On Predictability of Reinforcement Learning Dynamics for Large Language Models

Abstract

Recent advances in the reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the parameter dynamics underlying RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix almost fully determines reasoning improvements, recovering over 99% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 8 LLMs and 7 algorithms validate the generalizability of these properties. More importantly, building on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update from a short early training window, achieving up to 2.5× speedup while retaining over 96% of reasoning performance without extra modules or hyperparameter tuning. This positions our finding as a versatile and practical tool for large-scale RL, opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
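
The two properties named in the abstract can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the AlphaRL procedure itself: the function names `rank1_update` and `extrapolate_rank1`, the single-matrix setting, and the entrywise least-squares line fit are all hypothetical choices made here for exposition. The first function keeps only the top singular component of an update matrix; the second fits that component linearly in the training step over a few early checkpoints and extrapolates to a later step.

```python
import numpy as np


def rank1_update(w_base, w_ckpt):
    """Best rank-1 approximation (top singular triplet) of w_ckpt - w_base."""
    delta = w_ckpt - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0])


def extrapolate_rank1(w_base, early_ckpts, early_steps, target_step):
    """Fit the rank-1 update linearly in the step count and extrapolate.

    Assumption for this sketch: each entry of the rank-1 update is fit
    with an ordinary least-squares line over the early checkpoints.
    """
    steps = np.asarray(early_steps, dtype=float)
    # Rank-1 updates of the early checkpoints, flattened to (k, m * n).
    r1 = np.stack([rank1_update(w_base, w) for w in early_ckpts])
    flat = r1.reshape(len(steps), -1)
    # Design matrix for update ~ slope * step + intercept.
    design = np.stack([steps, np.ones_like(steps)], axis=1)
    coef, *_ = np.linalg.lstsq(design, flat, rcond=None)
    pred = coef[0] * target_step + coef[1]
    return w_base + pred.reshape(w_base.shape)
```

In this toy setting, if the checkpoints move along a fixed rank-1 direction at a constant rate, the extrapolated weights match the later checkpoint up to floating-point error. Real RL updates are only approximately rank-1 and approximately linear, so the quality of such an extrapolation has to be checked empirically.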
