Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR

Abstract

A prevailing view in Reinforcement Learning with Verifiable Rewards (RLVR) interprets recent progress through the lens of an exploration-exploitation trade-off, a perspective largely shaped by token-level metrics. We re-examine this perspective and propose that the perceived trade-off may not be a fundamental constraint but rather an artifact of the measurement level. To investigate this, we shift the analysis to the semantically rich hidden-state space, adopting Effective Rank (ER) to quantify exploration and proposing its first- and second-order derivatives, Effective Rank Velocity (ERV) and Effective Rank Acceleration (ERA), to capture exploitation dynamics. Our analysis shows that, at the hidden-state level, exploration and exploitation can be decoupled (Sec. 4), which opens an opportunity to enhance both capacities simultaneously. This insight motivates our method, Velocity-Exploiting Rank-Learning (VERL), the first to operationalize the principle of synergistic exploration-exploitation enhancement by directly shaping the RL advantage function. The key innovation is to use the theoretically stable ERA as a predictive meta-controller that creates a synergistic, dual-channel incentive structure. Instead of forcing a trade-off, VERL prospectively amplifies rewards for exploration to preempt overconfidence and reinforces exploitative gains to consolidate reasoning. Experiments across diverse LLMs and reasoning benchmarks show consistent gains, including an absolute accuracy improvement of up to 21.4% on the challenging Gaokao 2024 dataset.
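To make the three quantities named in the abstract concrete, the following is a minimal sketch and not the paper's implementation. It assumes the common definition of effective rank as the exponential of the Shannon entropy of the normalized singular values of a hidden-state matrix, and it uses plain finite differences as illustrative stand-ins for ERV and ERA. The function names `effective_rank` and `rank_dynamics`, the zero-matrix convention, and the choice of axis along which ER is tracked are all assumptions made for this example.

```python
import numpy as np


def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a (tokens x hidden_dim) matrix.

    Assumed definition: exp of the Shannon entropy of the singular
    values after normalizing them to sum to 1.
    """
    singular_values = np.linalg.svd(hidden_states, compute_uv=False)
    total = singular_values.sum()
    if total <= eps:
        # Convention chosen for this sketch: an all-zero matrix has rank 0.
        return 0.0
    p = singular_values / total
    p = p[p > eps]  # drop numerically zero entries so log(p) stays finite
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))


def rank_dynamics(er_series):
    """Finite-difference stand-ins for ERV and ERA (an assumption here).

    er_series: ER values measured at successive positions or steps.
    Returns (first differences, second differences).
    """
    er = np.asarray(er_series, dtype=float)
    erv = np.diff(er, n=1)  # discrete "velocity" of ER
    era = np.diff(er, n=2)  # discrete "acceleration" of ER
    return erv, era
```

Under this definition, a matrix with k equal nonzero singular values has an effective rank of k, while a rank-1 matrix has an effective rank of 1, so higher values indicate hidden states spread over more directions. The paper's own ERV and ERA definitions may differ from simple differencing; see Sec. 4 of the paper for the authors' formulation.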
