STARE: Surprisal-guided Token-level Advantage Reweighting for Policy Entropy Stability

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) algorithms such as GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet they commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this analysis, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B parameters and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while keeping policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4% to 8% in average accuracy, with reflection tokens and response length growing in tandem, indicating a sustained exploration-exploitation balance that further unlocks the potential of RL training. Code is available at https://github.com/hp-luo/STARE.
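
The three ingredients named above (batch-internal surprisal quantiles, selective advantage reweighting, and a target-entropy gate) can be illustrated with a minimal sketch. This is not the released implementation: the abstract does not give the quantile levels, the reweighting factors, or the form of the gate, so `q_low`, `q_high`, `boost`, `damp`, and every function name below are hypothetical. The sketch also assumes a simple sign heuristic for the four quadrants: raising the probability of a high-surprisal token, or lowering that of a low-surprisal token, tends to increase entropy, and the opposite cases tend to decrease it.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def entropy_gate(entropy, target_low, target_high):
    """Closed-loop gate (hypothetical form): +1 asks for more entropy,
    -1 asks for less, 0 means entropy is already inside the target band."""
    if entropy < target_low:
        return 1
    if entropy > target_high:
        return -1
    return 0


def reweight_advantages(logprobs, advantages, entropy, target_low, target_high,
                        q_low=0.2, q_high=0.8, boost=1.5, damp=0.5):
    """Reweight per-token advantages for one batch of sampled tokens.

    logprobs:   log-probability of each sampled token under the current policy
    advantages: trajectory-level advantage broadcast to each token
    All thresholds and factors are illustrative assumptions.
    """
    surprisal = [-lp for lp in logprobs]
    low_thr = quantile(surprisal, q_low)
    high_thr = quantile(surprisal, q_high)
    gate = entropy_gate(entropy, target_low, target_high)

    reweighted = []
    for s, a in zip(surprisal, advantages):
        # Assumed first-order sign of this token's effect on entropy.
        if a == 0 or low_thr < s < high_thr:
            effect = 0  # not treated as entropy-critical in this sketch
        elif s >= high_thr:
            effect = 1 if a > 0 else -1
        else:  # s <= low_thr
            effect = -1 if a > 0 else 1

        if gate == 0 or effect == 0:
            weight = 1.0    # leave the advantage untouched
        elif effect == gate:
            weight = boost  # token pushes entropy toward the target band
        else:
            weight = damp   # token pushes entropy away from the target band
        reweighted.append(a * weight)
    return reweighted


if __name__ == "__main__":
    logprobs = [-0.1, -0.2, -0.3, -3.0, -4.0]
    advantages = [1.0, -1.0, 1.0, -1.0, 1.0]
    # Entropy below a hypothetical target band of [0.3, 0.6].
    print(reweight_advantages(logprobs, advantages, 0.1, 0.3, 0.6))
```

In this sketch only the tokens in the two surprisal tails are modified, and only while the measured entropy sits outside the band, so the update reduces to the unmodified advantages once entropy is back on target.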