Quantile Advantage Estimation for Entropy-Safe Reasoning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens the reasoning of large language models (LLMs), but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), which replaces the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets the remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on the one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
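
The two-regime gate described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions the abstract does not confirm: rewards are binary (1 for a verified-correct response, 0 otherwise), p denotes the fraction of correct responses in a query's group, the K-quantile follows the inverted-CDF convention (the smallest reward whose empirical CDF reaches K), and advantages are left unnormalized. The function names `quantile_baseline` and `quantile_advantages` and the value K = 0.4 are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Sequence


def quantile_baseline(rewards: Sequence[float], k: float) -> float:
    """K-quantile of one group's rewards (inverted-CDF convention, assumed)."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must lie in (0, 1]")
    ordered = sorted(rewards)
    # Smallest order statistic whose empirical CDF is at least k.
    index = max(math.ceil(k * len(ordered)) - 1, 0)
    return ordered[index]


def quantile_advantages(rewards: Sequence[float], k: float) -> List[float]:
    """Advantage of each response: its reward minus the group K-quantile."""
    baseline = quantile_baseline(rewards, k)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    k = 0.4  # illustrative value only

    # Hard query: p = 1/8 <= 1 - k, so the baseline is 0 and only the
    # rare success receives a (positive) advantage.
    hard = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    print(quantile_advantages(hard, k))

    # Easy query: p = 7/8 > 1 - k, so the baseline is 1 and only the
    # remaining failure receives a (negative) advantage.
    easy = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0]
    print(quantile_advantages(easy, k))
```

Under these assumptions, every response on the majority side of the gate gets exactly zero advantage, which is one way to read the abstract's remark that credit assignment becomes sparse. A mean baseline would instead assign a nonzero advantage to every response in any group with mixed outcomes.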
