Quantile Advantage Estimation for Entropy-Safe Reasoning

September 26, 2025
Authors: Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p ≤ 1 − K) it reinforces rare successes, while on easy queries (p > 1 − K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design -- rather than token-level heuristics -- as the primary mechanism for scaling RLVR.
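
The abstract's core change, swapping the group-mean baseline for a group-wise K-quantile baseline, can be sketched in a few lines. The following is a minimal, hypothetical Python illustration, not the authors' released code: the function name `qae_advantages`, the default K, and the use of NumPy's lower-order-statistic quantile are assumptions made here to show how binary verifiable rewards produce the two-regime gate and the mostly-zero advantages described above.

```python
import numpy as np

def qae_advantages(rewards, k=0.8):
    """Minimal sketch of a group-wise K-quantile baseline (QAE-style).

    `rewards` holds the verifiable rewards (e.g., 0/1) of all responses
    sampled for one query; `k` is the quantile level K.  The function name,
    default K, and quantile convention are illustrative assumptions.
    """
    rewards = np.asarray(rewards, dtype=float)
    # K-quantile baseline instead of the group mean used by GRPO/DAPO.
    # method="lower" (NumPy >= 1.22) picks an order statistic, so binary
    # rewards give an exact 0 or 1 baseline rather than an interpolated one.
    baseline = np.quantile(rewards, k, method="lower")
    return rewards - baseline


# Two-regime gate on binary rewards with K = 0.8 (boundary at 1 - K = 0.2):
#   hard query, success rate p <= 0.2 -> baseline 0: the rare success gets +1,
#       every failure gets exactly zero advantage;
#   easy query, success rate p  > 0.2 -> baseline 1: the remaining failure
#       gets -1, every success gets exactly zero advantage.
print(qae_advantages([0, 0, 0, 0, 0, 0, 0, 1]))  # p = 0.125: only the success is nonzero
print(qae_advantages([1, 1, 1, 1, 1, 0, 1, 1]))  # p = 0.875: only the failure is nonzero
```

Under this reading, most responses in a group receive exactly zero advantage, which is consistent with the sparsified credit assignment (roughly 80% zero advantage with tuned K) reported in the abstract.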