Quantile Advantage Estimation for Entropy-Safe Reasoning

September 26, 2025
Authors: Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p ≤ 1 − K) it reinforces rare successes, while on easy queries (p > 1 − K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design -- rather than token-level heuristics -- as the primary mechanism for scaling RLVR.
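
The abstract's core change, swapping the group-mean baseline for a group-wise K-quantile baseline, can be sketched in a few lines. The following is a minimal, hypothetical Python illustration, not the authors' released code: the function name `qae_advantages`, the default K, and the use of NumPy's lower-order-statistic quantile are assumptions made here to show how binary verifiable rewards produce the two-regime gate and the mostly-zero advantages described above.

```python
import numpy as np

def qae_advantages(rewards, k=0.8):
    """Minimal sketch of a group-wise K-quantile baseline (QAE-style).

    `rewards` holds the verifiable rewards (e.g., 0/1) of all responses
    sampled for one query; `k` is the quantile level K.  The function name,
    default K, and quantile convention are illustrative assumptions.
    """
    rewards = np.asarray(rewards, dtype=float)
    # K-quantile baseline instead of the group mean used by GRPO/DAPO.
    # method="lower" (NumPy >= 1.22) picks an order statistic, so binary
    # rewards give an exact 0 or 1 baseline rather than an interpolated one.
    baseline = np.quantile(rewards, k, method="lower")
    return rewards - baseline


# Two-regime gate on binary rewards with K = 0.8 (boundary at 1 - K = 0.2):
#   hard query, success rate p <= 0.2 -> baseline 0: the rare success gets +1,
#       every failure gets exactly zero advantage;
#   easy query, success rate p  > 0.2 -> baseline 1: the remaining failure
#       gets -1, every success gets exactly zero advantage.
print(qae_advantages([0, 0, 0, 0, 0, 0, 0, 1]))  # p = 0.125: only the success is nonzero
print(qae_advantages([1, 1, 1, 1, 1, 0, 1, 1]))  # p = 0.875: only the failure is nonzero
```

Under this reading, most responses in a group receive exactly zero advantage, which is consistent with the sparsified credit assignment (roughly 80% zero advantage with tuned K) reported in the abstract.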