Quantile Advantage Estimation for Entropy-Safe Reasoning
September 26, 2025
Authors: Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p ≤ 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
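
As a reading aid, the sketch below illustrates the group-wise K-quantile baseline described in the abstract for binary verifiable rewards. It is a minimal, hypothetical implementation based only on the abstract (the function name qae_advantages and the choice of a lower empirical quantile are assumptions, not the authors' code); it shows how the response-level two-regime gate arises for a tuned K.

```python
import numpy as np

def qae_advantages(rewards, k=0.8):
    """Hypothetical sketch of Quantile Advantage Estimation (QAE).

    Replaces the group-mean baseline of value-free RL (GRPO/DAPO style)
    with a group-wise K-quantile baseline, following the abstract.

    rewards: verifiable rewards for one group of sampled responses to the
             same query (e.g., 0/1 correctness).
    k:       quantile level K in (0, 1).
    """
    rewards = np.asarray(rewards, dtype=float)
    # Lower empirical K-quantile of the group (sorting avoids interpolation,
    # so the baseline is always an observed reward value).
    baseline = np.sort(rewards)[int(np.floor(k * (len(rewards) - 1)))]
    return rewards - baseline

# Two-regime gate for binary rewards with K = 0.8 and group size 8:
# hard query (success rate p <= 1 - K): baseline = 0, so only the rare
# success receives a positive advantage; the 7 failures get exactly zero.
print(qae_advantages([1, 0, 0, 0, 0, 0, 0, 0]))  # [ 1. 0. 0. 0. 0. 0. 0. 0.]
# easy query (p > 1 - K): baseline = 1, so only the remaining failures
# receive a negative advantage; the 6 successes get exactly zero.
print(qae_advantages([1, 1, 1, 0, 1, 1, 0, 1]))  # [ 0. 0. 0. -1. 0. 0. -1. 0.]
```

In both toy groups most responses receive exactly zero advantage, consistent with the sparsified credit assignment (roughly 80% zero-advantage responses at a tuned K) reported in the abstract.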