Quantile Advantage Estimation for Entropy-Safe Reasoning
September 26, 2025
Authors: Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p ≤ 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
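
As a reading aid, the sketch below illustrates the group-wise K-quantile baseline described in the abstract for binary verifiable rewards. It is a minimal, hypothetical implementation based only on the abstract (the function name qae_advantages and the choice of a lower empirical quantile are assumptions, not the authors' code); it shows how the response-level two-regime gate arises for a tuned K.

```python
import numpy as np

def qae_advantages(rewards, k=0.8):
    """Hypothetical sketch of Quantile Advantage Estimation (QAE).

    Replaces the group-mean baseline of value-free RL (GRPO/DAPO style)
    with a group-wise K-quantile baseline, following the abstract.

    rewards: verifiable rewards for one group of sampled responses to the
             same query (e.g., 0/1 correctness).
    k:       quantile level K in (0, 1).
    """
    rewards = np.asarray(rewards, dtype=float)
    # Lower empirical K-quantile of the group (sorting avoids interpolation,
    # so the baseline is always an observed reward value).
    baseline = np.sort(rewards)[int(np.floor(k * (len(rewards) - 1)))]
    return rewards - baseline

# Two-regime gate for binary rewards with K = 0.8 and group size 8:
# hard query (success rate p <= 1 - K): baseline = 0, so only the rare
# success receives a positive advantage; the 7 failures get exactly zero.
print(qae_advantages([1, 0, 0, 0, 0, 0, 0, 0]))  # [ 1. 0. 0. 0. 0. 0. 0. 0.]
# easy query (p > 1 - K): baseline = 1, so only the remaining failures
# receive a negative advantage; the 6 successes get exactly zero.
print(qae_advantages([1, 1, 1, 0, 1, 1, 0, 1]))  # [ 0. 0. 0. -1. 0. 0. -1. 0.]
```

In both toy groups most responses receive exactly zero advantage, consistent with the sparsified credit assignment (roughly 80% zero-advantage responses at a tuned K) reported in the abstract.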