Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting
October 9, 2025
Authors: Yunzhen Feng, Parag Jain, Anthony Hartshorn, Yaqi Duan, Julia Kempe
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard
recipe for improving large language models (LLMs) on reasoning tasks, with
Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO
wastes substantial compute on negative groups: groups in which no sampled
response is correct yield zero advantage and thus no gradient. We ask whether
negative groups can be leveraged without extra supervision. Starting from a
maximum-likelihood (MLE) objective in reward modeling, we show that the MLE
gradient is equivalent to a policy gradient for a modified value function. This
value function adds a confidence-weighted penalty on incorrect responses,
imposing larger penalties on more confident mistakes. We refer to this as
Likelihood Estimation with Negative Samples
(LENS). LENS modifies GRPO to assign non-zero, confidence-dependent
rewards to incorrect generations, making negative groups informative and
converting previously wasted samples into useful gradient updates. On the MATH
benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently
outperforms the GRPO baseline, with significant gains on harder items. These
results demonstrate a principled and practical way to "rescue" negative groups,
improving efficiency and performance in RLVR.
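
To make the core idea concrete, the following is a minimal sketch, not the paper's implementation. It contrasts standard GRPO group-relative advantages, which vanish when every response in a group is incorrect, with a hypothetical confidence-weighted reward in the spirit of LENS. The function names (grpo_advantages, lens_style_rewards), the per-response confidence values, and the specific penalty form (confidence times a scale factor) are illustrative assumptions based only on the abstract's description.

import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standard GRPO group-relative advantage: reward minus the group mean,
    normalized by the group standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def lens_style_rewards(correct, confidence, penalty_scale=1.0):
    """Hypothetical confidence-weighted reward (assumption, not the paper's
    exact formula): correct responses keep reward 1, incorrect responses
    receive a negative reward that grows with the model's confidence in them."""
    correct = np.asarray(correct, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    return correct - (1.0 - correct) * penalty_scale * confidence

# A "negative group": all four sampled responses are incorrect.
correct = [0, 0, 0, 0]
confidence = [0.9, 0.6, 0.3, 0.1]  # illustrative per-response confidences

binary_rewards = correct
print(grpo_advantages(binary_rewards))   # all zeros -> no gradient signal

reweighted = lens_style_rewards(correct, confidence)
print(grpo_advantages(reweighted))       # non-zero: confident mistakes are penalized most

With the plain 0/1 verifiable reward the advantages are identically zero, so the group contributes no gradient; with the confidence-dependent rewards the same group yields non-zero advantages that push hardest against the most confident wrong answers, which is the behavior the abstract attributes to LENS.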