

Don't Waste Mistakes: Leveraging Negative RL-Groups via Confidence Reweighting

October 9, 2025
作者: Yunzhen Feng, Parag Jain, Anthony Hartshorn, Yaqi Duan, Julia Kempe
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard recipe for improving large language models (LLMs) on reasoning tasks, with Group Relative Policy Optimization (GRPO) widely used in practice. Yet GRPO wastes substantial compute on negative groups: groups in which no sampled response is correct yield zero advantage and thus no gradient. We ask whether negative groups can be leveraged without extra supervision. Starting from a maximum-likelihood (MLE) objective in reward modeling, we show that the MLE gradient is equivalent to a policy gradient for a modified value function. This value function adds a confidence-weighted penalty on incorrect responses, imposing larger penalties on more confident mistakes. We refer to this as Likelihood Estimation with Negative Samples (LENS). LENS modifies GRPO to assign non-zero, confidence-dependent rewards to incorrect generations, making negative groups informative and converting previously wasted samples into useful gradient updates. On the MATH benchmark with Llama-3.1-8B and Qwen-2.5-3B, the proposed variant consistently outperforms the GRPO baseline, with significant gains on harder items. These results demonstrate a principled and practical way to "rescue" negative groups, improving efficiency and performance in RLVR.
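As a rough illustration of the mechanism described in the abstract, the sketch below contrasts standard group-normalized GRPO advantages (which collapse to zero when every response in a group is wrong) with a hypothetical confidence-weighted reward for incorrect responses. The function names (`grpo_advantages`, `lens_style_rewards`), the use of a normalized sequence likelihood as "confidence", and the specific penalty form are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: rewards normalized within the group.
    If all responses receive the same reward (e.g. an all-incorrect group),
    the advantages are zero and the group contributes no gradient."""
    std = rewards.std()
    if std < 1e-8:  # degenerate group: no within-group signal
        return torch.zeros_like(rewards)
    return (rewards - rewards.mean()) / std


def lens_style_rewards(correct: torch.Tensor, confidence: torch.Tensor,
                       penalty_scale: float = 1.0) -> torch.Tensor:
    """Hypothetical confidence-reweighted rewards (assumed form): correct
    answers keep reward 1, incorrect answers receive a negative reward
    proportional to the model's confidence, so confident mistakes are
    penalized more and all-negative groups still carry a learning signal."""
    return torch.where(correct.bool(),
                       torch.ones_like(confidence),
                       -penalty_scale * confidence)


# Example: a group of 4 sampled responses, all incorrect.
correct = torch.tensor([0.0, 0.0, 0.0, 0.0])
confidence = torch.tensor([0.9, 0.6, 0.3, 0.1])  # e.g. normalized likelihoods

print(grpo_advantages(correct))                                   # all zeros
print(grpo_advantages(lens_style_rewards(correct, confidence)))   # non-zero
```

Under binary correctness rewards the first call returns all zeros (the "wasted" negative group), while the confidence-dependent rewards yield non-zero advantages that push probability mass away from the most confident wrong answers.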