The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
May 28, 2025
Authors: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
cs.AI
Abstract
Recent studies on post-training large language models (LLMs) for reasoning
through reinforcement learning (RL) typically focus on tasks that can be
accurately verified and rewarded, such as solving math problems. In contrast,
our research investigates the impact of reward noise, a more practical
consideration for real-world scenarios involving the post-training of LLMs
using reward models. We found that LLMs demonstrate strong robustness to
substantial reward noise. For example, manually flipping 40% of the reward
function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve
rapid convergence, improving its performance on math tasks from 5% to 72%,
compared to the 75% accuracy achieved by a model trained with noiseless
rewards. Surprisingly, by only rewarding the appearance of key reasoning
phrases (namely reasoning pattern reward, RPR), such as "first, I need
to", without verifying the correctness of answers, the model achieved peak
downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models
trained with strict correctness verification and accurate rewards. Recognizing
the importance of the reasoning process over the final results, we combined RPR
with noisy reward models. RPR helped calibrate the noisy reward models,
mitigating potential false negatives and enhancing the LLM's performance on
open-ended tasks. These findings underscore the importance of improving models'
foundational abilities during pre-training, while offering insights for
advancing post-training techniques. Our code and scripts are available at
https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
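The abstract describes three reward schemes: a verifiable correctness reward whose outputs are artificially flipped (reward noise), a reasoning pattern reward (RPR) that only checks for key reasoning phrases, and a combination in which RPR calibrates a noisy reward model. The sketch below illustrates these ideas under stated assumptions; the phrase list, the answer extractor, and the calibration rule are illustrative guesses, not the authors' implementation.

```python
import random

# Illustrative sketch of the reward schemes described in the abstract.
# The phrase list, answer extraction, and calibration rule are assumptions,
# not the paper's actual code.

# Hypothetical reasoning-pattern phrases; "first, i need to" comes from the
# abstract, the others are made up for illustration.
REASONING_PHRASES = ["first, i need to", "let me check", "therefore"]


def extract_answer(response: str) -> str:
    """Placeholder answer extractor (assumption): text after the last 'Answer:'."""
    return response.rsplit("Answer:", 1)[-1].strip()


def noisy_correctness_reward(response: str, reference: str,
                             flip_prob: float = 0.4) -> float:
    """Verifiable reward (1 if the final answer matches the reference, else 0),
    with each output flipped with probability `flip_prob` to simulate noise."""
    reward = 1.0 if extract_answer(response) == reference else 0.0
    if random.random() < flip_prob:  # e.g., flip 40% of rewards, as in the abstract
        reward = 1.0 - reward
    return reward


def reasoning_pattern_reward(response: str) -> float:
    """Reasoning pattern reward (RPR): score only the presence of key reasoning
    phrases, without verifying the correctness of the final answer."""
    text = response.lower()
    hits = sum(phrase in text for phrase in REASONING_PHRASES)
    return hits / len(REASONING_PHRASES)


def calibrated_reward(rm_score: float, response: str, boost: float = 0.5) -> float:
    """Assumed combination of a (possibly noisy) reward-model score with RPR:
    when the reward model scores a response low but its reasoning pattern looks
    sound, partially compensate to reduce false negatives. The exact rule here
    is a guess at the kind of calibration the abstract refers to."""
    if rm_score < 0.5:
        return rm_score + boost * reasoning_pattern_reward(response)
    return rm_score
```

In an RL post-training loop such as PPO or GRPO, one of these functions would be evaluated on each sampled rollout to produce the scalar reward used for policy updates.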