The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
May 28, 2025
Authors: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
cs.AI
Abstract
Recent studies on post-training large language models (LLMs) for reasoning
through reinforcement learning (RL) typically focus on tasks that can be
accurately verified and rewarded, such as solving math problems. In contrast,
our research investigates the impact of reward noise, a more practical
consideration for real-world scenarios involving the post-training of LLMs
using reward models. We found that LLMs demonstrate strong robustness to
substantial reward noise. For example, manually flipping 40% of the reward
function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve
rapid convergence, improving its performance on math tasks from 5% to 72%,
compared to the 75% accuracy achieved by a model trained with noiseless
rewards. Surprisingly, by only rewarding the appearance of key reasoning
phrases (namely reasoning pattern reward, RPR), such as "first, I need
to", without verifying the correctness of answers, the model achieved peak
downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models
trained with strict correctness verification and accurate rewards. Recognizing
the importance of the reasoning process over the final results, we combined RPR
with noisy reward models. RPR helped calibrate the noisy reward models,
mitigating potential false negatives and enhancing the LLM's performance on
open-ended tasks. These findings underscore the importance of improving models'
foundational abilities during pre-training, while offering insights for
advancing post-training techniques. Our code and scripts are available at
https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
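The abstract describes three reward schemes: a verifiable correctness reward whose outputs are artificially flipped (reward noise), a reasoning pattern reward (RPR) that only checks for key reasoning phrases, and a combination in which RPR calibrates a noisy reward model. The sketch below illustrates these ideas under stated assumptions; the phrase list, the answer extractor, and the calibration rule are illustrative guesses, not the authors' implementation.

```python
import random

# Illustrative sketch of the reward schemes described in the abstract.
# The phrase list, answer extraction, and calibration rule are assumptions,
# not the paper's actual code.

# Hypothetical reasoning-pattern phrases; "first, i need to" comes from the
# abstract, the others are made up for illustration.
REASONING_PHRASES = ["first, i need to", "let me check", "therefore"]


def extract_answer(response: str) -> str:
    """Placeholder answer extractor (assumption): text after the last 'Answer:'."""
    return response.rsplit("Answer:", 1)[-1].strip()


def noisy_correctness_reward(response: str, reference: str,
                             flip_prob: float = 0.4) -> float:
    """Verifiable reward (1 if the final answer matches the reference, else 0),
    with each output flipped with probability `flip_prob` to simulate noise."""
    reward = 1.0 if extract_answer(response) == reference else 0.0
    if random.random() < flip_prob:  # e.g., flip 40% of rewards, as in the abstract
        reward = 1.0 - reward
    return reward


def reasoning_pattern_reward(response: str) -> float:
    """Reasoning pattern reward (RPR): score only the presence of key reasoning
    phrases, without verifying the correctness of the final answer."""
    text = response.lower()
    hits = sum(phrase in text for phrase in REASONING_PHRASES)
    return hits / len(REASONING_PHRASES)


def calibrated_reward(rm_score: float, response: str, boost: float = 0.5) -> float:
    """Assumed combination of a (possibly noisy) reward-model score with RPR:
    when the reward model scores a response low but its reasoning pattern looks
    sound, partially compensate to reduce false negatives. The exact rule here
    is a guess at the kind of calibration the abstract refers to."""
    if rm_score < 0.5:
        return rm_score + boost * reasoning_pattern_reward(response)
    return rm_score
```

In an RL post-training loop such as PPO or GRPO, one of these functions would be evaluated on each sampled rollout to produce the scalar reward used for policy updates.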