The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
May 28, 2025
Authors: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
cs.AI
Abstract
Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as "first, I need to", without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase, while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
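To make the reward formulations in the abstract concrete, the following is a minimal, hypothetical Python sketch, not taken from the linked repository: a binary correctness reward with a controllable label-flip noise rate (mirroring the 40% flipping experiment), a reasoning-pattern reward (RPR) that scores only the presence of key phrases such as "First, I need to", and a simple RPR-based calibration of a possibly noisy reward-model score to soften potential false negatives. The phrase list, thresholds, weights, and function names are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch only; phrases, flip rate, and calibration rule are assumptions.
import random
import re

# Hypothetical reasoning-pattern phrases; the paper's actual list may differ.
REASONING_PHRASES = [
    "first, i need to",
    "let me break this down",
    "therefore",
    "to verify",
]


def noisy_correctness_reward(answer, reference, flip_rate=0.4, rng=random):
    """Binary correctness reward whose output is flipped with probability
    `flip_rate`, mimicking the manual reward-flipping experiment."""
    reward = 1.0 if answer.strip() == reference.strip() else 0.0
    if rng.random() < flip_rate:
        reward = 1.0 - reward  # inject reward noise by flipping the label
    return reward


def reasoning_pattern_reward(response, phrases=REASONING_PHRASES):
    """RPR-style reward: count occurrences of key reasoning phrases,
    ignoring whether the final answer is correct. Clipped to [0, 1]."""
    text = response.lower()
    hits = sum(len(re.findall(re.escape(p), text)) for p in phrases)
    return min(hits / len(phrases), 1.0)


def calibrated_reward(rm_score, response, threshold=0.5, rpr_weight=0.5):
    """Hypothetical combination of a (possibly noisy) reward-model score with
    RPR: when the reward model judges the response as negative, the RPR term
    compensates for potential false negatives, as motivated in the abstract."""
    rpr = reasoning_pattern_reward(response)
    if rm_score < threshold:
        return rm_score + rpr_weight * rpr
    return rm_score


if __name__ == "__main__":
    response = "First, I need to compute 12 * 7. Therefore, the answer is 84."
    print(noisy_correctness_reward("84", "84", flip_rate=0.4, rng=random.Random(0)))
    print(reasoning_pattern_reward(response))
    print(calibrated_reward(0.2, response))
```

In an actual RL post-training loop these scalar rewards would feed a policy-gradient objective (e.g. PPO or GRPO); the sketch only shows how the reward signals themselves could be computed and combined.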