The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
May 28, 2025
Authors: Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
cs.AI
Abstract
Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as "first, I need to", without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase, while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
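To make the reward formulations in the abstract concrete, the following is a minimal, hypothetical Python sketch, not taken from the linked repository: a binary correctness reward with a controllable label-flip noise rate (mirroring the 40% flipping experiment), a reasoning-pattern reward (RPR) that scores only the presence of key phrases such as "First, I need to", and a simple RPR-based calibration of a possibly noisy reward-model score to soften potential false negatives. The phrase list, thresholds, weights, and function names are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch only; phrases, flip rate, and calibration rule are assumptions.
import random
import re

# Hypothetical reasoning-pattern phrases; the paper's actual list may differ.
REASONING_PHRASES = [
    "first, i need to",
    "let me break this down",
    "therefore",
    "to verify",
]


def noisy_correctness_reward(answer, reference, flip_rate=0.4, rng=random):
    """Binary correctness reward whose output is flipped with probability
    `flip_rate`, mimicking the manual reward-flipping experiment."""
    reward = 1.0 if answer.strip() == reference.strip() else 0.0
    if rng.random() < flip_rate:
        reward = 1.0 - reward  # inject reward noise by flipping the label
    return reward


def reasoning_pattern_reward(response, phrases=REASONING_PHRASES):
    """RPR-style reward: count occurrences of key reasoning phrases,
    ignoring whether the final answer is correct. Clipped to [0, 1]."""
    text = response.lower()
    hits = sum(len(re.findall(re.escape(p), text)) for p in phrases)
    return min(hits / len(phrases), 1.0)


def calibrated_reward(rm_score, response, threshold=0.5, rpr_weight=0.5):
    """Hypothetical combination of a (possibly noisy) reward-model score with
    RPR: when the reward model judges the response as negative, the RPR term
    compensates for potential false negatives, as motivated in the abstract."""
    rpr = reasoning_pattern_reward(response)
    if rm_score < threshold:
        return rm_score + rpr_weight * rpr
    return rm_score


if __name__ == "__main__":
    response = "First, I need to compute 12 * 7. Therefore, the answer is 84."
    print(noisy_correctness_reward("84", "84", flip_rate=0.4, rng=random.Random(0)))
    print(reasoning_pattern_reward(response))
    print(calibrated_reward(0.2, response))
```

In an actual RL post-training loop these scalar rewards would feed a policy-gradient objective (e.g. PPO or GRPO); the sketch only shows how the reward signals themselves could be computed and combined.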