Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
August 7, 2025
Authors: Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang, Weiming Lu, Yongliang Shen, Jun Xiao
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable performance in
reasoning tasks, where reinforcement learning (RL) serves as a key algorithm
for enhancing their reasoning capabilities. Currently, there are two mainstream
reward paradigms: model-based rewards and rule-based rewards. However, both
approaches suffer from limitations: rule-based rewards lack robustness, while
model-based rewards are vulnerable to reward hacking. To address these issues,
we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework
that jointly optimizes both the policy model and the reward model. Cooper
leverages the high precision of rule-based rewards when identifying correct
responses, and dynamically constructs and selects positive-negative sample
pairs to continually train the reward model. This design enhances robustness
and mitigates the risk of reward hacking. To further support Cooper, we
introduce a hybrid annotation strategy that efficiently and accurately
generates training data for the reward model. We also propose a reference-based
reward modeling paradigm, where the reward model takes a reference answer as
input. Based on this design, we train a reward model named VerifyRM, which
achieves higher accuracy on VerifyBench compared to other models of the same
size. We conduct reinforcement learning using both VerifyRM and Cooper. Our
experiments show that Cooper not only alleviates reward hacking but also
improves end-to-end RL performance, for instance, achieving a 0.54% gain in
average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that
dynamically updating the reward model is an effective way to combat reward hacking,
providing a reference for better integrating reward models into RL.
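The abstract describes the co-optimization loop only at a high level. Below is a minimal Python sketch of what one such step might look like, under stated assumptions: the interfaces (policy.generate, policy.rl_update, reward_model.score, reward_model.pairwise_update) and the exact-match rule check are illustrative placeholders, not the paper's implementation.

```python
import random

def rule_based_verify(response: str, reference: str) -> bool:
    """High-precision rule check, e.g. matching the extracted final answer
    against the reference answer (placeholder heuristic)."""
    return response.strip().endswith(reference.strip())

def cooper_step(policy, reward_model, question: str, reference: str, k: int = 8):
    """One hypothetical Cooper-style update: an RL step on the policy plus a
    pairwise update on a reference-based reward model."""
    # 1. Sample k candidate responses from the current policy.
    responses = [policy.generate(question) for _ in range(k)]

    # 2. Rule-based reward: high precision when it flags a response as correct.
    correct = [r for r in responses if rule_based_verify(r, reference)]
    incorrect = [r for r in responses if not rule_based_verify(r, reference)]

    # 3. Reference-based reward modeling: the reward model scores each
    #    (question, response) pair with the reference answer as extra input.
    rewards = [reward_model.score(question, r, reference) for r in responses]

    # 4. Update the policy with the model-based rewards (e.g. a PPO/GRPO-style step).
    policy.rl_update(question, responses, rewards)

    # 5. Dynamically construct a positive-negative pair and continue training the
    #    reward model, so it tracks the evolving policy and resists reward hacking.
    if correct and incorrect:
        positive = random.choice(correct)    # rule-verified, trusted positive
        negative = random.choice(incorrect)  # likely-incorrect response as negative
        reward_model.pairwise_update(question, positive, negative, reference)
```

The key design choice the sketch tries to capture is asymmetry: the rule-based signal is trusted only where it is precise (identifying correct responses), and those trusted positives are paired with likely-incorrect samples to keep the reward model updated during RL rather than frozen.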