Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
August 7, 2025
Authors: Haitao Hong, Yuchen Yan, Xingyu Wu, Guiyang Hou, Wenqi Zhang, Weiming Lu, Yongliang Shen, Jun Xiao
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable performance in
reasoning tasks, where reinforcement learning (RL) serves as a key algorithm
for enhancing their reasoning capabilities. Currently, there are two mainstream
reward paradigms: model-based rewards and rule-based rewards. However, both
approaches suffer from limitations: rule-based rewards lack robustness, while
model-based rewards are vulnerable to reward hacking. To address these issues,
we propose Cooper (Co-Optimizing Policy Model and Reward Model), an RL framework
that jointly optimizes both the policy model and the reward model. Cooper
leverages the high precision of rule-based rewards when identifying correct
responses, and dynamically constructs and selects positive-negative sample
pairs for continued training of the reward model. This design enhances robustness
and mitigates the risk of reward hacking. To further support Cooper, we
introduce a hybrid annotation strategy that efficiently and accurately
generates training data for the reward model. We also propose a reference-based
reward modeling paradigm, where the reward model takes a reference answer as
input. Based on this design, we train a reward model named VerifyRM, which
achieves higher accuracy on VerifyBench compared to other models of the same
size. We conduct reinforcement learning using both VerifyRM and Cooper. Our
experiments show that Cooper not only alleviates reward hacking but also
improves end-to-end RL performance, for instance, achieving a 0.54% gain in
average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that
dynamically updating the reward model is an effective way to combat reward hacking,
providing a reference for better integrating reward models into RL.
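For intuition only, here is a minimal Python sketch of the co-optimization idea described in the abstract. Every name in it (`policy`, `reward_model`, `rule_verifier`, `collect_pairs`, `update_on_pairs`, the 0.5 negative-score threshold) is a hypothetical illustration, not the paper's actual interface: responses the rule-based checker accepts serve as positives, responses it rejects but the current reward model still scores highly serve as negatives, and these dynamically mined pairs are used to keep training the reward model alongside the policy.

```python
# Hedged sketch of a Cooper-style co-optimization loop (not the authors' code).
# Assumptions: `policy`, `reward_model`, and `rule_verifier` are hypothetical
# stand-ins for the policy model, the reference-based reward model (which takes
# the question, a reference answer, and a candidate response), and the
# rule-based checker whose "correct" verdicts are treated as high precision.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    question: str
    reference: str   # reference answer consumed by the reward model
    response: str    # candidate response sampled from the policy


def collect_pairs(
    samples: List[Sample],
    is_correct: Callable[[Sample], bool],
    rm_score: Callable[[Sample], float],
    neg_threshold: float = 0.5,   # illustrative cutoff, not from the paper
) -> List[Tuple[Sample, Sample]]:
    """Build (positive, negative) pairs for continued reward-model training.

    Positives: responses the rule-based verifier accepts (high precision).
    Negatives: responses the verifier rejects but the current reward model
    still scores highly -- the cases most likely to enable reward hacking.
    """
    positives = [s for s in samples if is_correct(s)]
    negatives = [
        s for s in samples
        if not is_correct(s) and rm_score(s) >= neg_threshold
    ]
    # Pair positives with suspicious negatives from the same rollout batch.
    return list(zip(positives, negatives))


def rl_step(batch, policy, reward_model, rule_verifier):
    """One co-optimization step: update the policy with reward-model rewards,
    then refresh the reward model on newly mined positive-negative pairs."""
    samples = [Sample(q, ref, policy.generate(q)) for q, ref in batch]

    # 1) Policy update: rewards come from the reference-based reward model.
    rewards = [reward_model.score(s.question, s.reference, s.response)
               for s in samples]
    policy.update(samples, rewards)          # e.g. a PPO/GRPO-style update

    # 2) Reward-model update: continue training on dynamically mined pairs.
    pairs = collect_pairs(
        samples,
        is_correct=lambda s: rule_verifier(s.response, s.reference),
        rm_score=lambda s: reward_model.score(s.question, s.reference, s.response),
    )
    if pairs:
        reward_model.update_on_pairs(pairs)  # pairwise/preference-style loss
```

The negatives mined this way target exactly the responses most likely to enable reward hacking, so refreshing the reward model on them during RL is expected to keep it calibrated against the evolving policy, consistent with the abstract's claim that dynamic reward-model updates mitigate hacking.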