Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models

Abstract

Large language models (LLMs) have demonstrated remarkable performance on reasoning tasks, and reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. There are currently two mainstream reward paradigms: model-based rewards and rule-based rewards. Both have limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes the policy model and the reward model. Cooper leverages the high precision of rule-based rewards in identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training of the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, in which the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench than other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance. For instance, it achieves a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, and they provide a reference for better integrating reward models into RL.
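The pair-construction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how negatives are produced or how the rule is defined, so `rule_based_correct`, `make_negative`, `Pair`, and `build_pairs` are all hypothetical names introduced here. The only point shown is that a strict rule is trusted when it accepts a response, so accepted rollouts can serve as positives.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Pair:
    question: str
    reference: str  # reference answer, also given to the reward model
    positive: str   # rollout accepted by the rule-based check
    negative: str   # response assumed to be incorrect


def rule_based_correct(response: str, reference: str) -> bool:
    """Hypothetical strict rule: the last line must equal the reference."""
    lines = response.strip().splitlines()
    final = lines[-1] if lines else ""
    return final.strip().lower() == reference.strip().lower()


def build_pairs(
    question: str,
    reference: str,
    rollouts: List[str],
    make_negative: Callable[[str], Optional[str]],
) -> List[Pair]:
    """Build positive-negative pairs for continued reward-model training.

    Positives are rollouts the rule accepts (the rule is assumed to be
    high precision). Negatives come from a caller-supplied function,
    which is an assumption of this sketch.
    """
    pairs = []
    for pos in (r for r in rollouts if rule_based_correct(r, reference)):
        neg = make_negative(pos)
        # Sanity check only: drop negatives that the rule would accept.
        if neg is not None and not rule_based_correct(neg, reference):
            pairs.append(Pair(question, reference, pos, neg))
    return pairs


rollouts = ["x = 2 + 2\n4", "The answer is four", "5"]
pairs = build_pairs(
    "What is 2 + 2?",
    "4",
    rollouts,
    # Toy corruption: replace the final line with a wrong answer.
    make_negative=lambda pos: pos.rsplit("\n", 1)[0] + "\n5",
)
```

Note that "The answer is four" is correct but rejected by the strict rule. This is the robustness gap the abstract attributes to rule-based rewards, and the sketch uses the rule only to pick positives for that reason.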
