Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

Abstract

Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). However, existing reward models focus primarily on human preferences and neglect verifiable correctness signals, which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals, factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and on inference-time best-of-n search on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our code is publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).
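To make the idea of combining a preference reward with verifiable signals concrete, here is a minimal Python sketch of best-of-n selection under such a combined reward. It is an illustration under stated assumptions, not the RewardAgent implementation: the additive aggregation rule, the `weight` parameter, and every function name below (`combined_reward`, `best_of_n`, and the `toy_*` scorers) are hypothetical, and the toy scorers merely stand in for a learned reward model and for real factuality and instruction-following checkers.

```python
from typing import Callable, Sequence

# Hypothetical scorer type: maps (instruction, response) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def combined_reward(instruction: str, response: str, preference: Scorer,
                    verifiers: Sequence[Scorer], weight: float = 1.0) -> float:
    """Preference score plus a weighted sum of verifier scores.

    The additive rule is an assumption made for this sketch only.
    """
    base = preference(instruction, response)
    checks = sum(v(instruction, response) for v in verifiers)
    return base + weight * checks


def best_of_n(instruction: str, candidates: Sequence[str],
              reward: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest reward (first one wins ties)."""
    return max(candidates, key=lambda r: reward(instruction, r))


def toy_preference(instruction: str, response: str) -> float:
    # Toy stand-in for a learned reward model: mildly prefers longer answers.
    return min(len(response.split()) / 20.0, 1.0)


def toy_instruction_following(instruction: str, response: str) -> float:
    # Toy verifiable check: the demo instruction caps the answer at 8 words.
    return 1.0 if len(response.split()) <= 8 else 0.0


def toy_factuality(instruction: str, response: str) -> float:
    # Toy verifiable check against a single known fact.
    return 1.0 if "Canberra" in response else 0.0


def agent_reward(instruction: str, response: str) -> float:
    return combined_reward(instruction, response, toy_preference,
                           [toy_instruction_following, toy_factuality])


instruction = "Name the capital of Australia in at most 8 words."
candidates = [
    "The capital of Australia is Sydney, a large and famous harbour city.",
    "The capital of Australia is Canberra.",
    "Canberra is the capital, and it was chosen as a compromise "
    "between Sydney and Melbourne.",
]

preference_only_choice = best_of_n(instruction, candidates, toy_preference)
agent_choice = best_of_n(instruction, candidates, agent_reward)
print("preference only:", preference_only_choice)
print("with verifiers: ", agent_choice)
```

In this toy setup the preference-only scorer selects the longest answer even though it violates the word limit, while the combined reward selects the short, correct answer. The same ranking could also be used to pick chosen and rejected responses when building preference pairs for DPO training, although the sketch does not cover that step.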

