Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
February 26, 2025
Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li
cs.AI
Abstract
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). However, existing reward models focus primarily on human preferences and neglect verifiable correctness signals, which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals, factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n search on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our code is publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).
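The abstract describes RewardAgent as combining a human preference reward with verifiable factuality and instruction-following signals, and using the combined score for inference-time best-of-n selection. Below is a minimal sketch of that idea, assuming hypothetical placeholder scorers and an illustrative weighted-sum aggregation; it is not the paper's implementation (see the linked repository for that).

```python
# Sketch of agentic reward modeling for best-of-n selection.
# All scorer functions are hypothetical stubs standing in for the
# actual preference RM and verification agents described in the paper.
from typing import List


def preference_score(instruction: str, response: str) -> float:
    """Placeholder for a human-preference reward model score."""
    return 0.0  # stub


def factuality_score(instruction: str, response: str) -> float:
    """Placeholder for a verifiable factuality check (e.g., against evidence)."""
    return 0.0  # stub


def instruction_following_score(instruction: str, response: str) -> float:
    """Placeholder for a verifiable instruction/constraint-following check."""
    return 0.0  # stub


def agentic_reward(instruction: str, response: str,
                   w_pref: float = 1.0, w_fact: float = 1.0,
                   w_inst: float = 1.0) -> float:
    """Combine the preference reward with verifiable correctness signals.

    The weighted sum is an illustrative aggregation choice, not necessarily
    the rule used by RewardAgent.
    """
    return (w_pref * preference_score(instruction, response)
            + w_fact * factuality_score(instruction, response)
            + w_inst * instruction_following_score(instruction, response))


def best_of_n(instruction: str, candidates: List[str]) -> str:
    """Inference-time best-of-n search: keep the highest-scoring candidate."""
    return max(candidates, key=lambda r: agentic_reward(instruction, r))
```

The same combined score can also rank sampled responses into chosen/rejected pairs for preference training, which is how the abstract describes constructing DPO training data.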
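For reference, the DPO objective mentioned in the abstract is, in its standard form, the following loss over chosen/rejected pairs $(y_w, y_l)$, here assumed to be selected by RewardAgent:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\sigma$ the logistic function, and $\beta$ a temperature hyperparameter; the specific pair-construction procedure is detailed in the paper rather than the abstract.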