Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

Abstract

Reward models (RMs) are crucial for both the training and the inference-time scaling of large language models (LLMs). However, existing reward models focus primarily on human preferences and neglect verifiable correctness signals, which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals, factuality and instruction following. We conduct comprehensive experiments on existing reward model benchmarks and on inference-time best-of-n search over real-world downstream tasks, where RewardAgent significantly outperforms vanilla reward models. We further use RewardAgent to construct training preference pairs and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared with conventional reward models. Our code is publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).
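The abstract names three uses of a combined reward: scoring a response, best-of-n selection at inference time, and building preference pairs for DPO. The sketch below illustrates that flow in Python under stated assumptions. The abstract does not say how the preference reward and the verifiable signals are combined, so the weighted sum used here is an assumption, not the authors' method. All names (`RewardCombiner`, `best_of_n`, `preference_pair`, and the toy scorers) are hypothetical stand-ins and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A scorer maps (instruction, response) to a numeric score.
Scorer = Callable[[str, str], float]


@dataclass
class RewardCombiner:
    """Hypothetical combiner: preference reward plus weighted verifier scores.

    The additive form is an assumption made for illustration only.
    """
    preference: Scorer
    verifiers: Sequence[Scorer]
    verifier_weight: float = 1.0

    def score(self, instruction: str, response: str) -> float:
        total = self.preference(instruction, response)
        for verify in self.verifiers:
            total += self.verifier_weight * verify(instruction, response)
        return total


def best_of_n(combiner: RewardCombiner, instruction: str,
              candidates: Sequence[str]) -> str:
    """Return the candidate with the highest combined reward."""
    return max(candidates, key=lambda r: combiner.score(instruction, r))


def preference_pair(combiner: RewardCombiner, instruction: str,
                    candidates: Sequence[str]) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-scored candidates."""
    ranked = sorted(candidates, key=lambda r: combiner.score(instruction, r))
    return ranked[-1], ranked[0]


# Toy stand-ins (assumptions): a real system would call a trained reward
# model and real factuality / instruction-following checkers here.
def toy_preference(instruction: str, response: str) -> float:
    return min(len(response) / 100, 1.0)  # naively prefers longer answers


def toy_factuality(instruction: str, response: str) -> float:
    return 1.0 if "Paris" in response else 0.0


def toy_instruction_following(instruction: str, response: str) -> float:
    return 1.0 if response.endswith(".") else 0.0


instruction = "Name the capital of France in one sentence ending with a period."
wrong_long = "The capital of France is Lyon, a large and beautiful city."
right_short = "Paris."
right_full = "The capital of France is Paris."
candidates = [wrong_long, right_short, right_full]

combiner = RewardCombiner(toy_preference,
                          [toy_factuality, toy_instruction_following])
best = best_of_n(combiner, instruction, candidates)
chosen, rejected = preference_pair(combiner, instruction, candidates)
preference_only_best = max(candidates,
                           key=lambda r: toy_preference(instruction, r))
```

In this toy setup, the preference-only scorer picks the longer but factually wrong answer, while the combined reward picks the correct full sentence. That contrast is the motivation the abstract gives for adding verifiable signals, though the toy scorers say nothing about the reported results. The `(chosen, rejected)` pair is the shape of data a DPO training set typically needs.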

