RM-R1: Reward Modeling as Reasoning

Abstract

Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should think deeply and reason interpretably before assigning a score or a judgment. However, existing RMs either produce opaque scalar scores or directly predict the preferred answer, so they struggle to integrate natural language critiques and lack interpretability. Inspired by recent advances in long chain-of-thought (CoT) reasoning on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances an RM's interpretability and performance. In this work, we introduce a new class of generative reward models, Reasoning Reward Models (ReasRMs), which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. The training consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. RM-R1 improves LLM rollouts by self-generating reasoning traces or chat-specific rubrics and evaluating candidate responses against them. Empirically, our models achieve state-of-the-art or near state-of-the-art performance among generative RMs across multiple comprehensive reward model benchmarks, outperforming much larger open-weight models (e.g., Llama3.1-405B) and proprietary ones (e.g., GPT-4o) by up to 13.8%. Beyond final performance, we perform a thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.
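
As a rough illustration of what a "verifiable reward" could look like for a pairwise preference judgment, the sketch below scores a generated judgment by checking its final verdict against a known preference label. The abstract does not specify the output format or the reward values, so the `[[A]]`/`[[B]]` verdict marker, the +1/-1 scheme, and the function names `extract_verdict` and `verifiable_reward` are all assumptions made for this example, not the authors' implementation.

```python
import re
from typing import Optional

# Hypothetical verdict format (an assumption, not taken from the paper):
# the judge ends its reasoning with "[[A]]" or "[[B]]".
VERDICT_PATTERN = re.compile(r"\[\[([AB])\]\]")


def extract_verdict(judgment: str) -> Optional[str]:
    """Return the last "A"/"B" verdict found in the judge's output, or None."""
    matches = VERDICT_PATTERN.findall(judgment)
    return matches[-1] if matches else None


def verifiable_reward(judgment: str, preferred: str) -> float:
    """Score a generated judgment against the known preference label.

    Assumed scheme: +1.0 for a correct verdict, -1.0 for a wrong or
    missing verdict.
    """
    return 1.0 if extract_verdict(judgment) == preferred else -1.0


if __name__ == "__main__":
    sample = "Response A applies the correct formula; response B does not. [[A]]"
    print(verifiable_reward(sample, "A"))  # verdict matches the label
    print(verifiable_reward(sample, "B"))  # verdict contradicts the label
```

Because the reward depends only on whether the final verdict matches the label, the reasoning text before it is left unconstrained; a real pipeline might add format checks or other terms.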
