Reward Reasoning Models

Abstract

Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, effectively using test-time compute to improve reward model performance remains an open challenge. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to carry out a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs spend additional test-time compute on complex queries where the appropriate reward is not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results show that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained reward reasoning models are available at https://huggingface.co/Reward-Reasoning.