Inference-Time Scaling for Generalist Reward Modeling

Abstract

Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recent work on incentivizing reasoning capabilities in LLMs through RL indicates that proper learning methods can enable effective inference-time scalability. A key challenge of RL is obtaining accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e., the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility across input types and the potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT), which fosters scalable reward generation behaviors in GRMs through online RL so that they generate principles adaptively and critiques accurately, resulting in the DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on various RM benchmarks without severe biases, and can achieve better performance than training-time scaling. DeepSeek-GRM still faces challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
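
The abstract names parallel sampling and meta-RM-guided voting but does not spell out the aggregation rule. The sketch below is one plausible reading, not the authors' implementation. It assumes that each sampled GRM output has already been reduced to one integer score per candidate response, that voting sums these scores across samples, and that the meta RM assigns each sample a scalar quality score used to keep only the top `k_meta` samples. The function name `vote` and its parameters are hypothetical.

```python
from typing import List, Optional, Sequence


def vote(
    samples: Sequence[Sequence[int]],
    meta_scores: Optional[Sequence[float]] = None,
    k_meta: Optional[int] = None,
) -> List[int]:
    """Aggregate k sampled pointwise score vectors into one score per response.

    samples[i][j] is the score that the i-th sampled reward output gives to
    candidate response j. Summation as the voting rule is an assumption.
    """
    if not samples:
        raise ValueError("need at least one sampled score vector")

    kept = list(samples)
    if meta_scores is not None and k_meta is not None:
        if len(meta_scores) != len(samples):
            raise ValueError("meta_scores must have one entry per sample")
        # Assumed meta-RM guidance: keep the k_meta samples rated most reliable.
        order = sorted(
            range(len(samples)), key=lambda i: meta_scores[i], reverse=True
        )
        kept = [samples[i] for i in order[:k_meta]]

    num_responses = len(kept[0])
    return [sum(s[j] for s in kept) for j in range(num_responses)]


# Four sampled score vectors for two candidate responses (made-up numbers).
samples = [[7, 5], [6, 8], [9, 4], [2, 9]]
meta = [0.9, 0.3, 0.8, 0.1]  # hypothetical meta RM scores, one per sample

plain = vote(samples)                  # sums all four samples
guided = vote(samples, meta, k_meta=2)  # sums only samples 0 and 2
```

With these illustrative numbers, plain voting slightly prefers the second response while the guided vote prefers the first, which shows how filtering low-quality samples can change the final ranking as more samples are drawn.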
