Inference-Time Scaling for Generalist Reward Modeling

Abstract

Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs through RL has indicated that proper learning methods can enable effective inference-time scalability. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e., the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, so that they generate principles adaptively and critiques accurately, resulting in the DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models on various RM benchmarks without severe biases, and that it can achieve better performance than training-time scaling. DeepSeek-GRM still faces challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
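The abstract names parallel sampling and meta-RM-guided voting but does not spell out the aggregation rule. The following is a minimal sketch under stated assumptions, not the authors' implementation: each sampled judgement is assumed to be a list of integer pointwise scores (one per candidate response), voting is assumed to be a per-response sum, and the meta RM is assumed to supply one quality score per sample so that only the top `k_meta` samples are counted. The names `vote`, `meta_scores`, and `k_meta` are hypothetical.

```python
from typing import List, Optional, Sequence

# One sampled judgement: a pointwise score for each candidate response.
Scores = List[int]


def vote(
    samples: Sequence[Scores],
    meta_scores: Optional[Sequence[float]] = None,
    k_meta: Optional[int] = None,
) -> Scores:
    """Aggregate sampled pointwise scores by per-response summation.

    Assumption: if meta_scores is given (one value per sample, produced by
    a hypothetical meta RM), only the k_meta highest-rated samples vote.
    """
    if not samples:
        raise ValueError("need at least one sampled judgement")
    kept = list(samples)
    if meta_scores is not None:
        if len(meta_scores) != len(samples):
            raise ValueError("meta_scores must match samples in length")
        # Default to keeping half of the samples (an arbitrary choice).
        k = k_meta if k_meta is not None else max(1, len(samples) // 2)
        order = sorted(
            range(len(samples)), key=lambda i: meta_scores[i], reverse=True
        )
        kept = [samples[i] for i in order[:k]]
    n_responses = len(kept[0])
    return [sum(s[j] for s in kept) for j in range(n_responses)]


# Three sampled judgements over two candidate responses.
samples = [[7, 4], [6, 8], [2, 9]]
plain = vote(samples)
guided = vote(samples, meta_scores=[0.9, 0.8, 0.1], k_meta=2)
```

In this toy input, plain summation favors the second response, while dropping the sample that the assumed meta RM rates lowest flips the preference to the first. That is the kind of effect a guided vote is meant to have when some sampled judgements are unreliable.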
