Reward Reasoning Model
May 20, 2025
Authors: Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei
cs.AI
Abstract
Reward models play a critical role in guiding large language models toward
outputs that align with human expectations. However, an open challenge remains
in effectively utilizing test-time compute to enhance reward model performance.
In this work, we introduce Reward Reasoning Models (RRMs), which are
specifically designed to execute a deliberate reasoning process before
generating final rewards. Through chain-of-thought reasoning, RRMs leverage
additional test-time compute for complex queries where appropriate rewards are
not immediately apparent. To develop RRMs, we implement a reinforcement
learning framework that fosters self-evolved reward reasoning capabilities
without requiring explicit reasoning traces as training data. Experimental
results demonstrate that RRMs achieve superior performance on reward modeling
benchmarks across diverse domains. Notably, we show that RRMs can adaptively
exploit test-time compute to further improve reward accuracy. The pretrained
reward reasoning models are available at
https://huggingface.co/Reward-Reasoning.
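Since the abstract describes RRMs as models that produce a chain-of-thought judgment before emitting a final reward, a minimal usage sketch might look like the following pairwise-judge setup built on Hugging Face Transformers. The checkpoint name `Reward-Reasoning/RRM-7B`, the judge prompt wording, and the "Final verdict" output format are illustrative assumptions, not details confirmed by the abstract; consult the linked Hugging Face organization for the actual checkpoints and prompt template.

```python
# Hypothetical sketch: querying a Reward Reasoning Model (RRM) as a pairwise judge.
# Model id, prompt wording, and verdict format are assumptions for illustration only.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Reward-Reasoning/RRM-7B"  # assumed checkpoint name; see the HF org page

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(query: str, answer_a: str, answer_b: str, max_new_tokens: int = 2048) -> str:
    """Ask the RRM to reason about two candidate answers and return 'A' or 'B'."""
    prompt = (
        "You are an impartial judge. Compare the two assistant responses to the "
        "user query, reason step by step, then state which one is better.\n\n"
        f"Query:\n{query}\n\nResponse A:\n{answer_a}\n\nResponse B:\n{answer_b}\n\n"
        "Finish with a line of the form 'Final verdict: A' or 'Final verdict: B'."
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    # A larger max_new_tokens budget permits a longer reasoning chain before the
    # verdict, which is one way to spend additional test-time compute.
    output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    match = re.search(r"Final verdict:\s*([AB])", text)
    return match.group(1) if match else "A"  # fall back to A if no verdict is parsed
```

The adaptive use of test-time compute mentioned in the abstract could be approximated on top of this sketch by sampling several reasoning chains per comparison and taking a majority vote over the parsed verdicts; the paper's actual aggregation scheme may differ.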