Reward Reasoning Model
May 20, 2025
Authors: Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei
cs.AI
Abstract
Reward models play a critical role in guiding large language models toward
outputs that align with human expectations. However, an open challenge remains
in effectively utilizing test-time compute to enhance reward model performance.
In this work, we introduce Reward Reasoning Models (RRMs), which are
specifically designed to execute a deliberate reasoning process before
generating final rewards. Through chain-of-thought reasoning, RRMs leverage
additional test-time compute for complex queries where appropriate rewards are
not immediately apparent. To develop RRMs, we implement a reinforcement
learning framework that fosters self-evolved reward reasoning capabilities
without requiring explicit reasoning traces as training data. Experimental
results demonstrate that RRMs achieve superior performance on reward modeling
benchmarks across diverse domains. Notably, we show that RRMs can adaptively
exploit test-time compute to further improve reward accuracy. The pretrained
reward reasoning models are available at
https://huggingface.co/Reward-Reasoning.
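Since the abstract describes RRMs as models that produce a chain-of-thought judgment before emitting a final reward, a minimal usage sketch might look like the following pairwise-judge setup built on Hugging Face Transformers. The checkpoint name `Reward-Reasoning/RRM-7B`, the judge prompt wording, and the "Final verdict" output format are illustrative assumptions, not details confirmed by the abstract; consult the linked Hugging Face organization for the actual checkpoints and prompt template.

```python
# Hypothetical sketch: querying a Reward Reasoning Model (RRM) as a pairwise judge.
# Model id, prompt wording, and verdict format are assumptions for illustration only.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Reward-Reasoning/RRM-7B"  # assumed checkpoint name; see the HF org page

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(query: str, answer_a: str, answer_b: str, max_new_tokens: int = 2048) -> str:
    """Ask the RRM to reason about two candidate answers and return 'A' or 'B'."""
    prompt = (
        "You are an impartial judge. Compare the two assistant responses to the "
        "user query, reason step by step, then state which one is better.\n\n"
        f"Query:\n{query}\n\nResponse A:\n{answer_a}\n\nResponse B:\n{answer_b}\n\n"
        "Finish with a line of the form 'Final verdict: A' or 'Final verdict: B'."
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    # A larger max_new_tokens budget permits a longer reasoning chain before the
    # verdict, which is one way to spend additional test-time compute.
    output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=False)
    text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    match = re.search(r"Final verdict:\s*([AB])", text)
    return match.group(1) if match else "A"  # fall back to A if no verdict is parsed
```

The adaptive use of test-time compute mentioned in the abstract could be approximated on top of this sketch by sampling several reasoning chains per comparison and taking a majority vote over the parsed verdicts; the paper's actual aggregation scheme may differ.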