Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

May 22, 2025
作者: Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, Tuo Zhao
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) has become a powerful post-training paradigm for aligning large language models with human preferences. A core challenge in RLHF is constructing accurate reward signals, where the conventional Bradley-Terry reward models (BT RMs) often suffer from sensitivity to data size and coverage, as well as vulnerability to reward hacking. Generative reward models (GenRMs) offer a more robust alternative by generating chain-of-thought (CoT) rationales followed by a final reward. However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks. Moreover, their pairwise preference outputs are incompatible with standard RLHF algorithms that require pointwise reward signals. In this work, we introduce Think-RM, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process. Rather than producing structured, externally provided rationales, Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities such as self-reflection, hypothetical reasoning, and divergent reasoning. To elicit these reasoning abilities, we first warm up the models by supervised fine-tuning (SFT) over long CoT data. We then further improve the model's long-horizon abilities by rule-based reinforcement learning (RL). In addition, we propose a novel pairwise RLHF pipeline that directly optimizes policies using pairwise preference rewards, eliminating the need for pointwise reward conversion and enabling more effective use of Think-RM outputs. Experiments show that Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%. When combined with our pairwise RLHF pipeline, it demonstrates superior end-policy performance compared to traditional approaches.
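
The pairwise pipeline described above can be pictured with a short sketch. The Python below is a hypothetical illustration, not the paper's implementation: the names PairedRollouts, think_rm_judge_stub, and pairwise_rewards are invented for this example, and the stub judge stands in for an actual Think-RM call that would first generate a long, self-guided reasoning trace before emitting its verdict. The sketch only shows the core idea of consuming a relative A/B preference directly as paired rewards, with no fitted pointwise score in between.

# Minimal sketch (assumed names, not the authors' released code) of using a
# generative judge's pairwise verdict directly as the reward for paired rollouts.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PairedRollouts:
    prompt: str
    response_a: str  # e.g., sampled from the current policy
    response_b: str  # e.g., a second sample or a baseline response


def think_rm_judge_stub(prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in for a Think-RM-style generative reward model.

    A real judge would reason at length over the two candidates and finish
    with a verdict ("A" or "B"). This stub returns a fixed verdict so the
    sketch runs without any model.
    """
    return "A"


def pairwise_rewards(
    batch: List[PairedRollouts],
    judge: Callable[[str, str, str], str],
) -> List[Tuple[float, float]]:
    """Map each pairwise verdict to per-response rewards (+1 preferred, -1 not).

    These relative rewards would then feed the policy update directly,
    which is the point of skipping a pointwise reward conversion.
    """
    rewards: List[Tuple[float, float]] = []
    for pair in batch:
        verdict = judge(pair.prompt, pair.response_a, pair.response_b)
        rewards.append((1.0, -1.0) if verdict == "A" else (-1.0, 1.0))
    return rewards


if __name__ == "__main__":
    demo = [PairedRollouts("Explain RLHF in one sentence.",
                           "RLHF fine-tunes a model against a learned reward.",
                           "RLHF is a dataset format.")]
    print(pairwise_rewards(demo, think_rm_judge_stub))  # [(1.0, -1.0)]

In practice the verdict would be decoded from the judge model's generated output, and the signed rewards would be applied to the two rollouts of each pair within a standard policy-gradient-style RLHF update.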
