Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models
May 22, 2025
Authors: Ilgee Hong, Changlong Yu, Liang Qiu, Weixiang Yan, Zhenghao Xu, Haoming Jiang, Qingru Zhang, Qin Lu, Xin Liu, Chao Zhang, Tuo Zhao
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) has become a powerful
post-training paradigm for aligning large language models with human
preferences. A core challenge in RLHF is constructing accurate reward signals,
where the conventional Bradley-Terry reward models (BT RMs) often suffer from
sensitivity to data size and coverage, as well as vulnerability to reward
hacking. Generative reward models (GenRMs) offer a more robust alternative by
generating chain-of-thought (CoT) rationales followed by a final reward.
However, existing GenRMs rely on shallow, vertically scaled reasoning, limiting
their capacity to handle nuanced or complex (e.g., reasoning-intensive) tasks.
Moreover, their pairwise preference outputs are incompatible with standard RLHF
algorithms that require pointwise reward signals. In this work, we introduce
Think-RM, a training framework that enables long-horizon reasoning in GenRMs by
modeling an internal thinking process. Rather than producing structured,
externally provided rationales, Think-RM generates flexible, self-guided
reasoning traces that support advanced capabilities such as self-reflection,
hypothetical reasoning, and divergent reasoning. To elicit these reasoning
abilities, we first warm up the model by supervised fine-tuning (SFT) over
long CoT data. We then further improve the model's long-horizon abilities by
rule-based reinforcement learning (RL). In addition, we propose a novel
pairwise RLHF pipeline that directly optimizes policies using pairwise
preference rewards, eliminating the need for pointwise reward conversion and
enabling more effective use of Think-RM outputs. Experiments show that Think-RM
achieves state-of-the-art results on RM-Bench, outperforming both BT RM and
vertically scaled GenRM by 8%. When combined with our pairwise RLHF pipeline,
it demonstrates superior end-policy performance compared to traditional
approaches.
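The abstract mentions that Think-RM is further trained with rule-based reinforcement learning on top of the SFT warm-up. As a minimal sketch of what such a verifiable reward could look like, the snippet below scores a generated reasoning trace 1.0 when its final pairwise verdict agrees with the annotated human preference and 0.0 otherwise. The bracketed `[[A]]`/`[[B]]` verdict format and the helper names (`extract_verdict`, `rule_based_reward`) are illustrative assumptions, not the paper's exact specification.

```python
import re


def extract_verdict(genrm_output: str) -> str | None:
    """Pull the final pairwise verdict (e.g. "[[A]]" or "[[B]]") from a
    generated reasoning trace. The bracketed-verdict format is an assumed
    convention for this sketch."""
    matches = re.findall(r"\[\[([AB])\]\]", genrm_output)
    return matches[-1] if matches else None


def rule_based_reward(genrm_output: str, preferred: str) -> float:
    """Binary rule-based reward for RL training of a generative reward model:
    1.0 if the model's final verdict matches the human preference label,
    0.0 otherwise (including malformed or missing verdicts)."""
    verdict = extract_verdict(genrm_output)
    return 1.0 if verdict == preferred else 0.0


# Example: a truncated reasoning trace that ends with a verdict.
trace = "<think>Response A is better grounded in the prompt ...</think> Final verdict: [[A]]"
print(rule_based_reward(trace, preferred="A"))  # 1.0
```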
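The pairwise RLHF pipeline described above uses pairwise preference rewards directly, without converting them into pointwise scores. The sketch below illustrates one way this could be wired up, under assumptions: for each prompt, two rollouts are sampled from the policy, Think-RM is queried for a pairwise verdict, and the winner/loser receive +1/-1 rewards that can feed a standard policy-gradient update. The two-rollout setup, the +1/-1 coding, and the `sample`/`prefer` interfaces are hypothetical, not the paper's exact recipe.

```python
from typing import Callable, List, Tuple


def pairwise_rlhf_rewards(
    prompts: List[str],
    sample: Callable[[str], str],            # policy sampler (hypothetical interface)
    prefer: Callable[[str, str, str], str],  # pairwise judge: returns "A" or "B"
) -> List[Tuple[str, str, float, str, float]]:
    """For each prompt, draw two rollouts, ask the generative reward model
    which one it prefers, and use that comparison directly as the reward
    (+1 winner / -1 loser) -- no pointwise scalar reward is ever produced."""
    batch = []
    for prompt in prompts:
        resp_a, resp_b = sample(prompt), sample(prompt)
        verdict = prefer(prompt, resp_a, resp_b)
        r_a, r_b = (1.0, -1.0) if verdict == "A" else (-1.0, 1.0)
        batch.append((prompt, resp_a, r_a, resp_b, r_b))
    # The resulting (response, reward) pairs can then be consumed by a
    # REINFORCE/GRPO-style policy-gradient step on the sampled responses.
    return batch
```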