LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
October 8, 2025
Authors: Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang
cs.AI
Abstract
Reward models (RMs) play a pivotal role in aligning large language models (LLMs)
with human preferences. As real-world applications increasingly involve long
history trajectories, e.g., LLM agents, it becomes indispensable to evaluate
whether a model's responses are not only high-quality but also grounded in and
consistent with the provided context. Yet, current RMs remain confined to
short-context settings and primarily focus on response-level attributes (e.g.,
safety or helpfulness), while largely neglecting the critical dimension of long
context-response consistency. In this work, we introduce Long-RewardBench, a
benchmark specifically designed for long-context RM evaluation, featuring both
Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that
even state-of-the-art generative RMs exhibit significant fragility in
long-context scenarios, failing to maintain context-aware preference judgments.
Motivated by the analysis of failure patterns observed in model outputs, we
propose a general multi-stage training strategy that effectively scales
arbitrary models into robust Long-context RMs (LongRMs). Experiments show that
our approach not only substantially improves performance on long-context
evaluation but also preserves strong short-context capability. Notably, our 8B
LongRM outperforms much larger 70B-scale baselines and matches the performance
of the proprietary Gemini 2.5 Pro model.
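The Best-of-N task mentioned above amounts to scoring N candidate responses against the context and selecting the highest-scoring one. A minimal sketch of that selection logic follows; `toy_reward` is a hypothetical stand-in (a crude context-overlap proxy, not the paper's LongRM), used only to make the selection loop concrete.

```python
def toy_reward(context: str, response: str) -> float:
    """Hypothetical reward: fraction of response words that appear in the context.

    A trained reward model would instead return a learned scalar preference
    score for the (context, response) pair; this proxy only illustrates the
    Best-of-N selection mechanics.
    """
    ctx_words = set(context.lower().split())
    resp_words = response.lower().split()
    if not resp_words:
        return 0.0
    return sum(w in ctx_words for w in resp_words) / len(resp_words)


def best_of_n(context: str, candidates: list[str]) -> str:
    """Return the candidate with the highest reward score (Best-of-N)."""
    return max(candidates, key=lambda r: toy_reward(context, r))


context = "The meeting was moved to Friday because the client is traveling."
candidates = [
    "The meeting is on Monday.",
    "The meeting was moved to Friday.",
    "No idea.",
]
print(best_of_n(context, candidates))  # selects the context-consistent answer
```

In a Pairwise Comparison setting, the same scorer would instead be applied to exactly two candidates, with the judgment being which one scores higher.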