LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
October 8, 2025
Authors: Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang
cs.AI
Abstract
Reward models (RMs) play a pivotal role in aligning large language models (LLMs)
with human preferences. As real-world applications increasingly involve long
history trajectories, e.g., LLM agents, it becomes indispensable to evaluate
whether a model's responses are not only high-quality but also grounded in and
consistent with the provided context. Yet, current RMs remain confined to
short-context settings and primarily focus on response-level attributes (e.g.,
safety or helpfulness), while largely neglecting the critical dimension of long
context-response consistency. In this work, we introduce Long-RewardBench, a
benchmark specifically designed for long-context RM evaluation, featuring both
Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that
even state-of-the-art generative RMs exhibit significant fragility in
long-context scenarios, failing to maintain context-aware preference judgments.
Motivated by the analysis of failure patterns observed in model outputs, we
propose a general multi-stage training strategy that effectively scales
arbitrary models into robust Long-context RMs (LongRMs). Experiments show that
our approach not only substantially improves performance on long-context
evaluation but also preserves strong short-context capability. Notably, our 8B
LongRM outperforms much larger 70B-scale baselines and matches the performance
of the proprietary Gemini 2.5 Pro model.
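The Best-of-N task mentioned above amounts to scoring N candidate responses against the context and selecting the highest-scoring one. A minimal sketch of that selection logic follows; `toy_reward` is a hypothetical stand-in (a crude context-overlap proxy, not the paper's LongRM), used only to make the selection loop concrete.

```python
def toy_reward(context: str, response: str) -> float:
    """Hypothetical reward: fraction of response words that appear in the context.

    A trained reward model would instead return a learned scalar preference
    score for the (context, response) pair; this proxy only illustrates the
    Best-of-N selection mechanics.
    """
    ctx_words = set(context.lower().split())
    resp_words = response.lower().split()
    if not resp_words:
        return 0.0
    return sum(w in ctx_words for w in resp_words) / len(resp_words)


def best_of_n(context: str, candidates: list[str]) -> str:
    """Return the candidate with the highest reward score (Best-of-N)."""
    return max(candidates, key=lambda r: toy_reward(context, r))


context = "The meeting was moved to Friday because the client is traveling."
candidates = [
    "The meeting is on Monday.",
    "The meeting was moved to Friday.",
    "No idea.",
]
print(best_of_n(context, candidates))  # selects the context-consistent answer
```

In a Pairwise Comparison setting, the same scorer would instead be applied to exactly two candidates, with the judgment being which one scores higher.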