LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
October 8, 2025
Authors: Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang
cs.AI
Abstract
Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet current RMs remain confined to short-context settings and focus primarily on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by an analysis of the failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
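To make the two evaluation protocols concrete, below is a minimal sketch of how a generative RM could be scored on Pairwise Comparison and Best-of-N tasks. The abstract does not specify the paper's implementation, so everything here is an assumption: the `Judge` callable stands in for any generative RM that, given a long context and two responses, emits a preference, and reducing Best-of-N to a sequential knockout of pairwise comparisons is just one plausible way to run that task. Names like `pairwise_compare`, `best_of_n`, and `toy_judge` are hypothetical.

```python
# Sketch of the two Long-RewardBench task formats (hypothetical, not the
# authors' code): Pairwise Comparison picks the better of two responses;
# Best-of-N picks the best of N candidates. Here Best-of-N is reduced to
# repeated pairwise knockouts, one possible protocol among several.

from typing import Callable, List

# Assumed interface for a generative RM used as a judge:
# judge(context, response_a, response_b) -> "A" or "B"
Judge = Callable[[str, str, str], str]


def pairwise_compare(judge: Judge, context: str, resp_a: str, resp_b: str) -> str:
    """Return whichever response the RM prefers, conditioned on the context."""
    return resp_a if judge(context, resp_a, resp_b) == "A" else resp_b


def best_of_n(judge: Judge, context: str, candidates: List[str]) -> str:
    """Select a winner from N candidates via sequential pairwise knockouts."""
    winner = candidates[0]
    for challenger in candidates[1:]:
        winner = pairwise_compare(judge, context, winner, challenger)
    return winner


if __name__ == "__main__":
    # Toy stand-in judge: prefers the response with more word overlap with
    # the context, i.e., a crude proxy for context-response consistency.
    def toy_judge(context: str, a: str, b: str) -> str:
        def score(resp: str) -> int:
            return len(set(resp.split()) & set(context.split()))
        return "A" if score(a) >= score(b) else "B"

    ctx = "the meeting was moved to friday at noon"
    best = best_of_n(toy_judge, ctx, [
        "the meeting is on monday",
        "the meeting was moved to friday at noon",
        "no meeting is scheduled",
    ])
    print(best)  # -> "the meeting was moved to friday at noon"
```

In a real evaluation, `toy_judge` would be replaced by a prompted generative RM, and accuracy would be measured against human preference labels; a production setup would also randomize the A/B order to control for position bias.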