LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling
October 8, 2025
Authors: Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang
cs.AI
Abstract
Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet current RMs remain confined to short-context settings and focus primarily on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by an analysis of the failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
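To make the two evaluation protocols concrete, below is a minimal sketch of how a generative RM could be scored on Pairwise Comparison and Best-of-N tasks. The abstract does not specify the paper's implementation, so everything here is an assumption: the `Judge` callable stands in for any generative RM that, given a long context and two responses, emits a preference, and reducing Best-of-N to a sequential knockout of pairwise comparisons is just one plausible way to run that task. Names like `pairwise_compare`, `best_of_n`, and `toy_judge` are hypothetical.

```python
# Sketch of the two Long-RewardBench task formats (hypothetical, not the
# authors' code): Pairwise Comparison picks the better of two responses;
# Best-of-N picks the best of N candidates. Here Best-of-N is reduced to
# repeated pairwise knockouts, one possible protocol among several.

from typing import Callable, List

# Assumed interface for a generative RM used as a judge:
# judge(context, response_a, response_b) -> "A" or "B"
Judge = Callable[[str, str, str], str]


def pairwise_compare(judge: Judge, context: str, resp_a: str, resp_b: str) -> str:
    """Return whichever response the RM prefers, conditioned on the context."""
    return resp_a if judge(context, resp_a, resp_b) == "A" else resp_b


def best_of_n(judge: Judge, context: str, candidates: List[str]) -> str:
    """Select a winner from N candidates via sequential pairwise knockouts."""
    winner = candidates[0]
    for challenger in candidates[1:]:
        winner = pairwise_compare(judge, context, winner, challenger)
    return winner


if __name__ == "__main__":
    # Toy stand-in judge: prefers the response with more word overlap with
    # the context, i.e., a crude proxy for context-response consistency.
    def toy_judge(context: str, a: str, b: str) -> str:
        def score(resp: str) -> int:
            return len(set(resp.split()) & set(context.split()))
        return "A" if score(a) >= score(b) else "B"

    ctx = "the meeting was moved to friday at noon"
    best = best_of_n(toy_judge, ctx, [
        "the meeting is on monday",
        "the meeting was moved to friday at noon",
        "no meeting is scheduled",
    ])
    print(best)  # -> "the meeting was moved to friday at noon"
```

In a real evaluation, `toy_judge` would be replaced by a prompted generative RM, and accuracy would be measured against human preference labels; a production setup would also randomize the A/B order to control for position bias.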