LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

Abstract

Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, such as those produced by LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet current RMs remain confined to short-context settings and focus primarily on response-level attributes (e.g., safety or helpfulness), largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation that features both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs are markedly fragile in long-context scenarios and fail to maintain context-aware preference judgments. Motivated by an analysis of the failure patterns observed in model outputs, we propose a general multi-stage training strategy that scales arbitrary models into robust long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
