LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

Abstract

Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, such as those produced by LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet current RMs remain confined to short-context settings and focus primarily on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs are significantly fragile in long-context scenarios and fail to maintain context-aware preference judgments. Motivated by an analysis of the failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.
