LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Abstract

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information amid extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited in two ways: their distractors have low confusability, and their sparse, outcome-only reward signals cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability). The resulting training contexts are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (positive-only strategy), which distinguishes reasoning quality among correct responses and prevents reward hacking. Experiments on three reasoning LLMs (4B to 30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
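The two mechanisms described in the abstract, tiered distractors and a positive-only rubric reward, can be illustrated with a minimal Python sketch. This is not the released implementation. The `Trajectory` record and its field names, the substring-based entity matching, the base reward of 1.0, and the `weight` parameter are all assumptions made for illustration, since the abstract does not specify them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: set[str]  # document ids that appeared in search results
    opened: set[str]     # document ids the agent actually read
    cited: set[str]      # document ids the agent cited in its answer


def tier_distractors(traj: Trajectory) -> dict[str, set[str]]:
    """Split non-evidence documents into the two tiers named in the abstract."""
    high = traj.opened - traj.cited      # read but not cited: high confusability
    low = traj.retrieved - traj.opened   # surfaced but never opened: low confusability
    return {"high": high, "low": low}


def rubric_reward(
    response: str,
    answer_correct: bool,
    gold_entities: list[str],
    weight: float = 0.5,  # assumed scaling of the rubric bonus
) -> float:
    """Outcome reward plus an entity-coverage bonus, granted only to correct answers."""
    if not answer_correct:
        # Positive-only strategy: an incorrect answer earns no rubric credit,
        # so merely name-dropping entities cannot raise the reward.
        return 0.0
    if not gold_entities:
        return 1.0
    text = response.lower()
    # Assumed matching rule: case-insensitive substring match per gold entity.
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return 1.0 + weight * hits / len(gold_entities)
```

Under these assumptions, two correct responses receive different rewards when one mentions more of the gold entities along the reasoning chain, while every incorrect response receives the same zero reward regardless of how many entities it mentions.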