LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
May 29, 2026
Authors: Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
cs.AI
Abstract
Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information amid extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and by sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability). The resulting training contexts are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (a positive-only strategy), which distinguishes reasoning quality among correct responses and prevents reward hacking. Experiments on three reasoning LLMs (4B to 30B parameters) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
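To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical `Trajectory` record holding the document ids a search agent saw, opened, and cited, and it assumes the rubric score is the fraction of gold entities mentioned in a response, added to a unit outcome reward with a hypothetical `weight`. The abstract does not specify the matching rule or how the rubric and outcome rewards are combined, so both are illustrative choices.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    searched: set[str] = field(default_factory=set)  # doc ids returned by search
    opened: set[str] = field(default_factory=set)    # doc ids the agent read
    cited: set[str] = field(default_factory=set)     # doc ids the agent cited


def tier_distractors(traj: Trajectory) -> dict[str, set[str]]:
    """Split non-cited documents into the two tiers named in the abstract."""
    return {
        # Read but not cited: high confusability.
        "high": traj.opened - traj.cited,
        # Appeared in search results but never opened: low confusability.
        "low": traj.searched - traj.opened - traj.cited,
    }


def rubric_reward(
    response: str,
    answer_correct: bool,
    gold_entities: list[str],
    weight: float = 0.5,  # assumed mixing weight, not taken from the paper
) -> float:
    """Outcome reward plus a positive-only, entity-level rubric bonus.

    Assumptions: additive combination, case-insensitive substring matching.
    """
    if not answer_correct:
        # Positive-only: incorrect answers receive no rubric credit at all.
        return 0.0
    if not gold_entities:
        return 1.0
    text = response.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return 1.0 + weight * hits / len(gold_entities)
```

Under these assumptions, two responses with the same correct final answer receive different rewards when one of them mentions more of the gold entities along the reasoning chain. A response with a wrong answer gets zero no matter how many entities it names, which is the property the abstract credits with preventing reward hacking.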