LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Abstract

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information amid extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and by sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce LongTraceRL. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability). The resulting training contexts are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a rubric reward that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (a positive-only strategy), which distinguishes reasoning quality among correct responses and prevents reward hacking. Experiments on three reasoning LLMs (4B to 30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.
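
To make the two mechanisms described in the abstract concrete, the following is a minimal Python sketch, not the released implementation. The `Trajectory` record, the function names, the case-insensitive substring matching of entities, the additive form `1 + weight * coverage`, and the default `weight` of 0.5 are all illustrative assumptions; the abstract specifies only the two distractor tiers and that the rubric reward applies solely to responses with correct final answers.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: frozenset[str]  # document ids that appeared in search results
    opened: frozenset[str]     # document ids the agent actually read
    cited: frozenset[str]      # document ids the agent cited as evidence


def tier_distractors(traj: Trajectory) -> tuple[set[str], set[str]]:
    """Split non-evidence documents into (high, low) confusability tiers."""
    # Read but not cited: plausible enough to open, so harder to dismiss.
    high = set(traj.opened - traj.cited)
    # Surfaced by search but never opened: easier to dismiss.
    low = set(traj.retrieved - traj.opened - traj.cited)
    return high, low


def rubric_reward(
    response: str,
    answer_correct: bool,
    gold_entities: list[str],
    weight: float = 0.5,  # assumed trade-off between outcome and rubric terms
) -> float:
    """Positive-only rubric reward: entity coverage counts only if the answer is right."""
    if not answer_correct:
        # Mentioning gold entities earns nothing when the final answer is wrong,
        # which removes the incentive to list entities without solving the task.
        return 0.0
    if not gold_entities:
        return 1.0
    text = response.casefold()
    hits = sum(1 for entity in gold_entities if entity.casefold() in text)
    return 1.0 + weight * hits / len(gold_entities)
```

Under these assumptions, two correct responses receive different rewards depending on how many gold entities from the reasoning chain they mention, while an incorrect response always scores zero regardless of entity coverage.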