
LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

October 22, 2025
Authors: Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang
cs.AI

Abstract

Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve the relevant facts, and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond the training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy, with absolute gains of +23.5% and +21.1%. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
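
The KeyChain construction described in the abstract can be pictured as a small synthesis routine. The sketch below is an illustrative reconstruction, not the authors' implementation: the function name `synthesize_keychain_example`, its parameters, and the exact note/tag formats are assumptions; only the core idea, a UUID pointer chain whose final key tags the true question, buried among decoy questions and distractor documents, comes from the abstract.

```python
import random
import uuid


def synthesize_keychain_example(question, answer, supporting_docs,
                                distractor_docs, decoy_questions,
                                chain_len=8):
    """Assemble one KeyChain-style long-context task (illustrative sketch).

    A short multi-hop QA pair is hidden behind a chain of UUID keys: the model
    must follow the pointer notes hop by hop, find the question tagged with the
    final key, then retrieve the supporting facts and answer it.
    """
    # A fresh chain of keys; only the final key marks the true question.
    chain = [str(uuid.uuid4()) for _ in range(chain_len)]

    # One pointer note per hop in the chain.
    notes = [f"Note: key {chain[i]} points to key {chain[i + 1]}."
             for i in range(chain_len - 1)]

    # Tag the true question with the final key; decoys get unrelated keys.
    questions = [f"[key {chain[-1]}] Question: {question}"]
    questions += [f"[key {uuid.uuid4()}] Question: {q}" for q in decoy_questions]

    # Mix supporting facts with distractors and shuffle everything together,
    # so neither position nor ordering reveals what matters.
    pieces = notes + questions + list(supporting_docs) + list(distractor_docs)
    random.shuffle(pieces)

    prompt = (f"You are given key {chain[0]}. Follow the chain of keys to find "
              "the question tagged with the final key, then answer it.\n\n"
              + "\n\n".join(pieces))
    return {"prompt": prompt, "answer": answer}
```

Under this view, difficulty and context length are controlled by `chain_len` and the number of distractors, which is consistent with the paper's setup of training at 16K while evaluating at 128K.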