LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
October 22, 2025
Authors: Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang
cs.AI
Abstract
Reasoning over long contexts is essential for large language models. While
reinforcement learning (RL) enhances short-context reasoning by inducing "Aha"
moments in chain-of-thought, the advanced thinking patterns required for
long-context reasoning remain largely unexplored, and high-difficulty RL data
are scarce. In this paper, we introduce LoongRL, a data-driven RL method for
advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis
approach that transforms short multi-hop QA into high-difficulty long-context
tasks by inserting UUID chains that hide the true question among large
collections of distracting documents. Solving these tasks requires the model to
trace the correct chain step-by-step, identify the true question, retrieve the
relevant facts, and reason over them to answer correctly. RL training on
KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning
pattern that generalizes far beyond the training length. Models trained at 16K
effectively solve 128K tasks without the prohibitive cost of full-length RL rollouts.
On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA
accuracy, with absolute gains of +23.5% and +21.1%. The resulting LoongRL-14B reaches
a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5)
and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all
128K needle-in-a-haystack stress tests, and preserves short-context reasoning
capabilities.
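
The abstract describes KeyChain only at a high level. Below is a minimal sketch of what such a synthesis step could look like, assuming a simple pointer-chain scheme; the function name, prompt wording, and chain format here are illustrative assumptions, not the paper's actual implementation.

```python
import random
import uuid

def make_keychain_example(question: str,
                          gold_docs: list[str],
                          distractor_docs: list[str],
                          chain_len: int = 4) -> str:
    """Build one KeyChain-style long-context example (illustrative only).

    A chain of UUID "keys" is scattered among distracting documents;
    only by following the chain from its first key can the model
    recover the true multi-hop question it must answer.
    """
    keys = [str(uuid.uuid4()) for _ in range(chain_len)]

    # Each link points to the next key; the final link reveals the
    # hidden question.
    links = [f"The key {keys[i]} points to the key {keys[i + 1]}."
             for i in range(chain_len - 1)]
    links.append(f"The key {keys[-1]} holds the true question: {question}")

    # Hide the chain links and the gold evidence among distractors.
    docs = gold_docs + links + distractor_docs
    random.shuffle(docs)

    context = "\n\n".join(docs)
    return (f"{context}\n\n"
            f"Start from the key {keys[0]}, follow the chain of keys to "
            f"find the true question, then answer it.")
```

In an actual pipeline, one would presumably scale distractor_docs to a target context length (e.g., 16K tokens at training time) and keep the gold answer alongside the prompt to compute the RL reward; the abstract does not specify these details.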