QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
May 23, 2025
Authors: Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
cs.AI
Abstract
Recent large reasoning models (LRMs) have demonstrated strong reasoning
capabilities through reinforcement learning (RL). These improvements have
primarily been observed on short-context reasoning tasks. In contrast,
extending LRMs to effectively process and reason over long-context inputs via
RL remains a critical unsolved challenge. To bridge this gap, we first
formalize the paradigm of long-context reasoning RL and identify its key
challenges: suboptimal training efficiency and an unstable optimization
process. To address these issues, we propose QwenLong-L1, a framework that
adapts short-context LRMs to long-context scenarios via progressive context
scaling. Specifically, we use a warm-up supervised fine-tuning (SFT) stage to
establish a robust initial policy, followed by a curriculum-guided phased RL
technique to stabilize policy evolution, enhanced with a difficulty-aware
retrospective sampling strategy to incentivize policy exploration.
Experiments on seven long-context document question-answering benchmarks
show that QwenLong-L1-32B outperforms flagship LRMs such as OpenAI-o3-mini
and Qwen3-235B-A22B and achieves performance on par with
Claude-3.7-Sonnet-Thinking, placing it among the leading state-of-the-art
LRMs. This work advances the development of practical long-context LRMs
capable of robust reasoning across information-intensive environments.
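
The abstract names a curriculum-guided phased RL schedule and a difficulty-aware retrospective sampling strategy but does not give their details. As a rough illustration only, the sketch below shows one way such a schedule could be assembled: training examples are bucketed into phases by input length, and each later phase re-includes a few earlier examples drawn with a preference for low average reward. Everything here is an assumption made for illustration and is not taken from the paper: the `Example` fields, the length caps, the weighting `1 - mean_reward`, and the function names are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    uid: str
    context_len: int    # input length in tokens (hypothetical field)
    mean_reward: float  # average reward from earlier rollouts, in [0, 1] (hypothetical)


def phase_pool(examples, max_len, prev_max_len=0):
    """Return examples whose input length falls in (prev_max_len, max_len]."""
    return [e for e in examples if prev_max_len < e.context_len <= max_len]


def retrospective_sample(prev_examples, k, rng):
    """Draw up to k earlier examples without replacement, favoring low mean reward."""
    pool = list(prev_examples)
    chosen = []
    while pool and len(chosen) < k:
        # Assumed difficulty weight: harder examples (lower reward) are more likely.
        weights = [1.0 - e.mean_reward + 1e-6 for e in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))
    return chosen


def build_curriculum(examples, length_caps, k_retro, seed=0):
    """Build one training pool per phase, with increasing input-length caps."""
    rng = random.Random(seed)
    phases, seen, prev_cap = [], [], 0
    for cap in length_caps:
        fresh = phase_pool(examples, cap, prev_cap)
        retro = retrospective_sample(seen, k_retro, rng)
        phases.append(fresh + retro)
        seen.extend(fresh)
        prev_cap = cap
    return phases


data = [
    Example("a", 8_000, 0.9),
    Example("b", 15_000, 0.1),
    Example("c", 40_000, 0.5),
    Example("d", 55_000, 0.3),
]
# Hypothetical length caps; the abstract does not state the actual values.
phases = build_curriculum(data, length_caps=[20_000, 60_000], k_retro=1)
```

In this sketch the first phase contains only the short inputs, and the second phase contains the longer inputs plus one replayed short example. A real training setup would also need to refresh `mean_reward` from the current policy's rollouts between phases, which is omitted here.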