QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
May 23, 2025
Authors: Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
cs.AI
Abstract
Recent large reasoning models (LRMs) have demonstrated strong reasoning
capabilities through reinforcement learning (RL). These improvements have
primarily been observed in short-context reasoning tasks. In contrast,
extending LRMs to effectively process and reason on long-context inputs via RL
remains a critical unsolved challenge. To bridge this gap, we first formalize
the paradigm of long-context reasoning RL, and identify key challenges in
suboptimal training efficiency and an unstable optimization process. To address
these issues, we propose QwenLong-L1, a framework that adapts short-context
LRMs to long-context scenarios via progressive context scaling. Specifically,
we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust
initial policy, followed by a curriculum-guided phased RL technique to
stabilize policy evolution, enhanced with a difficulty-aware retrospective
sampling strategy to incentivize policy exploration.
Experiments on seven long-context document question-answering benchmarks
demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini
and Qwen3-235B-A22B and performs on par with Claude-3.7-Sonnet-Thinking,
placing it among the leading state-of-the-art LRMs. This work advances the
development of practical
long-context LRMs capable of robust reasoning across information-intensive
environments.
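The training recipe sketched in the abstract (phased RL over progressively longer contexts, with hard examples replayed across phases) can be illustrated with a minimal data-scheduling sketch. This is an assumption-laden illustration, not the authors' implementation: the example records, the `ctx_len` and `accuracy` fields, the phase lengths, and the 0.5 difficulty threshold are all hypothetical.

```python
import random

def phase_batches(pool, context_phases, difficulty_threshold=0.5, seed=0):
    """Yield (phase_ctx, examples) pairs: each RL phase trains on inputs up
    to its context limit, plus hard examples retained from earlier phases
    (a sketch of difficulty-aware retrospective sampling)."""
    rng = random.Random(seed)
    hard_buffer = []  # hard examples carried forward between phases
    for max_ctx in context_phases:  # curriculum: shorter contexts first
        phase_pool = [ex for ex in pool if ex["ctx_len"] <= max_ctx]
        batch = phase_pool + hard_buffer  # replay hard cases from prior phases
        rng.shuffle(batch)
        yield max_ctx, batch
        # Retain low-accuracy (hard) examples for the next phase.
        hard_buffer = [ex for ex in phase_pool
                       if ex["accuracy"] < difficulty_threshold]

# Toy pool: "accuracy" stands in for a measured per-example pass rate.
pool = [
    {"id": "a", "ctx_len": 8_000,  "accuracy": 0.9},
    {"id": "b", "ctx_len": 16_000, "accuracy": 0.3},  # hard, short context
    {"id": "c", "ctx_len": 48_000, "accuracy": 0.4},  # hard, long context
]
for max_ctx, batch in phase_batches(pool, [20_000, 60_000]):
    print(max_ctx, sorted(ex["id"] for ex in batch))
```

In this sketch the second phase sees example `b` twice: once from its own pool and once replayed from the hard buffer, which is how retrospective sampling biases later phases toward previously difficult inputs.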