QwenLong-L1: Towards Long-Context Large Reasoning Models with Reinforcement Learning
May 23, 2025
Authors: Fanqi Wan, Weizhou Shen, Shengyi Liao, Yingcheng Shi, Chenliang Li, Ziyi Yang, Ji Zhang, Fei Huang, Jingren Zhou, Ming Yan
cs.AI
Abstract
Recent large reasoning models (LRMs) have demonstrated strong reasoning
capabilities through reinforcement learning (RL). These improvements have
primarily been observed in short-context reasoning tasks. In contrast,
extending LRMs to effectively process and reason on long-context inputs via RL
remains a critical unsolved challenge. To bridge this gap, we first formalize
the paradigm of long-context reasoning RL, and identify key challenges in
suboptimal training efficiency and an unstable optimization process. To address
these issues, we propose QwenLong-L1, a framework that adapts short-context
LRMs to long-context scenarios via progressive context scaling. Specifically,
we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust
initial policy, followed by a curriculum-guided phased RL technique to
stabilize policy evolution, enhanced with a difficulty-aware retrospective
sampling strategy to incentivize policy exploration.
Experiments on seven long-context document question-answering benchmarks
demonstrate that QwenLong-L1-32B outperforms flagship LRMs like OpenAI-o3-mini
and Qwen3-235B-A22B and performs on par with Claude-3.7-Sonnet-Thinking,
placing it among the leading state-of-the-art LRMs. This work advances the
development of practical
long-context LRMs capable of robust reasoning across information-intensive
environments.
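The training recipe sketched in the abstract (phased RL over progressively longer contexts, with hard examples replayed across phases) can be illustrated with a minimal data-scheduling sketch. This is an assumption-laden illustration, not the authors' implementation: the example records, the `ctx_len` and `accuracy` fields, the phase lengths, and the 0.5 difficulty threshold are all hypothetical.

```python
import random

def phase_batches(pool, context_phases, difficulty_threshold=0.5, seed=0):
    """Yield (phase_ctx, examples) pairs: each RL phase trains on inputs up
    to its context limit, plus hard examples retained from earlier phases
    (a sketch of difficulty-aware retrospective sampling)."""
    rng = random.Random(seed)
    hard_buffer = []  # hard examples carried forward between phases
    for max_ctx in context_phases:  # curriculum: shorter contexts first
        phase_pool = [ex for ex in pool if ex["ctx_len"] <= max_ctx]
        batch = phase_pool + hard_buffer  # replay hard cases from prior phases
        rng.shuffle(batch)
        yield max_ctx, batch
        # Retain low-accuracy (hard) examples for the next phase.
        hard_buffer = [ex for ex in phase_pool
                       if ex["accuracy"] < difficulty_threshold]

# Toy pool: "accuracy" stands in for a measured per-example pass rate.
pool = [
    {"id": "a", "ctx_len": 8_000,  "accuracy": 0.9},
    {"id": "b", "ctx_len": 16_000, "accuracy": 0.3},  # hard, short context
    {"id": "c", "ctx_len": 48_000, "accuracy": 0.4},  # hard, long context
]
for max_ctx, batch in phase_batches(pool, [20_000, 60_000]):
    print(max_ctx, sorted(ex["id"] for ex in batch))
```

In this sketch the second phase sees example `b` twice: once from its own pool and once replayed from the hard buffer, which is how retrospective sampling biases later phases toward previously difficult inputs.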