QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

December 15, 2025
作者: Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan
cs.AI

Abstract

We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability of long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO), which dynamically regulates the exploration-exploitation trade-off. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M to 4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains such as scientific reasoning, memory tool use, and extended dialogue.
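The data synthesis idea in (1) — decompose documents into atomic facts, then programmatically compose questions whose gold answers are verifiable by executing the hop chain — can be illustrated with a toy sketch. The fact representation and function names below are hypothetical, not the paper's actual schema:

```python
def synthesize_question(facts, start, relations):
    """Compose a verifiable multi-hop question from atomic facts.

    `facts` maps (subject, relation) -> object; `relations` is the hop
    chain. The gold answer is obtained by executing the chain, so a
    reward model can check correctness mechanically.
    """
    entity = start
    for rel in relations:
        # Each hop must be grounded in an atomic fact; a KeyError here
        # means the chain is not answerable and the sample is discarded.
        entity = facts[(entity, rel)]
    question = (
        f"Starting from {start}, follow "
        + " then ".join(relations)
        + ". What entity do you reach?"
    )
    return question, entity  # (question text, verifiable gold answer)
```

In a real pipeline the atomic facts would be extracted from long documents and the supporting evidence scattered across the context, forcing genuine long-range grounding rather than local retrieval.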
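The stabilization mechanisms in (2) can be sketched in miniature. This is a generic interpretation under stated assumptions — balanced draws across task types, and advantages normalized within each task so that reward-scale differences between tasks do not bias the policy update; the actual estimator in the paper may differ:

```python
import random
from collections import defaultdict


def task_balanced_batch(pool, batch_size, rng=random):
    """Draw a batch with (near-)equal representation of every task type."""
    by_task = defaultdict(list)
    for example in pool:
        by_task[example["task"]].append(example)
    per_task = batch_size // len(by_task)
    batch = []
    for task in sorted(by_task):
        k = min(per_task, len(by_task[task]))
        batch.extend(rng.sample(by_task[task], k))
    return batch


def task_specific_advantages(samples):
    """Center each reward on the mean reward of its own task, so tasks
    with systematically higher rewards do not dominate the gradient.
    `samples` is a list of (task_name, reward) pairs."""
    by_task = defaultdict(list)
    for task, reward in samples:
        by_task[task].append(reward)
    means = {t: sum(rs) / len(rs) for t, rs in by_task.items()}
    return [(task, reward - means[task]) for task, reward in samples]
```

Within-task centering guarantees the advantages of each task sum to zero, which is the property that removes cross-task reward bias.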
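The abstract does not specify AEPO's update rule, but "dynamically regulates the exploration-exploitation trade-off" suggests a controller over the entropy term of the policy loss. A minimal sketch of one such controller, assuming a target entropy band (the band and multiplicative step are illustrative assumptions, not the paper's values):

```python
def adapt_entropy_coef(coef, entropy, low, high, step=1.1):
    """Adjust the entropy-bonus coefficient toward a target band.

    If measured policy entropy collapses below `low` (exploration dying),
    raise the coefficient; if it overshoots `high` (updates too noisy),
    lower it; inside the band, leave it unchanged.
    """
    if entropy < low:
        return coef * step
    if entropy > high:
        return coef / step
    return coef
```

Called once per training step, this keeps entropy oscillating inside the band instead of drifting to collapse, which is one common source of the long-context RL instability the paper targets.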
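The memory-augmented processing in (3) — iterative folding of over-window input into a bounded memory, then answering from that memory — can be sketched as a simple loop. The `summarize`/`answer` callables stand in for model calls, and the character budget is a placeholder for a token budget; none of these are the paper's actual interfaces:

```python
def answer_over_chunks(chunks, question, summarize, answer, budget=2048):
    """Iterative memory-based processing for inputs that exceed the
    context window.

    Each chunk is folded into a running memory conditioned on the
    question; the memory is kept within a fixed budget so any number of
    chunks can be processed. The final answer is produced from the
    compressed memory alone.
    """
    memory = ""
    for chunk in chunks:
        memory = summarize(memory, chunk, question)
        memory = memory[-budget:]  # enforce the fixed memory budget
    return answer(memory, question)
```

A deployed system would route short inputs to single-pass reasoning and only fall back to this iterative path beyond the window limit, matching the paper's "seamless integration" of the two modes.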
PDF · December 17, 2025