
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

December 15, 2025
作者: Weizhou Shen, Ziyi Yang, Chenliang Li, Zhiyuan Lu, Miao Peng, Huashan Sun, Yingcheng Shi, Shengyi Liao, Shaopeng Lai, Bo Zhang, Dayiheng Liu, Fei Huang, Jingren Zhou, Ming Yan
cs.AI

Abstract

We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability of long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO), which dynamically regulates the exploration-exploitation trade-off. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Built on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M–4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates into improved performance in general domains such as scientific reasoning, memory tool use, and extended dialogue.
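The data-synthesis idea in (1) — decompose documents into atomic facts, then programmatically compose verifiable multi-hop questions — can be illustrated with a toy sketch. Here facts are `(subject, relation, object)` triples and a question chains facts whose object becomes the next hop's subject; the function name and verbalization scheme are illustrative assumptions, not the paper's pipeline.

```python
import random

# Hypothetical sketch of atomic-fact composition: a multi-hop question is
# built by chaining triples, and the final object is the verifiable answer.
def compose_multihop_question(facts, hops=2, seed=0):
    """facts: list of (subject, relation, object) triples."""
    rng = random.Random(seed)
    by_subject = {}
    for s, r, o in facts:
        by_subject.setdefault(s, []).append((r, o))
    # try start entities until one supports the requested number of hops
    for start in rng.sample(sorted(by_subject), len(by_subject)):
        chain, node = [], start
        while node in by_subject and len(chain) < hops:
            r, o = rng.choice(by_subject[node])
            chain.append((node, r, o))
            node = o
        if len(chain) == hops:
            # verbalize: ask for the final object, revealing only the start
            rels = " of the ".join(r for _, r, _ in reversed(chain))
            question = f"What is the {rels} of {start}?"
            return question, chain[-1][2]  # (question, verifiable answer)
    return None
```

Because the answer is derived mechanically from the fact chain, every generated question is verifiable by construction, which is what makes such data usable as an RL reward signal.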
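For (2), a minimal sketch of how task-balanced sampling and task-specific advantage estimation could mitigate reward bias: batches are drawn round-robin over task types rather than over the raw data mix, and each reward is normalized against its own task's group statistics so that tasks with inflated rewards do not dominate the advantage signal. All names and the exact normalization are assumptions of this sketch, not the paper's implementation.

```python
import random
from collections import defaultdict

def task_balanced_batch(pool, batch_size, seed=0):
    """Draw a batch that cycles uniformly over task types."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for ex in pool:
        by_task[ex["task"]].append(ex)
    tasks = sorted(by_task)
    return [rng.choice(by_task[tasks[i % len(tasks)]]) for i in range(batch_size)]

def task_specific_advantages(rewards, tasks, eps=1e-6):
    """Normalize each reward by the mean/std of its own task group."""
    by_task = defaultdict(list)
    for r, t in zip(rewards, tasks):
        by_task[t].append(r)
    stats = {}
    for t, rs in by_task.items():
        mean = sum(rs) / len(rs)
        std = (sum((r - mean) ** 2 for r in rs) / len(rs)) ** 0.5
        stats[t] = (mean, std)
    return [(r - stats[t][0]) / (stats[t][1] + eps) for r, t in zip(rewards, tasks)]
```

The per-task normalization means an easy task where all rollouts succeed contributes near-zero advantage, instead of swamping the gradient with uniformly high rewards.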
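The "adaptive entropy control" in AEPO can be pictured as a feedback controller on the entropy-bonus coefficient: when measured policy entropy falls below a target (collapse toward exploitation), the coefficient rises to restore exploration, and vice versa. The proportional update rule, learning rate, and clipping bounds below are illustrative assumptions, not AEPO's actual schedule.

```python
# Hedged sketch: proportional controller on the entropy gap. The coefficient
# is nudged toward closing (target - entropy) and clipped to [lo, hi].
def adapt_entropy_coef(coef, entropy, target, lr=0.1, lo=0.0, hi=1.0):
    coef = coef + lr * (target - entropy)
    return min(hi, max(lo, coef))
```

In a training loop this would be called once per update with the batch's measured policy entropy, and the returned coefficient would scale the entropy bonus added to the policy loss.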
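Finally, the integration of single-pass reasoning with iterative memory-based processing in (3) can be sketched as a simple dispatch: inputs that fit the context window are answered in one pass, while longer inputs are consumed chunk by chunk with a bounded memory carried forward. `summarize` and `answer` stand in for model calls and are assumptions of this sketch.

```python
# Hedged sketch of the memory-agent dispatch for ultra-long inputs.
def memory_agent(tokens, window, summarize, answer, question):
    if len(tokens) <= window:
        return answer(question, tokens)              # single-pass reasoning
    memory = []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        memory = summarize(memory, chunk, question)  # bounded memory update
    return answer(question, memory)                  # answer from final memory
```

The key property is that per-step input size stays bounded by `window` plus the memory size regardless of total sequence length, which is what lets a fixed-context model handle tasks beyond 4M tokens.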
PDF · December 17, 2025