
Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

October 8, 2025
Authors: Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen
cs.AI

Abstract

We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon, multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and, most importantly, strict context limits. To address these challenges, we introduce summarization-based context management into training. Specifically, it periodically compresses the tool-use history with LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructure to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization-augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function-calling and search tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length than baselines. We also show that, for complex search tasks, SUPO can further improve evaluation performance when the maximum number of summarization rounds at test time is scaled beyond that used in training. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
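
As a concrete illustration of the mechanism, the following is a minimal sketch of the rollout-side context management the abstract describes, not the paper's implementation: `policy`, `tool_env`, `summarize`, `num_tokens`, `TOKEN_BUDGET`, and `MAX_SUMMARY_ROUNDS` are hypothetical stand-ins (a whitespace word count replaces a real tokenizer, and toy strings replace real LLM and tool outputs).

```python
# Hypothetical sketch of summarization-based context management during a
# multi-turn tool-use rollout. All names and values are illustrative
# stand-ins, not the paper's API.

TOKEN_BUDGET = 16        # toy working-context limit
MAX_SUMMARY_ROUNDS = 3   # cap on summarization rounds used during training

def num_tokens(text: str) -> int:
    # Stand-in tokenizer: whitespace word count instead of a real BPE tokenizer.
    return len(text.split())

def policy(context: str) -> str:
    # Stand-in for the LLM policy: emits a tool call given the working context.
    return f"tool_call(query_{num_tokens(context)})"

def tool_env(action: str) -> str:
    # Stand-in tool environment: returns an observation for the tool call.
    return f"observation_for[{action}]"

def summarize(context: str) -> str:
    # Stand-in for the LLM-generated summary that retains task-relevant
    # information from the tool-use history.
    return f"<summary of {num_tokens(context)} tokens of history>"

def rollout(task: str, max_turns: int = 10) -> list[str]:
    context = task
    segments = [context]   # working contexts at each segment boundary
    rounds = 0
    for _ in range(max_turns):
        action = policy(context)
        observation = tool_env(action)
        context = f"{context}\n{action}\n{observation}"
        # When the working context exceeds the budget, compress the tool-use
        # history into a summary and continue from the compact context.
        if num_tokens(context) > TOKEN_BUDGET and rounds < MAX_SUMMARY_ROUNDS:
            context = f"{task}\n{summarize(context)}"
            rounds += 1
            segments.append(context)
    return segments

if __name__ == "__main__":
    for i, segment_start in enumerate(rollout("Find the answer to the user query.")):
        print(f"segment {i} starts with: {segment_start[:60]!r}")
```

Because the summaries are sampled from the same policy, their tokens can receive policy-gradient credit alongside the tool-call tokens, which is what lets standard RL infrastructure optimize both behaviors end to end; raising `MAX_SUMMARY_ROUNDS` at evaluation time corresponds to the test-time scaling experiment mentioned in the abstract.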