

Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

October 8, 2025
Authors: Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen
cs.AI

Abstract

We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon, multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and, most importantly, strict context limits. To address these challenges, we introduce summarization-based context management into training. Specifically, it periodically compresses the tool-use history with LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function-calling and search tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex search tasks, SUPO can further improve evaluation performance when the maximum number of summarization rounds at test time is scaled beyond that used at training time. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
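
To make the summarization-based context management concrete, here is a minimal Python sketch of one plausible reading of the rollout loop: the agent's working context is periodically compressed into an LLM-generated summary once it approaches the window limit. All identifiers (`rollout`, `summarize`, `SUMMARY_TRIGGER`, the `llm`/`tools` interfaces) are hypothetical illustrations, not the paper's actual implementation.

```python
# Sketch of summarization-based context management during an agent rollout.
# Names and interfaces are assumptions for illustration, not the paper's code.

from dataclasses import dataclass, field

MAX_CONTEXT_TOKENS = 8192  # fixed context window of the agent LLM (assumed)
SUMMARY_TRIGGER = 6144     # compress once the working context grows past this


@dataclass
class Context:
    task: str
    messages: list = field(default_factory=list)  # tool-use history in context

    def num_tokens(self) -> int:
        # Stand-in for a real tokenizer count.
        return sum(len(m.split()) for m in [self.task, *self.messages])


def summarize(messages: list, llm) -> str:
    """Ask the LLM to compress the tool-use history into a task-relevant
    summary. Under SUPO, this summary generation is itself part of the
    policy and is optimized end-to-end alongside tool-use behavior."""
    prompt = "Summarize the interaction so far, keeping task-relevant facts:\n"
    return llm.generate(prompt + "\n".join(messages))


def rollout(task: str, llm, tools, max_turns: int = 64):
    ctx = Context(task=task)
    for _ in range(max_turns):
        # Periodically replace the raw history with a compact summary,
        # letting the agent run beyond the fixed context window.
        if ctx.num_tokens() > SUMMARY_TRIGGER:
            ctx.messages = [summarize(ctx.messages, llm)]
        action = llm.generate("\n".join([ctx.task, *ctx.messages]))
        if action.startswith("FINAL:"):
            return action  # agent terminates with a final answer
        observation = tools.call(action)  # execute the tool call
        ctx.messages += [action, observation]
    return None
```

Because the summary replaces the raw history inside the same trajectory, the number of summarization rounds is a free knob at evaluation time, which is what allows test-time scaling beyond the training-time setting.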