Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks
October 14, 2025
Authors: Yuxiang Zhang, Jiangming Shu, Ye Ma, Xueyuan Lin, Shangxi Wu, Jitao Sang
cs.AI
Abstract
Large Language Models face challenges in long-horizon agentic tasks as their
constrained memory is easily overwhelmed by distracting or irrelevant context.
Existing working memory methods typically rely on external, heuristic
mechanisms that are decoupled from the agent's core policy. In this work, we
reframe working memory management as a learnable, intrinsic capability. We
propose a novel framework, Memory-as-Action, where an agent actively manages
its working memory by executing explicit editing operations as part of a
unified policy. This formulation allows an agent, trained via reinforcement
learning, to balance memory curation against long-term task objectives under
given resource constraints. However, such memory editing actions break the
standard assumption of a continuously growing prefix in LLM interactions,
leading to what we call trajectory fractures. These non-prefix changes disrupt
the causal continuity required by standard policy gradient methods, making
those methods inapplicable. To address this, we propose a new algorithm,
Dynamic Context Policy Optimization, which enables stable end-to-end
reinforcement learning by segmenting trajectories at memory action points and
applying trajectory-level advantages to the resulting action segments. Our
results demonstrate that jointly optimizing task reasoning and memory
management in an end-to-end fashion not only reduces overall computational
cost but also improves task performance, driven by adaptive context
curation strategies tailored to the model's intrinsic capabilities.
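The core mechanism described above — cutting a trajectory at each memory-editing action, since a non-prefix context change breaks causal continuity, and then weighting every resulting segment by a single trajectory-level advantage — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; all names (`Step`, `split_at_memory_edits`, `dcpo_loss`) are hypothetical, and real training would compute per-segment log-probabilities under each segment's own (edited) context.

```python
from dataclasses import dataclass

@dataclass
class Step:
    logprob: float         # log-probability of this action under the policy
    is_memory_edit: bool   # True if this action rewrote the working memory

def split_at_memory_edits(trajectory):
    """Cut the trajectory after each memory-edit action: beyond that
    point the context is no longer a continuation of the old prefix,
    so each segment must be treated as its own causally-contiguous run."""
    segments, current = [], []
    for step in trajectory:
        current.append(step)
        if step.is_memory_edit:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

def dcpo_loss(trajectory, advantage):
    """Policy-gradient surrogate in the spirit of the paper: the same
    trajectory-level advantage weights the summed log-probs of each
    segment produced by the split."""
    segments = split_at_memory_edits(trajectory)
    return -advantage * sum(sum(s.logprob for s in seg) for seg in segments)
```

Note that with a single trajectory-level advantage the segment sums collapse to a sum over the whole trajectory; the segmentation matters in practice because, after a memory edit, each segment's log-probabilities are evaluated under its own rewritten context rather than under one ever-growing prefix.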