Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks
October 14, 2025
Authors: Yuxiang Zhang, Jiangming Shu, Ye Ma, Xueyuan Lin, Shangxi Wu, Jitao Sang
cs.AI
Abstract
Large Language Models face challenges in long-horizon agentic tasks as their
constrained memory is easily overwhelmed by distracting or irrelevant context.
Existing working memory methods typically rely on external, heuristic
mechanisms that are decoupled from the agent's core policy. In this work, we
reframe working memory management as a learnable, intrinsic capability. We
propose a novel framework, Memory-as-Action, where an agent actively manages
its working memory by executing explicit editing operations as part of a
unified policy. This formulation allows an agent, trained via reinforcement
learning, to balance memory curation against long-term task objectives under
given resource constraints. However, such memory editing actions break the
standard assumption of a continuously growing prefix in LLM interactions,
leading to what we call trajectory fractures. These non-prefix changes disrupt
the causal continuity required by standard policy gradient methods, making
those methods inapplicable. To address this, we propose a new algorithm,
Dynamic Context Policy Optimization, which enables stable end-to-end
reinforcement learning by segmenting trajectories at memory action points and
applying trajectory-level advantages to the resulting action segments. Our
results demonstrate that jointly optimizing for task reasoning and memory
management in an end-to-end fashion not only reduces overall computational
consumption but also improves task performance, driven by adaptive context
curation strategies tailored to the model's intrinsic capabilities.
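The core mechanical idea above, cutting a trajectory wherever a memory-editing action rewrites the context, then scoring each resulting segment with one trajectory-level advantage, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the `Step` record, the `is_memory_edit` flag, and both function names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical step record: a step is either an ordinary reasoning/tool
# action or a memory-editing action that rewrites the working context.
@dataclass
class Step:
    text: str
    is_memory_edit: bool = False

def segment_at_memory_actions(trajectory):
    """Split a trajectory into contiguous segments, cutting after each
    memory-editing action, since such edits break the growing-prefix
    assumption that standard policy gradients rely on."""
    segments, current = [], []
    for step in trajectory:
        current.append(step)
        if step.is_memory_edit:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

def assign_advantages(segments, trajectory_advantage):
    """Broadcast a single trajectory-level advantage to every action
    segment, mirroring the segment-level credit assignment described
    for Dynamic Context Policy Optimization (sketch only)."""
    return [(seg, trajectory_advantage) for seg in segments]
```

Each segment then has a causally consistent context of its own, so a standard policy-gradient update can be applied per segment even though the full trajectory is no longer a single growing prefix.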