Metacognitive Memory Policy Optimization for Long-Horizon LLM Agents

Abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies with outcome-based reinforcement learning, which fails to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but also on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision by explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
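
The abstract does not state how Belief Entropy is computed or how the penalty enters the training objective. As a rough illustration only, the sketch below assumes that the belief is a discrete probability distribution over a handful of candidate task states, that Belief Entropy is its Shannon entropy, and that each intermediate summary receives the trajectory's outcome reward minus a linear entropy penalty. The names `belief_entropy`, `shaped_rewards`, and `penalty_weight` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def belief_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete belief over candidate task states.

    Assumption: the belief is a finite distribution; the paper may define it differently.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("belief must have positive total mass")
    entropy = 0.0
    for p in probs:
        q = p / total  # normalize so the belief sums to 1
        if q > 0:  # zero-probability states contribute nothing
            entropy -= q * math.log(q)
    return entropy


def shaped_rewards(
    outcome_reward: float,
    step_beliefs: Sequence[Sequence[float]],
    penalty_weight: float = 0.1,
) -> List[float]:
    """Per-summary reward: shared outcome reward minus an entropy penalty.

    Hypothetical shaping rule for illustration, not the paper's objective.
    """
    return [
        outcome_reward - penalty_weight * belief_entropy(belief)
        for belief in step_beliefs
    ]


if __name__ == "__main__":
    # One sharp belief (a clear summary) and one uniform belief (an ambiguous summary).
    beliefs = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
    for reward in shaped_rewards(1.0, beliefs):
        print(round(reward, 3))
```

Under these assumptions, two summaries from the same successful trajectory receive different rewards: the one that leaves the model with a near-uniform belief is penalized more than the one that leaves a sharply peaked belief. This is the kind of memory-specific signal that a purely outcome-based reward cannot provide.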