
Useful Memories Become Faulty When Continuously Updated by LLMs

May 13, 2026
Authors: Dylan Zhang, Yanshan Lin, Zhengkun Wu, Yihang Sun, Bingxuan Li, Dianqi Li, Hao Peng
cs.AI

Abstract

Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.
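The recipe the abstract recommends — treat raw episodes as first-class evidence, expose Retain/Delete/Consolidate as explicit actions, and gate consolidation rather than firing it after every interaction — can be sketched as a small data structure. This is a hypothetical illustration, not the paper's implementation: the names `retain`, `delete`, `consolidate`, and `gate_every` are assumptions, and the LLM rewrite step is replaced by a placeholder summary.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    task_id: str
    trajectory: str  # raw trace of what happened, kept verbatim


@dataclass
class MemoryBank:
    """Toy agent memory contrasting episodic retention with gated consolidation.

    Hypothetical sketch: raw episodes are never overwritten, and
    consolidation runs only once every `gate_every` episodes instead of
    after every interaction.
    """
    episodes: list[Episode] = field(default_factory=list)
    lessons: list[str] = field(default_factory=list)
    gate_every: int = 5
    _since_last: int = 0

    def retain(self, ep: Episode) -> None:
        # Raw episodes are first-class evidence: always appended, never rewritten.
        self.episodes.append(ep)
        self._since_last += 1
        if self._since_last >= self.gate_every:
            self.consolidate()

    def delete(self, task_id: str) -> None:
        # Explicit episodic management: drop traces for one task on request.
        self.episodes = [e for e in self.episodes if e.task_id != task_id]

    def consolidate(self) -> None:
        # Placeholder for an LLM rewrite: distill the recent episodes into a
        # reusable lesson WITHOUT deleting the underlying trajectories.
        recent = self.episodes[-self._since_last:]
        self.lessons.append(f"lesson distilled from {len(recent)} episodes")
        self._since_last = 0
```

With `gate_every=2`, retaining four episodes triggers consolidation twice, yet all four raw trajectories remain available as evidence — the property the abstract argues per-interaction consolidators lose.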