Useful Memories Become Faulty When Continuously Updated by LLMs
May 13, 2026
Authors: Dylan Zhang, Yanshan Lin, Zhengkun Wu, Yihang Sun, Bingxuan Li, Dianqi Li, Hao Peng
cs.AI
Abstract
Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.
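The memory regime the abstract advocates — raw episodes as first-class evidence, with consolidation as an explicit, gated action rather than an automatic step after every interaction — can be illustrated with a minimal sketch. This is not the authors' implementation; the `Episode`, `MemoryBank`, and `summarize` names are hypothetical, and the summarizer stands in for an LLM call:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """A raw trajectory: the task identifier and the full interaction trace."""
    task_id: str
    trace: str

@dataclass
class MemoryBank:
    """Episodic-first memory: Retain and Delete manage raw episodes;
    Consolidate is a separate, explicitly invoked action that distills
    lessons without overwriting the episodes it draws on."""
    episodes: list = field(default_factory=list)
    lessons: list = field(default_factory=list)  # consolidated abstractions

    def retain(self, episode: Episode) -> None:
        # Default action: keep the raw trajectory as evidence.
        self.episodes.append(episode)

    def delete(self, task_id: str) -> None:
        # Episodic management: prune episodes judged no longer useful.
        self.episodes = [e for e in self.episodes if e.task_id != task_id]

    def consolidate(self, summarize) -> None:
        # Gated consolidation: only runs when called explicitly, and it
        # appends a distilled lesson rather than rewriting the episodes.
        if self.episodes:
            self.lessons.append(summarize(self.episodes))

bank = MemoryBank()
bank.retain(Episode("arc-001", "grid in -> rotate 90 degrees -> grid out"))
bank.retain(Episode("arc-002", "grid in -> mirror horizontally -> grid out"))
bank.consolidate(lambda eps: f"{len(eps)} episodes suggest geometric transforms")
```

The key design point, under the paper's findings, is that `consolidate` is additive and caller-controlled: the raw episodes survive every consolidation, so faulty abstractions cannot destroy the evidence needed to recover from them.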