Are We Ready For An Agent-Native Memory System?
June 23, 2026
Authors: Wei Zhou, Xuanhe Zhou, Shaokun Han, Hongming Xu, Guoliang Li, Zhiyu Li, Feiyu Xiong, Fan Wu
cs.AI
Abstract
Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify each module's individual effect on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing that localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions toward building truly agent-native memory systems. The code is publicly available at https://github.com/OpenDataBox/MemoryData.
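To make the four-module decomposition concrete, the following is a minimal sketch of how the modules might fit together in a toy memory. It is not the paper's implementation and does not reflect the API of the MemoryData repository: `MemoryRecord`, `ToyAgentMemory`, the "key is value" extraction rule, and the word-overlap ranking are all hypothetical choices made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    """One stored fact (hypothetical record layout)."""
    key: str
    value: str
    version: int = 1


@dataclass
class ToyAgentMemory:
    # Representation and storage: a flat key -> record map (assumed design).
    store: dict[str, MemoryRecord] = field(default_factory=dict)

    def extract(self, utterance: str) -> list[tuple[str, str]]:
        """Extraction: pull 'key is value' facts out of raw text."""
        facts = []
        for sentence in utterance.split("."):
            if " is " in sentence:
                key, value = sentence.split(" is ", 1)
                facts.append((key.strip().lower(), value.strip()))
        return facts

    def maintain(self, key: str, value: str) -> None:
        """Maintenance: a localized update that touches one record only."""
        old = self.store.get(key)
        if old is None:
            self.store[key] = MemoryRecord(key, value)
        elif old.value != value:
            self.store[key] = MemoryRecord(key, value, old.version + 1)

    def write(self, utterance: str) -> None:
        """Run extraction, then maintenance, for each extracted fact."""
        for key, value in self.extract(utterance):
            self.maintain(key, value)

    def retrieve(self, query: str, k: int = 1) -> list[MemoryRecord]:
        """Retrieval and routing: rank records by word overlap with the query."""
        words = set(query.lower().replace("?", "").split())
        scored = [(len(words & set(r.key.split())), r) for r in self.store.values()]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [record for _, record in scored[:k]]


memory = ToyAgentMemory()
memory.write("The deploy region is us-east. The owner is Ana.")
memory.write("The deploy region is eu-west.")  # a dynamic knowledge update
top = memory.retrieve("Which deploy region?")[0]
print(top.value, top.version)
```

In this sketch, `maintain` rewrites only the record whose value changed, which is one simple reading of "localized maintenance"; a global reorganization would instead rebuild or re-index the whole store on each update.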