

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

February 22, 2026
作者: Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li
cs.AI

Abstract

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. We then analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting memory structures to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.