Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Abstract

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. We then analyze the key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.
