Are We Ready for Agent-Native Memory Systems?

Abstract

Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify the individual effect of each module on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing that localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions toward building truly agent-native memory systems. The code is publicly available at https://github.com/OpenDataBox/MemoryData.
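
To make the four-module decomposition concrete, the following is a minimal illustrative sketch and not the implementation from the paper or its repository. All names (`MemoryRecord`, `ToyAgentMemory`, the `key: value` observation format, and the word-overlap scoring) are assumptions introduced here purely for illustration. It shows one way the modules could be separated so that each can be swapped or ablated independently, and how a localized maintenance step touches only the affected record.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    key: str          # hypothetical fact identifier, e.g. "city"
    text: str         # stored content
    version: int = 1  # bumped on every localized update


@dataclass
class ToyAgentMemory:
    # Representation and storage: a flat mapping from fact id to record.
    records: dict[str, MemoryRecord] = field(default_factory=dict)

    def extract(self, observation: str) -> list[tuple[str, str]]:
        """Extraction: parse 'key: value' lines into (key, text) candidates."""
        facts = []
        for line in observation.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                facts.append((key.strip(), value.strip()))
        return facts

    def maintain(self, key: str, text: str) -> None:
        """Maintenance: localized update that rewrites only one record."""
        old = self.records.get(key)
        version = old.version + 1 if old else 1
        self.records[key] = MemoryRecord(key, text, version)

    def write(self, observation: str) -> None:
        """Pipeline entry point: extract facts, then apply localized updates."""
        for key, text in self.extract(observation):
            self.maintain(key, text)

    def retrieve(self, query: str, k: int = 1) -> list[MemoryRecord]:
        """Retrieval and routing: rank records by word overlap with the query."""
        words = set(query.lower().split())

        def score(record: MemoryRecord) -> int:
            return len(words & set(f"{record.key} {record.text}".lower().split()))

        return sorted(self.records.values(), key=score, reverse=True)[:k]


memory = ToyAgentMemory()
memory.write("city: Paris\nlanguage: Python")
memory.write("city: Berlin")  # knowledge update: only the "city" record changes
print(memory.retrieve("which city", k=1))
```

In this toy setup, the second `write` call overwrites the `city` record and leaves `language` untouched, which is the kind of localized maintenance the abstract contrasts with global reorganization. A real system would replace each method with a far richer component (for example, LLM-based extraction or embedding-based retrieval).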