Are We Ready for Agent-Native Memory Systems?

Abstract

Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU) and treat the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify the individual effect of each module on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing that localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems. The code is publicly available at https://github.com/OpenDataBox/MemoryData.