Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

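Abstract

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. We then analyze the key pain points limiting current systems: benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.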


