MEME: Multi-entity & Evolving Memory Evaluation

Abstract

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. Prior benchmarks evaluate only single-entity updates. MEME defines six tasks that cover the full space spanned by the multi-entity and evolving axes, including three that prior work does not score: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). We evaluate six memory systems from three memory paradigms on 100 controlled episodes and find that, under the default configuration, every system collapses on dependency reasoning (average accuracy: 3% on Cascade, 1% on Absence) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes it, at roughly 70x the baseline cost, which indicates that closing the gap currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
