MEME: Multi-entity & Evolving Memory Evaluation
May 12, 2026
Authors: Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh
cs.AI
Abstract
LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (average accuracy: Cascade 3%, Absence 1%) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes it, but at roughly 70x the baseline cost, indicating that closure currently depends on configurations that are impractical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.