
MEME: Multi-entity & Evolving Memory Evaluation

May 12, 2026
Authors: Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh
cs.AI

Abstract

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs all fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes it, but at ~70x the baseline cost, indicating that closing the gap currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.