MEME: Multi-entity & Evolving Memory Evaluation

Abstract

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks that cover the full space spanned by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems across three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (average accuracy: Cascade 3%, Absence 1%) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating that closing it currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
