MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Abstract

Real-world agents operate over long, evolving horizons in which information is repeatedly updated and may interfere across memories. Such settings demand both accurate recall and aggregated reasoning over multiple pieces of information. Existing benchmarks, however, focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts whose frequently updated information induces substantial interference; (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization; and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks, which require retrieving a specific target from a long context, and (ii) multi-target aggregation tasks, which require reasoning over multiple relevant pieces of information. Overall, MINTEval comprises 15.6k question-answer pairs over long-horizon contexts that average 138.8k tokens and extend up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (27.9% average accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is limited primarily by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, and accuracy degrades as the number of intervening updates increases.
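To make the two question types concrete, the following is a minimal sketch on a toy update stream. Everything in it is an illustrative assumption: the event format, the entity names, and the helper functions `replay`, `single_target_recall`, `multi_target_aggregate`, and `intervening_updates` are hypothetical and are not taken from MINTEval's data or code. It only shows how a fact can be overwritten by later context, how a single-target question differs from a multi-target one, and what "number of intervening updates" could mean for one fact.

```python
from collections import defaultdict

# Hypothetical update stream: (step, entity, value). Not MINTEval data.
events = [
    (0, "alice", "paris"),
    (1, "bob", "rome"),
    (2, "alice", "berlin"),  # overwrites alice's earlier value
    (3, "carol", "rome"),
    (4, "alice", "rome"),    # second overwrite of alice's value
]


def replay(events):
    """Build the per-entity value history, ordered by step."""
    history = defaultdict(list)
    for step, entity, value in sorted(events):
        history[entity].append((step, value))
    return history


def single_target_recall(history, entity, version=-1):
    """Return one entity's value at a given version (default: latest)."""
    return history[entity][version][1]


def multi_target_aggregate(history, value):
    """Return all entities whose latest value equals `value`."""
    return sorted(e for e, h in history.items() if h[-1][1] == value)


def intervening_updates(history, entity, version=0):
    """Count later updates to the same entity after the given version."""
    return len(history[entity]) - 1 - version


history = replay(events)
print(single_target_recall(history, "alice"))     # latest value for one target
print(single_target_recall(history, "alice", 0))  # earliest, since-revised value
print(multi_target_aggregate(history, "rome"))    # aggregation over several targets
print(intervening_updates(history, "alice"))      # updates after the first fact
```

In this toy setting, answering the aggregation question requires the latest state of every entity at once, whereas the recall question needs only one history. Asking for an early version of a frequently updated entity is the kind of case where later context can interfere with the original fact.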