MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Abstract

Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, MINTEval comprises 15.6k question-answer pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (27.9% average accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, with accuracy degrading as the number of intervening updates increases.
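
To make the two question types concrete, the following toy sketch models a context as a stream of state updates and computes the reference answers that a single-target recall question and a multi-target aggregation question would expect. It is an illustration only, not the benchmark's data format or scoring code: the `Update` record, the helper names, and the example stream are all hypothetical, and the stream is assumed to be ordered by `step`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """One state change in a toy context stream (hypothetical schema)."""
    step: int   # position of the update in the context
    key: str    # entity whose state changes
    value: int  # value of the entity after this update


def latest_value(stream, key):
    """Single-target recall: most recent value written for one key."""
    value = None
    for u in stream:
        if u.key == key:
            value = u.value
    return value


def value_at(stream, key, step):
    """Recall of an earlier fact: value of `key` as of `step` (inclusive)."""
    value = None
    for u in stream:
        if u.step > step:
            break  # stream is assumed sorted by step
        if u.key == key:
            value = u.value
    return value


def aggregate_latest(stream, keys):
    """Multi-target aggregation: sum of the latest values of several keys."""
    return sum(latest_value(stream, k) for k in keys)


def intervening_updates(stream, key, step):
    """Count later updates to `key` that overwrite the fact held at `step`."""
    return sum(1 for u in stream if u.key == key and u.step > step)


# Hypothetical example stream: two entities, each revised over time.
stream = [
    Update(0, "alice", 3),
    Update(1, "bob", 5),
    Update(2, "alice", 7),
    Update(3, "bob", 2),
    Update(4, "alice", 4),
]

print(latest_value(stream, "alice"))              # single-target recall
print(value_at(stream, "alice", 1))               # earlier, since-revised fact
print(aggregate_latest(stream, ["alice", "bob"]))  # multi-target aggregation
print(intervening_updates(stream, "alice", 0))     # updates after step 0
```

In this toy setting, a question about an early value of `alice` has two later revisions standing between the fact and the end of the context, which is the kind of interference the abstract reports as degrading accuracy.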