MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agentic Systems

Abstract

Real-world agents operate over long, evolving horizons in which information is repeatedly updated and memories can interfere with one another, so they must both recall individual facts accurately and reason over multiple pieces of information in aggregate. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, MINTEval contains 15.6k question-answer pairs over long-horizon contexts that average 138.8k tokens and extend up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (27.9% average accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, and accuracy degrades as the number of intervening updates increases.