Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
July 7, 2025
Authors: Yuanzhe Hu, Yu Wang, Julian McAuley
cs.AI
Abstract
Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (how agents memorize, update, and retrieve long-term information), remains under-evaluated for lack of suitable benchmarks. We term agents with memory mechanisms memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored to static, long-context settings such as book-based QA, neither of which reflects the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmark covers all four competencies. We therefore introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. It combines reformulated existing datasets with newly constructed ones to cover all four memory competencies, providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
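
To make the incremental, multi-turn evaluation protocol concrete, below is a minimal Python sketch of how such a loop might look. The `MemoryAgent` interface, the `BufferAgent` baseline, and `evaluate_incremental` are illustrative assumptions for exposition, not MemoryAgentBench's actual API: the key point is that information is fed to the agent chunk by chunk, as in a multi-turn interaction, rather than as one static long context, with questions posed only afterwards.

```python
from dataclasses import dataclass, field
from typing import Protocol


class MemoryAgent(Protocol):
    """Any agent that ingests conversation chunks and answers queries.

    Hypothetical interface for illustration; not the benchmark's real API.
    """

    def ingest(self, chunk: str) -> None: ...
    def answer(self, question: str) -> str: ...


@dataclass
class BufferAgent:
    """Trivial baseline: keep the entire history in context (no real memory)."""

    history: list[str] = field(default_factory=list)

    def ingest(self, chunk: str) -> None:
        self.history.append(chunk)

    def answer(self, question: str) -> str:
        # A real agent would prompt an LLM with the accumulated history here;
        # this stub just reports how much context it is conditioning on.
        return f"<answer conditioned on {sum(map(len, self.history))} chars>"


def evaluate_incremental(agent: MemoryAgent,
                         chunks: list[str],
                         qa_pairs: list[tuple[str, str]]) -> float:
    """Feed the dialogue chunk by chunk, then ask questions; return accuracy."""
    for chunk in chunks:      # information arrives turn by turn,
        agent.ingest(chunk)   # never as one static long context
    correct = sum(gold.lower() in agent.answer(question).lower()
                  for question, gold in qa_pairs)
    return correct / len(qa_pairs)
```

Under this protocol, a RAG system would implement `ingest` by indexing each chunk into its retriever, while an agent with an external memory module would update that store instead; the driver loop stays the same, which is what makes the four competencies comparable across architectures.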