Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

July 7, 2025
Authors: Yuanzhe Hu, Yu Wang, Julian McAuley
cs.AI

Abstract

Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (how agents memorize, update, and retrieve long-term information), is under-evaluated due to the lack of benchmarks. We refer to agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored to static, long-context settings such as book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that accumulate information incrementally. Furthermore, no existing benchmark covers all four competencies. We therefore introduce MemoryAgentBench, a new benchmark designed specifically for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covering all four memory competencies and providing a systematic, challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
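To make the evaluation setting concrete, below is a minimal Python sketch of the incremental multi-turn protocol the abstract describes: the agent writes each interaction turn into memory as it arrives and answers queries by retrieval, rather than receiving one static long context. All names here (`MemoryAgent`, `memorize`, `retrieve`, `evaluate`) are illustrative assumptions, not the MemoryAgentBench API, and the keyword-overlap retriever is a stand-in for a real RAG or external-memory module.

```python
# Hypothetical sketch of the incremental multi-turn memory setting;
# not the MemoryAgentBench implementation.
from dataclasses import dataclass, field


@dataclass
class MemoryAgent:
    """A bare-bones 'memory agent': stores turns, retrieves by keyword overlap."""
    memory: list[str] = field(default_factory=list)

    def memorize(self, turn: str) -> None:
        # Incremental update: each turn is written to memory as it arrives,
        # instead of being presented once as a single long context.
        self.memory.append(turn)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Crude relevance score: number of words shared with the query.
        words = set(query.lower().split())
        ranked = sorted(self.memory,
                        key=lambda t: len(words & set(t.lower().split())),
                        reverse=True)
        return ranked[:k]


def evaluate(agent: MemoryAgent, turns: list[str], question: str) -> list[str]:
    """Feed the dialogue one turn at a time, then query memory -- the
    interactive setting the paper contrasts with static long-context QA."""
    for turn in turns:
        agent.memorize(turn)
    return agent.retrieve(question)


if __name__ == "__main__":
    agent = MemoryAgent()
    turns = [
        "User: my favorite color is blue.",
        "User: I moved to Berlin last year.",
        "User: actually, my favorite color is green now.",  # stale vs. updated fact
    ]
    print(evaluate(agent, turns, "What is the user's favorite color?"))
```

In a benchmark-style scorer, the last turn is where a naive retriever fails: both color statements score equally, so probing whether the agent prefers the updated fact over the stale one exercises the conflict-resolution competency, alongside accurate retrieval, test-time learning, and long-range understanding.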