Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Abstract

Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory (how agents memorize, update, and retrieve long-term information), remains under-evaluated due to a lack of benchmarks. We refer to agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored to static, long-context settings such as book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that accumulate information incrementally. Furthermore, no existing benchmark covers all four competencies. We therefore introduce MemoryAgentBench, a new benchmark designed specifically for memory agents. It combines reformulated existing datasets with newly constructed ones, covers all four memory competencies, and provides a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results show that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
