LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Abstract

Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. This paper introduces LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems: commercial chat assistants and long-context LLMs show a 30% accuracy drop when memorizing information across sustained interactions. We then present a unified framework that breaks down long-term memory design into four design choices across the indexing, retrieval, and reading stages. Building on key experimental insights, we propose several memory designs, including session decomposition for optimizing value granularity, fact-augmented key expansion for enhancing the index structure, and time-aware query expansion for refining the search scope. Experimental results show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.
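To make the three named designs concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a key-value memory in which each conversation round is stored as its own value (session decomposition), each key is the round text plus a fact string (fact-augmented key expansion), and retrieval is restricted to a date range before ranking (time-aware query expansion). The names MemoryItem, index_session, and retrieve are hypothetical, the facts are written by hand rather than extracted by a model, the date range is passed in explicitly rather than inferred from the query, and plain word overlap stands in for a real retriever.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MemoryItem:
    key: str    # text used for matching: round text plus an attached fact
    value: str  # text that would be handed to the reader model
    day: date   # date of the session the round came from


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def index_session(rounds, facts, day):
    """Store each round as its own item (session decomposition) and
    append a hand-written fact to its key (key expansion)."""
    return [MemoryItem(key=f"{r} {f}", value=r, day=day)
            for r, f in zip(rounds, facts)]


def retrieve(memory, query, start=None, end=None, k=1):
    """Keep items inside the date range, then rank by word overlap
    between the query and each item's key."""
    q = tokenize(query)
    candidates = [m for m in memory
                  if (start is None or m.day >= start)
                  and (end is None or m.day <= end)]
    ranked = sorted(candidates,
                    key=lambda m: len(q & tokenize(m.key)),
                    reverse=True)
    return ranked[:k]


# Hypothetical chat history spread over two sessions.
memory = []
memory += index_session(
    ["I just adopted a puppy named Biscuit.", "Any tips for house training?"],
    ["user has a dog named Biscuit", "user wants dog training advice"],
    date(2024, 3, 1),
)
memory += index_session(
    ["I moved to Denver last week."],
    ["user lives in Denver"],
    date(2024, 5, 10),
)

top = retrieve(memory, "What is my dog named?",
               start=date(2024, 1, 1), end=date(2024, 3, 31))
print(top[0].value)
```

In this toy history, the round text alone never mentions "dog"; the attached fact is what lets the query match on that word, which is the intuition behind expanding keys with facts. The date filter removes the May session before ranking, so it cannot be returned for a question scoped to the first quarter.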
