Evaluating Very Long-Term Conversational Memory of LLM Agents

Abstract

Existing work on long-term open-domain dialogue focuses on evaluating model responses within contexts spanning no more than five chat sessions. Despite advances in long-context large language models (LLMs) and retrieval-augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline that generates high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding the dialogues in personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding in the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on average, over up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs struggle to understand lengthy conversations and to comprehend long-range temporal and causal dynamics within dialogues. Strategies such as long-context LLMs or RAG can offer improvements, but these models still lag substantially behind human performance.
