Evaluating Very Long-Term Conversational Memory of LLM Agents
February 27, 2024
Authors: Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang
cs.AI
Abstract
Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context large language models (LLMs) and retrieval-augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on average, over up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements, but these models still substantially lag behind human performance.
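The abstract contrasts feeding the full conversation to a long-context LLM with RAG, which retrieves only the most relevant past turns as memory. As a rough illustration of the retrieval step only (not the paper's implementation, which would use a learned embedding model), here is a minimal sketch that ranks past dialogue turns against a question by bag-of-words cosine similarity; the session texts and names below are invented for the example:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real RAG system would use a learned
    # embedding model rather than raw token counts.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(history, question, k=2):
    # Rank past dialogue turns by similarity to the question
    # and return the top-k as retrieved "memory".
    q = embed(question)
    ranked = sorted(history, key=lambda turn: cosine(embed(turn), q), reverse=True)
    return ranked[:k]

# Hypothetical multi-session dialogue history.
history = [
    "Session 1: Joanna adopted a rescue dog named Biscuit.",
    "Session 7: Nate started a pottery class downtown.",
    "Session 20: Joanna said Biscuit finally learned to fetch.",
]
memory = retrieve(history, "Did Joanna teach her dog to fetch?", k=2)
# The retrieved turns would then be prepended to the LLM prompt as context.
```

Under this setup, the two Joanna-related turns outrank the unrelated one, which is the behavior a RAG pipeline relies on to answer questions about events from many sessions ago without passing the entire 9K-token conversation to the model.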