Evaluating Very Long-Term Conversational Memory of LLM Agents
February 27, 2024
Authors: Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, Yuwei Fang
cs.AI
Abstract
Existing work on long-term open-domain dialogue focuses on evaluating model
responses within contexts spanning no more than five chat sessions. Despite
advancements in long-context large language models (LLMs) and retrieval
augmented generation (RAG) techniques, their efficacy in very long-term
dialogues remains unexplored. To address this research gap, we introduce a
machine-human pipeline to generate high-quality, very long-term dialogues by
leveraging LLM-based agent architectures and grounding their dialogues on
personas and temporal event graphs. Moreover, we equip each agent with the
capability of sharing and reacting to images. The generated conversations are
verified and edited by human annotators for long-range consistency and
grounding to the event graphs. Using this pipeline, we collect LoCoMo, a
dataset of very long-term conversations, each encompassing 300 turns and 9K
tokens on average, across up to 35 sessions. Based on LoCoMo, we present a
comprehensive evaluation benchmark to measure long-term memory in models,
encompassing question answering, event summarization, and multi-modal dialogue
generation tasks. Our experimental results indicate that LLMs struggle to
understand lengthy conversations and to comprehend long-range temporal and
causal dynamics within dialogues. Employing strategies like long-context
LLMs or RAG can offer improvements, but these models still lag substantially
behind human performance.
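The abstract's takeaway is that even retrieval-augmented generation only partially closes the gap to human performance on very long conversations. As a rough sketch of the general idea, the snippet below retrieves the few most relevant past turns by similarity instead of feeding a model the full 300-turn history; the Turn, embed, and retrieve names and the toy bag-of-words scoring are illustrative assumptions, not the paper's implementation.

```python
# A minimal, illustrative sketch of retrieval-augmented memory over a long
# multi-session conversation. Everything here (Turn, embed, retrieve) is an
# assumption for illustration, not LoCoMo's actual pipeline; a real system
# would use a dense neural encoder rather than this toy bag-of-words match.
import math
import re
from collections import Counter
from dataclasses import dataclass

@dataclass
class Turn:
    session: int   # which chat session the turn came from
    speaker: str
    text: str

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase bag of words; stand-in for a real encoder.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(history: list[Turn], query: str, top_k: int = 2) -> list[Turn]:
    # Rank every past turn against the query; keep only the top_k matches,
    # which would be prepended to the LLM prompt in place of the full history.
    q = embed(query)
    ranked = sorted(history, key=lambda t: cosine(embed(t.text), q), reverse=True)
    return ranked[:top_k]

history = [
    Turn(1, "Ana", "I adopted a puppy named Miso last spring."),
    Turn(4, "Ben", "My marathon training is finally paying off."),
    Turn(9, "Ana", "Miso just graduated from obedience school!"),
]
for turn in retrieve(history, "Tell me about Miso the puppy"):
    print(f"[session {turn.session}] {turn.speaker}: {turn.text}")
```

In this toy setup, both of Ana's turns about Miso outrank the unrelated marathon turn, which is the behavior a retriever needs for the long-range question answering the benchmark evaluates.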