LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Abstract

Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. This paper introduces LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems: commercial chat assistants and long-context LLMs show a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down long-term memory design into four design choices across the indexing, retrieval, and reading stages. Building on key experimental insights, we propose several memory designs, including session decomposition for optimizing value granularity, fact-augmented key expansion for enhancing the index structure, and time-aware query expansion for refining the search scope. Experimental results show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.
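
The abstract names the three memory designs but does not say how they are implemented. As a rough illustration only, the sketch below assumes that session decomposition means storing one memory item per chat round, that fact-augmented key expansion means appending extracted facts to the searchable key, and that time-aware query expansion means restricting the search to a date range inferred from the query. The names `MemoryItem`, `index_session`, `retrieve`, and `toy_fact_extractor` are hypothetical, word overlap stands in for a real retriever, and the reading stage (an LLM answering from the retrieved items) is omitted.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MemoryItem:
    key: str          # text that is searched at retrieval time
    value: str        # text handed to the reader (here: one chat round)
    timestamp: date   # date of the session the round came from


def tokenize(text: str) -> set[str]:
    """Lowercased bag of words; a stand-in for a real embedding model."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


def index_session(rounds, timestamp, extract_facts):
    """Indexing (assumed form): one item per round, with extracted
    facts appended to the key so they can be matched at query time."""
    items = []
    for text in rounds:
        facts = " ".join(extract_facts(text))
        key = f"{text} {facts}".strip()
        items.append(MemoryItem(key=key, value=text, timestamp=timestamp))
    return items


def retrieve(memory, query, time_range=None, top_k=2):
    """Retrieval (assumed form): optionally keep only items inside a
    date range, then rank by word overlap between query and key."""
    if time_range is not None:
        start, end = time_range
        memory = [m for m in memory if start <= m.timestamp <= end]
    q = tokenize(query)
    ranked = sorted(memory, key=lambda m: len(q & tokenize(m.key)), reverse=True)
    return ranked[:top_k]


def toy_fact_extractor(text):
    # Hypothetical rule; a real system would likely prompt an LLM instead.
    return ["user owns a dog"] if "puppy" in text.lower() else []


memory = []
memory += index_session(
    ["I adopted a puppy named Mochi.", "Any tips for sourdough?"],
    date(2023, 3, 1),
    toy_fact_extractor,
)
memory += index_session(
    ["I moved to Lisbon last week."], date(2023, 6, 10), toy_fact_extractor
)

hits = retrieve(
    memory,
    "What dog does the user own?",
    time_range=(date(2023, 1, 1), date(2023, 4, 30)),
    top_k=1,
)
print(hits[0].value)
```

In this toy run the query shares no words with the original round about the puppy; the match comes from the fact appended to the key, and the date range excludes the later session before ranking.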

