HaluMem: Evaluating Hallucinations in Memory Systems of Agents
November 5, 2025
Authors: Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, Zhiyu Li
cs.AI
Abstract
Memory systems are key components that enable AI systems such as large language models (LLMs) and AI
agents to achieve long-term learning and sustained interaction. However, during
memory storage and retrieval, these systems frequently exhibit memory
hallucinations, including fabrication, errors, conflicts, and omissions.
Existing evaluations of memory hallucinations rely primarily on end-to-end question
answering, which makes it difficult to localize the operational stage within
the memory system where hallucinations arise. To address this, we introduce the
Hallucination in Memory Benchmark (HaluMem), the first operation-level
hallucination evaluation benchmark tailored to memory systems. HaluMem defines
three evaluation tasks (memory extraction, memory updating, and memory question
answering) to comprehensively reveal hallucination behaviors across different
operational stages of interaction. To support evaluation, we construct
user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and
HaluMem-Long. Both include about 15k memory points and 3.5k multi-type
questions. The average dialogue length per user reaches 1.5k and 2.6k turns, respectively,
with context lengths exceeding 1M tokens, enabling evaluation of hallucinations
across different context scales and task complexities. Empirical studies based
on HaluMem show that existing memory systems tend to generate and accumulate
hallucinations during the extraction and updating stages, which subsequently
propagate errors to the question answering stage. Future research should focus
on developing interpretable and constrained memory operation mechanisms that
systematically suppress hallucinations and improve memory reliability.
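To make the operation-level framing concrete, the sketch below shows one way a memory-extraction stage could be scored against gold memory points, so that fabrications and omissions are localized at extraction rather than surfacing only in downstream question answering. The function name, exact-match criterion, and example memory points are illustrative assumptions, not the benchmark's actual metric.

```python
# Hypothetical operation-level scoring for a memory extraction task.
# Extracted memory points are compared against gold points so that
# fabrications (hallucinated points) and omissions (missed points)
# are attributed to the extraction stage itself.
# Exact string matching is a simplifying assumption for illustration.

def score_extraction(extracted: set[str], gold: set[str]) -> dict[str, float]:
    """Return precision/recall plus fabrication and omission counts."""
    correct = extracted & gold
    fabricated = extracted - gold   # points not grounded in the dialogue
    omitted = gold - extracted      # points the system failed to store
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "fabricated": len(fabricated),
        "omitted": len(omitted),
    }

# Toy example with invented memory points:
gold = {"user likes jazz", "user lives in Berlin"}
extracted = {"user likes jazz", "user owns a dog"}
print(score_extraction(extracted, gold))
# → {'precision': 0.5, 'recall': 0.5, 'fabricated': 1, 'omitted': 1}
```

Analogous per-stage checks for memory updating (did a new fact correctly overwrite a stale one?) and question answering would let error accumulation across stages be measured directly rather than inferred end to end.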