HaluMem: Evaluating Hallucinations in Memory Systems of Agents
November 5, 2025
Authors: Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, Zhiyu Li
cs.AI
Abstract
Memory systems are key components that enable LLMs and AI agents to achieve
long-term learning and sustained interaction. However, during
memory storage and retrieval, these systems frequently exhibit memory
hallucinations, including fabrication, errors, conflicts, and omissions.
Existing evaluations of memory hallucinations rely primarily on end-to-end
question answering, which makes it difficult to localize the operational stage
within the memory system where hallucinations arise. To address this, we
introduce the Hallucination in Memory Benchmark (HaluMem), the first
operation-level
hallucination evaluation benchmark tailored to memory systems. HaluMem defines
three evaluation tasks (memory extraction, memory updating, and memory question
answering) to comprehensively reveal hallucination behaviors across different
operational stages of interaction. To support evaluation, we construct
user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and
HaluMem-Long. Both include about 15k memory points and 3.5k multi-type
questions. The average dialogue length per user reaches 1.5k and 2.6k turns,
respectively, with context lengths exceeding 1M tokens, enabling evaluation of
hallucinations
across different context scales and task complexities. Empirical studies based
on HaluMem show that existing memory systems tend to generate and accumulate
hallucinations during the extraction and updating stages, which subsequently
propagate errors to the question answering stage. Future research should focus
on developing interpretable and constrained memory operation mechanisms that
systematically suppress hallucinations and improve memory reliability.
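To make the operation-level framing concrete, the sketch below scores a memory-extraction step by comparing a system's extracted memory points against gold memory points, classifying each point as correct, fabricated (hallucinated), or omitted. This is a minimal illustration assuming exact string matching; the function name, the matching criterion, and the example data are illustrative assumptions, not HaluMem's actual evaluation protocol.

```python
# Hypothetical operation-level scoring for a memory-extraction task.
# Extracted points absent from the gold set count as fabrications;
# gold points absent from the extraction count as omissions.

def score_extraction(extracted, gold):
    """Classify memory points and compute precision/recall (exact match)."""
    extracted_set, gold_set = set(extracted), set(gold)
    correct = extracted_set & gold_set      # points present in both
    fabricated = extracted_set - gold_set   # hallucinated points
    omitted = gold_set - extracted_set      # missed gold points
    precision = len(correct) / len(extracted_set) if extracted_set else 1.0
    recall = len(correct) / len(gold_set) if gold_set else 1.0
    return {
        "correct": sorted(correct),
        "fabricated": sorted(fabricated),
        "omitted": sorted(omitted),
        "precision": precision,
        "recall": recall,
    }

# Illustrative example: one fabricated and one omitted memory point.
result = score_extraction(
    extracted=["user lives in Berlin", "user owns a cat"],
    gold=["user lives in Berlin", "user is vegetarian"],
)
```

A real evaluation would replace exact matching with semantic matching (e.g., an LLM judge), and an analogous comparison at the memory-update stage would additionally flag conflicting or stale points, which is how per-stage error accumulation can be measured before it propagates to question answering.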