RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
January 11, 2026
Authors: Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen
cs.AI
Abstract
As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture **"long-term project-oriented"** interactions where agents must track evolving goals.
To bridge this gap, we introduce **RealMem**, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation.
We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects.
Our code and datasets are available at [https://github.com/AvatarMemory/RealMemBench](https://github.com/AvatarMemory/RealMemBench).
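To make the cross-session setting concrete, the following is a minimal, purely illustrative sketch of the evaluation loop the abstract describes: an agent's memory store is updated after each project session, and natural user queries must then be answered from memory alone. All class and method names here (`MemoryStore`, `update`, `answer`) are assumptions for illustration, not the RealMem API; the retrieval strategy shown is deliberately naive.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Hypothetical per-agent memory, accumulated across sessions."""
    notes: list = field(default_factory=list)

    def update(self, session_turns):
        # Naive strategy: append every turn verbatim. Real memory systems
        # would summarize, deduplicate, and track evolving project state.
        self.notes.extend(session_turns)

    def answer(self, query):
        # Return the most recent note sharing any word with the query,
        # so later sessions override earlier project state.
        for note in reversed(self.notes):
            if any(word in note for word in query.lower().split()):
                return note
        return "unknown"

# Two sessions of one project: the deadline changes between sessions.
sessions = [
    ["the launch deadline is march 3"],
    ["deadline moved: launch is now april 10"],
]
memory = MemoryStore()
for turns in sessions:
    memory.update(turns)

print(memory.answer("When is the launch deadline?"))
```

The point of the sketch is the failure mode the benchmark targets: a memory system must prefer the updated project state from the later session over the stale fact from the earlier one, and must resolve a natural query rather than a templated probe.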