According to Me: Long-Term Personalized Referential Memory QA
March 2, 2026
Authors: Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li, Bill Byrne
cs.AI
Abstract
Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing long-term memory benchmarks focus primarily on dialogue history and fail to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence. Its queries require resolving personal references, reasoning over multiple pieces of evidence drawn from multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement five state-of-the-art memory systems along with a standard RAG baseline, and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find that existing systems perform poorly on the ATM-Bench-Hard set (under 20% accuracy), and that SGM improves performance over the Descriptive Memory commonly adopted in prior work. Code is available at: https://github.com/JingbiaoMei/ATM-Bench