SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory
May 30, 2026
Authors: Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, James Fort, Richard Newcombe, Hyo Jin Kim, Mi Zhang
cs.AI
Abstract
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address the memory gaps that people experience in practical, personal, or social situations, working over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic question answering over short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple choice with an explicit "unanswerable" option to test robustness to hallucination. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.
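The multiple-choice format with an explicit "unanswerable" option can be illustrated with a minimal scoring sketch. This is not the authors' evaluation code: the `Item` record, the `UNANSWERABLE` label, and the two metrics (overall accuracy, and the fraction of unanswerable questions for which a system still picks a concrete option) are assumptions made purely for illustration, and the sample items are invented.

```python
from dataclasses import dataclass

# Hypothetical label for the explicit abstain option; the dataset's real label may differ.
UNANSWERABLE = "unanswerable"


@dataclass
class Item:
    question_id: str
    gold: str       # gold option label, or UNANSWERABLE
    predicted: str  # option label chosen by the system under test


def score(items):
    """Return overall accuracy and a hallucination rate on unanswerable items."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for it in items if it.predicted == it.gold)
    unanswerable = [it for it in items if it.gold == UNANSWERABLE]
    # A hallucination here means choosing a concrete option when the evidence is insufficient.
    hallucinated = sum(1 for it in unanswerable if it.predicted != UNANSWERABLE)
    return {
        "accuracy": correct / len(items),
        "hallucination_rate": hallucinated / len(unanswerable) if unanswerable else 0.0,
    }


# Invented toy predictions, not results from the paper.
items = [
    Item("q1", "B", "B"),
    Item("q2", UNANSWERABLE, "A"),
    Item("q3", UNANSWERABLE, UNANSWERABLE),
    Item("q4", "C", "D"),
]
result = score(items)
print(result)
```

Reporting the hallucination rate separately from accuracy keeps a system that always guesses from looking as reliable as one that abstains when evidence is missing.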