SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

May 30, 2026
Authors: Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, James Fort, Richard Newcombe, Hyo Jin Kim, Mi Zhang
cs.AI

Abstract

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address the memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic question answering over short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test robustness to hallucination. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new grounded AI memory architectures that answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.