SuperMemory-VQA: An Egocentric Visual Question Answering Benchmark for Long-Term Memory

Abstract

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address the memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic question answering over short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test robustness to hallucination. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.
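As an illustration of the multiple-choice format with an explicit "unanswerable" option, the following is a minimal scoring sketch. It is not the benchmark's official evaluation code: the `Item` record, the `UNANSWERABLE` label, and the split into answerable accuracy and a hallucination rate are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical label for the explicit abstain option (an assumption, not from the dataset).
UNANSWERABLE = "unanswerable"


@dataclass
class Item:
    gold: str       # correct option, possibly UNANSWERABLE
    predicted: str  # option chosen by the system


def _accuracy(items):
    """Fraction of items answered correctly, or None for an empty subset."""
    if not items:
        return None
    return sum(i.predicted == i.gold for i in items) / len(items)


def score(items):
    """Report accuracy overall and split by whether the gold answer is UNANSWERABLE."""
    answerable = [i for i in items if i.gold != UNANSWERABLE]
    unanswerable = [i for i in items if i.gold == UNANSWERABLE]
    # Hallucination here means committing to a concrete answer
    # when the gold label says the evidence is insufficient.
    hallucinated = [i for i in unanswerable if i.predicted != UNANSWERABLE]
    return {
        "overall_accuracy": _accuracy(items),
        "answerable_accuracy": _accuracy(answerable),
        "unanswerable_accuracy": _accuracy(unanswerable),
        "hallucination_rate": (
            len(hallucinated) / len(unanswerable) if unanswerable else None
        ),
    }


if __name__ == "__main__":
    demo = [
        Item(gold="desk", predicted="desk"),
        Item(gold="kitchen", predicted="desk"),
        Item(gold=UNANSWERABLE, predicted="desk"),
        Item(gold=UNANSWERABLE, predicted=UNANSWERABLE),
    ]
    print(score(demo))
```

Reporting the unanswerable subset separately keeps a system that always commits to a concrete answer from hiding its hallucinations behind a high overall accuracy.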