Personal AI Agent for Camera Roll VQA

Abstract

We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "What is the name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Because a personal camera roll is vast (i.e., it spans multiple years and hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized stream of visual content in order to navigate it and locate the correct or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and methods for long-context understanding in AI agent systems. Together, the camroll dataset and camroll-agent highlight a gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are involved.