Personal AI Agent for Camera Roll VQA

June 3, 2026
Authors: Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li
cs.AI

Abstract

We study visual question answering over personal camera rolls. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Given the scale of a personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized stream of visual content in order to navigate it and locate the correct and/or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baseline methods and AI agent systems for long-context understanding. Together, the camroll dataset and camroll-agent highlight a gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are involved.