Personal AI Agent for Camera Roll VQA
June 3, 2026
Authors: Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li
cs.AI
Abstract
We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Because a personal camera roll is large (it can span multiple years and contain hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized stream of visual content in order to navigate it and locate the correct or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, covers 50 users and contains 31,476 images and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and existing AI agent systems for long-context understanding. Together, the camroll dataset and camroll-agent highlight a gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are involved.