Personal AI Agent for Camera Roll VQA

Abstract

We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Because a personal camera roll is large (spanning multiple years and hundreds to thousands of photos), a successful AI assistant must understand a long-horizon, highly personalized stream of visual content in order to navigate it and locate the correct or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, covers 50 users and contains 31,476 images and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and existing long-context AI agent systems. Together, the camroll dataset and camroll-agent highlight a gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context matter.