AVIS: Autonomous Visual Information Seeking with Large Language Models

Abstract

In this paper, we propose AVIS, an autonomous information-seeking framework for visual question answering. Our method leverages a Large Language Model (LLM) to dynamically plan the use of external tools and to examine their outputs, thereby acquiring the knowledge needed to answer the posed questions. Responding to visual questions that require external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. It presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making on this task. This data is then used to design a system composed of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior guides our system in two key ways. First, we create a transition graph by analyzing the sequences of decisions made by users. This graph delineates distinct states and restricts the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant in-context examples, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
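The interplay of planner, reasoner, working memory, and transition graph described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the state names, tool names, the `TRANSITIONS` table, and the stub planner, reasoner, and tools are all hypothetical stand-ins chosen for illustration, and a real system would replace the stubs with LLM and API calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical transition graph: state -> actions allowed from that state.
# In the paper the graph is derived from user-study data; these names are
# illustrative assumptions only.
TRANSITIONS: Dict[str, List[str]] = {
    "START": ["object_detect", "image_search"],
    "object_detect": ["image_search", "web_search"],
    "image_search": ["web_search", "answer"],
    "web_search": ["web_search", "answer"],
}


@dataclass
class WorkingMemory:
    """Retains the question and the information acquired so far."""
    question: str
    facts: List[str] = field(default_factory=list)


Planner = Callable[[WorkingMemory, List[str]], str]
Reasoner = Callable[[WorkingMemory, str], Optional[str]]
Tool = Callable[[WorkingMemory], str]


def run_episode(question: str, planner: Planner, tools: Dict[str, Tool],
                reasoner: Reasoner, max_steps: int = 5) -> Optional[str]:
    memory = WorkingMemory(question)
    state = "START"
    for _ in range(max_steps):
        allowed = TRANSITIONS[state]        # the graph restricts the action set
        action = planner(memory, allowed)   # the planner picks the next action
        if action not in allowed:
            raise ValueError(f"{action!r} is not allowed in state {state!r}")
        if action == "answer":
            return memory.facts[-1] if memory.facts else None
        output = tools[action](memory)      # call the selected tool
        fact = reasoner(memory, output)     # extract key information
        if fact:
            memory.facts.append(fact)
        state = action
    return None                             # step budget exhausted


# Stand-ins for LLM and API calls (assumptions, not real services).
def stub_planner(memory: WorkingMemory, allowed: List[str]) -> str:
    has_answer = any(f.startswith("answer:") for f in memory.facts)
    if has_answer and "answer" in allowed:
        return "answer"
    return allowed[0]


def stub_reasoner(memory: WorkingMemory, output: str) -> Optional[str]:
    return output.strip() or None


STUB_TOOLS: Dict[str, Tool] = {
    "object_detect": lambda m: "detected: building",
    "image_search": lambda m: "entity: example landmark",
    "web_search": lambda m: "answer: example event",
}

result = run_episode("What event does this building commemorate?",
                     stub_planner, STUB_TOOLS, stub_reasoner)
```

With these stubs the episode walks `object_detect`, `image_search`, `web_search`, and then `answer`. The graph lookup is what keeps the planner from choosing an action that is not available in the current state.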
