AVIS: Autonomous Visual Information Seeking with Large Language Models

Abstract

We propose AVIS, an autonomous information-seeking framework for visual question answering. AVIS uses a Large Language Model (LLM) to dynamically plan the use of external tools and to examine their outputs, thereby acquiring the knowledge needed to answer the posed question. Answering visual questions that require external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. It presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect diverse instances of human decision-making on this task. We then use this data to design a system composed of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior guides our system in two key ways. First, we create a transition graph by analyzing the sequences of decisions made by users. This graph defines distinct states and restricts the set of actions available at each state. Second, we use examples of user decision-making to provide the LLM-powered planner and reasoner with relevant in-context examples, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
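The interplay of the three components and the transition graph can be illustrated with a minimal sketch. It is not the authors' implementation. The state names, tool names, and the functions `plan`, `reason`, and `run` are hypothetical, and the LLM-powered planner and reasoner are replaced by trivial stand-in functions so that the example runs without any model. The in-context examples drawn from the user study are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical transition graph: state -> actions allowed in that state.
# These names are placeholders, not the states or tools used in the paper.
TRANSITIONS: Dict[str, List[str]] = {
    "start": ["image_search", "caption"],
    "image_search": ["web_search", "answer"],
    "caption": ["image_search", "web_search"],
    "web_search": ["answer"],
}


@dataclass
class WorkingMemory:
    """Retains the information acquired so far."""
    facts: List[str] = field(default_factory=list)


def plan(allowed: List[str], memory: WorkingMemory) -> str:
    """Stand-in for the LLM planner: take the first allowed action.
    A real planner would condition on the question and on memory."""
    return allowed[0]


def reason(tool_output: str) -> str:
    """Stand-in for the LLM reasoner: keep any non-empty output as a fact."""
    return tool_output.strip()


def run(question: str,
        tools: Dict[str, Callable[[str], str]],
        max_steps: int = 8) -> Tuple[Optional[str], List[str]]:
    memory = WorkingMemory()
    state, tried, trace = "start", set(), []
    for _ in range(max_steps):
        # The graph restricts which actions are available in this state.
        allowed = [a for a in TRANSITIONS.get(state, []) if a not in tried]
        if not allowed:
            break
        action = plan(allowed, memory)
        trace.append(action)
        if action == "answer":
            return "; ".join(memory.facts), trace
        tried.add(action)
        fact = reason(tools[action](question))
        if fact:
            memory.facts.append(fact)
            state = action  # advance only if the tool output was informative
    return None, trace
```

With stub tools that return fixed strings, `run` yields both the collected facts and the sequence of actions taken. When a tool returns nothing useful, the state is left unchanged and the planner picks another action that the graph allows.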
