
AVIS: Autonomous Visual Information Seeking with Large Language Models

June 13, 2023
Authors: Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A Ross, Cordelia Schmid, Alireza Fathi
cs.AI

Abstract

In this paper, we propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring the indispensable knowledge needed to provide answers to the posed questions. Responding to visual questions that necessitate external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. This task presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system comprised of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior serves as a guide for our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users. This graph delineates distinct states and confines the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
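To make the interplay of the three components concrete, below is a minimal Python sketch of such a planner–reasoner–working-memory loop, with a transition graph confining which tools are available in each state. The state names, tool names, `TRANSITION_GRAPH`, and the `llm`/`call_tool` stubs are illustrative assumptions for this sketch, not the authors' actual implementation.

```python
# A minimal sketch of an AVIS-style control loop, as described in the abstract.
# All tool/state names and the llm()/call_tool() stubs are hypothetical.

# Transition graph: each state maps to the set of tools (actions) the planner
# may choose from, pruning the combinatorial action space as in the paper.
TRANSITION_GRAPH = {
    "START": ["image_captioner", "object_detector"],
    "VISUAL_INFO": ["image_search", "knowledge_base_lookup"],
    "EXTERNAL_INFO": ["web_search", "answer"],
}

NEXT_STATE = {
    "image_captioner": "VISUAL_INFO",
    "object_detector": "VISUAL_INFO",
    "image_search": "EXTERNAL_INFO",
    "knowledge_base_lookup": "EXTERNAL_INFO",
    "web_search": "EXTERNAL_INFO",
}

def llm(prompt: str) -> str:
    """Stand-in for a call to a large language model."""
    raise NotImplementedError

def call_tool(tool: str, question: str, memory: list[str]) -> str:
    """Stand-in for invoking an external tool/API."""
    raise NotImplementedError

def avis(question: str, max_steps: int = 8) -> str:
    state = "START"
    memory: list[str] = []  # working memory of facts acquired so far
    for _ in range(max_steps):
        allowed = TRANSITION_GRAPH[state]
        # Planner: choose the next tool, restricted to actions valid here.
        # In-context examples of human decisions would be prepended in practice.
        tool = llm(
            f"Question: {question}\nFacts so far: {memory}\n"
            f"Choose one tool from {allowed}:"
        ).strip()
        if tool == "answer" or tool not in allowed:
            break
        output = call_tool(tool, question, memory)
        # Reasoner: distill the raw tool output into a usable fact.
        fact = llm(f"Extract information relevant to '{question}' from: {output}")
        memory.append(fact)
        state = NEXT_STATE[tool]
    # Synthesize the final answer from working memory.
    return llm(f"Answer '{question}' using these facts: {memory}")
```

Confining the planner's choices to the edges of the transition graph, rather than letting it pick any tool at any step, is what keeps the otherwise combinatorial search over API calls tractable.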