AVIS: Autonomous Visual Information Seeking with Large Language Models
June 13, 2023
Authors: Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A. Ross, Cordelia Schmid, Alireza Fathi
cs.AI
Abstract
In this paper, we propose an autonomous information seeking visual question
answering framework, AVIS. Our method leverages a Large Language Model (LLM) to
dynamically strategize the utilization of external tools and to investigate
their outputs, thereby acquiring the indispensable knowledge needed to provide
answers to the posed questions. Responding to visual questions that necessitate
external knowledge, such as "What event is commemorated by the building
depicted in this image?", is a complex task. This task presents a combinatorial
search space that demands a sequence of actions, including invoking APIs,
analyzing their responses, and making informed decisions. We conduct a user
study to collect a variety of instances of human decision-making when faced
with this task. This data is then used to design a system comprising three
components: an LLM-powered planner that dynamically determines which tool to
use next, an LLM-powered reasoner that analyzes and extracts key information
from the tool outputs, and a working memory component that retains the acquired
information throughout the process. The collected user behavior serves as a
guide for our system in two key ways. First, we create a transition graph by
analyzing the sequence of decisions made by users. This graph delineates
distinct states and confines the set of actions available at each state.
Second, we use examples of user decision-making to provide our LLM-powered
planner and reasoner with relevant contextual instances, enhancing their
capacity to make informed decisions. We show that AVIS achieves
state-of-the-art results on knowledge-intensive visual question answering
benchmarks such as Infoseek and OK-VQA.
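
To make the architecture described above concrete, the sketch below illustrates the control loop in Python: a planner picks the next tool, a transition graph constrains which tools are legal in each state, a reasoner distills each tool's output, and a working memory carries the distilled facts forward. This is a minimal sketch, not the authors' implementation; the state names, the tool set, and the llm_plan / llm_reason stubs are all assumptions made purely for illustration.

```python
# Hypothetical sketch of an AVIS-style control loop. State names, tools,
# and the llm_* stubs are illustrative assumptions, not the paper's API.

from typing import Callable

# Transition graph: which actions the planner may choose in each state,
# mirroring how AVIS constrains the action set using user-study data.
TRANSITIONS: dict[str, list[str]] = {
    "start": ["image_search", "object_detect"],
    "image_search": ["web_search", "answer"],
    "object_detect": ["image_search", "answer"],
    "web_search": ["answer"],
}

# Stand-in tools; a real system would invoke external APIs here.
TOOLS: dict[str, Callable[[str], str]] = {
    "image_search": lambda q: f"[image-search results for: {q}]",
    "object_detect": lambda q: f"[detected objects for: {q}]",
    "web_search": lambda q: f"[web results for: {q}]",
    "answer": lambda q: f"[final answer to: {q}]",
}


def llm_plan(memory: list[str], allowed: list[str]) -> str:
    """Planner stub: an LLM would choose the next tool given the working
    memory plus in-context examples of human decisions. Here we simply
    take the first legal action so the sketch runs end to end."""
    return allowed[0]


def llm_reason(tool_output: str) -> str:
    """Reasoner stub: an LLM would extract only the key information from
    the raw tool output. Here it passes the output through unchanged."""
    return tool_output


def avis_loop(question: str, max_steps: int = 5) -> str:
    state = "start"
    memory = [f"question: {question}"]  # working memory
    for _ in range(max_steps):
        allowed = TRANSITIONS.get(state, ["answer"])
        action = llm_plan(memory, allowed)       # planner: choose a tool
        observation = TOOLS[action](question)    # invoke the chosen tool
        memory.append(llm_reason(observation))   # reasoner: store key facts
        if action == "answer":
            return memory[-1]
        state = action
    return memory[-1]


print(avis_loop("What event is commemorated by the building in this image?"))
```

In a full system, llm_plan and llm_reason would be few-shot LLM calls seeded with the collected user-decision examples, and TOOLS would wrap real services such as image search, object detection, and web search.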