
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

January 3, 2024
Authors: Aleksandar Stanić, Sergi Caelles, Michael Tschannen
cs.AI

Abstract

Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools. Recently, these models have achieved strong performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models rely heavily on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples altogether. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLM-as-controller setup more robust, and removes the need for human engineering of in-context examples.
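To make the setup concrete, below is a minimal Python sketch of the two ideas the abstract describes: an LLM controller that writes a short program over a visual-tool API, and in-context examples generated automatically by keeping zero-shot programs whose execution result matches a labeled answer. All names here (llm_complete, find, count, execute_command, auto_generate_examples) are hypothetical stand-ins for illustration, not the paper's actual API.

```python
def llm_complete(prompt: str) -> str:
    """Placeholder for an LLM call that returns Python source code."""
    raise NotImplementedError

def find(image, object_name):
    """Toy stand-in for an open-vocabulary detector; returns bounding boxes."""
    return []

def count(boxes):
    """Return the number of detected boxes."""
    return len(boxes)

TOOL_API_DOC = (
    "def find(image, object_name): ...  # bounding boxes for object_name\n"
    "def count(boxes): ...              # number of boxes\n"
)

def make_prompt(query, examples=""):
    """Assemble the controller prompt: tool API, optional examples, query."""
    return (
        f"API:\n{TOOL_API_DOC}\n{examples}"
        f"Query: {query}\n"
        "Write a Python function execute_command(image) answering the query."
    )

def run_program(program, image):
    """Execute LLM-generated code with the visual tools in scope."""
    scope = {"find": find, "count": count}
    exec(program, scope)
    return scope["execute_command"](image)

def auto_generate_examples(labeled, n_candidates=16):
    """Keep zero-shot programs whose result matches the label; the kept
    (query, program) pairs then serve as in-context examples, replacing
    hand-engineered ones."""
    kept = []
    for image, query, label in labeled:
        for _ in range(n_candidates):
            # Zero-shot generation, sampled with nonzero temperature in practice.
            program = llm_complete(make_prompt(query))
            try:
                if run_program(program, image) == label:
                    kept.append((query, program))
                    break
            except Exception:
                continue  # discard programs that fail to run
    return kept
```

Filtering candidate programs by their execution result in this way trades a handful of labeled answers for the dataset- and task-specific prompt engineering that prior tool-use systems required.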