Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
January 3, 2024
Authors: Aleksandar Stanić, Sergi Caelles, Michael Tschannen
cs.AI
Abstract
Visual reasoning is dominated by end-to-end neural networks scaled to
billions of model parameters and training examples. However, even the largest
models struggle with compositional reasoning, generalization, fine-grained
spatial and temporal reasoning, and counting. Visual reasoning with large
language models (LLMs) as controllers can, in principle, address these
limitations by decomposing the task and solving subtasks by orchestrating a set
of (visual) tools. Recently, these models achieved great performance on tasks
such as compositional visual question answering, visual grounding, and video
temporal reasoning. Nevertheless, in their current form, these models heavily
rely on human engineering of in-context examples in the prompt, which are often
dataset- and task-specific and require significant labor by highly skilled
programmers. In this work, we present a framework that mitigates these issues
by introducing spatially and temporally abstract routines and by leveraging a
small number of labeled examples to automatically generate in-context examples,
thereby avoiding human-created in-context examples. On a number of visual
reasoning tasks, we show that our framework leads to consistent gains in
performance, makes the LLM-as-controller setup more robust, and removes the need
for human engineering of in-context examples.
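
For illustration, below is a minimal sketch (not the authors' code) of the LLM-as-controller pattern the abstract describes: the LLM is prompted with a description of the available visual tools plus in-context (question, program) examples, and it emits a short program that composes those tools to answer the query. All names here (`query_llm`, `find_objects`, `build_prompt`, `run_program`) are hypothetical stand-ins rather than the paper's actual API or any real library.

```python
# Hedged sketch of "LLM as controller / programmer" for visual reasoning.
# The stubs below are placeholders; the paper's framework additionally
# generates the in-context examples automatically from a few labeled samples.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    """A (question, program) pair used as an in-context example."""
    question: str
    program: str


def build_prompt(api_doc: str, examples: List[Example], question: str) -> str:
    """Assemble the controller prompt: tool API + in-context examples + query."""
    shots = "\n\n".join(f"# Q: {e.question}\n{e.program}" for e in examples)
    return f"{api_doc}\n\n{shots}\n\n# Q: {question}\n"


def run_program(program: str, tools: Dict[str, Callable]) -> object:
    """Execute the LLM-generated program with the visual tools in scope."""
    scope = dict(tools)
    exec(program, scope)  # the generated program is expected to set `answer`
    return scope.get("answer")


# --- Hypothetical stubs so the sketch is self-contained ----------------------
def query_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed toy program."""
    return "answer = len(find_objects('muffin'))"


def find_objects(name: str) -> list:
    """Stand-in for an open-vocabulary detector tool."""
    return ["box_1", "box_2"]  # pretend two matching objects were detected


if __name__ == "__main__":
    api_doc = "# Available tool: find_objects(name) -> list of boxes"
    examples = [Example("How many dogs are there?",
                        "answer = len(find_objects('dog'))")]
    prompt = build_prompt(api_doc, examples, "How many muffins are on the plate?")
    program = query_llm(prompt)
    print(run_program(program, {"find_objects": find_objects}))  # -> 2
```

In this sketch the `examples` list is hand-written; the framework summarized above would instead populate it automatically from a small number of labeled examples, which is the part that removes the need for human prompt engineering.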