Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
January 3, 2024
Authors: Aleksandar Stanić, Sergi Caelles, Michael Tschannen
cs.AI
Abstract
Visual reasoning is dominated by end-to-end neural networks scaled to
billions of model parameters and training examples. However, even the largest
models struggle with compositional reasoning, generalization, fine-grained
spatial and temporal reasoning, and counting. Visual reasoning with large
language models (LLMs) as controllers can, in principle, address these
limitations by decomposing the task and solving subtasks by orchestrating a set
of (visual) tools. Recently, these models achieved great performance on tasks
such as compositional visual question answering, visual grounding, and video
temporal reasoning. Nevertheless, in their current form, these models heavily
rely on human engineering of in-context examples in the prompt, which are often
dataset- and task-specific and require significant labor by highly skilled
programmers. In this work, we present a framework that mitigates these issues
by introducing spatially and temporally abstract routines and by leveraging a
small number of labeled examples to automatically generate in-context examples,
thereby avoiding human-created in-context examples. On a number of visual
reasoning tasks, we show that our framework leads to consistent gains in
performance, makes the LLM-as-controller setup more robust, and removes the need
for human engineering of in-context examples.
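
For illustration, below is a minimal sketch (not the authors' code) of the LLM-as-controller pattern the abstract describes: the LLM is prompted with a description of the available visual tools plus in-context (question, program) examples, and it emits a short program that composes those tools to answer the query. All names here (`query_llm`, `find_objects`, `build_prompt`, `run_program`) are hypothetical stand-ins rather than the paper's actual API or any real library.

```python
# Hedged sketch of "LLM as controller / programmer" for visual reasoning.
# The stubs below are placeholders; the paper's framework additionally
# generates the in-context examples automatically from a few labeled samples.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    """A (question, program) pair used as an in-context example."""
    question: str
    program: str


def build_prompt(api_doc: str, examples: List[Example], question: str) -> str:
    """Assemble the controller prompt: tool API + in-context examples + query."""
    shots = "\n\n".join(f"# Q: {e.question}\n{e.program}" for e in examples)
    return f"{api_doc}\n\n{shots}\n\n# Q: {question}\n"


def run_program(program: str, tools: Dict[str, Callable]) -> object:
    """Execute the LLM-generated program with the visual tools in scope."""
    scope = dict(tools)
    exec(program, scope)  # the generated program is expected to set `answer`
    return scope.get("answer")


# --- Hypothetical stubs so the sketch is self-contained ----------------------
def query_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed toy program."""
    return "answer = len(find_objects('muffin'))"


def find_objects(name: str) -> list:
    """Stand-in for an open-vocabulary detector tool."""
    return ["box_1", "box_2"]  # pretend two matching objects were detected


if __name__ == "__main__":
    api_doc = "# Available tool: find_objects(name) -> list of boxes"
    examples = [Example("How many dogs are there?",
                        "answer = len(find_objects('dog'))")]
    prompt = build_prompt(api_doc, examples, "How many muffins are on the plate?")
    program = query_llm(prompt)
    print(run_program(program, {"find_objects": find_objects}))  # -> 2
```

In this sketch the `examples` list is hand-written; the framework summarized above would instead populate it automatically from a small number of labeled examples, which is the part that removes the need for human prompt engineering.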