Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Abstract

Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and orchestrating a set of (visual) tools to solve the resulting subtasks. Recently, these models have achieved strong performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models rely heavily on human engineering of the in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLMs-as-controllers setup more robust, and removes the need for human engineering of in-context examples.

