Modular Visual Question Answering via Code Generation

Abstract

We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning. The generated Python programs invoke and compose the outputs of the visual models using arithmetic and conditional logic. Our approach improves accuracy on the COVR dataset by at least 3% and on the GQA dataset by roughly 2% compared to the few-shot baseline that does not employ code generation.
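To make the idea of composing visual model outputs with arithmetic and conditional logic more concrete, the following is a minimal sketch of the kind of program a language model might generate for a multi-image question such as "Are there more images with a dog than with a cat?". It is an illustration under stated assumptions, not the paper's implementation: the names `query`, `TOY_ANSWERS`, and `generated_program` are hypothetical, and the visual model is replaced by a lookup table so that the example runs on its own.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a pre-trained visual model: a lookup table from
# (image id, question) to a short answer. A real system would run a model here.
TOY_ANSWERS: Dict[Tuple[str, str], str] = {
    ("img1", "Is there a dog?"): "yes",
    ("img2", "Is there a dog?"): "yes",
    ("img2", "Is there a cat?"): "yes",
}


def query(image: str, question: str) -> str:
    """Hypothetical visual primitive: answer a simple question about one image."""
    # Any (image, question) pair missing from the toy table is answered "no".
    return TOY_ANSWERS.get((image, question), "no")


def generated_program(images: List[str]) -> str:
    """Example of a generated program for:
    'Are there more images with a dog than with a cat?'
    """
    dog_count = 0
    cat_count = 0
    for image in images:
        # Conditional logic applied to the visual model's per-image answers.
        if query(image, "Is there a dog?") == "yes":
            dog_count += 1
        if query(image, "Is there a cat?") == "yes":
            cat_count += 1
    # Arithmetic comparison combines the per-image results into one answer.
    return "yes" if dog_count > cat_count else "no"


if __name__ == "__main__":
    print(generated_program(["img1", "img2", "img3"]))
```

In this sketch the language model is only responsible for writing the body of `generated_program`; the counting and comparison are carried out by ordinary Python rather than by the visual model, which is the separation the abstract describes.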
