Modular Visual Question Answering via Code Generation
June 8, 2023
Authors: Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein
cs.AI
Abstract
We present a framework that formulates visual question answering as modular
code generation. In contrast to prior work on modular approaches to VQA, our
approach requires no additional training and relies on pre-trained language
models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA
examples used for in-context learning. The generated Python programs invoke and
compose the outputs of the visual models using arithmetic and conditional
logic. Our approach improves accuracy on the COVR dataset by at least 3% and on
the GQA dataset by roughly 2% compared to the few-shot baseline that does not
employ code generation.
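
To make the idea concrete, below is a minimal sketch of the kind of Python program the framework might generate for a question such as "Is the cup to the left of the plate?". The primitives `query` and `get_pos` are hypothetical stand-ins for calls into the pre-trained visual models, not the paper's exact API; the stubs would be supplied by the framework's runtime.

```python
def query(image, question: str) -> str:
    """Hypothetical primitive: answer a simple yes/no or short-answer question
    about the image using a pre-trained image-text model (stubbed here)."""
    raise NotImplementedError("backed by a pre-trained visual model in the real system")


def get_pos(image, object_name: str) -> tuple[float, float]:
    """Hypothetical primitive: return the (x, y) center of the named object
    as located by a pre-trained visual model (stubbed here)."""
    raise NotImplementedError("backed by a pre-trained visual model in the real system")


def generated_program(image) -> str:
    """Illustrative generated program: composes visual-model outputs
    with arithmetic and conditional logic."""
    # First check that both objects are present.
    cup_present = query(image, "Is there a cup?") == "yes"
    plate_present = query(image, "Is there a plate?") == "yes"
    if not (cup_present and plate_present):
        return "no"
    # Compare horizontal positions to answer the spatial question.
    cup_x, _ = get_pos(image, "cup")
    plate_x, _ = get_pos(image, "plate")
    return "yes" if cup_x < plate_x else "no"
```

In this style, the language model only has to emit short programs over a small set of visual primitives, while the arithmetic and conditional logic handles the compositional reasoning that a single end-to-end model would otherwise need to learn.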