Modular Visual Question Answering via Code Generation
June 8, 2023
Authors: Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein
cs.AI
Abstract
We present a framework that formulates visual question answering as modular
code generation. In contrast to prior work on modular approaches to VQA, our
approach requires no additional training and relies on pre-trained language
models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA
examples used for in-context learning. The generated Python programs invoke and
compose the outputs of the visual models using arithmetic and conditional
logic. Our approach improves accuracy on the COVR dataset by at least 3% and on
the GQA dataset by roughly 2% compared to the few-shot baseline that does not
employ code generation.
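
To make the idea concrete, below is a minimal sketch of the kind of Python program the framework might generate for a question such as "Is the cup to the left of the plate?". The primitives `query` and `get_pos` are hypothetical stand-ins for calls into the pre-trained visual models, not the paper's exact API; the stubs would be supplied by the framework's runtime.

```python
def query(image, question: str) -> str:
    """Hypothetical primitive: answer a simple yes/no or short-answer question
    about the image using a pre-trained image-text model (stubbed here)."""
    raise NotImplementedError("backed by a pre-trained visual model in the real system")


def get_pos(image, object_name: str) -> tuple[float, float]:
    """Hypothetical primitive: return the (x, y) center of the named object
    as located by a pre-trained visual model (stubbed here)."""
    raise NotImplementedError("backed by a pre-trained visual model in the real system")


def generated_program(image) -> str:
    """Illustrative generated program: composes visual-model outputs
    with arithmetic and conditional logic."""
    # First check that both objects are present.
    cup_present = query(image, "Is there a cup?") == "yes"
    plate_present = query(image, "Is there a plate?") == "yes"
    if not (cup_present and plate_present):
        return "no"
    # Compare horizontal positions to answer the spatial question.
    cup_x, _ = get_pos(image, "cup")
    plate_x, _ = get_pos(image, "plate")
    return "yes" if cup_x < plate_x else "no"
```

In this style, the language model only has to emit short programs over a small set of visual primitives, while the arithmetic and conditional logic handles the compositional reasoning that a single end-to-end model would otherwise need to learn.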