InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
May 11, 2023
作者: Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
cs.AI
Abstract
General-purpose language models that can solve various language-domain tasks
have emerged, driven by the pre-training and instruction-tuning pipeline.
However, building general-purpose vision-language models is challenging due to
the increased task discrepancy introduced by the additional visual input.
Although vision-language pre-training has been widely studied, vision-language
instruction tuning remains relatively less explored. In this paper, we conduct
a systematic and comprehensive study on vision-language instruction tuning
based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly
available datasets, transform them into instruction tuning format and
categorize them into two clusters for held-in instruction tuning and held-out
zero-shot evaluation. Additionally, we introduce instruction-aware visual
feature extraction, a crucial method that enables the model to extract
informative features tailored to the given instruction. The resulting
InstructBLIP models achieve state-of-the-art zero-shot performance across all
13 held-out datasets, substantially outperforming BLIP-2 and the larger
Flamingo. Our models also lead to state-of-the-art performance when finetuned
on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG).
Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over
concurrent multimodal models. All InstructBLIP models have been open-sourced at
https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
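The abstract mentions converting 26 public datasets into an instruction-tuning format. The snippet below is a minimal sketch of what such a conversion can look like; the template strings, field names, and the `to_instruction_example` helper are illustrative assumptions, not the actual templates or code used by the authors.

```python
# Sketch (assumed, not the authors' code): wrap a VQA-style record in an
# instruction template to produce an (image, instruction, target) triple.
import random

# Hypothetical instruction templates; InstructBLIP uses its own template set.
INSTRUCTION_TEMPLATES = [
    "Question: {question} Short answer:",
    "Given the image, answer the following question. {question}",
]

def to_instruction_example(record):
    """Convert {'image', 'question', 'answer'} into an instruction-tuning example."""
    template = random.choice(INSTRUCTION_TEMPLATES)
    return {
        "image": record["image"],
        "text_input": template.format(question=record["question"]),
        "text_output": record["answer"],
    }

example = to_instruction_example(
    {"image": "coco_000001.jpg", "question": "What color is the bus?", "answer": "red"}
)
print(example["text_input"])  # one of the templates, filled in with the question
```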
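The key method highlighted in the abstract is instruction-aware visual feature extraction: the instruction tokens are fed to the Q-Former together with its learnable queries, so the extracted visual features are conditioned on the instruction. Below is a minimal PyTorch sketch of that idea under simplifying assumptions (a single transformer block built from `nn.MultiheadAttention`); it is not the BERT-based Q-Former used in BLIP-2/InstructBLIP.

```python
# Minimal sketch, assuming a single simplified block: queries first self-attend
# jointly with the instruction tokens, then cross-attend to frozen image features.
import torch
import torch.nn as nn

class InstructionAwareQFormerBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, image_feats, instruction_embeds):
        """image_feats: (B, N_img, dim) frozen image-encoder outputs.
        instruction_embeds: (B, N_txt, dim) embedded instruction tokens."""
        b = image_feats.size(0)
        q = self.queries.expand(b, -1, -1)
        # Self-attention over [queries; instruction]: queries read the instruction.
        x = torch.cat([q, instruction_embeds], dim=1)
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        q = x[:, : self.queries.size(1)]  # keep only the query slots
        # Cross-attention: instruction-conditioned queries attend to image features.
        q = self.norm2(q + self.cross_attn(q, image_feats, image_feats)[0])
        q = self.norm3(q + self.ffn(q))
        return q  # (B, num_queries, dim), then projected and fed to the frozen LLM

block = InstructionAwareQFormerBlock()
out = block(torch.randn(2, 257, 768), torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 32, 768])
```

The design point the sketch illustrates is that the same image yields different query outputs for different instructions, because the self-attention step mixes instruction information into the queries before they pool visual features.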