InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
May 11, 2023
作者: Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
cs.AI
Abstract
General-purpose language models that can solve various language-domain tasks
have emerged, driven by the pre-training and instruction-tuning pipeline.
However, building general-purpose vision-language models is challenging due to
the increased task discrepancy introduced by the additional visual input.
Although vision-language pre-training has been widely studied, vision-language
instruction tuning remains relatively less explored. In this paper, we conduct
a systematic and comprehensive study on vision-language instruction tuning
based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly
available datasets, transform them into instruction tuning format and
categorize them into two clusters for held-in instruction tuning and held-out
zero-shot evaluation. Additionally, we introduce instruction-aware visual
feature extraction, a crucial method that enables the model to extract
informative features tailored to the given instruction. The resulting
InstructBLIP models achieve state-of-the-art zero-shot performance across all
13 held-out datasets, substantially outperforming BLIP-2 and the larger
Flamingo. Our models also lead to state-of-the-art performance when finetuned
on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG).
Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over
concurrent multimodal models. All InstructBLIP models have been open-sourced at
https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
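The abstract mentions converting 26 public datasets into an instruction-tuning format. The snippet below is a minimal sketch of what such a conversion can look like; the template strings, field names, and the `to_instruction_example` helper are illustrative assumptions, not the actual templates or code used by the authors.

```python
# Sketch (assumed, not the authors' code): wrap a VQA-style record in an
# instruction template to produce an (image, instruction, target) triple.
import random

# Hypothetical instruction templates; InstructBLIP uses its own template set.
INSTRUCTION_TEMPLATES = [
    "Question: {question} Short answer:",
    "Given the image, answer the following question. {question}",
]

def to_instruction_example(record):
    """Convert {'image', 'question', 'answer'} into an instruction-tuning example."""
    template = random.choice(INSTRUCTION_TEMPLATES)
    return {
        "image": record["image"],
        "text_input": template.format(question=record["question"]),
        "text_output": record["answer"],
    }

example = to_instruction_example(
    {"image": "coco_000001.jpg", "question": "What color is the bus?", "answer": "red"}
)
print(example["text_input"])  # one of the templates, filled in with the question
```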
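The key method highlighted in the abstract is instruction-aware visual feature extraction: the instruction tokens are fed to the Q-Former together with its learnable queries, so the extracted visual features are conditioned on the instruction. Below is a minimal PyTorch sketch of that idea under simplifying assumptions (a single transformer block built from `nn.MultiheadAttention`); it is not the BERT-based Q-Former used in BLIP-2/InstructBLIP.

```python
# Minimal sketch, assuming a single simplified block: queries first self-attend
# jointly with the instruction tokens, then cross-attend to frozen image features.
import torch
import torch.nn as nn

class InstructionAwareQFormerBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, image_feats, instruction_embeds):
        """image_feats: (B, N_img, dim) frozen image-encoder outputs.
        instruction_embeds: (B, N_txt, dim) embedded instruction tokens."""
        b = image_feats.size(0)
        q = self.queries.expand(b, -1, -1)
        # Self-attention over [queries; instruction]: queries read the instruction.
        x = torch.cat([q, instruction_embeds], dim=1)
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        q = x[:, : self.queries.size(1)]  # keep only the query slots
        # Cross-attention: instruction-conditioned queries attend to image features.
        q = self.norm2(q + self.cross_attn(q, image_feats, image_feats)[0])
        q = self.norm3(q + self.ffn(q))
        return q  # (B, num_queries, dim), then projected and fed to the frozen LLM

block = InstructionAwareQFormerBlock()
out = block(torch.randn(2, 257, 768), torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 32, 768])
```

The design point the sketch illustrates is that the same image yields different query outputs for different instructions, because the self-attention step mixes instruction information into the queries before they pool visual features.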