
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

May 11, 2023
作者: Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
cs.AI

Abstract

General-purpose language models that can solve various language-domain tasks have emerged driven by the pre-training and instruction-tuning pipeline. However, building general-purpose vision-language models is challenging due to the increased task discrepancy introduced by the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains relatively less explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly available datasets, transform them into instruction tuning format and categorize them into two clusters for held-in instruction tuning and held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models have been open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
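The abstract names instruction-aware visual feature extraction as the key method but does not spell it out on this page. The PyTorch block below is a minimal conceptual sketch, assuming a Q-Former-style module in which learnable query tokens and the embedded instruction interact through self-attention before the queries cross-attend to frozen image features; the class and parameter names (InstructionAwareQFormerBlock, num_queries, and so on) are illustrative and are not the authors' implementation.

```python
# Conceptual sketch (not the authors' code) of instruction-aware visual feature
# extraction: learnable query tokens and the tokenized instruction attend to each
# other, and the queries then cross-attend to frozen image features, so the
# extracted visual tokens depend on the given instruction.
import torch
import torch.nn as nn


class InstructionAwareQFormerBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, image_feats: torch.Tensor, instr_embeds: torch.Tensor) -> torch.Tensor:
        # image_feats:  (B, N_img, dim) features from a frozen image encoder
        # instr_embeds: (B, N_txt, dim) embedded instruction tokens
        batch = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)       # (B, num_queries, dim)
        x = torch.cat([q, instr_embeds], dim=1)                   # queries see the instruction
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        q = x[:, : q.size(1)]                                     # keep only the query positions
        q = self.norm2(q + self.cross_attn(q, image_feats, image_feats)[0])
        return self.norm3(q + self.ffn(q))                        # instruction-conditioned visual tokens
```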
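Since the models are released through the LAVIS repository linked above, a minimal usage sketch follows. It assumes the LAVIS `load_model_and_preprocess` entry point and the Vicuna-7B InstructBLIP identifiers as given in the project README; the image path and prompt are placeholders, and the exact model names should be verified against the linked repository.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load an InstructBLIP checkpoint through the LAVIS model zoo.
# The identifiers below ("blip2_vicuna_instruct", "vicuna7b") follow the project
# README; check the linked repository for the exact names.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct",
    model_type="vicuna7b",
    is_eval=True,
    device=device,
)

# "example.jpg" is a placeholder image path.
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# The instruction conditions both the visual feature extraction and the frozen LLM.
answer = model.generate({"image": image, "prompt": "What is unusual about this image?"})
print(answer)
```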