InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
May 11, 2023
作者: Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi
cs.AI
Abstract
General-purpose language models that can solve various language-domain tasks
have emerged, driven by the pre-training and instruction-tuning pipeline.
However, building general-purpose vision-language models is challenging due to
the increased task discrepancy introduced by the additional visual input.
Although vision-language pre-training has been widely studied, vision-language
instruction tuning remains relatively less explored. In this paper, we conduct
a systematic and comprehensive study on vision-language instruction tuning
based on the pre-trained BLIP-2 models. We gather a wide variety of 26 publicly
available datasets, transform them into instruction-tuning format, and
categorize them into two clusters for held-in instruction tuning and held-out
zero-shot evaluation. Additionally, we introduce instruction-aware visual
feature extraction, a crucial method that enables the model to extract
informative features tailored to the given instruction. The resulting
InstructBLIP models achieve state-of-the-art zero-shot performance across all
13 held-out datasets, substantially outperforming BLIP-2 and the larger
Flamingo. Our models also lead to state-of-the-art performance when finetuned
on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG).
Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over
concurrent multimodal models. All InstructBLIP models have been open-sourced at
https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
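The key methodological point in the abstract is instruction-aware visual feature extraction: the Q-Former receives the instruction tokens together with its learnable query embeddings, so the visual features handed to the frozen LLM are conditioned on what the instruction asks about. The sketch below is a minimal, illustrative rendering of that idea in PyTorch, not the released implementation; the class name, layer choices, and dimensions are assumptions (the actual InstructBLIP Q-Former is a BERT-style module initialized from BLIP-2).

```python
# Minimal sketch (not the official implementation) of instruction-aware visual
# feature extraction: the instruction tokens join the learnable queries in
# self-attention, while cross-attention reads from the frozen image encoder.
import torch
import torch.nn as nn


class InstructionAwareQFormer(nn.Module):
    def __init__(self, num_queries=32, hidden_dim=768, llm_dim=4096,
                 num_layers=2, num_heads=12):
        super().__init__()
        # Learnable query embeddings, as in BLIP-2's Q-Former.
        self.query_tokens = nn.Parameter(torch.randn(1, num_queries, hidden_dim))
        # Each layer: self-attention over [queries; instruction tokens],
        # then cross-attention into the frozen image encoder's patch features.
        layer = nn.TransformerDecoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Projection from Q-Former outputs into the frozen LLM's embedding space.
        self.llm_proj = nn.Linear(hidden_dim, llm_dim)

    def forward(self, image_feats, instruction_embeds):
        """
        image_feats:        (B, N_patches, hidden_dim) from the frozen image encoder
        instruction_embeds: (B, L_instr, hidden_dim) embedded instruction tokens
        returns:            (B, num_queries, llm_dim) soft visual prompts for the LLM
        """
        batch = image_feats.size(0)
        queries = self.query_tokens.expand(batch, -1, -1)
        # Queries and instruction tokens interact via self-attention (tgt)
        # and attend to the image features via cross-attention (memory).
        tgt = torch.cat([queries, instruction_embeds], dim=1)
        out = self.blocks(tgt=tgt, memory=image_feats)
        # Only the query positions are projected and passed on to the LLM.
        query_out = out[:, : queries.size(1), :]
        return self.llm_proj(query_out)


# Toy usage with random tensors standing in for real encoder outputs.
if __name__ == "__main__":
    qformer = InstructionAwareQFormer()
    image_feats = torch.randn(2, 257, 768)   # e.g., ViT patch features
    instruction = torch.randn(2, 16, 768)    # embedded instruction tokens
    visual_prompts = qformer(image_feats, instruction)
    print(visual_prompts.shape)              # torch.Size([2, 32, 4096])
```

In the released models, the projected query outputs serve as soft visual prompts prepended to the instruction for a frozen LLM (FlanT5 or Vicuna); the final linear projection above stands in for that step.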