InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Abstract

General-purpose language models that can solve a variety of language-domain tasks have emerged, driven by the pre-training and instruction-tuning pipeline. Building general-purpose vision-language models is harder, however, because the additional visual input increases the discrepancy between tasks. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains comparatively underexplored. In this paper, we conduct a systematic and comprehensive study of vision-language instruction tuning based on the pre-trained BLIP-2 models. We gather 26 diverse, publicly available datasets, transform them into instruction-tuning format, and categorize them into two clusters: one for held-in instruction tuning and one for held-out zero-shot evaluation. Additionally, we introduce instruction-aware visual feature extraction, a crucial method that enables the model to extract informative features tailored to the given instruction. The resulting InstructBLIP models achieve state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo. Our models also achieve state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA IMG). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models have been open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
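As a rough illustration of the data preparation described above (converting existing datasets into instruction-tuning format and separating them into held-in and held-out clusters), the following is a minimal sketch. The template strings, the `InstructionSample` type, the raw-sample layout, and the function names are all hypothetical and chosen only for illustration; the abstract does not specify the actual templates or data structures used.

```python
import random
from dataclasses import dataclass

# Hypothetical instruction templates per task type (assumption: the real
# templates are not given in the abstract).
TEMPLATES = {
    "vqa": ["Question: {question} Short answer:", "{question}"],
    "caption": ["Write a short description for the image."],
}


@dataclass
class InstructionSample:
    image_path: str
    instruction: str
    target: str


def to_instruction_sample(raw, task, rng):
    """Wrap one raw dataset sample in a randomly chosen instruction template."""
    template = rng.choice(TEMPLATES[task])
    # Fill template placeholders from the sample's fields, if it has any.
    instruction = template.format(**raw.get("fields", {}))
    return InstructionSample(raw["image"], instruction, raw["target"])


def split_datasets(names, held_out_names):
    """Partition dataset names into held-in (training) and held-out (zero-shot) clusters."""
    held_out_set = set(held_out_names)
    unknown = held_out_set - set(names)
    if unknown:
        raise ValueError(f"unknown held-out datasets: {sorted(unknown)}")
    held_in = [n for n in names if n not in held_out_set]
    held_out = [n for n in names if n in held_out_set]
    return held_in, held_out
```

In this sketch, only samples from the held-in cluster would be passed through `to_instruction_sample` for instruction tuning, while the held-out cluster would be reserved for zero-shot evaluation.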
