Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

Abstract

Despite the remarkable capabilities of vision-language models (VLMs) as versatile visual assistants, two substantial challenges persist within existing VLM frameworks: (1) a lack of task diversity in pretraining and visual instruction tuning, and (2) annotation errors and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, with each task accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are first fine-tuned on Vision-Flan and then further tuned on GPT-4 synthesized data. We find that this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning, and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs' capabilities but rather modulates the model's responses to human-preferred formats; (2) a minimal quantity (e.g., 1,000) of GPT-4 synthesized data can effectively align VLM responses with human preferences; (3) visual instruction tuning mainly helps large language models (LLMs) to understand visual features.
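The two-stage schedule described in the abstract can be illustrated with a minimal, framework-agnostic sketch. Everything named here (`two_stage_tune`, `train_step`, `stage2_size`, the toy `record_source` helper) is hypothetical and not taken from the authors' code; the default of 1,000 stage-two examples only echoes the "e.g., 1,000" figure in the abstract and is not a reported training configuration.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Model = TypeVar("Model")
Example = TypeVar("Example")


def two_stage_tune(
    model: Model,
    human_labeled: Sequence[Example],
    gpt4_synthesized: Sequence[Example],
    train_step: Callable[[Model, Example], Model],
    stage2_size: int = 1000,  # assumption: mirrors the "e.g., 1,000" in the abstract
    seed: int = 0,
) -> Model:
    """Run stage 1 on human-labeled data, then stage 2 on a small synthetic subset."""
    # Stage 1: fine-tune on every human-labeled example (Vision-Flan in the paper).
    for example in human_labeled:
        model = train_step(model, example)

    # Stage 2: continue tuning on a small random subset of GPT-4 synthesized data.
    rng = random.Random(seed)
    k = min(stage2_size, len(gpt4_synthesized))
    for example in rng.sample(list(gpt4_synthesized), k):
        model = train_step(model, example)
    return model


# Toy usage: the "model" is just a log of which data source each update came from.
def record_source(log: List[str], example: dict) -> List[str]:
    return log + [example["source"]]


stage1 = [{"source": "vision_flan"} for _ in range(5)]
stage2 = [{"source": "gpt4"} for _ in range(50)]
history = two_stage_tune([], stage1, stage2, record_source, stage2_size=10)
```

In a real setup, `train_step` would wrap an optimizer update on a VLM; the sketch only fixes the ordering of the two stages and the size cap on the synthetic data.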
