Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

Abstract

Despite the remarkable capabilities of vision-language models (VLMs) as versatile visual assistants, two substantial challenges persist in existing VLM frameworks: (1) a lack of task diversity in pretraining and visual instruction tuning, and (2) annotation errors and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, with each task accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework in which VLMs are first fine-tuned on Vision-Flan and then further tuned on GPT-4 synthesized data. We find that this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning, and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs' capabilities but rather modulates the model's responses to human-preferred formats; (2) a minimal quantity (e.g., 1,000) of GPT-4 synthesized data can effectively align VLM responses with human preferences; (3) visual instruction tuning mainly helps large language models (LLMs) to understand visual features.
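The two-stage ordering described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' training code: the names `Stage`, `build_two_stage_schedule`, `run_schedule`, and the `train_step` callback are hypothetical, and the default stage-two size of 1,000 simply mirrors the "e.g., 1,000" figure quoted in the abstract. The sketch only shows the data ordering (human-labeled tasks first, then a small GPT-4 synthesized subset), not the model, optimizer, or hyperparameters, none of which the abstract specifies.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Stage:
    """One tuning stage: a label plus the examples used in that stage."""
    name: str
    examples: List[dict]


def build_two_stage_schedule(
    human_labeled: Sequence[dict],
    gpt4_synthesized: Sequence[dict],
    stage2_size: int = 1000,  # assumption: mirrors the "e.g., 1,000" in the abstract
    seed: int = 0,
) -> List[Stage]:
    """Return [stage 1: all human-labeled data, stage 2: small synthesized subset]."""
    rng = random.Random(seed)
    k = min(stage2_size, len(gpt4_synthesized))
    stage2_examples = rng.sample(list(gpt4_synthesized), k)
    return [
        Stage("stage1_human_labeled", list(human_labeled)),
        Stage("stage2_gpt4_synthesized", stage2_examples),
    ]


def run_schedule(
    schedule: Sequence[Stage],
    train_step: Callable[[str, dict], None],
) -> Dict[str, int]:
    """Run the stages strictly in order; train_step is a hypothetical update hook."""
    counts: Dict[str, int] = {}
    for stage in schedule:
        for example in stage.examples:
            train_step(stage.name, example)
        counts[stage.name] = len(stage.examples)
    return counts
```

In this sketch, `train_step` would wrap whatever fine-tuning update is actually used; keeping it as a callback makes the ordering of the two stages explicit and easy to check in isolation.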
