SVIT: Scaling up Visual Instruction Tuning

Abstract

With the emergence of foundation models, large language models and vision models have been integrated to acquire multimodal abilities such as visual captioning, dialogue, and question answering. Although existing multimodal models show impressive visual understanding and reasoning performance, their limits remain largely under-explored because high-quality instruction tuning data is scarce. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual instruction tuning samples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions. Beyond its volume, the proposed dataset also features high quality and rich diversity; it is generated by prompting GPT-4 with abundant manual annotations of images. We empirically verify that training multimodal models on SVIT significantly improves multimodal performance in terms of visual perception, reasoning, and planning.
