To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

Abstract

Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Although these methods achieve promising performance, the descriptions are derived from image annotations, which are often coarse-grained. Furthermore, because the language model never observes the full visual context, the generated instructions may even contradict the visual content. To address this challenge, we introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Through experimental validation and case studies, we demonstrate that high-quality visual instruction data improves the performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a wide spectrum of benchmarks by clear margins. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks, e.g., LLaVA^w (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4). We release our data and model at https://github.com/X2FD/LVIS-INSTRUCT4V.
