To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
November 13, 2023
Authors: Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Existing visual instruction tuning methods typically prompt large language
models with textual descriptions to generate instruction-following data.
Despite the promising performance achieved, these descriptions are derived from
image annotations, which are often coarse-grained. Moreover, because the
language model never observes the image itself, the generated instructions may
even contradict the visual content. To address this challenge, we introduce a fine-grained
visual instruction dataset, LVIS-Instruct4V, which contains 220K visually
aligned and context-aware instructions produced by prompting the powerful
GPT-4V with images from LVIS. Through experimental validation and case studies,
we demonstrate that high-quality visual instruction data can improve the
performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a
wide spectrum of benchmarks by clear margins. Notably, by simply replacing
LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA
on most of the challenging LMM benchmarks, e.g., LLaVA^w (76.7 vs. 70.7) and MM-Vet
(40.2 vs. 35.4). We release our data and model at
https://github.com/X2FD/LVIS-INSTRUCT4V.
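
The data-generation step described above (prompting GPT-4V directly with LVIS images so the instructions stay grounded in what is actually visible) can be pictured with a short sketch. The snippet below is a minimal, hypothetical example assuming the OpenAI Chat Completions API with a GPT-4V-capable model; the prompt wording, model name, and image path are illustrative assumptions, not the authors' released pipeline.

# Sketch (assumption): generate one visually grounded instruction-answer pair
# by showing a GPT-4V-style model an LVIS image, in the spirit of LVIS-Instruct4V.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(path: str) -> str:
    """Read a local LVIS image and base64-encode it for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def generate_instruction_pair(image_path: str) -> str:
    """Ask the vision-language model for a QA pair consistent with the image."""
    image_b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed GPT-4V endpoint name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Look carefully at the image and write one detailed, "
                          "fine-grained instruction (question) about its visual "
                          "content, followed by an answer that is fully "
                          "consistent with what is actually visible.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Hypothetical LVIS image path; replace with a real file.
    print(generate_instruction_pair("lvis/000000123456.jpg"))

Because the image itself is passed to the model, the resulting instructions are anchored to the visual context rather than to coarse textual annotations, which is the key difference from caption-driven pipelines such as LLaVA-Instruct.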