To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
November 13, 2023
Authors: Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Existing visual instruction tuning methods typically prompt large language
models with textual descriptions to generate instruction-following data.
Despite the promising performance achieved, these descriptions are derived from
image annotations, which are often coarse-grained. Moreover, because the
language model never observes the image itself, the generated instructions may
even contradict the visual content. To address this challenge, we introduce a fine-grained
visual instruction dataset, LVIS-Instruct4V, which contains 220K visually
aligned and context-aware instructions produced by prompting the powerful
GPT-4V with images from LVIS. Through experimental validation and case studies,
we demonstrate that high-quality visual instruction data can improve the
performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a
wide spectrum of benchmarks by clear margins. Notably, by simply replacing
LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA
on most of the challenging LMM benchmarks, e.g., LLaVA^w (76.7 vs. 70.7) and MM-Vet
(40.2 vs. 35.4). We release our data and model at
https://github.com/X2FD/LVIS-INSTRUCT4V.
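
The data-generation step described above (prompting GPT-4V directly with LVIS images so the instructions stay grounded in what is actually visible) can be pictured with a short sketch. The snippet below is a minimal, hypothetical example assuming the OpenAI Chat Completions API with a GPT-4V-capable model; the prompt wording, model name, and image path are illustrative assumptions, not the authors' released pipeline.

# Sketch (assumption): generate one visually grounded instruction-answer pair
# by showing a GPT-4V-style model an LVIS image, in the spirit of LVIS-Instruct4V.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def encode_image(path: str) -> str:
    """Read a local LVIS image and base64-encode it for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def generate_instruction_pair(image_path: str) -> str:
    """Ask the vision-language model for a QA pair consistent with the image."""
    image_b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed GPT-4V endpoint name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Look carefully at the image and write one detailed, "
                          "fine-grained instruction (question) about its visual "
                          "content, followed by an answer that is fully "
                          "consistent with what is actually visible.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Hypothetical LVIS image path; replace with a real file.
    print(generate_instruction_pair("lvis/000000123456.jpg"))

Because the image itself is passed to the model, the resulting instructions are anchored to the visual context rather than to coarse textual annotations, which is the key difference from caption-driven pipelines such as LLaVA-Instruct.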