To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

Abstract

Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Despite the promising performance achieved, these descriptions are derived from image annotations, which are often coarse-grained. Furthermore, because the language model never observes the full visual context, the generated instructions may even contradict the visual content. To address this challenge, we introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Through experimental validation and case studies, we demonstrate that high-quality visual instructional data can improve the performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a wide spectrum of benchmarks by clear margins. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks, e.g., LLaVA^w (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4). We release our data and model at https://github.com/X2FD/LVIS-INSTRUCT4V.
