To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

Abstract

Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Despite the promising performance achieved, these descriptions are derived from image annotations, which are often coarse-grained. Furthermore, because the instructions are generated without observing the entire visual context, they may even contradict the visual content. To address this challenge, we introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Through experimental validation and case studies, we demonstrate that high-quality visual instruction data can improve the performance of LLaVA-1.5, a state-of-the-art large multimodal model (LMM), across a wide spectrum of benchmarks by clear margins. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks, e.g., LLaVA^w (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4). We release our data and model at https://github.com/X2FD/LVIS-INSTRUCT4V.
