To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

Abstract

Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Although these methods achieve promising performance, the descriptions are derived from image annotations, which are often coarse-grained. Moreover, because the instructions are generated without observing the full visual context, they may even contradict the visual content. To address this challenge, we introduce LVIS-Instruct4V, a fine-grained visual instruction dataset containing 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Through experimental validation and case studies, we demonstrate that high-quality visual instruction data can improve the performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a wide spectrum of benchmarks by clear margins. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on the most challenging LMM benchmarks, e.g., LLaVA^w (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4). We release our data and model at https://github.com/X2FD/LVIS-INSTRUCT4V.
