SVIT: Scaling up Visual Instruction Tuning

Abstract

Thanks to the emergence of foundation models, large language and vision models are being integrated to acquire multimodal abilities such as visual captioning, dialogue, and question answering. Although existing multimodal models show impressive performance in visual understanding and reasoning, their limits remain largely under-explored because high-quality instruction tuning data is scarce. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual instruction tuning samples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions. Beyond its volume, the proposed dataset also features high quality and rich diversity, since it is generated by prompting GPT-4 with the abundant manual annotations of the images. We empirically verify that training multimodal models on SVIT can significantly improve multimodal performance in visual perception, reasoning, and planning.

