COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Abstract

Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle with complex tasks that require multiple capabilities at once, such as recognizing objects, counting them, and understanding their spatial relationships. This may be partly because Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume rather than on the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset that explicitly controls for the compositional complexity of its training examples. Training on COMPACT data exposes MLLMs to combinations of atomic capabilities, so they learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves performance comparable to LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several benchmarks, especially those involving complex multi-capability tasks. For example, on particularly complex questions that require four or more atomic capabilities, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0% improvement on MM-Vet over the full-scale VIT. COMPACT offers a scalable, data-efficient visual compositional tuning recipe for improving performance on complex vision-language tasks.
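The idea of controlling compositional complexity can be illustrated with a minimal sketch. This is not the authors' pipeline: the capability list is a placeholder (only object recognition, counting, and spatial relationships are named in the abstract), and the helper names `capability_combinations` and `sample_plan`, as well as the even spread over complexity levels, are assumptions made for illustration. The sketch only shows how one might enumerate combinations of k atomic capabilities and draw a fixed budget of them.

```python
from itertools import combinations
import random

# Illustrative atomic capabilities. Only the first three are named in the
# abstract; the last two are hypothetical placeholders.
ATOMIC_CAPABILITIES = [
    "object recognition",
    "counting",
    "spatial relationship",
    "color",             # hypothetical
    "text recognition",  # hypothetical
]


def capability_combinations(k):
    """Return every k-subset of the atomic capabilities (complexity level k)."""
    return list(combinations(ATOMIC_CAPABILITIES, k))


def sample_plan(budget, max_k, seed=0):
    """Draw `budget` capability combinations, cycling over k = 1..max_k.

    The even spread over complexity levels is an assumption of this sketch.
    """
    rng = random.Random(seed)
    plan = []
    for i in range(budget):
        k = 1 + i % max_k  # cycle through complexity levels 1, 2, ..., max_k
        plan.append(rng.choice(capability_combinations(k)))
    return plan


plan = sample_plan(budget=6, max_k=3)
for combo in plan:
    # Each combination would then drive the generation of one training
    # question that requires all listed capabilities at once.
    print(len(combo), combo)
```

In a real setup, each sampled combination would be handed to a question generator together with an image. That step is omitted here because the abstract does not describe it.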
