COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Abstract

Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This may be partly because Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume rather than the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset while explicitly controlling the compositional complexity of its examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities and thereby learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves performance comparable to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several of them, especially those involving complex multi-capability tasks. For example, on particularly complex questions that require four or more atomic capabilities, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0% improvement on MM-Vet compared to the full-scale VIT. COMPACT offers a scalable, data-efficient visual compositional tuning recipe for improving performance on complex vision-language tasks.
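As a rough illustration of what controlling compositional complexity could look like, the sketch below enumerates combinations of k atomic capabilities and draws the same number of combinations for each k. It is not the authors' pipeline: the capability names merely echo the three examples in the abstract, and the function `sample_capability_sets` and its parameters are hypothetical.

```python
from itertools import combinations
import random

# Assumption: illustrative names taken from the examples in the abstract,
# not the actual capability list used by COMPACT.
ATOMIC_CAPABILITIES = ["object_recognition", "counting", "spatial_relationship"]


def sample_capability_sets(capabilities, k_values, per_k, seed=0):
    """Hypothetical helper: for each complexity level k, draw `per_k`
    combinations of k distinct atomic capabilities (with replacement)."""
    rng = random.Random(seed)
    plan = {}
    for k in k_values:
        # All unordered sets of k distinct capabilities.
        pool = list(combinations(capabilities, k))
        plan[k] = [rng.choice(pool) for _ in range(per_k)]
    return plan


plan = sample_capability_sets(ATOMIC_CAPABILITIES, k_values=(1, 2, 3), per_k=4)
for k, combos in plan.items():
    print(k, combos)
```

Each sampled combination would then be the target for one generated question, so the share of examples at each complexity level is set by `per_k` instead of being left to chance.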
