COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Abstract

Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle with complex tasks that require multiple capabilities at once, such as recognizing objects, counting them, and understanding their spatial relationships. This may be partly because Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume rather than on the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset that explicitly controls the compositional complexity of its examples. Data from COMPACT lets MLLMs train on combinations of atomic capabilities and thereby learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves performance comparable to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several, especially those involving complex multi-capability tasks. For example, on particularly complex questions that require four or more atomic capabilities, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0% improvement on MM-Vet compared to the full-scale VIT. COMPACT offers a scalable, data-efficient visual compositional tuning recipe for improving performance on complex vision-language tasks.
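
As a rough illustration of what "controlling compositional complexity" could mean in practice, the sketch below enumerates combinations of atomic capabilities at each complexity level k. It is a minimal sketch under stated assumptions, not the authors' pipeline: the capability list contains only the three capabilities named in the abstract, and the helper names `capability_combinations` and `build_plan` are hypothetical.

```python
from itertools import combinations

# Assumption: only these three capabilities are named in the abstract;
# the full set of atomic capabilities used by COMPACT is not given here.
ATOMIC_CAPABILITIES = ["object recognition", "counting", "spatial relationship"]


def capability_combinations(capabilities, k):
    """Return every k-sized combination of atomic capabilities, in input order."""
    return list(combinations(capabilities, k))


def build_plan(capabilities, max_k):
    """Map each complexity level k (1..max_k) to its capability combinations."""
    return {k: capability_combinations(capabilities, k) for k in range(1, max_k + 1)}


# Hypothetical usage: each combination could seed questions that require
# exactly those capabilities together.
plan = build_plan(ATOMIC_CAPABILITIES, 3)
for k, combos in plan.items():
    print(k, combos)
```

Each combination at level k would then correspond to training questions that need those k capabilities at once. How such questions are actually generated and filtered is not described in the abstract.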
