COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
April 30, 2025
Authors: Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This may be partly because Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume rather than the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset that explicitly controls the compositional complexity of training examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities and thereby learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves performance comparable to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several benchmarks, especially those involving complex multi-capability tasks. For example, on particularly complex questions that require four or more atomic capabilities, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0% improvement on MM-Vet over the full-scale VIT. COMPACT offers a scalable, data-efficient visual compositional tuning recipe for improving performance on complex vision-language tasks.
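
To make the atomic-to-complex composition idea concrete, below is a minimal Python sketch of how a training question requiring exactly k atomic capabilities could be assembled, with k as an explicit knob. The capability list, prompt template, and function names are illustrative assumptions, not the authors' released data-generation pipeline.

# Minimal sketch (an assumption, not the authors' released pipeline): build one
# training question that requires exactly k atomic visual capabilities, so the
# compositional complexity of each example is controlled explicitly.
import random

ATOMIC_CAPABILITIES = {
    "recognition": "identify the objects present",
    "counting": "count how many of each object there are",
    "spatial": "describe the spatial relationships between the objects",
    "color": "state the color of each object",
    "ocr": "read any visible text",
}

def make_compact_example(image_id, k, seed=None):
    """Return a dict describing one example of compositional complexity k."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(ATOMIC_CAPABILITIES), k)
    # Join the k atomic sub-tasks into a single composite instruction.
    sub_tasks = "; ".join(ATOMIC_CAPABILITIES[c] for c in chosen)
    question = f"Looking at the image, {sub_tasks}. Answer in a single response."
    return {"image": image_id, "capabilities": chosen, "complexity": k, "question": question}

if __name__ == "__main__":
    # Sweep from atomic (k=1) to complex (k=3) questions for one image.
    for k in range(1, 4):
        print(make_compact_example("example_image.jpg", k, seed=0))

In the full method the answers to such composed questions must also be produced (for example from ground-truth annotations or a stronger model); this sketch only shows how compositional complexity can be controlled explicitly when generating the questions.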