COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

April 30, 2025
作者: Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This may be partly because Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume rather than on the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset that explicitly controls the compositional complexity of its training examples. COMPACT's data allows MLLMs to train on combinations of atomic capabilities and thereby learn complex capabilities more efficiently. Across all benchmarks, COMPACT matches the performance of LLaVA-665k VIT while using less than 10% of its data budget, and outperforms it on several benchmarks, especially those involving complex multi-capability tasks. For example, on particularly complex questions that require four or more atomic capabilities, COMPACT achieves a substantial 83.3% improvement on MMStar and a 94.0% improvement on MM-Vet over full-scale VIT. COMPACT offers a scalable, data-efficient recipe for visual compositional tuning that improves performance on complex vision-language tasks.
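The core idea, sampling combinations of atomic capabilities at a controlled complexity level k to build a data recipe, can be illustrated with a minimal sketch. The capability names, the `sample_capability_tuples` helper, and the placeholder output below are hypothetical illustrations; the paper's actual pipeline and capability taxonomy are defined in the full text, and a generator model would turn each (image, capability tuple) pair into a QA conversation.

```python
import itertools
import random

# Hypothetical atomic visual capabilities; the real COMPACT taxonomy
# is specified in the paper, not reproduced here.
ATOMIC_CAPABILITIES = [
    "object recognition",
    "counting",
    "spatial relationships",
    "attribute recognition",
    "text recognition",
]

def sample_capability_tuples(k: int, n: int, seed: int = 0):
    """Sample n distinct combinations of k atomic capabilities.

    k is the compositional complexity of a training example:
    k=1 yields atomic questions, k>=2 yields questions that must
    exercise several capabilities at once.
    """
    rng = random.Random(seed)
    all_tuples = list(itertools.combinations(ATOMIC_CAPABILITIES, k))
    rng.shuffle(all_tuples)
    return all_tuples[:n]

# Example: build a recipe that explicitly controls compositional
# complexity, allocating an equal budget to each level k = 1..3.
recipe = {k: sample_capability_tuples(k, n=2) for k in (1, 2, 3)}
for k, tuples in recipe.items():
    for caps in tuples:
        # Placeholder for the generation step: in practice, a generator
        # model would produce a question requiring exactly these capabilities.
        print(f"k={k}: ask a question requiring {' + '.join(caps)}")
```

The design point this sketch captures is that complexity is a sampling parameter rather than an emergent property of scraped data, which is what lets COMPACT trade raw data volume for controlled compositional coverage.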