MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
September 9, 2024
Authors: Run Luo, Haonan Zhang, Longze Chen, Ting-En Lin, Xiong Liu, Yuchuan Wu, Min Yang, Minzheng Wang, Pengpeng Zeng, Lianli Gao, Heng Tao Shen, Yunshui Li, Xiaobo Xia, Fei Huang, Jingkuan Song, Yongbin Li
cs.AI
Abstract
The development of Multimodal Large Language Models (MLLMs) has seen
significant advancements. However, the quantity and quality of multimodal
instruction data have emerged as significant bottlenecks in their progress.
Manually creating multimodal instruction data is both time-consuming and
inefficient, posing challenges in producing instructions of high complexity.
Moreover, distilling instruction data from black-box commercial models (e.g.,
GPT-4o, GPT-4V) often results in simplistic instruction data, which constrains
performance to that of these models. The challenge of curating diverse and
complex instruction data remains substantial. We propose MMEvol, a novel
multimodal instruction data evolution framework that combines fine-grained
perception evolution, cognitive reasoning evolution, and interaction evolution.
This iterative approach breaks through data quality bottlenecks to generate a
complex and diverse image-text instruction dataset, thereby empowering MLLMs
with enhanced capabilities. Beginning with an initial set of instructions,
SEED-163K, we utilize MMEvol to systematically broaden the diversity of
instruction types, integrate reasoning steps to enhance cognitive
capabilities, and extract detailed information from images to improve visual
understanding and robustness. To comprehensively evaluate the effectiveness of
our data, we train LLaVA-NeXT using the evolved data and conduct experiments
across 13 vision-language tasks. Compared to the baseline trained with seed
data, our approach achieves an average accuracy improvement of 3.1 points and
reaches state-of-the-art (SOTA) performance on 9 of these tasks.
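The abstract describes MMEvol only at a high level: starting from a seed instruction set and iteratively applying three evolution directions (fine-grained perception, cognitive reasoning, and interaction evolution), keeping the evolved samples that pass quality checks. The sketch below is a minimal, hypothetical illustration of such a loop based solely on that description; the function names, prompts, filtering criterion, and number of rounds are assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an MMEvol-style iterative instruction-evolution loop.
# All names (evolve_*, mmevol_round, call_llm, keep) are illustrative assumptions.
import random
from typing import Callable


def evolve_fine_grained_perception(sample: dict, call_llm: Callable[[str], str]) -> dict:
    """Enrich the instruction so answering it requires finer-grained visual details."""
    prompt = f"Rewrite this instruction to demand finer-grained visual details:\n{sample['instruction']}"
    return {**sample, "instruction": call_llm(prompt)}


def evolve_cognitive_reasoning(sample: dict, call_llm: Callable[[str], str]) -> dict:
    """Rewrite the instruction so answering it requires explicit multi-step reasoning."""
    prompt = f"Rewrite this instruction so it needs multi-step reasoning:\n{sample['instruction']}"
    return {**sample, "instruction": call_llm(prompt)}


def evolve_interaction(sample: dict, call_llm: Callable[[str], str]) -> dict:
    """Diversify the instruction format (e.g., new task types or interaction styles)."""
    prompt = f"Rewrite this instruction into a new, more diverse interaction format:\n{sample['instruction']}"
    return {**sample, "instruction": call_llm(prompt)}


EVOLUTIONS = [evolve_fine_grained_perception, evolve_cognitive_reasoning, evolve_interaction]


def mmevol_round(data: list[dict], call_llm: Callable[[str], str],
                 keep: Callable[[dict], bool]) -> list[dict]:
    """One evolution round: evolve each sample in a randomly chosen direction and
    retain only samples that pass a quality filter (the filter is an assumption)."""
    evolved = [random.choice(EVOLUTIONS)(s, call_llm) for s in data]
    return [s for s in evolved if keep(s)]


# Usage (illustrative): iterate a few rounds over the seed set (SEED-163K in the
# paper), then train the MLLM (LLaVA-NeXT in the paper) on the accumulated data.
# data = load_seed_instructions()                      # hypothetical loader
# for _ in range(3):                                   # round count is an assumption
#     data += mmevol_round(data, call_llm, keep=lambda s: True)
```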