MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

September 9, 2024
Authors: Run Luo, Haonan Zhang, Longze Chen, Ting-En Lin, Xiong Liu, Yuchuan Wu, Min Yang, Minzheng Wang, Pengpeng Zeng, Lianli Gao, Heng Tao Shen, Yunshui Li, Xiaobo Xia, Fei Huang, Jingkuan Song, Yongbin Li
cs.AI

Abstract

The development of Multimodal Large Language Models (MLLMs) has seen significant advancements. However, the quantity and quality of multimodal instruction data have emerged as significant bottlenecks in their progress. Manually creating multimodal instruction data is both time-consuming and inefficient, posing challenges in producing instructions of high complexity. Moreover, distilling instruction data from black-box commercial models (e.g., GPT-4o, GPT-4V) often results in simplistic instruction data, which constrains performance to that of these models. The challenge of curating diverse and complex instruction data remains substantial. We propose MMEvol, a novel multimodal instruction data evolution framework that combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution. This iterative approach breaks through data quality bottlenecks to generate a complex and diverse image-text instruction dataset, thereby empowering MLLMs with enhanced capabilities. Beginning with an initial set of instructions, SEED-163K, we utilize MMEvol to systematically broaden the diversity of instruction types, integrate reasoning steps to enhance cognitive capabilities, and extract detailed information from images to improve visual understanding and robustness. To comprehensively evaluate the effectiveness of our data, we train LLaVA-NeXT using the evolved data and conduct experiments across 13 vision-language tasks. Compared to the baseline trained with seed data, our approach achieves an average accuracy improvement of 3.1 points and reaches state-of-the-art (SOTA) performance on 9 of these tasks.
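
The abstract outlines an iterative pipeline: starting from the SEED-163K instructions, three evolution operators (fine-grained perception, cognitive reasoning, and interaction) are repeatedly applied to rewrite instructions into more detailed, reasoning-heavy, and varied forms. The sketch below illustrates one possible shape of such a loop; the operator prompts, the `Sample` fields, and the injected `generate`/`is_valid` hooks are illustrative assumptions, not the authors' released implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch of an instruction-evolution loop in the spirit of MMEvol.
# The three operators mirror the fine-grained perception, cognitive reasoning,
# and interaction evolutions named in the abstract; prompts and hooks are
# placeholders, not the paper's actual code.

@dataclass
class Sample:
    image_id: str
    instruction: str
    response: str

EVOLUTION_PROMPTS = {
    "fine_grained_perception": (
        "Rewrite the instruction so answering it requires attending to finer "
        "visual details (objects, attributes, spatial relations) in the image."
    ),
    "cognitive_reasoning": (
        "Rewrite the instruction so answering it requires explicit, "
        "step-by-step reasoning grounded in the image."
    ),
    "interaction": (
        "Rewrite the instruction into a new interaction format "
        "(e.g., multi-turn dialogue, comparison, or creative task)."
    ),
}

def evolve_round(
    samples: List[Sample],
    generate: Callable[[str, Sample], Sample],
    is_valid: Callable[[Sample], bool],
) -> List[Sample]:
    """Apply one randomly chosen evolution operator to each sample,
    keeping the evolved version only if it passes the validity check."""
    evolved = []
    for sample in samples:
        op = random.choice(list(EVOLUTION_PROMPTS))
        candidate = generate(EVOLUTION_PROMPTS[op], sample)
        evolved.append(candidate if is_valid(candidate) else sample)
    return evolved

def evolve(
    seed: List[Sample],
    generate: Callable[[str, Sample], Sample],
    is_valid: Callable[[Sample], bool],
    rounds: int = 3,
) -> List[Sample]:
    """Iteratively evolve the seed instruction set for a fixed number of rounds."""
    data = seed
    for _ in range(rounds):
        data = evolve_round(data, generate, is_valid)
    return data
```

In such a setup, `generate` would wrap a strong MLLM call that rewrites the instruction-response pair given the image, and `is_valid` would filter out degenerate or ungrounded rewrites before the next round.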
