MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Abstract

The development of Multimodal Large Language Models (MLLMs) has advanced significantly. However, the quantity and quality of multimodal instruction data have emerged as major bottlenecks in their progress. Manually creating multimodal instruction data is both time-consuming and inefficient, which makes it difficult to produce instructions of high complexity. Moreover, distilling instruction data from black-box commercial models (e.g., GPT-4o, GPT-4V) often yields simplistic instruction data, which caps performance at the level of those models. Curating diverse and complex instruction data therefore remains a substantial challenge. We propose MMEvol, a novel multimodal instruction data evolution framework that combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution. This iterative approach breaks through data quality bottlenecks to generate a complex and diverse image-text instruction dataset, thereby empowering MLLMs with enhanced capabilities. Beginning with an initial set of instructions, SEED-163K, we use MMEvol to systematically broaden the diversity of instruction types, integrate reasoning steps to enhance cognitive capabilities, and extract detailed information from images to improve visual understanding and robustness. To comprehensively evaluate the effectiveness of our data, we train LLaVA-NeXT on the evolved data and conduct experiments across 13 vision-language tasks. Compared to the baseline trained with seed data, our approach achieves an average accuracy improvement of 3.1 points and reaches state-of-the-art (SOTA) performance on 9 of these tasks.
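
The iterative evolution described in the abstract can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the `Sample` type, the `evolve` function, the `stub_rewriter` placeholder (which stands in for a model call), and the round-robin schedule over the three evolution directions are all hypothetical. The abstract names the three directions but does not specify how they are scheduled or applied.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

# The three evolution directions named in the abstract.
EVOLUTIONS = ("fine_grained_perception", "cognitive_reasoning", "interaction")


@dataclass(frozen=True)
class Sample:
    """Hypothetical image-text instruction record."""
    image_id: str
    instruction: str
    rounds: int = 0  # number of evolution rounds applied so far


# Hypothetical rewriter signature: (sample, evolution name) -> new instruction.
Rewriter = Callable[[Sample, str], str]


def stub_rewriter(sample: Sample, evolution: str) -> str:
    """Placeholder for a model call; it only tags the instruction text."""
    return f"{sample.instruction} [{evolution}]"


def evolve(seed: List[Sample], rewriter: Rewriter, num_rounds: int) -> List[Sample]:
    """Apply one evolution direction per round to every sample."""
    data = list(seed)
    for r in range(num_rounds):
        # Assumption: cycle through the directions in order; the abstract
        # does not say how a direction is chosen for each round.
        evolution = EVOLUTIONS[r % len(EVOLUTIONS)]
        data = [
            replace(s, instruction=rewriter(s, evolution), rounds=s.rounds + 1)
            for s in data
        ]
    return data


if __name__ == "__main__":
    seed = [Sample("img_0", "Describe the image.")]
    for s in evolve(seed, stub_rewriter, num_rounds=3):
        print(s.rounds, s.instruction)
```

In a real pipeline the rewriter would prompt a model with the image and the current instruction; the stub here only makes the control flow visible.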
