LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Abstract

We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between the output distributions so that the student model can emulate the teacher network's understanding. We then introduce preference distillation via Direct Preference Optimization (DPO), where the key is to treat l-MLLM as the reference model. During this phase, the s-MLLM's ability to discriminate between superior and inferior examples is enhanced significantly beyond that of l-MLLM, yielding a student that surpasses its teacher, particularly on hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD outperforms existing models across various multimodal benchmarks while maintaining a minimal number of activated parameters and low computational cost. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of the training data and 23% of the trainable parameters. These results underscore LLaVA-MoD's ability to effectively distill comprehensive knowledge from its teacher model, paving the way for the development of more efficient MLLMs. The code will be available at https://github.com/shufangxun/LLaVA-MoD.
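The two stages of the progressive knowledge transfer strategy can be illustrated with a minimal sketch of their loss terms. This is not the authors' implementation. It assumes a forward KL direction, KL(teacher || student), which the abstract does not specify, and it uses the standard DPO objective with the teacher's sequence log-probabilities standing in for the reference model. The function names, the toy inputs, and the value `beta=0.1` are hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def mimic_kl(teacher_logits, student_logits):
    """Stage 1 (mimic distillation): KL(teacher || student) for one token.

    The KL direction is an assumption; the abstract only says that the KL
    divergence between output distributions is minimized.
    """
    log_p = log_softmax(teacher_logits)  # teacher distribution
    log_q = log_softmax(student_logits)  # student distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_dpo(student_chosen, student_rejected,
                   teacher_chosen, teacher_rejected, beta=0.1):
    """Stage 2 (preference distillation): standard DPO loss for one pair.

    Arguments are sequence log-probabilities of the preferred ("chosen")
    and dispreferred ("rejected") responses. The teacher plays the role of
    the DPO reference model. beta is a hypothetical default.
    """
    margin = beta * ((student_chosen - teacher_chosen)
                     - (student_rejected - teacher_rejected))
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


if __name__ == "__main__":
    # Toy next-token logits over a 4-word vocabulary.
    print(mimic_kl([2.0, 0.5, -1.0, 0.0], [1.0, 1.0, 0.0, 0.0]))
    # Toy sequence log-probabilities for one preference pair.
    print(preference_dpo(-1.0, -3.0, -2.0, -2.0))
```

In this sketch the DPO loss falls below log 2 only when the student separates the chosen and rejected responses more strongly than the teacher does, which is one way to read the abstract's claim that the student's discrimination ability can be pushed beyond that of its teacher.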
