LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
August 28, 2024
Authors: Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, Siming Fu, Haoyuan Li, Bolin Li, Zhelun Yu, Si Liu, Hongsheng Li, Hao Jiang
cs.AI
Abstract
We introduce LLaVA-MoD, a novel framework designed to enable the efficient
training of small-scale Multimodal Language Models (s-MLLM) by distilling
knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental
challenges in MLLM distillation. First, we optimize the network structure of
s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the
language model, striking a balance between computational efficiency and model
expressiveness. Second, we propose a progressive knowledge transfer strategy to
ensure comprehensive knowledge migration. This strategy begins with mimic
distillation, where we minimize the Kullback-Leibler (KL) divergence between
output distributions to enable the student model to emulate the teacher
network's understanding. Following this, we introduce preference distillation
via Direct Preference Optimization (DPO), where the key lies in treating l-MLLM
as the reference model. During this phase, the s-MLLM's ability to discriminate
between superior and inferior examples is significantly enhanced beyond l-MLLM,
leading to a better student that surpasses its teacher, particularly in
hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD
outperforms existing models across various multimodal benchmarks while
maintaining a minimal number of activated parameters and low computational
costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses
Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of
the training data and 23% of the trainable parameters. These results underscore
LLaVA-MoD's ability to effectively distill comprehensive knowledge from its
teacher model, paving the way for the development of more efficient MLLMs. The
code will be available at: https://github.com/shufangxun/LLaVA-MoD.
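
The abstract describes two technical pieces: a sparse Mixture-of-Experts (MoE) language model for the student, and a two-stage distillation recipe (KL-based mimic distillation followed by DPO-style preference distillation with the large teacher as the reference model). The sketches below are not taken from the LLaVA-MoD repository; they are minimal PyTorch illustrations of those ideas, and all layer sizes, names, and hyperparameters are assumptions chosen for clarity.

```python
# Hypothetical sketch of a sparse MoE feed-forward layer with top-k routing,
# illustrating the kind of structure the abstract refers to. Expert count,
# top_k, and layer sizes are illustrative assumptions, not LLaVA-MoD's values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFFN(nn.Module):
    def __init__(self, hidden_dim, ffn_dim, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden_dim)
        gate_probs = F.softmax(self.router(x), dim=-1)
        topk_probs, topk_idx = gate_probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Only the routed (top-k) experts are activated per token, which is
        # what keeps the number of activated parameters small.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The two distillation stages can likewise be summarized as two losses: a KL term that makes the student imitate the teacher's output distribution, and a DPO-style term in which the teacher's log-probabilities play the role of the reference model. This is a hedged sketch assuming the standard formulations of KL distillation and DPO, not the paper's exact implementation.

```python
import torch.nn.functional as F


def mimic_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Stage 1 (sketch): minimize the KL divergence between the teacher's
    and the student's token-level output distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)


def preference_distillation_loss(student_chosen_logps, student_rejected_logps,
                                 teacher_chosen_logps, teacher_rejected_logps,
                                 beta=0.1):
    """Stage 2 (sketch): DPO-style preference loss with the large teacher
    (l-MLLM) used as the reference model."""
    chosen_logratio = student_chosen_logps - teacher_chosen_logps
    rejected_logratio = student_rejected_logps - teacher_rejected_logps
    # Push the student to separate preferred from dispreferred responses
    # more sharply than the reference (teacher) does.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```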