LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
August 28, 2024
Authors: Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, Siming Fu, Haoyuan Li, Bolin Li, Zhelun Yu, Si Liu, Hongsheng Li, Hao Jiang
cs.AI
Abstract
We introduce LLaVA-MoD, a novel framework designed to enable the efficient
training of small-scale Multimodal Language Models (s-MLLM) by distilling
knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental
challenges in MLLM distillation. First, we optimize the network structure of
s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the
language model, striking a balance between computational efficiency and model
expressiveness. Second, we propose a progressive knowledge transfer strategy to
ensure comprehensive knowledge migration. This strategy begins with mimic
distillation, where we minimize the Kullback-Leibler (KL) divergence between
output distributions to enable the student model to emulate the teacher
network's understanding. Following this, we introduce preference distillation
via Direct Preference Optimization (DPO), where the key lies in treating l-MLLM
as the reference model. During this phase, the s-MLLM's ability to discriminate
between superior and inferior examples is significantly enhanced beyond l-MLLM,
leading to a better student that surpasses its teacher, particularly in
hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD
outperforms existing models across various multimodal benchmarks while
maintaining a minimal number of activated parameters and low computational
costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses
Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of
the training data and 23% of the trainable parameters. These results underscore
LLaVA-MoD's ability to effectively distill comprehensive knowledge from its
teacher model, paving the way for the development of more efficient MLLMs. The
code will be available at: https://github.com/shufangxun/LLaVA-MoD.
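
The abstract describes two technical pieces: a sparse Mixture-of-Experts (MoE) language model for the student, and a two-stage distillation recipe (KL-based mimic distillation followed by DPO-style preference distillation with the large teacher as the reference model). The sketches below are not taken from the LLaVA-MoD repository; they are minimal PyTorch illustrations of those ideas, and all layer sizes, names, and hyperparameters are assumptions chosen for clarity.

```python
# Hypothetical sketch of a sparse MoE feed-forward layer with top-k routing,
# illustrating the kind of structure the abstract refers to. Expert count,
# top_k, and layer sizes are illustrative assumptions, not LLaVA-MoD's values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFFN(nn.Module):
    def __init__(self, hidden_dim, ffn_dim, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden_dim)
        gate_probs = F.softmax(self.router(x), dim=-1)
        topk_probs, topk_idx = gate_probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Only the routed (top-k) experts are activated per token, which is
        # what keeps the number of activated parameters small.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

The two distillation stages can likewise be summarized as two losses: a KL term that makes the student imitate the teacher's output distribution, and a DPO-style term in which the teacher's log-probabilities play the role of the reference model. This is a hedged sketch assuming the standard formulations of KL distillation and DPO, not the paper's exact implementation.

```python
import torch.nn.functional as F


def mimic_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Stage 1 (sketch): minimize the KL divergence between the teacher's
    and the student's token-level output distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)


def preference_distillation_loss(student_chosen_logps, student_rejected_logps,
                                 teacher_chosen_logps, teacher_rejected_logps,
                                 beta=0.1):
    """Stage 2 (sketch): DPO-style preference loss with the large teacher
    (l-MLLM) used as the reference model."""
    chosen_logratio = student_chosen_logps - teacher_chosen_logps
    rejected_logratio = student_rejected_logps - teacher_rejected_logps
    # Push the student to separate preferred from dispreferred responses
    # more sharply than the reference (teacher) does.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```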