LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
August 28, 2024
Authors: Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, Siming Fu, Haoyuan Li, Bolin Li, Zhelun Yu, Si Liu, Hongsheng Li, Hao Jiang
cs.AI
Abstract
We introduce LLaVA-MoD, a novel framework designed to enable the efficient
training of small-scale Multimodal Language Models (s-MLLM) by distilling
knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental
challenges in MLLM distillation. First, we optimize the network structure of
s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the
language model, striking a balance between computational efficiency and model
expressiveness. Second, we propose a progressive knowledge transfer strategy to
ensure comprehensive knowledge migration. This strategy begins with mimic
distillation, where we minimize the Kullback-Leibler (KL) divergence between
output distributions to enable the student model to emulate the teacher
network's understanding. Following this, we introduce preference distillation
via Direct Preference Optimization (DPO), where the key lies in treating l-MLLM
as the reference model. During this phase, the s-MLLM's ability to discriminate
between superior and inferior examples is significantly enhanced beyond l-MLLM,
leading to a better student that surpasses its teacher, particularly in
hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD
outperforms existing models across various multimodal benchmarks while
maintaining a minimal number of activated parameters and low computational
costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses
Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of
the training data and 23% of the trainable parameters. These results underscore
LLaVA-MoD's ability to effectively distill comprehensive knowledge from its
teacher model, paving the way for the development of more efficient MLLMs. The
code will be available at: https://github.com/shufangxun/LLaVA-MoD.
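The abstract names a sparse Mixture-of-Experts (MoE) architecture inside the language model but does not specify it. As a rough, illustrative sketch only (not the authors' implementation; the class name, number of experts, and top-k routing are assumptions), a sparse MoE feed-forward layer that activates only a few experts per token could look like this in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    """Hypothetical sparse MoE feed-forward layer: a router selects the
    top-k experts per token, so only a fraction of parameters is activated."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim) -> flatten tokens for per-token routing
        b, s, d = x.shape
        tokens = x.reshape(-1, d)
        gate_logits = self.router(tokens)                      # (num_tokens, num_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                   # renormalize over the chosen experts
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(b, s, d)
```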
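The mimic-distillation stage minimizes a KL divergence between the teacher's and the student's output distributions. The following is a minimal sketch of such a token-level loss; the KL direction, temperature scaling, and function name are assumptions for illustration, not details taken from the paper:

```python
import torch
import torch.nn.functional as F

def mimic_distillation_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """Hypothetical mimic-distillation loss: KL between teacher and student
    next-token distributions, with logits of shape (batch, seq_len, vocab_size).
    The teacher is treated as a frozen target."""
    t_log_probs = F.log_softmax(teacher_logits.detach() / temperature, dim=-1)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div takes the student's log-probs as input and the teacher's as (log) target
    return F.kl_div(s_log_probs, t_log_probs, log_target=True,
                    reduction="batchmean") * temperature ** 2
```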
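For the preference-distillation stage, the abstract states only that DPO is used with the l-MLLM as the reference model. A standard DPO loss with the teacher playing that frozen-reference role could be sketched as below; the beta value and the per-response summed log-probability inputs are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(student_chosen_logp: torch.Tensor,
                        student_rejected_logp: torch.Tensor,
                        teacher_chosen_logp: torch.Tensor,
                        teacher_rejected_logp: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective with the teacher (l-MLLM) as the frozen reference.
    Each tensor holds per-sample summed log-probabilities of a full response."""
    chosen_rewards = beta * (student_chosen_logp - teacher_chosen_logp.detach())
    rejected_rewards = beta * (student_rejected_logp - teacher_rejected_logp.detach())
    # Encourage the student to prefer the chosen response relative to the reference
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```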