LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Abstract

We introduce LLaVA-MoD, a framework for efficiently training small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from a large-scale MLLM (l-MLLM). Our approach addresses two fundamental challenges in MLLM distillation. First, we optimize the network structure of the s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, balancing computational efficiency against model expressiveness. Second, we propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration. This strategy begins with mimic distillation, in which we minimize the Kullback-Leibler (KL) divergence between the output distributions of the two models so that the student emulates the teacher network's understanding. We then introduce preference distillation via Direct Preference Optimization (DPO), where the key is to treat the l-MLLM as the reference model. During this phase, the s-MLLM's ability to discriminate between superior and inferior examples improves markedly, beyond that of the l-MLLM, yielding a student that surpasses its teacher, particularly on hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD outperforms existing models across various multimodal benchmarks while keeping the number of activated parameters and the computational cost low. Notably, LLaVA-MoD, with only 2B activated parameters, surpasses Qwen-VL-Chat-7B by an average of 8.8% across benchmarks while using merely 0.3% of the training data and 23% of the trainable parameters. These results underscore LLaVA-MoD's ability to distill comprehensive knowledge from its teacher model, paving the way for the development of more efficient MLLMs. The code is available at: https://github.com/shufangxun/LLaVA-MoD.
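The two distillation objectives named above can be illustrated with a minimal, dependency-free sketch. It is not taken from the LLaVA-MoD code base. The function names (`mimic_kd_loss`, `preference_kd_loss`), the KL direction (teacher to student), and the value `beta=0.1` are assumptions made for illustration. The preference loss is the standard DPO objective with the teacher's log-probabilities plugged in as the reference, which is how the abstract describes the role of the l-MLLM.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mimic_kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one next-token distribution.

    Assumption: the forward KL direction is used; the abstract only says
    the KL divergence between output distributions is minimized.
    """
    log_p = log_softmax(teacher_logits)  # teacher (l-MLLM)
    log_q = log_softmax(student_logits)  # student (s-MLLM)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def preference_kd_loss(student_logp_chosen, student_logp_rejected,
                       teacher_logp_chosen, teacher_logp_rejected,
                       beta=0.1):
    """Standard DPO loss with the teacher as the frozen reference model.

    Inputs are sequence log-probabilities of the preferred ("chosen") and
    dispreferred ("rejected") responses. beta is a hypothetical default.
    """
    margin = beta * ((student_logp_chosen - teacher_logp_chosen)
                     - (student_logp_rejected - teacher_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


if __name__ == "__main__":
    # Toy numbers, for illustration only.
    print(mimic_kd_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.5]))
    print(preference_kd_loss(-1.0, -5.0, -2.0, -2.0))
```

In this sketch the preference loss falls below log 2 only when the student separates the chosen and rejected responses more widely than the teacher does, which is one way to read the claim that the student can end up discriminating better than its reference.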
