MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
January 29, 2024
Authors: Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan
cs.AI
Abstract
For Large Vision-Language Models (LVLMs), scaling the model can effectively improve performance. However, expanding model parameters significantly increases training and inference costs, as all model parameters are activated for each token during computation. In this work, we propose MoE-tuning, a novel training strategy for LVLMs that constructs a sparse model with an outrageously large number of parameters but a constant computational cost, and effectively addresses the performance degradation typically associated with multi-modal learning and model sparsity. Furthermore, we present MoE-LLaVA, a MoE-based sparse LVLM architecture. During deployment, this framework activates only the top-k experts selected by routers, keeping the remaining experts inactive. Our extensive experiments highlight the excellent capabilities of MoE-LLaVA in visual understanding and its potential to reduce hallucinations in model outputs. Remarkably, with just 3 billion sparsely activated parameters, MoE-LLaVA performs comparably to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B on object hallucination benchmarks. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research on more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.
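To make the "activates only the top-k experts through routers" idea concrete, below is a minimal, hypothetical sketch of a top-k routed mixture-of-experts layer in PyTorch. It is an illustration of the general technique, not the authors' actual MoE-LLaVA implementation; the class name TopKMoELayer and all hyperparameters are assumptions for demonstration.

```python
# Minimal sketch of top-k expert routing (illustrative only; not the
# MoE-LLaVA source code). Per token, a router scores all experts, only
# the top-k experts run, and the rest stay inactive, so compute per
# token is roughly constant while total parameters grow with num_experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.router = nn.Linear(hidden_dim, num_experts)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, 4 * hidden_dim),
                nn.GELU(),
                nn.Linear(4 * hidden_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        logits = self.router(x)                               # (tokens, experts)
        weights = F.softmax(logits, dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)   # (tokens, top_k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)    # renormalize gates

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = topk_idx == e                               # (tokens, top_k)
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # expert received no tokens; it stays inactive
            gate = topk_w[token_ids, slot].unsqueeze(-1)
            out[token_ids] += gate * expert(x[token_ids])
        return out


# Usage example: 16 tokens routed through 4 experts with top-2 selection.
tokens = torch.randn(16, 512)
layer = TopKMoELayer(hidden_dim=512, num_experts=4, top_k=2)
print(layer(tokens).shape)  # torch.Size([16, 512])
```

The sketch shows why the abstract can claim a large total parameter count with constant per-token cost: each token only flows through top_k of the num_experts feed-forward blocks, so adding experts increases capacity without increasing the work done per token.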