MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
January 29, 2024
Authors: Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan
cs.AI
Abstract
For Large Vision-Language Models (LVLMs), scaling the model can effectively improve performance. However, expanding model parameters significantly increases training and inference costs, as all model parameters are activated for each token during computation. In this work, we propose MoE-tuning, a novel training strategy for LVLMs that constructs a sparse model with an outrageously large number of parameters but a constant computational cost, and effectively addresses the performance degradation typically associated with multi-modal learning and model sparsity. Furthermore, we present MoE-LLaVA, a MoE-based sparse LVLM architecture. During deployment, this framework activates only the top-k experts selected by routers, keeping the remaining experts inactive. Our extensive experiments highlight the excellent capabilities of MoE-LLaVA in visual understanding and its potential to reduce hallucinations in model outputs. Remarkably, with just 3 billion sparsely activated parameters, MoE-LLaVA performs comparably to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B on object hallucination benchmarks. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research on more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.
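To make the "activates only the top-k experts through routers" idea concrete, below is a minimal, hypothetical sketch of a top-k routed mixture-of-experts layer in PyTorch. It is an illustration of the general technique, not the authors' actual MoE-LLaVA implementation; the class name TopKMoELayer and all hyperparameters are assumptions for demonstration.

```python
# Minimal sketch of top-k expert routing (illustrative only; not the
# MoE-LLaVA source code). Per token, a router scores all experts, only
# the top-k experts run, and the rest stay inactive, so compute per
# token is roughly constant while total parameters grow with num_experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.router = nn.Linear(hidden_dim, num_experts)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, 4 * hidden_dim),
                nn.GELU(),
                nn.Linear(4 * hidden_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        logits = self.router(x)                               # (tokens, experts)
        weights = F.softmax(logits, dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)   # (tokens, top_k)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)    # renormalize gates

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = topk_idx == e                               # (tokens, top_k)
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # expert received no tokens; it stays inactive
            gate = topk_w[token_ids, slot].unsqueeze(-1)
            out[token_ids] += gate * expert(x[token_ids])
        return out


# Usage example: 16 tokens routed through 4 experts with top-2 selection.
tokens = torch.randn(16, 512)
layer = TopKMoELayer(hidden_dim=512, num_experts=4, top_k=2)
print(layer(tokens).shape)  # torch.Size([16, 512])
```

The sketch shows why the abstract can claim a large total parameter count with constant per-token cost: each token only flows through top_k of the num_experts feed-forward blocks, so adding experts increases capacity without increasing the work done per token.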