

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

January 29, 2024
Authors: Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan
cs.AI

Abstract

For Large Vision-Language Models (LVLMs), scaling the model can effectively improve performance. However, expanding model parameters significantly increases training and inference costs, as all model parameters are activated for each token in the computation. In this work, we propose a novel training strategy, MoE-tuning, for LVLMs, which can construct a sparse model with an outrageous number of parameters but a constant computational cost, and effectively addresses the performance degradation typically associated with multi-modal learning and model sparsity. Furthermore, we present the MoE-LLaVA framework, a MoE-based sparse LVLM architecture. This framework uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Our extensive experiments highlight the excellent capabilities of MoE-LLaVA in visual understanding and its potential to reduce hallucinations in model outputs. Remarkably, with just 3 billion sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B on object hallucination benchmarks. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.
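The abstract's key mechanism is top-k expert routing: a router scores each token, only the k highest-scoring experts run on that token, and the rest stay inactive, so compute per token stays constant even as total parameters grow. The sketch below is a minimal, generic PyTorch illustration of such a sparse MoE feed-forward layer under assumed dimensions, expert counts, and class names; it is not the paper's actual MoE-LLaVA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    """Minimal sparse MoE feed-forward layer (illustrative, not MoE-LLaVA's code):
    a linear router picks the top-k experts per token; only those experts run."""

    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim) -> flatten tokens for per-token routing
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                      # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)   # keep only the top-k experts
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize routing weights

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # (token, slot) pairs routed to expert e
            token_ids, slot_ids = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # this expert stays inactive for the batch
            out[token_ids] += top_p[token_ids, slot_ids].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)


# Usage: with top_k=2 of 4 experts, each token pays for 2 expert FFNs regardless of num_experts.
layer = SparseMoELayer(hidden_dim=512, ffn_dim=2048)
y = layer(torch.randn(2, 16, 512))  # (batch=2, seq_len=16, hidden_dim=512)
```

Because the routing weights come from a softmax restricted to the selected experts, adding more experts increases total capacity while the per-token FLOPs are fixed by `top_k`, which is the constant-cost property the abstract refers to.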