MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Abstract

For Large Vision-Language Models (LVLMs), scaling the model can effectively improve performance. However, expanding model parameters significantly increases training and inference costs, because all model parameters are activated for each token during computation. In this work, we propose MoE-tuning, a novel training strategy for LVLMs that constructs a sparse model with an outrageous number of parameters but a constant computational cost, and that effectively addresses the performance degradation typically associated with multi-modal learning and model sparsity. Furthermore, we present the MoE-LLaVA framework, an MoE-based sparse LVLM architecture. During deployment, this framework activates only the top-k experts through routers and keeps the remaining experts inactive. Our extensive experiments highlight the excellent capabilities of MoE-LLaVA in visual understanding and its potential to reduce hallucinations in model outputs. Remarkably, with just 3 billion sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to LLaVA-1.5-7B on various visual understanding datasets and even surpasses LLaVA-1.5-13B on object hallucination benchmarks. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.
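
To make the top-k routing idea concrete, the following is a minimal sketch in plain Python, not the MoE-LLaVA implementation. It assumes a linear router that produces one logit per expert, a softmax over those logits, and renormalization of the selected weights so that they sum to one; the abstract does not specify these details. All names (`top_k_route`, `moe_layer`, `make_expert`) and the toy experts are hypothetical and exist only for illustration.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    # Return (expert_index, weight) pairs for the k highest-scoring experts.
    # Assumption: the selected weights are renormalized to sum to 1.
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, experts, router_weights, k=2):
    # Hypothetical linear router: one logit per expert.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    routes = top_k_route(logits, k)
    out = [0.0] * len(token)
    # Only the selected experts are evaluated; the others stay inactive,
    # so compute per token depends on k, not on the total expert count.
    for idx, weight in routes:
        y = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, y)]
    return out, [idx for idx, _ in routes]


def make_expert(scale):
    # Toy expert: scales its input. A real expert would be a feed-forward network.
    return lambda x: [scale * v for v in x]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
token = [2.0, 1.0]

output, active = moe_layer(token, experts, router_weights, k=2)
print("active experts:", active)
print("output:", output)
```

With four experts and k=2, only two expert functions run for this token. That is the sense in which the total parameter count can grow while per-token compute stays roughly constant.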
