MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Abstract

State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based LLMs, including recent state-of-the-art open-source models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable, Transformer-like performance. Our model, MoE-Mamba, outperforms both Mamba and Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in 2.2x fewer training steps while preserving the inference performance gains of Mamba against the Transformer.
