MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

January 8, 2024
Authors: Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Sebastian Jaszczur
cs.AI

Abstract

State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based LLMs, including recent state-of-the-art open-source models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable, Transformer-like performance. Our model, MoE-Mamba, outperforms both Mamba and Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in 2.2x fewer training steps while preserving the inference performance gains of Mamba against the Transformer.
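To make the proposed combination concrete, below is a minimal PyTorch sketch (not the authors' code) of one MoE-Mamba-style block: a Mamba layer for sequence mixing followed by a switch-style (top-1 routed) Mixture-of-Experts feed-forward layer. The `SwitchMoE` class, the expert width `d_ff`, the number of experts, and the pre-norm residual layout are illustrative assumptions; the Mamba layer itself comes from the `mamba_ssm` package, which requires a CUDA build.

```python
# Hypothetical sketch of a MoE-Mamba-style block, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

from mamba_ssm import Mamba  # official Mamba layer; needs the mamba-ssm package and a CUDA GPU


class SwitchMoE(nn.Module):
    """Switch-style MoE feed-forward layer: each token is routed to a single
    expert MLP chosen by a learned router (top-1 routing)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to tokens for routing
        b, s, d = x.shape
        tokens = x.reshape(-1, d)
        gate = F.softmax(self.router(tokens), dim=-1)   # (num_tokens, n_experts)
        weight, expert_idx = gate.max(dim=-1)           # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # scale each expert's output by its routing probability
                out[mask] = weight[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(b, s, d)


class MoEMambaBlock(nn.Module):
    """One block in the spirit of the abstract: a Mamba (SSM) layer for
    sequence mixing followed by a MoE feed-forward layer, each wrapped in
    pre-norm plus a residual connection."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mamba = Mamba(d_model=d_model)  # default d_state / d_conv / expand
        self.norm2 = nn.LayerNorm(d_model)
        self.moe = SwitchMoE(d_model, d_ff, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mamba(self.norm1(x))    # (batch, seq_len, d_model)
        x = x + self.moe(self.norm2(x))
        return x
```

For example, `MoEMambaBlock(d_model=512, d_ff=2048, n_experts=8)` maps a `(batch, seq_len, 512)` tensor to the same shape; stacking such blocks yields a decoder in which only one expert's feed-forward parameters are active per token, which is the sparse-scaling property the abstract appeals to.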