MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
January 8, 2024
Authors: Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Sebastian Jaszczur
cs.AI
Abstract
State Space Models (SSMs) have become serious contenders in the field of
sequential modeling, challenging the dominance of Transformers. At the same
time, Mixture of Experts (MoE) has significantly improved Transformer-based
LLMs, including recent state-of-the-art open-source models. We propose that to
unlock the potential of SSMs for scaling, they should be combined with MoE. We
showcase this on Mamba, a recent SSM-based model that achieves remarkable,
Transformer-like performance. Our model, MoE-Mamba, outperforms both Mamba and
Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba
in 2.2x fewer training steps while preserving the inference performance gains of
Mamba against the Transformer.
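
The abstract does not spell out how MoE is attached to Mamba. Below is a minimal sketch, assuming the common pattern of interleaving a Mamba (selective SSM) sub-layer with a Switch-style top-1 MoE feed-forward sub-layer; the names SwitchMoE, MoEMambaBlock, and seq_mixer are illustrative and not taken from the paper's code, and the actual Mamba layer would come from the authors' implementation rather than this placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchMoE(nn.Module):
    """Top-1 (Switch-style) mixture-of-experts feed-forward layer (illustrative)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens so routing is per token
        tokens = x.reshape(-1, x.shape[-1])
        gate = F.softmax(self.router(tokens), dim=-1)  # routing probabilities
        weight, idx = gate.max(dim=-1)                 # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # scale each routed token's expert output by its gate probability
                out[mask] = weight[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(x.shape)


class MoEMambaBlock(nn.Module):
    """One interleaved block: a sequence-mixing (Mamba-style) layer followed by an MoE layer."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, seq_mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.seq_mixer = seq_mixer  # placeholder for a real Mamba layer
        self.norm2 = nn.LayerNorm(d_model)
        self.moe = SwitchMoE(d_model, d_ff, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.seq_mixer(self.norm1(x))  # selective-SSM (Mamba) sub-layer
        x = x + self.moe(self.norm2(x))        # sparse MoE feed-forward sub-layer
        return x
```

Replacing SwitchMoE with a plain dense feed-forward layer recovers an ordinary SSM-plus-MLP block, so in this sketch the sparse MoE layer is the only ingredient added by the combination the abstract proposes.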