

BlackMamba: Mixture of Experts for State-Space Models

February 1, 2024
Authors: Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge
cs.AI

Abstract

State-space models (SSMs) have recently demonstrated competitive performance to transformers at large-scale language modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. Mamba, a recently released SSM model, shows impressive performance in both language modeling and long sequence processing tasks. Simultaneously, mixture-of-expert (MoE) models have shown remarkable performance while significantly reducing the compute and latency costs of inference at the expense of a larger memory footprint. In this paper, we present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both. We demonstrate that BlackMamba performs competitively against both Mamba and transformer baselines, and outperforms in inference and training FLOPs. We fully train and open-source 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens of a custom dataset. We show that BlackMamba inherits and combines both of the benefits of SSM and MoE architectures, combining linear-complexity generation from SSM with cheap and fast inference from MoE. We release all weights, checkpoints, and inference code open-source. Inference code at: https://github.com/Zyphra/BlackMamba
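
To make the architectural idea in the abstract concrete, below is a minimal, self-contained PyTorch sketch of the pattern it describes: a sequence-mixing layer alternated with a sparsely routed mixture-of-experts MLP. This is an illustration under stated assumptions, not the authors' implementation. In BlackMamba the sequence mixer is Mamba's selective SSM (see the released code at https://github.com/Zyphra/BlackMamba); here a causal depthwise convolution stands in for it, and the top-1 router, module names, and layer sizes are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConvMixer(nn.Module):
    """Stand-in sequence mixer (BlackMamba uses a Mamba selective-SSM block here)."""
    def __init__(self, d_model: int, kernel: int = 4):
        super().__init__()
        self.kernel = kernel
        self.conv = nn.Conv1d(d_model, d_model, kernel, groups=d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); left-pad so each position sees only the past
        y = F.pad(x.transpose(1, 2), (self.kernel - 1, 0))
        return self.conv(y).transpose(1, 2)


class MoEMLP(nn.Module):
    """Token-level MLP experts with top-1 routing (the routing rule is assumed)."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])              # (batch*seq, d_model)
        probs = F.softmax(self.router(tokens), dim=-1)   # routing distribution
        top_p, top_idx = probs.max(dim=-1)               # one expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            sel = top_idx == e
            if sel.any():
                # only the chosen expert runs for each token -> sparse compute
                out[sel] = top_p[sel].unsqueeze(-1) * expert(tokens[sel])
        return out.reshape_as(x)


class BlackMambaStyleBlock(nn.Module):
    """One residual block: sequence mixer followed by an MoE MLP."""
    def __init__(self, d_model: int = 64, d_ff: int = 256, n_experts: int = 8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mixer = CausalConvMixer(d_model)   # Mamba SSM block in the real model
        self.moe = MoEMLP(d_model, d_ff, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))
        x = x + self.moe(self.norm2(x))
        return x


if __name__ == "__main__":
    block = BlackMambaStyleBlock()
    h = torch.randn(2, 16, 64)    # (batch, sequence, d_model), toy sizes
    print(block(h).shape)         # torch.Size([2, 16, 64])
```

The routing loop shows where the efficiency claim comes from: each token activates only one expert MLP, so per-token inference compute scales with the activated parameters rather than the total parameter count, while a recurrent or convolutional sequence mixer keeps generation cost linear in sequence length, as the abstract states for the combination of SSM and MoE.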