BlackMamba: Mixture of Experts for State-Space Models
February 1, 2024
Authors: Quentin Anthony, Yury Tokpanov, Paolo Glorioso, Beren Millidge
cs.AI
Abstract
State-space models (SSMs) have recently demonstrated competitive performance
to transformers at large-scale language modeling benchmarks while achieving
linear time and memory complexity as a function of sequence length. Mamba, a
recently released SSM model, shows impressive performance in both language
modeling and long sequence processing tasks. Simultaneously, mixture-of-expert
(MoE) models have shown remarkable performance while significantly reducing the
compute and latency costs of inference at the expense of a larger memory
footprint. In this paper, we present BlackMamba, a novel architecture that
combines the Mamba SSM with MoE to obtain the benefits of both. We demonstrate
that BlackMamba performs competitively against both Mamba and transformer
baselines, and outperforms them in inference and training FLOPs. We fully train and
open-source 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens of a
custom dataset. We show that BlackMamba inherits and combines both of the
benefits of SSM and MoE architectures, combining linear-complexity generation
from SSM with cheap and fast inference from MoE. We release all weights,
checkpoints, and inference code open-source. Inference code at:
https://github.com/Zyphra/BlackMamba
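
The abstract describes an architecture that alternates Mamba SSM sequence mixing with routed mixture-of-experts MLPs. Below is a minimal PyTorch sketch of that layer pattern, assuming a pre-norm residual layout and top-1 routing; the sequence mixer is only a stand-in for the actual Mamba kernel, and all names and sizes (SequenceMixerStub, MoEMLP, d_model=512, n_experts=8) are illustrative assumptions, not the released 340M/1.5B or 630M/2.8B configurations.

```python
# Hypothetical sketch of a BlackMamba-style block: a residual sequence mixer
# (placeholder for the Mamba SSM) alternated with a top-1 routed MoE MLP.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequenceMixerStub(nn.Module):
    """Placeholder for the Mamba SSM mixer (linear-time in sequence length)."""
    def __init__(self, d_model: int, d_conv: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.conv = nn.Conv1d(d_model, d_model, d_conv, padding=d_conv - 1,
                              groups=d_model)  # causal depthwise conv
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        u = self.conv(u.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.out_proj(F.silu(u) * F.silu(gate))


class MoEMLP(nn.Module):
    """Top-1 routed mixture of expert MLPs: one expert runs per token."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        flat = x.reshape(-1, x.size(-1))                 # (tokens, d_model)
        probs = F.softmax(self.router(flat), dim=-1)
        top_p, top_i = probs.max(dim=-1)                 # top-1 routing
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            mask = top_i == e
            if mask.any():
                out[mask] = top_p[mask, None] * expert(flat[mask])
        return out.view_as(x)


class BlackMambaBlock(nn.Module):
    """One layer: pre-norm sequence mixer, then pre-norm MoE MLP, both residual."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mixer = SequenceMixerStub(d_model)
        self.moe = MoEMLP(d_model, d_ff, n_experts)

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        x = x + self.moe(self.norm2(x))
        return x


if __name__ == "__main__":
    block = BlackMambaBlock()
    tokens = torch.randn(2, 16, 512)
    print(block(tokens).shape)  # torch.Size([2, 16, 512])
```

In the paper's framing, the routed MoE MLP keeps inference compute and latency low because only a fraction of the parameters are active per token, while the SSM mixer keeps generation cost linear in sequence length; the released implementation should be consulted for the actual block design at https://github.com/Zyphra/BlackMamba.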