Mamba: Linear-Time Sequence Modeling with Selective State Spaces
December 1, 2023
Authors: Albert Gu, Tri Dao
cs.AI
Abstract
Foundation models, now powering most of the exciting applications in deep
learning, are almost universally based on the Transformer architecture and its
core attention module. Many subquadratic-time architectures such as linear
attention, gated convolution and recurrent models, and structured state space
models (SSMs) have been developed to address Transformers' computational
inefficiency on long sequences, but they have not performed as well as
attention on important modalities such as language. We identify that a key
weakness of such models is their inability to perform content-based reasoning,
and make several improvements. First, simply letting the SSM parameters be
functions of the input addresses their weakness with discrete modalities,
allowing the model to selectively propagate or forget information along the
sequence length dimension depending on the current token. Second, even though
this change prevents the use of efficient convolutions, we design a
hardware-aware parallel algorithm in recurrent mode. We integrate these
selective SSMs into a simplified end-to-end neural network architecture without
attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher
higher throughput than Transformers) and linear scaling in sequence length, and
its performance improves on real data up to million-length sequences. As a
general sequence model backbone, Mamba achieves state-of-the-art performance
across several modalities such as language, audio, and genomics. On language
modeling, our Mamba-3B model outperforms Transformers of the same size and
matches Transformers twice its size, both in pretraining and downstream
evaluation.
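
To make the selection mechanism described in the abstract concrete, the sketch below is a simplified, sequential NumPy version of a diagonal selective state-space recurrence: the step size Delta and the matrices B and C are computed from the current input token, so the update can keep or discard state depending on content. All shapes, the softplus parameterization of Delta, and the projection variables (W_delta, W_B, W_C) are illustrative assumptions; this is not the paper's hardware-aware parallel scan.

    # Minimal selective SSM recurrence (plain NumPy sketch, not the paper's
    # hardware-aware parallel algorithm). Shapes and projections are assumed
    # for illustration only.
    import numpy as np

    def selective_ssm(x, A, W_delta, W_B, W_C):
        """Run a diagonal selective SSM over an input sequence.

        x       : (L, D)  input sequence of length L with D channels
        A       : (D, N)  fixed negative state matrix, diagonal per channel
        W_delta : (D,)    projection producing the input-dependent step size
        W_B     : (D, N)  projection producing input-dependent B_t
        W_C     : (D, N)  projection producing input-dependent C_t
        returns : (L, D)  output sequence
        """
        L, D = x.shape
        N = A.shape[1]
        h = np.zeros((D, N))          # hidden state: one N-dim state per channel
        y = np.zeros((L, D))
        for t in range(L):
            xt = x[t]                                  # (D,)
            # Selection: Delta, B, C are functions of the current input token.
            delta = np.log1p(np.exp(xt * W_delta))     # softplus keeps step size > 0
            B_t = xt[:, None] * W_B                    # (D, N)
            C_t = xt[:, None] * W_C                    # (D, N)
            # Discretize the continuous-time parameters with the chosen step size.
            A_bar = np.exp(delta[:, None] * A)         # (D, N), values in (0, 1)
            B_bar = delta[:, None] * B_t               # (D, N)
            # Recurrent update: A_bar gates what is propagated vs. forgotten.
            h = A_bar * h + B_bar * xt[:, None]
            y[t] = (h * C_t).sum(axis=-1)
        return y

    # Toy usage with random parameters.
    rng = np.random.default_rng(0)
    L, D, N = 16, 4, 8
    x = rng.normal(size=(L, D))
    A = -np.exp(rng.normal(size=(D, N)))               # negative A keeps dynamics stable
    out = selective_ssm(x, A, rng.normal(size=D),
                        rng.normal(size=(D, N)), rng.normal(size=(D, N)))
    print(out.shape)  # (16, 4)

Because the computation is a single pass over the sequence, its cost grows linearly with sequence length, which is the scaling behavior the abstract refers to; the paper's contribution includes computing this recurrence efficiently on modern accelerators despite the loss of the convolutional form.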