Mamba: Linear-Time Sequence Modeling with Selective State Spaces
December 1, 2023
Authors: Albert Gu, Tri Dao
cs.AI
Abstract
Foundation models, now powering most of the exciting applications in deep
learning, are almost universally based on the Transformer architecture and its
core attention module. Many subquadratic-time architectures such as linear
attention, gated convolution and recurrent models, and structured state space
models (SSMs) have been developed to address Transformers' computational
inefficiency on long sequences, but they have not performed as well as
attention on important modalities such as language. We identify that a key
weakness of such models is their inability to perform content-based reasoning,
and make several improvements. First, simply letting the SSM parameters be
functions of the input addresses their weakness with discrete modalities,
allowing the model to selectively propagate or forget information along the
sequence length dimension depending on the current token. Second, even though
this change prevents the use of efficient convolutions, we design a
hardware-aware parallel algorithm in recurrent mode. We integrate these
selective SSMs into a simplified end-to-end neural network architecture without
attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher
higher throughput than Transformers) and linear scaling in sequence length, and
its performance improves on real data up to million-length sequences. As a
general sequence model backbone, Mamba achieves state-of-the-art performance
across several modalities such as language, audio, and genomics. On language
modeling, our Mamba-3B model outperforms Transformers of the same size and
matches Transformers twice its size, both in pretraining and downstream
evaluation.
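
To make the selection mechanism described in the abstract concrete, the sketch below is a simplified, sequential NumPy version of a diagonal selective state-space recurrence: the step size Delta and the matrices B and C are computed from the current input token, so the update can keep or discard state depending on content. All shapes, the softplus parameterization of Delta, and the projection variables (W_delta, W_B, W_C) are illustrative assumptions; this is not the paper's hardware-aware parallel scan.

    # Minimal selective SSM recurrence (plain NumPy sketch, not the paper's
    # hardware-aware parallel algorithm). Shapes and projections are assumed
    # for illustration only.
    import numpy as np

    def selective_ssm(x, A, W_delta, W_B, W_C):
        """Run a diagonal selective SSM over an input sequence.

        x       : (L, D)  input sequence of length L with D channels
        A       : (D, N)  fixed negative state matrix, diagonal per channel
        W_delta : (D,)    projection producing the input-dependent step size
        W_B     : (D, N)  projection producing input-dependent B_t
        W_C     : (D, N)  projection producing input-dependent C_t
        returns : (L, D)  output sequence
        """
        L, D = x.shape
        N = A.shape[1]
        h = np.zeros((D, N))          # hidden state: one N-dim state per channel
        y = np.zeros((L, D))
        for t in range(L):
            xt = x[t]                                  # (D,)
            # Selection: Delta, B, C are functions of the current input token.
            delta = np.log1p(np.exp(xt * W_delta))     # softplus keeps step size > 0
            B_t = xt[:, None] * W_B                    # (D, N)
            C_t = xt[:, None] * W_C                    # (D, N)
            # Discretize the continuous-time parameters with the chosen step size.
            A_bar = np.exp(delta[:, None] * A)         # (D, N), values in (0, 1)
            B_bar = delta[:, None] * B_t               # (D, N)
            # Recurrent update: A_bar gates what is propagated vs. forgotten.
            h = A_bar * h + B_bar * xt[:, None]
            y[t] = (h * C_t).sum(axis=-1)
        return y

    # Toy usage with random parameters.
    rng = np.random.default_rng(0)
    L, D, N = 16, 4, 8
    x = rng.normal(size=(L, D))
    A = -np.exp(rng.normal(size=(D, N)))               # negative A keeps dynamics stable
    out = selective_ssm(x, A, rng.normal(size=D),
                        rng.normal(size=(D, N)), rng.normal(size=(D, N)))
    print(out.shape)  # (16, 4)

Because the computation is a single pass over the sequence, its cost grows linearly with sequence length, which is the scaling behavior the abstract refers to; the paper's contribution includes computing this recurrence efficiently on modern accelerators despite the loss of the convolutional form.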