Mamba: Linear-Time Sequence Modeling with Selective State Spaces
December 1, 2023
Authors: Albert Gu, Tri Dao
cs.AI
Abstract
Foundation models, now powering most of the exciting applications in deep
learning, are almost universally based on the Transformer architecture and its
core attention module. Many subquadratic-time architectures such as linear
attention, gated convolution and recurrent models, and structured state space
models (SSMs) have been developed to address Transformers' computational
inefficiency on long sequences, but they have not performed as well as
attention on important modalities such as language. We identify that a key
weakness of such models is their inability to perform content-based reasoning,
and make several improvements. First, simply letting the SSM parameters be
functions of the input addresses their weakness with discrete modalities,
allowing the model to selectively propagate or forget information along the
sequence length dimension depending on the current token. Second, even though
this change prevents the use of efficient convolutions, we design a
hardware-aware parallel algorithm in recurrent mode. We integrate these
selective SSMs into a simplified end-to-end neural network architecture without
attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5×
higher throughput than Transformers) and linear scaling in sequence length, and
its performance improves on real data up to million-length sequences. As a
general sequence model backbone, Mamba achieves state-of-the-art performance
across several modalities such as language, audio, and genomics. On language
modeling, our Mamba-3B model outperforms Transformers of the same size and
matches Transformers twice its size, both in pretraining and downstream
evaluation.
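
To make the selection mechanism concrete, below is a minimal NumPy sketch of the recurrence the abstract describes: the step size and the B and C matrices are computed from the current input, so the state update can propagate or forget information token by token. The projection names (W_delta, W_B, W_C), the shapes, and the simplified discretization are illustrative assumptions, not the paper's reference implementation, which replaces this sequential loop with a hardware-aware parallel scan.

```python
# A minimal sketch of a selective SSM ("S6") recurrence, assuming illustrative
# shapes and projections; not the paper's reference implementation.
import numpy as np

def selective_ssm_scan(x, A, W_delta, W_B, W_C):
    """Run a selective state space recurrence over a sequence.

    x: (L, D) input sequence; A: (D, N) fixed diagonal state matrix per channel;
    W_delta: (D, D), W_B: (D, N), W_C: (D, N) projections that make the step
    size, input matrix, and output matrix functions of the current input.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))               # hidden state per channel
    y = np.zeros((L, D))
    for t in range(L):                 # recurrent mode: O(L) sequential scan
        delta = np.log1p(np.exp(x[t] @ W_delta))   # softplus -> positive step size, (D,)
        B_t = x[t] @ W_B                           # input-dependent B, (N,)
        C_t = x[t] @ W_C                           # input-dependent C, (N,)
        A_bar = np.exp(delta[:, None] * A)         # ZOH discretization of diagonal A, (D, N)
        B_bar = delta[:, None] * B_t[None, :]      # simplified discretization of B, (D, N)
        h = A_bar * h + B_bar * x[t][:, None]      # selective state update
        y[t] = h @ C_t                             # readout through input-dependent C
    return y

# Example: a length-16 sequence with 4 channels and state size 8 (toy sizes).
rng = np.random.default_rng(0)
L, D, N = 16, 4, 8
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))           # negative entries keep the state stable
y = selective_ssm_scan(x, A,
                       rng.standard_normal((D, D)) * 0.1,
                       rng.standard_normal((D, N)) * 0.1,
                       rng.standard_normal((D, N)) * 0.1)
print(y.shape)  # (16, 4)
```

Because A_bar, B_bar, and C_t vary with the input at every step, the update cannot be rewritten as a fixed convolution over the sequence; this is why the paper instead computes the same recurrence with a parallel scan tuned to the GPU memory hierarchy.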