Mamba: Linear-Time Sequence Modeling with Selective State Spaces
December 1, 2023
Authors: Albert Gu, Tri Dao
cs.AI
Abstract
Foundation models, now powering most of the exciting applications in deep
learning, are almost universally based on the Transformer architecture and its
core attention module. Many subquadratic-time architectures such as linear
attention, gated convolution and recurrent models, and structured state space
models (SSMs) have been developed to address Transformers' computational
inefficiency on long sequences, but they have not performed as well as
attention on important modalities such as language. We identify that a key
weakness of such models is their inability to perform content-based reasoning,
and make several improvements. First, simply letting the SSM parameters be
functions of the input addresses their weakness with discrete modalities,
allowing the model to selectively propagate or forget information along the
sequence length dimension depending on the current token. Second, even though
this change prevents the use of efficient convolutions, we design a
hardware-aware parallel algorithm in recurrent mode. We integrate these
selective SSMs into a simplified end-to-end neural network architecture without
attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5×
higher throughput than Transformers) and linear scaling in sequence length, and
its performance improves on real data up to million-length sequences. As a
general sequence model backbone, Mamba achieves state-of-the-art performance
across several modalities such as language, audio, and genomics. On language
modeling, our Mamba-3B model outperforms Transformers of the same size and
matches Transformers twice its size, both in pretraining and downstream
evaluation.
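
To make the selection mechanism concrete, below is a minimal NumPy sketch of the recurrence the abstract describes: the step size and the B and C matrices are computed from the current input, so the state update can propagate or forget information token by token. The projection names (W_delta, W_B, W_C), the shapes, and the simplified discretization are illustrative assumptions, not the paper's reference implementation, which replaces this sequential loop with a hardware-aware parallel scan.

```python
# A minimal sketch of a selective SSM ("S6") recurrence, assuming illustrative
# shapes and projections; not the paper's reference implementation.
import numpy as np

def selective_ssm_scan(x, A, W_delta, W_B, W_C):
    """Run a selective state space recurrence over a sequence.

    x: (L, D) input sequence; A: (D, N) fixed diagonal state matrix per channel;
    W_delta: (D, D), W_B: (D, N), W_C: (D, N) projections that make the step
    size, input matrix, and output matrix functions of the current input.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))               # hidden state per channel
    y = np.zeros((L, D))
    for t in range(L):                 # recurrent mode: O(L) sequential scan
        delta = np.log1p(np.exp(x[t] @ W_delta))   # softplus -> positive step size, (D,)
        B_t = x[t] @ W_B                           # input-dependent B, (N,)
        C_t = x[t] @ W_C                           # input-dependent C, (N,)
        A_bar = np.exp(delta[:, None] * A)         # ZOH discretization of diagonal A, (D, N)
        B_bar = delta[:, None] * B_t[None, :]      # simplified discretization of B, (D, N)
        h = A_bar * h + B_bar * x[t][:, None]      # selective state update
        y[t] = h @ C_t                             # readout through input-dependent C
    return y

# Example: a length-16 sequence with 4 channels and state size 8 (toy sizes).
rng = np.random.default_rng(0)
L, D, N = 16, 4, 8
x = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))           # negative entries keep the state stable
y = selective_ssm_scan(x, A,
                       rng.standard_normal((D, D)) * 0.1,
                       rng.standard_normal((D, N)) * 0.1,
                       rng.standard_normal((D, N)) * 0.1)
print(y.shape)  # (16, 4)
```

Because A_bar, B_bar, and C_t vary with the input at every step, the update cannot be rewritten as a fixed convolution over the sequence; this is why the paper instead computes the same recurrence with a parallel scan tuned to the GPU memory hierarchy.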