Block-State Transformer
June 15, 2023
Authors: Mahan Fathi, Jonathan Pilault, Pierre-Luc Bacon, Christopher Pal, Orhan Firat, Ross Goroshin
cs.AI
Abstract
State space models (SSMs) have shown impressive results on tasks that require
modeling long-range dependencies and efficiently scale to long sequences owing
to their subquadratic runtime complexity. Originally designed for continuous
signals, SSMs have shown superior performance on a plethora of tasks, in vision
and audio; however, SSMs still lag Transformer performance in Language Modeling
tasks. In this work, we propose a hybrid layer named Block-State Transformer
(BST), that internally combines an SSM sublayer for long-range
contextualization, and a Block Transformer sublayer for short-term
representation of sequences. We study three different, and completely
parallelizable, variants that integrate SSMs and block-wise attention. We show
that our model outperforms similar Transformer-based architectures on language
modeling perplexity and generalizes to longer sequences. In addition, the
Block-State Transformer demonstrates a more than tenfold increase in speed at the
layer level compared to the Block-Recurrent Transformer when model
parallelization is employed.
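
To make the hybrid-layer idea concrete, below is a minimal, hypothetical PyTorch sketch of how an SSM sublayer and a block-wise attention sublayer could be combined in the spirit of the abstract: a toy SSM summarizes the whole sequence, and each block then attends to its own tokens plus the SSM state carried over from the end of the previous block. The class names (SimpleSSM, BlockStateLayer), the single-context-token design, and all hyperparameters are illustrative assumptions, not the paper's implementation, which relies on FFT-based SSM kernels, causal masking, and several context-state variants for efficiency.

import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    # Toy diagonal state-space layer (stand-in for an S4-style SSM).
    # Per channel it runs x_t = a * x_{t-1} + b * u_t, y_t = c * x_t.
    # A real SSM would use an FFT-based convolution for subquadratic
    # runtime; the sequential scan here is for clarity only.
    def __init__(self, dim):
        super().__init__()
        self.log_a = nn.Parameter(torch.full((dim,), -0.5))  # decay rates, exp(-0.5) < 1
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, u):                        # u: (batch, seq, dim)
        a = torch.exp(self.log_a)
        state = torch.zeros(u.shape[0], u.shape[2], device=u.device)
        outs = []
        for t in range(u.shape[1]):
            state = a * state + self.b * u[:, t]
            outs.append(self.c * state)
        return torch.stack(outs, dim=1)          # (batch, seq, dim)


class BlockStateLayer(nn.Module):
    # Hypothetical Block-State-style hybrid layer: the SSM sublayer
    # provides long-range contextualization, the block-wise attention
    # sublayer provides short-term representations within each block.
    def __init__(self, dim, block_len=64, num_heads=4):
        super().__init__()
        self.block_len = block_len
        self.ssm = SimpleSSM(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, x):                        # x: (batch, seq, dim)
        b, s, d = x.shape
        w = self.block_len
        assert s % w == 0, "sequence length must be a multiple of block_len"

        # 1) Long-range contextualization with the SSM sublayer.
        context = self.ssm(x)                    # (batch, seq, dim)

        # Context state handed to each block: the SSM output at the last
        # position of the previous block (zeros for the first block).
        last = context[:, w - 1::w]              # (batch, seq // w, dim)
        prev = torch.cat([torch.zeros_like(last[:, :1]), last[:, :-1]], dim=1)

        # 2) Short-range representation with block-wise attention.
        # Blocks are folded into the batch, so all blocks run in parallel;
        # the causal mask inside each block is omitted for brevity.
        blocks = x.reshape(b * s // w, w, d)
        ctx = prev.reshape(b * s // w, 1, d)     # one context token per block
        kv = torch.cat([ctx, self.norm1(blocks)], dim=1)
        attn_out, _ = self.attn(self.norm1(blocks), kv, kv)
        blocks = blocks + attn_out
        blocks = blocks + self.ff(self.norm2(blocks))
        return blocks.reshape(b, s, d)


# Usage: a 256-token sequence split into four blocks of 64 tokens.
layer = BlockStateLayer(dim=128, block_len=64, num_heads=4)
out = layer(torch.randn(2, 256, 128))
print(out.shape)                                 # torch.Size([2, 256, 128])

Because the SSM pass and every block's attention are independent of one another, the whole layer can be evaluated in parallel across blocks, which is the property the abstract contrasts with the Block-Recurrent Transformer's sequential recurrence.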