Block-State Transformer
June 15, 2023
Authors: Mahan Fathi, Jonathan Pilault, Pierre-Luc Bacon, Christopher Pal, Orhan Firat, Ross Goroshin
cs.AI
Abstract
State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, SSMs still lag behind Transformer performance in language modeling tasks. In this work, we propose a hybrid layer named the Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization and a Block Transformer sublayer for short-term representation of sequences. We study three different, and completely parallelizable, variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
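
To make the hybrid-layer idea above more concrete, the sketch below gives one plausible reading of it in plain NumPy: a diagonal SSM recurrence produces long-range context states over the full sequence, and a block-wise attention sublayer lets each block attend to its own tokens plus the SSM state collected at the block boundary. Every detail here (the `block_state_layer` name, the scalar SSM parameters `a`/`b`/`c`, single-head non-causal attention, and how the boundary state is injected) is an illustrative assumption, not the paper's exact formulation, which studies three fully parallelizable variants.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention (single head, no causal mask).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def block_state_layer(x, block_len, a, b, c, Wq, Wk, Wv):
    """One hybrid layer: an SSM pass for long-range context, then block-wise
    attention that also attends to the SSM state entering each block."""
    T, D = x.shape
    # SSM sublayer: diagonal, per-feature recurrence. Written as a sequential
    # loop for clarity; the abstract stresses fully parallelizable variants.
    h = np.zeros(D)
    context = np.zeros((T, D))
    for t in range(T):
        h = a * h + b * x[t]          # h_t = a * h_{t-1} + b * x_t
        context[t] = c * h            # y_t = c * h_t
    # Block Transformer sublayer: each block attends to its own tokens plus
    # the long-range context state at the previous block boundary.
    out = np.zeros_like(x)
    for start in range(0, T, block_len):
        blk = x[start:start + block_len]
        ctx = context[start - 1][None, :] if start > 0 else np.zeros((1, D))
        kv = np.concatenate([ctx, blk], axis=0)
        q, k, v = blk @ Wq, kv @ Wk, kv @ Wv
        out[start:start + block_len] = attention(q, k, v)
    return out

# Toy usage: 16 tokens, model width 8, blocks of length 4 (all hypothetical).
rng = np.random.default_rng(0)
T, D, block_len = 16, 8, 4
x = rng.normal(size=(T, D))
a = np.full(D, 0.9)                   # decay of the illustrative SSM state
b = np.ones(D)
c = np.ones(D)
Wq, Wk, Wv = (0.1 * rng.normal(size=(D, D)) for _ in range(3))
print(block_state_layer(x, block_len, a, b, c, Wq, Wk, Wv).shape)  # (16, 8)
```

The point the abstract emphasizes is that both sublayers admit parallel computation across the sequence; in practice the sequential SSM loop above would be replaced by a parallel scan or convolution, and the block-wise attention already runs independently per block.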