Block-State Transformer
June 15, 2023
Authors: Mahan Fathi, Jonathan Pilault, Pierre-Luc Bacon, Christopher Pal, Orhan Firat, Ross Goroshin
cs.AI
Abstract
State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, SSMs still lag behind Transformer performance in language modeling tasks. In this work, we propose a hybrid layer named the Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization and a Block Transformer sublayer for short-term representation of sequences. We study three different, and completely parallelizable, variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
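
To make the hybrid-layer idea above more concrete, the sketch below gives one plausible reading of it in plain NumPy: a diagonal SSM recurrence produces long-range context states over the full sequence, and a block-wise attention sublayer lets each block attend to its own tokens plus the SSM state collected at the block boundary. Every detail here (the `block_state_layer` name, the scalar SSM parameters `a`/`b`/`c`, single-head non-causal attention, and how the boundary state is injected) is an illustrative assumption, not the paper's exact formulation, which studies three fully parallelizable variants.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention (single head, no causal mask).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def block_state_layer(x, block_len, a, b, c, Wq, Wk, Wv):
    """One hybrid layer: an SSM pass for long-range context, then block-wise
    attention that also attends to the SSM state entering each block."""
    T, D = x.shape
    # SSM sublayer: diagonal, per-feature recurrence. Written as a sequential
    # loop for clarity; the abstract stresses fully parallelizable variants.
    h = np.zeros(D)
    context = np.zeros((T, D))
    for t in range(T):
        h = a * h + b * x[t]          # h_t = a * h_{t-1} + b * x_t
        context[t] = c * h            # y_t = c * h_t
    # Block Transformer sublayer: each block attends to its own tokens plus
    # the long-range context state at the previous block boundary.
    out = np.zeros_like(x)
    for start in range(0, T, block_len):
        blk = x[start:start + block_len]
        ctx = context[start - 1][None, :] if start > 0 else np.zeros((1, D))
        kv = np.concatenate([ctx, blk], axis=0)
        q, k, v = blk @ Wq, kv @ Wk, kv @ Wv
        out[start:start + block_len] = attention(q, k, v)
    return out

# Toy usage: 16 tokens, model width 8, blocks of length 4 (all hypothetical).
rng = np.random.default_rng(0)
T, D, block_len = 16, 8, 4
x = rng.normal(size=(T, D))
a = np.full(D, 0.9)                   # decay of the illustrative SSM state
b = np.ones(D)
c = np.ones(D)
Wq, Wk, Wv = (0.1 * rng.normal(size=(D, D)) for _ in range(3))
print(block_state_layer(x, block_len, a, b, c, Wq, Wk, Wv).shape)  # (16, 8)
```

The point the abstract emphasizes is that both sublayers admit parallel computation across the sequence; in practice the sequential SSM loop above would be replaced by a parallel scan or convolution, and the block-wise attention already runs independently per block.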