Block-State Transformer

Abstract

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies, and they scale efficiently to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, they still lag behind Transformers on language modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization with a Block Transformer sublayer for short-term representation of sequences. We study three different, completely parallelizable variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.
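
To make the division of labor in the hybrid layer concrete, the following is a minimal sketch of the general idea only: a linear state space recurrence carries information across the whole sequence, while causal attention is restricted to fixed-length blocks. It is not one of the three variants studied in this work. The function names (`ssm_scan`, `block_attention`, `hybrid_layer`), the scalar parameters `a` and `b`, and the additive combination of the two paths are all assumptions made for illustration. The recurrence is also written as a sequential loop for readability, whereas practical SSM layers are typically evaluated in parallel.

```python
import math

def ssm_scan(xs, a=0.9, b=0.1):
    """Toy diagonal SSM: h_t = a * h_{t-1} + b * x_t, applied per feature.

    `a` and `b` are hypothetical fixed scalars; real SSMs learn their parameters.
    """
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [a * h_i + b * x_i for h_i, x_i in zip(h, x)]
        states.append(h)
    return states

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def block_attention(xs, block_len):
    """Causal self-attention restricted to non-overlapping blocks.

    Queries, keys, and values are the raw inputs (no learned projections),
    which keeps the sketch short; this is an illustrative simplification.
    """
    d = len(xs[0])
    out = []
    for start in range(0, len(xs), block_len):
        block = xs[start:start + block_len]
        for i, q in enumerate(block):
            keys = block[:i + 1]  # attend only to this token and earlier ones in the block
            scores = [sum(qj * kj for qj, kj in zip(q, k)) / math.sqrt(d) for k in keys]
            weights = softmax(scores)
            out.append([sum(w * k[j] for w, k in zip(weights, keys)) for j in range(d)])
    return out

def hybrid_layer(xs, block_len):
    """Assumed combination: residual input + long-range SSM state + local attention."""
    long_range = ssm_scan(xs)
    local = block_attention(xs, block_len)
    return [
        [x_j + c_j + l_j for x_j, c_j, l_j in zip(x, c, l)]
        for x, c, l in zip(xs, long_range, local)
    ]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
ys = hybrid_layer(xs, block_len=2)
```

In this sketch the attention path costs on the order of `len(xs) * block_len` score computations rather than `len(xs) ** 2`, and only the SSM path lets a token see inputs from earlier blocks.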