TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing

Abstract

MEGA is a recent transformer-based architecture that uses a linear recurrent operator whose parallel computation, based on the FFT, scales as O(L log L), where L is the sequence length. We build upon this approach by replacing the linear recurrence with a special temporal convolutional network, which permits a larger receptive field with shallower networks and reduces the computational complexity to O(L). The resulting model is called TCNCA, a Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on EnWik8 language modeling, Long Range Arena (LRA) sequence classification, and a synthetic reasoning benchmark, associative recall. On EnWik8, TCNCA outperforms MEGA, reaching a lower loss with a 1.37×/1.24× faster forward/backward pass during training. The dilated convolutions used in TCNCA are consistently and significantly faster than the FFT-based parallelized recurrence on GPUs, making them a scalable candidate for handling very long sequences: they are up to 7.07×/2.86× faster in the forward/backward pass for sequence lengths up to 131k. Furthermore, on LRA, TCNCA achieves an average 1.28× speed-up during inference with accuracy similar to that of MEGA. On associative recall, we find that even a simplified version of TCNCA, without excessive multiplicative and additive interactions, remains superior to or competitive with MEGA across a range of sequence lengths and vocabulary sizes.
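As a rough illustration of why stacked dilated convolutions can cover a long context with few layers at a cost linear in L, the following is a minimal sketch in plain Python. It is not the TCNCA implementation: the kernel size, the dilation schedule (1, 2, 4), and the function names `receptive_field`, `causal_dilated_conv`, and `tcn_stack` are assumptions chosen for illustration only.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked causal dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation], zero-padded on the left.

    The cost is O(L * K) for sequence length L and kernel size K,
    i.e. linear in L for a fixed kernel size.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # positions before the start of the sequence count as zero
                acc += w * x[idx]
        y.append(acc)
    return y


def tcn_stack(x, weights, dilations):
    """Apply one causal dilated convolution per entry in `dilations` (illustrative)."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


# Hypothetical settings: kernel size 2, dilation doubling per layer.
impulse = [1.0] + [0.0] * 15
response = tcn_stack(impulse, [1.0, 1.0], [1, 2, 4])
print(receptive_field(2, [1, 2, 4]))  # 8
print(response)  # nonzero for the first 8 steps, zero afterwards
```

With these assumed settings, three layers already span 8 time steps, and each additional layer with a doubled dilation doubles that span, while every layer still costs a fixed number of operations per time step.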

