TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing

December 9, 2023
Authors: Aleksandar Terzic, Michael Hersche, Geethan Karunaratne, Luca Benini, Abu Sebastian, Abbas Rahimi
cs.AI

Abstract

MEGA is a recent Transformer-based architecture which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as O(L log L), where L is the sequence length. We build upon this approach by replacing the linear recurrence with a special temporal convolutional network, which permits a larger receptive field with shallower networks and reduces the computational complexity to O(L). The resulting model is called TCNCA, a Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on EnWik8 language modeling, Long Range Arena (LRA) sequence classification, and the synthetic reasoning benchmark associative recall. On EnWik8, TCNCA outperforms MEGA, reaching a lower loss with a 1.37×/1.24× faster forward/backward pass during training. The dilated convolutions used in TCNCA are consistently and significantly faster than the FFT-based parallelized recurrence on GPUs, making them a scalable candidate for handling very long sequences: they are up to 7.07×/2.86× faster in the forward/backward pass for sequences up to 131k. On LRA, TCNCA achieves, on average, a 1.28× speed-up during inference while matching MEGA's accuracy. On associative recall, we find that even a simplified version of TCNCA, without excessive multiplicative and additive interactions, remains superior to or competitive with MEGA across a range of sequence lengths and vocabulary sizes.
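
The core architectural move the abstract describes, swapping an FFT-parallelized linear recurrence for dilated causal convolutions and applying attention within fixed-size chunks, can be illustrated with a short sketch. The following is a minimal PyTorch illustration, not the authors' implementation; module and parameter names (`DilatedTCN`, `chunk_size`, the kernel-size-3 / depth-6 configuration) are assumptions made for exposition.

```python
# Minimal sketch of the two ideas named in the abstract: a stack of dilated
# causal convolutions whose receptive field grows exponentially with depth at
# O(L) cost per layer, followed by attention computed independently within
# fixed-size chunks. All names and hyperparameters here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedTCN(nn.Module):
    """Stack of causal dilated 1D convolutions.

    With kernel size k and dilations 1, 2, 4, ..., 2^(depth-1), the
    receptive field is 1 + (k - 1) * (2^depth - 1), so a shallow stack
    covers a long context while every layer costs O(L).
    """
    def __init__(self, dim: int, kernel_size: int = 3, depth: int = 6):
        super().__init__()
        self.layers = nn.ModuleList()
        self.pads = []
        for i in range(depth):
            dilation = 2 ** i
            # Left-only padding keeps each convolution causal.
            self.pads.append((kernel_size - 1) * dilation)
            self.layers.append(
                nn.Conv1d(dim, dim, kernel_size, dilation=dilation)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim) -> Conv1d expects (batch, dim, length)
        h = x.transpose(1, 2)
        for pad, conv in zip(self.pads, self.layers):
            h = F.gelu(conv(F.pad(h, (pad, 0)))) + h  # residual connection
        return h.transpose(1, 2)

def chunked_attention(x: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Self-attention restricted to non-overlapping chunks: O(L * chunk_size)."""
    b, L, d = x.shape
    assert L % chunk_size == 0, "pad the sequence to a multiple of chunk_size"
    xc = x.view(b * L // chunk_size, chunk_size, d)
    # Single-head attention within each chunk (causal mask omitted for brevity).
    attn = torch.softmax(xc @ xc.transpose(1, 2) / d ** 0.5, dim=-1)
    return (attn @ xc).view(b, L, d)

if __name__ == "__main__":
    x = torch.randn(2, 1024, 64)
    y = chunked_attention(DilatedTCN(64)(x), chunk_size=128)
    print(y.shape)  # torch.Size([2, 1024, 64])
```

With kernel size 3 and depth 6, the causal receptive field is 1 + 2·(2^6 − 1) = 127 tokens, and each layer is a plain O(L) convolution; this is the scaling argument the abstract makes against the O(L log L) FFT-based parallel recurrence.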