TCNCA: Temporal Convolution Network with Chunked Attention for Scalable Sequence Processing

Abstract

MEGA is a recent transformer-based architecture that uses a linear recurrent operator whose parallel computation, based on the Fast Fourier Transform (FFT), scales as O(L log L), where L is the sequence length. We build upon this approach by replacing the linear recurrence with a special temporal convolutional network (TCN), which permits a larger receptive field with shallower networks and reduces the computational complexity to O(L). The resulting model is called TCNCA, a Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on EnWik8 language modeling, Long-Range-Arena (LRA) sequence classification, and a synthetic reasoning benchmark, associative recall. On EnWik8, TCNCA outperforms MEGA, reaching a lower loss with a 1.37×/1.24× faster forward/backward pass during training. The dilated convolutions used in TCNCA are consistently and significantly faster on GPUs than the FFT-based parallelized recurrence, making them a scalable candidate for handling very long sequences: they are up to 7.07×/2.86× faster in the forward/backward pass for sequences of length up to 131k. On LRA, TCNCA achieves an average inference speed-up of 1.28× with accuracy similar to that of MEGA. On associative recall, we find that even a simplified version of TCNCA, without excessive multiplicative and additive interactions, remains superior or competitive to MEGA across a range of sequence lengths and vocabulary sizes.
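The claim that a dilated TCN reaches a large receptive field with few layers at O(L) cost can be illustrated with a small sketch. This is not the authors' implementation: the kernel size, the dilation schedule, and the function names `causal_dilated_conv` and `receptive_field` are assumptions chosen only for illustration, and the code is plain Python rather than a GPU kernel.

```python
def causal_dilated_conv(x, kernel, dilation):
    """Causal dilated 1-D convolution.

    y[t] = sum_k kernel[k] * x[t - k * dilation], with x[t] = 0 for t < 0.
    The cost is len(x) * len(kernel), i.e. linear in the sequence length.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # positions before the sequence start count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input positions that one output of the stacked layers depends on."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical configuration (not taken from the paper): kernel size 3 and
# dilations that grow geometrically with depth.
kernel_size = 3
dilations = [kernel_size ** i for i in range(3)]  # 1, 3, 9

# Push a unit impulse through the stack and count how far it spreads.
y = [1.0] + [0.0] * 31
for d in dilations:
    y = causal_dilated_conv(y, [1.0] * kernel_size, d)
span = sum(1 for v in y if v != 0.0)

print(receptive_field(kernel_size, dilations), span)
```

Under these assumed settings, three layers already cover 27 positions, and the receptive field grows geometrically with depth while each layer remains a single linear-time pass over the sequence.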
