
Ultra-Long Sequence Distributed Transformer

November 4, 2023
Authors: Xiao Wang, Isaac Lyngaas, Aristeidis Tsaris, Peng Chen, Sajal Dash, Mayanka Chandra Shekar, Tao Luo, Hong-Jun Yoon, Mohamed Wahib, John Gounley
cs.AI

Abstract

Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences. Unfortunately, conventional transformers struggle with long-sequence training due to the overwhelming computation and memory requirements. Existing methods for long-sequence training offer limited speedup and memory reduction, and may compromise accuracy. This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer (LSS Transformer), for training transformers with long sequences. It distributes a long sequence into segments among GPUs, with each GPU computing a partial self-attention for its segment. It then uses fused communication and a novel double gradient averaging technique to avoid the need to aggregate partial self-attention and to minimize communication overhead. We evaluated the performance of the LSS Transformer against the state-of-the-art Nvidia sequence parallelism on the Wikipedia enwik8 dataset. Results show that our proposed method is 5.6x faster and 10.2x more memory-efficient than state-of-the-art sequence parallelism on 144 Nvidia V100 GPUs. Moreover, our algorithm scales to an extreme sequence length of 50,112 on 3,456 GPUs, achieving 161% super-linear parallel efficiency and a throughput of 32 petaflops.
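To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch of sequence parallelism as described in the abstract: each GPU holds one segment of the token sequence, computes self-attention only within that segment, and gradients are averaged across ranks. The class and function names are illustrative assumptions, and the sketch deliberately omits the paper's fused communication and double gradient averaging; it is not the authors' LSS Transformer implementation.

```python
# Hypothetical sketch of segment-local (sequence-parallel) self-attention.
# Not the LSS Transformer code; a simplified illustration of the idea that
# each rank attends only within its own sequence segment and gradients are
# averaged across ranks afterwards.
import torch
import torch.distributed as dist
from torch import nn


class SegmentSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, local_seq_len, embed_dim) -- this rank's slice of
        # the full sequence; attention is computed only within the segment.
        out, _ = self.attn(segment, segment, segment)
        return out


def average_gradients(model: nn.Module, world_size: int) -> None:
    # Average gradients over all ranks so every rank applies the same update.
    # A simplified stand-in for the paper's double gradient averaging.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
```

In this simplified form, tokens in one segment never attend to tokens in another segment; the paper's contribution lies in recovering full-sequence attention without aggregating the partial results, which the fused communication and double gradient averaging techniques are designed to achieve.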