

Ultra-Long Sequence Distributed Transformer

November 4, 2023
Authors: Xiao Wang, Isaac Lyngaas, Aristeidis Tsaris, Peng Chen, Sajal Dash, Mayanka Chandra Shekar, Tao Luo, Hong-Jun Yoon, Mohamed Wahib, John Gounley
cs.AI

Abstract

Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences. Unfortunately, conventional transformers struggle with long-sequence training due to overwhelming computation and memory requirements. Existing methods for long-sequence training offer limited speedup and memory reduction, and may compromise accuracy. This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer (LSS Transformer), for training transformers on long sequences. It distributes a long sequence into segments across GPUs, with each GPU computing a partial self-attention for its segment. It then uses fused communication and a novel double gradient averaging technique to avoid aggregating the partial self-attentions and to minimize communication overhead. We evaluated the LSS Transformer against the state-of-the-art Nvidia sequence parallelism on the Wikipedia enwik8 dataset. Results show that our proposed method is 5.6x faster and 10.2x more memory-efficient than state-of-the-art sequence parallelism on 144 Nvidia V100 GPUs. Moreover, our algorithm scales to an extreme sequence length of 50,112 on 3,456 GPUs, achieving 161% super-linear parallel efficiency and a throughput of 32 petaflops.
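To make the segment-wise idea concrete, below is a minimal sketch of sequence-parallel self-attention, assuming each GPU rank holds one contiguous segment of the sequence and all-gathers the key/value segments so it can compute the attention output for its own query segment. The function name `segment_self_attention` and the tensor layout are illustrative assumptions; this sketch does not reproduce the paper's fused-communication scheme or its double gradient averaging, which the abstract describes as avoiding the aggregation of partial self-attentions.

```python
# Illustrative sketch of segment-wise (sequence-parallel) self-attention.
# Assumptions (not from the paper): each rank holds a [local_len, d_model]
# segment; keys/values are all-gathered; single attention head, no mask.
import math
import torch
import torch.distributed as dist

def segment_self_attention(q_seg, k_seg, v_seg):
    """Compute attention output for this rank's query segment only."""
    world_size = dist.get_world_size()
    # Gather key/value segments from all ranks to reconstruct the full sequence.
    k_parts = [torch.empty_like(k_seg) for _ in range(world_size)]
    v_parts = [torch.empty_like(v_seg) for _ in range(world_size)]
    dist.all_gather(k_parts, k_seg)
    dist.all_gather(v_parts, v_seg)
    k_full = torch.cat(k_parts, dim=0)  # [seq_len, d_model]
    v_full = torch.cat(v_parts, dim=0)  # [seq_len, d_model]
    # Each rank only forms the score rows for its own queries,
    # so activation memory scales with the local segment length.
    scores = q_seg @ k_full.T / math.sqrt(q_seg.shape[-1])  # [local_len, seq_len]
    attn = torch.softmax(scores, dim=-1)
    return attn @ v_full  # [local_len, d_model]
```

In such a setup, gradients would be averaged over both the data-parallel and sequence-parallel process groups; the paper's double gradient averaging and fused communication are aimed at keeping that overhead small, but their exact implementation is beyond this sketch.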