Ultra-Long Sequence Distributed Transformer

Abstract

Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences. Unfortunately, conventional transformers struggle with long-sequence training because of its overwhelming computation and memory requirements. Existing methods for long-sequence training offer limited speedup and memory reduction, and may compromise accuracy. This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer (LSS Transformer), for training transformers on long sequences. It distributes a long sequence into segments among GPUs, with each GPU computing a partial self-attention for its segment. It then uses fused communication and a novel double gradient averaging technique to avoid the need to aggregate partial self-attention and to minimize communication overhead. We compared the performance of LSS Transformer with that of the state-of-the-art Nvidia sequence parallelism on the Wikipedia enwik8 dataset. Results show that, on 144 Nvidia V100 GPUs, our proposed method yields an implementation that is 5.6x faster and 10.2x more memory-efficient than state-of-the-art sequence parallelism. Moreover, our algorithm scales to an extreme sequence length of 50,112 on 3,456 GPUs, achieving 161% super-linear parallel efficiency and a throughput of 32 petaflops.
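As a rough illustration of the segment-wise idea only, the following single-process sketch splits a toy sequence into contiguous segments and lets each simulated worker compute attention for the query rows of its own segment. It is not the LSS Transformer implementation: the function names, the toy data, and the assumption that every worker can read all keys and values are hypothetical, and neither the fused communication nor the double gradient averaging is modeled. The outputs are concatenated here purely to check them against unsegmented attention.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def split_into_segments(seq, num_workers):
    """Split a sequence into contiguous, near-equal segments, one per worker."""
    base, extra = divmod(len(seq), num_workers)
    segments, start = [], 0
    for rank in range(num_workers):
        size = base + (1 if rank < extra else 0)
        segments.append(seq[start:start + size])
        start += size
    return segments


def segmented_attention(queries, keys, values, num_workers):
    """Each simulated worker handles only the query rows of its own segment.

    Assumption (not from the source): every worker sees all keys and values.
    """
    partial_outputs = [attention(segment, keys, values)
                       for segment in split_into_segments(queries, num_workers)]
    # Concatenate only to compare against the unsegmented result.
    return [row for part in partial_outputs for row in part]


# Toy sequence: 6 tokens with 4 features each (hypothetical data).
tokens = [[math.sin(i + j) for j in range(4)] for i in range(6)]
full = attention(tokens, tokens, tokens)
sharded = segmented_attention(tokens, tokens, tokens, num_workers=3)
max_diff = max(abs(a - b)
               for full_row, shard_row in zip(full, sharded)
               for a, b in zip(full_row, shard_row))
print("max difference between full and segmented attention:", max_diff)
```

In this sketch each worker produces len(segment) output rows instead of the full sequence length, which is where the per-device savings in activation memory and compute would come from under the stated assumptions.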
