DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
September 25, 2023
Authors: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, Yuxiong He
cs.AI
Abstract
Computation in a typical Transformer-based large language model (LLM) can be
characterized by batch size, hidden dimension, number of layers, and sequence
length. Until now, system-level work on accelerating LLM training has focused on
the first three dimensions: data parallelism for batch size, tensor parallelism
for hidden size and pipeline parallelism for model depth or layers. These
widely studied forms of parallelism are not targeted or optimized for long
sequence Transformer models. Given practical application needs for long
sequence LLMs, renewed attention is being drawn to sequence parallelism.
However, existing work on sequence parallelism is constrained by memory and
communication inefficiencies, limiting its scalability to long sequence
large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable
and effective methodology for enabling highly efficient and scalable LLM
training with extremely long sequence length. DeepSpeed-Ulysses at its core
partitions input data along the sequence dimension and employs an efficient
all-to-all collective communication for attention computation. Theoretical
communication analysis shows that whereas other methods incur communication
overhead as sequence length increases, DeepSpeed-Ulysses maintains constant
communication volume when sequence length and compute devices are increased
proportionally. Furthermore, experimental evaluations show that
DeepSpeed-Ulysses trains 2.5X faster with a 4X longer sequence length than the
existing SOTA baseline.
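
For a concrete picture of the core idea described above, below is a minimal single-process NumPy sketch, not DeepSpeed's implementation: the "devices" are entries of a Python list, and the shapes `P`, `N`, `H`, `d` and the helper `all_to_all` are illustrative assumptions. It shows how a sequence-sharded Q/K/V layout is switched to a head-sharded layout with an all-to-all, attention is computed locally over the full sequence, the output is switched back, and why the per-device all-to-all volume stays roughly constant when sequence length and device count grow together.

```python
# Single-process sketch of sequence parallelism via all-to-all (toy sizes).
# Real DeepSpeed-Ulysses uses torch.distributed all-to-all collectives across GPUs.
import numpy as np

P, N, H, d = 4, 16, 8, 32   # devices, sequence length, heads, head dim (assumed toy values)
assert N % P == 0 and H % P == 0

def all_to_all(shards, split_axis, concat_axis):
    """Each 'device' splits its shard into P chunks along split_axis;
    device j gathers chunk j from every device and concatenates along concat_axis."""
    chunks = [np.split(s, P, axis=split_axis) for s in shards]
    return [np.concatenate([chunks[i][j] for i in range(P)], axis=concat_axis)
            for j in range(P)]

def local_attention(q, k, v):
    # q, k, v: (N, H/P, d) -- full sequence, local subset of heads
    scores = np.einsum('nhd,mhd->hnm', q, k) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum('hnm,mhd->nhd', weights, v)

rng = np.random.default_rng(0)
# Sequence-parallel layout: each device holds N/P tokens with all H heads.
q_sp = [rng.standard_normal((N // P, H, d)) for _ in range(P)]
k_sp = [rng.standard_normal((N // P, H, d)) for _ in range(P)]
v_sp = [rng.standard_normal((N // P, H, d)) for _ in range(P)]

# All-to-all: switch to head-parallel layout (full sequence, H/P heads per device).
q_hp = all_to_all(q_sp, split_axis=1, concat_axis=0)
k_hp = all_to_all(k_sp, split_axis=1, concat_axis=0)
v_hp = all_to_all(v_sp, split_axis=1, concat_axis=0)

out_hp = [local_attention(q, k, v) for q, k, v in zip(q_hp, k_hp, v_hp)]

# Second all-to-all: return the attention output to the sequence-parallel layout.
out_sp = all_to_all(out_hp, split_axis=0, concat_axis=1)
print([o.shape for o in out_sp])   # P shards of shape (N/P, H, d)

# Rough per-device activation volume moved by the four all-to-alls (Q, K, V, output):
# about 4 * N * H * d / P elements, so doubling N and P together keeps it constant,
# which is the constant-communication property claimed in the abstract.
for n, p in [(N, P), (2 * N, 2 * P)]:
    print(f"N={n:3d} P={p:2d} per-device all-to-all volume ~ {4 * n * H * d // p} elements")
```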