DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
September 25, 2023
Authors: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, Yuxiong He
cs.AI
Abstract
Computation in a typical Transformer-based large language model (LLM) can be
characterized by batch size, hidden dimension, number of layers, and sequence
length. Until now, systems work for accelerating LLM training has focused on
the first three dimensions: data parallelism for batch size, tensor parallelism
for hidden size, and pipeline parallelism for model depth or layers. These
widely studied forms of parallelism are not targeted at or optimized for
long-sequence Transformer models. Given the practical application needs of
long-sequence LLMs, renewed attention is being drawn to sequence parallelism.
However, existing work on sequence parallelism is constrained by
memory-communication inefficiency, limiting its scalability to long-sequence
large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable,
and effective methodology for enabling highly efficient and scalable LLM
training with extremely long sequence lengths. At its core, DeepSpeed-Ulysses
partitions input data along the sequence dimension and employs efficient
all-to-all collective communication for attention computation. Theoretical
communication analysis shows that whereas other methods incur communication
overhead that grows with sequence length, DeepSpeed-Ulysses maintains a
constant communication volume when sequence length and the number of compute
devices are increased proportionally. Furthermore, experimental evaluations
show that DeepSpeed-Ulysses trains 2.5x faster at 4x longer sequence lengths
than the existing SOTA baseline.
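
The abstract describes the core mechanism only at a high level: shard the sequence across devices, then use all-to-all exchanges so that each device can attend over the full sequence for a subset of heads. The PyTorch sketch below is an illustration of that layout-transform pattern, not code from DeepSpeed; the function names, tensor layouts, and helper structure are my own choices. It assumes torch.distributed is already initialized and that the number of attention heads is divisible by the sequence-parallel group size.

```python
# Illustrative sketch of all-to-all sequence-parallel attention
# (hypothetical helpers; not the DeepSpeed-Ulysses implementation).
import torch
import torch.distributed as dist
import torch.nn.functional as F


def seq_shard_to_head_shard(x: torch.Tensor, group=None) -> torch.Tensor:
    """[s/P, b, h, d] (local sequence chunk, all heads)
       -> [s, b, h/P, d] (full sequence, local subset of heads)."""
    P = dist.get_world_size(group)
    s_local, b, h, d = x.shape
    # Put the P head groups on dim 0 so all_to_all scatters them across ranks.
    x = x.reshape(s_local, b, P, h // P, d).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # After the exchange, dim 0 indexes the source rank, i.e. the sequence chunk.
    return out.reshape(P * s_local, b, h // P, d)


def head_shard_to_seq_shard(x: torch.Tensor, group=None) -> torch.Tensor:
    """[s, b, h/P, d] -> [s/P, b, h, d]; inverse of the transform above."""
    P = dist.get_world_size(group)
    s, b, h_local, d = x.shape
    # Put the P sequence chunks on dim 0 so all_to_all scatters them back.
    x = x.reshape(P, s // P, b, h_local, d).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # Dim 0 now indexes the source rank, i.e. the head group.
    return out.permute(1, 2, 0, 3, 4).reshape(s // P, b, P * h_local, d)


def ulysses_style_attention(q, k, v, group=None):
    """q, k, v: [s/P, b, h, d] sequence-sharded attention projections."""
    q, k, v = (seq_shard_to_head_shard(t, group) for t in (q, k, v))
    # Each rank now holds the full sequence for its head subset, so any
    # dense or fused attention kernel can run unmodified.
    q, k, v = (t.permute(1, 2, 0, 3) for t in (q, k, v))   # -> [b, h/P, s, d]
    ctx = F.scaled_dot_product_attention(q, k, v)
    ctx = ctx.permute(2, 0, 1, 3).contiguous()             # -> [s, b, h/P, d]
    return head_shard_to_seq_shard(ctx, group)             # -> [s/P, b, h, d]
```

Intuitively, and consistent with the abstract's claim, each rank's all-to-all payload in this pattern is proportional to its local sequence slice of length s/P, so scaling the sequence length and the number of devices together leaves the per-device communication volume roughly constant.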