
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

September 25, 2023
Authors: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Leon Song, Samyam Rajbhandari, Yuxiong He
cs.AI

Abstract

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system work for accelerating LLM training has focused on the first three dimensions: data parallelism for batch size, tensor parallelism for hidden size, and pipeline parallelism for model depth or number of layers. These widely studied forms of parallelism are not targeted at or optimized for long-sequence Transformer models. Given the practical application needs of long-sequence LLMs, renewed attention is being drawn to sequence parallelism. However, existing work on sequence parallelism is constrained by memory-communication inefficiency, limiting its scalability to long-sequence large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable, and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence lengths. At its core, DeepSpeed-Ulysses partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention computation. Theoretical communication analysis shows that, whereas other methods incur growing communication overhead as sequence length increases, DeepSpeed-Ulysses maintains constant communication volume when sequence length and the number of compute devices are increased proportionally. Furthermore, experimental evaluations show that DeepSpeed-Ulysses trains 2.5x faster at 4x longer sequence lengths than the existing SOTA baseline.
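
To make the core idea concrete, below is a minimal sketch, not DeepSpeed's actual implementation, of the sequence-parallel attention pattern the abstract describes: each rank holds a shard of the tokens for all attention heads, an all-to-all redistributes the projected Q, K, and V so that each rank holds the full sequence for a subset of heads, attention runs locally, and a second all-to-all restores the token-sharded layout. The tensor shapes, function names, and the use of torch.distributed.all_to_all_single here are illustrative assumptions.

```python
# Hypothetical sketch of Ulysses-style sequence-parallel attention.
# Launch with torchrun; heads must be divisible by the world size.
import torch
import torch.distributed as dist
import torch.nn.functional as F


def seq_to_head_parallel(x: torch.Tensor, world_size: int) -> torch.Tensor:
    """[N/P, H, d] (sequence shard, all heads) -> [N, H/P, d] (full sequence, head shard)."""
    n_local, heads, d = x.shape
    h_local = heads // world_size
    # Group heads by destination rank and move that group dimension first,
    # so all_to_all_single can scatter it along dim 0.
    x = x.reshape(n_local, world_size, h_local, d).transpose(0, 1).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x)  # exchanges O(N*h/P) elements per device
    # Received chunks are ordered by source rank, i.e. by sequence position.
    return out.reshape(world_size * n_local, h_local, d)


def head_to_seq_parallel(x: torch.Tensor, world_size: int) -> torch.Tensor:
    """[N, H/P, d] -> [N/P, H, d]; inverse of the transform above."""
    n, h_local, d = x.shape
    n_local = n // world_size
    x = x.reshape(world_size, n_local, h_local, d).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x)
    # out[r] holds this rank's token shard for the heads previously owned by rank r.
    return out.permute(1, 0, 2, 3).reshape(n_local, world_size * h_local, d)


def ulysses_style_attention(q, k, v, world_size):
    # q, k, v: [N/P, H, d] local shards of the projected activations.
    q, k, v = (seq_to_head_parallel(t, world_size) for t in (q, k, v))
    # Local attention over the full sequence, but only H/P heads: [1, H/P, N, d].
    o = F.scaled_dot_product_attention(
        q.transpose(0, 1)[None], k.transpose(0, 1)[None], v.transpose(0, 1)[None]
    )[0].transpose(0, 1).contiguous()
    return head_to_seq_parallel(o, world_size)


if __name__ == "__main__":
    dist.init_process_group("nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    n_local, heads, d = 1024, 16, 64  # toy sizes per rank
    q, k, v = (torch.randn(n_local, heads, d, device="cuda") for _ in range(3))
    out = ulysses_style_attention(q, k, v, world)
    print(rank, out.shape)  # [n_local, heads, d] on every rank
```

In this sketch, each all-to-all moves on the order of N*h/P elements per device (N tokens, hidden size h, P devices), which is consistent with the abstract's claim: if the sequence length and the number of devices grow proportionally, per-device communication volume stays constant.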