Zero Bubble Pipeline Parallelism
November 30, 2023
Authors: Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin
cs.AI
Abstract
Pipeline parallelism is one of the key components for large-scale distributed
training, yet its efficiency suffers from pipeline bubbles which were deemed
inevitable. In this work, we introduce a scheduling strategy that, to our
knowledge, is the first to successfully achieve zero pipeline bubbles under
synchronous training semantics. The key idea behind this improvement is to
split the backward computation into two parts: one that computes the gradient
with respect to the input and another that computes the gradient with respect
to the parameters. Based on this idea, we
handcraft novel pipeline schedules that significantly outperform the baseline
methods. We further develop an algorithm that automatically finds an optimal
schedule for a given model configuration and memory limit. Additionally,
to truly achieve zero bubble, we introduce a novel technique to bypass
synchronizations during the optimizer step. Experimental evaluations show that
our method outperforms the 1F1B schedule by up to 23% in throughput under a
similar memory limit. This number can be pushed further, to 31%, when the
memory constraint is relaxed. We believe our results mark a major step forward
in harnessing the true potential of pipeline parallelism. We have open-sourced
our implementation, based on the popular Megatron-LM repository, at
https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
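
To make the central idea concrete, the following is a minimal NumPy sketch, our illustration rather than code from the paper's repository, of how the backward pass of a single linear layer splits into an input-gradient step (commonly labeled B) and a weight-gradient step (labeled W). Function names and shapes here are illustrative assumptions. Only the input gradient lies on the critical path between pipeline stages; the weight gradient is consumed only by the optimizer, so a scheduler is free to defer it.

```python
# Minimal sketch: splitting a linear layer's backward pass into an
# input-gradient step (B) and a weight-gradient step (W). Illustrative
# only; not taken from the zero-bubble-pipeline-parallelism repository.
import numpy as np

def forward(x, w):
    # y = x @ w.T for a linear layer; x must be cached for the W step.
    return x @ w.T

def backward_input(grad_y, w):
    # B step: gradient w.r.t. the input. The previous pipeline stage
    # needs this immediately to continue its own backward pass.
    return grad_y @ w

def backward_weight(grad_y, x):
    # W step: gradient w.r.t. the weight. Only the optimizer consumes
    # it, so the scheduler may defer it to fill pipeline bubbles.
    return grad_y.T @ x

# Toy usage with x: (batch, d_in), w: (d_out, d_in), grad_y: (batch, d_out).
rng = np.random.default_rng(0)
x, w = rng.standard_normal((4, 8)), rng.standard_normal((16, 8))
grad_y = np.ones_like(forward(x, w))
grad_x = backward_input(grad_y, w)   # runs early, unblocks the upstream stage
grad_w = backward_weight(grad_y, x)  # runs late, fills an idle slot
assert grad_x.shape == x.shape and grad_w.shape == w.shape
```

Decoupling B from W in this way is what gives the handcrafted and automatically searched schedules described above the freedom to place weight-gradient work into otherwise idle slots, driving the pipeline bubble toward zero.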