Zero Bubble Pipeline Parallelism
November 30, 2023
Authors: Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin
cs.AI
Abstract
Pipeline parallelism is one of the key components for large-scale distributed
training, yet its efficiency suffers from pipeline bubbles which were deemed
inevitable. In this work, we introduce a scheduling strategy that, to our
knowledge, is the first to successfully achieve zero pipeline bubbles under
synchronous training semantics. The key idea behind this improvement is to
split the backward computation into two parts: one that computes the gradient
with respect to the input, and another that computes the gradients of the
parameters. Based on this idea, we
handcraft novel pipeline schedules that significantly outperform the baseline
methods. We further develop an algorithm that automatically finds an optimal
schedule based on specific model configuration and memory limit. Additionally,
to truly achieve zero bubble, we introduce a novel technique to bypass
synchronizations during the optimizer step. Experimental evaluations show that
our method outperforms the 1F1B schedule by up to 23% in throughput under a
similar memory limit. This number can be further pushed to 31% when the memory
constraint is relaxed. We believe our results mark a major step forward in
harnessing the true potential of pipeline parallelism. We have open-sourced our
implementation, based on the popular Megatron-LM repository, at
https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
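The key idea of splitting the backward pass can be illustrated with a minimal sketch (not the paper's implementation). For a linear layer y = x @ W, the backward computation decomposes into a B part (gradient with respect to the input, which the previous pipeline stage must wait for) and a W part (gradient with respect to the weights, which is only needed by the optimizer and can therefore be deferred to fill pipeline bubbles). The function names below are illustrative, not from the paper's codebase.

```python
import numpy as np

def backward_input(grad_out: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """B part: gradient w.r.t. the layer input, dL/dx = dL/dy @ W.T.
    This lies on the critical path between pipeline stages."""
    return grad_out @ weight.T

def backward_weight(grad_out: np.ndarray, inputs: np.ndarray) -> np.ndarray:
    """W part: gradient w.r.t. the weights, dL/dW = x.T @ dL/dy.
    Independent of the B part, so it can be scheduled later."""
    return inputs.T @ grad_out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))   # saved activations from the forward pass
W = rng.standard_normal((3, 2))   # layer weights
g = rng.standard_normal((4, 2))   # gradient arriving from the next stage

dx = backward_input(g, W)   # must run promptly: the previous stage needs it
dW = backward_weight(g, x)  # can be delayed to fill a pipeline bubble
```

Because `dW` does not block any other stage, a scheduler is free to move it into otherwise idle slots, which is what enables the zero-bubble schedules described above.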