Zero Bubble Pipeline Parallelism
November 30, 2023
Authors: Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin
cs.AI
Abstract
Pipeline parallelism is one of the key components for large-scale distributed
training, yet its efficiency suffers from pipeline bubbles which were deemed
inevitable. In this work, we introduce a scheduling strategy that, to our
knowledge, is the first to successfully achieve zero pipeline bubbles under
synchronous training semantics. The key idea behind this improvement is to
split the backward computation into two parts: one that computes the gradient
with respect to the input and another that computes the gradient with respect
to the parameters. Based on this idea, we
handcraft novel pipeline schedules that significantly outperform the baseline
methods. We further develop an algorithm that automatically finds an optimal
schedule for a given model configuration and memory limit. Additionally,
to truly achieve zero bubble, we introduce a novel technique to bypass
synchronizations during the optimizer step. Experimental evaluations show that
our method outperforms the 1F1B schedule by up to 23% in throughput under a
similar memory limit. This number can be pushed further, to 31%, when the
memory constraint is relaxed. We believe our results mark a major step forward
in harnessing the true potential of pipeline parallelism. We have open-sourced
our implementation, based on the popular Megatron-LM repository, at
https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
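
To make the central idea concrete, the following is a minimal NumPy sketch, our illustration rather than code from the paper's repository, of how the backward pass of a single linear layer splits into an input-gradient step (commonly labeled B) and a weight-gradient step (labeled W). Function names and shapes here are illustrative assumptions. Only the input gradient lies on the critical path between pipeline stages; the weight gradient is consumed only by the optimizer, so a scheduler is free to defer it.

```python
# Minimal sketch: splitting a linear layer's backward pass into an
# input-gradient step (B) and a weight-gradient step (W). Illustrative
# only; not taken from the zero-bubble-pipeline-parallelism repository.
import numpy as np

def forward(x, w):
    # y = x @ w.T for a linear layer; x must be cached for the W step.
    return x @ w.T

def backward_input(grad_y, w):
    # B step: gradient w.r.t. the input. The previous pipeline stage
    # needs this immediately to continue its own backward pass.
    return grad_y @ w

def backward_weight(grad_y, x):
    # W step: gradient w.r.t. the weight. Only the optimizer consumes
    # it, so the scheduler may defer it to fill pipeline bubbles.
    return grad_y.T @ x

# Toy usage with x: (batch, d_in), w: (d_out, d_in), grad_y: (batch, d_out).
rng = np.random.default_rng(0)
x, w = rng.standard_normal((4, 8)), rng.standard_normal((16, 8))
grad_y = np.ones_like(forward(x, w))
grad_x = backward_input(grad_y, w)   # runs early, unblocks the upstream stage
grad_w = backward_weight(grad_y, x)  # runs late, fills an idle slot
assert grad_x.shape == x.shape and grad_w.shape == w.shape
```

Decoupling B from W in this way is what gives the handcrafted and automatically searched schedules described above the freedom to place weight-gradient work into otherwise idle slots, driving the pipeline bubble toward zero.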