Breaking the Bubble: Asynchronous Pipeline-Parallel Training with Bounded Weight Inconsistency

Abstract

Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training with Controlled Inconsistency), a bubble-free asynchronous pipeline method that bounds forward/backward version drift without weight stashing, prediction, additional parameter copies, or global synchronization. The key idea is to use local gradient accumulation as a version-control mechanism: by slowing parameter-version evolution relative to pipeline delay, PACI limits the number of optimizer updates crossed by any micro-batch while preserving steady-state utilization. In GPT-style language-model pretraining, PACI matches the stability and final perplexity of synchronous 1F1B-flush, retains the same peak memory footprint, achieves fully utilized pipeline throughput, and improves training time-to-accuracy by up to 1.69× over the fastest flush baseline. These results show that forward/backward inconsistency need not be eliminated: when explicitly bounded, it can be safely traded for substantial efficiency gains.
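
To make the version-control idea concrete, the following toy model counts how many optimizer updates separate a micro-batch's forward pass from its backward pass on a single stage. It is a sketch under stated assumptions, not the PACI schedule itself: we assume a steady-state stage that, at each step, runs one forward pass and then the backward pass of the micro-batch issued `delay` steps earlier, and that applies one optimizer update after every `accum_steps` backward passes. The names `weight_version`, `max_version_drift`, `delay`, and `accum_steps` are hypothetical and do not come from the paper.

```python
import math


def weight_version(backwards_done: int, accum_steps: int) -> int:
    """Weight version after a given number of completed backward passes.

    Assumption: one optimizer update fires per `accum_steps` backward passes.
    """
    return backwards_done // accum_steps


def max_version_drift(delay: int, accum_steps: int, num_microbatches: int) -> int:
    """Largest forward/backward version gap over all micro-batches on one stage.

    Toy schedule (assumption): at step t the stage runs the forward pass of
    micro-batch t, then the backward pass of micro-batch t - delay.
    """
    worst = 0
    for i in range(num_microbatches):
        # Backward passes finished before micro-batch i's forward pass.
        done_at_forward = max(0, i - delay)
        # Backward passes finished before micro-batch i's own backward pass.
        done_at_backward = i
        drift = (weight_version(done_at_backward, accum_steps)
                 - weight_version(done_at_forward, accum_steps))
        worst = max(worst, drift)
    return worst


if __name__ == "__main__":
    delay = 3  # hypothetical pipeline delay, in micro-batch slots
    for accum_steps in (1, 2, 4, 8):
        drift = max_version_drift(delay, accum_steps, num_microbatches=64)
        bound = math.ceil(delay / accum_steps)
        print(f"accum_steps={accum_steps}: max drift={drift}, bound={bound}")
```

In this model the drift never exceeds ceil(delay / accum_steps). With a delay of 3, it falls from 3 updates without accumulation to 1 update once accum_steps reaches the delay, which is the sense in which slowing version evolution relative to pipeline delay bounds the inconsistency without storing extra weight copies.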