Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

June 5, 2026
Authors: Itay Elam, Eliron Rahimi, Avi Mendelson, Chaim Baskin
cs.AI

Abstract

Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training with Controlled Inconsistency), a bubble-free asynchronous pipeline method that bounds forward/backward version drift without weight stashing, prediction, additional parameter copies, or global synchronization. The key idea is to use local gradient accumulation as a version-control mechanism: by slowing parameter-version evolution relative to pipeline delay, PACI limits the number of optimizer updates crossed by any micro-batch while preserving steady-state utilization. In GPT-style language-model pretraining, PACI matches the stability and final perplexity of synchronous 1F1B-flush, retains the same peak memory footprint, achieves fully utilized pipeline throughput, and improves training time-to-accuracy by up to 1.69× over the fastest flush baseline. These results show that forward/backward inconsistency need not be eliminated: when explicitly bounded, it can be safely traded for substantial efficiency gains.
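
The bounding idea in the abstract can be illustrated with a toy counting model. The sketch below is not the authors' implementation or schedule. It assumes a single pipeline stage where each micro-batch's backward pass runs a fixed number of micro-batches (`delay`) after its forward pass, and where the optimizer steps once every `accum_steps` backward passes. The function names and both parameters are hypothetical, chosen only for illustration. Under these assumptions, the number of optimizer updates a micro-batch crosses is at most ceil(delay / accum_steps), so accumulating over at least `delay` micro-batches keeps the drift at one version or less.

```python
import math


def version_drift(micro_batch: int, delay: int, accum_steps: int) -> int:
    """Optimizer updates applied between the forward and backward pass
    of one micro-batch, under the toy model described above (assumed)."""
    # Backward passes already completed when this micro-batch runs forward.
    done_at_forward = max(0, micro_batch - delay)
    # Backward passes already completed when it runs backward.
    done_at_backward = micro_batch
    # The weight version advances once per `accum_steps` backward passes.
    return done_at_backward // accum_steps - done_at_forward // accum_steps


def max_drift(delay: int, accum_steps: int, num_micro_batches: int = 1000) -> int:
    """Worst-case drift over a run of micro-batches in the toy model."""
    return max(
        version_drift(i, delay, accum_steps) for i in range(num_micro_batches)
    )


if __name__ == "__main__":
    delay = 7  # hypothetical in-flight depth at an early pipeline stage
    for accum_steps in (1, 2, 4, 8):
        bound = math.ceil(delay / accum_steps)
        print(accum_steps, max_drift(delay, accum_steps), bound)
```

With `accum_steps = 1` the model reduces to a plain asynchronous pipeline, where drift equals the full delay. Raising `accum_steps` slows version evolution relative to that delay, which is the trade the abstract describes.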