Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency
June 5, 2026
Authors: Itay Elam, Eliron Rahimi, Avi Mendelson, Chaim Baskin
cs.AI
Abstract
Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training with Controlled Inconsistency), a bubble-free asynchronous pipeline method that bounds forward/backward version drift without weight stashing, prediction, additional parameter copies, or global synchronization. The key idea is to use local gradient accumulation as a version-control mechanism: by slowing parameter-version evolution relative to pipeline delay, PACI limits the number of optimizer updates crossed by any micro-batch while preserving steady-state utilization. In GPT-style language-model pretraining, PACI matches the stability and final perplexity of synchronous 1F1B-flush, retains the same peak memory footprint, achieves fully utilized pipeline throughput, and improves training time-to-accuracy by up to 1.69× over the fastest flush baseline. These results show that forward/backward inconsistency need not be eliminated: when explicitly bounded, it can be safely traded for substantial efficiency gains.
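The claim that local gradient accumulation bounds version drift can be illustrated with a toy counting model. This is only a sketch under simplifying assumptions and not the PACI schedule itself: a single stage is assumed to run in steady state with a fixed forward-to-backward delay of `delay` micro-batches, and its optimizer is assumed to step once every `accum_steps` backward passes. The names `version_drift`, `max_version_drift`, `delay`, and `accum_steps` are hypothetical and do not come from the paper.

```python
def version_drift(microbatch: int, delay: int, accum_steps: int) -> int:
    """Optimizer updates applied between one micro-batch's forward and backward pass.

    Toy assumption: when micro-batch `microbatch` runs its forward pass, the stage
    has finished the backward passes of micro-batches 0 .. microbatch - delay - 1;
    when it runs its backward pass, it has finished those of 0 .. microbatch - 1.
    The weight version is the number of completed optimizer steps, i.e. the number
    of completed backward passes divided (integer division) by `accum_steps`.
    """
    backward_done_at_forward = max(microbatch - delay, 0)
    backward_done_at_backward = microbatch
    version_at_forward = backward_done_at_forward // accum_steps
    version_at_backward = backward_done_at_backward // accum_steps
    return version_at_backward - version_at_forward


def max_version_drift(delay: int, accum_steps: int, num_microbatches: int = 1000) -> int:
    """Largest drift seen by any micro-batch in the simulated run."""
    return max(
        version_drift(m, delay, accum_steps) for m in range(num_microbatches)
    )


if __name__ == "__main__":
    # Hypothetical delay of 6 micro-batches; sweep the accumulation length.
    for k in (1, 2, 4, 8):
        print(f"accum_steps={k}: max drift = {max_version_drift(6, k)}")
```

In this model the worst-case drift works out to `ceil(delay / accum_steps)`, so choosing `accum_steps` at least as large as the delay keeps every micro-batch within one optimizer update of the weights it saw in its forward pass. How the actual method chooses the accumulation length per stage is not stated in the abstract.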