Breaking the Bubble: Asynchronous Pipeline-Parallel Training with Bounded Weight Inconsistency

Abstract

Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training with Controlled Inconsistency), a bubble-free asynchronous pipeline method that bounds forward/backward version drift without weight stashing, prediction, additional parameter copies, or global synchronization. The key idea is to use local gradient accumulation as a version-control mechanism: by slowing parameter-version evolution relative to pipeline delay, PACI limits the number of optimizer updates crossed by any micro-batch while preserving steady-state utilization. In GPT-style language-model pretraining, PACI matches the stability and final perplexity of synchronous 1F1B-flush, retains the same peak memory footprint, achieves fully utilized pipeline throughput, and improves training time-to-accuracy by up to 1.69× over the fastest flush baseline. These results show that forward/backward inconsistency need not be eliminated: when explicitly bounded, it can be safely traded for substantial efficiency gains.
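
The effect of gradient accumulation on version drift can be illustrated with a toy model. The sketch below is not the PACI schedule itself; it assumes a single stage in steady state where the forward pass of micro-batch m runs after the backward pass of micro-batch m - delay has finished, and where the stage applies one optimizer update after every accum_steps backward passes. The function name max_version_drift and its parameters are hypothetical and chosen only for illustration.

```python
def max_version_drift(delay: int, accum_steps: int, num_microbatches: int = 1000) -> int:
    """Toy model (an assumption, not the paper's algorithm).

    Returns the largest number of optimizer updates that land between the
    forward and the backward pass of any single micro-batch on one stage.

    delay        -- how many micro-batches lie between forward and backward
                    on this stage (assumed constant in steady state)
    accum_steps  -- backward passes accumulated before each optimizer update
    """
    worst = 0
    for m in range(num_microbatches):
        # Backward passes completed when micro-batch m runs its forward pass.
        backwards_done_at_forward = max(0, m - delay)
        # Weight version = number of optimizer updates applied so far.
        version_at_forward = backwards_done_at_forward // accum_steps
        # Backward of m runs after backward passes 0..m-1 have completed.
        version_at_backward = m // accum_steps
        worst = max(worst, version_at_backward - version_at_forward)
    return worst


if __name__ == "__main__":
    for k in (1, 3, 6, 8):
        print(f"accum_steps={k}: max drift = {max_version_drift(6, k)}")
```

In this model, updating after every backward pass (accum_steps = 1) lets a micro-batch cross as many updates as the delay, while accumulating over at least as many micro-batches as the delay caps the drift at one update. This is only meant to show why slowing version evolution relative to pipeline delay bounds the mismatch.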