One-Step Gradient Delay Is Not a Barrier to Large-Scale Asynchronous Pipeline-Parallel LLM Pretraining

Abstract

Modern large-scale LLM pretraining benefits from pipeline parallelism, but synchronous implementations leave GPUs idle during pipeline bubbles, wasting computational resources. Asynchronous pipeline parallelism eliminates these bubbles, maximizing throughput at the cost of gradient staleness. Among asynchronous schedules, PipeDream-2BW is particularly appealing: unlike the original PipeDream schedule, it guarantees a constant one-step gradient delay regardless of pipeline depth. Its adoption nevertheless remains limited, owing to the common belief that optimization under staleness is fundamentally unstable. In this work, we challenge this assumption and show that degradation under a one-step delay is not an intrinsic limitation but depends strongly on the choice of optimizer. We provide the first comprehensive empirical analysis showing that, while AdamW (the predominant optimizer when PipeDream-2BW was introduced) does suffer severe degradation, recent methods such as Muon are strongly robust to a one-step delay. To further mitigate delay effects, we introduce an optimizer-agnostic correction inspired by Error Feedback. We also provide supporting theoretical analysis establishing convergence for Muon both with and without this correction. Extensive evaluation on models of up to 10B parameters confirms that our strategies close the performance gap with synchronous training, highlighting the practical potential of asynchronous pipeline parallelism at scale.
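To make the notion of a one-step gradient delay concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy one-dimensional quadratic objective and plain gradient descent (neither AdamW nor Muon, and without the Error Feedback-inspired correction). The names `delayed_gradient_descent` and `grad` are hypothetical, introduced only for illustration. Each update applies a gradient evaluated at the weights from one step earlier.

```python
def delayed_gradient_descent(grad, w0, lr, steps, delay=1):
    """Gradient descent in which each update uses a gradient that is
    `delay` steps stale. Returns the list of iterates.

    delay=0 is ordinary (synchronous) gradient descent;
    delay=1 mimics a constant one-step gradient delay.
    """
    # Pad the history so that a stale iterate exists at the very first step.
    iterates = [w0] * (delay + 1)
    for _ in range(steps):
        stale_w = iterates[-1 - delay]  # weights the gradient was computed on
        iterates.append(iterates[-1] - lr * grad(stale_w))
    return iterates


def grad(w):
    # Gradient of the toy objective f(w) = 0.5 * w**2 (an assumption of this sketch).
    return w


if __name__ == "__main__":
    for lr in (0.1, 1.5):
        sync = delayed_gradient_descent(grad, 1.0, lr, steps=50, delay=0)
        stale = delayed_gradient_descent(grad, 1.0, lr, steps=50, delay=1)
        print(f"lr={lr}: sync |w|={abs(sync[-1]):.3e}, "
              f"one-step delay |w|={abs(stale[-1]):.3e}")
```

On this toy quadratic, the synchronous iteration is stable for learning rates below 2, whereas the one-step-delayed iteration is stable only below 1, so a learning rate of 1.5 converges synchronously but diverges under delay. This is a property of plain gradient descent on the toy problem only; it says nothing about how AdamW or Muon behave.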