2BP: 2-Stage Backpropagation

Abstract

As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, necessitating the sharding of model parameters across multiple accelerators. Pipeline parallelism is a commonly used sharding strategy for training large DNNs. However, current implementations of pipeline parallelism are being unintentionally bottlenecked by the automatic differentiation tools provided by ML frameworks. This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. We tested 2BP on various model architectures and pipelining schedules, achieving increases in throughput in all cases. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs.
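
The split described in the abstract can be illustrated with a small sketch. The abstract does not say what the two stages compute, so this example assumes that stage 1 produces the gradient with respect to a layer's input (which the previous pipeline stage is waiting on) and stage 2 produces the gradient with respect to the layer's weights (which is only needed before the optimizer step and can therefore be deferred into otherwise idle time). The class `TwoStageLinear` and its method names are hypothetical and written in plain Python for illustration; they are not taken from the paper or from any ML framework.

```python
# Hypothetical sketch: a linear layer y = x @ W whose backward pass is split
# into two separately callable stages. All names here are illustrative only.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

class TwoStageLinear:
    def __init__(self, weight):
        self.weight = weight            # shape (in_features, out_features)
        self.saved_input = None
        self.saved_grad_output = None
        self.grad_weight = None

    def forward(self, x):
        self.saved_input = x            # kept for the deferred weight-gradient stage
        return matmul(x, self.weight)

    def backward_stage1(self, grad_output):
        """Assumed stage 1: gradient w.r.t. the input, returned immediately
        so the upstream pipeline stage can continue."""
        self.saved_grad_output = grad_output
        return matmul(grad_output, transpose(self.weight))

    def backward_stage2(self):
        """Assumed stage 2: gradient w.r.t. the weights, computed later."""
        self.grad_weight = matmul(transpose(self.saved_input),
                                  self.saved_grad_output)
        self.saved_input = None         # release buffers once they are used
        self.saved_grad_output = None
        return self.grad_weight

layer = TwoStageLinear([[1.0, 2.0], [3.0, 4.0]])
y = layer.forward([[1.0, 1.0]])                 # [[4.0, 6.0]]
grad_x = layer.backward_stage1([[1.0, 1.0]])    # [[3.0, 7.0]]
# ... other pipeline work could be scheduled here ...
grad_w = layer.backward_stage2()                # [[1.0, 1.0], [1.0, 1.0]]
```

Under this assumption, the cost of deferring stage 2 is that the saved input and the incoming gradient must stay in memory until it runs, which is the kind of trade-off a pipeline schedule would need to account for.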
