A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale

Abstract

Shampoo is an online, stochastic optimization algorithm in the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner in which each block is a coarse Kronecker product approximation to full-matrix AdaGrad for one parameter of the network. In this work, we give a complete description of the algorithm and of the performance optimizations that our implementation uses to train deep networks at scale in PyTorch. Our implementation enables fast multi-GPU distributed data-parallel training by distributing the memory and computation associated with the blocks of each parameter via PyTorch's DTensor data structure and by performing an AllGather primitive on the computed search directions at each iteration. This major performance enhancement limits the slowdown in per-step wall-clock time to at most 10% relative to standard diagonal-scaling-based adaptive gradient methods. We validate our implementation with an ablation study on ImageNet ResNet50 training, which demonstrates Shampoo's superiority over standard training recipes with minimal hyperparameter tuning.
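
To make the Kronecker product approximation in the abstract concrete, the following is a minimal single-process sketch of the basic Shampoo update for one matrix-shaped parameter. It is not the implementation described in this work: it uses NumPy instead of PyTorch, omits blocking, DTensor distribution, and the AllGather step, and recomputes the matrix roots at every iteration. The names `inverse_root` and `ShampooMatrixState`, the value of `eps`, the learning rate, and the toy quadratic objective are all assumptions chosen for illustration.

```python
import numpy as np


def inverse_root(mat, root, eps=1e-12):
    """Return mat^(-1/root) for a symmetric positive semidefinite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.maximum(eigvals, eps)  # guard against tiny negative eigenvalues
    # Scale each eigenvector column by its eigenvalue^(-1/root), then recombine.
    return (eigvecs * eigvals ** (-1.0 / root)) @ eigvecs.T


class ShampooMatrixState:
    """Left and right Kronecker factors for one m-by-n parameter (hypothetical helper)."""

    def __init__(self, shape, eps=1e-4):
        m, n = shape
        self.L = eps * np.eye(m)  # accumulates G @ G.T
        self.R = eps * np.eye(n)  # accumulates G.T @ G

    def search_direction(self, grad):
        # Update both factors with the current gradient.
        self.L += grad @ grad.T
        self.R += grad.T @ grad
        # Preconditioned direction: L^(-1/4) @ G @ R^(-1/4).
        return inverse_root(self.L, 4) @ grad @ inverse_root(self.R, 4)


# Toy problem (an assumption for illustration): minimize 0.5 * ||W - target||_F^2.
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 3))
W = np.zeros((4, 3))
state = ShampooMatrixState(W.shape)
learning_rate = 0.5

initial_loss = 0.5 * np.sum((W - target) ** 2)
for _ in range(100):
    grad = W - target  # gradient of the toy objective
    W = W - learning_rate * state.search_direction(grad)
final_loss = 0.5 * np.sum((W - target) ** 2)

print(f"initial loss: {initial_loss:.4f}, final loss: {final_loss:.6f}")
```

The two factors together store m² + n² entries, compared with (mn)² for the full-matrix AdaGrad preconditioner of the same parameter, which is what makes the approximation practical for large layers.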
