A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
September 12, 2023
Authors: Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, Michael Rabbat
cs.AI
Abstract
Shampoo is an online and stochastic optimization algorithm belonging to the
AdaGrad family of methods for training neural networks. It constructs a
block-diagonal preconditioner where each block consists of a coarse Kronecker
product approximation to full-matrix AdaGrad for each parameter of the neural
network. In this work, we provide a complete description of the algorithm as
well as the performance optimizations that our implementation leverages to
train deep networks at-scale in PyTorch. Our implementation enables fast
multi-GPU distributed data-parallel training by distributing the memory and
computation associated with blocks of each parameter via PyTorch's DTensor data
structure and performing an AllGather primitive on the computed search
directions at each iteration. This major performance enhancement enables us to
achieve at most a 10% performance reduction in per-step wall-clock time
compared against standard diagonal-scaling-based adaptive gradient methods. We
validate our implementation by performing an ablation study on training
ImageNet ResNet50, demonstrating Shampoo's superiority over standard training
recipes with minimal hyperparameter tuning.
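For context on the Kronecker-product preconditioner mentioned above, the following is a minimal sketch of the classical single-block Shampoo update for a matrix-shaped parameter, as introduced in the original Shampoo formulation of Gupta, Koren, and Singer; the implementation described in this paper layers blocking, grafting, and distributed computation on top of a rule of this form.

% Sketch of the classical Shampoo update for a matrix parameter W_t with
% stochastic gradient G_t. The Kronecker product of the factors L_t and R_t
% serves as a coarse approximation to the full-matrix AdaGrad preconditioner.
\begin{aligned}
L_t &= L_{t-1} + G_t G_t^\top, \\
R_t &= R_{t-1} + G_t^\top G_t, \\
W_{t+1} &= W_t - \alpha_t \, L_t^{-1/4} \, G_t \, R_t^{-1/4}.
\end{aligned}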
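The communication pattern described in the abstract can be illustrated with a simplified sketch, not the paper's DTensor-based implementation: it assumes torch.distributed is already initialized, evenly shards a flattened gradient across data-parallel ranks, and substitutes a placeholder elementwise scaling for the actual block-wise Shampoo root-inverse preconditioning. The point is the structure of the step: each rank preconditions only its own shard, and an AllGather reassembles the full search direction on every rank.

import torch
import torch.distributed as dist

def allgather_search_direction(flat_grad: torch.Tensor) -> torch.Tensor:
    """Compute a search direction shard locally, then AllGather all shards.

    Assumes torch.distributed is initialized and flat_grad.numel() is
    divisible by the world size (padding is omitted for brevity).
    """
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # Evenly partition the flattened gradient across data-parallel ranks.
    local_shard = flat_grad.chunk(world_size)[rank]

    # Placeholder elementwise scaling standing in for the block-wise Shampoo
    # preconditioning performed in the actual implementation.
    local_direction = local_shard / (local_shard.abs() + 1e-8).sqrt()

    # AllGather the locally computed search directions from all ranks so that
    # every rank holds the full search direction before applying the update.
    gathered = [torch.empty_like(local_direction) for _ in range(world_size)]
    dist.all_gather(gathered, local_direction)
    return torch.cat(gathered)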