

A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale

September 12, 2023
Authors: Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, Michael Rabbat
cs.AI

Abstract

Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product approximation to full-matrix AdaGrad for each parameter of the neural network. In this work, we provide a complete description of the algorithm as well as the performance optimizations that our implementation leverages to train deep networks at-scale in PyTorch. Our implementation enables fast multi-GPU distributed data-parallel training by distributing the memory and computation associated with blocks of each parameter via PyTorch's DTensor data structure and performing an AllGather primitive on the computed search directions at each iteration. This major performance enhancement enables us to achieve at most a 10% performance reduction in per-step wall-clock time compared against standard diagonal-scaling-based adaptive gradient methods. We validate our implementation by performing an ablation study on training ImageNet ResNet50, demonstrating Shampoo's superiority over standard training recipes with minimal hyperparameter tuning.
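To make the factored preconditioner concrete, here is a minimal single-parameter sketch of the update described above: for a matrix-shaped gradient G, a left statistic L accumulates G Gᵀ, a right statistic R accumulates Gᵀ G, and the search direction L^(-1/4) G R^(-1/4) acts as the Kronecker-product approximation to full-matrix AdaGrad. The function name `shampoo_step_single_param`, the epsilon initialization, and the eigendecomposition-based root are illustrative choices, not the paper's implementation.

```python
import torch

def shampoo_step_single_param(param, grad, state, lr=1e-2, eps=1e-12):
    """Illustrative Shampoo step for one matrix-shaped parameter (m x n).

    Maintains factored statistics L = eps*I + sum(G @ G.T) and
    R = eps*I + sum(G.T @ G); the search direction L^{-1/4} @ G @ R^{-1/4}
    is the Kronecker-product approximation to full-matrix AdaGrad.
    """
    m, n = grad.shape
    if "L" not in state:
        state["L"] = eps * torch.eye(m, dtype=grad.dtype)
        state["R"] = eps * torch.eye(n, dtype=grad.dtype)

    # Accumulate the left and right second-moment factors.
    state["L"] += grad @ grad.T
    state["R"] += grad.T @ grad

    def inv_root(mat, p=4):
        # Inverse p-th root via eigendecomposition (illustrative only; the
        # implementation described in the paper uses more careful numerics).
        evals, evecs = torch.linalg.eigh(mat)
        return evecs @ torch.diag(evals.clamp(min=eps) ** (-1.0 / p)) @ evecs.T

    search_dir = inv_root(state["L"]) @ grad @ inv_root(state["R"])
    param.data.add_(search_dir, alpha=-lr)
```

In practice, large parameters are blocked into smaller matrices and the preconditioner roots are recomputed only every few steps; this sketch omits both of those details.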
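The distributed scheme described in the abstract, in which each rank preconditions only the parameter blocks it owns and then an AllGather shares the computed search directions, can be sketched with plain torch.distributed collectives. The round-robin block assignment, the equal-size flattening, and the helper name `distributed_precondition_and_sync` are simplifying assumptions; the actual implementation distributes per-block state through PyTorch's DTensor data structure.

```python
import torch
import torch.distributed as dist

def distributed_precondition_and_sync(blocks, grads, precondition_fn):
    """Sketch of AllGather-based data-parallel preconditioning.

    Assumes a process group has already been initialized. Each rank
    preconditions only its assigned parameter blocks, then all ranks
    exchange the resulting search directions with a single AllGather so
    that every rank can apply the full update locally.
    """
    rank, world = dist.get_rank(), dist.get_world_size()

    # Round-robin block assignment (illustrative; the real implementation
    # balances blocks across workers and stores their state in DTensors).
    local_dirs = [
        precondition_fn(blk, g).flatten()
        for i, (blk, g) in enumerate(zip(blocks, grads))
        if i % world == rank
    ]

    # Assume every rank produces the same total number of elements so a
    # single fixed-size buffer suffices (real code handles uneven shapes).
    local = torch.cat(local_dirs)
    gathered = [torch.empty_like(local) for _ in range(world)]
    dist.all_gather(gathered, local)  # each rank now holds all directions
    return gathered
```

The design point this illustrates is that only the final search directions travel over the network each iteration, so the expensive preconditioner state and root computations stay sharded across workers.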