Asynchronous Local-SGD Training for Language Modeling
January 17, 2024
Authors: Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato
cs.AI
Abstract
Local stochastic gradient descent (Local-SGD), also referred to as federated
averaging, is an approach to distributed optimization where each device
performs more than one SGD update per communication. This work presents an
empirical study of asynchronous Local-SGD for training language models;
that is, each worker updates the global parameters as soon as it has finished
its SGD steps. We conduct a comprehensive investigation by examining how worker
hardware heterogeneity, model size, number of workers, and optimizer could
impact the learning performance. We find that with naive implementations,
asynchronous Local-SGD takes more iterations to converge than its synchronous
counterpart despite updating the (global) model parameters more frequently. We
identify momentum acceleration on the global parameters when worker gradients
are stale as a key challenge. We propose a novel method that utilizes a delayed
Nesterov momentum update and adjusts the workers' local training steps based on
their computation speed. This approach, evaluated with models up to 150M
parameters on the C4 dataset, matches the performance of synchronous Local-SGD
in terms of perplexity per update step, and significantly surpasses it in terms
of wall clock time.
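
To make the two ingredients named in the abstract concrete, here is a minimal sketch of an asynchronous server that applies each worker's pseudo-gradient as it arrives, refreshes Nesterov momentum only periodically (a delayed Nesterov update), and scales each worker's local step count by its measured speed. The class name, hyperparameter values, and exact bookkeeping below are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

class DelayedNesterovServer:
    """Hypothetical server-side sketch: apply worker deltas immediately,
    but only refresh the Nesterov momentum every `momentum_every`
    asynchronous updates, so stale gradients are not repeatedly accelerated."""

    def __init__(self, params, outer_lr=0.7, momentum=0.9, momentum_every=4):
        self.params = params                    # global parameters (flat numpy vector here)
        self.velocity = np.zeros_like(params)   # outer-loop momentum state
        self.buffer = np.zeros_like(params)     # deltas accumulated between momentum refreshes
        self.outer_lr = outer_lr
        self.momentum = momentum
        self.momentum_every = momentum_every
        self.updates_seen = 0

    def apply_worker_delta(self, delta):
        """Called whenever a worker finishes its local SGD steps.
        `delta` is the worker's pseudo-gradient: the global parameters it
        started from minus its locally trained parameters."""
        self.buffer += delta
        self.updates_seen += 1
        if self.updates_seen % self.momentum_every == 0:
            # Periodic Nesterov-style refresh using the averaged buffered deltas.
            avg = self.buffer / self.momentum_every
            self.velocity = self.momentum * self.velocity + avg
            self.params -= self.outer_lr * (self.momentum * self.velocity + avg)
            self.buffer[:] = 0.0
        else:
            # Between refreshes, apply the fresh delta without touching momentum.
            self.params -= self.outer_lr * delta


def local_steps_for(worker_step_time, base_steps=64, reference_step_time=1.0):
    """Dynamic local updates: faster workers run proportionally more local
    steps so all workers deliver contributions at a similar wall-clock cadence."""
    return max(1, int(round(base_steps * reference_step_time / worker_step_time)))
```

The intent of the sketch is only to show how the two mechanisms fit together: the momentum delay smooths acceleration over several asynchronous contributions, while the step-count scaling keeps slow and fast workers roughly synchronized in wall-clock time.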