Asynchronous Local-SGD Training for Language Modeling
January 17, 2024
Authors: Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato
cs.AI
Abstract
Local stochastic gradient descent (Local-SGD), also referred to as federated
averaging, is an approach to distributed optimization where each device
performs more than one SGD update per communication. This work presents an
empirical study of asynchronous Local-SGD for training language models;
that is, each worker updates the global parameters as soon as it has finished
its SGD steps. We conduct a comprehensive investigation by examining how worker
hardware heterogeneity, model size, number of workers, and optimizer could
impact the learning performance. We find that with naive implementations,
asynchronous Local-SGD takes more iterations to converge than its synchronous
counterpart despite updating the (global) model parameters more frequently. We
identify momentum acceleration on the global parameters when worker gradients
are stale as a key challenge. We propose a novel method that utilizes a delayed
Nesterov momentum update and adjusts the workers' local training steps based on
their computation speed. This approach, evaluated with models up to 150M
parameters on the C4 dataset, matches the performance of synchronous Local-SGD
in terms of perplexity per update step, and significantly surpasses it in terms
of wall clock time.
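
To make the two ideas named in the abstract concrete, the sketch below illustrates a delayed Nesterov-style outer update together with speed-proportional local step counts for asynchronous Local-SGD. This is a minimal illustration under our own assumptions, not the paper's implementation: names such as `DelayedNesterovServer`, `delay`, `outer_lr`, and `dyn_local_steps`, as well as the exact bookkeeping of the momentum buffer, are illustrative choices.

```python
# Hypothetical sketch (not the authors' released code): an asynchronous
# Local-SGD outer loop in which the server applies a delayed Nesterov-style
# momentum update. We assume pseudo-gradients (parameter deltas) arrive from
# workers one at a time and that the momentum statistic is refreshed only
# every `delay` server updates.

import numpy as np


class DelayedNesterovServer:
    def __init__(self, params, outer_lr=0.7, momentum=0.9, delay=4):
        self.params = params                  # global model parameters (flat vector)
        self.outer_lr = outer_lr              # outer-loop learning rate
        self.momentum = momentum              # Nesterov momentum coefficient
        self.delay = delay                    # refresh momentum every `delay` updates
        self.velocity = np.zeros_like(params)
        self.buffer = np.zeros_like(params)   # accumulates recent pseudo-gradients
        self.count = 0

    def apply(self, pseudo_grad):
        """Apply one asynchronous update from a single worker.

        `pseudo_grad` is the worker's parameter delta (old params minus new
        params), treated as an outer gradient as in synchronous Local-SGD.
        """
        self.buffer += pseudo_grad
        self.count += 1
        if self.count % self.delay == 0:
            # Refresh the momentum statistic on the averaged buffer and take a
            # Nesterov-style lookahead step for this update.
            avg = self.buffer / self.delay
            self.velocity = self.momentum * self.velocity + avg
            step = pseudo_grad + self.momentum * self.velocity
            self.buffer[:] = 0.0
        else:
            # Individual (possibly stale) worker deltas receive a plain,
            # momentum-free step between momentum refreshes.
            step = pseudo_grad
        self.params -= self.outer_lr * step
        return self.params


def dyn_local_steps(base_steps, worker_speed, fastest_speed):
    """Scale a worker's number of local SGD steps by its relative speed so that
    slow and fast workers finish their local phases in similar wall-clock time."""
    return max(1, int(round(base_steps * worker_speed / fastest_speed)))
```

In this reading, single worker deltas are applied without acceleration, while the Nesterov momentum term is built only from an average of recent deltas; this is one way to avoid accelerating on a stale pseudo-gradient, which the abstract identifies as the key failure mode of naive asynchronous Local-SGD.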