DiLoCo: Distributed Low-Communication Training of Language Models
November 14, 2023
Authors: Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, Jiajun Shen
cs.AI
Abstract
Large language models (LLMs) have become a critical component in many
applications of machine learning. However, standard approaches to training LLMs
require a large number of tightly interconnected accelerators, with devices
exchanging gradients and other intermediate states at each optimization step.
While it is difficult to build and maintain a single computing cluster hosting
many accelerators, it might be easier to find several computing clusters each
hosting a smaller number of devices. In this work, we propose a distributed
optimization algorithm, Distributed Low-Communication (DiLoCo), that enables
training of language models on islands of devices that are poorly connected.
The approach is a variant of federated averaging, where the number of inner
steps is large, the inner optimizer is AdamW, and the outer optimizer is
Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8
workers performs as well as fully synchronous optimization while communicating
500 times less. DiLoCo exhibits great robustness to the data distribution of
each worker. It is also robust to resources becoming unavailable over time and,
conversely, it can seamlessly leverage resources that become available during
training.
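
The abstract outlines the structure of DiLoCo: each worker runs many inner AdamW steps on its own data shard, and the resulting parameter deltas are averaged and applied to the shared parameters with a Nesterov-momentum outer optimizer. Below is a minimal sketch of one such round, assuming a PyTorch-style setup; the names `workers`, `worker.compute_loss`, and `build_model`, as well as all hyperparameter values, are hypothetical placeholders rather than details taken from the paper.

```python
# Minimal sketch of a DiLoCo-style outer round: many local AdamW steps per
# worker, then a single Nesterov-momentum update on the shared parameters.
# Data/loss helpers and hyperparameters are hypothetical.
import copy
import torch

def diloco_round(global_model, workers, inner_steps, outer_opt):
    """Run one outer optimization round over all workers."""
    # Remember the shared parameters at the start of the round.
    initial_params = [p.detach().clone() for p in global_model.parameters()]

    # Each worker trains its own replica for `inner_steps` AdamW steps.
    replicas = []
    for worker in workers:
        replica = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(replica.parameters(), lr=1e-4)
        for _ in range(inner_steps):
            loss = worker.compute_loss(replica)  # hypothetical data/loss API
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        replicas.append(replica)

    # Outer "gradient": drift of the averaged replicas from the round's start.
    with torch.no_grad():
        worker_params = [list(r.parameters()) for r in replicas]
        for i, (p, p0) in enumerate(zip(global_model.parameters(), initial_params)):
            avg = torch.stack([wp[i].detach() for wp in worker_params]).mean(dim=0)
            p.grad = p0 - avg

    # Outer optimizer: SGD with Nesterov momentum on the shared parameters.
    outer_opt.step()
    outer_opt.zero_grad()

# Usage sketch (illustrative values only):
# model = build_model()                       # hypothetical
# outer_opt = torch.optim.SGD(model.parameters(), lr=0.7,
#                             momentum=0.9, nesterov=True)
# for _ in range(num_outer_rounds):
#     diloco_round(model, workers, inner_steps=500, outer_opt=outer_opt)
```

In this sketch, workers exchange parameters only once per round rather than every step, which is the mechanism behind the large reduction in communication reported in the abstract.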