DiLoCo: Distributed Low-Communication Training of Language Models
November 14, 2023
Authors: Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc'Aurelio Ranzato, Arthur Szlam, Jiajun Shen
cs.AI
Abstract
Large language models (LLM) have become a critical component in many
applications of machine learning. However, standard approaches to training LLM
require a large number of tightly interconnected accelerators, with devices
exchanging gradients and other intermediate states at each optimization step.
While it is difficult to build and maintain a single computing cluster hosting
many accelerators, it might be easier to find several computing clusters each
hosting a smaller number of devices. In this work, we propose a distributed
optimization algorithm, Distributed Low-Communication (DiLoCo), that enables
training of language models on islands of devices that are poorly connected.
The approach is a variant of federated averaging, where the number of inner
steps is large, the inner optimizer is AdamW, and the outer optimizer is
Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8
workers performs as well as fully synchronous optimization while communicating
500 times less. DiLoCo exhibits great robustness to the data distribution of
each worker. It is also robust to resources becoming unavailable over time, and
vice versa, it can seamlessly leverage resources that become available during
training.
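
To make the inner/outer structure described above concrete, here is a minimal single-process sketch in PyTorch. The function name, data loaders, and hyperparameter values are illustrative assumptions and not the paper's exact configuration; in the setting the abstract describes, each worker would run on its own island of accelerators, and only the outer step would require communication.

```python
# Sketch of a DiLoCo-style inner/outer loop: each worker takes many local AdamW
# steps, then the averaged parameter delta is applied with an outer Nesterov step.
# Hyperparameters and the function signature are hypothetical placeholders.
import copy
import itertools
import torch

def diloco_train(global_model, worker_loaders, outer_steps=100, inner_steps=500,
                 inner_lr=1e-4, outer_lr=0.7, outer_momentum=0.9):
    # Outer optimizer: Nesterov momentum applied to the averaged parameter delta.
    outer_opt = torch.optim.SGD(global_model.parameters(), lr=outer_lr,
                                momentum=outer_momentum, nesterov=True)

    for _ in range(outer_steps):
        global_params = [p.detach().clone() for p in global_model.parameters()]
        worker_params = []

        for loader in worker_loaders:  # one iteration per "island" of devices
            worker = copy.deepcopy(global_model)
            inner_opt = torch.optim.AdamW(worker.parameters(), lr=inner_lr)
            data_iter = itertools.cycle(loader)
            for _ in range(inner_steps):  # many local steps, no communication
                batch, targets = next(data_iter)
                loss = torch.nn.functional.cross_entropy(worker(batch), targets)
                inner_opt.zero_grad()
                loss.backward()
                inner_opt.step()
            worker_params.append([p.detach() for p in worker.parameters()])

        # Communication happens only here: average the workers' parameters and
        # treat the resulting delta as an "outer gradient" for the Nesterov step.
        outer_opt.zero_grad()
        for i, p in enumerate(global_model.parameters()):
            avg = torch.stack([w[i] for w in worker_params]).mean(dim=0)
            p.grad = global_params[i] - avg
        outer_opt.step()

    return global_model
```

Because a worker communicates only once every `inner_steps` optimization steps, the number of synchronization rounds drops by that factor relative to fully synchronous training, which is the source of the roughly 500x communication reduction reported in the abstract.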