DiLoCo: Distributed Low-Communication Training of Language Models

Abstract

Large language models (LLMs) have become a critical component in many applications of machine learning. However, standard approaches to training LLMs require a large number of tightly interconnected accelerators, with devices exchanging gradients and other intermediate states at each optimization step. While it is difficult to build and maintain a single computing cluster hosting many accelerators, it might be easier to find several computing clusters, each hosting a smaller number of devices. In this work, we propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected. The approach is a variant of federated averaging in which the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8 workers performs as well as fully synchronous optimization while communicating 500 times less. DiLoCo exhibits great robustness to the data distribution of each worker. It is also robust to resources becoming unavailable over time and, conversely, can seamlessly leverage resources that become available during training.
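The two-level structure described in the abstract (many local AdamW steps per worker, followed by a Nesterov-momentum update on the averaged parameter change) can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the model is a noiseless linear regression rather than a language model, the helper names (`adamw_step`, `local_train`, `diloco`) are hypothetical, all hyperparameter values are illustrative choices, and keeping each worker's AdamW state across rounds is an assumption made here.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """One AdamW update. wd is the decoupled weight decay; it defaults to 0.0
    in this toy so the optimum stays exactly at the true weights."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

def local_train(theta, X, y, state, inner_steps):
    """Inner optimization: a worker starts from the shared parameters and runs
    inner_steps AdamW steps on its own shard with no communication."""
    p = theta.copy()
    m, v, t = state
    for _ in range(inner_steps):
        # Full-batch gradient of the mean squared error on the local shard
        # (a real setup would use minibatches; this keeps the toy deterministic).
        g = 2.0 * X.T @ (X @ p - y) / len(y)
        t += 1
        p, m, v = adamw_step(p, g, m, v, t)
    return p, (m, v, t)

def diloco(shards, dim, rounds=30, inner_steps=50, outer_lr=0.7, outer_mu=0.9):
    """Toy outer loop: average the workers' parameter changes and treat the
    average as an 'outer gradient' for SGD with Nesterov momentum."""
    theta = np.zeros(dim)              # shared (global) parameters
    buf = np.zeros(dim)                # outer momentum buffer
    # Assumption: each worker keeps its own AdamW state between rounds.
    states = [(np.zeros(dim), np.zeros(dim), 0) for _ in shards]
    for _ in range(rounds):
        deltas = []
        for k, (X, y) in enumerate(shards):
            p, states[k] = local_train(theta, X, y, states[k], inner_steps)
            deltas.append(theta - p)   # the only quantity a worker sends
        outer_grad = np.mean(deltas, axis=0)
        buf = outer_mu * buf + outer_grad
        theta = theta - outer_lr * (outer_grad + outer_mu * buf)  # Nesterov form
    return theta

# Hypothetical usage: 4 workers, each holding its own shard of the same problem.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 3.0, 0.5])
shards = []
for _ in range(4):
    X = rng.normal(size=(64, 4))
    shards.append((X, X @ w_true))

theta = diloco(shards, dim=4)
print("distance to true weights:", np.linalg.norm(theta - w_true))
```

In this sketch the workers exchange information once per round rather than once per step, so the number of communication events falls by roughly a factor of `inner_steps` relative to per-step synchronization. Setting `outer_lr=1.0` and `outer_mu=0.0` with a single worker reduces the outer update to plain local training, which is a convenient sanity check.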
