Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

Abstract

As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach (DiLoCo) that relaxes synchronization demands without compromising model quality. However, these works do not carefully analyze how DiLoCo's behavior changes with model size. In this work, we study the scaling law behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on how algorithmic factors, including number of model replicas, hyperparameters, and token budget, affect training in ways that can be accurately predicted via scaling laws. We find that DiLoCo scales both predictably and robustly with model size. When well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes. Our results showcase a more general set of benefits of DiLoCo than previously documented, including increased optimal batch sizes, improved downstream generalization with scale, and improved evaluation loss for a fixed token budget.
