Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
March 12, 2025
Authors: Zachary Charles, Gabriel Teston, Lucio Dery, Keith Rush, Nova Fallen, Zachary Garrett, Arthur Szlam, Arthur Douillard
cs.AI
Abstract
As we scale to more massive machine learning models, the frequent
synchronization demands inherent in data-parallel approaches create significant
slowdowns, posing a critical challenge to further scaling. Recent work develops
an approach (DiLoCo) that relaxes synchronization demands without compromising
model quality. However, these works do not carefully analyze how DiLoCo's
behavior changes with model size. In this work, we study the scaling law
behavior of DiLoCo when training LLMs under a fixed compute budget. We focus on
how algorithmic factors, including number of model replicas, hyperparameters,
and token budget affect training in ways that can be accurately predicted via
scaling laws. We find that DiLoCo scales both predictably and robustly with
model size. When well-tuned, DiLoCo scales better than data-parallel training
with model size, and can outperform data-parallel training even at small model
sizes. Our results showcase a more general set of benefits of DiLoCo than
previously documented, including increased optimal batch sizes, improved
downstream generalization with scale, and improved evaluation loss for a fixed
token budget.
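To make the "relaxed synchronization" concrete, below is a minimal single-process sketch of a DiLoCo-style inner/outer training loop: M model replicas each take H local AdamW steps without communicating, and only then does a single outer step apply the averaged parameter delta to the global model with Nesterov-momentum SGD (the inner/outer optimizer choice follows the original DiLoCo paper). The tiny model, synthetic data, replica count, and hyperparameters here are illustrative placeholders, not the configuration studied in this work.

```python
# Illustrative single-process sketch of a DiLoCo-style inner/outer loop.
# Assumptions (not from this paper): AdamW inner optimizer, Nesterov-momentum
# outer optimizer, and a toy regression model on synthetic data.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

M = 4            # number of model replicas (placeholder value)
H = 20           # inner steps between synchronizations (placeholder value)
OUTER_STEPS = 10

def make_model():
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

global_model = make_model()
# Outer optimizer updates the global parameters using the averaged delta as a pseudo-gradient.
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)

def synthetic_batch(batch_size=32):
    # Stand-in for each replica's data shard.
    x = torch.randn(batch_size, 16)
    y = x.sum(dim=1, keepdim=True)
    return x, y

loss_fn = nn.MSELoss()

for outer_step in range(OUTER_STEPS):
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]
    for replica in range(M):
        # Each replica starts from the current global parameters.
        local_model = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(local_model.parameters(), lr=1e-3)
        for _ in range(H):  # H local steps with no cross-replica communication
            x, y = synthetic_batch()
            inner_opt.zero_grad()
            loss = loss_fn(local_model(x), y)
            loss.backward()
            inner_opt.step()
        # Accumulate this replica's contribution to the averaged delta (global - local).
        for d, gp, lp in zip(deltas, global_model.parameters(), local_model.parameters()):
            d += (gp.detach() - lp.detach()) / M
    # Outer step: treat the averaged delta as a gradient for the global model.
    outer_opt.zero_grad()
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
    print(f"outer step {outer_step}: last replica loss {loss.item():.4f}")
```

The contrast with standard data-parallel training is the communication pattern: data parallelism all-reduces gradients at every optimizer step, whereas the loop above exchanges parameters only once every H inner steps, which is the reduction in synchronization frequency the abstract refers to.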