NoLoCo: No-all-reduce Low Communication Training Method for Large Models
June 12, 2025
Authors: Jari Kolehmainen, Nikolay Blagoev, John Donaghy, Oğuzhan Ersoy, Christopher Nies
cs.AI
Abstract
Training large language models is generally done via optimization methods on
clusters containing tens of thousands of accelerators, communicating over a
high-bandwidth interconnect. Scaling up these clusters is expensive and can
become impractical, imposing limits on the size of models that can be trained.
Several recent studies have proposed training methods that are less
communication intensive, avoiding the need for a highly connected compute
cluster. These state-of-the-art low communication training methods still employ
a synchronization step for model parameters, which, when performed over all
model replicas, can become costly on a low-bandwidth network.
In this work, we propose a novel optimization method, NoLoCo, that does not
explicitly synchronize all model parameters during training and, as a result,
does not require any collective communication. NoLoCo implicitly synchronizes
model weights via a novel variant of the Nesterov momentum optimizer,
partially averaging each replica's weights with those of a single randomly
selected peer. We provide both a theoretical convergence analysis for the
proposed optimizer and empirical results from language model training.
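
To make the communication pattern concrete, the following is a minimal PyTorch
sketch of a gossip-style partial-averaging step of the kind described above:
each replica blends its weights with those of a single, randomly paired peer
instead of performing an all-reduce. The function name, the seed-based pairing
rule, and the mixing factor `mix` are illustrative assumptions, not the exact
NoLoCo update, which folds the averaging into its Nesterov momentum variant.

```python
import random

import torch
import torch.distributed as dist


def partial_average_with_random_peer(model, step: int, mix: float = 0.5):
    """Blend local weights with those of one randomly paired peer replica.

    Illustrative sketch only: instead of an all-reduce over every replica,
    each worker exchanges parameters with a single partner and takes a
    convex combination. Assumes an initialized process group and an even
    number of replicas.
    """
    world, rank = dist.get_world_size(), dist.get_rank()
    assert world % 2 == 0, "sketch assumes an even number of replicas"

    # Every worker derives the same random pairing from the step index,
    # so no extra coordination traffic is needed to agree on partners.
    perm = list(range(world))
    random.Random(step).shuffle(perm)
    pairs = {perm[i]: perm[i + 1] for i in range(0, world, 2)}
    pairs.update({b: a for a, b in pairs.items()})
    partner = pairs[rank]

    for p in model.parameters():
        peer = torch.empty_like(p.data)
        # Point-to-point exchange with one partner only; no collective op.
        req = dist.isend(p.data, dst=partner)
        dist.recv(peer, src=partner)
        req.wait()
        # Convex combination of local and peer weights.
        p.data.mul_(1.0 - mix).add_(peer, alpha=mix)
```

In a full training loop this pairwise exchange would take the place of the
global all-reduce that DiLoCo-style methods perform at each outer step.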
We benchmark NoLoCo on a wide range of accelerator counts and model sizes,
from 125M to 6.8B parameters. Our method incurs significantly less
communication overhead than fully sharded data parallel training, and even
less than the widely used low communication training method DiLoCo. For a few
hundred accelerators training over the internet, the synchronization step
itself is estimated to be an order of magnitude faster than the all-reduce
used in DiLoCo. Our method also avoids global blocking communication, which
reduces accelerator idle time. Compared to DiLoCo, we also observe up to 4%
faster convergence across a wide range of model sizes and accelerator counts.