NoLoCo: No-all-reduce Low Communication Training Method for Large Models
June 12, 2025
Authors: Jari Kolehmainen, Nikolay Blagoev, John Donaghy, Oğuzhan Ersoy, Christopher Nies
cs.AI
Abstract
Training large language models is generally done via optimization methods on
clusters containing tens of thousands of accelerators, communicating over a
high-bandwidth interconnect. Scaling up these clusters is expensive and can
become impractical, imposing limits on the size of models that can be trained.
Several recent studies have proposed training methods that are less
communication-intensive, avoiding the need for a highly connected compute
cluster. These state-of-the-art low-communication training methods still employ
a synchronization step for model parameters, which, when performed over all
model replicas, can become costly on a low-bandwidth network.
In this work, we propose a novel optimization method, NoLoCo, that does not
explicitly synchronize all model parameters during training and, as a result,
does not require any collective communication. NoLoCo implicitly synchronizes
model weights via a novel variant of the Nesterov momentum optimizer that
partially averages a replica's weights with those of another, randomly selected
replica. We provide both a theoretical convergence analysis of the proposed
optimizer and empirical results from language model training.
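As a rough illustration of the synchronization style described above (a sketch, not the authors' implementation), the snippet below blends each replica's weights with those of one randomly chosen peer instead of all-reducing over every replica. The function name noloco_style_sync, the mixing coefficient alpha, and the in-memory list of replicas are illustrative assumptions; NoLoCo's exact mixing rule and its coupling with the Nesterov momentum variant are specified in the paper.

import random

import numpy as np


def noloco_style_sync(replica_weights, alpha=0.5, seed=None):
    """Gossip-style partial averaging with one random peer per replica.

    Each replica mixes its weights with a single randomly selected peer
    rather than performing an all-reduce over all replicas. alpha is a
    hypothetical mixing coefficient used only for this sketch.
    """
    rng = random.Random(seed)
    n = len(replica_weights)
    mixed = []
    for i, weights in enumerate(replica_weights):
        peer = rng.choice([j for j in range(n) if j != i])
        # In a real deployment only this pair of replicas exchanges data;
        # here all replicas live in one process for illustration.
        mixed.append((1.0 - alpha) * weights + alpha * replica_weights[peer])
    return mixed


# Toy usage: four "replicas", each holding a small weight vector.
replicas = [np.full(3, float(i)) for i in range(4)]
print(noloco_style_sync(replicas, alpha=0.5, seed=0))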
We benchmark NoLoCo on a wide range of accelerator counts and model sizes,
from 125M to 6.8B parameters. Our method incurs significantly less
communication overhead than fully sharded data parallel training, or even the
widely used low-communication training method DiLoCo. The synchronization step
itself is estimated to be an order of magnitude faster than the all-reduce used
in DiLoCo when a few hundred accelerators train over the internet. There is
also no globally blocking communication, which reduces accelerator idle time.
Compared to DiLoCo, we also observe up to a 4% faster convergence rate across a
wide range of model sizes and accelerator counts.