DiPaCo: Distributed Path Composition
March 15, 2024
Authors: Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam
cs.AI
Abstract
Progress in machine learning (ML) has been fueled by scaling neural network
models. This scaling has been enabled by ever more heroic feats of engineering,
necessary for accommodating ML approaches that require high bandwidth
communication between devices working in parallel. In this work, we propose a
co-designed modular architecture and training approach for ML models, dubbed
DIstributed PAth COmposition (DiPaCo). During training, DiPaCo distributes
computation by paths through a set of shared modules. Together with a Local-SGD
inspired optimization (DiLoCo) that keeps modules in sync with drastically
reduced communication, our approach facilitates training across poorly
connected and heterogeneous workers, with a design that ensures robustness to
worker failures and preemptions. At inference time, only a single path needs to
be executed for each input, without the need for any model compression. We
consider this approach as a first prototype towards a new paradigm of
large-scale learning, one that is less synchronous and more modular. Our
experiments on the widely used C4 benchmark show that, for the same amount of
training steps but less wall-clock time, DiPaCo exceeds the performance of a 1
billion-parameter dense transformer language model by choosing one of 256
possible paths, each with a size of 150 million parameters.
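To make the two mechanisms in the abstract concrete, below is a minimal sketch, not the authors' implementation: each input is routed along a single path through a pool of shared modules, and workers train private copies of the modules with local SGD, synchronizing only infrequently. Plain parameter averaging stands in here for DiLoCo's outer optimizer, and the dimensions, router, module count, and sync interval are illustrative assumptions.

```python
# Toy sketch of path composition over shared modules plus infrequent syncing.
import numpy as np

rng = np.random.default_rng(0)
DIM, LEVELS, MODULES_PER_LEVEL = 8, 2, 4          # 4^2 = 16 possible paths (assumed)

def init_modules():
    # Shared module pool: one weight matrix per (level, module) slot.
    return {(l, m): rng.normal(scale=0.1, size=(DIM, DIM))
            for l in range(LEVELS) for m in range(MODULES_PER_LEVEL)}

def route(x):
    """Toy router: deterministically pick one module per level for this input."""
    return tuple(int(abs(float(x.sum())) * 1000 * (l + 1)) % MODULES_PER_LEVEL
                 for l in range(LEVELS))

def forward(params, x, path):
    """Execute only the modules on the chosen path; keep activations for grads."""
    acts = [x]
    for level, m in enumerate(path):
        acts.append(params[(level, m)] @ acts[-1])
    return acts

def local_sgd(params, data, steps, lr=0.01):
    """Worker-local training on a private copy of the shared modules."""
    params = {k: v.copy() for k, v in params.items()}
    for _ in range(steps):
        x, y = data[rng.integers(len(data))]
        path = route(x)
        acts = forward(params, x, path)
        delta = 2.0 * (acts[-1] - y)               # gradient of squared error
        for level in reversed(range(LEVELS)):      # backprop only through the path
            m = path[level]
            grad_w = np.outer(delta, acts[level])
            delta = params[(level, m)].T @ delta
            params[(level, m)] -= lr * grad_w
    return params

# Outer loop: several workers train in parallel; module copies are reconciled
# only every `steps` local updates, so communication is infrequent by design.
global_params = init_modules()
data = [(rng.normal(size=DIM), rng.normal(size=DIM)) for _ in range(256)]
for outer_round in range(5):
    workers = [local_sgd(global_params, data, steps=20) for _ in range(4)]
    global_params = {k: np.mean([w[k] for w in workers], axis=0)
                     for k in global_params}

# Inference: a single path is executed per input, i.e. 2 of the 8 modules here.
x = rng.normal(size=DIM)
print(route(x), forward(global_params, x, route(x))[-1][:3])
```

In the paper's setting the same structure is scaled up: paths are transformer-sized (150 million parameters each, 256 choices), and the infrequent synchronization is what allows training across poorly connected, heterogeneous, and preemptible workers.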