DiPaCo: Distributed Path Composition
March 15, 2024
Authors: Arthur Douillard, Qixuan Feng, Andrei A. Rusu, Adhiguna Kuncoro, Yani Donchev, Rachita Chhaparia, Ionel Gog, Marc'Aurelio Ranzato, Jiajun Shen, Arthur Szlam
cs.AI
Abstract
Progress in machine learning (ML) has been fueled by scaling neural network
models. This scaling has been enabled by ever more heroic feats of engineering,
necessary for accommodating ML approaches that require high bandwidth
communication between devices working in parallel. In this work, we propose a
co-designed modular architecture and training approach for ML models, dubbed
DIstributed PAth COmposition (DiPaCo). During training, DiPaCo distributes
computation by paths through a set of shared modules. Together with a
Local-SGD-inspired optimization (DiLoCo) that keeps modules in sync with
drastically reduced communication, our approach facilitates training across poorly
connected and heterogeneous workers, with a design that ensures robustness to
worker failures and preemptions. At inference time, only a single path needs to
be executed for each input, without the need for any model compression. We
consider this approach a first prototype towards a new paradigm of
large-scale learning, one that is less synchronous and more modular. Our
experiments on the widely used C4 benchmark show that, for the same number of
training steps but less wall-clock time, DiPaCo exceeds the performance of a 1
billion-parameter dense transformer language model by choosing one of 256
possible paths, each with a size of 150 million parameters.
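The two ideas in the abstract, paths composed from shared modules and infrequent synchronization of those modules, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: modules are single scalars, the loss is a made-up squared error, the level layout (for example two levels of 16 modules each, which happens to give 256 paths) is hypothetical, and the synchronization step uses plain averaging of parameter changes rather than the actual DiLoCo outer optimizer. The names `make_paths`, `local_steps`, `outer_sync`, and `demo` are likewise hypothetical.

```python
import itertools


def make_paths(levels):
    """Enumerate every path: one module index chosen per level."""
    return list(itertools.product(*[range(k) for k in levels]))


def local_steps(params, path, target, steps, lr):
    """Run plain SGD steps on a private copy of the modules along one path.

    Toy objective (an assumption): the sum of the path's module scalars
    should equal `target`, measured with a squared-error loss.
    """
    local = {(lvl, m): params[(lvl, m)] for lvl, m in enumerate(path)}
    for _ in range(steps):
        err = sum(local.values()) - target
        for key in local:
            # d/dw of (sum - target)^2 is 2 * err for every module on the path.
            local[key] -= lr * 2.0 * err
    return local


def outer_sync(params, local_results, outer_lr=1.0):
    """Average each module's parameter change over the paths that used it.

    Simplified stand-in for the outer optimization step; no momentum.
    """
    deltas = {}
    for local in local_results:
        for key, value in local.items():
            deltas.setdefault(key, []).append(value - params[key])
    for key, ds in deltas.items():
        params[key] += outer_lr * sum(ds) / len(ds)
    return params


def demo(levels=(2, 2), rounds=5, steps=5, lr=0.1, target=1.0):
    """Train all paths locally, synchronizing shared modules once per round."""
    paths = make_paths(levels)
    params = {(lvl, m): 0.0 for lvl, k in enumerate(levels) for m in range(k)}
    for _ in range(rounds):
        # Each path trains independently; no communication inside this loop.
        results = [local_steps(params, p, target, steps, lr) for p in paths]
        # Communication happens only here, once per round.
        params = outer_sync(params, results)
    return params, paths


if __name__ == "__main__":
    print(len(make_paths([16, 16])))  # 16 * 16 = 256 paths
    final_params, small_paths = demo()
    for p in small_paths:
        print(p, sum(final_params[(lvl, m)] for lvl, m in enumerate(p)))
```

In this sketch, communication occurs once per round rather than once per gradient step, which is the sense in which Local-SGD-style training reduces communication. At inference time only the modules on one chosen path would be evaluated; how an input is assigned to a path is not described in the abstract and is left out here.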