Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

January 30, 2025
Authors: Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham
cs.AI

Abstract

Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at every gradient step, all devices need to be co-located using low-latency, high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed this co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers only occur infrequently. This in turn means that workers can afford to be connected by lower-bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, since the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall-clock time. Third, we quantize the data exchanged by workers, which further reduces the bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, while reducing the required bandwidth by two orders of magnitude.
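
To make the three modifications concrete, here is a minimal sketch in Python/NumPy of the outer synchronization loop the abstract describes. It is not the paper's implementation: `num_fragments`, `sync_every`, `outer_lr`, the fake inner gradient, and the plain-SGD outer update (the DiLoCo line of work uses an outer Nesterov optimizer) are illustrative assumptions, and the overlap of communication with training is only indicated by a comment rather than actually run in the background.

```python
# Minimal sketch of the outer loop described in the abstract, simulated with
# NumPy. NOT the paper's implementation: num_fragments, sync_every, outer_lr,
# the fake inner gradient, and the plain-SGD outer update are assumptions.
import numpy as np

rng = np.random.default_rng(0)

num_workers = 4
num_fragments = 8    # parameters are split into fragments synced one at a time
sync_every = 16      # inner steps between two consecutive fragment syncs
outer_lr = 0.7       # outer step size (assumed value)

dim = 1024
global_params = rng.normal(size=dim).astype(np.float32)
fragments = np.array_split(np.arange(dim), num_fragments)

# Each worker holds its own replica and trains on its own data shard.
worker_params = [global_params.copy() for _ in range(num_workers)]

def local_step(params: np.ndarray) -> np.ndarray:
    """Stand-in for one inner optimization step on a worker's local data."""
    fake_grad = rng.normal(scale=0.01, size=params.shape).astype(np.float32)
    return params - 0.1 * fake_grad

def quantize_fp16(x: np.ndarray) -> np.ndarray:
    """Modification 3: exchange lower-precision deltas between workers."""
    return x.astype(np.float16).astype(np.float32)

for step in range(sync_every * num_fragments):
    # Workers keep training at every step. In the real system the collective
    # below would run in the background while this compute continues
    # (modification 2); here the overlap is only conceptual.
    worker_params = [local_step(p) for p in worker_params]

    # Modification 1: only one fragment is synchronized per sync point, on a
    # staggered schedule, so peak bandwidth is ~1/num_fragments of a full sync.
    if step % sync_every == 0:
        idx = fragments[(step // sync_every) % num_fragments]
        deltas = [quantize_fp16(global_params[idx] - p[idx]) for p in worker_params]
        outer_grad = np.mean(deltas, axis=0)         # all-reduce over this fragment only
        global_params[idx] -= outer_lr * outer_grad  # outer optimizer update
        for p in worker_params:
            p[idx] = global_params[idx]              # broadcast merged fragment back
```

The point the sketch captures is that each synchronization touches only one fragment of the parameters, so per-sync traffic (and hence peak bandwidth) shrinks roughly by the number of fragments, while quantizing the exchanged deltas further reduces the remaining volume.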
