Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
June 7, 2024
作者: Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan, Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li, Jiamang Wang, Rui Men, Jianwei Zhang, Huang Zhong, Dennis Cai, Yuan Xie, Binzhang Fu
cs.AI
Abstract
The emergence of Large Language Models (LLMs) has necessitated the adoption
of parallel training techniques, involving the deployment of thousands of GPUs
to train a single model. Unfortunately, we have found that the efficiency of
current parallel training is often suboptimal, largely due to the following two
main issues. Firstly, hardware failures are inevitable, leading to
interruptions in the training tasks. The inability to quickly identify the
faulty components results in a substantial waste of GPU resources. Secondly,
since GPUs must wait for parameter synchronization to complete before
proceeding to the next round of computation, network congestion can greatly
increase the waiting time for GPUs. To address these challenges, this paper
introduces a communication-driven solution, namely C4. The key insights of
C4 are twofold. First, in parallel training, collective communication
exhibits periodic and homogeneous characteristics, so any anomalies are
certainly due to some form of hardware malfunction. By leveraging this feature,
C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and
restart the task, thereby avoiding resource wastage caused by delays in anomaly
detection. Second, the predictable communication model of collective
communication, involving only a few large flows, allows C4 to efficiently execute
traffic planning, substantially reducing network congestion. C4 has been
extensively implemented across our production systems, cutting error-induced
overhead by roughly 30% and enhancing runtime performance by about 15% for
certain applications with moderate communication costs.
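The first insight — that collective communication is periodic and homogeneous across ranks, so a persistent outlier points to a hardware fault — can be illustrated with a toy statistical detector. This is a minimal sketch, not C4's actual mechanism; the function name, the per-rank timing input, and the z-score threshold are all illustrative assumptions:

```python
import statistics

def find_suspect_ranks(iteration_times, threshold=2.0):
    """Flag ranks whose collective-communication time deviates from
    the group mean by more than `threshold` standard deviations.

    Because every rank performs the same collective each step, a rank
    that is persistently slower suggests a faulty GPU, NIC, or link.
    `iteration_times` maps rank id -> measured collective time (s).
    """
    values = list(iteration_times.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:  # perfectly homogeneous step: nothing to flag
        return []
    return [rank for rank, t in iteration_times.items()
            if (t - mean) / stdev > threshold]

# Example: rank 5 is ~10x slower than its seven peers.
times = {r: 1.0 for r in range(8)}
times[5] = 10.0
print(find_suspect_ranks(times))  # → [5]
```

A production system would of course aggregate over many iterations before isolating a component and restarting the task, but the underlying signal is the same uniformity property the abstract describes.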
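The second insight — that a workload with only a few large, predictable flows makes traffic planning tractable — can be sketched with a generic greedy load-balancing pass (longest-flow-first onto the least-loaded path). This is a simplified stand-in for congestion-aware path planning, not C4's actual algorithm; the function and its inputs are hypothetical:

```python
def plan_flows(flows, num_paths):
    """Assign each flow to the currently least-loaded path, largest
    flows first, to keep per-path load (and thus congestion) balanced.

    flows: dict flow_id -> flow size (e.g. bytes per iteration)
    Returns a dict flow_id -> chosen path index.
    """
    loads = [0.0] * num_paths
    assignment = {}
    # Placing the largest flows first is the classic LPT heuristic:
    # with few, known flows the plan is cheap to compute up front.
    for fid, size in sorted(flows.items(), key=lambda kv: -kv[1]):
        path = min(range(num_paths), key=lambda i: loads[i])
        assignment[fid] = path
        loads[path] += size
    return assignment

# Four flows over two paths end up evenly split (5.0 per path).
flows = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
print(plan_flows(flows, 2))  # → {'a': 0, 'b': 1, 'c': 1, 'd': 0}
```

In contrast, hash-based path selection with no knowledge of flow sizes can put both large flows on one path; the abstract's point is that the predictability of collective traffic lets the planner avoid exactly that collision.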