Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
June 7, 2024
作者: Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan, Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li, Jiamang Wang, Rui Men, Jianwei Zhang, Huang Zhong, Dennis Cai, Yuan Xie, Binzhang Fu
cs.AI
Abstract
The emergence of Large Language Models (LLMs) has necessitated the adoption
of parallel training techniques, involving the deployment of thousands of GPUs
to train a single model. Unfortunately, we have found that the efficiency of
current parallel training is often suboptimal, largely due to the following two
main issues. Firstly, hardware failures are inevitable, leading to
interruptions in the training tasks. The inability to quickly identify the
faulty components results in a substantial waste of GPU resources. Secondly,
since GPUs must wait for parameter synchronization to complete before
proceeding to the next round of computation, network congestion can greatly
increase the waiting time for GPUs. To address these challenges, this paper
introduces a communication-driven solution, namely C4. The key insights of
C4 are twofold. First, in parallel training, collective communication
exhibits periodic and homogeneous characteristics, so any anomalies are
certainly due to some form of hardware malfunction. By leveraging this feature,
C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and
restart the task, thereby avoiding resource wastage caused by delays in anomaly
detection. Second, the predictable communication model of collective
communication, involving only a few large flows, allows C4 to efficiently execute
traffic planning, substantially reducing network congestion. C4 has been
extensively implemented across our production systems, cutting error-induced
overhead by roughly 30% and enhancing runtime performance by about 15% for
certain applications with moderate communication costs.
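The first insight — that collective communication is periodic and homogeneous across ranks, so a persistent outlier points to a hardware fault — can be illustrated with a toy statistical detector. This is a minimal sketch, not C4's actual mechanism; the function name, the per-rank timing input, and the z-score threshold are all illustrative assumptions:

```python
import statistics

def find_suspect_ranks(iteration_times, threshold=2.0):
    """Flag ranks whose collective-communication time deviates from
    the group mean by more than `threshold` standard deviations.

    Because every rank performs the same collective each step, a rank
    that is persistently slower suggests a faulty GPU, NIC, or link.
    `iteration_times` maps rank id -> measured collective time (s).
    """
    values = list(iteration_times.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:  # perfectly homogeneous step: nothing to flag
        return []
    return [rank for rank, t in iteration_times.items()
            if (t - mean) / stdev > threshold]

# Example: rank 5 is ~10x slower than its seven peers.
times = {r: 1.0 for r in range(8)}
times[5] = 10.0
print(find_suspect_ranks(times))  # → [5]
```

A production system would of course aggregate over many iterations before isolating a component and restarting the task, but the underlying signal is the same uniformity property the abstract describes.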
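The second insight — that a workload with only a few large, predictable flows makes traffic planning tractable — can be sketched with a generic greedy load-balancing pass (longest-flow-first onto the least-loaded path). This is a simplified stand-in for congestion-aware path planning, not C4's actual algorithm; the function and its inputs are hypothetical:

```python
def plan_flows(flows, num_paths):
    """Assign each flow to the currently least-loaded path, largest
    flows first, to keep per-path load (and thus congestion) balanced.

    flows: dict flow_id -> flow size (e.g. bytes per iteration)
    Returns a dict flow_id -> chosen path index.
    """
    loads = [0.0] * num_paths
    assignment = {}
    # Placing the largest flows first is the classic LPT heuristic:
    # with few, known flows the plan is cheap to compute up front.
    for fid, size in sorted(flows.items(), key=lambda kv: -kv[1]):
        path = min(range(num_paths), key=lambda i: loads[i])
        assignment[fid] = path
        loads[path] += size
    return assignment

# Four flows over two paths end up evenly split (5.0 per path).
flows = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
print(plan_flows(flows, 2))  # → {'a': 0, 'b': 1, 'c': 1, 'd': 0}
```

In contrast, hash-based path selection with no knowledge of flow sizes can put both large flows on one path; the abstract's point is that the predictability of collective traffic lets the planner avoid exactly that collision.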