Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
June 7, 2024
Authors: Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan, Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li, Jiamang Wang, Rui Men, Jianwei Zhang, Huang Zhong, Dennis Cai, Yuan Xie, Binzhang Fu
cs.AI
Abstract
The emergence of Large Language Models (LLMs) has necessitated the adoption
of parallel training techniques, involving the deployment of thousands of GPUs
to train a single model. Unfortunately, we have found that the efficiency of
current parallel training is often suboptimal, largely due to the following two
main issues. First, hardware failures are inevitable and lead to
interruptions in training tasks. The inability to quickly identify the
faulty components results in a substantial waste of GPU resources. Second,
since GPUs must wait for parameter synchronization to complete before
proceeding to the next round of computation, network congestion can greatly
increase the waiting time of GPUs. To address these challenges, this paper
introduces a communication-driven solution, namely C4. The key insights of
C4 are twofold. First, in parallel training, collective communication
exhibits periodic and homogeneous characteristics, so any anomaly is almost
certainly due to some form of hardware malfunction. By leveraging this feature,
C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and
restart the task, thereby avoiding the resource wastage caused by delays in
anomaly detection. Second, the predictable communication pattern of collective
communication, involving a few large flows, allows C4 to execute traffic
planning efficiently, substantially reducing network congestion. C4 has been
deployed extensively across our production systems, cutting error-induced
overhead by roughly 30% and improving runtime performance by about 15% for
certain applications with moderate communication costs.
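The first insight above — that collective communication in parallel training is periodic and homogeneous, so a persistent straggler indicates a hardware fault — can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not C4's actual detection logic: it assumes per-rank completion times for one collective operation are available and flags ranks that deviate far from the group median.

```python
from statistics import median

def find_suspect_ranks(completion_times, tolerance=3.0):
    """Flag ranks whose collective-op completion time deviates from the group.

    Because training traffic is periodic and homogeneous, every rank should
    finish each collective at roughly the same time; a persistent straggler
    points to a hardware fault. The function name, inputs, and threshold are
    illustrative assumptions, not the paper's implementation.
    """
    m = median(completion_times.values())
    return sorted(
        rank for rank, t in completion_times.items()
        if t > tolerance * m
    )

# Example: rank 3 takes far longer than its peers on the same all-reduce.
times = {0: 1.01, 1: 0.98, 2: 1.03, 3: 9.7}
print(find_suspect_ranks(times))  # → [3]
```

In practice such a check would run continuously across training iterations, since the periodicity of the traffic is what makes a single outlier meaningful.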
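The second insight — that collective communication produces a few large, predictable flows, making traffic planning tractable — can be sketched as a greedy load-balancing assignment. This is a hypothetical simplification of what a traffic planner might do, not C4's algorithm: it assumes flow sizes are known ahead of time and places each flow, largest first, on the currently least-loaded network path instead of leaving placement to hash-based schemes such as ECMP, whose collisions cause congestion.

```python
def plan_flows(flows, paths):
    """Assign each large flow to the currently least-loaded path, largest first.

    `flows` maps flow id -> size; `paths` is a list of candidate path names.
    Returns the per-flow assignment and the resulting per-path load. All names
    here are illustrative assumptions, not the paper's interfaces.
    """
    load = {p: 0.0 for p in paths}
    plan = {}
    for flow_id, size in sorted(flows.items(), key=lambda kv: -kv[1]):
        best = min(load, key=load.get)  # least-loaded path so far
        plan[flow_id] = best
        load[best] += size
    return plan, load

# Two equal-size flows end up on different paths instead of colliding.
plan, load = plan_flows({"f1": 10, "f2": 10, "f3": 5}, ["p0", "p1"])
print(plan, load)
```

Because there are only a few large flows, even this greedy pass keeps the maximum path load close to optimal, which is what makes planning practical at training scale.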