Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
June 7, 2024
Authors: Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan, Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li, Jiamang Wang, Rui Men, Jianwei Zhang, Huang Zhong, Dennis Cai, Yuan Xie, Binzhang Fu
cs.AI
Abstract
The emergence of Large Language Models (LLMs) has necessitated the adoption
of parallel training techniques, involving the deployment of thousands of GPUs
to train a single model. Unfortunately, we have found that the efficiency of
current parallel training is often suboptimal, largely due to the following two
main issues. First, hardware failures are inevitable and lead to
interruptions in training tasks. The inability to quickly identify the
faulty components results in a substantial waste of GPU resources. Second,
since GPUs must wait for parameter synchronization to complete before
proceeding to the next round of computation, network congestion can greatly
increase the waiting time of GPUs. To address these challenges, this paper
introduces a communication-driven solution, namely C4. The key insights of
C4 are twofold. First, in parallel training, collective communication
exhibits periodic and homogeneous characteristics, so any anomaly is almost
certainly due to some form of hardware malfunction. By leveraging this feature,
C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and
restart the task, thereby avoiding the resource wastage caused by delays in
anomaly detection. Second, the predictable communication pattern of collective
communication, involving a few large flows, allows C4 to execute traffic
planning efficiently, substantially reducing network congestion. C4 has been
deployed extensively across our production systems, cutting error-induced
overhead by roughly 30% and improving runtime performance by about 15% for
certain applications with moderate communication costs.
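The first insight above — that collective communication in parallel training is periodic and homogeneous, so a persistent straggler indicates a hardware fault — can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not C4's actual detection logic: it assumes per-rank completion times for one collective operation are available and flags ranks that deviate far from the group median.

```python
from statistics import median

def find_suspect_ranks(completion_times, tolerance=3.0):
    """Flag ranks whose collective-op completion time deviates from the group.

    Because training traffic is periodic and homogeneous, every rank should
    finish each collective at roughly the same time; a persistent straggler
    points to a hardware fault. The function name, inputs, and threshold are
    illustrative assumptions, not the paper's implementation.
    """
    m = median(completion_times.values())
    return sorted(
        rank for rank, t in completion_times.items()
        if t > tolerance * m
    )

# Example: rank 3 takes far longer than its peers on the same all-reduce.
times = {0: 1.01, 1: 0.98, 2: 1.03, 3: 9.7}
print(find_suspect_ranks(times))  # → [3]
```

In practice such a check would run continuously across training iterations, since the periodicity of the traffic is what makes a single outlier meaningful.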
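The second insight — that collective communication produces a few large, predictable flows, making traffic planning tractable — can be sketched as a greedy load-balancing assignment. This is a hypothetical simplification of what a traffic planner might do, not C4's algorithm: it assumes flow sizes are known ahead of time and places each flow, largest first, on the currently least-loaded network path instead of leaving placement to hash-based schemes such as ECMP, whose collisions cause congestion.

```python
def plan_flows(flows, paths):
    """Assign each large flow to the currently least-loaded path, largest first.

    `flows` maps flow id -> size; `paths` is a list of candidate path names.
    Returns the per-flow assignment and the resulting per-path load. All names
    here are illustrative assumptions, not the paper's interfaces.
    """
    load = {p: 0.0 for p in paths}
    plan = {}
    for flow_id, size in sorted(flows.items(), key=lambda kv: -kv[1]):
        best = min(load, key=load.get)  # least-loaded path so far
        plan[flow_id] = best
        load[best] += size
    return plan, load

# Two equal-size flows end up on different paths instead of colliding.
plan, load = plan_flows({"f1": 10, "f2": 10, "f3": 5}, ["p0", "p1"])
print(plan, load)
```

Because there are only a few large flows, even this greedy pass keeps the maximum path load close to optimal, which is what makes planning practical at training scale.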