Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach

Abstract

The emergence of Large Language Models (LLMs) has necessitated the adoption of parallel training techniques, in which thousands of GPUs are deployed to train a single model. Unfortunately, we have found that the efficiency of current parallel training is often suboptimal, largely due to two issues. First, hardware failures are inevitable and interrupt training tasks; the inability to quickly identify the faulty components results in a substantial waste of GPU resources. Second, since GPUs must wait for parameter synchronization to complete before proceeding to the next round of computation, network congestion can greatly increase GPU waiting time. To address these challenges, this paper introduces a communication-driven solution named C4. The key insights of C4 are twofold. First, in parallel training, collective communication exhibits periodic and homogeneous characteristics, so any anomalies are certainly due to some form of hardware malfunction. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding the resource waste caused by delays in anomaly detection. Second, the predictable communication model of collective communication, which involves a few large flows, allows C4 to execute traffic planning efficiently, substantially reducing network congestion. C4 has been extensively deployed across our production systems, cutting error-induced overhead by roughly 30% and improving runtime performance by about 15% for certain applications with moderate communication costs.
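The first insight, that homogeneous workers should behave alike so a deviating worker points to a fault, can be illustrated with a minimal sketch. This is not C4's actual detection logic, which the abstract does not describe. The function name `find_outlier_ranks`, the per-rank timing metric, the sample values, and the threshold are all assumptions made for illustration. The sketch flags ranks whose timing lies far from the median, measured in units of the median absolute deviation (MAD).

```python
from statistics import median

def find_outlier_ranks(step_times, threshold=5.0):
    """Return ranks whose timing deviates from the median by more than
    `threshold` times the median absolute deviation (MAD).

    `step_times` maps a rank id to a hypothetical per-step timing in ms.
    """
    values = list(step_times.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    # Guard against a zero MAD when most ranks report identical timings.
    scale = mad if mad > 0 else 1e-9
    return sorted(
        rank for rank, v in step_times.items()
        if abs(v - med) / scale > threshold
    )

# Hypothetical timings: seven ranks behave alike, rank 4 is much slower.
timings = {0: 10.1, 1: 10.0, 2: 9.9, 3: 10.2, 4: 25.0, 5: 10.0, 6: 10.1, 7: 9.8}
print(find_outlier_ranks(timings))  # prints [4]
```

The median and MAD are used here instead of the mean and standard deviation because a single faulty rank would otherwise inflate the very statistics used to judge it.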
