MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
February 23, 2024
Authors: Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
cs.AI
Abstract
We present the design, implementation and engineering experience in building
and deploying MegaScale, a production system for training large language models
(LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale
brings unprecedented challenges to training efficiency and stability. We take a
full-stack approach that co-designs the algorithmic and system components
across model block and optimizer design, computation and communication
overlapping, operator optimization, data pipeline, and network performance
tuning. Maintaining high efficiency throughout the training process (i.e.,
stability) is an important consideration in production given the long extent of
LLM training jobs. Many hard stability issues only emerge at large scale, and
in-depth observability is the key to addressing them. We develop a set of
diagnosis tools to monitor system components and events deep in the stack,
identify root causes, and derive effective techniques to achieve fault
tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs
Utilization (MFU) when training a 175B-parameter LLM on 12,288 GPUs, improving the
MFU by 1.34x compared to Megatron-LM. We share our operational experience in
identifying and fixing failures and stragglers. We hope that by articulating the
problems and sharing our experience from a systems perspective, this work can
inspire future LLM systems research.
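
For reference, Model FLOPs Utilization is conventionally defined as the ratio of the model FLOPs throughput a training run actually sustains to the aggregate theoretical peak throughput of the hardware:

    MFU = (achieved model FLOPs per second) / (aggregate theoretical peak FLOPs per second of the GPUs)

Under this definition, the reported 1.34x improvement implies a Megatron-LM baseline of roughly 55.2% / 1.34 ≈ 41.2% MFU on the same 175B-parameter, 12,288-GPU workload; this baseline figure is derived from the numbers above rather than stated explicitly in the abstract.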