MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
February 23, 2024
Authors: Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
cs.AI
Abstract
We present the design, implementation and engineering experience in building
and deploying MegaScale, a production system for training large language models
(LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale
brings unprecedented challenges to training efficiency and stability. We take a
full-stack approach that co-designs the algorithmic and system components
across model block and optimizer design, computation and communication
overlapping, operator optimization, data pipeline, and network performance
tuning. Maintaining high efficiency throughout the training process (i.e.,
stability) is an important consideration in production given the long extent of
LLM training jobs. Many hard stability issues only emerge at large scale, and
in-depth observability is the key to addressing them. We develop a set of
diagnosis tools to monitor system components and events deep in the stack,
identify root causes, and derive effective techniques to achieve fault
tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs
Utilization (MFU) when training a 175B-parameter LLM on 12,288 GPUs, improving the
MFU by 1.34x compared to Megatron-LM. We share our operational experience in
identifying and fixing failures and stragglers. We hope that by articulating the
problems and sharing our experience from a systems perspective, this work can
inspire future LLM systems research.
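
For reference, Model FLOPs Utilization is conventionally defined as the ratio of the model FLOPs throughput a training run actually sustains to the aggregate theoretical peak throughput of the hardware:

    MFU = (achieved model FLOPs per second) / (aggregate theoretical peak FLOPs per second of the GPUs)

Under this definition, the reported 1.34x improvement implies a Megatron-LM baseline of roughly 55.2% / 1.34 ≈ 41.2% MFU on the same 175B-parameter, 12,288-GPU workload; this baseline figure is derived from the numbers above rather than stated explicitly in the abstract.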