
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

February 23, 2024
作者: Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu
cs.AI

Abstract

We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
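The headline result, 55.2% Model FLOPs Utilization (MFU), compares the FLOPs the model actually performs against the aggregate peak FLOPs of the cluster. As a minimal sketch of how such a number is typically estimated (the exact accounting in the paper may differ), the snippet below uses the common approximation of ~6 FLOPs per parameter per training token for a dense transformer's forward and backward pass; the function name and the example throughput figure are illustrative, not from the paper.

```python
def model_flops_utilization(num_params, tokens_per_sec,
                            num_gpus, peak_flops_per_gpu):
    """Estimate MFU: achieved model FLOPs / aggregate peak hardware FLOPs.

    Assumes ~6 FLOPs per parameter per token for the combined
    forward + backward pass of a dense transformer (a standard
    back-of-the-envelope approximation, not the paper's exact method).
    """
    achieved_flops_per_sec = 6 * num_params * tokens_per_sec
    peak_flops_per_sec = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec

# Illustrative numbers: a 175B-parameter model on 12,288 GPUs,
# each with a hypothetical 312 TFLOPS peak, at an assumed
# throughput of 2.0 million tokens/sec.
mfu = model_flops_utilization(
    num_params=175e9,
    tokens_per_sec=2.0e6,
    num_gpus=12_288,
    peak_flops_per_gpu=312e12,
)
print(f"MFU ≈ {mfu:.1%}")
```

Under these assumed inputs the estimate lands in the mid-50% range, which gives a sense of the sustained per-GPU throughput the reported 55.2% MFU implies at this scale.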