MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Abstract

We present the design, implementation, and engineering experience of building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Because LLM training jobs run for a long time, maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production. Many hard stability issues only emerge at large scale, and in-depth observability is the key to addressing them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope that by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
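To make the MFU figures concrete, here is a minimal sketch of how such a number is commonly computed and what the reported 1.34x improvement implies for the baseline. It assumes the widely used approximation of about 6 FLOPs per parameter per token for forward plus backward passes (attention FLOPs ignored). The function names and the per-GPU peak throughput are illustrative placeholders that the abstract does not specify, so this is not the authors' exact accounting.

```python
def model_flops_utilization(tokens_per_second, n_params, n_gpus, peak_flops_per_gpu):
    """MFU = model FLOPs delivered per second / aggregate peak FLOPs per second.

    Assumption: ~6 FLOPs per parameter per token (forward + backward),
    ignoring attention FLOPs. This is a common approximation, not
    necessarily the accounting used for the reported 55.2%.
    """
    model_flops_per_second = 6.0 * n_params * tokens_per_second
    return model_flops_per_second / (n_gpus * peak_flops_per_gpu)


def implied_baseline_mfu(mfu, improvement_factor):
    """Baseline MFU implied by a reported MFU and its improvement factor."""
    return mfu / improvement_factor


# Figures from the abstract: 55.2% MFU, 1.34x over Megatron-LM.
baseline = implied_baseline_mfu(0.552, 1.34)
print(f"implied Megatron-LM MFU: {baseline:.1%}")

# Hypothetical inputs: the throughput and per-GPU peak are placeholders,
# not values taken from the abstract.
example = model_flops_utilization(
    tokens_per_second=2.0e6,
    n_params=175e9,
    n_gpus=12_288,
    peak_flops_per_gpu=312e12,
)
print(f"example MFU under assumed inputs: {example:.1%}")
```

Dividing 55.2% by 1.34 puts the implied Megatron-LM baseline at roughly 41% MFU on the same workload.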
