MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Abstract

We present the design, implementation, and engineering experience of building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production, given how long LLM training jobs run. Many hard stability issues emerge only at large scale, and in-depth observability is the key to addressing them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B-parameter LLM on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope that by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.
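To make the headline numbers concrete, the sketch below works through the MFU arithmetic. It assumes the usual reading of MFU as achieved model FLOP/s divided by the aggregate peak FLOP/s of the hardware; the function names and the per-GPU peak value are hypothetical, chosen for illustration and not taken from the paper. The Megatron-LM figure it prints is only what the abstract's two numbers (55.2% and 1.34x) imply, not a separately reported measurement.

```python
def mfu(achieved_flops_per_sec: float, num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs Utilization: achieved model FLOP/s over aggregate peak FLOP/s."""
    return achieved_flops_per_sec / (num_gpus * peak_flops_per_gpu)


def implied_baseline_mfu(improved_mfu: float, improvement_ratio: float) -> float:
    """Baseline MFU implied by an improved MFU and its ratio over the baseline."""
    return improved_mfu / improvement_ratio


# Figures stated in the abstract.
MEGASCALE_MFU = 0.552
IMPROVEMENT_OVER_MEGATRON = 1.34
NUM_GPUS = 12_288

# Baseline implied by the two figures above (derived here, not quoted).
baseline = implied_baseline_mfu(MEGASCALE_MFU, IMPROVEMENT_OVER_MEGATRON)
print(f"Implied Megatron-LM MFU: {baseline:.1%}")

# ASSUMPTION: a hypothetical per-GPU peak of 312 TFLOP/s, used only to show
# how an MFU value maps to an aggregate throughput and back again.
HYPOTHETICAL_PEAK_FLOPS_PER_GPU = 312e12
achieved = MEGASCALE_MFU * NUM_GPUS * HYPOTHETICAL_PEAK_FLOPS_PER_GPU
round_trip = mfu(achieved, NUM_GPUS, HYPOTHETICAL_PEAK_FLOPS_PER_GPU)
print(f"Round-trip MFU: {round_trip:.1%}")
```

Because MFU is normalized by hardware peak, the same ratio can be compared across cluster sizes, which is why the abstract reports the improvement over Megatron-LM as a multiple of MFU.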
