Unicron: Economizing Self-Healing LLM Training at Scale

Abstract

Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures that incur significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise: they focus narrowly on eliminating downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale language model training.
