Unicron: Economizing Self-Healing LLM Training at Scale
December 30, 2023
Authors: Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou
cs.AI
Abstract
Training large-scale language models is increasingly critical in various
domains, but it is hindered by frequent failures, leading to significant time
and economic costs. Current failure recovery methods in cloud-based settings
inadequately address the diverse and complex scenarios that arise, focusing
narrowly on erasing downtime for individual tasks without considering the
overall cost impact on a cluster. We introduce Unicron, a workload manager
designed for efficient self-healing in large-scale language model training.
Unicron optimizes the training process by minimizing failure-related costs
across multiple concurrent tasks within a cluster. Its key features include
in-band error detection for real-time error identification without extra
overhead, a dynamic cost-aware plan generation mechanism for optimal
reconfiguration, and an efficient transition strategy to reduce downtime during
state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates
up to a 1.9x improvement in training efficiency over state-of-the-art methods,
significantly reducing failure recovery costs and enhancing the reliability of
large-scale language model training.
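To make the "dynamic cost-aware plan generation" idea concrete, below is a minimal, hypothetical sketch of how a workload manager might pick a reconfiguration after a failure: it searches over GPU allocations for the concurrent tasks and keeps the plan with the lowest estimated failure-related cost (transition downtime plus degraded throughput). The throughput model, cost formula, task names, and all parameters are illustrative assumptions, not Unicron's actual algorithm.

```python
# Hypothetical sketch: cost-aware reconfiguration plan selection after a failure.
# All names, numbers, and the cost model are assumptions for illustration only.
from dataclasses import dataclass
from itertools import product
from typing import Tuple


@dataclass
class Task:
    name: str
    ideal_gpus: int            # GPUs the task was originally configured with
    per_gpu_throughput: float  # assumed samples/sec contributed by each GPU

    def throughput(self, gpus: int) -> float:
        # Toy scaling model: linear up to the original configuration.
        return self.per_gpu_throughput * min(gpus, self.ideal_gpus)


def plan_cost(alloc: Tuple[int, ...], tasks, healthy_gpus: int,
              transition_s: float, horizon_s: float) -> float:
    """Estimated failure-related cost of a plan over a planning horizon:
    work lost while transitioning plus work lost to reduced throughput."""
    if sum(alloc) > healthy_gpus:
        return float("inf")  # infeasible: plan uses more GPUs than are healthy
    ideal = sum(t.throughput(t.ideal_gpus) for t in tasks)
    actual = sum(t.throughput(g) for t, g in zip(tasks, alloc))
    downtime_loss = ideal * transition_s                      # nothing runs during the transition
    degraded_loss = (ideal - actual) * (horizon_s - transition_s)
    return downtime_loss + degraded_loss


def best_plan(tasks, healthy_gpus: int, step: int = 8,
              transition_s: float = 60.0, horizon_s: float = 3600.0):
    """Brute-force search over GPU allocations in multiples of `step`."""
    choices = range(0, healthy_gpus + 1, step)
    best = None
    for alloc in product(choices, repeat=len(tasks)):
        cost = plan_cost(alloc, tasks, healthy_gpus, transition_s, horizon_s)
        if best is None or cost < best[1]:
            best = (alloc, cost)
    return best


if __name__ == "__main__":
    tasks = [Task("llm-70b", ideal_gpus=64, per_gpu_throughput=1.0),
             Task("llm-13b", ideal_gpus=64, per_gpu_throughput=1.2)]
    # Suppose 16 of the 128 GPUs have failed, leaving 112 healthy.
    alloc, cost = best_plan(tasks, healthy_gpus=112)
    print("chosen allocation:", dict(zip((t.name for t in tasks), alloc)),
          "estimated cost:", cost)
```

In this toy setting the search simply favors plans that keep aggregate cluster throughput high rather than fully restoring any single task, which is the cluster-level (rather than per-task) perspective the abstract emphasizes.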