

Unicron: Economizing Self-Healing LLM Training at Scale

December 30, 2023
作者: Tao He, Xue Li, Zhibin Wang, Kun Qian, Jingbo Xu, Wenyuan Yu, Jingren Zhou
cs.AI

Abstract

Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures, leading to significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise, focusing narrowly on erasing downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale language model training.
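The abstract's "dynamic cost-aware plan generation" can be illustrated with a toy sketch: given several candidate reconfigurations after a failure, pick the one that maximizes useful training work over a planning horizon, trading transition downtime against post-transition throughput. The `Plan` fields, names, and the linear cost model below are illustrative assumptions, not Unicron's actual formulation.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    """A candidate cluster reconfiguration after a failure (hypothetical model)."""
    name: str
    transition_time: float   # seconds of downtime to switch to this plan
    throughput: float        # post-transition training throughput (samples/s)

def best_plan(plans, horizon):
    """Pick the plan maximizing useful work over a planning horizon (seconds).

    Useful work = throughput * time remaining after the transition; a toy
    stand-in for a cost-aware plan generator, not Unicron's real model.
    """
    def useful_work(p):
        return p.throughput * max(0.0, horizon - p.transition_time)
    return max(plans, key=useful_work)

plans = [
    Plan("wait-for-repair", transition_time=600.0, throughput=100.0),
    Plan("shrink-to-96-gpus", transition_time=60.0, throughput=80.0),
]
# Over a short horizon, shrinking the job beats waiting for the failed node;
# over a long horizon, waiting for full repair pays off.
print(best_plan(plans, horizon=1800.0).name)  # → shrink-to-96-gpus
print(best_plan(plans, horizon=3600.0).name)  # → wait-for-repair
```

The crossover between the two plans happens where the two linear useful-work terms intersect, which is why the same failure can warrant different recovery decisions depending on how long the cluster is expected to keep running.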