Unicron: Economizing Self-Healing LLM Training at Scale

Abstract

Training large-scale language models is increasingly critical in various domains, but it is hindered by frequent failures that incur significant time and economic costs. Current failure recovery methods in cloud-based settings inadequately address the diverse and complex scenarios that arise: they focus narrowly on reducing downtime for individual tasks without considering the overall cost impact on a cluster. We introduce Unicron, a workload manager designed for efficient self-healing in large-scale language model training. Unicron optimizes the training process by minimizing failure-related costs across multiple concurrent tasks within a cluster. Its key features include in-band error detection for real-time error identification without extra overhead, a dynamic cost-aware plan generation mechanism for optimal reconfiguration, and an efficient transition strategy to reduce downtime during state changes. Deployed on a 128-GPU distributed cluster, Unicron demonstrates up to a 1.9x improvement in training efficiency over state-of-the-art methods, significantly reducing failure recovery costs and enhancing the reliability of large-scale language model training.
