All is Not Lost: LLM Recovery without Checkpoints
June 18, 2025
Authors: Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
cs.AI
Abstract
Training LLMs on decentralized and wimpy computation nodes, e.g., multiple
on-spot instances, lowers the training cost and enables model democratization.
The inevitable challenge here is the churn of nodes due to failures and the
operator's scheduling policies, leading to losing a stage - a part of the
model. The conventional approaches to recover from failures are to either use
checkpointing, where periodically a copy of the entire model is sent to an
additional storage, or redundant computation. These approaches yield
significant communication and/or computation overhead even in non-failure cases
and scale poorly in settings with large models. In this paper, we propose
CheckFree, an efficient recovery method where a failing stage is substituted by
a weighted average of the closest neighboring stages. In contrast to the state
of the art, CheckFree requires no additional computation or storage. However,
because of the nature of averaging neighboring stages, it can only recover
failures of intermediate stages. We further extend our method to CheckFree+
with out-of-order pipeline execution to tolerate crashes of the first and last
stages. Thanks to out-of-order pipelining, the behavior of those stages is
mimicked by their neighboring ones, which allows CheckFree+ to recover them by
simply copying the weights from the immediate neighbor. To be able to recover
the (de)embedding layers, CheckFree+ copies those layers to the neighboring
stages, which incurs a relatively small storage overhead. We extensively
evaluate our method on LLaMa models ranging from 124M to 1.5B parameters with
varying failure frequencies. In the case of low and medium failure rates
(5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant
computation in terms of convergence in wall-clock time by over 12%. Both of our
proposals can be run via our code available at:
https://github.com/gensyn-ai/CheckFree.
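
The core recovery idea can be illustrated with a minimal sketch: when an intermediate pipeline stage is lost, its parameters are rebuilt as a weighted average of the two surviving neighbors. This is a hypothetical illustration only; the function name `recover_stage`, the representation of a stage as a flat parameter dictionary, and the uniform 0.5/0.5 weighting are all assumptions, not the paper's actual implementation (see the linked repository for that).

```python
# Hypothetical sketch of CheckFree-style recovery: reconstruct a lost
# intermediate stage as a weighted average of its two neighboring stages.
# The equal-weight default (alpha = 0.5) is an assumption; the paper uses
# a weighted average whose exact weighting is defined in the repository.

def recover_stage(prev_stage: dict, next_stage: dict, alpha: float = 0.5) -> dict:
    """Rebuild the parameters of a failed stage from its neighbors.

    prev_stage / next_stage: mapping of parameter name -> list of floats.
    alpha: weight given to the previous stage (1 - alpha to the next).
    Assumes neighboring stages have identically shaped parameters,
    which holds for the repeated transformer blocks of an LLM pipeline.
    """
    recovered = {}
    for name, prev_w in prev_stage.items():
        next_w = next_stage[name]
        recovered[name] = [alpha * p + (1.0 - alpha) * n
                           for p, n in zip(prev_w, next_w)]
    return recovered
```

Because this averaging needs both a preceding and a following stage, it only applies to intermediate stages; the first and last stages are instead handled by CheckFree+'s out-of-order pipelining and weight copying, as described above.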