Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training
October 9, 2025
Authors: Ruizhe Wang, Yucheng Ding, Xiao Liu, Yaoxiang Wang, Peng Cheng, Baining Guo, Zhengjun Zha, Yeyun Gong
cs.AI
Abstract
The rapidly increasing computational cost of pretraining Large Language Models necessitates more efficient approaches. Substantial compute has already been invested in existing well-trained checkpoints, yet many of them remain underutilized due to engineering constraints or limited model capacity. To efficiently reuse this "sunk" cost, we propose recycling pretrained checkpoints by expanding their parameter counts and continuing training. We propose an orthogonal growth method well suited to converged Mixture-of-Experts models: interpositional layer copying for depth growth, and expert duplication with injected noise for width growth. To determine the optimal timing for such growth across a sequence of checkpoints, we perform comprehensive scaling experiments, which reveal that final accuracy is strongly positively correlated with the amount of sunk cost, indicating that greater prior investment leads to better performance. We scale our approach to models with 70B parameters and over 1T training tokens, achieving a 10.66% accuracy gain over training from scratch under the same additional compute budget. Our checkpoint-recycling approach establishes a foundation for economically efficient large language model pretraining.
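
The two growth operations named in the abstract can be sketched in a few lines. The snippet below is a minimal illustration only, assuming a generic PyTorch-style model whose Transformer blocks are held in a plain list and whose MoE layers expose their experts as a list of modules; the function names (grow_depth_interpositional, grow_width_duplicate_experts) and the noise_std parameter are illustrative assumptions rather than the paper's actual implementation, and the router/gate resizing required after expert duplication is omitted.

```python
# Hypothetical sketch of depth growth (interpositional layer copying) and
# width growth (expert duplication with injected noise); not the paper's code.
import copy
import torch


def grow_depth_interpositional(layers):
    """Depth growth: insert a copy of each layer directly after the original,
    doubling the layer count while starting from the converged weights."""
    grown = []
    for layer in layers:
        grown.append(layer)
        grown.append(copy.deepcopy(layer))  # copied layer sits next to its source
    return grown


def grow_width_duplicate_experts(experts, noise_std=1e-2):
    """Width growth: duplicate each expert and perturb the copy with small
    Gaussian noise so the pair can diverge during continued training."""
    grown = []
    for expert in experts:
        grown.append(expert)
        noisy = copy.deepcopy(expert)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(noise_std * torch.randn_like(p))  # break symmetry
        grown.append(noisy)
    return grown
```

Placing each copy next to its source preserves the converged function at initialization, while the injected noise breaks the symmetry between duplicated experts so they can specialize as training continues.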