

Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training

October 9, 2025
作者: Ruizhe Wang, Yucheng Ding, Xiao Liu, Yaoxiang Wang, Peng Cheng, Baining Guo, Zhengjun Zha, Yeyun Gong
cs.AI

Abstract

The rapidly increasing computational cost of pretraining Large Language Models necessitates more efficient approaches. Substantial compute has already been invested in existing well-trained checkpoints, yet many of them remain underutilized due to engineering constraints or limited model capacity. To reuse this "sunk" cost efficiently, we propose recycling pretrained checkpoints by expanding their parameter counts and continuing training. We propose an orthogonal growth method well suited to converged Mixture-of-Experts models: interpositional layer copying for depth growth and expert duplication with injected noise for width growth. To determine the optimal timing for such growth across a sequence of checkpoints, we perform comprehensive scaling experiments, which reveal that final accuracy is strongly positively correlated with the amount of sunk cost, indicating that greater prior investment leads to better performance. We scale our approach to models with 70B parameters and over 1T training tokens, achieving a 10.66% accuracy gain over training from scratch under the same additional compute budget. Our checkpoint recycling approach establishes a foundation for economically efficient large language model pretraining.
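
To make the two growth operations concrete, the following is a minimal PyTorch sketch of depth growth by interleaved layer copying and width growth by noisy expert duplication. The class names (ToyExpert, ToyMoELayer), the soft routing, the router-widening step, and the noise scale are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Sketch of the two orthogonal growth operations on a toy MoE block.
import copy
import torch
import torch.nn as nn


class ToyExpert(nn.Module):
    """A single feed-forward expert (hypothetical toy structure)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.ff(x)


class ToyMoELayer(nn.Module):
    """One block with a router over a list of experts (soft routing for simplicity;
    real MoE models typically use sparse top-k routing)."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(ToyExpert(d_model, d_ff) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):
        weights = torch.softmax(self.router(x), dim=-1)           # (..., n_experts)
        outputs = torch.stack([e(x) for e in self.experts], -1)   # (..., d_model, n_experts)
        return x + torch.einsum("...dn,...n->...d", outputs, weights)


def grow_depth(layers: nn.ModuleList) -> nn.ModuleList:
    """Depth growth: insert a copy of each layer next to the original,
    doubling depth while starting from the converged weights."""
    grown = []
    for layer in layers:
        grown.append(layer)
        grown.append(copy.deepcopy(layer))
    return nn.ModuleList(grown)


def grow_width(layer: ToyMoELayer, noise_std: float = 1e-2) -> None:
    """Width growth: duplicate every expert and perturb the copy with small
    Gaussian noise so duplicates can diverge during continued training.
    The noise scale here is an arbitrary placeholder."""
    clones = []
    for expert in layer.experts:
        clone = copy.deepcopy(expert)
        with torch.no_grad():
            for p in clone.parameters():
                p.add_(noise_std * torch.randn_like(p))
        clones.append(clone)
    layer.experts.extend(clones)

    # Widen the router to emit one logit per expert, reusing the old rows
    # for both the original experts and their noisy copies.
    old = layer.router
    new = nn.Linear(old.in_features, len(layer.experts))
    with torch.no_grad():
        new.weight.copy_(old.weight.repeat(2, 1))
        new.bias.copy_(old.bias.repeat(2))
    layer.router = new
```

In the paper's setting, the grown model is then trained further under the additional compute budget; the routing scheme, noise magnitude, and layer interleaving order are design choices that this sketch only approximates.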