Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training

Abstract

The rapidly increasing computational cost of pretraining Large Language Models necessitates more efficient approaches. Substantial compute has already been invested in existing well-trained checkpoints, but many of them remain underutilized due to engineering constraints or limited model capacity. To efficiently reuse this "sunk" cost, we propose to recycle pretrained checkpoints by expanding their parameter counts and continuing training. We propose orthogonal growth methods well suited to converged Mixture-of-Experts models: interpositional layer copying for depth growth and expert duplication with injected noise for width growth. To determine the optimal timing for such growth across a sequence of checkpoints, we perform comprehensive scaling experiments, which reveal that the final accuracy has a strong positive correlation with the amount of sunk cost, indicating that greater prior investment leads to better performance. We scale our approach to models with 70B parameters and over 1T training tokens, achieving a 10.66% accuracy gain over training from scratch under the same additional compute budget. Our checkpoint recycling approach establishes a foundation for economically efficient large language model pretraining.
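As a rough illustration of the two growth directions named in the abstract, the following minimal sketch grows a toy model in plain Python. Everything concrete in it is an assumption made for illustration, not the authors' implementation: the function names, the representation of a layer as a dict and of an expert as a flat list of floats, the reading of "interpositional" as placing each copy directly after its source layer, the Gaussian form of the noise, and the `noise_std` value. Router expansion for the new experts is left out because the abstract does not describe it.

```python
import copy
import random


def grow_depth_interpositional(layers):
    """Depth growth: insert a copy of each layer right after the original.

    Assumed reading of "interpositional": [L0, L1] -> [L0, L0', L1, L1'],
    as opposed to stacking a copy of the whole block at the end.
    """
    grown = []
    for layer in layers:
        grown.append(layer)
        # Deep copy so the new layer can diverge during continued training.
        grown.append(copy.deepcopy(layer))
    return grown


def grow_width_noisy_experts(experts, noise_std=0.01, seed=0):
    """Width growth: append one noisy duplicate of every expert.

    Each expert is a flat list of floats in this toy setting.
    noise_std is a hypothetical hyperparameter; the abstract gives no value.
    """
    rng = random.Random(seed)
    duplicates = [
        [w + rng.gauss(0.0, noise_std) for w in expert]
        for expert in experts
    ]
    # Originals come first and are left unchanged; duplicates follow.
    return experts + duplicates


def grow_model(layers, noise_std=0.01, seed=0):
    """Apply width growth inside every layer, then depth growth."""
    widened = [
        {**layer, "experts": grow_width_noisy_experts(
            layer["experts"], noise_std=noise_std, seed=seed + i)}
        for i, layer in enumerate(layers)
    ]
    return grow_depth_interpositional(widened)
```

Calling `grow_model` on a two-layer toy model with two experts per layer yields four layers with four experts each. The injected noise is what lets a duplicated expert drift away from its source once training resumes; an exact copy would start out identical to the original.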
