Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

Abstract

LLMs are computationally expensive to pre-train because of their large scale. Model growth has emerged as a promising approach that leverages smaller models to accelerate the training of larger ones. However, the viability of these model growth methods for efficient LLM pre-training remains underexplored. This work identifies three critical obstacles: (O1) lack of comprehensive evaluation, (O2) untested viability for scaling, and (O3) lack of empirical guidelines. To tackle O1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called G_{stack}, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments on G_{stack} to address O2 and O3. For O2 (untested scalability), our study shows that G_{stack} is scalable and consistently performs well, in experiments with LLMs of up to 7B parameters after growth and with pre-training on up to 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our G_{stack} model converges to the same loss with 194B tokens, resulting in a 54.6% speedup. We further address O3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for G_{stack}, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of G_{stack}. Our code and pre-trained model are available at https://llm-stacking.github.io/.
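As an illustration only, the following minimal sketch shows one plausible reading of a depthwise stacking operator and of the reported speedup figure. It assumes that stacking means repeating the layers of a trained small model a whole number of times (the growth factor), and that speedup is the ratio of baseline tokens to tokens needed to reach the same loss, minus one. The names stack_layers and token_speedup, and the toy dictionary "layers", are hypothetical and are not taken from the released code.

```python
import copy


def stack_layers(small_layers, growth_factor):
    """Return a deeper layer list built from `growth_factor` copies of `small_layers`.

    Assumption: depthwise stacking is modelled here as whole-model repetition.
    """
    if growth_factor < 1:
        raise ValueError("growth_factor must be >= 1")
    grown = []
    for _ in range(growth_factor):
        # Deep-copy so that each copy can be updated independently after growth.
        grown.extend(copy.deepcopy(small_layers))
    return grown


def token_speedup(baseline_tokens, grown_tokens):
    """Relative speedup, assuming it is measured in training tokens to equal loss."""
    return baseline_tokens / grown_tokens - 1.0


# Toy stand-in for a 3-layer model: each "layer" is just a dict of parameters.
small = [{"name": f"block{i}", "w": [0.1 * i, 0.2 * i]} for i in range(3)]
grown = stack_layers(small, growth_factor=4)
print(len(grown))  # 12 layers: 3 layers repeated 4 times

# Figures quoted in the abstract: 300B baseline tokens vs. 194B tokens.
speedup = token_speedup(300e9, 194e9)
print(f"{speedup:.1%}")  # 54.6%
```

Under these assumptions, the token ratio 300/194 reproduces the 54.6% figure quoted above. The deep copy reflects a design choice in the sketch, namely that the repeated layers start from identical weights but are not tied during subsequent training.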
