Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
May 24, 2024
Authors: Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu
cs.AI
Abstract
LLMs are computationally expensive to pre-train due to their large scale.
Model growth emerges as a promising approach by leveraging smaller models to
accelerate the training of larger ones. However, the viability of these model
growth methods in efficient LLM pre-training remains underexplored. This work
identifies three critical obstacles: (O1)
lack of comprehensive evaluation, (O2) untested viability for
scaling, and (O3) lack of empirical guidelines. To tackle
O1, we summarize existing approaches into four atomic growth
operators and systematically evaluate them in a standardized LLM pre-training
setting. Our findings reveal that a depthwise stacking operator, called
G_{stack}, exhibits remarkable acceleration in training, leading to
decreased loss and improved overall performance on eight standard NLP
benchmarks compared to strong baselines. Motivated by these promising results,
we conduct extensive experiments to delve deeper into G_{stack} to
address O2 and O3. For O2 (untested
scalability), our study shows that G_{stack} is scalable and
consistently performs well, with experiments up to 7B LLMs after growth and
pre-training LLMs with 750B tokens. For example, compared to a conventionally
trained 7B model using 300B tokens, our G_{stack} model converges to
the same loss with 194B tokens, resulting in a 54.6% speedup. We further
address O3 (lack of empirical guidelines) by formalizing guidelines
to determine growth timing and growth factor for G_{stack}, making it
practical in general LLM pre-training. We also provide in-depth discussions and
comprehensive ablation studies of G_{stack}. Our code and pre-trained
model are available at https://llm-stacking.github.io/.
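To illustrate the depthwise stacking idea behind G_{stack}, the sketch below grows a small model by repeating its trained layers along the depth dimension. This is a minimal PyTorch sketch under assumed names (a `g_stack` helper, layers held in an `nn.ModuleList`, and an illustrative growth factor of 4); it is not the authors' implementation.

```python
# Minimal sketch of a depthwise stacking growth operator in PyTorch.
# Assumptions (not from the paper's code): the small model's layers live in an
# nn.ModuleList, and growth is plain repetition of that layer stack.
import copy
import torch.nn as nn

def g_stack(small_layers: nn.ModuleList, growth_factor: int) -> nn.ModuleList:
    """Stack deep copies of a trained small model's layers to build a deeper model."""
    grown = []
    for _ in range(growth_factor):
        # Deep copies let the repeated blocks diverge during continued pre-training.
        grown.extend(copy.deepcopy(layer) for layer in small_layers)
    return nn.ModuleList(grown)

# Toy usage: grow a 6-layer stack into a 24-layer stack (growth factor 4).
small = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    for _ in range(6)
)
large = g_stack(small, growth_factor=4)
assert len(large) == 24
```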
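One way to reproduce the reported 54.6% figure, assuming the speedup is measured as tokens saved relative to the G_{stack} model's token budget at equal loss:

```python
# Token counts taken from the abstract; the speedup definition here is an assumption.
baseline_tokens = 300e9  # conventionally trained 7B model
gstack_tokens = 194e9    # G_stack model reaching the same loss
speedup = (baseline_tokens - gstack_tokens) / gstack_tokens
print(f"{speedup:.1%}")  # -> 54.6%
```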