Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
May 24, 2024
作者: Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu
cs.AI
Abstract
LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical obstacles: (O1) lack of comprehensive evaluation, (O2) untested viability for scaling, and (O3) lack of empirical guidelines. To tackle O1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called G_{stack}, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into G_{stack} to address O2 and O3. For O2 (untested scalability), our study shows that G_{stack} is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our G_{stack} model converges to the same loss with 194B tokens, resulting in a 54.6% speedup. We further address O3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for G_{stack}, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of G_{stack}. Our code and pre-trained model are available at https://llm-stacking.github.io/.
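To make the depthwise stacking idea concrete, the following is a minimal sketch of what a G_{stack}-style growth operator could look like: a small trained transformer's layers are duplicated depthwise to initialize a deeper model, which is then pre-trained further. This assumes a PyTorch-style decoder whose blocks live in an `nn.ModuleList`; the function name `grow_depthwise` and the `growth_factor` argument are illustrative and not taken from the authors' released code.

```python
# Sketch of a depthwise stacking ("G_stack"-style) growth operator.
# Assumption: the base model exposes its transformer blocks as an nn.ModuleList.
import copy
import torch.nn as nn

def grow_depthwise(layers: nn.ModuleList, growth_factor: int) -> nn.ModuleList:
    """Initialize a deeper model by stacking copies of the trained base layers.

    With growth factor g, a base model with L blocks [B1, ..., BL] becomes a
    g*L-block model [B1..BL, B1..BL, ..., B1..BL], with weights copied from
    the small model rather than randomly initialized.
    """
    grown = []
    for _ in range(growth_factor):
        for layer in layers:
            grown.append(copy.deepcopy(layer))  # duplicate weights depthwise
    return nn.ModuleList(grown)
```

Training then continues on the grown model. The abstract's 54.6% figure follows from the token counts it reports: the grown 7B model matches the 300B-token baseline loss using only 194B tokens, and (300 - 194) / 194 ≈ 0.546.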