

Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

May 24, 2024
Authors: Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu
cs.AI

Abstract

LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach that leverages smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical obstacles: (O1) lack of comprehensive evaluation, (O2) untested viability for scaling, and (O3) lack of empirical guidelines. To tackle O1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called G_{stack}, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into G_{stack} to address O2 and O3. For O2 (untested scalability), our study shows that G_{stack} is scalable and consistently performs well, with experiments growing LLMs up to 7B parameters and pre-training them on up to 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our G_{stack} model converges to the same loss with 194B tokens, resulting in a 54.6% speedup. We further address O3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for G_{stack}, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of G_{stack}. Our code and pre-trained models are available at https://llm-stacking.github.io/.
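As a back-of-envelope check on the quoted figure (hedged; the paper's exact definition may differ), the 54.6% speedup is consistent with measuring the baseline tokens saved relative to the G_{stack} model's own token budget rather than relative to the baseline's 300B tokens:

```latex
% Assumed speedup definition: extra baseline tokens relative to the G_stack budget.
\mathrm{speedup} = \frac{300\,\text{B} - 194\,\text{B}}{194\,\text{B}} \approx 0.546 \;(= 54.6\%)
```

To make the depthwise stacking idea concrete, below is a minimal sketch of growing a shallow transformer checkpoint into a deeper one by repeating its full layer stack. It assumes a PyTorch-style state dict with layer parameters keyed as "layers.<i>.<name>"; the key pattern, function name, and interface are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' released code) of a depthwise stacking
# operator in the spirit of G_stack: a model with n layers is grown into one
# with g * n layers by repeating its entire layer stack g times, while
# embeddings, final norm, and the output head are copied unchanged.
import copy
import re
from typing import Any, Dict

LAYER_KEY = re.compile(r"^layers\.(\d+)\.(.+)$")  # assumed checkpoint key pattern


def stack_depthwise(state_dict: Dict[str, Any], num_layers: int, growth_factor: int) -> Dict[str, Any]:
    """Return an initialization for a (growth_factor * num_layers)-layer model."""
    grown: Dict[str, Any] = {}
    for key, value in state_dict.items():
        match = LAYER_KEY.match(key)
        if match is None:
            grown[key] = copy.deepcopy(value)  # non-layer weights reused as-is
            continue
        idx, rest = int(match.group(1)), match.group(2)
        for rep in range(growth_factor):  # place layer idx in every repetition of the stack
            grown[f"layers.{rep * num_layers + idx}.{rest}"] = copy.deepcopy(value)
    return grown


# Example: grow a 6-layer checkpoint into a 24-layer initialization (g = 4),
# then continue pre-training the deeper model from this starting point.
# deep_init = stack_depthwise(shallow_state_dict, num_layers=6, growth_factor=4)
```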
