Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
May 24, 2024
Authors: Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu
cs.AI
Abstract
LLMs are computationally expensive to pre-train due to their large scale.
Model growth emerges as a promising approach by leveraging smaller models to
accelerate the training of larger ones. However, the viability of these model
growth methods in efficient LLM pre-training remains underexplored. This work
identifies three critical obstacles: (O1)
lack of comprehensive evaluation, (O2) untested viability for
scaling, and (O3) lack of empirical guidelines. To tackle
O1, we summarize existing approaches into four atomic growth
operators and systematically evaluate them in a standardized LLM pre-training
setting. Our findings reveal that a depthwise stacking operator, called
G_{stack}, exhibits remarkable acceleration in training, leading to
decreased loss and improved overall performance on eight standard NLP
benchmarks compared to strong baselines. Motivated by these promising results,
we conduct extensive experiments to delve deeper into G_{stack} to
address O2 and O3. For O2 (untested
scalability), our study shows that G_{stack} is scalable and
consistently performs well, with experiments up to 7B LLMs after growth and
pre-training LLMs with 750B tokens. For example, compared to a conventionally
trained 7B model using 300B tokens, our G_{stack} model converges to
the same loss with 194B tokens, resulting in a 54.6% speedup. We further
address O3 (lack of empirical guidelines) by formalizing guidelines
to determine growth timing and growth factor for G_{stack}, making it
practical in general LLM pre-training. We also provide in-depth discussions and
comprehensive ablation studies of G_{stack}. Our code and pre-trained
model are available at https://llm-stacking.github.io/.
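To illustrate the depthwise stacking idea behind G_{stack}, the sketch below grows a small model by repeating its trained layers along the depth dimension. This is a minimal PyTorch sketch under assumed names (a `g_stack` helper, layers held in an `nn.ModuleList`, and an illustrative growth factor of 4); it is not the authors' implementation.

```python
# Minimal sketch of a depthwise stacking growth operator in PyTorch.
# Assumptions (not from the paper's code): the small model's layers live in an
# nn.ModuleList, and growth is plain repetition of that layer stack.
import copy
import torch.nn as nn

def g_stack(small_layers: nn.ModuleList, growth_factor: int) -> nn.ModuleList:
    """Stack deep copies of a trained small model's layers to build a deeper model."""
    grown = []
    for _ in range(growth_factor):
        # Deep copies let the repeated blocks diverge during continued pre-training.
        grown.extend(copy.deepcopy(layer) for layer in small_layers)
    return nn.ModuleList(grown)

# Toy usage: grow a 6-layer stack into a 24-layer stack (growth factor 4).
small = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    for _ in range(6)
)
large = g_stack(small, growth_factor=4)
assert len(large) == 24
```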
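One way to reproduce the reported 54.6% figure, assuming the speedup is measured as tokens saved relative to the G_{stack} model's token budget at equal loss:

```python
# Token counts taken from the abstract; the speedup definition here is an assumption.
baseline_tokens = 300e9  # conventionally trained 7B model
gstack_tokens = 194e9    # G_stack model reaching the same loss
speedup = (baseline_tokens - gstack_tokens) / gstack_tokens
print(f"{speedup:.1%}")  # -> 54.6%
```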