Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

Abstract

LLMs are computationally expensive to pre-train because of their large scale. Model growth has emerged as a promising approach that leverages smaller models to accelerate the training of larger ones. However, the viability of these model growth methods for efficient LLM pre-training remains underexplored. This work identifies three critical obstacles: (O1) lack of comprehensive evaluation, (O2) untested viability for scaling, and (O3) lack of empirical guidelines. To tackle O1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called G_{stack}, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments on G_{stack} to address O2 and O3. For O2 (untested scalability), our study shows that G_{stack} is scalable and consistently performs well, with experiments on LLMs of up to 7B parameters after growth and on LLMs pre-trained with up to 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our G_{stack} model converges to the same loss with 194B tokens, resulting in a 54.6% speedup. We further address O3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for G_{stack}, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of G_{stack}. Our code and pre-trained model are available at https://llm-stacking.github.io/.
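To make the two technical ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that depthwise stacking means repeating a trained small model's layer sequence a whole number of times (the growth factor), and that the reported speedup is the extra fraction of tokens the conventionally trained baseline needs to reach the same loss. The names `stack_depthwise`, `token_speedup`, and the toy two-layer model are hypothetical and introduced only for illustration.

```python
import copy


def stack_depthwise(layers, growth_factor):
    """Return a deeper layer list made of `growth_factor` copies of `layers`.

    `layers` is any list of per-layer parameter containers. Each copy is
    deep-copied so the repeated layers can diverge during further training.
    (Assumption: this mirrors the depthwise stacking idea only in spirit.)
    """
    if growth_factor < 1:
        raise ValueError("growth_factor must be a positive integer")
    return [copy.deepcopy(layer) for _ in range(growth_factor) for layer in layers]


def token_speedup(baseline_tokens, grown_tokens):
    """Extra fraction of tokens the baseline needs to reach the same loss."""
    return baseline_tokens / grown_tokens - 1.0


# Hypothetical 2-layer "small model"; the weights are placeholder numbers.
small_model = [
    {"name": "layer0", "w": [0.1, 0.2]},
    {"name": "layer1", "w": [0.3, 0.4]},
]
grown_model = stack_depthwise(small_model, growth_factor=3)

# The layer order repeats the small model's sequence three times.
print([layer["name"] for layer in grown_model])

# Token counts quoted in the abstract: 300B (baseline) vs. 194B (grown model).
print(f"{token_speedup(300e9, 194e9):.1%}")  # prints 54.6%
```

Under this reading, 300B / 194B - 1 ≈ 0.546, which agrees with the 54.6% figure quoted above. In a real training framework the dictionaries would be actual Transformer blocks, and the growth timing and growth factor would follow the paper's guidelines rather than the placeholder values used here.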

