Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
September 19, 2024
Authors: Mohammad Samragh, Iman Mirzadeh, Keivan Alizadeh Vahid, Fartash Faghri, Minsik Cho, Moin Nabi, Devang Naik, Mehrdad Farajtabar
cs.AI
Abstract
The pre-training phase of language models often begins with randomly
initialized parameters. With the current trends in scaling models, training
their large number of parameters can be extremely slow and costly. In contrast,
small language models are less expensive to train, but they often cannot
achieve the accuracy of large models. In this paper, we explore an intriguing
idea to connect these two different regimes: Can we develop a method to
initialize large language models using smaller pre-trained models? Will such
initialization bring any benefits in terms of training time and final accuracy?
In this paper, we introduce HyperCloning, a method that can expand the
parameters of a pre-trained language model to those of a larger model with
increased hidden dimensions. Our method ensures that the larger model retains
the functionality of the smaller model. As a result, the larger model already
inherits the predictive power and accuracy of the smaller model before the
training starts. We demonstrate that training such an initialized model results
in significant savings in terms of GPU hours required for pre-training large
language models.
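To make the core idea concrete, the sketch below (PyTorch, not the authors' released code) shows a function-preserving width expansion for a single linear layer. The block tiling with a 1/expand scaling is one natural way to obtain the property the abstract describes: the expanded layer, given a duplicated input, reproduces the small layer's output duplicated across the larger hidden dimension. The helper name `clone_linear` and the specific tiling scheme are illustrative assumptions, not necessarily the exact construction used in the paper.

```python
# Minimal sketch of function-preserving width expansion for one linear layer.
# If the small layer computes y = W x + b, the expanded layer applied to the
# duplicated input [x; x] returns the duplicated output [y; y].
import torch
import torch.nn as nn

def clone_linear(small: nn.Linear, expand: int = 2) -> nn.Linear:
    """Expand a Linear layer's width by `expand`x while preserving its function."""
    d_out, d_in = small.weight.shape
    big = nn.Linear(d_in * expand, d_out * expand, bias=small.bias is not None)
    with torch.no_grad():
        # Tile the weight in an (expand x expand) block pattern and divide by
        # `expand`, so each expanded output row sums the same contribution once.
        big.weight.copy_(small.weight.repeat(expand, expand) / expand)
        if small.bias is not None:
            big.bias.copy_(small.bias.repeat(expand))
    return big

# Quick check: the expanded layer maps a duplicated input to a duplicated output.
small = nn.Linear(8, 4)
big = clone_linear(small, expand=2)
x = torch.randn(8)
assert torch.allclose(big(torch.cat([x, x])), small(x).repeat(2), atol=1e-6)
```

Applying the same kind of cloning to every weight in the network (embeddings, attention, and MLP projections) would, as the abstract states, let the larger model reproduce the smaller model's predictions before any further training.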