Pre-training Small Base LMs with Fewer Tokens
April 12, 2024
Authors: Sunny Sanyal, Sujay Sanghavi, Alexandros G. Dimakis
cs.AI
Abstract
We study the effectiveness of a simple approach to developing a small base language model (LM) starting from an existing large base LM: first inherit a few transformer blocks from the larger LM, and then train this smaller model on a very small subset (0.1%) of the raw pretraining data of the larger model. We call our simple recipe Inheritune and first demonstrate it for building a small base LM with 1.5B parameters using 1B tokens (and the starting few layers of a larger 3B-parameter LM); we do this using a single A6000 GPU for less than half a day. Across 9 diverse evaluation datasets as well as the MMLU benchmark, the resulting model compares favorably to publicly available base models of 1B-2B size, some of which have been trained using 50-1000 times more tokens.
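To make the inheritance step concrete, below is a minimal sketch of how the first few transformer blocks of a larger pretrained LM could be copied into a smaller model, using Hugging Face transformers with GPT-2 checkpoints. The model names and the choice of n_inherit = 6 are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
# Minimal sketch of an Inheritune-style layer-inheritance step.
# Assumptions: GPT-2-style checkpoints from Hugging Face `transformers`;
# `n_inherit` and the model name are illustrative, not the paper's setup.
import copy
from transformers import GPT2Config, GPT2LMHeadModel

n_inherit = 6  # hypothetical: keep the first 6 transformer blocks

# Load the larger pretrained base LM (GPT-2 medium has 24 blocks).
large = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Build a smaller model with an identical config except for depth.
small_cfg = GPT2Config.from_pretrained("gpt2-medium", n_layer=n_inherit)
small = GPT2LMHeadModel(small_cfg)

# Inherit the embeddings and the starting n_inherit transformer blocks.
small.transformer.wte = copy.deepcopy(large.transformer.wte)
small.transformer.wpe = copy.deepcopy(large.transformer.wpe)
for i in range(n_inherit):
    small.transformer.h[i] = copy.deepcopy(large.transformer.h[i])
small.transformer.ln_f = copy.deepcopy(large.transformer.ln_f)

# Re-tie the LM head to the (replaced) token embeddings.
small.tie_weights()

# Per the recipe, `small` would then be trained further on a very small
# subset (~0.1%) of the larger model's pretraining data.
small.save_pretrained("inherited-small-lm")
```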
We also investigate Inheritune in a slightly different setting, where we train small LMs utilizing larger LMs and their full pre-training dataset. Here we show that smaller LMs trained utilizing some of the layers of GPT2-medium (355M) and GPT2-large (770M) can effectively match the validation loss of their bigger counterparts trained from scratch, given the same number of training steps on the OpenWebText dataset (9B tokens). We analyze our recipe with extensive experiments and demonstrate its efficacy in diverse settings. Our code is available at https://github.com/sanyalsunny111/LLM-Inheritune.