Pre-training Small Base LMs with Fewer Tokens
April 12, 2024
Authors: Sunny Sanyal, Sujay Sanghavi, Alexandros G. Dimakis
cs.AI
Abstract
We study the effectiveness of a simple approach to develop a small base
language model (LM) starting from an existing large base LM: first inherit a
few transformer blocks from the larger LM, and then train this smaller model on
a very small subset (0.1%) of the raw pretraining data of the larger model. We
call our simple recipe Inheritune and first demonstrate it for building a small
base LM with 1.5B parameters using 1B tokens (and the first few layers of a
larger LM of 3B parameters); we do this using a single A6000 GPU for less than
half a day. Across 9 diverse evaluation datasets as well as the MMLU benchmark,
the resulting model compares favorably to publicly available base models of
1B-2B size, some of which have been trained using 50-1000 times more tokens.
We investigate Inheritune in a slightly different setting where we train
small LMs utilizing larger LMs and their full pre-training dataset. Here we
show that smaller LMs trained utilizing some of the layers of GPT2-medium
(355M) and GPT-2-large (770M) can effectively match the val loss of their
bigger counterparts when trained from scratch for the same number of training
steps on the OpenWebText dataset with 9B tokens. We analyze our recipe with
extensive experiments and demonstrate its efficacy in diverse settings. Our code
is available at https://github.com/sanyalsunny111/LLM-Inheritune.