Language models scale reliably with over-training and on downstream tasks
March 13, 2024
作者: Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G. Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
cs.AI
Abstract
Scaling laws are useful guides for developing language models, but there are
still gaps between current scaling studies and how language models are
ultimately trained and evaluated. For instance, scaling is usually studied in
the compute-optimal training regime (i.e., "Chinchilla optimal" regime);
however, in practice, models are often over-trained to reduce inference costs.
Moreover, scaling laws mostly predict loss on next-token prediction, but
ultimately models are compared based on downstream task performance. In this
paper, we address both shortcomings. To do so, we create a testbed of 104
models with 0.011B to 6.9B parameters trained with various numbers of tokens on
three data distributions. First, we investigate scaling in the over-trained
regime. We fit scaling laws that extrapolate in both the number of model
parameters and the ratio of training tokens to parameters. This enables us to
predict the validation loss of a 1.4B parameter, 900B token run (i.e.,
32× over-trained) and a 6.9B parameter, 138B token
run, each from experiments that take 300× less compute.
Second, we relate the perplexity of a language model to its downstream task
performance via a power law. We use this law to predict top-1 error averaged
over downstream tasks for the two aforementioned models using experiments that
take 20× less compute. Our experiments are available at
https://github.com/mlfoundations/scaling.
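As a rough illustration of the first step described above (not the authors' exact fitting procedure), the sketch below fits a Chinchilla-style parametric loss law L(N, D) = E + A·N^(-α) + B·D^(-β) to small-scale runs and extrapolates it to a heavily over-trained configuration. The functional form, constants, and synthetic data are illustrative assumptions.

```python
# Minimal sketch: fit a Chinchilla-style loss law and extrapolate it to an
# over-trained run. Functional form, constants, and data are assumptions,
# not the paper's actual fit.
import numpy as np
from scipy.optimize import curve_fit

def loss_law(x, E, A, alpha, B, beta):
    N, D = x  # model parameters, training tokens
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Synthetic small-scale "runs": parameter counts, tokens-per-parameter ratios,
# and noisy losses generated from an assumed ground-truth law.
rng = np.random.default_rng(0)
N = np.array([1.1e7, 7.9e7, 1.5e8, 4.1e8, 1.4e9, 2.8e9])
M = np.array([20, 20, 40, 80, 20, 40])      # tokens per parameter
D = M * N                                    # training tokens
L = loss_law((N, D), 1.7, 400.0, 0.34, 410.0, 0.28)
L = L + rng.normal(0.0, 0.01, size=N.size)   # measurement noise

popt, _ = curve_fit(loss_law, (N, D), L,
                    p0=[1.5, 300.0, 0.3, 300.0, 0.3], maxfev=20000)

# Extrapolate to a 1.4B-parameter model trained at ~640 tokens per parameter
# (about 32x the "Chinchilla-optimal" ratio of ~20, i.e. roughly 900B tokens).
N_target, M_target = 1.4e9, 640
pred = loss_law((N_target, M_target * N_target), *popt)
print(f"predicted validation loss: {pred:.3f}")
```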
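For the second step, a similarly hedged sketch: fit a simple saturating curve that maps validation loss to average downstream top-1 error on hypothetical small-scale measurements, then evaluate it at a lower, extrapolated loss. The functional form and data points are assumptions for illustration and may differ from the paper's fitted law.

```python
# Minimal sketch: fit a saturating curve mapping validation loss to average
# downstream top-1 error. Functional form and data are illustrative
# assumptions, not the paper's actual fit.
import numpy as np
from scipy.optimize import curve_fit

def top1_error(loss, eps, k, gamma):
    # Error approaches eps (near chance level) at high loss and drops as
    # the validation loss improves.
    return eps - k * np.exp(-gamma * loss)

# Hypothetical (validation loss, average top-1 error) pairs from small runs.
loss = np.array([3.90, 3.40, 3.00, 2.80, 2.65])
err = np.array([0.62, 0.57, 0.52, 0.49, 0.47])

popt, _ = curve_fit(top1_error, loss, err, p0=[0.70, 2.0, 1.0], maxfev=20000)

# Predict downstream error at a loss extrapolated for a larger model,
# e.g. the output of a loss-scaling fit (2.35 is a made-up example value).
print(f"predicted avg top-1 error at loss 2.35: {top1_error(2.35, *popt):.3f}")
```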