Language models scale reliably with over-training and on downstream tasks
March 13, 2024
作者: Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G. Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
cs.AI
Abstract
Scaling laws are useful guides for developing language models, but there are
still gaps between current scaling studies and how language models are
ultimately trained and evaluated. For instance, scaling is usually studied in
the compute-optimal training regime (i.e., "Chinchilla optimal" regime);
however, in practice, models are often over-trained to reduce inference costs.
Moreover, scaling laws mostly predict loss on next-token prediction, but
ultimately models are compared based on downstream task performance. In this
paper, we address both shortcomings. To do so, we create a testbed of 104
models with 0.011B to 6.9B parameters trained with various numbers of tokens on
three data distributions. First, we investigate scaling in the over-trained
regime. We fit scaling laws that extrapolate in both the number of model
parameters and the ratio of training tokens to parameters. This enables us to
predict the validation loss of a 1.4B parameter, 900B token run (i.e.,
32× over-trained) and a 6.9B parameter, 138B token
run, each from experiments that take 300× less compute.
Second, we relate the perplexity of a language model to its downstream task
performance via a power law. We use this law to predict top-1 error averaged
over downstream tasks for the two aforementioned models using experiments that
take 20× less compute. Our experiments are available at
https://github.com/mlfoundations/scaling.
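As a rough illustration of the first step described above (not the authors' exact fitting procedure), the sketch below fits a Chinchilla-style parametric loss law L(N, D) = E + A·N^(-α) + B·D^(-β) to small-scale runs and extrapolates it to a heavily over-trained configuration. The functional form, constants, and synthetic data are illustrative assumptions.

```python
# Minimal sketch: fit a Chinchilla-style loss law and extrapolate it to an
# over-trained run. Functional form, constants, and data are assumptions,
# not the paper's actual fit.
import numpy as np
from scipy.optimize import curve_fit

def loss_law(x, E, A, alpha, B, beta):
    N, D = x  # model parameters, training tokens
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Synthetic small-scale "runs": parameter counts, tokens-per-parameter ratios,
# and noisy losses generated from an assumed ground-truth law.
rng = np.random.default_rng(0)
N = np.array([1.1e7, 7.9e7, 1.5e8, 4.1e8, 1.4e9, 2.8e9])
M = np.array([20, 20, 40, 80, 20, 40])      # tokens per parameter
D = M * N                                    # training tokens
L = loss_law((N, D), 1.7, 400.0, 0.34, 410.0, 0.28)
L = L + rng.normal(0.0, 0.01, size=N.size)   # measurement noise

popt, _ = curve_fit(loss_law, (N, D), L,
                    p0=[1.5, 300.0, 0.3, 300.0, 0.3], maxfev=20000)

# Extrapolate to a 1.4B-parameter model trained at ~640 tokens per parameter
# (about 32x the "Chinchilla-optimal" ratio of ~20, i.e. roughly 900B tokens).
N_target, M_target = 1.4e9, 640
pred = loss_law((N_target, M_target * N_target), *popt)
print(f"predicted validation loss: {pred:.3f}")
```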
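For the second step, a similarly hedged sketch: fit a simple saturating curve that maps validation loss to average downstream top-1 error on hypothetical small-scale measurements, then evaluate it at a lower, extrapolated loss. The functional form and data points are assumptions for illustration and may differ from the paper's fitted law.

```python
# Minimal sketch: fit a saturating curve mapping validation loss to average
# downstream top-1 error. Functional form and data are illustrative
# assumptions, not the paper's actual fit.
import numpy as np
from scipy.optimize import curve_fit

def top1_error(loss, eps, k, gamma):
    # Error approaches eps (near chance level) at high loss and drops as
    # the validation loss improves.
    return eps - k * np.exp(-gamma * loss)

# Hypothetical (validation loss, average top-1 error) pairs from small runs.
loss = np.array([3.90, 3.40, 3.00, 2.80, 2.65])
err = np.array([0.62, 0.57, 0.52, 0.49, 0.47])

popt, _ = curve_fit(top1_error, loss, err, p0=[0.70, 2.0, 1.0], maxfev=20000)

# Predict downstream error at a loss extrapolated for a larger model,
# e.g. the output of a loss-scaling fit (2.35 is a made-up example value).
print(f"predicted avg top-1 error at loss 2.35: {top1_error(2.35, *popt):.3f}")
```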