Language models scale reliably with over-training and on downstream tasks

March 13, 2024
作者: Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G. Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
cs.AI

Abstract

Scaling laws are useful guides for developing language models, but there are still gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., the "Chinchilla optimal" regime); however, in practice, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but ultimately models are compared based on downstream task performance. In this paper, we address both shortcomings. To do so, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we investigate scaling in the over-trained regime. We fit scaling laws that extrapolate in both the number of model parameters and the ratio of training tokens to parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32× over-trained) and a 6.9B parameter, 138B token run, each from experiments that take 300× less compute. Second, we relate the perplexity of a language model to its downstream task performance via a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models using experiments that take 20× less compute. Our experiments are available at https://github.com/mlfoundations/scaling.
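
The two fits described in the abstract can be made concrete with a small sketch. The snippet below is a minimal illustration, not the paper's actual fitting code: it assumes a loss law of the form L(N, M) = E + a·N^(−α) + b·(N·M)^(−β) in parameter count N and tokens-per-parameter ratio M, and a power law err = c + k·L^γ relating loss to average downstream top-1 error. The constants and the synthetic data are placeholders, not measurements from the paper's testbed.

```python
# Hedged sketch of the two fits: (i) a scaling law for validation loss in
# parameter count N and tokens-per-parameter ratio M, and (ii) a power law
# mapping loss to average downstream top-1 error. Functional forms, constants,
# and data are illustrative assumptions, not the paper's parameterization.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# (i) Assumed loss law: L(N, M) = E + a * N**(-alpha) + b * (N * M)**(-beta)
def loss_law(X, E, a, alpha, b, beta):
    N, M = X
    return E + a * N ** (-alpha) + b * (N * M) ** (-beta)

# Small-scale grid of runs (N in billions of parameters, M = tokens/parameter).
N = np.array([0.011, 0.079, 0.154, 0.411] * 3)
M = np.array([20] * 4 + [80] * 4 + [320] * 4, dtype=float)

# Synthetic "measurements" generated from the assumed law plus noise.
L = loss_law((N, M), 1.8, 0.35, 0.25, 0.45, 0.30) + rng.normal(0, 0.01, size=N.shape)

popt_loss, _ = curve_fit(loss_law, (N, M), L, p0=[2.0, 0.5, 0.3, 0.5, 0.3], maxfev=50000)

# Extrapolate to a larger, heavily over-trained run (e.g., 1.4B params, M = 640).
pred_loss = loss_law((np.array([1.4]), np.array([640.0])), *popt_loss)

# (ii) Assumed downstream law: top-1 error = c + k * loss**gamma
def error_law(loss, c, k, gamma):
    return c + k * loss ** gamma

err = 0.25 + 0.05 * L ** 1.6 + rng.normal(0, 0.005, size=L.shape)  # synthetic errors
popt_err, _ = curve_fit(error_law, L, err, p0=[0.2, 0.1, 1.5], maxfev=50000)

print(f"extrapolated validation loss: {pred_loss[0]:.3f}")
print(f"predicted average top-1 error: {error_law(pred_loss, *popt_err)[0]:.3f}")
```

Fitting on small-scale runs and then evaluating both laws at a much larger, heavily over-trained (N, M) point mirrors the extrapolation setup described above; the paper's actual functional forms and fitted constants may differ.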
