Language models scale reliably with over-training and on downstream tasks

Abstract

Scaling laws are useful guides for developing language models, but there are still gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., the "Chinchilla optimal" regime); in practice, however, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are ultimately compared based on downstream task performance. In this paper, we address both shortcomings. To do so, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we investigate scaling in the over-trained regime. We fit scaling laws that extrapolate in both the number of model parameters and the ratio of training tokens to parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32× over-trained) and of a 6.9B parameter, 138B token run, each from experiments that take 300× less compute. Second, we relate the perplexity of a language model to its downstream task performance via a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models using experiments that take 20× less compute. Our experiments are available at https://github.com/mlfoundations/scaling.
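As a rough sanity check on the "32× over-trained" figure, the sketch below computes the ratio of training tokens to parameters for the two runs named in the abstract and compares it with a compute-optimal ratio. The value of 20 tokens per parameter is an assumption here: it is a commonly cited rule of thumb for the Chinchilla-optimal regime and is not stated in the abstract. The function names are hypothetical.

```python
# Assumption (not stated in the abstract): a compute-optimal run uses
# roughly 20 training tokens per model parameter.
ASSUMED_OPTIMAL_TOKENS_PER_PARAM = 20.0


def token_multiplier(params: float, tokens: float) -> float:
    """Return the ratio of training tokens to model parameters."""
    return tokens / params


def over_training_factor(
    params: float,
    tokens: float,
    optimal: float = ASSUMED_OPTIMAL_TOKENS_PER_PARAM,
) -> float:
    """Return how many times more tokens per parameter a run uses
    than the assumed compute-optimal ratio."""
    return token_multiplier(params, tokens) / optimal


# The two runs named in the abstract: (parameters, training tokens).
runs = {
    "1.4B parameters, 900B tokens": (1.4e9, 900e9),
    "6.9B parameters, 138B tokens": (6.9e9, 138e9),
}

for name, (params, tokens) in runs.items():
    ratio = token_multiplier(params, tokens)
    factor = over_training_factor(params, tokens)
    print(f"{name}: {ratio:.0f} tokens/param, {factor:.1f}x the assumed optimum")
```

Under this assumption, the 1.4B run comes out at roughly 32 times the compute-optimal ratio, consistent with the abstract, while the 6.9B run sits at about the assumed optimum.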
