Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Abstract

We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we use a two-step prediction approach: first, use model and data size to predict a task-specific loss; then, use this task loss to predict task performance. We train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models costs only 1% of the compute used for the target models. On four multiple-choice tasks written in ranked classification format, we can predict the accuracy of both target models within 2 points of absolute error. We have higher prediction error on four other tasks (average absolute error 6.9) and find that these are often tasks with higher variance in task metrics. We also find that using less compute to train fewer ladder models tends to degrade predictions. Finally, we empirically show that our design choices and the two-step approach lead to superior performance in establishing scaling laws.
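The two-step idea can be illustrated with a minimal sketch. The abstract does not state the functional forms, so everything concrete here is an assumption made for illustration: step 1 uses a power law plus a constant in model size and token count, step 2 uses a sigmoid from task loss to accuracy, and the ladder sizes, token multipliers, and ground-truth coefficients are synthetic. The function names (`fit_loss_law`, `fit_accuracy_curve`) are hypothetical and do not come from the paper's code.

```python
import itertools
import numpy as np


def fit_loss_law(n, d, loss, exponents=np.linspace(0.1, 0.6, 11)):
    """Step 1 (assumed form): loss ~ A / n**alpha + B / d**beta + E.

    Exponents are grid-searched; (A, B, E) come from linear least squares.
    """
    best = None
    for alpha, beta in itertools.product(exponents, exponents):
        X = np.column_stack([n ** -alpha, d ** -beta, np.ones_like(n)])
        coef, *_ = np.linalg.lstsq(X, loss, rcond=None)
        err = float(np.sum((X @ coef - loss) ** 2))
        if best is None or err < best[0]:
            best = (err, alpha, beta, coef)
    _, alpha, beta, (A, B, E) = best
    return lambda n_, d_: A / n_ ** alpha + B / d_ ** beta + E


def fit_accuracy_curve(loss, acc, slopes=np.linspace(1, 10, 10),
                       midpoints=np.linspace(0.5, 2.0, 16)):
    """Step 2 (assumed form): acc ~ a / (1 + exp(k * (loss - mid))) + b.

    (k, mid) are grid-searched; (a, b) come from linear least squares.
    """
    best = None
    for k, mid in itertools.product(slopes, midpoints):
        s = 1.0 / (1.0 + np.exp(k * (loss - mid)))
        X = np.column_stack([s, np.ones_like(s)])
        coef, *_ = np.linalg.lstsq(X, acc, rcond=None)
        err = float(np.sum((X @ coef - acc) ** 2))
        if best is None or err < best[0]:
            best = (err, k, mid, coef)
    _, k, mid, (a, b) = best
    return lambda l: a / (1.0 + np.exp(k * (l - mid))) + b


# Synthetic ground truth, invented purely so the sketch has data to fit.
def true_loss(n, d):
    return 400.0 / n ** 0.3 + 200.0 / d ** 0.25 + 0.3


def true_acc(loss):
    return 0.75 / (1.0 + np.exp(3.0 * (loss - 1.5))) + 0.25


# Hypothetical ladder: four small model sizes, each at four token budgets.
sizes = np.array([190e6, 370e6, 760e6, 1.3e9])
multipliers = np.array([1.0, 2.0, 5.0, 10.0])
n = np.repeat(sizes, len(multipliers))
d = 20.0 * n * np.tile(multipliers, len(sizes))
ladder_loss = true_loss(n, d)
ladder_acc = true_acc(ladder_loss)

# Fit both steps on the ladder, then chain them for a larger target model.
predict_loss = fit_loss_law(n, d, ladder_loss)
predict_acc = fit_accuracy_curve(ladder_loss, ladder_acc)

target_n, target_d = 7e9, 4e12  # 7B parameters, 4T tokens
pred_loss = predict_loss(target_n, target_d)
pred_acc = predict_acc(pred_loss)
print(f"predicted task loss {pred_loss:.3f}, predicted accuracy {pred_acc:.3f}")
```

Because the synthetic data is noise-free and generated from the same assumed forms, the chained prediction matches the synthetic ground truth closely. Real ladder measurements are noisy, which is where the higher-variance tasks mentioned in the abstract would be expected to show larger errors.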
