Prescriptive Scaling Laws for Data Constrained Training

Abstract

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique, which limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice: beyond a certain point, further repetition is counterproductive, and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay (λ=1.0) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
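As a rough illustration of what an "additive overfitting penalty" on top of the Chinchilla law could look like, the sketch below writes out the standard Chinchilla parametric loss and then adds a penalty term scaled by a single coefficient. Only the Chinchilla form is established; the symbols U (unique tokens), R (number of passes over the data), c (the overfitting coefficient), the placeholder function g, and the choice D = U·R are assumptions made here for illustration, since the abstract does not state the penalty's functional form.

```latex
% Chinchilla parametric loss: N = parameters, D = training tokens,
% with every token assumed to be unique.
\[
  L_{\mathrm{unique}}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]

% Hypothetical additive-penalty form under repetition.
% ASSUMPTION: U, R, c, g and the substitution D = U R are illustrative
% placeholders; the abstract does not give the actual penalty.
\[
  L_{\mathrm{repeat}}(N, U, R)
    = L_{\mathrm{unique}}(N,\, U R) + c \, g(N, U, R),
  \qquad g \ge 0
\]
```

Under this reading, all overfitting behavior of a training configuration is carried by c, so two configurations can be compared by their fitted c alone. A reduction of about 70%, as reported for λ=1.0, would correspond to c falling to roughly 0.3 times its value in the comparison configuration.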
