Scaling Laws for Optimal Data Mixtures

Abstract

Large foundation models are typically trained on data from multiple domains, and the data mixture, that is, the proportion of each domain used, plays a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size N trained with D tokens and a specific domain weight vector h. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws extrapolate to new data mixtures and across scales: their parameters can be accurately estimated from a few small-scale training runs and then used to estimate performance at larger scales and for unseen domain weights. The scaling laws make it possible to derive the optimal domain weights for any target domain under a given training budget (N, D), providing a principled alternative to costly trial-and-error methods.
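To make the workflow concrete, here is a minimal sketch of how a fitted loss predictor could be used to pick domain weights at a fixed budget. The abstract does not state the functional form of the scaling law, so the form below is an assumption chosen for illustration. The names `predicted_loss`, `simplex_grid`, and `best_mixture` and all parameter values are hypothetical and not taken from the source.

```python
from itertools import product


def predicted_loss(n_params, n_tokens, h, p):
    """Loss predicted by an ASSUMED parametric form (illustrative only):

        L = E + A / N**alpha + B / D**beta + 1 / sum_i(C_i * h_i**gamma_i)

    n_params is N, n_tokens is D, h is the domain weight vector.
    """
    mix = sum(c * w ** g for c, w, g in zip(p["C"], h, p["gamma"]))
    return (p["E"]
            + p["A"] / n_params ** p["alpha"]
            + p["B"] / n_tokens ** p["beta"]
            + 1.0 / mix)


def simplex_grid(k, steps):
    """Yield every length-k weight vector whose entries are multiples
    of 1/steps and sum to 1."""
    for combo in product(range(steps + 1), repeat=k - 1):
        rest = steps - sum(combo)
        if rest >= 0:
            yield tuple(c / steps for c in combo) + (rest / steps,)


def best_mixture(n_params, n_tokens, p, steps=20):
    """Grid-search the simplex for the weights with the lowest predicted loss."""
    k = len(p["C"])
    return min(simplex_grid(k, steps),
               key=lambda h: predicted_loss(n_params, n_tokens, h, p))


# Hypothetical "fitted" parameters for two domains (made-up numbers).
params = {"E": 1.5, "A": 400.0, "alpha": 0.34, "B": 410.0, "beta": 0.28,
          "C": [1.0, 1.0], "gamma": [0.5, 0.5]}
h_star = best_mixture(n_params=1e9, n_tokens=2e10, p=params)
print(h_star, predicted_loss(1e9, 2e10, h_star, params))
```

In this assumed form the mixture term is purely additive, so the minimizing weights do not change with N and D. A form whose coefficients depend on h would make the optimum budget-dependent. A grid search is used only for simplicity; it scales poorly with the number of domains, where a constrained optimizer would be the more natural choice.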

