

Scaling Laws for Optimal Data Mixtures

July 12, 2025
Authors: Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin
cs.AI

Abstract

Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size N trained with D tokens and a specific domain weight vector h. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and then used to estimate performance at larger scales and unseen domain weights. These scaling laws make it possible to derive the optimal domain weights for any target domain under a given training budget (N, D), providing a principled alternative to costly trial-and-error methods.
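To make the optimization step concrete, below is a minimal, hypothetical sketch in Python. The abstract does not give the paper's actual functional form, so the surrogate loss L(N, D, h), its parameter values, and the helper names (predicted_loss, optimal_mixture) are illustrative assumptions; the sketch only shows how one could minimize a fitted scaling-law predictor over the domain weights h on the probability simplex for a fixed budget (N, D).

```python
# Hypothetical sketch: assume a simple parametric scaling-law surrogate for the
# loss and find the mixture h that minimizes it under a fixed budget (N, D).
# This is NOT the paper's published parameterization; all values are made up.
import numpy as np
from scipy.optimize import minimize


def predicted_loss(h, N, D, params):
    """Illustrative surrogate: irreducible loss plus power-law terms in model
    size N and token count D, plus a mixture-dependent penalty that grows when
    domains the target cares about are under-weighted."""
    eps = 1e-8  # avoid division by zero at the simplex boundary
    mixture_term = np.sum(params["c"] / (h + eps) ** params["gamma"])
    return (
        params["E"]
        + params["A"] / N ** params["alpha"]
        + params["B"] / D ** params["beta"]
        + mixture_term
    )


def optimal_mixture(N, D, params, n_domains):
    """Minimize the surrogate over h subject to h >= 0 and sum(h) == 1."""
    h0 = np.full(n_domains, 1.0 / n_domains)  # start from a uniform mixture
    constraints = {"type": "eq", "fun": lambda h: np.sum(h) - 1.0}
    bounds = [(0.0, 1.0)] * n_domains
    result = minimize(
        predicted_loss, h0, args=(N, D, params),
        method="SLSQP", bounds=bounds, constraints=constraints,
    )
    return result.x


if __name__ == "__main__":
    # Hypothetical parameters, e.g. fitted from a few small-scale runs over
    # a 3-domain mixture (web, code, math).
    params = {
        "E": 1.7, "A": 400.0, "alpha": 0.34, "B": 2000.0, "beta": 0.28,
        "c": np.array([0.02, 0.05, 0.01]),
        "gamma": np.array([0.5, 0.5, 0.5]),
    }
    h_star = optimal_mixture(N=1e9, D=2e10, params=params, n_domains=3)
    print("Predicted optimal domain weights:", np.round(h_star, 3))
```

In this sketch the expensive part of the paper's pipeline, fitting the scaling-law parameters from a few small-scale runs, is assumed to have already happened; the point is only that once a predictor of the loss in (N, D, h) exists, the optimal mixture reduces to a small constrained optimization over the simplex.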