Scaling Laws for Optimal Data Mixtures

Abstract

Large foundation models are typically trained on data from multiple domains, and the data mixture (the proportion of each domain used) plays a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size N trained with D tokens and a specific domain weight vector h. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct, large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws extrapolate to new data mixtures and across scales: their parameters can be accurately estimated from a few small-scale training runs and then used to predict performance at larger scales and for unseen domain weights. The scaling laws make it possible to derive the optimal domain weights for any target domain under a given training budget (N, D), providing a principled alternative to costly trial-and-error methods.
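To make the workflow concrete, the following is a minimal sketch of how a fitted law could be used to choose domain weights for a budget (N, D). The abstract does not state the functional form of the law, so the additive form in `predicted_loss`, the parameter names, and every numeric value below are assumptions chosen purely for illustration. The grid search is likewise a simple stand-in for whatever optimizer one would actually use.

```python
from itertools import product


def predicted_loss(n_params, n_tokens, h, p):
    """Hypothetical law (an assumption, not taken from the abstract):
    L = E + A / N^alpha + B / D^beta + 1 / sum_i(C_i * h_i^gamma_i)."""
    mix = sum(c * w ** g for c, w, g in zip(p["C"], h, p["gamma"]))
    return (p["E"]
            + p["A"] / n_params ** p["alpha"]
            + p["B"] / n_tokens ** p["beta"]
            + 1.0 / mix)


def simplex_grid(k, steps):
    """Yield every length-k weight vector whose entries are multiples
    of 1/steps and sum to 1."""
    for head in product(range(steps + 1), repeat=k - 1):
        rest = steps - sum(head)
        if rest >= 0:
            yield tuple(c / steps for c in head + (rest,))


def best_weights(n_params, n_tokens, p, steps=30):
    """Grid-search the simplex for the weights with the lowest predicted loss."""
    k = len(p["C"])
    return min(simplex_grid(k, steps),
               key=lambda h: predicted_loss(n_params, n_tokens, h, p))


# Made-up parameters for three domains. In practice they would be fitted
# to the losses measured in a few small-scale training runs.
PARAMS = {
    "E": 1.7, "A": 400.0, "alpha": 0.34, "B": 410.0, "beta": 0.28,
    "C": (1.0, 0.6, 0.3), "gamma": (0.5, 0.4, 0.3),
}

h_star = best_weights(1e9, 2e10, PARAMS)
print(h_star, predicted_loss(1e9, 2e10, h_star, PARAMS))
```

In this assumed additive form the mixture term does not interact with N or D, so the selected weights are the same at every budget. A form whose terms depend jointly on h and (N, D) would make the optimum budget-dependent, and only `predicted_loss` would need to change.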