Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining
May 23, 2024
Authors: Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
cs.AI
Abstract
Large language models exhibit exceptional generalization capabilities,
primarily attributed to the utilization of diversely sourced data. However,
conventional practices in integrating this diverse data heavily rely on
heuristic schemes, lacking theoretical guidance. This research tackles these
limitations by investigating strategies based on low-cost proxies for data
mixtures, with the aim of streamlining data curation to enhance training
efficiency. Specifically, we propose a unified scaling law, termed BiMix, which
accurately models the bivariate scaling behaviors of both data quantity and
mixing proportions. We conduct systematic experiments and provide empirical
evidence for the predictive power and fundamental principles of BiMix. Notably,
our findings reveal that entropy-driven training-free data mixtures can achieve
comparable or even better performance than more resource-intensive methods. We
hope that our quantitative insights can shed light on further judicious
research and development in cost-effective language modeling.
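The abstract highlights entropy-driven, training-free data mixing as a low-cost proxy for tuning mixture proportions. The following is a minimal sketch of that general idea, not the paper's exact procedure: it assumes Shannon entropy of a per-domain unigram token distribution as the entropy measure and simple proportional normalization into mixing weights, both of which are illustrative assumptions not specified in this abstract.

```python
# Hedged sketch: derive training-free data-mixing proportions from per-domain
# entropy estimates. The entropy measure (unigram Shannon entropy) and the
# normalization scheme below are illustrative assumptions, not the BiMix
# method as defined in the paper.
from collections import Counter
import math


def unigram_entropy(tokens):
    """Shannon entropy (in nats) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def entropy_mixing_proportions(domain_tokens):
    """Map each domain's token sample to a mixing proportion; weights sum to 1."""
    entropies = {name: unigram_entropy(toks) for name, toks in domain_tokens.items()}
    z = sum(entropies.values())
    return {name: h / z for name, h in entropies.items()}


# Usage with toy (hypothetical) token samples per domain.
domains = {
    "web":  "the cat sat on the mat the dog ran".split(),
    "code": "def f ( x ) : return x + x".split(),
    "wiki": "alpha beta gamma alpha delta epsilon".split(),
}
print(entropy_mixing_proportions(domains))
```

Under this heuristic, higher-entropy (more diverse) domains receive larger mixing weights; the paper's actual entropy criterion and the functional form of the BiMix scaling law are specified in the full text.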