Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining
May 23, 2024
作者: Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
cs.AI
Abstract
Large language models exhibit exceptional generalization capabilities,
primarily attributed to the utilization of diversely sourced data. However,
conventional practices in integrating this diverse data heavily rely on
heuristic schemes, lacking theoretical guidance. This research tackles these
limitations by investigating strategies based on low-cost proxies for data
mixtures, with the aim of streamlining data curation to enhance training
efficiency. Specifically, we propose a unified scaling law, termed BiMix, which
accurately models the bivariate scaling behaviors of both data quantity and
mixing proportions. We conduct systematic experiments and provide empirical
evidence for the predictive power and fundamental principles of BiMix. Notably,
our findings reveal that entropy-driven training-free data mixtures can achieve
comparable or even better performance than more resource-intensive methods. We
hope that our quantitative insights can shed light on further judicious
research and development in cost-effective language modeling.
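For intuition, the short Python sketch below pairs a generic bivariate power law in data quantity and mixing proportion with entropy-based mixing weights computed from token frequencies. It is a minimal sketch under assumed functional forms and toy numbers: `bivariate_loss`, `entropy_mixing_weights`, and all constants are hypothetical illustrations and do not reproduce the paper's actual BiMix equation or its entropy-driven criterion.

```python
# Illustrative sketch only: the functional form, parameter names, and default
# constants are assumptions for demonstration, not the paper's fitted BiMix law.
import numpy as np


def bivariate_loss(quantity, proportion, A=1.0, alpha=0.3, B=1.0, beta=0.2, C=0.1):
    """Generic bivariate power law: loss is assumed to decay with both the
    number of training tokens (quantity) and a domain's mixing proportion."""
    return A / quantity**alpha + B / proportion**beta + C


def entropy_mixing_weights(domain_token_counts):
    """Training-free, entropy-driven mixing proportions (illustrative): weight
    each domain by the Shannon entropy of its token-frequency distribution,
    then normalize the weights so they sum to one."""
    entropies = []
    for counts in domain_token_counts:
        p = np.asarray(counts, dtype=float)
        p /= p.sum()
        entropies.append(-np.sum(p * np.log(p + 1e-12)))
    entropies = np.asarray(entropies)
    return entropies / entropies.sum()


if __name__ == "__main__":
    # Two toy domains with different token-frequency profiles.
    domains = [
        [1000, 800, 600, 400, 200],  # fairly uniform -> higher entropy
        [5000, 100, 50, 25, 10],     # heavily skewed -> lower entropy
    ]
    mix = entropy_mixing_weights(domains)
    print("mixing proportions:", mix)
    print("predicted loss for domain 0:",
          bivariate_loss(quantity=1e9, proportion=mix[0]))
```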