RegMix: Data Mixture as Regression for Language Model Pre-training

July 1, 2024
Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin
cs.AI

Abstract

The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance given their respective mixtures. With the fitted regression model, we simulate the top-ranked mixture and use it to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters, each on 1B tokens drawn from different mixtures, to fit the regression model and find the optimal mixture. Using this mixture, we train a 1B parameter model for 25B tokens (i.e., 1000x larger and 25x longer), which we find performs best among 64 candidate 1B parameter models trained with other mixtures. Further, our method demonstrates superior performance compared to human selection and achieves results that match or surpass DoReMi, while utilizing only 10% of the compute budget. Our experiments also show that (1) data mixtures significantly impact performance, with single-task performance variations of up to 14.6%; (2) web corpora, rather than data perceived as high-quality like Wikipedia, have the strongest positive correlation with downstream performance; (3) domains interact in complex ways, often contradicting common sense, so automatic approaches like RegMix are needed; (4) data mixture effects transcend scaling laws, and our approach captures this complexity by considering all domains together. Our code is available at https://github.com/sail-sg/regmix.
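To make the two-step pipeline the abstract describes concrete, here is a minimal sketch (not the authors' implementation): fit a regression model on (mixture weights, measured performance) pairs from small proxy runs, then simulate a large pool of candidate mixtures and pick the top-ranked one. The regressor choice, the synthetic losses, and all sizes below are illustrative assumptions; the paper's actual code is at the repository linked above.

```python
# Sketch of the RegMix idea with stand-in data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_domains, n_proxy_runs = 5, 512  # 512 small proxy models, per the abstract

# Each row is a data mixture: per-domain weights that sum to 1.
proxy_mixtures = rng.dirichlet(np.ones(n_domains), size=n_proxy_runs)
# Stand-in for the validation losses measured after training each
# small proxy model on its mixture (synthetic here, measured in practice).
proxy_losses = rng.normal(loc=3.0, scale=0.1, size=n_proxy_runs)

# Step 1: fit a regression model mapping mixture -> performance.
reg = LinearRegression().fit(proxy_mixtures, proxy_losses)

# Step 2: simulate many candidate mixtures, rank them by predicted loss,
# and take the top-ranked mixture for the large-scale training run.
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
predicted = reg.predict(candidates)
best_mixture = candidates[np.argmin(predicted)]
print("predicted-best mixture:", np.round(best_mixture, 3))
```

The key design point is that only the small proxy runs require training; ranking candidate mixtures is a cheap prediction step, which is how the method stays within a fraction of the compute budget of approaches like DoReMi.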
