DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
May 17, 2023
Authors: Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu
cs.AI
Abstract
The mixture proportions of pretraining data domains (e.g., Wikipedia, books,
web text) greatly affect language model (LM) performance. In this paper, we
propose Domain Reweighting with Minimax Optimization (DoReMi), which first
trains a small proxy model using group distributionally robust optimization
(Group DRO) over domains to produce domain weights (mixture proportions)
without knowledge of downstream tasks. We then resample a dataset with these
domain weights and train a larger, full-sized model. In our experiments, we use
DoReMi on a 280M-parameter proxy model to find domain weights for training an
8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves
perplexity across all domains, even when it downweights a domain. DoReMi
improves average few-shot downstream accuracy by 6.5% over a baseline model
trained using The Pile's default domain weights and reaches the baseline
accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has
no knowledge of downstream tasks, even matches the performance of using domain
weights tuned on downstream tasks.
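
Below is a minimal sketch of the domain reweighting idea described above: a proxy model's per-domain excess loss relative to a reference model drives a multiplicative (Group DRO-style) update that upweights domains with larger excess loss, and the weights averaged over training give the mixture used to resample data for the full-sized model. This is an illustration under stated assumptions, not the paper's exact implementation: the function name doremi_update, the step size eta, the smoothing term, and the synthetic excess losses are all placeholders, and NumPy is assumed to be available.

import numpy as np

def doremi_update(weights, excess_loss, eta=1.0, smoothing=1e-3):
    """One Group DRO-style step: upweight domains with larger excess loss.

    weights:     current domain weights (nonnegative, sums to 1)
    excess_loss: per-domain proxy loss minus reference loss, clipped at 0
    eta:         step size for the multiplicative-weights update (assumed)
    smoothing:   mixes in the uniform distribution to keep weights positive
    """
    logits = np.log(weights) + eta * np.clip(excess_loss, 0.0, None)
    new_weights = np.exp(logits - logits.max())
    new_weights /= new_weights.sum()
    k = len(weights)
    return (1 - smoothing) * new_weights + smoothing / k

# Toy example with 3 domains (e.g., Wikipedia, books, web text).
np.random.seed(0)
weights = np.full(3, 1 / 3)
history = [weights]
for step in range(100):
    # Synthetic excess losses; a real run would recompute these each step
    # from minibatch losses of the proxy and reference models.
    excess = np.array([0.1, 0.4, 0.2]) + 0.05 * np.random.randn(3)
    weights = doremi_update(weights, excess, eta=0.1)
    history.append(weights)

# The final mixture is taken as the average of the weights over training;
# the larger, full-sized model is then trained on data resampled to it.
final_weights = np.mean(history, axis=0)
print(final_weights)

In the method the abstract describes, the excess losses come from the small (280M-parameter) proxy model trained with Group DRO against a reference model, and the resulting domain weights are used to resample the pretraining corpus (e.g., The Pile) before training the 8B-parameter model.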