

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

May 17, 2023
Authors: Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu
cs.AI

Abstract

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to find domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5% over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
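Since the abstract only outlines the method at a high level, the sketch below illustrates the kind of Group DRO domain-weight update it describes: a per-domain loss signal from the small proxy model drives a multiplicative (exponentiated-gradient) update of the mixture proportions, with smoothing toward uniform so no domain collapses to zero. The function name, arguments, and the use of an excess loss relative to a reference model are assumptions for illustration, not the paper's released implementation.

```python
import numpy as np

def update_domain_weights(domain_weights, excess_losses, step_size=1.0, smoothing=1e-3):
    """One exponentiated-gradient (Group DRO-style) update of the domain mixture.

    domain_weights: current mixture proportions over the k pretraining domains (sums to 1).
    excess_losses:  a per-domain signal of how poorly the proxy model is doing there,
                    e.g. proxy loss minus a reference model's loss on that domain (assumed).
    """
    # Upweight domains where the proxy model is currently doing poorly.
    logits = np.log(domain_weights) + step_size * np.asarray(excess_losses)
    logits -= logits.max()                    # numerical stability
    new_weights = np.exp(logits)
    new_weights /= new_weights.sum()          # renormalize to a distribution

    # Mix with the uniform distribution so no domain's weight is driven to zero.
    k = len(domain_weights)
    return (1.0 - smoothing) * new_weights + smoothing / k


# Illustrative usage: apply one update per proxy-training step and average the
# weights over steps; the averaged weights are then used to resample the corpus
# for the larger, full-sized model.
weights = np.full(3, 1.0 / 3)                 # e.g. Wikipedia, books, web text
weights = update_domain_weights(weights, excess_losses=[0.2, 0.05, 0.4])
print(weights)
```

In this sketch the expensive model never participates in the weight search: only the 280M proxy model is trained under Group DRO, and the resulting mixture is reused once to resample data for the 8B model.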