DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Abstract

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi). DoReMi first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions), without any knowledge of downstream tasks. We then resample the dataset with these domain weights and train a larger, full-sized model. In our experiments, we run DoReMi on a 280M-parameter proxy model to find domain weights for training an 8B-parameter model (roughly 30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even those it downweights. Compared with a baseline model trained using The Pile's default domain weights, DoReMi improves average few-shot downstream accuracy by 6.5 percentage points and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance obtained with domain weights tuned on downstream tasks.
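
The abstract only outlines the procedure, so the following is a rough illustrative sketch rather than the authors' implementation. It assumes that per-domain losses for the proxy model have already been measured, and shows (1) one exponentiated-gradient step that shifts weight toward domains with higher loss, in the spirit of a minimax objective over domains, and (2) resampling domain indices according to the resulting weights. The function names, `step_size`, `smoothing`, and the loss values are all hypothetical.

```python
import numpy as np


def update_domain_weights(weights, domain_losses, step_size=1.0, smoothing=1e-3):
    """One exponentiated-gradient step on domain weights (illustrative only).

    Assumes all weights are strictly positive and sum to 1.
    Domains with a higher loss receive a larger weight after the step.
    """
    weights = np.asarray(weights, dtype=float)
    losses = np.asarray(domain_losses, dtype=float)
    # Multiplicative update done in log space for numerical stability.
    log_w = np.log(weights) + step_size * losses
    log_w -= log_w.max()
    new_w = np.exp(log_w)
    new_w /= new_w.sum()
    # Mix with the uniform distribution so no domain weight collapses to zero.
    uniform = np.full_like(new_w, 1.0 / len(new_w))
    return (1.0 - smoothing) * new_w + smoothing * uniform


def sample_domains(weights, num_examples, seed=0):
    """Draw a domain index for each training example according to the weights."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(weights), size=num_examples, p=weights)


# Hypothetical per-domain losses for three domains, starting from uniform weights.
weights = update_domain_weights([1 / 3, 1 / 3, 1 / 3], [0.5, 1.0, 2.0])
domain_ids = sample_domains(weights, num_examples=1000)
```

In a full pipeline this update would be repeated throughout proxy-model training, and the final weights would define the mixture used to resample data for the larger model.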
