RegMix: Data Mixture as Regression for Language Model Pre-training

July 1, 2024
Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin
cs.AI

Abstract

The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix involves training a set of small models with diverse data mixtures and fitting a regression model to predict their performance given their respective mixtures. With the fitted regression model, we simulate the top-ranked mixture and use it to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens of different mixtures to fit the regression model and find the optimal mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Further, our method demonstrates superior performance compared to human selection and achieves results that match or surpass DoReMi, while utilizing only 10% of the compute budget. Our experiments also show that (1) Data mixtures significantly impact performance with single-task performance variations of up to 14.6%; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws, and our approach captures the complexity by considering all domains together. Our code is available at https://github.com/sail-sg/regmix.
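
As a concrete illustration of the pipeline the abstract describes (sample diverse mixtures, train small proxy models, fit a regressor, then rank simulated candidate mixtures), here is a minimal sketch in Python. The validation losses are synthetic, and the domain count, candidate pool size, and the choice of a plain linear regressor are illustrative assumptions rather than the authors' code; the actual implementation is at https://github.com/sail-sg/regmix.

```python
# Minimal sketch of the RegMix regression step, with synthetic data.
# NOT the authors' implementation (see github.com/sail-sg/regmix):
# proxy-model losses are fabricated, and a plain linear regressor
# stands in for whatever regression model the paper fits.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_proxy_models, n_domains = 512, 17  # 512 small models; domain count is hypothetical

# Step 1: sample diverse data mixtures (points on the probability simplex).
# In RegMix, each mixture would be used to pre-train a 1M-parameter proxy
# model for 1B tokens; here we fabricate each proxy's validation loss.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_proxy_models)
true_effect = rng.normal(size=n_domains)  # hidden per-domain effect on loss
val_loss = mixtures @ true_effect + 0.01 * rng.normal(size=n_proxy_models)

# Step 2: fit a regression model mapping mixture weights -> performance.
reg = LinearRegression().fit(mixtures, val_loss)

# Step 3: simulate a large pool of candidate mixtures, rank them by
# predicted loss, and keep the top-ranked one for the large-scale run.
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
best_mixture = candidates[np.argmin(reg.predict(candidates))]
print("predicted-best mixture:", np.round(best_mixture, 3))
```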
