Data Selection via Optimal Control for Language Models
October 9, 2024
Authors: Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, Minlie Huang
cs.AI
Abstract
This work investigates the selection of high-quality pre-training data from
massive corpora to enhance LMs' capabilities for downstream usage. We formulate
data selection as a generalized Optimal Control problem, which can be solved
theoretically by Pontryagin's Maximum Principle (PMP), yielding a set of
necessary conditions that characterize the relationship between optimal data
selection and LM training dynamics. Based on these theoretical results, we
introduce PMP-based Data Selection (PDS), a framework that approximates optimal
data selection by solving the PMP conditions. In our experiments, we adopt PDS
to select data from CommonCrawl and show that the PDS-selected corpus
accelerates the learning of LMs and consistently boosts their performance on a
wide range of downstream tasks across various model sizes. Moreover, the
benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by
the extrapolation of the test loss curves according to the Scaling Laws. PDS
also improves data utilization when pre-training data is limited, reducing the
data demand by a factor of 1.8 and mitigating the rapid exhaustion of available
web-crawled corpora. Our code, data, and model checkpoints are available at
https://github.com/microsoft/LMOps/tree/main/data_selection.
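
For readers unfamiliar with Pontryagin's Maximum Principle, the following is a minimal sketch of the discrete-time necessary conditions for a generic optimal control problem of the kind the abstract describes; it is not the paper's exact formulation. Here the state \theta_t stands in for the LM parameters, the control \gamma_t for the per-example data weights, f for one weighted training step, and J for the downstream loss; these symbols and the costate \lambda_t are assumed placeholders, not notation taken from the paper.

% Generic discrete-time optimal control problem (assumed notation):
%   dynamics:   \theta_{t+1} = f(\theta_t, \gamma_t)
%   objective:  minimize over \gamma the total cost \sum_{t=1}^{T} J(\theta_t)
% Define the Hamiltonian
%   H(\theta, \lambda, \gamma) = J(\theta) + \lambda^{\top} f(\theta, \gamma).
% PMP-style necessary conditions (under suitable convexity and regularity
% assumptions) state that an optimal trajectory admits costates \lambda_t with:
\begin{align*}
  \theta^{*}_{t+1} &= f(\theta^{*}_{t}, \gamma^{*}_{t}), \\
  \lambda^{*}_{t}  &= \nabla_{\theta} H(\theta^{*}_{t}, \lambda^{*}_{t+1}, \gamma^{*}_{t}),
    \qquad \lambda^{*}_{T} = \nabla_{\theta} J(\theta^{*}_{T}), \\
  \gamma^{*}_{t}   &\in \arg\min_{\gamma} H(\theta^{*}_{t}, \lambda^{*}_{t+1}, \gamma).
\end{align*}

Read this way, the costates propagate the downstream objective backward through the training trajectory, and the optimal data weights at each step are those whose induced update minimizes the Hamiltonian given that backward signal (the "maximum" in PMP refers to the equivalent maximization of -H). A practical method such as PDS must approximate these conditions rather than solve them exactly.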