Data Selection via Optimal Control for Language Models

October 9, 2024
Authors: Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, Minlie Huang
cs.AI

Abstract

This work investigates the selection of high-quality pre-training data from massive corpora to enhance LMs' capabilities for downstream usage. We formulate data selection as a generalized Optimal Control problem, which can be solved theoretically by Pontryagin's Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these theoretical results, we introduce PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions. In our experiments, we adopt PDS to select data from CommonCrawl and show that the PDS-selected corpus accelerates the learning of LMs and consistently boosts their performance on a wide range of downstream tasks across various model sizes. Moreover, the benefits of PDS extend to ~400B models trained on ~10T tokens, as evidenced by the extrapolation of the test loss curves according to the Scaling Laws. PDS also improves data utilization when the pre-training data is limited, by reducing the data demand by 1.8 times, which mitigates the quick exhaustion of available web-crawled corpora. Our code, data, and model checkpoints can be found at https://github.com/microsoft/LMOps/tree/main/data_selection.
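To make the control-theoretic framing concrete, the block below is a minimal sketch of a discrete-time optimal control problem of this kind. The notation (θ_t for model parameters, γ for per-example data weights on the simplex, λ_t for the costate, L^ds for the downstream loss) is illustrative and not necessarily the paper's exact formulation.

```latex
% Illustrative discrete-time formulation: the "state" is the parameter
% trajectory \theta_t and the "control" is the data weighting \gamma.
\begin{align}
  \theta_{t+1} &= \theta_t - \eta \sum_{n=1}^{N} \gamma_n \nabla \ell(x_n, \theta_t),
  \qquad
  \min_{\gamma \in \Delta^{N-1}} J(\gamma) = \sum_{t=1}^{T} L^{\mathrm{ds}}(\theta_t). \\
  % PMP introduces a costate \lambda_t that carries the downstream objective
  % backward through the training dynamics:
  \lambda_t &= \Big( I - \eta \sum_{n} \gamma_n \nabla^2 \ell(x_n, \theta_t) \Big)^{\!\top} \lambda_{t+1}
  + \nabla L^{\mathrm{ds}}(\theta_t),
  \qquad \lambda_T = \nabla L^{\mathrm{ds}}(\theta_T). \\
  % Necessary condition: the optimal weights concentrate on examples whose
  % gradients align with the costate.
  \gamma^{\star} &\in \arg\max_{\gamma \in \Delta^{N-1}} \sum_{n} \gamma_n\,
  \lambda_{t+1}^{\top} \nabla \ell(x_n, \theta_t).
\end{align}
```

Solving the alignment condition exactly over a full training trajectory is intractable at pre-training scale, so any practical selector must approximate it. The toy PyTorch sketch below shows one such approximation, scoring candidate examples by the inner product between their gradients and a downstream-loss gradient used as a one-step costate proxy; all names and the proxy itself are assumptions for illustration, not the PDS implementation.

```python
# Toy sketch (assumed setup, not the paper's implementation): approximate the
# PMP alignment score by the inner product between each candidate example's
# gradient and the gradient of a downstream loss at the current parameters.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)          # stand-in for a language model
loss_fn = nn.CrossEntropyLoss()

def flat_grad(loss: torch.Tensor) -> torch.Tensor:
    """Flatten d(loss)/d(parameters) into a single vector."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# One-step costate proxy (assumption): lambda ~ grad of the downstream loss.
x_ds, y_ds = torch.randn(8, 16), torch.randint(0, 4, (8,))
costate = flat_grad(loss_fn(model(x_ds), y_ds))

# Score candidate pre-training examples by <grad, costate>; keep the top-k.
candidates = [(torch.randn(1, 16), torch.randint(0, 4, (1,))) for _ in range(100)]
scores = torch.stack([flat_grad(loss_fn(model(x), y)) @ costate
                      for x, y in candidates])
selected = scores.topk(k=20).indices   # indices of the selected subset
print(selected.tolist())
```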
