Holistic Data Scheduler for Large Language Model Pre-training Based on Multi-Objective Reinforcement Learning

Abstract

The composition of training data, determined by the diversity of its sources and the strategy for mixing them, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), which adaptively adjusts the data mixture during training, has emerged as a promising way to improve efficiency. However, existing methods optimize from a single perspective and therefore overlook the need, in complex LLM pre-training, to adjust the data composition dynamically along multiple dimensions. To overcome this limitation, we introduce the Holistic Data Scheduler (HDS), a novel online data mixing framework. HDS formulates data scheduling as a reinforcement learning problem in a continuous control space and uses the Soft Actor-Critic (SAC) algorithm, chosen for its stability and sample efficiency in exploring the high-dimensional policy space. At the core of HDS lies a multi-objective, holistic reward function that integrates three perspectives: a data-driven reward based on data quality, a loss-driven reward that captures inter-domain influence, and a model-driven reward based on weight norms. To validate the design and determine its optimal configuration, we conducted systematic experiments on LLMs of various sizes. On The Pile, HDS reaches the final validation perplexity of the next-best method with 44% fewer training iterations. It also achieves a 7.2% improvement on 0-shot MMLU, along with consistent gains on other benchmarks, demonstrating that it improves both training efficiency and final model capability.
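As a rough illustration of the formulation described above, the sketch below shows one way a continuous action could be mapped to data-mixture proportions and how three reward signals could be combined into a single scalar. The abstract does not specify either step, so the softmax mapping, the linear weighting, and every name here (`mixture_from_action`, `holistic_reward`, `coeffs`) are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence, Tuple


def mixture_from_action(action: Sequence[float]) -> list:
    """Map an unconstrained continuous action (one value per data domain)
    to mixture proportions that are positive and sum to 1.
    Assumption: a softmax is used; the paper may use a different mapping."""
    peak = max(action)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in action]
    total = sum(exps)
    return [e / total for e in exps]


def holistic_reward(
    data_reward: float,
    loss_reward: float,
    model_reward: float,
    coeffs: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Combine the data-driven, loss-driven, and model-driven rewards.
    Assumption: a plain weighted sum with hypothetical coefficients."""
    w_data, w_loss, w_model = coeffs
    return w_data * data_reward + w_loss * loss_reward + w_model * model_reward


# Hypothetical usage: three data domains and made-up reward values.
mixture = mixture_from_action([0.2, -1.0, 0.5])
reward = holistic_reward(1.0, 2.0, 3.0, coeffs=(0.5, 0.25, 0.25))
```

In an actual SAC setup, the policy would emit the action vector, the resulting mixture would determine how the next batches are sampled, and the combined reward would be fed back to the agent. How each individual reward is computed is left to the body of the paper.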