

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

January 31, 2026
作者: Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao
cs.AI

Abstract

Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training a proxy model for every sampled mixture, DeMix trains component models on the candidate datasets at scale and derives data-mixture proxies via weighted model merging. This paradigm decouples search costs from training costs, enabling the evaluation of unlimited sampled mixtures without any extra training and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off among sufficiency, accuracy, and efficiency, obtaining an optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures, to facilitate open research. Our code and the DeMix Corpora are available at https://github.com/Lucius-lsr/DeMix.
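The core idea described in the abstract can be illustrated with a short sketch: instead of pre-training a fresh proxy model for each candidate ratio, a candidate mixture is scored by merging already-trained component models with the mixture's weights. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the `evaluate` callable (a benchmark scorer), the Dirichlet sampling of candidate ratios, and plain parameter averaging as the merge rule are all assumptions for the sake of the example.

```python
import copy
import torch

def merge_models(component_models, weights):
    """Merge component models by a weighted average of their parameters.

    Each component model is assumed to be pre-trained on one candidate
    dataset; the merged model acts as a cheap proxy for a model trained
    on the corresponding data mixture.
    """
    assert abs(sum(weights) - 1.0) < 1e-6, "mixture ratios must sum to 1"
    merged = copy.deepcopy(component_models[0])
    merged_state = merged.state_dict()
    states = [m.state_dict() for m in component_models]
    for name in merged_state:
        # Average only floating-point tensors; skip integer buffers.
        if merged_state[name].is_floating_point():
            merged_state[name] = sum(
                w * s[name] for w, s in zip(weights, states)
            )
    merged.load_state_dict(merged_state)
    return merged

def search_mixture(component_models, evaluate, n_trials=1000):
    """Search over sampled mixtures with no additional training.

    `evaluate` is a hypothetical callable scoring a model on the target
    benchmarks. Because merging is cheap, the number of trials is
    bounded by evaluation cost rather than training cost, which is the
    decoupling the abstract refers to.
    """
    best_score, best_ratio = float("-inf"), None
    dirichlet = torch.distributions.Dirichlet(
        torch.ones(len(component_models))
    )
    for _ in range(n_trials):
        ratio = dirichlet.sample().tolist()  # random candidate mixture
        proxy = merge_models(component_models, ratio)
        score = evaluate(proxy)
        if score > best_score:
            best_score, best_ratio = score, ratio
    return best_ratio, best_score
```

In this sketch, training happens once per candidate dataset (to produce the component models), while the search loop only performs parameter averaging and evaluation, so adding more trials costs no further training compute.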