Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
January 31, 2026
Authors: Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao
cs.AI
Abstract
Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy, and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and the DeMix Corpora are available at https://github.com/Lucius-lsr/DeMix.
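The core idea, deriving a proxy for a candidate mixture by weighted-averaging the parameters of component models trained on individual candidate datasets, can be sketched as follows. This is a minimal illustration only; the function and parameter names are assumptions, not taken from the DeMix codebase:

```python
# Minimal sketch of weighted model merging as a data-mixture proxy.
# Names (merge_models, component_states) are hypothetical, not DeMix's actual API.

def merge_models(component_states, weights):
    """Merge component model parameters by weighted averaging.

    component_states: one dict per component model (each trained on one
        candidate dataset), mapping parameter name -> list of floats.
    weights: candidate mixture ratios, one per component; must sum to 1.
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture ratios must sum to 1"
    merged = {}
    for name in component_states[0]:
        merged[name] = [
            sum(w * state[name][i] for w, state in zip(weights, component_states))
            for i in range(len(component_states[0][name]))
        ]
    return merged

# Example: two tiny "component models" (say, one trained on web text and one
# on code), merged under a candidate mixture ratio of 0.7 / 0.3. Evaluating
# this merged model stands in for training a proxy model on that mixture.
web_model = {"layer.weight": [1.0, 2.0]}
code_model = {"layer.weight": [3.0, 4.0]}
proxy = merge_models([web_model, code_model], [0.7, 0.3])
# proxy["layer.weight"] -> [1.6, 2.6]
```

Because merging is cheap relative to training, any number of candidate ratio vectors can be scored this way against the same fixed set of component models, which is what decouples the search cost from the training cost.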