FastMix: Fast Data Mixture Optimization via Gradient Descent
June 12, 2026
Authors: Haoru Tan, Sitong Wu, Yanfeng Chen, Jun Xia, Ruobing Xie, Bin Xia, Xingwu Sun, Xiaojuan Qi
cs.AI
Abstract
While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FASTMIX, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FASTMIX jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FASTMIX is a reformulation of mixture selection as a bilevel optimization problem. Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This equivalence embeds the mixture coefficients directly into a differentiable iterative optimization objective, enabling efficient gradient-based optimization of both the mixture coefficients and the model parameters. To solve the optimization problem, FASTMIX uses an approximate iterative procedure that alternates between (i) updating model parameters on data sampled according to the current mixture ratios (inner loop) and (ii) updating the mixture ratios based on validation feedback (outer loop). In both pre-training and post-training settings, FASTMIX outperforms baselines while substantially reducing search cost. Code: https://github.com/hrtan/fastmix
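To make the alternating procedure more concrete, the following is a minimal NumPy sketch on a toy problem. It is an illustration under stated assumptions, not the authors' implementation: the quadratic per-source losses, the softmax parameterization of the mixture, the one-step lookahead approximation of the outer gradient, and all function names are hypothetical choices made for this example. It also uses the full weighted loss instead of sampling, relying on the basic identity that the expected loss under mixture sampling equals the uniformly sampled loss with per-source weights; the precise form of the equivalence proved in the paper may differ.

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax that maps logits to mixture ratios.
    e = np.exp(a - a.max())
    return e / e.sum()

def mixture_loss(w, losses):
    # Expected loss when source k is sampled with probability w[k].
    return float(w @ losses)

def uniform_weighted_loss(w, losses):
    # Expected loss when sources are sampled uniformly and source k
    # carries the loss weight K * w[k] (assumed form of the reweighting).
    k = len(w)
    return float(np.mean(k * w * losses))

def alternating_mixture_search(centers, val_center, steps=500,
                               lr_model=0.1, lr_mix=1.0):
    # Toy setup (assumption): source k has loss 0.5 * ||theta - centers[k]||^2
    # and the validation loss is 0.5 * ||theta - val_center||^2.
    k, d = centers.shape
    theta = np.zeros(d)   # model parameters
    alpha = np.zeros(k)   # mixture logits, w = softmax(alpha)
    for _ in range(steps):
        w = softmax(alpha)
        g = theta - centers                      # row k: gradient of source k
        # Inner step: gradient descent on the mixture-weighted training loss.
        theta_new = theta - lr_model * (w @ g)
        # Outer step: differentiate the validation loss at theta_new with
        # respect to w through the single inner step above.
        g_val = theta_new - val_center
        dval_dw = -lr_model * (g @ g_val)
        # Chain rule through the softmax parameterization.
        dval_dalpha = w * (dval_dw - w @ dval_dw)
        alpha = alpha - lr_mix * dval_dalpha
        theta = theta_new
    return softmax(alpha), theta
```

In this toy setting, calling `alternating_mixture_search(centers, val_center=centers[0])` with three distinct source centers should shift most of the mixture weight toward the source that matches the validation target. The softmax parameterization is one simple way to keep the ratios on the probability simplex; a projected-gradient update would be an alternative.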