mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT
March 23, 2026
Authors: Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun
cs.AI
Abstract
Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) with a uniform compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest-overfitting sub-dataset, and reverts to that dataset's optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single new hyperparameter (the compute budget). Notably, at low compute budgets, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
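The loop the abstract describes (train on the active mixture, detect the earliest-overfitting sub-dataset, exclude it, revert to its best checkpoint, continue) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: quadratic validation-loss curves stand in for real per-dataset evaluation, a step counter stands in for model weights, and all names (`msft`, `loss_curves`, `best`) are illustrative.

```python
def msft(loss_curves, budget):
    """Toy sketch of an mSFT-style search.

    loss_curves: dict mapping dataset name -> callable(step) -> validation loss
                 (a stand-in for evaluating a real model on held-out data).
    budget:      maximum number of training steps (the single new hyperparameter).

    Returns (events, final_step), where events lists each excluded dataset
    together with the checkpoint step the model was reverted to.
    """
    active = set(loss_curves)
    step = 0  # stand-in for model state: here, "how many steps trained"
    best = {name: (loss_curves[name](0), 0) for name in active}
    events = []
    while step < budget and active:
        step += 1  # one training step on the active mixture
        overfit = None
        for name in active:
            loss = loss_curves[name](step)
            if loss < best[name][0]:
                best[name] = (loss, step)  # new best checkpoint for this dataset
            elif loss > best[name][0]:
                # validation loss is rising: candidate for exclusion; prefer
                # the dataset whose best checkpoint came earliest
                if overfit is None or best[name][1] < best[overfit][1]:
                    overfit = name
        if overfit is not None:
            active.remove(overfit)
            step = best[overfit][1]  # revert to that dataset's best checkpoint
            events.append((overfit, step))
            # after reverting, re-anchor the remaining datasets' best stats
            for name in active:
                best[name] = (loss_curves[name](step), step)
    return events, step
```

With a fast-learning task whose loss bottoms out at step 2 and a slow one that bottoms out at step 8, the sketch first drops the fast task and reverts to step 2, then continues on the slow task alone until it too begins to overfit.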