

mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

March 23, 2026
Authors: Woosung Koh, Jeyoung Jeon, Youngjin Song, Yujin Cheon, Soowon Oh, Jaehyeong Choi, Se-Young Yun
cs.AI

Abstract

Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest-overfitting sub-dataset, and reverts to that sub-dataset's optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and is insensitive to its single new hyperparameter (the compute budget). Notably, at low compute budgets, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
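The abstract's loop (train on the active mixture, detect the first sub-dataset whose validation loss starts rising, exclude it, and revert to its best checkpoint) can be sketched as below. This is a minimal illustration under assumed interfaces, not the authors' implementation: `train_epoch`, `val_loss`, and the scalar "checkpoint" are placeholders for real training, per-task validation, and model state saving/restoring.

```python
# Hedged sketch of the mSFT search loop from the abstract.
# `train_epoch(model, active)` and `val_loss(model, name)` are assumed
# callables standing in for one pass of mixture training and per-task
# validation; `deepcopy` stands in for checkpointing model state.
from copy import deepcopy


def msft(model, mixture, train_epoch, val_loss, max_steps=10):
    """Iteratively exclude the earliest-overfitting sub-dataset,
    reverting to that sub-dataset's best checkpoint before continuing
    training on the remaining (active) mixture."""
    active = dict(mixture)  # sub-dataset name -> data (data unused here)
    # Per-task best (validation loss, checkpoint) seen so far.
    best = {name: (float("inf"), deepcopy(model)) for name in active}
    step = 0
    while active and step < max_steps:
        model = train_epoch(model, active)  # one pass over active mixture
        step += 1
        overfit = []
        for name in active:
            loss = val_loss(model, name)
            if loss < best[name][0]:
                best[name] = (loss, deepcopy(model))  # new best checkpoint
            elif loss > best[name][0]:
                overfit.append((name, loss))  # val loss rose: overfitting
        if overfit:
            # Exclude the sub-dataset that degraded most this step and
            # revert the model to that sub-dataset's best checkpoint.
            name, _ = max(overfit, key=lambda x: x[1] - best[x[0]][0])
            model = deepcopy(best[name][1])
            del active[name]
    return model
```

Reverting only to the excluded task's best checkpoint (rather than restarting) is what keeps the search cheap: training on the remaining mixture resumes from a state that was optimal for the task being frozen out.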