mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

Abstract

Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) with a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: because learning dynamics are heterogeneous, faster-learning tasks overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest-overfitting sub-dataset, and reverts to the corresponding optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and that it is insensitive to its single new hyperparameter (compute budget). Notably, at low compute budgets, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
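
The train, exclude, and revert loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `msft_sketch`, the callables `train_step` and `val_loss`, the toy `OPTIMA` table, and the rule of stopping when no task overfits within a round are all assumptions made for illustration. The abstract specifies only the high-level steps. The toy model also assumes each task's validation loss depends only on how many updates that task has received, which ignores cross-task interference.

```python
from typing import Callable, Dict, List, Set, Tuple

# Toy "checkpoint": the number of updates each task has received (an assumption).
State = Dict[str, int]


def msft_sketch(
    tasks: List[str],
    train_step: Callable[[State, Set[str]], State],
    val_loss: Callable[[State, str], float],
    budget: int,
) -> Tuple[State, List[Tuple[str, int]]]:
    """Repeatedly drop the earliest-overfitting task and roll back to its best checkpoint."""
    state: State = {t: 0 for t in tasks}
    active: Set[str] = set(tasks)
    dropped: List[Tuple[str, int]] = []  # (task, step within its round where it peaked)

    while active:
        # Train for `budget` steps on the active mixture, keeping every checkpoint.
        checkpoints = [state]
        for _ in range(budget):
            checkpoints.append(train_step(checkpoints[-1], active))

        # For each active task, find the step with the lowest validation loss.
        best_step = {
            t: min(range(len(checkpoints)), key=lambda i: val_loss(checkpoints[i], t))
            for t in sorted(active)
        }
        # Assumption: a task "overfits" if its best step precedes the end of the round.
        overfit = {t: s for t, s in best_step.items() if s < budget}
        if not overfit:
            # Assumption: if nothing overfits, keep the final checkpoint and stop.
            state = checkpoints[-1]
            break

        # Exclude the earliest-overfitting task and revert to its best checkpoint.
        earliest = min(overfit, key=overfit.get)
        state = checkpoints[overfit[earliest]]
        active.remove(earliest)
        dropped.append((earliest, overfit[earliest]))

    return state, dropped


# Hypothetical toy problem: each task's loss is minimal after OPTIMA[task] updates.
OPTIMA = {"math": 3, "code": 6, "chat": 10}


def toy_train_step(state: State, active: Set[str]) -> State:
    # One update for every task that is still in the active mixture.
    return {t: n + (1 if t in active else 0) for t, n in state.items()}


def toy_val_loss(state: State, task: str) -> float:
    # Quadratic bowl: loss rises once a task is trained past its optimum.
    return float((state[task] - OPTIMA[task]) ** 2)


final_state, dropped = msft_sketch(list(OPTIMA), toy_train_step, toy_val_loss, budget=8)
print(final_state)  # {'math': 3, 'code': 6, 'chat': 10}
print(dropped)      # [('math', 3), ('code', 3), ('chat', 4)]
```

In this toy run, a single homogeneous budget of 8 steps would leave `math` and `code` trained past their optima and `chat` short of its optimum, whereas the iterative loop stops each task at its own best step.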
