mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

Abstract

Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest-overfitting sub-dataset, and reverts to the corresponding optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms that mSFT maintains robust gains across diverse dataset sizes and task granularities, and that it is insensitive to its single new hyperparameter (compute budget). Notably, at a low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
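
The train, exclude, and revert loop described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function name `overfitting_aware_search`, the integer-counter "model", the quadratic validation loss, the `OPTIMUM` table, and the rule "a task has overfit if its validation loss bottomed out before the end of the round" are all hypothetical choices made for illustration. The stopping rule (stop when no active task overfits within a round) is likewise an assumption.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

State = Dict[str, int]  # toy "model": number of training steps each task has received


def overfitting_aware_search(
    state: State,
    tasks: List[str],
    train_step: Callable[[State, FrozenSet[str]], State],
    val_loss: Callable[[State, str], float],
    budget: int,
) -> Tuple[State, List[str]]:
    """Toy loop: train on the active mixture, drop the earliest-overfitting task, roll back."""
    active = set(tasks)
    excluded: List[str] = []
    while active:
        # Train on the active mixture for `budget` steps, keeping every checkpoint.
        checkpoints = [state]
        for _ in range(budget):
            checkpoints.append(train_step(checkpoints[-1], frozenset(active)))
        # Step at which each active task reached its lowest validation loss.
        best_step = {
            t: min(range(budget + 1), key=lambda s, t=t: val_loss(checkpoints[s], t))
            for t in active
        }
        # Assumption: a task has overfit if its loss bottomed out before the round ended.
        overfit = {t: s for t, s in best_step.items() if s < budget}
        if not overfit:
            state = checkpoints[-1]  # nothing overfit in this round; stop here
            break
        # Pick the task that overfit first (ties broken alphabetically).
        earliest = min(sorted(overfit), key=overfit.__getitem__)
        state = checkpoints[overfit[earliest]]  # revert to that task's best checkpoint
        active.remove(earliest)
        excluded.append(earliest)
    return state, excluded


# Hypothetical toy setup: each task's validation loss is minimized after a
# different number of training steps, mimicking heterogeneous learning speeds.
OPTIMUM = {"math": 3, "code": 6, "chat": 10}


def toy_train_step(state: State, active: FrozenSet[str]) -> State:
    # Only tasks still in the active mixture receive another step of training.
    return {t: n + (1 if t in active else 0) for t, n in state.items()}


def toy_val_loss(state: State, task: str) -> float:
    return float((state[task] - OPTIMUM[task]) ** 2)


final_state, order = overfitting_aware_search(
    {t: 0 for t in OPTIMUM}, list(OPTIMUM), toy_train_step, toy_val_loss, budget=8
)
print(order)        # ['math', 'code', 'chat']
print(final_state)  # {'math': 3, 'code': 6, 'chat': 10}
```

In this toy run each task ends up trained for exactly the number of steps at which its own validation loss is lowest, whereas a single shared budget would leave some tasks over-trained and others under-trained. Real training would replace the counters with model checkpoints and the quadratic loss with held-out evaluation per sub-dataset.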

