FastMix: Fast Data Mixture Optimization via Gradient Descent

Abstract

While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FASTMIX, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FASTMIX jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FASTMIX is a reformulation of mixture selection as a bilevel optimization problem. Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This equivalence embeds the mixture coefficients directly into a differentiable, iteratively optimized objective, enabling efficient gradient-based optimization of both the mixture and the model. To solve the resulting problem, FASTMIX implements an approximate iterative procedure that alternates between (i) updating model parameters on data sampled according to the current mixture ratios (inner loop) and (ii) updating the mixture ratios based on validation feedback (outer loop). Across pre-training and post-training, FASTMIX outperforms baselines while drastically reducing search cost. Code is publicly available at https://github.com/hrtan/fastmix.
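The alternating procedure described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation. It assumes a linear regression proxy model, a softmax parameterization of the mixture ratios, full-batch per-source gradients in place of sampling (for determinism), and a one-step unrolled hypergradient for the outer update. All function names and hyperparameters are hypothetical. It also checks the stated equivalence numerically: the mixture-weighted loss sum_k p_k L_k equals the average over uniformly sampled sources of K p_k L_k.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mse(w, X, y):
    """Half mean squared error of the linear model X @ w."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def mse_grad(w, X, y):
    """Gradient of mse with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def make_source(rng, w_true, n=200, noise=0.1):
    """Synthetic regression data generated from the weights w_true."""
    X = rng.normal(size=(n, w_true.size))
    return X, X @ w_true + noise * rng.normal(size=n)

def mixture_loss(p, losses):
    """Expected loss when source k is sampled with probability p[k]."""
    return float(p @ losses)

def uniform_weighted_loss(p, losses):
    """Expected loss under uniform source sampling with weight K * p[k]."""
    return float(np.mean(len(losses) * p * losses))

def toy_mixture_search(steps=300, lr_w=0.1, lr_alpha=2.0, seed=0):
    rng = np.random.default_rng(seed)
    w_target = np.array([1.0, -2.0, 0.5, 1.5, -1.0])
    # Assumed toy sources: 0 matches the validation task, 1 conflicts
    # with it, and 2 has labels that are pure noise.
    truths = [w_target, -w_target, np.zeros_like(w_target)]
    sources = [make_source(rng, t) for t in truths]
    X_val, y_val = make_source(rng, w_target)

    w = np.zeros_like(w_target)
    alpha = np.zeros(len(sources))  # mixture logits; softmax gives ratios
    loss_start = mse(w, X_val, y_val)
    for _ in range(steps):
        p = softmax(alpha)
        grads = np.stack([mse_grad(w, X, y) for X, y in sources])  # (K, d)
        # Inner step: gradient step on the mixture-weighted training loss.
        w_new = w - lr_w * (p @ grads)
        # Outer step: differentiate the validation loss at w_new through
        # the single inner step (an assumed one-step approximation).
        g_val = mse_grad(w_new, X_val, y_val)
        dL_dp = -lr_w * (grads @ g_val)
        dL_dalpha = p * (dL_dp - p @ dL_dp)  # chain rule through softmax
        alpha = alpha - lr_alpha * dL_dalpha
        w = w_new
    return softmax(alpha), loss_start, mse(w, X_val, y_val)
```

Calling `toy_mixture_search()` returns the learned ratios together with the validation loss before and after the search. In this toy setup the ratio of the source that matches the validation task should grow relative to the other two, since only that source yields gradients aligned with the validation gradient.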