Scaling Laws for Optimal Data Mixtures

Abstract

Large foundation models are typically trained on data from multiple domains, and the data mixture (the proportion of each domain used) plays a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size N trained with D tokens and a specific domain weight vector h. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws extrapolate to new data mixtures and across scales: their parameters can be accurately estimated from a few small-scale training runs and then used to estimate performance at larger scales and unseen domain weights. The scaling laws make it possible to derive the optimal domain weights for any target domain under a given training budget (N, D), providing a principled alternative to costly trial-and-error methods.
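As a rough illustration of the workflow described above (predict the loss from N, D, and h, then pick the h that minimizes it), here is a minimal sketch. The abstract does not state the functional form of the law, so the additive form below, every parameter value in PARAMS, and the function names predicted_loss and best_two_domain_weights are assumptions made only for this example. In practice the parameters would be fitted to a few small-scale training runs rather than chosen by hand.

```python
import numpy as np

# Hypothetical parameters, chosen by hand for illustration only.
# In practice they would be estimated from small-scale training runs.
PARAMS = {
    "E": 1.5, "A": 400.0, "alpha": 0.3, "B": 1000.0, "beta": 0.3,
    "C": np.array([1.0, 1.0]), "gamma": np.array([0.5, 0.5]),
}

def predicted_loss(N, D, h, p=PARAMS):
    """Assumed form: E + A/N^alpha + B/D^beta + 1 / sum_i(C_i * h_i^gamma_i)."""
    h = np.asarray(h, dtype=float)
    mixture_term = 1.0 / np.sum(p["C"] * h ** p["gamma"])
    return p["E"] + p["A"] / N ** p["alpha"] + p["B"] / D ** p["beta"] + mixture_term

def best_two_domain_weights(N, D, p=PARAMS, steps=1001):
    """Grid-search the two-domain simplex for the weights with the lowest predicted loss."""
    grid = np.linspace(0.0, 1.0, steps)
    losses = [predicted_loss(N, D, [w, 1.0 - w], p) for w in grid]
    w = grid[int(np.argmin(losses))]
    return np.array([w, 1.0 - w])

h_opt = best_two_domain_weights(N=1e8, D=1e10)
print(h_opt, predicted_loss(1e8, 1e10, h_opt))
```

With more than two domains, the grid search would be replaced by a constrained optimizer over the simplex. Note that in this assumed additive form the minimizing h does not depend on (N, D); a form that couples h with N and D would make the optimum depend on the budget.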
