Prescriptive Scaling Laws for Data Constrained Training

Abstract

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from allocating compute optimally to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes that every training token is unique, which limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice: beyond a certain point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay (λ=1.0) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.
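The structure described in the abstract (a Chinchilla-style loss plus an additive overfitting penalty governed by a single coefficient) can be illustrated with a toy sketch. The abstract does not state the functional form of the penalty, so the form used here, `c * (epochs - 1) * n_params / unique_tokens`, is a hypothetical placeholder chosen only to grow with repetition and model size; it is not the paper's law. The Chinchilla constants are the parametric fit reported by Hoffmann et al. (2022), and the names `repeated_data_loss` and `best_epochs` are likewise invented for this illustration.

```python
# Chinchilla parametric fit L(N, D) = E + A / N**alpha + B / D**beta,
# using the constants reported by Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Predicted loss when all n_tokens training tokens are unique."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def repeated_data_loss(n_params, unique_tokens, epochs, c):
    """Toy loss under repetition: Chinchilla term plus an additive penalty.

    ASSUMPTION: the penalty form below is a placeholder, not the paper's.
    It is zero for a single epoch and scales with one coefficient c.
    """
    total_tokens = unique_tokens * epochs
    penalty = c * (epochs - 1) * n_params / unique_tokens
    return chinchilla_loss(n_params, total_tokens) + penalty


def best_epochs(n_params, unique_tokens, c, max_epochs=64):
    """Epoch count in 1..max_epochs that minimizes the toy loss."""
    return min(
        range(1, max_epochs + 1),
        key=lambda e: repeated_data_loss(n_params, unique_tokens, e, c),
    )


if __name__ == "__main__":
    n_params, unique_tokens = 1e9, 1e10
    for c in (0.05, 0.015):  # the second value is 70% smaller than the first
        e = best_epochs(n_params, unique_tokens, c)
        loss = repeated_data_loss(n_params, unique_tokens, e, c)
        print(f"c={c}: best epochs={e}, toy loss={loss:.4f}")
```

Under these assumptions the toy loss has an interior minimum in the number of epochs, which mirrors the qualitative claim that repetition eventually becomes counterproductive, and a smaller coefficient moves that minimum toward more epochs. The numbers themselves carry no meaning beyond the illustration.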
