Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

Abstract

Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because learning rate, batch size, number of training tokens, model size, and other hyperparameters are correlated in complicated ways, but also because a hyperparameter search is prohibitively expensive for large language models with billions or trillions of parameters. Recent studies propose running hyperparameter searches on small proxy models and small corpora, then transferring the optimal parameters to large models and large corpora. While zero-shot transferability is theoretically and empirically proven for model-size-related hyperparameters such as depth and width, zero-shot transfer from a small corpus to a large corpus is underexplored. In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between these variables and demonstrated its transferability across model sizes. Based on this observation, we propose a new learning rate scheduler, the Power scheduler, that is agnostic to the number of training tokens and the batch size. Experiments show that combining the Power scheduler with Maximum Update Parameterization (muP) consistently achieves impressive performance with one set of hyperparameters, regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve performance comparable to state-of-the-art small language models. We open-source these pretrained models at https://ibm.biz/BdKhLa.
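To make the idea of a token-count-agnostic schedule concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the scheduler's formula, so the capped power-law form, the linear scaling with batch size, the function name `power_lr`, and the constants `a`, `b`, and `lr_max` are all assumptions chosen for illustration.

```python
def power_lr(tokens_seen: int, batch_size: int,
             a: float, b: float, lr_max: float) -> float:
    """Illustrative capped power-law learning rate.

    Assumed form (not given in the abstract):
        lr = min(lr_max, batch_size * a * tokens_seen ** b), with b < 0.
    The rate depends only on the tokens processed so far, so the total
    token budget does not need to be known in advance.
    """
    if tokens_seen <= 0:
        # No tokens processed yet: fall back to the cap.
        return lr_max
    return min(lr_max, batch_size * a * tokens_seen ** b)


if __name__ == "__main__":
    # Hypothetical constants, for demonstration only.
    for n in (10**6, 10**9, 10**12):
        print(n, power_lr(n, batch_size=128, a=1.0, b=-0.5, lr_max=0.01))
```

Under these assumptions the rate is held at `lr_max` early in training and then decays as a power of the tokens seen. Unlike a schedule defined by a fixed total step count, nothing in it has to change if training is extended to more tokens.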
