Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

August 23, 2024
作者: Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda
cs.AI

Abstract

Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because there is a complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters but also because it is prohibitively expensive to perform a hyperparameter search for large language models with Billions or Trillions of parameters. Recent studies propose using small proxy models and small corpus to perform hyperparameter searches and transposing the optimal parameters to large models and large corpus. While the zero-shot transferability is theoretically and empirically proven for model size related hyperparameters, like depth and width, the zero-shot transfer from small corpus to large corpus is underexplored. In this paper, we study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between variables and demonstrated its transferability across model sizes. Based on the observation, we propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size. The experiment shows that combining the Power scheduler with Maximum Update Parameterization (muP) can consistently achieve impressive performance with one set of hyperparameters regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve comparable performance as state-of-the-art small language models. We open-source these pretrained models at https://ibm.biz/BdKhLa.
