
Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler

August 23, 2024
作者: Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda
cs.AI

Abstract

Finding the optimal learning rate for language model pretraining is a challenging task. This is not only because of the complicated correlation between learning rate, batch size, number of training tokens, model size, and other hyperparameters, but also because it is prohibitively expensive to perform a hyperparameter search for large language models with billions or trillions of parameters. Recent studies propose using small proxy models and a small corpus to perform hyperparameter searches and then transferring the optimal parameters to large models and a large corpus. While zero-shot transferability has been theoretically and empirically demonstrated for model-size-related hyperparameters, such as depth and width, zero-shot transfer from a small corpus to a large corpus remains underexplored. In this paper, we study the correlation between the optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler. After thousands of small experiments, we found a power-law relationship between these variables and demonstrated its transferability across model sizes. Based on this observation, we propose a new learning rate scheduler, the Power scheduler, which is agnostic to the number of training tokens and batch size. Our experiments show that combining the Power scheduler with Maximum Update Parameterization (muP) consistently achieves impressive performance with a single set of hyperparameters, regardless of the number of training tokens, batch size, model size, and even model architecture. Our 3B dense and MoE models trained with the Power scheduler achieve performance comparable to state-of-the-art small language models. We open-source these pretrained models at https://ibm.biz/BdKhLa.
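
As a rough illustration of the idea, the sketch below implements a learning-rate schedule that depends only on the cumulative number of training tokens through a power law, which is what makes it indifferent to batch size and total token budget. The function name `power_lr` and the constants `a`, `b`, and `lr_max` are arbitrary placeholders chosen for illustration; they are not the values fitted in the paper, and the paper's full schedule (including its final decay phase) is not reproduced here.

```python
# Minimal sketch of a power-law, token-count-based learning-rate schedule.
# Illustrative only: `a`, `b`, and `lr_max` are placeholder values, not the
# constants fitted in the paper, and the paper's final decay phase is omitted.

def power_lr(tokens_seen: int, a: float = 1.0, b: float = 0.5,
             lr_max: float = 0.02) -> float:
    """Return a learning rate that decays as a power law in tokens seen.

    Because the schedule is indexed by the cumulative token count rather
    than by step number or batch size, the same (a, b, lr_max) can be
    reused when the batch size or total token budget changes.
    """
    tokens_seen = max(tokens_seen, 1)  # guard against zero tokens
    return min(lr_max, a * tokens_seen ** (-b))


if __name__ == "__main__":
    for n in (10**6, 10**8, 10**10, 10**12):
        print(f"{n:.0e} tokens -> lr = {power_lr(n):.3e}")
```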

