
Fantastic Pretraining Optimizers and Where to Find Them

September 2, 2025
Authors: Kaiyue Wen, David Hall, Tengyu Ma, Percy Liang
cs.AI

Abstract

AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedups. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size, to only 1.1x for 1.2B parameter models. Third, comparing intermediate checkpoints before reaching the target training budget can be misleading, as the ranking between two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all of the fastest optimizers, such as Muon and Soap, use matrices as preconditioners, multiplying gradients with matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B parameter models to merely 1.1x for 1.2B parameter models.
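To make the distinction between entry-wise and matrix preconditioning concrete, below is a minimal sketch, not the authors' code or any library's implementation, contrasting an Adam-style update, which rescales each gradient entry by its own scalar, with a Muon-style update, which preconditions the whole 2-D gradient (momentum) matrix via a Newton-Schulz iteration. The function names, hyperparameter values, and the simple cubic iteration are illustrative assumptions; the actual Muon and Soap optimizers differ in detail.

```python
# A minimal sketch (not the paper's or any library's implementation) contrasting
# the two preconditioning styles discussed in the abstract. All names and
# hyperparameter values below are illustrative assumptions.
import torch

def adam_style_update(grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Entry-wise preconditioning: each gradient entry is rescaled by a scalar
    derived from its own running second moment (bias correction omitted)."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    return lr * m / (v.sqrt() + eps)

def newton_schulz_orthogonalize(g, steps=5):
    """Push the singular values of a 2-D gradient toward 1 with a simple cubic
    Newton-Schulz iteration, so the update is (approximately) an orthogonal
    matrix sharing the row/column spaces of g."""
    x = g / (g.norm() + 1e-7)          # scale so singular values lie in (0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_style_update(grad, momentum_buf, lr=0.02, beta=0.95):
    """Matrix preconditioning: the whole gradient (momentum) matrix is
    multiplied by matrices rather than scaled entry by entry."""
    momentum_buf.mul_(beta).add_(grad)
    return lr * newton_schulz_orthogonalize(momentum_buf)

# Toy usage on a single weight matrix.
W = torch.randn(64, 32)
G = torch.randn_like(W)                              # stand-in for a gradient
m, v = torch.zeros_like(W), torch.zeros_like(W)
W_adam = W - adam_style_update(G, m, v)              # entry-wise scalars
buf = torch.zeros_like(W)
W_matrix = W - muon_style_update(G, buf)             # matrix preconditioner
```

The contrast to notice is the shape of the preconditioner: the Adam-style buffers m and v hold one scalar per parameter, whereas the Newton-Schulz step mixes entire rows and columns of the gradient matrix, which is what the abstract means by multiplying gradients with matrices. The real Muon uses a higher-order iteration and per-layer scaling, and Soap maintains Shampoo-style preconditioner matrices, so this is only an illustration of the categorical difference.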