Fantastic Pretraining Optimizers and Where to Find Them

Abstract

AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer a 1.4x to 2x speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B to 1.2B parameters) and four data-to-model ratios (1x to 8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size, to only 1.1x for 1.2B-parameter models. Third, comparing intermediate checkpoints before reaching the target training budget can be misleading, because the ranking of two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all of the fastest optimizers, such as Muon and Soap, use matrices as preconditioners: they multiply gradients by matrices rather than by entry-wise scalars. However, the speedup of matrix-based optimizers shrinks as model scale grows, decreasing from 1.4x over AdamW for 0.1B-parameter models to merely 1.1x for 1.2B-parameter models.
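The distinction between entry-wise scalar and matrix preconditioners can be illustrated with a minimal NumPy sketch. It is not the update rule of Muon, Soap, or any other optimizer in the study. The function names, the choice of (G G^T)^(-1/2) as the matrix preconditioner, and the use of the squared gradient as the second-moment estimate are all assumptions made for illustration only.

```python
import numpy as np


def entrywise_precondition(grad, second_moment, eps=1e-8):
    """Adam-style scaling: every entry is divided by its own scalar."""
    return grad / (np.sqrt(second_moment) + eps)


def matrix_precondition(grad, eps=1e-8):
    """Illustrative matrix preconditioner (an assumption, not a published rule):
    left-multiply the gradient G by (G G^T)^(-1/2)."""
    gram = grad @ grad.T                         # (rows, rows) Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)      # symmetric eigendecomposition
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return inv_sqrt @ grad                       # mixes entries across rows


rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 6))               # toy gradient of a 4x6 weight

# Hypothetical second-moment estimate: the squared gradient itself.
scalar_update = entrywise_precondition(grad, grad**2)
matrix_update = matrix_precondition(grad)
```

Under these assumptions, the entry-wise update keeps the sign of each gradient entry and only rescales it, whereas the matrix update combines entries across rows and, for a full-rank gradient, has approximately orthonormal rows.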
