The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

Abstract

We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound through the absence of logarithmic terms. Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with the optimal learning rate, and (ii) transferring the optimal learning rate across schedules.
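For readers unfamiliar with the schedule the bound refers to, the following is a minimal sketch of a constant learning rate followed by a linear cooldown. It is an illustration, not the authors' implementation: the function name, the parameter names, the default cooldown fraction of 20%, and the choice to decay toward zero just after the final step are all assumptions made here.

```python
def constant_with_linear_cooldown(step, total_steps, base_lr, cooldown_fraction=0.2):
    """Return the learning rate at a 0-indexed `step`.

    Hypothetical helper for illustration: the rate stays at `base_lr`, then
    decays linearly over the last `cooldown_fraction` of training and would
    reach zero at `step == total_steps` (one past the final step).
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if not 0.0 <= cooldown_fraction <= 1.0:
        raise ValueError("cooldown_fraction must lie in [0, 1]")

    # Number of steps spent in the cooldown phase (assumed rounding rule).
    cooldown_steps = int(round(cooldown_fraction * total_steps))
    cooldown_start = total_steps - cooldown_steps

    # Constant phase: also covers cooldown_steps == 0, so no division by zero.
    if step < cooldown_start:
        return base_lr

    # Cooldown phase: scale by the fraction of cooldown steps still remaining.
    remaining = total_steps - step
    return base_lr * remaining / cooldown_steps


# Example: 100 steps with a base learning rate of 3e-4 (values chosen arbitrarily).
schedule = [constant_with_linear_cooldown(t, 100, 3e-4) for t in range(100)]
```

In this sketch the first 80 entries of `schedule` equal the base rate and the last 20 decrease linearly. Setting `cooldown_fraction=0.0` recovers a purely constant schedule, which is the case the abstract contrasts against.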
