Practical Efficiency of Muon for Pretraining

May 4, 2025
Authors: Essential AI, Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani
cs.AI

Abstract

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
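
For context, the Muon update referenced in the abstract replaces AdamW's per-coordinate scaling with an approximately orthogonalized momentum matrix applied to each 2-D weight. The sketch below is a minimal NumPy illustration of that idea, using a commonly cited quintic Newton-Schulz iteration; the coefficients, learning rate, momentum constant, and shape-dependent scaling are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximately map M onto the nearest (semi-)orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315          # quintic iteration coefficients (assumed)
    X = M / (np.linalg.norm(M) + eps)          # normalize so the iteration is stable
    transposed = X.shape[0] > X.shape[1]
    if transposed:                             # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a single 2-D weight matrix (illustrative only)."""
    momentum = beta * momentum + grad                     # heavy-ball momentum accumulation
    direction = newton_schulz_orthogonalize(momentum)     # orthogonalized update direction
    scale = np.sqrt(max(W.shape[0] / W.shape[1], 1.0))    # shape-dependent scaling (assumed form)
    return W - lr * scale * direction, momentum

# Tiny usage example on a random 2-D weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128)) / np.sqrt(128)
m = np.zeros_like(W)
grad = rng.normal(size=W.shape)
W, m = muon_step(W, grad, m)
```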

