
Practical Efficiency of Muon for Pretraining

May 4, 2025
Authors: Essential AI, Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, Khoi Nguyen, Kurt Smith, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Saurabh Srivastava, Somanshu Singla, Tim Romanski, Yash Vanjani, Ashish Vaswani
cs.AI

Abstract

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
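For context on the optimizer the abstract refers to, the sketch below shows the commonly published form of a Muon-style update: a momentum buffer for each 2-D weight matrix is approximately orthogonalized with a few Newton-Schulz iterations and applied as the step. This is a minimal illustration based on the public reference formulation of Muon, not the paper's own code; the function names, learning rate, and momentum values here are illustrative assumptions.

```python
# Minimal sketch of a Muon-style update for one matrix parameter (illustrative only).
import torch

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximately map M to an orthogonal matrix with a quintic
    Newton-Schulz iteration (coefficients from the public Muon reference)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)          # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                       # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style step: heavy-ball momentum, then an orthogonalized update."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.add_(update, alpha=-lr)
    return weight, momentum

# Usage on a toy 2-D parameter and gradient.
W = torch.randn(256, 128)
g = torch.randn_like(W)
m = torch.zeros_like(W)
W, m = muon_step(W, g, m)
```

The orthogonalization is what makes Muon behave like a cheap second-order method: it equalizes the scale of the update across directions of the weight matrix rather than following the raw momentum.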
