NorMuon: Making Muon more efficient and scalable
October 7, 2025
Authors: Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, Tuo Zhao
cs.AI
Abstract
The choice of optimizer significantly impacts the training efficiency and
computational costs of large language models (LLMs). Recently, the Muon
optimizer has demonstrated promising results by orthogonalizing parameter
updates, improving optimization geometry through better conditioning. Despite
Muon's emergence as a candidate successor to Adam, the potential for jointly
leveraging their strengths has not been systematically explored. In this work,
we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an
optimizer that synergistically combines orthogonalization with neuron-level
adaptive learning rates. Our analysis reveals that while Muon effectively
reduces condition numbers, the resulting updates exhibit highly non-uniform
neuron norms, causing certain neurons to dominate the optimization process.
NorMuon addresses this imbalance by maintaining second-order momentum
statistics for each neuron and applying row-wise normalization after
orthogonalization, ensuring balanced parameter utilization while preserving
Muon's conditioning benefits. To enable practical deployment at scale, we
develop an efficient distributed implementation under the FSDP2 framework that
strategically distributes orthogonalization computations across devices.
Experiments across multiple model scales demonstrate that NorMuon consistently
outperforms both Adam and Muon, achieving 21.74% better training efficiency
than Adam and an 11.31% improvement over Muon in the 1.1B pretraining setting,
while maintaining a memory footprint comparable to Muon's. Our findings suggest that
orthogonalization and adaptive learning rates are complementary rather than
competing approaches, opening new avenues for optimizer design in large-scale
deep learning.
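
To make the update rule described above concrete, below is a minimal single-device
sketch in PyTorch. The Newton-Schulz orthogonalization follows the publicly known
Muon recipe; the function name `normuon_step`, the hyperparameters `beta2` and `eps`,
the EMA of each row's mean squared entry, and the row-wise RMS division are
illustrative assumptions about how the neuron-level second-moment statistics might be
maintained and applied, not the paper's exact formulation, and the FSDP2 distributed
scheduling of the orthogonalization is omitted entirely.

```python
import torch


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D update with the Newton-Schulz iteration
    commonly used by Muon (quintic polynomial coefficients from the public recipe)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # keep the Gram matrix A = X X^T small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X


def normuon_step(W, grad, M, v, lr=0.02, beta1=0.95, beta2=0.95, eps=1e-8):
    """One NorMuon-style update for a 2D weight W whose rows correspond to neurons.

    M : momentum buffer, same shape as W
    v : per-neuron (per-row) second-moment buffer, shape (W.shape[0],)
    The EMA form of the row statistic and the row-wise RMS division below are
    assumptions made for illustration; the paper defines the exact normalization.
    """
    # Muon part: momentum accumulation followed by orthogonalization.
    M.mul_(beta1).add_(grad)
    O = newton_schulz_orthogonalize(M)

    # Neuron-wise second-order statistics: EMA of each row's mean squared entry.
    v.mul_(beta2).add_(O.pow(2).mean(dim=1), alpha=1.0 - beta2)

    # Row-wise normalization after orthogonalization, so that no single neuron's
    # update norm dominates the step.
    O = O / (v.sqrt().unsqueeze(1) + eps)

    W.add_(O, alpha=-lr)
    return W
```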