NorMuon: Making Muon more efficient and scalable
October 7, 2025
Authors: Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, Tuo Zhao
cs.AI
Abstract
The choice of optimizer significantly impacts the training efficiency and
computational costs of large language models (LLMs). Recently, the Muon
optimizer has demonstrated promising results by orthogonalizing parameter
updates, improving optimization geometry through better conditioning. Despite
Muon's emergence as a candidate successor to Adam, the potential for jointly
leveraging their strengths has not been systematically explored. In this work,
we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an
optimizer that synergistically combines orthogonalization with neuron-level
adaptive learning rates. Our analysis reveals that while Muon effectively
reduces condition numbers, the resulting updates exhibit highly non-uniform
neuron norms, causing certain neurons to dominate the optimization process.
NorMuon addresses this imbalance by maintaining second-order momentum
statistics for each neuron and applying row-wise normalization after
orthogonalization, ensuring balanced parameter utilization while preserving
Muon's conditioning benefits. To enable practical deployment at scale, we
develop an efficient distributed implementation under the FSDP2 framework that
strategically distributes orthogonalization computations across devices.
Experiments across multiple model scales demonstrate that NorMuon consistently
outperforms both Adam and Muon, achieving 21.74% better training efficiency
than Adam and an 11.31% improvement over Muon in the 1.1B pretraining setting, while
maintaining a comparable memory footprint to Muon. Our findings suggest that
orthogonalization and adaptive learning rates are complementary rather than
competing approaches, opening new avenues for optimizer design in large-scale
deep learning.
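
Below is a minimal, illustrative sketch of the update rule the abstract describes: a Muon-style Newton-Schulz orthogonalization of the momentum, followed by row-wise (per-neuron) normalization driven by per-neuron second-order momentum statistics. Function names, hyperparameters (beta1, beta2, eps, the number of Newton-Schulz steps), and the final rescaling are assumptions made for illustration, not the authors' released implementation; the distributed FSDP2 machinery is omitted.

```python
# Hypothetical sketch of one NorMuon-style step for a single 2D weight matrix
# (rows treated as output neurons). Based only on the abstract's description.
import torch


def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G with a Newton-Schulz iteration,
    the same primitive Muon uses (quintic coefficients are an assumption)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)              # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X


@torch.no_grad()
def normuon_step(param, grad, momentum_buf, second_moment_rows,
                 lr=0.02, beta1=0.95, beta2=0.95, eps=1e-8):
    """One NorMuon-style update (illustrative).

    momentum_buf:       same shape as param (first-order momentum).
    second_moment_rows: one scalar per row (per-neuron second moment).
    """
    # 1) Momentum accumulation, as in Muon.
    momentum_buf.mul_(beta1).add_(grad)

    # 2) Orthogonalize the momentum to improve conditioning.
    update = newton_schulz_orthogonalize(momentum_buf)

    # 3) Per-neuron second-moment statistics of the orthogonalized update,
    #    then row-wise normalization so no single neuron dominates.
    row_sq = update.pow(2).mean(dim=1)                  # one value per neuron
    second_moment_rows.mul_(beta2).add_(row_sq, alpha=1 - beta2)
    update = update / (second_moment_rows.sqrt().unsqueeze(1) + eps)

    # 4) Rescale so the overall update magnitude stays comparable to Muon's
    #    (an assumption; the paper may use a different scaling), then apply.
    update = update * (param.shape[1] ** 0.5 / (update.norm() + eps))
    param.add_(update, alpha=-lr)


if __name__ == "__main__":
    # Toy usage on a random 2D parameter.
    torch.manual_seed(0)
    p = torch.randn(64, 128)
    g = torch.randn(64, 128)
    m = torch.zeros_like(p)
    v = torch.zeros(64)
    normuon_step(p, g, m, v)
```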