NorMuon: Making Muon more efficient and scalable

Abstract

The choice of optimizer significantly impacts the training efficiency and computational cost of large language models (LLMs). Recently, the Muon optimizer has demonstrated promising results by orthogonalizing parameter updates, which improves optimization geometry through better conditioning. Despite Muon's emergence as a candidate successor to Adam, the potential for jointly leveraging their strengths has not been systematically explored. In this work, we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an optimizer that synergistically combines orthogonalization with neuron-level adaptive learning rates. Our analysis reveals that while Muon effectively reduces condition numbers, the resulting updates exhibit highly non-uniform neuron norms, causing certain neurons to dominate the optimization process. NorMuon addresses this imbalance by maintaining second-order momentum statistics for each neuron and applying row-wise normalization after orthogonalization, ensuring balanced parameter utilization while preserving Muon's conditioning benefits. To enable practical deployment at scale, we develop an efficient distributed implementation under the FSDP2 framework that strategically distributes orthogonalization computations across devices. Experiments across multiple model scales demonstrate that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in the 1.1B pretraining setting, while maintaining a memory footprint comparable to Muon's. Our findings suggest that orthogonalization and adaptive learning rates are complementary rather than competing approaches, opening new avenues for optimizer design in large-scale deep learning.
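The update described in the abstract (momentum, orthogonalization, then row-wise normalization driven by a per-neuron second-moment statistic) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Several details are assumptions made for illustration: the function names `orthogonalize` and `normuon_like_step`; the hyperparameter names and defaults `lr`, `beta1`, `beta2`, `eps`; the use of an exact SVD-based polar factor in place of whatever iterative orthogonalization the optimizer actually uses; treating each row of the weight matrix as one neuron; and the final rescaling that keeps the update's Frobenius norm equal to that of the orthogonalized matrix.

```python
import numpy as np


def orthogonalize(matrix):
    # Exact polar factor via SVD. This is a stand-in (assumption) for the
    # iterative orthogonalization an actual Muon-style optimizer would use.
    u, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return u @ vt


def normuon_like_step(weight, grad, momentum, row_second_moment,
                      lr=0.02, beta1=0.95, beta2=0.95, eps=1e-8):
    """One illustrative update; all hyperparameter names are hypothetical."""
    # First-order momentum on the raw gradient.
    momentum = beta1 * momentum + (1.0 - beta1) * grad

    # Orthogonalize the momentum matrix, as in Muon.
    ortho = orthogonalize(momentum)

    # Per-neuron (per-row) running average of the mean squared entry.
    row_mean_sq = np.mean(ortho ** 2, axis=1)
    row_second_moment = beta2 * row_second_moment + (1.0 - beta2) * row_mean_sq

    # Row-wise normalization applied after orthogonalization.
    update = ortho / (np.sqrt(row_second_moment)[:, None] + eps)

    # Assumption: rescale so the overall update magnitude matches `ortho`.
    update *= np.linalg.norm(ortho) / (np.linalg.norm(update) + eps)

    return weight - lr * update, momentum, row_second_moment


# Usage: a tall 8x4 weight matrix, where the orthogonalized rows are not
# uniform in norm, so the row-wise normalization has a visible effect.
rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 4))
grad = rng.standard_normal((8, 4))
new_weight, momentum, second_moment = normuon_like_step(
    weight, grad, np.zeros((8, 4)), np.zeros(8)
)
print("row norms before normalization:",
      np.linalg.norm(orthogonalize(grad), axis=1))
print("row norms of applied update:   ",
      np.linalg.norm(weight - new_weight, axis=1))
```

In this sketch the only optimizer state beyond Muon's momentum matrix is one scalar per row, which is in line with the abstract's claim of a memory footprint comparable to Muon's.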
