Why Muon Outperforms Adam: A Curvature Perspective

Abstract

Muon is known to improve training efficiency over Adam by roughly a factor of two in large language model training, but the local geometric source of this advantage remains unclear. This work takes a first step toward explaining Muon's advantage over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that, at matched validation loss, Muon achieves a larger one-step loss decrease than Adam. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and the Normalized Directional Sharpness (NDS). Because Muon and Adam have comparable update norms, Muon's smaller curvature penalty is driven by lower NDS rather than by update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-layer/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is sustained mainly by smaller within-layer curvature. Beyond the empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradients aligned with high-curvature modes. We prove that Muon attains a smaller average NDS than gradient descent (GD) by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields a lower local quadratic loss after the same number of steps.
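The second-order decomposition described above can be made concrete on a toy problem. The sketch below is illustrative only and is not the paper's construction. It assumes NDS is the Rayleigh-quotient-style ratio Δᵀ H Δ / ‖Δ‖², which the abstract does not spell out. It uses a hypothetical matrix quadratic L(W) = ½ tr(Wᵀ A W) with one high-curvature row. It also replaces Muon's actual update (momentum plus an approximate Newton-Schulz orthogonalization) with an exact SVD orthogonalization of the gradient. The names `second_order_terms` and `orthogonalize` are made up for this example.

```python
import numpy as np

def loss(A, W):
    # Toy quadratic L(W) = 0.5 * tr(W^T A W); its Hessian maps D to A @ D.
    return 0.5 * np.sum(W * (A @ W))

def second_order_terms(A, G, delta):
    # Split the one-step loss change into a first-order gain and a
    # curvature penalty, with penalty = 0.5 * ||delta||^2 * NDS.
    gain = -np.sum(G * delta)                       # -<G, delta>
    sq_norm = np.sum(delta * delta)                 # ||delta||_F^2
    nds = np.sum(delta * (A @ delta)) / sq_norm     # assumed NDS definition
    penalty = 0.5 * sq_norm * nds
    return gain, penalty, sq_norm, nds

def orthogonalize(G):
    # Exact SVD stand-in for a Muon-style orthogonalized update (assumption:
    # real Muon approximates this with Newton-Schulz iterations on momentum).
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
A = np.diag([100.0, 1.0, 1.0, 1.0])   # heterogeneous curvature, one sharp mode
W = rng.standard_normal((4, 8))
G = A @ W                             # gradient; dominated by the sharp row

O = orthogonalize(G)
directions = {
    "gd": G / np.linalg.norm(G),          # unit Frobenius norm
    "muon_like": O / np.linalg.norm(O),   # same norm, so update scale is matched
}

lr = 1e-2
results = {}
for name, d in directions.items():
    delta = -lr * d
    gain, penalty, sq_norm, nds = second_order_terms(A, G, delta)
    actual = loss(A, W) - loss(A, W + delta)   # exact decrease on a quadratic
    results[name] = {"gain": gain, "penalty": penalty,
                     "nds": nds, "actual_decrease": actual}
    print(f"{name}: gain={gain:.6f} penalty={penalty:.6f} nds={nds:.4f}")
```

Because the loss here is exactly quadratic, gain minus penalty equals the true one-step decrease. Both updates have the same norm, so any difference in curvature penalty comes from NDS alone. The orthogonalized update has orthonormal rows, which spreads its energy evenly across the rows, so its NDS equals the mean eigenvalue tr(A)/4 = 25.75. The gradient direction concentrates on the sharp row and its NDS is a gradient-energy-weighted average that sits near the largest eigenvalue.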