Why Muon Outperforms Adam: A Curvature Perspective

Abstract

Muon is roughly twice as training-efficient as Adam in large language model training, but the local geometric source of this advantage remains unclear. We take a first step toward explaining Muon's advantage over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training loss landscape and show that, at matched validation loss, Muon achieves a larger one-step loss decrease than Adam. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and the Normalized Directional Sharpness (NDS). Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS rather than by update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-layer/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond this empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature in which the gradient is aligned toward high-curvature modes. We prove that Muon attains a smaller average NDS than gradient descent (GD) by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields a lower local quadratic loss after the same number of steps.
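
The decomposition described above can be illustrated on a toy quadratic. The following is a minimal sketch under stated assumptions, not the authors' experimental setup. It assumes that NDS is the Rayleigh quotient of the Hessian along the update direction, so that the curvature penalty equals one half of the squared update norm times NDS; the abstract does not spell out the exact definition or normalization. The loss L(W) = 0.5 * tr(W^T A W), the matrix A, the step size, and the helper names `decompose_step` and `orthogonalized` are all hypothetical. The Muon-like update uses an exact SVD-based orthogonalization of the gradient with no momentum, which stands in for Muon's actual update rule.

```python
import numpy as np


def decompose_step(G, A, delta):
    """Split the predicted loss change for L(W) = 0.5 * tr(W^T A W).

    Assumed definitions (not taken from the source):
      first-order gain  = <G, delta>
      NDS               = <delta, A @ delta> / ||delta||_F^2
      curvature penalty = 0.5 * ||delta||_F^2 * NDS
    """
    first_order = float(np.sum(G * delta))
    quad = float(np.sum(delta * (A @ delta)))  # Hessian quadratic form
    norm_sq = float(np.sum(delta * delta))     # squared Frobenius norm
    nds = quad / norm_sq
    penalty = 0.5 * norm_sq * nds
    return {
        "first_order": first_order,
        "norm_sq": norm_sq,
        "nds": nds,
        "penalty": penalty,
        "predicted_change": first_order + penalty,
    }


def orthogonalized(G):
    """Polar factor U @ V^T of G via SVD (a stand-in for Muon's update)."""
    U, _, Vh = np.linalg.svd(G, full_matrices=False)
    return U @ Vh


def loss(W, A):
    return 0.5 * float(np.trace(W.T @ A @ W))


# Heterogeneous curvature: one sharp mode, three flat modes (illustrative).
A = np.diag([100.0, 1.0, 1.0, 1.0])
W = np.eye(4)
G = A @ W   # gradient of the toy loss; dominated by the sharp mode
lr = 0.01   # illustrative step size

gd = decompose_step(G, A, -lr * G)
muon_like = decompose_step(G, A, -lr * orthogonalized(G))

print("GD        NDS:", gd["nds"])
print("Muon-like NDS:", muon_like["nds"])
```

In this toy problem the gradient concentrates on the sharp mode, so the GD direction inherits nearly the full top curvature. The orthogonalized update spreads its energy evenly across all four modes and therefore sees close to the average curvature. Because NDS is scale-invariant, this comparison does not depend on the two updates having equal norms.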