Why Muon Outperforms Adam: A Curvature Perspective

Abstract

Muon trains large language models roughly twice as efficiently as Adam, but the local geometric source of this advantage remains unclear. This work takes a first step toward explaining Muon's advantage over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training loss landscape and show that, at matched validation loss, Muon achieves a larger one-step loss decrease than Adam. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and the Normalized Directional Sharpness (NDS). Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS rather than by update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond this empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradients aligned toward high-curvature modes. We prove that Muon attains a smaller average NDS than gradient descent (GD) by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields a lower local quadratic loss after the same number of steps.
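The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy loss L(W) = 0.5 tr(Wᵀ A W) with a diagonal curvature matrix A, the choice W = I (so the gradient carries more energy on high-curvature modes), the helper names `taylor_terms`, `orthogonalize`, and `rescale`, and the use of an exact SVD polar factor as an idealized stand-in for Muon's orthogonalized update. NDS is taken here to be δᵀHδ / ‖δ‖², so that the curvature penalty equals 0.5 · ‖δ‖² · NDS.

```python
import numpy as np

def taylor_terms(grad, hess_quad, delta):
    """Split the second-order Taylor estimate of the loss change for update `delta`.

    Returns (first_order, curvature_penalty, nds), where
    curvature_penalty = 0.5 * ||delta||^2 * nds  and  nds = delta^T H delta / ||delta||^2.
    """
    first_order = float(np.sum(grad * delta))   # <g, delta>
    quad = float(hess_quad(delta))              # delta^T H delta
    sq_norm = float(np.sum(delta ** 2))         # ||delta||^2
    nds = quad / sq_norm
    return first_order, 0.5 * sq_norm * nds, nds

def orthogonalize(G):
    """Idealized Muon-style direction: polar factor U V^T of G (assumption: exact SVD)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def rescale(direction, norm):
    """Descent update along `direction` with a fixed Frobenius norm."""
    return -norm * direction / np.linalg.norm(direction)

# Toy quadratic with two curvature groups (hypothetical values).
h = np.array([10.0, 10.0, 1.0, 1.0])
A = np.diag(h)
W = np.eye(4)

loss = lambda X: 0.5 * np.trace(X.T @ A @ X)
hess_quad = lambda D: np.trace(D.T @ A @ D)
G = A @ W                                       # gradient of the toy loss at W

# Match update norms so that only the direction (and hence NDS) differs.
step_norm = 0.1
delta_gd = rescale(G, step_norm)
delta_muon = rescale(orthogonalize(G), step_norm)

gain_gd, penalty_gd, nds_gd = taylor_terms(G, hess_quad, delta_gd)
gain_muon, penalty_muon, nds_muon = taylor_terms(G, hess_quad, delta_muon)

# For a quadratic loss the second-order estimate is exact.
exact_gd = loss(W + delta_gd) - loss(W)
exact_muon = loss(W + delta_muon) - loss(W)

print(f"GD:   NDS={nds_gd:.3f}  penalty={penalty_gd:.4f}")
print(f"Muon: NDS={nds_muon:.3f}  penalty={penalty_muon:.4f}")
```

In this toy setting the orthogonalized update spreads its energy evenly across the four modes, so its NDS equals the mean curvature (5.5), while the gradient-descent direction concentrates energy on the two high-curvature modes and has an NDS of about 9.91. The sketch illustrates only the decomposition and the NDS comparison at matched update norm; it does not reproduce the first-order-gain or loss-decrease findings reported for LLM training.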