Muon Outperforms Adam in Tail-End Associative Memory Learning
September 30, 2025
Authors: Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, Vincent Y. F. Tan
cs.AI
Abstract
The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we reveal that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and Feed-Forward Networks (FFNs), are the primary contributors to Muon's superiority. Motivated by this associative memory view, we then explain Muon's superiority on real-world corpora, which are intrinsically heavy-tailed: a few classes (tail classes) appear far less frequently than others. This superiority is explained through two key properties: (i) Muon's update rule consistently yields a more isotropic singular spectrum than Adam's; and, as a result, (ii) on heavy-tailed data it optimizes tail classes more effectively than Adam. Beyond empirical evidence, we theoretically confirm these findings by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on embedding properties. In summary, our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.
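
To make property (i) concrete, here is a minimal NumPy sketch (not the paper's code) of the orthogonalization at the heart of Muon's update rule. Muon replaces a momentum matrix with an approximation of its orthogonal polar factor; the public Muon implementation uses a tuned quintic Newton-Schulz iteration, while the cubic variant below has the same fixed point. The sign-based stand-in for Adam is a crude proxy for its elementwise m/sqrt(v) normalization, and the matrix size and conditioning are arbitrary illustration choices.

```python
import numpy as np

def orthogonalize(m, steps=40):
    """Approximate the orthogonal polar factor U V^T of m = U S V^T.

    Cubic Newton-Schulz iteration: each singular value s is mapped to
    1.5*s - 0.5*s^3, which converges to 1 for s in (0, sqrt(3)).
    """
    x = m / (np.linalg.norm(m) + 1e-12)  # Frobenius norm => singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
# An ill-conditioned "gradient": column scales decay over two orders of magnitude.
grad = rng.normal(size=(64, 64)) @ np.diag(np.logspace(0, -2, 64))

muon_update = orthogonalize(grad)
adam_update = np.sign(grad)  # crude proxy for Adam's elementwise normalization

# Muon's update has a near-flat (isotropic) singular spectrum by construction;
# the elementwise-normalized update does not.
for name, upd in [("muon", muon_update), ("adam-proxy", adam_update)]:
    s = np.linalg.svd(upd, compute_uv=False)
    print(f"{name:10s} sigma_max/sigma_min = {s[0] / s[-1]:.1f}")
```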
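And here is a toy experiment in the spirit of the paper's one-layer associative memory analysis, though with arbitrary choices of dimension, Zipf exponent, loss (plain squared error), learning rate, and step count rather than the paper's exact setup. A linear map W stores class associations e_c -> u_c under heavy-tailed class frequencies; comparing per-class residuals after Muon-style (orthogonalized-gradient) training versus Adam lets one check the prediction that Muon's learning is balanced across head and tail classes.

```python
import numpy as np

def orthogonalize(m, steps=40):
    # Cubic Newton-Schulz approximation of the polar factor, as in the sketch above.
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
d = 32                                        # embedding dim = number of classes
p = 1.0 / np.arange(1, d + 1)                 # Zipf-like heavy-tailed frequencies
p /= p.sum()
E = np.linalg.qr(rng.normal(size=(d, d)))[0]  # non-axis-aligned input embeddings e_c
U = rng.normal(size=(d, d))
U /= np.linalg.norm(U, axis=0)                # unit-norm target outputs u_c

def grad(W):
    # Population gradient of 0.5 * sum_c p_c ||W e_c - u_c||^2:
    # a frequency-weighted sum of outer products (W e_c - u_c) e_c^T.
    return ((W @ E - U) * p) @ E.T

def per_class_error(W):
    return np.linalg.norm(W @ E - U, axis=0)  # residual norm for each class c

# Muon-style training: each update is the orthogonalized gradient, so every
# singular direction of the outer-product gradient moves at the same rate.
W_muon = np.zeros((d, d))
for _ in range(400):
    W_muon -= 0.005 * orthogonalize(grad(W_muon))

# Adam on the same objective (standard bias-corrected full-batch updates).
W_adam = np.zeros((d, d))
m, v = np.zeros((d, d)), np.zeros((d, d))
b1, b2, eps, lr = 0.9, 0.999, 1e-8, 0.005
for t in range(1, 401):
    g = grad(W_adam)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    W_adam -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)

for name, W in [("muon", W_muon), ("adam", W_adam)]:
    err = per_class_error(W)
    print(f"{name}: head-class error {err[0]:.3f}, tail-class error {err[-1]:.3f}")
```

The mechanism this illustrates is the one the abstract names: because the gradient is a frequency-weighted sum of outer products, orthogonalizing it flattens the p_c weighting across singular directions, so tail-class associations are updated at the same rate as head-class ones regardless of how the embeddings are oriented.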