

Muon Outperforms Adam in Tail-End Associative Memory Learning

September 30, 2025
作者: Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, Vincent Y. F. Tan
cs.AI

Abstract

The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we reveal that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and Feed-Forward Networks (FFNs), are the primary contributors to Muon's superiority. Motivated by this associative memory view, we then explain Muon's superiority on real-world corpora, which are intrinsically heavy-tailed: a few classes (tail classes) appear far less frequently than others. The superiority is explained through two key properties: (i) its update rule consistently yields a more isotropic singular spectrum than Adam; and as a result, (ii) on heavy-tailed data, it optimizes tail classes more effectively than Adam. Beyond empirical evidence, we theoretically confirm these findings by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on embedding properties. In summary, our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.
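To make the two key properties concrete, here is a minimal NumPy sketch, not the paper's code: the dimensions, random embeddings, squared-error loss, and single-step comparison are all illustrative assumptions. It builds the frequency-weighted outer-product gradient of a one-layer linear associative memory under heavy-tailed class frequencies, then compares the singular spectrum of a Muon-style orthogonalized update against a first Adam step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-layer linear associative memory under class imbalance.
# All sizes, embeddings, and the loss below are hypothetical choices
# for illustration, not the paper's exact setup.
d, num_classes = 64, 8
X = rng.standard_normal((num_classes, d)) / np.sqrt(d)  # input embeddings x_c
Y = rng.standard_normal((num_classes, d)) / np.sqrt(d)  # target embeddings y_c
freq = 0.5 ** np.arange(num_classes)                    # heavy-tailed class frequencies
freq /= freq.sum()

# Gradient of 0.5 * sum_c f_c * ||W x_c - y_c||^2 at W = 0: a
# frequency-weighted sum of outer products, so tail classes only
# contribute tiny singular directions.
W = np.zeros((d, d))
grad = sum(f * np.outer(W @ x - y, x) for f, x, y in zip(freq, X, Y))

# Muon-style step: replace the update by its nearest semi-orthogonal
# factor U V^T (Muon approximates this with a Newton-Schulz iteration;
# an exact truncated SVD is used here for clarity).
U, S, Vt = np.linalg.svd(grad, full_matrices=False)
r = int((S > 1e-10).sum())
muon_update = U[:, :r] @ Vt[:r, :]

# First bias-corrected Adam step: elementwise normalization, which
# reduces to roughly sign(grad) and need not flatten the spectrum.
beta1, beta2, eps = 0.9, 0.999, 1e-8
m_hat = (1 - beta1) * grad / (1 - beta1)
v_hat = (1 - beta2) * grad**2 / (1 - beta2)
adam_update = m_hat / (np.sqrt(v_hat) + eps)

for name, M in [("gradient", grad), ("Muon", muon_update), ("Adam", adam_update)]:
    sv = np.linalg.svd(M, compute_uv=False)
    nonzero = sv[sv > 1e-10]
    print(f"{name:8s} singular values: max={nonzero.max():.4f}, min={nonzero.min():.4f}")
```

On this toy gradient, the Muon-style update's nonzero singular values are all exactly 1 (an isotropic spectrum), whereas the raw gradient's spectrum mirrors the heavy-tailed class frequencies and Adam's elementwise rescaling does not, in general, equalize it.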