Muon Outperforms Adam in Tail-End Associative Memory Learning

Abstract

The Muon optimizer consistently trains Large Language Models (LLMs) faster than Adam, yet the mechanism underlying its success remains unclear. This paper explains that mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we show that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and the Feed-Forward Networks (FFNs), are the primary contributors to Muon's superiority. Motivated by this associative memory view, we then explain Muon's superiority on real-world corpora, which are intrinsically heavy-tailed: a few classes (tail classes) appear far less frequently than others. Two key properties account for the advantage: (i) Muon's update rule consistently yields a more isotropic singular spectrum than Adam's; and, as a result, (ii) on heavy-tailed data, Muon optimizes tail classes more effectively than Adam. Beyond empirical evidence, we confirm these findings theoretically by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on embedding properties. In summary, our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories, which enables more balanced and effective learning of tail classes in heavy-tailed distributions than Adam achieves.
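The link between an isotropic singular spectrum and balanced learning of tail classes can be illustrated with a small NumPy sketch. It is not the paper's experiment. It assumes a toy linear associative memory whose gradient is a frequency-weighted sum of outer products with orthonormal key and value embeddings, idealizes Muon's update as the exact polar factor obtained from an SVD (practical implementations typically approximate this with a Newton-Schulz iteration), and uses an elementwise sign update as a crude stand-in for Adam's first step. All function names, dimensions, and the power-law exponent `alpha` are hypothetical choices made for illustration.

```python
import numpy as np


def heavy_tailed_gradient(num_classes=32, dim=64, alpha=1.5, seed=0):
    """Toy gradient G = sum_k p_k * v_k k_k^T with power-law frequencies p_k."""
    rng = np.random.default_rng(seed)
    # Orthonormal columns: an idealizing assumption about the embeddings.
    keys, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    values, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    freqs = 1.0 / np.arange(1, num_classes + 1) ** alpha
    freqs /= freqs.sum()
    grad = (values * freqs) @ keys.T  # values @ diag(freqs) @ keys.T
    return grad, keys, values, freqs


def orthogonalized_update(grad, rel_tol=1e-10):
    """Idealized Muon-style update: set every nonzero singular value to 1."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    rank = int(np.sum(s > rel_tol * s[0]))
    return u[:, :rank] @ vt[:rank]


def sign_update(grad):
    """Crude Adam-like stand-in: elementwise sign of the gradient."""
    return np.sign(grad)


def per_class_strength(update, keys, values):
    """How strongly the update moves each association: v_k^T U k_k."""
    return np.einsum("ik,ij,jk->k", values, update, keys)


if __name__ == "__main__":
    grad, keys, values, freqs = heavy_tailed_gradient()
    for name, upd in [
        ("raw gradient", grad),
        ("orthogonalized (Muon-like)", orthogonalized_update(grad)),
        ("sign (Adam-like)", sign_update(grad)),
    ]:
        strength = per_class_strength(upd, keys, values)
        print(f"{name}: head={strength[0]:.4f}, tail={strength[-1]:.4f}")
```

Under these assumptions the raw gradient moves each association in proportion to its class frequency, while the orthogonalized update moves every association by the same amount, which is the balanced behavior described above. The sign update's per-class strengths depend on the particular embeddings drawn, so no specific values are claimed for it here.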
