Why Muon Outperforms Adam: A Curvature Perspective

Abstract

Muon improves training efficiency over Adam by roughly a factor of two in large language model training, but the local geometric source of this advantage remains unclear. This work takes a first step toward explaining Muon's advantage over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that, at matched validation loss, Muon achieves a larger one-step loss decrease than Adam. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and the Normalized Directional Sharpness (NDS). Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS rather than by update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond this empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradients aligned with high-curvature modes. We prove that Muon attains a smaller average NDS than gradient descent (GD) by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields a lower local quadratic loss after the same number of steps.
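
The decomposition described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's experimental setup. It assumes a diagonal quadratic loss, assumes NDS is the Rayleigh quotient delta^T H delta / ||delta||^2 (the abstract does not give the exact formula), and uses a sign-based update as a stand-in for an orthogonalized update, since orthogonalizing a diagonal gradient matrix yields the signs of its entries. All names (taylor_terms, hess_diag, sign_delta, and so on) are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(w, hess_diag):
    # Toy quadratic loss with a diagonal Hessian: L(w) = 0.5 * sum_i h_i * w_i^2
    return 0.5 * sum(h * x * x for h, x in zip(hess_diag, w))

def taylor_terms(grad, hess_diag, delta):
    """Split the second-order Taylor change in loss for an update delta.

    L(w + delta) - L(w) ~= -first_order_gain + curvature_penalty, where
    curvature_penalty = 0.5 * ||delta||^2 * NDS and
    NDS = delta^T H delta / ||delta||^2 (assumed definition).
    """
    first_order_gain = -dot(grad, delta)
    sq_norm = dot(delta, delta)
    nds = sum(h * d * d for h, d in zip(hess_diag, delta)) / sq_norm
    curvature_penalty = 0.5 * sq_norm * nds
    return first_order_gain, sq_norm, nds, curvature_penalty

# One stiff direction and three flat ones (heterogeneous curvature).
hess_diag = [100.0, 1.0, 1.0, 1.0]
w = [1.0, 1.0, 1.0, 1.0]
grad = [h * x for h, x in zip(hess_diag, w)]  # gradient aligned with the stiff mode

update_norm = 0.02  # both updates are scaled to the same norm
grad_norm = math.sqrt(dot(grad, grad))

# Gradient-descent direction, rescaled to the target norm.
gd_delta = [-update_norm * g / grad_norm for g in grad]
# Sign-based stand-in for an orthogonalized update: equal energy per coordinate.
sign_delta = [-update_norm * math.copysign(1.0, g) / math.sqrt(len(grad)) for g in grad]

for name, delta in [("gd", gd_delta), ("sign", sign_delta)]:
    gain, sq_norm, nds, penalty = taylor_terms(grad, hess_diag, delta)
    print(f"{name}: gain={gain:.4f} sq_norm={sq_norm:.6f} nds={nds:.2f} penalty={penalty:.6f}")
```

In this toy setting both updates have the same norm, so any difference in curvature penalty comes from NDS alone. The sign-based update spreads its energy evenly across coordinates, whereas the gradient direction concentrates on the stiff coordinate, which is the balancing intuition the abstract describes. The first-order gains of the two updates are not matched here, so the example illustrates only the penalty decomposition, not the paper's Muon versus Adam comparison.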