DynMuon: A Dynamic Spectral-Shaping Perspective on Muon

Abstract

In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. Its essential difference from standard gradient descent methods is that it replaces the usual update matrix M = UΣV^⊤ with its polar factor UV^⊤. In this work, we consider a class of Muon-like updates in which the update M is replaced with UΣ^p V^⊤ for some parameter p (p = 1 recovers M itself, and p = 0 recovers the polar factor used by Muon). We call this a "spectral-shaping" operation, and develop a theory of how the choice of p should depend on (a) the local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) the training stage. Our theory and experiments reveal a previously overlooked behavior: positive p helps early in training by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative p helps later by reallocating update strength toward low-curvature directions that still contain useful training signal. Building on this insight, we propose DynMuon, an efficient dynamic spectral-shaping method that schedules p from positive to mildly negative over the course of training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6% to 26.5% fewer steps to reach the same target loss.
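
To make the spectral-shaping operation concrete, the following is a minimal NumPy sketch of the map M ↦ UΣ^p V^⊤, computed with an explicit SVD for clarity. It is an illustration under stated assumptions, not the paper's implementation: the names `spectral_shape` and `linear_p_schedule`, the `eps` clamp, the linear form of the schedule, and the endpoint values `p_start=0.5` and `p_end=-0.1` are all hypothetical. The abstract states only that p moves from positive to mildly negative. Practical Muon implementations typically approximate the polar factor iteratively rather than through a full SVD.

```python
import numpy as np

def spectral_shape(M, p, eps=1e-8):
    """Return U @ diag(s**p) @ Vt, where M = U @ diag(s) @ Vt is the thin SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Clamp tiny singular values so a negative p cannot divide by zero.
    # (eps is an illustrative safeguard, not taken from the paper.)
    shaped = np.maximum(s, eps) ** p
    # Scaling the columns of U by `shaped` is equivalent to U @ diag(shaped).
    return (U * shaped) @ Vt

def linear_p_schedule(step, total_steps, p_start=0.5, p_end=-0.1):
    """Hypothetical schedule: interpolate p linearly from p_start to p_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))       # stand-in for an update matrix

polar = spectral_shape(M, 0.0)        # p = 0: the polar factor U V^T
unchanged = spectral_shape(M, 1.0)    # p = 1: the original update M
late = spectral_shape(M, linear_p_schedule(900, 1000))  # mildly negative p
```

With p = 0 every singular value is mapped to 1, so the result has orthonormal columns; with p = 1 the input is returned unchanged. A mildly negative p inverts the ordering of the singular values, which is the sense in which update strength is shifted toward directions that received the smallest updates.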