NorMuon: Making Muon more efficient and scalable

Abstract

The choice of optimizer significantly impacts the training efficiency and computational costs of large language models (LLMs). Recently, the Muon optimizer has demonstrated promising results by orthogonalizing parameter updates, improving optimization geometry through better conditioning. Despite Muon's emergence as a candidate successor to Adam, the potential for jointly leveraging their strengths has not been systematically explored. In this work, we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an optimizer that synergistically combines orthogonalization with neuron-level adaptive learning rates. Our analysis reveals that while Muon effectively reduces condition numbers, the resulting updates exhibit highly non-uniform neuron norms, causing certain neurons to dominate the optimization process. NorMuon addresses this imbalance by maintaining second-order momentum statistics for each neuron and applying row-wise normalization after orthogonalization, ensuring balanced parameter utilization while preserving Muon's conditioning benefits. To enable practical deployment at scale, we develop an efficient distributed implementation under the FSDP2 framework that strategically distributes orthogonalization computations across devices. Experiments across multiple model scales demonstrate that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in the 1.1B pretraining setting, while maintaining a memory footprint comparable to Muon's. Our findings suggest that orthogonalization and adaptive learning rates are complementary rather than competing approaches, opening new avenues for optimizer design in large-scale deep learning.
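The update described in the abstract (momentum, orthogonalization, then row-wise normalization driven by a per-neuron second-moment statistic) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the hyperparameter defaults, and the exact form of the normalization are assumptions made for illustration. The sketch orthogonalizes with an exact SVD, whereas Muon approximates this step with a Newton-Schulz iteration. A tall weight matrix is used so that the orthogonalized update visibly has unequal row norms.

```python
import numpy as np


def orthogonalize(matrix):
    """Return the nearest semi-orthogonal matrix U @ V^T via an exact SVD.

    Stand-in for Muon's Newton-Schulz iteration (assumption for clarity).
    """
    u, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return u @ vt


def normuon_like_step(weight, grad, momentum, row_second_moment,
                      lr=0.02, beta1=0.95, beta2=0.95, eps=1e-8):
    """One illustrative step; all defaults are hypothetical values."""
    # First-order momentum of the gradient, as in Muon.
    momentum = beta1 * momentum + (1.0 - beta1) * grad
    # Orthogonalize the momentum to improve conditioning of the update.
    ortho = orthogonalize(momentum)
    # Per-neuron (per-row) running average of the squared update entries.
    row_second_moment = (beta2 * row_second_moment
                         + (1.0 - beta2) * np.mean(ortho ** 2, axis=1))
    # Row-wise normalization: each neuron gets its own effective step size.
    update = ortho / (np.sqrt(row_second_moment)[:, None] + eps)
    weight = weight - lr * update
    return weight, momentum, row_second_moment, ortho, update


rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 4))   # 8 neurons (rows), 4 inputs
grad = rng.standard_normal((8, 4))     # synthetic gradient for the demo
momentum = np.zeros_like(weight)
row_second_moment = np.zeros(weight.shape[0])

weight, momentum, row_second_moment, ortho, update = normuon_like_step(
    weight, grad, momentum, row_second_moment)

ortho_row_norms = np.linalg.norm(ortho, axis=1)    # uneven across neurons
update_row_norms = np.linalg.norm(update, axis=1)  # balanced across neurons
print("row norms after orthogonalization:", np.round(ortho_row_norms, 3))
print("row norms after normalization:   ", np.round(update_row_norms, 3))
```

In this sketch the second-moment buffer stores one scalar per row rather than one per parameter, which is consistent with the abstract's claim that the memory footprint stays comparable to Muon's.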
