Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum

February 19, 2026
Authors: Minxin Zhang, Yuxuan Liu, Hayden Scheaffer
cs.AI

Abstract

Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability, Muon utilizes the weight layers' matrix structure via orthogonalized momentum, showing superior performance in large language model training. We propose a new optimizer and a diagonal extension, NAMO and NAMO-D, providing the first principled integration of orthogonalized momentum with norm-based Adam-type noise adaptation. NAMO scales orthogonalized momentum using a single adaptive stepsize, preserving orthogonality while improving upon Muon at negligible additional cost. NAMO-D instead right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. This design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure. Under standard assumptions, we establish optimal convergence rates for both algorithms in the deterministic setting and show that, in the stochastic setting, their convergence guarantees adapt to the noise level of stochastic gradients. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to the AdamW and Muon baselines, with NAMO-D achieving further gains over NAMO via an additional clamping hyperparameter that balances the competing goals of maintaining a well-conditioned update direction and leveraging fine-grained noise adaptation.
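
The structural difference between the two updates described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the formulas for the adaptive stepsize or the diagonal entries, so `alpha`, `d`, `lo`, and `hi` are hypothetical inputs here, and the orthogonalization uses an exact SVD for clarity, whereas Muon in practice approximates this factor with Newton-Schulz iterations. The sketch only shows that a scalar scale keeps all singular values equal, while a clamped diagonal right-multiplication rescales columns within fixed bounds.

```python
import numpy as np


def orthogonalize(momentum):
    """Return the orthogonal polar factor U @ Vt of a matrix (exact SVD)."""
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    return u @ vt


def scalar_scaled_update(momentum, alpha):
    """NAMO-like structure: one scalar stepsize times the orthogonalized momentum.

    `alpha` is a hypothetical placeholder for the adaptive stepsize.
    """
    return alpha * orthogonalize(momentum)


def diag_scaled_update(momentum, d, lo, hi):
    """NAMO-D-like structure: right-multiply by a diagonal matrix with clamped entries.

    `d` holds hypothetical per-column scales; `lo` and `hi` are hypothetical
    clamping bounds. Broadcasting over the last axis equals O @ diag(d_clamped).
    """
    d_clamped = np.clip(np.asarray(d, dtype=float), lo, hi)
    return orthogonalize(momentum) * d_clamped


rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))  # stand-in for a layer's momentum matrix

U = scalar_scaled_update(M, alpha=0.1)
V = diag_scaled_update(M, d=[0.01, 0.5, 1.0, 50.0], lo=0.25, hi=2.0)

print(np.linalg.svd(U, compute_uv=False))  # all singular values equal alpha
print(np.linalg.svd(V, compute_uv=False))  # singular values lie in [lo, hi]
```

In this sketch the condition number of the diagonally scaled update is at most `hi / lo`, which is one way to read the trade-off the abstract attributes to the clamping hyperparameter: tighter bounds keep the update direction well conditioned, looser bounds allow more per-neuron adaptation.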