Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum

Abstract

Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability, Muon utilizes the weight layers' matrix structure via orthogonalized momentum, showing superior performance in large language model training. We propose a new optimizer and a diagonal extension, NAMO and NAMO-D, providing the first principled integration of orthogonalized momentum with norm-based Adam-type noise adaptation. NAMO scales orthogonalized momentum using a single adaptive stepsize, preserving orthogonality while improving upon Muon at negligible additional cost. NAMO-D instead right-multiplies orthogonalized momentum by a diagonal matrix with clamped entries. This design enables neuron-wise noise adaptation and aligns with the common near block-diagonal Hessian structure. Under standard assumptions, we establish optimal convergence rates for both algorithms in the deterministic setting and show that, in the stochastic setting, their convergence guarantees adapt to the noise level of stochastic gradients. Experiments on pretraining GPT-2 models demonstrate improved performance of both NAMO and NAMO-D compared to the AdamW and Muon baselines, with NAMO-D achieving further gains over NAMO via an additional clamping hyperparameter that balances the competing goals of maintaining a well-conditioned update direction and leveraging fine-grained noise adaptation.
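
The difference between the two scalings described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's update rule: the abstract does not state how the adaptive stepsize or the diagonal entries are computed, so `alpha`, `d`, `lo`, and `hi` are hypothetical inputs here, and an exact SVD stands in for the approximate orthogonalization that Muon uses in practice. The sketch only shows the structural point: a single scalar keeps all singular values of the update equal, whereas right-multiplying by a clamped diagonal matrix lets them differ while bounding the condition number by `hi / lo`.

```python
import numpy as np


def orthogonalize(m: np.ndarray) -> np.ndarray:
    """Return the polar factor U @ V^T of m, computed with a thin SVD.

    Assumption: exact SVD is used for clarity only; it is a stand-in for
    whatever approximate orthogonalization an optimizer would use.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def scalar_scaled_update(o: np.ndarray, alpha: float) -> np.ndarray:
    """Scale the orthogonalized momentum by one scalar (hypothetical alpha).

    All singular values of the result equal |alpha|, so the update stays
    orthogonal up to a constant factor.
    """
    return alpha * o


def diagonal_scaled_update(
    o: np.ndarray, d: np.ndarray, lo: float, hi: float
) -> np.ndarray:
    """Right-multiply o by diag(clip(d, lo, hi)) (hypothetical d, lo, hi).

    Broadcasting scales column j of o by the clamped entry d[j]. When o has
    orthonormal columns and 0 < lo <= hi, the singular values of the result
    are the clamped entries, so its condition number is at most hi / lo.
    """
    d_clamped = np.clip(d, lo, hi)
    return o * d_clamped


# Small demonstration on a random 6x4 "momentum" matrix.
rng = np.random.default_rng(0)
momentum = rng.standard_normal((6, 4))
o = orthogonalize(momentum)

scalar_step = scalar_scaled_update(o, alpha=0.1)
diag_step = diagonal_scaled_update(
    o, d=np.array([0.01, 0.5, 1.0, 50.0]), lo=0.1, hi=2.0
)

print(np.linalg.svd(scalar_step, compute_uv=False))
print(np.linalg.svd(diag_step, compute_uv=False))
```

In this toy setting, widening the interval `[lo, hi]` allows more per-column variation at the price of a worse-conditioned update, which is the trade-off the abstract attributes to the clamping hyperparameter.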
