DynMuon: A Dynamic Spectral Shaping Perspective on Muon

Abstract

In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. Its essential difference from standard gradient descent methods is that it replaces the usual update matrix M = UΣV^⊤ with its polar factor UV^⊤. In this work, we consider a class of Muon-like updates in which the update M is replaced with UΣ^p V^⊤ for some parameter p. We call this a "spectral-shaping" operation, and we develop a theory of how to choose p that depends on (a) the local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) the training stage. Our theory and experiments reveal a previously overlooked behavior: positive p helps early in training by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative p helps later by reallocating update strength toward low-curvature directions that still contain useful training signal. Building on this insight, we propose DynMuon, an efficient dynamic spectral-shaping method that schedules p from positive to mildly negative over the course of training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6% to 26.5% fewer steps to reach the same target loss.
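To make the effect of p concrete, the sketch below applies the map σ → σ^p to a short list of singular values, which is the only part of UΣ^p V^⊤ that p changes. It is an illustration under stated assumptions, not the paper's implementation: the function names, the linear schedule, the endpoint values 0.5 and -0.1, and the sample singular values are all hypothetical, and the singular values are assumed to be strictly positive.

```python
def shape_spectrum(singular_values, p):
    """Map each singular value s to s**p, the spectral part of U Σ^p V^T.

    Assumes every singular value is strictly positive, since a zero value
    raised to a negative p is undefined.
    """
    return [s ** p for s in singular_values]


def p_schedule(step, total_steps, p_start=0.5, p_end=-0.1):
    """Hypothetical linear schedule from a positive p to a mildly negative p.

    The endpoints are illustrative defaults, not values from the paper.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


# Illustrative singular values, largest first.
sigmas = [4.0, 1.0, 0.25]

for step in (0, 500, 1000):
    p = p_schedule(step, total_steps=1000)
    shaped = shape_spectrum(sigmas, p)
    print(step, round(p, 3), [round(s, 3) for s in shaped])
```

Two reference points are worth noting: p = 1 leaves the spectrum unchanged (the usual update M), and p = 0 sets every singular value to 1 (the polar factor UV^⊤ used by Muon). A positive p keeps the original ordering of the singular values, while a negative p reverses it, so directions with small singular values receive relatively more weight.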