DynMuon: A Dynamic Spectral-Shaping Perspective on Muon

Abstract

In recent years, Muon has emerged as the dominant method for training large language models and, more broadly, transformers. Its essential difference from standard gradient descent methods is that it replaces the usual update matrix, written via its singular value decomposition as \(M = U \Sigma V^\top\), with its polar factor \(UV^\top\). In this work, we consider a class of Muon-like updates that replace \(M\) with \(U \Sigma^p V^\top\) for some parameter \(p\) (so \(p = 1\) leaves the update unchanged and \(p = 0\) recovers Muon's polar factor). We call this a "spectral-shaping" operation and develop a theory of how to choose \(p\) as a function of (a) the local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) the training stage. Our theory and experiments reveal a previously overlooked behavior: positive \(p\) helps early in training by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative \(p\) helps later by reallocating update strength toward low-curvature directions that still contain useful training signal. Building on this insight, we propose DynMuon, an efficient dynamic spectral-shaping method that schedules \(p\) from positive to mildly negative over the course of training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6% to 26.5% fewer steps to reach the same target loss.
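As a rough illustration of the spectral-shaping update described above, the following NumPy sketch computes \(U \Sigma^p V^\top\) from an explicit SVD and pairs it with a schedule that moves \(p\) from positive to mildly negative. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names `spectral_shape` and `linear_p_schedule`, the linear form of the schedule, the endpoint values `p_start=0.5` and `p_end=-0.1`, and the `eps` clamp are all hypothetical choices made here for illustration. An explicit SVD is used only for clarity and is not claimed to be the efficient method the abstract refers to.

```python
import numpy as np


def spectral_shape(M, p, eps=1e-12):
    """Return U @ diag(s**p) @ V^T, where M = U @ diag(s) @ V^T is the thin SVD.

    p = 1 reproduces M, and p = 0 gives the polar factor U @ V^T
    (for a full-rank M).
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Assumption: clamp tiny singular values so that a negative p
    # cannot produce infinities. This safeguard is not from the source.
    s = np.maximum(s, eps)
    # Scaling the columns of U by s**p is equivalent to U @ diag(s**p).
    return (U * s**p) @ Vt


def linear_p_schedule(step, total_steps, p_start=0.5, p_end=-0.1):
    """Hypothetical linear schedule for p, from positive to mildly negative.

    The endpoints and the linear shape are illustrative assumptions only.
    """
    t = step / max(total_steps - 1, 1)
    t = min(max(t, 0.0), 1.0)  # keep the interpolation weight in [0, 1]
    return p_start + t * (p_end - p_start)
```

In a training loop, one would call `spectral_shape(M, linear_p_schedule(step, total_steps))` on each matrix-shaped update `M` before applying it. With a negative `p`, the smallest singular values of `M` become the largest in the shaped update, which is one way to read the abstract's "reallocating update strength" toward directions that the raw update weights least.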