DynMuon: A Dynamic Spectral Shaping Perspective on Muon

Abstract

In recent years, Muon has emerged as the dominant method for training large language models and transformers more broadly. Its essential difference from standard gradient descent methods is that it replaces the usual update matrix \( M = U\Sigma V^\top \) with its polar factor \( UV^\top \). In this work, we consider a class of Muon-like updates in which \( M \) is replaced with \( U\Sigma^p V^\top \) for a tunable parameter \( p \). We call this operation "spectral shaping" and develop a theory for choosing \( p \) based on (a) the local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) the training stage. Our theory and experiments reveal a previously overlooked behavior: positive \( p \) helps early in training by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative \( p \) helps later by reallocating update strength toward low-curvature directions that still contain useful training signal. Building on this insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules \( p \) from positive to mildly negative over the course of training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon while requiring 10.6% to 26.5% fewer steps to reach the same target loss.
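To make the spectral-shaping operation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It computes \( U\Sigma^p V^\top \) from an explicit SVD, so that \( p = 1 \) returns \( M \) itself and \( p = 0 \) returns the polar factor \( UV^\top \) used by Muon. The function names, the `eps` clamp, the linear schedule, and the endpoint values `p_start=0.5` and `p_end=-0.1` are all illustrative assumptions; the abstract says only that \( p \) moves from positive to mildly negative. Practical Muon implementations typically approximate the polar factor with a Newton-Schulz iteration rather than an SVD, and an efficient method would presumably avoid the explicit SVD as well.

```python
import numpy as np


def spectral_shape(M, p, eps=1e-12):
    """Return U diag(sigma**p) V^T, where M = U diag(sigma) V^T is the thin SVD."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    # Clamp singular values so that a negative p cannot divide by zero
    # (the clamp is an assumption of this sketch).
    shaped = np.maximum(sigma, eps) ** p
    # Scale the columns of U by the shaped singular values, then recombine.
    return (U * shaped) @ Vt


def linear_p_schedule(step, total_steps, p_start=0.5, p_end=-0.1):
    """Hypothetical schedule: interpolate p linearly from p_start to p_end."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


# Small demonstration on a random "update" matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))
U, sigma, Vt = np.linalg.svd(M, full_matrices=False)

polar = spectral_shape(M, 0.0)        # p = 0: polar factor U V^T (Muon-style)
unchanged = spectral_shape(M, 1.0)    # p = 1: recovers M
late_update = spectral_shape(M, linear_p_schedule(100, 100))  # mildly negative p
```

With a negative exponent, smaller singular values receive larger weights, which is the sense in which a mildly negative \( p \) shifts update strength toward directions that the raw update matrix weights least.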