Revisiting Muon Beyond Pretraining: Spectral Failures in VLA and RLVR and High-Pass Remedies

Abstract

Muon is a matrix-aware optimizer that uses Newton-Schulz (NS) iterations to orthogonalize the gradient spectrum, driving all singular values of the momentum matrix toward 1. This uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, but we show that it can be fundamentally limiting in two regimes beyond pretraining: (i) cross-modality vision-language-action (VLA) training, where action-module gradients are inherently low-rank, so whitening amplifies noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization from prior training make whitening unstable. To address these challenges, we propose Pion, a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion+Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect: it anchors dominant singular values at 1 and suppresses noisy tail components toward 0, with controllable filter strength. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently to each attention head via a simple reshape, at no extra cost. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines (AdamW and Muon) on ℓ1-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures. For example, with VLA-Adapter it reaches a 100% success rate on LIBERO Object after 1,500 training steps, versus 97.0% for Muon and only 32.2% for AdamW. The advantage of Pion further extends to a real Franka Research 3 robot with a π0.5 backbone under the DROID setup on three grasp-and-place tasks. In RLVR post-training of Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K, while Muon collapses to zero.
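
The spectral behavior described above can be illustrated with a small NumPy sketch. It is not the Pion implementation, and it does not reproduce the high-pass NS iteration, whose coefficients are not given in this abstract. It makes the following assumptions: `newton_schulz` uses the textbook cubic iteration rather than Muon's tuned polynomial and step count; `ideal_high_pass` is a hypothetical SVD-based reference showing only the target spectrum (dominant values at 1, tail at 0), with a made-up `rel_threshold` parameter; `newton_schulz_per_head` shows one plausible reading of "per-head mode via a simple reshape". All function names and the toy matrices are illustrative.

```python
import numpy as np

def newton_schulz(G, steps=50):
    """Textbook cubic Newton-Schulz iteration on the last two axes of G.

    Assumption: this is a generic orthogonalization sketch, not Muon's
    tuned polynomial or its usual small number of steps.
    """
    # Frobenius normalization puts every singular value in (0, 1].
    X = G / np.linalg.norm(G, axis=(-2, -1), keepdims=True)
    for _ in range(steps):
        # Each singular value s maps to 1.5*s - 0.5*s**3, which has an
        # attracting fixed point at s = 1.
        X = 1.5 * X - 0.5 * (X @ np.swapaxes(X, -1, -2)) @ X
    return X

def ideal_high_pass(G, rel_threshold=0.1):
    """Hypothetical SVD reference for a spectral high-pass target.

    Singular values at or above rel_threshold * (largest) are set to 1 and
    the rest to 0. This is NOT Pion's iteration, only the limiting spectrum.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    keep = (s >= rel_threshold * s[0]).astype(G.dtype)
    return (U * keep) @ Vt

def newton_schulz_per_head(G, num_heads):
    """Apply the iteration independently to each head's block of rows."""
    rows, cols = G.shape
    heads = G.reshape(num_heads, rows // num_heads, cols)
    return newton_schulz(heads).reshape(rows, cols)

rng = np.random.default_rng(0)

# Toy low-rank "gradient": a rank-2 signal plus a small noise tail.
U, _ = np.linalg.qr(rng.standard_normal((8, 8)))
V, _ = np.linalg.qr(rng.standard_normal((16, 16)))
signal = U[:, :2] @ np.diag([10.0, 8.0]) @ V[:, :2].T
G = signal + 0.01 * rng.standard_normal((8, 16))

sv_before = np.linalg.svd(G, compute_uv=False)
sv_after = np.linalg.svd(newton_schulz(G), compute_uv=False)
sv_ideal = np.linalg.svd(ideal_high_pass(G), compute_uv=False)

# Toy attention projection: 4 heads, head dimension 8, model width 64.
W = rng.standard_normal((4 * 8, 64))
per_head = newton_schulz_per_head(W, num_heads=4)
head_sv = np.linalg.svd(per_head.reshape(4, 8, 64), compute_uv=False)

print("before whitening:", np.round(sv_before, 3))
print("after whitening: ", np.round(sv_after, 3))
print("high-pass target:", np.round(sv_ideal, 3))
```

In this toy setting, whitening lifts the noise tail to the same scale as the two signal directions, which is the amplification effect the abstract attributes to low-rank action-module gradients. The SVD reference keeps only the two dominant directions.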