Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

Abstract

Muon is a matrix-aware optimizer that uses Newton-Schulz (NS) iterations to orthogonalize the gradient spectrally, driving all singular values of the momentum matrix toward 1. This uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, but we show that it can introduce fundamental limitations beyond pretraining in two regimes: (i) cross-modality vision-language-action (VLA) training, where action-module gradients are inherently low-rank, so whitening amplifies noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization from prior training make whitening unstable. To address these challenges, we propose Pion, a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion+Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect with controllable filter strength: dominant singular values are anchored at 1, while noisy tail components are suppressed toward 0. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently to each attention head via a simple reshape, at no extra cost. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines (Muon and AdamW) across ℓ1-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures. For example, with VLA-Adapter it reaches a 100% success rate on LIBERO Object after 1,500 training steps, versus 97.0% for Muon and only 32.2% for AdamW. The advantage of Pion extends to a real Franka Research 3 robot with a π0.5 backbone under the DROID setup on three grasp-and-place tasks. In RLVR post-training of Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K, while Muon collapses to zero.
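
To make the whitening behavior concrete, the following is a minimal NumPy sketch of spectral orthogonalization with the classical cubic Newton-Schulz iteration, together with a per-head variant that uses a reshape. It is an illustration under stated assumptions, not the authors' implementation: the function names, the step count, the cubic polynomial (practical Muon implementations typically use a tuned higher-order polynomial with only a few steps), and the layout in which attention heads are stacked along the rows are all assumptions. The high-pass NS iteration of Pion is not specified in the abstract and is therefore not reproduced here.

```python
import numpy as np


def newton_schulz(G, steps=30, eps=1e-7):
    """Approximately orthogonalize the last two axes of G.

    Uses the classical cubic Newton-Schulz iteration (an assumption made
    for clarity; not the tuned polynomial used in practice).
    """
    # Divide by the Frobenius norm so every singular value lies in (0, 1].
    norm = np.linalg.norm(G, axis=(-2, -1), keepdims=True)
    X = G / (norm + eps)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3,
        # which pushes every nonzero s toward 1.
        X = 1.5 * X - 0.5 * X @ np.swapaxes(X, -1, -2) @ X
    return X


def per_head_newton_schulz(G, num_heads, steps=30):
    """Orthogonalize each attention head's block independently.

    Assumes (hypothetically) that heads are stacked along the rows of G,
    so G has shape (num_heads * head_dim, cols).
    """
    rows, cols = G.shape
    head_dim = rows // num_heads
    heads = G.reshape(num_heads, head_dim, cols)  # one matrix per head
    return newton_schulz(heads, steps).reshape(rows, cols)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Build a 6x4 matrix with two dominant and two small "tail" singular values.
    U, _ = np.linalg.qr(rng.standard_normal((6, 4)))
    V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    G = U @ np.diag([10.0, 5.0, 0.05, 0.01]) @ V.T

    print("before:", np.linalg.svd(G, compute_uv=False))
    # After whitening, the tail values are lifted toward 1 like the dominant ones.
    print("after: ", np.linalg.svd(newton_schulz(G), compute_uv=False))
```

In this toy example the two small singular values are raised to the same level as the dominant ones, which is the uniform whitening the abstract identifies as problematic when the tail is mostly noise. The per-head variant only reshapes the matrix and reuses the same batched matrix products, which is consistent with the abstract's claim that the per-head mode adds no extra cost.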