Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

May 19, 2026
Authors: Chongyu Fan, Gaowen Liu, Mingyi Hong, Ramana Rao Kompella, Sijia Liu
cs.AI

Abstract

Muon is a matrix-aware optimizer that uses Newton-Schulz (NS) iterations to orthogonalize the gradient spectrum, driving all singular values of the momentum matrix toward 1. This uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, but we show that it can impose fundamental limitations beyond pretraining in two regimes: (i) cross-modality vision-language-action (VLA) training, where action-module gradients are inherently low-rank and whitening amplifies their noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization from prior training make whitening unstable. To address these challenges, we propose Pion, a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion+Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect: dominant singular values are anchored at 1, noisy tail components are suppressed toward 0, and the filter strength is controllable. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently to each attention head via a simple reshape, at no extra cost. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines (Muon and AdamW) across L1-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures. For example, with VLA-Adapter it reaches a 100% success rate on LIBERO Object after 1,500 training steps, versus 97.0% for Muon and only 32.2% for AdamW. The advantage of Pion extends to a real Franka Research 3 robot with a π0.5 backbone under the DROID setup on three grasp-and-place tasks. In RLVR post-training of Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K, while Muon collapses to zero.
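
The spectral behaviour described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation, and the abstract does not specify the high-pass NS iteration, so Pion itself is not reproduced here. The sketch assumes the textbook cubic Newton-Schulz iteration as a stand-in for Muon's whitening step, and uses an explicit SVD with a hard threshold only as an idealised reference for what a spectral high-pass target looks like. The function names, the `cutoff` parameter, the step count, and the choice of the row axis as the head axis are all hypothetical.

```python
import numpy as np

def newton_schulz_whiten(G, steps=40):
    """Cubic Newton-Schulz: push every nonzero singular value toward 1.

    Accepts one matrix or a batch of matrices (leading batch axes).
    Assumption: the textbook cubic iteration, not Muon's tuned polynomial.
    """
    G = np.asarray(G, dtype=np.float64)
    # Frobenius normalisation puts all singular values in (0, 1],
    # which lies inside the convergence region of the iteration.
    X = G / (np.linalg.norm(G, axis=(-2, -1), keepdims=True) + 1e-12)
    for _ in range(steps):
        Xt = np.swapaxes(X, -1, -2)
        X = 1.5 * X - 0.5 * (X @ Xt @ X)
    return X

def per_head_whiten(G, n_heads, steps=40):
    """Whiten each attention head's row block independently via a reshape.

    Assumption: heads are stacked along the row axis of G.
    """
    rows, cols = G.shape
    heads = G.reshape(n_heads, rows // n_heads, cols)
    return newton_schulz_whiten(heads, steps).reshape(rows, cols)

def ideal_high_pass(G, cutoff=0.1):
    """Idealised reference only: keep directions whose singular value is at
    least cutoff * largest (mapped to 1) and zero out the rest, via SVD."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    keep = (s >= cutoff * s.max()).astype(np.float64)
    return (U * keep) @ Vt

# A 3x5 "gradient" with two dominant directions and one tiny tail direction.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
V, _ = np.linalg.qr(rng.standard_normal((5, 3)))
sigma = np.array([1.0, 0.5, 1e-3])
G = U @ np.diag(sigma) @ V.T

sv_whitened = np.linalg.svd(newton_schulz_whiten(G), compute_uv=False)
sv_filtered = np.linalg.svd(ideal_high_pass(G), compute_uv=False)

# Per-head mode on a 6x5 update treated as 3 heads of 2 rows each.
H = rng.standard_normal((6, 5))
per_head = per_head_whiten(H, n_heads=3)
```

In this toy setup, `sv_whitened` shows the tail direction lifted to the same magnitude as the dominant ones, which is the amplification the abstract attributes to uniform whitening on low-rank gradients, whereas `sv_filtered` keeps the two dominant directions at 1 and drops the tail to 0. The SVD route is used purely for clarity; the abstract states that Pion achieves its high-pass effect with an NS-style iteration at Muon's computational cost.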