Rethinking Muon: Spectral Failures Beyond Pretraining and a High-Pass Remedy for VLA and RLVR

Abstract

Muon is a matrix-aware optimizer that uses Newton-Schulz (NS) iterations to orthogonalize the gradient spectrum, driving all singular values of the momentum matrix toward 1. This uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, but we show that it can cause fundamental limitations beyond pretraining in two regimes: (i) cross-modality vision-language-action (VLA) training, where action-module gradients are inherently low-rank, so whitening amplifies their noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization from prior training make whitening unstable. To address these challenges, we propose Pion, a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion+Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect with controllable filter strength: dominant singular values are anchored at 1, while noisy tail components are suppressed toward 0. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently across attention heads via a simple reshape, at no extra cost. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines across l_1-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures. For example, with VLA-Adapter it reaches a 100% success rate on LIBERO Object after 1,500 training steps, compared with 97.0% for Muon and only 32.2% for AdamW. The advantage of Pion further extends to a real Franka Research 3 robot with a pi_0.5 backbone under the DROID setup on three grasp-and-place tasks. In RLVR post-training of Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K, while Muon collapses to zero.
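
To make the spectral picture in the abstract concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation: the abstract does not specify the Promotion+Suppression polynomial, so the high-pass target is illustrated here with an explicit SVD threshold (the hypothetical helper `svd_high_pass`) and not with an NS-style iteration. The whitening step uses the classical cubic Newton-Schulz iteration in place of whatever tuned polynomial a given Muon implementation uses. The per-head helper assumes that heads occupy contiguous row blocks of the matrix. The test matrix, its singular values, and all function names are illustrative.

```python
import numpy as np


def newton_schulz_orthogonalize(m, steps=40):
    """Classical cubic Newton-Schulz: drives every nonzero singular value toward 1."""
    # The Frobenius norm upper-bounds the spectral norm, so after this
    # normalization all singular values lie in (0, 1], inside the convergence region.
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5 * s - 0.5 * s**3.
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x


def svd_high_pass(m, rel_threshold=0.1):
    """Reference high-pass target via explicit SVD (assumption: hard threshold).

    Singular values >= rel_threshold * sigma_max become 1, the rest become 0.
    """
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    keep = (s >= rel_threshold * s[0]).astype(m.dtype)
    return (u * keep) @ vt


def per_head(fn, m, num_heads):
    """Apply fn to each head's row block (assumed layout: num_heads * head_dim rows)."""
    heads = m.reshape(num_heads, -1, m.shape[1])
    return np.stack([fn(h) for h in heads]).reshape(m.shape)


# Synthetic "gradient" with two dominant directions and a small noisy tail.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((8, 8)))    # orthogonal 8 x 8
v, _ = np.linalg.qr(rng.standard_normal((16, 8)))   # orthonormal columns, 16 x 8
sigma = np.array([5.0, 3.0, 0.05, 0.04, 0.03, 0.02, 0.01, 0.005])
grad = (u * sigma) @ v.T                            # shape (8, 16)

whitened = newton_schulz_orthogonalize(grad)        # tail is amplified to ~1
filtered = svd_high_pass(grad)                      # tail is suppressed to ~0
headwise = per_head(newton_schulz_orthogonalize, grad, num_heads=2)

print("whitened spectrum:", np.round(np.linalg.svd(whitened, compute_uv=False), 3))
print("filtered spectrum:", np.round(np.linalg.svd(filtered, compute_uv=False), 3))
```

In this toy setting, uniform whitening lifts the smallest tail component (0.005) to the same scale as the dominant directions, which is the amplification effect the abstract attributes to low-rank, noisy gradients. The thresholded variant keeps only the two dominant directions. The per-head call orthogonalizes each head's block separately, so no head's update is mixed with another's.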