Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Summary
This paper introduces Pion, an optimizer that replaces Muon's uniform spectral whitening with a high-pass Newton-Schulz (NS) iteration to stabilize training in low-rank and low-SNR regimes, improving results in vision-language-action (VLA) training and reinforcement learning with verifiable rewards (RLVR).
Cached at: 05/25/26, 06:36 AM
Source: https://huggingface.co/papers/2605.19282
Abstract
Pion replaces the spectral whitening that Muon uses in LLM pretraining with a high-pass NS iteration, which stabilizes training in low-rank and low-SNR regimes while maintaining computational efficiency and supporting per-head updates.
Muon is a matrix-aware optimizer that leverages Newton-Schulz (NS) iterations to enforce spectral gradient orthogonalization by driving all singular values of the momentum matrix toward 1. While this uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, we show it could lead to fundamental limitations beyond pretraining in two regimes: (i) cross-modality vision-language-action (VLA) training, where inherently low-rank action-module gradients cause amplification of noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization from prior training make whitening unstable. To address these challenges, we propose Pion, a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion+Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect, anchoring dominant singular values at 1 while suppressing noisy tail components toward 0, with controllable filter strength. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently across attention heads via a simple reshape, at no extra cost. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines across ℓ1-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures, e.g., reaching 100% success rate on LIBERO Object after 1,500 training steps with VLA-Adapter, vs. 97.0% for Muon and only 32.2% for AdamW. The advantage of Pion further extends to a real Franka Research 3 robot with a π0.5 backbone under the DROID setup on three grasp-and-place tasks. In RLVR post-training on Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K while Muon collapses to zero.
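The whitening behavior described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's code. `newton_schulz` follows the quintic iteration and coefficients of the public Muon reference implementation; `ideal_high_pass` is a hypothetical SVD-based stand-in for the target spectrum the abstract describes (dominant singular values at 1, tail at 0) and is not Pion's Promotion+Suppression iteration, whose details are not given on this page; `per_head` assumes that heads occupy contiguous row blocks of the matrix, which may not match the paper's layout.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration in the style of the Muon reference code."""
    a, b, c = 3.4445, -4.7750, 2.0315      # coefficients used by public Muon code
    X = G / (np.linalg.norm(G) + eps)      # Frobenius scaling: singular values <= 1
    tall = X.shape[0] > X.shape[1]
    if tall:
        X = X.T                            # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if tall else X

def ideal_high_pass(G, rel_threshold=0.1):
    """Hypothetical SVD reference: keep singular values >= rel_threshold * max
    and set them to 1; set the rest to 0. Illustrates the target, not Pion."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    keep = (s >= rel_threshold * s[0]).astype(G.dtype)
    return (U * keep) @ Vt

def per_head(update_fn, G, num_heads):
    """Apply update_fn to each head separately via a reshape.
    Assumption: rows of G are grouped into contiguous per-head blocks."""
    blocks = G.reshape(num_heads, G.shape[0] // num_heads, G.shape[1])
    return np.stack([update_fn(block) for block in blocks]).reshape(G.shape)

# Synthetic low-rank "gradient": 4 dominant directions plus a weak noisy tail.
rng = np.random.default_rng(0)
n, rank = 32, 4
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s_in = np.concatenate([np.ones(rank), np.full(n - rank, 0.01)])
G = (U * s_in) @ V.T

s_whitened = np.linalg.svd(newton_schulz(G), compute_uv=False)
s_filtered = np.linalg.svd(ideal_high_pass(G), compute_uv=False)
G_heads = per_head(newton_schulz, G, num_heads=4)

print("input tail/top ratio:   ", s_in[-1] / s_in[0])
print("whitened min/max:       ", s_whitened.min(), s_whitened.max())
print("high-pass nonzero count:", int(np.sum(s_filtered > 0.5)))
```

In this toy setting the whitened spectrum ends up with every singular value of order 1, so the tail directions that started 100 times weaker than the dominant ones receive comparable weight in the update. The idealized filter keeps only the 4 dominant directions. The threshold `rel_threshold` is an illustrative knob and should not be read as the paper's filter-strength parameter.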
Get this paper in your agent:
hf papers read 2605.19282
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
This paper introduces Pion, a novel spectrum-preserving optimizer for large language model training that uses orthogonal equivalence transformations to maintain singular values during weight updates, offering stable performance comparable to standard optimizers.
Can Muon Fine-tune Adam-Pretrained Models?
Research paper investigating performance degradation when using the Muon optimizer instead of Adam for fine-tuning pretrained models, demonstrating that parameter-efficient methods like LoRA effectively mitigate this optimizer mismatch across language and vision tasks.
MuCon: Clipped Muon Updates for LLM Training
This paper introduces MuCon, a clipped-Muon optimizer for LLM training that applies singular-value clipping instead of full polarization, preserving smaller singular values while clipping only the largest ones. It explores approximations to avoid full SVD, including polar/absolute-value formulas and rational Newton filters, noting numerical challenges near the threshold.
Anytime Training with Schedule-Free Spectral Optimization
This paper introduces SF-NorMuon, a schedule-free spectral optimizer that matches or exceeds tuned AdamW on language models up to 772M parameters, with theoretical guarantees for stationarity and long-horizon stability.
Spectral Scaling Laws of Muon
This paper presents the first systematic study of singular value spectral behavior in Muon optimizer momentum matrices during LLM training, discovering clean power-law scaling relationships across model sizes (77M–2.8B parameters). The findings provide practitioners with principled, layer-aware guidelines for configuring Newton–Schulz iterations to maintain orthonormalization quality at frontier scale without unnecessary computation.