Why Muon Outperforms Adam: A Curvature Perspective
Summary
This paper investigates why the Muon optimizer outperforms Adam in large language model training, showing from a curvature perspective that Muon incurs a smaller curvature penalty due to lower normalized directional sharpness, with advantages amplified by data imbalance.
Source: https://huggingface.co/papers/2606.04662
Abstract
Muon outperforms Adam in large language model training by reducing curvature penalties through lower normalized directional sharpness, particularly in middle and late training stages, with advantages amplified by data imbalance and heterogeneous curvature.
Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon’s superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon’s smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon’s NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon’s NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon’s lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
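The decomposition described in the abstract can be illustrated on a toy quadratic. The sketch below is not the paper's code. It assumes NDS is the Rayleigh quotient d^T H d / ||d||^2 of the update d under the Euclidean norm, so that the curvature penalty equals 0.5 * ||d||^2 * NDS; the abstract does not state the exact definition or norm. The function names, the diagonal Hessian, and the "balanced" sign-style step are hypothetical stand-ins chosen for illustration. In particular, the balanced step is not Muon; it only mimics the idea of spreading update energy evenly across curvature groups.

```python
import numpy as np

def quadratic_terms(grad, hessian, update):
    """Split the second-order Taylor model of L(w + d) - L(w) into its parts.

    Assumed definition (not confirmed by the abstract):
        NDS = d^T H d / ||d||^2, penalty = 0.5 * ||d||^2 * NDS.
    """
    first_order = float(grad @ update)                 # g^T d, negative for a descent step
    sq_norm = float(update @ update)                   # ||d||^2
    nds = float(update @ hessian @ update) / sq_norm   # curvature along the update direction
    penalty = 0.5 * sq_norm * nds                      # equals 0.5 * d^T H d
    return first_order, sq_norm, nds, penalty

def loss(w, hessian):
    """Toy quadratic loss 0.5 * w^T H w."""
    return 0.5 * float(w @ hessian @ w)

# Heterogeneous curvature: one sharp mode, three flat ones (illustrative values).
H = np.diag([100.0, 1.0, 1.0, 1.0])
w = np.ones(4)
g = H @ w  # gradient of the toy loss; dominated by the sharp mode

# Plain gradient-descent step: most of its energy lands on the sharp mode.
gd_step = -0.01 * g

# Balanced step: equal magnitude per coordinate, rescaled to the same norm
# as the GD step so only the direction (and hence NDS) differs.
balanced_step = -np.sign(g)
balanced_step *= np.linalg.norm(gd_step) / np.linalg.norm(balanced_step)

gd = quadratic_terms(g, H, gd_step)
bal = quadratic_terms(g, H, balanced_step)

for name, (first, sq, nds, pen) in [("gd", gd), ("balanced", bal)]:
    print(f"{name:9s} first_order={first:.4f} sq_norm={sq:.4f} "
          f"nds={nds:.4f} penalty={pen:.4f}")
```

With matched update norms, any difference in the curvature penalty comes from NDS alone, which is the comparison the abstract draws between Muon and Adam. The toy does not reproduce the paper's one-step loss comparison or its first-order gains; it only shows how the quantities relate.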
Similar Articles
Can Muon Fine-tune Adam-Pretrained Models?
Research paper investigating performance degradation when using the Muon optimizer instead of Adam for fine-tuning pretrained models, demonstrating that parameter-efficient methods like LoRA effectively mitigate this optimizer mismatch across language and vision tasks.
Muon is Not That Special: Random or Inverted Spectra Work Just as Well
This paper challenges the geometric justification for the Muon optimizer, arguing that precise structure is less important than step-size optimality. It introduces Freon and Kaon optimizers to demonstrate that random or inverted spectra can perform as well as Muon.
How Much Orthogonalization Does Muon Need?
This paper studies how much orthogonalization the Muon optimizer requires, proposing a five-step cubic Newton-Schulz schedule that reduces computational cost while achieving training quality similar to more expensive methods across GPT-2 Small and hybrid MoE/Mamba models.
Spectral Scaling Laws of Muon
This paper presents the first systematic study of singular value spectral behavior in Muon optimizer momentum matrices during LLM training, discovering clean power-law scaling relationships across model sizes (77M–2.8B parameters). The findings provide practitioners with principled, layer-aware guidelines for configuring Newton–Schulz iterations to maintain orthonormalization quality at frontier scale without unnecessary computation.
SignMuon: Communication-Efficient Distributed Muon Optimization
SignMuon is a 1-bit, matrix-aware optimizer for distributed training that combines signSGD's majority-vote sign aggregation with Muon's polar-step framework, achieving 32x bandwidth reduction over float32 while maintaining strong convergence and performance on benchmarks like CIFAR-10/ResNet-50 and nanoGPT.