Why Muon Outperforms Adam: A Curvature Perspective

Hugging Face Daily Papers 06/03/26, 12:00 AM Papers

optimizer muon adam curvature large-language-model training-efficiency

Summary

This paper investigates why the Muon optimizer outperforms Adam in large language model training, showing from a curvature perspective that Muon incurs a smaller curvature penalty due to lower normalized directional sharpness, with advantages amplified by data imbalance.
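The two quantities named in this summary, the curvature penalty and normalized directional sharpness (NDS), can be written out explicitly. The following is a sketch under assumptions rather than the paper's exact formulation: the symbols \theta, \Delta, g, and H are notation chosen here, the norm is taken to be Euclidean (Frobenius for matrix parameters), and the factor 1/2 is kept outside NDS. The paper may normalize differently.

```latex
% Assumed notation (not from the source): parameters \theta, update \Delta,
% gradient g = \nabla L(\theta), Hessian H = \nabla^2 L(\theta).
\begin{align}
L(\theta + \Delta) &\approx L(\theta) + g^\top \Delta + \tfrac{1}{2}\,\Delta^\top H \Delta, \\
L(\theta) - L(\theta + \Delta) &\approx
  \underbrace{-\,g^\top \Delta}_{\text{first-order gain}}
  \;-\; \underbrace{\tfrac{1}{2}\,\Delta^\top H \Delta}_{\text{curvature penalty}}, \\
\tfrac{1}{2}\,\Delta^\top H \Delta &= \tfrac{1}{2}\,\|\Delta\|^2 \cdot \mathrm{NDS}(\Delta),
\qquad
\mathrm{NDS}(\Delta) = \frac{\Delta^\top H \Delta}{\|\Delta\|^2}.
\end{align}
```

Under this reading, two optimizers with comparable first-order gains and comparable update norms can differ in one-step loss decrease only through NDS, that is, through how much of the update falls along high-curvature directions.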

Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
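The claim about stylized quadratic problems can be illustrated with a toy experiment. The sketch below is not the paper's setup. It assumes a quadratic loss L(W) = 0.5 * tr(W^T A W) with a diagonal curvature matrix A that has two stiff modes and six flat ones, it uses an exact SVD in place of the Newton-Schulz iteration that practical Muon implementations use, it omits momentum, and it rescales the gradient-descent step to the same Frobenius norm as the orthogonalized step so that only the direction differs. NDS is taken to be the Rayleigh quotient of the Hessian along the update. All function names are hypothetical.

```python
import numpy as np


def loss(A, W):
    """Toy quadratic loss L(W) = 0.5 * tr(W^T A W) (an assumed test problem)."""
    return 0.5 * float(np.sum(W * (A @ W)))


def quad_form(A, D):
    """Directional curvature tr(D^T A D), i.e. D^T H D for the loss above."""
    return float(np.sum(D * (A @ D)))


def nds(A, D):
    """Assumed NDS: curvature along D divided by the squared Frobenius norm of D."""
    return quad_form(A, D) / float(np.sum(D * D))


def orthogonalize(G):
    """Muon-style direction: set every singular value of G to 1 (exact SVD here)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def taylor_terms(A, W, D):
    """First-order gain and curvature penalty of the step W -> W + D."""
    G = A @ W  # gradient of the quadratic loss
    gain = -float(np.sum(G * D))
    penalty = 0.5 * quad_form(A, D)
    return gain, penalty


rng = np.random.default_rng(0)
n = 8
eigs = np.array([100.0, 100.0] + [1.0] * (n - 2))  # heterogeneous curvature
A = np.diag(eigs)
W = rng.standard_normal((n, n))
G = A @ W  # gradient is dominated by the high-curvature rows

d_muon = -orthogonalize(G)
d_gd = -G
# Match update norms so the comparison isolates direction, not scale.
d_gd = d_gd * np.linalg.norm(d_muon) / np.linalg.norm(d_gd)

nds_gd = nds(A, d_gd)
nds_muon = nds(A, d_muon)
print(f"NDS (GD direction):   {nds_gd:.3f}")
print(f"NDS (Muon direction): {nds_muon:.3f}")

for name, D in [("GD", d_gd), ("Muon", d_muon)]:
    gain, penalty = taylor_terms(A, W, D)
    print(f"{name}: first-order gain {gain:.3f}, curvature penalty {penalty:.3f}")
```

In this toy problem the orthogonalized step has NDS equal to the mean eigenvalue of A (25.75 here), because a square orthogonal factor spreads update energy evenly across all curvature modes. The gradient direction concentrates its energy on the two stiff modes, so its NDS sits close to the largest eigenvalue. The example only illustrates the mechanism of balancing update energy across curvature groups. It says nothing about real language-model training or about Adam.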

Original Article

Cached at: 06/09/26, 08:42 AM

Source: https://huggingface.co/papers/2606.04662

Abstract

Muon outperforms Adam in large language model training by reducing curvature penalties through lower normalized directional sharpness, particularly in middle and late training stages, with advantages amplified by data imbalance and heterogeneous curvature.

Similar Articles

Can Muon Fine-tune Adam-Pretrained Models?

Muon is Not That Special: Random or Inverted Spectra Work Just as Well

How Much Orthogonalization Does Muon Need?

Spectral Scaling Laws of Muon

SignMuon: Communication-Efficient Distributed Muon Optimization
