muon

#muon

@maximelabonne: Parallax is a parametrized form of Local Linear Attention that drops the numerical solvers and matches FA 2/3 on decode…

X AI KOLs Following ↗ · 4d ago Cached

Parallax is a new parametrized form of Local Linear Attention that eliminates numerical solvers and matches FlashAttention 2/3 in decoding. Its effectiveness depends on the optimizer, working with Muon but not AdamW, highlighting the role of optimizer geometry.

#muon

Gram Newton-Schulz: A Fast, Hardware-Aware Newton-Schulz Algorithm for Muon

Hacker News Top ↗ · 5d ago Cached

This blog post presents Gram Newton-Schulz, a hardware-aware optimization of the Newton-Schulz orthogonalization procedure used in the Muon optimizer, achieving significant speedups for training large language models while preserving model quality.
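
For context, the orthogonalization step mentioned above can be sketched with the classic cubic Newton-Schulz iteration. This is a minimal illustration only: it is not the Gram variant from the blog post, and it does not use the tuned polynomial coefficients that Muon implementations typically rely on. The function name `newton_schulz_orthogonalize` and the parameters `steps` and `eps` are hypothetical.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=30, eps=1e-7):
    """Approximate the polar factor U V^T of g = U S V^T.

    Uses the classic cubic iteration X <- 1.5 X - 0.5 (X X^T) X,
    which pushes every nonzero singular value of X toward 1.
    """
    # The Frobenius norm bounds the spectral norm, so after scaling all
    # singular values lie in (0, 1], inside the convergence region.
    x = g / (np.linalg.norm(g) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # work on the wide form so x @ x.T is the smaller Gram matrix
    for _ in range(steps):
        gram = x @ x.T
        x = 1.5 * x - 0.5 * gram @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
update = newton_schulz_orthogonalize(rng.standard_normal((4, 6)))
# Rows of `update` are now approximately orthonormal.
```

The fixed step count is an assumption made for simplicity; very small singular values need more iterations to reach 1, which is one reason practical implementations tune the polynomial and the number of steps.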

#muon

Spectral Scaling Laws of Muon

arXiv cs.LG ↗ · 2026-06-04 Cached

This paper presents the first systematic study of singular value spectral behavior in Muon optimizer momentum matrices during LLM training, discovering clean power-law scaling relationships across model sizes (77M–2.8B parameters). The findings provide practitioners with principled, layer-aware guidelines for configuring Newton–Schulz iterations to maintain orthonormalization quality at frontier scale without unnecessary computation.

#muon

Why Muon Outperforms Adam: A Curvature Perspective

Hugging Face Daily Papers ↗ · 2026-06-03 Cached

This paper investigates why the Muon optimizer outperforms Adam in large language model training, showing from a curvature perspective that Muon incurs a smaller curvature penalty due to lower normalized directional sharpness, with advantages amplified by data imbalance.

#muon

MuCon: Clipped Muon Updates for LLM Training

arXiv cs.LG ↗ · 2026-05-27 Cached

This paper introduces MuCon, a clipped-Muon optimizer for LLM training that applies singular-value clipping instead of full polarization, preserving smaller singular values while clipping only the largest ones. It explores approximations to avoid full SVD, including polar/absolute-value formulas and rational Newton filters, noting numerical challenges near the threshold.
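
The difference between clipping and full polarization can be illustrated with an exact SVD reference. This is a rough sketch under stated assumptions, not the paper's method: the paper explores approximations that avoid the full SVD, whereas this example computes it directly. The names `clip_singular_values`, `polarize`, and the threshold `tau` are hypothetical.

```python
import numpy as np

def clip_singular_values(g, tau):
    """Cap singular values at tau; values below tau are left unchanged."""
    u, s, vt = np.linalg.svd(g, full_matrices=False)
    # u * s scales column i of u by s[i], i.e. u @ diag(s).
    return (u * np.minimum(s, tau)) @ vt

def polarize(g):
    """Full polarization: every nonzero singular value is set to 1."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

g = np.diag([3.0, 1.0, 0.2])
clipped = clip_singular_values(g, tau=1.0)
polar = polarize(g)
# Clipping keeps the small singular value 0.2, while polarization raises it to 1.
```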

#muon

DynMuon: A Dynamic Spectral Shaping View of Muon

Hugging Face Daily Papers ↗ · 2026-05-16 Cached

This paper introduces DynMuon, a dynamic spectral shaping optimizer that schedules the update parameter p from positive to mildly negative during training. It consistently achieves lower validation loss than the standard Muon optimizer and requires 10.6% to 26.5% fewer steps.

#muon

Muon is Not That Special: Random or Inverted Spectra Work Just as Well

arXiv cs.LG ↗ · 2026-05-13 Cached

This paper challenges the geometric justification for the Muon optimizer, arguing that precise structure is less important than step-size optimality. It introduces Freon and Kaon optimizers to demonstrate that random or inverted spectra can perform as well as Muon.

#muon

Can Muon Fine-tune Adam-Pretrained Models?

Hugging Face Daily Papers ↗ · 2026-05-11 Cached

This paper investigates the performance degradation that occurs when the Muon optimizer, rather than Adam, is used to fine-tune Adam-pretrained models. It shows that parameter-efficient methods such as LoRA mitigate this optimizer mismatch across language and vision tasks.
