Tag
This paper studies how much orthogonalization the Muon optimizer requires, proposing a five-step cubic Newton-Schulz schedule that reduces computational cost while achieving training quality similar to that of more expensive methods on GPT-2 Small and hybrid MoE/Mamba models.
This paper introduces DynMuon, a dynamic spectral shaping optimizer that schedules the update parameter p from positive to mildly negative during training, consistently achieving lower validation loss than the standard Muon optimizer while requiring 10.6% to 26.5% fewer steps.