Can Muon Fine-tune Adam-Pretrained Models?
Summary
Research paper investigating performance degradation when using the Muon optimizer instead of Adam for fine-tuning pretrained models, demonstrating that parameter-efficient methods like LoRA effectively mitigate this optimizer mismatch across language and vision tasks.
Cached at: 05/12/26, 10:52 AM
Source: https://huggingface.co/papers/2605.10468
Abstract
Optimizer mismatch between Adam and Muon during fine-tuning degrades performance due to differing implicit biases, but this can be mitigated using parameter-efficient fine-tuning methods like LoRA.
Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.
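To make the abstract's two ingredients concrete, here is a minimal NumPy sketch. It is not the paper's code and does not reproduce its experiments. It assumes the usual description of Muon as orthogonalizing a matrix-shaped update (computed here exactly with an SVD as a stand-in for the Newton-Schulz iteration used in practice), uses an elementwise sign step as a crude stand-in for Adam with its moment averaging switched off, and uses the standard LoRA parameterization of a weight change as a scaled product of two thin matrices. All function names, shapes, and the alpha value are hypothetical choices for illustration.

```python
import numpy as np


def orthogonalize(m):
    """Return U @ Vt from the thin SVD of m (all singular values set to 1).

    Assumption: this exact SVD replaces the Newton-Schulz iteration that
    Muon implementations typically use to approximate the same factor.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def sign_update(g):
    """Elementwise sign of g, a simplified stand-in for an Adam-style step."""
    return np.sign(g)


def lora_delta(b, a, alpha, rank):
    """LoRA weight change (alpha / rank) * B @ A, which has rank at most `rank`."""
    return (alpha / rank) * (b @ a)


rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 32, 4  # hypothetical layer shape and LoRA rank

# A random matrix standing in for a gradient or momentum buffer.
grad = rng.standard_normal((d_out, d_in))

muon_like = orthogonalize(grad)
adam_like = sign_update(grad)

# The orthogonalized update has a flat spectrum; the sign update does not.
muon_singular_values = np.linalg.svd(muon_like, compute_uv=False)
adam_singular_values = np.linalg.svd(adam_like, compute_uv=False)

# A LoRA update is confined to a low-rank subspace, whatever the optimizer.
b = rng.standard_normal((d_out, rank))
a = rng.standard_normal((rank, d_in))
delta = lora_delta(b, a, alpha=8.0, rank=rank)
delta_rank = np.linalg.matrix_rank(delta)

print("Muon-like singular values (min, max):",
      muon_singular_values.min(), muon_singular_values.max())
print("Adam-like singular values (min, max):",
      adam_singular_values.min(), adam_singular_values.max())
print("Rank of LoRA delta:", delta_rank, "out of a possible", min(d_out, d_in))
```

The sketch only illustrates why the two optimizers take differently shaped steps and why a LoRA update is structurally limited. Whether that limit explains the reduced Adam-Muon gap is the paper's hypothesis, which this toy example does not test.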
Similar Articles
Why Muon Outperforms Adam: A Curvature Perspective
This paper investigates why the Muon optimizer outperforms Adam in large language model training, showing from a curvature perspective that Muon incurs a smaller curvature penalty due to lower normalized directional sharpness, with advantages amplified by data imbalance.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
This paper introduces Pion, a new optimizer that replaces Muon's spectral whitening with a high-pass NS iteration to stabilize training in low-rank and low-SNR regimes, achieving improved performance in VLA and RLVR tasks.
Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning
This paper introduces AdaNAGED, a method that combines zero-order optimization, parameter-free adaptation, and non-Euclidean update geometry for memory-efficient fine-tuning of large language models, with theoretical convergence guarantees and validation on the OPT-1.3B model.
Anytime Training with Schedule-Free Spectral Optimization
This paper introduces SF-NorMuon, a schedule-free spectral optimizer that matches or exceeds tuned AdamW on language models up to 772M parameters, with theoretical guarantees for stationarity and long-horizon stability.
@0xLogicrw: Tilde Research found a hidden flaw in the Muon optimizer, used by leading models like DeepSeek V4, Kimi K2.5, and GLM-5: it causes over a quarter of MLP layer neurons to die permanently in early training. The team designed an alternative optimizer, Auro…
Tilde Research discovered a flaw in the Muon optimizer that leads to early death of MLP neurons and open-sourced an alternative, Aurora. While maintaining orthogonality, Aurora resolves the neuron death issue, significantly improving training efficiency.