Can Muon Fine-tune Adam-Pretrained Models?

Hugging Face Daily Papers 05/11/26, 12:00 AM Papers

muon adam fine-tuning lora optimizer-mismatch deep-learning

Summary

This paper investigates the performance degradation that occurs when the Muon optimizer, rather than Adam, is used to fine-tune Adam-pretrained models. It shows that parameter-efficient methods such as LoRA reduce this optimizer mismatch across language and vision tasks.

Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.
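The "distinct implicit biases" of the two optimizers can be made concrete with a small sketch. It is not taken from the paper's code. It assumes the commonly described form of each update: Adam rescales every gradient entry by a running magnitude estimate (so its first bias-corrected step is roughly the elementwise sign of the gradient), while Muon approximately orthogonalizes the matrix-shaped update so that all of its singular values are close to 1. The function names, the toy gradient, and the use of the textbook cubic Newton-Schulz iteration are illustrative assumptions; practical Muon implementations add momentum and use tuned polynomial coefficients with only a few iterations.

```python
import numpy as np


def adam_first_step_direction(grad, eps=1e-8):
    """Direction of Adam's first bias-corrected step: m_hat / (sqrt(v_hat) + eps).

    With zero-initialized moments, bias correction gives m_hat = grad and
    v_hat = grad**2, so every nonzero entry is rescaled to magnitude ~1.
    """
    return grad / (np.abs(grad) + eps)


def orthogonalize_newton_schulz(grad, steps=30):
    """Approximate the polar factor U @ V.T of grad = U @ diag(s) @ V.T.

    Illustrative assumption: the textbook cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X, not the tuned variant used in practice.
    """
    # Frobenius scaling puts all singular values in (0, 1], where the
    # iteration converges toward singular values of exactly 1.
    x = grad / (np.linalg.norm(grad) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x


# Hypothetical 2x3 gradient of a single weight matrix.
grad = np.array([[2.0, 1.0, 0.0],
                 [0.0, 1.0, 3.0]])

adam_dir = adam_first_step_direction(grad)
muon_dir = orthogonalize_newton_schulz(grad)

print("Adam-style direction:\n", np.round(adam_dir, 3))
print("Muon-style direction:\n", np.round(muon_dir, 3))
print("Adam-style singular values:",
      np.round(np.linalg.svd(adam_dir, compute_uv=False), 3))
print("Muon-style singular values:",
      np.round(np.linalg.svd(muon_dir, compute_uv=False), 3))
```

The Adam-style direction is uniform per entry but uneven across singular directions, whereas the Muon-style direction is uniform across singular directions. This is only a geometric illustration of why the two optimizers can pull weights toward different kinds of solutions; the paper's own analysis should be consulted for the actual argument.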


Cached at: 05/12/26, 10:52 AM

Source: https://huggingface.co/papers/2605.10468

Abstract

Fine-tuning an Adam-pretrained model with Muon degrades performance because the two optimizers have different implicit biases; parameter-efficient fine-tuning methods such as LoRA mitigate this optimizer mismatch.

Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.
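To see in what sense LoRA "constrains updates", the following minimal sketch uses the standard LoRA parameterization, in which the pretrained weight stays frozen and only a low-rank product is trained. It is an illustration under stated assumptions, not the authors' implementation: the layer sizes, the `alpha / rank` scaling, the random stand-in for optimizer steps, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 8, 6, 2, 4.0  # hypothetical layer shape and LoRA settings

w_pretrained = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
lora_a = rng.normal(size=(rank, d_in)) * 0.01   # trainable, small random init
lora_b = np.zeros((d_out, rank))                # trainable, zero init


def lora_delta(a, b, alpha, rank):
    """Weight change contributed by the adapter: (alpha / rank) * B @ A."""
    return (alpha / rank) * (b @ a)


# Before any training step the adapter leaves the pretrained weight untouched.
delta_init = lora_delta(lora_a, lora_b, alpha, rank)

# Stand-in for a few optimizer steps on the adapter matrices only
# (random perturbations here, purely for illustration).
lora_a = lora_a + 0.1 * rng.normal(size=lora_a.shape)
lora_b = lora_b + 0.1 * rng.normal(size=lora_b.shape)

delta = lora_delta(lora_a, lora_b, alpha, rank)
w_effective = w_pretrained + delta

# Whatever optimizer produced lora_a and lora_b, the change to the effective
# weight has rank at most `rank`. An explicit tolerance avoids counting
# floating-point noise as extra rank.
delta_rank = np.linalg.matrix_rank(delta, tol=1e-10)
relative_change = np.linalg.norm(delta) / np.linalg.norm(w_pretrained)

print("rank of weight change:", delta_rank)
print("relative size of weight change:", round(float(relative_change), 4))
```

Under full fine-tuning the weight change can be any full-rank matrix, while here it is confined to a rank-`rank` subspace. Raising `rank` loosens that constraint, which is one way to read the abstract's statement that LoRA rank studies link mismatch severity to update strength.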

Similar Articles

Why Muon Outperforms Adam: A Curvature Perspective

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning

Anytime Training with Schedule-Free Spectral Optimization

@0xLogicrw: Tilde Research found a hidden flaw in the Muon optimizer, used by leading models like DeepSeek V4, Kimi K2.5, and GLM-5: it causes over a quarter of MLP layer neurons to die permanently in early training. The team designed an alternative optimizer, Auro…
