Can Muon Fine-tune Adam-Pretrained Models?

Hugging Face Daily Papers

Summary

This research paper investigates the performance degradation that occurs when the Muon optimizer replaces Adam for fine-tuning Adam-pretrained models, and shows that parameter-efficient methods such as LoRA reduce this optimizer mismatch across language and vision tasks.

Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.

Cached at: 05/12/26, 10:52 AM

Source: https://huggingface.co/papers/2605.10468

Abstract

Optimizer mismatch between Adam and Muon during fine-tuning degrades performance due to differing implicit biases, but this can be mitigated using parameter-efficient fine-tuning methods like LoRA.

Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.
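As a rough illustration of the objects the abstract refers to, the sketch below contrasts a Muon-style update direction (the gradient matrix pushed toward a semi-orthogonal matrix), an Adam-style direction (the elementwise sign, which is what the first bias-corrected Adam step reduces to as epsilon goes to zero), and a LoRA weight change of bounded rank. It is not taken from the paper's repository. The function names, the toy matrices, and the classic cubic Newton-Schulz iteration are assumptions chosen for clarity; Muon implementations typically use a tuned polynomial and apply the step to a momentum buffer rather than the raw gradient.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the semi-orthogonal polar factor of a nonzero matrix g.

    Assumption: uses the classic cubic iteration X <- 1.5 X - 0.5 (X X^T) X,
    not the tuned polynomial of any particular Muon implementation.
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]  # singular values are now <= 1
    for _ in range(steps):
        cubic = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(xr, cr)]
             for xr, cr in zip(x, cubic)]
    return x


def sign_update(g):
    """Elementwise sign of g (Adam's first bias-corrected step as eps -> 0)."""
    return [[(v > 0) - (v < 0) for v in row] for row in g]


def lora_delta(b, a, alpha, rank):
    """LoRA weight change (alpha / rank) * B @ A; its rank is at most `rank`."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(b, a)]


# Hypothetical 2x3 gradient with singular values 5 and 2.
grad = [[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]]

muon_dir = newton_schulz_orthogonalize(grad)  # all singular values driven to 1
adam_dir = sign_update(grad)                  # every nonzero entry has magnitude 1

# Hypothetical rank-1 LoRA factors: B is 2x1, A is 1x3.
delta = lora_delta([[1.0], [2.0]], [[0.5, 0.0, -0.5]], alpha=2.0, rank=1)
```

In this toy case the orthogonalized direction equalizes the two singular values, while the sign direction equalizes entry magnitudes, which is one way to picture the distinct geometries of the two optimizers. The LoRA change stays rank 1 regardless of which optimizer trains the factors, matching the abstract's idea of a constrained update.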

Similar Articles

Why Muon Outperforms Adam: A Curvature Perspective

Hugging Face Daily Papers

This paper investigates why the Muon optimizer outperforms Adam in large language model training, showing from a curvature perspective that Muon incurs a smaller curvature penalty due to lower normalized directional sharpness, with advantages amplified by data imbalance.

Anytime Training with Schedule-Free Spectral Optimization

arXiv cs.LG

This paper introduces SF-NorMuon, a schedule-free spectral optimizer that matches or exceeds tuned AdamW on language models up to 772M parameters, with theoretical guarantees for stationarity and long-horizon stability.

@0xLogicrw: Tilde Research found a hidden flaw in the Muon optimizer, used by leading models like DeepSeek V4, Kimi K2.5, and GLM-5: it causes over a quarter of MLP layer neurons to die permanently in early training. The team designed an alternative optimizer, Auro…

X AI KOLs Timeline

Tilde Research discovered a flaw in the Muon optimizer that leads to early death of MLP neurons and open-sourced an alternative, Aurora. While maintaining orthogonality, Aurora resolves the neuron death issue, significantly improving training efficiency.