@0xLogicrw: Tilde Research found a hidden flaw in the Muon optimizer, used by leading models like DeepSeek V4, Kimi K2.5, and GLM-5: it causes over a quarter of MLP layer neurons to die permanently in early training. The team designed an alternative optimizer, Auro…

X AI KOLs Timeline

Summary

Tilde Research discovered a flaw in the Muon optimizer that leads to early death of MLP neurons and open-sourced an alternative, Aurora. While maintaining orthogonality, Aurora resolves the neuron death issue, significantly improving training efficiency.

Tilde Research discovered a hidden flaw in the Muon optimizer, used by leading models such as DeepSeek V4, Kimi K2.5, and GLM-5: it causes more than a quarter of the neurons in MLP layers to die permanently during the early stages of training. Based on this finding, the team designed and open-sourced an alternative optimizer called Aurora. A 1.1B model trained with only about 100B tokens matched the performance of Qwen3-1.7B (trained on 36T tokens) on language-understanding benchmarks such as HellaSwag and Winogrande.

The issue stems from a mathematical property of how Muon handles MLP weight matrices. Early in training, some neurons happen to receive weaker gradient signals. Traditional optimizers like AdamW normalize per parameter, which naturally evens out these disparities; Muon's orthogonalization step, however, passes the weak signals through unchanged. Weak neurons keep receiving weak updates and grow increasingly silent, a "rich get richer" vicious cycle. By step 500 of training, over a quarter of the neurons had effectively died, wasting parameter capacity.

An earlier improved variant, NorMuon, alleviated this by forcing update magnitudes to be flattened per row, but at the cost of breaking the orthogonality of the update matrix (orthogonalization is what makes each update step maximally efficient and is Muon's core advantage), sacrificing optimization precision.

Aurora instead treats "uniform updates" and "orthogonality" as joint constraints and uses alternating iterations to satisfy both at once: every neuron gets a fair chance to learn without sacrificing update precision. Untuned, Aurora adds only about 6% computational overhead over Muon and works as a drop-in replacement. In the modded-nanoGPT optimization benchmark, Aurora set a new record of 3175 steps. Its advantage grows with MLP width: the larger the expansion ratio, the more pronounced the improvement. Both the code and the 1.1B pre-trained model are open source.
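To make the mechanism concrete, here is a minimal numpy sketch, not Tilde's released implementation: it substitutes an exact SVD-based polar factor for Muon's Newton-Schulz iteration, and the alternating loop, function names, and iteration count are illustrative assumptions. It only shows how alternating a row-norm equalization step (one row per MLP neuron) with an orthogonalization step yields an update that stays close to orthogonal while no longer starving the weak neurons.

```python
import numpy as np

def orthogonalize(G):
    # Exact polar factor via SVD; Muon approximates this with a Newton-Schulz iteration.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def equalize_row_norms(U, eps=1e-8):
    # Rescale every row (one row per MLP neuron) toward the mean row norm.
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return U * (norms.mean() / (norms + eps))

def aurora_like_update(G, n_alt=5):
    # Alternate between the two constraints so the result is close to orthogonal
    # AND has near-uniform row norms (the "joint constraints" idea, schematically).
    U = G.copy()
    for _ in range(n_alt):
        U = equalize_row_norms(U)
        U = orthogonalize(U)
    return U

rng = np.random.default_rng(0)
G = rng.normal(size=(1024, 256))   # toy MLP gradient: 1024 neurons x 256 inputs
G[:100] *= 0.01                    # 100 neurons start with much weaker gradient signal

for name, upd in (("Muon-like", orthogonalize(G)), ("Aurora-like", aurora_like_update(G))):
    r = np.linalg.norm(upd, axis=1)
    print(f"{name:12s} weak/strong row-norm ratio ~ {r[:100].mean() / r[100:].mean():.2f}")
```

On this toy gradient, the Muon-like update keeps the weak rows roughly 100x smaller than the strong ones, while the alternating update brings the ratio close to 1, which is the behavior the summary above describes.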

Similar Articles

Aurora: A Leverage-Aware Optimizer for Rectangular Matrices

Lobsters Hottest

Tilde Research introduces Aurora, a new optimizer designed to prevent neuron death in MLP layers while maintaining orthogonality, achieving state-of-the-art results on nanoGPT benchmarks and 100x data efficiency on 1B models.

Can Muon Fine-tune Adam-Pretrained Models?

Hugging Face Daily Papers

Research paper investigating performance degradation when using the Muon optimizer instead of Adam for fine-tuning pretrained models, demonstrating that parameter-efficient methods like LoRA effectively mitigate this optimizer mismatch across language and vision tasks.

@AI_jacksaku: This week’s GitHub dark horse—Unsloth speeds up AI model training 2-5× while cutting VRAM use by 80%. What does that mean? Fine-tuning a large model used to require an A100 cluster and tens of thousands of dollars. Now one RTX 4090 can finish the job in a few hours. How? By optimizing attention compute, eliminating redundant memory copies, and adding QLoRA & Flash Attention support.

X AI KOLs Timeline

The open-source tool Unsloth boosts large-model fine-tuning speed 2-5× and cuts VRAM use by 80%, letting a single RTX 4090 finish in hours what once required an A100 cluster.
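For readers who want to see the underlying recipe, the sketch below is a generic QLoRA setup using the Hugging Face transformers/peft/bitsandbytes stack, not Unsloth's own API; the model name and LoRA hyperparameters are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization: this is what makes single-GPU fine-tuning fit in VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",       # placeholder: any causal LM checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters: only these small matrices are trained, the 4-bit base stays frozen.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Unsloth layers its attention-kernel and memory-copy optimizations on top of this kind of setup; the snippet only illustrates the QLoRA part.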

@berryxia: Small model, big wisdom? It's now real! A 7B model now orchestrates top large models like GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro. A new paper shows an RL-trained 7B model learned to write natural-language subtasks, assign them to different models, precisely...

X AI KOLs Timeline

A new paper proposes training a 7B model via reinforcement learning to act as a task scheduler that decomposes tasks into subtasks and assigns them to top models like GPT-5 and Claude. It surpasses individual frontier models on several hard benchmarks, demonstrating that end-to-end reward learning can effectively replace manual prompt engineering and hand-built multi-agent pipelines.
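The orchestration pattern itself is simple to sketch. The snippet below is a hypothetical illustration of the loop described above, not the paper's code: the planner output, model names, and merge step are all stand-ins, and the real system would generate the subtask list with the RL-trained 7B model.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    description: str   # natural-language instruction written by the 7B planner
    assignee: str      # which frontier model should handle it

def plan(task: str) -> list[Subtask]:
    # Stand-in for the RL-trained planner, which would emit this list itself.
    return [
        Subtask("Summarize the problem constraints.", "gpt-5"),
        Subtask("Draft and verify a step-by-step solution.", "claude-sonnet-4"),
    ]

def call_model(name: str, prompt: str) -> str:
    # Placeholder for an API call to the assigned frontier model.
    return f"[{name}] response to: {prompt}"

def orchestrate(task: str) -> str:
    results = [call_model(sub.assignee, sub.description) for sub in plan(task)]
    # The planner would also decide how to merge results; concatenation is a stand-in.
    return "\n".join(results)

print(orchestrate("Prove the inequality in problem 3."))
```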

@0xLogicrw: MiniMax published a technical blog post detailing the root cause analysis for its M2 series large models' inability to output the person's name "Ma Jiaqi". Starting from a single case study, the investigation ultimately revealed a systematic degradation issue affecting nearly 5% of the entire vocabulary. The root cause was a severe disconnect in data coverage between the two training stages of the large model. In the first stage (pre-training), massive amounts of internet text were used to cre…

X AI KOLs Timeline

MiniMax published a technical blog post providing an in-depth analysis of the systematic vocabulary degradation behind its M2 series large models' inability to output specific personal names. It traces the problem to parameter shifts caused by a disconnect in data coverage between the pre-training and post-training stages, and proposes an effective remediation based on full-scale synthetic data.
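A hypothetical diagnostic in the same spirit, not MiniMax's tooling: given a tokenizer and text samples from the two training stages, it flags vocabulary entries that are covered in pre-training data but (nearly) absent from post-training data, i.e. the tokens most at risk of the degradation described above. The ToyTokenizer and example corpora are placeholders.

```python
from collections import Counter

def token_counts(tokenizer, texts):
    # `tokenizer` is anything exposing .encode(str) -> list[int], e.g. a Hugging Face tokenizer.
    counts = Counter()
    for text in texts:
        counts.update(tokenizer.encode(text))
    return counts

def at_risk_tokens(tokenizer, pretrain_texts, posttrain_texts, min_post_count=1):
    # Tokens seen during pre-training but (almost) never during post-training.
    pre = token_counts(tokenizer, pretrain_texts)
    post = token_counts(tokenizer, posttrain_texts)
    return sorted(tok for tok, n in pre.items() if post.get(tok, 0) < min_post_count)

# Tiny stand-in tokenizer so the sketch runs on its own; swap in a real tokenizer
# to audit an actual vocabulary.
class ToyTokenizer:
    def __init__(self):
        self.vocab = {}
    def encode(self, text):
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]

tok = ToyTokenizer()
pretrain = ["the model is named Ma Jiaqi", "the weather is fine"]
posttrain = ["the weather is fine today"]
print(at_risk_tokens(tok, pretrain, posttrain))  # token IDs covered only in pre-training
```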