@0xLogicrw: Tilde Research found a hidden flaw in the Muon optimizer, used by leading models like DeepSeek V4, Kimi K2.5, and GLM-5: it causes over a quarter of MLP layer neurons to die permanently in early training. The team designed an alternative optimizer, Auro…
Summary
Tilde Research discovered a flaw in the Muon optimizer that leads to early death of MLP neurons and open-sourced an alternative, Aurora. While maintaining orthogonality, Aurora resolves the neuron death issue, significantly improving training efficiency.
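For context on what "maintaining orthogonality" refers to: Muon's defining step is to orthogonalize each weight matrix's gradient update via a Newton-Schulz iteration, so that all singular values of the update are pushed toward 1. The sketch below shows that iteration in NumPy, using the quintic coefficients from the public Muon reference implementation; it is an illustration of the mechanism, not Tilde Research's Aurora code.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately replace G with a nearby semi-orthogonal matrix
    (all singular values pushed toward 1) via the quintic
    Newton-Schulz iteration used by the Muon optimizer."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients
    X = G / (np.linalg.norm(G) + eps)   # scale so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the short side
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X               # singular values follow a*s + b*s^3 + c*s^5
    return X.T if transposed else X
```

The flaw Tilde Research describes concerns what this update pattern does to rectangular MLP matrices early in training, which is what Aurora's "leverage-aware" design targets.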
Similar Articles
Aurora: A Leverage-Aware Optimizer for Rectangular Matrices
Tilde Research introduces Aurora, a new optimizer designed to prevent neuron death in MLP layers while maintaining orthogonality, achieving state-of-the-art results on nanoGPT benchmarks and 100x data efficiency on 1B models.
Can Muon Fine-tune Adam-Pretrained Models?
Research paper investigating performance degradation when using the Muon optimizer instead of Adam for fine-tuning pretrained models, demonstrating that parameter-efficient methods like LoRA effectively mitigate this optimizer mismatch across language and vision tasks.
@AI_jacksaku: This week’s GitHub dark horse—Unsloth speeds up AI model training 2-5× while cutting VRAM use by 80%. What does that mean? Fine-tuning a large model used to require an A100 cluster and tens of thousands of dollars. Now one RTX 4090 can finish the job in a few hours. How? By optimizing attention compute, eliminating redundant memory copies, and adding QLoRA & Flash Attention support.
Unsloth open-source tool boosts large-model fine-tuning speed 2-5× and slashes VRAM by 80%, letting a single RTX 4090 finish in hours what once needed an A100 cluster.
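The VRAM savings come largely from QLoRA-style fine-tuning: the base weights stay frozen (and quantized), and only a low-rank adapter is trained. A minimal back-of-the-envelope sketch of the LoRA parameter math, not Unsloth's actual code:

```python
# Illustrative LoRA arithmetic: instead of updating a frozen weight
# W (d_out x d_in), train a low-rank pair B (d_out x r) and A (r x d_in);
# the effective weight becomes W + B @ A. The dimensions below are
# hypothetical, chosen to resemble a typical transformer projection.
d_out, d_in, r = 4096, 4096, 16

full_params = d_out * d_in          # trainable params in full fine-tuning
lora_params = r * (d_out + d_in)    # trainable params with a rank-16 adapter

print(f"trainable params: {lora_params:,} vs {full_params:,} "
      f"({100 * lora_params / full_params:.2f}% of full fine-tuning)")
```

Training well under 1% of the parameters per layer, plus 4-bit quantized frozen weights, is what lets a single RTX 4090 stand in for an A100 cluster.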
@berryxia: Small model, big brains? It's real now! A 7B model now orchestrates top large models like GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro. A new paper shows an RL-trained 7B model learned to write natural-language subtasks, assign them to different models, precisely...
A new paper proposes training a 7B small model via reinforcement learning as a task scheduler, automatically decomposing subtasks and assigning them to top models like GPT-5 and Claude. It surpasses individual frontier models on several hard benchmarks, demonstrating that end-to-end reward learning can effectively replace manual prompt engineering and multi-agent pipeline design.
@0xLogicrw: MiniMax published a technical blog post detailing the root-cause analysis of why its M2 series large models could not output the name "Ma Jiaqi". Starting from a single case study, the investigation ultimately revealed a systematic degradation issue affecting nearly 5% of the entire vocabulary. The root cause was a severe disconnect in data coverage between the two training stages of the large model. In the first stage (pre-training), massive amounts of internet text were used to cre…
MiniMax published a technical blog post providing an in-depth analysis of the systematic vocabulary degradation issue behind its M2 series large models' inability to output specific personal names. It reveals parameter shifts caused by a disconnect in data coverage between pre-training and post-training stages, and proposes an effective solution involving full-scale synthetic data for remediation.
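One practical takeaway is that this class of bug can be screened for: ask the model to echo each vocabulary item back and flag tokens it can never reproduce. The sketch below is a hypothetical screening loop, not MiniMax's method; `generate` and the toy stand-in model are placeholders for a real model call.

```python
def find_unreachable_tokens(vocab, generate):
    """Return tokens the model fails to echo back verbatim.
    `generate` is a placeholder for a real model-inference call."""
    dead = []
    for token in vocab:
        reply = generate(f"Repeat exactly: {token}")
        if token not in reply:
            dead.append(token)
    return dead

# Toy stand-in model that "forgets" one token, for demonstration only:
vocab = ["hello", "world", "Ma Jiaqi"]
demo_model = lambda prompt: prompt.replace("Ma Jiaqi", "")
print(find_unreachable_tokens(vocab, demo_model))  # ['Ma Jiaqi']
```

A real screen would sample multiple prompts per token and batch the inference, but the structure is the same: a full sweep over the vocabulary is what turned MiniMax's single failing name into a measurable 5%-of-vocabulary finding.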