curvature

Why Muon Outperforms Adam: A Curvature Perspective

Hugging Face Daily Papers · 2026-06-03

This paper investigates why the Muon optimizer outperforms Adam in large language model training. Taking a curvature perspective, it shows that Muon incurs a smaller curvature penalty because of its lower normalized directional sharpness, and that this advantage is amplified by data imbalance.

Your transformer's attention entropy collapse isn't a bug. It's the model doing exactly what you trained it to do. Here's how to fix it with a three-line temperature schedule. arXiv-able. Self-contained proof. No citations needed.

Reddit r/ArtificialInteligence · 2026-06-02

The article explains that attention entropy collapse in deep transformer layers is a geometric consequence of training, not a bug, and proposes a three-line temperature schedule to prevent it.

After 8 years, I rewrote my open-source PyTorch curvature library

Hacker News Top · 2026-05-14

After 8 years, the author rewrote the open-source pytorch-hessian-eigenthings library, which provides efficient eigendecomposition of the Hessian and other curvature matrices for PyTorch models using iterative methods such as Lanczos.
