Tag
This paper investigates why the Muon optimizer outperforms Adam in large language model training. Taking a curvature perspective, it shows that Muon incurs a smaller curvature penalty because of its lower normalized directional sharpness, and that this advantage is amplified by data imbalance.
The article explains that attention entropy collapse in deep transformer layers is a geometric consequence of training, not a bug, and proposes a three-line temperature schedule to prevent it.
After eight years, the author has rewritten the open-source pytorch-hessian-eigenthings library, which provides efficient eigendecomposition of the Hessian and other curvature matrices for PyTorch models using iterative methods such as Lanczos.