This blog post presents Gram Newton-Schulz, a hardware-aware optimization of the Newton-Schulz orthogonalization procedure used in the Muon optimizer, achieving significant speedups for training large language models while preserving model quality.