Tag
This blog post presents Gram Newton-Schulz, a hardware-aware optimization of the Newton-Schulz orthogonalization procedure used in the Muon optimizer, achieving significant speedups for training large language models while preserving model quality.
This paper proposes Dynamic Contextual Orthogonalization (DCO), an inference-time method that reduces hallucinations in large language models by aligning attention head outputs with the context manifold, and reports improved faithfulness on benchmarks with Llama-3 models.
This paper studies how much orthogonalization the Muon optimizer requires. It proposes a five-step cubic Newton-Schulz schedule that reduces computational cost while reaching training quality similar to that of more expensive methods on GPT-2 Small and hybrid MoE/Mamba models.