Aurora: A Leverage-Aware Optimizer for Rectangular Matrices

Lobsters Hottest 05/10/26, 01:24 AM Papers

machine-learning optimizers training-efficiency open-source research deep-learning

Summary

Tilde Research introduces Aurora, a new optimizer designed to prevent neuron death in MLP layers while maintaining orthogonality, achieving state-of-the-art results on nanoGPT benchmarks and 100x data efficiency on 1B models.

<p><a href="https://lobste.rs/s/2kznvg/aurora_leverage_aware_optimizer_for">Comments</a></p>

Original Article

View Cached Full Text

Cached at: 05/10/26, 02:46 AM

# Aurora: A Leverage-Aware Optimizer for Rectangular Matrices | Tilde Source: [https://blog.tilderesearch.com/blog/aurora](https://blog.tilderesearch.com/blog/aurora) [Back](https://tilderesearch.com/research)5\.05\.2026 Alec Dewulf\*, Dhruv Pai\*, Li Yang\*, Ashley Zhang\*, Ben Keigwin ## TL;DR - We show that Muon's update inherits row\-norm anisotropy on tall matrices which can cause a significant portion of neurons in MLP layers to permanently die\. Row normalizing updates fixes this, but at the cost of orthogonality\. - We formulate the problem of steepest descent under the joint constraint of row\-norm uniformity and orthogonality, and present**Aurora**optimizer as a solution\. - We use Aurora to train a 1\.1B model, which achieves**100x data efficiency**on open\-source internet data and outperforms larger models on general evals like HellaSwag\. - We submitted a PR to the modded\-nanoGPT speedrun where Aurora**outperforms the current SoTA run**\. - Untuned Aurora is only a 6% overhead over traditional Muon, and a drop\-in replacement\. - We open\-source code for both Riemannian and vanilla Aurora:[github\.com/tilde\-research/aurora\-release](https://github.com/tilde-research/aurora-release)\. Loading\.\.\. **Outline** 1. [Introduction](https://blog.tilderesearch.com/blog/aurora#1-introduction)— Muon's success and the NorMuon puzzle 2. [Tension Due to Row Normalization](https://blog.tilderesearch.com/blog/aurora#2-tension-due-to-row-normalization)— Row normalization competes with orthogonality on tall matrices 3. [Normalization Prevents Neuron Death](https://blog.tilderesearch.com/blog/aurora#3-normalization-prevents-neuron-death)— Muon kills MLP neurons; leverage scores explain why 4. [Augmenting the Muon Objective](https://blog.tilderesearch.com/blog/aurora#4-augmenting-the-muon-objective)— Deriving Aurora from a joint constraint 5. [Results](https://blog.tilderesearch.com/blog/aurora#5-results)— nanoGPT speedrun SoTA, 1B pretraining, and downstream evals 6. [Inflection](https://blog.tilderesearch.com/blog/aurora#6-inflection)— Why this matters ## 1: Introduction The Muon optimizer\[1\]first attracted attention due to its success in the nanoGPT speedrun competition\[2\], where it outperformed AdamW\[3\]in wall\-clock time to convergence despite requiring more computation per step\. Since then, substantial effort has gone into scaling Muon through improved distributed implementations\[4\], more accurate orthogonalization routines\[5\]\[6\], and methods to reduce compute and communication overhead\[4\]\. The result is that Muon \(including clipped and per\-head versions\) has become an increasingly popular choice for training frontier\-scale models\[7\]\[8\]\[18\]\. Recently, there has been a surge in Muon variants pushing the state\-of\-the\-art on open research benchmarks\[9\]\[10\]\. One example is NorMuon\[9\], which is currently SoTA on the modded\-nanoGPT speedrun\. NorMuon augments Muon with an additional step that scales each row by its inverse RMS norm, analogously to Adam's per\-parameter normalization\. Notably, this row normalization step can move the Muon update quite far from the true orthogonal gradient\. The fact that NorMuon still succeeds suggests that there may be a gap in the Muon formulation that is being addressed by row normalization\. We study the effects of row normalization and find that Muon can result in*neuron death*in MLP layers, whereby some neurons receive persistently small updates early in training and fail to recover\. We show that this failure mode can be avoided by redistributing mass equally across rows for updates to the up and gate projections in the MLP layers\. Motivated by this observation, we propose**Aurora**, which uses this mechanism to prevent neuron death without sacrificing precision of the gradient orthogonalization\. We show that Aurora achieves large gains over both Muon and NorMuon at 1B scale and outperforms the current SoTA on the modded\-nanoGPT optimization track\. Furthermore, we find that Aurora's performance gains scale with MLP width leading us to believe that Aurora is particularly effective for networks with large MLP expansion factors\. ## 2: Tension Due to Row Normalization The core algorithmic component in Muon is an iterative algorithm to compute the polar factor of a matrix\. ForG∈Rm×nG \\in \\mathbb\{R\}^\{m \\times n\}with thin Singular Value Decomposition \(SVD\)G=UΣV⊤G = U\\Sigma V^\\top, the*polar factor*polar⁡\(G\)=UV⊤\\operatorname\{polar\}\(G\) = UV^\\top, is the closest semi\-orthogonal matrix toGGin Frobenius norm\. IfGGis the gradient of a parameterWW, then Muon formspolar⁡\(G\)=UV⊤\\operatorname\{polar\}\(G\) = UV^\\topand updatesW←W−ηUV⊤W \\gets W \- \\eta\\, UV^\\topfor a learning rateη\\eta\. The existence of matmul\-only iterative algorithms for computingpolar⁡\(G\)\\operatorname\{polar\}\(G\)\[5\]\[6\]\[1\]is largely what makes Muon feasible at scale\. We focus on two such algorithms here: quintic iteration from modded\-nanoGPT\[21\]and Polar Express with 8 steps \(PE\-8\)\.\[5\]We give a more thorough overview of different algorithms for computing the polar factor in[Appendix A](https://blog.tilderesearch.com/blog/aurora#a-comparing-newton-schulz-iterations)\. We find that in our training settings \(as described in[Appendix B](https://blog.tilderesearch.com/blog/aurora#b-training-setting)\) Muon achieves monotonically better downstream loss under more precise orthogonalization; i\.e\. under iterative algorithms that more accurately approximatepolar⁡\(G\)\\operatorname\{polar\}\(G\)\. We plot Muon runs using the least precise iteration we test \(quintic\) and the most precise \(PE\-8\) in[Figure 1](https://blog.tilderesearch.com/blog/aurora#figure-1)to illustrate this gap\. This result confirms that polar factor precision is an important component in strong Muon implementations\. Loading\.\.\. **Figure1\.**Muon achieves monotonically lower loss with more precise orthogonalization routines\. PE\-8 substantially outperforms the quintic iteration at 1B scale\.NorMuon augments Muon with an additional step that scales each row of the polar factor to have unit RMS norm\. Algorithm:NorMuonRequire:Initial weightsW0∈Rm×n,lossL,learning rateη,momentum\(β1,β2\),perturbationε,weight decayλInitializeM0∈Rm×n←0,v0∈Rm←0fort=1,2,…doGt←∇WL\(Wt\)Mt←β1Mt−1\+\(1−β1\)GtOt←polar\(Mt\)vt←β2vt−1\+\(1−β2\)mean\_cols⁡\(Ot⊙Ot\)Vt←ExpandRows\(vt\)\(Vt∈Rm×n\)O^t←Ot⊘\(Vt\+ε\)η^=0\.2ηmn/∥O^t∥FWt\+1←Wt−ηλWt−η^O^tendfor\\begin\{aligned\} &\\textbf\{Algorithm: NorMuon\} \\\\ &\\textbf\{Require: \} \\text\{Initial weights \} \\mathbf\{W\}\_0 \\in \\mathbb\{R\}^\{m \\times n\},\\; \\text\{loss \} L,\\; \\text\{learning rate \} \\eta,\\; \\text\{momentum \} \(\\beta\_1, \\beta\_2\),\\; \\text\{perturbation \} \\varepsilon,\\; \\text\{weight decay \} \\lambda \\\\ &\\text\{Initialize \} \\mathbf\{M\}\_0 \\in \\mathbb\{R\}^\{m \\times n\} \\leftarrow \\mathbf\{0\},\\; \\mathbf\{v\}\_0 \\in \\mathbb\{R\}^m \\leftarrow \\mathbf\{0\} \\\\ &\\textbf\{for \} t = 1, 2, \\ldots \\textbf\{ do\} \\\\ &\\quad \\mathbf\{G\}\_t \\leftarrow \\nabla\_\{\\mathbf\{W\}\} L\(\\mathbf\{W\}\_t\) \\\\ &\\quad \\mathbf\{M\}\_t \\leftarrow \\beta\_1 \\mathbf\{M\}\_\{t\-1\} \+ \(1 \- \\beta\_1\)\\, \\mathbf\{G\}\_t \\\\ &\\quad \\mathbf\{O\}\_t \\leftarrow \\mathrm\{polar\}\(\\mathbf\{M\}\_t\) \\\\ &\\quad \{\\color\{\#2563EB\} \\mathbf\{v\}\_t \\leftarrow \\beta\_2 \\mathbf\{v\}\_\{t\-1\} \+ \(1 \- \\beta\_2\)\\, \\operatorname\{mean\\\_cols\}\(\\mathbf\{O\}\_t \\odot \\mathbf\{O\}\_t\)\} \\\\ &\\quad \{\\color\{\#2563EB\} \\mathbf\{V\}\_t \\leftarrow \\mathrm\{ExpandRows\}\(\\mathbf\{v\}\_t\) \\qquad \(\\mathbf\{V\}\_t \\in \\mathbb\{R\}^\{m \\times n\}\)\} \\\\ &\\quad \{\\color\{\#2563EB\} \\widehat\{\\mathbf\{O\}\}\_t \\leftarrow \\mathbf\{O\}\_t \\oslash \\bigl\(\\sqrt\{\\mathbf\{V\}\_t\} \+ \\varepsilon\\bigr\)\} \\\\ &\\quad \\hat\{\\eta\} = 0\.2\\, \\eta \\sqrt\{mn\}\\, /\\, \\\|\\widehat\{\\mathbf\{O\}\}\_t\\\|\_F \\\\ &\\quad \\mathbf\{W\}\_\{t\+1\} \\leftarrow \\mathbf\{W\}\_t \- \\eta \\lambda \\mathbf\{W\}\_t \- \\hat\{\\eta\}\\, \\widehat\{\\mathbf\{O\}\}\_t \\\\ &\\textbf\{end for\} \\end\{aligned\}We show that for tall matrices, this row normalization can significantly reduce polar factor precision, which is undesirable both theoretically and in light of our previous results \([Figure 1](https://blog.tilderesearch.com/blog/aurora#figure-1)\)\. **Claim 1\.**A tall matrix cannot be simultaneously column\-orthogonal and have uniform unit row norms\. That is, ifM∈Rm×n\\mathcal\{M\} \\in \\mathbb\{R\}^\{m \\times n\}withm\>nm \> nsatisfiesM⊤M=In\\mathcal\{M\}^\\top \\mathcal\{M\} = I\_n, thenM\\mathcal\{M\}cannot have all its row norms equal to one\. *Proof\.*Suppose by contradiction that all the rows ofM\\mathcal\{M\}have norm one\. Then tr\(M⊤M\)=tr\(In\)=nandtr\(M⊤M\)=tr\(MM⊤\)=∑i=1m∥Mi,⋅∥22=m\\text\{tr\}\(\\mathcal\{M\}^\\top\\mathcal\{M\}\) = \\text\{tr\}\(I\_n\) = n \\quad \\text\{and\} \\quad \\text\{tr\}\(\\mathcal\{M\}^\\top\\mathcal\{M\}\) = \\text\{tr\}\(\\mathcal\{M\}\\mathcal\{M\}^\\top\) = \\sum\_\{i=1\}^m \\\|\\mathcal\{M\}\_\{i, \\cdot\}\\\|\_2^2 = m Together, these facts imply thatm=nm = n, contradicting our assumption thatM\\mathcal\{M\}is tall\. The average norm of a row inM\\mathcal\{M\}isn/mn/m\.□\\square Therefore, row normalization in NorMuon necessarily introduces a precision defect into the orthogonalization routine\. We find that this polar precision defect can be quite large for matrices with non\-uniform row norms\. In particular, the magnitude of the introduced error can be much larger than the precision improvement of PE\-8 over quintic iteration in the same setting \(see[Appendix A](https://blog.tilderesearch.com/blog/aurora#a-comparing-newton-schulz-iterations)for a full comparison of different Newton\-Schulz iterations\)\. ![Row normalization precision](https://blog.tilderesearch.com/images/blog/aurora/figure1_row_norm_orth_defect.png) **Figure2\.**\(a\) We sample n=1000 random 512×128 matrices for row norm standard deviations in \{1, 2, 3, 4, 5\} and orthogonalize them with different polar factor algorithms \(PE\-8, CANS\-10, CANS\-12 and quintic\); the precision of the resulting polar factors is shown on the plot \(dotted lines\)\. We then apply unit row normalization and plot the updated precision \(dark lines\)\. At a row\-norm standard deviation of 3, the orthogonality defect peaks at 0\.06\. \(b\) Plots the average defect magnitude for each row standard norm deviation\.Indeed, the performance gap between PE\-8 and quintic iteration disappears when these algorithms are used with NorMuon in the same setting\. However, both runs outperform our Muon\+PE\-8 baseline, suggesting that row normalization can be independently useful\. Loading\.\.\. **Figure3\.**NorMuon does not benefit from more precise orthogonalization at 1B scale\. PE\-8 and quintic achieve the same downstream loss, consistent with our result that row normalization degrades the polar factor precision\.To mitigate the defect introduced by row normalization, we can simply normalize tall matrices to have row normsn/m\\sqrt\{n/m\}instead of one\. We call this variant**U\-NorMuon**and find that it outperforms NorMuon in our 340M setting \([Figure 4](https://blog.tilderesearch.com/blog/aurora#figure-4); setting described in[Appendix B](https://blog.tilderesearch.com/blog/aurora#b-training-setting)\)\. Algorithm:U\-NorMuonRequire:Initial weightsW0∈Rm×n,lossL,learning rateη,momentum\(β1,β2\),perturbationε,weight decayλInitializeM0∈Rm×n←0,v0∈Rm←0fort=1,2,…doGt←∇WL\(Wt\)Mt←β1Mt−1\+\(1−β1\)GtOt←polar\(Mt\)vt←β2vt−1\+\(1−β2\)mean\_cols⁡\(Ot⊙Ot\)Vt←ExpandRows\(vt\)\(Vt∈Rm×n\)O^t←Ot⊘\(Vt\+ε\)η^=0\.2ηn/∥O^t∥FWt\+1←Wt−ηλWt−η^O^tendfor\\begin\{aligned\} &\\textbf\{Algorithm: U\-NorMuon\} \\\\ &\\textbf\{Require: \} \\text\{Initial weights \} \\mathbf\{W\}\_0 \\in \\mathbb\{R\}^\{m \\times n\},\\; \\text\{loss \} L,\\; \\text\{learning rate \} \\eta,\\; \\text\{momentum \} \(\\beta\_1, \\beta\_2\),\\; \\text\{perturbation \} \\varepsilon,\\; \\text\{weight decay \} \\lambda \\\\ &\\text\{Initialize \} \\mathbf\{M\}\_0 \\in \\mathbb\{R\}^\{m \\times n\} \\leftarrow \\mathbf\{0\},\\; \\mathbf\{v\}\_0 \\in \\mathbb\{R\}^m \\leftarrow \\mathbf\{0\} \\\\ &\\textbf\{for \} t = 1, 2, \\ldots \\textbf\{ do\} \\\\ &\\quad \\mathbf\{G\}\_t \\leftarrow \\nabla\_\{\\mathbf\{W\}\} L\(\\mathbf\{W\}\_t\) \\\\ &\\quad \\mathbf\{M\}\_t \\leftarrow \\beta\_1 \\mathbf\{M\}\_\{t\-1\} \+ \(1 \- \\beta\_1\)\\, \\mathbf\{G\}\_t \\\\ &\\quad \\mathbf\{O\}\_t \\leftarrow \\mathrm\{polar\}\(\\mathbf\{M\}\_t\) \\\\ &\\quad \\mathbf\{v\}\_t \\leftarrow \\beta\_2 \\mathbf\{v\}\_\{t\-1\} \+ \(1 \- \\beta\_2\)\\, \\operatorname\{mean\\\_cols\}\(\\mathbf\{O\}\_t \\odot \\mathbf\{O\}\_t\) \\\\ &\\quad \\mathbf\{V\}\_t \\leftarrow \\mathrm\{ExpandRows\}\(\\mathbf\{v\}\_t\) \\qquad \(\\mathbf\{V\}\_t \\in \\mathbb\{R\}^\{m \\times n\}\) \\\\ &\\quad \\widehat\{\\mathbf\{O\}\}\_t \\leftarrow \\mathbf\{O\}\_t \\oslash \\bigl\(\\sqrt\{\\mathbf\{V\}\_t\} \+ \\varepsilon\\bigr\) \\\\ &\\quad \\hat\{\\eta\} = 0\.2\\, \\eta \{\\color\{\#2563EB\} n\}\\, /\\, \\\|\\widehat\{\\mathbf\{O\}\}\_t\\\|\_F \\\\ &\\quad \\mathbf\{W\}\_\{t\+1\} \\leftarrow \\mathbf\{W\}\_t \- \\eta \\lambda \\mathbf\{W\}\_t \- \\hat\{\\eta\}\\, \\widehat\{\\mathbf\{O\}\}\_t \\\\ &\\textbf\{end for\} \\end\{aligned\}Now we turn our analysis to wide and square matrices, which also receive unit row norm updates under NorMuon\. A left\-orthogonal wide or square matrix necessarily has all its row norms equal to one, so row\-norm uniformity is*implied*by orthogonality in this case\. Thus, we expect that row\-normalization is unnecessary or perhaps even harmful under precise orthogonalization routines like PE\-8\. Iterative polar factor algorithms can fail to converge precisely early in training when raw gradients tend to be ill\-conditioned, allowing for the accumulation of row norm statistics that quickly become stale\. In this case, row normalization will make rows*less*uniform until the statistics are corrected\. We find evidence of this occurring at 340M scale where row\-normalizing only tall matrices outperforms NorMuon and U\-NorMuon by a small but non\-trivial margin \([Figure 4](https://blog.tilderesearch.com/blog/aurora#figure-4)\)\. Loading\.\.\. **Figure4\.**Column normalization of down projection performs very similarly to up and gate row normalization under NorMuon\. NorMuon applied only to tall matrices outperforms all NorMuon variants and Muon\.We will show that Muon can allow a large subset of neurons in the MLP layers to effectively die, but that this pathology is mitigated by \(U\-\)NorMuon\. Building on ideas in the literature, we propose this as an explanation for the performance gap between Muon and NorMuon that we find in all our settings\. We then derive Aurora which effectively row normalizes updates to tall parameters without sacrificing precision of the polar factor\. ## 3: Normalization Prevents Neuron Death We define a*dead model component*as a subset of model parameters receiving persistently small learning signal after the earliest phases of training\. We identify dead model components with the following three criteria: 1. **Low effective gradient norm\.**Model components which, for a particular batch, receive low effective gradient norm must correspond to relatively flat directions in the loss landscape\. Components with persistently small gradient norms will thus receive little learning signal\. There is a strong precedent in the attribution and pruning literatures for using gradient norms as a proxy for feature importance\[11\]\[12\]\[13\]\. 2. **Low effective*update*norm\.**For networks trained with optimizers like Muon, small grad norms are not themselves sufficient for identifying dead network components\. In particular, the orthogonalization step can drastically alter the gradient before it is applied, including amplifying small magnitude directions\. We therefore require that effective update norm, which measures the actual parameter change, also be small\. 3. **Persistence over training\.**It's possible for parameters to receive small update norms at specific points in training but still contribute non\-trivially to the network output\. For example, some components may learn useful features early and receive smaller updates as those features stabilize\. To distinguish these static but useful components from genuinely dead ones, we require low effective gradient and update norms to be persistent and predictable over training\. Gradient norm\-9\-8\-7\-6\-5\-4\-3\-2\-101log grad normneuron index \(sorted\) deadaliveunpredictable **Figure5\.**Three criteria for identifying a dead neuron: low effective gradient norm, low effective update norm, and persistence of both over training\.We will show that neuron death, as we've defined it, can and does occur in networks trained with Muon because tall matrices in MLP layers are allowed to receive updates with very non\-uniform row\-norms\. In particular, we will show that the Muon update privileges network components that happened to get large updates early in training, allowing others to die\. We find that U\-NorMuon avoids this problem completely\. ## The Mechanism of Neuron Death For a matrixM∈Rm×nM \\in \\mathbb\{R\}^\{m \\times n\}with thin SVDM=UrΣV⊤M = U\_r \\Sigma V^\\top, define the*leverage score*of rowiiasℓi\(M\)=∥\(Ur\)i,⋅∥22\\ell\_i\(M\) = \\\|\(U\_r\)\_\{i,\\cdot\}\\\|\_2^2\. The leverage score of a row is directly proportional to the scale of its update so that rows with consistently low leverage scores in the update matrix receive a consistently small portion of the update mass\. We can show that row leverage is mostly preserved under orthogonalization of reasonably well\-conditioned matrices so that under Muon, rows with low leverage in the raw gradient tend to also have low leverage in the update\. M\-0\.960\.32\-0\.27\-0\.49\-1\.84\-0\.31\-1\.14\-1\.14\-0\.530\.091\.23\-1\.12\-0\.270\.72\-1\.80=Uᵣ0\.07\-0\.28\-0\.80\-0\.58\-0\.400\.42\-0\.37\-0\.56\-0\.280\.55\-0\.250\.110\.46\-0\.620\.30Σ2\.812\.330\.98V⊤0\.200\.90\-0\.390\.540\.230\.810\.82\-0\.37\-0\.44Row leverage scores0\.722r10\.681r20\.532r30\.382r40\.684r5 Click arowin M— Shrink a row/col to see its leverage score drop for rectangular matrices\. **Figure6\.**Interactive illustration of leverage scores\. Each row's leverage score determines its share of update energy under the polar factor\. Rows with small leverage receive proportionally small updates\.**Claim 2 \(Informal\)\.**For tall matrices, rows with small gradient norm still have small norm after orthogonalization, so that the Muon update does not prevent neuron death\. *Intuition\.*The leverage score of rowiimeasures how much rowiiparticipates in the column space ofMM\. A row that has near\-zero norm contributes very little to any singular vector, soUrU\_rplaces little weight on it\. Unlike singular values, which are unit\-normalized by orthogonalization, this row\-wise structure lives inUrU\_rand passes directly through to the polar factor\. For wide or square matrices, the orthogonality constraintUr⊤Ur=IU\_r^\\top U\_r = Iforces leverage scores to be uniform, but for tall matrices there are degrees of freedom under this constraint\. Themmleverage scores must sum tonnbut are otherwise allowed to concentrate on rows arbitrarily\. **Claim 2 \(Formal\)\.**LetM∈Rm×nM \\in \\mathbb\{R\}^\{m \\times n\}withm\>nm \> nand thin SVDM=UrΣV⊤M = U\_r \\Sigma V^\\top\. Suppose rowiisatisfies∥Mi,⋅∥≤ϵ\\\|M\_\{i,\\cdot\}\\\| \\leq \\epsilonfor someϵ≥0\\epsilon \\geq 0\. Thenℓi\(M\)≤ϵ2/σn\(M\)2\\ell\_i\(M\) \\leq \\epsilon^2 / \\sigma\_n\(M\)^2, whereσn\(M\)\\sigma\_n\(M\)is the smallest singular value ofMM\. *Proof\.*We haveMi,⋅=\(Ur\)i,⋅ΣV⊤M\_\{i,\\cdot\} = \(U\_r\)\_\{i,\\cdot\} \\Sigma V^\\top, so\(Ur\)i,⋅=Mi,⋅VΣ−1\(U\_r\)\_\{i,\\cdot\} = M\_\{i,\\cdot\} V \\Sigma^\{\-1\}\. Then: ℓi=∥\(Ur\)i,⋅∥2=∥Mi,⋅VΣ−1∥2≤∥Mi,⋅∥2⋅∥Σ−1∥2=∥Mi,⋅∥2σn\(M\)2≤ϵ2σn\(M\)2□\\begin\{aligned\} \\ell\_i &= \\\|\(U\_r\)\_\{i,\\cdot\}\\\|^2 = \\\|M\_\{i,\\cdot\} V \\Sigma^\{\-1\}\\\|^2 \\leq \\\|M\_\{i,\\cdot\}\\\|^2 \\cdot \\\|\\Sigma^\{\-1\}\\\|^2 \\\\ &= \\frac\{\\\|M\_\{i,\\cdot\}\\\|^2\}\{\\sigma\_n\(M\)^2\} \\leq \\frac\{\\epsilon^2\}\{\\sigma\_n\(M\)^2\} \\quad \\square \\end\{aligned\} Thus, the leverage score \(and hence update\) for a neuron is bounded by the squared norm of the corresponding row in the momentum buffer, normalized by the square of the smallest singular value\. In particular, whenMMis reasonably well\-conditioned \(so thatσn\(M\)\\sigma\_n\(M\)is not too small\) Muon preserves row leverage anisotropy\. This leads to the following self\-reinforcing feedback loop in networks trained with Muon: 1. Dead neurons receive near\-zero gradient rows 2. Momentum accumulates into a bufferMMwith highly anisotropic row norms\. 3. The updates inherit this anisotropy, resulting in dead neurons not being updated\. 4. Hence, dead neurons remain dead\. We find significant empirical evidence that this loop occurs in our standard 1B and 340M transformer settings\. Training0% xW\_gateW\_upSiLU⊙hW\_downyper\-neuron row view \(12 rows of W\_gate / W\_up\)Activation \|hᵢ\|Gradient ‖∇W\[i,:\]‖Update ‖ΔW\[i,:\]‖SiLU\(gᵢ\)σ′\(gᵢ\)score00\.350\.760\.2310\.790\.950\.8120\.540\.860\.3930\.830\.960\.9040\.370\.770\.2850\.880\.971\.0060\.620\.890\.5170\.390\.780\.2080\.390\.790\.3490\.420\.800\.24100\.700\.920\.65110\.460\.820\.28 **Figure7\.**The neuron death feedback loop under Muon, visualized per\-neuron across the SwiGLU MLP\. Each row corresponds to one neuron in gate/up projection\. Dead neurons receive vanishing activations, near\-zero gradient scalars, and near\-zero leverage\. Press play or drag the slider to watch neurons collapse over training\.## Empirical Evidence of Neuron Death We train vanilla transformer models with 340 million non\-embedding parameters following the recipe given in[Appendix B](https://blog.tilderesearch.com/blog/aurora#b-training-setting)using both Muon and U\-NorMuon\. For each, we extract the momentum buffer every 50 steps and compute row\-wise gradient norms and leverage scores\. The U\-NorMuon update has approximately uniform row norms \(and hence leverage scores\) by construction, but leverage scores tend to be very anisotropic for tall matrices under Muon\. Our models use SwiGLU MLPs and we verify later that our results transfer to models that use ReLU2^2\(see[Appendix G](https://blog.tilderesearch.com/blog/aurora#g-neuron-death-in-relu-mlps)\)\. **Neuron death is visible in activation space\.**We first examine the MLP latent statistics directly\. For each neuron, we track the mean and standard deviation of activations across tokens at each training step\. Under Muon, the distribution of gate means diverges sharply: as training progresses we find that a band of neurons drifts toward zero mean with collapsing variance; we interpret this as activation space collapse\. By contrast, gate means and variances remain consistently high under U\-NorMuon\. Loading gate latent statistics\.\.\. **Figure8\.**SwiGLU hidden activations over training for all 2816 neurons in layer 12\. Each dot is one neuron; the x\-axis is the mean of hᵢ across tokens and the y\-axis is log₁₀ of its standard deviation\. Under Muon \(left\), a cluster of neurons collapses to near\-zero variance early in training\. Under U\-NorMuon \(right\), all neurons maintain healthy variance\. Scrub the slider to watch neurons die\.**Muon kills neurons early and permanently\.**We visualize leverage scores for the tall MLP projections of a middle layer over training \([Figure 9](https://blog.tilderesearch.com/blog/aurora#figure-9)\)\. Under Muon, neurons are initially alive with uniformly high leverage, but a large fraction of neurons die during learning rate warmup and never recover\. By step 500, more than one in four neurons are effectively dead, producing a sharply bimodal distribution of leverage scores; one mass of neurons receives near\-zero updates, and the other receives disproportionately large ones\. step0 **Figure9\.**Each square on the grid corresponds to a row in the parameter matrix\. Gray neurons have low leverage and are classified as "dead"\. The distribution of leverage scores for Muon becomes highly anisotropic over training, whereas U\-NorMuon leverage scores are isotropic and there are no dead neurons\.This neuron death is self\-reinforcing and leads to a fundamental collapse in learning dynamics \([Figure 10](https://blog.tilderesearch.com/blog/aurora#figure-10)\)\. On a log\-log plot, row gradient norms and leverage scores are tightly correlated, indicating that the momentum buffer tends to be in the well\-conditioned domain where our bound from[Claim 2](https://blog.tilderesearch.com/blog/aurora#claim-2-formal)is tight\. Loading\\u2026 **Figure10\.**Muon has a strong tail of dead neurons\. These rows have both low gradient norm and row leverage scores\. U\-NorMuon, by contrast, is highly isotropic\. For Muon, the row gradient norm and leverage scores are strongly correlated on log\-log scale\.To verify that neuron death persists throughout training, we split neurons into quartiles by their leverage score at step 100 \(out of 10 thousand\) and track each cohort as training progresses \([Figure 11](https://blog.tilderesearch.com/blog/aurora#figure-11)\)\. The bottom quartile continues to decay, while the remaining cohorts quickly stabilize at high leverage under a "rich get richer" dynamic\. If anisotropy were i\.i\.d\. across rows, early leverage would not predict late leverage\. This strong stratification confirms that, in our experiments, neuron death is a fixed point of learning rather than just training noise\. Training stepLog leverage scoreBottom 10%10–25%25–40%40–55%55–70%70–85%85–95%Top 5%Optimizer Projection Layer **Figure11\.**Rows are grouped into 8 cohorts by their initial leverage score at step 500\. Under Muon, the bottom cohorts collapse to vanishing leverage while the top cohorts stabilize high, and dead neurons stay dead\. Under U\-NorMuon, all cohorts converge toward uniform leverage\. Toggle between optimizers, projections, and layers to explore\.**U\-NorMuon prevents death, and the prevention propagates\.** Under U\-NorMuon, leverage scores are isotropic and approximately distributed according to a Gaussian throughout training with no dead\-neuron tail \([Figures 9](https://blog.tilderesearch.com/blog/aurora#figure-9)–[10](https://blog.tilderesearch.com/blog/aurora#figure-10)\)\. One might expect this to be a trivial consequence of the normalization step forcing uniformity post\-hoc\. Instead, we find that the momentum buffer*itself*converges to a regime with isotropic row norms\. The normalization intervenes most aggressively early in training when anisotropy is developing, and as the buffer stabilizes, the correction shrinks\. More surprisingly, this benefit extends to parameters that U\-NorMuon does not directly normalize\. For example, in[Figure 12](https://blog.tilderesearch.com/blog/aurora#figure-12)we plot column leverage scores for the down projection matrix, which is a wide matrix and thus receives no benefit from row normalization\. By contrast, column leverage becomes anisotropic under Muon\. Dead rows in the up/gate projections starve the corresponding columns of the down projection of gradient signal\. Under U\-NorMuon, keeping the up/gate rows alive ensures isotropic gradient flow into the down projection, stabilizing its column leverage without any direct intervention\. ![Figure 12](https://blog.tilderesearch.com/images/blog/aurora/figure6_down_proj_rownorm.png) **Figure12\.**Even though down projection does not benefit directly from row normalization, the induced dynamics prevent \(column\-wise\) leverage anisotropy\. Alive rows in the up and gate projections in turn induce gradient flow in the columns of the down projection\.**Takeaway\.**U\-NorMuon's benefit is not simply normalization\. U\-NorMuon breaks a pathological self\-reinforcing collapse that wastes a quarter or more of the network's MLP parameters\. However, it does so by forcefully overriding the polar factor with uniform row norms, sacrificing precision of the polar factor which is both empirically and theoretically desirable in the Muon framework\. We will now present Aurora, which achieves both goals simultaneously\. We show that Aurora has better downstream performance than both Muon and \(U\-\)NorMuon at 1B scale and on the modded\-nanoGPT optimizers track\. ## 4: Augmenting the Muon Objective Recall that the Muon update direction can be motivated as the solution to the following constrained optimization problem: U∗=arg⁡max⁡UTr\(G⊤U\)subject to∥U∥2≤1U^\* = \\arg\\max\_\{U\} \\text\{Tr\}\(G^\\top U\) \\quad \\text\{subject to\} \\quad \\\|U\\\|\_2 \\le 1WhereG=∇WLG = \\nabla\_W \\mathcal\{L\}is the gradient with respect to a matrix parameterWW, and∥U∥2\\\|U\\\|\_2is the spectral norm \(ignoring the learning rate scalar for simplicity\)\. There is no constraint on the uniformity of the row norms of the updateU∈Rm×nU\\in \\mathbb\{R\}^\{m \\times n\}, which we found can be quite far from uniform for tall matrices \([Sections 2](https://blog.tilderesearch.com/blog/aurora#2-muon-kills-neurons)&[3](https://blog.tilderesearch.com/blog/aurora#3-normalization-prevents-neuron-death)\)\. Motivated by our results in[Section 3](https://blog.tilderesearch.com/blog/aurora#3-normalization-prevents-neuron-death), we augment the Muon objective for tall matrices \(m≥nm \\ge n\) with the additional requirement that all the update row norms be equal ton/m\\sqrt\{n/m\}: U∗=arg⁡max⁡UTr\(G⊤U\)s\.t\.∥U∥2≤1,∥Ui:∥2=nm∀iU^\* = \\arg\\max\_\{U\} \\text\{Tr\}\(G^\\top U\) \\quad \\text\{s\.t\.\} \\quad \\\|U\\\|\_2 \\le 1, \\quad \\\|U\_\{i:\}\\\|^2 = \\frac\{n\}\{m\} \\;\\;\\forall\\, iThe constant row norm constraint fixes the squared Frobenius norm to∥U∥F2=m⋅\(n/m\)=n\\\|U\\\|\_F^2 = m \\cdot \(n/m\) = n\. Because the squared Frobenius norm also equals the sum of the squared singular values and the spectral norm bound implies eachσj≤1\\sigma\_j \\le 1, allnnsingular values must exactly equal 1\. This forcesUUto be left semi\-orthogonal\. Therefore, the problem reduces to: U∗=arg⁡max⁡UTr\(G⊤U\)s\.t\.U⊤U=In,∥Ui:∥2=nm∀iU^\* = \\arg\\max\_\{U\} \\text\{Tr\}\(G^\\top U\) \\quad \\text\{s\.t\.\} \\quad U^\\top U = I\_n, \\quad \\\|U\_\{i:\}\\\|^2 = \\frac\{n\}\{m\} \\;\\;\\forall\\, iThe solution is steepest descent under two constraints: a left semi\-orthogonality constraint to make the operator RMS\-to\-RMS norm preserving, and a constant row norm constraint to enforce uniform updates\. Solving this problem yields Aurora\. ## A Riemannian Solution We first consider a Riemannian formulation of the constrained update\-selection problem\. We restrict the parameter updateUUto lie on the joint Stiefel/equal\-row\-leverage manifold M=\{U∈Rm×n:U⊤U=I,diag\(UU⊤\)=nm1\}\\mathcal\{M\} = \\\{U \\in \\mathbb\{R\}^\{m \\times n\} : U^\\top U = I, \\; \\text\{diag\}\(UU^\\top\) = \\frac\{n\}\{m\}\\mathbf\{1\}\\\}Given a matrix gradientGG, we seek the feasible update most aligned withGG: max⁡U⟨G,U⟩s\.t\.U∈M\\max\_U \\langle G, U \\rangle \\quad \\text\{s\.t\.\} \\quad U \\in \\mathcal\{M\}A Riemannian approach approximately solves this problem by projecting the Euclidean directionGGonto the tangent space ofM\\mathcal\{M\}, taking a step along the projected direction, and retracting the result back towardM\\mathcal\{M\}\([Figure 13](https://blog.tilderesearch.com/blog/aurora#figure-13)\)\. We refer to this reference solver as**Riemannian\-Aurora**; the full derivation is given in[Appendix D](https://blog.tilderesearch.com/blog/aurora#d-solving-for-the-riemannian-gradient)\. Algorithm:Riemannian\-AuroraRequire:MomentumG∈Rm×n,step sizeη,iterationsTSetr←n/mInitializeU0∈M\(e\.g\. polar projection ofGfollowed by row/Stiefel projection\)fort=0,1,…,T−1doBt←sym⁡\(Ut⊤G\)Pt←UtUt⊤qt←diag⁡\(GUt⊤\)−diag⁡\(UtBtUt⊤\)Solve:\(rIm−Pt∘Pt\)λt=qtDt←diag⁡\(λt\)St←Bt−Ut⊤DtUtZt←G−UtSt−DtUtYt←Ut\+ηZtUt\+1←Retract⁡\(Yt\)endforReturnUT\\begin\{aligned\} &\\textbf\{Algorithm: Riemannian\-Aurora\} \\\\ &\\textbf\{Require: \} \\text\{Momentum \} \\mathbf\{G\} \\in \\mathbb\{R\}^\{m \\times n\},\\; \\text\{step size \} \\eta,\\; \\text\{iterations \} T \\\\ &\\text\{Set \} r \\leftarrow n/m \\\\ &\\text\{Initialize \} \\mathbf\{U\}\_0 \\in \\mathcal\{M\} \\text\{ \(e\.g\. polar projection of \} \\mathbf\{G\} \\text\{ followed by row/Stiefel projection\)\} \\\\ &\\textbf\{for \} t = 0, 1, \\ldots, T\-1 \\textbf\{ do\} \\\\ &\\quad \\mathbf\{B\}\_t \\leftarrow \\operatorname\{sym\}\(\\mathbf\{U\}\_t^\\top \\mathbf\{G\}\) \\\\ &\\quad \\mathbf\{P\}\_t \\leftarrow \\mathbf\{U\}\_t \\mathbf\{U\}\_t^\\top \\\\ &\\quad \\mathbf\{q\}\_t \\leftarrow \\operatorname\{diag\}\(\\mathbf\{G\}\\mathbf\{U\}\_t^\\top\) \- \\operatorname\{diag\}\(\\mathbf\{U\}\_t \\mathbf\{B\}\_t \\mathbf\{U\}\_t^\\top\) \\\\ &\\quad \\text\{Solve: \} \\bigl\(r \\mathbf\{I\}\_m \- \\mathbf\{P\}\_t \\circ \\mathbf\{P\}\_t\\bigr\)\\,\\boldsymbol\{\\lambda\}\_t = \\mathbf\{q\}\_t \\\\ &\\quad \\mathbf\{D\}\_t \\leftarrow \\operatorname\{diag\}\(\\boldsymbol\{\\lambda\}\_t\) \\\\ &\\quad \\mathbf\{S\}\_t \\leftarrow \\mathbf\{B\}\_t \- \\mathbf\{U\}\_t^\\top \\mathbf\{D\}\_t \\mathbf\{U\}\_t \\\\ &\\quad \\mathbf\{Z\}\_t \\leftarrow \\mathbf\{G\} \- \\mathbf\{U\}\_t \\mathbf\{S\}\_t \- \\mathbf\{D\}\_t \\mathbf\{U\}\_t \\\\ &\\quad \\mathbf\{Y\}\_t \\leftarrow \\mathbf\{U\}\_t \+ \\eta\\, \\mathbf\{Z\}\_t \\\\ &\\quad \\mathbf\{U\}\_\{t\+1\} \\leftarrow \\operatorname\{Retract\}\(\\mathbf\{Y\}\_t\) \\\\ &\\textbf\{end for\} \\\\ &\\textbf\{Return \} \\mathbf\{U\}\_T \\end\{aligned\}StiefelObliqueUMMuonAuroraGMuon: tangent to Stiefel only · Aurora: tangent to both constraintsDrag G to see how the projections diverge **Figure13\.**2D cross\-section of the Riemannian geometry\. The Stiefel and oblique manifolds intersect at U on the joint manifold M\. Muon projects the gradient onto the Stiefel tangent only, drifting from the oblique constraint\. Aurora projects onto the joint tangent, respecting both constraints simultaneously\.Unlike post\-hoc row normalization, which first computes a Muon update and then rescales its rows, Riemannian\-Aurora treats orthogonality and equal row leverage as a single joint constraint\. Both the initialization ofU0U\_0and retraction are done approximately via alternately computing the polar factor and row normalizing the update in our implementation\. Some of the steps in Riemannian\-Aurora are particularly expensive, making it infeasible for use on moderately\-sized networks as written\. Namely: 1. The tangent space projection involves solving anmm\-dimensional linear system involvingrI−P⊙PrI \- P \\odot P, whereP=UU⊤P = UU^\\top\. For anm×nm \\times nmatrix, explicitly formingPPcostsO\(m2n\)O\(m^2 n\)memory/computation and storing it costsO\(m2\)O\(m^2\), which is prohibitive at moderate scales\. 2. Solving for the tangent constraint with conjugate gradient requires repeated projections of the formv↦\(P⊙P\)vv \\mapsto \(P \\odot P\)v, i\.e\. dense row interactions\. We present it mostly as a faithful solution to our dual\-constraint problem and as a comparison point for our practical instantiation of Aurora\. We include a small\-scale verification of Riemannian\-Aurora in[Appendix E](https://blog.tilderesearch.com/blog/aurora#e-cifar-10-results)\. ## An Iterative Approximation: Aurora Though the joint Stiefel/equal\-row\-norm projection has no simple closed form, it admits a natural iterative approximation which alternates between the two constraint sets, starting with row normalization, then using Newton\-Schultz iteration to approximate the polar factor\. In practice, we use a damped iteration with an EMA of row normalization factors to improve convergence and prevent oscillation\. This requires just one additional damping parameterβ\\beta\. Algorithm:AuroraInput:Momentum bufferM∈Rm×n,NS iterationsK,damping parameterβX0=M/∥M∥FD0=Imfork=0,1,…,K−1dork=\(∥\(Xk\)1,:∥,…,∥\(Xk\)m,:∥\)Dk=Dk−1β⋅diag\(rk\)\(1−β\)X~k=n/m⋅Dk−1XkXk\+1=polar\(X~k\)endforReturnXK\\begin\{aligned\} &\\textbf\{Algorithm: Aurora\} \\\\ &\\textbf\{Input: \} \\text\{Momentum buffer \} M \\in \\mathbb\{R\}^\{m \\times n\}, \\text\{ NS iterations \} K, \\text\{ damping parameter \} \\beta \\\\ &X\_0 = M / \\\|M\\\|\_F \\\\ &D\_0 = I\_m \\\\ &\\textbf\{for \} k = 0, 1, \\ldots, K\-1 \\textbf\{ do\} \\\\ &\\quad r\_k = \(\\\|\(X\_k\)\_\{1,:\}\\\|, \\ldots, \\\|\(X\_k\)\_\{m,:\}\\\|\) \\\\ &\\quad D\_k = D\_\{k\-1\}^\\beta \\cdot \\text\{diag\}\(r\_k\)^\{\(1\-\\beta\)\} \\\\ &\\quad \\tilde\{X\}\_k = \\sqrt\{n/m\} \\cdot D\_k^\{\-1\} X\_k \\\\ &\\quad X\_\{k\+1\} = \\mathrm\{polar\}\(\\tilde\{X\}\_k\) \\\\ &\\textbf\{end for\} \\\\ &\\textbf\{Return \} X\_K \\end\{aligned\}These steps alternate because the polar factor computation can disturb row\-norm uniformity and vice versa\. We show that, under mild conditions, this alternating projection converges to a point on or near the intersection of the Stiefel and row\-oblique manifolds \(proof in[Appendix F](https://blog.tilderesearch.com/blog/aurora#f-alternating-iteration-converges)\)\. Notably, the final iterate is always on the Stiefel manifold as the last step is always orthogonalization\. Aurora is roughlyKKtimes more compute expensive than Muon but in practice, we can effectively overlap much of this overhead with gradient communication in distributed settings\. We present an interactive simulation of vanilla \(iterative\) Aurora below\. StiefelObliqueOrthogonality defectRow norm spread0\.000\.430\.870\.000\.060\.11 X 0\.044 0\.224 0\.229 \-0\.192 0\.322 0\.374 \-0\.235 \-0\.360 0\.430 0\.131 0\.501 0\.518 \-0\.487 \-0\.033 0\.488 \-0\.275 \-0\.200 0\.340 ‖row‖ XᵀX orth defect0\.75448 row norm std0\.09671 Anisotropyσ0\.10 Normalizationβ1\.00 0/20 **Figure14\.**Aurora alternates between row normalization and polar factor computation\. Each iteration improves both orthogonality and row\-norm uniformity, converging to a point on the joint Stiefel–oblique manifold\.## 5: Results ## nanoGPT Speedrun We evaluate Aurora on the modded\-nanoGPT speedrun benchmark, comparing against NorMuon and the previous state\-of\-the\-art run\. Pure Aurora \(2 damping iterations,β=0\.5\\beta = 0\.5\) improves on NorMuon by 25 steps, reaching the 3\.28 target loss in 3225 steps\. When combined with Contra\-Muon and update/weight flooring, Aurora achieves a new SoTA of 3175 steps\. DescriptionKey hparamsS→3\.28mean@SNorMuon \(\#10 leaderboard\)lr=0\.035, wd=0\.02532503\.2789Aurora \(pure, ours\)lr=0\.05, wd=0\.025, pp\_iterations=2, pp\_beta=0\.532253\.2772ContraNorMuon w/ update clamping \(\#11, prior SOTA\)Setup from \#9 \+ Contra\-Muon32253\.2785Aurora \+ Contra \+ u/w\-floor \(ours, NEW SOTA\)lr=0\.0375, wd=0, pp\_iterations=2, pp\_beta=0\.531753\.2796 Loading\.\.\. **Figure15\.**nanoGPT speedrun convergence curves\. Aurora \+ Contra\-Muon reaches the 3\.28 eval loss target at step 3175, setting a new state\-of\-the\-art on the optimization track\.## Aurora 1\.1B Pretraining We train 1\.1B\-parameter transformers on ~100B tokens \(see[Appendix B](https://blog.tilderesearch.com/blog/aurora#b-training-setting)for full details\) and compare Aurora against Muon and NorMuon, each using PE\-8\. Aurora achieves the lowest final loss of all methods, reaching a smoothed loss of 2\.26 at step 24k, which is a clear improvement over Muon \(2\.31\) and NorMuon \(2\.33\)\. Loading\.\.\. **Figure16\.**1\.1B pretraining loss\. Aurora \(green\) converges faster and reaches a lower final loss than all Muon and NorMuon baselines\. All runs train for 24k steps\.**Downstream evaluations\.**Aurora's loss improvement translates to consistent gains on standard benchmarks\. On Hellaswag, Aurora achieves 67\.6% \- a 2\.5 point improvement over both Muon and NorMuon\. Gains are similarly consistent on ARC\-Challenge and Winogrande\. Strikingly, Aurora improves MMLU scores by 10 points over Muon\. We hypothesize that since MLPs are predominantly responsible for memorization, Aurora's gains are most visible on memorization\-intensive benchmarks like MMLU\. 01020304050607065\.164\.667\.6Hellaswag43\.442\.743\.5ARC\-C59\.762\.263\.1Winogrande27\.125\.837\.9MMLU\+10\.8MuonNorMuonAurora Benchmark 55%60%65%70%75%100B1T10T100Tpretraining tokens \(log scale\)Hellaswag\(%\)~20–360x fewer tokens67\.6%Aurora\-1\.1B62\.94%Gemma3\-1B67%Qwen2\-1\.5B65\.73%Llama3\.2\-1B67\.9%Qwen2\.5\-1\.5B67\.09%Qwen3\-1\.7B **Figure17\.**Downstream benchmark comparison across optimizers \(top\) and token efficiency relative to Qwen models \(bottom\)\. Aurora matches Qwen2\-1\.5B on Hellaswag and ARC\-C despite using 400M fewer parameters and ~100x fewer pretraining tokens\. Toggle benchmarks on the scatter plot to compare\.Remarkably, Aurora\-1\.1B matches or approaches Qwen3\-1\.7B \(36T tokens\) on Hellaswag and Winogrande despite training on two orders of magnitude fewer tokens, having 400M fewer parameters, and an uncurated data mixture\. The scatter plot reveals the striking token efficiency: Aurora sits far to the left of frontier small language models on the log\-token axis while achieving competitive scores on benchmarks that primarily test language understanding rather than domain\-specific knowledge\. Note that our 1\.1B model is trained on ~100B tokens of the Nemotron CC dataset, which contains no explicit math or science data, explaining the lower MMLU scores compared to models trained on curated multi\-domain corpora\. Notably, despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open\-source internet\-only data, we match Qwen3\-1\.7B on several benchmarks\. ## Update Quality We verify that Aurora's damped iteration actually solves the joint Stiefel–oblique problem for real momentum buffers, and compare it against the Riemannian reference solver from[Section 4](https://blog.tilderesearch.com/blog/aurora#a-riemannian-solution)\. **Gradient alignment vs\. leverage uniformity** ![Gradient alignment vs row norm CV](https://blog.tilderesearch.com/images/blog/aurora/figurex_pareto_alignment_vs_uniformity.png) **Figure18\.**Comparison of alignment vs row norm coefficient of variation tradeoff for Aurora, Riemannian\-Aurora, and standard Muon\. All updates lie on the Stiefel manifold by construction\. Data is computed on a real momentum buffer from a Muon model very early in training \(before pathology\)\.[Figure 18](https://blog.tilderesearch.com/blog/aurora#figure-18)shows the tradeoff between gradient alignment \(⟨G,U⟩/∥G∥\\langle G, U \\rangle / \\\|G\\\|\) and row\-norm coefficient of variation \(CV\) as a function of iteration count, computed on a real momentum buffer from early in training\. Standard Muon achieves high alignment but poor row\-norm uniformity\. Aurora and Riemannian\-Aurora converge to low CV with high alignment, but from opposite directions\. Whereas Aurora starts orthogonal and corrects row norms, Riemannian\-Aurora starts from the tangent projection and retracts onto the manifold\. Riemannian\-Aurora achieves 0\.6% higher gradient alignment at convergence, suggesting that the practical Aurora iteration sacrifices very little relative to the reference solver\. **Leverage score distributions** [Figure 19](https://blog.tilderesearch.com/blog/aurora#figure-19)compares leverage score distributions across optimizers\. Muon produces a heavy tail of dead rows spanning several orders of magnitude\. Aurora with 1 damping iteration \(equivalent to Muon\-Eq\) improves on Muon but retains visible spread\. Aurora with 2 iterations produces by far the most isotropic distribution, tightly clustered around the uniform valuenm\\frac\{n\}\{m\}\. Loading\.\.\. **Figure19\.**Distribution of leverage scores for Muon, U\-NorMuon, Aurora with 1 iter \(Muon\-Eq\) and Aurora with 2 iters\. Aurora with 2 iters has by far the most isotropic row leverage scores of all the approaches, tightly clustered around uniform\.**Weight spectra** We examine the singular value spectra of the trained weight matrices\. Under NorMuon, the down projection spectra are inflated by ~100x \(σmax≈500\\sigma\_\{max\} \\approx 500to15001500vs Aurora's1111to2121\), while gate projection and up projection spectra remain comparable\. This is consistent with the weight matrix absorbing scale information that the row\-normalized update discards\. Loading\.\.\. **Figure20\.**Singular value spectra of trained weight matrices \(log scale\)\. Gate and up projections have similar spectra across optimizers, but NorMuon's down projection is ~100x larger, with worse conditioning\. κ denotes the condition number\. Slide to compare across layers\.## 6: Inflection Four things excite us about this work: **1\. Strong empirical gains\.**We stress that Aurora 1\.1B was trained to be a research artifact: we did not do any tricks with data, architecture, or optimizer beyond simply using Aurora, to isolate its effect\. Despite this, it achieved 100x data efficiency on several key benchmarks, and we see this as very promising\. **2\. Architecture\-Optimizer Codesign\.**Aurora is an optimizer*specifically for up and gate projections*, which we show corrects a fundamental pathology in model training dynamics under Muon\. We believe this is a nice example of optimizer\-architecture co\-design at a level of granularity that is very uncommon in the literature\. Crucially, Aurora's advantage over Muon grows monotonically with MLP expansion factor \([Figure 21](https://blog.tilderesearch.com/blog/aurora#figure-21)\), suggesting that wider MLPs, which increase the row\-to\-column ratio of the up and gate projections, amplify exactly the pathology Aurora corrects\. 0\-0\.02\-0\.04\-0\.060\.5×2×4×8×16×\-0\.0019\-0\.0045\-0\.0161\-0\.0399\-0\.0623MLP expansion factorΔ CE \(Aurora − Muon\)Aurora better ↓ **Figure21\.**Aurora's advantage over Muon scales with MLP expansion factor\. Delta CE \(Aurora − Muon\) at step 3000 of 340M\-parameter runs across MLP hidden ratios from 0\.5× to 16×\. More negative = larger Aurora gain\.**3\. Compression / Dense Solutions\.**Aurora's row uniformity constraint is complementary to Muon's motivating constraint\. We propose thinking about the row\-uniformity constraint as encouraging*compression\.*Under Aurora, networks tend to more effectively utilize the parameters in their MLPs; that is, models tend to denser solutions\. The empirical success of Aurora indicates that dense solutions to language modelling can be feasible\. **4\. Bottom\-Up Optimizer Design\.**Aurora was derived by mechanistically studying a specific pathology in Muon and designing the minimal objective that corrects it\. This contrasts with the top\-down approach that dominates optimizer research, where one posits a mathematical formulation and hopes it yields empirical gains; an approach that offers little guidance on which of many sound framings to choose, and no explanation for why one works better than another \([Figure 22](https://blog.tilderesearch.com/blog/aurora#figure-22)\)\. We strongly believe the field would benefit from more bottom\-up research as a whole\. Top\-Downposit formulation → validateBottom\-Upstudy mechanisms → address pathologyDesirable propertyspectral norm, orthogonalityσ₁ = σ₂ = ⋯ = σₖDerive optimizersteepest descent under constraintEmpirical validationtrain models, measure lossWorks ✓Fails ✗newideaHard part: no signal on why it failsStudy existing optimizerscompare dynamics, ablate componentsIdentify pathologye\.g\. neuron death under MuonDesign targeted fixminimal objective for pathologyValidate & discoverempirical gains \+ new questionsiterateHard part: principled fix given signal **Figure22\.**Two philosophies of optimizer design\. Top\-down begins with a desirable mathematical property and validates empirically, but offers little signal when validation fails\. Bottom\-up studies existing optimizers mechanistically, identifies specific pathologies, and designs targeted fixes, yielding both better optimizers and deeper understanding\.Understanding how Muon fails on tall matrices led us to a better objective, a better optimizer, and "denser" models\. We are interested in faster and more precise methods of solving the dual\-constraint steepest descent objective, and extending our results to large networks with wide MLPs\. **Cite this work** ``` @article{dewulf2026aurora, title = {Aurora: A Leverage-Aware Optimizer for Rectangular Matrices}, author = {Dewulf, Alec and Pai, Dhruv and Yang, Li and Zhang, Ashley and Keigwin, Ben}, year = {2026}, url = {https://tilderesearch.com/blog/aurora} } ``` ## Appendix ## A\. Comparing Newton\-Schulz Iterations We present a cost analysis of different polynomials for iterative orthogonalization\. As discussed in Sections[2](https://blog.tilderesearch.com/blog/aurora#2-tension-due-to-row-normalization)&[5](https://blog.tilderesearch.com/blog/aurora#5-results), we find that more precise polynomials are always helpful for Muon and Aurora, which justifies the additional computation in many settings\. Method & StepsMatmulsDerivative at 0Max Deviation from 1 \(SVs ≥ 0\.01\)Simple\-5 \[1\]154840\.318169Quintic\-5 \[21\]154920\.031775Polar Express \(PE\)\-8 \[5\]246,8222\.22e\-16CANS\-12 \[6\]246,8681\.11e\-16 ![NS transfer functions](https://blog.tilderesearch.com/images/blog/aurora/appendx_ns.png) Transfer functions for Newton\-Schulz iterations\. PE\-8 and CANS\-12 converge to machine precision, while quintic retains a persistent spectral error of ~0\.03\.CANS\-12 and Polar Express\-8 are both much more precise than quintic iteration, which has a persistent spectral error of 0\.3, even in the limit\. Both CANS\-12 and PE\-8 achieve nearly perfect accuracy for reasonably well\-conditioned gradients\. ## B\. Training Setting We report exact configurations and training settings for our 340M and 1\.1B runs to maximize the reproducibility of our results\. We arrived at these settings via hyperparameter sweeps; we spent roughly the same compute sweeping each optimizer and setting\. We used an internal tokenizer with 128k vocab size for all our experiments, which was released in the model repository\. For reproducibility, we trained on fully open\-source internet data from NVIDIA Nemotron CC v2 \[19\]\. **Transformer 340M:** CategoryDetailsTraining Configuration800k tokens per batch, 8192 token sequence length, WSD scheduleData10\.5B tokens of NemotronCCv2 \[19\] from the high quality split\(Nor\)Muon Hyperparameterslr=8e\-3Aurora HyperparametersTwo projection iterations, damping beta=0\.5\. All other settings from NorMuonModel Architecturehidden\_size=1024, layers=24, GQA 4:1, RoPE, QKNorm, ShortConv, Gated Attention **Transformer 1\.1B:** CategoryDetailsTraining Configuration4M tokens per batch, 2048 token sequence length, WSD with cosine decay scheduleData100B tokens of NemotronCCv2 \[19\] from the high quality splitOptimizer HyperparametersSame as the 340M settings in all cases, with appropriate hyperparameter transferModel Architecturehidden\_size=2048, layers=24, GQA 4:1, RoPE, QKNorm, ShortConv, Gated Attention ## C\. Literature on Parameter Death In Transformer FFNs, prior work has shown that intermediate hidden dimensions are not uniformly utilized: FFN activations are often highly sparse, and some intermediate neurons can remain inactive over large corpora, often referred to as*dead*or*dormant*neurons\[14\]\[15\]\[16\]\. These neurons correspond to the hidden FFN dimensions between the up/gate projections and the down projection\. When a hidden neuron remains inactive or near\-zero for many tokens/steps across training, the gradients to its associated weights are greatly reduced\. As a result, only a subset of the FFN intermediate dimensions and their corresponding weight matrices effectively participate in computation, reducing the model's effective capacity and parameter utilization\[17\]\. ## D\. Solving for the Riemannian Gradient We want to solve max⁡U∈Rm×n⟨G,U⟩s\.t\.U⊤U=In,diag⁡\(UU⊤\)=nm1\\max\_\{U \\in \\mathbb\{R\}^\{m \\times n\}\} \\langle G, U \\rangle \\quad \\text\{s\.t\.\} \\quad U^\\top U = I\_n, \\qquad \\operatorname\{diag\}\(UU^\\top\) = \\frac\{n\}\{m\}\\mathbf\{1\}Letr:=nmr := \\frac\{n\}\{m\}\. The feasible set is the equal\-row\-norm Stiefel manifold M=\{U∈Rm×n:U⊤U=In,∥Ui:∥22=r∀i\}\\mathcal\{M\} = \\left\\\{ U \\in \\mathbb\{R\}^\{m \\times n\} : U^\\top U = I\_n, \\; \\\|U\_\{i:\}\\\|\_2^2 = r \\;\\; \\forall\\, i \\right\\\}**Tangent space\.**LetZ∈Rm×nZ \\in \\mathbb\{R\}^\{m \\times n\}be a tangent vector atU∈MU \\in \\mathcal\{M\}\. The Stiefel constraint requires\(U\+ϵZ\)⊤\(U\+ϵZ\)=In\+O\(ϵ2\)\(U \+ \\epsilon Z\)^\\top \(U \+ \\epsilon Z\) = I\_n \+ O\(\\epsilon^2\)for smallϵ\>0\\epsilon \> 0\. Keeping only first\-order terms gives U⊤Z\+Z⊤U=0U^\\top Z \+ Z^\\top U = 0The row\-norm constraint requires∥Ui:\+ϵZi:∥22=r\+O\(ϵ2\)\\\|U\_\{i:\} \+ \\epsilon Z\_\{i:\}\\\|\_2^2 = r \+ O\(\\epsilon^2\)for allii\. Expanding to first order gives⟨Ui:,Zi:⟩=0\\langle U\_\{i:\}, Z\_\{i:\} \\rangle = 0for allii, or equivalentlydiag⁡\(ZU⊤\)=0\\operatorname\{diag\}\(ZU^\\top\) = 0\. Therefore, TUM=\{Z∈Rm×n:U⊤Z\+Z⊤U=0,diag⁡\(ZU⊤\)=0\}T\_U \\mathcal\{M\} = \\left\\\{ Z \\in \\mathbb\{R\}^\{m \\times n\} : U^\\top Z \+ Z^\\top U = 0, \\; \\operatorname\{diag\}\(ZU^\\top\) = 0 \\right\\\}**Riemannian gradient\.**We compute the Riemannian gradient by projectingGGonto the tangent space: Z=arg⁡min⁡Z~∥Z~−G∥F2s\.t\.U⊤Z~\+Z~⊤U=0,diag⁡\(Z~U⊤\)=0Z = \\arg\\min\_\{\\widetilde\{Z\}\} \\\|\\widetilde\{Z\} \- G\\\|\_F^2 \\quad \\text\{s\.t\.\} \\quad U^\\top \\widetilde\{Z\} \+ \\widetilde\{Z\}^\\top U = 0, \\qquad \\operatorname\{diag\}\(\\widetilde\{Z\}U^\\top\) = 0The projected vector has the formZ=G−US−DUZ = G \- US \- DU, whereS=S⊤∈Rn×nS = S^\\top \\in \\mathbb\{R\}^\{n \\times n\}andD=diag⁡\(λ\)∈Rm×mD = \\operatorname\{diag\}\(\\lambda\) \\in \\mathbb\{R\}^\{m \\times m\}\. TheUSUSterm enforces the Stiefel tangent constraint, while theDUDUterm enforces the row\-norm tangent constraint\. Define B:=sym⁡\(U⊤G\)=12\(U⊤G\+G⊤U\),P:=UU⊤B := \\operatorname\{sym\}\(U^\\top G\) = \\frac\{1\}\{2\}\(U^\\top G \+ G^\\top U\), \\qquad P := UU^\\topWe impose the Stiefel tangent constraintU⊤Z\+Z⊤U=0U^\\top Z \+ Z^\\top U = 0\. SubstitutingZ=G−US−DUZ = G \- US \- DU, we get U⊤\(G−US−DU\)\+\(G−US−DU\)⊤U=0U^\\top\(G \- US \- DU\) \+ \(G \- US \- DU\)^\\top U = 0UsingU⊤U=InU^\\top U = I\_n,S=S⊤S = S^\\top, andD=D⊤D = D^\\top, this becomes U⊤G\+G⊤U−2S−2U⊤DU=0U^\\top G \+ G^\\top U \- 2S \- 2U^\\top DU = 0Therefore, S=sym⁡\(U⊤G\)−U⊤DU=B−U⊤DUS = \\operatorname\{sym\}\(U^\\top G\) \- U^\\top DU = B \- U^\\top DUNow impose the row\-norm tangent constraintdiag⁡\(ZU⊤\)=0\\operatorname\{diag\}\(ZU^\\top\) = 0\. UsingZ=G−US−DUZ = G \- US \- DU, this gives diag⁡\(GU⊤\)−diag⁡\(USU⊤\)−diag⁡\(DUU⊤\)=0\\operatorname\{diag\}\(GU^\\top\) \- \\operatorname\{diag\}\(USU^\\top\) \- \\operatorname\{diag\}\(DUU^\\top\) = 0Sincediag⁡\(UU⊤\)=r1\\operatorname\{diag\}\(UU^\\top\) = r\\mathbf\{1\}andD=diag⁡\(λ\)D = \\operatorname\{diag\}\(\\lambda\), we havediag⁡\(DUU⊤\)=rλ\\operatorname\{diag\}\(DUU^\\top\) = r\\lambda\. SubstitutingS=B−U⊤DUS = B \- U^\\top DUintodiag⁡\(USU⊤\)\\operatorname\{diag\}\(USU^\\top\)gives diag⁡\(USU⊤\)=diag⁡\(UBU⊤\)−diag⁡\(UU⊤DUU⊤\)\\operatorname\{diag\}\(USU^\\top\) = \\operatorname\{diag\}\(UBU^\\top\) \- \\operatorname\{diag\}\(UU^\\top DUU^\\top\)WithP=UU⊤P = UU^\\top, this becomesdiag⁡\(USU⊤\)=diag⁡\(UBU⊤\)−diag⁡\(PDP\)\\operatorname\{diag\}\(USU^\\top\) = \\operatorname\{diag\}\(UBU^\\top\) \- \\operatorname\{diag\}\(PDP\)\. SinceD=diag⁡\(λ\)D = \\operatorname\{diag\}\(\\lambda\), diag⁡\(PDP\)=\(P∘P\)λ\\operatorname\{diag\}\(PDP\) = \(P \\circ P\)\\lambdaThereforediag⁡\(USU⊤\)=diag⁡\(UBU⊤\)−\(P∘P\)λ\\operatorname\{diag\}\(USU^\\top\) = \\operatorname\{diag\}\(UBU^\\top\) \- \(P \\circ P\)\\lambda\. Plugging this into the row constraint gives diag⁡\(GU⊤\)−\[diag⁡\(UBU⊤\)−\(P∘P\)λ\]−rλ=0\\operatorname\{diag\}\(GU^\\top\) \- \\left\[\\operatorname\{diag\}\(UBU^\\top\) \- \(P \\circ P\)\\lambda\\right\] \- r\\lambda = 0Rearranging, \(rI−P∘P\)λ=diag⁡\(GU⊤\)−diag⁡\(UBU⊤\)\(rI \- P \\circ P\)\\lambda = \\operatorname\{diag\}\(GU^\\top\) \- \\operatorname\{diag\}\(UBU^\\top\)After solving this linear system forλ\\lambda, defineD=diag⁡\(λ\)D = \\operatorname\{diag\}\(\\lambda\),S=B−U⊤DUS = B \- U^\\top DU, and the tangent projection isZ=G−US−DUZ = G \- US \- DU\. ## E\. CIFAR\-10 Results We train a 2\-layer MLP with ReLU for 3 epochs on CIFAR\-10 as a toy verification of Riemannian\-Aurora\. We useK=20K=20conjugate gradient iterations, Nesterov Momentum, lr=4e\-3, 2 retraction steps, PE\-8, a Riemannian learning rate ofη=0\.1\\eta = 0\.1and an AdamW learning rate of 4e\-3 with betas\(0\.9,0\.95\)\(0\.9, 0\.95\)\. We found these parameters to be approximately optimal, after a brief sweep\. We find that Riemannian\-Aurora outperforms the AdamW baseline by a significant amount, and performs very similarly to Muon in terms of train accuracy, but has slightly higher test accuracy\. Unlike Muon, it leads to much more uniform row leverage scores\. ![CIFAR-10 results](https://blog.tilderesearch.com/images/blog/aurora/appendix_cifar10_aurora.png) CIFAR\-10 training and test accuracy for AdamW, Muon, and Riemannian\-Aurora, alongside leverage score distributions\. Riemannian\-Aurora achieves Muon\-level accuracy with uniform leverage\.## F\. Alternating Iteration Converges Letm\>nm \> n,r=n/mr = n/m, and define S=\{U∈Rm×n:U⊤U=In\}\\mathcal\{S\} = \\\{U \\in \\mathbb\{R\}^\{m \\times n\} : U^\\top U = I\_n\\\}Or=\{U∈Rm×n:∥Ui,:∥22=r,i=1,…,m\}\\mathcal\{O\}\_r = \\\{U \\in \\mathbb\{R\}^\{m \\times n\} : \\\|U\_\{i,:\}\\\|\_2^2 = r,\\; i = 1, \\ldots, m\\\}LetM=S∩Or\\mathcal\{M\} = \\mathcal\{S\} \\cap \\mathcal\{O\}\_r\. The alternating projection iteration is Uk\+1/2=ΠOr\(Uk\),Uk\+1=ΠS\(Uk\+1/2\)U\_\{k\+1/2\} = \\Pi\_\{\\mathcal\{O\}\_r\}\(U\_k\), \\qquad U\_\{k\+1\} = \\Pi\_\{\\mathcal\{S\}\}\(U\_\{k\+1/2\}\)whereΠOr\\Pi\_\{\\mathcal\{O\}\_r\}is row\-wise rescaling andΠS\\Pi\_\{\\mathcal\{S\}\}is the polar factor\. ### Lemma F\.1: Nonemptiness M≠∅\\mathcal\{M\} \\neq \\emptyset\. *Proof\.*By the Schur–Horn theorem, there exists a rank\-nnorthogonal projectorP∈Rm×mP \\in \\mathbb\{R\}^\{m \\times m\}with constant diagonalPii=n/mP\_\{ii\} = n/m\. For example, the diagonal vector\(n/m,…,n/m\)\(n/m, \\ldots, n/m\)is majorized by the eigenvalue vector\(1,…,1,0,…,0\)\(1, \\ldots, 1, 0, \\ldots, 0\)withnnones\. WriteP=UU⊤P = UU^\\topwithU⊤U=InU^\\top U = I\_n\. Then∥Ui,:∥22=\(UU⊤\)ii=Pii=n/m\\\|U\_\{i,:\}\\\|\_2^2 = \(UU^\\top\)\_\{ii\} = P\_\{ii\} = n/m, soU∈S∩OrU \\in \\mathcal\{S\} \\cap \\mathcal\{O\}\_r\. HenceM≠∅\\mathcal\{M\} \\neq \\emptyset\.□\\square ### Regularity Condition ForU∈MU \\in \\mathcal\{M\}, defineP=UU⊤P = UU^\\topandLU=rIm−P∘PL\_U = rI\_m \- P \\circ P, where∘\\circdenotes the Hadamard product\. SinceP2=PP^2 = PandPii=rP\_\{ii\} = r, ∑jPij2=\(P2\)ii=Pii=r\\sum\_j P\_\{ij\}^2 = \(P^2\)\_\{ii\} = P\_\{ii\} = rThereforeLUL\_Uis a graph Laplacian: λ⊤LUλ=12∑i,jPij2\(λi−λj\)2\\lambda^\\top L\_U \\lambda = \\frac\{1\}\{2\} \\sum\_\{i,j\} P\_\{ij\}^2 \(\\lambda\_i \- \\lambda\_j\)^2CallUU**regular**ifker⁡LU=span⁡\{1\}\\ker L\_U = \\operatorname\{span\}\\\{\\mathbf\{1\}\\\}\. ### Lemma F\.2: Clean Intersection IfU∈MU \\in \\mathcal\{M\}is regular, thenS\\mathcal\{S\}andOr\\mathcal\{O\}\_rintersect cleanly atUU\. *Proof\.*The normal spaces areNUS=\{UA:A=A⊤\}N\_U \\mathcal\{S\} = \\\{UA : A = A^\\top\\\}andNUOr=\{DU:D=diag⁡\(λ\)\}N\_U \\mathcal\{O\}\_r = \\\{DU : D = \\operatorname\{diag\}\(\\lambda\)\\\}\. Suppose a vector lies in both normal spaces, soUA=DUUA = DU\. SinceP=UU⊤P = UU^\\topis the projector ontocol⁡\(U\)\\operatorname\{col\}\(U\), this implies\(I−P\)DU=0\(I \- P\)DU = 0\. But ∥\(I−P\)DU∥F2=∥DU∥F2−∥PDU∥F2\\\|\(I \- P\)DU\\\|\_F^2 = \\\|DU\\\|\_F^2 \- \\\|PDU\\\|\_F^2Using∥Ui,:∥22=r\\\|U\_\{i,:\}\\\|\_2^2 = r, we get∥DU∥F2=r∥λ∥22\\\|DU\\\|\_F^2 = r\\\|\\lambda\\\|\_2^2and∥PDU∥F2=λ⊤\(P∘P\)λ\\\|PDU\\\|\_F^2 = \\lambda^\\top \(P \\circ P\)\\lambda, so 0=∥\(I−P\)DU∥F2=λ⊤\(rIm−P∘P\)λ=λ⊤LUλ0 = \\\|\(I \- P\)DU\\\|\_F^2 = \\lambda^\\top\(rI\_m \- P \\circ P\)\\lambda = \\lambda^\\top L\_U \\lambdaBy regularity, this forcesλ=α1\\lambda = \\alpha \\mathbf\{1\}, henceD=αImD = \\alpha I\_m,A=αInA = \\alpha I\_n, andDU=UA=αUDU = UA = \\alpha U\. ThereforeNUS∩NUOr=span⁡\{U\}N\_U \\mathcal\{S\} \\cap N\_U \\mathcal\{O\}\_r = \\operatorname\{span\}\\\{U\\\}\. This one\-dimensional common normal direction is unavoidable since both constraint sets imply the same Frobenius norm constraint∥U∥F2=n\\\|U\\\|\_F^2 = n\. Thus the combined constraint map has maximal rank modulo this single redundancy\. By the constant\-rank theorem,M\\mathcal\{M\}is locally a smooth manifold andTUM=TUS∩TUOrT\_U \\mathcal\{M\} = T\_U \\mathcal\{S\} \\cap T\_U \\mathcal\{O\}\_r, so the intersection is clean\.□\\square ### Theorem F\.3: Local Linear Convergence LetU⋆∈MU\_\\star \\in \\mathcal\{M\}be regular\. Then for every initializationU0U\_0sufficiently close toU⋆U\_\\star, the alternating projection sequenceUk\+1/2=ΠOr\(Uk\)U\_\{k\+1/2\} = \\Pi\_\{\\mathcal\{O\}\_r\}\(U\_k\),Uk\+1=ΠS\(Uk\+1/2\)U\_\{k\+1\} = \\Pi\_\{\\mathcal\{S\}\}\(U\_\{k\+1/2\}\)is well\-defined and converges linearly to someU^∈M\\widehat\{U\} \\in \\mathcal\{M\}\. *Proof\.*BothS\\mathcal\{S\}andOr\\mathcal\{O\}\_rareC∞C^\\inftyembedded submanifolds\. NearU⋆U\_\\star, projection ontoS\\mathcal\{S\}is uniquely given by the polar factor, since nearby matrices have full column rank\. Projection ontoOr\\mathcal\{O\}\_ris uniquely given by row\-wise normalization, since all rows ofU⋆U\_\\starhave normr\>0\\sqrt\{r\} \> 0\. By Lemma F\.2,S\\mathcal\{S\}andOr\\mathcal\{O\}\_rintersect cleanly atU⋆U\_\\star\. Clean intersection of smooth manifolds implies local linear convergence of alternating projections\. Therefore the iterates converge linearly to a pointU^∈S∩Or=M\\widehat\{U\} \\in \\mathcal\{S\} \\cap \\mathcal\{O\}\_r = \\mathcal\{M\}\.□\\square ## G\. Neuron Death in ReLU² MLPs The neuron death analysis in[Section 3](https://blog.tilderesearch.com/blog/aurora#3-normalization-prevents-neuron-death)focuses on SwiGLU MLPs, but, as we show below, this pathology is not specific to the SiLU gating function\. For any activationϕ\\phiwithϕ\(0\)=0\\phi\(0\) = 0andϕ′\(0\)≈0\\phi'\(0\) \\approx 0, the same vanishing\-gradient feedback loop occurs: neurons with small pre\-activations receive small gradients, accumulate small momentum, and produce small leverage scores under Muon\. Common activation functions including ReLU² \(defined byϕ\(x\)=max⁡\(0,x\)2\\phi\(x\) = \\max\(0,x\)^2\), GELU, and SiLU all share this property\. We verify this by repeating our leverage score analysis on a 400M transformer trained with ReLU² MLPs \(same setting as[Appendix B](https://blog.tilderesearch.com/blog/aurora#b-training-setting), with SwiGLU replaced by ReLU²\)\. The results closely mirror those of the SwiGLU experiments\. **Neuron death grid \(ReLU²\)\.**Under Muon, a large fraction of neurons die early in training\. Under U\-NorMuon, row leverage scores remain isotropic throughout training\. step0 Leverage scores for all 5632 neurons \(up projection, layer 12\) in a ReLU² MLP\. The same pattern of early, permanent death under Muon and healthy leverage under U\-NorMuon is observed\.**Leverage persistence \(ReLU²\)\.**We sort neurons into cohorts by their leverage score at step 500 and track each cohort over training\. The pattern matches our SwiGLU results: under Muon, the lowest\-leverage cohorts collapse and never recover, while under U\-NorMuon, all cohorts converge toward uniform leverage\. Training stepLog leverage scoreBottom 10%10–25%25–40%40–55%55–70%70–85%85–95%Top 5%Optimizer Projection Layer Leverage persistence for ReLU² MLPs\. Toggle between optimizers, projections, and layers\. The qualitative behavior matches the SwiGLU results in Section 3\.## H\. A Note on MoE Architectures Aurora's benefits are most pronounced for tall matrices, where the ratiom/nm/nis large\. This is the typical regime for up and gate projections in dense transformers, which commonly use MLP expansion factors of 4× or more\. In MoE architectures, capacity is distributed across many smaller experts, each with a proportionally smaller hidden dimension and a lowerm/nm/nratio\. We therefore expect the neuron death pathology to be less severe in MoE models of moderate size, though not entirely absent for experts that remain meaningfully tall\. ## I\. Code Release Our implementation of Aurora is open\-sourced at[github\.com/tilde\-research/aurora\-release](https://github.com/tilde-research/aurora-release)\. The repository includes both the practical damped\-iteration Aurora optimizer and the Riemannian\-Aurora reference solver\. Our modded\-nanoGPT speedrun submission \(Aurora \+ Contra\-Muon \+ u/w\-floor, 3175 steps\) is available as a pull request at[KellerJordan/modded\-nanogpt\#284](https://github.com/KellerJordan/modded-nanogpt/pull/284)\. We also release the Aurora 1\.1B pretrained model at[tilde\-research/aurora\-1\.1B](https://huggingface.co/tilde-research/aurora-1.1B)\. ## References 1. [Muon: An optimizer for hidden layers in neural networks](https://kellerjordan.github.io/posts/muon/) Jordan, K\., Jin, Y\., Boza, V\., You, J\., Cesista, F\., Newhouse, L\., and Bernstein, J\. \(2024\) \. 4. [Dion: Distributed Orthonormalized Updates](https://arxiv.org/abs/2504.05295) Ahn, K\., Xu, B\., Abreu, N\., Fan, Y\., Magakyan, G\., Sharma, P\., Zhan, Z\., and Langford, J\. \(2025\) \. 10. [MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration](https://arxiv.org/abs/2603.28254) Chang, D\., Shi, Q\., Zhang, L\., Li, Y\., Zhang, R\., Lu, Y\., Liu, Y\., and Yuan, G\. \(2026\) \. 15. [The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers](https://openreview.net/forum?id=TJ2nxciYCk-) Li, Z\., You, C\., Bhojanapalli, S\., Li, D\., Rawat, A\. S\., Reddi, S\. J\., Ye, K\., Chern, F\., Yu, F\., Guo, R\., and Kumar, S\. \(2023\) \. 17. [MoEfication: Transformer Feed\-forward Layers are Mixtures of Experts](https://aclanthology.org/2022.findings-acl.71/) Zhang, Z\., Lin, Y\., Liu, Z\., Li, P\., Sun, M\., and Zhou, J\. \(2022\) \. 18. [GLM\-5: From Vibe Coding to Agentic Engineering](https://arxiv.org/abs/2602.15763) Zeng, A\., Lv, X\., Hou, Z\., Du, Z\., Zheng, Q\., Chen, B\., et al\. \(2026\) \.

Aurora: A Leverage-Aware Optimizer for Rectangular Matrices

Similar Articles

Aurora: A Leverage-Aware Spectral Optimizer

@0xLogicrw: Tilde Research found a hidden flaw in the Muon optimizer, used by leading models like DeepSeek V4, Kimi K2.5, and GLM-5: it causes over a quarter of MLP layer neurons to die permanently in early training. The team designed an alternative optimizer, Auro…

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures

Submit Feedback

Similar Articles

Aurora: A Leverage-Aware Spectral Optimizer

@0xLogicrw: Tilde Research found a hidden flaw in the Muon optimizer, used by leading models like DeepSeek V4, Kimi K2.5, and GLM-5: it causes over a quarter of MLP layer neurons to die permanently in early training. The team designed an alternative optimizer, Auro…

AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures