MuCon: Clipped Muon Updates for LLM Training
Summary
This paper introduces MuCon, a clipped-Muon optimizer for LLM training that applies singular-value clipping instead of full polarization, preserving smaller singular values while clipping only the largest ones. It explores approximations to avoid full SVD, including polar/absolute-value formulas and rational Newton filters, noting numerical challenges near the threshold.
# MuCon: Clipped Muon Updates for LLM Training
Source: [https://arxiv.org/html/2605.26459](https://arxiv.org/html/2605.26459)
(May 8, 2026)
###### Abstract
Muon-style optimizers take a matrix-valued momentum or preconditioned update

$$B = U\,\mathrm{diag}(\sigma_1,\dots,\sigma_r)\,V^{\top}$$

and replace it with its canonical partial polar factor

$$\operatorname{Polar}(B) = UV^{\top}.$$

This maps every nonzero singular value to one. MuCon is the clipped-Muon variant studied here: it applies singular-value clipping to the same Muon matrix,

$$D^{\mathrm{MuCon}}_{\tau}(B) = \operatorname{MClip}_{\tau}(B) = U\,\mathrm{diag}(\min\{\sigma_i,\tau\})\,V^{\top}, \qquad \tau > 0.$$

Thus, $\operatorname{MClip}_{\tau}$ denotes the mathematical clipping operator, while MuCon denotes the optimizer primitive that substitutes this clipped direction for Muon’s polar direction. The Muon/MuCon scaling parameterization used in this work is called SpectralP: it is the hidden-matrix scaling recipe under which polar Muon or clipped MuCon directions are applied. The map $\operatorname{MClip}_{\tau}$ is the Frobenius projection onto the spectral-norm ball of radius $\tau$: it leaves singular values at or below $\tau$ unchanged and modifies only the violating singular directions. This paper asks when the MuCon clipping step can be approximated without a full dense SVD. We record two exact identities, a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter for the clipped positive-semidefinite factor, and identify the numerical obstruction common to both: singular values near the threshold make sign decisions and rational solves ill-conditioned. Matrix-function methods are therefore useful only when paired with stable polar/square-root primitives or explicit regularization near the clipping boundary.
## 1 Introduction
Many optimizer designs modify matrix-valued updates through spectral transformations. A well-known example is Muon-style orthogonalization. If

$$B_t = U\Sigma V^{\top}$$

is a compact SVD of the matrix-valued momentum or preconditioned update passed to Muon’s matrix step, the mathematical target in this report is the canonical partial polar factor

$$D_t^{\mathrm{Muon}} = \operatorname{Polar}(B_t) = UV^{\top}.$$

For rank-deficient matrices, $\operatorname{Polar}$ denotes this SVD-defined partial isometry, not an arbitrary orthogonal completion. In implementations it is often approximated by a Newton-Schulz iteration. The operation is aggressive: every nonzero singular value is replaced by one.

This report studies the more selective map

$$\operatorname{MClip}_{\tau}(M) = U\,\mathrm{diag}(\min\{\sigma_i,\tau\})\,V^{\top}, \qquad M = U\,\mathrm{diag}(\sigma_i)\,V^{\top},$$

so that

$$\sigma_i(\operatorname{MClip}_{\tau}(M)) = \min\{\sigma_i(M),\tau\}.$$

The default threshold is $\tau=1$, and we write $\operatorname{MClip}(M)=\operatorname{MClip}_{1}(M)$ when no threshold is shown. Clipping preserves all singular values at or below $\tau$ and modifies only the violating directions. When this map replaces Muon’s polar step on the Muon matrix $B_t$, the resulting clipped-Muon update is

$$D_t^{\mathrm{MuCon}} = \operatorname{MClip}_{\tau_t}(B_t), \qquad D_t^{\mathrm{Muon}} = \operatorname{Polar}(B_t).$$

Throughout the paper, $\operatorname{MClip}_{\tau}$ denotes the mathematical clipping operator, while MuCon denotes the optimizer primitive that applies this operator inside the Muon update pipeline. Separately, SpectralP denotes the scaling parameterization used for the Muon/MuCon hidden-matrix groups in this work; it is not a new clipping map.
The operator $\operatorname{MClip}_{\tau}$ is exactly the Frobenius projection onto the spectral-norm ball $\mathcal{B}_2(\tau)=\{X\in\mathbb{R}^{m\times n}:\|X\|_2\leq\tau\}$:

$$\operatorname{MClip}_{\tau}(M) = \mathop{\mathrm{argmin}}_{X\in\mathcal{B}_2(\tau)} \frac{1}{2}\|X-M\|_F^2.$$

The exact algorithm is simple: compute an SVD, clip the singular values, and reconstruct the matrix. Its dense cost,

$$O(mn\min(m,n)),$$

is usually too high for repeated use inside an optimizer; see Golub and Van Loan ([2013](https://arxiv.org/html/2605.26459#bib.bib11)) for standard background on dense matrix decompositions.
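The exact SVD-based algorithm described above can be sketched in a few lines of NumPy. This is a minimal reference sketch, not the paper's implementation; the function name `mclip_svd` and the random test matrix are illustrative assumptions.

```python
import numpy as np

def mclip_svd(M: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Exact singular-value clipping via a dense thin SVD (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Scale the columns of U by the clipped singular values, then reconstruct.
    return (U * np.minimum(s, tau)) @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
X = mclip_svd(M, tau=1.0)

# Singular values of the result should equal min(sigma_i, tau).
clipped_sv = np.linalg.svd(X, compute_uv=False)
expected_sv = np.minimum(np.linalg.svd(M, compute_uv=False), 1.0)
```

Such a routine is mainly useful as a correctness oracle against which cheaper approximations can be compared.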
The central numerical question is therefore:
> Can the MuCon clipping step be approximated accurately enough for optimizer use while avoiding a full dense SVD?
A key structural identity already suggests the correct regime split. Let

$$\mathcal{I}_{>} = \{i : \sigma_i(M) > \tau\}, \qquad k_{>} = |\mathcal{I}_{>}|.$$

Then

$$\operatorname{MClip}_{\tau}(M) = M - U_{>}\,\mathrm{diag}\bigl((\sigma_i-\tau)_{i\in\mathcal{I}_{>}}\bigr)\,V_{>}^{\top},$$

where $U_{>}, V_{>}$ contain only the singular vectors whose singular values exceed $\tau$. Hence, clipping is a rank-$k_{>}$ correction to $M$. When $k_{>}$ is small, a partial SVD, Lanczos method, or randomized subspace method is the most selective baseline; when $k_{>}$ is large, global matrix-function iterations may be competitive.
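The low-rank correction identity can be checked numerically. The sketch below uses a dense SVD only to obtain the violating singular triplets; in the small-$k_{>}$ regime one would presumably replace that call with a partial SVD or Lanczos routine. The name `mclip_lowrank` and the chosen singular values are illustrative assumptions.

```python
import numpy as np

def mclip_lowrank(M: np.ndarray, tau: float = 1.0):
    """Clip by subtracting a rank-k correction built from violating triplets only."""
    # A dense SVD stands in here for a partial SVD of the top-k subspace.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    viol = s > tau                      # index set I_> of violating directions
    correction = (U[:, viol] * (s[viol] - tau)) @ Vt[viol, :]
    return M - correction, int(viol.sum())

# Test matrix with known singular values 3, 2, 0.5, 0.1 (two exceed tau = 1).
rng = np.random.default_rng(1)
U0, _ = np.linalg.qr(rng.standard_normal((8, 4)))
V0, _ = np.linalg.qr(rng.standard_normal((5, 4)))
sigma = np.array([3.0, 2.0, 0.5, 0.1])
M = (U0 * sigma) @ V0.T

X, k_viol = mclip_lowrank(M, tau=1.0)
X_ref = (U0 * np.minimum(sigma, 1.0)) @ V0.T   # direct clipping for comparison
```

Only `k_viol` triplets enter the correction, which is why the cost of a selective method scales with the number of violating directions and not with the full rank.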
##### Contributions.
The report makes three technical points for SpectralP MuCon. First, it separates the mathematical clipping map from the clipped-Muon optimizer primitive and records the projection and low-rank correction identities that any approximation should respect. Second, it derives a polar/absolute-value formulation of clipping and explains why threshold eigenvalues are numerically delicate. Third, it analyzes a rational Newton iteration for the clipped positive-semidefinite factor and clarifies that it is a spectral filter, not a standalone SVD-free algorithm.
## 2 Background: SpectralP and Width-Depth Scaled Training
This project is motivated by hyperparameter transfer under simultaneous width and depth scaling. CompleteP studies joint width-depth transfer and non-lazy feature learning in deep Transformers (Dey et al., [2025](https://arxiv.org/html/2605.26459#bib.bib1)). Spectral $\mu\mathrm{P}$ and related operator-norm viewpoints motivate controlling weights and updates in normalized spectral norm (Yang et al., [2023](https://arxiv.org/html/2605.26459#bib.bib5); Zheng et al., [2026](https://arxiv.org/html/2605.26459#bib.bib2)). In this report, the resulting hidden-matrix scaling parameterization for Muon and MuCon is called SpectralP. SpectralP assigns hidden two-dimensional matrix groups to a spectral update class, while scalar, vector, embedding, and unembedding groups remain AdamW companion groups. Both viewpoints point to the same numerical need: matrix updates should have controlled spectra without requiring expensive decompositions at every optimizer step.
### 2.1 Hyperparameter Transfer and Maximal-Update Parameterization
A central practical problem in large-model training is hyperparameter transfer. Ideally, one tunes hyperparameters such as learning rate, initialization scale, and weight decay on a small model and transfers them to a much larger one. If the same base hyperparameters remain close to optimal after scaling, one obtains a “tune small, train large” strategy.

Maximal update parameterization, or $\mu\mathrm{P}$, was introduced to make this possible under width scaling (Yang and Hu, [2021](https://arxiv.org/html/2605.26459#bib.bib3); Yang et al., [2022](https://arxiv.org/html/2605.26459#bib.bib4)). In its simplest form, $\mu\mathrm{P}$ aims to preserve nontrivial feature learning as the width $N$ grows. If $h_\ell(x)\in\mathbb{R}^{d_\ell}$ is the hidden representation at layer $\ell$, the invariant scale is coordinate-wise, or equivalently RMS-normalized:

$$\|h_\ell(x)\|_{\mathrm{R},d_\ell} = \Theta(1), \qquad \|\Delta h_\ell(x)\|_{\mathrm{R},d_\ell} = \Theta(1), \qquad \|a\|_{\mathrm{R},d} := \frac{\|a\|_2}{\sqrt{d}}.$$

Thus, the parameterization should avoid both lazy dynamics,

$$\|\Delta h_\ell(x)\|_{\mathrm{R},d_\ell} \to 0,$$

and unstable dynamics,

$$\|\Delta h_\ell(x)\|_{\mathrm{R},d_\ell} \to \infty.$$

For width-only scaling, a typical hidden-matrix AdamW learning-rate rule takes the form

$$\eta_{\mathrm{hidden}} = \eta_{\mathrm{base}}\, m_N^{-1}, \qquad m_N = \frac{N}{N_{\mathrm{base}}}.$$

The exact exponent depends on the optimizer and parameterization, but the principle is that matrix updates must be rescaled so that their induced feature movement remains order one.

Modern foundation models do not scale only in width; they also become deeper. A useful parameterization should therefore preserve transfer as

$$N\to\infty, \qquad L\to\infty,$$

where $L$ is the number of residual blocks.
### 2.2 CompleteP: Residual Scaling for Deep Transformers
CompleteP studies joint width-depth scaling for pre-LN decoder-only Transformer language models (Dey et al., [2025](https://arxiv.org/html/2605.26459#bib.bib1)). Its starting point is the residual recursion

$$h_{\ell+1} = h_\ell + L^{-\alpha}F_\ell(h_\ell), \qquad \ell = 1,\dots,L,$$

where $F_\ell$ is a residual block, such as an attention or MLP block. The exponent $\alpha\in[1/2,1]$ controls the depth scaling of each residual branch.

The two most important cases are

$$\alpha = \frac{1}{2} \qquad\text{and}\qquad \alpha = 1.$$

The choice $\alpha=1/2$ is natural from initialization stability. If residual increments are roughly independent and have comparable size, then the accumulated residual variance scales as

$$\sum_{\ell=1}^{L} L^{-2\alpha} = L^{1-2\alpha}.$$

Avoiding variance explosion, therefore, requires

$$\alpha \geq \frac{1}{2}.$$
CompleteP argues that initialization stability is not enough. It advocates the stronger scaling

$$\alpha = 1,$$

so that

$$h_{\ell+1} = h_\ell + L^{-1}F_\ell(h_\ell).$$

In practical scaling experiments one writes

$$m_N = \frac{N}{N_{\mathrm{base}}}, \qquad m_L = \frac{L}{L_{\mathrm{base}}},$$

and uses

$$h_{\ell+1} = h_\ell + m_L^{-1}F_\ell(h_\ell).$$
#### 2.2.1 AdamW scaling in CompleteP
CompleteP is not only a residual multiplier. It also specifies how model and optimizer hyperparameters should scale with $m_N$ and $m_L$. For hidden matrix weights, the initialization variance follows the width-$\mu\mathrm{P}$ rule

$$\operatorname{Var}(W_{\mathrm{hidden}}) = \sigma_{\mathrm{base}}^{2}\, m_N^{-1}.$$

For AdamW hidden matrix updates, the learning-rate rule summarized in CompleteP is

$$\eta_{\mathrm{hidden}}^{\mathrm{AdamW}} = \eta_{\mathrm{base}}\, m_N^{-1} m_L^{\alpha-1}.$$

Thus, under CompleteP with $\alpha=1$,

$$\eta_{\mathrm{hidden}}^{\mathrm{AdamW}} = \eta_{\mathrm{base}}\, m_N^{-1}.$$

The hidden matrix learning rate scales with width, but not with depth.

LayerNorm and bias learning rates scale as

$$\eta_{\mathrm{LN}} = \eta_{\mathrm{base}}\, m_L^{\alpha-1}, \qquad \eta_{\mathrm{bias}} = \eta_{\mathrm{base}}\, m_L^{\alpha-1}.$$

Therefore, under CompleteP,

$$\eta_{\mathrm{LN}} = \eta_{\mathrm{bias}} = \eta_{\mathrm{base}}.$$

CompleteP also scales hidden-weight decay and AdamW’s numerical $\epsilon$ parameter by parameter group:

$$\lambda_{\mathrm{hidden}} = \lambda_{\mathrm{base}}\, m_N.$$

For the pure CompleteP AdamW parameterization, the AdamW $\epsilon$ scaling is parameter-group dependent. With a general residual exponent $\alpha$, hidden-block AdamW groups use

$$\epsilon_{\mathrm{hidden/residual}} = \epsilon_{\mathrm{base}}\, m_N^{-1} m_L^{-\alpha}.$$

This CompleteP hidden/residual group includes hidden matrix AdamW groups, hidden-block LayerNorm parameters, hidden-block biases, and other hidden vector parameters. Under CompleteP with $\alpha=1$, this becomes

$$\epsilon_{\mathrm{hidden/residual}} = \epsilon_{\mathrm{base}}\, m_N^{-1} m_L^{-1}.$$

CompleteP embedding/unembedding parameters and the final LayerNorm instead use

$$\epsilon_{\mathrm{emb/unemb}} = \epsilon_{\text{final LN}} = \epsilon_{\mathrm{base}}\, m_N^{-1}.$$

In SpectralP, the same AdamW $\epsilon$ rules apply to AdamW companion groups: hidden-block LayerNorm, bias, and vector companion groups use the CompleteP hidden/residual $\epsilon$, while embedding/unembedding and final-LayerNorm companion groups use the CompleteP embedding/unembedding $\epsilon$. The SpectralP Muon/MuCon hidden matrix groups themselves do not use AdamW $\epsilon$.

Biases and LayerNorm gains are usually assigned zero decoupled weight decay in LLM implementations. If they are decayed, their coefficient should be treated as a separate vector-parameter hyperparameter.
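The AdamW multipliers listed in this subsection can be collected in one small helper. This is a sketch of the formulas as stated above, not code from CompleteP or from the paper's implementation; the class name, field names, and function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AdamWMultipliers:
    """Multipliers applied to base hyperparameters (hypothetical container)."""
    hidden_init_var: float      # Var(W_hidden) / sigma_base^2
    hidden_lr: float            # eta_hidden^AdamW / eta_base
    ln_bias_lr: float           # eta_LN / eta_base and eta_bias / eta_base
    hidden_weight_decay: float  # lambda_hidden / lambda_base
    hidden_eps: float           # eps_hidden/residual / eps_base
    emb_eps: float              # eps_emb/unemb and eps_final-LN / eps_base

def completep_adamw_multipliers(m_N: float, m_L: float, alpha: float = 1.0) -> AdamWMultipliers:
    """Evaluate the scaling rules of Section 2.2.1 for width/depth ratios m_N, m_L."""
    return AdamWMultipliers(
        hidden_init_var=m_N ** -1,
        hidden_lr=m_N ** -1 * m_L ** (alpha - 1),
        ln_bias_lr=m_L ** (alpha - 1),
        hidden_weight_decay=m_N,
        hidden_eps=m_N ** -1 * m_L ** -alpha,
        emb_eps=m_N ** -1,
    )

# Example: 4x wider and 2x deeper than the base model, CompleteP exponent alpha = 1.
mult = completep_adamw_multipliers(m_N=4.0, m_L=2.0, alpha=1.0)
```

With $\alpha=1$ the depth ratio drops out of every learning-rate multiplier and survives only in the hidden-block $\epsilon$.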
#### 2.2.2 Complete feature learning
CompleteP also emphasizes *complete feature learning*: a good width-depth parameterization should not merely keep activations stable, but should also prevent the network from becoming effectively linearized around initialization as $N,L\to\infty$.

Let $h(\theta)$ be a representation depending on parameters $\theta$, and let $\theta_0$ be the initialization. The linearization of $h$ at $\theta_0$ is

$$h^{\mathrm{lin},\theta}(\theta,\theta_0) = h(\theta_0) + \left\langle \nabla_\theta h(\theta_0),\, \theta-\theta_0 \right\rangle.$$

A representation is lazy with respect to $\theta$ if its update becomes asymptotically indistinguishable from the update of this linearization:

$$\frac{\left|\Delta_\theta h - \Delta_\theta h^{\mathrm{lin},\theta}\right|}{\left|\Delta_\theta h^{\mathrm{lin},\theta}\right|} = o(1).$$
The role of $\alpha$ can be seen in a two-layer residual block

$$h_{\ell+1} = h_\ell + L^{-\alpha}W_\ell^{(2)}W_\ell^{(1)}h_\ell.$$

Under maximal-update scaling, suppose

$$\Delta W_\ell^{(i)} = \Theta(L^{\alpha-1}), \qquad i = 1,2.$$

Then the first-order contribution to $\Delta h_{\ell+1}$ has size

$$\Theta(L^{-1}),$$

while the second-order contribution has size

$$\Theta(L^{\alpha-2}).$$

Their ratio is therefore

$$\Theta(L^{\alpha-1}).$$

If $\alpha<1$, this ratio vanishes; if $\alpha=1$, it remains order one. This is the basic mechanism behind the CompleteP choice $\alpha=1$.
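The orders quoted above can be recovered by expanding the block update directly. The following is a bookkeeping sketch only; it assumes that $W_\ell^{(i)}$ and $h_\ell$ are $\Theta(1)$ in the relevant norms and ignores the change in $h_\ell$ itself (layer subscripts on $W$ are dropped for brevity):

$$
L^{-\alpha}\bigl(W^{(2)}+\Delta W^{(2)}\bigr)\bigl(W^{(1)}+\Delta W^{(1)}\bigr)h_\ell - L^{-\alpha}W^{(2)}W^{(1)}h_\ell
= \underbrace{L^{-\alpha}\bigl(\Delta W^{(2)}W^{(1)} + W^{(2)}\Delta W^{(1)}\bigr)h_\ell}_{\Theta(L^{-\alpha}\cdot L^{\alpha-1}) \,=\, \Theta(L^{-1})}
+ \underbrace{L^{-\alpha}\,\Delta W^{(2)}\Delta W^{(1)}h_\ell}_{\Theta(L^{-\alpha}\cdot L^{2\alpha-2}) \,=\, \Theta(L^{\alpha-2})}.
$$

Dividing the second-order size by the first-order size gives $L^{\alpha-2}/L^{-1} = L^{\alpha-1}$, which matches the stated ratio.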
### 2.3 Relevance to SpectralP MuCon
The preceding scaling arguments motivate spectral control of matrix-valued updates. Singular value clipping provides an explicit projection-based control primitive:

$$\operatorname{MClip}_{\tau}(M) = U\,\mathrm{diag}(\min\{\sigma_i,\tau\})\,V^{\top}, \qquad M = U\,\mathrm{diag}(\sigma_i)\,V^{\top}.$$

It enforces

$$\|\operatorname{MClip}_{\tau}(M)\|_2 \leq \tau$$

and solves

$$\operatorname{MClip}_{\tau}(M) = \mathop{\mathrm{argmin}}_{\|X\|_2\leq\tau} \frac{1}{2}\|X-M\|_F^2.$$

SpectralP MuCon applies this primitive to the same matrix-valued update that SpectralP Muon would polarize. This differs from Muon-style orthogonalization,

$$\operatorname{Polar}(M) = UV^{\top},$$

which sends every nonzero singular value to $1$. Clipping instead applies

$$\sigma_i \mapsto \min(\sigma_i,\tau),$$

so it preserves all singular directions with $\sigma_i\leq\tau$ and modifies only the violating directions.

This distinction matters for optimizer design. If an update matrix has a few unstable high-gain directions but many useful moderate directions, full orthogonalization can distort the update unnecessarily. Singular-value clipping gives the closest feasible matrix under the spectral-norm constraint. The computational obstacle is that exact clipping requires an SVD. The numerical question is therefore:

> Can the MuCon direction $\operatorname{MClip}_{\tau}(B_t)$ be approximated accurately and cheaply without a full SVD?
### 2.4 Why iterative SpectralP MuCon is a natural next step
CompleteP motivates depth scaling through

$$h_{\ell+1} = h_\ell + L^{-1}F_\ell(h_\ell),$$

while spectral $\mu\mathrm{P}$ motivates controlling matrix weights and updates in normalized operator norm. Both views suggest that optimizer design increasingly depends on matrix spectral geometry.

SpectralP MuCon fits naturally into this picture. It controls the largest singular values without forcing the update to become an isometry:

$$B_t \quad\mapsto\quad \operatorname{MClip}_{\tau}(B_t)$$

rather than

$$B_t \quad\mapsto\quad \operatorname{Polar}(B_t) = UV^{\top}.$$

This makes it a candidate primitive for optimizers that want the stability benefits of spectral control while preserving more of the original update geometry.

The rest of the report isolates the numerical linear algebra problem from the full training problem. The next section states the scaling conventions used in this report.
## 3 SpectralP Scaling Recipes and Operator-Norm Bookkeeping
Write

$$m_N = \frac{N}{N_{\mathrm{base}}}, \qquad m_L = \frac{L}{L_{\mathrm{base}}}.$$

For a vector $a\in\mathbb{R}^{d}$, define the dimension-normalized RMS norm

$$\|a\|_{\mathrm{R},d} = \frac{\|a\|_2}{\sqrt{d}}.$$

For $A\in\mathbb{R}^{m\times n}$, viewed as a map $\mathbb{R}^{n}\to\mathbb{R}^{m}$, define the induced RMS operator norm

$$\|A\|_{\mathrm{R}(m,n)} = \max_{v\neq 0}\frac{\|Av\|_{\mathrm{R},m}}{\|v\|_{\mathrm{R},n}} = \sqrt{\frac{n}{m}}\,\|A\|_2.$$

At fixed aspect ratio, this is equivalent to the spectral norm up to constants. For input or output matrices with one fixed dimension, the factor $\sqrt{n/m}$ is essential.
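The closed form $\sqrt{n/m}\,\|A\|_2$ can be checked against the defining ratio. The sketch below evaluates the ratio at the top right singular vector, where the maximum is attained; the helper names and the $3\times 12$ test shape are illustrative assumptions.

```python
import numpy as np

def rms_norm(a: np.ndarray) -> float:
    """Dimension-normalized RMS norm ||a||_2 / sqrt(d)."""
    return float(np.linalg.norm(a) / np.sqrt(a.size))

def rms_operator_norm(A: np.ndarray) -> float:
    """Induced RMS operator norm sqrt(n/m) * ||A||_2 for A of shape (m, n)."""
    m, n = A.shape
    return float(np.sqrt(n / m) * np.linalg.norm(A, 2))

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 12))          # strongly non-square on purpose

# The ratio ||Av||_R / ||v||_R is maximized at the top right singular vector.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
v_top = Vt[0]
ratio_at_top = rms_norm(A @ v_top) / rms_norm(v_top)

# Any other direction gives a ratio no larger than the operator norm.
v_rand = rng.standard_normal(12)
ratio_at_rand = rms_norm(A @ v_rand) / rms_norm(v_rand)
```

For this wide shape the factor $\sqrt{n/m}=2$ doubles the plain spectral norm, which illustrates why the aspect-ratio factor cannot be dropped when one dimension is fixed.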
Table [1](https://arxiv.org/html/2605.26459#S3.T1) summarizes the recipe used in this report. Each entry is the multiplier applied to the corresponding base-model hyperparameter. The base model has $m_N=m_L=1$. The CompleteP column specializes to residual exponent $\alpha=1$ and reports the pure CompleteP AdamW scaling. The last column is the SpectralP recipe for Muon/MuCon hidden matrix groups together with its AdamW companion groups. For a general residual multiplier $m_L^{-\alpha}$, replace the hidden/LN/bias AdamW learning-rate depth factor by $m_L^{\alpha-1}$ and the hidden-block AdamW $\epsilon$ depth factor by $m_L^{-\alpha}$. In both CompleteP and SpectralP, hidden-block AdamW $\epsilon$ applies to hidden residual matrix, hidden LayerNorm, hidden bias, and hidden vector groups; embedding/unembedding and final LayerNorm groups keep the $m_N^{-1}$ width factor with no depth factor. In SpectralP, hidden 2D Muon/MuCon matrix groups do not use AdamW $\epsilon$.

The implementation also follows the SteptronOSS Muon convention of applying a shape-dependent RMS-matching multiplier after the spectral direction is computed. For a hidden matrix with last two dimensions $m\times n$, define

$$\kappa_{\mathrm{Muon}}(m,n) = \rho_{\mathrm{match}}\sqrt{\max\{m,n\}},$$

where $\rho_{\mathrm{match}}$ is the configured `matched_adamw_rms`. Thus the SpectralP optimizer group still has learning-rate multiplier one, but the actual hidden-matrix Muon step uses the effective coefficient $\eta_{\rm base}\,\kappa_{\mathrm{Muon}}(m,n)$ multiplying the polar or clipped direction. This is an implementation-level RMS matching calibration, not an AdamW $\epsilon$ rule and not a width/depth scheduler multiplier.
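One way to see why this multiplier is called RMS matching: a full-rank polar factor of shape $m\times n$ has squared Frobenius norm $\min\{m,n\}$, so its entrywise RMS is $1/\sqrt{\max\{m,n\}}$, and scaling by $\kappa_{\mathrm{Muon}}$ brings the entrywise RMS to $\rho_{\mathrm{match}}$. The sketch below checks this numerically; the function name `kappa_muon` and the value `rho_match = 0.2` are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def kappa_muon(m: int, n: int, rho_match: float) -> float:
    """Shape-dependent multiplier rho_match * sqrt(max(m, n))."""
    return float(rho_match * np.sqrt(max(m, n)))

rng = np.random.default_rng(0)
m, n = 8, 32
rho_match = 0.2   # assumed value, for illustration only

# Exact polar factor of a random full-rank matrix via the thin SVD.
U, _, Vt = np.linalg.svd(rng.standard_normal((m, n)), full_matrices=False)
polar = U @ Vt

# After scaling, the entrywise RMS of the direction equals rho_match.
scaled = kappa_muon(m, n, rho_match) * polar
entry_rms = float(np.sqrt(np.mean(scaled ** 2)))
```

A clipped direction with $\tau\leq 1$ has singular values no larger than those of the polar factor, so under the same multiplier its entrywise RMS is at most $\rho_{\mathrm{match}}$.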
For embedding and unembedding AdamW companion groups, write $\gamma_{\rm emb}$ for the configured embedding learning-rate multiplier. The implementation therefore uses embedding/unembedding learning rate $\gamma_{\rm emb}\eta_{\rm base}$, with $\gamma_{\rm emb}=1$ as the default.
Table 1: Scaling recipe comparisons.

##### Untied versus tied embeddings.
Let $d_{\rm vocab}$ be fixed and store token matrices row-wise in hidden coordinates. In the untied setting, use separate matrices

$$E, W_U \in \mathbb{R}^{d_{\rm vocab}\times N}, \qquad h_0 = E^{\top}e_x, \qquad \ell = m_N^{-1}W_U h_L.$$

The recipe is

$$\operatorname{Var}(E_{ij}) = \operatorname{Var}((W_U)_{ij}) = \sigma_{\rm base}^{2}, \qquad \eta_E = \eta_U = \gamma_{\rm emb}\eta_{\rm base},$$

$$\lambda_E = \lambda_U = \lambda_{\rm base}, \qquad \epsilon_E = \epsilon_U = \epsilon_{\rm base}\, m_N^{-1}.$$

In the tied setting, set $W_U=E$ and use one shared parameter and one shared optimizer state:

$$h_0 = E^{\top}e_x, \qquad \ell = m_N^{-1}E h_L, \qquad \operatorname{Var}(E_{ij}) = \sigma_{\rm base}^{2}, \qquad \eta_E = \gamma_{\rm emb}\eta_{\rm base},$$

$$\lambda_E = \lambda_{\rm base}, \qquad \epsilon_E = \epsilon_{\rm base}\, m_N^{-1}.$$

The output multiplier remains $m_N^{-1}$ in both cases. In the tied case, this is especially important: the self-overlap term $E_x^{\top}E_x$ is order $N$ at initialization, so omitting this forward multiplier, or using a weaker width multiplier than $m_N^{-1}$, would make tied-output logits grow with width. Gradients from the input and output uses are summed before the single AdamW update. These embedding/unembedding groups are companion AdamW groups in the SpectralP recipe, not hidden-matrix Muon/MuCon groups.
The operator-norm entries should be read as bookkeeping, not as a derivation of the AdamW column. In this report, SpectralP names the Muon/MuCon hidden-matrix scaling rule summarized by the last column; MuCon itself remains the clipped update primitive. Let $G_{\mathcal{O}}(N)$ be the pre-learning-rate matrix direction produced by optimizer $\mathcal{O}$ before any implementation-level RMS matching multiplier. If, for a fixed aspect ratio,

$$\|G_{\mathcal{O}}(N)\|_{\mathrm{R}(m,n)} = \Theta(N^{p_{\mathcal{O}}}),$$

then choosing

$$\eta_{\mathrm{hidden}}^{\mathcal{O}} = \eta_{\mathrm{base}}\, m_N^{-p_{\mathcal{O}}}$$

keeps the uncalibrated hidden matrix direction order one in the RMS operator norm. The exponent is update-normalization dependent and must be computed for the actual optimizer direction. For example, a dense i.i.d. $O(1)$ update has $\|G\|_2=\Theta(\sqrt{N})$ and $p=1/2$ at fixed aspect ratio, whereas an uncalibrated polar/Muon hidden direction has $\|G\|_2=1$ and $p=0$. The code keeps this $p=0$ group-learning-rate convention, then multiplies the resulting Muon/MuCon direction by $\kappa_{\mathrm{Muon}}(m,n)$ to match the RMS scale used by the SteptronOSS Muon recipe. AdamW under $\mu\mathrm{P}$/CompleteP uses the coordinate-wise tensor-program scaling shown in Table [1](https://arxiv.org/html/2605.26459#S3.T1); it should not be inferred by treating the AdamW update as a generic dense i.i.d. spectral update.

If threshold-$\tau$ clipping is applied to the pre-learning-rate hidden matrix direction, as in SpectralP MuCon, then

$$\|\operatorname{MClip}_{\tau}(G)\|_{\mathrm{R}(m,n)} \leq \sqrt{\frac{n}{m}}\,\tau.$$

Thus, for hidden matrices with fixed aspect ratio and $\tau=\Theta(1)$, the uncalibrated SpectralP MuCon direction is automatically order one in RMS operator norm. Under this bookkeeping, clipped MuCon has the same group-learning-rate width exponent as polar Muon, but its direction is not forced to be an isometry. The implemented step then applies the same $\kappa_{\mathrm{Muon}}(m,n)$ RMS-matching multiplier used for polar Muon.
## 4 Problem Formulation
Let

$$M\in\mathbb{R}^{m\times n}, \qquad M = U\Sigma V^{\top}$$

be a compact SVD with rank $r$. Thus

$$U\in\mathbb{R}^{m\times r}, \qquad V\in\mathbb{R}^{n\times r}, \qquad \Sigma = \mathrm{diag}(\sigma_1,\dots,\sigma_r),$$

where

$$\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_r > 0.$$

The compact SVD omits zero singular values; all spectral formulas below act as zero on the omitted null spaces unless stated otherwise. For $M=0$, define $\operatorname{MClip}_{\tau}(0)=0$. For $M\neq 0$, the singular-value clipping operator is

$$\operatorname{MClip}_{\tau}(M) = U\,\mathrm{diag}(\min\{\sigma_1,\tau\},\dots,\min\{\sigma_r,\tau\})\,V^{\top}, \qquad \tau > 0.$$

Equivalently,

$$\operatorname{MClip}_{\tau}(M) = U f_{\tau}(\Sigma) V^{\top}, \qquad f_{\tau}(\sigma) = \min\{\sigma,\tau\}.$$

In the optimizer setting, $M$ is the same matrix $B_t$ that SpectralP Muon would pass to its polar or Newton-Schulz step. The SpectralP MuCon update direction is therefore

$$D_t^{\mathrm{MuCon}} = \operatorname{MClip}_{\tau_t}(B_t).$$

This notation keeps the operator $\operatorname{MClip}_{\tau}$ distinct from the algorithmic primitive MuCon.
The projection interpretation is

$$\operatorname{MClip}_{\tau}(M) = \mathop{\mathrm{argmin}}_{X\in\mathbb{R}^{m\times n}} \frac{1}{2}\|X-M\|_F^2 \quad\text{subject to}\quad \|X\|_2\leq\tau.$$

Indeed, by unitary invariance of the Frobenius norm and von Neumann’s trace inequality, the unique projection aligns its singular vectors with those of $M$ after zero-padding singular values as needed. The problem reduces to

$$\min_{0\leq s_i\leq\tau}\sum_i (s_i-\sigma_i)^2,$$

whose unique solution is

$$s_i = \min\{\sigma_i,\tau\}.$$

The same formula gives the useful low-rank correction identity

$$\operatorname{MClip}_{\tau}(M) = M - U_{>}\,\mathrm{diag}\bigl((\sigma_i-\tau)_{i\in\mathcal{I}_{>}}\bigr)\,V_{>}^{\top}, \qquad \mathcal{I}_{>} = \{i : \sigma_i > \tau\}.$$

Consequently, if $k_{>}=|\mathcal{I}_{>}|\ll r$, the target differs from $M$ by a rank-$k_{>}$ matrix. This observation is central for algorithm design: global matrix-function approximations should be compared against partial SVD, Lanczos, or randomized range-finding baselines that target only the violating singular subspace (Halko et al., [2011](https://arxiv.org/html/2605.26459#bib.bib12)).
The computational challenge is that a full dense SVD is too expensive for frequent optimizer use\. We therefore seek approximations using cheaper primitives:
- matrix-matrix multiplications;
- matrix-vector products;
- matrix-function iterations;
- small dense auxiliary linear algebra; and
- structured linear solves when numerically stable.
## 5 Algorithmic Approaches
The low-rank correction identity suggests a partial spectral baseline whenever $k_{>}$ is small. For SpectralP MuCon, this means targeting only the singular directions of the Muon matrix $B_t$ whose gains exceed $\tau_t$. The two approaches below are instead global matrix-function viewpoints. They are most relevant when many singular values violate the constraint, or when fast polar, square-root, or rational-filter primitives are already available.
### 5.1 Approach I: polar/absolute-value formulation
Let

$$H = (M^{\top}M)^{1/2}, \qquad Q = MH^{\dagger}.$$

With the compact SVD above,

$$Q = UV^{\top}, \qquad H = V\Sigma V^{\top}$$

on the row space of $M$, and $H=0$ on its orthogonal complement. Thus

$$M = QH.$$

Functional calculus gives

$$\operatorname{MClip}_{\tau}(M) = QP_{\star}, \qquad P_{\star} = f_{\tau}(H), \qquad f_{\tau}(t) = \min\{t,\tau\}.$$

Since $f_{\tau}(0)=0$, this formula is valid for rectangular and rank-deficient matrices.

If $m<n$, it can be cheaper to work on the left side. With

$$K = (MM^{\top})^{1/2}, \qquad Q = K^{\dagger}M = UV^{\top},$$

one has

$$\operatorname{MClip}_{\tau}(M) = f_{\tau}(K)\,Q.$$

Thus, the matrix-function factor should be formed on the smaller side whenever possible.
For a symmetric positive-semidefinite matrix $H$,

$$P_{\star} = f_{\tau}(H) = \frac{1}{2}\left(H + \tau I_n - |H-\tau I_n|\right),$$

where $|A|=(A^{2})^{1/2}$ for symmetric $A$. Here $f_{\tau}(H)$ denotes scalar functional calculus; it is not a Loewner infimum. Equivalently,

$$|H-\tau I_n| = (H-\tau I_n)\mathop{\mathrm{sign}}(H-\tau I_n),$$

with the spectral convention $\mathop{\mathrm{sign}}(0)=0$. Hence

$$\operatorname{MClip}_{\tau}(M) = Q\,\frac{1}{2}\left(H + \tau I_n - (H-\tau I_n)\mathop{\mathrm{sign}}(H-\tau I_n)\right).$$

This identity is exact, including at singular values equal to $\tau$.
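The absolute-value identity for the positive-semidefinite factor can be verified on a small example. In this sketch the matrix absolute value is computed by a symmetric eigendecomposition, which serves only as an exact stand-in for whatever cheaper matrix-function approximation would be used in practice; the helper names and the chosen eigenvalues (including one exactly at the threshold) are illustrative assumptions.

```python
import numpy as np

def sym_abs(A: np.ndarray) -> np.ndarray:
    """Matrix absolute value |A| = (A^2)^{1/2} of a symmetric matrix, via eigh."""
    w, V = np.linalg.eigh(A)
    return (V * np.abs(w)) @ V.T

def clip_psd(H: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """f_tau(H) = (H + tau*I - |H - tau*I|) / 2 for symmetric PSD H."""
    I = np.eye(H.shape[0])
    return 0.5 * (H + tau * I - sym_abs(H - tau * I))

# PSD matrix with eigenvalues 0, 0.5, 1 (exactly at the threshold), and 2.5.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
eigs = np.array([0.0, 0.5, 1.0, 2.5])
H = (V * eigs) @ V.T

P = clip_psd(H, tau=1.0)
P_ref = (V * np.minimum(eigs, 1.0)) @ V.T   # eigenvalue clipping for comparison
```

The eigenvalue sitting exactly at $\tau$ contributes zero to $|H-\tau I_n|$, so the identity needs no sign decision there, consistent with the exactness statement above.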
This gives a two-stage strategy. First, approximate the canonical partial polar factor $Q$. For a tall full-column-rank matrix, the Newton-Schulz iteration

$$X_{k+1}=\frac{1}{2}X_{k}(3I-X_{k}^{\top}X_{k})$$

converges under standard scaling assumptions; a simple sufficient condition is $\|I-X_{0}^{\top}X_{0}\|_{2}<1$. Rank-deficient cases are more delicate in finite precision: exact zero singular values remain zero under the iteration, but tiny singular values can slow convergence and contaminate the computed partial polar factor. Rank-aware polar iterations, scaled Newton iterations, QDWH-type iterations, or explicit regularization are often preferable (Higham, [1986](https://arxiv.org/html/2605.26459#bib.bib8); Nakatsukasa and Bai, [2010](https://arxiv.org/html/2605.26459#bib.bib10)). For wide matrices, apply the analogous iteration to $M^{\top}$ and transpose the result.
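A minimal sketch of the first stage, assuming a tall, well-conditioned, full-column-rank input. The function name `newton_schulz_polar`, the Frobenius-norm initial scaling, and the fixed iteration count are illustrative choices rather than the paper's recipe; the scaling places all singular values of $X_{0}$ in $(0,1]$, which satisfies the sufficient condition above.

```python
import numpy as np

def newton_schulz_polar(M, iters=40):
    """Approximate the polar factor of a tall full-column-rank matrix.

    X_0 = M / ||M||_F has singular values in (0, 1], so
    ||I - X_0^T X_0||_2 < 1 and the cubic iteration converges.
    """
    X = M / np.linalg.norm(M)          # Frobenius norm by default
    I = np.eye(M.shape[1])
    for _ in range(iters):
        X = 0.5 * X @ (3.0 * I - X.T @ X)
    return X

rng = np.random.default_rng(2)
M = rng.standard_normal((8, 5))
Q_hat = newton_schulz_polar(M)

U, _, Vt = np.linalg.svd(M, full_matrices=False)
orth_err = np.linalg.norm(Q_hat.T @ Q_hat - np.eye(5))
polar_err = np.linalg.norm(Q_hat - U @ Vt)
```

For wide inputs, the same function could be applied to `M.T` and the result transposed, as described above.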
Second, form the symmetric factor

$$\widehat{H}=\frac{1}{2}(\widehat{Q}^{\top}M+M^{\top}\widehat{Q}),$$

approximate $|\widehat{H}-\tau I_{n}|$, and set

$$\widehat{P}=\frac{1}{2}\left(\widehat{H}+\tau I_{n}-|\widehat{H}-\tau I_{n}|\right),\qquad\widehat{X}=\widehat{Q}\widehat{P}.$$

For the left-sided variant, use

$$\widehat{K}=\frac{1}{2}(M\widehat{Q}^{\top}+\widehat{Q}M^{\top}),\qquad\widehat{P}_{L}=f_{\tau}(\widehat{K}),\qquad\widehat{X}=\widehat{P}_{L}\widehat{Q}.$$

In finite precision, the symmetric factors should be explicitly symmetrized, and small negative eigenvalue artifacts should be treated as numerical error rather than meaningful spectrum.
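The second stage might be sketched as follows, with the side chosen by shape. The names `clip_psd` and `mclip_two_stage` are hypothetical. For brevity, $f_{\tau}$ is applied through a dense `eigh` rather than an iterative absolute-value routine, and an SVD-based polar factor stands in for an approximate $\widehat{Q}$; both are simplifying assumptions.

```python
import numpy as np

def clip_psd(H, tau):
    """f_tau(H) for symmetric H; negative eigenvalues are treated as roundoff."""
    H = 0.5 * (H + H.T)                 # explicit symmetrization
    w, V = np.linalg.eigh(H)
    return (V * np.clip(w, 0.0, tau)) @ V.T

def mclip_two_stage(M, Q_hat, tau):
    """Clip M given an (approximate) polar factor, working on the smaller side."""
    m, n = M.shape
    if m >= n:
        H_hat = 0.5 * (Q_hat.T @ M + M.T @ Q_hat)   # n x n right factor
        return Q_hat @ clip_psd(H_hat, tau)
    K_hat = 0.5 * (M @ Q_hat.T + Q_hat @ M.T)       # m x m left factor
    return clip_psd(K_hat, tau) @ Q_hat

rng = np.random.default_rng(3)
errs = []
for shape in [(7, 4), (4, 7)]:
    M = rng.standard_normal(shape)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    tau = 0.5 * (s[0] + s[-1])
    X_hat = mclip_two_stage(M, U @ Vt, tau)         # exact polar factor as Q_hat
    X_ref = (U * np.minimum(s, tau)) @ Vt
    errs.append(np.linalg.norm(X_hat - X_ref))
```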
##### Advantages.
This formulation reduces clipping to classical matrix functions: polar decomposition and the matrix absolute value (Higham, [2008](https://arxiv.org/html/2605.26459#bib.bib9)).
##### Challenges.
The clipped factor $f_{\tau}(H)$ is continuous but not differentiable at eigenvalue $\tau$. A standalone matrix-sign iteration for $H-\tau I_{n}$ is worse conditioned because the sign function is discontinuous at zero. The absolute-value formulation avoids this discontinuity, but it remains nonsmooth at the clipping boundary and can still be sensitive when many singular values satisfy $\sigma_{i}(M)\approx\tau$. Moreover, if $H$ is formed explicitly or if the absolute-value iteration requires dense eigensolves or large dense linear solves, the method can lose its advantage over SVD.
### 5.2 Approach II: rational Newton iteration for the clipped PSD factor
A second approach targets the clipped positive-semidefinite factor

$$P_{\star}=f_{\tau}(H),\qquad H=(M^{\top}M)^{1/2}.$$

For an eigenvalue $\sigma\geq 0$ of $H$, the desired clipped value

$$p_{\star}=\min\{\sigma,\tau\}$$

is the smaller root, or the double root when $\sigma=\tau$, of

$$(p-\tau)(p-\sigma)=0.$$

Newton's method for this scalar equation gives

$$p_{k+1}=\frac{p_{k}^{2}-\sigma\tau}{2p_{k}-\sigma-\tau},\qquad p_{0}=0.$$

Let

$$a=\min\{\sigma,\tau\},\qquad b=\max\{\sigma,\tau\},\qquad e_{k}=a-p_{k}.$$

For $p_{k}<a$,

$$e_{k+1}=\frac{e_{k}^{2}}{b-a+2e_{k}}.$$

Thus the iterates increase to the smaller root. If $\sigma\neq\tau$, convergence is locally quadratic; if $\sigma=\tau$, then $e_{k+1}=e_{k}/2$ and convergence is only linear.
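The two convergence regimes can be seen in a few lines of plain Python. This is a sketch of the scalar recurrence only; the function name `newton_clip_scalar` and the sample values of $\sigma$ and $\tau$ are illustrative.

```python
def newton_clip_scalar(sigma, tau, iters):
    """Newton iterates for (p - tau)(p - sigma) = 0, started from p_0 = 0."""
    p = 0.0
    history = [p]
    for _ in range(iters):
        denom = 2.0 * p - sigma - tau
        if denom == 0.0:          # p already sits on a double root
            break
        p = (p * p - sigma * tau) / denom
        history.append(p)
    return history

# Separated roots (sigma = 3, tau = 1): error roughly squares each step.
separated = newton_clip_scalar(sigma=3.0, tau=1.0, iters=8)
# Double root (sigma = tau = 1): error halves each step.
double_root = newton_clip_scalar(sigma=1.0, tau=1.0, iters=8)

errors_sep = [1.0 - p for p in separated]    # a = min(sigma, tau) = 1
errors_dbl = [1.0 - p for p in double_root]
```

With these inputs the separated case reaches roundoff level within a handful of steps, while the double-root case still has error $2^{-8}$ after eight steps, matching $e_{k+1}=e_{k}/2$.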
Applied by spectral functional calculus, this yields the rational matrix iteration

$$(H+\tau I_{n}-2P_{k})P_{k+1}=\tau H-P_{k}^{2},\qquad P_{0}=0.$$

Equivalently, in exact arithmetic,

$$P_{k+1}=(H+\tau I_{n}-2P_{k})^{-1}(\tau H-P_{k}^{2}),$$

because $P_{k}=r_{k}(H)$ is a rational function of $H$ and therefore commutes with $H$. This formula should be interpreted as a rational spectral filter, not as a generic noncommutative Newton method. In finite precision, the linear solve should be implemented symmetrically and $P_{k+1}$ should be explicitly symmetrized.
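A small NumPy sketch of the matrix iteration, under favorable assumptions: the test spectrum is chosen well away from $\tau$, $H$ is formed by a dense eigendecomposition (`psd_sqrt` is a hypothetical stand-in for a fast square-root routine), and a general `np.linalg.solve` followed by symmetrization is used in place of a genuinely symmetric solve. It is meant to illustrate the recurrence, not to represent a tuned implementation.

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric PSD square root via eigh (dense stand-in for a fast routine)."""
    w, V = np.linalg.eigh(0.5 * (A + A.T))
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def rational_newton_clip(H, tau, iters=12):
    """Iterate (H + tau I - 2 P_k) P_{k+1} = tau H - P_k^2 from P_0 = 0."""
    I = np.eye(H.shape[0])
    P = np.zeros_like(H)
    for _ in range(iters):
        A = H + tau * I - 2.0 * P
        P = np.linalg.solve(A, tau * H - P @ P)
        P = 0.5 * (P + P.T)            # explicit symmetrization
    return P

rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((6, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
s = np.array([3.0, 2.0, 0.5, 0.25])    # singular values, none close to tau
tau = 1.0
M = (U * s) @ V.T

H = psd_sqrt(M.T @ M)
P = rational_newton_clip(H, tau)
P_ref = (V * np.minimum(s, tau)) @ V.T
p_err = np.linalg.norm(P - P_ref)

# Recover the clipped matrix by applying the polar factor: M H^+ P.
X = M @ np.linalg.pinv(H) @ P
X_ref = (U * np.minimum(s, tau)) @ V.T
x_err = np.linalg.norm(X - X_ref)
```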
Computing $P_{\star}$ alone does not recover the clipped matrix. One must also apply the polar factor:

$$\operatorname{MClip}_{\tau}(M)=QP_{\star}=MH^{\dagger}P_{\star},$$

with the convention that the zero eigenspace of $H$ contributes zero in the pseudoinverse product. Equivalently,

$$\operatorname{MClip}_{\tau}(M)=Mg_{\tau}(H),$$

where

$$g_{\tau}(\sigma)=\begin{cases}1,&0\leq\sigma\leq\tau,\\ \tau/\sigma,&\sigma>\tau.\end{cases}$$

The value of $g_{\tau}$ at $\sigma=0$ is immaterial in exact arithmetic because $M$ annihilates the nullspace of $H$; the continuous convention $g_{\tau}(0)=1$ is preferable when approximating $g_{\tau}$ directly.
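The filter form $M g_{\tau}(H)$ can be illustrated on a rank-deficient matrix, where the convention $g_{\tau}(0)=1$ avoids any division by zero. In this sketch the function name `mclip_via_filter` is hypothetical and the eigendecomposition of $M^{\top}M$ is a dense stand-in for whatever approximation of $g_{\tau}$ would be used in practice.

```python
import numpy as np

def mclip_via_filter(M, tau):
    """MClip_tau(M) = M g_tau(H), with g_tau applied to the eigenvalues of H."""
    w, V = np.linalg.eigh(M.T @ M)            # eigenvalues of H^2
    sigma = np.sqrt(np.clip(w, 0.0, None))    # eigenvalues of H
    # g = 1 on [0, tau], tau / sigma above tau; the maximum guards the division.
    g = np.where(sigma > tau, tau / np.maximum(sigma, tau), 1.0)
    return M @ ((V * g) @ V.T)

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))   # rank 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
tau = 0.5 * (s[0] + s[1])                     # clip only the leading direction

X_filter = mclip_via_filter(M, tau)
X_ref = (U * np.minimum(s, tau)) @ Vt
err = np.linalg.norm(X_filter - X_ref)
```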
##### Advantages.
The iteration targets the clipped positive-semidefinite factor directly, avoiding a separate discontinuous sign computation.
##### Challenges.
The method still requires applying or approximating $H=(M^{\top}M)^{1/2}$ and then applying $H^{\dagger}P_{\star}$ or $Q$. It also involves solving linear systems with the coefficient matrix

$$H+\tau I_{n}-2P_{k}.$$

For the $i$th spectral component, the coefficient eigenvalue is

$$\sigma_{i}+\tau-2p_{k,i}\longrightarrow|\sigma_{i}-\tau|.$$

Thus, directions exactly at the clipping threshold make the limiting coefficient singular, and spectra clustered near $\sigma=\tau$ lead to ill-conditioned solves. The iteration is therefore not automatically cheaper than SVD unless combined with an efficient square-root/polar routine and stabilization near the threshold, such as smoothing the clip or regularizing the solve.
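The limiting behavior of the coefficient eigenvalue can be traced with the scalar recurrence alone. The pure-Python sketch below uses illustrative values of $\sigma$ and a hypothetical helper `coefficient_history`; it tracks the coefficient only and does not attempt any of the stabilizations mentioned above.

```python
def coefficient_history(sigma, tau, iters):
    """Track c_k = sigma + tau - 2 p_k along the scalar iteration from p_0 = 0."""
    p, coeffs = 0.0, []
    for _ in range(iters):
        c = sigma + tau - 2.0 * p
        coeffs.append(c)
        if c == 0.0:              # coefficient has become exactly singular
            break
        p = (sigma * tau - p * p) / c
    return coeffs

tau = 1.0
far = coefficient_history(3.0, tau, 10)      # tends to |3 - 1| = 2
near = coefficient_history(1.001, tau, 30)   # tends to |1.001 - 1| = 1e-3
at = coefficient_history(1.0, tau, 10)       # halves each step, tends to 0
```

A component far from the threshold keeps a coefficient of order one, a component near it ends up dividing by a number of size $|\sigma-\tau|$, and a component exactly at the threshold drives the coefficient toward zero.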
## 6 Conclusion
MuCon replaces Muon's polar direction with a clipped spectral direction. For the Muon matrix $B_{t}=U\Sigma V^{\top}$,

$$D_{t}^{\mathrm{Muon}}=UV^{\top},\qquad D_{t}^{\mathrm{MuCon}}=\operatorname{MClip}_{\tau_{t}}(B_{t})=U\operatorname{diag}(\min\{\sigma_{i},\tau_{t}\})V^{\top}.$$

Thus MuCon is not a new parameterization: it is a projection primitive layered on top of the SpectralP width-depth scaling and the implementation-level RMS-matching calibration. CompleteP supplies the residual and AdamW companion-group scaling rules; SpectralP specifies the hidden-matrix Muon/MuCon group convention; clipping controls only the spectrum of the resulting matrix direction.
The main algorithmic lesson is a regime split. When only a few singular values exceed $\tau$, clipping is the low-rank correction

$$\operatorname{MClip}_{\tau}(M)=M-U_{>}\operatorname{diag}\bigl((\sigma_{i}-\tau)_{i\in\mathcal{I}_{>}}\bigr)V_{>}^{\top},$$

so partial SVD, Lanczos, or randomized subspace methods are the natural baselines. When many singular values are clipped, global matrix-function methods such as polar/absolute-value formulations become plausible, but only if the polar or square-root primitives are stable. Rational Newton filters are exact as spectral filters, but their solves become ill-conditioned near $\sigma=\tau$, which explains their poor behavior without additional regularization.
## References
- Dey et al. (2025) Nolan Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don't be lazy: CompleteP enables compute-efficient deep transformers. *arXiv preprint arXiv:2505.01618*, 2025.
- Zheng et al. (2026) Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, and Chongxuan Li. Spectral condition for $\mu$P under width-depth scaling. *arXiv preprint arXiv:2603.00541v1*, 2026.
- Yang and Hu (2021) Greg Yang and Edward J. Hu. Tensor Programs IV: Feature learning in infinite-width neural networks. In *Proceedings of the International Conference on Machine Learning*, 2021.
- Yang et al. (2022) Greg Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer. *arXiv preprint arXiv:2203.03466*, 2022.
- Yang et al. (2023) Greg Yang, James B. Simon, and Jeremy Bernstein. A spectral condition for feature learning. *arXiv preprint arXiv:2310.17813*, 2023.
- Jordan et al. (2024) Keller Jordan, Yuchen Jin, Vladimir Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. Technical blog post, 2024.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In *International Conference on Learning Representations*, 2019.
- Higham (1986) Nicholas J. Higham. Computing the polar decomposition-with applications. *SIAM Journal on Scientific and Statistical Computing*, 7(4):1160-1174, 1986.
- Higham (2008) Nicholas J. Higham. *Functions of Matrices: Theory and Computation*. SIAM, 2008.
- Nakatsukasa and Bai (2010) Yuji Nakatsukasa and Zhaojun Bai. Optimizing Halley's iteration for computing the matrix polar decomposition. *SIAM Journal on Matrix Analysis and Applications*, 31(5):2700-2720, 2010.
- Golub and Van Loan (2013) Gene H. Golub and Charles F. Van Loan. *Matrix Computations*. Johns Hopkins University Press, 4th edition, 2013.
- Halko et al. (2011) Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. *SIAM Review*, 53(2):217-288, 2011.