@timlautk: 1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduc…
Summary
Introduces a symmetry-compatible principle for LLM optimizer design, yielding a layerwise optimizer stack with principled updates for embeddings, LM heads, SwiGLU MLPs, and MoE routers, showing improved validation loss over AdamW across multiple architectures.
Cached at: 05/20/26, 04:25 AM
1/4 New paper with @weijie444!
We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduct, get an end-to-end layerwise optimizer stack where every major matrix-valued parameter (embeddings, LM heads, SwiGLU MLPs, MoE routers) has its own principled update!
http://arxiv.org/abs/2605.18106 http://github.com/timlautk/equivariant_optimizers…
Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Source: https://arxiv.org/html/2605.18106
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Abstract
A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimization methods, such as Adam and its variants, operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. In this paper, we address this disparity by introducing a symmetry-compatible principle for optimizer design. Specifically, we argue that the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block of the neural network. Following this principle, we first provide a unified perspective on the natural class of bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive new classes of symmetry-compatible optimizers tailored to parameter blocks whose symmetries differ from those of general matrix layers: for embedding and LM head matrices, left-permutation and right-orthogonal equivariance leads to one-sided spectral, row-norm, and hybrid row-norm/spectral updates; for SwiGLU MLP projections, intermediate-neuron permutation symmetry motivates row-aware and column-aware variants; and for MoE routers, expert-permutation symmetry together with shared-logit-shift invariance gives rise to centered row-norm and left-spectral updates. These constructions yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this optimizer design principle through extensive pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures.
Across these experiments, symmetry-compatible update rules consistently improve final validation loss, and in several cases training stability, over the corresponding AdamW updates.
https://github.com/timlautk/equivariant_optimizers
1 Introduction
The most widely used optimizers in deep learning, such as Adam [81], Adafactor [136], RMSprop [147], AdaGrad [43,114], and their variants, all belong to the broad family of coordinate-wise adaptive gradient methods. These methods treat model parameters as a single long concatenated vector and update each coordinate independently. Despite their empirical success, this design implicitly assumes that every entry of a weight matrix is an independent coordinate in a high-dimensional vector space. This assumption is rarely questioned, yet it strongly shapes the training dynamics of modern neural networks. In particular, such a geometry-blind treatment ignores the rich matrix structure of neural network parameters and fails to distinguish between the geometries of different layer types, such as embeddings, LM heads, dense linear layers, attention projections, SwiGLU MLP projections, and MoE routers.
At the same time, our theoretical understanding of optimizer behavior remains limited across the two major families most relevant to modern large-scale training: coordinate-wise adaptive gradient optimizers and spectral optimizers. In language model pre-training in particular, comparisons between these optimizer families are still largely empirical, relying on large-scale benchmarking exercises [153,134] and speedrunning [75], with relatively little analysis of their different geometric behavior and training dynamics. Hyperparameter transfer rules [160] and scaling-law prescriptions [77,67], for example, are often applied across optimizers, even though their original development was tied primarily to coordinate-wise adaptive methods, particularly AdamW [109]. Another notable benchmarking effort is AlgoPerf: Training Algorithms [29,78], which evaluates training speedups obtained solely from changes to the training algorithm and aims to provide a more comprehensive comparison of optimizers. However, AlgoPerf does not include a language modeling workload, and its workloads are far smaller than the language models considered in modern pre-training. Such benchmarking practices implicitly assume that different optimizer families are directly comparable and share similar training phenomena, which need not be the case.
The central thesis of this paper is that optimizer design for modern neural networks should be layerwise and symmetry-compatible. Rather than applying a single coordinate-wise optimizer to all parameters, we propose a layerwise symmetry-compatible principle: each major matrix-valued parameter class should be updated by an optimizer whose equivariance matches the symmetry of that parameter class. This leads to a broad family of equivariant optimizers, whose update laws are matched to the symmetry groups of the parameter blocks on which they act.
Figure 1 summarizes this shift. The coordinate-wise view treats matrix-valued parameters as vectorized collections of independent coordinates, leading to updates that can discard spectral structure and break natural equivariances. In contrast, the symmetry-aware matrix view starts from the layerwise geometry of each parameter class and derives optimizer updates whose equivariance matches that geometry.
[Figure 1 schematic. Left panel, coordinate-wise view: parameters treated as a long vector; entrywise adaptive updates (Adam, AdaGrad, Adafactor, RMSprop); break orthogonal equivariance for matrix layers; discard spectral structure and induce mismatched geometry. Right panel, symmetry-aware matrix view: matrix parameters have layerwise symmetry and geometry; update maps should match the symmetry of each parameter class; spectral, one-sided spectral, row-norm, and hybrid optimizers; architecture–optimizer co-design for linear layers, SwiGLU MLPs, embeddings, heads, and MoE routers. Label: rethink optimizer geometry.]
Figure 1: Two perspectives on deep learning optimization. Left: coordinate-wise adaptive methods treat matrix parameters as vectors and ignore matrix geometry. Right: the symmetry- and equivariance-based viewpoint developed in this paper leads to a family of equivariant, layer-specific optimizer classes and architecture–optimizer co-design.
Contributions.
Our work makes the following contributions.
- 1. A symmetry-compatible principle for matrix-gradient optimizer design. We argue that popular coordinate-wise adaptive optimizers such as Adam, AdamW, and RMSprop are geometrically mismatched for matrix-valued parameters in the sense that their updates generally fail to respect the natural equivariance and invariance structures of matrix layers. Fully-connected layers, attention projections, embedding and LM head matrices, dense and expert SwiGLU MLP projections, and MoE router weight matrices all possess nontrivial row, column, permutation, and spectral geometries. Their gradients often exhibit correlations, low-rank structure, and dominant singular directions that are not explicitly represented by elementwise updates. Our central message is that neural network weight matrices live in geometries that coordinate-wise adaptive methods do not capture.
- 2. A unifying equivariance view of spectral optimizers. We show that optimizer updates governed by orthogonal equivariance naturally lead to the class of spectral optimizers. This class includes or provides a unifying interpretation of stochastic spectral descent (SSD) [21], Muon [76], Scion [122], and polar gradient methods (PolarGrad) [89]. These methods compute, exactly or approximately, the orthogonal polar factor of an update direction $D$, such as a gradient $G$ or momentum $M$: $D=U\Sigma V^{\top}\ \Rightarrow\ U_{\mathsf{p}}\coloneqq\mathrm{polar}(D)=UV^{\top}$. Such updates are bi-orthogonally equivariant, preserve the singular-vector structure of the update direction, and arise naturally from matrix geometry. This viewpoint gives a symmetry-based interpretation of the spectral-norm steepest descent principle underlying Muon [11,12,76]: because the spectral norm is unitarily invariant, the corresponding polar update is naturally bi-orthogonally equivariant.
- 3. A family of equivariant optimizers for layerwise architecture–optimizer co-design. Beyond full spectral optimizers for ordinary matrix layers, we derive equivariant optimizer classes for layers whose symmetries differ from those of standard linear maps. These include one-sided spectral optimizers, such as right-spectral updates for embedding and LM head matrices and left-spectral updates for MoE routers, as well as non-spectral row-norm-based optimizers and hybrid row-norm/one-sided-spectral optimizers. We further show that SwiGLU MLP projection matrices possess intermediate-neuron permutation geometry, motivating row-aware updates for gate and up projections and column-aware updates for down projections. The corresponding practical momentum variants are denoted RightPolarGradM, LeftPolarGradM, RowNormM, and HybridPolarGradM. These constructions instantiate an architecture–optimizer co-design principle based on layerwise equivariance.
- 4. End-to-end pre-training evidence. We evaluate the proposed equivariant optimizer assignments in dense and sparse MoE language model pre-training experiments (Section 4). These experiments instantiate, to the best of our knowledge, the first end-to-end pre-training optimizer stack in which all major matrix-valued parameter classes in language models are assigned updates according to their layerwise symmetry. Replacing AdamW on large vocabulary-indexed matrices with row-norm or hybrid equivariant updates consistently improves final validation loss. The gains are modest but visible for the smaller Qwen3-0.6B-style dense model, become more pronounced for the larger Gemma 3 1B-style model, and persist in sparse MoE experiments based on OLMoE-1B-7B and downsized gpt-oss (Figure 2). In dense models, hybrid row-norm/spectral updates for SwiGLU MLP projections further improve validation loss. In the MoE setting, symmetry-compatible router updates improve over coordinate-wise router updates and can reduce training loss spikes.
As a representative example, Figure 2 shows the effect of symmetry-compatible assignments in a sparse MoE pre-training experiment.
Figure 2: Validation losses for downsized gpt-oss pre-training. The configurations differ in the optimizers for the embedding, LM head, and router matrices; see Section 4.4 for details. Configurations (i) and (ii) use symmetry-compatible optimizers derived from the layerwise equivariance principle, while configuration (iii) replaces the router update by AdamW and configuration (iv) uses AdamW for the embedding, LM head, and router matrices.
Scope and limitations.
Our goal is not to claim that equivariant optimizers dominate coordinate-wise adaptive methods in all regimes. Rather, we develop a layerwise equivariance principle for matrix-valued parameters and show that it leads to practical optimizer assignments that are competitive and often beneficial in representative pre-training settings. The empirical results should be viewed as evidence for the usefulness of the principle, not as an exhaustive large-scale optimizer benchmark.
Organization.
We first introduce notation and closely related work in Section 2. In Section 3, we develop the layerwise symmetry-compatible principle, beginning from a linear-operator view of matrix parameters and the resulting coordinate-free equivariance requirements. We then derive equivariant optimizer classes for embeddings, LM heads, SwiGLU MLP projections, and MoE routers, including one-sided spectral, row-norm, and hybrid variants. In Section 3.8, we establish that spectral optimizers are precisely the direction-wise update maps compatible with bi-orthogonal equivariance. We present dense and MoE language model pre-training experiments in Section 4. We conclude with a discussion of broader implications and future directions in Section 5.
2 Preliminaries and Related Work
To keep the paper self-contained, this section introduces the necessary notation and related work. For an extended overview of related work, we refer readers to Appendix A.
Notation.
For any real-valued square matrix $S\in\mathbb{R}^{d\times d}$, $\mathrm{diag}(S)\in\mathbb{R}^{d}$ denotes the vector of its diagonal entries, $\operatorname*{Diag}(S)\in\mathbb{R}^{d\times d}$ the diagonal matrix with diagonal entries equal to those of $S$, and $\mathrm{tr}(S)$ its trace. For any $x\in\mathbb{R}^{d}$, $\operatorname*{Diag}(x)\in\mathbb{R}^{d\times d}$ is the diagonal matrix with diagonal entries equal to the entries of $x$. For any $m\times n$ real-valued matrices $A\coloneqq(a_{i,j})_{1\leqslant i\leqslant m,1\leqslant j\leqslant n}$ and $B\coloneqq(b_{i,j})_{1\leqslant i\leqslant m,1\leqslant j\leqslant n}$, we denote the Frobenius inner product of $A$ and $B$ by $\llangle A,B\rrangle_{\rm F}\coloneqq\mathrm{tr}(A^{\top}B)=\sum_{i,j}a_{i,j}b_{i,j}$. For a matrix $A\in\mathbb{R}^{m\times n}$, we denote its Frobenius norm by $\lvert\lvert\lvert A\rvert\rvert\rvert_{\mathrm{F}}\coloneqq\sqrt{\llangle A,A\rrangle_{\rm F}}$, its spectral norm by $\lvert\lvert\lvert A\rvert\rvert\rvert_{\mathrm{S}}\coloneqq\sup_{x\in\mathbb{R}^{n},x\neq 0}\{\|Ax\|_{2}/\|x\|_{2}\}$, its nuclear norm by $\lvert\lvert\lvert A\rvert\rvert\rvert_{\mathrm{nuc}}\coloneqq\sum_{i=1}^{m\wedge n}\sigma_{i}(A)$, where $\sigma(A)=(\sigma_{1}(A),\ldots,\sigma_{m\wedge n}(A))^{\top}$ is the vector of nonincreasing ordered singular values of $A$, and its max norm by $\lvert\lvert\lvert A\rvert\rvert\rvert_{\max}\coloneqq\max_{1\leqslant i\leqslant m,1\leqslant j\leqslant n}|a_{i,j}|$.
The Schatten $p$-norm of $A$ is denoted by $\lvert\lvert\lvert A\rvert\rvert\rvert_{p}\coloneqq\lVert\sigma(A)\rVert_{p}$. The Hadamard product of $A\in\mathbb{R}^{m\times n}$ and $B\in\mathbb{R}^{m\times n}$ is denoted by $A\odot B\coloneqq(a_{i,j}b_{i,j})_{1\leqslant i\leqslant m,1\leqslant j\leqslant n}$. For a matrix $A\in\mathbb{R}^{m\times n}$, we denote by $\mathrm{vec}(A)\in\mathbb{R}^{mn}$ its vectorization by rows. Conversely, for $x\in\mathbb{R}^{mn}$, we write $\mathrm{reshape}(x,m,n)\in\mathbb{R}^{m\times n}$ for the inverse operation, so that $\mathrm{reshape}(\mathrm{vec}(A),m,n)=A$ for all $A\in\mathbb{R}^{m\times n}$. Let $\mathbb{S}^{d}\coloneqq\{A\in\mathbb{R}^{d\times d}:A=A^{\top}\}$ denote the space of real symmetric matrices in $\mathbb{R}^{d\times d}$, $\mathbb{S}^{d}_{+}\coloneqq\{A\in\mathbb{S}^{d}:A\succcurlyeq 0\}$ the set of symmetric positive semidefinite matrices, and $\mathbb{S}^{d}_{++}\coloneqq\{A\in\mathbb{S}^{d}:A\succ 0\}$ the set of symmetric positive definite matrices, where $\succcurlyeq$ and $\succ$ denote Löwner orders. Let $\mathbb{O}^{d}\coloneqq\{A\in\mathbb{R}^{d\times d}:A^{\top}A=AA^{\top}=I_{d}\}$ denote the set of $d\times d$ orthogonal matrices, where $I_{d}\in\mathbb{R}^{d\times d}$ is the $d\times d$ identity matrix. Let $\mathbb{P}^{d}\coloneqq\{P\in\{0,1\}^{d\times d}:P\bm{1}_{d}=\bm{1}_{d},\;P^{\top}\bm{1}_{d}=\bm{1}_{d}\}$ denote the set of $d\times d$ permutation matrices, where $\bm{1}_{d}$ is the all-ones vector in $\mathbb{R}^{d}$. Let $\mathcal{E}$ be a Euclidean space endowed with an inner product $\langle\cdot,\cdot\rangle$ and the induced norm $\|\cdot\|$.
The domain of a function $f\colon\mathcal{E}\to\overline{\mathbb{R}}\coloneqq\mathbb{R}\cup\{\pm\infty\}$ is $\operatorname*{dom}f\coloneqq\{x\in\mathcal{E}:f(x)<\infty\}$. A function $f\colon\mathcal{E}\to\overline{\mathbb{R}}$ is said to be proper if it has a nonempty domain. The (convex) indicator function $\iota_{\mathcal{C}}(x)$ of a nonempty closed convex set $\mathcal{C}$ at $x$ equals $0$ if $x\in\mathcal{C}$ and $+\infty$ otherwise. The Euclidean projection of $x$ onto a nonempty closed convex set $\mathcal{C}$ is denoted by $\operatorname{proj}_{\mathcal{C}}(x)$. The set of nonnegative integers is denoted by $\mathbb{N}$, and $\mathbb{N}^{*}\coloneqq\mathbb{N}\setminus\{0\}$ denotes the set of positive integers. For a function $f\colon\mathcal{E}\to\overline{\mathbb{R}}$, we use $\operatorname*{argmin}f$ to denote the unique minimizer of $f$.
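As a small numerical illustration of the matrix norms and the row-wise vectorization defined above, the following NumPy sketch evaluates them on a hand-picked $3\times 2$ matrix. The matrix and the helper name `schatten` are illustrative choices, not part of the paper.

```python
import numpy as np

# Illustrative 3x2 matrix with singular values 4 and 3.
A = np.array([[3.0, 0.0],
              [0.0, 4.0],
              [0.0, 0.0]])

# Singular values, returned in nonincreasing order.
sigma = np.linalg.svd(A, compute_uv=False)

frobenius = np.sqrt(np.trace(A.T @ A))  # sqrt of <A, A>_F
spectral = sigma[0]                     # largest singular value
nuclear = sigma.sum()                   # sum of singular values
max_norm = np.abs(A).max()              # largest absolute entry


def schatten(A, p):
    """Schatten p-norm: the vector p-norm of the singular values."""
    return np.linalg.norm(np.linalg.svd(A, compute_uv=False), ord=p)


# Vectorization by rows and its inverse (NumPy reshapes are row-major).
vec_A = A.reshape(-1)
A_back = vec_A.reshape(3, 2)
```

In this example the Schatten norms with $p=1$, $p=2$, and $p=\infty$ coincide with the nuclear, Frobenius, and spectral norms, respectively.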
2.1 Matrix-Gradient Optimizers
The recent release of Muon [76], together with its strong empirical performance in the modded-nanogpt speedrun [75], has renewed interest in matrix-gradient optimizers for deep learning. This has led to a rapidly growing line of work on geometry-aware and matrix-structured optimization methods [122,89,28,150,26,85,49,71,140,169,158,73,171,56,126,159,42]. Conceptually, Muon is closely related to stochastic spectral descent (SSD) [21,22,23], since both methods can be derived from steepest descent with respect to the spectral norm. We emphasize that this spectral-norm steepest descent perspective is already closely aligned with the equivariance view developed here: because the spectral norm is unitarily invariant, its steepest descent direction is the orthogonal polar factor, and the resulting Muon update is implicitly bi-orthogonally equivariant. Our contribution is to make this equivariance principle explicit, to place Muon and related methods inside a broader class of spectral optimizers, and to extend the same symmetry-based design logic to layers whose symmetries are not fully bi-orthogonal, such as embeddings, LM heads, SwiGLU MLP projections, and MoE routers.
On the theoretical side, local one-step analyses of simplified Muon-type updates have been developed in [145,32,57], while several recent works study convergence rates and optimization guarantees under different assumptions [97,89,24,83,138,79,112]. Our work is aligned with this broad effort, but differs in emphasis: rather than viewing matrix-gradient optimizers primarily as normalization or preconditioning heuristics, we derive them from symmetry and equivariance principles for matrix-valued neural network parameters.
A separate but related line of work develops matrix-gradient optimizers from second-order or preconditioning perspectives. These include Kronecker-factored or layerwise preconditioners such as K-FAC [113,45], Shampoo [62,7,139,44], BFGS and L-BFGS-type methods [54], SOAP [151], KL-Shampoo and KL-SOAP [104], and learned or adaptive preconditioners such as preconditioned SGD (PSGD) [98,123]. These methods typically approximate curvature or preconditioning structure, whereas spectral and polar updates can also be understood as enforcing equivariance properties of the update map itself. This distinction is important for our framework, since the appropriate optimizer geometry depends on the symmetry group of the layer, not only on curvature approximation.
Other related directions include imposing constraints directly on the weights, such as Stiefel-manifold interpretations and manifold-constrained optimizers [15,20,156,161,61], as well as variance reduction and low-rank gradient projection methods such as MARS-M and GaLore [106,165,175,143]. These methods address complementary aspects of matrix-gradient optimization, including weight constraints, variance control, and computational efficiency. We refer readers to the recent review [121] for a broader overview of geometry-aware optimization methods in deep learning.
2.2 Matrix Optimization Problems, Löwner Operators, and Spectral Operators
Matrix optimization problems have long been studied as a distinct class of optimization problems because matrices carry algebraic and geometric structures, such as eigenvalues, singular values, ranks, invariant subspaces, and unitary symmetries, that are obscured by vectorization [38,39]. The foundations for convex and unitarily invariant matrix functions, eigenvalue optimization, and spectral optimization were developed in convex matrix analysis and variational analysis [93,94,92,95].
Our framework is also closely related to spectral functions and spectral operators [68,16,66,146,27]. For rectangular matrices, such operators act on singular values while preserving singular vectors, $G=U\operatorname*{Diag}(\sigma(G))V^{\top}\mapsto\mathcal{T}(G)=U\operatorname*{Diag}(\psi(\sigma(G)))V^{\top}$. This is the same operator-theoretic structure underlying spectral matrix-gradient optimizers such as stochastic spectral descent, Muon, Scion, and polar gradient methods.
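The map $\mathcal{T}$ above can be sketched in a few lines of NumPy. This is only an illustration using an exact SVD; the function name `spectral_operator` and the particular choices of $\psi$ are assumptions made here, and practical optimizers may approximate such maps without computing an SVD.

```python
import numpy as np


def spectral_operator(G, psi):
    """Apply psi to the singular values of G, keeping the singular vectors."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    # Scaling the columns of U by psi(s) forms U Diag(psi(s)) without
    # building the diagonal matrix explicitly.
    return (U * psi(s)) @ Vt


rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))

# psi = identity recovers G itself.
same = spectral_operator(G, lambda s: s)

# psi = 1 on all singular values gives U V^T, the orthogonal polar
# factor of a full-rank G.
polar_factor = spectral_operator(G, np.ones_like)

# psi = min(., 1) clips singular values, bounding the spectral norm by 1.
clipped = spectral_operator(G, lambda s: np.minimum(s, 1.0))
```

The three choices of `psi` differ only in how they reshape the spectrum; the singular vectors of the output are those of `G` in every case.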
2.3 Symmetry and Equivariance in Deep Learning
There is a long line of work recognizing symmetry and equivariance as organizing principles in neural networks, both for understanding optimization, generalization, and representation learning [118,63,133,99,1,172,173,125,174], and for designing equivariant architectures [102,10,82]. Our work is complementary: rather than imposing equivariance on the architecture or studying equivariance of existing training dynamics, we impose equivariance on the optimizer update map acting on parameter tensors. Thus, our viewpoint extends the equivariance principle from architecture design to optimizer design, where the relevant symmetry is the internal geometry of the parameter space rather than only the symmetry of the input or output domain.
3 Equivariant Optimizers from Layerwise Symmetry
Modern deep learning architectures contain matrix-valued parameters with different symmetry structures. The common principle is that a parameter matrix does not always represent an arbitrary array of coordinates, but often represents a linear map between two structured spaces. If the coordinates of these spaces are changed, the parameter and its gradient transform accordingly, and a geometry-compatible optimizer should transform in the same way.
We first state this principle in a general form. Let $W\in\mathbb{R}^{m\times n}$ represent a linear map from an input space to an output space. Suppose the output and input coordinates are transformed by invertible matrices $P\in\mathrm{GL}(m)$ and $Q\in\mathrm{GL}(n)$. Then the same linear map is represented by $\widetilde{W}=PWQ^{-1}$. If $\widetilde{f}(\widetilde{W})\coloneqq f(P^{-1}\widetilde{W}Q)$, then standard matrix calculus gives
$$\nabla_{\widetilde{W}}\widetilde{f}(\widetilde{W})=P^{-\top}\,\nabla_{W}f(W)\,Q^{\top}.$$
Thus, under a general change of coordinates, the gradient transforms contravariantly with respect to the output coordinates and covariantly with respect to the input coordinates.
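The transformation law for the gradient can be checked numerically. The sketch below is a minimal illustration under assumed choices: a least-squares loss $f(W)=\tfrac12\|WX-Y\|_{\rm F}^2$, small random data, and mildly perturbed identity matrices for $P$ and $Q$; none of these come from the paper. It compares a central finite-difference gradient of $\widetilde{f}$ with $P^{-\top}\nabla_W f(W)Q^{\top}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, b = 3, 4, 6
X = rng.standard_normal((n, b))  # assumed toy inputs
Y = rng.standard_normal((m, b))  # assumed toy targets


def f(W):
    """Toy layerwise loss, chosen only for illustration."""
    return 0.5 * np.sum((W @ X - Y) ** 2)


def grad_f(W):
    """Analytic gradient of the toy loss."""
    return (W @ X - Y) @ X.T


# Invertible (non-orthogonal) coordinate changes near the identity.
P = np.eye(m) + 0.1 * rng.standard_normal((m, m))
Q = np.eye(n) + 0.1 * rng.standard_normal((n, n))
P_inv = np.linalg.inv(P)
Q_inv = np.linalg.inv(Q)


def f_tilde(W_t):
    """The same loss expressed in the transformed coordinates."""
    return f(P_inv @ W_t @ Q)


def numerical_grad(fun, W, h=1e-6):
    """Entrywise central finite differences."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = h
            G[i, j] = (fun(W + E) - fun(W - E)) / (2.0 * h)
    return G


W = rng.standard_normal((m, n))
W_t = P @ W @ Q_inv                   # transformed parameter
lhs = numerical_grad(f_tilde, W_t)    # gradient of f_tilde at W_t
rhs = P_inv.T @ grad_f(W) @ Q.T       # predicted transformation law
```

When $P$ and $Q$ are orthogonal, `P_inv.T` equals `P` and the law reduces to $G\mapsto PGQ^{\top}$.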
In this work, we study the equivariance of the update map $\mathscr{U}\colon\mathbb{R}^{m\times n}\to\mathbb{R}^{m\times n}$ in matrix-optimizer iterations
$$(\forall k\in\mathbb{N})\qquad W_{k+1}=W_{k}-\gamma_{k}\mathscr{U}(D_{k}),$$
where $D_{k}$ is an update direction, such as a gradient or momentum. The relevant requirement is not necessarily that the layerwise loss function be invariant under arbitrary transformations, but that the optimizer update transform consistently with the representation of its input direction. Thus, once a layer symmetry specifies a transformation law $D_{k}\mapsto g\cdot D_{k}$, we require
$$\mathscr{U}(g\cdot D_{k})=g\cdot\mathscr{U}(D_{k}).$$
When $D_{k}$ transforms equivariantly, the update $\mathscr{U}(D_{k})$ therefore transforms equivariantly as well.
In this paper, however, we do not require equivariance under all invertible changes of coordinates. The relevant symmetry group depends on the layer. For ordinary linear and attention matrices, the natural coordinate changes are orthonormal changes of basis, so $P\in\mathbb{O}^{m}$ and $Q\in\mathbb{O}^{n}$. In this case $P^{-\top}=P$ and $Q^{-1}=Q^{\top}$, and both the parameter and gradient transform as $W\mapsto PWQ^{\top}$ and $G\mapsto PGQ^{\top}$. This leads to the bi-orthogonal equivariance condition
$$\mathscr{U}(PGQ^{\top})=P\,\mathscr{U}(G)\,Q^{\top}.$$
For embedding and LM head matrices $W\in\mathbb{R}^{v\times d}$, the row axis indexes vocabulary items, so the admissible left action is not a general orthogonal rotation but a permutation $P\in\mathbb{P}^{v}$, while the hidden feature axis still admits right orthogonal transformations. For MoE routers, the row axis indexes experts and additionally has a shared-logit-shift invariance. For SwiGLU MLP projections, the relevant symmetry is permutation of intermediate neurons, which acts on the rows of the gate and up projections and on the columns of the down projection.
This gives a layerwise equivariance principle: the optimizer update map should commute with the symmetry group of the parameter block on which it acts. Full bi-orthogonal equivariance leads to spectral optimizers for ordinary matrix layers; left-permutation/right-orthogonal equivariance leads to row-aware and right-spectral optimizers for embeddings and LM heads; intermediate-neuron permutation symmetry leads to row- and column-aware updates for SwiGLU MLP projections; and expert-permutation plus shared-shift symmetry leads to centered row-aware or left-spectral updates for MoE routers.
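Two of the symmetries named above are easy to verify on a toy example. The sketch below assumes a bias-free SwiGLU block of the form $W_{\rm down}(\mathrm{SiLU}(W_{\rm gate}x)\odot W_{\rm up}x)$ and a plain softmax router; the shapes, names, and the absence of top-$k$ selection are simplifying assumptions made here, not details taken from the paper.

```python
import numpy as np


def silu(z):
    return z / (1.0 + np.exp(-z))


def swiglu_mlp(x, W_gate, W_up, W_down):
    """Assumed bias-free SwiGLU block acting on a single token vector."""
    return W_down @ (silu(W_gate @ x) * (W_up @ x))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


rng = np.random.default_rng(0)
d, h, n_experts = 4, 6, 3
x = rng.standard_normal(d)
W_gate = rng.standard_normal((h, d))
W_up = rng.standard_normal((h, d))
W_down = rng.standard_normal((d, h))

# Intermediate-neuron permutation: permute the rows of the gate and up
# projections and the columns of the down projection.
P = np.eye(h)[rng.permutation(h)]
y = swiglu_mlp(x, W_gate, W_up, W_down)
y_perm = swiglu_mlp(x, P @ W_gate, P @ W_up, W_down @ P.T)

# Shared logit shift: adding the same row vector c to every expert row
# shifts all logits by the scalar c @ x, which softmax ignores.
W_router = rng.standard_normal((n_experts, d))
c = rng.standard_normal(d)
W_shifted = W_router + np.outer(np.ones(n_experts), c)
probs = softmax(W_router @ x)
probs_shifted = softmax(W_shifted @ x)
```

Both transformations leave the layer output unchanged, which is the sense in which they are invisible to the model.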
3.1 A General Symmetry-Induced Optimizer Geometry
Let $W\in\mathbb{R}^{m\times n}$ be a layer parameter and let $f\colon\mathbb{R}^{m\times n}\to\mathbb{R}$ be the corresponding layerwise loss. Suppose a group $\mathcal{G}$ acts on the parameter space by transformations $W\mapsto g\cdot W$. In the matrix settings considered below, this action is typically of the form $g\cdot W=PWQ^{-1}$, or, after restricting to orthogonal or permutation symmetries, $g\cdot W=PWQ^{\top}$. We say that the parameterization admits the symmetry group $\mathcal{G}$ if $f(g\cdot W)=f(W)$ for all $g\in\mathcal{G}$. The corresponding optimizer update map should satisfy
$$(\forall g\in\mathcal{G})\qquad\mathscr{U}(g\cdot G)=g\cdot\mathscr{U}(G),$$
where $G$ is an update direction, such as a gradient or momentum, expressed in the corresponding transformed coordinates. This condition ensures that the optimizer does not depend on arbitrary choices of representation that are invisible to the model.
3.2 Bi-Orthogonal Equivariance for Ordinary Matrix Layers
The general reparameterization $W\mapsto PWQ^{-1}$ specializes to $W\mapsto PWQ^{\top}$ when the admissible coordinate changes are orthogonal. This is the natural case for ordinary linear layers and attention projection matrices, where both input and output coordinates represent continuous feature bases. Therefore an update map for such layers should satisfy
$$(\forall P\in\mathbb{O}^{m},\forall Q\in\mathbb{O}^{n})\qquad\mathscr{U}(PGQ^{\top})=P\,\mathscr{U}(G)\,Q^{\top}.$$
This is exactly bi-orthogonal equivariance.
Definition 3.1 (Bi-orthogonal equivariance).
Let $\mathscr{U}\colon\mathbb{R}^{m\times n}\to\mathbb{R}^{m\times n}$ be a matrix-valued map. We say that $\mathscr{U}$ is bi-orthogonally equivariant if, for all $G\in\mathbb{R}^{m\times n}$ and all $P\in\mathbb{O}^{m}$, $Q\in\mathbb{O}^{n}$,
$$\mathscr{U}(PGQ^{\top})=P\,\mathscr{U}(G)\,Q^{\top}.$$
Thus, bi-orthogonal equivariance is exactly the requirement that if $W$ and its gradient are transformed as $W\mapsto PWQ^{\top}$ and $G\mapsto PGQ^{\top}$, then the optimizer update transforms in the same way. More generally, if an update rule has the form $\Delta W=\mathscr{U}(W,G)$, one may require
$$\mathscr{U}(PWQ^{\top},PGQ^{\top})=P\,\mathscr{U}(W,G)\,Q^{\top}.$$
In this paper, we focus on update maps of the form $\mathscr{U}(G)$ or $\mathscr{U}(D)$, where $D$ is a gradient-derived direction such as momentum. Terms that depend explicitly on $W$, such as decoupled weight decay or scalar step-size scaling, can typically be handled separately and preserve the same equivariance, so we suppress the dependence on $W$ for simplicity.
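Definition 3.1 can be probed numerically by measuring $\|\mathscr{U}(PGQ^{\top})-P\,\mathscr{U}(G)Q^{\top}\|_{\rm F}$ for random orthogonal $P$ and $Q$. The sketch below does this for two maps: an SVD-based orthogonal polar factor, and the entrywise sign map. The sign map is used here only as a simple stand-in for a coordinate-wise update (it is what an Adam-style step reduces to when both moment estimates use only the current gradient and the stabilizing constant is zero); the helper names are illustrative.

```python
import numpy as np


def polar(G):
    """Orthogonal polar factor U V^T from a thin SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def random_orthogonal(d, rng):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q


def equivariance_gap(update, G, P, Q):
    """Frobenius norm of U(P G Q^T) - P U(G) Q^T."""
    return np.linalg.norm(update(P @ G @ Q.T) - P @ update(G) @ Q.T)


rng = np.random.default_rng(0)
m, n = 5, 3
G = rng.standard_normal((m, n))
P = random_orthogonal(m, rng)
Q = random_orthogonal(n, rng)

gap_polar = equivariance_gap(polar, G, P, Q)   # at rounding-error level
gap_sign = equivariance_gap(np.sign, G, P, Q)  # far from zero
```

A single random test cannot prove equivariance, but a large gap is enough to show that a map fails the definition.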
For ordinary matrix layers, bi-orthogonal equivariance motivates polar and spectral update directions. In particular, the orthogonal polar factor [9,64] satisfies
$$\mathrm{polar}(PGQ^{\top})=P\,\mathrm{polar}(G)\,Q^{\top},\tag{1}$$
and hence Muon-style and polar-gradient updates are symmetry-compatible for ordinary matrix layers. We defer the full characterization of bi-orthogonally equivariant maps as spectral operators to Section 3.8. Standard momentum constructions such as EMA, Polyak, and Nesterov momentum preserve the same equivariance because their buffers are linear combinations of past gradients. We state this fact for Nesterov momentum in the following proposition.
Proposition 3.1 (Nesterov momentum is bi-orthogonally equivariant).
Let $(W_{k})_{k\in\mathbb{N}}\subset\mathbb{R}^{m\times n}$ be a parameter sequence, and define $G_{k}\coloneqq\nabla_{W}f(W_{k})$ for $k\in\mathbb{N}$. Fix orthogonal matrices $P\in\mathbb{O}^{m}$ and $Q\in\mathbb{O}^{n}$, and define the transformed parameter sequence $\widetilde{W}_{k}\coloneqq PW_{k}Q^{\top}$ for $k\in\mathbb{N}$, with transformed gradients $\widetilde{G}_{k}\coloneqq\nabla_{W}f(\widetilde{W}_{k})$ for $k\in\mathbb{N}$. Let the momentum buffer be $M_{k}=\beta M_{k-1}+G_{k}$ with $M_{-1}=0$, and define the update direction by $N_{k}\coloneqq G_{k}+\beta M_{k}$. For the transformed sequence, let $\widetilde{M}_{k}=\beta\widetilde{M}_{k-1}+\widetilde{G}_{k}$ with $\widetilde{M}_{-1}=0$, and $\widetilde{N}_{k}\coloneqq\widetilde{G}_{k}+\beta\widetilde{M}_{k}$. Then we have $\widetilde{M}_{k}=PM_{k}Q^{\top}$, $\widetilde{N}_{k}=PN_{k}Q^{\top}$, and $\mathrm{polar}(\widetilde{N}_{k})=P\,\mathrm{polar}(N_{k})Q^{\top}$.
All proofs are given in Appendix D.
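The conclusion of Proposition 3.1 can be illustrated numerically. The sketch below is not the paper's proof: it takes a short sequence of random matrices as stand-ins for the gradients $G_k$, assumes the transformed gradients are $PG_kQ^{\top}$, and runs the same Nesterov recursion on both sequences. The value of `beta` and all helper names are illustrative.

```python
import numpy as np


def polar(D):
    """Orthogonal polar factor U V^T from a thin SVD."""
    U, _, Vt = np.linalg.svd(D, full_matrices=False)
    return U @ Vt


def random_orthogonal(d, rng):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q


def nesterov_directions(grads, beta):
    """M_k = beta * M_{k-1} + G_k with M_{-1} = 0; N_k = G_k + beta * M_k."""
    M = np.zeros_like(grads[0])
    directions = []
    for G in grads:
        M = beta * M + G
        directions.append(G + beta * M)
    return directions


rng = np.random.default_rng(0)
m, n, beta = 5, 3, 0.9
P = random_orthogonal(m, rng)
Q = random_orthogonal(n, rng)

grads = [rng.standard_normal((m, n)) for _ in range(4)]  # stand-in gradients
grads_t = [P @ G @ Q.T for G in grads]                   # assumed transformed gradients

N = nesterov_directions(grads, beta)
N_t = nesterov_directions(grads_t, beta)

# Largest deviation from polar(N~_k) = P polar(N_k) Q^T over the sequence.
max_gap = max(
    np.linalg.norm(polar(Nt_k) - P @ polar(N_k) @ Q.T)
    for N_k, Nt_k in zip(N, N_t)
)
```

Because the recursion is linear in the gradients, the transformed directions equal $PN_kQ^{\top}$ up to rounding, and the polar step then inherits the equivariance in (1).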
3.3 Optimizers for Embeddings and LM Heads via Left-Permutation Right-Orthogonal Equivariance
For vocabulary-indexed matrices such as input embeddings and untied LM heads, we consider three symmetry-compatible update families: row-norm updates, right-spectral updates, and hybrid row-norm/right-spectral updates. For an update direction $D\in\mathbb{R}^{v\times d}$, representative examples are
$$\mathscr{U}_{\mathsf{row}}(D)=\operatorname*{Diag}(\eta(\|D_{1:}\|_{2}),\dots,\eta(\|D_{v:}\|_{2}))\,D,$$
$$\mathscr{U}_{\mathsf{R}}(D)=D(D^{\top}D+\varepsilon I)^{-1/2},$$
and
$$\mathscr{U}_{\mathsf{hyb}}(D)=\mathscr{U}_{\mathsf{R}}(\mathscr{U}_{\mathsf{row}}(D))\quad\text{or}\quad\mathscr{U}_{\mathsf{row}}(\mathscr{U}_{\mathsf{R}}(D)).$$
The row-norm update acts locally on vocabulary rows, the right-spectral update acts globally through the hidden-feature Gram matrix, and the hybrid update combines the two. We now explain why these updates are natural from the symmetry of embeddings and LM heads.
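The three representative maps can be sketched directly in NumPy. This is a minimal illustration, not the implementation in the authors' repository: the scalar function $\eta$ is left unspecified at this point in the text, so the sketch assumes $\eta(t)=1/(t+\delta)$ (row normalization), and it forms $(D^{\top}D+\varepsilon I)^{-1/2}$ through an exact eigendecomposition. The function names and default constants are hypothetical.

```python
import numpy as np


def row_norm_update(D, delta=1e-8):
    """Row-norm map with the assumed choice eta(t) = 1 / (t + delta)."""
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    return D / (norms + delta)


def right_spectral_update(D, eps=1e-8):
    """D (D^T D + eps I)^{-1/2}, via an eigendecomposition of the Gram matrix."""
    lam, V = np.linalg.eigh(D.T @ D)
    lam = np.maximum(lam, 0.0)  # guard against tiny negative rounding errors
    inv_sqrt = (V / np.sqrt(lam + eps)) @ V.T
    return D @ inv_sqrt


def hybrid_update(D, row_first=True):
    """Either composition order of the two maps above."""
    if row_first:
        return right_spectral_update(row_norm_update(D))
    return row_norm_update(right_spectral_update(D))


rng = np.random.default_rng(0)
v, d = 50, 4  # toy sizes with v much larger than d
D = rng.standard_normal((v, d))

U_row = row_norm_update(D)        # every row has (nearly) unit norm
U_R = right_spectral_update(D)    # columns are (nearly) orthonormal
U_hyb = hybrid_update(D)          # row-normalize, then whiten the columns
```

With $v\gg d$ the Gram matrix $D^{\top}D$ is only $d\times d$, so the right-spectral map works with a much smaller matrix than a full polar factorization of $D$ would suggest.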
In empirical uses of Muon [76], it is often recommended that AdamW be used for embedding and LM head matrices. For embeddings, this recommendation is motivated in part by modular norm theory [88]; for LM heads, it appears to be driven more by empirical considerations. Relatedly, Scion [122] derives embedding updates from induced operator norms in its linear minimization oracle framework. These approaches depend on a particular choice of norm. Here we instead derive optimizer classes directly from the symmetry of the parameterization.
Let $v\in\mathbb{N}^*$ denote the vocabulary size and $d\in\mathbb{N}^*$ the embedding dimension, typically with $v\gg d$. Consider an input embedding matrix $E\in\mathbb{R}^{v\times d}$ and an untied LM head matrix $W_{\rm out}\in\mathbb{R}^{v\times d}$. In both cases, rows index vocabulary items, while columns correspond to hidden features. Thus, the row axis admits permutation symmetry, whereas the hidden feature axis admits right orthogonal symmetry. The natural equivariance condition for an update map is therefore
$$\mathscr{U}_{\mathsf{LPRO}}(PDR^\top)=P\,\mathscr{U}_{\mathsf{LPRO}}(D)R^\top\tag{2}$$
for all $D\in\mathbb{R}^{v\times d}$, permutation matrices $P\in\mathbb{P}^v$, and orthogonal matrices $R\in\mathbb{O}^d$. We call such maps left-permutation right-orthogonal (LPRO) equivariant.
Definition 3.2 (Left-permutation and right-orthogonal equivariant maps).
A map $\mathscr{U}_{\mathsf{LPRO}}\colon\mathbb{R}^{v\times d}\to\mathbb{R}^{v\times d}$ is said to be left-permutation and right-orthogonal equivariant if (2) holds for all $D\in\mathbb{R}^{v\times d}$, $P\in\mathbb{P}^v$, and $R\in\mathbb{O}^d$. We denote the set of such maps by $\mathcal{U}_{\mathsf{LPRO}}^{v\times d}$.
3.3.1 Right-Spectral Optimizers
A first natural subclass of LPRO-equivariant maps is given by right-spectral updates,
$$\mathscr{U}_{\mathsf{R}}(D)=D\,\Phi(D^\top D),\tag{3}$$
where $\Phi\colon\mathbb{S}_+^d\to\mathbb{R}^{d\times d}$ is an orthogonally equivariant spectral operator. Equivalently, if $D^\top D=V\operatorname{Diag}(\lambda(D^\top D))V^\top$, then
$$\Phi(D^\top D)=V\operatorname{Diag}(\psi(\lambda(D^\top D)))V^\top$$
for some absolutely symmetric map $\psi\colon\mathbb{R}_+^d\to\mathbb{R}^d$.
Theorem 3.2 (Right-spectral updates are LPRO-equivariant).
Let $\mathscr{U}_{\mathsf{R}}$ be of the form (3), where $\Phi(RXR^\top)=R\,\Phi(X)R^\top$ for all $X\in\mathbb{S}_+^d$ and $R\in\mathbb{O}^d$. Then
$$\mathscr{U}_{\mathsf{R}}(PDR^\top)=P\,\mathscr{U}_{\mathsf{R}}(D)R^\top$$
for all $D\in\mathbb{R}^{v\times d}$, $P\in\mathbb{P}^v$, and $R\in\mathbb{O}^d$. We denote the set of right-spectral maps by $\mathcal{U}_{\mathsf{R}}^{v\times d}$.
The choice $\Phi(X)=(X+\varepsilon I)^{-1/2}$ yields the damped right-polar update
$$\mathscr{U}_{\mathsf{R}}(D)=D(D^\top D+\varepsilon I)^{-1/2}.$$
When $\varepsilon=0$ and $D$ has full column rank, this is the orthogonal polar factor of $D$. Thus, right-spectral updates are the one-sided analogue of spectral or polar-gradient updates, but they require only the smaller right Gram matrix $D^\top D\in\mathbb{R}^{d\times d}$.
This computational distinction is important for embeddings and LM heads, where $v\gg d$. Although coordinate-wise adaptive methods such as Adam are often used for these matrices in practice, they are not LPRO-equivariant. The reason for their empirical use may be computational rather than geometric: accurately approximating polar factors of tall-skinny, ill-conditioned vocabulary matrices can be challenging with simple Newton–Schulz iterations. More robust polar decomposition algorithms, such as QDWH [116] and ZOLO-PD [117], can compute such updates more accurately [89].
Right-spectral normalization is also natural statistically. In a mini-batch of size $b$, gradients of embedding and LM head matrices have rank at most $O(b)$, since they factor through token occurrences and hidden features. Thus, even when the vocabulary dimension is very large, the gradient often lies in a low-dimensional feature subspace. Right-spectral updates act on this intrinsic singular geometry through $D^\top D$, rather than on individual coordinates.
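As an illustration of the damped right-polar update, the sketch below forms only the $d\times d$ Gram matrix and checks LPRO equivariance on random data. It is a sketch under stated assumptions: the inverse square root is computed with an exact eigendecomposition (`numpy.linalg.eigh`) as a stand-in for the iterative routines discussed in the text, and the helper names `inv_sqrt_psd` and `right_polar` are hypothetical.

```python
import numpy as np

def inv_sqrt_psd(C, eps):
    """(C + eps*I)^{-1/2} for a symmetric PSD matrix C via an eigendecomposition."""
    lam, V = np.linalg.eigh(C)
    lam = np.maximum(lam, 0.0)  # clip tiny negative eigenvalues caused by round-off
    return (V / np.sqrt(lam + eps)) @ V.T  # V Diag((lam + eps)^{-1/2}) V^T

def right_polar(D, eps=1e-8):
    """Damped right-polar update D (D^T D + eps I)^{-1/2}."""
    return D @ inv_sqrt_psd(D.T @ D, eps)

rng = np.random.default_rng(0)
v, d = 50, 8  # tall-skinny, mimicking v >> d
D = rng.standard_normal((v, d))
P = np.eye(v)[rng.permutation(v)]                 # permutation matrix
R, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthogonal matrix

# LPRO equivariance: U(P D R^T) should equal P U(D) R^T.
equiv_err = np.abs(right_polar(P @ D @ R.T) - P @ right_polar(D) @ R.T).max()

# With eps = 0 and full column rank, the output has orthonormal columns.
U0 = right_polar(D, eps=0.0)
orth_err = np.abs(U0.T @ U0 - np.eye(d)).max()
print(equiv_err, orth_err)
```

The eigendecomposition costs $O(d^3)$ regardless of $v$, which is the computational point made above for vocabulary-sized matrices.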
3.3.2 Row-Norm and Hybrid LPRO-Equivariant Optimizers
Right-spectral maps do not exhaust all LPRO-equivariant updates. Restricting the left symmetry from the full orthogonal group to the permutation group allows update maps that depend on individual row norms. In particular, row-norm maps of the form
$$\mathscr{U}_{\mathsf{row}}(D)=\operatorname{Diag}(\eta(\|D_{1:}\|_2),\dots,\eta(\|D_{v:}\|_2))\,D\tag{4}$$
are LPRO-equivariant for any scalar function $\eta\colon\mathbb{R}_+\to\mathbb{R}$, because left multiplication by a permutation matrix permutes the row norms and right orthogonal transformations preserve them. We denote the set of such maps by $\mathcal{U}_{\mathsf{row}}^{v\times d}$.
Thus, right-spectral maps form a global Gram-based subclass, while row-norm maps form a local row-adaptive subclass:
$$\mathcal{U}_{\mathsf{R}}^{v\times d}\subset\mathcal{U}_{\mathsf{LPRO}}^{v\times d},\qquad\mathcal{U}_{\mathsf{row}}^{v\times d}\subset\mathcal{U}_{\mathsf{LPRO}}^{v\times d}.$$
Hybrid LPRO maps are obtained by composing these two types of maps.
Proposition 3.3 (Closure under composition).
If $\mathscr{U}_1,\mathscr{U}_2\colon\mathbb{R}^{v\times d}\to\mathbb{R}^{v\times d}$ are both left-permutation and right-orthogonal equivariant, then $\mathscr{U}_2\circ\mathscr{U}_1$ is also left-permutation and right-orthogonal equivariant.
Definition 3.3 (Hybrid LPRO-equivariant maps).
A map $\mathscr{U}$ is called a hybrid LPRO-equivariant map if it is a finite composition of right-spectral maps and row-norm maps. We denote the set of such maps by $\mathcal{U}_{\mathsf{hyb}}^{v\times d}$.
Accordingly, LPRO-compatible optimizers for embeddings and LM heads are obtained by applying maps in $\mathcal{U}_{\mathsf{R}}^{v\times d}$, $\mathcal{U}_{\mathsf{row}}^{v\times d}$, or $\mathcal{U}_{\mathsf{hyb}}^{v\times d}$ to update directions that transform in the same way as the gradient, such as momentum buffers formed from past gradients.
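The row-norm map and both hybrid orders can be checked against the LPRO condition (2) in a few lines. This is a minimal sketch on random data, assuming the smoothed rule $\eta(t)=1/(t+\varepsilon)$ and an exact eigendecomposition for the inverse square root; the function names are hypothetical.

```python
import numpy as np

def row_norm_map(D, eps=1e-8):
    """Row-norm map with eta(t) = 1/(t + eps): rescale each row by its Euclidean norm."""
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    return D / (norms + eps)

def right_polar(D, eps=1e-8):
    """Damped right-polar map D (D^T D + eps I)^{-1/2} via an eigendecomposition."""
    lam, V = np.linalg.eigh(D.T @ D)
    return D @ ((V / np.sqrt(np.maximum(lam, 0.0) + eps)) @ V.T)

def hybrid_spectral_then_row(D):
    """Hybrid map: row-norm applied to the right-polar output."""
    return row_norm_map(right_polar(D))

def hybrid_row_then_spectral(D):
    """Hybrid map: right-polar applied to the row-normalized input."""
    return right_polar(row_norm_map(D))

rng = np.random.default_rng(0)
v, d = 40, 6
D = rng.standard_normal((v, d))
P = np.eye(v)[rng.permutation(v)]
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Deviation from U(P D R^T) = P U(D) R^T for each map.
maps = {
    "row": row_norm_map,
    "right": right_polar,
    "spectral_then_row": hybrid_spectral_then_row,
    "row_then_spectral": hybrid_row_then_spectral,
}
errors = {name: np.abs(U(P @ D @ R.T) - P @ U(D) @ R.T).max() for name, U in maps.items()}
print(errors)
```

All four deviations should be at round-off level, in line with Proposition 3.3: composing two LPRO-equivariant maps cannot break the equivariance.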
Examples.
Several recent optimizers can be interpreted through this framework. SCALE [51] applies column normalization to the EMA momentum for LM heads; under our row-vocabulary convention, this corresponds to a row-norm-based update. Other row- or column-norm-based optimizers include RMNP [36] and REG [107]. Finally, NorMuon [100], Muon+ [170], and MuonEq [25], which apply row-wise and/or column-wise normalization to the orthogonal polar factor of the EMA momentum, can be viewed as hybrid spectral/row-norm optimizers.
3.4 Optimizers for SwiGLU MLP Projections
We next consider SwiGLU MLP projection matrices [31,137]. Unlike ordinary linear and attention projection matrices, SwiGLU projections do not possess full bi-orthogonal symmetry. Instead, their natural symmetry is the permutation symmetry of intermediate neurons. This suggests that the gate and up projections should use left-permutation/right-orthogonal equivariant updates, while the down projection should use the corresponding transposed geometry. Concretely, for $W_{\mathrm{gate}},W_{\mathrm{up}}\in\mathbb{R}^{d_{\mathrm{ff}}\times d_{\mathrm{model}}}$ and $W_{\mathrm{down}}\in\mathbb{R}^{d_{\mathrm{model}}\times d_{\mathrm{ff}}}$, we apply the LPRO-compatible optimizer classes of Section 3.3 to $W_{\mathrm{gate}}$, $W_{\mathrm{up}}$, and $W_{\mathrm{down}}^\top$.
This viewpoint is closely related to Aurora [37], an optimizer designed for the up and gate projections in SwiGLU MLPs. Aurora alternates between row-normalization and polar steps on the momentum, which is similar in spirit to the hybrid row-norm/right-spectral optimizers developed above. Our contribution here is to derive this type of update from the intermediate-neuron permutation symmetry of the SwiGLU block.
Proposition 3.4 (Intermediate-neuron permutation symmetry of SwiGLU MLPs).
Let us consider the SwiGLU MLP block defined by
$$\mathrm{SwiGLU}(x;W_{\mathrm{gate}},W_{\mathrm{up}},W_{\mathrm{down}})\coloneqq W_{\mathrm{down}}\left(\sigma(W_{\mathrm{gate}}x)\odot(W_{\mathrm{up}}x)\right),$$
where $W_{\mathrm{gate}},W_{\mathrm{up}}\in\mathbb{R}^{d_{\mathrm{ff}}\times d_{\mathrm{model}}}$, $W_{\mathrm{down}}\in\mathbb{R}^{d_{\mathrm{model}}\times d_{\mathrm{ff}}}$, $\sigma$ is applied coordinatewise, and $\odot$ denotes the Hadamard product. Let $P\in\mathbb{P}^{d_{\mathrm{ff}}}$ be a permutation matrix, and define $\widetilde{W}_{\mathrm{gate}}\coloneqq PW_{\mathrm{gate}}$, $\widetilde{W}_{\mathrm{up}}\coloneqq PW_{\mathrm{up}}$, and $\widetilde{W}_{\mathrm{down}}\coloneqq W_{\mathrm{down}}P^\top$. Then, for every input $x\in\mathbb{R}^{d_{\mathrm{model}}}$,
$$\mathrm{SwiGLU}(x;\widetilde{W}_{\mathrm{gate}},\widetilde{W}_{\mathrm{up}},\widetilde{W}_{\mathrm{down}})=\mathrm{SwiGLU}(x;W_{\mathrm{gate}},W_{\mathrm{up}},W_{\mathrm{down}}).$$
Proposition 3.4 implies that optimizers for $W_{\mathrm{gate}}$ and $W_{\mathrm{up}}$ should commute with permutations of their rows and orthogonal changes of basis in the model dimension. Thus, the same LPRO-compatible row-norm, right-spectral, and hybrid row-norm/right-spectral updates developed for embeddings and LM heads apply directly to these matrices. For the down projection, the intermediate-neuron axis appears as the column dimension, so the same principle applies to $W_{\mathrm{down}}^\top$. Equivalently, one may view the down-projection update as a right-permutation/left-orthogonal analogue of the LPRO updates.
More generally, this intermediate-neuron permutation symmetry is not specific to SwiGLU. Any feed-forward block whose hidden nonlinearity is applied coordinatewise admits a permutation symmetry of the hidden units: simultaneously permuting the rows of the input projection and the corresponding columns of the output projection leaves the represented function unchanged. For gated MLPs such as GLU, GeGLU, ReGLU, and SwiGLU, the same symmetry holds provided the gate and value projections are permuted together, along with the corresponding columns of the down projection.
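The permutation symmetry of Proposition 3.4 is easy to confirm numerically. The sketch below assumes $\sigma$ is the SiLU activation (the usual choice in SwiGLU; the proposition itself only requires a coordinatewise $\sigma$) and uses random weights; `silu` and `swiglu` are hypothetical helper names.

```python
import numpy as np

def silu(z):
    """SiLU / Swish activation z * sigmoid(z), applied coordinatewise."""
    return z / (1.0 + np.exp(-z))

def swiglu(x, W_gate, W_up, W_down):
    """SwiGLU block W_down (sigma(W_gate x) * (W_up x)) for a single input vector x."""
    return W_down @ (silu(W_gate @ x) * (W_up @ x))

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W_gate = rng.standard_normal((d_ff, d_model))
W_up = rng.standard_normal((d_ff, d_model))
W_down = rng.standard_normal((d_model, d_ff))
x = rng.standard_normal(d_model)
P = np.eye(d_ff)[rng.permutation(d_ff)]  # permutation of the intermediate neurons

y = swiglu(x, W_gate, W_up, W_down)

# Permute gate rows, up rows, and down columns together: the output is unchanged.
y_perm = swiglu(x, P @ W_gate, P @ W_up, W_down @ P.T)
perm_err = np.abs(y - y_perm).max()

# Permuting only the gate rows generally changes the represented function.
y_gate_only = swiglu(x, P @ W_gate, W_up, W_down)
mismatch = np.abs(y - y_gate_only).max()
print(perm_err, mismatch)
```

The second comparison shows why the three matrices must be permuted jointly: the symmetry acts on the shared intermediate-neuron axis, not on any single projection.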
3.5 Optimizers for MoE Routers
We now consider the router matrix in a mixture-of-experts (MoE) model. Unlike ordinary linear layers, embeddings, and LM heads, the router has an additional symmetry: experts are interchangeable, and the softmax is invariant under adding a shared scalar offset to all logits.
Let $W\in\mathbb{R}^{e\times d}$ denote the router matrix, where $e\in\mathbb{N}^*$ is the number of experts and $d\in\mathbb{N}^*$ is the hidden dimension. The routing distribution is $p(x;W)=\mathrm{softmax}(Wx)$ for $x\in\mathbb{R}^d$. Since permuting the rows of $W$ only relabels the experts, and since $\mathrm{softmax}(z+c\bm{1}_e)=\mathrm{softmax}(z)$ for all $z\in\mathbb{R}^e$ and $c\in\mathbb{R}$, the router parameters admit the symmetry
$$(\forall P\in\mathbb{P}^e,\ \forall a\in\mathbb{R}^d)\qquad W\mapsto PW+\bm{1}_e a^\top.$$
Indeed,
$$(\forall x\in\mathbb{R}^d)\qquad(PW+\bm{1}_e a^\top)x=P(Wx)+(a^\top x)\bm{1}_e,$$
so the router logits are unchanged up to expert relabeling and a shared logit shift.
This symmetry suggests that router updates should be defined on the centered expert geometry. Let
$$\Pi_\perp\coloneqq I_e-\frac{1}{e}\bm{1}_e\bm{1}_e^\top$$
be the orthogonal projector onto $\bm{1}_e^\perp$, and define $D_c\coloneqq\Pi_\perp D$.
The centered direction $D_c$ removes the shared-row component of $D$ and captures the intrinsic variation across experts. This motivates router update maps built from $D_c$, for example through the centered left Gram matrix
$$D_cD_c^\top=\Pi_\perp DD^\top\Pi_\perp.$$
Definition 3.4 (Router-compatible update maps).
A map $\mathscr{U}\colon\mathbb{R}^{e\times d}\to\mathbb{R}^{e\times d}$ is called router-compatible if it is expert-permutation equivariant and shared-row-shift invariant, namely,
$$\mathscr{U}(PD+\bm{1}_e a^\top)=P\,\mathscr{U}(D)$$
for all $D\in\mathbb{R}^{e\times d}$, all permutation matrices $P\in\mathbb{P}^e$, and all $a\in\mathbb{R}^d$.
We consider two basic router-compatible update families. The first is a left-spectral update in the centered expert geometry:
$$\mathscr{U}_{\mathsf{L}}(D)=\Psi(D_cD_c^\top)D_c,\qquad D_c=\Pi_\perp D,$$
where $\Psi\colon\mathbb{S}_+^e\to\mathbb{R}^{e\times e}$ is permutation equivariant, i.e.,
$$\Psi(PXP^\top)=P\,\Psi(X)P^\top.$$
The second is a centered row-norm update:
$$\mathscr{U}_{\mathsf{row}}^{\mathsf{router}}(D)=\operatorname{Diag}(\eta(\|D_{c,1:}\|_2),\dots,\eta(\|D_{c,e:}\|_2))\,D_c,\qquad D_c=\Pi_\perp D,$$
where $\eta\colon\mathbb{R}_+\to\mathbb{R}$ is applied pointwise to the centered expert-row norms. The left-spectral update mixes information globally across experts through $D_cD_c^\top$, while the row-norm update acts locally on each centered expert row.
Proposition 3.5 (Router-compatible update families).
Both centered left-spectral updates and centered row-norm updates are router-compatible. That is, for all $D\in\mathbb{R}^{e\times d}$, $P\in\mathbb{P}^e$, and $a\in\mathbb{R}^d$,
$$\mathscr{U}_{\mathsf{L}}(PD+\bm{1}_e a^\top)=P\,\mathscr{U}_{\mathsf{L}}(D),\qquad\mathscr{U}_{\mathsf{row}}^{\mathsf{router}}(PD+\bm{1}_e a^\top)=P\,\mathscr{U}_{\mathsf{row}}^{\mathsf{router}}(D).$$
The converse is false in general: router compatibility does not force an update map to be left-spectral. Centered row-norm updates are already router-compatible, but they depend on individual centered expert-row norms rather than only on the centered Gram matrix $D_cD_c^\top$. Thus, left-spectral and row-norm updates should be viewed as two natural subclasses of router-compatible maps.
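Router compatibility of the two basic families can be verified on random data. This is a sketch under stated assumptions: $\Psi(X)=(X+\varepsilon I)^{-1/2}$ is taken as the permutation-equivariant operator, $\eta(t)=1/(t+\varepsilon)$ as the row-scaling rule, and the inverse square root is computed by an exact eigendecomposition; `center`, `left_spectral`, and `centered_row_norm` are hypothetical names.

```python
import numpy as np

def center(D):
    """Apply Pi_perp = I - (1/e) 1 1^T: subtract the across-expert mean from each column."""
    return D - D.mean(axis=0, keepdims=True)

def left_spectral(D, eps=1e-6):
    """Centered left-spectral update Psi(D_c D_c^T) D_c with Psi(X) = (X + eps I)^{-1/2}."""
    Dc = center(D)
    lam, V = np.linalg.eigh(Dc @ Dc.T)
    Psi = (V / np.sqrt(np.maximum(lam, 0.0) + eps)) @ V.T
    return Psi @ Dc

def centered_row_norm(D, eps=1e-6):
    """Centered row-norm update with eta(t) = 1/(t + eps)."""
    Dc = center(D)
    return Dc / (np.linalg.norm(Dc, axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
e, d = 8, 16  # experts x hidden dimension
D = rng.standard_normal((e, d))
P = np.eye(e)[rng.permutation(e)]  # expert relabeling
a = rng.standard_normal(d)         # shared row shift
D_t = P @ D + np.outer(np.ones(e), a)

# Router compatibility: U(P D + 1 a^T) should equal P U(D).
err_left = np.abs(left_spectral(D_t) - P @ left_spectral(D)).max()
err_row = np.abs(centered_row_norm(D_t) - P @ centered_row_norm(D)).max()

# The left-spectral output stays centered: its columns sum to zero over experts.
col_sum = np.abs(left_spectral(D).sum(axis=0)).max()
print(err_left, err_row, col_sum)
```

Centering is implemented as a column-mean subtraction, which avoids forming the $e\times e$ projector explicitly.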
Hybrid router maps are obtained by composing router-compatible maps. If $\mathscr{U}_1$ and $\mathscr{U}_2$ are router-compatible, then so is $\mathscr{U}_2\circ\mathscr{U}_1$. Therefore, finite compositions of centered left-spectral and centered row-norm updates remain router-compatible. A representative hybrid router update is
$$\mathscr{U}_{\mathsf{hyb}}^{\mathsf{router}}(D)=\operatorname{Diag}(\eta(\|Z_{1:}\|_2),\dots,\eta(\|Z_{e:}\|_2))\,Z,\qquad Z=\Psi(D_cD_c^\top)D_c.$$
Such hybrid router optimizers combine the global expert-mixing geometry of left-spectral updates with the local expert-wise normalization of row-norm updates, while preserving expert-permutation equivariance and shared-row-shift invariance.
Definition 3.5 (Router-compatible optimizers).
A matrix optimizer for an MoE router is called router-compatible if its update rule has the form
$$(\forall k\in\mathbb{N})\qquad W_{k+1}=W_k-\gamma_k\mathscr{U}(D_k),$$
where $\mathscr{U}$ is router-compatible and $D_k\in\mathbb{R}^{e\times d}$ is an update direction that transforms under expert permutations and shared row shifts in the same way as the gradient.
Table 1: Optimizer classes across LLM layers induced by symmetry. Each matrix parameter has a natural symmetry group, which determines the corresponding symmetry-compatible optimizer class.
3.6 Symmetry-to-Optimizer Principle and Architecture–Optimizer Co-Design
The preceding developments suggest a unifying principle for optimizer design in modern deep learning architectures: the optimizer geometry should be determined by the symmetry structure of the underlying parameterization. This does not require the full layerwise loss to be globally invariant under the corresponding group action. Rather, it is enough that the gradient, or more generally the chosen update direction, transforms equivariantly under the relevant action on the parameter space. The optimizer update should then transform in the same way.
Accordingly, whenever the update direction associated with a layer parameter $W$ transforms under a symmetry group $\mathcal{G}$, a natural optimizer should use an update map $\mathscr{U}$ that is equivariant under the same induced action. For matrix-valued parameters, this principle leads to several canonical optimizer geometries: full spectral updates under bi-orthogonal symmetry; right-spectral updates under right-orthogonal symmetry; left-spectral updates under left-orthogonal symmetry; row-norm and hybrid updates under left-permutation/right-orthogonal symmetry; and centered left-spectral or row-norm updates under the expert-permutation and shared-row-shift symmetry of MoE routers. Thus, the symmetry structure of the layer determines the appropriate optimizer class.
This principle also gives a practical recipe for architecture–optimizer co-design. Given a new architecture block, one should:
1. identify the symmetry group of the parameterization;
2. determine whether the symmetry acts on the left, on the right, on both sides, or only after quotienting out symmetry-redundant directions;
3. choose the matching symmetry-compatible optimizer class; and
4. use the smallest Gram matrix or invariant statistic compatible with that symmetry.
In Section 3.7, we instantiate this recipe by extending momentum PolarGrad [89] to one-sided and hybrid variants, including LeftPolarGradM, RightPolarGradM, RowNormM, and HybridPolarGradM.
3.7 Practical Optimizers for Embeddings, LM Heads, SwiGLU MLP Projections, and MoE Routers
The preceding subsections introduced three main classes of symmetry-compatible optimizers: one-sided spectral optimizers, row-norm-based optimizers, and hybrid variants obtained by composing row-wise normalization with one-sided spectral updates. We now describe their practical momentum implementations. The main computational issue is the efficient and stable approximation of matrix inverse square roots, or equivalently orthogonal polar factors, which we compute using GPU-friendly numerical linear algebra routines.
3.7.1 One-Sided Spectral Optimizers
For embedding, LM head, and SwiGLU MLP projection matrices, the relevant matrices are often tall-skinny. In this regime, right-spectral updates are attractive because they only require the inverse square root of the smaller right Gram matrix
$$C_k\coloneqq G_k^\top G_k\qquad\text{or, with momentum,}\qquad C_k\coloneqq M_k^\top M_k.$$
For MoE routers, the corresponding left-spectral update is applied in the centered expert geometry. Writing
$$\Pi_\perp\coloneqq I_e-\frac{1}{e}\bm{1}_e\bm{1}_e^\top,\qquad M_{k,c}\coloneqq\Pi_\perp M_k,$$
the relevant Gram matrix is
$$C_k\coloneqq M_{k,c}M_{k,c}^\top.$$
To compute the inverse square roots in these updates, we use Newton–Schulz iterations with the polynomial coefficients of Polar Express [5]. This is motivated by the connection between polynomial iterations for polar decomposition and inverse-square-root computation [65]. For numerical stability, the Gram inverse-square-root implementation is performed in float32. Other fast inverse-square-root routines, such as PRISM [162], could be used in the same role. For RightPolarGradM, we also provide a Gram Newton–Schulz implementation [168], which directly approximates
$$M_k(M_k^\top M_k)^{-1/2}$$
and supports more inner iterations in lower-precision formats such as bfloat16 or float16. The resulting one-sided momentum polar-gradient algorithms are summarized in Algorithm 1.
Algorithm 1 LeftPolarGradM and RightPolarGradM
Input: $W_0\in\mathbb{R}^{m\times n}$, $M_{-1}=0$, learning rates $\{\gamma_k\}_{k\geqslant 0}$, momentum $\beta\in[0,1)$, scaling exponent $\alpha\in[0,1]$, damping $\varepsilon>0$, weight decay $\lambda\geqslant 0$
1: for $k=0,\ldots,K-1$ do
2:   $G_k=\nabla_W f(W_k)$
3:   $M_k=\beta M_{k-1}+(1-\beta)G_k$
4:   if LeftPolarGradM then
5:     $C_k=M_kM_k^\top$
6:     $L_k=(C_k+\varepsilon I)^{-1/2}$ via Polar Express or Newton–Schulz iteration
7:     $\nu_k=\max\{\mathrm{tr}(C_kL_k),\varepsilon\}$
8:     $W_{k+1}=(1-\gamma_k\lambda)W_k-\gamma_k\nu_k^{\alpha}L_kM_k$
9:   else if RightPolarGradM then
10:     $C_k=M_k^\top M_k$
11:     $R_k=(C_k+\varepsilon I)^{-1/2}$ via Polar Express or Newton–Schulz iteration
12:     $\nu_k=\max\{\mathrm{tr}(C_kR_k),\varepsilon\}$
13:     $W_{k+1}=(1-\gamma_k\lambda)W_k-\gamma_k\nu_k^{\alpha}M_kR_k$
14:   end if
15: end for
Output: $W_K$
At the level of exact polar decomposition, LeftPolarGradM and RightPolarGradM compute the same polar direction whenever both sides are well-defined. Their distinction is therefore computational and architectural: they differ in which Gram matrix is formed, which inverse square root is computed, and which layer symmetry they are intended to respect.
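A single step of Algorithm 1 can be sketched as follows. This is an illustrative sketch rather than the reference implementation: the inverse square root is computed with an exact eigendecomposition in place of Polar Express or Newton–Schulz iterations, and the function name `polar_grad_m_step`, its argument names, and its default values are assumptions.

```python
import numpy as np

def inv_sqrt_psd(C, eps):
    """(C + eps*I)^{-1/2} for symmetric PSD C (exact stand-in for an iterative routine)."""
    lam, V = np.linalg.eigh(C)
    return (V / np.sqrt(np.maximum(lam, 0.0) + eps)) @ V.T

def polar_grad_m_step(W, M_prev, G, lr, side="right", beta=0.95, alpha=1.0,
                      eps=1e-8, wd=0.0):
    """One iteration of Algorithm 1 (lines 3-13); returns (W_next, M)."""
    M = beta * M_prev + (1.0 - beta) * G         # EMA momentum
    if side == "left":
        C = M @ M.T                              # m x m left Gram matrix
        L = inv_sqrt_psd(C, eps)
        nu = max(float(np.trace(C @ L)), eps)    # trace-based scale
        direction = L @ M
    else:
        C = M.T @ M                              # n x n right Gram matrix
        R = inv_sqrt_psd(C, eps)
        nu = max(float(np.trace(C @ R)), eps)
        direction = M @ R
    W_next = (1.0 - lr * wd) * W - lr * nu**alpha * direction
    return W_next, M

rng = np.random.default_rng(0)
m, n = 12, 5  # tall-skinny: the right Gram matrix is the smaller one
W = rng.standard_normal((m, n))
G = rng.standard_normal((m, n))  # random stand-in for a gradient
M0 = np.zeros((m, n))

W_right, M = polar_grad_m_step(W, M0, G, lr=0.1, side="right")
W_left, _ = polar_grad_m_step(W, M0, G, lr=0.1, side="left")
gap = np.abs(W_left - W_right).max()
print(gap)
```

With the same damping on both sides, $(MM^\top+\varepsilon I)^{-1/2}M=M(M^\top M+\varepsilon I)^{-1/2}$, so the two variants agree up to round-off in this exact-arithmetic sketch; the practical difference is the size of the matrix whose inverse square root is needed ($12\times 12$ versus $5\times 5$ here).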
3.7.2 Row-Norm-Based and Hybrid Variants
For embedding, LM head, and SwiGLU MLP projection matrices, row-norm-based updates provide a cheaper LPRO-compatible alternative. Given a momentum direction $M_k$, define
$$D_\eta(M_k)\coloneqq\operatorname{Diag}(\eta(\|M_{k,1:}\|_2),\dots,\eta(\|M_{k,v:}\|_2)).$$
Then a row-norm update takes the form $\mathcal{T}_{\mathsf{row}}(M_k)=D_\eta(M_k)M_k$. Here $\eta$ may be chosen as a bounded row-scaling rule or as a smoothed normalization rule such as $\eta(t)=1/(t+\varepsilon)$.
Hybrid variants combine row-wise scaling with a one-sided spectral step. For embeddings, LM heads, and SwiGLU gate/up projections, there are two natural orders:
$$\text{right-spectral/row-norm:}\qquad Z_k=M_k(M_k^\top M_k+\varepsilon I)^{-1/2},\qquad\mathcal{T}_{\mathsf{hyb}}(M_k)=D_\eta(Z_k)Z_k,$$
and
$$\text{row-norm/right-spectral:}\qquad Z_k=D_\eta(M_k)M_k,\qquad\mathcal{T}_{\mathsf{hyb}}(M_k)=Z_k(Z_k^\top Z_k+\varepsilon I)^{-1/2}.$$
Both remain left-permutation and right-orthogonal equivariant. For MoE routers, the same constructions are applied after centering:
$$M_{k,c}=\Pi_\perp M_k.$$
Thus, router row-norm updates act on $M_{k,c}$, while hybrid router updates combine a centered left-spectral step with row-wise normalization across experts. We refer to these practical hybrid variants collectively as HybridPolarGradM. The algorithms for RowNormM and HybridPolarGradM are summarized in Algorithm 2.
Algorithm 2 RowNormM and HybridPolarGradM for Embeddings, LM Heads, and MoE Routers
Input: $W_0\in\mathbb{R}^{m\times n}$, $M_{-1}=0$, learning rates $\{\gamma_k\}_{k\geqslant 0}$, momentum $\beta\in[0,1)$, scaling exponent $\alpha\in[0,1]$, row-scaling rule $\eta$, damping $\varepsilon>0$, weight decay $\lambda\geqslant 0$
1: for $k=0,\ldots,K-1$ do
2:   $G_k=\nabla_W f(W_k)$
3:   $M_k=\beta M_{k-1}+(1-\beta)G_k$
4:   if embedding / LM head / SwiGLU gate-up and RowNormM then
5:     $D_k=\operatorname{Diag}(\eta(\|M_{k,1:}\|_2),\dots,\eta(\|M_{k,m:}\|_2))$
6:     $W_{k+1}=(1-\gamma_k\lambda)W_k-\gamma_kD_kM_k$
7:   else if embedding / LM head / SwiGLU gate-up and HybridPolarGradM (right-spectral/row-norm) then
8:     $C_k=M_k^\top M_k$
9:     $R_k=(C_k+\varepsilon I)^{-1/2}$ via Polar Express or Newton–Schulz iteration
10:     $\nu_k=\max\{\mathrm{tr}(C_kR_k),\varepsilon\}$
11:     $Z_k=\nu_k^{\alpha}M_kR_k$
12:     $D_k=\operatorname{Diag}(\eta(\|Z_{k,1:}\|_2),\dots,\eta(\|Z_{k,m:}\|_2))$
13:     $W_{k+1}=(1-\gamma_k\lambda)W_k-\gamma_kD_kZ_k$
14:   else if embedding / LM head / SwiGLU gate-up and HybridPolarGradM (row-norm/right-spectral) then
15:     $Z_k=\operatorname{Diag}(\eta(\|M_{k,1:}\|_2),\dots,\eta(\|M_{k,m:}\|_2))M_k$
16:     $C_k=Z_k^\top Z_k$
17:     $R_k=(C_k+\varepsilon I)^{-1/2}$ via Polar Express or Newton–Schulz iteration
18:     $\nu_k=\max\{\mathrm{tr}(C_kR_k),\varepsilon\}$
19:     $W_{k+1}=(1-\gamma_k\lambda)W_k-\gamma_k\nu_k^{\alpha}Z_kR_k$
20:   else if MoE router and RowNormM then
21:     $M_{k,c}=\Pi_\perp M_k$, where $\Pi_\perp=I_e-\frac{1}{e}\bm{1}_e\bm{1}_e^\top$
22:     $D_k=\operatorname{Diag}(\eta(\|M_{k,c,1:}\|_2),\dots,\eta(\|M_{k,c,e:}\|_2))$
23:     $W_{k+1}=(1-\gamma_k\lambda)W_k-\gamma_kD_kM_{k,c}$
24:   else if MoE router and HybridPolarGradM (left-spectral/row-norm) then
25:     $M_{k,c}=\Pi_\perp M_k$
26:     $C_k=M_{k,c}M_{k,c}^\top$
27:     $L_k=(C_k+\varepsilon I)^{-1/2}$ via Polar Express or Newton–Schulz iteration
28:     $\nu_k=\max\{\mathrm{tr}(C_kL_k),\varepsilon\}$
29:     $Z_k=\nu_k^{\alpha}L_kM_{k,c}$
30:     $D_k=\operatorname{Diag}(\eta(\|Z_{k,1:}\|_2),\dots,\eta(\|Z_{k,e:}\|_2))$
31:     $W_{k+1}=(1-\gamma_k\lambda)W_k-\gamma_kD_kZ_k$
32:   end if
33: end for
Output: $W_K$
For down projections in SwiGLU MLPs, the intermediate-neuron axis is the column dimension. In practice, we apply the same row-norm, right-spectral, or hybrid update to the transposed matrix $W_{\mathrm{down}}^\top$, and then transpose the resulting update back to the original parameter shape.
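The transposition trick for down projections amounts to a thin wrapper around any LPRO-compatible map. The sketch below uses a row-norm map with $\eta(t)=1/(t+\varepsilon)$ on a random stand-in for the momentum of $W_{\mathrm{down}}$; the helper names `row_norm` and `apply_to_down` are hypothetical.

```python
import numpy as np

def row_norm(M, eps=1e-8):
    """Row-norm map with eta(t) = 1/(t + eps)."""
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)

def apply_to_down(M_down, update_fn):
    """Apply an LPRO-compatible map to M_down^T, then transpose back to the original shape."""
    return update_fn(M_down.T).T

rng = np.random.default_rng(0)
d_model, d_ff = 6, 16
M_down = rng.standard_normal((d_model, d_ff))  # stand-in momentum for W_down
U_down = apply_to_down(M_down, row_norm)

# Each column (one per intermediate neuron) of the update now has unit norm.
col_norms = np.linalg.norm(U_down, axis=0)

# Right-permutation / left-orthogonal equivariance: U(Q M P^T) = Q U(M) P^T.
P = np.eye(d_ff)[rng.permutation(d_ff)]
Q, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
equiv_err = np.abs(apply_to_down(Q @ M_down @ P.T, row_norm) - Q @ U_down @ P.T).max()
print(col_norms.round(6), equiv_err)
```

In other words, a row-norm update on $W_{\mathrm{down}}^\top$ is a column-norm update on $W_{\mathrm{down}}$, which matches the right-permutation/left-orthogonal view of the down projection.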
The practical distinction among these variants is both geometric and computational. RightPolarGradM preserves the right-orthogonal geometry through a Gram inverse square root, while RowNormM is a purely row-adaptive LPRO-compatible alternative. HybridPolarGradM interpolates between the two by composing row normalization and one-sided spectral normalization. For MoE routers, the same alternatives appear in the centered expert geometry: one may use a purely centered row-norm update, a left-spectral update, or a hybrid left-spectral/row-norm update.
3.8 Spectral Optimizers as the Bi-Orthogonally Equivariant Class
We have used bi-orthogonal equivariance as the symmetry principle for ordinary matrix layers. We now record the corresponding structural characterization: direction-wise update maps satisfying this equivariance are precisely spectral operators. This explains why SSD [23], Muon [76], Scion [122], PolarGrad [89], and related spectral updates form the canonical optimizer class for ordinary matrix layers. The preceding sections show how different symmetry groups lead instead to different optimizer classes for embeddings, LM heads, SwiGLU MLP projections, and MoE routers.
We first recall the relevant matrix-analytic notions. A proper function $\varphi\colon\mathbb{R}^{m\times n}\to\overline{\mathbb{R}}$ is called spectral if there exists a proper function $\psi\colon\mathbb{R}^r\to\overline{\mathbb{R}}$, where $r=\min\{m,n\}$, such that $\varphi=\psi\circ\sigma$. Its matrix-valued analogue is a spectral operator: a map $\mathscr{U}\colon\mathbb{R}^{m\times n}\to\mathbb{R}^{m\times n}$ is spectral if there exists an absolutely symmetric map $\psi\colon\mathbb{R}^r\to\mathbb{R}^r$ such that, for every singular value decomposition $D=U\operatorname{Diag}(\sigma(D))V^\top$,
$$\mathscr{U}(D)=U\operatorname{Diag}(\psi(\sigma(D)))V^\top.$$
Thus, spectral operators preserve the singular-vector geometry of their input and act only on singular values.
We next recall the standard gradient formula for spectral functions; see, e.g., Lewis [93].
Theorem 3.6 (Gradient formula for spectral functions).
Let $f\colon\mathbb{R}^r\to\overline{\mathbb{R}}$ be convex and absolutely symmetric. Then the corresponding spectral function $f\circ\sigma$ is differentiable at $W\in\mathbb{R}^{m\times n}$ if and only if $f$ is differentiable at $\sigma(W)$. In this case, if $W=U\operatorname{Diag}(\sigma(W))V^\top$ is a singular value decomposition of $W$, then
$$\nabla(f\circ\sigma)(W)=U\operatorname{Diag}(\nabla f(\sigma(W)))V^\top.$$
This formula shows that spectral scalar functions act through singular values while preserving singular directions. The same structure appears when one requires an update map to commute with arbitrary left and right orthogonal changes of coordinates. We now state the corresponding characterization.
Theorem 3.7 (Characterization of bi-orthogonally equivariant update maps).
A continuous matrix-valued map $\mathscr{U}\colon\mathbb{R}^{m\times n}\to\mathbb{R}^{m\times n}$ satisfies
$$\mathscr{U}(PDQ^\top)=P\,\mathscr{U}(D)Q^\top$$
for all $D\in\mathbb{R}^{m\times n}$, $P\in\mathbb{O}^m$, and $Q\in\mathbb{O}^n$ if and only if it is a spectral operator. Equivalently, $\mathscr{U}$ is bi-orthogonally equivariant if and only if there exists an absolutely symmetric map $\psi\colon\mathbb{R}^r\to\mathbb{R}^r$ such that, for every singular value decomposition $D=U\operatorname{Diag}(\sigma(D))V^\top$,
$$\mathscr{U}(D)=U\operatorname{Diag}(\psi(\sigma(D)))V^\top,\qquad r=\min\{m,n\}.$$
We denote the set of bi-orthogonally equivariant matrix maps from $\mathbb{R}^{m\times n}$ to $\mathbb{R}^{m\times n}$ by $\mathcal{U}_{\mathbb{O}}^{m\times n}$. Theorem 3.7 shows that bi-orthogonal equivariance is not merely a desirable property: it is a complete structural characterization of direction-wise matrix update maps. Any such update must act through the singular values of the update direction and preserve its singular vectors.
Accordingly, we call a matrix optimizer spectral if its update rule is bi-orthogonally equivariant. Concretely, its iterates satisfy
$$W_{k+1}=W_k-\gamma_k\mathscr{U}(D_k),$$
where $\gamma_k>0$, $\mathscr{U}\in\mathcal{U}_{\mathbb{O}}^{m\times n}$, and $D_k$ is an update direction that itself transforms bi-orthogonally, such as the gradient or a symmetry-compatible momentum direction. By Theorem 3.7, every such direction-wise spectral optimizer is determined by a singular-value transformation $\psi$.
This characterization applies to maps acting on a single update direction. It does not exclude broader stateful matrix optimizers such as Shampoo, whose auxiliary states may themselves evolve equivariantly. Spectral optimizers should therefore be understood as the bi-orthogonally equivariant class of memoryless, or direction-wise, matrix update maps; stateful equivariant optimizers form a larger class.
3.8.1 Examples of Spectral and Equivariant Matrix Optimizers
The simplest spectral optimizer is vanilla gradient descent, for which the spectral operator is the identity map. Its Polyak, Nesterov, and EMA-momentum variants remain spectral because bi-orthogonally equivariant momentum constructions composed with spectral operators preserve spectrality. Other examples include stochastic spectral descent [21,23], Muon [76], Scion [122], and PolarGrad [89]. Power-type spectral maps $\psi(t)=t^p$, including $p\in\{1/2,1/4\}$, are also studied in [126].
It is useful to distinguish spectral operators acting on the current update direction from history-dependent optimizers whose states evolve equivariantly. For example,Shampoo[62,7]maintains left and right preconditioners based on moving averages ofGkGk⊤G_{k}G_{k}^{\top}andGk⊤GkG_{k}^{\top}G_{k}, and applies
Lk−1/4GkRk−1/4.L_{k}^{-\nicefrac{{1}}{{4}}}G_{k}R_{k}^{-\nicefrac{{1}}{{4}}}.This update is bi-orthogonally equivariant when the state variables(Lk,Rk)(L_{k},R_{k})transform accordingly, but it is not generally a spectral operator of the current gradient alone, since the preconditioners need not share the singular-vector basis ofGkG_{k}. In special aligned cases, it reduces to a singular-value transformation and hence becomes spectral. One-sidedShampoo[155,96]and ASGO[6]admit a similar interpretation.
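To make the aligned special case concrete, consider the following worked example under simplifying assumptions that are ours rather than part of the Shampoo algorithm: $G_{k}$ is square and invertible with SVD $G_{k}=U\Sigma V^{\top}$, and the preconditioners are built from the current gradient only, $L_{k}=G_{k}G_{k}^{\top}$ and $R_{k}=G_{k}^{\top}G_{k}$, with no moving average or damping. Then

$$L_{k}^{-1/4}G_{k}R_{k}^{-1/4}=\bigl(U\Sigma^{2}U^{\top}\bigr)^{-1/4}\,U\Sigma V^{\top}\,\bigl(V\Sigma^{2}V^{\top}\bigr)^{-1/4}=U\Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2}V^{\top}=UV^{\top}.$$

In this case the update is the singular-value transformation $\psi(t)=1$, that is, the orthogonal polar factor of $G_{k}$. Once the preconditioners accumulate history, their eigenvectors generally differ from $U$ and $V$, and the cancellation above no longer applies.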
SOAP [151] is also geometry-aware but generally not spectral in the strict direction-wise sense. It rotates gradients into learned eigenspaces and applies coordinate-wise adaptive scaling in those evolving coordinates. Thus, the update depends on history-dependent bases and coordinate-wise statistics, not only on the singular values of the current gradient. It becomes spectral only in special cases where the learned eigenspaces align with the singular-vector basis and the adaptive scaling acts symmetrically across singular directions.
Normalization preserves spectrality when the normalizing scalar is unitarily invariant. If $\mathscr{U}$ is spectral and $\alpha\colon\mathbb{R}^{m\times n}\to\mathbb{R}_{+}$ is a positive spectral scalar function, then $D\mapsto\mathscr{U}(D)/\alpha(D)$ is again spectral. In particular, normalization by a unitarily invariant matrix norm preserves spectrality. By contrast, row-wise or column-wise normalization generally breaks full bi-orthogonal equivariance because it depends on a preferred coordinate system. Such normalizations may still be appropriate for layers with one-sided or permutation symmetries, but they are not spectral operators for ordinary matrix layers.
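The contrast between the two kinds of normalization can be illustrated with a small NumPy sketch. All helper names here are hypothetical. It compares Frobenius normalization, which uses a unitarily invariant scalar, with row-wise normalization, and measures the equivariance gap under (a) left and right orthogonal transformations and (b) a left permutation combined with a right orthogonal transformation.

```python
import numpy as np

def frobenius_normalize(D):
    """Divide by the Frobenius norm, a unitarily invariant scalar."""
    return D / np.linalg.norm(D)

def row_normalize(D, eps=1e-12):
    """Rescale every row to unit Euclidean norm."""
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    return D / np.maximum(norms, eps)

def random_orthogonal(n, rng):
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

def gap(f, D, left, right):
    """Largest entry of f(left D right^T) - left f(D) right^T."""
    return np.max(np.abs(f(left @ D @ right.T) - left @ f(D) @ right.T))

rng = np.random.default_rng(0)
m, n = 5, 3
D = rng.standard_normal((m, n))
P, Q = random_orthogonal(m, rng), random_orthogonal(n, rng)
Pi = np.eye(m)[rng.permutation(m)]  # permutation matrix acting on rows

frob_gap = gap(frobenius_normalize, D, P, Q)   # bi-orthogonal: preserved
row_gap_orth = gap(row_normalize, D, P, Q)     # bi-orthogonal: broken
row_gap_perm = gap(row_normalize, D, Pi, Q)    # left-permutation/right-orthogonal: preserved
print(frob_gap, row_gap_orth, row_gap_perm)
```

Row normalization survives the second test because row norms are unchanged by right orthogonal transformations and are merely reordered by row permutations, which is the symmetry relevant for vocabulary-indexed matrices.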
More broadly, steepest descent for matrix-valued parameters is compatible with bi-orthogonal symmetry when the underlying norm is unitarily invariant. The unit ball of a unitarily invariant norm is invariant under $D\mapsto PDQ^{\top}$, so the associated steepest descent direction transforms equivariantly under left and right orthogonal changes of coordinates. Non-unitarily invariant norms, such as general induced $\ell_{p}\to\ell_{q}$ operator norms with $(p,q)\neq(2,2)$, generally fail this property and impose a coordinate-dependent geometry on the update.
4 Numerical Experiments
Our experiments test the layerwise equivariance principle in full language model pre-training. Rather than changing a single optimizer in isolation, we instantiate a symmetry-compatible optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its layerwise symmetry: attention matrices use spectral or head-wise spectral updates; embeddings and LM heads use row-norm or hybrid updates; dense and expert SwiGLU MLP projections use right-spectral, row-aware, column-aware, or hybrid updates along the appropriate intermediate-neuron axis; MoE routers use centered row-aware or left-spectral updates; and scalar or vector parameters use standard coordinate-wise optimizers.
We validate the proposed equivariant optimizer classes on four open-weight dense and MoE language model architectures spanning different vocabulary sizes, hidden sizes, embedding and LM head matrix dimensions, and numbers of trainable parameters. Our goal is to test the practical implications of the symmetry- and geometry-based design principles developed above across architectures with distinct matrix-parameter geometries. Because our focus is optimizer behavior rather than scaling-law-optimal pre-training, we do not train the models for the large token budgets typically prescribed by scaling laws. We pre-train all models on a 10B-token subset of FineWeb-Edu [111] with context length 1024. These settings allow us to examine optimizer behavior in controlled yet nontrivial pre-training regimes.
Unless otherwise specified, we use Muon [76] with Polar Express coefficients [5] for hidden and attention matrices, and AdamW [109] for scalar and vector parameters. For consistency with the intended layerwise geometry, fused attention weights are handled by applying Polar Express to the momentum of each attention head or fused component separately, in a similar spirit to Muon Split [53]. Likewise, for OLMoE-1B-7B and downsized gpt-oss, we treat fused expert SwiGLU projection tensors according to their intermediate-neuron geometry. The fused expert gate/up tensors are reshaped or interpreted so that gate/up channels associated with intermediate neurons receive row-aware or hybrid updates along the correct axis, while expert down projections are updated along the corresponding intermediate-neuron axis. This avoids applying a geometry-aware optimizer to an artifact of tensor storage rather than to the functional symmetry of the layer. We use untied input embeddings and output LM head weights, which allows us to assign different optimizers to these two large vocabulary-indexed matrices. Full experimental details are given in Appendix G. Code is available at https://github.com/timlautk/equivariant_optimizers.
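The head-wise treatment of fused attention weights can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: it swaps in the classical cubic Newton–Schulz iteration $X\leftarrow\tfrac{3}{2}X-\tfrac{1}{2}XX^{\top}X$ for the Polar Express coefficients, assumes that heads are stacked along the row axis of the fused matrix, and uses hypothetical helper names and toy dimensions.

```python
import numpy as np

def newton_schulz_polar(M, steps=30):
    """Approximate the orthogonal polar factor of M (classical cubic iteration)."""
    X = M / np.linalg.norm(M)  # Frobenius scaling puts singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def per_head_orthogonalize(M_fused, num_heads):
    """Orthogonalize each head's block of rows separately, not the fused matrix."""
    blocks = np.split(M_fused, num_heads, axis=0)
    return np.concatenate([newton_schulz_polar(B) for B in blocks], axis=0)

rng = np.random.default_rng(0)
num_heads, head_dim, d_model = 4, 8, 32  # toy sizes, chosen for illustration

# Stand-in for the momentum buffer of a fused projection (assumed layout:
# num_heads blocks of shape head_dim x d_model stacked along the rows).
momentum = rng.standard_normal((num_heads * head_dim, d_model))

update = per_head_orthogonalize(momentum, num_heads)
head0 = update[:head_dim]
print("head 0 row-orthonormality error:",
      np.max(np.abs(head0 @ head0.T - np.eye(head_dim))))
```

Each head block ends up with orthonormal rows on its own, so the update does not depend on how the heads happen to be concatenated in storage.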
4.1 Qwen3-0.6B-Style Pre-Training
We first pre-train a Qwen3-0.6B-style dense language model [127] incorporating several recent architectural innovations, including Grouped Query Attention (GQA) [4], the SwiGLU activation function [31,137], Rotary Positional Embeddings (RoPE) [144], pre-norm normalization [157] with RMSNorm [167,74], and QK-Norm [35], without QKV bias terms. The model uses a vocabulary size of 151,936 and hidden dimension 1024. Since untying the embeddings increases the number of trainable parameters, we reduce the number of hidden layers from 28 to 20, resulting in a total of 625,784,832 trainable parameters.
We compare three optimizer assignments for the vocabulary-indexed matrices, namely the input embedding and LM head matrices: (i) RowNormM, (ii) HybridPolarGradM with row-norm/right-spectral, and (iii) AdamW. For SwiGLU MLP projections, we compare (a) right-spectral updates, equivalently Muon-style updates, with (b) hybrid row-norm/right-spectral updates applied along the appropriate intermediate-neuron axis: row-aware for gate and up projections, and column-aware for down projections.
Although different LPRO-compatible optimizers could in principle be assigned to the embedding and LM head matrices, we use the same optimizer for both layers in each configuration for simplicity.
(a) SwiGLU MLP projection matrices use Muon, equivalently RightPolarGradM with $\alpha=0$.
(b) SwiGLU MLP projection matrices use HybridPolarGradM with a row-norm/right-spectral composition.
Figure 3: Training and validation losses for Qwen3-0.6B-style pre-training. In each subfigure, the three configurations differ only in the optimizer assigned to the input embedding and LM head matrices: RowNormM, HybridPolarGradM, or AdamW.
The final validation losses for configurations (i)–(iii) are 4.1991, 4.2055, and 4.2084 in Figure 3(a), and 4.1962, 4.1978, and 4.2046 in Figure 3(b), respectively. As shown in Figure 3, in both settings, configuration (iii) makes comparable initial progress to configuration (i) and has lower validation loss at the earlier stage of training, but is subsequently overtaken by both RowNormM and HybridPolarGradM. The final gap between HybridPolarGradM and AdamW is smaller than that between RowNormM and AdamW, but the validation loss still favors the symmetry-compatible update.
Across Figure 3(a) and Figure 3(b), using HybridPolarGradM for the SwiGLU MLP projection matrices improves all three embedding/LM-head configurations relative to using Muon for the SwiGLU MLP projections. The final validation losses decrease from 4.1991 to 4.1962 for RowNormM, from 4.2055 to 4.1978 for HybridPolarGradM, and from 4.2084 to 4.2046 for AdamW. The improvement is largest when HybridPolarGradM is also used for the input embedding and LM head matrices, suggesting a possible complementary effect between symmetry-compatible updates on vocabulary-indexed matrices and row-aware/right-spectral updates on SwiGLU MLP projections. Nevertheless, RowNormM remains the best-performing assignment for the input embedding and LM head matrices in both settings. Thus, the comparison between (a) and (b) suggests that applying row-norm/right-spectral hybrid updates to SwiGLU MLP projections can further improve validation loss, while the relative ranking among the three vocabulary-indexed optimizer assignments remains stable. This improvement is plausibly due to the tall-skinny geometry of the SwiGLU MLP projections. Since $d_{\mathrm{model}}=1024$ and $d_{\mathrm{ff}}=3072$, the gate and up projections have many more rows than columns, with rows corresponding to intermediate neurons. Row-scale imbalance can therefore be important. While Muon captures the right-spectral geometry, HybridPolarGradM additionally normalizes intermediate-neuron rows before the spectral step, which may yield a better update geometry in this regime.
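The row-norm/right-spectral composition described above can be sketched in a few lines of NumPy. This is a hedged illustration rather than the HybridPolarGradM algorithm itself: it omits momentum and any scaling factors, it assumes that the right-spectral step takes the form $D\,(D^{\top}D)^{-1/2}$, and it uses an exact eigendecomposition where the experiments use inexact Newton–Schulz-type iterations. The function names and toy sizes are hypothetical.

```python
import numpy as np

def row_normalize(D, eps=1e-12):
    """Rescale every row (intermediate neuron) to unit Euclidean norm."""
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    return D / np.maximum(norms, eps)

def right_spectral(D, eps=1e-12):
    """Right-whiten D as D (D^T D)^{-1/2} via an eigendecomposition of the Gram matrix."""
    evals, evecs = np.linalg.eigh(D.T @ D)
    inv_sqrt = (evecs / np.sqrt(np.maximum(evals, eps))) @ evecs.T
    return D @ inv_sqrt

def hybrid_row_norm_right_spectral(D):
    """Assumed composition: row normalization first, then the right-spectral step."""
    return right_spectral(row_normalize(D))

rng = np.random.default_rng(0)
d_ff, d_model = 12, 4  # toy tall-skinny gate/up projection
row_scales = 10.0 ** rng.uniform(-1, 1, size=(d_ff, 1))  # imbalanced neurons
G = row_scales * rng.standard_normal((d_ff, d_model))

hybrid = hybrid_row_norm_right_spectral(G)

# Left-permutation / right-orthogonal equivariance check.
Pi = np.eye(d_ff)[rng.permutation(d_ff)]
Q, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
equivariance_gap = np.max(
    np.abs(hybrid_row_norm_right_spectral(Pi @ G @ Q.T) - Pi @ hybrid @ Q.T)
)
print("equivariance gap:", equivariance_gap)
```

In this sketch the row normalization removes the per-neuron scale before whitening, so neurons with large gradient rows no longer dominate the Gram matrix that defines the spectral step.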
In terms of wall-clock training time, HybridPolarGradM incurs additional overhead due to the inner Gram Newton–Schulz iterations used to approximate the right-spectral component. Consequently, configurations (ii) and (iii) require a similar amount of training time to reach comparable validation loss values, despite HybridPolarGradM achieving a slightly lower final loss. Configurations (i) and (ii) both follow symmetry-compatible geometries for the embedding and LM head matrices, whereas configuration (iii) applies coordinate-wise AdamW updates to these matrix-valued parameters, thereby introducing a geometry mismatch. Overall, these results are consistent with our symmetry-aware matrix view of optimizer design: even when applied only to the vocabulary-indexed matrices, geometry-compatible updates can improve the optimization trajectory and final validation loss.
4.2 Gemma 3 1B-Style Pre-Training
We next pre-train a Gemma 3 1B-style dense language model [50]. Compared with the Qwen3-0.6B-style experiment, this model has both a larger vocabulary size of 262,144 and a larger hidden dimension of 1152. Consequently, the input embedding and LM head matrices are substantially larger, and their matrix gradients may be more anisotropic or ill-conditioned. This setting therefore provides a more stringent test of geometry-aware optimizers for vocabulary-indexed matrix parameters. For right-spectral updates, such as those used inside RightPolarGradM and HybridPolarGradM, the larger hidden dimension also makes the Gram matrix inverse-square-root computation more demanding, often requiring more accurate or additional Newton–Schulz iterations. To keep the total model size close to one billion trainable parameters after using untied embeddings, we reduce the number of hidden layers from 26 to 18, resulting in 1,087,138,944 trainable parameters. We use the same three optimizer assignments for the input embedding and LM head matrices as in the Qwen3-0.6B-style experiment, and compare two assignments for the SwiGLU MLP projection matrices: (a) Muon and (b) HybridPolarGradM.
(a) SwiGLU MLP projection matrices use Muon, equivalently RightPolarGradM with $\alpha=0$.
(b) SwiGLU MLP projection matrices use HybridPolarGradM with a row-norm/right-spectral composition.
Figure 4: Training and validation losses for Gemma 3 1B-style pre-training. In each subfigure, the three configurations differ only in the optimizer assigned to the input embedding and LM head matrices: RowNormM, HybridPolarGradM, or AdamW.
The final validation losses for configurations (i)–(iii) are 4.0699, 4.0663, and 4.1046 in Figure 4(a), and 4.0552, 4.0461, and 4.0862 in Figure 4(b), respectively. In both settings, HybridPolarGradM achieves the lowest final validation loss, while RowNormM also substantially outperforms AdamW. This behavior is consistent with the hypothesis that geometry-compatible updates become increasingly important as vocabulary-indexed matrices grow larger and their gradients become more anisotropic or ill-conditioned.
Comparing Figure 4(a) and Figure 4(b), using HybridPolarGradM for the SwiGLU MLP projection matrices improves all three embedding/LM-head optimizer assignments. The final validation loss decreases from 4.0699 to 4.0552 for RowNormM, from 4.0663 to 4.0461 for HybridPolarGradM, and from 4.1046 to 4.0862 for AdamW. This improvement is plausibly related to the tall-skinny geometry of the SwiGLU MLP projections: in Gemma 3 1B-style models, $d_{\mathrm{model}}=1152\ll d_{\mathrm{ff}}=6912$, so the gate and up projections have many more rows than columns, with rows corresponding to intermediate neurons. While Muon captures the right-spectral geometry, HybridPolarGradM additionally normalizes intermediate-neuron rows before the spectral step, which can better account for row-scale imbalance in this regime.
At the same time, HybridPolarGradM incurs higher computational overhead than RowNormM because it requires approximating matrix inverse square roots through inner Gram Newton–Schulz iterations. In contrast, RowNormM performs only row-wise normalization and therefore avoids spectral computations altogether. Thus, the Gemma 3 1B-style experiment highlights a practical tradeoff: the hybrid row-norm/right-spectral update achieves the best validation loss, while the row-norm-only update provides a computationally cheaper symmetry-compatible alternative that still improves markedly over the coordinate-wise AdamW baseline. We also provide additional experimental results for a base learning rate sweep and two extra random seeds in Section G.2.
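To indicate where this overhead comes from, the sketch below approximates a Gram matrix inverse square root using only matrix multiplications. It uses the classical coupled Newton–Schulz iteration for the matrix square root and its inverse, which we substitute here for illustration; it is not claimed to be the Gram Newton–Schulz variant of [168] or the iteration used in the experiments. The function name, step count, and toy dimensions are assumptions.

```python
import numpy as np

def gram_inverse_sqrt(A, steps=25):
    """Approximate A^{-1/2} for symmetric positive definite A (coupled Newton-Schulz)."""
    n = A.shape[0]
    eye = np.eye(n)
    c = np.linalg.norm(A)  # Frobenius norm; eigenvalues of A / c lie in (0, 1]
    Y = A / c              # Y_k converges to (A / c)^{1/2}
    Z = eye.copy()         # Z_k converges to (A / c)^{-1/2}
    for _ in range(steps):
        T = 0.5 * (3.0 * eye - Z @ Y)
        Y = Y @ T
        Z = T @ Z
    return Z / np.sqrt(c)  # undo the scaling: A^{-1/2} = (A / c)^{-1/2} / sqrt(c)

rng = np.random.default_rng(0)
rows, d_model = 40, 6  # toy tall matrix standing in for a vocabulary-indexed gradient
D = rng.standard_normal((rows, d_model))

A = D.T @ D                 # d_model x d_model Gram matrix
A_inv_sqrt = gram_inverse_sqrt(A)
whitened = D @ A_inv_sqrt   # right-spectral direction D (D^T D)^{-1/2}
print("orthonormality error:",
      np.max(np.abs(whitened.T @ whitened - np.eye(d_model))))
```

Each step costs a few $d_{\mathrm{model}}\times d_{\mathrm{model}}$ matrix products, and ill-conditioned Gram matrices need more steps to converge, which is consistent with the additional cost noted above relative to row-wise normalization.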
4.3 OLMoE-1B-7B-Style Pre-Training
In addition to dense language models, we also pre-train a sparse Mixture-of-Experts (MoE) model, a widely used architecture in recent open-weight language models [72,33,120,128,53,142,58,34]. We use AllenAI’s OLMoE-1B-7B [115], which provides a comprehensive training recipe together with open-source data, code, and training logs. The model has vocabulary size 50,304 and hidden dimension 2048, so the embedding and LM head matrices are sizable. Relative to the original pre-training setup, we remove the auxiliary load-balancing loss [135] and the router z-loss [176] in order to reduce confounding effects from auxiliary objectives and isolate the effect of optimizer geometry. We also reduce the number of hidden layers from 16 to 12 and the number of experts from 64 to 32, yielding a total of 2,824,177,664 trainable parameters.
For matrix-valued parameters other than the hidden and attention matrices, we compare four optimizer assignments:
- (i) RowNormM for embeddings, LM head, and routers,
- (ii) RowNormM for embeddings and LM head, and LeftPolarGradM for routers,
- (iii) RowNormM for embeddings and LM head, and AdamW for routers,
- (iv) AdamW for embeddings, LM head, and routers.
We choose RowNormM for the embedding and LM head matrices because it adds minimal computational overhead relative to AdamW while preserving a symmetry-compatible geometry for vocabulary-indexed matrices. By contrast, RightPolarGradM and HybridPolarGradM require numerical polar decomposition, which is computationally more demanding and may require higher numerical precision for large vocabulary matrices. We leave broader ablations over these alternatives to future work.
We also expect the relative behavior of Muon and HybridPolarGradM on SwiGLU MLP projections to depend on matrix aspect ratio. When $d_{\mathrm{ff}}\gg d_{\mathrm{model}}$, as in the dense Qwen3-0.6B-style and Gemma 3 1B-style models, the gate and up projections are tall-skinny and row-scale imbalance across intermediate neurons can be substantial. In this regime, the row-normalization step in HybridPolarGradM can be beneficial. In the MoE experiments below, however, $d_{\mathrm{ff}}$ and $d_{\mathrm{model}}$ are much closer, so pure Muon-style right-spectral updates may already capture much of the relevant matrix geometry. For this reason, in the MoE experiments we use Muon-style right-spectral updates for the SwiGLU MLP projection tensors and focus our ablations on the vocabulary-indexed matrices and MoE routers.
Figure 5: Training and validation losses for OLMoE-1B-7B-style pre-training. The configurations differ in the optimizers assigned to the embedding, LM head, and router matrices.
The final validation losses for configurations (i)–(iv) are 4.0814, 4.0717, 4.1083, and 4.1155, respectively. As shown in Figure 5, configuration (iv), which uses AdamW for the embedding, LM head, and router matrices, makes faster initial progress over roughly the first 500 steps, but is eventually overtaken by configurations (i)–(iii). This reversal occurs before the onset of learning rate decay, in which the learning rate decreases linearly to zero over the last 40% of the training tokens. The validation loss gaps continue to widen during the decay phase.
Configurations (i) and (ii) use symmetry-compatible updates for all three special matrix classes considered here: embeddings, LM head, and routers. Configuration (iii) retains symmetry-compatible updates for the embedding and LM head matrices, but introduces a geometry mismatch in the router updates by using coordinate-wise AdamW. Configuration (iv) applies AdamW to all three classes and therefore departs most strongly from the proposed symmetry-compatible optimizer design. Empirically, both (i) and (ii) outperform (iii), while (iv) performs worst overall. These results are consistent with our theoretical perspective that respecting parameter symmetry and matrix geometry can matter for optimizer design, especially in large sparse architectures where router dynamics play a central role.
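The router symmetries behind these assignments can be made concrete with a small sketch. We assume, for illustration only, a softmax router with one weight row per expert and a centered row-norm map that first subtracts the mean over experts and then normalizes each row; the paper's exact centered row-norm update may order or scale these steps differently, and the helper names are hypothetical.

```python
import numpy as np

def centered_row_norm(D, eps=1e-12):
    """Remove the component shared by all experts, then rescale each expert row."""
    centered = D - D.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.maximum(norms, eps)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_experts, d_model = 8, 16
W = rng.standard_normal((num_experts, d_model))  # router weights, one row per expert
h = rng.standard_normal(d_model)                 # hidden state of one token
c = rng.standard_normal(d_model)                 # shift added to every expert row

# Adding the same vector to every row shifts all logits by the scalar c @ h,
# which leaves the routing probabilities unchanged.
probs = softmax(W @ h)
probs_shifted = softmax((W + c) @ h)

# The centered row-norm map ignores the same shared shift in the direction ...
D = rng.standard_normal((num_experts, d_model))
shift_gap = np.max(np.abs(centered_row_norm(D + c) - centered_row_norm(D)))

# ... and commutes with relabeling the experts.
perm = rng.permutation(num_experts)
perm_gap = np.max(np.abs(centered_row_norm(D[perm]) - centered_row_norm(D)[perm]))
print(shift_gap, perm_gap)
```

A coordinate-wise rule such as AdamW satisfies the permutation property but is not invariant to the shared shift, which is one way to read the geometry mismatch attributed to configuration (iii).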
The performance gap between configurations (i) and (ii) is relatively small, suggesting that the row-norm and left-spectral router updates behave similarly in this setting. This small gap may reflect suboptimal learning-rate tuning for RowNormM, the effect of inexact polar oracles, or the ability of left-spectral normalization to capture more of the relevant router geometry in this experiment. Finally, we observe that configuration (iv) appears to exhibit slightly more pronounced training-loss spikes at around 2.1B training tokens than the other configurations, despite using Muon for the hidden and attention matrices. This suggests that geometry-matched optimizer choices for embeddings, LM heads, and routers may also improve training stability in practice.
4.4 Downsized gpt-oss Pre-Training
We finally pre-train a downsized variant of gpt-oss-20b [120]. This architecture differs from OLMoE in several important ways. For example, it uses QKV bias terms in its GQA modules and includes bias vectors in its MoE router networks. The model has vocabulary size 201,088, making the embedding and LM head matrices substantially larger than those in OLMoE. To obtain a tractable experimental variant, we downsize gpt-oss-20b by reducing the number of hidden layers from 24 to 12, the hidden and intermediate dimensions from 2880 to 2048, and the number of experts from 32 to 16. This yields a total of 3,467,779,008 trainable parameters. We use the same loss function and the same four optimizer assignments as in the OLMoE-1B-7B experiment.
Figure 6: Training and validation losses for downsized gpt-oss pre-training. The configurations differ in the optimizers assigned to the embedding, LM head, and router matrices.
The final validation losses for configurations (i)–(iv) are 4.3090, 4.3122, 4.3363, and 4.3704, respectively. As in the OLMoE experiment, the fully coordinate-wise baseline in configuration (iv), which uses AdamW for the embedding, LM head, and router matrices, obtains the worst final validation loss. The three configurations using RowNormM for the embedding and LM head matrices all substantially improve over this baseline, suggesting that the benefit of symmetry-compatible updates for vocabulary-indexed matrices persists in a distinct sparse MoE architecture.
Among configurations (i)–(iii), the differences are smaller. As shown in Figure 6, configuration (i), which uses RowNormM for embeddings, LM head, and routers, achieves the lowest final validation loss, followed closely by configuration (ii), which uses LeftPolarGradM for routers. Configuration (iii), which keeps RowNormM for embeddings and LM head but uses AdamW for routers, is slightly worse than both geometry-compatible router variants. This ordering is consistent with the view that router geometry can matter, although the smaller gap between configurations (i)–(iii) suggests that the dominant improvement in this setting comes from replacing AdamW on the large vocabulary-indexed matrices.
Overall, the downsized gpt-oss experiment provides an additional architecture check for our optimizer design principle. Despite architectural differences from OLMoE, including QKV biases and router bias terms, the same qualitative pattern holds: geometry-compatible optimizer assignments for special matrix-valued parameters improve the final validation loss relative to using coordinate-wise AdamW for those parameters.
4.5 Cross-Model Comparison
We compare the results across Sections 4.1, 4.2, 4.3 and 4.4. These experiments span dense and sparse language models with increasing numbers of trainable parameters, from the Qwen3-0.6B-style dense model to the downsized gpt-oss sparse MoE model. Across these models, the vocabulary sizes differ substantially, while the hidden dimensions are relatively close. Consequently, the embedding and LM head matrices have comparable column dimensions, determined by $d_{\mathrm{model}}$, but very different numbers of rows, determined by the vocabulary size $v$. Thus, as $v$ grows, these vocabulary-indexed matrices become increasingly tall. This changes both their computational cost and the conditioning properties of their matrix gradients, and makes them a natural testbed for row-aware and one-sided spectral optimizer design.
The same row-versus-column distinction is also important for SwiGLU MLP projection matrices. For a dense SwiGLU block, the gate and up projections have shape $d_{\mathrm{ff}}\times d_{\mathrm{model}}$, so their rows correspond to intermediate neurons, whereas the down projection has shape $d_{\mathrm{model}}\times d_{\mathrm{ff}}$, so the same intermediate-neuron geometry appears along the column axis. In the dense Qwen3-0.6B-style and Gemma 3 1B-style models, $d_{\mathrm{ff}}$ is substantially larger than $d_{\mathrm{model}}$, placing the gate and up projections in a tall-skinny regime. This is precisely the setting where row-normalization before a right-spectral step can be useful: HybridPolarGradM can correct row-scale imbalance across intermediate neurons while retaining spectral geometry. In the MoE experiments, by contrast, the hidden and intermediate dimensions are closer in our downsized settings, so a pure Muon-style right-spectral update may already capture much of the relevant matrix geometry for expert SwiGLU projection tensors.
Across all experiments, replacing AdamW on the large vocabulary-indexed matrices with symmetry-compatible optimizers consistently improves final validation loss. The gains are modest but visible for the smaller dense model, become more pronounced in the larger Gemma 3 1B-style experiment, and persist in both sparse MoE experiments. This trend is consistent with our matrix-geometry perspective: as embedding and LM head matrices grow in their row dimension, coordinate-wise updates increasingly operate in a parameterization-dependent augmented space, whereas row-norm and spectral updates preserve the relevant vocabulary-indexed matrix geometry.
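The size of these gains can be read off directly from the final validation losses reported in Sections 4.1–4.4. The short script below is a bookkeeping aid, not a new result: for each experiment it subtracts the best loss among configurations that replace AdamW on the embedding and LM head matrices from the loss of the configuration that keeps AdamW there. In the two MoE experiments the AdamW configuration (iv) also uses AdamW for the routers, so those gaps are not attributable to the vocabulary-indexed matrices alone.

```python
# Final validation losses copied from Sections 4.1-4.4.
# "adamw": configuration keeping AdamW on the embedding and LM head matrices.
# "compatible": configurations using RowNormM or HybridPolarGradM there.
final_val_loss = {
    "Qwen3-0.6B-style, Muon MLP": {"adamw": 4.2084, "compatible": [4.1991, 4.2055]},
    "Qwen3-0.6B-style, hybrid MLP": {"adamw": 4.2046, "compatible": [4.1962, 4.1978]},
    "Gemma 3 1B-style, Muon MLP": {"adamw": 4.1046, "compatible": [4.0699, 4.0663]},
    "Gemma 3 1B-style, hybrid MLP": {"adamw": 4.0862, "compatible": [4.0552, 4.0461]},
    "OLMoE-1B-7B-style": {"adamw": 4.1155, "compatible": [4.0814, 4.0717, 4.1083]},
    "downsized gpt-oss": {"adamw": 4.3704, "compatible": [4.3090, 4.3122, 4.3363]},
}

# Gap between the AdamW baseline and the best symmetry-compatible assignment.
gaps = {
    name: round(result["adamw"] - min(result["compatible"]), 4)
    for name, result in final_val_loss.items()
}

for name, gap in gaps.items():
    print(f"{name}: {gap:.4f}")
```

The resulting gaps are below 0.01 for the Qwen3-0.6B-style runs, about 0.04 for the Gemma 3 1B-style and OLMoE-1B-7B-style runs, and about 0.06 for the downsized gpt-oss run.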
The comparison should not be interpreted as a scaling law, since the models differ in architecture, vocabulary size, training length, and sparsity structure. Nevertheless, the consistent ordering across dense and sparse MoE models provides evidence that the benefit of symmetry-compatible optimizer assignments is not restricted to a single model family or architecture. In particular, the results suggest that large vocabulary-indexed matrices are a robust setting in which geometry-compatible updates can improve optimization. The sparse MoE experiments further show that router matrices and expert SwiGLU projection tensors provide additional layer types where symmetry-aware optimizer design can matter.
Taken together, these experiments support the view that symmetry-compatible optimizer design is most naturally applied as a layerwise optimizer stack, rather than as a single global replacement for AdamW.
5 Discussion and Outlook
This work suggests a different view of deep learning optimization: optimizer design should be layerwise, geometry-aware, and symmetry-compatible. Popular coordinate-wise adaptive methods such as Adam and AdamW remain strong default optimizers because of their robustness, efficiency, and practical momentum. However, when applied indiscriminately to matrix-valued parameters, they treat matrices and tensors as collections of independent coordinates and therefore ignore the intrinsic geometry of the parameter blocks they update. This mismatch is especially apparent in modern architectures, where different modules play different algebraic and semantic roles.
Our main contribution is a symmetry-compatible equivariance principle for designing optimizers for matrix-valued neural network parameters. For ordinary matrix layers, this principle recovers spectral optimizers as natural bi-orthogonally equivariant update maps. For embedding and LM head matrices, it leads to left-permutation/right-orthogonal equivariant updates. For SwiGLU MLP projections, it motivates row- and column-aware updates aligned with intermediate-neuron permutation geometry. For MoE router weights, it yields expert-permutation equivariant and shared-logit-shift invariant updates. Together, these examples support an architecture–optimizer co-design principle: different parameter classes should be updated by optimizers whose equivariance matches their layerwise symmetry.
This perspective connects recent progress in matrix-gradient optimization to a broader transition from generic coordinate-wise methods toward module-aware and geometry-aware optimization. Layerwise training itself is not new: classical examples include LARS and LAMB [163,164], layerwise hyperparameter prescriptions such as those arising in $\mu$P [160], and block-coordinate views of neural network training [166,90]. What is emerging, however, is a more refined class of layerwise optimizers that account for the geometry of each parameter block. Recent methods such as Shampoo [62], Muon [76], SOAP [151], Scion [122], Gluon [132], and PolarGrad [89] can be viewed as part of this trend.
The case for geometry-aware optimizer design becomes stronger as foundation models become more heterogeneous. Large language models [149,130,131,19], vision transformers [40], multimodal models [129,18], diffusion language models [110,55,119], MoEs [135,91], and state space models [60,30,87] all contain parameter blocks with distinct natural symmetries. From this viewpoint, it is increasingly unnatural to optimize all such layers with a single coordinate-wise rule. Instead, optimizer design for modern deep learning systems should be treated as an architecture-aware problem, rather than as the selection of one universal update rule.
This viewpoint also offers a possible interpretation of some training stabilization practices. As model scale increases, coordinate-wise adaptive optimizers are rarely used in isolation; they are typically accompanied by stabilization tricks, modified variants such as StableAdamW [154], and many recipe-level heuristics [70]. While such techniques are practically important, some of them may be understood as partial corrections for a mismatch between optimizer geometry and model geometry. Geometry-aware optimizers provide a complementary path by encoding more appropriate invariance, normalization, and scaling structure directly into the update rule.
Several challenges remain. First, geometry-aware optimizers depend on efficient numerical linear algebra. The practical success of Muon and related methods was enabled in part by making polar decomposition or matrix orthogonalization efficient at scale, especially through Newton–Schulz iterations. Further progress will likely depend on fast, stable, and GPU-friendly matrix decomposition routines, including inexact polar oracles based on Newton–Schulz iteration [79], QDWH [89], Polar Express [5], CANS [59], PRISM [162], Turbo-Muon [17], Flash-Muon [103], and Gram Newton–Schulz [168]. Second, non-elementwise optimizers raise new distributed-systems challenges, including communication costs, synchronization, memory sharding, and compatibility with tensor, pipeline, sequence, and data parallelism. Recent work on Distributed Muon [105,47,124], Dion [3,2], Disco [48], and Parallel Muon [101] suggests that these challenges can be addressed in practice.
Encouragingly, matrix-aware optimizers have already begun to appear in industry-scale model training, including work by Moonshot AI [105,80], Essential AI [46], Prime Intellect [124], Zhipu AI [52,53], Zyphra [8,152], Motif Technologies [101], Arcee AI [141], StepFun [142], and DeepSeek-AI [34]. These developments suggest that the question is no longer whether matrix-aware optimizers can be scaled, but how far they can be pushed once algorithmic geometry, numerical linear algebra, and distributed systems are designed in concert.
Recent work by Du et al. [41] provides a complementary perspective, showing through a layer-peeled optimization model that symmetries in next-token distributions can transfer to learned LLM weights, logits, and context embeddings. The Newton–Muon optimizer [42] provides a complementary example of symmetry-aware matrix-gradient optimization. By deriving a Muon-like update from a quadratic surrogate involving the layer input matrix, Newton–Muon shows that the polar update can be combined with right preconditioning by the input second moment. In this sense, it extends the geometry of Muon from a purely weight-gradient update to one that also reflects data geometry.
Overall, our results suggest that much of modern deep learning still relies on optimizer updates whose geometry is mismatched to the layers they train. This mismatch may affect robustness, stability, scalability, and interpretability, since coordinate-wise adaptivity is sensitive to arbitrary parameterizations and can be viewed as operating in a pathological diagonal lifting of the original matrix space. By contrast, symmetry-compatible updates offer the prospect of more stable, parameterization-consistent, and theoretically grounded training procedures. Introducing symmetry and equivariance into optimizer design therefore opens a path toward a more principled science of large-scale model pre-training.
Acknowledgments
This work was supported by computational resources from Prime Intellect.
References
- [1] E. Abbe and E. Boix-Adsera (2022) On the non-universality of deep learning: quantifying the cost of symmetry. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.3.
- [2] K. Ahn, N. Amsel, and J. Langford (2025) Dion2: a simple method to shrink matrix in Muon. arXiv preprint arXiv:2512.16928. Cited by: §5.
- [3] K. Ahn, B. Xu, N. Abreu, Y. Fan, G. Magakyan, P. Sharma, Z. Zhan, and J. Langford (2025) Dion: distributed orthonormalized updates. arXiv preprint arXiv:2504.05295. Cited by: §5.
- [4] J. Ainslie, J. Lee-Thorp, M. De Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai (2023) GQA: training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §4.1.
- [5] N. Amsel, D. Persson, C. Musco, and R. M. Gower (2026) The Polar Express: optimal matrix sign methods and their application to the Muon algorithm. In International Conference on Learning Representations (ICLR), Cited by: Algorithm E.1, Appendix G, §3.7.1, §4, §5.
- [6] K. An, Y. Liu, R. Pan, S. Ma, D. Goldfarb, and T. Zhang (2025) ASGO: adaptive structured gradient optimization. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §3.8.1, Remark 3.4.
- [7] R. Anil, V. Gupta, T. Koren, K. Regan, and Y. Singer (2020) Scalable second order optimization for deep learning. arXiv preprint arXiv:2002.09018. Cited by: §2.1, §3.8.1.
- [8] Q. Anthony, Y. Tokpanov, S. Szot, S. Rajagopal, P. Medepalli, R. Iyer, V. Shyam, A. Golubeva, A. Chaurasia, X. Yang, T. Figliolia, R. Washbourne, D. Thorstensen, A. Pearson, Z. Grossbart, J. van Patten, E. Barsoum, Z. Gu, Y. Fu, and B. Millidge (2025) Training foundation models on a full-stack AMD platform: compute, networking, and system design. arXiv preprint arXiv:2511.17127. Cited by: §5.
- [9] L. Autonne (1902) Sur les groupes linéaires, réels et orthogonaux. Bulletin de la Société Mathématique de France 30, pp. 121–134. Cited by: §3.2.
- [10] E. Bao, J. Lu, L. Song, N. Hart-Hodgson, W. Parson, and Y. Zhou (2019) Equivariant neural networks and equivarification. arXiv preprint arXiv:1906.07172. Cited by: §2.3.
- [11] J. Bernstein and L. Newhouse (2024) Old optimizer, new norm: an anthology. In OPT 2024: Optimization for Machine Learning, Cited by: §A.1, item 2, Remark 3.5.
- [12] J. Bernstein and L. Newhouse (2025) Modular duality in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §A.1, §A.2, item 2.
- [13] J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar (2018) SignSGD: compressed optimisation for non-convex problems. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §A.1, Appendix C.
- [14] J. Bernstein (2025-03) Deriving Muon. Cited by: §A.1.
- [15] J. Bernstein (2025) Modular manifolds. Thinking Machines Lab: Connectionism. Note: https://thinkingmachines.ai/blog/modular-manifolds/ Cited by: §2.1.
- [16] R. Bhatia (2013) Matrix analysis. Vol. 169, Springer Science & Business Media. Cited by: Appendix B, §2.2.
- [17] T. Boissin, T. Massena, F. Mamalet, and M. Serrurier (2025) Turbo-Muon: accelerating orthogonality-based optimization with pre-conditioning. arXiv preprint arXiv:2512.04632. Cited by: §5.
- [18] F. Bordes, R. Y. Pang, A. Ajay, A. C. Li, A. Bardes, S. Petryk, O. Mañas, Z. Lin, A. Mahmoud, B. Jayaraman, M. Ibrahim, M. Hall, Y. Xiong, J. Lebensold, C. Ross, S. Jayakumar, C. Guo, D. Bouchacourt, H. Al-Tahan, K. Padthe, V. Sharma, H. Xu, X. E. Tan, M. Richards, S. Lavoie, P. Astolfi, R. A. Hemmat, J. Chen, K. Tirumala, R. Assouel, M. Moayeri, A. Talattof, K. Chaudhuri, Z. Liu, X. Chen, Q. Garrido, K. Ullrich, A. Agrawal, K. Saenko, A. Celikyilmaz, and V. Chandra (2024) An introduction to vision-language modeling. arXiv preprint arXiv:2405.17247. Cited by: §5.
- [19] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §5.
- [20] S. Buchanan (2025) A faster manifold Muon with ADMM. Note: https://sdbuchanan.com/blog/manifold-muon/ Cited by: §2.1.
- [21] D. Carlson, V. Cevher, and L. Carin (2015) Stochastic spectral descent for restricted Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: item 2, §2.1, §3.8.1.
- [22] D. Carlson, E. Collins, Y. Hsieh, L. Carin, and V. Cevher (2015) Preconditioned spectral descent for deep learning. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.1.
- [23] D. Carlson, Y. Hsieh, E. Collins, L. Carin, and V. Cevher (2016) Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing 10(2), pp. 296–311. Cited by: §2.1, §3.8.1, §3.8.
- [24] D. Chang, Y. Liu, and G. Yuan (2025) On the convergence of Muon and beyond. arXiv preprint arXiv:2509.15816. Cited by: §2.1.
- [25] D. Chang, Q. Shi, L. Zhang, Y. Li, R. Zhang, Y. Lu, Y. Liu, and G. Yuan (2026) MuonEq: balancing before orthogonalization with lightweight equilibration. arXiv preprint arXiv:2603.28254. Cited by: §3.3.2.
- [26] L. Chen, J. Li, and Q. Liu (2025) Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054. Cited by: §2.1.
- [27] Y. Chen, Y. Chi, J. Fan, and C. Ma (2021) Spectral methods for data science: a statistical perspective. Foundations and Trends® in Machine Learning 14(5), pp. 566–806. Cited by: §2.2.
- [28] M. Crawshaw, C. Modi, M. Liu, and R. M. Gower (2025) An exploration of non-Euclidean gradient descent: Muon and its many variants. arXiv preprint arXiv:2510.09827. Cited by: §A.1, §2.1.
- [29] G. E. Dahl, F. Schneider, Z. Nado, N. Agarwal, C. S. Sastry, P. Hennig, S. Medapati, R. Eschenhagen, P. Kasimbeg, D. Suo, J. Bae, J. Gilmer, A. L. Peirson, B. Khan, R. Anil, M. Rabbat, S. Krishnan, D. Snider, E. Amid, K. Chen, C. J. Maddison, R. Vasudev, M. Badura, A. Garg, and P. Mattson (2023) Benchmarking neural network training algorithms. arXiv preprint arXiv:2306.07179. Cited by: §1.
- [30] T. Dao and A. Gu (2024) Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §5.
- [31] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §3.4, §4.1.
- [32] D. Davis and D. Drusvyatskiy (2025) When do spectral gradient updates help in deep learning?. arXiv preprint arXiv:2512.04299. Cited by: §2.1.
- [33] DeepSeek-AI, A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Wang, J. Chen, J. Chen, J. Yuan, J. Qiu, J. Li, J. Song, K. Dong, K. Hu, K. Gao, K. Guan, K. Huang, K. Yu, L. Wang, L. Zhang, L. Xu, L. Xia, L. Zhao, L. Wang, L. Zhang, M. Li, M. Wang, M. Zhang, M. Zhang, M. Tang, M. Li, N. Tian, P. Huang, P. Wang, P. Zhang, Q. Wang, Q. Zhu, Q. Chen, Q. Du, R. J. Chen, R. L. Jin, R. Ge, R. Zhang, R. Pan, R. Wang, R. Xu, R. Zhang, R. Chen, S. S. Li, S. Lu, S. Zhou, S. Chen, S. Wu, S. Ye, S. Ye, S. Ma, S. Wang, S. Zhou, S. Yu, S. Zhou, S. Pan, T. Wang, T. Yun, T. Pei, T. Sun, W. L. Xiao, W. Zeng, W. Zhao, W. An, W. Liu, W. Liang, W. Gao, W. Yu, W. Zhang, X. Q. Li, X. Jin, X. Wang, X. Bi, X. Liu, X. Wang, X. Shen, X. Chen, X. Zhang, X. Chen, X. Nie, X. Sun, X. Wang, X. Cheng, X. Liu, X. Xie, X. Liu, X. Yu, X. Song, X. Shan, X. Zhou, X. Yang, X. Li, X. Su, X. Lin, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. X. Zhu, Y. Zhang, Y. Xu, Y. Xu, Y. Huang, Y. Li, Y. Zhao, Y. Sun, Y. Li, Y. Wang, Y. Yu, Y. Zheng, Y. Zhang, Y. Shi, Y. Xiong, Y. He, Y. Tang, Y. Piao, Y. Wang, Y. Tan, Y. Ma, Y. Liu, Y. Guo, Y. Wu, Y. Ou, Y. Zhu, Y. Wang, Y. Gong, Y. Zou, Y. He, Y. Zha, Y. Xiong, Y. Ma, Y. Yan, Y. Luo, Y. You, Y. Liu, Y. Zhou, Z. F. Wu, Z. Z. Ren, Z. Ren, Z. Sha, Z. Fu, Z. Xu, Z. Huang, Z. Zhang, Z. Xie, Z. Zhang, Z. Hao, Z. Gou, Z. Ma, Z. Yan, Z. Shao, Z. Xu, Z. Wu, Z. Zhang, Z. Li, Z. Gu, Z. Zhu, Z. Liu, Z. Li, Z. Xie, Z. Song, Z. Gao, and Z. Pan (2024) DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437. Cited by: §4.3.
- [34] DeepSeek-AI (2026) DeepSeek-V4: towards highly efficient million-token context intelligence. Cited by: §4.3, §5.
- [35] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, R. Jenatton, L. Beyer, M. Tschannen, A. Arnab, X. Wang, C. Riquelme, M. Minderer, J. Puigcerver, U. Evci, M. Kumar, S. van Steenkiste, G. F. Elsayed, A. Mahendran, F. Yu, A. Oliver, F. Huot, J. Bastings, M. P. Collier, A. Gritsenko, V. Birodkar, C. Vasconcelos, Y. Tay, T. Mensink, A. Kolesnikov, F. Pavetić, D. Tran, T. Kipf, M. Lučić, X. Zhai, D. Keysers, J. Harmsen, and N. Houlsby (2023) Scaling vision transformers to 22 billion parameters. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §4.1.
- [36] S. Deng, Z. Ouyang, T. Pang, Z. Liu, R. Jin, S. Yu, and Y. Yang (2026) RMNP: row-momentum normalized preconditioning for scalable matrix-based optimization. arXiv preprint arXiv:2603.20527. Cited by: §3.3.2.
- [37] A. Dewulf, D. Pai, L. Yang, A. Zhang, and B. Keigwin (2026-05-05) Aurora: a leverage-aware optimizer for rectangular matrices. Cited by: §3.4.
- [38] C. Ding, D. Sun, J. Sun, and K. Toh (2018) Spectral operators of matrices. Mathematical Programming 168(1), pp. 509–531. Cited by: §2.2.
- [39] C. Ding, D. Sun, J. Sun, and K. Toh (2020) Spectral operators of matrices: semismoothness and characterizations of the generalized Jacobian. SIAM Journal on Optimization 30(1), pp. 630–659. Cited by: §2.2.
- [40] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), Cited by: §5.
- [41] Z. Du, H. He, and W. Su (2026) Uncovering symmetry transfer in large language models via layer-peeled optimization. arXiv preprint arXiv:2605.12756. Cited by: §5.
- [42] Z. Du and W. Su (2026) The Newton–Muon optimizer. arXiv preprint arXiv:2604.01472. Cited by: §2.1, §5.
- [43] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159. Cited by: §1.
- [44] R. Eschenhagen, A. Cai, T. Lee, and H. M. Shi (2026) Clarifying Shampoo: adapting spectral descent to stochasticity and the parameter trajectory. arXiv preprint arXiv:2602.09314. Cited by: §2.1.
- [45] R. Eschenhagen, A. Immer, R. Turner, F. Schneider, and P. Hennig (2023) Kronecker-factored approximate curvature for modern neural network architectures. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.1.
- [46] Essential AI, I. Shah, A. M. Polloreno, K. Stratos, P. Monk, A. Chaluvaraju, A. Hojel, A. Ma, A. Thomas, A. Tanwer, D. J. Shah, K. Nguyen, K. Smith, M. Callahan, M. Pust, M. Parmar, P. Rushton, P. Mazarakis, R. Kapila, S. Srivastava, S. Singla, T. Romanski, Y. Vanjani, and A. Vaswani (2025) Practical efficiency of Muon for pretraining. arXiv preprint arXiv:2505.02222. Cited by: §5.
- [47] Essential AI (2025) Layer sharding for large-scale training with Muon. Note: https://www.essential.ai/research/infra Cited by: §5.
- [48] O. Filatov, J. Wang, J. Ebert, and S. Kesselheim (2025) Optimal scaling needs optimal norm. arXiv preprint arXiv:2510.03871. Cited by: §5.
- [49] K. Frans, S. Levine, and P. Abbeel (2025) A stable whitening optimizer for efficient neural network training. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.1.
- [50] Gemma Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J. Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsitsulin, R. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y. Gao, B. Mustafa, I. Barr, E. Parisotto, D. Tian, M. Eyal, C. Cherry, J. Peter, D. Sinopalnikov, S. Bhupatiraju, R. Agarwal, M. Kazemi, D. Malkin, R. Kumar, D. Vilar, I. Brusilovsky, J. Luo, A. Steiner, A. Friesen, A. Sharma, A. Sharma, A. M. Gilady, A. Goedeckemeyer, A. Saade, A. Feng, A. Kolesnikov, A. Bendebury, A. Abdagic, A. Vadi, A. György, A. S. Pinto, A. Das, A. Bapna, A. Miech, A. Yang, A. Paterson, A. Shenoy, A. Chakrabarti, B. Piot, B. Wu, B. Shahriari, B. Petrini, C. Chen, C. L. Lan, C. A. Choquette-Choo, C. Carey, C. Brick, D. Deutsch, D. Eisenbud, D. Cattle, D. Cheng, D. Paparas, D. S. Sreepathihalli, D. Reid, D. Tran, D. Zelle, E. Noland, E. Huizenga, E. Kharitonov, F. Liu, G. Amirkhanyan, G. Cameron, H. Hashemi, H. Klimczak-Plucińska, H. Singh, H. Mehta, H. T. Lehri, H. Hazimeh, I. Ballantyne, I. Szpektor, I. Nardini, J. Pouget-Abadie, J. Chan, J. Stanton, J. Wieting, J. Lai, J. Orbay, J. Fernandez, J. Newlan, J. Ji, J. Singh, K. Black, K. Yu, K. Hui, K. Vodrahalli, K. Greff, L. Qiu, M. Valentine, M. Coelho, M. Ritter, M. Hoffman, M. Watson, M. Chaturvedi, M. Moynihan, M. Ma, N. Babar, N. Noy, N. Byrd, N. Roy, N. Momchev, N. Chauhan, N. Sachdeva, O. Bunyan, P. Botarda, P. Caron, P. K. Rubenstein, P. Culliton, P. Schmid, P. G. Sessa, P. Xu, P. Stanczyk, P. Tafti, R. Shivanna, R. Wu, R. Pan, R. Rokni, R. Willoughby, R. Vallu, R. Mullins, S. Jerome, S. Smoot, S. Girgin, S. Iqbal, S. Reddy, S. Sheth, S. Põder, S. Bhatnagar, S. R. Panyam, S. Eiger, S. Zhang, T. Liu, T. Yacovone, T. Liechty, U. Kalra, U. Evci, V. Misra, V. Roseberry, V. Feinberg, V. Kolesnikov, W. Han, W. Kwon, X. Chen, Y. Chow, Y. Zhu, Z. Wei, Z. Egyed, V. Cotruta, M. Giang, P. Kirk, A. Rao, K. Black, N. Babar, J. Lo, E. Moreira, L. G. Martins, O. Sanseviero, L. Gonzalez, Z. Gleicher, T. Warkentin, V. Mirrokni, E. Senter, E. Collins, J. Barral, Z. Ghahramani, R. Hadsell, Y. Matias, D. Sculley, S. Petrov, N. Fiedel, N. Shazeer, O. Vinyals, J. Dean, D. Hassabis, K. Kavukcuoglu, C. Farabet, E. Buchatskaya, J. Alayrac, R. Anil, D. Lepikhin, S. Borgeaud, O. Bachem, A. Joulin, A. Andreev, C. Hardin, R. Dadashi, and L. Hussenot (2025) Gemma 3 technical report. arXiv preprint arXiv:2503.19786. Cited by: Appendix G, §4.2.
- [51] A. Glentis, J. Li, A. Han, and M. Hong (2025) A minimalist optimizer design for LLM pretraining. arXiv preprint arXiv:2506.16659. Cited by: §3.3.2.
- [52] GLM-4.5 Team, A. Zeng, X. Lv, Q. Zheng, Z. Hou, B. Chen, C. Xie, C. Wang, D. Yin, H. Zeng, J. Zhang, K. Wang, L. Zhong, M. Liu, R. Lu, S. Cao, X. Zhang, X. Huang, Y. Wei, Y. Cheng, Y. An, Y. Niu, Y. Wen, Y. Bai, Z. Du, Z. Wang, Z. Zhu, B. Zhang, B. Wen, B. Wu, B. Xu, C. Huang, C. Zhao, C. Cai, C. Yu, C. Li, C. Ge, C. Huang, C. Zhang, C. Xu, C. Zhu, C. Li, C. Yin, D. Lin, D. Yang, D. Jiang, D. Ai, E. Zhu, F. Wang, G. Pan, G. Wang, H. Sun, H. Li, H. Li, H. Hu, H. Zhang, H. Peng, H. Tai, H. Zhang, H. Wang, H. Yang, H. Liu, H. Zhao, H. Liu, H. Yan, H. Liu, H. Chen, J. Li, J. Zhao, J. Ren, J. Jiao, J. Zhao, J. Yan, J. Wang, J. Gui, J. Zhao, J. Liu, J. Li, J. Li, J. Lu, J. Wang, J. Yuan, J. Li, J. Du, J. Du, J. Liu, J. Zhi, J. Gao, K. Wang, L. Yang, L. Xu, L. Fan, L. Wu, L. Ding, L. Wang, M. Zhang, M. Li, M. Xu, M. Zhao, M. Zhai, P. Du, Q. Dong, S. Lei, S. Tu, S. Yang, S. Lu, S. Li, S. Li, Shuang-Li, S. Yang, S. Yi, T. Yu, W. Tian, W. Wang, W. Yu, W. L. Tam, W. Liang, W. Liu, X. Wang, X. Jia, X. Gu, X. Ling, X. Wang, X. Fan, X. Pan, X. Zhang, X. Zhang, X. Fu, X. Zhang, Y. Xu, Y. Wu, Y. Lu, Y. Wang, Y. Zhou, Y. Pan, Y. Zhang, Y. Wang, Y. Li, Y. Su, Y. Geng, Y. Zhu, Y. Yang, Y. Li, Y. Wu, Y. Li, Y. Liu, Y. Wang, Y. Li, Y. Zhang, Z. Liu, Z. Yang, Z. Zhou, Z. Qiao, Z. Feng, Z. Liu, Z. Zhang, Z. Wang, Z. Yao, Z. Wang, Z. Liu, Z. Chai, Z. Li, Z. Zhao, W. Chen, J. Zhai, B. Xu, M. Huang, H. Wang, J. Li, Y. Dong, and J. Tang (2025) GLM-4.5: agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471. Cited by: §5.
- [53] GLM-5 Team, A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, C. Zhu, C. Yin, C. Wang, G. Pan, H. Zeng, H. Zhang, H. Wang, H. Chen, J. Zhang, J. Jiao, J. Guo, J. Wang, J. Du, J. Wu, K. Wang, L. Li, L. Fan, L. Zhong, M. Liu, M. Zhao, P. Du, Q. Dong, R. Lu, Shuang-Li, S. Cao, S. Liu, T. Jiang, X. Chen, X. Zhang, X. Huang, X. Dong, Y. Xu, Y. Wei, Y. An, Y. Niu, Y. Zhu, Y. Wen, Y. Cen, Y. Bai, Z. Qiao, Z. Wang, Z. Wang, Z. Zhu, Z. Liu, Z. Li, B. Wang, B. Wen, C. Huang, C. Cai, C. Yu, C. Li, C. Hu, C. Zhang, D. Zhang, D. Lin, D. Yang, D. Wang, D. Ai, E. Zhu, F. Yi, F. Chen, G. Wen, H. Sun, H. Zhao, H. Hu, H. Zhang, H. Liu, H. Zhang, H. Peng, H. Tai, H. Zhang, H. Liu, H. Wang, H. Yan, H. Ge, H. Liu, H. Chu, J. Zhao, J. Wang, J. Zhao, J. Ren, J. Wang, J. Zhang, J. Gui, J. Zhao, J. Li, J. An, J. Li, J. Yuan, J. Du, J. Liu, J. Zhi, J. Duan, K. Zhou, K. Wei, K. Wang, K. Luo, L. Zhang, L. Sha, L. Xu, L. Wu, L. Ding, L. Chen, M. Li, N. Lin, P. Ta, Q. Zou, R. Song, R. Yang, S. Tu, S. Yang, S. Wu, S. Zhang, S. Li, S. Li, S. Fan, W. Qin, W. Tian, W. Zhang, W. Yu, W. Liang, X. Kuang, X. Cheng, X. Li, X. Yan, X. Hu, X. Ling, X. Fan, X. Xia, X. Zhang, X. Zhang, X. Pan, X. Zou, X. Zhang, Y. Liu, Y. Wu, Y. Li, Y. Wang, Y. Zhu, Y. Tan, Y. Zhou, Y. Pan, Y. Zhang, Y. Su, Y. Geng, Y. Yan, Y. Tan, Y. Bi, Y. Shen, Y. Yang, Y. Li, Y. Liu, Y. Wang, Y. Li, Y. Wu, Y. Zhang, Y. Duan, Y. Zhang, Z. Liu, Z. Jiang, Z. Yan, Z. Zhang, Z. Wei, Z. Chen, Z. Feng, Z. Yao, Z. Chai, Z. Wang, Z. Zhang, B. Xu, M. Huang, H. Wang, J. Li, Y. Dong, and J. Tang (2026) GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763. Cited by: §4.3, §4, §5.
- [54] D. Goldfarb, Y. Ren, and A. Bahamou (2020) Practical quasi-Newton methods for training deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.1.
- [55] S. Gong, S. Agarwal, Y. Zhang, J. Ye, L. Zheng, M. Li, C. An, P. Zhao, W. Bi, J. Han, H. Peng, and L. Kong (2025) Scaling diffusion language models via adaptation from autoregressive models. In International Conference on Learning Representations (ICLR), Cited by: §5.
- [56] W. Gong, J. Zazo, Q. Luo, P. Wang, J. Hensman, and C. Ma (2026) ARO: a new lens on matrix optimization for large models. arXiv preprint arXiv:2602.09006. Cited by: §A.3, §2.1.
- [57] A. Gonon, A. Muşat, and N. Boumal (2026) Insights on Muon from simple quadratics. arXiv preprint arXiv:2602.11948. Cited by: §2.1.
- [58] Google DeepMind (2026-04) Gemma 4 model card. Cited by: §4.3.
- [59] E. Grishina, M. Smirnov, and M. Rakhuba (2025) Accelerating Newton-Schulz iteration for orthogonalization via Chebyshev-type polynomials. arXiv preprint arXiv:2506.10935. Cited by: §5.
- [60] A. Gu and T. Dao (2024) Mamba: linear-time sequence modeling with selective state spaces. In Proceedings of the Conference on Language Modeling (COLM), Cited by: §5.
- [61] Y. Gu and Z. Xie (2026) Mano: restriking manifold optimization for LLM training. arXiv preprint arXiv:2601.23000. Cited by: §2.1.
- [62] V. Gupta, T. Koren, and Y. Singer (2018) Shampoo: preconditioned stochastic tensor optimization. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §A.1, §2.1, §3.8.1, §5.
- [63] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3.
- [64] N. J. Higham (1986) Computing the polar decomposition—with applications. SIAM Journal on Scientific and Statistical Computing 7(4), pp. 1160–1174. Cited by: §3.2.
- [65] N. J. Higham (1997) Stable iterations for the matrix square root. Numerical Algorithms 15(2), pp. 227–242. Cited by: Theorem E.1, §3.7.1.
- [66] N. J. Higham (2008) Functions of matrices: theory and computation. Society for Industrial and Applied Mathematics. Cited by: §2.2.
- [67] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre (2022) Training compute-optimal large language models. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1.
- [68] R. A. Horn and C. R. Johnson (1994) Topics in matrix analysis. Cambridge University Press. Cited by: Appendix B, §2.2.
- [69] R. A. Horn and C. R. Johnson (2012) Matrix analysis. 2nd edition, Cambridge University Press. Cited by: Appendix B.
- [70] Y. Hu, H. Song, J. Deng, J. Wang, J. Chen, K. Zhou, Y. Zhu, J. Jiang, Z. Dong, W. X. Zhao, and J. Wen (2025) YuLan-Mini: pushing the limits of open data-efficient language model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers), Cited by: §5.
- [71] F. Huang, Y. Luo, and S. Chen (2025) LiMuon: light and fast Muon optimizer for large models. arXiv preprint arXiv:2509.14562. Cited by: §2.1.
- [72] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed (2024) Mixtral of experts. arXiv preprint arXiv:2401.04088. Cited by: §4.3.
- [73] R. Jiang, Z. Mhammedi, M. Mohri, and A. Mokhtari (2026) Adaptive matrix online learning through smoothing with guarantees for nonsmooth nonconvex optimization. arXiv preprint arXiv:2602.08232. Cited by: §2.1.
- [74] Z. Jiang, J. Gu, H. Zhu, and D. Pan (2023) Pre-RMSNorm and Pre-CRMSNorm transformers: equivalent and efficient Pre-LN transformers. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §4.1.
- [75] K. Jordan, J. Bernstein, B. Rappazzo, @fernbear.bsky.social, B. Vlado, Y. Jiacheng, F. Cesista, B. Koszarsky, and @Grad62304977 (2024) modded-nanogpt: speedrunning the NanoGPT baseline. Cited by: §A.1, §1, §2.1.
- [76] K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cesista, L. Newhouse, and J. Bernstein (2024) Muon: an optimizer for hidden layers in neural networks. Cited by: §A.1, item 2, item 2, §2.1, §3.3, §3.8.1, §3.8, Remark 3.1, §4, §5.
- [77] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Cited by: §1.
- [78] P. Kasimbeg, F. Schneider, R. Eschenhagen, J. Bae, C. S. Sastry, M. Saroufim, B. Feng, L. Wright, E. Z. Yang, Z. Nado, S. Medapati, P. Hennig, M. Rabbat, and G. E. Dahl (2025) Accelerating neural network training: an analysis of the AlgoPerf competition. In International Conference on Learning Representations (ICLR), Cited by: §1.
- [79] G. Y. Kim and M. Oh (2026) Convergence of Muon with Newton-Schulz. In International Conference on Learning Representations (ICLR), Cited by: §2.1, §5.
- [80] Kimi Team (2025) Kimi K2: open agentic intelligence. arXiv preprint arXiv:2507.20534. Cited by: §5.
- [81] D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: Appendix C, §1.
- [82] R. Kondor (2025) The principles behind equivariant neural networks for physics and chemistry. Proceedings of the National Academy of Sciences 122(41), pp. e2415656122. Cited by: §2.3.
- [83] D. Kovalev and E. Borodich (2025) Non-Euclidean SGD for structured optimization: unified analysis and improved rates. arXiv preprint arXiv:2511.11466. Cited by: §2.1.
- [84] D. Kovalev (2025) Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization. arXiv preprint arXiv:2503.12645. Cited by: §A.1.
- [85] A. Kravatskiy, I. Kozyrev, N. Kozlov, A. Vinogradov, D. Merkulov, and I. Oseledets (2025) The Ky Fan norms and beyond: dual norms and combinations for matrix optimization. arXiv preprint arXiv:2512.09678. Cited by: §A.1, §2.1.
- [86] K. Kurdyka (1998) On gradients of functions definable in o-minimal structures. Annales de l’institut Fourier 48(3), pp. 769–783. Cited by: Remark F.2.
- [87] A. Lahoti, K. Li, B. Chen, C. Wang, A. Bick, J. Z. Kolter, T. Dao, and A. Gu (2026) Mamba-3: improved sequence modeling using state space principles. In International Conference on Learning Representations (ICLR), Cited by: §5.
- [88] T. Large, Y. Liu, M. Huh, H. Bahng, P. Isola, and J. Bernstein (2024) Scalable optimization in the modular norm. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §A.1, §A.2, §3.3.
- [89] T. T. Lau, Q. Long, and W. Su (2025) PolarGrad: a class of matrix-gradient optimizers from a unifying preconditioning perspective. arXiv preprint arXiv:2505.21799. Cited by: Appendix C, §F.2.1, §F.2.2, item 2, §2.1, §2.1, §3.3.1, §3.6, §3.8.1, §3.8, Remark 3.1, §5, §5.
- [90] T. T. Lau, J. Zeng, B. Wu, and Y. Yao (2018) A proximal block coordinate descent algorithm for deep neural network training. In International Conference on Learning Representations (ICLR), Workshop Track, Cited by: §5.
- [91] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen (2021) GShard: scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations (ICLR), Cited by: §5.
- [92] A. S. Lewis and M. L. Overton (1996) Eigenvalue optimization. Acta Numerica 5, pp. 149–190. Cited by: Appendix B, §2.2.
- [93] A. S. Lewis (1995) The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis 2(1), pp. 173–183. Cited by: §2.2, §3.8.
- [94] A. S. Lewis (1996) Group invariance and convex matrix analysis. SIAM Journal on Matrix Analysis and Applications 17(4), pp. 927–949. Cited by: §2.2.
- [95] A. S. Lewis (2003) The mathematics of eigenvalue optimization. Mathematical Programming 97, pp. 155–176. Cited by: §2.2.
- [96] H. Li, Y. Dong, and Z. Lin (2026) Convergence rate analysis of the AdamW-style Shampoo: unifying one-sided and two-sided preconditioning. arXiv preprint arXiv:2601.07326. Cited by: §3.8.1, Remark 3.4.
- [97] J. Li and M. Hong (2025) A note on the convergence of Muon and further. arXiv preprint arXiv:2502.02900. Cited by: §2.1.
- [98] X. Li (2017) Preconditioned stochastic gradient descent. IEEE Transactions on Neural Networks and Learning Systems 29(5), pp. 1454–1466. Cited by: §2.1.
- [99] Z. Li, Y. Zhang, and S. Arora (2021) Why are convolutional nets more sample-efficient than fully-connected nets?. In International Conference on Learning Representations (ICLR), Cited by: §2.3.
- [100] Z. Li, L. Liu, C. Liang, W. Chen, and T. Zhao (2025) NorMuon: making Muon more efficient and scalable. arXiv preprint arXiv:2510.05491. Cited by: §3.3.2.
- [101] J. Lim, S. Lee, D. Kim, T. Kim, E. Park, J. Lee, J. Lee, J. Lee, W. T. Cheung, D. Choi, J. Her, J. Huh, H. Jung, C. Kang, B. Kim, M. Kim, T. Kim, Y. Kim, H. Kweon, H. Lee, K. Lee, D. Oh, Y. Park, B. Ryu, and D. Weon (2025) Motif 2 12.7B technical report. arXiv preprint arXiv:2511.07464. Cited by: §5, §5.
- [102] L. Lim and B. J. Nelson (2023) What is an equivariant neural network?. Notices of the American Mathematical Society 70(4), pp. 619–625. Cited by: §2.3.
- [103] T. Lin (2025) Flash-Muon: an efficient implementation of Muon optimizer. Cited by: §5.
- [104] W. Lin, S. C. Lowe, F. Dangel, R. Eschenhagen, Z. Xu, and R. B. Grosse (2025) Understanding and improving Shampoo and SOAP via Kullback–Leibler minimization. arXiv preprint arXiv:2509.03378. Cited by: §2.1.
- [105] J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, Y. Chen, H. Zheng, Y. Liu, S. Liu, B. Yin, W. He, H. Zhu, Y. Wang, J. Wang, M. Dong, Z. Zhang, Y. Kang, H. Zhang, X. Xu, Y. Zhang, Y. Wu, X. Zhou, and Z. Yang (2025) Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982. Cited by: §5, §5.
- [106] Y. Liu, A. Yuan, and Q. Gu (2025) MARS-M: when variance reduction meets matrices. arXiv preprint arXiv:2510.21800. Cited by: §2.1.
- [107] Z. Liu, H. Wu, X. Fu, S. Liu, X. Han, T. Zhong, and M. Yuan (2025) REG: a regularization optimizer for robust training dynamics. arXiv preprint arXiv:2510.03691. Cited by: §3.3.2.
- [108] S. Łojasiewicz (1993) Sur la géométrie semi-et sous-analytique. Annales de l’institut Fourier 43(5), pp. 1575–1595. Cited by: Remark F.2.
- [109] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), Cited by: §1, §4.
- [110] A. Lou, C. Meng, and S. Ermon (2024) Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §5.
- [111] A. Lozhkov, L. Ben Allal, L. von Werra, and T. Wolf (2024) FineWeb-Edu: the finest collection of educational content. Cited by: §4.
- [112] J. Ma, Y. Huang, Y. Chi, and Y. Chen (2026) Preconditioning benefits of spectral orthogonalization in Muon. arXiv preprint arXiv:2601.13474. Cited by: §2.1.
- [113] J. Martens and R. Grosse (2015) Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §2.1.
- [114] H. B. McMahan and M. Streeter (2010) Adaptive bound optimization for online convex optimization. In Proceedings of the Conference on Learning Theory (COLT), Cited by: §1.
- [115] N. Muennighoff, L. Soldaini, D. Groeneveld, K. Lo, J. Morrison, S. Min, W. Shi, P. Walsh, O. Tafjord, N. Lambert, Y. Gu, S. Arora, A. Bhagia, D. Schwenk, D. Wadden, A. Wettig, B. Hui, T. Dettmers, D. Kiela, A. Farhadi, N. A. Smith, P. W. Koh, A. Singh, and H. Hajishirzi (2025) OLMoE: open mixture-of-experts language models. In International Conference on Learning Representations (ICLR), Cited by: Appendix G, §4.3.
- [116] Y. Nakatsukasa, Z. Bai, and F. Gygi (2010) Optimizing Halley’s iteration for computing the matrix polar decomposition. SIAM Journal on Matrix Analysis and Applications 31(5), pp. 2700–2720. Cited by: §3.3.1.
- [117] Y. Nakatsukasa and R. W. Freund (2016) Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: the power of Zolotarev’s functions. SIAM Review 58(3), pp. 461–493. Cited by: §3.3.1.
- [118] A. Y. Ng (2004) Feature selection, $L_{1}$ vs. $L_{2}$ regularization, and rotational invariance. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §2.3.
- [119] S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §5.
- [120] OpenAI, S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, B. Barak, A. Bennett, T. Bertao, N. Brett, E. Brevdo, G. Brockman, S. Bubeck, C. Chang, K. Chen, M. Chen, E. Cheung, A. Clark, D. Cook, M. Dukhan, C. Dvorak, K. Fives, V. Fomenko, T. Garipov, K. Georgiev, M. Glaese, T. Gogineni, A. Goucher, L. Gross, K. G. Guzman, J. Hallman, J. Hehir, J. Heidecke, A. Helyar, H. Hu, R. Huet, J. Huh, S. Jain, Z. Johnson, C. Koch, I. Kofman, D. Kundel, J. Kwon, V. Kyrylov, E. Y. Le, G. Leclerc, J. P. Lennon, S. Lessans, M. Lezcano-Casado, Y. Li, Z. Li, J. Lin, J. Liss, L. Liu, J. Liu, K. Lu, C. Lu, Z. Martinovic, L. McCallum, J. McGrath, S. McKinney, A. McLaughlin, S. Mei, S. Mostovoy, T. Mu, G. Myles, A. Neitz, A. Nichol, J. Pachocki, A. Paino, D. Palmie, A. Pantuliano, G. Parascandolo, J. Park, L. Pathak, C. Paz, L. Peran, D. Pimenov, M. Pokrass, E. Proehl, H. Qiu, G. Raila, F. Raso, H. Ren, K. Richardson, D. Robinson, B. Rotsted, H. Salman, S. Sanjeev, M. Schwarzer, D. Sculley, H. Sikchi, K. Simon, K. Singhal, Y. Song, D. Stuckey, Z. Sun, P. Tillet, S. Toizer, F. Tsimpourlas, N. Vyas, E. Wallace, X. Wang, M. Wang, O. Watkins, K. Weil, A. Wendling, K. Whinnery, C. Whitney, H. Wong, L. Yang, Y. Yang, M. Yasunaga, K. Ying, W. Zaremba, W. Zhan, C. Zhang, B. Zhang, E. Zhang, and S. Zhao (2025) gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. Cited by: Appendix G, §4.3, §4.4.
- [121] T. Pethick, K. Antonakopoulos, A. Silveti-Falls, L. C. Vankadara, and V. Cevher (2025) Training neural networks at any scale. arXiv preprint arXiv:2511.11163. Cited by: §2.1.
- [122] T. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher (2025) Training deep learning models with norm-constrained LMOs. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §A.1, §A.1, item 2, §2.1, §3.3, §3.8.1, §3.8, §5.
- [123] O. Pooladzandi and X. Li (2024) Curvature-informed SGD via general purpose Lie-group preconditioners. arXiv preprint arXiv:2402.04553. Cited by: §2.1.
- [124] Prime Intellect Team, M. Senghaas, F. Obeid, S. Jaghouar, W. Brown, J. M. Ong, D. Auras, M. Sirovatka, J. Straube, A. Baker, S. Müller, J. Mattern, M. Basra, A. Ismail, D. Scherm, C. Miller, A. Patel, S. Kirsten, M. Sieg, C. Reetz, K. Erdem, V. Weisser, and J. Hagemann (2025) INTELLECT-3: technical report. arXiv preprint arXiv:2512.16144. Cited by: §5, §5.
- [125] T. Putterman, D. Lim, Y. Gelberg, M. M. Bronstein, S. Jegelka, and H. Maron (2025) GL equivariant metanetworks for learning on low rank weight spaces. In Learning on Graphs Conference (LoG), Cited by: §2.3.
- [126] X. Qi, M. Chen, J. Ye, Y. He, and R. Xiao (2026) Delving into Muon and beyond: deep analysis and extensions. arXiv preprint arXiv:2602.04669. Cited by: §2.1, §3.8.1.
- [127] Qwen Team (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: Appendix G, §4.1.
- [128]Qwen Team(2026-02)Qwen3.5: towards native multimodal agents.External Links:LinkCited by:§4.3.
- [129]A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever(2021)Learning transferable visual models from natural language supervision.InProceedings of the International Conference on Machine Learning (ICML),Cited by:§5.
- [130]A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever(2018)Improving language understanding by generative pre-training.Cited by:§5.
- [131]A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever(2019)Language models are unsupervised multitask learners.OpenAI blog.Cited by:§5.
- [132]A. Riabinin, E. Shulgin, K. Gruntkowska, and P. Richtárik(2025)Gluon: making Muon & Scion great again! (bridging theory and practice of LMO-based optimizers for LLMs).arXiv preprint arXiv:2505.13416.Cited by:§A.1,§5.
- [133]S. Schubert, P. Neubert, J. Pöschmann, and P. Protzel(2019)Circular convolutional neural networks for panoramic images and laser data.InIEEE Intelligent Vehicles Symposium (IV),Cited by:§2.3.
- [134]A. Semenov, M. Pagliardini, and M. Jaggi(2025)Benchmarking optimizers for large language model pretraining.arXiv preprint arXiv:2509.01440.Cited by:§1.
- [135]N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean(2017)Outrageously large neural networks: the sparsely-gated mixture-of-experts layer.InInternational Conference on Learning Representations (ICLR),Cited by:§4.3,§5.
- [136]N. Shazeer and M. Stern(2018)Adafactor: adaptive learning rates with sublinear memory cost.InProceedings of the International Conference on Machine Learning (ICML),Cited by:§1.
- [137]N. Shazeer(2020)GLU variants improve transformer.arXiv preprint arXiv:2002.05202.Cited by:§3.4,§4.1.
- [138]W. Shen, R. Huang, M. Huang, C. Shen, and J. Zhang(2025)On the convergence analysis of Muon.arXiv preprint arXiv:2505.23737.Cited by:§2.1.
- [139]H. M. Shi, T. Lee, S. Iwasaki, J. Gallego-Posada, Z. Li, K. Rangadurai, D. Mudigere, and M. Rabbat(2023)A distributed data-parallel PyTorch implementation of the distributed Shampoo optimizer for training neural networks at-scale.arXiv preprint arXiv:2309.06497.Cited by:§2.1.
- [140]C. Si, D. Zhang, and W. Shen(2025)AdaMuon: adaptive Muon optimizer.arXiv preprint arXiv:2507.11005.Cited by:§2.1.
- [141]V. Singh, L. Krauss, S. Jaghouar, M. Sirovatka, C. Goddard, F. Obied, J. M. Ong, J. Straube, Fern, A. Harley, C. Stewart, C. Kealty, M. Panahi, S. Kirsten, A. Deshpande, A. Vij, A. Bresnu, P. Veldurthi, R. Ravishankar, H. Bishnoi, DatologyAI Team, Arcee AI Team, Prime Intellect Team, M. McQuade, J. Hagemann, and L. Atkins(2026)Arcee Trinity Large technical report.arXiv preprint arXiv:2602.17004.Cited by:§5.
- [142]StepFun Team, A. Huang, A. Li, A. Kong, B. Wang, B. Jiao, B. Dong, B. Wang, B. Chen, B. Li, B. Ma, C. Su, C. Miao, C. Wan, C. Lou, C. Hu, C. Xu, C. Yu, C. Feng, C. Yao, C. Han, D. Ma, D. Shi, D. Jiang, D. Ma, D. Sun, D. Qi, E. Liu, F. Zhang, F. Wan, G. Huang, G. Yan, G. Cao, G. Li, H. Cheng, H. Guo, H. Zhang, H. Nie, H. Jia, H. Lv, H. Zhou, H. Lv, H. Wang, H. Shum, H. Huang, H. Peng, H. Zhou, H. Wang, H. Chen, H. Zhu, H. Wu, H. Guo, J. Wang, J. Zhou, J. Sun, J. Wu, J. Zhang, J. Lv, J. Liu, J. Fu, J. Liu, J. Cheng, J. Luo, J. Yang, J. Zhou, J. Hou, J. Bai, J. Hu, J. Xie, J. Wu, J. Zhang, J. Zhou, J. Liu, J. Lin, K. M. Lo, K. Liang, K. Liu, K. Tan, K. Yan, K. Li, K. An, K. Lin, L. Yang, L. Lv, L. Zhao, L. Chen, L. Shi, L. Tan, L. Lin, L. Chen, L. Ma, M. Ren, M. Li, M. Li, M. Li, M. Zhang, M. Chen, M. Huang, N. Wang, P. Liu, Q. Han, Q. Zhao, Q. He, Q. Du, Q. Wu, Q. Sun, R. Yang, R. Miao, R. Han, R. Wan, R. Guo, S. Wang, S. Pang, S. Yang, S. Fan, S. Shang, S. Yang, S. Li, S. Tian, S. Liu, S. Wu, S. Chen, S. Yuan, T. Cao, T. Yue, T. Cheng, T. Li, T. Luo, W. You, W. Ji, W. Yuan, W. Zhang, W. Wu, W. Xie, W. Sun, W. Deng, W. Zheng, W. Xie, X. Wang, X. Kong, X. Liu, X. Zhang, X. Yang, X. Liu, X. Yuan, X. Jiao, X. Ren, X. Zhang, X. Li, X. Liu, X. Wu, X. Chen, X. Yang, X. Wang, X. Zhao, X. He, X. Feng, X. Cai, X. Zhou, Y. Yu, Y. Li, Y. Xu, Y. Lai, Y. Xu, Y. Wang, Y. Shen, Y. Zhu, Y. Lv, Y. Cao, Y. Gong, Y. Yang, Y. Yang, Y. Zhao, Y. Zhao, Y. Zhang, Y. Zhang, Y. Zhang, Y. Chen, Y. Zhao, Y. Long, Y. Wang, Y. Guan, Y. Zhou, Y. Peng, Y. Ding, Y. Fan, Y. Lu, Y. Yang, Y. Luo, Y. Zhao, Y. Peng, Y. Lin, Y. Lu, Y. Zhao, Y. Ju, Y. Zhang, Y. Li, Y. Yang, Y. Chen, Y. Cai, Z. Weng, Z. Hong, Z. Li, Z. Xie, Z. Ge, Z. Gong, Z. Zeng, Z. Lu, Z. Huang, Z. Chang, Z. Huang, Z. Hu, Z. Yang, Z. Wang, Z. Ren, Z. Zhang, and Z. Wang(2026)Step 3.5 Flash: open frontier-level intelligence with 11B active parameters.arXiv preprint arXiv:2602.10604.Cited by:§4.3,§5.
- [143] D. Su, A. Gu, J. Xu, Y. Tian, and J. Zhao (2025) GaLore 2: large-scale LLM pre-training by gradient low-rank projection. arXiv preprint arXiv:2504.20437. Cited by: §2.1.
- [144] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu (2024) RoFormer: enhanced transformer with rotary position embedding. Neurocomputing 568, pp. 127063. Cited by: §4.1.
- [145] W. Su (2025) Isotropic curvature model for understanding deep learning optimization: is gradient orthogonalization optimal? arXiv preprint arXiv:2511.00674. Cited by: §2.1.
- [146] D. Sun and J. Sun (2008) Löwner’s operator and spectral functions in Euclidean Jordan algebras. Mathematics of Operations Research 33(2), pp. 421–445. Cited by: §2.2.
- [147] T. Tieleman and G. Hinton (2012) Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. Note: Coursera: Neural Networks for Machine Learning. Cited by: §1.
- [148] M. Tuddenham, A. Prügel-Bennett, and J. Hare (2022) Orthogonalising gradients to speed up neural network optimisation. arXiv preprint arXiv:2202.07052. Cited by: Remark 3.1.
- [149] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §5.
- [150] A. Veprikov, A. Bolatov, S. Horváth, A. Beznosikov, M. Takáč, and S. Hanzely (2025) Preconditioned norms: a unified framework for steepest descent, quasi-Newton and adaptive methods. arXiv preprint arXiv:2510.10777. Cited by: §A.1, §2.1, Remark 3.5.
- [151] N. Vyas, D. Morwani, R. Zhao, I. Shapira, D. Brandfonbrener, L. Janson, and S. Kakade (2025) SOAP: improving and stabilizing Shampoo using Adam. In International Conference on Learning Representations (ICLR). Cited by: §2.1, §3.8.1, §5.
- [152] R. Washbourne, R. Iyer, T. Figliolia, H. Zheng, R. Lorig-Roach, S. Yang, P. Yuvraj, Q. Anthony, Y. Tokpanov, X. Yang, G. Nanduru, S. Ebert, P. Medepalli, S. Szot, S. Rajagopal, A. Ong, B. Mehta, and B. Millidge (2026) ZAYA1-8B technical report. arXiv preprint arXiv:2605.05365. Cited by: §5.
- [153] K. Wen, D. Hall, T. Ma, and P. Liang (2026) Fantastic pretraining optimizers and where to find them. In International Conference on Learning Representations (ICLR). Cited by: §1.
- [154] M. Wortsman, T. Dettmers, L. Zettlemoyer, A. S. Morcos, A. Farhadi, and L. Schmidt (2023) Stable and low-precision training for large-scale vision-language models. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §5.
- [155] S. Xie, T. Wang, S. Reddi, S. Kumar, and Z. Li (2025) Structured preconditioners in adaptive optimization: a unified analysis. In Proceedings of the International Conference on Machine Learning (ICML). Cited by: §3.8.1, Remark 3.4.
- [156] T. Xie, H. Luo, H. Tang, Y. Hu, J. K. Liu, Q. Ren, Y. Wang, W. X. Zhao, R. Yan, B. Su, C. Luo, and B. Guo (2026) Controlled LLM training on spectral sphere. arXiv preprint arXiv:2601.08393. Cited by: §2.1.
- [157] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu (2020) On layer normalization in the transformer architecture. In Proceedings of the International Conference on Machine Learning (ICML). Cited by: §4.1.
- [158] C. Xu, W. Yan, and Y. A. Zhang (2026) FISMO: Fisher-structured momentum-orthogonalized optimizer. arXiv preprint arXiv:2601.21750. Cited by: §2.1.
- [159] R. Xu, J. Li, and Y. Lu (2026) On the width scaling of neural optimizers under matrix operator norms I: row/column normalization and hyperparameter transfer. arXiv preprint arXiv:2603.09952. Cited by: §A.1, §2.1, Remark 3.5.
- [160] G. Yang, E. J. Hu, I. Babuschkin, S. Sidor, X. Liu, D. Farhi, N. Ryder, J. Pachocki, W. Chen, and J. Gao (2021) Tuning large neural networks via zero-shot hyperparameter transfer. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §1, §5.
- [161] K. Yang and L. Lai (2026) Manifold constrained steepest descent. arXiv preprint arXiv:2601.21487. Cited by: §2.1.
- [162] S. Yang, Z. Wang, O. Balabanov, N. B. Erichson, and M. W. Mahoney (2026) PRISM: distribution-free adaptive computation of matrix functions for accelerating neural network training. arXiv preprint arXiv:2601.22137. Cited by: §3.7.1, §5.
- [163] Y. You, I. Gitman, and B. Ginsburg (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. Cited by: §5.
- [164] Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C. Hsieh (2020) Large batch optimization for deep learning: training BERT in 76 minutes. In International Conference on Learning Representations (ICLR). Cited by: §5.
- [165] H. Yuan, Y. Liu, S. Wu, X. Zhou, and Q. Gu (2024) MARS: unleashing the power of variance reduction for training large models. arXiv preprint arXiv:2411.10438. Cited by: §2.1.
- [166] J. Zeng, T. T. Lau, S. Lin, and Y. Yao (2019) Global convergence of block coordinate descent in deep learning. In Proceedings of the International Conference on Machine Learning (ICML). Cited by: §5.
- [167] B. Zhang and R. Sennrich (2019) Root mean square layer normalization. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §4.1.
- [168] J. Zhang, N. Amsel, B. Chen, and T. Dao (2026) Gram Newton-Schulz. Cited by: Appendix G, §3.7.1, §5.
- [169] M. Zhang, Y. Liu, and H. Schaeffer (2025) AdaGrad meets Muon: adaptive stepsizes for orthogonal updates. arXiv preprint arXiv:2509.02981. Cited by: §2.1.
- [170] R. Zhang, Y. Zhao, Z. Liu, Z. Wang, and Z. Zhang (2026) Muon+: towards better Muon via one additional normalization step. arXiv preprint arXiv:2602.21545. Cited by: §3.3.2.
- [171] Y. Zhang, S. Xing, J. Huang, K. Lv, Y. Zhou, X. Qiu, Q. Guo, and K. Chen (2026) Mousse: rectifying the geometry of Muon with curvature-aware preconditioning. arXiv preprint arXiv:2603.09697. Cited by: §2.1.
- [172] B. Zhao, N. Dehmamy, R. Walters, and R. Yu (2022) Symmetry teleportation for accelerated optimization. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §2.3.
- [173] B. Zhao, R. M. Gower, R. Walters, and R. Yu (2024) Improving convergence and generalization using parameter symmetries. In International Conference on Learning Representations (ICLR). Cited by: §2.3.
- [174] B. Zhao, R. Walters, and R. Yu (2026) Symmetry in neural network parameter spaces. Transactions on Machine Learning Research. ISSN 2835-8856. Cited by: §2.3.
- [175] J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, and Y. Tian (2024) GaLore: memory-efficient LLM training by gradient low-rank projection. In Proceedings of the International Conference on Machine Learning (ICML). Cited by: §2.1.
- [176] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus (2022) ST-MoE: designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906. Cited by: §4.3.
Appendix
Contents
- 1 Introduction
- 2 Preliminaries and Related Work
  - 2.1 Matrix-Gradient Optimizers
  - 2.2 Matrix Optimization Problems, Löwner Operators, and Spectral Operators
  - 2.3 Symmetry and Equivariance in Deep Learning
- 3 Equivariant Optimizers from Layerwise Symmetry
  - 3.1 A General Symmetry-Induced Optimizer Geometry
  - 3.2 Bi-Orthogonal Equivariance for Ordinary Matrix Layers
  - 3.3 Optimizers for Embeddings and LM Heads via Left-Permutation Right-Orthogonal Equivariance
    - 3.3.1 Right-Spectral Optimizers
    - 3.3.2 Row-Norm and Hybrid LPRO-Equivariant Optimizers
  - 3.4 Optimizers for SwiGLU MLP Projections
  - 3.5 Optimizers for MoE Routers
  - 3.6 Symmetry-to-Optimizer Principle and Architecture–Optimizer Co-Design
  - 3.7 Practical Optimizers for Embeddings, LM Heads, SwiGLU MLP Projections, and MoE Routers
    - 3.7.1 One-Sided Spectral Optimizers
    - 3.7.2 Row-Norm-Based and Hybrid Variants
  - 3.8 Spectral Optimizers as the Bi-Orthogonally Equivariant Class
    - 3.8.1 Examples of Spectral and Equivariant Matrix Optimizers
- 4 Numerical Experiments
  - 4.1 Qwen3-0.6B-Style Pre-Training
  - 4.2 Gemma 3 1B-Style Pre-Training
  - 4.3 OLMoE-1B-7B-Style Pre-Training
  - 4.4 Downsized gpt-oss Pre-Training
  - 4.5 Cross-Model Comparison
- 5 Discussion and Outlook
  - Acknowledgments
- References
- A Further Related Work
  - A.1 Non-Euclidean Norm-Based Steepest Descent and LMO-Based Frameworks
  - A.2 Modular Norm Theory
  - A.3 Rotation-Based Optimizers
- B Supplemental Technical Background on Matrix Analysis
- C Geometric Misalignment of Coordinate-wise Adaptive Gradient Methods
- D Proofs of Main Text
- E Implementation Details of Practical Optimizers
  - E.1 Numerical Algorithm for Matrix Inverse Square Root via Polar Express
- F Convergence Analysis of Symmetry-Compatible Optimizers
  - F.1 General Update Scheme and Standing Assumptions
  - F.2 Full Spectral Optimizers
    - F.2.1 Specialization to Normalized Polar-Type Spectral Methods
    - F.2.2 Specialization to PolarGrad with Nuclear-Norm Scaling
  - F.3 One-Sided Spectral Optimizers
    - F.3.1 Right-Spectral Optimizers
    - F.3.2 Left-Spectral Optimizers
  - F.4 Row-Norm-Based Optimizers
    - F.4.1 Specialization to Smoothed Row Normalization
  - F.5 Nuclear-Norm-Scaled Right-Spectral/Row-Norm Hybrid Optimizers
  - F.6 Nuclear-Norm-Scaled Row-Norm/Right-Spectral Hybrid Optimizers
- G Experimental Details
  - G.1 Qwen3-0.6B-Style Pre-Training
  - G.2 Gemma 3 1B-Style Pre-Training
    - G.2.1 Gemma 3 1B-Style Pre-Training Learning Rate Sweep
    - G.2.2 Gemma 3 1B-Style Pre-Training Across Random Seeds
  - G.3 OLMoE-1B-7B-Style Pre-Training
  - G.4 Downsized gpt-oss Pre-Training
Organization.
In the appendix, we discuss further related work (Appendix A) and provide supplementary technical background on matrix analysis (Appendix B). We then illustrate the geometric misalignment of coordinate-wise adaptive gradient methods in Appendix C, give the proofs omitted from the main text in Appendix D, and describe the implementation of the practical optimizers in Appendix E. Finally, we present the convergence analysis in Appendix F and the details of the numerical experiments in Appendix G.
Appendix A Further Related Work
We further discuss connections between our framework and other existing paradigms for designing matrix-gradient optimizers in modern deep learning.
A.1 Non-Euclidean Norm-Based Steepest Descent and LMO-Based Frameworks
A prominent line of recent work interprets Muon and related methods through the lens of non-Euclidean steepest descent, trust-region methods, and linear minimization oracle (LMO) frameworks [76,14,11,12,122,84,132]. In these views, Muon can be understood as a normalized steepest descent method with respect to a non-Euclidean matrix norm, most notably the spectral norm, and this perspective has also been used to connect Muon to optimizers such as signSGD [13] and Shampoo [62]. Related work has further explored alternative norm choices and generalized preconditioned steepest descent constructions [28,150,85,159].
While these frameworks provide useful algorithmic interpretations of Muon and its variants, they leave open a fundamental question: what is the principled criterion for choosing the “correct” norm for a given layer? In practice, existing prescriptions are not always fully aligned with actual usage. For example, Large et al. [88] suggest using the maximum column $\ell_2$-norm for embedding layers, whereas in practice AdamW is often used for embeddings while Muon is applied to other matrix parameters in modded-nanogpt speedrunning [75]. More broadly, several works suggest that one may use various $\ell_p \to \ell_q$ operator norms for steepest descent or LMO for embedding and LM head layers (see, e.g., [122]). Our theoretical development suggests that such norm choices might generally not be geometrically principled for matrix-valued parameters.
In contrast, our symmetry-based analysis provides a concrete design principle. For matrix parameters whose geometry should respect orthogonal changes of basis, only unitarily invariant matrix norms are appropriate, since they are precisely the norms compatible with orthogonal invariance. By comparison, the $\ell_p \to \ell_q$ operator norm is generally not unitarily invariant, except in the special case $p = q = 2$, which reduces to the spectral norm. Accordingly, non-unitarily invariant norms typically induce an incorrect matrix geometry. This offers a possible explanation for the gap between certain norm-based theoretical prescriptions and empirical practice.
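As a small numerical illustration of this point (a minimal NumPy sketch; the matrices below are hand-picked for illustration and do not come from the paper), the $\ell_1 \to \ell_1$ operator norm, i.e., the maximum absolute column sum, changes under an orthogonal change of basis, whereas the spectral norm does not:

```python
import numpy as np

# W has a single nonzero entry; U rotates the row space by 45 degrees.
W = np.array([[1.0, 0.0],
              [0.0, 0.0]])
c = np.sqrt(0.5)
U = np.array([[c, -c],
              [c,  c]])
W_rot = U @ W  # orthogonal change of basis on the left (identity on the right)

# l1 -> l1 operator norm: maximum absolute column sum (ord=1 in NumPy).
l1_before = np.linalg.norm(W, 1)
l1_after = np.linalg.norm(W_rot, 1)

# l2 -> l2 operator norm: spectral norm, i.e., largest singular value (ord=2).
spec_before = np.linalg.norm(W, 2)
spec_after = np.linalg.norm(W_rot, 2)

print(round(l1_before, 4), round(l1_after, 4))      # 1.0 1.4142
print(round(spec_before, 4), round(spec_after, 4))  # 1.0 1.0
```

The rotated matrix has first column $(\sqrt{1/2}, \sqrt{1/2})^\top$, so its maximum absolute column sum grows from $1$ to $\sqrt{2}$ even though its singular values are unchanged.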
A.2 Modular Norm Theory
Our right-spectral viewpoint is closely related in spirit to modular norm theory [12,88]: both seek architecture-aware optimizer geometries derived from structural properties of the module rather than from an ambient coordinate system. The two approaches are nevertheless conceptually distinct.
Modular norm theory derives updates from steepest descent under module-adapted operator norms, and is primarily motivated by scale transfer across width and depth through recursively constructed global norms. By contrast, right-spectral optimizers are derived directly from left-permutation and right-orthogonal equivariance, leading to update rules of the form (3), namely spectral transformations determined by the right Gram matrix $D^{\top}D$. In this sense, right-spectral optimizers form a symmetry-derived subclass of LPRO-equivariant updates, whereas modular-norm-based methods may also include coordinate-dependent row-wise or column-wise transformations that are compatible with only part of the underlying symmetry.
From this perspective, the two frameworks may be viewed as complementary. Modular norm theory aims to provide the correct scale invariance, namely how learning rates and normalized updates should behave as architecture width and depth vary. Our spectral framework instead aims to provide the correct directional geometry, namely how update directions should respect the invariance structure of different layer types. Their combination suggests the possibility of a fully geometry-aware optimizer that uses modular-norm-based global scaling together with layerwise spectral, right-spectral, or left-spectral update directions dictated by symmetry.
A.3 Rotation-Based Optimizers
The recent work [56] is perhaps the closest in spirit to our approach, but there are several important differences. First, we do not assume that the layerwise loss function itself is rotationally, or more generally bi-orthogonally, invariant. Rather, we derive optimizer classes from the symmetry structure of the parameter geometry and the corresponding equivariance properties of update rules. This distinction matters, since exact rotational invariance of the layerwise loss may be difficult to verify and can fail in practice.
Second, we do not advocate a single update rule for all matrix-valued parameters. Instead, our framework emphasizes that different layer types admit different symmetry groups and therefore should, in principle, use different optimizer geometries. In particular, embedding and LM head matrices possess different equivariance properties from weight matrices in linear and attention layers, which leads naturally to an architecture–optimizer co-design perspective.
Third, the update rules in ARO for matrix parameters are based only on left multiplication by a rotation matrix. By contrast, our framework includes full spectral, right-spectral, left-spectral, row-norm-based, and hybrid classes, depending on the relevant symmetry structure of the layer. In this sense, our theory provides a broader symmetry-based taxonomy of matrix-gradient optimizers.
Appendix B Supplemental Technical Background on Matrix Analysis
We include several technical definitions arising in matrix analysis, which is closely related to the derivation of spectral optimizers via steepest descent. Details of these materials can be found in [92,69,68,16].
Let us recall the notion of unitarily invariant norms, which provide the norm-level analogue of left-right orthogonal symmetry. Since we work over the field of real numbers $\mathbb{R}$, “unitarily invariant” here is equivalent to invariance under left and right orthogonal transformations.
Definition B.1 (Unitarily invariant norms).
A norm $\lVert\cdot\rVert \colon \mathbb{R}^{m\times n} \to \mathbb{R}_{+}$ is said to be unitarily invariant if $\lVert A \rVert = \lVert UAV \rVert$ for all $A \in \mathbb{R}^{m\times n}$ and all orthogonal matrices $U \in \mathbb{O}^{m}$ and $V \in \mathbb{O}^{n}$. A unitarily invariant matrix norm is a unitarily invariant norm on $\mathbb{R}^{m\times n}$ that is also submultiplicative.
Unitarily invariant norms can be completely characterized through symmetric gauge functions acting on singular values.
Definition B.2 (Symmetric gauge functions).
A function $\psi \colon \mathbb{R}^{r} \to \mathbb{R}_{+}$ is a symmetric gauge function if: (i) $\psi$ is a norm; (ii) $\psi(|x|) = \psi(x)$ for all $x \in \mathbb{R}^{r}$, where $|\cdot|$ is understood coordinatewise; and (iii) $\psi(Px) = \psi(x)$ for all permutation matrices $P \in \mathbb{P}^{r}$ and all $x \in \mathbb{R}^{r}$.
Proposition B.1.
A norm $\lVert\cdot\rVert \colon \mathbb{R}^{m\times n} \to \mathbb{R}_{+}$ is unitarily invariant if and only if there exists a symmetric gauge function $\psi$ such that $\lVert W \rVert = \psi(\sigma(W)) = (\psi \circ \sigma)(W)$, where $\sigma(W)$ is the vector of singular values of $W$ arranged in descending order.
Example B.1.
Important examples of unitarily invariant matrix norms include the Schatten $p$-norms $\lvert\!\lvert\!\lvert \cdot \rvert\!\rvert\!\rvert_{p}$ for $p \in [1, \infty]$, including the nuclear norm $\lvert\!\lvert\!\lvert \cdot \rvert\!\rvert\!\rvert_{\mathrm{nuc}}$, the Frobenius norm $\lvert\!\lvert\!\lvert \cdot \rvert\!\rvert\!\rvert_{\mathrm{F}}$, and the spectral norm $\lvert\!\lvert\!\lvert \cdot \rvert\!\rvert\!\rvert_{\mathrm{S}}$. Another notable example is the Ky Fan $k$-norm.
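The characterization in Proposition B.1 can be checked numerically. The following is a minimal NumPy sketch in which the helper names `schatten_norm` and `ky_fan_norm` are illustrative choices, not functions from the paper's code. Each norm is evaluated as a symmetric gauge function of the singular values and is therefore unchanged under left and right orthogonal transformations:

```python
import numpy as np

def schatten_norm(W, p):
    # Schatten p-norm: the l_p norm of the vector of singular values.
    s = np.linalg.svd(W, compute_uv=False)
    return np.max(s) if np.isinf(p) else np.sum(s ** p) ** (1.0 / p)

def ky_fan_norm(W, k):
    # Ky Fan k-norm: sum of the k largest singular values
    # (np.linalg.svd returns singular values in descending order).
    s = np.linalg.svd(W, compute_uv=False)
    return np.sum(s[:k])

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random orthogonal 5x5
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal 3x3
W_rot = U @ W @ V

# Special cases agree with NumPy's built-in nuclear, Frobenius, and spectral norms.
assert np.isclose(schatten_norm(W, 1), np.linalg.norm(W, "nuc"))
assert np.isclose(schatten_norm(W, 2), np.linalg.norm(W, "fro"))
assert np.isclose(schatten_norm(W, np.inf), np.linalg.norm(W, 2))

# Unitary invariance: the norms depend on W only through its singular values.
for p in (1, 2, 3, np.inf):
    assert np.isclose(schatten_norm(W, p), schatten_norm(W_rot, p))
assert np.isclose(ky_fan_norm(W, 2), ky_fan_norm(W_rot, 2))
```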
Appendix C Geometric Misalignment of Coordinate-wise Adaptive Gradient Methods
If an optimizer for matrix parameters in linear layers does not respect the intrinsic geometry of the parameter space through bi-orthogonal equivariance, several issues arise. In particular, for coordinate-wise adaptive gradient methods, the optimizer iterates depend on arbitrary coordinate choices: rotating the input space can change the optimizer itself, leading to different training dynamics under equivalent reparameterizations.
For a vector parameter $w \in \mathbb{R}^{n}$, the general form of a coordinate-wise adaptive gradient method is
$$(\forall k \in \mathbb{N}) \qquad w_{k+1} = w_{k} - \gamma_{k}\,\mathscr{U}(d_{k}),$$
where $d_{k}$ is an update direction and $\mathscr{U} \colon \mathbb{R}^{n} \to \mathbb{R}^{n}$ is a vector-valued map. A simple but important observation is that optimizers built only from linear operations, such as additions, subtractions, scalar multiplications, and their linear combinations, behave analogously for vector and matrix parameters. By contrast, coordinate-wise adaptive methods such as Adam rely on elementwise nonlinear operations, including division, squaring, and square roots. These operations distort the geometry of the original update direction, whether it is formed from the gradient itself or from a momentum term.
While spectral methods (e.g., spectral descent and Muon; formally introduced in Section 3.8) respect bi-orthogonal equivariance, coordinate-wise methods are typically equivariant only under the much smaller signed permutation group. As a result, they fail to respect the intrinsic matrix geometry of the optimization problem. This mismatch is especially problematic in deep learning, where the optimization landscape often exhibits an intrinsically low-dimensional structure. Spectral optimizers are designed to adapt to this geometry, whereas coordinate-wise methods largely ignore it. Moreover, as model size increases, the advantage of such geometry-aware methods typically becomes more pronounced.
Sign descent.
For a matrix gradient $G$, sign descent or signSGD [13] uses the coordinate-wise signs $\mathrm{sgn}(G)$ of the gradient as the update direction, which satisfies $\langle\!\langle G, \mathrm{sgn}(G) \rangle\!\rangle_{\mathrm{F}} = \lVert G \rVert_{1}$. Thus, sign descent is governed by the duality associated with the coordinate-wise $\ell_{\infty}$-norm, rather than by any unitarily invariant matrix geometry (see Definition B.1).
Adam.
More generally, Adam [81] can be viewed as a smoothed version of sign descent, since its update is dominated by coordinate-wise normalization. In particular, Adam applies the coordinate-wise scaling (omitting bias corrections for simplicity)
$$W_{ij} \leftarrow W_{ij} - \gamma \cdot \frac{M_{ij}}{\sqrt{V_{ij}} + \varepsilon},$$
where $M_{ij}$ and $V_{ij}$ denote the first- and second-moment statistics at coordinate $(i,j)$, $\gamma > 0$ is the learning rate, and $\varepsilon > 0$ is a small constant. This update does not transform equivariantly under bi-orthogonal reparameterizations $W \mapsto \widetilde{W} \coloneqq PWQ^{\top}$ for $P \in \mathbb{O}^{m}$ and $Q \in \mathbb{O}^{n}$. Consequently, the update direction of Adam depends on how the entries of $W$ are indexed, rather than only on the intrinsic geometry of $W$ as a matrix.
Furthermore, coordinate-wise methods tend to inject high-rank coordinate noise even when the underlying gradient $G$ is low rank. In contrast, spectral methods act directly on the matrix structure of $G$, and therefore remain aligned with the low-dimensional geometry that frequently governs deep network optimization.
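The lack of bi-orthogonal equivariance can be seen in a small experiment. The sketch below (NumPy; `polar` and `random_orthogonal` are illustrative helpers, and the sign map stands in for the coordinate-wise normalization that dominates Adam) compares how the sign direction and the orthogonal polar factor transform under $G \mapsto PGQ^{\top}$:

```python
import numpy as np

def polar(G):
    # Orthogonal polar factor U V^T from the reduced SVD G = U diag(s) V^T.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def random_orthogonal(n, rng):
    # The Q factor of a Gaussian matrix is orthogonal.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
P, Q = random_orthogonal(4, rng), random_orthogonal(3, rng)
G_t = P @ G @ Q.T  # gradient in the rotated coordinates

# Equivariance gap: || U(P G Q^T) - P U(G) Q^T ||_F for each update map U.
polar_gap = np.linalg.norm(polar(G_t) - P @ polar(G) @ Q.T)
sign_gap = np.linalg.norm(np.sign(G_t) - P @ np.sign(G) @ Q.T)

print(polar_gap)  # close to machine precision
print(sign_gap)   # clearly nonzero for generic rotations
```

For generic rotations the sign direction computed in the rotated coordinates is not the rotation of the original sign direction, whereas the polar factor commutes with the change of basis.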
Sign descent as a special case of spectral descent.
Following similar arguments in [89], we further interpret sign descent as spectral descent applied to the diagonal matrization of the vectorized matrix parameter. Define the diagonal matrization of the vectorization of $G$ by $\widetilde{G} \coloneqq \operatorname{Diag}(\mathrm{vec}(G)) \in \mathbb{R}^{mn\times mn}$. The following lemma makes this connection precise.
Lemma C.1.
Let $G \coloneqq \nabla_{W} f(W) \in \mathbb{R}^{m\times n}$ contain no exact zero entries, and define $\widetilde{G} \coloneqq \operatorname{Diag}(\mathrm{vec}(G)) \in \mathbb{R}^{mn\times mn}$. Then the orthogonal polar factor of $\widetilde{G}$ is $\mathrm{polar}(\widetilde{G}) = \operatorname{Diag}(\mathrm{sgn}(\mathrm{vec}(G)))$. Consequently, $\mathrm{sgn}(G) = \mathrm{reshape}(\mathrm{diag}(\mathrm{polar}(\widetilde{G})), m, n)$.
Proof.
Since $\widetilde{G} = \operatorname{Diag}(\mathrm{vec}(G))$ is diagonal, its singular values are given by the absolute values of its diagonal entries. Hence we may write its singular value decomposition as
$$\widetilde{G} = I_{mn}\,\operatorname{Diag}(|\mathrm{vec}(G)|)\,\operatorname{Diag}(\mathrm{sgn}(\mathrm{vec}(G))),$$
or equivalently,
$$\widetilde{G} = \operatorname{Diag}(\mathrm{sgn}(\mathrm{vec}(G)))\,\operatorname{Diag}(|\mathrm{vec}(G)|).$$
Because $G$ has no zero entries, $\operatorname{Diag}(\mathrm{sgn}(\mathrm{vec}(G)))$ is orthogonal and $\operatorname{Diag}(|\mathrm{vec}(G)|)$ is symmetric positive definite, so the second expression is the polar decomposition of $\widetilde{G}$. Therefore, by the definition of the orthogonal polar factor, we have $\mathrm{polar}(\widetilde{G}) = \operatorname{Diag}(\mathrm{sgn}(\mathrm{vec}(G)))$. Taking the diagonal of $\mathrm{polar}(\widetilde{G})$ and reshaping it back to an $m \times n$ matrix yields
$$\mathrm{reshape}(\mathrm{diag}(\mathrm{polar}(\widetilde{G})), m, n) = \mathrm{reshape}(\mathrm{sgn}(\mathrm{vec}(G)), m, n) = \mathrm{sgn}(G).$$ ∎
In other words, coordinate-wise sign descent can be viewed as spectral gradient descent applied to the highly degenerate diagonal lifting $\widetilde{G} = \operatorname{Diag}(\mathrm{vec}(G)) \in \mathbb{R}^{mn\times mn}$. Thus, sign-based methods inherit only the trivial geometry of this pathological diagonal representation, rather than the intrinsic matrix geometry of $G$ itself. This highlights a fundamental geometric mismatch in coordinate-wise updates, which is further worsened when the parameter dimensions $m$ and $n$, and hence model sizes, grow.
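Lemma C.1 is easy to verify numerically. The following minimal NumPy sketch (the helper `polar` is illustrative) lifts a small random gradient to its diagonal matrization, computes the orthogonal polar factor via an SVD, and recovers the coordinate-wise sign. The sketch uses NumPy's row-major flattening for $\mathrm{vec}$, which is an implementation assumption; any ordering works as long as the reshape inverts it.

```python
import numpy as np

def polar(A):
    # Orthogonal polar factor U V^T computed from the SVD A = U diag(s) V^T.
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt

rng = np.random.default_rng(0)
m, n = 3, 2
G = rng.standard_normal((m, n))   # generic Gaussian draw: no exact zero entries

G_lift = np.diag(G.reshape(-1))   # Diag(vec(G)), an (mn x mn) diagonal matrix
polar_lift = polar(G_lift)        # expected: Diag(sgn(vec(G)))
recovered = np.diag(polar_lift).reshape(m, n)

assert np.allclose(polar_lift, np.diag(np.sign(G.reshape(-1))))
assert np.allclose(recovered, np.sign(G))
```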
Appendix D Proofs of Main Text
Proof of Proposition 3.1.
We also state the desired results for Polyak and Nesterov momentum.
For EMA momentum, we define $M_{k} = \beta M_{k-1} + (1-\beta) G_{k}$ with $M_{-1} = 0$, and, for the transformed sequence, $\widetilde{M}_{k} = \beta \widetilde{M}_{k-1} + (1-\beta) \widetilde{G}_{k}$ with $\widetilde{M}_{-1} = 0$. Then we have $\widetilde{M}_{k} = P M_{k} Q^{\top}$ and $\mathrm{polar}(\widetilde{M}_{k}) = P\,\mathrm{polar}(M_{k})\,Q^{\top}$.
For Polyak heavy-ball momentum, we define $M_{k} = \beta M_{k-1} + G_{k}$ with $M_{-1} = 0$, and $\widetilde{M}_{k} = \beta \widetilde{M}_{k-1} + \widetilde{G}_{k}$ with $\widetilde{M}_{-1} = 0$. Then we have $\widetilde{M}_{k} = P M_{k} Q^{\top}$ and $\mathrm{polar}(\widetilde{M}_{k}) = P\,\mathrm{polar}(M_{k})\,Q^{\top}$.
We first prove the stated transformation for the momentum sequences. For the EMA recursion, assume inductively that $\widetilde{M}_{k-1} = P M_{k-1} Q^{\top}$; the base case holds trivially since $\widetilde{M}_{-1} = M_{-1} = 0$. Using the bi-orthogonal equivariance of the gradient, we have
$$\widetilde{G}_{k} = \nabla_{W} f(\widetilde{W}_{k}) = \nabla_{W} f(P W_{k} Q^{\top}) = P\,\nabla_{W} f(W_{k})\,Q^{\top} = P G_{k} Q^{\top}.$$
Hence
$$\begin{aligned}\widetilde{M}_{k} &= \beta \widetilde{M}_{k-1} + (1-\beta)\widetilde{G}_{k} = \beta P M_{k-1} Q^{\top} + (1-\beta) P G_{k} Q^{\top} \\ &= P\left(\beta M_{k-1} + (1-\beta) G_{k}\right) Q^{\top} = P M_{k} Q^{\top}.\end{aligned}$$
Thus the EMA momentum is bi-orthogonally equivariant. The Polyak case is identical, replacing $(1-\beta)$ by $1$, which gives $\widetilde{M}_{k} = P M_{k} Q^{\top}$.
For Nesterov momentum, the momentum recursion is the same as in the Polyak case, so $\widetilde{M}_{k} = P M_{k} Q^{\top}$. Therefore,
$$\widetilde{N}_{k} = \widetilde{G}_{k} + \beta \widetilde{M}_{k} = P G_{k} Q^{\top} + \beta P M_{k} Q^{\top} = P (G_{k} + \beta M_{k}) Q^{\top} = P N_{k} Q^{\top}.$$
All claims about orthogonal polar factors follow from (1). For example,
$$\mathrm{polar}(\widetilde{N}_{k}) = \mathrm{polar}(P N_{k} Q^{\top}) = P\,\mathrm{polar}(N_{k})\,Q^{\top}.$$
The proofs for the momentum polar factors are identical. ∎
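The induction above can be mirrored in a short numerical check. The sketch below is a minimal NumPy illustration under the proof's assumption that gradients transform as $\widetilde{G}_k = P G_k Q^{\top}$; random matrices stand in for the gradients, and the helper names are illustrative. It runs the EMA and heavy-ball recursions in both coordinate systems and compares the momenta and their polar factors:

```python
import numpy as np

def polar(A):
    # Orthogonal polar factor U V^T from the reduced SVD.
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def random_orthogonal(n, rng):
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
m, n, beta = 4, 3, 0.9
P, Q = random_orthogonal(m, rng), random_orthogonal(n, rng)
grads = [rng.standard_normal((m, n)) for _ in range(5)]  # stand-ins for G_k

M_ema, M_ema_t = np.zeros((m, n)), np.zeros((m, n))  # EMA momentum, both frames
M_hb, M_hb_t = np.zeros((m, n)), np.zeros((m, n))    # heavy-ball momentum, both frames
for G in grads:
    G_t = P @ G @ Q.T                      # assumed gradient equivariance
    M_ema = beta * M_ema + (1 - beta) * G
    M_ema_t = beta * M_ema_t + (1 - beta) * G_t
    M_hb = beta * M_hb + G
    M_hb_t = beta * M_hb_t + G_t
    N, N_t = G + beta * M_hb, G_t + beta * M_hb_t    # Nesterov-type direction

assert np.allclose(M_ema_t, P @ M_ema @ Q.T)
assert np.allclose(M_hb_t, P @ M_hb @ Q.T)
assert np.allclose(N_t, P @ N @ Q.T)
assert np.allclose(polar(M_ema_t), P @ polar(M_ema) @ Q.T)
assert np.allclose(polar(N_t), P @ polar(N) @ Q.T)
```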
Proof of Theorem 3.2.
Let $P\in\mathbb{P}^{v}$, $R\in\mathbb{O}^{d}$, and $D\in\mathbb{R}^{v\times d}$. Since $P$ is a permutation matrix, we have $P^{\top}P=I_{v}$. Therefore,
$$(PDR^{\top})^{\top}(PDR^{\top})=RD^{\top}P^{\top}PDR^{\top}=RD^{\top}DR^{\top}.$$
Using the orthogonal equivariance of $\Phi$, it follows that
$$\Phi((PDR^{\top})^{\top}(PDR^{\top}))=\Phi(RD^{\top}DR^{\top})=R\,\Phi(D^{\top}D)R^{\top}.$$
Hence we obtain
$$\mathscr{U}_{\mathsf{R}}(PDR^{\top})=PDR^{\top}\Phi((PDR^{\top})^{\top}(PDR^{\top}))=PDR^{\top}(R\,\Phi(D^{\top}D)R^{\top})=PD\,\Phi(D^{\top}D)R^{\top}=P\,\mathscr{U}_{\mathsf{R}}(D)R^{\top}.$$
∎
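As an illustration of the argument above, the sketch below evaluates $\mathscr{U}_{\mathsf{R}}(D)=D\,\Phi(D^{\top}D)$ for one concrete orthogonally equivariant choice, $\Phi(S)=(S+\epsilon I)^{-1/2}$. This choice of $\Phi$, the damping `eps`, and the function names are assumptions made for the example; the proof only uses the equivariance of $\Phi$.

```python
import numpy as np

def phi_inv_sqrt(S, eps=1e-6):
    """Assumed Phi: S -> (S + eps*I)^(-1/2), computed by eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w + eps)) @ V.T

def right_spectral_update(D, phi=phi_inv_sqrt):
    """U_R(D) = D Phi(D^T D)."""
    return D @ phi(D.T @ D)

rng = np.random.default_rng(1)
v, d = 8, 4
D = rng.standard_normal((v, d))
P = np.eye(v)[rng.permutation(v)]                   # permutation matrix
R, _ = np.linalg.qr(rng.standard_normal((d, d)))    # orthogonal matrix

lhs = right_spectral_update(P @ D @ R.T)
rhs = P @ right_spectral_update(D) @ R.T
equivariance_gap = np.abs(lhs - rhs).max()          # round-off level
```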
Proof of Proposition 3.3.
For any $D\in\mathbb{R}^{v\times d}$, permutation matrix $P\in\mathbb{P}^{v}$, and orthogonal matrix $R\in\mathbb{O}^{d}$,
$$(\mathscr{U}_{2}\circ\mathscr{U}_{1})(PDR^{\top})=\mathscr{U}_{2}(\mathscr{U}_{1}(PDR^{\top}))=\mathscr{U}_{2}(P\,\mathscr{U}_{1}(D)R^{\top})=P\,\mathscr{U}_{2}(\mathscr{U}_{1}(D))R^{\top}.$$
Hence $\mathscr{U}_{2}\circ\mathscr{U}_{1}$ is left-permutation and right-orthogonal equivariant. ∎
Proof of Proposition 3.4.
Since $\sigma$ is applied coordinatewise and $P$ is a permutation matrix, $\sigma(PW_{\mathrm{gate}}x)=P\sigma(W_{\mathrm{gate}}x)$. Moreover, $(Pu)\odot(Pv)=P(u\odot v)$ for all $u,v\in\mathbb{R}^{d_{\mathrm{ff}}}$. Hence
$$\begin{aligned}
\mathrm{SwiGLU}(x;\widetilde{W}_{\mathrm{gate}},\widetilde{W}_{\mathrm{up}},\widetilde{W}_{\mathrm{down}})
&=W_{\mathrm{down}}P^{\top}\left(\sigma(PW_{\mathrm{gate}}x)\odot(PW_{\mathrm{up}}x)\right)\\
&=W_{\mathrm{down}}P^{\top}P\left(\sigma(W_{\mathrm{gate}}x)\odot(W_{\mathrm{up}}x)\right)\\
&=W_{\mathrm{down}}\left(\sigma(W_{\mathrm{gate}}x)\odot(W_{\mathrm{up}}x)\right).
\end{aligned}$$
∎
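The invariance can be observed directly on a small example. This is a sketch under the assumption that $\sigma$ is the SiLU activation and that the transformed weights are $\widetilde{W}_{\mathrm{gate}}=PW_{\mathrm{gate}}$, $\widetilde{W}_{\mathrm{up}}=PW_{\mathrm{up}}$, $\widetilde{W}_{\mathrm{down}}=W_{\mathrm{down}}P^{\top}$, as used in the display above; the dimensions are arbitrary.

```python
import numpy as np

def silu(z):
    """SiLU activation z * sigmoid(z), applied coordinatewise."""
    return z / (1.0 + np.exp(-z))

def swiglu(x, W_gate, W_up, W_down):
    """SwiGLU MLP: W_down (sigma(W_gate x) * (W_up x))."""
    return W_down @ (silu(W_gate @ x) * (W_up @ x))

rng = np.random.default_rng(2)
d_model, d_ff = 6, 10
W_gate = rng.standard_normal((d_ff, d_model))
W_up = rng.standard_normal((d_ff, d_model))
W_down = rng.standard_normal((d_model, d_ff))
x = rng.standard_normal(d_model)

P = np.eye(d_ff)[rng.permutation(d_ff)]   # permutation of intermediate neurons

out = swiglu(x, W_gate, W_up, W_down)
out_perm = swiglu(x, P @ W_gate, P @ W_up, W_down @ P.T)
invariance_gap = np.abs(out - out_perm).max()   # round-off level
```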
Proof of Proposition 3.5.
Let $\widetilde{D}=PD+\bm{1}_{e}a^{\top}$. Since $P\bm{1}_{e}=\bm{1}_{e}$, we have $\Pi_{\perp}P=P\Pi_{\perp}$ and $\Pi_{\perp}\bm{1}_{e}a^{\top}=0$. Hence
$$\widetilde{D}_{c}=\Pi_{\perp}\widetilde{D}=P\Pi_{\perp}D=PD_{c}.$$
Therefore $\widetilde{D}_{c}\widetilde{D}_{c}^{\top}=PD_{c}D_{c}^{\top}P^{\top}$. For the left-spectral update, permutation equivariance of $\Psi$ gives
$$\mathscr{U}_{\mathsf{L}}(\widetilde{D})=\Psi(PD_{c}D_{c}^{\top}P^{\top})PD_{c}=P\,\Psi(D_{c}D_{c}^{\top})D_{c}=P\,\mathscr{U}_{\mathsf{L}}(D).$$
For the row-norm update, left multiplication by $P$ merely permutes the rows of $D_{c}$ and hence permutes their norms. Therefore the diagonal row-scaling matrix transforms by conjugation with $P$, giving
$$\mathscr{U}_{\mathsf{row}}^{\mathsf{router}}(\widetilde{D})=P\,\mathscr{U}_{\mathsf{row}}^{\mathsf{router}}(D).$$
∎
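The two facts used in this proof, that centering removes the shared shift and commutes with expert permutations, can be checked on a toy router matrix. The sketch below assumes $\Pi_{\perp}=I_{e}-\tfrac{1}{e}\bm{1}_{e}\bm{1}_{e}^{\top}$ and takes the row-norm update to be row normalization of $D_{c}$; both are assumptions for illustration, since the exact definitions appear earlier in the paper, and the small `eps` guard is hypothetical.

```python
import numpy as np

def center_rows(D):
    """D_c = Pi_perp D: subtract, in each column, the mean over the e experts."""
    return D - D.mean(axis=0, keepdims=True)

def router_row_norm_update(D, eps=1e-12):
    """Assumed row-norm update: normalize every row of the centered matrix."""
    Dc = center_rows(D)
    norms = np.linalg.norm(Dc, axis=1, keepdims=True)
    return Dc / (norms + eps)

rng = np.random.default_rng(6)
e, d = 8, 5                                  # experts x hidden dimension
D = rng.standard_normal((e, d))
P = np.eye(e)[rng.permutation(e)]            # expert permutation
a = rng.standard_normal(d)                   # shared logit shift
D_tilde = P @ D + np.outer(np.ones(e), a)    # P D + 1_e a^T

centering_gap = np.abs(center_rows(D_tilde) - P @ center_rows(D)).max()
update_gap = np.abs(router_row_norm_update(D_tilde)
                    - P @ router_row_norm_update(D)).max()
```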
Proof of Theorem 3.7.
Suppose first that $\mathscr{U}$ is of the stated spectral form. Let $P\in\mathbb{O}^{m}$ and $Q\in\mathbb{O}^{n}$ be orthogonal. If $D=U\operatorname*{Diag}(\sigma(D))V^{\top}$, then $PDQ^{\top}=(PU)\operatorname*{Diag}(\sigma(D))(QV)^{\top}$ is a singular value decomposition of $PDQ^{\top}$. Therefore, we have
$$\mathscr{U}(PDQ^{\top})=(PU)\operatorname*{Diag}(\psi(\sigma(D)))(QV)^{\top}=P\,\mathscr{U}(D)Q^{\top},$$
so $\mathscr{U}$ is bi-orthogonally equivariant.
Conversely, suppose that $\mathscr{U}$ is bi-orthogonally equivariant. Let $D=U\operatorname*{Diag}(\sigma(D))V^{\top}$ be a singular value decomposition of $D$. Then we have
$$\mathscr{U}(D)=\mathscr{U}(U\operatorname*{Diag}(\sigma(D))V^{\top})=U\,\mathscr{U}(\operatorname*{Diag}(\sigma(D)))V^{\top}.$$
Thus the action of $\mathscr{U}$ is completely determined by its action on diagonal matrices of singular values. Bi-orthogonal equivariance ensures that $\mathscr{U}(\operatorname*{Diag}(s))$ is invariant under conjugation by arbitrary diagonal sign matrices, which forces all of its off-diagonal entries to be zero. We can therefore define $\operatorname*{Diag}(\psi(s))\coloneqq\mathscr{U}(\operatorname*{Diag}(s))$, where $s\in\mathbb{R}_{+}^{r}$. The bi-orthogonal equivariance of $\mathscr{U}$ further implies that this definition is independent of the particular singular value decomposition and that $\psi$ is absolutely symmetric with respect to permutations and sign changes compatible with singular-value representations. Hence we have $\mathscr{U}(D)=U\operatorname*{Diag}(\psi(\sigma(D)))V^{\top}$, which is exactly the claimed spectral form. ∎
Appendix E Implementation Details of Practical Optimizers
We provide the implementation details of our proposed practical optimizers in this section.
E.1 Numerical Algorithm for Matrix Inverse Square Root via Polar Express
We illustrate below the algorithm for computing the matrix inverse square root via Polar Express. It is based on the intrinsic connection between polynomial iterations for computing the orthogonal polar factor and those for computing the inverse square root of a square matrix, stated in the following theorem.
Theorem E.1 (Higham [65]).
Let $A\in\mathbb{R}^{n\times n}$ be a square matrix with nonnegative eigenvalues. Consider any iteration of the form $X_{k+1}=X_{k}h(X_{k}^{2})$ that converges to $\mathrm{msgn}(X_{0})$ for $X_{0}=\left(\begin{smallmatrix}0&A\\ I_{n}&0\end{smallmatrix}\right)$ with order of convergence $q$. Then in the coupled iteration $X_{k+1}=X_{k}h(Y_{k}X_{k})$, $Y_{k+1}=h(Y_{k}X_{k})Y_{k}$, with $X_{0}=A$ and $Y_{0}=I_{n}$, we have $X_{k}\to A^{\nicefrac{1}{2}}$ and $Y_{k}\to A^{-\nicefrac{1}{2}}$, both with order of convergence $q$. Here $\mathrm{msgn}$ denotes the matrix sign function.
Algorithm E.1 Matrix Inverse Square Root via Polar Express [5]
Require: $A\in\mathbb{R}^{n\times n}$, $\varepsilon>0$, $K\in\mathbb{N}^{*}$, sequence of Polar Express coefficients $\{(a_{k},b_{k},c_{k})\}_{k=1}^{K}$
1: $A=(A+A^{\top})/2+\varepsilon I_{n}$
2: $\alpha=1.02\,|||A|||_{\mathrm{F}}+\varepsilon$
3: $Y_{1}=A/\alpha$
4: $Z_{1}=I_{n}$
5: for $k=1,\ldots,K$ do
6:   $(\overline{a}_{k},\overline{b}_{k},\overline{c}_{k})=(a_{k}+b_{k}+c_{k},\,-b_{k}-2c_{k},\,c_{k})$
7:   $R_{k}=I_{n}-Z_{k}Y_{k}$
8:   $R_{k}=(R_{k}+R_{k}^{\top})/2$ (optional matrix symmetrization)
9:   $M_{k}=\overline{a}_{k}I_{n}+\overline{b}_{k}R_{k}+\overline{c}_{k}R_{k}^{2}$
10:   $Y_{k+1}=Y_{k}M_{k}$
11:   $Z_{k+1}=M_{k}Z_{k}$
12: end for
Ensure: $Z_{K+1}/\sqrt{\alpha}$
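The steps of Algorithm E.1 translate almost line by line into NumPy. The following is a sketch only: the actual Polar Express coefficients are not listed in this section, so the example substitutes the classical Newton-Schulz cubic $(a,b,c)=(3/2,-1/2,0)$ at every step, which is an assumption made purely so the code can run. The function name `inv_sqrt_coupled` and the test matrix are likewise illustrative.

```python
import numpy as np

def inv_sqrt_coupled(A, coeffs, eps=1e-7, symmetrize=True):
    """Coupled iteration of Algorithm E.1 for a given coefficient sequence."""
    n = A.shape[0]
    I = np.eye(n)
    A = (A + A.T) / 2 + eps * I                   # step 1: symmetrize and damp
    alpha = 1.02 * np.linalg.norm(A, "fro") + eps # step 2: scaling factor
    Y, Z = A / alpha, I.copy()                    # steps 3-4
    for a, b, c in coeffs:                        # step 5
        a_bar, b_bar, c_bar = a + b + c, -b - 2 * c, c   # step 6
        R = I - Z @ Y                             # step 7: residual
        if symmetrize:
            R = (R + R.T) / 2                     # step 8 (optional)
        M = a_bar * I + b_bar * R + c_bar * (R @ R)      # step 9
        Y, Z = Y @ M, M @ Z                       # steps 10-11
    return Z / np.sqrt(alpha)                     # output: approx. A^(-1/2)

# Stand-in coefficients (Newton-Schulz), NOT the Polar Express sequence.
newton_schulz = [(1.5, -0.5, 0.0)] * 30

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
A = B @ B.T / 5 + np.eye(5)                       # well-conditioned SPD test matrix
Z = inv_sqrt_coupled(A, newton_schulz)
residual = np.abs(Z @ A @ Z - np.eye(5)).max()    # Z A Z should be close to I
```

The change of variables in step 6 comes from writing the odd polynomial $a x+b x^{3}+c x^{5}=x\,(a+b t+c t^{2})$ with $t=x^{2}$ and substituting $t=1-R$, which gives $(a+b+c)+(-b-2c)R+cR^{2}$.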
Appendix F Convergence Analysis of Symmetry-Compatible Optimizers
In this section, we study the convergence of several symmetry-compatible optimizer classes introduced above, including full spectral, one-sided spectral, row-norm-based, and hybrid optimizers. To present the analysis in a unified way, we begin from a general first-order iteration and state the basic assumptions once. The subsequent subsections then specialize these assumptions to each optimizer class.
F.1 General Update Scheme and Standing Assumptions
We consider the generic iteration
$$(\forall k\in\mathbb{N})\qquad W_{k+1}=W_{k}-\gamma_{k}\mathcal{T}(G_{k}),\qquad G_{k}\coloneqq\nabla f(W_{k}),\tag{F.1}$$
where $W_{k}\in\mathbb{R}^{m\times n}$, $\gamma_{k}>0$ is the learning rate, and $\mathcal{T}\colon\mathbb{R}^{m\times n}\to\mathbb{R}^{m\times n}$ is an update map whose precise form depends on the optimizer class under consideration. For simplicity, we do not consider momentum in our analysis and leave it to future work. The layerwise loss function $f$ is assumed to satisfy the following standard regularity conditions.
Assumption F.1 ($L$-smoothness).
Let $f\colon\mathbb{R}^{m\times n}\to\overline{\mathbb{R}}$ be differentiable and $L$-smooth with respect to the Frobenius norm, i.e., there exists $L\in(0,\infty)$ such that
$$(\forall X,Y\in\mathbb{R}^{m\times n})\qquad |||\nabla f(X)-\nabla f(Y)|||_{\mathrm{F}}\leqslant L\,|||X-Y|||_{\mathrm{F}}.$$
Assumption F.2 ($\mu$-Polyak–Łojasiewicz).
Let $f\colon\mathbb{R}^{m\times n}\to\overline{\mathbb{R}}$ satisfy the $\mu$-Polyak–Łojasiewicz (PŁ) inequality:
$$(\forall X\in\mathbb{R}^{m\times n})\qquad |||\nabla f(X)|||_{\mathrm{F}}^{2}\geqslant 2\mu\bigl(f(X)-f^{\star}\bigr),$$
where $f^{\star}\coloneqq\inf f$.
The PŁ condition will only be needed to obtain linear convergence. Under smoothness alone, we will still obtain monotonic descent and sublinear convergence to stationarity.
Throughout the section, the basic descent estimate follows from the standard smoothness inequality:
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\left\llangle G_{k},\mathcal{T}(G_{k})\right\rrangle_{\mathrm{F}}+\frac{L\gamma_{k}^{2}}{2}\,|||\mathcal{T}(G_{k})|||_{\mathrm{F}}^{2}.$$
Thus, the convergence analysis for each optimizer class reduces to controlling two quantities, $\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}$ and $|||\mathcal{T}(G)|||_{\mathrm{F}}^{2}$. The relevant bounds depend on the geometry of the update map $\mathcal{T}$, and are stated separately in the subsections below.
We now specialize (F.1) to each symmetry-compatible optimizer class introduced earlier. In each case, the key step is to identify the geometry-dependent alignment term $\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}$ and the corresponding update norm $|||\mathcal{T}(G)|||_{\mathrm{F}}^{2}$, which together determine the admissible learning rates and convergence rates.
The optimizer classes considered in this section fall into two broad regimes. The first consists of scale-compatible updates, for which $|||\mathcal{T}(G)|||_{\mathrm{F}}$ scales proportionally to $|||G|||_{\mathrm{F}}$. This includes standard spectral, one-sided spectral, and bounded row-norm-based updates. The second consists of fully normalized updates, such as polar or row-normalized directions, whose magnitude is controlled by rank or support rather than gradient norm. The former admit convergence bounds through uniform alignment and norm-control constants, whereas the latter are more naturally analyzed through geometry-dependent ratio quantities.
F.2 Full Spectral Optimizers
We begin with full spectral optimizers. The following assumption captures the two structural properties needed for convergence: positive alignment with the gradient and control of the update norm. We specialize (F.1) to the full spectral case, where $\mathcal{T}$ is a spectral operator.
Assumption F.3 (Singular-value alignment and boundedness).
Let $\mathcal{T}\colon\mathbb{R}^{m\times n}\to\mathbb{R}^{m\times n}$ be the spectral update map defined by $\mathcal{T}(G)=U\operatorname*{Diag}(\psi(\sigma(G)))V^{\top}$ whenever $G=U\operatorname*{Diag}(\sigma(G))V^{\top}$ is a singular value decomposition of $G$, for some absolutely symmetric map $\psi\colon\mathbb{R}_{+}^{r}\to\mathbb{R}^{r}$. Assume there exist constants $0<c_{1}\leqslant c_{2}<\infty$ such that for all $s\in\mathbb{R}_{+}^{r}$,
$$\sum_{i=1}^{r}s_{i}\psi_{i}(s)\geqslant c_{1}\sum_{i=1}^{r}s_{i}^{2}\quad\text{and}\quad\sum_{i=1}^{r}\psi_{i}(s)^{2}\leqslant c_{2}\sum_{i=1}^{r}s_{i}^{2},$$
where $r=\min\{m,n\}$.
Lemma F.1 (Alignment and norm bounds).
Under Assumption F.3, for all $G\in\mathbb{R}^{m\times n}$, we have $\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}\geqslant c_{1}|||G|||_{\mathrm{F}}^{2}$ and $|||\mathcal{T}(G)|||_{\mathrm{F}}^{2}\leqslant c_{2}|||G|||_{\mathrm{F}}^{2}$.
Proof.
If we write $s=\sigma(G)$, then we have $\mathcal{T}(G)=U\operatorname*{Diag}(\psi(s))V^{\top}$. Using orthogonal invariance of the Frobenius inner product, we have $\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}=\left\llangle\operatorname*{Diag}(s),\operatorname*{Diag}(\psi(s))\right\rrangle_{\mathrm{F}}=\sum_{i=1}^{r}s_{i}\psi_{i}(s)$. By Assumption F.3, we have $\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}\geqslant c_{1}\sum_{i=1}^{r}s_{i}^{2}=c_{1}|||G|||_{\mathrm{F}}^{2}$. Similarly, we also have $|||\mathcal{T}(G)|||_{\mathrm{F}}^{2}=\sum_{i=1}^{r}\psi_{i}(s)^{2}\leqslant c_{2}\sum_{i=1}^{r}s_{i}^{2}=c_{2}|||G|||_{\mathrm{F}}^{2}$. ∎
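The two bounds of Lemma F.1 can be tested for a concrete spectral map. The sketch below uses the hypothetical coordinatewise choice $\psi_{i}(s)=s_{i}\bigl(1+1/(1+s_{i})\bigr)$, which is not taken from the paper; it satisfies Assumption F.3 with $c_{1}=1$ and $c_{2}=4$ because $1\leqslant 1+1/(1+s_{i})\leqslant 2$ for $s_{i}\geqslant 0$.

```python
import numpy as np

def psi(s):
    """Hypothetical spectral map psi_i(s) = s_i * (1 + 1/(1 + s_i))."""
    return s * (1.0 + 1.0 / (1.0 + s))

def spectral_update(G):
    """T(G) = U Diag(psi(sigma(G))) V^T via a thin SVD."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U * psi(s)) @ Vt

rng = np.random.default_rng(4)
c1, c2 = 1.0, 4.0
bounds_hold = True
for _ in range(100):
    # Random gradients at very different scales.
    G = rng.standard_normal((6, 4)) * rng.uniform(0.01, 10.0)
    T = spectral_update(G)
    g2 = np.sum(G * G)              # |||G|||_F^2
    alignment = np.sum(G * T)       # <G, T(G)>_F
    t2 = np.sum(T * T)              # |||T(G)|||_F^2
    bounds_hold = bounds_hold and bool(alignment >= c1 * g2 * (1 - 1e-12))
    bounds_hold = bounds_hold and bool(t2 <= c2 * g2 * (1 + 1e-12))
```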
Theorem F.2 (Descent lemma for spectral optimizers).
Suppose Assumptions F.1 and F.3 hold. Then the iteration (F.1) satisfies
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\left\llangle G_{k},\mathcal{T}(G_{k})\right\rrangle_{\mathrm{F}}+\frac{L\gamma_{k}^{2}}{2}|||\mathcal{T}(G_{k})|||_{\mathrm{F}}^{2}.\tag{F.2}$$
Consequently, we have $f(W_{k+1})\leqslant f(W_{k})-\left(c_{1}\gamma_{k}-Lc_{2}\gamma_{k}^{2}/2\right)|||G_{k}|||_{\mathrm{F}}^{2}$. In particular, if $\gamma_{k}\in\left(0,2c_{1}/(Lc_{2})\right)$, then $f(W_{k+1})\leqslant f(W_{k})$.
Proof.
By $L$-smoothness of $f$, we have
$$f(W_{k+1})\leqslant f(W_{k})+\left\llangle\nabla f(W_{k}),W_{k+1}-W_{k}\right\rrangle_{\mathrm{F}}+\frac{L}{2}|||W_{k+1}-W_{k}|||_{\mathrm{F}}^{2}.$$
Using $W_{k+1}-W_{k}=-\gamma_{k}\mathcal{T}(G_{k})$ and $G_{k}=\nabla f(W_{k})$, we obtain (F.2). Now apply Lemma F.1 to obtain $\left\llangle G_{k},\mathcal{T}(G_{k})\right\rrangle_{\mathrm{F}}\geqslant c_{1}|||G_{k}|||_{\mathrm{F}}^{2}$ and $|||\mathcal{T}(G_{k})|||_{\mathrm{F}}^{2}\leqslant c_{2}|||G_{k}|||_{\mathrm{F}}^{2}$, which yields $f(W_{k+1})\leqslant f(W_{k})-\left(c_{1}\gamma_{k}-Lc_{2}\gamma_{k}^{2}/2\right)|||G_{k}|||_{\mathrm{F}}^{2}$. ∎
Theorem F.3 (Sublinear convergence to stationarity).
Suppose Assumptions F.1 and F.3 hold. If the learning rate is constant and satisfies $\gamma\in(0,2c_{1}/(Lc_{2}))$, then
$$\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{\gamma\left(c_{1}-Lc_{2}\gamma/2\right)},\quad\text{and therefore}\quad\min_{0\leqslant k<T}\,|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{T\gamma\left(c_{1}-Lc_{2}\gamma/2\right)}.$$
Proof.
By Theorem F.2, $f(W_{k+1})\leqslant f(W_{k})-\gamma\left(c_{1}-Lc_{2}\gamma/2\right)|||G_{k}|||_{\mathrm{F}}^{2}$. Summing from $k=0$ to $T-1$ yields
$$f(W_{T})\leqslant f(W_{0})-\gamma\left(c_{1}-\frac{Lc_{2}}{2}\gamma\right)\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{F}}^{2}.$$
Since $f(W_{T})\geqslant f^{\star}$, rearranging gives $\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{F}}^{2}\leqslant(f(W_{0})-f^{\star})/(\gamma(c_{1}-Lc_{2}\gamma/2))$. The second inequality follows from $\min_{0\leqslant k<T}\,|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\frac{1}{T}\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{F}}^{2}$. ∎
Theorem F.4 (Linear convergence under the PŁ condition).
Suppose Assumptions F.1, F.2 and F.3 hold. If the learning rate is constant and satisfies $\gamma\in(0,2c_{1}/(Lc_{2}))$, then the spectral iteration (F.1) obeys
$$f(W_{k+1})-f^{\star}\leqslant\left(1-2\mu\gamma\left(c_{1}-\frac{Lc_{2}}{2}\gamma\right)\right)\left(f(W_{k})-f^{\star}\right).$$
Hence $f(W_{k})-f^{\star}\leqslant\rho^{k}\bigl(f(W_{0})-f^{\star}\bigr)$, where $\rho\coloneqq 1-2\mu\gamma\left(c_{1}-Lc_{2}\gamma/2\right)\in(0,1)$.
Proof.
By Theorem F.2, subtracting $f^{\star}$ from both sides gives
$$f(W_{k+1})-f^{\star}\leqslant f(W_{k})-f^{\star}-\left(c_{1}\gamma-\frac{Lc_{2}}{2}\gamma^{2}\right)|||G_{k}|||_{\mathrm{F}}^{2}.$$
Using the PŁ inequality, we obtain the desired inequality. Since $0<\gamma<2c_{1}/(Lc_{2})$, the factor $c_{1}-Lc_{2}\gamma/2$ is positive, so $\rho\in(0,1)$. Iterating the recursion yields $f(W_{k})-f^{\star}\leqslant\rho^{k}(f(W_{0})-f^{\star})$. ∎
Corollary F.5 (Optimal constant learning rate within this bound).
Under the assumptions of Theorem F.4, the contraction factor in Theorem F.4 is minimized over constant learning rates $\gamma>0$ by $\gamma^{\star}=c_{1}/(Lc_{2})$, for which $\rho^{\star}=1-\mu c_{1}^{2}/(Lc_{2})$.
Proof.
The contraction factor $\rho(\gamma)=1-2\mu\gamma\left(c_{1}-Lc_{2}\gamma/2\right)$ is minimized when the quadratic $-2\mu c_{1}\gamma+\mu Lc_{2}\gamma^{2}$ is minimized, equivalently when $c_{1}\gamma-Lc_{2}\gamma^{2}/2$ is maximized. This occurs at $\gamma^{\star}=c_{1}/(Lc_{2})$. Substituting into $\rho(\gamma)$ gives $\rho^{\star}=1-2\mu\frac{c_{1}}{Lc_{2}}\left(c_{1}-\frac{Lc_{2}}{2}\frac{c_{1}}{Lc_{2}}\right)=1-\mu c_{1}^{2}/(Lc_{2})$. ∎
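The optimal step size of Corollary F.5 can be confirmed with a few lines of arithmetic. The constants below are hypothetical values chosen only for illustration; they are not estimated from any model in the paper.

```python
def contraction_factor(gamma, mu, L, c1, c2):
    """rho(gamma) = 1 - 2*mu*gamma*(c1 - L*c2*gamma/2) from Theorem F.4."""
    return 1.0 - 2.0 * mu * gamma * (c1 - L * c2 * gamma / 2.0)

# Hypothetical constants (illustrative only).
mu, L, c1, c2 = 0.5, 2.0, 0.8, 1.5

gamma_star = c1 / (L * c2)                       # claimed minimizer
rho_star = contraction_factor(gamma_star, mu, L, c1, c2)
closed_form = 1.0 - mu * c1**2 / (L * c2)        # claimed optimal value

# Compare against a grid over the admissible interval (0, 2*c1/(L*c2)).
upper = 2.0 * c1 / (L * c2)
grid = [upper * i / 1000 for i in range(1, 1000)]
rho_grid_min = min(contraction_factor(g, mu, L, c1, c2) for g in grid)
```

With these values, `gamma_star` sits at the midpoint of the admissible interval, which reflects the symmetry of the quadratic $c_{1}\gamma-Lc_{2}\gamma^{2}/2$ about its maximizer.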
F.2.1 Specialization to Normalized Polar-Type Spectral Methods
The abstract convergence result above assumes that the spectral update $\mathcal{T}(G)$ grows proportionally to the gradient norm. This covers many spectral maps, but excludes fully normalized polar updates such as $\mathcal{T}(G)=\mathrm{polar}(G)=UV^{\top}$, for which $|||\mathcal{T}(G)|||_{\mathrm{F}}^{2}=\mathrm{rank}(G)$ does not vanish as $G\to 0$ (cf. null-gradient consistency defined in [89]). To analyze such methods, it is more natural to work with a geometry-dependent ratio between the update norm and its alignment with the gradient, defined below.
Assumption F.4 (Positive alignment).
For every $G\in\mathbb{R}^{m\times n}\setminus\{0\}$, we assume that $\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}>0$.
Definition F.1 (Spectral advantage ratio).
For a spectral update map $\mathcal{T}$, define its spectral advantage ratio at $G\neq 0$ by
$$\mathfrak{R}_{\mathcal{T}}(G)\coloneqq\frac{\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}}{|||\mathcal{T}(G)|||_{\mathrm{F}}^{2}}.$$
Equivalently, its reciprocal is $\mathfrak{R}_{\mathcal{T}}^{-1}(G)=|||\mathcal{T}(G)|||_{\mathrm{F}}^{2}/\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}$.
The quantity $\mathfrak{R}_{\mathcal{T}}(G)$ measures how much descent is obtained per unit squared update norm. Larger values correspond to more favorable geometry.
Lemma F.6 (Descent lemma in ratio form).
Suppose Assumptions F.1 and F.4 hold. Then the iteration (F.1) satisfies
$$f(W_{k+1})\leqslant f(W_{k})-\left(\gamma_{k}-\frac{L\gamma_{k}^{2}}{2\,\mathfrak{R}_{\mathcal{T}}(G_{k})}\right)\left\llangle G_{k},\mathcal{T}(G_{k})\right\rrangle_{\mathrm{F}}.$$
In particular, if $0<\gamma_{k}<2\mathfrak{R}_{\mathcal{T}}(G_{k})/L$, then $f(W_{k+1})\leqslant f(W_{k})$.
Proof.
The inequality follows from $L$-smoothness of $f$ and the definition of $\mathfrak{R}_{\mathcal{T}}(G_{k})$, which gives $|||\mathcal{T}(G_{k})|||_{\mathrm{F}}^{2}=\left\llangle G_{k},\mathcal{T}(G_{k})\right\rrangle_{\mathrm{F}}/\mathfrak{R}_{\mathcal{T}}(G_{k})$. The final claim follows immediately, since the alignment term is positive by Assumption F.4. ∎
The preceding lemma shows that normalized spectral methods are governed by two distinct quantities: the ratio term $\mathfrak{R}_{\mathcal{T}}(G)$, which controls the admissible learning rate through the smoothness bound, and the alignment term $\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}$, which must still be related to $|||G|||_{\mathrm{F}}^{2}$ in order to combine the descent estimate with the PŁ inequality.
Theorem F.7 (Linear convergence under PŁ and ratio/alignment bounds).
Suppose Assumptions F.1, F.2 and F.4 hold. Assume there exists a constant $\underline{\mathfrak{R}}>0$ such that $\mathfrak{R}_{\mathcal{T}}(G)\geqslant\underline{\mathfrak{R}}$ for all $G\neq 0$, and that there exists $\alpha>0$ such that $\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}\geqslant\alpha|||G|||_{\mathrm{F}}^{2}$ for all $G\in\mathbb{R}^{m\times n}$. If the learning rate is constant and satisfies $0<\gamma<2\underline{\mathfrak{R}}/L$, then
$$f(W_{k+1})-f^{\star}\leqslant\left(1-2\mu\alpha\gamma\left(1-\frac{L\gamma}{2\underline{\mathfrak{R}}}\right)\right)\bigl(f(W_{k})-f^{\star}\bigr).$$
Hence $f(W_{k})-f^{\star}\leqslant\rho^{k}\bigl(f(W_{0})-f^{\star}\bigr)$, where $\rho\coloneqq 1-2\mu\alpha\gamma\left(1-L\gamma/(2\underline{\mathfrak{R}})\right)\in(0,1)$.
Proof.
By Lemma F.6, $f(W_{k+1})\leqslant f(W_{k})-\left(\gamma-\frac{L\gamma^{2}}{2\mathfrak{R}_{\mathcal{T}}(G_{k})}\right)\left\llangle G_{k},\mathcal{T}(G_{k})\right\rrangle_{\mathrm{F}}$. Using $\mathfrak{R}_{\mathcal{T}}(G)\geqslant\underline{\mathfrak{R}}$ and $\left\llangle G,\mathcal{T}(G)\right\rrangle_{\mathrm{F}}\geqslant\alpha|||G|||_{\mathrm{F}}^{2}$, we obtain
$$f(W_{k+1})\leqslant f(W_{k})-\alpha\gamma\left(1-\frac{L\gamma}{2\underline{\mathfrak{R}}}\right)|||G_{k}|||_{\mathrm{F}}^{2}.$$
Applying the PŁ inequality yields the desired inequality. Since $0<\gamma<2\underline{\mathfrak{R}}/L$, the factor $1-L\gamma/(2\underline{\mathfrak{R}})$ is positive, so $\rho\in(0,1)$. Iterating the recursion proves the claim. ∎
Specialization to the polar update.
For the normalized polar update $\mathcal{T}(G)=\mathrm{polar}(G)=UV^{\top}$ when $G=U\Sigma V^{\top}$, one has $\left\llangle G,\mathrm{polar}(G)\right\rrangle_{\mathrm{F}}=|||G|||_{\mathrm{nuc}}$ and $|||\mathrm{polar}(G)|||_{\mathrm{F}}^{2}=\mathrm{rank}(G)$, and therefore $\mathfrak{R}_{\mathrm{polar}}(G)=|||G|||_{\mathrm{nuc}}/\mathrm{rank}(G)$. This is the average nonzero singular value of $G$. In particular, the descent condition becomes
$$0<\gamma_{k}<\frac{2|||G_{k}|||_{\mathrm{nuc}}}{L\,\mathrm{rank}(G_{k})}.$$
Moreover, $\left\llangle G,\mathrm{polar}(G)\right\rrangle_{\mathrm{F}}=|||G|||_{\mathrm{nuc}}\geqslant|||G|||_{\mathrm{F}}$, but in general one does not have a uniform lower bound of the form
$$\left\llangle G,\mathrm{polar}(G)\right\rrangle_{\mathrm{F}}\geqslant\alpha|||G|||_{\mathrm{F}}^{2}$$
without additional scale control. Thus, for normalized polar updates, the ratio $\mathfrak{R}_{\mathrm{polar}}(G)$ naturally governs the admissible learning rate, whereas stronger convergence rates require an additional lower bound relating $|||G|||_{\mathrm{nuc}}$ to $|||G|||_{\mathrm{F}}^{2}$. Using the stable rank defined by $\mathrm{srank}(G)\coloneqq|||G|||_{\mathrm{F}}^{2}/|||G|||_{\mathrm{S}}^{2}$, one has
$$|||G|||_{\mathrm{nuc}}\geqslant\frac{|||G|||_{\mathrm{F}}^{2}}{|||G|||_{\mathrm{S}}}=\mathrm{srank}(G)\,|||G|||_{\mathrm{S}},$$
and hence
$$\mathfrak{R}_{\mathrm{polar}}(G)=\frac{|||G|||_{\mathrm{nuc}}}{\mathrm{rank}(G)}\geqslant\frac{\mathrm{srank}(G)}{\mathrm{rank}(G)}|||G|||_{\mathrm{S}}.$$
This lower bound shows explicitly how the descent margin depends on the spectral spread of the gradient: flatter spectra, corresponding to larger stable rank, lead to more favorable ratio bounds for the polar update.
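The quantities in this specialization are all functions of the singular values and are cheap to compute for a small matrix. The sketch below uses a random Gaussian stand-in for the gradient, which is an assumption for illustration, and evaluates the ratio, the stable rank, and the lower bound.

```python
import numpy as np

rng = np.random.default_rng(5)
G = rng.standard_normal((7, 5))            # stand-in gradient, full rank almost surely

U, s, Vt = np.linalg.svd(G, full_matrices=False)
rank = int(np.linalg.matrix_rank(G))
nuc = s.sum()                              # nuclear norm |||G|||_nuc
fro2 = np.sum(s**2)                        # squared Frobenius norm
spec = s.max()                             # spectral norm |||G|||_S

alignment = np.sum(G * (U @ Vt))           # <G, polar(G)>_F, should equal nuc
ratio_polar = nuc / rank                   # average nonzero singular value
srank = fro2 / spec**2                     # stable rank
lower_bound = srank / rank * spec          # (srank / rank) * |||G|||_S

# Descent requires gamma_k < max_step_times_L / L for a given smoothness constant L.
max_step_times_L = 2.0 * ratio_polar
```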
F.2.2 Specialization to PolarGrad with Nuclear-Norm Scaling
For PolarGrad with nuclear-norm scaling [89], the normalized polar direction is rescaled in such a way that both its alignment with the gradient and its squared Frobenius norm admit exact closed-form expressions. This removes the scale ambiguity present in fully normalized polar updates and leads to a particularly transparent descent analysis in terms of gradient rank and stable rank.
Given a gradient matrix $G=U\operatorname*{Diag}(\sigma(G))V^{\top}$, define the update map
$$\mathcal{T}_{\mathrm{PG}}(G)\coloneqq|||G|||_{\mathrm{nuc}}\,\mathrm{polar}(G)=|||G|||_{\mathrm{nuc}}\,UV^{\top}.$$
The corresponding iteration is
$$(\forall k\in\mathbb{N})\qquad W_{k+1}=W_{k}-\gamma_{k}|||G_{k}|||_{\mathrm{nuc}}\,\mathrm{polar}(G_{k}),\qquad G_{k}=\nabla f(W_{k}).\tag{F.3}$$
In what follows, we write $r_{k}\coloneqq\mathrm{rank}(G_{k})$.
Lemma F.8 (Alignment and norm identities for PolarGrad).
For every $G\in\mathbb{R}^{m\times n}$, we have $\left\llangle G,\mathcal{T}_{\mathrm{PG}}(G)\right\rrangle_{\mathrm{F}}=|||G|||_{\mathrm{nuc}}^{2}$ and $|||\mathcal{T}_{\mathrm{PG}}(G)|||_{\mathrm{F}}^{2}=\mathrm{rank}(G)\,|||G|||_{\mathrm{nuc}}^{2}$. Consequently, for $G\neq 0$,
$$\mathfrak{R}_{\mathcal{T}_{\mathrm{PG}}}(G)=\frac{\left\llangle G,\mathcal{T}_{\mathrm{PG}}(G)\right\rrangle_{\mathrm{F}}}{|||\mathcal{T}_{\mathrm{PG}}(G)|||_{\mathrm{F}}^{2}}=\frac{1}{\mathrm{rank}(G)}.$$
Proof.
Using orthogonal invariance of the Frobenius inner product,
$$\left\llangle G,\mathcal{T}_{\mathrm{PG}}(G)\right\rrangle_{\mathrm{F}}=|||G|||_{\mathrm{nuc}}\,\left\llangle U\operatorname*{Diag}(\sigma(G))V^{\top},UV^{\top}\right\rrangle_{\mathrm{F}}=|||G|||_{\mathrm{nuc}}\sum_{i=1}^{r}\sigma_{i}(G)=|||G|||_{\mathrm{nuc}}^{2}.$$
Also, $|||\mathcal{T}_{\mathrm{PG}}(G)|||_{\mathrm{F}}^{2}=|||G|||_{\mathrm{nuc}}^{2}\,|||UV^{\top}|||_{\mathrm{F}}^{2}=|||G|||_{\mathrm{nuc}}^{2}\,\mathrm{rank}(G)$. The ratio identity follows immediately. ∎
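The three identities of Lemma F.8 can be verified numerically. This is a sketch assuming a full-rank Gaussian stand-in for $G$, so that the thin-SVD product $UV^{\top}$ coincides with the polar factor and $\mathrm{rank}(G)=\min\{m,n\}$; the function name `polargrad_update` is illustrative.

```python
import numpy as np

def polargrad_update(G):
    """T_PG(G) = |||G|||_nuc * U V^T from a thin SVD (full-rank G assumed)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return s.sum() * (U @ Vt)

rng = np.random.default_rng(7)
G = rng.standard_normal((6, 4))                   # full rank almost surely
T = polargrad_update(G)

nuc = np.linalg.svd(G, compute_uv=False).sum()    # nuclear norm
rank = int(np.linalg.matrix_rank(G))

alignment = np.sum(G * T)          # <G, T_PG(G)>_F, expected nuc**2
update_norm2 = np.sum(T * T)       # |||T_PG(G)|||_F^2, expected rank * nuc**2
ratio = alignment / update_norm2   # expected 1 / rank
```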
Theorem F.9 (Descent lemma for nuclear-norm-scaled PolarGrad).
Suppose $f$ satisfies Assumption F.1. Then the iteration (F.3) satisfies
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\,|||G_{k}|||_{\mathrm{nuc}}^{2}+\frac{L\gamma_{k}^{2}}{2}\,r_{k}\,|||G_{k}|||_{\mathrm{nuc}}^{2}.$$
Equivalently,
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\left(1-\frac{L\gamma_{k}}{2}r_{k}\right)|||G_{k}|||_{\mathrm{nuc}}^{2}.$$
In particular, if $\gamma_{k}<2/(Lr_{k})$, then $f(W_{k+1})\leqslant f(W_{k})$.
Proof.
By $L$-smoothness of $f$ and $W_{k+1}-W_{k}=-\gamma_{k}\mathcal{T}_{\mathrm{PG}}(G_{k})$, we obtain
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\,\llangle G_{k},\mathcal{T}_{\mathrm{PG}}(G_{k})\rrangle_{\mathrm{F}}+\frac{L\gamma_{k}^{2}}{2}\,|||\mathcal{T}_{\mathrm{PG}}(G_{k})|||_{\mathrm{F}}^{2}.$$
Now apply Lemma F.8. ∎
Corollary F.10 (Stable-rank improvement).
Under the assumptions of Theorem F.9,
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\left(1-\frac{L\gamma_{k}}{2}r_{k}\right)|||G_{k}|||_{\mathrm{nuc}}^{2}.$$
Since $|||G_{k}|||_{\mathrm{F}}^{2}\leqslant|||G_{k}|||_{\mathrm{S}}\,|||G_{k}|||_{\mathrm{nuc}}$, we have $|||G_{k}|||_{\mathrm{nuc}}^{2}\geqslant\mathrm{srank}(G_{k})\,|||G_{k}|||_{\mathrm{F}}^{2}$ with $\mathrm{srank}(G_{k})\coloneqq|||G_{k}|||_{\mathrm{F}}^{2}/|||G_{k}|||_{\mathrm{S}}^{2}$. Hence, whenever $\gamma_{k}\leqslant 2/(Lr_{k})$, it follows that
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\left(1-\frac{L\gamma_{k}}{2}r_{k}\right)\mathrm{srank}(G_{k})\,|||G_{k}|||_{\mathrm{F}}^{2}.$$
Thus, compared with vanilla gradient descent, nuclear-norm-scaled PolarGrad admits a potentially stronger one-step decrease when the gradient has nontrivial spectral spread, measured by its stable rank.
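To make the role of the stable rank concrete, the sketch below compares $|||G|||_{\mathrm{nuc}}^{2}$, $\mathrm{srank}(G)\,|||G|||_{\mathrm{F}}^{2}$, and $|||G|||_{\mathrm{F}}^{2}$ for a matrix with spread-out singular values and for a rank-one matrix. The helper `norm_summary` and both test matrices are hypothetical illustrations.

```python
import numpy as np

def norm_summary(G):
    """Return (nuclear norm squared, Frobenius norm squared, stable rank) of G."""
    s = np.linalg.svd(G, compute_uv=False)
    nuc_sq = s.sum() ** 2
    fro_sq = np.sum(s ** 2)
    srank = fro_sq / s[0] ** 2  # s[0] is the spectral norm
    return nuc_sq, fro_sq, srank

rng = np.random.default_rng(1)
G_spread = rng.standard_normal((8, 5))                                  # generic, full rank
G_rank1 = np.outer(rng.standard_normal(8), rng.standard_normal(5))     # rank one

results = {}
for name, G in [("spread", G_spread), ("rank-1", G_rank1)]:
    nuc_sq, fro_sq, srank = norm_summary(G)
    results[name] = (nuc_sq, fro_sq, srank)
    print(name, "nuc^2 =", nuc_sq, "srank * F^2 =", srank * fro_sq, "F^2 =", fro_sq)
```

For the rank-one matrix all three quantities coincide, so the bound offers no gain over the Frobenius-norm decrease; for the generic matrix the stable rank exceeds one and the chain $|||G|||_{\mathrm{nuc}}^{2}\geqslant\mathrm{srank}(G)\,|||G|||_{\mathrm{F}}^{2}\geqslant|||G|||_{\mathrm{F}}^{2}$ is strict.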
Theorem F.11 (Sublinear convergence to stationarity for nuclear-norm-scaled PolarGrad).
Suppose $f$ satisfies Assumption F.1. Consider the iteration (F.3) with constant learning rate $\gamma>0$. Assume there exists $\overline{r}\in\mathbb{N}^{*}$ such that $r_{k}\leqslant\overline{r}$ for all $k\in\mathbb{N}$, and that $\gamma\in(0,2/(L\overline{r}))$. Then
$$f(W_{k+1})\leqslant f(W_{k})-\gamma\left(1-\frac{L\gamma}{2}\overline{r}\right)|||G_{k}|||_{\mathrm{nuc}}^{2}.$$
Consequently, we have
$$\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{nuc}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{\gamma\left(1-L\gamma\overline{r}/2\right)}\quad\text{and}\quad\min_{0\leqslant k<T}\,|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{T\gamma\left(1-L\gamma\overline{r}/2\right)}.$$
Hence the method converges to stationarity at the standard $\mathscr{O}(1/T)$ rate in the minimum gradient norm.
Proof.
By Theorem F.9, $f(W_{k+1})\leqslant f(W_{k})-\gamma\left(1-\frac{L\gamma}{2}r_{k}\right)|||G_{k}|||_{\mathrm{nuc}}^{2}$. Using $r_{k}\leqslant\overline{r}$, we obtain $f(W_{k+1})\leqslant f(W_{k})-\gamma\left(1-\frac{L\gamma}{2}\overline{r}\right)|||G_{k}|||_{\mathrm{nuc}}^{2}$. Summing from $k=0$ to $T-1$ yields
$$f(W_{T})\leqslant f(W_{0})-\gamma\left(1-\frac{L\gamma}{2}\overline{r}\right)\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{nuc}}^{2}.$$
Since $f(W_{T})\geqslant f^{\star}$ and $|||G_{k}|||_{\mathrm{nuc}}\geqslant|||G_{k}|||_{\mathrm{F}}$,
$$\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{nuc}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{\gamma\left(1-L\gamma\overline{r}/2\right)}.$$
Finally, $\min_{0\leqslant k<T}\,|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\frac{1}{T}\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{F}}^{2}$ proves the stated $\mathscr{O}(1/T)$ bound. ∎
Theorem F.12 (Linear convergence under PŁ for nuclear-norm-scaled PolarGrad).
Suppose $f$ satisfies Assumptions F.1 and F.2. Assume there exists $\overline{r}\in\mathbb{N}^{*}$ such that $r_{k}\leqslant\overline{r}$ for all $k\in\mathbb{N}$. Then any constant learning rate satisfying $\gamma\in(0,2/(L\overline{r}))$ yields
$$f(W_{k+1})-f^{\star}\leqslant\left(1-2\mu\gamma\left(1-\frac{L\gamma}{2}\overline{r}\right)\right)\bigl(f(W_{k})-f^{\star}\bigr).$$
Hence $f(W_{k})-f^{\star}\leqslant\rho^{k}\bigl(f(W_{0})-f^{\star}\bigr)$, where $\rho=1-2\mu\gamma\left(1-L\gamma\overline{r}/2\right)\in(0,1)$.
Proof.
By Theorem F.9, and using $r_{k}\leqslant\overline{r}$ and $|||G_{k}|||_{\mathrm{nuc}}^{2}\geqslant|||G_{k}|||_{\mathrm{F}}^{2}$, we obtain
$$f(W_{k+1})\leqslant f(W_{k})-\gamma\left(1-\frac{L\gamma}{2}\overline{r}\right)|||G_{k}|||_{\mathrm{F}}^{2}.$$
Applying the PŁ inequality gives
$$f(W_{k+1})-f^{\star}\leqslant\left(1-2\mu\gamma\left(1-\frac{L\gamma}{2}\overline{r}\right)\right)\bigl(f(W_{k})-f^{\star}\bigr).$$
Since $0<\gamma<2/(L\overline{r})$, the factor $1-L\gamma\overline{r}/2$ is positive, hence $\rho\in(0,1)$. ∎
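The contraction can be observed on a toy problem. The sketch below runs the nuclear-norm-scaled polar update on the quadratic $f(W)=\tfrac12|||W-A|||_{\mathrm{F}}^{2}$, for which $L=1$, $f^{\star}=0$, and the PŁ inequality holds with $\mu=1$ if Assumption F.2 is normalized as $|||\nabla f(W)|||_{\mathrm{F}}^{2}\geqslant 2\mu\,(f(W)-f^{\star})$. That normalization, the reading $r_{k}=\mathrm{rank}(G_{k})$, the bound $\overline{r}=\min(m,n)$, and all names in the code are assumptions made for illustration.

```python
import numpy as np

def polar_grad_nuc(G, tol=1e-12):
    """Nuclear-norm-scaled polar factor |||G|||_nuc * U V^T (rank-truncated SVD)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    if s[0] <= 0.0:
        return np.zeros_like(G)
    keep = s > tol * s[0]
    return s.sum() * (U[:, keep] @ Vt[keep, :])

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))   # minimizer of the toy quadratic
W = rng.standard_normal((m, n))   # starting point

L, mu, r_bar = 1.0, 1.0, min(m, n)
gamma = 1.0 / (L * r_bar)         # inside the admissible interval (0, 2 / (L * r_bar))
rho = 1.0 - 2.0 * mu * gamma * (1.0 - L * gamma * r_bar / 2.0)

f = lambda W: 0.5 * np.sum((W - A) ** 2)
values = [f(W)]
for _ in range(15):
    G = W - A                     # gradient of the quadratic
    W = W - gamma * polar_grad_nuc(G)
    values.append(f(W))

# Check f(W_{k+1}) - f* <= rho * (f(W_k) - f*) at every step (f* = 0 here).
contracts = all(b <= rho * a + 1e-12 for a, b in zip(values, values[1:]))
print(rho, contracts)
```

With these choices $\rho=2/3$. The observed per-step decrease is typically faster than the bound, since the proof discards the gap between $|||G_{k}|||_{\mathrm{nuc}}^{2}$ and $|||G_{k}|||_{\mathrm{F}}^{2}$.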
F.3 One-Sided Spectral Optimizers
We now specialize the preceding convergence analysis to one-sided spectral optimizers. The analysis follows the same smoothness-based template as in the full spectral case; the only additional ingredients are alignment and norm-control conditions adapted to the relevant one-sided Gram operator. These yield the standard descent bound, sublinear convergence to stationarity, and linear convergence under the PŁ condition.
Recall that right-spectral updates take the form $\mathcal{T}_{\mathsf{R}}(G)=G\,\Psi(G^{\top}G)$, while left-spectral updates take the form $\mathcal{T}_{\mathsf{L}}(G)=\Phi(GG^{\top})\,G$, where $\Psi$ and $\Phi$ are orthogonally equivariant spectral operators on the corresponding Gram matrices.
F.3.1 Right-Spectral Optimizers
Consider the iteration
$$(\forall k\in\mathbb{N})\qquad W_{k+1}=W_{k}-\gamma_{k}\mathcal{T}_{\mathsf{R}}(G_{k}),\qquad G_{k}\coloneqq\nabla f(W_{k}),\tag{F.4}$$
with $\mathcal{T}_{\mathsf{R}}(G)=G\,\Psi(G^{\top}G)$.
Assumption F.5 (Right-spectral eigenvalue alignment and boundedness).
For every $G\in\mathbb{R}^{m\times n}$, let $G^{\top}G=V\operatorname*{Diag}(\lambda(G^{\top}G))V^{\top}$ and $\Psi(G^{\top}G)=V\operatorname*{Diag}(\psi(\lambda(G^{\top}G)))V^{\top}$, where $\lambda(G^{\top}G)=(\lambda_{1},\dots,\lambda_{n})\in\mathbb{R}_{+}^{n}$. Assume there exist constants $0<c_{\mathsf{R},1}\leqslant c_{\mathsf{R},2}<\infty$ such that for all $\lambda\in\mathbb{R}_{+}^{n}$,
$$\sum_{i=1}^{n}\lambda_{i}\psi_{i}(\lambda)\geqslant c_{\mathsf{R},1}\sum_{i=1}^{n}\lambda_{i}\quad\text{and}\quad\sum_{i=1}^{n}\lambda_{i}\psi_{i}(\lambda)^{2}\leqslant c_{\mathsf{R},2}\sum_{i=1}^{n}\lambda_{i}.$$
Lemma F.13 (Right-spectral alignment identities).
Under Assumption F.5, for all $G\in\mathbb{R}^{m\times n}$, $\llangle G,\mathcal{T}_{\mathsf{R}}(G)\rrangle_{\mathrm{F}}=\sum_{i=1}^{n}\lambda_{i}\psi_{i}(\lambda)$ and $|||\mathcal{T}_{\mathsf{R}}(G)|||_{\mathrm{F}}^{2}=\sum_{i=1}^{n}\lambda_{i}\psi_{i}(\lambda)^{2}$, where $\lambda=\lambda(G^{\top}G)$. Consequently, $\llangle G,\mathcal{T}_{\mathsf{R}}(G)\rrangle_{\mathrm{F}}\geqslant c_{\mathsf{R},1}\,|||G|||_{\mathrm{F}}^{2}$ and $|||\mathcal{T}_{\mathsf{R}}(G)|||_{\mathrm{F}}^{2}\leqslant c_{\mathsf{R},2}\,|||G|||_{\mathrm{F}}^{2}$.
Proof.
Let $G=U\operatorname*{Diag}(\sigma(G))V^{\top}$ be a singular value decomposition of $G$. Since $\lambda_{i}(G^{\top}G)=\sigma_{i}(G)^{2}$, we may write $G^{\top}G=V\operatorname*{Diag}(\lambda)V^{\top}$ and $\Psi(G^{\top}G)=V\operatorname*{Diag}(\psi(\lambda))V^{\top}$. Therefore,
$$\mathcal{T}_{\mathsf{R}}(G)=G\,\Psi(G^{\top}G)=U\operatorname*{Diag}(\sigma(G))V^{\top}V\operatorname*{Diag}(\psi(\lambda))V^{\top}=U\operatorname*{Diag}(\sigma_{i}(G)\psi_{i}(\lambda))V^{\top}.$$
Hence $\llangle G,\mathcal{T}_{\mathsf{R}}(G)\rrangle_{\mathrm{F}}=\sum_{i=1}^{n}\sigma_{i}(G)^{2}\psi_{i}(\lambda)=\sum_{i=1}^{n}\lambda_{i}\psi_{i}(\lambda)$, and $|||\mathcal{T}_{\mathsf{R}}(G)|||_{\mathrm{F}}^{2}=\sum_{i=1}^{n}\sigma_{i}(G)^{2}\psi_{i}(\lambda)^{2}=\sum_{i=1}^{n}\lambda_{i}\psi_{i}(\lambda)^{2}$. The final inequalities follow from Assumption F.5 and $\sum_{i=1}^{n}\lambda_{i}=|||G|||_{\mathrm{F}}^{2}$. ∎
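As a sanity check of the lemma, the sketch below builds a right-spectral update from an eigendecomposition of $G^{\top}G$ and compares both sides of the identities. The eigenvalue-wise choice $\psi_{i}(\lambda)=1+1/(1+\lambda_{i})$ is a hypothetical example, picked only because it takes values in $(1,2]$ and therefore satisfies Assumption F.5 with $c_{\mathsf{R},1}=1$ and $c_{\mathsf{R},2}=4$; it is not one of the paper's optimizers.

```python
import numpy as np

def right_spectral_update(G, psi):
    """Return (G @ Psi(G^T G), eigenvalues of G^T G) for an eigenvalue-wise psi."""
    lam, V = np.linalg.eigh(G.T @ G)
    lam = np.clip(lam, 0.0, None)   # remove tiny negative round-off
    Psi = (V * psi(lam)) @ V.T      # V Diag(psi(lam)) V^T
    return G @ Psi, lam

# Hypothetical spectral function with values in (1, 2].
psi = lambda lam: 1.0 + 1.0 / (1.0 + lam)
c1, c2 = 1.0, 4.0

rng = np.random.default_rng(2)
G = rng.standard_normal((6, 4))
T, lam = right_spectral_update(G, psi)

inner = np.sum(G * T)     # <<G, T_R(G)>>_F
sq_norm = np.sum(T * T)   # |||T_R(G)|||_F^2
fro_sq = np.sum(G * G)    # |||G|||_F^2, equal to sum(lam)

print(np.isclose(inner, np.sum(lam * psi(lam))),
      np.isclose(sq_norm, np.sum(lam * psi(lam) ** 2)),
      inner >= c1 * fro_sq, sq_norm <= c2 * fro_sq)
```

The left-spectral case is obtained by the same construction applied to $GG^{\top}$ and multiplying on the left.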
Theorem F.14 (Right-spectral descent and convergence).
Suppose Assumptions F.1 and F.5 hold. Then the iteration (F.4) satisfies
$$f(W_{k+1})\leqslant f(W_{k})-\left(c_{\mathsf{R},1}\gamma_{k}-\frac{Lc_{\mathsf{R},2}}{2}\gamma_{k}^{2}\right)|||G_{k}|||_{\mathrm{F}}^{2}.$$
In particular, if $\gamma_{k}\in(0,2c_{\mathsf{R},1}/(Lc_{\mathsf{R},2}))$, then $f(W_{k+1})\leqslant f(W_{k})$. If, in addition, $f$ satisfies Assumption F.2 and $\gamma_{k}\equiv\gamma$ is constant with $\gamma\in(0,2c_{\mathsf{R},1}/(Lc_{\mathsf{R},2}))$, then
$$f(W_{k+1})-f^{\star}\leqslant\left(1-2\mu\gamma\left(c_{\mathsf{R},1}-\frac{Lc_{\mathsf{R},2}}{2}\gamma\right)\right)\bigl(f(W_{k})-f^{\star}\bigr),$$
and therefore $f(W_{k})-f^{\star}\leqslant\rho_{\mathsf{R}}^{k}\bigl(f(W_{0})-f^{\star}\bigr)$, where $\rho_{\mathsf{R}}=1-2\mu\gamma\left(c_{\mathsf{R},1}-Lc_{\mathsf{R},2}\gamma/2\right)\in(0,1)$. Moreover, without the PŁ condition, if $\gamma_{k}\equiv\gamma$ is constant, then
$$\min_{0\leqslant k<T}\,|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{T\gamma\left(c_{\mathsf{R},1}-Lc_{\mathsf{R},2}\gamma/2\right)}.$$
Proof.
Combine the smoothness inequality with Lemma F.13, as in the proof of Theorem F.2, and then use either the PŁ inequality or summation of the descent bound. ∎
F.3.2 Left-Spectral Optimizers
Consider the iteration
$$(\forall k\in\mathbb{N})\qquad W_{k+1}=W_{k}-\gamma_{k}\mathcal{T}_{\mathsf{L}}(G_{k}),\qquad G_{k}\coloneqq\nabla f(W_{k}),\tag{F.5}$$
with $\mathcal{T}_{\mathsf{L}}(G)=\Phi(GG^{\top})\,G$.
Assumption F.6 (Left-spectral eigenvalue alignment and boundedness).
For every $G\in\mathbb{R}^{m\times n}$, let $GG^{\top}=U\operatorname*{Diag}(\lambda(GG^{\top}))U^{\top}$ and $\Phi(GG^{\top})=U\operatorname*{Diag}(\phi(\lambda(GG^{\top})))U^{\top}$, where $\lambda(GG^{\top})=(\lambda_{1},\dots,\lambda_{m})\in\mathbb{R}_{+}^{m}$. Assume there exist constants $0<c_{\mathsf{L},1}\leqslant c_{\mathsf{L},2}<\infty$ such that for all $\lambda\in\mathbb{R}_{+}^{m}$,
$$\sum_{i=1}^{m}\lambda_{i}\phi_{i}(\lambda)\geqslant c_{\mathsf{L},1}\sum_{i=1}^{m}\lambda_{i}\quad\text{and}\quad\sum_{i=1}^{m}\lambda_{i}\phi_{i}(\lambda)^{2}\leqslant c_{\mathsf{L},2}\sum_{i=1}^{m}\lambda_{i}.$$
Lemma F.15 (Left-spectral alignment identities).
Under Assumption F.6, for all $G\in\mathbb{R}^{m\times n}$, $\llangle G,\mathcal{T}_{\mathsf{L}}(G)\rrangle_{\mathrm{F}}=\sum_{i=1}^{m}\lambda_{i}\,\phi_{i}(\lambda)$ and $|||\mathcal{T}_{\mathsf{L}}(G)|||_{\mathrm{F}}^{2}=\sum_{i=1}^{m}\lambda_{i}\,\phi_{i}(\lambda)^{2}$, where $\lambda=\lambda(GG^{\top})$. Consequently, $\llangle G,\mathcal{T}_{\mathsf{L}}(G)\rrangle_{\mathrm{F}}\geqslant c_{\mathsf{L},1}\,|||G|||_{\mathrm{F}}^{2}$ and $|||\mathcal{T}_{\mathsf{L}}(G)|||_{\mathrm{F}}^{2}\leqslant c_{\mathsf{L},2}\,|||G|||_{\mathrm{F}}^{2}$.
Proof.
Let $G=U\operatorname*{Diag}(\sigma(G))V^{\top}$ be a singular value decomposition of $G$. Since $\lambda_{i}(GG^{\top})=\sigma_{i}(G)^{2}$, we may write $GG^{\top}=U\operatorname*{Diag}(\lambda)U^{\top}$ and $\Phi(GG^{\top})=U\operatorname*{Diag}(\phi(\lambda))U^{\top}$. Therefore,
$$\mathcal{T}_{\mathsf{L}}(G)=\Phi(GG^{\top})\,G=U\operatorname*{Diag}(\phi(\lambda))U^{\top}U\operatorname*{Diag}(\sigma(G))V^{\top}=U\operatorname*{Diag}(\phi_{i}(\lambda)\sigma_{i}(G))V^{\top}.$$
Hence $\llangle G,\mathcal{T}_{\mathsf{L}}(G)\rrangle_{\mathrm{F}}=\sum_{i=1}^{m}\sigma_{i}(G)^{2}\phi_{i}(\lambda)=\sum_{i=1}^{m}\lambda_{i}\phi_{i}(\lambda)$, and $|||\mathcal{T}_{\mathsf{L}}(G)|||_{\mathrm{F}}^{2}=\sum_{i=1}^{m}\sigma_{i}(G)^{2}\phi_{i}(\lambda)^{2}=\sum_{i=1}^{m}\lambda_{i}\phi_{i}(\lambda)^{2}$. The final inequalities follow from Assumption F.6 and $\sum_{i=1}^{m}\lambda_{i}=|||G|||_{\mathrm{F}}^{2}$. ∎
Theorem F.16 (Left-spectral descent and convergence).
Suppose Assumptions F.1 and F.6 hold. Then the iteration (F.5) satisfies
$$f(W_{k+1})\leqslant f(W_{k})-\left(c_{\mathsf{L},1}\gamma_{k}-\frac{Lc_{\mathsf{L},2}}{2}\gamma_{k}^{2}\right)|||G_{k}|||_{\mathrm{F}}^{2}.$$
In particular, if $\gamma_{k}\in(0,2c_{\mathsf{L},1}/(Lc_{\mathsf{L},2}))$, then $f(W_{k+1})\leqslant f(W_{k})$. If, in addition, $f$ satisfies Assumption F.2 and $\gamma_{k}\equiv\gamma$ is constant with $\gamma\in(0,2c_{\mathsf{L},1}/(Lc_{\mathsf{L},2}))$, then
$$f(W_{k+1})-f^{\star}\leqslant\left(1-2\mu\gamma\left(c_{\mathsf{L},1}-\frac{Lc_{\mathsf{L},2}}{2}\gamma\right)\right)\bigl(f(W_{k})-f^{\star}\bigr),$$
and therefore $f(W_{k})-f^{\star}\leqslant\rho_{\mathsf{L}}^{k}\bigl(f(W_{0})-f^{\star}\bigr)$, where $\rho_{\mathsf{L}}=1-2\mu\gamma\left(c_{\mathsf{L},1}-Lc_{\mathsf{L},2}\gamma/2\right)\in(0,1)$. Moreover, without the PŁ condition, if $\gamma_{k}\equiv\gamma$ is constant, then
$$\min_{0\leqslant k<T}\,|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{T\gamma\left(c_{\mathsf{L},1}-Lc_{\mathsf{L},2}\gamma/2\right)}.$$
Proof.
Identical to the proof of Theorem F.14, replacing $\Psi(G^{\top}G)$ by $\Phi(GG^{\top})$. ∎
For the canonical one-sided polar updates, the abstract $(c_{1},c_{2})$-based analysis is less natural, and is better replaced by the same ratio-style viewpoint used for normalized PolarGrad. After nuclear-norm scaling, however, both one-sided variants recover the same closed-form alignment and norm identities as full nuclear-norm-scaled PolarGrad.
Theorem F.17 (Convergence of nuclear-norm-scaled one-sided PolarGrad).
Consider either the right-sided update
$$\mathcal{T}_{\mathsf{PG},\mathsf{R}}(G)=|||G|||_{\mathrm{nuc}}\,G(G^{\top}G)^{\dagger/2}$$
or the left-sided update
$$\mathcal{T}_{\mathsf{PG},\mathsf{L}}(G)=|||G|||_{\mathrm{nuc}}\,(GG^{\top})^{\dagger/2}G.$$
Then for both updates, $\llangle G,\mathcal{T}(G)\rrangle_{\mathrm{F}}=|||G|||_{\mathrm{nuc}}^{2}$ and $|||\mathcal{T}(G)|||_{\mathrm{F}}^{2}=\mathrm{rank}(G)\,|||G|||_{\mathrm{nuc}}^{2}$. Hence the identities in Lemma F.8 remain valid for both one-sided nuclear-norm-scaled variants. Therefore, the descent, stationarity, and PŁ linear convergence results of Theorems F.9, F.11 and F.12 apply verbatim.
Proof.
We first consider the right-sided update $\mathcal{T}_{\mathsf{PG},\mathsf{R}}(G)$. Let $G=U\operatorname*{Diag}(\sigma(G))V^{\top}$ be a singular value decomposition of $G$. Then
$$G(G^{\top}G)^{\dagger/2}=U\operatorname*{Diag}(\sigma(G))V^{\top}\left(V\operatorname*{Diag}(\sigma(G)^{2})V^{\top}\right)^{\dagger/2}=UV^{\top}=\mathrm{polar}(G),$$
so $\mathcal{T}_{\mathsf{PG},\mathsf{R}}(G)=|||G|||_{\mathrm{nuc}}\,\mathrm{polar}(G)$. Therefore, $\llangle G,\mathcal{T}_{\mathsf{PG},\mathsf{R}}(G)\rrangle_{\mathrm{F}}=|||G|||_{\mathrm{nuc}}\,\llangle G,\mathrm{polar}(G)\rrangle_{\mathrm{F}}=|||G|||_{\mathrm{nuc}}^{2}$, and $|||\mathcal{T}_{\mathsf{PG},\mathsf{R}}(G)|||_{\mathrm{F}}^{2}=|||G|||_{\mathrm{nuc}}^{2}\,|||\mathrm{polar}(G)|||_{\mathrm{F}}^{2}=\mathrm{rank}(G)\,|||G|||_{\mathrm{nuc}}^{2}$. The left-sided update follows similarly. The final claim follows immediately by invoking Theorems F.9, F.11 and F.12. ∎
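The key step of the proof, that both one-sided constructions reduce to the polar factor, can be checked numerically even for a rank-deficient gradient. In the sketch below, `pinv_sqrt` is a hypothetical helper that forms $S^{\dagger/2}$ for a symmetric positive semidefinite $S$ by inverting the square roots of the eigenvalues above a relative tolerance; the tolerance and test matrix are illustrative assumptions.

```python
import numpy as np

def pinv_sqrt(S, tol=1e-10):
    """Pseudo-inverse square root S^{dagger/2} of a symmetric PSD matrix S."""
    lam, Q = np.linalg.eigh(S)
    inv_sqrt = np.zeros_like(lam)
    keep = lam > tol * lam.max()        # eigenvalues treated as nonzero
    inv_sqrt[keep] = 1.0 / np.sqrt(lam[keep])
    return (Q * inv_sqrt) @ Q.T         # Q Diag(inv_sqrt) Q^T

rng = np.random.default_rng(3)
G = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # rank-2, shape 5 x 4

U, s, Vt = np.linalg.svd(G, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))
polar = U[:, :r] @ Vt[:r, :]            # polar(G) = U V^T restricted to the range
nuc = s[:r].sum()

right = G @ pinv_sqrt(G.T @ G)          # G (G^T G)^{dagger/2}
left = pinv_sqrt(G @ G.T) @ G           # (G G^T)^{dagger/2} G
T = nuc * right                         # nuclear-norm-scaled right-sided update

print(np.allclose(right, polar, atol=1e-8), np.allclose(left, polar, atol=1e-8),
      np.isclose(np.sum(G * T), nuc**2), np.isclose(np.sum(T * T), r * nuc**2))
```

In practice one would avoid explicit eigendecompositions of Gram matrices at scale; the sketch favors transparency over efficiency.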
F.4 Row-Norm-Based Optimizers
We next study row-norm-based optimizers, whose update maps act locally on each row of the gradient matrix. Such methods are especially natural for parameter matrices whose row axis carries the relevant structural symmetry, as in embeddings, LM heads, and MoE routers.
Consider the iteration
$$(\forall k\in\mathbb{N})\qquad W_{k+1}=W_{k}-\gamma_{k}\mathcal{T}_{\mathsf{row}}(G_{k}),\qquad G_{k}\coloneqq\nabla f(W_{k}),\tag{F.6}$$
where $\mathcal{T}_{\mathsf{row}}(G)=D_{\eta}(G)\,G$ and $D_{\eta}(G)\coloneqq\operatorname*{Diag}(\eta(\|G_{1:}\|_{2}),\dots,\eta(\|G_{v:}\|_{2}))$, for some scalar function $\eta\colon\mathbb{R}_{+}\to\mathbb{R}$.
Assumption F.7 (Uniform row-scaling bounds).
There exist constants $0<\underline{\eta}\leqslant\overline{\eta}<\infty$ such that $\underline{\eta}\leqslant\eta(t)\leqslant\overline{\eta}$ for all $t\geqslant 0$.
Lemma F.18 (Alignment and norm bounds for row-norm updates).
Under Assumption F.7, for all $G\in\mathbb{R}^{v\times d}$,
$$\llangle G,\mathcal{T}_{\mathsf{row}}(G)\rrangle_{\mathrm{F}}=\sum_{i=1}^{v}\eta(\|G_{i:}\|_{2})\,\|G_{i:}\|_{2}^{2}\geqslant\underline{\eta}\,|||G|||_{\mathrm{F}}^{2},$$
and
$$|||\mathcal{T}_{\mathsf{row}}(G)|||_{\mathrm{F}}^{2}=\sum_{i=1}^{v}\eta(\|G_{i:}\|_{2})^{2}\,\|G_{i:}\|_{2}^{2}\leqslant\overline{\eta}^{2}\,|||G|||_{\mathrm{F}}^{2}.$$
Proof.
By definition, we have
$$\llangle G,\mathcal{T}_{\mathsf{row}}(G)\rrangle_{\mathrm{F}}=\sum_{i=1}^{v}\llangle G_{i:},\eta(\|G_{i:}\|_{2})\,G_{i:}\rrangle_{\mathrm{F}}=\sum_{i=1}^{v}\eta(\|G_{i:}\|_{2})\,\|G_{i:}\|_{2}^{2}.$$
Using $\eta(\|G_{i:}\|_{2})\geqslant\underline{\eta}$, we obtain $\llangle G,\mathcal{T}_{\mathsf{row}}(G)\rrangle_{\mathrm{F}}\geqslant\underline{\eta}\sum_{i=1}^{v}\|G_{i:}\|_{2}^{2}=\underline{\eta}\,|||G|||_{\mathrm{F}}^{2}$. Similarly, we also have
$$|||\mathcal{T}_{\mathsf{row}}(G)|||_{\mathrm{F}}^{2}=\sum_{i=1}^{v}\eta(\|G_{i:}\|_{2})^{2}\,\|G_{i:}\|_{2}^{2}\leqslant\overline{\eta}^{2}\sum_{i=1}^{v}\|G_{i:}\|_{2}^{2}=\overline{\eta}^{2}\,|||G|||_{\mathrm{F}}^{2}.$$
∎
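A short numerical illustration of the lemma follows. The scaling function $\eta(t)=1+1/(1+t)$ is a hypothetical choice with values in $(1,2]$, so Assumption F.7 holds with $\underline{\eta}=1$ and $\overline{\eta}=2$; the helper `row_norm_update` and the test matrix (including one zero row) are likewise illustrative.

```python
import numpy as np

def row_norm_update(G, eta):
    """Return (D_eta(G) G, row norms, per-row scales) for a scalar function eta."""
    row_norms = np.linalg.norm(G, axis=1)
    scales = eta(row_norms)
    return scales[:, None] * G, row_norms, scales

# Hypothetical bounded scaling function with values in (1, 2].
eta = lambda t: 1.0 + 1.0 / (1.0 + t)
eta_lo, eta_hi = 1.0, 2.0

rng = np.random.default_rng(4)
G = rng.standard_normal((7, 3))
G[2] = 0.0                               # a row that receives no gradient
T, row_norms, scales = row_norm_update(G, eta)

inner = np.sum(G * T)                    # <<G, T_row(G)>>_F
sq_norm = np.sum(T * T)                  # |||T_row(G)|||_F^2
fro_sq = np.sum(G * G)                   # |||G|||_F^2

print(np.isclose(inner, np.sum(scales * row_norms**2)),
      np.isclose(sq_norm, np.sum(scales**2 * row_norms**2)),
      inner >= eta_lo * fro_sq, sq_norm <= eta_hi**2 * fro_sq)
```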
Theorem F.19 (One-step descent for row-norm-based optimizers).
Suppose Assumptions F.1 and F.7 hold. Then the iteration (F.6) satisfies
$$f(W_{k+1})\leqslant f(W_{k})-\left(\underline{\eta}\,\gamma_{k}-\frac{L\overline{\eta}^{2}}{2}\gamma_{k}^{2}\right)|||G_{k}|||_{\mathrm{F}}^{2}.$$
In particular, if $\gamma_{k}\in(0,2\underline{\eta}/(L\overline{\eta}^{2}))$, then $f(W_{k+1})\leqslant f(W_{k})$.
Proof.
By $L$-smoothness of $f$ and $W_{k+1}-W_{k}=-\gamma_{k}\mathcal{T}_{\mathsf{row}}(G_{k})$, we obtain
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\,\llangle G_{k},\mathcal{T}_{\mathsf{row}}(G_{k})\rrangle_{\mathrm{F}}+\frac{L\gamma_{k}^{2}}{2}\,|||\mathcal{T}_{\mathsf{row}}(G_{k})|||_{\mathrm{F}}^{2}.$$
Apply Lemma F.18. ∎
Theorem F.20 (Sublinear convergence to stationarity).
Suppose Assumptions F.1 and F.7 hold, and let $\gamma>0$ be constant with $\gamma\in(0,2\underline{\eta}/(L\overline{\eta}^{2}))$. Then
$$\sum_{k=0}^{T-1}|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{\gamma\left(\underline{\eta}-L\overline{\eta}^{2}\gamma/2\right)},\quad\text{and therefore}\quad\min_{0\leqslant k<T}\,|||G_{k}|||_{\mathrm{F}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{T\gamma\left(\underline{\eta}-L\overline{\eta}^{2}\gamma/2\right)}.$$
Proof.
Sum the descent inequality in Theorem F.19. ∎
Theorem F.21 (Linear convergence under the PŁ condition).
Suppose Assumptions F.1, F.2 and F.7 hold, and let $\gamma>0$ be constant with $\gamma\in(0,2\underline{\eta}/(L\overline{\eta}^{2}))$. Then
$$f(W_{k+1})-f^{\star}\leqslant\left(1-2\mu\gamma\left(\underline{\eta}-\frac{L\overline{\eta}^{2}}{2}\gamma\right)\right)\bigl(f(W_{k})-f^{\star}\bigr).$$
Hence $f(W_{k})-f^{\star}\leqslant\rho_{\mathsf{row}}^{k}\bigl(f(W_{0})-f^{\star}\bigr)$, where $\rho_{\mathsf{row}}=1-2\mu\gamma\left(\underline{\eta}-L\overline{\eta}^{2}\gamma/2\right)\in(0,1)$.
Proof.
Combine Theorem F.19 with the PŁ inequality. ∎
F.4.1 Specialization to Smoothed Row Normalization
A useful smoothed variant of row normalization is obtained by taking $\eta(t)=1/(t+\varepsilon)$ for some $\varepsilon>0$. The corresponding update map is
$$\mathcal{T}_{\mathsf{row},\varepsilon}(G)=\operatorname*{Diag}\left(\frac{1}{\|G_{1:}\|_{2}+\varepsilon},\dots,\frac{1}{\|G_{v:}\|_{2}+\varepsilon}\right)G.$$
Unlike the fully normalized choice $\eta(t)=1/t$, this smoothed row-norm-based map remains bounded at zero and therefore fits naturally into the preceding bounded-scaling framework.
Assumption F.8 (Uniform row-norm upper bound).
There exists $M>0$ such that $\|G_{k,i:}\|_{2}\leqslant M$ for all $k\in\mathbb{N}$, $i\in\llbracket v\rrbracket\coloneqq\{1,\ldots,v\}$.
Corollary F.22 (Descent and convergence for smoothed row normalization).
Suppose Assumptions F.1 and F.8 hold, and let $\eta(t)=1/(t+\varepsilon)$ for some $\varepsilon>0$. Then $1/(M+\varepsilon)\leqslant\eta(\|G_{k,i:}\|_{2})\leqslant 1/\varepsilon$ for all $k\in\mathbb{N}$ and $i\in\llbracket v\rrbracket$. Hence the bounds of Assumption F.7 hold along the iterates with $\underline{\eta}=1/(M+\varepsilon)$ and $\overline{\eta}=1/\varepsilon$. Consequently, the descent, stationarity, and PŁ linear-convergence results of Theorems F.19, F.20 and F.21 apply directly. In particular, any constant learning rate satisfying $\gamma\in(0,2\varepsilon^{2}/(L(M+\varepsilon)))$ guarantees monotonic descent, and under Assumption F.2 one obtains linear convergence with contraction factor
$$\rho_{\mathsf{row},\varepsilon}=1-2\mu\gamma\left(\frac{1}{M+\varepsilon}-\frac{L\gamma}{2\varepsilon^{2}}\right).$$
Proof.
For $t\geqslant 0$, the map $t\mapsto 1/(t+\varepsilon)$ is decreasing, so $1/(M+\varepsilon)\leqslant 1/(\|G_{k,i:}\|_{2}+\varepsilon)\leqslant 1/\varepsilon$. Thus the bounds of Assumption F.7 hold at every row norm encountered along the iterates, with the stated constants, which is all that the proofs of Theorems F.19, F.20 and F.21 use. The conclusions then follow immediately from these theorems. ∎
Thus, the choice $\eta(t)=1/(t+\varepsilon)$ interpolates between the bounded row-norm regime and the fully normalized regime: it preserves the local row-adaptive flavor of normalization while avoiding the singular behavior of $\eta(t)=1/t$ at small row norms.
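The corollary can be illustrated on a toy quadratic $f(W)=\tfrac12|||W-A|||_{\mathrm{F}}^{2}$, for which $L=1$ and, if Assumption F.2 is normalized as $|||\nabla f(W)|||_{\mathrm{F}}^{2}\geqslant 2\mu\,(f(W)-f^{\star})$, $\mu=1$. The sketch below is a hedged example: the objective, the choice $M=|||G_{0}|||_{\mathrm{F}}$ (valid for this objective as long as $f$ decreases, since every row norm is at most $|||G_{k}|||_{\mathrm{F}}=\sqrt{2f(W_{k})}$), the value of $\varepsilon$, and the helper name are all assumptions for illustration.

```python
import numpy as np

def smoothed_row_normalize(G, eps):
    """Row-wise map G_i -> G_i / (||G_i||_2 + eps), i.e. eta(t) = 1 / (t + eps)."""
    row_norms = np.linalg.norm(G, axis=1, keepdims=True)
    return G / (row_norms + eps)

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))          # minimizer of the toy quadratic
W = rng.standard_normal((6, 4))          # starting point

L, mu, eps = 1.0, 1.0, 0.5
M = np.linalg.norm(W - A)                # |||G_0|||_F, used as the row-norm bound
gamma = eps**2 / (L * (M + eps))         # half of the limit 2 eps^2 / (L (M + eps))
rho = 1.0 - 2.0 * mu * gamma * (1.0 / (M + eps) - L * gamma / (2.0 * eps**2))

f = lambda W: 0.5 * np.sum((W - A) ** 2)
values, max_row_norm = [f(W)], 0.0
for _ in range(200):
    G = W - A                            # gradient of the quadratic
    max_row_norm = max(max_row_norm, np.linalg.norm(G, axis=1).max())
    W = W - gamma * smoothed_row_normalize(G, eps)
    values.append(f(W))

monotone = all(b <= a for a, b in zip(values, values[1:]))
contracts = all(b <= rho * a + 1e-12 for a, b in zip(values, values[1:]))
print(rho, monotone, contracts, max_row_norm <= M)
```

The guaranteed factor $\rho_{\mathsf{row},\varepsilon}$ is close to one when $\varepsilon\ll M$, which reflects how conservative the worst-case constants $\underline{\eta}=1/(M+\varepsilon)$ and $\overline{\eta}=1/\varepsilon$ are rather than the typical behavior of the iteration.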
Having established convergence guarantees for right-spectral and row-norm-based optimizers separately, we now turn to their finite compositions. The resulting hybrid methods inherit the geometric structure of the right polar factor together with a local row-wise normalization, and their analyses are naturally expressed through the preserved alignment ratio and the active row support.
F.5 Nuclear-Norm-Scaled Right-Spectral/Row-Norm Hybrid Optimizers
We now study the hybrid optimizer obtained by composing the right polar factor with row-wise normalization. In this construction, the update is first normalized with respect to the feature geometry through the right polar factor, and then normalized locally across rows. After an additional nuclear-norm scaling, both the alignment and the squared Frobenius norm of the resulting update admit explicit expressions, leading to a clean descent analysis in terms of two interpretable quantities: the preserved alignment ratio $\mathfrak{A}_{\mathsf{hyb}}(G)$ and the active row support $s_{\mathsf{row}}(G)$.
Define the right polar factor
$$Z(G)\coloneqq G(G^{\top}G)^{\dagger/2},$$
and let the row-normalized hybrid map $\mathcal{T}_{\mathsf{hyb}}\colon\mathbb{R}^{v\times d}\to\mathbb{R}^{v\times d}$ be given row-wise by
$$\mathcal{T}_{\mathsf{hyb}}(G)_{i:}=\begin{cases}\dfrac{Z(G)_{i:}}{\|Z(G)_{i:}\|_{2}},&Z(G)_{i:}\neq 0,\\[6pt] 0,&Z(G)_{i:}=0.\end{cases}$$
Its nuclear-norm-scaled version is defined by $\mathcal{T}_{\mathsf{hyb},\mathsf{nuc}}(G)\coloneqq|\!|\!|G|\!|\!|_{\mathrm{nuc}}\,\mathcal{T}_{\mathsf{hyb}}(G)$. The corresponding iteration is
$$(\forall k\in\mathbb{N})\qquad W_{k+1}=W_{k}-\gamma_{k}\mathcal{T}_{\mathsf{hyb},\mathsf{nuc}}(G_{k}),\qquad G_{k}\coloneqq\nabla f(W_{k}).\tag{F.7}$$
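The maps defined above can be sketched numerically. The following NumPy snippet is a minimal illustration rather than the paper's implementation: it computes $Z(G)$ from a compact SVD instead of an iterative scheme, and the function names and the tolerance `tol` are assumptions introduced here.

```python
import numpy as np

def right_polar(G, tol=1e-10):
    """Right polar factor Z(G) = G (G^T G)^{+/2}, i.e. U V^T from a compact SVD."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    keep = s > tol * s[0]              # drop (numerically) zero singular values
    return U[:, keep] @ Vt[keep, :]

def hybrid_polar_first(G, tol=1e-10):
    """T_hyb(G): normalize each nonzero row of Z(G); zero rows stay zero."""
    Z = right_polar(G, tol)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    active = norms[:, 0] > tol         # rows counted by s_row(G)
    T = np.zeros_like(Z)
    T[active] = Z[active] / norms[active]
    return T

def hybrid_polar_first_nuc(G, tol=1e-10):
    """T_hyb,nuc(G) = |||G|||_nuc * T_hyb(G), the direction used in (F.7)."""
    return np.linalg.norm(G, ord="nuc") * hybrid_polar_first(G, tol)

# Hypothetical example: a 6 x 3 gradient whose third row is exactly zero.
rng = np.random.default_rng(0)
G = rng.standard_normal((6, 3))
G[2] = 0.0
T = hybrid_polar_first(G)
s_row = int(np.sum(np.linalg.norm(T, axis=1) > 0.5))  # number of unit-norm rows
print(s_row, float(np.sum(T ** 2)))                   # active rows and ||T||_F^2
```

A zero row of $G$ gives a zero row of $Z(G)$, so that row is left untouched by the update; the remaining rows all receive a unit-norm direction before the nuclear-norm scaling is applied.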
Definition F.2 (Active row support of the right polar factor).
For $G\in\mathbb{R}^{v\times d}$, define
$$s_{\mathsf{row}}(G)\coloneqq\sharp\bigl\{i\in\llbracket v\rrbracket:Z(G)_{i:}\neq 0\bigr\},\qquad Z(G)=G(G^{\top}G)^{\dagger/2}.$$
Definition F.3 (Hybrid row-polar alignment ratio).
For $G\neq 0$, define
$$\mathfrak{A}_{\mathsf{hyb}}(G)\coloneqq\frac{\llangle G,\mathcal{T}_{\mathsf{hyb}}(G)\rrangle_{\mathrm{F}}}{|\!|\!|G|\!|\!|_{\mathrm{nuc}}}.$$
The quantity $\mathfrak{A}_{\mathsf{hyb}}(G)$ measures how much of the nuclear-norm alignment of the right polar factor is preserved after row normalization.
Lemma F.23 (Norm and alignment identities).
For every $G\in\mathbb{R}^{v\times d}$, we have $|\!|\!|\mathcal{T}_{\mathsf{hyb}}(G)|\!|\!|_{\mathrm{F}}^{2}=s_{\mathsf{row}}(G)$, $|\!|\!|\mathcal{T}_{\mathsf{hyb},\mathsf{nuc}}(G)|\!|\!|_{\mathrm{F}}^{2}=|\!|\!|G|\!|\!|_{\mathrm{nuc}}^{2}\,s_{\mathsf{row}}(G)$, and
$$\llangle G,\mathcal{T}_{\mathsf{hyb},\mathsf{nuc}}(G)\rrangle_{\mathrm{F}}=|\!|\!|G|\!|\!|_{\mathrm{nuc}}\,\llangle G,\mathcal{T}_{\mathsf{hyb}}(G)\rrangle_{\mathrm{F}}=\mathfrak{A}_{\mathsf{hyb}}(G)\,|\!|\!|G|\!|\!|_{\mathrm{nuc}}^{2}.$$
Proof.
By construction, every nonzero row of $\mathcal{T}_{\mathsf{hyb}}(G)$ has Euclidean norm $1$, while every zero row remains zero. Hence $|\!|\!|\mathcal{T}_{\mathsf{hyb}}(G)|\!|\!|_{\mathrm{F}}^{2}=s_{\mathsf{row}}(G)$. Multiplying by $|\!|\!|G|\!|\!|_{\mathrm{nuc}}$ yields
$$|\!|\!|\mathcal{T}_{\mathsf{hyb},\mathsf{nuc}}(G)|\!|\!|_{\mathrm{F}}^{2}=|\!|\!|G|\!|\!|_{\mathrm{nuc}}^{2}\,|\!|\!|\mathcal{T}_{\mathsf{hyb}}(G)|\!|\!|_{\mathrm{F}}^{2}=|\!|\!|G|\!|\!|_{\mathrm{nuc}}^{2}\,s_{\mathsf{row}}(G).$$
Similarly,
$$\llangle G,\mathcal{T}_{\mathsf{hyb},\mathsf{nuc}}(G)\rrangle_{\mathrm{F}}=|\!|\!|G|\!|\!|_{\mathrm{nuc}}\,\llangle G,\mathcal{T}_{\mathsf{hyb}}(G)\rrangle_{\mathrm{F}}=\mathfrak{A}_{\mathsf{hyb}}(G)\,|\!|\!|G|\!|\!|_{\mathrm{nuc}}^{2}.$$
∎
Theorem F.24 (Descent lemma for the nuclear-norm-scaled hybrid update).
Suppose $f$ satisfies Assumption F.1. Then the iteration (F.7) satisfies
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\mathfrak{A}_{\mathsf{hyb}}(G_{k})\,|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}+\frac{L\gamma_{k}^{2}}{2}\,s_{\mathsf{row}}(G_{k})\,|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
Equivalently,
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\left(\mathfrak{A}_{\mathsf{hyb}}(G_{k})-\frac{L\gamma_{k}}{2}s_{\mathsf{row}}(G_{k})\right)|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
In particular, if $\gamma_{k}\in(0,2\,\mathfrak{A}_{\mathsf{hyb}}(G_{k})/(Ls_{\mathsf{row}}(G_{k})))$, then $f(W_{k+1})\leqslant f(W_{k})$.
Proof.
By $L$-smoothness of $f$ and (F.7), we obtain
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\llangle G_{k},\mathcal{T}_{\mathsf{hyb},\mathsf{nuc}}(G_{k})\rrangle_{\mathrm{F}}+\frac{L\gamma_{k}^{2}}{2}|\!|\!|\mathcal{T}_{\mathsf{hyb},\mathsf{nuc}}(G_{k})|\!|\!|_{\mathrm{F}}^{2}.$$
Now apply Lemma F.23. ∎
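The step-size condition of the descent lemma can be illustrated on a toy problem. The sketch below is an assumption-laden example, not an experiment from the paper: it uses the quadratic $f(W)=\tfrac12|\!|\!|W-W^{\star}|\!|\!|_{\mathrm{F}}^{2}$ (so $L=1$), an SVD-based polar factor, and the step $\gamma_k=\mathfrak{A}_{\mathsf{hyb}}(G_k)/(L\,s_{\mathsf{row}}(G_k))$, which is the midpoint of the admissible interval. The helper name `hybrid_update` is hypothetical.

```python
import numpy as np

def hybrid_update(G, tol=1e-10):
    """Return (T_hyb,nuc(G), A_hyb(G), s_row(G)) for a nonzero gradient G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    keep = s > tol * s[0]
    Z = U[:, keep] @ Vt[keep, :]                 # right polar factor Z(G)
    norms = np.linalg.norm(Z, axis=1)
    active = norms > tol
    T = np.zeros_like(Z)
    T[active] = Z[active] / norms[active, None]  # row-normalized polar factor
    nuc = s.sum()                                # nuclear norm of G
    align = np.sum(G * T) / nuc                  # alignment ratio A_hyb(G)
    return nuc * T, align, int(active.sum())

# Toy objective f(W) = 0.5 * ||W - W_star||_F^2, which is 1-smooth.
rng = np.random.default_rng(0)
W_star = rng.standard_normal((8, 4))
W = rng.standard_normal((8, 4))
f = lambda W: 0.5 * np.sum((W - W_star) ** 2)
L = 1.0

losses = [f(W)]
for _ in range(20):
    G = W - W_star                               # gradient of the quadratic
    step, align, s_row = hybrid_update(G)
    gamma = align / (L * s_row)                  # inside (0, 2 A_hyb / (L s_row))
    W = W - gamma * step
    losses.append(f(W))

print(all(b <= a for a, b in zip(losses, losses[1:])))  # monotone decrease check
```

With this choice of $\gamma_k$, the guaranteed decrease per step is $\mathfrak{A}_{\mathsf{hyb}}(G_k)^{2}\,|\!|\!|G_k|\!|\!|_{\mathrm{nuc}}^{2}/(2Ls_{\mathsf{row}}(G_k))$, which follows by substituting $\gamma_k$ into the second inequality of Theorem F.24.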
Assumption F.9 (Uniform hybrid alignment and row-support bounds).
There exist constants $\underline{a}>0$ and $\overline{s}\in\mathbb{N}^{*}$ such that for all $k\in\mathbb{N}$, $\mathfrak{A}_{\mathsf{hyb}}(G_{k})\geqslant\underline{a}$ and $s_{\mathsf{row}}(G_{k})\leqslant\overline{s}$.
Theorem F.25 (Sublinear convergence to stationarity).
Suppose Assumptions F.1 and F.9 hold. If the learning rate $\gamma$ is constant and satisfies $\gamma\in(0,2\underline{a}/(L\overline{s}))$, then
$$f(W_{k+1})\leqslant f(W_{k})-\gamma\left(\underline{a}-\frac{L\gamma}{2}\overline{s}\right)|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
Consequently,
$$\sum_{k=0}^{T-1}|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}\leqslant\sum_{k=0}^{T-1}|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{\gamma\left(\underline{a}-L\gamma\overline{s}/2\right)}\quad\text{and}\quad\min_{0\leqslant k<T}\,|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{T\gamma\left(\underline{a}-L\gamma\overline{s}/2\right)}.$$
Proof.
By Theorem F.24 and Assumption F.9,
$$f(W_{k+1})\leqslant f(W_{k})-\gamma\left(\underline{a}-\frac{L\gamma}{2}\overline{s}\right)|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
Summing from $k=0$ to $T-1$ gives
$$f(W_{T})\leqslant f(W_{0})-\gamma\left(\underline{a}-\frac{L\gamma}{2}\overline{s}\right)\sum_{k=0}^{T-1}|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
Using $f(W_{T})\geqslant f^{\star}$ yields
$$\sum_{k=0}^{T-1}|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{\gamma\left(\underline{a}-L\gamma\overline{s}/2\right)}.$$
The Frobenius-norm and minimum-gradient-norm bounds follow from $|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}\geqslant|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}$. ∎
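The two elementary facts used in the last step can be spelled out. Writing $\sigma_{1},\ldots,\sigma_{r}>0$ for the nonzero singular values of $G_{k}$ (notation introduced here for illustration only), we have

$$|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}=\Bigl(\sum_{i=1}^{r}\sigma_{i}\Bigr)^{2}=\sum_{i=1}^{r}\sigma_{i}^{2}+2\sum_{i<j}\sigma_{i}\sigma_{j}\geqslant\sum_{i=1}^{r}\sigma_{i}^{2}=|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2},\qquad\min_{0\leqslant k<T}|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}\leqslant\frac{1}{T}\sum_{k=0}^{T-1}|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}.$$

The first inequality gives the Frobenius-norm sum bound, and the second (a minimum is at most an average) produces the factor $1/T$ in the minimum-gradient-norm bound.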
Theorem F.26 (Linear convergence under the PŁ condition).
Suppose Assumptions F.1, F.2 and F.9 hold. If the learning rate is constant and satisfies $\gamma\in(0,2\underline{a}/(L\overline{s}))$, then
$$f(W_{k+1})-f^{\star}\leqslant\left(1-2\mu\gamma\left(\underline{a}-\frac{L\gamma}{2}\overline{s}\right)\right)\bigl(f(W_{k})-f^{\star}\bigr).$$
Hence $f(W_{k})-f^{\star}\leqslant\rho_{\mathsf{hyb}}^{k}\bigl(f(W_{0})-f^{\star}\bigr)$, where $\rho_{\mathsf{hyb}}=1-2\mu\gamma\left(\underline{a}-L\gamma\overline{s}/2\right)\in(0,1)$.
Proof.
By Theorem F.24 and Assumption F.9,
$$f(W_{k+1})\leqslant f(W_{k})-\gamma\left(\underline{a}-\frac{L\gamma}{2}\overline{s}\right)|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
Applying $|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}\geqslant|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}$ and the PŁ inequality gives the desired inequality. ∎
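For completeness, the final step can be written out as follows. This assumes that Assumption F.2 is the PŁ inequality in its standard form $|\!|\!|\nabla f(W)|\!|\!|_{\mathrm{F}}^{2}\geqslant 2\mu\bigl(f(W)-f^{\star}\bigr)$, which is the form consistent with the factor $2\mu$ in the stated rate. With the shorthand $c\coloneqq\gamma\left(\underline{a}-L\gamma\overline{s}/2\right)>0$ (introduced here for illustration),

$$\begin{aligned}f(W_{k+1})-f^{\star}&\leqslant f(W_{k})-f^{\star}-c\,|\!|\!|G_{k}|\!|\!|_{\mathrm{nuc}}^{2}\\&\leqslant f(W_{k})-f^{\star}-c\,|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}\\&\leqslant(1-2\mu c)\bigl(f(W_{k})-f^{\star}\bigr).\end{aligned}$$

Iterating this one-step contraction $k$ times gives the geometric rate $\rho_{\mathsf{hyb}}^{k}$ with $\rho_{\mathsf{hyb}}=1-2\mu c$.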
Switching the order of the two normalizations leads to a genuinely different optimizer. If row normalization is applied first, then the subsequent right-spectral step is computed from the modified Gram matrix $\widetilde{G}^{\top}\widetilde{G}=G^{\top}D_{\eta}(G)^{2}G$, rather than from the original gradient Gram matrix $G^{\top}G$. Thus, right-polar-first preserves the original feature geometry before applying local row-wise normalization, whereas row-normalize-first alters that geometry prior to the spectral step.
F.6 Nuclear-Norm-Scaled Row-Norm/Right-Spectral Hybrid Optimizers
We now turn to the hybrid optimizer obtained by reversing the order of the two normalizations. Instead of first applying the right polar factor and then normalizing rows, we first apply row-wise normalization to the gradient and then compute a right-spectral update from the resulting row-normalized matrix. This construction is genuinely different from the right-spectral/row-norm hybrid in Section F.5. Indeed, the spectral step is now computed from the modified Gram matrix
$$\widetilde{G}^{\top}\widetilde{G}=G^{\top}D_{\eta}(G)^{2}G,$$
rather than from the original feature Gram matrix $G^{\top}G$.
Let $\eta\colon\mathbb{R}_{+}\to\mathbb{R}_{+}$ be a row-scaling function, and define
$$D_{\eta}(G)\coloneqq\operatorname{Diag}\bigl(\eta(\|G_{1:}\|_{2}),\ldots,\eta(\|G_{v:}\|_{2})\bigr).$$
The row-normalized gradient is defined by $\widetilde{G}\coloneqq D_{\eta}(G)G$. For example, the normalized row-norm choice corresponds to
$$\eta(t)=\begin{cases}1/t,&t>0,\\ 0,&t=0,\end{cases}$$
or, in practical implementations, the smoothed variant $\eta(t)=1/(t+\varepsilon)$.
Define the row-norm/right-spectral hybrid map by $\mathcal{T}_{\mathsf{row}\mathsf{hyb}}(G)\coloneqq Z(\widetilde{G})=\widetilde{G}(\widetilde{G}^{\top}\widetilde{G})^{\dagger/2}$. Its nuclear-norm-scaled version is
$$\mathcal{T}_{\mathsf{row}\mathsf{hyb},\mathsf{nuc}}(G)\coloneqq|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}\,\mathcal{T}_{\mathsf{row}\mathsf{hyb}}(G).$$
The corresponding iteration is
$$W_{k+1}=W_{k}-\gamma_{k}\mathcal{T}_{\mathsf{row}\mathsf{hyb},\mathsf{nuc}}(G_{k}),\qquad G_{k}=\nabla f(W_{k}).\tag{F.8}$$
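To make the difference between the two orderings concrete, here is a small NumPy sketch. It is illustrative only: the polar factor is computed by SVD rather than by an iterative method, the smoothed rule $\eta(t)=1/(t+\varepsilon)$ is used for the row scaling, and all function names and the test matrix are assumptions introduced here.

```python
import numpy as np

def right_polar(G, tol=1e-10):
    """Right polar factor Z(G) = G (G^T G)^{+/2} = U V^T from a compact SVD."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    keep = s > tol * s[0]
    return U[:, keep] @ Vt[keep, :]

def row_first_hybrid(G, eps=1e-8):
    """Row-norm/right-spectral map: polar factor of G_tilde = D_eta(G) G."""
    G_tilde = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)
    return right_polar(G_tilde), G_tilde

def polar_first_hybrid(G, tol=1e-10):
    """Right-spectral/row-norm map of Section F.5: normalize the rows of Z(G)."""
    Z = right_polar(G, tol)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    active = norms[:, 0] > tol
    T = np.zeros_like(Z)
    T[active] = Z[active] / norms[active]
    return T

# Hypothetical gradient with very uneven row norms, where the ordering matters.
rng = np.random.default_rng(0)
scales = np.array([[10.0], [1.0], [1.0], [0.1], [0.1], [0.01]])
G = rng.standard_normal((6, 3)) * scales

T_row, G_tilde = row_first_hybrid(G)
T_pol = polar_first_hybrid(G)
update = np.linalg.norm(G_tilde, ord="nuc") * T_row  # scaled direction in (F.8)

# Squared Frobenius norms: rank(G_tilde) for row-first, active rows for polar-first.
print(float(np.sum(T_row ** 2)), float(np.sum(T_pol ** 2)))
```

The row-first output has orthonormal columns on the range of $\widetilde{G}$ but rows of unequal norm, whereas the polar-first output has unit-norm rows but in general no orthonormal columns, so the two updates cannot coincide for a generic gradient.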
Definition F.4 (Row-normalized effective rank).
For $G\in\mathbb{R}^{v\times d}$, define $r_{\mathsf{row}\mathsf{hyb}}(G)\coloneqq\mathrm{rank}(\widetilde{G})$, where $\widetilde{G}=D_{\eta}(G)G$.
Definition F.5 (Row-norm/right-spectral alignment ratio).
For $G\neq 0$, define
$$\mathfrak{A}_{\mathsf{row}\mathsf{hyb}}(G)\coloneqq\frac{\llangle G,\mathcal{T}_{\mathsf{row}\mathsf{hyb}}(G)\rrangle_{\mathrm{F}}}{|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}}\quad\text{with}\quad\widetilde{G}=D_{\eta}(G)G.$$
The quantity $\mathfrak{A}_{\mathsf{row}\mathsf{hyb}}(G)$ measures the alignment between the original gradient $G$ and the right polar factor of its row-normalized version $\widetilde{G}$. Unlike the polar-first hybrid, this alignment is not automatically tied to the nuclear norm of $G$, because the spectral step is computed after row-wise reweighting.
Lemma F.27 (Norm and alignment identities).
For every $G\in\mathbb{R}^{v\times d}$, let $\widetilde{G}=D_{\eta}(G)G$. Then $|\!|\!|\mathcal{T}_{\mathsf{row}\mathsf{hyb}}(G)|\!|\!|_{\mathrm{F}}^{2}=r_{\mathsf{row}\mathsf{hyb}}(G)$, $|\!|\!|\mathcal{T}_{\mathsf{row}\mathsf{hyb},\mathsf{nuc}}(G)|\!|\!|_{\mathrm{F}}^{2}=|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}^{2}\,r_{\mathsf{row}\mathsf{hyb}}(G)$, and $\llangle G,\mathcal{T}_{\mathsf{row}\mathsf{hyb},\mathsf{nuc}}(G)\rrangle_{\mathrm{F}}=\mathfrak{A}_{\mathsf{row}\mathsf{hyb}}(G)\,|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}^{2}$.
Proof.
Since $\mathcal{T}_{\mathsf{row}\mathsf{hyb}}(G)$ is the right polar factor of $\widetilde{G}$, its Frobenius norm squared equals $\mathrm{rank}(\widetilde{G})=r_{\mathsf{row}\mathsf{hyb}}(G)$. Multiplying by $|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}$ gives
$$|\!|\!|\mathcal{T}_{\mathsf{row}\mathsf{hyb},\mathsf{nuc}}(G)|\!|\!|_{\mathrm{F}}^{2}=|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}^{2}\,|\!|\!|\mathcal{T}_{\mathsf{row}\mathsf{hyb}}(G)|\!|\!|_{\mathrm{F}}^{2}=|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}^{2}\,r_{\mathsf{row}\mathsf{hyb}}(G).$$
The alignment identity follows directly from the definition of $\mathfrak{A}_{\mathsf{row}\mathsf{hyb}}(G)$. ∎
Theorem F.28 (Descent lemma for the nuclear-norm-scaled row-norm/right-spectral update).
Suppose $f$ satisfies Assumption F.1. Then the iteration (F.8) satisfies
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\mathfrak{A}_{\mathsf{row}\mathsf{hyb}}(G_{k})\,|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2}+\frac{L\gamma_{k}^{2}}{2}\,r_{\mathsf{row}\mathsf{hyb}}(G_{k})\,|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2},$$
where $\widetilde{G}_{k}=D_{\eta}(G_{k})G_{k}$. Equivalently,
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\left(\mathfrak{A}_{\mathsf{row}\mathsf{hyb}}(G_{k})-\frac{L\gamma_{k}}{2}r_{\mathsf{row}\mathsf{hyb}}(G_{k})\right)|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
In particular, if
$$\gamma_{k}\in\left(0,\,\frac{2\mathfrak{A}_{\mathsf{row}\mathsf{hyb}}(G_{k})}{Lr_{\mathsf{row}\mathsf{hyb}}(G_{k})}\right),$$
then $f(W_{k+1})\leqslant f(W_{k})$.
Proof.
By $L$-smoothness and (F.8),
$$f(W_{k+1})\leqslant f(W_{k})-\gamma_{k}\llangle G_{k},\mathcal{T}_{\mathsf{row}\mathsf{hyb},\mathsf{nuc}}(G_{k})\rrangle_{\mathrm{F}}+\frac{L\gamma_{k}^{2}}{2}|\!|\!|\mathcal{T}_{\mathsf{row}\mathsf{hyb},\mathsf{nuc}}(G_{k})|\!|\!|_{\mathrm{F}}^{2}.$$
Applying Lemma F.27 gives the claim. ∎
Assumption F.10 (Uniform row-norm/right-spectral alignment and rank bounds).
There exist constants $\underline{a}_{\mathsf{row}}>0$ and $\overline{r}_{\mathsf{row}}\in\mathbb{N}^{*}$ such that for all $k\in\mathbb{N}$, $\mathfrak{A}_{\mathsf{row}\mathsf{hyb}}(G_{k})\geqslant\underline{a}_{\mathsf{row}}$ and $r_{\mathsf{row}\mathsf{hyb}}(G_{k})\leqslant\overline{r}_{\mathsf{row}}$.
We also need the following comparability assumption, which is necessary if we want convergence to stationarity in terms of $|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}$ rather than $|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2}$.
Assumption F.11 (Row-normalization comparability).
There exists a constant $\kappa_{\mathsf{row}}>0$ such that for all $k\in\mathbb{N}$, $|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}\geqslant\kappa_{\mathsf{row}}\,|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}$, where $\widetilde{G}_{k}=D_{\eta}(G_{k})G_{k}$.
Theorem F.29 (Sublinear convergence to stationarity).
Suppose Assumptions F.1, F.10 and F.11 hold. If the learning rate $\gamma$ is constant and satisfies $\gamma\in\left(0,2\underline{a}_{\mathsf{row}}/(L\overline{r}_{\mathsf{row}})\right)$, then
$$f(W_{k+1})\leqslant f(W_{k})-\gamma\left(\underline{a}_{\mathsf{row}}-\frac{L\gamma}{2}\overline{r}_{\mathsf{row}}\right)|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
Consequently,
$$\sum_{k=0}^{T-1}|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{\kappa_{\mathsf{row}}^{2}\gamma\left(\underline{a}_{\mathsf{row}}-L\gamma\overline{r}_{\mathsf{row}}/2\right)},$$
and hence
$$\min_{0\leqslant k<T}|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{T\kappa_{\mathsf{row}}^{2}\gamma\left(\underline{a}_{\mathsf{row}}-L\gamma\overline{r}_{\mathsf{row}}/2\right)}.$$
Proof.
By Theorem F.28 and Assumption F.10,
$$f(W_{k+1})\leqslant f(W_{k})-\gamma\left(\underline{a}_{\mathsf{row}}-\frac{L\gamma}{2}\overline{r}_{\mathsf{row}}\right)|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
Summing from $k=0$ to $T-1$ and using $f(W_{T})\geqslant f^{\star}$ yields
$$\sum_{k=0}^{T-1}|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2}\leqslant\frac{f(W_{0})-f^{\star}}{\gamma\left(\underline{a}_{\mathsf{row}}-L\gamma\overline{r}_{\mathsf{row}}/2\right)}.$$
The comparability assumption gives $|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2}\geqslant\kappa_{\mathsf{row}}^{2}\,|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}$, which proves the claim. ∎
Theorem F.30 (Linear convergence under the PŁ condition).
Suppose Assumptions F.1, F.2, F.10 and F.11 hold. If the learning rate $\gamma$ is constant and satisfies $\gamma\in\left(0,2\underline{a}_{\mathsf{row}}/(L\overline{r}_{\mathsf{row}})\right)$, then
$$f(W_{k+1})-f^{\star}\leqslant\left(1-2\mu\kappa_{\mathsf{row}}^{2}\gamma\left(\underline{a}_{\mathsf{row}}-\frac{L\gamma}{2}\overline{r}_{\mathsf{row}}\right)\right)\bigl(f(W_{k})-f^{\star}\bigr).$$
Hence
$$f(W_{k})-f^{\star}\leqslant\rho_{\mathsf{row}\mathsf{hyb}}^{k}\bigl(f(W_{0})-f^{\star}\bigr),$$
where
$$\rho_{\mathsf{row}\mathsf{hyb}}=1-2\mu\kappa_{\mathsf{row}}^{2}\gamma\left(\underline{a}_{\mathsf{row}}-L\gamma\overline{r}_{\mathsf{row}}/2\right)\in(0,1).$$
Proof.
By Theorem F.28 and Assumption F.10,
$$f(W_{k+1})\leqslant f(W_{k})-\gamma\left(\underline{a}_{\mathsf{row}}-\frac{L\gamma}{2}\overline{r}_{\mathsf{row}}\right)|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
Using Assumption F.11,
$$|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2}\geqslant\kappa_{\mathsf{row}}^{2}\,|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}.$$
The PŁ inequality then proves the claimed recursion. Iterating the recursion yields the linear rate. ∎
We now specialize the convergence results to $\eta(t)=1/(t+\varepsilon)$ with $\varepsilon>0$, further assuming a uniform row-norm bound.
Assumption F.12 (Uniform row-norm bound).
There exists $R>0$ such that, for all $k\in\mathbb{N}$, $\max_{i\in\llbracket v\rrbracket}\|G_{k,i:}\|_{2}\leqslant R$.
Lemma F.31 (Verification for $\eta(t)=1/(t+\varepsilon)$).
Let $\varepsilon>0$ and define
$$D_{\varepsilon}(G)=\operatorname{Diag}\left(\frac{1}{\|G_{1:}\|_{2}+\varepsilon},\ldots,\frac{1}{\|G_{v:}\|_{2}+\varepsilon}\right),$$
and $\widetilde{G}=D_{\varepsilon}(G)G$. Then, for every $G\neq 0$,
$$\mathfrak{A}_{\mathsf{row}\mathsf{hyb}}(G)=\frac{\llangle G,\mathrm{polar}(\widetilde{G})\rrangle_{\mathrm{F}}}{|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}}\geqslant\varepsilon.$$
Moreover, if $\max_{i\in\llbracket v\rrbracket}\|G_{i:}\|_{2}\leqslant R$, then
$$|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}\geqslant|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{F}}\geqslant\frac{1}{R+\varepsilon}\,|\!|\!|G|\!|\!|_{\mathrm{F}}.$$
Consequently, along any sequence satisfying Assumption F.12, Assumptions F.10 and F.11 hold with $\underline{a}_{\mathsf{row}}=\varepsilon$, $\overline{r}_{\mathsf{row}}=d$, and $\kappa_{\mathsf{row}}=\frac{1}{R+\varepsilon}$.
Proof.
Let $\widetilde{G}=D_{\varepsilon}(G)G$ and write $\widetilde{U}_{\mathsf{p}}=\mathrm{polar}(\widetilde{G})$. Since $G=D_{\varepsilon}(G)^{-1}\widetilde{G}$, we have $\llangle G,\widetilde{U}_{\mathsf{p}}\rrangle_{\mathrm{F}}=\llangle D_{\varepsilon}(G)^{-1}\widetilde{G},\widetilde{U}_{\mathsf{p}}\rrangle_{\mathrm{F}}$. Let $\widetilde{G}=\widetilde{U}\widetilde{\Sigma}\widetilde{V}^{\top}$ be a compact singular value decomposition. Then $\widetilde{U}_{\mathsf{p}}=\widetilde{U}\widetilde{V}^{\top}$, and therefore
$$\llangle D_{\varepsilon}(G)^{-1}\widetilde{G},\widetilde{U}_{\mathsf{p}}\rrangle_{\mathrm{F}}=\mathrm{tr}\left(\widetilde{\Sigma}\widetilde{U}^{\top}D_{\varepsilon}(G)^{-1}\widetilde{U}\right).$$
Because $D_{\varepsilon}(G)^{-1}=\operatorname{Diag}(\|G_{1:}\|_{2}+\varepsilon,\ldots,\|G_{v:}\|_{2}+\varepsilon)\succcurlyeq\varepsilon I_{v}$, we obtain
$$\mathrm{tr}\left(\widetilde{\Sigma}\widetilde{U}^{\top}D_{\varepsilon}(G)^{-1}\widetilde{U}\right)\geqslant\varepsilon\,\mathrm{tr}(\widetilde{\Sigma})=\varepsilon\,|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}.$$
This proves $\mathfrak{A}_{\mathsf{row}\mathsf{hyb}}(G)\geqslant\varepsilon$.
For the comparability bound, if $\max_{i\in\llbracket v\rrbracket}\|G_{i:}\|_{2}\leqslant R$, then
$$\frac{1}{\|G_{i:}\|_{2}+\varepsilon}\geqslant\frac{1}{R+\varepsilon}.$$
Hence
$$|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{F}}^{2}=\sum_{i=1}^{v}\frac{\|G_{i:}\|_{2}^{2}}{(\|G_{i:}\|_{2}+\varepsilon)^{2}}\geqslant\frac{1}{(R+\varepsilon)^{2}}\,|\!|\!|G|\!|\!|_{\mathrm{F}}^{2}.$$
Since $|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{nuc}}\geqslant|\!|\!|\widetilde{G}|\!|\!|_{\mathrm{F}}$, the claimed comparability bound follows. Finally, $r_{\mathsf{row}\mathsf{hyb}}(G)=\mathrm{rank}(\widetilde{G})\leqslant d$, so we may take $\overline{r}_{\mathsf{row}}=d$. ∎
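The two bounds of Lemma F.31 are easy to probe numerically. The snippet below is a sanity-check sketch under stated assumptions: the polar factor is taken from a compact SVD, the test matrix and the value $\varepsilon=0.1$ are arbitrary choices made here (much larger than the $10^{-8}$ used in practice, so that the alignment bound is visible), and the helper name is hypothetical.

```python
import numpy as np

def lemma_f31_quantities(G, eps):
    """Return (A_rowhyb(G), |||G_tilde|||_nuc, |||G|||_F / (R + eps))."""
    row_norms = np.linalg.norm(G, axis=1, keepdims=True)
    G_tilde = G / (row_norms + eps)               # D_eps(G) G
    U, s, Vt = np.linalg.svd(G_tilde, full_matrices=False)
    keep = s > 1e-12 * s[0]                       # compact SVD of G_tilde
    polar = U[:, keep] @ Vt[keep, :]              # polar(G_tilde) = U V^T
    nuc_tilde = s[keep].sum()                     # nuclear norm of G_tilde
    alignment = np.sum(G * polar) / nuc_tilde     # <G, polar(G_tilde)> / nuc
    R = row_norms.max()                           # tightest row-norm bound for G
    return alignment, nuc_tilde, np.linalg.norm(G) / (R + eps)

# Hypothetical gradient with row norms spread over several orders of magnitude.
rng = np.random.default_rng(0)
G = rng.standard_normal((50, 8)) * rng.uniform(0.01, 5.0, size=(50, 1))
eps = 0.1

alignment, nuc_tilde, lower = lemma_f31_quantities(G, eps)
# Both comparisons are expected to hold by Lemma F.31.
print(bool(alignment >= eps), bool(nuc_tilde >= lower))
```

The proof in fact shows slightly more than the stated bound: the alignment is at least $\min_{i}\|G_{i:}\|_{2}+\varepsilon$, so $\varepsilon$ is only the worst case, attained when some row of $G$ vanishes.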
Corollary F.32 (Convergence for $\eta(t)=1/(t+\varepsilon)$).
Suppose Assumptions F.1 and F.12 hold and consider the row-norm/right-spectral hybrid optimizer with $\eta(t)=1/(t+\varepsilon)$. If the learning rate $\gamma$ is constant and satisfies $\gamma\in(0,2\varepsilon/(Ld))$, then
$$f(W_{k+1})\leqslant f(W_{k})-\gamma\left(\varepsilon-\frac{L\gamma}{2}d\right)|\!|\!|\widetilde{G}_{k}|\!|\!|_{\mathrm{nuc}}^{2}.$$
Moreover, by Theorem F.29 with $\kappa_{\mathsf{row}}=1/(R+\varepsilon)$,
$$\min_{0\leqslant k<T}|\!|\!|G_{k}|\!|\!|_{\mathrm{F}}^{2}\leqslant\frac{(R+\varepsilon)^{2}(f(W_{0})-f^{\star})}{T\gamma(\varepsilon-L\gamma d/2)}.$$
If, in addition, $f$ satisfies the $\mu$-PŁ condition, then
$$f(W_{k})-f^{\star}\leqslant\left(1-\frac{2\mu\gamma(\varepsilon-L\gamma d/2)}{(R+\varepsilon)^{2}}\right)^{k}\bigl(f(W_{0})-f^{\star}\bigr).$$
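As a worked illustration of how the constants in this corollary interact, consider the step-size factor $g(\gamma)\coloneqq\gamma(\varepsilon-L\gamma d/2)$ (notation introduced here). It is a concave quadratic in $\gamma$, so

$$g'(\gamma)=\varepsilon-L\gamma d=0\iff\gamma=\frac{\varepsilon}{Ld},\qquad g\Bigl(\frac{\varepsilon}{Ld}\Bigr)=\frac{\varepsilon^{2}}{2Ld}.$$

This maximizer lies inside the admissible interval $(0,2\varepsilon/(Ld))$, and with it the contraction factor in the PŁ bound becomes $1-\mu\varepsilon^{2}/\bigl(Ld(R+\varepsilon)^{2}\bigr)$. The guaranteed rate therefore degrades quadratically as $\varepsilon\to 0$, which reflects the worst-case nature of the bound $\underline{a}_{\mathsf{row}}=\varepsilon$ from Lemma F.31.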
Appendix G Experimental Details
Note that we reduce the number of hidden layers and the number of experts relative to the original architectures. The main specifications of the modified architectures are given in Table G.1; the detailed designs can be found in the corresponding technical reports [127, 50, 115, 120]. We initialize all 2D weights from a Gaussian distribution with zero mean and standard deviation 0.02.
Table G.1: Modified model architectures of language models.

| Model | Qwen3-0.6B | Gemma 3 1B | OLMoE-1B-7B | gpt-oss |
|---|---|---|---|---|
| # trainable parameters | 625,784,832 | 1,087,138,944 | 2,824,177,664 | 3,467,779,008 |
| $d_{\mathrm{model}}$ | 1024 | 1152 | 2048 | 2048 |
| $d_{\mathrm{ff}}$ | 3072 | 6912 | 1024 | 2048 |
| $n_{\mathrm{layers}}$ | 20 | 18 | 12 | 12 |
| $n_{\mathrm{heads}}$ (Q / KV) | 16 / 8 | 4 / 1 | 16 / 16 | 64 / 8 |
| $d_{\mathrm{heads}}$ | 128 | 256 | 128 | 64 |
| $n_{\mathrm{experts}}$ | 1 | 1 | 32 | 16 |
| $n_{\mathrm{experts}}$ activated | N/A | N/A | 8 | 4 |
| vocabulary size | 151,936 | 262,144 | 50,304 | 201,088 |
| layer norm | RMSNorm | RMSNorm | RMSNorm | RMSNorm |
| activation function | SwiGLU | GeGLU with $\tanh$ | SwiGLU | SwiGLU¹ |

¹ The SwiGLU implementation in gpt-oss is unconventional as it includes clamping and a residual connection.

In the following experiments, we use Polar Express [5] for computing the matrix inverse square root in LeftPolarGradM and Gram Newton–Schulz [168] for computing the orthogonal polar factor directly in HybridPolarGradM. Unless otherwise specified, for Polar Express and Gram Newton–Schulz, we use 5 inner steps with $\varepsilon_{\mathrm{NS}}=10^{-7}$. We use $\varepsilon=10^{-8}$ for all of RowNormM, LeftPolarGradM and HybridPolarGradM. The row-scaling rule in RowNormM and HybridPolarGradM is chosen to be the smoothed row normalization, i.e., $\eta(t)=1/(t+\varepsilon)$ with $\varepsilon=10^{-8}$.
For AdamW on scalar and vector parameters, we use a linear-warmup, cosine-decay learning rate schedule with 100 warmup steps and a half-cosine decay to $0$. For all other parameters, regardless of the choice of optimizer, we use a stable-decay schedule: the learning rate is held at its initial value $\gamma_{0}$ for the first 60% of training steps and then decays linearly to $0$ over the last 40% of training steps. We use the fused implementation of AdamW in PyTorch, while the implementations of RowNormM, LeftPolarGradM, RightPolarGradM and HybridPolarGradM are not optimized with custom kernels, except for the use of Gram Newton–Schulz [168].
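The two schedules described above can be sketched in a few lines of plain Python. This is an illustrative reading of the description, not the authors' training code: the function names, the zero-based step indexing, and the exact warmup convention (`(step + 1) / warmup_steps`) are assumptions made here.

```python
import math

def stable_decay_lr(step, total_steps, base_lr, stable_frac=0.6):
    """Hold base_lr for the first stable_frac of steps, then decay linearly to 0."""
    stable_steps = round(stable_frac * total_steps)
    if step < stable_steps:
        return base_lr
    decay_steps = max(total_steps - stable_steps, 1)
    return base_lr * max(0.0, (total_steps - step) / decay_steps)

def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps=100):
    """Linear warmup for warmup_steps, then a half-cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# Hypothetical run of 1000 steps with base learning rate 1.0.
total = 1000
for step in (0, 100, 599, 800, 1000):
    print(step, stable_decay_lr(step, total, 1.0), warmup_cosine_lr(step, total, 1.0))
```

In a PyTorch training loop, functions of this kind would typically be wrapped in per-parameter-group multipliers, so that the AdamW group for scalar and vector parameters and the groups for matrix parameters follow different schedules.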
We give the training configurations and optimizer hyperparameters of all four model pre-training experiments in the following subsections.
G.1 Qwen3-0.6B-Style Pre-Training
In terms of wall-clock training time, for (a), configurations (i)–(iii) take 7.347 hours, 7.509 hours and 7.369 hours respectively, while for (b), configurations (i)–(iii) take 7.707 hours, 7.832 hours and 7.771 hours respectively. The time difference between (a) and (b) is expected, as HybridPolarGradM for SwiGLU MLP projection matrices incurs additional computational overhead compared with Muon.
Table G.2: Training configurations for Qwen3-0.6B-style pre-training.
Table G.3: Optimizer hyperparameters for Qwen3-0.6B-style pre-training.
G.2 Gemma 3 1B-Style Pre-Training
In terms of wall-clock training time, for (a), configurations (i)–(iii) take 8.303 hours, 8.615 hours and 8.142 hours respectively, whereas for (b), configurations (i)–(iii) take 8.510 hours, 9.059 hours and 8.480 hours respectively.
Table G.4: Training configurations for Gemma 3 1B-style pre-training.
Table G.5: Optimizer hyperparameters for Gemma 3 1B-style pre-training.
G.2.1 Gemma 3 1B-Style Pre-Training Learning Rate Sweep
For this pre-training experiment, we also perform a base learning rate sweep for the embedding and LM head matrices, keeping the learning rates for scalar/vector parameters and the remaining matrices fixed. For simplicity, we set the base learning rates for the embedding and LM head matrices to be equal, although finer tuning is possible.
Table G.6: Base learning-rate sweep for the input embedding and LM head matrices in Gemma 3 1B-style pre-training. We sweep only $\gamma_{0,\mathrm{emb}}=\gamma_{0,\mathrm{head}}$. In setting (a), SwiGLU MLP projection matrices use Muon; in setting (b), they use HybridPolarGradM with a row-norm/right-spectral composition.
We observe from Table G.6 that the validation loss gaps between configuration (iii) and configurations (i)–(ii) remain substantial across the swept base learning rates. As shown in Figure G.1, the separation between the AdamW embedding/LM-head baselines and the symmetry-compatible alternatives is not explained by a single learning-rate choice. Across the swept values of $\gamma_{0,\mathrm{emb}}=\gamma_{0,\mathrm{head}}$, the AdamW curves make comparable or slightly faster initial progress, but remain consistently above the RowNormM and HybridPolarGradM curves later in training. This suggests that the improvement from symmetry-compatible vocabulary-indexed updates is robust to reasonable base learning-rate variation.
Figure G.1: Validation loss curves for the Gemma 3 1B-style embedding/LM-head learning-rate sweep. The swept learning rate is $\gamma_{0,\mathrm{emb}}=\gamma_{0,\mathrm{head}}$; SwiGLU MLP projection matrices use HybridPolarGradM with a row-norm/right-spectral composition. Across the sweep, RowNormM and HybridPolarGradM remain consistently better than AdamW for the input embedding and LM head matrices.
G.2.2 Gemma 3 1B-Style Pre-Training Across Random Seeds
To assess the robustness of the observed optimizer ordering, we repeat the Gemma 3 1B-style pre-training experiment with two additional random seeds. As in Section 4.2, we consider the same three optimizer assignments for the input embedding and LM head matrices: (i) RowNormM, (ii) HybridPolarGradM, and (iii) AdamW. In all runs in this subsection, the SwiGLU MLP projection matrices use HybridPolarGradM with a row-norm/right-spectral composition.
(a) Second random seed.
(b) Third random seed.
Figure G.2: Training and validation losses for Gemma 3 1B-style pre-training under two additional random seeds. In each subfigure, the three configurations differ only in the optimizer assigned to the input embedding and LM head matrices: RowNormM, HybridPolarGradM, or AdamW. The SwiGLU MLP projection matrices use HybridPolarGradM in all runs.
Table G.7: Final validation losses across three random seeds for Gemma 3 1B-style pre-training. The optimizer assignment varies only for the input embedding and LM head matrices; SwiGLU MLP projection matrices use HybridPolarGradM throughout.
As shown in Table G.7, the qualitative ordering is stable across random seeds. Both symmetry-compatible optimizer assignments, RowNormM and HybridPolarGradM, consistently outperform the AdamW baseline on the vocabulary-indexed matrices. HybridPolarGradM achieves the lowest mean final validation loss, while RowNormM also provides a clear improvement over AdamW. These results indicate that the gains observed in the main Gemma 3 1B-style experiment are not an artifact of a single random initialization.
G.3 OLMoE-1B-7B-Style Pre-Training
In terms of wall-clock training time, configurations (i)–(iv) take 8.607 hours, 8.661 hours, 8.685 hours and 8.686 hours respectively.
Table G.8: Training configurations for OLMoE-1B-7B-style pre-training.
Table G.9: Optimizer hyperparameters for OLMoE-1B-7B-style pre-training.
G.4 Downsized gpt-oss Pre-Training
In terms of wall-clock training time, configurations (i)–(iv) take 14.45 hours, 14.25 hours, 14.82 hours and 14.80 hours respectively.
Table G.10: Training configurations for downsized gpt-oss pre-training.
Table G.11: Optimizer hyperparameters for downsized gpt-oss pre-training.