Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Summary
Researchers introduce symmetry-compatible optimizers whose update rules are equivariant under the symmetry group of each weight block, in contrast to coordinate-wise methods such as Adam and AdamW. In pre-training experiments on Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures, these updates consistently improve final validation loss over the corresponding AdamW updates, and in several cases also improve training stability.
Cached at: 05/19/26, 10:34 PM
Source: https://huggingface.co/papers/2605.18106
Abstract
Researchers developed symmetry-compatible optimizers that respect the equivariance structures of neural network parameters, improving training stability and performance over traditional coordinate-wise methods like Adam.
A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
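The equivariance condition in the abstract can be checked numerically on a toy gradient. The sketch below is illustrative only and is not taken from the paper: it assumes that a "row-norm" update means scaling each gradient row to unit Euclidean norm, and it uses a coordinate-wise sign update as a rough stand-in for Adam-like behavior. The helper names (row_norm_update, sign_update, max_abs_diff) are hypothetical. An update map u is treated as equivariant under a transformation T when u(T(G)) equals T(u(G)).

```python
import math

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def row_norm_update(G, eps=1e-12):
    """Assumed row-norm direction: scale every row of G to unit Euclidean norm."""
    out = []
    for row in G:
        n = math.sqrt(sum(x * x for x in row))
        out.append([x / (n + eps) for x in row])
    return out

def sign_update(G):
    """Coordinate-wise sign, a simplified proxy for an Adam-like update."""
    return [[(x > 0) - (x < 0) for x in row] for row in G]

def max_abs_diff(A, B):
    """Largest entrywise gap between two matrices of equal shape."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def rotation(theta):
    """2x2 rotation matrix, an element of the orthogonal group."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

G = [[3.0, 4.0], [1.0, -2.0], [-0.5, 0.25]]   # toy 3x2 gradient block
P = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]         # permutation acting on rows
Q = rotation(0.7)                             # rotation acting on columns

# Row permutation: compare update(P G) with P update(G).
perm_gap_rownorm = max_abs_diff(row_norm_update(matmul(P, G)),
                                matmul(P, row_norm_update(G)))
perm_gap_sign = max_abs_diff(sign_update(matmul(P, G)),
                             matmul(P, sign_update(G)))

# Right rotation: compare update(G Q) with update(G) Q.
rot_gap_rownorm = max_abs_diff(row_norm_update(matmul(G, Q)),
                               matmul(row_norm_update(G), Q))
rot_gap_sign = max_abs_diff(sign_update(matmul(G, Q)),
                            matmul(sign_update(G), Q))

print("row permutation gap:", perm_gap_rownorm, perm_gap_sign)
print("right rotation gap: ", rot_gap_rownorm, rot_gap_sign)
```

Under these assumptions, both updates commute with the row permutation, but only the row-normalized update commutes with the rotation, because right-multiplying by an orthogonal matrix leaves row norms unchanged while it does not preserve entrywise signs. This is the kind of mismatch between coordinate-wise updates and a block's symmetry group that the abstract describes; the paper's actual constructions may differ in scaling, momentum, and other details.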
Get this paper in your agent:
hf papers read 2605.18106
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
@timlautk: 1/4 New paper with @weijie444! We introduce a symmetry-compatible principle for LLM optimizer design and, as a byproduc…
Introduces a symmetry-compatible principle for LLM optimizer design, yielding a layerwise optimizer stack with principled updates for embeddings, LM heads, SwiGLU MLPs, and MoE routers, showing improved validation loss over AdamW across multiple architectures.
DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models
Introduces DualOptim+, an optimization framework for LLM unlearning that uses shared base states and decoupled delta states to balance forgetting and retaining objectives, with a quantized variant for reduced memory.
Symmetry in the Wild: The Role of Equivariance in Neural Fluid Surrogates
This paper investigates the role of group-equivariant architectures in neural fluid dynamics surrogates, introducing the AB-GATr model. It finds that equivariance is beneficial when data lacks strong alignment, but can degrade performance on highly aligned datasets.
Aurora: A Leverage-Aware Optimizer for Rectangular Matrices
Tilde Research introduces Aurora, a new optimizer designed to prevent neuron death in MLP layers while maintaining orthogonality, achieving state-of-the-art results on nanoGPT benchmarks and 100x data efficiency on 1B models.
Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap
This paper introduces a Jacobian-PCA-Grassmann framework to analyze the geometric structure of expert specialization in Mixture-of-Experts (MoE) Transformers. It finds that experts exhibit strong functional decorrelation while their representations overlap, and that routing sparsity significantly influences this geometry.