Measuring Dead Directions: Decomposing and Classifying Singular Structure off Canonical Alignment

arXiv cs.LG Papers

Summary

This paper presents a descent-free, alignment-free method for measuring singular structure in trained neural networks. It recovers the order of each dead direction from the directional Fisher rate, separates genuine singularities from flat gauge symmetries, and demonstrates the technique on transformer, convolutional, and normalisation layers.

arXiv:2607.00603v1 Announce Type: new Abstract: We give a descent-free, alignment-free measurement of singular structure on trained networks. At a single frozen checkpoint the read recovers the order $k$ of each dead direction from the directional-Fisher rate, the master invariant from which the per-direction learning coefficient $1/(2k)$ follows exactly, in whatever basis the optimizer left. The same read classifies each direction, separating a genuine singularity, whose order the architecture fixes, from a flat gauge symmetry; the directional-Fisher magnitude settles the cases the order cannot. A pluggable detector supplies the directions for transformer, convolutional, and normalisation layers. The read recovers the architecture-predicted order across constructed cells and trained networks, including a fine-tuned vision transformer whose dead structure is the LayerNorm-kernel gauge and a from-scratch one whose compressed MLP forms a node-death at its activation order. Where the singular structure enumerates, the per-direction orders assemble, through the typed intersection of the loci, into the global coefficient $(\lambda, m)$ matching the closed form. The method removes the canonical-alignment and descent preconditions of the underlying rate result, turning order-recovery into a deterministic, architecture-general reading. We then map its reach into the Watanabe triple: the order determines the universal singular fluctuation $\nu(k)$, though a trained network's realized $\nu$ falls below it as the live structure absorbs the dead direction's data fluctuation, and the multiplicity recovers from the dominant structure under a single-locus assumption.

# Decomposing and Classifying Singular Structure off Canonical Alignment
Source: [https://arxiv.org/html/2607.00603](https://arxiv.org/html/2607.00603)

Tejas Pradeep Shirodkar IIIT, Hyderabad

###### Abstract

We give a descent-free, alignment-free measurement of singular structure on trained networks. At a single frozen checkpoint the read recovers the order $k$ of each dead direction from the directional-Fisher rate, the master invariant from which the per-direction learning coefficient $1/(2k)$ follows exactly, in whatever basis the optimizer left. The same read classifies each direction, separating a genuine singularity, whose order the architecture fixes, from a flat gauge symmetry; the directional-Fisher magnitude settles the cases the order cannot. A pluggable detector supplies the directions for transformer, convolutional, and normalisation layers. The read recovers the architecture-predicted order across constructed cells and trained networks, including a fine-tuned vision transformer whose dead structure is the LayerNorm-kernel gauge and a from-scratch one whose compressed MLP forms a node-death at its activation order. Where the singular structure enumerates, the per-direction orders assemble, through the typed intersection of the loci, into the global coefficient $(\lambda,m)$ matching the closed form. The method removes the canonical-alignment and descent preconditions of the underlying rate result, turning order-recovery into a deterministic, architecture-general reading. We then map its reach into the Watanabe triple: the order determines the universal singular fluctuation $\nu(k)$, though a trained network’s realized $\nu$ falls below it as the live structure absorbs the dead direction’s data fluctuation, and the multiplicity recovers from the dominant structure under a single-locus assumption.

## 1 Introduction

A dead direction is the object two traditions see at the same point. From Amari’s information geometry (Amari, [2016](https://arxiv.org/html/2607.00603#bib.bib1)) it is a direction in which the Fisher metric loses non-degeneracy. From Watanabe’s singular learning theory (Watanabe, [2009](https://arxiv.org/html/2607.00603#bib.bib29)) it is tangent to the analytic singular set, where the Kullback–Leibler divergence vanishes to an integer order that resolution of singularities recovers. The two readings name the same vector, and that order $k$ is the invariant that bridges them. The trajectory-rate result of Shirodkar ([2026b](https://arxiv.org/html/2607.00603#bib.bib23)) reads $k$ in original parameter coordinates, without resolution: move the parameters along a dead direction $u$, $\theta(t)=\theta_0+tu$, and the directional Fisher decays as $u^{\top}F(\theta(t))\,u=\Theta(t^{2(k-1)})$, so the log-log slope $2(k-1)$ returns $k$ and the per-direction learning coefficient $\lambda=1/(2k)$.

![Refer to caption](https://arxiv.org/html/2607.00603v1/x1.png)

Figure 1: Reading the order off canonical alignment. (a) A trained network leaves a dead direction $u$ rotated off the coordinate axes; we construct $u$ as the joint mode of the K-FAC factors $A\otimes G$ and scan the directional Fisher out from the frozen checkpoint $\theta_0$, with no descent and no alignment. (b) On a real gelu transformer block whose dead direction is rotated off the axes, the off-canonical joint-mode read recovers the activation order ($\hat{k}=1.95$, $k=2$), while a per-coordinate scan along an axis follows the wrong direction and reads a deviant order ($\hat{k}=1.26$). The slope in the purity-matched window (shaded) returns $k$.

Singular learning theory characterises a trained network by this learning coefficient together with the multiplicity $m$ and the singular fluctuation $\nu$, the Watanabe triple $(\lambda,m,\nu)$ that controls the Bayesian free energy and the widely applicable information criterion (WAIC) (Watanabe, [2018](https://arxiv.org/html/2607.00603#bib.bib30)). Reading the triple on a real network is costly and preconditioned. The standard estimator of the learning coefficient samples the posterior with stochastic-gradient Langevin dynamics (SGLD) (Lau et al., [2025](https://arxiv.org/html/2607.00603#bib.bib15)); it returns a single calibrated scalar, requires per-model tuning of the sampler, does not localise to a network coordinate, and does not isolate $m$ or $\nu$. The rate read above is cheaper, but the clean per-layer version of Shirodkar ([2026b](https://arxiv.org/html/2607.00603#bib.bib23)) assumes canonical alignment, the dead direction being the same coordinate at every layer, and a descending, theorem-compatible optimizer. Trained networks meet these conditions only in part.

We give a measurement methodology that applies wherever a dead direction has formed, the regime where a network carries singular structure to read. Given one, the pipeline is *detect then read*: a detector locates the direction, and one descent-free scan at a single frozen checkpoint reads its order from the directional Fisher rate, with a purity-matched window isolating the $t^{2(k-1)}$ regime. The read returns the order $k$ per direction, hence $\lambda_{\mathrm{dir}}=1/(2k)$, with the dead-subspace dimension alongside, so the single coefficient that posterior sampling reports resolves into the per-direction structure it sums over. The scan also reads which kind of direction it found. A finite order marks a genuine degeneracy: a *node-death*, a hidden unit whose incoming and outgoing weights have both collapsed, whose order is the unit activation’s local analytic order, or a depth-induced singularity of a deep linear map, whose order is the depth. A directional Fisher that stays at the floor marks a gauge direction, a symmetry of the architecture that adds to the multiplicity without carrying a finite order. Separating the two takes the magnitude of the Fisher: a curved gauge orbit read along its tangent imitates a finite order on the slope alone. We exercise the read across a taxonomy of dead-structure types, from constructed node-deaths and deep-linear depth singularities to a real fine-tuned vision transformer’s LayerNorm-kernel gauge and a from-scratch one’s rotated node-death (Section [4](https://arxiv.org/html/2607.00603#S4)).

The detector is the part of the pipeline that changes with the architecture, since the structure that exposes a dead direction changes with it. A generic layer surfaces the direction as a near-kernel of a K-FAC (Kronecker-factored approximate curvature) factor (Martens and Grosse, [2015](https://arxiv.org/html/2607.00603#bib.bib17)); the detector forms the activation factor and the gradient factor and reads the direction off whichever separates it more cleanly, the activation–gradient duality of Shirodkar ([2026b](https://arxiv.org/html/2607.00603#bib.bib23)). A convolutional layer replaces the activation factor with a spatial-patch covariance, and the read runs on that. A LayerNorm transformer needs no scan, since the kernel of its normalisation scale gives the direction in closed form. The read downstream is identical in each case.

The read and the posterior sampler are complementary. The read is deterministic and needs no canonical alignment or descent, and the sampler supplies the single calibrated coefficient the read decomposes into per-direction structure. Beyond the order and its exact coefficient $1/(2k)$, the read reaches the universal fluctuation $\nu(k)$ through the order and the multiplicity through the dominant structure, though a trained network realizes both only in part.

##### Contributions.

1. A descent-free, alignment-free read of the per-direction order $k$, hence $\lambda_{\mathrm{dir}}=1/(2k)$, and of the dead-subspace dimension, at a single frozen checkpoint (Figure [1](https://arxiv.org/html/2607.00603#S1.F1)). The read rests on duality-based identification, on constructing the dead mode from the factors rather than searching the Fisher spectrum for it, and on a purity-matched rate window (Section [3](https://arxiv.org/html/2607.00603#S3)).
2. A detect-then-read pipeline whose detector adapts to the architecture (a K-FAC dual-factor scan, a convolutional channel-death factor, or the algebraic LayerNorm-kernel direction) while the read stays fixed (Section [3](https://arxiv.org/html/2607.00603#S3)).
3. A taxonomy that classifies each dead direction as a genuine singularity, whose finite order the architecture fixes (a node-death at the activation order, a depth singularity at the network depth, a unit-overlap merging), or a flat gauge that adds to the multiplicity without an order (the LayerNorm-kernel, attention-rotation, and cross-entropy-shift gauges), with the magnitude criterion that tells a curved gauge orbit from a finite order. We populate it across constructed cells, a from-scratch vision transformer whose compressed MLP forms a genuine node-death, and a real fine-tuned one whose dead structure reads as the LayerNorm-kernel and attention-rotation gauges (Section [4](https://arxiv.org/html/2607.00603#S4), Table [2](https://arxiv.org/html/2607.00603#S4.T2)).
4. The optimizer-dependent geometry the read reports: a standard optimizer can leave a deep network’s dead structure too diffuse to carry an order, or a shallower network’s rotated off the coordinate axes, where a per-coordinate scan misses it, while an orthogonalising optimizer forms a clean dead structure the off-canonical read recovers (Section [5](https://arxiv.org/html/2607.00603#S5)).
5. The global coefficient where the singular structure enumerates: the per-direction orders assemble, through the typed intersection of the loci (transversal, separable, tangency, or determinantal), into the global $(\lambda,m)$ matching Aoyagi’s closed form on the analytic cells, set beside the posterior sampler where enumeration is open (Section [7](https://arxiv.org/html/2607.00603#S7), Appendix [D.3](https://arxiv.org/html/2607.00603#A4.SS3)).
6. A map of the read’s reach into the rest of the Watanabe triple: the order determines the universal singular fluctuation $\nu(k)$, which we confirm by sampling on an isolated order-$k$ direction; a trained network’s realized $\nu$ falls below $\nu(k)$ as the live structure absorbs the dead direction’s data fluctuation, a suppression we isolate and measure; and the multiplicity recovers from the dominant structure under a single-dominant-locus assumption (Section [6](https://arxiv.org/html/2607.00603#S6)).

## 2 Background

##### The rate primitive.

Along a dead direction the directional Fisher decays as $t^{2(k-1)}$ for the direction’s KL order $k$, a measure of how flat the loss is along the direction: the larger $k$, the more derivatives vanish before the loss responds. The order is the invariant both traditions read in original coordinates (Shirodkar, [2026b](https://arxiv.org/html/2607.00603#bib.bib23); Watanabe, [2009](https://arxiv.org/html/2607.00603#bib.bib29)). That order fixes the direction’s local threshold $\lambda_{\mathrm{dir}}=1/(2k)$. The global threshold collects the per-direction contributions, summing $\sum_{i}1/(2k_{i})$ over independent dead directions but taking $\min_{i}1/(2k_{i})$ where directions meet in a normal crossing (their singular loci intersect transversally), so a global value needs both the per-direction orders and the crossing structure that combines them.
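A one-dimensional sanity check of the local threshold may help here. It assumes, for illustration only, a regression-type model whose output along the dead direction grows as $t^{k}$, so that the KL divergence $K$ behaves as $t^{2k}$; the partition integral $Z_n$ over the unit interval is our choice of toy prior, not a construction from the paper:

$$
K(\theta_0+tu)\asymp t^{2k},\qquad u^{\top}F(\theta(t))\,u\asymp\Big(\tfrac{d}{dt}\,t^{k}\Big)^{2}=k^{2}\,t^{2(k-1)},
$$

$$
Z_n=\int_0^1 e^{-n\,t^{2k}}\,dt\;\asymp\;n^{-1/(2k)}\quad\Longrightarrow\quad -\log Z_n=\frac{1}{2k}\log n+O(1).
$$

The coefficient of $\log n$ is $\lambda_{\mathrm{dir}}=1/(2k)$, and $k=1$ recovers the regular value $1/2$ per parameter.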

##### The Watanabe triple.

A regular model, one with an identifiable parameter and a non-degenerate Fisher metric, has its Bayesian free energy and generalization error set by half the parameter count, $d/2$. Neural networks are singular: distinct parameters realize the same function, and the Fisher degenerates on the set $\Sigma_T$ of optimal parameters, so $d/2$ no longer applies (Watanabe, [2009](https://arxiv.org/html/2607.00603#bib.bib29)). Singular learning theory replaces it with three invariants of that singular structure. The learning coefficient, or real log canonical threshold, $\lambda$ is the effective complexity: it is the coefficient of the leading correction to the free energy $F_n=nL_0+\lambda\log n-(m-1)\log\log n+O(1)$, equals $d/2$ for a regular model, falls below it as the structure grows more degenerate, and governs the Bayes generalization error. The multiplicity $m$ counts the components of $\Sigma_T$ that achieve this minimal $\lambda$, and sets the $\log\log n$ term. The singular fluctuation $\nu$ governs the gap between generalization and training loss, through the widely applicable information criterion $\mathrm{WAIC}=T_n+2\nu/n$ with $T_n$ the training loss (Watanabe, [2018](https://arxiv.org/html/2607.00603#bib.bib30)), and also reduces to $d/2$ in the regular case. For the analytic singular models, reduced-rank regression and deep linear networks, the triple is known in closed form (Aoyagi and Watanabe, [2005](https://arxiv.org/html/2607.00603#bib.bib4); Aoyagi, [2024](https://arxiv.org/html/2607.00603#bib.bib3)); these are the ground truth the paper calibrates against.

##### The two preconditions.

The clean per-layer rate read of Shirodkar ([2026b](https://arxiv.org/html/2607.00603#bib.bib23)) holds under two conditions that a trained network meets only in part: *canonical alignment*, so that the dead direction occupies one coordinate at every layer and a per-coordinate scan follows it, and *descent* under a theorem-compatible optimizer, so that the rate is read along the approach to the singularity. We keep neither, asking only that a dead direction has formed and reading it at one frozen checkpoint in whatever basis the optimizer leaves. In place of the descent, the read scans a synthetic displacement $\theta_0+tu$ out from the checkpoint along the nominated direction $u$, with the scale $t$ set by the read, and takes the order from the growth of the directional Fisher along that scan. No training trajectory enters; the checkpoint need only sit at the singularity in $u$, the one precondition that remains.

##### K-FAC factors and the A–G duality.

The parameter Fisher of a layer $y=Wx$ factorises (Martens and Grosse, [2015](https://arxiv.org/html/2607.00603#bib.bib17)) as $F_W\approx A\otimes G$, with input covariance $A=\mathbb{E}[xx^{\top}]$ and output-gradient covariance $G=\mathbb{E}[gg^{\top}]$ for $g=\partial L/\partial y$, and its smallest-Fisher direction is the rank-one lift $g_{\min}a_{\min}^{\top}$. The two factors are dual, with $\lambda_{\min}(A_{\ell})\,\lambda_{\min}(G_{\ell})=\Theta(t^{2(L-1)})$ the same at every layer $\ell$ of the depth-$L$ network (Shirodkar, [2026b](https://arxiv.org/html/2607.00603#bib.bib23), Thm. 3), so a detector can read the dead direction off whichever factor separates it at a given layer. We accumulate both factors as the *true* Fisher (label-resampled and Monte-Carlo estimated, true-MC in the tables), resampling the labels from the model’s own predictive distribution. The *empirical* Fisher, built from the data labels, carries the model’s fit error: there $g$ is the residual at the observed label, which vanishes only at a perfect fit and contributes no curvature. Resampling removes that term, so $\lambda_{\min}$ and the effective rank (Roy and Vetterli, [2007](https://arxiv.org/html/2607.00603#bib.bib21)) stay geometric quantities. The loss-landscape degeneracy line of Bushnaq et al. ([2024](https://arxiv.org/html/2607.00603#bib.bib5)) reads the same activation and gradient structure at leading order and bounds the higher-order content by the Hessian rank, while the read here scans the nominated direction for the order $k$ that the rank leaves unmeasured.
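The Kronecker structure above can be checked numerically. The following is a minimal sketch, not the paper's implementation: the samples `x` and `g` are synthetic stand-ins (each with one weakly used coordinate, an assumption made for illustration), and the column-major `vec` convention is chosen so that $\mathrm{vec}(g a^{\top}) = a \otimes g$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 512, 4, 3

# Synthetic stand-ins for layer inputs x and output gradients g = dL/dy.
# One coordinate of each is scaled down to mimic a nearly unused direction.
x = rng.normal(size=(n, d_in)) * np.array([1.0, 1.0, 1.0, 0.1])
g = rng.normal(size=(n, d_out)) * np.array([1.0, 0.1, 1.0])

A = x.T @ x / n  # input second moment, E[x x^T]
G = g.T @ g / n  # output-gradient second moment, E[g g^T]

# eigh returns eigenvalues in ascending order, so column 0 is the near-kernel.
lam_A, vec_A = np.linalg.eigh(A)
lam_G, vec_G = np.linalg.eigh(G)
a_min, g_min = vec_A[:, 0], vec_G[:, 0]

# Rank-one lift in the shape of W (d_out x d_in).
U = np.outer(g_min, a_min)

# Kronecker-factored Fisher acting on the column-major vectorisation of W.
F = np.kron(A, G)
u = U.flatten(order="F")          # vec(g a^T) = a (x) g
lam_joint = lam_A[0] * lam_G[0]   # smallest eigenvalue of A (x) G
```

Under these assumptions `u` is the bottom eigenvector of `F`, and its eigenvalue is the product of the two smallest factor eigenvalues, which is why either factor alone can nominate the direction.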

##### Node-death.

A hidden unit whose incoming weights $c_{\mathrm{fc}}[j,:]$ and outgoing weights $c_{\mathrm{proj}}[:,j]$ have both collapsed contributes nothing to the output, the fully-dead-unit configuration that singular learning theory studies as the canonical two-layer singularity (Carroll, [2021](https://arxiv.org/html/2607.00603#bib.bib6); Farrugia-Roberts, [2022](https://arxiv.org/html/2607.00603#bib.bib8), [2023](https://arxiv.org/html/2607.00603#bib.bib9)); we call it a *node-death*. Scaling the two halves together makes the unit’s contribution bilinear, so the output grows as $t^{k}$ and the directional Fisher as $t^{2(k-1)}$ with $k$ the activation’s local analytic order, $3$ for squared-ReLU and $2$ for gelu or ReLU. Scaling one half alone leaves the function unchanged, a *gauge* direction of the hidden-reparametrisation symmetry $W_{\ell}\mapsto MW_{\ell},\ W_{\ell+1}\mapsto W_{\ell+1}M^{-1}$, joined by the LayerNorm centering symmetry. These gauge directions occupy the Fisher’s near-zero floor (Shirodkar, [2026b](https://arxiv.org/html/2607.00603#bib.bib23), Thm. 3) and carry no order. At a frozen checkpoint we count the dead subspace, the genuine node-deaths at the dominant order, excluding the gauge directions and the below-floor numerical near-kernels. Under a normal crossing of equal-order directions this count is the analytic multiplicity $m$ (the constructed overlap cell returns $m=n-1$); on a trained network it is the dead-subspace dimension, which equals $m$ only under that crossing assumption and otherwise tracks the compressing subspace as training proceeds. We report the dead-subspace dimension on the trained networks and reserve $m$ for the analytic value we verify on the constructed cells.
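The node-death orders quoted above can be reproduced on a toy unit. This is a sketch under stated assumptions rather than the paper's pipeline: a scalar unit $f(x)=b\,\sigma(ax)$ with both weights collapsed at $\theta_0=(0,0)$, a joint direction $u=(1,1)/\sqrt{2}$, and a unit-noise Gaussian regression Fisher. The function and variable names are hypothetical.

```python
import numpy as np

def directional_fisher(t, act, dact, x):
    """u^T F u at theta = t * u for f(x) = b * act(a * x), u = (1, 1)/sqrt(2)."""
    a = b = t / np.sqrt(2.0)
    # Per-sample Jacobian of f with respect to (a, b).
    J = np.stack([b * dact(a * x) * x, act(a * x)], axis=1)
    u = np.array([1.0, 1.0]) / np.sqrt(2.0)
    # For unit-noise Gaussian regression the Fisher is E[J^T J].
    return np.mean((J @ u) ** 2)

def recovered_order(act, dact, x, ts):
    """Fit the log-log slope of the scan and convert it with k = 1 + slope / 2."""
    fisher = np.array([directional_fisher(t, act, dact, x) for t in ts])
    slope = np.polyfit(np.log(ts), np.log(fisher), 1)[0]
    return 1.0 + slope / 2.0

relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)
sq_relu = lambda z: np.maximum(z, 0.0) ** 2
dsq_relu = lambda z: 2.0 * np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=2048)
ts = np.logspace(-3, -1, 20)  # scan scale t out from the collapsed checkpoint

k_relu = recovered_order(relu, drelu, x, ts)
k_sq = recovered_order(sq_relu, dsq_relu, x, ts)
```

Because both activations are positively homogeneous, the scan is an exact power law here, so the fit returns the order without any window selection; a real network would need the purity-matched window of Section 3.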

##### Related work.

The standard estimator of the learning coefficient is the SGLD local learning coefficient (Lau et al., [2025](https://arxiv.org/html/2607.00603#bib.bib15)), extended to a weight subset by the refined LLC (Wang et al., [2025](https://arxiv.org/html/2607.00603#bib.bib28)) and across training checkpoints by the stagewise-development reading (Hoogland et al., [2024](https://arxiv.org/html/2607.00603#bib.bib12)). The geometry read is the deterministic, sampling-free counterpart of that family: it returns the per-direction order at a frozen checkpoint and is set beside the sampler in Section [7](https://arxiv.org/html/2607.00603#S7), and its developmental tracking (Appendix [B.5](https://arxiv.org/html/2607.00603#A2.SS5)) reads without sampling the same structure the stagewise LLC follows. The loss-landscape degeneracy programme of Bushnaq et al. ([2024](https://arxiv.org/html/2607.00603#bib.bib5)) diagonalises the same K-FAC factors $A$ and $G$ to expose a sparsely interacting feature basis; the read here scans those factors instead for the order the Hessian rank leaves unmeasured. The wider singular-learning landscape in deep learning is surveyed by the theory paper (Shirodkar, [2026b](https://arxiv.org/html/2607.00603#bib.bib23)).

This paper is the off-canonical measurement layer of one programme. It builds on the rate primitive, the activation–gradient duality, the algebraic LayerNorm-kernel direction, and the gauge quotient of the theory paper (Shirodkar, [2026b](https://arxiv.org/html/2607.00603#bib.bib23)), the cheap spectral observables of Shirodkar and Narayanan ([2026b](https://arxiv.org/html/2607.00603#bib.bib25)), the algebraic LayerNorm detector of Shirodkar and Narayanan ([2026a](https://arxiv.org/html/2607.00603#bib.bib24)), and the gauge-equivariant optimizer of Shirodkar ([2026a](https://arxiv.org/html/2607.00603#bib.bib22)). On top of these it removes the canonical-alignment and descent preconditions of the rate read, classifies each dead direction as a genuine degeneracy or a flat gauge, and decomposes the learning coefficient into the per-direction order and the dead-subspace dimension at a frozen checkpoint. Its experiments re-analyse the modular-addition cohort of Shirodkar ([2026a](https://arxiv.org/html/2607.00603#bib.bib22)) and reuse the dead-unit census of Shirodkar and Narayanan ([2026b](https://arxiv.org/html/2607.00603#bib.bib25)), reading at a frozen checkpoint what those papers read in training and in the spectrum.

##### Notation.

Table [1](https://arxiv.org/html/2607.00603#S2.T1) collects the symbols the read uses.

Table 1: Notation. One symbol, $\gamma$, carries two meanings the context separates: the LayerNorm per-channel gain in the gauge read, and the localization strength of the posterior sampler in the global view.

## 3 Method

We measure the order of a dead direction at a frozen checkpoint, in whatever basis the optimizer left, by a procedure in two stages. The detector reads the dead coordinate at each layer and assembles the per-layer coordinates into one network-wide direction $u$; the read then scans the directional Fisher along $u$ out from the checkpoint and takes the order from its growth (Figure [2](https://arxiv.org/html/2607.00603#S3.F2)). The two stages separate a coordinate-free quantity from an architecture-specific one. The order is an analytic invariant of the singularity, the same number in any basis, so the scan that recovers it is one fixed operation. Finding the direction it lives along depends on where each layer type seats its dead coordinate, so only the detector varies with the architecture. Three mechanisms make the measurement hold off canonical alignment, taken in turn below: the duality that carries the dead coordinate across layers, the construction that seats the scan on the order-carrying mode above the gauge floor, and the purity-matched window that reads the exponent from a direction recovered only approximately. The full pipeline and setup are given in Appendices [A.2](https://arxiv.org/html/2607.00603#A1.SS2) and [A.1](https://arxiv.org/html/2607.00603#A1.SS1).

##### The detector across architectures.

The detector generalises by one rule: read the dead coordinate as the near-kernel of the layer’s natural second-moment object. A near-kernel is the subspace of that object’s smallest eigenvalues, the directions along which the metric it builds has gone near-singular because the network stopped using them. For a generic layer that object is the K-FAC factor pair $A\otimes G$ of Section [2](https://arxiv.org/html/2607.00603#S2); a convolution replaces the input factor with its spatial-patch covariance (Grosse and Martens, [2016](https://arxiv.org/html/2607.00603#bib.bib10)); a LayerNorm transformer needs no scan at all, its dead direction being the closed-form kernel $\gamma^{-1}/\|\gamma^{-1}\|$ of the per-channel gain (Shirodkar and Narayanan, [2026a](https://arxiv.org/html/2607.00603#bib.bib24); Shirodkar, [2026b](https://arxiv.org/html/2607.00603#bib.bib23)). An architecture whose degeneracy lives in an object none of these expose, an attention head’s rotation or a dead expert in a mixture, needs a detector built for that object, and supplying one extends the pipeline without touching the read.
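One way to see why a LayerNorm direction is available in closed form is sketched below. It is our reading of the mechanism, not code from the paper: it assumes a LayerNorm with zero bias and no epsilon, a hypothetical random gain `gamma`, and isotropic synthetic inputs. Centering forces the normalised activations to sum to zero, so the outputs $y=\gamma\odot\hat{h}$ satisfy $\sum_i y_i/\gamma_i=0$ and their second moment annihilates $\gamma^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 8
gamma = rng.uniform(0.5, 2.0, size=d)  # hypothetical per-channel gain

# LayerNorm with zero bias: centre and rescale each sample, then apply the gain.
h = rng.normal(size=(n, d))
h_hat = (h - h.mean(axis=1, keepdims=True)) / h.std(axis=1, keepdims=True)
y = gamma * h_hat

A = y.T @ y / n  # second moment of the LayerNorm outputs

# Closed-form candidate direction: the normalised inverse gain.
u = (1.0 / gamma) / np.linalg.norm(1.0 / gamma)

# Compare against the numerically smallest eigenvector of A.
lam, vec = np.linalg.eigh(A)
overlap = abs(vec[:, 0] @ u)
residual = np.linalg.norm(A @ u)
```

In this toy setting the closed-form direction coincides with the numerical near-kernel up to sign, which is the sense in which no scan is needed to nominate it.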

![Refer to caption](https://arxiv.org/html/2607.00603v1/x2.png)

Figure 2: The read in action, on a log-log directional Fisher against $t$ with the purity-matched window shaded. (a) a real squared-ReLU dead unit, axis-aligned, returning $k=3$; (b) a real gelu network’s rotated dead direction, recovered off canonical alignment at $k=2$; (c) a trained deep linear network with no activation, returning the depth order $k=L=5$. In each the Fisher flattens onto a contamination floor as $t\to 0$ and follows the $t^{2(k-1)}$ power law in the window.

##### Duality-based identification.

A clean per-layer read needs the dead coordinate at every layer, yet at each end of the network one K-FAC factor fails to pin it: at the input $A_1=\mathrm{cov}(x)$ stays full rank, and at the output $G_L$ goes flat. The duality $\lambda_{\min}(A_{\ell})\,\lambda_{\min}(G_{\ell})=\Theta(t^{2(L-1)})$ keeps the other factor informative wherever one fails, so we read the dead coordinate from whichever factor separates it. We fill the few layers neither factor resolves by mapping a neighbour’s coordinate through the layer Jacobian, then assemble the per-layer coordinates into the cross-layer joint mode, the network-wide dead direction $u$ that the read scans.

##### Construct, do not search.

The order-carrying joint mode sits above the bottom of the Fisher spectrum. That bottom holds the gauge orbit, the $\sim(L-1)d^{2}$ within-layer reparametrisation directions (for layer width $d$) that the layered structure leaves flat (Shirodkar, [2026b](https://arxiv.org/html/2607.00603#bib.bib23), Thm. 3), and below a sample ratio $n/d<1$ a sampling null space as well; the order-carrying mode lies above both, since moving along it raises the Fisher as $t^{2(k-1)}$ while the gauge orbit stays exactly flat. An eigendecomposition of the Fisher, or any refinement toward its bottom, therefore returns a gauge or null direction instead of the order, so we name the dead substructure from the factors and construct the joint mode at it. The argument holds wherever the gauge orbit and the sampling null lie below the order-carrying mode, which the gauge-quotient structure makes the generic case for node-death.
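The point that a spectral search lands on the gauge can be seen on the smallest possible example. The sketch below is illustrative only and makes several assumptions of its own: a scalar ReLU unit $f(x)=b\,\mathrm{relu}(ax)$ with $a=b>0$, a unit-noise regression Fisher, and synthetic Gaussian inputs. Since $f$ depends only on the product $ab$, the orbit $ab=\text{const}$ is the gauge and its tangent is $(a,-b)$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
relu_x = np.maximum(x, 0.0)

t = 0.1
a = b = t / np.sqrt(2.0)  # checkpoint displaced along u = (1, 1)/sqrt(2)

# For a > 0, f(x) = b * relu(a x) = a * b * relu(x), so the Jacobian
# with respect to (a, b) is relu(x) * (b, a).
J = np.stack([b * relu_x, a * relu_x], axis=1)
F = J.T @ J / len(x)  # 2x2 Fisher for unit-noise regression

lam, vec = np.linalg.eigh(F)
bottom = vec[:, 0]                          # what a search of the spectrum returns
u = np.array([1.0, 1.0]) / np.sqrt(2.0)     # order-carrying joint mode
gauge = np.array([a, -b]) / np.hypot(a, b)  # tangent to the orbit a * b = const

fisher_bottom = bottom @ F @ bottom  # flat: the gauge carries no order
fisher_u = u @ F @ u                 # grows as t^2, i.e. order k = 2
```

Here the bottom eigenvector is the gauge tangent, while the constructed mode `u` carries the $t^{2(k-1)}$ growth, so ranking directions by Fisher eigenvalue would skip the direction the read needs.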

##### Purity-matched rate window.

The order is the leading exponent of the rate as $t\to 0$, and the window selector recovers it without being given its value. A nominated direction at cosine $1-\varepsilon$ to the true one carries an $\varepsilon$-sized component on a lower-order direction, whose rate exponent lies $\Delta\alpha=2(k-k_{\mathrm{low}})$ below the true one. That component dominates the rate for $t\lesssim\varepsilon^{2/\Delta\alpha}$, so the read is pure only above that scale. Sweeping the lower end of the fit window once and keeping the highest-$r^{2}$ admissible window ($r^{2}>0.95$) isolates the single power law that has cleared the contamination. The selector optimizes the fit alone, so the architecturally fixed prediction then checks the recovered order, having played no part in choosing the window. A robustness battery bears this out: on planted orders the selector recovers $k\in\{2,3,4\}$ on every clean cell, returns no order on a flat scan, and never matches a wrong exponent (Section [4](https://arxiv.org/html/2607.00603#S4)).
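The selector described above can be sketched as follows. This is a simplified stand-in, not the paper's implementation: the names `fit_loglog` and `select_window`, the fixed upper end of the window, and the `min_points` admissibility rule are our assumptions, and the scan is synthetic, with a planted order $k=3$ and an $\varepsilon^{2}$ floor from a $k_{\mathrm{low}}=1$ contaminant.

```python
import numpy as np

def fit_loglog(t, fisher):
    """Least-squares slope and r^2 of log(fisher) against log(t)."""
    x, y = np.log(t), np.log(fisher)
    slope, intercept = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    # A perfectly flat scan has no variance to explain; treat it as no fit.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, r2

def select_window(t, fisher, min_points=10, r2_min=0.95):
    """Sweep the lower end of the window once; keep the highest-r^2 admissible fit."""
    best = None
    for lo in range(len(t) - min_points + 1):
        slope, r2 = fit_loglog(t[lo:], fisher[lo:])
        if r2 > r2_min and (best is None or r2 > best[1]):
            best = (slope, r2, lo)
    return best  # None means the read rejects and returns no order

t = np.logspace(-4, 0, 81)
eps, k_true, k_low = 1e-3, 3, 1

# Planted power law plus the contamination of an eps-sized lower-order component.
fisher = t ** (2 * (k_true - 1)) + eps ** 2 * t ** (2 * (k_low - 1))

best = select_window(t, fisher)
k_hat = 1.0 + best[0] / 2.0
t_cross = eps ** (2.0 / (2 * (k_true - k_low)))  # contamination scale eps^(2 / delta_alpha)

flat = select_window(t, np.ones_like(t))  # a flat scan should yield no order
```

With these numbers the contamination scale is $\varepsilon^{2/\Delta\alpha}=10^{-1.5}$, and the selected window starts above it; note the selector is never told `k_true`.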

##### What the read returns.

The scan sorts a direction by its slope and the magnitude it reaches, over the whole axis between a flat gauge and a clean finite order. A genuine degeneracy raises the directional Fisher as $t^{2(k-1)}$ with finite $k$, the slope returning the order. A log-log slope $\alpha$ gives $k=1+\alpha/2$ and $\lambda_{\mathrm{dir}}=1/(2k)$: the squared-ReLU unit of Figure [2](https://arxiv.org/html/2607.00603#S3.F2)a rises with slope $4$, so $k=3$ and $\lambda_{\mathrm{dir}}=1/6$. A flat scan is a gauge when it sits at a deep floor and a regular live direction when it sits at the constant value a non-dead weight carries, a split the magnitude makes where the slope cannot. The magnitude breaks a second tie: a curved gauge orbit read along its tangent rises with slope $2$, the slope a $k=2$ node-death also shows, separated only by its deep floor. Between the endpoints the slope itself diagnoses the direction. A slope shallower than the prediction marks a contaminated direction: the dead mode is mixed with a lower-order component, and the true exponent surfaces only above the contamination scale. This is the deviant per-coordinate read of a rotated death, whose returned value misses the predicted order (Section [4](https://arxiv.org/html/2607.00603#S4)). A slope steeper than the prediction marks a generic line through a crossing of loci, the signal that routes the global assembly to its structured resolution (Section [9](https://arxiv.org/html/2607.00603#S9)). A slope that has not settled across the window marks a pre-asymptotic scan. The read returns an order only on a clean, settled match and otherwise rejects with the diagnosis, so it never mints a wrong exponent; Appendix [A.3](https://arxiv.org/html/2607.00603#A1.SS3) gives the key from slope and magnitude to verdict.
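As a worked instance of the conversion $k=1+\alpha/2$, $\lambda_{\mathrm{dir}}=1/(2k)$, the orders quoted in Figure 2 correspond to the following slopes and coefficients:

$$
\begin{aligned}
\alpha=4 &\;\Rightarrow\; k=3, & \lambda_{\mathrm{dir}}&=\tfrac{1}{6} && \text{(squared-ReLU unit)}\\
\alpha=2 &\;\Rightarrow\; k=2, & \lambda_{\mathrm{dir}}&=\tfrac{1}{4} && \text{(gelu or ReLU node-death)}\\
\alpha=8 &\;\Rightarrow\; k=5, & \lambda_{\mathrm{dir}}&=\tfrac{1}{10} && \text{(deep linear network, } L=5\text{)}
\end{aligned}
$$

Read in the other direction, the two estimates of Figure 1 would correspond to fitted slopes $\hat{\alpha}=2(\hat{k}-1)$ of $1.90$ for $\hat{k}=1.95$ and $0.52$ for $\hat{k}=1.26$; these slopes are back-computed here from the reported $\hat{k}$ and are not quoted by the source.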

##### Reading the triple.

The read places three quantities in the frame of the Watanabe triple $(\lambda,m,\nu)$, each with its own reach and status. From the order, each dead direction carries the local coefficient $\lambda_{\mathrm{dir}}=1/(2k)$ exactly; the global $\lambda$ assembles these by the sum-versus-crossing rule of Section [2](https://arxiv.org/html/2607.00603#S2) (independent directions add, a crossing takes the minimum) where the singular structure can be enumerated, on the analytic models (Appendix [D.3](https://arxiv.org/html/2607.00603#A4.SS3)), and the posterior sampler supplies it on a large network where the enumeration is open (Section [7](https://arxiv.org/html/2607.00603#S7)). A floor-aware count of the dead directions at the dominant order gives the dead-subspace dimension; the constructed cells confirm it equals the analytic multiplicity $m$ under a normal crossing of equal-order directions, while on a trained network it tracks the compressing dead subspace (Section [5](https://arxiv.org/html/2607.00603#S5)). The order fixes the universal singular fluctuation $\nu(k)$; a trained network’s realized $\nu$ falls below that value, suppressed by the live structure (Section [6](https://arxiv.org/html/2607.00603#S6)).
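The sum-versus-crossing rule can be written out for the two simplest cases. The sketch below covers only independent directions and a normal crossing; the other intersection types the paper names (separable, tangency, determinantal) are not modelled, and the function names are hypothetical.

```python
from fractions import Fraction

def lambda_dir(k):
    """Local coefficient of a dead direction of order k."""
    return Fraction(1, 2 * k)

def assemble_independent(orders):
    """Independent dead directions: the local coefficients add."""
    return sum(lambda_dir(k) for k in orders)

def assemble_crossing(orders):
    """Normal crossing: take the minimum; m counts the directions attaining it."""
    lams = [lambda_dir(k) for k in orders]
    lam = min(lams)
    return lam, lams.count(lam)

# Illustrative orders, not values reported by the paper.
lam_sum = assemble_independent([3, 2])      # 1/6 + 1/4
lam_cross, m_cross = assemble_crossing([2, 2, 3])
lam_equal, m_equal = assemble_crossing([2, 2])
```

Exact fractions are used so the assembled value can be compared term by term with a closed form; in the equal-order crossing the multiplicity equals the number of directions, matching the count described above.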

## 4 A taxonomy of dead structure

The read is one instrument for a family of dead\-structure types\. Whatever the type, it nominates a candidate direction from the factors, scans it, and classifies what it finds: a genuine singularity carries a finite order the architecture fixes, a gauge stays at the floor\. Table[2](https://arxiv.org/html/2607.00603#S4.T2)maps the family and marks, for each type, whether a model in reach formed a clean instance\. The detected types carry the evidence below, the gauge family completes at the transformer’s architectural symmetries, and the attempted rows are types whose precondition, a cleanly formed instance, the models tried did not meet\.

A finite recovered order $k$ along a direction $u$ makes four claims at once. It certifies that a genuine singularity has formed there: the directional Fisher rises from the floor with a finite order, the signature of a dead direction the network drove its weights into. Its value names the type: the activation’s analytic order for a node-death, the depth for a linear collapse, and no finite order for a gauge. It sets the local learning coefficient $\lambda_{\mathrm{dir}}=1/(2k)$ this direction contributes to the model’s complexity. And on the constructed and architecture-known cells, where the order is set in advance, a match certifies the read itself, so the same read can be trusted on a model whose singular structure is unknown. The orders, dead-subspace dimensions, and gauge labels across a network then form a localized, typed map of its singular complexity, the per-direction structure a single posterior-sampled coefficient sums over but cannot localize.

For the decomposition to mean anything, the recovered order must belong to the structure that formed, fixed before the read. We check it where the order is predictable in advance, reading structures whose $k$ we know and asking whether the scan returns it; across the genuine singularities below the recovered order falls on the diagonal of the predicted one (Figure [3](https://arxiv.org/html/2607.00603#S4.F3)).

Table 2: The family of dead-structure types the read covers. Order is the finite KL order a genuine singularity carries, or a dash for a flat direction. Status marks the read state: *detected* here, or *attempted*, scanned for but with no model in reach forming a clean instance.

### 4.1 Genuine singularities

![Refer to caption](https://arxiv.org/html/2607.00603v1/x3.png) Figure 3: The recovered order tracks the structure present. The predicted order is the activation’s analytic order for a node-death ($k{=}3$ squared-ReLU, $k{=}2$ gelu, across $d{=}8$, $d{=}24$, and modular multiplication), the depth for a trained linear network ($k{=}L$, $L\in\{3,4,5\}$), and again the activation order for a convolutional channel read through a different K-FAC factor. All fall on the diagonal.

##### The activation sets a node-death’s order.

The order of a node-death follows the activation’s local analytic order, through the order at which the activation derivative $\phi'$ vanishes, so before reading anything we can predict $k=2$ for gelu and ReLU and $k=3$ for squared-ReLU. To test the prediction in isolation we construct a node-death in a small network, vary only the activation, and scan the joint mode: the read returns $k=2,3,4$ for gelu, squared-ReLU, and cubed-ReLU at $r^{2}=1.000$, recovering the predicted order in each case. To test it on a real network we train a matched pair of grokking transformers (Power et al., [2022](https://arxiv.org/html/2607.00603#bib.bib20); Nanda et al., [2023](https://arxiv.org/html/2607.00603#bib.bib18)) that differ only in activation and read the dead structure each forms during training. The squared-ReLU model leaves its dead subspace on the coordinate axes, where a scan along the axis returns $k=3$. The gelu model leaves its dead subspace rotated off the axes, so a per-coordinate scan no longer follows it and returns a deviant $1.29$; the off-canonical read identifies the rotated direction and returns $k=2$ at $r^{2}=1.000$, the activation’s predicted order recovered where the rotation had hidden it. Appendix [B.2](https://arxiv.org/html/2607.00603#A2.SS2) gives the constructed node-death cells.
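The activation dependence can be reproduced on a toy unit. The sketch below assumes, for illustration only, a single hidden unit $f(x)=c\,\phi(a x)$ with scalar input, a joint mode that moves the incoming and outgoing weights together as $(a,c)=(t,t)$ out of the death at zero, and unit-variance Gaussian output noise so that the directional Fisher is $\mathbb{E}_x[(\partial_t f)^2]$. None of these choices is taken from the paper's constructed cells, and `joint_mode_order` is a hypothetical name.

```python
import numpy as np
from math import erf, sqrt

_erf = np.vectorize(erf)

ACTIVATIONS = {
    "relu": lambda z: np.maximum(z, 0.0),
    "gelu": lambda z: 0.5 * z * (1.0 + _erf(z / sqrt(2.0))),
    "squared_relu": lambda z: np.maximum(z, 0.0) ** 2,
    "cubed_relu": lambda z: np.maximum(z, 0.0) ** 3,
}

def joint_mode_order(phi, x, ts):
    """Scan t -> (a, c) = (t, t) for f(x) = c * phi(a * x); return 1 + slope / 2."""
    fisher = []
    for t in ts:
        h = 1e-4 * t  # relative step for a central finite difference in t
        df = ((t + h) * phi((t + h) * x) - (t - h) * phi((t - h) * x)) / (2 * h)
        fisher.append(np.mean(df ** 2))  # directional Fisher up to the noise scale
    alpha = np.polyfit(np.log(ts), np.log(fisher), 1)[0]
    return 1.0 + alpha / 2.0

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)          # toy scalar inputs
ts = np.logspace(-3, -2, 12)           # scan window near the death
orders = {name: joint_mode_order(phi, x, ts) for name, phi in ACTIVATIONS.items()}
```

In this toy the output scales as $t^{2}$ for ReLU and (to leading order) gelu, $t^{3}$ for squared-ReLU, and $t^{4}$ for cubed-ReLU, so the fitted orders land near $2$, $2$, $3$, and $4$.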

##### The same order on a real vision transformer\.

Trained from scratch on ImageNet under weight decay, a transformer forms the activation node-death the fine-tuned model below lacks. A six-block ViT (Dosovitskiy et al., [2021](https://arxiv.org/html/2607.00603#bib.bib7)) (width $256$, MLP hidden $1024$, patch $16$, $112{\times}112$ inputs) on a $100$-class subset prunes its over-parametrised MLP as it compresses, and the optimizer leaves the resulting dead subspace rotated off the coordinate axes: at the deepest singular block the squared-ReLU coordinate-concentration is $0.07$. A per-coordinate scan misses such a death; the off-canonical read recovers the activation-predicted order on every condition (Table [3](https://arxiv.org/html/2607.00603#S4.T3)), squared-ReLU $\hat{k}=3.00$ and gelu $\hat{k}\approx 2.0$, across two decay strengths and three seeds. On a current vision architecture the order the read returns is the order the activation fixes, recovered off a death the standard optimizer left rotated. Appendix [B.4](https://arxiv.org/html/2607.00603#A2.SS4) gives the setup for both vision-transformer regimes and the per-block gauge detection.

Table 3: From-scratch ViT on ImageNet-100, per condition (mean over three seeds). The order is read off the deepest singular block. Coordinate-concentration near $0$ is a dead subspace rotated off the coordinate axes, near $1$ axis-aligned; the last column is the dead-subspace dimension (the Fisher-bottom count).

##### Depth sets a linear singularity’s order.

A test that the read returns a non-activation order needs a singularity with no activation behind it. We train a deep linear network of depth $L$ to a rank-deficient target until one mode of the weight product collapses to machine precision; the resulting dead direction takes its order from depth alone, with $k=L$. Reading it at $L\in\{3,4,5\}$ returns $\hat{k}=3.00, 4.02, 5.07$ at $r^{2}=1.000$. The value $k=5$ is one no activation node-death produces, so the read is tracking the order the structure carries. Appendix [B.2](https://arxiv.org/html/2607.00603#A2.SS2) gives the trained deep-linear cells.
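A scalar illustration of why depth sets the order, under the simplifying assumption (ours, for illustration only) that the collapsed mode is a product of $L$ scalar factors, all at zero at the checkpoint, and that the scan moves them together by $t$ with unit-variance Gaussian output noise:

$$
f_t(x) = \Big(\prod_{\ell=1}^{L} t\Big)\,x = t^{L}x,
\qquad
F(t) \;\propto\; \mathbb{E}_x\!\big[(\partial_t f_t(x))^{2}\big] = L^{2}\,t^{2(L-1)}\,\mathbb{E}[x^{2}].
$$

The log-log slope is $\alpha = 2(L-1)$, so $k = 1+\alpha/2 = L$ and $\lambda_{\mathrm{dir}} = 1/(2L)$. At $L=5$ the slope is $8$, which maps to $k=5$ and $\lambda_{\mathrm{dir}}=1/10$.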

##### A convolutional channel carries the activation order through a different factor\.

The reads above run on the Linear MLP factor; a dead convolutional channel exposes its dead direction through a spatial-patch covariance instead, a different K-FAC factor, which tests whether the order law and the read survive the change of factor structure. We construct a dead channel and read $k=2$ for ReLU and $k=3$ for squared-ReLU at $r^{2}=1.000$, and we then train a wide CNN until weight decay kills its spare channels, nominate a dead channel from the convolutional $G$-factor, and recover $k=2$ for ReLU and gelu and $k=3$ for squared-ReLU across three seeds. The nominated channel-direction is usually a rotated combination of channels, so the trained convolutional read is itself off canonical, and the activation order survives both the new factor and the rotation. Appendix [B.3](https://arxiv.org/html/2607.00603#A2.SS3) gives the convolutional setup.

##### Task, width, and seed\.

Three controls check that the node-death order is the structure’s, holding as task, width, and seed change. Changing the task to modular multiplication, the gelu network still reads $k=2$ once it groks (off-canonical $\hat{k}=1.99$ at $r^{2}=1.000$). Changing the width to $d{=}24$, three times the $d{=}8$ width and read at a proper sample ratio $n/d\approx 100$, the gauge-fixed squared-ReLU network reads $k=3$ along the axis and the gauge-fixed gelu network $k=2$ off the axis, holding across three seeds and three interior blocks ($\hat{k}=2.81\pm 0.18$ for squared-ReLU, $\hat{k}=1.97\pm 0.06$ for gelu). A RoPE-aware attention gauge and a standard one give the same MLP order, so the attention-side gauge choice leaves the read unchanged. Appendix [B.1](https://arxiv.org/html/2607.00603#A2.SS1) lists every per-cell read.

##### Two coincident units read the curvature order\.

The other canonical two-layer singularity is a unit overlap, two hidden units sharing the same incoming and outgoing weights, the overlap singularity of Amari et al. ([2006](https://arxiv.org/html/2607.00603#bib.bib2)) paired with the elimination singularity a node-death realises. It carries two degenerate directions the read separates. Moving the outgoing weights apart, $c_{i}\to c+s$ and $c_{j}\to c-s$, leaves the output unchanged, a flat transfer gauge. Moving the incoming weights apart, $a_{i}\to a+t\delta$ and $a_{j}\to a-t\delta$, cancels the odd terms of the expansion and grows the output as $t^{2}$ from the activation’s curvature $\phi''$ at the operating point, so the split direction carries order $k=2$. We construct an overlap in a small network and read $k=2$ on the split at $r^{2}=1.000$ for squared-ReLU and gelu alike, the transfer direction flat at the floor. The split order is the curvature’s, $k=2$ for either activation, while a node-death’s order tracks the activation’s order at zero; so one architecture carries two singularity types whose orders the read tells apart. For $n$ coincident units the split order stays $k=2$ and its multiplicity grows as $n-1$, the transfer gauge of the same dimension, which we read at $n=2,3,4$. Appendix [B.2](https://arxiv.org/html/2607.00603#A2.SS2) gives the overlap construction.
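The cancellation of the odd terms can be written out directly. As a sketch, assume the two coincident units share outgoing weight $c$ and incoming weight $a$, and that $\phi$ is smooth enough at the operating point $a\cdot x$ for a fourth-order Taylor expansion (an assumption of this illustration). Along the split,

$$
c\,\phi\big((a+t\delta)\cdot x\big) + c\,\phi\big((a-t\delta)\cdot x\big)
= 2c\,\phi(a\cdot x) + c\,t^{2}\,(\delta\cdot x)^{2}\,\phi''(a\cdot x) + O(t^{4}),
$$

so the output moves as $t^{2}$, its $t$-derivative as $t$, and the directional Fisher as $t^{2}=t^{2(k-1)}$ with $k=2$. Along the transfer direction,

$$
(c+s)\,\phi(a\cdot x) + (c-s)\,\phi(a\cdot x) = 2c\,\phi(a\cdot x),
$$

which is independent of $s$, so that direction is exactly flat.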

### 4\.2 Architectural gauges

##### A real model often carries a gauge in place of a node\-death\.

A DINOv2 ViT-S (Oquab et al., [2023](https://arxiv.org/html/2607.00603#bib.bib19)) fine-tuned on CIFAR-100 carries no weight-space node-death: scanning every transformer block’s feed-forward layer for a hidden unit whose incoming and outgoing weights have both collapsed, the smallest combined norm stays at $0.44$ to $0.73$ of the block median across all twelve blocks, with no unit near the floor. The dead structure this model carries is a gauge, which the read classifies directly. Appendix [B.4](https://arxiv.org/html/2607.00603#A2.SS4) gives the fine-tuned DINOv2 setup and the per-block gauge reads.

##### The LayerNorm\-kernel gauge\.

What this model carries instead is the LayerNorm-kernel direction $u^{\star}=\gamma^{-1}/\|\gamma^{-1}\|$ (Shirodkar and Narayanan, [2026a](https://arxiv.org/html/2607.00603#bib.bib24); Shirodkar, [2026b](https://arxiv.org/html/2607.00603#bib.bib23)), the exact kernel of the post-LayerNorm input covariance: at every block the detected $u^{\star}$ aligns with the covariance’s smallest-eigenvalue direction to $|\cos|=1.0000$. The read flags it as a gauge, its directional Fisher sitting a factor of $10^{4}$ to $5\times 10^{4}$ below a live direction at the two normalisation sites of each block. The flatness comes from the LayerNorm centering symmetry. Projecting the activations onto $u^{\star}$ gives a variance at machine precision ($\mathrm{std}=8\times 10^{-16}$) about a nonzero mean set by the LayerNorm bias, so along $u^{\star}$ the data is effectively constant. The direction is therefore an approximate gauge, and the residual Fisher is that constant offset carried through the nonlinearity, a small nonzero floor.
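The kernel property follows from centering: the normalised activations sum to zero across features, so the elementwise scale $\gamma$ is undone by $\gamma^{-1}$ and only a bias-dependent constant survives. A small NumPy check on synthetic data (the dimensions, the random $\gamma$ and $\beta$, and the variable names are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 1000
gamma = rng.uniform(0.5, 2.0, size=d)    # LayerNorm scale (synthetic)
beta = rng.standard_normal(d)            # LayerNorm bias (synthetic)

# LayerNorm over the feature axis: center, normalise, then scale and shift.
x = rng.standard_normal((n, d)) * 3.0 + 1.0
z = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)
y = gamma * z + beta

# Candidate kernel direction u* = gamma^{-1} / ||gamma^{-1}||.
u_star = (1.0 / gamma) / np.linalg.norm(1.0 / gamma)
proj = y @ u_star

# Since sum_i z_i = 0, the projection is the constant sum_i(beta_i / gamma_i) / ||gamma^{-1}||.
offset = (beta / gamma).sum() / np.linalg.norm(1.0 / gamma)

# u* should coincide with the smallest-eigenvalue direction of the covariance.
evals, evecs = np.linalg.eigh(np.cov(y, rowvar=False))
cos = abs(evecs[:, 0] @ u_star)
```

Here `proj` is constant across samples up to rounding and sits at `offset`, which is nonzero whenever the bias is, matching the "variance at machine precision about a nonzero mean" description.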

##### The attention rotation gauge, and the slope it imitates\.

The attention query–key rotation is a second architectural gauge: a shared per-head rotation of the query and key projections leaves every attention score invariant, and once the rotation includes the projection bias the read places it a factor of $10^{6}$ below a live direction. This gauge is a curved orbit, so read along its tangent it rises with slope $2$, the slope a $k{=}2$ node-death also shows, and only the depth of its floor separates the two. The read therefore classifies the architectural gauges of this network as flat and reserves a finite order for genuine node-death, the distinction the magnitude makes (Section [3](https://arxiv.org/html/2607.00603#S3)).
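Both statements, exact invariance on the orbit and a second-order departure along its tangent, can be checked numerically. The sketch below uses a single synthetic head with random weights; the shapes and names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tok, d_model, d_head = 6, 32, 8
X = rng.standard_normal((n_tok, d_model))          # token activations
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
b_q = rng.standard_normal(d_head)
b_k = rng.standard_normal(d_head)

def scores(W_q, b_q, W_k, b_k):
    """Pre-softmax attention scores for one head."""
    return (X @ W_q + b_q) @ (X @ W_k + b_k).T / np.sqrt(d_head)

base = scores(W_q, b_q, W_k, b_k)

# Exact orbit: a shared orthogonal R applied to weights and biases alike.
R, _ = np.linalg.qr(rng.standard_normal((d_head, d_head)))
on_orbit = scores(W_q @ R, b_q @ R, W_k @ R, b_k @ R)

# Rotating the weights but not the biases is not a symmetry.
weights_only = scores(W_q @ R, b_q, W_k @ R, b_k)

# Tangent line I + theta * A, with A skew-symmetric, leaves the orbit at second order.
M = rng.standard_normal((d_head, d_head))
A = M - M.T

def tangent_deviation(theta):
    T = np.eye(d_head) + theta * A
    return np.abs(scores(W_q @ T, b_q @ T, W_k @ T, b_k @ T) - base).max()

# A tenfold larger step should move the scores a hundredfold more.
ratio = tangent_deviation(1e-2) / tangent_deviation(1e-3)
```

Because $(I+\theta A)(I+\theta A)^{\top}=I+\theta^{2}AA^{\top}$ for skew-symmetric $A$, the score deviation grows as $\theta^{2}$ and its derivative as $\theta$, which is the slope-$2$ rise in the directional Fisher that a $k{=}2$ node-death also shows.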

##### Completing the gauge family\.

The LayerNorm scale and the query–key rotation are two of the transformer’s architectural symmetries, the gauge family the equivariant optimizer of Shirodkar ([2026a](https://arxiv.org/html/2607.00603#bib.bib22)) is built to quotient; the same flat-magnitude test reaches the rest. The attention value–output rotation is the sibling of the query–key gauge under the per-head $O(d_{\mathrm{head}})$ symmetry. The cross-entropy shift is a constant added to every output logit that the softmax absorbs. The ReLU rescaling is the single-sided move at a hidden unit, the gauge companion the order reads use throughout. Each is a flat direction with no order, a contributor to the architectural multiplicity. We read the ReLU rescaling directly as that companion, and a constructed cell reads the cross-entropy shift flat, its directional Fisher more than twenty orders of magnitude below a live bias direction. The value–output rotation reads the same way on the fine-tuned vision transformer, a curved gauge rising with slope $2$ but sitting four orders of magnitude below a live direction.
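The cross-entropy shift is the simplest member of the family to verify. In logit space the softmax Fisher is $\mathrm{diag}(p)-pp^{\top}$, and the all-ones direction lies in its kernel. A short check on synthetic logits (values and names are illustrative):

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
logits = rng.standard_normal((5, 10))
shifted = logits + 3.7                      # same constant added to every logit

# Fisher of a categorical distribution with respect to its logits.
p = np.exp(log_softmax(logits[0]))
fisher_logits = np.diag(p) - np.outer(p, p)

# Directional Fisher along the normalised all-ones (shift) direction.
ones = np.ones(10) / np.sqrt(10)
shift_fisher = ones @ fisher_logits @ ones
```

The log-probabilities agree before and after the shift, and `shift_fisher` vanishes up to rounding, which is why a constructed cell reads this direction far below any live bias direction.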

### 4\.3 Reach and open testbeds

The read mechanism, nominate then scan then classify, does not change with the type, so the family extends past the cells above once a clean instance forms\. Attention\-head death forms in reach as a low weight\-space rank of the attention block \(Appendix[C\.3](https://arxiv.org/html/2607.00603#A3.SS3)\), with no single head carrying an order; an isolated head\-death needs a known\-prunable\-head transformer\. A mixture\-of\-experts expert goes dead by distribution coverage, reviving under a different corpus, and a forced\-dead expert is input\-dead, reading flat with no order; its durable weight\-level structure is the within\-expert node\-death already in the table\. A clean instance of either forms in models we did not read here, so we leave the full validation to future work; that formed instance is the precondition the read shares with every type it does detect\.

## 5 Optimizer and training phase shape the dead structure

The activation and the depth fix the order, and the optimizer and the training phase fix the basis the dead structure occupies and whether it forms at all\.

##### The orthogonaliser decides whether a clean structure forms\.

The optimizer decides whether a cleanly readable dead structure forms at all, and on this deep transformer the orthogonaliser is what decides it, with the gauge projection playing the separate role we isolate below. We read the same run under three optimizers from one matched cohort over three seeds: vanilla Muon (Jordan et al., [2024](https://arxiv.org/html/2607.00603#bib.bib13)) (the textbook degree-five Newton–Schulz orthogonalisation, NS5, the ns\_off baseline), the scaled-polar orthogonaliser with the gauge removed (the DDCMuon orthogonaliser of Shirodkar, [2026a](https://arxiv.org/html/2607.00603#bib.bib22), bf\_fast), and the gauge-equivariant optimizer on that orthogonaliser (Appendix [C.1](https://arxiv.org/html/2607.00603#A3.SS1)). Vanilla Muon leaves the dead structure diffuse, a small flat near-kernel spread across directions with its axis-alignment falling from $0.71$ to $0.40$ through depth and no single direction carrying a clean order; the scan reports a deviant pre-asymptotic value under both activations ($\hat{k}=1.13\pm 0.15$ for gelu and $1.48\pm 0.43$ for squared-ReLU, $r^{2}$ below the admissibility threshold), so the read finds no order to recover. The scaled-polar orthogonaliser, with the gauge removed, instead compresses the network into a large clean dead subspace: under squared-ReLU its directions sit on the coordinate axes at alignment $>0.99$ and read $k=3$, and under gelu they sit rotated off the axes, where the off-canonical read recovers $k=2$. The gauge-equivariant optimizer on the same orthogonaliser reads the same order at the same alignment, so the readable axis-aligned basis comes from the scaled-polar orthogonaliser, and the gauge’s separate effect is to spectrally separate the bottom block. The contrast reproduces at the task-natural one-block width, where weight decay matched across the arms confirms the orthogonaliser supplies the alignment (Appendix [C.1](https://arxiv.org/html/2607.00603#A3.SS1)).

The off-canonical read removes the canonical-alignment precondition, recovering a dead direction whether it sits on the coordinate axes or rotated off them, as the gelu case shows. It keeps the one remaining precondition: the frozen checkpoint must sit at the singularity in the nominated direction, so the synthetic scan out from it shows the clean $t^{2(k-1)}$ growth the admissibility gate checks. The optimizers reach different checkpoints. The scaled-polar orthogonaliser lands the units at clean node-deaths, read at $k=3$ on the axes or $k=2$ rotated; vanilla Muon, on this deep transformer, lands at a diffuse, low-alignment checkpoint whose best nominated direction stays pre-asymptotic ($r^{2}$ below the gate). That is a property of the solution the optimizer reaches: the common optimizers on the fixed architecture below reach readable structures with no gauge fixing (Section [9](https://arxiv.org/html/2607.00603#S9)). Under the per-coordinate squared-ReLU the gauge-fixed rotation is no symmetry, so the clean rotated subspace is geometry the optimizer produced (Figure [4](https://arxiv.org/html/2607.00603#S5.F4)).

##### The common optimizers on a fixed architecture\.

The reads above move the optimizer and the depth together. To place the optimizer on its own we hold a small architecture fixed, a deterministic squared-ReLU teacher-student MLP whose spare units a wide student prunes to a node-death under weight decay, and read it under SGD, AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2607.00603#bib.bib16)), RMSProp (Tieleman and Hinton, [2012](https://arxiv.org/html/2607.00603#bib.bib26)), and Adam (Kingma and Ba, [2015](https://arxiv.org/html/2607.00603#bib.bib14)) at three seeds (Table [4](https://arxiv.org/html/2607.00603#S5.T4); the cell setup is Appendix [C.2](https://arxiv.org/html/2607.00603#A3.SS2)). The target is deterministic, so the spare units have nothing to fit and prune cleanly under every optimizer; this removes the adaptive-preconditioner resistance a noisy target provokes, where the Adam family amplifies the spare units’ noise-fitting gradient. The off-canonical rate read returns $k\approx 3$ (asymptotic, $r^{2}=1.0$) under every optimizer; the optimizer sets the structure the order occupies. SGD, which carries no preconditioner to bend the metric, and Adam leave a clean node-death on the coordinate axes, where the gauge companion (a single-sided move at the dead unit) reads flat. AdamW leaves a deep node-death rotated off the axes at one of the three seeds, where the off-canonical read supplies the order a per-coordinate scan would miss. RMSProp, a pure diagonal preconditioner without momentum, keeps a distributed representation with no unit driven to a weight-space joint-zero, so the rate reads $k\approx 3$ off the lowest-Fisher direction while the gauge companion stays finite, an order read off a near-kernel with no confirmed node-death. On the ReLU cells SGD follows the activation order ($k\approx 2$), and the single-seed AdamW ReLU cell reads deviant. Across the common optimizers on this deterministic architecture the read recovers the order wherever a low-Fisher direction forms. The noisy-target case is measured separately (Table [5](https://arxiv.org/html/2607.00603#S5.T5)). The deep grokking transformer under vanilla Muon keeps its structure diffuse, and no clean dead direction forms, so the admissibility gate correctly returns no order there, the null reading that marks the absence of a formed singularity.

Table 4: Optimiser-axis coverage on a fixed squared-ReLU teacher-student node-death cell (predicted $k=3$, three seeds). The off-canonical order reads $k\approx 3$ under every optimizer; the optimizer sets the structure the order occupies. The gauge companion is a single-sided move at the dead unit, flat only when the consumer has been pruned (a confirmed node-death).

##### A noisy target grades the precondition by optimizer.

The deterministic cell prunes the spare units cleanly under every optimizer. Gaussian noise on the teacher target gives the adaptive preconditioner a spare-unit noise-fitting gradient to amplify, and the prune then separates the optimizers (Table [5](https://arxiv.org/html/2607.00603#S5.T5), sweeping the noise scale $\sigma_{\mathrm{train}}$ across $\{0, 0.1, 0.25, 0.5, 1.0\}$ at three seeds, the read-time geometry on the true-MC Fisher). SGD forms a clean $k{=}3$ node-death at every noise level, the death deepening as the noise grows. Adam under its coupled-$L_{2}$ weight decay holds the clean axis node-death across the sweep. RMSProp keeps its distributed near-kernel, the rate reading $k\approx 3$ off the lowest-Fisher direction with no confirmed node-death. AdamW breaks at the first noise level: its spare units fit the noise, no unit reaches a joint-zero, and the order read returns the deviant value. The data-fit shows in the empirical-over-true-MC bottom-eigenvalue ratio, which grows to $94$ for AdamW and $100$ for RMSProp as the noise rises, the inflation the true-MC geometry sees through. Where a clean death forms (SGD, Adam), the true-MC bottom eigenvalue sits below the residual’s resolution, leaving the ratio undefined. The off-canonical read follows the geometry under target noise wherever a low-Fisher direction survives, and returns the deviant value where the optimizer fits the noise and no unit prunes.

Table 5: Noisy-target sweep on the same squared-ReLU node-death cell (predicted $k=3$, three seeds), the target gaining Gaussian noise of scale $\sigma_{\mathrm{train}}$ and the read-time geometry on the true-MC Fisher. The prune separates the optimizers the deterministic cell prunes alike: SGD and Adam hold the clean death, RMSProp keeps a near-kernel, AdamW loses the death once the target is noisy. The data-fit ratio (empirical over true-MC bottom eigenvalue) is undefined where a clean death forms, its true-MC eigenvalue sitting below the residual’s resolution.

##### The dead-subspace dimension.

Counting the dead directions at a block gives the dead-subspace dimension, the per-direction read supplying the order and a floor-aware count supplying how many directions share it; under a normal crossing of equal-order directions this count is the analytic multiplicity. The count keeps the genuine node-deaths and drops the gauge directions and the below-floor numerical near-kernels, so the architectural symmetries (the $\sim(L{-}1)d^{2}$ reparametrisation directions and the per-site LayerNorm-centering directions) enter a separate architectural multiplicity, distinct from the learned one.

##### The structure emerges in the compression phase\.

Read across training steps, the dead subspace grows and sharpens: its dimension rises from $188$ to $562$ while its axis-alignment climbs from $0.78$ to $0.999$, with the order $k=3$ steady throughout (Appendix [B.5](https://arxiv.org/html/2607.00603#A2.SS5)). The gauge directions are present from initialisation, fixed by the architecture, whereas the node-deaths appear only as the network compresses, which gives the classification a second, developmental signature beyond the frozen-checkpoint read. In the accumulation regime, before the network compresses, no node-death forms and the read returns no order, a regime boundary the discussion makes precise (Section [9](https://arxiv.org/html/2607.00603#S9)).

![Refer to caption](https://arxiv.org/html/2607.00603v1/x4.png) Figure 4: The orthogonaliser sets the basis. Dead-subspace axis-alignment by interior block for three optimizers from one matched cohort: the scaled-polar orthogonaliser, with or without the gauge projection, keeps the subspace on the coordinate axes (alignment $\approx 1$, clean $k{=}3$), while vanilla Muon rotates it off the axes with falling alignment through depth, where a per-coordinate read cannot follow it. The readable axis-aligned basis comes from the orthogonaliser; the gauge arm reads the same.

## 6 The singular fluctuation: fixed by the order, absorbed by the network

The third invariant of the Watanabe triple, the singular fluctuation $\nu$, sets $\mathrm{WAIC}=T_{n}+2\nu/n$ and the asymptotic train–validation gap. Along a dead direction $\nu$ is fixed by the order alone: it is a universal function $\nu(k)$ of the KL order, independent of the model-specific leading coefficient (Shirodkar, [2026b](https://arxiv.org/html/2607.00603#bib.bib23)), with $\nu(2)\approx 0.173$ and $\nu(3)\approx 0.278$. The order read reaches $\nu$ the way it reaches $\lambda_{\mathrm{dir}}$: from $k$. The reach has a ceiling we measure here, where the live structure of a trained network absorbs part of the fluctuation.

##### The order fixes $\nu(k)$ on an isolated direction.

We confirm the predicted $\nu(k)$ by measuring it directly on data, an independent check of a value the theory computes by quadrature. On the canonical order-$k$ cell $y=a\,s^{k}+\varepsilon$, $\varepsilon\sim\mathcal{N}(0,\sigma^{2})$, we form the exact posterior on generated noisy data and compute Watanabe’s functional variance $V_{n}=\sum_{i}\mathrm{Var}_{\mathrm{post}}[\log p(y_{i}\mid s)]$, with $\hat{\nu}=V_{n}/2$. Data-averaged over $300$ draws, $\hat{\nu}$ recovers $\nu(k)$ within a few percent across $n\in\{500,\dots,4000\}$: $0.167$ to $0.178$ at $k{=}2$ (against $0.173$) and $0.275$ to $0.289$ at $k{=}3$ (against $0.278$). The functional-variance estimator is itself calibrated against the regular anchor, where it returns $\nu=d/2$ on a $d$-parameter linear-Gaussian model by both the closed form and a posterior sampler (the cells are Appendix [D.4](https://arxiv.org/html/2607.00603#A4.SS4)).
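The functional-variance estimator on this cell is simple enough to sketch with a one-dimensional grid posterior. The sketch below makes several assumptions the text does not state: a flat prior on $s\in[-2,2]$, a fixed known $a$ and $\sigma$, data generated at the true value $s=0$, and grid quadrature in place of whatever exact-posterior computation the paper uses. The name `nu_hat` and all defaults are hypothetical.

```python
import numpy as np

def nu_hat(y, k, a=1.0, sigma=1.0, grid=None):
    """Functional-variance estimate V_n / 2 for y = a * s**k + noise.

    Uses a flat prior on a bounded grid of s (an assumption of this sketch).
    """
    if grid is None:
        grid = np.linspace(-2.0, 2.0, 4001)
    mu = a * grid ** k                                          # model mean, shape (G,)
    # Per-sample log-likelihood on the grid; the additive constant is dropped
    # because it does not affect posterior variances.
    loglik = -0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2   # shape (n, G)
    log_post = loglik.sum(axis=0)
    w = np.exp(log_post - log_post.max())
    w /= w.sum()                                                # posterior weights
    mean = loglik @ w
    var = (loglik ** 2) @ w - mean ** 2                         # Var_post[log p(y_i | s)]
    return 0.5 * float(var.sum())                               # nu_hat = V_n / 2

rng = np.random.default_rng(0)
# Data-average over a few draws at the true parameter s = 0 (y is pure noise).
draws = [nu_hat(rng.standard_normal(500), k=2) for _ in range(20)]
est = float(np.mean(draws))
```

A faithful comparison against $\nu(k)$ would need the paper's prior, sample sizes, and the full $300$-draw average; the sketch only shows the shape of the computation.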

##### A trained network’s realized $\nu$ falls below $\nu(k)$.

The universality value is the fluctuation an isolated order-$k$ direction sees. In a trained network the dead direction’s basis overlaps the live units, and the live parameters absorb part of the data fluctuation the singular fluctuation integrates over. We isolate the effect on a controlled cell $y=b\,g(x)+s^{k}c(x)+\varepsilon$ with one regular parameter $b$ (basis $g$) and one order-$k$ dead coordinate $s$ (basis $c$): in the joint, data-averaged posterior the dead direction contributes exactly $\nu(k)$ while the bases stay distinct, and the contribution collapses as they align (Figure [5](https://arxiv.org/html/2607.00603#S6.F5)). In the trained over-parametrised networks of Section [4](https://arxiv.org/html/2607.00603#S4) the effective overlap of a generic dead-scan basis with the live-unit span is $\rho_{\mathrm{eff}}\approx 0.62$ to $0.81$, inside the suppression band, so the live structure absorbs the fluctuation and the realized $\nu$ sits below $\nu(k)$ by a structure-dependent amount. The order fixes the idealised $\nu(k)$; recovering a trained network’s realized $\nu$ needs the live-structure absorption, which the order alone does not carry.

![Refer to caption](https://arxiv.org/html/2607.00603v1/x5.png) Figure 5: The order fixes $\nu(k)$, the live structure absorbs it. Left: on the isolated order-$k$ cell $y=a\,s^{k}+\varepsilon$ the functional-variance $\hat{\nu}$ recovers the universality value $\nu(k)$ across training size $n$, data-averaged over $300$ draws ($\nu(2){=}0.173$, $\nu(3){=}0.278$). Right: in the controlled cell $y=b\,g+s^{k}c+\varepsilon$, the dead direction’s contribution to $\nu$ holds at $\nu(3)$ while the regular and dead bases stay distinct and collapses as their overlap $\rho=\mathrm{corr}(g,c)$ grows ($120$ draws per point); the shaded band is the effective dead–live overlap $\rho_{\mathrm{eff}}\in[0.62,0.81]$ measured in the trained vision transformers, where the live structure absorbs the fluctuation.

##### The frozen scan is the calibrated order.

The order the read reports comes from the frozen scan, synthesized at the checkpoint. A dead direction also collapses along the training trajectory, but its $\sigma_{\min}$ decay against steps is the optimizer’s approach speed, which on the grokking runs separates $k=3$ from $k=2$ at close to chance. The trajectory carries the order in the Fisher-against-$\sigma_{\min}$ slope under a theorem-compatible descent, validated under SGD (Shirodkar and Narayanan, [2026b](https://arxiv.org/html/2607.00603#bib.bib25)) and scoped out for the Adam-class and Muon runs here. The frozen scan removes that descent precondition; the trajectory collapse then cross-checks the frozen order without supplying it.

## 7 The global view: alongside posterior sampling

The global view estimates the single threshold by sampling the local posterior with SGLD\(Lau et al\.,[2025](https://arxiv.org/html/2607.00603#bib.bib15)\), and setting that estimate beside the geometry read fixes what each measures and how its accuracy is established\.

##### Accuracy is fixed on closed\-form ground truth\.

On the analytic models that anchor accuracy (Section [2](https://arxiv.org/html/2607.00603#S2)) the geometry read recovers the per-direction order exactly, $k\in\{2,3,4\}$ on the planted node-deaths and $k=L$ on the deep linear networks, while the sampler’s local coefficient carries its documented width-dependent drift from the global threshold and is read as a rank statistic.

##### A scalar against a decomposition\.

The sampler returns one global number and localises at most to a parameter subset, a layer or an attention head, its refined form\(Wang et al\.,[2025](https://arxiv.org/html/2607.00603#bib.bib28)\); a single direction lies below its resolution\. The per\-direction order, the intersection type, and the gauge\-versus\-singularity verdict have no counterpart in a single scalar, so the paper validates them against closed forms; the sampler has no per\-direction read to set beside them\. The geometry read returns the order of each dead direction and the dead\-subspace dimension of the directions sharing it, the per\-direction decomposition the global scalar aggregates by the sum\-versus\-crossing rule of Section[2](https://arxiv.org/html/2607.00603#S2)\. The two answer different questions: the geometry read assembles the global coefficient where the singular structure is enumerable, on the analytic models, and the sampler supplies it where enumeration on a large network is still open \(Section[9](https://arxiv.org/html/2607.00603#S9)\)\.

##### Cost\.

The geometry read is deterministic and needs no calibration, returning the order and the dead\-subspace dimension in one pass of forward and backward evaluations\. A credible sampler estimate needs a per\-model calibration sweep to a stability plateau before the estimate, so the cost gap is the calibration overhead and the per\-model retuning that precede it \(Table[15](https://arxiv.org/html/2607.00603#A4.T15); the full head\-to\-head is Appendix[D\.1](https://arxiv.org/html/2607.00603#A4.SS1)\)\.

##### The triple, assembled\.

From the geometry view we read the per\-direction order and the dead\-subspace dimension, and through the order the universal $\nu(k)$ \(live\-basis absorption suppresses the realized value in a trained network\)\. The global view adds the single threshold the sampler estimates\. The first two reads need no canonical alignment and no posterior sample, and they decompose the sampler’s global value\.

## 8 Uses

Beyond a single complexity scalar the read returns two things: a per\-direction decomposition into the order and the dead\-subspace dimension, and a gauge\-versus\-singularity verdict\. Several uses follow from what the paper already measures, and more are immediate given how cheap the read is\.

##### A gauge\-fixing diagnostic\.

The classification names which directions a gauge\-equivariant optimizer\(Shirodkar,[2026a](https://arxiv.org/html/2607.00603#bib.bib22)\)should quotient away and which carry learned structure to keep, a flat verdict marking a symmetry direction and a finite order a node\-death\. On the fine\-tuned vision transformer the read returns this verdict directly \(Section[4](https://arxiv.org/html/2607.00603#S4)\), turning the optimizer’s gauge choices from a design assumption into a measured target\.

##### The effective dimension tracks the sampler\.

The cheap read that matches the sampler most closely is the curvature\-weighted effective dimension of the K\-FAC Fisher spectrum, $\hat{\lambda}_{\mathrm{eff}}(\gamma)=\tfrac{1}{2}\sum_{i}\lambda_{i}/(\lambda_{i}+\gamma)$ over the block’s factor\-eigenvalue products, at one forward and backward pass per block\. On a width\-128 seven\-optimizer cohort it recovers the calibrated SGLD coefficient restricted to each block at Spearman $\rho=0.82$ on the MLP and $\rho=0.79$ on the attention \(Appendix [D\.2\.1](https://arxiv.org/html/2607.00603#A4.SS2.SSS1)\)\. The curvature weighting is what carries it: a bare rank or dead\-unit count does not track \($\rho=0.29$ on this cohort\), since the sampler weights each direction by its curvature\. The liveness gate reads a fully input\-dead MLP to zero, the coefficient the restricted sampler also reads\. The decomposition is the informative object, since the global coefficient is near\-constant across these arms and the per\-block restriction carries the cross\-optimizer structure\.
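The formula can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes the block's K-FAC factors are available as symmetric positive semi-definite matrices `A` (input side) and `G` (gradient side), and that the spectrum $\lambda_i$ is the set of all pairwise eigenvalue products. The function name, the toy factors, and the damping value are hypothetical.

```python
import numpy as np

def effective_dimension(A, G, gamma):
    """0.5 * sum_i lam_i / (lam_i + gamma) over K-FAC factor-eigenvalue products."""
    # Eigenvalues of the symmetric PSD factors (clip tiny negatives from round-off).
    eig_a = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    eig_g = np.clip(np.linalg.eigvalsh(G), 0.0, None)
    # Spectrum of the Kronecker product of the two factors: all pairwise products.
    lam = np.outer(eig_a, eig_g).ravel()
    return 0.5 * np.sum(lam / (lam + gamma))

# Toy factors (assumed): one dead input direction out of three, full-rank G.
A = np.diag([1.0, 1.0, 0.0])
G = np.diag([2.0, 2.0])
lam_eff = effective_dimension(A, G, gamma=1e-6)  # four live products, two dead
```

With a damping far below the live eigenvalues, each live product contributes about one half and each dead product contributes nothing, so the toy block reads close to 2. A fully input-dead block (all eigenvalues of `A` zero) reads zero, in line with the liveness-gate behaviour described above.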

![Refer to caption](https://arxiv.org/html/2607.00603v1/x6.png)

Figure 6: The one\-pass census separates the optimizer families\. The census learning\-coefficient estimate $\hat{\lambda}$ \(one forward pass\) against the calibrated SGLD\-LLC \(a per\-model calibration sweep\), on the d8 modular\-addition models where the sampler converges\. Across the seven distinct models every gauge\-fixed DDC model \(green\) reads as more complex than every vanilla\-Muon model \(red\) in both estimators; the pooled rank agreement is the weaker Spearman $\rho=0.71$ across those models and $\rho=0.66$ over the eleven checkpoints\.

##### A one\-pass cross\-optimizer separator\.

The dead\-unit census of Shirodkar and Narayanan \([2026b](https://arxiv.org/html/2607.00603#bib.bib25)\) is cheaper still, a single forward pass with no spectrum\. Its estimate $\hat{\lambda}=N_{\mathrm{total}}/2-N_{\mathrm{dead}}\,(1/2-1/(2k))$ is the regular dimension minus the reduction the dead units buy\. On eighteen d8 modular\-addition checkpoints \(gauge\-fixed DDC, vanilla Muon, vanilla AdamW, with and without weight decay\), across the seven distinct models where the sampler converges \($\hat{R}\leq 1.1$, positive estimate\), every gauge\-fixed DDC model reads as more complex than every vanilla\-Muon model in both the census and the SGLD\-LLC \(Figure [6](https://arxiv.org/html/2607.00603#S8.F6)\)\. The separator is the dead\-unit count, which the census turns into $\hat{\lambda}$ directly for this order\-$k=3$ set: 17 to 106 dead directions in the DDC models against 508 to 721 in the vanilla\-Muon ones\. That count is only $0.1\%$ to $4.4\%$ of the $N_{\mathrm{total}}$ regular dimension, so $\hat{\lambda}$ barely moves off $N_{\mathrm{total}}/2$: it separates the families cleanly but not the checkpoints within a run, where the count follows the weight\-decay schedule while the sampler moves the other way, and the pooled agreement is the weaker Spearman $\rho=0.71$ across the seven models and $\rho=0.66$ over the eleven checkpoints\. Where the sampler degenerates to a negative estimate, on the vanilla\-AdamW and accumulation\-regime cells, the census still returns a reading\. Appendix [D\.2](https://arxiv.org/html/2607.00603#A4.SS2) gives the full ranking\.
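The census arithmetic is simple enough to write out. The sketch below evaluates the estimate exactly with rational arithmetic; the function name and the counts are hypothetical, chosen only to show how little a few hundred dead directions move the estimate off $N_{\mathrm{total}}/2$ at order $k=3$.

```python
from fractions import Fraction

def census_lambda(n_total, n_dead, k):
    """Census estimate: N_total/2 - N_dead * (1/2 - 1/(2k))."""
    # Each dead direction of order k lowers the coefficient from 1/2 to 1/(2k).
    per_dead_saving = Fraction(1, 2) - Fraction(1, 2 * k)
    return Fraction(n_total, 2) - n_dead * per_dead_saving

# Hypothetical counts (not taken from the paper's tables), order k = 3.
lam_few_dead = census_lambda(n_total=16000, n_dead=100, k=3)
lam_many_dead = census_lambda(n_total=16000, n_dead=700, k=3)
```

At $k=3$ each dead direction removes $1/3$ from the estimate, so a model with fewer dead directions reads as more complex, which is the direction of the separation reported above.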

##### Developmental tracking\.

The read is cheap enough to run at every checkpoint, tracing the singular structure through training: the gauge directions are present from initialisation and the node\-deaths emerge as the network compresses \(Appendix [B\.5](https://arxiv.org/html/2607.00603#A2.SS5)\)\. The per\-model profile \(its order spectrum, gauge count, and effective Fisher rank by layer\) is a richer signature than the bottom singular value alone\. The per\-direction diagnosis sharpens it into model\-level measurements: a compression\-progress fraction from the settled\-versus\-in\-transit split, a continuous deadness\-depth spectrum in place of a binary count, and a bucket\-census fingerprint\. Read alongside a run, the same profile is a training diagnostic: it shows whether the singular structure is forming, separating a compressing network from one the optimizer leaves diffuse \(Section [5](https://arxiv.org/html/2607.00603#S5)\) or one still in the accumulation regime with no node\-death, a geometric state the loss curve does not report\.

## 9 Discussion and limitations

This paper turns the order\-recovery ofShirodkar \([2026b](https://arxiv.org/html/2607.00603#bib.bib23)\)into a measurement: a descent\-free read at a frozen checkpoint that decomposes the single learning\-coefficient scalar into per\-direction orders and a dead\-subspace dimension, classifies each direction as a genuine singularity or a flat gauge, and does so off canonical alignment across transformer, convolutional, and normalisation layers\. What follows maps where the read applies, what it needs to run, and the cases at the programme’s frontier\.

##### The read applies once a singularity has formed\.

The read needs a dead direction in place, which a network forms in the compression phase as it folds into a singular solution\. A checkpoint still accumulating capacity, where the bottom singular value rises, carries no dead structure, and the read correctly returns no order: singular\-learning analysis begins once a singularity has formed\. Standard language\-model pretraining sits in the accumulation phase, so the read characterises its compressed checkpoints and reports the absence of singular structure on the rest\.

##### A detector has to surface the structure\.

The read runs on whatever a detector exposes, and the paper supplies detectors across the dead\-structure family: the K\-FAC factors, the convolutional channel factor, the LayerNorm kernel, the attention rotations, and the constructed overlap and shift cells \(Table[2](https://arxiv.org/html/2607.00603#S4.T2)\)\. Each returns a classified verdict, a finite order where a node\-death or depth singularity has formed and a gauge contribution where the structure is a symmetry, both of which the read reports as results\. Two rows stay unverified, attention\-head death and expert death in a mixture of experts: a detector is in hand for each, but no model in reach formed a clean instance to read, so they sit at attempted in the table\. The detector family is finite, so the completeness of an enumeration is itself measurable: the K\-FAC factors flag every separated dead block per layer regardless of type, and comparing that total against the dimension the named detectors claimed returns the unattributed remainder, dead structure the factors see but no detector typed\. A nonzero remainder bounds what the named reads miss, so a global coefficient assembled from the named structure alone is a lower bound on the complexity until the remainder is closed\.

##### What the read tells you on an arbitrary model\.

The read measures the dead structure it is pointed at; it does not find structure on its own\. A uniformly random direction in a random network almost surely reads regular, since the order\-carrying directions are a constructed, measure\-zero set and the order\-carrying mode sits above the high\-dimensional gauge floor, so a random line has vanishing overlap with it\. Every order or gauge verdict is therefore conditional on a detector having nominated the direction, and searching the parameter space at random or refining the Fisher toward its bottom returns a gauge or null direction in place of the order \(Section [3](https://arxiv.org/html/2607.00603#S3)\)\. Given any model and any weights the read does three things accurately\. It certifies a detector\-nominated direction, returning a finite order with $\lambda_{\mathrm{dir}}=1/(2k)$ or a gauge with none, and rejecting a contaminated, crossing, or unsettled scan with a diagnosis and never a wrong order\. It reports the absence of singular structure on an uncompressed checkpoint, a real answer about the network’s phase\. And it decomposes a known, enumerable structure into per\-direction orders the analytic models confirm to machine precision\. The hard limit is the first sentence: the read sees the fraction of a real model’s singular structure its detector family surfaces \(the unattributed remainder is the coverage residual above\)\. The bottom singular value, the K\-FAC bottom\-block separation, and the deadness ratio answer, before any scan, whether a checkpoint has compressed enough to carry anything to read\.

##### The optimizer shapes how clean the read is\.

The read characterises whatever structure an optimizer leaves, and different optimizers leave different solutions \(Section[5](https://arxiv.org/html/2607.00603#S5)\)\. The architecture fixes the order; the basis and the sharpness are the optimizer’s\. An orthogonalising optimizer\(Shirodkar,[2026a](https://arxiv.org/html/2607.00603#bib.bib22)\)leaves an axis\-clean structure that reads directly, a standard one can leave it rotated, which the off\-canonical read still follows, and on a deep transformer it can leave the structure too diffuse to carry any order\. The read sees only what the optimizer leaves; where that is nothing, as for vanilla Muon on the deep grokking transformer, there is nothing to recover\.

##### $\sigma_{\min}$ holds only in regime\.

The bottom singular value underwrites the magnitude floor and the pre\-scan compression check\. At deeply singular checkpoints its magnitude becomes unreliable, nearing the precision floor as its sample variance grows, and the asymptoticity and floor gates reject the read once it reaches the floor\. The rank\-level reading, the effective rank and the cross\-checkpoint ordering, still holds at those checkpoints; the absolute magnitude is what is lost\.

##### The global coefficient\.

The per\-direction reads assemble into a global learning coefficient by the sum\-versus\-crossing rule of Section [2](https://arxiv.org/html/2607.00603#S2) wherever the singular structure is enumerable\. Across twelve such cells, the normal crossings, separable sums, their composites, and the scalar deep linear network at depths two through five, the assembled $(\lambda,m)$ reproduces the closed\-form threshold to machine precision \(Table [19](https://arxiv.org/html/2607.00603#A4.T19)\); the scalar network’s depth\-$L$ collapse reads order $k=L$, and the global $\lambda=1/2$ follows from the $L$\-hyperplane crossing\. The blind assembly breaks on one structure, the wide matrix deep linear networks, whose product\-zero locus is determinantal with its crossings rotated off every coordinate\. There the coordinate read mis\-counts and the assembled coefficient leaves the closed form, a gap the improved rotated order read does not close, since it sits in the assembly\. Three reads carry across the boundary, developed in Appendix [D\.3](https://arxiv.org/html/2607.00603#A4.SS3): a rigorous bracket from the Newton\-polyhedron simplex bound and the generic\-line order contains the closed form on every cell and holds the hundreds\-scale $\lambda$ of the trained $H=(20,h,h,20)$ network up to $h=128$; a blind detector flags the determinantal locus and recovers its depth; and the structured resolution rule returns the exact coefficient on all five determinantal cells for depth $L\leq 3$\. A general exact resolver for an arbitrary determinantal variety stays open, the simple origin blow\-up stalling at the simplex bound, an item in the programme’s open\-problems register\. For one global number there the posterior sampler stays the practical tool, with the geometry read supplying the per\-direction decomposition where individual directions can be targeted\.

The multiplicity recovers the way the coefficient does: typing the dominant\-order directions returns it where they form a single crossing, robust to padding by lower\-order, higher\-order, and gauge directions, and under\-counts only when the multiplicity splits across several equal\-threshold loci, which then need enumerating\. The singular fluctuation reaches a sharper limit: the order fixes the universal $\nu(k)$, but a trained network’s live structure absorbs part of the data fluctuation, so the realized $\nu$ sits below $\nu(k)$ by a structure\-dependent amount the order alone does not carry \(Section [6](https://arxiv.org/html/2607.00603#S6)\)\.
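The assembly on enumerable cells can be illustrated with a short sketch. The rule itself is stated in Section 2, outside this excerpt, so the code assumes the two textbook cases it is named after: a normal crossing $\prod_i w_i^{2k_i}$ has $\lambda=\min_i 1/(2k_i)$ with $m$ the number of factors attaining the minimum, and a sum over disjoint variable blocks adds the coefficients and combines multiplicities as $\sum_j (m_j-1)+1$. Whether the paper's typed intersection reduces to exactly these two cases is an assumption; the function names are hypothetical.

```python
from fractions import Fraction

def crossing(orders):
    """Normal crossing prod_i w_i^(2 k_i): lambda = min 1/(2 k_i), m = number attaining it."""
    rates = [Fraction(1, 2 * k) for k in orders]
    lam = min(rates)
    return lam, rates.count(lam)

def separable_sum(components):
    """Disjoint variable blocks: coefficients add, multiplicities combine as sum(m_j - 1) + 1."""
    lam = sum(lam_j for lam_j, _ in components)
    m = sum(m_j - 1 for _, m_j in components) + 1
    return lam, m

# Scalar deep linear network of depth L = 4: four hyperplanes {w_i = 0}, each of order 1.
lam_chain, m_chain = crossing([1, 1, 1, 1])
# Two independent node-deaths of orders 2 and 3 in disjoint parameter blocks.
lam_sum, m_sum = separable_sum([crossing([2]), crossing([3])])
```

On the depth-4 chain the sketch returns $\lambda=1/2$ with multiplicity 4, consistent with the $L$\-hyperplane crossing described above; the separable pair returns $1/4+1/6=5/12$ with multiplicity 1.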

##### Cost and sample count\.

The proxies the read uses cost little: the activation $\sigma_{\min}$ is one SVD, the K\-FAC factor reads are per\-layer, and the directional Fisher is a few forward evaluations; only a full\-spectrum Fisher would grow cubically in width, and the read does not form one\. The geometry reads carry an $n/d$ sample\-count requirement, met at the $d=24$ width \($n/d\approx 100$\) and approached at $d=8$ \($n/d\approx 64$\), where the recovered orders still hold at $r^{2}=1.000$\.

## References

- Amari \(2016\)Shun\-ichi Amari\.*Information Geometry and Its Applications*, volume 194 of*Applied Mathematical Sciences*\.Springer, 2016\.URL[https://link\.springer\.com/book/10\.1007/978\-4\-431\-55978\-8](https://link.springer.com/book/10.1007/978-4-431-55978-8)\.
- Amari et al\. \(2006\)Shun\-ichi Amari, Hyeyoung Park, and Tomoko Ozeki\.Singularities affect dynamics of learning in neuromanifolds\.*Neural Computation*, 18\(5\):1007–1065, 2006\.URL[https://doi\.org/10\.1162/neco\.2006\.18\.5\.1007](https://doi.org/10.1162/neco.2006.18.5.1007)\.
- Aoyagi \(2024\)Miki Aoyagi\.Consideration on the learning efficiency of multiple\-layered neural networks with linear units\.*Neural Networks*, 172:106132, 2024\.URL[https://doi\.org/10\.1016/j\.neunet\.2024\.106132](https://doi.org/10.1016/j.neunet.2024.106132)\.
- Aoyagi and Watanabe \(2005\)Miki Aoyagi and Sumio Watanabe\.Stochastic complexities of reduced rank regression in Bayesian estimation\.*Neural Networks*, 18\(7\):924–933, 2005\.URL[https://doi\.org/10\.1016/j\.neunet\.2005\.03\.014](https://doi.org/10.1016/j.neunet.2005.03.014)\.
- Bushnaq et al\. \(2024\)Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky\-Dill, Kaarel Hänni, Cindy Wu, and Marius Hobbhahn\.Using degeneracy in the loss landscape for mechanistic interpretability, 2024\.URL[https://arxiv\.org/abs/2405\.10927](https://arxiv.org/abs/2405.10927)\.
- Carroll \(2021\)Liam Carroll\.Phase transitions in neural networks\.Master’s thesis, School of Mathematics and Statistics, The University of Melbourne, 2021\.URL[http://therisingsea\.org/notes/MSc\-Carroll\.pdf](http://therisingsea.org/notes/MSc-Carroll.pdf)\.
- Dosovitskiy et al\. \(2021\)Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby\.An image is worth 16x16 words: Transformers for image recognition at scale\.In*ICLR*, 2021\.
- Farrugia\-Roberts \(2022\)Matthew Farrugia\-Roberts\.Structural degeneracy in neural networks\.Master’s thesis, School of Computing and Information Systems, The University of Melbourne, 2022\.URL[https://far\.in\.net/mthesis](https://far.in.net/mthesis)\.
- Farrugia\-Roberts \(2023\)Matthew Farrugia\-Roberts\.Functional equivalence and path connectivity of reducible hyperbolic tangent networks\.In*Advances in Neural Information Processing Systems 36 \(NeurIPS\)*, pages 79502–79517, 2023\.URL[https://arxiv\.org/abs/2305\.05089](https://arxiv.org/abs/2305.05089)\.
- Grosse and Martens \(2016\)Roger Grosse and James Martens\.A Kronecker\-factored approximate Fisher matrix for convolution layers\.In*ICML*, 2016\.URL[https://arxiv\.org/abs/1602\.01407](https://arxiv.org/abs/1602.01407)\.
- Hironaka \(1964\)Heisuke Hironaka\.Resolution of singularities of an algebraic variety over a field of characteristic zero\.*Annals of Mathematics*, 79\(1\):109–326, 1964\.URL[https://www\.jstor\.org/stable/1970486](https://www.jstor.org/stable/1970486)\.
- Hoogland et al\. \(2024\)Jesse Hoogland, George Wang, Matthew Farrugia\-Roberts, Liam Carroll, Susan Wei, and Daniel Murfet\.Loss landscape degeneracy and stagewise development in transformers\.*Transactions on Machine Learning Research*, 2024\.URL[https://arxiv\.org/abs/2402\.02364](https://arxiv.org/abs/2402.02364)\.
- Jordan et al\. \(2024\)Keller Jordan et al\.Muon: An optimizer for hidden layers in neural networks\.[https://kellerjordan\.github\.io/posts/muon/](https://kellerjordan.github.io/posts/muon/), 2024\.Blog post\.
- Kingma and Ba \(2015\)Diederik P\. Kingma and Jimmy Ba\.Adam: A method for stochastic optimization\.In*International Conference on Learning Representations \(ICLR\)*, 2015\.URL[https://arxiv\.org/abs/1412\.6980](https://arxiv.org/abs/1412.6980)\.
- Lau et al\. \(2025\)Edmund Lau, Zach Furman, George Wang, Daniel Murfet, and Susan Wei\.The local learning coefficient: A singularity\-aware complexity measure\.In*AISTATS*, 2025\.URL[https://proceedings\.mlr\.press/v258/lau25a\.html](https://proceedings.mlr.press/v258/lau25a.html)\.
- Loshchilov and Hutter \(2019\)Ilya Loshchilov and Frank Hutter\.Decoupled weight decay regularization\.In*International Conference on Learning Representations \(ICLR\)*, 2019\.URL[https://arxiv\.org/abs/1711\.05101](https://arxiv.org/abs/1711.05101)\.
- Martens and Grosse \(2015\)James Martens and Roger Grosse\.Optimizing neural networks with Kronecker\-factored approximate curvature\.In*ICML*, 2015\.URL[https://arxiv\.org/abs/1503\.05671](https://arxiv.org/abs/1503.05671)\.
- Nanda et al\. \(2023\)Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt\.Progress measures for grokking via mechanistic interpretability\.In*ICLR*, 2023\.URL[https://arxiv\.org/abs/2301\.05217](https://arxiv.org/abs/2301.05217)\.
- Oquab et al\. \(2023\)Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El\-Nouby, et al\.DINOv2: Learning robust visual features without supervision\.*arXiv preprint arXiv:2304\.07193*, 2023\.
- Power et al\. \(2022\)Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra\.Grokking: Generalization beyond overfitting on small algorithmic datasets\.*arXiv:2201\.02177*, 2022\.
- Roy and Vetterli \(2007\)Olivier Roy and Martin Vetterli\.The effective rank: A measure of effective dimensionality\.*15th European Signal Processing Conference \(EUSIPCO\)*, pages 606–610, 2007\.
- Shirodkar \(2026a\)Tejas Pradeep Shirodkar\.Dead\-Direction Conditioners: Gauge\-Equivariant Preconditioning for Deep Networks, 2026a\.URL[https://arxiv\.org/abs/2606\.29176](https://arxiv.org/abs/2606.29176)\.
- Shirodkar \(2026b\)Tejas Pradeep Shirodkar\.Dead directions: Geometric singular learning, 2026b\.URL[https://arxiv\.org/abs/2606\.05957](https://arxiv.org/abs/2606.05957)\.
- Shirodkar and Narayanan \(2026a\)Tejas Pradeep Shirodkar and P\. J\. Narayanan\.Algebraic dead directions in LayerNorm transformers: A forward\-pass\-only diagnostic at LLM scale, 2026a\.URL[https://arxiv\.org/abs/2606\.19491](https://arxiv.org/abs/2606.19491)\.
- Shirodkar and Narayanan \(2026b\)Tejas Pradeep Shirodkar and P\. J\. Narayanan\.Dead\-direction signatures: A cheap spectral reading of singular complexity, 2026b\.URL[https://arxiv\.org/abs/2606\.21158](https://arxiv.org/abs/2606.21158)\.
- Tieleman and Hinton \(2012\)Tijmen Tieleman and Geoffrey Hinton\.Lecture 6\.5—RMSProp: Divide the gradient by a running average of its recent magnitude\.COURSERA: Neural Networks for Machine Learning, 2012\.
- Timaeus and collaborators \(2024\)Timaeus and collaborators\.devinterp: A library for developmental interpretability\.[https://github\.com/timaeus\-research/devinterp](https://github.com/timaeus-research/devinterp), 2024\.Python package\.
- Wang et al\. \(2025\)George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, and Daniel Murfet\.Differentiation and specialization of attention heads via the refined local learning coefficient\.In*International Conference on Learning Representations \(ICLR\)*, 2025\.URL[https://arxiv\.org/abs/2410\.02984](https://arxiv.org/abs/2410.02984)\.Spotlight\.
- Watanabe \(2009\)Sumio Watanabe\.*Algebraic Geometry and Statistical Learning Theory*\.Cambridge University Press, 2009\.URL[https://doi\.org/10\.1017/CBO9780511800474](https://doi.org/10.1017/CBO9780511800474)\.
- Watanabe \(2018\)Sumio Watanabe\.*Mathematical Theory of Bayesian Statistics*\.CRC Press, 2018\.URL[https://www\.routledge\.com/9781482238068](https://www.routledge.com/9781482238068)\.

## Appendices

We give the experiments in full here, from the network setups through to the global\-coefficient assembly and the singular\-fluctuation cells\.

## Appendix A The read: setup and pipeline

### A\.1 Experimental setup

The paper reads four families of network\. The modular\-addition transformers are one\-block RoPE \(rotary position embedding\) models trained to predict $(a+b)\bmod p$ from the two operand tokens, read at the grokked checkpoint where the network has folded into its singular solution\. All d=8 runs use 5000 steps and the d=24 run 6000, at seed 42 unless noted\. The gauge\-fixed optimizer is Muon with a body\-frame rotation gauge on the attention QK and VO blocks \(the `DDCMuon` recipe of Shirodkar, [2026a](https://arxiv.org/html/2607.00603#bib.bib22)\); vanilla Muon is the same recipe with the gauge removed \(textbook Newton–Schulz, `ns_off`\)\. The deep linear networks are trained by gradient descent to a rank\-deficient regression target until one mode of the weight product reaches machine precision, with a hand\-constructed linear bridge swept toward its singularity for method validation\. The convolutional cell is a teacher–student channel\-death network \(Appendix [B\.3](https://arxiv.org/html/2607.00603#A2.SS3)\)\. The two vision transformers, a DINOv2 ViT\-S/14 fine\-tuned on CIFAR\-100 and a from\-scratch six\-block ViT trained on an ImageNet\-100 subset, are set up with their reads in Appendix [B\.4](https://arxiv.org/html/2607.00603#A2.SS4)\. Table [6](https://arxiv.org/html/2607.00603#A1.T6) lists the transformer, linear, and convolutional cohorts\.

The transformer reads accumulate the K\-FAC factors over held\-out tokens and scan the true \(label\-resampled\) softmax\-categorical Fisher \(Appendix [A\.2](https://arxiv.org/html/2607.00603#A1.SS2)\), at sample ratio $n/d\approx 64$ for the d=8 reads and $n/d\approx 100$ for the d=24 reads\. The frozen\-checkpoint scans run in single precision \(float32\), since lower precision floors the directional Fisher before the power\-law window opens\. Each read and figure regenerates from its committed driver and result JSON\.

Table 6: Cohorts\. Width $d$ is the model dimension; the MLP hidden width is $4d$\.

### A\.2 The read pipeline

The read takes a checkpoint and a target MLP block\. It accumulates the K\-FAC factors $A$ and $G$ over held\-out tokens, nominates the dead direction from the better\-separated factor \(the A–G duality\), and assembles the cross\-layer joint mode\. It then scans the directional Fisher along that mode, $\theta(t)=\theta_{0}+tu$, evaluating $u^{\top}F(\theta(t))u$ by a finite\-difference Jacobian\-vector product on the true \(label\-resampled\) Fisher, and fits the log\-log slope over the auto\-selected purity window\. The sample ratio $n/d$ is set by the number of accumulation batches: $n/d\approx 64$ for the d=8 reads and $n/d\approx 100$ for the d=24 reads\. The scan grid is 16 points over $t\in[10^{-2},0.5]$\. The window selector keeps the highest\-$r^{2}$ admissible single\-power\-law fit\.

The read needs no library beyond a forward pass and one singular value decomposition\. Given a nominated unit\-norm direction $u$ at a base point $\theta_{0}$, it scans the true softmax\-categorical Fisher quadratic form $u^{\top}F(\theta)u$ along $\theta(t)=\theta_{0}+tu$ by a finite\-difference Jacobian\-vector product, fits a single power law over the purity window, and classifies the result\.

```
# inputs: model, batch x; unit direction u at base theta0; finite-diff step eps;
#         log-spaced scan grid t_values (16 pts in [1e-2, 0.5]).

def dir_fisher(theta, u):     # u^T F u, softmax-CE Fisher (residual-free)
    f  = lambda th: logits(model_at(th), x)   # shifted-param logits
    Ju = (f(theta + eps*u) - f(theta - eps*u)) / (2*eps)
    p  = softmax(f(theta))                    # (N tokens, C classes)
    return mean_x[ sum_c p*Ju^2 - (sum_c p*Ju)^2 ]   # cat-cov contraction

ufu       = [dir_fisher(theta0 + t*u, u) for t in t_values]   # the scan
alpha, r2 = best_powerlaw_window(t_values, ufu)   # sweep lower cut, keep max-r2
k_hat     = 1 + alpha/2

# classify (verdict = asymptotic iff r2 > 0.95 and the window clears the floor):
if asymptotic and r2>0.95 and k_hat>=1.5:  genuine singularity (order k_hat)
elif max(ufu) << a live direction:         gauge (no finite order)
else:                                      live / pre-asymptotic
```

At an exact constructed singularity $u^{\top}F(\theta_{0})u=0$ for a genuine direction and a gauge alike, so the clean rise of the scan, not the base\-point magnitude, separates them; a flat gauge never rises, a curved gauge rises with slope 2 but from a deep floor, and a genuine degeneracy rises from a live\-scale coefficient\.
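A runnable toy version of the scan may help fix ideas. The sketch below is an illustration under stated assumptions, not the paper's driver: it builds a small random two\-layer softmax network with a squared\-ReLU activation, plants a node\-death by zeroing one unit's incoming and outgoing weights, and scans the joint mode. It fits the whole 16\-point grid in float64 in place of the purity\-window selector and the float32 setting described above, and every size, seed, and name in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hid, n_cls = 512, 6, 5, 4
X = rng.normal(size=(n, d_in))                   # fixed batch of inputs
W1 = rng.normal(scale=0.3, size=(d_hid, d_in))   # incoming weights a_j (rows)
W2 = rng.normal(size=(n_cls, d_hid))             # outgoing weights c_j (columns)
W1[0] = 0.0                                      # plant the node-death: a_0 = 0
W2[:, 0] = 0.0                                   #                       c_0 = 0

def act(z):
    """Squared ReLU, whose node-death has order 3."""
    return np.maximum(z, 0.0) ** 2

def logits(a0, c0):
    """Logits with the dead unit's weights set to (a0, c0); other units stay frozen."""
    W1t, W2t = W1.copy(), W2.copy()
    W1t[0], W2t[:, 0] = a0, c0
    return act(X @ W1t.T) @ W2t.T

def dir_fisher(t, da, dc, eps=1e-4):
    """u^T F(theta0 + t u) u for the softmax-categorical Fisher, by central differences."""
    f = lambda s: logits(s * da, s * dc)         # theta0 is zero in the (a_0, c_0) block
    Ju = (f(t + eps) - f(t - eps)) / (2 * eps)   # finite-difference Jacobian-vector product
    z = f(t)
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # Categorical-covariance contraction, averaged over the batch.
    return np.mean((p * Ju**2).sum(axis=1) - (p * Ju).sum(axis=1) ** 2)

# Joint mode u = [d ; d'] / sqrt(2) with random unit vectors d, d'.
d = rng.normal(size=d_in)
d /= np.linalg.norm(d)
dp = rng.normal(size=n_cls)
dp /= np.linalg.norm(dp)
da, dc = d / np.sqrt(2), dp / np.sqrt(2)

t_values = np.geomspace(1e-2, 0.5, 16)           # 16 log-spaced points in [1e-2, 0.5]
ufu = np.array([dir_fisher(t, da, dc) for t in t_values])
alpha, _ = np.polyfit(np.log(t_values), np.log(ufu), 1)
k_hat = 1 + alpha / 2
```

Along this mode the dead unit's contribution to the logits scales as $t^{3}$, so the directional derivative scales as $t^{2}$ and the Fisher quadratic form as $t^{4}$; the fitted slope should therefore sit near 4 and `k_hat` near the activation order 3.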

### A\.3 A diagnostic key

The read returns four numbers per direction: the slope $\alpha$ \(equivalently the order $\hat{k}=1+\alpha/2$\), the single\-power\-law fit quality $r^{2}$, the asymptoticity verdict \(whether the slope has settled across the window\), and the magnitude $u^{\top}Fu$ the scan reaches relative to a live direction\. Table [7](https://arxiv.org/html/2607.00603#A1.T7) is the key from these four to the verdict and to what it says about the nominated direction\. The classification is conservative: it accepts an order only on a clean, settled match \(the first row\) and otherwise rejects with a named diagnosis, so it never returns a wrong exponent\. The four reject rows are the failure modes the off\-canonical read is built to tell apart, and each names its remedy: a shallow slope is a contaminated direction to re\-nominate, a steep one a crossing to resolve structurally, an unsettled one a scan to widen, and a flat live one a direction that is simply not dead\.

Table 7: Reading the scan: slope and magnitude to verdict\. “deep floor” is a magnitude orders of magnitude below a live direction’s Fisher; “live” is the live\-scale value a non\-dead weight carries; $k$ is the predicted order where one is known \(the activation order for a node\-death, the depth for a determinantal collapse\)\.

Each verdict carries a refinement the read also returns\. The magnitude is reported as the continuous depth $\log_{10}(F_{\max}/F_{\mathrm{live}})$, so the gauge / near\-dead / regular split reads as a position on the depth continuum\. The slope is reported as a profile, a plateau for a clean power law, a monotone drift for a transitional structure, or a curved log\-log, from the local slopes the single window discards\. A pre\-asymptotic reject is split in turn: a slope drifting up from a depressed magnitude is a unit still compressing \(*in transit*\), while a flat\-slope one only needs a wider scan\. These read directly off the scan the verdict already computed\.
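Both refinements are direct functions of the scan values. The sketch below computes the continuous depth and the local log\-log slopes from an existing scan; it stops short of labelling the profile as plateau, drift, or curved, since the thresholds for that split are not given in this section. The function name and the synthetic scan are assumptions.

```python
import numpy as np

def scan_refinements(t_values, ufu, f_live):
    """Continuous depth and local log-log slope profile of one directional-Fisher scan."""
    t = np.asarray(t_values, dtype=float)
    f = np.asarray(ufu, dtype=float)
    depth = np.log10(f.max() / f_live)                 # log10(F_max / F_live)
    local_slopes = np.gradient(np.log(f), np.log(t))   # d log F / d log t per grid point
    return depth, local_slopes

# Synthetic scan (assumed): a clean power law F(t) = 1e-3 * t^4, i.e. order 1 + 4/2 = 3.
t_values = np.geomspace(1e-2, 0.5, 16)
depth, slopes = scan_refinements(t_values, 1e-3 * t_values**4, f_live=1.0)
```

A clean power law gives a flat slope profile, the plateau case; a slope that changes monotonically across the grid would show up as a trend in `slopes` that a single windowed fit averages away.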

Table [8](https://arxiv.org/html/2607.00603#A1.T8) applies the key to existing checkpoint reads\. The grokking node\-deaths return their order, on the axis and recovered off it; the two DINOv2 gauges read flat at a deep floor, the query–key rotation rising with slope 2 but separated from a $k=2$ node\-death by its $10^{-6}$ floor; and the diffuse vanilla\-Muon read returns no settled order\. The classifications come from the diagnostic key applied to the committed read data, with no re\-run\.

Table 8: The diagnostic key on real\-model reads\. Each row is an existing checkpoint read classified by the diagnostic key \(Table [7](https://arxiv.org/html/2607.00603#A1.T7)\); *depth* is $\log_{10}(F/F_{\mathrm{live}})$ where a live reference is recorded\.

## Appendix BPer\-type reads

### B\.1 Per\-cell results

Table [9](https://arxiv.org/html/2607.00603#A2.T9) reports the read per cell\. The cells fall into the three verdicts the method returns\. The gauge\-fixed squared\-ReLU blocks read a clean node\-death, $\hat{k}\approx 3$ at axis\-alignment above 0\.99 and $r^{2}=1.000$, the order the activation fixes\. The gelu cells read $\hat{k}\approx 2$ at axis\-alignment near 0\.1, the same order recovered off the coordinate axes through the rotated\-direction read, where a per\-coordinate scan finds nothing\. The vanilla\-Muon base reads pre\-asymptotic, its dead structure diffuse with no single direction carrying a clean order \($r^{2}=0.71$, no order returned\)\. The trained deep linear cells read the depth order $\hat{k}=L$ for $L=3,4,5$ with no activation behind them, $k=5$ an order no node\-death produces\. Architecture fixes the order, the optimizer the alignment, and the read separates the two\. Figure [7](https://arxiv.org/html/2607.00603#A2.F7) shows the dead\-subspace dimension and axis\-alignment by depth for the gauge\-fixed d=8 run\.

Table 9: Per\-cell reads\. Axis\-alignment near 1 means the dead subspace lies on the coordinate axes; $\hat{k}$ is the recovered order \(single\-axis where axis\-aligned, rotated\-direction read where not\)\.

![Refer to caption](https://arxiv.org/html/2607.00603v1/x7.png)

Figure 7: Dead subspace by depth \(gauge\-fixed d=8\)\. The axis\-aligned activation\-side subspace sits in the interior blocks \(1–4\) and grows with depth; the input block carries its dead structure on the gradient factor \(A\-side $m=0$\), and the late blocks carry none\.

### B\.2 Constructed and trained linear cells

The constructed and trained\-linear cells here read a structure whose order is fixed before the measurement, which checks that the read returns the order planted before the scan\.

##### Constructed node\-death \(planted order\)\.

In a small two\-layer network one hidden unit’s incoming and outgoing weights are zeroed, the activation is varied, and the joint mode is scanned\. The read returns the activation’s analytic order, $k=2,3,4$ for gelu, squared\-ReLU, and cubed\-ReLU at $r^{2}=1.000$ \(Figure [3](https://arxiv.org/html/2607.00603#S4.F3)\)\. The same construction on a convolutional channel \(Appendix [B\.3](https://arxiv.org/html/2607.00603#A2.SS3)\) reads the order through the spatial\-patch covariance, a different K\-FAC factor\. The selector recovers the planted order on every clean cell, returns no order on a flat scan, and never matches a wrong exponent\.

##### Trained deep linear network \(depth order\)\.

A deep linear network of depth $L$ is trained by gradient descent to a rank-deficient regression target until one mode of the weight product collapses to machine precision. The resulting dead direction takes its order from depth alone, $k=L$, with no activation behind it. Reading it at $L\in\{3,4,5\}$ returns $\hat{k}=3.00,4.02,5.07$ at $r^{2}=1.000$ (Table [9](https://arxiv.org/html/2607.00603#A2.T9)), and $k=5$ is an order no activation node-death produces. The wide matrix deep linear networks, whose singular set is a determinantal variety, mark the boundary of the global assembly and are treated in Appendix [D.3](https://arxiv.org/html/2607.00603#A4.SS3).
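A scalar caricature shows why the order is the depth. The display below is a sketch under stated assumptions (a scalar chain $f(x)=\prod_{l} w_{l}\,x$, zero target on the collapsed mode, unit-variance inputs and noise, and the radial direction $w_{l}=t/\sqrt{L}$), not the trained matrix network of the text:

$$
f_{t}(x)=\Big(\tfrac{t}{\sqrt{L}}\Big)^{L}x,
\qquad
K(t)=\tfrac{1}{2}\,\mathbb{E}\big[f_{t}(x)^{2}\big]\propto t^{2L},
\qquad
u^{\top}F(tu)\,u=\mathbb{E}\big[(\partial_{t}f_{t}(x))^{2}\big]\propto t^{2(L-1)} .
$$

Matching the directional-Fisher exponent $2(L-1)$ to $2(k-1)$ gives $k=L$, and the per-direction coefficient $1/(2k)=1/(2L)$.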

##### Per\-type direction constructors\.

Each type plants its degeneracy and supplies the directions $u$ the read scans, so the constructed-cell claims reproduce from the description alone. Write $a_{j}$ for hidden unit $j$'s incoming weights, $c_{j}$ for its outgoing weights, and $d,d'$ for random unit vectors.

```
node-death (unit j):  set a_j = c_j = 0;  theta0 = 0 in the (a_j, c_j) block
          joint mode u = [d ; d'] / sqrt(2)
          -> order k = activation order (sq-ReLU 3, gelu/ReLU 2)
unit overlap (units 0..n-1 all copied onto unit 0 so they coincide):
          split_i    = +d  on a_0, -d  on a_i  -> order k = 2 (curvature phi'')
          transfer_i = +d' on c_0, -d' on c_i  -> flat gauge  (i = 1..n-1)
cross-entropy shift (output bias b in R^C):  u = 1_C / sqrt(C)  -> flat gauge
ReLU rescale (units W_l, W_{l+1}): one-sided log-norm move      -> flat gauge
```
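Three of the constructors above can be written out directly. The sketch below builds the node-death joint mode and checks numerically that the cross-entropy shift and the `transfer_i` move leave the outputs unchanged, which is what "flat gauge" means here. It is an illustration only: the function names, shapes, seeds, and the squared-ReLU choice in the overlap check are assumptions, and the check compares outputs rather than running the paper's Fisher scan.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)      # stabilised softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def joint_mode(d, d_out):
    """Node-death joint mode u = [d ; d'] / sqrt(2) on the (a_j, c_j) block."""
    return np.concatenate([d, d_out]) / np.sqrt(2.0)


def shift_gap(t, C=10, n=64, seed=0):
    """Largest change in softmax outputs when the bias moves along u = 1_C / sqrt(C)."""
    rng = np.random.default_rng(seed)
    logits = rng.standard_normal((n, C))
    u = np.ones(C) / np.sqrt(C)
    return float(np.abs(softmax(logits + t * u) - softmax(logits)).max())


def transfer_gap(t, n_units=3, d_in=5, d_out=2, n=64, seed=0):
    """Largest change in the output when outgoing weight moves between coincident units."""
    rng = np.random.default_rng(seed)
    phi = lambda z: np.maximum(z, 0.0) ** 2    # squared-ReLU, chosen for illustration
    x = rng.standard_normal((n, d_in))
    a = np.tile(rng.standard_normal(d_in), (n_units, 1))   # every unit copied onto unit 0
    c = rng.standard_normal((n_units, d_out))
    dp = rng.standard_normal(d_out)
    dp /= np.linalg.norm(dp)
    moved = c.copy()
    moved[0] += t * dp                          # +d' on c_0
    moved[1] -= t * dp                          # -d' on c_1
    hidden = phi(x @ a.T)
    return float(np.abs(hidden @ moved - hidden @ c).max())


print(shift_gap(0.7), transfer_gap(0.7))        # both at floating-point noise level
```

Both gaps sit at rounding error for any step size, because the moves are exact symmetries of the output; the `split_i` direction, by contrast, changes the output at second order and is not covered by this check.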

The LayerNorm-kernel gauge is read in closed form, the kernel $u^{\star}=\gamma^{-1}/\|\gamma^{-1}\|$ of the per-channel gain $\gamma$, with no scan. The attention rotation gauges are curved: for an antisymmetric generator $A$ on a head's $d_{\mathrm{head}}$ subspace, the query–key gauge moves $\delta W_{Q}=W_{Q}A$, $\delta W_{K}=-AW_{K}$, and the value–output gauge moves $\delta W_{V}=W_{V}A$, $\delta W_{O}=-AW_{O}$; both leave the head invariant, so the read places them at a deep floor (the magnitude criterion, since they rise with slope $2$ along the tangent).
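The curvature of the query–key gauge can be seen on the product $W_{Q}W_{K}$ alone. The sketch below assumes a weight layout in which $W_{Q}$ is $d_{\mathrm{model}}\times d_{\mathrm{head}}$ and $W_{K}$ is $d_{\mathrm{head}}\times d_{\mathrm{model}}$, so that the moves in the text are well-formed; the function names and sizes are hypothetical. Along the straight tangent line the first-order terms cancel and the product changes as $t^{2}$, while along the rotation orbit (here a Cayley transform, used in place of a matrix exponential) it does not change at all.

```python
import numpy as np


def _setup(seed, d_model, d_head):
    rng = np.random.default_rng(seed)
    W_Q = rng.standard_normal((d_model, d_head))
    W_K = rng.standard_normal((d_head, d_model))
    B = rng.standard_normal((d_head, d_head))
    return W_Q, W_K, B - B.T                    # antisymmetric generator A


def tangent_gap(t, seed=0, d_model=16, d_head=4):
    """Change in W_Q W_K along the straight line theta + t * (W_Q A, -A W_K)."""
    W_Q, W_K, A = _setup(seed, d_model, d_head)
    moved = (W_Q + t * W_Q @ A) @ (W_K - t * A @ W_K)
    return float(np.linalg.norm(moved - W_Q @ W_K))


def orbit_gap(t, seed=0, d_model=16, d_head=4):
    """Change in W_Q W_K along the curved orbit (W_Q R, R^T W_K) with R orthogonal."""
    W_Q, W_K, A = _setup(seed, d_model, d_head)
    I = np.eye(d_head)
    R = np.linalg.solve(I - 0.5 * t * A, I + 0.5 * t * A)   # Cayley transform of t * A
    return float(np.linalg.norm((W_Q @ R) @ (R.T @ W_K) - W_Q @ W_K))


ts = np.logspace(-3, -1, 9)
slope = np.polyfit(np.log(ts), np.log([tangent_gap(t) for t in ts]), 1)[0]
print("tangent-line log-log slope:", round(slope, 3))
print("orbit gap at t = 0.3:", orbit_gap(0.3))
```

Since the output change along the tangent grows as $t^{2}$, its squared rate of change also grows as $t^{2}$, which is the slope a genuine order-two direction would show; this is why the text falls back on the magnitude criterion for these gauges.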

### B\.3 Convolutional channel\-death

The convolutional read uses a small CNN: two $3\times 3$ convolutions ($c_{\mathrm{in}}=3\to 12\to 12$, padding $1$, no bias), the activation, a global average pool, and a linear head to a $4$-dimensional regression target. The second convolution carries the dead channel; the head is the channel's consumer. Inputs are $512$ random $3\times 8\times 8$ images. In the constructed case one channel's incoming filter and outgoing head slice are zeroed, and the scan runs along the joint turn-on mode. In the trained case the target is a narrow teacher CNN ($4$ channels) that the wide student over-covers, trained with Adam (lr $3\cdot 10^{-3}$, weight decay $5\cdot 10^{-2}$, $4000$ steps); the spare channels die, and the dead channel-direction is nominated from the smallest-eigenvalue eigenvector of the convolutional $G$-factor (the $C_{\mathrm{out}}\times C_{\mathrm{out}}$ per-spatial-position output-gradient covariance). The frozen-weight true-MC Fisher scan and the purity window are the same as the Linear reads. The nominated direction is usually a rotated combination of channels (axis-alignment near $0$ for the smooth and polynomial activations), so the trained read is off canonical.

Table 10: Convolutional channel-death reads. Trained rows are the mean recovered order over three seeds $\{42,142,242\}$.

### B\.4 Vision transformer reads

The real vision\-transformer reads of Section[4](https://arxiv.org/html/2607.00603#S4)cover two regimes on the same architecture: a fine\-tuned network that carries a gauge, and a from\-scratch network that forms a node\-death\.

##### Setup\.

The fine-tuned model is a DINOv2 ViT-S/14 (twelve blocks, embedding dimension $384$, patch $14$) fine-tuned on CIFAR-100, read at the converged checkpoint. The from-scratch model is a six-block ViT (width $256$, MLP hidden $1024$, patch $16$, $112\times 112$ inputs) trained on a $100$-class ImageNet subset under weight decay, read at the deepest singular block. The detector at each block is the closed-form LayerNorm-kernel direction $u^{\star}=\gamma^{-1}/\|\gamma^{-1}\|$ for the gauge read and the smallest-eigenvalue eigenvector of the squared-ReLU input covariance for the node-death read; the directional Fisher scan and the purity window are the same as the Linear reads.

##### The fine\-tuned model carries gauges\.

Across all twelve blocks the smallest combined feed-forward norm sits at $0.44$ to $0.73$ of the block median, so no weight-space node-death has formed. What the model carries is three architectural gauges, all read as flat (Table [11](https://arxiv.org/html/2607.00603#A2.T11)): the LayerNorm kernel, whose detected $u^{\star}$ aligns with the input covariance's smallest-eigenvalue direction at $|\cos|=1.0000$ and whose directional Fisher sits orders of magnitude below a live direction, and the attention query-key and value-output rotations, read $10^{6}$ and $2\times 10^{4}$ below live with the slope-$2$ a curved gauge orbit imitates. The read reserves a finite order for a genuine node-death and flags these as gauges.

Table 11: Architectural gauges on the fine-tuned DINOv2 ViT-S/14. The LayerNorm-kernel direction is detected at three blocks; the QK and VO rotations at block $6$. The gauge/live ratio is the directional Fisher against a live direction at the same site; a deep floor marks a gauge.

![Refer to caption](https://arxiv.org/html/2607.00603v1/x8.png)

Figure 8: The gauge floors on the fine-tuned DINOv2 ViT-S/14. The LayerNorm kernel and the attention query–key rotation read a directional Fisher orders of magnitude below a live direction (ratio $1$); the deep floor is the flat signature the read classifies them by.

##### The from\-scratch model forms a node\-death off the axis\.

Trained from scratch under weight decay, the same architecture prunes its over-parametrised MLP as it compresses and leaves the dead subspace rotated off the coordinate axes (squared-ReLU coordinate-concentration $0.07$ at the deepest singular block). A per-coordinate scan misses such a death; the off-canonical read recovers the activation-predicted order on all four conditions (Table [3](https://arxiv.org/html/2607.00603#S4.T3)), squared-ReLU $\hat{k}=3.00$ and gelu $\hat{k}\approx 2.0$, across two decay strengths and three seeds.

### B\.5 Developmental emergence of the dead subspace

The frozen-checkpoint read is cheap enough to run at every checkpoint, which traces the dead subspace through training. Table [12](https://arxiv.org/html/2607.00603#A2.T12) and Figure [9](https://arxiv.org/html/2607.00603#A2.F9) read the gauge-fixed d=8 squared-ReLU run at four training steps across the grok transition. The dead subspace grows and sharpens onto the coordinate axes as the network compresses: the dead-subspace dimension rises from $188$ to $562$ and the axis-alignment from $0.78$ to $0.999$, while the recovered order holds at $k\approx 3$ throughout. The order is fixed by the activation before the subspace forms; only the count and the alignment develop. This gives the classification a developmental signature beyond the single-checkpoint read: the architectural gauge directions are present from initialisation, whereas the node-deaths appear only as the network compresses.

Table 12: Developmental read of the gauge-fixed d=8 squared-ReLU run (cell `gauge_mdrift0p4_s42`, block $1$, $n/d\approx 64$), at four steps across the grok transition. The order $k$ is steady while the dead-subspace dimension and axis-alignment grow.

![Refer to caption](https://arxiv.org/html/2607.00603v1/x9.png)

Figure 9: The dead subspace emerges through grokking (gauge-fixed d=8). The dead-subspace dimension (blue) and the axis-alignment (green) both grow across the grok transition while the recovered order stays at $k\approx 3$.

## Appendix C Optimizer and basis shape the read

### C\.1 The orthogonaliser base produces the readable basis

The optimiser claim of Section [5](https://arxiv.org/html/2607.00603#S5) reads the deep d=8 squared-ReLU transformer under three optimisers from one matched cohort, which separates the orthogonaliser from the gauge projection. The arms are vanilla Muon (textbook NS5, the `ns_off` baseline), the scaled-polar orthogonaliser with the gauge removed (the `DDCMuon` orthogonaliser, `bf_fast`), and the gauge-equivariant optimiser on that orthogonaliser (`bf_fast` plus the rotation gauge). This is the dead-structure readability counterpart of the four-arm accuracy decomposition of Shirodkar ([2026a](https://arxiv.org/html/2607.00603#bib.bib22)), run on the same cohort. Each block's dead subspace is read by the multi-component scan: the top eight directions of the dual-factor bottom subspace are each scanned for the asymptotic $t^{2(k-1)}$ growth, and the count that reaches it (the *readable* count) separates a formed dead structure from a diffuse one (Table [13](https://arxiv.org/html/2607.00603#A3.T13)).

Vanilla Muon leaves no readable dead subspace: the bottom block is a small flat near-kernel (dead-subspace dimension $9$ to $32$), none of whose directions reach the asymptotic regime, at every interior block and every seed (block-1 readable $0/24$ across three seeds; blocks $2$ to $4$ at seed $42$ read $0/8$ with axis-alignment falling to $0.10$ to $0.40$). The scaled-polar orthogonaliser, with the gauge removed, instead forms a large readable subspace (dead-subspace dimension $606$ to $809$) whose directions read the squared-ReLU order $k\approx 3$ on the coordinate axes (axis-alignment $>0.99$), across three seeds and the interior blocks ($23/24$ readable at block $1$, $8/8$ at each of blocks $2$ to $4$). The gauge on the same orthogonaliser reads the same order at the same alignment ($22/24$ at block $1$, $8/8$ through depth), and in addition spectrally separates the bottom block (block verdict "subspace" against the orthogonaliser's "flat"). So on this deep transformer the readable, axis-aligned dead structure comes from the scaled-polar orthogonaliser; the gauge's separate contribution here is the spectral separation of the block. The task-natural reproduction below confirms the orthogonaliser supplies the alignment at width $128$ as well, once the gauge's weight decay is matched.

The $0/24$ for vanilla Muon is the diffuse-structure verdict: the bottom block forms no direction that reaches the asymptotic regime, the null the gate returns in Section [5](https://arxiv.org/html/2607.00603#S5), with nothing formed for any scan to read. This cohort is squared-ReLU, whose readable structure the orthogonaliser leaves on the coordinate axes, so the off-canonical recovery of a *rotated* order is exercised on the gelu cohort, where the same orthogonaliser leaves the order rotated off the axes and the joint-mode scan still returns it (Figure [1](https://arxiv.org/html/2607.00603#S1.F1), Section [5](https://arxiv.org/html/2607.00603#S5)).

At depth $24$ the diffuse-versus-readable split holds with a thinner signal. That cohort carries only the vanilla and gauge arms, with no gauge-free scaled-polar cell, so the orthogonaliser and the gauge cannot be separated there. Its dead subspace is one- to four-dimensional, so the read is single-component, but vanilla Muon reads no order at any block under either activation (readable $0/1$ to $0/4$) while the gauge arm reads it on most blocks (squared-ReLU $3/3$, gelu $2/3$). The contrast is the same as at depth $8$; isolating the orthogonaliser from the gauge there needs a gauge-free scaled-polar arm that the cohort does not yet contain.

Table 13: Three optimisers from one matched cohort (`grok_rope_ab`, d=8 squared-ReLU, $n/d\approx 64$), read by the multi-component scan at block $1$ across three seeds. The readable count is how many of the top eight dual-factor subspace directions reach the asymptotic $t^{2(k-1)}$ regime. The readable axis-aligned structure is the scaled-polar orthogonaliser: both scaled-polar arms form it, vanilla Muon does not.

![Refer to caption](https://arxiv.org/html/2607.00603v1/x10.png)

Figure 10: The multi-component read at block $1$ (d8 squared-ReLU, seed $42$). Each panel scans the top eight directions of the dual-factor bottom subspace; a filled marker is an asymptotic component (the order $\hat{k}$ it reads), a cross a pre-asymptotic one (no order). Vanilla Muon forms no readable order; the scaled-polar orthogonaliser and the gauge both read the squared-ReLU order $k=3$ on nearly every component. The readable dead structure comes from the orthogonaliser; the gauge reads the same.

##### Task-size corroboration.

The deep-transformer reads move the orthogonaliser and the depth together. A task-natural one-block transformer (width $128$, the canonical modular-addition size) repeats the orthogonaliser-versus-gauge contrast and crosses it with weight decay, five seeds per arm (Table [14](https://arxiv.org/html/2607.00603#A3.T14)). The dead subspace is small here (dead-subspace dimension $1$ to $4$), so this is the axis-alignment read rather than the multi-component order read the d=8 node-death affords. At matched weight decay the orthogonaliser lifts the dead-subspace axis-alignment from $0.50$ to $0.69$, and the rotation gauge adds $0.01$, within seed noise; weight decay leaves the alignment unchanged ($0.50$ at $1\times$ and $2\times$). The readable axis-aligned basis is therefore the orthogonaliser at the task size too, matching the deep-transformer read. The grok-speed differences across these arms are weight-decay effects, accounted for by the DDC optimizer paper (Shirodkar, [2026a](https://arxiv.org/html/2607.00603#bib.bib22)); the dead-structure read here is the alignment.

Table 14: Task-natural one-block transformer (width $128$, squared-ReLU, five seeds): dead-subspace axis-alignment crossing the orthogonaliser and the rotation gauge with weight decay. At matched ($1\times$) weight decay the orthogonaliser supplies the alignment ($+0.19$) and the gauge adds little ($+0.01$, within noise); weight decay does not move alignment (vanilla $1\times = 2\times$), the control that isolates the alignment read.

![Refer to caption](https://arxiv.org/html/2607.00603v1/x11.png)

Figure 11: The dead-subspace axis-alignment at task size (1-block d128, 5 seeds). At matched $1\times$ weight decay the alignment jumps from vanilla to the scaled-polar orthogonaliser ($+0.19$) and the rotation gauge adds nothing ($+0.01$, within seed noise), so the clean axis-aligned basis is the orthogonaliser. The hatched $2\times$-WD vanilla bar is the control: weight decay does not move the alignment, which isolates the read.

### C\.2 The optimiser\-axis cell

The optimiser-axis reads of Section [5](https://arxiv.org/html/2607.00603#S5) (Tables [4](https://arxiv.org/html/2607.00603#S5.T4) and [5](https://arxiv.org/html/2607.00603#S5.T5)) hold a small architecture fixed and move only the optimiser, isolating the optimiser's effect from the depth the deep-transformer reads confound it with. The cell is a two-layer squared-ReLU teacher–student MLP (input dimension $16$, hidden width $64$, output dimension $4$); the teacher uses $4$ hidden units and the student $64$, so the student over-covers and its spare units prune to a node-death under weight decay. Each arm trains to convergence under SGD, AdamW, RMSProp, or Adam at three seeds $\{42,142,242\}$. The read is the same off-canonical pipeline as the transformer cells: the dead unit is nominated from the smaller-norm K-FAC factor, and the joint mode is scanned on the frozen-weight true-MC Fisher over the auto-selected purity window. The gauge companion is a single-sided move at the dead unit; it reads flat only when the unit's consumer has also been pruned, the signature of a confirmed node-death.

Section [5](https://arxiv.org/html/2607.00603#S5) reads the deterministic-target and noisy-target results (Tables [4](https://arxiv.org/html/2607.00603#S5.T4) and [5](https://arxiv.org/html/2607.00603#S5.T5)): the order holds at $k\approx 3$ where a clean death forms, the optimiser sets the basis, and a noisy target separates the optimisers through the adaptive preconditioner's noise-fitting amplification. Figure [12](https://arxiv.org/html/2607.00603#A3.F12) shows the two cells side by side.

![Refer to caption](https://arxiv.org/html/2607.00603v1/x12.png)

Figure 12: The optimiser-axis cell. Left: on the deterministic squared-ReLU teacher–student node-death every optimiser reads the predicted $k=3$. Right: under Gaussian target noise AdamW loses the death (the order reads deviant) while SGD, Adam, and RMSProp hold it.

### C\.3 Dead\-input classification on the seven\-optimizer cohort

A dead direction the read cannot order is the limiting case of the read. The component is degenerate enough that no perturbation of its weights moves the output, so the transversal-Fisher scan finds nothing to fit and the order is undefined. This appendix documents that case, the read's third classification beside the gauge and the node-death. The cohort is a one-block transformer (width $128$, four heads) on addition mod $113$, the modular-addition cohort Shirodkar ([2026a](https://arxiv.org/html/2607.00603#bib.bib22)) analyses for its mechanism, read at the grokked checkpoint ($0.98$ to $1.00$ validation), across seven optimizers and three activations with three seeds each. The read targets the smallest-joint-norm hidden unit of the MLP.

On the arm trained by the gauge-equivariant Adam the read returns no order: every scanned direction classifies tangential and the transversal count is zero. The cause is upstream. The second LayerNorm gain has collapsed to $3\times 10^{-9}$, eight orders below the $0.20$ to $0.36$ the other arms hold, so the normalised MLP input is a constant and the unit's pre-activation no longer depends on the token. The liveness read of the MLP input (one forward pass over the gain norm and the input-activation spread) reports the collapse, so the bare no-order result classifies a dead input channel, distinct from a gauge at a deep floor and a node-death at a finite order. This collapse is the DDC optimizer shedding the spare feed-forward block on its log-radial scale, where weight decay drives the gauged scale to zero (Shirodkar, [2026a](https://arxiv.org/html/2607.00603#bib.bib22)); the read's part is to localise it from the frozen checkpoint, where the same rotation gauge on a Muon base (the DDC arm of Appendix [D.2](https://arxiv.org/html/2607.00603#A4.SS2)) instead keeps the MLP alive.

The restricted learning coefficient confirms the classification. The MLP block's SGLD-LLC on the gauge-equivariant Adam arm is near zero, a seed median of $0.4$ against $17$ to $756$ for the live arms, the cohort floor by more than an order of magnitude. The same restriction taken to a single dead unit does not converge (Appendix [D.1](https://arxiv.org/html/2607.00603#A4.SS1)); the block is the finest scale the sampler resolves on this arm, while the rate read returns the per-direction order.

Across the seven optimizers and three seeds the read separates this dead\-input arm from the live\-MLP arms at the frozen checkpoint, the classification the cross\-optimizer analyses of Appendices[D\.2](https://arxiv.org/html/2607.00603#A4.SS2)and[D\.2\.1](https://arxiv.org/html/2607.00603#A4.SS2.SSS1)build on\.

## Appendix D The global view

### D\.1 Learning\-coefficient comparison

Table [15](https://arxiv.org/html/2607.00603#A4.T15) reports the SGLD learning-coefficient estimate at three granularities against the rate read, on the grokked d=8 squared-ReLU checkpoint. The estimator is the SGLD local learning coefficient of Lau et al. ([2025](https://arxiv.org/html/2607.00603#bib.bib15)), its restricted (per-subset) form following Wang et al. ([2025](https://arxiv.org/html/2607.00603#bib.bib28)), run via the devinterp library (Timaeus and collaborators, [2024](https://arxiv.org/html/2607.00603#bib.bib27)) with our own calibration of the inverse temperature and convergence gates. The global and per-block (restricted) estimates converge at component scale; the per-direction restriction does not converge. The rate read returns the per-direction order deterministically.

Table 15: Learning-coefficient estimate vs the rate read (grokked d=8 checkpoint).

![Refer to caption](https://arxiv.org/html/2607.00603v1/x13.png)

Figure 13: The learning coefficient across granularities (grokked d8 squared-ReLU). The SGLD sampler converges at the global model and one block but degenerates restricted to a single direction; the descent-free rate read returns the per-direction order $k=3$, hence $\lambda_{\mathrm{dir}}=1/6$, deterministically.

On the seven-optimizer width-128 cohort of Appendix [C.3](https://arxiv.org/html/2607.00603#A3.SS3) the read returns the three views of the triple at one frozen checkpoint (Table [16](https://arxiv.org/html/2607.00603#A4.T16)): the per-direction order $k$ and its coefficient $1/(2k)$ from the geometry read, the block coefficient from the restricted sampler (Appendix [D.2.1](https://arxiv.org/html/2607.00603#A4.SS2.SSS1)), and the trajectory fluctuation $\nu$ from the loss curve. The views are complementary. The order separates the geometries, the canonical arms reading the architecture order $3$, the weight-decay arms departing to the input-death order near $2$, and the gauge-equivariant Adam arm returning no order with a dead input (Appendix [C.3](https://arxiv.org/html/2607.00603#A3.SS3)). The block coefficient carries the bulk degeneracy a single direction does not: SGD and AdamW share the order $3$ yet differ thirtyfold in the restricted coefficient, the dense weight-decay-free posterior against the compressed one. The fluctuation is read on the trajectory the geometry read does not need.

Table 16: The Watanabe triple read three ways on the seven-optimizer cohort, seed median, squared-ReLU. Order $k$ is the intrinsic geometry read (the weight-decay arms read the off-canonical input-death order, gauge Adam's input is dead with no order, the DDC log-radial shed of Appendix [C.3](https://arxiv.org/html/2607.00603#A3.SS3)); $1/(2k)$ the per-direction coefficient; MLP rLLC the local learning coefficient from the sampler restricted to the block (Table [18](https://arxiv.org/html/2607.00603#A4.T18)); $\nu$ the trajectory fluctuation at the plateau.

### D\.2 Complexity ranking against the sampler

This appendix backs the ranking use of Section [8](https://arxiv.org/html/2607.00603#S8) (Figure [6](https://arxiv.org/html/2607.00603#S8.F6)). The set is eighteen squared-ReLU d8 checkpoints on one synthetic arithmetic task (the task and architecture fixed), spanning gauge-fixed DDC, vanilla Muon, and vanilla AdamW, with and without weight decay, across training steps. For each the cheap estimate is the dead-unit census (one forward pass), summed into $\hat{\lambda}=N_{\mathrm{total}}/2-N_{\mathrm{dead}}\,(1/2-1/(2k))$ over the $N_{\mathrm{total}}=16384$ MLP-hidden units with $k=3$; the expensive estimate is the calibrated SGLD-LLC with a focused $(\eta,\gamma)$ grid per cell, since one shared configuration does not converge positively across the set. Eleven cells converge ($\hat{R}\leq 1.1$, positive estimate); the other seven drop from the SGLD comparison on a negative estimate or $\hat{R}>1.1$ (the four no-weight-decay accumulation-regime vanilla-Muon cells, a late vanilla-Muon cell, one gauge-fixed DDC cell, and the vanilla-AdamW cell), while the census reads all eighteen. Spearman $\rho(\hat{\lambda},\text{SGLD-LLC})=0.66$ on the eleven converged checkpoints and $0.71$ on the seven distinct converged models (Table [17](https://arxiv.org/html/2607.00603#A4.T17)). The gauge-fixed DDC models occupy the high-$\hat{\lambda}$, high-LLC corner and the vanilla-Muon models the low corner, in both estimators. The census separates the families by the dead-unit count, which is $0.1\%$ to $4.4\%$ of the $16384$ units, so $\hat{\lambda}$ barely moves off $N_{\mathrm{total}}/2$: the pooled $\rho$ reads the family split, while within a single run the count follows the weight-decay schedule and the sampler moves the other way. The per-block effective dimension below tracks the restricted sampler more closely.
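The census formula is simple enough to evaluate by hand, and doing so shows how little the estimate moves. The sketch below uses exact rationals; the function name `census_lambda` and the rounding of the quoted percentages to whole unit counts are assumptions made for the example.

```python
from fractions import Fraction


def census_lambda(n_total, n_dead, k):
    """lambda_hat = N_total / 2 - N_dead * (1/2 - 1/(2k)), evaluated exactly."""
    return Fraction(n_total, 2) - n_dead * (Fraction(1, 2) - Fraction(1, 2 * k))


N_TOTAL, K = 16384, 3
for dead_fraction in (0.001, 0.044):            # the 0.1% and 4.4% ends of the quoted range
    n_dead = round(dead_fraction * N_TOTAL)     # assumed rounding to whole units
    lam = census_lambda(N_TOTAL, n_dead, K)
    shift = 1 - lam / Fraction(N_TOTAL, 2)
    print(n_dead, float(lam), f"{float(shift):.2%} below N_total/2")
```

Each dead unit at $k=3$ removes $1/2-1/6=1/3$ from the regular count, so even the largest dead fraction in the set shifts $\hat{\lambda}$ by only a few percent of $N_{\mathrm{total}}/2=8192$.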

Table 17: The seven distinct converged models, sorted by the cheap $\hat{\lambda}$. Both estimators place the gauge-fixed DDC models above the vanilla-Muon models. Seed $42$; $n=960$ validation sequences; per-cell SGLD calibration over $\eta\in\{1,3,10\}\times 10^{-5}$, $\gamma\in\{10,100\}$.

#### D.2.1 Per-block ranking on a seven-optimizer cohort

The ranking above reads the cheap census against the global coefficient. The same question at the block scale, on the width-128 seven-optimizer cohort of Appendix [C.3](https://arxiv.org/html/2607.00603#A3.SS3) (SGD, Muon, Muon with weight decay, AdamW, AdamW with matched hyperparameters, gauge-fixed Muon, gauge-fixed Adam; squared-ReLU, three seeds), reads the calibrated SGLD coefficient restricted to the MLP block and to the attention block (over the block parameter subset, seed median under a cross-seed stability gate) against a cheap frozen-Fisher proxy. The proxy is the localized effective dimension $\hat{\lambda}_{\mathrm{eff}}(\gamma)=\tfrac{1}{2}\sum_{i}\lambda_{i}/(\lambda_{i}+\gamma)$ over the block's K-FAC factor spectrum (the products $\lambda_{G,i}\lambda_{A,j}$ of the input and output-gradient factor eigenvalues), at one forward and backward pass per block. Table [18](https://arxiv.org/html/2607.00603#A4.T18) lists the seed medians.
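A minimal sketch of the effective-dimension proxy follows, assuming the two K-FAC factor spectra are already available as arrays of non-negative eigenvalues. The function name and the toy spectra are hypothetical; only the formula is taken from the text.

```python
import numpy as np


def effective_dimension(eig_G, eig_A, gamma=0.1):
    """0.5 * sum_i lam_i / (lam_i + gamma), with lam the products lam_G,i * lam_A,j."""
    lam = np.outer(np.asarray(eig_G, float), np.asarray(eig_A, float)).ravel()
    return 0.5 * float(np.sum(lam / (lam + gamma)))


# Toy spectra: two live output-gradient directions and one dead one,
# three live input directions and one dead one (values are illustrative).
eig_G = [4.0, 1.0, 0.0]
eig_A = [2.0, 1.0, 0.5, 0.0]
print(effective_dimension(eig_G, eig_A, gamma=0.1))
print(0.5 * len(eig_G) * len(eig_A))            # the regular count D/2 for comparison
```

Directions with zero curvature contribute nothing and strongly curved ones contribute close to $1/2$ each, so the proxy interpolates between a rank count and a curvature-weighted count as $\gamma$ varies.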

Table 18: Per-block complexity on the seven-optimizer cohort: the calibrated SGLD coefficient restricted to each block (rLLC) against the cheap K-FAC effective dimension at $\gamma=0.1$. Seed medians, squared-ReLU. The gauge-Adam MLP is the DDC log-radial shed (Appendix [C.3](https://arxiv.org/html/2607.00603#A3.SS3)); the read and the restricted sampler both localise it.

The effective dimension tracks the restricted sampler per block: Spearman $\rho(\hat{\lambda}_{\mathrm{eff}},\mathrm{rLLC})=0.82$ on the MLP and $0.79$ on the attention, at $\gamma=0.1$, near the localization scale the sampler itself uses. The bare factor rank does not track ($\rho=0.29$): SGD carries a low-rank MLP input from the low-dimensional grokked representation yet the cohort's highest MLP coefficient from its dense weight-decay-free posterior, so the directions have to be weighted by curvature, which the effective dimension does and a count does not. The liveness-gated census reads the input-dead arm to zero (gauge Adam, $\hat{\lambda}_{\mathrm{gated}}=0$ against $94$ to $256$ for the live arms), the same death the restricted sampler reads ($0.4$). The attention block carries no isolated dead channels; its degeneracy is a low weight-space rank, and the per-head VO and QK effective rank separates the optimizer families the restricted sampler does, the gauge and vanilla Muon arms at high weight-rank and low coefficient, the Adam arms at low rank and high coefficient.

### D\.3 Global\-coefficient assembly on analytic models

Table [19](https://arxiv.org/html/2607.00603#A4.T19) assembles a global learning coefficient from the per-direction reads on analytic singular models and checks it against the closed form. For each cell the per-direction orders come from the multi-component read, the crossing-versus-independent grouping from the directional-Fisher coupling (two directions share a locus when fixing one scales the other's leading Fisher coefficient, an exponent of $2k_{j}$ within a crossing and $0$ across independent loci), and the global pair from the sum-versus-crossing rule. The closed form is Watanabe's normal-crossing real log canonical threshold (RLCT) for the crossings, the separable sum for the independent loci, and Aoyagi's deep-linear coefficient for the scalar networks. On the scalar networks the radial collapse direction reads order $k=L$, whose single-direction threshold $1/(2L)$ sits below the global $1/2$ the $L$-hyperplane crossing carries; the blind grouping recovers the $1/2$.
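The sum-versus-crossing rule can be written out as a short sketch. It assumes the grouping into loci is already known, takes the minimum of $1/(2k_{j})$ with its multiplicity inside a normal crossing, and adds coefficients across independent loci with multiplicities combining as $m_{1}+m_{2}-1$. That combination law is the standard one for sums over disjoint variables and is stated here as an assumption rather than quoted from the text; the function names are hypothetical.

```python
from fractions import Fraction


def crossing(orders):
    """(lambda, m) for one normal crossing: the smallest 1/(2k) and how often it is attained."""
    thresholds = [Fraction(1, 2 * k) for k in orders]
    lam = min(thresholds)
    return lam, thresholds.count(lam)


def assemble(groups):
    """(lambda, m) for independent loci, each locus given as a list of crossing orders."""
    lam, m = Fraction(0), 1
    for orders in groups:
        lam_g, m_g = crossing(orders)
        lam += lam_g                             # coefficients add across independent loci
        m += m_g - 1                             # multiplicities combine as m1 + m2 - 1
    return lam, m


print(assemble([[2, 2]]))          # one crossing of two order-2 directions
print(assemble([[2], [3], [4]]))   # three independent loci
print(assemble([[2, 2], [3]]))     # a crossing plus an independent locus
```

The three calls correspond to the crossing $(2,2)$, sum $(2,3,4)$, and composite $2,2,3$ rows of Table 19 and return $(1/4,2)$, $(13/24,1)$, and $(5/12,2)$.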

The wide deep linear networks mark the boundary of the read. Their singular set is the matrix product-zero locus, a determinantal variety whose crossings lie in coupled directions away from any coordinate. The per-direction coordinate read departs from the closed form: on the small grids it over-reads to the regular $D/2$ at depth two, each coordinate staying regular at order one, and under-reads to $1/2$ at depth three, the grouping collapsing every coordinate into a single crossing (lower block of Table [19](https://arxiv.org/html/2607.00603#A4.T19)). The improved rotated order read leaves this unchanged, so the gap is in the assembly. Three reads survive the boundary. The Newton polyhedron of the KL divergence's monomial support gives the learning coefficient by a toric computation, the diagonal Newton distance solved as a linear program: exact for a singularity non-degenerate with respect to its Newton polyhedron, reproducing the closed form on every enumerable cell, and the simplex upper bound otherwise. On the determinantal cells the simplex bound sits far below the regular $D/2$ (the depth-two $2\times 2\times 2$ reads $2.0$ against $D/2=4.0$ and the exact $1.5$) and is itself exact on some cells ($3\times 2\times 2\times 3$ reads the exact $2.0$). The generic-line order $k_{\mathrm{gen}}$ then brackets the coefficient as $[1/(2k_{\mathrm{gen}}),\,\min(D/2,\,\text{simplex})]$, which contains the closed-form $\lambda$ on every cell of the table and holds the hundreds-scale $\lambda$ at the trained $H=(20,h,h,20)$ for $h$ up to $128$. The determinantal signature is itself detectable: the generic-line order disagrees with the order the recovered coordinate structure implies, which flags the locus blind and recovers its depth $L$. Routing the flagged locus to the structured resolution rule, Aoyagi's coefficient for the architecture, recovers the closed form on all five cells the blind assembly missed, the determinantal resolution Shirodkar ([2026b](https://arxiv.org/html/2607.00603#bib.bib23)) proves for depth $L\leq 3$. A general exact resolver for an arbitrary determinantal variety stays open in that programme: the simple origin blow-up (Hironaka, [1964](https://arxiv.org/html/2607.00603#bib.bib11)) stalls at the simplex bound, so the exact value needs the structured rank-locus resolution (Section [9](https://arxiv.org/html/2607.00603#S9)).

##### Typing the intersection\.

The coupling grouping generalises to a per-cell read of the intersection type, which routes the assembly to its matched rule: a transversal crossing of regular loci to the minimum, a separable sum to the sum, a determinantal locus to the structured resolution, and a tangency to the Newton-polyhedron engine. A tangency and a transversal crossing can read the same generic order; the offset resolution separates them, since holding one nominated normal at an offset makes a transversal crossing's per-direction orders sum to the generic order while a tangency's overshoot it, no offset peeling a clean factor off a shared tangent. On the constructed tangency $K=(y^{2}-x^{4})^{2}$ this routes to Newton for the exact $3/8$ the crossing rule misses, and the matched rule reproduces the closed form on every typed cell. The tangency whose loci read order one along the shared tangent, where the offset orders do not overshoot and a line scan reads it as a transversal crossing, separates under a second-order read. Near the singular point the directional Fisher grows as $t^{2}(u^{\top}Av)^{2}$ with $A$ the Hessian of the locus, and the rank of $A$ types the cell: a transversal crossing reads full rank, every sheet entering at second order, and a tangency reads rank-deficient, the contact pushing one direction to a higher order. The confidence is the singular-value gap, which falls toward the decision threshold as a crossing approaches a shallow angle and the tangency boundary. The case still open is a contact tangent to second order, where $A$ itself degenerates and the read needs a third order.
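The $3/8$ on the constructed tangency can be checked by hand with the diagonal Newton distance. The computation below is a worked sketch under the convention that the coefficient is the reciprocal of the point where the diagonal meets the Newton polygon of $K$; that convention is an assumption chosen to agree with $\lambda=1/(2k)$ for a single monomial $x^{2k}$.

$$
K(x,y)=(y^{2}-x^{4})^{2}=y^{4}-2x^{4}y^{2}+x^{8},
\qquad
\mathrm{supp}(K)=\{(0,4),\,(4,2),\,(8,0)\}.
$$

All three exponents $(\alpha,\beta)$ lie on the single face $\alpha/8+\beta/4=1$. The diagonal point $(\tau,\tau)$ meets that face when

$$
\frac{\tau}{8}+\frac{\tau}{4}=1
\;\Longrightarrow\;
\tau=\frac{8}{3},
\qquad
\lambda=\frac{1}{\tau}=\frac{3}{8}.
$$

Treating the two sheets $y=\pm x^{2}$ as a transversal crossing of order-one loci would instead give the minimum $1/2$, which is the value the crossing rule returns and the Newton route corrects.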

Table 19: Assembled global coefficient against the closed form on analytic models, with the rigorous bracket and the resolution-rule value where the assembly fails. Orders are recovered blind; the locus grouping is recovered from the Fisher coupling; the assembly is the sum-versus-crossing rule (reproduced exactly on the enumerable block by the Newton-polyhedron engine); the bracket is $[1/(2k_{\mathrm{gen}}),\,\min(D/2,\text{simplex})]$, the determinantal upper bound being the Newton simplex bound. The resolved column is the structured resolution rule (Aoyagi) the determinantal detector routes to. Seed $42$, $n=2\times10^{4}$ Monte-Carlo for the directional Fisher (the wide block is exact Gauss-Newton). Assembled equals the closed form on every enumerable cell (upper block); on the wide matrix determinantal cells (lower block) the assembly departs, the bracket contains the closed form, and the resolution rule recovers it.

| cell | rec. $k$ | assembled $(\lambda,m)$ | bracket | resolved $(\lambda,m)$ | closed form $(\lambda,m)$ |
| --- | --- | --- | --- | --- | --- |
| crossing $(2,2)$ | $2,2$ | $(1/4,\,2)$ | $[0.12,\,1.00]$ | – | $(1/4,\,2)$ |
| crossing $(2,3)$ | $2,3$ | $(1/6,\,1)$ | $[0.10,\,1.00]$ | – | $(1/6,\,1)$ |
| crossing $(2,3,2)$ | $2,3,2$ | $(1/6,\,1)$ | $[0.07,\,1.50]$ | – | $(1/6,\,1)$ |
| crossing $(3,3,3)$ | $3,3,3$ | $(1/6,\,3)$ | $[0.06,\,1.50]$ | – | $(1/6,\,3)$ |
| sum $(2,3)$ | $2,3$ | $(5/12,\,1)$ | $[0.25,\,1.00]$ | – | $(5/12,\,1)$ |
| sum $(2,3,4)$ | $2,3,4$ | $(13/24,\,1)$ | $[0.25,\,1.50]$ | – | $(13/24,\,1)$ |
| composite | $2,2,3$ | $(5/12,\,2)$ | $[0.17,\,1.50]$ | – | $(5/12,\,2)$ |
| composite | $2,3,2,2$ | $(5/12,\,2)$ | $[0.12,\,2.00]$ | – | $(5/12,\,2)$ |
| DLN $L=2$ | $1,1$ | $(1/2,\,2)$ | $[0.25,\,1.00]$ | – | $(1/2,\,2)$ |
| DLN $L=3$ | $1,1,1$ | $(1/2,\,3)$ | $[0.17,\,1.50]$ | – | $(1/2,\,3)$ |
| DLN $L=4$ | $1^{\times 4}$ | $(1/2,\,4)$ | $[0.12,\,2.00]$ | – | $(1/2,\,4)$ |
| DLN $L=5$ | $1^{\times 5}$ | $(1/2,\,5)$ | $[0.10,\,2.50]$ | – | $(1/2,\,5)$ |
| *Wide matrix deep linear (matrix product-zero, determinantal): the boundary* | | | | | |
| DLN $2\times2\times2$ | $1^{\times 8}$ | $(4,\,1)$ | $[0.25,\,2.00]$ | $(3/2,\,1)$ | $(3/2,\,1)$ |
| DLN $3\times2\times3$ | $1^{\times 12}$ | $(6,\,1)$ | $[0.25,\,3.00]$ | $(5/2,\,1)$ | $(5/2,\,1)$ |
| DLN $4\times2\times4$ | $1^{\times 16}$ | $(8,\,1)$ | $[0.25,\,4.00]$ | $(7/2,\,1)$ | $(7/2,\,1)$ |
| DLN $3\times3\times3$ | $1^{\times 18}$ | $(9,\,1)$ | $[0.25,\,4.50]$ | $(7/2,\,2)$ | $(7/2,\,2)$ |
| DLN $3\times2\times2\times3$ | $1^{\times 16}$ | $(1/2,\,16)$ | $[0.17,\,2.00]$ | $(2,\,3)$ | $(2,\,3)$ |

![Refer to caption](https://arxiv.org/html/2607.00603v1/x14.png)

Figure 14: Assembling the global learning coefficient against the closed form. The per-direction reads assemble to the closed form (black tick) on every enumerable cell (green) and depart on the wide matrix determinantal cells (amber, right of the divider); the rigorous bracket (grey) contains the closed form and the determinantal-routed resolution (blue) recovers it.
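The sum-versus-crossing rule on the enumerable block of Table 19 can be reproduced with exact fractions. This is a minimal sketch, not the paper's assembly code. The helper names `crossing` and `separable_sum` are hypothetical. The multiplicity rule for sums, $m = \sum_i (m_i - 1) + 1$, is the standard rule for independent blocks and is assumed here rather than stated in the text. The grouping used for the composite cell is one grouping consistent with the tabulated value, not one the text specifies.

```python
from fractions import Fraction


def crossing(orders):
    """Transversal crossing: lambda is the smallest 1/(2k), m counts the ties."""
    coeffs = [Fraction(1, 2 * k) for k in orders]
    lam = min(coeffs)
    return lam, coeffs.count(lam)


def separable_sum(blocks):
    """Separable sum of independent blocks, each given as (lambda, m).

    Coefficients add; multiplicities combine as sum(m_i - 1) + 1
    (assumed standard rule for independent blocks).
    """
    lam = sum(block_lam for block_lam, _ in blocks)
    m = sum(block_m - 1 for _, block_m in blocks) + 1
    return lam, m


# Rows of Table 19 (enumerable block).
print(crossing([2, 2]))                                   # (1/4, 2)
print(crossing([2, 3]))                                   # (1/6, 1)
print(crossing([3, 3, 3]))                                # (1/6, 3)
print(separable_sum([crossing([2]), crossing([3])]))      # (5/12, 1)
print(separable_sum([crossing([k]) for k in (2, 3, 4)]))  # (13/24, 1)
# Composite 2,2,3 read as a (2,2) crossing summed with an order-3 direction
# (hypothetical grouping that matches the tabulated (5/12, 2)).
print(separable_sum([crossing([2, 2]), crossing([3])]))   # (5/12, 2)
```

The same two rules also reproduce the departing assembled values on the wide block: eight order-one directions under the sum rule give $(4, 1)$, which is where the table routes to the structured resolution instead.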

### D.4 The singular-fluctuation cells

Section [6](https://arxiv.org/html/2607.00603#S6) reads the singular fluctuation $\nu$ from the order through the universality value $\nu(k)$ and measures where a trained network's realized $\nu$ falls below it. Three cells support that section, all CPU and fp64, seed $42$.

The *order-$k$ validation cell* confirms $\hat{\nu}=\nu(k)$ on an isolated direction. The model is $y=a\,s^{k}+\varepsilon$, $\varepsilon\sim\mathcal{N}(0,\sigma^{2})$ ($a=1$, $\sigma=1$), whose KL along $s$ has order $2k$. We form the exact one-dimensional posterior on a grid and compute the functional variance $V_{n}=\sum_{i}\mathrm{Var}_{\mathrm{post}}[\log p(y_{i}\mid s)]$, with $\hat{\nu}=V_{n}/2$, data-averaged over $300$ draws at each $n\in\{500,1000,2000,4000\}$.
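The grid computation can be sketched in a few lines of standard-library Python. This is an illustrative reconstruction, not the paper's code. The flat prior on $s\in[-2,2]$, the grid size, the reduced sample sizes, and the names `nu_hat` and `averaged_nu_hat` are all assumptions chosen to keep the example fast.

```python
import math
import random


def nu_hat(y, k, a=1.0, sigma=1.0, s_max=2.0, grid=801):
    """Return V_n / 2 for y_i = a*s**k + noise, from an exact grid posterior in s."""
    inv2 = 1.0 / (2.0 * sigma ** 2)
    # Model mean a*s^k at each grid point of s in [-s_max, s_max].
    means = [a * (-s_max + 2.0 * s_max * j / (grid - 1)) ** k for j in range(grid)]
    # Pointwise log-likelihoods up to an s-independent constant,
    # which cancels in the posterior variance.
    loglik = [[-inv2 * (yi - m) ** 2 for m in means] for yi in y]
    # Posterior weights under a flat prior on the grid (assumed), normalised stably.
    log_post = [sum(row[j] for row in loglik) for j in range(grid)]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Functional variance: sum over data of the posterior variance of log p(y_i | s).
    v_n = 0.0
    for row in loglik:
        mean = sum(w * l for w, l in zip(weights, row))
        v_n += sum(w * (l - mean) ** 2 for w, l in zip(weights, row))
    return v_n / 2.0


def averaged_nu_hat(k, n, draws, seed=42):
    """Average nu_hat over datasets drawn from the teacher s* = 0 (pure noise)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(draws):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        estimates.append(nu_hat(y, k))
    return sum(estimates) / draws


if __name__ == "__main__":
    for order in (1, 2, 3):
        print(order, averaged_nu_hat(k=order, n=200, draws=8))
```

For $k=1$ the model is regular with a single parameter, so the average should sit near $1/2$, the $d/2$ value with $d=1$. Higher orders would be compared against $\nu(k)$ from Section 6.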

The *estimator anchor* validates the functional-variance estimator on a regular $d$-parameter linear-Gaussian model, where $\nu=d/2$. The closed form and a hand-rolled SGLD sampler of the same posterior both return $d/2$ ($d=4,6,10\to 2.0,3.0,5.0$).
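One way to see why the closed form returns $d/2$ is the short calculation below. It is a sketch under assumptions the text does not spell out: a flat prior, inverse temperature one, known noise variance $\sigma^{2}$, and a design matrix $X$ of full column rank $d$ with rows $x_i^{\top}$. The symbols $r_i$ (residual) and $h_i$ (leverage) are introduced here for the derivation only.

```latex
% Assumptions: flat prior, inverse temperature 1, known sigma^2, rank(X) = d.
\begin{align*}
\theta \mid y &\sim \mathcal{N}\!\big(\hat\theta,\ \sigma^{2}(X^{\top}X)^{-1}\big),
\qquad r_i = y_i - x_i^{\top}\hat\theta,
\qquad h_i = x_i^{\top}(X^{\top}X)^{-1}x_i, \\
\mathrm{Var}_{\mathrm{post}}\big[\log p(y_i \mid \theta)\big]
&= \frac{r_i^{2}\,h_i}{\sigma^{2}} + \frac{h_i^{2}}{2}, \\
\mathbb{E}[V_n]
&= \sum_i h_i(1-h_i) + \frac{1}{2}\sum_i h_i^{2}
 = d - \frac{1}{2}\sum_i h_i^{2}, \\
\hat\nu = \frac{V_n}{2} &\longrightarrow \frac{d}{2}
\quad \text{as } n \to \infty \text{ with } \max_i h_i \to 0 .
\end{align*}
```

The third line uses $\mathbb{E}[r_i^{2}] = \sigma^{2}(1-h_i)$ and $\sum_i h_i = d$, so the correction to $d$ is of order $d^{2}/n$ for a well-spread design.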

The *absorption cell* isolates the live-structure suppression. The model is $y=b\,g(x)+s^{k}c(x)+\varepsilon$ with a regular parameter $b$ (basis $g$) and an order-$k$ dead coordinate $s$ (basis $c$), the teacher setting $s^{*}=0$. The dead direction's contribution to $\nu$ is $\nu(\text{joint over }b,s)-\nu(b\text{ alone})$, sampled from the $(b,s)$ grid posterior and data-averaged over $120$ draws, swept over the basis overlap $\rho=\mathrm{corr}(g,c)$ (Figure [5](https://arxiv.org/html/2607.00603#S6.F5)). The effective overlap $\rho_{\mathrm{eff}}$ of a generic dead-scan basis with the live-unit span in the trained vision-transformer cells is the mean over $200$ random scan directions.
