# Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
Source: [https://arxiv.org/html/2605.14258](https://arxiv.org/html/2605.14258)
Jesseba Fernando (Network Science Institute, Northeastern University; fernando.je@northeastern.edu) and Grigori Guitchounts (Flagship Pioneering; g.guitchounts@alumni.harvard.edu)
###### Abstract
Large language models are remarkably capable, yet how computation propagates through their layers remains poorly understood. A growing line of work treats depth as discrete time and the residual stream as a dynamical system, where each layer's nonlinear update has a local linear description. However, previous analyses have relied on scalar summaries or approximate linearizations, leaving the full spectral geometry of trained LLMs unknown. We perform full Jacobian eigendecomposition across three production-scale LLMs and show that training installs a monotonic spectral gradient through depth—from non-normal, rotation-dominated early layers to near-symmetric late layers—together with a cumulative low-rank bottleneck that funnels perturbations into a small fraction of the residual stream's effective dimensions. Our experiments reveal that this gradient and the dimensional collapse are learned rather than architectural, and that both are largely dissolved when structured non-normality is removed. We further show that the topological positioning of graph communities predicts whether the Jacobian amplifies or suppresses them, with the sign of the coupling determined by the local operator type, a relationship absent at initialization. These results map a learned spectral geometry in LLMs that links perturbation propagation and compression to the network's functional topology.
## 1 Introduction
How computation propagates through a neural network remains largely unknown despite being one of the driving principles behind mechanistic interpretability. A growing toolkit addresses this question for transformers: circuit analysis traces computations through specific attention heads and MLP blocks (Olah et al., [2020](https://arxiv.org/html/2605.14258#bib.bib26); Olsson et al., [2022](https://arxiv.org/html/2605.14258#bib.bib28); Lindsey et al., [2025](https://arxiv.org/html/2605.14258#bib.bib27)); sparse autoencoders decompose activations into interpretable features (Cunningham et al., [2023](https://arxiv.org/html/2605.14258#bib.bib29); Bricken et al., [2023](https://arxiv.org/html/2605.14258#bib.bib30)); probing classifiers and residual-stream readouts test what information is linearly accessible at each layer (Alain and Bengio, [2016](https://arxiv.org/html/2605.14258#bib.bib31); Belrose et al., [2023](https://arxiv.org/html/2605.14258#bib.bib32)). These methods have been productive at identifying *what* individual components represent. But *how* these components come to be, or how a perturbation at one layer becomes amplified, compressed, or redirected as it passes through subsequent layers, is an open question.
Dynamical systems theory is a natural formalism for this question, and viewing layers as discrete time steps has precedent in work that treats transformers as ODE solvers or interacting-particle systems (Lu et al., [2019](https://arxiv.org/html/2605.14258#bib.bib15); Geshkovski et al., [2024](https://arxiv.org/html/2605.14258#bib.bib19)), and in empirical analyses of residual-stream trajectories through depth (Hosseini and Fedorenko, [2023](https://arxiv.org/html/2605.14258#bib.bib16); Lawson et al., [2024](https://arxiv.org/html/2605.14258#bib.bib20); Fernando and Guitchounts, [2025](https://arxiv.org/html/2605.14258#bib.bib33)). Each transformer block updates a shared residual stream, $\mathbf{h}_{\ell+1}=\mathbf{h}_{\ell}+f_{\ell}(\mathbf{h}_{\ell})$, making depth a form of discrete time and the per-layer Jacobian $J_{\ell}=\partial\mathbf{h}_{\ell+1}/\partial\mathbf{h}_{\ell}$ the local linear description of the dynamics. Theoretical literature on signal propagation has characterized these Jacobians at random initialization, establishing order-chaos phase boundaries (Poole et al., [2016](https://arxiv.org/html/2605.14258#bib.bib5); Schoenholz et al., [2017](https://arxiv.org/html/2605.14258#bib.bib6)) and conditions for dynamical isometry (Pennington et al., [2017](https://arxiv.org/html/2605.14258#bib.bib7)), but in most cases this analysis has been applied only at initialization or reduces the Jacobian to scalar summaries. Recent empirical work has begun to probe Jacobian-like objects in *trained* transformers—population-level linear maps (Fu et al., [2025](https://arxiv.org/html/2605.14258#bib.bib9)), detached Jacobians with frozen nonlinearities (Golden, [2025](https://arxiv.org/html/2605.14258#bib.bib10)), residual Jacobians correlated with benchmark performance (Aubry et al., [2025](https://arxiv.org/html/2605.14258#bib.bib11)), Dynamic Mode Decomposition on averaged states (Jacobs et al., [2026](https://arxiv.org/html/2605.14258#bib.bib12))—and each independently finds signatures of progressive dimensional funneling through depth. These approaches use approximate or aggregate linearizations rather than the true input-space Jacobian. As a result, the full spectral geometry of trained production-scale Jacobians—the eigenvalue distribution, the non-normality structure, the balance of expanding and contracting modes, the rotational dynamics encoded in complex eigenvalues—remains uncharacterized.
A separate question concerns the relationship between a network's computational *topology* and its *dynamics*. Prior work on modularity in neural networks has focused on the weight graph—showing that trained networks are more "clusterable" than at initialization (Filan et al., [2021](https://arxiv.org/html/2605.14258#bib.bib34)) or extracting communities from weight connectivity patterns (Watanabe et al., [2018](https://arxiv.org/html/2605.14258#bib.bib35)). In transformers, Elhage et al. ([2021](https://arxiv.org/html/2605.14258#bib.bib24)) framed computation as independent components "reading from" and "writing to" a shared residual stream. Whether this additive structure gives rise to mesoscale community organization in the residual stream's *activation* correlations—and whether such structure is architectural or learned—has not been tested at production scale.
We address these gaps by computing exact Jacobians of three production-scale LLMs—Llama 3.1 8B, OLMo 3 7B, and Gemma 4 E4B—performing full eigendecomposition and singular value analysis at every layer. For OLMo, we repeat this at three training checkpoints: step 0 (random initialization), step 471k, and step 1.41M (final pretraining step). Because step 0 shares the trained model's architecture, depth, and wiring while varying only the weights, it serves as a null model: findings present at step 0 are architectural; findings absent at step 0 and emergent at trained checkpoints are attributed to training. We bridge these dynamics to activation-graph topology by constructing activation-correlation graphs and running community detection to test agreement with the Jacobian dynamics across the three checkpoints noted above.
Our contributions are:
1. Full spectral characterization of trained production-scale Jacobians. We find that per-layer Jacobians organize into depth regimes with a monotonic non-normality gradient—from near-rotational early layers to near-symmetric late layers. Approximately 98% of eigenvalues appear as complex conjugate pairs. Composing Jacobians across depth reveals that perturbations are funneled into a handful of effective channels ($\sim 4$–$40$) across all three architectures.
2. Architecture-versus-training decomposition via OLMo's step-0 null model. Depth regimes and adjacent-layer subspace decoupling are largely architectural, while the non-normality gradient and dimensional funneling are installed by training. Using Schur decomposition to zero the non-normality of the Jacobian reveals disruptions that track whether rank collapse is preserved, indicating that non-normal structure is central to the computations carried out by the residual stream.
3. Per-unit[^1] coupling between community topology and Jacobian dynamics. Units that bridge multiple activation-graph communities (boundary nodes) are preferentially amplified or suppressed by the Jacobian, with the sign governed by operator type: boundary units are amplified at near-symmetric layers and de-amplified at non-normal ones. This coupling is absent at initialization and emerges monotonically over training ($0\to 14\to 24$ FDR-significant layers across OLMo checkpoints), establishing it as a learned property even though the mesoscale community structure itself is partly architectural.

[^1]: We use "unit" throughout in the neuroscientific sense of a recording channel: the $i$-th coordinate of $\mathbf{h}_{\ell}\in\mathbb{R}^{d}$. We avoid "neuron" because residual stream coordinates do not correspond to discrete computational elements but are bookkeeping channels of a shared workspace (Elhage et al., [2021](https://arxiv.org/html/2605.14258#bib.bib24)).
## 2 Training reshapes per-layer Jacobians into a non-normality gradient and a dimensional bottleneck
We treat each transformer block as one step of a discrete dynamical system on the residual stream and ask what its local linear description—the per-layer Jacobian $J_{\ell}=\partial\mathbf{h}_{\ell+1}/\partial\mathbf{h}_{\ell}$—looks like in trained transformers. We compute exact per-sample Jacobians at every sub-layer boundary on 1,000 WikiText-2 (Merity et al., [2017](https://arxiv.org/html/2605.14258#bib.bib3)) samples for three architectures (Llama 3.1 8B, OLMo 3 7B, Gemma 4 E4B) and, for OLMo, at three pre-training checkpoints (step 0, 471k, 1.41M). Step 0 shares the trained model's architecture, depth, and wiring while varying only the weights, so it serves as an architectural null. Computational and statistical details, and the formal definitions of the spectral quantities used below, are in Appendix [A](https://arxiv.org/html/2605.14258#A1).
A first basic observation cuts across every model and every layer: about 98% of the eigenvalues of the mean Jacobian come as complex conjugate pairs (visualized for Llama in Figure [1](https://arxiv.org/html/2605.14258#S2.F1); full per-layer densities for all three models in Figure [S1](https://arxiv.org/html/2605.14258#A1.F1)). Each such pair is a spiral—a two-dimensional subspace that is simultaneously rotated and either stretched or compressed—rather than a pure radial expansion or contraction. This rotational structure is invisible to the singular-value decompositions used in all prior spectral analyses of transformers (Fu et al., [2025](https://arxiv.org/html/2605.14258#bib.bib9); Golden, [2025](https://arxiv.org/html/2605.14258#bib.bib10); Aubry et al., [2025](https://arxiv.org/html/2605.14258#bib.bib11)), which by construction collapse each mode to a non-negative scale factor and discard phase. Trained transformers thus route information through coupled rotational modes at every depth.
### 2.1 Jacobians of trained transformers sweep from rotation-dominated to near-symmetric operators across depth
We set out to ask whether different layers play distinct dynamical roles, and if so, how those roles are distributed along the depth axis. Two complementary measures organize the layers (Figure [2](https://arxiv.org/html/2605.14258#S2.F2)). The condition number $\kappa_{\ell}=\sigma_{\max}/\sigma_{\min}$ summarizes geometric distortion; self-alignment $\|V_{\ell,:k}^{\top}U_{\ell,:k}\|_{F}^{2}/k$ (overlap between leading input and output singular subspaces, $k=64$) summarizes operator type, equaling 1 for symmetric operators and $k/d\approx 0.016$ for pure rotators. In Llama 3.1 8B these measures partition the 32 layers into three regimes. Early layers (0–4) are highly anisotropic ($\kappa\sim 10^{6}$) and strongly non-normal (self-alignment $\sim 0.04$): near-pure rotators. Mid layers (5–19) collapse the condition number to $\sim 10^{2}$–$10^{3}$ and the expanding-mode fraction to 17–30%, but remain non-normal. Late layers (20–31) re-expand $\kappa$ to $\sim 10^{5}$–$10^{6}$ while rising toward symmetry (self-alignment $\sim 0.55$). Similar patterns were observed for OLMo and Gemma (Figure [S2](https://arxiv.org/html/2605.14258#A1.F2)).
Figure 1: Eigenvalue structure of Llama 3.1 8B mean Jacobians. (a) Complex-plane scatter at five depths; the predominance of off-axis points reflects the $\sim$98% complex conjugate pairs. (b) Log-density heatmaps of $\mathrm{Re}(\lambda)$, $\mathrm{Im}(\lambda)$, and $|\lambda|$ across all 32 layers. Both panels show the three-regime transition: broad early, contracted mid, re-expanded late.

Figure 2: Three-regime Jacobian structure across depth. Top row: Llama 3.1 8B per-layer profiles. (a) Condition number $\kappa$ (median over 1,000 samples; shaded band shows IQR), (b) fraction of expanding eigenvalue modes, (c) mean eigenvalue magnitude. Bottom row: regime-mean summaries (early 0–4, mid 5–19, late 20–31) across all five configurations for the same three quantities: (d) $\kappa$, (e) fraction expanding, (f) mean $|\lambda|$ (broken $y$-axis).

Self-alignment rises monotonically from $\approx 0.04$ (layer 2) to $\approx 0.70$ (layer 29); the Henrici departure $\delta(J)=\sqrt{\|J\|_{F}^{2}-\sum_{i}|\lambda_{i}|^{2}}\,/\,\|J\|_{F}$ falls in lockstep, from $0.91$ to $0.47$ (Figure [3](https://arxiv.org/html/2605.14258#S2.F3)a,c). Operator type thus changes continuously from rotation toward symmetry even though geometric distortion bottoms out in the middle. Could this rising symmetry merely reflect the identity in the skip connection dominating in late layers? Stripping the skip ($R_{\ell}=J_{\ell}-I$) rules that out: $R_{\ell}$'s self-alignment stays below $\sim 0.20$ everywhere (Figure [3](https://arxiv.org/html/2605.14258#S2.F3)a). The block computation is non-normal at every layer; the symmetry of $J_{\ell}$ instead tracks the residual norm ratio $\|R_{\ell}\|_{F}/\|J_{\ell}\|_{F}$, which declines from $0.963$ to $0.518$ as the identity progressively dominates (Figure [3](https://arxiv.org/html/2605.14258#S2.F3)b). The same pattern holds across all models (Figure [S3](https://arxiv.org/html/2605.14258#A1.F3)).
OLMo's step-0 checkpoint separates architectural from learned contributions. At initialization, all layers beyond the first are nearly symmetric (mid 0.95, late 0.98) with eigenvalue clouds collapsed near $\mathrm{Re}(\lambda)=1$ (Figure [S4](https://arxiv.org/html/2605.14258#A1.F4)a; full trajectory in Figure [S5](https://arxiv.org/html/2605.14258#A1.F5)). Training leaves the late ceiling near-symmetric ($0.98\to 0.86$) while pulling early and mid layers toward rotation (mid $0.95\to 0.65$; early $0.64\to 0.43$; Table [2](https://arxiv.org/html/2605.14258#A1.T2)). The non-normality gradient is thus a joint product of architecture (late-regime ceiling) and training (early/mid rotator regime).
Figure 3: Non-normality gradient across depth. Top row: Llama 3.1 8B per-layer profiles. (a) Self-alignment of $J_{\ell}$ (orange) and $R_{\ell}=J_{\ell}-I$ (blue), with random baseline $k/d\approx 0.016$ (dotted). (b) Residual norm ratio $\|R_{\ell}\|_{F}/\|J_{\ell}\|_{F}$. (c) Henrici departure from normality $\delta(J_{\ell})$. Bottom row: regime means (early/mid/late) across all five configurations for: (d) $J$ self-alignment, (e) $R$ self-alignment, (f) residual norm ratio, (g) Henrici departure. Full per-layer profiles for all models in Figure [S3](https://arxiv.org/html/2605.14258#A1.F3).

Does the gradient imply inter-layer coordination—do the leading singular directions one layer writes into match those the next layer reads? Largely not: forward alignment $\|U_{\ell}^{\top}V_{\ell+1}\|_{F}^{2}/k$ sits at or near the random baseline $k/d\approx 0.016$ across all models (Figure [S6](https://arxiv.org/html/2605.14258#A1.F6); Appendix [B](https://arxiv.org/html/2605.14258#A2)). The residual stream thus functions as the shared workspace theorized by Elhage et al. ([2021](https://arxiv.org/html/2605.14258#bib.bib24)), with at most weak inter-layer coordination.
### 2.2 Training installs a cumulative low-rank bottleneck through depth
To capture end-to-end propagation we form the cumulative Jacobian product $P_{\ell}=J_{31}\cdots J_{\ell}$, mapping a perturbation at layer $\ell$ to the final residual-stream state, and measure its dimensionality via effective rank $\mathrm{erank}(P_{\ell})=\exp(-\sum_{i}p_{i}\log p_{i})$ with $p_{i}=\sigma_{i}^{2}/\sum_{j}\sigma_{j}^{2}$.
In Llama 3.1 8B, effective rank drops from $\sim 436$ (single-layer, $\ell=31$) to $6.7$ at full composition ($\ell=0$): of $4{,}096$ input directions, only about seven survive end-to-end (Figure [4](https://arxiv.org/html/2605.14258#S2.F4)a). Gemma funnels to $5.9$ across 42 layers (Figure [S7](https://arxiv.org/html/2605.14258#A1.F7)). The collapse is not architectural: at OLMo step 0 effective rank stays high ($\approx 326$ for $P_{0}$, $\approx 4006$ for $P_{31}$); by step 1.41M it falls to $\sim 42$—two orders of magnitude below initialization (Figure [4](https://arxiv.org/html/2605.14258#S2.F4)d). The steepest collapse occurs at early layers, where large spectral radii amplify a few eigendirections and suppress the rest (Spearman $\rho_{s}=-0.33$, $p=0.06$; Figure [4](https://arxiv.org/html/2605.14258#S2.F4)b).
The trained network's sensitivity to its own input is thus confined to a subspace three orders of magnitude smaller than the residual stream—a learned bottleneck that bounds from above how many independent features any single forward pass can propagate from embedding to logit.
Figure 4: Cumulative Jacobian analysis (Llama 3.1 8B unless noted). (a) Effective rank of $P_{\ell}=J_{31}\cdots J_{\ell}$ (blue, left axis) and fraction of expanding eigenvalues (red dashed, right axis; dotted = 50%) vs. injection layer. (b) Effective rank vs. spectral radius $\rho(J_{\ell})$, colored by layer; Spearman $\rho_{s}$ annotated. (c) Singular value spectra of cumulative products at three depths. (d) Effective rank of $P_{0}$ (full product), $P_{\text{mid}}$, and $P_{L}$ (single-layer) across all five configurations (broken $y$-axis).
### 2.3 The cumulative bottleneck is a property of the trained non-normal feedforward, not the spectrum
The bottleneck in the residual stream could in principle reflect either of two distinct properties of $J_{\ell}$: the *eigenvalue spectrum*, which we have shown contains expanding modes at every depth, or the trained *non-normal feedforward*, the off-spectrum mass that distinguishes a generic operator from a normal one with the same eigenvalues. Applying Schur decomposition to the Jacobians allowed us to cleanly separate the two possibilities: $J_{\ell}=Q_{\ell}(\Lambda_{\ell}+N_{\ell})Q_{\ell}^{*}$, with $\Lambda_{\ell}$ diagonal (the eigenvalues) and $N_{\ell}$ strictly upper-triangular (the non-normal piece). Holding $\Lambda_{\ell}$ and $Q_{\ell}$ fixed and scaling only $N_{\ell}$ by a dose $c\in\{0,0.25,0.5,0.75,1,1.5,2\}$ ($c=0$ removes all non-normality at fixed spectrum; $c=1$ is the trained model), we recomposed the linearized stack and recomputed $\mathrm{erank}(P_{0:31})$.
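A minimal sketch of this dose manipulation, assuming the per-layer mean Jacobians are available as a list of $d\times d$ NumPy arrays (here called `mean_jacobians`, a name introduced for illustration), using `scipy.linalg.schur`:

```python
import numpy as np
from scipy.linalg import schur

def schur_dose(jacobians, c):
    """Rescale only the strictly upper-triangular (non-normal) part of each
    per-layer Jacobian, keeping its Schur basis Q and spectrum Lambda fixed,
    then return the re-composed cumulative product P_0 = J_{L-1} ... J_0."""
    P = np.eye(jacobians[0].shape[0])
    for J in jacobians:                            # layer order 0 .. L-1
        T, Q = schur(J, output="complex")          # J = Q T Q*, T upper triangular
        Lam = np.diag(np.diag(T))                  # eigenvalues on the diagonal
        N = T - Lam                                # strictly upper-triangular part
        J_c = (Q @ (Lam + c * N) @ Q.conj().T).real
        P = J_c @ P                                # compose through depth
    return P

def effective_rank(M):
    """Entropy-based effective rank of a matrix's singular value spectrum."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s**2 / np.sum(s**2)
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

# e.g. compare the trained dose (c=1) to the fully normal surrogate (c=0):
# erank_trained = effective_rank(schur_dose(mean_jacobians, c=1.0))
# erank_normal  = effective_rank(schur_dose(mean_jacobians, c=0.0))
```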
Figure 5: Schur surgery on the trained non-normal feedforward of each layer's Jacobian. Each $J_{\ell}$ is written in complex Schur form and only $N_{\ell}$ is scaled by a dose $c$, holding the spectrum $\Lambda_{\ell}$ and basis $Q_{\ell}$ fixed; $c=0$ is fully normal at the trained spectrum, $c=1$ is the trained model. (a) Cumulative effective rank of $P_{0:31}$ vs. dose (log $y$); all three trained models trace the same monotone curve, OLMo step 0 (untrained) is nearly flat. (b) $\log_{10}\|P_{0:L}\|_{F}$ vs. dose: cumulative Jacobians gain several orders of magnitude in cumulative Frobenius norm as $N_{\ell}$ is scaled in, consistent with the transient-amplification reading of non-normality. (c) Henrici departure of $P_{0:L}$ vs. dose: cumulative non-normality saturates near unity at modest $c$ in trained networks. Random-replacement controls and mode-$R$ confirmation in Appendix [C](https://arxiv.org/html/2605.14258#A3).

Removing the trained non-normal feedforward component of $J$ while keeping the eigenspectrum intact largely dissolves the bottleneck. In Llama 3.1 8B, $\mathrm{erank}(P_{0:31})$ rises from $7.1$ at $c=1$ to $45.4$ at $c=0$ (a $6.4\times$ recovery), and falls further to $2.5$ at $c=2$ (Figure [5](https://arxiv.org/html/2605.14258#S2.F5)a); OLMo (step 1.41M) and Gemma 4 E4B reproduce the same monotone trajectory, with $c=0$/$c=1$ ratios of $5.7$ and $6.6$, while the untrained OLMo step 0 baseline is nearly flat over the same dose range. Similarly, we ruled out that the dimensionality collapse relies on the Jacobian's Frobenius mass or the skip connection (Appendix [C](https://arxiv.org/html/2605.14258#A3), Figure [S8](https://arxiv.org/html/2605.14258#A1.F8)). The dimensional collapse in the transformer residual stream is therefore not an effect of simple properties of $J$ or its eigenspectrum, but rather a property of the trained upper-triangular structure.
## 3 Activation-graph community structure predicts which units the Jacobian amplifies, with a sign set by operator type
Having characterized the low-dimensional, rotational dynamics of the transformer residual stream, we next asked if those dynamics are related to the stream's topological structure. Do the network's residual-stream units cluster into functional groups? And if so, where does community structure come from: is it a learned organization built by training, or is it already wired in by the architecture? Finally, does a unit's position in the community structure predict how the Jacobian treats it?
### 3.1 Activation correlations form mesoscale communities at every depth
For each of the sub-layer snapshots (i.e., pre-attention and pre-MLP, making 64 steps for a 32-layer model) we extract last-token activations on the same 1,000 WikiText samples used in §[2](https://arxiv.org/html/2605.14258#S2), build a sparse signed correlation graph by retaining the top-$k=20$ edges per unit (positive and negative weights both kept), and partition the graph using signed Leiden CPM (Traag et al., [2011](https://arxiv.org/html/2605.14258#bib.bib36), [2019](https://arxiv.org/html/2605.14258#bib.bib4)) at resolution $\gamma=0.001$ (see Appendix [A](https://arxiv.org/html/2605.14258#A1) and Table [1](https://arxiv.org/html/2605.14258#A1.T1) for the joint $\gamma_{\text{pos}}/\gamma_{\text{neg}}$ formulation, justification, and comparison methods). This yielded 66–87 non-degenerate communities per layer at every checkpoint of every model.
Figure 6: Activation-correlation graphs and community structure (OLMo 3 7B, step 1.41M). (a) Pairwise activation correlation matrix at layer 17 (4,096 units, original order). (b) Same matrix sorted by signed Leiden CPM communities ($\gamma=0.001$). (c) Pairwise NMI between community partitions across all sub-layer snapshots (pre-attn even, pre-MLP odd).

The communities are visually clear (Figure [6](https://arxiv.org/html/2605.14258#S3.F6)a,b) and reorganize gradually with depth: the pairwise normalized mutual information (NMI) between partitions decays smoothly with layer distance (Figure [6](https://arxiv.org/html/2605.14258#S3.F6)c), with no discrete phase boundaries.
### 3.2 Communities capture Jacobian variance even at initialization—a partly architectural prior
We reasoned that if topology and dynamics are coupled, the layers where the community structure changes the most may also show the largest dynamical shifts. They largely do not—the correlation between adjacent-layer topology disruption ($1-\text{NMI}$) and changes in the community-projected operator's spectrum is at chance in every model and checkpoint (Appendix [D](https://arxiv.org/html/2605.14258#A4)).
By contrast, community participation (assessed via the coarse-grained operator $K=C_{\text{out}}^{\top}J\,C_{\text{in}}$, where $C$ is the community basis) explained significantly more of the Jacobian variance than size-matched random community partitions, with the explained variance rising over the layers (Appendix [A](https://arxiv.org/html/2605.14258#A1), Figure [S9](https://arxiv.org/html/2605.14258#A1.F9)).
This was true even at OLMo's step 0: the untrained network showed 32/32 layers above null with median $z=19.2$—the largest effect of any configuration we tested. The residual-stream wiring and attention/MLP block topology are by themselves sufficient to make community structure dynamically informative; mesoscale topology $\leftrightarrow$ dynamics agreement is therefore a partly architectural prior, not a learned property.
### 3.3 Training installs a per-unit coupling between community boundaries and Jacobian amplification
Our mesoscale experiments showed that, at the layer-by-layer level, the transformer architecture connects residual-stream network topology to its dynamics. This does not, however, address the question of whether topology has any effect on particular residual-stream dimensions or units.
Network topology allowed us to measure how evenly a residual-stream unit's connections spread across communities using each unit's participation coefficient: high participation means the unit bridges multiple communities (a boundary node), low participation means its connections concentrate within one community. We asked whether boundary units are preferentially amplified by the Jacobian, where "amplification" is read off the column norm $\|J_{:,i}\|$ (Figure [7](https://arxiv.org/html/2605.14258#S3.F7)a). Significance was assessed per layer with Cohen's $d$ between the column norms of units in the top-10% versus bottom-10% participation tails, FDR-corrected (Benjamini–Hochberg, $\alpha=0.05$) across layers within each model; full procedure in Appendix [A](https://arxiv.org/html/2605.14258#A1). Whereas at OLMo step 0 the coupling is null ($0/32$ layers reach FDR significance), training installs the coupling monotonically: by step 471k it is detectable at $14/32$ layers (median $d=+0.124$), and by step 1.41M at $24/32$ layers (median $d=+0.248$) (Figure [7](https://arxiv.org/html/2605.14258#S3.F7)b,d).
Llama 3.1 8B reproduces the trained pattern at $16/32$ FDR-significant layers (median $d=+0.131$; Figure [7](https://arxiv.org/html/2605.14258#S3.F7)e). Gemma 4 E4B replicated the broad picture but with a twist: only $4/42$ layers show FDR-significant positive coupling, while $13/42$ show FDR-significant negative coupling, the latter clustered in the strongly non-normal early and mid layers where boundary units are preferentially de-amplified (Figure [7](https://arxiv.org/html/2605.14258#S3.F7)f). This sign inversion motivated us to investigate the relationship between boundary-node coupling to dynamics and properties of the Jacobians.
Figure 7: Boundary-node coupling: training trajectory and cross-architecture generalization. Top row (OLMo): (a) participation coefficient vs. Jacobian column-norm (layer 17, step 1.41M); (b) per-layer Cohen's $d$, step 0 (gray) vs. step 1.41M (purple); (c) Spearman $\rho$ by layer; (d) OLMo training trajectory of FDR-significant fraction and median $d$. Bottom row (cross-model): (e) Llama 3.1 8B and (f) Gemma 4 E4B per-layer Cohen's $d$ (darker filled markers are FDR-significant); Gemma shows the sign inversion in non-normal early/mid layers. (g) Cross-model FDR-significant positive fraction.

In Llama and OLMo, the per-unit coupling is uniformly positive and concentrates in the mid-to-late layers. In Gemma, it is positive in the four near-symmetric final layers and negative in the strongly non-normal early and mid layers. However, both patterns are instances of the same monotone relationship: the per-layer self-alignment of $J_{\ell}$—the operator-type measure of §[2.1](https://arxiv.org/html/2605.14258#S2.SS1)—predicts both the existence and the sign of the boundary-node coupling.
Across all five configurations tested, the per-layer Spearman correlation between self-alignment and Cohen's $d$ tracks a single emerging pattern (Figure [8](https://arxiv.org/html/2605.14258#S3.F8)): a monotonic relationship in which layers with self-alignment below $\sim 0.1$ have $d<0$ (boundary de-amplification), layers above $\sim 0.5$ have $d>0$ (boundary amplification), and the crossover lies near the operator-type transition between non-normal and near-symmetric. These results demonstrate that training does not distribute amplification uniformly across residual-stream units but concentrates it (or, in non-normal layers, suppresses it) at the units that bridge distinct functional communities.
Figure 8: Self-alignment (operator type) predicts the sign and magnitude of boundary-node coupling across models and training stages. Top row: OLMo training trajectory — (a) step 0, (b) step 471k, (c) step 1.41M. Bottom row: (d) Llama 3.1 8B; (e) Gemma 4 E4B, whose widest self-alignment range reveals the full monotone relationship including the sign inversion at low self-alignment; (f) cross-model Spearman $\rho$ summary ($**=p<0.01$). Per-panel Spearman $\rho$ and $p$ annotated; points colored by depth regime (blue early, red mid, green late).
## 4 Discussion
Dynamical systems theory has proven a productive lens for understanding neural networks, yet it has so far been applied almost entirely at initialization or in toy regimes. Network topology offers a complementary lens, with community detection on activation graphs revealing mesoscale functional organization, but whether such organization relates to model dynamics has not been established. Here we investigate the full spectral geometry of three production-scale LLMs across training, and connect their dynamics to the residual stream's community topology. Three main findings emerge. First, a monotonic non-normality gradient runs from rotation-dominated early layers to near-symmetric late layers, a depth-varying operator structure not detectable from singular value analyses alone. Second, composing these operators end-to-end reveals a cumulative low-rank bottleneck that funnels perturbations into a small number of effective channels. Third, boundary nodes in the activation-graph community structure are preferentially coupled to the Jacobian dynamics, with a sign that is a learned function of operator type: boundary units are amplified at near-symmetric layers and de-amplified at non-normal ones. Together, these results provide the first full eigenvalue-level description of production-scale transformer Jacobians after training, filling the gap between initialization-time theory, which predicts none of this structure, and approximate linearizations, which cannot resolve the rotational dynamics that dominate the spectrum.
Several methodological choices shape these claims, and each carries a different scope of generalization. The community-detection backbone is signed Leiden CPM at $\gamma=0.001$, selected on partition-stability grounds; three alternative methods yield the same qualitative pattern across all three topology findings, with CPM giving the largest magnitudes—we therefore read the trained-Jacobian topology–dynamics relationship as algorithm-independent in shape, even where its absolute scale is detector-specific. The per-unit effect sizes themselves are moderate, so the OLMo trajectory of $0\to 14\to 24$ FDR-significant layers should be read as a statement about how broadly across depth the coupling becomes detectable rather than about its strength at any single layer; whether prolonged training pushes effect sizes higher or merely propagates the existing signal across more layers is a question the three OLMo checkpoints used here cannot settle. Finally, every Jacobian here is evaluated on last-token activations from WikiText-2, so distributional shift to other inputs could in principle affect the results, although we believe the qualitative character of our findings would not be affected.
Despite these limitations, this work provides useful insights into the dynamics and topology of the transformer residual stream. The dimensional collapse we measure converges with what other groups have found using very different mathematical objects: Golden ([2025](https://arxiv.org/html/2605.14258#bib.bib10)) reports cumulative stable rank of $\sim$1–3 in Llama 3.2 3B via detached Jacobians; Fu et al. ([2025](https://arxiv.org/html/2605.14258#bib.bib9)) document a compression–expansion cycle in regression-fit linear maps; Jacobs et al. ([2026](https://arxiv.org/html/2605.14258#bib.bib12)) observe collapsing stable rank of layer updates in DINOv2-Giant via Dynamic Mode Decomposition. All of these methods and architecture families point at the same progressive funneling through depth, and the exact-Jacobian description we provide here closes the gap that each of those approximate analyses had to leave open. For activation-graph and community-detection methodology, the partition that captures more Jacobian variance than chance is already in place at random initialization: studies that read community structure as a learned organization without an untrained baseline risk attributing to data what is actually due to wiring, and we recommend that activation-graph papers report a random-init baseline whenever the comparison is feasible. For signal-propagation theory, the same step-0 comparison sharpens where the theory still applies: late-regime near-symmetric operators, adjacent-layer SV decoupling at the $\sim k/d$ baseline, and the existence of a mesoscale variance signal are all present at initialization. The non-normality gradient, the cumulative dimensional collapse, and the boundary-unit coupling, by contrast, are absent at step 0 and emerge over training—these are exactly the places where initialization-time analysis breaks down and a trained-Jacobian description like the one we provide becomes necessary.
## References
- G. Alain and Y. Bengio (2016) Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644. Note: Submitted to ICLR 2017. Cited by: [§1](https://arxiv.org/html/2605.14258#S1.p1.1).
- M\. Aubry, H\. Meng, A\. Sugolov, and V\. Papyan \(2025\)Transformer block coupling and its correlation with generalization in LLMs\.InInternational Conference on Learning Representations \(ICLR\),Note:arXiv:2407\.07810\. First posted July 2024Cited by:[§E\.1](https://arxiv.org/html/2605.14258#A5.SS1.p3.4),[§1](https://arxiv.org/html/2605.14258#S1.p2.2),[§2](https://arxiv.org/html/2605.14258#S2.p2.1)\.
- N\. Belrose, Z\. Furman, L\. Smith, D\. Halawi, I\. Ostrovsky, L\. McKinney, S\. Biderman, and J\. Steinhardt \(2023\)Eliciting latent predictions from transformers with the tuned lens\.arXiv preprint arXiv:2303\.08112\.Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p1.1)\.
- T\. Bricken, A\. Templeton, J\. Batson, B\. Chen, A\. Jermyn, T\. Conerly, N\. L\. Turner, S\. Carter,et al\.\(2023\)Towards monosemanticity: decomposing language models with dictionary learning\.Transformer Circuits Thread\.Note:[https://transformer\-circuits\.pub/2023/monosemantic\-features/index\.html](https://transformer-circuits.pub/2023/monosemantic-features/index.html)Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p1.1)\.
- H\. Cunningham, A\. Ewart, L\. Riggs, R\. Huben, and L\. Sharkey \(2023\)Sparse autoencoders find highly interpretable features in language models\.arXiv preprint arXiv:2309\.08600\.Note:Published at ICLR 2024Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p1.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan,et al\.\(2024\)The Llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Note:Meta AI\. Available at[https://arxiv\.org/abs/2407\.21783](https://arxiv.org/abs/2407.21783)Cited by:[1st item](https://arxiv.org/html/2605.14258#A1.I1.i1.p1.1)\.
- N\. Elhage, N\. Nanda, C\. Olsson, T\. Henighan, N\. Joseph, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, N\. DasSarma, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah \(2021\)A mathematical framework for transformer circuits\.Transformer Circuits Thread\.Note:[https://transformer\-circuits\.pub/2021/framework/index\.html](https://transformer-circuits.pub/2021/framework/index.html)Cited by:[Appendix B](https://arxiv.org/html/2605.14258#A2.p3.16),[§E\.3](https://arxiv.org/html/2605.14258#A5.SS3.p1.4),[§1](https://arxiv.org/html/2605.14258#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.14258#S2.SS1.p4.2),[footnote 1](https://arxiv.org/html/2605.14258#footnote1)\.
- R\. Engelken \(2023\)Gradient flossing: improving gradient descent through dynamic control of Jacobians\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.36\.Note:arXiv:2312\.17306\. See also Engelken, Wolf & Abbott, “Lyapunov spectra of chaotic recurrent neural networks,” Phys\. Rev\. Research 5, 043044 \(2023\)Cited by:[§E\.2](https://arxiv.org/html/2605.14258#A5.SS2.p1.4)\.
- J\. Fernando and G\. Guitchounts \(2025\)Transformer dynamics: a neuroscientific approach to interpretability of large language models\.arXiv preprint arXiv:2502\.12131\.Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p2.2)\.
- D\. Filan, S\. Casper, S\. Hod, C\. Wild, A\. Critch, and S\. Russell \(2021\)Clusterability in neural networks\.arXiv preprint arXiv:2103\.03386\.Note:Studies weight\-graph clusterability via spectral clustering \(normalized cut\)\. Key finding: trained NNs are more clusterable than at random initialization or random weight shuffles\. Definition explicitly excludes activations\.Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p3.1)\.
- Z\. Fu, M\. Liao, C\. Russell, and Z\. G\. Cai \(2025\)CAST: compositional analysis via spectral tracking for understanding transformer layer functions\.arXiv preprint arXiv:2510\.14262\.Cited by:[§E\.1](https://arxiv.org/html/2605.14258#A5.SS1.p1.3),[§1](https://arxiv.org/html/2605.14258#S1.p2.2),[§2](https://arxiv.org/html/2605.14258#S2.p2.1),[§4](https://arxiv.org/html/2605.14258#S4.p3.3)\.
- B\. Geshkovski, C\. Letrouit, Y\. Polyanskiy, and P\. Rigollet \(2024\)A mathematical perspective on transformers\.arXiv preprint arXiv:2312\.10794\.Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p2.2)\.
- G\. Godin \(2026\)SCORE: replacing layer stacking with contractive recurrent depth\.arXiv preprint arXiv:2603\.10544\.Note:Controls contraction through explicit ODE step sizesCited by:[§E\.3](https://arxiv.org/html/2605.14258#A5.SS3.p2.1)\.
- J\. R\. Golden \(2025\)Large language models are locally linear mappings\.arXiv preprint arXiv:2505\.24293\.Note:Workshop version titled “Equivalent Linear Mappings of Large Language Models”Cited by:[§E\.1](https://arxiv.org/html/2605.14258#A5.SS1.p2.2),[§1](https://arxiv.org/html/2605.14258#S1.p2.2),[§2](https://arxiv.org/html/2605.14258#S2.p2.1),[§4](https://arxiv.org/html/2605.14258#S4.p3.3)\.
- E\. A\. Hosseini and E\. Fedorenko \(2023\)Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language\.arXiv preprint arXiv:2311\.04930\.Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p2.2)\.
- M\. Jacobs, T\. Fel, R\. Hakim, A\. Brondetta, D\. E\. Ba, and T\. A\. Keller \(2026\)Block recurrent dynamics in vision transformers\.InInternational Conference on Learning Representations \(ICLR\),Note:arXiv:2512\.19941\. First posted December 2025; accepted at ICLR 2026\. Introduces “Dynamical Interpretability”Cited by:[§E\.2](https://arxiv.org/html/2605.14258#A5.SS2.p2.2),[§1](https://arxiv.org/html/2605.14258#S1.p2.2),[§4](https://arxiv.org/html/2605.14258#S4.p3.3)\.
- H\. Koubbi, B\. Geshkovski, and P\. Rigollet \(2026\)Homogenized transformers\.arXiv preprint arXiv:2604\.01978\.Note:Studies deterministic and stochastic homogenized limits for transformers with independently resampled weights across layers/headsCited by:[§E\.3](https://arxiv.org/html/2605.14258#A5.SS3.p2.1)\.
- T\. Lawson, L\. Farnik, C\. Houghton, and L\. Aitchison \(2024\)Residual stream analysis with multi\-layer SAEs\.arXiv preprint arXiv:2409\.04185\.Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p2.2)\.
- J\. Lindsey, W\. Gurnee, E\. Ameisen, B\. Chen, A\. Pearce, N\. L\. Turner, C\. Citro, D\. Abrahams, S\. Carter, B\. Hosmer, J\. Marcus, M\. Sklar, A\. Templeton, T\. Bricken, C\. McDougall, H\. Cunningham, T\. Henighan, A\. Jermyn, A\. Jones, A\. Persic, Z\. Qi, T\. B\. Thompson, S\. Zimmerman, K\. Rivoire, T\. Conerly, C\. Olah, and J\. Batson \(2025\)On the biology of a large language model\.Transformer Circuits Thread\.Note:[https://transformer\-circuits\.pub/2025/attribution\-graphs/biology\.html](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)\. Uses attribution graphs from a cross\-layer transcoder to trace circuits in Claude 3\.5 HaikuCited by:[§1](https://arxiv.org/html/2605.14258#S1.p1.1)\.
- Y\. Lu, Z\. Li, D\. He, Z\. Sun, B\. Dong, T\. Qin, L\. Wang, and T\. Liu \(2019\)Understanding and improving transformer from a multi\-particle dynamic system point of view\.arXiv preprint arXiv:1906\.02762\.Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p2.2)\.
- S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2017\)Pointer sentinel mixture models\.InInternational Conference on Learning Representations \(ICLR\),Note:arXiv:1609\.07843\. Introduces the WikiText\-2 datasetCited by:[§A\.1](https://arxiv.org/html/2605.14258#A1.SS1.p3.2),[§2](https://arxiv.org/html/2605.14258#S2.p1.1)\.
- C\. Olah, N\. Cammarata, L\. Schubert, G\. Goh, M\. Petrov, and S\. Carter \(2020\)Zoom in: an introduction to circuits\.Distill5\(3\),pp\. e00024\.001\.External Links:[Document](https://dx.doi.org/10.23915/distill.00024.001)Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p1.1)\.
- OLMo Team, A\. Ettinger, A\. Bertsch, B\. Kuehl, D\. Graham, D\. Heineman, D\. Groeneveld, F\. Brahman, F\. Timbers, H\. Ivison,et al\.\(2025\)OLMo 3\.arXiv preprint arXiv:2512\.13961\.Note:Allen Institute for AI\. Team OLMo, 68 authorsCited by:[2nd item](https://arxiv.org/html/2605.14258#A1.I1.i2.p1.1)\.
- C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, S\. Johnston, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah \(2022\)In\-context learning and induction heads\.arXiv preprint arXiv:2209\.11895\.Note:Transformer Circuits ThreadCited by:[§1](https://arxiv.org/html/2605.14258#S1.p1.1)\.
- J\. Pennington, S\. Schoenholz, and S\. Ganguli \(2017\)Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.30\.Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p2.2)\.
- B\. Poole, S\. Lahiri, M\. Raghu, J\. Sohl\-Dickstein, and S\. Ganguli \(2016\)Exponential expressivity in deep neural networks through transient chaos\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.29\.Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p2.2)\.
- H\. Prairie, Z\. Novack, T\. Berg\-Kirkpatrick, and D\. Y\. Fu \(2026\)Parcae: scaling laws for stable looped language models\.arXiv preprint arXiv:2604\.12946\.Note:Enforces stability via negative diagonal parameterizationCited by:[§E\.3](https://arxiv.org/html/2605.14258#A5.SS3.p2.1)\.
- L\. Ruthotto and E\. Haber \(2020\)Deep neural networks motivated by partial differential equations\.Journal of Mathematical Imaging and Vision62\(3\),pp\. 352–364\.Note:arXiv:1804\.04272 \(2018\)\. The 2018 date in the main text refers to the preprintExternal Links:[Document](https://dx.doi.org/10.1007/s10851-019-00903-1)Cited by:[§E\.2](https://arxiv.org/html/2605.14258#A5.SS2.p1.4)\.
- H\. Saratchandran and S\. Lucey \(2026\)Spectral conditioning of attention improves transformer performance\.arXiv preprint arXiv:2603\.07162\.Note:NeurIPS 2025 poster\. Analyzes the attention Jacobian with respect toWQW\_\{Q\},WKW\_\{K\}, andWVW\_\{V\}Cited by:[§E\.1](https://arxiv.org/html/2605.14258#A5.SS1.p4.5)\.
- S\. S\. Schoenholz, J\. Gilmer, S\. Ganguli, and J\. Sohl\-Dickstein \(2017\)Deep information propagation\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p2.2)\.
- L\. Storm, H\. Linander, J\. Bec, K\. Gustavsson, and B\. Mehlig \(2024\)Finite\-time Lyapunov exponents of deep neural networks\.Physical Review Letters132\(5\),pp\. 057301\.Note:arXiv:2306\.12548 \(2023\)\. Published February 2024External Links:[Document](https://dx.doi.org/10.1103/PhysRevLett.132.057301)Cited by:[§E\.2](https://arxiv.org/html/2605.14258#A5.SS2.p1.4)\.
- V\. A\. Traag, P\. Van Dooren, and Y\. Nesterov \(2011\)Narrow scope for resolution\-limit\-free community detection\.Physical Review E84\(1\),pp\. \+016114\.External Links:1104\.3083,[Document](https://dx.doi.org/10.1103/PhysRevE.84.016114)Cited by:[§3\.1](https://arxiv.org/html/2605.14258#S3.SS1.p1.6)\.
- V\. A\. Traag, L\. Waltman, and N\. J\. van Eck \(2019\)From Louvain to Leiden: guaranteeing well\-connected communities\.Scientific Reports9\(1\),pp\. 5233\.External Links:[Document](https://dx.doi.org/10.1038/s41598-019-41695-z)Cited by:[§A\.4](https://arxiv.org/html/2605.14258#A1.SS4.p1.5),[§3\.1](https://arxiv.org/html/2605.14258#S3.SS1.p1.6)\.
- C\. Watanabe, K\. Hiramatsu, and K\. Kashino \(2018\)Understanding community structure in layered neural networks\.arXiv preprint arXiv:1804\.04778\.Note:Probabilistic community detection on weight connectivity patterns\. Groups neurons by shared connection structure to adjacent layers\. Also: Watanabe et al\., “Modular representation of layered neural networks,” Neural Networks 97:62–73, 2018Cited by:[§1](https://arxiv.org/html/2605.14258#S1.p3.1)\.
## Appendix A Methods
### A.1 Models and data
We study three decoder-only transformer families:
- Llama 3.1 8B [Dubey et al., [2024](https://arxiv.org/html/2605.14258#bib.bib1)]: meta-llama/Llama-3.1-8B (final checkpoint). 32 layers, $d=4{,}096$, 32 attention heads with grouped-query attention (8 KV heads).
- OLMo 3 7B [OLMo Team et al., [2025](https://arxiv.org/html/2605.14258#bib.bib2)]: allenai/Olmo-3-1025-7B at three training stages (within the pretraining regime): step 0 (random initialization), step 471,000, and step 1,413,814 (final pretraining step). 32 layers, $d=4{,}096$, 32 attention heads with full multi-head attention.
- Gemma 4 E4B: google/gemma-4-e4b-it. 42 layers, $d=2{,}560$, 10 attention heads with grouped-query attention (2 KV heads). Analyzed at the final (instruction-tuned) checkpoint to test cross-architecture generality.
For each configuration, we extract activations on 1,000 samples from WikiText-2 [Merity et al., [2017](https://arxiv.org/html/2605.14258#bib.bib3)] (train split, character-length filtered, taken in dataset order). At each transformer block we record the residual-stream state at two sub-layer boundaries—before attention and before the MLP—yielding 64 (Llama, OLMo) or 84 (Gemma) layer snapshots per sample. We retain only the last-token position, producing data tensors of shape $(1{,}000\times 64\times 4{,}096)$ for Llama/OLMo and $(1{,}000\times 84\times 2{,}560)$ for Gemma.
### A.2 Jacobian computation
The per-layer Jacobian $J_{\ell}=\partial\mathbf{h}_{\ell+1}/\partial\mathbf{h}_{\ell}$ is the $d\times d$ linear map ($d=4{,}096$ for Llama/OLMo, $d=2{,}560$ for Gemma) describing how an infinitesimal perturbation at the input of transformer block $\ell$ propagates to its output. Because the block includes a residual connection, $J_{\ell}=I+\partial f_{\ell}/\partial\mathbf{h}_{\ell}$.
We compute exact Jacobians via `torch.autograd.functional.jacobian`, passing each sample's last-token activation through an isolated copy of the transformer block (cast to float32). Position embeddings are supplied at position 0 for a single-token context; because the computation involves single-token self-attention without a KV cache, the RoPE rotation applied identically to $Q$ and $K$ cancels in the attention dot product, making the result position-invariant.
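A minimal sketch of this computation: the wrapper `block_fn` below (a name introduced for illustration) is assumed to route a single last-token vector through one isolated transformer block, supplying whatever position embeddings or attention mask the specific model class requires.

```python
import torch

def per_layer_jacobian(block_fn, h_last_token):
    """Exact d x d Jacobian J_l = d h_{l+1} / d h_l for one block update.

    `block_fn` maps a (d,) residual-stream vector through one isolated block
    (float32, single-token context; supplying the block's position embeddings
    and attention mask is model-specific and omitted here) and returns the
    (d,) output state.
    """
    h = h_last_token.detach().float()
    return torch.autograd.functional.jacobian(block_fn, h)  # shape (d, d)

# Mean Jacobian over N samples, accumulated in float64 as described above:
# J_bar = sum(per_layer_jacobian(block_fn, h).double() for h in sample_states) / N
```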
For each of the 32 layers, we compute 1,000 individual Jacobians (one per sample) and define the mean Jacobian $\bar{J}_{\ell}=\frac{1}{N}\sum_{i=1}^{N}J_{\ell}^{(i)}$ (accumulated in float64, stored as float32). Per-sample singular value decompositions provide distributional statistics (condition number, Frobenius norm); eigenvalues are computed on the mean Jacobian via `numpy.linalg.eigvals`.
### A.3 Spectral characterization
We derive the following quantities from each layer's Jacobians. Together, these capture how each layer stretches, rotates, and compresses the residual stream, and how these transformations compose across depth.
Singular values (per-sample): full SVD of each $J_{\ell}^{(i)}$. Condition number $\kappa=\sigma_{\max}/\sigma_{\min}$; participation ratio $\text{PR}=(\sum\sigma_{i}^{2})^{2}/\sum\sigma_{i}^{4}$. The condition number quantifies how anisotropically a layer stretches its input: large $\kappa$ means the layer nearly annihilates some directions while amplifying others, a signature of low-rank or feature-selective computation. The participation ratio complements this by measuring how many singular values carry appreciable weight—a layer that distributes energy across many directions (high PR) operates differently from one that concentrates it in a few (low PR), even if both share the same condition number.
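A short sketch of these per-sample summaries, assuming one Jacobian is held as a NumPy array:

```python
import numpy as np

def svd_summaries(J):
    """Condition number and participation ratio from one Jacobian's singular values."""
    s = np.linalg.svd(J, compute_uv=False)
    kappa = s.max() / s.min()                  # geometric distortion
    pr = (np.sum(s**2) ** 2) / np.sum(s**4)    # how many singular modes carry weight
    return kappa, pr
```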
Eigenvalues (mean Jacobian): fraction of expanding modes $=|\{i:|\lambda_{i}|>1\}|/d$. Complex conjugate pairs identified at threshold $|\text{Im}(\lambda)|>10^{-6}$. The fraction of expanding modes indicates whether a layer, on average, amplifies or contracts the residual stream. Complex eigenvalue pairs signal rotational dynamics: the layer mixes pairs of directions rather than simply scaling them. A layer with many complex pairs implements a qualitatively different transformation—one that rotates information between subspaces—compared to a layer whose eigenvalues are predominantly real.
Self-alignment: truncated SVD ($k=64$) of $\bar{J}_{\ell}=U\Sigma V^{\top}$. Self-alignment $=\|V_{k}^{\top}U_{k}\|_{F}^{2}/k$; equals 1 for normal matrices, near $k/d\approx 0.016$ for random subspaces. Normal matrices have orthogonal eigenvectors and their spectral decomposition fully determines their behavior. Non-normal matrices—those with low self-alignment—can transiently amplify inputs even when all eigenvalues indicate contraction, because their input and output directions are misaligned. This transient amplification is invisible to eigenvalue analysis alone and has direct consequences for how perturbations propagate through the network.
Forward alignment: $\|U_{\ell}^{\top}V_{\ell+1}\|_{F}^{2}/k$ measures whether adjacent layers compose through their leading directions. When forward alignment is high, the output subspace of layer $\ell$ feeds directly into the input subspace of layer $\ell+1$, meaning information flows efficiently between layers without being scattered into directions the next layer ignores. Low forward alignment suggests that inter-layer composition is indirect, relying on secondary singular directions or the residual stream's skip connection to preserve information.
Henrici departure: $\delta(J)=\sqrt{\|J\|_{F}^{2}-\sum_{i}|\lambda_{i}|^{2}}\,/\,\|J\|_{F}$. This is a scalar summary of non-normality derived from the gap between the Frobenius norm and the eigenvalue spectrum. It equals zero for normal matrices and approaches one when the matrix's behavior is dominated by its off-diagonal structure. We use it alongside self-alignment because the two measures are sensitive to different aspects of non-normality: self-alignment captures directional misalignment of the leading singular subspaces, while the Henrici departure reflects the aggregate energy in the strictly upper-triangular part of the Schur decomposition.
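A compact sketch of the three alignment and non-normality measures defined above, written directly from the formulas (full rather than truncated SVDs for clarity):

```python
import numpy as np

def self_alignment(J, k=64):
    """Overlap between the leading input (V) and output (U) singular subspaces of J."""
    U, _, Vt = np.linalg.svd(J)
    return np.linalg.norm(Vt[:k] @ U[:, :k], "fro") ** 2 / k

def forward_alignment(J_l, J_lp1, k=64):
    """Do layer l's leading output directions feed layer l+1's leading input directions?"""
    U_l, _, _ = np.linalg.svd(J_l)
    _, _, Vt_lp1 = np.linalg.svd(J_lp1)
    return np.linalg.norm(U_l[:, :k].T @ Vt_lp1[:k].T, "fro") ** 2 / k

def henrici_departure(J):
    """Normalized gap between Frobenius mass and eigenvalue mass (0 for normal matrices)."""
    lam = np.linalg.eigvals(J)
    fro2 = np.linalg.norm(J, "fro") ** 2
    return np.sqrt(max(fro2 - np.sum(np.abs(lam) ** 2), 0.0)) / np.sqrt(fro2)
```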
Residual operator $R_{\ell}=J_{\ell}-I$: the same alignment and spectral analyses applied after stripping the skip connection. In a residual architecture, $J_{\ell}=I+R_{\ell}$, so the identity contributes a baseline of unit singular values and real eigenvalues at one. Analyzing $R_{\ell}$ isolates what the attention and MLP sublayers actually compute—the perturbation to the residual stream—from the pass-through signal. This decomposition is necessary because the skip connection can mask the spectral structure of the learned transformation: a layer whose $J_{\ell}$ appears well-conditioned may have an $R_{\ell}$ that is highly anisotropic.
Cumulative Jacobian $P_{\ell}=J_{31}\cdots J_{\ell}$: computed via iterative backward SVD composition (truncated at $k=512$). Effective rank $=\exp(-\sum p_{i}\log p_{i})$ where $p_{i}=\sigma_{i}^{2}/\sum\sigma_{j}^{2}$. The cumulative Jacobian captures the end-to-end linear sensitivity of the output to perturbations at layer $\ell$, integrating the effects of all downstream layers. Its effective rank measures the dimensionality of the subspace through which information from layer $\ell$ can influence the final representation. A sharp drop in effective rank with depth indicates a computational bottleneck where the network progressively discards degrees of freedom.
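A sketch of the backward composition under a rank-$k$ truncation, assuming the per-layer Jacobians are a list of NumPy arrays in layer order; this illustrates the procedure described above rather than reproducing the exact implementation:

```python
import numpy as np

def erank(s):
    """Entropy-based effective rank of a singular value spectrum."""
    p = s**2 / np.sum(s**2)
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

def cumulative_eranks(jacobians, k=512):
    """erank(P_l) for P_l = J_{L-1} ... J_l, composed backward from the last layer.

    The running product is kept as a rank-k factorization U * diag(s) * Vt, so each
    step only requires an SVD of a k x d core matrix.
    """
    eranks = {}
    U, s, Vt = np.linalg.svd(jacobians[-1])
    U, s, Vt = U[:, :k], s[:k], Vt[:k]
    eranks[len(jacobians) - 1] = erank(s)
    for l in range(len(jacobians) - 2, -1, -1):
        B = (s[:, None] * Vt) @ jacobians[l]       # k x d core of P_l = P_{l+1} J_l
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        U = U @ Ub                                  # carry the orthonormal left factor
        eranks[l] = erank(s)
    return eranks
```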
### A.4 Activation-correlation graphs and community detection
For each of the 64 sub-layer snapshots, we construct a sparse signed correlation graph:
1. Z-score each unit across the 1,000 samples.
2. Compute the full $4{,}096\times 4{,}096$ Pearson correlation matrix.
3. Retain the top-$k=20$ edges per node (by $|\text{corr}|$), symmetrize by keeping the larger absolute weight, exclude self-loops.
Edge signs (positive and negative correlations) are retained. We detect communities using signed Leiden CPM [Traag et al., [2019](https://arxiv.org/html/2605.14258#bib.bib4)] with resolution $\gamma=0.001$ on the positive subgraph and $\gamma_{\text{neg}}=0$ on the negative subgraph, optimized jointly with layer weights $[1,-1]$. This yields 66–87 non-degenerate communities per layer across all configurations.
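A rough sketch of this pipeline under the stated settings, assuming python-igraph and leidenalg are available; the two-layer (positive/negative) optimization follows leidenalg's standard treatment of negative links, and all names here are illustrative rather than the exact implementation:

```python
import numpy as np
import igraph as ig
import leidenalg as la

def signed_leiden_communities(X, k=20, gamma_pos=0.001, gamma_neg=0.0):
    """X: (n_samples, d) last-token activations at one sub-layer snapshot."""
    Z = (X - X.mean(0)) / (X.std(0) + 1e-12)             # z-score each unit
    C = np.corrcoef(Z.T)                                 # d x d Pearson correlations
    np.fill_diagonal(C, 0.0)

    d = C.shape[0]
    W = np.zeros_like(C)
    topk = np.argsort(-np.abs(C), axis=1)[:, :k]         # top-k edges per node by |corr|
    rows = np.repeat(np.arange(d), k)
    W[rows, topk.ravel()] = C[rows, topk.ravel()]
    W = np.where(np.abs(W) >= np.abs(W.T), W, W.T)       # symmetrize: keep larger |weight|

    src, dst = np.nonzero(np.triu(W))
    w = W[src, dst]
    g_pos = ig.Graph(n=d, edges=list(zip(src[w > 0], dst[w > 0])))
    g_pos.es["weight"] = w[w > 0]
    g_neg = ig.Graph(n=d, edges=list(zip(src[w < 0], dst[w < 0])))
    g_neg.es["weight"] = -w[w < 0]

    part_pos = la.CPMVertexPartition(g_pos, weights="weight", resolution_parameter=gamma_pos)
    part_neg = la.CPMVertexPartition(g_neg, weights="weight", resolution_parameter=gamma_neg)
    la.Optimiser().optimise_partition_multiplex([part_pos, part_neg], layer_weights=[1, -1])
    return np.array(part_pos.membership)
```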
Table 1: Comparison of community-detection methods on sparse signed activation-correlation graphs (top $k=20$ edges per node by $|\mathrm{corr}|$), aggregated over 64 sub-layer snapshots per model. *Frac. FDR sig. (+)* is the fraction of layers with a false-discovery-rate significant positive bridge effect; Cohen's $d$ is summarized by the median and mean across layers. The participation coefficient of unit $i$ is $p_{i}=1-\sum_{c}(k_{ic}/k_{i})^{2}$, where $k_{ic}$ is the sum of absolute edge weights from $i$ to community $c$ and $k_{i}=\sum_{c}k_{ic}$.
### A.5 Statistical tests
#### Test 1 (rate-level coupling).
Spearman correlation between topology disruption ($1-\text{NMI}$ of adjacent-layer communities) and $\Delta$ dynamics ($\Delta\sigma_{\max}(K)$ or $|\Delta\text{variance captured}|$); $n=31$ layer pairs. Significance via a 10,000-permutation two-sided test.
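A sketch of this permutation test, with assumed input names (the per-layer-pair disruption and dynamics vectors):

```python
import numpy as np
from scipy.stats import spearmanr

def permutation_spearman(x: np.ndarray, y: np.ndarray, n_perm: int = 10_000, seed: int = 0):
    """Observed Spearman rho and a two-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    rho_obs, _ = spearmanr(x, y)
    perm = np.array([spearmanr(rng.permutation(x), y)[0] for _ in range(n_perm)])
    p = (np.sum(np.abs(perm) >= abs(rho_obs)) + 1) / (n_perm + 1)
    return rho_obs, p
```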
#### Test 2 (mesoscale variance captured).
The mesoscale operator $K=C_{\text{out}}^{\top}JC_{\text{in}}$ projects the Jacobian onto community bases ($C$: columns are $1/\sqrt{n_{c}}$ for community members, zero elsewhere). Variance captured $=\|C_{\text{out}}KC_{\text{in}}^{\top}\|_{F}^{2}/\|J\|_{F}^{2}$, compared against 100 random same-size partitions (one-sided $z$-test).
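A sketch of the community projection and the variance-captured statistic, with assumed names for the community label vectors on the input and output sides of the layer:

```python
import numpy as np

def community_basis(labels: np.ndarray) -> np.ndarray:
    """Columns are 1/sqrt(n_c) indicator vectors, one per community."""
    comms = np.unique(labels)
    C = np.zeros((labels.size, comms.size))
    for j, c in enumerate(comms):
        members = labels == c
        C[members, j] = 1.0 / np.sqrt(members.sum())
    return C

def variance_captured(J: np.ndarray, labels_in: np.ndarray, labels_out: np.ndarray) -> float:
    """||C_out K C_in^T||_F^2 / ||J||_F^2 with K = C_out^T J C_in."""
    C_in, C_out = community_basis(labels_in), community_basis(labels_out)
    K = C_out.T @ J @ C_in        # mesoscale (community-projected) operator
    recon = C_out @ K @ C_in.T    # back-projection into unit space
    return float(np.linalg.norm(recon, "fro") ** 2 / np.linalg.norm(J, "fro") ** 2)

# The observed statistic is compared to 100 random same-size partitions via a one-sided z-test.
```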
#### Test 3 (boundary-node amplification).
Per layer: Cohen's $d$ between Jacobian column norms $\|J_{:,i}\|_{2}$ for units in the top-10% vs. bottom-10% participation tails (denominator: $\text{std}(\text{all column norms},\ \text{ddof}=1)$). Significance via a 5,000-permutation two-sided test; FDR correction (Benjamini–Hochberg) across 32 layers within each model at $\alpha=0.05$.
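A sketch of the per-layer effect size, using the participation coefficient defined in the Table 1 caption; the signed adjacency `W`, the label vector, and the helper names are our own assumptions:

```python
import numpy as np

def participation_coefficient(W: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """p_i = 1 - sum_c (k_ic / k_i)^2 on absolute edge weights."""
    A = np.abs(W)
    k_i = A.sum(axis=1)
    p = np.ones(A.shape[0])
    for c in np.unique(labels):
        k_ic = A[:, labels == c].sum(axis=1)
        p -= np.where(k_i > 0, (k_ic / np.maximum(k_i, 1e-12)) ** 2, 0.0)
    return p

def boundary_cohens_d(J: np.ndarray, p: np.ndarray, tail: float = 0.10) -> float:
    """Cohen's d between column norms of top- vs. bottom-participation units."""
    col_norms = np.linalg.norm(J, axis=0)     # per-unit column norm ||J_{:,i}||_2
    lo, hi = np.quantile(p, [tail, 1.0 - tail])
    top, bottom = col_norms[p >= hi], col_norms[p <= lo]
    denom = np.std(col_norms, ddof=1)         # denominator as specified in the test
    return float((top.mean() - bottom.mean()) / denom)
```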
### A.6 Software
All computations use Python 3.12, PyTorch ≥ 2.5 (CUDA 12.4), HuggingFace Transformers ≥ 5.2, nnsight ≥ 0.4 for activation extraction, and leidenalg for community detection. Code is available at [redacted for review].
Figure S1: Eigenvalue distribution heatmaps for all three trained models, organized as a 3×3 grid. Rows are models ((a–c) Llama 3.1 8B; (d–f) OLMo 3 7B step 1.41M; (g–i) Gemma 4 E4B); columns are Re($\lambda$), Im($\lambda$), and $|\lambda|$. Color encodes $P(\text{value}\mid\text{layer})$ (log scale). Dashed lines mark Re $=1$, Im $=0$, and $|\lambda|=1$. All three models show the same three-regime structure: broad early distributions, contracted mid-layers, and re-expanding late layers.

Figure S2: Per-layer spectral profiles for all five configurations (companion to Figure [2](https://arxiv.org/html/2605.14258#S2.F2)). Each row is one model/checkpoint (top to bottom: OLMo step 0, OLMo step 471k, OLMo step 1.41M, Llama 3.1 8B, Gemma 4 E4B); columns show, from left to right: condition number $\kappa=\sigma_{\max}/\sigma_{\min}$, the leading and trailing singular values $\sigma_{\max}$ and $\sigma_{\min}$, the Frobenius norm $\|J_{\ell}\|_{F}$, the fraction of expanding eigenvalue modes ($|\lambda|>1$), and the mean eigenvalue magnitude. $\sigma_{\max}$ is comparatively stable across mid/late layers in every trained model, while $\sigma_{\min}$ collapses at both extremes; this is the singular-value mechanism behind the U-shaped condition number profile.

Figure S3: Per-layer non-normality metrics for all five configurations (companion to Figure [3](https://arxiv.org/html/2605.14258#S2.F3)). Each row is one model/checkpoint. Left: self-alignment of $J$ (orange) and $R$ (blue) with the $k/d$ baseline. Center: residual norm ratio. Right: Henrici departure. The non-normality gradient (low early, high late) is present in all trained models; Gemma spans the widest range (0.0003–0.92). OLMo step 0 saturates to near-symmetric by layer 5.

Figure S4: Architecture vs. training in eigenvalue geometry (OLMo 3 7B). Each row shows eigenvalues at layers 0, 7, 15, 23, 31 (dashed circle = unit circle). (a) Step 0: eigenvalues collapse to a tight cluster near $\mathrm{Re}=1$ at all layers beyond layer 0. (b) Step 471k: structured elliptical clouds emerge, with expanding mid-layer modes. (c) Step 1.41M: rich spectral diversity at every layer, with persistent complex-plane geometry matching Llama's trained structure. Full eigenvalue distribution heatmaps for all models are in Figure [S1](https://arxiv.org/html/2605.14258#A1.F1).

Figure S5: Eigenvalue distribution heatmaps across OLMo training (companion to Figure [S4](https://arxiv.org/html/2605.14258#A1.F4)). (a–c) Step 0: eigenvalues are tightly concentrated around Re $\approx 1$, Im $\approx 0$, $|\lambda|\approx 1$, with only layer 0 showing spread. (d–f) Step 471k: structured distributions emerge. (g–i) Step 1.41M: full three-regime structure matching Llama. Training progressively installs spectral diversity.

Figure S6: Adjacent-layer forward alignment $\|U_{\ell}^{\top}V_{\ell+1}\|_{F}^{2}/k$ for $J_{\ell}$ (orange) and $R_{\ell}=J_{\ell}-I$ (blue), with random baseline $k/d\approx 0.016$ (dotted). Panels show one model/checkpoint each: (a) OLMo step 0, (b) OLMo step 471k, (c) OLMo step 1.41M, (d) Llama 3.1 8B, (e) Gemma 4 E4B. At step 0, $J$ and $R$ are exactly at baseline: no inter-layer coordination at initialization. By step 471k, $R$ begins to show elevated mid-layer values; at step 1.41M, $R$ shows a pronounced mid-layer peak ($3.8\times$ baseline, layers 6–14), indicating training installs block-level inter-layer coordination in the depth range where the operator-type transition is steepest. Llama shows no such peak; Gemma shows a modest late-layer elevation.

Figure S7: Cumulative Jacobian analysis for all five configurations (companion to Figure [4](https://arxiv.org/html/2605.14258#S2.F4)). Each row is one model/checkpoint (top to bottom: OLMo step 0, OLMo step 471k, OLMo step 1.41M, Llama 3.1 8B, Gemma 4 E4B). Left: effective rank of $P_{\ell}$ (blue, left axis) and fraction of expanding eigenvalues (red dashed, right axis) vs. injection layer. Center: effective rank vs. spectral radius $\rho(J_{\ell})$, colored by layer; Spearman $\rho_{s}$ and $p$ annotated. At step 0, spectral radii cluster near 1 and effective rank is uniformly high; training progressively installs the negative correlation. Right: singular value spectra of $P_{0}$ (full-depth product), $P_{\mathrm{mid}}$, and $P_{L-1}$ (single layer). The trained models show orders-of-magnitude dynamic range concentrated in the leading modes; the untrained model has a nearly flat spectrum. Spectra are shown for the singular values computed in the saved decomposition: Gemma spectra and the single-layer OLMo $P_{31}$ spectra use full SVD, whereas the 4096-dimensional OLMo cumulative products and Llama spectra use the iterative truncated-SVD computation with $k=512$ for tractability; zero padding of truncated spectra is omitted from the plot.

Figure S8: Schur surgery: random controls and mode-$R$ confirmation. Cumulative effective rank of $P_{0:L-1}$ vs. dose $c$ in mode $J$ for (a) Llama 3.1 8B and (b) OLMo 3 7B (step 1.41M): trained dose (red, solid) vs. Control A (Frobenius-matched random replacement of $N_{\ell}$, blue dashed) and Control B (Haar-random Schur basis, green dashed); shaded bands show $\pm$ std across 4 independent random draws per layer. Control A reverses the dose response in both networks; Control B partly recovers the funnel in Llama but not in OLMo. (c) Trained-dose curve for Gemma 4 E4B (mode $J$). (d) Mode-$J$ (solid) vs. mode-$R$ (dashed) dose curves overlaid for all five configurations: the two modes are nearly indistinguishable, ruling out the $+I$ skip as a source of the funnel.

Figure S9: Mesoscale bridge diagnostic across all models and training checkpoints (signed Leiden CPM, $\gamma=0.001$). Columns: (a) OLMo step 0, (b) OLMo step 471k, (c) OLMo step 1.41M, (d) Llama 3.1 8B, (e) Gemma 4 E4B. Top row: variance captured by the community-projected operator $K$ (red) vs. null distribution from 100 random same-size partitions (blue band = 5–95%); annotation shows the count of layers with $z>1.96$. Middle row: rate-level coupling, i.e. topology disruption ($1-\text{NMI}$) vs. *signed* $\Delta\sigma_{\max}(K)$; dot color encodes layer depth. The figure annotations show Spearman correlations on the signed measure (e.g., Gemma $\rho=+0.18$, $p=0.255$); the absolute-value version $|\Delta\sigma_{\max}(K)|$ used in the main text is reported in Table [4](https://arxiv.org/html/2605.14258#A4.T4) (Gemma: $\rho=+0.39$, $p=0.012$). Bottom row: topology disruption vs. $|\Delta\text{variance captured}|$; Spearman $\rho$ and $p$-values annotated per panel.

Table 2: Mean $J_{\ell}$ self-alignment by regime (early/mid/late split at layers 5, 20 for 32-layer models; 8, 28 for Gemma's 42 layers). Training broadens the rotator regime in early and mid layers while leaving the late-layer near-symmetry largely intact.
## Appendix B Adjacent-layer forward alignment
The forward alignment $\|U_{\ell}^{\top}V_{\ell+1}\|_{F}^{2}/k$ measures whether the leading output subspace of layer $\ell$ overlaps with the leading input subspace of layer $\ell+1$, i.e. whether adjacent blocks compose through their dominant singular directions ($k=64$). A value near the random baseline $k/d\approx 0.016$ means the two subspaces are no more aligned than chance.
For the full Jacobian $J$, forward alignment sits near baseline across all models and training stages (Figure [S6](https://arxiv.org/html/2605.14258#A1.F6)a–c). Llama 3.1 8B has mean 0.016 ($1.05\times$ baseline) with a single spike at the final layer (0.068) reflecting the readout boundary; OLMo step 0 sits at 0.016 ($1.00\times$ baseline), exactly random at initialization, and OLMo step 1.41M reaches 0.032 ($2.0\times$), with a final-layer spike (0.106). Gemma's mean is 0.016 ($0.64\times$ its $k/d=0.025$ baseline), again with a final-layer spike (0.050, $2.0\times$). Adjacent-layer subspace coupling at the level of the full Jacobian is therefore at or below random across the board.
Stripping the identity via the residual operator $R=J-I$ reveals a more subtle picture. Llama's $R$ forward alignment is 0.019 ($1.2\times$ baseline), nearly unchanged from $J$ and consistent with the block computations being uncoupled across adjacent layers. OLMo step 0 $R$ sits exactly at baseline ($1.0\times$). But OLMo step 1.41M $R$ rises to 0.060 ($3.8\times$ baseline), with a pronounced mid-layer peak (layers 6–14 reach 0.09–0.15; Figure [S6](https://arxiv.org/html/2605.14258#A1.F6)c). Training installs inter-layer coordination in the block computations concentrated at mid-depth, the same regime where the non-normality gradient is steepest. Even so, the absolute values are small (max 0.15 on a $[0,1]$ scale): the residual stream still functions primarily as the shared workspace of Elhage et al. [[2021](https://arxiv.org/html/2605.14258#bib.bib24)], not a sequential pipeline. Whether this mid-layer $R$ peak is a general feature of trained transformers or specific to OLMo's architecture and training recipe remains open; Llama's $R$ forward alignment shows no such peak.
## Appendix C Schur surgery: setup, random controls, and mode-$R$ confirmation
This appendix expands on the dose-response intervention reported in §[2.3](https://arxiv.org/html/2605.14258#S2.SS3): the formal setup, two random-baseline controls that separately ablate the structural and basis components of the trained $N_{\ell}$, and a mode-$R$ variant that targets the residual operator $R_{\ell}=J_{\ell}-I$.
### C.1 Schur form, dose response, and random controls
For each per-layer mean Jacobian we compute the complex Schur factorization

$$J_{\ell} = Q_{\ell}\,(\Lambda_{\ell}+N_{\ell})\,Q_{\ell}^{*},$$

with $Q_{\ell}$ unitary, $\Lambda_{\ell}$ diagonal (the eigenvalues), and $N_{\ell}$ strictly upper-triangular. Because $Q_{\ell}$ is unitary,

$$\|N_{\ell}\|_{F}^{2} = \|J_{\ell}\|_{F}^{2}-\sum_{i}|\lambda_{i}|^{2},$$

so $\|N_{\ell}\|_{F}$ is identical to the Henrici-departure quantity used in §[2.1](https://arxiv.org/html/2605.14258#S2.SS1) (it is the same non-normality, written in a form we can manipulate). We sweep a dose grid $c\in\{0,0.25,0.5,0.75,1,1.5,2\}$ over three constructions, each replacing $J_{\ell}$ at every layer of the linearized stack:
- Dose (the main intervention): $J_{\ell}(c)=Q_{\ell}(\Lambda_{\ell}+c\,N_{\ell})Q_{\ell}^{*}$. Spectrum and basis preserved; non-normal feedforward scaled. $c=0$ is fully normal at the trained spectrum, $c=1$ is the trained model.
- Control A (Frobenius-matched random replacement): $J_{\ell}^{A}(c)=Q_{\ell}\Lambda_{\ell}Q_{\ell}^{*}+M_{\ell}^{\mathrm{rand}}$, with $M_{\ell}^{\mathrm{rand}}$ drawn i.i.d. Gaussian and rescaled so $\|M_{\ell}^{\mathrm{rand}}\|_{F}=c\,\|N_{\ell}\|_{F}$. Preserves the spectrum, the trained basis, and the Frobenius mass of the perturbation; destroys the upper-triangular structure of $N_{\ell}$.
- Control B (Haar-random Schur basis): $J_{\ell}^{B}(c)=Q^{\mathrm{rand}}(\Lambda_{\ell}+c\,N_{\ell})(Q^{\mathrm{rand}})^{*}$, with $Q^{\mathrm{rand}}$ a Haar-random unitary. Preserves the spectrum and the upper-triangular shape of $N_{\ell}$; replaces the trained Schur basis $Q_{\ell}$.
Both controls draw four independent random matrices per layer ($n=4$). The cumulative product $P_{0:\ell}(c)=J_{\ell}(c)\cdots J_{0}(c)$ is computed by iterative truncated SVD at rank $K_{\mathrm{cum}}=512$ with per-layer max-normalization, with a $\log_{10}$ Frobenius scale accumulated separately for the transient-amplification panel of Figure [5](https://arxiv.org/html/2605.14258#S2.F5).
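A minimal sketch of the dose construction and Control A at a single layer, under the Schur notation above; the function names and the use of SciPy's complex Schur factorization are our own assumptions, not the authors' released implementation:

```python
import numpy as np
from scipy.linalg import schur

def schur_dose(J: np.ndarray, c: float) -> np.ndarray:
    """Dose: J(c) = Q (Lambda + c*N) Q*, scaling only the strictly upper-triangular part."""
    T, Q = schur(J, output="complex")   # J = Q T Q*, with T upper-triangular
    Lam = np.diag(np.diag(T))           # eigenvalues on the diagonal
    N = T - Lam                         # strictly upper-triangular (non-normal) part
    return Q @ (Lam + c * N) @ Q.conj().T

def control_A(J: np.ndarray, c: float, rng: np.random.Generator) -> np.ndarray:
    """Control A: spectrum and basis preserved, N replaced by Frobenius-matched Gaussian noise."""
    T, Q = schur(J, output="complex")
    Lam = np.diag(np.diag(T))
    N = T - Lam
    M = rng.standard_normal(J.shape)
    M *= c * np.linalg.norm(N, "fro") / np.linalg.norm(M, "fro")
    return Q @ Lam @ Q.conj().T + M
```

Control B would replace $Q_{\ell}$ with a Haar-random unitary (for example via `scipy.stats.unitary_group.rvs`) while keeping $\Lambda_{\ell}+c\,N_{\ell}$ fixed.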
#### Mode $R$.
To confirm the result is not an artifact of the residual identity, we repeat all three constructions on $R_{\ell}=J_{\ell}-I$, write $R_{\ell}=Q_{\ell}(\Lambda_{\ell}^{R}+N_{\ell}^{R})Q_{\ell}^{*}$ in Schur form, scale only $N_{\ell}^{R}$ by $c$, and substitute $J_{\ell}(c)=I+R_{\ell}(c)$ back into the stack.
### C.2 Per-model results
Table 3: Schur dose response, cumulative effective rank of $P_{0:L-1}$. Mode-$J$ $c{=}0/c{=}1$ ratios are 6.4, 5.7, and 6.6 across the three trained models; mode-$R$ ratios on the residual operator are nearly identical, ruling out the $+I$ skip as a source of the funnel.

#### Random controls at $c=1$ (mode $J$).
Cumulative effective rank of $P_{0:31}$:
- Llama 3.1 8B: dose 7.12; Control A $215.9\pm 1.2$ ($n=4$); Control B $16.5\pm 1.6$ ($n=4$).
- OLMo 3 7B (step 1.41M): dose 55.5; Control A $349.4\pm 0.3$ ($n=4$); Control B $305.3\pm 0.8$ ($n=4$).
Control A reverses the dose response in both networks: random feedforward of identical Frobenius mass to the trained $N_{\ell}$ *raises* cumulative effective rank rather than lowering it (Figure [S8](https://arxiv.org/html/2605.14258#A1.F8)a). Control B partly recovers the funnel in Llama (16.5, close to the dose value of 7.1) but not in OLMo (305, close to Control A): the trained Schur basis $Q_{\ell}$ is therefore necessary to drive the funnel in OLMo, while the upper-triangular structure of $N_{\ell}$ alone is largely sufficient in Llama. The bottleneck thus depends on the trained non-normal structure (the upper-triangular shape of $N_{\ell}$ and, in some architectures, also the trained basis $Q_{\ell}$ in which it is realized), not on either the spectrum alone or the Frobenius mass of the off-diagonal perturbation.
#### Mode $R$ confirmation.
Repeating the surgery on $R_{\ell}=J_{\ell}-I$ gives dose curves nearly indistinguishable from mode $J$ at every $c$ across all four configurations (Figure [S8](https://arxiv.org/html/2605.14258#A1.F8)b; Table [3](https://arxiv.org/html/2605.14258#A3.T3), last column). The result is therefore a property of the learned block computation, not of the residual identity.
## Appendix D Rate-level topology–dynamics coupling
If topology and dynamics are coupled at the level of layer-to-layer change, then layers where the community structure reorganizes most between consecutive depths should also show the largest swings in dynamical observables. We tested this via Spearman correlation between adjacent-layer topology disruption ($1-\text{NMI}$ of signed Leiden CPM partitions) and two dynamical measures derived from the community-projected operator $K=C^{\top}JC$: the change in its dominant singular value, $|\Delta\sigma_{\max}(K)|$, and the change in the variance it captures relative to $J$, $|\Delta(\|K\|_{F}^{2}/\|J\|_{F}^{2})|$ ($n=31$ layer pairs for the 32-layer models, $n=41$ for Gemma). Table [4](https://arxiv.org/html/2605.14258#A4.T4) gives the full per-model statistics; per-model coupling scatters with point colour by layer depth are in the middle and bottom rows of Figure [S9](https://arxiv.org/html/2605.14258#A1.F9).
Llama's $|\Delta\text{variance captured}|$ correlation is significantly negative ($\rho=-0.36$, $p=0.046$): more topology disruption between adjacent layers predicts *smaller* swings in variance captured, the opposite direction to a coupling hypothesis. Gemma's $|\Delta\sigma_{\max}(K)|$ is the only positive coupling, modest in magnitude ($\rho=+0.39$, $p=0.012$), suggesting that in this deeper network layers where the community partition reshuffles also show larger shifts in mesoscale operator gain. The complementary variance-captured measure is null in Gemma, however, and both measures are null at OLMo step 0 and across the OLMo training trajectory. The aggregate verdict is that rate-level topology–dynamics coupling is not a robust cross-architecture phenomenon, and we do not pursue it further in the main text.
Table 4: Rate-level coupling tests (signed Leiden CPM, $\gamma=0.001$). $|\Delta\sigma_{\max}(K)|$ and $|\Delta\text{variance captured}|$ between adjacent layers correlated with topology disruption ($1-\text{NMI}$).
## Appendix E Related Work
### E.1 Spectral and Jacobian analysis of transformers
Fu et al. [[2025](https://arxiv.org/html/2605.14258#bib.bib9)] estimate a population-level linear map $T_{i}=\tilde{H}_{i}^{\dagger}\tilde{H}_{i+1}$ between consecutive layers via least-squares regression, then extract effective rank and spectral decay via SVD. Their compression–expansion cycle in GPT-2 and Llama 3.2 1B resonates with our U-shaped condition number profile, but their $T_{i}$ is a global regression fit, not a local linearization, and explains only $\sim$62% of deeper-layer outputs, precisely the regime where the nonlinear dynamics we characterize via Jacobian eigendecomposition are most pronounced.
Golden [[2025](https://arxiv.org/html/2605.14258#bib.bib10)] constructs "detached Jacobians" by freezing each nonlinearity at its inference-time value, yielding an exactly linear factorization of the model. SVD of these objects reveals cumulative stable rank collapsing to $\sim$1–3 in Llama 3.2 3B, parallel to our cumulative participation ratio collapse to $\sim$7 in Llama 3.1 8B. Convergence across true and detached Jacobians is strong evidence that progressive dimensional funneling is a genuine structural property of transformer computation. Detached Jacobians cannot, however, characterize perturbation sensitivity or stability, and without eigendecomposition the rotational dynamics encoded in complex eigenvalues remain invisible.
The most closely related empirical work is Aubry et al. [[2025](https://arxiv.org/html/2605.14258#bib.bib11)], who compute *residual* Jacobians $\partial f^{l}/\partial x^{l-1}$ (excluding the skip connection) across 30+ models and document cross-layer singular-vector alignment that correlates with benchmark performance ($R^{2}=0.80$). Our work differs in three respects: (i) we compute the *full* Jacobian $J_{l}=I+\partial f_{l}/\partial h_{l}$, where the identity term produces the near-unity spectral radius, expanding/contracting mode counts, and non-normality gradient that structure our analysis; (ii) we perform eigendecomposition, revealing that $\sim$98% of eigenvalues are complex conjugate pairs encoding rotational dynamics invisible to SVD; (iii) we separate architecture from training at per-unit granularity via OLMo's step-0 checkpoint, going beyond their existence proof of a training effect to identify *which specific properties* are architectural versus training-induced.
Saratchandran and Lucey [[2026](https://arxiv.org/html/2605.14258#bib.bib21)] analyze the *parameter* Jacobian $\partial\mathbf{A}/\partial W$ relevant to optimization, reporting condition numbers of $10^{9}$–$10^{11}$, orders of magnitude larger than our input-Jacobian condition numbers ($\kappa\sim 84$–$10^{5}$), reflecting the dramatic conditioning improvement from the residual connection.
### E.2 Depth as a dynamical system
The interpretation of network depth as time in a dynamical system originates with Ruthotto and Haber [[2020](https://arxiv.org/html/2605.14258#bib.bib14)], who established residual networks as forward Euler discretizations of ODEs and showed that constraining $J_{Y}F$ to be negative semi-definite or purely imaginary yields provably stable architectures. Storm et al. [[2024](https://arxiv.org/html/2605.14258#bib.bib17)] extracted finite-time Lyapunov exponents from MLP Jacobian products, showing that ridges of large $\lambda_{1}$ align with decision boundaries and that training self-organizes the FTLE distribution toward zero. Engelken [[2023](https://arxiv.org/html/2605.14258#bib.bib18)] showed that even small Lyapunov spectrum spreads cause exponential ill-conditioning of long-term Jacobians, motivating gradient flossing to compress the spread. These works establish Jacobian products as fundamental diagnostics but are restricted to MLPs or RNNs and operate at initialization or on small-scale networks ($d\leq 256$, $L\leq 16$).
The work most closely related to ours is Jacobs et al. [[2026](https://arxiv.org/html/2605.14258#bib.bib12)], who coined "Dynamical Interpretability" and applied it to DINOv2-Giant ViTs. They demonstrated directional convergence to angular attractors, self-correcting perturbation dynamics, and collapsing stable rank in late depth, phenomenology consistent with what we observe in LLMs. They linearized the depth flow via Dynamic Mode Decomposition on group-averaged states, obtaining eigenvalues near the positive real axis inside the unit circle. We differ in three respects: (i) we analyze decoder-only LLMs rather than a ViT, establishing cross-modality generality; (ii) we compute full per-sample, per-layer Jacobians rather than fitting a single global linear operator to averaged states, revealing the $\sim$98% complex eigenvalue structure invisible to DMD on averages; (iii) we provide quantitative per-layer diagnostics (U-shaped condition numbers, participation ratios 9–1860, expanding fraction 33%–61%) and a topology$\leftrightarrow$dynamics bridge that separates architectural priors from training-induced properties at per-unit granularity.
### E.3 Residual stream as shared workspace
Elhage et al. [[2021](https://arxiv.org/html/2605.14258#bib.bib24)] introduced the framework of transformer components reading from and writing to a shared residual stream, where each attention head and MLP block projects down, computes, and adds its output back. This additive picture predicts that individual blocks operate as approximately independent perturbations to a shared state. Our Jacobian decomposition $J_{\ell}=I+R_{\ell}$ tests this directly: the residual norm ratio $\|R_{\ell}\|_{F}/\|J_{\ell}\|_{F}$ quantifies how much each block perturbs the residual stream beyond the identity pass-through, and the near-random forward alignment (§[B](https://arxiv.org/html/2605.14258#A2)) confirms that adjacent blocks' leading singular subspaces are largely decoupled, consistent with the shared-workspace model. The one departure is the mid-layer $R$ forward alignment peak in trained OLMo ($3.8\times$ baseline), suggesting that training can install partial sequential coupling between blocks in the depth range where the operator-type transition is steepest.
In the architecture-design direction, Prairie et al. [[2026](https://arxiv.org/html/2605.14258#bib.bib22)] enforce stability in looped transformers by parameterizing the state matrix with guaranteed negative eigenvalues, and Godin [[2026](https://arxiv.org/html/2605.14258#bib.bib23)] controls contraction through explicit ODE step sizes. Koubbi et al. [[2026](https://arxiv.org/html/2605.14258#bib.bib25)] prove that transformers with independent weights converge to an Itô SDE on the sphere, predicting representation collapse via logistic dynamics. These theoretical baselines motivate, but do not replace, the direct empirical measurement of trained Jacobian spectra that we provide.