Task-Restricted Symmetries in Recurrent Weight Space

arXiv cs.LG Papers

Summary

This paper studies functional redundancy in recurrent neural networks by using ordered real Schur coordinates to identify structured ablations that preserve task performance, finding that task-restricted symmetries vary across tasks and trained solutions.

arXiv:2606.18457v1 Announce Type: new

Cached at: 06/18/26, 05:43 AM

# Task-Restricted Symmetries in Recurrent Weight Space
Source: [https://arxiv.org/html/2606.18457](https://arxiv.org/html/2606.18457)
###### Abstract

Recurrent networks can contain substantial functional redundancy in weight space: changing a recurrent matrix may leave the input\-output rollout nearly unchanged on a task distribution, while similar\-scale changes can destroy the same behavior\. We study this redundancy in one\-layer tanh RNNs using ordered real Schur coordinates\. The Schur form separates spectral blocks from directed nonnormal couplings, giving a diagnostic basis for structured ablations that keep the input and readout maps fixed\. In a fixed\-length copy task, selected nonnormal Schur couplings can be removed with little loss in some trained solutions, whereas other couplings are necessary for accurate autonomous replay\. Across flip\-flop, sine generation, and context\-dependent integration, the loss\-preserving ablation profile varies across tasks and trained solutions\. These results identify candidate approximate functional invariances, not universal symmetries of recurrent weight space\. Schur\-coordinate ablations provide a practical diagnostic for which structured perturbations preserve a trained recurrent solution and which ones disrupt its computation\.

Keywords: weight-space symmetries, recurrent neural networks, Schur decomposition, nonnormality

## 1 Introduction

Exact weight-space symmetries have become a practical tool for comparing neural networks and for learning directly in parameter space (Entezari et al., [2022](https://arxiv.org/html/2606.18457#bib.bib3); Ainsworth et al., [2023](https://arxiv.org/html/2606.18457#bib.bib1); Navon et al., [2023](https://arxiv.org/html/2606.18457#bib.bib2), [2024](https://arxiv.org/html/2606.18457#bib.bib4)). Those symmetries identify transformations that preserve the realized function exactly, and recent work builds such structure directly into models that operate on trained networks as inputs (Zhou et al., [2023](https://arxiv.org/html/2606.18457#bib.bib15); Kofinas et al., [2024](https://arxiv.org/html/2606.18457#bib.bib16)). Recurrent networks can also admit large structured changes to the recurrent matrix that preserve task behavior only approximately and only on the task distribution. These directions fall outside exact group-theoretic symmetries, while still shaping the functional geometry of weight space.

Ordered Schur coordinates reveal candidate approximate functional invariances under structured perturbation\. Because the resulting ablation profiles vary by task and by trained solution, they should not be read as evidence that nonnormal components can usually be ignored\. They identify which Schur\-coordinate couplings a particular recurrent solution can lose while preserving its original input\-output rollout, and which couplings carry task\-specific function\.

Because tanh RNNs do not admit arbitrary orthogonal changes of basis as exact symmetries, raw recurrent coordinates make nonnormal structure hard to compare across runs. The real Schur decomposition represents every real recurrent matrix by an orthogonal basis, diagonal or quasi-diagonal spectral blocks, and strictly upper-triangular nonnormal interactions. Such interactions are known to shape transient recurrent computations (Murphy and Miller, [2009](https://arxiv.org/html/2606.18457#bib.bib10); Hennequin et al., [2012](https://arxiv.org/html/2606.18457#bib.bib11); Bondanelli and Ostojic, [2020](https://arxiv.org/html/2606.18457#bib.bib12); Pattadkal et al., [2024](https://arxiv.org/html/2606.18457#bib.bib14)), and ordered Schur coordinates make them comparable and ablatable.

Schur-coordinate ablations preserve the rollout function for some blocks and not for others. In the copy task, selected ablations produce nearly identical autonomous replay accuracy, while directed cross-sector ablations move the model to lower-accuracy behavior. The neuroscience-style tasks provide a scope test for the same interventions. The copy task supplies an explicit temporal symmetry; the flip-flop, sine-generation, and context-dependent integration tasks ask whether the same diagnostic basis also localizes fragile directions in other recurrent computations (Sussillo and Barak, [2013](https://arxiv.org/html/2606.18457#bib.bib5); Mante et al., [2013](https://arxiv.org/html/2606.18457#bib.bib6); Maheswaranathan et al., [2019](https://arxiv.org/html/2606.18457#bib.bib7); Schuessler et al., [2024](https://arxiv.org/html/2606.18457#bib.bib13)). Task-dependent ablation profiles tie approximate invariance to the rollout distribution rather than to a task-independent property of Schur blocks.

## 2 Ordered Schur Coordinates

A one-layer tanh RNN with input $x_t \in \mathbb{R}^{N_x}$, hidden state $h_t \in \mathbb{R}^{N_h}$, and output $\hat{y}_t \in \mathbb{R}^{N_y}$ is defined by

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1}), \qquad h_0 = 0, \tag{1}$$

$$\hat{y}_t = W_{hy} h_t, \tag{2}$$

with $W_{xh} \in \mathbb{R}^{N_h \times N_x}$, $W_{hh} \in \mathbb{R}^{N_h \times N_h}$, and $W_{hy} \in \mathbb{R}^{N_y \times N_h}$. All reported experiments set the recurrent and readout biases to zero, $b_h = b_y = 0$.
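As a minimal sketch of equations (1) and (2), the rollout can be written in a few lines of NumPy. The function name `rollout` and the small matrix sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rollout(W_xh, W_hh, W_hy, xs):
    """Run h_t = tanh(W_xh x_t + W_hh h_{t-1}), y_t = W_hy h_t with h_0 = 0.

    xs has shape (T, N_x); the returned outputs have shape (T, N_y).
    """
    h = np.zeros(W_hh.shape[0])          # h_0 = 0, biases fixed at zero
    ys = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        ys.append(W_hy @ h)
    return np.stack(ys)

# Illustrative sizes only (not the paper's): N_x = 2, N_h = 4, N_y = 1.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 2))
W_hh = rng.normal(size=(4, 4)) / 2.0
W_hy = rng.normal(size=(1, 4))
xs = rng.normal(size=(5, 2))
ys = rollout(W_xh, W_hh, W_hy, xs)       # shape (5, 1)
```

Because the biases are zero and $h_0 = 0$, an all-zero input sequence produces an all-zero output in this sketch.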

For a trained recurrent matrix, write $W = W_{hh}$. Its real Schur decomposition is

$$W = Q T Q^\top, \tag{3}$$

where $Q$ is orthogonal and $T$ is real quasi-upper-triangular (Trefethen and Embree, [2005](https://arxiv.org/html/2606.18457#bib.bib8)). We decompose

$$T = B + N, \tag{4}$$

where $B$ contains the block-diagonal $1 \times 1$ and $2 \times 2$ real Schur eigenvalue blocks, and $N$ contains the strictly block-upper-triangular nonnormal couplings between those blocks.

The Schur blocks are ordered by nonincreasing eigenvalue modulus. A relative threshold $\alpha$ separates leading spectral blocks from their complement:

$$R = \{i : |\lambda_i| \geq \alpha \rho(W)\}, \qquad C = \{1, \ldots, N_h\} \setminus R.$$

Here $\lambda_i$ is the eigenvalue associated with the $i$th Schur block and $\rho(W) = \max_j |\lambda_j|$ is the spectral radius of $W$. $R$ indexes the leading rotation-like subspace used as the reference sector, while $C$ indexes the remaining Schur blocks whose couplings to $R$ and to each other are tested by ablation. In this ordered partition,

$$B = \begin{pmatrix} B_R & 0 \\ 0 & B_C \end{pmatrix}, \qquad N = \begin{pmatrix} T_{RR} & T_{C \rightarrow R} \\ 0 & T_{CC} \end{pmatrix}. \tag{5}$$

$T_{RR}$, $T_{C \rightarrow R}$, and $T_{CC}$ are blocks of the nonnormal coupling matrix $N$, not separate eigenvalue blocks. The cross block $T_{C \rightarrow R}$ is the upper-right coupling from the complement sector into the leading sector in the ordered Schur coordinates.
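One way to obtain this partition in practice is sketched below with SciPy. It is an illustration under stated assumptions rather than the authors' code: `scipy.linalg.schur` with a `sort` callback moves the selected eigenvalues to the top-left, which yields the $R$/$C$ split but does not fully order blocks by modulus within each sector. The helper name `ordered_schur_partition` and the test matrix are hypothetical.

```python
import numpy as np
from scipy.linalg import schur

def ordered_schur_partition(W, alpha=0.9):
    """Real Schur form of W with the leading sector R moved to the top-left.

    Returns Q, T, B, N, r with W = Q T Q^T, T = B + N, and r = |R|.
    """
    rho = np.max(np.abs(np.linalg.eigvals(W)))     # spectral radius rho(W)

    def in_R(re, im=0.0):
        # Accepts (real, imag) parts or a single complex eigenvalue.
        return bool(abs(complex(re) + 1j * im) >= alpha * rho)

    T, Q, r = schur(W, output="real", sort=in_R)

    # B keeps the 1x1 and 2x2 diagonal blocks; a nonzero subdiagonal entry
    # marks a 2x2 block holding a complex-conjugate eigenvalue pair.
    n = T.shape[0]
    B = np.diag(np.diag(T))
    for i in range(n - 1):
        if T[i + 1, i] != 0.0:
            B[i:i + 2, i:i + 2] = T[i:i + 2, i:i + 2]
    N = T - B
    return Q, T, B, N, r

# Hypothetical test matrix with known eigenvalues 0.2, 1.0, 0.5, 0.95.
rng = np.random.default_rng(0)
T0 = np.triu(rng.normal(size=(4, 4)), k=1) + np.diag([0.2, 1.0, 0.5, 0.95])
Q0, _ = np.linalg.qr(rng.normal(size=(4, 4)))
W = Q0 @ T0 @ Q0.T

Q, T, B, N, r = ordered_schur_partition(W, alpha=0.9)
T_RR, T_CtoR, T_CC = N[:r, :r], N[:r, r:], N[r:, r:]   # blocks of N in (5)
```

With $\alpha = 0.9$ and spectral radius $1.0$, only the eigenvalues $1.0$ and $0.95$ fall in $R$, so `r` is 2 here and `T_CtoR` is the $2 \times 2$ upper-right block.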

For a set $S$ of Schur-coupling blocks, the intervention zeros the corresponding entries of $N$, reconstructs

$$\widetilde{W}_{hh}(S) = Q \widetilde{T}(S) Q^\top, \tag{6}$$

and reevaluates the original network without changing input or readout weights. Let $f_W$ denote the rollout function of the trained network on a task distribution $\mathcal{D}$. This fixed-encoder/fixed-decoder intervention tests whether the original input-output map is preserved in the original readout coordinates. Refitting a linear or ridge decoder after the ablation would answer a different question: whether the perturbed latent dynamics still contain task information up to a new readout.

For a rollout discrepancy $d_{\mathcal{D}}$ and tolerance $\epsilon$, an intervention $S$ is an $\epsilon$-stabilizer on $\mathcal{D}$ when $d_{\mathcal{D}}(f_W, f_{\widetilde{W}_{hh}(S)}) \leq \epsilon$. A Schur-coupling block is a candidate approximate functional invariance when zeroing it gives small discrepancy while removing non-negligible Schur mass. If performance changes sharply, the block lies in a fragile functional direction for that trained solution.
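The intervention and the $\epsilon$-stabilizer test can be sketched as follows. The text does not fix a specific discrepancy, so the mean squared output difference used here is an assumption, as are the function names, the block labels `"RR"`, `"CR"`, `"CC"`, and the toy matrices. The example is built so that the inputs never excite the complement sector, which makes the $T_{CC}$ ablation an exact stabilizer on this input set even though the removed entries are nonzero.

```python
import numpy as np

def ablate(Q, T, r, blocks):
    """Zero the nonnormal couplings of T in the named sectors and rebuild W.

    blocks is a subset of {"RR", "CR", "CC"}; "CR" is the upper-right block.
    """
    n = T.shape[0]
    coupling = np.triu(np.ones((n, n), dtype=bool), k=1)
    for i in range(n - 1):
        if T[i + 1, i] != 0.0:           # entry inside a 2x2 eigenvalue block
            coupling[i, i + 1] = False
    in_R = np.arange(n) < r
    sector = {
        "RR": np.outer(in_R, in_R),
        "CR": np.outer(in_R, ~in_R),
        "CC": np.outer(~in_R, ~in_R),
    }
    T_abl = T.copy()
    for name in blocks:
        T_abl[coupling & sector[name]] = 0.0
    return Q @ T_abl @ Q.T

def rollout(W_xh, W_hh, W_hy, xs):
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        ys.append(W_hy @ h)
    return np.stack(ys)

def is_eps_stabilizer(W, W_abl, W_xh, W_hy, inputs, eps):
    """Assumed discrepancy: mean squared output difference over rollouts."""
    d = float(np.mean([
        np.mean((rollout(W_xh, W, W_hy, xs) - rollout(W_xh, W_abl, W_hy, xs)) ** 2)
        for xs in inputs
    ]))
    return d <= eps, d

# Toy example: Q = I, so W equals its own Schur form; R = first two units.
Q = np.eye(4)
T = np.array([[0.9, 0.5, 0.7, 0.1],
              [0.0, 0.8, 0.2, 0.6],
              [0.0, 0.0, 0.3, 0.4],
              [0.0, 0.0, 0.0, 0.2]])
W = Q @ T @ Q.T
W_xh = np.array([[1.0], [0.5], [0.0], [0.0]])    # input drives only sector R
W_hy = np.array([[1.0, -1.0, 0.0, 0.0]])         # readout fixed throughout
inputs = [np.sin(np.arange(6) + k).reshape(6, 1) for k in range(3)]

ok_cc, d_cc = is_eps_stabilizer(W, ablate(Q, T, 2, ["CC"]), W_xh, W_hy, inputs, 1e-6)
ok_rr, d_rr = is_eps_stabilizer(W, ablate(Q, T, 2, ["RR"]), W_xh, W_hy, inputs, 1e-6)
```

In this toy setting `ok_cc` is true with zero discrepancy, while `ok_rr` is false: the same Schur coordinates contain both a removable coupling and a fragile one, and which is which depends on the inputs the network actually sees.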

For neuroscience-style tasks, held-out error is measured by

$$\mathrm{FVU} = \frac{\mathbb{E}\|\hat{y} - y\|^2}{\mathbb{E}\|y - \bar{y}\|^2}. \tag{7}$$

The expectation is over held-out rollouts, $y$ is the target trajectory, $\hat{y}$ is the model output, and $\bar{y}$ is the empirical mean target over the evaluation set. For those tasks two summaries are reported:

$$\Delta\mathrm{FVU} = \mathrm{FVU}(\widetilde{W}_{hh}) - \mathrm{FVU}(W_{hh}), \tag{8}$$

$$S_{\Delta T} = \frac{\Delta\mathrm{FVU}}{\|\Delta T\|_F / \|T\|_F}. \tag{9}$$

Here $\Delta T = T - \widetilde{T}(S)$, and $\|\cdot\|_F$ denotes the Frobenius norm. $\Delta\mathrm{FVU}$ captures the effect at the trained scale, whereas $S_{\Delta T}$ measures effect per unit removed Schur mass. The perturbations are evaluated after training; no input or readout weights are refit.
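The three quantities in equations (7) to (9) can be sketched as below. The array layout `(rollouts, time, N_y)` and the choice to take $\bar{y}$ as a per-channel mean over rollouts and time are assumptions made for illustration; the numbers are toy values, not results from the paper.

```python
import numpy as np

def fvu(y_hat, y):
    """Fraction of variance unexplained; arrays are (rollouts, time, N_y)."""
    y_bar = y.mean(axis=(0, 1), keepdims=True)   # assumed: per-channel mean
    return float(np.sum((y_hat - y) ** 2) / np.sum((y - y_bar) ** 2))

def normalized_sensitivity(delta_fvu, T, T_abl):
    """Delta FVU per unit of relative Schur mass removed, as in equation (9)."""
    removed = np.linalg.norm(T - T_abl, "fro") / np.linalg.norm(T, "fro")
    return delta_fvu / removed

# Toy targets and predictions (illustrative only).
y = np.array([[[0.0], [1.0], [2.0], [3.0]]])
y_hat_full = y + 0.1       # stand-in for the trained model's output
y_hat_abl = y + 1.0        # stand-in for the ablated model's output

delta = fvu(y_hat_abl, y) - fvu(y_hat_full, y)   # equation (8)

T = np.array([[3.0, 4.0], [0.0, 0.0]])
T_abl = np.array([[3.0, 0.0], [0.0, 0.0]])       # off-diagonal coupling zeroed
s = normalized_sensitivity(delta, T, T_abl)
```

Here the ablation removes a relative Frobenius mass of $4/5$, so the normalized sensitivity is the raw degradation divided by $0.8$.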

We use $\alpha = 0.9$ throughout the main experiments. This value was chosen a priori as a simple relative spectral-radius cutoff for grouping high-modulus Schur blocks into $R$, rather than tuned for an ablation outcome. The threshold controls only the $R/C$ partition used to assign nonnormal couplings to $T_{RR}$, $T_{C \rightarrow R}$, and $T_{CC}$. A nearby-threshold check on the copy-task controllers preserves the same qualitative profile ([Table 1](https://arxiv.org/html/2606.18457#S2.T1)).

Table 1: Sensitivity to the Schur split threshold. Values are mean autonomous replay accuracy over 128 lags. The main experiments use $\alpha = 0.9$.

### Coordinate choice.

The Schur basis remains orthogonal even for strongly nonnormal matrices (Trefethen and Embree, [2005](https://arxiv.org/html/2606.18457#bib.bib8)). Direct eigencoordinates are often ill-conditioned when transient amplification is large, making cross-run comparison unstable and turning component ablations into basis-sensitive operations. By separating spectral blocks from nonnormal couplings and ordering them by eigenvalue modulus, the real Schur form turns those couplings into structured perturbation directions. Compared with eigencoordinates, Schur coordinates provide a reproducible diagnostic basis for perturbing and interpreting recurrent dynamics.
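The conditioning contrast can be checked numerically on a small example. The $2 \times 2$ matrix below is a hypothetical, strongly nonnormal matrix chosen for illustration; it is not one of the trained recurrent matrices.

```python
import numpy as np
from scipy.linalg import schur

# A small, strongly nonnormal matrix (illustrative, not taken from the paper).
W = np.array([[0.9, 50.0],
              [0.0, 0.8]])

_, V = np.linalg.eig(W)                  # columns of V are eigenvectors
T, Q = schur(W, output="real")           # orthogonal Schur basis

cond_eig = np.linalg.cond(V)             # large: eigenvectors nearly parallel
cond_schur = np.linalg.cond(Q)           # 1 up to rounding: Q is orthogonal
```

The two eigenvectors are almost collinear, so the eigenvector basis has a condition number on the order of $10^3$, while the Schur basis stays perfectly conditioned. Zeroing a coordinate in the eigenbasis is therefore far more sensitive to small changes in the matrix than zeroing an entry of $T$.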

## 3 Approximate Stabilizers in the Copy Task

The copy task is a fixed-delay variant of the copying-memory benchmark for long-range recurrent memory (Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2606.18457#bib.bib17); Arjovsky et al., [2016](https://arxiv.org/html/2606.18457#bib.bib18)), and related fixed-length copy tasks have been used to study traveling-wave recurrent models (Keller et al., [2024](https://arxiv.org/html/2606.18457#bib.bib19)). It presents a sequence of $s = 8$ symbols in $\{-1, +1\}^d$, with $d = 8$, then sets inputs to zero while the network autonomously reproduces the stored sequence. Replay accuracy is measured over the first 128 generated symbols after the input sequence. The copy task experiments train one-layer tanh RNNs at $N_h \in \{56, 64, 72\}$ under four recurrent constructions. Let $m = N_h^{-1/2}$. The three dense constructions optimize an unconstrained matrix $W_{hh} \in \mathbb{R}^{N_h \times N_h}$ and differ only in $W_{hh}^{(0)}$:

$$\begin{aligned}
\text{dense default:} \quad & W_{hh,ij}^{(0)} \sim \mathrm{Unif}[-m, m], \\
\text{dense orthogonal:} \quad & W_{hh}^{(0)} = Q, \qquad Q^\top Q = I, \\
\text{dense normal:} \quad & W_{hh}^{(0)} = Q D_{\mathrm{norm}} Q^\top,
\end{aligned}$$

where

$$D_{\mathrm{norm}} = \mathrm{blockdiag}(B_1, \ldots, B_{N_h/2}), \qquad B_i = \begin{pmatrix} a_i & -b_i \\ b_i & a_i \end{pmatrix}, \qquad a_i, b_i \sim \mathcal{N}(0, 1/6).$$

For the Cayley construction, every optimization iterate satisfies $W_{hh}^{(k)} = O(A^{(k)}) D^{(k)} O(A^{(k)})^\top$, where $(A^{(k)})^\top = -A^{(k)}$ and

$$O(A) = (I - A)(I + A)^{-1}.$$

At initialization,

$$\begin{aligned}
U_{ij} &\sim \mathrm{Unif}[-m, m], & A^{(0)} &= (U - U^\top)/2, \\
\widetilde{W}_{ij} &\sim \mathrm{Unif}[-m, m], & D^{(0)} &= \mathrm{realblock}\!\left(\mathrm{eig}(\widetilde{W})\right),
\end{aligned}$$

with $\mathrm{realblock}(\cdot)$ converting conjugate eigenvalue pairs into $2 \times 2$ real blocks of the form above. For $\mathcal{Z} = \{T_{RR}, T_{C \rightarrow R}, T_{CC}\}$ and $S \subseteq \mathcal{Z}$, the intervention is

$$\widetilde{W}_{hh}(S) = Q Z_S(T) Q^\top, \qquad \bigl(Z_S(T)\bigr)_B = \begin{cases} 0, & B \in S, \\ T_B, & B \notin S, \end{cases} \qquad B \in \mathcal{Z}.$$

Entries outside $\{T_{RR}, T_{C \rightarrow R}, T_{CC}\}$ are unchanged. For $\mathcal{D}_{\mathrm{rc}}$ and $\mathcal{L} = \{1, \ldots, 128\}$,

$$\hat{y}_{\ell j}^{S}(x) := \hat{y}_{\ell j}(x; \widetilde{W}_{hh}(S)),$$

$$\mathrm{Acc}_{\mathrm{rc}} = \frac{1}{|\mathcal{D}_{\mathrm{rc}}|\,|\mathcal{L}|\,d} \sum_{\substack{(x,y) \in \mathcal{D}_{\mathrm{rc}} \\ \ell \in \mathcal{L},\; j \in [d]}} \mathbf{1}\{\operatorname{sgn}(\hat{y}_{\ell j}^{S}(x)) = y_{\ell j}\}.$$
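The replay accuracy above is a mean sign agreement over examples, lags, and output dimensions. A minimal sketch, assuming outputs and targets are stored as arrays of shape `(examples, lags, d)` and using a hypothetical function name:

```python
import numpy as np

def replay_accuracy(y_hat, y):
    """Mean sign agreement; both arrays have shape (examples, lags, d).

    Targets are in {-1, +1}; an output of exactly 0 has sign 0 and therefore
    counts as an error under this definition.
    """
    return float(np.mean(np.sign(y_hat) == y))

# Toy data: one example, two lags, d = 2 (not the paper's 128 lags, d = 8).
y = np.array([[[1, -1], [-1, 1]]])
y_hat = np.array([[[0.7, -0.2], [0.3, 0.9]]])   # third entry has the wrong sign
acc = replay_accuracy(y_hat, y)                 # 3 of 4 signs agree
```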
![Refer to caption](https://arxiv.org/html/2606.18457v1/x1.png)

Figure 1: Candidate approximate functional invariances in the copy task. Points connected by gray line segments differ only by additionally zeroing $T_{CC}$. In the dense orthogonal model, $T_{CC}$ removal leaves the autonomous replay function nearly unchanged conditional on the other removed blocks, while $T_{RR}$ and $T_{C \rightarrow R}$ move the network between lower-accuracy functional classes. The Cayley-transform representative has negligible complement blocks and changes little under the shown ablations.

In the dense orthogonal $N_h = 72$ model, removing $T_{CC}$ alone leaves mean replay accuracy at $1.00$, matching the full model ([Figure 1](https://arxiv.org/html/2606.18457#S3.F1)). The same near-equivalence holds after other Schur blocks have already been removed: $-T_{RR}$ and $-T_{RR}, -T_{CC}$ give $0.876$ and $0.875$; $-T_{C \rightarrow R}$ and $-T_{C \rightarrow R}, -T_{CC}$ both give $0.639$; $-T_{RR}, -T_{C \rightarrow R}$ and zeroing all three blocks both give $0.624$. Selected structured changes to nonnormal Schur couplings can therefore preserve the task behavior once the other ablated blocks are fixed.
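As a small worked check on the reported numbers, the accuracy gap within each pair can be compared with the drop caused by removing the other blocks. The values below are copied from the paragraph above; treating the accuracy gap as the discrepancy is an illustrative choice, and the dictionary keys are shorthand labels.

```python
# Mean replay accuracies reported for the dense orthogonal N_h = 72 model.
# Each pair is (T_CC kept, T_CC additionally removed).
pairs = {
    "full": (1.00, 1.00),
    "-T_RR": (0.876, 0.875),
    "-T_CR": (0.639, 0.639),
    "-T_RR,-T_CR": (0.624, 0.624),
}

# Gap attributable to zeroing T_CC, conditional on the other removed blocks.
gaps = {name: round(abs(a - b), 3) for name, (a, b) in pairs.items()}
max_gap = max(gaps.values())

# Drop from the full model when a single other block is removed.
drops = {
    "-T_RR": round(1.00 - 0.876, 3),
    "-T_CR": round(1.00 - 0.639, 3),
}
```

The largest conditional gap from removing $T_{CC}$ is 0.001, against drops of 0.124 and 0.361 for $T_{RR}$ and $T_{C \rightarrow R}$, which is the sense in which $T_{CC}$ behaves as a near-stabilizer here.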

$T_{CC}$ is close to a stabilizer for this solved copy task controller conditional on the other removed blocks. Removing $T_{C \rightarrow R}$ moves the dense model to a different functional class, and removing $T_{RR}$ produces a distinct intermediate class. The Cayley representative has negligible complement blocks at this width, so the same ablations leave replay accuracy unchanged.
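One possible reading of why a Cayley-parameterized solution carries little nonnormal coupling is sketched below. If $D$ keeps the $2 \times 2$ block form $\begin{pmatrix} a & -b \\ b & a \end{pmatrix}$ and $O(A)$ is orthogonal, then $O(A) D O(A)^\top$ is a normal matrix, and a normal matrix has no strictly block-upper-triangular part in its real Schur form. This argument is an assumption offered for illustration and is not stated in the text; the size `n = 6`, the Gaussian block entries, and reading $1/6$ as a variance are likewise illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
m = n ** -0.5

U = rng.uniform(-m, m, size=(n, n))
A = (U - U.T) / 2.0                          # skew-symmetric: A.T == -A
I = np.eye(n)
O = (I - A) @ np.linalg.inv(I + A)           # Cayley transform O(A)

# Block-diagonal D with 2x2 blocks [[a, -b], [b, a]] (values are illustrative).
D = np.zeros((n, n))
for k in range(0, n, 2):
    a, b = rng.normal(scale=np.sqrt(1 / 6), size=2)
    D[k:k + 2, k:k + 2] = [[a, -b], [b, a]]

W = O @ D @ O.T

orth_err = np.linalg.norm(O.T @ O - I)            # ~0: O is orthogonal
normal_err = np.linalg.norm(W @ W.T - W.T @ W)    # ~0: W is normal
```

Both errors are at rounding level in this sketch, consistent with ablations of $T_{RR}$, $T_{C \rightarrow R}$, and $T_{CC}$ having almost nothing to remove.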

These pairs define task-restricted approximate equivalence classes in which multiple recurrent matrices with different nonnormal coordinates realize nearly identical rollout functions on the copy task distribution. The copy task panels evaluate two representative solved $N_h = 72$ controllers, using Schur-coordinate ablations as mechanistic interventions on trained controllers.

*Takeaway:* in the dense orthogonal copy solution, $T_{CC}$ is nearly loss-preserving while $T_{C \rightarrow R}$ is not; in the Cayley-transform solution, the tested nonnormal couplings are nearly absent and the same ablations have little effect.

## 4 Task Dependence Beyond the Copy Task

The cross-task suite tests whether the Schur-coordinate interventions remain informative beyond the explicit temporal symmetry of the copy task. The three tasks require discrete memory, oscillatory generation, and context-dependent accumulation, so they probe distinct recurrent computations in the same one-layer architecture. The experiments use one-layer tanh RNNs with $N_h = 64$, $W_{hh}^{(0)} = Q$, $Q^\top Q = I$, Adam with learning rate $10^{-3}$, batch size 64, 30 epochs, and 128 batches per epoch. Three seeds are trained for each of 3-bit flip-flop with length 25, frequency-cued sine generation with length 50, and context-dependent integration with four inputs and length 48. Full models have held-out mean $\mathrm{FVU} = 0.0048$ for flip-flop, $0.0036$ for sine generation, and $0.0104$ for context-dependent integration.

![Refer to caption](https://arxiv.org/html/2606.18457v1/x2.png)

Figure 2: Single-block Schur ablations across neuroscience-style tasks. Top: raw degradation $\Delta\mathrm{FVU}$. Bottom: normalized sensitivity $S_{\Delta T}$. The loss-preserving ablation profile depends on the computation: raw degradation is largest for $T_{C \rightarrow R}$ in flip-flop and for complement-linked blocks in sine generation and context-dependent integration.

Values are mean $\pm$ SEM over seeds ([Figure 2](https://arxiv.org/html/2606.18457#S4.F2)). For flip-flop, zeroing $T_{C \rightarrow R}$ increases held-out error by $9.45 \times 10^{-2} \pm 9.35 \times 10^{-3}$, while zeroing $T_{CC}$ increases error by $4.96 \times 10^{-2} \pm 5.39 \times 10^{-3}$. The ring-internal block $T_{RR}$ has almost no raw effect.

For sine generation, zeroing $T_{CC}$ raises held-out error by $2.08 \pm 0.23$, and zeroing $T_{C \rightarrow R}$ raises it by $1.73 \pm 0.34$. The normalized sensitivity is largest for $T_{C \rightarrow R}$, $21.1 \pm 5.1$, with $T_{CC}$ still substantial at $12.3 \pm 1.5$. Removing $T_{RR}$ has little effect at this width. In context-dependent integration, zeroing $T_{CC}$ raises held-out error by $0.94 \pm 0.03$, while zeroing $T_{C \rightarrow R}$ raises it by $0.37 \pm 0.16$. The raw effect is dominated by $T_{CC}$, consistent with a slow accumulated variable supported by within-complement recurrence.

Across tasks, selected Schur couplings can be removed with little loss when they avoid task-relevant directions, as in the copy task $T_{CC}$ pairs. The same coordinates localize fragile directions when a block is required, as in sine generation and context-dependent integration.

*Takeaway:* the loss-preserving ablation profile varies across tasks and trained solutions; no single Schur coupling is uniformly safe to remove.

### Metric interpretation\.

Raw degradation measures loss at the trained operating point, whereas $S_{\Delta T}$ measures loss per unit removed Schur mass. We treat $\Delta\mathrm{FVU}$ as the primary behavioral effect and use $S_{\Delta T}$ to identify small sectors with disproportionate impact.

## 5 Discussion and Limitations

### Interpretation\.

Exact symmetries characterize functional equivalence classes in weight space (Entezari et al., [2022](https://arxiv.org/html/2606.18457#bib.bib3); Ainsworth et al., [2023](https://arxiv.org/html/2606.18457#bib.bib1); Navon et al., [2023](https://arxiv.org/html/2606.18457#bib.bib2)). Ordered Schur coordinates play a complementary role by fixing an orthogonal coordinate system in which recurrent matrices can be compared and nonnormal sectors can be causally ablated. The resulting equivalences are task-restricted and approximate, because they are defined by rollout behavior on $\mathcal{D}$ rather than by a global parameter-space group action. For recurrent networks, raw parameter distance can miss both large structured changes that preserve the task function and small directed changes that alter it.

Because the tasks studied here are low\-dimensional, the trained networks may use only a low\-dimensional hidden\-state subspace\. A Schur ablation can then preserve performance because it avoids the activity directions aligned with the readout or the dominant hidden\-state principal components, rather than because the removed coupling has no computational role\. The experiments do not separate this subspace explanation from the Schur\-coordinate account\. Separating the two would require measuring how the ablated Schur directions project onto hidden\-state PCs, readout\-aligned subspaces, and task\-conditioned activity manifolds\.

### Scope\.

The experiments use vanilla one\-layer tanh RNNs, simple low\-dimensional tasks, a narrow width range, and a small number of trained solutions\. They do not test LSTMs, GRUs, gated architectures, large sequence models, or high\-dimensional real\-world sequence tasks, so the evidence supports Schur\-coordinate ablation as a diagnostic for trained recurrent controllers rather than a universal statement about nonnormal structure\.

## References

- S. K. Ainsworth, J. Hayase, and S. Srinivasa (2023). Git re-basin: merging models modulo permutation symmetries. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=CQsmMYmlP5T). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p1.1), [§5](https://arxiv.org/html/2606.18457#S5.SS0.SSS0.Px1.p1.1).
- M. Arjovsky, A. Shah, and Y. Bengio (2016). Unitary evolution recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 48, pp. 1120–1128. [Link](https://proceedings.mlr.press/v48/arjovsky16.html). Cited by: [§3](https://arxiv.org/html/2606.18457#S3.p1.7).
- G. Bondanelli and S. Ostojic (2020). Coding with transient trajectories in recurrent neural networks. PLOS Computational Biology 16(2), pp. e1007655. [Document](https://dx.doi.org/10.1371/journal.pcbi.1007655). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p3.1).
- R. Entezari, H. Sedghi, O. Saukh, and B. Neyshabur (2022). The role of permutation invariance in linear mode connectivity of neural networks. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=dNigytemkL). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p1.1), [§5](https://arxiv.org/html/2606.18457#S5.SS0.SSS0.Px1.p1.1).
- G. Hennequin, T. P. Vogels, and W. Gerstner (2012). Non-normal amplification in random balanced neuronal networks. Physical Review E 86(1), pp. 011909. [Document](https://dx.doi.org/10.1103/PhysRevE.86.011909). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p3.1).
- S. Hochreiter and J. Schmidhuber (1997). Long short-term memory. Neural Computation 9(8), pp. 1735–1780. [Document](https://dx.doi.org/10.1162/neco.1997.9.8.1735). Cited by: [§3](https://arxiv.org/html/2606.18457#S3.p1.7).
- T. A. Keller, L. Muller, T. Sejnowski, and M. Welling (2024). Traveling waves encode the recent past and enhance sequence learning. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=p4S5Z6Sah4). Cited by: [§3](https://arxiv.org/html/2606.18457#S3.p1.7).
- M. Kofinas, B. Knyazev, Y. Zhang, Y. Chen, G. J. Burghouts, E. Gavves, C. G. M. Snoek, and D. W. Zhang (2024). Graph neural networks for learning equivariant representations of neural networks. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=oO6FsMyDBt). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p1.1).
- N. Maheswaranathan, A. H. Williams, M. D. Golub, S. Ganguli, and D. Sussillo (2019). Universality and individuality in neural dynamics across large populations of recurrent networks. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p4.1).
- V. Mante, D. Sussillo, K. V. Shenoy, and W. T. Newsome (2013). Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503, pp. 78–84. [Document](https://dx.doi.org/10.1038/nature12742). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p4.1).
- B. K. Murphy and K. D. Miller (2009). Balanced amplification: a new mechanism of selective amplification of neural activity patterns. Neuron 61(4), pp. 635–648. [Document](https://dx.doi.org/10.1016/j.neuron.2009.02.005). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p3.1).
- A. Navon, A. Shamsian, I. Achituve, E. Fetaya, G. Chechik, and H. Maron (2023). Equivariant architectures for learning in deep weight spaces. In Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 202, pp. 25790–25816. [Link](https://proceedings.mlr.press/v202/navon23a.html). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p1.1), [§5](https://arxiv.org/html/2606.18457#S5.SS0.SSS0.Px1.p1.1).
- A. Navon, A. Shamsian, E. Fetaya, G. Chechik, N. Dym, and H. Maron (2024). Equivariant deep weight space alignment. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 235, pp. 37376–37395. [Link](https://proceedings.mlr.press/v235/navon24a.html). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p1.1).
- J. J. Pattadkal, B. V. Zemelman, I. Fiete, and N. J. Priebe (2024). Primate neocortex performs balanced sensory amplification. Neuron 112(4), pp. 661–675.e7. [Document](https://dx.doi.org/10.1016/j.neuron.2023.11.005). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p3.1).
- F. Schuessler, F. Mastrogiuseppe, S. Ostojic, and O. Barak (2024). Aligned and oblique dynamics in recurrent neural networks. eLife 13, pp. RP93060. [Document](https://dx.doi.org/10.7554/eLife.93060.3). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p4.1).
- D. Sussillo and O. Barak (2013). Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation 25(3), pp. 626–649. [Document](https://dx.doi.org/10.1162/NECO%5Fa%5F00409). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p4.1).
- L. N. Trefethen and M. Embree (2005). Spectra and pseudospectra: the behavior of nonnormal matrices and operators. Princeton University Press. ISBN 9780691119465. [Document](https://dx.doi.org/10.1515/9780691213101). Cited by: [§2](https://arxiv.org/html/2606.18457#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2606.18457#S2.p2.3).
- A. Zhou, K. Yang, K. Burns, A. Cardace, Y. Jiang, S. Sokota, J. Z. Kolter, and C. Finn (2023). Permutation equivariant neural functionals. In Advances in Neural Information Processing Systems, Vol. 36, pp. 24966–24992. [Link](https://openreview.net/forum?id=fmYmXNPmhv). Cited by: [§1](https://arxiv.org/html/2606.18457#S1.p1.1).
