Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

arXiv cs.LG 05/12/26, 04:00 AM Papers
grokking deep-learning theoretical-ai empirical-study two-layer-networks feature-learning
Summary
This empirical study validates theoretical findings on feature repulsion and spectral lock-in during the grokking phenomenon in two-layer neural networks, demonstrating how activation functions influence the transition from memorization to generalization.
arXiv:2605.08119v1 Announce Type: new Abstract: Tian (2025) proves a repulsion theorem (Theorem 6) for the matrix $ B = (\widetilde{F}^\top \widetilde{F} + \eta I)^{-1} $ during the interactive feature-learning stage of grokking: similar features have negative off-diagonal entries $ B_{j\ell} $, producing an effective repulsive force that drives them apart. However, the theorem does not specify when this mechanism becomes empirically observable, nor whether it leaves a measurable spectral signature in the parameter updates. We test this directly on Tian's modular addition setup ($ M = 71 $, $ K = 2048 $, MSE loss) and observe a clear structure-mechanism dissociation. The predicted sign rule holds robustly on the top-200 most-similar feature pairs across activations (empirical sign-match rising from 0.865 to 0.985 on $ \sigma = x^2 $ across 5 seeds, and saturating at 1.000 on $ \sigma = \operatorname{ReLU} $). However, the spectral signature in the parameter updates is strongly activation-dependent. With $ \sigma = x^2 $, a simple slope detector on the rolling eigengap $ \sigma_2 / \sigma_3 $ of $ \Delta W $ fires in 15/15 grokking seeds at epoch 174 (IQR [173,174]) and in 0/15 non-grokking controls, with 229$ \times $ late-stage magnitude separation; the spectrum is rank-2. In contrast, with $ \sigma = \operatorname{ReLU} $, the detector never fires and the spectrum remains effectively rank-1. This dissociation aligns with Tian's Theorem 5 distinction between focused (power-law) and spreading (ReLU) memorization: while the sign structure of $ B $ depends only on $ \widetilde{F}^\top \widetilde{F} $, how feature repulsion translates into weight updates critically depends on the activation derivative $ \sigma' $.
Source: [https://arxiv.org/html/2605.08119](https://arxiv.org/html/2605.08119)
## Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

Code at [https://github.com/skydancerosel/grokking-integrability/tree/main/tian_eigengap](https://github.com/skydancerosel/grokking-integrability/tree/main/tian_eigengap).

###### Abstract

Tian ([2025](https://arxiv.org/html/2605.08119#bib.bib5)) proves a repulsion theorem (Theorem 6) for the off-diagonal structure of $B=(\widetilde{F}^\top\widetilde{F}+\eta I)^{-1}$ in two-layer networks during the interactive feature-learning stage of grokking, but does not specify when in training this mechanism becomes empirically observable. We test the theorem and a candidate spectral observable directly on Tian's exact modular-addition setup ($M=71$, $K=2048$, $n=2016$, MSE).

Theorem 6 holds across activations. The sign rule $\operatorname{sgn}(B_{j\ell})=-\operatorname{sgn}(\widetilde{f}_j^\top P_{\eta,-j\ell}\widetilde{f}_\ell)$ is verified on the top-200 most-similar feature pairs at five deterministic-replay checkpoints across $n=5$ seeds. Empirical agreement rises from 0.865 [IQR 0.865, 0.875] at epoch 50 to a tight saturation of 0.985 [IQR 0.980, 0.990] by epoch 300 with $\sigma(x)=x^2$. On $\sigma=\mathrm{ReLU}$ the same sign rule *also* holds, saturating even faster (1.000 by epoch 500). The mechanism is activation-general.

The parameter-update spectral signature is activation-specific. The rolling-window eigengap $\sigma_2/\sigma_3$ on the parameter-update Gram of $\Delta W$ fires only when Theorem 6 saturates *and* features collapse onto sharp peaks (the *focused memorization* regime of Tian's Theorem 5, characteristic of power activations). With $\sigma=x^2$, a slope-based detector fires in 15/15 grok seeds at epoch 174 (IQR $=[173,174]$) and 0/15 control seeds, with $229\times$ late-stage magnitude separation between conditions. With $\sigma=\mathrm{ReLU}$, which Tian's Theorem 5 places in the *spreading memorization* regime, the same detector fires in 0/15 grok seeds, late-stage magnitude separation collapses to $1.4\times$, and the spectrum is rank-1 dominated rather than rank-2.

The two findings together draw a structure–mechanism distinction. Tian's Theorem 6 governs the off-diagonal sign structure of $B$ via properties of $\widetilde{F}^\top\widetilde{F}$ alone, which depends on the activation only through its effect on $\widetilde{F}$. The signature in the parameter-update spectrum, however, depends on *how* the repulsion translates into weight updates, which depends on $\sigma'$. Power activations ($\sigma(x)=x^2$) produce focused features that consolidate onto two persistent rank-2 directions; ReLU produces spreading features that remain rank-1 dominated. We connect this to Theorem 5's focused-vs-spreading distinction.

We also report supporting findings: the lock-in detector is sensitive to window size ($W\leq 10$ produces false positives in the $\eta=0$ control; $W\in\{20,30\}$ give perfect specificity); $\sigma_3,\sigma_4,\sigma_5$ collapse together at small windows, confirming rank-2 at the finest temporal resolution; and the lead time of the level-metric detector $\rho_{\mathrm{tian}}$ at $\eta=10^{-5}$ scales as Tian's $1/\eta$ prediction (lead 567 epochs, grokking at epoch 1527).

## 1 Introduction

Grokking, the abrupt onset of generalization long after memorization (Power et al., [2022](https://arxiv.org/html/2605.08119#bib.bib4)), has accumulated explanations through mechanistic interpretability (Nanda et al., [2023](https://arxiv.org/html/2605.08119#bib.bib3)), weight decay as implicit regularization (Liu et al., [2022](https://arxiv.org/html/2605.08119#bib.bib2)), and lazy-to-rich transitions (Kumar et al., [2024](https://arxiv.org/html/2605.08119#bib.bib1)). Tian ([2025](https://arxiv.org/html/2605.08119#bib.bib5)) provides the most principled framework: Li2 decomposes grokking dynamics in two-layer networks into three stages (*Lazy* learning, *Independent* feature learning, and *Interactive* feature learning) characterized by progressively richer structures of the backpropagated gradient $G_F$ and the activation Gram $\widetilde{F}^\top\widetilde{F}$.

Within Stage III, Tian's Theorem 6 (*repulsion of similar features*) asserts that the off-diagonal entries of $B:=(\widetilde{F}^\top\widetilde{F}+\eta I)^{-1}$ satisfy

$$\operatorname{sgn}(B_{j\ell})\;=\;-\operatorname{sgn}\!\left(\widetilde{f}_j^\top P_{\eta,-j\ell}\widetilde{f}_\ell\right),\qquad P_{\eta,-j\ell}:=I-\widetilde{F}_{-j\ell}(\widetilde{F}_{-j\ell}^\top\widetilde{F}_{-j\ell}+\eta I)^{-1}\widetilde{F}_{-j\ell}^\top, \tag{1}$$

where $\widetilde{F}_{-j\ell}$ excludes the $j$-th and $\ell$-th columns. The mechanism: when two hidden nodes acquire similar activations ($\widetilde{f}_j^\top\widetilde{f}_\ell$ large and positive), $B_{j\ell}$ becomes negative, producing an effective force that drives them apart.
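The sign rule in equation (1) can be checked numerically on toy data. The following is a minimal sketch, not the authors' verification code: it uses random Gaussian features in place of trained activations, a small hypothetical size ($n=30$, $K=8$), and the exact leave-two-out projector $P_{\eta,-j\ell}$. The helper name `sign_rule_matches` is ours.

```python
import numpy as np

def sign_rule_matches(F, eta, j, l):
    """Check sgn(B_jl) == -sgn(f_j^T P_{eta,-jl} f_l) for one column pair (j, l)."""
    n, K = F.shape
    B = np.linalg.inv(F.T @ F + eta * np.eye(K))
    keep = [c for c in range(K) if c not in (j, l)]
    F_rest = F[:, keep]                          # F with columns j and l removed
    inner = np.linalg.inv(F_rest.T @ F_rest + eta * np.eye(K - 2))
    P = np.eye(n) - F_rest @ inner @ F_rest.T    # exact projector P_{eta,-jl}
    residual = F[:, j] @ P @ F[:, l]             # residual similarity of the pair
    return bool(np.sign(B[j, l]) == -np.sign(residual))

# Toy stand-in for the feature matrix (assumption: random, not trained, features).
rng = np.random.default_rng(0)
F = rng.standard_normal((30, 8))
pairs = [(j, l) for j in range(8) for l in range(j + 1, 8)]
results = [sign_rule_matches(F, eta=0.1, j=j, l=l) for j, l in pairs]
print(sum(results), "of", len(pairs), "pairs satisfy the sign rule")
```

Recomputing the projector per pair costs one $(K-2)\times(K-2)$ inverse each time, which is cheap here but is the expense that motivates a shortcut at $K=2048$.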

The framework is theoretically clean, but it leaves two empirical questions open: (i) when in training does this repulsion become observable, and (ii) does it manifest as a measurable signature in quantities a practitioner can compute online, without expensive offline diagnostics? This paper addresses both.

#### Two complementary tests.

On Tian's exact setup ($M=71$, $K=2048$, $\sigma(x)=x^2$, MSE, training fraction $p\approx 0.40$, $\eta=2\times 10^{-4}$ vs $\eta=0$ as a no-grokking control), we run two tests.

The first directly verifies the sign rule of equation ([1](https://arxiv.org/html/2605.08119#S1.E1)) by deterministic-replay reconstruction at multiple training checkpoints, computing $B$ via the Woodbury identity, and checking sign agreement on the top-200 most-similar feature pairs.

The second tests a candidate online observable: the rolling-window eigengap $\sigma_2/\sigma_3$ of the parameter-update Gram. If Stage III repulsion consolidates redundant feature dimensions, the rolling $\Delta W$ spectrum should become low-rank: two persistent update directions for the surviving feature consolidations, with subdominant directions collapsing to noise. The $\sigma_2/\sigma_3$ ratio is a natural detector for that collapse.

#### Contributions.

1. Multi-seed verification of Theorem 6 on $\sigma=x^2$. The empirical sign-match (top-200 similar pairs) rises from 0.865 [IQR 0.865, 0.875] at epoch 50 to 0.985 [IQR 0.980, 0.990] at epoch 300 across $n=5$ seeds. Saturation at $\geq 0.95$ occurs at epoch 175 in every seed.
2. Theorem 6 generalizes beyond power activations. On $\sigma=\mathrm{ReLU}$, the same sign-rule check yields 0.91 at epoch 100, 0.995 at epoch 300, and 1.000 at epoch 500. The repulsion mechanism is activation-general.
3. The parameter-update spectral signature is $\sigma=x^2$ specific. A slope-based detector on $\sigma_2/\sigma_3$ fires in 15/15 grok seeds at epoch 174 (IQR $=[173,174]$) on $\sigma=x^2$, with $229\times$ late-stage magnitude separation from the $\eta=0$ control. On $\sigma=\mathrm{ReLU}$ it fires in 0/15 grok seeds and the magnitude separation collapses to $1.4\times$. The ReLU spectrum is rank-1 dominated rather than rank-2. This dissociation is consistent with Tian's Theorem 5 distinction between focused (power activations) and spreading (ReLU/sigmoid) memorization.
4. Methodological controls. A window-size sensitivity sweep shows that $W\leq 10$ produces false positives in the $\eta=0$ control; specificity holds for $W\in\{20,30\}$. With $\sigma_4,\sigma_5$ logged, the rank-2 claim is exact at the finest window size ($W=5$), where $\sigma_3,\sigma_4,\sigma_5$ collapse together to the noise floor; at larger windows a geometric cascade emerges.
5. Extension across $\eta$. An extended single-seed run at $\eta=10^{-5}$ confirms Tian's $1/\eta$ scaling: grokking at epoch 1527, with the level metric $\rho_{\mathrm{tian}}$ (equation ([7](https://arxiv.org/html/2605.08119#S6.E7)) below) leading test accuracy by 567 epochs. The lock-in magnitude $\sigma_2/\sigma_3$ at peak drops from $\sim 300$ to $\sim 25$ at this slow $\eta$, consistent with rank-2 structure being incompletely developed when measurement ends.

#### Paper outline.

Section [2](https://arxiv.org/html/2605.08119#S2) describes the setup. Section [3](https://arxiv.org/html/2605.08119#S3) presents the Theorem 6 verification across activations and seeds, the headline result. Section [4](https://arxiv.org/html/2605.08119#S4) presents the parameter-update spectral signature on $\sigma=x^2$ and documents its failure on $\sigma=\mathrm{ReLU}$. Section [5](https://arxiv.org/html/2605.08119#S5) reports the $\eta$ sweep including the extended run. Section [6](https://arxiv.org/html/2605.08119#S6) briefly reports a level-metric detector that works on $\sigma=x^2$ but does not generalize across $(M,p,\sigma)$, scoping its applicability. Section [7](https://arxiv.org/html/2605.08119#S7) discusses what the distinction between an activation-general mechanism and an activation-specific signature implies for the spectral approach to grokking diagnostics.

## 2 Setup and Instrumentation

### 2.1 Architecture and training

We replicate Tian ([2025](https://arxiv.org/html/2605.08119#bib.bib5)) Figure 3 exactly. The model is

$$\widehat{Y}=\sigma(XW)V, \tag{2}$$

with frozen identity embedding $X\in\mathbb{R}^{n\times 2M}$ (concatenated one-hots of two input tokens), bias-free linear layers $W\in\mathbb{R}^{2M\times K}$ and $V\in\mathbb{R}^{K\times M}$, and configurable activation $\sigma$. The loss is the zero-meaned MSE used in Tian's code:

$$J(W,V)=\tfrac{1}{2}\left\|P_1^\perp\bigl(Y-\sigma(XW)V\bigr)\right\|_F^2,\quad P_1^\perp:=I_n-\tfrac{1}{n}\mathbf{1}\mathbf{1}^\top. \tag{3}$$

Training uses Adam at learning rate $10^{-3}$ with weight decay $\eta$. The hyperparameter $\eta$ in Tian's notation is *weight decay* (not learning rate); we follow this convention throughout.
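Equations (2) and (3) can be sketched in a few lines. This is an illustrative NumPy version under stated assumptions, not Tian's code or the released repository: `modular_addition_data` and `zero_mean_mse` are hypothetical helper names, the hidden width is shrunk to $K=16$ to keep it fast, and the train split is a plain random permutation.

```python
import numpy as np

def modular_addition_data(M):
    """All M*M pairs (a, b): X holds concatenated one-hots, Y is one-hot of (a+b) mod M."""
    a, b = np.divmod(np.arange(M * M), M)
    X = np.zeros((M * M, 2 * M))
    X[np.arange(M * M), a] = 1.0        # first token occupies columns 0..M-1
    X[np.arange(M * M), M + b] = 1.0    # second token occupies columns M..2M-1
    Y = np.eye(M)[(a + b) % M]
    return X, Y

def zero_mean_mse(X, Y, W, V, sigma=np.square):
    """J(W, V) = 0.5 * || P1_perp (Y - sigma(X W) V) ||_F^2."""
    residual = Y - sigma(X @ W) @ V
    # Applying P1_perp = I - (1/n) 1 1^T subtracts each column's mean over samples.
    residual = residual - residual.mean(axis=0, keepdims=True)
    return 0.5 * np.sum(residual ** 2)

M, K = 71, 16                            # K reduced from 2048 for this sketch
X, Y = modular_addition_data(M)
rng = np.random.default_rng(0)
train_idx = rng.permutation(M * M)[: round(0.40 * M * M)]   # p ~ 0.40
W = rng.standard_normal((2 * M, K)) / np.sqrt(2 * M)
V = rng.standard_normal((K, M)) / np.sqrt(K)
loss = zero_mean_mse(X[train_idx], Y[train_idx], W, V)
```

With $M=71$ and $p\approx 0.40$ the split has 2016 training rows, matching the $n=2016$ quoted in the abstract. Because of $P_1^\perp$, adding a constant to every target leaves the loss unchanged.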

The default setup is $M=71$, $K=2048$, $p=n_{\text{train}}/M^2\approx 0.40$, $\eta\in\{2\times 10^{-4},0\}$, 400 epochs (800 for ReLU, which groks more slowly), 15 seeds. The matched-seed $\eta=0$ control isolates the effect of weight decay.

### 2.2 Logged quantities

At each epoch we log: train/test accuracy; the off-diagonal ratio of $\widetilde{F}^\top\widetilde{F}$; the level metric $\rho_{\mathrm{tian}}$ (equation ([7](https://arxiv.org/html/2605.08119#S6.E7)), Section [6](https://arxiv.org/html/2605.08119#S6)); $\|G_F\|$; the top-5 eigenvalues of the rolling-window Gram of $\Delta W$ (and $\Delta V$) with $W=20$; and a 500-pair independence proxy for $G_F$ column decoupling.

The rolling Gram is maintained as a deque of flattened parameter deltas; at each step we form $\Delta=[\Delta W_{t-W+1},\ldots,\Delta W_t]\in\mathbb{R}^{P\times W}$ and call `torch.linalg.eigvalsh` on $\Delta^\top\Delta\in\mathbb{R}^{W\times W}$, an $O(W^3)$ operation that is negligible per epoch. The rolling-Gram top-$k$ eigenvalues are $\sigma_k(t)$; we report $\sigma_2/\sigma_3$ as the primary detector.
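The deque-based bookkeeping described above might look like the following. This is a sketch under assumptions: it uses `numpy.linalg.eigvalsh` rather than `torch.linalg.eigvalsh` so that it runs without PyTorch, the class name `RollingUpdateSpectrum` is hypothetical, and the synthetic updates (two fixed directions plus small noise) are invented to show what a rank-2 stream looks like.

```python
from collections import deque
import numpy as np

class RollingUpdateSpectrum:
    """Keeps the last `window` flattened parameter deltas and exposes Gram eigenvalues."""

    def __init__(self, window=20):
        self.deltas = deque(maxlen=window)   # oldest delta is dropped automatically

    def push(self, delta_w):
        self.deltas.append(np.asarray(delta_w, dtype=np.float64).ravel())

    def top_eigenvalues(self, k=5):
        D = np.stack(self.deltas, axis=1)    # shape (P, W)
        eigs = np.linalg.eigvalsh(D.T @ D)   # W x W Gram, eigenvalues ascending
        return eigs[::-1][:k]                # largest first

# Synthetic stream: updates confined to two directions, plus tiny isotropic noise.
rng = np.random.default_rng(0)
basis = rng.standard_normal((2, 1000))
spec = RollingUpdateSpectrum(window=20)
for _ in range(40):
    coeffs = rng.standard_normal(2)
    noise = 1e-4 * rng.standard_normal(1000)
    spec.push(coeffs @ basis + noise)

sigma = spec.top_eigenvalues(k=5)
eigengap = sigma[1] / sigma[2]               # sigma_2 / sigma_3 in the paper's indexing
```

Working with the $W\times W$ Gram instead of the $P\times P$ one is what keeps the eigensolve cheap; on this synthetic stream the third eigenvalue sits at the noise floor, so the ratio is very large.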

### 2.3 Reproduction of Tian's Figure 3

[Figure 1](https://arxiv.org/html/2605.08119#S2.F1) reproduces Tian ([2025](https://arxiv.org/html/2605.08119#bib.bib5)) Figure 3 across 15 seeds: train accuracy reaches $1$ by epoch 25; test accuracy crosses $0.5$ at median epoch $102$ at $\eta=2\times 10^{-4}$ and never at $\eta=0$; $\|G_F\|$ peaks around epoch 50; the $\widetilde{F}^\top\widetilde{F}$ off-diagonal ratio remains below $0.04$ throughout (within Tian's $8\%$ bound).

![Refer to caption](https://arxiv.org/html/2605.08119v1/figures/headline_overlay.png)

Figure 1: Cross-seed median ($\pm$ std for accuracy and the level metric; IQR for the eigengap) on the headline 15-seed sweep. Top: test accuracy reproduction. Middle: the level metric $\rho_{\mathrm{tian}}$ rises in Stage II only in the grok condition. Bottom: $\sigma_2/\sigma_3$ on the rolling $\Delta W$ Gram (log scale) saturates post-grokking only in the grok condition. $N=15$ seeds per condition.

## 3 Theorem 6 verification across activations and seeds

### 3.1 Verification protocol

We compute $B=(\widetilde{F}^\top\widetilde{F}+\eta I)^{-1}$ exactly via the Woodbury identity,

$$B=\tfrac{1}{\eta}I-\tfrac{1}{\eta}\widetilde{F}^\top(\widetilde{F}\widetilde{F}^\top+\eta I)^{-1}\widetilde{F}, \tag{4}$$

which reduces the $K\times K=2048\times 2048$ inverse to an $n\times n=2016\times 2016$ inverse, computed in float64 on CPU (PyTorch 2.5 MPS does not support double-precision linear algebra).
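A quick numerical check of the Woodbury route is sketched below. It is our illustration, not the authors' code: it uses the standard form $B=\tfrac{1}{\eta}\bigl(I-\widetilde{F}^\top(\widetilde{F}\widetilde{F}^\top+\eta I)^{-1}\widetilde{F}\bigr)$, a hypothetical toy size ($n=12$, $K=40$), and $\eta=10^{-2}$ (larger than the paper's $2\times 10^{-4}$) so that the direct inverse used for comparison stays well conditioned.

```python
import numpy as np

def woodbury_B(F, eta):
    """B = (F^T F + eta I)^{-1}, computed through an n x n solve instead of a K x K inverse."""
    n, K = F.shape
    small = F @ F.T + eta * np.eye(n)                  # n x n system
    return (np.eye(K) - F.T @ np.linalg.solve(small, F)) / eta

rng = np.random.default_rng(0)
F = rng.standard_normal((12, 40))      # n < K, so the n x n system is the smaller one
eta = 1e-2                              # assumption: chosen for conditioning in this toy
B_direct = np.linalg.inv(F.T @ F + eta * np.eye(40))
B_woodbury = woodbury_B(F, eta)
max_abs_err = np.max(np.abs(B_direct - B_woodbury))
```

Using `np.linalg.solve` on the $n\times n$ system avoids forming an explicit inverse; the two results should agree to roundoff.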

For each checkpoint, we identify the top-200 most-similar unordered feature pairs $(j,\ell)$ via the cosine matrix $S_{j\ell}=\widetilde{f}_j^\top\widetilde{f}_\ell/(\|\widetilde{f}_j\|\,\|\widetilde{f}_\ell\|)$ and evaluate the Theorem 6 sign rule (equation ([1](https://arxiv.org/html/2605.08119#S1.E1))) on those pairs.
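Selecting the most-similar unordered pairs from the cosine matrix can be sketched as follows. The function name `top_similar_pairs` and the toy feature matrix with one planted near-duplicate column are assumptions made for illustration; the sketch also assumes no feature column is identically zero.

```python
import numpy as np

def top_similar_pairs(F, k=200):
    """Return the k unordered column pairs (j, l), j < l, with the largest cosine similarity."""
    norms = np.linalg.norm(F, axis=0)
    S = (F.T @ F) / np.outer(norms, norms)               # cosine matrix S_jl
    j_idx, l_idx = np.triu_indices(F.shape[1], k=1)      # each unordered pair exactly once
    sims = S[j_idx, l_idx]
    order = np.argsort(sims)[::-1][:k]                   # most similar first
    return [(int(j_idx[i]), int(l_idx[i]), float(sims[i])) for i in order]

rng = np.random.default_rng(0)
F = rng.standard_normal((50, 10))
F[:, 7] = F[:, 2] + 0.01 * rng.standard_normal(50)       # plant a near-duplicate feature
pairs = top_similar_pairs(F, k=5)
```

Restricting to the strict upper triangle removes the diagonal (self-similarity 1) and avoids counting each pair twice.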

Computing $P_{\eta,-j\ell}$ requires excluding columns $j$ and $\ell$ from $\widetilde{F}$ and recomputing the projector for each pair. We use the approximation $P_{\eta,-j\ell}\approx P_\eta:=I-\widetilde{F}(\widetilde{F}^\top\widetilde{F}+\eta I)^{-1}\widetilde{F}^\top$, which uses the full projector. A direct verification of the approximation on 10 pairs at epoch 175 (seed 0) shows that *the sign* of the residual similarity $\widetilde{f}_j^\top P_{\eta,-j\ell}\widetilde{f}_\ell$ is preserved in 10/10 pairs by the approximation, even though the magnitudes differ substantially (the full projector $P_\eta$ nearly annihilates $\widetilde{f}_\ell$ since $\widetilde{f}_\ell$ is in the column space of $\widetilde{F}$, while $P_{\eta,-j\ell}$ does not). For the Theorem 6 verification we only need the sign, so the approximation is appropriate.

### 3.2 Multi-seed result on $\sigma=x^2$

[Table 1](https://arxiv.org/html/2605.08119#S3.T1) reports the empirical sign-match across $n=5$ seeds at five checkpoints. Reproducibility is tight: IQR $\leq 0.015$ at every checkpoint, and the saturation epoch (sign-match $\geq 0.95$) coincides with the $\sigma_2/\sigma_3$ slope-fire epoch in every seed.

Table 1: Theorem 6 sign-match $\Pr[\operatorname{sgn}(B_{j\ell})=-\operatorname{sgn}(\widetilde{f}_j^\top P_\eta\widetilde{f}_\ell)]$ on top-200 most-similar feature pairs across $n=5$ seeds, at $\sigma=x^2$, $\eta=2\times 10^{-4}$, $M=71$, $K=2048$.

[Figure 2](https://arxiv.org/html/2605.08119#S3.F2) visualizes the progression with the slope-fire epoch overlaid.

![Refer to caption](https://arxiv.org/html/2605.08119v1/figures/multi_seed_thm6.png)

Figure 2: Theorem 6 sign-match across $n=5$ seeds with median, IQR, and individual seed traces. Saturation ($\geq 0.95$) at epoch 175 coincides with the $\sigma_2/\sigma_3$ slope-fire epoch (Section [4](https://arxiv.org/html/2605.08119#S4)).

### 3.3 Generalization to $\sigma=\mathrm{ReLU}$

We re-ran the headline 15-seed sweep with $\sigma(x)=\mathrm{ReLU}(x)$ (800 epochs, otherwise identical setup; ReLU groks at this $\eta$ but on a longer timescale: test accuracy reaches 0.99 at median epoch 530). We then re-ran the Theorem 6 verification on seed 0 at five checkpoints.

The sign rule continues to hold: the sign-match is 0.91 at epoch 100, 0.995 at epoch 300, and 1.000 at epoch 500.

ReLU saturates the sign rule even *faster* than $\sigma=x^2$ (reaching 1.000 by epoch 500 versus the asymptotic 0.985 for $x^2$ at epoch 300). Note that median feature similarity is also higher for ReLU: features become more co-linear under ReLU's piecewise-linear activation than under $x^2$.

#### Conclusion.

Theorem 6 is empirically verified and *activation-general*: the sign rule of equation ([1](https://arxiv.org/html/2605.08119#S1.E1)) holds whenever the network has progressed past Stage II, regardless of whether the activation is $x^2$ or ReLU.

## 4 Parameter-update spectral signature: rank-2 lock-in on $\sigma=x^2$

### 4.1 The detector

Define the rolling-window Gram of $\Delta W$,

$$\Delta:=[\Delta W_{t-W+1},\ldots,\Delta W_t]\in\mathbb{R}^{P\times W},\qquad \sigma_k(t):=\lambda_k\!\left(\Delta^\top\Delta\right), \tag{5}$$

where $P=2MK$ is the parameter dimension of $W$ and $W=20$ is the window size. The slope-based detector,

$$s(t):=\tfrac{1}{25}\bigl[\log(\sigma_2/\sigma_3)(t)-\log(\sigma_2/\sigma_3)(t-25)\bigr],\quad \text{``fire''}:=\min\{t\geq 100: s(t)>0.04\}, \tag{6}$$

identifies the moment $\sigma_2/\sigma_3$ enters its post-Stage-II rise. The restriction $t\geq 100$ excludes the initial-condition transient of the rolling window filling.
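Equation (6) translates almost directly into code. The sketch below is a plain-Python reading of it, assuming a per-epoch list of $\sigma_2/\sigma_3$ values indexed from epoch 0 and the natural logarithm; the function name `slope_fire_epoch` and the synthetic series (flat until epoch 150, then rising exponentially) are ours.

```python
import math

def slope_fire_epoch(ratio, lag=25, t_min=100, threshold=0.04):
    """First epoch t >= t_min where the lag-averaged slope of log(ratio) exceeds threshold.

    Returns None if the detector never fires.
    """
    log_ratio = [math.log(r) for r in ratio]
    for t in range(max(t_min, lag), len(log_ratio)):
        s = (log_ratio[t] - log_ratio[t - lag]) / lag
        if s > threshold:
            return t
    return None

# Synthetic eigengap traces (assumption: invented for illustration, not measured data).
rising = [1.0] * 150 + [math.exp(0.3 * (t - 150)) for t in range(150, 300)]
flat = [1.3] * 300

fire_rising = slope_fire_epoch(rising)
fire_flat = slope_fire_epoch(flat)
```

On the rising trace the slope reaches $0.3\cdot 4/25=0.048>0.04$ four epochs after the rise begins, so the detector fires at epoch 154; on the flat trace it returns `None`.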

### 4.2 Result on $\sigma=x^2$ (15 seeds, headline)

[Table 2](https://arxiv.org/html/2605.08119#S4.T2) summarizes. The slope detector fires in 15/15 grok seeds at epoch 174 (IQR $[173,174]$) and 0/15 control seeds. The late-stage magnitude separation between conditions is $229\times$ (grok median $\sigma_2/\sigma_3\approx 300$ vs control $\approx 1.31$ over epochs 200–400).

Table 2: Lock-in detector on $\sigma=x^2$: perfect specificity, tight timing, large magnitude separation.

### 4.3 Mechanism: rank-2 collapse

The mechanism behind the lock-in is direct rank reduction in the rolling $\Delta W$ spectrum. Inspecting raw eigenvalues at seed 0 ([Table 3](https://arxiv.org/html/2605.08119#S4.T3)): in the grok condition, $\sigma_2$ stabilizes near $5\times 10^{-3}$ between epochs 150 and 200 while $\sigma_3$ collapses by two orders of magnitude. The ratio $\sigma_2/\sigma_3$ jumps from $22$ to $212$. By epoch 300 it is $310$. The control ($\eta=0$) shows the opposite: $\sigma_1,\sigma_2,\sigma_3$ all collapse to $\sim 10^{-4}$ by epoch 250, leaving only isotropic numerical noise.

Table 3: Top-3 eigenvalues of the rolling-window ($W=20$) Gram of $\Delta W$ at seed 0. Rank-2 lock-in develops between epochs 150 and 200.

The connection to Theorem 6 is empirical and tight: across $n=5$ seeds, the slope-fire epoch ($174\pm 1$) coincides with the moment the sign-match $\Pr[\operatorname{sgn}(B_{j\ell})=-\operatorname{sgn}(\widetilde{f}_j^\top P_\eta\widetilde{f}_\ell)]$ jumps from 0.91 (epoch 100) to 0.955 (epoch 175). The interpretation: between epochs 150 and 200, redundant feature pairs have similar enough activations that $B_{j\ell}$ becomes negative on the top-similarity tail, generating repulsion strong enough to consolidate the redundant features into two persistent directions, which appears in the rolling-window spectrum of $\Delta W$ as rank-2 lock-in.

### 4.4 Window-size sensitivity

[Table 4](https://arxiv.org/html/2605.08119#S4.T4) reports a window-size sweep ($n=3$ seeds, both conditions) with $W\in\{5,10,20,30\}$. The $W=20$ choice is *load-bearing*: at smaller windows the slope detector misfires in the control condition (3/3 false positives at $W=5$; 3/3 false positives at $W=10$ with reversed specificity).

Table 4: Window-size sensitivity. The transition is $\sim 25$ epochs wide, so windows must be wide enough to average out single-step noise across it but small enough not to smear into Stage I/II.

### 4.5 Rank confirmation via $\sigma_4,\sigma_5$

For $W\leq 10$, $\sigma_3,\sigma_4,\sigma_5$ all collapse to $\sim 10^{-5}$ together, supporting the rank-2 framing exactly. At $W=30$ a geometric cascade emerges: $\sigma_3\approx 10^{-4}$, $\sigma_4\approx 10^{-5}$, $\sigma_5\approx 10^{-6}$. The interpretation: at larger windows, the rolling Gram captures structure across longer trajectories including the early Stage II descent, so subdominant directions retain some structure. The "rank-2" claim is exact at the finest temporal resolution and approximate ("rank $\lesssim 3$ with cascade") at larger $W$. [Figure 3](https://arxiv.org/html/2605.08119#S4.F3) plots the trajectories.

![Refer to caption](https://arxiv.org/html/2605.08119v1/figures/rank2_top5.png)

Figure 3: Top-5 eigenvalues of the rolling $\Delta W$ Gram at three window sizes. At small $W$, $\sigma_3,\sigma_4,\sigma_5$ collapse together to the noise floor while $\sigma_1,\sigma_2$ persist (rank-2). At $W=30$, the spectrum forms a geometric cascade.

### 4.6 Failure on $\sigma=\mathrm{ReLU}$

We re-ran the headline sweep with $\sigma=\mathrm{ReLU}$ (15 seeds, 800 epochs each, otherwise identical). [Figure 4](https://arxiv.org/html/2605.08119#S4.F4) compares the two activations on four panels: test accuracy, $\sigma_2/\sigma_3$, $\sigma_1/\sigma_2$, and $\rho_{\mathrm{tian}}$.

![Refer to caption](https://arxiv.org/html/2605.08119v1/figures/relu_comparison.png)

Figure 4: $\sigma=x^2$ (blue) vs $\sigma=\mathrm{ReLU}$ (green), medians across 15 seeds each. The rank-2 lock-in detector $\sigma_2/\sigma_3$ that gives perfect specificity on $\sigma=x^2$ fails on ReLU: separation drops from $229\times$ to $1.4\times$, slope-fire 0/15. The level metric $\rho_{\mathrm{tian}}$ fires at epoch 0 on ReLU because the ReLU initialization is already far from the lazy-regime form (Section [6](https://arxiv.org/html/2605.08119#S6)).

The contrast is stark: under $\sigma=\mathrm{ReLU}$ the slope detector fires in 0/15 grok seeds; the late-stage magnitude separation is $1.4\times$ rather than $229\times$. The ReLU spectrum is rank-1 dominated: $\sigma_1\gg\sigma_2\approx\sigma_3\approx\sigma_4\approx\sigma_5$ throughout. There is no rank-2 lock-in to detect.

#### Why?

Tian's Theorem 5 distinguishes *focused memorization* (power activations $\sigma(x)=x^2$, with $\sigma'(x)/x$ constant) from *spreading memorization* (ReLU, sigmoid, with $\sigma'(x)/x$ strictly decreasing). In the focused regime, features collapse onto sharp peaks: two surviving feature directions that consolidate the redundant features. In the spreading regime, features remain distributed across the hidden layer, with $\sigma_1$ dominating but no rank-2 substructure. The Theorem 6 sign rule still holds (Section [3](https://arxiv.org/html/2605.08119#S3)) because the structure of $B$ depends on $\widetilde{F}^\top\widetilde{F}$ alone, not on $\sigma'$. But the way Theorem 6 repulsion translates into parameter updates depends crucially on $\sigma'$, and the rank-2 spectral signature does not survive the change of activation regime.
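As a worked instance of the ratio that separates the two regimes, $\sigma'(x)/x$ can be evaluated directly for the two activations used here. For ReLU we restrict to $x>0$, where the derivative is 1; that restriction is our choice for the illustration.

$$\sigma(x)=x^2:\quad \frac{\sigma'(x)}{x}=\frac{2x}{x}=2,\qquad\qquad \sigma(x)=\mathrm{ReLU}(x),\ x>0:\quad \frac{\sigma'(x)}{x}=\frac{1}{x}.$$

The first ratio is constant in $x$, while the second strictly decreases as $x$ grows, which matches the focused/spreading split stated above.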

This is the central structure-vs-mechanism distinction the paper identifies: *Theorem 6's repulsion is general; its parameter-update spectral observable is specific to focused activations.*

## 5 $\eta$ sweep: scaling and the slow-grokking regime

### 5.1 $\eta\in\{10^{-5},5\times 10^{-5},10^{-4},2\times 10^{-4},5\times 10^{-4}\}$, 5 seeds

We sweep $\eta$ at $M=71$, $K=2048$, $\sigma=x^2$, 600 epochs, five seeds each ($n=25$). [Table 5](https://arxiv.org/html/2605.08119#S5.T5) summarizes.

Table 5: $\eta$ sweep summary, median values per cell.

### 5.2 Extended $\eta=10^{-5}$: the predicted slow regime

Tian ([2025](https://arxiv.org/html/2605.08119#bib.bib5)) predicts that the grokking timescale scales as $1/\eta$. To test this, we extend a single seed at $\eta=10^{-5}$ to 2000 epochs. The model groks: test accuracy crosses 0.5 at epoch 1094 and 0.99 at epoch 1527. At this slow $\eta$, the level metric $\rho_{\mathrm{tian}}$ crosses 0.075 at epoch 527, leading the test-accuracy 0.5 crossing by 567 epochs. The $1/\eta$ scaling is approximately preserved: extrapolating from $\eta=10^{-4}$ ($t_{\text{test}=0.99}\approx 200$), $\eta=10^{-5}$ should grok at $\approx 2000$ epochs; the observed value is 1527.
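The extrapolation and lead-time arithmetic above can be reproduced in a few lines. This sketch only restates numbers quoted in the text; the helper name `extrapolate_grok_epoch` is ours, and strict proportionality $t\propto 1/\eta$ is the assumption being tested, not an established fit.

```python
def extrapolate_grok_epoch(t_ref, eta_ref, eta_new):
    """Rescale a reference grokking epoch under the assumption t_grok ~ 1 / eta."""
    return t_ref * eta_ref / eta_new

predicted = extrapolate_grok_epoch(t_ref=200, eta_ref=1e-4, eta_new=1e-5)
observed = 1527                 # epoch where test accuracy reaches 0.99 at eta = 1e-5
ratio = observed / predicted    # how far the single-seed run lands from strict 1/eta
lead = 1094 - 527               # test-accuracy 0.5 crossing minus rho_tian threshold crossing
```

The observed epoch is about 76% of the strict $1/\eta$ prediction, and the 567-epoch lead is measured against the 0.5 accuracy crossing rather than the 0.99 one.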

The lock-in magnitude $\sigma_2/\sigma_3$ at peak is dramatically reduced in this slow regime: $\approx 25$ vs $\approx 300$ at $\eta=2\times 10^{-4}$. This is consistent with the rank-2 structure being incompletely developed when measurement ends: at $\eta=10^{-5}$ the model has just barely finished grokking and $\sigma_3$ has not collapsed as deeply. The late-stage magnitude is therefore $\eta$-dependent, but the *existence* of the slope-fire is preserved.

## 6 The level-metric initiation detector $\rho_{\mathrm{tian}}$: a tightly-scoped tool

A simpler signal, the off-diagonal level metric on the activation Gram,

$$\rho_{\mathrm{tian}}(t):=\frac{\left\|P_1^\perp FF^\top-\bigl(a(t)I+b(t)\,\mathbf{1}\mathbf{1}^\top\bigr)\right\|_F}{\left\|FF^\top\right\|_F}, \tag{7}$$

where $a(t),b(t)$ are the empirical diagonal/off-diagonal averages, appears predictive at the headline operating point: at $\eta=2\times 10^{-4}$ on $\sigma=x^2$, threshold $0.075$ separates 15/15 grok seeds (median fire epoch 17) from 15/15 control seeds (max $\rho_{\mathrm{tian}}=0.0645<0.075$), with lead time $-84$ epochs vs $t_{\text{test}=0.5}$.
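One possible reading of equation (7) is sketched below. The text does not spell out how $a(t)$ and $b(t)$ are computed, so this sketch assumes they are chosen such that $aI+b\mathbf{1}\mathbf{1}^\top$ matches the mean diagonal and mean off-diagonal entry of $P_1^\perp FF^\top$; the function name `rho_tian` and the toy inputs are likewise ours.

```python
import numpy as np

def rho_tian(F):
    """Relative distance of P1_perp F F^T from the closest a*I + b*11^T (assumed fit)."""
    n = F.shape[0]
    G = F @ F.T                                        # n x n activation Gram
    centered = G - G.mean(axis=0, keepdims=True)       # P1_perp @ G
    diag_mean = np.trace(centered) / n
    off_mean = (centered.sum() - np.trace(centered)) / (n * (n - 1))
    b = off_mean                                       # off-diagonal entries of a*I + b*11^T
    a = diag_mean - off_mean                           # diagonal entries equal a + b
    baseline = a * np.eye(n) + b * np.ones((n, n))
    return np.linalg.norm(centered - baseline) / np.linalg.norm(G)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((20, 6)))      # orthonormal columns
rho_isotropic = rho_tian(Q.T)                          # rows orthonormal: F F^T = I
rho_random = rho_tian(rng.standard_normal((6, 20)))
```

When $FF^\top$ is a multiple of the identity the metric vanishes, which is the sense in which it measures departure from an isotropic baseline; a generic random $F$ gives a strictly positive value.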

The $\eta$ sweep ([Table 5](https://arxiv.org/html/2605.08119#S5.T5)) preserves this picture: positive lead in 20/20 grokking runs at $\sigma=x^2$, range 79–154 epochs.

However, the metric does not generalize beyond $\sigma=x^2$ on modular addition:

- $M\times p$ scaling sweep. On a 60-run sweep ($M\in\{41,71,127\}$, $p\in\{0.1,0.2,0.3,0.5\}$, 5 seeds), $\rho_{\mathrm{tian}}\geq 0.075$ fires in 60/60 runs at near-constant epoch ($\sim 11$–$25$), including all cells that fail to grok within the training budget. Fire epoch is decoupled from grokking outcome. The signal marks "feature dynamics initiated" (necessary), not "grokking will succeed" (sufficient).
- $\sigma=\mathrm{ReLU}$. $\rho_{\mathrm{tian}}$ fires at epoch 0 in 15/15 grok seeds, because the ReLU initialization is already far from the form $aI+b\mathbf{1}\mathbf{1}^\top$. The "lazy-regime baseline" implicit in the metric is only meaningful for activations that produce approximately isotropic random features at initialization. ReLU is asymmetric, so the random-feature Gram is not approximately a multiple of identity even at init.

We retain $\rho_{\mathrm{tian}}$ in the paper as a tightly-scoped diagnostic: it works for activations producing approximately isotropic random features at init, fires at a near-constant Stage I escape timescale, and provides positive (but not specific to grokking) predictive content. It does not survive activation changes, and a careful analysis of *what* $\rho_{\mathrm{tian}}$ is measuring across activation regimes is beyond the scope of this paper.

## 7Discussion

#### Structure vs mechanism\.

The paper’s central finding is the dissociation between Theorem 6’s underlying mechanism \(the sign rule onBB’s off\-diagonals\) and its parameter\-update spectral signature \(the rank\-2 lock\-in ofΔW\\Delta W\)\. Both are present atσ=x2\\sigma=x^\{2\}; only the mechanism is present atσ=ReLU\\sigma=\\mathrm\{ReLU\}\. The mechanism depends on the structure ofF~⊤F~\\widetilde\{F\}^\{\\top\}\\widetilde\{F\}, which feeds intoBBwithout explicit dependence onσ\\sigma\. The signature, however, depends on*how the feature Gram’s structure translates into parameter updates*, which involvesσ′\\sigma^\{\\prime\}and is therefore activation\-dependent\.

This connects to Tian ([2025](https://arxiv.org/html/2605.08119#bib.bib5)) Theorem 5, which formally distinguishes focused memorization (power activations: features collapse onto sharp peaks) from spreading memorization (ReLU/sigmoid: features remain distributed). Our empirical observation is the spectral side of this distinction: focused memorization produces rank-2 lock-in in $\Delta W$ because Theorem 6 repulsion has a small set of feature directions to consolidate onto; spreading memorization does not, because there are no sharp peaks for repulsion to consolidate around.
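The mechanism side of this dissociation, the sign rule on the off-diagonals of $B = (\widetilde{F}^{\top}\widetilde{F} + \eta I)^{-1}$, can be checked with a few lines of NumPy. The sketch below is a minimal illustration, not the paper's evaluation code: the function name `sign_match_rate`, the use of cosine similarity to rank "most-similar" pairs, and the assumption that the input features are already centred and have nonzero columns are all choices made here.

```python
import numpy as np

def sign_match_rate(features: np.ndarray, eta: float, top_k: int) -> float:
    """Fraction of the top_k most-similar feature pairs (j, l) with B[j, l] < 0.

    features: shape (n_samples, K); assumed already centred, nonzero columns.
    Similarity is cosine similarity between columns (an assumption).
    """
    k = features.shape[1]
    gram = features.T @ features
    b = np.linalg.inv(gram + eta * np.eye(k))
    norms = np.linalg.norm(features, axis=0)
    cosine = gram / np.outer(norms, norms)
    rows, cols = np.triu_indices(k, k=1)  # each unordered pair once
    order = np.argsort(-cosine[rows, cols])[:top_k]
    return float(np.mean(b[rows[order], cols[order]] < 0))

# Toy example: features 0 and 1 are nearly parallel, feature 2 is orthogonal.
theta = 0.2
features = np.array([
    [1.0, np.cos(theta), 0.0],
    [0.0, np.sin(theta), 0.0],
    [0.0, 0.0, 1.0],
])
print(sign_match_rate(features, eta=0.1, top_k=1))  # 1.0
```

In the toy example the Gram matrix is block-diagonal, so the entry for the similar pair is $-c / ((1+\eta)^{2} - c^{2})$ with $c = \cos\theta$, which is negative as the sign rule predicts.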

#### Methodological lessons.

- The lock-in detector requires a rolling-window size $W \in \{20, 30\}$. Smaller windows produce false positives in the no-grokking control. The transition is $\sim 25$ epochs wide, so the window must average over this without smearing into Stage I/II. Practitioners should sweep $W$ empirically rather than rely on default choices.
- The rank-2 framing is exact at small $W$ (where $\sigma_{3}, \sigma_{4}, \sigma_{5}$ collapse together) and approximate at $W = 30$ (geometric cascade). Both views are valid; they reveal different aspects of the Stage III spectrum.
- The $P_{\eta,-j\ell} \approx P_{\eta}$ approximation in the Theorem 6 verification preserves *sign* (10/10 pairs we checked exactly), but not magnitude. For the sign rule of equation ([1](https://arxiv.org/html/2605.08119#S1.E1)) the approximation is sufficient; for any quantitative claim about $\lvert B_{j\ell} \rvert$ the exact projector should be used.
- Spectral metrics on $\Delta W$ are activation-specific. $\sigma_{1}/\sigma_{2}$ would be the natural rank-1 detector for ReLU, analogous to $\sigma_{2}/\sigma_{3}$ for $x^{2}$; we have not optimized the ReLU detector here.
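For readers who want to experiment with window sizes, the following is a minimal sketch of an eigengap-plus-slope detector in the spirit of the one discussed above. Several details are assumptions rather than the paper's specification: $\Delta W$ is taken as the difference between weight snapshots `window` epochs apart, the slope is a simple finite difference over `lag` steps, and the names `eigengap_series`, `first_fire`, and `slope_threshold` are hypothetical.

```python
import numpy as np

def eigengap_series(snapshots: np.ndarray, window: int, eps: float = 1e-12) -> np.ndarray:
    """sigma_2 / sigma_3 of W_t - W_{t-window} for each t >= window.

    snapshots: shape (n_epochs, rows, cols), one weight matrix per epoch,
    with min(rows, cols) >= 3. Entry i of the result corresponds to
    epoch i + window. Defining Delta W this way is an assumption.
    """
    gaps = []
    for t in range(window, snapshots.shape[0]):
        delta_w = snapshots[t] - snapshots[t - window]
        s = np.linalg.svd(delta_w, compute_uv=False)  # descending order
        gaps.append(s[1] / (s[2] + eps))  # eps guards an exactly rank-2 update
    return np.asarray(gaps)

def first_fire(series: np.ndarray, slope_threshold: float, lag: int = 1):
    """Index of the first point whose average rise per step over `lag`
    steps reaches `slope_threshold`, or None if it never does."""
    for t in range(lag, len(series)):
        if (series[t] - series[t - lag]) / lag >= slope_threshold:
            return t
    return None

# Synthetic check: a constant update diag(3, 2, 1) per epoch gives a flat gap of 2.
snapshots = np.stack([t * np.diag([3.0, 2.0, 1.0]) for t in range(5)])
print(eigengap_series(snapshots, window=2))

# A series that jumps fires at the jump; a flat one never fires.
print(first_fire(np.array([1.0, 1.0, 1.0, 1.0, 5.0, 9.0]), slope_threshold=2.0))  # 4
print(first_fire(np.ones(10), slope_threshold=0.5))  # None
```

Sweeping `window` over several values and comparing fire indices between grokking runs and no-grokking controls is the kind of empirical check the first bullet recommends; for a rank-1 (ReLU-style) detector the ratio would be `s[0] / s[1]` instead.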

#### Open empirical question.

The level-metric initiation detector $\rho_{\mathrm{tian}}$ fires at near-constant epoch $\sim 11$–$25$ across the $M \times p$ scaling sweep, regardless of $M$, $p$, or whether grokking succeeds. This invariance is unexplained. A natural mechanistic guess: the timescale is set by the convergence of the top layer $V$ to its ridge solution per Tian’s Lemma 1, which depends on the spectrum of $\widetilde{F}\widetilde{F}^{\top}$ but may be approximately $M$-invariant when $K \gg M$. We have not verified this directly. The question of *why* Stage I escape happens on a roughly fixed $\sim 17$-epoch timescale across $(M, p)$ is left as a follow-up.
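To make the guess above concrete, here is the standard gradient-flow calculation for a ridge objective. It is a sketch under assumptions that are not verified in this paper: the top layer is taken to minimize $\tfrac{1}{2}\lVert \widetilde{F}V - \widetilde{Y} \rVert_F^{2} + \tfrac{\eta}{2}\lVert V \rVert_F^{2}$ with $\widetilde{F}$ held fixed (the target symbol $\widetilde{Y}$ is introduced here for illustration), and plain gradient flow stands in for Adam, which the experiments actually use.

$$
V^{\star} = \bigl(\widetilde{F}^{\top}\widetilde{F} + \eta I\bigr)^{-1}\widetilde{F}^{\top}\widetilde{Y},
\qquad
\frac{dV}{dt} = -\bigl(\widetilde{F}^{\top}\widetilde{F} + \eta I\bigr)\bigl(V - V^{\star}\bigr)
\;\Longrightarrow\;
V(t) - V^{\star} = e^{-(\widetilde{F}^{\top}\widetilde{F} + \eta I)\,t}\,\bigl(V(0) - V^{\star}\bigr).
$$

Along an eigenvector of $\widetilde{F}^{\top}\widetilde{F}$ with eigenvalue $\lambda_i$, the error therefore decays at rate $\lambda_i + \eta$, and the nonzero $\lambda_i$ coincide with the nonzero eigenvalues of $\widetilde{F}\widetilde{F}^{\top}$. Under these assumptions, an $M$-invariant escape epoch would require the relevant $\lambda_i + \eta$ to be roughly $M$-invariant, which is the part that remains untested.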

#### Limits.

Single hidden width $K = 2048$. Single optimizer (Adam). We did not test the boundary between memorization and generalization solutions (Tian ([2025](https://arxiv.org/html/2605.08119#bib.bib5)) Theorem 5 on focused memorization). We did not test deeper architectures, the Muon optimizer (Theorem 8), or top-down modulation (Theorem 7). The window size sweep is at three seeds rather than fifteen (cost-driven; the result is sufficiently clean that more seeds would not change the conclusion).

#### Connection to prior spectral-edge work.

Xu ([2026](https://arxiv.org/html/2605.08119#bib.bib6)) identified a low-dimensional execution manifold in attention-based grokking models, with commutator defects orthogonal to the manifold and growing $10$–$1000\times$ during the grokking transition. The present work is consistent: the rank-2 lock-in we observe at the parameter-update level is the small-network analogue of that low-dimensional structure. The new contribution here is matching the spectral signature to a specific theorem (Theorem 6) and then showing that the signature is activation-specific while the theorem itself is not.

## References

- Kumar et al. [2024] Tanishq Kumar, Blake Bordelon, Samuel J Gershman, and Cengiz Pehlevan. Grokking as the transition from lazy to rich training dynamics. *arXiv preprint arXiv:2310.06110*, 2024.
- Liu et al. [2022] Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J Michaud, Max Tegmark, and Mike Williams. Omnigrok: Grokking beyond algorithmic data. *arXiv preprint arXiv:2210.01117*, 2022.
- Nanda et al. [2023] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. *arXiv preprint arXiv:2301.05217*, 2023.
- Power et al. [2022] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. In *ICLR 2022 Workshop on MATH-AI*, 2022. URL [https://arxiv.org/abs/2201.02177](https://arxiv.org/abs/2201.02177).
- Tian [2025] Yuandong Tian. Provable scaling laws of feature emergence from learning dynamics of grokking. *arXiv preprint arXiv:2509.21519*, 2025. URL [https://arxiv.org/abs/2509.21519](https://arxiv.org/abs/2509.21519).
- Xu [2026] Yongzhong Xu. Low-dimensional and transversely curved optimization dynamics in grokking. *arXiv preprint arXiv:2602.16746*, 2026.