Distributional Spectral Diagnostics for Localizing Grokking Transitions

arXiv cs.LG Papers

Summary

This paper proposes distributional spectral diagnostics to localize grokking transitions in Transformer models before test accuracy rises. It uses empirical distributions and Hankel dynamic mode decomposition to create a monitoring signal that discriminates between grokking and non-grokking runs.

arXiv:2605.08237v1 Announce Type: new Abstract: In grokking, a model first fits the training data while test accuracy remains low, and only later begins to generalize. We ask whether this transition can be localized from observed training trajectories before the test accuracy rises, and formulate grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead-time trade-off. Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and analyzed by Hankel dynamic mode decomposition (DMD); the resulting reconstruction residual, together with spectrum and effective rank, forms the diagnostic output. On held-out modular-addition Transformer runs, the residual achieves AUROC \(\approx \) 0.93 for grokking-vs-non-grokking discrimination at the run level; under a fixed sustained-threshold operating rule, true-positive alarms can precede onset, with lead time reported jointly with false-alarm rate and uncertainty intervals. Perturbation experiments show that, in the tested \(wd=1\) pool, high-residual windows exhibit about \(3\times\) larger short-horizon perturbation deviation than low-residual windows. In a same-data norm-window control, perturbation sensitivity aligns with the residual ordering rather than total-parameter-norm ordering, suggesting that the residual is not merely a total-norm proxy at the window level in the studied \(wd=1\) dynamics. Norm signals remain strong run-level regime indicators, and log-probability performs best among the observables tested under the current protocol. We position the residual as a window-level monitoring and localization signal in the studied modular-arithmetic Transformer settings, not a universal early-warning predictor or an intervention rule.

# Distributional Spectral Diagnostics for Localizing Grokking Transitions
Source: [https://arxiv.org/html/2605.08237](https://arxiv.org/html/2605.08237)
Ziyue Wang¹, Yufeng Ying², Takafumi Kanamori¹
¹Institute of Science Tokyo, ²University of Science and Technology of China

###### Abstract

In grokking, a model first fits the training data while test accuracy remains low, and only later begins to generalize. We ask whether this transition can be localized from observed training trajectories before the test accuracy rises, and formulate grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead-time trade-off. Task-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and analyzed by Hankel dynamic mode decomposition (DMD); the resulting reconstruction residual, together with spectrum and effective rank, forms the diagnostic output. On held-out modular-addition Transformer runs, the residual achieves AUROC \(\approx 0.93\) for grokking-vs-non-grokking discrimination at the run level; under a fixed sustained-threshold operating rule, true-positive alarms can precede onset, with lead time reported jointly with false-alarm rate and uncertainty intervals. Perturbation experiments show that, in the tested \(wd=1\) pool, high-residual windows exhibit about \(3\times\) larger short-horizon perturbation deviation than low-residual windows. In a same-data norm-window control, perturbation sensitivity aligns with the residual ordering rather than total-parameter-norm ordering, suggesting that the residual is not merely a total-norm proxy at the window level in the studied \(wd=1\) dynamics. Norm signals remain strong run-level regime indicators, and log-probability performs best among the observables tested under the current protocol. We position the residual as a window-level monitoring and localization signal in the studied modular-arithmetic Transformer settings, not a universal early-warning predictor or an intervention rule.

## 1 Introduction

Grokking is a striking failure mode of standard training summaries: on certain algorithmic tasks, a model fits the training set early yet generalizes only after a long delay, with test-accuracy curves remaining nearly flat throughout the intervening period (Power et al., [2022](https://arxiv.org/html/2605.08237#bib.bib21)). The flatness is the difficulty: loss and accuracy alone do not directly indicate when the transition will occur, even in runs that eventually grok. We address a narrow question motivated by this gap: can transition windows be localized from observed training trajectories, using generic distributional observables and an explicit threshold/FPR/lead-time diagnostic protocol? This is a different question from explaining *why* grokking occurs.

A growing body of work explains grokking through mechanism. Mechanistic interpretability identifies emergent computational circuits (Nanda et al., [2023](https://arxiv.org/html/2605.08237#bib.bib34); Varma et al., [2023](https://arxiv.org/html/2605.08237#bib.bib2)); implicit-bias analyses attribute the transition to late-phase norm minimization on the zero-loss manifold (Liu et al., [2022](https://arxiv.org/html/2605.08237#bib.bib4); Lyu et al., [2024](https://arxiv.org/html/2605.08237#bib.bib5); Musat, [2026](https://arxiv.org/html/2605.08237#bib.bib22)); stability-based accounts link grokking to logit scaling and softmax collapse (Thilak et al., [2022](https://arxiv.org/html/2605.08237#bib.bib6); Prieto et al., [2025](https://arxiv.org/html/2605.08237#bib.bib51)). Norm-growth, logit-scaling, AGOP-based feature emergence (Radhakrishnan et al., [2024](https://arxiv.org/html/2605.08237#bib.bib3)), and circuit-formation views are the natural reference points for our work. Our goal is complementary: a thresholded localization signal computed from task-dependent observables and reported with explicit false-alarm/lead-time trade-offs, intended to flag transition windows rather than explain them.

We summarize a chosen task-dependent observable \(o_t\) at each training step as an empirical distribution \(\mu_t\). Because the diagnostic is computed from the chosen output distribution rather than from hidden-unit coordinates, it does not depend on hidden-unit indexing. Wasserstein/quantile coordinates convert \(\mu_t\) into a vector observation \(z_t\), and windowed Hankel dynamic mode decomposition (DMD) provides a local dynamical approximation (Schmid, [2010](https://arxiv.org/html/2605.08237#bib.bib23); Tu et al., [2014](https://arxiv.org/html/2605.08237#bib.bib15); Arbabi and Mezić, [2017](https://arxiv.org/html/2605.08237#bib.bib16)). The reconstruction residual \(\mathrm{Res}^{(r)}\) is the primary diagnostic; the spectrum and effective rank \(r_{0.99}\) are auxiliary descriptors interpretable mainly in low-residual windows. Our primary setting takes the observable from the empirical distribution of correct-answer log-probabilities on a fixed probe set in Transformers trained on modular addition; secondary observables and FCN comparisons appear later as scope checks.

Empirically, the residual rises around grokking transitions in the modular-addition Transformer setting. On a held-out test split with fresh seeds, the residual gives nontrivial run-level detection behavior under an explicit threshold/FPR/lead-time trade-off (precise numbers in §[3.2](https://arxiv.org/html/2605.08237#S3.SS2) and Table [3](https://arxiv.org/html/2605.08237#S3.T3)). A perturbation experiment shows that high-residual windows exhibit larger short-horizon deviation than low-residual windows under matched noise. A same-data norm-window control re-labels the same runs by total-parameter-norm percentile and produces an opposite fragility ordering, suggesting that the residual is not merely a total-norm proxy at the window level in the studied \(wd=1\) dynamics; norm signals nevertheless remain strong run-level regime indicators. An observable ablation finds log-probability to be the best-performing observable among those tested under the current protocol.

#### Contributions.

- (i) We propose a windowed distributional diagnostic for training dynamics. The method maps task-dependent observables to empirical distributions, represents them by Wasserstein/quantile coordinates, and applies Hankel-DMD to compute spectrum, effective rank, and reconstruction residual.
- (ii) We evaluate the reconstruction residual as a transition-localization signal for grokking. Paired with a sustained-threshold operating rule, it gives held-out detection behavior under an explicit threshold/FPR/lead-time trade-off; log-probability is the best-performing observable among those we tested under the current protocol.
- (iii) We provide perturbation-based evidence that high-residual windows correspond to fragile training periods, supporting the residual as a monitoring/localization signal rather than an intervention rule.
- (iv) We evaluate scope and boundaries through model-scale, task-family, norm-baseline, AGOP, intervention, CIFAR-10, and FCN checks; these are presented as scope checks, not universal robustness claims.

#### Scope of claims.

We do not claim a universal predictor of grokking, an architecture-independent diagnostic, or an automatic intervention rule. We do not claim the residual replaces norm-based regime classifiers, nor that perturbation alignment establishes causal mechanism. Our claim is narrower: in the studied modular-arithmetic Transformer settings, the reconstruction residual is a window-level cue for transition localization and fragility monitoring, evaluated under an explicit threshold/FPR/lead-time trade-off. Extended comparisons to mechanistic, spectral, Koopman/DMD, and Wasserstein training-dynamics diagnostics are deferred to Appendix [U](https://arxiv.org/html/2605.08237#A21).

## 2 Method

The pipeline has three stages: (i) summarize a chosen task-dependent observable as an empirical distribution; (ii) embed each distribution in a Hilbert coordinate via the Wasserstein–quantile representation; (iii) analyze the resulting vector-valued trajectory over fixed step windows by Hankel-DMD and read off a small set of windowed quantities. The pipeline is observable-dependent: the choice of \(o_t\) determines what the diagnostic can detect. Figure [1](https://arxiv.org/html/2605.08237#S2.F1) summarizes the overall diagnostic pipeline, and Table [1](https://arxiv.org/html/2605.08237#S2.T1) lists the main notation.

Table 1: Notation used in the distributional Hankel-DMD diagnostic.

### 2.1 Observable and Wasserstein–quantile coordinates

#### Observable and distributional state.

For Transformers on modular addition (primary setting), we fix a probe set \(\mathcal{P}=\{(x_i, y_i^\star)\}_{i=1}^{M}\) (\(M=100\) examples), where \(y_i^\star\) is the correct-answer token for input \(x_i\). At training step \(t\) the per-sample observable is the scalar correct-answer log-probability \(o_{t,i}=\log p_{\theta_t}(y_i^\star \mid x_i)\), and the distributional state is the empirical distribution of these \(M\) scalars:

\(\mu_t := \frac{1}{M}\sum_{i=1}^{M}\delta_{o_{t,i}}.\)  (1)

The diagnostic therefore tracks the empirical distribution of correct-answer log-probabilities over a fixed probe set; an averaged loss or accuracy collapses this distribution to a single scalar. Because the construction uses the chosen output distribution rather than hidden-unit coordinates, it does not depend on hidden-unit indexing. Wasserstein/quantile coordinates provide the representation: they convert distribution-valued states into vector-valued observations. Hankel-DMD provides the local dynamical approximation: it analyzes how these vectors evolve over short training windows. FCN observables, used as a secondary low-residual descriptor, are defined in Appendix [T](https://arxiv.org/html/2605.08237#A20).
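To make Eq. (1) concrete, the following sketch computes the \(M\) probe-set scalars \(o_{t,i}\) from raw next-token logits at one checkpoint. The array shapes and names (`logits`, `y_star`) are our own illustrative assumptions, not the paper's code:

```python
import numpy as np

def logprob_observable(logits, y_star):
    """Correct-answer log-probabilities o_{t,i} = log p(y*_i | x_i)
    for one checkpoint, from raw logits on the probe set.

    logits : (M, V) array of next-token logits for the M probe examples
    y_star : (M,) array of correct-answer token ids
    Returns the M scalars whose empirical distribution is mu_t."""
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_z = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_probs = shifted - log_z
    return log_probs[np.arange(len(y_star)), y_star]

# toy checkpoint: M = 4 probe examples, vocabulary of size 3
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 2.0],
                   [1.0, 1.0, 1.0]])
y_star = np.array([0, 1, 2, 0])
o_t = logprob_observable(logits, y_star)
```

The empirical measure \(\mu_t\) is then just this sample of \(M\) scalars; no density estimate is needed because the next stage only uses quantiles.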

#### Wasserstein–quantile coordinate.

For one-dimensional measures with finite second moment, the quantile map is a global isometry between \((\mathcal{W}_2(\mathbb{R}), d_W)\) and a closed convex subset of \(L^2(0,1)\) (Villani, [2009](https://arxiv.org/html/2605.08237#bib.bib1)): at a fixed reference \(\mu^\star\), \(\log_{\mu^\star}(\mu)=F_{\mu}^{-1}\circ F_{\mu^\star}-\mathrm{id}\) identifies the Wasserstein tangent space with a Hilbert subspace. We evaluate \(F_{\mu_t}^{-1}\) on a fixed quantile grid \(p_1,\dots,p_d\) (\(d=19\) levels, \(0.05\)–\(0.95\)):

\(z_t = \bigl(F_{\mu_t}^{-1}(p_1),\,\dots,\,F_{\mu_t}^{-1}(p_d)\bigr) \in \mathbb{R}^{d}.\)  (2)

Multi-dimensional analogues require embeddings (kernel mean embeddings, MDS) that do not preserve Wasserstein geometry; we therefore restrict to one-dimensional task-dependent observables. Full Wasserstein background, including the Hadamard structure of \(\mathcal{W}_2(D)\), is in Appendix [B](https://arxiv.org/html/2605.08237#A2).
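A minimal sketch of the coordinate in Eq. (2), assuming the empirical quantile function is evaluated with NumPy's default linear interpolation (the paper does not state its interpolation rule):

```python
import numpy as np

def quantile_coordinate(samples, d=19, lo=0.05, hi=0.95):
    """Map an empirical distribution mu_t (given by its samples) to the
    Wasserstein/quantile coordinate z_t in R^d: the empirical quantile
    function F^{-1} evaluated on the fixed grid p_1, ..., p_d of Eq. (2)."""
    grid = np.linspace(lo, hi, d)        # 0.05, 0.10, ..., 0.95
    return np.quantile(samples, grid)

rng = np.random.default_rng(0)
samples = rng.normal(size=100)           # stand-in for the M = 100 log-probs
z_t = quantile_coordinate(samples)
```

Because quantile functions are monotone, each \(z_t\) is a nondecreasing vector; the grid midpoint (\(p=0.5\)) recovers the sample median.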

### 2.2 Windowed Hankel-DMD diagnostics

#### Delay state and snapshot matrices.

Over a step window of length \(m+q\) we form delay-embedded vectors

\(\xi_t = \bigl[z_t^\top,\, z_{t+1}^\top,\, \ldots,\, z_{t+q-1}^\top\bigr]^\top \in \mathbb{R}^{qd},\)  (3)

and Hankel snapshot matrices \(H_{-} = [\xi_0\,\cdots\,\xi_{m-1}]\), \(H_{+} = [\xi_1\,\cdots\,\xi_m] \in \mathbb{R}^{qd \times m}\).
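The delay embedding and the shifted snapshot pair of Eq. (3) can be sketched as follows; the toy window sizes are our own choice:

```python
import numpy as np

def hankel_snapshots(Z, q):
    """Build delay-embedded states xi_t = [z_t; ...; z_{t+q-1}] in R^{qd}
    (Eq. 3) and the shifted snapshot pair H_-, H_+ from a window whose
    rows are the coordinates z_0, ..., z_{m+q-1}."""
    n = len(Z)
    m = n - q                                  # number of snapshot columns
    # column t of Xi stacks q consecutive coordinates starting at step t
    Xi = np.column_stack([Z[t:t + q].ravel() for t in range(m + 1)])
    return Xi[:, :m], Xi[:, 1:m + 1]           # H_-, H_+

Z = np.arange(24.0).reshape(8, 3)              # toy window: 8 steps, d = 3
H_minus, H_plus = hankel_snapshots(Z, q=2)     # qd = 6, m = 6
```

By construction the pair is one-step shifted: dropping the last column of \(H_{+}\) and the first column of \(H_{-}\) gives the same matrix.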

#### Hankel-DMD estimator.

A Koopman/DMD approximation (Schmid, [2010](https://arxiv.org/html/2605.08237#bib.bib23); Tu et al., [2014](https://arxiv.org/html/2605.08237#bib.bib15); Arbabi and Mezić, [2017](https://arxiv.org/html/2605.08237#bib.bib16); Drmač et al., 2017) solves the ordinary least-squares problem

\(A_H^{\star} = \arg\min_{A} \bigl\|H_{+} - A H_{-}\bigr\|_F^2,\)  (4)

and we then truncate the fitted operator to rank \(r\) via the leading \(r\) eigenpairs \((\lambda_j, w_j)_{j=1}^{r}\) of \(A_H^{\star}\). The rank-\(r\) DMD reconstruction is \(\widehat{\xi}_t^{(r)} = W \Lambda^t b\), where \(b = W^{\dagger} \xi_0\). Snapshot construction details, the reduced-rank projection, and a discussion of non-normal sensitivity are in Appendix [C](https://arxiv.org/html/2605.08237#A3).
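A minimal implementation sketch of the estimator in Eq. (4), verified on an exactly linear toy trajectory; the paper's reduced-rank SVD projection and non-normality caveats (Appendix C) are omitted here:

```python
import numpy as np

def hankel_dmd(H_minus, H_plus, r):
    """Fit A* = argmin_A ||H_+ - A H_-||_F^2 (Eq. 4), keep the r leading
    eigenpairs, and form the rank-r reconstruction xi_hat_t = W Lambda^t b
    with b = W^+ xi_0."""
    # lstsq solves the transposed system (H_-)^T X = (H_+)^T, so A = X^T
    A = np.linalg.lstsq(H_minus.T, H_plus.T, rcond=None)[0].T
    lam, W = np.linalg.eig(A)
    keep = np.argsort(-np.abs(lam))[:r]        # r largest-magnitude modes
    lam, W = lam[keep], W[:, keep]
    b = np.linalg.pinv(W) @ H_minus[:, 0]      # mode amplitudes from xi_0
    recon = np.column_stack([W @ (lam ** t * b)
                             for t in range(H_minus.shape[1])])
    return A, lam, np.real(recon)

# toy linear system xi_{t+1} = A_true xi_t: DMD should recover it exactly
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
traj = np.column_stack([np.linalg.matrix_power(A_true, t) @ np.array([1.0, 1.0])
                        for t in range(8)])
A_hat, lam, recon = hankel_dmd(traj[:, :-1], traj[:, 1:], r=2)
```

On this exactly linear example the fitted operator matches \(A_{\mathrm{true}}\) and the rank-2 reconstruction is exact; on real trajectories the mismatch is precisely what the residual of Eq. (5) measures.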

#### Reconstruction residual.

\(\mathrm{Res}^{(r)} := \dfrac{\bigl(\sum_t \|\xi_t - \widehat{\xi}_t^{(r)}\|_2^2\bigr)^{1/2}}{\bigl(\sum_t \|\xi_t\|_2^2\bigr)^{1/2}}.\)  (5)

A small \(\mathrm{Res}^{(r)}\) indicates the windowed trajectory admits an accurate low-rank linear description in the chosen coordinates; a large value indicates a departure from that regime.
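Eq. (5) in code, with `Xi` holding the windowed states column-wise; the toy values are illustrative:

```python
import numpy as np

def reconstruction_residual(Xi, Xi_hat):
    """Relative residual Res^(r) of Eq. (5): root-sum-of-squares error of
    the rank-r reconstruction, normalized by the trajectory's own norm."""
    return np.sqrt(((Xi - Xi_hat) ** 2).sum()) / np.sqrt((Xi ** 2).sum())

Xi = np.array([[1.0, 2.0],
               [3.0, 4.0]])                  # columns are the states xi_t
```

The normalization makes the residual scale-free: a perfect reconstruction gives 0 and an all-zero reconstruction gives 1, which is what makes a relative threshold rule (\(\tau \times\) baseline) meaningful across runs.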

#### Effective rank.

With \(H_{-} = U \Sigma V^\top\) and singular values \(\sigma_1 \geq \sigma_2 \geq \cdots\),

\(r_{0.99} := \min\Bigl\{r \geq 1 :\ \tfrac{\sum_{i=1}^{r}\sigma_i^2}{\sum_i \sigma_i^2} \geq 0.99\Bigr\}.\)  (6)
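Eq. (6) as a short function; the near-rank-1 example below is a sanity check of our own:

```python
import numpy as np

def effective_rank(H_minus, energy=0.99):
    """r_0.99 of Eq. (6): the smallest r whose leading singular values
    carry at least 99% of the squared-singular-value mass of H_-."""
    s = np.linalg.svd(H_minus, compute_uv=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(frac, energy) + 1)

# near-rank-1 matrix: one dominant direction, so r_0.99 should be 1
H = np.outer(np.ones(5), np.arange(1.0, 7.0)) + 1e-8 * np.eye(5, 6)
```

An isotropic matrix with equal singular values is the opposite extreme: every direction carries equal energy, so \(r_{0.99}\) equals the full dimension.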

#### Validity gate.

As summarized in Figure [1](https://arxiv.org/html/2605.08237#S2.F1), the residual serves as a validity gate for the auxiliary descriptors. In low-residual windows, the spectrum of \(A_H^{\star}\) and \(r_{0.99}\) can be read as empirical summaries of the local linear-evolution regime. In high-residual windows, the low-rank linear approximation is poor; the residual itself should then be interpreted as a transition or fragility signal, and spectral points and effective rank should not be over-interpreted as stable regime descriptors.

\(\underbrace{o_t}_{\substack{\text{selected}\\\text{observable}}} \;\longrightarrow\; \underbrace{\mu_t}_{\substack{\text{distributional}\\\text{state}}} \;\xrightarrow{\text{fixed quantile map}}\; \underbrace{z_t \in \mathbb{R}^d}_{\substack{\text{Wasserstein}\\\text{coordinate}}} \;\xrightarrow{\text{windowed Hankel-DMD}}\; \underbrace{\bigl(\{\lambda_j\},\, r_{0.99},\, \mathrm{Res}^{(r)}\bigr)}_{\substack{\text{spectrum / effective rank}\\\text{/ residual}}}\)

*How to read the diagnostics.* Low \(\mathrm{Res}^{(r)}\): interpret the spectrum and \(r_{0.99}\) as local regime descriptors; high \(\mathrm{Res}^{(r)}\): interpret the residual itself as a transition/fragility signal.

Figure 1: Pipeline of the proposed diagnostic. The selected task-dependent observable \(o_t\) at each training step is summarized as an empirical distribution \(\mu_t\). Wasserstein/quantile coordinates convert each \(\mu_t\) into a vector observation \(z_t \in \mathbb{R}^d\). Windowed Hankel-DMD then analyzes the local temporal evolution of \(\{z_t\}\) over fixed step windows and returns spectrum, effective rank, and reconstruction residual. Low residual supports interpreting the spectrum and effective rank as local descriptors; high residual is treated as a transition/fragility signal.
#### AGOP as a parallel route.

The Average Gradient Outer Product (AGOP) (Radhakrishnan et al., [2024](https://arxiv.org/html/2605.08237#bib.bib3)) summarizes input sensitivity through averaged input-gradient outer products and has been used to study feature emergence and grokking-related transitions. AGOP-based diagnostics provide a parallel route under sufficient checkpoint coverage; in our setup, sparse checkpoint coverage prevents a fair quantitative comparison, so we treat AGOP as corroborative rather than competitive evidence (Appendix [H](https://arxiv.org/html/2605.08237#A8)).

## 3 Experiments

We evaluate the residual as a transition-localization signal under an explicit threshold/FPR/lead-time protocol in modular-addition Transformers and report scope checks for related settings. Section [3.1](https://arxiv.org/html/2605.08237#S3.SS1) fixes the diagnostic protocol and observable choice; Section [3.2](https://arxiv.org/html/2605.08237#S3.SS2) reports detection performance on a held-out test fold; Section [3.3](https://arxiv.org/html/2605.08237#S3.SS3) reports perturbation fragility, norm baselines, and architecture scale; Section [3.4](https://arxiv.org/html/2605.08237#S3.SS4) summarizes scope checks. All experiments run on a single workstation (NVIDIA RTX 4070 Laptop, 8 GB VRAM; PyTorch 2.7.1, CUDA 12.8).

### 3.1 Diagnostic protocol and observable choice

This section addresses two design questions: which observable should be used, and how the detection protocol is defined.

#### Setup.

We train a decoder-only Transformer (\(d_{\mathrm{model}}=128\), \(n_{\mathrm{layers}}=2\), \(n_{\mathrm{heads}}=4\), AdamW, training fraction \(0.4\), \(6{,}000\) max steps) on the modular-addition task of Power et al. ([2022](https://arxiv.org/html/2605.08237#bib.bib21)). The base pool consists of \(41\) unique runs (\(8\) seeds \(\times\) \(5\) weight-decay settings \(+\,1\) smoke run); full hyperparameters are in Appendix [D](https://arxiv.org/html/2605.08237#A4). The default observable is the empirical distribution of correct-answer log-probabilities on a fixed probe set, defined precisely below and summarized by \(19\) fixed quantile coordinates (\(0.05\)–\(0.95\)).

#### Transformer observable.

We instantiate the distributional state of §[2.1](https://arxiv.org/html/2605.08237#S2.SS1) as follows (see Eq. [1](https://arxiv.org/html/2605.08237#S2.E1)). At each training step \(t\) we evaluate the current model on a fixed probe set of \(M=100\) training examples and record the scalar correct-answer log-probability \(o_{t,i}=\log p_{\theta_t}(y_i^\star \mid x_i)\) for each example \(i\). The distributional state \(\mu_t\) is the empirical distribution of these \(M\) scalars, and \(z_t\) is its quantile coordinate on the same \(19\)-level grid as in §[2.1](https://arxiv.org/html/2605.08237#S2.SS1). This definition also explains the observable ablation below: the empirical distribution of correct-answer log-probabilities retains distributional information about the output dynamics, whereas logits, top-\(k\) summaries, and hidden-state observables provide weaker or less stable signals under the current DMD configuration.

#### Onset and labeling.

Grokking onset is defined as the first step at which test accuracy crosses \(99\%\). In the base pool, mean onset for \(wd=1\) runs is \(3398 \pm 361\) steps, while \(wd=2\) runs reach the same threshold by step \(1528\) on average; the bimodal gap between these distributions motivates a step threshold separating grokking from early generalization (justification and onset histogram in Appendix [M](https://arxiv.org/html/2605.08237#A13)).
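The onset label can be sketched as a first-crossing search; treating accuracy as recorded once per step is our simplifying assumption:

```python
import numpy as np

def grokking_onset(test_acc):
    """Onset label used for run labeling: the first training-step index
    at which test accuracy crosses 99%; None if it never does."""
    hits = np.flatnonzero(np.asarray(test_acc) >= 0.99)
    return int(hits[0]) if hits.size else None

acc = [0.10, 0.20, 0.55, 0.992, 0.996]   # toy accuracy trajectory
onset = grokking_onset(acc)
```

Runs for which the search returns `None` within the step budget are the non-grokking pool against which false alarms are counted.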

#### Operating rule.

A run-level alarm fires when \(\mathrm{Res}^{(r)}\) exceeds \(\tau \times b_{\mathrm{run}}\) for \(K\) consecutive windows, where \(b_{\mathrm{run}}\) is a per-run residual baseline. We treat \(K=2\), \(\tau=10\) as a fixed heuristic operating point: \(\tau\) is set on the order of \(10\times\) baseline (residual-multiplier heuristic) and \(K=2\) enforces sustainment. Held-out fresh seeds are used only for evaluation, and we report the full threshold sweep to expose the sensitivity–specificity trade-off (Appendix [N](https://arxiv.org/html/2605.08237#A14)). We do not select \(\tau\) by AUROC or AUPRC optimization on either fold.
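The sustained-threshold rule reads in code roughly as below; `baseline` stands in for the per-run \(b_{\mathrm{run}}\), whose exact estimation is not restated here:

```python
import numpy as np

def sustained_alarm(residuals, baseline, tau=10.0, K=2):
    """First window index at which Res^(r) exceeds tau * b_run for K
    consecutive windows (the sustained_K2_tau10 rule at its defaults);
    returns None if the alarm never fires."""
    above = np.asarray(residuals) > tau * baseline
    run = 0
    for i, flag in enumerate(above):
        run = run + 1 if flag else 0
        if run >= K:
            return i - K + 1          # first window of the sustained run
    return None

# toy trajectory: an isolated spike (ignored), then a sustained excursion
res = [1.0, 1.2, 25.0, 1.1, 30.0, 28.0, 26.0]
alarm = sustained_alarm(res, baseline=1.0)
```

The sustainment requirement \(K=2\) is what suppresses the isolated spike at window 2 while still firing at the start of the sustained excursion.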

#### Observable ablation.

Replacing log-probability with alternative observables under the same DMD configuration (Appendix [P](https://arxiv.org/html/2605.08237#A16)) gives, on the \(4\) grok runs of the ablation sub-pool: log-probability fires sustained_K2_tau10 alarms on \(4/4\) runs, of which \(2/4\) land before onset (TP) with median lead \(\approx 601\) steps; the matched FPR is the rule's test-fold value \(0.50\) from §[3.2](https://arxiv.org/html/2605.08237#S3.SS2). Logits fire alarms on \(3/4\); top-\(k\) (\(\sim 95\)-dim) and hidden-state (\(\sim 1900\)-dim) observables trigger on \(0/4\). Among the observables we tested under the current protocol, log-probability is the best-performing default. The present implementation is best suited to scalar empirical observables represented by one-dimensional quantile coordinates, so the failure of top-\(k\) and hidden-state variants under the same DMD configuration should be interpreted as a limitation of this scalar-distribution implementation and fixed DMD setup, not as evidence that those observables lack useful information.

#### DMD-vs-persistence calibration.

To check whether the Hankel-DMD fit captures temporal structure beyond a trivial identity predictor, we compare one-step holdout prediction within each window against a persistence baseline (predicting that the next state equals the current state). Across the tested weight-decay regimes, Hankel-DMD improves over persistence (Table [2](https://arxiv.org/html/2605.08237#S3.T2)). We use this as a calibration check that the residual is not merely a noise-fitting artifact, not as evidence that DMD is a uniformly reliable forecaster (full setup in Appendix [I](https://arxiv.org/html/2605.08237#A9)).

Table 2: DMD-vs-persistence calibration. Holdout error and persistence error are relative residuals on within-window holdout splits, aggregated across seeds and segment sizes per weight-decay setting. Positive gain (persistence minus holdout) means Hankel-DMD improves over the persistence baseline.
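In spirit, the calibration compares one-step errors as below; the paper's actual within-window holdout splits (Appendix I) may differ from this whole-window sketch:

```python
import numpy as np

def persistence_gain(H_minus, H_plus, A):
    """Calibration check: relative one-step error of the fitted DMD
    operator vs. the persistence baseline (predict xi_{t+1} = xi_t).
    Positive gain means Hankel-DMD improves over persistence."""
    err_dmd = np.linalg.norm(A @ H_minus - H_plus) / np.linalg.norm(H_plus)
    err_persist = np.linalg.norm(H_minus - H_plus) / np.linalg.norm(H_plus)
    return err_persist - err_dmd

# geometric decay xi_{t+1} = 0.5 * xi_t: the operator A = [[0.5]] is exact,
# while the persistence (identity) predictor is systematically wrong
traj = 0.5 ** np.arange(5.0)[None, :]       # [[1, .5, .25, .125, .0625]]
gain = persistence_gain(traj[:, :-1], traj[:, 1:], np.array([[0.5]]))
```

A near-zero or negative gain on a window would indicate the DMD fit adds nothing over the identity map, in which case its residual should not be trusted as a dynamics signal.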

### 3.2 Residual localizes grokking transitions

We next evaluate the residual as a transition-localization signal, reporting false alarms and lead time jointly.

Finding 1. Paired with the sustained-threshold operating rule of §[3.1](https://arxiv.org/html/2605.08237#S3.SS1), the reconstruction residual yields a held-out detector for grokking-vs-non-grokking. Lead time is threshold-dependent and is reported jointly with FPR and CI.

#### Protocol.

Test fold: fresh seeds \(46\)–\(49\), \(n_{\mathrm{grok}}=5\), \(n_{\mathrm{non}}=12\) (base rate \(\approx 0.29\)); \(wd=2\) early-generalization runs are excluded from numerator and denominator. Onset is the first step crossing \(99\%\) test accuracy. Operating rule: sustained_K2_tau10, fixed by the heuristic of §[3.1](https://arxiv.org/html/2605.08237#S3.SS1). Lead time is computed only over true-positive alarms. Unless otherwise stated, AUROC and AUPRC reported in this section are run-level ranking metrics computed from residual-based alarm scores on held-out runs; the temporal alarm metrics (TPR, FPR, lead, CI) describe the first alarm time and lead relative to grokking onset under a thresholded rule. Window-level evidence appears in §[3.3](https://arxiv.org/html/2605.08237#S3.SS3) (perturbation and norm-window controls).

#### Evidence.

On this fold, the residual achieves AUROC \(\approx 0.93\) (AUPRC \(\approx 0.91\)). At the selected operating point, TPR \(=0.80\) and FPR \(=0.50\); the median lead computed only over true-positive alarms is \(1068\) steps (\(95\%\) bootstrap CI \([142,\,2426]\); Appendix [O](https://arxiv.org/html/2605.08237#A15)). Threshold trade-off (Table [3](https://arxiv.org/html/2605.08237#S3.T3)): an instantaneous \(\tau=5\times\) rule recovers all grok runs (TPR \(1.00\)) at the cost of FPR \(0.917\) and longer lead (\(1774\) steps); a stricter instantaneous \(\tau=20\times\) rule reduces FPR to \(0.250\) but recalls only \(40\%\) of grok runs. Figure [2](https://arxiv.org/html/2605.08237#S3.F2) shows a representative \(wd=1\) run with the residual rising before onset; full ROC and lead-time distributions on the test fold are in Appendix [N](https://arxiv.org/html/2605.08237#A14).

#### Limitation.

On the reused-seed split (seeds \(42\)–\(45\)), the same fixed sustained_K2_tau10 operating point fires almost no alarms, yielding TPR \(=0\) and FPR \(=0\) on that split (Appendix [N](https://arxiv.org/html/2605.08237#A14)). We therefore report this asymmetry as seed-split sensitivity rather than as evidence of calibrated threshold selection: the reused-seed split does not validate threshold calibration. High absolute residual can occur in non-grokking runs, motivating the relative sustained-threshold rule rather than an absolute cut. Lead time must always be read together with FPR and CI. The selected operating rule is one specific point on a sensitivity–specificity trade-off.

Table 3: Detection threshold trade-off on the held-out test fold (fresh seeds \(46\)–\(49\); \(n_{\mathrm{grok}}=5\), \(n_{\mathrm{non}}=12\), base rate \(\approx 0.29\); \(wd=2\) early-generalization runs excluded). Lead is computed only over true-positive alarms; the bracketed range for the selected sustained rule is the \(95\%\) bootstrap CI of the median (Appendix [O](https://arxiv.org/html/2605.08237#A15)). AUROC \(\approx 0.93\) (AUPRC \(\approx 0.91\)) is computed on this same fold and applies to all rows (single detector at different cuts). The instantaneous \(\tau=5\times\) rule gives high recall at high FPR; the instantaneous \(\tau=20\times\) rule gives high specificity at low recall; the selected sustained rule sustained_K2_tau10 sits at a moderate recall/FPR point on the same curve.

| Rule | TPR (\(\uparrow\)) | FPR (\(\downarrow\)) | Median lead | \(95\%\) CI |
| --- | --- | --- | --- | --- |
| *Instantaneous threshold rules* | | | | |
| \(\tau=5\times\) | 1.00 | 0.917 | 1774 | — |
| \(\tau=20\times\) | 0.40 | 0.250 | — | — |
| *Sustained operating rule* | | | | |
| sustained_K2_tau10 (selected) | 0.80 | 0.500 | 1068 | \([142,\,2426]\) |
#### AUROC/AUPRC vs. operating point.

The AUROC and AUPRC summarize ranking performance across all thresholds; Table [3](https://arxiv.org/html/2605.08237#S3.T3) reports concrete operating points on the corresponding sensitivity–specificity curve. Lower thresholds give earlier alarms at higher false-positive rates, while higher thresholds reduce false positives at the cost of missed or delayed alarms. The selected operating rule is one configuration on this curve, intended to localize candidate transition windows for further inspection rather than to serve as a single decision rule in isolation. Because the held-out split is small, we report run-level bootstrap confidence intervals and operating-point uncertainty estimates in Appendix [O](https://arxiv.org/html/2605.08237#A15); these intervals support the run-level ranking value of RR while making the uncertainty of the fixed operating point explicit.
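A percentile-bootstrap sketch of the median-lead CI; the lead times below are hypothetical stand-ins, not the paper's data:

```python
import numpy as np

def bootstrap_median_ci(leads, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median lead time over
    true-positive alarms, the quantity reported alongside FPR."""
    rng = np.random.default_rng(seed)
    leads = np.asarray(leads, dtype=float)
    # resample the true-positive leads with replacement, n_boot times
    idx = rng.integers(0, len(leads), size=(n_boot, len(leads)))
    medians = np.median(leads[idx], axis=1)
    lo, hi = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return float(np.median(leads)), (float(lo), float(hi))

# hypothetical lead times (in steps) for four true-positive alarms
med, (lo, hi) = bootstrap_median_ci([150.0, 600.0, 1100.0, 2400.0])
```

With only a handful of true positives, the interval is necessarily wide, which is why the paper insists lead time be read jointly with its CI rather than as a point estimate.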

![Refer to caption](https://arxiv.org/html/2605.08237v1/figures/fig_e1_case_success.png)

Figure 2: Representative \(wd=1\) grokking run. The dark-blue curve shows test accuracy (left axis), the light-cyan curve in the background shows training accuracy (left axis, included as a reference for the memorization phase), and the red curve shows the reconstruction residual on a log scale (right axis). The vertical dotted line marks grokking onset, defined as the first step at which test accuracy exceeds \(99\%\). In this run, the rising edge of the residual precedes the onset.

### 3.3 Fragility, norm baselines, and architecture scale

We now test whether high-residual windows have functional meaning through perturbations, compare the residual against norm-based baselines, and check sensitivity to model scale.

#### Perturbation fragility.

Finding 2. In the \(wd=1\) pool we tested, identical perturbations applied at high-residual windows produce roughly \(3\times\) larger short-horizon deviation than at low-residual windows. The result quantifies functional fragility under matched noise; it does not establish a causal mechanism.

Protocol: identical multiplicative perturbations at scales \(\{0.005, 0.01\}\) are applied at high-RR vs. low-RR windows on \(wd=1\) grokking baselines (\(4\) high-RR runs and \(5\) low-RR runs at each scale; details in Appendix [K](https://arxiv.org/html/2605.08237#A11)). Evidence: at scale \(0.01\), mean short-horizon deviation is \(0.090\) (high-RR) vs. \(0.029\) (low-RR), giving a high/low ratio \(\approx 3.1\times\); at scale \(0.005\), \(0.107\) vs. \(0.034\), ratio \(\approx 3.2\times\). Figure [3](https://arxiv.org/html/2605.08237#S3.F3) shows the deviation distribution. Limitation: one unrecoverable failure occurs in a high-RR window near the transition; one low-RR early-training failure indicates a separate instability unrelated to transition fragility and is treated as a boundary case in Appendix [K](https://arxiv.org/html/2605.08237#A11).

![Refer to caption](https://arxiv.org/html/2605.08237v1/figures/fig_e5_devbox.png)Figure 3: Perturbation fragility. Short-horizon deviation by RR window and noise scale on the wd=1 pool. High-RR windows show larger short-horizon deviation under matched noise than low-RR windows; the low-RR early-training failure is a boundary case discussed in Appendix [K](https://arxiv.org/html/2605.08237#A11).
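As a concrete sketch of the matched-perturbation protocol, the following illustrates multiplicative perturbation and short-horizon deviation; the helper names and the toy contractive update rule are illustrative, not the paper's implementation:

```python
import numpy as np

def perturb_multiplicative(params, scale, rng):
    """Matched multiplicative perturbation: theta -> theta * (1 + scale*eps).
    Patterned on the text's 'identical multiplicative perturbations at
    scales {0.005, 0.01}'; the exact family used in the paper may differ."""
    return [p * (1.0 + scale * rng.normal(size=p.shape)) for p in params]

def short_horizon_deviation(step, params, params_pert, horizon):
    """Advance both copies with the same update rule for `horizon` steps
    and report the relative parameter deviation at the end."""
    for _ in range(horizon):
        params, params_pert = step(params), step(params_pert)
    num = sum(np.linalg.norm(p - q) ** 2 for p, q in zip(params, params_pert))
    den = sum(np.linalg.norm(p) ** 2 for p in params)
    return float(np.sqrt(num / den))

# Toy illustration: a contractive update standing in for an optimizer step.
rng = np.random.default_rng(0)
params = [rng.normal(size=(10, 10))]
pert = perturb_multiplicative(params, scale=0.01, rng=rng)
dev = short_horizon_deviation(lambda ps: [0.5 * p for p in ps], params, pert, 5)
print(dev)  # stays near the injected 1% relative perturbation
```

Comparing `dev` between windows labeled high-RR and low-RR (under a real training step) is what produces the high/low deviation ratios reported above.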
#### Norm baselines and a same\-data window control\.

Finding 3. Norm-derived signals are strong run-level regime indicators on the shared 22-run pool, confirming that scale dynamics are informative for grokking. In the same-data norm-window control, perturbation sensitivity aligns with the residual ordering rather than total-norm ordering, suggesting that the residual is not merely a total-norm proxy at the window level in the studied wd=1 dynamics.

Protocol: (i) the same 21 perturbation runs (17 reused above plus 4 new low-norm s=52/50 perturbations on seeds 46/49) are re-labeled by total-parameter-norm percentile rather than RR percentile (window-control protocol); (ii) run-level (max-of-trajectory) AUROC for several signals is computed on a shared 22-run pool; (iii) a fair-FPR temporal alarm matches FPR to 0.50 on the reused-seed split for each signal independently. Note that the fair-FPR protocol in (iii) is a baseline-comparison protocol used to set comparison thresholds and is *not* the `sustained_K2_tau10` operating rule used in §[3.2](https://arxiv.org/html/2605.08237#S3.SS2); details in Appendix [Q](https://arxiv.org/html/2605.08237#A17).

Evidence:

- Window-control deviation (scale 0.01): under RR-window framing the high/low ratio is ≈3.82×; under norm-window framing the same runs give ≈0.41×, reversing the ordering.
- Run-level AUROCs under run-level max-of-trajectory scoring on the shared norm-baseline pool (22 runs): total parameter norm 0.9829, \(\lvert\Delta\mathrm{norm}\rvert\) 0.9915, RR (run-level max) 0.4103. This run-level RR AUROC should not be compared directly with the selected sustained-threshold result in §[3.2](https://arxiv.org/html/2605.08237#S3.SS2); the protocols differ.
- Under the fair-FPR temporal alarm (FPR target 0.50 on the reused-seed split), RR triggers TPR = 0.75 on the test fold with median lead 273.5 steps; `norm_N_total` triggers TPR = 0 (no alarms fire) under the same fair-FPR protocol. This is a different protocol from the selected sustained-threshold operating rule of §[3.2](https://arxiv.org/html/2605.08237#S3.SS2).
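A temporal alarm of the sustained-threshold kind used in this subsection can be sketched as follows; the persistence-of-`k`-windows reading and the threshold values shown are illustrative, and the paper's exact `sustained_K2_tau10` parameters live in its appendices:

```python
def sustained_alarm(signal, tau, k):
    """Index of the first window at which `signal` has exceeded threshold
    `tau` for `k` consecutive windows; None if no alarm fires. One plausible
    reading of a sustained-threshold operating rule (illustrative only)."""
    run = 0
    for i, v in enumerate(signal):
        run = run + 1 if v > tau else 0
        if run >= k:
            return i
    return None

print(sustained_alarm([0.1, 0.2, 5.0, 6.0, 7.0], tau=4.0, k=2))  # 3
print(sustained_alarm([0.1, 5.0, 0.1, 5.0, 0.1], tau=4.0, k=2))  # None
```

Requiring `k` consecutive exceedances is what trades lead time against false-alarm rate: a single-step spike never fires, at the cost of alarms arriving `k - 1` windows later.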

Limitation: norm and RR are temporally anti-correlated in wd=1 dynamics, so the two framings select temporally adjacent windows; the control distinguishes which signal aligns with the perturbation-outcome direction but does not establish a causal mechanism. Norm signals' high run-level AUROC and the residual's window-level perturbation alignment are complementary use cases. We do not claim RR universally outperforms norm baselines, and a fully fair head-to-head detection comparison requires per-signal threshold recalibration. Appendix pointer: Appendix [Q](https://arxiv.org/html/2605.08237#A17).

#### Architecture scale\.

Finding 4\.The residual signal is observed across the three tested Transformer scales, with onset and amplitude shifting with capacity\. This evaluates model\-scale sensitivity, not standard\-vs\-mean\-field parameterization\.

Protocol and evidence: small (\(d_{\mathrm{model}}=64\), 1 layer), baseline (\(d_{\mathrm{model}}=128\), 2 layers), and large (\(d_{\mathrm{model}}=256\), 3 layers) variants are trained at \(wd\in\{0,1,2\}\) with 2 seeds per scale; the per-scale aggregated table is in Appendix [F](https://arxiv.org/html/2605.08237#A6). Limitation: residual amplitude shrinks with capacity, so the relative-threshold rule's behavior at small scale requires further investigation; this ablation does not test parameterization scaling.

### 3.4 Scope and remaining checks

The remaining checks are summarized in Table [4](https://arxiv.org/html/2605.08237#S3.T4); full plots and per-row protocol details are in the cited appendices. Lead times reported in this table use the protocol of the corresponding appendix and are paired with the FPR of that appendix's protocol.

Table 4: Compact summary of scope and robustness checks. \(\rho\) denotes the lead-lag Spearman correlation between \(\mathrm{Res}^{(r)}\) and subsequent test-accuracy change. Each appendix gives full plots and the per-row protocol; lead-time figures (where applicable) are paired with the FPR of that protocol.

## 4 Discussion

We formulated grokking transition localization as a diagnostic problem with an explicit threshold/FPR/lead\-time trade\-off on observed training trajectories and instantiated it via a windowed distributional pipeline: task\-dependent observables are summarized as empirical distributions, mapped to Wasserstein/quantile coordinates, and analyzed by Hankel\-DMD, with the reconstruction residual as the primary diagnostic and the spectrum and effective rank as auxiliary low\-residual descriptors\. The framing is deliberately narrow: the residual is a window\-level monitoring and localization signal in the studied modular\-arithmetic Transformer settings, not a universal predictor or an intervention rule\.

#### What the evidence supports\.

Paired with the fixed sustained-threshold operating rule of §[3.1](https://arxiv.org/html/2605.08237#S3.SS1), the reconstruction residual yields nontrivial held-out detection behavior (numbers in §[3.2](https://arxiv.org/html/2605.08237#S3.SS2) and Table [3](https://arxiv.org/html/2605.08237#S3.T3)), with lead time, false-alarm rate, and bootstrap CI reported jointly. In the wd=1 pool we tested, identical perturbations applied at high-RR windows produce larger short-horizon deviations than at low-RR windows, and a same-data norm-window control reverses the fragility ordering (§[3.3](https://arxiv.org/html/2605.08237#S3.SS3)). Log-probability performs best among the observables tested under the current protocol (Appendix [P](https://arxiv.org/html/2605.08237#A16)).

#### What the diagnostic adds\.

The main value of the reconstruction residual is not to replace mechanistic, norm\-based, or gradient\-based analyses, but to provide a lightweight window\-level readout of when the observed output distribution is being reconfigured\. Unlike run\-level norm scores, which are strong regime indicators, the residual is evaluated locally on fixed step windows and highlights candidate transition periods for further inspection\. The perturbation and norm\-window controls suggest that these windows have functional meaning: matched noise produces larger deviations in high\-RR windows, and the same runs do not show the same ordering when re\-labeled by total parameter norm\. Thus, the residual provides a complementary output\-distribution view of training dynamics: it can be computed post hoc from a fixed probe set, does not rely on hidden\-unit indexing, and helps identify where mechanistic or gradient\-based analyses should focus\.

#### What it does not support\.

Norm-derived signals are strong run-level regime indicators on the shared pool (run-level AUROC ≈0.98–0.99 for total norm and \(\lvert\Delta\mathrm{norm}\rvert\) vs. 0.41 for run-level RR); we do not claim the residual replaces them as a regime classifier. Lead times are threshold-dependent and must be paired with FPR. Perturbation alignment between residual ordering and short-horizon deviation quantifies sensitivity, not a causal mechanism. AGOP coverage in our setup is one run on the baseline-battle pool, so a head-to-head AGOP comparison is not feasible; AGOP is treated as a parallel route under sufficient checkpoint coverage (Appendix [H](https://arxiv.org/html/2605.08237#A8)). An RR-triggered learning-rate intervention introduces a non-zero failure rate without a reliable speedup, so monitoring does not directly imply beneficial control (Appendix [R](https://arxiv.org/html/2605.08237#A18)).

#### Boundaries\.

The current implementation assumes a scalar empirical observable represented by one-dimensional quantile coordinates; the failure of top-\(k\) and hidden-state observables to trigger any alarm under the current DMD configuration is best interpreted as a limitation of this scalar-distribution implementation and fixed DMD setup rather than as a verdict on those observables' information content. Extending the diagnostic to genuinely multidimensional observables would require separate geometric embeddings, such as sliced Wasserstein representations, kernel mean embeddings, or layer-wise scalar summaries, and is outside the scope of the present study. The signal is observed across the three tested Transformer scales but with shrinking residual amplitude at small scale (Appendix [F](https://arxiv.org/html/2605.08237#A6)); coarse conclusions are stable across segment sizes {250, 500, 1000} while fine ordering may vary (Appendix [G](https://arxiv.org/html/2605.08237#A7)). Cross-task transfer is partial, with shorter lead on modular multiplication and explicit failure cases on the fraction/prime sweep (Appendix [S](https://arxiv.org/html/2605.08237#A19)).

#### Future work\.

Future directions include layer-wise observables and multi-dimensional Wasserstein embeddings, such as kernel mean embeddings (Smola et al., [2007](https://arxiv.org/html/2605.08237#bib.bib54)), that respect distributional geometry beyond the one-dimensional case; per-signal threshold recalibration for fully fair head-to-head detection comparisons against norm-based baselines; non-autonomous extensions accounting for schedule-induced drift (Mandt et al., [2018](https://arxiv.org/html/2605.08237#bib.bib53)); and large-scale evaluation across architectures and task families. Mechanistic accounts remain complementary: they explain *why* grokking occurs, while the residual addresses *where in training* the transition can be located from chosen task-dependent observables.

## Acknowledgments and Disclosure of Funding

## References

- A. Aghajanyan, L. Zettlemoyer, and S. Gupta (2020) Intrinsic dimensionality explains the effectiveness of language model fine-tuning. External Links: 2012.13255, [Link](https://arxiv.org/abs/2012.13255) Cited by: [Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1).
- H\. Arbabi and I\. Mezić \(2017\)Ergodic theory, dynamic mode decomposition, and computation of spectral properties of the koopman operator\.SIAM Journal on Applied Dynamical Systems16\(4\),pp\. 2096–2126\.External Links:ISSN 1536\-0040,[Link](http://dx.doi.org/10.1137/17M1125236),[Document](https://dx.doi.org/10.1137/17m1125236)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px4.p1.1),[Appendix C](https://arxiv.org/html/2605.08237#A3.SS0.SSS0.Px4.p1.6),[Appendix C](https://arxiv.org/html/2605.08237#A3.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p3.6),[§2\.2](https://arxiv.org/html/2605.08237#S2.SS2.SSS0.Px2.p1.8)\.
- J\. Brokman, R\. Betser, R\. Turjeman, T\. Berkov, I\. Cohen, and G\. Gilboa \(2024\)Enhancing neural training via a correlated dynamics model\.External Links:2312\.13247,[Link](https://arxiv.org/abs/2312.13247)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- S\. L\. Brunton, B\. W\. Brunton, J\. L\. Proctor, E\. Kaiser, and J\. N\. Kutz \(2017\)Chaos as an intermittently forced linear system\.Nature Communications8\(1\)\.External Links:ISSN 2041\-1723,[Link](http://dx.doi.org/10.1038/s41467-017-00030-8),[Document](https://dx.doi.org/10.1038/s41467-017-00030-8)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px4.p1.1),[Appendix C](https://arxiv.org/html/2605.08237#A3.p1.1)\.
- L\. Chizat and F\. Bach \(2018\)On the global convergence of gradient descent for over\-parameterized models using optimal transport\.External Links:1805\.09545,[Link](https://arxiv.org/abs/1805.09545)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px5.p1.1)\.
- L\. Chizat, E\. Oyallon, and F\. Bach \(2020\)On lazy training in differentiable programming\.External Links:1812\.07956,[Link](https://arxiv.org/abs/1812.07956)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- J\. M\. Cohen, S\. Kaur, Y\. Li, J\. Z\. Kolter, and A\. Talwalkar \(2022\)Gradient descent on neural networks typically occurs at the edge of stability\.External Links:2103\.00065,[Link](https://arxiv.org/abs/2103.00065)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- N\. Črnjarić\-Žic, S\. Maćešić, and I\. Mezić \(2019\)Koopman operator spectrum for random dynamical systems\.External Links:1711\.03146,[Link](https://arxiv.org/abs/1711.03146)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px4.p1.1),[Appendix C](https://arxiv.org/html/2605.08237#A3.SS0.SSS0.Px1.p1.9)\.
- A\. Damian, E\. Nichani, and J\. D\. Lee \(2023\)Self\-stabilization: the implicit bias of gradient descent at the edge of stability\.External Links:2209\.15594,[Link](https://arxiv.org/abs/2209.15594)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- A\. S\. Dogra and W\. T\. Redman \(2020\)Optimizing neural networks via koopman operator theory\.External Links:2006\.02361,[Link](https://arxiv.org/abs/2006.02361)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px4.p1.1)\.
- B\. Gess, S\. Kassing, and V\. Konarovskyi \(2023\)Stochastic modified flows, mean\-field limits and dynamics of stochastic gradient descent\.External Links:2302\.07125,[Link](https://arxiv.org/abs/2302.07125)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px5.p1.1)\.
- B\. Ghorbani, S\. Krishnan, and Y\. Xiao \(2019\)An investigation into neural net optimization via hessian eigenvalue density\.External Links:1901\.10159,[Link](https://arxiv.org/abs/1901.10159)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- G\. Gur\-Ari, D\. A\. Roberts, and E\. Dyer \(2018\)Gradient descent happens in a tiny subspace\.External Links:1812\.04754,[Link](https://arxiv.org/abs/1812.04754)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- J\. H\. Tu, C\. W\. Rowley, D\. M\. Luchtenburg, S\. L\. Brunton, and J\. Nathan Kutz \(2014\)On dynamic mode decomposition: theory and applications\.Journal of Computational Dynamics1\(2\),pp\. 391–421\.External Links:ISSN 2158\-2505,[Link](http://dx.doi.org/10.3934/jcd.2014.1.391),[Document](https://dx.doi.org/10.3934/jcd.2014.1.391)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px4.p1.1),[Appendix C](https://arxiv.org/html/2605.08237#A3.SS0.SSS0.Px1.p1.9),[Appendix C](https://arxiv.org/html/2605.08237#A3.SS0.SSS0.Px4.p1.6),[Appendix C](https://arxiv.org/html/2605.08237#A3.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p3.6),[§2\.2](https://arxiv.org/html/2605.08237#S2.SS2.SSS0.Px2.p1.8)\.
- A\. Jacot, F\. Gabriel, and C\. Hongler \(2020\)Neural tangent kernel: convergence and generalization in neural networks\.External Links:1806\.07572,[Link](https://arxiv.org/abs/1806.07572)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- J\. Lee, L\. Xiao, S\. S\. Schoenholz, Y\. Bahri, R\. Novak, J\. Sohl\-Dickstein, and J\. Pennington \(2020\)Wide neural networks of any depth evolve as linear models under gradient descent \*\.Journal of Statistical Mechanics: Theory and Experiment2020\(12\),pp\. 124002\.External Links:ISSN 1742\-5468,[Link](http://dx.doi.org/10.1088/1742-5468/abc62b),[Document](https://dx.doi.org/10.1088/1742-5468/abc62b)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- C\. Li, H\. Farkhoor, R\. Liu, and J\. Yosinski \(2018\)Measuring the intrinsic dimension of objective landscapes\.External Links:1804\.08838,[Link](https://arxiv.org/abs/1804.08838)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- T\. Li, L\. Tan, Q\. Tao, Y\. Liu, and X\. Huang \(2021\)Low dimensional landscape hypothesis is true: dnns can be trained in tiny subspaces\.External Links:2103\.11154,[Link](https://arxiv.org/abs/2103.11154)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- Z\. Liu, O\. Kitouni, N\. Nolte, E\. J\. Michaud, M\. Tegmark, and M\. Williams \(2022\)Towards understanding grokking: an effective theory of representation learning\.External Links:2205\.10343,[Link](https://arxiv.org/abs/2205.10343)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p2.1)\.
- D\. Luo, J\. Shen, R\. Dangovski, and M\. Soljačić \(2024\)QuACK: accelerating gradient\-based quantum optimization with koopman operator learning\.External Links:2211\.01365,[Link](https://arxiv.org/abs/2211.01365)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px4.p1.1)\.
- K\. Lyu, J\. Jin, Z\. Li, S\. S\. Du, J\. D\. Lee, and W\. Hu \(2024\)Dichotomy of early and late phase implicit biases can provably induce grokking\.External Links:2311\.18817,[Link](https://arxiv.org/abs/2311.18817)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p2.1)\.
- S\. Mandt, M\. D\. Hoffman, and D\. M\. Blei \(2018\)Stochastic gradient descent as approximate bayesian inference\.External Links:1704\.04289,[Link](https://arxiv.org/abs/1704.04289)Cited by:[§4](https://arxiv.org/html/2605.08237#S4.SS0.SSS0.Px5.p1.1)\.
- J\. Mao, I\. Griniasty, H\. K\. Teoh, R\. Ramesh, R\. Yang, M\. K\. Transtrum, J\. P\. Sethna, and P\. Chaudhari \(2024\)The training process of many deep networks explores the same low\-dimensional manifold\.Proceedings of the National Academy of Sciences121\(12\)\.External Links:ISSN 1091\-6490,[Link](http://dx.doi.org/10.1073/pnas.2310002121),[Document](https://dx.doi.org/10.1073/pnas.2310002121)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- C\. H\. Martin and M\. W\. Mahoney \(2018\)Implicit self\-regularization in deep neural networks: evidence from random matrix theory and implications for learning\.External Links:1810\.01075,[Link](https://arxiv.org/abs/1810.01075)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- S\. Mei, A\. Montanari, and P\. Nguyen \(2018\)A mean field view of the landscape of two\-layer neural networks\.Proceedings of the National Academy of Sciences115\(33\)\.External Links:ISSN 1091\-6490,[Link](http://dx.doi.org/10.1073/pnas.1806579115),[Document](https://dx.doi.org/10.1073/pnas.1806579115)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px5.p1.1)\.
- T\. Musat \(2026\)The geometry of grokking: norm minimization on the zero\-loss manifold\.External Links:2511\.01938,[Link](https://arxiv.org/abs/2511.01938)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p2.1)\.
- N\. Nanda, L\. Chan, T\. Lieberum, J\. Smith, and J\. Steinhardt \(2023\)Progress measures for grokking via mechanistic interpretability\.External Links:2301\.05217,[Link](https://arxiv.org/abs/2301.05217)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p2.1)\.
- V\. Papyan \(2019\)Measurements of three\-level hierarchical structure in the outliers in the spectrum of deepnet hessians\.External Links:1901\.08244,[Link](https://arxiv.org/abs/1901.08244)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- J\. Pennington and Y\. Bahri \(2017\)Geometry of neural network loss surfaces via random matrix theory\.InProceedings of the 34th International Conference on Machine Learning,D\. Precup and Y\. W\. Teh \(Eds\.\),Proceedings of Machine Learning Research, Vol\.70,pp\. 2798–2806\.External Links:[Link](https://proceedings.mlr.press/v70/pennington17a.html)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- A\. Power, Y\. Burda, H\. Edwards, I\. Babuschkin, and V\. Misra \(2022\)Grokking: generalization beyond overfitting on small algorithmic datasets\.External Links:2201\.02177,[Link](https://arxiv.org/abs/2201.02177)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px1.p1.1),[§D\.2](https://arxiv.org/html/2605.08237#A4.SS2.p1.4),[§1](https://arxiv.org/html/2605.08237#S1.p1.1),[§3\.1](https://arxiv.org/html/2605.08237#S3.SS1.SSS0.Px1.p1.13)\.
- L\. Prieto, M\. Barsbey, P\. A\. M\. Mediano, and T\. Birdal \(2025\)Grokking at the edge of numerical stability\.External Links:2501\.04697,[Link](https://arxiv.org/abs/2501.04697)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p2.1)\.
- A\. Radhakrishnan, D\. Beaglehole, P\. Pandit, and M\. Belkin \(2024\)Mechanism for feature learning in neural networks and backpropagation\-free machine learning models\.Science383\(6690\),pp\. 1461–1467\.External Links:[Document](https://dx.doi.org/10.1126/science.adi5639),[Link](https://www.science.org/doi/abs/10.1126/science.adi5639),https://www\.science\.org/doi/pdf/10\.1126/science\.adi5639Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px2.p1.1),[Appendix H](https://arxiv.org/html/2605.08237#A8.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.08237#S2.SS2.SSS0.Px6.p1.1)\.
- W\. T\. Redman, J\. M\. Bello\-Rivas, M\. Fonoberova, R\. Mohr, I\. G\. Kevrekidis, and I\. Mezić \(2024\)Identifying equivalent training dynamics\.External Links:2302\.09160,[Link](https://arxiv.org/abs/2302.09160)Cited by:[Appendix T](https://arxiv.org/html/2605.08237#A20.p2.1),[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px4.p1.1),[§D\.1](https://arxiv.org/html/2605.08237#A4.SS1.SSS0.Px1.p1.10),[§D\.1](https://arxiv.org/html/2605.08237#A4.SS1.p1.1),[§D\.1](https://arxiv.org/html/2605.08237#A4.SS1.p3.2)\.
- G\. Rotskoff and E\. Vanden‐Eijnden \(2022\)Trainability and accuracy of artificial neural networks: an interacting particle system approach\.Communications on Pure and Applied Mathematics75\(9\),pp\. 1889–1935\.External Links:ISSN 1097\-0312,[Link](http://dx.doi.org/10.1002/cpa.22074),[Document](https://dx.doi.org/10.1002/cpa.22074)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px5.p1.1)\.
- C. W. Rowley, I. Mezić, S. Bagheri, P. Schlatter, and D. S. Henningson (2009) Spectral analysis of nonlinear flows. Journal of Fluid Mechanics 641, pp. 115–127. External Links: [Document](https://dx.doi.org/10.1017/S0022112009992059) Cited by: [Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px4.p1.1), [Appendix C](https://arxiv.org/html/2605.08237#A3.SS0.SSS0.Px5.p1.7), [Appendix C](https://arxiv.org/html/2605.08237#A3.p1.1).
- L\. Sagun, U\. Evci, V\. U\. Guney, Y\. Dauphin, and L\. Bottou \(2018\)Empirical analysis of the hessian of over\-parametrized neural networks\.External Links:1706\.04454,[Link](https://arxiv.org/abs/1706.04454)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- P\. J\. Schmid \(2010\)Dynamic mode decomposition of numerical and experimental data\.Journal of Fluid Mechanics656,pp\. 5–28\.External Links:[Document](https://dx.doi.org/10.1017/S0022112010001217)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px4.p1.1),[Appendix C](https://arxiv.org/html/2605.08237#A3.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p3.6),[§2\.2](https://arxiv.org/html/2605.08237#S2.SS2.SSS0.Px2.p1.8)\.
- J\. Sirignano and K\. Spiliopoulos \(2019\)Mean field analysis of neural networks: a law of large numbers\.External Links:1805\.01053,[Link](https://arxiv.org/abs/1805.01053)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px5.p1.1)\.
- A\. Smola, A\. Gretton, L\. Song, and B\. Schölkopf \(2007\)A hilbert space embedding for distributions\.InAlgorithmic Learning Theory,M\. Hutter, R\. A\. Servedio, and E\. Takimoto \(Eds\.\),Berlin, Heidelberg,pp\. 13–31\.External Links:ISBN 978\-3\-540\-75225\-7Cited by:[Appendix A](https://arxiv.org/html/2605.08237#A1.p1.1),[§4](https://arxiv.org/html/2605.08237#S4.SS0.SSS0.Px5.p1.1)\.
- M\. E\. Tano, G\. D\. Portwood, and J\. C\. Ragusa \(2020\)Accelerating training in artificial neural networks with dynamic mode decomposition\.External Links:2006\.14371,[Link](https://arxiv.org/abs/2006.14371)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px4.p1.1)\.
- V\. Thilak, E\. Littwin, S\. Zhai, O\. Saremi, R\. Paiss, and J\. Susskind \(2022\)The slingshot mechanism: an empirical study of adaptive optimizers and the grokking phenomenon\.External Links:2206\.04817,[Link](https://arxiv.org/abs/2206.04817)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p2.1)\.
- V\. Varma, R\. Shah, Z\. Kenton, J\. Kramár, and R\. Kumar \(2023\)Explaining grokking through circuit efficiency\.External Links:2309\.02390,[Link](https://arxiv.org/abs/2309.02390)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.08237#S1.p2.1)\.
- C\. Villani \(2009\)Optimal transport: old and new\.Grundlehren der Mathematischen Wissenschaften, Vol\.338,Springer,Berlin, Heidelberg\.Cited by:[Appendix A](https://arxiv.org/html/2605.08237#A1.p1.1),[Appendix B](https://arxiv.org/html/2605.08237#A2.p1.10),[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px5.p1.1),[§2\.1](https://arxiv.org/html/2605.08237#S2.SS1.SSS0.Px2.p1.9)\.
- B\. Woodworth, S\. Gunasekar, J\. D\. Lee, E\. Moroshko, P\. Savarese, I\. Golan, D\. Soudry, and N\. Srebro \(2020\)Kernel and rich regimes in overparametrized models\.External Links:2002\.09277,[Link](https://arxiv.org/abs/2002.09277)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.
- G\. Yang and E\. J\. Hu \(2022\)Feature learning in infinite\-width neural networks\.External Links:2011\.14522,[Link](https://arxiv.org/abs/2011.14522)Cited by:[Appendix U](https://arxiv.org/html/2605.08237#A21.SS0.SSS0.Px3.p1.1)\.

## Appendix A Why one-dimensional distributions

Section [2](https://arxiv.org/html/2605.08237#S2) introduces how to project a one-dimensional distribution onto a tangent space and obtain a finite-dimensional coordinate representation via fixed quantile levels. However, for multi-dimensional distributions, there is generally no global isometric isomorphism that maps the Wasserstein space into a linear space [Villani, [2009](https://arxiv.org/html/2605.08237#bib.bib1)], which creates a practical obstacle to constructing the finite-dimensional snapshot coordinates required in Section [2.2](https://arxiv.org/html/2605.08237#S2.SS2). One may instead use kernel mean embeddings (KME) [Smola et al., [2007](https://arxiv.org/html/2605.08237#bib.bib54)] or multidimensional scaling (MDS) to obtain a finite-dimensional representation, but this typically introduces systematic distortion and does not preserve Wasserstein geometry, limiting the fidelity of the resulting dynamical characterization in the chosen observation space.

## Appendix B Wasserstein geometry: full background

This appendix expands on the compact Wasserstein-quantile material in §[2.1](https://arxiv.org/html/2605.08237#S2.SS1). Let \((\mathcal{X},\|\cdot\|)\) be a Polish space and let \(\mathcal{P}_{2}(\mathcal{X})\) denote the set of Borel probability measures on \(\mathcal{X}\) with finite second moments,

\[
\mathcal{P}_{2}(\mathcal{X}):=\Bigl\{\mu\in\mathcal{P}(\mathcal{X}):\ \int_{\mathcal{X}}\|x\|^{2}\,d\mu(x)<\infty\Bigr\}.
\]
For \(\mu,\nu\in\mathcal{P}_{2}(\mathcal{X})\), the 2-Wasserstein distance is

\[
W_{2}(\mu,\nu):=\Bigl(\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathcal{X}\times\mathcal{X}}\|x-y\|^{2}\,d\pi(x,y)\Bigr)^{1/2},
\]
where \(\Pi(\mu,\nu)\) is the set of couplings of \(\mu\) and \(\nu\). This is the minimum quadratic transport cost from \(\mu\) to \(\nu\) [Villani, [2009](https://arxiv.org/html/2605.08237#bib.bib1)].

For \(D\subseteq\mathbb{R}\) closed and \(\mathcal{W}_{2}(D)\) the corresponding 1-D Wasserstein space,

\[
d_{W}^{2}(\mu,\nu)=\int_{0}^{1}\bigl(F_{\mu}^{-1}(p)-F_{\nu}^{-1}(p)\bigr)^{2}\,dp,
\]
with \(F_{\mu}^{-1}\) the quantile function. If \(\mu\in\mathcal{W}_{2}(D)\) is atomless, the map \(\mu\mapsto F_{\mu}^{-1}\in L^{2}(0,1)\) is a global isometric embedding, identifying \(\mathcal{W}_{2}(D)\) with a closed convex subset of \(L^{2}(0,1)\). Consequently \((\mathcal{W}_{2}(D),d_{W})\) is a Hadamard space (complete, geodesic, non-positively curved), Fréchet means are uniquely defined, and at any atomless reference \(\mu^{\star}\) the tangent space \(T_{\mu^{\star}}\) is identified with a closed subspace of \(L^{2}(\mu^{\star})\) via the logarithmic map

\[
\log_{\mu^{\star}}(\mu)=F_{\mu}^{-1}\circ F_{\mu^{\star}}-\mathrm{id}.
\]
This isometry justifies treating the windowed dynamics on a linear coordinate (the quantile function evaluated on a fixed grid) as in §[2.1](https://arxiv.org/html/2605.08237#S2.SS1).
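A minimal numerical sketch of the quantile-coordinate map and the resulting 1-D \(W_{2}\) computation; the grid size and sample counts here are arbitrary choices, not the paper's settings:

```python
import numpy as np

def quantile_coords(samples, grid):
    """Summarize an empirical 1-D distribution as quantile coordinates on a
    fixed grid of levels (the Wasserstein/quantile map of the main text)."""
    return np.quantile(samples, grid)

def w2_from_quantiles(qa, qb):
    """Approximate W2 between two 1-D distributions as the RMS difference of
    their quantile functions on a shared uniform grid of levels."""
    return float(np.sqrt(np.mean((qa - qb) ** 2)))

rng = np.random.default_rng(0)
grid = np.linspace(0.001, 0.999, 999)  # uniform levels, avoiding extreme tails
qa = quantile_coords(rng.normal(0.0, 1.0, 200_000), grid)
qb = quantile_coords(rng.normal(1.5, 1.0, 200_000), grid)
print(w2_from_quantiles(qa, qb))  # close to 1.5 for N(0,1) vs N(1.5,1)
```

Because two Gaussians with equal variance differ by a pure shift of the quantile function, the exact \(W_{2}\) is the mean shift, which is what the quantile-grid approximation recovers.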

## Appendix C Hankel-DMD: full construction

This appendix expands on §[2.2](https://arxiv.org/html/2605.08237#S2.SS2). Koopman-operator theory analyzes a nonlinear system through an infinite-dimensional linear operator acting on a chosen observation space [Schmid, [2010](https://arxiv.org/html/2605.08237#bib.bib23); Rowley et al., [2009](https://arxiv.org/html/2605.08237#bib.bib24)]. In practice, dynamic mode decomposition (DMD) [Schmid, [2010](https://arxiv.org/html/2605.08237#bib.bib23); Tu et al., [2014](https://arxiv.org/html/2605.08237#bib.bib15); Arbabi and Mezić, [2017](https://arxiv.org/html/2605.08237#bib.bib16); Brunton et al., [2017](https://arxiv.org/html/2605.08237#bib.bib19); Drmač et al., 2017] approximates the Koopman operator and its spectrum from snapshots.

#### Random\-system interpretation\.

Following Črnjarić-Žic et al. [[2019](https://arxiv.org/html/2605.08237#bib.bib17)] and Tu et al. [[2014](https://arxiv.org/html/2605.08237#bib.bib15)], let \((X_{t})_{t\geq 0}\) be a stochastic process on \(\mathcal{X}\) and \(\psi:\mathcal{X}\to\mathbb{R}^{d}\) a fixed observable vector (here, the quantiles of the distributional state); \(z_{t}=\psi(X_{t})\in\mathbb{R}^{d}\). The Koopman operator \(K\) acts on observables \(g:\mathcal{X}\to\mathbb{R}\) by \((Kg)(x)=\mathbb{E}[g(X_{t+1})\mid X_{t}=x]\). The vectors \(z_{t}\) are evaluations of a finite collection of observables evolving under \(K\).

#### Raw DMD on a window\.

Collecting \(m+1\) successive observations into snapshot matrices

\[
Z_{-}=[z_{0}\ \cdots\ z_{m-1}]\in\mathbb{R}^{d\times m},\qquad Z_{+}=[z_{1}\ \cdots\ z_{m}]\in\mathbb{R}^{d\times m},
\]
raw DMD seeks \(A\in\mathbb{R}^{d\times d}\) minimizing \(\|Z_{+}-AZ_{-}\|_{F}^{2}\), with closed form \(A^{\star}=Z_{+}Z_{-}^{\dagger}\) when \(Z_{-}\) has full row rank. \(A^{\star}\) is the orthogonal projection (under the empirical inner product) of the action of \(K\) onto the Krylov subspace \(\mathcal{K}_{m}=\mathrm{span}\{z_{0},\dots,z_{m-1}\}\); its eigenvalues (Ritz values) give a finite-dimensional approximation of the spectral properties of \(K\) restricted to that subspace.
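The closed-form least-squares fit can be sketched as follows; this is a generic raw-DMD implementation via the pseudoinverse, not the paper's code:

```python
import numpy as np

def raw_dmd(Z):
    """Least-squares DMD operator from a snapshot matrix Z (d x (m+1)):
    A* = Z_plus @ pinv(Z_minus), minimizing ||Z_plus - A Z_minus||_F."""
    Z_minus, Z_plus = Z[:, :-1], Z[:, 1:]
    return Z_plus @ np.linalg.pinv(Z_minus)

# Sanity check on an exactly linear system: DMD recovers the generator.
rng = np.random.default_rng(1)
A_true = np.array([[0.9, 0.1], [-0.2, 0.95]])
z = rng.normal(size=2)
snaps = [z]
for _ in range(20):
    z = A_true @ z
    snaps.append(z)
A_hat = raw_dmd(np.column_stack(snaps))
print(np.allclose(A_hat, A_true, atol=1e-8))  # True
```

When the snapshots do not span \(\mathbb{R}^{d}\), the pseudoinverse silently restricts the fit to the Krylov subspace, which is exactly the projection interpretation given above.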

#### Non\-normal sensitivity\.

With noise and finite samples, \(A^{\star}\)'s eigenvalues are sensitive to small perturbations of \((Z_{-},Z_{+})\) when the underlying operator is non-normal (large pseudospectrum). Raw DMD spectra may then be unstable or contain spurious modes.

#### Hankel time\-delay extension\.

Fix \(q\geq 1\) and form delay-stacked vectors \(\xi_t=[z_t^{\top},z_{t+1}^{\top},\ldots,z_{t+q-1}^{\top}]^{\top}\in\mathbb{R}^{qd}\). The block Hankel snapshots are

\[
H_{-}=[\xi_0\ \cdots\ \xi_{m-q-1}],\qquad H_{+}=[\xi_1\ \cdots\ \xi_{m-q}].
\]

Hankel-DMD solves the ordinary least-squares problem \(A_H^{\star}=\arg\min_A \|H_{+}-AH_{-}\|_F^2\) and studies its eigenvalues; the time-delay embedding enlarges the observable space and encodes temporal correlations over \(q\) lags, typically improving spectral robustness and allowing continuous spectral components to be approximated by clusters of discrete modes [Arbabi and Mezić, [2017](https://arxiv.org/html/2605.08237#bib.bib16); H. Tu et al., [2014](https://arxiv.org/html/2605.08237#bib.bib15)]. We then truncate the fitted operator to a low-rank approximation by retaining the leading \(r\) eigenpairs of \(A_H^{\star}\) (after a stabilizing SVD projection onto the dominant snapshot subspace) [Drmač, 2017].

#### Spectral linear approximation\.

Let \(A_H^{\star}W=W\Lambda\) with \(\Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_r)\) and \(W=[w_1,\dots,w_r]\). For a delay state \(\xi_0\), modal amplitudes are \(b=W^{\dagger}\xi_0\), and the rank-\(r\) reconstruction is

\[
\widehat{\xi}_t^{(r)} := W\Lambda^t b=\sum_{j=1}^{r} b_j\,\lambda_j^t\,w_j,\qquad t=0,\dots,m-q.
\]

This is a spectral, low-rank linear approximation of the windowed evolution within the data-driven subspace. Each eigenpair represents a direction; real eigenvalues correspond to non-oscillatory modes and complex eigenvalues to oscillatory ones [Rowley et al., [2009](https://arxiv.org/html/2605.08237#bib.bib24)]. The directions are not in general orthogonal; reading \(\widehat{\xi}_t^{(r)}\) as a complete orthogonal decomposition requires substantially stronger assumptions. The reconstruction residual ([5](https://arxiv.org/html/2605.08237#S2.E5)) and effective rank ([6](https://arxiv.org/html/2605.08237#S2.E6)) are the two windowed scalars we use to summarize a window's dynamics.

## Appendix D Training experiment details

### D.1 Fully connected networks (FCNs)

The FCN experiment considers a single-hidden-layer fully connected network trained on MNIST. This part is mainly based on the open-source repository of Redman et al. [[2024](https://arxiv.org/html/2605.08237#bib.bib20)] (https://github.com/william-redman/Identifying_Equivalent_Training_Dynamics), and we keep closely matched hyperparameters. We use the standard PyTorch initialization scheme, but instead of the repository's multiplicative perturbation protocol, we obtain different initializations directly by changing the random seed. This choice is consistent with our Wasserstein-geometry perspective, under which seed-wise initializations are treated as independent samples from the same underlying initialization distribution and hence correspond to a common starting point on the Wasserstein manifold. We additionally report the same spectral diagnostics for FCNs trained with AdamW.

Concretely, the FCN hyperparameters used in this work are summarized in Table[5](https://arxiv.org/html/2605.08237#A4.T5)\.

Table 5: FCN hyperparameters.

We use 25 different random seeds and, following Section [2](https://arxiv.org/html/2605.08237#S2), construct distributional-state snapshots, using 19 quantile levels (0.05–0.95) as a finite-dimensional coordinate representation of the empirical distribution; we collect snapshots every step, define stages by non-overlapping 500-step windows, and compute DMD. For all experiments, we fix the delay length to 4 and retain \(r=10\) Koopman modes in the DMD. Note that in a small number of windows the effective rank can be below 10; however, to enable consistent visual comparisons of spectral plots and take advantage of the randomized shuffle control of Redman et al. [[2024](https://arxiv.org/html/2605.08237#bib.bib20)], we report 10 spectral points in all cases.

#### Randomized shuffle control.

Let \(\Omega_1=\{\lambda_j^{(1)}\}_{j=1}^{k}\) and \(\Omega_2=\{\lambda_j^{(2)}\}_{j=1}^{k}\) be two estimated spectral point sets, and let \(\omega:=d(\Omega_1,\Omega_2)\) denote their spectral distance (e.g., the 2-Wasserstein / optimal-matching distance). Following Redman et al. [[2024](https://arxiv.org/html/2605.08237#bib.bib20)], we construct a "shuffle" baseline by first computing an optimal matching \(\sigma\in S_k\) that minimizes \(\sum_{j=1}^{k}\|\lambda_j^{(1)}-\lambda_{\sigma(j)}^{(2)}\|\). Then, for each matched pair \((\lambda_j^{(1)},\lambda_{\sigma(j)}^{(2)})\), we independently swap the two elements with probability \(1/2\), forming a randomized partition \((\Omega_1',\Omega_2')\) that preserves pairwise proximity while removing systematic group structure. Repeating this procedure \(n_{\mathrm{shuff}}\) times yields distances \(\{\omega_i'=d(\Omega_{1,i}',\Omega_{2,i}')\}_{i=1}^{n_{\mathrm{shuff}}}\). We report the empirical exceedance rate

\[
\hat{p} := \frac{1}{n_{\mathrm{shuff}}}\sum_{i=1}^{n_{\mathrm{shuff}}}\mathbf{1}\{\omega_i'\geq\omega\},
\]

which quantifies how often a "naturally shuffled" pair is at least as separated as the observed spectra.
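
The match-swap-recompute loop can be sketched in a few lines (pure-Python illustration with a tiny \(k\); the exact matching by enumeration stands in for the optimal-matching step, and the point sets are invented):

```python
import itertools
import random

# Two tiny invented spectral point sets (complex Ritz values).
random.seed(0)
Omega1 = [0.90 + 0.10j, 0.50 - 0.20j, -0.30 + 0.00j]
Omega2 = [0.88 + 0.12j, 0.52 - 0.18j, -0.28 + 0.02j]
k = len(Omega1)

def spectral_distance(S1, S2):
    # Optimal-matching distance between equal-size spectral point sets.
    return min(sum(abs(S1[j] - S2[p[j]]) for j in range(len(S1)))
               for p in itertools.permutations(range(len(S1))))

# Optimal matching sigma between the two sets, and the observed distance.
sigma = min(itertools.permutations(range(k)),
            key=lambda p: sum(abs(Omega1[j] - Omega2[p[j]]) for j in range(k)))
omega = spectral_distance(Omega1, Omega2)

# Swap each matched pair with probability 1/2, recompute, repeat.
n_shuff = 200
exceed = 0
for _ in range(n_shuff):
    S1p, S2p = [], []
    for j in range(k):
        a, b = Omega1[j], Omega2[sigma[j]]
        if random.random() < 0.5:
            a, b = b, a
        S1p.append(a)
        S2p.append(b)
    exceed += spectral_distance(S1p, S2p) >= omega
p_hat = exceed / n_shuff     # empirical exceedance rate
```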

#### Qualitative spectral robustness checks\.

We ran three qualitative robustness checks on the FCN spectra under the shuffle control above; the corresponding raw experiment outputs are not part of the released claim index, so we report direction only and omit numerical tables. (i) *Multiplicative initialization perturbations.* For \(h=40\), applying \(0.1\%\) relative multiplicative perturbations elementwise to the weights and computing the Wasserstein distance between spectra over the resulting initialization pairs gives small distances under both SGD and AdamW; the shuffle control does not reject the null that the observed distances are within the shuffle baseline. (ii) *Seed variability.* For each width \(h\in\{5,40,256,1024\}\) we draw \(25\) independent seeds and compute pairwise spectral Wasserstein distances; within-width seed-to-seed distances are small at every tested width and again do not exceed the shuffle baseline. Across-width distances are visibly larger than within-width seed distances, indicating that in this setting width changes the spectrum more than seed choice does. (iii) *Activation choice (ReLU vs. GeLU).* Under SGD at \(h\in\{40,256\}\), the ReLU-vs-GeLU spectral Wasserstein distance is small at \(h=256\) and larger at \(h=40\), but in neither case exceeds the shuffle baseline; Figure [4](https://arxiv.org/html/2605.08237#A4.F4) shows the corresponding spectral points.

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/ReLU_vs_GELU_spectrum_comparison_h=40.png) (a) \(h=40\)
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/ReLU_vs_GELU_spectrum_comparison.png) (b) \(h=256\)

Figure 4: Training-dynamics spectral points under SGD for ReLU and GeLU activations at widths \(h=40\) and \(h=256\). To better visualize differences in the locations of the spectral points, we slightly warp the axis scale.

### D.2 Transformers

We follow the modular addition (mod 97) task and training codebase introduced by Power et al. [[2022](https://arxiv.org/html/2605.08237#bib.bib21)] and use their open-source implementation (https://github.com/openai/grok). The task maps pairs \((a,b)\in\{0,\dots,96\}^2\) to the label \(y=(a+b)\bmod 97\). We modify only the training-set fraction, setting it to \(0.4\), and keep the remaining data generation and evaluation protocol unchanged.

We consider three weight decay settings, \(\mathrm{wd}\in\{0,1,2\}\), and train one model per setting under otherwise matched hyperparameters. Unless stated otherwise, all architectural choices and optimization hyperparameters follow the repository defaults. The detailed values (optimizer, learning rate, batch size, number of training steps, model dimensions, random seeds, and any learning-rate schedule) are reported in Table [6](https://arxiv.org/html/2605.08237#A4.T6).

As described in Section [2](https://arxiv.org/html/2605.08237#S2) (Eq. [1](https://arxiv.org/html/2605.08237#S2.E1)), we construct the distributional state variable from a fixed probe set of \(M=100\) training examples: at training step \(t\) we record the scalar correct-answer log-probability \(o_{t,i}=\log p_{\theta_t}(y_i^{\star}\mid x_i)\) for each probe sample \(i\), and \(\mu_t\) is their empirical distribution. The quantile coordinate \(z_t\) uses the same 19-level grid as the FCN setting. We then compute windowed Hankel-DMD spectra and diagnostics (reconstruction residual and effective rank) on step-based segments, using the same DMD settings as the FCN experiments across all weight decay values.
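
The per-step quantile coordinate can be sketched as follows (the probe log-probabilities are synthetic stand-ins; only the 19-level grid matches the text):

```python
import numpy as np

# One step's probe log-probabilities summarized on the 19-level quantile
# grid (0.05, 0.10, ..., 0.95). The values o_t are synthetic stand-ins
# for the correct-answer log-probabilities o_{t,i}.
rng = np.random.default_rng(3)
M = 100                                  # probe-set size
levels = np.linspace(0.05, 0.95, 19)     # quantile levels

o_t = rng.normal(loc=-2.0, scale=0.5, size=M)  # stand-in log-probs
z_t = np.quantile(o_t, levels)                 # quantile coordinate of mu_t
```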

Table 6: Transformer modular addition (mod 97) experimental setup.

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/train_val_accuracy_wd0_averaged.png) (a) wd=0
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/train_val_accuracy_wd1_averaged.png) (b) wd=1
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/train_val_accuracy_wd2_averaged.png) (c) wd=2

Figure 5: (a) An example of training and test accuracy curves without weight decay. The blue solid line denotes training accuracy and the red dashed line denotes test accuracy; the run remains overfitted for an extended period. (b) Mean training and test accuracy curves with weight decay \(\mathrm{wd}=1\), with a \(\pm 25\%\) standard-deviation band. The blue solid line denotes training accuracy, the red dashed line denotes test accuracy, and the shaded region indicates the \(\pm 25\%\) standard-deviation range; after a period of overfitting, the model undergoes rapid generalization. (c) Mean training and test accuracy curves with weight decay \(\mathrm{wd}=2\), with a \(\pm 25\%\) standard-deviation band. The blue solid line denotes training accuracy, the red dashed line denotes test accuracy, and the shaded region indicates the \(\pm 25\%\) standard-deviation range; generalization occurs without a pronounced overfitting period.

Across our three weight decay settings, the no-weight-decay runs exhibit prolonged overfitting followed by slow generalization. Even with the training fraction set to 40%, some seeds do not generalize until beyond \(5\times 10^{5}\) steps. In contrast, under our experimental setup, all runs with weight decay generalize within 6000 steps, as shown in Figure [5](https://arxiv.org/html/2605.08237#A4.F5). We therefore analyze spectral features only over the training trajectory from steps 1 to 6000: we record the log-probability every two steps and define stages by non-overlapping 500-step windows. The complete spectral plots are provided in Figure [6](https://arxiv.org/html/2605.08237#A4.F6).

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step00001-00500.png) (a) stage 1
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step00501-01000.png) (b) stage 2
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step01001-01500.png) (c) stage 3
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step01501-02000.png) (d) stage 4
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step02001-02500.png) (e) stage 5
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step02501-03000.png) (f) stage 6
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step03001-03500.png) (g) stage 7
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step03501-04000.png) (h) stage 8
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step04001-04500.png) (i) stage 9
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step04501-05000.png) (j) stage 10
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step05001-05500.png) (k) stage 11
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/partspec/spectrum_step05501-06000.png) (l) stage 12

Figure 6: Local Koopman spectra across successive training stages. Each panel corresponds to a distinct step interval.

## Appendix E Extended Weight Decay Robustness

To address concerns regarding the sensitivity of our Transformer results to the choice of weight decay, we expand the original sweep from three values (\(\mathrm{wd}\in\{0,1,2\}\)) to five values (\(\mathrm{wd}\in\{0,0.25,0.5,1,2\}\)) with four random seeds per setting, yielding 20 independent runs. All other hyperparameters remain identical to Table [6](https://arxiv.org/html/2605.08237#A4.T6).

Figure [7](https://arxiv.org/html/2605.08237#A5.F7) shows the mean test accuracy trajectories with \(\pm 1\) standard-deviation bands. The qualitative ordering between weight decay strength and grokking onset timing observed in the original experiments remains visible across the expanded sweep: among runs that grok, higher weight decay tends to be associated with earlier generalization onset in this sweep.

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_train_test_consistency_overlay.png)

Figure 7: Test accuracy trajectories across five weight decay values (4 seeds each), with \(\pm 1\) std bands.

#### Alignment between diagnostics and accuracy.

The two panels of Figure [8](https://arxiv.org/html/2605.08237#A5.F8) display the reconstruction residual and effective rank curves, respectively. Both diagnostics track the accuracy trajectories, with the residual peak roughly coinciding with the onset of generalization for grokking runs (\(\mathrm{wd}\geq 1\)).

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_residual_alignment_bands.png)
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_rank_alignment_bands.png)

Figure 8: Reconstruction residual and effective rank over training for five weight decay values. Left: the reconstruction residual tracks transition windows and spikes near generalization for grokking runs. Right: the effective rank remains comparatively stable away from these transition windows.
#### Peak residual and onset\.

Figure [9](https://arxiv.org/html/2605.08237#A5.F9) summarizes the peak residual and grokking onset step as a function of weight decay across the expanded range; both are descriptive summaries of the runs in this sweep, not a calibrated trend extrapolated to other regularization regimes.

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_peak_residual_vs_weight_decay.png)
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_onset_vs_weight_decay.png)

Figure 9: Peak reconstruction residual and grokking onset step across weight decay values. Left: peak residual varies smoothly with regularization strength. Right: grokking onset shifts earlier as weight decay increases among runs that generalize.
#### Train–test consistency\.

Figure [10](https://arxiv.org/html/2605.08237#A5.F10) overlays train and test accuracy for each weight decay value, showing that the distributional diagnostics track dynamics visible in both splits in the runs we evaluated.

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_train_test_consistency_overlay.png)

Figure 10: Train–test accuracy overlay across weight decay settings.

## Appendix F Architecture Ablation

A natural concern is whether the observed patterns depend on the specific Transformer configuration (\(d_{\mathrm{model}}=128\), \(n_{\mathrm{layers}}=2\), \(n_{\mathrm{heads}}=4\)). We test two additional configurations while keeping the task, data split, and analysis pipeline unchanged:

- Small/shallow: \(d_{\mathrm{model}}=64\), \(n_{\mathrm{layers}}=1\), \(n_{\mathrm{heads}}=4\)
- Large/deep: \(d_{\mathrm{model}}=256\), \(n_{\mathrm{layers}}=3\), \(n_{\mathrm{heads}}=8\)

Each variant is trained with \(\mathrm{wd}\in\{0,1,2\}\) and 2 seeds.

Table [7](https://arxiv.org/html/2605.08237#A6.T7) reports the aggregated results. The baseline observation that higher weight decay promotes generalization in this task is preserved across the three tested scales; the residual peak co-occurs with the generalization regime in all three scales, but its amplitude shifts substantially with capacity (small \(\sim 0.24\), baseline \(\sim 0.68\), large \(\sim 0.84\) at \(\mathrm{wd}=1\)). At \(\mathrm{wd}=2\) the small/shallow variant fails to generalize on this run pool, and the large/deep variant under-generalizes at both \(\mathrm{wd}=1\) and \(\mathrm{wd}=2\), indicating that the strength of the wd–generalization relationship is architecture-dependent.

Table 7: Architecture ablation summary (modular addition, 6000 steps). Aggregates over the seeds in each cell. Onset is reported for cells where the mean test accuracy exceeds the \(99\%\) criterion; "–" means the cell does not reach generalization on this pool.

Figure [11](https://arxiv.org/html/2605.08237#A6.F11) compares the residual alignment and accuracy bands for both variants.

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/arch_large_train_test_consistency_overlay.png)
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/arch_small_train_test_consistency_overlay.png)
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/arch_large_residual_alignment_bands.png)
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/arch_small_residual_alignment_bands.png)

Figure 11: Test accuracy and reconstruction residual for the two architecture variants. Top left/right: accuracy bands for the large/deep and small/shallow models. Bottom left/right: the corresponding residual curves, showing that the residual–onset alignment is only partially preserved under architecture changes.

We emphasize that this ablation *partially extends* the main findings to different structural hyperparameters, but it does not establish full robustness to arbitrary architecture choices.

## Appendix G Stage Partition Sensitivity

To examine whether the conclusions depend on the specific choice of segment size in the windowed DMD, we re-analyze a representative subset of expanded Transformer runs (2 seeds, \(\mathrm{wd}\in\{0,1,2\}\)) with three segment sizes: 250, 500, and 1000 steps. All other DMD hyperparameters (\(n_{\mathrm{delays}}=32\), \(k_{\mathrm{mode}}=10\)) are held fixed.

Table [8](https://arxiv.org/html/2605.08237#A7.T8) reports the key metrics. Across the three tested segment sizes \(\{250,500,1000\}\), the qualitative observations that remain visible are:

1. \(\mathrm{wd}=0\) has the highest peak residual at every tested segment size;
2. the peak-residual ordering \(\mathrm{wd}=0>\mathrm{wd}=1>\mathrm{wd}=2\) holds at every tested segment size;
3. generalization onset under \(\mathrm{wd}=1\) is later than under \(\mathrm{wd}=2\) at every tested segment size;
4. lead-lag Spearman \(\rho\) between RR and subsequent accuracy change stays above \(0.76\) for the grokking regimes (\(\mathrm{wd}=1,2\)) at every tested segment size; for \(\mathrm{wd}=0\) (no grokking) \(\rho\) is lower (\(\approx 0.54\)–\(0.60\)), consistent with the weaker temporal coupling there.

Quantitative values shift with segment size, but the orderings (Table [9](https://arxiv.org/html/2605.08237#A7.T9)) are stable across the three sizes we tested, and the \(\rho\) comparison between grokking and non-grokking regimes is preserved.

Table 8: Partition sensitivity: peak residual, peak step, mean grokking onset, and lead-lag Spearman correlation between \(\mathrm{Res}^{(r)}\) and subsequent accuracy change, under three segment sizes. Cells aggregate across 4–5 seeds. "–" means the cell does not reach \(99\%\) test accuracy.

Table 9: Ordering stability across segment sizes. "\(>\)" denotes the weight-decay value with the higher metric.
## Appendix H AGOP Comparison and Joint Evidence

The Average Gradient Outer Product (AGOP) [Radhakrishnan et al., [2024](https://arxiv.org/html/2605.08237#bib.bib3)] provides a parallel route to identifying transition windows from gradient outer products. To relate it to our distributional diagnostics, we compute AGOP trace and top-eigenvalue trajectories for the expanded Transformer runs and examine their joint behavior with the reconstruction residual.

#### Coverage limitation \(N1\)\.

On the N1 baseline-battle pool, AGOP can only be evaluated at saved checkpoints; the current exponential checkpoint schedule leaves only 1 non-grok run with full coverage, which is insufficient for a fair head-to-head AUROC comparison. We therefore do not interpret AGOP's resulting AUROC on this pool as evidence against the method, and treat AGOP as corroborative under sufficient checkpoint coverage rather than competitive evidence.

Figure [12](https://arxiv.org/html/2605.08237#A8.F12) shows three complementary views: (a) AGOP trace consistency across weight decay settings, (b) AGOP top-eigenvalue trajectories, and (c) the joint evidence relating AGOP and reconstruction residual.

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_agop_trace_consistency.png)
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_agop_top_eigenvalue_consistency.png)
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_agop_residual_joint_evidence.png)

Figure 12: AGOP diagnostics and their relationship with reconstruction residual. Left: AGOP trace over training on a log scale. Middle: the top AGOP eigenvalue over training on a log scale. Right: scatter of relative residual against AGOP trace (log scale) per checkpoint. The relationship is non-monotonic on this pool: residual is elevated in the mid-AGOP regime that contains transition windows for grokking runs (\(\mathrm{wd}\geq 1\)), while saturated post-grokking checkpoints sit at large AGOP yet small residual. We therefore read panel (c) as qualitative co-elevation of AGOP and residual within the transition windows of the runs we could evaluate, not as a global monotone correlation between AGOP magnitude and residual.

Subject to the coverage limitation noted above, the AGOP dynamics are qualitatively consistent with the distributional diagnostics in the runs we could evaluate: in those runs, grokking-regime trajectories (\(\mathrm{wd}\geq 1\)) show concurrent elevation in AGOP trace and reconstruction residual during the generalization transition window. We treat this as qualitative corroboration rather than independent confirmation; a fair quantitative AGOP–RR comparison would require a denser checkpoint schedule and is left for future work.

#### N1 detector pool\.

On the broader N1 baseline-battle pool (18 runs; 5 grok / 13 non-grok, including smoke and early-generalization border cases), the residual scores AUROC \(=0.9231\) and AUPRC \(=0.8944\). Figure [13](https://arxiv.org/html/2605.08237#A8.F13) shows the run-level ROC for RR; AGOP cannot be placed in the (FPR, TPR) plane on this pool because its coverage is one non-grok run, so its TPR is undefined. We therefore do not interpret AGOP's nominal score on this pool as evidence against AGOP, and treat it only as corroborative under sufficient checkpoint coverage.

![Refer to caption](https://arxiv.org/html/2605.08237v1/figures/fig_n1_roc.png)

Figure 13: N1 baseline-battle pool (18 runs; 5 grok / 13 non-grok). Run-level ROC for the residual detector (AUROC \(=0.9231\)). AGOP cannot be plotted on this pool because its coverage is one run with no grokking, so TPR is undefined; this is annotated in-figure rather than rendered as an empty curve. We therefore do not interpret AGOP's score on this pool as evidence against the method.

## Appendix I DMD reconstruction quality calibration

A natural question is whether the DMD approximation captures meaningful dynamics or merely fits noise. We address this with a holdout-based quality evaluation: within each DMD segment, we split the time steps, fit on one portion, and evaluate one-step prediction error on the held-out portion. We compare against a persistence baseline (identity transition) that predicts each next state as equal to the current state. "Gain" is the persistence-minus-holdout reduction in relative residual; positive values mean DMD beats persistence.
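
The holdout-vs-persistence comparison can be sketched as follows (the even/odd split rule and the synthetic drift dynamics are our illustrative choices, not the paper's exact protocol):

```python
import numpy as np

# Fit a one-step linear map on even-indexed transitions, score one-step
# relative error on odd-indexed transitions, and compare with the
# persistence baseline z_hat_{t+1} = z_t.
rng = np.random.default_rng(4)
d, T = 5, 60
A = 0.9 * np.eye(d) + 0.02 * rng.standard_normal((d, d))  # slow linear drift
Z = np.empty((d, T + 1))
Z[:, 0] = rng.standard_normal(d)
for t in range(T):
    Z[:, t + 1] = A @ Z[:, t] + 0.001 * rng.standard_normal(d)

fit_idx = np.arange(0, T, 2)       # transitions (t -> t+1) used for fitting
hold_idx = np.arange(1, T, 2)      # held-out transitions
A_fit = Z[:, fit_idx + 1] @ np.linalg.pinv(Z[:, fit_idx])

def rel_err(pred, true):
    return np.linalg.norm(pred - true) / np.linalg.norm(true)

holdout_rr = rel_err(A_fit @ Z[:, hold_idx], Z[:, hold_idx + 1])
persist_rr = rel_err(Z[:, hold_idx], Z[:, hold_idx + 1])
gain = persist_rr - holdout_rr     # positive: the fitted map beats persistence
```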

Table [10](https://arxiv.org/html/2605.08237#A9.T10) reports the results, aggregated across seeds and segment sizes per weight-decay setting. The DMD holdout error is lower than the persistence baseline in all three regimes (positive gain), indicating that the windowed DMD fit captures nontrivial temporal structure beyond persistence. The gap is largest for \(\mathrm{wd}=0\) on this pool: the wd=0 distributional trajectory is far from stationary, and persistence is a particularly weak baseline there. The gap remains positive but smaller for \(\mathrm{wd}=1,2\), where the trajectory is closer to a slow drift and persistence is already a strong baseline.

Table 10: DMD quality: holdout vs. persistence baseline (relative residual). "Gain" is persistence rr minus holdout rr; positive means DMD beats persistence.

This calibration supports a conservative interpretation: across the regimes we tested, the DMD approximation yields a windowed prediction signal above the persistence baseline. The quality should be read alongside the regime context (the wd=0 trajectory has different structure than wd \(=1,2\)), and the windowed DMD fit should not be uniformly treated as a reliable predictor across all training settings.

## Appendix J Portability to CIFAR-10 with a Tiny CNN

To address the concern that results are limited to MNIST MLPs and modular-addition Transformers, we apply the analysis pipeline to a Tiny CNN (two convolutional layers + global average pooling + linear head) trained on CIFAR-10. We use logits as the observable and sweep over 2 seeds \(\times\) 2 channel widths \(\{32,64\}\) \(\times\) 2 optimizers \(\{\text{AdamW},\text{SGD}\}\) for a total of 8 runs, each trained for 5 epochs.

Table [11](https://arxiv.org/html/2605.08237#A10.T11) reports the DMD summary. The method produces readable reconstruction residual and effective rank curves in this new setting. Wider networks (\(\text{channels}=64\)) exhibit lower peak residual, paralleling the width dependence observed for MNIST MLPs.

Table 11: CIFAR-10 Tiny CNN: DMD diagnostic summary. Each cell averages over 2 seeds.

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_cifar_peak_residual.png)
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/rebuttal/reviewer_cifar_optimizer_comparison.png)

Figure 14: CIFAR-10 Tiny CNN: DMD diagnostics across channel widths and optimizers. Left: peak residual by optimizer/channel configuration. Right: peak residual and peak-residual step by optimizer/channel, showing that wider channels yield lower residuals and earlier peak timing.

CIFAR-10 is included as a portability check only, not as a grokking diagnostic benchmark: the run pool is small (8 runs, 5 epochs), there is no grokking transition to localize in this setup, and no detection metric (TPR/FPR/AUROC) is reported. The pipeline runs end-to-end and yields a readable RR/effective-rank curve; we draw no detection or fragility conclusion from this section.

## Appendix K Perturbation fragility: full pool

This appendix details the perturbation\-fragility experiment summarized in §[3\.3](https://arxiv.org/html/2605.08237#S3.SS3)\.

#### Pool and protocol\.

On wd \(=1\) grokking baselines, identical multiplicative perturbations of magnitude \(\{0.005,0.01\}\) are applied to model parameters at high-RR vs. low-RR windows (4 high-RR runs and 5 low-RR runs at each scale, 21 runs in total counting the additional low-norm seeds). For each perturbation we record short-horizon accuracy deviation (mean accuracy gap between perturbed and unperturbed trajectories over the first 1000 steps after perturbation) and final accuracy.

#### Results\.

Mean short-horizon deviation: at scale 0.01, 0.090 (high-RR) vs. 0.029 (low-RR), a high/low ratio of \(\approx 3.1\times\); at scale 0.005, 0.107 vs. 0.034, a ratio of \(\approx 3.2\times\).

#### Boundary cases\.

One unrecoverable failure occurs in a high\-RR window near the transition \(consistent with the high\-RR\-as\-fragile reading\)\. A second unrecoverable failure occurs in a low\-RR window very early in training; we interpret this as a separate early\-training instability unrelated to the transition window and treat it as a boundary case rather than a low\-RR failure of the fragility hypothesis\.

The result quantifies sensitivity under matched noise; it does not establish that the residual is a causal driver of the transition\.

## Appendix L Discussion: method positioning

The supplementary experiments collectively support a narrow positioning of the distributional spectral diagnostics in the studied modular\-arithmetic Transformer settings:

1. **Window-level transition localization.** On the held-out test fold (Appendix [N](https://arxiv.org/html/2605.08237#A14)), the sustained_K2_tau10 rule attains AUROC ≈ 0.93, TPR = 0.80 at FPR = 0.50, and a median lead of 1068 steps on true-positive alarms (95% bootstrap CI [142, 2426]; Appendix [O](https://arxiv.org/html/2605.08237#A15)). Lead time is threshold-dependent and is always reported jointly with FPR.
2. **Window-level fragility under matched perturbations.** The sensitivity-window experiment (Appendix [K](https://arxiv.org/html/2605.08237#A11)) reports that high-RR windows show elevated short-horizon perturbation sensitivity relative to low-RR windows in wd=1 baselines under matched noise. The result quantifies sensitivity, not causal mechanism; one boundary-case low-RR early-training failure is reported alongside the main pool.
3. **Not a total-norm proxy at the window level.** The norm-window control (Appendix [Q](https://arxiv.org/html/2605.08237#A17)) re-labels the same perturbation runs by total-parameter-norm percentile and reverses the fragility ordering, indicating that the residual carries window-level information not captured by the total-norm signal. Norm-derived signals nevertheless remain strong run-level regime indicators on the same pool, and we do not claim RR universally outperforms norm baselines.
4. **AGOP as a parallel route under coverage constraints.** The AGOP comparison (Appendix [H](https://arxiv.org/html/2605.08237#A8)) shows qualitative co-occurrence between AGOP elevation and RR elevation in transition windows for the runs with sufficient checkpoint coverage. AGOP coverage on the N1 baseline-battle pool is one run, so we treat AGOP as corroborative under sufficient coverage rather than as a competitor; a fair head-to-head comparison would require a denser checkpoint schedule.
5. **Scope checks: weight decay, segment size, architecture.** The weight-decay sweep (Appendix [E](https://arxiv.org/html/2605.08237#A5)) shows that the qualitative ordering between regularization strength and onset timing remains visible across five settings; segment-size sensitivity (Appendix [G](https://arxiv.org/html/2605.08237#A7)) shows coarse conclusions are stable across {250, 500, 1000} steps while the fine ordering varies; the architecture ablation (Appendix [F](https://arxiv.org/html/2605.08237#A6)) shows the signal across the three tested Transformer scales but with shrinking residual amplitude at small scale, indicating model-scale sensitivity rather than parameterization-class behavior.
6. **Scope checks: portability and FCN.** CIFAR-10 (Appendix [J](https://arxiv.org/html/2605.08237#A10)) is a portability check that the pipeline runs end-to-end on a different task/architecture; it is not a grokking diagnostic benchmark, and no detection metric is reported there. FCN results (Appendix [T](https://arxiv.org/html/2605.08237#A20)) serve as a secondary low-residual regime descriptor, not a grokking diagnostic claim.

These results position the method as a window\-level monitoring and localization signal in the studied modular\-arithmetic Transformer settings, not as a universal early\-warning predictor, an architecture\-independent diagnostic, an automatic intervention rule, or a replacement for norm\-based regime classifiers\.

## Appendix M Onset criterion and labeling protocol

The grokking-vs-non-grokking label used by the §[3.2](https://arxiv.org/html/2605.08237#S3.SS2) detector requires (i) an onset definition and (ii) a step threshold separating grokking from early generalization.

#### Onset definition\.

For each run, grokking onset is the first step at which test accuracy crosses 99%. In the base pool (41 unique runs: 8 seeds × 5 weight-decay settings + 1 smoke run), this gives a mean wd=1 onset of 3398 steps with std 361 over 9 wd=1 runs; wd=2 runs reach the same threshold by step 1528 on average. The released code contains scripts to reproduce these aggregates.
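The onset rule amounts to a first-crossing search over the logged test-accuracy trajectory. A minimal sketch; the step grid and accuracy values are illustrative, not the paper's logs:

```python
import numpy as np

def grokking_onset(test_acc, steps, threshold=0.99):
    """First logged step at which test accuracy crosses the threshold;
    None if the run never generalizes."""
    acc = np.asarray(test_acc)
    idx = int(np.argmax(acc >= threshold))  # first True, or 0 if none
    if acc[idx] >= threshold:
        return steps[idx]
    return None

# Hypothetical grokking run: accuracy stays low, then jumps late.
steps = list(range(0, 6000, 100))
acc = [0.02 if s < 3400 else 0.995 for s in steps]
print(grokking_onset(acc, steps))  # → 3400
```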

#### Bimodal gap\.

The empirical onset distribution is clearly bimodal between wd=2 early-generalization runs and wd=1 grokking runs: the wd=2 pool's maximum onset is well below the wd=1 pool's minimum onset, and we choose the step threshold to fall in this gap. The threshold is data-driven (post hoc) but supported by the visible bimodal structure.

#### Pool overview\.

Figure [15](https://arxiv.org/html/2605.08237#A13.F15) shows representative test-accuracy and RR trajectories for non-grok, grok, and early-generalization runs in the base pool.

![Refer to caption](https://arxiv.org/html/2605.08237v1/figures/fig_e1_repr_trajectories.png)Figure 15: Representative test-accuracy and RR trajectories for non-grok, grok, and early-generalization regimes in the base pool (41 unique runs).

## Appendix N Threshold sweep and reused-seed split behavior

This section reports the full threshold sweep on the held-out test fold and documents the asymmetry between the held-out test split and a reused-seed split. The reused-seed split is reported as a sensitivity check, not as a successful calibration fold; the operating rule of §[3.1](https://arxiv.org/html/2605.08237#S3.SS1) is fixed by heuristic and sustainment, not by tuning on either split.

#### Test\-fold sweep\.

The test fold is fresh seeds 46–49 (n_grok = 5, n_non = 12, base rate ≈ 0.29; wd=2 early-generalization runs excluded). Lead and summaries are computed only over true-positive alarms.

Table 12: Detection threshold trade-off on the held-out test fold (n_grok = 5, n_non = 12). Median lead is computed only over true-positive alarms; the bracketed range for the selected sustained rule is the 95% bootstrap CI of the median (Appendix [O](https://arxiv.org/html/2605.08237#A15)). AUROC ≈ 0.93 (AUPRC ≈ 0.91) is computed on this same fold and applies to all rows (a single detector at different cuts).

| Rule | TPR | FPR | Median lead | 95% CI | Comment |
| --- | --- | --- | --- | --- | --- |
| *Instantaneous threshold rules* | | | | | |
| τ = 5× | 1.00 | 0.917 | 1774 | — | high recall |
| τ = 20× | 0.40 | 0.250 | — | — | high specificity |
| *Sustained operating rule* | | | | | |
| sustained_K2_tau10 (selected) | 0.80 | 0.500 | 1068 | [142, 2426] | moderate recall/FPR |
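Reading the rule name sustained_K2_tau10 as "K=2 consecutive windows above τ=10× a relative baseline", the sustained operating rule can be sketched as below. The baseline definition (median residual over an early fraction of the run) is our assumption for illustration, not the paper's exact rule:

```python
import numpy as np

def sustained_alarm(residuals, k=2, tau=10.0, baseline_frac=0.2):
    """First window index at which the residual exceeds tau times a relative
    baseline for k consecutive windows; None if the rule never fires.
    Baseline = median residual over an initial fraction of the run
    (an assumption; the paper's baseline may be defined differently)."""
    r = np.asarray(residuals, dtype=float)
    n_base = max(1, int(baseline_frac * len(r)))
    baseline = np.median(r[:n_base])
    above = r > tau * baseline
    run = 0
    for i, flag in enumerate(above):
        run = run + 1 if flag else 0
        if run >= k:
            return i - k + 1  # index where the sustained excursion began
    return None

rr = [1.0, 1.1, 0.9, 1.2, 14.0, 15.0, 13.0, 1.0]
print(sustained_alarm(rr))  # → 4
```

Instantaneous rules (the τ = 5× and τ = 20× rows) correspond to k=1; the sustainment requirement trades some lead for fewer spurious single-window alarms.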
#### Test\-fold ROC and lead\-time distribution\.

Figure [16](https://arxiv.org/html/2605.08237#A14.F16) shows the ROC for the residual detector on the test fold and the per-threshold distribution of lead times across grok runs (computed only over true-positive alarms).

![Refer to caption](https://arxiv.org/html/2605.08237v1/figures/fig_e2_roc.png)(a) ROC on test fold
![Refer to caption](https://arxiv.org/html/2605.08237v1/figures/fig_e2_leadbox.png)(b) Lead-time distribution

Figure 16: Detection on the held-out test fold (5 grok / 12 non-grok). (a) ROC for the residual detector. (b) Per-threshold distribution of lead times across grok runs, computed only over true-positive alarms.
#### Reused\-seed split behavior\.

On the reused-seed split (seeds 42–45; 5 grok / 11 non-grok), the same fixed sustained_K2_tau10 operating point fires no alarms, yielding TPR = 0 and FPR = 0. We report this as seed-split sensitivity rather than as evidence of calibrated threshold selection: the reused-seed split does not validate threshold calibration, and we therefore avoid using it to tune K or τ.

#### Precision–recall and a non\-grok rising\-residual case\.

Figure [17(a)](https://arxiv.org/html/2605.08237#A14.F17.sf1) shows the precision–recall curve on the same test fold. Figure [17(b)](https://arxiv.org/html/2605.08237#A14.F17.sf2) shows a non-grok run whose RR also rises but where no generalization follows; this motivates the relative sustained-threshold rule rather than an absolute residual cut.

![Refer to caption](https://arxiv.org/html/2605.08237v1/figures/fig_e2_pr.png)(a) Precision–recall on the test fold.
![Refer to caption](https://arxiv.org/html/2605.08237v1/figures/fig_e1_case_failure.png)(b) Non-grok seed with rising RR (no generalization).

Figure 17: Test-fold detection diagnostics. (a) AUPRC ≈ 0.91 on the test fold. (b) A non-grok run whose RR rises without producing generalization; sustained-threshold rules with relative baselines are intended to suppress such cases at the cost of some lead.

## Appendix O Uncertainty estimates for held-out detection

This appendix reports uncertainty estimates for the held-out detection results of §[3.2](https://arxiv.org/html/2605.08237#S3.SS2). AUROC and AUPRC intervals are computed by stratified bootstrap over runs, resampling grokking and non-grokking runs separately. TPR and FPR intervals use exact binomial intervals. Lead-time intervals are computed over true-positive alarms unless otherwise stated. The bootstrap unit is the run, not the window. These estimates quantify small-sample uncertainty for the held-out evaluation; they introduce no additional claims and are not used to reselect the fixed operating point.
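A stratified run-level bootstrap of this kind can be sketched as follows; the scores are synthetic, and `n_boot` and the percentile CI are illustrative choices rather than the paper's exact procedure:

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    """AUROC via the rank-sum identity: P(pos > neg) + 0.5 * P(tie)."""
    pos = np.asarray(scores_pos)[:, None]
    neg = np.asarray(scores_neg)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

def stratified_bootstrap_auroc(scores_pos, scores_neg,
                               n_boot=2000, alpha=0.05, seed=0):
    """Resample grokking and non-grokking runs separately (the run is the
    bootstrap unit) and return the point AUROC with a percentile CI."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        bp = rng.choice(scores_pos, size=len(scores_pos), replace=True)
        bn = rng.choice(scores_neg, size=len(scores_neg), replace=True)
        stats.append(auroc(bp, bn))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auroc(scores_pos, scores_neg), (float(lo), float(hi))

# Hypothetical run-level residual scores: 5 grok / 12 non-grok runs.
grok = [3.1, 2.8, 4.0, 2.2, 3.5]
non = [1.0, 1.2, 0.8, 2.5, 1.1, 0.9, 1.4, 1.3, 0.7, 1.6, 1.0, 3.0]
point, (lo, hi) = stratified_bootstrap_auroc(grok, non)
print(round(point, 3))  # → 0.95 for these synthetic scores
```

The wide CIs reported in Table 13 are the expected behavior of this estimator at n = 5 + 12 runs.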

Table 13: Uncertainty estimates for the held-out detection results. AUROC and AUPRC intervals are computed by stratified bootstrap over runs. TPR and FPR intervals use exact binomial intervals. Lead-time intervals are computed over true-positive alarms unless otherwise stated. The wide intervals reflect the small held-out split. The AUROC/AUPRC intervals support the run-level ranking value of RR, while the TPR/FPR intervals show that the fixed operating point has substantial uncertainty. The TP-only lead summarizes alarms that occur before grokking onset; the all-alarming lead additionally includes grokking runs whose alarms occur after onset, which explains the negative lower endpoint. We report these intervals as small-sample uncertainty quantification, not as evidence that the fixed operating point is statistically calibrated or validated.

## Appendix P Observable ablation

This appendix reports the per\-observable detection results referenced in §[3\.1](https://arxiv.org/html/2605.08237#S3.SS1)\.

For the 4 grok runs in the observable-ablation sub-pool, log-probability (19-dim) fires sustained_K2_tau10 alarms on 4/4 runs but only 2/4 before onset (true positive); the median true-positive lead is ≈601 steps. Logits (19-dim) fire alarms on 3/4 runs but 0/3 before onset; top-k (∼95-dim) and hidden (∼1900-dim) fire 0/4 alarms. The matched FPR for these lead numbers is the rule's test-fold value of 0.50 from §[3.2](https://arxiv.org/html/2605.08237#S3.SS2); per-observable FPR on this sub-pool is not separately reported. As in §[3.1](https://arxiv.org/html/2605.08237#S3.SS1) and the Discussion, the present implementation is best suited to scalar empirical observables represented by one-dimensional quantile coordinates, so the failure of the top-k and hidden-state variants under the same DMD configuration should be interpreted as a limitation of this scalar-distribution implementation and fixed DMD setup, not as evidence that those observables lack useful information.
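The scalar-quantile pipeline these observables feed can be sketched as: summarize each checkpoint's per-example values (e.g. log-probabilities) as a fixed quantile vector, stack delayed copies into a Hankel matrix, fit a low-rank one-step linear model, and report the relative reconstruction residual. The delay count, DMD rank, and synthetic series below are illustrative, not the paper's configuration:

```python
import numpy as np

def quantile_coords(samples, n_q=19):
    """Map an empirical scalar distribution to fixed quantile coordinates;
    for 1-D distributions, L2 distance between quantile functions recovers
    Wasserstein-2 geometry."""
    qs = np.linspace(0.05, 0.95, n_q)
    return np.quantile(np.asarray(samples, dtype=float), qs)

def hankel_dmd_residual(series, delays=4, rank=6):
    """Relative one-step reconstruction residual of a rank-truncated
    Hankel DMD fit. `series` is (T, d): one quantile vector per checkpoint."""
    z = np.asarray(series, dtype=float)
    T, d = z.shape
    # Stack `delays` consecutive snapshots into each Hankel column.
    cols = [z[i:i + delays].ravel() for i in range(T - delays + 1)]
    H = np.stack(cols, axis=1)                  # (delays*d, T-delays+1)
    X, Y = H[:, :-1], H[:, 1:]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = min(rank, int(np.sum(s > 1e-10)))       # truncate to numerical rank
    U, s, Vt = U[:, :r], s[:r], Vt[:r]
    A = Y @ Vt.T @ np.diag(1.0 / s) @ U.T       # rank-r DMD operator
    return float(np.linalg.norm(Y - A @ X) / np.linalg.norm(Y))

rng = np.random.default_rng(0)
# Each "checkpoint" yields per-example log-probs; summarize as 19 quantiles.
series = np.stack([quantile_coords(rng.normal(-2 + 0.01 * t, 1.0, size=500))
                   for t in range(30)])
res = hankel_dmd_residual(series)
print("relative residual:", round(res, 4))
```

Windows where the linear model stops explaining the distributional dynamics (high residual) are the candidate transition windows.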

![Refer to caption](https://arxiv.org/html/2605.08237v1/figures/fig_n3_lead_box.png)Figure 18: N3 observable ablation under the selected sustained_K2_tau10 rule, on the four grok runs of the sub-pool. Left: alarm outcome per observable, stacked over the four runs: green = TP (alarm before onset), orange = FN (alarm after onset), gray = no alarm fired. Log-probability fires alarms on all four runs and lands two of them before onset (TP); logits fires once and lands after onset (FN); top-k and hidden produce no alarms under the current DMD configuration. Right: lead time of the TP alarms; only log-probability has TPs, with leads of 142 and 1062 steps (median ≈601). Under this rule and DMD setup, log-probability is thus the only scalar-distribution observable that yields useful pre-onset signal on this sub-pool; we read this as a limitation of the current scalar-quantile implementation, not as evidence that the other observables lack information.
## Appendix Q Norm and logit-scale baselines

We compare the residual against norm\-derived signals at two scales: run\-level \(max\-of\-trajectory\) discrimination and window\-level fragility under matched perturbations\.

#### Pool\.

23 runs enter the pipeline; 22 have finite ROC scores under all signals on the shared pool.

#### Run\-level AUROC\.

Under run-level max-of-trajectory scoring on this pool, total parameter norm gives AUROC = 0.9829, |Δnorm| gives 0.9915, and RR (run-level max) gives 0.4103. Norm-derived signals are therefore strong run-level regime indicators on this pool, while RR's strength is at the window level (next paragraph and Table [14](https://arxiv.org/html/2605.08237#A17.T14)).
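Run-level max-of-trajectory scoring reduces each run to a single score before computing AUROC over runs. A minimal sketch with hypothetical norm trajectories (labels: 1 = grok, 0 = non-grok):

```python
import numpy as np

def run_level_auroc(trajectories, labels):
    """Score each run by the max of its signal trajectory, then compute
    AUROC over runs via the pairwise rank-sum identity."""
    scores = np.array([np.max(t) for t in trajectories])
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Hypothetical trajectories: grokking runs reach a larger total norm.
trajs = [[10, 40, 90], [12, 35, 80], [11, 20, 30], [9, 18, 25]]
labels = [1, 1, 0, 0]
print(run_level_auroc(trajs, labels))  # → 1.0
```

Because this scoring discards when the maximum occurs, a signal can rank runs almost perfectly (as the norm baselines do here) while carrying no window-level localization information.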

#### Fair\-FPR temporal\-alarm protocol\.

For the fair-FPR norm comparison, each signal's threshold is selected separately to match the target FPR on the reused-seed split; these thresholds are not the sustained_K2_tau10 operating rule used in the main RR experiment of §[3.2](https://arxiv.org/html/2605.08237#S3.SS2). This is a baseline-comparison protocol, not the selected operating rule, and the reused-seed split is used here only to set comparison thresholds, not as a successful calibration fold. With an FPR target of 0.50 on the reused-seed split, reported on the test fold (n_test_grok = 4, n_test_non_grok = 5): under this fair-FPR protocol, RR triggers TPR = 0.75 with a median lead of 273.5 steps, while norm_N_total triggers TPR = 0 (no alarms fire). A fully fair head-to-head detection comparison would require per-signal threshold recalibration tailored to each signal's dynamic range and is left for future work.

#### Window\-control contrast\.

The same 21 perturbation runs (17 reused from §[3.3](https://arxiv.org/html/2605.08237#S3.SS3) plus 4 new low-norm s=5250 perturbations on seeds 46/49) are re-labeled by total-parameter-norm percentile rather than RR percentile. The aggregator below reports short-horizon deviation in accuracy on a 0–100 scale, so values are 100× those in the §[3.3](https://arxiv.org/html/2605.08237#S3.SS3) reporting.

Table 14: Window-control contrast (21 runs, perturbation scale 0.01). Re-labeling the same pool by RR percentile vs. total-parameter-norm percentile reverses the fragility ordering. Aggregator: short-horizon deviation in accuracy on a 0–100 scale.
#### Interpretation\.

Norm-derived signals are strong run-level regime indicators, confirming that scale dynamics are informative for grokking. In the same-data window-control experiment, perturbation sensitivity aligns with the residual ordering rather than the total-norm ordering, suggesting that RR is not merely a total-norm proxy at the window level in the studied wd=1 dynamics. RR therefore provides a complementary window-level fragility diagnostic on this pool. We do not claim RR universally outperforms norm baselines, and the fair-FPR comparison above is a baseline-comparison protocol distinct from the selected operating rule of §[3.2](https://arxiv.org/html/2605.08237#S3.SS2).

## Appendix R Triggered intervention

We test whether RR\-triggered control improves training beyond passive monitoring\.

Three strategies apply a learning-rate halving event: (a) at a fixed step (4000); (b) at a uniformly random step within a fixed window; (c) when the RR alarm fires (rr_trigger, using the sustained_K2_tau10 rule). On wd=1 runs (10 runs per strategy), the rr_trigger strategy prevents grokking in 0.10 of runs; the fixed strategy lands an average of 773 steps after onset (a negative lead, i.e. too late). The fixed and random strategies produce a zero failure rate but do not reliably accelerate grokking either; the rr_trigger variant gives at most a marginal speedup at the cost of a non-zero failure rate. Monitoring does not directly imply beneficial control.
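The three strategies differ only in how the learning-rate-halving step is chosen; a schematic sketch (strategy names follow the text, while the window bounds are illustrative assumptions):

```python
import random

def intervention_step(strategy, rr_alarm_step=None, fixed_step=4000,
                      window=(3000, 5000), rng=random):
    """Choose the step at which the learning rate is halved, for the three
    trigger strategies compared in this appendix (a sketch)."""
    if strategy == "fixed":
        return fixed_step                  # strategy (a)
    if strategy == "random":
        return rng.randint(*window)        # strategy (b): uniform in window
    if strategy == "rr_trigger":
        return rr_alarm_step               # strategy (c); None if no alarm
    raise ValueError(f"unknown strategy: {strategy}")

print(intervention_step("fixed"))                           # → 4000
print(intervention_step("rr_trigger", rr_alarm_step=2330))  # → 2330
```

The failure mode reported below arises only in the rr_trigger branch: halving the learning rate exactly inside a high-residual window can stall the transition that the alarm was flagging.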

![Refer to caption](https://arxiv.org/html/2605.08237v1/figures/fig_n2_outcomes.png)Figure 19: Outcome distribution by trigger strategy (fixed/random/rr_trigger). RR-triggered intervention introduces a 10% failure mode that the fixed and random strategies do not exhibit.
## Appendix S Task-family transfer

The signal is evaluated on related modular\-arithmetic settings as a transfer check\.

#### Modular\-addition fraction/prime sweep\.

Sweeping the training fraction and modulus prime size yields partial transfer with explicit failure cases under the same operating rule\.

#### Modular multiplication\.

The same sustained_K2_tau10 rule triggers on wd=1 modular-multiplication runs, but with shorter lead than on modular addition.

We emphasize that this is preliminary task-family transfer evidence, not clean cross-task robustness, and we pair every transfer claim with its FPR before re-stating lead time once the relevant rows are added to the index.

## Appendix T FCN secondary validation

For FCNs of varying width on MNIST, we treat the residual and effective rank as low-residual regime descriptors (a secondary use), not as the primary detection signal. The FCN training details (Table [5](https://arxiv.org/html/2605.08237#A4.T5)), randomized shuffle control, and qualitative seed/initialization/activation robustness checks are reported as part of Section [D](https://arxiv.org/html/2605.08237#A4) above. Figure [20](https://arxiv.org/html/2605.08237#A20.F20) reports the stage-wise reconstruction residual and effective rank across widths under SGD and AdamW.

![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/Reconstruction_error_comparison.png)(a) Reconstruction residuals for FCNs
![Refer to caption](https://arxiv.org/html/2605.08237v1/Chapters/Figures/Effective_rank_comparison.png)(b) 99% effective rank for FCNs

Figure 20: FCN secondary check: stage-wise reconstruction residuals and 99% effective rank for FCNs of widths h ∈ {5, 40, 256, 1024} under SGD (solid) and AdamW (dashed). Wider networks tend to give smaller residuals and lower effective rank in the chosen observation space; AdamW tends to give lower effective rank than SGD at matched settings. The qualitative observations are: spectral structure varies with width under both SGD and AdamW; wider networks tend to exhibit smaller reconstruction residuals and lower effective rank across stages; AdamW tends to yield lower effective rank than SGD under matched settings; and the spectra are stable under small multiplicative initialization perturbations and across seeds. Among the tested factors in this FCN setting, width produces the clearest observed differences in training dynamics. We do not interpret these FCN results as a grokking diagnostic claim; they are included as a low-residual regime check and as a comparison point with prior Koopman-spectral analyses on FCN training \[Redman et al., [2024](https://arxiv.org/html/2605.08237#bib.bib20)\].
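One common reading of "99% effective rank", which we assume here, is the smallest number of singular values whose cumulative squared energy reaches 99% of the total:

```python
import numpy as np

def effective_rank_99(matrix, energy=0.99):
    """Smallest k such that the top-k singular values capture `energy`
    of the total squared spectral energy (our reading of '99% effective
    rank'; the paper's exact definition may differ)."""
    s = np.linalg.svd(np.asarray(matrix, dtype=float), compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)

# A matrix with one dominant direction has effective rank 1 at 99% energy.
u = np.ones((8, 1))
low_rank = u @ u.T + 1e-6 * np.eye(8)
print(effective_rank_99(low_rank))  # → 1
```

Under this definition, the "wider networks give lower effective rank" observation means the fitted dynamics concentrate in fewer spectral directions as width grows.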

## Appendix U Extended related work and positioning

This appendix expands the brief positioning given in the introduction\.

#### Mechanistic and norm\-based accounts of grokking\.

Since the original observation of delayed generalization on algorithmic tasks \[Power et al., [2022](https://arxiv.org/html/2605.08237#bib.bib21)\], several lines of work have sought to explain grokking. Mechanistic interpretability identifies the emergence of specific computational circuits, such as Fourier features for modular addition \[Nanda et al., [2023](https://arxiv.org/html/2605.08237#bib.bib34)\] and circuit-efficiency tradeoffs \[Varma et al., [2023](https://arxiv.org/html/2605.08237#bib.bib2)\], and yields progress measures that track circuit formation. Implicit-bias accounts attribute grokking to late-phase norm minimization on the zero-loss manifold \[Liu et al., [2022](https://arxiv.org/html/2605.08237#bib.bib4), Lyu et al., [2024](https://arxiv.org/html/2605.08237#bib.bib5), Musat, [2026](https://arxiv.org/html/2605.08237#bib.bib22)\]; stability-based accounts link it to logit scaling and softmax collapse \[Thilak et al., [2022](https://arxiv.org/html/2605.08237#bib.bib6), Prieto et al., [2025](https://arxiv.org/html/2605.08237#bib.bib51)\]. These works primarily address why grokking occurs and yield signals tied to specific circuits or mechanisms. Our reconstruction residual addresses a different question, window-level transition localization from chosen task-dependent distributional observables, and is intended to complement rather than substitute for these mechanistic accounts.

#### Gradient\-based diagnostics\.

A separate line of work tracks training through gradient-derived quantities. The Average Gradient Outer Product (AGOP) \[Radhakrishnan et al., [2024](https://arxiv.org/html/2605.08237#bib.bib3)\] has been used to study feature emergence and grokking-related transitions, providing another route to identifying transition windows. We view our residual as complementary: it is derived from windowed distributional dynamics rather than from gradient outer products, and is intended as a window-level monitoring signal rather than a mechanism-specific progress measure. In our setup, sparse checkpoint coverage prevents a fair head-to-head comparison; AGOP appears in Appendix [H](https://arxiv.org/html/2605.08237#A8) as corroborative rather than competitive evidence.

#### Spectral and geometric diagnostics of training\.

A large body of work characterizes training through curvature spectra \[Sagun et al., [2018](https://arxiv.org/html/2605.08237#bib.bib7), Papyan, [2019](https://arxiv.org/html/2605.08237#bib.bib8), Ghorbani et al., [2019](https://arxiv.org/html/2605.08237#bib.bib9)\], gradient confinement to top Hessian subspaces \[Gur-Ari et al., [2018](https://arxiv.org/html/2605.08237#bib.bib27)\], and the edge-of-stability phenomenon in which the top Hessian eigenvalue saturates near 2/η \[Cohen et al., [2022](https://arxiv.org/html/2605.08237#bib.bib10), Damian et al., [2023](https://arxiv.org/html/2605.08237#bib.bib11)\]. Weight-matrix spectra have been studied via random-matrix theory and heavy-tailed self-regularization \[Pennington and Bahri, [2017](https://arxiv.org/html/2605.08237#bib.bib12), Martin and Mahoney, [2018](https://arxiv.org/html/2605.08237#bib.bib13)\]. Trajectory-based analyses report low-dimensional structure of parameter evolution \[Li et al., [2018](https://arxiv.org/html/2605.08237#bib.bib45), Aghajanyan et al., [2020](https://arxiv.org/html/2605.08237#bib.bib46), Mao et al., [2024](https://arxiv.org/html/2605.08237#bib.bib48), Li et al., [2021](https://arxiv.org/html/2605.08237#bib.bib47)\], correlated dynamics \[Brokman et al., [2024](https://arxiv.org/html/2605.08237#bib.bib28)\], and distinct lazy/rich regimes \[Jacot et al., [2020](https://arxiv.org/html/2605.08237#bib.bib25), Chizat et al., [2020](https://arxiv.org/html/2605.08237#bib.bib29), Lee et al., [2020](https://arxiv.org/html/2605.08237#bib.bib26), Woodworth et al., [2020](https://arxiv.org/html/2605.08237#bib.bib30)\], with the precise scaling determining whether feature learning occurs \[Yang and Hu, [2022](https://arxiv.org/html/2605.08237#bib.bib14)\].
Our diagnostic operates on the empirical distribution of a chosen observable rather than on raw parameters or curvature, and is computed over windows rather than as instantaneous summaries; for output\-distribution observables the construction does not depend on hidden\-unit indexing\.

#### Koopman and DMD approaches to neural training\.

Koopman-operator methods \[Schmid, [2010](https://arxiv.org/html/2605.08237#bib.bib23), Rowley et al., [2009](https://arxiv.org/html/2605.08237#bib.bib24), H. Tu et al., [2014](https://arxiv.org/html/2605.08237#bib.bib15), Arbabi and Mezić, [2017](https://arxiv.org/html/2605.08237#bib.bib16), Brunton et al., [2017](https://arxiv.org/html/2605.08237#bib.bib19), Drmač, 2017\] have been applied to analyze and accelerate neural training \[Dogra and Redman, [2020](https://arxiv.org/html/2605.08237#bib.bib41), Tano et al., [2020](https://arxiv.org/html/2605.08237#bib.bib42), Luo et al., [2024](https://arxiv.org/html/2605.08237#bib.bib43)\] and to detect equivalent training dynamics through Koopman conjugacy \[Redman et al., [2024](https://arxiv.org/html/2605.08237#bib.bib20)\]. The most closely related of these, Redman et al. \[[2024](https://arxiv.org/html/2605.08237#bib.bib20)\], uses Koopman spectra to test global, full-trajectory equivalence between training runs. Our use of windowed Hankel DMD \[Črnjarić-Žic et al., [2019](https://arxiv.org/html/2605.08237#bib.bib17)\] on distributional observables targets a different scale of analysis: window-level transition localization within a single trajectory, rather than equivalence testing across full trajectories.

#### Distributional view and Wasserstein geometry\.

Mean-field analyses model training as a Wasserstein gradient flow on the empirical parameter distribution \[Chizat and Bach, [2018](https://arxiv.org/html/2605.08237#bib.bib37), Mei et al., [2018](https://arxiv.org/html/2605.08237#bib.bib38), Rotskoff and Vanden-Eijnden, [2022](https://arxiv.org/html/2605.08237#bib.bib39), Sirignano and Spiliopoulos, [2019](https://arxiv.org/html/2605.08237#bib.bib40), Gess et al., [2023](https://arxiv.org/html/2605.08237#bib.bib52)\]. These results require specific parameterizations and asymptotic regimes that do not directly apply to our finite-width, finite-step setting. We therefore use Wasserstein geometry and the tangent-space identification at a reference measure \[Villani, [2009](https://arxiv.org/html/2605.08237#bib.bib1)\] as a descriptive coordinate system for distribution-valued observables, and treat the choice of observable as task-dependent. We make no asymptotic claim, and our framework does not depend on a mean-field limit being attained.
