The Discrete-Log Clock: How a Transformer Learns Modular Multiplication
Summary
This paper demonstrates that when transformers grok modular multiplication, the dense Fourier spectrum observed in previous work is an artifact of using the additive Fourier transform; using the multiplicative character transform reveals a sparse representation, leading to a reverse-engineered 'Discrete-Log Clock' algorithm analogous to the clock algorithm for modular addition.
View Cached Full Text
Cached at: 06/17/26, 05:37 AM
# 1 Introduction
Source: [https://arxiv.org/html/2606.17399](https://arxiv.org/html/2606.17399)
marginparsep has been altered\. topmargin has been altered\. marginparpush has been altered\. The page layout violates the ICML style\.Please do not change the page layout, or include packages like geometry, savetrees, or fullpage, which change it for you\. We’re not able to reliably undo arbitrary changes to the style\. Please remove the offending package\(s\), or layout\-changing commands and try again\.
The Discrete\-Log Clock: How a Transformer Learns Modular Multiplication
Huu Danh Nguyen1
††footnotetext:1Stanford University\. Correspondence to: Huu Danh Nguyen <dannycodeinc@gmail\.com\>\.
Mechanistic Interpretability Workshop at the43rd\\mathit\{43\}^\{rd\}International Conference on Machine Learning, Seoul, South Korea, 2026\. Copyright 2026 by the author\(s\)\.###### Abstract
When small transformers grok modular multiplication, prior work reports that the learned embedding has a “dense” Fourier spectrum requiring all frequencies\. This contrasts with modular addition, where only a sparse set of key frequencies suffices\. We show this density is an artifact of analyzing in the wrong basis\. The natural Fourier transform for multiplication is not the standard additive DFT but the*multiplicative character transform*, which decomposes functions on the multiplicative group\(ℤ/pℤ\)∗\(\\mathbb\{Z\}/p\\mathbb\{Z\}\)^\{\*\}into its irreducible representations\. Applying this transform to a grokked transformer trained ona⋅bmod113a\\cdot b\\bmod 113, we find the embedding spectrum becomes highly sparse \(Gini coefficient 0\.58 vs\. 0\.07 in the additive basis\) with only 4 key frequencies carrying significant energy\. Furthermore, 96\.9% of MLP neurons are cleanly tuned to a single multiplicative frequency, and neuron activation heatmaps reveal 2D\-periodic structure when reordered by the discrete logarithm\. These results demonstrate the transformer reduces multiplication to addition in discrete\-log space, implementing a “Discrete\-Log Clock” algorithm analogous to Nanda et al\.’s Clock algorithm for addition\. The methodology generalizes: matching the analysis basis to the algebraic structure of the task reveals interpretable structure where standard tools see noise\.
Grokking, the phenomenon where neural networks suddenly generalize long after memorizing training data, has become a key testbed for mechanistic interpretability\(Poweret al\.,[2022](https://arxiv.org/html/2606.17399#bib.bib1)\)\. For modular addition \(a\+bmodpa\+b\\bmod p\), the internal algorithm has been fully reverse\-engineered: the model learns a sparse Fourier representation and applies trigonometric identities to compose rotations on a circle\(Nandaet al\.,[2023](https://arxiv.org/html/2606.17399#bib.bib2); Zhonget al\.,[2023](https://arxiv.org/html/2606.17399#bib.bib3)\)\.
For modular multiplication \(a⋅bmodpa\\cdot b\\bmod p\), models also grok successfully, but the internal algorithm has remained poorly characterized\.Doshiet al\.\([2024](https://arxiv.org/html/2606.17399#bib.bib4)\)derived analytical MLP solutions requiring dense Fourier components \(all frequencies\), andFurutaet al\.\([2024](https://arxiv.org/html/2606.17399#bib.bib5)\)confirmed experimentally that grokked transformers use cosine\-biased components across all frequencies\. These findings suggest multiplication uses a fundamentally different, less interpretable algorithm than addition\.
Our contribution\.We show this conclusion is premature\. The “dense spectrum” is an artifact of analyzing with the*additive*Fourier transform, which is the natural basis for addition but not for multiplication\. The correct analysis tool is the*multiplicative character transform*, which is the Fourier transform on the multiplicative group\(ℤ/pℤ\)∗\(\\mathbb\{Z\}/p\\mathbb\{Z\}\)^\{\*\}\. In this basis, the embedding becomes sparse, the neurons cluster cleanly by frequency, and the algorithm is interpretable: It reduces multiplication to addition via the discrete logarithm\.
Task formulation\.The input to our model is a pair of integers\(a,b\)∈\{1,…,112\}2\(a,b\)\\in\\\{1,\\ldots,112\\\}^\{2\}\. We train a 1\-layer transformer to output the predicted productc=a⋅bmod113c=a\\cdot b\\bmod 113\. The model observes 30% of all input pairs during training and must generalize to the remaining 70%\.
## 2Related Work
Mechanistic interpretability of modular addition\.Nandaet al\.\([2023](https://arxiv.org/html/2606.17399#bib.bib2)\)fully reverse\-engineered the algorithm learned by transformers on modular addition, showing the model uses discrete Fourier transforms and trigonometric identities to compose rotations\.Zhonget al\.\([2023](https://arxiv.org/html/2606.17399#bib.bib3)\)named this the “Clock” algorithm and discovered an alternative “Pizza” algorithm, demonstrating that multiple algorithmic solutions exist for the same task\. These works establish the methodology we extend to multiplication\.
Prior work on modular multiplication\.Doshiet al\.\([2024](https://arxiv.org/html/2606.17399#bib.bib4)\)derived analytical closed\-form MLP weights for modular multiplication and found that the solutions contain the discrete logarithmlogg\(i\)\\log\_\{g\}\(i\)in the weight structure\. However, they used quadratic activations and did not connect this to the empirical Fourier picture in trained ReLU transformers\.Furutaet al\.\([2024](https://arxiv.org/html/2606.17399#bib.bib5)\)trained transformers on modular polynomials \(including multiplication\) and reported that the learned spectrum uses “all frequencies” with no clear sparse structure, in contrast to addition\. Our work resolves this discrepancy: the density is a basis artifact, not an algorithmic difference\.
Group\-theoretic frameworks\.Chughtaiet al\.\([2023](https://arxiv.org/html/2606.17399#bib.bib6)\)showed that networks trained on group operations learn irreducible representations of the relevant group\.Standeret al\.\([2024](https://arxiv.org/html/2606.17399#bib.bib8)\)reverse\-engineered networks on permutation group multiplication \(S5, S6\), finding coset\-based circuits\.McCrackenet al\.\([2025](https://arxiv.org/html/2606.17399#bib.bib7)\)unified all known addition algorithms under an “approximate CRT” framework\. These works study different groups \(non\-abelian permutation groups, additive cyclic groups\) from ours; our contribution is applying the multiplicative character basis specifically to\(ℤ/pℤ\)∗\(\\mathbb\{Z\}/p\\mathbb\{Z\}\)^\{\*\}in a trained transformer\.
## 3Methods
### 3\.1Task and Architecture
We train a 1\-layer decoder\-only transformer ona⋅bmod113a\\cdot b\\bmod 113for alla,b∈\{1,…,112\}a,b\\in\\\{1,\\ldots,112\\\}\(excluding zero, which lies outside the multiplicative group\)\. The architecture followsNandaet al\.\([2023](https://arxiv.org/html/2606.17399#bib.bib2)\):dmodel=128d\_\{\\text\{model\}\}=128, 4 attention heads \(dhead=32d\_\{\\text\{head\}\}=32\), MLP hidden dimension 512 with ReLU activation, no LayerNorm\. Input format is the token sequence\[a,b,=\]\[a,b,\{=\}\]; the model’s prediction is read from the output logits at position 2 \(“=\{=\}”\)\.
Hyperparameters\.We use 30% training data \(3,763 of 12,544 pairs\), AdamW with learning rate10−310^\{\-3\}, weight decay 1\.0, and train for 40,000 epochs with full\-batch gradient descent\. These hyperparameters exactly followNandaet al\.\([2023](https://arxiv.org/html/2606.17399#bib.bib2)\)to ensure comparability; the only change is the task \(multiplication vs\. addition\)\.
Training dynamics\.The model memorizes the training set by epoch∼\\sim500 \(train loss→0\\to 0, test loss remains high\)\. Between epochs 9,000 and 14,000, test loss drops suddenly to near zero: the model groks, achieving near\-perfect generalization \(Figure[1](https://arxiv.org/html/2606.17399#S3.F1)\)\. This delayed generalization is characteristic of grokking: the model first overfits, then discovers the generalizing algorithm under pressure from weight decay\.
Figure 1:Training curve fora⋅bmod113a\\cdot b\\bmod 113\. The model memorizes quickly \(train loss drops by epoch 500\), then generalizes suddenly during grokking \(epochs 9K–14K\)\. Weight decay drives the transition from memorization to the structured algorithm\.
### 3\.2The Multiplicative Group and Discrete Logarithm
Since 113 is prime, the nonzero integers\{1,…,112\}\\\{1,\\ldots,112\\\}form a cyclic group under multiplication mod 113\. A*primitive root*is a generator: an elementggsuch that\{g0,g1,…,g111\}mod113\\\{g^\{0\},g^\{1\},\\ldots,g^\{111\}\\\}\\bmod 113exhausts all 112 elements\. Forp=113p=113, we useg=3g=3\.
The*discrete logarithm*logg:\{1,…,112\}→\{0,…,111\}\\log\_\{g\}\\colon\\\{1,\\ldots,112\\\}\\to\\\{0,\\ldots,111\\\}assigns each element its exponent:3α≡a\(mod113\)3^\{\\alpha\}\\equiv a\\pmod\{113\}meanslog3\(a\)=α\\log\_\{3\}\(a\)=\\alpha\. We writeα=logg\(a\)\\alpha=\\log\_\{g\}\(a\)andβ=logg\(b\)\\beta=\\log\_\{g\}\(b\)for the two inputs\. This is an exact bijection \(a permutation of the 112 elements\) that defines a group isomorphism:
logg:\(ℤ/pℤ\)∗→≅ℤ/\(p−1\)ℤ\\log\_\{g\}:\(\\mathbb\{Z\}/p\\mathbb\{Z\}\)^\{\*\}\\xrightarrow\{\\;\\cong\\;\}\\mathbb\{Z\}/\(p\{\-\}1\)\\mathbb\{Z\}\(1\)Under this map, multiplication becomes addition:
a⋅b≡3\(α\+β\)mod112\(mod113\)a\\cdot b\\equiv 3^\{\(\\alpha\+\\beta\)\\bmod 112\}\\pmod\{113\}\(2\)In the relabeled coordinates, the multiplication table*is*the addition table mod 112\.
### 3\.3Two Fourier Bases
TheadditiveFourier basis consists ofsin\(2πka/q\)\\sin\(2\\pi ka/q\)andcos\(2πka/q\)\\cos\(2\\pi ka/q\)fork=1,…,q/2k=1,\\ldots,q/2, whereq=112q=112\. This is the standard DFT, natural for functions periodic in the integer labelaa\.
Themultiplicative character basisusessin\(2πklogg\(a\)/q\)\\sin\(2\\pi k\\log\_\{g\}\(a\)/q\)andcos\(2πklogg\(a\)/q\)\\cos\(2\\pi k\\log\_\{g\}\(a\)/q\)\. Operationally: reorder the embedding rows by the discrete\-log map, then take the standard DFT\. This is the Fourier transform on the group\(ℤ/pℤ\)∗\(\\mathbb\{Z\}/p\\mathbb\{Z\}\)^\{\*\}, decomposing functions into multiplicative characters\.
The motivation is direct: since multiplication becomes addition after relabeling \(Eq\.[2](https://arxiv.org/html/2606.17399#S3.E2)\), a model that has learned the group structure should have embeddings periodic in the relabeled coordinates, just as addition models have embeddings periodic in the original coordinates\.
### 3\.4Analysis Methodology
We apply the following pipeline to the trained model:
Step 1: Embedding spectrum\.Extract the trained embedding matrixWE∈ℝ112×128W\_\{E\}\\in\\mathbb\{R\}^\{112\\times 128\}\. Project onto both bases and compute the combined frequency norm‖fk‖=‖sk‖2\+‖ck‖2\\\|f\_\{k\}\\\|=\\sqrt\{\\\|s\_\{k\}\\\|^\{2\}\+\\\|c\_\{k\}\\\|^\{2\}\}for each frequencykk, wheresks\_\{k\}andckc\_\{k\}are the sin and cos projections \(each a 128\-dim vector\)\.
Step 2: Sparsity metrics\.We measure concentration using the Gini coefficient:
G=∑i=1n\(2i−n−1\)\|xi\|n∑i=1n\|xi\|G=\\frac\{\\sum\_\{i=1\}^\{n\}\(2i\-n\-1\)\|x\_\{i\}\|\}\{\\,n\\sum\_\{i=1\}^\{n\}\|x\_\{i\}\|\}\(3\)wherex1≤…≤xnx\_\{1\}\\leq\\ldots\\leq x\_\{n\}are the sorted frequency norms\.G=0G=0means uniform \(dense\),G=1G=1means maximally sparse\. Key frequencies are detected as those exceeding5×5\\timesthe median norm\.
Step 3: Neuron frequency assignment\.For each of 512 MLP neurons, compute the 2D activation patternhn\(a,b\)h\_\{n\}\(a,b\)over all input pairs \(reshaped to a112×112112\\times 112grid\)\. Take the 2D DFT in each basis and measure the maximum fraction of energy at any single frequency\. Neurons with\>\>85% energy at one frequency are classified as “single\-frequency tuned\.”
Step 4: SVD analysis\.Perform SVD onWEW\_\{E\}and reorder the principal components by discrete logarithm\. Sinusoidal structure in log\-order confirms the embedding encodes multiplicative characters\.
## 4Experiments and Results
### 4\.1Basis Comparison
Figure[2](https://arxiv.org/html/2606.17399#S4.F2)shows the embedding spectrum in both bases\. In the additive basis, energy spreads uniformly \(Gini = 0\.071\)\. In the multiplicative basis, energy concentrates in 4 peaks at frequencies\{2,8,47,56\}\\\{2,8,47,56\\\}\(Gini = 0\.579, an 8\.1×\\timesincrease\)\. Table[1](https://arxiv.org/html/2606.17399#S4.T1)summarizes all metrics\.
Figure 2:Embedding Fourier spectrum\.Left:additive basis shows flat, dense spectrum\.Right:multiplicative basis reveals 4 sparse peaks\. Same y\-axis scale\. The “density” reported by prior work is a basis artifact\.Table 1:Sparsity metrics comparing the two bases\.
### 4\.2Neuron\-Level Confirmation
In the multiplicative basis, 496 out of 512 neurons \(96\.9%\) have\>\>85% of their Fourier energy at a single frequency \(Figure[3](https://arxiv.org/html/2606.17399#S4.F3)\)\. In the additive basis, zero neurons pass this threshold\.
When neuron activation heatmaps are reordered by discrete logarithm, clear diagonal\-stripe patterns emerge \(Figure[4](https://arxiv.org/html/2606.17399#S4.F4)\)\. These stripes indicate the neuron computes a periodic function ofα\+β\\alpha\+\\beta, which is the signature of the trigonometric identitycos\(k\(α\+β\)\)\\cos\(k\(\\alpha\+\\beta\)\)operating in log\-space: the same mechanism Nanda et al\. found for addition, but applied after the discrete\-log transformation\.
Figure 3:Sorted maximum frequency fraction per neuron\. Orange: multiplicative basis \(96\.9% above 85% threshold\)\. Blue: additive basis \(0% above threshold\)\.Figure 4:Neuron heatmaps in integer order \(left\) vs\. discrete\-log order \(right\)\. In log\-order, diagonal stripes emerge, indicating periodicity inlogg\(a\)\+logg\(b\)\\log\_\{g\}\(a\)\+\\log\_\{g\}\(b\)\.
### 4\.3SVD in Log\-Order
We perform SVD onWEW\_\{E\}and reorder principal components by discrete logarithm\. In integer order, all components appear noisy\. In log\-order, several components reveal sinusoidal structure \(Figure[5](https://arxiv.org/html/2606.17399#S4.F5)\), consistent with the embedding encoding multiplicative characters\. The effective rank is 10\.3 \(out of 112\), confirming the embedding lives in a low\-dimensional subspace\.
Figure 5:Principal components ofWEW\_\{E\}reordered by discrete logarithm show sinusoidal structure, confirming the embedding encodes multiplicative characters\.
### 4\.4Discussion
The Discrete\-Log Clock algorithm\.Our results show the transformer solves modular multiplication by: \(1\) embedding each integeraaas sinusoidal functions oflogg\(a\)\\log\_\{g\}\(a\); \(2\) routing embeddings via attention to the output position; \(3\) applying trig identities in the MLP to computecos\(k\(α\+β\)/q\)\\cos\(k\(\\alpha\+\\beta\)/q\); \(4\) scoring each candidateccby alignment withcos\(klogg\(c\)/q\)\\cos\(k\\,\\log\_\{g\}\(c\)/q\), where only the correct answerc=abmodpc=ab\\bmod pachieves constructive interference across all frequencies\.
Why prior work missed this\.Furutaet al\.\([2024](https://arxiv.org/html/2606.17399#bib.bib5)\)andDoshiet al\.\([2024](https://arxiv.org/html/2606.17399#bib.bib4)\)analyzed with the additive DFT\. A function periodic inlogg\(a\)\\log\_\{g\}\(a\)appears non\-periodic inaabecause the discrete logarithm is a nonlinear permutation of the integers\. The “all\-frequencies” finding is correct in the additive basis but misleading because it conflates basis mismatch with algorithmic complexity\.
Overfitting and generalization\.Grokking is itself an overfitting phenomenon: the model achieves zero training loss by epoch 500 \(memorization\), then continues training for thousands of epochs before discovering the generalizing algorithm\. Weight decay penalizes the large, unstructured weights needed for memorization, gradually favoring the compact Fourier representation that generalizes\. We do not use cross\-validation because the task is deterministic \(no label noise\); the 70% held\-out test set directly measures generalization\. The model achieves 100% test accuracy after grokking, confirming complete generalization\.
Limitations\.We do not prove this is the only possible algorithm; prior work on addition\(Zhonget al\.,[2023](https://arxiv.org/html/2606.17399#bib.bib3)\)shows different architectures can yield different solutions\.
## 5Conclusion and Future Work
We have shown that a transformer trained on modular multiplication learns the “Discrete\-Log Clock”: it reduces multiplication to addition in discrete\-log space, then applies the same Fourier/trigonometric mechanism known from addition\. The key methodological insight is that the analysis basis must match the algebraic structure of the task\.
In follow\-up experiments, we confirmed the Discrete\-Log Clock across 10 primes \(p=59p=59to113113\), all exhibiting sparse multiplicative spectra with 4 to 7 key frequencies\. We also trained across multiple random seeds onp=113p=113and found the mechanism is universal \(∼\\sim80% of seeds\), with consistent sparsity \(Gini 0\.45–0\.61\) despite varying key frequencies\. We have also achieved preliminary results on modular exponentiation \(abmodpa^\{b\}\\bmod p\) with a 2\-layer transformer that achieves 100% accuracy onabmod41a^\{b\}\\bmod 41, to our knowledge the first demonstration of a transformer perfectly learning this operation\. The mechanism for exponentiation remains under active investigation\.
The “right basis” principle extends naturally: for any algebraic task, the Fourier transform on the relevant group should reveal interpretable structure\. Future work includes extending this methodology to polynomial operations, and multi\-layer architectures\.
## Appendix
### A\.0 Reproducibility
### A\.1 The Discrete\-Log Clock Derivation
This appendix derives the algorithm the transformer approximately implements for modular multiplication\. Letωk=2πk/q\\omega\_\{k\}=2\\pi k/q\.
### A\.2 The Embedding
Each dimensionjjofWEW\_\{E\}, viewed as a function ofα\\alphaacross all 112 elements, is approximately a sinusoid at a key frequencykj∈𝒦k\_\{j\}\\in\\mathcal\{K\}:
WE\[gα,j\]≈Ajsin\(ωkjα\+ϕj\)W\_\{E\}\[g^\{\\alpha\},\\;j\]\\;\\approx\\;A\_\{j\}\\sin\(\\omega\_\{k\_\{j\}\}\\alpha\+\\phi\_\{j\}\)\(4\)A single rowWE\[a\]W\_\{E\}\[a\]has no visible structure: it is a point\-sample of all 128 waves evaluated atα=logg\(a\)\\alpha=\\log\_\{g\}\(a\), plus noise\.
### A\.3 Attention
Attention routes information from positionsaaandbbto position==\. After attention, the residual stream at==approximately encodes the Fourier components of both inputs at the key frequencies\.
### A\.4 The MLP \(Trigonometric Addition\)
Writingck\(x\)=cos\(ωkx\)c\_\{k\}\(x\)=\\cos\(\\omega\_\{k\}x\)andsk\(x\)=sin\(ωkx\)s\_\{k\}\(x\)=\\sin\(\\omega\_\{k\}x\)for brevity, the attention and MLP layers together approximately implement:
ck\(α\+β\)\\displaystyle c\_\{k\}\(\\alpha\{\+\}\\beta\)=ck\(α\)ck\(β\)−sk\(α\)sk\(β\)\\displaystyle=c\_\{k\}\(\\alpha\)\\,c\_\{k\}\(\\beta\)\-s\_\{k\}\(\\alpha\)\\,s\_\{k\}\(\\beta\)\(5\)sk\(α\+β\)\\displaystyle s\_\{k\}\(\\alpha\{\+\}\\beta\)=sk\(α\)ck\(β\)\+ck\(α\)sk\(β\)\\displaystyle=s\_\{k\}\(\\alpha\)\\,c\_\{k\}\(\\beta\)\+c\_\{k\}\(\\alpha\)\\,s\_\{k\}\(\\beta\)\(6\)After the MLP, the residual stream approximately encodesck\(α\+β\)c\_\{k\}\(\\alpha\{\+\}\\beta\)andsk\(α\+β\)s\_\{k\}\(\\alpha\{\+\}\\beta\)for eachk∈𝒦k\\in\\mathcal\{K\}\.
### A\.5 Scoring \(Dot Product with Unembedding\)
The unembedding column for candidateccapproximately encodes\(cos\(ωkγ\),sin\(ωkγ\)\)\(\\cos\(\\omega\_\{k\}\\gamma\),\\,\\sin\(\\omega\_\{k\}\\gamma\)\)for eachkk\. The logit for candidateccis the dot product of the final residual with this column\. SincecosAcosB\+sinAsinB=cos\(A−B\)\\cos A\\cos B\+\\sin A\\sin B=\\cos\(A\-B\), each frequency’s contribution reduces to:
logit\(c\)≈∑k∈𝒦Akcos\(ωk\(α\+β−γ\)\)\\boxed\{\\;\\text\{logit\}\(c\)\\;\\approx\\;\\sum\_\{k\\in\\mathcal\{K\}\}A\_\{k\}\\cos\\\!\\left\(\\omega\_\{k\}\(\\alpha\+\\beta\-\\gamma\)\\right\)\\;\}\(7\)
### A\.6 Constructive Interference
Correct answer\(γ∗=\(α\+β\)modq\\gamma^\{\*\}=\(\\alpha\{\+\}\\beta\)\\bmod q\):
logit\(c∗\)≈∑kAkcos\(0\)=∑kAk\(maximum\)\\text\{logit\}\(c^\{\*\}\)\\approx\\sum\_\{k\}A\_\{k\}\\cos\(0\)=\\sum\_\{k\}A\_\{k\}\\quad\\text\{\(maximum\)\}All cosines equal 1; all frequencies constructively interfere\.
Wrong answer\(Δ=α\+β−γ≠0\\Delta=\\alpha\{\+\}\\beta\{\-\}\\gamma\\neq 0\):
logit\(c\)≈∑kAkcos\(ωkΔ\)<∑kAk\\text\{logit\}\(c\)\\approx\\sum\_\{k\}A\_\{k\}\\cos\(\\omega\_\{k\}\\Delta\)<\\sum\_\{k\}A\_\{k\}Cosines at different frequencies take values in\[−1,1\]\[\-1,1\], partially canceling\. With 4 incommensurate frequencies, destructive interference ensures no wrong answer approaches the correct answer’s score\.
## References
- A toy model of universality: reverse engineering how networks learn group operations\.InInternational Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2606.17399#S2.p3.1)\.
- D\. Doshi, T\. He, A\. Das, and A\. Gromov \(2024\)Grokking modular polynomials\.arXiv preprint arXiv:2406\.03495\.Cited by:[§1](https://arxiv.org/html/2606.17399#S1.p2.1),[§2](https://arxiv.org/html/2606.17399#S2.p2.1),[§4\.4](https://arxiv.org/html/2606.17399#S4.SS4.p2.2)\.
- H\. Furuta, G\. Minegishi, Y\. Iwasawa, and Y\. Matsuo \(2024\)Towards empirical interpretation of internal circuits and properties in grokked transformers on modular polynomials\.Transactions on Machine Learning Research\.Cited by:[§1](https://arxiv.org/html/2606.17399#S1.p2.1),[§2](https://arxiv.org/html/2606.17399#S2.p2.1),[§4\.4](https://arxiv.org/html/2606.17399#S4.SS4.p2.2)\.
- G\. McCracken, G\. Moisescu\-Pareja, V\. Letourneau, D\. Precup, and J\. Love \(2025\)Uncovering a universal abstract algorithm for modular addition in neural networks\.InAdvances in Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2606.17399#S2.p3.1)\.
- N\. Nanda, L\. Chan, T\. Lieberum, J\. Smith, and J\. Steinhardt \(2023\)Progress measures for grokking via mechanistic interpretability\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.17399#S1.p1.1),[§2](https://arxiv.org/html/2606.17399#S2.p1.1),[§3\.1](https://arxiv.org/html/2606.17399#S3.SS1.p1.6),[§3\.1](https://arxiv.org/html/2606.17399#S3.SS1.p2.1)\.
- A\. Power, Y\. Burda, H\. Edwards, I\. Babuschkin, and V\. Misra \(2022\)Grokking: generalization beyond overfitting on small algorithmic datasets\.InICLR 2022 Workshop on MATH\-AI,Cited by:[§1](https://arxiv.org/html/2606.17399#S1.p1.1)\.
- D\. Stander, Q\. Yu, H\. Fan, and S\. Biderman \(2024\)Grokking group multiplication with cosets\.InInternational Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2606.17399#S2.p3.1)\.
- Z\. Zhong, Z\. Liu, M\. Tegmark, and J\. Andreas \(2023\)The clock and the pizza: two stories in mechanistic explanation of neural networks\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.17399#S1.p1.1),[§2](https://arxiv.org/html/2606.17399#S2.p1.1),[§4\.4](https://arxiv.org/html/2606.17399#S4.SS4.p4.1)\.Similar Articles
Transformers Learn the Mestre-Nagao Heuristic
This paper trains a two-layer transformer encoder to classify rational elliptic curves by rank from Frobenius traces, achieving >99% accuracy. Mechanistic interpretability reveals the model learns the Mestre-Nagao heuristic and concentrates attention on prime positions, demonstrating that transformers can learn number-theoretic algorithms.
Transformer Math Explorer [P]
This interactive tool visualizes the mathematical underpinnings of transformer models through dataflow graphs, covering architectures from GPT-2 to Qwen 3.6 and various attention mechanisms.
The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason
A comprehensive spectral analysis across 11 LLMs revealing that transformers exhibit phase transitions in hidden activation spaces during reasoning versus factual recall, with seven fundamental phenomena including spectral compression, instruction-tuning reversal, and perfect correctness prediction (AUC=1.0) based solely on spectral properties.
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
This paper performs full Jacobian eigendecomposition across production-scale LLMs, revealing a learned spectral gradient from rotation-dominated early layers to symmetric late layers, along with a low-rank bottleneck that compresses perturbations. The results link perturbation propagation and compression to network functional topology.
Convergent Evolution: How Different Language Models Learn Similar Number Representations
Study reveals that diverse language-model architectures independently evolve similar periodic Fourier features for representing numbers, with only some achieving geometric separability for modular arithmetic.