Transformers Learn the Mestre-Nagao Heuristic

arXiv cs.LG 06/16/26, 04:00 AM Papers
Summary
This paper trains a two-layer transformer encoder to classify rational elliptic curves by rank from Frobenius traces, achieving >99% accuracy. Mechanistic interpretability reveals the model learns the Mestre-Nagao heuristic and concentrates attention on prime positions, demonstrating that transformers can learn number-theoretic algorithms.
arXiv:2606.15036v1 Announce Type: new Abstract: We train a two-layer transformer encoder to classify rational elliptic curves $E/\mathbb{Q}$ of conductor $\leq 10000$ as either rank 0 or rank 1 from the first 128 normalized Frobenius traces. We achieve >99% accuracy on both classes, and accuracy is essentially unchanged on test curves with no isogeny or quadratic-twist relative in the training set. We then apply techniques from mechanistic interpretability such as attention analysis, linear probing, activation patching, logit attribution, and neuron-level circuit analysis to reverse-engineer the algorithm the (centroid in function space) model learned. We find that a sparse circuit of 20 out of 512 layer-1 MLP neurons is sufficient for rank prediction under a linear probe with an AUROC of 0.992 at plateau, implementing a push-pull detector architecture of rank-0 and rank-1 detectors with a one-sided readout. However, we notice that the model has sub-optimal readout problems indicating a mismatch in rank-order between the readout pathway and the discriminative circuit. Critically, the learned input weights of the top discriminating neuron match the Mestre-Nagao sum heuristic weights $\log(p)/(p\cdot \log{B})$ with a Spearman coefficient $r = 0.997$ and Pearson coefficient $r = 0.952$: the model has learnt a result from analytic number theory from the Frobenius trace data alone. We additionally find that all 50 independently trained models concentrate CLS attention on prime positions at 2-50$\times$ the rate of composite positions. The CLS embedding encodes $\log{L(E,1)}$ with $R^2 = 0.962\pm 0.011$ across the 50 models (after controlling for the conductor). Activation patching analysis reveals that attention weights are dissociated from causal information flow. Additionally, the 50 solutions from training are near-identical in function space (with pairwise agreement $>$98.8%) despite large weight space barriers.
Cached at: 06/16/26, 11:36 AM
# Transformers Learn the Mestre-Nagao Heuristic
Source: [https://arxiv.org/html/2606.15036](https://arxiv.org/html/2606.15036)
###### Abstract

We train a two-layer transformer encoder to classify rational elliptic curves $E/\mathbb{Q}$ of conductor $\leq 10000$ as either rank 0 or rank 1 from the first 128 normalized Frobenius traces. We achieve >99% accuracy on both classes, and accuracy is essentially unchanged on test curves with no isogeny or quadratic-twist relative in the training set. We then apply techniques from mechanistic interpretability such as attention analysis, linear probing, activation patching, logit attribution, and neuron-level circuit analysis to reverse-engineer the algorithm the (centroid in function space) model learned. We find that a sparse circuit of 20 out of 512 layer-1 MLP neurons is sufficient for rank prediction under a linear probe with an AUROC of 0.992 at plateau, implementing a push-pull detector architecture of rank-0 and rank-1 detectors with a one-sided readout: rank-1 is signaled by a withheld push rather than by an opposing pull. However, the model's readout is sub-optimal: its readout weights extract only an AUROC of 0.956 from the same neurons, indicating a mismatch in rank-order between the readout pathway and the discriminative circuit. Critically, the learned input weights of the top discriminating neuron match the Mestre-Nagao sum heuristic weights $\log(p)/(p\cdot\log{B})$ with a Spearman coefficient $r=0.997$ and Pearson coefficient $r=0.952$: the model has learnt a result from analytic number theory from the Frobenius trace data alone. We additionally find that all 50 independently trained models concentrate CLS attention on prime positions at 2-50$\times$ the rate of composite positions, which is consistent with the Euler product structure of $L(E,s)$. The CLS embedding encodes $\log{L(E,1)}$ with $R^{2}=0.962\pm 0.011$ across the 50 models (after controlling for conductor). Activation patching analysis reveals that attention weights are dissociated from causal information flow. Additionally, the 50 solutions from training are near-identical in function space (with pairwise agreement $>98.8\%$) despite large weight space barriers.

## 1 Introduction

The Birch and Swinnerton-Dyer (BSD) conjecture [BirchSwinnertonDyer1965] predicts that the rank of the Mordell-Weil group $E(\mathbb{Q})$ equals the order of vanishing of the $L$-function $L(E,s)$ at $s=1$. We note that $L(E,s)$ is completely determined by the Frobenius traces $\{a_{n}\}$, from the Euler product

$$L(E,s)=\prod_{p}(1-a_{p}p^{-s}+p^{1-2s})^{-1}.$$

In theory, the rank is thus readable from the sequence of Frobenius traces $(a_{1},a_{2},\ldots)$. Detecting the vanishing of the $L$-value $L(E,1)$ from finitely many terms is numerically very difficult, as the approximate functional equation

$$L(E,1)\approx 2\sum_{n=1}^{N}\frac{a_{n}}{n}\cdot W\left(\frac{n}{\sqrt{N_{E}}}\right)$$

converges slowly (particularly for elliptic curves of high conductor) [rubinstein2005computational].

Previous work has shown that machine learning models can predict rank from Frobenius traces with high accuracy; see [babei2025learning], [bieri2026murmurations], [10.1016/j.jsc.2022.08.017], and [kazalicki2023ranks]. These works show that prediction using machine learning is feasible, but do not address the central question of mechanistic interpretability: what algorithm does the model discover?

We address this by using tools from mechanistic interpretability ([elhage2021mathematical], [nanda2023progressmeasuresgrokkingmechanistic], [elhage2022toymodelssuperposition]), such as attention analysis, linear probing [alain2018understandingintermediatelayersusing], activation patching [10.5555/3600270.3601532], direct logit attribution, and neuron-level circuit analysis. We find that a transformer trained on rank prediction independently rediscovers the Mestre-Nagao heuristic [bieri2026murmurations], a result from classical analytic number theory that estimates the rank, implemented by a sparse push-pull MLP circuit. This appears to be the first mechanistic identification of a transformer neural network learning a named mathematical result from number-theoretic data without supervision.

## 2 Background

### 2.1 $L$-functions, BSD, and Frobenius Traces

Let $E/\mathbb{Q}$ be a rational elliptic curve with conductor $N_{E}$. The $L$-function is

$$L(E,s)=\sum_{n\geq 1}\frac{a_{n}}{n^{s}}=\prod_{p\nmid N_{E}}\frac{1}{1-a_{p}p^{-s}+p^{1-2s}}\cdot\prod_{p\mid N_{E}}\frac{1}{1-a_{p}p^{-s}}.$$

Here, $a_{p}=p+1-\#E(\mathbb{F}_{p})$ for primes of good reduction (i.e. the curve is non-singular when the coefficients are reduced modulo $p$), and we have by the Ramanujan-Eichler-Shimura bound that $|a_{p}|\leq 2\sqrt{p}$ [diamond2005first].

The Frobenius traces $a_{n}$ satisfy Hecke multiplicativity: for $m,n$ where $\gcd(m,n)=1$, we have $a_{mn}=a_{m}a_{n}$. At prime powers, expanding the Euler factors above gives $a_{p^{k+1}}=a_{p}a_{p^{k}}-p\,a_{p^{k-1}}$ for $p\nmid N_{E}$ and $a_{p^{k}}=a_{p}^{k}$ for $p\mid N_{E}$. Thus $\{a_{n}\}$ is determined completely by $\{a_{p}\}$ for primes $p$.

For rank 0 elliptic curves, the BSD formula gives

$$L(E,1)=\frac{\Omega_{E}\cdot\#\Sha(E)\cdot\prod_{p}c_{p}}{|\operatorname{Tor}(E(\mathbb{Q}))|^{2}},$$

where:

1. $\Omega_{E}$ is the real period, defined as follows. Every elliptic curve $E/\mathbb{Q}$ has a Weierstrass equation with integer coefficients, so that $E$ is the projective curve $y^{2}=x^{3}+ax+b$. We can define the unique invariant differential (in the sense that it is translation invariant) $\omega_{E}=\frac{\operatorname{d}x}{2y}$. The lattice of periods $\Lambda$ is then defined as the discrete subgroup of $\mathbb{C}$ generated by integrals of the form $\int_{\gamma}\omega_{E}$, where $\gamma\in H_{1}(E,\mathbb{Z})$ (note that there is an isomorphism $E(\mathbb{C})\cong\mathbb{C}/\Lambda$). The real period $\Omega_{E}$ is then defined as the least positive element of $\Lambda\cap\mathbb{R}$ multiplied by the number of components of $E(\mathbb{R})$ [lmfdb].
2. $\Sha(E)$ is the Tate-Shafarevich group, defined as follows. Let $K$ be a number field, and let $G_{K}$ be its absolute Galois group. For a place $\nu$, let $K_{\nu}$ denote the completion of $K$ at $\nu$, and let $G_{K_{\nu}}$ be the absolute Galois group of the completion. We define the Tate-Shafarevich group for an elliptic curve $E/K$ as $\Sha(E)=\ker\left(H^{1}(G_{K},E)\to\prod_{\nu}H^{1}(G_{K_{\nu}},E_{K_{\nu}})\right)$, where $\nu$ runs over all places of $K$, and $E_{K_{\nu}}$ denotes the base change of $E$ to $K_{\nu}$. The order of $\Sha(E)$ is conjectured to be finite.
3. $\prod_{\mathfrak{p}}c_{\mathfrak{p}}$ is the Tamagawa product, defined as follows. Let $\mathfrak{p}$ be a prime of $K$. We define the Tamagawa number $c_{\mathfrak{p}}=[E(K_{\mathfrak{p}}):E^{0}(K_{\mathfrak{p}})]$, where $E^{0}(K_{\mathfrak{p}})$ is the subgroup of $E(K_{\mathfrak{p}})$ consisting of all points whose reduction modulo $\mathfrak{p}$ is smooth. If $E$ has good reduction at $\mathfrak{p}$, then $c_{\mathfrak{p}}(E)=1$. The Tamagawa product is the product of the Tamagawa numbers over all primes, and is a positive integer.

Importantly, for curves of rank 1 the $L$-value at $s=1$ vanishes: $L(E,1)=0$.

### 2.2 Mestre-Nagao Heuristic

For a positive real bound $B$, the *Mestre-Nagao sum* is defined as

$$S(E,B)=\frac{1}{\log{B}}\sum_{p<B,\,p\nmid N_{E}}\frac{a_{p}\log{p}}{p}$$

[bieri2026murmurations]. For an elliptic curve of analytic rank $r_{\mathrm{an}}$, the explicit formula for $\frac{L^{\prime}(E,1)}{L(E,1)}$ predicts that if $\lim\limits_{B\to\infty}S(E,B)$ exists, it converges to $\frac{1}{2}-r_{\mathrm{an}}$ [kim2021birchswinnertondyerconjecturenagaos]. Rank-0 curves thus satisfy $S(E,B)\approx\frac{1}{2}$, and rank-1 curves satisfy $S(E,B)\approx-\frac{1}{2}$ for large $B$. The sum separates the rank classes by a gap of 1 and serves as a heuristic predictor of rank with unnormalized weights $\frac{\log{p}}{p}$ indexed by prime $p$. The per-prime normalized Mestre-Nagao weights are

$$w_{p}=\frac{\log{p}}{p\cdot\log{B}}.\tag{1}$$

Bieri et al. [bieri2026murmurations] show that Mestre-Nagao sums achieve an AUROC of approximately 0.95 for rank prediction and that CNN saliency curves qualitatively resemble $w_{p}$ as a function of $p$. We establish this connection at the level of individual neurons, with Pearson $r=0.952$, using circuit analysis methods.
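As a rough illustration of the definitions above, the following Python sketch evaluates $S(E,B)$ and the normalized weights $w_{p}$ of equation (1) from a table of prime-indexed traces. The function names and the toy trace values are hypothetical and are not taken from the paper; only the two formulas follow the text.

```python
import math

def mestre_nagao_weight(p, bound):
    """Normalized per-prime weight w_p = log(p) / (p * log(B)), as in equation (1)."""
    return math.log(p) / (p * math.log(bound))

def mestre_nagao_sum(ap, bound, conductor):
    """S(E, B): sum of a_p * log(p) / p over primes p < B with p not dividing N_E,
    divided by log(B). `ap` maps each prime p to its trace a_p."""
    total = 0.0
    for p, a_p in ap.items():
        if p < bound and conductor % p != 0:
            total += a_p * math.log(p) / p
    return total / math.log(bound)

# Toy input (illustrative values only): traces at the primes below 12 for a
# hypothetical curve of conductor 11, so p = 11 is skipped as a bad prime.
toy_ap = {2: -2, 3: -1, 5: 1, 7: -2, 11: 1}
print(mestre_nagao_sum(toy_ap, bound=12, conductor=11))
```

With so few primes the value is far from the limiting $\frac{1}{2}-r_{\mathrm{an}}$; the sketch is only meant to make the weighting explicit.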

### 2.3 Transformers and Mechanistic Interpretability

A transformer encoder processes input $(x_{1},\ldots,x_{T})$ through $L$ layers, each with multi-head self-attention and a position-wise MLP. For a head $h$ we have

$$\alpha_{ij}^{(h)}=\operatorname{softmax}\left(\frac{q_{i}^{(h)}\cdot k_{j}^{(h)}}{\sqrt{d_{\mathrm{head}}}}\right),\quad\text{and}\quad z_{i}^{(h)}=\sum_{j}\alpha_{ij}^{(h)}v_{j}^{(h)},$$

where the softmax is taken over the key index $j$. We prepend a learned CLS token (introduced in [devlin2019bert]) whose final hidden state is used for classification.
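The attention formula can be made concrete with a small pure-Python sketch of a single head. This is a minimal illustration under simplifying assumptions (no learned projections, no batching, lists instead of tensors); the function names are hypothetical and this is not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(q, k, v):
    """q, k: lists of d_head-dimensional vectors; v: list of value vectors.
    Returns (alpha, z) with alpha[i][j] the attention weights and
    z[i] = sum_j alpha[i][j] * v[j]."""
    d_head = len(q[0])
    alpha, z = [], []
    for q_i in q:
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_head) for k_j in k]
        row = softmax(scores)
        alpha.append(row)
        z.append([sum(row[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))])
    return alpha, z

# A zero query scores every key equally, so the output is the mean of the values.
alpha, z = single_head_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0], [4.0]])
print(alpha, z)
```

In the setting of the paper, the row of `alpha` belonging to the CLS token is the quantity analyzed in the attention sections.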

Mechanistic interpretability [elhage2021mathematical] aims to reverse-engineer the learned algorithm through analyzing model internals, including analysis of attention weights [clark-etal-2019-bert], linear probing [alain2018understandingintermediatelayersusing], activation patching [10.5555/3600270.3601532], and neuron-level circuit analysis [nanda2023progressmeasuresgrokkingmechanistic].

## 3 Experimental Setup

### 3.1 Data

We use the Cremona database accessed through the L-functions and Modular Forms Database (LMFDB) [lmfdb]. We restrict to curves of conductor $N_{E}\leq 10000$ and analytic rank in $\{0,1\}$, which yields 62,298 curves (30,427 rank 0 curves and 31,871 rank 1 curves), with a stratified 80/20 split (random seed 42). The input is

$$\tilde{a}_{n}=\frac{a_{n}}{2\sqrt{n}},$$

where $n=1,\ldots,N$ with $N=128$ as the primary experimental setting. BSD-related invariants (such as $L$-values, periods, Tamagawa numbers, $\Sha(E)$, and torsion) are fetched from the LMFDB and computed by Dokchitser's algorithm [dokchitser2002computingspecialvaluesmotivic].
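The input normalization is simple enough to state in code. The sketch below assumes the raw traces $a_{1},\ldots,a_{N}$ are given as a Python list; the function name and the sample values are hypothetical.

```python
import math

def normalize_traces(traces):
    """Map raw traces a_1, ..., a_N to a_n / (2 * sqrt(n)), with n starting at 1."""
    return [a / (2.0 * math.sqrt(n)) for n, a in enumerate(traces, start=1)]

# Illustrative values for a_1, ..., a_4 (not taken from the dataset).
print(normalize_traces([1, -2, -1, 2]))
```

At prime indices the bound $|a_{p}|\leq 2\sqrt{p}$ places the normalized value in $[-1,1]$; at composite indices the same scaling is applied, but that bound is not guaranteed.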

### 3.2 Model

We train a 2-layer transformer encoder with $d_{\mathrm{model}}=128$, 4 heads, an MLP width of $4d_{\mathrm{model}}=512$ neurons per layer, and pre-norm LayerNorm [ba2016layer], for a total of approximately 500,000 parameters. Training uses AdamW [loshchilov2019adamw] with learning rate $3\times 10^{-4}$, a cosine schedule [loshchilov2017sgdr], weight decay 0.01, weighted cross-entropy for class imbalance, and 100 epochs. We train 50 independent models with random seeds 8-57.

The model achieves $98.7\pm 0.02\%$ accuracy on rank 0 curves and $99.5\pm 0.2\%$ on rank 1 curves. This exceeds both the Mestre-Nagao partial sum baseline, which has an AUROC of 0.95, and the naive partial sum

$$S_{N}=2\sum_{n=1}^{N}\frac{a_{n}}{n},$$

which has an AUROC of 0.93.

Recently, Babei, Shah, and Kebe [babei2026twistclassredundancydrives] showed that for the related task of predicting a Frobenius trace from nearby traces, much of the reported model performance is attributable to quadratic-twist redundancy in the dataset, as twist-classes share trace magnitudes: an explicit twist-matching baseline substantially outperforms the trained transformers. We show that this situation does not apply in the case of rank prediction.

We note that the dataset contains all curves of conductor $\leq 10^{4}$ rather than one representative per isogeny class, so isogenous curves (which have the same trace sequences and identical ranks) may appear in both the training and test sets. Following the twist-redundancy analysis in [babei2026twistclassredundancydrives], we partition the test set into three slices: curves with exact isogeny duplicates in the training set (60.4% of curves), curves with no exact duplicate but whose quadratic-twist proxy key (the absolute traces $|a_{p}|$ at the eight largest primes $p\leq 127$, following [babei2026twistclassredundancydrives]) matches a training curve (23.5%), and curves with neither (16.1%).

Our representative model (the centroid model in function space, see [section 4.2](https://arxiv.org/html/2606.15036#S4.SS2)) achieves 99.7%, 98.9%, and 98.9% accuracy on these slices respectively, with AUROC $\geq 0.999$ on each. Performance is thus essentially unchanged on curves about which the training set carries no twist-class information. A twist-lookup baseline that predicts the majority rank among a test curve's twist proxy mates in training performs at 50.4% accuracy on the twist slice. Rank, unlike trace magnitude, is not recoverable from twist-class membership. This implies the model's accuracy is not derived from isogeny or twist retrieval, consistent with the parametric Mestre-Nagao mechanism in [section 8.3](https://arxiv.org/html/2606.15036#S8.SS3), and in contrast to the trace-prediction setting of [babei2026twistclassredundancydrives].

## 4 Solution Space Geometry

### 4.1 Weight Space

We compute the pairwise loss barriers by linear mode connectivity following [goodfellow2015qualitativelycharacterizingneuralnetwork] and [garipov2018losssurfacesmodeconnectivity].

For each pair $(i,j)$ of models, we interpolate $\theta(\alpha)=\alpha\theta_{i}+(1-\alpha)\theta_{j}$ over a uniform grid of 21 values $\alpha\in[0,1]$ and define $\operatorname{barrier}(i,j)=1-\min_{\alpha}\operatorname{acc}(\theta(\alpha))$. This agrees with the endpoint-relative barrier up to approximately 0.01, as every endpoint model achieves an accuracy of approximately 0.99. Across the 1225 pairs, barriers range from 0.18 to 0.89 with a mean of 0.556: no pair of solutions is linearly mode-connected in raw weight space.
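The barrier computation can be sketched as follows. This is a simplified illustration: parameters are flattened into plain lists, and `accuracy` is a hypothetical callable standing in for evaluating a model with the given parameters on the test set. The toy accuracy function at the end is invented purely to exercise the code.

```python
def interpolation_barrier(theta_i, theta_j, accuracy, num_points=21):
    """Return 1 - min over a uniform grid of alpha in [0, 1] of
    accuracy(alpha * theta_i + (1 - alpha) * theta_j)."""
    worst = float("inf")
    for step in range(num_points):
        alpha = step / (num_points - 1)
        theta = [alpha * a + (1.0 - alpha) * b for a, b in zip(theta_i, theta_j)]
        worst = min(worst, accuracy(theta))
    return 1.0 - worst

# Toy example: two "solutions" at +1 and -1 that are both perfect, with accuracy
# collapsing to 0 at the midpoint of the straight line between them.
toy_accuracy = lambda theta: abs(theta[0])
print(interpolation_barrier([1.0], [-1.0], toy_accuracy))
```

As in the text, no permutation or rescaling alignment is applied before interpolating in this sketch.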

These interpolations are performed without permutation or rescaling alignment of neurons. Large raw-weight barriers between functionally equivalent models are expected under the parameter-space symmetries of the architecture (see [ainsworth2023gitrebasinmergingmodels] and [entezari2022rolepermutationinvariancelinear]), which is confirmed strongly by the function space analysis in [section 4.2](https://arxiv.org/html/2606.15036#S4.SS2).

We apply metric MDS to the pairwise barrier matrix, treated as a dissimilarity matrix, to characterize the geometry of the weight space using Kruskal stress-1. The stress declines with embedding dimension until $d\approx 15$-$20$, where it plateaus at $\approx 0.15$ and improves no further even at $d=49$. This implies the barrier geometry admits no low-dimensional Euclidean structure, and is not approximately Euclidean at any dimension. This is consistent with loss barriers not satisfying metric axioms.

### 4.2 Function Space

We compute functional similarity of the models by Hamming agreement on their test set predictions. The mean pairwise agreement across all 50 models was $99.2\%$ with a standard deviation of $0.11\%$, and all pairs of models had agreement above $98.8\%$. Refer to [fig. 1](https://arxiv.org/html/2606.15036#S4.F1) for a visualization of the function dissimilarities $(1-\mathrm{agreement})$ by 3D MDS. The embedding's Kruskal stress-1 of 0.27 reflects the near-equidistant character of the residual disagreements, as dissimilarities range from 0.005 to 0.012 and near-equidistant point sets do not embed in low dimension. This is what one expects from a single shared function perturbed by small independent per-model errors rather than from multiple functional clusters. All models are effectively computing the same function, across many different parameterizations (hence the large barriers observed in weight space). We attempted to characterize function space by various metrics, but note that prime concentration strength does *not* organize the functional space.

With the above results in mind, the centroid model in functional space was chosen as a representative model for subsequent analysis (although checks of the solution types of all models revealed that their broad structure, as expected, was the same).
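A small sketch of the agreement computation follows. The text does not spell out how the centroid is selected, so the rule used here (the model with the highest mean agreement with all others) is an assumption made for illustration, as are the function names and toy predictions.

```python
from itertools import combinations

def agreement(preds_a, preds_b):
    """Fraction of test examples on which two models predict the same label."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def pairwise_agreement(all_preds):
    """Agreement for every unordered pair of models, keyed by (i, j) with i < j."""
    return {(i, j): agreement(all_preds[i], all_preds[j])
            for i, j in combinations(range(len(all_preds)), 2)}

def centroid_index(all_preds):
    """Assumed centroid rule: index of the model with highest mean agreement."""
    n = len(all_preds)
    def mean_agreement(i):
        return sum(agreement(all_preds[i], all_preds[j]) for j in range(n) if j != i) / (n - 1)
    return max(range(n), key=mean_agreement)

# Toy predictions from three hypothetical models on four test curves.
toy_preds = [[0, 1, 1, 0], [0, 1, 1, 1], [0, 1, 0, 1]]
print(pairwise_agreement(toy_preds), centroid_index(toy_preds))
```

The dissimilarity fed to MDS in the text is then one minus each pairwise agreement.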

![Refer to caption](https://arxiv.org/html/2606.15036v1/x1.png)

Figure 1: The left plot depicts the functional solution space projected onto 3 dimensions, with solutions colored by their attention to primes for the purpose of function space visualization. Refer to [section 4.2](https://arxiv.org/html/2606.15036#S4.SS2) for the stress interpretation. The right plot projects onto dimensions 1 and 2. Notice the clustering of the solutions.

## 5 Attention Analysis

For each trained model, we extract the mean CLS attention weight to each input position, averaged over the test set. Across all 50 trained models, layer-0 attention concentrates on prime positions. Of 200 layer-0 heads, 198 attend more strongly to prime positions than to composite positions, with per-head prime/composite ratios spanning $0.65\times$ to $128\times$ (the extreme values reflect heads with near-zero composite attention). Per-model means over the four layer-0 heads range from $2.1\times$ to $50\times$ (median $6.5\times$, mean $9.7\times$), and every model's mean exceeds $2\times$, implying prime concentration is universal at the model level. The two heads weakly preferring composite positions ($0.65\times$ and $0.92\times$, in two different models) are offset by strong prime concentration in their sibling heads.
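One plausible way to compute a prime/composite ratio for a single head is sketched below. The text does not state the exact definition, so this sketch assumes the ratio of mean attention per prime position to mean attention per composite position, with $n=1$ (neither prime nor composite) left out. All names and the toy attention profile are hypothetical.

```python
def is_prime(n):
    """Trial-division primality test, adequate for positions up to 128."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def prime_composite_ratio(attention):
    """attention[n - 1] is the mean CLS attention to trace position n = 1..N.
    Returns mean attention on primes divided by mean attention on composites."""
    primes = [w for n, w in enumerate(attention, start=1) if is_prime(n)]
    composites = [w for n, w in enumerate(attention, start=1) if n > 1 and not is_prime(n)]
    return (sum(primes) / len(primes)) / (sum(composites) / len(composites))

# Toy attention profile over positions 1..6 (primes 2, 3, 5; composites 4, 6).
toy_attention = [0.0, 0.3, 0.3, 0.05, 0.3, 0.05]
print(prime_composite_ratio(toy_attention))
```

A head with near-zero composite attention makes the denominator tiny, which is consistent with the extreme per-head ratios reported above.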

This prime preference is consistent with the mathematical structure of the Euler product: since $a_{mn}=a_{m}a_{n}$ for $\gcd(m,n)=1$, the composite index Frobenius traces are determined entirely by those with prime indices, and the model learns to exploit this arithmetic structure. Refer to [fig. 2](https://arxiv.org/html/2606.15036#S5.F2) for the attention distribution for the centroid model.

![Refer to caption](https://arxiv.org/html/2606.15036v1/x2.png)

Figure 2: CLS attention weight by position for the centroid model. The blue bars are prime positions and the grey bars are composite positions. Note that prime positions receive 1.6-7.6$\times$ as much attention as composite positions, and no head in any layer pays more attention to composite positions.

![Refer to caption](https://arxiv.org/html/2606.15036v1/x3.png)

Figure 3: Note that the centroid model tracks small primes (in particular, $p=11,13$) the most. This is not explained by the rate of elliptic curves with good reduction at $p$. For models trained on smaller trace sequence lengths, confounding due to the smaller number of primes was observed.

### 5.1 Attention Causality Dissociation

Note that the centroid model concentrates attention on early primes in several heads (visible in [fig. 3](https://arxiv.org/html/2606.15036#S5.F3)): in particular, the primes $p=11$ and $p=13$ receive the most attention across all layers. However, activation patching analysis (see [section 7](https://arxiv.org/html/2606.15036#S7)) shows that the most important prime in terms of causal patch effects is $p=31$, followed by $p=13$, while $p=11$ is not within the top ten most causally impactful primes. Several composite positions also have non-trivial causal effects. Refer to [table 1](https://arxiv.org/html/2606.15036#S5.T1) for the top 10 positions in terms of causal patch effects. This dissociation between attention and causality is an instance, in a number-theoretic setting, of the finding of Jain and Wallace [jain2019attentionexplanation] that attention analysis is not necessarily reliable as a complete explanation of model internals. In particular, attention analysis correctly gives the coarse explanation that prime positions are important, but does not explain which prime positions in particular are the most significant.

Table 1: Attention vs. causal importance for the centroid model. The top-attended primes ($p=11,13$) do not correspond to the most causally important positions under activation patching. Notably, $p=11$ does not appear in the top 10 by causal effect, while $p=31$ (ranked first by patch effect) receives near-zero attention. Three composite positions ($a_{9},a_{25},a_{26}$) appear among the top 10 causal positions despite receiving negligible attention.

## 6 $L$-value Encoding

We fit a Ridge regression from the 128-dimensional CLS embedding to $\log{L(E,1)}$ for rank 0 test curves, referenced against the LMFDB exact values of the $L$-values. Across 50 runs, we observed $R^{2}=0.944\pm 0.011$. After controlling for the conductor (regressing out $\log{N_{E}}$), we observe a residual $R^{2}=0.962\pm 0.011$. We additionally trained a model to explicitly regress $\log{L(E,1)}$ from the same inputs, and this achieved $R^{2}=0.953$. This implies that the classification model implicitly optimizes a near-complete $L$-value approximation, as visible in [fig. 4](https://arxiv.org/html/2606.15036#S6.F4).

Probes for the remaining BSD-invariants recovered little to no signal, as expected since global arithmetic invariants are not determined by finitely many local traces. We defer the fuller treatment (including a discussion of encoding the real period $\Omega_{E}$) to the sequel.

![Refer to caption](https://arxiv.org/html/2606.15036v1/x4.png)

Figure 4: Scatter plot of CLS-predicted vs. true value of $\log{L(E,1)}$ for the 50 trained models on the left and the explicit regression model on the right. Both achieve $R^{2}$ values $>0.94$.

## 7 Activation Patching

### 7.1 Method

We apply activation patching [10.5555/3600270.3601532] to identify which positions causally determine rank prediction. In particular, for a clean rank 0 curve and a corrupted rank 1 curve, we patch the residual stream activation at layer $l$ and position $p$ from the clean run into the corrupted run and measure the normalized logit difference:

$$\operatorname{patch}(l,p)=\frac{\Delta_{\mathrm{logit}}(\mathrm{patched})-\Delta_{\mathrm{logit}}(\mathrm{corrupt})}{\Delta_{\mathrm{logit}}(\mathrm{clean})-\Delta_{\mathrm{logit}}(\mathrm{corrupt})}.$$

Results are averaged over 200 pairs. We also note that the information flow structure was qualitatively consistent across a sample of the 50 solutions.
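The normalized metric can be written as a small helper. The sketch assumes the three logit differences have already been obtained from forward passes (clean, corrupted, and corrupted-with-patch); the function names and the sample numbers are hypothetical.

```python
def normalized_patch_effect(patched, clean, corrupt):
    """(patched - corrupt) / (clean - corrupt): 0 means the patch changed nothing,
    1 means it fully restored the clean logit difference."""
    denom = clean - corrupt
    if denom == 0:
        raise ValueError("clean and corrupted logit differences coincide")
    return (patched - corrupt) / denom

def mean_patch_effect(triples):
    """Average the normalized effect over (patched, clean, corrupt) triples,
    one triple per clean/corrupted curve pair."""
    effects = [normalized_patch_effect(p, c, r) for p, c, r in triples]
    return sum(effects) / len(effects)

# Illustrative numbers only: two curve pairs for a single (layer, position) patch.
print(mean_patch_effect([(1.0, 4.0, -2.0), (-2.0, 3.0, -2.0)]))
```

Values outside $[0,1]$ are possible when a patch overshoots the clean run or pushes further toward the corrupted one.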

### 7.2 Direct Logit Attribution

To quantify the relative contribution of the various model components, we decompose the output logit difference into contributions from each attention head and MLP layer by direct logit attribution, following [elhage2021mathematical]. In particular, we have

$$\Delta_{\mathrm{logit}}=\sum_{\ell,h}\left(W_{U}\cdot z_{\mathrm{CLS}}^{(\ell,h)}\right)+\sum_{\ell}\left(W_{U}\cdot m_{\mathrm{CLS}}^{(\ell)}\right),$$

where $W_{U}$ is the unembedding matrix that stores the difference direction of logit weights. The left sum in the expression is the head contribution and the right sum is the MLP contribution.

Across 50 models, the layer-1 MLP dominates: its mean absolute contribution is $3.2\times$ larger than the layer 0 MLP and $7.5\times$ larger than individual attention heads. Attention heads collectively account for less than $15\%$ of total logit variance. This motivates the neuron-level analysis of the layer-1 MLP in the next section.

![Refer to caption](https://arxiv.org/html/2606.15036v1/x5.png)

Figure 5: Two-panel activation patching figure for the centroid model. The top shows activation patching for layer 0. Notice that the most significant primes are $p=31,13,19$, which differ from the primes that receive the most attention by the model. The bottom shows the activation patching for layer 1, which has a perfect spike at the CLS position.

## 8 MLP Circuit Analysis and Mestre-Nagao Sums

### 8.1 Circuit Sparsity

For each neuron $n$ in the layer 1 MLP (totalling 512), we compute the Fisher discriminant score

$$F_{n}=\frac{|\overline{a_{n,0}}-\overline{a_{n,1}}|}{\sqrt{\frac{\sigma_{n,0}^{2}+\sigma_{n,1}^{2}}{2}}}.$$

Here, $\overline{a_{n,r}}$ is the mean post-ReLU activation of neuron $n$ on rank $r$ curves and $\sigma_{n,r}^{2}$ is the corresponding variance. We then fit a logistic probe from the top-$k$ neurons' activations and measure AUROC as $k$ increases.
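The Fisher score and the top-$k$ selection can be sketched with the standard library. The text does not say whether population or sample variance is used; population variance (`statistics.pvariance`) is an assumption here, as are the function names and toy activations.

```python
import math
from statistics import mean, pvariance

def fisher_score(acts_rank0, acts_rank1):
    """|mean_0 - mean_1| / sqrt((var_0 + var_1) / 2) for one neuron's
    post-ReLU activations on rank-0 and rank-1 curves."""
    gap = abs(mean(acts_rank0) - mean(acts_rank1))
    pooled = math.sqrt((pvariance(acts_rank0) + pvariance(acts_rank1)) / 2.0)
    if pooled == 0.0:
        # Degenerate case: constant activations in both classes.
        return float("inf") if gap > 0 else 0.0
    return gap / pooled

def top_k_neurons(acts_rank0, acts_rank1, k):
    """acts_rank0[n] and acts_rank1[n] hold neuron n's activations per class.
    Returns the indices of the k neurons with the largest Fisher score."""
    scores = [fisher_score(a0, a1) for a0, a1 in zip(acts_rank0, acts_rank1)]
    return sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)[:k]

# Toy activations for three hypothetical neurons.
toy_rank0 = [[2.0, 4.0], [1.0, 3.0], [5.0, 7.0]]
toy_rank1 = [[0.0, 2.0], [1.0, 3.0], [0.0, 2.0]]
print(top_k_neurons(toy_rank0, toy_rank1, 2))
```

The selected neurons' activations would then be passed to a logistic probe, which is outside the scope of this sketch.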

For each $k$, we also select the top-$k$ neurons by $|w_{n}|$, where $w_{n}$ denotes the neuron's effective contribution to the logit difference under the model's *own* weights. In particular, the layer 1 MLP output is

$$\operatorname{MLP}(x)=W_{2}\operatorname{ReLU}(W_{1}x+b_{1})+b_{2},$$

where $W_{1}\in\mathbb{R}^{512\times d}$ maps the CLS residual stream into the 512-dimensional hidden layer, and $W_{2}\in\mathbb{R}^{d\times 512}$ maps back from the hidden layer to the CLS residual stream. The classification head computes $\Delta_{\mathrm{logit}}=v^{\top}c$, where $c$ is the CLS embedding and $v=w^{(0)}-w^{(1)}$ is the logit direction. The effective weight of a neuron $n$ is thus $w_{n}=(W_{2}^{\top}v)_{n}$, and the direct attribution score is

$$\sum_{n\in S_{k}}w_{n}h_{n},$$

where $S_{k}$ is the set of analyzed neurons and $h_{n}$ is the post-ReLU activation of hidden neuron $n$. We then measure AUROC as $k$ increases.
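The effective weights and the direct attribution score can be computed as in the following sketch. It uses nested lists in place of tensors and a toy $2\times 3$ matrix in place of the $d\times 512$ output matrix; the function names and all numbers are hypothetical.

```python
def effective_weights(W2, v):
    """w_n = (W2^T v)_n, where W2 has d rows (residual dimensions) and one
    column per hidden neuron, and v is the d-dimensional logit direction."""
    d, width = len(W2), len(W2[0])
    return [sum(W2[r][n] * v[r] for r in range(d)) for n in range(width)]

def direct_attribution(w, h, neurons):
    """Sum of w_n * h_n over the selected neuron indices."""
    return sum(w[n] * h[n] for n in neurons)

def top_k_by_readout(w, k):
    """Indices of the k neurons with the largest |w_n|."""
    return sorted(range(len(w)), key=lambda n: abs(w[n]), reverse=True)[:k]

# Toy example: d = 2 residual dimensions, 3 hidden neurons.
toy_W2 = [[1.0, 0.0, 2.0],
          [0.0, 1.0, -1.0]]
toy_v = [1.0, -1.0]
toy_w = effective_weights(toy_W2, toy_v)
toy_h = [2.0, 0.0, 1.0]
print(toy_w, direct_attribution(toy_w, toy_h, top_k_by_readout(toy_w, 2)))
```

Ranking by $|w_{n}|$ uses only the model's readout, whereas the Fisher ranking uses class statistics of the activations, which is why the two orderings can disagree.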

The two scores diverge sharply initially before converging at $k=200$ (see [fig. 6](https://arxiv.org/html/2606.15036#S8.F6)). At the cutoff $k=20$, the linear probe achieves AUROC 0.992, while the direct logit attribution achieves AUROC 0.956, and the direct logit attribution curve is non-monotonic at small values of $k$.

![Refer to caption](https://arxiv.org/html/2606.15036v1/x6.png)

Figure 6: The left figure depicts the difference in AUROC vs. the number of top-$k$ neurons for both scores. Notice the divergence for low $k$ initially. The right figure depicts logit contribution weights for all 512 neurons sorted by $|w_{n}|$, colored blue for rank 0 detector neurons and red for rank 1 detector neurons.

In particular, note that the direct logit attribution curve sits essentially at chance (0.501-0.507) for $k=1,\ldots,5$ before jumping to AUROC 0.883 when the neuron N199 is included (refer to [fig. 7](https://arxiv.org/html/2606.15036#S8.F7) for a plot). This reflects a rank-order mismatch between the readout circuit and the discriminative circuit: the ten neurons with the largest readout magnitudes $|w_{n}|$ all lie outside the Fisher top-100, and none belong to the 20-neuron rank discriminative circuit. In particular, the model's 5 biggest readout weights point at neurons that are collectively inconsequential to rank discrimination.

The gap reflects ordering rather than orientation: sign-correcting the six circuit neurons whose readout sign disagreed with their firing pattern left the direct-attribution AUROC essentially unchanged at every $k$.

![Refer to caption](https://arxiv.org/html/2606.15036v1/x7.png)

Figure 7: Note the jump in AUROC when the neuron N199 is added, despite it falling outside the top 20 neurons ordered by Fisher discriminant, indicative of sub-optimal readout.

We classify each neuron by its *firing pattern*: we compute $\Delta_{n}=\overline{a_{n,0}}-\overline{a_{n,1}}$, the mean post-ReLU activation differential. Neurons with $\Delta_{n}>0$ are rank 0 detectors, and those with $\Delta_{n}<0$ are rank 1 detectors. We classify explicitly by firing rather than by the sign of $w_{n}$, as they disagree for several circuit neurons.
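The firing-pattern classification, and the check of whether a neuron's readout sign agrees with it, can be sketched as below. The sign convention (positive $w_{n}$ pushes toward rank 0) follows from $\Delta_{\mathrm{logit}}=v^{\top}c$ with $v=w^{(0)}-w^{(1)}$; the function names and toy activations are hypothetical.

```python
from statistics import mean

def firing_class(acts_rank0, acts_rank1):
    """0 if the neuron fires more on rank-0 curves (Delta_n > 0),
    1 if it fires more on rank-1 curves (Delta_n < 0), None if tied."""
    delta = mean(acts_rank0) - mean(acts_rank1)
    if delta > 0:
        return 0
    if delta < 0:
        return 1
    return None

def readout_class(w_n):
    """Class that the neuron's readout weight pushes toward when it fires."""
    if w_n > 0:
        return 0
    if w_n < 0:
        return 1
    return None

def is_aligned(acts_rank0, acts_rank1, w_n):
    """True when the firing class and the readout direction agree."""
    return firing_class(acts_rank0, acts_rank1) == readout_class(w_n)

# Toy case: a neuron that fires on rank-1 curves but carries a positive readout weight.
print(is_aligned([0.0, 0.0], [2.0, 1.0], 0.19))
```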

### 8.2 Push-Pull Architecture

The 20 circuit neurons can be split by firing pattern into 17 rank-0 detectors and 3 rank-1 detectors. Each neuron's pre-activation is well-approximated (with $R^{2}=0.81$-$0.89$) by a prime-weighted linear form

$$z_{n}\approx\sum_{p}c_{p}^{(n)}a_{p}+b_{n},$$

where $h_{n}=\operatorname{ReLU}(z_{n})$. The regressions in [section 8.3](https://arxiv.org/html/2606.15036#S8.SS3) show that rank-0 detectors display coefficient profiles correlated with Mestre-Nagao weights, while rank-1 detectors display profiles anticorrelated with Mestre-Nagao weights (with Spearman $r$ ranging from $-0.48$ to $-0.56$). Each class thus fires on its own rank class and is nullified by the ReLU on the other class.

![Refer to caption](https://arxiv.org/html/2606.15036v1/x8.png)

Figure 8: The left figure shows the 20 circuit neurons sorted by the Fisher discriminant plotted against their logit contribution weights and most-correlated primes. The right figure shows the push-pull architecture of the circuit: rank 0 and rank 1 detectors each compute a scaled Mestre-Nagao partial sum.

The logit weights then wire the detectors into a vote, and this wiring is strongly one-sided. Refer to [fig. 8](https://arxiv.org/html/2606.15036#S8.F8) for a plot of the circuit. Firing class and the vote sign agree for 14 of the 20 circuit neurons: aligned rank-0 detectors push the rank-0 logit with weights up to $w_n = 0.83$. The six neurons that disagree are systematic: three rank-0 firing neurons N335, N456, and N58 (with $w_n = -0.26, -0.27, -0.24$ respectively) push toward rank 1 when they fire, and the three rank-1 detectors N405, N231, and N42 (with $w_n = 0.19, 0.21, 0.20$ respectively) push toward rank 0 when they fire. The rank-1 class does not receive a positive vote anywhere in the circuit.

In particular, summing the mean signed contributions $w_n h_n$ over the circuit gives $9.71$ on rank-0 curves versus $1.19$ on rank-1 curves. The model signals rank 1 by the essential *absence* of rank-0 push rather than by a pull in the negative direction. An ablation of N405 shifted the mean logit difference by $-0.026$ on the curves where it fires, which shows that the positive weight convention is indeed correct.

Notice further that the misalignment is confined to small magnitudes: all misaligned neurons have $|w_n| \leq 0.27$, whereas aligned detector neurons reach up to $|w_n| = 0.83$. The misalignment is also prime-structured: the three misaligned rank-0 detectors are all best correlated with $a_{19}$, and the three rank-1 detectors are best correlated with $a_{43}$.
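The alignment check and the one-sided vote can be illustrated with a small sketch. It assumes the sign convention stated above (positive $w_n$ pushes toward rank 0), treats a neuron as aligned when the sign of its firing differential matches the sign of its readout weight, and uses toy numbers; the function name `circuit_vote` is hypothetical.

```python
import numpy as np

def circuit_vote(h, w, is_rank0):
    """Compare firing class with vote sign and sum mean signed contributions.

    h        : (num_curves, num_neurons) post-ReLU activations
    w        : (num_neurons,) readout weights, positive = push toward rank 0
    is_rank0 : (num_curves,) boolean mask, True for rank-0 curves
    """
    mean0 = h[is_rank0].mean(axis=0)   # mean activation on rank-0 curves
    mean1 = h[~is_rank0].mean(axis=0)  # mean activation on rank-1 curves
    delta = mean0 - mean1
    # Aligned: rank-0 detector with w > 0, or rank-1 detector with w < 0.
    aligned = np.sign(delta) == np.sign(w)
    return {
        "aligned": aligned,
        "push_on_rank0": float(w @ mean0),  # sum_n w_n * mean(h_n | rank 0)
        "push_on_rank1": float(w @ mean1),  # sum_n w_n * mean(h_n | rank 1)
    }

# Toy circuit: one aligned rank-0 detector, one rank-1 detector with a
# positive weight, and one rank-0 detector with a negative weight.
h = np.array([[2.0, 0.0, 1.0],
              [4.0, 0.0, 3.0],
              [0.0, 1.0, 0.5],
              [0.0, 3.0, 0.5]])
is_rank0 = np.array([True, True, False, False])
w = np.array([0.8, 0.2, -0.25])

vote = circuit_vote(h, w, is_rank0)
```

In this toy circuit both class totals are positive and the rank-0 total is the larger one, so the rank-1 class is indicated by a weaker push rather than by a negative pull, which is the qualitative pattern reported for the real circuit.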

### 8.3 Mestre-Nagao Sums

In order to identify the learned input weighting, we fit a Ridge regression from raw prime-indexed Frobenius traces $\{a_p : p \leq 128\}$ to the pre-activation of each circuit neuron. Of the top five discriminating rank-0 detectors, three of them (N99, N412, and N396) have learned regression coefficients $\widehat{c_p}$ that closely match the Mestre-Nagao weights $w_p = \frac{\log{p}}{p \cdot \log{B}}$ from [eq. 1](https://arxiv.org/html/2606.15036#S2.E1) (with Spearman $r \geq 0.997$). The remaining two neurons (N506 and N335) show the decaying shape more loosely (with Spearman $r = 0.59, 0.65$ respectively). The circuit computes parallel, approximately scaled copies of one Mestre-Nagao partial sum rather than partitioning prime ranges across the neurons. Refer to [fig. 9](https://arxiv.org/html/2606.15036#S8.F9) for the exact comparison of Mestre-Nagao weights and learned coefficients for N99.
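The regression-and-compare step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the traces are synthetic (uniform integers within the Hasse bound, not real $a_p$), the pre-activation is simulated as a scaled Mestre-Nagao sum plus noise, the ridge fit is a closed-form NumPy solve with an assumed penalty `alpha=1.0`, and `B=128` is a placeholder since the value of $B$ is not given in this section. All function names are hypothetical.

```python
import numpy as np

def primes_up_to(n):
    """Sieve of Eratosthenes returning all primes <= n."""
    sieve = np.ones(n + 1, dtype=bool)
    sieve[:2] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = False
    return np.flatnonzero(sieve)

def mestre_nagao_weights(primes, B):
    """w_p = log(p) / (p * log(B))."""
    return np.log(primes) / (primes * np.log(B))

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

rng = np.random.default_rng(0)
primes = primes_up_to(128)
w_mn = mestre_nagao_weights(primes, B=128)  # B=128 is an assumed placeholder

# Synthetic traces: integers in [-floor(2*sqrt(p)), floor(2*sqrt(p))].
bounds = np.floor(2 * np.sqrt(primes)).astype(int)
X = np.column_stack(
    [rng.integers(-b, b + 1, size=5000) for b in bounds]
).astype(float)

# Simulated pre-activation: scaled Mestre-Nagao sum, bias, small noise.
z = 3.0 * X @ w_mn + 0.1 + rng.normal(scale=0.01, size=5000)

coef, intercept = ridge_fit(X, z, alpha=1.0)
rho = spearman(coef, w_mn)
```

Because Spearman correlation depends only on rank order, `rho` is unaffected by the choice of $B$ and by any positive rescaling of the neuron's weights, which is why it is a natural statistic for detecting "scaled copies" of the Mestre-Nagao profile.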

![Refer to caption](https://arxiv.org/html/2606.15036v1/x9.png)

Figure 9: The left figure shows the learned neuron 99 coefficients vs. the Mestre-Nagao weights at the top 15 primes by magnitude: note the near-identical profiles. The center figure shows a scatter plot of learned coefficients vs. Mestre-Nagao weight at each prime $p$. The right figure shows coefficient profiles for 5 rank-0 detector neurons, showing the parallel Mestre-Nagao-like structure across 3 of the top 5 neurons in the circuit.

In [fig. 10](https://arxiv.org/html/2606.15036#S8.F10), we assemble the mechanism of N99. The pre-activation distributions of the two rank classes are well-separated about the firing threshold: rank-0 curves sit comfortably above it ($z > 0$), and rank-1 curves below. The ReLU activation itself acts as the decision boundary and the neuron functions as a one-dimensional classifier on its learned statistic. We note that said statistic is linear in the traces, as seen in the center diagram, with slope 1.61 in $\tilde{a}_7$, the neuron's most correlated position. This is consistent with the global linear fit of $R^2 = 0.89$. By the Hasse bound, $a_7$ takes only the 11 values $-5, \ldots, 5$, which explains the vertical banding in the diagram. The post-ReLU activation is a graded decision of rank: on rank-0 test curves, the post-activation increases monotonically with the value of $\log{L(E,1)}$ (with Spearman $r = 0.827$), so above threshold, the neuron's firing strength is a proxy for the distance of $L(E,1)$ from 0. Rank-1 curves have $L(E,1) = 0$ and are clipped to silence.
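The count of 11 values follows from the Hasse bound $|a_p| \leq 2\sqrt{p}$, and can be checked with a short sketch. The helper name `hasse_trace_values` is hypothetical, and the function only enumerates integers allowed by the bound, not which of them actually occur for a given curve.

```python
import math

def hasse_trace_values(p):
    """Integers a with |a| <= 2*sqrt(p), equivalently a*a <= 4*p."""
    bound = math.isqrt(4 * p)  # floor(2 * sqrt(p)) without floating point
    return list(range(-bound, bound + 1))

values = hasse_trace_values(7)
```

For $p = 7$ we have $2\sqrt{7} \approx 5.29$, so `values` runs from $-5$ to $5$ and has length 11, matching the vertical bands in the center panel.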

![Refer to caption](https://arxiv.org/html/2606.15036v1/x10.png)

Figure 10: The left figure shows the pre-activation distributions for rank 0 vs. rank 1 curves. Note that the two class distributions are cleanly separated by the firing threshold. The center figure shows a scatter plot of pre-activation vs. the Frobenius trace $a_7$, which shows a linear relationship. The right figure shows the post-activation vs. $\log{L(E,1)}$ (Spearman $r = 0.827$), which confirms that neuron 99 is an $L$-value detector.

N99 thus implements a thresholded Mestre-Nagao partial sum: a linear, Mestre-Nagao-weighted functional of the prime-indexed traces whose ReLU output is simultaneously a crude estimator of the $L$-value and a rank vote. The sparse circuit aggregates twenty such similar detectors with varying prime emphases and thresholds. This ensemble outperforms any single Mestre-Nagao sum (AUROC 0.95) and explains at the neuron level the near-complete encoding of $\log{L(E,1)}$ in the CLS embedding observed in [section 6](https://arxiv.org/html/2606.15036#S6).
