
# How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits
Source: [https://arxiv.org/html/2605.08348](https://arxiv.org/html/2605.08348)
###### Abstract

The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring *necessity* and *sufficiency*. We measure circuit reuse, the proportion of components shared across per-example circuits within a task, and investigate two less-studied properties of this: *consistency*, the recurrence of components within a task, and *specificity*, their uniqueness to a task. Using edge attribution patching across six tasks and seven models, we find that within-task reuse is high and that shared components are necessary for task performance, with ablations causing up to ∼100% relative accuracy drops. However, circuits turn out not to be task-specific: ablating one task’s circuit damages another task’s performance about as much as that task’s own circuit does. We discover that this is due to substantial overlap between circuits across tasks; the overlapping components are causally important for performance. Some circuits do contain a smaller set of task-specific components, but these account for only a modest portion of circuit performance. Overall, our findings suggest that while circuit discovery at the level of attention heads and MLP layers identifies important components, their lack of task-specificity raises questions about the degree to which circuits can support targeted understanding and intervention on model behavior.


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.08348v1/x1.png)

Figure 1: Circuit evaluation criteria. We propose evaluating circuits for consistency across inputs and specificity across tasks, in addition to necessity and sufficiency.

Neural networks are infamously black box; even when we can elicit strong performance on a task, it is unclear which internal computations are responsible. The field of mechanistic interpretability seeks to reverse-engineer the internal computations of neural networks by identifying *circuits*: sparse subgraphs of model components that are causally responsible for a particular behavior (Elhage et al., [2021](https://arxiv.org/html/2605.08348#bib.bib3); Wang et al., [2023](https://arxiv.org/html/2605.08348#bib.bib4)). A growing body of work has developed methods for extracting such circuits (Syed et al., [2024](https://arxiv.org/html/2605.08348#bib.bib1); Marks et al., [2025](https://arxiv.org/html/2605.08348#bib.bib20); Jafari et al., [2025](https://arxiv.org/html/2605.08348#bib.bib19)) and evaluating their *necessity* (removing the circuit should degrade performance) and *sufficiency* (the circuit alone should reproduce the behavior) (Shi et al., [2024](https://arxiv.org/html/2605.08348#bib.bib18)).

We argue that there are two additional properties that are crucial to consider ([Figure 1](https://arxiv.org/html/2605.08348#S1.F1)). First, circuits should be *consistent*: if a circuit truly captures how a model solves a task, the same components should recur for different instances of that task. Second, circuits should be *specific*: a task’s circuit should be meaningfully different from the circuits of unrelated tasks. Without consistency, a circuit is an artifact of a particular input rather than a description of the model’s algorithm. Without specificity, a circuit is not task-specific, limiting its utility for understanding or intervention.

We test both properties at scale. Using Edge Attribution Patching (EAP; Syed et al. ([2024](https://arxiv.org/html/2605.08348#bib.bib1))), we extract per-example circuits for $n=1000$ examples across six tasks spanning algorithmic reasoning (Addition, Boolean Logic), information retrieval (IOI, CopyColors MCQA), and knowledge-intensive benchmarks (ARC Easy, ARC Challenge), and seven models from four architecture families (Gemma 2, Llama 3.2, Qwen3, OLMo-2). We find the following:

1. **Circuits are consistent within a task.** Across tasks and models, a substantial fraction of each per-example circuit is drawn from a shared set of components. Ablating this shared set causes large accuracy loss compared to a capacity-matched random ablation, confirming that the shared components are causally important, not merely high-scoring attribution method artifacts.
2. **Circuits are not specific across tasks.** When we ablate one task’s circuit and evaluate on a different task, the performance drop is comparable to ablating that task’s own circuit. This is explained by the substantial overlap between task circuits: at the component level, different tasks’ circuits are composed of largely the same MLP layers. Selective ablation experiments reveal that a small number of important task-specific components do exist, but the bulk of each circuit is shared across tasks.

These findings suggest that circuit discovery at the level of attention heads and MLP layers primarily identifies general-purpose model infrastructure rather than task-specific mechanisms. We discuss several explanations for this, including the role of shared MLP layers and polysemanticity, and speculate that finer-grained methods, such as concept-level/sparse feature circuits (Marks et al., [2025](https://arxiv.org/html/2605.08348#bib.bib20)), may be needed to recover task-specific structure. We also discuss implications for applications that assume circuit-level modularity, including model editing (Meng et al., [2022](https://arxiv.org/html/2605.08348#bib.bib23); Dai et al., [2022](https://arxiv.org/html/2605.08348#bib.bib25)) and safety interventions (Li et al., [2023](https://arxiv.org/html/2605.08348#bib.bib24)), while noting important limitations of our analysis for these settings.

## 2 Background

### 2.1 Transformer Circuits

We use the Transformer Circuits framework (Elhage et al., [2021](https://arxiv.org/html/2605.08348#bib.bib3)) to represent decoder-only transformers as a directed acyclic computational graph. The *residual stream* acts as a central communication channel: the token embedding is written into it, and each subsequent layer reads from it, performs a computation, and additively writes its output back. Because contributions are additive, the output of any component can influence any downstream component, resulting in a fully connected graph between layers. Within this graph, nodes are the model’s computational units, typically attention heads and MLP layers, though other decompositions (*e.g.*, individual neurons or sparse autoencoder features) are also possible (Marks et al., [2025](https://arxiv.org/html/2605.08348#bib.bib20); Arora et al., [2025](https://arxiv.org/html/2605.08348#bib.bib26); Ameisen et al., [2025](https://arxiv.org/html/2605.08348#bib.bib27)). Edges represent the flow of information between components through the residual stream. A *circuit* is then defined as a sparse subgraph of this computational graph which is sufficient to explain a given model behavior (Wang et al., [2023](https://arxiv.org/html/2605.08348#bib.bib4); Conmy et al., [2023](https://arxiv.org/html/2605.08348#bib.bib22)).

### 2.2 Edge Attribution Patching

To identify components which are important for a given behavior, researchers typically use activation patching (Vig et al., [2020](https://arxiv.org/html/2605.08348#bib.bib28); Meng et al., [2022](https://arxiv.org/html/2605.08348#bib.bib23); Wang et al., [2023](https://arxiv.org/html/2605.08348#bib.bib4)), which replaces each component’s activation with its value under a corrupted input and measures how much the output changes. However, this requires a separate forward pass per component, which becomes prohibitively expensive. Edge Attribution Patching (EAP; Syed et al. ([2024](https://arxiv.org/html/2605.08348#bib.bib1))) approximates these causal effects using gradient information, requiring only two forward passes and one backward pass per example. Components are ranked by the absolute value of their attribution score, and the top-$K$ components define the circuit. Syed et al. ([2024](https://arxiv.org/html/2605.08348#bib.bib1)) show that EAP recovers circuits competitive with those found by more expensive methods, making it suitable for the large-scale analysis we conduct here (see [Appendix B](https://arxiv.org/html/2605.08348#A2) for full details).
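As a toy sketch of this approximation, EAP-style scores can be computed as the dot product of (corrupted minus clean activation) with the gradient of the task metric from the clean run. The dict-based interface and component names below are hypothetical; a real implementation would read activations and gradients from hooks on the model’s forward and backward passes (e.g., via TransformerLens). Note this sketch scores components (nodes) rather than individual edges, matching how components are ranked and selected above:

```python
import numpy as np

def eap_scores(clean_acts, corrupt_acts, clean_grads):
    """First-order attribution: the effect of patching in the corrupted
    activation is approximated by (a_corrupt - a_clean) . dL/da, where the
    gradient is taken on the clean run. Inputs are dicts mapping a component
    name to a 1-D activation (or gradient) vector."""
    return {
        name: float(np.dot(corrupt_acts[name] - clean_acts[name], clean_grads[name]))
        for name in clean_acts
    }

def top_k_circuit(scores, k):
    """Rank components by absolute attribution score and keep the top-k."""
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return set(ranked[:k])
```

Because all gradients come from a single backward pass, this costs two forward passes (clean and corrupted) and one backward pass per example, as described above.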

## 3 Methodology

### 3.1 Tasks and Models

We evaluate on six tasks spanning algorithmic reasoning (Addition, Boolean Logic), information retrieval from context (IOI (Wang et al., [2023](https://arxiv.org/html/2605.08348#bib.bib4)), CopyColors MCQA (Mueller et al., [2025](https://arxiv.org/html/2605.08348#bib.bib21))), and knowledge-intensive benchmarks (ARC Easy, ARC Challenge (Clark et al., [2018](https://arxiv.org/html/2605.08348#bib.bib5))). Full task descriptions are in [Appendix C](https://arxiv.org/html/2605.08348#A3). We study seven models from four architecture families: Gemma 2 (2B, 2B IT; Gemma Team ([2024](https://arxiv.org/html/2605.08348#bib.bib14))), Llama 3.2 (3B, 3B Instruct; Llama team ([2024](https://arxiv.org/html/2605.08348#bib.bib11))), Qwen3 (4B, 8B; Yang et al. ([2025](https://arxiv.org/html/2605.08348#bib.bib10))), and OLMo-2-1B (Team OLMo, [2024](https://arxiv.org/html/2605.08348#bib.bib12)), which is used for the pretraining dynamics analysis.

### 3.2 Extracting and Evaluating Shared Circuits

For each task $T$ we use a dataset $\mathcal{D}_T^{\mathrm{train}}=\{(x_i,y_i)\}_{i=1}^{n}$ of $n=1000$ examples, where $x_i$ is an input prompt and $y_i$ the target answer token, and a disjoint held-out evaluation set $\mathcal{D}_T^{\mathrm{eval}}$. Let $\mathcal{C}=(\mathcal{V},\mathcal{E})$ denote the model’s computation graph, where vertices $\mathcal{V}$ are model components (attention heads and MLPs) and edges $\mathcal{E}$ are the connections between them. For each $(x_i,y_i)\in\mathcal{D}_T^{\mathrm{train}}$ we extract a per-input circuit $\mathcal{C}_i\subseteq\mathcal{C}$ via EAP, defined as the subgraph spanned by the top-$K$% of components by absolute attribution score, and sweep $K\in\{1,5,10,20,30\}$. Given the per-input circuits $\{\mathcal{C}_i\}_{i=1}^{n}$, the shared component set $S_P$ contains all components that appear in at least a fraction $P$ of the per-input circuits:

$$S_P=\Bigl\{\,c\in\mathcal{C}\,:\,\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\mathbf{1}\{c\in\mathcal{C}_i\}\,\geq\,P\,\Bigr\}. \tag{1}$$

We define $\textbf{reuse@}P$ as the mean fraction of a per-input circuit overlapping with the shared set,

$$\textbf{reuse@}P=\frac{1}{n}\sum_{i=1}^{n}\frac{|S_P\cap\mathcal{C}_i|}{|\mathcal{C}_i|}, \tag{2}$$

and report $\textbf{reuse@}P$ for $P\in\{95\%,96\%,\ldots,100\%\}$.
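With per-input circuits represented as sets of component identifiers, Equations (1) and (2) reduce to simple set and counting operations; a minimal sketch (function names illustrative):

```python
def shared_set(circuits, p):
    """S_P (Eq. 1): components appearing in at least a fraction p of the
    per-example circuits, each circuit given as a set of component names."""
    n = len(circuits)
    counts = {}
    for circ in circuits:
        for c in circ:
            counts[c] = counts.get(c, 0) + 1
    return {c for c, k in counts.items() if k / n >= p}

def reuse_at(circuits, p):
    """reuse@P (Eq. 2): mean fraction of each per-example circuit covered
    by the shared set S_P."""
    s_p = shared_set(circuits, p)
    return sum(len(s_p & circ) / len(circ) for circ in circuits) / len(circuits)
```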

To test whether shared components are causally important, we ablate (zero out) the shared set $S_P$ and measure accuracy on $\mathcal{D}_T^{\mathrm{eval}}$. A raw accuracy drop, however, is difficult to interpret. Removing any set of components reduces network capacity, so some degradation is expected regardless of whether the ablated components are task-relevant. We therefore compare the shared-set ablation against a *capacity-conserved control* (C$^3$). Let $S_C$ be a uniformly random subset of $\mathcal{C}\setminus S_P$ matching $S_P$ in the number of attention heads and MLPs (we do not randomly sample edges; ablating a vertex zeroes out all edges adjacent to it, so matching vertex counts automatically accounts for edge ablation). Since both ablations are capacity-matched, any additional degradation from the shared-set ablation can largely be attributed to the functional role of those components rather than to capacity.
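The control set $S_C$ can be drawn as follows; a minimal sketch, assuming each component is tagged as an attention head or an MLP (the `kind` mapping and function name are illustrative):

```python
import random

def capacity_matched_control(all_components, shared, kind, seed=0):
    """Sample a random control set S_C from C \\ S_P with the same number of
    attention heads and MLPs as S_P. `kind` maps each component name to
    either "head" or "mlp"."""
    rng = random.Random(seed)
    pool = [c for c in all_components if c not in shared]
    control = set()
    for component_type in ("head", "mlp"):
        candidates = [c for c in pool if kind[c] == component_type]
        need = sum(1 for c in shared if kind[c] == component_type)
        control.update(rng.sample(candidates, need))
    return control
```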

We formalize ablation via the do-calculus (Pearl, [1995](https://arxiv.org/html/2605.08348#bib.bib38)). For an ablation set $S\subseteq\mathcal{C}$, let $\mathrm{do}(S\leftarrow 0)$ denote the intervention that clamps every $c\in S$ to zero. Vertex activations are set to zero, and edges are removed from the computation graph (equivalently, the signal passed along them is zeroed). The model’s output distribution on $x$ under this intervention is $p_{\mathcal{M}}(\cdot\mid x;\,\mathrm{do}(S\leftarrow 0))$, with $S=\varnothing$ referring to the output distribution of the original (unablated) model. We define the *zero-ablated prediction* (ZAP) of $\mathcal{M}$ under ablation $S$ as the model’s top-logit token under the intervention (we use argmax decoding here, but any decoding algorithm could be used on the output distribution $p_{\mathcal{M}}$),

$$\mathrm{ZAP}(\mathcal{M},S,x)=\arg\max_{y'}\,p_{\mathcal{M}}(y'\mid x;\,\mathrm{do}(S\leftarrow 0)), \tag{3}$$

We define accuracy on task $T$ under ablation $S$ as the fraction of held-out examples on which the model’s prediction equals the true label ($S_P$, $\mathrm{acc}$, and $\mathit{necessity}$ all implicitly depend on $T$ through $\mathcal{D}_T^{\mathrm{train}}$ and $\mathcal{D}_T^{\mathrm{eval}}$, but we drop $T$ from the notation for readability),

$$\mathrm{acc}(\mathcal{M},S)=\frac{1}{|\mathcal{D}_T^{\mathrm{eval}}|}\sum_{(x,y)\,\in\,\mathcal{D}_T^{\mathrm{eval}}}\mathbf{1}\{y=\mathrm{ZAP}(\mathcal{M},S,x)\}. \tag{4}$$
Finally, we define the causal effect of $S_P$ on task $T$ as

$$\mathit{necessity}(\mathcal{M},S_P)=\frac{\mathrm{acc}(\mathcal{M},S_C)-\mathrm{acc}(\mathcal{M},S_P)}{\mathrm{acc}(\mathcal{M},\varnothing)}. \tag{5}$$

A positive $\mathit{necessity}$ means ablating $S_P$ hurts more than $C^3$, which we interpret as evidence that the shared components are causally important for task $T$.
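Equations (4) and (5) are straightforward to compute once predictions under each ablation condition are available; a minimal sketch with hypothetical names:

```python
def accuracy(predictions, labels):
    """acc (Eq. 4): fraction of held-out examples where the (possibly
    ablated) model's argmax prediction matches the target token."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def necessity(acc_baseline, acc_control, acc_shared):
    """necessity (Eq. 5): extra accuracy lost by ablating the shared set
    S_P beyond the capacity-matched control S_C, normalized by the
    unablated model's accuracy."""
    return (acc_control - acc_shared) / acc_baseline
```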

### 3.3 Cross-Task Experiments

The experiments above tell us whether the shared components are causally important for a given task. However, they do not disambiguate whether those components are specific to that task. A component could be critical for task $A$ simply because it is critical for all tasks, in which case finding it in task $A$’s circuit might not tell us much about task $A$ in particular. To probe specificity, we run two cross-task experiments. The first ablates a task’s shared circuit and measures the effect on other tasks. The second ablates subsets of different tasks’ circuits.

First, define $\Delta_A^B=\mathrm{acc}_A(\mathcal{M},\varnothing)-\mathrm{acc}_A(\mathcal{M},S_P^B)$, the accuracy drop on task $A$ when ablating task $B$’s shared circuit, where $\mathrm{acc}_A$ denotes accuracy evaluated on $\mathcal{D}_A^{\mathrm{eval}}$ and $S_P^B$ is task $B$’s shared component set. If circuits are task-specific, ablating task $A$’s own circuit should hurt task $A$ more than ablating any other task’s circuit. For each task and model, we compare $\Delta_A^A$, the drop on $A$ when ablating its own circuit, against $\frac{1}{|\mathcal{T}|-1}\sum_{B\neq A}\Delta_A^B$, the mean drop on $A$ when ablating each other task’s circuit instead.

Second, to localize where task-specific signal resides, we partition the union $\mathcal{C}_A\cup\mathcal{C}_B$ for a pair of tasks $(A,B)$ into three disjoint sets: the *shared core* $\mathcal{C}_A\cap\mathcal{C}_B$ (components in both circuits), the *$A$-only* set $\mathcal{C}_A\setminus\mathcal{C}_B$ (components in $\mathcal{C}_A$ but not $\mathcal{C}_B$), and the *$B$-only* set $\mathcal{C}_B\setminus\mathcal{C}_A$ (components in $\mathcal{C}_B$ but not $\mathcal{C}_A$). We ablate each set independently and report the accuracy drop on task $A$ alongside the mean drop across other tasks. Results are averaged over all task pairs.
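With circuits represented as component sets, this partition is one intersection and two set differences; a minimal sketch (the function and key names are illustrative):

```python
def partition_pair(circuit_a, circuit_b):
    """Split C_A ∪ C_B into the three disjoint groups ablated in the
    cross-task experiments: shared core, A-only, and B-only."""
    return {
        "shared_core": circuit_a & circuit_b,
        "a_only": circuit_a - circuit_b,
        "b_only": circuit_b - circuit_a,
    }
```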

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/combined/combined_reuse_lift.png)

Figure 2: Within-task circuit reuse and importance scores across circuit sizes. *Top:* reuse@97% measures the fraction of each example’s top-$K$% circuit covered by components appearing in at least 97% of examples; each line is a model, and the $x$-axis sweeps the circuit size $K$. *Bottom:* The importance score measures how much more performance drops when shared components are removed versus when an equally sized random set is removed; a positive score means the shared components matter more than a random set of the same size. See [Appendix E](https://arxiv.org/html/2605.08348#A5) for all settings.

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/pretraining/pretraining_k10.png)

Figure 3: Pretraining dynamics of circuit reuse and causal importance in OLMo-2-1B. *Top:* reuse@P at $K$=10% across pretraining checkpoints, sweeping the consistency threshold $P\in\{95,\dots,100\}$ (darker = stricter). *Bottom:* Baseline accuracy (teal solid), accuracy after a capacity-matched random ablation (gray dashed), and accuracy after ablating the shared $K$=10% circuit (pink). The gap between gray and pink reflects necessity. Checkpoints span the full training run from 0B to 4001B tokens of stage-1 plus two stage-2 *anneal* checkpoints (ingredient1, ingredient3); each anneal continues training for ∼51B more tokens on a curated mixture with a learning-rate decay schedule, producing the released OLMo-2-1B.

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/cross_task/cross_task_diagonal_vs_offdiag_K10.png)

Figure 4: Own-circuit vs. other-circuit accuracy drop at $K$=10%. For each task, “Own” is the accuracy drop from removing that task’s circuit; “Other” is the mean drop from removing every other task’s circuit. The two bars are close across tasks and models, indicating that circuits are not task-specific. See [Appendix F](https://arxiv.org/html/2605.08348#A6) for all $K$ values.

## 4 Within-Task Consistency

#### Circuits reuse the same components across inputs.

If a circuit captures how a model solves a task, the same components should appear regardless of which input is used. We find consistent evidence for this. [Figure 2](https://arxiv.org/html/2605.08348#S3.F2) (top row) shows the reuse@97% rate, the fraction of each example’s circuit covered by components appearing in at least 97% of examples, as a function of circuit size $K$, with each model as a separate line (see [Appendix E](https://arxiv.org/html/2605.08348#A5) for additional settings). Most task-model combinations show 40–70% reuse at $K$=10%, meaning that roughly half or more of any individual example’s circuit is drawn from a set of components shared by nearly every example. CopyColors MCQA and Boolean tend toward the higher end, while ARC and IOI are more moderate. The Qwen models generally achieve higher reuse than Llama or Gemma.

The relationship between reuse and circuit size is not strictly monotonic: for some task-model pairs reuse increases with $K$ (*e.g.*, Addition in the Gemma 2 family), while for others (*e.g.*, IOI in the Qwen 3 family) it decreases as the larger circuit pulls in more example-specific components. Despite this variation, reuse remains well above zero at $K\geq 5\%$.

#### Shared components are actually doing work, not just appearing frequently.

High reuse alone does not establish that the shared components matter for the task; they could simply be components with large activations that get ranked highly without playing a real functional role. We test this directly by comparing how much performance drops when the shared components are removed versus when an equally sized set of randomly chosen components is removed. [Figure 2](https://arxiv.org/html/2605.08348#S3.F2) (bottom row) shows this excess performance drop across tasks and models. For most tasks, removing the shared components hurts more than removing an equally sized random set, and this gap grows with $K$. The effect is strongest at moderate-to-large circuit sizes ($K\geq 10\%$), where Addition shows an excess drop of 0.8–1.0 across all models and CopyColors MCQA reaches 0.5–0.8.

The one exception is Addition at $K$=1% in Gemma models, where the circuit contains only 3 MLPs (out of 26 MLPs in Gemma 2 2B), making random comparisons somewhat likely to overlap with it by chance. This effect disappears at larger $K$, where Addition shows the largest excess drop of any task.

#### Consistency emerges early but degrades over training; necessity is largely uninformative until anneal.

[Figure 3](https://arxiv.org/html/2605.08348#S3.F3) tracks reuse and necessity across OLMo-2-1B’s full ∼4T-token training run. reuse@P=95 at $K$=10% peaks in the first ∼76B tokens (around 50–60% for all tasks) and then declines for the rest of stage-1: to 7–22% on Addition, 25–35% on ARC, 18–37% on IOI, and 30–50% on CopyColors MCQA. Most of this drop happens in the gap between our 76B and 399B checkpoints rather than gradually over training. Boolean is the most extreme: its $K$=10% shared circuit becomes empty by 399B and remains so for the rest of training. The two anneal checkpoints sit close to the late stage-1 reuse values, suggesting the anneal phase does not substantially reshape circuit consistency.

Necessity is hard to read for most of stage-1 because baseline accuracy on Addition, ARC, and MCQA hovers near chance, leaving little room for ablation to do measurable damage. During the anneal phase, however, baselines on Addition (0%→40%), ARC (Easy) (25%→58%), and CopyColors MCQA (10%→95%) jump sharply. This is likely because anneal mixtures emphasize downstream-task data, so the model has started learning these tasks. On MCQA at the anneals, ablating the shared circuit drops accuracy from ∼95% to 0%, while a capacity-matched random ablation only drops it to 30%. On ARC, by contrast, both ablations land near 25%, so the circuit is not necessary. Full checkpoint breakdowns are in [Appendix L](https://arxiv.org/html/2605.08348#A12).

Table 1: Mean of MLP vs. attention head circuit share across tasks.

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/combined/combined_overlap_selective.png)

Figure 5: Cross-task overlap and targeted removal at $K$=10%. (a) Overlap between task pairs’ circuits; high overlap explains why removing one task’s circuit damages other tasks comparably. (b) Accuracy drop from removing each circuit group: shared core, task-specific, complementary, and a random control of equal size. Solid bars show the target task drop; hatched bars show the mean drop on other tasks. Results shown for Llama-3.2-3B (top) and Qwen3-4B (bottom); see [Appendices F](https://arxiv.org/html/2605.08348#A6) and [I](https://arxiv.org/html/2605.08348#A9) for all $K$ values and models.
#### MLP layers dominate at small circuit sizes.

Breaking down circuits by component type reveals that MLPs make up the vast majority of small circuits ([Table 1](https://arxiv.org/html/2605.08348#S4.T1); see [Appendix G](https://arxiv.org/html/2605.08348#A7) for full results across all models). For Gemma 2 2B-IT, MLPs account for 95–100% of the circuit at $K\leq 10\%$ across all tasks, and the Llama and Qwen families show the same pattern. As $K$ increases, attention heads take up a progressively larger share, reaching roughly half the circuit by $K$=20–30%. The location of these components across layers varies by model family: in Llama and Qwen, the small-$K$ circuits are concentrated in early layers, with middle- and late-layer components joining as $K$ grows. Gemma is more task-dependent, with some tasks placing their small circuits in middle-to-late layers instead (see [Appendix H](https://arxiv.org/html/2605.08348#A8) for cumulative layer distributions). This is consistent with early work analyzing BERT, which found that lower layers tend to capture general syntactic structure while higher layers specialize in more task-specific semantic processing (Tenney et al., [2019](https://arxiv.org/html/2605.08348#bib.bib33)).

## 5 Cross-Task Specificity

Having established that circuits are consistent, we now ask a different but related question: are the components identified for one task *specific* to that task? We investigate this below.

#### Removing one task’s circuit damages other tasks just as much, in most models.

If circuits are task-specific, removing task $A$’s circuit should damage task $A$ far more than task $B$. For most model families, this is not what we observe. [Figure 4](https://arxiv.org/html/2605.08348#S3.F4) compares, for each task and model at $K$=10%, the accuracy drop from removing that task’s own circuit (“Own”) against the mean drop from removing all other tasks’ circuits (“Other”; see [Appendix F](https://arxiv.org/html/2605.08348#A6) for all $K$ values). These two numbers are close in the Llama and Qwen families: in Llama-3.2-3B, removing the Addition circuit causes a 99% drop on Addition, while removing other tasks’ circuits causes a mean 99% drop on Addition as well. For ARC (Challenge), the own-circuit drop is 41% and the other-circuit mean is 40%. The Gemma family is a notable exception, showing more differentiation between tasks in several cases. For example, CopyColors MCQA in Gemma 2 2B IT has a 68% own-circuit drop vs. 42% from other circuits, a gap large enough to suggest some degree of task-specific structure at this granularity.

The fact that removing *other* tasks’ circuits causes equally large drops on Addition tells us that Addition’s small circuit is not unique to it: it is a subset of the shared components that every task depends on.

#### The non-specificity is explained by how much circuits overlap.

The near-identical accuracy drops are explained by the fact that different tasks’ circuits are composed of largely the same components. [Figure 5](https://arxiv.org/html/2605.08348#S4.F5)a shows the overlap between task pairs’ circuits at $K$=10% for Llama-3.2-3B and Qwen3-4B (see [Appendix F](https://arxiv.org/html/2605.08348#A6) for all $K$ values and models). Overlap at $K$=10% typically ranges from 0.46 to 0.89, with the highest overlap between related tasks (ARC Easy and ARC Challenge: 1.00 in Llama, 0.89 in Qwen) and the lowest involving CopyColors MCQA. For comparison, two random circuits of size $K$ would have an expected overlap of roughly $K/(2-K)$, or about 5% at $K$=10%, an order of magnitude below what we observe. Overlap tends to be highest at small $K$ and decreases as $K$ grows and more task-specific components enter the circuit, but remains well above chance even at $K$=30%.
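The $K/(2-K)$ baseline can be checked with a quick Monte Carlo simulation. This sketch assumes overlap is measured as the Jaccard index of two independently sampled random circuits, each covering a fraction $K$ of components (an assumption on our part, since the Jaccard expectation matches the stated closed form):

```python
import random

def expected_random_overlap(k):
    """Closed form: expected Jaccard overlap of two independent random
    subsets, each covering a fraction k of components, is ~ k/(2 - k)."""
    return k / (2.0 - k)

def simulate_random_overlap(n_components, k, trials=200, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    size = int(n_components * k)
    total = 0.0
    for _ in range(trials):
        a = set(rng.sample(range(n_components), size))
        b = set(rng.sample(range(n_components), size))
        total += len(a & b) / len(a | b)
    return total / trials
```

At $K$=10% both the closed form and the simulation sit near 0.053, an order of magnitude below the observed 0.46–0.89 range.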

#### Task-specific structure exists and has measurable effects.

High overlap does not mean circuits are entirely undifferentiated. To identify where task-specific structure resides, we split each pair of task circuits into three non-overlapping groups: the shared core (components in both circuits), the task-specific components (in circuit $A$ but not $B$), and the complementary components (in circuit $B$ but not $A$). We then remove each group independently ([Figure 5](https://arxiv.org/html/2605.08348#S4.F5)b; see [Appendix I](https://arxiv.org/html/2605.08348#A9) for results across all $K$ values and models).

*Shared core.* Removing the shared core causes large drops on both the target task and other tasks. For Addition in Llama-3.2-3B, removing the shared core drops the target by ∼97% and other tasks by ∼38%; in Qwen3-4B, the pattern is similar (∼83% vs. ∼20%). For ARC (Challenge), the target and non-target drops are closer (∼37% vs. ∼35% in Llama). In all cases, removing the shared core causes substantially more damage than removing an equally sized random set, confirming that these components are genuinely important rather than simply numerous.

*Task-specific components.* Removing the task-specific components produces larger accuracy drops on the target task than on other tasks, showing that these components carry some signal that is genuinely specific to that task. For Addition, the target drop is substantially larger than the drop on other tasks (∼83pp vs. ∼18pp in Llama-3.2-3B). For ARC and CopyColors MCQA, the gap between target and non-target drops is much smaller. In absolute terms, these components are also few, accounting for only 15–30% of the total circuit at $K$=10% (see [Appendix J](https://arxiv.org/html/2605.08348#A10)).

*Complementary components.* Removing the complementary components generally causes small drops that do not disproportionately hurt the target task.

In summary, the shared core accounts for most of each circuit and most of the performance impact when removed. The task-specific components are a small portion of each circuit, but removing them does hurt the target task more than other tasks. Because the shared core dominates, removing any task’s circuit strips away roughly the same components and causes roughly the same performance degradation regardless of which task is being evaluated.

## 6 Discussion

### 6\.1Why do circuits overlap?

We hypothesize several \(non\-mutually\-exclusive\) explanations for high cross\-task overlap\.

#### MLP layers as shared infrastructure\.

At small circuit sizes, circuits are composed almost entirely of MLP layers ([Appendix K](https://arxiv.org/html/2605.08348#A11)). This likely contributes to the especially high overlap at small *K*: because models have far fewer MLP layers than attention heads, MLP-dominated circuits are constrained to draw from a small shared set of components. Beyond this, these layers plausibly perform general-purpose operations – storing parametric knowledge (Sun et al., [2025](https://arxiv.org/html/2605.08348#bib.bib35); Liu et al., [2025](https://arxiv.org/html/2605.08348#bib.bib36)), mapping tokens into a useful representational space, adjusting positional information – that all downstream computation depends on, regardless of task. Indeed, activation steering typically operates on post-MLP residual stream states (Subramani et al., [2022](https://arxiv.org/html/2605.08348#bib.bib34)) rather than on individual attention heads.

#### Polysemanticity and superposition\.

The observation by Elhage et al. ([2022](https://arxiv.org/html/2605.08348#bib.bib2)) that neural networks represent more features than they have dimensions implies that individual heads and MLP layers inevitably serve multiple roles. At the granularity of entire MLP layers, these roles cannot be disentangled, so circuits for different tasks will overlap even if the underlying feature-level computations are distinct. A natural response is to move to finer-grained units of analysis: Marks et al. ([2025](https://arxiv.org/html/2605.08348#bib.bib20)) propose sparse feature circuits built on sparse autoencoders, which decompose polysemantic components into monosemantic features and may recover the task-specific structure that component-level analysis misses. Whether feature-level circuits exhibit greater specificity than what we observe here is a relevant open question.

#### Reuse as a feature, not a bug\.

Our framing so far has treated non-specificity as a limitation of circuit discovery, but this implicitly assumes that task circuits *ought* to be disjoint. It is worth noting that high reuse may itself be a desirable property. An alternative hypothesis holds that models develop (via training pressures) small, reusable computational motifs (*e.g.*, induction heads, copy-suppression heads) that function as general-purpose neural machinery – and that finding these shared primitives is a valuable interpretability goal. Reuse is plausibly one driver of generalization: in-context learning, for example, likely succeeds precisely because models can apply the same retrieval and binding operations across novel tasks without dedicated machinery for each. On this view, the shared core is not a confounder but a meaningful object of study in its own right.

### 6.2 What does this mean for circuit-level analysis?

The value proposition of per-task circuit discovery implicitly relies on specificity: if task *A*’s circuit largely matches task *B*’s, identifying *A*’s circuit reveals more about what the model needs to function at all than about how it specifically performs *A*. Our results suggest this is the more accurate description at the granularity of attention heads and MLP layers – though the selective ablation experiments do reveal a smaller set of task-specific components embedded within this shared core, particularly for Addition.

Recovering task-specific structure reliably may require either (a) finer-grained units of analysis, such as sparse feature circuits (Marks et al., [2025](https://arxiv.org/html/2605.08348#bib.bib20)), or (b) attribution methods that explicitly control for shared infrastructure – for example, by identifying components with high attribution for one task *relative to* others, rather than in absolute terms.

### 6.3 Broader implications

The broad reuse of causally important MLP layers has implications beyond circuit analysis. Model editing methods (Meng et al., [2022](https://arxiv.org/html/2605.08348#bib.bib23); Dai et al., [2022](https://arxiv.org/html/2605.08348#bib.bib25)) modify specific weight matrices to target specific factual associations, but if those matrices are load-bearing across many tasks, targeted edits will produce wider effects than intended. This is consistent with existing evidence that localization does not straightforwardly inform editing (Hase et al., [2023](https://arxiv.org/html/2605.08348#bib.bib30)) and that editing techniques suffer from low specificity (Hoelscher-Obermaier et al., [2023](https://arxiv.org/html/2605.08348#bib.bib32)). Safety interventions present a somewhat different picture, as they typically steer directions within activation space rather than ablating entire components (Li et al., [2023](https://arxiv.org/html/2605.08348#bib.bib24)), so our results apply less directly. The broader lesson is that the degree of modularity depends heavily on the granularity of analysis, and conclusions drawn at one level of description should not be assumed to hold at others.

## 7 Related Work

#### Circuit discovery and evaluation\.

One of the first works to almost fully reverse-engineer a model behavior was Wang et al. ([2023](https://arxiv.org/html/2605.08348#bib.bib4)), who identified a circuit for indirect object identification (IOI) in GPT-2 Small. Subsequent work has scaled circuit discovery through automation: ACDC (Conmy et al., [2023](https://arxiv.org/html/2605.08348#bib.bib22)), EAP (Syed et al., [2024](https://arxiv.org/html/2605.08348#bib.bib1)), and relevance patching (Jafari et al., [2025](https://arxiv.org/html/2605.08348#bib.bib19)). We use EAP throughout for its scalability. On the evaluation side, Shi et al. ([2024](https://arxiv.org/html/2605.08348#bib.bib18)) propose statistical tests for necessity, sufficiency, and minimality. Miller et al. ([2024](https://arxiv.org/html/2605.08348#bib.bib16)) show that standard evaluation metrics can be fragile, and Hanna et al. ([2024](https://arxiv.org/html/2605.08348#bib.bib15)) introduce EAP-IG and argue that circuits should be evaluated by faithfulness rather than overlap with known circuits. Our work adds *consistency* and *specificity* to this evaluation toolkit.

#### Circuit reuse across tasks\.

The most closely related work is Merullo et al. ([2024](https://arxiv.org/html/2605.08348#bib.bib8)), who compare the IOI circuit to a Colored Objects circuit (BIG-bench authors, [2023](https://arxiv.org/html/2605.08348#bib.bib29)) in GPT-2 Medium, finding 78% overlap in attention heads. They interpret this as evidence that models reuse algorithmic building blocks across tasks with a common underlying structure (both tasks require copying a token from context). Our work differs in scope, scale, and interpretation. We study six diverse tasks across seven models from four families, and find comparable overlap between tasks with no obvious shared algorithmic structure (*e.g.*, Addition and ARC), driven by shared MLP layers rather than shared algorithmic roles. This suggests that component-level overlap may largely reflect dependence on general-purpose infrastructure – a distinction that is hard to make when comparing algorithmically similar tasks. Notably, our CopyColors MCQA task (Mueller et al., [2025](https://arxiv.org/html/2605.08348#bib.bib21)) is similar in spirit to their Colored Objects task, yet shows comparable overlap with unrelated tasks like Addition.

## 8 Conclusion

We evaluated two underexplored properties of language model circuits – consistency (whether the same components recur across inputs to a task) and specificity (whether circuits are unique to their task). Across six tasks and seven models, we find that circuits are consistent – shared components appear reliably and prove causally necessary – but largely not specific: circuits for different tasks overlap extensively, and ablating one task’s circuit damages others comparably. Both findings are explained by a heavy reliance on shared MLP layers.

What should the field take from this? We think the primary lesson is that, at the level of attention heads and MLP layers, circuit discovery is effective at identifying which components are important (consistency), but most of the identified components are important for *everything*, not just the target task (non-specificity). A smaller set of task-specific components does exist within some circuits and shows selective causal effects – but these are a minority, embedded in a much larger shared core. Whether existing or new circuit discovery methods can reliably isolate this task-specific signal – through finer-grained methods like sparse feature circuits (Marks et al., [2025](https://arxiv.org/html/2605.08348#bib.bib20)), contrastive attribution, or other approaches – remains an important open question.

## 9 Acknowledgments

We thank Arnab Sen Sharma and other members of the Bau Lab for their helpful discussions\. We are also grateful to Tanush Chopra for valuable feedback on early drafts of this work, and to the anonymous reviewers for their thoughtful comments and suggestions, particularly their recommendation to conduct cross\-task experiments, which greatly improved the paper\.

## References

- E\. Ameisen, J\. Lindsey, A\. Pearce, W\. Gurnee, N\. L\. Turner, B\. Chen, C\. Citro, D\. Abrahams, S\. Carter, B\. Hosmer, J\. Marcus, M\. Sklar, A\. Templeton, T\. Bricken, C\. McDougall, H\. Cunningham, T\. Henighan, A\. Jermyn, A\. Jones, A\. Persic, Z\. Qi, T\. Ben Thompson, S\. Zimmerman, K\. Rivoire, T\. Conerly, C\. Olah, and J\. Batson \(2025\)Circuit tracing: revealing computational graphs in language models\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2025/attribution-graphs/methods.html)Cited by:[§2\.1](https://arxiv.org/html/2605.08348#S2.SS1.p1.1)\.
- A\. Arora, Z\. Wu, J\. Steinhardt, and S\. Schwettmann \(2025\)Language model circuits are sparse in the neuron basis\.Note:[https://transluce\.org/neuron\-circuits](https://transluce.org/neuron-circuits)Cited by:[§2\.1](https://arxiv.org/html/2605.08348#S2.SS1.p1.1)\.
- BIG-bench authors (2023) Beyond the imitation game: quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research. External Links: ISSN 2835-8856, [Link](https://openreview.net/forum?id=uyTL5Bvosj). Cited by: [§7](https://arxiv.org/html/2605.08348#S7.SS0.SSS0.Px2.p1.1).
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. ArXiv abs/1803.05457. External Links: [Link](https://api.semanticscholar.org/CorpusID:3922816). Cited by: [Appendix C](https://arxiv.org/html/2605.08348#A3.SS0.SSS0.Px5.p1.1), [§3.1](https://arxiv.org/html/2605.08348#S3.SS1.p1.1).
- A\. Conmy, A\. Mavor\-Parker, A\. Lynch, S\. Heimersheim, and A\. Garriga\-Alonso \(2023\)Towards automated circuit discovery for mechanistic interpretability\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 16318–16352\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/34e1dbe95d34d7ebaf99b9bcaeb5b2be-Paper-Conference.pdf)Cited by:[§2\.1](https://arxiv.org/html/2605.08348#S2.SS1.p1.1),[§7](https://arxiv.org/html/2605.08348#S7.SS0.SSS0.Px1.p1.1)\.
- D\. Dai, L\. Dong, Y\. Hao, Z\. Sui, B\. Chang, and F\. Wei \(2022\)Knowledge neurons in pretrained transformers\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics,Cited by:[§1](https://arxiv.org/html/2605.08348#S1.p5.1),[§6\.3](https://arxiv.org/html/2605.08348#S6.SS3.p1.1)\.
- N\. Elhage, T\. Hume, C\. Olsson, N\. Schiefer, T\. Henighan, S\. Kravec, Z\. Hatfield\-Dodds, R\. Lasenby, D\. Drain, C\. Chen, R\. Grosse, S\. McCandlish, J\. Kaplan, D\. Amodei, M\. Wattenberg, and C\. Olah \(2022\)Toy models of superposition\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2022/to%5C_model/index.html)Cited by:[§6\.1](https://arxiv.org/html/2605.08348#S6.SS1.SSS0.Px2.p1.1)\.
- N\. Elhage, N\. Nanda, C\. Olsson, T\. Henighan, N\. Joseph, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, N\. DasSarma, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah \(2021\)A mathematical framework for transformer circuits\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2021/framework/index.html)Cited by:[§1](https://arxiv.org/html/2605.08348#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.08348#S2.SS1.p1.1)\.
- Gemma Team \(2024\)Gemma 2: improving open language models at a practical size\.External Links:2408\.00118,[Link](https://arxiv.org/abs/2408.00118)Cited by:[§3\.1](https://arxiv.org/html/2605.08348#S3.SS1.p1.1)\.
- M\. Hanna, S\. Pezzelle, and Y\. Belinkov \(2024\)Have faith in faithfulness: going beyond circuit overlap when finding model mechanisms\.InICML 2024 Workshop on Mechanistic Interpretability,External Links:[Link](https://openreview.net/forum?id=grXgesr5dT)Cited by:[§7](https://arxiv.org/html/2605.08348#S7.SS0.SSS0.Px1.p1.1)\.
- P\. Hase, M\. Bansal, B\. Kim, and A\. Ghandeharioun \(2023\)Does localization inform editing? surprising differences in causality\-based localization vs\. knowledge editing in language models\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=EldbUlZtbd)Cited by:[§6\.3](https://arxiv.org/html/2605.08348#S6.SS3.p1.1)\.
- J\. Hoelscher\-Obermaier, J\. Persson, E\. Kran, I\. Konstas, and F\. Barez \(2023\)Detecting edit failures in large language models: an improved specificity benchmark\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 11548–11559\.External Links:[Link](https://aclanthology.org/2023.findings-acl.733/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.733)Cited by:[§6\.3](https://arxiv.org/html/2605.08348#S6.SS3.p1.1)\.
- F\. R\. Jafari, O\. Eberle, A\. Khakzar, and N\. Nanda \(2025\)RelP: faithful and efficient circuit discovery via relevance patching\.InMechanistic Interpretability Workshop at NeurIPS 2025,External Links:[Link](https://openreview.net/forum?id=5PKPy82sWN)Cited by:[§1](https://arxiv.org/html/2605.08348#S1.p1.1),[§7](https://arxiv.org/html/2605.08348#S7.SS0.SSS0.Px1.p1.1)\.
- K\. Li, O\. Patel, F\. Viégas, H\. Pfister, and M\. Wattenberg \(2023\)Inference\-time intervention: eliciting truthful answers from a language model\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2605.08348#S1.p5.1),[§6\.3](https://arxiv.org/html/2605.08348#S6.SS3.p1.1)\.
- J\. Liu, J\. Jain, M\. Diab, and N\. Subramani \(2025\)LLM microscope: what model internals reveal about answer correctness and context utilization\.arXiv preprint arXiv:2510\.04013\.Cited by:[§6\.1](https://arxiv.org/html/2605.08348#S6.SS1.SSS0.Px1.p1.1)\.
- Llama team \(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§3\.1](https://arxiv.org/html/2605.08348#S3.SS1.p1.1)\.
- S\. Marks, C\. Rager, E\. J\. Michaud, Y\. Belinkov, D\. Bau, and A\. Mueller \(2025\)Sparse feature circuits: discovering and editing interpretable causal graphs in language models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=I4e82CIDxv)Cited by:[§1](https://arxiv.org/html/2605.08348#S1.p1.1),[§1](https://arxiv.org/html/2605.08348#S1.p5.1),[§2\.1](https://arxiv.org/html/2605.08348#S2.SS1.p1.1),[§6\.1](https://arxiv.org/html/2605.08348#S6.SS1.SSS0.Px2.p1.1),[§6\.2](https://arxiv.org/html/2605.08348#S6.SS2.p2.1),[§8](https://arxiv.org/html/2605.08348#S8.p2.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in GPT\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=-h6WAS6eE4)Cited by:[§1](https://arxiv.org/html/2605.08348#S1.p5.1),[§2\.2](https://arxiv.org/html/2605.08348#S2.SS2.p1.1),[§6\.3](https://arxiv.org/html/2605.08348#S6.SS3.p1.1)\.
- J\. Merullo, C\. Eickhoff, and E\. Pavlick \(2024\)Circuit component reuse across tasks in transformer language models\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=fpoAYV6Wsk)Cited by:[§7](https://arxiv.org/html/2605.08348#S7.SS0.SSS0.Px2.p1.1)\.
- J\. Miller, B\. Chughtai, and W\. Saunders \(2024\)Transformer circuit evaluation metrics are not robust\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=zSf8PJyQb2)Cited by:[§7](https://arxiv.org/html/2605.08348#S7.SS0.SSS0.Px1.p1.1)\.
- A\. Mueller, A\. Geiger, S\. Wiegreffe, D\. Arad, I\. Arcuschin, A\. Belfki, Y\. S\. Chan, J\. F\. Fiotto\-Kaufman, T\. Haklay, M\. Hanna, J\. Huang, R\. Gupta, Y\. Nikankin, H\. Orgad, N\. Prakash, A\. Reusch, A\. Sankaranarayanan, S\. Shao, A\. Stolfo, M\. Tutek, A\. Zur, D\. Bau, and Y\. Belinkov \(2025\)MIB: a mechanistic interpretability benchmark\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=sSrOwve6vb)Cited by:[Appendix C](https://arxiv.org/html/2605.08348#A3.SS0.SSS0.Px3.p1.1),[Appendix C](https://arxiv.org/html/2605.08348#A3.SS0.SSS0.Px4.p1.1),[Appendix C](https://arxiv.org/html/2605.08348#A3.SS0.SSS0.Px5.p1.1),[§3\.1](https://arxiv.org/html/2605.08348#S3.SS1.p1.1),[§7](https://arxiv.org/html/2605.08348#S7.SS0.SSS0.Px2.p1.1)\.
- J\. Pearl \(1995\)Causal diagrams for empirical research\.Biometrika82\(4\),pp\. 669–688\.External Links:ISSN 00063444, 14643510,[Link](http://www.jstor.org/stable/2337329)Cited by:[§3\.2](https://arxiv.org/html/2605.08348#S3.SS2.p3.8)\.
- C\. Shi, N\. Beltran\-Velez, A\. Nazaret, C\. Zheng, A\. Garriga\-Alonso, A\. Jesson, M\. Makar, and D\. Blei \(2024\)Hypothesis testing the circuit hypothesis in LLMs\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=5ai2YFAXV7)Cited by:[§1](https://arxiv.org/html/2605.08348#S1.p1.1),[§7](https://arxiv.org/html/2605.08348#S7.SS0.SSS0.Px1.p1.1)\.
- N\. Subramani, N\. Suresh, and M\. Peters \(2022\)Extracting latent steering vectors from pretrained language models\.InFindings of the Association for Computational Linguistics: ACL 2022,S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 566–581\.External Links:[Link](https://aclanthology.org/2022.findings-acl.48/),[Document](https://dx.doi.org/10.18653/v1/2022.findings-acl.48)Cited by:[§6\.1](https://arxiv.org/html/2605.08348#S6.SS1.SSS0.Px1.p1.1)\.
- Z\. Sun, X\. Zang, K\. Zheng, J\. Xu, X\. Zhang, W\. Yu, Y\. Song, and H\. Li \(2025\)ReDeEP: detecting hallucination in retrieval\-augmented generation via mechanistic interpretability\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ztzZDzgfrh)Cited by:[§6\.1](https://arxiv.org/html/2605.08348#S6.SS1.SSS0.Px1.p1.1)\.
- A\. Syed, C\. Rager, and A\. Conmy \(2024\)Attribution patching outperforms automated circuit discovery\.InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP,Y\. Belinkov, N\. Kim, J\. Jumelet, H\. Mohebbi, A\. Mueller, and H\. Chen \(Eds\.\),Miami, Florida, US,pp\. 407–416\.External Links:[Link](https://aclanthology.org/2024.blackboxnlp-1.25/),[Document](https://dx.doi.org/10.18653/v1/2024.blackboxnlp-1.25)Cited by:[Appendix A](https://arxiv.org/html/2605.08348#A1.p1.1),[§1](https://arxiv.org/html/2605.08348#S1.p1.1),[§1](https://arxiv.org/html/2605.08348#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.08348#S2.SS2.p1.1),[§7](https://arxiv.org/html/2605.08348#S7.SS0.SSS0.Px1.p1.1)\.
- Team OLMo \(2024\)2 OLMo 2 Furious\.External Links:2501\.00656,[Link](https://arxiv.org/abs/2501.00656)Cited by:[§3\.1](https://arxiv.org/html/2605.08348#S3.SS1.p1.1)\.
- I\. Tenney, D\. Das, and E\. Pavlick \(2019\)BERT rediscovers the classical NLP pipeline\.InAssociation for Computational Linguistics,External Links:[Link](https://arxiv.org/abs/1905.05950)Cited by:[§4](https://arxiv.org/html/2605.08348#S4.SS0.SSS0.Px4.p1.5)\.
- J\. Vig, S\. Gehrmann, Y\. Belinkov, S\. Qian, D\. Nevo, Y\. Singer, and S\. Shieber \(2020\)Investigating gender bias in language models using causal mediation analysis\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 12388–12401\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf)Cited by:[§2\.2](https://arxiv.org/html/2605.08348#S2.SS2.p1.1)\.
- K\. R\. Wang, A\. Variengien, A\. Conmy, B\. Shlegeris, and J\. Steinhardt \(2023\)Interpretability in the wild: a circuit for indirect object identification in GPT\-2 small\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=NpsVSN6o4ul)Cited by:[Appendix C](https://arxiv.org/html/2605.08348#A3.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.08348#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.08348#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2605.08348#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2605.08348#S3.SS1.p1.1),[§7](https://arxiv.org/html/2605.08348#S7.SS0.SSS0.Px1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§3\.1](https://arxiv.org/html/2605.08348#S3.SS1.p1.1)\.

## Appendix A Limitations

Our analysis uses a single circuit extraction method (EAP) and operates at the granularity of attention heads and MLP layers. While Syed et al. ([2024](https://arxiv.org/html/2605.08348#bib.bib1)) show that EAP recovers circuits competitive with those found by more expensive methods, we cannot rule out that different circuit extraction methods would yield different specificity patterns. Relatedly, we cannot rule out that gradient-based attribution is simply biased toward components with large activation magnitudes, which could inflate apparent overlap. However, our causal ablation experiments confirm that the shared components are functionally important (positive necessity), rather than merely high-scoring artifacts of the attribution method. Testing whether alternative extraction methods (*e.g.*, relevance propagation) recover greater task specificity is an important direction for future work.

Different granularities \(*e\.g\.*, individual neurons, sparse autoencoder features\) might also yield qualitatively different conclusions about specificity\. Additionally, the small number of MLP layers relative to attention heads means that MLP\-heavy circuits are structurally constrained to overlap; our results should be interpreted with this prior in mind\. Our tasks, while diverse, do not include generative or multi\-step tasks; it is possible that more complex tasks would show different patterns of reuse\. Finally, our models range from 1B to 8B parameters\. Scaling behavior at larger model sizes remains an open question\.

## Appendix B Edge Attribution Patching Details

Given a clean input $x$ and a corrupted input $x'$, EAP computes the attribution score for each component $u$ as a first-order approximation of the effect of patching that component’s activation:

$$\hat{e}_u = \left(a_u(x') - a_u(x)\right)^{\top} \cdot \frac{\partial L(x)}{\partial a_u}$$

where $a_u(\cdot)$ denotes the activation of component $u$ and $L(x)$ is a scalar metric (*e.g.*, logit difference) evaluated on the clean input. The score is the dot product of the activation difference between corrupted and clean inputs with the gradient of the metric with respect to that activation, summed over sequence positions. In practice, this is a single line of PyTorch:

```python
score = (act_corrupt - act_clean) * grad
```

where `act_corrupt` and `act_clean` are the component’s activations under the corrupted and clean inputs respectively, and `grad` is the gradient of the metric with respect to the clean activation. Per-component scores are obtained by summing over positions and (for attention heads) the head dimension.
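As a hedged sketch of this computation (assuming PyTorch, with toy random tensors standing in for a component’s real activations and metric gradient):

```python
import torch

def eap_scores(act_clean, act_corrupt, grad_clean):
    """EAP attribution: dot the activation difference (corrupt - clean)
    with the gradient of the metric w.r.t. the clean activation,
    summing over the sequence and hidden dimensions."""
    return ((act_corrupt - act_clean) * grad_clean).sum(dim=(1, 2))

# Toy tensors with shape [batch, seq_len, d_hidden]; real values would
# come from clean/corrupted forward passes and a backward pass on the metric.
torch.manual_seed(0)
act_clean = torch.randn(2, 4, 8)
act_corrupt = torch.randn(2, 4, 8)
grad = torch.randn(2, 4, 8)

scores = eap_scores(act_clean, act_corrupt, grad)  # one score per example
```

No forward pass through the patched model is needed: the score is a first-order estimate computed entirely from quantities available after one clean forward/backward pass and one corrupted forward pass.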

| Model | HuggingFace ID | Params | \|L\| | \|a\| | d_model |
|---|---|---|---|---|---|
| Gemma 2 2B | google/gemma-2-2b | 2.6B | 26 | 8 | 2304 |
| Gemma 2 2B Instruct | google/gemma-2-2b-it | 2.6B | 26 | 8 | 2304 |
| Llama-3.2-3B | meta-llama/Llama-3.2-3B | 3.2B | 28 | 24 | 3072 |
| Llama-3.2-3B Instruct | meta-llama/Llama-3.2-3B-Instruct | 3.2B | 28 | 24 | 3072 |
| Qwen3-4B | Qwen/Qwen3-4B | 4.0B | 36 | 32 | 2560 |
| Qwen3-8B | Qwen/Qwen3-8B | 8.2B | 36 | 32 | 4096 |
| OLMo-2-1B | allenai/OLMo-2-0425-1B | 1.2B | 16 | 16 | 2048 |

Table 2: Models studied in this work. We report the number of parameters, number of layers |L|, number of attention heads |a|, and hidden dimension d_model for each model.
## Appendix C Task Details

Activation patching and its approximations require a clean input $x$ and a corrupted input $x'$ that alters the information the model must use while preserving surface structure as much as possible. The attribution metric for all tasks is the logit difference between the correct and incorrect answer tokens. Below we describe each task and its corruption strategy.
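A minimal sketch of this logit-difference metric, assuming PyTorch; the tiny vocabulary and token IDs below are illustrative, not the tasks’ real tokenization:

```python
import torch

def logit_difference(logits, correct_id, incorrect_id):
    """Logit difference at the final position: logit of the correct
    answer token minus logit of the incorrect one."""
    final = logits[:, -1, :]  # [batch, vocab]
    return final[:, correct_id] - final[:, incorrect_id]

# Toy logits: batch=1, seq_len=1, vocab=3
logits = torch.tensor([[[0.0, 2.5, 1.0]]])
diff = logit_difference(logits, correct_id=1, incorrect_id=2)  # 2.5 - 1.0
```

In attribution, this scalar plays the role of $L(x)$: its gradient with respect to each component’s activation is what EAP dots with the activation difference.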

#### Addition\.

The model is given a 2\-digit arithmetic problem \(*e\.g\.*, “Compute: 47 \+ 63 =”\) and must produce the correct sum\. Corrupted inputs are generated by pairing each problem with a randomly selected different problem from the same batch, so the corrupted input has the same format but different operands and a different answer\.

#### Boolean Logic\.

The model evaluates logical expressions composed of `and`, `or`, and `not` over boolean literals (*e.g.*, “Evaluate: true and (false or true) =” → “true”). Corrupted inputs are produced by randomly flipping one boolean literal in the expression (*e.g.*, `true` → `false`), which changes the expression’s truth value while preserving its syntactic structure.
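A minimal sketch of this corruption strategy (an illustrative implementation, not the authors’ code):

```python
import random

LITERALS = ("true", "false")

def corrupt_boolean(expr, seed=0):
    """Flip one randomly chosen boolean literal (true <-> false),
    leaving the rest of the expression untouched."""
    rng = random.Random(seed)
    tokens = expr.split(" ")
    # Literals may carry parentheses, e.g. "(false" or "true)"
    literal_positions = [i for i, t in enumerate(tokens)
                         if t.strip("()") in LITERALS]
    i = rng.choice(literal_positions)
    t = tokens[i]
    tokens[i] = (t.replace("true", "false") if "true" in t
                 else t.replace("false", "true"))
    return " ".join(tokens)

clean = "Evaluate: true and (false or true) ="
corrupted = corrupt_boolean(clean)  # differs from clean in exactly one literal
```

Because only one literal changes, the clean and corrupted inputs are token-aligned, which is what position-wise patching requires.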

#### Indirect Object Identification \(IOI\)\.

Following the task setup in Wang et al. ([2023](https://arxiv.org/html/2605.08348#bib.bib4)), the model must identify the indirect object in sentences with a specific template involving two names (*e.g.*, “Bilbo and Frodo spoke in Rivendell before Bilbo gave Sting to” → “Frodo”). We use the dataset and counterfactuals from the Mechanistic Interpretability Benchmark (Mueller et al., [2025](https://arxiv.org/html/2605.08348#bib.bib21)). The corrupted input applies the S2-IO flip counterfactual, which swaps the subject and indirect object names so that the correct completion changes while the sentence template remains identical.

#### CopyColors MCQA\.

A multiple-choice task from Mueller et al. ([2025](https://arxiv.org/html/2605.08348#bib.bib21)) in which the model is given a passage describing objects and their colors, followed by a question asking which color corresponds to a particular object. The answer choices are presented as labeled options (A, B, C, D). The corrupted input applies the answer-position counterfactual from the benchmark, which permutes the order of the answer choices so that the correct answer appears at a different position, changing the correct label token while keeping the passage and question unchanged.

#### ARC Easy / ARC Challenge\.

The AI2 Reasoning Challenge (Clark et al., [2018](https://arxiv.org/html/2605.08348#bib.bib5)) consists of multiple-choice science exam questions. The Easy split contains questions answerable with basic retrieval and reasoning, while the Challenge split filters for questions requiring more complex inference. We use the datasets and counterfactuals from Mueller et al. ([2025](https://arxiv.org/html/2605.08348#bib.bib21)). As with CopyColors MCQA, the corrupted input applies the answer-position counterfactual, permuting the order of answer choices so that the correct label changes.

## Appendix D Model Details

See[Table2](https://arxiv.org/html/2605.08348#A2.T2)for a list of model information\.

## Appendix E Full Within-Task Results

[Table3](https://arxiv.org/html/2605.08348#A5.T3)reports reuse@PPfor all models, tasks, and circuit sizesKK, and[Table4](https://arxiv.org/html/2605.08348#A5.T4)reports the corresponding necessity values\.

Table 3: reuse@97 (%) across all circuit sizes *K*.

Table 4: Necessity across all circuit sizes *K*.
## Appendix F Full Cross-Task Results

[Table5](https://arxiv.org/html/2605.08348#A6.T5)reports own\-circuit vs\. other\-circuit accuracy drops for all models and circuit sizes\.[Figure6](https://arxiv.org/html/2605.08348#A6.F6)shows the cross\-task Jaccard overlap heatmaps across all values ofKK\.

Table 5: Own-circuit vs. other-circuit accuracy drop (pp) across all *K* values. Each cell reports *Own/Oth.*, where Own is the drop from ablating that task’s circuit and Oth. is the mean drop from ablating other tasks’ circuits.

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/overlap/jaccard_K1_t100.png)

*K* = 1%

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/overlap/jaccard_K5_t100.png)

*K* = 5%

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/overlap/jaccard_K10_t100.png)

*K* = 10%

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/overlap/jaccard_K20_t100.png)

*K* = 20%

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/overlap/jaccard_K30_t100.png)

*K* = 30%

Figure 6: Cross-task Jaccard overlap across different values of *K*.
## Appendix G Circuit Composition Across Models

[Table6](https://arxiv.org/html/2605.08348#A7.T6)reports the MLP and attention head fractions of each circuit across all models, tasks, and circuit sizesKK\.

Table 6: Shared circuit composition across models and tasks. Entries are MLP/attention-head percentages.
## Appendix H Layer Distribution of Circuit Components

[Figure7](https://arxiv.org/html/2605.08348#A8.F7)shows the cumulative fraction of circuit components across model depth for each task and circuit sizeKK\. In the Llama and Qwen families, the CDF at smallKKis shifted toward earlier layers, indicating that the highest\-attribution components tend to sit in early\-to\-middle layers\. The Gemma family is less uniform: for tasks like IOI and CopyColors MCQA, the small\-KKcircuit is concentrated in middle\-to\-late layers rather than early ones\. At largerKK, the distribution becomes more uniform across layers in all models\.

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/layer_distribution/layer_line_google_gemma-2-2b.png)

Gemma-2-2B

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/layer_distribution/layer_line_google_gemma-2-2b-it.png)

Gemma-2-2B-IT

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/layer_distribution/layer_line_meta-llama_Llama-3.2-3B.png)

Llama-3.2-3B

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/layer_distribution/layer_line_meta-llama_Llama-3.2-3B-Instruct.png)

Llama-3.2-3B-Instruct

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/layer_distribution/layer_line_qwen3-4b.png)

Qwen3-4B

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/layer_distribution/layer_line_qwen3-8b.png)

Qwen3-8B

Figure 7: Cumulative layer distribution of circuit components. Each line shows the cumulative fraction of components in the top-K% circuit at or below a given layer. At small K, the CDF is shifted left (toward earlier layers) in the Llama and Qwen families but shows more task-dependent variation in the Gemma family.
## Appendix I: Selective Ablation Across K

[Figure 8](https://arxiv.org/html/2605.08348#A9.F8) and [Figure 9](https://arxiv.org/html/2605.08348#A9.F9) show the selective ablation results for K ∈ {1, 5, 10, 20, 30}%, complementing the K = 10% results in the main text.

K = 1%

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/selective_ablation/drop_relative_k1.png)

K = 5%

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/selective_ablation/drop_relative_k5.png)

K = 10%

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/selective_ablation/drop_relative_k10.png)

Figure 8: Selective ablation: relative accuracy drop by component set for K ∈ {1, 5, 10}%.

K = 20%

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/selective_ablation/drop_relative_k20.png)

K = 30%

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/selective_ablation/drop_relative_k30.png)

Figure 9: Selective ablation: relative accuracy drop by component set for K ∈ {20, 30}%.
## Appendix J: Circuit Decomposition Sizes

[Table 7](https://arxiv.org/html/2605.08348#A10.T7) shows the mean number of components in each partition (shared core / task-specific / task-complement) for all models and tasks.

At small K (≤ 5%), the shared core is near-empty for the Gemma models, reflecting their low cross-task overlap at strict thresholds. For the Llama and Qwen families, the shared core already dominates at K = 5%, accounting for 73–78% of the decomposition on average. As K grows, the shared core’s absolute size increases across all models, but its relative share decreases: at K = 10% it constitutes 45–67% of the decomposition, dropping to 40–60% at K = 30%. The task-specific and task-complement sets grow faster in both absolute and relative terms as more lower-attribution components enter the circuit. The larger Qwen models have substantially larger circuits in absolute terms (*e.g.*, 45 components at K = 10% vs. 15 for Gemma), but the proportional breakdown is similar, suggesting that the dominance of shared infrastructure is not an artifact of model size.

Table 7: Mean circuit decomposition sizes across K values, averaged over task-pair partners. Each cell shows Shared / Specific / Complement counts (|C_A ∩ C_B| / |C_A ∖ C_B| / |C_B ∖ C_A|).

![Refer to caption](https://arxiv.org/html/2605.08348v1/figures/circuit_decomposition/decomposition_mlp_head.png)

Figure 10: MLP vs. attention head composition of circuit decompositions at K = 10%. Each group of three bars shows the mean number of MLP layers (solid) and attention heads (hatched) in the shared core, task-specific, and task-complement sets. MLP layers account for the vast majority of the shared core across all models and tasks, while attention heads appear primarily in the task-specific and task-complement sets at larger circuit sizes.
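The three partitions counted in Table 7 are plain set operations on a task pair's circuits. A minimal sketch with illustrative component IDs (not the paper's actual circuits):

```python
def decompose(circuit_a, circuit_b):
    """Partition task A's circuit against task B's:
    shared core (C_A ∩ C_B), A-specific (C_A \\ C_B),
    and B's complement (C_B \\ C_A)."""
    a, b = set(circuit_a), set(circuit_b)
    return a & b, a - b, b - a

# Hypothetical circuits for a task pair.
c_a = {"m3", "m4", "a5.1", "a7.2"}
c_b = {"m3", "m4", "a6.0"}
shared, specific, complement = decompose(c_a, c_b)
print(len(shared), len(specific), len(complement))  # 2 2 1
```

Each Table 7 cell reports exactly these three cardinalities, averaged over a task's pair partners.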
## Appendix K: Component Breakdown

Tables [8](https://arxiv.org/html/2605.08348#A11.T8)–[13](https://arxiv.org/html/2605.08348#A11.T13) report the MLP fraction of each circuit for all models and tasks, extending the summary in the main text to the full set of models.

Table 8: Gemma 2 2B: mean attention heads / MLPs in the top-K% circuit.

Table 9: Gemma 2 2B Instruct: mean attention heads / MLPs in the top-K% circuit.

Table 10: Llama-3.2-3B: mean attention heads / MLPs in the top-K% circuit.

Table 11: Llama-3.2-3B Instruct: mean attention heads / MLPs in the top-K% circuit.

Table 12: Qwen3-4B: mean attention heads / MLPs in the top-K% circuit.

Table 13: Qwen3-8B: mean attention heads / MLPs in the top-K% circuit.
## Appendix L: Pretraining Dynamics Tables

We report per-checkpoint values for the pretraining dynamics analysis of [§4](https://arxiv.org/html/2605.08348#S4), split by circuit size K ∈ {1, 5, 10, 20, 30}%. Rows cover the 18 stage-1 checkpoints (0B–4001B tokens) plus two stage-2 anneal checkpoints (anneal1, anneal3; each ∼51B tokens of curated data with LR decay on top of stage-1).

#### Reuse tables.

[Tables 14](https://arxiv.org/html/2605.08348#A12.T14), [15](https://arxiv.org/html/2605.08348#A12.T15), [16](https://arxiv.org/html/2605.08348#A12.T16), [17](https://arxiv.org/html/2605.08348#A12.T17), and [18](https://arxiv.org/html/2605.08348#A12.T18) give reuse@95 (%) at each K. At K = 1% ([Table 14](https://arxiv.org/html/2605.08348#A12.T14)) circuits are tiny – often a single component – so reuse is 0 or 50% depending on whether that component is shared. At K = 5% ([Table 15](https://arxiv.org/html/2605.08348#A12.T15)) reuse fluctuates around 0–55% with no consistent trend. [Table 16](https://arxiv.org/html/2605.08348#A12.T16) corresponds to the top row of [Figure 3](https://arxiv.org/html/2605.08348#S3.F3): reuse starts near 50–60% in the first ∼76B tokens and declines for the rest of stage-1, with Boolean’s K = 10% shared circuit becoming empty from 399B onward. At K = 20–30% ([Tables 17](https://arxiv.org/html/2605.08348#A12.T17) and [18](https://arxiv.org/html/2605.08348#A12.T18)) the shared circuit covers a larger fraction of the model, so per-checkpoint reuse is more stable (typically 25–40%).
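One plausible reading of reuse@95 — the set of components that appear in at least 95% of per-example circuits, expressed relative to mean circuit size — can be sketched as follows. The normalization is an assumption here; the paper's exact definition may differ:

```python
from collections import Counter

def reuse_at(per_example_circuits, threshold=0.95):
    """Percentage of the average circuit accounted for by components that
    recur in at least `threshold` of the per-example circuits.
    The normalization by mean circuit size is an assumption."""
    n = len(per_example_circuits)
    counts = Counter(c for circuit in per_example_circuits for c in set(circuit))
    shared = {c for c, k in counts.items() if k / n >= threshold}
    mean_size = sum(len(circuit) for circuit in per_example_circuits) / n
    return 100.0 * len(shared) / mean_size

# Toy per-example circuits: "m2" and "a4.1" recur in all three examples.
circuits = [{"m2", "a4.1", "a5.0"}, {"m2", "a4.1", "a6.3"}, {"m2", "a4.1", "a7.7"}]
print(reuse_at(circuits))  # 2 shared components / mean size 3 -> ~66.7
```

Under this reading, an empty shared set (as for Boolean from 399B tokens onward at K = 10%) yields reuse of 0 and makes the shared-circuit ablation degenerate.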

#### Necessity tables.

[Tables 19](https://arxiv.org/html/2605.08348#A12.T19), [20](https://arxiv.org/html/2605.08348#A12.T20), [21](https://arxiv.org/html/2605.08348#A12.T21), [22](https://arxiv.org/html/2605.08348#A12.T22), and [23](https://arxiv.org/html/2605.08348#A12.T23) give necessity = (control − ablation)/baseline at each K, where ablation removes the reuse@95 shared circuit and control removes a capacity-matched random component set. A dash (–) marks checkpoints where baseline accuracy is 0 (ratio undefined); a 0.00 entry typically means the shared circuit is empty at that K (no components in ≥95% of examples), making the ablation degenerate. [Table 21](https://arxiv.org/html/2605.08348#A12.T21) corresponds to the bottom row of [Figure 3](https://arxiv.org/html/2605.08348#S3.F3). The IOI columns are consistently negative across K, reflecting the anomaly noted in the main text: random ablation hurts more than shared-circuit ablation. At the anneal checkpoints, CopyColors MCQA shows the largest positive necessity at K = 10–30% (e.g., a ∼95% baseline drops to 0% under shared-circuit ablation but only to 30% under random ablation), the only setting where the shared circuit is clearly causally distinguished from a capacity-matched random control.
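The necessity ratio above is straightforward to compute; the worked call below plugs in the CopyColors MCQA anneal numbers quoted in the text (baseline ∼95%, 0% under shared-circuit ablation, 30% under random ablation):

```python
def necessity(baseline, ablation, control):
    """Necessity = (control - ablation) / baseline.

    `ablation` is accuracy after removing the reuse@95 shared circuit;
    `control` is accuracy after removing a capacity-matched random set.
    Returns None when baseline accuracy is 0 (the dashed table entries),
    since the ratio is undefined.
    """
    if baseline == 0:
        return None
    return (control - ablation) / baseline

# CopyColors MCQA anneal-checkpoint example from the text.
print(necessity(0.95, 0.0, 0.30))  # (0.30 - 0.0) / 0.95 ≈ 0.32
```

Negative values (as in the IOI columns) arise whenever the random control hurts accuracy more than the shared-circuit ablation does.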

Table 14: Pretraining reuse@95 (%) at K = 1% across OLMo-2-1B checkpoints. annealN entries are stage-2 anneal checkpoints (ingredient N).

Table 15: Pretraining reuse@95 (%) at K = 5% across OLMo-2-1B checkpoints. annealN entries are stage-2 anneal checkpoints (ingredient N).

Table 16: Pretraining reuse@95 (%) at K = 10% across OLMo-2-1B checkpoints. annealN entries are stage-2 anneal checkpoints (ingredient N).

Table 17: Pretraining reuse@95 (%) at K = 20% across OLMo-2-1B checkpoints. annealN entries are stage-2 anneal checkpoints (ingredient N).

Table 18: Pretraining reuse@95 (%) at K = 30% across OLMo-2-1B checkpoints. annealN entries are stage-2 anneal checkpoints (ingredient N).

Table 19: Pretraining necessity at K = 1% across OLMo-2-1B checkpoints. annealN entries are stage-2 anneal checkpoints (ingredient N).

Table 20: Pretraining necessity at K = 5% across OLMo-2-1B checkpoints. annealN entries are stage-2 anneal checkpoints (ingredient N).

Table 21: Pretraining necessity at K = 10% across OLMo-2-1B checkpoints. annealN entries are stage-2 anneal checkpoints (ingredient N).

Table 22: Pretraining necessity at K = 20% across OLMo-2-1B checkpoints. annealN entries are stage-2 anneal checkpoints (ingredient N).

Table 23: Pretraining necessity at K = 30% across OLMo-2-1B checkpoints. annealN entries are stage-2 anneal checkpoints (ingredient N).
