# CSP-Atlas: Concept-Specific Neural Circuits in a Sparse Python Transformer
Source: [https://arxiv.org/html/2605.24603](https://arxiv.org/html/2605.24603)
###### Abstract
A sparse 8-layer code transformer develops dedicated neural circuitry for every Python construct tested, and that circuitry is organised by a clean computational principle rather than by semantic category. We extract neural circuits for 106 concepts (43 AST node types, 63 builtin objects) by marginalising across 63,800 controlled prompts, and decompose each circuit into concept-specific and token-driven components using contrastive checker prompts that present a keyword token without its associated syntactic structure. Three findings emerge. First, all 106 concepts produce non-empty universal circuits at every one of nine parameter settings, and the ranking of concept-specificity across constructs is stable across the sweep, so survival is not an artifact of a permissive threshold. Second, AST circuits contain a genuine concept component distinct from token activation: concept-only neurons constitute up to 62.5% of the loudest-firing neurons at mid-to-late layers, while builtin circuits are almost entirely token-driven. Third, six computationally atomic constructs (Import, ImportFrom, Break, Continue, Pass, Assert) cluster together despite being semantically unrelated, sharing only the property of being single-statement constructs requiring no nested body; this atomicity super-cluster, together with a four-tier hierarchy organised by token ambiguity and structural distinctiveness, shows that the model’s internal organisation tracks computational structure rather than meaning. The methodology, full decomposition data, and analysis code are released.
## 1 Introduction
When a code language model processes a Python `for` loop, what is it responding to? It might be recognising the syntactic concept of iteration: a structured control-flow construct with a target variable, an iterable, and an indented body. Or it might simply be reacting to the three-letter token `for`. These two hypotheses, concept representation versus token detection, predict identical behaviour on well-formed code but diverge when the token appears outside its syntactic role: in a string literal, a comment, or a variable name. This divergence is the experimental lever for the present study.
The question matters for interpretability and trust\. If internal representations reduce to token pattern\-matching, the model’s apparent understanding is shallow; if the model builds dedicated circuitry for syntactic concepts distinct from token\-level processing, this suggests a deeper structural encoding\.
Working with a sparse 8\-layer transformer trained on Python code \(the CSP model\), we extract neural circuits for 106 programming concepts — 43 Abstract Syntax Tree \(AST\) node types and 63 builtin objects — using an aggressive marginalisation procedure across 63,800 controlled prompts\. We then decompose each circuit into three disjoint neuron populations: concept\-only neurons that respond to the syntactic concept but not the bare keyword token; shared neurons that respond to both; and token\-only neurons that respond to the token alone\.
#### Contributions\.
(1) *Parameter-stable concept survival.* All 106 concepts produce non-empty universal circuits at every setting of a $3 \times 3$ parameter sweep, and the concept-specificity ranking is stable across settings, establishing that the model develops dedicated circuitry for every construct tested, robustly rather than at one permissive corner. (2) *Concept–token decomposition.* AST circuits contain a substantial concept-specific component (up to 62.5% of the loudest-firing units at mid-to-late layers), while builtin circuits are almost entirely token-driven. (3) *Atomicity super-cluster and four-tier hierarchy.* Six computationally atomic constructs cluster together by shared structural property rather than meaning, within a consistent hierarchy organised by token ambiguity and structural complexity: tokenless ASTs $>$ modular keyword ASTs $>$ non-modular keyword ASTs $>$ builtins.
#### Scope\.
This study is observational: we characterise circuit structure but do not perform causal interventions\. The analysis covers one model and one language\. These limitations define directions for follow\-up work, which extends the method to dense production\-scale models and a second language\.
## 2 Background
#### Circuit discovery.
The mechanistic-interpretability programme reverse-engineers neural networks into human-understandable components. Conmy et al. ([2023](https://arxiv.org/html/2605.24603#bib.bib2)) introduced automated circuit discovery via activation patching; Heimersheim and Nanda ([2024](https://arxiv.org/html/2605.24603#bib.bib4)) refined causal scrubbing. Our work differs in scope: rather than a circuit for a single behaviour, we extract circuits for an entire concept space and characterise their compositional structure.
#### Superposition and polysemanticity.
Individual neurons often respond to multiple unrelated features (Elhage et al., [2022](https://arxiv.org/html/2605.24603#bib.bib3)). Our marginalisation procedure is designed to cut through superposition: by intersecting across all complementary objects, we isolate neurons that respond to a concept regardless of context.
#### Probing and causal methods.
Linear probes detect syntactic structure in language model representations (Tenney et al., [2019](https://arxiv.org/html/2605.24603#bib.bib7); Belinkov, [2022](https://arxiv.org/html/2605.24603#bib.bib1)). Activation patching and causal tracing (Meng et al., [2022](https://arxiv.org/html/2605.24603#bib.bib6)) identify causally relevant components. Our cross-section experiment measures which neurons carry concept information, providing a target map for subsequent causal interventions.
#### Code-model interpretability.
Prior work has probed code models for syntax trees (Wan et al., [2022](https://arxiv.org/html/2605.24603#bib.bib8)) and variable binding (Hernandez et al., [2024](https://arxiv.org/html/2605.24603#bib.bib5)). We extend this line by decomposing internal representations into concept-specific and token-driven partitions at the neuron level.
## 3 Model and Concept Space
### 3.1 The CSP Transformer
The model under study is `openai/circuit-sparsity`, a sparse transformer released in 2025 and trained on Python source code. It has 8 layers; the MLP output at each layer is a 2,048-dimensional vector entering the residual stream. The sparsity is structural: the architecture uses sparse-activation MLP blocks designed so that the residual-stream-bound MLP output contains genuine zeros for the majority of neurons on any given input, rather than the dense post-nonlinearity values produced by standard SwiGLU/GELU blocks. This is what makes binarisation meaningful in the present study: a thresholded mask reflects real structure in the activations rather than an arbitrary cut through a dense distribution. We use the model purely as a measuring instrument (weights are loaded from the public Hub release and never updated) and pin the revision SHA in the released extraction config for reproducibility. The bundled tokenizer is loaded via the same Hub identifier; it is not a generic GPT-2 tokenizer, and token boundaries differ from CodeLlama or StarCoder.
The term “neuron” throughout refers to one dimension of the MLP output vector — the composite signal contributed to the residual stream — not an internal neuron in the expanded MLP hidden layer\.
### 3.2 Concept Space
The investigation targets two families of Python concepts. AST nodes are syntactic constructs defining program structure: `For`, `If`, `FunctionDef`, `Import`, `Break`, `Pass`, `ListComp`, and 36 others, for 43 node types in total. Builtin objects are types, functions, and exceptions provided without imports: `int`, `list`, `dict`, `print`, `len`, `range`, `ValueError`, and 56 others, for 63 in total. The concept space is built from the Cartesian product of the two families and comprises 1,276 AST-builtin pairs. Complete lists are in Appendix A of the released artifacts.
## 4 Prompt Generation
### 4.1 Object Prompts
For each of the 1,276 pairs, 50 prompt variations are generated programmatically. A structurally valid Python snippet is built using `ast.parse()` and `ast.unparse()`, guaranteed to contain the target AST node applied to the target builtin. Variance is injected along three orthogonal dimensions: lexical/semantic (variable names from five domains: finance, biology, gaming, physics, e-commerce), structural (global scope $\sim 40\%$, inside function $\sim 30\%$, inside class method $\sim 30\%$), and padding (unrelated code optionally added before/after). Prompts with excessively high sequence loss are discarded; the top 50 per pair are retained, for $1{,}276 \times 50 = 63{,}800$ prompts.
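The generation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the released generator: the template, the function names `build_for_prompt` and `has_node_on_builtin`, and the variable names are hypothetical, and only one AST node (`For`) and the structural dimension are shown.

```python
import ast
import textwrap

def build_for_prompt(builtin: str, var: str, scope: str) -> str:
    """Build a snippet in which a For loop iterates over a call to `builtin`.

    The template and scope labels are illustrative assumptions.
    """
    loop = f"for {var} in {builtin}(records):\n    total = {var}\n"
    if scope == "function":
        loop = "def process(records):\n" + textwrap.indent(loop, "    ")
    elif scope == "method":
        header = "class Ledger:\n    def process(self, records):\n"
        loop = header + textwrap.indent(loop, "        ")
    tree = ast.parse(loop)      # raises SyntaxError if the snippet is invalid
    return ast.unparse(tree)    # normalised source (Python 3.9+)

def has_node_on_builtin(source: str, node_type: type, builtin: str) -> bool:
    """Check that some `node_type` node contains a reference to `builtin`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, node_type):
            names = {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
            if builtin in names:
                return True
    return False

prompt = build_for_prompt("sorted", "price", "method")
```

Round-tripping through `ast.parse` and `ast.unparse` both validates the snippet and normalises its formatting, so surface variation comes only from the dimensions that are varied deliberately.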
### 4.2 Checker Prompts
Every keyword-bearing concept has an inherent confound: the import statement always contains the token `import`, so a universal circuit firing on `import` might reflect the token, not the concept. The checker prompt set isolates this.
Of the 43 AST nodes, 24 have testable keywords; these divide into 6 modular keywords (`Import`, `ImportFrom`, `Break`, `Continue`, `Pass`, `Assert`) and 18 non-modular keywords (`For`, `While`, `If`, `Return`, and the other body-requiring constructs). Of the 63 builtins, 34 have testable keywords. The remaining concepts lack distinctive keyword tokens and are not subject to the confound. This gives 58 testable objects (24 AST + 34 builtins). For each testable object, 50 checker prompts are generated across five categories where the keyword token appears but the concept does not (Table [1](https://arxiv.org/html/2605.24603#S4.T1)). Each prompt is validated to (a) parse as valid Python, (b) exclude the target concept from the AST, and (c) contain the keyword token in the tokeniser output.
Table 1: Five checker prompt categories. Each presents the keyword token in a non-structural context.
Table 2: Concept fraction (%) by layer at $\varepsilon = 0.001, C = 80\%$. Layer 1 is essentially pure token; concept signal peaks at layer 5 for modular ASTs.
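The three validation conditions for a checker prompt can be sketched as below. This is an assumption-laden illustration: the function name `is_valid_checker` is hypothetical, and condition (c) is approximated with a plain substring test rather than the model's bundled tokeniser, which the paper uses.

```python
import ast

def is_valid_checker(source: str, keyword: str, node_type: type) -> bool:
    """Return True if `source` shows the keyword without the concept."""
    try:
        tree = ast.parse(source)                 # (a) must be valid Python
    except SyntaxError:
        return False
    if any(isinstance(n, node_type) for n in ast.walk(tree)):
        return False                             # (b) concept must be absent
    # (c) stand-in for "keyword token appears in the tokeniser output";
    # the real check would inspect the model tokeniser's token ids.
    return keyword in source

in_string = 'note = "import later"'
in_comment = "# break here\nx = 1"
real_import = "import os"
```

Here `in_string` and `in_comment` qualify as checkers for `Import` and `Break` respectively, while `real_import` is rejected because the `Import` node is present in its AST.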
## 5 Extraction Pipeline
#### Activation extraction.
Each prompt is processed in a single forward pass; no text is generated, and the model is used purely as a measuring instrument. Forward hooks on the MLP module at each of 8 layers intercept the output at the last token position only, because the last token’s residual stream integrates information from the entire sequence through causal attention. Each prompt produces 8 vectors of 2,048 values.
#### Binarisation\.
Raw activations are converted to binary masks. *Epsilon thresholding*: a neuron is active if $|\text{activation}| > \varepsilon$. *Consistency filtering*: across the 50 prompt variations, only neurons active in $\geq C\%$ of prompts are retained.
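The two binarisation steps can be sketched in a few lines. This is a minimal pure-Python illustration, not the released implementation; the function name `binarise` and the toy activations (4 prompts, 3 neurons) are made up for the example.

```python
def binarise(activations, eps, consistency_pct):
    """Return the set of neuron indices that survive both filters.

    activations: list of per-prompt activation vectors of equal length.
    eps: a neuron counts as active on a prompt if |activation| > eps.
    consistency_pct: keep neurons active on at least this percentage of prompts.
    """
    n_prompts = len(activations)
    n_neurons = len(activations[0])
    mask = set()
    for j in range(n_neurons):
        n_active = sum(1 for vec in activations if abs(vec[j]) > eps)
        # Integer comparison avoids floating-point edge cases at the threshold.
        if 100 * n_active >= consistency_pct * n_prompts:
            mask.add(j)
    return mask

# Toy data: neuron 0 fires strongly everywhere, neuron 1 weakly once,
# neuron 2 moderately on half of the prompts.
toy = [
    [0.9, 0.00, 0.2],
    [0.7, 0.00, 0.0],
    [0.8, 0.05, 0.3],
    [0.6, 0.00, 0.0],
]
loose = binarise(toy, eps=0.001, consistency_pct=50)
strict = binarise(toy, eps=0.5, consistency_pct=80)
```

On the toy data, the looser setting keeps neurons 0 and 2, while the stricter one keeps only neuron 0, the single neuron that is both loud and consistent.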
#### Marginalisation\.
A universal circuit is obtained by intersecting across the complementary dimension: across all 63 builtins for an AST node, across all 43 AST nodes for a builtin. A neuron survives only if it fires regardless of which complementary object is involved. If the model did not encode these concepts as structured units, the intersection would collapse to empty, as it would for random binary vectors of this density. The result is 106 universal circuits.
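The intersection and the random-mask argument can be made concrete with a short sketch. The names `universal_circuit` and `expected_random_survivors`, the toy masks, and the 10% mask density are all illustrative assumptions; the paper does not state a density figure.

```python
def universal_circuit(pair_masks):
    """Intersect the masks of one concept over every complementary object.

    pair_masks: non-empty dict mapping complementary object -> set of indices.
    """
    return set.intersection(*pair_masks.values())

def expected_random_survivors(n_neurons, density, n_masks):
    """Expected intersection size for independent random masks.

    Each neuron survives with probability density ** n_masks.
    """
    return n_neurons * density ** n_masks

# Toy masks for the AST node `For` paired with three builtins.
for_masks = {
    "len": {3, 7, 12},
    "print": {3, 7, 40},
    "range": {1, 3, 7},
}
universal = universal_circuit(for_masks)

# Assumed density of 10% over 2,048 neurons and 63 complementary builtins.
baseline = expected_random_survivors(2048, 0.10, 63)
```

Under that assumed density the expected number of random survivors is far below one neuron, which is why a non-empty intersection is informative.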
#### Parameter sweep\.
Two parameters control extraction: $\varepsilon \in \{0.001, 0.1, 0.5\}$ (activation threshold) and $C \in \{20\%, 50\%, 80\%\}$ (consistency threshold). Both universal and checker masks are rebuilt at every combination in the $3 \times 3$ grid (9 settings); results hold across all 9 unless stated otherwise.
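A sketch of the sweep, assuming a simple thresholding helper: the grid values come from the text, while `binarise` and the toy activations are hypothetical stand-ins for the released pipeline.

```python
from itertools import product

EPSILONS = (0.001, 0.1, 0.5)
CONSISTENCIES = (20, 50, 80)  # percent of prompts

def binarise(activations, eps, consistency_pct):
    """Indices of neurons with |activation| > eps on enough prompts."""
    n = len(activations)
    return {
        j for j in range(len(activations[0]))
        if 100 * sum(abs(v[j]) > eps for v in activations) >= consistency_pct * n
    }

# Toy activations: 4 prompts, 3 neurons (made-up values).
toy = [
    [0.9, 0.00, 0.2],
    [0.7, 0.00, 0.0],
    [0.8, 0.05, 0.3],
    [0.6, 0.00, 0.0],
]

# Rebuild the mask at every (epsilon, C) combination of the 3 x 3 grid.
grid = {
    (eps, c): binarise(toy, eps, c)
    for eps, c in product(EPSILONS, CONSISTENCIES)
}
```

Because both filters only become stricter as $\varepsilon$ or $C$ grows, the mask at any setting is a subset of the mask at every looser setting, so a concept that survives the strictest corner survives the whole grid.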
#### Decomposition\.
For each of the 58 testable objects, at each layer, the universal mask $A$ and the token-checker mask $B$ are compared, partitioning the dimensions active in either mask (out of 2,048) into three disjoint groups. *Concept-only* $(A \setminus B)$: in the universal circuit but not the checker, that is, neurons firing when the concept is present but not for the bare keyword. *Shared* $(A \cap B)$: in both. *Token-only* $(B \setminus A)$: in the checker but not the universal circuit. The concept fraction $|A \setminus B| / |A|$ quantifies how much of a circuit reflects structural understanding versus surface pattern-matching, computed per concept, per layer, at all 9 settings.
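The decomposition reduces to three set operations and one ratio. The sketch below uses made-up neuron indices; the function name `decompose` and the convention of returning 0.0 for an empty universal mask are assumptions for illustration.

```python
def decompose(universal, checker):
    """Split masks A (universal) and B (checker) into the three groups."""
    concept_only = universal - checker      # A \ B
    shared = universal & checker            # A ∩ B
    token_only = checker - universal        # B \ A
    # Concept fraction |A \ B| / |A|; defined as 0.0 here when A is empty.
    fraction = len(concept_only) / len(universal) if universal else 0.0
    return {
        "concept_only": concept_only,
        "shared": shared,
        "token_only": token_only,
        "concept_fraction": fraction,
    }

A = {1, 2, 3, 4, 5, 6, 7, 8}   # toy universal mask
B = {4, 5, 6, 7, 8, 9}         # toy token-checker mask
result = decompose(A, B)
```

In this toy case three of the eight universal neurons are concept-only, giving a concept fraction of 0.375, and one neuron responds to the bare token only.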
## 6 Results
### 6.1 Finding 1: Universal Circuits Are Parameter-Stable Across the Sweep
All 43 AST nodes and all 63 builtins produce non-empty universal circuits at every one of the nine $(\varepsilon, C)$ settings: not at a single permissive corner but across the full grid, including $\varepsilon = 0.5$, where most neurons are filtered out. At $\varepsilon = 0.001, C = 80\%$ (the strictest consistency threshold), every one of the 10,208 layer-level pair masks (1,276 pairs $\times$ 8 layers) contains active neurons. The substantive claim, however, is the stability: the ranking of concept-specificity across constructs is preserved across all nine settings, so survival reflects structured representation rather than a threshold artifact. Were the masks random binary vectors of comparable density, the marginalised intersection would collapse toward empty; it does not, for any concept, at any setting.
### 6.2 Finding 2: AST Circuits Contain a Concept Component Distinct from Token Activation
Universal circuits for AST nodes are not merely token detectors: they contain a substantial set of neurons firing when the syntactic concept is present but not when the keyword token appears without the concept. Builtin circuits are almost entirely token-driven. (Note that 19 of the 43 AST nodes, including `Assign`, `BinOp`, `ListComp`, `Compare`, and `Call`, have no distinctive keyword; their universal circuits are inherently concept-driven, since the token confound does not apply.)
Across the parameter sweep, mean concept fraction for AST nodes exceeds that for builtins at every setting: by roughly 3 to 6 times at low to mid $\varepsilon$, and by an order of magnitude at $\varepsilon = 0.5$, where builtin concept-only neurons approach zero (Table [3](https://arxiv.org/html/2605.24603#S6.T3)). At a high threshold the loudest neurons are predominantly concept-specific: at $\varepsilon = 0.5, C = 80\%$, layer 5, modular AST circuits are 62.5% concept-only and non-modular AST circuits 58.2%, while builtin concept-only neurons vanish entirely (Table [4](https://arxiv.org/html/2605.24603#S6.T4)). The layer profile shows concept signal near-zero at layer 1, peaking at layer 5 for modular ASTs, then declining (Table [2](https://arxiv.org/html/2605.24603#S4.T2)).
Table 3: Mean concept fraction by group across the $3 \times 3$ parameter sweep. AST exceeds builtin at every setting; the gap is widest at $\varepsilon = 0.5$, where builtin CF approaches zero.
Table 4: At $\varepsilon = 0.5, C = 80\%$, layer 5: per-group totals of concept-only, shared, and universal-circuit size. The majority of AST circuit neurons are concept-specific. Builtin concept-only neurons vanish entirely.
### 6.3 Finding 3: The Atomicity Super-Cluster and a Four-Tier Hierarchy
Hierarchical clustering of concept-only neuron sets groups six constructs tightly together: `Import`, `ImportFrom`, `Break`, `Continue`, `Pass`, `Assert`. These six are not semantically related (a `break` and an `import` have nothing in common functionally), but they share a single computational property: each is an atomic, single-statement construct requiring no nested body. We call this the *atomicity super-cluster*, and it is the clearest evidence that the model’s internal organisation tracks computational structure rather than semantic category. Under Ward linkage on $1 - \text{Jaccard}$ distance over the 106 universal concept-only neuron sets, the six-member set merges as a single cluster at layer 3 under a $k = 4$ partition; at later layers (including L5, where mean concept fraction peaks) the six members remain mutually closer in Jaccard distance than to other concepts but distribute across $\geq 2$ sub-trees of the same partition. The released code regenerates the dendrogram at any chosen layer (Appendix R).
By a relaxed per-layer modularity criterion (counting significant layers at trim level $p = 0$), the top of the modularity ranking is dominated by atomicity members: `Break`, `ImportFrom`, and `Assert` occupy the three highest positions, with `Break` alone at the top with 3 significant layers. The remaining three atomicity concepts (`Continue`, `Import`, `Pass`) tie with a broader set of concepts at the next level down; the strict $p = 0$ criterion does not separate them from the field, but they fall inside the same hierarchical cluster as the top three under Ward linkage.
The super-cluster sits within a four-tier hierarchy of concept representation:
*Tier 1: Tokenless ASTs* (19 objects: `Assign`, `AugAssign`, `BinOp`, `Dict`, `ListComp`, `Compare`, `Call`, and others). No distinctive keyword token; circuits are inherently concept-driven.
*Tier 2: Modular keyword ASTs* (6 objects: `Import`, `ImportFrom`, `Break`, `Continue`, `Pass`, `Assert`). Computationally atomic: single-statement, no nested body. Strongest concept signal among testable objects: 30% by count, 62.5% by magnitude at L5. The atomicity super-cluster.
*Tier 3: Non-modular keyword ASTs* (18 objects: `For`, `While`, `If`, `Return`, `ClassDef`, `FunctionDef`, `Try`, and others, including async variants). Keywords present, but constructs require bodies, targets, or structural dependencies. Concept signal 6 to 10% by count, 58% by magnitude.
*Tier 4: Builtins* (63 objects: `int`, `list`, `print`, `len`, `range`, and others). Near-zero concept signal relative to AST tiers; circuits are effectively subsets of token activation, and the few concept-only neurons at low $\varepsilon$ vanish at $\varepsilon = 0.5$.
The hierarchy aligns with a gradient of token ambiguity and structural distinctiveness. Tokenless ASTs must be encoded structurally. Modular keyword ASTs have distinctive tokens and crisp self-contained syntactic identity; the model encodes both, and they are partially separable. Non-modular keyword ASTs distribute their identity across multiple tokens (keyword, colon, indented body). Builtins are the most token-predictable: the token is nearly sufficient to identify the concept.
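The distance underlying the clustering, and one reading of the "mutually closer" criterion, can be sketched as follows. The neuron sets are made-up toy data, the function names are hypothetical, and `mutually_closer` is an assumed formalisation (every within-group distance smaller than every group-to-outside distance), not necessarily the exact test in the released code. The Ward linkage step itself is omitted; in practice it would typically be delegated to a library routine such as SciPy's `scipy.cluster.hierarchy.linkage`.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|, with distance 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def mutually_closer(sets, group):
    """True if all within-group distances beat all group-to-outside distances."""
    inside = [
        jaccard_distance(sets[a], sets[b]) for a, b in combinations(group, 2)
    ]
    outside = [
        jaccard_distance(sets[a], sets[o])
        for a in group
        for o in sets
        if o not in group
    ]
    return max(inside) < min(outside)

# Toy concept-only neuron sets (indices are invented for illustration).
toy_sets = {
    "Break": {1, 2, 3, 4},
    "Pass": {1, 2, 3, 5},
    "Import": {2, 3, 4, 5},
    "For": {1, 10, 11, 12},
    "If": {2, 10, 11, 13},
}
atomic = ["Break", "Pass", "Import"]
clustered = mutually_closer(toy_sets, atomic)
```

In the toy data each pair of atomic constructs shares three of five neurons (distance 0.4), while any atomic construct shares at most one neuron with `For` or `If`, so the group satisfies the criterion.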
## 7 Discussion
#### The atomicity observation.
The six modular keyword ASTs cluster not because they are semantically related but because they share a computational property: they are atomic single-statement constructs requiring no nested body. This suggests the model’s internal organisation reflects computational complexity rather than semantic category, and it is the finding most likely to generalise: it predicts that *which* constructs receive early, separable circuitry should be a property of the language’s computational structure, testable in other models and languages.
#### Why builtins are token\-driven\.
The marginalisation intersects across all 43 AST nodes, extracting only what is shared when `print` appears as argument to `For`, `If`, `FunctionDef`, and every other construct. What survives is primarily the token response; the builtin’s role in any particular syntactic context is stripped away by design.
#### Methodology as infrastructure\.
The pipeline is language\-agnostic: adding a new language requires only a concept\-space definition and contrastive prompts, and the decomposition produces directly comparable measurements across any \(language, model\) pair\. This generality motivates follow\-up cross\-language, cross\-model studies that test whether the four\-tier hierarchy and the atomicity super\-cluster survive in dense production models and a second language\.
#### Future work\.
Three extensions follow: cross\-model comparison on dense production\-scale models; cross\-language comparison testing whether language design predicts representation strength; and causal validation via ablation, confirming that concept\-only neurons are functionally active rather than epiphenomenal\.
## Limitations
*No causal validation.* The decomposition is observational; we identify concept-only neurons by activation pattern but have not confirmed that ablating them degrades the model’s processing of the associated construct. Some may be epiphenomenal.
*Single model.* The CSP transformer is a sparse 8-layer model; findings may not transfer to dense architectures, larger scales, or multilingual training. Follow-up work extending the method to dense production models and a second language addresses this directly.
*Single language.* All concepts are Python. Different languages may produce qualitatively different circuit structure.
*MLP outputs only.* The analysis examines the MLP output vector entering the residual stream, not attention patterns, attention-head outputs, or internal MLP hidden layers.
*Last-token position only.* If concept representations are distributed across positions, last-token extraction may capture only a partial signal.
*Checker-prompt coverage.* Five checker categories cover the most common non-structural keyword contexts but are not exhaustive; other contexts (docstrings, type annotations, f-strings) might reveal additional nuance.
## Ethical Considerations
This work analyses internal representations of a publicly available open\-source model\. No new models are trained, no human subjects are involved, and no private data is used; all prompts are synthetic code snippets generated programmatically\.
## Conclusion
We have presented a methodology for extracting and decomposing neural circuits for programming-language concepts, and applied it to a sparse 8-layer Python transformer across 106 concepts and 63,800 prompts. Three findings result: the model develops dedicated circuitry for every concept tested, stably across a parameter sweep; AST circuits contain a genuine concept component distinct from token activation, strongest at mid-to-late layers; and the internal organisation tracks computational structure rather than meaning, visible in an atomicity super-cluster and a four-tier hierarchy. The methodology is general, and all data and code are released to support the cross-language, cross-model, and causal extensions.
## Data and Code Availability
The full dataset (63,800 object prompts, checker prompts, extracted activation masks at all 9 settings, 106 universal circuit masks, and per-object per-layer decomposition tables with neuron indices) is released together with the extraction and analysis code at [https://github.com/piotrwilam/AtlasCSP](https://github.com/piotrwilam/AtlasCSP). The frozen artifacts are mirrored on the Hugging Face Hub at [https://huggingface.co/datasets/piotrwilam/AtlasCSP](https://huggingface.co/datasets/piotrwilam/AtlasCSP).
## References
- Belinkov [2022] Y. Belinkov. Probing classifiers: Promises, shortcomings, and advances. *Computational Linguistics* 48(1), 2022.
- Conmy et al. [2023] A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. *NeurIPS*, 2023.
- Elhage et al. [2022] N. Elhage et al. Toy models of superposition. *Transformer Circuits Thread*, 2022.
- Heimersheim and Nanda [2024] S. Heimersheim and N. Nanda. How to use and interpret activation patching. *arXiv:2404.15255*, 2024.
- Hernandez et al. [2024] E. Hernandez, S. Schwettmann, D. Bau, T. Bagashvili, A. Torralba, and J. Andreas. Linearity of relation decoding in transformer language models. *ICLR*, 2024.
- Meng et al. [2022] K. Meng, D. Bau, A. Andonian, and Y. Belinkov. Locating and editing factual associations in GPT. *NeurIPS*, 2022.
- Tenney et al. [2019] I. Tenney, D. Das, and E. Pavlick. BERT rediscovers the classical NLP pipeline. *ACL*, 2019.
- Wan et al. [2022] Z. Wan, W. Zhao, H. Zhang, et al. What do they capture? A structural analysis of pre-trained language models for source code. *ICSE*, 2022.
## Appendix A Reproducibility and Claim Verification
Table 5: Every paper-cited number is locked in `tests/test_paper_numbers.py` with $\pm 0.001$ tolerance. Analysis primitives in `csp_atlas/analysis/` (`jaccard.py`, `decomposition.py`, `hierarchy.py`, `modularity.py`) map one-to-one to these claims; data loaders sit in `csp_atlas/io/`; model-touching code (forward-pass extraction) lives in `circuits/extraction/` and is separated because it requires a GPU. Frozen artifacts are mirrored on the Hugging Face Hub at [https://huggingface.co/datasets/piotrwilam/AtlasCSP](https://huggingface.co/datasets/piotrwilam/AtlasCSP).
Every numerical claim in the paper is regenerable from the released code. Running
`pytest tests/test_paper_numbers.py`
verifies the locked numbers in Table [5](https://arxiv.org/html/2605.24603#A1.T5); the dendrogram regenerates from a config-named script (`experiments/fig1_atomicity_dendrogram.py`, driven by `configs/paper/figure1_atomicity_dendrogram.yaml`).