Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?
Summary
This paper investigates whether language model agents can automate the explanation phase of mechanistic interpretability by introducing AgenticInterpBench, a benchmark with 84 semi-synthetic circuits, and HyVE, an agentic explainer that iteratively hypothesizes, validates, and explains circuit components. Experiments show promise but identify reliable validation as a key obstacle.
# Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?
Source: [https://arxiv.org/html/2606.24026](https://arxiv.org/html/2606.24026)
Ayan Antik Khan¹, Harsh Kohli², Yuekun Yao², Huan Sun², Ziyu Yao¹
¹George Mason University, ²The Ohio State University. {akhan265,ziyuyao}@gmu.edu, {kohli.120,yao.1267,sun.397}@osu.edu
###### Abstract
Mechanistic interpretability has made substantial progress in automatically localizing circuits, but explaining what localized components do remains labor-intensive and difficult to standardize. In this work, we study whether language model (LM) agents can assist with this explanation problem once a circuit has already been identified. We introduce AgenticInterpBench, a benchmark for circuit explanation built from 84 semi-synthetic transformer circuits with 163 component-level annotations. We propose HyVE (Hypothesize, Validate, Explain), an agentic explainer that analyzes each component through an iterative loop of observation, hypothesis generation, and causal validation, eventually producing a component-level explanation and a circuit-level task description. Across four LM backbones, HyVE recovers useful component- and task-level explanations, but no backbone is uniformly best. Our analysis shows that strong backbones usually form observation-grounded hypotheses, while failures more often arise later in the validation loop, through incomplete validation plans, code execution errors, or unresolved hypotheses. A case study on an arithmetic circuit in Llama-3-8B shows that the same formulation can extend beyond semi-synthetic benchmarks to naturally trained models. Overall, LM agents are promising circuit explainers, but reliable validation remains the key obstacle. We release the benchmark dataset, source code, and prompts at [https://github.com/Ziyu-Yao-NLP-Lab/LLM-Circuit-Explainer](https://github.com/Ziyu-Yao-NLP-Lab/LLM-Circuit-Explainer).
## 1 Introduction
Figure 1: An instance of the circuit explanation task on `frac_prevs`, a 2-layer transformer that computes the running fraction of token ‘x’. An agent receives the input-output examples and a localized circuit and must (i) assign each component a functional role (e.g., L0_MLP: INDICATOR, L1H2: AGGREGATOR) along with a natural-language role description and (ii) derive a task description of the overall model behavior.

Mechanistic Interpretability (MI) seeks to reverse-engineer how language models (LMs) implement specific behaviors by identifying the underlying circuits, i.e., sub-networks of attention heads and MLP components that contribute to the model behaviors (Rai et al., [2024](https://arxiv.org/html/2606.24026#bib.bib28); Bereska and Gavves, [2024](https://arxiv.org/html/2606.24026#bib.bib4); Ferrando et al., [2024](https://arxiv.org/html/2606.24026#bib.bib7)). While recent advances have made circuit localization more efficient through automated patching and attribution methods (Conmy et al., [2023](https://arxiv.org/html/2606.24026#bib.bib6); Hanna et al., [2024](https://arxiv.org/html/2606.24026#bib.bib13); Syed et al., [2024](https://arxiv.org/html/2606.24026#bib.bib30)), the explanation phase, i.e., understanding the semantic roles of these components and their interactions, remains largely manual and difficult to scale. Human researchers typically conduct iterative hypothesis generation and validation using established methods. Yet as models grow larger and more complex, this human-centered process becomes increasingly infeasible. Recent work has shown that LM agents can support open-ended scientific workflows by generating hypotheses, designing experiments, executing code, and refining conclusions from evidence (Chen et al., [2025](https://arxiv.org/html/2606.24026#bib.bib5); Lu et al., [2024](https://arxiv.org/html/2606.24026#bib.bib19); Yamada et al., [2025](https://arxiv.org/html/2606.24026#bib.bib34)).
Given that circuit explanation shares a similar loop, a natural question arises: Can LM agents assist in explaining the circuits within an LM?
In this work, we explore whether LM agents can work as effective circuit explainers once a circuit is localized. We focus on assessing the sufficiency and reliability of LMs in generating and validating explanations grounded in mechanistic evidence, an essential step toward scalable and automated circuit understanding. To study this problem in a controlled setting, we construct AgenticInterpBench, a benchmark comprising 84 localized circuits on semi-synthetic transformers that cover 163 transformer components, built on InterpBench (Gupta et al., [2025](https://arxiv.org/html/2606.24026#bib.bib10)). Each component is annotated with a functional role tag drawn from a 5-class taxonomy together with a natural-language description of its task-specific role.
We further propose HyVE (Hypothesize, Validate, Explain), an agent-based framework that explains a localized circuit through iterative observation, hypothesis generation, and validation. We evaluate HyVE on AgenticInterpBench using four frontier LMs as backbones: GPT-5.4 (OpenAI, [2026](https://arxiv.org/html/2606.24026#bib.bib24)), Claude-Sonnet-4.6 (Anthropic, [2026](https://arxiv.org/html/2606.24026#bib.bib2)), Gemini-3.1-Pro (Google DeepMind, [2026](https://arxiv.org/html/2606.24026#bib.bib9)), and Qwen-3-Coder-30B-A3B-Instruct (Qwen, [2025](https://arxiv.org/html/2606.24026#bib.bib27)). HyVE achieves up to 79% component tag accuracy and 83% task accuracy. The results show that LM agents can produce useful circuit explanations, but no backbone is uniformly best. Initial hypotheses are usually grounded for the stronger backbones, while the main failures arise later in the validation loop. GPT-5.4 produces the soundest validation plans, Claude-Sonnet-4.6 executes code most reliably, and Gemini-3.1-Pro achieves the strongest judged explanation scores. These trends suggest that hypothesis generation may not be the main bottleneck by itself; reliable circuit explanation also depends on validation design and code execution.
To evaluate whether HyVE generalizes beyond the semi-synthetic transformers of AgenticInterpBench, we conduct a case study on a realistic circuit for three-operand addition in Llama-3-8B (Mamidanna et al., [2025](https://arxiv.org/html/2606.24026#bib.bib20)). Our experiment shows that HyVE can recover component roles in this setting: Claude-Sonnet-4.6 correctly explains 8 of 10 components, while GPT-5.4 gives 6 correct and 3 partially correct descriptions. Both models recover the main operand-transfer structure, while Claude also explains the causally redundant components. This case study complements AgenticInterpBench by testing HyVE in a more realistic setting, where the localized circuit comes from a naturally trained next-token prediction model. It also highlights a practical role for agentic explainers as tools for stress-testing existing circuit analyses and probing for missed mechanisms.
## 2 Related Work
##### Mechanistic Interpretability (MI)
MI has largely advanced through detailed case studies that localize and explain circuits for specific model behaviors. A landmark example is the IOI circuit of Wang et al. ([2023](https://arxiv.org/html/2606.24026#bib.bib32)), a 26-head circuit for indirect-object identification in GPT-2 small. A complementary line of work studies arithmetic and algorithmic circuits in LMs, including greater-than comparison (Hanna et al., [2023](https://arxiv.org/html/2606.24026#bib.bib12)) and helical number representations (Kantamneni and Tegmark, [2025](https://arxiv.org/html/2606.24026#bib.bib14)). In our work, we evaluate the generalizability of HyVE on the All-for-One (AF1) subgraph discovered by Mamidanna et al. ([2025](https://arxiv.org/html/2606.24026#bib.bib20)) for mental math.
##### Automation in MI
Early analyses relied solely on human-designed interventions to identify relevant model components. ACDC (Conmy et al., [2023](https://arxiv.org/html/2606.24026#bib.bib6)) automates part of this process by pruning a model’s computational graph with intervention-based tests. EAP and EAP-IG (Syed et al., [2024](https://arxiv.org/html/2606.24026#bib.bib30); Hanna et al., [2024](https://arxiv.org/html/2606.24026#bib.bib13)) further improve scalability by using attribution-based scores to identify important circuit edges. These methods increasingly automate localization, but the subsequent explanation step still largely requires human analysis. Our work is motivated by the need to fill this gap. Similar to us, Paulo et al. ([2024](https://arxiv.org/html/2606.24026#bib.bib26)); Han et al. ([2026](https://arxiv.org/html/2606.24026#bib.bib11)); Liu et al. ([2026](https://arxiv.org/html/2606.24026#bib.bib18)); Marin-Llobet and Ferrando ([2026](https://arxiv.org/html/2606.24026#bib.bib21)) explore automated interpretability; however, they focus on explaining isolated features or neurons, while we target circuit explanation (i.e., explaining transformer components and how they connect to enable specific task performance).
Finally, Bai et al. ([2026](https://arxiv.org/html/2606.24026#bib.bib3)) design agents to *evaluate* MI findings against their underlying code, data, and evidence, while we create agents to *perform* MI research from scratch.
##### Benchmarks in MI
MIB (Mueller et al., [2025](https://arxiv.org/html/2606.24026#bib.bib22)) evaluates circuit localization by reporting two metrics derived from the faithfulness of the circuit against the full model. Tracr (Lindner et al., [2023](https://arxiv.org/html/2606.24026#bib.bib17)) compiles RASP (Weiss et al., [2021](https://arxiv.org/html/2606.24026#bib.bib33)) programs into transformers with known internal structure, which can then serve as the ground-truth circuits, and TracrBench (Thurnherr and Scheurer, [2024](https://arxiv.org/html/2606.24026#bib.bib31)) scales this approach. InterpBench (Gupta et al., [2025](https://arxiv.org/html/2606.24026#bib.bib10)) builds on this line by producing more realistic transformers with known circuits. These resources, however, all target the *localization* of circuits, while benchmarking the *explanation* of circuit components remains understudied. In this line, FIND (Schwettmann et al., [2023](https://arxiv.org/html/2606.24026#bib.bib29)) evaluates open-ended descriptions of black-box functions, but it does not center MI circuits. Our work fills this gap by proposing the first benchmark for agentic circuit explanation. Our benchmark is built on top of InterpBench, as described in Section [3](https://arxiv.org/html/2606.24026#S3).
## 3 Benchmarking LLM Agents as Circuit Explainers
In this section, we formulate the task of circuit explanation and describe AgenticInterpBench.
### 3.1 Task Formulation
Formally, let $\mathcal{E}=\{(x_k,y_k)\}_{k=1}^{m}$ denote a set of task input-output examples illustrating the model’s behavior, and let $\mathcal{C}=\{c_1,c_2,\dots,c_n\}$ denote a localized circuit, where each $c_i$ is a circuit component such as an attention head or MLP sublayer. The agent’s task is to explain the functional role of each component and the task-level behavior implemented by the circuit. The agent produces three outputs. For each circuit component $c_i$, it predicts (i) a role tag $t_i\in\mathcal{R}$ summarizing the component’s abstract role, where $\mathcal{R}$ is the role taxonomy introduced in Section [3.2](https://arxiv.org/html/2606.24026#S3.SS2.SSS0.Px2), and (ii) a natural-language note $n_i$ describing the task-specific behavior of $c_i$. For the full circuit, it also produces (iii) a derived task description $d$ characterizing the LM’s underlying task. Figure [1](https://arxiv.org/html/2606.24026#S1.F1) illustrates these inputs and outputs on the running example `frac_prevs`.
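The inputs and outputs of this formulation can be written down as plain data structures. The following is a minimal sketch, not the benchmark's actual schema: the class and field names are hypothetical, and the two notes and the task description are illustrative wording. Only the tag names and the `frac_prevs` role assignments (L0_MLP as INDICATOR, L1H2 as AGGREGATOR) come from the text and Figure 1.

```python
from dataclasses import dataclass
from enum import Enum


class RoleTag(Enum):
    # The 5-class role taxonomy R named in the paper.
    INDICATOR = "INDICATOR"
    AGGREGATOR = "AGGREGATOR"
    ROUTER = "ROUTER"
    MAPPER = "MAPPER"
    COMBINER = "COMBINER"


@dataclass
class CircuitTask:
    # Agent input: examples E and the localized circuit C (hypothetical schema).
    examples: list   # [(x_k, y_k), ...]
    circuit: list    # component names c_1..c_n, e.g. "L0_MLP"


@dataclass
class ComponentExplanation:
    component: str   # c_i
    tag: RoleTag     # t_i in R
    note: str        # n_i, natural-language role description


@dataclass
class CircuitExplanation:
    # Agent output: per-component (t_i, n_i) plus task description d.
    components: list
    task_description: str


task = CircuitTask(
    examples=[(("c", "x", "a"), (0, 1 / 2, 1 / 3))],
    circuit=["L0_MLP", "L1H2"],
)
explanation = CircuitExplanation(
    components=[
        # Notes below are illustrative wording, not benchmark annotations.
        ComponentExplanation("L0_MLP", RoleTag.INDICATOR, "marks 'x' tokens"),
        ComponentExplanation("L1H2", RoleTag.AGGREGATOR, "averages the marks"),
    ],
    task_description="running fraction of 'x' tokens",
)
```

Keeping the tag as an enum mirrors the exact-match evaluation of tags, while the note and task description stay free text because they are scored by judges.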
### 3.2 AgenticInterpBench
We introduce AgenticInterpBench, a benchmark for evaluating LLM agents on circuit explanation. AgenticInterpBench consists of 84 transformer circuits with 163 annotated components (Table [1](https://arxiv.org/html/2606.24026#S3.T1)). Notably, AgenticInterpBench targets a circuit explanation setting in which the localized circuit is given and the agent must recover the role of each component. AgenticInterpBench is built on InterpBench (Gupta et al., [2025](https://arxiv.org/html/2606.24026#bib.bib10)), which we briefly review before describing our annotation taxonomy and construction procedure.

| Statistic | Count |
| --- | --- |
| Benchmark tasks/circuits | 84 |
| Total MLP components | 120 |
| Total attention components | 43 |
| Avg./Min/Max # of components per circuit | 1.94/1/10 |
| **Role tag counts** | |
| MAPPER | 72 |
| AGGREGATOR | 32 |
| COMBINER | 33 |
| ROUTER | 11 |
| INDICATOR | 15 |

Table 1: Statistics of AgenticInterpBench.

##### Background.
InterpBench provides semi-synthetic transformers whose ground-truth circuits are known by design. It builds on Tracr (Lindner et al., [2023](https://arxiv.org/html/2606.24026#bib.bib17)), a compiler that converts RASP programs (Weiss et al., [2021](https://arxiv.org/html/2606.24026#bib.bib33)) into decoder-only transformers with fully transparent computational structure. To mitigate the unrealistic weight distributions in Tracr-compiled models, InterpBench retrains Tracr models with Strict Interchange Intervention Training (SIIT), a procedure extended from IIT (Geiger et al., [2022](https://arxiv.org/html/2606.24026#bib.bib8)) that aligns a low-level transformer with the Tracr-compiled circuit while penalizing contributions from non-circuit components. The resulting models exhibit weight distributions and activations close to those of naturally trained transformers, while preserving the same circuit components as their Tracr counterparts.
We use the 84 RASP-derived models in InterpBench as the foundation for AgenticInterpBench. (We exclude the two IOI tasks, as IOI is a widely studied circuit and the agent may rely on memorized conclusions instead of grounded execution; Bai et al., [2026](https://arxiv.org/html/2606.24026#bib.bib3).) These tasks span small algorithmic behaviors, including counting, fraction computation, sorting, and matching. Two properties make them well-suited for evaluating LMs as circuit explainers: (i) the ground-truth circuit and per-component role are recoverable from the RASP source, enabling precise evaluation, and (ii) the diversity of tasks reduces the risk of the agent memorizing well-known circuits from prior literature.
##### Dataset Annotation
We build AgenticInterpBench by extending InterpBench with a semantic annotation layer for circuit explanation. For each localized component in an InterpBench model, we inspect the corresponding RASP program and use InterpBench’s high-level/low-level correspondence map to trace the trained component back to the RASP variable it implements. This allows us to assign precise task-specific roles to each component.
Specifically, each task in AgenticInterpBench is annotated with its task description, the original RASP program, five input-output examples with inputs sampled from the task’s data distribution and outputs obtained by executing the RASP program, and per-component role annotations. A role annotation consists of two fields: a tag, drawn from the 5-class taxonomy (Indicator, Aggregator, Router, Mapper, and Combiner) detailed in Appendix [B](https://arxiv.org/html/2606.24026#A2), and a note, a brief natural-language description of the component’s task-specific role. Together, the two fields support evaluation at two granularities: whether the agent identifies the correct abstract role, and whether it can describe that role accurately in the task context. The annotation was performed manually and checked against the original RASP program-circuit mapping to ensure quality.
An example is shown in Figure [1](https://arxiv.org/html/2606.24026#S1.F1). We include its corresponding RASP program and other details in Appendix [C](https://arxiv.org/html/2606.24026#A3).
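For intuition about the running example, the input-output behavior of `frac_prevs` can be sketched in plain Python. This is not the RASP program from the appendix, only a reference function written to match the description ("running fraction of token ‘x’") and the example pair (‘c’, ‘x’, ‘a’) → (0, 1/2, 1/3) shown in Table 3. It assumes the fraction at each position counts the current token and ignores any special start token.

```python
from fractions import Fraction


def frac_prevs(tokens, target="x"):
    """Return, for each position, the fraction of tokens so far equal to target."""
    fractions = []
    count = 0
    for position, token in enumerate(tokens, start=1):
        if token == target:
            count += 1  # indicator step: is this token the target?
        fractions.append(Fraction(count, position))  # aggregation step: running mean
    return fractions


print(frac_prevs(("c", "x", "a")))
# [Fraction(0, 1), Fraction(1, 2), Fraction(1, 3)]
```

The two commented steps line up with the two annotated roles in Figure 1: an indicator that flags ‘x’ tokens and an aggregator that averages those flags over the prefix.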
### 3.3 Evaluation Metrics
We evaluate an agent at three levels of granularity. At the component level, across all components, we report tag prediction accuracy ($Acc_{\text{tag}}$), the exact-match rate between the agent’s predicted tag and the ground-truth tag, and role description quality ($Q_{\text{desc}}$), an LLM-judged score of the predicted role note against the ground-truth note. Specifically, the LLM judge assesses the description quality on a 3-point scale (0 = incorrect, 1 = partially correct, 2 = correct). We use this scale to distinguish fully incorrect descriptions from partially correct ones that capture the main role but contain incorrect mechanistic sub-claims (an example is provided in Appendix [E.7.1](https://arxiv.org/html/2606.24026#A5.SS7.SSS1)). We then rescale the score to $[0,1]$ as $Q_{\text{desc}}$. At the task level, for each task, we report derived task accuracy ($Acc_{\text{task}}$), a binary LLM-judged score of the agent’s derived task description against the ground-truth task description. Finally, at the process level, we report code execution success rate ($S_{\text{exec}}$), the fraction of `execute_python` calls that run without error.
LLM-judged metrics ($Q_{\text{desc}}$ and $Acc_{\text{task}}$) are scored independently by two LLM judges, GPT-5.4 and Gemini-3.1-Pro. We aggregate the two scores by taking the lower score instead of the mean. This choice provides a conservative estimate of explanation quality, which fits our setting because over-crediting an incorrect mechanistic claim is more harmful than under-crediting an incomplete one. The lower-score aggregation also reduces the impact of self-preference bias, where LLM judges can favor their own generations (Panickssery et al., [2024](https://arxiv.org/html/2606.24026#bib.bib25)). A high score is retained only when both judges assign it.
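The four metrics and the lower-score aggregation can be sketched as follows. This is an illustrative reading of the definitions above, not the released evaluation code: all function names are hypothetical, and averaging the rescaled description scores over components is an assumption, since the text only states that the 0–2 score is rescaled to $[0,1]$.

```python
def tag_accuracy(predicted_tags, gold_tags):
    """Acc_tag: exact-match rate between predicted and ground-truth tags."""
    matches = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return matches / len(gold_tags)


def conservative(scores_a, scores_b):
    """Keep the lower of two judges' scores for each item."""
    return [min(a, b) for a, b in zip(scores_a, scores_b)]


def desc_quality(judge_a, judge_b, max_score=2):
    """Q_desc: lower of two 0-2 scores, rescaled to [0, 1].

    Averaging over components is an assumption of this sketch.
    """
    agreed = conservative(judge_a, judge_b)
    return sum(agreed) / (max_score * len(agreed))


def task_accuracy(judge_a, judge_b):
    """Acc_task: a task counts as correct only if both judges mark it 1."""
    agreed = conservative(judge_a, judge_b)
    return sum(agreed) / len(agreed)


def exec_success(call_succeeded):
    """S_exec: fraction of execute_python calls that ran without error."""
    return sum(call_succeeded) / len(call_succeeded)


# One judge gives 2 and the other 0 on the last component: the minimum keeps 0.
print(desc_quality([2, 1, 2], [2, 2, 0]))  # 0.5
```

Taking the minimum means a disagreement always resolves toward the stricter judge, which is the conservative behavior the paragraph above argues for.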
To validate the LLM-judged metrics, we collect human ratings on a subset of 10 tasks containing 17 components. For each component, two human judges independently evaluate the outputs of all four HyVE backbones (Section [5](https://arxiv.org/html/2606.24026#S5)), with agent identities hidden and randomized. This results in a total of 68 component-level annotations. The judges score $Q_{\text{desc}}$ and $Acc_{\text{task}}$ using the same rubrics as the LLM judges, and we observed a Cohen’s $\kappa$ of 0.83 for $Q_{\text{desc}}$ and 0.96 for $Acc_{\text{task}}$, indicating almost perfect inter-annotator agreement (Landis and Koch, [1977](https://arxiv.org/html/2606.24026#bib.bib15)). As with the two LLM judges, we consider the lower score between the two human annotators as the ground-truth evaluation label, and report the LLM-human agreement. We observed substantial agreement for $Q_{\text{desc}}$ ($\kappa=0.76$) and almost perfect agreement for $Acc_{\text{task}}$ ($\kappa=0.8$), which confirms the validity of the LLM-judged metrics. We include details in Appendix [E](https://arxiv.org/html/2606.24026#A5).
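For readers unfamiliar with the agreement statistic, the sketch below computes unweighted Cohen’s $\kappa$ from two raters’ labels, $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. Whether the paper uses the unweighted or a weighted variant for the 3-point scale is not stated, so the unweighted form is an assumption here, and the ratings in the example are made up for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up 0-2 ratings for six components; the raters disagree on one item.
human_1 = [0, 1, 2, 2, 1, 0]
human_2 = [0, 1, 2, 1, 1, 0]
print(round(cohen_kappa(human_1, human_2), 2))  # 0.75
```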
## 4 HyVE
Figure 2: HyVE’s pipeline. HyVE explains each localized component through an iterative observe → hypothesize → validate loop. Refuted hypotheses are fed back as additional context to the next round. After processing all components, it assigns role tags and synthesizes a circuit-level summary.

In this section, we introduce HyVE, our LM agent for circuit explanation.
### 4.1 Overview
HyVE operates one component at a time. For each component in the localized circuit, it runs a three-stage analysis: observe, hypothesize, and validate. These three stages form an iterative loop. HyVE generates a hypothesis from its observations, designs a controlled intervention to test it, and decides whether the evidence supports or refutes the claim. If refuted, the loop returns to hypothesis generation with the refuted claim as additional context. After all components are processed, HyVE classifies each component, produces a component-level explanation, and derives the task description. Figure [2](https://arxiv.org/html/2606.24026#S4.F2) illustrates the full pipeline.
### 4.2 Observation
The goal of observation is to gather descriptive evidence about a target component before hypothesizing its role. HyVE first writes a structured observation plan specifying the goal of the observation, a step-by-step procedure, the required model tensors, and the expected pattern in the result. It then writes Python code implementing the plan, executes the code, and summarizes the results as a natural-language observation. We provide a helper library, `observation_tools.py`, with primitives for inspecting attention patterns and activations. HyVE may use these helpers, write custom code, or combine both.
### 4.3 Hypothesis Generation

| Granularity | Metric | GPT-5.4 | Claude-Sonnet-4.6 | Gemini-3.1-Pro | Qwen-3-Coder-30B-A3B |
| --- | --- | --- | --- | --- | --- |
| Component-level | $Acc_{\text{tag}}$ | 0.74 | **0.79** | 0.76 | 0.67 |
| | $Q_{\text{desc}}$ | 0.46 | 0.58 | **0.59** | 0.25 |
| Task-level | $Acc_{\text{task}}$ | 0.63 | 0.75 | **0.83** | 0.25 |
| Process-level | $S_{\text{exec}}$ | 0.52 | **0.93** | 0.80 | 0.62 |

Table 2: Results of HyVE with different LM backbones on AgenticInterpBench. Higher is better for all metrics. Bold indicates the best score in each row.

After observation, HyVE proposes a hypothesis about the target component’s role. The hypothesis is a short natural-language claim grounded in the observation, the task input-output examples, and any previously refuted hypotheses. At this stage, the role taxonomy is withheld, allowing HyVE to reason freely about the component’s behavior before committing to a fixed label.
### 4.4 Hypothesis Validation
After a hypothesis is proposed, HyVE tests it through controlled interventions on the target component. It first writes a structured validation plan specifying the prediction being tested, a step-by-step procedure, the activations or hooks to intervene on, and the result that would support or refute the hypothesis. It then writes Python code implementing the plan, executes the code, and issues a binary decision based on the results. We provide `validation_tools.py`, a helper library with primitives for ablation, activation patching, and interchange interventions. Similar to the observation stage, HyVE is free to use these primitives, write custom code, or combine both. If the evidence supports the hypothesis, HyVE moves on to the next component. If the evidence refutes it, the loop returns to hypothesis generation with the refuted claim added to the context, and HyVE proposes a revised hypothesis informed by what has been ruled out.
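To make the interchange-intervention idea concrete, here is a toy sketch in plain Python. It does not use `validation_tools.py` or TransformerLens, whose interfaces are not described in this section. Instead, two hypothetical functions, `indicator` and `aggregator`, stand in for the two components of the `frac_prevs` example, and the "activation" is simply the list of per-position indicator values. The test mirrors the first validation plan in Table 3: patch the indicator activation between an ‘x’ and a non-‘x’ input, in both directions, and check that the output shifts as if the token had flipped.

```python
def indicator(tokens):
    """Toy stand-in for L0_MLP: 1.0 where the token is 'x', else 0.0."""
    return [1.0 if token == "x" else 0.0 for token in tokens]


def aggregator(activations):
    """Toy stand-in for the downstream head: running mean of the activations."""
    outputs, total = [], 0.0
    for position, value in enumerate(activations, start=1):
        total += value
        outputs.append(total / position)
    return outputs


def run(tokens, patch=None):
    """Run the toy model, optionally overwriting one indicator activation."""
    activations = indicator(tokens)
    if patch is not None:
        position, value = patch
        activations[position] = value
    return aggregator(activations)


def interchange(base_tokens, source_tokens, position):
    """Run on base_tokens with the activation at `position` taken from source_tokens."""
    source_value = indicator(source_tokens)[position]
    return run(base_tokens, patch=(position, source_value))


base = ["c", "x", "a"]
source = ["c", "c", "a"]  # differs from base only at position 1

# If the patched activation encodes "is x", each patched run should
# reproduce the clean run of the other input.
supported = (
    interchange(base, source, 1) == run(source)
    and interchange(source, base, 1) == run(base)
)
print(supported)  # True
```

In a trained transformer the comparison would be made on model outputs with a tolerance rather than exact equality, and the binary decision would depend on how the plan defines support and refutation.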
### 4.5 Classification
Once all components have been processed, HyVE assigns each one a tag from the taxonomy and writes a concise task-specific note describing the role of each component based on the validated hypotheses. It receives the taxonomy together with the final hypothesis for each component, and selects the tag that best matches the component’s role in the circuit. Separating classification from hypothesis generation allows HyVE to reason about each component’s behavior before committing to a fixed label. Introducing the taxonomy earlier could produce more taxonomy-aligned descriptions, but it would also constrain the agent’s reasoning to the available tags, which we deliberately avoid.
### 4.6 Summarization
After classification, HyVE synthesizes the component-level explanations into a circuit-level account of how the localized circuit implements the task. The summary contains two parts: first, a short description of how information flows between components, which serves as an intermediate step that externalizes HyVE’s findings; second, a derived task description inferred from the validated hypotheses and the task input-output examples. Only the derived task description is evaluated. It tests whether HyVE can move beyond isolated component labels and recover the behavior implemented by the circuit as a whole, given that no task description was provided.
### 4.7 Implementation
HyVE is implemented as a graph-based state machine using LangGraph (LangChain AI, [2024](https://arxiv.org/html/2606.24026#bib.bib16)). The Observation and Hypothesis Validation stages share the same tool-calling procedure: `list_directory` and `read_file` for inspecting the helper libraries (built using TransformerLens (Nanda and Bloom, [2022](https://arxiv.org/html/2606.24026#bib.bib23))), and `execute_python` for running generated code. If the code execution fails, the model receives the error message and may revise its code. We allow up to five execution attempts per stage, after which the tool loop terminates and HyVE must conclude with the evidence gathered so far. Generated code runs in a sandboxed subprocess against a pre-loaded LM. The hypothesis generation and validation loop is capped at three iterations per component. If the budget is exhausted without a supported hypothesis, HyVE proceeds to the next component and retains its most recent hypothesis as a tentative explanation.
We provide the reproducible prompts for HyVE in Appendix [A](https://arxiv.org/html/2606.24026#A1) and trace HyVE’s full trajectory for the running example in Appendix [D](https://arxiv.org/html/2606.24026#A4).
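The two budgets described above (five execution attempts per stage, three hypothesis iterations per component) can be sketched as ordinary control flow. This is a simplified illustration in plain Python, not the LangGraph implementation: every function name is hypothetical, and the LM calls and the sandbox are passed in as callables so that the sketch stays self-contained.

```python
MAX_EXEC_ATTEMPTS = 5     # execution attempts per stage, as stated above
MAX_HYPOTHESIS_ITERS = 3  # hypothesize/validate iterations per component


def run_with_retries(write_code, execute, max_attempts=MAX_EXEC_ATTEMPTS):
    """Write code, run it, and feed any error back until the budget runs out."""
    error = None
    for _ in range(max_attempts):
        code = write_code(error)       # the agent may revise using the last error
        succeeded, output = execute(code)
        if succeeded:
            return output
        error = output                 # on failure, output is the error message
    return None                        # budget exhausted without a clean run


def explain_component(observe, hypothesize, validate,
                      max_iters=MAX_HYPOTHESIS_ITERS):
    """Return (hypothesis, supported) for one component."""
    observation = observe()
    refuted = []
    hypothesis = None
    for _ in range(max_iters):
        hypothesis = hypothesize(observation, refuted)
        if validate(hypothesis):
            return hypothesis, True
        refuted.append(hypothesis)     # refuted claims become extra context
    return hypothesis, False           # keep the latest hypothesis as tentative


# Stub callables: the third hypothesis is the first one that validates.
result = explain_component(
    observe=lambda: "activation norms differ between 'x' and other tokens",
    hypothesize=lambda obs, refuted: f"hypothesis-{len(refuted) + 1}",
    validate=lambda h: h == "hypothesis-3",
)
print(result)  # ('hypothesis-3', True)
```

Returning a `supported` flag alongside the hypothesis keeps tentative explanations distinguishable from validated ones, which is what a convergence analysis needs.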
## 5 Experiments
We evaluate four frontier LLMs as agent backbones: GPT-5.4, Claude-Sonnet-4.6, Gemini-3.1-Pro, and Qwen-3-Coder-30B-A3B-Instruct. Table [2](https://arxiv.org/html/2606.24026#S4.T2) reports their results on AgenticInterpBench.
##### HyVE provides meaningful circuit explanations, but no backbone dominates.
Table [2](https://arxiv.org/html/2606.24026#S4.T2) shows that HyVE provides useful component- and task-level explanations, with different strengths across backbones. Claude-Sonnet-4.6 is strongest on component tagging and code execution, reaching 0.79 $Acc_{\text{tag}}$ and 0.93 $S_{\text{exec}}$. Gemini-3.1-Pro gives the best judged explanations, with the highest $Q_{\text{desc}}$ and $Acc_{\text{task}}$; its task accuracy is 8 points higher than the second-best backbone. GPT-5.4 remains competitive on tag prediction, but its low code execution success appears to limit its final explanation quality. Qwen-3-Coder trails the closed-weight models on the final explanation metrics.
##### Stronger LM backbones generate observation-grounded hypotheses.
We further analyze whether HyVE’s hypotheses follow from its own observations. On the 10-task, 17-component subset used for human validation, we manually rate each observation-hypothesis pair for all four HyVE backbones on a 0–2 grounding scale (0: hypotheses contradicting or ignoring observations; 1: hypotheses partially supported by observations; 2: fully supported). We include annotation details in Appendix [E](https://arxiv.org/html/2606.24026#A5) and show examples in Table [3](https://arxiv.org/html/2606.24026#S5.T3).
All proprietary models show high consistency between observations and hypotheses (average score of 1.94 with only one partially supported hypothesis and no ungrounded hypotheses). Qwen-3-Coder is lower, with a mean score of 1.41 and only 41.2% fully grounded hypotheses. It often starts from a valid but generic observation, but then over-specifies the hypothesis by adding unsupported task-specific mechanisms, such as particular per-neuron roles or positional rules.
##### GPT-5.4 produces the soundest validation plans.
Given a grounded hypothesis, we ask whether HyVE proposes an experiment that actually tests it. We manually score *validation-plan soundness* on a 0–2 scale, ranging from no validation (0), through indirect or incomplete validation (1), to full validation (2); examples are shown in Table [3](https://arxiv.org/html/2606.24026#S5.T3). The score judges only whether the proposed experiment would meaningfully support or refute the hypothesis, not whether the hypothesis itself is correct. Similar to Section [3.3](https://arxiv.org/html/2606.24026#S3.SS3), we collect human ratings on the 10-task, 17-component subset and aggregate them as the lower of the two annotators’ scores. GPT-5.4 is strongest with a score of 1.71, followed by Gemini-3.1-Pro (1.41), Claude-Sonnet-4.6 (1.24), and Qwen-3-Coder (0.71). We provide scoring details and qualitative rubric examples in Appendix [E](https://arxiv.org/html/2606.24026#A5).

| Task | Comp. (role) | Observation | Hypothesis loop |
| --- | --- | --- | --- |
| Returns fraction of previous ‘x’ tokens. Example: (‘c’, ‘x’, ‘a’) → (0, 1/2, 1/3) | L0_MLP (detect ‘x’ tokens) | Obs: … L2 norm of MLP outputs varies between ‘x’ and ‘non-x’ tokens … | ✓ H1: L0_MLP is a binary ‘is_x’ feature detector. (H1 is fully grounded in Obs.) ✓ VP: Patch L0_MLP activations among ‘x’ and ‘non-x’ in both directions. Outputs shift as if the token’s is_x value flipped. (VP tests all claims in H1.) |
| Detects spam keywords. Example: (‘Hi’, ‘offer’, ‘free’) → (‘not spam’, ‘spam’, ‘spam’) | L0_MLP (detect each token from spam keywords & emit per-position signal) | Obs: … L0_MLP has high, stable activation norms across positions, dominated by a small set of neurons … | ▲ H1: Detects position-specific spam patterns by aggregating features from prev. tokens. (H1 is partially grounded in Obs.) ▲ VP: Patch L0_MLP on spam positions, mean-ablate neuron 31, expect performance drops. (VP ignores the aggregation claim in H1.) |
| Multiply each element by the sequence length. Example: (2, 4, 6) → (6, 12, 18) | L0_MLP (computes per-position seq. length from aggregation) | Obs: … L0_MLP has high activation norms, with position-dependent top neurons … | ✗ H1: L0_MLP applies a non-linear transformation to each token. (H1 is not grounded in Obs.) ✗ VP: Mean-ablate top 3 neurons, test neuron 84 for causal effect. (VP does not verify any claims in H1.) |

Table 3: Examples of hypothesis grounding and validation-plan soundness on benchmark tasks. Each row shows a task, an agent observation, the hypothesis (H1), and the corresponding validation plan (VP). (✓) indicates grounded/sound, (▲) indicates partial cases, and (✗) indicates ungrounded/unsound. Pink marks the negative claims and Green marks the positive claims.

##### Reliable validation requires both sound plans and executable code\.
Sound validation plans are not sufficient unless they can be executed\. GPT\-5\.4 has the strongest validation\-plan ratings, but its low code execution success \(Sexec=0\.52S\_\{\\text\{exec\}\}=0\.52\) limits how often those plans yield usable evidence\. Claude\-Sonnet\-4\.6 shows the opposite pattern \(Sexec=0\.93S\_\{\\text\{exec\}\}=0\.93\), with reliable execution but weaker validation plans\. Gemini\-3\.1 Pro is more balanced across the two dimensions, which helps explain its strong judged explanation scores\.
To understand execution failures, we cluster the failed `execute_python` calls into broad error categories. Figure [3](https://arxiv.org/html/2606.24026#S5.F3) shows that Python and tensor-manipulation bugs are common across backbones. Agents often make tensor shape mistakes, misuse helper or TransformerLens APIs, mishandle `<BOS>` offsets, or violate the tool protocol by omitting the required `result` variable. These patterns suggest that better execution scaffolding and more constrained helper APIs could improve HyVE without changing the high-level reasoning loop.

Figure 3: Distribution of `execute_python` errors by failure category and agent backbone.
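One of the failure categories above, omitting the required `result` variable, can be caught before any code is run. The following is a minimal sketch of such a pre-flight check, not part of the released HyVE code; the function name `assigns_result` is hypothetical, and the check only looks at top-level assignments.

```python
import ast


def assigns_result(code: str) -> bool:
    """Return True if `code` parses and assigns a top-level name `result`.

    Hypothetical pre-flight check: it inspects the syntax tree only and
    never executes the submitted code.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in tree.body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets = [node.target]
        else:
            continue
        if any(isinstance(t, ast.Name) and t.id == "result" for t in targets):
            return True
    return False
```

A scaffold could reject a call for which `assigns_result` returns `False` and return a short protocol reminder to the agent, without spending one of its execution attempts.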
##### Explanations improve when hypotheses converge\.
Figure 4: Hypothesis convergence per backbone. Each bar shows the share of 163 components supported on the 1st, 2nd, or 3rd hypothesis-generation iteration, or left unresolved after the three-iteration budget.

We examine the convergence rate of HyVE backbones. This metric summarizes the downstream effect of the preceding failure modes: a grounded hypothesis must still be tested by a sound validation plan and executed successfully. Figure [4](https://arxiv.org/html/2606.24026#S5.F4) shows that backbones with fewer unresolved components tend to achieve stronger final explanations: Claude-Sonnet-4.6 converges most reliably and attains the best tag accuracy, while Qwen-3-Coder leaves many components unresolved and performs worst. Despite producing the soundest validation plans, GPT-5.4 has the lowest convergence rate among the proprietary models, as many of its validation attempts fail at execution. Thus, a sound plan improves final explanations only when the agent can execute it and turn the result into usable evidence. This suggests that final explanation quality depends on completing the full *observe, hypothesize, validate* loop. We provide token usage and estimated API cost for running HyVE with each backbone in Appendix [F](https://arxiv.org/html/2606.24026#A6).
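The convergence breakdown described above can be computed from per-component run records. The sketch below assumes a hypothetical record format in which each component stores the 1-based iteration at which its hypothesis was supported, or `None` if it stayed unresolved; the input values in the usage note are made up for illustration and are not results from the paper.

```python
from collections import Counter
from typing import Dict, List, Optional


def convergence_shares(
    supported_at: List[Optional[int]], budget: int = 3
) -> Dict[str, float]:
    """Share of components supported at each iteration, plus unresolved.

    `supported_at` holds one entry per component: the 1-based iteration at
    which the hypothesis was supported, or None if the budget ran out.
    """
    total = len(supported_at)
    if total == 0:
        raise ValueError("need at least one component")
    counts = Counter(supported_at)
    shares = {f"iter_{i}": counts.get(i, 0) / total for i in range(1, budget + 1)}
    shares["unresolved"] = counts.get(None, 0) / total
    return shares
```

For example, `convergence_shares([1, 1, 2, None])` gives half of the components supported on the first iteration, a quarter on the second, and a quarter unresolved.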
## 6 A Case Study on Realistic LM
AgenticInterpBench provides controlled ground truth by relying on semi-synthetic transformers, but these models do not capture the full setting of a naturally trained, large-scale autoregressive LM. To test whether HyVE’s behavior carries over to this setting, we conduct a case study on the All-for-One (AF1) circuit identified by Mamidanna et al. ([2025](https://arxiv.org/html/2606.24026#bib.bib20)) for the three-operand task $A+B+C$ in Llama-3-8B (AI@Meta, [2024](https://arxiv.org/html/2606.24026#bib.bib1)). Compared to AgenticInterpBench, this setting introduces additional challenges: (i) a larger localized circuit with more components, (ii) redundant routes between attention heads leading to backup heads, and (iii) components that appear important under logit lens probes but are causally weak under intervention. The localized circuit contains 10 components, including operand-transfer attention heads, late-layer MLPs, and logit-lens-positive attention heads. We manually construct component-level reference roles using targeted interventions, retaining both causal and redundant components to test whether HyVE can distinguish mechanistic evidence from suggestive but non-causal signals. We provide setup and reference-annotation details in Appendix [G](https://arxiv.org/html/2606.24026#A7).
We run HyVE with three LM backbones: GPT-5.4, Claude-Sonnet-4.6, and Gemini-3.1-Pro. (Footnote 3: The AF1 paper was published after the reported training-data cutoffs of the proprietary backbones we use, reducing the likelihood of data leakage.) We omit Qwen-3-Coder as it substantially underperforms the closed-weight models on component descriptions and task inference in the controlled benchmark. Three human annotators independently rate the natural-language role note produced by each agent, and we report the majority vote.
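The majority-vote aggregation over the three annotators can be sketched as follows. The label strings follow the columns of Table 4; returning `None` when no label has a strict majority (for instance, three different ratings) is an assumption of this sketch, since the text does not say how such cases are handled.

```python
from collections import Counter
from typing import List, Optional


def majority_label(ratings: List[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators.

    Returns None when no label has a strict majority (an assumption of
    this sketch, not a rule stated in the paper).
    """
    if not ratings:
        return None
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None
```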
| Agent | Correct | Partial | Wrong |
|---|---|---|---|
| GPT-5.4 | 6 | 3 | 1 |
| Claude-Sonnet-4.6 | 8 | 2 | 0 |
| Gemini-3.1-Pro | 1 | 2 | 7 |

Table 4: Human-rated role-description quality on the 10-component AF1 circuit of Llama-3-8B.

Table [4](https://arxiv.org/html/2606.24026#S6.T4) shows the performance across these models. Claude-Sonnet-4.6 and GPT-5.4 generally recover the transfer-head structure and distinguish causally redundant late components from necessary ones (a Claude-Sonnet-4.6 HyVE iteration example is given in Appendix Table [12](https://arxiv.org/html/2606.24026#A7.T12)). The main failure mode is over-interpreting answer-correlated evidence: Gemini-3.1-Pro often treats positional or logit-lens signals as causal, leading to incorrect role descriptions. More broadly, this case study suggests a practical use for HyVE: applying it to already studied realistic circuits could test whether agentic explainers can reproduce known human findings and even surface certain overlooked mechanisms.
## 7 Conclusion and Future Work
We study whether LM agents can explain localized circuits in transformers\. To this end, we introduce a controlled benchmark and a new agentic circuit explanation framework\. Our results show that LM agents can produce useful circuit explanations, but the problem is not solved\. Stronger backbones usually generate grounded hypotheses\. The harder step is validating them through sound causal tests and reliable code execution\. This validation loop is where failures occur, especially through incomplete validation plans and code execution errors\.
Future work may expand AgenticInterpBench to larger and more naturally occurring circuits. Improving the validation loop is also important. In particular, richer helper libraries and more constrained execution interfaces could reduce code-level failures and make causal interventions easier for agents. More broadly, combining automated circuit discovery with agentic circuit explanation could enable end-to-end systems that both localize and explain mechanisms in language models. Finally, we will release our dataset and the agent framework, encouraging the MI community to contribute more MI tool implementations and framework designs.
## Limitations
This work evaluates circuit explanation in a post-localization setting. HyVE is given the localized circuit and asked to explain its components. Thus, our results measure the explanation stage rather than end-to-end circuit discovery.
AgenticInterpBench uses semi-synthetic circuits with recoverable ground truth. This enables systematic evaluation, but the circuits are smaller, more structured, and more algorithmic than many mechanisms in naturally trained LMs. Our real-model case study provides an initial test beyond this setting, and it can be extended. Future researchers should, however, be careful about potential data leakage: existing circuit findings may have been memorized by current LMs, which would invalidate the benchmarking.
The results reflect one agent design, prompting setup, and a helper library for code execution\. Future systems may instantiate the same framework with richer tools or alternative interaction designs\.
## Acknowledgments
We appreciate the sponsorship from Foresight Institute\. This project was also supported by resources provided by the Office of Research Computing at George Mason University \(URL: https://orc\.gmu\.edu\) and funded in part by grants from the National Science Foundation \(Award Number 2018631\)\.
## References
- AI@Meta (2024) AI@Meta. 2024. [Llama 3 model card](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md).
- Anthropic (2026) Anthropic. 2026. Claude sonnet 4.6 system card. [https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf](https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf). Accessed: 2026-05-22.
- Bai et al. (2026) Xiaoyan Bai, Alexander Baumgartner, Haojia Sun, Ari Holtzman, and Chenhao Tan. 2026. [The story is not the science: Execution-grounded evaluation of mechanistic interpretability research](https://arxiv.org/abs/2602.18458). *Preprint*, arXiv:2602.18458.
- Bereska and Gavves (2024) Leonard Bereska and Efstratios Gavves. 2024. Mechanistic interpretability for ai safety–a review. *arXiv preprint arXiv:2404.14082*.
- Chen et al. (2025) Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. 2025. [Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery](https://openreview.net/forum?id=6z4YKr0GK6). In *The Thirteenth International Conference on Learning Representations*.
- Conmy et al. (2023) Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards automated circuit discovery for mechanistic interpretability. *Advances in Neural Information Processing Systems*, 36:16318–16352.
- Ferrando et al. (2024) Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R. Costa-jussà. 2024. [A primer on the inner workings of transformer-based language models](https://arxiv.org/abs/2405.00208). *Preprint*, arXiv:2405.00208.
- Geiger et al. (2022) Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, and Christopher Potts. 2022. [Inducing causal structure for interpretable neural networks](https://proceedings.mlr.press/v162/geiger22a.html). In *Proceedings of the 39th International Conference on Machine Learning*, volume 162 of *Proceedings of Machine Learning Research*, pages 7324–7338. PMLR.
- Google DeepMind (2026) Google DeepMind. 2026. Gemini 3.1 pro model card. [https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf). Accessed: 2026-05-22.
- Gupta et al. (2025) Rohan Gupta, Iván Arcuschin, Thomas Kwa, and Adrià Garriga-Alonso. 2025. [Interpbench: Semi-synthetic transformers for evaluating mechanistic interpretability techniques](https://arxiv.org/abs/2407.14494). *Preprint*, arXiv:2407.14494.
- Han et al. (2026) Jiaojiao Han, Wujiang Xu, Mingyu Jin, and Mengnan Du. 2026. [Sage: An agentic explainer framework for interpreting sae features in language models](https://arxiv.org/abs/2511.20820). *Preprint*, arXiv:2511.20820.
- Hanna et al. (2023) Michael Hanna, Ollie Liu, and Alexandre Variengien. 2023. [How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model](https://arxiv.org/abs/2305.00586). *Preprint*, arXiv:2305.00586.
- Hanna et al. (2024) Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov. 2024. [Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms](https://arxiv.org/abs/2403.17806). *Preprint*, arXiv:2403.17806.
- Kantamneni and Tegmark (2025) Subhash Kantamneni and Max Tegmark. 2025. [Language models use trigonometry to do addition](https://arxiv.org/abs/2502.00873). *Preprint*, arXiv:2502.00873.
- Landis and Koch (1977) J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. *Biometrics*, 33(1):159–174.
- LangChain AI (2024) LangChain AI. 2024. Langgraph. [https://github.com/langchain-ai/langgraph](https://github.com/langchain-ai/langgraph).
- Lindner et al. (2023) David Lindner, János Kramár, Matthew Rahtz, Thomas McGrath, and Vladimir Mikulik. 2023. Tracr: Compiled transformers as a laboratory for interpretability. *arXiv preprint arXiv:2301.05062*.
- Liu et al. (2026) Weiqi Liu, Yongliang Miao, Haiyan Zhao, Yanguang Liu, and Mengnan Du. 2026. [Neuronscope: A multi-agent framework for explaining polysemantic neurons in language models](https://arxiv.org/abs/2601.03671). *Preprint*, arXiv:2601.03671.
- Lu et al. (2024) Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. 2024. The ai scientist: Towards fully automated open-ended scientific discovery. *arXiv preprint arXiv:2408.06292*.
- Mamidanna et al. (2025) Siddarth Mamidanna, Daking Rai, Ziyu Yao, and Yilun Zhou. 2025. [All for one: LLMs solve mental math at the last token with information transferred from other tokens](https://doi.org/10.18653/v1/2025.emnlp-main.1565). In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 30747–30760, Suzhou, China. Association for Computational Linguistics.
- Marin-Llobet and Ferrando (2026) Arnau Marin-Llobet and Javier Ferrando. 2026. [Automated interpretability and feature discovery in language models with agents](https://arxiv.org/abs/2605.01555). *Preprint*, arXiv:2605.01555.
- Mueller et al. (2025) Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, and 1 others. 2025. Mib: A mechanistic interpretability benchmark. *arXiv preprint arXiv:2504.13151*.
- Nanda and Bloom (2022) Neel Nanda and Joseph Bloom. 2022. Transformerlens. [https://github.com/TransformerLensOrg/TransformerLens](https://github.com/TransformerLensOrg/TransformerLens).
- OpenAI (2026) OpenAI. 2026. Introducing gpt-5.4. [https://openai.com/index/introducing-gpt-5-4/](https://openai.com/index/introducing-gpt-5-4/). Accessed: 2026-05-22.
- Panickssery et al. (2024) Arjun Panickssery, Samuel R. Bowman, and Shi Feng. 2024. [Llm evaluators recognize and favor their own generations](https://arxiv.org/abs/2404.13076). *Preprint*, arXiv:2404.13076.
- Paulo et al. (2024) Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. 2024. [Automatically Interpreting Millions of Features in Large Language Models](https://doi.org/10.48550/arXiv.2410.13928). *arXiv e-prints*, arXiv:2410.13928.
- Qwen (2025) Qwen. 2025. [Qwen3 technical report](https://arxiv.org/abs/2505.09388). *Preprint*, arXiv:2505.09388.
- Rai et al. (2024) Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, and Ziyu Yao. 2024. A practical review of mechanistic interpretability for transformer-based language models. *arXiv preprint arXiv:2407.02646*.
- Schwettmann et al. (2023) Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, and Antonio Torralba. 2023. [FIND: A Function Description Benchmark for Evaluating Interpretability Methods](https://doi.org/10.48550/arXiv.2309.03886). *arXiv e-prints*, arXiv:2309.03886.
- Syed et al. (2024) Aaquib Syed, Can Rager, and Arthur Conmy. 2024. Attribution patching outperforms automated circuit discovery. In *Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP*, pages 407–416.
- Thurnherr and Scheurer (2024) Hannes Thurnherr and Jérémy Scheurer. 2024. [Tracrbench: Generating interpretability testbeds with large language models](https://arxiv.org/abs/2409.13714). *Preprint*, arXiv:2409.13714.
- Wang et al. (2023) Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2023. [Interpretability in the wild: a circuit for indirect object identification in GPT-2 small](https://openreview.net/forum?id=NpsVSN6o4ul). In *The Eleventh International Conference on Learning Representations*.
- Weiss et al. (2021) Gail Weiss, Yoav Goldberg, and Eran Yahav. 2021. [Thinking like transformers](https://arxiv.org/abs/2106.06981). *Preprint*, arXiv:2106.06981.
- Yamada et al. (2025) Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. 2025. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. *arXiv preprint arXiv:2504.08066*.
## Appendix A Prompt Templates
In this section, we include the templates used to prompt HyVE.
### A.1 System Prompt
System Prompt

You are a research agent specializing in mechanistic interpretability of transformer language models. Your goal is to explain how a localized circuit, and in particular each target component, contributes to the model’s behavior on a given sequence-processing task.

Context:

1. Task: The model performs a sequence-processing task. See the Task Information below for input-output examples.
2. Target model: A decoder-only transformer with {NUM_LAYERS} layers, {NUM_HEADS} attention heads per layer, and an MLP sublayer in each block.
3. Localized components: {ALL_COMPONENTS}
4. Target component: {COMPONENT_IDENTIFIER}
5. Component Localization: The listed components have been identified as causally relevant to the task. Your goal is to explain the functional role of the target component within this circuit.
6. Previously validated components: {VALIDATED}.
7. Iterative Workflow:
   - (a) Observation: Gather descriptive evidence about the behavior and activation patterns of COMPONENT_IDENTIFIER using the available analysis tools.
   - (b) Hypothesis Generation: Propose a hypothesis about the functional role of COMPONENT_IDENTIFIER, grounded in your observations.
   - (c) Hypothesis Validation: Design and execute causal or behavioral tests to support or refute the hypothesis.
   - (d) Classification: After all components are validated, assign each a functional role tag from a fixed taxonomy and a concise functional role note, grounded in its validated hypothesis.
   - (e) Summarization: Once all components are classified, synthesize how the information flows through the localized circuit. Write a description of the task being performed by the model.

Guidelines. Do not assume the role of any component in advance. Prefer precise, falsifiable hypotheses. Ground claims in mechanistic evidence from activations, interventions, and model behavior.
### A.2 Observation Stage
Observation Plan Prompt

You are currently in the Observation stage, Plan Generation step. Your goal is to gather descriptive evidence about COMPONENT_IDENTIFIER that will guide hypothesis generation in the next stage.

Generate a structured plan specifying what to observe about {COMPONENT_IDENTIFIER}. Follow these instructions:

1. The plan must be purely descriptive and observational. Do not include ablations or causal interventions.
2. Be concrete and implementable. After this step, Python code will be generated directly from this plan.
3. Keep each field concise, using 1-3 sentences.

Analysis tools are available in the execution environment and will be discovered during code execution. Respond with a JSON object containing exactly the following fields, and do not include any text outside the JSON:

- `stage`: "OBSERVATION PLAN for {COMPONENT_IDENTIFIER}"
- `goal`: what aspect of the component should be characterized, and why it is relevant.
- `procedure`: a short list of concrete observation steps.
- `inputs`: the model tensors, hooks, or cached activations required.
- `expected_result`: the expected pattern or value and how it should be interpreted.

Observation Code and Execution Prompt

You are currently in the Observation stage, Code Generation and Execution step.

Observation plan: OBSERVATION_PLAN

You have access to three tools:

- `list_directory(path)`: list files in a directory on the analysis server.
- `read_file(path)`: read the full contents of a file on the analysis server.
- `execute_python(code)`: execute Python code in the pre-configured execution environment.

The execution environment has the following objects in scope:

- `model`: a `HookedTransformer`.
- `tokenize(sequences)`: converts a list of token lists into a `[batch, seq_len]` tensor. Each sequence starts with `BOS` and uses tokens from the task vocabulary.
- `decode(logits)`: converts model output logits to task-semantic values in the same format as the task examples. The BOS position is excluded by default.
- Standard imports

Recommended steps:

1. Call `list_directory()` to discover available helper files.
2. Read the file most relevant to the observation plan to learn the available function signatures.
3. Generate executable Python code for the observation and call `execute_python`. The code must implement the observation plan, assign all outputs to a dictionary named `result`.

Each `execute_python` call runs in a fresh subprocess. Variables, imports, and state do not persist across calls. Every call must be self-contained. Do not perform causal interventions. You have up to {MAX_EXECUTE_ROUNDS} execution attempts. If code returns an error, read the traceback, identify the root cause, and revise the next attempt.

When sufficient evidence has been gathered to characterize {COMPONENT_IDENTIFIER}, respond with a JSON object containing exactly one field:

- `observation`: a concise natural-language description of what was observed about {COMPONENT_IDENTIFIER}.
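The fresh-subprocess rule in this prompt explains why every call must be self-contained. The sketch below illustrates the idea with the standard library only; it is not the released tool. The name `run_isolated`, the `__RESULT__` marker, and the JSON hand-off are assumptions of this sketch, and the real environment additionally places `model`, `tokenize`, and `decode` in scope.

```python
import json
import subprocess
import sys

# Appended to the submitted code so the child process prints `result`.
_MARKER = "__RESULT__"
_EPILOGUE = f"\nimport json as _json\nprint({_MARKER!r} + _json.dumps(result))\n"


def run_isolated(code: str, timeout: float = 30.0) -> dict:
    """Run `code` in a new Python process and return its `result` dict.

    Nothing defined by one call is visible to the next, because each call
    starts a new interpreter.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code + _EPILOGUE],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        return {"ok": False, "error": lines[-1] if lines else "unknown error"}
    for line in reversed(proc.stdout.splitlines()):
        if line.startswith(_MARKER):
            return {"ok": True, "result": json.loads(line[len(_MARKER):])}
    return {"ok": False, "error": "no result emitted"}
```

Under this sketch, a second call that refers to a variable defined in an earlier call fails with a `NameError`, as does a call that never assigns `result`.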
### A.3 Hypothesis Generation Stage
Hypothesis Generation Prompt

You are currently in the Hypothesis Generation stage. Based on the observation collected in the previous stage (if available) and the task context provided in the system prompt, your goal is to propose a precise, falsifiable hypothesis about the functional role of {COMPONENT_IDENTIFIER} in the circuit.

Prior observation: {OBSERVATION}

Previously refuted hypotheses (if any): {REFUTED}

Instructions:

1. Your hypothesis must be grounded in the prior observation. Do not contradict observed evidence.
2. If previous hypotheses were refuted, explicitly address why they failed and constrain the new proposal accordingly.
3. A good hypothesis specifies: (a) what the component computes, (b) which input positions or tokens it operates on, and (c) what it contributes downstream.
4. Use language that reflects a testable claim, such as “we hypothesize that …”

Respond with a JSON object containing exactly the following fields, and do not include any text outside the JSON:

- `hypothesis`: the proposed hypothesis about the functional role of {COMPONENT_IDENTIFIER}.
- `reasoning`: 1-3 sentences explaining why the hypothesis is consistent with the observation.
### A.4 Hypothesis Validation Stage
Validation Plan Prompt

You are currently in the Hypothesis Validation stage. Your goal is to design and execute a causal experiment that produces evidence either supporting or refuting the current hypothesis about {COMPONENT_IDENTIFIER}. The outcome will determine whether this component’s role is considered explained, or whether hypothesis generation restarts with revised constraints.

Current hypothesis: {CURRENT_HYPOTHESIS}

Prior observation: {OBSERVATION}

Requirement: Design a causal experiment to test the hypothesis. Follow these instructions:

1. The experiment must be causal or interventional. Appropriate methods include activation patching, mean ablation, interchange intervention etc.
2. The plan must be directly motivated by the hypothesis. The expected result should follow logically from what the hypothesis predicts.
3. Be concrete and implementable. In the next step, Python code will be generated directly from this plan.
4. Keep each field concise, using 1-3 sentences.

Respond with a JSON object containing exactly the following fields, and do not include any text outside the JSON:

- `stage`: "VALIDATION PLAN for COMPONENT_IDENTIFIER"
- `goal`: the specific prediction of the hypothesis being tested.
- `procedure`: a short list of concrete experimental steps.
- `inputs`: the model tensors, hooks, or cached activations required.
- `expected_result`: what should appear in the results if the hypothesis is correct.

Validation Code and Execution Prompt

You are currently in the Hypothesis Validation stage, Code Generation and Execution step.

Current hypothesis: {CURRENT_HYPOTHESIS}

Validation plan: {VALIDATION_PLAN}

You have access to three tools:

- `list_directory(path)`: list files in a directory on the analysis server.
- `read_file(path)`: read the full contents of a file on the analysis server.
- `execute_python(code)`: execute Python code in the pre-configured analysis environment.

The execution environment has the following objects in scope:

- `model`: a `HookedTransformer`.
- `tokenize(sequences)`: converts a list of token lists into a `[batch, seq_len]` tensor. Each sequence starts with `BOS` and uses tokens from the task vocabulary.
- `decode(logits)`: converts model output logits to task-semantic values in the same format as the task examples. The BOS position is excluded by default.
- Standard imports.

Recommended steps:

1. Call `list_directory()` to discover available helper files with template code.
2. Read the file most relevant to the validation task to learn the available causal-intervention function signatures. Helper functions are already defined in the execution environment and should be called directly, without import statements.
3. Generate executable Python code for validating the hypothesis and call `execute_python`. The code must implement the validation plan, assign all outputs to a dictionary named `result`.

Each `execute_python` call runs in a fresh subprocess; variables, imports, and state do not persist across calls. Every call must be self-contained. The experiment must include at least one causal intervention, such as ablation, activation patching, or interchange intervention. If a metric function is used, it should have signature `fn(logits: Tensor) -> float`, evaluate all output-bearing positions, exclude BOS, and use `decode()` to convert logits to task values. You have up to {MAX_EXECUTE_ROUNDS} execution attempts. If code returns an error, read the traceback, identify the root cause, and revise the next attempt.

Decision criteria:

- `support`: the results are consistent with the hypothesis, and the intervention effect aligns with the prediction.
- `refute`: the results contradict the hypothesis, or the effect is absent, reversed, or inconsistent. If the result is ambiguous, prefer `refute`.

When sufficient experimental evidence has been gathered, respond with a JSON object containing the following fields, and do not include any text outside the JSON:

- `decision`: exactly `support` or `refute`.
- `explanation`: 2–3 sentences summarizing the observed tool results and why they support the decision.
- `ruling_out`: required only for `refute`; specifies what claim in the hypothesis is contradicted.
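The response contract at the end of this prompt (a `decision` of `support` or `refute`, an `explanation`, and a `ruling_out` field that is required only for `refute`) lends itself to a strict parser on the orchestration side. The following is a minimal sketch under that reading; the function name `parse_validation_response` is hypothetical, and how the released framework handles malformed responses is not described in the text.

```python
import json


def parse_validation_response(raw: str) -> dict:
    """Parse and check the agent's final validation JSON.

    Raises ValueError when the response violates the contract stated in
    the prompt (json.JSONDecodeError is a subclass of ValueError).
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    decision = data.get("decision")
    if decision not in ("support", "refute"):
        raise ValueError("decision must be exactly 'support' or 'refute'")
    explanation = data.get("explanation")
    if not isinstance(explanation, str) or not explanation.strip():
        raise ValueError("explanation must be a non-empty string")
    # ruling_out tells the next hypothesis round which claim was contradicted.
    if decision == "refute" and not data.get("ruling_out"):
        raise ValueError("ruling_out is required when decision is 'refute'")
    return data
```

Rejecting a `refute` without `ruling_out` matters for the loop: the refuted claim is what the next hypothesis-generation round is asked to address.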
### A.5 Classification Stage
Taxonomy Classification Prompt

You are now in the Taxonomy Classification stage. All components have been analyzed. Assign each circuit component a functional role tag from the taxonomy below, based on its validated hypothesis.

Taxonomy:

- INDICATOR: Detects a property of a single input and emits a binary or predicate-like signal.
- AGGREGATOR: Reduces content across multiple positions into a summary quantity, such as a count, fraction, or accumulated value. The information is collapsed after the operation.
- ROUTER: Moves content from one position to another through positional or index-based selection. Content is copied across positions.
- MAPPER: Transforms a single input at each position independently into a non-binary output, such as an arithmetic value, remapping, lookup, or reshaped representation, including when the result is a control signal.
- COMBINER: Reads two or more distinct upstream signals or inputs and fuses them into one output.

Validated hypotheses: {VALIDATED_HYPOTHESES}

Instructions:

1. Assign exactly one tag to each component based on the validated hypothesis and the taxonomy definitions.
2. When tags overlap, choose based on the component’s functional input-output behavior.
3. Write a concise one-sentence description of what the component does in the context of this specific task.
4. Ground the description in the hypothesis, not in the generic taxonomy definition.
5. Do not output anything outside the JSON.

Respond with a JSON object containing one entry for each component. Each entry should contain exactly two fields:

- `tag`: one of INDICATOR, AGGREGATOR, ROUTER, MAPPER, or COMBINER.
- `description`: a one-sentence task-specific role description.
### A.6 Summarization Stage
Circuit Summarization Prompt

You are in the Circuit Summarization stage. All components have been validated and classified.

Localized components: {LOCALIZED}

Classified component roles: {ROLES}

Validated hypotheses: {VALIDATED_HYPOTHESES}

Instructions:

1. Synthesize a coherent account of how information flows through the localized circuit and what task is being implemented.
2. Describe the computational stages and the explicit interactions between components.
3. Infer the underlying task using both the input-output examples and the validated component mechanisms. If the mechanisms conflict with the examples, prioritize the examples.
4. Before finalizing the derived task description, check it against at least two input-output examples at non-boundary positions and revise if needed. Do not include these checks in the final output.
5. Write the derived task description as a concise task specification. Do not copy example values, or restate component-level operations.

Respond with a valid JSON object containing the following fields, and do not include any text outside the JSON:

- `information_flow`: 1-2 sentences describing the sequential dependencies among components.
- `derived_task_description`: 1-2 sentences stating the sequential task performed by the model.

| Tag | Type | RASP primitive | Description |
|---|---|---|---|
| INDICATOR | MLP | `rasp.Map(pred, tokens)` | Detects a property of the current token and emits a binary signal. |
| AGGREGATOR | ATTN | `rasp.Aggregate()`, `rasp.SelectorWidth()` | Computes a summary over selected positions (e.g. count, fraction, accumulated quantity). |
| ROUTER | ATTN | `rasp.Select(rasp.indices, …)` + `rasp.Aggregate()` | Moves a token from one position to another via positional or index-based selection. |
| MAPPER | MLP | `rasp.Map()` | Applies an element-wise transformation to each position. |
| COMBINER | MLP | `rasp.SequenceMap()`, `rasp.LinearSequenceMap()` | Reads and combines multiple upstream signals into one output through an arithmetic or logical operation. |

Table 5: Taxonomy of functional roles used in component-level annotations. Each tag captures the abstract computational role played by an attention head or MLP within a localized circuit, grounded in the corresponding RASP primitive.
## Appendix B Component Role Taxonomy
Table [5](https://arxiv.org/html/2606.24026#A1.T5) contains details regarding the 5-class role taxonomy.
## Appendix C Annotation Example: frac_prevs
This section illustrates how we derive component-level annotations from InterpBench using the `frac_prevs` task as an example. The goal of `frac_prevs` is to return, at each position, the fraction of previous tokens up to and including that position that are equal to ‘x’. Our annotation procedure uses two sources of information: the original RASP program, which specifies the high-level algorithm, and the high-level/low-level correspondence map, which identifies which trained InterpBench component implements each Tracr component.
The RASP program for this task is:
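```
from tracr.rasp import rasp


def make_frac_prevs() -> rasp.SOp:  # wrapper added so that `return` is valid
    # Per-position predicate: is the current token "x"?
    is_x = (rasp.tokens == "x").named("is_x")
    bools = rasp.numerical(is_x)
    # Prefix selector: each position selects all positions at or before it.
    prevs = rasp.Select(rasp.indices, rasp.indices, rasp.Comparison.LEQ)
    # Numerical mean of the is_x signal over the selected prefix.
    return rasp.numerical(
        rasp.Aggregate(prevs, bools, default=0)
    ).named("frac_prevs")
```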
This program decomposes the task into two main steps. First, `is_x` computes a per-position predicate indicating whether the current token is `x`. The variable `bools` converts this predicate into a numerical signal. Second, `prevs` defines a prefix selector over positions, and `Aggregate(prevs, bools)` aggregates the `is_x` signal over the prefix to compute the running fraction. Thus, `is_x` corresponds to an indicator-style computation, while `frac_prevs` corresponds to an aggregation over previous positions.
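The two steps can be mirrored in plain Python to check the expected outputs by hand. This is a reference sketch of the task semantics only, not of the compiled transformer; it ignores the BOS token that the trained models prepend, and the function name `frac_prevs_reference` is hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence


def frac_prevs_reference(tokens: Sequence[str], target: str = "x") -> List[Fraction]:
    """Running fraction of positions up to and including i that hold `target`."""
    # Step 1 (indicator): per-position is_x signal.
    is_x = [int(token == target) for token in tokens]
    # Step 2 (aggregator): mean of is_x over the prefix ending at each position.
    out = []
    hits = 0
    for position, flag in enumerate(is_x, start=1):
        hits += flag
        out.append(Fraction(hits, position))
    return out
```

On the Table 3 example, `frac_prevs_reference(["c", "x", "a"])` yields 0, 1/2, and 1/3.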
InterpBench provides a high-level/low-level correspondence map that links each Tracr high-level node to the trained low-level InterpBench component aligned with it. For `frac_prevs`, the relevant entries are:
```
{
  TracrHLNode(
    name: blocks.0.mlp.hook_post,
    label: is_x_3,
    index: [:]
  ): {
    LLNode(name='blocks.0.mlp.hook_post', index=[:])
  },
  TracrHLNode(
    name: blocks.1.attn.hook_result,
    label: frac_prevs_1,
    index: [:, :, 0, :]
  ): {
    LLNode(name='blocks.1.attn.hook_result', index=[:, :, 2, :])
  }
}

is_x_3 |
  HL = blocks.0.mlp.hook_post, index = [:]
  -> LL = [('blocks.0.mlp.hook_post', [:])]
frac_prevs_1 |
  HL = blocks.1.attn..., index = [:, :, 0, :]
  -> LL = [('blocks.1.attn...', [:, :, 2, :])]
```
The first correspondence entry maps the Tracr MLP component labeled `is_x_3` to the trained InterpBench component `blocks.0.mlp.hook_post`. Since the corresponding RASP variable `is_x` detects whether each token is `x`, we annotate this component as an Indicator. Its role note is: “Computes a per-position feature indicating whether the token at that position is x or not.”
The second correspondence entry maps the Tracr attention output labeled `frac_prevs_1` to head 2 of `blocks.1.attn.hook_result` in the trained InterpBench model. Since this component implements the aggregation over the prefix selector `prevs`, we annotate it as an Aggregator. Its role note is: “Aggregates prefix fraction by attending over previous positions.” We also record that this component uses the upstream `is_x` feature computed by L0_MLP.
The resulting component annotations are therefore:
```
components = [
    {
        "id": "L0_MLP",
        "hook": "blocks.0.mlp.hook_post",
        "role": {
            "tag": "INDICATOR",
            # Adjacent string literals are concatenated into one note.
            "note": "Computes per-position feature "
                    "indicating whether the token at that "
                    "position is 'x' or not.",
        },
        "labels": ["is_x_3"],
    },
    {
        "id": "L1H2_ATTN",
        "hook": "blocks.1.attn.hook_result[2]",
        "role": {
            "tag": "AGGREGATOR",
            "note": "Aggregates prefix fraction "
                    "by attending over previous positions.",
        },
        "labels": ["frac_prevs_1"],
    },
]
```
This example shows how AgenticInterpBench extends InterpBench: InterpBench provides the trained low\-level models and their correspondence to Tracr components, while AgenticInterpBench adds semantic role annotations by tracing each localized component back to the RASP variable it implements\.
## Appendix D HyVE Walkthrough
To make the pipeline concrete, we trace HyVE’s full trajectory on component L0\_MLP for the running example frac\_prevs, using Claude\-Sonnet\-4\.6 as the backbone\.
##### Observation\.
The observation plan is to characterize what L0\_MLP encodes, write code to cache its outputs across token types \(‘x’, ‘c’, ‘a’, ‘b’\), and compare per\-token\-type difference vectors\. HyVE observes that L0\_MLP produces dramatically different outputs for ‘x’ vs\. non\-‘x’ tokens \($\lVert\Delta\rVert \approx 2.65$\), while non\-‘x’ tokens are similar to each other \($\lVert\Delta\rVert \approx 0.04\text{-}0.07$\)\.
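The comparison of per\-token\-type difference vectors can be sketched as follows\. This is a minimal illustration on synthetic activations, not the code that HyVE generated; the function name `mean_diff_norms`, the dimensionality, and the synthetic data are all assumptions\.

```python
import numpy as np


def mean_diff_norms(acts, token_types):
    """L2 norm of the difference between mean activations of each token-type pair.

    acts: (n, d) array of cached component outputs (hypothetical input).
    token_types: length-n sequence of token labels, aligned with `acts`.
    """
    types = np.array(token_types)
    labels = sorted(set(token_types))
    means = {t: acts[types == t].mean(axis=0) for t in labels}
    norms = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            norms[(a, b)] = float(np.linalg.norm(means[a] - means[b]))
    return norms


# Synthetic stand-in for cached outputs: only 'x' shifts one coordinate.
rng = np.random.default_rng(0)
d = 8
token_types = ["x", "a", "b", "c"] * 25
base = rng.normal(size=d)
signal = np.zeros(d)
signal[0] = 2.5
acts = np.stack([
    base + (signal if t == "x" else 0.0) + 0.01 * rng.normal(size=d)
    for t in token_types
])

norms = mean_diff_norms(acts, token_types)
```

In this synthetic setting, pairs involving ‘x’ have a large norm while pairs of non\-‘x’ tokens have a norm near zero, which is the qualitative pattern the observation step reports\.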
##### Hypothesis\.
“L0\_MLP is a binary is\_x feature detector: at every position, it writes a position\-invariant signal into the residual stream encoding whether the token is ‘x’ \(positive\) or not \(near\-zero/negative\)\.”
##### Validation\.
HyVE designs an activation\-patching experiment: replace the L0\_MLP output for an ‘x’ token with the output for a non\-‘x’ token \(and the reverse\)\. Patching confirms causal necessity, with normalized effect $\approx 0.97$ for $x \to c$ and $\approx 0.99$ on the reverse patch\.
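The text does not spell out how the patching effect is normalized\. One common convention, assumed here, divides the change caused by the patch by the gap between the clean and corrupted outputs, so that 0 means the patch changed nothing and 1 means it fully moved the output to the corrupted value\. The toy model below is a hypothetical stand\-in, not the InterpBench model\.

```python
def toy_model(token, mlp_override=None):
    """Toy stand-in for a model: an indicator 'MLP' plus a weak direct path."""
    mlp_out = 1.0 if token == "x" else 0.0      # indicator component
    if mlp_override is not None:
        mlp_out = mlp_override                   # activation patch
    direct_path = 0.1 if token == "x" else 0.0   # signal that bypasses the MLP
    return mlp_out + direct_path


def normalized_patch_effect(clean_out, corrupt_out, patched_out):
    """Assumed normalization: fraction of the clean-to-corrupt gap closed by the patch."""
    return (patched_out - clean_out) / (corrupt_out - clean_out)


clean = toy_model("x")                       # clean run on an 'x' token
corrupt = toy_model("c")                     # run on a non-'x' token
patched = toy_model("x", mlp_override=0.0)   # 'x' run with the non-'x' MLP output
effect = normalized_patch_effect(clean, corrupt, patched)
```

In this toy setup the effect is below 1 only because of the direct path that bypasses the patched component; an effect close to 1 indicates that nearly all of the behavior flows through the patched component\.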
##### Classification\.
HyVE assigns the tag Indicator, matching the ground\-truth annotation\. It also writes a role description \(“At each token position, L0\_MLP detects whether the token is ‘x’ and writes a consistent binary feature into the residual stream”\), which closely matches the ground\-truth note \(“Computes per\-position feature indicating whether the token at that position is ‘x’\.”\)\.
##### Summarization\.
Based on the validated hypotheses and the assigned component tags, HyVE defines the underlying task as: “Given a sequence of tokens, the model outputs at each position the proportion of tokens seen so far \(excluding `<BOS>`\) that are equal to ‘x’, producing a running fraction that updates with each new token\.”
## Appendix E Evaluation Details and Human Annotation
### E\.1 Overview
Section [3\.3](https://arxiv.org/html/2606.24026#S3.SS3) defines the main evaluation metrics\. Here, we provide additional details about the human annotation protocol, agreement computation, process\-level metrics, and qualitative rubric examples\. The human evaluation covers two final\-output metrics, role description quality \($Q_{\mathrm{desc}}$\) and derived task accuracy \($Acc_{\mathrm{task}}$\), and two process\-level metrics, validation\-plan soundness \($S_{\mathrm{val}}$\) and hypothesis grounding\.
The human evaluation was conducted in two stages\. We first annotated outputs from the two backbones used in our initial analysis, GPT\-5\.4 and Claude\-Sonnet\-4\.6\. This larger GPT/Claude annotation set is used to report \(i\) the inter\-annotator agreement and \(ii\) the agreement between human annotators and LLM judges in Table [6](https://arxiv.org/html/2606.24026#A5.T6)\. It contains $n=110$ component\-level instances for $Q_{\mathrm{desc}}$ and $S_{\mathrm{val}}$, and $n=62$ task\-level instances for $Acc_{\mathrm{task}}$\.
After extending HyVE to two additional backbones, Gemini\-3\.1\-Pro and Qwen\-3\-Coder\-30B\-A3B\-Instruct, we performed a second annotation pass on a smaller shared subset covering all four backbones\. This cross\-backbone subset contains 10 tasks and 17 components, yielding 68 component\-level instances and 40 task\-level instances\. This subset supports the human\-validation results and process\-level comparisons discussed in the main text\.
### E\.2 Annotation Protocol
For $Q_{\mathrm{desc}}$, $Acc_{\mathrm{task}}$, and $S_{\mathrm{val}}$, the annotation was performed by two human annotators, both CS graduate students with machine\-learning experience\. The annotators were given a standardized annotation README, detailed metric definitions, and representative examples for each score level\. The instructions followed the same rubrics used for the LLM\-judge evaluation\.
Annotators worked independently using a Streamlit\-based interface\. To reduce bias, model identities were hidden and randomized\. The interface displayed model outputs using anonymized labels such as Agent A and Agent B; these labels were only interface labels and did not correspond to fixed backbone names\. In the initial annotation stage, the interface showed outputs from GPT\-5\.4 and Claude\-Sonnet\-4\.6\. In the later cross\-backbone annotation stage, the same blinding and randomization procedure was applied to outputs from all four backbones\.
For each task, annotators first saw the task context, including the ground\-truth task summary, up to five input\-output examples, and the list of localized components with their ground\-truth tags\. For each localized component, annotators then saw the agent’s hypothesis and validation plan as read\-only context and rated validation\-plan soundness \($S_{\mathrm{val}}$\) as Sound, Partial, or Unsound\. A Sound plan directly tests the key mechanistic prediction of the hypothesis; a Partial plan is causally relevant but indirect or incomplete; and an Unsound plan does not meaningfully test the hypothesis\.
Annotators next saw the ground\-truth tag and role note for the component, followed by the agent’s predicted role description\. The predicted tag was shown only as context and was not itself rated\. Annotators rated role description quality \($Q_{\mathrm{desc}}$\) as Correct, Partial, or Wrong\. A Correct description captures the component’s task\-specific role; a Partial description captures the main role but is vague, incomplete, or contains an incorrect mechanistic sub\-claim; and a Wrong description contradicts the reference role or describes a different function\.
Finally, for each task, annotators saw the ground\-truth task summary and each agent’s derived task description\. They rated task accuracy \($Acc_{\mathrm{task}}$\) as Correct or Wrong, indicating whether the derived description recovered the task\-level behavior\. Annotators could optionally provide a short rationale for each rating\. The hidden mapping from anonymized agent labels to the underlying HyVE backbone was recorded automatically for analysis but was not visible during annotation\.
### E\.3 Human Inter\-Annotator Agreement Computation
For ordinal 3\-point metrics \($Q_{\mathrm{desc}}$ and $S_{\mathrm{val}}$\), we report linearly weighted Cohen’s $\kappa$\. Linear weighting is appropriate because adjacent disagreements, such as 1 vs\. 2, are less severe than endpoint disagreements, such as 0 vs\. 2\. For binary $Acc_{\mathrm{task}}$, we report ordinary Cohen’s $\kappa$ without weighting\.
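For readers unfamiliar with the weighted statistic, the following sketch implements Cohen’s $\kappa$ with the standard linear disagreement weights $w_{ij} = |i-j|/(k-1)$ and the unweighted variant\. It is a minimal illustration, not the evaluation script used in this work; the function name and the example ratings are assumptions\.

```python
import numpy as np


def cohen_kappa(rater1, rater2, n_categories, weights=None):
    """Cohen's kappa for integer labels in 0..n_categories-1.

    weights=None gives ordinary kappa; weights="linear" penalizes a
    disagreement between categories i and j by |i - j| / (k - 1).
    Undefined (division by zero) if chance disagreement is zero.
    """
    k = n_categories
    observed = np.zeros((k, k))
    for a, b in zip(rater1, rater2):
        observed[a, b] += 1
    observed /= observed.sum()
    # Chance agreement table from the two raters' marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    idx = np.arange(k)
    if weights == "linear":
        w = np.abs(idx[:, None] - idx[None, :]) / (k - 1)
    else:
        w = (idx[:, None] != idx[None, :]).astype(float)
    return 1.0 - (w * observed).sum() / (w * expected).sum()


# Hypothetical ratings on the 3-point scale (0, 1, 2).
h1 = [2, 2, 1, 0, 2, 1]
h2 = [2, 1, 1, 0, 2, 2]
kappa_linear = cohen_kappa(h1, h2, 3, weights="linear")
kappa_plain = cohen_kappa(h1, h2, 3)
```

On this example both disagreements are between adjacent categories, so the linearly weighted value is higher than the unweighted one\.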
Table [6](https://arxiv.org/html/2606.24026#A5.T6) reports inter\-annotator agreement on the larger GPT/Claude annotation subset\. Agreement is almost perfect for the final\-output metrics, with $\kappa=0.80$ for $Q_{\mathrm{desc}}$ and $\kappa=0.96$ for $Acc_{\mathrm{task}}$\. Agreement is moderate for $S_{\mathrm{val}}$ \($\kappa=0.46$\), reflecting the greater subjectivity of judging whether a proposed causal experiment fully tests a mechanistic hypothesis\. To mitigate this subjectivity, we take the lower score between the two annotators as the ground truth, implementing a stricter evaluation standard for LM agents\. This applies to all human evaluations\.
### E\.4 Human\-LLM Judge Agreement Computation
We employ two LLM judges in our work\. Similar to how we aggregate the annotated labels from the two annotators, we use conservative lower\-score aggregation between the two LLM judges as well\. That is, when we apply the LLM judges, we consider the lower score between them as the judging score for an agent\. This aggregation retains a high score only when both annotators or both LLM judges assign it, reducing the chance of over\-crediting an incomplete or incorrect explanation\. We report the human\-LLM judge agreement in the bottom panel of Table [6](https://arxiv.org/html/2606.24026#A5.T6), where the lower human label is the lower of the two human annotator scores, and the lower judge label is the lower of the two LLM\-judge scores\.
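The conservative lower\-score aggregation amounts to an element\-wise minimum over the two raters\. A minimal sketch, with a hypothetical function name and hypothetical scores:

```python
def lower_score(scores_a, scores_b):
    """Element-wise minimum of two raters' scores for the same instances."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both raters must score the same instances.")
    return [min(a, b) for a, b in zip(scores_a, scores_b)]


# Hypothetical 3-point scores from two raters on four instances.
rater_1 = [2, 1, 2, 0]
rater_2 = [2, 2, 1, 0]
aggregated = lower_score(rater_1, rater_2)
```

An instance keeps the top score only when both raters assign it, which is the property the aggregation is meant to guarantee\.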
| Metric | $n$ | $\kappa$ |
|---|---|---|
| *Human–human agreement* | | |
| $Q_{\mathrm{desc}}$ | 110 | 0\.802 |
| $S_{\mathrm{val}}$ | 110 | 0\.460 |
| $Acc_{\mathrm{task}}$ | 62 | 0\.963 |
| *Lower human vs\. Lower judge agreement* | | |
| $Q_{\mathrm{desc}}$ | 110 | 0\.753 |
| $S_{\mathrm{val}}$ | 110 | 0\.481 |
| $Acc_{\mathrm{task}}$ | 62 | 0\.864 |

Table 6: Human–human inter\-annotator agreement and Human–LLM\-judge agreement on the larger GPT/Claude annotation subset\. The value of $n$ counts model\-output instances; $\kappa$ denotes linearly weighted Cohen’s $\kappa$ for ordinal metrics\.

Table [6](https://arxiv.org/html/2606.24026#A5.T6) shows that the agreement is substantial for $Q_{\mathrm{desc}}$ \($\kappa=0.75$\) and almost perfect for $Acc_{\mathrm{task}}$ \($\kappa=0.86$\), supporting the use of LLM judges for the final\-output metrics\. In contrast, agreement is lower for $S_{\mathrm{val}}$ \($\kappa=0.48$\)\. Together with the lower human\-human agreement for $S_{\mathrm{val}}$, this suggests that validation\-plan soundness is useful as a process\-level diagnostic in qualitative analysis but less reliable as an LLM\-judged headline metric\. We therefore opt not to use it as an official metric for AgenticInterpBench and leave more reliable automatic evaluation of validation\-plan quality to future work\.
### E\.5 Cross\-Backbone Human Validation Subset
We also evaluate a smaller subset covering all four HyVE backbones\. This subset contains 10 tasks and 17 localized components, yielding 68 component\-level instances for $Q_{\mathrm{desc}}$ and $S_{\mathrm{val}}$, and 40 task\-level instances for $Acc_{\mathrm{task}}$\. Table [7](https://arxiv.org/html/2606.24026#A5.T7) reports the corresponding agreement results\. The final\-output metrics show high human–human and human–LLM agreement, while $S_{\mathrm{val}}$ remains lower, supporting our decision to treat it as a process\-level metric\.

| Metric | $n$ | $\kappa$ |
|---|---|---|
| *Human–human agreement* | | |
| $Q_{\mathrm{desc}}$ | 68 | 0\.83 |
| $S_{\mathrm{val}}$ | 68 | 0\.37 |
| $Acc_{\mathrm{task}}$ | 40 | 0\.96 |
| *Lower human vs\. Lower judge agreement* | | |
| $Q_{\mathrm{desc}}$ | 68 | 0\.76 |
| $S_{\mathrm{val}}$ | 68 | 0\.44 |
| $Acc_{\mathrm{task}}$ | 40 | 0\.80 |

Table 7: Human–human inter\-annotator agreement and Human–LLM\-judge agreement on the cross\-backbone human\-validation subset covering all four HyVE backbones\. The value of $n$ counts model\-output instances; $\kappa$ denotes linearly weighted Cohen’s $\kappa$ for ordinal metrics\.

For completeness, Table [8](https://arxiv.org/html/2606.24026#A5.T8) reports pairwise agreement between each human annotator and each LLM judge on the cross\-backbone subset\. Agreement varies across individual judge pairs, especially for $S_{\mathrm{val}}$, but remains higher for the two final\-output metrics\.

| Metric | H1\-GPT | H1\-Gemini | H2\-GPT | H2\-Gemini |
|---|---|---|---|---|
| $Q_{\mathrm{desc}}$ | 0\.73 | 0\.78 | 0\.63 | 0\.74 |
| $S_{\mathrm{val}}$ | 0\.23 | 0\.59 | 0\.43 | 0\.49 |
| $Acc_{\mathrm{task}}$ | 0\.73 | 0\.74 | 0\.73 | 0\.74 |

Table 8: Linear\-weighted Cohen’s $\kappa$ between each LLM judge and each human annotator \(H1, H2\) on the cross\-backbone subset \($n=68$ for $Q_{\mathrm{desc}}$/$S_{\mathrm{val}}$, $n=40$ for $Acc_{\mathrm{task}}$\)\.
### E\.6 Process\-Level Diagnostics
In addition to the final\-output metrics, we analyze two process\-level diagnostics: hypothesis grounding and validation\-plan soundness\. These diagnostics help identify where the agent succeeds or fails inside the observe\-hypothesize\-validate loop\.
#### E\.6\.1 Hypothesis Grounding
We annotate hypothesis grounding on the same 10\-task, 17\-component cross\-backbone subset used for the main\-text human validation\. This annotation was performed by one author of the paper for analysis purposes\. For each HyVE backbone and each component, the annotator was shown the natural\-language observation produced by the agent, the subsequent hypothesis generated from that observation, and the task context, including the task description, input\-output examples, and ground\-truth component roles\. The annotator judged whether the hypothesis was supported by the observation on a 3\-point scale: 0 if the hypothesis contradicted or ignored the observation, 1 if it was partially supported but added unsupported details, and 2 if it was fully supported by the observation\.
Table [9](https://arxiv.org/html/2606.24026#A5.T9) summarizes the grounding scores\. The proprietary backbones produce mostly observation\-grounded hypotheses, while Qwen\-3\-Coder more often adds unsupported task\-specific details beyond its observations\.
| Backbone | Mean | Fully grounded | Partial |
|---|---|---|---|
| GPT\-5\.4 | 1\.94 | 94\.1% | 5\.9% |
| Claude\-Sonnet\-4\.6 | 1\.94 | 94\.1% | 5\.9% |
| Gemini\-3\.1\-Pro | 1\.94 | 94\.1% | 5\.9% |
| Qwen\-3\-Coder\-30B | 1\.41 | 41\.2% | 58\.8% |

Table 9: Human\-evaluated hypothesis\-grounding results on the 10\-task, 17\-component cross\-backbone subset\. The score measures whether the agent’s hypothesis is supported by its own observation \(scale: 0\-2\)\.
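The summary statistics in Table 9 follow directly from per\-component scores\. The sketch below recomputes them from score lists that are inferred from the reported percentages on 17 components \(for example, 94\.1% of 17 corresponds to 16 components\); the lists are a reconstruction for illustration, not the raw annotations, and the function name is an assumption\.

```python
def summarize_grounding(scores):
    """Mean score and percentage of fully (2) and partially (1) grounded hypotheses."""
    n = len(scores)
    return {
        "mean": round(sum(scores) / n, 2),
        "fully_grounded_pct": round(100 * scores.count(2) / n, 1),
        "partial_pct": round(100 * scores.count(1) / n, 1),
    }


# Inferred score lists: 16 full + 1 partial, and 7 full + 10 partial.
mostly_grounded = summarize_grounding([2] * 16 + [1])
often_partial = summarize_grounding([2] * 7 + [1] * 10)
```

Under this reconstruction, the first summary matches the rows reported for GPT\-5\.4, Claude\-Sonnet\-4\.6, and Gemini\-3\.1\-Pro, and the second matches the Qwen\-3\-Coder\-30B row\.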
#### E\.6\.2 Validation\-Plan Soundness
Validation\-plan soundness \($S_{\mathrm{val}}$\) measures whether a proposed validation experiment directly tests the current hypothesis\. A sound plan should specify an intervention whose expected result follows from the hypothesis and whose outcome could meaningfully support or refute it\. We report LLM\-judged $S_{\mathrm{val}}$ in Table [10](https://arxiv.org/html/2606.24026#A5.T10) as a process\-level diagnostic for analyzing validation behavior\. However, as shown in Tables [6](https://arxiv.org/html/2606.24026#A5.T6), [7](https://arxiv.org/html/2606.24026#A5.T7), and [8](https://arxiv.org/html/2606.24026#A5.T8), $S_{\mathrm{val}}$ has lower human–human and human–LLM agreement than the final\-output metrics\. This suggests that validation\-plan soundness is more subjective to evaluate than role descriptions or task descriptions\. For this reason, the main text reports validation\-plan soundness using human annotations on the cross\-backbone subset\. The LLM\-judged scores in Table [10](https://arxiv.org/html/2606.24026#A5.T10) are included only as an additional diagnostic; they show a broadly consistent backbone\-level pattern\.

| Backbone | $S_{\mathrm{val}}$ |
|---|---|
| GPT\-5\.4 | 1\.85 |
| Claude\-Sonnet\-4\.6 | 1\.0 |
| Gemini\-3\.1\-Pro | 1\.13 |
| Qwen\-3\-Coder\-30B | 0\.76 |

Table 10: LLM\-judged validation\-plan soundness for the four HyVE backbones\. Scores are averaged over all validation plans using the conservative lower score assigned by the two LLM judges, GPT\-5\.4 and Gemini\-3\.1\-Pro\.
### E\.7 Qualitative Rubric Examples
#### E\.7\.1 Role Description Quality
We provide examples for the three $Q_{\mathrm{desc}}$ score levels\. As described in Section [3\.3](https://arxiv.org/html/2606.24026#S3.SS3), $Q_{\mathrm{desc}}$ evaluates whether the predicted role description matches the reference component role\. We use a 3\-point scale when prompting the LLM judge and later rescale it to $[0,1]$ for ease of reporting\. A score of 2 indicates a correct task\-specific description, a score of 1 indicates a partially correct description with missing or incorrect mechanistic details, and a score of 0 indicates a wrong or contradictory description\.
##### Correct example \(score 2\): L0\_MLP in frac\_prevs
- Reference tag: INDICATOR\. Reference note: Computes per\-position feature indicating whether the token at that position is ‘x’ or not\.
- Agent’s description: At each token position, L0\_MLP detects whether the token is ‘x’ and writes a consistent binary feature vector into the residual stream encoding ‘is\_x’ \(positive\) or ‘not\-x’ \(near\-zero/negative\)\.
- Rationale: The description accurately captures that L0\_MLP computes a binary indicator of whether each token is ‘x’ and writes this to the residual stream\.
##### Partial example \(score 1\): L1H2 of frac\_prevs
- Reference tag: AGGREGATOR\. Reference note: Aggregates prefix fraction by attending over previous positions\.
- Agent’s description: At each position $t$, L1H2 reads the ‘is\_x’ feature vectors from all prior positions, attends preferentially to ‘x’ positions, and collapses them into a single output vector encoding the running fraction of ‘x’ tokens seen so far \($\mathrm{count\_x}/t$\)\.
- Rationale: The description correctly identifies L1H2 as aggregating information from previous positions to compute a running fraction of ‘x’ tokens\. However, it incorrectly says the head attends preferentially to ‘x’ positions\. The ground\-truth attention pattern is uniform over previous positions, not preferential toward ‘x’ positions, and the fraction is computed by averaging the upstream ‘is\_x’ values\.
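The distinction drawn in this rationale can be made concrete: uniform attention over the prefix, applied to the upstream ‘is\_x’ values, already yields the running fraction, so no preference for ‘x’ positions is needed\. The following sketch illustrates this with NumPy; it is a simplified illustration that ignores `<BOS>` and the actual model weights, and the function name is an assumption\.

```python
import numpy as np


def uniform_prefix_attention(n):
    """Causal attention matrix where position i attends equally to positions 0..i."""
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)


tokens = ["x", "a", "x", "x", "b"]
# Upstream indicator feature, as the Indicator component would provide.
is_x = np.array([float(t == "x") for t in tokens])

attn = uniform_prefix_attention(len(tokens))
# Averaging the indicator under uniform attention gives the running fraction.
running_fraction = attn @ is_x
```

Each row of `attn` sums to one and weights ‘x’ and non\-‘x’ positions identically, yet the output equals the running fraction of ‘x’ tokens\.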
##### Wrong example \(score 0\): L1H0 from an “extract\-unique tokens” task
- Reference tag: AGGREGATOR\. Reference note: Aggregates matching positions, defined by same token and earlier\-or\-equal index, into a per\-position count of how many times each token has appeared up to and including the current position\.
- Agent’s description: L1H0 mainly routes residual content by preserving local state through self\-attention on ‘c’ positions and otherwise sometimes pulling a weak, largely non\-essential generic contextual write from a recent ‘c’\-associated position\.
- Rationale: The description focuses on preserving local state and attending to ‘c’ positions, which does not match the ground\-truth role of aggregating same\-token prefix positions to compute occurrence counts\.
#### E\.7\.2 Hypothesis Grounding
Hypothesis grounding evaluates whether HyVE’s hypothesis follows from its own observation\. This score is separate from role\-description correctness\. A hypothesis can be grounded in the observation but still be wrong with respect to the reference role, or correct in outcome but unsupported by the evidence the agent cites\.
Fully grounded example \(score 2\)\. In the frac\_prevs task, HyVE observes that the L2 norm of L0\_MLP outputs varies between ‘x’ and non\-‘x’ tokens\. It then hypothesizes that L0\_MLP is a binary ‘is\_x’ feature detector\.
This receives a score of 2 because the hypothesis is directly supported by the observation\. The observed activation difference is exactly the kind of evidence expected from a token\-property indicator\.
Partially grounded example \(score 1\)\. In the spam\-keyword detection task, HyVE observes that L0\_MLP has high, stable activation norms across positions dominated by a small set of neurons\. It then hypothesizes that the component detects position\-specific spam patterns by aggregating features from previous tokens\.
This receives a score of 1\. Although the observation supports the broad claim that L0\_MLP is important and neuron\-mediated, it does not support the more specific claims about position\-specific behavior or aggregation over previous tokens\.
Ungrounded example \(score 0\)\. In the sequence\-length multiplication task, HyVE observes that L0\_MLP has high activation norms with position\-dependent top neurons\. It then hypothesizes that L0\_MLP applies a non\-linear transformation to each token\.
This receives a score of 0\. The hypothesis does not follow from the observation: position\-dependent activation strength does not provide evidence for a token\-wise non\-linear transformation\.
#### E\.7\.3 Validation\-Plan Soundness
$S_{\mathrm{val}}$ evaluates whether the agent’s proposed validation experiment directly tests its stated hypothesis\. The score does not judge whether the hypothesis itself is correct, nor whether the generated code eventually executes successfully\. Instead, it asks whether the proposed causal experiment would meaningfully support or refute the specific mechanistic claim made in the hypothesis\.
We use a 3\-point scale:
- Sound \(2\): The plan directly targets the key prediction in the hypothesis\. The intervention cleanly distinguishes the hypothesis from nearby alternatives\.
- Partial \(1\): The plan is causally motivated and relevant, but it tests the hypothesis only indirectly, bundles multiple subclaims together, or leaves important alternatives unresolved\.
- Unsound \(0\): The plan does not test the stated hypothesis\. For example, it may test a different claim, rely only on non\-causal evidence, or propose an expected result that would actually refute the hypothesis\.

Partial example \(score 1\): L1H2 in frac\_prevs\.
- Task: The model computes the running fraction of x tokens seen so far\.
- Component: L1H2, whose reference role is to aggregate prefix information by attending over previous positions\.
- Agent hypothesis: L1H2 is a running\-fraction aggregator\. It reads upstream is\_x features from L0\_MLP, attends to prior positions, and writes an output vector encoding the running fraction of x tokens\.
- Validation plan: Run the model on sequences with different running fractions, mean\-ablate L1H2, and measure how the clean\-minus\-ablated output changes with the running fraction\.
The plan is relevant because it performs a causal intervention on L1H2\. If ablating this head systematically disrupts the running\-fraction output, that would provide evidence that the head contributes to the task\. Thus, the plan tests the general necessity of L1H2 for the running\-fraction computation\.
However, the plan is incomplete because the hypothesis makes more specific mechanistic claims than simple necessity\. It claims that L1H2 reads is\_x features from previous positions and encodes the running fraction\. Mean ablation alone does not distinguish whether the head uniformly aggregates previous positions, attends preferentially to x positions, or contributes through another nearby aggregation strategy\. We therefore rate this plan as Partial: it tests the right general mechanism, but it does not cleanly isolate the key prediction in the hypothesis\.
## Appendix F API Cost
We estimate the API cost of running HyVE on the full 84\-case AgenticInterpBench benchmark\. We count tokens using each provider’s native tokenizer API \(Claude, Gemini\) and tiktoken for GPT\-5\.4 and Qwen\. Claude\-Sonnet\-4\.6 incurs the largest estimated API cost, at $147\.55 total \($1\.76 per task\), followed by GPT\-5\.4 at $77\.47 total \($0\.92 per task\) and Gemini\-3\.1\-Pro at $33\.81 total \($0\.40 per task\)\. Qwen\-3\-Coder is self\-hosted, so we report $0 marginal API cost\. The full run required approximately 10 GPU\-hours, which we exclude from the dollar\-cost estimate because GPU cost depends on the hardware and pricing environment\.
The cost differences highlight the cost\-performance tradeoff across backbones\. Claude produces the highest $Acc_{\mathrm{tag}}$, $Q_{\mathrm{desc}}$, and $S_{\mathrm{exec}}$, but is also the most expensive\. Gemini achieves the best $Acc_{\mathrm{task}}$ while being roughly $4\times$ cheaper than Claude, and GPT\-5\.4 falls between them with the highest $S_{\mathrm{val}}$\. We report the token statistics and API cost in Table [11](https://arxiv.org/html/2606.24026#A6.T11)\.

| Backbone | Input tokens | Output tokens | Total cost | Mean cost/case |
|---|---|---|---|---|
| Claude\-Sonnet\-4\.6 | 34\.49M | 2\.94M | $147\.55 | $1\.76 |
| GPT\-5\.4 | 16\.73M | 2\.38M | $77\.47 | $0\.92 |
| Gemini\-3\.1\-Pro | 12\.77M | 0\.69M | $33\.81 | $0\.40 |
| Qwen\-3\-Coder\-30B\-A3B | 18\.64M | 1\.31M | $0\.00 | – |

Table 11: Estimated token usage and API cost for running HyVE on the full 84\-case AgenticInterpBench benchmark\. Qwen\-3\-Coder\-30B\-A3B is self\-hosted, so we report zero marginal API cost and exclude GPU\-hour costs\.
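The per\-case figures and the cost ratio follow from the reported totals and the 84 benchmark cases\. The short check below uses only numbers from Table 11; the variable names are illustrative\.

```python
N_CASES = 84

# Total estimated API cost in USD, as reported in Table 11.
total_cost_usd = {
    "Claude-Sonnet-4.6": 147.55,
    "GPT-5.4": 77.47,
    "Gemini-3.1-Pro": 33.81,
}

# Mean cost per benchmark case, rounded to cents.
per_case = {name: round(cost / N_CASES, 2) for name, cost in total_cost_usd.items()}

# How many times more expensive Claude is than Gemini.
claude_vs_gemini = total_cost_usd["Claude-Sonnet-4.6"] / total_cost_usd["Gemini-3.1-Pro"]
```

The ratio comes out slightly above 4, consistent with describing Gemini as roughly $4\times$ cheaper than Claude\.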
## Appendix G Real Circuit Reference Annotation

| Comp\. | Reference Role | Observation | Hypothesis loop |
|---|---|---|---|
| L15H13 | Primary $B$\-transfer head\. Reads $B$ at the final query position, writes $B$\-dependent information into the residual stream\. | Final token attention concentrates strongly on operand $B$, near\-zero mass elsewhere\. | Hypothesis \(Iteration 1 ✓\): L15H13 routes $B$’s identity to the final token\. Ablation should cause a significant accuracy drop, with errors clustering near $A+C$\. Validation: Supported\. Ablation yields a large accuracy drop, activation patching restores it\. |
| L16H1 | Tertiary $C$\-transfer head\. Invisible in the full model but load\-bearing once the stronger $C$ routes \(L15H3, L15H31\) are suppressed\. | Final\-token attention to operand $C$, with secondary attention to `<BOS>` and smaller weights on $B$ and $A$\. | Hypothesis \(Iteration 1 ✗\): L16H1 is a *primary* $C$\-router\. Ablation should drop accuracy significantly\. Validation: Refuted\. Ablation yields no accuracy drop\. Hypothesis \(Iteration 2 ✓\): L16H1 is a *backup* $C$\-router\. Invisible under solo ablation, active when L15H3 and L15H31 are removed\. Validation: Supported\. Ablation together with \(L15H3, L15H31\) causes a significant further accuracy drop\. |

Table 12: Per\-component exploration trace for two transfer heads of the AF1 circuit on Llama\-3\-8B\. HyVE converges immediately on L15H13’s role as the primary $B$\-transfer head, but requires a refuted iteration before re\-hypothesizing L16H1 as a backup $C$\-router\.

### G\.1 Setup
##### Task and model\.
We use the three\-operand addition prompt template “$A+B+C=\quad$” with $A, B, C \in \{0, 1, \dots, 100\}$ and the answer lying in the range $\{0, 999\}$, evaluated on Llama\-3\-8B\. Our case study builds on the All\-for\-One \(AF1\) circuit, which identifies a sparse subgraph sufficient for this arithmetic behavior\.
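A prompt set of this form can be generated with a few lines of Python\. This is a hypothetical sketch: the function name, the seed, and the exact string format \(in particular, any whitespace after “=”\) are assumptions rather than the setup used in the case study\.

```python
import random


def make_addition_prompts(n, seed=0, lo=0, hi=100):
    """Sample n three-operand addition prompts with operands in [lo, hi]."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a, b, c = (rng.randint(lo, hi) for _ in range(3))
        # Assumed surface form of the template; the true spacing may differ.
        examples.append({"prompt": f"{a}+{b}+{c}=", "answer": a + b + c})
    return examples


examples = make_addition_prompts(99)
```

In the case study, such prompts would additionally be filtered to those the model answers correctly before any intervention is run\.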
##### Localized Circuit\.
Starting from the AF1 subgraph, we construct a 10\-component localized circuit for explanation\. The circuit contains five transfer attention heads in layers 15 and 16 \(L15H3, L15H13, L15H31, L16H1, L16H21\), three late MLPs \(L20\_MLP, L29\_MLP, L31\_MLP\), and two late attention heads with strong logit\-lens signal \(L26H3, L28H18\)\. We deliberately retain some causally redundant components \(L29\_MLP, L31\_MLP, L26H3, L28H18\) to test whether HyVE can distinguish mechanistic evidence from suggestive but non\-causal signals\.
##### Reference annotation\.
AF1 establishes the high\-level arithmetic circuit, but it does not provide the component\-level roles needed for our evaluation\. We therefore construct manual reference annotations for the 10 localized components\. We start from the AF1 subgraph and run targeted interventions on 99 prompts of the form “$A+B+C=\quad$”, restricted to examples the model answers correctly\. The raw model has baseline accuracy $1.00$ on this set\.
For attention heads, we inspect final\-token attention patterns, edge ablations, last\-query head ablations, and corrupt\-operand activation patching\. For MLPs, we use zero/mean/CAMA\-style ablations, corrupt\-operand patching, iterative pruning, and logit\-lens projections\. These experiments distinguish primary operand\-transfer heads, backup transfer heads, a late MLP that is necessary for accuracy but does not directly write in the answer direction, and components with strong logit\-lens signal but weak causal effect\. Tables [13](https://arxiv.org/html/2606.24026#A7.T13) and [14](https://arxiv.org/html/2606.24026#A7.T14) summarize the resulting reference roles\.
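The distinction between primary and backup transfer heads rests on comparing the accuracy drop when a head is ablated alone with the drop when it is ablated after stronger routes are suppressed\. The sketch below applies this comparison to the accuracies reported in Table 13 \(and the “no accuracy drop” solo result for L16H1 in Table 12\)\. The 0\.1 threshold and the function name are illustrative assumptions; the reference annotations were made manually, not by this rule\.

```python
def classify_head(solo_drop, conditional_drop, threshold=0.1):
    """Label a head from its solo and conditional ablation accuracy drops.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if solo_drop >= threshold:
        return "primary"
    if conditional_drop is not None and conditional_drop >= threshold:
        return "backup"
    return "redundant"


# (solo drop, drop after suppressing stronger routes), from the reported accuracies.
heads = {
    "L16H21": (1.000 - 0.192, None),
    "L15H13": (1.000 - 0.374, None),
    "L15H3": (1.000 - 0.747, None),
    "L15H31": (1.000 - 0.980, 0.747 - 0.414),  # conditional on L15H3 suppressed
    "L16H1": (0.0, 0.414 - 0.121),             # conditional on L15H3 and L15H31 suppressed
}

labels = {name: classify_head(solo, cond) for name, (solo, cond) in heads.items()}
```

Under this simple rule, L16H21, L15H13, and L15H3 come out as primary and L15H31 and L16H1 as backup, in line with the reference roles; the rule does not separate a backup from a tertiary backup, which requires the ordering of suppressions described in the tables\.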
| Component | Reference role | Evidence used for annotation |
|---|---|---|
| L16H21 | Primary $A$\-transfer head\. This head carries information about the first operand to the final token\. | At the final token, L16H21 almost always attends to operand $A$\. Mean attention on $A$ is $0.979$, and $A$ is the top key in all $99$ prompts\. This attention is causally important because zeroing this head’s final\-token output reduces accuracy from $1.000$ to $0.192$\. Removing only the $A$ edge reduces accuracy from $1.000$ to $0.101$, while removing the $B$ or $C$ edge has no effect\. In a corrupt\-$A$ patch, the model switches to the corrupt answer on $98.0\%$ of prompts, and the corrupt\-vs\-clean answer margin moves from $-4.750$ to $+4.156$\. |
| L15H13 | Primary $B$\-transfer head\. This head moves information about operand $B$ to the final token\. | At the final token, L15H13 puts most of its attention on $B$ \(mean mass $0.878$, top key in $98/99$ prompts\)\. The $B$ edge was also found to be causally relevant\. Zeroing the head’s final\-token output drops accuracy from $1.000$ to $0.374$, and removing only the $B$ edge drops it to $0.465$\. But removing the $A$ or $C$ operand edges has no effect\. Corrupt\-$B$ patching changes the corrupt\-answer rate from $0.000$ to $0.475$ and gives a corrupt\-answer logit gain of $+2.953$\. |
| L15H3 | Primary $C$\-transfer head\. This head moves information about operand $C$ to the final token, but its effect is weaker than the primary $A$ and $B$ movers because $C$ has backup transfer routes\. | At the final token, L15H3 puts most of its attention on $C$ \(mean mass $0.843$, top key in $99/99$ prompts\)\. Zeroing the head’s final\-token output drops accuracy from $1.000$ to $0.747$\. The $C$ edge is the causal one, as removing only the $C$ edge drops accuracy to $0.879$, while removing `<BOS>`, $A$, $B$, or = has no effect\. Corrupt\-$C$ patching gives a corrupt\-answer logit gain of $+0.734$, but only changes the corrupt\-answer rate from $0.000$ to $0.020$\. In iterative pruning over L15\-L16 heads, L15H3 is the final survivor; removing it takes accuracy from $0.030$ to $0.000$\. |
| L15H31 | Backup $C$\-transfer head\. This head carries $C$\-related information, but it is mostly redundant while L15H3 is active\. Suppressing L15H3 exposes L15H31 as a load\-bearing backup $C$ route\. | At the final token, L15H31 attends mostly to $C$, with some attention to $B$ \(mean mass $0.505$ on $C$ and $0.340$ on $B$; top key $C$ in $90/99$ prompts\)\. In the full model, zeroing its final\-token output only drops accuracy from $1.000$ to $0.980$\. After suppressing L15H3, zeroing L15H31 drops accuracy from $0.747$ to $0.414$, and removing only its $C$ edge gives the same accuracy\. This shows that L15H31’s $C$ edge becomes important when L15H3 is absent\. Corrupt\-$C$ patching in this setting gives a corrupt\-answer logit gain of $+0.762$, compared with only $+0.221$ in the full model\. |
| L16H1 | Tertiary $C$\-transfer backup head\. The head carries $C$\-related signal, but it becomes cleanly load\-bearing only after the stronger $C$\-transfer routes L15H3 and L15H31 are both suppressed\. | At the final token, L16H1 attends mostly to $C$, with substantial attention to `<BOS>` \(mean mass $0.516$ on $C$ and $0.219$ on `<BOS>`\)\. With only L15H3 suppressed, removing the $C$ edge already hurts accuracy, from $0.747$ to $0.515$\. After suppressing both L15H3 and L15H31, zeroing L16H1 drops accuracy from $0.414$ to $0.121$, and removing only its $C$ edge gives the same accuracy\. In this double\-suppressed setting, corrupt\-$C$ patching gives a corrupt\-answer logit gain of $+0.875$\. |

Table 13: Manual reference annotations for the AF1 transfer heads\. The reference roles distinguish primary operand\-transfer heads from backup $C$\-transfer heads\.

| Component | Reference role | Evidence used for annotation |
|---|---|---|
| L20\_MLP | Latent arithmetic feature builder\. L20 is useful for the arithmetic task, but its output does not look like a direct answer vector or a clean intermediate such as $A+B$, $A+C$, or $B+C$\. | Zeroing L20\_MLP drops accuracy from $1.000$ to $0.737$\. CAMA\-style ablation gives a smaller but nonzero drop of $0.192$\. Corrupt\-operand patching gives similar corrupt\-answer flip rates for $A$, $B$, and $C$ \($0.253$, $0.242$, $0.273$\), with corrupt\-answer logit gains of $+2.03$, $+2.20$, and $+2.31$, suggesting that L20 does not strongly prefer one operand over the others\. Also, the candidate\-target logit lens is weak\. The answer top\-5 rate is only $0.051$, and pair\-sum targets remain near zero\. Directionally, the output is only weakly aligned with the answer direction \($\cos=0.036$\) and is not an amplification of the pre\-MLP residual \($\cos=-0.188$\)\. We therefore annotate L20 as a latent feature builder rather than an explicit answer writer\. |
| L29\_MLP | Answer\-related but redundant MLP\. L29’s output points toward the correct answer in projection tests, but removing it does not hurt the model on this task\. | A candidate\-target logit lens on L29\_MLP recovers the answer at top\-5 rate $0.394$, with mean answer logit $+8.35$, while pair\-sum and operand targets stay near zero\. Direction decomposition gives DLA\(answer\) $=+8.35$ compared with DLA\(random\) $=+0.97$, so the output is answer\-related\. However, direct zero\-ablation leaves accuracy unchanged \($1.000$ to $1.000$\), and CAMA\-style ablation also gives zero drop\. Corrupt\-operand patching gives nonzero corrupt\-answer logit gains around $+1.3$, but never flips the prediction\. We therefore annotate L29 as answer\-related but causally redundant in the full circuit\. |
| L31\_MLP | Strong answer projection but causally redundant MLP\. L31 has the strongest answer signal under logit\-lens\-style projection, but the signal is broad rather than answer\-specific, and the component is not necessary in isolation\. | This MLP has the strongest single\-MLP answer lens signal, with answer top\-5 rate $0.566$ and mean answer logit $+15.67$\. Direction decomposition also gives large DLA\(answer\) $=+15.67$ compared with DLA\(random\) $=-0.16$\. However, the projection is not specific to the final answer: pair sums and individual operands also receive high mean logits \($A+B$: $14.57$, $B+C$: $14.54$, $A+C$: $14.48$, $A$: $+15.07$, $B$: $+14.96$, $C$: $+14.98$\)\. Isolated zero\-ablation barely changes accuracy \($1.000 \rightarrow 0.990$\), CAMA\-style ablation gives zero drop, and corrupt\-operand patching produces zero corrupt\-answer flips\. We therefore annotate L31 as lens\-positive but causally redundant in the full circuit\. |
| L26H3 | Lens\-positive `<BOS>`\-sink head\. This late attention head has answer\-related projection under logit lens, but its attention is concentrated on `<BOS>` rather than the operand tokens\. | Per\-head logit lens ranks L26H3 second among late attention heads, with top\-1 rate $0.162$ and top\-3 rate $0.394$\. Its final\-token attention is dominated by `<BOS>`\. `<BOS>` mass is $0.587$, while total operand mass is only $0.027$ \($21.7\times$ smaller\)\. In targeted pruning over the top late\-attention lens heads, removing L26H3 leaves accuracy at $1.000$\. We therefore annotate it as a lens\-positive `<BOS>`\-sink head rather than a causally necessary arithmetic component\. |
| L28H18 | Lens\-positive `<BOS>` / anchor head\. This is the strongest late\-attention logit\-lens head, but its attention is not primarily on operands and it is causally redundant in isolation\. | Per\-head logit lens ranks L28H18 first among late attention heads, with top\-1 rate $0.222$ and top\-3 rate $0.455$\.
Its final\-token attention is \\verb<BOS\>\|\-leaning:`<BOS\>`mass is0\.4450\.445, while total operand mass is0\.1020\.102\(4\.4×4\.4\\timessmaller\)\. The remaining non\-`<BOS\>`mass is concentrated more on positional anchors such as=than on operands\. In targeted pruning over the top late\-attention lens heads, removing L28H18 leaves accuracy at1\.0001\.000\. We therefore annotate it as lens\-positive but causally redundant\.Table 14:Manual reference annotations for the late\-layer AF1 components\. These components distinguish causal task\-relevant computation from answer\-correlated projection evidence\. L20\_MLP is causally relevant and carries operand\-dependent signal, but does not directly write the final answer or a clean pair\-sum representation\. In contrast, L29\_MLP, L31\_MLP, L26H3, and L28H18 show answer\-related projection signals but have little or no isolated causal effect in the full circuit\.
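The distinction drawn in Table 14, between components that matter causally and components that merely project onto the answer, can be illustrated with a small sketch. The reference annotations in this appendix are manual, so the code below is not the annotation procedure: the two cutoffs, the label strings, and the `ComponentEvidence` and `label` names are hypothetical choices made only for illustration. The input numbers are the answer top-5 lens rates and zero-ablation accuracies reported for the three MLP rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentEvidence:
    name: str
    lens_top5_rate: float           # answer top-5 rate under the logit lens
    acc_after_zero_ablation: float  # task accuracy after zeroing the component


# Hypothetical cutoffs, chosen for illustration only (not from the paper).
LENS_POSITIVE_MIN = 0.30
CAUSAL_DROP_MIN = 0.05


def label(ev: ComponentEvidence, clean_acc: float = 1.0) -> str:
    """Combine projection evidence and causal evidence into one coarse label."""
    drop = clean_acc - ev.acc_after_zero_ablation
    lens_positive = ev.lens_top5_rate >= LENS_POSITIVE_MIN
    causal = drop >= CAUSAL_DROP_MIN
    if causal and lens_positive:
        return "causal answer writer"
    if causal:
        return "causal, lens-weak"
    if lens_positive:
        return "lens-positive, causally redundant"
    return "no clear role"


# Values copied from the MLP rows of Table 14.
ROWS = [
    ComponentEvidence("L20_MLP", 0.051, 0.737),
    ComponentEvidence("L29_MLP", 0.394, 1.000),
    ComponentEvidence("L31_MLP", 0.566, 0.990),
]
LABELS = {ev.name: label(ev) for ev in ROWS}

if __name__ == "__main__":
    for name, component_label in LABELS.items():
        print(f"{name}: {component_label}")
```

Under these assumed cutoffs, L20_MLP falls on the causal side while L29_MLP and L31_MLP fall on the lens-positive but redundant side, which matches the reference roles. A single threshold pair cannot capture the backup behavior seen in Table 13, where a head's causal effect only appears after stronger routes are suppressed.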
## Appendix H Artifact Use, Licensing, and Data Content
This work uses existing research artifacts and software libraries for MI evaluation. AgenticInterpBench is built on InterpBench and Tracr-derived models, which we use as controlled research testbeds with known circuit structure. We use TransformerLens and LangGraph as implementation libraries for model inspection and agent orchestration. We also use Llama-3-8B and Qwen-3-Coder-30B-A3B-Instruct only for research evaluation and do not redistribute third-party model weights. Our released artifacts, including benchmark annotations, prompt templates, and HyVE code, are intended for research use in circuit-explanation evaluation and should not be interpreted as deployment-ready guarantees of model safety. We will release our own code and annotations under the MIT License. Our data are based on synthetic algorithmic tasks derived from InterpBench/RASP programs, consisting of task tokens and program outputs rather than human-authored or user-provided text. We manually checked the task vocabularies, input-output examples, prompt templates, annotations, and agent outputs for human names, uniquely identifying information, and offensive content. The Llama-3-8B case study uses arithmetic prompts only. We found no PII or offensive content requiring anonymization.