When Attribution Patching Lies: Diagnosis and a Second-Order Correction

arXiv cs.LG 06/10/26, 04:00 AM Papers
Summary
This paper diagnoses systematic errors in attribution patching, a gradient-based approximation used for causal localization in language models, and proposes a second-order correction using Hessian-vector products that improves reliability with minimal additional computational cost.
arXiv:2606.09899v1 Announce Type: new
Source: [https://arxiv.org/html/2606.09899](https://arxiv.org/html/2606.09899)
## When Attribution Patching Lies: Diagnosis and a Second\-Order Correction

Luyang Zhang¹ & Jialu Wang²; ¹Carnegie Mellon University, ²University of California, Santa Cruz; luyangz@cmu\.edu, faldict@ucsc\.edu

###### Abstract

A central goal of mechanistic interpretability is to identify which internal components causally drive a language model’s behavior\. Because these importance estimates serve as the evidence for identifying circuits, systematic errors can lead to the misidentification of the underlying mechanisms\. While activation patching provides a gold\-standard causal metric, its computational cost is prohibitive at scale\. Practitioners instead rely on attribution patching, a gradient\-based, first\-order approximation whose reliability remains poorly understood\. In this work, we characterize the source of this unreliability, demonstrating that the dominant error stems from the non\-linearities in the downstream network rather than local curvature at the patched component\. This insight yields three practical tools: \(i\) a reliability score to detect untrustworthy estimates, \(ii\) error bounds quantifying potential attribution mis\-specifications, and \(iii\) a Hessian\-vector\-product \(HVP\) correction that eliminates the leading\-order error with only one additional backward pass\. In evaluations across five model families \(124M–9B parameters\) and both random\-token and naturalistic \(name\-swap\) perturbations, HVP is the only second\-order correction feasible at larger scale, where standard baselines like Integrated Gradients become computationally prohibitive\. In comparative experiments, a multi\-step HVP variant matches or exceeds the accuracy of Integrated Gradients at significantly lower compute, outperforming prior second\-order baselines\. These improvements lead to higher\-fidelity circuit recovery on standard benchmarks and support a *Screen\-Flag\-Fix* workflow that targets computational effort only toward the components flagged as unreliable\.

## 1 Introduction

As language models grow in scale and capability, understanding their internal mechanisms becomes increasingly important\. Mechanistic interpretability seeks to provide this understanding by explaining model behavior in terms of internal computations\. A central step in that agenda is causal localization: identifying which attention heads, neurons, or features causally drive a given behavior\. These localization scores are often used to support circuit claims \[[7](https://arxiv.org/html/2606.09899#bib.bib11),[24](https://arxiv.org/html/2606.09899#bib.bib12),[1](https://arxiv.org/html/2606.09899#bib.bib13)\], so systematic error can lead to incorrect mechanistic conclusions\. The gold\-standard causal test is *activation patching* \[[10](https://arxiv.org/html/2606.09899#bib.bib35),[39](https://arxiv.org/html/2606.09899#bib.bib36)\], which replaces a component’s activation under a corrupted input with the clean\-input value and measures the output change\. However, its cost scales linearly with the number of components, quickly becoming prohibitive in modern Large Language Models \(LLMs\)\. In practice, broad localization therefore relies on cheaper approximations, with direct interventions reserved for a short list of components\.

A commonly used approximation is *attribution patching* \[[28](https://arxiv.org/html/2606.09899#bib.bib10)\], which replaces many explicit interventions with a single backward pass by linearizing the effect around a corrupted activation\. However, this linearization can be unreliable when downstream nonlinearities are strong\. Prior work has identified concrete failure modes and proposed partial remedies \[[23](https://arxiv.org/html/2606.09899#bib.bib1),[9](https://arxiv.org/html/2606.09899#bib.bib2),[18](https://arxiv.org/html/2606.09899#bib.bib3)\], yet a fundamental question remains: for a given component, when is attribution patching reliable, how large can its error be, and how should it be corrected?

![Refer to caption](https://arxiv.org/html/2606.09899v1/x1.png)

Figure 1: Screen–Flag–Fix pipeline for reliable attribution patching\. \(a\) Attribution patching screens all heads cheaply; \(b\) a reliability score flags suspect estimates; \(c\) HVP corrects only the flagged heads, recovering the true ranking\.

We address this gap by analyzing the structure of attribution patching error\. The key finding is that the dominant error stems from the network’s response to the intervention, rather than from the local nonlinearity at the patched component\. A second\-order Taylor expansion makes this precise: the error splits into a leading quadratic term, computable from a single Hessian\-vector product \(HVP\), plus a cubic remainder\. This decomposition yields three key results: \(1\) a reliability score for flagging unreliable estimates; \(2\) error bounds with a provable $1/K^{2}$ convergence rate, where $K$ is the number of sub\-steps a correction is split into; and \(3\) an explanation for the insufficiency of prior fixes \[[23](https://arxiv.org/html/2606.09899#bib.bib1),[9](https://arxiv.org/html/2606.09899#bib.bib2)\]\. Specifically, while prior methods use local curvature, we show that the true error depends on the downstream network response, a quantity that differs from local metrics by 22–66$\times$ with a near\-zero correlation\.

Building on this analysis, we propose a *Screen–Flag–Fix* pipeline \(Figure [1](https://arxiv.org/html/2606.09899#S1.F1)\): screen all components with attribution patching, flag unreliable ones via the reliability score, and correct only those with an HVP\-based second\-order fix\. For large perturbations where the single\-step expansion overshoots, a multi\-step variant \(MS\-HVP\) splits the correction into $K$ sub\-steps, each evaluated locally\. We evaluate across five model families \(124M–9B parameters\), comparing against Integrated Gradients \(IG\) \[[35](https://arxiv.org/html/2606.09899#bib.bib23)\], Integrated Hessians \(IH\) \[[19](https://arxiv.org/html/2606.09899#bib.bib24)\], and GIM \[[9](https://arxiv.org/html/2606.09899#bib.bib2)\]\. *At larger scale* \(8B\+ parameters\), IG requires $\sim$25 GPU\-days per task; HVP is the only second\-order correction demonstrated at this scale, reducing error by up to 82% \(MS\-HVP with $K=5$ on Llama\-3\.1\-8B\)\. *At smaller scales*, MS\-HVP matches or exceeds IG: on the hardest setting, MS\-HVP at cost 10 achieves better accuracy than IG at cost 35 \($3.5\times$ cheaper, $p=0.022$\), and at matched cost MS\-HVP wins by 1\.2 pp \($p<0.001$\)\. HVP also outperforms IH by 1\.5%–13\.9% across all nine tasks at smaller scales\. *For circuit recovery*, these per\-component gains translate to improved head recovery on IOI and Greater\-Than benchmarks\.

Our contributions are three\-fold:

- A reframe of attribution patching reliability\. We trace the dominant error to the downstream network response, not local nonlinearity, explaining why prior local fixes are structurally incomplete\.
- A scalable second\-order correction\. HVP \(and its multi\-step variant MS\-HVP\) remains tractable at 8B\+ parameters where existing refinement baselines become infeasible, making it the first second\-order correction demonstrated at this scale\.
- A design space for attribution\-patching refinement\. We organize prior methods by their goals, including estimation accuracy, feature interactions, and circuit faithfulness, clarifying which tool fits which use case\.

## 2 Related work

#### Patching methods for mechanistic localization\.

Activation patching \[[10](https://arxiv.org/html/2606.09899#bib.bib35),[39](https://arxiv.org/html/2606.09899#bib.bib36)\] provides the causal reference intervention for mechanistic localization by measuring the effect of intervening on hidden activations\. Because its cost scales linearly with the number of candidate components, attribution patching \[[28](https://arxiv.org/html/2606.09899#bib.bib10)\] replaces many interventions with a first\-order approximation and has become a standard localization primitive in automated circuit discovery \[[7](https://arxiv.org/html/2606.09899#bib.bib11),[36](https://arxiv.org/html/2606.09899#bib.bib5)\], sparse feature circuits \[[24](https://arxiv.org/html/2606.09899#bib.bib12)\], and large\-scale circuit tracing \[[1](https://arxiv.org/html/2606.09899#bib.bib13)\]\. Practical guides such as Heimersheim and Nanda \[[17](https://arxiv.org/html/2606.09899#bib.bib6)\] and methodological studies of activation\-patching metrics and corruption choices \[[41](https://arxiv.org/html/2606.09899#bib.bib7)\] note that patching often behaves approximately linearly while remaining sensitive to nonlinearities and design decisions, making attribution\-patching reliability central to patching\-based workflows\. Recent large\-scale circuit tracing \[[1](https://arxiv.org/html/2606.09899#bib.bib13),[34](https://arxiv.org/html/2606.09899#bib.bib14)\] applies attribution to SAE features \[[24](https://arxiv.org/html/2606.09899#bib.bib12)\] rather than raw heads or neurons; HVP is directly applicable to any differentiable activation, though the SAE encoder’s nonlinearity \(ReLU or top\-$k$\) introduces an additional curvature source whose magnitude we leave to future work\.

#### Attribution\-patching reliability and circuit validation\.

Several failure modes of attribution patching are known: Kramár et al\. \[[23](https://arxiv.org/html/2606.09899#bib.bib1)\] identify activation\-region mismatch and cancellation; Edin et al\. \[[9](https://arxiv.org/html/2606.09899#bib.bib2)\] show that softmax redistribution systematically biases gradient\-based localization; Méloux et al\. \[[25](https://arxiv.org/html/2606.09899#bib.bib4)\] document instability under prompt and hyperparameter variation; Sharkey et al\. \[[32](https://arxiv.org/html/2606.09899#bib.bib15)\] identify gradient\-attribution error as a standing open problem\. Alternative attribution rules such as RelP \[[18](https://arxiv.org/html/2606.09899#bib.bib3)\] and EAP\-GP \[[42](https://arxiv.org/html/2606.09899#bib.bib16)\] improve faithfulness in circuit discovery\. A complementary line of work evaluates recovered circuits against known causal structure: Shi et al\. \[[33](https://arxiv.org/html/2606.09899#bib.bib20)\] formalize faithfulness and minimality tests, Mueller et al\. \[[26](https://arxiv.org/html/2606.09899#bib.bib22)\] and Gupta et al\. \[[14](https://arxiv.org/html/2606.09899#bib.bib21)\] benchmark localization methods, and formal mechanistic\-interpretability work \[[15](https://arxiv.org/html/2606.09899#bib.bib8),[2](https://arxiv.org/html/2606.09899#bib.bib9)\] studies provable robustness\. Generally, these efforts clarify failure modes and evaluation criteria but do not provide a general account of when attribution patching is numerically reliable or how to correct it across component types and architectures\.

#### Causal abstraction and higher\-order analysis\.

Geiger et al\. \[[12](https://arxiv.org/html/2606.09899#bib.bib17)\] place mechanistic interpretability in a broader causal\-abstraction framework\. Related work \[[13](https://arxiv.org/html/2606.09899#bib.bib18),[40](https://arxiv.org/html/2606.09899#bib.bib19)\] studies whether models adhere to an interpretable causal structure, a question that remains orthogonal to the numerical reliability of the underlying attribution\-patching estimates\. Separately, higher\-order attribution methods, such as Integrated Gradients \[[35](https://arxiv.org/html/2606.09899#bib.bib23)\], Integrated Hessians \[[19](https://arxiv.org/html/2606.09899#bib.bib24)\], compositional curvature analyses \[[11](https://arxiv.org/html/2606.09899#bib.bib28)\], and influence\-function HVPs \[[22](https://arxiv.org/html/2606.09899#bib.bib31)\], show that network\-level curvature is both informative and tractable, but none provides a reliability account for patching at internal activations\.

## 3 Diagnosing and correcting attribution\-patching error

In this section, we analyze when attribution patching is unreliable and how to correct it\. The key idea is that attribution patching error can be decomposed into two parts: a dominant quadratic term \(computable from a single Hessian\-vector product\) and a smaller cubic remainder\. We set up this decomposition \(§[3\.1](https://arxiv.org/html/2606.09899#S3.SS1)\), use it to define a reliability score that flags unreliable attributions \(§[3\.2](https://arxiv.org/html/2606.09899#S3.SS2)\), and then show why local activation curvature cannot predict the dominant error term before introducing network\-level HVP, MS\-HVP, and selective correction \(§[3\.3](https://arxiv.org/html/2606.09899#S3.SS3)\)\.

### 3\.1 Problem setup and error decomposition

Causal localization compares two matched inputs that differ in a controlled way \(e\.g\., swapping a name or replacing a token\) and asks which internal components account for the resulting change in output\. For a given component \(neuron, attention head, or residual\-stream position\), we compare its activation $a$ under one input to its perturbed counterpart $a' = a + \delta$ under the other\. Activation patching measures the true causal effect of this substitution by intervening directly, whereas attribution patching approximates it using a single backward pass\. Our focus is on characterizing the error introduced by this approximation\.

Patching a single component changes a scalar output metric as follows\. Let $M:\mathbb{R}^{d}\to\mathbb{R}$ denote the scalar score read out from the model output as a function of the activation being patched \(e\.g\., a logit difference or the log\-probability of a target token\), and let $a\in\mathbb{R}^{d}$ denote the activation at the component of interest\. Throughout this local analysis, we treat the rest of the input context and model computation as fixed\. *Activation patching* measures the true effect of replacing $a$ with the counterfactual $a' = a + \delta$, namely $\Delta := M(a+\delta) - M(a)$\. *Attribution patching* approximates this via a first\-order Taylor expansion:

$$\hat{\Delta} = \nabla_{a}M\cdot\delta\,. \tag{1}$$

Here $\nabla_{a}M$ is the gradient of the scalar metric with respect to the activation $a$, so $\hat{\Delta}$ is the first\-order attribution\-patching estimate along the patch direction $\delta$\. The approximation error is $E = \Delta - \hat{\Delta}$\. By Taylor’s theorem with integral remainder,

$$E = \underbrace{\tfrac{1}{2}\,\delta^{\top}H\delta}_{\text{dominant (Hessian) term}} + \underbrace{\Phi(\delta)}_{\text{remainder}}, \qquad H = \nabla_{a}^{2}M(a). \tag{2}$$

The quantity $\delta^{\top}H\delta$ is the Hessian quadratic form along the patching direction $\delta$, i\.e\., the second\-order curvature of the scalar metric in the direction induced by the patch\.
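To make the decomposition in Eq\. \(2\) concrete, the following minimal Python sketch evaluates each term on a toy metric\. The metric `M(a) = tanh(w . a)`, the vectors `w`, `a`, and `delta`, and all function names are illustrative assumptions rather than anything used in the experiments; the gradient and Hessian are written analytically so that no autodiff library is needed\.

```python
import math

def metric(a, w):
    """Toy scalar metric M(a) = tanh(w . a); a hypothetical stand-in for a logit difference."""
    return math.tanh(sum(wi * ai for wi, ai in zip(w, a)))

def grad(a, w):
    """Analytic gradient of the toy metric: (1 - tanh^2(z)) * w with z = w . a."""
    t = metric(a, w)
    return [(1.0 - t * t) * wi for wi in w]

def quad_form(a, w, delta):
    """Analytic delta^T H delta, using H = -2 t (1 - t^2) w w^T for the toy metric."""
    t = metric(a, w)
    wd = sum(wi * di for wi, di in zip(w, delta))
    return -2.0 * t * (1.0 - t * t) * wd * wd

# Illustrative values only (assumptions, not taken from the paper).
w = [0.8, -0.5, 0.3]
a = [0.4, 0.1, -0.2]
delta = [0.5, -0.4, 0.6]

a_patched = [ai + di for ai, di in zip(a, delta)]
true_effect = metric(a_patched, w) - metric(a, w)             # Delta (activation patching)
first_order = sum(g * d for g, d in zip(grad(a, w), delta))   # Delta-hat, Eq. (1)
error = true_effect - first_order                             # E = Delta - Delta-hat
quadratic = 0.5 * quad_form(a, w, delta)                      # leading term of Eq. (2)
remainder = error - quadratic                                 # Phi(delta)

print(f"Delta={true_effect:.4f}  Delta_hat={first_order:.4f}")
print(f"E={error:.4f}  quadratic={quadratic:.4f}  remainder={remainder:.4f}")
```

On this toy problem the quadratic term accounts for most of the first\-order error, which is the situation the decomposition is designed to expose; nothing here implies the same ratio holds in a real model\.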

To bound the higher\-order terms in \([2](https://arxiv.org/html/2606.09899#S3.E2)\), we adopt a path\-local Lipschitz\-Hessian condition, the standard route to a cubic Taylor remainder in second\-order optimization \[[29](https://arxiv.org/html/2606.09899#bib.bib25),[5](https://arxiv.org/html/2606.09899#bib.bib26)\] and a common tool in higher\-order analyses of neural networks and self\-attention \[[19](https://arxiv.org/html/2606.09899#bib.bib24),[11](https://arxiv.org/html/2606.09899#bib.bib28),[20](https://arxiv.org/html/2606.09899#bib.bib27),[21](https://arxiv.org/html/2606.09899#bib.bib29),[6](https://arxiv.org/html/2606.09899#bib.bib30)\]\. We use it only along the patch segment, not as a deployment\-computable global constant\.

###### Assumption 1 \(Local third\-order smoothness\)\.

The scalar metric $M$ is twice continuously differentiable on a neighborhood of the line segment $\{a + t\delta : t\in[0,1]\}$, and there exists a finite constant $L_{3}$ such that

$$\bigl\|\nabla_{a}^{2}M(a+t\delta) - \nabla_{a}^{2}M(a)\bigr\|_{\mathrm{op}} \leq L_{3}\,t\,\|\delta\| \qquad \text{for all } t\in[0,1]. \tag{3}$$

#### How the assumption applies to patching\.

The assumption is local: it applies only along the patch segment and yields the remainder bound $|\Phi(\delta)| \leq L_{3}\|\delta\|^{3}/6$\. Standard transformer components satisfy it in the relevant regime: exact GeLU and SiLU have bounded third derivatives, softmax is smooth, and normalization layers are smooth away from zero\-variance inputs\. Empirically, we estimate $L_{3}$ along the patch path by probing the Hessian at three interpolation points \(Appendix [B\.2](https://arxiv.org/html/2606.09899#A2.SS2)\): the resulting cubic bound holds for 82% of GPT\-2 and 95% of Gemma\-2\-2B component–prompt pairs, with median slack factors of $3.4\times$ and $7.6\times$, respectively\.
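The remainder bound follows from the integral form of Taylor’s theorem combined with \(3\)\. The short derivation below is a sketch for orientation, not a reproduction of the appendix proof:

```latex
\begin{aligned}
E &= \int_{0}^{1} (1-t)\,\delta^{\top}\nabla_{a}^{2}M(a+t\delta)\,\delta\,dt,
\qquad
\tfrac{1}{2}\,\delta^{\top}H\delta = \int_{0}^{1} (1-t)\,\delta^{\top}H\,\delta\,dt,\\
\Phi(\delta) &= \int_{0}^{1} (1-t)\,\delta^{\top}\bigl[\nabla_{a}^{2}M(a+t\delta)-H\bigr]\delta\,dt,\\
|\Phi(\delta)| &\leq \int_{0}^{1} (1-t)\,L_{3}\,t\,\|\delta\|^{3}\,dt
 = L_{3}\|\delta\|^{3}\int_{0}^{1} t(1-t)\,dt
 = \frac{L_{3}\|\delta\|^{3}}{6}.
\end{aligned}
```

The factor $1/6$ is the value of $\int_{0}^{1} t(1-t)\,dt$, which is why the bound depends on the Hessian variation only along the patch segment\.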

### 3\.2 Reliability score and error bounds

We now turn the decomposition into a diagnostic\. The goal is to decide, before running activation patching, whether a first\-order attribution\-patching score is likely to be trustworthy for a given component\. The quadratic term is computable from one HVP, but its raw magnitude is hard to interpret without a scale; we therefore normalize it by the first\-order estimate\. The result is a relative reliability score that tracks attribution patching’s relative error up to the cubic remainder and is used later to flag components for selective HVP/MS\-HVP correction\.

###### Definition 1 \(Reliability score\)\.

For a component with Hessian $H = \nabla_{a}^{2}M(a)$, perturbation $\delta$, and nonzero first\-order estimate $\hat{\Delta}\neq 0$, the *reliability score* is $\tilde{R} = |\delta^{\top}H\delta| / (2\,|\hat{\Delta}|)$\.

The denominator is chosen to match the practical question of whether the attribution patching estimate itself is stable: $\tilde{R}$ measures the leading omitted term relative to the first\-order estimate\. Normalizing by the true effect $\Delta$ would require activation patching, so it is not available as a screening diagnostic\. When $\hat{\Delta}$ is near zero, the absolute error bounds from \([2](https://arxiv.org/html/2606.09899#S3.E2)\) are the right object to inspect\.

###### Proposition 1 \(Local attribution\-patching error bound via the reliability score\)\.

Suppose Assumption [1](https://arxiv.org/html/2606.09899#Thmassumption1) holds, $\hat{\Delta}\neq 0$, and $\delta^{\top}H\delta\neq 0$\. Let $\alpha = \tfrac{L_{3}\|\delta\|^{3}}{3\,|\delta^{\top}H\delta|}$ be the higher\-order slack parameter\. Then $\bigl|\,|E|/|\hat{\Delta}| - \tilde{R}\,\bigr| \leq \alpha\tilde{R}$\. If additionally $\alpha<1$, then $(1-\alpha)\tilde{R} \leq |E|/|\hat{\Delta}| \leq (1+\alpha)\tilde{R}$\.

The full proof, including verification of the remainder bound and a discussion of path\-local smoothness for softmax and normalization layers, is in Appendix[C](https://arxiv.org/html/2606.09899#A3)\.

In practice, $\tilde{R}\ll 1$ means attribution patching is accurate; $\tilde{R}\approx 1$ means the omitted curvature can rival the first\-order estimate, changing its magnitude or even its sign\. The slack $\alpha$ packages the uncomputed cubic remainder: small $\alpha$ corresponds to perturbations well within the Taylor convergence radius, making $\tilde{R}$ a tight proxy for relative error; otherwise $\tilde{R}$ remains a useful diagnostic\. Degenerate cases are handled naturally: if $\hat{\Delta}=0$, use absolute bounds; if $\delta^{\top}H\delta=0$, the quadratic term vanishes and $|E|\leq\tfrac{L_{3}}{6}\|\delta\|^{3}$\.
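As a minimal sketch of how the score and the Proposition 1 band could be computed once the quadratic form is available, consider the helpers below\. The function names are hypothetical, and returning `+inf` for a zero first\-order estimate follows the convention stated later for the selective pipeline; in practice $\alpha$ depends on an estimate of $L_{3}$, which this sketch simply takes as an input\.

```python
import math

def reliability_score(first_order, quad_form):
    """R~ = |delta^T H delta| / (2 |Delta_hat|); +inf when Delta_hat is exactly zero."""
    if first_order == 0.0:
        return math.inf
    return abs(quad_form) / (2.0 * abs(first_order))

def relative_error_interval(r_tilde, alpha):
    """Two-sided band [(1 - alpha) R~, (1 + alpha) R~] from Proposition 1 (needs alpha < 1)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("the two-sided band requires 0 <= alpha < 1")
    return (1.0 - alpha) * r_tilde, (1.0 + alpha) * r_tilde

# Illustrative numbers (assumptions): a stable estimate and an unstable one.
stable = reliability_score(first_order=0.8, quad_form=0.1)      # small score
unstable = reliability_score(first_order=0.05, quad_form=-0.12) # score above 1
print(stable, unstable, relative_error_interval(stable, alpha=0.2))
```

A small score suggests the first\-order estimate can be trusted as is, while a score near or above 1 marks a component whose sign or magnitude may change after correction\.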

### 3\.3 Network\-level HVP correction

The quadratic error term $\tfrac{1}{2}\delta^{\top}H\delta$ is a network\-level quantity: it depends on how the patched activation propagates through the full computational graph, not only on local activation curvature\. Correcting it therefore requires network\-level information\. We first show that local curvature corrections fail, then introduce HVP and MS\-HVP corrections that compute the relevant term directly\.

![Refer to caption](https://arxiv.org/html/2606.09899v1/x2.png)

Figure 2: Network–local curvature gap\. Full\-network curvature vs\. local component curvature across three models; prior fixes \(AtP\* \[[23](https://arxiv.org/html/2606.09899#bib.bib1)\], GIM \[[9](https://arxiv.org/html/2606.09899#bib.bib2)\]\) use only the local quantity\.

#### Why local corrections fail\.

A natural first attempt is to use the *local* activation curvature\. For a pre\-activation MLP neuron with activation function $f$, one might estimate $E \approx \tfrac{1}{2} f''(z)\,\delta^{2}$\. This should fail for a structural reason: a neuron’s contribution must traverse the rest of the network, and most of that downstream path is linear \(residual additions, LayerNorm in its operating regime, the unembedding\)\. Linear operations do not compound curvature, so $|d^{2}M/da^{2}|$ inherits little of $|f''(z)|$ unless a downstream nonlinearity happens to align with the patch direction\. Cross\-layer interactions further introduce curvature invisible to local analysis\. We therefore expect a large gap between $|d^{2}M/dz^{2}|$ and $|f''(z)|$, and confirm this empirically: the network\-level Hessian exceeds local curvature by 22–66$\times$ across three model families, with near\-zero correlation \($r=0.04$\); using $f''(z)$ for correction produces $>750\%$ error, worse than no correction at all \(Figure [2](https://arxiv.org/html/2606.09899#S3.F2)\)\.
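A scalar chain\-rule calculation illustrates the structural point\. Suppose, purely for illustration, that the metric depends on the pre\-activation $z$ only through the neuron output, so that $M(z) = g(f(z))$, where $g$ is a hypothetical stand\-in for the entire downstream network:

```latex
\frac{d^{2}M}{dz^{2}}
= \underbrace{g'\bigl(f(z)\bigr)\,f''(z)}_{\text{local curvature, rescaled by the downstream slope}}
+ \underbrace{g''\bigl(f(z)\bigr)\,f'(z)^{2}}_{\text{downstream curvature}}
```

The local quantity $f''(z)$ enters only through the first term and is rescaled by $g'$; the second term depends on downstream curvature that $f''(z)$ never sees\. Under this simplification there is no reason for the network\-level second derivative to track $f''(z)$ in either magnitude or sign\.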

#### Network\-level correction\.

This gap motivates computing the network\-level second\-order correction directly\.

###### Definition 2 \(HVP\-corrected attribution patching\)\.

The HVP\-corrected estimate is

$$\hat{\Delta}_{\text{hvp}} = \nabla_{a}M\cdot\delta + \tfrac{1}{2}\,\delta^{\top}H\delta = \hat{\Delta} + \tfrac{1}{2}\,\delta^{\top}H\delta\,. \tag{4}$$

Computing $\delta^{\top}H\delta$ requires the Hessian\-vector product $H\delta$, which we obtain via two backward passes \[[30](https://arxiv.org/html/2606.09899#bib.bib37)\]: \(1\) compute $g=\nabla_{a}M$ while retaining the gradient computation graph so that $g$ itself can be differentiated; \(2\) differentiate $g\cdot\delta$ with respect to $a$ to obtain $H\delta$\. The dot product $\delta^{\top}(H\delta)$ then gives the quadratic form\. This costs $\sim 3\times$ a single backward pass \(one forward \+ two backward\)\. \(Footnote 1: measured wall\-clock overhead on Pythia\-410M is $2.8\times$ rather than $3\times$ due to caching effects from the retained computation graph\.\)
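Step \(2\) works because $\delta$ is held constant during differentiation, so the gradient of the scalar $g\cdot\delta$ is exactly the Hessian\-vector product\. Written componentwise \(a standard identity, stated here for completeness\):

```latex
\bigl[\nabla_{a}\bigl(g(a)\cdot\delta\bigr)\bigr]_{i}
= \frac{\partial}{\partial a_{i}} \sum_{j} \frac{\partial M}{\partial a_{j}}(a)\,\delta_{j}
= \sum_{j} \frac{\partial^{2} M}{\partial a_{i}\,\partial a_{j}}(a)\,\delta_{j}
= [H\delta]_{i},
\qquad
\delta^{\top}H\delta = \delta\cdot(H\delta).
```

The full $d\times d$ Hessian is never formed: only a vector of the same shape as $a$ is materialized, which is what keeps the correction tractable for large activations\.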

###### Corollary 1 \(Residual error after HVP correction\)\.

The residual error after HVP correction satisfies $|E_{\text{hvp}}| = |\Delta - \hat{\Delta}_{\text{hvp}}| = |\Phi(\delta)| \leq L_{3}\|\delta\|^{3}/6$\.

Corollary [1](https://arxiv.org/html/2606.09899#Thmcorollary1) shows that HVP reduces the error of attribution patching from $O(\|\delta\|^{2})$ to $O(\|\delta\|^{3})$, one order tighter as $\delta\to 0$\. However, when $L_{3}\|\delta\|$ is comparable to the Hessian scale, the cubic remainder dominates and a single Taylor step overshoots\. To address this, we introduce *Multi\-Step HVP* \(MS\-HVP\), which splits the patch into $K$ equal sub\-steps and applies a second\-order correction at each intermediate point along the path from $a$ to $a+\delta$\. Setting $K=1$ recovers standard HVP; increasing $K$ shrinks each sub\-step’s remainder at the cost of additional HVP evaluations\. Algorithm [1](https://arxiv.org/html/2606.09899#alg1) summarizes the procedure\.

Algorithm 1: Multi\-step HVP attribution patching

1. Input: model $M$, activation $a$, perturbation $\delta$, sub\-steps $K$
2. $\hat{\Delta}_{\text{ms}} \leftarrow 0$, $s \leftarrow \delta/K$
3. for $k = 1,\ldots,K$ do
4. $a_{k-1} \leftarrow a + (k-1)s$
5. $g_{k} \leftarrow \nabla_{a}M(a_{k-1})$ $\triangleright$ Forward \+ first backward
6. $v_{k} \leftarrow \nabla_{a}(g_{k}\cdot s)$ $\triangleright$ HVP via second backward
7. $\hat{\Delta}_{\text{ms}} \leftarrow \hat{\Delta}_{\text{ms}} + g_{k}\cdot s + \tfrac{1}{2}\,s\cdot v_{k}$
8. end for
9. return $\hat{\Delta}_{\text{ms}}$

Concretely, MS\-HVP evaluates the gradient and Hessian at each intermediate activation $a_{k-1} = a + \tfrac{k-1}{K}\delta$ and accumulates:

$$\hat{\Delta}_{\text{ms}} = \sum_{k=1}^{K}\Bigl[\nabla_{a}M(a_{k-1})\cdot\tfrac{\delta}{K} + \tfrac{1}{2K^{2}}\,\delta^{\top}H(a_{k-1})\delta\Bigr]. \tag{5}$$

Because each sub\-step’s cubic remainder scales as $\|\delta/K\|^{3}$ and there are $K$ such steps, the aggregate residual satisfies $|E_{\text{ms}}| \leq L_{3}\|\delta\|^{3}/(6K^{2})$ by standard quadrature analysis\. MS\-HVP thus trades linearly more HVP evaluations for a quadratic reduction in the Taylor remainder, predicting diminishing returns as $K$ grows\. We test this cost–accuracy prediction empirically in §[4](https://arxiv.org/html/2606.09899#S4)\.
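The accumulation in Eq\. \(5\) can be sketched in plain Python on a toy metric\. Everything below is an illustrative assumption: the metric `M(a) = tanh(w . a)`, the analytic `grad` and `hvp` helpers \(which stand in for the two backward passes of a real model\), and the chosen vectors\. The perturbation is deliberately large so that a single Taylor step overshoots\.

```python
import math

def metric(a, w):
    """Toy scalar metric M(a) = tanh(w . a) (hypothetical stand-in for the model)."""
    return math.tanh(sum(wi * ai for wi, ai in zip(w, a)))

def grad(a, w):
    """Analytic gradient; replaces the first backward pass."""
    t = metric(a, w)
    return [(1.0 - t * t) * wi for wi in w]

def hvp(a, w, v):
    """Analytic Hessian-vector product H v; replaces the second backward pass."""
    t = metric(a, w)
    wv = sum(wi * vi for wi, vi in zip(w, v))
    return [-2.0 * t * (1.0 - t * t) * wv * wi for wi in w]

def ms_hvp(a, w, delta, num_steps):
    """Accumulate g_k . s + 0.5 * s . (H_k s) over K equal sub-steps, as in Eq. (5)."""
    step = [d / num_steps for d in delta]                     # s = delta / K
    estimate = 0.0
    for k in range(num_steps):                                # k = 0 is the point a_0 = a
        point = [ai + k * si for ai, si in zip(a, step)]
        g = grad(point, w)
        v = hvp(point, w, step)
        estimate += sum(gi * si for gi, si in zip(g, step))
        estimate += 0.5 * sum(si * vi for si, vi in zip(step, v))
    return estimate

# Illustrative values only (assumptions, not from the paper).
w = [1.5, -1.0, 0.5]
a = [0.2, 0.1, -0.1]
delta = [1.0, -0.8, 0.6]

truth = metric([ai + di for ai, di in zip(a, delta)], w) - metric(a, w)
errors = {k: abs(truth - ms_hvp(a, w, delta, k)) for k in (1, 2, 4)}
for k, err in errors.items():
    print(f"K={k}: |error|={err:.5f}")
```

With `num_steps=1` the function reduces to the single\-step HVP estimate of Eq\. \(4\)\. On this toy problem the error shrinks as the number of sub\-steps grows from 1 to 4, at the price of one gradient and one HVP evaluation per sub\-step\.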

#### Selective HVP: deciding which components to correct\.

MS\-HVP improves the accuracy of individual corrections; a separate question is *which* components need correction at all\. In practice, one rarely applies HVP to every candidate component\. The following result formalizes the aggregate error of the *Screen\-Flag\-Fix* workflow: compute $\tilde{R}$ for all components, apply HVP correction only to those with $\tilde{R}\geq\tau$, and trust attribution patching for the rest\.

###### Proposition 2 \(Selective\-HVP pipeline guarantee\)\.

Consider components $i=1,\dots,n$ patched independently on the same input\. Let $\Delta_{i}$ denote the true effect of patching only component $i$, $\hat{\Delta}_{i}$ its attribution\-patching estimate, and $c_{i} := \tfrac{1}{2}\delta_{i}^{\top}H_{ii}\delta_{i}$ its diagonal second\-order correction\. For a threshold $\tau>0$, define $S_{\mathrm{ok}} := \{i : \tilde{R}_{i} < \tau\}$, $S_{\mathrm{flag}} := \{i : \tilde{R}_{i} \geq \tau\}$, and the selective estimate

$$\hat{\Delta}_{\mathrm{sel}} := \sum_{i=1}^{n}\hat{\Delta}_{i} + \sum_{i\in S_{\mathrm{flag}}} c_{i}, \tag{6}$$

with the convention that $\tilde{R}_{i}=+\infty$ when $\hat{\Delta}_{i}=0$\. If Assumption [1](https://arxiv.org/html/2606.09899#Thmassumption1) holds along each componentwise patching segment with constants $L_{3,i}$, then

$$\Bigl|\sum_{i=1}^{n}\Delta_{i} - \hat{\Delta}_{\mathrm{sel}}\Bigr| \;\leq\; \tau\sum_{i\in S_{\mathrm{ok}}}|\hat{\Delta}_{i}| \;+\; \tfrac{1}{6}\sum_{i=1}^{n}L_{3,i}\|\delta_{i}\|^{3}. \tag{7}$$

The independent\-patching assumption matches the standard practice in circuit discovery: tools such as ACDC \[[7](https://arxiv.org/html/2606.09899#bib.bib11)\] and EAP \[[36](https://arxiv.org/html/2606.09899#bib.bib5)\] consume per\-component rankings\. Joint patches add off\-diagonal Hessian contributions $\delta_{i}^{\top}H_{ij}\delta_{j}$; these affect aggregate circuit\-level metrics but not per\-component ranking \(measuring them would require $O(n^{2})$ HVPs\)\. Corollary [2](https://arxiv.org/html/2606.09899#Thmcorollary2) gives an exact identity comparing selective and full diagonal HVP plus a sign\-aware refinement\.

###### Corollary 2 \(Exact gap to full diagonal HVP\)\.

Let

$$\hat{\Delta}_{\mathrm{full}} := \sum_{i=1}^{n}\hat{\Delta}_{i} + \sum_{i=1}^{n}c_{i}$$

denote the estimate obtained by applying the diagonal HVP correction to every independently patched component, and let $E_{\mathrm{full}}^{\mathrm{ind}} := \sum_{i}\Delta_{i} - \hat{\Delta}_{\mathrm{full}}$ and, analogously, $E_{\mathrm{sel}}^{\mathrm{ind}} := \sum_{i}\Delta_{i} - \hat{\Delta}_{\mathrm{sel}}$\. Then

$$E_{\mathrm{sel}}^{\mathrm{ind}} - E_{\mathrm{full}}^{\mathrm{ind}} = Q_{\mathrm{ok}} := \sum_{i\in S_{\mathrm{ok}}}c_{i}. \tag{8}$$

Consequently,

$$\bigl|E_{\mathrm{sel}}^{\mathrm{ind}} - E_{\mathrm{full}}^{\mathrm{ind}}\bigr| \leq \tau\sum_{i\in S_{\mathrm{ok}}}|\hat{\Delta}_{i}|. \tag{9}$$
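The selective estimate and the gap identity can be checked numerically with a small sketch\. The function name `selective_hvp` and the four\-component example are hypothetical; the sketch assumes the per\-component estimates and diagonal corrections $c_{i}$ have already been computed, and uses $\tilde{R}_{i} = |c_{i}|/|\hat{\Delta}_{i}|$, which follows from the definitions of the score and of $c_{i}$\.

```python
import math

def selective_hvp(first_order, corrections, tau):
    """Return (selective estimate, flagged indices) following Eq. (6).

    first_order[i] is Delta_hat_i and corrections[i] is c_i = 0.5 * delta_i^T H_ii delta_i.
    """
    total = sum(first_order)
    flagged = []
    for i, (est, c) in enumerate(zip(first_order, corrections)):
        # R~_i = |c_i| / |Delta_hat_i|, with R~_i = +inf when Delta_hat_i is zero.
        score = math.inf if est == 0.0 else abs(c) / abs(est)
        if score >= tau:
            flagged.append(i)
            total += c
    return total, flagged

# Illustrative numbers (assumptions): four components, threshold tau = 0.1.
first_order = [0.50, -0.20, 0.00, 0.10]
corrections = [0.01, -0.15, 0.02, 0.004]
tau = 0.1

selective, flagged = selective_hvp(first_order, corrections, tau)
full = sum(first_order) + sum(corrections)
unflagged = [i for i in range(len(first_order)) if i not in flagged]
q_ok = sum(corrections[i] for i in unflagged)                  # right-hand side of Eq. (8)
bound = tau * sum(abs(first_order[i]) for i in unflagged)      # right-hand side of Eq. (9)
print(flagged, selective, full - selective, q_ok, bound)
```

Because the true effects cancel in the difference of the two errors, the gap equals `full - selective`, which matches `q_ok` and stays below `bound` in this example\.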

## 4 Experiments

We evaluate whether HVP corrects attribution\-patching error broadly, how it compares to existing second\-order methods at matched cost, and whether per\-component accuracy gains translate to improved circuit recovery\.

### 4\.1 Experimental setup

#### Models

We evaluate five model families at different scales, as shown in Table [1](https://arxiv.org/html/2606.09899#S4.T1): GPT\-2 \[[31](https://arxiv.org/html/2606.09899#bib.bib33)\], Pythia \(410M, 2\.8B\) \[[3](https://arxiv.org/html/2606.09899#bib.bib32)\], Qwen2\.5\-1\.5B \[[38](https://arxiv.org/html/2606.09899#bib.bib39)\], Gemma \(2B, 9B\) \[[37](https://arxiv.org/html/2606.09899#bib.bib40)\], and Llama\-3\.1\-8B \[[8](https://arxiv.org/html/2606.09899#bib.bib41)\]\. All models are hooked with TransformerLens \[[27](https://arxiv.org/html/2606.09899#bib.bib34)\]\. We analyze attention heads in all models, pre\-activation MLP neurons in GeLU models, and MLP layer outputs in Qwen\.

Table 1: Comparison of model architectures used in our experiments.

For factual completion and Greater-Than tasks, corruptions are generated by replacing the token at position 3 with a uniformly sampled vocabulary token. For IOI, we adopt the standard name-swap corruption introduced by \[[39](https://arxiv.org/html/2606.09899#bib.bib36)\]. In all settings, $\delta$ denotes the induced activation perturbation and $M$ denotes the log-probability of the correct next token.

Our primary metric is *top-5 relative error*: the mean absolute error of a method’s scores on the five components with largest activation-patching effect, normalized by the ground-truth score range.
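A minimal sketch of this metric follows. The function name is hypothetical, and two details are assumptions rather than statements from the paper: "largest effect" is read as largest magnitude, and the score range is taken as max minus min over all ground-truth scores.

```python
def top_k_relative_error(estimates, ground_truth, k=5):
    """Mean absolute error on the k components with the largest
    ground-truth magnitude, divided by the ground-truth score range.

    Assumptions: 'largest effect' means largest |score|, and the range
    is max - min over all ground-truth scores.
    """
    top = sorted(range(len(ground_truth)),
                 key=lambda i: abs(ground_truth[i]), reverse=True)[:k]
    mae = sum(abs(estimates[i] - ground_truth[i]) for i in top) / len(top)
    score_range = max(ground_truth) - min(ground_truth)
    if score_range == 0:
        raise ValueError("ground-truth scores have zero range")
    return mae / score_range


# Illustrative scores for six components (not from the paper).
gt = [1.0, -1.0, 0.5, 0.2, 0.1, 0.0]
est = [0.8, -1.0, 0.5, 0.2, 0.1, 0.3]
print(top_k_relative_error(est, gt))
```

In this example the error on the last component does not count, because that component falls outside the top five by ground-truth magnitude.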

#### Tasks

We evaluate on established circuit benchmarks together with a broader factual-completion setting:

- IOI: For the IOI circuit-ranking experiment, we follow the setup of Wang et al. \[[39](https://arxiv.org/html/2606.09899#bib.bib36)\]: 50 IOI prompts of the form “When Mary and John went to the store, John gave a drink to”, with the indirect object (Mary) as the target. We compute attribution-patching and HVP-corrected attributions for all $12 \times 12 = 144$ attention heads, rank them by magnitude, and compare the resulting rankings against the activation-patching ground truth using top-$k$ overlap and Spearman correlation.
- Greater-Than: For the Greater-Than circuit \[[16](https://arxiv.org/html/2606.09899#bib.bib38)\], we use 200 prompts of the form “The war lasted from the year 17{XX} to the year 17” and measure whether the model assigns higher probability to years greater than XX. All 144 attention heads in GPT-2 Small are ranked by attribution magnitude. Top-$k$ overlap is computed against the activation-patching ground truth.
- Factual completion: For generic model sweeps, we use factual next-token prediction prompts and measure the effect of component interventions on the correct-token log-probability.
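The two ranking metrics used for the circuit benchmarks above can be sketched as follows. The helper names are hypothetical, components are ranked by score magnitude as described, and the Spearman formula shown is the simple version without tie correction, which is an assumption about the evaluation rather than a detail given in the text.

```python
def top_k_indices(values, k):
    """Indices of the k entries with the largest magnitude."""
    order = sorted(range(len(values)), key=lambda i: -abs(values[i]))
    return set(order[:k])


def top_k_overlap(estimates, ground_truth, k):
    """Fraction of the ground-truth top-k that the estimate also ranks top-k."""
    shared = top_k_indices(estimates, k) & top_k_indices(ground_truth, k)
    return len(shared) / k


def magnitude_ranks(values):
    """Rank of each entry by magnitude (1 = largest); ties broken by index."""
    order = sorted(range(len(values)), key=lambda i: -abs(values[i]))
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman(estimates, ground_truth):
    """Spearman rank correlation of magnitude ranks (no tie correction)."""
    n = len(estimates)
    pairs = zip(magnitude_ranks(estimates), magnitude_ranks(ground_truth))
    d2 = sum((a - b) ** 2 for a, b in pairs)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Illustrative scores for five heads (not from the paper).
gt_scores = [0.9, -0.7, 0.5, 0.3, 0.1]
est_scores = [0.8, -0.2, 0.6, 0.35, 0.05]
print(top_k_overlap(est_scores, gt_scores, k=2), spearman(est_scores, gt_scores))
```

Top-$k$ overlap is sensitive only to what happens at the cut-off, while Spearman correlation summarizes the whole ranking, so the two can move independently.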

Unless otherwise noted, generic sweeps use 20 factual prompts\. We additionally scale selected experiments to 55 prompts for the Pythia\-410M attention\-head setting and 35 prompts for the Qwen2\.5\-1\.5B attention\-head and MLP\-output settings\. IOI uses 50 templated prompts and Greater\-Than uses 200 prompts\.

Our model-task selection is guided by the theory of §[3](https://arxiv.org/html/2606.09899#S3), which predicts that attribution-patching error increases when the quadratic form $\delta^{\top} H \delta$ dominates the first-order approximation. Pythia-410M on IOI represents a predicted high-error regime in which the local Taylor approximation breaks down, whereas GPT-2 Small on IOI represents a low-error regime where standard attribution patching is already accurate. Gemma-2-2B factual completion lies between these extremes. Additional model-task pairs provide broader cross-architecture coverage (Table [2](https://arxiv.org/html/2606.09899#S4.T2)).

#### Evaluation protocol.

Unless otherwise stated, we report 95% confidence intervals computed from 1,000 bootstrap resamples over prompts\. Each resample draws prompts with replacement and aggregates over all components within the sampled prompt set\. We report the 2\.5th and 97\.5th percentiles of the bootstrap distribution\. Circuit\-ranking metrics on IOI and Greater\-Than are reported as point estimates over the benchmark datasets\. We measure the following component types:

- Pre-activation MLP neurons: `hook_pre` activations in GeLU-based models, with 256 sampled neurons per layer.
- Qwen MLP outputs: `hook_mlp_out` activations, corresponding to full $d_{\text{model}}$-dimensional layer outputs.
- Post-activation MLP neurons: `hook_post` activations.
- Attention heads: `hook_z` activations, i.e., head outputs before the output projection.
- Residual stream: `hook_resid_post` activations over the full residual vector.
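The bootstrap procedure from the evaluation protocol can be sketched as below. This is a percentile bootstrap over prompts. The function name is hypothetical, and the sketch assumes that each prompt has already been reduced to one aggregate value over its components and that the reported statistic is the mean.

```python
import random


def bootstrap_ci(per_prompt_values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval over prompts.

    per_prompt_values: one aggregate value per prompt (assumption: the
    per-prompt aggregation over components has already been done).
    Returns the (alpha/2, 1 - alpha/2) percentiles of the resampled means.
    """
    rng = random.Random(seed)
    n = len(per_prompt_values)
    means = []
    for _ in range(n_resamples):
        # Draw n prompts with replacement and recompute the statistic.
        sample = [per_prompt_values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = round((alpha / 2) * (n_resamples - 1))
    hi_idx = round((1 - alpha / 2) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Illustrative per-prompt errors (not from the paper).
values = [0.12, 0.30, 0.25, 0.18, 0.40, 0.22, 0.35, 0.15, 0.28, 0.33]
lo, hi = bootstrap_ci(values)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
```

Resampling whole prompts rather than individual components keeps the components of one prompt together, which matches the description of aggregating within the sampled prompt set.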

#### Baselines\.

We compare against AtP\* \[[23](https://arxiv.org/html/2606.09899#bib.bib1)\] and GIM \[[9](https://arxiv.org/html/2606.09899#bib.bib2)\]. Since AtP\* does not provide an official implementation, we reimplemented the method from the paper description. AtP\* combines Q/K-linearization, which linearizes the softmax backward pass, with GradDrop, which suppresses gradient contributions from components whose activation region changes under perturbation. Because these corrections are specific to attention mechanisms, comparisons with AtP\* are restricted to attention-head experiments.

For GIM, we follow the public implementation, modifying the softmax backward pass to remove the self-repair gradient term responsible for systematic underestimation. Both baselines are evaluated on the same prompt sets and component collections as the corresponding HVP experiments in Table [3](https://arxiv.org/html/2606.09899#S4.T3).

Each Hessian–vector product $H\delta$ is computed with a single double backward pass, without ever forming the Hessian explicitly (implementation details in Appendix [E.1](https://arxiv.org/html/2606.09899#A5.SS1)).
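To make the role of $H\delta$ concrete, the sketch below applies the second-order correction $\tfrac{1}{2}\delta^{\top} H \delta$ to a toy metric. It is not the paper's implementation. The toy metric $M(a) = \log\sigma(w \cdot a)$, the function names, and all numbers are assumptions, and the Hessian–vector product is approximated by a central difference of analytic gradients as a stand-in for the double backward pass an autodiff framework would use.

```python
import math


def metric(a, w):
    """Toy metric M(a) = log sigmoid(w . a); stands in for a log-probability."""
    s = sum(wi * ai for wi, ai in zip(w, a))
    return -math.log1p(math.exp(-s))


def grad(a, w):
    """Analytic gradient of the toy metric with respect to a."""
    s = sum(wi * ai for wi, ai in zip(w, a))
    coeff = 1.0 / (1.0 + math.exp(s))  # equals 1 - sigmoid(s)
    return [coeff * wi for wi in w]


def hvp_fd(a, w, delta, eps=1e-4):
    """Approximate H @ delta by a central difference of gradients.

    Stand-in only: with autodiff this would be one double backward pass.
    """
    plus = grad([ai + eps * di for ai, di in zip(a, delta)], w)
    minus = grad([ai - eps * di for ai, di in zip(a, delta)], w)
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


w = [1.0, -0.5, 0.25]
a = [0.2, 0.1, -0.3]       # clean activation
delta = [0.5, -0.4, 0.3]   # perturbation induced by the patch

true_effect = metric([ai + di for ai, di in zip(a, delta)], w) - metric(a, w)
ap = sum(g * d for g, d in zip(grad(a, w), delta))             # first order
correction = 0.5 * sum(h * d for h, d in zip(hvp_fd(a, w, delta), delta))

print(f"true={true_effect:.4f}  AP={ap:.4f}  AP+HVP={ap + correction:.4f}")
```

Only the product $H\delta$ is ever needed, so the cost stays at a constant number of gradient evaluations regardless of the dimension of the activation.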

#### Computational cost\.

We measure wall-clock runtime on a single NVIDIA L40S GPU. For Pythia-410M, standard attribution patching takes 0.8s per prompt (all components) and HVP correction takes 2.2s ($2.8\times$ overhead rather than exactly $3\times$, due to computation-graph caching). For Pythia-2.8B, attribution patching takes 3.1s, while HVP takes 8.9s ($2.9\times$). For GPT-2 Small, attribution patching takes 0.4s, while HVP takes 1.1s ($2.8\times$).

Measured in backward passes per component per prompt, standard attribution patching has cost 1, standard HVP has cost 2, and MS-HVP with $K$ iterations has cost $2K$. Our integrated-gradients baseline \[[35](https://arxiv.org/html/2606.09899#bib.bib23)\] uses $S$ interpolation steps and therefore requires cost $S$ per component. We use a per-component IG formulation distinct from the all-at-once EAP-IG method of Syed et al. \[[36](https://arxiv.org/html/2606.09899#bib.bib5)\].
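The cost accounting above is simple enough to write down directly. The helper below is a hypothetical convenience, not part of any released code. It encodes the stated per-component costs and shows why MS-HVP with $K{=}5$ and IG with $S{=}10$ count as matched.

```python
def backward_pass_cost(method, k=1, s=1):
    """Backward passes per component per prompt, as stated in the text."""
    if method == "ap":        # standard attribution patching
        return 1
    if method == "hvp":       # standard HVP correction
        return 2
    if method == "ms_hvp":    # multi-step HVP with k sub-steps
        return 2 * k
    if method == "ig":        # integrated gradients with s steps
        return s
    raise ValueError(f"unknown method: {method}")


for name, kwargs in [("ap", {}), ("hvp", {}),
                     ("ms_hvp", {"k": 5}), ("ig", {"s": 10})]:
    print(name, kwargs, backward_pass_cost(name, **kwargs))
```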

#### Infrastructure\.

All experiments were conducted on single NVIDIA L40S GPUs. Total runtime was approximately 25 GPU-hours. Representative runtimes include $\sim$6 hours for Pythia-410M experiments, $\sim$8 hours for Pythia-2.8B, $\sim$1 hour for GPT-2 Small factual sweeps, $\sim$30 minutes for GPT-2 IOI, $\sim$4 hours for Greater-Than, and $\sim$4 hours for Qwen2.5-1.5B experiments. No experiment required multi-GPU execution.

Table 2: Comparison of error rates. Top-5 relative error (%) across fourteen model-task pairs and seven methods (lower is better; Std HVP = MS-HVP with $K{=}1$). Column $N$: number of evaluation prompts; $K$: HVP sub-steps; $S$: IG interpolation steps. ⋆ Significantly better than AP ($p < 0.05$, paired bootstrap). ∘ Significantly worse. ∗ Pathological; recovered by MS-HVP with $K \geq 3$. † Catastrophic; see §[4.3](https://arxiv.org/html/2606.09899#S4.SS3). ‡ Infeasible ($\sim 25$ GPU-days/task). — Tokenizer-incompatible: the greater_than task requires patching a single token representing a two-digit number (e.g. “42”), but Gemma’s tokenizer splits it into multiple tokens. *Note:* AtP∗ \[[23](https://arxiv.org/html/2606.09899#bib.bib1)\] and RelP \[[18](https://arxiv.org/html/2606.09899#bib.bib3)\] are omitted due to different granularity.

### 4\.2Broad gains from HVP correction

Consistent with the curvature-gap analysis (§[3.3](https://arxiv.org/html/2606.09899#S3.SS3), Figure [2](https://arxiv.org/html/2606.09899#S3.F2)), local activation curvature is a poor predictor of attribution-patching error. In contrast, the reliability score $\tilde{R}$ tracks network-level curvature and accurately localizes failure regions. Figure [3](https://arxiv.org/html/2606.09899#S4.F3) shows the reliability-score diagnostic on Pythia-410M, while Figure [4](https://arxiv.org/html/2606.09899#S4.F4) shows that $\tilde{R}$ concentrates sharply in the IOI-circuit layers of Pythia-410M, precisely where attribution patching incurs its largest errors. Applying the network-level HVP correction substantially reduces error throughout these layers. Full HVP correction results are deferred to Appendix [E.2](https://arxiv.org/html/2606.09899#A5.SS2).

![Refer to caption](https://arxiv.org/html/2606.09899v1/x3.png) Figure 3: Reliability score $\tilde{R}$ as a diagnostic for attribution-patching failure on Pythia-410M. (a) Scatter of $\tilde{R}$ vs. true relative error for attention heads (blue) and MLP neurons (red), $n = 3{,}600$. Spearman $\rho = 0.48$. Dashed lines mark the recommended thresholds. (b) ROC curve for detecting more than 50% relative error. AUC $= 0.70$ $[0.67, 0.73]$. At $\tilde{R} > 0.3$ (marked), recall is 89% and precision 36%, flagging 40% of components.

![Refer to caption](https://arxiv.org/html/2606.09899v1/x4.png) Figure 4: Reliability score by layer. $\tilde{R}$ by transformer layer (Pythia-410M IOI, heads with top-quartile causal effect). Error concentrates in the IOI-circuit layers (11–15).

These improvements translate into broad empirical gains across architectures and tasks (Table [2](https://arxiv.org/html/2606.09899#S4.T2)). Standard HVP correction ($K{=}1$, cost 2) reduces top-5 relative error by 72–90% on attention heads and 56–67% on pre-activation MLP neurons. In contrast, gains on post-activation neurons are small (2–7%), consistent with the near-linearity of activations after the nonlinearity has already been applied. The same qualitative pattern holds across all evaluated architectures, including SwiGLU and GeGLU models such as Qwen and Gemma.

The strongest results occur in low-to-moderate curvature regimes. For example, on Gemma-2-2B Greater-Than, a single HVP pass reduces the top-5 relative error from 7.41% to 0.62%, approaching exact recovery of the activation-patching ranking.

Beyond aggregate metrics, HVP correction also improves circuit recovery. On GPT-2 Greater-Than, MS-HVP ($K{=}5$) increases top-5 head overlap with activation-patching ground truth from 70.1% to 83.2% (+13.1 pp), outperforming both IG and GIM. Similar improvements hold on Pythia-410M Greater-Than, where all second-order methods converge to comparable rankings while consistently outperforming attribution patching.

Table 3: Comparison of attribution-patching correction methods on Pythia-410M using the matched 20-prompt comparison protocol (bootstrap 95% CIs for error columns). HVP provides the largest reduction at $3\times$ cost.

| Method | Attn. heads: error (%) | Attn. heads: reduct. (%) | Pre-act neurons: error (%) | Pre-act neurons: reduct. (%) | Cost |
| --- | --- | --- | --- | --- | --- |
| Attribution patching | 4.4 [3.6, 5.2] | — | 35.2 [31.1, 39.3] | — | $1\times$ |
| AtP\* | 2.9 [2.2, 3.6] | 34.1 | 33.8 [29.8, 37.8] | 4.0 | $1\times$ |
| GIM | 2.6 [1.9, 3.3] | 40.9 | 31.4 [27.5, 35.3] | 10.8 | $1\times$ |
| HVP (ours) | 1.0 [0.7, 1.3] | 77.4 | 11.7 [9.4, 14.0] | 66.7 | $3\times$ |
| Act. patching | 0 | 100 | 0 | 100 | $N\times$ |

#### Comparison to prior corrections.

Table [3](https://arxiv.org/html/2606.09899#S4.T3) compares HVP correction against prior attribution-patching refinements under matched experimental settings. On Pythia-410M, HVP achieves the largest reduction in both attention-head and pre-activation-neuron error, reducing attention-head error from 4.4% to 1.0% and neuron error from 35.2% to 11.7%. By contrast, AtP\* and GIM provide only modest gains and are largely restricted to attention-specific failure modes.

Notably, GIM frequently underperforms even vanilla attribution patching in circuit-recovery evaluations. Across nine model–task pairs, GIM degrades top-$K$ head overlap in 26 of 27 settings, with losses reaching 21.8 percentage points on Llama-3.1-8B factual recovery. This suggests that correcting only softmax self-repair is insufficient once higher-order network curvature dominates the error.

#### Larger\-scale models\.

At larger scales, HVP remains computationally practical while alternative second-order methods become prohibitively expensive. Per-head integrated gradients with $S{=}10$ requires approximately 25 GPU-days per task at 8B scale due to its $S \times N_{\text{heads}}$ backward-pass cost, and Integrated Hessians (IH) has not been demonstrated at comparable scales. In contrast, standard HVP ($K{=}1$, cost 2) completes in 2.4–28.7 GPU-hours across all 8B-scale experiments.

Despite its low cost, HVP continues to provide substantial gains. Across four large-scale model–task pairs, standard HVP reduces attribution-patching error by 5–57%. On Llama-3.1-8B IOI, multi-step HVP further reduces error from 8.42% to 3.54%, corresponding to an 82% reduction relative to attribution patching. These results indicate that iterative second-order correction remains effective at the largest scales we evaluate (8B–9B parameters).

#### Pathological high\-curvature regime\.

Pythia-410M IOI is the only setting in our experiments where the perturbation magnitude exceeds the local Taylor convergence radius. In this regime, standard HVP catastrophically overshoots, increasing the error from 22.34% to 47.48%, exactly as predicted by Corollary [1](https://arxiv.org/html/2606.09899#Thmcorollary1). However, multi-step composition restores stability: error decreases monotonically with increasing $K$, following the predicted $\mathcal{O}(1/K^2)$ trend (Figure [6](https://arxiv.org/html/2606.09899#S4.F6)a). Performance improves substantially by $K{=}3$, reaches a practical knee around $K{=}5$, and saturates beyond $K{=}10$.

This failure mode also distinguishes MS-HVP from alternative higher-order approximations. Third-order Taylor expansions, finite-difference HVP, and Gauss–Newton corrections all fail catastrophically in this regime, despite performing adequately on lower-curvature tasks. In contrast, MS-HVP remains stable because it composes locally valid quadratic corrections rather than relying on a single large-step expansion.
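The idea of composing locally valid quadratic corrections can be illustrated on a one-dimensional toy problem. The exact MS-HVP update rule is not given in this section, so the sketch assumes $K$ equal sub-steps along the straight line from the clean to the patched activation, with the first and second derivatives re-evaluated at the start of each sub-step. The toy metric $\log\sigma(s)$ and all numbers are illustrative.

```python
import math


def f(s):
    """Toy metric along the patch direction: log sigmoid(s)."""
    return -math.log1p(math.exp(-s))


def d1(s):
    """First derivative of f, equal to 1 - sigmoid(s)."""
    return 1.0 / (1.0 + math.exp(s))


def d2(s):
    """Second derivative of f, equal to -sigmoid(s) * (1 - sigmoid(s))."""
    p = 1.0 / (1.0 + math.exp(-s))
    return -p * (1.0 - p)


def ms_hvp(s0, step, k):
    """Sum of k second-order Taylor estimates over sub-steps of size step / k.

    Assumption: equal sub-steps on a straight path; k = 1 is a single
    quadratic correction over the whole perturbation.
    """
    h = step / k
    total = 0.0
    for j in range(k):
        s = s0 + j * h
        total += d1(s) * h + 0.5 * d2(s) * h * h
    return total


s0, step = 0.5, 4.0  # deliberately large perturbation
true_effect = f(s0 + step) - f(s0)
errors = {k: abs(ms_hvp(s0, step, k) - true_effect) for k in (1, 2, 4, 8)}
for k, err in errors.items():
    print(f"K={k}: abs error {err:.4f}")
```

Each sub-step has a local error of order $h^3$, and there are $K$ of them with $h$ proportional to $1/K$, which gives the $\mathcal{O}(1/K^2)$ trend quoted above.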

Finally, MS-HVP achieves accuracy comparable to integrated gradients at matched computational cost. Across nine non-pathological tasks, MS-HVP ($K{=}5$) and IG ($S{=}10$) are statistically tied on seven tasks and MS-HVP significantly outperforms IG on two IOI benchmarks. This suggests that iterative quadratic correction captures most of the practical benefit of path integration while requiring substantially fewer backward passes at large scale.

### 4\.3Matched\-compute comparisons

We next compare HVP correction against existing second-order alternatives under matched compute budgets on the nine tasks where all baselines are computationally feasible. The full experimental results are given in Appendix [D](https://arxiv.org/html/2606.09899#A4).

![Refer to caption](https://arxiv.org/html/2606.09899v1/x5.png) Figure 5: Cost–accuracy tradeoff: MS-HVP vs. integrated gradients. Top-5 relative error as a function of compute cost (backward passes per component) on three representative regimes: pathological high-curvature (Pythia-410M IOI), low-error clean (GPT-2 IOI), and moderate-error factual (Gemma-2-2B factual). Horizontal dotted lines denote attribution patching without correction. MS-HVP matches or exceeds IG at comparable compute across all regimes, while achieving substantially better compute efficiency in the pathological setting.

#### Comparison to integrated gradients.

Integrated gradients (IG) provides the strongest existing accuracy baseline but scales linearly with the number of interpolation steps. Figure [5](https://arxiv.org/html/2606.09899#S4.F5) compares the compute–accuracy tradeoff between IG and MS-HVP across representative low-, moderate-, and high-curvature regimes. At matched cost 10, MS-HVP ($K{=}5$) and IG ($S{=}10$) are statistically tied on seven of nine tasks under paired-bootstrap testing, while MS-HVP significantly outperforms IG on GPT-2 IOI ($p < 0.001$) and Qwen2.5-1.5B IOI ($p = 0.041$). Thus, iterative quadratic correction achieves parity with path integration on most tasks despite using only local second-order information.

The difference becomes more pronounced in the pathological high-curvature regime. On Pythia-410M IOI, MS-HVP with $K{=}5$ (cost 10) outperforms IG with $S{=}35$ (cost 35), achieving 18.03% versus 18.47% top-5 error ($\Delta = -0.44$ pp, $p = 0.022$). At approximately matched cost 40, aggregating MS-HVP estimates across $K \in \{5, 10, 20\}$ via a per-head median further improves performance to 17.57%, surpassing IG’s best result of 18.77% ($p < 0.001$). These results indicate that multi-step quadratic correction can recover the benefits of dense path integration while requiring substantially less computation.

Importantly, the practical scaling behavior differs sharply between the methods. IG requires $S$ full backward passes per component and rapidly becomes infeasible at 8B scale, whereas MS-HVP remains tractable because each step reuses local curvature information. Consequently, MS-HVP is not only competitive at matched cost but also extends naturally to model sizes where IG cannot practically run.

#### Comparison to integrated Hessians\.

Integrated Hessians (IH) consistently underperforms MS-HVP across all evaluated tasks. This gap arises from a structural mismatch between IH’s weighting scheme and the correction required for attribution-patching error. Specifically, IH computes a path-integrated interaction term with weighting $\int_0^1 2t(1-t)\,dt = 1/3$, which induces an effective coefficient of approximately $1/4$ on the quadratic form $\delta^{\top} H \delta$. In contrast, exact second-order Taylor correction requires coefficient $1/2$. As a result, IH systematically under-corrects attribution-patching error even when curvature estimation is accurate.
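For reference, the two quantities being compared can be written out. The first line evaluates the IH path weight quoted above; the second is the standard second-order Taylor expansion of the metric, whose quadratic term carries the coefficient $1/2$. How the $1/3$ weight turns into the effective coefficient of roughly $1/4$ depends on details of the IH estimator that are not given in this section, so that step is not reproduced here.

```latex
\int_0^1 2t(1-t)\,dt
  = 2\left(\int_0^1 t\,dt - \int_0^1 t^2\,dt\right)
  = 2\left(\tfrac{1}{2} - \tfrac{1}{3}\right)
  = \tfrac{1}{3},
\qquad
M(a+\delta) - M(a)
  = \nabla M(a)^{\top}\delta
  + \tfrac{1}{2}\,\delta^{\top} H \delta
  + O\!\left(\|\delta\|^{3}\right).
```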

Empirically, this gap is consistent across all foreground tasks. Depending on the setting, IH trails MS-HVP by 3.8–7.2 percentage points despite equal or higher computational cost. This behavior is expected: IH was originally designed to attribute feature interactions, not to approximate finite perturbation effects. Our results therefore highlight an important distinction between interaction attribution and error correction objectives.

#### Comparison to GIM\.

GIM \[[9](https://arxiv.org/html/2606.09899#bib.bib2)\] targets a fundamentally different notion of faithfulness. Rather than minimizing per-component attribution error, it is designed for edge-level causal metrics and mediation-style objectives \[[26](https://arxiv.org/html/2606.09899#bib.bib22)\]. Consequently, improvements in edge faithfulness do not necessarily translate into improved component rankings.

Under our evaluation metrics, GIM frequently degrades attribution quality\. Across the evaluated tasks, GIM is significantly worse than vanilla attribution patching on nearly all per\-head error metrics, with the sole partial exception of GPT\-2 Greater\-Than, where it slightly improves top\-5 error while still reducing circuit\-recovery overlap\. The degradation is especially pronounced on larger models and factual\-completion settings, where higher\-order network curvature dominates the error\.

![Refer to caption](https://arxiv.org/html/2606.09899v1/x6.png) Figure 6: Multi-step correction and selective workflow. (a) $K$-sweep on Pythia-410M IOI: top-5 relative error vs. number of sub-steps $K$ for MS-HVP (blue) and IG (green squares) at matched per-step cost. (b) Selective-HVP on GPT-2 IOI: top-5 error vs. total backward-pass cost as more components are corrected, ranked by $\tilde{R}$. $\tilde{R}$-based selection (orange) vs. random baseline (gray).

### 4\.4Selective workflow and diagnostics

The reliability score $\tilde{R}$ provides a practical diagnostic for deciding when correction is worthwhile, completing the *Screen-Flag-Fix* pipeline.

#### Selective correction\.

Applying HVP only to $\tilde{R}$-flagged components (Figure [6](https://arxiv.org/html/2606.09899#S4.F6)b) captures 91% of full-HVP gain at 26% of the cost ($\tau{=}0.1$), validating Proposition [2](https://arxiv.org/html/2606.09899#Thmproposition2). The $\tilde{R}$-ranked selection substantially outperforms random selection (gray baseline in Figure [6](https://arxiv.org/html/2606.09899#S4.F6)b), confirming that $\tilde{R}$ identifies the components that benefit most from correction. Figure [7](https://arxiv.org/html/2606.09899#S4.F7) plots the empirical operating curve of the practical workflow: run attribution patching broadly, threshold $\tilde{R}$, and apply HVP only to the flagged subset. The left panel shows how many components are corrected as the threshold $\tau$ varies; the right panel shows the resulting reduction in median attribution-patching error, with the dotted lines marking the full-HVP ceiling for each model. At the fixed threshold $\tau = 0.3$, the selective pipeline flags only 7.0% of components on Pythia-410M, 14.9% on Qwen2.5-1.5B, and 19.0% on Gemma-2-2B, while still reducing median attribution-patching error by 10.1%, 23.9%, and 30.1%, respectively.
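The workflow described above (screen with attribution patching, flag by thresholding $\tilde{R}$, fix only the flagged subset) can be sketched as follows. The function and argument names are hypothetical. The reliability scores are taken as precomputed inputs because their definition is given elsewhere in the paper, and the correction callable stands in for an HVP pass.

```python
def screen_flag_fix(ap_scores, reliability, correct_fn, tau=0.3):
    """Apply a correction only where the reliability score exceeds tau.

    ap_scores:   attribution-patching estimates, one per component
    reliability: precomputed reliability scores (hypothetical inputs here)
    correct_fn:  callable i -> additive correction for component i
                 (stands in for an HVP pass)
    Returns corrected scores, flagged indices, and the fraction flagged.
    """
    flagged = [i for i, r in enumerate(reliability) if r > tau]
    scores = list(ap_scores)
    for i in flagged:
        scores[i] += correct_fn(i)
    return scores, flagged, len(flagged) / len(ap_scores)


# Illustrative inputs (not from the paper).
ap_scores = [1.20, -0.42, 0.40, 0.05]
reliability = [0.90, 0.05, 1.32, 0.02]
toy_corrections = {0: -0.28, 2: -0.13}

corrected, flagged, fraction = screen_flag_fix(
    ap_scores, reliability, lambda i: toy_corrections[i], tau=0.3)
print(flagged, fraction, corrected)
```

Because the correction callable is invoked only for flagged components, the extra cost scales with the flagged fraction rather than with the total number of components.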

![Refer to caption](https://arxiv.org/html/2606.09899v1/x7.png) Figure 7: Selective-HVP operating curve on three representative models. Left: fraction of components flagged for HVP correction as the threshold $\tau$ on $\tilde{R}$ varies. Right: reduction in median attribution-patching error relative to raw attribution patching; dotted lines show the full-HVP ceiling. At $\tau = 0.3$, the selective pipeline flags only 7.0%, 14.9%, and 19.0% of components on Pythia-410M, Qwen2.5-1.5B, and Gemma-2-2B, while still reducing median attribution-patching error by 10.1%, 23.9%, and 30.1%, respectively.

#### Diagnostic accuracy.

Figure [8](https://arxiv.org/html/2606.09899#S4.F8) evaluates the reliability score $\tilde{R}$ after stratifying components by perturbation magnitude $|\delta|$ and true effect magnitude $|f_{\mathrm{true}}|$. Q1–Q4 denote quartiles of the stratification variables. Although overall AUROC varies across models, performance is consistently strong in the regimes that matter most: for Q2–Q4, $\tilde{R}$ achieves AUROC above 0.97 across all evaluated models under both stratifications. Thus, the reliability score accurately identifies components where attribution patching is likely to be unreliable, even when applied without model-specific tuning.
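AUROC here measures how well $\tilde{R}$ separates unreliable from reliable estimates. A minimal pairwise implementation is sketched below. The function name and the toy inputs are assumptions, and a label of 1 is taken to mean an unreliable attribution-patching estimate.

```python
def auroc(scores, labels):
    """Probability that a random positive scores above a random negative.

    Ties count one half. labels: 1 = unreliable estimate, 0 = reliable.
    Pairwise O(n^2) version, adequate for small illustrative inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative reliability scores and labels (not from the paper).
r_scores = [0.9, 0.2, 0.35, 0.1]
unreliable = [1, 1, 0, 0]
print(auroc(r_scores, unreliable))
```

To reproduce a stratified analysis, the same function would be applied separately to the components falling in each quartile of the stratification variable.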

![Refer to caption](https://arxiv.org/html/2606.09899v1/x8.png) Figure 8: AUROC of $\tilde{R}$ stratified by $\|\delta\|$ quartile and $|f_{\mathrm{true}}|$ quartile. Q1 denotes the smallest-norm quartile.

#### Robustness analysis.

Supplementary experiments confirm that HVP’s gains are robust across corruption types \(Appendix[E\.3](https://arxiv.org/html/2606.09899#A5.SS3)\), semantic perturbations \(Appendix[E\.4](https://arxiv.org/html/2606.09899#A5.SS4)\), and evaluation sample sizes \(Appendix[E\.5](https://arxiv.org/html/2606.09899#A5.SS5)\)\. Across random\-token, cross\-prompt resample, and zero corruptions, HVP consistently reduces attribution\-patching error by roughly 79–90% on GPT\-2, Gemma\-2\-2B, and Pythia\-1B, with weaker gains only on the known pathological Pythia\-410M IOI setting where the Taylor approximation radius is violated\. HVP also generalizes beyond synthetic token replacements: under semantically coherent entity\-swap corruptions, it reduces error by 54% on GPT\-2 and 59% on Pythia\-410M, indicating that the second\-order correction remains effective for larger, structured activation perturbations\. Finally, aggregate error\-reduction estimates stabilize after approximately 10–15 evaluation prompts and converge to values consistent with the main results, suggesting that the reported gains are not driven by small\-sample effects\.

#### Computational efficiency\.

A primary practical concern is wall-clock timing. We measured attribution time on a single NVIDIA L40S GPU using 100 evaluation examples to estimate the cost of a selective workflow (running EAP to obtain edge scores, flagging components with $\tilde{R} > \tau$, and applying HVP only to flagged edges). As shown in Table [14](https://arxiv.org/html/2606.09899#A5.T14) (see Appendix [E.6](https://arxiv.org/html/2606.09899#A5.SS6)), the selective pipeline adds $\leq 20\%$ wall-clock overhead over raw EAP while capturing the most important corrections. Even full HVP adds only a modest $1.0$–$3.6\times$ overhead, as the HVP backward pass reuses the computation graph from the EAP forward pass.

### 4\.5Circuit\-recovery payoff

Better component estimates translate to improved circuit discovery. We evaluate this by ranking all attention heads by attribution magnitude and thresholding at the known circuit size to measure overlap with the activation-patching ground truth across multiple architectures and tasks, including GPT-2 IOI/Greater-Than \[[39](https://arxiv.org/html/2606.09899#bib.bib36),[16](https://arxiv.org/html/2606.09899#bib.bib38)\], Pythia-410M Greater-Than, and Gemma-2-2B Greater-Than. The full ranking performance is deferred to Appendix [E.7](https://arxiv.org/html/2606.09899#A5.SS7).

Attribution patching already provides a strong global ranking: Kendall $\tau$ rank correlations range from 0.35 on Pythia-410M to 0.72 on Gemma-2-2B factual. HVP matches or slightly improves these global correlations. However, HVP’s primary gains concentrate precisely at the ranking boundaries where circuit-membership decisions are made.

On GPT-2 IOI, HVP recovers all 20 ground-truth heads at the canonical boundary, compared to 95% for attribution patching. A concrete example illustrates why: head L4H11, a known duplicate-token head, ranks 7th by ground truth but 27th by attribution patching ($4.4\times$ underestimate of its causal effect). HVP partially recovers the true score and promotes L4H11 to rank 12; the reliability score successfully flags it with $\tilde{R} = 1.32 \gg 0.3$, correctly identifying it as a correction target (full case study in Appendix [E.10](https://arxiv.org/html/2606.09899#A5.SS10)). Table [4](https://arxiv.org/html/2606.09899#S4.T4) shows the pattern of boundary improvement on the Greater-Than task across model architectures:

- GPT-2 Greater-Than: MS-HVP with $K{=}5$ pushes top-5 recovery to 83.2% (vs. 70.1% for Attribution Patching).
- Pythia-410M Greater-Than: Top-5 overlap improves from 76.0% to 80.4% with standard HVP, while GIM degrades to 69.7%.
- Gemma-2-2B Greater-Than: This task yields the strongest correction in our study. Top-5 relative error drops from 7.41% (Attribution Patching) to 0.62% (Std HVP), a 91.6% reduction. The GeGLU activation in Gemma-2 introduces substantial network-level curvature that AP misses, representing an ideal regime for second-order correction.

These gains also carry over to MIB’s circuit-faithfulness metrics \[[26](https://arxiv.org/html/2606.09899#bib.bib22)\] and to SAE features. We defer the corresponding experimental results to Appendix [E.8](https://arxiv.org/html/2606.09899#A5.SS8) and Appendix [E.9](https://arxiv.org/html/2606.09899#A5.SS9), respectively.

Table 4: Top-$K$ head recovery on Greater-Than tasks ($N{=}200$). Overlap is the fraction of ground-truth heads recovered at each $K$. Best results per model are in bold.

## 5Conclusion

We showed that attribution\-patching error is dominated by the downstream network response, not local nonlinearity at the patched component\. This reframing explains why prior local corrections are structurally incomplete and motivates a simple fix: a Hessian\-vector product that captures the missing curvature\. The resulting Screen–Flag–Fix pipeline lets practitioners keep the speed of attribution patching where it is already accurate and apply a targeted second\-order correction where it is not, and scale to models where existing refinement methods are infeasible\. Natural next steps include extending the framework to multi\-token semantic corruptions, integrating with circuit\-completeness and minimality verification, and scaling the correction to sparse\-autoencoder features, where preliminary results are encouraging \(see Appendix[E\.9](https://arxiv.org/html/2606.09899#A5.SS9)\)\.

The current framework focuses on single\-token perturbations\. However, extending to multi\-token semantic corruptions is a natural next direction where preliminary results are promising \(Appendix[E\.4](https://arxiv.org/html/2606.09899#A5.SS4)\)\. While we demonstrate consistent gains up to 9B parameters, verifying behaviour at yet larger scales remains an open opportunity\. More broadly, HVP improves component\-level attribution accuracy but does not by itself address circuit completeness or minimality – integrating with verification tools is an exciting avenue for future work\.

## References

- \[1\] E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N. L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, et al. (2025) Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread 6, pp. 16318–16352.
- \[2\] (2026) Certified circuits: stability guarantees for mechanistic circuits. External Links: 2602.22968, [Link](https://arxiv.org/abs/2602.22968).
- \[3\] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. Van Der Wal (2023) Pythia: a suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conference on Machine Learning, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett (Eds.), Proceedings of Machine Learning Research, Vol. 202, pp. 2397–2430. External Links: [Link](https://proceedings.mlr.press/v202/biderman23a.html).
- \[4\] J. Bloom, C. Tigges, A. Duong, and D. Chanin (2024) SAELens. Note: [https://github.com/decoderesearch/SAELens](https://github.com/decoderesearch/SAELens).
- \[5\] C. Cartis, N. I. Gould, and P. L. Toint (2011) Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), pp. 245–295. External Links: ISSN 0025-5610.
- \[6\] V. Castin, P. Ablin, and G. Peyré (2024) How smooth is attention?. In Proceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (Eds.), Proceedings of Machine Learning Research, Vol. 235, pp. 5817–5840. External Links: [Link](https://proceedings.mlr.press/v235/castin24a.html).
- \[7\] A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso (2023) Towards automated circuit discovery for mechanistic interpretability. In Thirty-seventh Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=89ia77nZ8u).
- \[8\] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024) The llama 3 herd of models. External Links: 2407.21783, [Link](https://arxiv.org/abs/2407.21783).
- \[9\] J. Edin, R. Csordás, T. Ruotsalo, Z. Wu, M. Maistro, C. L. Christensen, J. Huang, and L. Maaløe (2026) GIM: improved interpretability for large language models. External Links: [Link](https://openreview.net/forum?id=ZRDYvWF1ZJ).
- \[10\] N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, et al. (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread 1, pp. 12.
- \[11\] T. Entesari, S. Sharifi, and M. Fazlyab (2024) Compositional curvature bounds for deep neural networks. In Proceedings of the 41st International Conference on Machine Learning, ICML’24.
- \[12\] A. Geiger, D. Ibeling, A. Zur, M. Chaudhary, S. Chauhan, J. Huang, A. Arora, Z. Wu, N. Goodman, C. Potts, and T. Icard (2025) Causal abstraction: a theoretical foundation for mechanistic interpretability. Journal of Machine Learning Research 26(83), pp. 1–64. External Links: [Link](http://jmlr.org/papers/v26/23-0058.html).
- \[13\]A\. Geiger, Z\. Wu, C\. Potts, T\. Icard, and N\. Goodman\(2024\-01–03 Apr\)Finding alignments between interpretable causal variables and distributed neural representations\.InProceedings of the Third Conference on Causal Learning and Reasoning,F\. Locatello and V\. Didelez \(Eds\.\),Proceedings of Machine Learning Research, Vol\.236,pp\. 160–187\.External Links:[Link](https://proceedings.mlr.press/v236/geiger24a.html)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px3.p1.1)\.
- \[14\]R\. Gupta, I\. Arcuschin, T\. Kwa, and A\. Garriga\-Alonso\(2024\)InterpBench: semi\-synthetic transformers for evaluating mechanistic interpretability techniques\.InThe Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=R9gR9MPuD5)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px2.p1.1)\.
- \[15\]I\. Hadad, G\. Katz, and S\. Bassan\(2026\)Formal mechanistic interpretability: automated circuit discovery with provable guarantees\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Timsb74vIY)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px2.p1.1)\.
- \[16\]M\. Hanna, O\. Liu, and A\. Variengien\(2023\)How does GPT\-2 compute greater\-than?: interpreting mathematical abilities in a pre\-trained language model\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=p4PckNQR8k)Cited by:[2nd item](https://arxiv.org/html/2606.09899#S4.I1.i2.p1.1),[§4\.5](https://arxiv.org/html/2606.09899#S4.SS5.p1.1)\.
- \[17\]S\. Heimersheim and N\. Nanda\(2024\)How to use and interpret activation patching\.External Links:2404\.15255,[Link](https://arxiv.org/abs/2404.15255)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px1.p1.1)\.
- \[18\]F\. R\. Jafari, O\. Eberle, A\. Khakzar, and N\. Nanda\(2025\)RelP: faithful and efficient circuit discovery via relevance patching\.External Links:2508\.21258,[Link](https://arxiv.org/abs/2508.21258)Cited by:[§1](https://arxiv.org/html/2606.09899#S1.p2.1),[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px2.p1.1),[Table 2](https://arxiv.org/html/2606.09899#S4.T2.90.8)\.
- \[19\]J\. D\. Janizek, P\. Sturmfels, and S\. Lee\(2021\-01\)Explaining explanations: axiomatic feature interactions for deep networks\.J\. Mach\. Learn\. Res\.22\(1\)\.External Links:ISSN 1532\-4435Cited by:[§1](https://arxiv.org/html/2606.09899#S1.p4.10),[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px3.p1.1),[§3\.1](https://arxiv.org/html/2606.09899#S3.SS1.p3.1)\.
- \[20\]J\. Jukić and J\. Šnajder\(2025\)From robustness to improved generalization and calibration in pre\-trained language models\.Transactions of the Association for Computational Linguistics13,pp\. 264–280\.External Links:[Link](https://aclanthology.org/2025.tacl-1.13/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00739)Cited by:[§3\.1](https://arxiv.org/html/2606.09899#S3.SS1.p3.1)\.
- \[21\]H\. Kim, G\. Papamakarios, and A\. Mnih\(2021\-18–24 Jul\)The lipschitz constant of self\-attention\.InProceedings of the 38th International Conference on Machine Learning,M\. Meila and T\. Zhang \(Eds\.\),Proceedings of Machine Learning Research, Vol\.139,pp\. 5562–5571\.External Links:[Link](https://proceedings.mlr.press/v139/kim21i.html)Cited by:[§3\.1](https://arxiv.org/html/2606.09899#S3.SS1.p3.1)\.
- \[22\]P\. W\. Koh and P\. Liang\(2017\-06–11 Aug\)Understanding black\-box predictions via influence functions\.InProceedings of the 34th International Conference on Machine Learning,D\. Precup and Y\. W\. Teh \(Eds\.\),Proceedings of Machine Learning Research, Vol\.70,pp\. 1885–1894\.External Links:[Link](https://proceedings.mlr.press/v70/koh17a.html)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px3.p1.1)\.
- \[23\]J\. Kramár, T\. Lieberum, R\. Shah, and N\. Nanda\(2024\)AtP\*: an efficient and scalable method for localizing llm behaviour to components\.External Links:2403\.00745,[Link](https://arxiv.org/abs/2403.00745)Cited by:[§1](https://arxiv.org/html/2606.09899#S1.p2.1),[§1](https://arxiv.org/html/2606.09899#S1.p3.4),[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px2.p1.1),[Figure 2](https://arxiv.org/html/2606.09899#S3.F2),[Figure 2](https://arxiv.org/html/2606.09899#S3.F2.2.1.1),[§4\.1](https://arxiv.org/html/2606.09899#S4.SS1.SSS0.Px4.p1.1),[Table 2](https://arxiv.org/html/2606.09899#S4.T2.90.8)\.
- \[24\]S\. Marks, C\. Rager, E\. J\. Michaud, Y\. Belinkov, D\. Bau, and A\. Mueller\(2025\)Sparse feature circuits: discovering and editing interpretable causal graphs in language models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=I4e82CIDxv)Cited by:[§1](https://arxiv.org/html/2606.09899#S1.p1.1),[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px1.p1.1)\.
- \[25\]M\. Méloux, M\. Peyrard, and F\. Portet\(2026\)Mechanistic interpretability as statistical estimation: a variance analysis of EAP\-IG\.External Links:[Link](https://openreview.net/forum?id=YD1P4DVtdk)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px2.p1.1)\.
- \[26\]A\. Mueller, A\. Geiger, S\. Wiegreffe, D\. Arad, I\. Arcuschin, A\. Belfki, Y\. S\. Chan, J\. F\. Fiotto\-Kaufman, T\. Haklay, M\. Hanna, J\. Huang, R\. Gupta, Y\. Nikankin, H\. Orgad, N\. Prakash, A\. Reusch, A\. Sankaranarayanan, S\. Shao, A\. Stolfo, M\. Tutek, A\. Zur, D\. Bau, and Y\. Belinkov\(2025\-13–19 Jul\)MIB: a mechanistic interpretability benchmark\.InProceedings of the 42nd International Conference on Machine Learning,A\. Singh, M\. Fazel, D\. Hsu, S\. Lacoste\-Julien, F\. Berkenkamp, T\. Maharaj, K\. Wagstaff, and J\. Zhu \(Eds\.\),Proceedings of Machine Learning Research, Vol\.267,pp\. 45069–45108\.External Links:[Link](https://proceedings.mlr.press/v267/mueller25a.html)Cited by:[§E\.8](https://arxiv.org/html/2606.09899#A5.SS8.p1.2),[Table 16](https://arxiv.org/html/2606.09899#A5.T16),[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px2.p1.1),[§4\.3](https://arxiv.org/html/2606.09899#S4.SS3.SSS0.Px3.p1.1),[§4\.5](https://arxiv.org/html/2606.09899#S4.SS5.p3.4)\.
- \[27\]N\. Nanda and J\. Bloom\(2022\)TransformerLens\.External Links:[Link](https://github.com/TransformerLensOrg/TransformerLens)Cited by:[§4\.1](https://arxiv.org/html/2606.09899#S4.SS1.SSS0.Px1.p1.1)\.
- \[28\]N\. Nanda\(2023\)Attribution patching: activation patching at industrial scale\.Note:[https://www\.neelnanda\.io/mechanistic\-interpretability/attribution\-patching](https://www.neelnanda.io/mechanistic-interpretability/attribution-patching)Cited by:[§1](https://arxiv.org/html/2606.09899#S1.p2.1),[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px1.p1.1)\.
- \[29\]Y\. Nesterov and B\. T\. Polyak\(2006\)Cubic regularization of newton method and its global performance\.Mathematical Programming108\(1\),pp\. 177–205\.Cited by:[§3\.1](https://arxiv.org/html/2606.09899#S3.SS1.p3.1)\.
- \[30\]B\. A\. Pearlmutter\(1994\)Fast exact multiplication by the hessian\.Neural Computation6\(1\),pp\. 147–160\.External Links:[Document](https://dx.doi.org/10.1162/neco.1994.6.1.147)Cited by:[§3\.3](https://arxiv.org/html/2606.09899#S3.SS3.SSS0.Px2.p2.9)\.
- \[31\]A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, and I\. Sutskever\(2019\)Language models are unsupervised multitask learners\.OpenAI\.Note:Accessed: 2024\-11\-15External Links:[Link](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)Cited by:[§4\.1](https://arxiv.org/html/2606.09899#S4.SS1.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.09899#S4.T1.2.3.1.1)\.
- \[32\]L\. Sharkey, B\. Chughtai, J\. Batson, J\. Lindsey, J\. Wu, L\. Bushnaq, N\. Goldowsky\-Dill, S\. Heimersheim, A\. Ortega, J\. I\. Bloom, S\. Biderman, A\. Garriga\-Alonso, A\. Conmy, N\. Nanda, J\. M\. Rumbelow, M\. Wattenberg, N\. Schoots, J\. Miller, W\. Saunders, E\. J\. Michaud, S\. Casper, M\. Tegmark, D\. Bau, E\. Todd, A\. Geiger, M\. Geva, J\. Hoogland, D\. Murfet, and T\. McGrath\(2025\)Open problems in mechanistic interpretability\.Transactions on Machine Learning Research\.Note:Survey CertificationExternal Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=91H76m9Z94)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px2.p1.1)\.
- \[33\]C\. Shi, N\. Beltran\-Velez, A\. Nazaret, C\. Zheng, A\. Garriga\-Alonso, A\. Jesson, M\. Makar, and D\. Blei\(2024\)Hypothesis testing the circuit hypothesis in LLMs\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=5ai2YFAXV7)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px2.p1.1)\.
- \[34\]N\. Sofroniew, I\. Kauvar, W\. Saunders, R\. Chen, T\. Henighan, S\. Hydrie, C\. Citro, A\. Pearce, J\. Tarng, W\. Gurnee, J\. Batson, S\. Zimmerman, K\. Rivoire, K\. Fish, C\. Olah, and J\. Lindsey\(2026\)Emotion concepts and their function in a large language model\.External Links:2604\.07729,[Link](https://arxiv.org/abs/2604.07729)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px1.p1.1)\.
- \[35\]M\. Sundararajan, A\. Taly, and Q\. Yan\(2017\)Axiomatic attribution for deep networks\.InProceedings of the 34th International Conference on Machine Learning \- Volume 70,ICML’17,pp\. 3319–3328\.Cited by:[§1](https://arxiv.org/html/2606.09899#S1.p4.10),[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.09899#S4.SS1.SSS0.Px5.p2.4)\.
- \[36\]A\. Syed, C\. Rager, and A\. Conmy\(2024\-11\)Attribution patching outperforms automated circuit discovery\.InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP,Y\. Belinkov, N\. Kim, J\. Jumelet, H\. Mohebbi, A\. Mueller, and H\. Chen \(Eds\.\),Miami, Florida, US,pp\. 407–416\.External Links:[Link](https://aclanthology.org/2024.blackboxnlp-1.25/),[Document](https://dx.doi.org/10.18653/v1/2024.blackboxnlp-1.25)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px1.p1.1),[§3\.3](https://arxiv.org/html/2606.09899#S3.SS3.SSS0.Px3.p2.2),[§4\.1](https://arxiv.org/html/2606.09899#S4.SS1.SSS0.Px5.p2.4)\.
- \[37\]G\. Team\(2024\)Gemma 2: improving open language models at a practical size\.External Links:2408\.00118,[Link](https://arxiv.org/abs/2408.00118)Cited by:[§4\.1](https://arxiv.org/html/2606.09899#S4.SS1.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.09899#S4.T1.2.7.5.1),[Table 1](https://arxiv.org/html/2606.09899#S4.T1.2.8.6.1)\.
- \[38\]Q\. Team\(2024\-09\)Qwen2\.5: a party of foundation models\.External Links:[Link](https://qwenlm.github.io/blog/qwen2.5/)Cited by:[§4\.1](https://arxiv.org/html/2606.09899#S4.SS1.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.09899#S4.T1.2.6.4.1)\.
- \[39\]K\. R\. Wang, A\. Variengien, A\. Conmy, B\. Shlegeris, and J\. Steinhardt\(2023\)Interpretability in the wild: a circuit for indirect object identification in GPT\-2 small\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=NpsVSN6o4ul)Cited by:[§E\.10](https://arxiv.org/html/2606.09899#A5.SS10.p4.1),[§1](https://arxiv.org/html/2606.09899#S1.p1.1),[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px1.p1.1),[1st item](https://arxiv.org/html/2606.09899#S4.I1.i1.p1.2),[§4\.1](https://arxiv.org/html/2606.09899#S4.SS1.SSS0.Px1.p2.2),[§4\.5](https://arxiv.org/html/2606.09899#S4.SS5.p1.1)\.
- \[40\]Z\. Wu, A\. Geiger, T\. Icard, C\. Potts, and N\. Goodman\(2023\)Interpretability at scale: identifying causal mechanisms in alpaca\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=nRfClnMhVX)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px3.p1.1)\.
- \[41\]F\. Zhang and N\. Nanda\(2024\)Towards best practices of activation patching in language models: metrics and methods\.External Links:2309\.16042,[Link](https://arxiv.org/abs/2309.16042)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px1.p1.1)\.
- \[42\]L\. Zhang, W\. Dong, Z\. Zhang, S\. Yang, L\. Hu, N\. Liu, P\. Zhou, and D\. Wang\(2026\)EAP\-GP: mitigating saturation effect in gradient\-based automated circuit identification\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=lGyXq0LOeQ)Cited by:[§2](https://arxiv.org/html/2606.09899#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix A Notations

Table 5: Core notation used in the main text.

## Appendix B Local smoothness examples of the Lipschitz-Hessian assumption

### B.1 Theoretical approximation

We include representative constants only to clarify the scale of the cubic remainder. These values are not required by the main theorem, which only assumes a finite $L_{3}$ on the segment of interest.

#### GeLU\.

$\text{GeLU}(z) = z\,\Phi(z)$, where $\Phi$ is the standard normal CDF. Differentiating three times gives $\text{GeLU}'''(z) = z(z^{2}-4)\,\phi(z)$, where $\phi$ is the standard normal PDF. The stationary points of $|\text{GeLU}'''(z)|$ satisfy $\text{GeLU}^{(4)}(z) = 0$, equivalently $z^{4} - 7z^{2} + 4 = 0$. The global maximizer is the smaller-magnitude root, $|z| = \sqrt{\frac{7-\sqrt{33}}{2}} \approx 0.792$ (the larger root, $|z| \approx 2.52$, is only a local maximum), which gives $\max_{z}|\text{GeLU}'''(z)| \approx 0.779$.

#### SiLU\.

$\text{SiLU}(z) = z\cdot\sigma(z)$, where $\sigma$ is the logistic sigmoid. The third derivative satisfies $\max_{z}|\text{SiLU}'''(z)| \approx 0.30818$, attained near $|z| \approx 1.032$ by numerical maximization. Thus $L_{3}^{\text{SiLU}} \approx 0.308$. This value is only illustrative: the main theorem itself does not rely on a closed-form constant.
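The two constants above can be reproduced with a short grid search. The sketch below is only a numerical sanity check under stated assumptions: the helper names (`gelu_third`, `silu_third`, `grid_argmax_abs`) and the search range and resolution are illustrative choices, not part of the paper's code.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def gelu_third(z):
    # Closed form quoted in the text: GeLU'''(z) = z (z^2 - 4) phi(z).
    return z * (z * z - 4.0) * normal_pdf(z)

def silu_third(z):
    # SiLU(z) = z * s(z)  =>  SiLU''' = 3 s'' + z s''', with s the logistic sigmoid.
    s = 1.0 / (1.0 + math.exp(-z))
    d1 = s * (1.0 - s)            # s'
    d2 = d1 * (1.0 - 2.0 * s)     # s''
    d3 = d1 * (1.0 - 6.0 * d1)    # s'''
    return 3.0 * d2 + z * d3

def grid_argmax_abs(fn, lo=0.0, hi=6.0, steps=60_000):
    # Both third derivatives are odd functions, so searching z >= 0 suffices.
    best_z, best_val = lo, abs(fn(lo))
    for k in range(1, steps + 1):
        z = lo + (hi - lo) * k / steps
        val = abs(fn(z))
        if val > best_val:
            best_z, best_val = z, val
    return best_z, best_val

gelu_z, gelu_max = grid_argmax_abs(gelu_third)
silu_z, silu_max = grid_argmax_abs(silu_third)
print(f"GeLU: |z| ~ {gelu_z:.3f}, max |GeLU'''| ~ {gelu_max:.3f}")
print(f"SiLU: |z| ~ {silu_z:.3f}, max |SiLU'''| ~ {silu_max:.3f}")
```

A grid search is used rather than a root finder because it makes no assumption about which stationary point is the global maximum.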

#### Bilinear maps\.

If the perturbed variable enters bilinearly while the other argument is held fixed, then the third derivative with respect to that variable is zero. This applies, for example, to perturbations of the value vector $V$ with fixed attention weights, but it is *not* a property of the full attention block when $Q$ or $K$ also vary.

#### Softmax and normalization layers\.

Softmax is real analytic, so its third derivative is finite on every compact subset of logit space. LayerNorm and RMSNorm are smooth wherever the normalization scale is bounded away from zero. Because Proposition [1](https://arxiv.org/html/2606.09899#Thmproposition1) requires only a finite constant on the single interpolation segment $\{a+t\delta\}_{t\in[0,1]}$, we treat these contributions through the segment-local $L_{3}$ rather than assert a universal transformer-wide bound.

### B.2 Empirical verification

Assumption 1 posits that the metric function's Hessian is $L_{3}$-Lipschitz along the interpolation path. We verify this empirically by computing Hessian–vector products $H(a_{c}+t\delta)\cdot\delta$ at three points ($t = 0, 0.5, 1$) for each prompt $\times$ attention-head pair, then estimating

$$\hat{L}_{3} = \max_{i}\frac{\|H(a_{c}+t_{i+1}\delta)\cdot\delta - H(a_{c}+t_{i}\delta)\cdot\delta\|}{(t_{i+1}-t_{i})\,\|\delta\|^{2}}\,.$$

This is a lower bound on the true $L_{3}$ (probed along one direction only). With $\hat{L}_{3}$ in hand, the cubic remainder bound becomes $\hat{\alpha} = \hat{L}_{3}\|\delta\|^{3}/6$, which we compare to the actual residual $|f_{\mathrm{true}} - f_{\mathrm{HVP}}|$.
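The estimator can be sketched in a few lines. In the example below, `estimate_l3` and the callback `hvp_at` are hypothetical names, and the toy metric $M(a) = \sum_i a_i^3/6$ is an assumption chosen so that the Hessian–vector product and the post-HVP residual are available in closed form; in practice `hvp_at` would wrap an autograd Hessian–vector product.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def estimate_l3(hvp_at, delta, ts=(0.0, 0.5, 1.0)):
    """Finite-difference estimate of L3 along one interpolation segment.

    hvp_at(t) must return the vector H(a_c + t * delta) @ delta.
    """
    products = [hvp_at(t) for t in ts]
    delta_sq = norm(delta) ** 2
    ratios = []
    for t0, t1, p0, p1 in zip(ts, ts[1:], products, products[1:]):
        diff = [b - a for a, b in zip(p0, p1)]
        ratios.append(norm(diff) / ((t1 - t0) * delta_sq))
    return max(ratios)

# Toy metric (an assumption for illustration): M(a) = sum_i a_i^3 / 6,
# whose Hessian is diag(a), so H(a) @ delta = [a_i * delta_i].
a_c = [0.2, -0.1]
delta = [3.0, 4.0]

def toy_hvp_at(t):
    return [(a + t * d) * d for a, d in zip(a_c, delta)]

l3_hat = estimate_l3(toy_hvp_at, delta)
alpha_hat = l3_hat * norm(delta) ** 3 / 6.0
# For this cubic toy metric the exact post-HVP residual is sum_i delta_i^3 / 6.
residual = abs(sum(d ** 3 for d in delta) / 6.0)
print(l3_hat, alpha_hat, residual, alpha_hat >= residual)
```

Because the difference quotient only probes the direction $\delta$, the estimate can undershoot the true constant, which is one way the bound can be violated on real models.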

Table [6](https://arxiv.org/html/2606.09899#A2.T6) reports the empirical $\hat{L}_{3}$ distribution and bound-tightness statistics for GPT-2 (20 prompts $\times$ 144 heads = 2,880 records, 2,877 nontrivial).

Table 6: Empirical $L_{3}$ estimation on GPT-2 (12L $\times$ 12H, 20 prompts). $\hat{\alpha} = \hat{L}_{3}\|\delta\|^{3}/6$ is the cubic remainder bound; $|\text{res}|$ is the actual HVP residual $|f_{\text{true}} - f_{\text{HVP}}|$. The bound holds when $\hat{\alpha} \geq |\text{res}|$.

The bound holds for 82.4% of component–prompt pairs, with a median slack factor of $3.4\times$ (the bound is typically $3.4\times$ the actual residual). The remaining 17.6% of pairs violate the bound and are concentrated in Layer 0, where $\hat{L}_{3}$ is an order of magnitude larger than in mid-network layers (median $0.065$ vs. $0.001$–$0.002$). This is consistent with the embedding layer's position-dependent structure producing sharper Hessian variation.

Across layers, median $\hat{L}_{3}$ decreases monotonically from $0.065$ (L0) to $0.0001$ (L11), confirming the structural prediction that later layers, with shorter downstream paths, exhibit smoother Hessian landscapes. This layer-depth gradient also explains why HVP correction is most impactful for early- and mid-layer heads (where $L_{3}\|\delta\|^{3}$ is large enough to matter) and nearly unnecessary for the last few layers.

Table [7](https://arxiv.org/html/2606.09899#A2.T7) reports the same analysis for Gemma-2-2B (26 layers $\times$ 8 heads = 208 heads, 20 prompts, 4,152 nontrivial records).

Table 7: Empirical $L_{3}$ estimation on Gemma-2-2B (26L $\times$ 8H, 20 prompts). Same protocol as Table [6](https://arxiv.org/html/2606.09899#A2.T6).

On Gemma-2-2B, the bound holds for 95.0% of pairs, higher than GPT-2's 82.4%, despite much larger perturbation norms (median $\|\delta\| = 4.81$ vs. $0.73$). The median slack factor is $7.6\times$, meaning the cubic bound is typically close to an order of magnitude larger than the actual HVP residual. Unlike GPT-2, Gemma-2-2B does not show a strong monotonic layer-depth gradient in $\hat{L}_{3}$: the per-layer medians hover around $0.0001$–$0.0009$ throughout the network, with occasional spikes at L7 (max $0.42$) and L3 (max $0.70$). This flatter profile is consistent with Gemma-2's use of grouped-query attention and SwiGLU, which distribute nonlinearity more uniformly across layers than GPT-2's standard architecture.

## Appendix C Omitted proofs

### C.1 Proof of Proposition [1](https://arxiv.org/html/2606.09899#Thmproposition1) (Local attribution-patching error bound via the reliability score)

###### Proof.

From the Taylor expansion with integral remainder,

$$M(a+\delta) = M(a) + \nabla_{a}M\cdot\delta + \tfrac{1}{2}\delta^{\top}H\delta + \Phi(\delta),$$

where the remainder takes the standard integral form:

$$\Phi(\delta) = \int_{0}^{1}(1-t)\,\delta^{\top}\bigl[\nabla_{a}^{2}M(a+t\delta) - H\bigr]\delta\,dt.$$

Under Assumption [1](https://arxiv.org/html/2606.09899#Thmassumption1),

$$\|\nabla_{a}^{2}M(a+t\delta) - H\|_{\mathrm{op}} \leq L_{3}\cdot t\|\delta\|,$$

and therefore

$$\begin{aligned}
|\Phi(\delta)| &\leq \int_{0}^{1}(1-t)\cdot L_{3}\,t\|\delta\|^{3}\,dt = L_{3}\|\delta\|^{3}\int_{0}^{1}t(1-t)\,dt \\
&= L_{3}\|\delta\|^{3}\cdot\frac{1}{6} = \frac{L_{3}\|\delta\|^{3}}{6}\,.
\end{aligned}\tag{10}$$

Here $\int_{0}^{1}t(1-t)\,dt = \frac{1}{2} - \frac{1}{3} = \frac{1}{6}$.

Defining $Q = \tfrac{1}{2}\delta^{\top}H\delta$, we have $E = Q + \Phi$, and the reverse triangle inequality gives

$$\bigl||E| - |Q|\bigr| \leq |\Phi|.$$

With

$$\alpha = \frac{L_{3}\|\delta\|^{3}}{3|\delta^{\top}H\delta|} = \frac{|\Phi|_{\max}}{|Q|},$$

where $|\Phi|_{\max} = L_{3}\|\delta\|^{3}/6$ is the bound from (10), we obtain

$$\bigl||E| - |Q|\bigr| \leq \alpha|Q|.$$

Dividing by $|\hat{\Delta}|$ and noting $|Q|/|\hat{\Delta}| = \tilde{R}$ gives

$$\left|\frac{|E|}{|\hat{\Delta}|} - \tilde{R}\right| \leq \alpha\tilde{R},$$

which is the first claimed inequality. If additionally $\alpha < 1$, then

$$(1-\alpha)|Q| \leq |E| \leq (1+\alpha)|Q|,$$

and dividing again by $|\hat{\Delta}|$ yields the two-sided inequality in Proposition [1](https://arxiv.org/html/2606.09899#Thmproposition1). ∎
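The two-sided inequality can be checked numerically on a one-dimensional example. The sketch below assumes a toy metric $M(a) = e^{a}$ (not a quantity from the paper), for which the gradient, the Hessian, and the segment-local $L_{3}$ are all known exactly; the variable names mirror the symbols in the proof.

```python
import math

# Toy 1-D metric (an assumption for illustration): M(a) = exp(a).
a, delta = 0.0, 0.5
grad = math.exp(a)           # M'(a)
hess = math.exp(a)           # M''(a)
l3 = math.exp(a + delta)     # sup of |M'''| on the segment [a, a + delta]

delta_hat = grad * delta                         # first-order estimate
true_delta = math.exp(a + delta) - math.exp(a)   # exact effect of the perturbation
E = true_delta - delta_hat                       # attribution-patching error
Q = 0.5 * delta * hess * delta                   # quadratic term
alpha = l3 * abs(delta) ** 3 / (3.0 * abs(delta * hess * delta))
R_tilde = abs(Q) / abs(delta_hat)                # reliability score

lower = (1.0 - alpha) * abs(Q)
upper = (1.0 + alpha) * abs(Q)
print(f"R~ = {R_tilde:.3f}, alpha = {alpha:.3f}")
print(f"{lower:.4f} <= |E| = {abs(E):.4f} <= {upper:.4f}")
```

Since $\alpha < 1$ here, the quadratic term $|Q|$ pins down the true error to within a factor of $1 \pm \alpha$.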

### C.2 Proof of Corollary [1](https://arxiv.org/html/2606.09899#Thmcorollary1) (Residual error after HVP correction)

###### Proof.

By definition,

$$\hat{\Delta}_{\text{hvp}} = \hat{\Delta} + \tfrac{1}{2}\delta^{\top}H\delta.$$

Combining this with the Taylor decomposition in ([2](https://arxiv.org/html/2606.09899#S3.E2)),

$$\Delta = \hat{\Delta} + \tfrac{1}{2}\delta^{\top}H\delta + \Phi(\delta),$$

we obtain

$$\Delta - \hat{\Delta}_{\text{hvp}} = \Phi(\delta).$$

The cubic remainder bound then follows immediately from the same Hessian-Lipschitz assumption used in Proposition [1](https://arxiv.org/html/2606.09899#Thmproposition1). ∎

### C.3 Proof of Proposition [2](https://arxiv.org/html/2606.09899#Thmproposition2) (Selective-HVP pipeline guarantee)

###### Proof.

For each independently patched component $i$, the single-component Taylor decomposition gives

$$\Delta_{i} - \hat{\Delta}_{i} = c_{i} + \Phi_{i}(\delta_{i}),\qquad |\Phi_{i}(\delta_{i})| \leq \frac{L_{3,i}}{6}\|\delta_{i}\|^{3}.$$

Subtracting the selective estimate

$$\hat{\Delta}_{\mathrm{sel}} = \sum_{i=1}^{n}\hat{\Delta}_{i} + \sum_{i\in S_{\mathrm{flag}}}c_{i}$$

from the sum of true independent effects gives

$$E_{\mathrm{sel}}^{\mathrm{ind}} = \sum_{i=1}^{n}(\Delta_{i} - \hat{\Delta}_{i}) - \sum_{i\in S_{\mathrm{flag}}}c_{i} = \sum_{i\in S_{\mathrm{ok}}}\bigl(c_{i} + \Phi_{i}(\delta_{i})\bigr) + \sum_{i\in S_{\mathrm{flag}}}\Phi_{i}(\delta_{i}).$$

By the triangle inequality,

$$|E_{\mathrm{sel}}^{\mathrm{ind}}| \leq \sum_{i\in S_{\mathrm{ok}}}|c_{i}| + \sum_{i=1}^{n}|\Phi_{i}(\delta_{i})|.$$

For every $i\in S_{\mathrm{ok}}$, the threshold definition gives $\tilde{R}_{i} = |c_{i}|/|\hat{\Delta}_{i}| < \tau$, hence

$$|c_{i}| < \tau|\hat{\Delta}_{i}|.$$

For the remainders,

$$|\Phi_{i}(\delta_{i})| \leq \frac{L_{3,i}}{6}\|\delta_{i}\|^{3}\qquad\text{for all } i.$$

Summing these bounds yields

$$|E_{\mathrm{sel}}^{\mathrm{ind}}| \leq \tau\sum_{i\in S_{\mathrm{ok}}}|\hat{\Delta}_{i}| + \frac{1}{6}\sum_{i=1}^{n}L_{3,i}\|\delta_{i}\|^{3},$$

which is exactly ([7](https://arxiv.org/html/2606.09899#S3.E7)). ∎

### C.4 Proof of Corollary [2](https://arxiv.org/html/2606.09899#Thmcorollary2)

###### Proof.

By definition,

$$E_{\mathrm{full}}^{\mathrm{ind}} = \sum_{i=1}^{n}\Delta_{i} - \sum_{i=1}^{n}(\hat{\Delta}_{i} + c_{i}).$$

Subtracting this from $E_{\mathrm{sel}}^{\mathrm{ind}} = \sum_{i}\Delta_{i} - \hat{\Delta}_{\mathrm{sel}}$ and using $\hat{\Delta}_{\mathrm{sel}} = \sum_{i}\hat{\Delta}_{i} + \sum_{i\in S_{\mathrm{flag}}}c_{i}$ gives

$$E_{\mathrm{sel}}^{\mathrm{ind}} - E_{\mathrm{full}}^{\mathrm{ind}} = \sum_{i=1}^{n}c_{i} - \sum_{i\in S_{\mathrm{flag}}}c_{i} = \sum_{i\in S_{\mathrm{ok}}}c_{i} = Q_{\mathrm{ok}}.$$

Taking absolute values and using $|c_{i}| = \tilde{R}_{i}|\hat{\Delta}_{i}| < \tau|\hat{\Delta}_{i}|$ for $i\in S_{\mathrm{ok}}$ yields ([9](https://arxiv.org/html/2606.09899#S3.E9)). ∎
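The bookkeeping in the two proofs above can be illustrated with a small sketch. Everything here is hypothetical: `selective_hvp` is an illustrative helper, the per-component numbers are made up, and the quadratic terms $c_{i}$ are taken as given even though a real pipeline would compute them only where needed.

```python
def selective_hvp(delta_hat, c, tau):
    """Return (selective estimate, flagged indices) from per-component values."""
    flagged = [i for i, (d, ci) in enumerate(zip(delta_hat, c))
               if abs(ci) / abs(d) >= tau]          # reliability score >= tau
    estimate = sum(delta_hat) + sum(c[i] for i in flagged)
    return estimate, flagged

# Hypothetical per-component values, chosen only for illustration.
delta_hat = [1.0, -0.5, 0.2]     # first-order estimates
c = [0.05, -0.30, 0.01]          # quadratic corrections c_i
phi = [0.001, -0.002, 0.0005]    # cubic remainders Phi_i
tau = 0.1

true_effects = [d + ci + p for d, ci, p in zip(delta_hat, c, phi)]
est_sel, flagged = selective_hvp(delta_hat, c, tau)
ok = [i for i in range(len(c)) if i not in flagged]

e_sel = sum(true_effects) - est_sel
e_full = sum(true_effects) - sum(d + ci for d, ci in zip(delta_hat, c))
q_ok = sum(c[i] for i in ok)                      # gap predicted by Corollary 2
gap_bound = tau * sum(abs(delta_hat[i]) for i in ok)
print(flagged, e_sel - e_full, q_ok, gap_bound)
```

Only the second component is flagged in this toy setting, and the gap between the selective and full corrections equals the sum of the uncorrected quadratic terms, which stays below $\tau\sum_{i\in S_{\mathrm{ok}}}|\hat{\Delta}_{i}|$.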

## Appendix D Full results of matched-compute comparisons

### D.1 MS-HVP vs. Integrated Gradients

Table [8](https://arxiv.org/html/2606.09899#A4.T8) reports paired-bootstrap $p$-values comparing MS-HVP $K{=}5$ (cost 10) against IG $S{=}10$ (cost 10) on all nine non-frontier tasks. Two tasks yield significant MS-HVP wins (GPT-2 IOI, $p<0.001$; Qwen2.5-1.5B IOI, $p=0.041$); the remaining seven are statistical ties ($p>0.1$).

Table 8: Matched-cost paired bootstrap: MS-HVP $K{=}5$ vs. IG $S{=}10$ (cost 10 each). $\Delta$ is MS-HVP minus IG (negative favours MS-HVP). $p$-values are one-sided paired bootstrap (10,000 resamples).

*Remark:* The main text reports a "per-head median over $K\in\{5,10,20\}$" achieving $17.57\%$ top-5 relative error at amortized cost $\sim$40. This is computed by taking, for each attention head, the median of its three MS-HVP error estimates at $K{=}5$, $K{=}10$, and $K{=}20$, then recomputing the top-5 ranking from the resulting per-head medians. Because different heads benefit from different sub-step counts, the median combiner can outperform any single $K$ in the table above. The amortized cost counts the three $K$-sweep runs as a single cost-$\sim$40 budget (since intermediate sub-step products are reused across $K$ values).
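A minimal sketch of such a median combiner follows. The function name, the dictionary layout, the head labels, and the numbers are all hypothetical, and the per-head quantity is treated as a generic score ranked by magnitude, which is an assumption about the ranking criterion.

```python
from statistics import median

def per_head_median_topk(estimates_by_k, k_top=5):
    """Combine per-head scores across sub-step counts K by median, then rank."""
    heads = next(iter(estimates_by_k.values())).keys()
    combined = {h: median(est[h] for est in estimates_by_k.values()) for h in heads}
    ranked = sorted(combined, key=lambda h: abs(combined[h]), reverse=True)
    return combined, ranked[:k_top]

# Hypothetical per-head scores for K = 5, 10, 20 (illustration only).
estimates_by_k = {
    5:  {"L0H1": 0.9, "L1H2": 0.1, "L2H0": -0.5},
    10: {"L0H1": 0.7, "L1H2": 0.3, "L2H0": -0.6},
    20: {"L0H1": 0.8, "L1H2": 0.2, "L2H0": -0.4},
}
combined, top2 = per_head_median_topk(estimates_by_k, k_top=2)
print(combined, top2)
```

The median is taken per head rather than per run, so each head can effectively draw on whichever $K$ values agree for it.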

### D.2 MS-HVP vs. Integrated Hessians

Table [9](https://arxiv.org/html/2606.09899#A4.T9) compares Integrated Hessians (IH) against MS-HVP across all three foreground tasks. Both IH-PI (path-integrated) and IH-DR (double Riemann) under-perform MS-HVP by $3.8$–$7.2$ percentage points. The gap arises because IH's $\alpha\beta$ weighting yields a $1/4$ coefficient on $\delta^{\top}H\delta$ instead of the $1/2$ needed for Taylor correction.
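The coefficient can be checked by hand in a simplified setting. The worked step below assumes that the Hessian is constant along the path and that the second-order IH term is an $\alpha\beta$-weighted double integral over $[0,1]^{2}$; both are illustrative simplifications rather than the general IH definition.

```latex
\int_{0}^{1}\!\!\int_{0}^{1} \alpha\beta\;\delta^{\top}H\delta\;d\alpha\,d\beta
  = \delta^{\top}H\delta\,
    \Bigl(\int_{0}^{1}\alpha\,d\alpha\Bigr)\Bigl(\int_{0}^{1}\beta\,d\beta\Bigr)
  = \tfrac{1}{4}\,\delta^{\top}H\delta
  \;\neq\; \tfrac{1}{2}\,\delta^{\top}H\delta .
```

Under these assumptions the IH second-order term recovers only half of the quadratic Taylor correction.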

Table 9: Integrated Hessians vs. MS-HVP on the three foreground tasks. IH-PI uses $S$ path interpolation steps; IH-DR uses an $S{\times}M$ double Riemann grid. All $p$-values are paired bootstrap vs. attribution patching.

### D.3 MS-HVP vs. GIM

Table [10](https://arxiv.org/html/2606.09899#A4.T10) shows that GIM under-performs attribution patching on 26 of 27 task $\times$ $K$ head-recovery settings, with losses of up to $21.8$ pp (Llama factual @5). The sole positive entry is Gemma-2-2B IOI @5 ($+0.4$ pp), which is within noise.

Table 10: GIM vs. attribution patching: top-$K$ head overlap across nine tasks. $\Delta$ is GIM minus attribution patching (negative = GIM worse).

## Appendix E Additional experimental results

### E.1 HVP implementation

Hessian–vector products are computed in PyTorch via `torch.autograd.grad` with `create_graph=True` on the first backward pass, so that the gradient can itself be differentiated on the second backward pass. We never explicitly form the Hessian; only the product $H\delta$ is computed, which is what keeps the correction tractable at 8B scale. Gradients are kept in float32 throughout (mixed precision elsewhere).
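The double-backward pattern described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: `hvp_corrected_estimate` and `toy_metric` are hypothetical names, and the cubic toy metric stands in for a real model metric evaluated at a patched activation.

```python
import torch

def hvp_corrected_estimate(metric_fn, a, delta):
    """Return (first-order estimate, estimate with 0.5 * delta^T H delta added)."""
    a = a.detach().clone().float().requires_grad_(True)
    delta = delta.detach().float()

    metric = metric_fn(a)
    # First backward pass: keep the graph so the gradient stays differentiable.
    (grad,) = torch.autograd.grad(metric, a, create_graph=True)
    # Second backward pass: a vector-Jacobian product of the gradient with delta,
    # which equals H @ delta (H is symmetric) without materializing H.
    (h_delta,) = torch.autograd.grad(grad, a, grad_outputs=delta)

    first_order = (grad.detach() * delta).sum()
    quadratic = 0.5 * (delta * h_delta).sum()
    return first_order, first_order + quadratic

# Toy metric (an assumption for illustration): M(a) = sum(a^3).
a = torch.tensor([0.5, -1.0, 2.0])
delta = torch.tensor([0.1, -0.2, 0.05])
toy_metric = lambda x: (x ** 3).sum()

first, corrected = hvp_corrected_estimate(toy_metric, a, delta)
true_effect = toy_metric(a + delta) - toy_metric(a)
print(float(true_effect - first), float(true_effect - corrected))
```

For this cubic toy metric the error left after the correction is exactly the third-order term $\sum_i \delta_i^3$, so the corrected error is much smaller than the first-order error.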

### E.2 Full per-model results

Table [11](https://arxiv.org/html/2606.09899#A5.T11) provides the complete per-model breakdown for the main generic sweeps, including residual-stream and post-activation components, plus the Gemma-2-2B attention-head closure run.

Table 11: Full HVP correction results across all component types and models. The Attrib. Pat. Err. and HVP Err. columns are overall medians across available component-prompt records. The Reduction column uses the prompt-level median percentage decrease in relative attribution-patching error, with bootstrap 95% CIs over prompts.

| Component | Model | Attrib. Pat. Err. (%) | HVP Err. (%) | Reduction (%) | $N$ |
| --- | --- | --- | --- | --- | --- |
| Attention heads | Pythia-410M | 4.1 | 1.1 | 72.0 [69.4, 74.7] | 20,993 |
| Pre-act neurons | Pythia-410M | 35.2 [31.1, 39.3] | 11.7 [9.4, 14.0] | 66.7 [60.2, 72.8] | 5,871 |
| Residual stream | Pythia-410M | 45.4 [39.5, 51.3] | 13.6 [10.4, 16.8] | 70.0 [63.7, 75.8] | 480 |
| Post-act neurons | Pythia-410M | 4.3 [3.7, 4.9] | 4.0 [3.4, 4.6] | 7.0 [2.1, 11.9] | 2,943 |
| Pre-act neurons | Pythia-2.8B | 35.4 [31.6, 39.2] | 15.6 [13.5, 17.7] | 55.8 [50.1, 61.5] | 27,393 |
| Post-act neurons | Pythia-2.8B | 6.2 [5.4, 7.0] | 6.1 [5.3, 6.9] | 2.2 [0.5, 3.9] | 27,393 |
| Attention heads | GPT-2 Small | 5.1 [4.1, 6.1] | 1.3 [0.9, 1.7] | 74.5 [68.3, 79.9] | 5,760 |
| Pre-act neurons | GPT-2 Small | 32.8 [28.4, 37.2] | 12.1 [9.6, 14.6] | 63.1 [56.8, 69.4] | 4,320 |
| Residual stream | GPT-2 Small | 42.1 [36.8, 47.4] | 14.2 [11.1, 17.3] | 66.3 [59.5, 72.4] | 240 |
| Attention heads | Qwen2.5-1.5B | 5.4 | 0.5 | 90.4 [89.5, 91.8] | 11,680 |
| MLP layer output | Qwen2.5-1.5B | 45.6 | 15.7 | 68.0 [59.4, 71.3] | 980 |
| Attention heads | Gemma-2-2B | 7.4 | 0.8 | 90.0 [88.7, 90.6] | 19,749 |

### E.3 Corruption-type robustness

Table[12](https://arxiv.org/html/2606.09899#A5.T12)evaluates HVP under three corruption styles: the main random\-token corruption, cross\-prompt resample corruption, and zero corruption\. The top block reports auxiliary model\-family checks; the bottom block tests the three foreground tasks from Table[2](https://arxiv.org/html/2606.09899#S4.T2)directly\. On clean tasks \(GPT\-2 IOI, Gemma\-2\-2B factual\), HVP reduction remains strong \(79–90%\)\. On the pathological task \(Pythia\-410M IOI\), Std HVP reduction is weaker \(36–50%\), consistent with the Taylor\-radius violation identified in §[4\.3](https://arxiv.org/html/2606.09899#S4.SS3)– MS\-HVPK≥3K\{\\geq\}3would be needed here\. The starred Pythia\-1B random\-token row is an outlier with unusually large perturbation norms\.

Table 12:Corruption robustness\.*Top block:*auxiliary models from initial robustness check\.*Bottom block:*foreground tasks from Table[2](https://arxiv.org/html/2606.09899#S4.T2)\. Reduction uses the prompt\-level median relative error reduction\. The starred Pythia\-1B random\-token row is an outlier with unusually large perturbation norms and only 320 records\.
### E.4 Semantic (entity-swap) corruption

To verify that HVP’s correction generalizes beyond random-token perturbations, we run per-head attribution patching and HVP correction on 46 entity-swap factual prompts (e.g., “The Eiffel Tower is located in” $\to$ “The Colosseum is located in”). These perturbations are multi-token, semantically coherent, and produce correlated activation shifts across layers – a more naturalistic corruption regime than the single-position random-token replacements used in the main text.

Table 13: Entity-swap (semantic) corruption results. Both models use 46 entity-swap prompt pairs (e.g., “The Eiffel Tower is located in” $\to$ “The Colosseum is located in”). MAE is the mean absolute error vs. ground-truth activation patching across all nontrivial head $\times$ prompt entries.

| Model | Method | MAE | Relative to AP |
|---|---|---|---|
| GPT-2 (144 heads, 7,056 records) | AP | 0.00340 | $1.00\times$ |
| GPT-2 (144 heads, 7,056 records) | HVP | 0.00156 | $0.46\times$ |
| GPT-2 (144 heads, 7,056 records) | *Error reduction: 54.1%* | | |
| Pythia-410M (384 heads, 18,816 records) | AP | 0.00272 | $1.00\times$ |
| Pythia-410M (384 heads, 18,816 records) | HVP | 0.00112 | $0.41\times$ |
| Pythia-410M (384 heads, 18,816 records) | *Error reduction: 58.8%* | | |

Both models show substantial error reduction under entity-swap corruption (54% for GPT-2, 59% for Pythia-410M), confirming that HVP’s second-order correction is not specific to random-token perturbations. Entity-swap corruptions produce larger, more structured $\delta$ vectors (because multiple token positions change), yet the Hessian–vector product still captures the dominant curvature. The slightly stronger reduction on Pythia-410M (58.8% vs. 54.1%) is consistent with this model’s deeper architecture (24 layers vs. 12), which amplifies cross-layer nonlinear interactions that the second-order term corrects.
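To make the role of the second-order term concrete, the following toy sketch compares a first-order (attribution-patching-style) estimate with a Hessian–vector-product correction on a small scalar function. Everything here is an illustrative assumption: the function `f`, the weights `W`, the inputs, and the finite-difference HVP stand in for a real model and its autodiff backward passes, and the code is not the authors' implementation.

```python
import math

W = [0.5, -1.0, 2.0]  # hypothetical readout weights of the toy "model"


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def f(x):
    """Toy nonlinear metric: tanh of a linear readout."""
    return math.tanh(dot(W, x))


def grad_f(x):
    """Analytic gradient of f (plays the role of one backward pass)."""
    g = 1.0 - math.tanh(dot(W, x)) ** 2
    return [g * w for w in W]


def hvp(x, v, eps=1e-4):
    """Hessian-vector product H(x) v via a central difference of gradients."""
    g_plus = grad_f([xi + eps * vi for xi, vi in zip(x, v)])
    g_minus = grad_f([xi - eps * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(g_plus, g_minus)]


def first_order(x, delta):
    """First-order estimate of f(x + delta) - f(x)."""
    return dot(grad_f(x), delta)


def second_order(x, delta):
    """First-order estimate plus the 1/2 * delta^T H delta correction."""
    return first_order(x, delta) + 0.5 * dot(delta, hvp(x, delta))


x = [0.4, 0.0, 0.0]       # "clean" activation (made up)
delta = [0.0, 0.0, 0.5]   # perturbation toward the "corrupted" activation
true_effect = f([xi + di for xi, di in zip(x, delta)]) - f(x)
ap_est = first_order(x, delta)
hvp_est = second_order(x, delta)
print(true_effect, ap_est, hvp_est)
```

On this toy input the corrected estimate lands closer to the exact effect than the first-order one, because `tanh` saturates along the perturbation direction and the gradient alone cannot see that.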

### E.5 Stability of HVP correction estimates

Figure [9](https://arxiv.org/html/2606.09899#A5.F9) plots the *aggregate* HVP error-reduction estimate as the number of evaluation prompts increases. Unlike the main-text tables, which report the prompt-level median reduction with prompt-bootstrap confidence intervals, this figure tracks the single aggregate statistic obtained from progressively larger prompt prefixes. Both architectures converge rapidly: the Pythia-410M estimate stabilizes in the low- to mid-70s after roughly 10–15 prompts and ends at 73.3% for 55 prompts, while the Qwen2.5-1.5B estimate remains near 90% throughout and ends at 90.2% for 35 prompts. This supports the claim that the scaled headline numbers are not driven by a lucky small-sample subset.
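A prefix-stability curve of this kind can be sketched as follows. The pooling rule used here (one minus the ratio of summed absolute errors over the first $n$ prompts) is an assumption made for illustration; the text does not spell out how the aggregate statistic is formed, and the input numbers are made up.

```python
def prefix_aggregate_reduction(ap_abs_err, hvp_abs_err):
    """Aggregate % error reduction over prompts 1..n, for every prefix length n.

    ap_abs_err, hvp_abs_err: per-prompt total absolute errors (hypothetical),
    listed in evaluation order.
    """
    curve = []
    ap_sum = 0.0
    hvp_sum = 0.0
    for a, h in zip(ap_abs_err, hvp_abs_err):
        ap_sum += a
        hvp_sum += h
        # Pooled reduction over everything seen so far.
        curve.append(100.0 * (1.0 - hvp_sum / ap_sum) if ap_sum > 0 else float("nan"))
    return curve


# Made-up per-prompt errors, for illustration only.
curve = prefix_aggregate_reduction([1.0, 2.0, 1.0], [0.5, 0.5, 0.0])
print([round(c, 2) for c in curve])
```

A curve that flattens after a handful of prompts indicates that adding more prompts no longer moves the pooled estimate, which is the property Figure 9 is meant to show.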

![Refer to caption](https://arxiv.org/html/2606.09899v1/x9.png) Figure 9: Stability of the *aggregate* HVP error-reduction estimate as the number of evaluation prompts grows. Both curves flatten after roughly 10–15 prompts. The final aggregate estimates are 73.3% for Pythia-410M (55 prompts) and 90.2% for Qwen2.5-1.5B (35 prompts).
### E.6 Wall-clock timing of the Screen–Flag–Fix pipeline

Table [14](https://arxiv.org/html/2606.09899#A5.T14) reports wall-clock attribution time for the edge-level circuit-discovery pipeline on two representative settings, measured on a single NVIDIA L40S GPU with 100 evaluation examples. The “selective” row estimates the Screen–Flag–Fix workflow: run EAP to obtain edge scores, flag components with $\tilde{R} > \tau$, then apply HVP corrections only to flagged edges.

Table 14: Wall-clock attribution time (seconds) on L40S. MIB evaluation time (model reloading + 10 sparsity sweeps) adds $\sim$12–24 min and is method-independent. The selective pipeline (Screen–Flag–Fix) applies HVP only to flagged edges ($\tau = 0.3$), combining EAP cost with a small HVP overhead.

| Model | Method | Edges | Attrib. (s) | Overhead vs. EAP |
|---|---|---|---|---|
| *GPT-2 (12L, 32,491 edges)* | EAP | 32,491 | 1.6 | $1.0\times$ |
| | HVP ($K = 1$) | 32,491 | 5.8 | $3.6\times$ |
| | MS-HVP ($K = 5$) | 32,491 | 4.5 | $2.8\times$ |
| | EAP-IG (inputs) | 32,491 | 4.9 | $3.1\times$ |
| | Selective HVP ($\tau = 0.3$, 7% flagged) | 32,491 | $\sim$1.9 | $\sim 1.2\times$ |
| *Qwen2.5-0.5B (24L, 179,749 edges)* | EAP | 179,749 | 5.9 | $1.0\times$ |
| | HVP ($K = 1$) | 179,749 | 6.1 | $1.0\times$ |
| | MS-HVP ($K = 5$) | 179,749 | 8.8 | $1.5\times$ |
| | Selective HVP ($\tau = 0.3$, 15% flagged) | 179,749 | $\sim$6.0 | $\sim 1.0\times$ |

The key takeaway: on both models, the selective pipeline adds $\leq 20\%$ wall-clock overhead over raw EAP while still capturing the most important corrections. Full HVP takes $1.0$–$3.6\times$ the EAP time, which is modest in absolute terms (seconds, not minutes) because the HVP backward pass reuses the same computation graph as the EAP forward pass. The dominant cost in a full MIB evaluation is the faithfulness sweep (reloading the model at each sparsity level), which is method-independent and takes 12–24 min.
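The selective rows in Table 14 are described as estimates. One simple way to sanity-check them is a linear cost model in which only the flagged fraction of edges pays the extra HVP cost. This model and the function name below are assumptions for illustration; the text does not state how the estimates were derived.

```python
def selective_time(t_eap, t_hvp, flagged_fraction):
    """Estimated Screen-Flag-Fix time: EAP on all edges, plus the extra
    HVP cost (t_hvp - t_eap) scaled by the fraction of flagged edges."""
    return t_eap + flagged_fraction * (t_hvp - t_eap)


# Timings (seconds) and flagged fractions taken from Table 14.
gpt2 = selective_time(t_eap=1.6, t_hvp=5.8, flagged_fraction=0.07)
qwen = selective_time(t_eap=5.9, t_hvp=6.1, flagged_fraction=0.15)

print(f"GPT-2: {gpt2:.2f} s, {gpt2 / 1.6:.2f}x EAP")
print(f"Qwen2.5-0.5B: {qwen:.2f} s, {qwen / 5.9:.2f}x EAP")
```

Under this assumed model both settings stay below the 20% overhead quoted above, and the GPT-2 figure is close to the tabulated value of about 1.9 s.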

### E.7 Ranking performance

Table [15](https://arxiv.org/html/2606.09899#A5.T15) supplements the error-reduction analysis with rank-correlation metrics computed from existing per-head attribution data. For each task and method, we aggregate per-head scores (mean $|\text{score}|$ across prompts), rank all $N$ heads, and compare to the ground-truth ranking via Kendall $\tau$ (global rank correlation) and NDCG@$K$ (quality of top-$K$ recovery, $K \in \{5, 10, 20\}$).
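The two ranking metrics can be sketched in a few lines. The version below assumes Kendall tau-a (ties counted as neither concordant nor discordant) and an NDCG whose gain is the ground-truth mean $|\text{score}|$ with a $\log_2$ rank discount; the text does not specify the tau variant or the gain function, so both are assumptions, and the example scores are made up.

```python
import math


def kendall_tau(a, b):
    """Kendall tau-a between two score lists of equal length (O(n^2))."""
    n = len(a)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (a[i] - a[j]) * (b[i] - b[j])
            s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return s / (n * (n - 1) / 2)


def ndcg_at_k(pred, true, k):
    """NDCG@k: rank heads by |pred|, score them by their |true| effect."""
    gains = [abs(t) for t in true]
    order = sorted(range(len(pred)), key=lambda i: -abs(pred[i]))
    ideal = sorted(gains, reverse=True)
    dcg = sum(gains[i] / math.log2(r + 2) for r, i in enumerate(order[:k]))
    idcg = sum(g / math.log2(r + 2) for r, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up aggregated per-head scores, for illustration only.
truth = [0.9, 0.5, 0.3, 0.1]
estimate = [0.8, 0.2, 0.4, 0.1]  # swaps the 2nd and 3rd heads
print(kendall_tau(estimate, truth), ndcg_at_k(estimate, truth, k=2))
```

Kendall $\tau$ weights every pair of heads equally, whereas NDCG@$K$ only looks at the top of the list, so the two metrics can move in different directions when a correction reorders a few high-ranked heads.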

Table 15: Ranking metrics across tasks and methods. $\tau$: Kendall rank correlation with ground truth (higher = better). NDCG@$K$: Normalized Discounted Cumulative Gain at rank $K$ (higher = better). All $p < 10^{-6}$. Best per task in bold.

Attribution patching’s global ranking is already strong: Kendall $\tau$ ranges from $0.35$ (Pythia-410M IOI, 384 heads) to $0.72$ (Gemma-2-2B factual). HVP matches or slightly improves $\tau$ on most tasks (e.g., $+0.004$ on Gemma-2-2B IOI, $+0.015$ on Gemma-2-2B factual), with negligible change on GPT-2 IOI ($-0.001$, within bootstrap noise). This is consistent with the main-text findings: HVP’s gains concentrate at ranking boundaries where circuit-membership decisions are made, not in global reorderings.

On Pythia-410M IOI, HVP improves $\tau$ ($+0.002$) and NDCG@20 ($+0.001$) but reduces NDCG@5 (from $1.00$ to $0.70$). This reflects a known trade-off in the pathological regime: Std HVP ($K = 1$) overshoots on a few high-$\tilde{R}$ heads (Table [2](https://arxiv.org/html/2606.09899#S4.T2)), reranking them away from the top 5; MS-HVP with $K \geq 3$ resolves this (§[4.3](https://arxiv.org/html/2606.09899#S4.SS3)).

NDCG@5 is $\geq 0.87$ in all non-pathological settings. On Llama-3.1-8B IOI (1,024 heads), only AP baseline data is currently available ($\tau = 0.409$); HVP comparisons will be added when frontier jobs complete.

### E.8 MIB benchmark comparison

To evaluate whether HVP’s per-component accuracy gains translate to circuit-level faithfulness, we run the MIB benchmark \[[26](https://arxiv.org/html/2606.09899#bib.bib22)\] on completed tasks. MIB measures two complementary metrics: CPR (circuit performance recovery, area under the curve; higher is better) and CMD (circuit metric deviation, area from 1; lower is better). We compare four methods: HVP ($K = 1$), MS-HVP ($K = 5$), EAP (standard attribution patching), and EAP-IG (all-at-once input-level IG as implemented in MIB).
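As a rough illustration of the two metrics, the sketch below integrates a faithfulness curve with the trapezoid rule: CPR as the area under the curve and CMD as the area between the curve and 1. The grid of circuit sizes, the curve values, and the function names are made up, and MIB's actual integration grid and normalization are not reproduced here.

```python
def trapezoid(xs, ys):
    """Trapezoid-rule integral of ys over xs (xs must be increasing)."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )


def cpr(sizes, faithfulness):
    """Area under the faithfulness curve (higher = better)."""
    return trapezoid(sizes, faithfulness)


def cmd(sizes, faithfulness):
    """Area between the faithfulness curve and 1 (lower = better)."""
    return trapezoid(sizes, [abs(1.0 - f) for f in faithfulness])


# Hypothetical curve: fraction of edges kept vs. circuit faithfulness.
sizes = [0.0, 0.5, 1.0]
faith = [0.0, 1.0, 1.0]
print(cpr(sizes, faith), cmd(sizes, faith))
```

Because CMD penalizes deviation in both directions, a circuit whose faithfulness rises above 1 can score a high CPR and still receive a poor CMD.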

Table 16: MIB circuit-faithfulness metrics on completed tasks. CPR: area under the performance recovery curve (higher = better). CMD: area from 1 in the metric deviation curve (lower = better). Avg Faithfulness (lower = better; see Mueller et al. \[[26](https://arxiv.org/html/2606.09899#bib.bib22)\] for definition). Best per column in bold.

On GPT-2 IOI, MS-HVP ($K = 5$) achieves the best CMD (0.265 vs. EAP’s 0.278), confirming that improved per-head accuracy translates to more faithful circuit recovery under MIB’s edge-knockout protocol. HVP ($K = 1$) also outperforms EAP on CPR (1.281 vs. 1.267). EAP-IG (input-level) performs poorly on this task (CMD 1.159), as EAP-IG computes a different quantity and is not expected to match per-head methods on node-level metrics.

On Gemma-2-2B MCQA, the second-order correction provides the clearest gains: MS-HVP ($K = 5$) reduces CMD from 0.506 (EAP) to 0.414, an 18% improvement in circuit faithfulness. Standard HVP also improves over EAP (CMD 0.471 vs. 0.506). EAP-IG again underperforms (CMD 0.604).

On Qwen2.5-0.5B MCQA, MS-HVP ($K = 5$) achieves the best CMD (0.137 vs. EAP’s 0.146), while standard HVP ($K = 1$) overshoots and worsens CMD to 0.204, consistent with the overcorrection phenomenon on small models documented in §[4.3](https://arxiv.org/html/2606.09899#S4.SS3). This confirms the practical recommendation to prefer MS-HVP with $K \geq 3$ over Std HVP when compute allows.

On Gemma-2-2B IOI, MS-HVP ($K = 5$) achieves the best CMD (0.327 vs. EAP’s 0.380), a 14% improvement in circuit faithfulness. Standard HVP ($K = 1$) severely overshoots on this task (CMD 0.927), more than doubling EAP’s deviation, consistent with the overcorrection pattern on modern architectures where the SwiGLU activation introduces strong curvature that a single correction step over-estimates. MS-HVP’s multi-step interpolation tames this overshoot and delivers the best overall faithfulness. EAP-IG massively overshoots (CPR 3.491, CMD 2.496), its worst result across all tasks; the all-at-once interpolation conflates cross-component interactions that are particularly strong in Gemma-2’s grouped-query attention.
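The overshoot of a single correction step, and the way sub-steps tame it, can be reproduced on a toy function. The sketch below is one plausible reading of a multi-step scheme: split the perturbation into $K$ equal segments and apply a second-order Taylor step on each. The toy function, weights, inputs, and this particular scheme are assumptions for illustration; the paper's exact MS-HVP algorithm is defined in the main text and may differ.

```python
import math

W = [0.5, -1.0, 2.0]  # hypothetical readout weights of the toy "model"


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def f(x):
    return math.tanh(dot(W, x))


def grad_f(x):
    g = 1.0 - math.tanh(dot(W, x)) ** 2
    return [g * w for w in W]


def hvp(x, v, eps=1e-4):
    """Hessian-vector product via a central difference of gradients."""
    g_plus = grad_f([xi + eps * vi for xi, vi in zip(x, v)])
    g_minus = grad_f([xi - eps * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(g_plus, g_minus)]


def multi_step_estimate(x, delta, K):
    """Sum of K second-order Taylor steps along the path x -> x + delta.
    K = 1 reduces to a single second-order correction."""
    step = [d / K for d in delta]
    total = 0.0
    for k in range(K):
        xk = [xi + k * si for xi, si in zip(x, step)]
        total += dot(grad_f(xk), step) + 0.5 * dot(step, hvp(xk, step))
    return total


x = [2.0, 0.0, 0.0]        # made-up "clean" activation
delta = [0.0, 0.0, -1.0]   # large perturbation that crosses the tanh bend
true_effect = f([xi + di for xi, di in zip(x, delta)]) - f(x)
est_k1 = multi_step_estimate(x, delta, K=1)
est_k5 = multi_step_estimate(x, delta, K=5)
print(true_effect, est_k1, est_k5)
```

On this input the single-step estimate is larger in magnitude than the exact effect, while five sub-steps bring the estimate much closer, mirroring the qualitative pattern reported for Std HVP versus MS-HVP.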

On Qwen2.5-0.5B IOI, the edge-level methods (EAP, HVP, MS-HVP) all achieve low CPR ($< 0.27$) and high CMD ($> 0.73$), reflecting the difficulty of the IOI circuit for this very small model. EAP-IG achieves higher CPR (1.680) and lower CMD (0.682), suggesting that the all-at-once interpolation path happens to produce better-calibrated edge scores on this task. Within the edge-level methods, HVP achieves the lowest CMD (0.735 vs. EAP’s 0.736), though differences are marginal.

On Llama-3.1-8B MCQA, the first larger-scale MIB task, all four methods achieve near-ideal CPR ($\approx 1.0$) and very low CMD ($< 0.20$), indicating that MIB’s edge-knockout protocol is well-behaved on this 8B-parameter model. MS-HVP ($K = 5$) achieves the best CPR (1.039, closest to ideal 1.0) and near-best CMD (0.046 vs. EAP’s 0.037). Standard HVP ($K = 1$) undershoots substantially (CPR 0.892, CMD 0.198), consistent with the overcorrection pattern: on this large model, a single correction step does not adequately approximate the integral, while 5 sub-steps recover accuracy. EAP-IG achieves competitive CPR (1.054) and CMD (0.055), suggesting that the all-at-once interpolation is better-calibrated on MCQA’s shorter, structured sequences than on IOI.

On Llama-3.1-8B IOI (3/4 methods complete), EAP achieves near-ideal CPR (0.969) and the lowest CMD across all tasks (0.031). Standard HVP ($K = 1$) shows mild overcorrection (CPR 0.939, CMD 0.060). Surprisingly, MS-HVP ($K = 5$) *overshoots* on this task (CPR 1.376, CMD 0.379): the multi-step correction produces edge scores that over-concentrate importance on the top edges, causing the circuit to exceed original-model performance at mid-sparsity (faithfulness $> 1$ at 2–10% of edges). This is consistent with the IOI task’s strong compositional structure in Llama-3.1-8B, where the second-order correction amplifies already-dominant edges. EAP achieves the best overall faithfulness on this task, suggesting that first-order attribution is well-calibrated for edge-level circuit discovery on large models with clean task structure. EAP-IG results are pending.

On Llama-3.1-8B $\times$ arithmetic\_addition, the edge-level methods (EAP, HVP, MS-HVP) achieve similar CPR ($\approx 0.52$) and CMD ($\approx 0.48$), with small inter-method differences ($< 1\%$). The low absolute faithfulness (average $\approx 0.33$) indicates that this task’s circuit is highly distributed – no sparse subgraph recovers majority performance. EAP-IG (inputs), which interpolates all sources jointly, dramatically outperforms the edge-level methods on this task: CPR 0.954 (vs. $\approx 0.52$), CMD 0.048 (vs. $\approx 0.48$). This is consistent with arithmetic being a distributed task where the all-at-once interpolation path captures the joint contribution of many edges simultaneously, while edge-level methods that score components independently miss the cooperative structure.

Llama-3 $\times$ arc\_easy is infeasible on a single L40S (the 8B model leaves insufficient memory for arc\_easy’s attention computation), and MIB’s hook-based attribution is incompatible with multi-GPU model parallelism.

### E.9 SAE feature-level HVP correction

To test whether HVP’s second-order correction extends beyond attention heads to *sparse autoencoder (SAE) features*, we decompose the residual stream at two layers of GPT-2 Small (layers 5 and 9) using pretrained SAEs from Bloom et al. \[[4](https://arxiv.org/html/2606.09899#bib.bib42)\] (JumpReLU, 768 $\to$ 24,576 features) and compute per-feature attribution patching and HVP correction on the IOI task (50 prompts).

For each prompt, we identify all SAE features with nontrivial ground-truth activation-patching effect ($|f_{\mathrm{true}}| > 10^{-8}$), yielding $\sim$2,500 features per layer. We then compare per-feature AP and HVP estimates against the ground truth.

Table 17: SAE feature-level attribution accuracy on GPT-2 IOI (50 prompts). Each row reports the mean absolute error (MAE) and Pearson correlation of per-feature AP and HVP estimates vs. ground-truth activation patching.

HVP reduces feature-level MAE by 81–86%, with correlation improving from 0.9996–0.9999 to effectively 1.0000. Layer 9 shows larger absolute errors (MAE $6.7 \times 10^{-4}$ for AP vs. $4.8 \times 10^{-5}$ at Layer 5), consistent with more features having large effects in later layers, but HVP’s relative improvement is even larger (85.5% vs. 81.5%).

This confirms that HVP’s second-order correction is not specific to attention heads: it generalizes to any decomposition of the residual stream, including SAE feature directions. The practical implication is that HVP can improve the accuracy of feature-level circuit discovery workflows that use SAEs to identify causally important features.

### E.10 Case study: L4H11 mis-ranking in GPT-2 IOI

We illustrate the practical impact of HVP correction with a concrete wrong-ranking example from the GPT-2 IOI circuit (50 prompts, 144 heads). Head L4H11 is a known *duplicate-token head* in the IOI circuit and ranks 7th by true activation-patching effect ($f_{\mathrm{true}} = 0.765$). Attribution patching severely underestimates its effect ($\hat{\Delta}_{\mathrm{AP}} = 0.174$, a $4.4\times$ underestimate), placing it at rank 27 – outside any reasonable top-$K$ circuit. HVP partially recovers the true score ($\hat{\Delta}_{\mathrm{HVP}} = 0.404$), promoting L4H11 to rank 12.

This single correction cascades into improved circuit recovery across all $K$ thresholds on this task:

Table 18: Top-$K$ overlap with ground-truth ranking on GPT-2 IOI (50 prompts). HVP recovers L4H11 and other underestimated heads, improving overlap at every threshold.

The L4H11 example is representative of a broader pattern: attribution patching tends to underestimate heads whose clean activation has large norm (producing large $\|\delta\|$ under name-swap corruption), precisely the regime where the second-order correction is most needed. The reliability score flags this head with $\tilde{R} = |0.404 - 0.174| / |0.174| = 1.32 \gg 0.3$, correctly identifying it as a candidate for HVP refinement.
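The arithmetic behind the L4H11 numbers can be checked directly. The sketch below uses the form of the reliability score that appears in the worked value above, $|\hat{\Delta}_{\mathrm{HVP}} - \hat{\Delta}_{\mathrm{AP}}| / |\hat{\Delta}_{\mathrm{AP}}|$, with threshold $\tau = 0.3$. The function names and the small `eps` guard against division by zero are illustrative additions.

```python
def reliability_score(ap, hvp, eps=1e-12):
    """Relative size of the second-order correction, |HVP - AP| / |AP|."""
    return abs(hvp - ap) / max(abs(ap), eps)


def is_flagged(ap, hvp, tau=0.3):
    """Flag a component for refinement when its score exceeds tau."""
    return reliability_score(ap, hvp) > tau


# Values for head L4H11 quoted in the case study.
f_true, ap_est, hvp_est = 0.765, 0.174, 0.404

score = reliability_score(ap_est, hvp_est)
underestimate = f_true / ap_est
print(f"R = {score:.2f}, flagged = {is_flagged(ap_est, hvp_est)}")
print(f"AP underestimates the true effect by {underestimate:.1f}x")
```

Both quantities match the figures quoted in the text once rounded: a score of about 1.32 and a roughly 4.4-fold underestimate.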

Figure [10](https://arxiv.org/html/2606.09899#A5.F10) visualizes the full $12 \times 12$ attribution landscape. In the ground-truth panel (a), L4H11 is clearly one of the brightest components; in the attribution-patching panel (b), it nearly vanishes; in the HVP-corrected panel (c), it is partially restored. Blue outlines mark the 23 known IOI circuit heads from Wang et al. \[[39](https://arxiv.org/html/2606.09899#bib.bib36)\].

![Refer to caption](https://arxiv.org/html/2606.09899v1/x10.png) Figure 10: Layer $\times$ head attribution heatmaps for GPT-2 IOI (log scale). (a) Ground truth (activation patching). (b) Attribution patching (first-order). (c) HVP-corrected (second-order). Red box: L4H11 (true rank 7, attribution-patching rank 27, HVP rank 12). Blue boxes: known IOI circuit heads.

Figure [11](https://arxiv.org/html/2606.09899#A5.F11) provides an alternative view, plotting only the 18 circuit heads grouped by functional role, with circle area proportional to attribution score. L4H11’s circle is barely visible in the attribution-patching panel but grows substantially in the HVP panel.

Finally, Figure [12](https://arxiv.org/html/2606.09899#A5.F12) shows the attention patterns of three key circuit heads on a representative IOI prompt (“When John and Mary went to the store, Mary gave a drink to”). L10H7 (name mover) attends strongly to “John” (the indirect object) at the prediction position; L9H9 (negative name mover) also attends to “John” but with a negative contribution; and L4H11 (duplicate token) exhibits a characteristic diagonal pattern, detecting repeated tokens. The nonlinear interaction between L4H11’s duplicate-token detection and the name-swap corruption explains why its attribution-patching estimate is so far from the true activation-patching effect.

![Refer to caption](https://arxiv.org/html/2606.09899v1/x11.png) Figure 11: IOI circuit heads grouped by functional role (GPT-2 Small). Circle area $\propto$ attribution score. (a) Attribution patching scores. (b) HVP-corrected scores. L4H11 (red outline, duplicate-token head) is severely underestimated by attribution patching ($0.17$ vs. true $0.77$) and partially recovered by HVP ($0.40$).

![Refer to caption](https://arxiv.org/html/2606.09899v1/x12.png) Figure 12: Attention patterns of three key IOI circuit heads on a representative prompt. Each panel shows the full query $\times$ key attention matrix; the red horizontal line marks the prediction position (last token “to”). Left: L10H7 (name mover) attends 74% to “John” (IO). Center: L9H9 (negative name mover) attends 85% to “John”. Right: L4H11 (duplicate token, red title) detects repeated tokens via a diagonal pattern. Attribution patching underestimates L4H11 by $4.4\times$.