Localizing Anchoring Pathways in Language Models

arXiv cs.CL Papers

Summary

This paper investigates how irrelevant numbers in prompts cause anchoring effects in language models and localizes the internal pathways carrying this signal using attribution-based circuit methods on Qwen and Llama models.

arXiv:2606.12818v1 Announce Type: new Abstract: Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B--8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.

Cached at: 06/12/26, 08:50 AM

# Localizing Anchoring Pathways in Language Models
Source: [https://arxiv.org/html/2606.12818](https://arxiv.org/html/2606.12818)
Hillary N. Owusu, Sarah Wiegreffe, Naomi H. Feldman (University of Maryland, College Park; {hnyowusu,sarahwie,nhf}@umd.edu)

###### Abstract

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning\. We study where this anchor\-sensitive signal is carried inside language models using a controlled multiple\-choice setup with shared answer options\. We define a logit\-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring\. Using attribution\-based circuit localization on 7B–8B Qwen and Llama base and instruction\-tuned models, we find that edge\-level methods recover this signal more faithfully than node\-level methods\. Low\- and high\-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction\. However, sparse transfer across base and instruction\-tuned variants is less reliable, indicating that post\-training changes which pathways matter most\. Overall, our results provide a mechanistic account of how anchoring\-related decision signals are carried inside language models\.


## 1 Introduction

A language model can know the right answer and still be moved by the wrong context. In *anchoring*, a cognitive bias that both humans and large language models (LLMs) exhibit, an irrelevant number shifts a subsequent judgment, pulling estimates toward that number (Tversky and Kahneman, [1974](https://arxiv.org/html/2606.12818#bib.bib4); Takenami et al., [2025](https://arxiv.org/html/2606.12818#bib.bib9); [Figure 1](https://arxiv.org/html/2606.12818#S1.F1)). For LLMs, this phenomenon exposes a decision-level robustness failure: irrelevant prompt content can change how the model ranks competing answers at the moment of prediction.

Prompt variants:

- Low anchor: "The slot machine stopped on 15."
- High anchor: "The slot machine stopped on 49."
- Token-matched X control: "The slot machine stopped on XXXX."

Question:

> [slot-machine line]
> What is the duration of one moon orbit around the earth in days?
> Choose the correct answer. Respond with the letter only.
> A) 15 B) 19 C) 23 D) 27 E) 31 F) 35 G) 39 H) 44 I) 49
> Answer:

Figure 1: Anchors (shown in purple) given in the prompt can shift language model predictions for the subsequent question (shown in brown). We study anchoring bias both behaviorally and mechanistically by converting questions from the OpAQ dataset (Röseler et al., [2022](https://arxiv.org/html/2606.12818#bib.bib15)) to the multiple-choice format shown (compactly) here (ground truth 27 days, low anchor 15, high anchor 49).

Prior work has shown that LLMs exhibit anchoring (Lou and Sun, [2024](https://arxiv.org/html/2606.12818#bib.bib8); Takenami et al., [2025](https://arxiv.org/html/2606.12818#bib.bib9)), but has mostly treated the model as a black box: the anchor is added, the prediction shifts, and the size of that shift is measured. What this behavioral view does not explain is *how* an irrelevant number affects the internal computation leading to the final answer. Here we tackle the question of mechanism by asking which internal pathways support the anchor-sensitive competition between the correct answer and the answer corresponding to the anchor.

Mechanistic interpretability provides tools for determining which internal parts of the model carry a behavior from prompt to output (Saphra and Wiegreffe, [2024](https://arxiv.org/html/2606.12818#bib.bib29); Mueller et al., [2025a](https://arxiv.org/html/2606.12818#bib.bib30); Geiger et al., [2025](https://arxiv.org/html/2606.12818#bib.bib32)). Circuit localization (Olah et al., [2020](https://arxiv.org/html/2606.12818#bib.bib33)) isolates a minimal subset of LLM components that faithfully replicates the full LLM's inference-time behavior. In circuit localization, a model is treated as a graph: nodes correspond to internal components, such as attention heads or multilayer perceptron (MLP) blocks, and edges correspond to pathways through which information flows between components. The goal is to identify which nodes (Wang et al., [2022](https://arxiv.org/html/2606.12818#bib.bib21)) or edges (Goldowsky-Dill et al., [2023](https://arxiv.org/html/2606.12818#bib.bib40)) contribute most to a target behavior or metric.

In this paper, we frame numerical anchoring as a problem for mechanistic interpretability\. We make three contributions\. First, we convert anchoring into a controlled multiple\-choice task with a validated answer\-level metric that measures whether the active anchor answer becomes more competitive with the correct answer\. Second, using this metric, we show that anchoring can be linked to identifiable internal pathways: edge\-level localization is more faithful than node\-level localization, suggesting that the effect is better understood through pathways between components than through isolated “biased” components\. Third, we test whether these pathways transfer across anchor direction and instruction tuning\. We find that Qwen and Llama show different layer\-wise attribution patterns, but that in both families, low\- and high\-anchor circuits are closely related within a model: overlapping in their top\-ranked edges, localizing to nearly identical layer regions, and transferring well across anchor contrasts\. In contrast, instruction tuning preserves the broad layer\-wise location of attribution but changes which sparse edges matter most\.

By characterizing these pathways, we provide a mechanistic account of how a validated anchor\-sensitive decision signal is carried inside language models and how the relevant pathways vary across model family and instruction tuning\.

## 2 Related Work

### 2.1 Anchoring and Cognitive Biases in LLMs

Anchoring is a classic human judgment bias in which exposure to an arbitrary initial value shifts subsequent estimates toward that value (Tversky and Kahneman, [1974](https://arxiv.org/html/2606.12818#bib.bib4)). Recent work shows that large language models (LLMs) exhibit anchoring-like effects across numerical estimation, negotiation, and decision-making tasks (Lou and Sun, [2024](https://arxiv.org/html/2606.12818#bib.bib8); Takenami et al., [2025](https://arxiv.org/html/2606.12818#bib.bib9); Valencia-Clavijo, [2025](https://arxiv.org/html/2606.12818#bib.bib12); Huang et al., [2026](https://arxiv.org/html/2606.12818#bib.bib11)). In numerical estimation, prior behavioral work further shows that anchoring depends on confidence and post-training: models can resist anchors when they are confidently wrong, while low-confidence models remain susceptible even when they are accurate (Owusu and Feldman, [2026](https://arxiv.org/html/2606.12818#bib.bib28)). This finding suggests that anchoring is not simply a factual-knowledge failure, but a robustness failure tied to how strongly the model favors its current answer distribution.

Recent studies have also begun to examine mechanisms of anchoring. Valencia-Clavijo ([2025](https://arxiv.org/html/2606.12818#bib.bib12)) uses structured prompt-field attribution to quantify how anchor fields affect model log-probabilities, while Huang et al. ([2026](https://arxiv.org/html/2606.12818#bib.bib11)) use activation patching to show that anchoring-related effects can appear in relatively shallow layers. Our work builds on this direction by moving from activation-level interventions to circuit localization: we use attribution patching to efficiently rank a large candidate set of nodes and edges. This allows us to evaluate faithful subgraphs, compare low- and high-anchor circuits, and ask whether depth localization is stable across model families and instruction-tuned variants.

### 2.2 Circuit Localization and Mechanistic Interpretability

Circuit localization aims to identify candidate components or pathways that support a target behavior. Early causal approaches relied on activation patching or related interventions (Vig et al., [2020](https://arxiv.org/html/2606.12818#bib.bib38); Meng et al., [2022](https://arxiv.org/html/2606.12818#bib.bib22)), while recent work uses scalable attribution-based approximations. Following the Mechanistic Interpretability Benchmark (MIB; Mueller et al., [2025b](https://arxiv.org/html/2606.12818#bib.bib13)), we distinguish node- and edge-level attribution patching methods: Node Attribution Patching (NAP) scores individual components, while Edge Attribution Patching (EAP) scores directed pathways between components (Nanda, [2023](https://arxiv.org/html/2606.12818#bib.bib39); Kramár et al., [2024](https://arxiv.org/html/2606.12818#bib.bib27); Syed et al., [2024](https://arxiv.org/html/2606.12818#bib.bib19)). Integrated-gradient variants, NAP-IG and EAP-IG, improve attribution quality by averaging gradients along an interpolation path rather than relying on a single local gradient (Hanna et al., [2024](https://arxiv.org/html/2606.12818#bib.bib20)).

Standard circuit benchmarks evaluate these methods using *faithfulness*: whether a localized subgraph recovers the full model's behavior on a task (Hanna et al., [2024](https://arxiv.org/html/2606.12818#bib.bib20)). These tasks include entity-tracking tasks such as indirect object identification, arithmetic, multiple-choice question answering, and scientific reasoning (Wang et al., [2022](https://arxiv.org/html/2606.12818#bib.bib21); Stolfo et al., [2023](https://arxiv.org/html/2606.12818#bib.bib23); Wiegreffe et al., [2025](https://arxiv.org/html/2606.12818#bib.bib24); Mueller et al., [2025b](https://arxiv.org/html/2606.12818#bib.bib13)). In many of these settings, the behavior can be summarized by a natural scalar metric, such as the logit difference between a correct and a counterfactual answer (Zhang and Nanda, [2024](https://arxiv.org/html/2606.12818#bib.bib35)).

Anchoring differs from these settings because the behavior of interest is not simply whether the model answers correctly, but whether an irrelevant number pulls probability toward the anchor answer. Following prior work emphasizing that circuit evaluations depend on the chosen metric (Mueller et al., [2025b](https://arxiv.org/html/2606.12818#bib.bib13)), we make this choice explicit for anchoring. We use a standard logit-difference form, comparing the correct answer with the answer option corresponding to the active low or high anchor. We then validate that this metric tracks behavioral anchoring before using it for attribution-based circuit localization.

## 3 Multiple-Choice Anchoring Task

### 3.1 Task Setup

We adapt 100 numerical-estimation questions from the Open Anchoring Question Dataset (OpAQ; Röseler et al., [2022](https://arxiv.org/html/2606.12818#bib.bib15)) into a controlled multiple-choice format for mechanistic analysis, following prior work that uses fixed answer options to study model behavior in question answering (Clark et al., [2018](https://arxiv.org/html/2606.12818#bib.bib25); Lieberum et al., [2023](https://arxiv.org/html/2606.12818#bib.bib34); Wiegreffe et al., [2025](https://arxiv.org/html/2606.12818#bib.bib24)). This format serves as a controlled mechanistic probe: it tests whether anchoring effects appear in a fixed answer space while providing well-defined single-token answer-label logits for circuit localization.

Each item provides a factual question, a ground\-truth numerical answer, a low anchor, and a high anchor\. For each item, we construct a nine\-option candidate set containing the ground\-truth value, the low and high anchors, and intermediate values spanning the anchor range\. The same candidate set is used across anchored and control prompts\. The model is instructed to respond with the answer letter only, so each candidate is scored through a single answer\-label token rather than through open\-ended numerical generation\.

[Figure 1](https://arxiv.org/html/2606.12818#S1.F1) shows the prompt structure. For each item, we create a low-anchor prompt and a high-anchor prompt by prepending an irrelevant slot-machine sentence containing the corresponding anchor value. Each anchored prompt is paired with its own tokenizer-matched control, where the numerical anchor is replaced by a non-numeric `X` string of matching token length. Thus, the low-anchor contrast compares the low-anchor prompt to its matched control, and the high-anchor contrast compares the high-anchor prompt to its matched control.

Because multiple-choice models can be sensitive to answer-letter and position order (Pezeshkpour and Hruschka, [2024](https://arxiv.org/html/2606.12818#bib.bib26)), we evaluate each item under multiple option orderings. Behavioral analyses average over 20 random option permutations per question to reduce letter- and position-order artifacts. Attribution-metric validation and circuit localization use four fixed cyclic rotations per question for computational tractability, yielding 400 paired examples per anchor contrast. We also test sensitivity to the number of answer options in Appendix [A](https://arxiv.org/html/2606.12818#A1).
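The rotation scheme can be illustrated with a short sketch. Which four rotations are fixed is not stated here, so the shifts 0 through 3 below are an assumption, as are the helper names `cyclic_rotations` and `label_of`; the option values are the ones shown in Figure 1.

```python
from string import ascii_uppercase


def cyclic_rotations(options, num_rotations=4):
    """Return cyclic shifts 0..num_rotations-1 of the option list (assumed choice of shifts)."""
    return [options[s:] + options[:s] for s in range(num_rotations)]


def label_of(value, ordering):
    """Letter (A, B, C, ...) assigned to `value` under a given option ordering."""
    return ascii_uppercase[ordering.index(value)]


# Candidate values from Figure 1: ground truth 27, low anchor 15, high anchor 49.
options = [15, 19, 23, 27, 31, 35, 39, 44, 49]
rotations = cyclic_rotations(options)

# The letter carrying the correct answer changes with every rotation.
correct_labels = [label_of(27, r) for r in rotations]
```

With 100 questions and four orderings each, a scheme like this gives the 400 paired examples per anchor contrast.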

We use four open-weight decoder-only language models from the Qwen2.5 and Llama-3.1 families (Yang et al., [2024](https://arxiv.org/html/2606.12818#bib.bib41); Grattafiori et al., [2024](https://arxiv.org/html/2606.12818#bib.bib42)). Within each family, we compare a base model to its instruction-tuned counterpart, allowing us to examine how instruction tuning changes both behavioral anchoring and internal localization. For Qwen, we evaluate Qwen2.5-7B and Qwen2.5-7B-Instruct; for Llama, we evaluate Llama-3.1-8B and Llama-3.1-8B-Instruct. The Qwen models have 28 transformer layers, while the Llama models have 32 layers, a difference we account for when comparing localization depth. All models are run locally using the Hugging Face `transformers` library (Wolf et al., [2020](https://arxiv.org/html/2606.12818#bib.bib18)). For compactness in tables and figures, we abbreviate these models as Qwen-7B, Qwen-Inst, Llama-8B, and Llama-Inst, respectively.

Table 1: Behavioral anchoring and attribution-metric validation. Entries are low / high. EV shift is relative to the tokenizer-matched control; negative EV shifts indicate movement toward the low anchor, while positive shifts indicate movement toward the high anchor. $\Delta m$ is the anchored-minus-control change in the correct–anchor logit difference; $\rho_{\mathrm{EV}}$ correlates $-\Delta m$ with anchor-consistent EV shift.

### 3.2 Behavioral and Attribution Metrics

We use behavioral metrics to measure the strength and direction of anchoring before applying circuit localization. For each item and condition, we compute the model's probability distribution over numerical candidate values by scoring the answer-label tokens, mapping label probabilities back to candidate values, and averaging over option-order permutations. We summarize anchor effects using normalized expected-value (EV) shift: the anchored-minus-control change in expected value, normalized over the candidate range, so negative values indicate movement toward the low anchor and positive values indicate movement toward the high anchor. We also compute total variation distance (TVD), which captures the overall size of the distributional shift. Full metric details and additional behavioral analyses are provided in Appendix [B](https://arxiv.org/html/2606.12818#A2).
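As a rough illustration of these two behavioral summaries, the sketch below computes a normalized EV shift and a TVD for one item. The exact definitions are in Appendix B; here we assume that "normalized over the candidate range" means dividing by the largest minus the smallest candidate value, and that TVD is half the L1 distance between the two distributions. The function names and the probabilities are hypothetical.

```python
def expected_value(probs, values):
    """Expected candidate value under a probability distribution over candidates."""
    return sum(p * v for p, v in zip(probs, values))


def normalized_ev_shift(p_anchor, p_control, values):
    """Anchored-minus-control change in expected value, divided by the candidate range (assumed normalization)."""
    span = max(values) - min(values)
    return (expected_value(p_anchor, values) - expected_value(p_control, values)) / span


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over the same candidates."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical two-candidate item: the anchored prompt moves mass toward the low value.
values = [0, 10]
p_control = [0.5, 0.5]
p_anchor = [0.8, 0.2]

shift = normalized_ev_shift(p_anchor, p_control, values)  # negative: toward the low anchor
tvd = total_variation_distance(p_anchor, p_control)
```

The sign of `shift` carries the direction of movement, while `tvd` only reports how much probability mass moved.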

For circuit localization, we need a single answer-level metric whose gradients can be used to score internal nodes or edges. We use a logit difference that compares the correct answer with the answer option corresponding to the active anchor. For each item and anchor direction, let $y_{\mathrm{correct}}$ be the option label assigned to the ground-truth value, and let $y_{\mathrm{anchor}}$ be the option label assigned to the active low or high anchor value.

At the final answer-label prediction position, we define

$$m = \mathrm{logit}(y_{\mathrm{correct}}) - \mathrm{logit}(y_{\mathrm{anchor}}), \tag{1}$$

where $\mathrm{logit}(y)$ denotes the model's pre-softmax score for answer-label token $y$. Lower values of $m$ indicate that the anchor option has become more competitive with the correct option.

For each matched prompt pair, we compute

$$\Delta m = m_{\mathrm{anchor}} - m_{\mathrm{control}}. \tag{2}$$
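A minimal sketch of Equations 1 and 2 for a single matched prompt pair follows. The logit values and the dictionary layout (answer letter mapped to pre-softmax score) are hypothetical; in practice the scores would be read from the model at the final answer-label position.

```python
def logit_difference(logits, correct_label, anchor_label):
    """Equation 1: m = logit(y_correct) - logit(y_anchor)."""
    return logits[correct_label] - logits[anchor_label]


def delta_m(anchored_logits, control_logits, correct_label, anchor_label):
    """Equation 2: anchored-minus-control change in the logit difference."""
    m_anchor = logit_difference(anchored_logits, correct_label, anchor_label)
    m_control = logit_difference(control_logits, correct_label, anchor_label)
    return m_anchor - m_control


# Hypothetical answer-label logits: "D" holds the ground truth, "A" holds the low anchor.
control_logits = {"A": 1.0, "D": 3.0}
anchored_logits = {"A": 2.5, "D": 3.0}

m_control = logit_difference(control_logits, "D", "A")
m_anchored = logit_difference(anchored_logits, "D", "A")
change = delta_m(anchored_logits, control_logits, "D", "A")
```

Here the anchored prompt raises the anchor option's logit, so $m$ falls and $\Delta m$ is negative, which is the direction associated with anchoring.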

### 3.3 Does the Attribution Metric Track Behavioral Anchoring?

Before using the logit-difference metric for circuit localization, we verify two things: that the multiple-choice task elicits behavioral anchoring, and that the attribution metric tracks this behavioral effect.

[Table 1](https://arxiv.org/html/2606.12818#S3.T1) shows that the multiple-choice task elicits anchoring in all four models. Low anchors shift expected value downward, while high anchors shift expected value upward. The effect is strongest in the Qwen models, especially Qwen2.5-7B-Instruct, and smaller but still directionally consistent in the Llama models.

[Table 1](https://arxiv.org/html/2606.12818#S3.T1) also validates the attribution metric. Mean $\Delta m$ is negative for every model and anchor contrast, indicating that anchored prompts make the anchor answer more competitive with the correct answer; all values are significant under a one-sided Wilcoxon signed-rank test ($p < 10^{-15}$). The logit-difference shift also tracks behavioral anchoring. Because stronger anchoring corresponds to more negative $\Delta m$, we correlate $-\Delta m$ with anchor-consistent normalized EV shift so that larger values indicate stronger anchoring in both measures. We denote this Spearman correlation by $\rho_{\mathrm{EV}}$. The strongest correlations appear for high-anchor contrasts in the base models, with $\rho_{\mathrm{EV}} = 0.73$ for Qwen-7B and $0.75$ for Llama-8B, while the weakest appears for Llama-Inst on the high-anchor contrast ($\rho_{\mathrm{EV}} = 0.18$).

We adopt $\Delta m$ for the mechanistic analyses that follow because, although it does not capture every distributional effect of anchoring, it consistently moves in the expected direction, correlates with behavioral EV shifts, and provides an answer-level target for circuit localization.
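To make the sign convention behind $\rho_{\mathrm{EV}}$ concrete, here is a small standard-library sketch of a Spearman correlation between $-\Delta m$ and anchor-consistent EV shift. The per-item numbers are invented for illustration and the helper names are assumptions; a statistics package would normally be used instead of this hand-rolled version.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-item values: more negative delta_m pairs with a larger anchor-consistent shift.
delta_m_values = [-0.8, -0.1, -0.5, -1.2]
ev_shift_values = [0.20, 0.02, 0.15, 0.30]

# Negate delta_m so that larger values mean stronger anchoring in both measures.
rho_ev = spearman([-d for d in delta_m_values], ev_shift_values)
```

Without the negation, the same data would give a correlation of the opposite sign, which is why the text correlates $-\Delta m$ rather than $\Delta m$.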

## 4 Localizing Anchor-Sensitive Pathways

### 4.1 Circuit Localization Setup

We adopt the circuit-localization framework used in MIB (Mueller et al., [2025b](https://arxiv.org/html/2606.12818#bib.bib13)): a transformer is represented as a computation graph $N$, where nodes are internal components such as attention heads or MLP modules, and edges are directed pathways between components. Under this framework, a circuit $C \subseteq N$ is a selected subgraph of nodes or edges, and localization methods rank these nodes or edges by their estimated contribution to a target metric.

We apply this framework to anchoring. For each model and anchor direction, we use the matched prompt pairs defined in Section [3](https://arxiv.org/html/2606.12818#S3): the tokenizer-matched control prompt is the "clean" input, and the anchored prompt is the "corrupted" input. The target metric is the validated correct–anchor logit difference $m$ from [Equation 1](https://arxiv.org/html/2606.12818#S3.E1). Since smaller $m$ means that the anchor option has become more competitive with the correct option, we use $L = -m$ as the attribution loss.

We compare node- and edge-level attribution patching methods following MIB. Attribution Patching (AP) approximates the effect of patching internal activations by combining the clean–corrupted activation difference with the gradient of the scalar objective. Intuitively, a node or edge receives a high score when it changes under the anchor contrast and that change affects the answer-level logit difference. We refer to node-level AP as NAP and edge-level AP as EAP. Their integrated-gradient variants, NAP-IG and EAP-IG, average attribution estimates along an interpolation path between clean and corrupted input embeddings, reducing reliance on a single local gradient. Each method returns a ranked list of candidate nodes or edges. To form a circuit, we select the top-ranked entries at a given retained fraction $k$ and keep only entries that remain connected in the resulting subgraph before evaluating faithfulness below. Thus, $C_k$ is formed from attribution-ranked entries rather than from a separate greedy search. For node-level methods, $C_k$ contains the top-ranked components; for edge-level methods, $C_k$ contains the top-ranked directed pathways.
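The first-order idea behind attribution patching can be shown on a toy objective. The sketch below is not the MIB implementation: a three-input `tanh` readout stands in for the model, finite differences stand in for backpropagation, and all names and numbers are hypothetical. It compares the gradient-times-difference estimate with the true effect of patching each component from its clean to its corrupted value.

```python
import math


def metric(acts):
    """Toy scalar objective over three 'component' activations (stand-in for a model)."""
    weights = [0.5, -1.0, 2.0]
    return math.tanh(sum(w * a for w, a in zip(weights, acts)))


def gradient(f, acts, eps=1e-6):
    """Central finite-difference gradient of f at acts (stand-in for backprop)."""
    grads = []
    for i in range(len(acts)):
        up, down = list(acts), list(acts)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def attribution_scores(f, clean, corrupted):
    """First-order estimate: (corrupted - clean) activation difference times the gradient at clean."""
    grads = gradient(f, clean)
    return [(cor - cln) * g for cln, cor, g in zip(clean, corrupted, grads)]


def true_patch_effects(f, clean, corrupted):
    """Exact change in f when one component at a time is set to its corrupted value."""
    base = f(clean)
    effects = []
    for i in range(len(clean)):
        patched = list(clean)
        patched[i] = corrupted[i]
        effects.append(f(patched) - base)
    return effects


clean = [0.10, 0.20, 0.05]      # activations on the control prompt
corrupted = [0.12, 0.20, 0.15]  # activations on the anchored prompt

scores = attribution_scores(metric, clean, corrupted)
effects = true_patch_effects(metric, clean, corrupted)
ranking = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
```

A component that does not change between the two runs gets a score of zero regardless of its gradient, matching the intuition that a high score needs both a change under the contrast and an effect on the metric. The estimate is only approximate for nonlinear objectives, which is the motivation for the integrated-gradient variants.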

### 4.2 Recovering the Anchor-Sensitive Decision Signal

Attribution rankings alone do not show whether the selected subgraph accounts for the control–anchored difference in $m$. We next ask how intervening on the subgraph impacts model behavior.

The tokenizer-matched control prompt is treated as the clean input, representing the unanchored computation we want the retained circuit to recover. The anchored prompt is treated as the corrupted input. For each retained proportion $k$, we construct a top-$k$ circuit $C_k$, sweeping $k \in \{.001, .002, .005, .01, .02, .05, .1, .2, .5, 1\}$. We run the control prompt, retain the activations in $C_k$, and replace all other activations with the corresponding activations from the matched anchored run. Thus, $C_k = \emptyset$ corresponds to the fully patched anchored limit, while $C_k = N$ corresponds to the full control run. Faithfulness is evaluated on $m$, not on the negated attribution loss $L = -m$, so recovery is measured in the same units as Section [3.3](https://arxiv.org/html/2606.12818#S3.SS3):

$$f(C_k, N; m) = \frac{m(C_k) - m(\emptyset)}{m(N) - m(\emptyset)}. \tag{3}$$

Here, $m(N)$ is the full control value, and $m(\emptyset)$ is the fully patched limit where no ranked subgraph is retained; $m(N)$, $m(C_k)$, and $m(\emptyset)$ denote mean logit-difference values over the evaluation set. In other words, faithfulness captures the proportion of the control–anchored gap in $m$ that is recovered by keeping $C_k$ at its control values. Raw values can fall outside $[0, 1]$ when the retained circuit undershoots or overshoots this gap.
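Equation 3 reduces to a one-line ratio once the three mean logit differences are available. The sketch below uses hypothetical values for them; the function and argument names are ours.

```python
def faithfulness(m_circuit, m_full, m_empty):
    """Equation 3: share of the control-anchored gap in m recovered by the retained circuit."""
    gap = m_full - m_empty
    if gap == 0:
        raise ValueError("control and anchored runs give the same m; faithfulness is undefined")
    return (m_circuit - m_empty) / gap


# Hypothetical mean logit differences over an evaluation set.
m_full = 2.0    # m(N): full control run
m_empty = 1.0   # m(empty set): everything patched to the anchored run

recovered = faithfulness(1.8, m_full, m_empty)   # circuit recovers most of the gap
overshoot = faithfulness(2.3, m_full, m_empty)   # raw value above 1
undershoot = faithfulness(0.9, m_full, m_empty)  # raw value below 0
```

The last two cases show how raw values leave $[0, 1]$ when the circuit overshoots the control value or falls below the anchored limit.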

Larger circuits generally replicate model behavior better than smaller circuits. We therefore summarize faithfulness curves with circuit performance recovery (CPR; Mueller et al., [2025b](https://arxiv.org/html/2606.12818#bib.bib13)), the area under the curve defined by $f(C_k, N; m)$ at the various retained circuit fractions $k$. For main-text results, faithfulness values are clipped to $[0, 1]$ at each sweep point before trapezoidal integration, with endpoints added at $k = 0$ and $k = 1$ when missing. Higher CPR indicates earlier recovery with a smaller retained subgraph. Because Section [3.3](https://arxiv.org/html/2606.12818#S3.SS3) shows that changes in $m$ track behavioral anchoring on the same prompt pairs, we interpret faithful subgraphs as pathways relevant to the validated anchor-sensitive signal.
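The clipping and integration steps can be sketched as follows. Two details are assumptions rather than statements from the text: the added endpoints take the values that Equation 3 gives at those limits (0 at $k = 0$ and 1 at $k = 1$), and the trapezoidal rule is applied on a linear $k$ axis. The sweep values in the example are invented.

```python
def circuit_performance_recovery(ks, faithfulness_values):
    """Area under the clipped faithfulness curve over k in [0, 1] (trapezoidal rule, linear k axis assumed)."""
    # Clip each sweep point to [0, 1] before integrating.
    points = {k: min(1.0, max(0.0, f)) for k, f in zip(ks, faithfulness_values)}
    # Add missing endpoints; the values follow from Equation 3 at C = empty set and C = N.
    points.setdefault(0.0, 0.0)
    points.setdefault(1.0, 1.0)
    xs = sorted(points)
    return sum((x1 - x0) * (points[x0] + points[x1]) / 2 for x0, x1 in zip(xs, xs[1:]))


# Hypothetical sweep: one overshooting point at k = 0.5 is clipped to 1 before integration.
cpr_clipped = circuit_performance_recovery([0.5], [1.4])
cpr_exact = circuit_performance_recovery([0.5], [1.0])
cpr_baseline = circuit_performance_recovery([], [])
```

A curve that reaches full faithfulness at small $k$ encloses more area than one that rises late, which is why a higher CPR indicates earlier recovery.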

### 4.3 Comparing Localization Methods by Faithfulness

We first evaluate which attribution-patching method produces the most faithful circuits for the anchoring logit-difference target. This comparison determines which circuit estimate we use for the analyses that follow.

[Table 2](https://arxiv.org/html/2606.12818#S4.T2) reports CPR for each attribution method, model, and anchor direction. Consistent with previous work (Mueller et al., [2025b](https://arxiv.org/html/2606.12818#bib.bib13)), edge-level methods are more faithful than node-level methods: EAP consistently outperforms NAP, and EAP-IG consistently outperforms NAP-IG. This is intuitive, given that edges are more granular than nodes. Integrated gradients further improve recovery in most edge-level comparisons, with EAP-IG outperforming EAP in seven of eight model–contrast combinations.

Table 2: Faithfulness (CPR) of attribution-patching methods. Higher values indicate a larger area under the curve, and therefore recovery of the control–anchored gap in the validated logit-difference target at smaller circuit sizes. NAP/EAP denote node-/edge-level Attribution Patching; NAP-IG/EAP-IG use integrated gradients over input embeddings.

Instruction tuning affects recoverability more strongly for EAP than for EAP-IG. For example, EAP CPR drops from 0.939 to 0.862 in Qwen on the low-anchor contrast and from 0.958 to 0.840 in Llama, while EAP-IG remains above 0.91 across all instruction-tuned settings. EAP-IG is therefore the most consistently faithful localization method in our setting.

## 5 What Do EAP-IG Circuits Reveal About Anchoring Structure?

We focus our circuit analyses on EAP\-IG\. We first characterize where these pathways appear in the network and which component types they connect\. We then perform analyses examining the degree to which circuits are preserved across different anchor values and across base and instruction\-tuned models\.

### 5.1 Where Are Anchoring Pathways Localized?

We first ask where the estimated anchoring pathways of EAP\-IG are located in the network\. For each model and anchor contrast, we compute the share of total absolute EAP\-IG score contributed by edges from each source layer\.

[Figure 2](https://arxiv.org/html/2606.12818#S5.F2) shows two main layer-wise attribution patterns. First, attribution is not spread uniformly across depth. Qwen variants place more attribution in mid-to-late layers, while Llama variants concentrate attribution earlier and drop sharply after mid-depth. Second, within each model, the high- and low-anchor curves are nearly identical: the same model stays in the same depth region across anchor contrasts. The centroid lines make this visually clear, with Qwen centroids appearing later than Llama centroids in both contrasts and low/high centroids nearly overlapping within each model. Quantitatively, low- and high-anchor per-layer attribution vectors have Pearson correlations above $0.99$, and their centroids differ by less than $0.01$ relative depth. (The Pearson correlation is computed between the low- and high-anchor vectors of per-layer attribution shares for a given model. The attribution centroid is $\sum_{\ell} r_{\ell} a_{\ell}$, where $r_{\ell}$ is the relative depth of layer $\ell$ and $a_{\ell}$ is that layer's normalized attribution share.) This partly aligns with prior activation-level analyses that found anchoring-related effects in earlier layers (Huang et al., [2026](https://arxiv.org/html/2606.12818#bib.bib11)), but shows that the localization pattern is not uniform across model families.
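The per-layer share and centroid computations can be sketched in a few lines. The edge scores below are invented, the `(source_layer, score)` pair layout is hypothetical, and the relative depth $r_{\ell} = \ell / (L - 1)$ for a model with $L$ layers is an assumption, since the exact depth normalization is not spelled out here.

```python
def layer_shares(edge_scores, num_layers):
    """Share of total absolute edge score contributed by each source layer."""
    totals = [0.0] * num_layers
    for source_layer, score in edge_scores:
        totals[source_layer] += abs(score)
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def attribution_centroid(shares):
    """Sum over layers of relative depth times attribution share, with depth l / (L - 1) (assumed)."""
    last = len(shares) - 1
    return sum((layer / last) * share for layer, share in enumerate(shares))


# Hypothetical (source_layer, score) pairs for a 4-layer toy model.
edge_scores = [(0, 1.0), (1, -2.0), (3, 1.0)]

shares = layer_shares(edge_scores, num_layers=4)
centroid = attribution_centroid(shares)
```

Expressing depth on a relative scale is what allows the 28-layer Qwen models and the 32-layer Llama models to be compared on the same axis.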

![Refer to caption](https://arxiv.org/html/2606.12818v1/x1.png)

Figure 2: EAP-IG attribution by source-layer relative depth. For each model and anchor contrast, absolute edge scores are summed by source layer and normalized by the total absolute score. Qwen variants show a broader mid-to-late attribution profile, while Llama variants concentrate attribution earlier in the network.

(a) Low–high transfer within the same model

![Refer to caption](https://arxiv.org/html/2606.12818v1/x2.png)

(b) Base–instruction transfer within the same model family

![Refer to caption](https://arxiv.org/html/2606.12818v1/x3.png)

Figure 3: Structural overlap and transfer of EAP-IG edge circuits. Panel (a) compares low- and high-anchor circuits within each model. Panel (b) compares base and instruction-tuned circuits within each model family and anchor contrast. Jaccard and shared fraction measure edge overlap; faithfulness measures whether circuits replicate model behavior across the comparison.

### 5.2 Which Component Types Are Involved?

Layer depth is only one way to characterize the localized pathways. We also examine which types of components are connected by the top-ranked EAP-IG edges. Most possible edges in the full graph are attention-to-attention edges, accounting for about 95–96% of all scored edges. The top $5\%$ circuits are also attention-heavy, but less so: attention-to-attention edges make up roughly 68–80% of selected edges. Compared with the full graph, the selected circuits include many more attention–MLP edges. This pattern does not by itself identify the semantic role of these MLP-linked edges, but it suggests that MLPs contribute mainly through connections to and from attention components, as part of attention-mediated pathways, rather than forming a separate MLP-only anchoring circuit. Full component-type breakdowns and full-graph baselines are reported in Appendix [C](https://arxiv.org/html/2606.12818#A3).

### 5.3 Do Low and High Anchors Use the Same Edges?

The layer-wise attribution results show that low- and high-anchor effects localize to nearly identical depth regions within each model. We next ask whether the low- and high-anchor vs. control prompt contrasts also rely on the same edges. For each model, we compare equal-sized circuits: the top $5\%$ of EAP-IG-ranked edges for the low-anchor contrast and the top $5\%$ for the high-anchor contrast.

We use two simple overlap measures. Jaccard similarity, proposed in Merullo et al. ([2024](https://arxiv.org/html/2606.12818#bib.bib43)), measures how much the two edge sets overlap relative to all edges selected by either contrast. (For edge sets $A$ and $B$, Jaccard similarity, or Intersection over Union (IoU), is $J(A, B) = |A \cap B| / |A \cup B|$.) Because the two circuits are the same size, we also report the fraction of each circuit that is shared with the other. (For equal-sized edge sets $A$ and $B$, this fraction is $|A \cap B| / |A|$, equivalently $|A \cap B| / |B|$.) This second measure is easier to interpret: it tells us how much of a low-anchor circuit also appears in the high-anchor circuit, and vice versa.
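Both overlap measures are short set computations. In the sketch below, the edge names are invented placeholders for `(source, target)` component pairs. For equal-sized sets with shared fraction $s$, the two measures are tied by $J = s / (2 - s)$, because $|A \cup B| = 2|A| - |A \cap B|$.

```python
def jaccard(a, b):
    """Jaccard similarity (IoU): |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def shared_fraction(a, b):
    """Fraction of each circuit shared with the other; only defined here for equal-sized edge sets."""
    a, b = set(a), set(b)
    if len(a) != len(b):
        raise ValueError("shared fraction assumes equal-sized circuits")
    return len(a & b) / len(a)


# Hypothetical top-ranked edges, written as (source, target) component names.
low_circuit = {("a0.h1", "m3"), ("a2.h4", "a5.h0"), ("m3", "a6.h2")}
high_circuit = {("a0.h1", "m3"), ("a2.h4", "a5.h0"), ("m4", "a6.h2")}

j = jaccard(low_circuit, high_circuit)
s = shared_fraction(low_circuit, high_circuit)
```

Because of this relation, a shared fraction near two thirds corresponds to a Jaccard value near one half, so the two numbers describe the same overlap on different scales.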

Table 3: Edge overlap between circuits at 5% retained EAP\-IG edges\. Ranges are across models for low–high overlap and across family and anchor\-contrast pairs for base–instruction overlap\.

At $k = 5\%$, [Table 3](https://arxiv.org/html/2606.12818#S5.T3) shows that low\- and high\-anchor circuits are not identical, but roughly two thirds of each circuit is shared with the other anchor contrast\. This suggests substantial shared structure, with some edges specific to each contrast\. Thus, low and high anchors differ in which individual edges rank highest, despite localizing to the same depth regions\. Full Jaccard values across retained fractions are reported in Appendix [C](https://arxiv.org/html/2606.12818#A3), [Table 8](https://arxiv.org/html/2606.12818#A3.T8)\.

### 5\.4 Do Base and Instruction\-Tuned Models Use the Same Edges?

We next ask whether base and instruction\-tuned variants also rely on the same top\-ranked edges\. For each model family and anchor contrast, we compare equal\-sized circuits: the top 5% of EAP\-IG\-ranked edges for the base model and the top 5% for its instruction\-tuned counterpart\. \(Qwen\-7B and Qwen\-Inst share the same scored edge set size, as do Llama\-8B and Llama\-Inst; therefore, top\-$k\%$ base–instruction circuits are equal\-sized within each family\.\)

Table [3](https://arxiv.org/html/2606.12818#S5.T3) shows that base and instruction\-tuned variants share many top\-ranked edges, though slightly fewer than the low–high circuits within a model\. The shared fraction is about 0\.64–0\.66, compared with 0\.66–0\.71 for low–high overlap\. This suggests that instruction tuning preserves some sparse circuit structure while changing which individual edges rank highest\.

## 6 Transferability of Anchoring Circuits

The structural edge overlap analyses above show that circuits can share many top\-ranked edges across anchor contrasts and between base and instruction\-tuned variants\. However, structural overlap alone does not tell us whether the circuits are functionally interchangeable\. We therefore test circuit transfer directly using faithfulness evaluations \(Hanna et al\., [2024](https://arxiv.org/html/2606.12818#bib.bib20)\)\.

We test transfer across two axes: anchor contrast and model variant\. A *matched* circuit is ranked and evaluated on the same model and anchor contrast\. A *cross\-contrast* circuit is ranked on one control–anchor contrast and evaluated on the other within the same model \(control\-vs\-low $\leftrightarrow$ control\-vs\-high\)\. A *cross\-variant* circuit is ranked on the paired base or instruction\-tuned variant from the same model family, using the same anchor contrast\. A *cross\-both* circuit changes both the anchor contrast and the model variant\.
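The four transfer conditions follow from comparing where a circuit was ranked with where it is evaluated. The sketch below makes that mapping explicit; the `Setting` class, its field names, and `transfer_type` are hypothetical names introduced for illustration, and the sketch assumes both settings come from the same model family.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    variant: str   # e.g. "base" or "instruct", within one model family
    contrast: str  # e.g. "low" or "high", each measured against its control


def transfer_type(ranked_on: Setting, evaluated_on: Setting) -> str:
    """Label a transfer condition from the ranking and evaluation settings."""
    same_variant = ranked_on.variant == evaluated_on.variant
    same_contrast = ranked_on.contrast == evaluated_on.contrast
    if same_variant and same_contrast:
        return "matched"
    if same_variant:
        return "cross-contrast"  # only the anchor contrast changes
    if same_contrast:
        return "cross-variant"   # only the model variant changes
    return "cross-both"          # both axes change


print(transfer_type(Setting("base", "low"), Setting("base", "high")))
print(transfer_type(Setting("instruct", "low"), Setting("base", "low")))
```

The first call prints `cross-contrast` and the second prints `cross-variant`.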

### 6\.1 Do Low\- and High\-Anchor Circuits Transfer Within a Model?

The full CPR results are reported in Appendix [C](https://arxiv.org/html/2606.12818#A3), Table [9](https://arxiv.org/html/2606.12818#A3.T9)\. Cross\-contrast circuits perform almost as well as matched circuits: CPR ranges from 0\.933 to 0\.981 and differs from matched recovery by at most about 0\.04\. Random edge sets perform much worse, showing that transfer is not simply a consequence of retaining many edges\.

[Figure 3](https://arxiv.org/html/2606.12818#S5.F3) complements this result by showing low–high overlap and cross\-contrast faithfulness across retained fractions\. At 5% retained edges, low\- and high\-anchor circuits have moderate Jaccard overlap but larger shared fractions, and cross\-contrast faithfulness is already high\. This suggests that low and high anchors do not select exactly the same edges, but enough of each circuit is shared, and those shared edges are useful enough, for circuits to transfer across anchor contrasts\.

### 6\.2 Does Circuit Transfer Hold Across Instruction Tuning?

The results above show that low\- and high\-anchor circuits transfer well within the same model\. We next ask whether this stability also holds across instruction tuning: can a circuit found in a base model work in its instruction\-tuned counterpart, and vice versa?

The full CPR results, reported in Appendix Table [9](https://arxiv.org/html/2606.12818#A3.T9), show that cross\-variant circuits remain well above random, indicating that instruction tuning does not create an entirely unrelated pathway\. However, transfer across instruction tuning is less uniform than transfer across low\- and high\-anchor contrasts within the same model, especially for low\-anchor contrasts and sparse retained fractions\.

Figure [3](https://arxiv.org/html/2606.12818#S5.F3) shows this pattern\. For high\-anchor contrasts, cross\-variant faithfulness is often close to the low–high transfer results\. For low\-anchor contrasts, sparse cross\-variant transfer is weaker: at 5% retained edges, Qwen\-7B drops from 0\.917 matched faithfulness to 0\.291 with the Qwen\-Inst ranking, and Llama\-Inst drops from 0\.408 to 0\.166 with the Llama\-8B ranking \(Appendix Table [11](https://arxiv.org/html/2606.12818#A3.T11)\)\. Thus, instruction tuning preserves some broader pathway structure, but the same sparse circuit does not always transfer cleanly across base and instruction\-tuned variants\.

## 7 Discussion

This work provides new insight into how language models implement numerical anchoring\. Using a controlled multiple\-choice anchoring task and a validated correct–anchor logit\-difference metric, we apply attribution\-based circuit localization to identify pathways that carry anchor\-sensitive answer competition\. Our results show that anchoring is better captured by sparse edge\-level circuits than by isolated components\.

Within a model, low\- and high\-anchor circuits share substantial structure and transfer well across anchor contrasts, suggesting a shared anchor\-sensitive pathway\. At the same time, instruction tuning and model family shape which edges matter most and where attribution concentrates in depth\. Because anchoring can shift model predictions even when the final answer remains correct, understanding these pathways is an important step toward building models that are less sensitive to irrelevant numerical context\.

These findings also clarify what mitigation would need to account for\. The strong low–high transfer suggests that anchoring may not need to be treated as a separate failure mode for each anchor contrast within a model\. However, weaker functional transfer across base and instruction\-tuned variants suggests that interventions may need to be re\-checked after post\-training\. More broadly, applying circuit localization to cognitive biases provides a way to move beyond measuring whether a bias occurs, toward understanding how prompt\-induced biases are carried through model computation\.

## Limitations

Our analysis uses 100 numerical\-estimation questions from OpAQ\. This scale is sufficient for controlled circuit\-localization experiments, but it is small relative to the diversity of real\-world numerical reasoning settings\. Because all questions come from a single anchoring dataset, the results should be interpreted as evidence about anchoring in this controlled benchmark rather than as a comprehensive account of anchoring\-like behavior in language models\.

Our analysis is also specific to the multiple\-choice formulation\. The fixed\-option setup is useful because it provides shared candidate answers, comparable answer distributions, and well\-defined answer\-label logits for attribution and faithfulness testing\. However, it differs from open\-ended numerical estimation, where models generate free\-form answers and may express anchoring through different decoding dynamics, rationales, or intermediate computations\. Future work should test whether the same layer\-wise attribution and cross\-contrast transfer results appear in open\-ended settings\.

We localize circuits using the correct–anchor logit difference, which captures competition between the factual answer and the answer corresponding to the active anchor at the final answer position\. Although we validate that this metric tracks expected\-value shifts, it does not capture the full distributional effect of anchoring\. Other scalar objectives, such as KL divergence, total variation distance, expected\-value shift, or anchor\-side probability mass, may highlight different parts of the computation\. Thus, our circuits should be understood as circuits for this validated answer\-level measure of anchoring, not as a complete circuit for every way the bias can appear\.

Finally, our mechanistic analysis covers four models from two families: Qwen2\.5\-7B, Qwen2\.5\-7B\-Instruct, Llama\-3\.1\-8B, and Llama\-3\.1\-8B\-Instruct\. The contrast between Qwen and Llama suggests family\-specific differences in where attribution concentrates across layers, but broader evaluation across additional architectures, model scales, and post\-training procedures is needed before drawing general conclusions about anchoring\-related circuits in language models\.

## References

- Think you have solved question answering? try arc, the ai2 reasoning challenge\.External Links:1803\.05457,[Link](https://arxiv.org/abs/1803.05457)Cited by:[§3\.1](https://arxiv.org/html/2606.12818#S3.SS1.p1.1)\.
- A\. Geiger, J\. Harding, and T\. Icard \(2025\)How Causal Abstraction Underpins Computational Explanation\.arXiv\.Note:arXiv:2508\.11214 \[cs\]External Links:[Link](http://arxiv.org/abs/2508.11214),[Document](https://dx.doi.org/10.48550/arXiv.2508.11214)Cited by:[§1](https://arxiv.org/html/2606.12818#S1.p3.1)\.
- N\. Goldowsky\-Dill, C\. MacLeod, L\. Sato, and A\. Arora \(2023\)Localizing Model Behavior with Path Patching\.arXiv\.Note:arXiv:2304\.05969 \[cs\]External Links:[Link](http://arxiv.org/abs/2304.05969),[Document](https://dx.doi.org/10.48550/arXiv.2304.05969)Cited by:[§1](https://arxiv.org/html/2606.12818#S1.p3.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§3\.1](https://arxiv.org/html/2606.12818#S3.SS1.p5.1)\.
- M\. Hanna, S\. Pezzelle, and Y\. Belinkov \(2024\)Have faith in faithfulness: going beyond circuit overlap when finding model mechanisms\.InConference on Language Modeling: COLM 2024,External Links:2403\.17806,[Link](https://arxiv.org/abs/2403.17806)Cited by:[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p1.1),[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p2.1),[§6](https://arxiv.org/html/2606.12818#S6.p1.1)\.
- Y\. Huang, B\. Bie, Z\. Na, W\. Ruan, S\. Lei, Y\. Yue, and X\. He \(2026\)Understanding the anchoring effect of llm with synthetic data: existence, mechanism, and potential mitigations\.External Links:2505\.15392,[Link](https://arxiv.org/abs/2505.15392v2)Cited by:[§2\.1](https://arxiv.org/html/2606.12818#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.12818#S2.SS1.p2.1),[§5\.1](https://arxiv.org/html/2606.12818#S5.SS1.p2.2)\.
- J\. Kramár, T\. Lieberum, R\. Shah, and N\. Nanda \(2024\)AtP\*: an efficient and scalable method for localizing llm behaviour to components\.External Links:2403\.00745,[Link](https://arxiv.org/abs/2403.00745)Cited by:[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p1.1)\.
- T\. Lieberum, M\. Rahtz, J\. Kramár, N\. Nanda, G\. Irving, R\. Shah, and V\. Mikulik \(2023\)Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla\.Technical Report arXiv:2307\.09458, arXiv\.Note:arXiv:2307\.09458 \[cs\]External Links:[Link](http://arxiv.org/abs/2307.09458)Cited by:[§3\.1](https://arxiv.org/html/2606.12818#S3.SS1.p1.1)\.
- J\. Lou and Y\. Sun \(2024\)Anchoring bias in large language models: an experimental study\.External Links:2412\.06593,[Link](https://arxiv.org/abs/2412.06593)Cited by:[§1](https://arxiv.org/html/2606.12818#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.12818#S2.SS1.p1.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in gpt\.Advances in Neural Information Processing Systems 35\.External Links:[Link](https://api.semanticscholar.org/CorpusID:255825985)Cited by:[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p1.1)\.
- J\. Merullo, C\. Eickhoff, and E\. Pavlick \(2024\)Circuit Component Reuse Across Tasks in Transformer Language Models\.InThe Twelfth International Conference on Learning Representations,Note:arXiv:2310\.08744 \[cs\]External Links:[Link](http://arxiv.org/abs/2310.08744),[Document](https://dx.doi.org/10.48550/arXiv.2310.08744)Cited by:[§5\.3](https://arxiv.org/html/2606.12818#S5.SS3.p2.1)\.
- A\. Mueller, J\. Brinkmann, M\. Li, S\. Marks, K\. Pal, N\. Prakash, C\. Rager, A\. Sankaranarayanan, A\. S\. Sharma, J\. Sun, E\. Todd, D\. Bau, and Y\. Belinkov \(2025a\)The Quest for the Right Mediator: Surveying Mechanistic Interpretability for NLP Through the Lens of Causal Mediation Analysis\.Computational Linguistics,pp\. 1–48\.External Links:ISSN 0891\-2017,[Link](https://doi.org/10.1162/COLI.a.572),[Document](https://dx.doi.org/10.1162/COLI.a.572)Cited by:[§1](https://arxiv.org/html/2606.12818#S1.p3.1)\.
- A\. Mueller, A\. Geiger, S\. Wiegreffe, D\. Arad, I\. Arcuschin, A\. Belfki, Y\. S\. Chan, J\. F\. Fiotto\-Kaufman, T\. Haklay, M\. Hanna, J\. Huang, R\. Gupta, Y\. Nikankin, H\. Orgad, N\. Prakash, A\. Reusch, A\. Sankaranarayanan, S\. Shao, A\. Stolfo, M\. Tutek, A\. Zur, D\. Bau, and Y\. Belinkov \(2025b\)MIB: a mechanistic interpretability benchmark\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=sSrOwve6vb)Cited by:[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p2.1),[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p3.1),[§4\.1](https://arxiv.org/html/2606.12818#S4.SS1.p1.2),[§4\.2](https://arxiv.org/html/2606.12818#S4.SS2.p3.6),[§4\.3](https://arxiv.org/html/2606.12818#S4.SS3.p2.1)\.
- N\. Nanda \(2023\)Attribution patching: activation patching at industrial scale\.\(en\-US\)\.External Links:[Link](https://www.neelnanda.io/mechanistic-interpretability/attribution-patching)Cited by:[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p1.1)\.
- C\. Olah, N\. Cammarata, L\. Schubert, G\. Goh, M\. Petrov, and S\. Carter \(2020\)Zoom In: An Introduction to Circuits\.Distill\(en\)\.External Links:[Link](https://distill.pub/2020/circuits/zoom-in)Cited by:[§1](https://arxiv.org/html/2606.12818#S1.p3.1)\.
- H\. N\. Owusu and N\. H\. Feldman \(2026\)Anchoring depends on confidence and post\-training in language models\.InProceedings of the 64th Annual Meeting of the Association for Computational Linguistics: ACL 2026,Note:To appearCited by:[Appendix B](https://arxiv.org/html/2606.12818#A2.SS0.SSS0.Px1.p1.1),[§2\.1](https://arxiv.org/html/2606.12818#S2.SS1.p1.1)\.
- P\. Pezeshkpour and E\. Hruschka \(2024\)Large language models sensitivity to the order of options in multiple\-choice questions\.InFindings of ACL: NAACL 2024,pp\. 2006–2017\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-naacl.130)Cited by:[§3\.1](https://arxiv.org/html/2606.12818#S3.SS1.p4.1)\.
- L\. Röseler, L\. Weber, E\. P\. B\. Stijović, K\. A\. K\. Jaekel, J\. F\. \(\. M\. T\. \(\. G\. \(\. Gijsbers, and N\. Milstein \(2022\)The Open Anchoring Quest Dataset: Anchored Estimates from 96 Studies on Anchoring Effects\.Journal of Open Psychology Data10\(1\),pp\. 16\.External Links:[Document](https://dx.doi.org/10.5334/jopd.67)Cited by:[Figure 1](https://arxiv.org/html/2606.12818#S1.F1),[§3\.1](https://arxiv.org/html/2606.12818#S3.SS1.p1.1)\.
- N\. Saphra and S\. Wiegreffe \(2024\)Mechanistic?\.InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP,Y\. Belinkov, N\. Kim, J\. Jumelet, H\. Mohebbi, A\. Mueller, and H\. Chen \(Eds\.\),Miami, Florida, US,pp\. 480–498\.External Links:[Link](https://aclanthology.org/2024.blackboxnlp-1.30/),[Document](https://dx.doi.org/10.18653/v1/2024.blackboxnlp-1.30)Cited by:[§1](https://arxiv.org/html/2606.12818#S1.p3.1)\.
- A\. Stolfo, Y\. Belinkov, and M\. Sachan \(2023\)A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 7035–7052\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.435)Cited by:[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p2.1)\.
- A\. Syed, C\. Rager, and A\. Conmy \(2024\)Attribution patching outperforms automated circuit discovery\.InProceedings of the 7th BlackboxNLP Workshop,pp\. 407–416\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.blackboxnlp-1.25)Cited by:[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p1.1)\.
- Y\. Takenami, Y\. J\. Huang, Y\. Murawaki, and C\. Chu \(2025\)How does cognitive bias affect large language models? a case study on the anchoring effect in price negotiation simulations\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),External Links:[Link](https://aclanthology.org/2025.findings-emnlp.240/)Cited by:[§1](https://arxiv.org/html/2606.12818#S1.p1.1),[§1](https://arxiv.org/html/2606.12818#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.12818#S2.SS1.p1.1)\.
- A\. Tversky and D\. Kahneman \(1974\)Judgment under uncertainty: heuristics and biases\.Science 185\(4157\),pp\. 1124–1131\.External Links:[Document](https://dx.doi.org/10.1126/science.185.4157.1124),[Link](https://www.science.org/doi/abs/10.1126/science.185.4157.1124)Cited by:[§1](https://arxiv.org/html/2606.12818#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.12818#S2.SS1.p1.1)\.
- F\. Valencia\-Clavijo \(2025\)Anchors in the machine: behavioral and attributional evidence of anchoring bias in llms\.External Links:2511\.05766,[Link](https://arxiv.org/abs/2511.05766)Cited by:[§2\.1](https://arxiv.org/html/2606.12818#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.12818#S2.SS1.p2.1)\.
- J\. Vig, S\. Gehrmann, Y\. Belinkov, S\. Qian, D\. Nevo, Y\. Singer, and S\. Shieber \(2020\)Investigating Gender Bias in Language Models Using Causal Mediation Analysis\.InAdvances in Neural Information Processing Systems,Vol\.33,pp\. 12388–12401\.External Links:[Link](https://proceedings.neurips.cc/paper/2020/hash/92650b2e92217715fe312e6fa7b90d82-Abstract.html)Cited by:[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p1.1)\.
- K\. Wang, A\. Variengien, A\. Conmy, B\. Shlegeris, and J\. Steinhardt \(2022\)Interpretability in the wild: a circuit for indirect object identification in gpt\-2 small\.InICLR 2023,External Links:2211\.00593,[Link](https://arxiv.org/abs/2211.00593)Cited by:[§1](https://arxiv.org/html/2606.12818#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p2.1)\.
- S\. Wiegreffe, O\. Tafjord, Y\. Belinkov, H\. Hajishirzi, and A\. Sabharwal \(2025\)Answer, assemble, ace: understanding how lms answer multiple choice questions\.InThe Thirteenth International Conference on Learning Representations: ICLR 2025,External Links:2407\.15018,[Link](https://arxiv.org/abs/2407.15018)Cited by:[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p2.1),[§3\.1](https://arxiv.org/html/2606.12818#S3.SS1.p1.1)\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, J\. Davison, S\. Shleifer, P\. von Platen, C\. Ma, Y\. Jernite, J\. Plu, C\. Xu, T\. Le Scao, S\. Gugger, M\. Drame, Q\. Lhoest, and A\. Rush \(2020\)Transformers: state\-of\-the\-art natural language processing\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,Q\. Liu and D\. Schlangen \(Eds\.\),Online,pp\. 38–45\.External Links:[Link](https://aclanthology.org/2020.emnlp-demos.6/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-demos.6)Cited by:[§3\.1](https://arxiv.org/html/2606.12818#S3.SS1.p5.1)\.
- Q\. A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, G\. Dong, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, Z\. Qiu, S\. Quan, and Z\. Wang \(2024\)Qwen2\.5 technical report\.ArXivabs/2412\.15115\.External Links:[Link](https://api.semanticscholar.org/CorpusID:274859421)Cited by:[§3\.1](https://arxiv.org/html/2606.12818#S3.SS1.p5.1)\.
- F\. Zhang and N\. Nanda \(2024\)Towards Best Practices of Activation Patching in Language Models: Metrics and Methods\.InThe Twelfth International Conference on Learning Representations,Note:arXiv:2309\.16042 \[cs\]External Links:[Link](http://arxiv.org/abs/2309.16042),[Document](https://dx.doi.org/10.48550/arXiv.2309.16042)Cited by:[§2\.2](https://arxiv.org/html/2606.12818#S2.SS2.p2.1)\.

## Appendix

Table 4: Directional consistency across MCQA menu sizes\. Values report the percentage of items for which low anchors shift expected value downward and high anchors shift expected value upward relative to controls\.

Table 5: Normalized EV shifts at small menu sizes for selected models\. The expected pattern is negative for low anchors and positive for high anchors; small menus can reverse this pattern under controls\.

## Appendix A MCQA Menu Size Validation

Our main experiments use nine answer options per item\. To justify this choice, we evaluated whether anchoring is preserved as the number of answer options varies\. For each menu size $n \in \{2, \ldots, 9\}$, we constructed MCQA candidate sets spanning the low and high anchors and measured whether each item showed the expected directional pattern relative to controls: low anchors should shift the normalized expected value downward, while high anchors should shift it upward\.

Table [4](https://arxiv.org/html/2606.12818#A0.T4) reports the percentage of items showing both expected directions\. Small menus are unstable: with only two or three options, several models fail to recover the expected bidirectional anchoring pattern\. In some cases, especially for Qwen instruction\-tuned models, the normalized expected\-value shifts even have the wrong sign\. Directional consistency improves as the menu size increases and stabilizes around $n \geq 5$–$7$\. The nine\-option setting used in the main experiments is therefore not arbitrary: it is the largest tested menu size and is either the maximum or within a few percentage points of the maximum directional consistency for all models\.
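The directional-consistency percentage can be sketched as follows. This is an illustrative reading of the description above, assuming one normalized expected-value shift per item and per anchor direction, and assuming strict inequalities (a zero shift does not count as the expected direction); the function name and toy numbers are hypothetical.

```python
from typing import Sequence


def directional_consistency(low_shifts: Sequence[float],
                            high_shifts: Sequence[float]) -> float:
    """Percentage of items whose low-anchor shift is negative AND whose
    high-anchor shift is positive, both measured relative to controls."""
    if len(low_shifts) != len(high_shifts) or len(low_shifts) == 0:
        raise ValueError("need one low and one high shift per item")
    hits = sum(1 for lo, hi in zip(low_shifts, high_shifts) if lo < 0 and hi > 0)
    return 100.0 * hits / len(low_shifts)


# Toy shifts for four items (made-up values, not results from the paper).
low = [-0.10, -0.20, 0.05, -0.30]
high = [0.20, -0.10, 0.10, 0.40]
print(directional_consistency(low, high))  # items 0 and 3 pass: 50.0
```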

## Appendix B Behavioral Validation

Before localizing circuits, we first verify that our MCQA transformation preserves the behavioral signature of anchoring\. This step is necessary because the mechanistic analysis relies on a multiple\-choice format, whereas anchoring is typically studied in open\-ended numerical estimation\.

Table 6: Behavioral anchoring effects after MCQA conversion\. EV shifts are relative to controls, adherence counts expected\-sign shifts, and TVD measures distributional change\.

#### Behavioral Metrics

We use behavioral metrics from Owusu and Feldman \([2026](https://arxiv.org/html/2606.12818#bib.bib28)\) to verify that the multiple\-choice task preserves the anchoring effect studied in prior behavioral work\. These metrics validate the task setup, while the circuit analyses focus on the correct–anchor logit\-difference metric\. In this setting, anchoring means that an irrelevant numerical prime shifts the model’s probability distribution over candidate answers: low anchors should move probability mass toward smaller values, while high anchors should move probability mass toward larger values\. We therefore measure both the direction of the shift, using expected value, and the overall size of the distributional change, using total variation distance\.

For each item $i$, let $\mathcal{L} = \{A, \ldots, I\}$ denote the set of answer labels, and let $\mathcal{V}_i$ denote the corresponding set of numerical candidate values\. For a prompt condition $c$, the model assigns a logit $z_i^{(c)}(\ell)$ to each answer label $\ell \in \mathcal{L}$ at the final answer\-label prediction position\. We obtain a probability distribution over answer labels by applying a softmax over the nine label logits:

$$p_i^{(c)}(\ell) = \frac{\exp z_i^{(c)}(\ell)}{\sum_{\ell' \in \mathcal{L}} \exp z_i^{(c)}(\ell')}. \tag{4}$$

Because answer labels are randomly permuted across trials, we compute label probabilities separately for each permutation, map those probabilities back to their associated numerical values, and then average the resulting value\-space distributions across permutations\. This yields $q_i^{(c)}(v)$, the permutation\-averaged probability assigned to numerical candidate $v \in \mathcal{V}_i$\. Thus, $q_i^{(c)}$ is a distribution over candidate values, not over fixed answer labels or display positions\.
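The label softmax and the permutation averaging can be sketched as below. This is a minimal illustration under stated assumptions: each trial is given as a pair of label logits and the candidate values in the same label order, the toy example uses three labels rather than nine, and the function names and numbers are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Softmax over label logits; subtracting the max is only for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def value_distribution(
    trials: Sequence[Tuple[Sequence[float], Sequence[float]]]
) -> Dict[float, float]:
    """Average per-permutation label probabilities in value space.

    Each trial is (label_logits, values_in_label_order) for one permutation.
    """
    q: Dict[float, float] = {}
    for logits, values in trials:
        for p, v in zip(softmax(logits), values):
            q[v] = q.get(v, 0.0) + p / len(trials)
    return q


# Two permutations of the same three candidate values (toy logits).
trials = [
    ([2.0, 0.0, 0.0], [10.0, 20.0, 30.0]),  # label A -> 10, B -> 20, C -> 30
    ([0.0, 2.0, 0.0], [20.0, 10.0, 30.0]),  # label A -> 20, B -> 10, C -> 30
]
q = value_distribution(trials)
print(q[10.0])  # about 0.787: value 10 is favored under both label orders
```

Because probabilities are keyed by numerical value before averaging, a model that always prefers the same value ends up with mass on that value regardless of which label it was displayed under.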

We summarize the model’s distribution under condition $c$ by its expected value:

$$\mathrm{EV}_i^{(c)} = \sum_{v \in \mathcal{V}_i} q_i^{(c)}(v)\, v. \tag{5}$$

For each anchor direction $d \in \{\mathrm{low}, \mathrm{high}\}$, let $b(d)$ denote the matched control baseline for that direction\. The low\-anchor prompt is compared to a control prompt matched to the low\-anchor string, and the high\-anchor prompt is compared to a control prompt matched to the high\-anchor string\. Both controls preserve the slot\-machine sentence but replace the numerical anchor with a non\-numeric `X` string matched to the tokenizer length of the corresponding anchor\. In some examples these two matched controls are identical, but we define them separately because low and high anchor strings need not tokenize the same way\.

We define the normalized expected\-value shift as

$$\Delta\mathrm{EV}_i^{(d)}=\frac{\mathrm{EV}_i^{(d)}-\mathrm{EV}_i^{(b(d))}}{v_i^{\max}-v_i^{\min}}.\tag{6}$$

Negative values indicate shifts toward lower estimates, while positive values indicate shifts toward higher estimates\. Thus, anchoring predicts $\Delta\mathrm{EV}_i^{(\mathrm{low})}<0$ and $\Delta\mathrm{EV}_i^{(\mathrm{high})}>0$\.
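As a worked illustration of Eqs. 5 and 6, the sketch below computes the expected value and the normalized shift for a toy item. The function names and the distributions are hypothetical, and the sketch assumes that $v_i^{\max}$ and $v_i^{\min}$ are the largest and smallest candidate values of the item. For brevity both directions share one control distribution here, which corresponds to the case where the two matched controls coincide.

```python
def expected_value(q):
    """EV = sum over candidate values v of q(v) * v (Eq. 5)."""
    return sum(p * v for v, p in q.items())

def normalized_ev_shift(q_anchor, q_control):
    """Anchored EV minus matched-control EV, divided by the candidate range (Eq. 6).

    Assumption: the range is max(V_i) - min(V_i) over the item's candidates.
    """
    values = list(q_control)
    span = max(values) - min(values)
    return (expected_value(q_anchor) - expected_value(q_control)) / span

# Toy value-space distributions (illustrative numbers, not results).
q_control = {10: 0.2, 20: 0.6, 30: 0.2}
q_low = {10: 0.5, 20: 0.4, 30: 0.1}    # mass moved toward smaller values
q_high = {10: 0.1, 20: 0.4, 30: 0.5}   # mass moved toward larger values

shift_low = normalized_ev_shift(q_low, q_control)
shift_high = normalized_ev_shift(q_high, q_control)
```

With these toy numbers the control expected value is 20, the low\-anchor expected value is 16, and the high\-anchor expected value is 24, so the normalized shifts are $-0.2$ and $+0.2$ over a candidate range of 20.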

We also measure the total variation distance between the anchored and matched control distributions:

$$\mathrm{TVD}_i^{(d)}=\frac{1}{2}\sum_{v\in\mathcal{V}_i}\left|q_i^{(d)}(v)-q_i^{(b(d))}(v)\right|.\tag{7}$$

Unlike $\Delta\mathrm{EV}$, TVD is non\-directional: it measures the magnitude of the distributional change regardless of whether the expected value shifts upward or downward\.
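A minimal sketch of Eq. 7 makes the non\-directional property concrete. The function name and the toy distributions are assumptions chosen for illustration; the two anchored distributions below are mirror images of each other around the control.

```python
def total_variation_distance(q_anchor, q_control):
    """Half the L1 distance between two value-space distributions (Eq. 7)."""
    support = set(q_anchor) | set(q_control)
    return 0.5 * sum(
        abs(q_anchor.get(v, 0.0) - q_control.get(v, 0.0)) for v in support
    )

# Toy distributions (illustrative numbers, not results).
q_control = {10: 0.2, 20: 0.6, 30: 0.2}
q_low = {10: 0.5, 20: 0.4, 30: 0.1}
q_high = {10: 0.1, 20: 0.4, 30: 0.5}

tvd_low = total_variation_distance(q_low, q_control)
tvd_high = total_variation_distance(q_high, q_control)
```

Both toy anchors yield a TVD of 0.3 even though they move the expected value in opposite directions, which is why TVD is reported alongside the signed expected\-value shift rather than in place of it.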

#### Results

Normalized expected\-value shifts test whether the model’s answer distribution moves in the expected direction, while TVD measures the magnitude of the distributional change regardless of direction\. A preserved anchoring effect predicts $\Delta\mathrm{EV}_i^{(\mathrm{low})}<0$ for low anchors and $\Delta\mathrm{EV}_i^{(\mathrm{high})}>0$ for high anchors\.

Table [6](https://arxiv.org/html/2606.12818#A2.T6) shows that the MCQA transformation preserves the expected directionality of anchoring across all four models\. Mean normalized shifts are negative for low anchors and positive for high anchors in every model, indicating that low anchors pull the expected answer downward while high anchors pull it upward relative to controls\.

Llama\-3\.1\-8B moves in the expected direction on 98/100 low\-anchor questions and 87/100 high\-anchor questions, with mean shifts of $-0.035$ and $+0.042$, respectively\. Qwen2\.5\-7B shows larger aggregate shifts \($-0.112$ for low anchors and $+0.134$ for high anchors\) and high directional consistency in both directions, moving as expected on 98/100 low\-anchor and 95/100 high\-anchor questions\.
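The two per\-model summaries reported above, the count of questions moving in the expected direction and the mean normalized shift, could be computed along the following lines. This is a sketch under assumptions: the function name is hypothetical, the shift values are toy numbers rather than results, and a shift of exactly zero is counted as not moving in the expected direction, which the text does not specify.

```python
def directional_summary(shifts, direction):
    """Return (count in expected direction, number of items, mean shift).

    `shifts` holds one normalized expected-value shift per question.
    Assumption: a zero shift does not count as the expected direction.
    """
    expected_sign = -1 if direction == "low" else 1
    n_expected = sum(1 for s in shifts if s * expected_sign > 0)
    mean_shift = sum(shifts) / len(shifts)
    return n_expected, len(shifts), mean_shift

# Four toy low-anchor questions (illustrative numbers only).
toy_low_shifts = [-0.10, -0.02, 0.01, -0.05]
n_expected, n_items, mean_shift = directional_summary(toy_low_shifts, "low")
```

Here three of the four toy questions shift downward and the mean shift is $-0.04$, mirroring the "count out of 100" and "mean shift" format used for each model and anchor direction.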

## Appendix C EAP\-IG Circuit Analyses

#### Component composition\.

Table [7](https://arxiv.org/html/2606.12818#A3.T7) compares the component\-type composition of the full scored edge set with the top $5\%$ EAP\-IG circuits\. Most possible edges in the full graph are attention\-to\-attention edges, but the selected circuits contain proportionally more attention–MLP edges\. This suggests that MLPs contribute mainly through connections to and from attention components rather than through MLP\-to\-MLP pathways\.
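The composition comparison can be illustrated with a small sketch that ranks edges by absolute attribution score, keeps a top fraction, and tabulates edge types. Everything specific here is an assumption made for illustration: the node naming scheme (`input`, `a<layer>.h<head>`, `m<layer>`), the directed type labels such as `A->M`, the toy scores, and the use of a 25% cutoff in place of $5\%$ because the toy graph has only eight edges.

```python
from collections import Counter

def component_type(node):
    """Map a (hypothetical) node name to E, A, or M."""
    if node == "input":
        return "E"
    if node.startswith("a"):
        return "A"
    if node.startswith("m"):
        return "M"
    raise ValueError(f"unknown node name: {node}")

def top_fraction(scores, fraction):
    """Keep the top `fraction` of edges ranked by absolute attribution score."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(scores, key=lambda edge: abs(scores[edge]), reverse=True)
    return set(ranked[:k])

def composition(edges):
    """Percentage of edges falling into each source->destination type."""
    counts = Counter(
        f"{component_type(src)}->{component_type(dst)}" for src, dst in edges
    )
    return {t: 100.0 * c / len(edges) for t, c in counts.items()}

# Toy attribution scores keyed by (source, destination); not real data.
scores = {
    ("input", "a0.h0"): 0.05,
    ("a0.h0", "a1.h0"): -0.01,
    ("a0.h0", "a1.h1"): 0.02,
    ("a0.h1", "a1.h0"): 0.01,
    ("a0.h0", "m0"): -0.90,
    ("m0", "a1.h0"): 0.70,
    ("m0", "m1"): 0.03,
    ("a1.h0", "m1"): 0.04,
}
full = composition(scores.keys())
top = composition(top_fraction(scores, 0.25))
```

In this toy graph attention\-to\-attention edges are the largest single type in the full edge set (37.5%), yet the two retained edges are both attention–MLP connections, which is the kind of contrast between the full\-graph rows and the circuit rows that the table is designed to expose.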

#### Low–high overlap across retained fractions\.

Table [10](https://arxiv.org/html/2606.12818#A3.T10) compares structural overlap and functional transfer between low\- and high\-anchor EAP\-IG circuits\. Jaccard overlap is moderate, but the shared fraction shows that a large portion of each equal\-sized circuit appears in the other\. Cross\-contrast faithfulness rises quickly as more top\-ranked edges are retained, indicating that the two anchor contrasts are not identical as edge sets but are functionally related\.
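The difference between Jaccard overlap and shared fraction, and the clipped averaging used for cross\-contrast faithfulness, can be sketched as below. The function names, edge identifiers, and faithfulness values are hypothetical; the faithfulness scores themselves are taken as given inputs, since their computation is defined elsewhere in the paper.

```python
def jaccard(a, b):
    """Intersection size relative to the union of the two edge sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def shared_fraction(a, b):
    """Fraction of each equal-sized circuit that also appears in the other."""
    if len(a) != len(b):
        raise ValueError("shared fraction assumes equal-sized circuits")
    return len(a & b) / len(a)

def cross_contrast_faithfulness(low_to_high, high_to_low):
    """Average of the two transfer directions, each clipped to [0, 1]."""
    def clip(x):
        return min(1.0, max(0.0, x))
    return 0.5 * (clip(low_to_high) + clip(high_to_low))

# Toy top-k circuits of equal size (edge names are placeholders).
low_circuit = {"e1", "e2", "e3", "e4"}
high_circuit = {"e3", "e4", "e5", "e6"}

j = jaccard(low_circuit, high_circuit)
s = shared_fraction(low_circuit, high_circuit)
f = cross_contrast_faithfulness(1.2, 0.8)
```

For equal\-sized circuits the two overlap measures are linked by $J = s/(2-s)$, so a shared fraction of 0.5 corresponds to a Jaccard value of only about 0.33. This is why a moderate Jaccard overlap is compatible with a large portion of each circuit appearing in the other.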

#### Sparse\-circuit transfer\.

Table [11](https://arxiv.org/html/2606.12818#A3.T11) reports pointwise faithfulness at $5\%$ retained EAP\-IG edges\. Unlike CPR, which summarizes recovery across the full retained\-fraction curve, this table tests whether transfer holds at a sparse circuit size\. The results show that low–high transfer is often strong even at $5\%$, while cross\-variant transfer is more variable, suggesting that instruction tuning changes which sparse edges matter most\.

#### Additional transfer results\.

Tables [11](https://arxiv.org/html/2606.12818#A3.T11) and [9](https://arxiv.org/html/2606.12818#A3.T9) provide complementary transfer results for EAP\-IG edge circuits\. Table [11](https://arxiv.org/html/2606.12818#A3.T11) reports pointwise faithfulness at $5\%$ retained edges, while Table [9](https://arxiv.org/html/2606.12818#A3.T9) reports CPR across the full retained\-fraction sweep\.

Table 7: Component\-type composition of the full scored edge set and the top $5\%$ EAP\-IG edge circuits by absolute attribution score\. Values are percentages of edges\. A = attention component, M = MLP component, and E = embedding\. The full graph rows show the component\-type distribution before ranking by EAP\-IG score\.

Table 8: Jaccard similarity between low\- and high\-anchor EAP\-IG edge circuits across retained fractions\. For each model and retained fraction $k$, we compare the top\-$k\%$ edges ranked by absolute EAP\-IG score for the low\-anchor contrast and the high\-anchor contrast\. Higher values indicate greater overlap between the two edge sets\.

Table 9: Cross\-condition faithfulness of EAP\-IG edge circuits\. Matched uses circuits ranked on the same model and anchor contrast as evaluation\. Cross\-contrast uses the same model but the opposite anchor contrast\. Cross\-variant uses the paired base or instruction\-tuned variant from the same model family with the same anchor contrast\. Cross\-both changes both the model variant and anchor contrast\. Random uses same\-size random edge sets\.

Table 10: Low–high circuit overlap and cross\-contrast faithfulness across retained fractions for EAP\-IG edge circuits\. Jaccard measures overlap relative to the union of the low\- and high\-anchor edge sets\. Shared fraction measures how much of each equal\-sized top\-$k$ circuit is shared with the other\. Cross\-contrast faithfulness reports average faithfulness when a circuit ranked on one control–anchor contrast is evaluated on the other, with each direction clipped to $[0,1]$\. These measures separate edge\-set overlap from functional transfer\.

Table 11: Sparse\-circuit transfer at $5\%$ retained EAP\-IG edges\. Values are pointwise unclipped faithfulness values at the retained\-fraction sweep point nearest $5\%$ \(unlike CPR\)\. Matched uses circuits ranked on the same model and anchor contrast as evaluation\. Cross\-contrast uses the same model but the opposite anchor contrast\. Cross\-variant uses the paired base or instruction\-tuned variant from the same model family with the same anchor contrast\. Cross\-both changes both the model variant and anchor contrast\. Random uses same\-size random edge sets\. Values are unclipped and may exceed 1 when the retained circuit overshoots the full\-model recovery\.

Table 12: Low–high edge overlap and transfer at $5\%$ retained EAP\-IG edges\. Jaccard measures overlap relative to the union of the two edge sets; shared fraction measures how much of each equal\-sized circuit is shared; cross\-contrast faithfulness averages low\-to\-high and high\-to\-low faithfulness, clipped to $[0,1]$\.
