Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

arXiv cs.CL 05/08/26, 04:00 AM Papers
Summary
This paper challenges the 'Locate-then-Update' paradigm in LLM post-training by demonstrating that static mechanistic localization is insufficient due to the dynamic evolution of neural circuits during fine-tuning. It introduces new metrics to analyze circuit stability and proposes the need for predictive frameworks in mechanistic localization.
arXiv:2605.06076v1 Announce Type: new
# Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
Source: [https://arxiv.org/html/2605.06076](https://arxiv.org/html/2605.06076)
- Hang Chen, School of Computer Science and Technology, Xi’an Jiaotong University, albert2123@stu.xjtu.edu.cn
- Jiaying Zhu, School of Computer Science and Engineering, The Chinese University of Hong Kong, jyzhu24@cse.cuhk.edu.hk
- Hongyang Chen, China Mobile Group Shaanxi Co., Ltd (Xi’an 710077, China), chenhongyang@sn.chinamobile.com
- Hongxu Liu, College of Computing and Data Science, Nanyang Technological University, hongxu001@e.ntu.edu.sg
- Xinyu Yang, School of Computer Science and Technology, Xi’an Jiaotong University, yxyphd@mail.xjtu.edu.cn
- Wenya Wang, College of Computing and Data Science, Nanyang Technological University, wangwy@ntu.edu.sg

###### Abstract

The "Locate-then-Update" paradigm has become a predominant approach in the post-training of large language models (LLMs), identifying critical components via mechanistic interpretability for targeted parameter updates. However, this paradigm rests on a fundamental yet unverified assumption: can mechanisms derived from current static parameters reliably guide future dynamic parameter updates? To investigate this, we systematically track the structural evolution of Transformer circuits throughout the supervised fine-tuning (SFT) process, revealing the underlying dynamics of task mechanisms. We introduce three novel metrics (Circuit Distance, Circuit Stability, and Circuit Conflict) to analyze circuit evolution across three dimensions: neural migration, semantic stability, and cross-task interference. Our empirical results reveal that circuits inherently exhibit "Free Evolution" during parameter updates. Consequently, static mechanisms extracted from current states inevitably suffer from temporal latency, making them fundamentally inadequate for guiding future states. Moreover, by deconstructing the "illusion of effectiveness" in existing methods, this work underscores the necessity of "foresight" in mechanistic localization and proposes a predictive framework for future research. Our code is available at [https://github.com/Zodiark-ch/MechLocalization](https://github.com/Zodiark-ch/MechLocalization).

## 1 Introduction

Post-training optimization of large language models (LLMs) refers to the targeted refinement of pre-trained language models equipped with powerful general capabilities (Lai et al., [2025](https://arxiv.org/html/2605.06076#bib.bib1); Xiao et al., [2023](https://arxiv.org/html/2605.06076#bib.bib2)). Strategies such as supervised fine-tuning (SFT) (Hu et al., [2022](https://arxiv.org/html/2605.06076#bib.bib3)), reinforcement learning (Havrilla et al., [2024](https://arxiv.org/html/2605.06076#bib.bib5)), parameter editing (Yao et al., [2023](https://arxiv.org/html/2605.06076#bib.bib6)), or vector steering (Cao et al., [2024](https://arxiv.org/html/2605.06076#bib.bib7)) are employed to marginally alter model parameters. This ensures the model better aligns with practical application scenarios while preserving its general capabilities. From an optimization perspective, post-training processing aims to achieve optimal performance on a new target task while maintaining existing capabilities (hereafter referred to as pervasiveness tasks) (Zhang et al., [2026](https://arxiv.org/html/2605.06076#bib.bib8)).

To mitigate catastrophic forgetting on pervasiveness tasks, recent studies increasingly adopt a "locate-then-update" paradigm across applications like model unlearning (Wu et al., [2023](https://arxiv.org/html/2605.06076#bib.bib12); Li et al., [2025](https://arxiv.org/html/2605.06076#bib.bib13)), knowledge editing (Meng et al., [2022](https://arxiv.org/html/2605.06076#bib.bib10); Dai et al., [2022](https://arxiv.org/html/2605.06076#bib.bib11)), and reinforcement learning (Yan et al., [2026](https://arxiv.org/html/2605.06076#bib.bib4)). This paradigm relies on Mechanistic Localization: using mechanistic interpretability to identify the minimal parameter space responsible for the target skill. Post-training parameter updates are then exclusively confined to this localized region.

However, recent studies show that Mechanistic Localization often lacks completeness (Chen et al., [2025](https://arxiv.org/html/2605.06076#bib.bib14)) and exclusiveness (Hase et al., [2023](https://arxiv.org/html/2605.06076#bib.bib9)), raising a critical question: does it act merely as a “placebo”? Specifically, can a static snapshot of mechanistic interpretability findings genuinely guide the dynamic process of future parameter updates? As illustrated in Figure [1](https://arxiv.org/html/2605.06076#S1.F1), if full-parameter SFT causes the target task’s critical components to shift from $B_1, B_2$ to $A_1, A_2$, localization based solely on pre-update parameters will prematurely freeze $A_1$ and $A_2$. Therefore, it remains unclear whether Mechanistic Localization genuinely prevents conflicts or improperly constrains the target mechanism’s natural evolution. To address this, we break the problem down into two specific research questions:

![Refer to caption](https://arxiv.org/html/2605.06076v1/x1.png)
Figure 1: Differences in mechanism localization in post-training SFT with and without localization.

- RQ1: Without localization, do the critical components of the target skill change during parameter updates? If so, how do they evolve?
- RQ2: Does post-training with localization truly improve performance on the target task and mitigate conflicts with pervasiveness tasks?

In this paper, we use SFT as a representative post-training process so that the evolutionary process can be observed directly, and we define the component (specifically, an independent parameter matrix such as $W_q, W_k, W_v, W_o$ in attention heads or $W_{\text{up}}, W_{\text{down}}$ in MLPs) as the minimal update unit. For Mechanistic Localization, we adopt circuit discovery (Conmy et al., [2023](https://arxiv.org/html/2605.06076#bib.bib15); Syed et al., [2024](https://arxiv.org/html/2605.06076#bib.bib16)), which comprehensively captures overarching mechanisms and inter-component connections. To address RQ1, we introduce circuit distance to quantify critical component migration, and circuit stability to evaluate the model’s mastery of the task mechanism. For RQ2, alongside direct performance observation, we propose circuit conflict to measure how effectively localization prevents mechanistic clashes between the target and pervasiveness tasks. Through extensive experiments, we draw the following conclusions:

- Divergent Free Evolution: Without localization, critical components evolve freely, exhibiting distinct structural patterns: attention mechanisms undergo drastic shifts, whereas MLP components remain relatively stable.
- Temporal Lag of Static Localization: Because circuits inherently exhibit free evolution, using current parameter states as a reference for future parameters suffers from severe temporal lag.
- The Illusion of Effectiveness: The perceived success of existing Mechanistic Localization methods relies heavily on their application to MLP-dominated, knowledge-centric downstream tasks (e.g., knowledge editing, unlearning).

Ultimately, this paper reveals that Mechanistic Localization exhibits a critical lag during dynamic updates. To better optimize target performance and minimize mechanistic conflicts, we explore the necessity of more advanced, dynamic localization paradigms in Section [5](https://arxiv.org/html/2605.06076#S5).

## 2 Preliminaries

We denote a well-trained LLM as $\mathcal{M}=\langle\mathcal{G},\theta\rangle$, where $\theta$ represents the states of all trainable parameters. The computational graph $\mathcal{G}=\langle\mathcal{V},\mathcal{E}\rangle$ models the forward pass, with $\mathcal{V}$ comprising all components (i.e., parameter matrices such as $W_q, W_k, W_v, W_o, W_{\text{up}}, W_{\text{down}}$) and $\mathcal{E}$ denoting their activation connections (e.g., $W_o\rightarrow W_{\text{up}}$).

### 2.1 Post-Training Processing

We define post-training processing as modifying a target task’s mechanism while preserving pre-existing capabilities (pervasiveness tasks), with typical applications including model unlearning (Liu et al., [2025](https://arxiv.org/html/2605.06076#bib.bib17)) and knowledge editing (Wang et al., [2024](https://arxiv.org/html/2605.06076#bib.bib18)). To observe intermediate dynamics directly, we formalize post-training processing as a multi-objective fine-tuning task (we demonstrate in Appendix [F](https://arxiv.org/html/2605.06076#A6) that our conclusions also hold for non-continuous post-training methods like knowledge editing). We are given initial parameters $\theta$, a target dataset $\mathcal{D}_t$ in which each input $x$ should yield the output $y_t$, and a pervasiveness dataset $\mathcal{D}_p$ in which each input $x$ yields an output $y$ that follows $p(y|x,\theta)$. The updated parameters $\theta^{\prime}$ are optimized via:

$$\min_{\theta^{\prime}}\ \mathbb{E}_{(x,y_t)\in\mathcal{D}_t}\left[\mathcal{L}(y_t|x;\theta^{\prime})\right]+\lambda\,\mathbb{E}_{(x,y)\in\mathcal{D}_p}\left[\mathcal{L}(y|x;\theta^{\prime})\right] \tag{1}$$

where $\lambda\geq 0$ is the regularization parameter. Essentially, this objective ensures $\theta^{\prime}$ adapts to the target behavior specified by $\mathcal{D}_t$ while leaving the forward pass for task-irrelevant inputs unaltered.
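As a rough illustration of Eq. (1), the sketch below evaluates the empirical objective from per-example losses. The function name `post_training_objective` and the loss values are hypothetical placeholders, not taken from the paper's code; in practice the losses would come from forward passes of the model under $\theta^{\prime}$.

```python
from statistics import fmean


def post_training_objective(target_losses, pervasive_losses, lam):
    """Empirical form of Eq. (1): mean target loss + lam * mean pervasiveness loss."""
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    return fmean(target_losses) + lam * fmean(pervasive_losses)


# Hypothetical per-example losses under candidate parameters theta'.
target = [0.9, 1.1, 1.0]      # losses on D_t
pervasive = [0.2, 0.4]        # losses on D_p
print(post_training_objective(target, pervasive, lam=0.5))
```

Setting `lam=0` recovers plain target-task fine-tuning, while larger values weight retention of pervasiveness behavior more heavily.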

### 2.2 Circuit Discovery

We employ circuit discovery as our mechanistic interpretability technique. Compared to alternatives, it better deconfounds correlations (vs. gradient/magnitude methods (Li et al., [2016](https://arxiv.org/html/2605.06076#bib.bib19); Tang et al., [2024](https://arxiv.org/html/2605.06076#bib.bib20))), captures holistic mechanisms (vs. causal interventions (Stolfo et al., [2023](https://arxiv.org/html/2605.06076#bib.bib21))), and provides stronger theoretical foundations (vs. probing (Ju et al., [2024](https://arxiv.org/html/2605.06076#bib.bib22)) or vocabulary lenses (Belrose et al., [2023](https://arxiv.org/html/2605.06076#bib.bib23))). It seeks to identify a minimal subgraph (circuit) $\mathcal{C}\subset\mathcal{G}$ capturing the task-relevant behavior of a target dataset (Elhage et al., [2021](https://arxiv.org/html/2605.06076#bib.bib24); Conmy et al., [2023](https://arxiv.org/html/2605.06076#bib.bib15); Rai et al., [2024](https://arxiv.org/html/2605.06076#bib.bib25)), optimized via:

$$\arg\min_{\mathcal{C}}\ \mathbb{E}_{x\in\mathcal{D}_t}\left[D\left(p_{\mathcal{G}}(y|x)\,\|\,p_{\mathcal{C}}(y|x)\right)\right],\quad \text{s.t.}\ 1-|\mathcal{C}|/|\mathcal{G}|\geq s \tag{2}$$

where $s$ is the sparsity constraint and $D$ measures the output divergence between $\mathcal{G}$ and $\mathcal{C}$. The nodes and edges within $\mathcal{C}$ are thus deemed the most critical components for processing $\mathcal{D}_t$. (Note that the discovery target dataset $\mathcal{D}_t$ can correspond to either the post-training target task or a pervasiveness task, depending on which mechanism is being observed.)

### 2.3 Locate-then-Update Paradigm

In summary, the locate-then-update paradigm consists of two steps. First (locating), mechanistic interpretability (here, circuit discovery) identifies the critical component set $\mathcal{C}=\langle\mathcal{V}_t,\mathcal{E}_t\rangle$ for $\mathcal{D}_t$ (details of logical circuit discovery are given in Appendix [A](https://arxiv.org/html/2605.06076#A1)). Second (updating), the remaining components $\mathcal{V}^{*}=\mathcal{V}\setminus\mathcal{V}_t$ are frozen, followed by target-specific post-training on $\mathcal{D}_t$. Recent literature has advanced both phases: localization improvements include jointly considering $\mathcal{D}_t$ and $\mathcal{D}_p$ (Jia et al., [2024](https://arxiv.org/html/2605.06076#bib.bib26)), ensembling multiple interpretability methods (Li et al., [2025](https://arxiv.org/html/2605.06076#bib.bib13)), and utilizing low-dimensional projections (Muhamed et al., [2025](https://arxiv.org/html/2605.06076#bib.bib27)); meanwhile, updating enhancements employ diverse fine-tuning strategies such as gradient ascent (Liu et al., [2022](https://arxiv.org/html/2605.06076#bib.bib28)), direct preference optimization (Maini et al., [2024](https://arxiv.org/html/2605.06076#bib.bib30)), and negative preference optimization (Zhang et al., [2024](https://arxiv.org/html/2605.06076#bib.bib29)).
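The two steps can be sketched in a few lines. This is a toy illustration under stated assumptions: scalars stand in for parameter matrices, and the helper names `trainable_mask` and `masked_sgd_step` are hypothetical rather than part of the paper's released code.

```python
def trainable_mask(all_components, circuit_components):
    """Step 1 (locate): mark components in V_t as trainable, all others as frozen."""
    located = set(circuit_components)
    return {name: name in located for name in all_components}


def masked_sgd_step(params, grads, mask, lr):
    """Step 2 (update): apply a gradient step only to components that are not frozen."""
    return {
        name: value - lr * grads[name] if mask[name] else value
        for name, value in params.items()
    }


# Toy model: one scalar per component instead of a full weight matrix.
params = {"W_q": 1.0, "W_up": 2.0, "W_down": 3.0}
grads = {"W_q": 0.5, "W_up": 0.5, "W_down": 0.5}
mask = trainable_mask(params, circuit_components=["W_q"])
updated = masked_sgd_step(params, grads, mask, lr=0.1)
print(updated)
```

Note that the mask is computed once from the pre-update state and never refreshed, which is exactly the static assumption the paper questions.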

## 3 Evaluation Metrics

This section introduces three metrics (summarized in Table [1](https://arxiv.org/html/2605.06076#S3.T1)) to evaluate mechanism evolution. To address RQ1, Circuit Distance ($CD$) and Circuit Stability ($CS$) evaluate a single mechanism’s evolution: given a target task’s logical circuit (Appendix [A](https://arxiv.org/html/2605.06076#A1)), $CD$ measures component migration, while $CS$ assesses knowledge consolidation. To address RQ2, we propose Circuit Conflict ($CC$) to quantify inter-mechanism interference. Combined with intrinsic task performance metrics, which gauge capability enhancement or retention, $CC$ comprehensively evaluates multi-task interactions within the Locate-then-Update paradigm.

Table 1: An overview of the three circuit metrics.

### 3.1 Circuit Distance ($CD$)

In circuit discovery, an edge $W_i\rightarrow W_j\in\mathcal{C}$ is identified by measuring the change in output under causal interventions. An edge is retained if its causal effect $I(W_i\rightarrow W_j)$ exceeds a threshold $\tau$:

$$I(W_i\rightarrow W_j)=\left|\mathbb{L}(x|\text{do}(W_i\rightarrow W_j))-\mathbb{L}(x)\right|>\tau \tag{3}$$

where $\mathbb{L}$ denotes the output logits, and $\text{do}(\cdot)$ represents activation patching. For these interventions, we default to interchange ablation (Heimersheim and Nanda, [2024](https://arxiv.org/html/2605.06076#bib.bib31); Vig et al., [2020](https://arxiv.org/html/2605.06076#bib.bib32); Chan et al., [2022](https://arxiv.org/html/2605.06076#bib.bib33); Goldowsky-Dill et al., [2023](https://arxiv.org/html/2605.06076#bib.bib34)).
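A minimal sketch of the retention rule in Eq. (3), together with the sparsity term from Eq. (2), is given below. It assumes the causal effects have already been measured and reduced to one scalar per edge; the edge names, effect values, and helper names are hypothetical.

```python
def select_circuit(effects, tau):
    """Keep edges whose causal effect exceeds tau, as in Eq. (3)."""
    return {edge: eff for edge, eff in effects.items() if eff > tau}


def sparsity(circuit, graph_edges):
    """Fraction of edges pruned away: 1 - |C| / |G|, the constraint term in Eq. (2)."""
    return 1.0 - len(circuit) / len(graph_edges)


# Hypothetical pre-computed causal effects I(W_i -> W_j) for a four-edge graph.
effects = {
    ("W_o.0", "W_up.1"): 0.42,
    ("W_v.0", "W_o.0"): 0.03,
    ("W_q.1", "W_o.1"): 0.17,
    ("W_up.1", "W_down.1"): 0.01,
}
circuit = select_circuit(effects, tau=0.1)
print(sorted(circuit), sparsity(circuit, effects))
```

Raising `tau` yields a sparser circuit, so the threshold and the sparsity constraint $s$ are two views of the same trade-off.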

To quantify the migration of critical pathways across parameter states $\theta$ and $\theta^{\prime}$, we employ the Manhattan distance over the computational graph $\mathcal{G}=\langle\mathcal{V},\mathcal{E}\rangle$ with $N$ components. Specifically, we associate each edge in $\mathcal{E}$ with its causal effect, denoted as $\mathcal{E}=\{W_i\rightarrow W_j, I(W_i\rightarrow W_j)\}$. This approach captures continuous variations more effectively than discrete metrics (e.g., Hamming distance). The Circuit Distance ($CD$) is defined as:

$$CD=D_{\mathcal{G}}(\mathcal{G}^{\theta},\mathcal{G}^{\theta^{\prime}})=\sum_{(W_i,W_j)\in\mathcal{E}}\left|I(W_i\rightarrow W_j)^{\theta}-I(W_i\rightarrow W_j)^{\theta^{\prime}}\right| \tag{4}$$

To account for varying logit baselines across tasks, we normalize $CD$ using the maximum empirical range of $I(\cdot)$. Ultimately, $CD$ reflects the extent of mechanism transformation by aggregating the absolute shifts in causal effects across all components.
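The following sketch computes Eq. (4) over two dictionaries of edge effects. Two choices here are assumptions, since the text does not spell them out: an edge missing from one state is treated as having effect 0, and normalization divides by the max-minus-min range of all observed effects.

```python
def circuit_distance(effects_before, effects_after, normalize=True):
    """Manhattan distance between edge-effect maps at two parameter states (Eq. 4)."""
    edges = set(effects_before) | set(effects_after)
    total = sum(
        abs(effects_before.get(e, 0.0) - effects_after.get(e, 0.0)) for e in edges
    )
    if not normalize:
        return total
    # Assumed normalization: divide by the empirical range of I(.) over both states.
    values = list(effects_before.values()) + list(effects_after.values())
    span = max(values) - min(values)
    return total / span if span > 0 else 0.0


# Hypothetical effects at theta (before) and theta' (after).
before = {("W_q.0", "W_o.0"): 0.4, ("W_up.0", "W_down.0"): 0.1}
after = {("W_q.0", "W_o.0"): 0.1, ("W_up.0", "W_down.0"): 0.3}
print(circuit_distance(before, after, normalize=False))
print(circuit_distance(before, after))
```

Because the distance uses continuous effects rather than set membership, an edge that weakens but stays above $\tau$ still contributes to $CD$.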

### 3.2 Circuit Stability ($CS$)

Beyond tracking mechanism migration, it is crucial to assess the model’s mastery of the mechanism. Recent studies (Sun, [2025](https://arxiv.org/html/2605.06076#bib.bib35)) indicate that if the learning of the target mechanism has not converged, different sampled subsets of $\mathcal{D}_t$ will yield significantly divergent circuits. We quantify this using Circuit Stability ($CS$), defined as the expected Spearman’s rank correlation ($\rho$) between circuits derived from two i.i.d. subsets $s, s^{\prime}$ (each 20% of $\mathcal{D}_t$):

$$CS(\theta^{\prime})=\mathbb{E}_{s,s^{\prime}\in\mathcal{D}_t}\left[\rho\left(\mathcal{C}^{\theta^{\prime}}_{s},\mathcal{C}^{\theta^{\prime}}_{s^{\prime}}\right)\right] \tag{5}$$

Here, $\rho$ ranks the causal effects of all edges. Specifically, we take circuits $\mathcal{C}_s$ and $\mathcal{C}_{s^{\prime}}$, and rank $\mathcal{E}=\{W_i\rightarrow W_j, I(W_i\rightarrow W_j)\}^{N}_{(i,j)}$ based on $\mathcal{C}_s(\mathcal{E})$ and $\mathcal{C}_{s^{\prime}}(\mathcal{E})$, respectively. Therefore, $CS$ reflects the stability trend of the target mechanism across two parameter states by measuring the correlation between circuits derived from different randomly sampled subsets. A large $|CS|$ indicates high cross-sample consistency, meaning knowledge is effectively consolidated. Conversely, a small $|CS|$ implies the information remains scattered and unsettled, undergoing further refinement. Combining $CD$ and $CS$ yields four joint states:

- Large $CD$, Large $|CS|$: The model preserves the firmly consolidated information of the mechanism but actively migrates its storage to a new set of components.
- Small $CD$, Small $|CS|$: The model continues to utilize the original components but actively updates and refines the unsettled internal information within them.
- Large $CD$, Small $|CS|$: The model undergoes a reorganization, simultaneously updating the unsettled internal information and migrating its storage to a new set of components.
- Small $CD$, Large $|CS|$: The model maintains a highly stable state, retaining both the consolidated information of the mechanism and the original components storing it.

An intuitive, albeit imprecise, metaphor is that $CD$ determines whether to change the glass holding the cocktail, while $CS$ determines whether to refine the cocktail’s recipe. $CD$ and $CS$ not only reflect the evolutionary state of a single mechanism but also serve as a “baseline evolution” for studying multiple mechanisms. By comparing against this baseline, we can better analyze the practical impact of localization on multiple mechanisms.
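To make Eq. (5) concrete, here is a standard-library sketch of $CS$ as the mean Spearman correlation over pairs of random 20% subsets. The callback `edge_effects_fn` is a hypothetical stand-in for circuit discovery: it is assumed to return one causal effect per edge, in a fixed edge order, for a given subset. The number of sampled pairs is likewise an assumption.

```python
import random
from statistics import fmean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = fmean(rx), fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def circuit_stability(edge_effects_fn, dataset, n_pairs=10, frac=0.2, seed=0):
    """Monte Carlo estimate of Eq. (5) over random subset pairs."""
    rng = random.Random(seed)
    k = max(1, int(frac * len(dataset)))
    scores = []
    for _ in range(n_pairs):
        s1 = rng.sample(dataset, k)
        s2 = rng.sample(dataset, k)
        scores.append(spearman(edge_effects_fn(s1), edge_effects_fn(s2)))
    return fmean(scores)
```

As a usage note, `spearman` is undefined when all effects in a subset are identical (zero rank variance), so a production version would need to guard that case.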

### 3.3 Circuit Conflict ($CC$)

We define conflicting components as critical components shared across multiple optimization targets’ circuits. Reducing these components typically enhances post-training performance (Chen et al., [2026](https://arxiv.org/html/2605.06076#bib.bib36)), as sharing induces optimization competition rather than synergy. For instance, a polysemantic component $W_i$ encoding both “sports” and “geography” faces competing gradient requirements during multi-objective optimization within “sports” and “geography” tasks, preventing simultaneous optimality. Conversely, disentangling these semantics into distinct components $W_m$ and $W_n$ enables independent tuning to task-specific peaks. Therefore, a decreasing trend in conflicting components during evolution is highly desirable, signifying a shift toward specialized neural representations that facilitate superior independent performance.

However, identifying conflicting components by simply intersecting circuits yields biased conclusions due to component non-exclusivity (e.g., redundant components effectively forming an “OR gate”). To resolve this, we leverage logical circuits (Chen et al., [2026](https://arxiv.org/html/2605.06076#bib.bib36)) to map task circuits into Conjunctive Normal Form (CNF) clauses, framing conflict detection as a Boolean satisfiability (SAT) problem. Let $\Phi_t$ and $\Phi_p$ denote the CNFs for the target and pervasiveness circuits, respectively. We define Circuit Conflict ($CC$) as:

$$CC(\theta)=\min_{n}\{\text{UNSAT}(\Phi^{\theta}_{t}\land\Phi^{\theta}_{p})\} \tag{6}$$

where UNSAT computes the UNSAT Core (the minimum number of conflicting clauses) using a SAT solver (Selsam and Bjørner, [2019](https://arxiv.org/html/2605.06076#bib.bib37); Cimatti et al., [2007](https://arxiv.org/html/2605.06076#bib.bib38); D’Ippolito et al., [2010](https://arxiv.org/html/2605.06076#bib.bib39)). This formulation extends naturally to multiple pervasiveness tasks (e.g., $\Phi_t\land\Phi_{p1}\land\dots$).

By extracting the UNSAT Core across parameter states, $CC$ strictly quantifies the evolutionary trend of conflicts. A larger $CC$ signifies intensified interference from pervasiveness tasks, which inherently hinders optimal target performance. Ultimately, $CC$ bypasses superficial performance fluctuations, directly revealing whether multiple mechanisms evolve synergistically.
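The idea behind Eq. (6) can be illustrated with a brute-force toy that finds the size of a smallest unsatisfiable subset of clauses. This is only a sketch under assumptions: the clause encoding (signed integers, positive for a variable and negative for its negation) and the example formulas are hypothetical, the mapping from circuits to CNF is not reproduced here, and the exhaustive search stands in for the SAT solver the paper relies on.

```python
from itertools import combinations, product


def satisfiable(clauses):
    """Exhaustively check whether some assignment satisfies every clause."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for bits in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, bits))
        if all(any(assign[abs(l)] == (l > 0) for l in clause) for clause in clauses):
            return True
    return False


def min_unsat_core_size(phi_t, phi_p):
    """Size of a smallest unsatisfiable clause subset of phi_t AND phi_p (0 if satisfiable)."""
    clauses = list(phi_t) + list(phi_p)
    for size in range(1, len(clauses) + 1):
        for subset in combinations(clauses, size):
            if not satisfiable(subset):
                return size
    return 0


# Hypothetical CNFs: the target needs x1 and (x2 or x3); the pervasiveness
# task needs not-x1 and (not-x2 or x3).
phi_t = [(1,), (2, 3)]
phi_p = [(-1,), (-2, 3)]
print(min_unsat_core_size(phi_t, phi_p))
```

The search is exponential in both clauses and variables, so it is suitable only for tiny examples like this one.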

## 4 Experiments

Our evaluation framework is grounded in a Supervised Fine-Tuning (SFT) pipeline. The primary optimization objective is to master a target task while preserving performance on a pervasiveness task. We analyze the underlying mechanistic dynamics by evaluating task circuits at each SFT step (extraction details in Appendix [A](https://arxiv.org/html/2605.06076#A1)).

Building upon this, we structure our experiments around two distinct SFT pipelines:

- Free Evolution: Without Mechanistic Localization, we fine-tune all parameters to investigate whether, how, to what extent, and why circuits change during parameter evolution.
- Localization Evolution: Adhering to the “locate-then-update” paradigm (Section [2.3](https://arxiv.org/html/2605.06076#S2.SS3)), we identify critical components (Appendix [A](https://arxiv.org/html/2605.06076#A1)), freeze the others, and perform SFT. By comparing this against free evolution, we explore whether localization genuinely benefits parameter updates and investigate why existing methods ostensibly improve task performance.

We employ Mistral-7B and LLaMA3-8B as baseline models (Mistral-7B: [https://huggingface.co/mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1); Llama3-8B: [https://huggingface.co/meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)). For target and pervasiveness tasks, we select 15 datasets widely adopted in mechanistic interpretability and post-training research: OpenBookQA (Mihaylov et al., [2018](https://arxiv.org/html/2605.06076#bib.bib40)), Gender (Mathwin et al., [2023](https://arxiv.org/html/2605.06076#bib.bib41)), RTE (Dagan et al., [2022](https://arxiv.org/html/2605.06076#bib.bib42)), IOI ([Wang et al.](https://arxiv.org/html/2605.06076#bib.bib43)), Docstring (Heimersheim and Janiak, [2023](https://arxiv.org/html/2605.06076#bib.bib44)), SST2 (Socher et al., [2013](https://arxiv.org/html/2605.06076#bib.bib45)), Winogrande ([51](https://arxiv.org/html/2605.06076#bib.bib46)), Reverse (Lindner et al., [2023](https://arxiv.org/html/2605.06076#bib.bib47)), Greater Than (Hanna et al., [2023](https://arxiv.org/html/2605.06076#bib.bib48)), FEVER (Thorne et al., [2018](https://arxiv.org/html/2605.06076#bib.bib49)), zsRE (Levy et al., [2017](https://arxiv.org/html/2605.06076#bib.bib50)), Induction (Conmy et al., [2023](https://arxiv.org/html/2605.06076#bib.bib15)), Bool (Suzgun et al., [2023](https://arxiv.org/html/2605.06076#bib.bib51)), Arithmetic (Brown et al., [2020](https://arxiv.org/html/2605.06076#bib.bib52)), and SA (Yu et al., [2024](https://arxiv.org/html/2605.06076#bib.bib53)). We track SFT evolution across 10 epochs, subdividing each into 20 observation steps to capture granular dynamics. Comprehensive setups are detailed in Appendix [B](https://arxiv.org/html/2605.06076#A2). Across 5 random seeds, the performance fluctuations of all SFT experiments fall within a narrow 95% confidence interval ($t=2.776$), so we report representative outcomes for clarity.
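The value $t=2.776$ is the two-sided 95% Student-t critical value for 4 degrees of freedom, which matches 5 seeds. A minimal sketch of the corresponding interval follows; the accuracy values are hypothetical and the function name is illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def confidence_interval_95(samples, t_crit=2.776):
    """Two-sided 95% CI for the mean; t_crit=2.776 is only valid for 5 samples (df=4)."""
    n = len(samples)
    center = mean(samples)
    half_width = t_crit * stdev(samples) / sqrt(n)
    return center - half_width, center + half_width


# Hypothetical target-task accuracies from 5 random seeds.
accuracies = [0.80, 0.82, 0.81, 0.79, 0.83]
low, high = confidence_interval_95(accuracies)
print(round(low, 4), round(high, 4))
```

For a different number of seeds, `t_crit` would need to be replaced with the critical value for the matching degrees of freedom.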

### 4.1 Free Evolution

In this subsection, we focus on RQ1 (Without localization, do the critical components of the target skill change during parameter updates? If so, how do they evolve?). Specifically, we decompose it into three sub-questions:

- RQ1-a: Does the task mechanism evolve during parameter updates without localization?
- RQ1-b: If evolution exists, what are the underlying patterns of this evolution?
- RQ1-c: What factors influence this evolution, and how do they exert their effects?

![Refer to caption](https://arxiv.org/html/2605.06076v1/x2.png) (a) Circuit Distance
![Refer to caption](https://arxiv.org/html/2605.06076v1/x3.png) (b) Circuit Stability

Figure 2: Line plots of different target tasks on the Mistral-7B model in terms of Circuit Distance ($CD$) and Circuit Stability ($CS$).

To address RQ1-a and RQ1-b, we evaluate mechanism evolution on Mistral-7B and LLaMA3-8B across five target tasks (Arithmetic, SST2, Winogrande, Bool, Gender) against a general pervasiveness task (OpenBookQA). Circuits are extracted via the Logical Circuit Framework with EdgePruning (Appendix [A.3](https://arxiv.org/html/2605.06076#A1.SS3)). Figure [2](https://arxiv.org/html/2605.06076#S4.F2) plots the $CD$ and $CS$ trajectories during SFT on Mistral-7B, where $CD$ tracks the shift from the initial parameters ($\theta^{0}$) to the current step ($\theta^{s}$). Full results, including LLaMA3-8B findings and additional task performance metrics, are detailed in Appendix [C](https://arxiv.org/html/2605.06076#A3).

These results definitively answer RQ1-a: without localization, task mechanisms spontaneously shift during parameter updates, driving substantial circuit migration. Notably, even well-mastered tasks like Gender (initial accuracy $>85\%$) undergo drastic migration, replacing numerous critical components (Table [9](https://arxiv.org/html/2605.06076#A3.T9)). Ablation studies (Tables [9](https://arxiv.org/html/2605.06076#A3.T9) and [6](https://arxiv.org/html/2605.06076#A3.T6)) confirm this significant transfer persists even without the pervasiveness task and at $100\%$ initial accuracy. From an optimization perspective, post-training transitions the LLM from infinite to finite objectives, naturally shifting the optimal solution space. Ultimately, this dynamic nature confirms that static mechanistic conclusions (circuits) derived from current parameters cannot reliably represent future neural contributions.

To address RQ1-b, we track the evolution of attention ($W_q, W_k, W_v, W_o$; denoted Attn) and MLP ($W_{\text{up}}, W_{\text{down}}$; denoted MLP) components. As shown in Figure [2](https://arxiv.org/html/2605.06076#S4.F2), Attn components consistently exhibit significantly higher $CD$ than MLP components throughout SFT. Prior mechanistic studies (Meng et al., [2022](https://arxiv.org/html/2605.06076#bib.bib10); Geva et al., [2021](https://arxiv.org/html/2605.06076#bib.bib55)) associate Attention with inter-token “skills” (e.g., induction heads dedicated to processing “A B …A” patterns, or syntax heads handling structures like “The + Noun”) and MLPs with semantic “knowledge” (e.g., factual concepts). Combining these insights reveals a fundamental evolutionary pattern: “skill”-centric Attn circuits are highly volatile and prone to migration during updates, whereas “knowledge”-centric MLP circuits remain largely inert. This dichotomy is further corroborated by $CS$ trajectories, where MLP stability substantially exceeds Attn stability.
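One way to obtain separate Attn and MLP distances is to split the sum in Eq. (4) by component type, as sketched below. Attributing each edge to the type of its source component, and the `W_x.layer` naming scheme, are assumptions made for illustration; the paper's exact attribution rule is not stated in this section.

```python
ATTN_PREFIXES = {"W_q", "W_k", "W_v", "W_o"}


def component_kind(name):
    """Classify a component as 'Attn' or 'MLP' from its (assumed) name prefix."""
    return "Attn" if name.split(".")[0] in ATTN_PREFIXES else "MLP"


def distance_by_kind(effects_before, effects_after):
    """Unnormalized circuit distance, accumulated per source-component type."""
    totals = {"Attn": 0.0, "MLP": 0.0}
    for edge in set(effects_before) | set(effects_after):
        shift = abs(effects_before.get(edge, 0.0) - effects_after.get(edge, 0.0))
        totals[component_kind(edge[0])] += shift
    return totals


# Hypothetical edge effects before and after an SFT step.
before = {("W_q.0", "W_o.0"): 0.5, ("W_up.0", "W_down.0"): 0.20}
after = {("W_q.0", "W_o.0"): 0.1, ("W_up.0", "W_down.0"): 0.25}
print(distance_by_kind(before, after))
```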

Table 2: Summary of key factors on free evolution.

| Factor | $CD$ of Attn | $CD$ of MLP | $CS$ of Attn | $CS$ of MLP |
| --- | --- | --- | --- | --- |
| “Skill” tasks | ↑ | - | ↓ | - |
| “Knowledge” tasks | - | ↓ | - | ↑ |
| Pervasiveness ↑ | ↑ | ↑ | - | - |
| Dataset Size ↑ | - | - | ↑ | ↑ |
| Conflicting ↑ | ↑ | ↑ | ↓ | ↓ |
| Mastered tasks | ↓ | ↓ | ↑ | ↑ |
| Unmastered tasks | ↑ | ↑ | ↓ | ↓ |

This phenomenon is logically sound, given that MLPs contain vastly more parameters than Attention layers, making their stored knowledge more deeply entrenched and harder to migrate. We further validate the differential evolutionary impacts of “skill” versus “knowledge” tasks in the subsequent analysis of RQ1-c.

Finally, to address RQ1-c, we designed a series of ablation experiments to observe the impact of varying factors on Circuit Distance ($CD$) and Circuit Stability ($CS$). Ultimately, we identified five key factors that significantly influence circuit evolution:

- Task Type: Skill-centric tasks (Attention-dominated) are highly susceptible to component migration, whereas knowledge-centric tasks (MLP-dominated) favor internal information updates (reflected in $CS$).
- Degree of Pervasiveness: Co-optimizing with increasingly pervasive tasks broadens component participation, triggering more extensive migration.
- Dataset Size: Larger SFT datasets intensify internal information refinement, enhancing knowledge consolidation and overall circuit robustness.
- Conflict Proportion: Higher ratios of conflicting components sharply increase migration and hinder effective information updates, diminishing circuit robustness.
- Initial Mastery: High initial task proficiency minimizes component migration and accelerates internal information updates.

Table [2](https://arxiv.org/html/2605.06076#S4.T2) summarizes these factor-metric correlations, with comprehensive ablation data detailed in Appendix [C](https://arxiv.org/html/2605.06076#A3).

### 4.2 Localization Evolution

In this section, we turn to the dynamics of Localization Evolution. Keeping the experimental setup identical to Section [4.1](https://arxiv.org/html/2605.06076#S4.SS1), we compare three localization strategies during SFT:

- Free (Baseline): Full-parameter SFT without freezing any parameters.
- Mech (Mechanistic Localization): Localizing critical components based on the initial parameter state ($\theta$) via mechanistic interpretability, while freezing all non-critical components.
- Random: Randomly localizing and updating exactly the same number of components as identified by the Mech strategy (with the same ratio between Attn and MLP), while freezing the remainder.
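
The three strategies differ only in which components remain trainable. The following is a minimal sketch of that selection step, assuming components are identified by a kind ("attn" or "mlp") and an integer id, and assuming a circuit-discovery step has already produced the located circuit. The function name, the data layout, and the toy sizes are hypothetical and are not taken from the paper's implementation.

```python
import random


def trainable_components(strategy, all_components, mech_circuit, seed=0):
    """Return the set of (kind, id) pairs left unfrozen under a strategy.

    all_components: dict mapping kind ("attn" / "mlp") to a list of ids.
    mech_circuit:   dict with the same keys, holding the components located
                    from the initial parameters (hypothetical input).
    """
    if strategy == "free":
        # Full-parameter SFT: nothing is frozen.
        return {(k, c) for k, ids in all_components.items() for c in ids}
    if strategy == "mech":
        # Only the located circuit is updated; everything else is frozen.
        return {(k, c) for k, ids in mech_circuit.items() for c in ids}
    if strategy == "random":
        # Match the Mech budget per kind so the Attn/MLP ratio is preserved.
        rng = random.Random(seed)
        chosen = set()
        for kind, ids in mech_circuit.items():
            for c in rng.sample(all_components[kind], len(ids)):
                chosen.add((kind, c))
        return chosen
    raise ValueError(f"unknown strategy: {strategy}")


# Toy sizes for illustration only.
all_components = {"attn": list(range(32)), "mlp": list(range(32))}
mech_circuit = {"attn": [0, 3, 7], "mlp": [1, 2]}

free = trainable_components("free", all_components, mech_circuit)
mech = trainable_components("mech", all_components, mech_circuit)
rand = trainable_components("random", all_components, mech_circuit)
```

Sampling per kind, rather than from the pooled component list, is what keeps the Random budget comparable to Mech in both size and Attn/MLP composition.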

Through the comparative analysis of these three strategies, we aim to systematically address RQ2 (Does post-training with localization truly improve performance on the target task and mitigate conflicts with pervasiveness tasks?). To provide a granular analysis, we decompose this overarching inquiry into two specific sub-questions:

- RQ2-a: Does static localization genuinely provide guidance for dynamic parameter updates?
- RQ2-b: Why do existing localization methods ostensibly yield significant improvements in task performance?

#### 4.2.1 RQ2-a

To address RQ2-a, we compared three localization strategies (Free, Mech, and Random) on Mistral-7B. Consistent with previous settings, we paired five target tasks (Arithmetic, Bool, Gender, SST2, Winogrande) with the OpenBookQA pervasiveness task. For the Mech strategy, we isolated 800 critical components (the predetermined circuit scale). To ensure a rigorously controlled comparison, the Random strategy correspondingly updated exactly 800 randomly selected components.

![Refer to caption](https://arxiv.org/html/2605.06076v1/x4.png) (a) Target Accuracy
![Refer to caption](https://arxiv.org/html/2605.06076v1/x5.png) (b) Pervasiveness Accuracy
![Refer to caption](https://arxiv.org/html/2605.06076v1/x6.png) (c) Circuit Conflict

Figure 3: Target Task Accuracy, Pervasiveness Task Accuracy, and Circuit Conflict of the Arithmetic task with localization.

Figure [3](https://arxiv.org/html/2605.06076#S4.F3) illustrates the results for the Arithmetic task across three metrics: Target Task Accuracy, Pervasiveness Task Accuracy, and Circuit Conflict (results for the other four tasks are shown in Appendix [D.1](https://arxiv.org/html/2605.06076#A4.SS1)). Two observations emerge. First, the Target Task Accuracy fails to surpass its pre-SFT baseline, in stark contrast to typical outcomes in single-objective optimization. Coupled with the evidence of circuit conflicts in Figure [3](https://arxiv.org/html/2605.06076#S4.F3)(c), this substantiates that conflicts arising from multi-task optimization degrade individual task performance. Second, while the Mech strategy significantly outperforms the Free baseline in accuracy and maintains a lower conflict level, it exhibits no substantial advantage over the Random strategy. This outcome reveals that while localization in general does benefit task performance, the improvement does not appear to stem from the mechanistic interpretability guidance of the circuit.

The CD and CS metrics provided in Appendix [D.2](https://arxiv.org/html/2605.06076#A4.SS2) further corroborate this: compared to free evolution, localization induces greater Circuit Distance and greater variation in Circuit Stability. In other words, localization paradoxically renders the circuit’s evolution more uncontrollable and unstable, and the improvement in task performance may instead stem from the reduced number of trainable parameters, which simplifies task optimization. This validates our preceding hypothesis: Mechanistic Localization based on the current parameter state identifies only a fraction of the components that genuinely contribute to future parameter updates. Because the unidentified critical components are prematurely frozen, they impede normal evolutionary dynamics, rendering circuit evolution substantially more difficult. Consequently, from a macroscopic perspective, its performance is virtually indistinguishable from “random localization.” Appendix [D.3](https://arxiv.org/html/2605.06076#A4.SS3) provides additional validation for this hypothesis: expanding the circuit scale, and thereby encompassing more “unidentified critical components,” allows Mechanistic Localization to surpass Random Localization.

#### 4.2.2 RQ2-b

However, prevalent downstream applications (e.g., LLM unlearning, Knowledge Editing) report Mechanistic Localization vastly outperforming Random Localization, in apparent contradiction with the findings of RQ2-a. We hypothesize this discrepancy arises because these applications predominantly target MLP-governed, knowledge-centric tasks. As established in RQ1-b (Section [4.1](https://arxiv.org/html/2605.06076#S4.SS1)), MLP circuits inherently resist component migration. Consequently, the divergence between current and updated circuits is significantly diminished. This minimal evolutionary drift creates an “illusion of effectiveness”: the false premise that the “current circuit” reliably guides the “future circuit.”

To empirically validate this hypothesis, we conducted a comparative analysis within the LLM unlearning paradigm, contrasting mainstream unlearning on the WMDP-Bio dataset against unlearning the Induction task. The WMDP-Bio dataset comprises biological knowledge and related factual information; hence, unlearning WMDP-Bio is a quintessential knowledge-centric task dominated by MLP circuits. Conversely, the Induction task relies on a skill-centric circuit predominantly driven by Attention mechanisms, making its unlearning an Attention-dominated process.

We evaluated these two target tasks with OpenBookQA acting as the pervasiveness task. The tracked metrics are Forget Efficacy (FE), measured as $1 - \text{accuracy}$ on the unlearning task; Retain Utility (RU), measured as accuracy on the pervasiveness task; Circuit Distance of Attention ($CD_{Attn}$) and of MLP ($CD_{MLP}$); Circuit Stability of Attention ($CS_{Attn}$) and of MLP ($CS_{MLP}$); and Circuit Conflict (CC).
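
The two accuracy-based metrics follow directly from their definitions. A minimal sketch, assuming accuracy is computed from counts of correct predictions; the function names and the example counts are illustrative and do not come from the paper's results.

```python
def forget_efficacy(unlearn_correct, unlearn_total):
    """FE = 1 - accuracy on the unlearning (target) task."""
    return 1.0 - unlearn_correct / unlearn_total


def retain_utility(perv_correct, perv_total):
    """RU = accuracy on the pervasiveness task."""
    return perv_correct / perv_total


# Illustrative counts only (not reported values).
fe = forget_efficacy(unlearn_correct=30, unlearn_total=120)
ru = retain_utility(perv_correct=90, perv_total=120)
```

Under these definitions, higher is better for both metrics: a high FE means the target behavior was removed, and a high RU means performance on the pervasiveness task was preserved.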

Table 3: Performance of WMDP-Bio and Induction tasks on LLM unlearning

Table [3](https://arxiv.org/html/2605.06076#S4.T3) corroborates our hypothesis: because the circuits of knowledge-centric tasks are largely composed of MLPs and their associated edges, the immense parameter volume and dense knowledge superposition within MLPs render them highly resistant to neural migration. In stark contrast, skill-centric tasks governed by Attention mechanisms are much more susceptible to interference due to their smaller parameter footprint, leading to a substantially higher proportion of component migration. In Appendix [E](https://arxiv.org/html/2605.06076#A5), we provide supplementary sampling results during the optimization process when these two tasks are treated as independent optimization objectives, further substantiating this conclusion.

Therefore, from the perspective of circuit\-guided parameter dynamics, the circuits of Attention\-dominated skill tasks undergo drastic transformations during parameter updates\. Mechanistic conclusions derived from these circuits suffer from severe temporal latency, rendering them fundamentally incapable of providing practical, meaningful contributions to future parameter updates\.

## 5 Discussion: The Path to Predictive Localization

Section [4](https://arxiv.org/html/2605.06076#S4) demonstrates that task mechanisms inherently undergo “free evolution.” Due to these dynamic shifts, static interpretability conclusions suffer from severe temporal latency and fail to guide future parameter updates. To make Mechanistic Localization practical, we must endow it with “foresight”: the ability to predict future mechanisms.

Ideally, successfully forecasting the post-update circuit from the current parameter state would enable accurate localization of critical components. Accordingly, we conducted a preliminary exploration that uses the circuit obtained after free-evolution SFT in place of the current circuit, a paradigm we term Future Mechanistic localization. Figure [4](https://arxiv.org/html/2605.06076#S5.F4) shows that this approach yields substantial advantages in both Target (T-Acc) and Pervasiveness (P-Acc) Accuracy on the Arithmetic task (additional metrics in Appendix [F](https://arxiv.org/html/2605.06076#A6)), corroborating our hypothesis and suggesting a direction for future research.

Furthermore, Appendix [F](https://arxiv.org/html/2605.06076#A6) presents a comprehensive analysis of contemporary Mechanistic Localization methodologies. Moving beyond strictly circuit-centric approaches, we evaluate two other predominant paradigms: gradient-based methods and intervention-based methods. Our empirical results reveal that critical components localized via gradient-based techniques exhibit markedly superior performance in terms of Circuit Distance (CD). This suggests that leveraging current gradient signals to approximate or extrapolate the mechanistic structures of future parameter states is a promising direction for predictive Mechanistic Localization.

![Refer to caption](https://arxiv.org/html/2605.06076v1/x7.png) (a) Task Accuracy
![Refer to caption](https://arxiv.org/html/2605.06076v1/x8.png) (b) Pervasiveness Accuracy

Figure 4: Line plots of Future-Localization.

## 6 Conclusion

In this paper, we explore whether interpretability conclusions derived from current parameter states offer predictive guidance for future parameter updates. To investigate this, we constructed a “Locate-then-Update” pipeline using circuit discovery for Mechanistic Localization. By tracking circuits across SFT steps via three novel metrics, Circuit Distance (CD), Circuit Stability (CS), and Circuit Conflict (CC), we systematically evaluated the evolutionary dynamics of task mechanisms. Extensive experiments substantiate three core conclusions:

- Divergent Free Evolution: Without localization, circuits naturally undergo “free evolution.” Attention-driven skill tasks experience drastic structural shifts, whereas MLP-reliant knowledge tasks evolve much more gradually.
- Temporal Lag of Static Localization: Due to this inherent free evolution, utilizing current parameter states as a reference suffers from severe temporal latency. Consequently, static Mechanistic Localization fails to reliably guide future dynamic updates.
- The Illusion of Effectiveness: The perceived success of existing localization methods stems from their evaluation on MLP-dominated tasks. The inherently slower evolution of these tasks coincidentally masks the latency of static circuits, creating a false sense of efficacy.

Finally, we discuss potential research directions for achieving effective Mechanistic Localization and outline the limitations of this work in Appendix [G](https://arxiv.org/html/2605.06076#A7).

## References

- \[1\]N\. Belrose, Z\. Furman, L\. Smith, D\. Halawi, I\. Ostrovsky, L\. McKinney, S\. Biderman, and J\. Steinhardt\(2023\)Eliciting latent predictions from transformers with the tuned lens\.arXiv preprint arXiv:2303\.08112\.Cited by:[§2\.2](https://arxiv.org/html/2605.06076#S2.SS2.p1.1)\.
- \[2\]T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. M\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei\(2020\)Language models are few\-shot learners\.External Links:2005\.14165Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[3\]Y\. Cao, T\. Zhang, B\. Cao, Z\. Yin, L\. Lin, F\. Ma, and J\. Chen\(2024\)Personalized steering of large language models: versatile steering vectors through bi\-directional preference optimization\.Advances in Neural Information Processing Systems37,pp\. 49519–49551\.Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p1.1)\.
- \[4\]L\. Chan, A\. Garriga\-Alonso, N\. Goldowsky\-Dill, R\. Greenblatt, J\. Nitishinskaya, A\. Radhakrishnan, B\. Shlegeris, and N\. Thomas\(2022\)Causal scrubbing, a method for rigorously testing interpretability hypotheses\.AI Alignment Forum\.Note:[https://www\.alignmentforum\.org/posts/JvZhhzycHu2Yd57RN/causal\-scrubbing\-a\-method\-for\-rigorously\-testing](https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing)Cited by:[§3\.1](https://arxiv.org/html/2605.06076#S3.SS1.p1.5)\.
- \[5\]H\. Chen, J\. Zhu, X\. Yang, and W\. Wang\(2025\)Rethinking circuit completeness in language models: and, or, and adder gates\.InAdvances in Neural Information Processing Systems,D\. Belgrave, C\. Zhang, H\. Lin, L\. Montoya, R\. Pascanu, P\. Koniusz, M\. Ghassemi, and N\. Chen \(Eds\.\),Vol\.38,pp\. 150511–150540\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/dd37fdb24a4e1cfa3ed5c247217a7394-Paper-Conference.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.06076#A1.p1.1),[§1](https://arxiv.org/html/2605.06076#S1.p3.4)\.
- \[6\]H\. Chen, J\. Zhu, X\. Yang, and W\. Wang\(2026\)CLUE: conflict\-guided localization for LLM unlearning framework\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=jtRYvazBWv)Cited by:[§C\.5](https://arxiv.org/html/2605.06076#A3.SS5.p1.5),[Appendix F](https://arxiv.org/html/2605.06076#A6.p8.1),[§3\.3](https://arxiv.org/html/2605.06076#S3.SS3.p1.3),[§3\.3](https://arxiv.org/html/2605.06076#S3.SS3.p2.3)\.
- \[7\]A\. Cimatti, A\. Griggio, and R\. Sebastiani\(2007\)A simple and flexible way of computing small unsatisfiable cores in sat modulo theories\.InInternational Conference on Theory and Applications of Satisfiability Testing,pp\. 334–339\.Cited by:[§3\.3](https://arxiv.org/html/2605.06076#S3.SS3.p2.5)\.
- \[8\]A\. Conmy, A\. Mavor\-Parker, A\. Lynch, S\. Heimersheim, and A\. Garriga\-Alonso\(2023\)Towards automated circuit discovery for mechanistic interpretability\.Advances in Neural Information Processing Systems36,pp\. 16318–16352\.Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§1](https://arxiv.org/html/2605.06076#S1.p5.1),[§2\.2](https://arxiv.org/html/2605.06076#S2.SS2.p1.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[9\]N\. D’Ippolito, M\. F\. Frias, J\. P\. Galeotti, E\. Lanzarotti, and S\. Mera\(2010\)Alloy\+ hotcore: a fast approximation to unsat core\.InInternational Conference on Abstract State Machines, Alloy, B and Z,pp\. 160–173\.Cited by:[§3\.3](https://arxiv.org/html/2605.06076#S3.SS3.p2.5)\.
- \[10\]I\. Dagan, D\. Roth, F\. Zanzotto, and M\. Sammons\(2022\)Recognizing textual entailment: models and applications\.Springer Nature\.Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[11\]D\. Dai, L\. Dong, Y\. Hao, Z\. Sui, B\. Chang, and F\. Wei\(2022\)Knowledge neurons in pretrained transformers\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 8493–8502\.Cited by:[Appendix F](https://arxiv.org/html/2605.06076#A6.p7.1),[§1](https://arxiv.org/html/2605.06076#S1.p2.1)\.
- \[12\]N\. Elhage, N\. Nanda, C\. Olsson, T\. Henighan, N\. Joseph, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, N\. DasSarma, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah\(2021\)A mathematical framework for transformer circuits\.Transformer Circuits Thread\.Note:https://transformer\-circuits\.pub/2021/framework/index\.htmlCited by:[§2\.2](https://arxiv.org/html/2605.06076#S2.SS2.p1.1)\.
- \[13\]M\. Geva, R\. Schuster, J\. Berant, and O\. Levy\(2021\)Transformer feed\-forward layers are key\-value memories\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 5484–5495\.Cited by:[§4\.1](https://arxiv.org/html/2605.06076#S4.SS1.p4.4)\.
- \[14\]N\. Goldowsky\-Dill, C\. MacLeod, L\. Sato, and A\. Arora\(2023\)Localizing model behavior with path patching\.arXiv preprint arXiv:2304\.05969\.Cited by:[§3\.1](https://arxiv.org/html/2605.06076#S3.SS1.p1.5)\.
- \[15\]M\. Hanna, O\. Liu, and A\. Variengien\(2023\)How does gpt\-2 compute greater\-than?: interpreting mathematical abilities in a pre\-trained language model\.Advances in Neural Information Processing Systems36,pp\. 76033–76060\.Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[16\]P\. Hase, M\. Bansal, B\. Kim, and A\. Ghandeharioun\(2023\)Does localization inform editing? surprising differences in causality\-based localization vs\. knowledge editing in language models\.Advances in Neural Information Processing Systems36,pp\. 17643–17668\.Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p3.4)\.
- \[17\]A\. Havrilla, Y\. Du, S\. C\. Raparthy, C\. Nalmpantis, J\. Dwivedi\-Yu, E\. Hambro, S\. Sukhbaatar, and R\. Raileanu\(2024\)Teaching large language models to reason with reinforcement learning\.InAI for Math Workshop @ ICML 2024,External Links:[Link](https://openreview.net/forum?id=mjqoceuMnI)Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p1.1)\.
- \[18\]S\. Heimersheim and J\. Janiak\(2023\)A circuit for python docstrings in a 4\-layer attention\-only transformer\.InAlignment Forum,Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[19\]S\. Heimersheim and N\. Nanda\(2024\)How to use and interpret activation patching\.arXiv preprint arXiv:2404\.15255\.Cited by:[Appendix A](https://arxiv.org/html/2605.06076#A1.p2.4),[§3\.1](https://arxiv.org/html/2605.06076#S3.SS1.p1.5)\.
- \[20\]E\. J\. Hu, yelong shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen\(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p1.1)\.
- \[21\]J\. Jia, J\. Liu, Y\. Zhang, P\. Ram, N\. Baracaldo, and S\. Liu\(2024\)Wagle: strategic weight attribution for effective and modular unlearning in large language models\.Advances in Neural Information Processing Systems37,pp\. 55620–55646\.Cited by:[Appendix F](https://arxiv.org/html/2605.06076#A6.p5.1),[§2\.3](https://arxiv.org/html/2605.06076#S2.SS3.p1.6)\.
- \[22\]T\. Ju, W\. Sun, W\. Du, X\. Yuan, Z\. Ren, and G\. Liu\(2024\)How large language models encode context knowledge? a layer\-wise probing study\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),pp\. 8235–8246\.Cited by:[§2\.2](https://arxiv.org/html/2605.06076#S2.SS2.p1.1)\.
- \[23\]H\. Lai, X\. Liu, J\. Gao, J\. Cheng, Z\. Qi, Y\. Xu, S\. Yao, D\. Zhang, J\. Du, Z\. Hou,et al\.\(2025\)A survey of post\-training scaling in large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 2771–2791\.Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p1.1)\.
- \[24\]O\. Levy, M\. Seo, E\. Choi, and L\. Zettlemoyer\(2017\-08\)Zero\-shot relation extraction via reading comprehension\.InProceedings of the 21st Conference on Computational Natural Language Learning \(CoNLL 2017\),R\. Levy and L\. Specia \(Eds\.\),Vancouver, Canada,pp\. 333–342\.External Links:[Link](https://aclanthology.org/K17-1034/),[Document](https://dx.doi.org/10.18653/v1/K17-1034)Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[25\]J\. Li, X\. Chen, E\. Hovy, and D\. Jurafsky\(2016\)Visualizing and understanding neural models in nlp\.InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 681–691\.Cited by:[§2\.2](https://arxiv.org/html/2605.06076#S2.SS2.p1.1)\.
- \[26\]Y\. Li, C\. Sun, and T\. Weng\(2025\)Effective skill unlearning through intervention and abstention\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 6358–6371\.Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.06076#S2.SS3.p1.6)\.
- \[27\]D\. Lindner, J\. Kramár, S\. Farquhar, M\. Rahtz, T\. McGrath, and V\. Mikulik\(2023\)Tracr: compiled transformers as a laboratory for interpretability\.Advances in Neural Information Processing Systems36,pp\. 37876–37899\.Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[28\]B\. Liu, Q\. Liu, and P\. Stone\(2022\)Continual learning and private unlearning\.InConference on Lifelong Learning Agents,pp\. 243–254\.Cited by:[§2\.3](https://arxiv.org/html/2605.06076#S2.SS3.p1.6)\.
- \[29\]H\. Liu, J\. Ma, X\. Wang, C\. Yuan, and F\. Feng\(2026\)An information\-theoretic parameter\-free bayesian framework for probing labeled dependency trees from attention score\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=q7raIuTQDK)Cited by:[§C\.2](https://arxiv.org/html/2605.06076#A3.SS2.p2.1)\.
- \[30\]S\. Liu, Y\. Yao, J\. Jia, S\. Casper, N\. Baracaldo, P\. Hase, Y\. Yao, C\. Y\. Liu, X\. Xu, H\. Li,et al\.\(2025\)Rethinking machine unlearning for large language models\.Nature Machine Intelligence7\(2\),pp\. 181–194\.Cited by:[§2\.1](https://arxiv.org/html/2605.06076#S2.SS1.p1.9)\.
- \[31\]I\. Loshchilov and F\. Hutter\(2017\)Decoupled weight decay regularization\.arXiv preprint arXiv:1711\.05101\.Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p1.2)\.
- \[32\]P\. Maini, Z\. Feng, A\. Schwarzschild, Z\. C\. Lipton, and J\. Z\. Kolter\(2024\)TOFU: a task of fictitious unlearning for llms\.InFirst Conference on Language Modeling,Cited by:[§2\.3](https://arxiv.org/html/2605.06076#S2.SS3.p1.6)\.
- \[33\]C\. Mathwin, G\. Corlouer, E\. Kran, F\. Barez, and N\. Nanda\(2023\)Identifying a preliminary circuit for predicting gendered pronouns in gpt\-2 small\.URL: https://itch\.io/jam/mechint/rate/1889871,pp\. 2\.Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[34\]K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov\(2022\)Locating and editing factual associations in gpt\.Advances in neural information processing systems35,pp\. 17359–17372\.Cited by:[§C\.2](https://arxiv.org/html/2605.06076#A3.SS2.p2.1),[§1](https://arxiv.org/html/2605.06076#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.06076#S4.SS1.p4.4)\.
- \[35\]T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal\(2018\)Can a suit of armor conduct electricity? a new dataset for open book question answering\.InEMNLP,Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[36\]A\. Muhamed, J\. Bonato, M\. T\. Diab, and V\. Smith\(2025\)Saes can improve unlearning: dynamic sparse autoencoder guardrails for precision unlearning in llms\.InSecond Conference on Language Modeling,Cited by:[§2\.3](https://arxiv.org/html/2605.06076#S2.SS3.p1.6)\.
- \[37\]C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, S\. Johnston, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah\(2022\)In\-context learning and induction heads\.Transformer Circuits Thread\.Note:https://transformer\-circuits\.pub/2022/in\-context\-learning\-and\-induction\-heads/index\.htmlCited by:[§C\.2](https://arxiv.org/html/2605.06076#A3.SS2.p2.1)\.
- \[38\]V\. Patil, P\. Hase, and M\. Bansal\(2023\)Can sensitive information be deleted from llms? objectives for defending against extraction attacks\.InThe Twelfth International Conference on Learning Representations,Cited by:[Appendix F](https://arxiv.org/html/2605.06076#A6.p6.1)\.
- \[39\]D\. Rai, Y\. Zhou, S\. Feng, A\. Saparov, and Z\. Yao\(2024\)A practical review of mechanistic interpretability for transformer\-based language models\.arXiv preprint arXiv:2407\.02646\.Cited by:[§2\.2](https://arxiv.org/html/2605.06076#S2.SS2.p1.1)\.
- \[40\]D\. Selsam and N\. Bjørner\(2019\)Guiding high\-performance sat solvers with unsat\-core predictions\.InInternational conference on theory and applications of satisfiability testing,pp\. 336–353\.Cited by:[§3\.3](https://arxiv.org/html/2605.06076#S3.SS3.p2.5)\.
- \[41\]R\. Socher, A\. Perelygin, J\. Wu, J\. Chuang, C\. D\. Manning, A\. Ng, and C\. Potts\(2013\-10\)Recursive deep models for semantic compositionality over a sentiment treebank\.InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,Seattle, Washington, USA,pp\. 1631–1642\.External Links:[Link](https://www.aclweb.org/anthology/D13-1170)Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[42\]A\. Stolfo, Y\. Belinkov, and M\. Sachan\(2023\)A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 7035–7052\.Cited by:[§2\.2](https://arxiv.org/html/2605.06076#S2.SS2.p1.1)\.
- \[43\]A\. Sun\(2025\)Circuit stability characterizes language model generalization\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 9025–9040\.Cited by:[§3\.2](https://arxiv.org/html/2605.06076#S3.SS2.p1.5)\.
- \[44\]M\. Suzgun, N\. Scales, N\. Schärli, S\. Gehrmann, Y\. Tay, H\. W\. Chung, A\. Chowdhery, Q\. Le, E\. Chi, D\. Zhou,et al\.\(2023\)Challenging big\-bench tasks and whether chain\-of\-thought can solve them\.InFindings of the Association for Computational Linguistics: ACL 2023,pp\. 13003–13051\.Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[45\]A\. Syed, C\. Rager, and A\. Conmy\(2024\)Attribution patching outperforms automated circuit discovery\.InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP,pp\. 407–416\.Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p5.1)\.
- \[46\]T\. Tang, W\. Luo, H\. Huang, D\. Zhang, X\. Wang, W\. X\. Zhao, F\. Wei, and J\. Wen\(2024\)Language\-specific neurons: the key to multilingual capabilities in large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 5701–5715\.Cited by:[§2\.2](https://arxiv.org/html/2605.06076#S2.SS2.p1.1)\.
- \[47\]J\. Thorne, A\. Vlachos, C\. Christodoulopoulos, and A\. Mittal\(2018\-06\)FEVER: a large\-scale dataset for fact extraction and VERification\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),M\. Walker, H\. Ji, and A\. Stent \(Eds\.\),New Orleans, Louisiana,pp\. 809–819\.External Links:[Link](https://aclanthology.org/N18-1074/),[Document](https://dx.doi.org/10.18653/v1/N18-1074)Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[48\]J\. Vig, S\. Gehrmann, Y\. Belinkov, S\. Qian, D\. Nevo, Y\. Singer, and S\. Shieber\(2020\)Investigating gender bias in language models using causal mediation analysis\.Advances in neural information processing systems33,pp\. 12388–12401\.Cited by:[§3\.1](https://arxiv.org/html/2605.06076#S3.SS1.p1.5)\.
- \[49\]K\. R\. Wang, A\. Variengien, A\. Conmy, B\. Shlegeris, and J\. Steinhardt\(2023\)Interpretability in the wild: a circuit for indirect object identification in gpt\-2 small\.InThe Eleventh International Conference on Learning Representations,Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[50\]S\. Wang, Y\. Zhu, H\. Liu, Z\. Zheng, C\. Chen, and J\. Li\(2024\)Knowledge editing for large language models: a survey\.ACM Computing Surveys57\(3\),pp\. 1–37\.Cited by:[§2\.1](https://arxiv.org/html/2605.06076#S2.SS1.p1.9)\.
- \[51\]\(2019\)WinoGrande: an adversarial winograd schema challenge at scale\.Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[52\]X\. Wu, J\. Li, M\. Xu, W\. Dong, S\. Wu, C\. Bian, and D\. Xiong\(2023\)Depn: detecting and editing privacy neurons in pretrained language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 2875–2886\.Cited by:[Appendix F](https://arxiv.org/html/2605.06076#A6.p4.1),[§1](https://arxiv.org/html/2605.06076#S1.p2.1)\.
- \[53\]G\. Xiao, J\. Lin, M\. Seznec, H\. Wu, J\. Demouth, and S\. Han\(2023\)Smoothquant: accurate and efficient post\-training quantization for large language models\.InInternational conference on machine learning,pp\. 38087–38099\.Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p1.1)\.
- \[54\]L\. Yan, R\. Li, G\. Chen, Q\. Li, J\. Geng, W\. Li, V\. Wang, and C\. Lee\(2026\)Spurious rewards paradox: mechanistically understanding how rlvr activates memorization shortcuts in llms\.arXiv preprint arXiv:2601\.11061\.Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p2.1)\.
- \[55\]Y\. Yao, P\. Wang, B\. Tian, S\. Cheng, Z\. Li, S\. Deng, H\. Chen, and N\. Zhang\(2023\)Editing large language models: problems, methods, and opportunities\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 10222–10240\.Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p1.1)\.
- \[56\]L\. Yu, J\. Niu, Z\. Zhu, and G\. Penn\(2024\)Functional faithfulness in the wild: circuit discovery with differentiable computation graph pruning\.arXiv preprint arXiv:2407\.03779\.Cited by:[Appendix B](https://arxiv.org/html/2605.06076#A2.p2.1),[§4](https://arxiv.org/html/2605.06076#S4.p3.4)\.
- \[57\]H\. Zhang, Z\. Zhang, M\. Wang, Z\. Su, Y\. Wang, Q\. Wang, S\. Yuan, E\. Nie, X\. Duan, Q\. Xue,et al\.\(2026\)Locate, steer, and improve: a practical survey of actionable mechanistic interpretability in large language models\.arXiv preprint arXiv:2601\.14004\.Cited by:[§1](https://arxiv.org/html/2605.06076#S1.p1.1)\.
- \[58\]R\. Zhang, L\. Lin, Y\. Bai, and S\. Mei\(2024\)Negative preference optimization: from catastrophic collapse to effective unlearning\.InFirst Conference on Language Modeling,Cited by:[§2\.3](https://arxiv.org/html/2605.06076#S2.SS3.p1.6)\.

## Appendix A Details of Logical Circuit Framework

We first introduce three fundamental circuit logic types: the AND gate, the OR gate, and the ADDER gate \[[5](https://arxiv.org/html/2605.06076#bib.bib14)\].

###### Definition 1.

Consider a common paradigm in which a receiver node $B$ is connected to more than one sender node $A_1, A_2, \dots$. For any edge $A_i \rightarrow B$, we use the binary values ‘0’ and ‘1’ to represent the activation state of the sender node. Specifically, $A_i = 0$ indicates that node $A_i$ is removed, ablated, or deactivated, whereas $A_i = 1$ indicates that node $A_i$ is retained and active. When sender nodes are ablated, the effect of node $B$ on the output exhibits one of three distinct patterns:

AND gate: All sender nodes satisfy an AND logical relationship with the receiver node, i.e., $B = A_1 \land A_2 \land \dots$. In this case, node $B$ exerts a significant effect on the output only if all of its sender nodes are retained. If even a single sender node is ablated, the effect of $B$ on the output is nearly eliminated.

OR gate: All sender nodes satisfy an OR logical relationship with the receiver node, i.e., $B = A_1 \lor A_2 \lor \dots$. In this case, node $B$ exerts a significant effect on the output whenever one or more of its sender nodes are retained. Only if all sender nodes are ablated is the effect of $B$ on the output nearly eliminated.

ADDER gate: All sender nodes satisfy an ADDER logical relationship with the receiver node, i.e., $B = A_1 + A_2 + \dots$. In this case, node $B$ exhibits its maximal effect on the output only when all of its sender nodes are retained. If any single sender node is ablated, the effect of $B$ on the output is substantially diminished; when all sender nodes are ablated, $B$’s effect on the output is reduced to zero. Accordingly, we define the state of $B$ as taking values $0, 1, 2, \dots$, where the number of distinct nonzero states equals the number of sender nodes.

Theoretical analyses support the view that noising\-based intervention is capable of recovering a complete AND gate but fails to recover a complete OR gate, whereas denoising\-based intervention demonstrates the opposite pattern \[[19](https://arxiv.org/html/2605.06076#bib.bib31)\]\. This asymmetry is straightforward to interpret\. The noising\-based intervention procedure corresponds to the transition from a clean activation state \($\text{state}=1$\) to a corrupted activation state \($\text{state}=0$\)\. Since all gates can be regarded as being initialized with activation states equal to $1$, any transition to $\text{state}=0$ induces a significant change in the effect of AND and ADDER gates on the output\. Consequently, noising\-based intervention can reliably identify AND and ADDER gates\.

The denoising\-based intervention first performs the corrupted run in the computational graph and then replaces the corrupted activations with the clean activations\. The activations that lead to significant changes in the output \($\tilde{y}$\) constitute the circuit\. Denoising\-based intervention thus has the following objective:

$$\arg\min_{\mathcal{C}} \mathbb{E}_{(x,\tilde{x})\in\mathcal{T}}\left[D\left(p_{\mathcal{G}}(\tilde{y}\mid\tilde{x}) \,\|\, p_{\mathcal{C}}(\tilde{y}\mid\tilde{x},x)\right)\right], \quad \text{s.t.}\ 1-|\mathcal{C}|/|\mathcal{G}| \geq s \tag{7}$$

Conversely, the denoising\-based intervention procedure corresponds to initialization with activation states equal to $0$\. In this case, any transition to $\text{state}=1$ produces a significant change in the effect of OR and ADDER gates on the output\. Consequently, denoising\-based intervention can reliably identify OR and ADDER gates\.
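The asymmetry between the two intervention directions can be checked on a toy model\. The sketch below is only an illustration under simplifying assumptions: each gate has exactly two binary senders, and the gate functions in `GATES` are hypothetical stand\-ins for the effect of node $B$ on the output, not part of any circuit discovery library\.

```python
# Toy receiver functions for a node B with two binary senders
# (1 = retained, 0 = ablated). These are illustrative assumptions.
GATES = {
    "AND": lambda a1, a2: float(a1 and a2),
    "OR": lambda a1, a2: float(a1 or a2),
    "ADDER": lambda a1, a2: float(a1 + a2),
}


def edge_effects(gate, baseline):
    """Flip each sender away from `baseline` and return the change in B's effect."""
    effects = []
    for i in range(2):
        senders = [baseline, baseline]
        senders[i] = 1 - baseline  # intervene on a single edge A_i -> B
        effects.append(abs(gate(*senders) - gate(baseline, baseline)))
    return effects


def detected(gate, baseline):
    """An edge counts as detected if intervening on it alone changes B's effect."""
    return all(effect > 0 for effect in edge_effects(gate, baseline))


# Noising starts from the clean run (all senders = 1);
# denoising starts from the corrupted run (all senders = 0).
summary = {
    name: {"noising": detected(fn, 1), "denoising": detected(fn, 0)}
    for name, fn in GATES.items()
}
for name, row in summary.items():
    print(name, row)
```

In this toy setting, the AND gate is visible only to noising, the OR gate only to denoising, and the ADDER gate to both, which matches the pattern described above\.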

Therefore, we denote the circuit constructed under the noising\-based intervention strategy as $\mathcal{C}_{\text{Ns}}$, and the one constructed under the denoising\-based intervention strategy as $\mathcal{C}_{\text{Dn}}$\. Based on the above set\-theoretic relationships between $\mathcal{C}_{\text{Ns}}$ and $\mathcal{C}_{\text{Dn}}$, we extract subsets of edges corresponding to AND, OR, and ADDER gates as follows:

- AND gate \($\mathcal{C}_{\text{AND}}$\): edges that are present in $\mathcal{C}_{\text{Ns}}$ but absent from $\mathcal{C}_{\text{Dn}}$\.
- OR gate \($\mathcal{C}_{\text{OR}}$\): edges that are present in $\mathcal{C}_{\text{Dn}}$ but absent from $\mathcal{C}_{\text{Ns}}$\.
- ADDER gate \($\mathcal{C}_{\text{ADDER}}$\): edges that are shared between $\mathcal{C}_{\text{Ns}}$ and $\mathcal{C}_{\text{Dn}}$\.
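The three subsets above are plain set differences and an intersection\. A minimal sketch, assuming each circuit is represented as a set of \(sender, receiver\) edge tuples; the function name `split_gates` and the edge names are hypothetical and used only for illustration\.

```python
def split_gates(c_ns, c_dn):
    """Partition edges into AND / OR / ADDER subsets from the two circuits."""
    c_ns, c_dn = set(c_ns), set(c_dn)
    return {
        "AND": c_ns - c_dn,    # present under noising only
        "OR": c_dn - c_ns,     # present under denoising only
        "ADDER": c_ns & c_dn,  # shared by both circuits
    }


# Hypothetical edge names, for illustration only.
c_ns = {("a0.h1", "m1"), ("a0.h2", "m1"), ("m1", "logits")}
c_dn = {("a0.h3", "m2"), ("m1", "logits")}
gates = split_gates(c_ns, c_dn)
print(gates)
```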

Therefore, we propose a combined Ns\+Dn approach to recover logically complete gates\. This method is compatible with a wide range of circuit discovery algorithms, introduces minimal additional computational overhead, and enables clear and effective separation of the three types of logic gates\. Ns\+Dn has the following objective:

$$\arg\min_{\mathcal{C}} \mathbb{E}_{(x,\tilde{x})\in\mathcal{T}}\left[D\left(p_{\mathcal{G}}(y\mid x) \,\|\, p_{\mathcal{C}}(y\mid x,\tilde{x})\right) + D\left(p_{\mathcal{G}}(\tilde{y}\mid\tilde{x}) \,\|\, p_{\mathcal{C}}(\tilde{y}\mid\tilde{x},x)\right)\right], \quad \text{s.t.}\ 1-|\mathcal{C}|/|\mathcal{G}| \geq s \tag{8}$$

In the following subsections, we provide a detailed exposition of the original design of each method under the Ns\. strategy, the corresponding formulation under the Dn\. strategy, and the final approach that integrates both \(Ns\.\+Dn\.\) for recovering logically complete gates\.

### A\.1 Greedy Search Example: ACDC

The ACDC method identifies important edges by iteratively removing each edge and observing the effect of this intervention on the model output\. Edges whose removal causes an effect greater than a predefined threshold $\tau$ are retained, while those with an effect smaller than $\tau$ are pruned\. The original algorithm \(Ns\. strategy\) is outlined as follows:

Data: Computational graph $\mathcal{G}$, dataset $(x_i)_{i=1}^{n}$, corrupted datapoints $(x_i')_{i=1}^{n}$, and threshold $\tau > 0$\.

Result: Subgraph $\mathcal{H} \subseteq \mathcal{G}$\.

- $\mathcal{H} \leftarrow \mathcal{G}$ // Initialize $\mathcal{H}$ to the full computational graph
- $\mathcal{H} \leftarrow \mathcal{H}.\text{reverse\_topological\_sort}()$ // Sort $\mathcal{H}$ so that the output comes first
- for $v \in \mathcal{H}$ do
  - for $w$ parent of $v$ do
    - $\mathcal{H}_{\mathrm{new}} \leftarrow \mathcal{H} \setminus \{w \rightarrow v\}$ // Temporarily remove candidate edge
    - if $D_{KL}(\mathcal{G} \,\|\, \mathcal{H}_{\mathrm{new}}) - D_{KL}(\mathcal{G} \,\|\, \mathcal{H}) < \tau$ then
      - $\mathcal{H} \leftarrow \mathcal{H}_{\mathrm{new}}$ // Edge is unimportant, remove permanently
- return $\mathcal{H}$

Algorithm 1: The ACDC algorithm in Ns\.

In the Ns\. strategy, $\mathcal{G}$ denotes the clean run, and $\mathcal{H} \setminus \{w \rightarrow v\}$ represents the replacement of the clean activation on the edge $w \rightarrow v$ with its corrupted activation\. In contrast, under the Dn\. strategy, $\mathcal{G}$ refers to the corrupted run, and $\mathcal{H} \setminus \{w \rightarrow v\}$ indicates the substitution of the corrupted activation on edge $w \rightarrow v$ with the corresponding clean activation\.

In the combined Ns\.\+Dn\. approach, the effects from both strategies are jointly considered\. Specifically, the original pruning condition $D_{KL}(\mathcal{G} \,\|\, \mathcal{H}_{\mathrm{new}}) - D_{KL}(\mathcal{G} \,\|\, \mathcal{H}) < \tau$ is replaced with the aggregated criterion:

$$D_{KL}(\mathcal{G}^{\text{clean}} \,\|\, \mathcal{H}_{\mathrm{new}}) - D_{KL}(\mathcal{G}^{\text{clean}} \,\|\, \mathcal{H}) + D_{KL}(\mathcal{G}^{\text{corrupted}} \,\|\, \mathcal{H}_{\mathrm{new}}) - D_{KL}(\mathcal{G}^{\text{corrupted}} \,\|\, \mathcal{H}) < \tau.$$
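The greedy loop and the aggregated criterion can be sketched in a few lines of Python\. This is a simplified illustration, not the reference ACDC implementation: the function `acdc_prune`, the toy graph, and the per\-edge costs are all assumptions, and each divergence callable stands in for a real $D_{KL}$ evaluation that would require running the model\. Passing one callable corresponds to Ns\. \(or Dn\.\); passing two sums their effects as in the aggregated criterion\.

```python
def acdc_prune(nodes_output_first, parents, divergences, tau):
    """Greedy ACDC-style pruning (simplified sketch).

    nodes_output_first: nodes in reverse topological order (output first).
    parents: dict mapping each node to the list of its parent nodes.
    divergences: list of callables, each mapping a set of kept edges to a
        divergence from the full graph (one for Ns., two for Ns.+Dn.).
    tau: pruning threshold.
    """
    kept = {(w, v) for v in nodes_output_first for w in parents.get(v, [])}

    def total(edges):
        return sum(d(edges) for d in divergences)

    for v in nodes_output_first:
        for w in parents.get(v, []):
            candidate = kept - {(w, v)}  # temporarily remove the edge w -> v
            if total(candidate) - total(kept) < tau:
                kept = candidate  # edge is unimportant, remove permanently
    return kept


# Toy stand-in for a KL divergence: every removed edge adds a fixed cost.
# The graph and the costs are made up for illustration.
ALL_EDGES = {("a", "out"), ("b", "out"), ("in", "a"), ("in", "b")}
NS_COST = {("a", "out"): 0.9, ("b", "out"): 0.0, ("in", "a"): 0.8, ("in", "b"): 0.0}
DN_COST = {("a", "out"): 0.0, ("b", "out"): 0.7, ("in", "a"): 0.0, ("in", "b"): 0.01}


def make_divergence(cost):
    return lambda edges: sum(cost[e] for e in ALL_EDGES - edges)


parents = {"out": ["a", "b"], "a": ["in"], "b": ["in"], "in": []}
order = ["out", "a", "b", "in"]

ns_circuit = acdc_prune(order, parents, [make_divergence(NS_COST)], tau=0.1)
both_circuit = acdc_prune(
    order, parents, [make_divergence(NS_COST), make_divergence(DN_COST)], tau=0.1
)
print(sorted(ns_circuit))
print(sorted(both_circuit))
```

In this toy example, the edge `("b", "out")` matters only under the denoising cost, so it is pruned by the Ns\. run but retained by the combined run\.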

### A\.2 Linear Estimation Example: EAP

The EAP method approximates the effect of each edge using the first\-order term of its Taylor expansion, enabling all edge effects to be estimated from a single gradient computation rather than one intervention per edge\. It is important to note that, during the computation of each edge’s effect, all other edges remain in their unpruned \(active\) state\.

Specifically, Ns\. has the approximation:

$$L(x \mid \mathrm{do}(\tilde{x}_i)) - L(x) \approx (\tilde{x}_i - x_i)^{T} \frac{\partial}{\partial x_i} L(x) \tag{9}$$

and Dn\. has the approximation:

$$L(\tilde{x} \mid \mathrm{do}(x_i)) - L(\tilde{x}) \approx (\tilde{x}_i - x_i)^{T} \frac{\partial}{\partial \tilde{x}_i} L(\tilde{x}) \tag{10}$$

Therefore, the approximation for Ns\.\+Dn\. is $(\tilde{x}_i - x_i)^{T} \frac{\partial}{\partial x_i} L(x) + (\tilde{x}_i - x_i)^{T} \frac{\partial}{\partial \tilde{x}_i} L(\tilde{x})$\.
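To make the three scores concrete, the following sketch evaluates Equations \(9\) and \(10\) and their sum on a toy problem\. It assumes one scalar activation per edge \(real activations are vectors, in which case the product becomes a dot product\) and a hypothetical quadratic loss whose gradient is available in closed form; `eap_scores` and `toy_grad` are illustrative names only\.

```python
def eap_scores(x, x_tilde, grad):
    """First-order edge scores following Eqs. (9) and (10).

    x, x_tilde: clean and corrupted activations, one scalar per edge.
    grad: callable returning dL/d(activation) for every edge at a given input.
    """
    g_clean, g_corrupt = grad(x), grad(x_tilde)
    diff = [xt - xc for xc, xt in zip(x, x_tilde)]
    ns = [d * g for d, g in zip(diff, g_clean)]    # Eq. (9), noising
    dn = [d * g for d, g in zip(diff, g_corrupt)]  # Eq. (10), denoising
    combined = [a + b for a, b in zip(ns, dn)]     # Ns.+Dn.
    return ns, dn, combined


# Hypothetical toy loss L(a) = sum_i w_i * a_i^2, so dL/da_i = 2 * w_i * a_i.
W = [1.0, 0.5, 2.0]


def toy_grad(a):
    return [2.0 * w * ai for w, ai in zip(W, a)]


ns, dn, combined = eap_scores([1.0, 2.0, 0.0], [3.0, 2.0, 1.0], toy_grad)
print(ns, dn, combined)
```

In this toy case, the third edge receives a zero noising score because the clean gradient vanishes there, yet it has a nonzero denoising score, so only the combined score flags it\.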

### A\.3 Differentiable Mask Example: EdgePruning

EdgePruning assigns a learnable mask to each node or edge, where the mask is reparameterized using the hard concrete distribution\. In the Ns\. setting, the optimization objective corresponds to Equation [2](https://arxiv.org/html/2605.06076#S2.E2)\. Consequently, the objectives for the Dn\. and Ns\.\+Dn\. settings are given by Equation [7](https://arxiv.org/html/2605.06076#A1.E7) and Equation [8](https://arxiv.org/html/2605.06076#A1.E8), respectively\.

In the Ns\.\+Dn\. setting, directly optimizing both objectives jointly can lead to gradient interference and convergence to Pareto\-optimal solutions, rather than a unified optimum\. To address this, we independently compute the final mask values for Ns\. and Dn\. using Equations [2](https://arxiv.org/html/2605.06076#S2.E2) and [7](https://arxiv.org/html/2605.06076#A1.E7), and then obtain the mask for Ns\.\+Dn\. by averaging the two\.
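The averaging step can be sketched as follows\. The text specifies only that the two masks are averaged; the dictionary representation, the function name `combine_masks`, and the binarization threshold of 0\.5 are assumptions made for this illustration\.

```python
def combine_masks(mask_ns, mask_dn, threshold=0.5):
    """Average per-edge mask values from the Ns. and Dn. runs, then binarize.

    The threshold is an assumption of this sketch, not taken from the paper.
    """
    averaged = {edge: 0.5 * (mask_ns[edge] + mask_dn[edge]) for edge in mask_ns}
    kept = {edge for edge, value in averaged.items() if value >= threshold}
    return averaged, kept


# Hypothetical final mask values in [0, 1] from two independent optimizations.
mask_ns = {"e1": 1.0, "e2": 0.0, "e3": 1.0}
mask_dn = {"e1": 1.0, "e2": 0.0, "e3": 0.0}
averaged, kept = combine_masks(mask_ns, mask_dn)
print(averaged, kept)
```

Because the two masks are optimized separately, neither objective's gradients can interfere with the other; the cost is that edges on which the runs disagree end up with intermediate values whose treatment depends on the chosen threshold\.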

Finally, we simplify the ADDER gate in the forget circuit to an OR gate, and the ADDER gate in the retain circuit to an AND gate\.

## Appendix B Experiment Details

The learning rate, selected by grid search, is $1 \times 10^{-5}$ for each dataset\. The parameter $\lambda$ is set to $1$, and we adopt AdamW \[[31](https://arxiv.org/html/2605.06076#bib.bib54)\] as the optimizer\. All experiments were conducted on 16 NVIDIA RTX A100 GPUs\.

We select a series of specific tasks as target and pervasiveness sets: OpenBookQA \[[35](https://arxiv.org/html/2605.06076#bib.bib40)\], Gender \[[33](https://arxiv.org/html/2605.06076#bib.bib41)\], RTE \[[10](https://arxiv.org/html/2605.06076#bib.bib42)\], IOI \(Indirect Object Identification \[[49](https://arxiv.org/html/2605.06076#bib.bib43)\]\), Docstring \[[18](https://arxiv.org/html/2605.06076#bib.bib44)\], SST2 \[[41](https://arxiv.org/html/2605.06076#bib.bib45)\], Winogrande \[[51](https://arxiv.org/html/2605.06076#bib.bib46)\], Reverse \[[27](https://arxiv.org/html/2605.06076#bib.bib47)\], Greater Than \[[15](https://arxiv.org/html/2605.06076#bib.bib48)\], FEVER \[[47](https://arxiv.org/html/2605.06076#bib.bib49)\], zsRE \[[24](https://arxiv.org/html/2605.06076#bib.bib50)\], Induction \[[8](https://arxiv.org/html/2605.06076#bib.bib15)\], Bool \[[44](https://arxiv.org/html/2605.06076#bib.bib51)\], Arithmetic \[[2](https://arxiv.org/html/2605.06076#bib.bib52)\], and SA \(syntactic agreement \[[56](https://arxiv.org/html/2605.06076#bib.bib53)\]\)\. Table [4](https://arxiv.org/html/2605.06076#A2.T4) shows an example of each task\.

Table 4: An overview of the datasets of specific tasks\.

| Task | Example | Label |
| --- | --- | --- |
| Winogrande | John moved the couch from the garage to the backyard to create space\. The \_ is small\. | garage |
| SST\-2 | hide new secretions from the parental units | negative |
| RTE | No Weapons of Mass Destruction Found in Iraq Yet\. Weapons of Mass Destruction Found in Iraq\. | not entailment |
| Bool | \(True AND True\) OR False | True |
| Induction | Vernon Dursley and Petunia | Dursley |
| IOI | When John and Mary went to the store, Mary gave a bottle of milk to | John |
| Gender | So Evan is a really great friend, isn’t | he |
| Docstring | `def f(self, files, obj, state, size, shape, option):` `:param state: performance analysis` `:param size: pattern design` `:param` | shape |
| Greater Than | The war lasted from 1517 to 15 | 18 |
| SA | Many girls insulted | themselves |
| Arithmetic | What is \(2 \- 8\) \- 4? Answer: | \-10 |
| Reverse | \[0, 3, 2, 1\] | \[1, 2, 3, 0\] |
| OpenBookQA | `'question_stem': 'The sun is responsible for', 'choices': {'text': ['puppies learning new tricks', 'children growing up and getting old', 'flowers wilting in a vase', 'plants sprouting, blooming and wilting'], 'label': ['A', 'B', 'C', 'D']},` | D |
| FEVER | Kolhan is a village in the Palghar district of Maharashtra , | India |
| zsRE | Angela Merkel’s second husband is professor Joachim | Sauer |

## Appendix C Ablation Studies and Additional Results on Free Evolution

### C\.1 Comprehensive Results of Free Evolution

Figure [5](https://arxiv.org/html/2605.06076#A3.F5) presents the comprehensive evolutionary metrics for five target tasks \(Arithmetic, Bool, Gender, SST2, and Winogrande\) paired with OpenBookQA as the pervasiveness task, evaluated on both the Mistral\-7B and LLaMA3\-8B models\. These tracked metrics encompass Circuit Distance \($CD$\), Circuit Stability \($CS$\), Task Accuracy \($ACC$\), and Loss\. Furthermore, for $CD$ and $CS$, we separately delineate the evolutionary dynamics of components within the attention and MLP modules, denoted as $Attn$ and $MLP$, respectively\.

![Refer to caption](https://arxiv.org/html/2605.06076v1/x9.png) \(a\) Circuit Distance on Mistral
![Refer to caption](https://arxiv.org/html/2605.06076v1/x10.png) \(b\) Circuit Stability on Mistral
![Refer to caption](https://arxiv.org/html/2605.06076v1/x11.png) \(c\) Performance on Mistral
![Refer to caption](https://arxiv.org/html/2605.06076v1/x12.png) \(d\) Circuit Distance on LLaMA
![Refer to caption](https://arxiv.org/html/2605.06076v1/x13.png) \(e\) Circuit Stability on LLaMA
![Refer to caption](https://arxiv.org/html/2605.06076v1/x14.png) \(f\) Performance on LLaMA

Figure 5: Line plots of different target tasks on the Mistral\-7B and LLaMA3\-8B models in terms of Circuit Distance \($CD$\), Circuit Stability \($CS$\), Task Accuracy \($ACC$\), and loss \($Loss$\)\.

### C\.2 Impact of Task Type on Free Evolution

As observed in Figure [5](https://arxiv.org/html/2605.06076#A3.F5), components within the attention and MLP modules exhibit markedly distinct evolutionary patterns\. Consequently, we hypothesize that for any given task, the extent of circuit variation during free evolution is heavily influenced by the relative functional contributions of its attention and MLP components\. To validate this hypothesis, we designed the following ablation study\.

Prior research has established that tasks dominated by MLP components are predominantly associated with stored factual “knowledge” \[[34](https://arxiv.org/html/2605.06076#bib.bib10)\] \(e\.g\., factual queries regarding nations, geography, prominent figures, and sports\), whereas tasks dominated by attention components are generally tied to latent functional “skills” \(e\.g\., induction heads \[[37](https://arxiv.org/html/2605.06076#bib.bib56)\] and syntax heads \[[29](https://arxiv.org/html/2605.06076#bib.bib57)\]\)\. Building upon this consensus, we select the Induction and Reverse tasks to represent attention\-dominated \(skill\-centric\) tasks, and the FEVER and zsRE factual datasets to represent MLP\-dominated \(knowledge\-centric\) tasks\.

Table [5](https://arxiv.org/html/2605.06076#A3.T5) details the performance across various metrics for these disparate task types on the Mistral\-7B model, maintaining OpenBookQA as the pervasiveness task\. The Attention and MLP columns represent the percentage of attention/MLP components in the circuit relative to the total number of components in the corresponding computational graph\. The results substantiate our initial hypothesis: skill\-centric tasks induce substantial migration among attention components, but cause negligible changes in MLP circuit distance\. Conversely, knowledge\-centric tasks trigger drastic internal parameter updates within MLPs \(reflected in stability changes\) but result in minimal positional migration of the circuit\.

Table 5: Performance of skill\-centric tasks and knowledge\-centric tasks

### C\.3 Impact of Pervasiveness Degree on Free Evolution

In the context of LLMs, the optimization of any target task is inherently a multi\-objective process accompanied by pervasiveness tasks\. Therefore, we conducted an ablation study to investigate the impact of the degree of pervasiveness\. Specifically, we selected various concrete tasks, each governing a distinct mechanism, and simulated increasing pervasiveness through their linear superposition \(e\.g\., an ensemble of 10 disparate tasks functioning jointly as the pervasiveness task intrinsically embodies a more “universal” constraint than a single isolated task\)\. Accordingly, we chose Gender, RTE, IOI, Docstring, SST2, Winogrande, Reverse, Greater Than, FEVER, and zsRE as candidate components for the pervasiveness task\. Designating Induction as the target task, we observed the circuit evolution on the Mistral\-7B model when the total number of concurrent pervasiveness tasks was scaled to 0, 1, 2, 5, and 10\.

Table [6](https://arxiv.org/html/2605.06076#A3.T6) demonstrates that as the degree of pervasiveness increases, the target task exhibits more extensive component migration, while the internally learned information remains largely unaffected\. Intuitively, elevated pervasiveness implies a more comprehensive engagement of components across the network during parameter updates, equating to a more constrained and challenging optimization objective\. Consequently, the original critical components of the target task typically require more substantial migration to navigate toward the new optimal solution space\.

Table 6: Performance of different pervasiveness

### C\.4 Impact of Dataset Size on Free Evolution

Similarly, we conducted an ablation study on the impact of dataset size on circuit evolution\. We hypothesized that circuit migration might simply be an artifact of insufficient sample sizes, which could lead to uncertain or unstable sampled distributions\. To isolate the effect of dataset size while simultaneously eliminating the confounding factor of sample imbalance, we designated the IOI task as the target task\. The IOI dataset is a synthetic \(procedurally generated\) dataset, alleviating concerns that scaling up data samples might introduce real\-world biases\. Using the Mistral\-7B model with OpenBookQA as the pervasiveness task, we evaluated circuit metrics across varying SFT dataset sizes: 500; 2,000; 5,000; 10,000; and 100,000\.

Table [7](https://arxiv.org/html/2605.06076#A3.T7) indicates that variations in dataset size do not induce significant changes in Circuit Distance \($CD$\); that is, component migration during free evolution is largely independent of dataset size\. However, dataset size profoundly impacts the updating of internal information within components\. A richer set of training samples facilitates more effective internal information updating, thereby substantially enhancing the robustness of the resulting circuit\.

Table 7: Performance of different dataset size

### C\.5 Impact of Conflict Proportion on Free Evolution

Prior research has established that the quantity of conflicting components among multiple optimization tasks is a critical determinant of multi\-task optimization efficacy \[[6](https://arxiv.org/html/2605.06076#bib.bib36)\]\. Accordingly, we constructed various skill combinations to serve as the pervasiveness task, aiming to observe how an escalating proportion of conflicting components impacts circuit evolution\. In our experimental setup, we evaluated the Mistral\-7B model using Gender as the target task\. We formulated five distinct pervasiveness task combinations that yielded varying proportions of conflicting components relative to the target task, calculated following the methodology in \[[6](https://arxiv.org/html/2605.06076#bib.bib36)\]\. These combinations and their corresponding conflict proportions are: 5\.86% \(SST2 \+ SA\), 8\.57% \(SST2 \+ SA \+ RTE\), 10\.17% \(SST2 \+ SA \+ RTE \+ IOI\), 12\.49% \(SST2 \+ SA \+ RTE \+ IOI \+ Winogrande\), and 15\.63% \(SST2 \+ SA \+ RTE \+ IOI \+ FEVER\)\.

Table [8](https://arxiv.org/html/2605.06076#A3.T8) reveals that an increased proportion of conflicting components indeed precipitates significant component migration\. As corroborated by previous studies, conflicting components are predominantly polysemantic, encapsulating essential information for multiple tasks simultaneously\. When confronted with multi\-task optimization, these components struggle to resolve optimal gradient descent directions\. Consequently, the language model is compelled to disentangle these multiplexed semantics, forcing them to migrate into distinct, separate components to satisfy the diverse optimization objectives\. Inevitably, this extensive migration necessitates a reorganization of internal neuronal semantics, which subsequently impedes the effective updating of internal information\.

Table 8: Performance of different conflict percentage

### C\.6 Impact of Initial Mastery Level on Free Evolution

Finally, we investigate the impact of the model’s intrinsic mastery level of the target skill\. This investigation is driven by a question: if a model has already attained 100% mastery of a task’s mechanism \(i\.e\., achieving 100% accuracy\), will its corresponding circuit still undergo evolution during subsequent parameter updates?

To explore this, we selected a set of high\-mastery tasks \(Gender, SST2, Winogrande\) and low\-mastery tasks \(Bool, Arithmetic, SA\)\. High\-mastery tasks are defined as those exhibiting a pre\-SFT accuracy exceeding 70%, whereas low\-mastery tasks fall below 40%\. Furthermore, we curated a specialized subset comprising 500 correctly answered samples extracted from the high\-mastery datasets, ensuring a rigorous initial accuracy of 100%\. Designating these datasets as target tasks and OpenBookQA as the pervasiveness task, we conducted ablation experiments on the Mistral\-7B model\.
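The selection rule above can be expressed as a short sketch\. The thresholds \(70% and 40%\) and the subset size \(500\) come from the text; the function names, the sample format, and the choice to keep the first correctly answered samples are assumptions made for illustration\.

```python
def mastery_level(pre_sft_accuracy, high=0.70, low=0.40):
    """Label a task by its pre-SFT accuracy using the thresholds stated above."""
    if pre_sft_accuracy > high:
        return "high"
    if pre_sft_accuracy < low:
        return "low"
    return "unclassified"


def fully_mastered_subset(samples, predict, size=500):
    """Keep up to `size` samples the model already answers correctly.

    By construction, the model's initial accuracy on the result is 100%.
    Taking the first `size` correct samples is an assumption of this sketch.
    """
    correct = [s for s in samples if predict(s["input"]) == s["label"]]
    return correct[:size]


# Hypothetical data and a dummy predictor that always answers 0.
samples = [{"input": i, "label": i % 2} for i in range(10)]
subset = fully_mastered_subset(samples, predict=lambda x: 0, size=3)
print(mastery_level(0.85), mastery_level(0.20), [s["input"] for s in subset])
```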

Table [9](https://arxiv.org/html/2605.06076#A3.T9) demonstrates that tasks with higher initial mastery exhibit a diminished degree of component migration, suggesting that the model has already partially solidified the underlying mechanism\. Crucially, however, even for skills mastered at 100%, the circuit still undergoes definitive migration and evolution during parameter updates\. This finding compellingly reinforces the conclusion that a circuit is an inherently dynamic property; therefore, a static circuit derived from current parameters cannot reliably dictate or guide future parameter updates\.

Table 9: Performance of different mastery

## Appendix D More Experiments Results with Localization

### D\.1 More Target Tasks on Three Localization Strategies

Figure [6](https://arxiv.org/html/2605.06076#A4.F6) details the evolutionary trajectories of the remaining four target tasks across the Accuracy and Circuit Conflict metrics\. Corroborating the findings from the Arithmetic task, the localization methods consistently exhibit superior performance relative to the *Free* evolution baseline\.

![Refer to caption](https://arxiv.org/html/2605.06076v1/x15.png) \(a\) T\-Acc of Bool
![Refer to caption](https://arxiv.org/html/2605.06076v1/x16.png) \(b\) P\-Acc of Bool
![Refer to caption](https://arxiv.org/html/2605.06076v1/x17.png) \(c\) CC of Bool
![Refer to caption](https://arxiv.org/html/2605.06076v1/x18.png) \(d\) T\-Acc of Gender
![Refer to caption](https://arxiv.org/html/2605.06076v1/x19.png) \(e\) P\-Acc of Gender
![Refer to caption](https://arxiv.org/html/2605.06076v1/x20.png) \(f\) CC of Gender
![Refer to caption](https://arxiv.org/html/2605.06076v1/x21.png) \(g\) T\-Acc of Winogrande
![Refer to caption](https://arxiv.org/html/2605.06076v1/x22.png) \(h\) P\-Acc of Winogrande
![Refer to caption](https://arxiv.org/html/2605.06076v1/x23.png) \(i\) CC of Winogrande
![Refer to caption](https://arxiv.org/html/2605.06076v1/x24.png) \(j\) T\-Acc of SST\-2
![Refer to caption](https://arxiv.org/html/2605.06076v1/x25.png) \(k\) P\-Acc of SST\-2
![Refer to caption](https://arxiv.org/html/2605.06076v1/x26.png) \(l\) CC of SST\-2

Figure 6: Target Task Accuracy \(T\-Acc\), Pervasiveness Task Accuracy \(P\-Acc\), and Circuit Conflict \(CC\) of the Bool, Gender, Winogrande, and SST\-2 tasks with localization\.

### D\.2 Impact of Localization on Circuit Distance and Stability

Figure [7](https://arxiv.org/html/2605.06076#A4.F7) illustrates the performance of the five target tasks concerning Circuit Distance \($CD$\) and Circuit Stability \($CS$\)\. A pronounced observation is that localization methods precipitate substantially greater component migration compared to free evolution\. This strongly implies that the components pinpointed by localization do not constitute the genuinely optimal subset required for adaptation; paradoxically, this artificial restriction renders the evolutionary trajectory even more “incorrect” than if no localization were applied\. Concurrently, the sharp decline in $CS$ demonstrates that it becomes markedly more arduous to effectively update internal information within the remaining unfrozen components\. Collectively, these phenomena substantiate that current Mechanistic Localization fails to isolate the truly critical components and inadvertently freezes components essential for natural adaptation\. It is precisely this misallocation that induces the chaotic volatility observed in the individual circuit metrics\.

![Refer to caption](https://arxiv.org/html/2605.06076v1/x27.png) \(a\) $CD$ of Arithmetic
![Refer to caption](https://arxiv.org/html/2605.06076v1/x28.png) \(b\) $CS$ of Arithmetic
![Refer to caption](https://arxiv.org/html/2605.06076v1/x29.png) \(c\) $CD$ of Bool
![Refer to caption](https://arxiv.org/html/2605.06076v1/x30.png) \(d\) $CS$ of Bool
![Refer to caption](https://arxiv.org/html/2605.06076v1/x31.png) \(e\) $CD$ of Gender
![Refer to caption](https://arxiv.org/html/2605.06076v1/x32.png) \(f\) $CS$ of Gender
![Refer to caption](https://arxiv.org/html/2605.06076v1/x33.png) \(g\) $CD$ of Winogrande
![Refer to caption](https://arxiv.org/html/2605.06076v1/x34.png) \(h\) $CS$ of Winogrande
![Refer to caption](https://arxiv.org/html/2605.06076v1/x35.png) \(i\) $CD$ of SST\-2
![Refer to caption](https://arxiv.org/html/2605.06076v1/x36.png) \(j\) $CS$ of SST\-2

Figure 7: Circuit Distance \($CD$\) and Circuit Stability \($CS$\) of the Arithmetic, Bool, Gender, Winogrande, and SST\-2 tasks with localization\.

### D\.3 Impact of Circuit Scale on Localization Efficacy

To further investigate the limitations of static localization, we conducted a systematic interpolation study on the circuit scale\. We evaluated task performance \(T\-Acc and P\-Acc\), Circuit Distance \($CD$\), Circuit Stability \($CS$\), and Circuit Conflict \($CC$\) across varying circuit capacities: 500, 800, 1,000, 1,500, 2,000, 2,500, 3,000, 3,500, and 4,000 \(corresponding to the total number of unfrozen critical components\)\.

As demonstrated in Table [10](https://arxiv.org/html/2605.06076#A4.T10), optimal performance across all metrics is achieved when the circuit scale expands to approximately 3,000–3,500 components\. At this capacity, the localized subset is sufficiently expansive to encompass those latent components that, while deemed “marginally important” under the current parameter state, prove highly critical for future parameter updates\. This highlights a fundamental trade\-off inherent in determining circuit scale: an overly restrictive scale reliably isolates components critical to the *current* state but inadvertently discards latent components vital for *future* evolution; conversely, an excessively permissive scale severely compromises the very “interpretability” the mechanism seeks to provide\.

Table 10: Performance of different circuit scale in Arithmetic task

Crucially, however, at this optimal scale of 3,000–3,500 components, the localized circuit encompasses nearly 80% of the language model’s total target components\. Consequently, any assertion regarding the inherent “effectiveness of localization” becomes exceedingly tenuous at this juncture, as the tuning process fundamentally regresses toward standard full\-parameter SFT\.

Additionally, we evaluated the comparative efficacy of the three distinct circuit extraction methodologies outlined in Appendix [A](https://arxiv.org/html/2605.06076#A1)\. Table [11](https://arxiv.org/html/2605.06076#A4.T11) delineates the performance disparities among ACDC, EAP, and EdgePruning\. It is evident that the EAP approach markedly outperforms the other two alternatives\. Synthesizing this empirical finding with our analysis of RQ2\-c in Section [4\.2](https://arxiv.org/html/2605.06076#S4.SS2), a compelling explanation emerges: EAP’s reliance on gradient\-based estimation aligns inherently with the trajectory of gradient descent\. Consequently, this characteristic renders the EAP\-derived circuits substantially more compatible with, and adaptive to, future dynamic parameter updates\.

Table 11: Performance of different circuit methodologies

## Appendix E Validation in Single\-Objective SFT

In Section 4\.2\.2, we established a critical conclusion: in tasks governed by MLP\-dominated circuits, the circuits are inherently more resistant to migration\. Consequently, Mechanistic Localization coincidentally retains its guiding significance for future parameter updates, yielding performance metrics significantly superior to both Random Localization and free evolution\. Conversely, for tasks governed by Attention\-dominated circuits, intense component migration precipitates massive structural discrepancies between the circuit under the current parameter state and circuits under future states\. Thus, Mechanistic Localization fails to provide predictive guidance for future parameters, resulting in performance virtually indistinguishable from Random Localization\.

To rigorously validate this conclusion, we eliminated the pervasiveness task during the SFT of the Mistral\-7B model\. This allowed us to observe whether the isolated target task exhibits similar evolutionary patterns under pure single\-objective optimization\.

Table 12: Performance of WMDP\-Bio dataset without pervasiveness task

Table 13: Performance of Induction dataset without pervasiveness task

Tables [12](https://arxiv.org/html/2605.06076#A5.T12) and [13](https://arxiv.org/html/2605.06076#A5.T13) demonstrate that when WMDP\-Bio and Induction are optimized independently as isolated target tasks, the following conclusions continue to hold:

1. MLP\-dominated circuits are significantly less prone to migration\. Therefore, Mechanistic Localization for these knowledge\-centric tasks genuinely imparts meaningful guidance for future parameter updates\.
2. Attention\-dominated circuits are highly susceptible to migration, leading to profound structural discrepancies across different parameter states\. Consequently, Mechanistic Localization for these skill\-centric tasks suffers from severe temporal latency, rendering it ineffective for guiding dynamic updates\.

## Appendix F Extended Analysis of Future Mechanistic Localization and Methodological Comparisons

![Refer to caption](https://arxiv.org/html/2605.06076v1/x37.png) \(a\) Circuit Distance
![Refer to caption](https://arxiv.org/html/2605.06076v1/x38.png) \(b\) Circuit Stability
![Refer to caption](https://arxiv.org/html/2605.06076v1/x39.png) \(c\) Circuit Conflict

Figure 8: Line plots of Future Mechanistic Localization

First, we present the comprehensive circuit metrics, namely Circuit Distance \($CD$\), Circuit Stability \($CS$\), and Circuit Conflict \($CC$\), for the Arithmetic task under the Future Mechanistic localization paradigm, as illustrated in Figure [8](https://arxiv.org/html/2605.06076#A6.F8)\. These detailed metrics further substantiate the superiority and stability of the Future Mechanistic approach over static baseline methods\.

Subsequently, we expand our comparative analysis beyond standard circuit discovery to encompass other predominant Mechanistic Localization methodologies\. Specifically, we evaluate gradient\-based methods \(e\.g\., WAGLE and DEPN\), intervention\-based methods \(e\.g\., MEMIT and CLUE\), and strictly circuit\-centric methods \(e\.g\., CLUE\-EAP and CLUE\-EdgePruning\)\.

As demonstrated in Table [14](https://arxiv.org/html/2605.06076#A6.T14), gradient\-associated methodologies consistently yield the best CD and CS metrics among all evaluated techniques\. Notably, this superior performance persists even for methods that do not rely on standard gradient descent fine\-tuning for parameter updates \(e\.g\., MEMIT, which employs closed\-form vector editing\)\. These localization strategies are introduced as follows:

DEPN \[[52](https://arxiv.org/html/2605.06076#bib.bib12)\] \(Detect and Edit Privacy Neurons\) is a framework designed to safeguard against privacy leakage in pretrained language models by localizing and editing specific neurons\. The method’s core localization component is a privacy neuron detector that uses a gradient\-based attribution technique\. This detector computes a privacy attribution score for each neuron to quantify its contribution to the model’s leakage of private information\. This is achieved by calculating the cumulative gradient of the output probability with respect to the neuron’s activation value, as the activation is gradually changed from zero to its original value\.
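The cumulative\-gradient score described above can be approximated with a Riemann sum\. The following is a minimal sketch, not DEPN's implementation: it assumes a toy model whose output probability is a sigmoid of a weighted sum of neuron activations, and the names `output_prob`, `grad_wrt_neuron`, and `attribution` are hypothetical\.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_prob(acts, w, b):
    """Toy stand-in (assumption) for the model's output probability."""
    return sigmoid(acts @ w + b)

def grad_wrt_neuron(acts, w, b, i):
    """Analytic derivative of the toy output probability w.r.t. acts[i]."""
    p = output_prob(acts, w, b)
    return p * (1.0 - p) * w[i]

def attribution(acts, w, b, i, steps=50):
    """Riemann-sum approximation of the cumulative gradient as neuron i
    is scaled from zero up to its original activation value."""
    total = 0.0
    for k in range(1, steps + 1):
        scaled = acts.copy()
        scaled[i] = acts[i] * k / steps  # only neuron i is interpolated
        total += grad_wrt_neuron(scaled, w, b, i)
    return acts[i] * total / steps

# Hypothetical activations and readout weights for three neurons.
acts = np.array([1.0, 0.5, 2.0])
w = np.array([2.0, 0.0, -1.0])
b = 0.0

scores = np.array([attribution(acts, w, b, i) for i in range(len(acts))])
print(scores)
```

In this toy setting each score approximates the change in output probability when the neuron is restored from zero to its original value, so a neuron with zero readout weight receives a score of zero\. In a real model the gradient would come from automatic differentiation rather than a closed form\.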

WAGLE \[[21](https://arxiv.org/html/2605.06076#bib.bib26)\] \(Weight Attribution\-guided LLM Unlearning Framework\) pinpoints the most influential weights for unlearning through a strategic weight attribution method\. The method frames the weight attribution problem as a bi\-level optimization \(BLO\) problem, which allows it to balance unlearning efficacy with utility preservation\. The core of the localization process is the derivation of a closed\-form attribution score for each weight, calculated using the implicit gradient from the BLO problem\. This score’s value is determined by combining the gradients from both the forget loss and the retain loss\.

MEMIT \[[38](https://arxiv.org/html/2605.06076#bib.bib58)\] \(Mass\-Editing Memory in a Transformer\) addresses the deletion of factual information through causal tracing, a denoising\-based intervention method\. This approach relies on the assumption that knowledge is stored in specific, localized components of the network and can be identified via causal mediation\.

KN \[[11](https://arxiv.org/html/2605.06076#bib.bib11)\] \(Knowledge Neurons\) introduces the concept of knowledge neurons to investigate how factual knowledge is stored in pretrained Transformers\. As an intervention\-based method, it views feed\-forward network \(FFN\) modules as key\-value memories\. The method utilizes a knowledge attribution technique based on integrated gradients to evaluate the contribution of each neuron to knowledge predictions\. By identifying and manipulating \(e\.g\., suppressing or amplifying\) these specific neurons, KN demonstrates that interventions on localized neurons can explicitly affect knowledge expression and edit factual knowledge within the model without the need for SFT\.
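The effect of suppressing or amplifying a localized neuron can be illustrated on a toy FFN viewed as a key\-value memory\. This is a minimal sketch under stated assumptions, not KN's code: the two\-neuron layer, its weights, and the helper names `ffn_logits` and `target_prob` are hypothetical, and the scaling factors 0 and 2 are chosen only for illustration\.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ffn_logits(x, w_key, w_value, scale=None):
    """Toy FFN as key-value memory: keys produce neuron activations,
    values map activations to output logits. `scale` rescales activations
    per neuron to mimic an intervention."""
    h = np.maximum(x @ w_key, 0.0)  # ReLU activations of the FFN neurons
    if scale is not None:
        h = h * scale
    return h @ w_value

def target_prob(x, w_key, w_value, target, scale=None):
    return softmax(ffn_logits(x, w_key, w_value, scale))[target]

# Hypothetical weights: neuron 0 mainly supports token 0.
x = np.array([1.0, 0.0])
w_key = np.array([[1.0, 0.5], [0.0, 1.0]])    # 2 inputs -> 2 neurons
w_value = np.array([[2.0, 0.0], [0.0, 1.0]])  # 2 neurons -> 2 tokens

base = target_prob(x, w_key, w_value, target=0)
suppressed = target_prob(x, w_key, w_value, 0, scale=np.array([0.0, 1.0]))
amplified = target_prob(x, w_key, w_value, 0, scale=np.array([2.0, 1.0]))
print(base, suppressed, amplified)
```

Zeroing the neuron lowers the probability of the token it supports and doubling it raises that probability, with no weight update involved\.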

CLUE \[[6](https://arxiv.org/html/2605.06076#bib.bib36)\] \(Conflict\-guided Localization for LLM Unlearning Framework\) is a circuit\-based localization framework designed to improve the precision of LLM unlearning\. It leverages mechanistic interpretability to discover logical circuits corresponding to the forget set and the retain set\. CLUE transforms these circuits into Conjunctive Normal Form \(CNF\) and uses a Boolean satisfiability solver to disentangle the intertwined nodes into three distinct categories: forget nodes, retain nodes, and conflict nodes\. By pinpointing the specific function of each node, CLUE enables targeted SFT strategies that significantly enhance forget efficacy while preserving retain utility, avoiding the pitfalls of applying uniform interventions on entangled nodes\.

This empirical evidence strongly implies that gradient\-based techniques inherently possess greater “foresight” during the parameter update process; they are capable of prospectively identifying a crucial subset of the critical components that will ultimately govern the future parameter state\.

Table 14: Performance of different localization strategies on the Arithmetic dataset

## Appendix G Limitations

While this paper derives a series of insightful conclusions by observing transformer circuits throughout the Supervised Fine\-Tuning \(SFT\) process, several limitations remain to be acknowledged:

1. Inherent Limitations of Circuit Discovery: The process of circuit discovery itself is notoriously difficult to scale to exceptionally large LLMs and imposes stringent requirements on data quality\. Consequently, this computational bottleneck precludes further analysis under massive data and model scaling scenarios, thereby restricting the direct application of our analytical framework in certain real\-world, large\-scale deployments\.
2. Coupling of Localization and Parameter Update Mechanisms: Many contemporary Mechanistic Localization methodologies introduce bespoke parameter update techniques paired with their localization strategies; the effects of these two components are rarely strictly independent\. Although employing standard SFT as our observational baseline allows us to capture universally applicable and dynamic evolutionary trends, integrating our framework with these specialized, coupled update methods could potentially unveil more granular and nuanced patterns that remain unexplored in this work\.