Judge Circuits
Summary
This paper investigates the internal mechanisms of LLM-as-a-judge, finding a shared Latent Evaluator sub-graph in mid-to-late MLPs across models that handles abstract judging, while format-specific terminal branches map the judgment to output tokens, revealing the cause of format-induced inconsistency.
# Judge Circuits
Source: [https://arxiv.org/html/2605.16023](https://arxiv.org/html/2605.16023)
Nils Feldhus¹,², Tanja Baeumel³,⁶, Elena Golimblevskaia⁴, Qianli Wang¹, Van Bach Nguyen⁵, Aaron Louis Eidt¹,⁴, Christopher Ebert³, Wojciech Samek¹,²,⁴, Jing Yang¹,², Vera Schmitt¹,³,⁶, Sebastian Möller¹,³, Simon Ostermann³,⁶

¹Technische Universität Berlin, ²BIFOLD – Berlin Institute for the Foundations of Learning and Data, ³German Research Center for Artificial Intelligence (DFKI), ⁴Fraunhofer Heinrich Hertz Institute, ⁵Marburg University, ⁶Centre for European Research in Trusted AI (CERTAIN)

Correspondence: feldhus@tu-berlin.de
###### Abstract
LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1–5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.

Figure 1: Overview of our pipeline on an MNLI minimal pair: (1) PEAP (Haklay et al., [2025](https://arxiv.org/html/2605.16023#bib.bib17)) traces cross-token causal edges from the differential input tokens into a shared Latent Evaluator sub-circuit ($\mathcal{C}_{\text{LE}} := \mathcal{C}_{\text{rate}} \cap \mathcal{C}_{\text{class}}$). (2) We validate this circuit three ways: zero-ablation (red $\boldsymbol{\times}$) isolates evaluation from world knowledge; BDAS (Wu et al., [2023](https://arxiv.org/html/2605.16023#bib.bib52)) identifies a 1D judgment direction in the LE’s activation space; Task Formatters ($\mathcal{C}_{\text{TF,rate}}, \mathcal{C}_{\text{TF,class}}$) in terminal layers map that judgment scalar to the concrete target token.

## 1 Introduction
The LLM-as-a-Judge (LaaJ) paradigm is now widespread across NLP for evaluation tasks such as benchmark scoring, reward modeling, and content moderation – automating quality assessment without a human in the loop (Calderon et al., [2025](https://arxiv.org/html/2605.16023#bib.bib3); Gao et al., [2025](https://arxiv.org/html/2605.16023#bib.bib11); Li et al., [2024](https://arxiv.org/html/2605.16023#bib.bib27)). However, the reliability of LLMs as automated judges is heavily contested. Lee et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib25)) document a contradictory dissociation – relative preferences are often consistent, but absolute ratings are not – and isolate two specific failure modes: self-consistency across repeated evaluations, and inter-scale consistency across different rating formats. Even large proprietary models fail on both dimensions, undermining the reproducibility of any LaaJ-driven leaderboard, reward, or safety judgment. Eshuijs et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib9)) corroborate this from a different angle, showing that models frequently exploit shallow classification shortcuts – e.g., relying on lexical cues such as response length or sentiment polarity – rather than integrating the multiple aspects of input and target that holistic evaluation requires. Comparable inconsistency and calibration failures hold for judges of $<70$B parameters (Girrbach et al., [2025](https://arxiv.org/html/2605.16023#bib.bib14)).
No prior work has investigated the internal computational mechanisms underlying LLM judgment, a necessary step toward understanding and improving LaaJ reliability. Concretely, our results recast the diagnostic question from “does the model judge consistently?” to “where in the computational pathway from input to output token does format-induced inconsistency originate?” We address this gap directly, demonstrating that the consistency failures in Lee et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib25)) are not failures of evaluation but of output routing: a shared internal sub-circuit computes a stable judgment, and format-specific terminal pathways then translate that judgment into the requested output token – and it is the latter step that fails. We hypothesize that LaaJ implements judgment via two architecturally separable sub-systems – a shared evaluation core and a format-specific output router – and that inter-format inconsistency localizes to the latter.
To test this, we use Position-aware Edge Attribution Patching (PEAP) (Haklay et al., [2025](https://arxiv.org/html/2605.16023#bib.bib17)) to show that distinct judgment tasks rely on shared computational pathways. Unlike prior circuit discovery methods, PEAP handles cross-token edges – a necessary property for judge circuits whose inputs span separated linguistic spans (e.g., premise vs. hypothesis) – while remaining linear-in-edges to compute. Drawing on the literature on intermediate variables in transformer circuits (Lepori et al., [2024](https://arxiv.org/html/2605.16023#bib.bib26)) and the known dissociation of formal and functional linguistic mechanisms (Hanna et al., [2026](https://arxiv.org/html/2605.16023#bib.bib19)), we explicitly test whether LLMs decouple abstract judgment from fragile syntax formatting. We cross-validate every circuit with three independent causal probes – cumulative edge patching, subspace steering, and cross-format activation transfer – which converge on the same Latent Evaluator components and guard against non-identifiability (Miller et al., [2024](https://arxiv.org/html/2605.16023#bib.bib35); Méloux et al., [2025](https://arxiv.org/html/2605.16023#bib.bib32)). We then validate that the discovered circuits are modular and task-independent, and that the evaluation signal within them is encoded in a geometrically separable subspace (Figure [1](https://arxiv.org/html/2605.16023#S0.F1)).
Contributions:
1. We show that LLM judgment is computed by highly sparse, cross-task circuits sharing a generalized Latent Evaluator in mid-to-late MLPs, recoverable at top-$k \leq 200$ edges.
2. We show that judgment modularity is architecture-dependent: Qwen is modular at 7B, Gemma only at 27B. On modular models, zero-ablating the Latent Evaluator preserves world knowledge while collapsing judgment; on Gemma-3-12B it degrades both, indicating tight entanglement with world-knowledge pathways.
3. We provide a mechanistic explanation of inter-format LLM evaluator inconsistency, localizing it to format-specific output routing rather than to the underlying evaluation.

Together, these results suggest that LaaJ format inconsistency is a routing problem rather than an evaluation problem – and therefore that fixes can target the formatter without disturbing the model’s judgment competence.
## 2 Experimental Setup
Central Finding. An LLM-as-a-judge implements judgment via two architecturally separable sub-systems – a shared evaluation core and a format-specific output router.

We test this in three steps: §[3](https://arxiv.org/html/2605.16023#S3) discovers the candidate sub-circuits; §[4](https://arxiv.org/html/2605.16023#S4) probes whether the shared core is functionally isolated; §[5](https://arxiv.org/html/2605.16023#S5) causally validates the split via cross-format activation transfer.

A judgment task in our setting asks the model to assign a quality, preference, or correctness score to a candidate text given the input it conditions on, producing a scalar rating or categorical verdict over the candidate rather than a free-form generation. Our pipeline operates on contrastive minimal-pair prompts (Figure [1](https://arxiv.org/html/2605.16023#S0.F1)); the rating-vs-classification decomposition into a Latent Evaluator and format-specific Task Formatters is introduced in §[4.1](https://arxiv.org/html/2605.16023#S4.SS1).
#### Data
We select five datasets that together span the three dimensions of evaluation that LaaJ is deployed for: (i) structured linguistic correctness (CoLA, MultiNLI, STS-B), (ii) preference / quality judgment (RewardBench), and (iii) subjective sentiment (Yelp).
- CoLA (linguistic acceptability): fluency and grammaticality as quality criteria.
- MultiNLI (natural language inference) (Williams et al., [2018](https://arxiv.org/html/2605.16023#bib.bib51)): entailment / neutral / contradiction between a hypothesis and a premise.
- STS-B (sentence semantic similarity) (Cer et al., [2017](https://arxiv.org/html/2605.16023#bib.bib4)): semantic equivalence between pairs.
- RewardBench (preference evaluation) (Lambert et al., [2025](https://arxiv.org/html/2605.16023#bib.bib23)): the canonical testbed for open-ended LLM-as-a-judge capabilities.
- Yelp (sentiment, 1–5 star reviews) (Zhang et al., [2015](https://arxiv.org/html/2605.16023#bib.bib54)): a subjective, user-written evaluation domain with a natural ordinal scale.
#### Models
We evaluate five instruct-tuned models from three families: Gemma-3 (12B-it, 27B-it) (Team et al., [2025](https://arxiv.org/html/2605.16023#bib.bib12)), Qwen2.5 (7B-Instruct, 14B-Instruct) (Qwen et al., [2025](https://arxiv.org/html/2605.16023#bib.bib40)), and Llama-3.1-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2605.16023#bib.bib16)), accessed via TransformerLens (Nanda and Bloom, [2022](https://arxiv.org/html/2605.16023#bib.bib38)). We cap the minimal-pair subset at $|S| = 500$ for MNLI; CoLA, STS-B, RewardBench, and Yelp have 100–200 valid semantic pairs each. The split-half reliability check (App. [L](https://arxiv.org/html/2605.16023#A12)) confirms that within-task circuit IoU is comparable across these subset sizes. The computational geometry constraints behind the cap and our backward-pass tracing budget are deferred to App. [G](https://arxiv.org/html/2605.16023#A7).
#### Prompt design
For each dataset we construct contrastive minimal pairs: a clean prompt (correct rating) and a corrupted prompt (incorrect rating) with matched token lengths for PEAP attribution.¹ Half the pairs assign the higher rating to the clean prompt and half to the corrupted prompt, so that per-edge attributions are symmetric by construction (§[3.1](https://arxiv.org/html/2605.16023#S3.SS1)). We format every input as a 1–5 rating prompt; to enable contrastive circuit analysis (§[4.1](https://arxiv.org/html/2605.16023#S4.SS1)), we additionally pair each dataset with a parallel classification-control prompt (categorical Yes/No, True/False, or Entailment/Contradiction labels) on the same instances. Exact templates and padding/alignment details are in Appendices [F](https://arxiv.org/html/2605.16023#A6)–[G](https://arxiv.org/html/2605.16023#A7).

¹ For MNLI, minimal pairs are drawn from the entailment/contradiction subset; neutral instances are excluded so that clean and corrupted prompts have semantically opposed ground truth (App. [G](https://arxiv.org/html/2605.16023#A7) details the per-task selection rules).
## 3 Discovering Judge Circuits in LLMs
We use judge circuit to refer to the sparse causal sub-circuit a model uses to compute a rating from a structured prompt; §[4.1](https://arxiv.org/html/2605.16023#S4.SS1) decomposes it into a shared evaluation core ($\mathcal{C}_{\text{LE}}$) and a format-specific output branch ($\mathcal{C}_{\text{TF}}$). Our two-stage pipeline first applies PEAP to identify the causal pathways responsible for evaluation, then isolates task-specific formatting mechanisms from generic evaluation logic using contrastive control tasks.
### 3.1 Circuit Discovery via PEAP
Circuit discovery in decoder-only LLMs conceptualizes the forward pass as a computation graph $\mathcal{G}$ whose nodes are MLPs and attention heads and whose directed edges carry information flow, and seeks a sparse subgraph $\mathcal{C} \subset \mathcal{G}$ that causally accounts for a target behavior (Vig et al., [2020](https://arxiv.org/html/2605.16023#bib.bib47); Conmy et al., [2023](https://arxiv.org/html/2605.16023#bib.bib7); Wang et al., [2023](https://arxiv.org/html/2605.16023#bib.bib48)). Position-aware Edge Attribution Patching (PEAP) (Haklay et al., [2025](https://arxiv.org/html/2605.16023#bib.bib17)) extends Edge Attribution Patching (Hanna et al., [2024](https://arxiv.org/html/2605.16023#bib.bib18)) to capture causal edges across token positions in addition to intra-token ones – a necessary property for judge circuits that must cross-reference separated linguistic spans (e.g., premise vs. hypothesis). Concretely, for each candidate edge from sender $S$ to receiver $R$, PEAP estimates causal importance by the dot product of the receiver’s gradient $\nabla R$ with the difference between the sender’s activation on the clean and corrupted inputs, $(S_{\text{clean}} - S_{\text{corr}})$. A single backward pass yields all receiver gradients simultaneously, so the entire ranked edge list over attention heads and MLPs is extracted in one forward–backward sweep per minimal pair. We extend PEAP with a symmetric polarity correction (full formulas in Appendix [A](https://arxiv.org/html/2605.16023#A1)) that handles our bidirectional minimal pairs (§[2](https://arxiv.org/html/2605.16023#S2.SS0.SSS0.Px3)) without canceling genuine causal signal under naïve gradient summation. We separately verify that the extracted circuits are faithful to the full model (Appendix [C](https://arxiv.org/html/2605.16023#A3)) and stable under data resampling (Appendix [L](https://arxiv.org/html/2605.16023#A12)).
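The per-edge score described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy vectors, and the choice to rank by absolute score are all hypothetical, and the symmetric polarity correction from Appendix A is not reproduced here.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def edge_attribution(receiver_grad: Vector, sender_clean: Vector, sender_corr: Vector) -> float:
    """First-order edge score: grad_R . (S_clean - S_corr)."""
    return sum(g * (c - k) for g, c, k in zip(receiver_grad, sender_clean, sender_corr))


def rank_edges(edges: Dict[str, Tuple[Vector, Vector, Vector]], top_k: int) -> List[str]:
    """Return the top_k edge names by absolute score (the ranking rule is an assumption)."""
    scores = {name: edge_attribution(*parts) for name, parts in edges.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:top_k]


# Hypothetical edges: name -> (receiver gradient, sender clean act., sender corrupted act.)
toy_edges = {
    "mlp_a -> head_b": ([1.0, 0.0], [2.0, 0.0], [0.0, 0.0]),
    "head_b -> mlp_c": ([0.5, 0.5], [1.0, 1.0], [1.0, 0.0]),
}
top_edges = rank_edges(toy_edges, top_k=1)
```

In a real model the gradients for every receiver would come from one backward pass rather than being supplied by hand, which is what makes the full ranked list cheap to extract.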
### 3.2 Structural Overlap: The Latent Evaluator
Cross-task structural overlap is established evidence of shared computation in transformer circuits (Tigges et al., [2024](https://arxiv.org/html/2605.16023#bib.bib46); Ferrando and Costa-jussà, [2024](https://arxiv.org/html/2605.16023#bib.bib10); Lan et al., [2024](https://arxiv.org/html/2605.16023#bib.bib24)). Given two circuits $\mathcal{C}_A, \mathcal{C}_B$ traced on different tasks $A$ and $B$ and pruned to their top-$k$ edges, we quantify similarity via Jaccard Intersection-over-Union on both the set of unique edges $\mathcal{E}$ and distinct components $\mathcal{N}$, abstracting away token positions:

$$\text{IoU}_{\text{edge}} = \frac{|\mathcal{E}_A \cap \mathcal{E}_B|}{|\mathcal{E}_A \cup \mathcal{E}_B|}, \quad \text{IoU}_{\text{node}} = \frac{|\mathcal{N}_A \cap \mathcal{N}_B|}{|\mathcal{N}_A \cup \mathcal{N}_B|}.$$

Edge IoU is the stricter metric; Node IoU measures architectural recruitment at a coarser grain.
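As a small worked example of the two overlap metrics, the sketch below computes both on toy circuits. The component labels are hypothetical, and edges are written as (sender, receiver) pairs with token positions already dropped.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (sender, receiver), token positions abstracted away


def jaccard(a: Set, b: Set) -> float:
    """Jaccard Intersection-over-Union of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nodes_of(edges: Set[Edge]) -> Set[str]:
    """Distinct components touched by a set of edges."""
    return {component for edge in edges for component in edge}


# Hypothetical top-k edge sets for two tasks A and B.
edges_a = {("L10.mlp", "L12.h3"), ("L12.h3", "L20.mlp"), ("L20.mlp", "L30.mlp")}
edges_b = {("L10.mlp", "L12.h3"), ("L12.h3", "L30.mlp"), ("L20.mlp", "L30.mlp")}

iou_edge = jaccard(edges_a, edges_b)
iou_node = jaccard(nodes_of(edges_a), nodes_of(edges_b))
```

Here the two circuits recruit the same four components but wire them differently, so the node-level score is higher than the edge-level score. This is the sense in which Edge IoU is the stricter of the two.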
Figure 2: Sparse circuit faithfulness across the five evaluated models and five rating tasks. Each curve traces median MIB recovery as we cumulatively patch the top-$k$ PEAP edges from a fully corrupted forward pass back toward the clean activations. Solid colored lines are the discovered circuits; the gray dashed line is a random-edge baseline. Curves saturating at $\approx 1.0$ at small $k$ indicate that the sparse circuit fully captures the model’s evaluation behavior; flat curves (Gemma-3-12B / RewardBench, Yelp; Llama-3.1-8B / Yelp) reflect architectural entanglement on those particular cells rather than an absence of mechanism.

Finding 1: Distinct judgment tasks share a dense computational trunk on every modular architecture.
On Gemma-3-12B at top-200 (Figure [3](https://arxiv.org/html/2605.16023#A2.F3); Node IoU in Appendix [B](https://arxiv.org/html/2605.16023#A2)), we measure 61.0% Node IoU / 35.3% Edge IoU between CoLA and MNLI, 62.3% / 42.1% between MNLI and STS-B, and 48.8% Node IoU / 31.1% Edge IoU between RewardBench and CoLA. The same shared-trunk pattern holds across the modular models (Figure [4](https://arxiv.org/html/2605.16023#A2.F4)). Qwen2.5-7B in particular achieves a uniformly high Edge IoU (34.9–47.0%) on every task pair we test, including the open-ended RewardBench pairings. Qwen2.5-14B and Gemma-3-27B post lower raw Edge IoUs at the same $k$, but their Node IoUs remain substantial (28.1–55.9%), consistent with the scale-dependent redundancy effect documented in Appendix [L](https://arxiv.org/html/2605.16023#A12): larger modular models route judgment through multiple computationally equivalent sub-pathways, so the same components are recruited but the specific top-200 edges differ across data splits. To rule out the possibility that this overlap reflects sample-size noise rather than genuine shared structure, we compute within-task split-half reliability on Gemma-3-12B at the same $k$: Node IoU is 76.3% on MNLI, 80.6% on STS-B, and 61.6% on CoLA (Appendix [L](https://arxiv.org/html/2605.16023#A12)), meeting or exceeding the cross-task numbers.
### 3.3 Sparse Circuit Faithfulness
To validate that the PEAP-discovered edges are causally sufficient for the model’s judgment, we apply the per-instance MIB faithfulness metric (Mueller et al., [2025](https://arxiv.org/html/2605.16023#bib.bib36)): starting from a fully corrupted forward pass, we progressively restore the top-$k$ PEAP edges and measure the median fraction of the clean–corrupted EV gap (§[3.1](https://arxiv.org/html/2605.16023#S3.SS1)) that the patched sub-circuit recovers (full methodology and the magnitude-weighted sensitivity analysis are in Appendices [C](https://arxiv.org/html/2605.16023#A3) and [M](https://arxiv.org/html/2605.16023#A13)).
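The recovery statistic can be illustrated as follows. This is a sketch assuming EV is available as one scalar per forward pass (clean, corrupted, and patched); the function names and the toy values are hypothetical, and the exact per-instance handling in Appendix C may differ.

```python
from statistics import median
from typing import Iterable, Tuple


def recovery(ev_clean: float, ev_corr: float, ev_patched: float) -> float:
    """Fraction of the clean-corrupted EV gap recovered by the patched sub-circuit."""
    gap = ev_clean - ev_corr
    if gap == 0:
        raise ValueError("clean and corrupted EV coincide; recovery is undefined")
    return (ev_patched - ev_corr) / gap


def median_recovery(instances: Iterable[Tuple[float, float, float]]) -> float:
    """Median per-instance recovery over (ev_clean, ev_corr, ev_patched) triples."""
    return median(recovery(*triple) for triple in instances)


# Hypothetical EV triples for three minimal pairs at some fixed k.
toy_instances = [(4.0, 2.0, 4.0), (5.0, 1.0, 3.0), (4.0, 2.0, 3.5)]
toy_median = median_recovery(toy_instances)
```

A value near 1 means restoring the top-$k$ edges brings the corrupted run back to the clean EV; a value near 0 means the patched edges recover none of the gap.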
Finding 2: PEAP recovers highly sparse, faithful circuits across models and tasks.

Across the 25 (model, task) cells we trace, 21 reach median recovery $\geq 0.87$ at some $k \leq 200$ (Figure [2](https://arxiv.org/html/2605.16023#S3.F2)); on Gemma-3-27B the open-ended RewardBench circuit saturates at median $\approx 1.0$ with just $k = 5$ edges. The non-saturating cells are Gemma-3-12B on RewardBench and Yelp (median recovery $\approx 0$ through $k = 200$, consistent with that model’s functional entanglement of judgment with world-knowledge pathways; §[4.2](https://arxiv.org/html/2605.16023#S4.SS2), Table [1](https://arxiv.org/html/2605.16023#S3.T1)), Llama-3.1-8B on MNLI (slower climb, $\approx 0.82$ at $k = 200$), and Llama-3.1-8B on Yelp (peaks at $\approx 0.40$ before drifting back down). A randomly-sampled-edge baseline (gray dashed line) hovers near 0% across every configuration, ruling out the possibility that any sparse subgraph would suffice.
#### Cross-method robustness.
Faithfulness rules out the metric-fragility concern about sparse circuit extraction. The complementary non-identifiability concern flagged in §[1](https://arxiv.org/html/2605.16023#S1) (Méloux et al., [2025](https://arxiv.org/html/2605.16023#bib.bib32)) is that different attribution algorithms may select different sparse subgraphs on the same model and task. We address it by re-tracing every Qwen2.5-7B and Gemma-3-12B circuit with LRPEAP, an alternative attribution backbone we develop that keeps PEAP’s position-aware edge formulation but replaces the gradient-based backward pass with an LRP-rule backward pass (Jafari et al., [2025](https://arxiv.org/html/2605.16023#bib.bib21)) (Appendix [N](https://arxiv.org/html/2605.16023#A14)).
Finding 3: The judge circuit and its Latent Evaluator are stable across attribution backbones.

On the (Qwen2.5-7B, Gemma-3-12B) $\times$ 10-task panel, top-200 PEAP and LRPEAP edge sets share 34% mean Jaccard IoU on edges and 46% on components (permutation null $p_{99} = 1.9\%$); the Latent Evaluator subgraph $\mathcal{C}_{\text{LE}} = \mathcal{C}_{\text{rate}} \cap \mathcal{C}_{\text{class}}$ computed under each method recovers at 0.47 mean component IoU, peaking at 0.61 on MNLI. The partial edge-overlap is consistent with computational redundancy, where multiple sparse subgraphs implement the same judgment behavior; the LE/TF decomposition is the structural intersection both methods converge on (Appendix [N](https://arxiv.org/html/2605.16023#A14)).
Table 1: Zero-ablation semantic domain control. Ablating the Latent Evaluator collapses world knowledge (MMLU: Clinical DB, Abstract Alg., Physics) and formal factual retrieval (StrategyQA, CREAK) in Gemma-3-12B, but preserves both across the four other models – indicating that modularity depends on architecture, not scale alone. The full circuit-topology panels in Appendix [K](https://arxiv.org/html/2605.16023#A11) demonstrate the corresponding two-stage Latent Evaluator / Task Formatter separation across models and tasks, supporting the same generalization. † Llama-3.1-8B’s merged top-50 Latent Evaluator contains only MLPs (no shared attention heads); the StrategyQA / CREAK cells reflect the meaningful MLP-only ablation, while the MMLU cells are vacuously preserved because the head-targeted MMLU runner had no heads to ablate – consistent with Llama’s MLP-dominant evaluator (Appendix [K](https://arxiv.org/html/2605.16023#A11)).
## 4 Judge Circuit Modularity is Architecture-Dependent
Building on §[3.2](https://arxiv.org/html/2605.16023#S3.SS2), we test whether the shared trunk is a functionally modular sub-system and not a generic capability bottleneck (Hanna et al., [2026](https://arxiv.org/html/2605.16023#bib.bib19)): if zero-ablating the Latent Evaluator collapses judgment but spares world-knowledge benchmarks, the sub-graph is doing genuinely judgment-specific work – which in turn licenses the format-transfer experiments in §[5](https://arxiv.org/html/2605.16023#S5) as targeted perturbations.
### 4.1 Isolating Judgment from Formatting via Contrastive Circuits
For each dataset we trace two circuits on the same data: one for the rating task ($\mathcal{C}_{\text{rate}}$, e.g., “On a scale of 1 to 5…”) and one for a classification control task ($\mathcal{C}_{\text{class}}$, e.g., yes/no) with matched prompt structure. Their structural overlap decomposes the model’s cognition into two functionally distinct components:
- The Latent Evaluator ($\mathcal{C}_{\text{LE}} := \mathcal{C}_{\text{rate}} \cap \mathcal{C}_{\text{class}}$): the shared computational trunk. Components in this intersection process the core semantic judgment of the prompt, agnostic to output format. $\mathcal{C}_{\text{LE}}$ is the formal definition of the $\mathcal{C}_{\text{shared}}$ sub-circuit highlighted in Figure [1](https://arxiv.org/html/2605.16023#S0.F1).
- The Task Formatters ($\mathcal{C}_{\text{TF,rate}} := \mathcal{C}_{\text{rate}} \setminus \mathcal{C}_{\text{class}}$ and $\mathcal{C}_{\text{TF,class}} := \mathcal{C}_{\text{class}} \setminus \mathcal{C}_{\text{rate}}$): specialized terminal routing branches, typically late-layer attention heads, that translate the abstract judgment into format-specific target tokens.

The judge circuit (§[3](https://arxiv.org/html/2605.16023#S3)) is therefore $\mathcal{C}_{\text{rate}} = \mathcal{C}_{\text{LE}} \cup \mathcal{C}_{\text{TF,rate}}$. We abbreviate Latent Evaluator and Task Formatter as LE and TF.
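The decomposition above is plain set algebra, which a short sketch makes concrete. The component labels are hypothetical placeholders, not components reported in the paper.

```python
from typing import Set, Tuple


def decompose(c_rate: Set[str], c_class: Set[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split two traced circuits into (LE, TF_rate, TF_class)."""
    latent_evaluator = c_rate & c_class   # shared trunk
    tf_rate = c_rate - c_class            # rating-only terminal branch
    tf_class = c_class - c_rate           # classification-only terminal branch
    return latent_evaluator, tf_rate, tf_class


# Hypothetical circuits traced on the same data under the two output formats.
c_rate = {"mlp_mid_1", "mlp_mid_2", "head_late_r"}
c_class = {"mlp_mid_1", "mlp_mid_2", "head_late_c"}
le, tf_rate, tf_class = decompose(c_rate, c_class)
```

By construction, `le | tf_rate` reproduces `c_rate` and `le | tf_class` reproduces `c_class`, and the two formatter sets are disjoint from each other and from the shared trunk.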
Finding 4: Contrastive tracing yields a clean Latent Evaluator / Task Formatter decomposition.

On Gemma-3-12B (CoLA $\times$ CoLA_CLASS, top-200), 3 of 17 analyzed heads act as shared evaluators – most strongly L45H3, L46H12, L47H7 – while the remaining 14 split cleanly into rating-specific formatters (9) and classification-specific formatters (5). An independent SAE-based role assignment (Appendix [J](https://arxiv.org/html/2605.16023#A10)) selects the same three heads as the shared-evaluator core on both CoLA and STS-B, providing cross-method confirmation of the decomposition.
### 4.2 Functional Modularity via Zero-Ablation
Identifying a shared causal circuit does not guarantee that the Latent Evaluator is functionally isolated from unrelated capabilities such as world-knowledge recall. We test this by zero-ablating the Latent Evaluator: for every component (attention head or MLP) that appears as a sender in at least one top-$k$ Latent Evaluator edge, we clamp its forward-pass output to zero. We then evaluate the ablated model against MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2605.16023#bib.bib20)) world knowledge and two factual QA datasets, StrategyQA (Geva et al., [2021](https://arxiv.org/html/2605.16023#bib.bib13)) and CREAK (Onoe et al., [2021](https://arxiv.org/html/2605.16023#bib.bib39)). These probes natively emit “Yes/No” or “True/False” tokens – mirroring our Task Formatter setups – while relying on disjoint semantic phenomena (factual retrieval vs. abstract judgment).
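The selection rule (ablate every sender of a top-$k$ LE edge) and the clamp itself can be sketched on a toy residual stream. This is a deliberately simplified illustration: component outputs are treated as fixed vectors, whereas in a real forward pass downstream components would recompute from the altered stream. All names and values are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (sender, receiver)


def senders_to_ablate(le_edges: Set[Edge]) -> Set[str]:
    """Components that appear as a sender in at least one Latent Evaluator edge."""
    return {sender for sender, _receiver in le_edges}


def residual_stream(outputs: Dict[str, List[float]], ablated: Set[str]) -> List[float]:
    """Sum component outputs, replacing ablated components by the zero vector."""
    width = len(next(iter(outputs.values())))
    total = [0.0] * width
    for name, vector in outputs.items():
        contribution = [0.0] * width if name in ablated else vector
        total = [t + c for t, c in zip(total, contribution)]
    return total


# Hypothetical top-k LE edges and per-component outputs.
le_edges = {("mlp_a", "head_b"), ("mlp_a", "mlp_c"), ("head_b", "mlp_c")}
outputs = {"mlp_a": [1.0, 2.0], "head_b": [0.5, 0.5], "mlp_c": [3.0, -1.0]}

ablated = senders_to_ablate(le_edges)
intact_stream = residual_stream(outputs, set())
ablated_stream = residual_stream(outputs, ablated)
```

Note that `mlp_c` only ever appears as a receiver, so under the sender-based rule it is left untouched.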
Finding 5: On modular architectures, ablating the Latent Evaluator leaves world knowledge intact.

Table [1](https://arxiv.org/html/2605.16023#S3.T1) illustrates that, on the four modular models, every meaningfully-tested probe shows $\leq 2$ pp degradation under ablation.² By contrast, iteratively ablating Latent Evaluator edges in the same models triggers a phase-transition collapse in judgment EV on every rating task tested (Appendix [H](https://arxiv.org/html/2605.16023#A8), Figures [13](https://arxiv.org/html/2605.16023#A8.F13)–[14](https://arxiv.org/html/2605.16023#A8.F14)). The Latent Evaluator therefore operates as a specialized sub-system whose removal destroys judging while leaving the model’s world knowledge stores largely intact.

² Caveat: Llama-3.1-8B’s MMLU cells are vacuously preserved since its merged Latent Evaluator contains no shared attention heads, making the head-only MMLU runner inert; the meaningful Llama tests are the StrategyQA and CREAK cells, which ablate the LE MLPs and show 0 pp drop.
Finding 6: Modularity emerges at family-specific scales.

Qwen achieves clean modularity already at the smallest size we study (7B); Llama-3.1-8B does so for its MLP-dominant evaluator; Gemma-3 only at 27B. In contrast, Gemma-3-12B tightly entangles the Latent Evaluator with world-knowledge pathways: zero-ablation roughly halves MMLU clinical, physics, and CREAK accuracy (Table [1](https://arxiv.org/html/2605.16023#S3.T1)). Only when we scale to Gemma-3-27B does this entanglement dissolve. Scale alone therefore does not predict modularity: comparable parameter counts produce qualitatively different internal structure across families.
## 5 Inter-Format Inconsistencies Arise from a Modular Mismatch
Given that the Latent Evaluator is a real, functionally modular sub-system, the question is how its output is transformed into the format-specific target token. Our hypothesis is that the Task Formatter branches (§[4.1](https://arxiv.org/html/2605.16023#S4.SS1)) are the locus of inter-format inconsistency: the Latent Evaluator computes a stable continuous judgment signal, but this signal is mapped onto format-specific tokens by fragile, non-linear terminal routing. We test this hypothesis via a causal cross-format patching experiment.
#### Causal Analysis via Format Transfer Injection
We design a minimal causal test: Format Transfer Injection (FTI), following Merullo et al. ([2024](https://arxiv.org/html/2605.16023#bib.bib34)). For a given instance we capture the activations of the Latent Evaluator components during a pristine 5-star rating prompt and force those exact activations into the computational graph of the same model running on a corrupted classification prompt (whose natural output would be the negative token, e.g., “No”). This is a blanket activation transfer that overwrites the entire LE pattern, in contrast to the targeted 1D subspace interventions at the same LE components in Appendix [D](https://arxiv.org/html/2605.16023#A4). If the Latent Evaluator is the primary causal anchor for the judgment, the downstream classification head should receive the injected positive judgment signal and flip its output token – from “No” to “Yes” or “Entailment”. If instead the terminal branches are doing the actual judgment work, the injection should have no effect. §[5.2](https://arxiv.org/html/2605.16023#S5.SS2) develops the resulting scalar-vs-blanket distinction as a deployment-relevant design property of the formatter.
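The intervention logic of FTI can be shown on a toy two-stage model. This sketch is purely illustrative: the "evaluator" and "formatter" functions are hypothetical stand-ins, and the toy satisfies the LE-as-anchor hypothesis by construction, whereas the experiment asks whether real models behave this way.

```python
from typing import List, Optional


def latent_evaluator(evidence: List[float]) -> float:
    """Toy LE: collapse per-feature evidence into one scalar judgment (hypothetical)."""
    return sum(evidence) / len(evidence)


def classification_formatter(judgment: float) -> str:
    """Toy classification Task Formatter: map the scalar onto a discrete label."""
    return "Yes" if judgment > 0.0 else "No"


def run_classification(evidence: List[float], patched_le: Optional[float] = None) -> str:
    """Run the classification graph, optionally overwriting the LE activation."""
    judgment = latent_evaluator(evidence) if patched_le is None else patched_le
    return classification_formatter(judgment)


# Step 1: cache the LE activation on the clean (5-star) rating prompt.
cached_le = latent_evaluator([0.9, 0.8, 0.7])

# Step 2: run the corrupted classification prompt with and without the injection.
corrupted_evidence = [-0.6, -0.4, -0.5]
base_output = run_classification(corrupted_evidence)
patched_output = run_classification(corrupted_evidence, patched_le=cached_le)
```

If the formatter ignored the LE value and judged from its own inputs, `patched_output` would stay at the base label, which is the alternative outcome the experiment is designed to distinguish.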
### 5.1 The Latent Evaluator is the Causal Anchor for Judgment
Table 2: FTI probability shifts on all five tasks. Patching a 5-star Latent Evaluator into a corrupted categorical classification prompt shifts probability mass toward the positive target token (Yes/Entailment) when the Task Formatter is geometrically compatible. $N$ is the post-filter pair count under the inclusion criteria (source rating EV $> 4$, corrupted base prediction $\notin \{\text{Yes}, \text{Entailment}\}$); per-cell discussion in §[5.1](https://arxiv.org/html/2605.16023#S5.SS1).

Finding 7: Injecting the Latent Evaluator causally shifts downstream classifier outputs; inter-format inconsistency therefore localizes to the classification Task Formatter, not the Latent Evaluator.
The clearest causal demonstration is Qwen2.5-7B: blanket FTI flips the argmax in $\geq 99\%$ of CoLA, STS-B, MNLI, and RewardBench pairs (Table [2](https://arxiv.org/html/2605.16023#S5.T2)), with mean target-class probability rising from $\leq 17\%$ at baseline to $\geq 85\%$ post-injection – i.e., the classification graph that would natively output the negative token instead emits the positive one in essentially every pair, driven solely by the rating-prompt LE pattern. The same near-total flips obtain on Gemma-3-27B / STS-B and Qwen2.5-14B / CoLA. Crucially, in every case the injected continuous judgment scalar is mapped by the classification formatter onto the discrete target token “Yes”/“Entailment” without breaking the output format space (no pair emits “5”), demonstrating that the classification Task Formatter correctly interprets a scalar judgment signal regardless of where in the graph that signal originated.
Reading these results in aggregate: the judgment representation is stable,11D, and shared across semantic domains \(§[4\.1](https://arxiv.org/html/2605.16023#S4.SS1), Appendix[D](https://arxiv.org/html/2605.16023#A4)\); the bottleneck is theterminalmapping – a format\-specific routing layer that is fragile to perturbation and varies sharply in topology across tasks \(3\-wayMNLIvs\. binary classification\) and models \(geometricallyinsulated, e\.g\., Gemma\-3\-27B and Qwen2\.5\-14B onMNLI, where blanket injection barely moves the output, vs\.exposed, e\.g\., Qwen2\.5\-7B\)\. This is why ratings produced by the same model on structurally identical inputs can diverge under trivial format perturbations: under our FTI evidence the Latent Evaluator does not disagree – the classification Task Formatter does\.
Finding 8: FTI fails when the formatter is geometrically insulated – either by scale \(open\-ended tasks\) or by multi\-attractor label structure \(MNLI\)\.
Two regimes within Table [2](https://arxiv.org/html/2605.16023#S5.T2) share a common explanation\.
\(a\) Multi\-attractor label structure \(MNLI\)\. We apply Logit Lens \([https://www\.lesswrong\.com/posts/AcKRB8wDpdaN6v6ru](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru)\), which decodes intermediate activations into vocabulary space via the model’s unembedding matrix $W\_\{U\}$, to the late\-layer Task Formatter components \(Appendix [I](https://arxiv.org/html/2605.16023#A9), Figure [10](https://arxiv.org/html/2605.16023#A5.F10)\)\. On Gemma\-3\-27B, MNLI’s classification formatter spreads its projected mass across three competing target tokens \(*contradiction*, *entailment*, *neutral*\) of roughly equal weight \(max/min projected\-mass ratio ≈ 2\.7\), forming a three\-way attractor basin: a routing geometry in which several output tokens act as locally dominant targets\. By contrast, STS\-B’s formatter on the same model concentrates mass on a single positive token \(*positive*: 0\.27, *negative*: 0\.01; max/min ratio ≈ 19\), forming a near\-unipolar binary attractor\. This geometric difference predicts the FTI behavior we then observe: apart from Qwen2\.5\-7B \(which flips MNLI near\-perfectly\), MNLI flip rates collapse to single digits on every model \(Table [2](https://arxiv.org/html/2605.16023#S5.T2)\)\. The 1D judgment direction has no unambiguous target in a three\-attractor basin, so the injected mass fragments across \{entailment, neutral, contradiction\} and no single label reaches argmax; when the target basin is binary or asymmetric, the scalar decodes cleanly and argmax flips\.
\(b\) Within\-family scale decrease \(open\-ended tasks\)\. Smaller models flip open\-ended classifiers more readily: on both RewardBench and Yelp the FTI flip rate falls as Qwen scales from 7B to 14B and as Gemma\-3 scales from 12B to 27B \(Table [2](https://arxiv.org/html/2605.16023#S5.T2)\)\. Two data points per family are too few to claim a general scaling law, so we describe the pattern as a within\-family decrease rather than as an inverse trend\. The pattern cannot be explained by the Latent Evaluator being absent at scale: the same top\-200 sparse circuits recover near\-full MIB faithfulness on these cells \(Appendix [C](https://arxiv.org/html/2605.16023#A3)\), and the directional 1D subspace steering at the LE components moves the output cleanly \(Appendix [D](https://arxiv.org/html/2605.16023#A4)\)\. We interpret the FTI decoupling at scale as evidence that the open\-ended Task Formatter becomes geometrically insulated with scale: the scalar judgment direction is still present and steerable, but the full Latent Evaluator activation pattern from a rating prompt is no longer a sufficient causal key for the classification\-prompt formatter to accept\.
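The Logit Lens readout used in regime \(a\) can be sketched in a few lines\. This is an illustrative sketch, not the paper’s pipeline: it assumes "projected mass" means softmax probability after multiplying an activation by the unembedding matrix, it omits the final normalization layer that practical Logit Lens implementations often apply, and `logit_lens` and `max_min_ratio` are hypothetical helper names\.

```python
import math
from typing import Dict, List, Sequence


def logit_lens(activation: Sequence[float],
               unembed: Sequence[Sequence[float]],
               vocab: Sequence[str]) -> Dict[str, float]:
    """Project an activation (d_model) through W_U (d_model x |vocab|), then softmax."""
    logits = [
        sum(a * row[j] for a, row in zip(activation, unembed))
        for j in range(len(vocab))
    ]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    z = sum(exps)
    return {tok: e / z for tok, e in zip(vocab, exps)}


def max_min_ratio(masses: Dict[str, float], targets: List[str]) -> float:
    """Ratio of the largest to the smallest projected mass among the target tokens."""
    values = [masses[t] for t in targets]
    return max(values) / min(values)
```

Under this reading, a ratio near 1 means several labels compete for the projected mass \(a multi\-attractor formatter\), while a large ratio means one label dominates\.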
### 5\.2 The Format Split is the Inconsistency Bottleneck
The FTI results close the causal loop on our third contribution\. The Latent Evaluator’s output, a 1D direction whose orientation tracks the scaled rating signal \(App\. [D](https://arxiv.org/html/2605.16023#A4)\), is universally received by downstream Task Formatters, but only reaches the argmax token when the TF’s attractor geometry is compatible with a scalar input \(Table [2](https://arxiv.org/html/2605.16023#S5.T2), App\. [I](https://arxiv.org/html/2605.16023#A9)\)\.
Three causal probes do not agree on RewardBench/Yelp: cumulative patching recovers the behavior \(App\. [C](https://arxiv.org/html/2605.16023#A3)\) and targeted 1D subspace steering \(App\. [D](https://arxiv.org/html/2605.16023#A4)\) moves the output cleanly, but blanket activation injection via FTI flips only a small minority of instances\. We read this scalar\-vs\-blanket divergence as evidence that the formatter’s basin becomes more selective with scale, accepting perturbations aligned with the learned judgment direction but rejecting the full rating\-prompt activation pattern\. Practically, this means that deploying the LE as a robust LaaJ signal on open\-ended tasks favors targeted subspace interventions over blanket activation transfer\.
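The difference between the two interventions can be made concrete with a small vector sketch\. It is a sketch under assumptions, not the paper’s BDAS code: the targeted variant is modeled as a 1D interchange that replaces only the coordinate along a given direction, and `subspace_transfer` and `blanket_transfer` are hypothetical names\.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def unit(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [a / norm for a in v]


def subspace_transfer(h_target: Vector, h_source: Vector, direction: Vector) -> Vector:
    """Targeted 1D intervention: copy only the source's coordinate along `direction`.

    Everything orthogonal to the direction stays as it was in the target run.
    """
    d = unit(direction)
    shift = dot(h_source, d) - dot(h_target, d)
    return [t + shift * di for t, di in zip(h_target, d)]


def blanket_transfer(h_target: Vector, h_source: Vector) -> Vector:
    """Blanket intervention: overwrite the entire activation with the source's."""
    return list(h_source)
```

The targeted variant changes one coordinate and keeps the target prompt’s remaining activation geometry, which is one way to read why a selective formatter could accept it while rejecting the full overwrite\.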
Finding 9: The LE’s 1D direction is a usable zero\-shot judgment scalar in the small\-*N* preference regime\.
As a deployment\-oriented test of the mechanism, we ask whether the LE’s 1D causal direction can serve as a judgment signal directly\. On three benchmarks with continuous human ratings \(STS\-B, Yelp, RewardBench\), a zero\-shot 1D readout \(BDAS\-1D\) tracks a fully supervised residual probe Girrbach et al\. \([2025](https://arxiv.org/html/2605.16023#bib.bib14)\) within a few percentage points of Spearman ρ on most cells and matches or exceeds it specifically on small\-*N* preference data, while beating the prompted argmax in nearly every cell\. The advantage concentrates where the supervised probe overfits and the prompted output is poorly calibrated; on tasks with a scale\-aligned prompted vocabulary \(Yelp 1–5\), prob\-weighted EV remains a stronger baseline \(§[Limitations](https://arxiv.org/html/2605.16023#Sx1)\)\. Full methodology, results table, and per\-regime breakdown are in Appendix [E](https://arxiv.org/html/2605.16023#A5)\.
## 6 Discussion
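For readers who want to reproduce the comparison metric, Spearman ρ between a scalar readout and human ratings can be computed as the Pearson correlation of average ranks\. The sketch below is a generic standard\-library implementation, not the authors’ evaluation code; the function names are illustrative\.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(vx * vy)


def spearman_rho(readout: Sequence[float], human: Sequence[float]) -> float:
    """Spearman correlation between a scalar readout and human ratings."""
    if len(readout) != len(human):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(readout), average_ranks(human))
```

Because only ranks enter the statistic, the readout does not need to be calibrated to the human rating scale, which is why a raw 1D projection and a prompted 1–5 rating can be compared on equal footing\.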
The Latent Evaluator / Task Formatter split reframes the ongoing debate about LLM\-as\-a\-judge reliability Lee et al\. \([2025](https://arxiv.org/html/2605.16023#bib.bib25)\); Bavaresco et al\. \([2025](https://arxiv.org/html/2605.16023#bib.bib2)\); Chehbouni et al\. \([2025](https://arxiv.org/html/2605.16023#bib.bib6)\)\. Behavioral inconsistency under format perturbations is, on the mechanism we identify, the expected signature of a stable internal judgment routed through a fragile terminal mapping and not a failure of the underlying evaluation\. This shifts the diagnostic question from “does the model judge consistently?” to “does the formatter for this output specification preserve the underlying judgment?”, and it predicts that benchmark\-level reliability comparisons across formats are partially measuring formatter geometry as opposed to evaluation quality\.
A second implication concerns the architectural origin of judgment modularity\. Comparable parameter counts produce qualitatively different internal structure across families\. This pushes back on the assumption that clean internal abstractions emerge as a generic consequence of scale\. Architectural and training choices that shape circuit topology appear at least as load\-bearing as scale; isolating which specific factor \(pretraining\-data composition, post\-training procedure, attention sparsity, normalization placement\) drives the Qwen vs\. Gemma scale contrast we observe is a natural follow\-on question for the mechanistic interpretability community\.
Mechanism connects to practice through a non\-trivial regime caveat\. Our results converge with concurrent behavioral findings that latent signals from internal activations outperform prompted Likert outputs Girrbach et al\. \([2025](https://arxiv.org/html/2605.16023#bib.bib14)\), and we causally identify the subspace from which those signals are recovered\. When the prompted output is calibrated to the human\-label scale, however, prompted aggregations remain a strong baseline that the 1D latent direction does not exceed; the latent signal’s advantage concentrates on small\-*N* preference data where the discrete output is poorly calibrated and supervised probes overfit\. The practical question is therefore not “should one extract from the latent subspace?” but “when?”, a design choice whose answer depends on whether the deployment regime offers a scale\-aligned prompted output or only a discrete preference signal\.
A natural open question is whether the two\-step pattern we identify – a stable internal computation routed through fragile terminal pathways – recurs in other behaviors where output formatting matters \(e\.g\., chain\-of\-thought, structured generation, tool calling\)\. If it does, the LaaJ inconsistency we mechanistically pin down here would be one instance of a broader routing\-vs\-computation dissociation worth probing in those settings\.
## 7 Related Work
#### Behavioral critiques of LaaJ validity\.
Beyond the inter\-format inconsistencies established by Lee et al\. \([2025](https://arxiv.org/html/2605.16023#bib.bib25)\) and the shortcut\-exploitation results of Eshuijs et al\. \([2025](https://arxiv.org/html/2605.16023#bib.bib9)\), Chehbouni et al\. \([2025](https://arxiv.org/html/2605.16023#bib.bib6)\) challenge the fundamental validity of LaaJ protocols, arguing that even strong models lack the robustness required to evaluate abstract concepts reliably\. Bavaresco et al\. \([2025](https://arxiv.org/html/2605.16023#bib.bib2)\) corroborate this empirically in a large\-scale comparison, finding that no single LLM consistently aligns with human judgment across tasks\. Our mechanistic results refine these critiques: under the LE/TF split, much of the observed inconsistency localizes to the terminal formatting stage\. Benchmark\-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality\.
#### Mechanistic precedents\.
Our work joins three lines of evidence: cross\-task circuit overlap Tigges et al\. \([2024](https://arxiv.org/html/2605.16023#bib.bib46)\); Ferrando and Costa\-jussà \([2024](https://arxiv.org/html/2605.16023#bib.bib10)\); Lan et al\. \([2024](https://arxiv.org/html/2605.16023#bib.bib24)\), low\-rank linear intermediate variables Lepori et al\. \([2024](https://arxiv.org/html/2605.16023#bib.bib26)\); Mueller et al\. \([2026](https://arxiv.org/html/2605.16023#bib.bib37)\), and the formal/functional dissociation Hanna et al\. \([2026](https://arxiv.org/html/2605.16023#bib.bib19)\) that the LE/TF split mirrors at the rating\-judgment level\. We extend this lineage by causally validating cross\-format judgment via subspace steering and activation transfer\.
## 8 Conclusion
LLM judgment reliability depends not only on what models compute internally but on how that computation is routed to the output token\. We identify a compact Latent Evaluator in mid\-to\-late MLPs that is functionally modular on most architectures we study but entangled with world\-knowledge pathways on Gemma\-3\-12B, so modularity is architecture\-dependent rather than a consequence of scale\. The 1D causal direction underlying this sub\-graph recovers a supervised linear\-probe judgment signal zero\-shot and exceeds it on small\-*N* preference data, mechanistically locating the latent signal that practical reference\-free rating methods rely on\.
## Limitations
A primary limitation of our mechanistic investigation is the computational cost of tracing large architectures end\-to\-end\. For context, mapping the CoLA judgment computational graph in Gemma\-3\-12B requires evaluating approximately 1\.46 million candidate edges\. While PEAP traces these evaluation circuits across token positions efficiently, edge patching at this density, especially on the largest models such as Gemma\-3\-27B \(roughly 50,000 components\), required us to restrict each analyzed dataset subset to between 100 and 500 minimal\-pair samples\. We partially mitigate this concern by reporting split\-half circuit reliability \(Appendix [L](https://arxiv.org/html/2605.16023#A12)\): within\-task circuits are substantially more stable than chance at every scale we tested, and comparable to or higher than the cross\-task IoU numbers we report in §[3\.2](https://arxiv.org/html/2605.16023#S3.SS2)\.
Furthermore, while we demonstrate our findings on evaluations of grammar, logical entailment, sentiment, and preference, mapping exactly how models route highly subjective or culturally biased evaluation criteria remains a compelling direction for future research\. The open\-ended\-task scope concern that RewardBench and Yelp circuits might require a denser subgraph than structured NLU is largely resolved by the present data: on Qwen2\.5\-7B, Qwen2\.5\-14B, and Gemma\-3\-27B the same sparse edge budget that recovers structured NLU also recovers open\-ended judgment \(Appendix [C](https://arxiv.org/html/2605.16023#A3), Figure [2](https://arxiv.org/html/2605.16023#S3.F2)\)\. The exceptions are Gemma\-3\-12B \(where neither RewardBench nor Yelp saturates\) and Llama\-3\.1\-8B on Yelp alone \(median recovery peaks at ≈ 0\.40 before drifting back down\); we attribute the Gemma\-3\-12B failure to its architectural entanglement \(Table [1](https://arxiv.org/html/2605.16023#S3.T1)\) and the Llama\-3\.1\-8B Yelp shortfall to its weaker cross\-task structural overlap and MLP\-dominant Latent Evaluator \(§[3\.2](https://arxiv.org/html/2605.16023#S3.SS2), Appendix [K](https://arxiv.org/html/2605.16023#A11)\)\.
A more nuanced limitation concerns the scalar\-vs\-blanket FTI decoupling on open\-ended tasks at scale, developed in §[5\.2](https://arxiv.org/html/2605.16023#S5.SS2)\. Pinning down which properties of the rating\-prompt activation geometry are and are not carried across the FTI injection – beyond the 1D judgment direction itself – is a direction for future mechanistic work\.
The practical\-judge result \(Appendix [E](https://arxiv.org/html/2605.16023#A5)\) carries a regime caveat: when the prompted vocabulary is scale\-aligned to the human label \(Yelp 1–5 stars\), prob\-weighted EV is a strong baseline that the 1D BDAS readout does not exceed\. Whether higher\-rank extraction \(e\.g\., *k*\-D BDAS or multi\-component aggregation\) closes that gap is left to future work\.
We do not benchmark PEAP against an alternative circuit\-tracing method such as Relevance Patching \(RelP\) Jafari et al\. \([2025](https://arxiv.org/html/2605.16023#bib.bib21)\); our cross\-method confirmation is presently limited to the SAE\-based role assignment in Appendix [J](https://arxiv.org/html/2605.16023#A10)\. A side\-by\-side PEAP\-vs\.\-RelP attribution comparison on the same minimal pairs would test whether the Latent Evaluator decomposition is robust to the choice of attribution algorithm\.
The cross\-task IoU values reported in §[3\.2](https://arxiv.org/html/2605.16023#S3.SS2) are bracketed by within\-task split\-half reliability \(an upper bound\) and a random\-edge baseline \(a lower bound\), both on judgment circuits; we do not include an IoU comparison against a circuit traced on a non\-judgment task \(e\.g\., factual recall on MMLU\) as a non\-LaaJ external reference, which would further sharpen the interpretation of the LaaJ shared\-trunk magnitude\.
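The bracketing described above can be sketched with plain set operations\. This is a minimal sketch under assumptions: circuits are taken to be the top\-*k* edges ranked by absolute attribution score, the random baseline draws two uniform *k*\-edge subsets per trial, and all function names and the edge representation are hypothetical\.

```python
import random
from typing import Dict, Iterable, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (upstream node, downstream node)


def iou(a: Iterable[Edge], b: Iterable[Edge]) -> float:
    """Intersection over union of two edge sets (1.0 if both are empty)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def top_k_edges(scores: Dict[Edge, float], k: int) -> Set[Edge]:
    """Keep the k edges with the largest absolute attribution score."""
    ranked = sorted(scores, key=lambda e: abs(scores[e]), reverse=True)
    return set(ranked[:k])


def split_half_iou(scores_a: Dict[Edge, float],
                   scores_b: Dict[Edge, float], k: int) -> float:
    """Reliability proxy: overlap of circuits traced on two halves of one task."""
    return iou(top_k_edges(scores_a, k), top_k_edges(scores_b, k))


def random_edge_baseline(all_edges: Sequence[Edge], k: int,
                         trials: int = 100, seed: int = 0) -> float:
    """Mean IoU of two random k-edge subsets, as a chance-level reference."""
    rng = random.Random(seed)
    pool = list(all_edges)
    total = 0.0
    for _ in range(trials):
        total += iou(rng.sample(pool, k), rng.sample(pool, k))
    return total / trials
```

A cross\-task IoU is then read against these two references: close to the split\-half value suggests a shared trunk, close to the random baseline suggests none\.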
## Acknowledgments
We thank Fedor Splitt for running additional experiments and Laura Kopf and Gabriele Sarti for their feedback on earlier drafts\.
AI assistance \(Claude Code\) was used for coding and minor textual edits\. All scientific claims, interpretations, and conclusions remain the responsibility of the authors\.
## References
- A\. Bavaresco, R\. Bernardi, L\. Bertolazzi, D\. Elliott, R\. Fernández, A\. Gatt, E\. Ghaleb, M\. Giulianelli, M\. Hanna, A\. Koller, A\. Martins, P\. Mondorf, V\. Neplenbroek, S\. Pezzelle, B\. Plank, D\. Schlangen, A\. Suglia, A\. K\. Surikuchi, E\. Takmaz, and A\. Testoni \(2025\)LLMs instead of human judges? a large scale empirical study across 20 NLP evaluation tasks\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 238–255\.External Links:[Link](https://aclanthology.org/2025.acl-short.20/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-short.20),ISBN 979\-8\-89176\-252\-7Cited by:[§6](https://arxiv.org/html/2605.16023#S6.p1.1),[§7](https://arxiv.org/html/2605.16023#S7.SS0.SSS0.Px1.p1.1)\.
- N\. Calderon, R\. Reichart, and R\. Dror \(2025\)The alternative annotator test for LLM\-as\-a\-judge: how to statistically justify replacing human annotators with LLMs\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 16051–16081\.External Links:[Link](https://aclanthology.org/2025.acl-long.782/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.782),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2605.16023#S1.p1.1)\.
- D\. Cer, M\. Diab, E\. Agirre, I\. Lopez\-Gazpio, and L\. Specia \(2017\)SemEval\-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation\.InProceedings of the 11th International Workshop on Semantic Evaluation \(SemEval\-2017\),S\. Bethard, M\. Carpuat, M\. Apidianaki, S\. M\. Mohammad, D\. Cer, and D\. Jurgens \(Eds\.\),Vancouver, Canada,pp\. 1–14\.External Links:[Link](https://aclanthology.org/S17-2001/),[Document](https://dx.doi.org/10.18653/v1/S17-2001)Cited by:[3rd item](https://arxiv.org/html/2605.16023#S2.I1.i3.p1.1)\.
- K\. Chehbouni, M\. Haddou, J\. C\. Cheung, and G\. Farnadi \(2025\)Neither valid nor reliable? investigating the use of LLMs as judges\.InThe Thirty\-Ninth Annual Conference on Neural Information Processing Systems Position Paper Track,External Links:[Link](https://openreview.net/forum?id=yqKfMr0yvY)Cited by:[§6](https://arxiv.org/html/2605.16023#S6.p1.1),[§7](https://arxiv.org/html/2605.16023#S7.SS0.SSS0.Px1.p1.1)\.
- A\. Conmy, A\. N\. Mavor\-Parker, A\. Lynch, S\. Heimersheim, and A\. Garriga\-Alonso \(2023\)Towards automated circuit discovery for mechanistic interpretability\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=89ia77nZ8u)Cited by:[§3\.1](https://arxiv.org/html/2605.16023#S3.SS1.p1.6)\.
- L\. Eshuijs, S\. Wang, and A\. Fokkens \(2025\)Short\-circuiting shortcuts: mechanistic investigation of shortcuts in text classification\.InProceedings of the 29th Conference on Computational Natural Language Learning,G\. Boleda and M\. Roth \(Eds\.\),Vienna, Austria,pp\. 105–125\.External Links:[Link](https://aclanthology.org/2025.conll-1.8/),[Document](https://dx.doi.org/10.18653/v1/2025.conll-1.8),ISBN 979\-8\-89176\-271\-8Cited by:[§1](https://arxiv.org/html/2605.16023#S1.p1.1),[§7](https://arxiv.org/html/2605.16023#S7.SS0.SSS0.Px1.p1.1)\.
- J\. Ferrando and M\. R\. Costa\-jussà \(2024\)On the similarity of circuits across languages: a case study on the subject\-verb agreement task\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 10115–10125\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.591/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.591)Cited by:[§3\.2](https://arxiv.org/html/2605.16023#S3.SS2.p1.6),[§7](https://arxiv.org/html/2605.16023#S7.SS0.SSS0.Px2.p1.1)\.
- M\. Gao, X\. Hu, X\. Yin, J\. Ruan, X\. Pu, and X\. Wan \(2025\)LLM\-based NLG evaluation: current status and challenges\.Computational Linguistics51,pp\. 661–687\.External Links:[Link](https://aclanthology.org/2025.cl-2.9/),[Document](https://dx.doi.org/10.1162/coli%5Fa%5F00561)Cited by:[§1](https://arxiv.org/html/2605.16023#S1.p1.1)\.
- M\. Geva, D\. Khashabi, E\. Segal, T\. Khot, D\. Roth, and J\. Berant \(2021\)Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies\.Transactions of the Association for Computational Linguistics9,pp\. 346–361\.External Links:ISSN 2307\-387X,[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00370),[Link](https://doi.org/10.1162/tacl%5Fa%5F00370)Cited by:[§4\.2](https://arxiv.org/html/2605.16023#S4.SS2.p1.1)\.
- L\. Girrbach, C\. Su, T\. Saanum, R\. Socher, E\. Schulz, and Z\. Akata \(2025\)Reference\-free rating of llm responses via latent information\.arXivabs/2509\.24678\.External Links:[Link](https://arxiv.org/abs/2509.24678)Cited by:[§D\.1](https://arxiv.org/html/2605.16023#A4.SS1.p1.1),[Appendix E](https://arxiv.org/html/2605.16023#A5.p1.9),[Appendix E](https://arxiv.org/html/2605.16023#A5.p2.12),[§1](https://arxiv.org/html/2605.16023#S1.p1.1),[§5\.2](https://arxiv.org/html/2605.16023#S5.SS2.p4.5),[§6](https://arxiv.org/html/2605.16023#S6.p3.2)\.
- E\. Golimblevskaia, A\. Jain, B\. Puri, A\. Ibrahim, W\. Samek, and S\. Lapuschkin \(2026\)Circuit insights: towards interpretability beyond activations\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=2Jyb1yu3nN)Cited by:[Appendix I](https://arxiv.org/html/2605.16023#A9.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXivabs/2407\.21783\.External Links:[Link](https://arxiv.org/abs/2407.21783)Cited by:[§2](https://arxiv.org/html/2605.16023#S2.SS0.SSS0.Px2.p1.3)\.
- T\. Haklay, H\. Orgad, D\. Bau, A\. Mueller, and Y\. Belinkov \(2025\)Position\-aware automatic circuit discovery\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 2792–2817\.External Links:[Link](https://aclanthology.org/2025.acl-long.141/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.141),ISBN 979\-8\-89176\-251\-0Cited by:[Appendix A](https://arxiv.org/html/2605.16023#A1.p3.3),[Figure 1](https://arxiv.org/html/2605.16023#S0.F1),[§1](https://arxiv.org/html/2605.16023#S1.p3.1),[§3\.1](https://arxiv.org/html/2605.16023#S3.SS1.p1.6)\.
- M\. Hanna, Y\. Belinkov, and S\. Pezzelle \(2026\)Are formal and functional linguistic mechanisms dissociated in language models?\.Computational Linguistics,pp\. 1–41\.External Links:ISSN 0891\-2017,[Document](https://dx.doi.org/10.1162/COLI.a.24),[Link](https://doi.org/10.1162/COLI.a.24)Cited by:[§1](https://arxiv.org/html/2605.16023#S1.p3.1),[§4](https://arxiv.org/html/2605.16023#S4.p1.1),[§7](https://arxiv.org/html/2605.16023#S7.SS0.SSS0.Px2.p1.1)\.
- M\. Hanna, S\. Pezzelle, and Y\. Belinkov \(2024\)Have faith in faithfulness: going beyond circuit overlap when finding model mechanisms\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=TZ0CCGDcuT)Cited by:[§C\.1](https://arxiv.org/html/2605.16023#A3.SS1.p2.1),[§3\.1](https://arxiv.org/html/2605.16023#S3.SS1.p1.6)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=d7KBjmI3GmQ)Cited by:[§4\.2](https://arxiv.org/html/2605.16023#S4.SS2.p1.1)\.
- F\. R\. Jafari, O\. Eberle, A\. Khakzar, and N\. Nanda \(2025\)RelP: faithful and efficient circuit discovery in language models via relevance patching\.arXivabs/2508\.21258\.External Links:[Link](https://arxiv.org/abs/2508.21258)Cited by:[§N\.1](https://arxiv.org/html/2605.16023#A14.SS1.p1.8),[§3\.3](https://arxiv.org/html/2605.16023#S3.SS3.SSS0.Px1.p1.1),[Limitations](https://arxiv.org/html/2605.16023#Sx1.p5.1)\.
- N\. Lambert, V\. Pyatkin, J\. Morrison, L\. Miranda, B\. Y\. Lin, K\. Chandu, N\. Dziri, S\. Kumar, T\. Zick, Y\. Choi, N\. A\. Smith, and H\. Hajishirzi \(2025\)RewardBench: evaluating reward models for language modeling\.InFindings of the Association for Computational Linguistics: NAACL 2025,L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 1755–1797\.External Links:[Link](https://aclanthology.org/2025.findings-naacl.96/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.96),ISBN 979\-8\-89176\-195\-7Cited by:[4th item](https://arxiv.org/html/2605.16023#A7.I1.i4.p1.1),[4th item](https://arxiv.org/html/2605.16023#S2.I1.i4.p1.1)\.
- M\. Lan, P\. Torr, and F\. Barez \(2024\)Towards interpretable sequence continuation: analyzing shared circuits in large language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 12576–12601\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.699/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.699)Cited by:[§3\.2](https://arxiv.org/html/2605.16023#S3.SS2.p1.6),[§7](https://arxiv.org/html/2605.16023#S7.SS0.SSS0.Px2.p1.1)\.
- N\. Lee, J\. Hong, and J\. Thorne \(2025\)Evaluating the consistency of LLM evaluators\.InProceedings of the 31st International Conference on Computational Linguistics,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),Abu Dhabi, UAE,pp\. 10650–10659\.External Links:[Link](https://aclanthology.org/2025.coling-main.710/)Cited by:[§1](https://arxiv.org/html/2605.16023#S1.p1.1),[§1](https://arxiv.org/html/2605.16023#S1.p2.1),[§6](https://arxiv.org/html/2605.16023#S6.p1.1),[§7](https://arxiv.org/html/2605.16023#S7.SS0.SSS0.Px1.p1.1)\.
- M\. A\. Lepori, T\. Serre, and E\. Pavlick \(2024\)Uncovering intermediate variables in transformers using circuit probing\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=gUNeyiLNxr)Cited by:[§1](https://arxiv.org/html/2605.16023#S1.p3.1),[§7](https://arxiv.org/html/2605.16023#S7.SS0.SSS0.Px2.p1.1)\.
- H\. Li, Q\. Dong, J\. Chen, H\. Su, Y\. Zhou, Q\. Ai, Z\. Ye, and Y\. Liu \(2024\)LLMs\-as\-judges: a comprehensive survey on llm\-based evaluation methods\.External Links:2412\.05579,[Link](https://arxiv.org/abs/2412.05579)Cited by:[§1](https://arxiv.org/html/2605.16023#S1.p1.1)\.
- M\. Méloux, S\. Maniu, F\. Portet, and M\. Peyrard \(2025\)Everything, everywhere, all at once: is mechanistic interpretability identifiable?\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=5IWJBStfU7)Cited by:[§C\.1](https://arxiv.org/html/2605.16023#A3.SS1.p1.1),[§1](https://arxiv.org/html/2605.16023#S1.p3.1),[§3\.3](https://arxiv.org/html/2605.16023#S3.SS3.SSS0.Px1.p1.1)\.
- J\. Merullo, C\. Eickhoff, and E\. Pavlick \(2024\)Language models implement simple Word2Vec\-style vector arithmetic\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 5030–5047\.External Links:[Link](https://aclanthology.org/2024.naacl-long.281/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.281)Cited by:[§D\.1](https://arxiv.org/html/2605.16023#A4.SS1.p1.1),[§5](https://arxiv.org/html/2605.16023#S5.SS0.SSS0.Px1.p1.3)\.
- J\. Miller, B\. Chughtai, and W\. Saunders \(2024\)Transformer circuit evaluation metrics are not robust\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=zSf8PJyQb2)Cited by:[§C\.1](https://arxiv.org/html/2605.16023#A3.SS1.p1.1),[§1](https://arxiv.org/html/2605.16023#S1.p3.1)\.
- A\. Mueller, J\. Brinkmann, M\. Li, S\. Marks, K\. Pal, N\. Prakash, C\. Rager, A\. Sankaranarayanan, A\. S\. Sharma, J\. Sun, E\. Todd, D\. Bau, and Y\. Belinkov \(2026\)The quest for the right mediator: surveying mechanistic interpretability for nlp through the lens of causal mediation analysis\.Computational Linguistics,pp\. 1–48\.External Links:ISSN 0891\-2017,[Document](https://dx.doi.org/10.1162/COLI.a.572),[Link](https://doi.org/10.1162/COLI.a.572)Cited by:[§D\.1](https://arxiv.org/html/2605.16023#A4.SS1.p1.1),[§7](https://arxiv.org/html/2605.16023#S7.SS0.SSS0.Px2.p1.1)\.
- A\. Mueller, A\. Geiger, S\. Wiegreffe, D\. Arad, I\. Arcuschin, A\. Belfki, Y\. S\. Chan, J\. F\. Fiotto\-Kaufman, T\. Haklay, M\. Hanna, J\. Huang, R\. Gupta, Y\. Nikankin, H\. Orgad, N\. Prakash, A\. Reusch, A\. Sankaranarayanan, S\. Shao, A\. Stolfo, M\. Tutek, A\. Zur, D\. Bau, and Y\. Belinkov \(2025\)MIB: a mechanistic interpretability benchmark\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=sSrOwve6vb)Cited by:[§C\.1](https://arxiv.org/html/2605.16023#A3.SS1.p3.2),[§3\.3](https://arxiv.org/html/2605.16023#S3.SS3.p1.1)\.
- N\. Nanda and J\. Bloom \(2022\)TransformerLens\.Note:[https://github\.com/TransformerLensOrg/TransformerLens](https://github.com/TransformerLensOrg/TransformerLens)Cited by:[§2](https://arxiv.org/html/2605.16023#S2.SS0.SSS0.Px2.p1.3)\.
- Y\. Onoe, M\. Zhang, E\. Choi, and G\. Durrett \(2021\)CREAK: a dataset for commonsense reasoning over entity knowledge\.InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks,J\. Vanschoren and S\. Yeung \(Eds\.\),Vol\.1,pp\.\.External Links:[Link](https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/5737c6ec2e0716f3d8a7a5c4e0de0d9a-Paper-round2.pdf)Cited by:[§4\.2](https://arxiv.org/html/2605.16023#S4.SS2.p1.1)\.
- Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.arXivabs/2412\.15115\.External Links:[Link](https://arxiv.org/abs/2412.15115)Cited by:[§2](https://arxiv.org/html/2605.16023#S2.SS0.SSS0.Px2.p1.3)\.
- A\. Saurez, N\. Sengar, and D\. Har \(2026\)Circuit fingerprints: how answer tokens encode their geometrical path\.arXivabs/2602\.09784\.External Links:[Link](https://arxiv.org/abs/2602.09784)Cited by:[§D\.1](https://arxiv.org/html/2605.16023#A4.SS1.p1.1)\.
- A\. Syed, C\. Rager, and A\. Conmy \(2024\)Attribution patching outperforms automated circuit discovery\.InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP,Y\. Belinkov, N\. Kim, J\. Jumelet, H\. Mohebbi, A\. Mueller, and H\. Chen \(Eds\.\),Miami, Florida, US,pp\. 407–416\.External Links:[Link](https://aclanthology.org/2024.blackboxnlp-1.25/),[Document](https://dx.doi.org/10.18653/v1/2024.blackboxnlp-1.25)Cited by:[§C\.1](https://arxiv.org/html/2605.16023#A3.SS1.p2.1)\.
- G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard, T. Mesnard, G. Cideron, J. Grill, S. Ramos, E. Yvinec, M. Casbon, E. Pot, I. Penchev, G. Liu, F. Visin, K. Kenealy, L. Beyer, X. Zhai, A. Tsitsulin, R. Busa-Fekete, A. Feng, N. Sachdeva, B. Coleman, Y. Gao, B. Mustafa, I. Barr, E. Parisotto, D. Tian, M. Eyal, C. Cherry, J. Peter, D. Sinopalnikov, S. Bhupatiraju, R. Agarwal, M. Kazemi, D. Malkin, R. Kumar, D. Vilar, I. Brusilovsky, J. Luo, A. Steiner, A. Friesen, A. Sharma, A. Sharma, A. M. Gilady, A. Goedeckemeyer, A. Saade, A. Feng, A. Kolesnikov, A. Bendebury, A. Abdagic, A. Vadi, A. György, A. S. Pinto, A. Das, A. Bapna, A. Miech, A. Yang, A. Paterson, A. Shenoy, A. Chakrabarti, B. Piot, B. Wu, B. Shahriari, B. Petrini, C. Chen, C. L. Lan, C. A. Choquette-Choo, C. Carey, C. Brick, D. Deutsch, D. Eisenbud, D. Cattle, D. Cheng, D. Paparas, D. S. Sreepathihalli, D. Reid, D. Tran, D. Zelle, E. Noland, E. Huizenga, E. Kharitonov, F. Liu, G. Amirkhanyan, G. Cameron, H. Hashemi, H. Klimczak-Plucińska, H. Singh, H. Mehta, H. T. Lehri, H. Hazimeh, I. Ballantyne, I. Szpektor, I. Nardini, J. Pouget-Abadie, J. Chan, J. Stanton, J. Wieting, J. Lai, J. Orbay, J. Fernandez, J. Newlan, J. Ji, J. Singh, K. Black, K. Yu, K. Hui, K. Vodrahalli, K. Greff, L. Qiu, M. Valentine, M. Coelho, M. Ritter, M. Hoffman, M. Watson, M. Chaturvedi, M. Moynihan, M. Ma, N. Babar, N. Noy, N. Byrd, N. Roy, N. Momchev, N. Chauhan, N. Sachdeva, O. Bunyan, P. Botarda, P. Caron, P. K. Rubenstein, P. Culliton, P. Schmid, P. G. Sessa, P. Xu, P. Stanczyk, P. Tafti, R. Shivanna, R. Wu, R. Pan, R. Rokni, R. Willoughby, R. Vallu, R. Mullins, S. Jerome, S. Smoot, S. Girgin, S. Iqbal, S. Reddy, S. Sheth, S. Põder, S. Bhatnagar, S. R. Panyam, S. Eiger, S. Zhang, T. Liu, T. Yacovone, T. Liechty, U. Kalra, U. Evci, V. Misra, V. Roseberry, V. Feinberg, V. Kolesnikov, W. Han, W. Kwon, X. Chen, Y. Chow, Y. Zhu, Z. Wei, Z. Egyed, V. Cotruta, M. Giang, P. Kirk, A. Rao, K. Black, N. Babar, J. Lo, E. Moreira, L. G. Martins, O. Sanseviero, L. Gonzalez, Z. Gleicher, T. Warkentin, V. Mirrokni, E. Senter, E. Collins, J. Barral, Z. Ghahramani, R. Hadsell, Y. Matias, D. Sculley, S. Petrov, N. Fiedel, N. Shazeer, O. Vinyals, J. Dean, D. Hassabis, K. Kavukcuoglu, C. Farabet, E. Buchatskaya, J. Alayrac, R. Anil, D. Lepikhin, S. Borgeaud, O. Bachem, A. Joulin, A. Andreev, C. Hardin, R. Dadashi, and L. Hussenot (2025) Gemma 3 technical report. arXiv abs/2503.19786. External Links: [Link](https://arxiv.org/abs/2503.19786) Cited by: [§2](https://arxiv.org/html/2605.16023#S2.SS0.SSS0.Px2.p1.3).
- C. Tigges, M. Hanna, Q. Yu, and S. Biderman (2024) LLM circuit analyses are consistent across training and scale. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37, pp. 40699–40731. External Links: [Document](https://dx.doi.org/10.52202/079017-1287), [Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/47c7edadfee365b394b2a3bd416048da-Paper-Conference.pdf) Cited by: [§3.2](https://arxiv.org/html/2605.16023#S3.SS2.p1.6), [§7](https://arxiv.org/html/2605.16023#S7.SS0.SSS0.Px2.p1.1).
- J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. Shieber (2020) Investigating gender bias in language models using causal mediation analysis. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA. External Links: ISBN 9781713829546 Cited by: [§3.1](https://arxiv.org/html/2605.16023#S3.SS1.p1.6).
- K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt (2023) Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=NpsVSN6o4ul) Cited by: [§3.1](https://arxiv.org/html/2605.16023#S3.SS1.p1.6).
- A. Warstadt, A. Singh, and S. R. Bowman (2019) Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7, pp. 625–641. External Links: [Link](https://aclanthology.org/Q19-1040/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00290) Cited by: [1st item](https://arxiv.org/html/2605.16023#S2.I1.i1.p1.1).
- A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. Walker, H. Ji, and A. Stent (Eds.), New Orleans, Louisiana, pp. 1112–1122. External Links: [Link](https://aclanthology.org/N18-1101/), [Document](https://dx.doi.org/10.18653/v1/N18-1101) Cited by: [2nd item](https://arxiv.org/html/2605.16023#S2.I1.i2.p1.1).
- Z. Wu, A. Geiger, T. Icard, C. Potts, and N. Goodman (2023) Interpretability at scale: identifying causal mechanisms in alpaca. In Thirty-seventh Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=nRfClnMhVX) Cited by: [§D.1](https://arxiv.org/html/2605.16023#A4.SS1.p1.1), [Appendix E](https://arxiv.org/html/2605.16023#A5.p1.9), [Figure 1](https://arxiv.org/html/2605.16023#S0.F1).
- A. Yom Din, T. Karidi, L. Choshen, and M. Geva (2024) Jump to conclusions: short-cutting transformers with linear transformations. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue (Eds.), Torino, Italia, pp. 9615–9625. External Links: [Link](https://aclanthology.org/2024.lrec-main.840/) Cited by: [Appendix I](https://arxiv.org/html/2605.16023#A9.p4.1).
- X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, Cambridge, MA, USA, pp. 649–657. External Links: [Link](https://proceedings.neurips.cc/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf) Cited by: [5th item](https://arxiv.org/html/2605.16023#S2.I1.i5.p1.1).
## Appendix A PEAP Attribution Formulas
This appendix contains the exact attribution-score formulas referenced in §[3.1](https://arxiv.org/html/2605.16023#S3.SS1). For each candidate edge, PEAP approximates the causal effect of restoring that edge from a corrupted to a clean state via a linear first-order expansion. Let $\mathrm{EV}=\sum_{r=1}^{s} r\cdot P(\text{rating}=r)$ denote the expected value of the predicted rating distribution (where $P(\text{rating}=r)$ is the softmax over the rating-token logits at the final sequence position and $s$ is the upper bound of the rating scale), and let $m=\mathrm{sgn}(\mathrm{EV}_{\text{clean}}-\mathrm{EV}_{\text{corr}})$ be the per-pair polarity multiplier that keeps attributions directionally consistent across our symmetrically balanced minimal pairs (§[2](https://arxiv.org/html/2605.16023#S2.SS0.SSS0.Px3)).
For intra-token residual-stream communication between sender $S_i$ and receiver $R_j$ at the same token position ($i=j$), the attribution score is
$$\text{Score}(S_i\to R_i)=m\cdot\left((S_i^{\text{clean}}-S_i^{\text{corr}})\cdot\nabla R_i\right).$$
For cross-token edges ($i\neq j$) we capture the Attention mechanism's crossing edges in the PEAP formulation, treating the Value vector $V$ at the source token as the sender, the Attention Output $Z$ at the destination token as the receiver, and scaling by the Attention Pattern $A$:
$$\text{Score}(V_i\to Z_j)=m\cdot A_{j,i}\cdot\left((V_i^{\text{clean}}-V_i^{\text{corr}})\cdot\nabla Z_j\right).$$
The Value/Output decomposition follows the original PEAP formulation Haklay et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib17)); our contribution is the symmetric polarity correction $m$, which adapts PEAP to the bidirectional rating targets inherent to LLM-as-a-judge evaluation. A single backward pass on the corrupted prompt yields all $\nabla R$ and $\nabla Z$ terms simultaneously, so an entire circuit's attribution is extracted in one forward–backward sweep per minimal pair.
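The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: activations and gradients are passed in as flat lists of floats, and the function names (`expected_rating`, `polarity`, `intra_token_score`, `cross_token_score`) are hypothetical.

```python
import math

def expected_rating(rating_logits):
    """EV = sum_r r * P(rating = r), with P the softmax over the rating-token logits."""
    peak = max(rating_logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in rating_logits]
    total = sum(weights)
    return sum(r * w / total for r, w in enumerate(weights, start=1))

def polarity(ev_clean, ev_corr):
    """m = sgn(EV_clean - EV_corr), returned as -1, 0, or 1."""
    gap = ev_clean - ev_corr
    return (gap > 0) - (gap < 0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def intra_token_score(s_clean, s_corr, grad_r, m):
    """Score(S_i -> R_i) = m * ((S_clean - S_corr) . grad R_i)."""
    diff = [c - k for c, k in zip(s_clean, s_corr)]
    return m * dot(diff, grad_r)

def cross_token_score(v_clean, v_corr, grad_z, attn_ji, m):
    """Score(V_i -> Z_j) = m * A[j, i] * ((V_clean - V_corr) . grad Z_j)."""
    diff = [c - k for c, k in zip(v_clean, v_corr)]
    return m * attn_ji * dot(diff, grad_z)
```

Flipping the sign of $m$ flips the sign of both scores, which is what keeps attributions comparable between pairs whose clean prompt is the high-rated one and pairs whose clean prompt is the low-rated one.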
## Appendix B Cross-task Node Overlap
Figure 3: Cross-task Edge IoU on Gemma-3-12B across Top-$K$ patching thresholds. Edges are PEAP-attributed connections between sub-components; higher curves indicate more shared structure at a given sparsity.
Figure 4: Cross-task circuit overlap at top-$200$ across all four architecturally modular models. The shared trunk is recoverable on every model, but the per-pair magnitude reflects each model's circuit redundancy (§[3.2](https://arxiv.org/html/2605.16023#S3.SS2), App. [L](https://arxiv.org/html/2605.16023#A12)): smaller models route through fewer equivalent paths, so their top-$200$ edges are more conserved across tasks, while larger modular models distribute attribution across many equivalent sub-pathways, lowering raw Edge IoU even though Node IoU stays high.
Figure [5](https://arxiv.org/html/2605.16023#A2.F5) reports the Node IoU complement to the Edge IoU view in Figure [3](https://arxiv.org/html/2605.16023#A2.F3) (§[3.2](https://arxiv.org/html/2605.16023#S3.SS2)). Node IoU measures architectural recruitment at the granularity of attention heads and MLPs, ignoring the specific cross-token connections to which Edge IoU is restricted. Across all task pairs the Node IoU curve sits substantially above the corresponding Edge IoU curve at the same Top-$K$, reflecting that distinct tasks reuse the same physical sub-components while routing through partially distinct edge subsets.
Figure 5: Cross-task Node IoU on Gemma-3-12B across Top-$K$ patching thresholds. Companion to Figure [3](https://arxiv.org/html/2605.16023#A2.F3).
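To make the Edge IoU versus Node IoU distinction concrete, the following sketch computes both for two toy circuits. It assumes a simplified representation in which an edge is a `(sender, receiver)` tuple with an attribution score; the component names and scores are invented for illustration, and the paper's edges additionally carry token positions.

```python
def top_k_edges(scores, k):
    """Return the set of the k edges with the largest absolute attribution score."""
    ranked = sorted(scores, key=lambda edge: abs(scores[edge]), reverse=True)
    return set(ranked[:k])

def iou(a, b):
    """Intersection over union of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def nodes_of(edges):
    """Collapse a set of edges to the set of components they touch."""
    return {node for edge in edges for node in edge}

# Toy attribution scores for two tasks (hypothetical values).
task_a = {("attn18.h3", "mlp20"): 0.5, ("mlp20", "mlp24"): 0.9, ("mlp5", "mlp6"): 0.1}
task_b = {("attn18.h3", "mlp20"): -0.6, ("mlp20", "mlp25"): 0.8, ("mlp1", "mlp2"): 0.05}

edges_a, edges_b = top_k_edges(task_a, 2), top_k_edges(task_b, 2)
edge_iou = iou(edges_a, edges_b)                      # one shared edge out of three
node_iou = iou(nodes_of(edges_a), nodes_of(edges_b))  # two shared nodes out of four
```

In this toy case the two tasks share most components but route through different connections, so Node IoU exceeds Edge IoU, the same qualitative pattern the appendix describes.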
## Appendix C Circuit Faithfulness
### C.1 Methodology
Circuit faithfulness, the degree to which a discovered subgraph causally accounts for the target behavior, is notoriously fragile and highly sensitive to seemingly insignificant changes in the ablation methodology (e.g., node vs. edge patching) Miller et al. ([2024](https://arxiv.org/html/2605.16023#bib.bib35)). A parallel concern is non-identifiability: multiple incompatible circuits can artificially explain the same downstream behavior Méloux et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib32)). We therefore adopt the per-instance MIB formulation throughout the main body and report a sensitivity analysis against the legacy magnitude-weighted directional score in Appendix [M](https://arxiv.org/html/2605.16023#A13), and we cross-validate the resulting circuits with two independent causal probes (BDAS, Appendix [D](https://arxiv.org/html/2605.16023#A4); FTI, §[5.1](https://arxiv.org/html/2605.16023#S5.SS1)) to guard against accepting a circuit that is faithful under one metric but spurious under another.
To validate that the edges identified by PEAP are sufficient for eliciting the judge behavior, we evaluate the faithfulness of the extracted circuit via cumulative patching Syed et al. ([2024](https://arxiv.org/html/2605.16023#bib.bib44)); Hanna et al. ([2024](https://arxiv.org/html/2605.16023#bib.bib18)). Starting from a fully corrupted forward pass, we progressively restore the activations of the top-$k$ edges (ranked by absolute PEAP score) to their clean-state values. Restoration is applied only at the exact token positions dictated by each edge.
Following the MIB benchmark Mueller et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib36)), we define the faithfulness of a sparse circuit $\mathcal{C}_k$ (the sub-circuit containing the top-$k$ attributed edges) as the mean per-instance fraction of the clean–corrupted EV gap that the patched circuit recovers:
$$\text{Faith}(k)=\frac{1}{N}\sum_{i=1}^{N}\frac{\text{EV}^{(i)}(\mathcal{C}_k)-\text{EV}^{(i)}_{\text{corr}}}{\text{EV}^{(i)}_{\text{clean}}-\text{EV}^{(i)}_{\text{corr}}}\,.$$
A faithfulness score near $1.0$ indicates that $\mathcal{C}_k$ fully encapsulates the model's rating behavior. Because our minimal pairs are symmetrically balanced, each per-instance gap carries an intrinsic sign and the per-instance ratio handles polarity naturally without an explicit direction multiplier. Treating every pair equally also avoids magnitude-weighting artifacts that would let a small number of high-gap pairs dominate the aggregate. We report the median across minimal pairs as our primary statistic, since the ratio distribution is heavy-tailed when a minority of pairs have near-equal clean/corrupt EVs; the mean, 95% bootstrap CI, and the count of low-gap pairs skipped via `mib_min_gap=0.05` are all reported in the supplementary CSVs. A legacy magnitude-weighted directional formulation is reported in Appendix [M](https://arxiv.org/html/2605.16023#A13) as a sensitivity analysis.
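The per-instance ratio and the low-gap filter can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes the `mib_min_gap` threshold is applied to the absolute clean/corrupted EV gap, and the function name and the toy EV values are hypothetical.

```python
from statistics import mean, median

def faithfulness_ratios(ev_circuit, ev_clean, ev_corr, min_gap=0.05):
    """Per-pair fraction of the clean/corrupted EV gap recovered by the patched circuit.

    Pairs whose absolute gap is below min_gap are skipped (assumed reading of
    mib_min_gap); the number of skipped pairs is returned alongside the ratios.
    """
    ratios, skipped = [], 0
    for patched, clean, corr in zip(ev_circuit, ev_clean, ev_corr):
        gap = clean - corr
        if abs(gap) < min_gap:
            skipped += 1
            continue
        ratios.append((patched - corr) / gap)
    return ratios, skipped

# Toy example: the second pair has the opposite polarity, the third is a low-gap pair.
ratios, skipped = faithfulness_ratios(
    ev_circuit=[3.9, 2.2, 3.0],
    ev_clean=[4.0, 2.0, 3.0],
    ev_corr=[2.0, 4.0, 2.99],
)
primary = median(ratios)   # the primary statistic described above
secondary = mean(ratios)   # matches Faith(k) restricted to the kept pairs
```

The second pair shows why no explicit polarity multiplier is needed: numerator and denominator are both negative there, so the ratio is still a positive recovery fraction.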
### C.2 Results
#### Per\-cell saturation points\.
The headline 21-of-25 finding is summarized in §[3.3](https://arxiv.org/html/2605.16023#S3.SS3); here we report the representative per-cell saturation points behind it. On Gemma-3-12B, median faithfulness saturates at $0.96$ on MNLI at $k=25$, $1.00$ on STS-B at $k=50$, and $0.95$ on CoLA at $k=100$. On Gemma-3-27B, median recovery snaps to $\approx 1.00$ at $k\geq 50$ on the four structured/open-ended tasks; the RewardBench circuit in particular saturates at median $1.02$ with just $k=5$ edges, an extreme sparsity that we partly attribute to Gemma-3-27B's clean modularity (Table [1](https://arxiv.org/html/2605.16023#S3.T1)) concentrating the open-ended-task circuit into a very small number of highly-attributed edges.
#### Interpreting the shape of the curves\.
Two curve shapes in Figure [2](https://arxiv.org/html/2605.16023#S3.F2) deserve explicit comment. First, a handful of cells, most prominently Gemma-3-27B $\times$ RewardBench (median $1.02$ at $k=5$), reach full recovery essentially at the sparsest budget we probe. This is not a metric artifact: the per-instance ratio $(\text{EV}(\mathcal{C}_k)-\text{EV}_{\text{corr}})/(\text{EV}_{\text{clean}}-\text{EV}_{\text{corr}})$ with `mib_min_gap=0.05` is bounded below by the $k=0$ corruption floor and takes no shortcuts; the shape reflects the underlying attribution distribution. When a model is architecturally modular (§[4.2](https://arxiv.org/html/2605.16023#S4.SS2)) and the task is decoded through a shallow, terminal Task Formatter (as RewardBench is on Gemma-3-27B, where the binary preference-scoring token sits directly after a short helpful/aligned instruction), the causal work concentrates into a few deep-layer edges, and cumulative patching recovers the clean EV as soon as those edges are restored. Conversely, structured NLU tasks such as MNLI, which requires cross-referencing premise and hypothesis spans, distribute attribution across more edges and therefore climb more gradually through $k\in[10,100]$ before saturating. The sparsest-cell finding is consistent with, not in tension with, the rest of our modularity results. Second, the Gemma-3-12B open-ended curves remain flat through $k=200$. This is the entanglement regime documented in Table [1](https://arxiv.org/html/2605.16023#S3.T1): PEAP still localizes stable open-ended edges on Gemma-3-12B (its split-half IoU on Yelp is $22.4\%$ and on RewardBench is $25.6\%$, well above the $0.5$–$6.8\%$ random baseline in Appendix [L](https://arxiv.org/html/2605.16023#A12)), but the top-$200$ subgraph is not sparse-recoverable because the circuit is densely interleaved with world-knowledge pathways. The flat shape therefore encodes an architectural property of Gemma-3-12B rather than an absence of mechanism; we treat it as a bounded scope condition on the sparse-circuit claim and say so explicitly in the Limitations (§[Limitations](https://arxiv.org/html/2605.16023#Sx1)).
## Appendix D Causal Subspace Steering of the Latent Evaluator
### D.1 Methodology
Beyond isolating physical components, recent mechanistic work investigates how concepts are encoded inside identified circuits through linear vector arithmetic Merullo et al. ([2024](https://arxiv.org/html/2605.16023#bib.bib34)) and through linearly-steerable conceptual variables that route latent states into specific geometric output fingerprints Mueller et al. ([2026](https://arxiv.org/html/2605.16023#bib.bib37)); Saurez et al. ([2026](https://arxiv.org/html/2605.16023#bib.bib42)). Wu et al. ([2023](https://arxiv.org/html/2605.16023#bib.bib52)) formalized Interchange Intervention Training (IIT) and Distributed Alignment Search (DAS) grounded in causal abstraction, discovering alignments between interpretable abstract variables and distributed neural representations; complementarily, Girrbach et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib14)) provide independent behavioral evidence that probability-weighted scores and linear probes on rating-position activations outperform prompted Likert outputs, indicating that judgment is encoded in a steerable latent representation.
To probe whether the Latent Evaluator's judgment signal lies along a single steerable direction, we apply a directional mean-difference steering protocol at the PEAP-discovered LE component positions, oriented toward positive judgment. For each source-task minimal pair $(x_{\text{clean}},x_{\text{corr}})$ we cache activations at every hook position $(\ell,p,h)$ identified as a Latent Evaluator sender and compute the per-pair difference $\Delta=a_{\text{clean}}-a_{\text{corr}}$. We orient $\Delta$ toward the positive-judgment pole using the polarity multiplier $m=\mathrm{sgn}(\mathrm{EV}_{\text{clean}}-\mathrm{EV}_{\text{corr}})$ from §[3.1](https://arxiv.org/html/2605.16023#S3.SS1) (the same multiplier that keeps PEAP attributions directionally consistent under our symmetric minimal-pair design) and average $m\cdot\Delta$ across all source-task pairs to obtain a per-hook steering vector $\bar{v}_{\ell,p,h}$. At inference on the target task, we add $\alpha\cdot\bar{v}_{\ell,p,h}$ to the corresponding hook activation during the forward pass and read off the resulting expected rating value. $\alpha=0$ recovers the unsteered baseline; $\alpha=1$ approximates a one-pair clean activation injection; $\alpha=2$ extrapolates past it. This protocol probes a 1D linear characterization of the LE subspace: any single-direction encoding of the judgment signal predicts a smooth, monotonic dose-response in $\alpha$.
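The mean-difference protocol for a single hook can be sketched as below. This is a simplified illustration, not the authors' implementation: activations are plain lists of floats for one hook position, and `steering_vector` and `steer` are hypothetical helper names.

```python
def steering_vector(a_clean, a_corr, ev_clean, ev_corr):
    """Average of m * (a_clean - a_corr) over pairs, with m = sgn(EV_clean - EV_corr).

    a_clean and a_corr are lists of per-pair activation vectors at one hook.
    """
    dim = len(a_clean[0])
    total = [0.0] * dim
    for clean, corr, evc, evk in zip(a_clean, a_corr, ev_clean, ev_corr):
        gap = evc - evk
        m = (gap > 0) - (gap < 0)  # polarity multiplier
        for d in range(dim):
            total[d] += m * (clean[d] - corr[d])
    return [t / len(a_clean) for t in total]

def steer(activation, v_bar, alpha):
    """Add alpha * v_bar to a hook activation; alpha = 0 leaves it unchanged."""
    return [a + alpha * v for a, v in zip(activation, v_bar)]

# Two toy pairs with opposite polarity: without m the differences would cancel.
v_bar = steering_vector(
    a_clean=[[1.0, 0.0], [0.0, 0.0]],
    a_corr=[[0.0, 0.0], [1.0, 0.0]],
    ev_clean=[4.0, 2.0],
    ev_corr=[2.0, 4.0],
)
steered = steer([0.0, 0.0], v_bar, alpha=2.0)
```

The toy pairs are chosen so that the raw differences are equal and opposite; the polarity multiplier is what prevents a symmetrically balanced dataset from averaging to a zero vector.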
### D.2 Results
#### Finding: The Latent Evaluator's judgment is encoded in a 1D steerable subspace.
Across the five models in Table [3](https://arxiv.org/html/2605.16023#A4.T3), the directional mean-difference vector at the LE components steers the predicted rating from a neutral mid-scale value to a confident $\approx 5$ at $\alpha=2.0$ when the target domain is compatible (CoLA $\rightarrow$ MNLI, CoLA $\rightarrow$ STS-B). Qwen2.5-7B, the smallest model, matches Qwen2.5-14B and Gemma-3-27B in steered EV precision ($4.94\pm 0.05$ on CoLA $\rightarrow$ MNLI at $\alpha=2.0$); Llama-3.1-8B reaches the tightest steered EV in the panel ($4.98\pm 0.01$ on the same pair). The evidence for a shared linear judgment direction is twofold: (i) a steering vector computed on one domain (e.g., CoLA grammar) successfully steers judgment on a structurally unrelated domain (e.g., MNLI entailment), so the direction generalizes across tasks; and (ii) the steering response is monotonic and smooth in $\alpha$ (Figure [6](https://arxiv.org/html/2605.16023#A4.F6), Figure [7](https://arxiv.org/html/2605.16023#A4.F7)), consistent with a 1D linear encoding rather than a nonlinear or multi-dimensional one.
Figure 6: Cross-domain causal steering of the Latent Evaluator. By extracting a 1-dimensional directional steering vector at the LE components on a source domain (e.g., CoLA) and injecting it into the corresponding hooks of a distinct target domain (e.g., MNLI), we control the model's final output. The $x$-axis denotes the scalar multiplier ($\alpha$) applied to the targeted subspace intervention, demonstrating bidirectional control over the model's judgment score independent of the underlying geometry.
Table 3: Cross-task subspace steering at $\alpha=2.0$ via the directional mean-difference vector at the PEAP-discovered Latent Evaluator components (§[D](https://arxiv.org/html/2605.16023#A4)). The isolated Latent Evaluator cleanly commands reasoning across syntax and semantics (boldfaced), but the steering fails when we attempt to cross-patch entirely distinct output formats (binary Classification $\rightarrow$ ordinal Rating).
Figure 7: Steering probability heatmap for STS-B $\rightarrow$ MNLI across intervention strength ($x$-axis) and predicted rating tokens (1–5) ($y$-axis) for Gemma-3-12B. As $\alpha$ increases, probability mass shifts monotonically from lower ratings toward $5$, demonstrating smooth, continuous control over the judgment output via a single geometric direction.
#### Finding: DAS fails precisely at the cross\-format boundary\.
Steering between distinct formatting modalities, from a binary classification task to a 5-bucket ordinal rating on STS-B, fails, shifting EV by a statistically negligible margin (Table [3](https://arxiv.org/html/2605.16023#A4.T3), “Binary $\rightarrow$ Rating” rows). This deliberate negative control reinforces the Latent Evaluator / Task Formatter split: DAS handles the abstract judgment direction but cannot re-route a categorical output into an ordinal one, because that mapping lives in the non-linear terminal Task Formatter. This 1D geometric finding is independently corroborated by the Format Transfer Injection experiment in §[5.1](https://arxiv.org/html/2605.16023#S5.SS1), which uses direct activation patching (rather than a learned rotation) and reaches the same conclusion.
#### Finding: Subspace steering extends to open\-ended tasks on the modular architectures\.
We supplement Table [3](https://arxiv.org/html/2605.16023#A4.T3) with cross-domain steering into RewardBench and Yelp circuits. On Qwen2.5-7B, the steering vector from CoLA to RewardBench drives the target EV from a neutral baseline to $4.99\pm 0.02$ ($\alpha=2.0$), and STS-B $\to$ Yelp reaches $4.87\pm 0.04$. Qwen2.5-14B and Gemma-3-27B show the same pattern (Qwen2.5-14B CoLA $\to$ RewardBench: $4.98\pm 0.01$; Gemma-3-27B CoLA $\to$ RewardBench: $4.92\pm 0.19$). Gemma-3-12B, consistent with its entanglement profile, fails all RewardBench-target steering (EV stays at baseline $\approx 0$) and shows only partial recovery on Yelp targets. Steering vectors sourced from MNLI into either open-ended target are substantially weaker across all models (e.g., Qwen2.5-14B MNLI $\to$ RewardBench: $3.15\pm 0.88$), mirroring the 3-way-attractor structure of MNLI's formatter that we characterize for FTI in §[5.1](https://arxiv.org/html/2605.16023#S5.SS1). Taken together, the subspace steering confirms that the 1D judgment direction extracted on structured NLU transfers onto open-ended Latent Evaluator circuits, even where the blanket FTI intervention fails to flip the final argmax, as on Gemma-3-27B's open-ended tasks.
#### Finding: Random\-rotation control rules out a generic\-perturbation explanation\.
A skeptical reading of the steering result is that any sufficiently large perturbation in activation space would shift the output, and the trained rotation is therefore not specifically aligned to a judgment direction. We rule this out via a Haar-uniform random-rotation control on Gemma-3-12B: at $\alpha=2.0$ on the same target hooks, the trained rotation moves mean target EV by $-0.42$ on CoLA $\to$ MNLI and by $-0.49$ on MNLI $\to$ STS-B, while ten random orthogonal rotations move mean EV by less than $\pm 0.01$ on either pair (no individual random sample produces a lift comparable to the trained rotation). The trained rotation also reduces per-instance variance ($\sigma=1.10$ vs. $1.59$ for the random ensemble on CoLA $\to$ MNLI; $\sigma=0.70$ vs. $1.06$ on MNLI $\to$ STS-B), consistent with a directionally-aligned rather than noise-injecting perturbation. On the cross-format STS-B Binary $\rightarrow$ Rating pair where DAS already fails (Table [3](https://arxiv.org/html/2605.16023#A4.T3)), the trained rotation moves EV by only $-0.002$, statistically indistinguishable from the random ensemble's $\pm 0.005$ null effect. The control therefore discriminates the two regimes: where DAS succeeds (cross-domain), the trained rotation is $\sim 50\times$ more effective than random; where DAS fails (cross-format), real and random rotations alike are inert, indicating a genuine absence of a steerable cross-format direction rather than a small-intervention artifact.
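One standard way to draw a Haar-uniform orthogonal matrix is to orthonormalize i.i.d. Gaussian vectors. The sketch below does this with Gram-Schmidt in plain Python; the source does not describe its sampling procedure, so this is an assumption about how such a control could be built, and `haar_orthogonal` and `rotate` are hypothetical names.

```python
import math
import random

def haar_orthogonal(dim, rng):
    """Sample an orthogonal matrix by Gram-Schmidt on i.i.d. Gaussian vectors.

    Returns a list of dim orthonormal rows. Gram-Schmidt fixes the sign
    convention (positive diagonal in the implied QR factorization), which is
    what makes the resulting matrix Haar-distributed.
    """
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:  # remove components along earlier rows
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-9:  # resample in the (measure-zero) degenerate case
            basis.append([x / norm for x in v])
    return basis

def rotate(matrix, vec):
    """Apply the matrix to a vector (matrix is a list of rows)."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

rng = random.Random(0)
Q = haar_orthogonal(6, rng)
direction = [1.0, 2.0, 0.0, 0.0, 0.0, 2.0]   # stand-in for a trained direction
control = rotate(Q, direction)               # random direction with the same norm
```

Because an orthogonal map preserves norms, the control perturbation has exactly the magnitude of the original one, so any difference in effect can be attributed to direction rather than size.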
#### Cross-task PCA overlap (Figure [8](https://arxiv.org/html/2605.16023#A4.F8)).
As a complementary geometric check, we compute the first principal component $PC_1$ of the difference matrices between clean-rating and corrupt-rating activations at the active Latent Evaluator nodes, separately for CoLA, MNLI, and STS-B. The pairwise cosine similarity between these $PC_1$ directions is uniformly high, confirming that the geometric shift from a low-rating to a high-rating state is structurally conserved across semantically distinct tasks.
Figure 8: Pairwise cosine similarity between the $PC_1$ direction of the Latent Evaluator's clean/corrupt activation difference matrix across CoLA, MNLI, and STS-B (Gemma-3-12B).
Figure 9: Timeline of the geometric token intersection overlap between ordinal rating (1-5) and categorical classification models. Abstract judgment logic converges across tasks in the late-middle layers before splitting into formatting topologies at the terminal layer ($1.0$).
## Appendix E The Latent Evaluator as a Practical Judge
We close the gap between mechanism and practice by asking whether the LE's 1D causal direction can serve as a deployment-ready judge signal. For each instance in three benchmarks with continuous human ratings (STS-B, Yelp, and RewardBench) we extract four signals at the rating position and correlate each with the human label (Spearman $\rho$, $N\leq 500$): the prompted argmax (M1); the prob-weighted expected value $\text{EV}=\sum_{r} r\cdot P(r)$ (M2); a Girrbach et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib14))-style supervised ridge probe trained on the residual-stream activation (of dimension $d_{\text{model}}$, the model's hidden size) at the Boundless DAS Wu et al. ([2023](https://arxiv.org/html/2605.16023#bib.bib52)) layer, following Girrbach et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib14))'s reference-free rating setup (M3, 5-fold CV); and the zero-shot BDAS-1D (M4), the first dimension of the rotation $\mathbf{R}$ trained for the steering experiment (App. [D](https://arxiv.org/html/2605.16023#A4)) applied to the per-head activation (dimension $d_{\text{head}}$) at the same site. M4 never sees human labels: $\mathbf{R}$'s IIT target is the model's own clean rating, and we calibrate its sign per (model, task) cell against M2, mirroring the polarity multiplier $m$ used for the steering vector in Appendix [D](https://arxiv.org/html/2605.16023#A4).
Table 4: Spearman $\rho$ between four judgment signals and human labels: prompted argmax (M1), prob-weighted EV (M2), Girrbach-style supervised residual probe (M3), and zero-shot BDAS-1D (M4). Bold marks the per-row best. Methodology in Appendix [E](https://arxiv.org/html/2605.16023#A5).
Figure 10: Late-layer Task Formatter attractor geometry on Gemma-3-27B (Logit Lens; Appendix [I](https://arxiv.org/html/2605.16023#A9)). MNLI's 3-class formatter splits mass roughly evenly across three target tokens (max/min ratio $\approx 2.7$); STS-B's binary formatter concentrates mass on a single positive token (ratio $\approx 19$). The 1D LE injection has no unambiguous target in the 3-attractor basin, predicting the MNLI FTI flip-rate collapse.
Three regimes emerge from Table [4](https://arxiv.org/html/2605.16023#A5.T4). (i) On STS-B, the supervised residual probe wins on every model and BDAS-1D tracks it within a few percentage points without supervised training (mean $\Delta=4.7$ pp), beating argmax in all five models. (ii) On RewardBench, BDAS-1D matches or exceeds the supervised probe in 3 of 5 models (Gemma-3-12B/27B and Qwen2.5-7B, each $\Delta\approx +5$ pp): with $N=200$ giving $\sim 40$ training examples per fold against $d_{\text{model}}\approx 4$k features, the supervised probe overfits while the zero-shot 1D direction does not. (iii) On Yelp, where the prompted 1–5 rating tokens are scale-aligned to the human label, prob-weighted EV is the strongest signal, a regime in which there is little headroom for latent extraction; even there, BDAS-1D tracks prob-EV within $5$–$13$ pp on every model. The LE's 1D causal direction therefore encodes a zero-shot judgment scalar that recovers the supervised probe's signal on most cells and exceeds it on small-$N$ preference data, complementing Girrbach et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib14)) by mechanistically identifying the subspace from which their probe extracts predictive signal.
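The comparison in Table 4 rests on Spearman's $\rho$, which is the Pearson correlation of tie-averaged ranks. The following stdlib-only sketch shows the computation and why a coarse argmax signal (M1) can correlate less well with human labels than a continuous EV signal (M2); the four data points are invented for illustration and are not results from the paper.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

human = [1, 2, 3, 4]                 # hypothetical human labels
argmax_rating = [1, 1, 5, 5]         # M1-style signal: ties discard ordering
expected_value = [1.2, 1.9, 4.1, 4.8]  # M2-style signal: preserves ordering

rho_argmax = spearman(argmax_rating, human)
rho_ev = spearman(expected_value, human)
```

In this toy case the argmax signal collapses distinct instances onto the same rating token, so its rank correlation is lower than that of the expected-value signal even though both agree on the coarse ordering.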
## Appendix F Prompt Design
This section provides the exact string templates used to query the evaluated models\. We first detail the prompts designed to elicit numerical judgments on a 1\-5 rating scale, followed by the alternative categorical templates used for our classification control tasks\. The Proposed Label line converts open\-ended judgment into label\-conditioned verification, which our minimal\-pair construction \(§[2](https://arxiv.org/html/2605.16023#S2.SS0.SSS0.Px3)\) requires: the clean prompt has the correct label, the corrupted prompt has the wrong one, and PEAP attribution scores the differential\.
1. CoLA: Sentence: {sentence} Proposed Label: {label}. On a scale of 1 to 5, how grammatically acceptable is this sentence? Rating:
2. MultiNLI: Premise: {premise} Hypothesis: {hypothesis} Proposed Label: {label}. On a scale of 1 to 5, how logically consistent is the hypothesis? Rating:
3. STS-B: Sentence 1: {s1} Sentence 2: {s2} Proposed Label: {label}. On a scale of 1 to 5, how semantically similar are these sentences? Rating:
4. RewardBench: User Prompt: {prompt} Response: {response} On a scale of 1 to 5, how helpful and aligned is this response? Rating:
5. Yelp: Review: {review} On a scale of 1 to 5, how positive is this review? Rating:
Classification Control Tasks:
1. CoLA_CLASS: … Is this sentence grammatically acceptable? Answer:
2. MNLI_CLASS: … The relationship is:
3. STS-B_CLASS: … Are these sentences semantically similar? Answer:
4. RewardBench_CLASS: … Is this response helpful and aligned? Answer:
5. Yelp_CLASS: … Is this review positive? Answer:
The selection spans meaningfully different label structures – binary \(CoLA\), three\-class \(MNLI\), ordinal \(STS\-B,Yelp\), and pairwise preference \(RewardBench\) – and this heterogeneity is essential to the cross\-task overlap claim in §[3\.2](https://arxiv.org/html/2605.16023#S3.SS2): a shared computational trunk that recurs across distinct label spaces is stronger evidence of generalized infrastructure than overlap on uniformly\-formatted tasks\.
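As an illustration of how the label-conditioned rating templates yield a clean/corrupted pair, the sketch below fills three of the templates listed above with a correct and a wrong proposed label. The exact whitespace and line breaks of the original templates are not recoverable from this text, so single spaces are assumed; `build_minimal_pair` and the label strings are hypothetical.

```python
# Rating templates copied from the list above (single-space separators assumed).
RATING_TEMPLATES = {
    "CoLA": ("Sentence: {sentence} Proposed Label: {label}. "
             "On a scale of 1 to 5, how grammatically acceptable is this sentence? Rating:"),
    "MultiNLI": ("Premise: {premise} Hypothesis: {hypothesis} Proposed Label: {label}. "
                 "On a scale of 1 to 5, how logically consistent is the hypothesis? Rating:"),
    "STS-B": ("Sentence 1: {s1} Sentence 2: {s2} Proposed Label: {label}. "
              "On a scale of 1 to 5, how semantically similar are these sentences? Rating:"),
}

def build_minimal_pair(task, correct_label, wrong_label, **fields):
    """Return (clean, corrupted) prompts that differ only in the proposed label."""
    template = RATING_TEMPLATES[task]
    clean = template.format(label=correct_label, **fields)
    corrupted = template.format(label=wrong_label, **fields)
    return clean, corrupted

clean, corrupted = build_minimal_pair(
    "CoLA", "acceptable", "unacceptable",  # hypothetical label strings
    sentence="The cat sat on the mat.",
)
```

In practice the two label strings would also need to tokenize to the same length, since the minimal-pair construction requires length-matched clean and corrupted prompts.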
## Appendix G Minimal Pairs and Sequence Alignment
Causal tracing requires a clean and a corrupted run. For each dataset, we construct contrastive minimal pairs by sampling instances with opposite ground-truth labels (e.g., a fluent sentence vs. a grammatically flawed sentence). To ensure mathematical parity during the element-wise gradient computations of PEAP, the clean and corrupted prompts within a pair are strictly constrained to tokenize to the exact same length.
However, sequence lengths vary widely between different pairs in the dataset. To successfully aggregate edge scores across the entire dataset to find the generalized macro-circuit, we apply right-aligned sequence padding using negative indices. By indexing from the end of the sequence, the evaluation token (e.g., Rating:) is strictly anchored at position $-1$ for all inputs, allowing the causal graphs to superimpose regardless of the premise length.
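The right-aligned indexing can be sketched as follows. The edge representation (a tuple of sender, source position, receiver, destination position) and the choice to sum scores across pairs are assumptions made for illustration; the text does not specify the aggregation operator.

```python
from collections import defaultdict

def right_align(position, seq_len):
    """Map a 0-based position to a negative index counted from the sequence end."""
    return position - seq_len  # the last token maps to -1

def aggregate_edge_scores(pairs):
    """Sum per-pair edge scores after re-indexing positions from the end.

    pairs: iterable of (seq_len, {(sender, src_pos, receiver, dst_pos): score}).
    """
    totals = defaultdict(float)
    for seq_len, edge_scores in pairs:
        for (sender, src_pos, receiver, dst_pos), score in edge_scores.items():
            key = (sender, right_align(src_pos, seq_len),
                   receiver, right_align(dst_pos, seq_len))
            totals[key] += score
    return dict(totals)

# Two pairs of different lengths whose final-token edges should superimpose.
pairs = [
    (5, {("attn.10.h2", 3, "mlp.20", 4): 0.6}),
    (8, {("attn.10.h2", 6, "mlp.20", 7): 0.4}),
]
macro = aggregate_edge_scores(pairs)
```

Both toy edges end at the final token, so after re-indexing they share the key with destination position `-1` and their scores accumulate, even though the prompts have lengths 5 and 8.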
#### Per\-task selection rules\.
Minimal pairs are constructed automatically from labeled splits, so no human annotation step is involved and inter\-annotator agreement does not apply\. Per task:
- CoLA: acceptable vs. unacceptable sentences from the labeled splits, with token-length matching.
- MNLI: pairs are drawn from {entailment, contradiction}; neutral instances are excluded so clean and corrupted prompts have semantically opposed ground truth.
- STS-B: continuous similarity score $\geq 4$ vs. $\leq 2$ on the 1–5 scale.
- RewardBench: native chosen/rejected preference pairs from Lambert et al. ([2025](https://arxiv.org/html/2605.16023#bib.bib23)).
- Yelp: 5-star vs. 1-star reviews; intermediate stars excluded.
After per\-task filtering and the token\-length\-matching constraint, the resulting yield is\|S\|=145\|S\|=145\(CoLA\),≤500\\leq 500\(MNLI; we cap at500500\),189189\(STS\-B\),150150–200200\(RewardBench\), and145145–200200\(Yelp\)\.
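The selection rules and the token-length constraint can be sketched as a simple bucketing step. This is an illustrative sketch under stated assumptions: `build_minimal_pairs` is a hypothetical helper, labels are assumed to be binarized to 0/1 by the per-task rules above, and `str.split` stands in for the model tokenizer.

```python
from collections import defaultdict


def build_minimal_pairs(examples, tokenize, cap=None):
    """Pair opposite-label examples whose prompts tokenize to equal length.

    examples: iterable of (text, label) with label in {0, 1}.
    tokenize: callable text -> list of tokens (stand-in for a real tokenizer).
    cap:      optional maximum number of pairs to keep.
    """
    by_len = defaultdict(lambda: {0: [], 1: []})
    for text, label in examples:
        by_len[len(tokenize(text))][label].append(text)

    pairs = []
    for length in sorted(by_len):
        bucket = by_len[length]
        # zip stops at the shorter list, so each example is used at most once.
        pairs.extend(zip(bucket[1], bucket[0]))
    return pairs[:cap] if cap is not None else pairs


examples = [
    ("the cat sat", 1),
    ("cat the sat", 0),
    ("a dog barked loudly", 1),  # no length-4 counterpart, so it is dropped
    ("dog", 0),                  # no length-1 counterpart, so it is dropped
]
pairs = build_minimal_pairs(examples, tokenize=str.split)
```

Dropping unmatched examples is one plausible reason the yield differs across tasks after filtering.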
#### Backward\-pass tracing budget\.
Dense backward\-pass tracing has quadratic attention overhead in sequence length and is the binding cost for end\-to\-end attribution at the architectures we consider: natively mapping the CoLA judgment computational graph in Gemma\-3\-12B requires evaluating approximately 1\.46 million candidate edges, and Gemma\-3\-27B incorporates roughly 50,000 components\. The minimal\-pair caps above are chosen so that one forward–backward sweep per pair completes within memory constraints across all five models \(see Limitations\)\.
## Appendix H Ablation Study
To evaluate the functional importance of the causally identified circuit components at the strictest level, we perform a resampling ablation study within the Latent Evaluator\. We rank the circuit’s edges by attribution score and ablate them cumulatively in that order, replacing each edge’s activation with its value from the corrupted input\. We measure the EV drop and the judgment accuracy immediately after each ablation step\.
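The cumulative ablation loop can be illustrated on a toy model. This is only a sketch: it assumes a purely additive metric in which each edge contributes independently, which real transformer circuits do not satisfy, and the function name and edge labels are hypothetical.

```python
def cumulative_resample_ablation(clean, corrupted, attribution):
    """Ablate edges in descending attribution order and track a toy metric.

    clean / corrupted: dict edge -> contribution under each input (toy values).
    attribution:       dict edge -> attribution score used only for ranking.
    Returns the ranking and the metric before ablation and after each step.
    """
    ranked = sorted(attribution, key=attribution.get, reverse=True)
    active = dict(clean)
    curve = [sum(active.values())]      # metric with the intact circuit
    for edge in ranked:
        active[edge] = corrupted[edge]  # resample: swap in the corrupted value
        curve.append(sum(active.values()))
    return ranked, curve


clean = {"a": 2.0, "b": 1.0, "c": 0.5}
corrupted = {"a": -1.0, "b": 0.5, "c": 0.5}
attribution = {"a": 3.0, "b": 0.5, "c": 0.0}
ranked, curve = cumulative_resample_ablation(clean, corrupted, attribution)
```

In this toy case most of the drop occurs at the first step, which is the kind of single-edge fragility the ablation curves are meant to expose.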
Circuit robustness varies substantially across structural tasks: STS\-B classification exhibits the highest robustness, while MNLI judgment is extremely fragile, with accuracy typically dropping significantly after ablating only the single top\-ranked edge\. Additionally, model scale appears to strongly influence robustness, with smaller models \(e\.g\., Qwen2\.5\-7B\) exhibiting notably less robust judge circuits than larger models \(e\.g\., Gemma\-3\-27B\)\. All evaluation tasks demonstrate characteristic semantic phase transitions, where accuracy remains relatively stable until a critical edge ablation threshold, beyond which performance collapses completely\. Crucially, classification subtasks consistently exhibit much greater robustness than their numerical judgment counterparts, highlighting computationally redundant processing pathways in classification routers, whereas judgment circuits compress into highly concentrated bottleneck heads\.
Figures [11](https://arxiv.org/html/2605.16023#A8.F11), [12](https://arxiv.org/html/2605.16023#A8.F12), [13](https://arxiv.org/html/2605.16023#A8.F13), and [14](https://arxiv.org/html/2605.16023#A8.F14) illustrate the ablation study results, showing the effect on downstream task performance on Gemma\-3\-12B, Gemma\-3\-27B, Qwen2\.5\-7B, and Qwen2\.5\-14B, respectively\.
Figure 11: Ablation phase\-transition study \(Gemma\-3\-12B\)\. Panels: \(a\) classification ablation and \(b\) numerical judgment ablation on CoLA; \(c\) and \(d\) the same on MNLI; \(e\) and \(f\) the same on STS\-B\.
Figure 12: Ablation phase\-transition study \(Gemma\-3\-27B\)\. Panels \(a\)–\(f\) follow the same layout: classification and numerical judgment ablation on CoLA, MNLI, and STS\-B\.
Figure 13: Ablation phase\-transition study \(Qwen2\.5\-7B\-Instruct\)\. Panels \(a\)–\(f\) follow the same layout: classification and numerical judgment ablation on CoLA, MNLI, and STS\-B\.
Figure 14: Ablation phase\-transition study \(Qwen2\.5\-14B\-Instruct\)\. Panels \(a\)–\(f\) follow the same layout: classification and numerical judgment ablation on CoLA, MNLI, and STS\-B\.
## Appendix I Structural Validation via Logit Lens
While automated circuit discovery provides scalable methodologies for identifying active subgraph components, it heavily relies on dataset activations and external language models for semantic explanations Golimblevskaia et al\. \([2026](https://arxiv.org/html/2605.16023#bib.bib15)\)\. To validate our findings, we employ Logit Lens \([https://www\.lesswrong\.com/posts/AcKRB8wDpdaN6v6ru](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru)\): projecting the contrastive steering representations directly into the vocabulary space using the model’s unembedding weights \($W_U$\)\. This resolves the explicit semantic composition of the nodes without depending on black\-box auto\-interpretability\.
To validate structural consistency across architectures without arbitrary discrete thresholding, we compute cosine similarity projections to map the contrastive evaluator vectors directly back into the unembedding matrices\. Normalizing by the magnitude vectors neutralizes untrained tokenizer noise and reveals geometric alignment across the nodes\. Figure [9](https://arxiv.org/html/2605.16023#A4.F9) graphically quantifies this geometric distribution connecting the topological outputs of ordinal \(Rating\) and categorical \(Classification\) evaluators across network progression\.
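A cosine-normalized vocabulary projection of this kind can be sketched in a few lines. The snippet below is an assumption-laden illustration: `cosine_logit_lens` is a hypothetical helper, the unembedding matrix is a tiny hand-written list with one row per token, and the three-token vocabulary is made up for the example.

```python
import math


def cosine_logit_lens(vector, unembed, vocab, top_k=3):
    """Rank tokens by cosine similarity between `vector` and each token's
    unembedding direction (one row of `unembed` per vocabulary entry)."""
    v_norm = math.sqrt(sum(x * x for x in vector))
    scored = []
    for token, row in zip(vocab, unembed):
        r_norm = math.sqrt(sum(x * x for x in row))
        if v_norm == 0.0 or r_norm == 0.0:
            continue  # skip degenerate (zero-norm) directions
        dot = sum(a * b for a, b in zip(vector, row))
        scored.append((dot / (v_norm * r_norm), token))
    scored.sort(reverse=True)
    return [(tok, sim) for sim, tok in scored[:top_k]]


vocab = ["verify", "five", "false"]               # toy vocabulary
unembed = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]   # toy unembedding rows
top = cosine_logit_lens([2.0, 0.5], unembed, vocab, top_k=2)
```

Dividing by both norms means a token with a large unembedding norm (here `five`) gains no advantage from magnitude alone, which is the stated motivation for the normalization.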
The topological projections trace a shared geometry in the late\-middle topological bucket \(depth $0.50$–$0.85$\)\. Notably, MNLI shows weaker convergence with other tasks in this shared evaluation window, consistent with its three\-way classification structure \(entailment/neutral/contradiction\) requiring a richer internal representation than a simple positive/negative judgment scalar\. This suggests that the Latent Evaluator’s 1D judgment abstraction generalizes most cleanly to binary or ordinal tasks, while multi\-class judgment tasks partially escape the shared trunk\. By applying a strict cross\-architecture intersection to discard tokenizer\-specific artifacts, we isolated a generalized abstract logic continuum utilized across all evaluations\. High\-probability masses cleanly define evaluator reasoning nodes without heuristics via cross\-architectural tokens: \{*confirm*, *verify*, *validate*, *identical*, *perfectly*\}\.
Just before the terminal Output Formatting boundary \(Layer Depth $1.0$\), however, the shared semantic coherence completely collapses across networks\. Visualizing the bifurcating projections directly, the ordinal rating tasks explicitly route their probability trajectories toward discrete syntactic intervals \(e\.g\., *five*, *5*, *1*\), abandoning the abstraction layer entirely\. Categorical models simultaneously polarize entirely into categorical literals \(e\.g\., *false*, *true*, *contradiction*\)\. This supports our Task Formatter hypothesis: the Latent Evaluator calculates generalized continuous judgment magnitudes uniformly within the deeper block sequences before task\-specific routers discretely overwrite that geometry strictly for terminal language formatting\. We emphasize that Logit Lens provides a correlational readout rather than a causal intervention: the tokens recovered through vocabulary projection represent directions that are linearly decodable from intermediate representations, which need not coincide with the representations the model actually uses for downstream computation Yom Din et al\. \([2024](https://arxiv.org/html/2605.16023#bib.bib53)\)\. We therefore treat these projections as supporting evidence that corroborates – but does not independently prove – the causal findings from PEAP and Boundless DAS\.
## Appendix J Sparse Autoencoder Feature Analysis
As an independent check on the PEAP\-based circuit decomposition, we apply Sparse Autoencoders \(SAEs\) to Gemma\-3\-12B’s residual\-stream and attention\-head activations over the CoLA and STS\-B minimal pairs, using the Gemma\-Scope\-2 canonical SAE release \(gemma\-scope\-2\-res\-65k\-l0\-small; coverage limited to layers $\{12, 24, 31, 41\}$\)\. The SAE analysis operates in two modes: \(i\) at the residual\-stream position immediately before the rating token, we decode the top SAE features activated across all prompts and report their aggregate activation; \(ii\) at each attention head already identified by PEAP, we classify the head into one of three roles based on whether its $V \to Z$ edges appear in $\mathcal{C}_{\text{rate}} \setminus \mathcal{C}_{\text{class}}$, $\mathcal{C}_{\text{class}} \setminus \mathcal{C}_{\text{rate}}$, or the intersection $\mathcal{C}_{\text{rate}} \cap \mathcal{C}_{\text{class}}$\. These correspond respectively to rating formatters, class formatters, and shared evaluators – the same decomposition used in §[4\.1](https://arxiv.org/html/2605.16023#S4.SS1)\.
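The three-way role assignment reduces to set operations on the two circuits. The sketch below is illustrative only: `assign_head_roles` is a hypothetical helper, and apart from L45H3 (named later in this appendix) the head labels are invented for the example.

```python
def assign_head_roles(rate_heads, class_heads):
    """Partition heads by membership in the rating and classification circuits."""
    rate, cls = set(rate_heads), set(class_heads)
    return {
        "rating_formatter": rate - cls,  # C_rate \ C_class
        "class_formatter": cls - rate,   # C_class \ C_rate
        "shared_evaluator": rate & cls,  # C_rate ∩ C_class
    }


# L40H1 and L42H2 are made-up labels used purely for illustration.
roles = assign_head_roles(
    rate_heads={"L45H3", "L40H1"},
    class_heads={"L45H3", "L42H2"},
)
```

By construction the three role sets are disjoint and together cover every head that appears in either circuit.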
#### Attention\-head role assignment\.
On the CoLA circuit \(17 heads analyzed\), SAE attribution labels 3 heads as shared evaluators \(L45H3, L46H12, L47H7\), 9 as rating formatters, and 5 as class formatters – a clean two\-way partition modulo the small evaluator core\. The same three heads \(L45H3, L46H12, L47H7\) emerge as shared evaluators on STS\-B \(from 25 analyzed heads\), where they are joined by two additional shared heads \(L25H8, L44H8\) that did not surface in the CoLA top\-$k$\. L45H3 carries the highest shared\-evaluator weight on both tasks \(normalized circuit weight 0\.24 on CoLA, 0\.13 on STS\-B\), which is precisely the attention head at which we train Boundless DAS rotations in Appendix [D](https://arxiv.org/html/2605.16023#A4)\. This alignment – that an independently computed SAE role\-labeling identifies the same head as the central shared evaluator that our BDAS training selected on causal\-intervention grounds – is non\-trivial confirmation that the Latent Evaluator / Task Formatter decomposition is a genuine architectural structure rather than an artifact of either method alone\.
#### MLP residual\-stream features\.
At the rating\-token position \(relative offset $-2$\) in L24M, the top\-ranked SAE feature \(ID 617\) activates on 148/148 CoLA prompts with mean activation 1872, followed by features 1210, 8229, 1686, and 402 \(mean activations in the range 1072–1531, each activating on all 148 prompts\)\. Reconstruction quality at this position is high \(cosine similarity 0\.999, relative $L^2$ error 0\.053\), confirming the SAE faithfully reconstructs the rating\-position activations\. We do not attempt semantic interpretation of individual features here because it would require evidence beyond the activation statistics, but the consistency with which the top features fire across all evaluation prompts supports the interpretation that the rating\-token position aggregates a stable set of evaluation features rather than a prompt\-specific representation\.
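The two reconstruction-quality numbers can be computed as below. This is a sketch under an assumption: relative $L^2$ error is taken to be $\lVert x - \hat{x} \rVert_2 / \lVert x \rVert_2$, which is a common convention but is not spelled out in the text (some reports use the squared ratio instead). The function name is hypothetical.

```python
import math


def reconstruction_quality(x, x_hat):
    """Return (cosine similarity, relative L2 error) between an activation
    vector `x` and its SAE reconstruction `x_hat`."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_hat = math.sqrt(sum(b * b for b in x_hat))
    residual = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
    cosine = dot / (norm_x * norm_hat)
    rel_l2 = residual / norm_x  # assumed definition: ||x - x_hat|| / ||x||
    return cosine, rel_l2


cosine, rel_l2 = reconstruction_quality([3.0, 4.0], [3.0, 3.0])
```

The two metrics are complementary: cosine similarity ignores scale, while the relative error also penalizes a reconstruction that points the right way but has the wrong magnitude.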
#### Scope\.
The SAE analysis covers CoLA and STS\-B on Gemma\-3\-12B\. The Gemma\-Scope\-2 canonical SAE release covers only four MLP layers on Gemma\-3\-12B, which limits MLP\-level decomposition to L24M within the circuit\. Neither limitation affects the head\-level findings above, which use attention\-head attribution directly rather than an SAE over MLP outputs\. Public SAE releases for Gemma\-3 outside of Gemma\-Scope\-2 are limited; a broader multi\-model multi\-layer SAE decomposition of the Latent Evaluator is out of scope for this submission\.
## Appendix K Global Judge Circuit Topology
To complement the structural\-overlap and faithfulness summaries in the main body, we visualize the full PEAP\-discovered judge circuit across multiple \(model, task\) pairs \(Figures [15](https://arxiv.org/html/2605.16023#A11.F15)–[18](https://arxiv.org/html/2605.16023#A11.F18)\)\. Nodes are laid out by \(token position, layer\) so that the spatial separation of the Latent Evaluator and the rating\-specific Task Formatter is directly visible\. We pair the canonical MNLI on Gemma\-3\-27B example with three additional circuits – CoLA on the same model, STS\-B on Gemma\-3\-12B, and MNLI on Qwen2\.5\-14B – to illustrate that the two\-stage topology is conserved across both task semantics and model family\. The remaining \(RewardBench, Yelp\) circuits and the unshown model variants are available in the code release and exhibit the same pattern\.
Across all four panels, the Latent Evaluator sub\-circuit corresponds to the green content\-token MLP cluster in the middle layers distributed across multiple token positions, while the rating\-specific Task Formatter corresponds to a concentrated salmon column of late\-layer attention heads at the rating token position\. Node coloring encodes token\-role context rather than circuit membership: green nodes sit on content tokens \(premise/hypothesis or sentence spans\), blue nodes on instruction/scale tokens \(“scale”, “how”, “Sentence”\), and salmon nodes on the terminal rating target tokens; edge color encodes PEAP attribution polarity \(blue = positive, crimson = negative\)\. Two qualitative observations motivate the two\-stage decomposition used in the main body: \(i\) Latent Evaluator edges form at earlier token positions and earlier layers than rating\-specific edges, which concentrate in the deepest layers at the rating token position; and \(ii\) the rating sub\-circuit is sparse and column\-like relative to the spatially distributed Latent Evaluator, consistent with the formatter acting as a terminal decoding stage rather than a distributed computation\. Tokens not in the top\-$k$ are rendered as \[VAR\] placeholders in the prompt template footer to avoid privileging any one instance\.
Figure 15: Global judge circuit for MNLI on Gemma\-3\-27B\.
Figure 16: Global judge circuit for CoLA on Gemma\-3\-27B\.
Figure 17: Global judge circuit for STS\-B on Gemma\-3\-12B\.
Figure 18: Global judge circuit for MNLI on Qwen2\.5\-14B\.
## Appendix L Split\-Half Circuit Reliability
A recurring concern with circuit\-level interpretability is whether circuits discovered on modest sample sizes reflect genuine causal structure or idiosyncratic features of the specific instances traced\. We address this by measuring within\-task split\-half reliability: for each \(model, task\) we partition the available minimal pairs into two disjoint halves, aggregate PEAP scores independently on each half, and compute Jaccard IoU between the resulting top\-$k$ circuits\. We repeat this 10 times with different random partitions and report mean $\pm$ standard deviation\. IoU is computed on structural $(\text{sender}, \text{receiver})$ pairs using the same convention as §[3\.2](https://arxiv.org/html/2605.16023#S3.SS2) \(i\.e\., position\-specific edges are ranked first and then collapsed to structural pairs before computing IoU; early reading layers are excluded\)\. We additionally report a random\-subset baseline drawn independently from the same observed edge universe\.
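The split-half procedure can be sketched as follows. This is a simplified illustration, not the released code: summation is an assumed aggregation rule, ranking uses absolute score as an assumption, the early-layer exclusion is omitted, and all function names and edge labels are hypothetical.

```python
import random


def aggregate(per_pair):
    """Sum per-pair edge scores (assumed aggregation rule)."""
    total = {}
    for scores in per_pair:
        for edge, s in scores.items():
            total[edge] = total.get(edge, 0.0) + s
    return total


def top_k_structural(scores, k):
    """Rank position-specific edges (sender, receiver, position), keep the
    top k, then collapse them to structural (sender, receiver) pairs."""
    ranked = sorted(scores, key=lambda e: abs(scores[e]), reverse=True)[:k]
    return {(sender, receiver) for sender, receiver, _pos in ranked}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_half_iou(per_pair_scores, k, n_repeats=10, seed=0):
    """Mean and standard deviation of top-k IoU over random half splits."""
    rng = random.Random(seed)
    ious = []
    for _ in range(n_repeats):
        idx = list(range(len(per_pair_scores)))
        rng.shuffle(idx)
        half = len(idx) // 2
        a = aggregate(per_pair_scores[i] for i in idx[:half])
        b = aggregate(per_pair_scores[i] for i in idx[half:])
        ious.append(jaccard(top_k_structural(a, k), top_k_structural(b, k)))
    mean = sum(ious) / len(ious)
    std = (sum((x - mean) ** 2 for x in ious) / len(ious)) ** 0.5
    return mean, std
```

Because edges are collapsed to structural pairs after ranking, the resulting set can contain fewer than $k$ elements when several top-ranked edges share the same sender and receiver at different positions.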
Because the available pair counts per \(model, task\) vary \($N \in \{145, \dots, 500\}$\), a naive comparison across cells would confound the reliability signal with statistical power\. To isolate structural stability from sample\-size effects, we cap each task at the minimum N available across the original four models before splitting \(CoLA: $N=145$, MNLI: 186, STS\-B: 189, RewardBench: 150, Yelp: 145\)\. Table [5](https://arxiv.org/html/2605.16023#A12.T5) reports the resulting headline Edge IoU at $k=100$\. Random\-subset Edge IoU at $k=100$ ranges between 0\.5% and 6\.8% across conditions, so all reported reliability values are at least several times above chance\.
Table 5: Split\-half Edge IoU \(%\) at top\-100, mean $\pm$ standard deviation over 10 random partitions\. All cells are evaluated at the same N per task \(smallest N available across models\), so comparisons are not confounded by sample size\. Chance baseline is $<7\%$ across all cells\.
Four observations are worth emphasizing\. First, Qwen split\-half reliability is uniformly high across both structured NLU and open\-ended judgment tasks, matching the architectural\-modularity pattern already visible in Tab\. [1](https://arxiv.org/html/2605.16023#S3.T1): wherever a model exhibits functional modularity, its extracted circuits are also reliable\.
Second, Gemma\-3\-27B yields lower split\-half Edge IoU on MNLI and STS\-B than Gemma\-3\-12B does, even at matched N\. We do not read this as instability\. Rather, it is consistent with a scale\-dependent redundancy effect: once the Latent Evaluator is cleanly modular \(Table [1](https://arxiv.org/html/2605.16023#S3.T1)\), the model can route judgment through multiple computationally equivalent sub\-pathways, and different data halves select different\-but\-equivalent subsets of the top\-$k$ edges\. The underlying Node IoU remains high on Gemma\-3\-27B \(66\.5% on STS\-B, 65\.4% on CoLA at $k=100$\), indicating that the same set of components is recruited – just at different attribution ranks within the top 100\.
Third, at matched N, Gemma\-3\-12B’s Yelp reliability drops substantially \(Edge IoU 22\.4% vs\. the 46\.5% we observe at its native $N=500$\)\. This sample\-size sensitivity is itself informative: on open\-ended tasks, reliable PEAP attribution on Gemma\-3\-12B requires significantly more data than the structured NLU circuits demand\. Qwen\-14B, by contrast, maintains strong Yelp reliability \(77\.5%\) at $N=145$, which matches Qwen’s earlier\-emergence\-of\-modularity pattern\.
Fourth, the split\-half numbers combined with the median MIB faithfulness results \(Appendix [C](https://arxiv.org/html/2605.16023#A3)\) yield a cleaner picture than the previous\-draft interpretation\. Among the four models with full open\-ended split\-half coverage, RewardBench and Yelp are above chance on every model, and on three \(both Qwens and Gemma\-3\-27B\) the same sparse top\-$k$ edge budget that suffices for structured NLU is sufficient to recover open\-ended judgment behavior\. Only Gemma\-3\-12B exhibits reliable\-but\-unfaithful open\-ended circuits \(stable split\-half IoU but MIB faithfulness near $0$\), consistent with its entangled zero\-ablation profile in Table [1](https://arxiv.org/html/2605.16023#S3.T1)\. The original concern that open\-ended judgment requires a denser circuit than structured NLU is therefore more accurately characterized as a Gemma\-3\-12B\-specific entanglement effect rather than a property of open\-ended evaluation per se\.
We also note that split\-half Edge IoU should not be read as a quality metric for the circuit itself, only as a diagnostic for attribution stability\. Where a model’s true circuit is distributed across many partially redundant paths \(as we suspect is the case for Gemma\-3\-27B on MNLI/STS\-B\), a strict top\-$k$ edge comparison understates the underlying structural agreement\. Full per\-$k$ curves and raw native\-N numbers are reported in the companion CSVs in the supplementary release\.
## Appendix M Pooled\-Directional Faithfulness
As a sensitivity analysis on the per\-instance MIB metric used in the main body \(Appendix [C](https://arxiv.org/html/2605.16023#A3)\), we additionally report a magnitude\-weighted directional formulation:
$$\text{Faith}_{\text{pool}}(k)=\frac{\sum_{i=1}^{N} m_i \cdot \left(\text{EV}^{(i)}(\mathcal{C}_k)-\text{EV}^{(i)}_{\text{corr}}\right)}{\sum_{i=1}^{N}\left|\text{EV}^{(i)}_{\text{clean}}-\text{EV}^{(i)}_{\text{corr}}\right|},$$
with $m_i \in \{-1,+1\}$ the per\-pair polarity sign\. This pooled formulation has a single aggregate denominator, which causes pairs with large $|\text{EV}_{\text{clean}}-\text{EV}_{\text{corr}}|$ to dominate the recovery score and can yield artifacts such as non\-monotonic curves and implausibly high recovery at very small $k$\. On Gemma\-3\-12B the pooled curve peaks at 1\.10 on MNLI at $k=5$ \(recovering 110% of the gap by patching so few edges is an aggregation artifact, not a genuine mechanistic claim\) and similarly overshoots at intermediate $k$ on CoLA and STS\-B, before drifting downward at $k=200$\. The per\-instance MIB metric removes these artifacts by construction, which is the reason we adopt it as our primary metric\. The two metrics agree on the qualitative structured\-NLU vs\. open\-ended\-task distinction: both saturate near 1\.0 on CoLA, MNLI, and STS\-B for Gemma\-3\-12B, and both remain below 0\.5 across the full $k$ range for RewardBench and Yelp\.
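The pooled formula can be evaluated directly, which also makes the dominance effect easy to see. The sketch below assumes the polarity sign is $m_i = \mathrm{sgn}(\text{EV}^{(i)}_{\text{clean}} - \text{EV}^{(i)}_{\text{corr}})$, with ties at zero mapped to $+1$ as an arbitrary choice for this example; the function name and the toy EV values are hypothetical.

```python
def pooled_faithfulness(ev_circuit, ev_clean, ev_corr):
    """Magnitude-weighted directional faithfulness with one pooled denominator.

    ev_circuit[i]: EV when only the top-k circuit is kept, for pair i.
    ev_clean[i], ev_corr[i]: EV on the clean and corrupted runs of pair i.
    """
    numerator = 0.0
    denominator = 0.0
    for circ, clean, corr in zip(ev_circuit, ev_clean, ev_corr):
        gap = clean - corr
        m = 1.0 if gap >= 0 else -1.0   # per-pair polarity sign
        numerator += m * (circ - corr)
        denominator += abs(gap)
    return numerator / denominator


# Pair 1 (small gap) recovers nothing; pair 2 (large gap) overshoots slightly.
score = pooled_faithfulness(
    ev_circuit=[0.0, 12.0],
    ev_clean=[1.0, 10.0],
    ev_corr=[0.0, 0.0],
)
```

Here the score is 12/11, above 1, even though one of the two pairs shows no recovery at all: the large-gap pair dominates both numerator and denominator, which is the kind of artifact the text attributes to the pooled formulation.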
## Appendix N Cross\-Method Validation via LRPEAP
### N\.1 Methodology
LRPEAP retains PEAP’s position\-aware edge attribution \(Appendix [A](https://arxiv.org/html/2605.16023#A1)\) and per\-pair aggregation but replaces the autograd backward with an LRP\-rule backward, using the LN\-rule/Identity\-rule/Half\-rule combination of RelP \(Jafari et al\., [2025](https://arxiv.org/html/2605.16023#bib.bib21)\)\. All other PEAP machinery – candidate\-edge set, top\-$k$ capping, polarity correction $m = \mathrm{sgn}(\mathrm{EV}_{\text{clean}} - \mathrm{EV}_{\text{corr}})$ – is unchanged, so LRPEAP and PEAP are comparable under our top\-$k$ Jaccard IoU and faithfulness metrics\. LRPEAP is not equivalent to RelP itself: RelP’s candidate\-edge graph is component\-level $(n_1, n_2) \in E$, whereas LRPEAP injects RelP’s LRP\-coefficient backward into PEAP’s position\-aware formulation\. LRPEAP runs on the same minimal\-pair sets as the PEAP experiments \(§[2](https://arxiv.org/html/2605.16023#S2.SS0.SSS0.Px2)\); the permutation null for each \(model, task, $k$\) cell samples 500 random size\-$k$ edge subsets from each method’s edge pool and reports the $p_{99}$ Jaccard IoU\.
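One way to realize the permutation null described above is sketched below. It assumes that each of the 500 draws pairs one random size-$k$ subset from each method's pool, and it uses the nearest-rank convention for the 99th percentile; both choices are assumptions for this sketch, and `permutation_null_p99` is a hypothetical name.

```python
import math
import random


def permutation_null_p99(pool_a, pool_b, k, n_samples=500, seed=0):
    """99th percentile of Jaccard IoU between random size-k edge subsets,
    one drawn from each method's edge pool per sample (requires k >= 1)."""
    rng = random.Random(seed)
    pool_a, pool_b = sorted(pool_a), sorted(pool_b)  # fixed order for reproducibility
    ious = []
    for _ in range(n_samples):
        a = set(rng.sample(pool_a, k))
        b = set(rng.sample(pool_b, k))
        ious.append(len(a & b) / len(a | b))
    ious.sort()
    # Nearest-rank percentile: one of several common conventions.
    rank = max(1, math.ceil(0.99 * len(ious)))
    return ious[rank - 1]


# Toy edge pools of integer edge ids; real pools would hold edge tuples.
null_p99 = permutation_null_p99(range(1000), range(1000), k=20)
```

An observed cross-method IoU can then be divided by this value to obtain an enrichment factor over chance.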
### N\.2 Results
Table 6: PEAP vs LRPEAP Jaccard IoU at $K=200$ \(edge / component\); null is the permutation $p_{99}$\.
Table [6](https://arxiv.org/html/2605.16023#A14.T6) reports per\-task PEAP $\leftrightarrow$ LRPEAP IoU at $K=200$: mean edge IoU is 0\.29 on Qwen2\.5\-7B and 0\.38 on Gemma\-3\-12B against null $p_{99}$ of 0\.022 and 0\.015, a $\sim 13$–$25\times$ enrichment at $K=200$, with $\geq 12\times$ enrichment at every $k \in \{5, \dots, 500\}$\. Cross\-method agreement is stronger on Gemma\-3\-12B than on Qwen2\.5\-7B; the single weak cell is Gemma\-3\-12B RewardBench \(edge IoU 0\.14\), consistent with that model’s entanglement on the same task \(Table [1](https://arxiv.org/html/2605.16023#S3.T1), Figure [2](https://arxiv.org/html/2605.16023#S3.F2)\)\.
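As a quick arithmetic check on the enrichment range quoted for $K=200$, dividing each mean edge IoU by its null $p_{99}$ gives, for Qwen2.5-7B and Gemma-3-12B respectively:

$$\frac{0.29}{0.022} \approx 13.2, \qquad \frac{0.38}{0.015} \approx 25.3,$$

which is consistent with the stated $\sim 13$–$25\times$ enrichment.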
Table 7: Cross\-method Latent Evaluator IoU at $K=200$ between PEAP’s $\mathcal{C}_{\text{LE}}$ and LRPEAP’s $\mathcal{C}_{\text{LE}}$, each computed as $\mathcal{C}_{\text{rate}} \cap \mathcal{C}_{\text{class}}$\.
Restricting to the Latent Evaluator \(Table [7](https://arxiv.org/html/2605.16023#A14.T7)\), the LE subgraph is recovered with 0\.28 edge / 0\.47 component IoU on average, peaking at 0\.61 on Gemma\-3\-12B MNLI\. On Gemma\-3\-12B CoLA $\times$ CoLA\_CLASS, LRPEAP’s LE at $K=200$ includes 31 distinct attention heads with $V \to Z$ edges; L45H3, L46H12, and L47H7 \(the three shared\-evaluator heads from Appendix [J](https://arxiv.org/html/2605.16023#A10)\) are all present\.
Figure 19: Cross\-task structural overlap on Gemma\-3\-12B: LRPEAP \(solid\) overlaid on PEAP \(dashed, faded\) at matched task\-pair color\. Node IoU agrees on the structurally easy pairs \(CoLA $\times$ MNLI, MNLI $\times$ STS\-B\); Edge IoU is consistently higher under LRPEAP, with both metrics diverging in LRPEAP’s favor on pairs involving the open\-ended RewardBench task\.
The cross\-task shared trunk of Finding 1 also reproduces under LRPEAP \(Figure [19](https://arxiv.org/html/2605.16023#A14.F19)\): Gemma\-3\-12B Node IoU at top\-200 is 61\.5% / 65\.0% / 65\.6% for CoLA $\times$ MNLI / MNLI $\times$ STS\-B / CoLA $\times$ RewardBench, matching or exceeding the PEAP numbers in §[3\.2](https://arxiv.org/html/2605.16023#S3.SS2) \(61\.0% / 62\.3% / 48\.8%\)\. Edge IoU is also uniformly higher under LRPEAP \(52–57% vs 16–42% across the six pairs\), suggesting LRP\-rule attribution produces more consistent edge rankings across semantically distinct tasks than autograd attribution\.
Figure 20: Layer\-pair attribution density of the top\-200 edges on Gemma\-3\-12B MNLI under PEAP \(left\) and LRPEAP \(right\)\. Both methods light up the same mid\-to\-late diagonal band, the LE region of §[4\.1](https://arxiv.org/html/2605.16023#S4.SS1)\. LRPEAP additionally suppresses early\-layer attribution that PEAP picks up, possibly reflecting LRP rules’ numerical\-stability advantage through LayerNorm\.
Figure [20](https://arxiv.org/html/2605.16023#A14.F20) confirms architectural agreement: under both methods the top\-200 MNLI edges on Gemma\-3\-12B concentrate in the same mid\-to\-late diagonal band \(layers $\sim$20–47\), exactly the LE region \(§[4\.1](https://arxiv.org/html/2605.16023#S4.SS1)\); the only visible difference is some early\-layer activity \(layers 3–15\) that PEAP picks up but LRPEAP suppresses\.
Figure 21: Sparse\-circuit faithfulness with PEAP \(blue\) and LRPEAP \(green\) on Qwen2\.5\-7B and Gemma\-3\-12B across the five rating tasks\.
Figure [21](https://arxiv.org/html/2605.16023#A14.F21) overlays PEAP and LRPEAP faithfulness curves on the same panel as Figure [2](https://arxiv.org/html/2605.16023#S3.F2); both backbones saturate at comparable $k$ on every cell where the PEAP circuit saturates\. The five cells where LRPEAP undershoots PEAP at $K=200$ \(Qwen2\.5\-7B MNLI\_CLASS / RewardBench; Gemma\-3\-12B CoLA\_CLASS / MNLI\_CLASS / STS\-B\_CLASS\) all peak at $K \leq 100$ \(e\.g\. 86%, 117%, 107% on the three cells that reach saturation\); the $K=200$ drop reflects sign\-inverted edges entering the LRP ranking far down the tail on tasks with asymmetric output spaces, where LRP\-rule relevance redistribution does not preserve the per\-pair sign that PEAP’s symmetric polarity correction \(§[3\.1](https://arxiv.org/html/2605.16023#S3.SS1)\) handles natively\.