Validating Causal Abstraction Metrics on Simulated Complex Systems
Summary
This paper introduces a benchmark of ten complex systems for validating causal abstraction metrics, evaluates over thirty candidate metrics, and proposes the Causal Abstraction Error (CAE) as a general-purpose validity metric that reliably discriminates valid from invalid explanations.
# Validating Causal Abstraction Metrics on Simulated Complex Systems
Source: [https://arxiv.org/html/2607.00267](https://arxiv.org/html/2607.00267)
Maxime Méloux¹, Tiago Pimentel², François Portet¹, Maxime Peyrard¹
¹Université Grenoble Alpes, CNRS, Grenoble INP, LIG; ²ETH Zürich
{melouxm, portetf, peyrardm}@univ-grenoble-alpes.fr, tiago.pimentel@inf.ethz.ch
###### Abstract
A central goal of science is to produce valid explanations of complex systems: high-level causal accounts that faithfully reflect the behavior of lower-level mechanisms. Yet no consensus exists on how to measure whether a proposed high-level explanation is actually valid. We introduce a benchmark of ten complex systems spanning both discrete and continuous state spaces, as well as static and dynamical regimes, each equipped with consensual ground-truth causal explanations and invalid contrastive conditions. Within a unified causal abstraction framework, we systematically evaluate over thirty candidate metrics drawn from observational, functional, information-theoretic, and causal families. Our results show that only the causal family reliably discriminates valid from invalid abstractions, and only when it incorporates faithfulness testing over unmapped variables. Building on these findings, we introduce the Causal Abstraction Error (CAE), a continuous validity metric with an explicit faithfulness test, which passes all discrimination tests across every system and whose estimate can converge with as few as 30 sampled interventions. We offer it as a general-purpose metric for the discovery and validation of high-level explanations.
## 1 Introduction
A central goal of science is to produce *explanations*: not merely descriptions or predictions, but accounts of *why* phenomena occur (Hempel, [1965](https://arxiv.org/html/2607.00267#bib.bib113); Woodward, [2004](https://arxiv.org/html/2607.00267#bib.bib153)). While the philosophy of science has long debated what explanations are, a working consensus has emerged around a few core desiderata: good scientific explanations should be causally informative (Woodward, [2004](https://arxiv.org/html/2607.00267#bib.bib153); Pearl, [2009](https://arxiv.org/html/2607.00267#bib.bib157); Salmon, [1984](https://arxiv.org/html/2607.00267#bib.bib154)), parsimonious (Kitcher, [1989](https://arxiv.org/html/2607.00267#bib.bib114); Batterman and Rice, [2014](https://arxiv.org/html/2607.00267#bib.bib155)), and appropriately scoped to the context and level of description at which the target phenomenon is best characterized (Lombrozo, [2006](https://arxiv.org/html/2607.00267#bib.bib145); Potochnik, [2017](https://arxiv.org/html/2607.00267#bib.bib143)). Finding good explanations is particularly difficult when the system under study exhibits what Warren Weaver famously called *organized complexity* (Weaver, [1991](https://arxiv.org/html/2607.00267#bib.bib144)): systems with many interacting parts that are neither so disordered as to yield to statistical averaging, nor so simple as to admit direct analytic treatment. The appropriate high-level variables for explanation must be discovered, and the class of functions that legitimately aggregate low-level quantities into those variables is itself a subject of inquiry (Hoel et al., [2013](https://arxiv.org/html/2607.00267#bib.bib110); Potochnik, [2017](https://arxiv.org/html/2607.00267#bib.bib143)).
This raises a fundamental challenge: what are useful high-level variables, and how should they be defined from lower-level quantities? Different fields studying different classes of complex systems have developed their own candidate answers, such as firing rates and population codes in systems neuroscience (Cunningham and Yu, [2014](https://arxiv.org/html/2607.00267#bib.bib109)), or species abundance and trophic levels in ecology (Loreau, [2010](https://arxiv.org/html/2607.00267#bib.bib108)). However, a unified cross-disciplinary methodology for evaluating high-level explanations of complex systems remains elusive. A sobering illustration of why comes from Jonas and Kording ([2017](https://arxiv.org/html/2607.00267#bib.bib178)), who applied the standard causal and statistical toolkit of neuroscience to a fully observable microprocessor: a relatively simple system engineered for modular, hierarchical organization. Even so, they failed to recover meaningful high-level properties of this system. The recent success of artificial intelligence (AI) offers a similar account. Like microprocessors, modern AI systems are structurally complex, yet they are fully observable and perfectly manipulable, a rare property in the natural sciences. This privileged epistemic access makes AI systems a methodological laboratory for the science of complex systems (Holtzman et al., [2025](https://arxiv.org/html/2607.00267#bib.bib258)), in which explanation discovery procedures can be developed and tested in simplified settings. Yet AI interpretability has also proven sobering, with a strong tendency to produce false-positive findings and explanations that are unstable and do not generalize (Hewitt and Liang, [2019](https://arxiv.org/html/2607.00267#bib.bib255); Ravichander et al., [2021](https://arxiv.org/html/2607.00267#bib.bib172); Méloux et al., [2025a](https://arxiv.org/html/2607.00267#bib.bib214)).
A general metric for evaluating proposed high-level explanations of low-level complex systems is therefore a prerequisite for moving forward: without it, the discovery and evaluation of candidates cannot be done rigorously, and progress is hard to measure or transfer across domains. Yet no such agreed-upon, general metric exists. A growing body of work has begun addressing this gap through the lens of *causal abstraction* (Beckers and Halpern, [2019](https://arxiv.org/html/2607.00267#bib.bib70)), the idea that a high-level causal model is a valid explanation of a low-level system if interventions at the high level correspond faithfully to interventions at the low level under a postulated abstraction map. This framework is conceptually appealing and promising, but existing implementations differ in important ways: in whether they treat abstraction as exact or approximate (Beckers et al., [2020](https://arxiv.org/html/2607.00267#bib.bib71)), and in how they handle distributions over inputs or over interventions (Geiger et al., [2021](https://arxiv.org/html/2607.00267#bib.bib275), [2025](https://arxiv.org/html/2607.00267#bib.bib254)). Concrete metric implementations developed in the field of AI interpretability have also been shown to suffer from the same issues of striking false positives and non-generalizable findings (Sutter et al., [2025](https://arxiv.org/html/2607.00267#bib.bib62); Méloux et al., [2025a](https://arxiv.org/html/2607.00267#bib.bib214)).
In this work, we set out to systematically test, compare, and benchmark a wide range of proposed explanation metrics (observational, information-theoretic, symbolic, and causal) across a diverse collection of idealized complex systems with known, consensual high-level explanations. Rather than confining ourselves to a single domain, we deliberately span systems that vary across multiple dimensions: (i) discrete versus continuous state spaces, (ii) static versus temporal dynamics, and (iii) disordered versus structured complexity (in Weaver’s sense). For each system, we manually construct the alignment function mapping low-level variables to high-level ones, and implement all pairs within a shared conceptual formalism (causal abstraction) and a unified programmatic API supporting both observational and interventional queries. Finally, we handcraft contrastive perturbations, that is, modifications to either the low-level or high-level model that render the proposed explanation invalid, allowing us to test whether each candidate metric correctly recognizes broken explanations as well as valid ones. This is an empirical test of the practical operationalization of existing conceptual ideas.
Our results provide strong support for the causal abstraction research program, but only for specific distributional variants. We further introduce the Causal Abstraction Error (CAE), an error measure that incorporates a principled treatment of distributional inputs and interventions, includes a built-in faithfulness measurement, and produces a real-valued degree-of-violation score. The CAE passes all contrastive tests across every system class we examine and outperforms related causal abstraction variants.
## 2 Metrics of Explanation Validity
Following the framework of Pearl ([2009](https://arxiv.org/html/2607.00267#bib.bib157)) and Beckers and Halpern ([2019](https://arxiv.org/html/2607.00267#bib.bib70)), we model the studied system as a *structural causal model* (SCM). Dyer et al. ([2024](https://arxiv.org/html/2607.00267#bib.bib156)) previously adopted Pearl’s modeling assumption that simulated complex systems belong to this class.
###### Definition 2.1 (Structural Causal Model)
A structural causal model is a tuple $\mathcal{M}=\langle\mathcal{U},\,\mathcal{V},\,\mathcal{R},\,\mathcal{F},\,P_{\mathcal{U}}\rangle$, where $\mathcal{U}$ is a set of *exogenous* variables with joint distribution $P_{\mathcal{U}}$; $\mathcal{V}$ is a set of *endogenous* variables; $\mathcal{R}=\{\mathcal{R}_{W}\}_{W\in\mathcal{U}\cup\mathcal{V}}$ assigns a range to every variable; and $\mathcal{F}=\{f_{V}\}_{V\in\mathcal{V}}$ is a set of *structural equations*, each determining $V$ from its direct causes $\mathrm{Pa}(V)\subseteq\mathcal{V}$ and a noise term $\mathcal{U}_{V}\subseteq\mathcal{U}$. For any subset $\mathcal{W}\subseteq\mathcal{U}\cup\mathcal{V}$, we write $\mathcal{R}_{\mathcal{W}}$ for its joint range.
We write $\mathcal{M}(\mathcal{W}\mid u,\mathrm{do})$ for the joint value assumed by the variables $\mathcal{W}\subseteq\mathcal{V}$ in $\mathcal{M}$ under exogenous realization $u\in\mathcal{R}_{\mathcal{U}}$ and intervention $\mathrm{do}$ (empty $\mathrm{do}$ is the observational case); for a single variable $V$, we abbreviate this as $\mathcal{M}(V\mid u,\mathrm{do})$. For deterministic models, we omit the vacuous dependence on $u$.
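To make the notation $\mathcal{M}(\mathcal{W}\mid u,\mathrm{do})$ concrete, the following is a minimal sketch of an SCM that can be queried observationally and under interventions. The `SCM` class, its `run` method, and the three-variable model are hypothetical illustrations, not the API used in the paper.

```python
import random


class SCM:
    """Toy SCM: structural equations are evaluated in a fixed topological order."""

    def __init__(self, equations, sample_noise):
        self.equations = equations        # dict: variable name -> f(values, u)
        self.sample_noise = sample_noise  # callable: rng -> exogenous realization u

    def run(self, u, do=None):
        """Return every endogenous value under noise u and intervention do."""
        do = do or {}
        values = {}
        for name, f in self.equations.items():
            # An intervened variable ignores its structural equation.
            values[name] = do[name] if name in do else f(values, u)
        return values


# Hypothetical micro-model: two noisy inputs A, B and their sum S.
micro = SCM(
    equations={
        "A": lambda v, u: u["A"],
        "B": lambda v, u: u["B"],
        "S": lambda v, u: v["A"] + v["B"],
    },
    sample_noise=lambda rng: {"A": rng.randint(0, 3), "B": rng.randint(0, 3)},
)

rng = random.Random(0)
u = micro.sample_noise(rng)
observational = micro.run(u)                 # empty do: the observational case
interventional = micro.run(u, do={"A": 10})  # do(A = 10) under the same noise u
```

Reusing the same `u` for both calls is what makes the comparison counterfactual rather than merely interventional: only the intervened variable and its descendants change.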
A *computational explanation* is a surrogate model $\mathcal{E}$ taken from some admissible class $\mathfrak{E}$, intended to account for the behavior of $\mathcal{M}$ at a relevant level of abstraction. The central question studied in this paper is: *Under what conditions is $\mathcal{E}$ a valid computational explanation of $\mathcal{M}$ under $P_{\mathcal{U}}$?* Radically different answers to this question have been proposed. We briefly discuss here the ones we evaluate and offer a more in-depth description in Appendix [B](https://arxiv.org/html/2607.00267#A2).
#### Observational validity\.
The most elementary criterion is *observational equivalence*: $\mathcal{E}$ should reproduce the measurable outputs of $\mathcal{M}$ under the *natural* input distribution. For deterministic settings, this is operationalized via pointwise discrepancy metrics, including MSE, RMSE, $L^{2}$, and their normalized variant NMSE, together with the coefficient of determination $R^{2}$ (Koza, [1994](https://arxiv.org/html/2607.00267#bib.bib244); Ljung, [1999](https://arxiv.org/html/2607.00267#bib.bib232)). Complexity-regularized variants such as AIC (Akaike, [1974](https://arxiv.org/html/2607.00267#bib.bib242)), BIC (Schwarz, [1978](https://arxiv.org/html/2607.00267#bib.bib241)), MDL (Rissanen, [1978](https://arxiv.org/html/2607.00267#bib.bib231)), and Mallows’ $C_{p}$ (Mallows, [1973](https://arxiv.org/html/2607.00267#bib.bib25); Lee and Ghosh, [2009](https://arxiv.org/html/2607.00267#bib.bib223)) penalize explanation complexity to prevent overfitting. When $\mathcal{M}$ is stochastic, distributional similarity is measured via measures such as the KL divergence, the symmetric JS divergence, or the kernel-based MMD (Gretton et al., [2012](https://arxiv.org/html/2607.00267#bib.bib29)); HSIC (Gretton et al., [2005](https://arxiv.org/html/2607.00267#bib.bib26)) can further test whether residuals of the explanation are independent of the input, probing global agreement. For dynamical systems, validity must hold over entire trajectories: we evaluate trajectory MSE, dynamic time warping (DTW; Sakoe and Chiba, [1978](https://arxiv.org/html/2607.00267#bib.bib28)), temporal autocorrelation matching, and spectral analysis (Percival and Walden, [1993](https://arxiv.org/html/2607.00267#bib.bib27)). Finally, we also consider symbolic regression methods like the SINDy evaluator (Brunton et al., [2016](https://arxiv.org/html/2607.00267#bib.bib116)), which scores an explanation by the residual of its governing equations on the observed state derivatives.
All observational criteria share a fundamental limitation called *equifinality* (Valogianni and Padmanabhan, [2022](https://arxiv.org/html/2607.00267#bib.bib100); Collins et al., [2024](https://arxiv.org/html/2607.00267#bib.bib101)): distinct mechanisms can be observationally indistinguishable, yielding explanations that often do not generalize (Ghorbani et al., [2019](https://arxiv.org/html/2607.00267#bib.bib213); Kindermans et al., [2019](https://arxiv.org/html/2607.00267#bib.bib211); Méloux et al., [2025b](https://arxiv.org/html/2607.00267#bib.bib215)).
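Equifinality can be illustrated with a minimal sketch. The two models below are hypothetical and are not drawn from the benchmark: in the ground truth, a common cause drives both $X$ and $Y$, whereas the candidate explanation claims that $X$ causes $Y$. Their observational MSE is exactly zero, yet they disagree as soon as $X$ is intervened on.

```python
import random


def micro(u, do=None):
    """Hypothetical ground truth: a common cause Z drives both X and Y."""
    do = do or {}
    z = do.get("Z", u)
    x = do.get("X", z)
    y = do.get("Y", z)  # Y depends on Z, not on X
    return {"Z": z, "X": x, "Y": y}


def explanation(u, do=None):
    """Hypothetical candidate explanation: X directly causes Y."""
    do = do or {}
    x = do.get("X", u)
    y = do.get("Y", x)  # Y depends on X
    return {"X": x, "Y": y}


def mse(pairs):
    return sum((p - q) ** 2 for p, q in pairs) / len(pairs)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Observational comparison: both models return Y = u, so the error vanishes.
obs_mse = mse([(explanation(u)["Y"], micro(u)["Y"]) for u in noise])

# Interventional comparison under do(X = 5): only the explanation's Y follows X.
do_x = {"X": 5.0}
int_mse = mse([(explanation(u, do_x)["Y"], micro(u, do_x)["Y"]) for u in noise])
```

Any purely observational score computed on these two models would report perfect agreement, which is the failure mode that interventional criteria are meant to expose.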
#### Functional validity\.
Functional criteria require that $\mathcal{E}$ additionally reproduce the *input–output response profile* of $\mathcal{M}$. *Variance decomposition* via ANOVA or Sobol sensitivity indices (Sobol’, [2001](https://arxiv.org/html/2607.00267#bib.bib9)) decomposes output variance into per-input contributions and interactions, with validity requiring agreement between the indices of the high-level model $\mathcal{E}$ and the low-level model $\mathcal{M}$. At the local level, the infidelity metric (Yeh et al., [2019](https://arxiv.org/html/2607.00267#bib.bib187)), adapted to the two-model setting, measures whether $\mathcal{E}$ and $\mathcal{M}$ agree on their per-input attribution vectors. This replaces the original single-model attribution formulation with a two-model sensitivity comparison, using $\mathcal{E}$ itself as the local linear approximation. LIME (Ribeiro et al., [2016](https://arxiv.org/html/2607.00267#bib.bib177)) and SHAP (Lundberg and Lee, [2017](https://arxiv.org/html/2607.00267#bib.bib180)) can be understood as proxies for this notion. Globally, relational fidelity (Collins et al., [2024](https://arxiv.org/html/2607.00267#bib.bib101)) requires consistency of output differences across all input pairs, ensuring that $\mathcal{E}$ preserves the functional geometry of $\mathcal{M}$. However, functional criteria still remain agnostic to internal mechanisms.
#### Information\-theoretic and representational validity\.
A third class of criteria requires that $\mathcal{E}$ reproduce the representational structure of $\mathcal{M}$. This can be measured by representational similarity analysis (Kriegeskorte et al., [2008](https://arxiv.org/html/2607.00267#bib.bib271)). Similarly, *probing accuracy* (Alain and Bengio, [2018](https://arxiv.org/html/2607.00267#bib.bib186); Pimentel et al., [2020](https://arxiv.org/html/2607.00267#bib.bib257)) measures whether the variables posited by $\mathcal{E}$ are decodable from the internal states of $\mathcal{M}$. The *information bottleneck* (IB) Lagrangian (Tishby et al., [2000](https://arxiv.org/html/2607.00267#bib.bib121)) characterizes valid representations as those achieving an optimal compression–relevance trade-off. *Complexity shift* (Zenil et al., [2019](https://arxiv.org/html/2607.00267#bib.bib225)) measures whether $\mathcal{E}$ induces comparable transformations under perturbation, assessed via Kolmogorov complexity. These criteria capture *what* information is present, but not *how* it is causally transformed (Elazar et al., [2021](https://arxiv.org/html/2607.00267#bib.bib4); Lasri et al., [2022](https://arxiv.org/html/2607.00267#bib.bib5); Teney et al., [2022](https://arxiv.org/html/2607.00267#bib.bib259)).
#### Causal and interventional validity\.
The strictest criteria require $\mathcal{E}$ to correctly reproduce $\mathcal{M}$’s behavior under *active interventions*. *Symbion* (Gritti et al., [2020](https://arxiv.org/html/2607.00267#bib.bib221)) is a binary-analysis technique that interleaves symbolic and concrete execution, using consistency between symbolic and concrete states to guide analysis through code regions that are difficult to handle symbolically. We draw inspiration from this notion of symbolic–concrete consistency to define a Symbion-inspired validity metric. The *causal sensitivity index* and the related *structural deviation* metric (Katende, [2025](https://arxiv.org/html/2607.00267#bib.bib237)) measure how much the interventional consistency score between $\mathcal{E}$ and $\mathcal{M}$ changes when each parameter of $\mathcal{E}$ is zeroed out or perturbed by a fixed fraction, respectively. Mechanistic interpretability (Olah et al., [2020](https://arxiv.org/html/2607.00267#bib.bib140)) seeks to recover the sparse *circuits* (Vig et al., [2020](https://arxiv.org/html/2607.00267#bib.bib173); Meng et al., [2022](https://arxiv.org/html/2607.00267#bib.bib174)) implementing a given behavior, with validity defined by output agreement under computational interventions. Finally, $\epsilon$-machines (Crutchfield and Young, [1989](https://arxiv.org/html/2607.00267#bib.bib224); Shalizi and Crutchfield, [2001](https://arxiv.org/html/2607.00267#bib.bib117)) define the canonical valid explanation as the minimal sufficient statistic for predicting the next steps of a dynamical system $\mathcal{M}$, potentially under interventions. The *dynamic causal consistency* (DCC) criterion (Grosse-Wentrup et al., [2024](https://arxiv.org/html/2607.00267#bib.bib120)) offers a practical implementation of this idea.
### 2.1 Zooming in on Causal Abstraction
Causal abstraction aims to provide a mathematical foundation for determining when high-level models can safely be considered as causal proxies for low-level ones (Chalupka et al., [2016](https://arxiv.org/html/2607.00267#bib.bib57); Rubenstein et al., [2017](https://arxiv.org/html/2607.00267#bib.bib69); Beckers and Halpern, [2019](https://arxiv.org/html/2607.00267#bib.bib70)). It is the central conceptual framework in this work. We explore its formal definitions and the ongoing debates regarding its precise framing.
#### Overview\.
We denote $\mathcal{M}$ as the micro- (or low-level) model and $\mathcal{E}$ as the macro- (or high-level) model. $\mathcal{E}$ is now also defined as an SCM. An abstraction between these models is defined by a mapping $\tau:\mathcal{R}^{\mathcal{M}}\to\mathcal{R}^{\mathcal{E}}$ which specifies how low-level computations in $\mathcal{M}$ translate to high-level ones in $\mathcal{E}$. Zennaro ([2022](https://arxiv.org/html/2607.00267#bib.bib2)) reviews properties of existing definitions of valid abstractions and their evolution (Rubenstein et al., [2017](https://arxiv.org/html/2607.00267#bib.bib69); Beckers and Halpern, [2019](https://arxiv.org/html/2607.00267#bib.bib70); Beckers et al., [2020](https://arxiv.org/html/2607.00267#bib.bib71); Rischel, [2020](https://arxiv.org/html/2607.00267#bib.bib68); Rischel and Weichwald, [2021](https://arxiv.org/html/2607.00267#bib.bib67); Otsuka and Saigo, [2022](https://arxiv.org/html/2607.00267#bib.bib66)). Two key categories of properties are relevant for us: *formal structural properties* of the mapping itself and *consistency properties*, which ensure that reasoning with either the micro- or the macro-model leads to causally consistent outputs.
#### Structural Properties\.
Structural properties define constraints on the abstraction mapping $\tau$, determining which types of abstraction maps are *permissible*. Minimal but sufficient properties to model the complex systems we consider are as follows:
1. Surjective disjoint coarse-graining of variables: We take the coarse-graining map to be $a:\mathcal{V}_{\mathcal{M}}\to\mathcal{V}_{\mathcal{E}}\cup\{\Phi\}$, surjective onto $\mathcal{V}_{\mathcal{E}}$, where $\Phi$ is a designated dummy sentinel that collects all *unmapped* micro-variables; the (possibly empty) unmapped set is $a^{-1}(\Phi)$. The map assigns each micro-variable to exactly one macro-variable or to $\Phi$, and surjectivity guarantees that every substantive macro-variable is grounded in at least one micro-variable. This corresponds to the *constructive* setting (Beckers and Halpern, [2019](https://arxiv.org/html/2607.00267#bib.bib70)), which is widely used (Zennaro et al., [2023a](https://arxiv.org/html/2607.00267#bib.bib60); Kekić et al., [2023](https://arxiv.org/html/2607.00267#bib.bib65); Zhu et al., [2024](https://arxiv.org/html/2607.00267#bib.bib64)) and which we adopt throughout. The dummy $\Phi$ carries no structural equation and no value map; it plays an operational role only during sampling, where intervening on it represents interventions on the unmapped micro-variables.
2. Surjective value mapping: The coarse-graining map only associates variables (structure). We must also specify how micro-states map to macro-states (contents). This is achieved via a set of *value maps* $\tau_{X}:\mathcal{R}^{\mathcal{M}}_{a^{-1}(X)}\to\mathcal{R}^{\mathcal{E}}_{X}$, which determine, for each macro-variable $X$, how its state arises from the states of its associated micro-variables. We require each value map to be surjective to guarantee that every possible macro-state can be realized from some micro-state.

Finally, the abstraction map $\tau$ consists of a surjective (disjoint) coarse-graining map $a$, the set of all surjective value maps $\tau_{X}$, and the noise map $\tau_{u}$.
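For finite ranges, the two structural properties above can be checked mechanically. The sketch below is a hypothetical illustration: the helper names, the `PHI` sentinel string, and the three-variable example are assumptions made for this example and are not taken from the paper's implementation. Disjointness holds by construction, since a dictionary assigns each micro-variable to exactly one target.

```python
from itertools import product

PHI = "PHI"  # dummy sentinel collecting unmapped micro-variables


def is_valid_coarse_graining(a, micro_vars, macro_vars):
    """Check that a is total on the micro-variables and surjective onto the macro-variables."""
    total = set(a) == set(micro_vars)
    image = set(a.values()) - {PHI}
    return total and image == set(macro_vars)


def preimage(a, target):
    """Micro-variables assigned to a given macro-variable (or to PHI)."""
    return sorted(m for m, x in a.items() if x == target)


def is_surjective(value_map, micro_range, macro_range):
    """Check that the image of the value map is exactly the macro range."""
    return {value_map(s) for s in micro_range} == set(macro_range)


# Hypothetical example: two bits are summed into X; a third variable n is unmapped.
a = {"b0": "X", "b1": "X", "n": PHI}
tau_X = lambda s: s[0] + s[1]
micro_range_X = list(product((0, 1), repeat=2))  # joint range of a^{-1}(X)

valid = is_valid_coarse_graining(a, ["b0", "b1", "n"], ["X"])
surjective = is_surjective(tau_X, micro_range_X, (0, 1, 2))
unmapped = preimage(a, PHI)
```

Here `tau_X` is surjective but not injective: the micro-states `(0, 1)` and `(1, 0)` both realize the macro-state `1`, which is why a single macro-intervention can have several compatible micro-interventions.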
#### Consistency Property\.
For a given macro-intervention $\nu=(X\!=\!x)$, applied as $\mathrm{do}(\nu)$, multiple compatible micro-level interventions $\mu=(a^{-1}(X)=\tau_{X}^{-1}(x))$ may exist, applied as $\mathrm{do}(\mu)$, since $\tau_{X}$ is not typically injective (multiple micro-states may realize the same macro-state). Let $\mu$ be one such micro-intervention. We overload $\tau$ to denote the induced map on interventions, lifting targets via $a$ and values via $\tau_{X}$. Then, the consistency property requires that $\tau$’s noise-variable map $\tau_{u}:\mathcal{R}^{\mathcal{M}}_{\mathcal{U}}\to\mathcal{R}^{\mathcal{E}}_{\mathcal{U}}$ satisfy, for every compatible pair of interventions $(\nu,\mu)$ and every $u\in\mathcal{R}^{\mathcal{M}}_{\mathcal{U}}$,
$$\mathcal{E}\left(\mathcal{V}_{\mathcal{E}}\mid\tau_{u}(u),\mathrm{do}(\nu)\right)=\tau\left(\mathcal{M}\left(\mathcal{V}_{\mathcal{M}}\mid u,\mathrm{do}(\mu)\right)\right).$$
This pointwise counterfactual consistency requires both models to coincide for every exogenous realization. When $\mathcal{E}$ is deterministic, this equation reduces to
$$\mathcal{E}\left(\mathcal{V}_{\mathcal{E}}\mid\mathrm{do}(\nu)\right)=\tau\left(\mathcal{M}\left(\mathcal{V}_{\mathcal{M}}\mid u,\mathrm{do}(\mu)\right)\right),$$
which is the primary criterion implemented in this work. This is typically visualized via a commuting diagram (Beckers and Halpern, [2019](https://arxiv.org/html/2607.00267#bib.bib70)):
$$
\begin{array}{ccc}
\mathcal{E}(X) & \xrightarrow{\;\mathrm{do}(\nu)\;} & \mathcal{E}(Y) \\
{\scriptstyle\tau_{X}}\big\uparrow & & \big\uparrow{\scriptstyle\tau_{Y}} \\
\mathcal{M}(a^{-1}(X)) & \xrightarrow{\;\mathrm{do}(\mu)\;} & \mathcal{M}(a^{-1}(Y))
\end{array}
$$
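On small finite systems, the commuting requirement can be verified by brute force: enumerate every macro-intervention, every compatible micro-intervention, and compare both paths around the diagram. The sketch below assumes a hypothetical deterministic micro-model with two duplicated outputs and a macro-model $Y=2X$; all function names are illustrative.

```python
from itertools import product

BITS = (0, 1)


def micro(mu):
    """Hypothetical micro-model: inputs x1, x2; outputs y1 = y2 = x1 + x2."""
    total = mu["x1"] + mu["x2"]
    return {"y1": total, "y2": total}


def macro(nu):
    """Hypothetical valid macro-model: Y = 2 * X."""
    return {"Y": 2 * nu["X"]}


def broken_macro(nu):
    """Hypothetical invalid macro-model used as a contrast: Y = X + 1."""
    return {"Y": nu["X"] + 1}


def tau_X(x1, x2):
    return x1 + x2


def tau_Y(y1, y2):
    return y1 + y2


def compatible_micro(x):
    """All micro-interventions on a^{-1}(X) that tau_X maps to the macro-value x."""
    return [{"x1": x1, "x2": x2}
            for x1, x2 in product(BITS, repeat=2) if tau_X(x1, x2) == x]


def is_consistent(macro_model):
    """Check that both paths of the diagram agree for every compatible pair (nu, mu)."""
    for x in (0, 1, 2):
        for mu in compatible_micro(x):
            low = micro(mu)
            if macro_model({"X": x})["Y"] != tau_Y(low["y1"], low["y2"]):
                return False
    return True
```

Under these assumptions, `is_consistent(macro)` holds while `is_consistent(broken_macro)` fails at the first macro-intervention. Exhaustive enumeration is only feasible for small discrete ranges, which is one motivation for the sampled error measures discussed next.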
#### Measuring Consistency Errors\.
From the consistency requirement, it is possible to derive an error measure, quantifying how far a macro\-model \(and its associated abstraction map\) is from being a valid causal abstraction of a given micro\-model\. A general formulation of such a measure is given by:
$$E_{\tau}(\nu,\mu)=D\!\left(\mathcal{E}(\mathcal{Y}\mid\mathrm{do}(\nu)),\;\tau_{\mathcal{Y}}\!\left(\mathcal{M}(a^{-1}(\mathcal{Y})\mid\mathrm{do}(\mu))\right)\right),$$
where $\mathcal{Y}\subseteq\mathcal{V}_{\mathcal{E}}$ is a set of macro-variables (scored jointly), and $D$ is a pointwise dissimilarity metric (e.g., MSE) in the deterministic setting, or a distributional divergence for stochastic models. The function $E_{\tau}(\nu,\mu)$ quantifies the degree of consistency violation for a given macro-intervention $\nu$ and one associated micro-intervention $\mu$. Different approaches have been proposed to define a *global* consistency error: Zennaro et al. ([2023a](https://arxiv.org/html/2607.00267#bib.bib60)) adopt the worst-case error over $(\nu,\mu)$, defining $E_{\tau}=\sup_{\nu,\mu}E_{\tau}(\nu,\mu)$ and using the Jensen-Shannon divergence as the choice for $D$. In contrast, Zhu et al. ([2024](https://arxiv.org/html/2607.00267#bib.bib64)) and Kekić et al. ([2023](https://arxiv.org/html/2607.00267#bib.bib65)) employ the Kullback-Leibler divergence and compute the expectation over interventions, facilitating sample-based approximations. A different approach is interchange intervention accuracy (IIA; Geiger et al., [2022](https://arxiv.org/html/2607.00267#bib.bib48)), developed especially for neural networks. IIA requires a variable correspondence (a coarse-graining) and a mechanism to copy the micro-state realizing a source macro-value into a target run, but it does not require a full surjective value map with invertible preimages, and it checks consistency only at output variables. Even perfect IIA therefore does not certify causal abstraction in the general sense, since intermediate variables may still violate consistency (Méloux et al., [2025a](https://arxiv.org/html/2607.00267#bib.bib214); Sutter et al., [2025](https://arxiv.org/html/2607.00267#bib.bib62)).
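The practical difference between worst-case and expected aggregation of $E_{\tau}(\nu,\mu)$ can be seen on a toy deterministic example. The models below are hypothetical: the micro-mechanism is additive except in one saturating corner case, so the per-pair error is non-zero for a single intervention only. Squared error stands in for $D$, and $\tau_{\mathcal{Y}}$ is taken to be the identity.

```python
from itertools import product

VALUES = range(4)


def micro_out(x1, x2):
    """Hypothetical micro-mechanism: additive, but saturating at 5."""
    return min(x1 + x2, 5)


def macro_out(x):
    """Hypothetical macro-model: identity on the aggregated input."""
    return x


def pair_error(x1, x2):
    """Squared-error instance of E_tau(nu, mu), with nu = tau(mu) = x1 + x2."""
    nu = x1 + x2
    return (macro_out(nu) - micro_out(x1, x2)) ** 2


errors = [pair_error(x1, x2) for x1, x2 in product(VALUES, repeat=2)]
worst_case = max(errors)             # supremum aggregation
expected = sum(errors) / len(errors)  # expectation under a uniform intervention distribution
```

Only the intervention `(3, 3)` violates consistency, so the supremum reports the full violation while the uniform expectation dilutes it over all sixteen interventions. The expectation is easier to estimate from samples, but its value depends on the chosen intervention distribution.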
#### Faithfulness\.
The standard definition of causal abstraction does not require all low-level variables in $\mathcal{V}_{\mathcal{M}}$ to be mapped: $a^{-1}(\Phi)$ may be non-empty. This is natural for AI interpretability, where explanations often posit that only a small subset of a neural network is relevant to a behavior (Olah et al., [2020](https://arxiv.org/html/2607.00267#bib.bib140); Conmy et al., [2023](https://arxiv.org/html/2607.00267#bib.bib137)), and similar sparsity assumptions appear in biological systems (Rochefort et al., [2009](https://arxiv.org/html/2607.00267#bib.bib17); Meunier et al., [2010](https://arxiv.org/html/2607.00267#bib.bib18)). Yet declaring excluded components causally irrelevant is a strong empirical claim that must be tested. In circuit discovery, this is captured by *faithfulness*: unmapped neurons should be intervened on to verify that they do not influence the proposed circuit computation (Hanna et al., [2024](https://arxiv.org/html/2607.00267#bib.bib50)). Existing operationalizations of causal abstraction largely take the relevant variable set as modeler-provided, rather than testing whether excluded variables are genuinely inert (Zennaro et al., [2023a](https://arxiv.org/html/2607.00267#bib.bib60); Kekić et al., [2023](https://arxiv.org/html/2607.00267#bib.bib65); Zhu et al., [2024](https://arxiv.org/html/2607.00267#bib.bib64)). Faithfulness checks reject explanations that incorrectly assume away causal influence from omitted components.
## 3 The Causal Abstraction Error
Building upon existing causal alignment error measures (Zennaro et al., [2023a](https://arxiv.org/html/2607.00267#bib.bib60); Kekić et al., [2023](https://arxiv.org/html/2607.00267#bib.bib65); Zhu et al., [2024](https://arxiv.org/html/2607.00267#bib.bib64)), we define a new metric that explicitly captures faithfulness in causal abstraction. We first define what we term an instance-level abstraction error (IAE):
$$\mathrm{IAE}\!\left(\mathcal{M}\overset{\tau}{\to}\mathcal{E},\mathcal{Y}\mid u,\,\mathrm{do}(\mu,\nu)\right)=D\!\left(\mathcal{E}(\mathcal{Y}\mid\tau_{u}(u),\,\mathrm{do}(\nu)),\;\tau_{\mathcal{Y}}\!\left(\mathcal{M}(a^{-1}(\mathcal{Y})\mid u,\,\mathrm{do}(\mu))\right)\right)\tag{1}$$
where $\mathcal{Y}\subseteq\mathcal{V}_{\mathcal{E}}$ is a set of macro-variables (scored jointly), and $D$ is a dissimilarity over $\mathcal{R}^{\mathcal{E}}_{\mathcal{Y}}$. In words, this metric measures whether, given a controlled exogenous value $u$ and intervention $\mathrm{do}(\mu,\nu)$, the two SCMs produce compatible outputs for the variables $\mathcal{Y}$ at a counterfactual level. However, this is an instance-level error. In practice, we are interested in how our high-level explanation performs across several exogenous values and interventions. We thus define two variants of *causal abstraction error* (CAE) that differ in how we aggregate these values:
$$\mathrm{CAE}_{\downarrow}^{\mathrm{Agg}}\!\left(\mathcal{M}\overset{\tau}{\to}\mathcal{E},\mathcal{Y}\right)=\operatorname{Agg}_{\nu\sim P_{I},\;\mu\sim P_{\tau^{-1}(\nu)},\;u\sim P_{\mathcal{U}_{\mathcal{M}}}}\left[\mathrm{IAE}\!\left(\mathcal{M}\overset{\tau}{\to}\mathcal{E},\mathcal{Y}\mid u,\,\mathrm{do}(\mu,\nu)\right)\right]\tag{2a}$$
$$\mathrm{CAE}_{\uparrow}^{\mathrm{Agg}}\!\left(\mathcal{M}\overset{\tau}{\to}\mathcal{E},\mathcal{Y}\right)=\operatorname{Agg}_{\mu\sim P_{I},\;u\sim P_{\mathcal{U}_{\mathcal{M}}}}\left[\mathrm{IAE}\!\left(\mathcal{M}\overset{\tau}{\to}\mathcal{E},\mathcal{Y}\mid u,\,\mathrm{do}(\mu,\nu)\right)\right]\tag{2b}$$
where $P_{I}$ denotes a distribution over interventions on the set of micro- or macro-variables (in the $\uparrow$ and the $\downarrow$ variant, respectively); and $P_{\tau^{-1}(\nu)}$ denotes a distribution over micro-interventions $\mu$ compatible with $\nu$, i.e., for which $\nu=\tau(\mu)$. In (2b), $\nu$ is not sampled: it is the lift $\nu=\tau(\mu)$ of the sampled micro-intervention. CAE thus aggregates IAE values using a function $\operatorname{Agg}$ which operates over the indexed variables $(\nu,\mu,u)$ through the distributions $P_{I}$, $P_{\tau^{-1}(\nu)}$, and $P_{\mathcal{U}_{\mathcal{M}}}$; this function can be, e.g., the expectation, supremum, infimum, or combinations thereof.
Notably, for stochastic models, the aggregation over exogenous variables $u\sim P_{\mathcal{U}_{\mathcal{M}}}$ could be folded into the dissimilarity metric $D$ and replaced by a distributional divergence (e.g., JSD, KL, MMD), computing for instance $\mathrm{KL}\!\left(\mathcal{E}(\mathcal{Y}\mid\mathrm{do}(\tau(\mu))),\;\tau_{\mathcal{Y}}\!\left(\mathcal{M}(a^{-1}(\mathcal{Y})\mid\mathrm{do}(\mu))\right)\right)$, where $u$ is marginalized out; in that case, CAE would measure causal consistency at the interventional (instead of counterfactual) level.
In the top\-down variant \(↓\\downarrow\), macro\-interventions are grounded to compatible micro\-interventions viaτX−1\\tau\_\{X\}^\{\-1\}; in the bottom\-up one \(↑\\uparrow\), micro\-interventions are drawn directly and lifted to the macro level viaτX\\tau\_\{X\}\. Note that these two approaches are equivalent in their zero set; however, their non\-zero estimates may differ due to sampling decisions\. Both arguments ofDDlie in the macro\-variable rangeℛ𝒴\\mathcal\{R\}\_\{\\mathcal\{Y\}\}, avoiding ill\-posedness fromτ𝒴\\tau\_\{\\mathcal\{Y\}\}being non\-injective\. Faithfulness is integrated in both variants:PIP\_\{I\}can select the dummy macro\-variableΦ\\Phias an intervention target \(or, in the↑\\uparrowvariant, it can target the unmapped micro\-variablesa−1\(Φ\)a^\{\-1\}\(\\Phi\)\), which the proposed explanation claims are causally inert\. We also define non\-faithful variants, denoted by the subscriptNF\{\}\_\{\\text\{NF\}\}: in those,PIP\_\{I\}excludesΦ\\Phi\.
#### Relation to Zennaro et al. ([2023b](https://arxiv.org/html/2607.00267#bib.bib6)).
That work defines four related metrics; ISIL and IC are the closest to $\mathrm{CAE}_{\downarrow}$ and $\mathrm{CAE}_{\uparrow}$, respectively. Beyond the faithfulness consideration, three differences are worth noting. First, their metrics aggregate by supremum; ours use expectation, which is easier to estimate by sampling. Second, their framework is restricted to finite variable domains, whereas CAE handles continuous value maps directly. Third, their stochastic treatment prevents compositionality of valid abstractions from being preserved, which is not the case for CAE (see Appendix [D](https://arxiv.org/html/2607.00267#A4)).
### 3\.1Practical Estimation
In all of our experiments, we use the expectation $\mathbb{E}$, approximated by the empirical mean, as the aggregation operator Agg in CAE's definition. Both CAE variants thus use expectations amenable to Monte Carlo estimation. The formalism samples only exogenous variables $u\in\mathcal{R}_{\mathcal{U}}^{\mathcal{M}}$: the macro-level exogenous variables enter only through the mapping $\tau_{u}$. In our systems, $\mathcal{E}$ is deterministic, making $\tau_{u}$ vacuous (the identity, as there is no macro-exogenous noise to map), so each Monte Carlo sample is determined solely by the draw of the intervention $(\nu,\mu)$ together with any sampled $u\in\mathcal{R}_{\mathcal{U}}^{\mathcal{M}}$. For models $\mathcal{M}$ that are internally stochastic (e.g., the Lennard–Jones gas), additional simulator randomness is generated independently for each model evaluation. Since our implementation does not specify a coupling map for this internal randomness across the micro- and macro-level models, the resulting comparison is distributional rather than pointwise. Accordingly, these experiments instantiate the interventional interpretation of CAE (the folded-divergence form) rather than a counterfactual interpretation, which would require a well-defined coupling $\tau_{u}$ that identifies corresponding exogenous realizations across the two levels. We return to this limitation in Section [6.1](https://arxiv.org/html/2607.00267#S6.SS1).
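As an illustration of the empirical-mean estimator described above, the following is a minimal sketch of the top-down loop for a deterministic toy system. Everything in it is hypothetical and not taken from the released implementation: the function name `estimate_cae_down`, the toy models (macro variable $X = x_1 + x_2$, output $Y = 2X$), and the absolute difference used as dissimilarity. Exogenous noise is omitted because the toy models are deterministic.

```python
import random
from statistics import mean


def estimate_cae_down(sample_macro, ground, run_macro, run_micro, tau_y, dist,
                      n=200, seed=0):
    """Empirical-mean estimate of a top-down consistency error (sketch)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n):
        nu = sample_macro(rng)            # macro-intervention nu ~ P_I
        mu = ground(nu, rng)              # compatible micro-intervention
        y_macro = run_macro(nu)           # macro model under do(nu)
        y_micro = tau_y(run_micro(mu))    # micro model under do(mu), lifted
        errors.append(dist(y_macro, y_micro))
    return mean(errors)


# Hypothetical toy system: macro X = x1 + x2, micro output y = 2 * (x1 + x2).
def sample_macro(rng):
    return rng.randint(0, 4)


def ground(nu, rng):
    x1 = rng.randint(0, nu)               # any split with x1 + x2 == nu
    return (x1, nu - x1)


def run_micro(mu):
    return 2 * (mu[0] + mu[1])


def identity(y):
    return y


def abs_diff(a, b):
    return abs(a - b)


# Valid explanation: Y = 2X. Invalid explanation: Y = 3X.
valid_score = estimate_cae_down(sample_macro, ground, lambda x: 2 * x,
                                run_micro, identity, abs_diff)
invalid_score = estimate_cae_down(sample_macro, ground, lambda x: 3 * x,
                                  run_micro, identity, abs_diff)
print(valid_score, invalid_score)
```

In this toy setting the valid explanation scores exactly zero, while the invalid one scores the average sampled value of $X$, which is strictly positive.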
#### Sampling macro-interventions ($P_{I}$ in $\mathrm{CAE}_{\downarrow}$).
We define $P_{I}$ as the following hierarchical distribution. First draw a cardinality $k\sim\mathrm{Uniform}\{1,\ldots,\min(|\mathcal{V}_{\mathcal{E}}|,\text{max\_interventions})\}$; then draw the target set $\mathcal{S}$ uniformly among all size-$k$ subsets of $\mathcal{V}_{\mathcal{E}}$; finally draw intervention values $\mathbf{x}_{\mathcal{S}}$ with each $x_{V}$ ($V\in\mathcal{S}$) i.i.d. uniform on the domain $\mathcal{R}_{V}$. We draw $n$ such interventions $\{\nu^{(i)}\}_{i=1}^{n}\sim P_{I}$.
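The three-stage draw above can be sketched as follows. This is an illustrative reading rather than the released code: the function names, the encoding of domains (a `(low, high)` tuple for a continuous range, a list for a finite domain), and the example variables are all assumptions.

```python
import random


def draw_value(domain, rng):
    """Uniform draw from a (low, high) range or from a finite list of values."""
    if isinstance(domain, tuple):
        low, high = domain
        return rng.uniform(low, high)
    return rng.choice(domain)


def sample_macro_intervention(domains, max_interventions, rng):
    """Draw one multi-node intervention: cardinality, targets, then values."""
    k_max = min(len(domains), max_interventions)
    k = rng.randint(1, k_max)                  # k ~ Uniform{1, ..., k_max}
    targets = rng.sample(sorted(domains), k)   # uniform size-k subset
    return {v: draw_value(domains[v], rng) for v in targets}


# Hypothetical macro-variables with mixed domains.
domains = {"T": (250.0, 350.0), "N": [10, 20, 40], "gate": [False, True]}
rng = random.Random(0)
interventions = [sample_macro_intervention(domains, 2, rng) for _ in range(100)]
print(interventions[:3])
```

Each returned dictionary maps the selected target variables to their intervened values; variables that are absent are left to their natural mechanisms.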
#### Sampling compatible micro-interventions ($P_{\tau^{-1}(\nu)}$ in $\mathrm{CAE}_{\downarrow}$).
For each macro-intervention $\nu$ on variables $\mathcal{S}$, we sample a compatible micro-intervention by applying the inverse value map to each intervened variable: for $V\in\mathcal{S}$ with intervened value $x_{V}$, we compute the preimage $\tau_{V}^{-1}(x_{V})$ and sample a micro-state from it, which corresponds to the construction from Zennaro et al. ([2023a](https://arxiv.org/html/2607.00267#bib.bib60)).
#### Sampling micro-interventions ($P_{I}$ in $\mathrm{CAE}_{\uparrow}$).
We draw $n$ micro-interventions $\{\mu^{(i)}\}_{i=1}^{n}\sim P_{I}$ directly, using the same multi-node scheme as above but operating over micro-variable domains. The corresponding macro-intervention is obtained by applying $\tau_{X}$ to the intervened micro-state.
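For a finite micro-domain, grounding a macro-value through the preimage of its value map can be sketched by enumeration. The helper `sample_preimage`, the uniform choice within the preimage, and the example map (a macro-variable counting the active bits among three micro-bits) are assumptions made for illustration; the text only states that a micro-state is sampled from the preimage.

```python
import random
from itertools import product


def sample_preimage(tau, micro_states, x, rng):
    """Pick one micro-state s with tau(s) == x, uniformly among candidates."""
    candidates = [s for s in micro_states if tau(s) == x]
    if not candidates:
        raise ValueError(f"macro value {x!r} has no compatible micro-state")
    return rng.choice(candidates)


# Hypothetical value map: the macro-variable counts active bits among 3 wires.
micro_states = list(product([0, 1], repeat=3))
tau_count = sum

rng = random.Random(0)
grounded = {x: sample_preimage(tau_count, micro_states, x, rng) for x in range(4)}
print(grounded)
```

For continuous value maps, enumeration is not available and the preimage would have to be sampled analytically or by rejection, which this sketch does not cover.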
#### Testing faithfulness\.
We test faithfulness by augmenting the macro-intervention pool with the dummy $\Phi$. At each sample, $\Phi$ is included in the selected intervention targets with a configurable probability. When selected, we first ground the substantive interventions to micro-values as usual, then overwrite the micro-value of each unmapped variable with sampled noise before executing $\mathcal{M}$ (in the $\uparrow$ variant, $a^{-1}(\Phi)$ is instead included directly in the micro-intervention pool): Gaussian noise for continuous variables, uniform $\{-1,0,1\}$ perturbations for integers, and random resampling for Booleans. Since $\Phi$ is a sentinel, it induces no intervention on the macro side: any shift in the abstracted output relative to the unperturbed run reflects a causal influence of $a^{-1}(\Phi)$ and increases the consistency error.
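The type-dependent perturbation of unmapped variables can be sketched as below. The function name, the additive form of the Gaussian and integer perturbations, and the `sigma` parameter are assumptions; the text specifies only the noise family used for each variable type.

```python
import random


def perturb_unmapped(micro_values, unmapped, rng, sigma=1.0):
    """Return a copy of the micro-state with noise on the unmapped variables."""
    out = dict(micro_values)
    for name in unmapped:
        value = out[name]
        if isinstance(value, bool):        # check bool first: bool is an int
            out[name] = rng.choice([False, True])
        elif isinstance(value, int):
            out[name] = value + rng.choice([-1, 0, 1])
        else:
            out[name] = value + rng.gauss(0.0, sigma)   # assumed additive
    return out


# Hypothetical micro-state: "x" is mapped; the other variables belong to Phi.
state = {"x": 1.5, "spare_bit": True, "counter": 7, "drift": 0.0}
perturbed = perturb_unmapped(state, ["spare_bit", "counter", "drift"],
                             random.Random(0))
print(perturbed)
```

Mapped variables are left untouched, so any change in the abstracted output after this step can be attributed to the variables that the explanation treats as causally inert.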
## 4A Benchmark of Simulated Complex Systems
A central obstacle in evaluating explanation-validity metrics is the lack of naturally occurring systems paired with objective ground-truth explanations. Human judgment is an inadequate substitute: an explanation can be *plausible* (convincing to a human) without being *faithful* (an accurate account of the underlying mechanism), and users often prefer the former even when it is incomplete or misleading (Jacovi and Goldberg, [2020](https://arxiv.org/html/2607.00267#bib.bib159); Ancona et al., [2019](https://arxiv.org/html/2607.00267#bib.bib169); Herman, [2017](https://arxiv.org/html/2607.00267#bib.bib168)). Further, existing benchmarks typically evaluate desirable *properties* of explanation methods (sparsity, stability, localization, or causal robustness) rather than whether an explanation recovers a *known* valid high-level causal account. For instance, SAEBench (Karvonen et al., [2025](https://arxiv.org/html/2607.00267#bib.bib47)) scores sparse autoencoders on feature disentanglement and reconstruction quality, and RAVEL (Huang et al., [2024](https://arxiv.org/html/2607.00267#bib.bib43)) measures how cleanly interventions isolate targeted attributes; both ask whether a method behaves well by some proxy, not whether its output matches a ground-truth mechanism. The distinction matters because a method can score well on stability or sparsity while still recovering the wrong causal structure.
We address both issues above by constructing a benchmark of idealized complex systems whose high-level explanations are known and analytically justified at the relevant abstraction level. This benchmark enables us to verify two desired properties for a target metric: it should assign near-zero error to correct explanations, and it should detect controlled failures produced by invalid explanations. A key contribution of this work is the manual alignment of these heterogeneous systems with their high-level explanations and their integration within a common framework. The benchmark spans ten systems chosen to cover different abstraction types, yet all systems admit clean causal-abstraction representations and, more strongly, *constructive* causal abstractions. This provides a reusable resource and a nontrivial empirical demonstration of the breadth of the causal abstraction program. For each system, we provide the high-level model, low-level implementation, abstraction map, and manually crafted invalid contrastive scenarios. Full details are given in Appendix [E](https://arxiv.org/html/2607.00267#A5); Table [1](https://arxiv.org/html/2607.00267#S4.T1) summarizes the systems.
Table 1: Benchmark systems. Each row is a manually aligned ground-truth abstraction pair $(\mathcal{M},\mathcal{E},\tau)$. The benchmark spans discrete (D) and continuous (C) state spaces, static and dynamical regimes, and local, spatial, and global constructive maps. *Structure* describes how each macro-observable is formed from the micro-state; for "aggregation" and "observables", this is performed within $\mathcal{M}$'s measurement, with $\tau$ acting as the identity on the resulting macro-variables. Rows shaded in light blue are dynamical systems. For every pair, we also provide invalid contrastive conditions by perturbing the model, the map, or the validity regime; full details are in Appendix [E](https://arxiv.org/html/2607.00267#A5).

| System | Low level $\mathcal{M}$ | High level $\mathcal{E}$ | Constructive map $\tau$: type | Constructive map $\tau$: structure |
| --- | --- | --- | --- | --- |
| Gas | Lennard–Jones particles | Ideal gas / VdW law | C $\to$ C | global / aggregation |
| Predator–prey | Agent-based model | Lotka–Volterra ODE | D $\to$ C | global / aggregation |
| Heat equation 1D | Brownian particles | 1D heat equation | C $\to$ C | spatial / binning |
| Heat equation 2D | Phonon/lattice simulation | 2D heat equation | C $\to$ C | spatial / smoothing |
| Ising model | MD/Monte-Carlo simulator | Nearest-neighbor Ising model | C $\to$ C | global / observables |
| Tracr transformer | Compiled transformer states | Symbolic sort-rank program | D $\to$ D | local / decoding |
| Gene regulatory network | Multi-valued GRN | Binary Wg–Fz rule | D $\to$ D | local / threshold |
| Logic circuit | Wire-level Boolean circuit | Multi-bit gate circuit | D $\to$ D | local / functional |
| Transistor circuit | SPICE/MOSFET voltages | Boolean gate network | C $\to$ D | local / threshold |
| MOS 6502 CPU | Transistor-level sim. | Gate-level sim. | D $\to$ D | local / functional |
| MOS 6502 CPU | Transistor-level sim. | ISA-level sim. | D $\to$ D | local / functional |
## 5Results
We evaluate metrics along three dimensions in sequence: classification performance under high sampling, discrimination of structurally invalid abstractions in controlled experiments, and sampling efficiency as a function of model size\.
#### Benchmarking Metrics\.
We begin by asking which metrics can, in principle, separate valid from invalid abstractions when sampling is not a constraint. We evaluate all metrics in the asymptotic regime on every (valid, invalid) abstraction pair in the benchmark, with full results reported in Appendices [A.1](https://arxiv.org/html/2607.00267#A1.SS1) and [A.2](https://arxiv.org/html/2607.00267#A1.SS2). These results motivate two principled thresholds: after normalization, a metric should score at most 0.10 on a valid abstraction and at least 0.15 on an invalid one. Figure [1](https://arxiv.org/html/2607.00267#S5.F1) uses these thresholds to assess the classification performance of each metric across the benchmark.

Figure 1: Precision, recall, and AUROC of each metric, computed over all valid and invalid abstractions of the benchmark (higher is better). Applicability denotes the fraction of abstractions for which each metric could be evaluated; by design, many metrics are only applicable to a restricted class of systems. Metric normalization, implementation, and hyperparameters are reported in Appendix [F](https://arxiv.org/html/2607.00267#A6).

Coverage (applicability) immediately eliminates a large subset of candidates. Temporal metrics (TrajMSE, DTW, Autocorr, Spectral, SINDy) are defined only for dynamical systems; Symbion requires finite discrete label spaces; $C_{p}$ requires a linear proxy model with a well-defined noise floor; and structural deviation and the causal sensitivity index fail to produce scores on most benchmark systems. Among metrics with broad coverage, information-theoretic scores (IB and CIB Lagrangian) and several functional metrics, such as Sobol indices, structural deviation, and causal sensitivity indices, perform poorly on valid abstractions: they are sensitive to distributional details unrelated to causal consistency and assign high error even to valid abstractions. For invalid abstractions, some metrics attain reasonable precision, but at the cost of low recall. This combination of poor calibration and limited applicability eliminates many candidate metrics from further consideration. Conversely, causal abstraction metrics consistently achieve the highest AUROC and maximum coverage.
#### Detection of Controlled Failure Cases\.
Classification performance under unlimited sampling does not capture whether a metric can detect the specific structural failures that motivate causal abstraction. We therefore turn to six controlled experiments, each designed to expose a subtle violation that is observationally indistinguishable from the valid condition or manifests only under targeted interventions. The six settings are illustrated in Figure [2](https://arxiv.org/html/2607.00267#S5.F2) and described in more detail in Appendix [C](https://arxiv.org/html/2607.00267#A3). Figure [3](https://arxiv.org/html/2607.00267#S5.F3) then reports, for each experiment and metric, whether a Mann–Whitney $U$ test over 100 runs reveals a statistically significant difference between the valid and invalid condition.
[Figure 2 panels, each showing $\mathcal{E}$ (top) and $\mathcal{M}$ (bottom): (1) Hidden confounder ($\mathcal{E}$: $X_{1}$, $X_{2}$, $Y$; $\mathcal{M}$: $x_{1}$, $x_{2}$, $y$, $r$); (2) XOR backup path ($\mathcal{E}$: $X$, $M$, $Y$; $\mathcal{M}$: $x$, $m$, $y$, $a$, $b$, with $m\lor(a\oplus b)$); (3) Wrong intermediate ($\mathcal{E}$: $M=2X$ vs. $M=3X$; $\mathcal{M}$: $m=2x$); (4) Spurious mediator (rm. chain); (5) Unreachable states ($\mathcal{E}$: $M\geq 2$ vs. $M\geq 1$; $\mathcal{M}$: $M=1$ unreachable); (6) Wrong direction (rm. fork).]
Figure 2: Causal graphs and abstraction maps for the six controlled experiments. Each panel shows $\mathcal{E}$ (top, uppercase) and $\mathcal{M}$ (bottom, lowercase). Gray dashed arrows are the abstraction map $\tau$. Red marks what changes in the invalid condition; light gray dashed arrows mark edges present in the valid condition that are removed in the invalid condition. Dashed gray circles are $\Phi$ (unmapped) variables; a dashed red circle is a $\Phi$ variable introduced only in the invalid condition. $X$ and $Y$ correspond to inputs and outputs of the models, respectively. Each variable is assumed to be perturbed by exogenous noise $u$, suppressed in the diagrams for visual clarity; refer to Appendix [C](https://arxiv.org/html/2607.00267#A3) for details.

Figure 3: Controlled failure modes. Each cell reports whether a metric significantly separates the valid and invalid abstraction over 100 runs using a Mann–Whitney $U$ test. CAE is the only family that detects all six targeted invalidities.

The results show that among the metrics that do apply, none of the observational, functional, or information-theoretic criteria achieve consistent discrimination across the six experiments. This is expected: each invalid abstraction is observationally indistinguishable from the valid one or fails only under specific interventional conditions, which are precisely the failure modes that causal abstraction was designed to detect. Within the causal abstraction family, results are more nuanced. IIA only succeeds in detecting the violations in Experiments 4 and 5. Both $\mathrm{CAE}_{\text{NF}}$ variants fail in Experiments 1 and 2; only the CAE variants with faithfulness testing detect the violations in all six experiments.
Together, the benchmark results and controlled experiments narrow the candidate set to four metrics. IIA is excluded because, despite broad coverage, it fails to detect the violations in Experiments 3 and 6 where other causal abstraction metrics succeed. The remaining candidates are $\mathrm{CAE}_{\downarrow}$ and $\mathrm{CAE}_{\uparrow}$, which we carry forward as the primary metrics. Their faithfulness-free counterparts, $\mathrm{CAE}_{\downarrow\text{NF}}$ and $\mathrm{CAE}_{\uparrow\text{NF}}$, are retained as ablations to isolate the contribution of faithfulness testing.
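To make the role of faithfulness testing concrete, here is a toy reading of the XOR backup path setting, computed by exact enumeration rather than sampling. It is a sketch under assumptions that are not taken from Appendix C: the micro output is $y = m \lor (a \oplus b)$, the macro model is $Y = M$ with $\tau$ the identity on $m$ and $y$, and the unmapped variables $a$ and $b$ both default to false when they are not perturbed.

```python
from itertools import product


def micro_output(m, a, b):
    """Assumed micro mechanism: y = m OR (a XOR b)."""
    return m or (a != b)


def macro_output(M):
    """Assumed macro mechanism: Y = M."""
    return M


def consistency_error(perturb_unmapped):
    """Exact mean mismatch over interventions on m (and optionally a, b)."""
    if perturb_unmapped:
        ab_settings = list(product([False, True], repeat=2))
    else:
        ab_settings = [(False, False)]     # assumed unperturbed default
    errors = [
        int(macro_output(m) != micro_output(m, a, b))
        for m in (False, True)
        for a, b in ab_settings
    ]
    return sum(errors) / len(errors)


error_nf = consistency_error(perturb_unmapped=False)
error_faithful = consistency_error(perturb_unmapped=True)
print(error_nf, error_faithful)
```

Without perturbing $a$ and $b$, the backup path never fires and the two models agree on every intervention; once the unmapped variables are resampled, the mismatch appears whenever $m$ is false and $a \neq b$.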
#### Statistical Power\.
Having narrowed the candidate set to four metrics, we ask how many interventions each requires to reliably detect an invalid abstraction. Convergence results are reported in Appendix [A.3](https://arxiv.org/html/2607.00267#A1.SS3). Figure [4](https://arxiv.org/html/2607.00267#S5.F4) reports statistical power across selected systems and controlled experiments.

Figure 4: How quickly do metrics detect bad abstractions over good ones? For different numbers of sampled interventions $n$, we compute each metric 100 times on selected valid and invalid abstractions. We apply a Mann–Whitney $U$ test to the two sets of scores (valid and invalid) and report the fraction of runs for which the results differ significantly ($p<0.05$). Note that both NF variants overlap in the leftmost and rightmost plots. Full results are given in Appendix [A](https://arxiv.org/html/2607.00267#A1).

Statistical power converges to nearly 100% within 30 interventions for most systems and all four metrics. Two exceptions stand out. For the gas simulation, power increases more slowly, consistent with the higher variance noted above. Experiment 2 (XOR leak) illustrates the usefulness of faithfulness testing: metrics that include it detect the invalid abstraction as $n$ grows, while their faithfulness-free counterparts consistently fail. The noise condition on the GRN system illustrates the value of top-down sampling: $\mathrm{CAE}_{\downarrow}$ and $\mathrm{CAE}_{\downarrow\text{NF}}$ can reliably detect the invalid abstraction after 25–30 interventions, while bottom-up metrics converge more slowly. For Experiment 5 (unreachable intermediate states), the picture reverses: faithfulness-aware metrics converge more slowly and exhibit lower power than plain $\mathrm{CAE}_{\text{NF}}$. $\mathrm{CAE}_{\downarrow\text{NF}}$ already probes unreachable states by sampling from the full abstract label domain; faithfulness augmentation additionally injects perturbations into unmapped $\Phi$ variables, which occasionally push $\mathcal{M}$ into the same micro-region as the unreachable states, inflating the valid-condition score and shrinking the discrimination gap. Faithfulness testing therefore does not miss the invalid abstraction here, but reduces specificity on the valid side.

Figure 5: Minimum number of interventions required to reach 95% detection power as a function of $|\mathcal{V}_{\mathcal{E}}|$, measured on Tracr sort-rank programs of increasing sequence length (100 runs per length). See details in Appendix [A.4](https://arxiv.org/html/2607.00267#A1.SS4).
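One way to read the power computation is sketched below in standard-library Python. Several parts are assumptions: the Mann–Whitney $U$ test uses a normal approximation with tie correction and no continuity correction (in practice one would more likely call `scipy.stats.mannwhitneyu`), the per-intervention error is simulated as a Bernoulli draw with a hypothetical rate `p_err`, and the nesting of runs and repeated trials is one possible interpretation of the protocol.

```python
import math
import random


def mann_whitney_p(xs, ys):
    """Two-sided p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(xs), len(ys)
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    total = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < total:
        j = i
        while j < total and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                          # size of this tie group
        avg_rank = (i + 1 + j) / 2         # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((total + 1) - tie_term / (total * (total - 1)))
    if var <= 0:                           # all scores tied: no evidence
        return 1.0
    z = (u_x - n1 * n2 / 2) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def score(n, p_err, rng):
    """Toy metric: mean of n per-intervention errors, each 1 with prob p_err."""
    return sum(rng.random() < p_err for _ in range(n)) / n


def detection_power(n, p_err, runs=100, trials=50, alpha=0.05, seed=0):
    """Fraction of trials where valid and invalid scores differ significantly."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        valid = [score(n, 0.0, rng) for _ in range(runs)]
        invalid = [score(n, p_err, rng) for _ in range(runs)]
        if mann_whitney_p(valid, invalid) < alpha:
            hits += 1
    return hits / trials


print(detection_power(n=30, p_err=0.3, runs=20, trials=20))
```

Sweeping `n` for a fixed `p_err` yields a power curve of the kind plotted in Figure 4, up to the simplifications listed above.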
#### Scaling\.
Power curves on fixed systems do not reveal how detection cost scales as the high-level model $\mathcal{E}$ grows. To answer this question, we construct a controlled scaling experiment using Tracr (Lindner et al., [2023](https://arxiv.org/html/2607.00267#bib.bib160)), a compiler that translates RASP programs into exact transformer weights. We compile a sort-rank program for each sequence length $\ell\in\{2,3,4,5,6,8,10,12,15\}$, yielding transformers of increasing size that compute the rank of each token in the input sequence. The valid high-level model ($\mathcal{E}$) encodes the correct formula $\mathrm{rank}_{i}=|\{j\neq i:\mathrm{token}_{j}<\mathrm{token}_{i}\}|$ for all positions $i$, while the wrong model encodes a specific incorrect hypothesis: the minimum-valued token at position 0 always receives the maximum rank rather than rank 0. This error triggers exactly when $\mathrm{token}_{0}$ is the minimum element of the sequence, an event with probability $1/\ell$ under uniform sampling of distinct tokens. The expected number of interventions to observe at least one triggering event therefore grows linearly with $\ell$, and analytic calculation gives $n_{95\%}\approx 2.5\ell$ for 95% power.
Figure [5](https://arxiv.org/html/2607.00267#S5.F5) shows the empirically measured required $n$ for CAE and $\mathrm{CAE}_{\text{NF}}$ as a function of $|\mathcal{E}|=2\ell$ (token inputs plus rank outputs). All four metrics follow the predicted linear trend closely. This confirms that the sampling budget is governed primarily by the rarity of the triggering condition in the abstract domain, not by the internal complexity of the transformer. In practice, a wrong explanation that diverges from the correct one on a fraction $p$ of inputs requires $O(1/p)$ interventions to detect with fixed confidence: rare but systematic errors in proposed explanations remain detectable at a cost proportional to their rarity, independent of model depth or width.
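The $O(1/p)$ scaling can be illustrated with a short calculation. The sketch below makes a simplifying assumption that is not the paper's test procedure: it treats detection as observing at least one triggering input among $n$ independent samples, so that the required $n$ is the smallest integer with $1-(1-p)^{n}$ at or above the target level. The function name `n_for_detection` is hypothetical.

```python
import math


def n_for_detection(p, level=0.95):
    """Smallest n with P(at least one trigger in n draws) >= level."""
    return math.ceil(math.log(1 - level) / math.log(1 - p))


# Trigger probability 1/ell for the sequence lengths used in the experiment.
lengths = [2, 3, 4, 5, 6, 8, 10, 12, 15]
required = {ell: n_for_detection(1 / ell) for ell in lengths}
for ell, n in required.items():
    print(ell, n, round(n / ell, 2))
```

Under this simplified model the ratio $n/\ell$ stays between roughly 2 and 3 over these lengths and approaches $\ln 20 \approx 3.0$ for large $\ell$, which is of the same order as the linear rule quoted above; the exact constant depends on how the statistical test aggregates the samples.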
## 6Discussion
This paper asks when a high-level explanation of a low-level system should be trusted. Our benchmark turns this into an empirical stress test: given valid and invalid abstraction pairs, which metrics separate them? The main result is that observational agreement, distribution matching, representational similarity, compression, symbolic fit, and input-output sensitivity can be useful diagnostics, but not validity criteria. Across heterogeneous systems, causal-abstraction metrics are the most reliable. However, our results also show that operationalization matters: not all causal metrics succeed, and different implementations fail in different ways. The proposed CAE metrics, which combine multi-node interventions with built-in faithfulness tests, are the only metrics that pass all tests. This makes CAE a natural objective for explanation discovery: symbolic, gradient-based, evolutionary, and human-guided search procedures all need a criterion that rewards causally valid abstractions. In particular, future work should study adaptive intervention sampling to quickly detect invalid abstractions, CAE-based abstraction discovery, and community extensions of the benchmark. We release the benchmark and all metric implementations on [GitHub](https://github.com/MelouxM/CAE) and [PyPI](https://pypi.org/project/causal-abstraction-eval/) to support this direction.
### 6\.1Limitations
#### Validity of a given triple vs\. discovery of good triples\.
CAE evaluates a *specified* $(\mathcal{M},\mathcal{E},\tau)$; it does not by itself prevent gerrymandered abstraction maps that are made complex enough to pass, the abstraction analogue of the non-linear representation dilemma (Sutter et al., [2025](https://arxiv.org/html/2607.00267#bib.bib62); Méloux et al., [2025a](https://arxiv.org/html/2607.00267#bib.bib214)). Three ingredients mitigate but do not eliminate this: constructive (disjoint) coarse-graining restricts $\tau$ to surjective variable partitions; faithfulness testing penalizes maps that silently assume away causal influence; and multi-node interventions probe interaction terms that single-variable checks miss. A complete account of *which* valid abstractions are also *good* explanations (parsimony, level-appropriateness) is beyond a consistency metric and remains future work.
#### Handling of exogenous noise\.
All macro-models in our benchmark are deterministic, and the reported results use $\tau_{u}$ as the identity. The implementation optionally supports a stochastic high-level model $\mathcal{E}$ that carries an exogenous distribution $P_{\mathcal{U}}$: its noise is marginalized out and consumed by the distributional divergences (JSD, KL, MMD), instantiating the interventional interpretation of CAE (Section [3.1](https://arxiv.org/html/2607.00267#S3.SS1)). The *counterfactual* regime, with a nontrivial coupling $\tau_{u}$ that identifies corresponding exogenous realizations across the micro- and macro-levels, is out of scope and left for future work.
#### Design choices and inductive bias\.
The benchmark necessarily involves design choices that can be debated: systems, valid abstractions, invalid contrasts, and sampled interventions\. We mitigate this by trying to span diverse domains and abstraction types, but future versions should include community\-contributed systems and possibly blind evaluations\.
## Acknowledgements
This work was partially conducted within French research unit UMR 5217 and was partially supported by CNRS \(grant ANR\-22\-CPJ2\-0036\-01\) and by MIAI@Grenoble\-Alpes \(grant ANR\-19\-P3IA\-0003\)\. It was granted access to the HPC resources of IDRIS under the allocation 2025\-AD011014834 made by GENCI\.
## References
- H. Akaike (1974) A new look at the statistical model identification. IEEE transactions on automatic control 19(6), pp. 716–723.
- G. Alain and Y. Bengio (2018) Understanding intermediate layers using linear classifier probes. arXiv:1610.01644. [Link](https://arxiv.org/abs/1610.01644)
- M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2019) Gradient-based attribution methods. Explainable AI: Interpreting, explaining and visualizing deep learning, pp. 169–191.
- K. L. Aw, S. Montariol, B. AlKhamissi, M. Schrimpf, and A. Bosselut (2024) Instruction-tuning aligns llms to the human brain. arXiv:2312.00575. [Link](https://arxiv.org/abs/2312.00575)
- R. Axtell, R. Axelrod, J. M. Epstein, and M. D. Cohen (1996) Aligning simulation models: a case study and results. Computational & Mathematical Organization Theory 1(2), pp. 123–141. ISSN 1572-9346. [Document](https://dx.doi.org/10.1007/BF01299065), [Link](https://doi.org/10.1007/BF01299065)
- R. W. Batterman and C. C. Rice (2014) Minimal model explanations. Philosophy of Science 81(3), pp. 349–376.
- S. Beckers, F. Eberhardt, and J. Y. Halpern (2020) Approximate causal abstractions. In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, R. P. Adams and V. Gogate (Eds.), Proceedings of Machine Learning Research, Vol. 115, pp. 606–615. [Link](https://proceedings.mlr.press/v115/beckers20a.html)
- S. Beckers and J. Y. Halpern (2019) Abstracting causal models. Proceedings of the AAAI Conference on Artificial Intelligence 33(01), pp. 2678–2685. [Link](https://ojs.aaai.org/index.php/AAAI/article/view/4117), [Document](https://dx.doi.org/10.1609/aaai.v33i01.33012678)
- Y. Belinkov (2022) Probing classifiers: promises, shortcomings, and advances. Computational Linguistics 48(1), pp. 207–219. [Link](https://aclanthology.org/2022.cl-1.7/), [Document](https://dx.doi.org/10.1162/coli%5Fa%5F00422)
- A. A. S. Beula, G. Peter, A. Alexander Stonier, K. E. Vignesh, and V. Ganji (2024) Behaviour analysis of modeling and model evaluating methods in system identification for a multiprocess station. Complexity 2024(1), pp. 7741473. [Document](https://dx.doi.org/https%3A//doi.org/10.1155/2024/7741473), [Link](https://onlinelibrary.wiley.com/doi/abs/10.1155/2024/7741473), https://onlinelibrary.wiley.com/doi/pdf/10.1155/2024/7741473
- T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah (2023) Towards monosemanticity: decomposing language models with dictionary learning. Transformer Circuits Thread. Note: https://transformer-circuits.pub/2023/monosemantic-features/index.html
- S. L. Brunton, J. L. Proctor, and J. N. Kutz (2016) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences of the United States of America 113(15), pp. 3932–3937. ISSN 0027-8424. [Link](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4839439/), [Document](https://dx.doi.org/10.1073/pnas.1517384113)
- K. Chalupka, F. Eberhardt, and P. Perona (2016) Multi-level cause-effect systems. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, A. Gretton and C. C. Robert (Eds.), Proceedings of Machine Learning Research, Vol. 51, Cadiz, Spain, pp. 361–369. [Link](https://proceedings.mlr.press/v51/chalupka16.html)
- A. Collins, M. Koehler, and C. Lynch (2024) Methods that support the validation of agent-based models: an overview and discussion. Journal of Artificial Societies and Social Simulation 27(1), pp. 11. ISSN 1460-7425. [Link](http://jasss.soc.surrey.ac.uk/27/1/11.html), [Document](https://dx.doi.org/10.18564/jasss.5258)
- A. Conmy, A. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso (2023) Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36, pp. 16318–16352. [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/34e1dbe95d34d7ebaf99b9bcaeb5b2be-Paper-Conference.pdf)
- J. P. Crutchfield and K. Young (1989) Inferring statistical complexity. Phys. Rev. Lett. 63, pp. 105–108. [Document](https://dx.doi.org/10.1103/PhysRevLett.63.105), [Link](https://link.aps.org/doi/10.1103/PhysRevLett.63.105)
- R. Cukier, C. Fortuin, K. E. Shuler, A. Petschek, and J. H. Schaibly (1973) Study of the sensitivity of coupled reaction systems to uncertainties in rate coefficients. i theory. The Journal of chemical physics 59(8), pp. 3873–3878.
- H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2023) Sparse autoencoders find highly interpretable features in language models. arXiv:2309.08600. [Link](https://arxiv.org/abs/2309.08600)
- J. P. Cunningham and B. M. Yu (2014) Dimensionality reduction for large-scale neural recordings. Nature neuroscience 17(11), pp. 1500–1509.
- J. Dong and J. Zhong (2025) Recent advances in symbolic regression. ACM Comput. Surv. 57(11). ISSN 0360-0300. [Link](https://doi.org/10.1145/3735634), [Document](https://dx.doi.org/10.1145/3735634)
- J. Dyer, N. Bishop, Y. Felekis, F. M. Zennaro, A. Calinescu, T. Damoulas, and M. Wooldridge (2024) Interventionally consistent surrogates for complex simulation models. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37, pp. 21814–21841. [Document](https://dx.doi.org/10.52202/079017-0686), [Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/26b8e3dc3a21fcd660d80c63b767f324-Paper-Conference.pdf)
- B. Efron and C. Stein (1981) The jackknife estimate of variance. The Annals of Statistics, pp. 586–596.
- Y. Elazar, S. Ravfogel, A. Jacovi, and Y. Goldberg (2021) Amnesic probing: behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics 9, pp. 160–175. [Link](https://aclanthology.org/2021.tacl-1.10/), [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00359)
- M. E. Fisher (1998) Renormalization group theory: its basis and formulation in statistical physics. Rev. Mod. Phys. 70, pp. 653–681. [Document](https://dx.doi.org/10.1103/RevModPhys.70.653), [Link](https://link.aps.org/doi/10.1103/RevModPhys.70.653)
- A\. Geiger, D\. Ibeling, A\. Zur, M\. Chaudhary, S\. Chauhan, J\. Huang, A\. Arora, Z\. Wu, N\. Goodman, C\. Potts, and T\. Icard \(2025\)Causal abstraction: a theoretical foundation for mechanistic interpretability\.Journal of Machine Learning Research26\(83\),pp\. 1–64\.External Links:[Link](http://jmlr.org/papers/v26/23-0058.html)Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p3.1)\.
- A\. Geiger, H\. Lu, T\. Icard, and C\. Potts \(2021\)Causal Abstractions of Neural Networks\.arXiv\.Note:arXiv:2106\.02997 \[cs\]External Links:[Link](http://arxiv.org/abs/2106.02997),[Document](https://dx.doi.org/10.48550/arXiv.2106.02997)Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p3.1)\.
- A\. Geiger, Z\. Wu, H\. Lu, J\. Rozner, E\. Kreiss, T\. Icard, N\. Goodman, and C\. Potts \(2022\)Inducing causal structure for interpretable neural networks\.InProceedings of the 39th International Conference on Machine Learning,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvari, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research, Vol\.162,pp\. 7324–7338\.External Links:[Link](https://proceedings.mlr.press/v162/geiger22a.html)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px4.p1.1),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px4.p1.8)\.
- H\. Georgi \(1993\)Effective Field Theory\.Annual Review of Nuclear and Particle Science43\(1\),pp\. 209–252\.External Links:ISSN 0163\-8998, 1545\-4134,[Link](https://www.annualreviews.org/doi/10.1146/annurev.ns.43.120193.001233),[Document](https://dx.doi.org/10.1146/annurev.ns.43.120193.001233)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px3.p1.6)\.
- M\. Geva, J\. Bastings, K\. Filippova, and A\. Globerson \(2023\)Dissecting recall of factual associations in auto\-regressive language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 12216–12235\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.751),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.751)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px4.p1.1)\.
- A\. Ghorbani, A\. Abid, and J\. Zou \(2019\)Interpretation of neural networks is fragile\.Proceedings of the AAAI Conference on Artificial Intelligence33\(01\),pp\. 3681–3688\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/4252),[Document](https://dx.doi.org/10.1609/aaai.v33i01.33013681)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px5.p1.1),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- M\. L\. Glasser \(2002\)Second virial coefficient for a Lennard\-Jones \(2n\-n\) system in d dimensions and confined to a nanotube surface\.Physics Letters A300\(4\),pp\. 381–384\.External Links:ISSN 0375\-9601,[Document](https://dx.doi.org/10.1016/S0375-9601%2802%2900814-9),[Link](https://www.sciencedirect.com/science/article/pii/S0375960102008149)Cited by:[5th item](https://arxiv.org/html/2607.00267#A5.I3.i5.p1.4),[§G\.2](https://arxiv.org/html/2607.00267#A7.SS2.SSS0.Px2.p1.6),[Appendix G](https://arxiv.org/html/2607.00267#A7.p1.2)\.
- A\. Gretton, K\. M\. Borgwardt, M\. J\. Rasch, B\. Schölkopf, and A\. Smola \(2012\)A kernel two\-sample test\.Journal of Machine Learning Research13\(25\),pp\. 723–773\.External Links:[Link](http://jmlr.org/papers/v13/gretton12a.html)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px3.p1.11),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- A\. Gretton, O\. Bousquet, A\. Smola, and B\. Schölkopf \(2005\)Measuring statistical dependence with Hilbert\-Schmidt norms\.InAlgorithmic Learning Theory,S\. Jain, H\. U\. Simon, and E\. Tomita \(Eds\.\),Berlin, Heidelberg,pp\. 63–77\.External Links:ISBN 978\-3\-540\-31696\-1Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px3.p1.11),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- F\. Gritti, L\. Fontana, E\. Gustafson, F\. Pagani, A\. Continella, C\. Kruegel, and G\. Vigna \(2020\)SYMBION: interleaving symbolic with concrete execution\.In2020 IEEE Conference on Communications and Network Security \(CNS\),pp\. 1–10\.External Links:[Document](https://dx.doi.org/10.1109/CNS48642.2020.9162164)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px3.p1.6),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px4.p1.7)\.
- M\. Grosse\-Wentrup, A\. Kumar, A\. Meunier, and M\. Zimmer \(2024\)Neuro\-cognitive multilevel causal modeling: a framework that bridges the explanatory gap between neuronal activity and cognition\.PLOS Computational Biology20\(12\),pp\. 1–32\.External Links:[Document](https://dx.doi.org/10.1371/journal.pcbi.1012674),[Link](https://doi.org/10.1371/journal.pcbi.1012674)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px1.p1.3),[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px2.p1.9),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px4.p1.7)\.
- M\. Hanna, S\. Pezzelle, and Y\. Belinkov \(2024\)Have faith in faithfulness: going beyond circuit overlap when finding model mechanisms\.arXiv preprint arXiv:2403\.17806\.Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px4.p1.1),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px5.p1.2)\.
- J\. V\. Haxby, M\. I\. Gobbini, M\. L\. Furey, A\. Ishai, J\. L\. Schouten, and P\. Pietrini \(2001\)Distributed and overlapping representations of faces and objects in ventral temporal cortex\.Science293\(5539\),pp\. 2425–2430\.Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px1.p1.8)\.
- C\. G\. Hempel \(1965\)Aspects of scientific explanation and other essays in the philosophy of science\.The Free Press,New York\.Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p1.1)\.
- B\. Herman \(2017\)The promise and peril of human evaluation for model interpretability\.arXiv preprint arXiv:1711\.07414\.Cited by:[§4](https://arxiv.org/html/2607.00267#S4.p1.1)\.
- J\. Hewitt and P\. Liang \(2019\)Designing and interpreting probes with control tasks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 2733–2743\.External Links:[Link](https://aclanthology.org/D19-1275/),[Document](https://dx.doi.org/10.18653/v1/D19-1275)Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p2.1)\.
- E\. P\. Hoel, L\. Albantakis, and G\. Tononi \(2013\)Quantifying causal emergence shows that macro can beat micro\.Proceedings of the National Academy of Sciences110\(49\),pp\. 19790–19795\.External Links:[Document](https://dx.doi.org/10.1073/pnas.1314922110),[Link](https://www.pnas.org/doi/abs/10.1073/pnas.1314922110)Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p1.1)\.
- A\. Holtzman, P\. West, and L\. Zettlemoyer \(2025\)Generative models as a complex systems science: how can we make sense of large language model behavior?\.Journal of Social Computing6\(2\),pp\. 75–94\.External Links:[Link](https://www.sciopen.com/article/10.23919/JSC.2025.0009),[Document](https://dx.doi.org/10.23919/JSC.2025.0009)Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p2.1)\.
- J\. Huang, Z\. Wu, C\. Potts, M\. Geva, and A\. Geiger \(2024\)RAVEL: evaluating interpretability methods on disentangling language model representations\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 8669–8687\.External Links:[Link](https://aclanthology.org/2024.acl-long.470/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.470)Cited by:[§4](https://arxiv.org/html/2607.00267#S4.p1.1)\.
- G\. S\. Imai Aldeia, H\. Zhang, G\. Bomarito, M\. Cranmer, A\. Fonseca, B\. Burlacu, W\. G\. La Cava, and F\. O\. de França \(2025\)Call for action: towards the next generation of symbolic regression benchmark\.InProceedings of the Genetic and Evolutionary Computation Conference Companion,GECCO ’25 Companion,New York, NY, USA,pp\. 2529–2538\.External Links:ISBN 9798400714641,[Link](https://doi.org/10.1145/3712255.3734309),[Document](https://dx.doi.org/10.1145/3712255.3734309)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px2.p1.12)\.
- A\. Jacovi and Y\. Goldberg \(2020\)Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 4198–4205\.External Links:[Link](https://aclanthology.org/2020.acl-main.386),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.386)Cited by:[§4](https://arxiv.org/html/2607.00267#S4.p1.1)\.
- E\. Jonas and K\. P\. Kording \(2017\)Could a neuroscientist understand a microprocessor?\.PLoS computational biology13\(1\),pp\. e1005268\.Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p2.1)\.
- A\. Karvonen, C\. Rager, J\. Lin, C\. Tigges, J\. Bloom, D\. Chanin, Y\. Lau, E\. Farrell, C\. McDougall, K\. Ayonrinde, M\. Wearden, A\. Conmy, S\. Marks, and N\. Nanda \(2025\)SAEBench: a comprehensive benchmark for sparse autoencoders in language model interpretability\.External Links:2503\.09532,[Link](https://arxiv.org/abs/2503.09532)Cited by:[§4](https://arxiv.org/html/2607.00267#S4.p1.1)\.
- R\. Katende \(2025\)Causal operator discovery in partial differential equations via counterfactual physics\-informed neural networks\.External Links:2506\.20181,[Link](https://arxiv.org/abs/2506.20181)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px5.p1.1),[§B\.2](https://arxiv.org/html/2607.00267#A2.SS2.SSS0.Px3.p1.3),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px4.p1.7)\.
- A\. Kekić, B\. Schölkopf, and M\. Besserve \(2023\)Targeted reduction of causal models\.arXiv preprint arXiv:2311\.18639\.Cited by:[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px2.p2.6),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px4.p1.8),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px5.p1.2),[§3](https://arxiv.org/html/2607.00267#S3.p1.25)\.
- B\. Kim, M\. Wattenberg, J\. Gilmer, C\. Cai, J\. Wexler, F\. Viegas, and R\. Sayres \(2018\)Interpretability beyond feature attribution: quantitative testing with concept activation vectors \(TCAV\)\.External Links:1711\.11279,[Link](https://arxiv.org/abs/1711.11279)Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px1.p1.8)\.
- P\. Kindermans, S\. Hooker, J\. Adebayo, M\. Alber, K\. T\. Schütt, S\. Dähne, D\. Erhan, and B\. Kim \(2019\)The \(un\)reliability of saliency methods\.InExplainable AI: Interpreting, Explaining and Visualizing Deep Learning,W\. Samek, G\. Montavon, A\. Vedaldi, L\. K\. Hansen, and K\. Müller \(Eds\.\),pp\. 267–280\.External Links:ISBN 978\-3\-030\-28954\-6,[Document](https://dx.doi.org/10.1007/978-3-030-28954-6%5F14),[Link](https://doi.org/10.1007/978-3-030-28954-6_14)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px5.p1.1),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- P\. Kitcher \(1989\)Explanatory unification and the causal structure of the world\.InScientific Explanation,P\. Kitcher and W\. C\. Salmon \(Eds\.\),pp\. 410–505\.Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p1.1)\.
- J\. R\. Koza \(1994\)Genetic programming as a means for programming computers by natural selection\.Statistics and computing4\(2\),pp\. 87–112\.Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px1.p1.8),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- P\. A\. Kragel, L\. Koban, L\. F\. Barrett, and T\. D\. Wager \(2018\)Representation, pattern information, and brain signatures: from neurons to neuroimaging\.Neuron99\(2\),pp\. 257–273\.Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px1.p1.8)\.
- N\. Kriegeskorte, M\. Mur, and P\. A\. Bandettini \(2008\)Representational similarity analysis – connecting the branches of systems neuroscience\.Frontiers in systems neuroscience2,pp\. 249\.Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px1.p1.8),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px3.p1.5)\.
- M\. Landajuela, C\. S\. Lee, J\. Yang, R\. Glatt, C\. P\. Santiago, I\. Aravena, T\. Mundhenk, G\. Mulcahy, and B\. K\. Petersen \(2022\)A unified framework for deep symbolic regression\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 33985–33998\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/dbca58f35bddc6e4003b2dd80e42f838-Paper-Conference.pdf)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px1.p1.8)\.
- K\. Lasri, T\. Pimentel, A\. Lenci, T\. Poibeau, and R\. Cotterell \(2022\)Probing for the usage of grammatical number\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 8818–8831\.External Links:[Link](https://aclanthology.org/2022.acl-long.603/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.603)Cited by:[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px3.p1.5)\.
- H\. Lee and S\. Ghosh \(2009\)Performance of information criteria for spatial models\.Journal of statistical computation and simulation79,pp\. 93–106\.External Links:[Document](https://dx.doi.org/10.1080/00949650701611143)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px2.p1.12),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- D\. Lindner, J\. Kramár, S\. Farquhar, M\. Rahtz, T\. McGrath, and V\. Mikulik \(2023\)Tracr: compiled transformers as a laboratory for interpretability\.Advances in Neural Information Processing Systems36,pp\. 37876–37899\.Cited by:[§5](https://arxiv.org/html/2607.00267#S5.SS0.SSS0.Px4.p1.9)\.
- L\. Ljung \(1999\)System identification\.InWiley Encyclopedia of Electrical and Electronics Engineering\.External Links:ISBN 9780471346081,[Document](https://dx.doi.org/10.1002/047134608X.W1046),[Link](https://onlinelibrary.wiley.com/doi/abs/10.1002/047134608X.W1046)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px1.p1.8),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- T\. Lombrozo \(2006\)The structure and function of explanations\.Trends in cognitive sciences10\(10\),pp\. 464–470\.Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p1.1)\.
- M\. Loreau \(2010\)From populations to ecosystems: theoretical foundations for a new ecological synthesis \(MPB\-46\)\.Princeton University Press\.Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p2.1)\.
- S\. M\. Lundberg and S\. Lee \(2017\)A unified approach to interpreting model predictions\.InAdvances in Neural Information Processing Systems,I\. Guyon, U\. V\. Luxburg, S\. Bengio, H\. Wallach, R\. Fergus, S\. Vishwanathan, and R\. Garnett \(Eds\.\),Vol\.30\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf)Cited by:[§B\.2](https://arxiv.org/html/2607.00267#A2.SS2.SSS0.Px2.p1.11),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px2.p1.9)\.
- C\. L\. Mallows \(1973\)Some comments on Cp\.Technometrics15\(4\),pp\. 661–675\.External Links:ISSN 00401706,[Link](http://www.jstor.org/stable/1267380)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px2.p1.12),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- N\. M\. Mangan, J\. N\. Kutz, S\. L\. Brunton, and J\. L\. Proctor \(2017\)Model selection for dynamical systems via sparse regression and information criteria\.Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences473\(2204\),pp\. 20170009\.External Links:ISSN 1471\-2946,[Link](http://dx.doi.org/10.1098/rspa.2017.0009),[Document](https://dx.doi.org/10.1098/rspa.2017.0009)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px2.p1.12)\.
- T\. McGrath, M\. Rahtz, J\. Kramar, V\. Mikulik, and S\. Legg \(2023\)The hydra effect: emergent self\-repair in language model computations\.External Links:2307\.15771,[Link](https://arxiv.org/abs/2307.15771)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px6.p1.1)\.
- M\. Méloux, G\. Dirupo, F\. Portet, and M\. Peyrard \(2025a\)The dead salmons of AI interpretability\.arXiv preprint arXiv:2512\.18792\.Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px5.p1.1),[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px4.p1.1),[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px6.p1.1),[§1](https://arxiv.org/html/2607.00267#S1.p2.1),[§1](https://arxiv.org/html/2607.00267#S1.p3.1),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px4.p1.8),[§6\.1](https://arxiv.org/html/2607.00267#S6.SS1.SSS0.Px1.p1.2)\.
- M\. Méloux, F\. Portet, and M\. Peyrard \(2025b\)Mechanistic interpretability as statistical estimation: a variance analysis of EAP\-IG\.arXiv preprint arXiv:2510\.00845\.Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px5.p1.1),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in GPT\.Advances in Neural Information Processing Systems35\.Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px4.p1.7)\.
- D\. Meunier, R\. Lambiotte, and E\. T\. Bullmore \(2010\)Modular and hierarchically modular organization of brain networks\.Frontiers in neuroscience4,pp\. 200\.Cited by:[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px5.p1.2)\.
- G\. Monea, M\. Peyrard, M\. Josifoski, V\. Chaudhary, J\. Eisner, E\. Kıcıman, H\. Palangi, B\. Patra, and R\. West \(2024\)A glitch in the matrix? locating and detecting language model grounding with fakepedia\.External Links:2312\.02073,[Link](https://arxiv.org/abs/2312.02073)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px4.p1.1)\.
- M\. D\. Morris \(1991\)Factorial sampling plans for preliminary computational experiments\.Technometrics33\(2\),pp\. 161–174\.Cited by:[§B\.2](https://arxiv.org/html/2607.00267#A2.SS2.SSS0.Px1.p1.5)\.
- K\. A\. Norman, S\. M\. Polyn, G\. J\. Detre, and J\. V\. Haxby \(2006\)Beyond mind\-reading: multi\-voxel pattern analysis of fMRI data\.Trends in cognitive sciences10\(9\),pp\. 424–430\.Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px1.p1.8)\.
- C\. Olah, N\. Cammarata, L\. Schubert, G\. Goh, M\. Petrov, and S\. Carter \(2020\)Zoom in: an introduction to circuits\.Distill\.Note:https://distill\.pub/2020/circuits/zoom\-inExternal Links:[Document](https://dx.doi.org/10.23915/distill.00024.001)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px4.p1.7),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px5.p1.2)\.
- J\. Otsuka and H\. Saigo \(2022\)On the equivalence of causal models: a category\-theoretic approach\.InConference on Causal Learning and Reasoning,pp\. 634–646\.Cited by:[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px1.p1.6)\.
- J\. Pearl \(2009\)Causality: models, reasoning and inference\.2nd edition,Cambridge University Press,USA\.External Links:ISBN 052189560XCited by:[§1](https://arxiv.org/html/2607.00267#S1.p1.1),[§2](https://arxiv.org/html/2607.00267#S2.p1.1)\.
- D\. B\. Percival and A\. T\. Walden \(1993\)Spectral analysis for physical applications\.Cambridge University Press\.Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px4.p1.10),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- B\. K\. Petersen, M\. Landajuela, T\. N\. Mundhenk, C\. P\. Santiago, S\. K\. Kim, and J\. T\. Kim \(2019\)Deep symbolic regression: recovering mathematical expressions from data via risk\-seeking policy gradients\.arXiv preprint arXiv:1912\.04871\.Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px1.p1.8)\.
- T\. Pimentel, J\. Valvoda, R\. H\. Maudslay, R\. Zmigrod, A\. Williams, and R\. Cotterell \(2020\)Information\-theoretic probing for linguistic structure\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 4609–4622\.External Links:[Link](https://aclanthology.org/2020.acl-main.420/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.420)Cited by:[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px3.p1.5)\.
- A\. Potochnik \(2017\)Idealization and the aims of science\.University of Chicago Press,Chicago\.Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p1.1)\.
- A\. Ravichander, Y\. Belinkov, and E\. Hovy \(2021\)Probing the probing paradigm: does probing accuracy entail task relevance?\.InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume,P\. Merlo, J\. Tiedemann, and R\. Tsarfaty \(Eds\.\),Online,pp\. 3363–3377\.External Links:[Link](https://aclanthology.org/2021.eacl-main.295),[Document](https://dx.doi.org/10.18653/v1/2021.eacl-main.295)Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px1.p1.8),[§1](https://arxiv.org/html/2607.00267#S1.p2.1)\.
- M\. T\. Ribeiro, S\. Singh, and C\. Guestrin \(2016\)"Why should i trust you?": explaining the predictions of any classifier\.InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD ’16,New York, NY, USA,pp\. 1135–1144\.External Links:ISBN 9781450342322,[Link](https://doi.org/10.1145/2939672.2939778),[Document](https://dx.doi.org/10.1145/2939672.2939778)Cited by:[§B\.2](https://arxiv.org/html/2607.00267#A2.SS2.SSS0.Px2.p1.11),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px2.p1.9)\.
- E\. F\. Rischel and S\. Weichwald \(2021\)Compositional abstraction error and a category of causal models\.InUncertainty in Artificial Intelligence,pp\. 1013–1023\.Cited by:[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px1.p1.6)\.
- E\. F\. Rischel \(2020\)The category theory of causal models\.Master’s thesis, University of Copenhagen\.Cited by:[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px1.p1.6)\.
- J\. Rissanen \(1978\)Modeling by shortest data description\.Automatica14\(5\),pp\. 465–471\.External Links:ISSN 0005\-1098,[Document](https://dx.doi.org/10.1016/0005-1098%2878%2990005-5),[Link](https://www.sciencedirect.com/science/article/pii/0005109878900055)Cited by:[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- N\. L\. Rochefort, O\. Garaschuk, R\. Milos, M\. Narushima, N\. Marandi, B\. Pichler, Y\. Kovalchuk, and A\. Konnerth \(2009\)Sparsification of neuronal activity in the visual cortex at eye\-opening\.Proceedings of the National Academy of Sciences106\(35\),pp\. 15049–15054\.Cited by:[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px5.p1.2)\.
- F\. E\. Rosas, B\. C\. Geiger, A\. I\. Luppi, A\. K\. Seth, D\. Polani, M\. Gastpar, and P\. A\. M\. Mediano \(2024\)Software in the natural world: a computational approach to hierarchical emergence\.External Links:2402\.09090,[Link](https://arxiv.org/abs/2402.09090)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px5.p1.6)\.
- P\. K\. Rubenstein, S\. Weichwald, S\. Bongers, J\. M\. Mooij, D\. Janzing, M\. Grosse\-Wentrup, and B\. Schölkopf \(2017\)Causal consistency of structural equation models\.InProceedings of the 33rd Conference on Uncertainty in Artificial Intelligence \(UAI\),pp\. ID 11\.Note:\*equal contributionExternal Links:[Link](http://auai.org/uai2017/proceedings/papers/11.pdf)Cited by:[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px1.p1.6),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.p1.1)\.
- M\. Ryskina, G\. Tuckute, A\. Fung, A\. Malkin, and E\. Fedorenko \(2025\)Language models align with brain regions that represent concepts across modalities\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=2JohTFaGbW)Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px1.p1.8)\.
- H\. Sakoe and S\. Chiba \(1978\)Dynamic programming algorithm optimization for spoken word recognition\.IEEE transactions on acoustics, speech, and signal processing26\(1\),pp\. 43–49\.Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px4.p1.10),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- W\. C\. Salmon \(1984\)Scientific explanation and the causal structure of the world\.Princeton University Press\.External Links:ISBN 9780691072937,[Link](http://www.jstor.org/stable/j.ctv173f2gh)Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p1.1)\.
- A\. Saltelli, S\. Tarantola, and K\. Chan \(1999\)A quantitative model\-independent method for global sensitivity analysis of model output\.Technometrics41\(1\),pp\. 39–56\.Cited by:[§B\.2](https://arxiv.org/html/2607.00267#A2.SS2.SSS0.Px1.p1.5)\.
- L\. Sánchez, C\. Chaouiya, and D\. Thieffry \(2008\)Segmenting the fly embryo: logical analysis of the role of the segment polarity cross\-regulatory module\.The International Journal of Developmental Biology52\(8\),pp\. 1059–1075\.External Links:ISSN 0214\-6282,[Document](https://dx.doi.org/10.1387/ijdb.072439ls),[Link](https://ijdb.ehu.eus/article/072439ls)Cited by:[§E\.9](https://arxiv.org/html/2607.00267#A5.SS9.p1.3)\.
- M\. Schmidt and H\. Lipson \(2009\)Distilling free\-form natural laws from experimental data\.Science324\(5923\),pp\. 81–85\.Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px1.p1.8)\.
- M\. Schrimpf, I\. A\. Blank, G\. Tuckute, C\. Kauf, E\. A\. Hosseini, N\. Kanwisher, J\. B\. Tenenbaum, and E\. Fedorenko \(2021\)The neural architecture of language: integrative modeling converges on predictive processing\.Proceedings of the National Academy of Sciences118\(45\),pp\. e2105646118\.Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px1.p1.8)\.
- G\. Schwarz \(1978\)Estimating the dimension of a model\.The annals of statistics,pp\. 461–464\.Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px2.p1.12),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- C\. R\. Shalizi and J\. P\. Crutchfield \(2001\)Computational Mechanics: Pattern and Prediction, Structure and Simplicity\.Journal of Statistical Physics104\(3\),pp\. 817–879\.External Links:ISSN 1572\-9613,[Link](https://doi.org/10.1023/A:1010388907793),[Document](https://dx.doi.org/10.1023/A%3A1010388907793)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px5.p1.6),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px4.p1.7)\.
- F\. N\. F\. Q\. Simoes, M\. Dastani, and T\. van Ommen \(2025\)The causal information bottleneck and optimal causal variable abstractions\.External Links:2410\.00535,[Link](https://arxiv.org/abs/2410.00535)Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px2.p1.4)\.
- I\. Sobol’ \(2001\)Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates\.Mathematics and Computers in Simulation55\(1\),pp\. 271–280\.Note:The Second IMACS Seminar on Monte Carlo MethodsExternal Links:ISSN 0378\-4754,[Document](https://dx.doi.org/10.1016/S0378-4754%2800%2900270-6),[Link](https://www.sciencedirect.com/science/article/pii/S0378475400002706)Cited by:[§B\.2](https://arxiv.org/html/2607.00267#A2.SS2.SSS0.Px1.p1.5),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px2.p1.9)\.
- D\. Sutter, J\. Minder, T\. Hofmann, and T\. Pimentel \(2025\)The non\-linear representation dilemma: is causal abstraction enough for mechanistic interpretability?\.External Links:2507\.08802,[Link](https://arxiv.org/abs/2507.08802)Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p3.1),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px4.p1.8),[§6\.1](https://arxiv.org/html/2607.00267#S6.SS1.SSS0.Px1.p1.2)\.
- A\. Syed, C\. Rager, and A\. Conmy \(2023\)Attribution patching outperforms automated circuit discovery\.External Links:2310\.10348,[Link](https://arxiv.org/abs/2310.10348)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px4.p1.1)\.
- W\. Tenachi, R\. Ibata, and F\. I\. Diakogiannis \(2023\)Deep symbolic regression for physics guided by units constraints: toward the automated discovery of physical laws\.The Astrophysical Journal959\(2\),pp\. 99\.External Links:ISSN 1538\-4357,[Link](http://dx.doi.org/10.3847/1538-4357/ad014c),[Document](https://dx.doi.org/10.3847/1538-4357/ad014c)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px2.p1.12)\.
- D\. Teney, M\. Peyrard, and E\. Abbasnejad \(2022\)Predicting is not understanding: recognizing and addressing underspecification in machine learning\.InComputer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIII,Berlin, Heidelberg,pp\. 458–476\.External Links:ISBN 978\-3\-031\-20049\-6,[Link](https://doi.org/10.1007/978-3-031-20050-2_27),[Document](https://dx.doi.org/10.1007/978-3-031-20050-2%5F27)Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px3.p1.5)\.
- N\. Tishby, F\. C\. Pereira, and W\. Bialek \(2000\)The information bottleneck method\.External Links:physics/0004057,[Link](https://arxiv.org/abs/physics/0004057)Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px2.p1.4),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px3.p1.5)\.
- S\. Udrescu and M\. Tegmark \(2020\)AI feynman: a physics\-inspired method for symbolic regression\.External Links:1905\.11481,[Link](https://arxiv.org/abs/1905.11481)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px2.p1.12)\.
- G\. Vallejo, E\. Tuero\-Herrero, J\. C\. Núñez, and P\. Rosário \(2014\)Performance evaluation of recent information criteria for selecting multilevel models in behavioral and social sciences\.International Journal of Clinical and Health Psychology14\(1\),pp\. 48–57\.External Links:ISSN 1697\-2600,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/S1697-2600%2814%2970036-5),[Link](https://www.sciencedirect.com/science/article/pii/S1697260014700365)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px2.p1.12)\.
- K\. Valogianni and B\. Padmanabhan \(2022\)Causal abms: learning plausible causal models using agent\-based modeling\.InProceedings of The KDD’22 Workshop on Causal Discovery,T\. D\. Le, L\. Liu, E\. Kıcıman, S\. Triantafyllou, and H\. Liu \(Eds\.\),Proceedings of Machine Learning Research, Vol\.185,pp\. 3–29\.External Links:[Link](https://proceedings.mlr.press/v185/valogianni22a.html)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px5.p1.1),[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px3.p1.6),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px1.p1.6)\.
- J\. Vig, S\. Gehrmann, Y\. Belinkov, S\. Qian, D\. Nevo, Y\. Singer, and S\. Shieber \(2020\)Investigating gender bias in language models using causal mediation analysis\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 12388–12401\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px4.p1.7)\.
- S\. Wachter, B\. D\. Mittelstadt, and C\. Russell \(2017\)Counterfactual explanations without opening the black box: automated decisions and the GDPR\.CoRRabs/1711\.00399\.External Links:[Link](http://arxiv.org/abs/1711.00399),1711\.00399Cited by:[§B\.2](https://arxiv.org/html/2607.00267#A2.SS2.SSS0.Px4.p1.8)\.
- T\. D\. Wager, L\. Y\. Atlas, M\. A\. Lindquist, M\. Roy, C\. Woo, and E\. Kross \(2013\)An fmri\-based neurologic signature of physical pain\.New England Journal of Medicine368\(15\),pp\. 1388–1397\.Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px1.p1.8)\.
- W\. Weaver \(1991\)Science and complexity\.InFacets of systems science,pp\. 449–456\.Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p1.1)\.
- J\. Woodcock, P\. G\. Larsen, J\. Bicarregui, and J\. Fitzgerald \(2009\)Formal methods: practice and experience\.ACM Comput\. Surv\.41\(4\)\.External Links:ISSN 0360\-0300,[Link](https://doi.org/10.1145/1592434.1592436),[Document](https://dx.doi.org/10.1145/1592434.1592436)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px3.p1.6)\.
- J\. Woodward \(2004\)Making things happen: a theory of causal explanation\.Oxford University Press\.External Links:ISBN 9780195155273,[Document](https://dx.doi.org/10.1093/0195155270.001.0001),[Link](https://doi.org/10.1093/0195155270.001.0001)Cited by:[§1](https://arxiv.org/html/2607.00267#S1.p1.1)\.
- C\. Yeh, C\. Hsieh, A\. S\. Suggala, D\. I\. Inouye, and P\. Ravikumar \(2019\)On the \(in\)fidelity and sensitivity for explanations\.External Links:1901\.09392,[Link](https://arxiv.org/abs/1901.09392)Cited by:[§B\.2](https://arxiv.org/html/2607.00267#A2.SS2.SSS0.Px2.p1.3),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px2.p1.9)\.
- R\. Yuste \(2008\)Circuit neuroscience: the road ahead\.Frontiers in neuroscience2,pp\. 1038\.Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px4.p1.1)\.
- H\. Zenil, N\. A\. Kiani, F\. S\. Abrahão, and J\. N\. Tegnér \(2020\)Algorithmic Information Dynamics\.Scholarpedia15\(7\),pp\. 53143\.Note:revision \#203921External Links:[Document](https://dx.doi.org/10.4249/scholarpedia.53143)Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px3.p1.3)\.
- H\. Zenil, N\. A\. Kiani, F\. Marabita, Y\. Deng, S\. Elias, A\. Schmidt, G\. Ball, and J\. Tegnér \(2019\)An algorithmic information calculus for causal discovery and reprogramming systems\.iScience19,pp\. 1160–1172\.External Links:ISSN 2589\-0042,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.isci.2019.07.043),[Link](https://www.sciencedirect.com/science/article/pii/S2589004219302706)Cited by:[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px3.p1.3),[§B\.3](https://arxiv.org/html/2607.00267#A2.SS3.SSS0.Px3.p1.4),[§2](https://arxiv.org/html/2607.00267#S2.SS0.SSS0.Px3.p1.5)\.
- F\. M\. Zennaro, M\. Drávucz, G\. Apachitei, W\. D\. Widanage, and T\. Damoulas \(2023a\)Jointly learning consistent causal abstractions over multiple interventional distributions\.InConference on Causal Learning and Reasoning,pp\. 88–121\.Cited by:[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px2.p2.6),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px4.p1.8),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px5.p1.2),[§3\.1](https://arxiv.org/html/2607.00267#S3.SS1.SSS0.Px2.p1.5),[§3](https://arxiv.org/html/2607.00267#S3.p1.25)\.
- F\. M\. Zennaro, P\. Turrini, and T\. Damoulas \(2023b\)Quantifying Consistency and Information Loss for Causal Abstraction Learning\.InElectronic proceedings of IJCAI 2023,Vol\.5,pp\. 5750–5757\(en\)\.Note:ISSN: 1045\-0823External Links:[Link](https://www.ijcai.org/proceedings/2023/638),[Document](https://dx.doi.org/10.24963/ijcai.2023/638)Cited by:[§B\.4](https://arxiv.org/html/2607.00267#A2.SS4.SSS0.Px4.p1.1),[§3](https://arxiv.org/html/2607.00267#S3.SS0.SSS0.Px1)\.
- F\. M\. Zennaro \(2022\)Abstraction between structural causal models: a review of definitions and properties\.arXiv preprint arXiv:2207\.08603\.Cited by:[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px1.p1.6)\.
- H\. Zhang, F\. T\. Figueroa, and H\. Hermanns \(2025\)Saliency Maps Give a False Sense of Explanability to Image Classifiers: An empirical evaluation across methods and metrics\.InProceedings of the 16th Asian Conference on Machine Learning,V\. Nguyen and H\. Lin \(Eds\.\),Proceedings of Machine Learning Research, Vol\.260,pp\. 479–494\.External Links:[Link](https://proceedings.mlr.press/v260/zhang25a.html)Cited by:[§B\.1](https://arxiv.org/html/2607.00267#A2.SS1.SSS0.Px5.p1.1)\.
- Y\. Zhu, S\. H\. G\. Mejia, B\. Schölkopf, and M\. Besserve \(2024\)Unsupervised causal abstraction\.InNeurIPS 2024 Causal Representation Learning Workshop,Cited by:[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px2.p2.6),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px4.p1.8),[§2\.1](https://arxiv.org/html/2607.00267#S2.SS1.SSS0.Px5.p1.2),[§3](https://arxiv.org/html/2607.00267#S3.p1.25)\.
## Appendix A Additional results
This section contains multiple additional figures for results described in the main body of the work, as well as a new experiment on convergence rates\.
### A.1 Validity on Correct Abstractions
Figure [6](https://arxiv.org/html/2607.00267#A1.F6) shows the scores produced by each baseline metric when applied to the correct abstractions in our benchmark.
Figure 6: Error scores obtained by baseline metrics when applied to valid abstractions across benchmark systems. Scores are normalized to the $[0,1]$ range, where $0$ denotes no error and $1$ maximal error. Gray cells denote computation failure or NaN results for the metric.

Note that most benchmark systems use an identity-style coarse-graining in which every micro-variable is either mapped or declared internal, leaving the unmapped set $\Phi=\emptyset$. For these systems, the faithfulness test has no $\Phi$-variable to perturb and is therefore vacuous, so the faithful (CAE$_{\uparrow}$, CAE$_{\downarrow}$) and non-faithful (CAE$_{\uparrow\text{NF}}$, CAE$_{\downarrow\text{NF}}$) variants coincide up to sampling noise; their separate entries should not be read as independent evidence. Only the logic circuit and the GRN have a non-empty $\Phi$ and thus a substantive faithfulness test.
### A.2 Discrimination Power on Invalid Abstractions
We report in Figure [7](https://arxiv.org/html/2607.00267#A1.F7) the results of the discrimination measurement in Section [5](https://arxiv.org/html/2607.00267#S5.SS0.SSS0.Px1), applied to all systems (including controlled experiments) and metrics.
Figure 7: Discriminatory power of baseline metrics when applied to invalid abstractions compared to valid ones. For each cell, we report three values (from top to bottom): Cohen’s $d$, computed on the scores obtained by the metric on the invalid vs. valid abstractions; the score obtained on the invalid condition; the significance level, computed on a Mann–Whitney $U$ test applied between the valid and invalid condition scores. Gray cells denote computation failure or NaN results for the metric. Stars represent p-values: $p<0.05$ (\*), $p<0.01$ (\*\*) and $p<0.001$ (\*\*\*).
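The two statistics reported in each cell can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation: the function names `cohens_d` and `mann_whitney_u` and the toy scores are assumptions, Cohen's $d$ uses the pooled sample standard deviation, and the Mann–Whitney $p$-value uses the normal approximation without tie or continuity correction.

```python
import math
from statistics import mean, variance


def cohens_d(invalid, valid):
    """Standardized mean difference (invalid minus valid), pooled sample SD.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    n1, n2 = len(invalid), len(valid)
    pooled_var = ((n1 - 1) * variance(invalid) + (n2 - 1) * variance(valid)) / (n1 + n2 - 2)
    return (mean(invalid) - mean(valid)) / math.sqrt(pooled_var)


def mann_whitney_u(a, b):
    """U statistic of sample a and a two-sided p-value (normal approximation)."""
    # Count pairs where a beats b; ties contribute one half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    n1, n2 = len(a), len(b)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return u, p_value


# Toy (hypothetical) error scores of one metric on valid vs. invalid abstractions.
valid_scores = [0.0, 0.1, 0.2, 0.1]
invalid_scores = [0.8, 0.9, 1.0, 0.9]

d = cohens_d(invalid_scores, valid_scores)
u_stat, p = mann_whitney_u(invalid_scores, valid_scores)
```

For small samples such as this toy case, an exact test (as provided by statistical libraries) would be preferable to the normal approximation.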
### A.3 Power Curves
Figure [8](https://arxiv.org/html/2607.00267#A1.F8) contains the full results of the power experiments in Section [5](https://arxiv.org/html/2607.00267#S5.SS0.SSS0.Px2), computed across all invalid abstractions for all systems.
Figure 8: How quickly do metrics detect bad abstractions over good ones? For different numbers of sampled interventions $n$, we compute each metric 100 times on selected valid and invalid abstractions. We apply a Mann–Whitney $U$ test to the two sets of scores (valid and invalid) and report the fraction of runs for which the results differ significantly ($p<0.05$).

Figure [9](https://arxiv.org/html/2607.00267#A1.F9) shows how the score estimates produced by each causal metric stabilize for different numbers of sampled interventions.
Figure 9: Convergence rate. For different numbers of sampled interventions $n$, we compute each metric 100 times on selected invalid abstractions.

Estimates stabilize quickly across most systems: 30 sampled interventions are generally sufficient to obtain a stable score, although the gas simulation typically requires more samples due to the higher variance of the molecular dynamics estimator. No metric stands out as substantially slower to converge than the others. However, differences between sampling directions emerge. For the GRN with a reversed causal rule, top-down metrics (CAE$_{\downarrow}$ and CAE$_{\downarrow\text{NF}}$), which intervene directly on the abstract states that expose the mis-specification, score higher than their bottom-up counterparts, which only draw micro-states reachable from the input distribution. For the GRN noise condition, bottom-up metrics score higher than top-down ones on valid abstractions: bottom-up sampling generates interventions near the boundaries of the value-map subspaces, where noisy outputs of $\mathcal{M}$ are more likely to be misclassified after abstraction. Top-down sampling stays in the interior of each subspace, reducing the effect of noise. The separation is notably strong in Experiment 1 (hidden confounder): CAE$_{\text{NF}}$ variants score near zero, while faithfulness-augmented variants detect the invalid abstraction (score near one). This demonstrates that faithfulness testing is critical, independently of the sampling direction.
### A.4 Tracr Power Curves
We provide here extended results for the Tracr sort\-rank scaling experiment\.
Figure [10](https://arxiv.org/html/2607.00267#A1.F10) displays the full detection power curves across different sequence lengths. Figure [11](https://arxiv.org/html/2607.00267#A1.F11) provides the minimal sample counts required to achieve a $95\%$ detection threshold.
Figure 10: Detection power vs. the number of sampled interventions $n$ across different sequence lengths for the Tracr sort-rank programs. The curves show the empirical probability of successfully detecting the invalid condition (where $\text{rank}_{0}=\ell-1$ when $\text{token}_{0}$ is the minimum).

Figure 11: Minimum number of sampled interventions required to achieve 95% detection power on the Tracr sort-rank experiment, evaluated across different metrics (rows) and sequence lengths (columns). Numbers between brackets represent Wilson confidence intervals.
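For readers unfamiliar with the Wilson interval mentioned in the caption of Figure 11, the sketch below computes it for a binomial proportion such as an empirical detection power. It is an illustration only: the function name `wilson_interval`, the default `z = 1.96` (a 95% level), and the example counts are assumptions, and how exactly the interval is propagated to the reported sample counts is not specified in this section.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z is the standard-normal quantile (1.96 for roughly 95% coverage).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point round-off at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: the invalid condition was detected in 95 of 100 runs.
low, high = wilson_interval(95, 100)
```

Unlike the naive normal interval, the Wilson interval stays inside $[0,1]$ and remains informative when the observed proportion is close to 0 or 1, which is the regime of interest for a 95% power threshold.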
## Appendix B Extended Background: Metrics of Explanation Validity
We survey the principal classes of metrics used to evaluate whether a computational explanation $\mathcal{E}$ is a valid account of a target system $\mathcal{M}$. Throughout, $u\sim P_{\mathcal{U}}$ denotes an exogenous input drawn from the natural input distribution, and validity is assessed relative to the observable or internal behavior of $\mathcal{M}$. We organize criteria by the type of evidence they require: observational, functional, information-theoretic, or causal.
### B.1 Observational Validity
The most elementary criterion for explanation validity is *observational equivalence*: $\mathcal{E}$ is valid insofar as it reproduces the measurable outputs of $\mathcal{M}$ under the natural input distribution.
#### Pointwise reconstruction accuracy\.
In deterministic or regression settings, observational validity is operationalized as low expected point-to-point discrepancy between the outputs of $\mathcal{E}$ and $\mathcal{M}$. Standard metrics include the *mean squared error* $\mathrm{MSE}=\mathbb{E}_{u}[\|\mathcal{M}(u)-\mathcal{E}(u)\|^{2}]$, the *root mean squared error* $\mathrm{RMSE}$, the *$\ell_{2}$ distance* $L_{2}$, and the *normalized* variant $\mathrm{NMSE}$, which removes the dependence on output scale. The coefficient of determination $R^{2}$ provides an interpretable measure of the fraction of variance explained. These criteria underlie classical system identification [Ljung, [1999](https://arxiv.org/html/2607.00267#bib.bib232)] and general-purpose symbolic regression [Koza, [1994](https://arxiv.org/html/2607.00267#bib.bib244), Schmidt and Lipson, [2009](https://arxiv.org/html/2607.00267#bib.bib243), Petersen et al., [2019](https://arxiv.org/html/2607.00267#bib.bib240), Landajuela et al., [2022](https://arxiv.org/html/2607.00267#bib.bib245)], where model selection is performed by minimizing reconstruction error on held-out data.
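The pointwise metrics above can be sketched for scalar outputs as follows. This is a minimal illustration under stated assumptions: the function name `pointwise_metrics` and the toy outputs are hypothetical, and NMSE is taken to be the MSE divided by the (population) variance of the outputs of $\mathcal{M}$, which is one common normalization and not necessarily the one used in the benchmark.

```python
import math


def pointwise_metrics(m_out, e_out):
    """MSE, RMSE, NMSE and R^2 between paired scalar outputs of M and E.

    Raises ZeroDivisionError if the outputs of M are constant.
    """
    n = len(m_out)
    mse = sum((m - e) ** 2 for m, e in zip(m_out, e_out)) / n
    mean_m = sum(m_out) / n
    var_m = sum((m - mean_m) ** 2 for m in m_out) / n  # population variance
    nmse = mse / var_m
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "nmse": nmse,
        "r2": 1.0 - nmse,  # 1 - SS_res / SS_tot under this normalization
    }


# Toy outputs of the target system M and of the explanation E on four inputs.
scores = pointwise_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

With this choice of normalization, $R^{2}=1-\mathrm{NMSE}$, so the two quantities carry the same information on opposite scales.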
#### Complexity\-regularized accuracy\.
Accuracy-only criteria permit arbitrarily complex explanations that overfit the observed data. Standard practice augments the reconstruction objective with a complexity penalty $\Omega(\mathcal{E})$, yielding criteria such as AIC [Akaike, [1974](https://arxiv.org/html/2607.00267#bib.bib242)], BIC [Schwarz, [1978](https://arxiv.org/html/2607.00267#bib.bib241)], and minimum description length [Mangan et al., [2017](https://arxiv.org/html/2607.00267#bib.bib239), Dong and Zhong, [2025](https://arxiv.org/html/2607.00267#bib.bib238)]. *Mallows’ $C_{p}$* statistic [Mallows, [1973](https://arxiv.org/html/2607.00267#bib.bib25), Vallejo et al., [2014](https://arxiv.org/html/2607.00267#bib.bib220), Beula et al., [2024](https://arxiv.org/html/2607.00267#bib.bib219), Lee and Ghosh, [2009](https://arxiv.org/html/2607.00267#bib.bib223)] provides a related bias–variance decomposition, penalizing the number of free parameters: $C_{p}=\mathrm{RSS}/\hat{\sigma}^{2}-n+2p$, where $\mathrm{RSS}$ is the residual sum of squares, $n$ the number of observations, $p$ the number of parameters, and $\hat{\sigma}^{2}$ an estimate of noise variance. In our setting, $C_{p}$ is applied to a linear proxy fitted between the predictions of $\mathcal{E}$ and the outputs of $\mathcal{M}$ so that it simultaneously penalizes residual mismatch and model complexity in a cross-model comparison: $p$ is set to twice the number of variables in $\mathcal{E}$ (one slope and one intercept per variable), and $\hat{\sigma}^{2}$ is estimated from the residuals of this proxy. Physics-inspired approaches embed domain-specific inductive biases, such as dimensional homogeneity, separability, and known invariances, directly into the admissible hypothesis class, as in AI Feynman [Udrescu and Tegmark, [2020](https://arxiv.org/html/2607.00267#bib.bib251)] and related frameworks [Tenachi et al., [2023](https://arxiv.org/html/2607.00267#bib.bib246), Imai Aldeia et al., [2025](https://arxiv.org/html/2607.00267#bib.bib233)].
We also employ a *SINDy-style evaluator* [Brunton et al., [2016](https://arxiv.org/html/2607.00267#bib.bib116)], which scores an explanation by how well its governing equations predict the empirical trajectory one epoch ahead. A high residual indicates that the governing equations of $\mathcal{E}$ do not faithfully reproduce the system’s epoch-level dynamics.
#### Distributional equivalence\.
When $\mathcal{M}$ is stochastic or the output of interest is a distributional quantity, pointwise comparison is insufficient. Observational validity is then defined by proximity between the output marginals $P_{\mathcal{E}(u)}$ and $P_{\mathcal{M}(u)}$. The *KL divergence* $D_{\mathrm{KL}}(P_{\mathcal{M}}\|P_{\mathcal{E}})$ and the symmetric *Jensen–Shannon divergence* $D_{\mathrm{JS}}$ measure discrepancy in a model-based sense. The *maximum mean discrepancy* $\mathrm{MMD}^{2}(\mathcal{M},\mathcal{E})=\|\mu_{P_{\mathcal{M}}}-\mu_{P_{\mathcal{E}}}\|^{2}_{\mathcal{H}}$, where $\mu_{P}$ is the kernel mean embedding of $P$ in a reproducing kernel Hilbert space $\mathcal{H}$ [Gretton et al., [2012](https://arxiv.org/html/2607.00267#bib.bib29)], provides a nonparametric alternative. The *Hilbert–Schmidt independence criterion* (HSIC) [Gretton et al., [2005](https://arxiv.org/html/2607.00267#bib.bib26)] evaluates statistical dependence between residuals $\mathcal{M}(u)-\mathcal{E}(u)$ and the input $u$; a low HSIC indicates that errors are independent of the input, reflecting global rather than locally compensating agreement. These criteria subsume the *docking* paradigm in agent-based modeling [Axtell et al., [1996](https://arxiv.org/html/2607.00267#bib.bib96), Collins et al., [2024](https://arxiv.org/html/2607.00267#bib.bib101)], where models are accepted as equivalent when they reproduce the same aggregate distributional patterns.
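For discrete output distributions, the KL and Jensen–Shannon divergences above reduce to short sums. The sketch below is illustrative only: the function names and the toy probability vectors are assumptions, distributions are given as aligned lists of probabilities, and natural logarithms are used (so values are in nats).

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions on a shared support (in nats)."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i == 0:
                return math.inf  # p puts mass where q has none
            total += p_i * math.log(p_i / q_i)
    return total


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence, bounded above by log(2)."""
    mixture = [(p_i + q_i) / 2.0 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical output marginals of M and E over three outcomes.
p_model = [0.5, 0.3, 0.2]
p_explanation = [0.4, 0.4, 0.2]

kl_value = kl_divergence(p_model, p_explanation)
js_value = js_divergence(p_model, p_explanation)
```

The KL divergence is asymmetric and becomes infinite when the explanation assigns zero probability to an outcome the system produces, whereas the Jensen–Shannon divergence is always finite, which can make it easier to normalize to a bounded error score.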
#### Temporal fidelity\.
For dynamical systems, observational validity must be assessed over entire trajectories rather than marginal snapshots. *Trajectory MSE* measures the average squared deviation between state sequences $\{x_{t}^{\mathcal{M}}\}$ and $\{x_{t}^{\mathcal{E}}\}$ over a common time horizon. *Dynamic time warping* (DTW) [Sakoe and Chiba, [1978](https://arxiv.org/html/2607.00267#bib.bib28)] generalizes this to temporally misaligned trajectories by computing the minimum-cost elastic alignment: $\mathrm{DTW}(\mathbf{x}^{\mathcal{M}},\mathbf{x}^{\mathcal{E}})=\min_{\pi}\sum_{(i,j)\in\pi}d(x_{i}^{\mathcal{M}},x_{j}^{\mathcal{E}})$, where $\pi$ ranges over admissible warping paths. *Temporal autocorrelation matching* assesses whether $\mathcal{E}$ reproduces the autocorrelation structure $\rho^{\mathcal{M}}(\tau)=\mathbb{E}[x_{t}x_{t+\tau}]$ of $\mathcal{M}$ across lags $\tau$; failure indicates that $\mathcal{E}$ does not capture the system’s memory structure. *Spectral analysis* compares the power spectral densities obtained via the Fourier transform of the respective autocorrelation functions [Percival and Walden, [1993](https://arxiv.org/html/2607.00267#bib.bib27)]; agreement in dominant frequencies, harmonic structure, and spectral entropy indicates that the explanation captures the characteristic temporal modes of $\mathcal{M}$.
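The DTW objective above is typically computed by dynamic programming over prefix alignments. The sketch below shows the textbook recursion for scalar trajectories; the function name `dtw_distance`, the absolute-difference local cost, and the toy trajectories are assumptions for illustration, and no windowing or normalization is applied.

```python
import math


def dtw_distance(traj_m, traj_e, dist=lambda x, y: abs(x - y)):
    """Minimum cumulative cost over monotone warping paths (classic O(n*m) DP)."""
    n, m = len(traj_m), len(traj_e)
    # cost[i][j] = best alignment cost of the first i and first j samples.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(traj_m[i - 1], traj_e[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance only in traj_m
                cost[i][j - 1],      # advance only in traj_e
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# The second trajectory repeats one sample: same shape, different timing.
aligned = dtw_distance([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0])
offset = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

In the first example the repeated sample is absorbed by the warping path, so the distance is zero even though a pointwise trajectory MSE would not even be defined for the unequal lengths; in the second, a constant offset cannot be warped away.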
#### Limitation: equifinality\.
All observational criteria share a fundamental limitation: they are insensitive to the causal mechanisms generating the data. Distinct explanations may induce indistinguishable output distributions, a phenomenon termed *equifinality* [Valogianni and Padmanabhan, [2022](https://arxiv.org/html/2607.00267#bib.bib100), Collins et al., [2024](https://arxiv.org/html/2607.00267#bib.bib101)]. Observationally valid models can violate the causal structure of $\mathcal{M}$ while remaining empirically accurate, leading to explanations that are unstable under distribution shift [Ghorbani et al., [2019](https://arxiv.org/html/2607.00267#bib.bib213), Kindermans et al., [2019](https://arxiv.org/html/2607.00267#bib.bib211), Zhang et al., [2025](https://arxiv.org/html/2607.00267#bib.bib212)] and fragile to mechanistic perturbations [Méloux et al., [2025b](https://arxiv.org/html/2607.00267#bib.bib215), [a](https://arxiv.org/html/2607.00267#bib.bib214), Katende, [2025](https://arxiv.org/html/2607.00267#bib.bib237)].
### B.2 Functional Validity
*Functional validity* strengthens observational equivalence by requiring that $\mathcal{E}$ reproduce not only the outputs of $\mathcal{M}$ but also its *input–output response profile*: how outputs vary as inputs are systematically modified. These criteria are most naturally applied to static or quasi-static input–output systems, though several generalizations to dynamical settings exist.
#### Variance decomposition and global sensitivity\.
*Variance decomposition* decomposes the total output variance $\mathrm{Var}[\mathcal{M}(u)]$ into contributions attributable to individual inputs and their interactions via the ANOVA expansion [Efron and Stein, [1981](https://arxiv.org/html/2607.00267#bib.bib247), Saltelli et al., [1999](https://arxiv.org/html/2607.00267#bib.bib248)]. *Sobol sensitivity indices* [Sobol’, [2001](https://arxiv.org/html/2607.00267#bib.bib9)] formalize this as $S_{i}=\mathrm{Var}[\mathbb{E}[\mathcal{M}(u)\mid u_{i}]]/\mathrm{Var}[\mathcal{M}(u)]$ (first-order) and analogously for higher-order terms (interactions between input coordinates). Under this validity criterion, $\mathcal{E}$ is deemed valid if its Sobol indices $\hat{S}_{i}$ agree with those of $\mathcal{M}$ up to an acceptable tolerance, ensuring global functional coherence in how uncertainty in each input propagates to the output. Related global sensitivity methods, including FAST [Cukier et al., [1973](https://arxiv.org/html/2607.00267#bib.bib249), Saltelli et al., [1999](https://arxiv.org/html/2607.00267#bib.bib248)] and Morris screening [Morris, [1991](https://arxiv.org/html/2607.00267#bib.bib250)], provide computationally cheaper approximations to the same underlying sensitivity structure.
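A first-order Sobol comparison between $\mathcal{M}$ and $\mathcal{E}$ can be sketched with a Monte Carlo "pick-freeze" estimator. Everything specific here is an assumption made for illustration: inputs are taken to be independent and uniform on $[0,1]$, the function names `sobol_first_order` and `sobol_gap` are hypothetical, the estimator is one of several standard variants, and the toy linear models are not benchmark systems.

```python
import random


def sobol_first_order(f, dim, n=20000, seed=0):
    """Monte Carlo estimate of first-order Sobol indices S_i of f.

    Assumes independent inputs, each uniform on [0, 1].
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(dim)] for _ in range(n)]
    b = [[rng.random() for _ in range(dim)] for _ in range(n)]
    f_a = [f(row) for row in a]
    f_b = [f(row) for row in b]
    pooled = f_a + f_b
    f0 = sum(pooled) / len(pooled)
    var = sum((y - f0) ** 2 for y in pooled) / len(pooled)
    indices = []
    for i in range(dim):
        # Rows of `a` with coordinate i replaced by the value from `b`.
        f_ab = [f(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        # f_b and f_ab share only coordinate i, so this estimates Var(E[f | u_i]).
        v_i = sum((yb - f0) * (yab - ya) for yb, yab, ya in zip(f_b, f_ab, f_a)) / n
        indices.append(v_i / var)
    return indices


def sobol_gap(model, explanation, dim, **kwargs):
    """Largest absolute difference between the first-order indices of M and E."""
    s_m = sobol_first_order(model, dim, **kwargs)
    s_e = sobol_first_order(explanation, dim, **kwargs)
    return max(abs(x - y) for x, y in zip(s_m, s_e))


def toy_model(u):          # analytic indices: S_1 = 0.2, S_2 = 0.8
    return u[0] + 2.0 * u[1]


def swapped_explanation(u):  # analytic indices: S_1 = 0.8, S_2 = 0.2
    return 2.0 * u[0] + u[1]


s_model = sobol_first_order(toy_model, dim=2)
gap = sobol_gap(toy_model, swapped_explanation, dim=2)
```

The swapped explanation has the same output distribution as the toy model, so distributional criteria cannot separate them, yet its sensitivity profile is reversed; a tolerance on `sobol_gap` would flag it.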
#### Local attribution fidelity\.
A complementary notion evaluates agreement between $\mathcal{E}$ and $\mathcal{M}$ in a neighborhood of a fixed input $u_{0}$. The *infidelity* metric [Yeh et al., [2019](https://arxiv.org/html/2607.00267#bib.bib187)] is defined as

$$\mathrm{INFD}(\mathcal{E},u_{0})=\mathbb{E}_{\delta u}\!\left[\bigl(\delta u^{\top}\phi(\mathcal{E},u_{0})-[\mathcal{M}(u_{0})-\mathcal{M}(u_{0}-\delta u)]\bigr)^{2}\right],$$

where $\phi(\mathcal{E},u_{0})$ is the attribution vector of $\mathcal{E}$ at $u_{0}$ and $\delta u$ is drawn from a perturbation distribution. Low infidelity indicates that the explanation’s attributions accurately predict the actual output changes of $\mathcal{M}$ under local perturbations. We adapt infidelity to compare output sensitivities between $\mathcal{E}$ and $\mathcal{M}$ directly, without requiring attribution vectors. LIME [Ribeiro et al., [2016](https://arxiv.org/html/2607.00267#bib.bib177)] constructs a locally faithful surrogate by minimizing a locally weighted reconstruction error around $u_{0}$, and SHAP [Lundberg and Lee, [2017](https://arxiv.org/html/2607.00267#bib.bib180)] defines attributions via Shapley values, which satisfy a consistent axiomatic decomposition of the output into per-feature contributions; both can be interpreted as proxies for local attribution fidelity.
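One plausible reading of the attribution-free adaptation is to replace the attribution term $\delta u^{\top}\phi(\mathcal{E},u_{0})$ with the explanation's own output change $\mathcal{E}(u_{0})-\mathcal{E}(u_{0}-\delta u)$. The sketch below implements that reading for scalar outputs; it is an assumption rather than the paper's exact definition, and the function name `sensitivity_infidelity`, the Gaussian perturbations, and the toy models are all hypothetical.

```python
import random


def sensitivity_infidelity(model, explanation, u0, n=1000, scale=0.1, seed=0):
    """Mean squared gap between the local output changes of E and M around u0.

    Perturbations are i.i.d. Gaussian with standard deviation `scale`
    (an illustrative choice of perturbation distribution).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        delta = [rng.gauss(0.0, scale) for _ in u0]
        shifted = [x - d for x, d in zip(u0, delta)]
        change_m = model(u0) - model(shifted)
        change_e = explanation(u0) - explanation(shifted)
        total += (change_e - change_m) ** 2
    return total / n


def toy_model(u):
    return u[0] ** 2 + u[1]


def partial_explanation(u):  # ignores the first input entirely
    return u[1]


u0 = [1.0, 2.0]
score_self = sensitivity_infidelity(toy_model, toy_model, u0)
score_partial = sensitivity_infidelity(toy_model, partial_explanation, u0)
```

Because only differences of outputs enter the score, explanations that differ from $\mathcal{M}$ by a constant offset are not penalized under this reading.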
#### Structural deviation and causal sensitivity index\.
While standard sensitivity indices quantify statistical influence under passive sampling, a complementary family of metrics evaluates the sensitivity of explanatory validity to the structural parameters of $\mathcal{E}$ itself [Katende, [2025](https://arxiv.org/html/2607.00267#bib.bib237)]. The *structural deviation* metric perturbs each parameter of $\mathcal{E}$ by a small amount and measures the resulting change in alignment score, identifying which parameters are most important for the explanation’s validity. The *causal sensitivity index* applies a harder test: each parameter is zeroed out in turn, and the drop in alignment score is recorded, providing a global decomposition of validity across the components of $\mathcal{E}$. Together, these metrics support model diagnosis by distinguishing robust explanatory components from those whose perturbation or removal would invalidate the abstraction.
#### Counterfactual and relational fidelity\.
*Counterfactual validity* [Wachter et al., [2017](https://arxiv.org/html/2607.00267#bib.bib90)] requires that $\mathcal{E}$ reproduce the minimal perturbations $\delta u$ that change the class label or output regime of $\mathcal{M}$: the decision boundaries of $\mathcal{E}$ must coincide with those of $\mathcal{M}$ in the relevant neighborhood. *Relational fidelity* [Collins et al., [2024](https://arxiv.org/html/2607.00267#bib.bib101)] measures the Pearson correlation between $\Delta_{\mathcal{E}}(u,u^{\prime})=\mathcal{E}(u^{\prime})-\mathcal{E}(u)$ and $\Delta_{\mathcal{M}}(u,u^{\prime})=\tau_{\mathcal{Y}}(\mathcal{M}(u^{\prime}))-\tau_{\mathcal{Y}}(\mathcal{M}(u))$ over sampled pairs $(u,u^{\prime})$, computed independently per output dimension and averaged. A score of 0 (after normalization) indicates perfect agreement in relational structure.
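Relational fidelity can be sketched for a single scalar output dimension as follows. The normalization $(1-r)/2$, which maps a Pearson correlation $r=1$ to an error of 0 and $r=-1$ to an error of 1, is an assumption (the text only states that 0 means perfect agreement), as are the function names, the input sampler, and the toy systems.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def relational_fidelity_error(model, explanation, tau_y, sample_input,
                              n_pairs=500, seed=0):
    """Error in [0, 1] from the correlation of pairwise output differences."""
    rng = random.Random(seed)
    delta_e, delta_m = [], []
    for _ in range(n_pairs):
        u, u_prime = sample_input(rng), sample_input(rng)
        delta_e.append(explanation(u_prime) - explanation(u))
        delta_m.append(tau_y(model(u_prime)) - tau_y(model(u)))
    return (1.0 - pearson(delta_e, delta_m)) / 2.0


def draw_input(rng):
    return rng.uniform(-1.0, 1.0)


def toy_model(u):            # micro-level output
    return 3.0 * u + 1.0


def output_map(y):           # stand-in for tau_Y (identity here)
    return y


error_good = relational_fidelity_error(toy_model, lambda u: 3.0 * u + 1.0,
                                       output_map, draw_input)
error_reversed = relational_fidelity_error(toy_model, lambda u: -u,
                                           output_map, draw_input)
```

For multi-dimensional outputs, the same computation would be repeated per output dimension and the resulting scores averaged, as described above.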
#### Limitation: structural underdetermination\.
Functional criteria remain agnostic to internal mechanisms. Distinct models can exhibit identical global sensitivity profiles, local attributions, and counterfactual boundaries while relying on entirely different underlying structures. Functional validity therefore does not uniquely identify the causal organization of $\mathcal{M}$, and multiple mechanistically incompatible explanations may be simultaneously admissible.
### B.3 Information-Theoretic and Representational Validity
Another class of criteria requires that $\mathcal{E}$ reproduce not only the input–output behavior of $\mathcal{M}$ but also the structure of its internal representations: how information is encoded, compressed, and transmitted across the system.
#### Representational alignment and probing accuracy\.
A first approach tests whether the abstract variables posited by $\mathcal{E}$ are linearly decodable from the internal states of $\mathcal{M}$. In the case of neural networks, given a representation $h(u)\in\mathbb{R}^{d}$ extracted from a nominated layer of $\mathcal{M}$, *probing accuracy* measures the performance of a linear classifier or regressor trained to predict an explanatory variable from $h(u)$ [Alain and Bengio, [2018](https://arxiv.org/html/2607.00267#bib.bib186)]. For real neural networks, neuroscience commonly refers to these practices as multivariate pattern analysis [Haxby et al., [2001](https://arxiv.org/html/2607.00267#bib.bib15), Norman et al., [2006](https://arxiv.org/html/2607.00267#bib.bib22)] or brain signatures [Wager et al., [2013](https://arxiv.org/html/2607.00267#bib.bib21), Kragel et al., [2018](https://arxiv.org/html/2607.00267#bib.bib20)]. High probing accuracy indicates that the feature is explicitly represented in $\mathcal{M}$’s internal geometry [Ravichander et al., [2021](https://arxiv.org/html/2607.00267#bib.bib172), Belinkov, [2022](https://arxiv.org/html/2607.00267#bib.bib14)]. For systems with continuous identity value maps, where all samples share a single abstract label, probing accuracy is trivially maximal and reflects the smoothness of the micro-model’s representation rather than the discriminability of abstract categories. Supervised concept-level probing is instantiated by Concept Activation Vectors [Kim et al., [2018](https://arxiv.org/html/2607.00267#bib.bib87)]; unsupervised variants based on sparse autoencoders (SAEs) [Cunningham et al., [2023](https://arxiv.org/html/2607.00267#bib.bib270), Bricken et al., [2023](https://arxiv.org/html/2607.00267#bib.bib273)] recover a dictionary of monosemantic features without requiring labeled concepts.
Representational similarity analysis (RSA) [Kriegeskorte et al., [2008](https://arxiv.org/html/2607.00267#bib.bib271)] generalizes this by comparing representational dissimilarity matrices between $\mathcal{E}$ and $\mathcal{M}$ [Schrimpf et al., [2021](https://arxiv.org/html/2607.00267#bib.bib272)], without committing to a specific linear structure [Aw et al., [2024](https://arxiv.org/html/2607.00267#bib.bib24), Ryskina et al., [2025](https://arxiv.org/html/2607.00267#bib.bib274)].
#### Information bottleneck and causal information bottleneck Lagrangians\.
The *information bottleneck* (IB) framework characterizes valid representations as those achieving an optimal trade-off between compression of the input and preservation of information relevant to the output [Tishby et al., [2000](https://arxiv.org/html/2607.00267#bib.bib121)]. However, standard IB relies on observational mutual information, rendering it sensitive to spurious correlations in the input distribution. The *causal information bottleneck* (CIB) [Simoes et al., [2025](https://arxiv.org/html/2607.00267#bib.bib94)] strengthens this by replacing observational relevance with causal relevance measured under interventions over inputs, so that $\mathcal{E}$ is assessed by its capacity to preserve causally invariant predictive information. Operationally, the IB Lagrangian is evaluated using the observational mutual information $I(\mathcal{E}(u);\mathcal{M}(u))$, whereas the CIB Lagrangian replaces this with the interventional relevance $H(Y)-H_{c}(Y\mid\mathrm{do}(T))$; this distinction is relevant when the input distribution contains spurious correlations that increase the apparent observational relevance of $\mathcal{E}$.
#### Complexity shift\.
*Algorithmic information dynamics* (AID) [Zenil et al., [2019](https://arxiv.org/html/2607.00267#bib.bib225), [2020](https://arxiv.org/html/2607.00267#bib.bib98), Zenil\_Kiani\_Tegnér\_2023] evaluates explanations in terms of changes in algorithmic complexity induced by interventions. The *complexity shift* metric is defined as the discrepancy between the change in Kolmogorov complexity $K$ of the output under perturbation of $\mathcal{M}$ versus the corresponding change in $\mathcal{E}$:

$$\Delta K_{\mathcal{M}}(u,\delta u)=K(\mathcal{M}(u+\delta u))-K(\mathcal{M}(u)),\qquad\mathrm{CS}(\mathcal{E})=\mathbb{E}\!\left[\bigl|\Delta K_{\mathcal{M}}-\Delta K_{\mathcal{E}}\bigr|\right].$$

Low complexity shift indicates that $\mathcal{E}$ produces mechanistically comparable transformations under perturbation. In practice, Kolmogorov complexity is approximated via compression algorithms or block decomposition methods [Zenil et al., [2019](https://arxiv.org/html/2607.00267#bib.bib225)], introducing estimation variance.
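The compression-based approximation mentioned above can be sketched with a general-purpose compressor. This illustration uses the length of a zlib-compressed output as a crude stand-in for $K$; that proxy, the function names, the byte-string encoding of outputs, and the toy systems are all assumptions, and the block decomposition method is not shown.

```python
import zlib


def approx_complexity(data: bytes) -> int:
    """Compressed length in bytes, a rough upper-bound proxy for K(data)."""
    return len(zlib.compress(data, 9))


def complexity_shift(model, explanation, input_pairs):
    """Mean |Delta K_M - Delta K_E| over (u, perturbed u) input pairs.

    Both systems must return their output encoded as bytes.
    """
    total = 0
    for u, u_perturbed in input_pairs:
        delta_k_m = approx_complexity(model(u_perturbed)) - approx_complexity(model(u))
        delta_k_e = approx_complexity(explanation(u_perturbed)) - approx_complexity(explanation(u))
        total += abs(delta_k_m - delta_k_e)
    return total / len(input_pairs)


def toy_model(u: int) -> bytes:
    # Output is a periodic byte pattern whose period depends on the input.
    return bytes((i * u) % 256 for i in range(512))


def constant_explanation(u: int) -> bytes:
    # Ignores the input, so its complexity never changes under perturbation.
    return bytes(512)


pairs = [(1, 2), (3, 7), (5, 64)]
shift_self = complexity_shift(toy_model, toy_model, pairs)
shift_constant = complexity_shift(toy_model, constant_explanation, pairs)
```

Compressed lengths include format overhead and are insensitive to small structural changes, which is one concrete source of the estimation variance noted above.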
#### Limitation: mechanistic underdetermination\.
Information\-theoretic criteria capture *what information* is present and how it is organized across variables, but not *how* that information is causally transformed by the system’s dynamics\. Distinct mechanisms can yield identical probing accuracies, identical IB optima, and identical complexity\-shift profiles \[Méloux et al\., [2025a](https://arxiv.org/html/2607.00267#bib.bib214)\]\. Moreover, most of these criteria, except in their explicitly causal variants, are grounded in observational regimes \[Teney et al\., [2022](https://arxiv.org/html/2607.00267#bib.bib259)\]\.
### B\.4Causal and Interventional Validity
The strictest class of criteria requires that $\mathcal{E}$ correctly reproduce the behavior of $\mathcal{M}$ under *active interventions*\. Informally, let $\mathcal{I}$ be a class of interventions on $\mathcal{M}$, and let $\mathcal{M}^{(i)}$ and $\mathcal{E}^{(i)}$ denote the respective post\-intervention systems for $i\in\mathcal{I}$\. Then $\mathcal{E}$ is *interventionally valid* relative to $\mathcal{I}$ if

$$P_{\mathcal{M}^{(i)}}(Y)\;\approx\;P_{\mathcal{E}^{(i)}}(\hat{Y}),\quad\forall\,i\in\mathcal{I},$$

where $Y$ and $\hat{Y}$ are corresponding output variables under a fixed variable correspondence\. Unlike observational criteria, this requirement is closed under the causal operator: agreement must hold across a family of intervention regimes\.
#### Behavioral causal consistency and macroscopic invariance\.
A prerequisite for interventional validity is that the abstraction map $\tau$ must be well\-defined with respect to the dynamics: distinct micro\-states sharing the same abstract label must produce the same abstract output under intervention, regardless of the micro\-realization\. In the Neuro\-Cognitive Multilevel Causal Modeling framework \[Grosse\-Wentrup et al\., [2024](https://arxiv.org/html/2607.00267#bib.bib120)\], this is formalized as *behavioral causal consistency* \(BCC\): for every pair $x_{1},x_{2}\in\tau^{-1}(c)$, $P(B[t]\mid\mathrm{do}(X[t]=x_{1}))=P(B[t]\mid\mathrm{do}(X[t]=x_{2}))$\. The same requirement appears independently in statistical physics under the name *macroscopic invariance*: a coarse\-graining is admissible only when all micro\-configurations within a macro\-state are observationally equivalent at the macro\-level \[Fisher, [1998](https://arxiv.org/html/2607.00267#bib.bib217)\]\. BCC and macroscopic invariance differ in scope: macroscopic invariance tests each variable in isolation, intervening on a single variable while leaving others unspecified, whereas BCC tests each variable within a complete system state drawn from the joint distribution\. Operationally, these are assessed by sampling pairs of micro\-states with the same abstract label and verifying that their abstracted outputs of $\mathcal{M}$ agree under intervention\.
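The pairwise check described above can be sketched as follows, under the simplifying assumption that behavior under $\mathrm{do}(X=x)$ is deterministic, so that comparing distributions reduces to comparing values\. The names `bcc_violations`, `tau`, `consistent`, and `inconsistent` are illustrative and do not come from the benchmark code\.

```python
import itertools

def bcc_violations(micro_states, tau, behavior_under_do):
    """Pairs of micro-states with the same abstract label but different behavior."""
    by_label = {}
    for x in micro_states:
        by_label.setdefault(tau(x), []).append(x)
    bad = []
    for group in by_label.values():
        for x1, x2 in itertools.combinations(group, 2):
            if behavior_under_do(x1) != behavior_under_do(x2):
                bad.append((x1, x2))
    return bad

# Toy setting (assumption): 8 micro-states coarse-grained into 2 labels.
def tau(x):
    return x // 4

def consistent(x):      # behavior depends only on the abstract label
    return int(x >= 4)

def inconsistent(x):    # behavior depends on a detail that tau discards
    return x % 2

ok = bcc_violations(range(8), tau, consistent)
broken = bcc_violations(range(8), tau, inconsistent)
```

In the stochastic case, the equality test would be replaced by a distance between estimated output distributions and a tolerance\.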
#### Dynamic causal consistency\.
The *dynamic causal consistency* \(DCC\) criterion, introduced in the Neuro\-Cognitive Multilevel Causal Modeling framework for dynamical systems \[Grosse\-Wentrup et al\., [2024](https://arxiv.org/html/2607.00267#bib.bib120)\], requires that interventions on $\mathcal{E}$ correctly predict the next\-step internal states of $\mathcal{M}$\. Formally, DCC is satisfied if $\mathcal{E}^{(i)}(u_{t})\approx h_{\mathcal{M}}(u_{t+1})$ for all interventions $i\in\mathcal{I}$ and time steps $t$, where $h_{\mathcal{M}}(\cdot)$ denotes the corresponding internal state of $\mathcal{M}$\. DCC thus enforces step\-wise causal coherence between the explanation’s dynamics and those of the target system\. It is a dynamical extension of behavioral causal consistency \(BCC\): where BCC requires that same\-label micro\-states produce identical behavioral outputs under intervention, DCC additionally requires that micro\-states sharing the same abstract label at time $t$ also share one at $t+1$ under the system dynamics\. We evaluate this using exact label matching on the abstracted next state\.
#### Constructive and formal verification\.
The strongest guarantees arise in settings where $\mathcal{E}$ can be analytically derived from or formally verified against $\mathcal{M}$\. In physics, renormalization group methods and effective field theory \[Fisher, [1998](https://arxiv.org/html/2607.00267#bib.bib217), Georgi, [1993](https://arxiv.org/html/2607.00267#bib.bib216)\] construct coarse\-grained explanations that provably preserve long\-distance or low\-energy observables\. In computer science, formal verification \[Woodcock et al\., [2009](https://arxiv.org/html/2607.00267#bib.bib210)\] establishes that an implementation $\mathcal{M}$ satisfies a high\-level specification $\mathcal{E}$ for *all* inputs, yielding universal behavioral equivalence guarantees\. *Symbion* \[Gritti et al\., [2020](https://arxiv.org/html/2607.00267#bib.bib221)\] provides a useful source of inspiration through its use of agreement between symbolic and concrete execution as a consistency check\. Adapting this idea to the explanation setting, we treat the explanatory model $\mathcal{E}$ as an abstract representation of the target model $\mathcal{M}$ and measure validity by the extent to which their behaviors agree\. In our evaluation, we implement a discrete analog suited to finite abstract domains: we exhaustively enumerate all combinations of abstract input labels, execute both models, and measure the fraction of input combinations for which they disagree on any output\. In causal agent\-based modeling \[Valogianni and Padmanabhan, [2022](https://arxiv.org/html/2607.00267#bib.bib100)\], explanations are constructed to satisfy domain\-theoretic structural constraints\.
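The exhaustive enumeration described above can be sketched as follows\. This is an illustration on a hypothetical half\-adder, not the benchmark implementation: `disagreement_rate`, `spec`, `wrong_spec`, and `nand_impl` are assumed names, and the low\-level model is taken to be already abstracted to Boolean outputs\.

```python
from itertools import product

def disagreement_rate(explanation, abstracted_model, domains):
    """Fraction of abstract input combinations on which any output differs."""
    combos = list(product(*domains))
    n_bad = sum(explanation(*c) != abstracted_model(*c) for c in combos)
    return n_bad / len(combos)

def nand(p, q):
    return 1 - (p & q)

def nand_impl(a, b):
    """Low-level half-adder built from NAND gates; returns (sum, carry)."""
    n1 = nand(a, b)
    s = nand(nand(a, n1), nand(b, n1))
    return s, nand(n1, n1)

def spec(a, b):
    """High-level explanation: XOR for sum, AND for carry."""
    return a ^ b, a & b

def wrong_spec(a, b):
    """Invalid explanation: OR instead of XOR for sum."""
    return a | b, a & b

domains = [(0, 1), (0, 1)]
rate_valid = disagreement_rate(spec, nand_impl, domains)
rate_invalid = disagreement_rate(wrong_spec, nand_impl, domains)
```

The invalid explanation disagrees only on the input $(1,1)$, so one of the four combinations fails\. The cost grows with the product of the domain sizes, which is why this check is restricted to finite abstract domains\.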
#### Mechanistic interpretability\.
Developed primarily in AI, mechanistic interpretability \(MI\) aims to reverse\-engineer networks into human\-interpretable algorithms \[Olah et al\., [2020](https://arxiv.org/html/2607.00267#bib.bib140)\]\. MI assumes that for a given behavior of interest, a sparse subset of the network executes the relevant algorithm\. This subset, often referred to as a *circuit*, is a key research target of modern AI interpretability methods \[Vig et al\., [2020](https://arxiv.org/html/2607.00267#bib.bib173), Meng et al\., [2022](https://arxiv.org/html/2607.00267#bib.bib174), Monea et al\., [2024](https://arxiv.org/html/2607.00267#bib.bib253), Kramár et al\., 2024, Conmy et al\., [2023](https://arxiv.org/html/2607.00267#bib.bib137), Geva et al\., [2023](https://arxiv.org/html/2607.00267#bib.bib139), Syed et al\., [2023](https://arxiv.org/html/2607.00267#bib.bib72)\]\. It is loosely inspired by neuroscience, which also seeks to uncover neural circuits underlying observed behaviors \[Yuste, [2008](https://arxiv.org/html/2607.00267#bib.bib138)\]\. A circuit is valid if it reproduces the network output under interventions on the computation, whether or not the intervened components are part of the circuit \[Hanna et al\., [2024](https://arxiv.org/html/2607.00267#bib.bib50)\]\. *Interchange intervention accuracy* \(IIA\) \[Geiger et al\., [2022](https://arxiv.org/html/2607.00267#bib.bib48)\] generalizes this perspective: it requires that pairwise interchange interventions on the network’s internal representations and the high\-level model produce consistent outputs\. Unlike IC \[Zennaro et al\., [2023b](https://arxiv.org/html/2607.00267#bib.bib6)\] and its variants, IIA evaluates interventional consistency without requiring a fully specified abstraction map over variables\.
#### Computational mechanics and $\epsilon$\-machines\.
A more general perspective is provided by computational mechanics, in which valid explanations correspond to *causally closed* macro\-level representations\. The $\epsilon$\-machine framework \[Crutchfield and Young, [1989](https://arxiv.org/html/2607.00267#bib.bib224), Shalizi and Crutchfield, [2001](https://arxiv.org/html/2607.00267#bib.bib117)\] defines the canonical valid explanation as the minimal sufficient statistic for predicting $\mathcal{M}$: the set of *causal states* $\mathcal{S}$ such that future behavior is conditionally independent of the past given $\mathcal{S}$\. An explanation is valid under this criterion if its state space is isomorphic to the $\epsilon$\-machine of $\mathcal{M}$, ensuring that no micro\-level information would improve predictive accuracy\. Related perspectives interpret valid explanations as *programs implemented* by the underlying system, whose macro\-level dynamics are self\-contained and interventionally sufficient \[Rosas et al\., [2024](https://arxiv.org/html/2607.00267#bib.bib191)\]\.
#### Limitations\.
While causal and interventional criteria provide the strongest notion of validity, existing formulations face practical limitations in complex, stochastic settings\. Constructive approaches require strong domain\-specific assumptions\. Mechanistic interpretability operates at the wrong level of abstraction, viewing low\-level components as variables of the high\-level explanation, and thus suffers from causal overdetermination \[McGrath et al\., [2023](https://arxiv.org/html/2607.00267#bib.bib19), Méloux et al\., [2025a](https://arxiv.org/html/2607.00267#bib.bib214)\]\. Finally, computational mechanics and $\epsilon$\-machines are subsumed by the more general framework of causal abstraction\.
## Appendix CControlled Experiments
We describe below the six controlled experiments from Figure[2](https://arxiv.org/html/2607.00267#S5.F2)\. These highlight subtle ways in which abstraction can fail: aside from the strongest causal abstraction metrics, baseline metrics generally fail to detect the errors in these claimed abstractions\.
Each panel in this appendix shows the full structural model, including the exogenous noise variables $u_{\bullet}$\. Every endogenous variable carries its own exogenous noise\. The noise abstraction map $\tau_{u}$ is the identity: each micro\-noise $u_{v}$ maps to the macro\-noise of the corresponding variable\. For $\Phi$ variables \(unmapped by $\tau$\), the noise has no macro counterpart, so $\tau_{u}$ is undefined and the corresponding noise is held at zero under the valid abstraction\. As in Figure [2](https://arxiv.org/html/2607.00267#S5.F2), $\mathcal{E}$ is drawn on top \(uppercase\) and $\mathcal{M}$ on the bottom \(lowercase\), gray dashed arrows are $\tau$, dashed gray circles are $\Phi$ variables, and red marks what changes in the invalid condition\.
\[Figure 12 panels\. Panel 1: $\mathcal{E}$ with nodes $X_{1}$, $Y$, $X_{2}$; $\mathcal{M}$ with nodes $x_{1}$, $y$, $x_{2}$ and an unmapped node $r$, annotated $1+\gamma r$\. Panel 2: $\mathcal{E}$ with nodes $X$, $M$, $Y$; $\mathcal{M}$ with nodes $x$, $m$, $y$, $a$, $b$, annotated $m\lor(a\oplus b)$\. Panel 3: $\mathcal{E}$ with nodes $X$, $M$, $Z$ and equations $M=2X$, $Z=M+3$ versus $M=3X$, $Z=\tfrac{2}{3}M+3$; $\mathcal{M}$ with $m=2x$, $z=m+3$\. Every node carries its own exogenous noise $u_{\bullet}$\.\]

Figure 12: Experiments 1 to 3\.

#### Experiment 1: Hidden confounder\.
The macro\-model posits that initial prey and predator populations determine the final populations \(Lotka\-Volterra dynamics\)\. $\mathcal{M}$ introduces an unmapped resource variable $r$ \(a $\Phi$ variable\) that multiplicatively scales the final prey population output by a factor of $1+\gamma r$, where $\gamma$ is the coupling strength and $r$ is the resource level\. This creates a hidden confounder that the macro\-model cannot account for\. The confounder enters through the exogenous noise $u_{r}$ of the unmapped variable, which has no image under $\tau_{u}$\. A valid metric should detect that the abstraction fails under perturbations to this resource variable\. In the valid condition, $\gamma=0$, so the factor is $1$ and the resource is inert\.
#### Experiment 2: XOR backup path\.
The macro\-model posits $X\to M\to Y$, with two $\Phi$ variables $a,b$ \(declared causally inert\)\. In the valid $\mathcal{M}$, the XOR path is structurally disconnected \($y=m=x$\), so perturbing $a$ and $b$ has no effect regardless of their values\. In the invalid $\mathcal{M}$, $y$ is computed as $m\lor(a\oplus b)$, where $m$ is the micro\-variable corresponding to $M$: a hidden XOR backup path to $y$ fires only when the $\Phi$ variables are independently perturbed through their noise $u_{a},u_{b}$\. This experiment specifically tests faithfulness: structural metrics that do not probe $\Phi$ variables cannot detect the violation\.
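A small sketch of this experiment, assuming Boolean variables and identity mechanisms along $X\to M\to Y$ as stated above\. The function names are illustrative\. With the $\Phi$ noise held at zero the invalid $\mathcal{M}$ is indistinguishable from the macro\-model; the violation only appears when $a$ and $b$ are perturbed\.

```python
from itertools import product

def micro_valid(x, a=0, b=0):
    m = x
    return m                # y = m; a and b are disconnected

def micro_invalid(x, a=0, b=0):
    m = x
    return m | (a ^ b)      # hidden XOR backup path into y

def macro(x):
    return x                # X -> M -> Y with identity mechanisms

# Observational regime: Phi noise held at zero (a = b = 0).
obs_agree = all(micro_invalid(x) == macro(x) for x in (0, 1))

# Faithfulness probe: perturb the Phi variables as well.
phi_violations = [(x, a, b) for x, a, b in product((0, 1), repeat=3)
                  if micro_invalid(x, a, b) != macro(x)]
valid_violations = [(x, a, b) for x, a, b in product((0, 1), repeat=3)
                    if micro_valid(x, a, b) != macro(x)]
```

The violations occur exactly when $x=0$ and $a\neq b$, which is why a metric that never intervenes on $a$ or $b$ reports no error\.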
#### Experiment 3: Wrong intermediate variable\.
$\mathcal{M}$ implements the chain $x\to m=2x\to z=m+3$\. The valid $\mathcal{E}$ matches this exactly\. The invalid $\mathcal{E}$ posits $m=3x$ with a compensating downstream equation \($z=\tfrac{2}{3}m+3$\), so that $z=2x+3$ in both cases\. The intermediate $m$ is wrong, but the output is identical under observation\.
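The compensation can be checked directly\. The sketch below uses exact rational arithmetic and an identity abstraction map; the `do_m` argument is an illustrative way of writing a hard intervention on the intermediate variable\.

```python
from fractions import Fraction

def micro(x, do_m=None):
    m = 2 * x if do_m is None else do_m
    return m, m + 3

def macro_invalid(x, do_m=None):
    m = 3 * x if do_m is None else do_m
    return m, Fraction(2, 3) * m + 3   # compensating downstream equation

xs = range(5)
# Outputs coincide observationally: z = 2x + 3 in both models.
same_output = all(micro(x)[1] == macro_invalid(x)[1] for x in xs)
# The intermediate differs whenever x != 0.
same_intermediate = all(micro(x)[0] == macro_invalid(x)[0] for x in xs)
# Under do(m = 3) the outputs diverge: 3 + 3 versus (2/3) * 3 + 3.
z_micro = micro(1, do_m=3)[1]
z_macro = macro_invalid(1, do_m=3)[1]
```

Any metric that compares only input\-output behavior sees `same_output` and accepts the abstraction; comparing the intermediate, or intervening on it, exposes the error\.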
\[Figure 13 panels\. Panel 4: $\mathcal{E}$ with nodes $X$, $M$, $Y$ and $\mathcal{M}$ with nodes $x$, $m$, $y$, annotated "valid: $X\to Y$" and "chain"\. Panel 5: same nodes, with thresholds $M\geq 2$ and $M\geq 1$ in $\mathcal{E}$ and the annotation "thr\. $1.5$; $m\in\{0,2\}$ only" in $\mathcal{M}$\. Panel 6: same nodes, annotated "valid: $M\to Y$" and "fork"\. Every node carries its own exogenous noise $u_{\bullet}$\.\]

Figure 13: Experiments 4 to 6\.
#### Experiment 4: Spurious mediator\.
$\mathcal{M}$ is fork\-structured: $x\to m$ and $x\to y$ independently, so $y$ does not depend on $m$\. The invalid $\mathcal{E}$ incorrectly posits the chain $x\to m\to y$\. Specifically, $m=2x+1$ and $y=3x+2$ in $\mathcal{M}$; the invalid chain $\mathcal{E}$ posits $y=1.5m+0.5$, which reproduces $y=3x+2$ observationally but fails under interventions on $m$\.
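A sketch of the fork and the spurious chain with the coefficients given above, again writing a hard intervention on $m$ as an illustrative `do_m` argument\.

```python
def micro_fork(x, do_m=None):
    m = 2 * x + 1 if do_m is None else do_m
    y = 3 * x + 2                    # y depends on x only
    return m, y

def macro_chain(x, do_m=None):
    m = 2 * x + 1 if do_m is None else do_m
    y = 1.5 * m + 0.5                # y routed through m
    return m, y

# Observationally identical: 1.5 * (2x + 1) + 0.5 = 3x + 2.
obs_equal = all(micro_fork(x) == macro_chain(x) for x in range(5))

# Under do(m = 10) with x = 1, the fork ignores m but the chain does not.
y_fork = micro_fork(1, do_m=10)[1]
y_chain = macro_chain(1, do_m=10)[1]
```

Swapping the roles of the two functions, with the chain as $\mathcal{M}$ and the fork as the invalid $\mathcal{E}$, gives the setting of Experiment 6\.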
#### Experiment 5: Unreachable states\.
Both $\mathcal{E}$ and $\mathcal{M}$ share the chain $X\to M\to Y$ with $M\in\{0,1,2,3\}$\. The dynamics of $\mathcal{M}$ only ever produce $M\in\{0,2\}$ from any input $X\in\{0,1\}$\. The valid $\mathcal{E}$ uses the threshold $M\geq 2$; the invalid $\mathcal{E}$ uses $M\geq 1$, so they disagree only at $M=1$, which is never reached from natural inputs\. In micro\-space, $\mathcal{M}$ uses a threshold of $1.5$ to cleanly separate the $M\in\{0,1\}$ subspaces from the $\{2,3\}$ subspaces\.
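A sketch of the unreachable\-state case\. The mechanism $m=2x$ is an assumption chosen so that natural inputs reach only $m\in\{0,2\}$; the text does not specify the exact form\. The thresholds are those stated above\.

```python
def micro(x, do_m=None):
    m = 2 * x if do_m is None else do_m   # assumption: natural m is 0 or 2
    return m, int(m > 1.5)

def macro(x, threshold, do_m=None):
    m = 2 * x if do_m is None else do_m
    return m, int(m >= threshold)

natural_inputs = (0, 1)
valid_obs = all(micro(x) == macro(x, 2) for x in natural_inputs)
invalid_obs = all(micro(x) == macro(x, 1) for x in natural_inputs)

# Forcing the off-distribution state M = 1 separates the two explanations.
y_micro = micro(0, do_m=1)[1]
y_valid = macro(0, 2, do_m=1)[1]
y_invalid = macro(0, 1, do_m=1)[1]
```

Both explanations pass every observational check; only an intervention that sets $M=1$ reveals that the invalid threshold is wrong\.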
#### Experiment 6: Wrong causal direction\.
$\mathcal{M}$ implements the chain $x\to m\to y$ \(so $y$ depends on $m$\)\. The invalid $\mathcal{E}$ posits the fork $x\to m$, $x\to y$ \(treating $x$ as the common cause of both, ignoring $m\to y$\)\. Specifically, $m=2x+1$ and $y=1.5m+0.5$ in $\mathcal{M}$; the invalid fork $\mathcal{E}$ posits $y=3x+2$, which matches $\mathcal{M}$ observationally but fails under interventions on $m$\.
## Appendix DCompositionality of CAE
### D\.1Compositionality
Scientific knowledge is often organized in hierarchies: a coarse macro\-model $\mathcal{M}^{C}$ abstracts a meso\-model $\mathcal{M}^{B}$, which in turn abstracts a micro\-model $\mathcal{M}^{A}$\. The following proposition shows that, under the compatibility assumption stated below \(Assumption [D\.1](https://arxiv.org/html/2607.00267#A4.Thmassumption1)\), CAE degrades gracefully across abstraction chains: intermediate errors accumulate at most additively\.
###### Assumption D\.1\(Compatible intervention distributions\)
The intervention distributions $P_{I}^{A}$, $P_{I}^{B}$, $P_{I}^{C}$ are *compatible* across levels:
- •For $\mathrm{CAE}_{\downarrow}$: the marginal of $\nu_{B}$ induced by $\nu_{C}\sim P_{I}^{C}$, $\nu_{B}\sim P_{\nu_{C}}$ equals $P_{I}^{B}$; the marginal of $\mu_{A}$ induced by $\nu_{B}\sim P_{I}^{B}$, $\mu_{A}\sim P_{\nu_{B}}^{AB}$ equals $P_{I}^{A}$; and for every $\nu_{C}$, the conditional distribution of $\mu_{A}$ induced by $\nu_{B}\sim P_{\nu_{C}}^{BC}$, $\mu_{A}\sim P_{\nu_{B}}^{AB}$ equals $P_{\nu_{C}}^{AC}$, i\.e\. the chain of compatible micro\-interventions composes correctly\.
- •For $\mathrm{CAE}_{\uparrow}$: the pushforward $(\tau_{AB})_{\ast}P_{I}^{A}=P_{I}^{B}$\.
A sufficient condition for the conditional compatibility is that each $P_{I}$ is uniform over intervention subsets and values, and that $\tau_{AB}$, $\tau_{BC}$ are surjections with equal\-measure fibers \(e\.g\., uniform surjections in discrete settings, or measure\-preserving maps in continuous ones\)\.
###### Proposition D\.1\(Compositionality\)
Let $\mathcal{M}^{A}$, $\mathcal{M}^{B}$, $\mathcal{M}^{C}$ be SCMs with abstraction maps $\tau_{AB}$ and $\tau_{BC}$, and let $\tau_{AC}=\tau_{BC}\circ\tau_{AB}$\. Under Assumption [D\.1](https://arxiv.org/html/2607.00267#A4.Thmassumption1) and $D=D_{\mathrm{TV}}$, both variants satisfy

$$\mathrm{CAE}\!\left(\mathcal{M}^{A},\mathcal{M}^{C},\tau_{AC}\right)\;\leq\;\mathrm{CAE}\!\left(\mathcal{M}^{A},\mathcal{M}^{B},\tau_{AB}\right)+\mathrm{CAE}\!\left(\mathcal{M}^{B},\mathcal{M}^{C},\tau_{BC}\right).\tag{3}$$

When $\mathrm{CAE}(\mathcal{M}^{A},\mathcal{M}^{B},\tau_{AB})=0$, the bound is tight and equality holds\.
###### Proof D\.1
We prove each variant in turn; both rely on the triangle inequality for $D_{\mathrm{TV}}$ and the data processing inequality \(DPI\) for deterministic maps\.
#### CAE↓\.
Fix a C\-level intervention $\nu_{C}$, draw a compatible B\-level intervention $\nu_{B}\sim P_{\nu_{C}}$, and draw a compatible A\-level intervention $\mu_{A}\sim P_{\nu_{B}}$\. Since $\tau_{AC}=\tau_{BC}\circ\tau_{AB}$, transitivity of compatibility gives $\mu_{A}\in\tau_{AC}^{-1}(\nu_{C})$\. By the conditional compatibility condition in Assumption [D\.1](https://arxiv.org/html/2607.00267#A4.Thmassumption1), sampling via the chain $\nu_{B}\sim P_{\nu_{C}}^{BC}$, $\mu_{A}\sim P_{\nu_{B}}^{AB}$ yields the correct conditional distribution $P_{\nu_{C}}^{AC}$, so the triple expectation below is precisely $\mathrm{CAE}_{\downarrow}^{AC}$\. Define three distributions over $Y_{C}$:

$$\begin{aligned}
P &:= \mathcal{M}^{C}\!\left(Y_{C}\mid\mathrm{do}_{\nu_{C}}\right),\\
Q &:= \tau_{Y}^{BC}\!\left(\mathcal{M}^{B}\!\left(Y_{B}\mid\mathrm{do}_{\nu_{B}}\right)\right),\\
R &:= \tau_{Y}^{AC}\!\left(\mathcal{M}^{A}\!\left(Y_{A}\mid\mathrm{do}_{\mu_{A}}\right)\right)=\tau_{Y}^{BC}\!\left(\tau_{Y}^{AB}\!\left(\mathcal{M}^{A}\!\left(Y_{A}\mid\mathrm{do}_{\mu_{A}}\right)\right)\right).
\end{aligned}$$

By the triangle inequality, $D_{\mathrm{TV}}(P,R)\leq D_{\mathrm{TV}}(P,Q)+D_{\mathrm{TV}}(Q,R)$\. Since $\tau_{Y}^{BC}$ is deterministic, the DPI gives

$$D_{\mathrm{TV}}(Q,R)\;\leq\;D_{\mathrm{TV}}\!\left(\mathcal{M}^{B}(Y_{B}\mid\mathrm{do}_{\nu_{B}}),\;\tau_{Y}^{AB}\!\left(\mathcal{M}^{A}(Y_{A}\mid\mathrm{do}_{\mu_{A}})\right)\right).$$

Combining and taking expectations over $\nu_{C}\sim P_{I}^{C}$, $\nu_{B}\sim P_{\nu_{C}}$, $\mu_{A}\sim P_{\nu_{B}}$:

$$\mathrm{CAE}_{\downarrow}^{AC}\leq\underbrace{\mathbb{E}_{\nu_{C}}\mathbb{E}_{\nu_{B}\mid\nu_{C}}\bigl[D_{\mathrm{TV}}(P,\,Q)\bigr]}_{(I)}+\underbrace{\mathbb{E}_{\nu_{C}}\mathbb{E}_{\nu_{B}\mid\nu_{C}}\mathbb{E}_{\mu_{A}\mid\nu_{B}}\left[D_{\mathrm{TV}}\!\left(\mathcal{M}^{B}(Y_{B}\mid\mathrm{do}_{\nu_{B}}),\;\tau_{Y}^{AB}\!\left(\mathcal{M}^{A}(Y_{A}\mid\mathrm{do}_{\mu_{A}})\right)\right)\right]}_{(II)}.$$

Term $(I)$ equals $\mathrm{CAE}_{\downarrow}^{BC}$ directly, since $\mu_{A}$ does not appear\. For term $(II)$, Assumption [D\.1](https://arxiv.org/html/2607.00267#A4.Thmassumption1) guarantees that the marginal of $\nu_{B}$ under $(\nu_{C}\sim P_{I}^{C},\,\nu_{B}\sim P_{\nu_{C}})$ is $P_{I}^{B}$, and the marginal of $\mu_{A}$ under $(\nu_{B}\sim P_{I}^{B},\,\mu_{A}\sim P_{\nu_{B}})$ is $P_{I}^{A}$, so term $(II)$ equals $\mathrm{CAE}_{\downarrow}^{AB}$, yielding \([3](https://arxiv.org/html/2607.00267#A4.E3)\)\.
#### CAE↑\.
Draw $\mu_{A}\sim P_{I}^{A}$ and define $\nu_{B}:=\tau_{AB}(\mu_{A})$, $\nu_{C}:=\tau_{BC}(\nu_{B})=\tau_{AC}(\mu_{A})$ deterministically\. Define:

$$\begin{aligned}
P &:= \mathcal{M}^{C}\!\left(Y_{C}\mid\mathrm{do}_{\nu_{C}}\right),\\
Q &:= \tau_{Y}^{BC}\!\left(\mathcal{M}^{B}\!\left(Y_{B}\mid\mathrm{do}_{\nu_{B}}\right)\right),\\
R &:= \tau_{Y}^{AC}\!\left(\mathcal{M}^{A}\!\left(Y_{A}\mid\mathrm{do}_{\mu_{A}}\right)\right).
\end{aligned}$$

By the triangle inequality, $D_{\mathrm{TV}}(P,R)\leq D_{\mathrm{TV}}(P,Q)+D_{\mathrm{TV}}(Q,R)$\. The DPI applied to $\tau_{Y}^{BC}$ gives

$$D_{\mathrm{TV}}(Q,R)\;\leq\;D_{\mathrm{TV}}\!\left(\mathcal{M}^{B}(Y_{B}\mid\mathrm{do}_{\nu_{B}}),\;\tau_{Y}^{AB}\!\left(\mathcal{M}^{A}(Y_{A}\mid\mathrm{do}_{\mu_{A}})\right)\right).$$

Taking expectations over $\mu_{A}\sim P_{I}^{A}$:

$$\mathrm{CAE}_{\uparrow}^{AC}\leq\underbrace{\mathbb{E}_{\mu_{A}}\bigl[D_{\mathrm{TV}}(P,\,Q)\bigr]}_{(I)}+\underbrace{\mathbb{E}_{\mu_{A}}\!\left[D_{\mathrm{TV}}\!\left(\mathcal{M}^{B}(Y_{B}\mid\mathrm{do}_{\nu_{B}}),\;\tau_{Y}^{AB}\!\left(\mathcal{M}^{A}(Y_{A}\mid\mathrm{do}_{\mu_{A}})\right)\right)\right]}_{(II)}.$$

Term $(II)$ equals $\mathrm{CAE}_{\uparrow}^{AB}$ directly\. For term $(I)$, Assumption [D\.1](https://arxiv.org/html/2607.00267#A4.Thmassumption1) gives $(\tau_{AB})_{\ast}P_{I}^{A}=P_{I}^{B}$, so the marginal of $\nu_{B}=\tau_{AB}(\mu_{A})$ is $P_{I}^{B}$, and term $(I)$ equals $\mathrm{CAE}_{\uparrow}^{BC}$, yielding \([3](https://arxiv.org/html/2607.00267#A4.E3)\)\.
#### Tightness\.
We consider both variants\. When $\mathrm{CAE}^{AB}_{TV}=0$, the AB consistency error vanishes almost surely in each case, so $\tau_{Y}^{AB}(\mathcal{M}^{A}(Y_{A}\mid\mathrm{do}_{\mu_{A}}))=\mathcal{M}^{B}(Y_{B}\mid\mathrm{do}_{\nu_{B}})$ a\.s\., and therefore $Q=R$ a\.s\. It follows that $D_{\mathrm{TV}}(P,R)=D_{\mathrm{TV}}(P,Q)$ a\.s\., and taking expectations gives $\mathrm{CAE}^{AC}_{TV}=\mathrm{CAE}^{BC}_{TV}$, so equality holds in \([3](https://arxiv.org/html/2607.00267#A4.E3)\) for both variants\.
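The two inequalities used in the proof can be checked numerically for a single fixed triple of interventions\. The sketch below uses hypothetical discrete output distributions and a hypothetical output map $\tau_{Y}^{BC}(y)=\lfloor y/2\rfloor$; it illustrates the per\-intervention step only and does not compute CAE, which additionally averages over interventions\.

```python
def tv(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)

def pushforward(p, f):
    """Distribution of f(Y) when Y is distributed according to p."""
    out = {}
    for value, weight in p.items():
        out[f(value)] = out.get(f(value), 0.0) + weight
    return out

def tau_y_bc(y_b):
    return y_b // 2                  # hypothetical output map from B to C

# Hypothetical post-intervention output distributions.
p_c = {0: 0.6, 1: 0.4}                          # M^C(Y_C | do nu_C)
p_b = {0: 0.4, 1: 0.1, 2: 0.3, 3: 0.2}          # M^B(Y_B | do nu_B)
p_a_in_b = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}     # tau_Y^AB applied to M^A(Y_A | do mu_A)

P = p_c
Q = pushforward(p_b, tau_y_bc)
R = pushforward(p_a_in_b, tau_y_bc)

ab_error = tv(p_b, p_a_in_b)   # AB consistency error for this intervention
bc_error = tv(P, Q)            # BC consistency error for this intervention
ac_error = tv(P, R)            # AC consistency error for this intervention
```

With these numbers the DPI step gives $D_{\mathrm{TV}}(Q,R)=0.2\leq 0.3$ and the composed error satisfies $0.3\leq 0.1+0.3$, up to floating\-point rounding\.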
## Appendix EBenchmark Systems
For each system, we describe the high\-level model $\mathcal{E}$, the low\-level model $\mathcal{M}$, the abstraction map $\tau$, the invalid conditions used to test discrimination, and key implementation details\. All metrics are evaluated on a fixed number $N$ of samples and averaged over $n_{\text{runs}}$ runs\. Their values, given below, were chosen to keep computational costs similar across systems\.
### E\.1Logic Circuit
A 2\-bit ripple\-carry adder evaluated at the wire level \($\mathcal{M}$\) and the gate level \($\mathcal{E}$\)\. This is the simplest discrete\-to\-discrete abstraction in the benchmark\.
- •$\mathcal{E}$: An SCM with integer\-valued variables: two 2\-bit operands \($\mathtt{A},\mathtt{B}\in\{0,\ldots,3\}$\), a 1\-bit carry\-in, a 1\-bit internal carry, a 2\-bit sum output, and a 1\-bit carry\-out\. Each variable is a deterministic Boolean arithmetic function of its parents\.
- •$\mathcal{M}$: A Boolean netlist simulator over individual wire states\. Each wire carries a float value in $\{0.0,1.0\}$, and gates are simulated by iterating until stable\. Interventions force specific wire values before propagation\.
- •Abstraction map: The coarse\-graining map groups pairs of 1\-bit wires or single wires into variables of $\mathcal{E}$\. The value map maps each integer label to the Cartesian product of unit\-width subspaces: $(-0.1,0.1)$ for bit\-0 and $(0.9,1.1)$ for bit\-1\. Internal gate wires are declared as internal variables and excluded from the $\Phi$ set\. Because every wire is either explicitly coarse\-grained or declared internal, the $\Phi$ set is empty\.
- •Invalid conditions: \(i\) *Fail*: $\mathcal{E}$ replaces XOR with OR in both the internal\-carry and the sum logic\. \(ii\) *Inverted internal*: $\mathcal{E}$ inverts $\mathtt{Internal\_Carries}$ but compensates downstream, so the final output is correct, but the intermediate is wrong\. \(iii\) *Noise*: Gaussian noise \($\sigma=0.4$\) is added to wire voltages\.
- •Implementation details: Metrics are evaluated over $N=1000$ samples and averaged over 100 runs\.
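A simplified sketch of the two levels and the coarse\-graining between them, assuming integer wire values instead of floats and omitting the interval\-based value map\. The wire names \(`s0`, `s1`, `c1`, `cout`\) and macro\-variable names are hypothetical\.

```python
from itertools import product

def full_adder(a, b, c):
    s = a ^ b ^ c
    carry = (a & b) | (c & (a ^ b))
    return s, carry

def micro_adder(a0, a1, b0, b1, cin):
    """Wire-level model: two chained full adders, least significant bit first."""
    s0, c1 = full_adder(a0, b0, cin)
    s1, cout = full_adder(a1, b1, c1)
    return {"s0": s0, "s1": s1, "c1": c1, "cout": cout}

def macro_adder(A, B, cin):
    """Integer-level model over 2-bit operands."""
    total = A + B + cin
    return {"Sum": total % 4,
            "Internal_Carry": (A % 2 + B % 2 + cin) // 2,
            "Carry_Out": total // 4}

def tau(wires):
    """Group wires into macro-variables (pairs of bits become one integer)."""
    return {"Sum": wires["s0"] + 2 * wires["s1"],
            "Internal_Carry": wires["c1"],
            "Carry_Out": wires["cout"]}

# Check commutation of tau with the dynamics on all 32 input combinations.
mismatches = [
    bits for bits in product((0, 1), repeat=5)
    if tau(micro_adder(*bits))
    != macro_adder(bits[0] + 2 * bits[1], bits[2] + 2 * bits[3], bits[4])
]
```

An empty `mismatches` list means the abstraction commutes with the dynamics on every natural input; the invalid conditions listed above would be detected by intervening on the internal carry as well\.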
### E\.2Transistor Circuit
A CMOS half\-adder \(sum \+ carry\) simulated at the SPICE transistor level \($\mathcal{M}$\) and the Boolean gate level \($\mathcal{E}$\)\. This tests a continuous\-to\-discrete abstraction in which thresholding continuous voltages yields Boolean gate values\.
- •$\mathcal{E}$: A Boolean gate network with variables $a,b,\mathtt{sum},\mathtt{carry}\in\{0,1\}$, implementing XOR for sum and AND for carry\.
- •$\mathcal{M}$: A PySpice SPICE simulation of the CMOS half\-adder topology\. The topology uses four NAND gates and one inverter to implement the half\-adder, and produces continuous node voltages for all circuit nodes \(inputs, outputs, internal NAND wires, transistor junctions, and power rail\)\. Internal nodes are declared as internal variables and excluded from the $\Phi$ set\.
- •Abstraction map: The coarse\-graining map associates the $\{a,b,\mathtt{sum},\mathtt{carry}\}$ nodes with the corresponding variables of $\mathcal{E}$\. The value map maps label 0 to voltages in $(-0.5,0.5)$ V and label 1 to $(4.5,5.5)$ V\. Voltages outside these ranges are unmapped\.
- •Invalid conditions: \(i\) *Fail*: $\mathcal{E}$ uses OR instead of XOR for sum\. \(ii\) *Noise*: Gaussian noise \($\sigma=1.5$ V\) is added to all circuit nodes \(inputs, outputs, internal wires, and the power rail\)\.
- •Implementation details: Metrics are evaluated over $N=250$ samples and averaged over 50 runs\.
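The value map above can be sketched as a thresholding function\. The voltage intervals are those stated in the abstraction map; the node readings and the internal node name `n1` are made\-up values for illustration, not simulator output\.

```python
def voltage_to_label(v):
    """Map a node voltage (in volts) to a Boolean label, or None if unmapped."""
    if -0.5 < v < 0.5:
        return 0
    if 4.5 < v < 5.5:
        return 1
    return None

def tau(node_voltages):
    """Abstract only the nodes that have a macro counterpart."""
    return {name: voltage_to_label(node_voltages[name])
            for name in ("a", "b", "sum", "carry")}

def macro_half_adder(a, b):
    return {"a": a, "b": b, "sum": a ^ b, "carry": a & b}

# Hypothetical readings for inputs a = 1, b = 0; "n1" is an internal node.
reading = {"a": 4.97, "b": 0.02, "sum": 4.91, "carry": 0.05, "n1": 2.3}
abstracted = tau(reading)
```

A reading such as 2\.5 V falls in neither interval and is left unmapped, which is how large voltage noise can push states outside the domain of the abstraction\.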
### E\.3Gas Simulation
An $N$\-particle Lennard\-Jones fluid \($\mathcal{M}$\) evaluated against the ideal gas law or Van der Waals equation of state \($\mathcal{E}$\)\. This is an extreme example of global aggregation: every variable of $\mathcal{E}$ aggregates over all $N$ particles\. A detailed case study of this system is provided in Appendix [G](https://arxiv.org/html/2607.00267#A7)\.
- •$\mathcal{E}$: The ideal gas law $PV=NT$ \(reduced units\) or the Van der Waals equation $(P+a\rho^{2})(1/\rho-b)=T$, with causal variables $\{P,V,T\}$\. Unless specified, the causal graph is reduced to the acyclic $(V,T)\to P$ forward macroscopic mapping \(NVT ensemble\)\.
- •$\mathcal{M}$: An $N$\-body molecular dynamics simulation using the 12\-6 Lennard\-Jones potential \($N=128$ particles\), velocity\-Verlet integration, Langevin thermostat, and Berendsen barostat\. Long\-range tail corrections are applied analytically to the pressure virial to account for truncation of the potential at cutoff $r_{c}=3.0\sigma$ in reduced units\. $\mathcal{M}$ outputs the measured $P$, $T$, $V$ alongside particle positions and velocities\. An energy minimization step \(steepest descent, up to 500 iterations\) is applied after lattice initialization to remove particle overlaps before dynamics begin\.
- •Abstraction map: The coarse\-graining map is the identity on $\{P,V,T\}$ \(these are directly computed macroscopic observables returned by the simulator: $T$ via kinetic energy, $P$ via the virial theorem, $V=\text{BoxLength}^{3}$\)\. The value map is the identity\. The underlying particle substrate \(positions, velocities, box length\) is the state $\mathcal{M}$ integrates; it is not part of the macro description and is excluded from $\tau_{X}$, so $\Phi=\emptyset$ and the gas system carries no faithfulness test\.
- •Invalid conditions: \(i\) *Wrong $\alpha$*: the ideal gas $\mathcal{E}$ uses an incorrect temperature exponent\. \(ii\) *High density*: particle density is raised outside the ideal gas regime\. \(iii\) *Low temperature*: temperature is reduced to a regime where the ideal gas law breaks down\.
- •Implementation details: Van der Waals parameters $a,b$ are calibrated against LJ fluid isotherms at the Boyle temperature \($T^{*}\approx 3.42$\) \[Glasser, [2002](https://arxiv.org/html/2607.00267#bib.bib42)\]\. Metrics are evaluated over $N=10$ samples and averaged over 20 runs; a custom sampler initializes lattice configurations and hands equilibration to $\mathcal{M}$\.
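The two equations of state can be written as forward maps $(\rho,T)\to P$ with $\rho=N/V$\. Solving the Van der Waals equation above for $P$ gives $P=\rho T/(1-b\rho)-a\rho^{2}$\. In the sketch below, the values of `a` and `b` are placeholders chosen for illustration, not the calibrated parameters\.

```python
def ideal_pressure(rho, T):
    """Ideal gas law in reduced units: P = rho * T."""
    return rho * T

def vdw_pressure(rho, T, a, b):
    """Van der Waals: (P + a rho^2)(1/rho - b) = T, solved for P."""
    return rho * T / (1.0 - b * rho) - a * rho ** 2

def relative_gap(rho, T, a, b):
    p_ideal = ideal_pressure(rho, T)
    return abs(vdw_pressure(rho, T, a, b) - p_ideal) / p_ideal

A_PLACEHOLDER, B_PLACEHOLDER = 1.0, 0.5   # illustrative values only
T = 2.0
gap_dilute = relative_gap(0.01, T, A_PLACEHOLDER, B_PLACEHOLDER)
gap_dense = relative_gap(0.8, T, A_PLACEHOLDER, B_PLACEHOLDER)
```

The two laws nearly coincide in the dilute regime and separate as density rises, which matches the role of the *High density* invalid condition for the ideal gas explanation\.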
### E\.4Predator\-Prey Dynamics
An agent-based model of predator-prey dynamics ($\mathcal{M}$), with optional spatial, stochastic, and aging variants, evaluated against the Lotka-Volterra ODE system ($\mathcal{E}$). The valid baseline used for calibration is the non-spatial configuration. The abstraction is a continuous identity mapping over total population counts: $\mathcal{M}$ outputs aggregate agent counts, which are the macro-variables.
- •$\mathcal{E}$: The Lotka-Volterra equations, with parameters $\{\alpha,\beta,\delta,\gamma\}$ calibrated against the ideal (non-spatial, deterministic) ABM. The LV equations are integrated via the forward Euler method over 50 steps, matching the ABM's timestep count.
- •$\mathcal{M}$: An agent-based model (ABM) of predator-prey dynamics in which individual prey and predators are tracked. At each timestep, births, predation events, and deaths are computed from per-capita probabilities. The simplest configuration (non-spatial, deterministic rates, no aging) approximates the LV equations and is used for calibration.
- •Abstraction map: The coarse-graining map is the identity on $\{\mathtt{prey\_t},\mathtt{predator\_t},\mathtt{final\_populations}\}$. The value map is the identity.
- •Invalid conditions: (i) *Wrong $\alpha$*: the prey reproduction rate in $\mathcal{E}$ is 1.5 times higher than in $\mathcal{M}$. (ii) *Spatial*: $\mathcal{M}$ uses spatial dynamics on a $20\times 20$ grid, where agents move to and interact with neighboring cells. (iii) *Stochastic*: $\mathcal{M}$ uses stochastic (binomial demographic) reproduction. The noise is mean-unbiased (its expectation equals the deterministic per-capita rate), so this is a zero-mean-shift, variance-only contrastive condition that mean-based baselines under-detect. (iv) *Aging*: agents have finite lifespans. (v) *Noise*: Gaussian noise ($\sigma=5$) on output populations. (vi) *Complex*: all three $\mathcal{M}$ extensions (spatial, stochastic, aging) simultaneously active.
- •Implementation details: Metrics are evaluated over $N=50$ samples and averaged over 100 runs. LV parameters are calibrated via grid search.
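The forward Euler integration of $\mathcal{E}$ described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the assignment of $\alpha,\beta,\delta,\gamma$ to prey growth, predation, predator growth, and predator death follows the usual textbook convention and is an assumption here, as are the function name and the step size `dt`.

```python
import numpy as np

def lotka_volterra_euler(prey0, pred0, alpha, beta, delta, gamma, dt=1.0, n_steps=50):
    """Forward Euler rollout of dx/dt = alpha*x - beta*x*y, dy/dt = delta*x*y - gamma*y.

    Parameter roles and dt are assumptions made for illustration.
    Returns an array of shape (n_steps + 1, 2) with columns (prey, predator).
    """
    traj = np.empty((n_steps + 1, 2), dtype=float)
    traj[0] = (prey0, pred0)
    for t in range(n_steps):
        x, y = traj[t]
        dx = alpha * x - beta * x * y      # prey growth minus predation losses
        dy = delta * x * y - gamma * y     # predator growth minus natural deaths
        traj[t + 1] = (x + dt * dx, y + dt * dy)
    return traj
```

Started at the fixed point $(\gamma/\delta,\ \alpha/\beta)$ the rollout stays constant, which gives a quick sanity check before calibrating against an ABM.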
### E\.5Heat Equation \(1D\)
Brownian particle diffusion ($\mathcal{M}$) is abstracted to a finite-difference heat equation ($\mathcal{E}$). This tests a spatial aggregation abstraction.
- •$\mathcal{E}$: A finite-difference heat equation solver on a uniform grid of $K$ bins, parameterized by the physical diffusion coefficient $\alpha$ (distinct from the solver's internal dimensionless update coefficient $\alpha\,\Delta t/\Delta x^{2}$). The finite-difference solver uses Neumann boundary conditions (zero flux at both walls), consistent with the reflective particle boundaries in $\mathcal{M}$.
- •$\mathcal{M}$: A system of $N=1000$ Brownian particles diffusing in a 1D box $[0,L]$ with reflective boundaries over 200 time steps.
- •Abstraction map: The value map bins particle positions into a normalized density histogram (one bin per cell in $\mathcal{E}$), and grounding samples particle positions from that histogram, treated as a probability density function.
- •Invalid conditions: (i) *Fail*: the diffusion coefficient in $\mathcal{E}$ is set to a wrong value of zero. (ii) *Noise*: Gaussian noise ($\sigma=0.15$) on the outputs of $\mathcal{M}$.
- •Implementation details: Metrics are evaluated over $N=30$ samples and averaged over 50 runs, with $K=50$ bins. The diffusion coefficient $\alpha=0.1$ is set to a known value matching the particle simulation's diffusion constant. Initial conditions are sampled as random Gaussian profiles (mean $\sim\mathrm{U}(0.3,0.7)$, width $\sim\mathrm{U}(0.05,0.15)$), grounded to particle positions and re-abstracted to obtain $\mathcal{E}$-compatible inputs.
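To make the value map and the zero-flux boundary concrete, the sketch below bins positions into a normalized histogram and applies one finite-difference update. It assumes an explicit scheme and ghost cells that copy the edge value; the paper only states that the solver is finite-difference with Neumann boundaries, and both function names are hypothetical.

```python
import numpy as np

def abstract_positions(positions, box_length, n_bins=50):
    """Value map sketch: particle positions -> normalized density histogram."""
    counts, _ = np.histogram(positions, bins=n_bins, range=(0.0, box_length))
    return counts / counts.sum()

def heat_step(u, alpha, dt, dx):
    """One explicit step of u_t = alpha * u_xx with zero-flux (Neumann) walls.

    Explicit time stepping is an assumption; r = alpha*dt/dx**2 is the
    dimensionless update coefficient and should stay <= 0.5 for stability.
    """
    r = alpha * dt / dx**2
    padded = np.pad(u, 1, mode="edge")  # ghost cells equal to the wall cells
    return u + r * (padded[2:] - 2.0 * u + padded[:-2])
```

With edge-copy ghost cells the boundary fluxes cancel, so the total mass of the histogram is preserved from step to step, mirroring the reflective walls of the particle system.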
### E\.6Heat Equation \(2D\)
A phonon lattice model ($\mathcal{M}$) is abstracted to a 2D heat PDE ($\mathcal{E}$). This tests a spatial aggregation abstraction.
- •$\mathcal{E}$: A 2D heat PDE solver with Neumann boundaries and a calibrated diffusion coefficient $\alpha$.
- •$\mathcal{M}$: A $16\times 16$ phonon lattice model with harmonic springs, phonon scattering (rate $p_{\text{scatter}}=0.2$), and $N_{\text{avg}}=10$ simulation averages per sample.
- •Abstraction map: The value map abstracts the kinetic energy map via Gaussian smoothing ($\sigma=0.05\times\text{GridSize}$) and normalization; grounding uses a parameterized Gaussian source. The source parameters $(\mathtt{source\_x},\mathtt{source\_y},\mathtt{source\_E})$ use identity grounding.
- •Invalid conditions: (i) *Fail*: the diffusion coefficient in $\mathcal{E}$ is set to a wrong value of $10\times$ the calibrated value. (ii) *Noise*: Gaussian noise ($\sigma=50$) on the output of $\mathcal{M}$.
- •Implementation details: Metrics are evaluated over $N=5$ samples and averaged over 10 runs. The diffusion coefficient $\alpha$ is calibrated via scalar optimization. The explicit Euler PDE solver uses adaptive sub-stepping: whenever the dimensionless coefficient $\alpha\,\Delta t/\Delta x^{2}$ exceeds the stability threshold of $0.2$, the time step is subdivided until the threshold is met.
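The adaptive sub-stepping rule can be illustrated as follows. This is a sketch under stated assumptions: the number of sub-steps is taken as the smallest integer that brings the coefficient to at most 0.2, and boundaries are handled with edge-copy ghost cells; neither detail is specified in the text, and the function names are hypothetical.

```python
import math
import numpy as np

def n_substeps(alpha, dt, dx, r_max=0.2):
    """Smallest number of equal sub-steps keeping alpha*dt_sub/dx**2 <= r_max."""
    r = alpha * dt / dx**2
    return max(1, math.ceil(r / r_max))

def heat2d_advance(u, alpha, dt, dx, r_max=0.2):
    """Advance a 2D field by dt with explicit Euler, subdividing dt when needed."""
    n_sub = n_substeps(alpha, dt, dx, r_max)
    r_sub = alpha * (dt / n_sub) / dx**2
    for _ in range(n_sub):
        p = np.pad(u, 1, mode="edge")  # zero-flux walls via edge-copy ghost cells
        lap = p[2:, 1:-1] + p[:-2, 1:-1] + p[1:-1, 2:] + p[1:-1, :-2] - 4.0 * u
        u = u + r_sub * lap
    return u
```

The threshold 0.2 sits below the classical 2D explicit-Euler limit of 0.25, so every sub-step remains a convex combination of neighboring cells.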
### E\.7Ising Model
A hybrid molecular-dynamics/Monte-Carlo simulator in which atomic positions can evolve via Newtonian dynamics when the vibrational temperature is $>0$, and spins are updated via Metropolis sweeps with distance-dependent coupling $J_{\text{eff}}=J_{0}\exp(-\lambda|r-r_{0}|)$. The motivating question is whether the rigid-lattice abstraction remains valid once atoms vibrate (which would shift $J_{\text{eff}}$); in this benchmark, we evaluate $\mathcal{M}$ at vibrational temperature $0$, so atoms stay on their lattice sites and $J_{\text{eff}}=J_{0}$, isolating the abstraction's validity for the static lattice.
- •$\mathcal{E}$: A rigid $L\times L$ Ising model simulated via Metropolis-Hastings MCMC with coupling constant $J$. The causal variables are $\{\mathtt{Temperature},\mathtt{ExternalField},\mathtt{PredictedMagnetization}\}$, all continuous.
- •$\mathcal{M}$: A hybrid molecular-dynamics/Monte-Carlo simulator in which atomic positions evolve via Newtonian dynamics when the vibrational temperature is $>0$, and spins are updated via Metropolis sweeps with distance-dependent coupling $J_{\text{eff}}=J_{0}\exp(-\lambda|r-r_{0}|)$.
- •Abstraction map: The coarse-graining map is the identity on all three continuous variables. The value map is the identity.
- •Invalid conditions: (i) *Fail*: the coupling constant in $\mathcal{E}$ is set to $J=2$ (double the true value). (ii) *Noise*: Gaussian noise ($\sigma=0.05$) on the magnetization output of $\mathcal{M}$.
- •Implementation details: $L=8$ and the vibrational temperature is fixed to $0$ for all conditions (rigid lattice), so the invalid conditions are the coupling change ($J=2$) and the output noise above rather than a vibrational-breakdown regime. Metrics are evaluated over $N=30$ samples and averaged over 150 runs. Sampling is performed over $[\mathtt{Temperature},\mathtt{ExternalField}]$.
### E\.8Tracr Transformer
A transformer mechanistically compiled from a RASP sort-rank program ($\mathcal{M}$) is evaluated against the symbolic program itself ($\mathcal{E}$). In this system, $\mathcal{M}$ is a neural network.
- •$\mathcal{E}$: A symbolic sort-rank program over sequences of length $\ell=3$: $\mathtt{rank}_{i}=|\{j\neq i:\mathtt{token}_{j}<\mathtt{token}_{i}\}|$. Variables: $\mathtt{token}_{i}\in\{1,\ldots,5\}$ and $\mathtt{rank}_{i}\in\{0,\ldots,4\}$ (the value map and rank prior use five labels; only $\{0,1,2\}$ are reachable for $\ell=3$).
- •$\mathcal{M}$: The Tracr-compiled JAX transformer, exposing two multi-dimensional micro-variables: $\mathtt{tokens}\in\mathbb{R}^{\ell}$ (float encoding of token indices) and $\mathtt{ranks}\in\mathbb{R}^{\ell}$ (float encoding of rank values).
- •Abstraction map: The coarse-graining map slices $\mathtt{token}_{i}\to\mathtt{tokens}[i]$ and $\mathtt{rank}_{i}\to\mathtt{ranks}[i]$. The value map uses $[(v-1)-0.5,(v-1)+0.5]$ for token label $v$ (tokens are 0-indexed internally: label $v=1\to 0.0$, $v=5\to 4.0$) and $[r-0.5,r+0.5]$ for rank label $r$. Tracr guarantees lossless round-trip encoding: grounding and abstraction are exact inverses for integer labels.
- •Invalid conditions: (i) *Fail*: $\mathcal{E}$ uses $>$ instead of $<$ in the rank comparison (reversed ranks). (ii) *Noise*: Gaussian noise ($\sigma=0.3$) on the residual stream before read-out.
- •Implementation details: Metrics are evaluated over $N=100$ samples and averaged over 150 runs. The sequence length is $\ell=3$ and the vocabulary is $\{1,\ldots,5\}$. Inputs are formed as $[\mathtt{BOS}]+[\mathtt{vocab\_labels}]$; the BOS output position is discarded from the decoded result.
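The symbolic rule for $\mathcal{E}$ and the interval-based value map for tokens are small enough to write out directly. The sketch below is illustrative only: the function names are hypothetical, and resolving a value that falls exactly on a shared interval boundary to the lower label is an assumption not stated in the text.

```python
def sort_rank(tokens):
    """rank_i = number of positions j != i whose token is strictly smaller."""
    return [
        sum(1 for j, tj in enumerate(tokens) if j != i and tj < ti)
        for i, ti in enumerate(tokens)
    ]

def token_interval(v):
    """Micro-value interval assigned to token label v (0-indexed internally)."""
    return ((v - 1) - 0.5, (v - 1) + 0.5)

def abstract_token(x, vocab=(1, 2, 3, 4, 5)):
    """Map a float micro-value to its token label, or None if out of range.

    Boundary ties go to the lower label (an assumption for this sketch).
    """
    for v in vocab:
        lo, hi = token_interval(v)
        if lo <= x <= hi:
            return v
    return None
```

Because the comparison is strict, tied tokens share the same rank, which is why not every permutation of $\{0,1,2\}$ need appear in the output.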
### E\.9Gene Regulatory Network \(GRN\)
The segment polarity gene regulatory network of Sánchez et al. [[2008](https://arxiv.org/html/2607.00267#bib.bib10)] ($\mathcal{M}$) is evaluated against a simple two-variable Boolean causal rule ($\mathcal{E}$). $\mathcal{M}$ is a multi-valued network (not a continuous ODE model) to which we apply one synchronous update step.
- •$\mathcal{E}$: A two-variable Boolean SCM: $\mathtt{wg\_src}\in\{0,1\}$ (Wg signal active?) $\to$ $\mathtt{fz\_tgt}\in\{0,1\}$ (target Fz on?). Rule: $\mathtt{fz\_tgt}=\mathtt{wg\_src}$.
- •Low-level model: One synchronous update step of the 6-cell segment polarity network. Variables are multi-valued (e.g. Wg $\in\{0,1,2\}$), and updates follow GINsim logical parameter semantics. Micro-variables exposed: $\mathtt{wg\_c2}$ (Wg level in cell 2, input) and $\mathtt{fz\_c1},\mathtt{fz\_c3},\mathtt{fz\_c4},\mathtt{fz\_c5}$ (Fz in various cells, outputs).
- •Abstraction map: The coarse-graining map links $[\mathtt{wg\_c2}]\to\mathtt{wg\_src}$ and $[\mathtt{fz\_c1},\mathtt{fz\_c3}]\to\mathtt{fz\_tgt}$ (true neighbors of cell 2). The value map ($\mathtt{WgFzValueMap}$) thresholds: $\mathtt{wg\_src}=1$ iff $\mathtt{wg\_c2}\geq 1.9$ (effectively $=2$ for integer-valued states); $\mathtt{fz\_tgt}=1$ iff any target Fz $>0.5$.
- •Invalid conditions: (i) *Wrong map*: the coarse-graining links $[\mathtt{fz\_c4},\mathtt{fz\_c5}]$ (non-neighbors) instead of $[\mathtt{fz\_c1},\mathtt{fz\_c3}]$. (ii) *Wrong $\mathcal{E}$*: the causal rule is reversed ($\mathtt{fz\_tgt}=1-\mathtt{wg\_src}$). (iii) *Noise*: Gaussian noise ($\sigma=0.4$) on Fz outputs.
- •Implementation details: Metrics are evaluated over $N=200$ samples and averaged over 200 runs. Sampling is performed over $[\mathtt{wg\_src}]$.
### E\.10MOS 6502 CPU
The MOS 6502 CPU, modeled at three discrete abstraction levels: the Visual6502 digital netlist, the manually extracted gate-level implementation, and a corresponding ISA-level simulation. All three levels are discrete and differ in their internal simulator fidelity: the transistor level uses binary switch segments, the gate level uses Boolean gate outputs, and the ISA level uses integer register and flag values (0–255). The causal variables exposed to the abstraction are, at every level, the shared integer register/flag interface (so $a$ and $\tau_{X}$ are the identity on these values; see the abstraction-map note below); individual transistor segments or gate signals are never themselves abstraction variables. We evaluate abstractions between those three layers independently.
- •Highest\-level model \(ISA\):The ISA semantics of selected 6502 instructions, expressed as a causal SCM over register variables \(accumulator, index registers, status flags\) implemented by the ISA simulator \([ivop/fake6502](https://github.com/ivop/fake6502), commit 09fc542\)\.
- •Intermediate\-level model \(gate\):A gate\-level simulator \([ivop/break6502](https://github.com/ivop/break6502), commit 922af64\), which implements the decoder PLA and datapath at the Boolean gate level\.
- •Lowest\-level model \(transistor\):A transistor\-level digital logic simulator \([mist64/perfect6502](https://github.com/mist64/perfect6502), commit 6c1f1a5\), extracted directly from the 6502 CPU via the[Visual6502](http://visual6502.org/)project\. This implements a segment model in which each transistor is a binary switch\.
- •Abstraction maps: All simulators share the same interface: integer register values 0–255 per variable. The coarse-graining map and value maps are both effectively the identity on these integer values. All simulators mask the processor status register P with $\mathtt{0xCF}$ (clearing bits 4 and 5) before comparison, since bit 4 (the B flag) only reflects how P was pushed and is not a persistent flag, and bit 5 is unused and hardwired to 1. Without this mask, all status-affecting instructions would appear to disagree.
- •Invalid conditions: (i) *Broken gate $\leftrightarrow$ ISA*: accumulator bit 7 is stuck at 0 in $\mathcal{M}$. (ii) *Broken transistor $\leftrightarrow$ gate*: the same fault applied at the transistor level. (iii) *Broken transistor $\leftrightarrow$ ISA*: the same accumulator bit 7 stuck-at-0 fault at the transistor level, evaluated against the ISA high-level model.
- •Implementation details:The abstraction is evaluated on a reduced set of single\-instruction tests covering implied and immediate addressing modes\. Metrics are evaluated overN=200N=200samples and averaged over 100 runs \(per condition\)\.
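The status-register masking and the stuck-at-0 fault are both single bitwise operations. The helper names below are hypothetical and the snippet is only a sketch of the comparison logic, not code from any of the three simulators.

```python
P_MASK = 0xCF  # 0b1100_1111: clears bit 4 (B flag) and bit 5 (unused)

def masked_status(p):
    """Status register P with bits 4 and 5 cleared before comparison."""
    return p & P_MASK

def status_matches(p_low, p_high):
    """True when two status bytes agree on all persistent flags."""
    return masked_status(p_low) == masked_status(p_high)

def stuck_at_zero_bit7(a):
    """Illustrative fault model: accumulator bit 7 forced to 0."""
    return a & 0x7F
```

For instance, the bytes `0x21` and `0x31` differ only in bit 4 and therefore compare equal after masking, whereas a difference in the N flag (bit 7) is still detected.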
## Appendix FBaseline Metric Implementation Details
We document design choices, normalization functions, and known limitations for the baseline metrics. All metrics receive the same sampler, $n$ samples, and output variable list as the CAE variants.
#### Observational metrics (MSE, RMSE, NMSE, $R^{2}$, $L^{2}$, JSD, KL, MMD, HSIC, VarDecomp).
These are evaluated via a task wrapper that restricts interventions to root variables only and compares outputs at the designated output nodes. For the asymmetric or normalized metrics (KL, NMSE, $R^{2}$, $L^{2}$), the high-level model $\mathcal{E}$ is the normalization reference: KL computes $\mathrm{KL}(\mathcal{E}\,\|\,\mathcal{M})$, and NMSE, $R^{2}$, and $L^{2}$ divide by the variance (resp. squared norm) of $\mathcal{E}$'s outputs.
JSD/KL: Uses adaptive histogram binning into $\lceil\log_{2}n+1\rceil$ bins, capped at 20. KL adds Laplace smoothing ($\varepsilon=10^{-10}$). When samples collapse to a point mass, JSD returns 0. The adaptive bin count depends on $n$, so the metric definition (and not only its estimate) varies with sample size; the power curves of JSD and KL therefore conflate estimator variance with this definitional drift.
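A minimal sketch of the binning rule and a histogram JSD follows. Several details are assumptions made for illustration and are not stated in the text: both samples are binned over a shared pooled range, the logarithm is base 2 (so the score lies in $[0,1]$), and the point-mass case is detected as a zero pooled range.

```python
import math
import numpy as np

def adaptive_bins(n, cap=20):
    """Bin count ceil(log2(n) + 1), capped at `cap`."""
    return min(cap, math.ceil(math.log2(n) + 1))

def jsd_histogram(x, y):
    """Jensen-Shannon divergence (base 2) between two 1D samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    if hi == lo:                      # all samples collapse to one point
        return 0.0
    k = adaptive_bins(min(len(x), len(y)))
    p, _ = np.histogram(x, bins=k, range=(lo, hi))
    q, _ = np.histogram(y, bins=k, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                  # 0 * log(0) is treated as 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The dependence of `adaptive_bins` on `n` is what makes the metric definition itself shift as the sample size grows.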
MMD: Uses an RBF kernel with bandwidth $\sigma=\sqrt{\operatorname{median}(\|x_{i}-x_{j}\|^{2})/2}$ computed from the pooled sample (median heuristic). The unbiased $\widehat{\text{MMD}}^{2}$ estimator is clipped to $[0,\infty)$, then passed through $\tanh$.
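The MMD pipeline can be sketched in a few lines of NumPy. The kernel parameterization $k(x,y)=\exp(-\|x-y\|^{2}/(2\sigma^{2}))$ and taking the median over off-diagonal pooled pairs are assumptions of this sketch; the function name is hypothetical.

```python
import numpy as np

def mmd_score(x, y):
    """tanh of the clipped unbiased MMD^2 with a median-heuristic RBF kernel."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    z = np.vstack([x, y])                                   # pooled sample
    d2 = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    off_diag = d2[~np.eye(len(z), dtype=bool)]
    sigma = np.sqrt(np.median(off_diag) / 2.0)              # median heuristic
    if sigma == 0.0:
        return 0.0
    k = np.exp(-d2 / (2.0 * sigma**2))
    n, m = len(x), len(y)
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    mmd2 = ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))     # unbiased: drop diagonals
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2.0 * kxy.mean())
    return float(np.tanh(max(mmd2, 0.0)))
```

Clipping matters because the unbiased estimator can be slightly negative when the two samples come from the same distribution.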
HSIC: Measures whether the residuals $R=\mathcal{M}(u)-\mathcal{E}(u)$ are statistically independent of the outputs of $\mathcal{E}$ (used as a proxy for the inputs $u$). The centered kernel alignment estimator $\widehat{\text{HSIC}}=\mathrm{tr}(HK_{X}H\cdot HK_{R}H)/(n-1)^{2}$ is computed, where $K_{X}$ is the RBF kernel matrix of the outputs of $\mathcal{E}$ and $K_{R}$ is the RBF kernel matrix of residuals; both kernel matrices are centered by $H=I-\mathbf{1}\mathbf{1}^{\top}/n$. The estimator is normalized by the product of the Frobenius norms of $HK_{X}H$ and $HK_{R}H$. Requires $\geq 20$ samples; returns NaN below this threshold.
$R^{2}$: Normalized as $(1-R^{2})/2$, clipped to $[0,1]$.
VarDecomp: Reports $\mathrm{Var}(y-\hat{y})/\mathrm{Var}(y)$, which is robust to constant offsets. This is a fraction-of-variance-unexplained statistic, not a Sobol/ANOVA partition.
$L^{2}$: Reports the relative squared error $\mathrm{mean}(\|y-\hat{y}\|^{2}/\|y\|^{2})$ (an NMSE-like ratio, not a Euclidean distance), passed through $\tanh$.
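The three ratio-style scores above differ mainly in how they treat a constant offset, which the sketch below makes visible. It assumes $y$ holds the reference outputs of $\mathcal{E}$ and $\hat{y}$ the abstracted outputs of $\mathcal{M}$, with one row per sample; the function names are hypothetical.

```python
import numpy as np

def r2_normalized(y, y_hat):
    """(1 - R^2) / 2, clipped to [0, 1]."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(np.clip((1.0 - r2) / 2.0, 0.0, 1.0))

def var_decomp(y, y_hat):
    """Fraction of variance unexplained: Var(y - y_hat) / Var(y)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.var(y - y_hat) / np.var(y))

def l2_relative(y, y_hat):
    """tanh of the mean per-sample relative squared error."""
    y = np.asarray(y, float).reshape(len(y), -1)
    y_hat = np.asarray(y_hat, float).reshape(len(y_hat), -1)
    ratios = np.sum((y - y_hat) ** 2, axis=1) / np.sum(y**2, axis=1)
    return float(np.tanh(ratios.mean()))
```

Adding a constant to every prediction leaves `var_decomp` at zero while saturating `r2_normalized`, which is the offset robustness noted for VarDecomp.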
#### Temporal metrics \(TrajMSE, DTW, Autocorr, Spectral, SINDy\)\.
These are implemented for the predator\-prey system in a way that rolls out bothℰ\\mathcal\{E\}andℳ\\mathcal\{M\}from shared initial conditions\.
DTW: Standard dynamic programming with Euclidean local cost, normalized by the sum of sequence lengths $|s_{1}|+|s_{2}|$.
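As a reference point, the textbook dynamic program with this normalization looks as follows. It is a generic sketch (the function name is hypothetical), using the common step pattern of insertion, deletion, and match.

```python
import numpy as np

def dtw_normalized(s1, s2):
    """DTW cost with Euclidean local cost, divided by len(s1) + len(s2)."""
    s1 = np.asarray(s1, float).reshape(len(s1), -1)
    s2 = np.asarray(s2, float).reshape(len(s2), -1)
    n, m = len(s1), len(s2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(s1[i - 1] - s2[j - 1])       # Euclidean local cost
            cost[i, j] = d + min(cost[i - 1, j],            # insertion
                                 cost[i, j - 1],            # deletion
                                 cost[i - 1, j - 1])        # match
    return float(cost[n, m] / (n + m))
```

A trajectory that is merely stretched in time, such as `[0, 1, 1, 2]` against `[0, 1, 2]`, scores zero, which is the property that distinguishes DTW from pointwise trajectory MSE.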
SINDy: For each sampled trajectory, we extract consecutive macro-step pairs $(X_{t},X_{t+1})$ from the trajectory of $\mathcal{M}$ (ABM), where each step corresponds to one full model epoch. The observed epoch change is $\dot{X}_{t}=X_{t+1}-X_{t}$. The predicted change in $\mathcal{E}$ is $\hat{\dot{X}}_{t}=\nu(X_{t})-X_{t}$, where $\nu(X_{t})$ runs the LV equations for one full epoch starting from the actual ABM state $X_{t}$. Only steps where both species exceed 1 individual are scored. The normalized Frobenius residual $\|\dot{X}-\hat{\dot{X}}\|_{F}\,/\,\|\dot{X}\|_{F}$ is mapped through $\tanh(\cdot)$.
#### Sobol first\-order indices:
The Saltelli estimator shares the same $A$ and $B$ sample matrices for $\mathcal{E}$ and $\mathcal{M}$, making indices directly comparable. The metric reports the mean of $|S_{i}^{\mathcal{E}}-S_{i}^{\mathcal{M}}|$ over input variables. In several benchmark systems, we evaluate Sobol indices on fewer than $100\times k$ samples due to the shared compute budget; reported Sobol scores should therefore be read as under-sampled estimates, not as evidence that the metric is inherently uninformative.
#### Infidelity \(two\-model attribution adaptation\):
For each abstract input variable $i$, a finite-difference perturbation of size $\delta_{i}$ is applied to the abstract label: $\delta_{i}=\max(\delta_{\text{frac}}\cdot|x_{i}|,\,\delta_{\text{min}})$ for continuous labels (with $\delta_{\text{frac}}=0.1$, $\delta_{\text{min}}=10^{-3}$) and $\delta_{i}=1$ for discrete (integer) labels. Both $\mathcal{E}$ and abstracted $\mathcal{M}$ are evaluated at the original and perturbed inputs; per-input attributions $\phi_{i}=(\text{output}_{\text{perturbed}}-\text{output}_{\text{original}})/\delta_{i}$ are computed for each model and compared by squared difference, averaged over input variables; multi-dimensional outputs are first mean-collapsed to a per-variable scalar before forming $\phi_{i}$. The metric is normalized via $\tanh$. For the 6502 CPU, only register inputs $\{A,X,Y,S,P\}$ are perturbed; opcode and operand bytes are excluded because a perturbation of $\delta=1$ changes the instruction identity, producing spurious attribution disagreement even for correct abstractions.
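The perturbation rule and the attribution comparison can be sketched for a single input point as follows. The sketch assumes each model is a callable from a list of abstract labels to a scalar (that is, $\mathcal{M}$ is already wrapped with grounding and abstraction, and outputs are already mean-collapsed); averaging over sampled inputs is omitted, and all names are hypothetical.

```python
import math

def perturbation_size(x, discrete, frac=0.1, d_min=1e-3):
    """delta = 1 for discrete labels, else max(frac * |x|, d_min)."""
    return 1.0 if discrete else max(frac * abs(x), d_min)

def attributions(model, x, discrete):
    """Finite-difference attribution phi_i for every input variable."""
    base = model(x)
    phis = []
    for i, xi in enumerate(x):
        delta = perturbation_size(xi, discrete[i])
        x_pert = list(x)
        x_pert[i] = xi + delta                 # perturb one variable at a time
        phis.append((model(x_pert) - base) / delta)
    return phis

def infidelity(model_e, model_m, x, discrete):
    """tanh of the mean squared attribution gap between the two models."""
    phi_e = attributions(model_e, x, discrete)
    phi_m = attributions(model_m, x, discrete)
    mean_sq = sum((a - b) ** 2 for a, b in zip(phi_e, phi_m)) / len(x)
    return math.tanh(mean_sq)
```

Two models that respond identically to every single-variable perturbation score exactly zero, whatever their internal structure.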
#### Relational fidelity:
Pearson $r$ per output dimension, averaged, mapped to $(1-r)/2\in[0,1]$.
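A minimal sketch of this score, assuming one row per sample and one column per output dimension (the function name is hypothetical). Note that `np.corrcoef` yields NaN for a constant column, a case the sketch does not handle.

```python
import numpy as np

def relational_fidelity(y_e, y_m):
    """Average Pearson r over output dimensions, mapped to (1 - r) / 2."""
    y_e = np.asarray(y_e, float).reshape(len(y_e), -1)
    y_m = np.asarray(y_m, float).reshape(len(y_m), -1)
    rs = [np.corrcoef(y_e[:, d], y_m[:, d])[0, 1] for d in range(y_e.shape[1])]
    return float((1.0 - np.mean(rs)) / 2.0)
```

Since Pearson correlation is invariant to positive affine rescaling, this score cannot detect a wrong gain or offset on its own.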
#### Mallows' $C_{p}$:
$p=2\times$ the number of scored output variables of $\mathcal{E}$; $\hat{\sigma}^{2}$ is estimated as the mean variance of the outputs of $\mathcal{M}$ across output dimensions, serving as a reference noise floor. Normalized as $\tanh(\max(0,C_{p}-p)/n)$.
#### Structural deviation and causal sensitivity index:
Structural deviation perturbs each parameter of $\mathcal{E}$ by 1% relative and measures $|\text{CAE}_{\uparrow\text{NF}}(\mathcal{E}_{\text{perturb}})-\text{CAE}_{\uparrow\text{NF}}(\mathcal{E}_{\text{base}})|$. Causal sensitivity zeros each parameter instead, but measures the same quantity. Neither metric divides by the baseline $\text{CAE}_{\uparrow\text{NF}}$ score.
#### IB vs\. CIB Lagrangian:
IB uses passive co-occurrence statistics for both $I(X;T)$ and $I(T;Y)$; a spurious correlate of $Y$ in $\mathcal{M}$ can inflate $I(T;Y)$ even without causal influence. CIB replaces $I(T;Y)$ with $\text{CAE}_{\uparrow\text{NF}}(Y\mid\mathrm{do}(T))$, estimated as $H(Y)-H_{c}(Y\mid\mathrm{do}(T))$, where the conditional entropy averages over 20 label bins under direct interventions. Both Lagrangians are normalized to $[0,1]$: the IB result is shifted by $\beta H(Y)$ and divided by $H(X)+\beta H(Y)$; the CIB result is shifted by $\beta$ and divided by $H(X)+\beta$, where $H(\cdot)$ denotes the histogram entropy of the respective variable.
#### Complexity shift:
For each sample, both the original input $x$ and a Gaussian-perturbed input $x+\varepsilon$ ($\varepsilon\sim\mathcal{N}(0,0.01^{2})$) are run through both models. The complexity shift for each model is $(K(\text{output}_{x+\varepsilon})-K(\text{output}_{x}))/K(\text{input})$, where $K(\cdot)$ denotes the zlib level-9 compressed byte length. The metric reports the absolute difference between the mean complexity shifts of $\mathcal{E}$ and $\mathcal{M}$. Because the proxy depends on float formatting and output dimensionality, complexity-shift values are not comparable across systems and are interpreted only within-system as an ordinal signal.
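The compression-based proxy can be sketched with the standard `zlib` module. How arrays are serialized before compression is not specified in the text; this sketch compresses the raw float64 bytes, which is an assumption (a text-based serialization would give different absolute values). Function names and the shared-noise design are likewise hypothetical.

```python
import zlib
import numpy as np

def k_bytes(arr):
    """Compressed byte length at zlib level 9 (raw float64 bytes; an assumption)."""
    return len(zlib.compress(np.asarray(arr, float).tobytes(), 9))

def complexity_shift_gap(model_e, model_m, inputs, sigma=0.01, seed=0):
    """|mean shift of E - mean shift of M| under shared Gaussian input noise."""
    rng = np.random.default_rng(seed)
    shifts_e, shifts_m = [], []
    for x in inputs:
        x = np.asarray(x, float)
        x_pert = x + rng.normal(0.0, sigma, size=x.shape)   # same noise for both models
        k_in = k_bytes(x)
        shifts_e.append((k_bytes(model_e(x_pert)) - k_bytes(model_e(x))) / k_in)
        shifts_m.append((k_bytes(model_m(x_pert)) - k_bytes(model_m(x))) / k_in)
    return abs(float(np.mean(shifts_e)) - float(np.mean(shifts_m)))
```

Because the byte length depends on the serialization, the resulting values are only meaningful as a within-system ordinal signal, as noted above.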
#### Symbion:
Exhaustively enumerates all root\-variable label combinations; applicable only to systems with finite discrete label spaces\.
#### Macroscopic invariance vs\. BCC:
Both compare abstracted outputs ofℳ\\mathcal\{M\}for two micro\-states that share the same abstract label\. Macroscopic invariance perturbs a single variable in isolation: it intervenes on one variable at a time, drawing two micro\-states from the subspace of each label while leaving the rest of the state unspecified\. BCC instead tests consistency within a complete system state: each iteration draws a joint label assignment over all mapped input \(root\) variables, grounds the whole state to a micro\-realization, then re\-grounds it to a second micro\-realization of the same complete label vector, and compares the abstracted outputs read from the non\-root behavioral variables\. For both discrete and continuous value maps, BCC skips pairs whose grounding is the identity, as these are trivially consistent\.
#### DCC:
Always uses hard label equality for the next\-state comparison\.
#### IIA:
The interchange copies raw micro\-values ofℳ\\mathcal\{M\}from the source run \(not grounded preimages of the source abstract label\)\. This tests whether the actual internal micro\-state ofℳ\\mathcal\{M\}is causally exchangeable, which is a stronger condition than checking any preimage of the source label\. This distinction vanishes for identity value maps\.
#### Probing:
The probe is trained onntrainn\_\{\\text\{train\}\}samples \(default 200, overridden per system\) and evaluated on a held\-out test set\. When the probe is scored with hard label matching \(Tracr\), its predictions are rounded to the nearest integer label before scoring; when it is scored with MSE \(logic circuit and GRN, whose integer labels are treated as regression targets, as well as the continuous identity value maps\), the probe is a standard linear regression whose continuous predictions are compared to the labels without rounding\.
### F\.1Computation Costs
All experiments were run on a consumer-grade laptop equipped with 16 GB of RAM and an Intel i5-13600H CPU. The computation time required to produce the experimental results is estimated as follows:
- •Evaluation of baseline metrics on benchmark systems for valid and invalid abstractions: approx. 100 hours, of which nearly half was allocated to system 3 (gas simulation)
- •Measurements of statistical power and convergence: approx. 50 hours, of which half was for system 3
- •Scaling laws for Tracr: approx. 5 hours
In addition, functional tests were implemented and run to check for correctness of the CPU implementations\. Their total runtime is estimated at approx\. 50 hours\.
Finally, preliminary experiments, including debugging and prototyping, are estimated to have required twice as much compute as the production of final results \(approx\. 300 hours\)\.
## Appendix GCase Study: Gas simulation \(system 3\)
We report additional experiments for the gas simulation system, which tests whether thermodynamic laws emerge as valid causal abstractions of a Lennard-Jones fluid simulation. Unless otherwise stated, experiments are performed at the Boyle temperature $T_{\text{Boyle}}\approx 3.418$ [Glasser, [2002](https://arxiv.org/html/2607.00267#bib.bib42)] and with a low density of $\rho=0.05$, where inter-particle interactions are rare and ideal thermodynamic behavior is expected.
We evaluate two high-level models: the ideal gas law (IGL, $PV\propto T$) and the Van der Waals equation (VdW, $(P+a\rho^{2})(1/\rho-b)\propto T$). These models represent successive levels of physical refinement: the IGL assumes non-interacting point particles, while VdW is a historical correction that explicitly accounts for finite particle volume and pairwise attraction. Neither is exact, but they serve as a useful pair to test whether CAE can discriminate the quality of competing valid abstractions.
For each equation, we consider three causal directions corresponding to different thermodynamic ensembles: NVT ($P\leftarrow f(V,T)$), NPT ($V\leftarrow f(P,T)$), and a non-standard inverse problem PVT ($T\leftarrow f(P,V)$), implemented via a proportional-gain temperature controller.
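For the NVT direction, the two high-level models reduce to closed-form pressure functions. The sketch below works per particle in reduced units with the proportionality constant set to 1, which is an assumption; the values of `a` and `b` used in the usage note are illustrative and are not the calibrated parameters of the paper.

```python
def pressure_ideal(rho, T):
    """Ideal gas law in reduced units: P = rho * T."""
    return rho * T

def pressure_vdw(rho, T, a, b):
    """Van der Waals: (P + a*rho^2) * (1/rho - b) = T, solved for P."""
    return T / (1.0 / rho - b) - a * rho**2

def temperature_ideal(P, rho):
    """PVT direction for the ideal gas law: T = P / rho."""
    return P / rho
```

Expanding the VdW pressure in powers of $\rho$ gives $P=\rho T+(bT-a)\rho^{2}+\dots$, so the second-order correction vanishes at $T=a/b$. With the illustrative choice `a = 3.418`, `b = 1.0`, the two models nearly coincide at $T=3.418$ and $\rho=0.05$, and separate more clearly at lower temperature.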
### G\.1Valid Abstractions
Table [2](https://arxiv.org/html/2607.00267#A7.T2) confirms that both models achieve near-zero CAE across all three causal directions in the dilute limit. This demonstrates that at low density, the validity of the statistical mechanical mapping matters more than the precise choice of equation of state. The VdW equation is marginally more faithful across most conditions, reflecting its tighter fit to the LJ fluid.

| CAE | NVT ($P\leftarrow f(V,T)$) | NPT ($V\leftarrow f(P,T)$) | PVT ($T\leftarrow f(P,V)$) |
|---|---|---|---|
| Ideal gas law | 0.0231 | 0.0397 | 0.0278 |
| Van der Waals equation | 0.0135 | 0.0520 | 0.0435 |

Table 2: CAE↓ for two high-level gas models across three causal directions (10 samples per evaluation).
### G\.2Destructive Perturbations
We next systematically violated the physical assumptions required for thermodynamic laws to emerge from particle dynamics\. Results are shown in Figure[14](https://arxiv.org/html/2607.00267#A7.F14)and Table[3](https://arxiv.org/html/2607.00267#A7.T3)\.
#### Density sweep ($\rho$).
Increasing density from $\rho=0.05$ to $\rho=0.85$ in NVT mode introduces progressively stronger inter-particle interactions. The IGL error rises steadily, while VdW remains below $0.05$ for significantly longer, exploiting its explicit correction terms. Both abstractions effectively fail (error $\approx 1$) for $\rho>0.45$.
#### Temperature sweep ($T$).
At low temperatures ($T<1.0$), particles cluster in attractive wells, violating the homogeneity assumptions of both laws, and both fail (error $\approx 0.8$). As temperature increases, VdW recovers faster, reaching its minimum around $T\approx 2.5$. The IGL minimum falls in $[3.25,3.75]$, consistent with the Boyle temperature where attractive and repulsive contributions cancel [Glasser, [2002](https://arxiv.org/html/2607.00267#bib.bib42)]. Both models remain valid at high temperatures ($T>4.0$, error $<0.1$) where kinetic energy dominates.
#### Distorted exponent ($\alpha$).
Replacing the IGL's temperature dependence with $P\propto T^{\alpha}$ and sweeping $\alpha\in[0.5,1.5]$ tests sensitivity to functional form. The CAE correctly identifies the physical ground truth, reaching near zero at $\alpha=1.0$ and increasing monotonically away from it: sub-linear deviations ($\alpha<1$) increase the error nearly linearly (reaching $\approx 0.7$ at $\alpha=0.5$), while super-linear deviations are penalized more gradually ($\approx 0.42$ at $\alpha=1.5$).



Figure 14: Average CAE↓ for IGL and VdW under (a) increasing density, (b) varying temperature exponent $\alpha$, and (c) varying initial temperature $T$. All experiments in NVT mode, 10 samples per evaluation. Shaded areas denote $\pm 1\sigma$.

| CAE↓ | NVT ($P\leftarrow f(V,T)$) | NPT ($V\leftarrow f(P,T)$) | PVT ($T\leftarrow f(P,V)$) |
|---|---|---|---|
| Ideal gas law | 0.5920 | 0.5432 | 0.5497 |
| Van der Waals equation | 0.5892 | 0.5478 | 0.5551 |

Table 3: CAE↓ for both models in non-equilibrium conditions (10 samples per evaluation).

Together, these results illustrate that CAE↓ behaves as a graded physical validity score: it correctly validates both abstractions in their regime of applicability, distinguishes the more faithful VdW model at intermediate densities and temperatures, and detects violations as soon as the underlying physical assumptions break down.