StableRCA: Robust Graph-Agnostic Mechanism-Level Root Cause Analysis

arXiv cs.LG 06/05/26, 04:00 AM Papers
Summary
StableRCA is a novel root cause analysis framework that identifies intervention targets by estimating local Markov boundaries and detecting conditional distribution shifts, avoiding the need for global causal graph discovery and demonstrating robustness across synthetic and real-world datasets.
arXiv:2606.05636v1 Announce Type: new Abstract: Root-Cause Analysis (RCA) seeks to identify the variables responsible for abnormal system behavior in complex domains such as manufacturing, cloud computing, and healthcare. Existing approaches face a critical bottleneck: graph-based causal methods can identify intervention targets but typically require a known or accurately estimated causal graph, while graph-free statistical methods either localize marginal anomalies rather than structural causes, or rely on restrictive assumptions about graph structure or functional form. We propose StableRCA, a local mechanism-level RCA framework that avoids global graph discovery by estimating local Markov boundaries and detecting conditional distribution shifts within them. Leveraging the Independent Causal Mechanism principle, we show that intervention targets can be identified with probability converging exponentially in sample size under faithful Markov boundary recovery and non-degenerate mechanism shifts. Experiments on synthetic benchmarks and five real-world datasets demonstrate that StableRCA is robust to graph misspecification, effective under multiple intervention targets, scalable to large systems, and reliable across diverse application domains. Code is available at: https://anonymous.4open.science/r/StableRCA-E362
Original Article
View Cached Full Text
Cached at: 06/05/26, 08:12 AM
# StableRCA: Robust Graph-Agnostic Mechanism-Level Root Cause Analysis
Source: [https://arxiv.org/html/2606.05636](https://arxiv.org/html/2606.05636)
Xiaoyu Lin Department of Computer Science Tsinghua University xiaoyulin@mail\.tsinghua\.edu\.cn &Nicholas Tagliapietra11footnotemark:1 Bosch Center for Artificial Intelligence Renningen, Germany Computer Science Department TU Darmstadt, Germany nicholas\.tagliapietra@de\.bosch\.com &Kehan Li11footnotemark:1 Department of Computer Science Tsinghua University lkh20@mails\.tsinghua\.edu\.cn &Lavdim Halilaj Bosch Center for Artificial Intelligence Renningen, Germany Lavdim\.Halilaj@de\.bosch\.com &Juergen Luettin Bosch Center for Artificial Intelligence Renningen, Germany Juergen\.Luettin@de\.bosch\.com

###### Abstract

Root\-Cause Analysis \(RCA\) seeks to identify the variables responsible for abnormal system behavior in complex domains such as manufacturing, cloud computing, and healthcare\. Existing approaches face a critical bottleneck: graph\-based causal methods can identify intervention targets but typically require a known or accurately estimated causal graph, while graph\-free statistical methods either localize marginal anomalies rather than structural causes, or rely on restrictive assumptions about graph structure or functional form\. We propose StableRCA, a local mechanism\-level RCA framework that avoids global graph discovery by estimating local Markov boundaries and detecting conditional distribution shifts within them\. Leveraging the Independent Causal Mechanism principle, we show that intervention targets can be identified with probability converging exponentially in sample size under faithful Markov boundary recovery and non\-degenerate mechanism shifts\. Experiments on synthetic benchmarks and five real\-world datasets demonstrate that StableRCA is robust to graph misspecification, effective under multiple intervention targets, scalable to large systems, and reliable across diverse application domains\. Code is available at:[https://anonymous\.4open\.science/r/StableRCA\-E362](https://anonymous.4open.science/r/StableRCA-E362)

## 1Introduction

Root\-Cause Analysis \(RCA\) aims to identify the underlying factors responsible for abnormal system behavior\. It is a central task in many high\-stakes and large\-scale domains, including manufacturing\[e Oliveiraet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib29), Papageorgiouet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib31)\], IT services and cloud systems\[solé2017surveymodelstechniquesrootcause, Soldani and Brogi,[2022](https://arxiv.org/html/2606.05636#bib.bib32)\], healthcare\[Kellogget al\.,[2017](https://arxiv.org/html/2606.05636#bib.bib34)\], and medicine\[Wuet al\.,[2008](https://arxiv.org/html/2606.05636#bib.bib33)\]\. In such systems, anomalies often propagate through complex dependencies: many variables may appear abnormal, while only a small subset corresponds to the causal mechanisms that actually changed\. The central challenge is therefore to distinguish*true root causes*from downstream effects\.

Graph\-based causal RCA methods address this problem by leveraging a causal graph and tracing anomaly propagation along graph structures\. When the graph is accurate, these methods can provide interpretable causal diagnoses\. In practice, however, reliable global causal graphs are rarely available\. Learning them from observational data is statistically fragile and computationally expensive in high\-dimensional systems, especially when variables are heterogeneous and dependencies are nonlinear\. Consequently, graph\-based RCA can degrade severely under graph misspecification\. In contrast, graph\-free statistical methods avoid causal graph learning, but often rank variables by marginal distribution shifts, anomaly scores, or statistical discrepancies\. Such signals are insufficient for mechanism\-level RCA because downstream variables can exhibit strong marginal anomalies even when their own causal mechanisms remain unchanged\. Existing methods that relax full\-graph requirements usually rely on localized conditional\-independence testing, partial causal structures, or restrictive assumptions on graph topology, intervention type, or functional form\.

In this work, we focuses onpopulation\-levelRCA, where the goal is to identify variables whose mechanisms drive systematic distributional changes between normal and abnormal regimes\. This setting differs fromsample\-levelRCA, which explains individual anomalous instances and may be sensitive to noise or instance\-specific fluctuations\. Under this setting, we propose StableRCA, a graph\-agnostic framework for mechanism\-level RCA\. The key insight is that root causes and downstream affected variables behave differently under local conditioning\. An intervention induces marginal shifts not only on the intervened variable but also on its descendants\. However, under the Independent Causal Mechanisms principle, downstream variables preserve their conditional mechanisms once conditioned on their local Markov boundaries \(MB\), whereas the true intervention target exhibits a conditional distribution shift\. Based on this observation, StableRCA first detects variables with marginal shifts, then estimates their local MBs, and finally ranks candidates by the strength of their conditional distribution shifts\.

This design offers three practical advantages\. First, it avoids global structure discovery by relying on local conditioning sets rather than a complete system topology\. Second, it separates true mechanism changes from propagated marginal anomalies, reducing false positives among downstream variables\. Third, it accommodates high\-dimensional and heterogeneous data by combining stable local causal variable selection with predictive risk\-based conditional shift detection\.

Our contributions are summarized as follows:

- •We introduce StableRCA, a graph\-agnostic framework for mechanism\-level RCA\. It identifies root causes by combining marginal shift screening, local MB estimation, and conditional distribution shift detection, without requiring a known or learned global causal graph\.
- •We establish theoretical conditions under which intervention targets are identifiable from downstream affected variables via conditional distribution shifts relative to their MBs\. We further establish finite\-sample identification guarantees, showing that under appropriate non\-degenerate mechanism\-shift assumptions, the probability of correct recovery converges exponentially with sample size\.
- •We conduct extensive experiments on synthetic and real\-world benchmarks, covering graph misspecification, multiple intervention targets, large\-scale graphs, and diverse application domains\. The results show that StableRCA achieves strong accuracy and robustness, and provides a favorable accuracy\-efficiency trade\-off compared with other RCA baselines\.

![Refer to caption](https://arxiv.org/html/2606.05636v1/figures/motivation_figure_a.png)\(a\)Traditional Graph\-Dependent RCA
![Refer to caption](https://arxiv.org/html/2606.05636v1/figures/motivation_figure_b.png)\(b\)StableRCA Population\-Level Mechanism RCA

Figure 1:Motivation and Contribution\. : a\) Traditional Graph\-Dependent RCA, and b\) StableRCA Population\-Level Mechanism RCA\.
## 2Related Work

Graph\-Based RCA\.Many RCA methods localize root causes using a known, constructed, or learned global graph\. Early approaches, especially in microservice systems, build dependency, causal, or impact graphs from domain knowledge, system topology, or observational data, and then rank candidate causes using graph traversal or scoring procedures such as random walks\[Wanget al\.,[2018](https://arxiv.org/html/2606.05636#bib.bib79), Maet al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib45), Wanget al\.,[2023](https://arxiv.org/html/2606.05636#bib.bib40), Zhenget al\.,[2024](https://arxiv.org/html/2606.05636#bib.bib41)\], PageRank\[Xinet al\.,[2023](https://arxiv.org/html/2606.05636#bib.bib43), Linet al\.,[2024](https://arxiv.org/html/2606.05636#bib.bib42)\], or depth\-first search\[Chenet al\.,[2014](https://arxiv.org/html/2606.05636#bib.bib14), Linet al\.,[2018](https://arxiv.org/html/2606.05636#bib.bib44)\]\. However, such topological heuristics can conflate correlation or graph proximity with causal influence\. More recent methods formulate RCA as intervention\-target identification under Structural Causal Models \(SCMs\)\. For example, CIRCA identifies root causes by measuring changes in conditional distributions given parent variables on a causal Bayesian network\[Liet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib15)\]\. Under the assumption that the SCM is known,Budhathokiet al\.\[[2022](https://arxiv.org/html/2606.05636#bib.bib74)\]define the root causes of outliers as interventions on exogenous noise variables, and identify them by simulating counterfactual distributions via noise randomization\. Despite the differences in modeling and inference, such methods rely on a reasonably accurate global graph, which is often difficult to obtain in high\-dimensional real\-world settings\.

RCA Without Full Structural Knowledge\.Several methods relax the need for a fully known causal graph\. RCD introduces an auxiliary intervention indicator and performs hierarchical localized conditional\-independence tests to identify root\-cause candidates\[Ikramet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib13)\]\. RCG instead uses an offline\-learned partial graph, such as a CPDAG or mixed graph, and ranks variables by conditional dependence with the intervention indicator given possible parents\[Ikramet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib80)\]\. Under a single\-root\-cause polytree graph setting, Score Ordering ranks variables by marginal anomaly scores without a graph; the same work also proposes Smooth Traversal for the graph\-known setting, identifying the root cause through parent\-child score differences\[Orchardet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib17)\]\. Likewise,Liet al\.\[[2025](https://arxiv.org/html/2606.05636#bib.bib16)\]establishes identifiability in linear acyclic SCMs with a single mean\-shift intervention via an invariance property induced by variable permutations and Cholesky decomposition\. Domain\-specific approaches such as PRISM further reduce structural requirements by exploiting component\-level knowledge in microservice systems\[Pham,[2026](https://arxiv.org/html/2606.05636#bib.bib81)\]\. Overall, existing RCA methods without full structural knowledge typically rely on localized CI testing, partial causal graphs, or restrictive assumptions on graph structure, intervention type, or application domain\. In contrast, our method combines data\-driven local MB estimation with conditional distribution shift detection for mechanism\-level RCA, without requiring either a known global graph or an offline\-learned partial causal structure\.

Non\-Causal RCA Methods\.Another line of work approaches RCA through anomaly detection or statistical testing without explicit causal modeling\. For example,ϵ\\epsilon\-Diagnosis identifies root causes using two\-sample testing andϵ\\epsilon\-statistics\[Shanet al\.,[2019](https://arxiv.org/html/2606.05636#bib.bib12)\], while BARO combines multivariate Bayesian online change\-point detection with nonparametric hypothesis testing for root\-cause localization\[Phamet al\.,[2024](https://arxiv.org/html/2606.05636#bib.bib82)\]\. These methods can be efficient and practically useful, but they do not explicitly reason about interventions or causal propagation, making them vulnerable to spurious associations and downstream anomaly effects\.

![Refer to caption](https://arxiv.org/html/2606.05636v1/x1.png)Figure 2:Illustration of the StableRCA framework\. It comprises three main phases: 1\) Marginal Distribution Shift Detection; 2\) MB Identification; and 3\) Conditional Distribution Shift Detection\.Invariant Prediction under Distribution Shift\.Invariant prediction is closely related but aims at a different objective\. It uses distributional changes across environments to identify stable predictive relationships and remove spurious associations for robust generalization, whereas RCA uses distributional changes to localize the intervention targets responsible for abnormal behavior\.Peterset al\.\[[2016](https://arxiv.org/html/2606.05636#bib.bib83)\]connect cross\-environment invariance with causal inference by identifying predictors whose conditional relationship with the target remains invariant\. Building on this perspective, Stable Learning identifies stable predictors under covariate shift through independence\-driven importance weighting\[Kuanget al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib58), Shenet al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib18), Zhanget al\.,[2021](https://arxiv.org/html/2606.05636#bib.bib59), Xuet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib11)\]\. In our framework, we borrow this idea as a building block for robust local MB estimation, while root\-cause identification is achieved through subsequent conditional distribution shift detection\.

## 3Preliminaries

#### Notation

Let𝐗=\(X1,…,Xd\)T∈ℝd\\mathbf\{X\}=\(X\_\{1\},\.\.\.,X\_\{d\}\)^\{T\}\\in\\mathbb\{R\}^\{d\}denote the system variables, whereXiX\_\{i\}represents theii\-th variable\. We denote by𝒟obs=\{𝐗obs\(n\)\}n=1Nobs\\mathcal\{D\}\_\{\\textrm\{obs\}\}=\\\{\\mathbf\{X\}^\{\(n\)\}\_\{\\textrm\{obs\}\}\\\}\_\{n=1\}^\{N\_\{\\textrm\{obs\}\}\}the observational dataset collected under normal operation, and by𝒟int=\{𝐗int\(n\)\}n=1Nint\\mathcal\{D\}\_\{\\textrm\{int\}\}=\\\{\\mathbf\{X\}^\{\(n\)\}\_\{\\textrm\{int\}\}\\\}\_\{n=1\}^\{N\_\{\\textrm\{int\}\}\}the interventional dataset collected under anomalous regime\.

### 3\.1SCMs, Interventions, and Markov boundaries

We assume that the system variables𝐗=\{X1,…,XD\}\\mathbf\{X\}=\\\{X\_\{1\},\\ldots,X\_\{D\}\\\}are generated by a causally sufficient and faithful SCM whose causal graph𝒢\{\\mathcal\{G\}\}is a DAG\[Pearl,[2009](https://arxiv.org/html/2606.05636#bib.bib9), Spirteset al\.,[2000](https://arxiv.org/html/2606.05636#bib.bib36)\]\. The observational distribution under normal operating conditions is denoted byP\(𝐗\)P\(\\mathbf\{X\}\)\. For each variableXiX\_\{i\}, we writePaiPa\_\{i\},ChiCh\_\{i\}, andSpi\\mathrm\{Sp\}\_\{i\}for its parents, children, and spouses in𝒢\{\\mathcal\{G\}\}, respectively\. Its Markov boundary is

MB\(Xi\)=Pai∪Chi∪Spi,\\mathrm\{MB\}\(X\_\{i\}\)=Pa\_\{i\}\\cup Ch\_\{i\}\\cup\\mathrm\{Sp\}\_\{i\},which is the minimal local conditioning set that rendersXiX\_\{i\}independent of the remaining variables under faithfulness\.

As in\[Ikramet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib13)\], we model population\-level systematic anomalous behavior \(the root\-cause\) as an intervention on a certain target variableXiX\_\{i\}\. Specifically, ifXiX\_\{i\}is generated by the structural equationXi=fi\(Pai,Ui\)X\_\{i\}=f\_\{i\}\(Pa\_\{i\},U\_\{i\}\), whereUiU\_\{i\}denotes exogenous noise, an intervention replaces the original causal mechanismfif\_\{i\}ofXiX\_\{i\}with another mechanismf~i\\tilde\{f\}\_\{i\}, which induces an interventional SCMℳI\\mathcal\{M\}^\{I\}and its associated marginal distributionPI\(𝐗\)P^\{I\}\(\\mathbf\{X\}\)incorporating the abnormality\. We assume that interventions are local, i\.e\. that each intervention only affects the causal mechanism of its respective target target variable\.

Our analysis relies on the Independent Causal Mechanisms, or modularity, assumption\[Pearl,[2009](https://arxiv.org/html/2606.05636#bib.bib9), Schölkopfet al\.,[2021](https://arxiv.org/html/2606.05636#bib.bib10)\]: causal mechanisms are autonomous, so an intervention onXiX\_\{i\}changes the mechanism ofXiX\_\{i\}but leaves non\-target mechanisms unchanged\. Consequently, downstream variables may exhibit marginal distribution shifts due to anomaly propagation, but their conditional mechanisms remain invariant when conditioned on the appropriate local Markov boundary\. This observation motivates identifying root causes through conditional distribution shifts of the form

P\(Xi∣MB\(Xi\)\)≠PI\(Xi∣MB\(Xi\)\)\.P\(X\_\{i\}\\mid\\mathrm\{MB\}\(X\_\{i\}\)\)\\neq P^\{I\}\(X\_\{i\}\\mid\\mathrm\{MB\}\(X\_\{i\}\)\)\.

### 3\.2Stable Learning Background

Stable Learning seeks predictors that remain reliable under distribution shift by relying on variables whose predictive relationship with the target is stable, rather than on spurious correlations that may vary across environments\[Kuanget al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib58), Shenet al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib18)\]\. In this work, we use this idea as a tool for estimating local Markov boundaries\.

###### Definition 3\.1\(Stable Set\)\.

A stable variable set ofXiX\_\{i\}under distributionPP, denoted byStable\(Xi\)\\mathrm\{Stable\}\(X\_\{i\}\), is any subset𝐒i⊆𝐗∖\{Xi\}\\mathbf\{S\}\_\{i\}\\subseteq\\mathbf\{X\}\\setminus\\\{X\_\{i\}\\\}such that

𝔼P\[Xi∣𝐗\]=𝔼P\[Xi∣𝐒i\]\.\\mathbb\{E\}\_\{P\}\[X\_\{i\}\\mid\\mathbf\{X\}\]=\\mathbb\{E\}\_\{P\}\[X\_\{i\}\\mid\\mathbf\{S\}\_\{i\}\]\.\(1\)A minimal stable set is an inclusion\-minimal subset satisfying Eq\. \([1](https://arxiv.org/html/2606.05636#S3.E1)\)\.

Intuitively, a minimal stable set preserves the conditional\-mean predictive information aboutXiX\_\{i\}, while discarding variables whose contribution is redundant or unstable\.

###### Assumption 3\.2\(Additive Noise Model\)\.

Each variableXi∈𝐗X\_\{i\}\\in\\mathbf\{X\}follows

Xi=fi\(Pai\)\+Ui,𝔼\[Ui\]=0,X\_\{i\}=f\_\{i\}\(Pa\_\{i\}\)\+U\_\{i\},\\qquad\\mathbb\{E\}\[U\_\{i\}\]=0,where the exogenous noise is conditionally independent of non\-boundary variables:

Ui⟂⟂𝐗∖\(\{Xi\}∪MB\(Xi\)\)∣MB\(Xi\)\.U\_\{i\}\\perp\\\!\\\!\\\!\\perp\\mathbf\{X\}\\setminus\\bigl\(\\\{X\_\{i\}\\\}\\cup\\mathrm\{MB\}\(X\_\{i\}\)\\bigr\)\\mid\\mathrm\{MB\}\(X\_\{i\}\)\.

Under standard positivity and causal regularity assumptions, Stable Learning can identify the minimal stable set for a target variable\[Xuet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib11)\]\. Moreover, under Assumption[3\.2](https://arxiv.org/html/2606.05636#S3.Thmtheorem2), the minimal stable set for predictingXiX\_\{i\}coincides with its Markov boundaryMB\(Xi\)\\mathrm\{MB\}\(X\_\{i\}\)\. This connection allows us to use Stable Learning as a local MB estimation tool\.

We instantiate Stable Learning with the Sample Reweighted Decorrelation Operator \(SRDO\)\[Shenet al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib18)\]\. SRDO learns sample weights that reduce spurious dependence among covariates, encouraging the predictor to rely on stable variables rather than unstable correlations\. In StableRCA, SRDO is used only to obtain robust local estimates ofMB\(Xi\)\\mathrm\{MB\}\(X\_\{i\}\); root causes are subsequently identified by conditional distribution shift tests within these estimated Markov boundaries\.

## 4Method

StableRCA identifies intervention targets from observational and anomalous data through a three\-phase procedure, without requiring a known global causal graph\. It first screens variables that exhibit marginal distribution shifts, then estimates a local MB for each candidate, and finally ranks candidates according to the magnitude of their conditional distribution shift given the estimated MB\. Figure[2](https://arxiv.org/html/2606.05636#S2.F2)illustrates the overall framework and its three phases, which are described below\. Implementation details are provided in Appendix[B](https://arxiv.org/html/2606.05636#A2)\.

Phase 1: Marginal Distribution Shift Detection\.In the first phase, we identify variables whose marginal distributions are affected by the anomaly\. Specifically, for each variable, we test for distributional differences between𝒟obs\\mathcal\{D\}\_\{\\textrm\{obs\}\}and𝒟int\\mathcal\{D\}\_\{\\textrm\{int\}\}\. We apply the Kolmogorov–Smirnov test to continuous variables\[Conover,[1999](https://arxiv.org/html/2606.05636#bib.bib21)\]and theχ2\\chi^\{2\}test to discrete variables\[Pearson,[1900](https://arxiv.org/html/2606.05636#bib.bib22)\]\. This phase yields a subset of variables with significant marginal shifts, denoted by𝐗shift\\mathbf\{X\}\_\{\\textrm\{shift\}\}, defined as those for which the null hypothesis is rejected at levelα\\alpha\. Under our causal assumptions, this set is expected to include the intervened variable together with its descendants\.

Phase 2: Markov Boundary Identification\.In the second phase, for each variableXi∈𝐗shiftX\_\{i\}\\in\\mathbf\{X\}\_\{\\textrm\{shift\}\}, we estimate its Markov boundaryMBi\\textrm\{MB\}\_\{i\}by applying the SRDO method introduced in Section[3](https://arxiv.org/html/2606.05636#S3)to the observational dataset𝒟obs\\mathcal\{D\}\_\{\\textrm\{obs\}\}\. Concretely, we treatXiX\_\{i\}as the target variable and use the remaining variables,𝐗−i=𝐗∖Xi\\mathbf\{X\}\_\{\-i\}=\\mathbf\{X\}\\setminus X\_\{i\}as predictors\. SRDO first estimates sample weights via an importance\-driven weighting procedure to reduce spurious associations among predictors\. Using these weights, we then fit a weighted regression model for continuous targets or a weighted classification model for discrete targets, and use the resulting feature importances to obtain a Markov boundary estimate forXiX\_\{i\}\. The specific model class and feature\-selection procedure are described in Appendix[B](https://arxiv.org/html/2606.05636#A2)\.

Phase 3: Conditional Distribution Shift Detection\.In the third phase, we identify variables within𝐗shift\\mathbf\{X\}\_\{\\textrm\{shift\}\}whose conditional distributions given their Markov boundaries,P\(Xi∣MBi\)P\(X\_\{i\}\\mid\\textrm\{MB\}\_\{i\}\), exhibit significant shifts between the observational dataset𝒟obs\\mathcal\{D\}\_\{\\textrm\{obs\}\}and the interventional dataset𝒟int\\mathcal\{D\}\_\{\\textrm\{int\}\}\. Here,MBi\\textrm\{MB\}\_\{i\}denotes the Markov boundary estimate obtained in Phase 2\. To quantify such conditional shifts, we train a predictive modelf^i\(MBi\)\\hat\{f\}\_\{i\}\(\\textrm\{MB\}\_\{i\}\)for each variableXiX\_\{i\}using a training split of𝒟obs\\mathcal\{D\}\_\{\\textrm\{obs\}\}, and evaluate it on both a held\-out split of𝒟obs\\mathcal\{D\}\_\{\\textrm\{obs\}\}and on𝒟int\\mathcal\{D\}\_\{\\textrm\{int\}\}\. The resulting degradation in predictive performance serves as an indicator of the conditional distributional changes\. Formally, we define the risk difference for variableXiX\_\{i\}as

ΔRi=RiPI−RiP=𝔼\(MBi,Xi\)∼PI\[l\(f^i\(MBi\),Xi\)\]−𝔼\(MBi,Xi\)∼P\[l\(f^i\(MBi\),Xi\)\],\\displaystyle\\Delta R\_\{i\}=R\_\{i\}^\{P^\{I\}\}\-R\_\{i\}^\{P\}=\\mathbb\{E\}\_\{\(\\textrm\{MB\}\_\{i\},X\_\{i\}\)\\sim P^\{I\}\}\[l\(\\hat\{f\}\_\{i\}\(\\textrm\{MB\}\_\{i\}\),X\_\{i\}\)\]\-\\mathbb\{E\}\_\{\(\\textrm\{MB\}\_\{i\},X\_\{i\}\)\\sim P\}\[l\(\\hat\{f\}\_\{i\}\(\\textrm\{MB\}\_\{i\}\),X\_\{i\}\)\],\(2\)wherePIP^\{I\}andPPdenote the interventional and observational distributions, respectively, andl\(⋅,⋅\)l\(\\cdot,\\cdot\)is a task\-dependent loss function\. In our experiments, we use the mean squared error for regression tasks and cross\-entropy loss for classification tasks\. To disentangle conditional distribution shift from covariate shift induced by changes in the marginal distributionP\(MBi\)P\(\\textrm\{MB\}\_\{i\}\), we estimate the expected risk underPIP^\{I\}using importance weighting\[Shimodaira,[2000](https://arxiv.org/html/2606.05636#bib.bib64)\]\. Specifically, samples from𝒟int\\mathcal\{D\}\_\{\\textrm\{int\}\}are reweighted so that the marginal distribution ofMBi\\textrm\{MB\}\_\{i\}matches that underPP\. We then define the root\-cause score forXiX\_\{i\}as the*Relative Risk Variation*\(RRV\):

S\(Xi\)=Ri,weightedPI−RiPRiP\.S\(X\_\{i\}\)=\\frac\{R\_\{i,\\textrm\{weighted\}\}^\{P^\{I\}\}\-R\_\{i\}^\{P\}\}\{R\_\{i\}^\{P\}\}\.\(3\)Variables with larger RRV scores are ranked as more likely root causes\.

## 5Theoretical Analysis

In this section, we provide the main end\-to\-end theoretical result on how StableRCA discovers the true interventional target\. In Appendix[A](https://arxiv.org/html/2606.05636#A1), we provide the complete theoretical framework along with proofs, both in the ideal \(large\-sample limit and perfect stable set learned\) and non\-ideal case \(finite\-data, and possibly inaccurate stable set\)\.

We prove that intervened nodes have a larger drop in performance on the interventional data \(used at test time\) with respect to the observational data \(used for training\)\. Such drop results in intervened nodes to be scored higher than non\-intervened ones\. The intuitive reason for the drop is that the learned stable set enables accurate predictions in scenarios sharing the same causal structure as the observational domain\. After intervention, however, the stable set of the intervened nodes will change, therefore making predictions using the "old" stable set learned in the observational setting will lead to poor predictions\.

In Appendix[A\.2](https://arxiv.org/html/2606.05636#A1.SS2)we develop the theory in the ideal case, i\.e\. in the population limit ofn→\+∞n\\to\+\\infty\. In the real\-world, however, data is finite, therefore we developed in Appendix[A\.3](https://arxiv.org/html/2606.05636#A1.SS3)robustness guarantees on the probability of identifying the correct interventional target\. Those guarantees are tailored to SRDO\[Shenet al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib18)\], which is the instantiation of stable learning that we adopted in this work\.

###### Theorem 5\.1\(End\-to\-End Finite\-Sample Identification\)\.

LetXtX\_\{t\}be the true intervention target with marginal shift magnitudeδshift\>0\\delta\_\{shift\}\>0and structural intervention strengthν\>0\\nu\>0\. LetNNbe the sample size\. Suppose that the true intervention signal dominates the stable set approximation error, i\.e\. that the following Identifiability Margin Condition:

ν\>C⋅ϵdec\+η,\\nu\>C\\cdot\\epsilon\_\{dec\}\+\\eta,\(4\)whereν\\nuis the structural intervention strength in Def[A\.21](https://arxiv.org/html/2606.05636#A1.Thmtheorem21),ϵdec\\epsilon\_\{\\mathrm\{dec\}\}is the*decorrelation error*in Def\.[A\.20](https://arxiv.org/html/2606.05636#A1.Thmtheorem20),C\>0C\>0is a universal constant, andη\>0\\eta\>0is a separation margin\. Then, with probability at least:

P\(Success\)≥1−Λ\(N,δshift,K\)⏟Phase 1 Failure−exp⁡\(−Nη2/8\)⏟Phase 2 Failure\\displaystyle P\(\\text\{Success\}\)\\geq 1\-\\underbrace\{\\Lambda\(N,\\delta\_\{shift\},K\)\}\_\{\\text\{Phase 1 Failure\}\}\-\\underbrace\{\\exp\\left\(\-N\\eta^\{2\}/8\\right\)\}\_\{\\text\{Phase 2 Failure\}\}−8\(\|𝒦^\|−1\)e−Nη28M2⏟Phase 3 Failure,\\displaystyle\-\\underbrace\{8\(\|\\hat\{\\mathcal\{K\}\}\|\-1\)e^\{\-\\frac\{N\\eta^\{2\}\}\{8M^\{2\}\}\}\}\_\{\\text\{Phase 3 Failure\}\},\(5\)StableRCA correctly identifiesXtX\_\{t\}as the unique target\. Here,MMdenotes a uniform bound on the empirical conditional\-shift statistic and\|𝒦^\|\|\\hat\{\\mathcal\{K\}\}\|denotes the number of Phase 1 candidates\.Λ\(⋅\)\\Lambda\(\\cdot\)is the bound on failing to detect the marginal shift in Phase 1 \(Type II error\), defined based on the variable type ofXtX\_\{t\}:

Λ=\{2exp⁡\(−Nccont\),ifXtis continuous,2\(2K−2\)exp⁡\(−Ncdisc\),ifXtis discrete,\\Lambda=\\begin\{cases\}2\\exp\\\!\\left\(\-Nc\_\{\\mathrm\{cont\}\}\\right\),&\\text\{if \}X\_\{t\}\\text\{ is continuous\},\\\\\[2\.84526pt\] 2\(2^\{K\}\-2\)\\exp\\\!\\left\(\-Nc\_\{\\mathrm\{disc\}\}\\right\),&\\text\{if \}X\_\{t\}\\text\{ is discrete\},\\end\{cases\}\(6\)with

ccont:=2\(δshift−τcont\)2,cdisc:=12\(δshift−τdisc\)2\.c\_\{\\mathrm\{cont\}\}:=2\(\\delta\_\{\\mathrm\{shift\}\}\-\\tau\_\{\\mathrm\{cont\}\}\)^\{2\},\\qquad c\_\{\\mathrm\{disc\}\}:=\\tfrac\{1\}\{2\}\(\\delta\_\{\\mathrm\{shift\}\}\-\\tau\_\{\\mathrm\{disc\}\}\)^\{2\}\.Here,KKis the number of categories andτcont,τdisc\\tau\_\{\\mathrm\{cont\}\},\\tau\_\{\\mathrm\{disc\}\}denote the corresponding test thresholds\.

In essence, even in the finite\-sample regime, the failure probability of StableRCA decays exponentially in the sample sizeNN; equivalently, the probability of correctly identifying the true intervention target converges to 1 at an exponential rate\. Theorem[5\.1](https://arxiv.org/html/2606.05636#S5.Thmtheorem1)guarantees identification provided that the target’s signal dominates the noise\. In practice, relying solely on the raw predictability dropΔRi\\Delta R\_\{i\}favors variables with naturally higher marginal variance or larger unit scales \(e\.g\., continuous variables with large MSE\)\. To address this, we employ the RRV as defined in Eq\. \([3](https://arxiv.org/html/2606.05636#S4.E3)\), which acts as a variance\-stabilizing transformation\. This normalization renders the score dimensionless, enabling comparisons between mixed data types \(continuous and discrete\) and ensuring the ranking is driven by therelative magnitudeof the structural change rather than the intrinsic scale of the noise\.

## 6Experiments

### 6\.1Experimental settings

We evaluate StableRCA on both synthetic and real\-world datasets\. Below, we briefly describe the experimental settings and corresponding datasets; further details are provided in Appendix[C](https://arxiv.org/html/2606.05636#A3)\.

Synthetic data\.We generate data from SCMs defined on random DAGs sampled using the Erdős–Rényi \(ER\) model\[Erdös and Rényi,[1959](https://arxiv.org/html/2606.05636#bib.bib73)\]\. Each non\-root variable follows an additive\-noise structural equation, where the functional form is independently sampled as either a linear function or a nonlinear MLP\-based function\. Noise terms are independently drawn from Gaussian, Gumbel, Uniform, or Exponential distributions\. Anomalies are introduced through mechanistic interventions on one or more root\-cause variables\. Specifically, for an intervened variableXiX\_\{i\}with observational structural equationXi=fi\(Pai\)\+UiX\_\{i\}=f\_\{i\}\(\\mathrm\{Pa\}\_\{i\}\)\+U\_\{i\}, abnormal samples are generated from a perturbed mechanismXi′=fi′\(Pai\)\+UiX^\{\\prime\}\_\{i\}=f^\{\\prime\}\_\{i\}\(\\mathrm\{Pa\}\_\{i\}\)\+U\_\{i\}\. The remaining variables are then generated according to the original SCM, allowing the intervention effects to propagate through the causal graph\.

We evaluate three synthetic settings: causal\-graph quality, multiple interventions, and scalability\. Forcausal\-graph quality, we use graphs with5050nodes and100100edges, comparing the ground\-truth graph, an XGES\-estimated graph learned from normal data\[Nazaret and Blei,[2024](https://arxiv.org/html/2606.05636#bib.bib68)\], and corrupted DAGs with30%30\\%,50%50\\%, or70%70\\%edge deletion, addition, and orientation errors\. Formultiple interventions, we inject five root causes and vary graph size over4040,6060, and8080nodes with2d2dedges\. Forscalability, we use linear SCMs with five interventions and100100,200200,400400, or800800nodes\. Each setting uses 20 SCMs, with 2,000 normal and 200 abnormal samples per SCM, and is repeated over three runs\. We report the mean and uncertainty across these run repetitions\.

Real\-world data\.We evaluate on five datasets across retail services, microservices, manufacturing, and physical systems:ProRCA\[Dawoud and Talupula,[2025](https://arxiv.org/html/2606.05636#bib.bib67)\],Sock\-Shop\[[36](https://arxiv.org/html/2606.05636#bib.bib69)\],RCAEval\[Phamet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib70)\],CausalMan\[Tagliapietraet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib72)\], andCausalChambers\[Gamellaet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib28)\]\. For graph\-based baselines, we use ground\-truth causal graphs forProRCA,CausalMan, andCausalChambers; an edge\-reversed service\-call graph as a proxy causal graph forSock\-Shop; and an XGES\-estimated graph from normal data forRCAEval, where no ground\-truth graph is available\.

Baselines and metricsWe compare StableRCA against a diverse set of representative RCA baselines, includinggraph\-based methods: Traversal\[Chenet al\.,[2014](https://arxiv.org/html/2606.05636#bib.bib14)\], CIRCA\[Liet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib15)\], Smooth Traversal\[Orchardet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib17)\], and Counterfactual Attribution\[Budhathokiet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib74)\];graph\-free methods: RCD\[Ikramet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib13)\], Cholesky\[Liet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib16)\], Score Ordering\[Orchardet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib17)\], and RCG\-0\[Ikramet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib80)\]; andnon\-causal methods:ϵ\\epsilon\-diagnosis\[Shanet al\.,[2019](https://arxiv.org/html/2606.05636#bib.bib12)\], and BARO\[Phamet al\.,[2024](https://arxiv.org/html/2606.05636#bib.bib82)\]\. Baseline details are provided in Appendix[D](https://arxiv.org/html/2606.05636#A4)\. For sample\-level RCA methods, RCA scores are averaged across all abnormal samples\. We evaluate using top\-kkranking metrics: Top\-1 accuracy for single\-root\-cause cases and Top\-kkprecision/recall for multi\-root\-cause cases, withkkset to the number of true root causes and precision and recall coincide\.

### 6\.2Results on synthetic data

Table 1:Top\-1 accuracy on a 50\-node ER graph\. Comparison across the ground\-truth \(GT\), the XGES\-estimated graph, and GT with varying levels of corruption \(30%,50%,and70%30\\%,50\\%,\\text\{ and \}70\\%\)\.Table 2:Top\-5 precision/recall with 5 intervention targets\. Precision and recall coincide withk=5k=5\.Figure 3:Top\-1 accuracy across 5 different real\-world datasets\.
![Refer to caption](https://arxiv.org/html/2606.05636v1/x2.png)
Table 3:Top\-5 precision/recall and runtime under increasing graph sizes\. Precision and recall coincide withk=5k=5\. Runtime is reported in seconds\.Table 4:Ablation studies for StableRCA\. Left: MB ablation, where “Add” denotes injecting irrelevant variables into the MB, “Remove” denotes deleting true MB variables, and “Add/Remove” uses the same ratio for both operations\. Right: effect of abnormal sample size\.Effect of Causal\-Graph Quality\.Table[1](https://arxiv.org/html/2606.05636#S6.T1)reports Top\-1 accuracy under varying graph\-quality regimes\. Overall, StableRCA achieves the best performance in four out of five settings and remains competitive under 50% graph corruption, showing strong robustness to graph misspecification\. In contrast, graph\-based methods are more sensitive to graph quality: Smooth Traversal, Traversal, and Counterfactual Attribution perform reasonably with the true graph but degrade as corruption increases\. For example, Traversal drops from 0\.68 with the true graph to 0\.30 under 70% corruption, while Smooth Traversal decreases from 0\.58 to 0\.27, suggesting that incorrect edges can mislead propagation\-based RCA\. Graph\-independent or weakly graph\-dependent methods are less affected by corruption, but their accuracy varies substantially\. Score Ordering and Cholesky remain stable yet consistently underperform StableRCA, while RCD performs strongly and slightly surpasses StableRCA under 50% corruption\. In contrast,ϵ\\epsilon\-Diagnosis, BARO, CIRCA, and RCG\-0 perform poorly overall\. Finally, the XGES\-estimated graph yields results close to the true\-graph setting for most methods, suggesting that XGES provides a reasonably accurate graph estimate in this regime\.

Effect of Multiple Intervention Targets\.Table[2](https://arxiv.org/html/2606.05636#S6.T2)reports Top\-5 recovery under five simultaneous interventions\. StableRCA achieves the best performance across graph sizes, obtaining0\.610\.61,0\.580\.58, and0\.580\.58forn=40,60,80n=40,60,80, respectively\. Sincekkequals the number of true root causes, these scores directly indicate the fraction of intervention targets recovered in the top five predictions\. Score Ordering is competitive but drops atn=80n=80, while StableRCA remains stable\. Cholesky performs moderately but degrades with graph size, probably due to its linear\-SEM assumptions\. CIRCA, BARO, RCD, and RCG\-0 recover fewer true causes, whereas traversal\-based methods and counterfactual attribution fail under multiple interventions\. These results suggest that StableRCA remains robust when both graph size and the number of intervention targets increase\.

Scalability to Large Graphs\.Table[3](https://arxiv.org/html/2606.05636#S6.T3)reports Top\-5 precision/recall and runtime on graphs with 100–800 nodes under linear\-SCM, evaluated on a server with two Intel Xeon Gold 5320 CPUs at 2\.20 GHz\. StableRCA achieves the best accuracy–efficiency trade\-off, maintaining near\-perfect precision/recall from 0\.97 to 1\.00 while remaining substantially faster than other high\-accuracy methods\. Atn=800n=800, Cholesky and Score Ordering take 14994\.00 and 4024\.80 seconds, respectively, whereas StableRCA achieves 0\.99 precision/recall in 860\.90 seconds\. In contrast, traversal\-based methods,ϵ\\epsilon\-Diagnosis, and Counterfactual Attribution fail to recover root causes reliably; BARO is fastest but inaccurate; and RCD is efficient but less stable at large graph sizes\. These results show that StableRCA offers a practical balance between accuracy and scalability for large\-scale RCA\.

### 6\.3Results on real\-world scenarios

Figure[3](https://arxiv.org/html/2606.05636#S6.F3)summarizes the real\-world precision results across five benchmarks\. StableRCA forms a relatively large and balanced polygon, indicating consistently strong performance across heterogeneous real\-world scenarios\. In contrast, several baselines exhibit more irregular profiles: some perform competitively on particular benchmarks but degrade substantially on others\. This suggests that their effectiveness is more sensitive to dataset\-specific properties, graph quality, or intervention characteristics\. By relying on local Markov boundary estimation and conditional distribution shift detection, StableRCA achieves a more stable precision pattern across benchmarks, supporting its robustness in practical RCA settings without requiring a fully reliable global causal graph\.

### 6\.4Ablation studies

Robustness to cascading errors\.As StableRCA consists of three phases, errors in MB identification may propagate to Phase 3 and affect conditional distribution shift detection\. We therefore conduct an ablation study by comparing the SRDO\-estimated MBs with oracle MBs and perturbed oracle MBs with randomly removed true MB variables and/or added irrelevant variables\. Table[4](https://arxiv.org/html/2606.05636#S6.T4)\-left shows that the SRDO\-estimated MB achieves the same Top\-1 accuracy as the oracle MB, both reaching 0\.90\. This suggests that Phase 2 provides sufficiently accurate MB estimates in this setting\. Perturbing oracle MBs generally degrades performance, with removing true MB variables being more harmful than adding irrelevant ones: accuracy drops to 0\.60 after removing 40% of true MB variables, but remains 0\.80 after adding 40% irrelevant variables\. When both errors are introduced, accuracy decreases to 0\.65 under 40% removal and addition\. Overall, StableRCA is robust to moderate MB errors, but retaining true MB variables is more critical than excluding all irrelevant variables\.

Sensitivity to abnormal sample size\.Table[4](https://arxiv.org/html/2606.05636#S6.T4)\-right evaluates the sensitivity of StableRCA to the number of abnormal samples\. Performance improves as the abnormal sample size increases from 5 to 100, with Top\-1 accuracy rising from 0\.65 to 0\.85, suggesting that very small abnormal sets provide insufficient evidence for reliable marginal and conditional shift detection\. Beyond 100 abnormal samples, accuracy remains relatively stable, mostly between 0\.80 and 0\.85, indicating that StableRCA does not require a large abnormal set to perform well\. The drop at 2000 samples is likely due to finite\-run variability or changes in the detected shifted\-variable set\. Overall, StableRCA remains robust once a moderate number of abnormal samples is available\.

## 7Conclusions

In this work, we introduced StableRCA for population\-level RCA, showing that mechanism\-level root causes can be identified through local Markov boundary information rather than full causal graph recovery\. Our theoretical analysis characterizes when intervention targets can be separated from downstream effects under both idealized and non\-ideal regimes\. Empirically, StableRCA achieves strong effectiveness, robustness, and scalability across three synthetic configurations and five real\-world datasets, positioning it as a practical middle ground between graph\-based causal RCA and graph\-free statistical methods that may confuse root causes with downstream symptoms\.

Limitations and Future Work\.StableRCA still relies on sufficiently accurate Markov boundary estimation and standard causal assumptions such as causal sufficiency, faithfulness, and detectable intervention\-induced shifts\. Severe boundary estimation errors, hidden confounding, feedback dynamics, or weak interventions may therefore affect its reliability\. Future work will extend StableRCA to partially observed and cyclic systems, develop uncertainty\-aware or ensemble Markov boundary estimators, and combine population\-level mechanism diagnosis with instance\-level explanations\. Improving the efficiency of Markov boundary estimation and conditional shift testing is also important for deploying StableRCA in larger, heterogeneous, and streaming environments\.

## References

- P\. L\. Bartlett and S\. Mendelson \(2003\)Rademacher and gaussian complexities: risk bounds and structural results\.Journal of Machine Learning Research3\(null\),pp\. 463–482\.Cited by:[§A\.3](https://arxiv.org/html/2606.05636#A1.SS3.SSS0.Px4.p1.14),[§A\.3](https://arxiv.org/html/2606.05636#A1.SS3.SSS0.Px4.p1.4)\.
- K\. Budhathoki, L\. Minorics, P\. Blöbaum, and D\. Janzing \(2022\)Causal structure\-based root cause analysis of outliers\.InInternational conference on machine learning,pp\. 2357–2369\.Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p6.1.1),[§2](https://arxiv.org/html/2606.05636#S2.p1.1),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p5.4)\.
- P\. Chen, Y\. Qi, P\. Zheng, and D\. Hou \(2014\)CauseInfer: automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems\.InIEEE Conference on Computer Communications,pp\. 1887–1895\.Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p3.1.1),[§2](https://arxiv.org/html/2606.05636#S2.p1.1),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p5.4)\.
- W\. J\. Conover \(1999\)Practical nonparametric statistics\.3rd edition,John Wiley & Sons\.Cited by:[1st item](https://arxiv.org/html/2606.05636#A1.I3.i1.p1.1),[§4](https://arxiv.org/html/2606.05636#S4.p2.5)\.
- A\. Dawoud and S\. Talupula \(2025\)ProRCA: a causal python package for actionable root cause analysis in real\-world business scenarios\.arXiv:2503\.01475\.External Links:2503\.01475Cited by:[§C\.2](https://arxiv.org/html/2606.05636#A3.SS2.p3.1),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p4.1)\.
- A\. Dvoretzky, J\. Kiefer, and J\. Wolfowitz \(1956\)Asymptotic Minimax Character of the Sample Distribution Function and of the Classical Multinomial Estimator\.The Annals of Mathematical Statistics27\(3\),pp\. 642 – 669\.Cited by:[§A\.3](https://arxiv.org/html/2606.05636#A1.SS3.SSS0.Px1.p1.6),[Lemma A\.18](https://arxiv.org/html/2606.05636#A1.Thmtheorem18.p1.6.6)\.
- E\. e Oliveira, V\. L\. Miguéis, and J\. L\. Borges \(2022\)Automatic root cause analysis in manufacturing: an overview & conceptualization\.Journal of Intelligent Manufacturing34\(5\),pp\. 2061–2078\.Cited by:[§1](https://arxiv.org/html/2606.05636#S1.p1.1)\.
- P\. Erdös and A\. Rényi \(1959\)On random graphs i\.Publicationes Mathematicae Debrecen6,pp\. 290\.Cited by:[§C\.1](https://arxiv.org/html/2606.05636#A3.SS1.SSS0.Px1.p1.4),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p2.3)\.
- J\. L\. Gamella, J\. Peters, and P\. Bühlmann \(2025\)Causal chambers as a real\-world physical testbed for AI methodology\.Nature Machine Intelligence\.Cited by:[§C\.2](https://arxiv.org/html/2606.05636#A3.SS2.p6.10),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p4.1)\.
- A\. Ikram, S\. Chakraborty, S\. Mitra, S\. K\. Saini, S\. Bagchi, and M\. Kocaoglu \(2022\)Root cause analysis of failures in microservices through causal discovery\.InProceedings of International Conference on Neural Information Processing Systems,Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p5.1.1),[§2](https://arxiv.org/html/2606.05636#S2.p2.1),[§3\.1](https://arxiv.org/html/2606.05636#S3.SS1.p2.9),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p5.4)\.
- A\. Ikram, K\. Lee, S\. Agarwal, S\. K\. Saini, S\. Bagchi, and M\. Kocaoglu \(2025\)Root cause analysis of failures from partial causal structures\.InProceedings of the Forty\-First Conference on Uncertainty in Artificial Intelligence,Vol\.286,pp\. 1794–1818\.Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p10.1.1),[§2](https://arxiv.org/html/2606.05636#S2.p2.1),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p5.4)\.
- K\. M\. Kellogg, Z\. Hettinger, M\. Shah, R\. L\. Wears, C\. R\. Sellers, M\. Squires, and R\. J\. Fairbanks \(2017\)Our current approach to root cause analysis: is it contributing to our failure to improve patient safety?\.BMJ Quality & Safety26\(5\),pp\. 381–387\.Cited by:[§1](https://arxiv.org/html/2606.05636#S1.p1.1)\.
- K\. Kuang, R\. Xiong, P\. Cui, S\. Athey, and B\. Li \(2020\)Stable prediction with model misspecification and agnostic distribution shift\.Proceedings of the AAAI Conference on Artificial Intelligence34,pp\. 4485–4492\.Cited by:[§2](https://arxiv.org/html/2606.05636#S2.p4.1),[§3\.2](https://arxiv.org/html/2606.05636#S3.SS2.p1.1)\.
- J\. Li, B\. B\. Chu, I\. F\. Scheller, J\. Gagneur, and M\. H\. Maathuis \(2025\)Root cause discovery via permutations and cholesky decomposition\.Journal of the Royal Statistical Society Series B: Statistical Methodology,pp\. qkaf066\.Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p7.1.1),[§2](https://arxiv.org/html/2606.05636#S2.p2.1),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p5.4)\.
- M\. Li, Z\. Li, K\. Yin, X\. Nie, W\. Zhang, K\. Sui, and D\. Pei \(2022\)Causal inference\-based root cause analysis for online service systems with intervention recognition\.InProceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 3230–3240\.Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p4.1.1),[§2](https://arxiv.org/html/2606.05636#S2.p1.1),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p5.4)\.
- C\. Lin, C\. Chang, W\. Wang, K\. Wang, and W\. Peng \(2024\)Root cause analysis in microservice using neural granger causal discovery\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§2](https://arxiv.org/html/2606.05636#S2.p1.1)\.
- J\. Lin, P\. Chen, and Z\. Zheng \(2018\)Microscope: pinpoint performance issues with causal graphs in micro\-service environments\.InService\-Oriented Computing,pp\. 3–20\.Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p3.1.1),[§2](https://arxiv.org/html/2606.05636#S2.p1.1)\.
- D\. Liu, C\. He, X\. Peng, F\. Lin, C\. Zhang, S\. Gong, Z\. Li, J\. Ou, and Z\. Wu \(2021\)Microhecl: high\-efficient root cause localization in large\-scale microservice systems\.InIEEE/ACM International Conference on Software Engineering: Software Engineering in Practice,pp\. 338–347\.Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p3.1.1)\.
- M\. Ma, J\. Xu, Y\. Wang, P\. Chen, Z\. Zhang, and P\. Wang \(2020\)AutoMAP: diagnose your microservice\-based web applications automatically\.InProceedings of The Web Conference 2020,pp\. 246–258\.Cited by:[§2](https://arxiv.org/html/2606.05636#S2.p1.1)\.
- P\. Massart \(1990\)The tight constant in the dvoretzky\-kiefer\-wolfowitz inequality\.The Annals of Probability18\(3\),pp\. 1269–1283\.Cited by:[§A\.3](https://arxiv.org/html/2606.05636#A1.SS3.SSS0.Px1.4.p4.1),[Lemma A\.18](https://arxiv.org/html/2606.05636#A1.Thmtheorem18.p1.6.6)\.
- A\. Nazaret and D\. Blei \(2024\)Extremely greedy equivalence search\.InProceedings of Conference on Uncertainty in Artificial Intelligence,Cited by:[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p3.13)\.
- W\. R\. Orchard, N\. Okati, S\. H\. G\. Mejia, P\. Blöbaum, and D\. Janzing \(2025\)Root cause analysis of outliers with missing structural knowledge\.InThe Annual Conference on Neural Information Processing Systems,Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p8.1.1),[Appendix D](https://arxiv.org/html/2606.05636#A4.p9.1.1),[§2](https://arxiv.org/html/2606.05636#S2.p2.1),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p5.4)\.
- K\. Papageorgiou, T\. Theodosiou, A\. Rapti, E\. I\. Papageorgiou, N\. Dimitriou, D\. Tzovaras, and G\. Margetis \(2022\)A systematic review on machine learning methods for root cause analysis towards zero\-defect manufacturing\.Frontiers in Manufacturing Technology2,pp\. 972712\.Cited by:[§1](https://arxiv.org/html/2606.05636#S1.p1.1)\.
- J\. Pearl \(1988\)Probabilistic reasoning in intelligent systems: networks of plausible inference\.Morgan Kaufmann\.Cited by:[§A\.1](https://arxiv.org/html/2606.05636#A1.SS1.p3.3),[Definition A\.6](https://arxiv.org/html/2606.05636#A1.Thmtheorem6),[Definition A\.7](https://arxiv.org/html/2606.05636#A1.Thmtheorem7)\.
- J\. Pearl \(2009\)Causality\.2 edition,Cambridge University Press\.Cited by:[Definition A\.1](https://arxiv.org/html/2606.05636#A1.Thmtheorem1.p1.1),[Appendix A](https://arxiv.org/html/2606.05636#A1.p4.1),[§3\.1](https://arxiv.org/html/2606.05636#S3.SS1.p1.8),[§3\.1](https://arxiv.org/html/2606.05636#S3.SS1.p3.2)\.
- K\. Pearson \(1900\)X\. on the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling\.The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science50\(302\),pp\. 157–175\.Cited by:[2nd item](https://arxiv.org/html/2606.05636#A1.I3.i2.p1.1),[§4](https://arxiv.org/html/2606.05636#S4.p2.5)\.
- J\. Peters, P\. Bühlmann, and N\. Meinshausen \(2016\)Causal inference using invariant prediction: identification and confidence intervals\.Journal of the Royal Statistical Society: Series B \(Statistical Methodology\)78\(5\),pp\. 947–1012\.Cited by:[§2](https://arxiv.org/html/2606.05636#S2.p4.1)\.
- L\. Pham, H\. Ha, and H\. Zhang \(2024\)BARO: robust root cause analysis for microservices via multivariate bayesian online change point detection\.Proceedings of the ACM on Software Engineering1\(FSE\),pp\. 2214–2237\.Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p11.1.1),[§2](https://arxiv.org/html/2606.05636#S2.p3.2),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p5.4)\.
- L\. Pham, H\. Zhang, H\. Ha, F\. Salim, and X\. Zhang \(2025\)RCAEval: a benchmark for root cause analysis of microservice systems with telemetry data\.InCompanion Proceedings of the ACM Web Conference 2025,pp\. 777–780\.Cited by:[§C\.2](https://arxiv.org/html/2606.05636#A3.SS2.p7.1),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p4.1)\.
- L\. Pham \(2026\)Graph\-free root cause analysis\.arXiv preprint arXiv:2601\.21359\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2601.21359)Cited by:[§2](https://arxiv.org/html/2606.05636#S2.p2.1)\.
- L\. Prokhorenkova, G\. Gusev, A\. Vorobev, A\. V\. Dorogush, and A\. Gulin \(2018\)CatBoost: unbiased boosting with categorical features\.InProceedings of International Conference on Neural Information Processing Systems,pp\. 6639–6649\.Cited by:[Appendix B](https://arxiv.org/html/2606.05636#A2.p4.3)\.
- B\. Schölkopf, F\. Locatello, S\. Bauer, N\. R\. Ke, N\. Kalchbrenner, A\. Goyal, and Y\. Bengio \(2021\)Toward causal representation learning\.Proceedings of the IEEE109\(5\),pp\. 612–634\.Cited by:[Assumption A\.4](https://arxiv.org/html/2606.05636#A1.Thmtheorem4),[Appendix A](https://arxiv.org/html/2606.05636#A1.p3.1),[Appendix A](https://arxiv.org/html/2606.05636#A1.p4.1),[§3\.1](https://arxiv.org/html/2606.05636#S3.SS1.p3.2)\.
- H\. Shan, Y\. Chen, H\. Liu, Y\. Zhang, X\. Xiao, X\. He, M\. Li, and W\. Ding \(2019\)ϵ\\epsilon\-Diagnosis: unsupervised and real\-time diagnosis of small window long\-tail latency in large\-scale microservice platforms\.InThe World Wide Web Conference,pp\. 3215–3222\.Cited by:[Appendix D](https://arxiv.org/html/2606.05636#A4.p2.2.1),[§2](https://arxiv.org/html/2606.05636#S2.p3.2),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p5.4)\.
- Z\. Shen, P\. Cui, T\. Zhang, and K\. Kuang \(2020\)Stable learning via sample reweighting\.InProceedings of the AAAI Conference on Artificial Intelligence,pp\. 5692–5699\.Cited by:[§A\.3](https://arxiv.org/html/2606.05636#A1.SS3.SSS0.Px2.p1.5),[§A\.3](https://arxiv.org/html/2606.05636#A1.SS3.p1.1),[Appendix B](https://arxiv.org/html/2606.05636#A2.p4.3),[§2](https://arxiv.org/html/2606.05636#S2.p4.1),[§3\.2](https://arxiv.org/html/2606.05636#S3.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.05636#S3.SS2.p4.1),[§5](https://arxiv.org/html/2606.05636#S5.p3.1)\.
- H\. Shimodaira \(2000\)Improving predictive inference under covariate shift by weighting the log\-likelihood function\.Journal of Statistical Planning and Inference90\(2\),pp\. 227–244\.Cited by:[§4](https://arxiv.org/html/2606.05636#S4.p4.20)\.
- \[36\]\(2022\)Sock\-shop: a microservice demo application\.Note:[https://github\.com/microservices\-demo/microservices\-demo](https://github.com/microservices-demo/microservices-demo)Cited by:[§C\.2](https://arxiv.org/html/2606.05636#A3.SS2.p4.1),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p4.1)\.
- J\. Soldani and A\. Brogi \(2022\)Anomaly detection and failure root cause analysis in \(micro\) service\-based cloud applications: a survey\.ACM Computing Surveys55\(3\)\.Cited by:[§1](https://arxiv.org/html/2606.05636#S1.p1.1)\.
- P\. Spirtes, C\. Glymour, and R\. Scheines \(2000\)Causation, prediction, and search\.2 edition,MIT Press,Cambridge, MA\.Cited by:[Appendix A](https://arxiv.org/html/2606.05636#A1.p3.1),[§3\.1](https://arxiv.org/html/2606.05636#S3.SS1.p1.8)\.
- A\. Statnikov, N\. I\. Lytkin, J\. Lemeire, and C\. F\. Aliferis \(2013\)Algorithms for discovery of multiple markov boundaries\.Journal of Machine Learning Research14\(15\),pp\. 499–566\.Cited by:[§A\.1](https://arxiv.org/html/2606.05636#A1.SS1.p3.3)\.
- M\. Sugiyama, M\. Krauledat, and K\. Müller \(2007\)Covariate shift adaptation by importance weighted cross validation\.Journal of Machine Learning Research8,pp\. 985–1005\.Cited by:[Appendix B](https://arxiv.org/html/2606.05636#A2.p5.7)\.
- M\. Sugiyama, T\. Suzuki, and T\. Kanamori \(2012\)Density ratio estimation in machine learning\.Cambridge University Press\.Cited by:[§A\.3](https://arxiv.org/html/2606.05636#A1.SS3.SSS0.Px2.p1.6)\.
- N\. Tagliapietra, J\. Luettin, L\. Halilaj, M\. Willig, T\. Pychynski, and K\. Kersting \(2025\)CausalMan: a physics\-based simulator for large\-scale causality\.arXiv:2502\.12707\.External Links:2502\.12707Cited by:[§C\.2](https://arxiv.org/html/2606.05636#A3.SS2.p5.1),[§6\.1](https://arxiv.org/html/2606.05636#S6.SS1.p4.1)\.
- D\. Wang, Z\. Chen, Y\. Fu, Y\. Liu, and H\. Chen \(2023\)Incremental causal graph learning for online root cause analysis\.InProceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 2269–2278\.Cited by:[§2](https://arxiv.org/html/2606.05636#S2.p1.1)\.
- P\. Wang, J\. Xu, M\. Ma, W\. Lin, D\. Pan, Y\. Wang, and P\. Chen \(2018\)CloudRanger: root cause identification for cloud native systems\.In2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing \(CCGRID\),pp\. 492–502\.Cited by:[§2](https://arxiv.org/html/2606.05636#S2.p1.1)\.
- T\. Weissman, E\. Ordentlich, G\. Seroussi, S\. Verdú, and M\. J\. Weinberger \(2003\)Inequalities for the l1 deviation of the empirical distribution\.InHP Labs Technical Report,Cited by:[§A\.3](https://arxiv.org/html/2606.05636#A1.SS3.SSS0.Px1.6.p1.16),[Lemma A\.19](https://arxiv.org/html/2606.05636#A1.Thmtheorem19.p2.4.4)\.
- A\. W\. Wu, A\. K\. M\. Lipshutz, and P\. J\. Pronovost \(2008\)Effectiveness and efficiency of root cause analysis in medicine\.Journal of the American Medical Association299\(6\),pp\. 685–687\.Cited by:[§1](https://arxiv.org/html/2606.05636#S1.p1.1)\.
- R\. Xin, P\. Chen, and Z\. Zhao \(2023\)CausalRCA: causal inference based precise fine\-grained root cause localization for microservice applications\.Journal of Systems and Software203\(C\)\.Cited by:[§2](https://arxiv.org/html/2606.05636#S2.p1.1)\.
- R\. Xu, X\. Zhang, Z\. Shen, T\. Zhang, and P\. Cui \(2022\)A theoretical analysis on independence\-driven importance weighting for covariate\-shift generalization\.InProceedings of International Conference on Machine Learning,Vol\.162,pp\. 24803–24829\.Cited by:[§A\.1](https://arxiv.org/html/2606.05636#A1.SS1.1.p1.1),[§A\.1](https://arxiv.org/html/2606.05636#A1.SS1.p4.1),[§A\.3](https://arxiv.org/html/2606.05636#A1.SS3.SSS0.Px2.p1.5),[Assumption A\.10](https://arxiv.org/html/2606.05636#A1.Thmtheorem10),[Theorem A\.11](https://arxiv.org/html/2606.05636#A1.Thmtheorem11),[Assumption A\.5](https://arxiv.org/html/2606.05636#A1.Thmtheorem5),[§2](https://arxiv.org/html/2606.05636#S2.p4.1),[§3\.2](https://arxiv.org/html/2606.05636#S3.SS2.p3.2)\.
- X\. Zhang, P\. Cui, R\. Xu, L\. Zhou, Y\. He, and Z\. Shen \(2021\)Deep stable learning for out\-of\-distribution generalization\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 5372–5382\.Cited by:[§2](https://arxiv.org/html/2606.05636#S2.p4.1)\.
- L\. Zheng, Z\. Chen, J\. He, and H\. Chen \(2024\)MULAN: multi\-modal causal structure learning and root cause analysis for microservice systems\.InProceedings of the ACM Web Conference,pp\. 4107–4116\.Cited by:[§2](https://arxiv.org/html/2606.05636#S2.p1.1)\.

## Appendix ATheoretical Analysis

In this section, we provide a theoretical analysis of StableRCA \. First, we start from an analysis in the ideal case where exact stable sets are available\. Next, we proceed with an analysis of its robustness and potential failure modes\.

###### Definition A\.1\.

\[Structural Causal Model \(SCM\)\[Pearl,[2009](https://arxiv.org/html/2606.05636#bib.bib9)\]\] A Structural Causal Model \(SCM\) is a 4\-tupleℳ=⟨𝐗,𝐔,𝐅,P\(𝐔\)⟩\\mathcal\{M\}=\\langle\\mathbf\{X\},\\mathbf\{U\},\\mathbf\{F\},P\(\\mathbf\{U\}\)\\rangle, where:

- •𝐗\\mathbf\{X\}denotes the set of endogenous variables\.
- •𝐔\\mathbf\{U\}denotes the set of exogenous noise variables, representing external factors not explained within the model\.
- •𝐅=\{f1,…,fD\}\\mathbf\{F\}=\\\{f\_\{1\},\\dots,f\_\{D\}\\\}is a collection of structural equations, where each structural equation specifies an endogenous variableXi∈𝐗X\_\{i\}\\in\\mathbf\{X\}asXi=fi\(Pai,Ui\),X\_\{i\}=f\_\{i\}\(\\mathrm\{Pa\}\_\{i\},U\_\{i\}\),withPai⊆𝐗∖\{Xi\}\\mathrm\{Pa\}\_\{i\}\\subseteq\\mathbf\{X\}\\setminus\\\{X\_\{i\}\\\}denoting the set of parent variables ofXiX\_\{i\}in the associated causal graph, andUi∈𝐔U\_\{i\}\\in\\mathbf\{U\}the corresponding exogenous noise variable\.
- •P\(𝐔\)P\(\\mathbf\{U\}\)is the joint probability distribution over the exogenous variables\.

Additionally, an SCM induces an observational data distributionP\(𝐗\)P\(\\mathbf\{X\}\)\. We define an interventionIIon a target variableXt∈𝐗X\_\{t\}\\in\\mathbf\{X\}as the replacement of its structural equationftf\_\{t\}with a new mechanismf~t\\tilde\{f\}\_\{t\}\(or a fixed valuea∈ℝa\\in\\mathbb\{R\}\)\. An intervention results in a new SCMℳI\\mathcal\{M\}^\{I\}where the structural equation forXtX\_\{t\}is substituted\. We refer to the new data distribution entailed byℳI\\mathcal\{M\}^\{I\}asPIP^\{I\}\(𝐗\\mathbf\{X\}\)\.

We start by reporting the 3 core assumptions that are necessary for our framework:Causal Sufficiency\[Spirteset al\.,[2000](https://arxiv.org/html/2606.05636#bib.bib36)\],Causal Faithfulness\[Spirteset al\.,[2000](https://arxiv.org/html/2606.05636#bib.bib36)\]andIndependent Causal Mechanisms\[Schölkopfet al\.,[2021](https://arxiv.org/html/2606.05636#bib.bib10)\]\.

###### Assumption A\.2\(Causal Sufficiency\)\.

All variables𝐗\\mathbf\{X\}in the underlying data generating process, described by the SCMℳ\\mathcal\{M\}, are observed\.

###### Assumption A\.3\(Causal Faithfulness\)\.

LetXi,Xj⊂𝐗X\_\{i\},X\_\{j\}\\subset\\mathbf\{X\}, and letP\(𝐗\)P\(\\mathbf\{X\}\)be the observational data distribution induced by an SCMℳ\\mathcal\{M\}\. Then, ifXiX\_\{i\}andXjX\_\{j\}are independent givenXk∈𝐗X\_\{k\}\\in\\mathbf\{X\}inP\(𝐗\)P\(\\mathbf\{X\}\), then they are also d\-separated in the causal graph induced byℳ\\mathcal\{M\}\.

Given the interventional setting of this work, we leverage the Independent Causal Mechanisms assumption\[Schölkopfet al\.,[2021](https://arxiv.org/html/2606.05636#bib.bib10)\], often connected to the Modularity assumption\[Pearl,[2009](https://arxiv.org/html/2606.05636#bib.bib9)\]\.

###### Assumption A\.4\(Independent Causal Mechanisms\[Schölkopfet al\.,[2021](https://arxiv.org/html/2606.05636#bib.bib10)\]\)\.

The causal data generating process is composed by autonomous modules that do not inform or influence each other\. In essence, interventions on a variableXjX\_\{j\}do not change the other causal mechanisms\.

Lastly, we invoke theStrict Positivityassumption to facilitate the Markov Boundary identification with stable learning\.

###### Assumption A\.5\(Strict Positivity\[Xuet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib11)\]\)\.

The joint probability distributionP\(𝐗\)P\(\\mathbf\{X\}\)is strictly positive\. That is, for every possible assignment of values𝐱\\mathbf\{x\}in the domain ofX, it holds thatP\(𝐱\)\>0P\(\\mathbf\{x\}\)\>0\.

The strict positivity assumption is reasonable in practice, as the presence of measurement or exogenous noise with unbounded support \(e\.g\., Gaussian noise\) ensures that the induced joint distribution assigns positive probability \(or density\) to all admissible covariate configurations\.

### A\.1Markov Blanket, Markov Boundary, Stable Set, and Minimal Stable Set

Our framework builds on stable learning, which, under appropriate assumptions and data heterogeneity, enables identification of the minimal stable variable set\. Under the SCM framework considered in this work, this set coincides with the Markov Boundary \(and hence a Markov blanket\) of the target variable\. Accordingly, we formally define these notions below and clarify their relationships in the context of stable learning\. For notational convenience, we useBL\(Xi\)\\textrm\{BL\}\(X\_\{i\}\)to denote the Markov BlanketXiX\_\{i\},BD\(Xi\)\\textrm\{BD\}\(X\_\{i\}\)to denote the Markov Boundary ofXiX\_\{i\},Stable\(Xi\)\\textrm\{Stable\}\(X\_\{i\}\)to denote the stable set ofXiX\_\{i\}, andMinStable\(Xi\)\\textrm\{MinStable\}\(X\_\{i\}\)to denote the minimal stable set ofXiX\_\{i\}\. This notation differs from that used in the main text, whereMB\(Xi\)\\textrm\{MB\}\(X\_\{i\}\)denotes the Markov Boundary ofXiX\_\{i\}\.

###### Definition A\.6\(Markov Blanket\[Pearl,[1988](https://arxiv.org/html/2606.05636#bib.bib19)\]\)\.

A Markov blanket ofXiX\_\{i\}under distributionPP, denoted asBL\(Xi\)\\textrm\{BL\}\(X\_\{i\}\), is any subsetBL\(Xi\)⊆X∖Xi\\textrm\{BL\}\(X\_\{i\}\)\\subseteq\\textbf\{X\}\\setminus X\_\{i\}for which the following holds

Xi⟂\(X∖BL\(Xi\)\)\|BL\(Xi\)\.X\_\{i\}\\perp\(\\textbf\{X\}\\setminus\\textrm\{BL\}\(X\_\{i\}\)\)\|\\textrm\{BL\}\(X\_\{i\}\)\.\(7\)

Further, we define below the Markov Boundary as the minimal Markov blanket\.

###### Definition A\.7\(Markov Boundary\[Pearl,[1988](https://arxiv.org/html/2606.05636#bib.bib19)\]\)\.

A Markov Boundary ofXiX\_\{i\}, denoted asBD\(Xi\)\\textrm\{BD\}\(X\_\{i\}\), is a minimal subset of the Markov blanket ofXiX\_\{i\}\. In other words, no subset of the Markov Boundary satisfies the Markov blanket property\.

A Markov blanket of a variableXiX\_\{i\}is any set of variables that rendersXiX\_\{i\}conditionally independent of all remaining variables in the system\. The Markov Boundary, in contrast, is a Markov blanket that is minimal with respect to set inclusion, and thus contains no redundant variables\. In a causal DAG, under the causal faithfulness assumption, the Markov Boundary ofXiX\_\{i\}is given by the union of its parents, its children, and the coparents of its children \(often referred to as spouses\)\. While, in principle, multiple Markov boundaries may exist, classical results byPearl \[[1988](https://arxiv.org/html/2606.05636#bib.bib19)\]andStatnikovet al\.\[[2013](https://arxiv.org/html/2606.05636#bib.bib20)\]show that under mild regularity conditions satisfied in most practical settings, the Markov Boundary is unique\.

###### Definition A\.8\(Stable Set\)\.

A stable variable set ofXiX\_\{i\}under distributionPP, denoted asStable\(Xi\)\\textrm\{Stable\}\(X\_\{i\}\), is any subset𝐒i\\mathbf\{S\}\_\{i\}of\{X\\Xi\}\\\{\\textbf\{X\}\\backslash\{X\_\{i\}\}\\\}for which the following identity holds

𝔼P⁡\[Xi\|X\]=𝔼P⁡\[Xi\|𝐒i\]\.\\operatorname\{\\mathbb\{E\}\}\_\{P\}\[X\_\{i\}\|\\textbf\{X\}\]=\\operatorname\{\\mathbb\{E\}\}\_\{P\}\[X\_\{i\}\|\\mathbf\{S\}\_\{i\}\]\.\(8\)

###### Definition A\.9\(Minimal Stable Set\)\.

A minimal stable variable set ofXiX\_\{i\}, denoted asMinStable\(Xi\)\\textrm\{MinStable\}\(X\_\{i\}\), is a minimal set inStable\(Xi\)\\textrm\{Stable\}\(X\_\{i\}\), i\.e\., none of its proper subsets satisfies Equation \([8](https://arxiv.org/html/2606.05636#A1.E8)\)\.

The central intuition underlying stable learning is that the spurious correlations, that arise from associations and do not correspond to invariant causal dependencies across environments, should be removed in order to obtain predictors whose performance remains stable under distributional shifts\. In\[Xuet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib11)\],the authors further establish a formal connection between stable sets and Markov blankets, as well as between the minimal stable set and the Markov Boundary\. In particular, Theorem A\.2 of\[Xuet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib11)\]shows that, under idealized assumptions, any stable set constitutes a Markov blanket of the target variable, and the minimal stable set is a subset of the Markov Boundary\.

###### Assumption A\.10\(Strict Positivity\[Xuet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib11)\]\)\.

The joint probability distributionP\(𝐗\)P\(\\mathbf\{X\}\)is strictly positive\. That is, for every possible assignment of values𝐱\\mathbf\{x\}in the domain ofX, it holds thatP\(𝐱\)\>0P\(\\mathbf\{x\}\)\>0\.

###### Theorem A\.11\(Any Stable set is a Markov Blanket\[Xuet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib11)\]\)\.

Under Assumption[A\.2](https://arxiv.org/html/2606.05636#A1.Thmtheorem2)and[A\.10](https://arxiv.org/html/2606.05636#A1.Thmtheorem10), it holds that a Stable Set is also a Markov Blanket, and the minimal stable variable set is a subset of the Markov Boundary\. Formally,

BL\(Xi\)⊆Stable\(Xi\),BD\(Xi\)⊆MinStable\(Xi\)\.\\textrm\{BL\}\(X\_\{i\}\)\\subseteq\\textrm\{Stable\}\(X\_\{i\}\),\\qquad\\textrm\{BD\}\(X\_\{i\}\)\\subseteq\\textrm\{MinStable\}\(X\_\{i\}\)\.\(9\)

###### Proof\.

See\[Xuet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib11)\], Theorem A\.2\. ∎

Under the SCM framework defined in Definition[A\.1](https://arxiv.org/html/2606.05636#A1.Thmtheorem1), each variableXiX\_\{i\}is generated according to a structural equation of the formXi=fi\(Pai,Ui\)X\_\{i\}=f\_\{i\}\(Pa\_\{i\},U\_\{i\}\), wherefif\_\{i\}is deterministic and all stochasticity arises solely from the exogenous noise variableUiU\_\{i\}\. Consequently, the conditional mean sufficiency in Equation \([8](https://arxiv.org/html/2606.05636#A1.E8)\) holds\. Therefore, the minimal stable setMinStable\(Xi\)\\textrm\{MinStable\}\(X\_\{i\}\)coincides with the Markov BoundaryBD\(Xi\)\\textrm\{BD\}\(X\_\{i\}\)\.

### A\.2Ideal Conditions

We first perform a theoretical analysis under ideal conditions, i\.e\. namely assuming asymptotically infinite data, and StableRCA recovers the exact Markov Boundary of the target variable\.

###### Assumption A\.12\(Single Non\-trivial Interventional Target\)\.

The distribution shift in the outcomeYYis originated by a unique intervention on a variableXi∈VX\_\{i\}\\in V\.

First of all, our goal is to discover the intervention targetXtX\_\{t\}\. Under Assm\.[A\.12](https://arxiv.org/html/2606.05636#A1.Thmtheorem12), the intervened node must be an ancestor of all variables that exhibit a marginal distribution shift\.

###### Corollary A\.13\.

LetIIbe a non\-trivial intervention satisfying Assm\.[A\.12](https://arxiv.org/html/2606.05636#A1.Thmtheorem12), inducing a marginal distribution shift on the nodes in𝒦=\{Xi∈𝐗∣Pobs\(Xi\)≠PI\(Xi\)\}⊆𝐗\\mathcal\{K\}=\\\{X\_\{i\}\\in\\mathbf\{X\}\\mid P\_\{obs\}\(X\_\{i\}\)\\neq P^\{I\}\(X\_\{i\}\)\\\}\\subseteq\\mathbf\{X\}\. Under the causal sufficiency assumption\(Assm\.[A\.2](https://arxiv.org/html/2606.05636#A1.Thmtheorem2)\), the target of the interventionIIis an ancestor of all the nodes in𝒦\\mathcal\{K\}\.

###### Proof\.

This corollary follows from a simple observation: Under the assumptions that the underlying causal graph is a DAG, and under causal sufficiency, an intervention on a descendant cannot cause a marginal distribution shift on an ancestor node\. In the underlying SCM, the value ofXiX\_\{i\}is determined by its structural equationfif\_\{i\}\. Therefore, by exclusion, the intervention target can only be an ancestor of all other nodes in𝒦\\mathcal\{K\}\. ∎

Now, we proceed by showing how the detected marginal distribution shifts do not imply an intervention, aside for the intervention target node\.

###### Theorem A\.14\(Invariance ofP\(Xi\|BD\(Xi\)\)P\(X\_\{i\}\|\\textrm\{BD\}\(X\_\{i\}\)\)for non\-Intervened nodes\)\.

LetXi∈𝐗X\_\{i\}\\in\\mathbf\{X\}be a target variable andBD\(Xi\)\\textrm\{BD\}\(X\_\{i\}\)its associated Markov Boundary\. Under Assm\.[A\.12](https://arxiv.org/html/2606.05636#A1.Thmtheorem12), for any variableXiX\_\{i\}that is not the interventional target, the conditional distributionP\(Xi\|BD\(Xi\)\)P\(X\_\{i\}\|\\textrm\{BD\}\(X\_\{i\}\)\)is invariant with respect to any interventionIItargeting other variablesXt∈𝐗X\_\{t\}\\in\\mathbf\{X\}\. Mathematically,

PI\(Xi\|BD\(Xi\)\)=P\(Xi\|BD\(Xi\)\)∀i≠t,P^\{I\}\(X\_\{i\}\|\\textrm\{BD\}\(X\_\{i\}\)\)=P\(X\_\{i\}\|\\textrm\{BD\}\(X\_\{i\}\)\)\\quad\\forall i\\neq t,\(10\)wherePI\(𝐗\)P^\{I\}\(\\mathbf\{X\}\)is the data distribution induced byII\.

###### Proof\.

By the ICM assumption[A\.4](https://arxiv.org/html/2606.05636#A1.Thmtheorem4), when an intervention is applied onXtX\_\{t\}, it exclusively changes the causal mechanismftf\_\{t\}along with its associated conditional distributionP\(Xt\|Pat\)P\(X\_\{t\}\|Pa\_\{t\}\)\. All other causal mechanisms are invariant\.

Now, we distinguish two cases on whether the intervention is done outside of \(1\) or within \(2\) the Markov Boundary:

1. 1\.XtX\_\{t\}is outside of the Markov Boundary: This follows from the shielding property ofBD\(Xi\)\\textrm\{BD\}\(X\_\{i\}\) Xi⟂\(X∖BD\(Xi\)\)\|BD\(Xi\)⟹Xi⟂Xt\|BD\(Xi\)\.X\_\{i\}\\perp\(\\textbf\{X\}\\setminus\\textrm\{BD\}\(X\_\{i\}\)\)\|\\textrm\{BD\}\(X\_\{i\}\)\\implies X\_\{i\}\\perp X\_\{t\}\|\\textrm\{BD\}\(X\_\{i\}\)\.\(11\)Therefore,BD\(Xi\)\\textrm\{BD\}\(X\_\{i\}\)shieldsXi\\textbf\{X\}\_\{i\}from all other variables, including the intervened variableXtX\_\{t\}\.
2. 2\.XtX\_\{t\}is inside of the stable set: In this case, the shielding property of Markov blankets cannot be used\. For example, ifXtX\_\{t\}is a parent ofXiX\_\{i\}, thenXi⟂̸Xt\|BD\(Xi\)X\_\{i\}\\not\\perp X\_\{t\}\|\\textrm\{BD\}\(X\_\{i\}\), therefore a different argument is needed\. By Corollary[A\.13](https://arxiv.org/html/2606.05636#A1.Thmtheorem13), we know that the interventional target is an ancestor of all other variables exhibiting a marginal distribution shift betweenP\(𝐗\)P\(\\mathbf\{X\}\)andPI\(𝐗\)P^\{I\}\(\\mathbf\{X\}\)\. The case of an indirect ancestor is covered by point \(1\), and here we discuss the case in which the interventional targetXtX\_\{t\}is a parent ofXiX\_\{i\}contained inBD\(Xi\)\\textrm\{BD\}\(X\_\{i\}\)\. We recall the structural equation forXiX\_\{i\}as: Xi:=fi\(Pai,ui\)whereXt∈Pai⊆BD\(Xi\),X\_\{i\}:=f\_\{i\}\(Pa\_\{i\},u\_\{i\}\)\\quad\\text\{where \}X\_\{t\}\\in Pa\_\{i\}\\subseteq\\textrm\{BD\}\(X\_\{i\}\),\(12\)whereuiu\_\{i\}is the independent noise term\. We now exploit again the structural invariance, i\.e\. the ICM assumption: An intervention insidePai⊆BD\(Xi\)Pa\_\{i\}\\subseteq\\textrm\{BD\}\(X\_\{i\}\)affects the marginal distributionP\(BD\(Xi\)\)P\(\\textrm\{BD\}\(X\_\{i\}\)\), but it does not affectP\(Xi\|Pai\)P\(X\_\{i\}\|Pa\_\{i\}\)which stays invariant sincefif\_\{i\}is untouched\. And thereforeP\(Xi\|BD\(Xi\)\)P\(X\_\{i\}\|\\textrm\{BD\}\(X\_\{i\}\)\)stays invariant\.

∎

StableRCA is based on exploiting the conditional distribution shift induced by the interventionII\. Therefore, we continue by characterizing the behavior of the intervention targetXtX\_\{t\}\.

###### Theorem A\.15\(Distribution Change ofP\(Xt\|BD\(Xt\)\)P\(X\_\{t\}\|\\textrm\{BD\}\(X\_\{t\}\)\)for the Intervened node\)\.

LetXt∈𝐗X\_\{t\}\\in\\mathbf\{X\}be a target variable andBD\(Xt\)\\textrm\{BD\}\(X\_\{t\}\)an associated Markov Blanket\. If a nontrivial interventionIIis applied onXtX\_\{t\}, then its conditional distributionPI\(Xt\|BD\(Xt\)\)=P\(Xt\|BD\(Xt\),I\)P^\{I\}\(X\_\{t\}\|\\textrm\{BD\}\(X\_\{t\}\)\)=P\(X\_\{t\}\|\\textrm\{BD\}\(X\_\{t\}\),I\)with respect to the interventional distribution is different from the one on the observational distributionP\(Xt\|BD\(Xt\)\)P\(X\_\{t\}\|\\textrm\{BD\}\(X\_\{t\}\)\)\. Mathematically,

PI\(Xi\|BD\(Xi\)\)≠P\(Xi\|BD\(Xi\)\)\.P^\{I\}\(X\_\{i\}\|\\textrm\{BD\}\(X\_\{i\}\)\)\\neq P\(X\_\{i\}\|\\textrm\{BD\}\(X\_\{i\}\)\)\.\(13\)

###### Proof\.

This is an immediate implication of the ICM assumption \([A\.4](https://arxiv.org/html/2606.05636#A1.Thmtheorem4)\): Applying interventionIIresults on a new causalmechanism,mXt=f~t\(BD\(Xt\),ut\)X\_\{t\}=\\tilde\{f\}\_\{t\}\(\\textrm\{BD\}\(X\_\{t\}\),u\_\{t\}\)which entails a conditional distributionP\(Xt\|BD\(Xt\),I\)=PI\(Xt\|BD\(Xt\)\)P\(X\_\{t\}\|\\textrm\{BD\}\(X\_\{t\}\),I\)=P^\{I\}\(X\_\{t\}\|\\textrm\{BD\}\(X\_\{t\}\)\)\. Given that the interventionIIis not the identity \(by Assm\.[A\.12](https://arxiv.org/html/2606.05636#A1.Thmtheorem12)\), then it holds that

P\(Xt\|BD\(Xt\),I\)≠P\(Xt\|BD\(Xt\)\)\.P\(X\_\{t\}\|\\textrm\{BD\}\(X\_\{t\}\),I\)\\neq P\(X\_\{t\}\|\\textrm\{BD\}\(X\_\{t\}\)\)\.\(14\)∎

A practical example for corollary[A\.15](https://arxiv.org/html/2606.05636#A1.Thmtheorem15)can be made in the case of an hard\-intervention: such intervention entails a Dirac delta as a conditional and marginal distribution, that isP\(Xi\|BD\(Xi\),do\(Xi=δ\(Xi−a\)\)\)=P~\(Xi\)P\(X\_\{i\}\|\\textrm\{BD\}\(X\_\{i\}\),do\(X\_\{i\}=\\delta\(X\_\{i\}\-a\)\)\)=\\tilde\{P\}\(X\_\{i\}\)\. That would be clearly a distribution change unlessP\(Xi\|BD\(Xi\)\)P\(X\_\{i\}\|\\textrm\{BD\}\(X\_\{i\}\)\)was also a point\-mass distribution centered ata∈ℝa\\in\\mathbb\{R\}\.

###### Theorem A\.16\(Predictability drop under Interventional Data\)\.

Lethobs⋆:𝒮→𝒳h^\{\\star\}\_\{\\text\{obs\}\}:\\mathcal\{S\}\\to\\mathcal\{X\}be a predictive model for the intervention targetXt∈𝐗X\_\{t\}\\in\\mathbf\{X\}learned usingBD\(Xt\)\\textrm\{BD\}\(X\_\{t\}\)as a predictor set and the observational distributionPP\. Let𝕄\(h,P\)\\mathbb\{M\}\(h,P\)be a strictly proper metric \(i\.e\. a function that is minimized only by a unique correct model under a distributionPP\), and assumehobs⋆h^\{\\star\}\_\{\\text\{obs\}\}is be the optimal minimizer for𝕄\(h,Pobs\)\\mathbb\{M\}\(h,P\_\{obs\}\), and thathint⋆h^\{\\star\}\_\{\\text\{int\}\}is the one for𝕄\(h,PI\)\\mathbb\{M\}\(h,P^\{I\}\)\. Further, letIIbe a nontrivial intervention on the target variableXjX\_\{j\}inducing the interventional distributionPIP^\{I\}\. Then, it holds that

𝕄\(hobs⋆,PI\)\>𝕄\(hint⋆,PI\)\.\\mathbb\{M\}\(h^\{\\star\}\_\{\\text\{obs\}\},P^\{I\}\)\>\\mathbb\{M\}\(h^\{\\star\}\_\{\\text\{int\}\},P^\{I\}\)\.\(15\)In other words,hobs⋆h^\{\\star\}\_\{obs\}is optimal when evaluated on the observational distribution, but not the optimal model anymore when evaluated on the interventional distribution\.

###### Proof\.

Given that𝕄\\mathbb\{M\}is a strictly proper metric,hobs⋆h^\{\\star\}\_\{\\text\{obs\}\}andhint⋆h^\{\\star\}\_\{\\text\{int\}\}are the unique minimizers for𝕄\(h,P\)\\mathbb\{M\}\(h,P\)and𝕄\(h,PI\)\\mathbb\{M\}\(h,P^\{I\}\)respectively, therefore they correspond to the conditional distributionP\(Xt\|BD\(Xt\)\)P\(X\_\{t\}\|\\textrm\{BD\}\(X\_\{t\}\)\)in their respective environments \(observational and interventional distribution\)\.

By Assm\.[A\.12](https://arxiv.org/html/2606.05636#A1.Thmtheorem12), the intervention is nontrivial and cannot be the identity, and we proved in Theorem[A\.15](https://arxiv.org/html/2606.05636#A1.Thmtheorem15)that the observational and interventional conditional distributions differ\. Therefore,P\(Xt∣BD\(Xt\)\)≠PI\(Xt∣BD\(Xt\)\)P\(X\_\{t\}\\mid\\textrm\{BD\}\(X\_\{t\}\)\)\\neq P^\{I\}\(X\_\{t\}\\mid\\textrm\{BD\}\(X\_\{t\}\)\)implies thatPI≠PP^\{I\}\\neq P, and consequentlyhobs⋆≠hint⋆h^\{\\star\}\_\{\\text\{obs\}\}\\neq h^\{\\star\}\_\{\\text\{int\}\}\. Sincehint⋆h^\{\\star\}\_\{\\text\{int\}\}is theuniqueminimizer under the interventional distributionPIP^\{I\}, any different functionhobs⋆h^\{\\star\}\_\{\\text\{obs\}\}must have a strictly higher error:

𝕄\(hobs⋆,PI\)\>𝕄\(hint⋆,PI\),\\mathbb\{M\}\(h^\{\\star\}\_\{\\text\{obs\}\},P^\{I\}\)\>\\mathbb\{M\}\(h^\{\\star\}\_\{\\text\{int\}\},P^\{I\}\),\(16\)which concludes our proof\. ∎

### A\.3Non\-Ideal Case

After establishing the theoretical properties of our framework in the large\-sample limit, we turn to the finite\-data regime\. In detail, we analyze the implementation of StableRCA using theSample Reweighted Decorrelation Operator\(SRDO\)\[Shenet al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib18)\]\. While SRDO is designed to learn the minimal stable set, we discussed in Appendix[A\.1](https://arxiv.org/html/2606.05636#A1.SS1)that the minimal stable set coincides with the Markov Boundary in our specific setting\.

#### Phase 1: Detection of Marginal Distribution Shifts\.

During this phase, we search for nodes with a marginal distribution shift, as the intervention target will be among them\. We tackle this via Hypothesis testing\. For eachXi∈𝐗X\_\{i\}\\in\\mathbf\{X\}, we test the null hypothesisH0:Pobs\(Xi\)=PI\(Xi\)H\_\{0\}:P\_\{obs\}\(X\_\{i\}\)=P^\{I\}\(X\_\{i\}\)against the alternativeH1:Pobs\(Xi\)≠PI\(Xi\)H\_\{1\}:P\_\{obs\}\(X\_\{i\}\)\\neq P^\{I\}\(X\_\{i\}\)\. We employ appropriate non\-parametric tests based on the variable type:

- •Continuous Variables:We use the two\-sample Kolmogorov\-Smirnov \(KS\) test\[Conover,[1999](https://arxiv.org/html/2606.05636#bib.bib21)\]\.
- •Discrete/Categorical Variables:We use the Pearsonχ2\\chi^\{2\}test\[Pearson,[1900](https://arxiv.org/html/2606.05636#bib.bib22)\]\.

Letα\\alphabe the significance level for these tests\. The result is a candidate set𝒦^\\hat\{\\mathcal\{K\}\}containing variables where the p\-value<α<\\alpha, i\.e\. where a marginal distribution shift is detected\. Errors in this phase result in discarding candidates that might be potentially the intervention target\. Finite\-data guarantees for this phase are just a direct application of the Dvoretzky\-Kiefer\-Wolfowitz inequality\[Dvoretzkyet al\.,[1956](https://arxiv.org/html/2606.05636#bib.bib24)\]\.

With finite data, the sensibility of StableRCA depends on the amount of data and on how much the anomaly stands above noise\. To analyze this, we proceed by defining the Minimum Detectable Shift\.

###### Assumption A\.17\(Non\-Zero Distribution Shift for Intervened Nodes\)\.

LetXjX\_\{j\}be a variable with observational distributionPobsP\_\{obs\}and interventional distributionPintP\_\{int\}\. We quantify the magnitude of the marginal distribution shiftδj\\delta\_\{j\}as follows:

δj=\{supx\|Fobs\(x\)−Fint\(x\)\|ifXjis continuous \(Kolmogorov Dist\.\),12∑k\|Pobs\(k\)−Pint\(k\)\|ifXjis discrete \(Total Variation Dist\.\)\.\\delta\_\{j\}=\\begin\{cases\}\\sup\_\{x\}\|F\_\{obs\}\(x\)\-F\_\{int\}\(x\)\|&\\text\{if \}X\_\{j\}\\text\{ is continuous \(Kolmogorov Dist\.\),\}\\\\ \\frac\{1\}\{2\}\\sum\_\{k\}\|P\_\{obs\}\(k\)\-P\_\{int\}\(k\)\|&\\text\{if \}X\_\{j\}\\text\{ is discrete \(Total Variation Dist\.\)\.\}\\end\{cases\}\(17\)whereF\(⋅\)F\(\\cdot\)denotes the cumulative distribution function \(CDF\) andP\(⋅\)P\(\\cdot\)denotes the probability mass function \(PMF\)\. We assume that for any node affected by the intervention, there exists a non\-trivial shiftδj\>0\\delta\_\{j\}\>0\.

###### Lemma A\.18\(Finite\-Sample Detectability of Continuous\-Distribution Shifts\)\.

Consider a r\.v\.XXwith observational CDFFobsF\_\{obs\}and interventional CDFFintF\_\{int\}\. By Assm\.[A\.17](https://arxiv.org/html/2606.05636#A1.Thmtheorem17),δ=supx\|Fobs\(x\)−Fint\(x\)\|\\delta=\\sup\_\{x\}\|F\_\{obs\}\(x\)\-F\_\{int\}\(x\)\|is the Kolmogorov distance quantifying the distribution\-shift\. We perform a two\-sample Kolmogorov\-Smirnov test at significance levelα\\alphausingNNsamples from each distribution\. By the Dvoretzky\-Kiefer\-Wolfowitz \(DKW\) inequality\[Dvoretzkyet al\.,[1956](https://arxiv.org/html/2606.05636#bib.bib24), Massart,[1990](https://arxiv.org/html/2606.05636#bib.bib25)\], the probability of correctly detecting the shift \(statistical power\) is lower\-bounded by:

P\(RejectH0∣δ\)≥1−2exp⁡\(−N2\(δ−2Nln⁡2α\)2\),P\(\\text\{Reject \}H\_\{0\}\\mid\\delta\)\\geq 1\-2\\exp\\left\(\-\\frac\{N\}\{2\}\\left\(\\delta\-\\sqrt\{\\frac\{2\}\{N\}\\ln\\frac\{2\}\{\\alpha\}\}\\right\)^\{2\}\\right\),\(18\)provided that the sample sizeNNis large enough thatδ\>2Nln⁡2α\\delta\>\\sqrt\{\\frac\{2\}\{N\}\\ln\\frac\{2\}\{\\alpha\}\}\.

###### Proof\.

We aim to bound the probability of a Type\-II error \(failing to detect a shift when one exists\), i\.e\. acceptingH0H\_\{0\}when it is false\.

First, we recall that the two\-sample KS test statistic is the maximum distance between the empirical cumulative distribution functions \(ECDFs\), denoted asF^obs\\hat\{F\}\_\{obs\}andF^int\\hat\{F\}\_\{int\}:

D^N=supx\|F^obs\(x\)−F^int\(x\)\|\.\\hat\{D\}\_\{N\}=\\sup\_\{x\}\|\\hat\{F\}\_\{obs\}\(x\)\-\\hat\{F\}\_\{int\}\(x\)\|\.\(19\)The test rejects the null hypothesis \(H0:Fobs=FintH\_\{0\}:F\_\{obs\}=F\_\{int\}\) ifD^N\>c\(α\)\\hat\{D\}\_\{N\}\>c\(\\alpha\), where the critical valuec\(α\)≈2Nln⁡2αc\(\\alpha\)\\approx\\sqrt\{\\frac\{2\}\{N\}\\ln\\frac\{2\}\{\\alpha\}\}\.

We know the true distributions differ byδ\\delta\. By triangle inequality, the empirical distance is related to the true distance as

\|Fobs\(x\)−Fint\(x\)\|\\displaystyle\|F\_\{obs\}\(x\)\-F\_\{int\}\(x\)\|=\|\(Fobs\(x\)−F^obs\(x\)\)\+\(F^obs\(x\)−F^int\(x\)\)\+\(F^int\(x\)−Fint\(x\)\)\|,\\displaystyle=\|\(F\_\{obs\}\(x\)\-\\hat\{F\}\_\{obs\}\(x\)\)\+\(\\hat\{F\}\_\{obs\}\(x\)\-\\hat\{F\}\_\{int\}\(x\)\)\+\(\\hat\{F\}\_\{int\}\(x\)\-F\_\{int\}\(x\)\)\|,≤\|Fobs\(x\)−F^obs\(x\)\|\+\|F^obs\(x\)−F^int\(x\)\|\+\|F^int\(x\)−Fint\(x\)\|\.\\displaystyle\\leq\|F\_\{obs\}\(x\)\-\\hat\{F\}\_\{obs\}\(x\)\|\+\|\\hat\{F\}\_\{obs\}\(x\)\-\\hat\{F\}\_\{int\}\(x\)\|\+\|\\hat\{F\}\_\{int\}\(x\)\-F\_\{int\}\(x\)\|\.\(20\)Taking the supremum over allxxon both sides:

supx\|Fobs\(x\)−Fint\(x\)\|⏟True Shiftδ\\displaystyle\\underbrace\{\\sup\_\{x\}\|F\_\{obs\}\(x\)\-F\_\{int\}\(x\)\|\}\_\{\\text\{True Shift \}\\delta\}≤supx\(\|Fobs\(x\)−F^obs\(x\)\|\+\|F^obs\(x\)−F^int\(x\)\|\+\|F^int\(x\)−Fint\(x\)\|\),\\displaystyle\\leq\\sup\_\{x\}\\left\(\|F\_\{obs\}\(x\)\-\\hat\{F\}\_\{obs\}\(x\)\|\+\|\\hat\{F\}\_\{obs\}\(x\)\-\\hat\{F\}\_\{int\}\(x\)\|\+\|\\hat\{F\}\_\{int\}\(x\)\-F\_\{int\}\(x\)\|\\right\),≤supx\|F^obs\(x\)−Fobs\(x\)\|⏟Sampling ErrorE1\+supx\|F^obs\(x\)−F^int\(x\)\|⏟Empirical DistanceD^N\+supx\|F^int\(x\)−Fint\(x\)\|⏟Sampling ErrorE2\.\\displaystyle\\leq\\underbrace\{\\sup\_\{x\}\|\\hat\{F\}\_\{obs\}\(x\)\-F\_\{obs\}\(x\)\|\}\_\{\\text\{Sampling Error \}E\_\{1\}\}\+\\underbrace\{\\sup\_\{x\}\|\\hat\{F\}\_\{obs\}\(x\)\-\\hat\{F\}\_\{int\}\(x\)\|\}\_\{\\text\{Empirical Distance \}\\hat\{D\}\_\{N\}\}\+\\underbrace\{\\sup\_\{x\}\|\\hat\{F\}\_\{int\}\(x\)\-F\_\{int\}\(x\)\|\}\_\{\\text\{Sampling Error \}E\_\{2\}\}\.\(21\)Rearranging the inequality to isolate the test statisticD^N\\hat\{D\}\_\{N\}, we obtain the necessary lower bound:

D^N≥δ−\(E1\+E2\)\.\\hat\{D\}\_\{N\}\\geq\\delta\-\(E\_\{1\}\+E\_\{2\}\)\.\(22\)We fail to rejectH0H\_\{0\}if the empirical distance is too small, i\.e\. ifD^N≤c\(α\)\\hat\{D\}\_\{N\}\\leq c\(\\alpha\)\. Using the inequality above, failure implies:

δ−\(E1\+E2\)≤c\(α\)⟹\(E1\+E2\)≥δ−c\(α\)\.\\delta\-\(E\_\{1\}\+E\_\{2\}\)\\leq c\(\\alpha\)\\implies\(E\_\{1\}\+E\_\{2\}\)\\geq\\delta\-c\(\\alpha\)\.\(23\)Intuitively, the test fails if the sampling noise \(E1\+E2E\_\{1\}\+E\_\{2\}\) hides the true signal \(δ−c\(α\)\\delta\-c\(\\alpha\)\)\.

\[Massart,[1990](https://arxiv.org/html/2606.05636#bib.bib25)\]provides a version of the DKW inequality stating that∀ϵ\>0\\forall\\epsilon\>0it holds that

P\(supx\|F^N\(x\)−F\(x\)\|\>ϵ\)≤2exp⁡\(−2Nϵ2\)\.P\(\\sup\_\{x\}\|\\hat\{F\}\_\{N\}\(x\)\-F\(x\)\|\>\\epsilon\)\\leq 2\\exp\(\-2N\\epsilon^\{2\}\)\.\(24\)We apply this to the sum of errorsE1\+E2E\_\{1\}\+E\_\{2\}\. To boundP\(E1\+E2≥λ\)P\(E\_\{1\}\+E\_\{2\}\\geq\\lambda\), whereλ=δ−c\(α\)\\lambda=\\delta\-c\(\\alpha\), we assume the worst\-case split of errorλ/2\\lambda/2for each term:

P\(AcceptH0∣δ\)≤P\(E1≥λ/2\)\+P\(E2≥λ/2\)≤2⋅\[2exp⁡\(−2N\(λ2\)2\)\]\.P\(\\text\{Accept \}H\_\{0\}\\mid\\delta\)\\leq P\(E\_\{1\}\\geq\\lambda/2\)\+P\(E\_\{2\}\\geq\\lambda/2\)\\leq 2\\cdot\\left\[2\\exp\\left\(\-2N\(\\frac\{\\lambda\}\{2\}\)^\{2\}\\right\)\\right\]\.\(25\)Simplifying the exponent, we have that−2N\(λ/2\)2=−2N\(λ2/4\)=−N2λ2\-2N\(\\lambda/2\)^\{2\}=\-2N\(\\lambda^\{2\}/4\)=\-\\frac\{N\}\{2\}\\lambda^\{2\}\.

And finally, we reach our desired bound:

P\(AcceptH0∣δ\)≤4exp⁡\(−N2\(δ−c\(α\)\)2\)⟹P\(RejectH0∣δ\)≥1−2exp⁡\(−N2\(δ−2Nln⁡2α\)2\)\.P\(\\text\{Accept \}H\_\{0\}\\mid\\delta\)\\leq 4\\exp\\left\(\-\\frac\{N\}\{2\}\(\\delta\-c\(\\alpha\)\)^\{2\}\\right\)\\implies P\(\\text\{Reject \}H\_\{0\}\\mid\\delta\)\\geq 1\-2\\exp\\left\(\-\\frac\{N\}\{2\}\\left\(\\delta\-\\sqrt\{\\frac\{2\}\{N\}\\ln\\frac\{2\}\{\\alpha\}\}\\right\)^\{2\}\\right\)\.\(26\)∎

We proceed by deriving an analogous bound for discrete or categorical variables\.

###### Lemma A\.19\(Finite\-Sample Detectability of Discrete Distribution Shifts\)\.

Consider a discrete/categorical r\.v\.XXtaking values in a finite set of sizeKK\. LetPobsP\_\{obs\}andPintP\_\{int\}denote the observational and interventional probability mass functions \(PMFs\)\. By Assm\.[A\.17](https://arxiv.org/html/2606.05636#A1.Thmtheorem17), the distribution shiftδ\\deltafor categorical variables is quantified using the Total Variation Distance \(TVD\)\.

We perform a two\-sample test based on theL1L\_\{1\}distance \(or the asymptotically equivalent Pearsonχ2\\chi^\{2\}test\) at significance levelα\\alphausingNNsamples per distribution\. By Weissman’s Inequality\[Weissmanet al\.,[2003](https://arxiv.org/html/2606.05636#bib.bib26)\], the probability of correctly detecting the shift \(statistical power\) is lower\-bounded by:

P\(RejectH0∣δ\)≥1−2\(2K−2\)exp⁡\(−N2\(δ−2Nln⁡2\(2K−2\)α\)2\),P\(\\text\{Reject \}H\_\{0\}\\mid\\delta\)\\geq 1\-2\(2^\{K\}\-2\)\\exp\\left\(\-\\frac\{N\}\{2\}\\left\(\\delta\-\\sqrt\{\\frac\{2\}\{N\}\\ln\\frac\{2\(2^\{K\}\-2\)\}\{\\alpha\}\}\\right\)^\{2\}\\right\),\(27\)provided that the sample sizeNNis sufficient such thatδ\>2Nln⁡2\(2K−2\)α\\delta\>\\sqrt\{\\frac\{2\}\{N\}\\ln\\frac\{2\(2^\{K\}\-2\)\}\{\\alpha\}\}\.

###### Proof\.

We follow the same logic as Lemma[A\.18](https://arxiv.org/html/2606.05636#A1.Thmtheorem18), adapting the metric from the Kolmogorov distance \(used for continuous variables\) to theL1L\_\{1\}\-norm of PMFs \(used for categorical/discrete ones\)\. LetP^obs\\hat\{P\}\_\{obs\}andP^int\\hat\{P\}\_\{int\}be the empirical PMFs estimated fromNNsamples\. By Assm\.[A\.17](https://arxiv.org/html/2606.05636#A1.Thmtheorem17), the test statistic is the empirical Total Variation Distance \(orL1L\_\{1\}\-norm\):

D^N=12∑k=1K\|P^obs\(k\)−P^int\(k\)\|=12‖P^obs−P^int‖1\.\\hat\{D\}\_\{N\}=\\frac\{1\}\{2\}\\sum\_\{k=1\}^\{K\}\|\\hat\{P\}\_\{obs\}\(k\)\-\\hat\{P\}\_\{int\}\(k\)\|=\\frac\{1\}\{2\}\\\|\\hat\{P\}\_\{obs\}\-\\hat\{P\}\_\{int\}\\\|\_\{1\}\.\(28\)The null hypothesis \(Pobs=PintP\_\{obs\}=P\_\{int\}\) is rejected ifD^N\>c\(α\)\\hat\{D\}\_\{N\}\>c\(\\alpha\)\. By the triangle inequality on theL1L\_\{1\}norm \(procedure completely identical to the proof of Theorem[A\.18](https://arxiv.org/html/2606.05636#A1.Thmtheorem18)\):

12‖Pobs−Pint‖1⏟True Shiftδ\\displaystyle\\underbrace\{\\frac\{1\}\{2\}\\\|P\_\{obs\}\-P\_\{int\}\\\|\_\{1\}\}\_\{\\text\{True Shift \}\\delta\}≤12‖Pobs−P^obs‖1\+12‖P^obs−P^int‖1\+12‖P^int−Pint‖1,\\displaystyle\\leq\\frac\{1\}\{2\}\\\|P\_\{obs\}\-\\hat\{P\}\_\{obs\}\\\|\_\{1\}\+\\frac\{1\}\{2\}\\\|\\hat\{P\}\_\{obs\}\-\\hat\{P\}\_\{int\}\\\|\_\{1\}\+\\frac\{1\}\{2\}\\\|\\hat\{P\}\_\{int\}\-P\_\{int\}\\\|\_\{1\},=E1\+D^N\+E2,\\displaystyle=E\_\{1\}\+\\hat\{D\}\_\{N\}\+E\_\{2\},\(29\)whereE1E\_\{1\}andE2E\_\{2\}represent the sampling errors of the empirical PMFs in terms of TVD\. Rearranging for the test statistic, we haveD^N≥δ−\(E1\+E2\)\\hat\{D\}\_\{N\}\\geq\\delta\-\(E\_\{1\}\+E\_\{2\}\)\. A Type\-II error occurs if we fail to rejectH0H\_\{0\}, i\.e\.,D^N≤c\(α\)\\hat\{D\}\_\{N\}\\leq c\(\\alpha\), which implies:

E1\+E2≥δ−c\(α\)\.E\_\{1\}\+E\_\{2\}\\geq\\delta\-c\(\\alpha\)\.\(30\)\[Weissmanet al\.,[2003](https://arxiv.org/html/2606.05636#bib.bib26)\]provides the concentration bound for theL1L\_\{1\}deviation of a multinomial estimate onKKcategories, where∀ϵ\>0\\forall\\epsilon\>0it holds that

P\(‖P^−P‖1≥ϵ\)≤\(2K−2\)exp⁡\(−Nϵ22\)\.P\(\\\|\\hat\{P\}\-P\\\|\_\{1\}\\geq\\epsilon\)\\leq\(2^\{K\}\-2\)\\exp\\left\(\-\\frac\{N\\epsilon^\{2\}\}\{2\}\\right\)\.\(31\)Note that our error termsEEare defined as12‖…‖1\\frac\{1\}\{2\}\\\|\\dots\\\|\_\{1\}\. We have thatP\(E≥γ\)=P\(12‖P^−P‖1≥γ\)=P\(‖P^−P‖1≥2γ\)P\(E\\geq\\gamma\)=P\(\\frac\{1\}\{2\}\\\|\\hat\{P\}\-P\\\|\_\{1\}\\geq\\gamma\)=P\(\\\|\\hat\{P\}\-P\\\|\_\{1\}\\geq 2\\gamma\)\. Therefore, by Applying Weissman’s bound toP\(‖P^−P‖1≥2γ\)P\(\\\|\\hat\{P\}\-P\\\|\_\{1\}\\geq 2\\gamma\), we have that∀γ\>0\\forall\\gamma\>0:

P\(‖P^−P‖1≥2γ\)≤\(2K−2\)exp⁡\(−N\(2γ\)22\)=\(2K−2\)exp⁡\(−2Nγ2\)\.P\(\\\|\\hat\{P\}\-P\\\|\_\{1\}\\geq 2\\gamma\)\\leq\(2^\{K\}\-2\)\\exp\\left\(\-\\frac\{N\(2\\gamma\)^\{2\}\}\{2\}\\right\)=\(2^\{K\}\-2\)\\exp\(\-2N\\gamma^\{2\}\)\.\(32\)Settingλ=δ−c\(α\)\\lambda=\\delta\-c\(\\alpha\)and splitting the error budgetλ/2\\lambda/2betweenE1E\_\{1\}andE2E\_\{2\}:

P\(AcceptH0∣δ\)\\displaystyle P\(\\text\{Accept \}H\_\{0\}\\mid\\delta\)≤P\(E1≥λ/2\)\+P\(E2≥λ/2\),\\displaystyle\\leq P\(E\_\{1\}\\geq\\lambda/2\)\+P\(E\_\{2\}\\geq\\lambda/2\),≤2⋅\[\(2K−2\)exp⁡\(−2N\(λ2\)2\)\],\\displaystyle\\leq 2\\cdot\\left\[\(2^\{K\}\-2\)\\exp\\left\(\-2N\\left\(\\frac\{\\lambda\}\{2\}\\right\)^\{2\}\\right\)\\right\],=2\(2K−2\)exp⁡\(−N2λ2\)\.\\displaystyle=2\(2^\{K\}\-2\)\\exp\\left\(\-\\frac\{N\}\{2\}\\lambda^\{2\}\\right\)\.\(33\)Substitutingλ=δ−c\(α\)\\lambda=\\delta\-c\(\\alpha\)yields the required lower bound on the power\. ∎

#### Phase 2: Stable Set Learning via SRDO\.

In this phase, we learn the stable set for every variable in𝒦^\\hat\{\\mathcal\{K\}\}\. To do so, we employ the SRDO algorithm\[Shenet al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib18)\]\. SRDO works by learning a reweighting of the samples that transformsP\(𝐗\)P\(\\mathbf\{X\}\)intoP~\(𝐗\)=w\(𝐗\)⋅P\(𝐗\)\\tilde\{P\}\(\\mathbf\{X\}\)=w\(\\mathbf\{X\}\)\\cdot P\(\\mathbf\{X\}\)\. These weights are subsequently used as a feature selection criterion\. Further,\[Shenet al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib18), Xuet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib11)\]prove that features with non\-negligible regression coefficients belong to the stable set \(in practice, a threshold is applied\)\. Therefore, we can use SRDO to learn such stable set\. To do so, SRDO randomly resamples column\-wise the entries in𝐃\\mathbf\{D\}, resulting in a new data distributionP~\(𝐗\)\\tilde\{P\}\(\\mathbf\{X\}\)where each variable is decorrelated\. Subsequently, the weight matrix can be expressed as

w\(𝐗\)∼P~\(𝐗\)P\(𝐗\)=P~\(X1\)⋅P~\(X2\)⋯⋅P~\(XD\)P\(𝐗\)w\(\\mathbf\{X\}\)\\sim\\frac\{\\tilde\{P\}\(\\mathbf\{X\}\)\}\{P\(\\mathbf\{X\}\)\}=\\frac\{\\tilde\{P\}\(X\_\{1\}\)\\cdot\\tilde\{P\}\(X\_\{2\}\)\\dots\\cdot\\tilde\{P\}\(X\_\{D\}\)\}\{P\(\\mathbf\{X\}\)\}\(34\)which can be estimated by density\-ratio estimation\[Sugiyamaet al\.,[2012](https://arxiv.org/html/2606.05636#bib.bib23)\]\.

In the finite\-data regime, this decorrelation procedure may be imperfect\. Therefore, we define thedecorrelation erroras follows\.

###### Definition A\.20\(Decorrelation Error\)\.

Let𝐃∈ℝn×d\\mathbf\{D\}\\in\\mathbb\{R\}^\{n\\times d\}be the covariate matrix and let𝐖=diag\(w1,…,wd\)∈ℝd×d\\mathbf\{W\}=diag\(w\_\{1\},\\dots,w\_\{d\}\)\\in\\mathbb\{R\}^\{d\\times d\}be the weight matrix used to decorrelate features with SRDO\. Further, letΣW\\Sigma\_\{W\}be the covariance matrix ofP~\(𝐗\)\\tilde\{P\}\(\\mathbf\{X\}\)\. We define the decorrelation error as

ϵdec=‖ΣW∘\(𝐉−𝐈\)‖F,\\epsilon\_\{dec\}=\|\|\\Sigma\_\{W\}\\circ\(\\mathbf\{J\}\-\\mathbf\{I\}\)\|\|\_\{F\},\(35\)where\|\|⋅\|\|F\|\|\\cdot\|\|\_\{F\}is the Frobenius norm,𝐈∈ℝD×D\\mathbf\{I\}\\in\\mathbb\{R\}^\{D\\times D\}is the identity matrix, and𝐉∈ℝD×D\\mathbf\{J\}\\in\\mathbb\{R\}^\{D\\times D\}is an all\-ones matrix \(\[𝐉\]ij=1\[\\mathbf\{J\}\]\_\{ij\}=1∀i,j\\forall i,j\)\.

Errors in the de\-correlation procedure can result in erroneous variables contained in the stable set\. Before continuing with the last theorem, we have to define quantitatively the magnitude of the intervention by using the Structural Intervention Strength\.

###### Definition A\.21\(Structural Intervention Strength\)\.

The structural intervention strengthν\\nuis the magnitude of the mechanism shift for the true target variableXtX\_\{t\}\. By denoting withht∗h^\{\*\}\_\{t\}the optimal model trained on the observational data, we define it as

ν≔Δt=ℛint\(ht∗\)−ℛobs\(ht∗\),\\nu\\coloneqq\\Delta\_\{t\}=\\mathcal\{R\}\_\{int\}\(h^\{\*\}\_\{t\}\)\-\\mathcal\{R\}\_\{obs\}\(h^\{\*\}\_\{t\}\),\(36\)whereℛobs\(⋅\)\\mathcal\{R\}\_\{obs\}\(\\cdot\)andℛint\(⋅\)\\mathcal\{R\}\_\{int\}\(\\cdot\)denote the expected risk in the observational and interventional environments, respectively\. Here,ht∗h^\{\*\}\_\{t\}represents the optimal predictive model forXtX\_\{t\}inferred from the observational distribution using the true stable set \(Markov Boundary\)\.

Intuitively,ν\\numeasures the error increase encountered by the stable model when the underlying mechanismP\(Xt\|Pat\)P\(X\_\{t\}\|Pa\_\{t\}\)changes due to an intervention\.

###### Theorem A\.22\(End\-to\-End Finite\-Sample Identification\)\.

LetXtX\_\{t\}be the true intervention target with marginal shiftδshift\>0\\delta\_\{shift\}\>0and structural intervention strengthν\>0\\nu\>0\. LetNNbe the sample size\. If the Identifiability Margin Condition is satisfied, where the true intervention signal dominates the stable set approximation error:

ν\>C⋅ϵdec\+η,\\nu\>C\\cdot\\epsilon\_\{dec\}\+\\eta,\(37\)whereη\>0\\eta\>0is a separation margin\. Then, StableRCA correctly identifiesXtX\_\{t\}as the unique target with probability at least:

P\(Success\)≥1−Λ\(N,δshift,K\)⏟Phase 1 Failure−exp⁡\(−Nη2/8\)⏟Phase 2 Failure−8\(\|𝒦^\|−1\)e−Nη28M2⏟Phase 3 Failure,P\(\\text\{Success\}\)\\geq 1\-\\underbrace\{\\Lambda\(N,\\delta\_\{shift\},K\)\}\_\{\\text\{Phase 1 Failure\}\}\-\\underbrace\{\\exp\\left\(\-N\\eta^\{2\}/8\\right\)\}\_\{\\text\{Phase 2 Failure\}\}\-\\underbrace\{8\(\|\\hat\{\\mathcal\{K\}\}\|\-1\)e^\{\-\\frac\{N\\eta^\{2\}\}\{8M^\{2\}\}\}\}\_\{\\text\{Phase 3 Failure\}\},\(38\)whereΛ\(⋅\)\\Lambda\(\\cdot\)is the bound on failing to detect the marginal shift \(Type II error\), defined based on the variable type ofXtX\_\{t\}:

Λ=\{2exp⁡\(−2N\(δshift−τcont\)2\)ifXtis continuous,2\(2K−2\)exp⁡\(−N2\(δshift−τdisc\)2\)ifXtis discrete,\\Lambda=\\begin\{cases\}2\\exp\\left\(\-2N\(\\delta\_\{shift\}\-\\tau\_\{cont\}\)^\{2\}\\right\)&\\text\{if \}X\_\{t\}\\text\{ is continuous\},\\\\ 2\(2^\{K\}\-2\)\\exp\\left\(\-\\frac\{N\}\{2\}\(\\delta\_\{shift\}\-\\tau\_\{disc\}\)^\{2\}\\right\)&\\text\{if \}X\_\{t\}\\text\{ is discrete\},\\end\{cases\}\(39\)whereKKis the number of categories, andτ\\taurepresents the respective test thresholds\.

###### Proof\.

We define the event of "Success" as the intersection of the success events of the three phases:

- •EdetectE\_\{detect\}\(Detection\): The true targetXtX\_\{t\}is successfully included in the candidate set𝒦^\\hat\{\\mathcal\{K\}\}\(i\.e\., the hypothesis test correctly rejects the null\)\.
- •EstableE\_\{stable\}\(Stable Set Generalization\): An accurate stable set is learned, meaning the learned weights𝐖^\\hat\{\\mathbf\{W\}\}generalize to unseen environments \(i\.e\., no overfitting\)\.
- •EidentE\_\{ident\}\(Identification\): Among the candidates in𝒦^\\hat\{\\mathcal\{K\}\}, the empirical predictability gap of the targetΔ^t\\hat\{\\Delta\}\_\{t\}is strictly larger than that of any non\-targetXjX\_\{j\}\.

Success occurs ifEdetect∩Estable∩EidentE\_\{detect\}\\cap E\_\{stable\}\\cap E\_\{ident\}holds\. By De Morgan’s laws and the union bound, the probability of failure is:

P\(Failure\)=P\(Edetectc∪Establec∪Eidentc\)≤P\(Edetectc\)\+P\(Establec\)\+P\(Eidentc\)\.P\(\\text\{Failure\}\)=P\(E\_\{detect\}^\{c\}\\cup E\_\{stable\}^\{c\}\\cup E\_\{ident\}^\{c\}\)\\leq P\(E\_\{detect\}^\{c\}\)\+P\(E\_\{stable\}^\{c\}\)\+P\(E\_\{ident\}^\{c\}\)\.\(40\)
The Identifiability Margin Condition ensures that the true target’s signalν\\nuis strictly separated from the non\-target leakage by a safety bufferη\\eta\. In simple words, the target signalν\\nuemerges from noise by at least a marginη\\eta\. In the following, we will useη\\etaas the available error budget against finite\-sample noise\. We will allocateη\\etabetween phase 2 and 3 as a budget, and the proof will show that the probability of the actual sampling noise exceeding this budget decays exponentially withNN\.

We proceed by bounding each failure term separately\.

#### 1\. Bounding Phase 1 Failure \(P\(Edetectc\)P\(E\_\{detect\}^\{c\}\)\)\.

EventEdetectcE\_\{detect\}^\{c\}corresponds to a Type II error in the hypothesis test for the intervention targetXtX\_\{t\}\. The convergence guarantees depend on the data type ofXtX\_\{t\}:

Case A: Continuous Target\.We apply Lemma[A\.18](https://arxiv.org/html/2606.05636#A1.Thmtheorem18)\. Using the DKW inequality, the probability that the empirical Kolmogorov distance falls below the thresholdτcont\\tau\_\{cont\}is bounded by:

P\(Edetectc\)≤2exp⁡\(−2N\(δshift−τcont\)2\)\.P\(E\_\{detect\}^\{c\}\)\\leq 2\\exp\\left\(\-2N\(\\delta\_\{shift\}\-\\tau\_\{cont\}\)^\{2\}\\right\)\.\(41\)
Case B: Discrete Target\.We apply Lemma[A\.19](https://arxiv.org/html/2606.05636#A1.Thmtheorem19)\. Using Weissman’s inequality for the Total Variation distance, givenKKcategories, the failure probability is bounded by:

P\(Edetectc\)≤2\(2K−2\)exp⁡\(−N2\(δshift−τdisc\)2\)\.P\(E\_\{detect\}^\{c\}\)\\leq 2\(2^\{K\}\-2\)\\exp\\left\(\-\\frac\{N\}\{2\}\(\\delta\_\{shift\}\-\\tau\_\{disc\}\)^\{2\}\\right\)\.\(42\)In both cases, the error decays exponentially withNN, ensuring that the true target is included in𝒦^\\hat\{\\mathcal\{K\}\}with high probability\.

#### 2\. Bounding Phase 2 Failure \(P\(Establec\)P\(E\_\{stable\}^\{c\}\)\)\.

Here, we bound the probability that the learned stable set weights generalize\. Mathematically, we require the true population decorrelationϵdec\(𝐖^\)\\epsilon\_\{dec\}\(\\hat\{\\mathbf\{W\}\}\)to be close to the empirical estimateϵ^dec\(𝐖^\)\\hat\{\\epsilon\}\_\{dec\}\(\\hat\{\\mathbf\{W\}\}\)\. In SRDO, the optimization is formulated as a bounded Empirical Risk Minimization task over the weight hypothesis space𝒲\\mathcal\{W\}; therefore, classic uniform convergence bounds apply\. From\[Bartlett and Mendelson,[2003](https://arxiv.org/html/2606.05636#bib.bib71)\], we know that with probability1−δ′1\-\\delta^\{\\prime\}, it holds that:

sup𝐖∈𝒲\|ϵdec\(𝐖\)−ϵ^dec\(𝐖\)\|≤2ℜN\(ℱ𝒲\)\+Blog⁡\(1/δ′\)2N,\\sup\_\{\\mathbf\{W\}\\in\\mathcal\{W\}\}\|\\epsilon\_\{dec\}\(\\mathbf\{W\}\)\-\\hat\{\\epsilon\}\_\{dec\}\(\\mathbf\{W\}\)\|\\leq 2\\mathfrak\{R\}\_\{N\}\(\\mathcal\{F\}\_\{\\mathcal\{W\}\}\)\+B\\sqrt\{\\frac\{\\log\(1/\\delta^\{\\prime\}\)\}\{2N\}\},\(43\)whereBBis the range of the error \(assumedB=1B=1via normalization\), andℜN\(ℱ𝒲\)\\mathfrak\{R\}\_\{N\}\(\\mathcal\{F\}\_\{\\mathcal\{W\}\}\)is the Rademacher complexity ofℱ𝒲\\mathcal\{F\}\_\{\\mathcal\{W\}\}\. A failure in Phase 2 occurs if the generalization error \(LHS\) exceeds the allocated safety margin\. We allocateη/2\\eta/2of the total marginη\\etato this phase\. First,\[Bartlett and Mendelson,[2003](https://arxiv.org/html/2606.05636#bib.bib71)\]shows that the Rademacher complexity termℜN\(ℱ𝒲\)∈𝒪\(N−1/2\)\\mathfrak\{R\}\_\{N\}\(\\mathcal\{F\}\_\{\\mathcal\{W\}\}\)\\in\\mathcal\{O\}\(N^\{\-1/2\}\), therefore∃N∈ℕ\\exists N\\in\\mathbb\{N\}such that this term is bounded byη/4\\eta/4\. The remaining budgetη/4\\eta/4covers the sampling noise\. We set the failure condition as the event where the noise term exceeds this budget:

log⁡\(1/δ′\)2N\>η4\.\\sqrt\{\\frac\{\\log\(1/\\delta^\{\\prime\}\)\}\{2N\}\}\>\\frac\{\\eta\}\{4\}\.\(44\)We now solve this inequality forδ′\\delta^\{\\prime\}to bound the probability of Phase 2 failure:

log⁡\(1/δ′\)2N\>η4⟹P\(Establec\)=δ′<exp⁡\(−Nη28\)\.\\sqrt\{\\frac\{\\log\(1/\\delta^\{\\prime\}\)\}\{2N\}\}\>\\frac\{\\eta\}\{4\}\\implies P\(E\_\{stable\}^\{c\}\)=\\delta^\{\\prime\}<\\exp\\left\(\-\\frac\{N\\eta^\{2\}\}\{8\}\\right\)\.\(45\)

#### 3\. Bounding Phase 3 Failure \(P\(Eidentc\)P\(E\_\{ident\}^\{c\}\)\)\.

Given an accurate stable set, we aim to bound the probability that the sampling noise causes a non\-target variableXjX\_\{j\}to have a larger predictability gap than the true targetXtX\_\{t\}\(a ranking error\)\. We allocate the remaining noise budget ofη/2\\eta/2to Phase 3\.

LetΔk\\Delta\_\{k\}denote the population \(asymptotic\) predictability gap for variableXkX\_\{k\}, andΔ^k\\hat\{\\Delta\}\_\{k\}denote its empirical estimate\. By the Identifiability Margin Condition and our allocated budgetη/2\\eta/2, the true gaps satisfyΔt−Δj≥η/2\\Delta\_\{t\}\-\\Delta\_\{j\}\\geq\\eta/2for allj≠tj\\neq t\. A failure occurs ifΔ^j≥Δ^t\\hat\{\\Delta\}\_\{j\}\\geq\\hat\{\\Delta\}\_\{t\}\. Rearranging terms:

Δ^j\\displaystyle\\hat\{\\Delta\}\_\{j\}≥Δ^t,\\displaystyle\\geq\\hat\{\\Delta\}\_\{t\},\(Δ^j−Δj\)\+Δj\\displaystyle\(\\hat\{\\Delta\}\_\{j\}\-\\Delta\_\{j\}\)\+\\Delta\_\{j\}≥\(Δ^t−Δt\)\+Δt,\\displaystyle\\geq\(\\hat\{\\Delta\}\_\{t\}\-\\Delta\_\{t\}\)\+\\Delta\_\{t\},\(Δ^j−Δj\)−\(Δ^t−Δt\)\\displaystyle\(\\hat\{\\Delta\}\_\{j\}\-\\Delta\_\{j\}\)\-\(\\hat\{\\Delta\}\_\{t\}\-\\Delta\_\{t\}\)≥Δt−Δj,\\displaystyle\\geq\\Delta\_\{t\}\-\\Delta\_\{j\},\(Δ^j−Δj\)−\(Δ^t−Δt\)\\displaystyle\(\\hat\{\\Delta\}\_\{j\}\-\\Delta\_\{j\}\)\-\(\\hat\{\\Delta\}\_\{t\}\-\\Delta\_\{t\}\)≥η/2\.\\displaystyle\\geq\\eta/2\.\(46\)Letℰk=Δ^k−Δk\\mathcal\{E\}\_\{k\}=\\hat\{\\Delta\}\_\{k\}\-\\Delta\_\{k\}be the estimation error for variablekk\. The failure conditionℰj−ℰt≥η/2\\mathcal\{E\}\_\{j\}\-\\mathcal\{E\}\_\{t\}\\geq\\eta/2implies that at least one error must be large:

\{ℰj−ℰt≥η/2\}⟹\{\|ℰj\|≥η/4\}∪\{\|ℰt\|≥η/4\}\.\\\{\\mathcal\{E\}\_\{j\}\-\\mathcal\{E\}\_\{t\}\\geq\\eta/2\\\}\\implies\\\{\|\\mathcal\{E\}\_\{j\}\|\\geq\\eta/4\\\}\\cup\\\{\|\\mathcal\{E\}\_\{t\}\|\\geq\\eta/4\\\}\.\(47\)Note thatℰk\\mathcal\{E\}\_\{k\}decomposes into errors on interventional and observational risks:

ℰk=\(ℛ^int−ℛint\)−\(ℛ^obs−ℛobs\)\.\\mathcal\{E\}\_\{k\}=\(\\hat\{\\mathcal\{R\}\}\_\{int\}\-\\mathcal\{R\}\_\{int\}\)\-\(\\hat\{\\mathcal\{R\}\}\_\{obs\}\-\\mathcal\{R\}\_\{obs\}\)\.\(48\)Assuming the loss is bounded byM∈ℝ\>0M\\in\\mathbb\{R\}\_\{\>0\}, we can apply Hoeffding’s inequality, as the loss is assumed to be the sum ofNNindependent samples\. For the total errorℰk\\mathcal\{E\}\_\{k\}to stay withinη/4\\eta/4, each risk term must stay withinη/8\\eta/8\. The probability of exceeding this is:

P\(\|ℰk\|≥η/4\)\\displaystyle P\(\|\\mathcal\{E\}\_\{k\}\|\\geq\\eta/4\)≤P\(\|ℛ^int−ℛint\|≥η/8\)\+P\(\|ℛ^obs−ℛobs\|≥η/8\),\\displaystyle\\leq P\(\|\\hat\{\\mathcal\{R\}\}\_\{int\}\-\\mathcal\{R\}\_\{int\}\|\\geq\\eta/8\)\+P\(\|\\hat\{\\mathcal\{R\}\}\_\{obs\}\-\\mathcal\{R\}\_\{obs\}\|\\geq\\eta/8\),≤2exp⁡\(−2N\(η/8\)2M2\)\+2exp⁡\(−2N\(η/8\)2M2\),\\displaystyle\\leq 2\\exp\\left\(\-\\frac\{2N\(\\eta/8\)^\{2\}\}\{M^\{2\}\}\\right\)\+2\\exp\\left\(\-\\frac\{2N\(\\eta/8\)^\{2\}\}\{M^\{2\}\}\\right\),=4exp⁡\(−Nη232M2\)\.\\displaystyle=4\\exp\\left\(\-\\frac\{N\\eta^\{2\}\}\{32M^\{2\}\}\\right\)\.\(49\)Using the union bound on Eq\.[47](https://arxiv.org/html/2606.05636#A1.E47), the pairwise ranking failure probability is:

P\(Δ^j≥Δ^t\)≤P\(\|ℰj\|≥η/4\)\+P\(\|ℰt\|≥η/4\)≤8exp⁡\(−Nη232M2\)\.P\(\\hat\{\\Delta\}\_\{j\}\\geq\\hat\{\\Delta\}\_\{t\}\)\\leq P\(\|\\mathcal\{E\}\_\{j\}\|\\geq\\eta/4\)\+P\(\|\\mathcal\{E\}\_\{t\}\|\\geq\\eta/4\)\\leq 8\\exp\\left\(\-\\frac\{N\\eta^\{2\}\}\{32M^\{2\}\}\\right\)\.\(50\)Finally, we apply the union bound over all candidate variables\. A global failure occurs ifanynon\-targetj∈𝒦^∖\{t\}j\\in\\hat\{\\mathcal\{K\}\}\\setminus\\\{t\\\}flips ranking with the target:

P\(Eidentc\)\\displaystyle P\(E\_\{ident\}^\{c\}\)=P\(⋃j∈𝒦^∖\{t\}\{Δ^j≥Δ^t\}\),\\displaystyle=P\\Big\(\\bigcup\_\{j\\in\\hat\{\\mathcal\{K\}\}\\setminus\\\{t\\\}\}\\left\\\{\\hat\{\\Delta\}\_\{j\}\\geq\\hat\{\\Delta\}\_\{t\}\\right\\\}\\Big\),≤∑j≠tP\(Δ^j≥Δ^t\),\\displaystyle\\leq\\sum\_\{j\\neq t\}P\\left\(\\hat\{\\Delta\}\_\{j\}\\geq\\hat\{\\Delta\}\_\{t\}\\right\),≤\(\|𝒦^\|−1\)⋅8exp⁡\(−Nη232M2\)\.\\displaystyle\\leq\(\|\\hat\{\\mathcal\{K\}\}\|\-1\)\\cdot 8\\exp\\left\(\-\\frac\{N\\eta^\{2\}\}\{32M^\{2\}\}\\right\)\.\(51\)This confirms the exponential decay of the ranking error\.

Conclusion\.Summing the failure probabilities for all three phases yields the theorem statement\. AsN→∞N\\to\\infty, all terms vanish, and the probability of success tends to 1 with exponential speed\. ∎

### A\.4Discussion on scenarios with Multiple Interventional Targets

In case of multiple interventions, most currently available approaches for RCA display limitations\. For StableRCA, the theory still holds either by adding an additional assumption, or by introducing partial causal information as an inductive bias:

1. 1\.Assuming that interventions cannot happen at consecutive nodes, as formalized below: ###### Assumption A\.23\(Non\-Consecutive Interventional Targets\)\. No interventional target is directly connected to another\. The reason why this assumption is needed is that intervening on the children of a target encounters two limitations of our current design: i\) Being the children part of the Markov Boundary, it is not conditionally independent from its parents, and 2\) The intervention removes a causal link, therefore impeding us to leverage on the ICM assumption \(the mechanism was indeed changed\) as done in Theorem[A\.14](https://arxiv.org/html/2606.05636#A1.Thmtheorem14)\. Indeed, in the event of a simultaneous intervention applied both children and parent, it would not be possible to distinguish whether the targets are only the children alone, or both children and parent together\.
2. 2\.Alternatively, Assm\.[A\.23](https://arxiv.org/html/2606.05636#A1.Thmtheorem23)can be relaxed in the case that prior knowledge is available on the causal relationship between the two targets: If we know thatX→YX\\to Y, and forceXXto not useYYfor making predictions, then the identifiability issue is resolved\. In this case, observing a conditional shift on both variables can only be explained by a multiple intervention on parent and children simultaneously\.

## Appendix BStableRCA Implementation Details

Here we present the implementation details of each module in StableRCA\.

The overall algorithm takes as input two data frames: one corresponding to the observational dataset𝒟obs\\mathcal\{D\}\_\{\\textrm\{obs\}\}and the other to the interventional dataset𝒟int\\mathcal\{D\}\_\{\\textrm\{int\}\}\. Each data frame containsDDcolumns, corresponding to the set of variables, andNobsN\_\{obs\}\(respectivelyNintN\_\{int\}\) rows, corresponding to the number of samples\. The output is a dictionary with keys representing the variable names and values representing their corresponding scores for being root cause\.

In the first module, we perform marginal distribution shift detection independently for each variabled∈\(1,…D\)d\\in\(1,\.\.\.D\)\. Using the observational dataset, we first determine whether each variable should be treated as continuous or discrete based on its data type and cardinality\. Specifically, a variable is considered discrete if it is categorical or if the number of unique values is below a predefined threshold \(set to 10 in our experiments\); otherwise, it is treated as continuous\. For continuous variables, we apply the Kolmogorov–Smirnov test, while for discrete variables we use the chi\-squared test, on both𝒟obs\\mathcal\{D\}\_\{\\textrm\{obs\}\}and𝒟int\\mathcal\{D\}\_\{\\textrm\{int\}\}to assess if a marginal distribution shift has occurred between these two datasets\. Throughout our experiments, the significance level of the statistical tests is set toα=0\.001\\alpha=0\.001\. The output of this module is the set of variables𝐗shift\\mathbf\{X\}\_\{\\textrm\{shift\}\}that exhibit statistically significant marginal distribution shifts\.

In the second module, we apply the SRDO algorithm\[Shenet al\.,[2020](https://arxiv.org/html/2606.05636#bib.bib18)\], a representative instantiation of the stable learning framework, to identify the Markov Boundary variables for each variableXi∈𝐗shiftX\_\{i\}\\in\\mathbf\{X\}\_\{\\textrm\{shift\}\}detected in the first module\. Specifically, each variableXiX\_\{i\}is treated as the target variable, while the remaining variables𝐗−i=𝐗\\Xi\\mathbf\{X\}\_\{\-i\}=\\mathbf\{X\}\\backslash X\_\{i\}are used as predictors\. SRDO takes the observational dataset as input and estimates sample\-wise importance weights via an independence\-driven weighting procedure, following the principles described in Appendix[A\.3](https://arxiv.org/html/2606.05636#A1.SS3)\. Using the estimated weights, we then perform a weighted regression \(for continuous targets\) or a weighted classification \(for discrete targets\) to select the Markov Boundary features\. To accommodate potentially nonlinear data\-generating mechanisms and handle categorical variables effectively, we employ CatBoost\-based regressors and classifiers\[Prokhorenkovaet al\.,[2018](https://arxiv.org/html/2606.05636#bib.bib65)\]\. Since tree\-based models do not yield explicit regression coefficients, we use the built\-in Gini importance scores as a proxy for feature relevance and perform feature selection accordingly\. Features whose importance values exceed a predefined threshold are selected\. In our experiments, this threshold is set to 0\.15 times the maximum importance value across all features\.

In the third module, we conduct conditional distribution shift detection by measuring the degradation in predictive performance of a learned model when evaluated on observational versus interventional data\. The observational data set𝒟obs\\mathcal\{D\}\_\{\\textrm\{obs\}\}is first split into a training set and a held\-out test set\. For a target variableXi∈𝐗shiftX\_\{i\}\\in\\mathbf\{X\}\_\{\\textrm\{shift\}\}, the MB features identified in the second module are used as predictors to fit a predictive model – specifically, a regressor ifXiX\_\{i\}is continuous and a classifier ifXiX\_\{i\}is discrete\. In our experiments, we employ CatBoost\-based regressors and classifiers\. The trained model is then evaluated on both the held\-out test set from𝒟obs\\mathcal\{D\}\_\{\\textrm\{obs\}\}and the interventional dataset𝒟int\\mathcal\{D\}\_\{\\textrm\{int\}\}\. For continuous targets, we compute the mean squared error, while for discrete targets, we compute the cross\-entropy loss\. To isolate conditional distribution shifts from covariate shifts in the predictors, we apply an importance\-weighting procedure when computing the loss on the interventional dataset, as described in Section[4](https://arxiv.org/html/2606.05636#S4)\. In practice, the importance weights are estimated by training an auxiliary nuisance classifier following standard procedures\[Sugiyamaet al\.,[2007](https://arxiv.org/html/2606.05636#bib.bib66)\]\. Finally, the root cause score for each variableXiX\_\{i\}is computed as the relative risk variation after importance weighting, as formalized in Equation \([3](https://arxiv.org/html/2606.05636#S4.E3)\)\. The resulting scores across all variables are used for the overall top\-k evaluation of the complete RCA algorithm\.

## Appendix CDataset Description

### C\.1Synthetic Data Generation

We generate synthetic tabular datasets using a causal graph\-based data\-generating mechanism, and create anomalies by intervening on a small set of causal variables \(“root causes”\)\. In each trial, we first sample a causal graph that specifies the high\-level data\-generating structure among variables, and instantiate a structural causal model by assigning node\-wise structural equations and exogenous noise distributions\. We then draw a set of observational \(normal\) samples for training, and generate abnormal samples under interventions on the selected root\-cause variables\. Along with the resulting observational and interventional datasets, we also record the ground\-truth root\-cause set, as well as a designated target node for evaluation\.

#### Causal Graph Construction

We first sample a causal graphG=\(V,E\)G=\(V,E\)with\|V\|=n\|V\|=nvariables using ER algorithm\[Erdös and Rényi,[1959](https://arxiv.org/html/2606.05636#bib.bib73)\]\. Concretely, we draw an undirected ER graph withnnnodes and\|E\|=2n\|E\|=2nedges, and then assign directions to the edges to obtain a directed acyclic graph \(DAG\) that serves as the causal structure for the data\-generating mechanism\.

#### Structural Assignments and Normal Data Generation

Given the causal graph, we instantiate the data\-generating mechanism by assigning each node a structural equation and an exogenous noise distribution, which together define how normal \(observational\) data are generated\. Specifically, root nodes are sampled independently from a uniform distribution over a predefined range, while each non\-root node is generated as a function of its parent variables plus additive noise\. The the functional form is independently sampled as either a linear function or a nonlinear MLP\-based function\. For linear mappings, weights are independently sampled from a uniform distribution𝒰\(0\.25,1\)\\mathcal\{U\}\(0\.25,1\), and no bias term is used\. For nonlinear mappings, we employ a two\-layer MLP with a hidden dimension of 10 and randomly initialized weights\. For each node, we randomly select a noise distribution type from a predefined set that includes Gaussian, Gumbel, Uniform, and Exponential, and generate zero\-mean noise with a fixed standard deviation\. The sampled noise is then added to the output of the structural function to obtain the final value of the node\. Sampling from this instantiated model yields the observational dataset, which serves as the reference distribution for root\-cause analysis\.

#### Intervention Design and Abnormal Data Generation

To simulate anomalies, we intervene on a small set of variables referred to as*root causes*\. The root\-cause variables are selected from non\-root nodes of the causal graph, and in multi\-root\-cause settings, we enforce that no selected variable is an ancestor or descendant of another, so as to avoid trivial causal chains within the intervention set\. By modifying the data\-generating mechanism at these variables and sampling from the resulting model, we obtain the interventional \(abnormal\) dataset\. We consider the soft\-function interventions, where the structural function of the variable is modified while keeping its noise distribution unchanged\.

#### Target Node Selection

To ensure that the target variable indeed corresponds to an anomalous observation caused by the selected root causes, the target node is determined*after*the intervention mechanism is specified\. Specifically, given the selected root\-cause variables, all their descendant nodes are identified, and one of them is randomly chosen as the target node\. This guarantees that the anomaly observed at the target node is causally influenced by the intervened variables\.

#### Evaluation Protocol

In addition to observational and abnormal datasets, the ground\-truth root\-cause set is recorded and used to evaluate the rankings produced by the RCA methods\. In each trial, 2,000 observational samples and 200 abnormal samples are provided, together with the underlying SCM when required by certain RCA methods\. By default, each experimental setting is repeated for 20 independent trials with different SCMs, and all reported results are averaged over three repetitive runs\. In multi\-root\-cause settings, top\-kkmetrics are reported withkkequal to the number of intervened variables\. For sample\-level RCA methods, including Score Ordering, Smooth Traversal, Traversal, Counterfactual Attribution, RCG\-0, and Cholesky, when multiple abnormal samples are available, a batch evaluation protocol is optionally applied, in which each sample is scored independently and the resulting node\-wise scores are aggregated \(e\.g\., by mean or max\) to produce a final ranking\.

### C\.2Real\-World Datasets

We provide detailed descriptions of the datasets used in our experimental evaluation\. These datasets are derived from multiple real\-world domains, including retail services, microservice architectures, manufacturing systems, and physical simulation environments, thereby reflecting a broad spectrum of practical deployment scenarios\.

In our experiments, we utilize both normal and abnormal samples\. For sample\-level RCA baseline methods, including Score Ordering, Smooth Traversal, Traversal, RCG\-0, and Cholesky, RCA scores are computed by aggregating results across all abnormal samples and evaluating on their average\. For the counterfactual attribution method, due to computational constraints, evaluation is performed on a randomly sampled abnormal instance\. For graph\-based methods, we use the available causal graph whenever possible\. Specifically, the ground\-truth causal graphs are used forProRCA,CausalMan, andCausalChambers\. ForSock\-Shop, we use the edge\-reversed service call graph as a proxy causal graph\. ForRCAEval, where ground\-truth causal graphs are unavailable, we estimate a graph from normal data using XGES\.

ProRCA\.ProRCA is a semi\-synthetic data provided by the ProRCA package, constructed via simulation to approximate real\-world retail transaction processes in a complex, nonlinear manner\[Dawoud and Talupula,[2025](https://arxiv.org/html/2606.05636#bib.bib67)\]\. The dataset is designed to model realistic business systems into which different types of anomalies can be systematically injected\. It comprises 13 variables, with "PROFIT" designated as the outcome variable\. The ground\-truth causal graph underlying the data\-generating process is illustrated in Figure[4](https://arxiv.org/html/2606.05636#A3.F4)\. A total of five anomaly types can be injected to generate abnormal samples:ExcessiveDiscount,COGs,FulfillmentSpike,ReturnSurge, andShippingDisruption\. We apply four of these anomaly types with predefined anomaly strengths\. Additional dataset statistics and configuration details are reported in Table[5](https://arxiv.org/html/2606.05636#A3.T5)\.

![Refer to caption](https://arxiv.org/html/2606.05636v1/x3.png)Figure 4:Causal graph of ProRCA dataset\.Table 5:ProRCA data information\.![Refer to caption](https://arxiv.org/html/2606.05636v1/x4.png)Figure 5:Causal graph of Sockshop dataset\.Sock\-shop\.Sock\-shop is a microservice monitoring data collected from an application of an online shop that sells socks\[[36](https://arxiv.org/html/2606.05636#bib.bib69)\]\. It is a real\-world, microservice\-based testbed designed to facilitate the evaluation of microservice and cloud\-native technologies\. The dataset comprises 14 variables, and the underlying ground\-truth causal graph is shown in Figure[5](https://arxiv.org/html/2606.05636#A3.F5)\. The variable "FRONT\-END" is treated as the outcome variable\. Five types of anomalies are injected into the system: CPU hog \(cpu\), memory leak \(mem\), disk I/O stress \(disk\), network delay \(delay\), and packet loss \(loss\)\. Each anomaly type is independently injected into one of five services — CARTS, CATALOGUE, ORDERS, PAYMENT, and USER\. For each anomaly–service pair, five independent replicates are provided\. In total, the dataset contains 125 anomaly\-injection instances\. Each instance consists of 360 normal samples and 361 anomalous samples\.

CausalMan\.CausalMan is a manufacturing data simulator modeled after a real\-world production line that assembles magnetic valves and hydraulic blocks in Hydraulic Units \(HU\)\[Tagliapietraet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib72)\]\. The simulator captures a wide range of linear and non\-linear causal mechanisms, as well as challenging real\-world behaviors such as discrete mode switches\. The resulting CausalMan dataset is provided in two scales:CausalMan Small, which contains 157 variables, andCausalMan Medium, which contains 605 variables\. In both settings, the data are generated under a causally sufficient assumption, where all relevant variables are observed\. CausalMan represents a highly challenging real\-world benchmark due to its complex non\-linear dependencies and heterogeneous variable types, including continuous, categorical, and binary variables\. The dataset includes 14 interventional settings targeting three variables, "PF\_M1\_T1\_Force", "PF\_M1\_T1\_Fmax" and "PF\_M1\_T1\_sgrad", that induce abnormal system behavior\. Specifically, CausalMan Small contains 5 hard and 3 soft interventions, while CausalMan Medium includes 3 hard and 3 soft interventions\. The outcome variable, "Sec\_C2\_Machine1\_ProcessResult" is binary and indicates whether the manufacturing process result is acceptable\. Detailed information about all interventional configurations is provided in Table[6](https://arxiv.org/html/2606.05636#A3.T6)\. For each intervention, both the normal \(observational\) dataset and the corresponding interventional \(abnormal\) dataset contain 47,400 samples\. Due to computational constraints, we subsample 2,000 normal samples and 200 abnormal samples for each root cause analysis experiment\.

Table 6:CausalMan data information\.![Refer to caption](https://arxiv.org/html/2606.05636v1/x5.png)Figure 6:Causal graph of Causal\-Chamber dataset\.Causal\-Chamber\.The Causal Chambers are real\-world physical system simulators designed to generate data governed by known physical laws with fully specified ground\-truth data\-generating processes\[Gamellaet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib28)\]\. Two physical systems are provided: a wind tunnel and a light tunnel\. The wind tunnel consists of a chamber equipped with two controllable fans that drive airflow, along with barometers that measure air pressure at multiple locations\. The light tunnel comprises a controllable light source at one end and two linear polarizers mounted on rotating frames\. In this work, we use interventional data generated from the light tunnel system for root cause analysis experiments\. The light tunnel system contains 38 variables and supports 58 distinct interventions with varying intervention targets and anomaly strengths\. The ground\-truth causal graph of the system is shown in Figure[6](https://arxiv.org/html/2606.05636#A3.F6)\. We conduct experiments on 16 interventions applied to continuous variables\. For each intervention, the outcome variable is randomly selected from the leaf nodes among the descendants of the intervention target\. The normal data is generated as follows\. The actuators are sampled as follows:R,G,B∼i\.i\.d\.Unif\(\{0,…,85\}\)R,G,B\\overset\{\\text\{i\.i\.d\.\}\}\{\\sim\}\\text\{Unif\}\(\\\{0,\\dots,85\\\}\);L11,L12,L21,L22,L31,L32∼i\.i\.d\.Unif\(\{0,…,170\}\)L\_\{11\},L\_\{12\},L\_\{21\},L\_\{22\},L\_\{31\},L\_\{32\}\\overset\{\\text\{i\.i\.d\.\}\}\{\\sim\}\\text\{Unif\}\(\\\{0,\\dots,170\\\}\); andθ1,θ2∼i\.i\.d\.Unif\(\{−15,−14\.9,…,25\}\)\\theta\_\{1\},\\theta\_\{2\}\\overset\{\\text\{i\.i\.d\.\}\}\{\\sim\}\\text\{Unif\}\(\\\{\-15,\-14\.9,\\dots,25\\\}\), whereUnif\(S\)\\text\{Unif\}\(S\)denotes sampling uniformly fromSS\. The sensor parameters are set as follows:O1,O2,OC=1O\_\{1\},O\_\{2\},O\_\{C\}=1;R1,R2,RC=5R\_\{1\},R\_\{2\},R\_\{C\}=5;D1I,D2I,D3I=2D\_\{1\}^\{I\},D\_\{2\}^\{I\},D\_\{3\}^\{I\}=2;D1V,D2V,D3V=1D\_\{1\}^\{V\},D\_\{2\}^\{V\},D\_\{3\}^\{V\}=1; andT1V,T2V,T3V,T1I,T2I,T3I=3T\_\{1\}^\{V\},T\_\{2\}^\{V\},T\_\{3\}^\{V\},T\_\{1\}^\{I\},T\_\{2\}^\{I\},T\_\{3\}^\{I\}=3\. This procedure yields 10,000 normal samples\. Detailed information about the interventions is summarized in Table[7](https://arxiv.org/html/2606.05636#A3.T7)\. Each interventional \(abnormal\) dataset contains 1,000 samples\. Due to computational constraints, we subsample 2,000 normal samples and 200 abnormal samples for each root cause analysis experiment\.

Table 7:Causal\-Chamber data information\.Table 8:RCAEval data information\. RCAEval contains benchmark failure cases from three microservice systems: Online Boutique \(OB\), Sock Shop \(SS\), and Train Ticket \(TT\)\.RCAEval\.RCAEval is an open\-source benchmark for RCA in microservice systemsPhamet al\.\[[2025](https://arxiv.org/html/2606.05636#bib.bib70)\]\. It contains 735 real failure cases collected from three representative systems: Online Boutique, Sock Shop, and Train Ticket\. The benchmark covers 11 fault types, including resource, network, and code\-level failures, and provides annotated root\-cause services and root\-cause indicators for each case\. RCAEval is organized into three suites: RE1 contains metric\-only data, while RE2 and RE3 provide multi\-source telemetry, including metrics, logs, and traces, to support metric\-based, trace\-based, and multi\-source RCA evaluation\.

Table 9:Top\-1 accuracy on five real\-world RCA benchmarks\. We report mean±\\pmstandard deviation across benchmark cases\.![Refer to caption](https://arxiv.org/html/2606.05636v1/x6.png)Figure 7:Case\-level performance heatmap of different methods on ProRCA\.![Refer to caption](https://arxiv.org/html/2606.05636v1/x7.png)Figure 8:Case\-level performance heatmap of different methods on Sock\-shop\.![Refer to caption](https://arxiv.org/html/2606.05636v1/x8.png)Figure 9:Case\-level performance heatmap of different methods on Causal\-Chamber\.![Refer to caption](https://arxiv.org/html/2606.05636v1/x9.png)Figure 10:Performance heatmap of different methods on CausalMan\.![Refer to caption](https://arxiv.org/html/2606.05636v1/x10.png)Figure 11:Performance heatmap of different methods on RCAEval, grouped by dataset\.![Refer to caption](https://arxiv.org/html/2606.05636v1/x11.png)Figure 12:Performance heatmap of different methods on RCAEval, grouped by fault type\.

## Appendix DBaseline Models

Table 10:Comparison of representative root cause analysis \(RCA\) methods\.We compare our method against a diverse set of representative root cause analysis \(RCA\) baselines, covering graph\-based, statistics\-based, and model\-based approaches\.

ϵ\\epsilon\-Diagnosis\[Shanet al\.,[2019](https://arxiv.org/html/2606.05636#bib.bib12)\]\.ϵ\\epsilon\-Diagnosis is a root cause localization method originally developed for microservice\-based systems\. It localizes root causes by performing energy\-distance\-based two\-sample tests between normal and anomalous windows for each container\-metric pair\. Metrics whose marginal distributions exhibit statistically significant deviations are reported as root\-cause candidates\.

Traversal\[Chenet al\.,[2014](https://arxiv.org/html/2606.05636#bib.bib14), Linet al\.,[2018](https://arxiv.org/html/2606.05636#bib.bib44), Liuet al\.,[2021](https://arxiv.org/html/2606.05636#bib.bib77)\]\.Traversal operates on a given causal graph and localizes root causes by selecting anomalous nodes that are upstream of the target, have no anomalous parents, and lie on paths composed solely of anomalous nodes\.

CI\-RCA\[Liet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib15)\]\.CIRCA leverages a known causal graph to fit a linear structural model on normal data and detects root causes as nodes whose conditional behavior, given their parent variables, deviates most strongly during anomalies\.

RCD\[Ikramet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib13)\]\.RCD formulates root cause analysis as a causal discovery problem by introducing a dummy fault node and identifying variables directly connected to it via conditional\-independence\-based structure learning on normal and anomalous data\.

Counterfactual\[Budhathokiet al\.,[2022](https://arxiv.org/html/2606.05636#bib.bib74)\]\.Counterfactual uses a given SCM to quantify each node’s counterfactual contribution to the target anomaly via Shapley\-based attribution analysis\.

Cholesky\[Liet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib16)\]\.Cholesky assumes a linear SEM and localizes root causes by analyzing the Cholesky decomposition of the covariance matrix under different variable permutations\.

Smooth Traversal\[Orchardet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib17)\]\.Smooth Traversal localizes the root cause by selecting the node with the largest positive difference between its anomaly score and that of its highest\-scoring parent in the causal graph\.

Score Ordering\[Orchardet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib17)\]\.Score orderring identifies root causes by directly ranking variables according to their information\-theoretic anomaly scores without using any structural information\.

RCG\-0\[Ikramet al\.,[2025](https://arxiv.org/html/2606.05636#bib.bib80)\]\. RCG\-0 is thek=0k=0variant of RCG that learns a partial causal structure using only marginal CI tests, then ranks root\-cause candidates by CMI conditioned on possible parents, but its dense graph can force large conditioning sets and obscure the true root cause\.

BARO\[Phamet al\.,[2024](https://arxiv.org/html/2606.05636#bib.bib82)\]\.BARO is an end\-to\-end microservice troubleshooting method that combines Multivariate Bayesian Online Change Point Detection for anomaly detection with a nonparametric statistical hypothesis\-testing scorer to rank root\-cause metrics and services\.

## Appendix EDetailed Experimental Results on Real\-World Datasets

Here we present the detailed experimental results on the five real\-world datasets\. Table[9](https://arxiv.org/html/2606.05636#A3.T9)shows that StableRCA achieves strong and stable Top\-1 accuracy across five real\-world RCA benchmarks\. It obtains the best performance on ProRCA and SockShop, and remains competitive on CausalChamber, CausalMan, and RCAEval\. Several baselines are competitive on particular datasets but degrade on others\. For example, Cholesky performs well on ProRCA, SockShop, and RCAEval, but drops sharply on CausalChamber and CausalMan\. BARO is competitive on SockShop and RCAEval, but performs poorly on CausalChamber and CausalMan\. Traversal\-based methods achieve perfect accuracy on CausalChamber, but fail on RCAEval\. In contrast, StableRCA maintains consistently strong performance across all five datasets, suggesting better cross\-domain robustness under heterogeneous real\-world RCA scenarios\. Figure[7](https://arxiv.org/html/2606.05636#A3.F7)\-Figure[12](https://arxiv.org/html/2606.05636#A3.F12)report the detailed results on each case across datasets\.
StableRCA: Robust Graph-Agnostic Mechanism-Level Root Cause Analysis

Similar Articles

STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis

Why Retrieval-Augmented Generation Fails: A Graph Perspective

RRISE: Robust Radius Inference via a Surrogate Estimator

GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning

Submit Feedback

Similar Articles

STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices
ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis
Why Retrieval-Augmented Generation Fails: A Graph Perspective
RRISE: Robust Radius Inference via a Surrogate Estimator
GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning