Global Explanations for Multivariate Time Series Forecasting Models via $K$-Order Markov Approximations

arXiv cs.LG Papers

Summary

This paper proposes KARMA, a method for explaining multivariate time series forecasting models by constructing a K-order Markov surrogate model that captures temporal dependencies, offering a five-level global explanation hierarchy.

arXiv:2606.27599v1 Announce Type: new Abstract: While many explainable AI (XAI) methods have been proposed, most are not designed for time-series forecasting models and often rely on the implicit assumption that timestamp features are independent. This assumption ignores the fundamental property of temporal dependence and can lead to explanations that violate the sequential and causal structure of the data. We introduce \textsc{KARMA}, a method for explaining time-series predictors by constructing a Markov surrogate model that captures the temporal dependencies learned by the predictor. Our approach revolves around three main aspects: identifying the minimal history length $K$ that is predictively sufficient for the model, estimating the best-fitting $K$-order Markov transition kernel from the discretized history space, and a five-level global explanation hierarchy that can be derived from the Markov transition kernel, which we illustrate using real-world weather data (Beijing PM 2.5). We also certify using complex synthetic data with known true causal edges that KARMA (i) recovers the data causal structure as learned by the model via a controlled experiment and (ii) identifies temporal dependencies better than established attribution methods such as TimeSHAP.
Original Article
View Cached Full Text

Cached at: 06/29/26, 05:23 AM

# Global Explanations for Multivariate Time Series Forecasting Models via 𝐾-Order Markov Approximations
Source: [https://arxiv.org/html/2606.27599](https://arxiv.org/html/2606.27599)
###### Abstract

While many explainable AI \(XAI\) methods have been proposed, most are not designed for time\-series forecasting models and often rely on the implicit assumption that timestamp features are independent\. This assumption ignores the fundamental property of temporal dependence and can lead to explanations that violate the sequential and causal structure of the data\. We introduceKARMA, a method for explaining time\-series predictors by constructing a Markov surrogate model that captures the temporal dependencies learned by the predictor\. Our approach revolves around three main aspects: identifying the minimal history lengthKKthat is predictively sufficient for the model, estimating the best\-fittingKK\-order Markov transition kernel from the discretized history space, and a five\-level global explanation hierarchy that can be derived from the Markov transition kernel, which we illustrate using real\-world weather data \(Beijing PM 2\.5\)\. We also certify using complex synthetic data with known true causal edges that KARMA \(i\) recovers the data causal structure as learned by the model via a controlled experiment and \(ii\) identifies temporal dependencies better than established attribution methods such as TimeSHAP\.

*Keywords*Explainable AI⋅\\cdotTime Series Forecasting⋅\\cdotMarkov Chains

## 1Introduction

The deployment of deep learning models on financial time series, clinical monitoring and industrial sensing have grown substantially\. Recurrent networks, temporal convolutional networks \(TCNs\), and transformer architectures achieve strong predictive performance, but remain largely opaque: practitioners cannot readily identify which temporal patterns, cross\-variable dependencies, or historical regimes drive a particular prediction\. This opacity creates regulatory friction under ESMA guidelines and the EU AI Act, and undermines model trust in high\-stakes decision environments\.

Explainable AI \(XAI\) methods such as LIMERibeiroet al\.\([2016](https://arxiv.org/html/2606.27599#bib.bib14)\), SHAPLundberg and Lee \([2017](https://arxiv.org/html/2606.27599#bib.bib11)\), and attention\-based attribution were developed primarily for static, i\.i\.d\. settings\. Their extension to time series is non\-trivial: temporal autocorrelation, non\-stationarity, and inter\-variable Granger causality violate the independence assumptions underlying most attribution frameworks\. For instance, SHAP’s baseline marginalisation is structurally incoherent for time series becauseXt−2X\_\{t\-2\}is not independent ofXt−1X\_\{t\-1\}; gradient attributions describe local geometry at a single point rather than the systematic conditional structure of what the model has globally learned\.

We propose a different approach rooted in probabilistic approximation\. Rather than perturbing inputs or computing gradient\-based attributions, we ask:*can the behaviour of a black\-box model on a multivariate time series be faithfully approximated by aKK\-th order Markov chain?*If so, the transition probabilities of that chain constitute a structured, interpretable explanation grounded in the conditional dependence structure of the data as learned by the model\. This framing is natural for sequential domains: a clinician, engineer, or trader already reasons in conditional scenarios “given the last three states of the system, what does the model expect next?” and transition probabilities answer that question directly, in the native language of the domain\. We present KARMA, an approach to interpreting a black\-box time\-series model using a five\-level \(global\) explanation hierarchy derived from the transition kernel\. Our contribution is organized around three distinct pillars with different theoretical foundations and different data requirements\.

1. 1\.Pillar 1 \(Markov surrogate\)\.A surrogacy predictive validity stopping rule selects the minimal lagK∗K^\{\*\}that is predictively sufficient, and for thisK∗K^\{\*\}estimate with minimal error using the data, the best nearestK∗−K^\{\*\}\-Order Markov probability transition kernel\. This answers the question every practitioner should ask before trusting a sequential model: how much of its input window does the model actually use? The answer is both certified and model\-agnostic,
2. 2\.Pillar 2 \(Compression and certified attribution\)\.WhenKKis smaller than the model window size, the model is provably insensitive to inputs beyond lagK∗K^\{\*\}\. This yields a compression ratio, a model\-certified baselineb∗b^\{\*\}that resolves the baseline selection problem, and certified\-zero attributions for all lags beyondK∗K^\{\*\}, not merely small attributions, but mathematically zero within approximation error\.
3. 3\.Pillar 3 \(Five Level Global Explanation Derivation\)\.Given the estimated kernel, KARMA extracts five layers of explanation without additional oracle queries\.Level 1computes the normalized variable importance or feature\-level importance, ranking source variables by their total distributional influence on the model’s predictions across all target variables and lags\.Level 2resolves this influence by lagkk, yielding cell\-level explanations \(for example, influence on forecast at timettof value at timet−kt\-kfor featuredd\) that reveals whether the model has learned momentum, mean\-reversion, or long\-memory dynamics for each variable\.Level 3identifies distinctive regimes via the marginal interdependence index identifying distinct paths \(histories\) relevant to the forecast\.Level 4computes average interventional effects and the edge contributions that populate the model\-induced causal graph, connecting the explanation directly to the level 1\.Level 5quantifies explanation reliability: aleatoric entropy measures genuine model uncertainty at each history, while the epistemic variance flags histories where kernel estimates are unreliable\.

## 2Related Work

XAI for Time Series Forecasting\.Existing methods fall into three families\.*Gradient\-based*methods propagate prediction sensitivity back to inputs: SaliencySimonyanet al\.\([2014](https://arxiv.org/html/2606.27599#bib.bib36)\), Grad\-CAMSelvarajuet al\.\([2017](https://arxiv.org/html/2606.27599#bib.bib38)\), DeepLIFTShrikumaret al\.\([2017](https://arxiv.org/html/2606.27599#bib.bib39)\), Layer\-wise Relevance PropagationBachet al\.\([2015](https://arxiv.org/html/2606.27599#bib.bib40)\), and Integrated GradientsSundararajanet al\.\([2017](https://arxiv.org/html/2606.27599#bib.bib37)\), often smoothed via SmoothGradSmilkovet al\.\([2017](https://arxiv.org/html/2606.27599#bib.bib41)\)\. These are fast but provide no probabilistic guarantees and are sensitive to architecture; moreover, standard saliency methods transfer poorly to temporal data, as benchmarked by Ismail et al\.Ismailet al\.\([2020](https://arxiv.org/html/2606.27599#bib.bib42)\), who propose Temporal Saliency Rescaling \(TSR\) to recover time\-localized importance\. Temporal Integrated Gradients \(TIG;Enguehard \([2023](https://arxiv.org/html/2606.27599#bib.bib26)\)\) extends this family to sequential settings but remains local\.*Perturbation\-based*methods \(LIMERibeiroet al\.\([2016](https://arxiv.org/html/2606.27599#bib.bib14)\), SHAPLundberg and Lee \([2017](https://arxiv.org/html/2606.27599#bib.bib11)\)\) struggle with temporal autocorrelation since random perturbations break sequential structure and induce off\-manifold inputs\. The simplest instance, Feature Occlusion \(FO;Sureshet al\.\([2017](https://arxiv.org/html/2606.27599#bib.bib52)\)\), replaces a feature or group with a baseline and scores the resulting output change; its augmented variant \(AFO;Tonekaboniet al\.\([2020](https://arxiv.org/html/2606.27599#bib.bib43)\)\) perturbs with in\-distribution samples to mitigate off\-manifold artifacts\. TimeSHAPBentoet al\.\([2021](https://arxiv.org/html/2606.27599#bib.bib4)\)extends SHAP to recurrent models but inherits the baseline selection problem and provides no certified attributions\. FITTonekaboniet al\.\([2020](https://arxiv.org/html/2606.27599#bib.bib43)\)instead scores each observation by its contribution to the predictive\-distribution shift under a KL divergence against a counterfactual where remaining features are unobserved, explicitly controlling for time\-dependent shift, but remains purely local\. DynamaskCrabbé and van der Schaar \([2021](https://arxiv.org/html/2606.27599#bib.bib27)\)learns a salient input mask but operates locally with no reliability guarantees, and subsequent mask\-based refinements such as ContraLSPLiuet al\.\([2024](https://arxiv.org/html/2606.27599#bib.bib45)\)\(contrastive, locally sparse perturbations\) and the surrogate explainer TimeXQueenet al\.\([2023](https://arxiv.org/html/2606.27599#bib.bib46)\)improve fidelity yet likewise yield only instance\-level, uncertified attributions\. WinITLeunget al\.\([2023](https://arxiv.org/html/2606.27599#bib.bib44)\)measures information transfer across time windows but yields no certified attributions\.*Attention\-based*methods treat attention weights as explanations, a practice whose reliability is contestedJain and Wallace \([2019](https://arxiv.org/html/2606.27599#bib.bib10)\); Wiegreffe and Pinter \([2019](https://arxiv.org/html/2606.27599#bib.bib47)\); interpretable forecasters such as the Temporal Fusion TransformerLimet al\.\([2021](https://arxiv.org/html/2606.27599#bib.bib51)\)expose variable\-selection and attention weights intrinsically but offer no statistical guarantees on the resulting attributions\. KARMA is a global surrogate method that explicitly models sequential dependence through Markov structure, yielding explanations that respect the temporal geometry of the data and carry explicit statistical reliability guarantees absent from all prior work\.

Total Variation for Model Selection\.The formal definition of the total variation commonly used,

###### Definition 2\.1\(Total Variation Distance\)\.

Letℳ​\(\[N\]\)\\mathcal\{M\}\(\[N\]\)denote the set of conditional probability kernels on a set\[N\]\[N\]\. Forp\(⋅\|h\),q\(⋅\|h\)∈ℳ\(\[N\]\)p\(\\cdot\|h\),q\(\\cdot\|h\)\\in\\mathcal\{M\}\(\[N\]\),hhin some history space, the*total variation distance*is given by

∥p\(⋅\|h\)−q\(⋅\|h\)∥T​V=12∑s∈\[N\]\|p\(s\|h\)−q\(s\|h\)\|\\\|p\(\\cdot\|h\)\-q\(\\cdot\|h\)\\\|\_\{TV\}=\\frac\{1\}\{2\}\\sum\_\{s\\in\[N\]\}\|p\(s\|h\)\-q\(s\|h\)\|\(1\)

The use of total variation distance as a Markov order stopping rule is related toCsiszár and Shields \([2000](https://arxiv.org/html/2606.27599#bib.bib7)\)on consistent order estimation, though our formulation targets surrogate equivalence rather than data compression, and operates on the model kernel rather than the data kernel\.

## 3K\-Order Markov Chain Surrogate Model

Building on the intuition introduced in the introduction, we now formalize KARMA as a K\-order Markov surrogate model\. We model a multivariate time series \(MVTS\) as a stochastic process𝐗=\{Xt\}t∈ℤ,Xt=\(Xt1,…,XtD\)⊤∈ℝD\\mathbf\{X\}=\\\{X\_\{t\}\\\}\_\{t\\in\\mathbb\{Z\}\},\\\>X\_\{t\}=\(X^\{1\}\_\{t\},\\ldots,X^\{D\}\_\{t\}\)^\{\\top\}\\in\\mathbb\{R\}^\{D\}\. Condider the predictorf:ℝW×D→𝒴∈ℝT×Df:\\mathbb\{R\}^\{W\\times D\}\\to\\mathcal\{Y\}\\in\\mathbb\{R\}^\{T\\times D\}be a trained black\-box predictor that takes an input window of lengthWWand outputs aTT\-length prediction window\. For the rest of this manuscript, we considerT=1T=1,T\>1T\>1follows by constructive extension\.

Our goal is to construct an explanation based on the transition kernel of a surrogateKK\-order Markov chain that is \(1\) human–interpretable, \(2\) faithful toffin the total variation sense, and \(3\) explicit about its own statistical reliability\.

This section develops the construction of this surrogate model\.

###### Definition 3\.1\(K−K\-Order Markov chain\)\.

Let𝐗=\(Xn\)n≥1\\mathbf\{X\}=\(X\_\{n\}\)\_\{n\\geq 1\}be a stochastic process with countable state space𝒮\\mathcal\{S\}\. The process is a*kk\-th order Markov chain*if, for alln\>kn\>k,XnX\_\{n\}is independent of all other past time steps given the lastkksteps\. The chain is*time\-homogeneous*if the transition probabilities do not depend onnn\.

### 3\.1State Space Construction

Denote\[D\]=\{1,…,D\}\[D\]=\\\{1,\\dots,D\\\}and\[N\]=\{1,…,N\}\[N\]=\\\{1,\\dots,N\\\}\. For each variabled∈\[D\]d\\in\[D\], define a measurable partition ofℝ\\mathbb\{R\}intoNNbins using quantile boundaries estimated on the training data:

ψd:ℝ→\[N\],ψd​\(x\)=n⇔x∈\[qn−1d,qnd\)\.\\psi^\{d\}:\\mathbb\{R\}\\to\[N\],\\qquad\\psi^\{d\}\(x\)=n\\iff x\\in\[q^\{d\}\_\{n\-1\},q^\{d\}\_\{n\}\)\.
The multivariate discretization

ψ:ℝD→\[N\]D\\psi:\\mathbb\{R\}^\{D\}\\to\[N\]^\{D\}
induces the discrete state space𝒮=\[N\]D\\mathcal\{S\}=\[N\]^\{D\}with\|𝒮\|=ND\.\|\\mathcal\{S\}\|=N^\{D\}\.

The discretized process isX~t=ψ​\(Xt\)∈𝒮\\tilde\{X\}\_\{t\}=\\psi\(X\_\{t\}\)\\in\\mathcal\{S\}\. And for histories of lengthKK, the corresponding history space is

ℋK=𝒮K,\|ℋK\|=ND​K\.\\mathcal\{H\}\_\{K\}=\\mathcal\{S\}^\{K\},\\qquad\|\\mathcal\{H\}\_\{K\}\|=N^\{DK\}\.
wherehkd∈\[N\]h^\{d\}\_\{k\}\\in\[N\]is the discretized bin of variableddat lag positionkk, and the total number of distinct indices isND​K=\|ℋK\|N^\{DK\}=\|\\mathcal\{H\}\_\{K\}\|\.

### 3\.2The K\-Order Markov Surrogate

Letf:ℝW×D→𝒴f:\\mathbb\{R\}^\{W\\times D\}\\to\\mathcal\{Y\}be a trained black\-box predictor andψ:ℝD→𝒮\\psi:\\mathbb\{R\}^\{D\}\\to\\mathcal\{S\}the discretization map defined above\. Sinceffoperates on a window of lengthWW, its predictions may in principle depend on the full input history\. However, the predictive information relevant toffmay be concentrated in a strictly shorter suffix\. To identify the minimal such suffix, we prepend a baselineb∈ℬb\\in\\mathcal\{B\}to aKK\-length suffix to form a complete input window and measure

Δ^pred​\(K,b\)=1n​∑i=1nℓ​\(f​\(h~i\),f​\(b⊕hi\)\),\\hat\{\\Delta\}\_\{\\mathrm\{pred\}\}\(K,b\)=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\ell\\\!\\left\(f\(\\tilde\{h\}\_\{i\}\),\\,f\(b\\oplus h\_\{i\}\)\\right\),\(2\)whereh~i∈ℝW×D\\tilde\{h\}\_\{i\}\\in\\mathbb\{R\}^\{W\\times D\}denotes theWWlengthiiobserved history,b⊕hib\\oplus h\_\{i\}denotes theKKlength suffixhih\_\{i\}ofh~i\\tilde\{h\}\_\{i\}concatenated with aW−KW\-Klength baselinebb, andℓ​\(⋅,⋅\)\\ell\(\\cdot,\\cdot\)a user imposed predictive difference metric \(absolute difference for forecasting\)\.Δ^pred​\(K,b\)\\hat\{\\Delta\}\_\{\\mathrm\{pred\}\}\(K,b\)defines the average discrepancy betweenff’s predictions on the true window and on the substituted window\. The selected orderK∗K^\{\*\}is the smallestKKfor whichminb∈ℬ⁡Δ^pred​\(K,b\)<ε\\min\_\{b\\in\\mathcal\{B\}\}\\,\\hat\{\\Delta\}\_\{\\mathrm\{pred\}\}\(K,b\)<\\varepsilon,ε\>0\\varepsilon\>0withb∗=arg⁡minb⁡Δ^pred​\(K∗,b\)b^\{\*\}=\\arg\\min\_\{b\}\\,\\hat\{\\Delta\}\_\{\\mathrm\{pred\}\}\(K^\{\*\},b\)\. WhenK∗<WK^\{\*\}<W, the model is provably insensitive to all inputs beyond lagK∗K^\{\*\}, yielding compression ratioW/K∗W/K^\{\*\}and certified\-zero attributions for allk\>K∗k\>K^\{\*\}\. Importantly,Δ^pred\\hat\{\\Delta\}\_\{\\mathrm\{pred\}\}requires only direct model queries, no kernel estimation, making it the sole stopping certificate\.

GivenK∗K^\{\*\}, thesurrogate transition kernelofffis

𝒯K∗f\(s∣h\)=ℙ\(ψ\(f\(h~\)\)=s\|suffixK∗\(ψ\(h~\)\)=h\),\\mathcal\{T\}^\{f\}\_\{K^\{\*\}\}\(s\\mid h\)\\;=\\;\\mathbb\{P\}\\\!\\left\(\\psi\(f\(\\tilde\{h\}\)\)=s\\;\\middle\|\\;\\mathrm\{suffix\}\_\{K^\{\*\}\}\(\\psi\(\\tilde\{h\}\)\)=h\\right\),\(3\)
s∈𝒮,h∈ℋK∗s\\in\\mathcal\{S\},\\;h\\in\\mathcal\{H\}\_\{K^\{\*\}\}with marginal at dimensiond∈\[D\]d\\in\[D\]

𝒯K∗f,d\(sd∣h\)=ℙ\(ψd\(f\(h~\)\)=sd\|suffixK∗\(ψ\(h~\)\)=h\)\.\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\)\\;=\\;\\mathbb\{P\}\\\!\\left\(\\psi^\{d\}\(f\(\\tilde\{h\}\)\)=s^\{d\}\\;\\middle\|\\;\\mathrm\{suffix\}\_\{K^\{\*\}\}\(\\psi\(\\tilde\{h\}\)\)=h\\right\)\.\(4\)wheresds^\{d\}is the d\-th component ofss\. TheKK\-order Markov chain associated with𝒯K∗f\\mathcal\{T\}^\{f\}\_\{K^\{\*\}\}is theKK\-order Markov SurrogateℳK∗\\mathcal\{M\}\_\{K^\{\*\}\}\. Crucially,ffitself need not satisfy any Markov property:𝒯K∗f\\mathcal\{T\}^\{f\}\_\{K^\{\*\}\}defines the nearestKK\-order Markov approximation toff’s predictive behavior, not an extraction of latent Markov structure from the model\.

⋯\\cdots⋯\\cdots⋯\\cdots⋮\\vdots⋮\\vdots⋮\\vdots⋱\\ddotsh~∈ℝW×D\\tilde\{h\}\\in\\mathbb\{R\}^\{W\\times D\}continuous window⋯\\cdots⋯\\cdots⋯\\cdots⋮\\vdots⋮\\vdots⋮\\vdots⋱\\ddotsψ​\(h~\)∈\[N\]D×W\\psi\(\\tilde\{h\}\)\\in\[N\]^\{D\\times W\}discretized state⋮\\vdots⋮\\vdots⋮\\vdotsh∈\[N\]D×K∗h\\in\[N\]^\{D\\times K^\{\*\}\}history \(suffix\)123𝒯^K∗f,d\(⋅∣h\)\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)transition kernelsuffixK∗\\mathrm\{suffix\}\_\{K^\{\*\}\}lastK∗K^\{\*\}stepsψ\\psiquantilebinsψd∘f\\psi^\{d\}\\\!\\circ fquery& poolFigure 1:KARMA surrogate construction\.A continuous input windowh~∈ℝW×D\\tilde\{h\}\\in\\mathbb\{R\}^\{W\\times D\}is discretized by the quantile mapψ\\psiinto\[N\]D×W\[N\]^\{D\\times W\}; its length\-K∗K^\{\*\}suffix forms the historyh∈\[N\]D×K∗h\\in\[N\]^\{D\\times K^\{\*\}\}\. TheK∗K^\{\*\}\-order surrogate kernel𝒯^K∗f,d\(⋅∣h\)\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)is the distribution of the discretized model outputψd​\(f​\(h~\)\)\\psi^\{d\}\(f\(\\tilde\{h\}\)\)over next\-binssd∈\[N\]s^\{d\}\\in\[N\], conditioned onhhand pooled over all windows landing in that history\.

## 4Why Transition Probabilities Are \(Global\) Explanations

Standard attribution methods ask:how much does featurexix\_\{i\}contribute relative to a baseline?For time series, this framing is problematic —Xt−2X\_\{t\-2\}andXt−1X\_\{t\-1\}are strongly dependent, so marginalizing one while fixing the other produces counterfactuals off the data manifold, and gradient\-based explanations describe local sensitivity at a single point rather than the global conditional structure the model has learned\. Transition probabilities answer a different and more natural question:given that the system was in statehhfor the lastK∗K^\{\*\}steps, what does the model predict next?For aK∗K^\{\*\}\-order Markov process,𝒯K∗f\\mathcal\{T\}^\{f\}\_\{K^\{\*\}\}and𝒯K∗f,d\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}are sufficient statistics for next\-state prediction, so under output equivalence at toleranceε\\varepsilonthe surrogate kernel captures precisely the predictive dependencies the model has learned between variables and lags\. Under the faithfulness assumption \(in the sense of causality\), this directly reveals which variables at which lags act as direct causal drivers\.

### 4\.1The Five\-Level Explanation Hierarchy

LetK∗\>0K^\{\*\}\>0,h∈ℋK∗h\\in\\mathcal\{H\}\_\{K^\{\*\}\},d∈\[D\]d\\in\[D\]andffa trained model\. Further assume𝒯^K∗f,d\\hat\{\\mathcal\{T\}\}\_\{K^\{\*\}\}^\{f,d\}the approximation of𝒯K∗f,d\\mathcal\{T\}\_\{K^\{\*\}\}^\{f,d\}such that

𝔼\[∥𝒯^Kf,d\(⋅∣h\)−𝒯Kf,d\(⋅∣h\)∥T​V\]<λ4,λ\>0\\mathbb\{E\}\\\!\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\right\]<\\frac\{\\lambda\}\{4\},\\qquad\\lambda\>0\(5\)We demonstrate how marginal transition probabilities are used to provide global explanations\. We also go on to prove in Supplementary Material[S4](https://arxiv.org/html/2606.27599#S4a)that under a model\-centric faithfulness condition onℳK∗\\mathcal\{M\}\_\{K^\{\*\}\}, the resulting attributions are causally grounded in a precise sense: nonzero attributions correspond exactly to direct effects in the model\-induced causal graph, and zero attributions to d\-separated pairs\.

#### 4\.1\.1Level 1: Variable Importance \- Who Matters?

###### Definition 4\.1\(Lag\-Resolved Influence and Variable Importance\)\.

For any dimensiond′∈\[D\]d^\{\\prime\}\\in\[D\]at lagk∈\{1,…,K∗\}k\\in\\\{1,\\dots,K^\{\*\}\\\}, the*lag\-resolved influence*on the forecast at timettis

ϕkd′=∑d=1Dρ​\(Xt−kd′→Xtd\)⋅1​\{ρ​\(Xt−kd′→Xtd\)\>λ\}\.\\phi\_\{k\}^\{d^\{\\prime\}\}=\\sum\_\{d=1\}^\{D\}\\rho\(X\_\{t\-k\}^\{d^\{\\prime\}\}\\rightarrow X\_\{t\}^\{d\}\)\\cdot\\mathrm\{1\}\\\{\\rho\(X\_\{t\-k\}^\{d^\{\\prime\}\}\\rightarrow X\_\{t\}^\{d\}\)\>\\lambda\\\}\.\(6\)where:

ρ​\(Xt−kd′→Xtd\)=∑h∈ℋK∗π^∗​\(h\)⋅12​ΔT​V,\\rho\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)=\\sum\_\{h\\in\\mathcal\{H\}\_\{K^\{\*\}\}\}\\hat\{\\pi\}^\{\*\}\(h\)\\cdot\\frac\{1\}\{2\}\\Delta\_\{TV\},\(7\)with

ΔT​V=∑sd∈\[N\]\|𝒯^K∗f,d\(sd∣h\)−1N∑x=0N−1𝒯^K∗f,d\(sd∣hkd′←x\)\|,\\Delta\_\{TV\}=\\sum\_\{s^\{d\}\\in\[N\]\}\\left\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\)\-\\frac\{1\}\{N\}\\sum\_\{x=0\}^\{N\-1\}\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\)\\right\|,
wherehkd′←xh\_\{k\}^\{d^\{\\prime\}\\leftarrow x\}replaces only thed′d^\{\\prime\}\-th component at lagkk\.π^∗​\(h\)=ℙ​\(ψ​\(Xt−K∗\+1:t\)=h\)\>0\\hat\{\\pi\}^\{\*\}\(h\)=\\mathbb\{P\}\(\\psi\(X\_\{t\-K^\{\*\}\+1:t\}\)=h\)\>0is the approximate stationary probability of historyhh\. Intuitively,ρ​\(Xt−kd′→Xtd\)\\rho\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)measures how much the model’s forecast distribution for variableddshifts when we average out the influence of variabled′d^\{\\prime\}at lagkk\.λ\\lambdahere serves as a threshold for which there exists genuine causal dependency, not noise arising from approximations, and prevents false positive and false negative causal edges with probability dependent on the upper bound estimate of𝔼\[∥𝒯^Kf,d\(⋅∣h\)−𝒯Kf,d\(⋅∣h\)∥T​V\]\\mathbb\{E\}\\\!\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\right\]\(λ≤0\.1\)\\lambda\\leq 0\.1\)is recommended, Supplementary Material[S2](https://arxiv.org/html/2606.27599#S2a)\)\.

The*Marginal Total\-Variation Influence*of variabled′d^\{\\prime\}aggregated across all lags is:

Φd′=∑k=1K∗ϕkd′\\Phi^\{d^\{\\prime\}\}=\\sum\_\{k=1\}^\{K^\{\*\}\}\\phi\_\{k\}^\{d^\{\\prime\}\}\(8\)and finally define the*Variable Importance*as the normalization of the marginal total\-variation influence\. The quantityΦ~d′\\tilde\{\\Phi\}^\{d^\{\\prime\}\}quantifies the total influence of\{Xd′\}t−kt−1\\\{X^\{d^\{\\prime\}\}\\\}\_\{t\-k\}^\{t\-1\}to predict𝐗t\\mathbf\{X\}\_\{t\}for anyt≥0t\\geq 0\.

This ranking is \(i\) global, averaged over the full state space; \(ii\) probabilistically grounded, measuring actual distributional shift rather than contribution to a local linear approximation; and \(iii\) directly comparable across variables, as all values lie in\[0,1\]\[0,1\]\.

#### 4\.1\.2Level 2: Lag Profiles \- When does Each Variable Matter?

The lag\-resolved influence in equation \([6](https://arxiv.org/html/2606.27599#S4.E6)\) plotted againstkkreveals the temporal shape of how the model uses past information\. The decay profile reveals whether the model has learned momentum, volatility clustering, or mean\-reversion\.

#### 4\.1\.3Level 3: Marginal Regime Explanation \(Top\-kk\) \- What Sequences Matter?

For target dimensiond∈\[D\]d\\in\[D\]and history sequenceh∈ℋK∗h\\in\\mathcal\{H\}\_\{K^\{\*\}\}, we definemarginal interdependence indexas:

Ψd​\(h\)\\displaystyle\\Psi^\{d\}\(h\)=𝔼h′∼π^∗\[∥𝒯^K∗f,d\(⋅∣h\)−𝒯^K∗f,d\(⋅∣h′\)∥T​V\]\\displaystyle=\\mathbb\{E\}\_\{h^\{\\prime\}\\sim\\hat\{\\pi\}^\{\*\}\}\\bigl\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\;\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h^\{\\prime\}\)\\\|\_\{TV\}\]=∑h′∈ℋK∗π^∗\(h′\)⋅∥𝒯^K∗f,d\(⋅∣h\)−𝒯^K∗f,d\(⋅∣h′\)∥T​V\\displaystyle=\\sum\_\{h^\{\\prime\}\\in\\mathcal\{H\}\_\{K^\{\*\}\}\}\\hat\{\\pi\}^\{\*\}\(h^\{\\prime\}\)\\cdot\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\;\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h^\{\\prime\}\)\\\|\_\{TV\}theπ^∗\\hat\{\\pi\}^\{\*\}\-weighted average total variation distance between the model’s prediction athhand its prediction at a history drawn from the stationary distribution\. A high value ofΨd​\(h\)\\Psi^\{d\}\(h\)identifieshhas a distinctive regime: the model behaves unusually athhrelative to its typical conditional behavior across the history space\. Level 3 reports the score table\{Ψd​\(h\)\}d∈\[D\],h∈ℋK∗\\\{\\Psi^\{d\}\(h\)\\\}\_\{d\\in\[D\],\\,h\\in\\mathcal\{H\}\_\{K^\{\*\}\}\}\.

#### 4\.1\.4Level 4: Interventional Profiles \- What Happens When We Fix a Variable?

Forh∈HK∗\+h\\in H^\{\+\}\_\{K^\{\*\}\}, the*counterfactual history*hkd′←xh^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}replaces thed′d^\{\\prime\}\-th component ofhhat lagkkwith binxx\. The*average interventional effect*\(AIE\) of variabled′d^\{\\prime\}at lagkkon variableddis

AIEd\(d′,k,x\)=∑hπ^∗\(h\)∥𝒯^K∗f,d\(⋅∣h\)−𝒯^K∗f,d\(⋅∣hkd′←x\)∥\\mathrm\{AIE\}\_\{d\}\(d^\{\\prime\},k,x\)=\\sum\_\{h\}\\hat\{\\pi\}^\{\*\}\(h\)\\left\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\)\\right\\\|\(9\)AIEd​\(d′,k,x\)\\mathrm\{AIE\}\_\{d\}\(d^\{\\prime\},k,x\)measuresff’s predictive sensitivity to fixing past inputs; it is a property of the surrogateℳK∗\\mathcal\{M\}\_\{K^\{\*\}\}, not of the data\-generating process\. One can verify that1N​∑xAIEd​\(d′,k,x\)=ρ​\(Xt−kd′→Xtd\)\\frac\{1\}\{N\}\\sum\_\{x\}\\mathrm\{AIE\}\_\{d\}\(d^\{\\prime\},k,x\)=\\rho\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\), connecting Level 4 directly to the edge\-trimming criterion of Level 1\. Level 4 reports the ranked table\{1N​∑xAIEd​\(d′,k,x\)\}d,d′,k\\\{\\frac\{1\}\{N\}\\sum\_\{x\}\\mathrm\{AIE\}\_\{d\}\(d^\{\\prime\},k,x\)\\\}\_\{d,d^\{\\prime\},k\}\.

#### 4\.1\.5Level 5: Uncertainty of Explanations

Per\-variable aleatoric uncertaintyis given by the entropy of the predictive distribution:

Hhd=−∑sd𝒯^K∗f,d​\(sd∣h\)​log⁡𝒯^K∗f,d​\(sd∣h\),H^\{d\}\_\{h\}=\-\\sum\_\{s^\{d\}\}\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\)\\log\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\),\(10\)
We estimate theEpistemic Uncertaintyvia an MLE bootstrap variance:

Var​\(𝒯^K∗f,d​\(sd∣h\)\)≈𝒯^K∗f,d​\(sd∣h\)​\[1−𝒯^K∗f,d​\(sd∣h\)\]/nf​\(h\)\.\\mathrm\{Var\}\(\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\)\)\\approx\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\)\\left\[1\-\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\)\\right\]/n^\{f\}\(h\)\.\(11\)
wherenf​\(h\)n^\{f\}\(h\)is the query\-visit count for history h\. Histories with lownf​\(h\)n^\{f\}\(h\)flag unreliable explanations, yielding a naturalexplanation coveragemap\. No other XAI method for time series provides this reliability certificate\.

![Refer to caption](https://arxiv.org/html/2606.27599v1/x2.png)

\(a\)

![Refer to caption](https://arxiv.org/html/2606.27599v1/x3.png)

\(b\)

![Refer to caption](https://arxiv.org/html/2606.27599v1/x4.png)

\(c\)

![Refer to caption](https://arxiv.org/html/2606.27599v1/x5.png)

\(d\)

![Refer to caption](https://arxiv.org/html/2606.27599v1/x6.png)

\(e\)

Figure 2:Five\-levelKARMAexplanation hierarchyapplied to a TCN \(W=24W=24, hidden 64, 2 layers, kernel 3\) trained on the Beijing PM2\.5dataset \(D=11D=11,N=3N=3,K∗=4K^\{\*\}=4,ε=0\.049\\varepsilon=0\.049,λ=0\.025\\lambda=0\.025,Δ^pred=0\.046\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}=0\.046\)\. All levels are derived from the single estimated kernel𝒯^K∗f,d\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}without additional oracle queries\. History indiceshhfollow the mixed\-radix encodingh=∑k,dhkd⋅ND​\(K∗−1−k\)\+\(D−1−d\)h=\\sum\_\{k,d\}h^\{d\}\_\{k\}\\cdot N^\{D\(K^\{\*\}\-1\-k\)\+\(D\-1\-d\)\}\.\(a\) Level 1 — Variable importanceΦ~d′\\tilde\{\\Phi\}^\{d^\{\\prime\}\}:dewpdominates; the remaining wind\-direction and precipitation variables receive certified\-zero attributions \(<λ<\\lambda\)\.\(b\) Level 2 — Lag profilesϕkd′\\phi^\{d^\{\\prime\}\}\_\{k\}: active variables peak sharply atk=1k=1and decay over theK∗=4K^\{\*\}=4retained lags, withtempretaining secondary influence \(certified\-zero variables omitted\)\.\(c\) Level 4 — Ranked interventional effectsρ​\(e\)\\rho\(e\): self\-loops dominate the top\-25 retained edges; the strongest cross\-variable effect isdewp→\\totemp\.\(d\) Level 3 — Regime distance matrix\(targetpm2\.5\): pairwise total\-variation distances∥𝒯^K∗f,d\(⋅∣h\)−𝒯^K∗f,d\(⋅∣h′\)∥T​V\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h^\{\\prime\}\)\\\|\_\{TV\}between histories, hierarchically clustered; bright off\-diagonal blocks mark distinctive regimes where the model’s conditional forecast departs from its typical behaviour, the pairwise view from whichΨd​\(h\)\\Psi^\{d\}\(h\)aggregates\.\(e\) Level 5 — Uncertainty\-aware explanation\(π^∗\\hat\{\\pi\}^\{\*\}\-weighted\):\(A\)per\-history aleatoric entropyHhdH^\{d\}\_\{h\}\(bits, averaged over variables\);\(B\)epistemic uncertainty by next\-bin, the bootstrap standard deviationVar​\(𝒯^K∗f,d\)\\sqrt\{\\mathrm\{Var\}\(\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\)\};\(C\)π^∗\\hat\{\\pi\}^\{\*\}\-weighted marginal next\-bin distribution per target variable, against the uniform1/N1/Nreference;\(D\)coverage map of aleatoric entropy vs\. epistemic deviation with bubble area∝\\proptovisit countnf​\(h\)n^\{f\}\(h\), flagging high\-epistemic, low\-coverage histories that warrant more data\.

## 5KARMA

The complete KARMA procedure is summarized in Algorithm[S4\.1](https://arxiv.org/html/2606.27599#S4.SS1a)\(Supplementary Material[S4\.1](https://arxiv.org/html/2606.27599#S4.SS1a)\); we describe each step in turn\.

### 5\.1Step 1: State Space Discretization

Each marginal seriesXdX^\{d\}is discretized intoNNbins via quantile boundaries on the training window\. We recommendN∈\{2,3,4,5\}N\\in\\\{2,3,4,5\\\}\.

### 5\.2Step 2:K∗K^\{\*\}Selection and Certified Attribution

##### Surrogate validity\.

The selected orderK∗K^\{\*\}is the smallestKKfor which the model is certified insensitive to the prefix:

K∗=min⁡\{K≥1:minb∈ℬ⁡Δ^pred​\(K,b\)<ε\},ε\>0,K^\{\*\}=\\min\\\!\\left\\\{K\\geq 1:\\min\_\{b\\in\\mathcal\{B\}\}\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}\(K,b\)<\\varepsilon\\right\\\},\\\>\\\>\\\>\\varepsilon\>0,\(12\)
Δ^pred\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}is adirect model testrequiring no kernel estimation: queryfftwice per sample, once with the full window and once withb∗⊕hib^\{\*\}\\oplus h\_\{i\}, and average the prediction discrepancy\.111Optimally, for maximum confidence in the surrogacy approximation, it is performed on unseen data to ensure transparency\. However, due to data scarcity, the validation split can be used as an alternative\.

### 5\.3Step 3: KARMA Baseline and Model Compression

WhenK∗<WK^\{\*\}<W, define theKARMA baselineb∗b^\{\*\}as the prefix minimizing the average prediction discrepancy when the true prefix is replaced:

b∗=arg​minb⁡Δ^pred​\(K∗,b\)\.b^\{\*\}=\\operatorname\*\{arg\\,min\}\_\{b\}\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}\(K^\{\*\},b\)\.\(13\)Unlike other baselines, which are analyst\-chosen,b∗b^\{\*\}ismodel\-certified: it is the prefix that makesff’s predictions under the surrogate closest toff’s true predictions, discovered from model behavior\. Define theCompression ratiobyW/K∗W/K^\{\*\}\.222Model compression and the certified\-zero result require only thatΔ^pred\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}be estimated reliably, which demandsn≳200n\\gtrsim 200independent model queries for stable averages, easily satisfied with any held\-out series of lengthT≥W\+200T\\geq W\+200\. Pillars 1–2 are therefore applicable under standard data budgets without qualification\.

### 5\.4Step 4: \(Marginal\) Transition Kernel Estimation

The five\-level explanations are computed from the marginal transition kernel𝒯^K∗f,d\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\. Estimating the full joint kernel is statistically prohibitive: the number of histories grows asND​K∗N^\{DK^\{\*\}\}, so the data required for a uniform total\-variation guarantee is exponential inD​K∗DK^\{\*\}and infeasible at standard series lengths\. KARMA therefore estimatesDDfactored marginal kernels, one per target variable, which a single set of model queries supplies simultaneously\.

We provide three estimators, all with finite\-sample total\-variation guarantees; the choice trades data efficiency against implementation simplicity\.

##### Strategy A \(empirical counting\)\.

A Dirichlet posterior\-mean estimator accumulates discretised model outputs into per\-history counts\. It is simple and unbiased in the large\-sample limit, but its sufficient sample budget still scales withND​K∗N^\{DK^\{\*\}\}, limiting it to small state spaces\.

##### Strategy B \(b∗b^\{\*\}\-prefix Monte Carlo, default\)\.

We provide first the following definition

###### Definition 5\.1\(Observed history support\)\.

Theobserved supportofπ^∗\\hat\{\\pi\}^\{\*\}at orderKKis:

ℋK\+=\{h∈ℋK:π^∗​\(h\)\>θ,θ≥0\},\\mathcal\{H\}^\{\+\}\_\{K\}=\\bigl\\\{h\\in\\mathcal\{H\}\_\{K\}:\\hat\{\\pi\}^\{\*\}\(h\)\>\\theta,\\\>\\\>\\theta\\geq 0\\bigr\\\},\(14\)the set of histories that actually appear in the training data, whereπ^∗​\(h\)=ℙ​\(ψ​\(Xt−K∗\+1:t\)=h\)\>0\\hat\{\\pi\}^\{\*\}\(h\)=\\mathbb\{P\}\(\\psi\(X\_\{t\-K^\{\*\}\+1:t\}\)=h\)\>0\. Allπ^∗\\hat\{\\pi\}^\{\*\}\-weighted explanation quantities are zero outside this set\. Restricting toℋK∗\+\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}bounds the effective history space by\|ℋK∗\+\|≤T−K∗\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\|\\leq T\-K^\{\*\}, growing linearly inTTrather than exponentially inD​K∗DK^\{\*\}\. We useθ=0\\theta=0is the default setting\.

Rather than relying on observed history counts, Strategy B draws samples from a distribution pool, substitutes the certified baselineb∗b^\{\*\}on the prefix, and queries the model\. Its sufficient budget scales with\|ℋK∗\+\|≤T−K∗\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\|\\leq T\-K^\{\*\}, linear in the series length rather than exponential inD​K∗DK^\{\*\}, which is what makes KARMA tractable in practice\. We use Strategy B by default\.

##### Strategy C \(tree\-structured pooling\)\.

Strategies A and B estimate each history’s kernel independently, so they fail on histories that are never sufficiently covered, no matter how long the series\. Strategy C instead grows a regression tree that partitions the covered historiesℋcov\\mathcal\{H\}^\{\\mathrm\{cov\}\}\(patterns with sufficient coverage\) into leaves of structurally similar histories and pools their model counts, sharing statistical strength across histories that agree on the bins the tree identifies as predictive\. Sparse histories inherit the pooled kernel of the leaf they route to, extending reliable estimates to histories inaccessible to Strategies A and B\. The sample budget scales with the number of leavesL≪\|ℋK∗\+\|L\\ll\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\|rather than with\|ℋK∗\+\|\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\|, and when every history is covered the tree reduces to one history per leaf and recovers Strategy B exactly\. The split variables additionally read off as Level 1–2 structure: a source variable at a given lag that appears as a high split node for targetddis a primary driver ofdd\.

All estimators admit explicit noise floors that upper\-bound the per\-history estimation error \([5](https://arxiv.org/html/2606.27599#S4.E5)\) and feed directly into the Level 5 reliability map; the precise bounds and sample budgets are stated and proved in Supplementary Material\.

### 5\.5Step 5: 5\-level Explanation Retrieval

See Section[4\.1](https://arxiv.org/html/2606.27599#S4.SS1)

## 6Experiments

We compare KARMA \(KM\) to three temporal aware methods, TimeSHAP \(TS\), WinIT \(WI\), and Dynamask \(DM\), the Feature Occlusion \(FO\) method, and the gradient\-based method Integrated Gradients \(IG\)\. Code available on[https://github\.com/AmTuTi1999/karma\_\\\_\.git](https://github.com/AmTuTi1999/karma_.git)\.

### 6\.1Case Study A: Synthetic Data

#### 6\.1\.1Experimental setup

All methods explain the same analytical oraclefVARf\_\{\\mathrm\{VAR\}\}, which implements the true VAR data\-generating process exactly \(no model error\) with known true Markov OrderKt​r​u​eK\_\{true\}given coefficient matrixA∈ℝD×D×Kt​r​u​eA\\in\\mathbb\{R\}^\{D\\times D\\times K\_\{true\}\}:

fVAR​\(x\)j=∑k=1Kt​r​u​e∑i=1DAi​j\(k\)​xW−k,i,j∈\[D\]f\_\{\\mathrm\{VAR\}\}\(x\)\_\{j\}\\;=\\;\\sum\_\{k=1\}^\{K\_\{true\}\}\\sum\_\{i=1\}^\{D\}A\_\{ij\}^\{\(k\)\}\\,x\_\{W\-k,\\,i\},\\qquad j\\in\[D\]wherex∈ℝW×Dx\\in\\mathbb\{R\}^\{W\\times D\}is the input window of lengthWW\. The ground\-truth importance matrix isΦi,k∗=∑j\|Ai​j\(k\)\|\\Phi^\{\*\}\_\{i,k\}=\\sum\_\{j\}\|A\_\{ij\}^\{\(k\)\}\|, aggregating the absolute coefficient magnitudes over all target variables\. Methods are ranked by Kendall’sτ\\taubetween their flattened attribution estimateΦ^\\hat\{\\Phi\}andΦ∗\\Phi^\{\*\}\.

For KARMA, the Markov orderK∗K^\{\*\}is immediately obtained asKt​r​u​eK\_\{true\}and certified baselineb∗b^\{\*\}is selected viaΔpred<ε\\Delta^\{\\mathrm\{pred\}\}<\\varepsilonwithε=10−5\\varepsilon=10^\{\-5\}\. The history discretizer usesN=3N=3bins forD≤4D\\leq 4andN=2N=2forD\>4D\>4to keep the symbolic state space tractable\.M=100M=100Monte Carlo draws per history are used to build sampling pool for kernel estimation\.

To obtain cell\-level global explanations from FO, DynaMask, and TimeSHAP, the raw\(B,D,W\)\(B,D,W\)attribution tensor is aggregated toΦ^∈ℝD×W\\hat\{\\Phi\}\\in\\mathbb\{R\}^\{D\\times W\}by

Φ^d=1B​∑b=1B\|ϕb,d,W−k\|,d∈\[D\],k∈\[W\]\\hat\{\\Phi\}\_\{d\}\\;=\\;\\frac\{1\}\{B\}\\sum\_\{b=1\}^\{B\}\|\\phi\_\{b,d,\\,W\-k\}\|,\\qquad d\\in\[D\],\\\>k\\in\[W\]reading each lagkkfrom window positionW−kW\-k\. WinIT uses the last\-target slice at the same window positions\. KARMA producesΦ^\\hat\{\\Phi\}directly without further aggregation\.

We exclude IG from the synthetic comparison: on the linear VAR oracle, integrated gradients reduce to the exact coefficient magnitudes, making the comparison degenerate rather than informative\. IG is retained in the real\-world evaluation \(Table[3](https://arxiv.org/html/2606.27599#S6.T3)\), where the models are nonlinear\.

#### 6\.1\.2Results

We first evaluateKARMAon model\-induced causal graph recovery, treatingρ​\(e\)\>λ\\rho\(e\)\>\\lambda\(0\.025 fortiny, 0\.1 for the rest\) as the criterion for a retained edgeee\. Table[1](https://arxiv.org/html/2606.27599#S6.T1)reports the precision, recall, and F1 scores of edge discovery\. KARMA achieves a recall score of1\.01\.0in all but the most complex case \(still achieving 0\.71\) showing KARMA excludes no true causal relationships\. The high precision and F1 indicates KARMA’s ability to not retain spurious edges\.

Table 1:KARMA edge\-recovery metrics,\|E\|\|E\|denotes the number of edges\. Recall is 1\.00 through*large*; precision and recall also stay consistently close or equal to 1\.00\.ConfigDDKt​r​u​eK\_\{true\}\|E\|\|E\|PrecisionRecallF1tiny2121\.001\.001\.00small4251\.001\.001\.00medium4370\.701\.000\.82large6391\.001\.001\.00xlarge84141\.000\.710\.83We further evaluate across five VAR configurations of increasing scale \(Table[2](https://arxiv.org/html/2606.27599#S6.T2)\)\.KARMAis the sole top\-ranked method on medium, large, and xlarge configurations, where the other methods degrade as the graph grows denser\.

Table 2:Kendall’sτ\\tauvs\. ground truth across VAR configurations\.Boldmarks the highest per row \(ties included\)\.∗\*indicates Kendall test with p value less than 0\.05ConfigFODMWITSKARMAtiny0\.360\.360\.76\\phantom\{\-\}0\.76^\{\\phantom\{\*\}\}0\.94∗\\mathbf\{0\.94^\{\*\}\}0\.94∗\\mathbf\{0\.94^\{\*\}\}0\.94∗\\mathbf\{0\.94^\{\*\}\}small0\.13\\phantom\{\-\}0\.130\.14\\phantom\{\-\}0\.14^\{\\phantom\{\*\}\}0\.86∗0\.86^\{\*\}0\.92∗\\mathbf\{0\.92^\{\*\}\}0\.92∗\\mathbf\{0\.92^\{\*\}\}medium0\.26\\phantom\{\-\}0\.26−0\.09\-0\.09^\{\\phantom\{\*\}\}0\.77∗0\.77^\{\*\}0\.79∗0\.79^\{\*\}0\.90∗\\mathbf\{0\.90^\{\*\}\}large−0\.07\-0\.07−0\.32∗\-0\.32^\{\*\}0\.72∗0\.72^\{\*\}0\.68∗0\.68^\{\*\}0\.92∗\\mathbf\{0\.92^\{\*\}\}xlarge0\.25\\phantom\{\-\}0\.250\.09\\phantom\{\-\}0\.09\{\\phantom\{\*\}\}0\.76∗0\.76^\{\*\}0\.64∗0\.64^\{\*\}0\.77∗\\mathbf\{0\.77^\{\*\}\}

### 6\.2Case Study B: Real\-World Forecasting Benchmark

We consider three diverse time series datasets: Electricity Transformer Temperature hourly 1 dataset \(ETTh1\)Zhouet al\.\([2021](https://arxiv.org/html/2606.27599#bib.bib48)\), Beijing PM2\.5\(Bei\)Lianget al\.\([2015](https://arxiv.org/html/2606.27599#bib.bib49)\), Exchange Rate \(ExRa\)Laiet al\.\([2018](https://arxiv.org/html/2606.27599#bib.bib50)\), and three time series forecasting models \(GRU, LSTM, TCN\)\. To compare KARMA to the different methods, we use atemporal aware lag based AUC score\(𝐀𝐔𝐂l​a​g\\mathbf\{AUC\}\_\{lag\}\) \(Supplementary Material[S5](https://arxiv.org/html/2606.27599#S5a)\) to measure predictive change under time lag occlusion i\.e\. we evaluate how well the XAI method learns model induced temporal dependencies by replacing entire time horizons with conditional expectations learnt by a VAR on transition dynamics on the training data\.

Given a batch ofBBtest windows and aDD\-variate attribution tensorϕ∈ℝB×D×T\\phi\\in\\mathbb\{R\}^\{B\\times D\\times T\}, we aggregate attributions into a per\-timestep importance score

st=1B​D​∑b=1B∑d=1D\|ϕb,d,t\|,t=1,…,T\.s\_\{t\}\\;=\\;\\frac\{1\}\{BD\}\\sum\_\{b=1\}^\{B\}\\sum\_\{d=1\}^\{D\}\\bigl\|\\phi\_\{b,d,t\}\\bigr\|,\\qquad t=1,\\ldots,T\.\(15\)ForKARMA, we bypass the cell attribution matrix and derive time scores directly from the raw edge weights,

stKARMA=∑\{e:T−ℓe=t\}ρe,s\_\{t\}^\{\\textsc\{KARMA\}\}\\;=\\\!\\\!\\sum\_\{\\\{e\\,:\\,T\-\\ell\_\{e\}=t\\\}\}\\\!\\\!\\rho\_\{e\},\(16\)whereℓe\\ell\_\{e\}is the lag andρe\\rho\_\{e\}is the importance of edgeee, so that a timestep targeted by many high\-weight causal edges scores proportionally higher\.

We report𝐀𝐔𝐂l​a​g​@​25%\\mathbf\{AUC\}\_\{lag\}\\text\{@\}25\\%, a single\-point summary of how much prediction changes when the top quarter of timesteps are removed\. A method that correctly identifies the temporal dependencies concentrates its top\-ranked positions there, producing a steep early rise and a high area; one that wastes rankings on non\-blanket lags sees a flat curve\.

#### 6\.2\.1Results\.

Table[3](https://arxiv.org/html/2606.27599#S6.T3)reports AUClag\{\}\_\{\\text\{lag\}\}@25% across the three datasets and three architectures, with the full removal curves shown in Figure[3](https://arxiv.org/html/2606.27599#S6.F3)\. On ETTh1, KARMA attains the highest score on every architecture and opens a clear margin on the TCN \(0\.710\.71vs\.0\.490\.49for the next\-best method\), indicating that its edge\-derived timestep scores concentrate prediction\-relevant lags more sharply than the flat baselines\. On Beijing PM2\.5, KARMA is competitive throughout, matching or marginally trailing the strongest baseline on each architecture\. On Exchange Rate, all methods collapse to near\-zero AUClag\{\}\_\{\\text\{lag\}\}@25%: this dataset is close to a random walk, where successive increments carry little learnable temporal structure, so no attribution method can identify timesteps whose removal substantially shifts the forecast\. We read this not as a failure of the explainers but as the metric behaving correctly: when the model has learned little exploitable temporal dependence, a faithful temporal\-importance score*should*be flat\. Notably, the strong perturbation baselines IG and TimeSHAP, which are competitive on ETTh1, exhibit the same collapse on Exchange Rate, confirming that the effect is a property of the data rather than of KARMA’s construction\.

Table 3:𝐀𝐔𝐂l​a​g​@​25%\\mathbf\{AUC\}\_\{lag\}\\text\{@\}25\\%under VAR\-conditional imputation\. Higher is better\.DatasetArchFODMWIIGTSKMETTh1GRU0\.030\.040\.030\.610\.610\.63LSTM0\.160\.080\.090\.620\.660\.66TCN0\.030\.240\.020\.490\.490\.71ExRAGRU0\.000\.010\.000\.030\.030\.02LSTM0\.000\.060\.000\.010\.010\.00TCN0\.000\.010\.000\.020\.020\.05BeiGRU0\.010\.010\.290\.290\.220\.18LSTM0\.010\.010\.100\.250\.240\.24TCN0\.020\.010\.280\.220\.190\.28![Refer to caption](https://arxiv.org/html/2606.27599v1/x7.png)Figure 3:Lag AUC removal curves for all datasets and architectures\. Each curve shows cumulative\|Δ​y^\|\|\\Delta\\hat\{y\}\|as the fraction of removed timesteps grows from 0 to 1\.

## 7Conclusion

We presented KARMA, a framework for certified temporal explanation of black\-box models on multivariate time series viaKK\-order Markov approximations\. KARMA makes three contributions\. Pillar 1 selects the minimal predictively sufficient lagK∗K^\{\*\}\. Pillar 2 delivers a model compression theorem certifying prediction fidelity loss≤ε\\leq\\varepsilonat compression ratioW/K∗W/K^\{\*\}, a model\-certified baselineb∗b^\{\*\}, and certified\-zero attributions for all lags beyondK∗K^\{\*\}, not merely small attributions\. Pillar 3 extracts five layers of global explanation from the single estimated kernel without additional oracle queries, spanning variable importance, lag profiles, regime distinctiveness, average causal effects, and uncertainty quantification\.

On synthetic VAR benchmarks with known ground truth, KARMA recovers the model\-induced causal structure with perfect recall on all but the largest configuration \(1\.001\.00through thelargesetting,0\.710\.71onxlarge\), correctly retaining every true edge; its high precision on all graphs reflects a robustness to spurious relationships that might arise from autocorrelation\. On feature\-importance ranking, KARMA is the sole top\-ranked method on themedium,large, andxlargeconfigurations, where perturbation\-based baselines such as TimeSHAP and WinIT degrade as graph density grows, while remaining competitive on the smaller settings\. On real\-world forecasting across three datasets and three architectures, KARMA matches or exceeds the strongest baselines under VAR\-conditional imputation, most notably on ETTh1\.

A key limitation is the exponential growth of the state space inND​K∗N^\{DK^\{\*\}\}, which can make full kernel estimation challenging in high\-dimensional settings\. In practice, KARMA mitigates this via marginal kernels, sampling strategies restricting toℋK∗\+\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\(bounded linearly byT−K∗T\-K^\{\*\}\) and thresholding onπ^∗​\(h\)\\hat\{\\pi\}^\{\*\}\(h\), with reliability certified by the Level 5 coverage map; scaling to very high\-dimensional systems remains an open direction\. We believe the Markov approximation paradigm offers a principled and underexplored direction for XAI in sequential domains, with applications spanning finance, neuroscience, and genomics\.

## References

- S\. Bach, A\. Binder, G\. Montavon, F\. Klauschen, K\. Müller, and W\. Samek \(2015\)On pixel\-wise explanations for non\-linear classifier decisions by layer\-wise relevance propagation\.PLOS ONE10\(7\),pp\. e0130140\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- J\. Bento, P\. Saleiro, A\. F\. Cruz, M\. A\. T\. Figueiredo, and P\. Bizarro \(2021\)TimeSHAP: explaining recurrent models through sequence perturbations\.InProceedings of KDD,Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- J\. Crabbé and M\. van der Schaar \(2021\)Explaining time series predictions with dynamic masks\.InProceedings of the 38th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.139,pp\. 2166–2177\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- I\. Csiszár and P\. C\. Shields \(2000\)The consistency of the BIC Markov order estimator\.Annals of Statistics28\(6\),pp\. 1601–1619\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p3.1)\.
- A\. Dvoretzky, J\. Kiefer, and J\. Wolfowitz \(1956\)Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator\.Annals of Mathematical Statistics27\(3\),pp\. 642–669\.External Links:[Document](https://dx.doi.org/10.1214/aoms/1177728174)Cited by:[3rd item](https://arxiv.org/html/2606.27599#S3.I2.i3.p1.2)\.
- J\. Enguehard \(2023\)Learning perturbations to explain time series predictions\.InProceedings of the 40th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.202,pp\. 9329–9342\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- A\. A\. Ismail, M\. Gunady, H\. Corrada Bravo, and S\. Feizi \(2020\)Benchmarking deep learning interpretability in time series predictions\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33,pp\. 6441–6452\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- S\. Jain and B\. C\. Wallace \(2019\)Attention is not explanation\.InProceedings of NAACL,Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- G\. Lai, W\. Chang, Y\. Yang, and H\. Liu \(2018\)Modeling long\-and short\-term temporal patterns with deep neural networks\.InProceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 95–104\.Cited by:[§6\.2](https://arxiv.org/html/2606.27599#S6.SS2.p1.2)\.
- K\. K\. Leung, C\. Rooke, J\. Smith, S\. Zuberi, and M\. Volkovs \(2023\)Temporal dependencies in feature importance for time series prediction\.InThe Eleventh International Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- X\. Liang, T\. Zou, B\. Guo, S\. Li, H\. Zhang, S\. Zhang, H\. Huang, and S\. X\. Chen \(2015\)Assessing beijing’s pm2\.5 pollution: severity, weather impact, apec and winter heating\.Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences471\(2182\),pp\. 20150257\.Cited by:[§6\.2](https://arxiv.org/html/2606.27599#S6.SS2.p1.2)\.
- B\. Lim, S\. Ö\. Arık, N\. Loeff, and T\. Pfister \(2021\)Temporal fusion transformers for interpretable multi\-horizon time series forecasting\.International Journal of Forecasting37\(4\),pp\. 1748–1764\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- Z\. Liu, Y\. Zhang, T\. Wang, Z\. Wang, D\. Luo, M\. Du, M\. Wu, Y\. Wang, C\. Chen, L\. Fan, and Q\. Wen \(2024\)Explaining time series via contrastive and locally sparse perturbations\.InThe Twelfth International Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- S\. M\. Lundberg and S\. I\. Lee \(2017\)A unified approach to interpreting model predictions\.InProceedings of NeurIPS,Cited by:[§1](https://arxiv.org/html/2606.27599#S1.p2.2),[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- O\. Queen, T\. Hartvigsen, T\. Koker, H\. He, T\. Tsiligkaridis, and M\. Zitnik \(2023\)Encoding time\-series explanations through self\-supervised model behavior consistency\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.36,pp\. 32129–32159\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- M\. T\. Ribeiro, S\. Singh, and C\. Guestrin \(2016\)“Why should I trust you?”: explaining the predictions of any classifier\.InProceedings of KDD,Cited by:[§1](https://arxiv.org/html/2606.27599#S1.p2.2),[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- R\. R\. Selvaraju, M\. Cogswell, A\. Das, R\. Vedantam, D\. Parikh, and D\. Batra \(2017\)Grad\-cam: visual explanations from deep networks via gradient\-based localization\.InProceedings of the IEEE International Conference on Computer Vision \(ICCV\),pp\. 618–626\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- A\. Shrikumar, P\. Greenside, and A\. Kundaje \(2017\)Learning important features through propagating activation differences\.InProceedings of the 34th International Conference on Machine Learning \(ICML\),Vol\.70,pp\. 3145–3153\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- K\. Simonyan, A\. Vedaldi, and A\. Zisserman \(2014\)Deep inside convolutional networks: visualising image classification models and saliency maps\.InInternational Conference on Learning Representations \(ICLR\) Workshop,Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- D\. Smilkov, N\. Thorat, B\. Kim, F\. Viégas, and M\. Wattenberg \(2017\)SmoothGrad: removing noise by adding noise\.arXiv preprint arXiv:1706\.03825\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- M\. Sundararajan, A\. Taly, and Q\. Yan \(2017\)Axiomatic attribution for deep networks\.InProceedings of the 34th International Conference on Machine Learning \(ICML\),Vol\.70,pp\. 3319–3328\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- H\. Suresh, N\. Hunt, A\. Johnson, L\. A\. Celi, P\. Szolovits, and M\. Ghassemi \(2017\)Clinical intervention prediction and understanding using deep networks\.arXiv preprint arXiv:1705\.08498\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- S\. Tonekaboni, S\. Joshi, K\. Campbell, D\. Duvenaud, and A\. Goldenberg \(2020\)What went wrong and when? instance\-wise feature importance for time\-series black\-box models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- S\. Wiegreffe and Y\. Pinter \(2019\)Attention is not not explanation\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing \(EMNLP\-IJCNLP\),pp\. 11–20\.Cited by:[§2](https://arxiv.org/html/2606.27599#S2.p1.1)\.
- H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang \(2021\)Informer: beyond efficient transformer for long sequence time\-series forecasting\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.35,pp\. 11106–11115\.Cited by:[§6\.2](https://arxiv.org/html/2606.27599#S6.SS2.p1.2)\.

## Supplementary Material

This supplementary material gathers the proofs and implementation details deferred from the main text: the soundness of the trimming thresholdλ\\lambdaand the K\.E\.R\.I\. reliability index \(Section[S2](https://arxiv.org/html/2606.27599#S2a)\); finite\-sample total\-variation guarantees, noise floors, and sample budgets for the three kernel estimators \(Section[S3](https://arxiv.org/html/2606.27599#S3a)\); the model\-induced causal graph and faithfulness of KARMA’s attributions \(Section[S4](https://arxiv.org/html/2606.27599#S4a)\); and the full algorithm, theT\>1T\>1extension, and the evaluation metric\. References prefixed with “S” point to this supplement; unprefixed ones to the main paper\.

## S1Constructive Extension toT\>1T\>1

Recall\[N\]\[N\],\[D\]\[D\],ψ\\psi,𝒮\\mathcal\{S\}andℋK\\mathcal\{H\}\_\{K\}as defined in Section 3\.1 of the manuscript\. For a forecast horizon of lengthTT, let𝒮T=\[N\]D×T\\mathcal\{S\}\_\{T\}=\[N\]^\{D\\times T\}denote the extended \(discretised\) output state space, so that a discretised forecast is a matrixs=\(sid\)d∈\[D\],i∈\[T\]∈𝒮Ts=\(s^\{d\}\_\{i\}\)\_\{d\\in\[D\],\\,i\\in\[T\]\}\\in\\mathcal\{S\}\_\{T\}, wheresids^\{d\}\_\{i\}is the bin of variableddat horizon stepii\. Given the selected orderK∗K^\{\*\}, themulti\-step \(joint\) surrogate transition kernelofffis

𝒯K∗f\(s∣h\)=ℙ\(ψ\(f\(h~\)\)=s\|suffixK∗\(ψ\(h~\)\)=h\),\\mathcal\{T\}^\{f\}\_\{K^\{\*\}\}\(s\\mid h\)\\;=\\;\\mathbb\{P\}\\\!\\left\(\\psi\(f\(\\tilde\{h\}\)\)=s\\;\\middle\|\\;\\mathrm\{suffix\}\_\{K^\{\*\}\}\(\\psi\(\\tilde\{h\}\)\)=h\\right\),\(S1\)fors∈𝒮Ts\\in\\mathcal\{S\}\_\{T\}andh∈ℋK∗h\\in\\mathcal\{H\}\_\{K^\{\*\}\}, whereh~\\tilde\{h\}is aWW\-length input window whose discretisedK∗K^\{\*\}\-suffix equalshh\.

##### Horizon paths\.

ForT\>1T\>1an explanation tracks a single target variable at each future step\. We encode this choice by a*relevant mask*M∈\{0,1\}D×TM\\in\\\{0,1\\\}^\{D\\times T\}with exactly one active entry per horizon step,

M​\[d\]​\[i\]=\{1d=di,0otherwise,i∈\[T\],M\[d\]\[i\]=\\begin\{cases\}1&d=d\_\{i\},\\\\ 0&\\text\{otherwise,\}\\end\{cases\}\\qquad i\\in\[T\],\(S2\)wheredi∈\[D\]d\_\{i\}\\in\[D\]is the variable selected at stepii\(thedid\_\{i\}need not be distinct across steps\)\. Writing⟨s,M⟩=\(sidi\)i∈\[T\]∈\[N\]T\\langle s,M\\rangle=\(s^\{d\_\{i\}\}\_\{i\}\)\_\{i\\in\[T\]\}\\in\[N\]^\{T\}for the entries ofsspicked out byMM, themulti\-step marginal surrogate transition kernelalong the pathMMis

𝒯K∗f,M\(sT∣h\)=ℙ\(⟨ψ\(f\(h~\)\),M⟩=sT\|suffixK∗\(ψ\(h~\)\)=h\),\\mathcal\{T\}^\{f,M\}\_\{K^\{\*\}\}\(s^\{T\}\\mid h\)\\;=\\;\\mathbb\{P\}\\\!\\left\(\\langle\\psi\(f\(\\tilde\{h\}\)\),\\,M\\rangle=s^\{T\}\\;\\middle\|\\;\\mathrm\{suffix\}\_\{K^\{\*\}\}\(\\psi\(\\tilde\{h\}\)\)=h\\right\),\(S3\)wheresT∈\[N\]Ts^\{T\}\\in\[N\]^\{T\}is the length\-TThorizon trajectory selected byMM\. BecauseMMactivates at most one variable per step, the path contains no contemporaneous \(same\-step\) couplings, consistent with the strictly\-past construction of Section 3\.2; the single\-target caseT=1T=1recovers the marginal kernel𝒯K∗f,d\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}of Eq\. \(4\)\.

### S1\.1The Five\-Level Explanation Hierarchy

LetK∗\>0K^\{\*\}\>0,h∈ℋK∗h\\in\\mathcal\{H\}\_\{K^\{\*\}\},d∈\[D\]d\\in\[D\]andffa trained model\. Further assume𝒯^K∗f,M\\hat\{\\mathcal\{T\}\}\_\{K^\{\*\}\}^\{f,M\}the approximation of𝒯K∗f,M\\mathcal\{T\}\_\{K^\{\*\}\}^\{f,M\}such that

𝔼\[∥𝒯^Kf,M\(⋅∣h\)−𝒯Kf,M\(⋅∣h\)∥T​V\]<λ4,λ\>0\\mathbb\{E\}\\\!\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,M\}\_\{K\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f,M\}\_\{K\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\right\]<\\frac\{\\lambda\}\{4\},\\qquad\\lambda\>0\(S4\)We describe two cases, user\-specified prediction time steps

###### Definition S1\.1\(Lag\-Resolved Influence and Variable Importance\)\.

We are given the estimated horizon kernel𝒯^K∗f​\(s∣h\)\\hat\{\\mathcal\{T\}\}^\{f\}\_\{K^\{\*\}\}\(s\\mid h\)over𝒮T=\[N\]D×T\\mathcal\{S\}\_\{T\}=\[N\]^\{D\\times T\}\(Eq\.[S1](https://arxiv.org/html/2606.27599#S1.E1)\) and wish to quantify the effect onqqchosen forecast stepsτ=\{t1,…,tq\}⊆\[T\]\\tau=\\\{t\_\{1\},\\dots,t\_\{q\}\\\}\\subseteq\[T\]\. Fix a target dimensiondd\. Selecting its length\-TThorizon with the dimension maskMd∈\{0,1\}D×TM\_\{d\}\\in\\\{0,1\\\}^\{D\\times T\}\(Md​\[d′\]​\[i\]=𝟏​\{d′=d\}M\_\{d\}\[d^\{\\prime\}\]\[i\]=\\mathbf\{1\}\\\{d^\{\\prime\}=d\\\}\) gives

𝒯^K∗f,d​\(s1:Td∣h\)=ℙ​\(⟨ψ​\(f​\(h~\)\),Md⟩=s1:Td∣suffixK∗​\(ψ​\(h~\)\)=h\),s1:Td∈\[N\]T,\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\_\{1:T\}\\mid h\)=\\mathbb\{P\}\\\!\\left\(\\langle\\psi\(f\(\\tilde\{h\}\)\),M\_\{d\}\\rangle=s^\{d\}\_\{1:T\}\\mid\\mathrm\{suffix\}\_\{K^\{\*\}\}\(\\psi\(\\tilde\{h\}\)\)=h\\right\),\\quad s^\{d\}\_\{1:T\}\\in\[N\]^\{T\},already computed in equation \(19\) and marginalizing out theT−qT\-qsteps outsideτ\\taukeeps the requested steps:

𝒯^K∗f,d,τ​\(sτd∣h\)=∑𝐱∈\[N\]\[T\]∖τ𝒯^K∗f,d​\(sτd,s\[T\]∖τd=𝐱∣h\),sτd∈\[N\]q\.\\hat\{\\mathcal\{T\}\}^\{f,d,\\tau\}\_\{K^\{\*\}\}\(s^\{d\}\_\{\\tau\}\\mid h\)=\\sum\_\{\\mathbf\{x\}\\in\[N\]^\{\[T\]\\setminus\\tau\}\}\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\\\!\\bigl\(s^\{d\}\_\{\\tau\},\\,s^\{d\}\_\{\[T\]\\setminus\\tau\}=\\mathbf\{x\}\\mid h\\bigr\),\\quad s^\{d\}\_\{\\tau\}\\in\[N\]^\{q\}\.The edge strength of sourced′d^\{\\prime\}at lagkkon targetddover the stepsτ\\tauis

ρ\(Xt1−kd′→Xτd\)=∑h∈ℋK∗π^∗\(h\)⋅12ΔT​V,ΔT​V=∑sτd∈\[N\]q\|𝒯^K∗f,d,τ\(sτd∣h\)−1N∑x=0N−1𝒯^K∗f,d,τ\(sτd∣hkd′←x\)\|,\\rho\(X^\{d^\{\\prime\}\}\_\{t\_\{1\}\-k\}\\to X^\{d\}\_\{\\tau\}\)=\\sum\_\{h\\in\\mathcal\{H\}\_\{K^\{\*\}\}\}\\hat\{\\pi\}^\{\*\}\(h\)\\cdot\\tfrac\{1\}\{2\}\\,\\Delta\_\{TV\},\\qquad\\Delta\_\{TV\}=\\\!\\\!\\sum\_\{s^\{d\}\_\{\\tau\}\\in\[N\]^\{q\}\}\\\!\\\!\\left\|\\hat\{\\mathcal\{T\}\}^\{f,d,\\tau\}\_\{K^\{\*\}\}\(s^\{d\}\_\{\\tau\}\\mid h\)\-\\frac\{1\}\{N\}\\sum\_\{x=0\}^\{N\-1\}\\hat\{\\mathcal\{T\}\}^\{f,d,\\tau\}\_\{K^\{\*\}\}\(s^\{d\}\_\{\\tau\}\\mid h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\)\\right\|,\(S5\)wherehkd′←xh^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}replaces only thed′d^\{\\prime\}\-th component at lagkk\.

The*lag\-resolved influence*of sourced′d^\{\\prime\}at lagkksums the edge strengths over all targetsdd,

⋯\\cdots⋯\\cdots⋯\\cdots⋮\\vdots⋮\\vdots⋮\\vdots⋮\\vdots⋱\\ddotsWWDDW×DW\\times Dinput windowlagkkffpredictor⋯\\cdots⋯\\cdots⋯\\cdots⋮\\vdots⋮\\vdots⋮\\vdots⋱\\ddotst1t\_\{1\}t2t\_\{2\}⋯\\cdotsTTτ=\{t1,…,tq\}\\tau=\\\{t\_\{1\},\\dots,t\_\{q\}\\\}ddT×DT\\times Dforecast horizonρ​\(Xt1−kd′→Xτd\)\\rho\\\!\\left\(X^\{d^\{\\prime\}\}\_\{t\_\{1\}\-k\}\\to X^\{d\}\_\{\\tau\}\\right\)Figure S1:KARMA explanation for aT\>1T\>1forecast\.A black\-box predictorffmaps aW×DW\\times Dinput window \(window lengthWW,DDvariables\) to aT×DT\\times Dforecast horizon\. The explanation targets a user\-chosen subset of horizon stepsτ=\{t1,…,tq\}⊆\[T\]\\tau=\\\{t\_\{1\},\\dots,t\_\{q\}\\\}\\subseteq\[T\]; the remainingT−qT\-qsteps are marginalised out of the estimated horizon kernel\. The shaded input column is the lag\-kksource slice, and the highlighted row\-ddband overτ\\tauis the target\. The curved arrow denotes the interventional edgeρ​\(Xt1−kd′→Xτd\)\\rho\\\!\\left\(X^\{d^\{\\prime\}\}\_\{t\_\{1\}\-k\}\\to X^\{d\}\_\{\\tau\}\\right\)measured by KARMA: theπ^∗\\hat\{\\pi\}^\{\*\}\-weighted total\-variation shift in the targeted forecast𝒯^K∗f,d,τ\(⋅∣h\)\\hat\{\\mathcal\{T\}\}^\{f,d,\\tau\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)when thed′d^\{\\prime\}\-th input component at lagkkis intervened on\. Lagsk\>K∗k\>K^\{\*\}are certified negligible by the stopping rule\.ϕkd′=∑d=1Dρ​\(Xt−kd′→Xτd\)⋅𝟏​\{ρ​\(Xt−kd′→Xτd\)\>λ\},\\phi\_\{k\}^\{d^\{\\prime\}\}=\\sum\_\{d=1\}^\{D\}\\rho\(X\_\{t\-k\}^\{d^\{\\prime\}\}\\to X\_\{\\tau\}^\{d\}\)\\cdot\\mathbf\{1\}\\\{\\rho\(X\_\{t\-k\}^\{d^\{\\prime\}\}\\to X\_\{\\tau\}^\{d\}\)\>\\lambda\\\},\(S6\)and the variable importance sums over lags,Φd′=∑k=1K∗ϕkd′\\Phi^\{d^\{\\prime\}\}=\\sum\_\{k=1\}^\{K^\{\*\}\}\\phi\_\{k\}^\{d^\{\\prime\}\}, normalised as in theT=1T=1case\. The remaining levels follow as in theT=1T=1case\.

## S2Causal Sufficiency of Thresholding Constantλ\\lambda

LetK∗≥1K^\{\*\}\\geq 1ande=\(Xt−kd′→Xtd\)e=\(X\_\{t\-k\}^\{d^\{\\prime\}\}\\to X\_\{t\}^\{d\}\)denote a causal edge andρ^​\(ρ\)\\hat\{\\rho\}\(\\rho\), the quantity in Eq \(6\) w\.r\.t\.𝒯^K∗f,d​\(𝒯K∗f,d\)\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\)respectively\. In the following, the argument proceeds analogously for the joint kernel𝒯K∗f\{\\mathcal\{T\}\}^\{f\}\_\{K^\{\*\}\}as well\.

###### Proposition S2\.1\.

Let𝒯K∗f,d\(⋅∣h\)\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)be the true modelK−K\-lag surrogate kernel and𝒯^K∗f,d\(⋅∣h\)\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)it’s approximation via any strategy\. The constantλ\>0\\lambda\>0is chosen such that

𝔼\[∥𝒯^K∗f,d\(⋅∣h\)−𝒯K∗f,d\(⋅∣h\)∥T​V\]<λ4\.\\mathbb\{E\}\\\!\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\right\]<\\frac\{\\lambda\}\{4\}\.for anyh∈ℋK∗\+h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\. Further assumemaxh∈ℋK∗\+⁡ρfloor​\(h\)<λ/4\\max\_\{h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\}\\rho\_\{\\mathrm\{floor\}\}\(h\)<\\lambda/4, whereρfloor​\(h\)\\rho\_\{\\mathrm\{floor\}\}\(h\)is an approximate upper bound for𝔼\[∥𝒯^Kf,d\(⋅∣h\)−𝒯Kf,d\(⋅∣h\)∥T​V\]\\mathbb\{E\}\\\!\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\right\]\. Then the trimming decision at thresholdλ\\lambdain Level 1 is correct with probability at least1/21/2for each edge, and the probability of a incorrect trimming decision is bounded by:

ℙ​\(\|ρ^​\(e\)−ρ​\(e\)\|≥λ/2\)≤4​maxh⁡ρfloor​\(h\)λ<1\.\\mathbb\{P\}\\left\(\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\geq\\lambda/2\\right\)\\leq\\frac\{4\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)\}\{\\lambda\}<1\.\(S7\)

###### Proof\.

By definition ofρ^\\hat\{\\rho\}andρ\\rhofrom Eq\. \(6\), using\|\|a\|−\|b\|\|≤\|a−b\|\\big\|\|a\|\-\|b\|\\big\|\\leq\|a\-b\|and then the triangle inequality to the inner difference:

\|ρ^​\(e\)−ρ​\(e\)\|\\displaystyle\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|≤∑hπ^∗\(h\)⋅12∑sd\[\|𝒯^K∗f,d\(sd\|h\)−𝒯K∗f,d\(sd\|h\)\|\\displaystyle\\leq\\sum\_\{h\}\\hat\{\\pi\}^\{\*\}\(h\)\\cdot\\frac\{1\}\{2\}\\sum\_\{s^\{d\}\}\\left\[\\left\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\|h\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\|h\)\\right\|\\right\.\+1N∑x\|𝒯^K∗f,d\(sd\|hkd′←x\)−𝒯K∗f,d\(sd\|hkd′←x\)\|\]\.\\displaystyle\\quad\\left\.\+\\frac\{1\}\{N\}\\sum\_\{x\}\\left\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\|h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\|h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\)\\right\|\\right\]\.\(S8\)
Recognizing the inner sums as total variation distances:

\|ρ^​\(e\)−ρ​\(e\)\|\\displaystyle\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|≤∑hπ^∗\(h\)\[\[∥𝒯^K∗f,d\(⋅\|h\)−𝒯K∗f,d\(⋅\|h\)∥T​V\]\+1N∑x\[∥𝒯^K∗f,d\(⋅\|hkd′←x\)−𝒯K∗f,d\(⋅\|hkd′←x\)∥T​V\]\]\.\\displaystyle\\leq\\sum\_\{h\}\\hat\{\\pi\}^\{\*\}\(h\)\\Bigg\[\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\|h\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\|h\)\\\|\_\{TV\}\\right\]\+\\frac\{1\}\{N\}\\sum\_\{x\}\\Big\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\|h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\|h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\)\\\|\_\{TV\}\\Big\]\\Bigg\]\.\(S9\)
Taking expectations on both sides and applying linearity, noting thatπ^∗​\(h\)\\hat\{\\pi\}^\{\*\}\(h\)is estimated independently of the kernel:

𝔼​\[\|ρ^​\(e\)−ρ​\(e\)\|\]\\displaystyle\\mathbb\{E\}\\left\[\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\right\]≤∑hπ^∗\(h\)\[𝔼\[∥𝒯^K∗f,d\(⋅\|h\)−𝒯K∗f,d\(⋅\|h\)∥T​V\]\\displaystyle\\leq\\sum\_\{h\}\\hat\{\\pi\}^\{\*\}\(h\)\\Bigg\[\\mathbb\{E\}\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\|h\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\|h\)\\\|\_\{TV\}\\right\]\+1N∑x𝔼\[∥𝒯^K∗f,d\(⋅\|hkd′←x\)−𝒯K∗f,d\(⋅\|hkd′←x\)∥T​V\]\]\.\\displaystyle\\quad\+\\frac\{1\}\{N\}\\sum\_\{x\}\\mathbb\{E\}\\Big\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\|h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\|h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\)\\\|\_\{TV\}\\Big\]\\Bigg\]\.\(S10\)
Since counterfactual historieshkd′←x∈ℋK∗\+h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}for allx∈\[N\]x\\in\[N\]\. Since∑hπ^∗​\(h\)=1\\sum\_\{h\}\\hat\{\\pi\}^\{\*\}\(h\)=1:

𝔼​\[\|ρ^​\(e\)−ρ​\(e\)\|\]\\displaystyle\\mathbb\{E\}\\left\[\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\right\]≤∑hπ^∗​\(h\)⋅2⋅maxh′⁡ρfloor​\(h′\)\\displaystyle\\leq\\sum\_\{h\}\\hat\{\\pi\}^\{\*\}\(h\)\\cdot 2\\cdot\\max\_\{h^\{\\prime\}\}\\rho\_\{\\mathrm\{floor\}\}\(h^\{\\prime\}\)=2⋅maxh∈ℋK∗\+⁡ρfloor​\(h\)\.\\displaystyle=2\\cdot\\max\_\{h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\}\\rho\_\{\\mathrm\{floor\}\}\(h\)\.\(S11\)
Since\|ρ^​\(e\)−ρ​\(e\)\|≥0\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\geq 0, Markov’s inequality applies directly:

ℙ​\(\|ρ^​\(e\)−ρ​\(e\)\|≥t\)≤𝔼​\[\|ρ^​\(e\)−ρ​\(e\)\|\]t≤2⋅maxh⁡ρfloor​\(h\)t\.\\mathbb\{P\}\\left\(\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\geq t\\right\)\\leq\\frac\{\\mathbb\{E\}\\left\[\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\right\]\}\{t\}\\leq\\frac\{2\\cdot\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)\}\{t\}\.\(S12\)
Settingt=λ/2t=\\lambda/2:

ℙ​\(\|ρ^​\(e\)−ρ​\(e\)\|≥λ/2\)≤4⋅maxh⁡ρfloor​\(h\)λ\.\\mathbb\{P\}\\left\(\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\geq\\lambda/2\\right\)\\leq\\frac\{4\\cdot\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)\}\{\\lambda\}\.\(S13\)
We now verify that the trimming decision at thresholdλ\\lambdais correct with high probability under the conditionmaxh⁡ρfloor​\(h\)<λ/4\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)<\\lambda/4\. Consider two cases:

- •False negative: A true edge withρ​\(e\)≥λ\\rho\(e\)\\geq\\lambdais missed becauseρ^​\(e\)<λ\\hat\{\\rho\}\(e\)<\\lambda\. This requires\|ρ^​\(e\)−ρ​\(e\)\|≥λ/2\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\geq\\lambda/2, since if the true value is at leastλ\\lambdaaway fromλ/2\\lambda/2, an error of at mostλ/2\\lambda/2cannot push the estimate belowλ\\lambda\. By Eq\. \([S13](https://arxiv.org/html/2606.27599#S2.E13)\), this occurs with probability at most4​maxh⁡ρfloor​\(h\)λ<1\\frac\{4\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)\}\{\\lambda\}<1\.
- •False positive: A spurious non\-edge withρ​\(e\)=0\\rho\(e\)=0is included becauseρ^​\(e\)≥λ\\hat\{\\rho\}\(e\)\\geq\\lambda\. This requires\|ρ^​\(e\)−ρ​\(e\)\|≥λ\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\geq\\lambda, a stricter condition\. By Markov’s inequality witht=λt=\\lambda: ℙ​\(\|ρ^​\(e\)−ρ​\(e\)\|≥λ\)≤2⋅maxh⁡ρfloor​\(h\)λ<12,\\mathbb\{P\}\\left\(\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\geq\\lambda\\right\)\\leq\\frac\{2\\cdot\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)\}\{\\lambda\}<\\frac\{1\}\{2\},\(S14\)under the conditionmaxh⁡ρfloor​\(h\)<λ/4\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)<\\lambda/4\.

Therefore, undermaxh⁡ρfloor​\(h\)<λ/4\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)<\\lambda/4, both false negatives and false positives occur with probability strictly less than11, and the probability of any incorrect trimming decision across all edges is bounded by Eq\. \([S7](https://arxiv.org/html/2606.27599#S2.E7)\)\. Substituting the conditionmaxh⁡ρfloor​\(h\)<λ/4\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)<\\lambda/4into Eq\. \([S13](https://arxiv.org/html/2606.27599#S2.E13)\) gives the safety guarantee:

ℙ​\(\|ρ^​\(e\)−ρ​\(e\)\|≥λ/2\)<1,\\mathbb\{P\}\\left\(\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\geq\\lambda/2\\right\)<1,\(S15\)confirmingλ\\lambdaas a valid trimming threshold provided the noise floor conditionmaxh⁡ρfloor​\(h\)<λ/4\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)<\\lambda/4is satisfied\. ∎

### S2\.1KARMA Explanation Reliability Index

We derive the probabilistic foundation of K\.E\.R\.I from the Markov bound established in Proposition 1\. Recall that for any causal edgee=\(d′,k,d\)e=\(d^\{\\prime\},k,d\):

ℙ​\(\|ρ^​\(e\)−ρ​\(e\)\|≥λ2\)≤4⋅maxh⁡ρfloor​\(h\)λ,\\mathbb\{P\}\\left\(\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\geq\\frac\{\\lambda\}\{2\}\\right\)\\leq\\frac\{4\\cdot\\max\_\{h\}\\rho\_\{\\mathrm\{floor\}\}\(h\)\}\{\\lambda\},\(S16\)which is derived by applying Markov’s inequality to the corrected bound:

𝔼​\[\|ρ^​\(e\)−ρ​\(e\)\|\]≤2⋅maxh∈HK∗\+⁡ρfloor​\(h\)\.\\mathbb\{E\}\\left\[\|\\hat\{\\rho\}\(e\)\-\\rho\(e\)\|\\right\]\\leq 2\\cdot\\max\_\{h\\in H^\{\+\}\_\{K^\{\*\}\}\}\\rho\_\{\\mathrm\{floor\}\}\(h\)\.\(S17\)For the bound in Eq\. \([S16](https://arxiv.org/html/2606.27599#S2.E16)\) to be non\-vacuous, i\.e\. strictly less than 1, we require:

maxh∈HK∗\+⁡ρfloor​\(h\)<λ4\.\\max\_\{h\\in H^\{\+\}\_\{K^\{\*\}\}\}\\rho\_\{\\mathrm\{floor\}\}\(h\)<\\frac\{\\lambda\}\{4\}\.\(S18\)
Sinceρfloor​\(h\)\\rho\_\{\\mathrm\{floor\}\}\(h\)is itself an upper bound on the true per\-history estimation error, the condition in Eq\. \([S18](https://arxiv.org/html/2606.27599#S2.E18)\) may be overly conservative: the true noise floor is likely smaller thanρfloor​\(h\)\\rho\_\{\\mathrm\{floor\}\}\(h\)by some factorκ≥1\\kappa\\geq 1\. To account for this inflation, we relax the non\-vacuity condition by replacingλ/4\\lambda/4withκ⋅λ/4\\kappa\\cdot\\lambda/4:

maxh∈HK∗\+⁡ρfloor​\(h\)<κ⋅λ4,\\max\_\{h\\in H^\{\+\}\_\{K^\{\*\}\}\}\\rho\_\{\\mathrm\{floor\}\}\(h\)<\\kappa\\cdot\\frac\{\\lambda\}\{4\},\(S19\)and redefine the per\-history miscoverage probability bound accordingly:

δκ​\(h\):=min⁡\(4​ρfloor​\(h\)κ⋅λ,1\),\\delta\_\{\\kappa\}\(h\):=\\min\\left\(\\frac\{4\\,\\rho\_\{\\mathrm\{floor\}\}\(h\)\}\{\\kappa\\cdot\\lambda\},\\;1\\right\),\(S20\)so thatδκ​\(h\)∈\[0,1\]\\delta\_\{\\kappa\}\(h\)\\in\[0,1\]is the worst\-case probability that the trimming decision at historyhhis incorrect under the relaxed threshold\. The conservatism constantκ\>1\\kappa\>1reflects the practitioner’s belief thatρfloor​\(h\)\\rho\_\{\\mathrm\{floor\}\}\(h\)overestimates the true noise floor;κ=1\\kappa=1recovers the unrelaxed bound, andκ=2\\kappa=2is the recommended default\. K\.E\.R\.I aggregates the complementary per\-history reliability1−δκ​\(h\)1\-\\delta\_\{\\kappa\}\(h\)over the stationary distribution:

K\.E\.R\.I\\displaystyle\\mathrm\{K\.E\.R\.I\}=∑π^∗​\(h\)\>0π^∗​\(h\)⋅\(1−δκ​\(h\)\)\\displaystyle=\\sum\_\{\\hat\{\\pi\}^\{\*\}\(h\)\>0\}\\hat\{\\pi\}^\{\*\}\(h\)\\cdot\\bigl\(1\-\\delta\_\{\\kappa\}\(h\)\\bigr\)=∑π^∗​\(h\)\>0π^∗​\(h\)⋅\(1−4​ρfloor​\(h\)κ⋅λ\)\+,\\displaystyle=\\sum\_\{\\hat\{\\pi\}^\{\*\}\(h\)\>0\}\\hat\{\\pi\}^\{\*\}\(h\)\\cdot\\left\(1\-\\frac\{4\\,\\rho\_\{\\mathrm\{floor\}\}\(h\)\}\{\\kappa\\cdot\\lambda\}\\right\)\_\{\\\!\+\},\(S21\)which is theπ^∗\\hat\{\\pi\}^\{\*\}\-weighted expected reliability of theλ\\lambda\-trimming decisions across the full history space\. A value ofK\.E\.R\.I\\mathrm\{K\.E\.R\.I\}close to11indicates that the non\-vacuity condition \([S19](https://arxiv.org/html/2606.27599#S2.E19)\) is comfortably satisfied at most histories; a low value suggests increasingMM\(more Monte Carlo draws\) orλ\\lambdauntil the condition is met\. Note thatK\.E\.R\.I\\mathrm\{K\.E\.R\.I\}is a lower bound on reliability for any fixedκ\\kappa: increasingκ\\kapparelaxes the threshold and raises the reported score, so the practitioner should chooseκ\\kappaconservatively\.

##### Interpretation\.

K\.E\.R\.I=1=1if and only ifρfloor​\(h\)=0\\rho\_\{\\mathrm\{floor\}\}\(h\)=0for allh∈ℋK∗\+h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}, an ideal case requiring perfect kernel estimation\. In practice, K\.E\.R\.I∈\[0,1\)\\in\[0,1\), and values close to 1 indicate that trimming decisions are well\-supported across the stationary distribution\. Sinceρfloor​\(h\)\\rho\_\{\\mathrm\{floor\}\}\(h\)is itself an upper bound, K\.E\.R\.I is conservative: the true expected reliability is at least as large as the reported value\. We therefore recommend interpreting K\.E\.R\.I as a*lower bound on reliability*\. A low K\.E\.R\.I does not invalidate the certified\-zero attributions for lagsk\>K∗k\>K^\{\*\}, which rest solely onΔ^pred<ε\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}<\\varepsilonand are unaffected by kernel estimation quality\.

## S3Step 4: \(Marginal\) Transition Kernel Estimation

We expand in detail the strategies for kernel estimation introduced in the manuscript\.

### S3\.1Strategy A: Empirical Counting \(Dirichlet Estimator\)

The empirical estimator accumulates model outputs directly from held\-out data\. For each queryiion windowh~i\\tilde\{h\}\_\{i\}, record theK∗K^\{\*\}\-length discretised suffixhih\_\{i\}and the discretised outputs\(i\)=ψ​\(f​\(h~i\)\)s^\{\(i\)\}=\\psi\(f\(\\tilde\{h\}\_\{i\}\)\)\. Define the counts

nf​\(h,s\)=\|\{i∈\[N\]:hi=h,ψ​\(f​\(h~i\)\)=s\}\|,n^\{f\}\(h,s\)=\\bigl\|\\bigl\\\{i\\in\[N\]:h\_\{i\}=h,\\;\\psi\(f\(\\tilde\{h\}\_\{i\}\)\)=s\\bigr\\\}\\bigr\|,
nf​\(h\)=∑snf​\(h,s\)\.n^\{f\}\(h\)=\\sum\_\{s\}n^\{f\}\(h,s\)\.
The Dirichlet posterior mean with Jeffreys priorα≥0\\alpha\\geq 0gives:

𝒯^K∗f,\(A\)​\(s∣h\)=nf​\(h,s\)\+αnf​\(h\)\+ND​α\.\\hat\{\\mathcal\{T\}\}^\{f,\(\\mathrm\{A\}\)\}\_\{K^\{\*\}\}\(s\\mid h\)=\\frac\{n^\{f\}\(h,s\)\+\\alpha\}\{n^\{f\}\(h\)\+N^\{D\}\\alpha\}\.\(S22\)
A sufficient minimum number of observations needed to estimate𝒯Kf,\(A\)\(⋅\|h\)\\mathcal\{T\}^\{f,\(\\mathrm\{A\}\)\}\_\{K\}\(\\cdot\|h\)up to maximum expected total variation errorλ4\>0\\frac\{\\lambda\}\{4\}\>0for anyh∈ℋK\+h\\in\\mathcal\{H\}^\{\+\}\_\{K\}isTmin\(A\)=W\+ND\(λ4\)2⋅ND​K∗T^\{\(\\mathrm\{A\}\)\}\_\{\\min\}=W\+\\frac\{N^\{D\}\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}\\cdot N^\{DK^\{\*\}\}\(Theorem[S3\.1](https://arxiv.org/html/2606.27599#S3.Thmtheorem1a)\) and grows exponentially inD​K∗\+DDK^\{\*\}\+Dthe number of histories blows up faster than the per\-history count requirement, infeasible at standard financial or clinical series lengths\.

To remain tractable,KARMAestimatesDDfactored marginal kernelsinstead of the full joint\. For target variabled∈\[D\]d\\in\[D\], historyh∈ℋK\+h\\in\\mathcal\{H\}^\{\+\}\_\{K\}, and next binsd∈\[N\]s^\{d\}\\in\[N\]:

𝒯Kf,d,\(A\)​\(sd∣h\)=ℙ​\(ψd​\(f​\(h~\)\)=sd\|suffixK​\(ψ​\(h~\)\)=h\),\\mathcal\{T\}^\{f,d,\(\\mathrm\{A\}\)\}\_\{K\}\(s^\{d\}\\mid h\)=\\mathbb\{P\}\\\!\\left\(\\psi^\{d\}\\\!\\bigl\(f\(\\tilde\{h\}\)\\bigr\)=s^\{d\}\\;\\Big\|\\;\\mathrm\{suffix\}\_\{K\}\(\\psi\(\\tilde\{h\}\)\)=h\\right\),\(S23\)the marginal of𝒯Kf,\(A\)\\mathcal\{T\}^\{f,\(\\mathrm\{A\}\)\}\_\{K\}over thedd\-th output coordinate\. Each model query\(h,s\)\(h,s\)contributes to allDDmarginals simultaneously by reading offsds^\{d\}for eachdd, yielding a reduction in the data requirement per history\.

Define the marginal counts:

nf,d​\(h,sd\)=\|\{i∈\[N\]:hi=h,ψd​\(f​\(h~i\)\)=sd\}\|,n^\{f,d\}\(h,s^\{d\}\)=\\bigl\|\\bigl\\\{i\\in\[N\]:h\_\{i\}=h,\\;\\psi^\{d\}\(f\(\\tilde\{h\}\_\{i\}\)\)=s^\{d\}\\bigr\\\}\\bigr\|,nf​\(h\)=∑sdnf,d​\(h,sd\)\.n^\{f\}\(h\)=\\sum\_\{s^\{d\}\}n^\{f,d\}\(h,s^\{d\}\)\.The Dirichlet posterior mean with Jeffreys priorα≥0\\alpha\\geq 0becomes:

𝒯^K∗f,d,\(A\)​\(sd∣h\)≈nf,d​\(h,sd\)\+αnf​\(h\)\+N​α\.\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{A\}\)\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\)\\approx\\frac\{n^\{f,d\}\(h,s^\{d\}\)\+\\alpha\}\{n^\{f\}\(h\)\+N\\alpha\}\.\(S24\)Each query contributes to allDDmarginals simultaneously by reading offsd\(i\)s^\{\(i\)\}\_\{d\}for eachdd, giving a factorND−1N^\{D\-1\}count improvement over the joint kernel\. This strategy with the marginal kernel reducesTmin\(A\)T^\{\(\\mathrm\{A\}\)\}\_\{\\min\}toW\+N\(λ4\)2⋅ND​K∗W\+\\frac\{N\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}\\cdot N^\{DK^\{\*\}\}\(Theorem[S3\.3](https://arxiv.org/html/2606.27599#S3.Thmtheorem3)\), better but still not an easily attainable budget for largeN,DN,Dand/orK∗K^\{\*\}\.

###### Theorem S3\.1\(Total Variation Error of Estimation of Joint Kernel\)\.

Let𝒯^Kf,\(A\)\(⋅∣h\)\\hat\{\\mathcal\{T\}\}^\{f,\(\\mathrm\{A\}\)\}\_\{K\}\(\\cdot\\mid h\)denote the Dirichlet posterior mean estimator \(Eq\.[S22](https://arxiv.org/html/2606.27599#S3.E22)\) with Jeffreys priorα=1/2\\alpha=1/2, based onnf​\(h\)n^\{f\}\(h\)observations\. The following inequality holds true

𝔼\[∥𝒯^Kf,\(A\)\(⋅∣h\)−𝒯Kf\(⋅∣h\)∥T​V\]≤ND2​nf​\(h\),\\mathbb\{E\}\\\!\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,\(\\mathrm\{A\}\)\}\_\{K\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f\}\_\{K\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\right\]\\;\\leq\\;\\sqrt\{\\frac\{N^\{D\}\}\{2\\,n^\{f\}\(h\)\}\},\(S25\)
We define the noise floor as

ρfloor\(A\)​\(h\):=ND2​nf​\(h\)\\rho^\{\(\\mathrm\{A\}\)\}\_\{\\text\{floor\}\}\(h\):=\\sqrt\{\\frac\{N^\{D\}\}\{2\\,n^\{f\}\(h\)\}\}
Setting this below\(λ4\)≥0\(\\frac\{\\lambda\}\{4\}\)\\geq 0:

ND2​nf​\(h\)<\(λ4\)⟹nf​\(h\)\>ND\(λ4\)2\\sqrt\{\\frac\{N^\{D\}\}\{2\\,n^\{f\}\(h\)\}\}<\(\\frac\{\\lambda\}\{4\}\)\\;\\;\\Longrightarrow\\;\\;n^\{f\}\(h\)\>\\frac\{N^\{D\}\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}and the minimum amount of observed history needed,

Tmin\(A\)=W\+ND\(λ4\)2⋅ND​K∗T^\{\(\\mathrm\{A\}\)\}\_\{\\min\}=W\+\\frac\{N^\{D\}\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}\\cdot N^\{DK^\{\*\}\}

Letn=nf​\(h\)n=n^\{f\}\(h\),ps=𝒯Kf,d\(⋅∣h\)p\_\{s\}=\\mathcal\{T\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)denote the true kernel, andp^s=\(ns\+α\)/\(n\+ND​α\)\\hat\{p\}\_\{s\}=\(n\_\{s\}\+\\alpha\)/\(n\+N^\{D\}\\alpha\)the Dirichlet posterior mean with Jeffreys priorα≥0\\alpha\\geq 0, wherens=nf,d​\(h,s\)n\_\{s\}=n^\{f,d\}\(h,s\)and\(ns\)s∈\[N\]\(n\_\{s\}\)\_\{s\\in\[N\]\}\.

###### Lemma S3\.2\(Pinsker’s Inequality\)\.

LetPPandQQbe two probability distributions on the same measurable space\. Then Pinsker’s inequality states:

‖P−Q‖T​V≤12​KL​\(P∥Q\)≤12​χ2​\(P∥Q\)\\\|P\-Q\\\|\_\{TV\}\\leq\\sqrt\{\\frac\{1\}\{2\}\\mathrm\{KL\}\(P\\\|Q\)\}\\leq\\sqrt\{\\frac\{1\}\{2\}\\chi^\{2\}\(P\\\|Q\)\}
where

KL​\(P∥Q\)\\displaystyle\\mathrm\{KL\}\(P\\\|Q\):=∑xP​\(x\)​log⁡P​\(x\)Q​\(x\)\\displaystyle:=\\sum\_\{x\}P\(x\)\\log\\frac\{P\(x\)\}\{Q\(x\)\}χ2​\(P∥Q\)\\displaystyle\\chi^\{2\}\(P\\\|Q\):=∑x\(P​\(x\)−Q​\(x\)\)2Q​\(x\)\\displaystyle:=\\sum\_\{x\}\\frac\{\(P\(x\)\-Q\(x\)\)^\{2\}\}\{Q\(x\)\}
are the Kullback–Leibler divergence and chi\-squared divergence, respectively\.

###### Proof of Theorem[S3\.1](https://arxiv.org/html/2606.27599#S3.Thmtheorem1a)\.

Decompose the estimation error:

p^s−ps=\(ns−n​ps\)\+α​\(1−ND​ps\)n\+ND​α\.\\hat\{p\}\_\{s\}\-p\_\{s\}=\\frac\{\(n\_\{s\}\-np\_\{s\}\)\+\\alpha\(1\-N^\{D\}p\_\{s\}\)\}\{n\+N^\{D\}\\alpha\}\.Since𝔼​\[ns\]=n​ps\\mathbb\{E\}\[n\_\{s\}\]=np\_\{s\}, the cross term vanishes and:

𝔼​\[\(p^s−ps\)2\]\\displaystyle\\mathbb\{E\}\\bigl\[\(\\hat\{p\}\_\{s\}\-p\_\{s\}\)^\{2\}\\bigr\]=Var​\(ns\)\+α2​\(1−ND​ps\)2\(n\+ND​α\)2\\displaystyle=\\frac\{\\mathrm\{Var\}\(n\_\{s\}\)\+\\alpha^\{2\}\(1\-N^\{D\}p\_\{s\}\)^\{2\}\}\{\(n\+N^\{D\}\\alpha\)^\{2\}\}=n​ps​\(1−ps\)\+α2​\(1−ND​ps\)2\(n\+ND​α\)2,\\displaystyle=\\frac\{np\_\{s\}\(1\-p\_\{s\}\)\+\\alpha^\{2\}\(1\-N^\{D\}p\_\{s\}\)^\{2\}\}\{\(n\+N^\{D\}\\alpha\)^\{2\}\},usingVar​\(ns\)=n​ps​\(1−ps\)\\mathrm\{Var\}\(n\_\{s\}\)=np\_\{s\}\(1\-p\_\{s\}\)for Binomial\(n,ps\)\(n,p\_\{s\}\)\.

Summing overssand dividing bypsp\_\{s\}:

𝔼​\[χ2​\(p^∥p\)\]\\displaystyle\\mathbb\{E\}\\bigl\[\\chi^\{2\}\(\\hat\{p\}\\\|p\)\\bigr\]=1\(n\+ND​α\)2×∑s=0N−1n​ps​\(1−ps\)\+α2​\(1−ND​ps\)2ps\\displaystyle=\\frac\{1\}\{\(n\+N^\{D\}\\alpha\)^\{2\}\}\\times\\sum\_\{s=0\}^\{N\-1\}\\frac\{np\_\{s\}\(1\-p\_\{s\}\)\+\\alpha^\{2\}\(1\-N^\{D\}p\_\{s\}\)^\{2\}\}\{p\_\{s\}\}=n\(n\+ND​α\)2​∑s\(1−ps\)\+α2\(n\+ND​α\)2​∑s\(1−ND​ps\)2ps\\displaystyle=\\frac\{n\}\{\(n\+N^\{D\}\\alpha\)^\{2\}\}\\sum\_\{s\}\(1\-p\_\{s\}\)\+\\frac\{\\alpha^\{2\}\}\{\(n\+N^\{D\}\\alpha\)^\{2\}\}\\sum\_\{s\}\\frac\{\(1\-N^\{D\}p\_\{s\}\)^\{2\}\}\{p\_\{s\}\}\(S26\)=1\(n\+ND​α\)2​\[n​\(ND−1\)\+α2​∑s\(1−ND​ps\)2ps\]\.\\displaystyle=\\frac\{1\}\{\(n\+N^\{D\}\\alpha\)^\{2\}\}\\left\[n\(N^\{D\}\-1\)\+\\alpha^\{2\}\\sum\_\{s\}\\frac\{\(1\-N^\{D\}p\_\{s\}\)^\{2\}\}\{p\_\{s\}\}\\right\]\.\(S27\)For the bias term, expand\(1−ND​ps\)2/ps=1/ps−2​ND\+N2​D​ps\(1\-N^\{D\}p\_\{s\}\)^\{2\}/p\_\{s\}=1/p\_\{s\}\-2N^\{D\}\+N^\{2D\}p\_\{s\}and use∑sps=1\\sum\_\{s\}p\_\{s\}=1andps≤1p\_\{s\}\\leq 1:

∑s\(1−ND​ps\)2ps≤∑s1ps−2​ND\+1\+N2​D≤∑s1ps\+N2​D\.\\sum\_\{s\}\\frac\{\(1\-N^\{D\}p\_\{s\}\)^\{2\}\}\{p\_\{s\}\}\\leq\\sum\_\{s\}\\frac\{1\}\{p\_\{s\}\}\-2N^\{D\+1\}\+N^\{2D\}\\leq\\sum\_\{s\}\\frac\{1\}\{p\_\{s\}\}\+N^\{2D\}\.
Retaining only the leading term in \([S27](https://arxiv.org/html/2606.27599#S3.E27)\):

𝔼​\[χ2​\(p^∥p\)\]\\displaystyle\\mathbb\{E\}\\bigl\[\\chi^\{2\}\(\\hat\{p\}\\\|p\)\\bigr\]≤n​\(ND−1\)\+α2​Cp\(n\+ND​α\)2\\displaystyle\\;\\leq\\;\\frac\{n\(N^\{D\}\-1\)\+\\alpha^\{2\}C\_\{p\}\}\{\(n\+N^\{D\}\\alpha\)^\{2\}\}≤NDn⋅1\+α2​Cp/\(n​ND\)\(1\+ND​α/n\)2,\\displaystyle\\;\\leq\\;\\frac\{N^\{D\}\}\{n\}\\cdot\\frac\{1\+\\alpha^\{2\}C\_\{p\}/\(nN^\{D\}\)\}\{\(1\+N^\{D\}\\alpha/n\)^\{2\}\},\(S28\)whereCp=∑s\(1\+ND​ps\)2/ps≥0C\_\{p\}=\\sum\_\{s\}\(1\+N^\{D\}p\_\{s\}\)^\{2\}/p\_\{s\}\\geq 0\. Fornnsufficiently large \(specificallyn≥ND​αn\\geq N^\{D\}\\alphaandn≥α2​Cp/NDn\\geq\\alpha^\{2\}C\_\{p\}/N^\{D\}or settingα→0\\alpha\\to 0\):

𝔼​\[KL⁡\(p^∥p\)\]≤𝔼​\[χ2​\(p^∥p\)\]≤NDnf​\(h\),\\mathbb\{E\}\\bigl\[\\operatorname\{KL\}\(\\hat\{p\}\\\|p\)\\bigr\]\\;\\leq\\;\\mathbb\{E\}\\bigl\[\\chi^\{2\}\(\\hat\{p\}\\\|p\)\\bigr\]\\;\\leq\\;\\frac\{N^\{D\}\}\{n^\{f\}\(h\)\},\(S29\)The KL divergence from the posterior mean to the true kernel satisfies:

KL\(𝒯^Kf,\(A\)\(⋅∣h\)∥𝒯Kf\(⋅∣h\)\)≤NDnf​\(h\),\\mathrm\{KL\}\\\!\\left\(\\hat\{\\mathcal\{T\}\}^\{f,\(\\mathrm\{A\}\)\}\_\{K\}\(\\cdot\\mid h\)\\,\\Big\\\|\\,\\mathcal\{T\}^\{f\}\_\{K\}\(\\cdot\\mid h\)\\right\)\\;\\leq\\;\\frac\{N^\{D\}\}\{n^\{f\}\(h\)\},\(S30\)fornf​\(h\)n^\{f\}\(h\)sufficiently large\. Applying Pinsker’s Inequality to \([S30](https://arxiv.org/html/2606.27599#S3.E30)\) yields:

𝔼\[∥𝒯^Kf,\(A\)\(⋅∣h\)−𝒯Kf\(⋅∣h\)∥T​V\]≤ND2​nf​\(h\)\\mathbb\{E\}\\\!\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,\(\\mathrm\{A\}\)\}\_\{K\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f\}\_\{K\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\right\]\\;\\leq\\;\\sqrt\{\\frac\{N^\{D\}\}\{2\\,n^\{f\}\(h\)\}\}\(S31\)
If we define thenoise flooras

ρfloor\(A\)​\(h\):=ND2​nf​\(h\)\\rho^\{\(\\mathrm\{A\}\)\}\_\{\\text\{floor\}\}\(h\):=\\sqrt\{\\frac\{N^\{D\}\}\{2\\,n^\{f\}\(h\)\}\}
Setting this below\(λ4\)≥0\(\\frac\{\\lambda\}\{4\}\)\\geq 0:

ND2​nf​\(h\)<\(λ4\)⟹nf​\(h\)\>ND\(λ4\)2\\sqrt\{\\frac\{N^\{D\}\}\{2\\,n^\{f\}\(h\)\}\}<\(\\frac\{\\lambda\}\{4\}\)\\;\\;\\Longrightarrow\\;\\;n^\{f\}\(h\)\>\\frac\{N^\{D\}\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}
Meaning every historyh∈ℋK∗\+h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}needs at leastND\(λ4\)2\\frac\{N^\{D\}\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}model queries\. The total number of histories in the worst case is:

\|ℋK∗\|=ND​K∗\\left\|\\mathcal\{H\}\_\{K^\{\*\}\}\\right\|=N^\{DK^\{\*\}\}
LetTTbe the number of observations used in the training set \(or whatever held\-out set is used\)\. Since each window contributes to exactly one history’s count, the total number of windows needed is:

T−W≥ND\(λ4\)2⋅ND​K∗T\-W\\geq\\frac\{N^\{D\}\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}\\cdot N^\{DK^\{\*\}\}
giving:

Tmin\(A\)≈W\+ND\(λ4\)2⋅ND​K∗,nf​\(h\)\>ND\(λ4\)2T^\{\(\\mathrm\{A\}\)\}\_\{\\min\}\\approx W\+\\frac\{N^\{D\}\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}\\cdot N^\{DK^\{\*\}\},\\quad n^\{f\}\(h\)\>\\frac\{N^\{D\}\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}
the sufficient minimum amount of history needed\. ∎

###### Theorem S3\.3\.

Let𝒯^Kf,d\(⋅∣h\)\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)denote the Dirichlet posterior mean estimator \(Eq\.[S24](https://arxiv.org/html/2606.27599#S3.E24)\) with Jeffreys priorα=1/2\\alpha=1/2, based onnf​\(h\)n^\{f\}\(h\)observations\.

𝔼\[∥𝒯^Kf,d\(⋅∣h\)−𝒯Kf,d\(⋅∣h\)∥T​V\]≤N2​nf​\(h\),\\mathbb\{E\}\\\!\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\right\]\\;\\leq\\;\\sqrt\{\\frac\{N\}\{2\\,n^\{f\}\(h\)\}\},\(S32\)which is the noise floor of Strategy A \(Eq\.[S33](https://arxiv.org/html/2606.27599#S3.E33)\)\. By Pinsker’s inequality applied to the Dirichlet posterior:

ρfloor\(A\)​\(h\):=N2​nf​\(h\),\\rho^\{\(\\mathrm\{A\}\)\}\_\{\\mathrm\{floor\}\}\(h\):=\\sqrt\{\\frac\{N\}\{2\\,n^\{f\}\(h\)\}\},\(S33\)which diverges asnf​\(h\)→0n^\{f\}\(h\)\\to 0\. And the sufficient minimum series length if we requireρfloor\(A\)​\(h\)<\(λ4\)\\rho^\{\(\\mathrm\{A\}\)\}\_\{\\mathrm\{floor\}\}\(h\)<\(\\frac\{\\lambda\}\{4\}\),\(λ2\)≥0\(\\frac\{\\lambda\}\{2\}\)\\geq 0:

Tmin\(A\)≈W\+N\(λ4\)2⋅ND​K∗,nf​\(h\)\>N\(λ4\)2T^\{\(\\mathrm\{A\}\)\}\_\{\\min\}\\approx W\+\\frac\{N\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}\\cdot N^\{DK^\{\*\}\},\\quad n^\{f\}\(h\)\>\\frac\{N\}\{\(\\frac\{\\lambda\}\{4\}\)^\{2\}\}\(S34\)growing*exponentially*inD​K∗DK^\{\*\}\.

###### Proof of Theorem[S3\.3](https://arxiv.org/html/2606.27599#S3.Thmtheorem3)\.

Follows from the fact that:

KL\(𝒯^Kf,d,\(A\)\(⋅∣h\)∥𝒯Kf,d\(⋅∣h\)\)≤Nnf​\(h\),\\mathrm\{KL\}\\\!\\left\(\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{A\}\)\}\_\{K\}\(\\cdot\\mid h\)\\,\\Big\\\|\\,\\mathcal\{T\}^\{f,d\}\_\{K\}\(\\cdot\\mid h\)\\right\)\\;\\leq\\;\\frac\{N\}\{n^\{f\}\(h\)\},\(S35\)fornf​\(h\)n^\{f\}\(h\)sufficiently large\. Everything else follows as in the proof of Theorem[S3\.1](https://arxiv.org/html/2606.27599#S3.Thmtheorem1a)∎

### S3\.2Strategy B:b∗b^\{\*\}\-Prefix Monte Carlo \(default in practice\)

Even in the marginal case, the Dirichlet estimator still proves inconvenient as we require observation lengths of orderN\(D​K∗\+1\)N^\{\(DK^\{\*\}\+1\)\}\. So instead of observed history instances, we look at observed history distribution\. We define the following quantity,

###### Definition S3\.4\(Within\-bin conditional density\)\.

Letpstat:ℝD×K∗→ℝ≥0p\_\{\\mathrm\{stat\}\}:\\mathbb\{R\}^\{D\\times K^\{\*\}\}\\to\\mathbb\{R\}\_\{\\geq 0\}denote the stationary joint density of theK∗K^\{\*\}\-length suffix\(Xt−K∗\+1,…,Xt\)\(X\_\{t\-K^\{\*\}\+1\},\\ldots,X\_\{t\}\)under the data\-generating process\. Thewithin\-bin conditional densityat historyhhis:

p​\(x∣ψ​\(x\)=h\)=pstat​\(x\)​1​\[ψ​\(x\)=h\]π^∗​\(h\),x∈ℝD×K∗,p\(x\\mid\\psi\(x\)=h\)\\;=\\;\\frac\{p\_\{\\mathrm\{stat\}\}\(x\)\\;\\mathbf\{1\}\[\\psi\(x\)=h\]\}\{\\hat\{\\pi\}^\{\*\}\(h\)\},\\quad x\\in\\mathbb\{R\}^\{D\\times K^\{\*\}\},\(S36\)whereπ^∗​\(h\)=ℙ​\(ψ​\(Xt−K∗\+1:t\)=h\)\>0\\hat\{\\pi\}^\{\*\}\(h\)=\\mathbb\{P\}\(\\psi\(X\_\{t\-K^\{\*\}\+1:t\}\)=h\)\>0is the stationary probability of historyhh\. This density characterises the distribution of continuous states conditional on falling in the bins specified byhh\. Sampling fromp\(⋅∣ψ\(⋅\)=h\)p\(\\cdot\\mid\\psi\(\\cdot\)=h\)gives the authentic within\-bin continuous variation that both estimation strategies must approximate\.

###### Definition S3\.5\(Suffix pool\)\.

Forh∈ℋK∗\+h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}, thesuffix poolis:

pool​\(h\)=\{x~∈𝐗train:ψ​\(x~W−K∗\+1:W\)=h\},\\mathrm\{pool\}\(h\)=\\bigl\\\{\\tilde\{x\}\\in\\mathbf\{X\}\_\{\\mathrm\{train\}\}:\\psi\(\\tilde\{x\}\_\{W\-K^\{\*\}\+1:W\}\)=h\\bigr\\\},\(S37\)the set of allWW\-length training windows whose discretisedK∗K^\{\*\}\-suffix equalshh\. Sampling uniformly frompool​\(h\)\\mathrm\{pool\}\(h\)provides a non\-parametric Monte Carlo approximation to the within\-bin conditional density \(Definition[S3\.4](https://arxiv.org/html/2606.27599#S3.Thmtheorem4)\)\.

Strategy B proceeds as follows:

For eachh∈ℋK∗\+h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}andm=1,…,Mm=1,\\ldots,M:

1. 1\.Samplex~\(m\)∼Uniform​\(pool​\(h\)\)\\tilde\{x\}^\{\(m\)\}\\sim\\mathrm\{Uniform\}\(\\mathrm\{pool\}\(h\)\)\(Definition[S3\.5](https://arxiv.org/html/2606.27599#S3.Thmtheorem5)\)\. This approximates the within\-bin conditional density \(Definition[S3\.4](https://arxiv.org/html/2606.27599#S3.Thmtheorem4)\) non\-parametrically from training data\.
2. 2\.Replace the prefix withb∗b^\{\*\}:h~\(m\)=b∗⊕x~W−K∗\+1:W\(m\)\\tilde\{h\}^\{\(m\)\}=b^\{\*\}\\oplus\\tilde\{x\}^\{\(m\)\}\_\{W\-K^\{\*\}\+1:W\}, whereb∗∈ℝ\(W−K∗\)×Db^\{\*\}\\in\\mathbb\{R\}^\{\(W\-K^\{\*\}\)\\times D\}is the model\-certified baseline from Step 3\.
3. 3\.Query the model:sd\(m\)=ψd​\(f​\(h~\(m\)\)\)s^\{\(m\)\}\_\{d\}=\\psi^\{d\}\(f\(\\tilde\{h\}^\{\(m\)\}\)\)\.

The kernel estimate is:

𝒯^K∗f,d,\(B\)​\(sd∣h\)=∑m=1M𝟏​\[sd\(m\)=sd\]\+αM\+N​α,α=12\.\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{B\}\)\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\)=\\frac\{\\sum\_\{m=1\}^\{M\}\\mathbf\{1\}\[s^\{\(m\)\}\_\{d\}=s^\{d\}\]\+\\alpha\}\{M\+N\\alpha\},\\quad\\alpha=\\tfrac\{1\}\{2\}\.\(S38\)
###### Theorem S3\.6\(Strategy B noise floor decomposition\)\.

Let𝒯^K∗f,d,\(B\)\(⋅∣h\)\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{B\}\)\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)be theb∗b^\{\*\}\-prefix Monte Carlo estimator \(Eq\.[S38](https://arxiv.org/html/2606.27599#S3.E38)\) withMMdraws and poolpool​\(h\)\\mathrm\{pool\}\(h\)of sizePh=\|pool​\(h\)\|P\_\{h\}=\|\\mathrm\{pool\}\(h\)\|\. Then:

𝔼\[∥𝒯^K∗f,d,\(B\)\(⋅∣h\)−𝒯K∗f,d\(⋅∣h\)∥T​V\]≤ρfloor\(B\)\(h\)\\mathbb\{E\}\\\!\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{B\}\)\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\right\]\\leq\\rho^\{\(\\mathrm\{B\}\)\}\_\{\\mathrm\{floor\}\}\(h\)\(S39\)where,

ρfloor\(B\)​\(h\)=π2​M⏟\(i\) MC variance\+Δ^pred⏟\(ii\) prefix error\+π2​Ph⏟\(iii\) within\-bin var\.\\rho^\{\(\\mathrm\{B\}\)\}\_\{\\mathrm\{floor\}\}\(h\)=\\underbrace\{\\sqrt\{\\frac\{\\pi\}\{2M\}\}\}\_\{\\text\{\(i\) MC variance\}\}\+\\;\\underbrace\{\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}\}\_\{\\text\{\(ii\) prefix error\}\}\+\\;\\underbrace\{\\sqrt\{\\frac\{\\pi\}\{2P\_\{h\}\}\}\}\_\{\\text\{\(iii\) within\-bin var\.\}\}And the minimum series length if we requireρfloor\(B\)​\(h\)<\(λ4\)\\rho^\{\(\\mathrm\{B\}\)\}\_\{\\mathrm\{floor\}\}\(h\)<\(\\frac\{\\lambda\}\{4\}\),λ4\>0\\frac\{\\lambda\}\{4\}\>0:

Tmin\(B\)≈W\+⌈2​πλ2⌉⋅\|ℋK∗\+\|,and,​Ph\>⌈2​πλ2⌉T^\{\(\\mathrm\{B\}\)\}\_\{\\min\}\\approx W\+\\left\\lceil\\frac\{2\\pi\}\{\\lambda^\{2\}\}\\right\\rceil\\cdot\\left\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\\right\|,\\quad\\text\{and,\}\\\>P\_\{h\}\>\\left\\lceil\\frac\{2\\pi\}\{\\lambda^\{2\}\}\\right\\rceil

###### Proof of Theorem[S3\.6](https://arxiv.org/html/2606.27599#S3.Thmtheorem6)\.

Assume the Jeffrey priorα=0\\alpha=0, the case whenα=1/2\\alpha=1/2follows via upper bound approximation\. Introduce two intermediate distributions\. Let𝒯K∗f,d,b∗\(⋅∣h\)\\mathcal\{T\}^\{f,d,b^\{\*\}\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)denote the kernel estimated using the true within\-bin density but with prefix replaced byb∗b^\{\*\}:𝒯K∗f,d,b∗​\(sd∣h\)\\mathcal\{T\}^\{f,d,b^\{\*\}\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\)given by

𝔼x∼p\(⋅∣ψ\(x\)=h\)​\[𝟏​\[ψd​\(f​\(b∗⊕xW−K∗\+1:W\)\)=sd\]\],\\mathbb\{E\}\_\{x\\sim p\(\\cdot\\mid\\psi\(x\)=h\)\}\\\!\\left\[\\mathbf\{1\}\\\!\\left\[\\psi^\{d\}\\\!\\bigl\(f\(b^\{\*\}\\oplus x\_\{W\-K^\{\*\}\+1:W\}\)\\bigr\)=s^\{d\}\\right\]\\right\],\(S40\)where the expectation is over the true within\-bin conditional density \(Definition[S3\.4](https://arxiv.org/html/2606.27599#S3.Thmtheorem4)\)\. Let𝒯K∗f,d,pool\(⋅∣h\)\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)denote the same quantity with the pool empirical distribution replacing the true within\-bin density:

𝒯K∗f,d,pool​\(sd∣h\)\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h\)given by

1Ph​∑x~∈pool​\(h\)𝟏​\[ψd​\(f​\(b∗⊕x~W−K∗\+1:W\)\)=sd\]\.\\frac\{1\}\{P\_\{h\}\}\\sum\_\{\\tilde\{x\}\\in\\mathrm\{pool\}\(h\)\}\\mathbf\{1\}\\\!\\left\[\\psi^\{d\}\\\!\\bigl\(f\(b^\{\*\}\\oplus\\tilde\{x\}\_\{W\-K^\{\*\}\+1:W\}\)\\bigr\)=s^\{d\}\\right\]\.\(S41\)The estimator𝒯^K∗f,d,\(B\)\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{B\}\)\}\_\{K^\{\*\}\}is a Monte Carlo approximation to \([S41](https://arxiv.org/html/2606.27599#S3.E41)\) withMMdraws\. By the triangle inequality:

‖𝒯^K∗f,d,\(B\)−𝒯K∗f,d‖T​V\\displaystyle\\\|\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{B\}\)\}\_\{K^\{\*\}\}\-\\;\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\\\|\_\{TV\}≤‖𝒯^K∗f,d,\(B\)−𝒯K∗f,d,pool‖T​V⏟\(i\)\\displaystyle\\leq\\underbrace\{\\\|\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{B\}\)\}\_\{K^\{\*\}\}\-\\;\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{K^\{\*\}\}\\\|\_\{TV\}\}\_\{\\text\{\(i\)\}\}\+‖𝒯K∗f,d,pool−𝒯K∗f,d,b∗‖T​V⏟\(ii\)\\displaystyle\+\\;\\underbrace\{\\\|\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{K^\{\*\}\}\-\\;\\mathcal\{T\}^\{f,d,b^\{\*\}\}\_\{K^\{\*\}\}\\\|\_\{TV\}\}\_\{\\text\{\(ii\)\}\}\+‖𝒯K∗f,d,b∗−𝒯K∗f,d‖T​V⏟\(iii\)\.\\displaystyle\+\\;\\underbrace\{\\\|\\mathcal\{T\}^\{f,d,b^\{\*\}\}\_\{K^\{\*\}\}\-\\;\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\\\|\_\{TV\}\}\_\{\\text\{\(iii\)\}\}\.
- •For the term \(i\):𝒯^K∗f,d,\(B\)\(⋅∣h\)\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{B\}\)\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)estimatesTK∗f,d,pool\(⋅∣h\)T^\{f,d,\\mathrm\{pool\}\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)by averagingMMi\.i\.d\. Bernoulli indicators in\[0,1\]\[0,1\]for eachsd∈\[N\]s^\{d\}\\in\[N\]\. By Hoeffding’s inequality, each marginal probability is estimated withinttwith probability at least1−2​e−2​M​t21\-2e^\{\-2Mt^\{2\}\}\. Taking expectation over theMMdraws: 𝔼​\[‖𝒯^K∗f,d,\(B\)−𝒯K∗f,d,pool‖T​V\]≤π2​M\.\\mathbb\{E\}\\\!\\left\[\\\|\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{B\}\)\}\_\{K^\{\*\}\}\-\\;\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{K^\{\*\}\}\\\|\_\{TV\}\\right\]\\;\\leq\\;\\sqrt\{\\frac\{\\pi\}\{2M\}\}\.\(S42\)
- •For the term \(iii\):𝒯K∗f,d,b∗\\mathcal\{T\}^\{f,d,b^\{\*\}\}\_\{K^\{\*\}\}uses the true within\-bin density but theb∗b^\{\*\}prefix, while𝒯K∗f,d\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}uses the true within\-bin density and the true prefix\. The Total Variation distance between them is bounded by the average prediction discrepancy when the prefix is replaced byb∗b^\{\*\}: ∥𝒯K∗f,d,b∗\(⋅∣h\)−𝒯K∗f,d\(⋅∣h\)∥T​V\\\|\\mathcal\{T\}^\{f,d,b^\{\*\}\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\\|\_\{TV\}≤𝔼x∼p\(⋅\|ψ\(x\)=h\)​\[ℓ​\(f​\(h~true\),f​\(b∗⊕xW−K∗\+1:W\)\)\]\\;\\leq\\;\\mathbb\{E\}\_\{x\\sim p\(\\cdot\|\\psi\(x\)=h\)\}\\\!\\left\[\\ell\\\!\\left\(f\(\\tilde\{h\}\_\{\\mathrm\{true\}\}\),\\;f\(b^\{\*\}\\oplus x\_\{W\-K^\{\*\}\+1:W\}\)\\right\)\\right\]≤Δ^pred,\\;\\leq\\;\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}, where the last inequality follows from the surrogate validity certificate \(Eq\.[2](https://arxiv.org/html/2606.27599#S3.E2)\), which bounds the average prediction discrepancy under prefix substitution byΔ^pred<ε\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}<\\varepsilon\.
- •For the term \(ii\):𝒯K∗f,d,pool\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{K^\{\*\}\}uses the empirical pool distribution as a surrogate for the true within\-bin density\. The pool is a random sample of sizePhP\_\{h\}from the within\-bin conditional distribution, so by the Dvoretzky–Kiefer–Wolfowitz \(DKW\) inequalityDvoretzkyet al\.\[[1956](https://arxiv.org/html/2606.27599#bib.bib21)\]applied to the empirical distribution: 𝔼\[∥𝒯K∗f,d,pool\(⋅∣h\)−𝒯K∗f,d,b∗\(⋅∣h\)∥T​V\]≤π2​Ph,\\mathbb\{E\}\\\!\\left\[\\\|\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\;\\mathcal\{T\}^\{f,d,b^\{\*\}\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\right\]\\;\\leq\\;\\sqrt\{\\frac\{\\pi\}\{2P\_\{h\}\}\},\(S43\)wherePh=\|pool​\(h\)\|P\_\{h\}=\|\\mathrm\{pool\}\(h\)\|\. Intuitively, the pool is an empirical approximation to the within\-bin density; the TV error of this approximation decreases asPh−1/2P\_\{h\}^\{\-1/2\}, the standard Monte Carlo rate for distribution estimation\.

The noise floorρfloor​\(h\)\\rho\_\{\\text\{floor\}\}\(h\)has three terms:

ρfloor​\(h\)=π2​M\+Δ^pred\+π2​Ph\\rho\_\{\\text\{floor\}\}\(h\)=\\sqrt\{\\frac\{\\pi\}\{2M\}\}\+\\hat\{\\Delta\}\_\{\\text\{pred\}\}\+\\sqrt\{\\frac\{\\pi\}\{2P\_\{h\}\}\}
The first two terms are controlled by design choicesMMandε\\varepsilonindependently ofTT\. The binding constraint fromTTcomes entirely from the pool size termπ2​Ph\\sqrt\{\\frac\{\\pi\}\{2P\_\{h\}\}\}\.

Subtracting the other two terms from the budgetλ4\>0\\frac\{\\lambda\}\{4\}\>0:

π2​Ph<λ4−π2​M−Δ^pred<λ4\\sqrt\{\\frac\{\\pi\}\{2P\_\{h\}\}\}<\\frac\{\\lambda\}\{4\}\-\\sqrt\{\\frac\{\\pi\}\{2M\}\}\-\\hat\{\\Delta\}\_\{\\text\{pred\}\}<\\frac\{\\lambda\}\{4\}
requires:

Ph\>⌈2​πλ2⌉=:npoolP\_\{h\}\>\\left\\lceil\\frac\{2\\pi\}\{\\lambda^\{2\}\}\\right\\rceil=:n\_\{\\text\{pool\}\}
where⌈a⌉\\lceil a\\rceilis the nearest biggest integer toa∈ℝa\\in\\mathbb\{R\}\. NowPh=\|pool​\(h\)\|P\_\{h\}=\|\\text\{pool\}\(h\)\|is the number of training windows whose discretisedK∗K^\{\*\}\-suffix equalshh\. Each of theT−WT\-Wavailable windows lands in exactly one pool\. For everyh∈ℋK∗\+h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}to accumulatenpooln\_\{\\text\{pool\}\}windows:

T−W≥npool⋅\|ℋK∗\+\|T\-W\\geq n\_\{\\text\{pool\}\}\\cdot\\left\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\\right\|
giving:

Tmin\(B\)≈W\+npool⋅\|ℋK∗\+\|T^\{\(\\mathrm\{B\}\)\}\_\{\\min\}\\approx W\+n\_\{\\text\{pool\}\}\\cdot\\left\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\\right\|
The key difference from Strategy A is that:

\|ℋK∗\+\|≤T−K∗,\\left\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\\right\|\\leq T\-K^\{\*\},
that is, the observed support is at most the number of distinct suffixes seen in the data, which grows linearly inTTrather than exponentially inD​K∗DK^\{\*\}\. This is what gives Strategy B its linear scaling\. ∎

### S3\.3Strategy C: Regression Tree on the Transition Tensor

Strategies A and B both estimate𝒯K∗f,d\(⋅∣h\)\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)independently for eachh∈ℋK∗\+h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}, treating every history as a separate estimation problem\. This nonparametric approach is interpretable and certified but data\-hungry: the pool requirement scales asnpool\(B\)⋅\|ℋK∗\+\|n\_\{\\mathrm\{pool\}\}^\{\(\\mathrm\{B\}\)\}\\cdot\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\|, wherenpool\(B\)n\_\{\\mathrm\{pool\}\}^\{\(\\mathrm\{B\}\)\}is the minimum pool size such thatρfloor\(B\)​\(h\)≤β\\rho^\{\(\\mathrm\{B\}\)\}\_\{\\mathrm\{floor\}\}\(h\)\\leq\\betafor allhh\. Strategy C replaces independent per\-history estimation with a*regression tree*that partitionsℋK∗\+\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}into leaves and pools counts within each leaf, sharing statistical strength across structurally similar histories while remaining fully interpretable\.

##### Covered and sparse sets\.

PartitionℋK∗\+\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}according to whether each history has accumulated sufficient model counts:

ℋcov=\{h∈ℋK∗\+:Ph≥npool\(B\)\},ℋsparse=ℋK∗\+∖ℋcov\.\\mathcal\{H\}^\{\\mathrm\{cov\}\}=\\bigl\\\{h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}:P\_\{h\}\\geq n^\{\(\\mathrm\{B\}\)\}\_\{\\mathrm\{pool\}\}\\bigr\\\},\\qquad\\mathcal\{H\}^\{\\mathrm\{sparse\}\}=\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\\setminus\\mathcal\{H\}^\{\\mathrm\{cov\}\}\.Histories inℋcov\\mathcal\{H\}^\{\\mathrm\{cov\}\}have enough data to estimate their individual transition distributions reliably at the noise floor; histories inℋsparse\\mathcal\{H\}^\{\\mathrm\{sparse\}\}do not\. The regression tree is grown onℋcov\\mathcal\{H\}^\{\\mathrm\{cov\}\}and extrapolates toℋsparse\\mathcal\{H\}^\{\\mathrm\{sparse\}\}via leaf assignment\.

##### What the tree is doing\.

The tree operates in three conceptually distinct steps\.

Step 1 \(direct estimation\)\.For everyh∈ℋcovh\\in\\mathcal\{H\}^\{\\mathrm\{cov\}\}, compute the individual kernel estimate𝒯^K∗f,d,\(C\)\(⋅∣h\)\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{C\}\)\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)directly from B\. The covered set exists precisely because these estimates are reliable:Ph≥npoolP\_\{h\}\\geq n\_\{\\mathrm\{pool\}\}ensures the noise floor at eachh∈ℋcovh\\in\\mathcal\{H\}^\{\\mathrm\{cov\}\}is already below the coverage threshold\.

Step 2 \(structured pooling\)\.Leth∈ℋcov⊆\[N\]D×Kh\\in\\mathcal\{H\}^\{\\mathrm\{cov\}\}\\subseteq\[N\]^\{D\\times K\}and target dimensiondd, the input is the discrete history vectorhhbe represented as\(hkd′\)d′∈\[D\],k∈\[K∗\]∈\[N\]D​K∗\(h^\{d^\{\\prime\}\}\_\{k\}\)\_\{d^\{\\prime\}\\in\[D\],\\,k\\in\[K^\{\*\}\]\}\\in\[N\]^\{DK^\{\*\}\},hkd′h^\{d^\{\\prime\}\}\_\{k\}denoting the bin of variabled′≤dd^\{\\prime\}\\leq dat lagkkwithinhh\. The tree partitionsℋcov\\mathcal\{H\}^\{\\mathrm\{cov\}\}intoLLleaves by recursively selecting binary splits of the form\{h:hkd′=n\}\\\{h:h^\{d^\{\\prime\}\}\_\{k\}=n\\\}vs\.\{h:hkd′≠n\}\\\{h:h^\{d^\{\\prime\}\}\_\{k\}\\neq n\\\}for any random\(d′,k,n\)∈\[D\]×\[K∗\]×\[N\]\(d^\{\\prime\},k,n\)\\in\[D\]\\times\[K^\{\*\}\]\\times\[N\]such that the split minimises the within\-leaf heterogeneity of the individual kernel estimates \(Eq\.[S44](https://arxiv.org/html/2606.27599#S3.E44)\)\. Within each leafℓ\\ell, model counts are aggregated across all member histories to form apooled leaf kernel𝒯¯ℓd\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\}\(Eq\.[S45](https://arxiv.org/html/2606.27599#S3.E45)\), which replaces the individual estimates for every covered history in that leaf\. This pooling is structured rather than arbitrary: the splits are chosen so that histories assigned to the same leaf have similar individual kernels, making the pooled kernel a good representative for all of them\. The cost of pooling is the within\-leaf biasδℓ\\delta\_\{\\ell\}, the maximum Total Variation distance between any individual kernel in the leaf and the pooled kernel, which the KL The splitting criterion is directly minimized\.

Step 3 \(extrapolation by structural similarity\)\.Forh∈ℋsparseh\\in\\mathcal\{H\}^\{\\mathrm\{sparse\}\}, there are insufficient model counts to form a reliable individual estimate\. The tree routeshhto a leafℓ​\(h\)\\ell\(h\)by evaluating the same binary split conditions top\-down from the root: at each internal node, check whetherhk∗d∗=n∗h^\{d^\{\*\}\}\_\{k^\{\*\}\}=n^\{\*\},d∗≤dd^\{\*\}\\leq dand go left if true, right otherwise\. The sparse history then inherits the leaf’s pooled kernel\. The implicit claim is:hhshares the same bin values as the covered histories inℓ​\(h\)\\ell\(h\)on the variables and lags that the tree has identified as predictively important, so their transition distributions should be similar\. This is extrapolation by structural similarity, not smoothing, not interpolation, but stratification: histories that look alike on the splits that matter are assigned the same transition distribution estimate\.

##### Input and response\.

For eachh∈ℋcov⊆\[N\]D×Kh\\in\\mathcal\{H\}^\{\\mathrm\{cov\}\}\\subseteq\[N\]^\{D\\times K\}and target dimensiondd, the input is the discrete history vectorxh=\(hkd′\)d′∈\[D\],k∈\[K∗\]∈\[N\]D​K∗x\_\{h\}=\(h^\{d^\{\\prime\}\}\_\{k\}\)\_\{d^\{\\prime\}\\in\[D\],\\,k\\in\[K^\{\*\}\]\}\\in\[N\]^\{DK^\{\*\}\},hkd′h^\{d^\{\\prime\}\}\_\{k\}denoting the bin of variabled′d^\{\\prime\}at lagkkwithinhhand the response is the individually estimated kernel simplexyh=𝒯^K∗f,d\(⋅∣h\)y\_\{h\}=\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\.

##### Tree construction\.

The tree recursively partitionsℋcov\\mathcal\{H\}^\{\\mathrm\{cov\}\}by choosing splits of the form\{h:hkd′=n\}\\\{h:h^\{d^\{\\prime\}\}\_\{k\}=n\\\}vs\.\{h:hkd′≠n\}\\\{h:h^\{d^\{\\prime\}\}\_\{k\}\\neq n\\\}for each variabled′∈\[D\]d^\{\\prime\}\\in\[D\], lagk∈\[K∗\]k\\in\[K^\{\*\}\], and binn∈\[N\]n\\in\[N\], givingD⋅K∗⋅ND\\cdot K^\{\*\}\\cdot Ncandidates per node\. At each node the split minimising theπ^∗\\hat\{\\pi\}^\{\*\}\-weighted within\-leaf KL divergence is selected:

csplit=∑ℓ∈\{left,right\}∑h∈ℓπ^∗\(h\)⋅KL\(𝒯^K∗f,d\(⋅∣h\)∥𝒯¯ℓd\(⋅\)\),\\mathrm\{c\}\_\{\\mathrm\{split\}\}=\\sum\_\{\\ell\\in\\\{\\mathrm\{left,\\,right\}\\\}\}\\sum\_\{h\\in\\ell\}\\hat\{\\pi\}^\{\*\}\(h\)\\cdot\\operatorname\{KL\}\\\!\\left\(\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\;\\Big\\\|\\;\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\}\(\\cdot\)\\right\),\(S44\)where thepooled leaf kernelis:

𝒯¯ℓd​\(sd\)=∑h∈ℓnf,d​\(h,sd\)\+α∑h∈ℓnf​\(h\)\+N​α,α=12\.\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\}\(s^\{d\}\)=\\frac\{\\displaystyle\\sum\_\{h\\in\\ell\}n^\{f,d\}\(h,\\,s^\{d\}\)\+\\alpha\}\{\\displaystyle\\sum\_\{h\\in\\ell\}n^\{f\}\(h\)\+N\\alpha\},\\quad\\alpha=\\tfrac\{1\}\{2\}\.\(S45\)
Splitting halts when any of the following hold: the node contains a single history; maximum depthdmaxd\_\{\\max\}is reached; no valid split exists that gives both children pooled count≥npool\(C\)\\geq n\_\{\\mathrm\{pool\}\}^\{\(\\mathrm\{C\}\)\}; or the best cost reduction is below thresholdη\\eta\. The pool\-size check∑h∈ℓPh≥npool\(C\)\\sum\_\{h\\in\\ell\}P\_\{h\}\\geq n\_\{\\mathrm\{pool\}\}^\{\(\\mathrm\{C\}\)\}at every candidate child is the structural link between tree construction andTmin\(C\)T^\{\(\\mathrm\{C\}\)\}\_\{\\min\}: it guarantees that every leaf produced by the algorithm meets the coverage requirement by construction, without any post\-hoc pruning\.

##### Extrapolation to sparse histories\.

Forh∈ℋsparseh\\in\\mathcal\{H\}^\{\\mathrm\{sparse\}\}, routehhdown the tree to its leafℓ​\(h\)\\ell\(h\)and assign:

𝒯^K∗f,d,\(C\)\(⋅∣h\)=𝒯¯ℓ​\(h\)d\(⋅\)\.\\hat\{\\mathcal\{T\}\}^\{f,d,\(\\mathrm\{C\}\)\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)=\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\(h\)\}\(\\cdot\)\.\(S46\)The leafℓ​\(h\)\\ell\(h\)groups histories sharing the same bin conditions on the predictively important splits, providing a non\-parametric but a structured estimate for histories that would otherwise be inaccessible to Strategies A and B\.

Algorithm 1Strategy C: Tree construction1:Individual kernel estimates

\{𝒯^K∗f,d\(⋅∣h\)\}h∈ℋcov\\\{\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\\}\_\{h\\in\\mathcal\{H\}^\{\\mathrm\{cov\}\}\}, counts

\{nf,d​\(h,sd\),nf​\(h\)\}h∈ℋcov\\\{n^\{f,d\}\(h,s^\{d\}\),\\,n^\{f\}\(h\)\\\}\_\{h\\in\\mathcal\{H\}^\{\\mathrm\{cov\}\}\}, stationary weights

\{π^∗​\(h\)\}h∈ℋcov\\\{\\hat\{\\pi\}^\{\*\}\(h\)\\\}\_\{h\\in\\mathcal\{H\}^\{\\mathrm\{cov\}\}\}, pool sizes

\{Ph\}h∈ℋcov\\\{P\_\{h\}\\\}\_\{h\\in\\mathcal\{H\}^\{\\mathrm\{cov\}\}\}, parameters

npool,dmax,η,α=12n\_\{\\mathrm\{pool\}\},\\,d\_\{\\max\},\\,\\eta,\\,\\alpha=\\tfrac\{1\}\{2\}
2:Tree

𝒯d\\mathcal\{T\}^\{d\}with leaf kernels

\{T¯ℓd\}\\\{\\bar\{T\}^\{d\}\_\{\\ell\}\\\}and split conditions

\{\(dν′⁣∗,kν∗,nν∗\)\}ν​internal\\\{\(d^\{\\prime\*\}\_\{\\nu\},\\,k^\{\*\}\_\{\\nu\},\\,n^\{\*\}\_\{\\nu\}\)\\\}\_\{\\nu\\,\\text\{internal\}\}
3:Initialise root

ν0\\nu\_\{0\}:

ℋ​\(ν0\)←ℋcov\\mathcal\{H\}\(\\nu\_\{0\}\)\\leftarrow\\mathcal\{H\}^\{\\mathrm\{cov\}\},

depth​\(ν0\)←0\\mathrm\{depth\}\(\\nu\_\{0\}\)\\leftarrow 0
4:

𝒬←\{ν0\}\\mathcal\{Q\}\\leftarrow\\\{\\nu\_\{0\}\\\}
5:while

𝒬≠∅\\mathcal\{Q\}\\neq\\emptysetdo

6:Pop node

ν\\nufrom

𝒬\\mathcal\{Q\}
7:Compute pooled kernel and node cost:

𝒯¯νd​\(sd\)\\displaystyle\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\nu\}\(s^\{d\}\)←∑h∈ℋ​\(ν\)nf,d​\(h,sd\)\+α∑h∈ℋ​\(ν\)nf​\(h\)\+N​α∀sd∈\[N\]\\displaystyle\\leftarrow\\frac\{\\displaystyle\\sum\_\{h\\in\\mathcal\{H\}\(\\nu\)\}n^\{f,d\}\(h,\\,s^\{d\}\)\+\\alpha\}\{\\displaystyle\\sum\_\{h\\in\\mathcal\{H\}\(\\nu\)\}n^\{f\}\(h\)\+N\\alpha\}\\qquad\\forall\\,s^\{d\}\\in\[N\]cν\\displaystyle c\_\{\\nu\}←∑h∈ℋ​\(ν\)π^∗\(h\)⋅KL\(𝒯^K∗f,d\(⋅∣h\)∥𝒯¯νd\)\\displaystyle\\leftarrow\\sum\_\{h\\in\\mathcal\{H\}\(\\nu\)\}\\hat\{\\pi\}^\{\*\}\(h\)\\cdot\\operatorname\{KL\}\\\!\\Bigl\(\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\;\\Big\\\|\\;\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\nu\}\\Bigr\)
8:if

\|ℋ​\(ν\)\|=1\|\\mathcal\{H\}\(\\nu\)\|=1or

depth​\(ν\)≥dmax\\mathrm\{depth\}\(\\nu\)\\geq d\_\{\\max\}then

9:Mark

ν\\nuas leaf with kernel

𝒯¯νd\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\nu\};continue⊳\\trianglerighttrivial stopping

10:endif

11:

c∗←cνc^\{\*\}\\leftarrow c\_\{\\nu\},

σ∗←null\\sigma^\{\*\}\\leftarrow\\mathrm\{null\}
12:for

\(d′,k,n\)∈\[D\]×\[K∗\]×\[N\]\(d^\{\\prime\},\\,k,\\,n\)\\in\[D\]\\times\[K^\{\*\}\]\\times\[N\]do⊳\\trianglerightD⋅K∗⋅ND\\cdot K^\{\*\}\\cdot Ncandidates

13:

ℋL←\{h∈ℋ​\(ν\):hkd′=n\}\\mathcal\{H\}\_\{L\}\\leftarrow\\\{h\\in\\mathcal\{H\}\(\\nu\):h^\{d^\{\\prime\}\}\_\{k\}=n\\\},

ℋR←ℋ​\(ν\)∖ℋL\\mathcal\{H\}\_\{R\}\\leftarrow\\mathcal\{H\}\(\\nu\)\\setminus\\mathcal\{H\}\_\{L\}
14:if

ℋL=∅\\mathcal\{H\}\_\{L\}=\\emptysetor

ℋR=∅\\mathcal\{H\}\_\{R\}=\\emptysetthen

15:continue

16:endif

17:if

∑h∈ℋLPh<npool\\sum\_\{h\\in\\mathcal\{H\}\_\{L\}\}P\_\{h\}<n\_\{\\mathrm\{pool\}\}or

∑h∈ℋRPh<npool\\sum\_\{h\\in\\mathcal\{H\}\_\{R\}\}P\_\{h\}<n\_\{\\mathrm\{pool\}\}then

18:continue⊳\\trianglerightchild pool violates coverage requirement

19:endif

20:Compute child pooled kernels

𝒯¯Ld\\bar\{\\mathcal\{T\}\}^\{d\}\_\{L\},

𝒯¯Rd\\bar\{\\mathcal\{T\}\}^\{d\}\_\{R\}via Eq\. \([S45](https://arxiv.org/html/2606.27599#S3.E45)\)

21:

csplit←∑ℓ∈\{L,R\}∑h∈ℋℓπ^∗\(h\)⋅KL\(𝒯^K∗f,d\(⋅∣h\)∥𝒯¯ℓd\)c\_\{\\mathrm\{split\}\}\\leftarrow\\displaystyle\\sum\_\{\\ell\\in\\\{L,\\,R\\\}\}\\sum\_\{h\\in\\mathcal\{H\}\_\{\\ell\}\}\\hat\{\\pi\}^\{\*\}\(h\)\\cdot\\operatorname\{KL\}\\\!\\bigl\(\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\;\\\|\\;\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\}\\bigr\)⊳\\trianglerightEq\. \([S44](https://arxiv.org/html/2606.27599#S3.E44)\), single variabledd

22:if

csplit<c∗c\_\{\\mathrm\{split\}\}<c^\{\*\}then

23:

c∗←csplitc^\{\*\}\\leftarrow c\_\{\\mathrm\{split\}\},

σ∗←\(d′,k,n,ℋL,ℋR\)\\sigma^\{\*\}\\leftarrow\(d^\{\\prime\},\\,k,\\,n,\\,\\mathcal\{H\}\_\{L\},\\,\\mathcal\{H\}\_\{R\}\)
24:endif

25:endfor

26:if

σ∗=null\\sigma^\{\*\}=\\mathrm\{null\}or

cν−c∗<ηc\_\{\\nu\}\-c^\{\*\}<\\etathen

27:Mark

ν\\nuas leaf with kernel

𝒯¯νd\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\nu\}⊳\\trianglerightno valid split or gain<η<\\eta

28:else

29:

\(d′⁣∗,k∗,n∗,ℋL,ℋR\)←σ∗\(d^\{\\prime\*\},\\,k^\{\*\},\\,n^\{\*\},\\,\\mathcal\{H\}\_\{L\},\\,\\mathcal\{H\}\_\{R\}\)\\leftarrow\\sigma^\{\*\}
30:

ν\.split←\(d′⁣∗,k∗,n∗\)\\nu\.\\mathrm\{split\}\\leftarrow\(d^\{\\prime\*\},\\,k^\{\*\},\\,n^\{\*\}\)
31:Create children

νL\\nu\_\{L\},

νR\\nu\_\{R\}:

ℋ​\(νL\)←ℋL\\mathcal\{H\}\(\\nu\_\{L\}\)\\leftarrow\\mathcal\{H\}\_\{L\},

ℋ​\(νR\)←ℋR\\mathcal\{H\}\(\\nu\_\{R\}\)\\leftarrow\\mathcal\{H\}\_\{R\},

depth←depth​\(ν\)\+1\\mathrm\{depth\}\\leftarrow\\mathrm\{depth\}\(\\nu\)\+1
32:

𝒬←𝒬∪\{νL,νR\}\\mathcal\{Q\}\\leftarrow\\mathcal\{Q\}\\cup\\\{\\nu\_\{L\},\\,\\nu\_\{R\}\\\}
33:endif

34:endwhile

35:returntree

𝒯d\\mathcal\{T\}^\{d\}

Algorithm[1](https://arxiv.org/html/2606.27599#alg1)is run independently for each target variabled∈\[D\]d\\in\[D\], producingDDtrees\{𝒯d\}d=1D\\\{\\mathcal\{T\}^\{d\}\\\}\_\{d=1\}^\{D\}with potentially different split structures\. The split candidates\(d′,k,n\)\(d^\{\\prime\},k,n\)range over all source variablesd′∈\[D\]d^\{\\prime\}\\in\[D\], lagsk∈\[K∗\]k\\in\[K^\{\*\}\], and binsn∈\[N\]n\\in\[N\], so the tree for target variableddmay select splits on any source variabled′d^\{\\prime\}, includingd′=dd^\{\\prime\}=d\(self\-influence\) andd′≠dd^\{\\prime\}\\neq d\(cross\-variable influence\)\. The split variables selected across theDDtrees directly encode the Level 1 and Level 2 explanation: a source variabled′d^\{\\prime\}at lagkkthat appears as a high\-level split node in𝒯d\\mathcal\{T\}^\{d\}is a primary driver of the transition distribution of variabledd\.

yesnoyesnoyesnoℋcov\\mathcal\{H\}^\{\\mathrm\{cov\}\}covered historieshk2d2=n2h^\{d\_\{2\}\}\_\{k\_\{2\}\}=n\_\{2\}?hk3d3=n3h^\{d\_\{3\}\}\_\{k\_\{3\}\}=n\_\{3\}?split:hk1d1=n1h^\{d\_\{1\}\}\_\{k\_\{1\}\}=n\_\{1\}?𝒯¯ℓ1d\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\_\{1\}\}𝒯¯ℓ2d\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\_\{2\}\}𝒯¯ℓ3d\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\_\{3\}\}𝒯¯ℓ4d\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\_\{4\}\}sparseh∈ℋsparseh\\in\\mathcal\{H\}^\{\\mathrm\{sparse\}\}inherits𝒯¯ℓ4d\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\_\{4\}\}Figure S2:Strategy C: regression tree on the transition tensor\.The tree partitions the covered historiesℋcov\\mathcal\{H\}^\{\\mathrm\{cov\}\}through binary bin testshkd′=nh^\{d^\{\\prime\}\}\_\{k\}=n, with candidates\(d′,k,n\)∈\[D\]×\[K∗\]×\[N\]\(d^\{\\prime\},k,n\)\\in\[D\]\\times\[K^\{\*\}\]\\times\[N\]chosen to minimise theπ^∗\\hat\{\\pi\}^\{\*\}\-weighted within\-leaf KL divergence \(Eq\.[S44](https://arxiv.org/html/2606.27599#S3.E44)\)\. Each leaf stores a pooled kernel𝒯¯ℓd\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\}\(Eq\.[S45](https://arxiv.org/html/2606.27599#S3.E45)\) aggregated from the model counts of its member histories, shown here as a distribution over theNNnext\-state bins\. Growth halts when a node holds a single history, reaches depthdmaxd\_\{\\max\}, yields gain belowη\\eta, or would give a child pooled count∑h∈ℓPh<npool\\sum\_\{h\\in\\ell\}P\_\{h\}<n\_\{\\mathrm\{pool\}\}\. A sparse historyh∈ℋsparseh\\in\\mathcal\{H\}^\{\\mathrm\{sparse\}\}\(dashed\) is routed top\-down through the same tests to a leaf and inherits its pooled kernel, extending reliable estimates to histories never sufficiently covered by Strategies A and B\. The sample budget then scales with the number of leavesL≪\|ℋK∗\+\|L\\ll\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\|rather than with\|ℋK∗\+\|\|\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}\|\.###### Theorem S3\.8\(Strategy C noise floor\)\.

Let𝒯^K∗f,d,\(C\)\(⋅∣h\)\\hat\{\\mathcal\{T\}\}^\{f,d,\(C\)\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)be the Strategy C estimator forh∈HK∗\+h\\in H^\{\+\}\_\{K^\{\*\}\}routed to leafℓ=ℓ​\(h\)\\ell=\\ell\(h\)with pooled countPℓ=∑h′∈ℓPh′P\_\{\\ell\}=\\sum\_\{h^\{\\prime\}\\in\\ell\}P\_\{h^\{\\prime\}\}\. Define the*population within\-leaf diameter*

δℓ∗=maxh′,h′′∈ℓ∥𝒯K∗f,d\(⋅∣h′\)−𝒯K∗f,d\(⋅∣h′′\)∥T​V,\\delta\_\{\\ell\}^\{\*\}\\;=\\;\\max\_\{h^\{\\prime\},h^\{\\prime\\prime\}\\in\\ell\}\\bigl\\\|\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h^\{\\prime\}\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h^\{\\prime\\prime\}\)\\bigr\\\|\_\{TV\},\(S47\)and the population pooled kernel

𝒯ℓf,d,pool​\(sd\)=1Pℓ​∑h′′∈ℓPh′′​𝒯K∗f,d​\(sd∣h′′\)\.\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{\\ell\}\(s^\{d\}\)\\;=\\;\\frac\{1\}\{P\_\{\\ell\}\}\\sum\_\{h^\{\\prime\\prime\}\\in\\ell\}P\_\{h^\{\\prime\\prime\}\}\\,\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(s^\{d\}\\mid h^\{\\prime\\prime\}\)\.\(S48\)Then, for anyh∈ℓh\\in\\ell,

𝔼\[∥𝒯^K∗f,d,\(C\)\(⋅∣h\)−𝒯K∗f,d\(⋅∣h\)∥T​V\]\\displaystyle\\mathbb\{E\}\\\!\\left\[\\bigl\\\|\\hat\{\\mathcal\{T\}\}^\{f,d,\(C\)\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\bigr\\\|\_\{TV\}\\right\]≤N2​Pℓ⏟\(i\) pooled estimation\+δℓ∗⏟\(ii\) within\-leaf bias,\\displaystyle\\quad\\quad\\;\\leq\\;\\underbrace\{\\sqrt\{\\frac\{N\}\{2P\_\{\\ell\}\}\}\}\_\{\\text\{\(i\) pooled estimation\}\}\\;\+\\;\\underbrace\{\\delta\_\{\\ell\}^\{\*\}\}\_\{\\text\{\(ii\) within\-leaf bias\}\},\(S49\)so the noise floor isρfloor\(C\)​\(h\)=N/\(2​Pℓ\)\+δℓ∗\\rho^\{\(C\)\}\_\{\\mathrm\{floor\}\}\(h\)=\\sqrt\{N/\(2P\_\{\\ell\}\)\}\+\\delta\_\{\\ell\}^\{\*\}\.

###### Proof\.

Fixh∈HK∗\+h\\in H^\{\+\}\_\{K^\{\*\}\}routed to leafℓ\\ell\. By definition \([S48](https://arxiv.org/html/2606.27599#S3.E48)\) and the assignment ruleT^K∗f,d,\(C\)\(⋅∣h\)=T¯ℓd\\hat\{T\}^\{f,d,\(C\)\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)=\\bar\{T\}^\{d\}\_\{\\ell\}\(Eq\. 55 in the main paper\), apply the triangle inequality withTℓf,d,poolT^\{f,d,\\mathrm\{pool\}\}\_\{\\ell\}as intermediate:

∥𝒯^K∗f,d,\(C\)\(⋅∣h\)−𝒯K∗f,d\(⋅∣h\)∥T​V\\displaystyle\\bigl\\\|\\hat\{\\mathcal\{T\}\}^\{f,d,\(C\)\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\bigr\\\|\_\{TV\}≤‖𝒯¯ℓd−𝒯ℓf,d,pool‖T​V⏟Term \(i\)\+∥𝒯ℓf,d,pool−𝒯K∗f,d\(⋅∣h\)∥T​V⏟Term \(ii\)\.\\displaystyle\\quad\\leq\\;\\underbrace\{\\bigl\\\|\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\}\-\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{\\ell\}\\bigr\\\|\_\{TV\}\}\_\{\\text\{Term \(i\)\}\}\\;\+\\;\\underbrace\{\\bigl\\\|\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{\\ell\}\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\bigr\\\|\_\{TV\}\}\_\{\\text\{Term \(ii\)\}\}\.\(S50\)
Bounding Term \(i\)\.𝒯¯ℓd\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\}is the Dirichlet posterior mean \(Eq\. 54 in the main paper\) formed by pooling allPℓP\_\{\\ell\}model queries within leafℓ\\ell\. Each query yields a Multinomial observation in\[N\]\[N\]\. By the same Pinsker–Dirichlet argument used in Theorem A\.4 \(Strategy A, marginal kernel\), applied now to the pooled estimator withPℓP\_\{\\ell\}total counts andNNbins:

𝔼​\[‖𝒯¯ℓd−𝒯ℓf,d,pool‖T​V\]≤N2​Pℓ\.\\mathbb\{E\}\\\!\\left\[\\bigl\\\|\\bar\{\\mathcal\{T\}\}^\{d\}\_\{\\ell\}\-\\mathcal\{T\}^\{f,d,\\mathrm\{pool\}\}\_\{\\ell\}\\bigr\\\|\_\{TV\}\\right\]\\;\\leq\\;\\sqrt\{\\frac\{N\}\{2P\_\{\\ell\}\}\}\.\(S51\)This is a finite\-sample statement requiring no limiting argument\.

Bounding Term \(ii\)\.By convexity of the total variation distance,

\(i​i\)\\displaystyle\(ii\)=∥1Pℓ∑h′′∈ℓPh′′𝒯K∗f,d\(⋅∣h′′\)−𝒯K∗f,d\(⋅∣h\)∥T​V\\displaystyle=\\Bigl\\\|\\frac\{1\}\{P\_\{\\ell\}\}\\sum\_\{h^\{\\prime\\prime\}\\in\\ell\}P\_\{h^\{\\prime\\prime\}\}\\,\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h^\{\\prime\\prime\}\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\Bigr\\\|\_\{TV\}≤1Pℓ∑h′′∈ℓPh′′\[∥𝒯K∗f,d\(⋅∣h′′\)−𝒯K∗f,d\(⋅∣h\)∥T​V\]\\displaystyle\\leq\\;\\frac\{1\}\{P\_\{\\ell\}\}\\sum\_\{h^\{\\prime\\prime\}\\in\\ell\}P\_\{h^\{\\prime\\prime\}\}\\,\\biggl\[\\\|\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h^\{\\prime\\prime\}\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\\|\_\{TV\}\\biggr\]≤maxh′′∈ℓ∥𝒯K∗f,d\(⋅∣h′′\)−𝒯K∗f,d\(⋅∣h\)∥T​V\\displaystyle\\leq\\;\\max\_\{h^\{\\prime\\prime\}\\in\\ell\}\\bigl\\\|\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h^\{\\prime\\prime\}\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\bigr\\\|\_\{TV\}≤δℓ∗,\\displaystyle\\leq\\;\\delta\_\{\\ell\}^\{\*\},\(S52\)where the last inequality holds becauseh∈ℓh\\in\\elland the maximum in \([S47](https://arxiv.org/html/2606.27599#S3.E47)\) is taken over all pairs inℓ\\ell\. Crucially, every step here involves only the*true*population kernels𝒯K∗f,d\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}; no estimated quantity appears, so no finite\-sample correction is needed for this term\.

Combining\.Taking expectations in \([S50](https://arxiv.org/html/2606.27599#S3.E50)\) and substituting \([S51](https://arxiv.org/html/2606.27599#S3.E51)\)–\([S52](https://arxiv.org/html/2606.27599#S3.E52)\):

𝔼\[∥𝒯^K∗f,d,\(C\)\(⋅∣h\)−𝒯K∗f,d\(⋅∣h\)∥T​V\]≤N2​Pℓ\+δℓ∗,\\mathbb\{E\}\\\!\\left\[\\bigl\\\|\\hat\{\\mathcal\{T\}\}^\{f,d,\(C\)\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\-\\mathcal\{T\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)\\bigr\\\|\_\{TV\}\\right\]\\;\\leq\\;\\sqrt\{\\frac\{N\}\{2P\_\{\\ell\}\}\}\+\\delta\_\{\\ell\}^\{\*\},\(S53\)establishing \([S49](https://arxiv.org/html/2606.27599#S3.E49)\)\.

∎

##### Degradation to Strategy B\.

WhenTTis large enough thatℋsparse=∅\\mathcal\{H\}^\{\\mathrm\{sparse\}\}=\\emptyset, Strategy C with one history per leaf recovers the nonparametric kernel exactly\. Strategy C therefore degrades gracefully: it activates the tree structure only where coverage fails and reverts to the nonparametric estimate for well\-covered histories\.

## S4Model Induced Causal Graph Recovery

We refer to DAG, a Directed Acyclic Graph\. We distinguish carefully between \(i\) thedata\-generatingcausal DAGG∗G^\{\*\}over the true data\-generating process, and \(ii\) themodel\-inducedcausal DAGGfG^\{f\}encoding whatffhas learned\. This distinction is not merely philosophical: ifffhas learned spurious correlations,GfG^\{f\}faithfully reflects them, which is precisely what model\-centric XAI should report\. ComparingGfG\_\{f\}withG∗G^\{\*\}\(recovered by PCMCI on raw data\) constitutes a model audit: agreement indicates causally correct learning; divergence flags what the model got wrong\.

###### Definition S4\.1\(Model\-induced causal DAG\)\.

Given surrogateMK∗M\_\{K^\{\*\}\}, define the node set

V=\{Xt−kd:d∈\[D\],k∈\{0,…​K∗\}\}V=\\\{X\_\{t\-k\}^\{d\}:d\\in\[D\],k\\in\\\{0,\\dots K^\{\*\}\\\}\\\}A directed edgeXt−kd→Xt−k′d′X^\{d\}\_\{t\-k\}\\to X^\{d^\{\\prime\}\}\_\{t\-k^\{\\prime\}\}\(withk\>k′k\>k^\{\\prime\}, forward in time\) belongs toEEiff

Xt−kd⟂̸⟂Xt−k′d′\|V\\\{Xt−kd\}X^\{d\}\_\{t\-k\}\\not\\perp\\\!\\\!\\\!\\perp X^\{d^\{\\prime\}\}\_\{t\-k^\{\\prime\}\}\|V\\backslash\\\{X^\{d\}\_\{t\-k\}\\\}under the marginal distribution define𝒯^K∗f,d\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\. The model\-induced DAG isGf=\(V,E\)G^\{f\}=\(V,E\)\.

Three structural properties ofGfG^\{f\}follow immediately from the time\-ordered construction\. First, acyclicity is automatic: all edges run strictly forward in time, so no directed cycle can exist, and KARMA need not enforce acyclicity as a constraint\. Second, orientation is given rather than inferred: because time\-ordering provides all edge orientations for free, KARMA recovers a fully oriented Model induced DAG, a strictly stronger result than the CPDAG recovered by standard algorithms such as PC or FCI\. We note that intra\-slice edges \(k=k′k=k^\{\\prime\}\) are excluded by construction, as the transition kernel conditions only on strictly past states; we leave contemporaneous extensions to future work\.

In the presence of noise from data budget and approximations, we impose a minimum detectable causal effectλ\\lambdathat ensures meaningful causation\.

###### Definition S4\.2\(Total\-Variation Markov Blanket\)\.

The Total\-Variation Markov Blanket ofXtd′X\_\{t\}^\{d^\{\\prime\}\}at thresholdλ≥0\\lambda\\geq 0is

MBλ​\(Xtd′\)=\{Xt−kd:ρ​\(Xt−kd→Xtd′\)≥λ,k∈\{0,…,K∗\}\}\\text\{MB\}\_\{\\lambda\}\(X\_\{t\}^\{d^\{\\prime\}\}\)=\\\{X\_\{t\-k\}^\{d\}:\\rho\(X^\{d\}\_\{t\-k\}\\to X^\{d^\{\\prime\}\}\_\{t\}\)\\geq\\lambda,\\\>k\\in\\\{0,\\dots,K^\{\*\}\\\}\\\}where,ρ​\(⋅\)\\rho\(\\cdot\)is defined as per Eq[7](https://arxiv.org/html/2606.27599#S4.E7)\.

The DAG recovered by KARMAGfG^\{f\}has edge set

Eλ=\{\(Xt−kd→Xt−k′d′\):Xt−kd∈MBλ​\(Xt−k′d′\),k<k′\}E^\{\\lambda\}=\\\{\(X^\{d\}\_\{t\-k\}\\to X^\{d^\{\\prime\}\}\_\{t\-k^\{\\prime\}\}\):X^\{d\}\_\{t\-k\}\\in\\text\{MB\}\_\{\\lambda\}\(X\_\{t\-k^\{\\prime\}\}^\{d^\{\\prime\}\}\),\\\>\\\>k<k^\{\\prime\}\\\}Asλ→0\\lambda\\to 0,GfλG\_\{f\}^\{\\lambda\}approaches the full transition graph; asλ→‖Pf−PℳK∗‖T​V\\lambda\\to\\\|P\_\{f\}\-P\_\{\\mathcal\{M\}\_\{K^\{\*\}\}\}\\\|\_\{TV\}, it approaches the empty graph, wherePfP\_\{f\}is the model distribution andPℳK∗P\_\{\\mathcal\{M\}\_\{K^\{\*\}\}\}the distribution of the Markov Surrogate\. The operating rangeλ∈\(0,ε/2\)\\lambda\\in\(0,\\varepsilon/2\)is where causal structure is meaningfully identified\.

Ifρ​\(Xt−kd→Xt−k′d′\)<λ\\rho\(X^\{d\}\_\{t\-k\}\\to X^\{d^\{\\prime\}\}\_\{t\-k^\{\\prime\}\}\)<\\lambda, the edge\(Xt−kd→Xt−k′d′\)\(X^\{d\}\_\{t\-k\}\\to X^\{d^\{\\prime\}\}\_\{t\-k^\{\\prime\}\}\)has no effective contribution\. This treats trimming as*regularization under a faithfulness prior*: edges removed by trimming are those whose path coefficient is smaller thanλ\\lambda, analogous to d\-separation\.

###### Theorem S4\.3\(Causal Faithfulness of KARMA Explanations\)\.

LetℳK∗\\mathcal\{M\}\_\{K^\{\*\}\}be the K\-order Markov surrogate offfwith estimated transition kernel𝒯^K∗f,d\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}and model\-induced causal graphGf=\(V,Eλ\)G^\{f\}=\(V,E^\{\\lambda\}\)\. Suppose the following*transition–separation correspondence*holds: for alld∈\[D\]d\\in\[D\],d′∈\[D\]d^\{\\prime\}\\in\[D\],k∈\{1,…,K∗\}k\\in\\\{1,\\ldots,K^\{\*\}\\\},

ρ\(Xt−kd′→Xtd\)=0⇔Xt−kd′⟂⟂GfXtd∣V∖\{Xt−kd′\},\\rho\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)=0\\;\\iff\\;X^\{d^\{\\prime\}\}\_\{t\-k\}\\perp\\\!\\\!\\\!\\perp\_\{G^\{f\}\}X^\{d\}\_\{t\}\\mid V\\setminus\\\{X^\{d^\{\\prime\}\}\_\{t\-k\}\\\},\(S56\)where⟂⟂Gf\\perp\\\!\\\!\\\!\\perp\_\{G^\{f\}\}denotes d\-separation inGfG^\{f\}\. Then:

1. \(i\)Faithfulness\.The edge setEλE^\{\\lambda\}ofGfG^\{f\}exactly encodes the conditional dependence structure ofℳK∗\\mathcal\{M\}\_\{K^\{\*\}\}: an edge\(Xt−kd′→Xtd\)\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)is retained iffXt−kd′X^\{d^\{\\prime\}\}\_\{t\-k\}has a direct distributional influence onXtdX^\{d\}\_\{t\}within the surrogate that is not mediated by any other variable in the Markov blanket\.
2. \(ii\)Causal grounding\.The AIE \(Eq\.[9](https://arxiv.org/html/2606.27599#S4.E9)\) satisfies AIEd​\(d′,k,x\)\>0⇔\(Xt−kd′→Xtd\)∈Eλ,\\mathrm\{AIE\}\_\{d\}\(d^\{\\prime\},k,x\)\>0\\;\\iff\\;\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)\\in E^\{\\lambda\},\(S57\)so the Level 1–4 explanations are causally grounded inGfG^\{f\}: every retained attribution corresponds to a genuine direct effect within the surrogate, and every zero attribution corresponds to a d\-separated pair\.
3. \(iii\)Inheritance toG∗G^\{\*\}\.If additionallyffhas learned the true conditional dependence structure of the data\-generating process, i\.e\.GfG^\{f\}is faithful toG∗G^\{\*\}, then the explanations are causally grounded in the data\-generating causal graphG∗G^\{\*\}\.

###### Proof\.

We prove each part in turn\.

Part \(i\)\.By Definition A\.7, an edge\(Xt−kd′→Xtd\)\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)belongs toEλE^\{\\lambda\}iffρ​\(Xt−kd′→Xtd\)≥λ\>0\\rho\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)\\geq\\lambda\>0, i\.e\. iff condition \([S56](https://arxiv.org/html/2606.27599#S4.E56)\) holds withρ\>0\\rho\>0\. By \([S56](https://arxiv.org/html/2606.27599#S4.E56)\),ρ=0\\rho=0iffXt−kd′X^\{d^\{\\prime\}\}\_\{t\-k\}andXtdX^\{d\}\_\{t\}are d\-separated inGfG^\{f\}given the remaining Markov blanket\. ThereforeEλE^\{\\lambda\}retains edges corresponding to conditionally dependent pairs inℳK∗\\mathcal\{M\}\_\{K^\{\*\}\}, which is precisely the definition of faithfulness for a DAG with respect to a distribution\. ∎

Part \(ii\)\.From Eq\. \([9](https://arxiv.org/html/2606.27599#S4.E9)\) in Level 4,

1N​∑x∈\[N\]AIEd​\(d′,k,x\)=ρ​\(Xt−kd′→Xtd\)\.\\frac\{1\}\{N\}\\sum\_\{x\\in\[N\]\}\\mathrm\{AIE\}\_\{d\}\(d^\{\\prime\},k,x\)\\;=\\;\\rho\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)\.\(S58\)SinceAIEd​\(d′,k,x\)≥0\\mathrm\{AIE\}\_\{d\}\(d^\{\\prime\},k,x\)\\geq 0for allxxby definition \(it is a total variation distance\), the average overxxis zero iff every summand is zero, iff𝒯^K∗f,d\(⋅∣h\)=𝒯^K∗f,d\(⋅∣hkd′←x\)\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h\)=\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\(\\cdot\\mid h^\{d^\{\\prime\}\\leftarrow x\}\_\{k\}\)forπ^∗\\hat\{\\pi\}^\{\*\}\-almost allhhand allx∈\[N\]x\\in\[N\], iffρ​\(Xt−kd′→Xtd\)=0\\rho\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)=0\. Combined with Part \(i\), this gives \([S57](https://arxiv.org/html/2606.27599#S4.E57)\): a positive AIE iff the edge is retained, i\.e\. iff the pair is not d\-separated inGfG^\{f\}\. Hence every nonzero attribution in Levels 1–4 maps bijectively to a direct effect inGfG^\{f\}, and every zero attribution maps to a d\-separated pair\. ∎

Part \(iii\)\.Faithfulness ofGfG^\{f\}toG∗G^\{\*\}means that d\-separation inGfG^\{f\}coincides with conditional independence in the data\-generating distributionP∗P^\{\*\}\. By Part \(i\), d\-separation inGfG^\{f\}already coincides withρ=0\\rho=0inMK∗M\_\{K^\{\*\}\}\. Chaining these two equivalences:

ρ​\(Xt−kd′→Xtd\)=0\\displaystyle\\rho\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)=0⇔Xt−kd′⟂⟂GfXtd\\displaystyle\\;\\iff\\;X^\{d^\{\\prime\}\}\_\{t\-k\}\\perp\\\!\\\!\\\!\\perp\_\{G^\{f\}\}X^\{d\}\_\{t\}⇔Xt−kd′⟂⟂G∗Xtd,\\displaystyle\\;\\iff\\;X^\{d^\{\\prime\}\}\_\{t\-k\}\\perp\\\!\\\!\\\!\\perp\_\{G^\{\*\}\}X^\{d\}\_\{t\},\(S59\)so the retained edges ofEλE^\{\\lambda\}coincide with the true direct effects inG∗G^\{\*\}, and the KARMA explanations are causally grounded in the data\-generating process\. ∎∎

### S4\.1KARMA Algorithm

Algorithm 2KARMA

1:MVTS

𝐗\\mathbf\{X\}, model

ffwith window

WW, Training series

𝐗train\\mathbf\{X\}\_\{\\mathrm\{train\}\}of length

TtrainT\_\{\\mathrm\{train\}\}, held\-out series

𝐗val\\mathbf\{X\}\_\{\\mathrm\{val\}\}of length

TvalT\_\{\\mathrm\{val\}\}, model

ffwith window

WW, tolerances

ε,λ\\varepsilon,\\lambda, candidate baselines

ℬ\\mathcal\{B\}, kernel estimator

ℰ∈\{StratA,StratB,StratC\}\\mathcal\{E\}\\in\\\{\\textsc\{StratA\},\\,\\textsc\{StratB\},\\,\\textsc\{StratC\}\\\}, estimator hyperparameters

Θℰ\\Theta\_\{\\mathcal\{E\}\}⊳\\trianglerighte\.g\.M,npool,dmaxM,n\_\{\\mathrm\{pool\}\},d\_\{\\max\}forStratB/C

2:— Shared pre\-processing —

3:Discretise

𝐗train\\mathbf\{X\}\_\{\\text\{train\}\}into

𝒮\\mathcal\{S\}via quantisation

4:Compute

π^∗​\(h\)\\hat\{\\pi\}^\{\*\}\(h\)from

𝐗\\mathbf\{X\}⊳\\trianglerightstationary weights, independent of model

5:— Pillar 1: Markov order selection —

7:repeat

8:Query

ffon

n≤Tval−Wn\\leq T\_\{\\mathrm\{val\}\}\-Windependent windows; record triples

\(h~i,hi\(K\),si′\)\(\\tilde\{h\}\_\{i\},\\,h\_\{i\}^\{\(K\)\},\\,s^\{\\prime\}\_\{i\}\)
9:Find

bK=arg​minb∈ℬ⁡Δ^pred​\(K,b\)b\_\{K\}=\\operatorname\*\{arg\\,min\}\_\{b\\in\\mathcal\{B\}\}\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}\(K,b\)⊳\\trianglerightdirect model test, no kernel needed

11:until

Δ^pred​\(K,bK\)<ε\\hat\{\\Delta\}^\{\\mathrm\{pred\}\}\(K,b\_\{K\}\)<\\varepsilon⊳\\trianglerightsole stopping certificate

12:

K∗←KK^\{\*\}\\leftarrow K;

b∗←bKb^\{\*\}\\leftarrow b\_\{K\}⊳\\trianglerightassertK∗≤WK^\{\*\}\\leq Wby architecture

13:— Pillar 2: compression and certified attribution —

14:Certified zeros:

ϕkd≈0\\phi^\{d\}\_\{k\}\\approx 0for all

k\>K∗k\>K^\{\*\},

d∈\[D\]d\\in\[D\]
15:Report compression ratio

K∗/WK^\{\*\}/W
16:— Pillar 3: kernel estimation —

17:Compute

TminT\_\{\\min\}⊳\\trianglerightstrategy\-specificTminT\_\{\\min\}

18:for

d=1,…,Dd=1,\\ldots,Ddo

19:

𝒯^K∗f,d←EstimateKernel​\(ℰ,Θℰ,K∗\)\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\\leftarrow\\textsc\{EstimateKernel\}\(\\mathcal\{E\},\\,\\Theta\_\{\\mathcal\{E\}\},\\,K^\{\*\}\)⊳\\trianglerightfinal kernel at selected order

20:Compute noise floor

ρfloor​\(h\)\\rho\_\{\\mathrm\{floor\}\}\(h\)for all

h∈ℋK∗\+h\\in\\mathcal\{H\}^\{\+\}\_\{K^\{\*\}\}
21:endfor

23:for

d=1,…,Dd=1,\\ldots,Ddo

24:for

\(d′,k\)∈\[D\]×\[K∗\]\(d^\{\\prime\},k\)\\in\[D\]\\times\[K^\{\*\}\]do

25:Compute

ρ​\(Xt−kd′→Xtd\)\\rho\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)via Eq\. \([7](https://arxiv.org/html/2606.27599#S4.E7)\) using

T^K∗f,d\\hat\{T\}^\{f,d\}\_\{K^\{\*\}\}
26:endfor

27:endfor

28:

Gf←\{\(d′,k,d\):ρ​\(Xt−kd′→Xtd\)\>λ\}G\_\{f\}\\leftarrow\\\{\(d^\{\\prime\},k,d\):\\rho\(X^\{d^\{\\prime\}\}\_\{t\-k\}\\to X^\{d\}\_\{t\}\)\>\\lambda\\\}⊳\\trianglerightRegularised trimming

29:— Five\-level explanation hierarchy —

30:Compute Levels 1–5 from

𝒯^K∗f,d\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}and

GfG\_\{f\}
31:return

K∗K^\{\*\},

b∗b^\{\*\},

K∗/WK^\{\*\}/W,

\{𝒯^K∗f,d\}d=1D\\\{\\hat\{\\mathcal\{T\}\}^\{f,d\}\_\{K^\{\*\}\}\\\}\_\{d=1\}^\{D\},

GfG\_\{f\},

Cov​\(K∗\)\\mathrm\{Cov\}\(K^\{\*\}\), Levels 1–5

## S5Temporal Aware Lag Based AUC Score

##### Naive AUC and its limitation\.

Letσ\\sigmabe the permutation that sortsssin descending order\. A natural metric masks the top\-kktimesteps with the training\-set meanx¯d,t\\bar\{x\}\_\{d,t\}:

x~b,d,t\(k\)=\{x¯d,tt∈\{σ​\(1\),…,σ​\(k\)\},xb,d,totherwise\.\\tilde\{x\}^\{\(k\)\}\_\{b,d,t\}\\;=\\;\\begin\{cases\}\\bar\{x\}\_\{d,t\}&t\\in\\\{\\sigma\(1\),\\ldots,\\sigma\(k\)\\\},\\\\ x\_\{b,d,t\}&\\text\{otherwise\.\}\\end\{cases\}\(S60\)This is problematic: substituting an unconditional constant into a temporally correlated sequence creates out\-of\-distribution discontinuities\. The model reacts to the broken correlation structure rather than to any genuine absence of causal information, inflating prediction shifts even for lags outside the Markov blanket and masking meaningful differences between methods\.

##### VAR\-conditional imputation\.

We address this by replacing the masked value with its conditional mean given recent history\. We fit a VAR\(KK\) model by ridge regression on the training windows, selectingKKvia BIC:

x^t=∑k=1KA^k​xt−k\+b^,A^k∈ℝD×D\.\\hat\{x\}\_\{t\}\\;=\\;\\sum\_\{k=1\}^\{K\}\\hat\{A\}\_\{k\}\\,x\_\{t\-k\}\+\\hat\{b\},\\qquad\\hat\{A\}\_\{k\}\\in\\mathbb\{R\}^\{D\\times D\}\.\(S61\)When positionttis masked, we substitutex~t=A^1​xt−1\+⋯\+A^K​xt−K\\tilde\{x\}\_\{t\}=\\hat\{A\}\_\{1\}x\_\{t\-1\}\+\\cdots\+\\hat\{A\}\_\{K\}x\_\{t\-K\}, processing positions in temporal order so each imputation can condition on already\-imputed predecessors \(an AR\-consistent chain\)\. This withholds only the innovationrt=xt−x~tr\_\{t\}=x\_\{t\}\-\\tilde\{x\}\_\{t\}—the component unpredictable from recent history—while preserving the learnable dynamics\. A timestep outside the Markov blanket thus yields‖f​\(𝐱\)−f​\(𝐱~\)‖1≈0\\\|f\(\\mathbf\{x\}\)\-f\(\\tilde\{\\mathbf\{x\}\}\)\\\|\_\{1\}\\approx 0when imputed, whereas a blanket member carrying genuine causal signal produces a measurable drop\.

Similar Articles

Nested Spatio-Temporal Time Series Forecasting

arXiv cs.LG

This paper proposes a nested spatiotemporal forecasting framework that uses spectral clustering to construct semantically coherent macro-level regions, which provide top-down guidance for fine-grained micro-level predictions. Experiments on high-dimensional datasets show consistent improvements over state-of-the-art baselines.