GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series

arXiv cs.LG 06/24/26, 04:00 AM Papers
causal-discovery time-series gated-refinement high-dimensional constraint-based score-based l0-regularization
Summary
Proposes GRACE, a method that combines constraint-based skeleton with gated refinement using L0 regularization for efficient and accurate causal edge discovery in high-dimensional time series. It outperforms existing methods in F1 and speed, demonstrated on synthetic and real-world river flow data.
arXiv:2606.23880v1 Announce Type: new Abstract: From climate teleconnections to gene regulation, modern time-series datasets encompass tens or hundreds of interacting variables, making causal discovery increasingly challenging. Constraint-based methods offer statistical rigor but their nonlinear CI tests are infeasible at scale, while score-based alternatives avoid CI testing but require arbitrary thresholds to binarize continuous edge scores. We propose GRACE ($\textbf{G}$ated $\textbf{R}$efinement for $\textbf{A}$ccurate $\textbf{C}$ausal $\textbf{E}$dge discovery), which refines constraint-based discovery using Hard Concrete gates with $L_0$ regularization: each candidate edge has an independent gate whose values concentrate near 0 or 1, yielding a clean bimodal separation that makes the binary decision robust, unlike the narrow, overlapping score distributions produced by $L_1$ and attention-based methods. A fast linear CI skeleton provides high-recall candidates; a single gated model then prunes false positives by learning which edges genuinely improve prediction, with automatic regularization adapted to problem dimensions and skeleton density. Systematic experiments on synthetic benchmarks, spanning diverse graph topologies (scale-free, Erd\H{o}s-R'enyi, small-world) and dimensionalities up to $d=100$, show that GRACE substantially improves F1 over its base CI method while maintaining high precision, and outperforms attention-based and score-based alternatives. GRACE matches or exceeds expensive nonlinear CI tests at a fraction of the cost ($75\times$ faster). On a real-world river flow dataset, where rainfall confounders, variable propagation lags, and distributional shifts violate standard assumptions, a temporal bootstrap variant of GRACE recovers 9 of 11 causal edges along the Elbe River with only 1 false positive ($F_1 = 0.86$, AUROC${} = 0.99$), reducing the skeleton's 106 false positives by 99%.
Original Article
View Cached Full Text
Cached at: 06/24/26, 07:49 AM
# GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series
Source: [https://arxiv.org/html/2606.23880](https://arxiv.org/html/2606.23880)
###### Abstract

From climate teleconnections to gene regulation, modern time\-series datasets encompass tens or hundreds of interacting variables, making causal discovery increasingly challenging\. Constraint\-based methods offer statistical rigor but their nonlinear CI tests are infeasible at scale, while score\-based alternatives avoid CI testing but require arbitrary thresholds to binarize continuous edge scores\. We proposeGRACE\(GatedRefinement forAccurateCausalEdge discovery\), which refines constraint\-based discovery using Hard Concrete gates withL0L\_\{0\}regularization—each candidate edge has an independent gate whose values concentrate near 0 or 1, yielding a clean bimodal separation that makes the binary decision robust—unlike the narrow, overlapping score distributions produced byL1L\_\{1\}and attention\-based methods\. A fast linear CI skeleton provides high\-recall candidates; a single gated model then prunes false positives by learning which edges genuinely improve prediction, with automatic regularization adapted to problem dimensions and skeleton density\. Systematic experiments on synthetic benchmarks—spanning diverse graph topologies \(scale\-free, Erdős–Rényi, small\-world\) and dimensionalities up tod=100d=100—show thatGRACEsubstantially improves F1 over its base CI method while maintaining high precision, and outperforms attention\-based and score\-based alternatives\.GRACEmatches or exceeds expensive nonlinear CI tests at a fraction of the cost \(75×75\\timesfaster\)\. On a real\-world river flow dataset—where rainfall confounders, variable propagation lags, and distributional shifts violate standard assumptions—a temporal bootstrap variant ofGRACErecovers 9 of 11 causal edges along the Elbe River, for example, with only 1 false positive \(F1=0\.86F\_\{1\}=0\.86, AUROC=0\.99\{\}=0\.99\), reducing the skeleton’s 106 false positives by 99%\.

## 1Introduction

Inferring causal relationships from observational time series is a central challenge in climate science, neuroscience, finance, gene regulatory network reconstruction, industrial root cause analysis, and many other fields\(Peters et al\.,[2017](https://arxiv.org/html/2606.23880#bib.bib17)\)\. Given a multivariate time series𝐱t=\(xt1,…,xtd\)\\mathbf\{x\}\_\{t\}=\(x\_\{t\}^\{1\},\\ldots,x\_\{t\}^\{d\}\)withTTobservations, the goal is to recover the*lagged causal graph*GG, whereG\[c,ℓ,e\]=1G\[c,\\ell,e\]=1indicates that variablexcx^\{c\}at lagℓ\\ellis a direct cause of variablexex^\{e\}at timett\.

Two dominant families of methods have emerged for time\-series causal discovery, each with complementary strengths and limitations\.*Score\-based methods*—Neural Granger Causality\(Tank et al\.,[2021](https://arxiv.org/html/2606.23880#bib.bib24)\), NAVAR\(Bussmann et al\.,[2021](https://arxiv.org/html/2606.23880#bib.bib3)\), DYNOTEARS\(Pamfil et al\.,[2020](https://arxiv.org/html/2606.23880#bib.bib16)\), CUTS\+\(Cheng et al\.,[2024](https://arxiv.org/html/2606.23880#bib.bib4)\)—take a global optimization view, searching for the graph that best fits the data under a scoring criterion\. They can capture nonlinear dependencies and scale to moderate dimensionality, but enforce sparsity through continuous penalties \(L1L\_\{1\}or Gumbel\-Softmax\), requiring post\-hoc weight thresholding to obtain a binary graph—and model misspecification or poor scoring assumptions can silently degrade the recovered graph\. Attention\-based methods like TCDF\(Nauta et al\.,[2019](https://arxiv.org/html/2606.23880#bib.bib15)\)similarly optimize a predictive objective but use attention weights as causal proxies; however, softmax normalization creates artificial competition among candidate parents, and attention thresholds must be dataset\-tuned\.*Constraint\-based methods*—PCMCI/PCMCI\+\(Runge et al\.,[2019](https://arxiv.org/html/2606.23880#bib.bib20); Runge,[2020](https://arxiv.org/html/2606.23880#bib.bib19)\), CDNOTS\(Sadeghi et al\.,[2025](https://arxiv.org/html/2606.23880#bib.bib21)\), SyPI\+\(Fesanghary and Gopal,[2025](https://arxiv.org/html/2606.23880#bib.bib5)\), LPCMCI\(Gerhardus and Runge,[2020](https://arxiv.org/html/2606.23880#bib.bib6)\)—build causal graphs from conditional independence \(CI\) tests, offering interpretable, statistically principled inference without requiring a global scoring function\. However, these tests are sensitive to sample size and conditioning\-set dimension: nonlinear CI tests such as CMI\(Runge,[2018](https://arxiv.org/html/2606.23880#bib.bib18)\), KCIT\(Zhang et al\.,[2011](https://arxiv.org/html/2606.23880#bib.bib26)\)\(O\(n3\)O\(n^\{3\}\)\), and RCoT\(Strobl et al\.,[2019](https://arxiv.org/html/2606.23880#bib.bib23)\)\(O\(df2n\)O\(d\_\{f\}^\{2\}n\)fordfd\_\{f\}Fourier features\) remain costly at highdd, forcing practitioners to rely on fast linear tests that miss nonlinear dependencies\. For broader surveys, seeAssaad et al\. \([2022](https://arxiv.org/html/2606.23880#bib.bib1)\)andGong et al\. \([2023](https://arxiv.org/html/2606.23880#bib.bib7)\)\.

In this work, we proposeGRACE\(Gated Refinement for Accurate Causal Edge discovery\), a framework for improving constraint\-based causal discovery at high dimensionality\.GRACEoperates in two stages: first, a high\-recall constraint\-based method such as CDNOTS\(Sadeghi et al\.,[2025](https://arxiv.org/html/2606.23880#bib.bib21)\)or PCMCI\(Runge et al\.,[2019](https://arxiv.org/html/2606.23880#bib.bib20)\)produces a candidate skeleton; then, a gated neural model refines this skeleton by assigning each candidate edge an independent Hard Concrete gate\(Louizos et al\.,[2018](https://arxiv.org/html/2606.23880#bib.bib12); Maddison et al\.,[2017](https://arxiv.org/html/2606.23880#bib.bib13)\)trained withL0L\_\{0\}regularization, whose gate values concentrate near 0 or 1, yielding a robust binary graph at a natural 0\.5 threshold\. The gated model learns which skeleton edges genuinely improve prediction, pruning false positives while preserving true causal links—narrowing the gap with expensive nonlinear CI tests at a fraction of the computational cost\. Although our experiments focus on lagged effects \(τ≥1\\tau\\geq 1\), the framework natively supports contemporaneous \(lag\-0\) edges: enabling lag\-0 gates allows the sameL0L\_\{0\}mechanism to select among instantaneous candidates provided by the skeleton\.

Our contributions are:

1. 1\.TheGRACEframework: A gated refinement stage that converts a high\-recall skeleton into a high\-precision causal graph via Hard Concrete gates withL0L\_\{0\}regularization\. The gating mechanism is independent of the skeleton source: we demonstrate it with both CDNOTS and PCMCI skeletons\.
2. 2\.An empirically derived regularization rule \(Eq\.[12](https://arxiv.org/html/2606.23880#S2.E12)\) that adaptsλ\\lambdato problem dimensions \(dd,TT\) and skeleton density, paired withL0L\_\{0\}normalization that decouples regularization from batch size\.
3. 3\.Real\-world evaluation on the Elbe River network\(Stein et al\.,[2025](https://arxiv.org/html/2606.23880#bib.bib22)\), where a temporal bootstrap variant—applyingGRACEto random temporal windows and retaining edges that appear consistently—handles non\-stationarity, hidden confounders, and variable lags without domain\-specific heuristics\.

## 2Methodology

### 2\.1Problem Formulation

Consider add\-variate stationary time series\{𝐱t\}t=1T\\\{\\mathbf\{x\}\_\{t\}\\\}\_\{t=1\}^\{T\}with𝐱t∈ℝd\\mathbf\{x\}\_\{t\}\\in\\mathbb\{R\}^\{d\}\. We assume a structural causal model with maximum lagLL:

xte=fe\(\{xt−ℓc:G\[c,ℓ,e\]=1\}\)\+ϵte,ϵte∼𝒩\(0,σe2\),x\_\{t\}^\{e\}=f\_\{e\}\\\!\\left\(\\\{x\_\{t\-\\ell\}^\{c\}:G\[c,\\ell,e\]=1\\\}\\right\)\+\\epsilon\_\{t\}^\{e\},\\quad\\epsilon\_\{t\}^\{e\}\\sim\\mathcal\{N\}\(0,\\sigma\_\{e\}^\{2\}\),\(1\)whereG∈\{0,1\}d×\(L\+1\)×dG\\in\\\{0,1\\\}^\{d\\times\(L\+1\)\\times d\}is the lagged causal graph andfef\_\{e\}is the \(possibly nonlinear\) causal mechanism for effect variableee\. The goal is to recoverGGfrom observations\{𝐱t\}t=1T\\\{\\mathbf\{x\}\_\{t\}\\\}\_\{t=1\}^\{T\}\.

The gated refinement stage assumes: \(i\)*Causal sufficiency*: there are no unobserved common causes; \(ii\)*Finite lag*: all causal effects occur within lags0,…,L0,\\ldots,L; \(iii\)*Additive structure*: effects combine additively across parents, though each individual effectfc,ℓ→ef\_\{c,\\ell\\to e\}can be nonlinear\. Additional assumptions \(e\.g\., stationarity\) are inherited from the skeleton discovery algorithm; when using CDNOTS, nonstationarity is handled natively\.

Constraint\-based method \(e\.g\., CDNOTS, PCMCI\)skeletonSSmaskStage 1Stage 2xt−11x^\{1\}\_\{t\-1\}xt−12x^\{2\}\_\{t\-1\}⋮\\vdotsxt−Ldx^\{d\}\_\{t\-L\}𝐖1,1\\mathbf\{W\}\_\{1,1\}𝐖2,1\\mathbf\{W\}\_\{2,1\}𝐖d,L\\mathbf\{W\}\_\{d,L\}⋮\\vdotsz1z\_\{1\}z2z\_\{2\}zDz\_\{D\}×\\times×\\times×\\times⋮\\vdotsΣ\\SigmaMLPe\\text\{MLP\}\_\{e\}μe,σe\\mu\_\{e\},\\sigma\_\{e\}InputsEncoderGatesDecodersharedper effecteeHard Concrete\+L0\+\\;L\_\{0\}penaltyFigure 1:GRACEtwo\-stage pipeline\.Stage 1: A constraint\-based method \(e\.g\., CDNOTS or PCMCI\) identifies candidate edges\.Stage 2: A gated model refines the skeleton—only edges inSShave learnable Hard Concrete gates; all others are masked to zero\. TheL0L\_\{0\}penalty drives gates toward exact zero or one, producing a bimodal distribution that makes the0\.50\.5decision boundary robust\.
### 2\.2Gated Causal Model

We construct a predictive model for each effect variableee, where candidate causal inputs are individually gated\. The architecture consists of three components \(Figure[1](https://arxiv.org/html/2606.23880#S2.F1)\): shared encoder, per\-effect gated aggregation, and per\-effect decoder\.

#### Shared encoder\.

A linear projection maps each lagged input to a hidden representation shared across all effect variables:

hc,ℓ=𝐖c,ℓxt−ℓc\+bc,ℓ∈ℝH,h\_\{c,\\ell\}=\\mathbf\{W\}\_\{c,\\ell\}\\,x\_\{t\-\\ell\}^\{c\}\+b\_\{c,\\ell\}\\in\\mathbb\{R\}^\{H\},\(2\)whereHHis the encoder hidden dimension,𝐖c,ℓ∈ℝH×1\\mathbf\{W\}\_\{c,\\ell\}\\in\\mathbb\{R\}^\{H\\times 1\}, andbc,ℓ∈ℝHb\_\{c,\\ell\}\\in\\mathbb\{R\}^\{H\}\. Sharing the encoder across effects reduces parameters fromO\(d2LH\)O\(d^\{2\}LH\)toO\(dLH\)O\(dLH\); effect\-specificity is provided by the gates\.

#### Gated aggregation\.

For each effectee, a Hard Concrete gatezc,ℓ,e∈\{0,1\}z\_\{c,\\ell,e\}\\in\\\{0,1\\\}independently controls each candidate parent:

𝐱^e=∑c,ℓzc,ℓ,e⋅hc,ℓ\.\\hat\{\\mathbf\{x\}\}\_\{e\}=\\sum\_\{c,\\ell\}z\_\{c,\\ell,e\}\\cdot h\_\{c,\\ell\}\.\(3\)Gates are parameterized by learnable log\-oddslog⁡αc,ℓ,e\\log\\alpha\_\{c,\\ell,e\}\(Section[2\.3](https://arxiv.org/html/2606.23880#S2.SS3)\)\.

#### Per\-effect decoders\.

Each effect variable has an independent two\-layer MLP decoder with SiLU activations that maps the aggregated representation to Gaussian parameters:

\(μe,log⁡σe\)=MLPe\(𝐱^e\),MLPe:ℝH→ℝ2\.\(\\mu\_\{e\},\\log\\sigma\_\{e\}\)=\\text\{MLP\}\_\{e\}\(\\hat\{\\mathbf\{x\}\}\_\{e\}\),\\quad\\text\{MLP\}\_\{e\}:\\mathbb\{R\}^\{H\}\\to\\mathbb\{R\}^\{2\}\.\(4\)Using per\-effect decoders \(rather than a shared decoder\) prevents conflation of distinct causal mechanisms—e\.g\., a quadratic effectxt−1c→xtex^\{c\}\_\{t\-1\}\\to x^\{e\}\_\{t\}and a linear effectxt−1c→xte′x^\{c\}\_\{t\-1\}\\to x^\{e^\{\\prime\}\}\_\{t\}can learn separate response functions\.

### 2\.3Hard Concrete Gates

We adopt the Hard Concrete distribution\(Louizos et al\.,[2018](https://arxiv.org/html/2606.23880#bib.bib12)\), which stretches the binary concrete \(Gumbel\-Softmax\)\(Maddison et al\.,[2017](https://arxiv.org/html/2606.23880#bib.bib13); Jang et al\.,[2017](https://arxiv.org/html/2606.23880#bib.bib9)\)distribution to produce exact zeros and ones\.

#### Training \(stochastic\)\.

During training, gates are sampled via the reparameterization trick:

u\\displaystyle u∼Uniform\(ϵ,1−ϵ\),\\displaystyle\\sim\\text\{Uniform\}\(\\epsilon,1\-\\epsilon\),\(5\)s\\displaystyle s=sigmoid⁡\(log⁡u−log⁡\(1−u\)\+log⁡ατ\),\\displaystyle=\\operatorname\{sigmoid\}\\\!\\left\(\\frac\{\\log u\-\\log\(1\-u\)\+\\log\\alpha\}\{\\tau\}\\right\),\(6\)z¯\\displaystyle\\bar\{z\}=s⋅\(ζ−γ\)\+γ,\\displaystyle=s\\cdot\(\\zeta\-\\gamma\)\+\\gamma,\(7\)z\\displaystyle z=min⁡\(max⁡\(z¯,0\),1\),\\displaystyle=\\min\(\\max\(\\bar\{z\},0\),1\),\(8\)whereτ=2/3\\tau=2/3is the temperature, andγ=−0\.1\\gamma=\-0\.1,ζ=1\.1\\zeta=1\.1define the stretch interval\. The stretch beyond\[0,1\]\[0,1\]followed by clamping places positive probability mass on exact 0 and exact 1\.

#### Evaluation \(deterministic\)\.

At test time, we use the deterministic Hard Concrete estimator fromLouizos et al\. \([2018](https://arxiv.org/html/2606.23880#bib.bib12)\):

z=min⁡\(max⁡\(sigmoid⁡\(log⁡α\)⋅\(ζ−γ\)\+γ,0\),1\)\.z=\\min\\\!\\left\(\\max\\\!\\left\(\\operatorname\{sigmoid\}\(\\log\\alpha\)\\cdot\(\\zeta\-\\gamma\)\+\\gamma,\\;0\\right\),\\;1\\right\)\.\(9\)Crucially, the stretch\-and\-clamp mechanism places positive probability mass on*exact*zero, so trained gates converge to eitherz=0z=0\(edge absent\) orz≈1z\\approx 1\(edge present\); Appendix[A](https://arxiv.org/html/2606.23880#A1)confirms this empirically with gate value histograms showing clear bimodal separation between true and false edges\. The final causal graph is obtained by thresholding deterministic gate values at0\.50\.5—a natural boundary that falls in the gap between the two modes, making the binary decision robust to the exact threshold choice\. Unlike the*post\-hoc weight thresholding*required byL1L\_\{1\}and attention\-based methods—where continuous edge scores cluster in a narrow range and results are sensitive to the chosen cutoff—the bimodal gate distribution makes the0\.50\.5boundary a principled default rather than an arbitrary tuning parameter\. The gate initialization \(log⁡α=−0\.5\\log\\alpha=\{\-\}0\.5\) andL0L\_\{0\}penalty strength \(λ\\lambda\) remain as design choices, but these are set once before training rather than tuned after inspecting learned weights\.

#### L0L\_\{0\}penalty\.

The probability that a gate is nonzero has a closed\-form expression:

ℙ\(z≠0\)=sigmoid⁡\(log⁡α−τlog⁡−γζ\)\.\\mathbb\{P\}\(z\\neq 0\)=\\operatorname\{sigmoid\}\\\!\\left\(\\log\\alpha\-\\tau\\log\\\!\\frac\{\-\\gamma\}\{\\zeta\}\\right\)\.\(10\)This enables an analyticL0L\_\{0\}regularizer without Monte Carlo estimation\.

#### Initialization\.

We initializelog⁡α=−0\.5\\log\\alpha=\-0\.5for all gates, yielding a deterministic gate value of approximately0\.360\.36\(below the active threshold of0\.50\.5\)\. Gates start mostly closed, requiring edges to demonstrate predictive value to open\. This asymmetric initialization reduces false positives compared to a symmetric initialization nearlog⁡α=0\\log\\alpha=0\.

### 2\.4Loss Function

The training objective combines prediction quality with sparsity:

ℒ\\displaystyle\\mathcal\{L\}=1Bd∑t=1B∑e=1dNLL\(xte∣μte,σte\)⏟prediction\+λ⋅s∑c,ℓ,eℙ\(zc,ℓ,e≠0\)⏟L0sparsity\\displaystyle=\\underbrace\{\\frac\{1\}\{Bd\}\\sum\_\{t=1\}^\{B\}\\sum\_\{e=1\}^\{d\}\\text\{NLL\}\(x\_\{t\}^\{e\}\\mid\\mu\_\{t\}^\{e\},\\sigma\_\{t\}^\{e\}\)\}\_\{\\text\{prediction\}\}\+\\underbrace\{\\lambda\\cdot s\\sum\_\{c,\\ell,e\}\\mathbb\{P\}\(z\_\{c,\\ell,e\}\\neq 0\)\}\_\{\\text\{$L\_\{0\}$ sparsity\}\}\+λ⋅λlag∑c,eReLU⁡\(∑ℓℙ\(zc,ℓ,e≠0\)−k\)⏟lag concentration \(optional\),\\displaystyle\\quad\+\\underbrace\{\\lambda\\cdot\\lambda\_\{\\text\{lag\}\}\\sum\_\{c,e\}\\operatorname\{ReLU\}\\\!\\left\(\\sum\_\{\\ell\}\\mathbb\{P\}\(z\_\{c,\\ell,e\}\\neq 0\)\-k\\right\)\}\_\{\\text\{lag concentration \(optional\)\}\},\(11\)whereNLL\(x∣μ,σ\)=\(x−μ\)22σ2\+log⁡σ\+12log⁡2π\\text\{NLL\}\(x\\mid\\mu,\\sigma\)=\\frac\{\(x\-\\mu\)^\{2\}\}\{2\\sigma^\{2\}\}\+\\log\\sigma\+\\frac\{1\}\{2\}\\log 2\\piis the Gaussian negative log\-likelihood,BBis the batch size, andddis the number of variables\.

#### L0L\_\{0\}normalization\.

The scaling factors=1/⌈N/B⌉s=1/\\lceil N/B\\rceilnormalizes theL0L\_\{0\}penalty by the number of gradient steps per epoch, ensuring thatλ\\lambdavalues transfer reliably across dataset sizes \(Section[2\.5](https://arxiv.org/html/2606.23880#S2.SS5)\)\.

### 2\.5Practical Considerations

#### Batch size andL0L\_\{0\}normalization\.

WithNNsamples and batch sizeBB, there are⌈N/B⌉\\lceil N/B\\rceilgradient steps per epoch\. Without normalization, theL0L\_\{0\}penalty is applied per step, while the NLL is averaged per sample, creating an implicit coupling:effectiveλ∝λ⋅N/B\\text\{effective \}\\lambda\\propto\\lambda\\cdot N/B\. OurL0L\_\{0\}normalization \(factors=1/⌈N/B⌉s=1/\\lceil N/B\\rceilin Eq\.[11](https://arxiv.org/html/2606.23880#S2.E11)\) decouples regularization from batch size\. The batch size is set automatically asB=max⁡\(32,min⁡\(N/8,256\)\)B=\\max\(32,\\min\(N/8,256\)\)\.

#### Lag concentration \(optional\)\.

By default, each gate is penalized independently by theL0L\_\{0\}term and the skeleton determines the lag structure—if the CI skeleton identifies edges at multiple lags for a pair, all corresponding gates are learned without additional constraint\. When domain knowledge or exploratory analysis \(e\.g\., partial cross\-correlation\) suggests that causal effects concentrate at a small number of lags, an optional lag concentration penalty \(third term in Eq\.[11](https://arxiv.org/html/2606.23880#S2.E11)\) can be enabled by settingλlag\>0\\lambda\_\{\\text\{lag\}\}\>0\. This penalty discourages more thankkactive lags per \(cause, effect\) pair via a soft hinge:kklags are “free”; the penalty is linear abovekk\. The lag term omits the per\-step normalization factorss: it is a structural penalty on edge multiplicity, not a per\-sample regularizer\. In our experiments,λlag=0\\lambda\_\{\\text\{lag\}\}=0\(disabled\) since the skeleton already provides the appropriate lag structure\.

#### Training details\.

We use Adam\(Kingma and Ba,[2015](https://arxiv.org/html/2606.23880#bib.bib10)\)with learning rate10−310^\{\-3\}and weight decay10−410^\{\-4\}\. The encoder hidden dimension isH=64H=64; each per\-effect decoder is a two\-layer MLP with hidden size 64 and SiLU activations\. The noise scaleσe\\sigma\_\{e\}is not a fixed parameter but a learned output of the decoder \(via softplus\), optimized jointly with all other parameters\. Training runs for up to 150 epochs with early stopping \(patience 20, monitoring training loss\)\.

### 2\.6CDNOTS\-Gated Refinement

Constraint\-based methods like CDNOTS\(Sadeghi et al\.,[2025](https://arxiv.org/html/2606.23880#bib.bib21)\)with fast linear CI tests \(partial correlation\) achieve high recall but suffer from low precision as dimensionality increases: in our benchmarks \(Section[3](https://arxiv.org/html/2606.23880#S3)\), CDNOTS achieves95%95\\%TPR but only37%37\\%precision atd=100d=100/T=1000T=1000, yielding F1==0\.53\.GRACEuses the gated model from Section[2\.2](https://arxiv.org/html/2606.23880#S2.SS2)as a*precision filter*: starting from this high\-recall skeleton, it learns which edges genuinely improve prediction and prunes the rest\.

#### Pipeline\.

The CDNOTS\-Gated refinement operates in two stages:

1. 1\.CDNOTS skeleton: Run CDNOTS with partial correlation CI tests at significance levelα=0\.05\\alpha=0\.05to obtain a binary skeletonS∈\{0,1\}d×d×\(L\+1\)S\\in\\\{0,1\\\}^\{d\\times d\\times\(L\+1\)\}\. This produces a high\-recall superset of the true causal graph\.
2. 2\.Gated refinement: Train a single gated model with the skeleton as a hard mask—only edges present inSShave learnable gates; all others are fixed to zero\. Gates are initialized atlog⁡α=−0\.5\\log\\alpha=\-0\.5\(deterministic gate value≈0\.36\\approx 0\.36, below the active threshold of0\.50\.5\) for skeleton edges, so each edge must demonstrate predictive value to open\.

#### λ\\lambdaselection\.

We use an empirically derived rule based on problem dimensions and skeleton density:

λ=max⁡\(0\.007\+0\.16ρS,1d\+4T−0\.4ρS0\.8\),\\lambda=\\max\\\!\\bigl\(0\.007\+0\.16\\,\\rho\_\{S\},\\;\\;\\tfrac\{1\}\{d\}\+\\tfrac\{4\}\{T\}\-0\.4\\,\\rho\_\{S\}^\{0\.8\}\\bigr\),\(12\)whereρS=\|S\|/\(d2\(L\+1\)−d\)\\rho\_\{S\}=\|S\|/\(d^\{2\}\(L\+1\)\-d\)is the fraction of candidate edges in the skeleton\. The coefficients were fitted by grid search over the oracle\-bestλ\\lambdaon a held\-out calibration set of∼1,000\{\\sim\}1\{,\}000random Erdős–Rényi graphs spanningd∈\{10,…,100\}d\\in\\\{10,\\ldots,100\\\},T∈\{500,…,2000\}T\\in\\\{500,\\ldots,2000\\\}, and edge densities from sparse to 0\.12—deliberately disjoint from the SCP benchmarks used in Section[3](https://arxiv.org/html/2606.23880#S3), to avoid overfitting to the evaluation data\. The main term \(1/d\+4/T−0\.4ρS0\.81/d\+4/T\-0\.4\\,\\rho\_\{S\}^\{0\.8\}\) adapts to problem scale: more variables and more data allow weaker regularization, with a sublinear density correction\. The floor \(0\.007\+0\.16ρS0\.007\+0\.16\\,\\rho\_\{S\}\) ensures sufficient regularization for dense skeletons\. This reduces computation to a single model\-training run\.

#### Why it works\.

The combination succeeds because the two components address complementary failure modes\. CDNOTS with partial correlation provides a computationally cheap skeleton with very high recall—in practice, most nonlinear causal relationships also induce some linear association that partial correlation can detect—but many false positives at highdd\. The gated model, constrained to the skeleton, operates in a much smaller search space \(\|S\|≪d2L\|S\|\\ll d^\{2\}L\), enabling it to learn nonlinear causal mechanisms and prune edges that don’t genuinely improve prediction\.

A natural question is why a singleλ\\lambdavalue works across all edges, rather than requiring per\-edge tuning\. To build intuition \(not a formal guarantee for the nonlinear case\), Appendix[B](https://arxiv.org/html/2606.23880#A2)analyzes a simplified linear Gaussian model and shows that each edge has a*critical penalty*λ∗\\lambda^\{\*\}proportional to its NLL reductionΔ\\Delta—the improvement in prediction from opening that gate\. True causal edges reduce NLL by a constantΔtrue=Θ\(1\)\\Delta^\{\\text\{true\}\}=\\Theta\(1\)\(determined by signal strength\), while false edges can only overfit noise, givingΔfalse=O\(1/T\)\\Delta^\{\\text\{false\}\}=O\(1/T\)\. The ratio of their critical penalties is thereforeλ∗true/λ∗false=T⋅SNR\\lambda^\{\*\\text\{true\}\}/\\lambda^\{\*\\text\{false\}\}=T\\cdot\\text\{SNR\}, whereSNRis the per\-edge signal\-to\-noise ratio \(formally defined in Appendix[B](https://arxiv.org/html/2606.23880#A2)as the conditional SNR\)\. This means a*separation interval*of validλ\\lambdavalues exists and widens linearly with sample size, so anyλ\\lambdain this interval recovers the correct graph\. The analysis does not directly derive the constants in Eq\.[12](https://arxiv.org/html/2606.23880#S2.E12), but it explains three qualitative properties that the heuristic satisfies: \(i\) the skeleton pre\-filters candidates, reducing the number of false edges thatλ\\lambdamust suppress; \(ii\) the gate initialization below the decision boundary creates inertia against false edges opening; and \(iii\) the dependence onρS\\rho\_\{S\}is justified because denser skeletons admit more false candidates, requiring stronger regularization to keep the expected false positive count bounded\.

#### Comparison with nonlinear CI tests\.

An alternative to gated refinement is using expensive nonlinear CI tests \(RCoT, KCIT\) within CDNOTS\. Using a nonlinear CI test \(RCoT\) within the skeleton generally improves F1, since it can detect dependencies that linear partial correlation misses\. However, the runtime cost is steep—in our experiments, 40–75×\\timesslower atd≥50d\\geq 50\. Gated refinement with a fast linear skeleton narrows the F1 gap at a fraction of the cost, requiring onlyO\(\|S\|⋅H\)O\(\|S\|\\cdot H\)per gradient step\. We validate this empirically in Section[3\.3](https://arxiv.org/html/2606.23880#S3.SS3.SSS0.Px3)\(full results in Appendix[C](https://arxiv.org/html/2606.23880#A3)\)\.

## 3Experiments

We evaluateGRACEon synthetic benchmarks with known ground\-truth causal graphs, focusing on high\-dimensional scaling \(d∈\{20,30,50,70,100\}d\\in\\\{20,30,50,70,100\\\}\) with varying sample sizes \(T∈\{300,500,1000,2000\}T\\in\\\{300,500,1000,2000\\\}\)\.

### 3\.1Setup

#### Data generation\.

We generate random structural causal processes \(SCPs\) with controlled properties—the standard benchmark class for time\-series causal discovery, modeling systems common in climate science, econometrics, and neuroscience\(Runge et al\.,[2019](https://arxiv.org/html/2606.23880#bib.bib20); Assaad et al\.,[2022](https://arxiv.org/html/2606.23880#bib.bib1)\)\. Graphs are*sparse*\(∼\\sim1\.3–2\.2 edges per variable\), with lags drawn uniformly from11toL=5L=5\. Dependency coefficients are sampled uniformly from\[0\.40,0\.89\]\[0\.40,0\.89\]or\[−0\.89,−0\.40\]\[\-0\.89,\-0\.40\]with equal probability, and functions are drawn from a mixed pool of linear and nonlinear forms:\{x,x2,\|x\|,sin⁡\(x\)\}\\\{x,x^\{2\},\|x\|,\\sin\(x\)\\\}\. For each dimensionalitydd, we generate 10 random graphs \(seeds 0–9\) atTmax=2500T\_\{\\max\}=2500and slice to each targetTTfor evaluation, ensuring that results across sample sizes reflect the same underlying causal structure\.

#### Metrics\.

We report F1 score \(harmonic mean of precision and recall\), along with precision and true positive rate \(TPR/recall\) to characterize the precision–recall tradeoff\. Structural Hamming Distance \(SHD\) results are provided in Appendix[D](https://arxiv.org/html/2606.23880#A4)\. All results are averaged over 10 random graph seeds; we report mean±\\pmstandard deviation\.

### 3\.2Baselines

We compare against representative methods from each major paradigm—constraint\-based \(PCMCI, CDNOTS\), attention\-based \(TCDF\), and score\-based \(CUTS\+\)\. CUTS\+ subsumes the score\-based family: it outperforms NGC, DYNOTEARS, and NAVAR on its own benchmarks\(Cheng et al\.,[2024](https://arxiv.org/html/2606.23880#bib.bib4)\), so including all four would be redundant\.

- •PCMCI\(Runge et al\.,[2019](https://arxiv.org/html/2606.23880#bib.bib20)\): Constraint\-based method with partial correlation CI test\. Significance levelα=0\.05\\alpha=0\.05\.
- •CDNOTS\(Sadeghi et al\.,[2025](https://arxiv.org/html/2606.23880#bib.bib21)\): Nonstationary causal discovery with partial correlation\.
- •TCDF\(Nauta et al\.,[2019](https://arxiv.org/html/2606.23880#bib.bib15)\): Attention\-based convolutional approach with permutation testing\.
- •CDNOTS\-Gated\(GRACE\): Our CDNOTS\-Gated refinement pipeline \(Section[2\.6](https://arxiv.org/html/2606.23880#S2.SS6)\)\.
- •CUTS\+\(Cheng et al\.,[2024](https://arxiv.org/html/2606.23880#bib.bib4)\): Score\-based method using Gumbel\-Softmax gates on a GRU\-based message\-passing network, designed explicitly for*high\-dimensional*causal discovery from irregular time series\. CUTS\+ is the most directly comparable score\-based baseline at scale; since it outputs pairwise probabilities without lag resolution, we evaluate both methods at*pair\-level*F1 \(lags collapsed\) for a fair comparison, setting asideGRACE’s lag\-specific output\.GRACEsubstantially outperforms CUTS\+ across all configurations \(Appendix[E](https://arxiv.org/html/2606.23880#A5)\)\. We use their published default settings with threshold0\.50\.5and set the input window to match the true maximum lag\.

### 3\.3Results: F1 Across Dimensionality and Sample Size

Table 1:F1 scores on sparse SCP benchmarks across dimensionality and sample size \(mean±\\pmstd over 10 seeds\)\. Best result per row inbold\.ddMethodT=300T=300T=500T=500T=1000T=1000T=2000T=200020PCMCI\.436±\\pm\.03\.433±\\pm\.02\.455±\\pm\.04\.454±\\pm\.03CDNOTS\.785±\\pm\.04\.810±\\pm\.04\.844±\\pm\.04\.856±\\pm\.06TCDF\.556±\\pm\.16\.623±\\pm\.17\.660±\\pm\.18\.670±\\pm\.17CDNOTS\-Gated\.794±\\pm\.05\.845±\\pm\.04\.868±\\pm\.04\.893±\\pm\.0330PCMCI\.341±\\pm\.02\.346±\\pm\.03\.354±\\pm\.03\.356±\\pm\.03CDNOTS\.768±\\pm\.04\.784±\\pm\.04\.816±\\pm\.06\.830±\\pm\.03TCDF\.518±\\pm\.12\.608±\\pm\.14\.683±\\pm\.15\.688±\\pm\.15CDNOTS\-Gated\.804±\\pm\.04\.851±\\pm\.05\.893±\\pm\.03\.908±\\pm\.0450PCMCI\.221±\\pm\.02\.228±\\pm\.02\.236±\\pm\.02\.241±\\pm\.02CDNOTS\.717±\\pm\.05\.720±\\pm\.04\.742±\\pm\.04\.722±\\pm\.04TCDF\.431±\\pm\.08\.600±\\pm\.10\.691±\\pm\.09\.724±\\pm\.08CDNOTS\-Gated\.841±\\pm\.03\.886±\\pm\.03\.915±\\pm\.02\.911±\\pm\.0370PCMCI\.161±\\pm\.02\.168±\\pm\.02\.171±\\pm\.02\.177±\\pm\.02CDNOTS\.651±\\pm\.04\.665±\\pm\.03\.666±\\pm\.05\.648±\\pm\.05TCDF\.322±\\pm\.04\.529±\\pm\.07\.724±\\pm\.07\.768±\\pm\.06CDNOTS\-Gated\.849±\\pm\.05\.896±\\pm\.03\.927±\\pm\.02\.929±\\pm\.02100PCMCI\.105±\\pm\.02\.113±\\pm\.02\.117±\\pm\.02\.119±\\pm\.02CDNOTS\.571±\\pm\.04\.559±\\pm\.06\.531±\\pm\.07\.513±\\pm\.07TCDF\.269±\\pm\.04\.504±\\pm\.10\.718±\\pm\.10\.785±\\pm\.09CDNOTS\-Gated\.830±\\pm\.06\.878±\\pm\.04\.920±\\pm\.02\.922±\\pm\.02![Refer to caption](https://arxiv.org/html/2606.23880v1/x1.png)Figure 2:F1F\_\{1\}score vs\. number of variablesddat three sample sizes\. CDNOTS\-Gated \(black\) consistently outperforms all baselines, and the advantage grows with dimensionality\.Table[1](https://arxiv.org/html/2606.23880#S3.T1)and Figure[2](https://arxiv.org/html/2606.23880#S3.F2)present F1 scores across all methods, dimensionalities, and sample sizes\. We highlight several key findings:

#### CDNOTS\-Gated dominates at high dimensionality\.

Atd=50d=50/T=1000T=1000, CDNOTS\-Gated achieves F1==0\.915, compared to 0\.742 for CDNOTS and 0\.691 for TCDF\. Atd=100d=100, the gap widens further \(F1==0\.878 vs\. 0\.559 atT=500T=500\), with the advantage growing monotonically indd\. PCMCI achieves\>97%\>97\\%recall at allddbut precision collapses from 30% \(d=20d=20\) to 6% \(d=100d=100\)—a common challenge as the hypothesis space grows asO\(d2L\)O\(d^\{2\}L\)\. TCDF achieves high precision \(\>93%\>93\\%\) through permutation testing but limited recall \(∼56%\{\\sim\}56\\%\), plateauing around F1≈\\approx0\.72\. CUTS\+ precision collapses at highdd: pair\-level F1 drops to 0\.055 atd=100d=100\(Appendix[E](https://arxiv.org/html/2606.23880#A5)\)\.

Table 2:Precision and TPR atT=1000T=1000across dimensionality \(mean over 10 seeds\)\. CDNOTS\-Gated preserves CDNOTS’s high recall while dramatically improving precision\.d=20d=20d=50d=50d=70d=70d=100d=100MethodPrecTPRPrecTPRPrecTPRPrecTPRPCMCI\.298\.973\.135\.978\.094\.984\.062\.982CDNOTS\.837\.860\.617\.934\.514\.953\.372\.951TCDF\.938\.537\.933\.556\.931\.597\.954\.584CDNOTS\-Gated\.950\.804\.967\.870\.977\.883\.974\.872
#### Precision–recall tradeoff\.

Table[2](https://arxiv.org/html/2606.23880#S3.T2)reveals the mechanism: CDNOTS\-Gated preserves most of CDNOTS’s recall \(83–88% vs\. 86–95%\) while boosting precision from 37–84% to 86–97%\. Atd=70d=70, precision reaches 97\.5%—nearly every predicted edge is correct\. Runtime remains practical \(∼4\{\\sim\}4min atd=100d=100; Appendix[F](https://arxiv.org/html/2606.23880#A6)\)\.

#### Nonlinear CI skeleton\.

Replacing the linear skeleton with RCoT\(Strobl et al\.,[2019](https://arxiv.org/html/2606.23880#bib.bib23)\)improves F1 slightly \(0\.944 vs\. 0\.922 atd=100d=100/T=2000T=2000\) but at75×75\\timesthe cost; at lowerTT, ParCorr\-Gated matches or exceeds RCoT\-Gated \(Appendix[C](https://arxiv.org/html/2606.23880#A3)\)\.

#### Ablations and generalization\.

Without a skeleton,L0L\_\{0\}suppression alone cannot recover the signal at highdd\(F1 degrades substantially\); the adaptiveλ\\lambdaformula consistently outperforms fixed values; lag overspecification is handled gracefully \(F1≥0\.878F\_\{1\}\\geq 0\.878\) viaL0L\_\{0\}suppression \(Appendix[G](https://arxiv.org/html/2606.23880#A7)\)\.GRACEgeneralizes across skeleton sources: PCMCI\-Gated yields33–5×5\\timesF1 over raw PCMCI \(Appendix[H](https://arxiv.org/html/2606.23880#A8)\), and DYNOTEARS\-Gated matches CDNOTS\-Gated atd≤50d\\leq 50\(Appendix[I](https://arxiv.org/html/2606.23880#A9)\)\. It also generalizes across non\-additive mechanisms \(Lorenz\-96; Appendix[J](https://arxiv.org/html/2606.23880#A10)\) and diverse graph topologies \(scale\-free, Erdős–Rényi, small\-world; Appendix[K](https://arxiv.org/html/2606.23880#A11)\)\.

### 3\.4Real\-World Evaluation: Elbe River Network

Synthetic benchmarks offer controlled evaluation but cannot capture the compounding assumption violations present in real\-world systems\. We evaluateGRACEon a river network dataset\(Stein et al\.,[2025](https://arxiv.org/html/2606.23880#bib.bib22)\)providing water level measurements along river networks in eastern Germany, where ground truth is determined by physical flow direction \(upstream→\\todownstream\)\.

#### Challenges\.

River data violates standard assumptions: rainfall acts as a time\-varying, spatially\-varying hidden confounder; causal lags vary with water level \(higher flow==faster propagation\); seasonal floods create distributional shifts; and reservoirs disconnect natural flow\. These challenges make causal discovery on river data substantially harder than on synthetic benchmarks\(Stein et al\.,[2025](https://arxiv.org/html/2606.23880#bib.bib22)\)\.

#### Setup\.

We focus on the Elbe River’s main branch in eastern Germany: 12 gauging stations forming a chain graph with 11 directed edges \(upstream→\\todownstream\), measured at 6\-hour resolution over 2019–2023 \(T=3,164T=3\{,\}164\)\. We compare PCMCI, CDNOTS, single\-runGRACE, andGRACE\-Bootstrap\. When applied to the full multi\-year time series, all skeleton methods achieve perfect recall but extremely poor precision due to ubiquitous confounding—e\.g\., CDNOTS recovers all 11 true edges but produces 106 false positives \(F1=0\.17F\_\{1\}=0\.17\)\.

#### Temporal bootstrap\.

We applyGRACEindependently to random 1\-month windows and retain edges appearing in more than aτ\\taufraction of runs—a temporal analogue of stability selection\(Meinshausen and Bühlmann,[2010](https://arxiv.org/html/2606.23880#bib.bib14)\)\. Because confounders like rainfall are transient and episodic, they inflate correlations only in the windows they affect; true flow edges, by contrast, persist across all hydrological conditions\. Windowing thus separates stable causal signal from transient confounding without requiring confounder measurements\.

#### Results\.

We setτ=0\.70\\tau=0\.70, requiring an edge to appear in more than 70% of windows to be retained—a conservative threshold that does not require ground\-truth calibration and is interpretable on its own terms\. Table[3](https://arxiv.org/html/2606.23880#S3.T3)shows thatGRACE\-Bootstrap \(50 runs,τ=0\.70\\tau=0\.70\) recovers 9 of 11 edges with only 1 false positive \(F1=0\.86F\_\{1\}=0\.86, Precision==0\.90\), reducing CDNOTS’s 106 FPs by 99%\. The bootstrap AUROC reaches 0\.986, showing that edge frequency provides an excellent ranking even when the binary threshold excludes borderline edges\.

Table 3:Elbe River’s main branch \(d=12d=12, 11 true edges,T=3,164T=3\{,\}164, 6h resolution\)\(Stein et al\.,[2025](https://arxiv.org/html/2606.23880#bib.bib22)\)\. The CI skeleton achieves perfect recall but produces 106 false positives\.GRACE\-Bootstrap appliesGRACEto 50 random 1\-month windows and thresholds on edge frequency \(τ=0\.70\\tau=0\.70\)\.MethodAUROCPrecRecF1FPCDNOTS \(skeleton\)\.839\.0941\.00\.171106GRACE\(single run\)\.938\.1381\.00\.24269GRACE\-Bootstrap\.986\.900\.818\.8571Together, the synthetic experiments \(Section[3\.3](https://arxiv.org/html/2606.23880#S3.SS3)\) and this real\-world evaluation cover complementary axes of difficulty: the former tests scalability under controlled conditions, the latter tests robustness when causal assumptions are simultaneously and severely violated—a combination that no single benchmark can address\.

## 4Conclusion

We presentedGRACE, a framework that refines constraint\-based causal discovery via Hard Concrete gates withL0L\_\{0\}regularization\. On synthetic benchmarks,GRACEnearly doubles F1 over raw CDNOTS atd=100d=100\(0\.92 vs\. 0\.53\) with 97% precision, generalizing across skeleton sources \(PCMCI, DYNOTEARS\) and graph topologies\. On the real\-world Elbe River benchmark\(Stein et al\.,[2025](https://arxiv.org/html/2606.23880#bib.bib22)\), a temporal bootstrap variant achievesF1=0\.86F\_\{1\}=0\.86with AUROC=0\.99\{\}=0\.99despite hidden confounders, variable lags, and distributional shifts\.

#### Limitations\.

On dense graphs \(scale\-free hubs atd≳70d\\gtrsim 70\), per\-edge SNR decreases and gate values cluster below threshold—AUROC remains high but binary F1 degrades, partially mitigated by more data\. Spurious edges from latent confounders may survive if they improve prediction\. The linear skeleton may miss purely symmetric nonlinear dependencies \(e\.g\.,x2x^\{2\}\) that induce no linear association; a nonlinear CI skeleton mitigates this at higher cost \(Appendix[C](https://arxiv.org/html/2606.23880#A3)\)\. Gated refinement trades a small number of true positives for a large precision gain; the recall reduction is minor in our experiments but may be more pronounced at very low sample sizes\.

## References

- Assaad et al\. \[2022\]Charles K Assaad, Emilie Devijver, and Eric Gaussier\.Survey and evaluation of causal discovery methods for time series\.*Journal of Artificial Intelligence Research*, 73:767–819, 2022\.
- Barabási and Albert \[1999\]Albert\-László Barabási and Réka Albert\.Emergence of scaling in random networks\.*Science*, 286\(5439\):509–512, 1999\.
- Bussmann et al\. \[2021\]Bart Bussmann, Jannes Nys, and Steven Latré\.Neural additive vector autoregression models for causal discovery in time series\.In*International Conference on Discovery Science*, pages 446–460\. Springer, 2021\.
- Cheng et al\. \[2024\]Yuxiao Cheng, Lianglong Li, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qin Zhong\.CUTS\+: High\-dimensional causal discovery from irregular time\-series\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, 2024\.
- Fesanghary and Gopal \[2025\]Mohammad Fesanghary and Achintya Gopal\.Efficient causal discovery for autoregressive time series, 2025\.URL[https://arxiv\.org/abs/2507\.07898](https://arxiv.org/abs/2507.07898)\.
- Gerhardus and Runge \[2020\]Andreas Gerhardus and Jakob Runge\.High\-recall causal discovery for autocorrelated time series with latent confounders\.In*Advances in Neural Information Processing Systems*, volume 33, pages 12615–12625, 2020\.
- Gong et al\. \[2023\]Chang Gong, Di Yao, Chuzhe Zhang, Wenbin Li, and Jingping Bi\.Causal discovery from temporal data: An overview and new perspectives\.*arXiv preprint arXiv:2303\.10112*, 2023\.
- Hagberg et al\. \[2008\]Aric A\. Hagberg, Daniel A\. Schult, and Pieter J\. Swart\.Exploring network structure, dynamics, and function using NetworkX\.In*Proceedings of the 7th Python in Science Conference \(SciPy\)*, pages 11–15, 2008\.
- Jang et al\. \[2017\]Eric Jang, Shixiang Gu, and Ben Poole\.Categorical reparameterization with Gumbel\-Softmax\.In*International Conference on Learning Representations*, 2017\.
- Kingma and Ba \[2015\]Diederik P Kingma and Jimmy Ba\.Adam: A method for stochastic optimization\.In*International Conference on Learning Representations*, 2015\.
- Lorenz \[1996\]Edward N Lorenz\.Predictability: A problem partly solved\.*Proc\. Seminar on Predictability, ECMWF*, 1:1–18, 1996\.
- Louizos et al\. \[2018\]Christos Louizos, Max Welling, and Diederik P Kingma\.Learning sparse neural networks throughL0L\_\{0\}regularization\.In*International Conference on Learning Representations*, 2018\.
- Maddison et al\. \[2017\]Chris J Maddison, Andriy Mnih, and Yee Whye Teh\.The concrete distribution: A continuous relaxation of discrete random variables\.In*International Conference on Learning Representations*, 2017\.
- Meinshausen and Bühlmann \[2010\]Nicolai Meinshausen and Peter Bühlmann\.Stability selection\.*Journal of the Royal Statistical Society: Series B \(Statistical Methodology\)*, 72\(4\):417–473, 2010\.
- Nauta et al\. \[2019\]Merel Nauta, Doina Bucur, and Christin Seifert\.Causal discovery with attention\-based convolutional neural networks\.*Machine Learning and Knowledge Extraction*, 1\(1\):312–340, 2019\.
- Pamfil et al\. \[2020\]Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Georgatzis, Paul Beaumont, and Bryon Aragam\.DYNOTEARS: Structure learning from time\-series data\.In*International Conference on Artificial Intelligence and Statistics*, pages 1595–1605\. PMLR, 2020\.
- Peters et al\. \[2017\]Jonas Peters, Dominik Janzing, and Bernhard Schölkopf\.*Elements of Causal Inference: Foundations and Learning Algorithms*\.MIT Press, 2017\.
- Runge \[2018\]Jakob Runge\.Conditional independence testing based on a nearest\-neighbor estimator of conditional mutual information\.In*International Conference on Artificial Intelligence and Statistics*, pages 938–947\. PMLR, 2018\.
- Runge \[2020\]Jakob Runge\.Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets\.In*Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence \(UAI\)*, pages 1388–1397\. PMLR, 2020\.
- Runge et al\. \[2019\]Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic\.Detecting and quantifying causal associations in large nonlinear time series datasets\.*Science Advances*, 5\(11\):eaau4996, 2019\.
- Sadeghi et al\. \[2025\]Agathe Sadeghi, Achintya Gopal, and Mohammad Fesanghary\.Causal discovery from nonstationary time series\.*International Journal of Data Science and Analytics*, 19:33–59, 2025\.doi:10\.1007/s41060\-024\-00679\-7\.
- Stein et al\. \[2025\]Gideon Stein, Maha Shadaydeh, Jan Blunk, Niklas Penzel, and Joachim Denzler\.CausalRivers – scaling up benchmarking of causal discovery for real\-world time\-series\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=wmV4cIbgl6](https://openreview.net/forum?id=wmV4cIbgl6)\.
- Strobl et al\. \[2019\]Eric V Strobl, Kun Zhang, and Shyam Visweswaran\.Approximate kernel\-based conditional independence tests for fast non\-parametric causal discovery\.*Journal of Causal Inference*, 7\(1\), 2019\.
- Tank et al\. \[2021\]Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily B Fox\.Neural Granger causality\.*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 44\(8\):4267–4279, 2021\.
- Watts and Strogatz \[1998\]Duncan J\. Watts and Steven H\. Strogatz\.Collective dynamics of ‘small\-world’ networks\.*Nature*, 393\(6684\):440–442, 1998\.
- Zhang et al\. \[2011\]Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf\.Kernel\-based conditional independence test and application in causal discovery\.In*Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence*, pages 804–813, 2011\.

## Appendix AGate Value Distribution

Figure[3](https://arxiv.org/html/2606.23880#A1.F3)plots the distribution of deterministic gate values \(Eq\.[9](https://arxiv.org/html/2606.23880#S2.E9)\) after training, partitioned by whether each skeleton edge corresponds to a true or false causal link\. Atd=100d=100, false edges cluster near zero \(mass at0\.00\.0and0\.250\.25–0\.400\.40\) while true edges concentrate at0\.550\.55–0\.850\.85, with almost no overlap at the0\.50\.5decision boundary\. Atd=50d=50, the separation is somewhat narrower, with minor overlap near0\.50\.5, consistent with the slightly lower F1 at this scale\. The clear bimodal pattern confirms that Hard Concrete gates learn to separate true from false edges without requiring post\-hoc threshold tuning: the0\.50\.5boundary falls squarely in the gap between the two modes\.

![Refer to caption](https://arxiv.org/html/2606.23880v1/x2.png)Figure 3:Distribution of deterministic gate values on SCP benchmarks \(T=1000T=1000, 3 seeds\)\. True causal edges \(blue\) concentrate above the0\.50\.5decision boundary; false edges \(yellow\) concentrate below it\. The separation is sharper at higherdd, where theL0L\_\{0\}penalty provides stronger sparsity pressure\.
## Appendix BTheoretical Analysis ofλ\\lambda\-Sensitivity

This appendix analyzes theλ\\lambda\-sensitivity of the gated model under a simplified linear Gaussian setting\. The results provide*qualitative intuition*for why a singleλ\\lambdacan separate true and false edges, but do not constitute identifiability guarantees for the full nonlinear model with MLP decoders\. We view these results as motivation for the design choices in Section[2\.6](https://arxiv.org/html/2606.23880#S2.SS6), not as formal recovery proofs\.

### B\.1Setup and Notation

Consider a stationary VAR process withddvariables and maximum lagLL:

xte=∑\(c,ℓ\)∈pa\(e\)βcℓext−ℓc\+εte,εte∼iid𝒩\(0,σe2\),x^\{e\}\_\{t\}=\\sum\_\{\(c,\\ell\)\\in\\text\{pa\}\(e\)\}\\beta\_\{c\\ell\}^\{e\}\\,x^\{c\}\_\{t\-\\ell\}\+\\varepsilon^\{e\}\_\{t\},\\qquad\\varepsilon^\{e\}\_\{t\}\\stackrel\{\{\\scriptstyle\\text\{iid\}\}\}\{\{\\sim\}\}\\mathcal\{N\}\(0,\\sigma\_\{e\}^\{2\}\),\(13\)wherepa\(e\)\\text\{pa\}\(e\)denotes the true parent set of variableee\(the set of\(c,ℓ\)\(c,\\ell\)pairs withβcℓe≠0\\beta\_\{c\\ell\}^\{e\}\\neq 0\)\. We analyze the linear encoder case; the nonlinear extension is discussed in the Remark \(Nonlinear extension\) below\.

The per\-sample loss for effectee, with weightwcℓew\_\{c\\ell\}^\{e\}for each gated input, is:

ℓe=12σ^e2\(xte−∑c,ℓzcℓewcℓext−ℓc\)2\+log⁡σ^e\+12log⁡2π\.\\ell\_\{e\}=\\frac\{1\}\{2\\hat\{\\sigma\}\_\{e\}^\{2\}\}\\left\(x^\{e\}\_\{t\}\-\\sum\_\{c,\\ell\}z\_\{c\\ell e\}\\,w\_\{c\\ell\}^\{e\}\\,x^\{c\}\_\{t\-\\ell\}\\right\)^\{2\}\+\\log\\hat\{\\sigma\}\_\{e\}\+\\tfrac\{1\}\{2\}\\log 2\\pi\.\(14\)The full training objective \(Eq\.[11](https://arxiv.org/html/2606.23880#S2.E11)\) averagesℓe\\ell\_\{e\}over bothTTsamples andddeffects—i\.e\.,NLL¯=1d∑e1T∑tℓe\(t\)\\overline\{\\text\{NLL\}\}=\\frac\{1\}\{d\}\\sum\_\{e\}\\frac\{1\}\{T\}\\sum\_\{t\}\\ell\_\{e\}\(t\)—and adds theL0L\_\{0\}penaltyλ⋅s∑c,ℓ,eℙ\(zcℓe≠0\)\\lambda\\cdot s\\sum\_\{c,\\ell,e\}\\mathbb\{P\}\(z\_\{c\\ell e\}\\neq 0\), wheres=1/⌈T/B⌉s=1/\\lceil T/B\\rceilis the per\-step normalization factor\. The1/d1/daveraging is important: it means each gate’s NLL contribution is scaled by1/d1/d, which enters the criticalλ\\lambdaformula below\.

### B\.2Per\-Edge NLL Reduction

###### Definition 1\(Marginal NLL reduction\)\.

For edge\(c,ℓ,e\)\(c,\\ell,e\), defineΔcℓe\\Delta\_\{c\\ell e\}as the decrease in average NLL when the gate is opened \(with optimal weightw∗w^\{\*\}\) compared to when it is forced closed:

Δcℓe=1T∑t=1T\[ℓe\|zcℓe=0−ℓe\|zcℓe=1,w=w∗\]\.\\Delta\_\{c\\ell e\}=\\frac\{1\}\{T\}\\sum\_\{t=1\}^\{T\}\\left\[\\ell\_\{e\}\\big\|\_\{z\_\{c\\ell e\}=0\}\-\\ell\_\{e\}\\big\|\_\{z\_\{c\\ell e\}=1,\\,w=w^\{\*\}\}\\right\]\.\(15\)

Under the linear Gaussian DGP \(Eq\.[13](https://arxiv.org/html/2606.23880#A2.E13)\), assuming the other gates and weights are at their population optima:

#### True edge\(\(c,ℓ\)∈pa\(e\)\)\(\(c,\\ell\)\\in\\text\{pa\}\(e\)\)\.

The optimal weight isw∗=βcℓew^\{\*\}=\\beta\_\{c\\ell\}^\{e\}, and the NLL reduction equals the variance uniquely explained by this edge—the variance ofxt−ℓcx^\{c\}\_\{t\-\\ell\}after projecting out all other parents already included:

Δcℓetrue=\(βcℓe\)2Var^\(xt−ℓc∣pa^−\(c,ℓ\)e\)2σe2,\\Delta\_\{c\\ell e\}^\{\\text\{true\}\}=\\frac\{\(\\beta\_\{c\\ell\}^\{e\}\)^\{2\}\\,\\widehat\{\\text\{Var\}\}\(x^\{c\}\_\{t\-\\ell\}\\mid\\widehat\{\\text\{pa\}\}^\{e\}\_\{\-\(c,\\ell\)\}\)\}\{2\\,\\sigma\_\{e\}^\{2\}\},\(16\)whereVar^\(xt−ℓc∣pa^−\(c,ℓ\)e\)\\widehat\{\\text\{Var\}\}\(x^\{c\}\_\{t\-\\ell\}\\mid\\widehat\{\\text\{pa\}\}^\{e\}\_\{\-\(c,\\ell\)\}\)is the empirical variance of the residual ofxt−ℓcx^\{c\}\_\{t\-\\ell\}after linear projection onto all other true parents ofee\. This is the conditional variance formula from multivariate regression: in a regression with correlated regressors, the incremental gain from one predictor depends on its residualized variance, not its marginal variance\. By the law of large numbers this converges toVar\(xt−ℓc∣pa−\(c,ℓ\)e\)\>0\\text\{Var\}\(x^\{c\}\_\{t\-\\ell\}\\mid\\text\{pa\}^\{e\}\_\{\-\(c,\\ell\)\}\)\>0\(positive unless a true parent is a perfect linear combination of others\), soΔtrue\\Delta^\{\\text\{true\}\}converges to a*positive constant*\.

#### False edge\(\(c,ℓ\)∉pa\(e\)\)\(\(c,\\ell\)\\notin\\text\{pa\}\(e\)\)\.

Opening a spurious gate adds one free parameter that can only fit noise\. By standard regression theory, the in\-sample NLL reduction from fitting one noise regressor toTTresiduals is:

Δcℓefalse=r^cℓe22\+O\(T−1\),\\Delta\_\{c\\ell e\}^\{\\text\{false\}\}=\\frac\{\\hat\{r\}\_\{c\\ell e\}^\{2\}\}\{2\}\+O\(T^\{\-1\}\),\(17\)wherer^cℓe\\hat\{r\}\_\{c\\ell e\}is the sample correlation between the residual andxt−ℓcx^\{c\}\_\{t\-\\ell\}\. Under the null hypothesis \(no causal effect\),T⋅r^2∼χ2\(1\)T\\cdot\\hat\{r\}^\{2\}\\sim\\chi^\{2\}\(1\), sor^2\\hat\{r\}^\{2\}is itselfO\(1/T\)O\(1/T\)and

𝔼\[Δfalse\]=12T\.\\mathbb\{E\}\[\\Delta^\{\\text\{false\}\}\]=\\frac\{1\}\{2T\}\.\(18\)

### B\.3Criticalλ\\lambdaand Separation

###### Proposition 1\(Edge\-wise criticalλ\\lambda\)\.

At a local minimum of the expected loss, gate\(c,ℓ,e\)\(c,\\ell,e\)is open \(i\.e\.,log⁡αcℓe\>0\\log\\alpha\_\{c\\ell e\}\>0, giving deterministic gatez\>0\.5z\>0\.5\) if and only if the per\-edge NLL reduction exceeds the marginalL0L\_\{0\}cost\. Define the*critical penalty*as:

λcℓe∗=Δcℓe⋅Cd⋅s,C=Rκ,\\lambda^\{\*\}\_\{c\\ell e\}=\\frac\{\\Delta\_\{c\\ell e\}\\cdot C\}\{d\\cdot s\},\\quad C=\\frac\{R\}\{\\kappa\},\(19\)whereκ=∂∂log⁡αℙ\(z≠0\)\|log⁡α=0\\kappa=\\frac\{\\partial\}\{\\partial\\log\\alpha\}\\mathbb\{P\}\(z\\neq 0\)\\big\|\_\{\\log\\alpha=0\}is the sensitivity of the gate\-open probability at the decision boundary, andR=𝔼\[∂z∂log⁡α\]\|log⁡α=0R=\\mathbb\{E\}\\\!\\left\[\\frac\{\\partial z\}\{\\partial\\log\\alpha\}\\right\]\\big\|\_\{\\log\\alpha=0\}is the expected reparameterization gradient of the gate value\. The factor1/d1/darises because the NLL is averaged overddeffects, so each gate’s contribution to the loss is scaled by1/d1/d\. With the default parameters\(τ=2/3,γ=−0\.1,ζ=1\.1\)\(\\tau=2/3,\\gamma=\-0\.1,\\zeta=1\.1\):

κ=σ\(c0\)\(1−σ\(c0\)\)≈0\.140,c0=−τlog⁡−γζ≈1\.60,\\kappa=\\sigma\(c\_\{0\}\)\(1\-\\sigma\(c\_\{0\}\)\)\\approx 0\.140,\\quad c\_\{0\}=\-\\tau\\log\\frac\{\-\\gamma\}\{\\zeta\}\\approx 1\.60,\(20\)andR≈0\.219R\\approx 0\.219\(computed via Monte Carlo over the Hard Concrete noise\), yieldingC=R/κ≈1\.56C=R/\\kappa\\approx 1\.56\. The gate is open whenλ<λcℓe∗\\lambda<\\lambda^\{\*\}\_\{c\\ell e\}and closed whenλ\>λcℓe∗\\lambda\>\\lambda^\{\*\}\_\{c\\ell e\}\.

###### Proof\.

The training loss decomposes asℒ=NLL¯\+λ⋅s∑c,ℓ,eℙ\(zcℓe≠0\)\\mathcal\{L\}=\\overline\{\\text\{NLL\}\}\+\\lambda\\cdot s\\sum\_\{c,\\ell,e\}\\mathbb\{P\}\(z\_\{c\\ell e\}\\neq 0\), whereNLL¯=1d∑e1T∑tℓe\(t\)\\overline\{\\text\{NLL\}\}=\\frac\{1\}\{d\}\\sum\_\{e\}\\frac\{1\}\{T\}\\sum\_\{t\}\\ell\_\{e\}\(t\)\. Since gate\(c,ℓ,e\)\(c,\\ell,e\)only affectsℓe\\ell\_\{e\}, the stationarity condition is:

∂ℒ∂log⁡αcℓe=−Δcℓed⋅𝔼\[∂z∂log⁡α\]⏟NLL gradient \(favors opening\)\+λ⋅s⋅∂ℙ\(z≠0\)∂log⁡αcℓe⏟L0gradient \(favors closing\)=0\.\\frac\{\\partial\\mathcal\{L\}\}\{\\partial\\log\\alpha\_\{c\\ell e\}\}=\\underbrace\{\-\\frac\{\\Delta\_\{c\\ell e\}\}\{d\}\\cdot\\mathbb\{E\}\\\!\\left\[\\frac\{\\partial z\}\{\\partial\\log\\alpha\}\\right\]\}\_\{\\text\{NLL gradient \(favors opening\)\}\}\+\\underbrace\{\\lambda\\cdot s\\cdot\\frac\{\\partial\\mathbb\{P\}\(z\\neq 0\)\}\{\\partial\\log\\alpha\_\{c\\ell e\}\}\}\_\{\\text\{$L\_\{0\}$ gradient \(favors closing\)\}\}=0\.\(21\)At the decision boundarylog⁡α=0\\log\\alpha=0, bothRRandκ\\kappaare strictly positive\. Setting the gradients equal and solving forλ\\lambdagivesλ∗=Δ⋅R/\(d⋅s⋅κ\)=Δ⋅C/\(d⋅s\)\\lambda^\{\*\}=\\Delta\\cdot R/\(d\\cdot s\\cdot\\kappa\)=\\Delta\\cdot C/\(d\\cdot s\)\. ∎

###### Corollary 1\(Sample\-size separation\)\.

Substituting the NLL reductions from Eqs\.[16](https://arxiv.org/html/2606.23880#A2.E16)–[17](https://arxiv.org/html/2606.23880#A2.E17), and writingVcℓecond=Var\(xt−ℓc∣pa−\(c,ℓ\)e\)V^\{\\text\{cond\}\}\_\{c\\ell e\}=\\text\{Var\}\(x^\{c\}\_\{t\-\\ell\}\\mid\\text\{pa\}^\{e\}\_\{\-\(c,\\ell\)\}\)for the conditional variance:

λcℓe∗true\\displaystyle\\lambda^\{\*\\text\{true\}\}\_\{c\\ell e\}=\(βcℓe\)2Vcℓecond⋅C2σe2⋅d⋅s\+o\(1\),\\displaystyle=\\frac\{\(\\beta\_\{c\\ell\}^\{e\}\)^\{2\}\\,V^\{\\text\{cond\}\}\_\{c\\ell e\}\\cdot C\}\{2\\,\\sigma\_\{e\}^\{2\}\\cdot d\\cdot s\}\+o\(1\),\(22\)λcℓe∗false\\displaystyle\\lambda^\{\*\\text\{false\}\}\_\{c\\ell e\}=C2T⋅d⋅s\+o\(T−1\)\.\\displaystyle=\\frac\{C\}\{2T\\cdot d\\cdot s\}\+o\(T^\{\-1\}\)\.\(23\)The separation ratio is:

λ∗trueλ∗false=T⋅\(βcℓe\)2Vcℓecondσe2=T⋅SNRcℓecond\.\\frac\{\\lambda^\{\*\\text\{true\}\}\}\{\\lambda^\{\*\\text\{false\}\}\}=T\\cdot\\frac\{\(\\beta\_\{c\\ell\}^\{e\}\)^\{2\}\\,V^\{\\text\{cond\}\}\_\{c\\ell e\}\}\{\\sigma\_\{e\}^\{2\}\}=T\\cdot\\text\{SNR\}^\{\\text\{cond\}\}\_\{c\\ell e\}\.\(24\)Note thatCC,dd, andsscancel in the ratio, so the separation depends only onTTand the edge’s*conditional*signal\-to\-noise ratioSNRcond=\(βcℓe\)2Vcℓecond/σe2\\text\{SNR\}^\{\\text\{cond\}\}=\(\\beta\_\{c\\ell\}^\{e\}\)^\{2\}V^\{\\text\{cond\}\}\_\{c\\ell e\}/\\sigma\_\{e\}^\{2\}\. For any minimum signal strengthSNRmin\>0\\text\{SNR\}\_\{\\min\}\>0, there existsT∗=1/SNRminT^\{\*\}=1/\\text\{SNR\}\_\{\\min\}such that forT\>T∗T\>T^\{\*\}, the interval\(λmax∗false,λmin∗true\)\(\\lambda^\{\*\\text\{false\}\}\_\{\\max\},\\,\\lambda^\{\*\\text\{true\}\}\_\{\\min\}\)is nonempty, and*any*λ\\lambdain this interval yields perfect edge recovery \(zero FP, full TP\)\. This bound uses the per\-edge expectedΔfalse\\Delta^\{\\text\{false\}\}; for a simultaneous guarantee over allnfalsen\_\{\\text\{false\}\}spurious edges, extreme\-value corrections onmaxi⁡r^i2\\max\_\{i\}\\hat\{r\}\_\{i\}^\{2\}increaseT∗T^\{\*\}toO\(log⁡\(nfalse\)/SNRmin\)O\(\\log\(n\_\{\\text\{false\}\}\)/\\text\{SNR\}\_\{\\min\}\)\.

## Appendix CNonlinear CI Test Results

Section[3\.3](https://arxiv.org/html/2606.23880#S3.SS3.SSS0.Px3)compares the fast linear CI skeleton \(ParCorr\) with a nonlinear one \(RCoT\) in terms of F1 accuracy and runtime cost\. Here we present the full results\. We compare CDNOTS\-Gated with the default ParCorr skeleton against CDNOTS\-Gated with an RCoT skeleton, reporting both F1 and runtime\.

Table 4:F1 scores and runtime \(seconds\) for ParCorr\-Gated vs\. RCoT\-Gated \(mean±\\pmstd over 10 seeds\)\. Best F1 per row inbold\. RCoT\-Gated generally outperforms ParCorr\-Gated, but at 20–75×\\timesthe runtime cost\.ddTTParCorr\-GatedF1RCoT\-GatedF1ParCorrTime \(s\)RCoTTime \(s\)20300\.794±\\pm\.05\.804±\\pm\.0638±\\pm8380±\\pm282500\.845±\\pm\.04\.853±\\pm\.0442±\\pm9726±\\pm5421000\.868±\\pm\.04\.872±\\pm\.0438±\\pm9852±\\pm6882000\.893±\\pm\.03\.911±\\pm\.0443±\\pm101,129±\\pm1,01830300\.804±\\pm\.04\.840±\\pm\.0456±\\pm5611±\\pm153500\.851±\\pm\.05\.879±\\pm\.0353±\\pm111,189±\\pm3471000\.893±\\pm\.03\.900±\\pm\.0355±\\pm131,554±\\pm4492000\.908±\\pm\.04\.932±\\pm\.0361±\\pm131,845±\\pm63550300\.841±\\pm\.03\.847±\\pm\.0399±\\pm31,874±\\pm610500\.886±\\pm\.03\.897±\\pm\.0295±\\pm53,479±\\pm1,1581000\.915±\\pm\.02\.928±\\pm\.02100±\\pm94,060±\\pm1,3262000\.911±\\pm\.03\.941±\\pm\.02105±\\pm104,671±\\pm1,42970300\.849±\\pm\.05\.855±\\pm\.04129±\\pm74,385±\\pm1,638500\.896±\\pm\.03\.859±\\pm\.05129±\\pm27,926±\\pm2,7431000\.927±\\pm\.02\.918±\\pm\.03159±\\pm99,287±\\pm3,1182000\.929±\\pm\.02\.943±\\pm\.03156±\\pm1010,074±\\pm2,954100300\.830±\\pm\.06\.804±\\pm\.08181±\\pm158,321±\\pm3,745500\.878±\\pm\.04\.875±\\pm\.06234±\\pm1114,971±\\pm6,5671000\.920±\\pm\.02\.916±\\pm\.04228±\\pm817,059±\\pm7,1282000\.922±\\pm\.02\.944±\\pm\.03243±\\pm2118,962±\\pm7,103Table[4](https://arxiv.org/html/2606.23880#A3.T4)compares both F1 and runtime for the two Gated variants\. RCoT\-Gated outperforms ParCorr\-Gated at highTT, since RCoT can detect nonlinear conditional dependencies that linear partial correlation misses\. Atd=100d=100/T=2000T=2000, RCoT\-Gated achieves F1==0\.944 vs\. 0\.922 for ParCorr\-Gated\. However, at moderateTT, the gap narrows or reverses \(e\.g\., 0\.916 vs\. 0\.920 atd=100d=100/T=1000T=1000\), while the runtime cost is substantial: atd=50d=50/T=1000T=1000, RCoT\-Gated requires 4,060s vs\. 100s for ParCorr\-Gated—a 40×\\timesslowdown\. Atd=100d=100, the ratio reaches 75×\\times\(∼17,000\{\\sim\}17\{,\}000s vs\.∼228\{\\sim\}228s\)\. The practical conclusion is clear: ford≥50d\\geq 50, ParCorr\-Gated is the most practical choice, with RCoT offering diminishing returns at disproportionate cost\.

## Appendix DSHD Results

We report the Structural Hamming Distance \(SHD\)—the total number of edge additions, deletions, and reversals needed to transform the predicted graph into the true graph—to complement the F1 results in the main text\. Lower SHD indicates a predicted graph closer to ground truth\.

Table 5:SHD on sparse SCP benchmarks \(mean±\\pmstd over 10 seeds\)\. Best per row inbold\. CDNOTS\-Gated achieves the lowest SHD at all dimensionalities\.ddTTPCMCICDNOTSTCDFCDNOTS\-Gated20300104±\\pm919±\\pm532±\\pm917±\\pm5500108±\\pm1217±\\pm426±\\pm1013±\\pm41000104±\\pm1614±\\pm423±\\pm1011±\\pm42000106±\\pm1814±\\pm622±\\pm109±\\pm430300226±\\pm1231±\\pm549±\\pm1124±\\pm6500227±\\pm1630±\\pm640±\\pm1419±\\pm71000226±\\pm926±\\pm832±\\pm1513±\\pm42000229±\\pm1725±\\pm431±\\pm1511±\\pm450300642±\\pm2166±\\pm1385±\\pm1830±\\pm7500638±\\pm3269±\\pm1462±\\pm1921±\\pm71000627±\\pm1864±\\pm1249±\\pm1516±\\pm42000619±\\pm3174±\\pm1545±\\pm1418±\\pm6703001246±\\pm34117±\\pm21128±\\pm1937±\\pm165001233±\\pm36118±\\pm1892±\\pm2426±\\pm1010001241±\\pm25124±\\pm2560±\\pm2118±\\pm720001210±\\pm14139±\\pm2652±\\pm1918±\\pm61003002575±\\pm43218±\\pm36159±\\pm2452±\\pm255002530±\\pm54243±\\pm42120±\\pm3841±\\pm1910002490±\\pm42281±\\pm4379±\\pm3827±\\pm1220002478±\\pm47313±\\pm5465±\\pm3426±\\pm10Table[5](https://arxiv.org/html/2606.23880#A4.T5)confirms the F1 findings: CDNOTS\-Gated achieves the lowest SHD at all dimensionalities and sample sizes\. The SHD reduction over raw CDNOTS is particularly striking at highddwith sufficient samples: atd=100d=100/T=1000T=1000, CDNOTS\-Gated achieves SHD==27 vs\. 281 for CDNOTS—a 90% reduction in graph edit distance\. PCMCI’s SHD grows dramatically withdd\(reaching∼2,500\{\\sim\}2\{,\}500atd=100d=100\), reflecting its high false positive rate, while TCDF achieves competitive SHD at highTTbut is less consistent at low sample sizes\.

## Appendix EScore\-Based Baseline: CUTS\+ Comparison

Since CUTS\+\[Cheng et al\.,[2024](https://arxiv.org/html/2606.23880#bib.bib4)\]outputs a pairwise probability matrix without lag resolution, we report*pair\-level*metrics: the lag dimension is collapsed \(an edge exists if any lag is active\) before computing precision, recall, and F1\. We use their published default settings: threshold0\.50\.5, 64 training epochs, GRU hidden size 32, and Gumbel temperature annealing fromτ=1\\tau=1toτ=0\.1\\tau=0\.1\. The input window is set to match the true maximum lag \(L=5L=5\)\.

Table 6:Pair\-level F1 comparison betweenGRACE\(CDNOTS\-Gated\) and CUTS\+ on sparse SCP benchmarks \(mean±\\pmstd over 10 seeds\)\. CUTS\+ achieves high recall but very low precision, resulting in pair F1 that degrades sharply with dimensionality\.GRACEresults use the same pair\-level evaluation \(lags collapsed\) for fair comparison\.ddMethodT=300T=300T=500T=500T=1000T=1000T=2000T=200020CDNOTS\-Gated\.662±\\pm\.07\.736±\\pm\.07\.813±\\pm\.07\.844±\\pm\.06CUTS\+\.346±\\pm\.06\.441±\\pm\.08\.510±\\pm\.05\.620±\\pm\.0850CDNOTS\-Gated\.732±\\pm\.07\.814±\\pm\.06\.824±\\pm\.09\.878±\\pm\.04CUTS\+\.112±\\pm\.04\.139±\\pm\.05\.186±\\pm\.06\.400±\\pm\.08100CDNOTS\-Gated\.791±\\pm\.04\.821±\\pm\.04\.834±\\pm\.05\.859±\\pm\.04CUTS\+\.035±\\pm\.02\.043±\\pm\.02\.055±\\pm\.03\.117±\\pm\.07Table 7:Pair\-level precision and recall atT=1000T=1000\(mean over 10 seeds\)\. CUTS\+ opens most gates \(high recall\) but cannot distinguish true from false edges \(low precision\), especially at highdd\.d=20d=20d=50d=50d=100d=100MethodPrecRecPrecRecPrecRecCDNOTS\-Gated\.975\.702\.885\.797\.922\.763CUTS\+\.372\.838\.105\.903\.029\.912Table[6](https://arxiv.org/html/2606.23880#A5.T6)shows that CUTS\+ pair\-level F1 degrades sharply with dimensionality: from0\.5100\.510atd=20d=20to0\.0550\.055atd=100d=100\(T=1000T=1000\)\. Table[7](https://arxiv.org/html/2606.23880#A5.T7)reveals the underlying cause: CUTS\+ achieves high recall \(\>83%\>83\\%\) by opening most Gumbel\-Softmax gates, but precision collapses from37%37\\%atd=20d=20to2\.9%2\.9\\%atd=100d=100\. Its Gumbel\-Softmax gates produce continuous probabilities that cluster in a narrow range \(typically0\.360\.36–0\.710\.71\), making it difficult to separate true from false edges regardless of threshold choice\. By contrast,GRACE’s Hard Concrete gates concentrate near exact zero or one \(Appendix[A](https://arxiv.org/html/2606.23880#A1)\), yielding a robust binary decision at the0\.50\.5boundary\.

## Appendix FRuntime

Table[8](https://arxiv.org/html/2606.23880#A6.T8)and Figure[4](https://arxiv.org/html/2606.23880#A6.F4)compare wall\-clock runtime atT=1000T=1000\. CDNOTS\-Gated adds22–3\.5×3\.5\\timesoverhead over raw CDNOTS while achieving substantially higher F1\. The runtime is dominated by the CDNOTS skeleton at highdd; the gated training phase itself scales linearly in the number of skeleton edges\.

Table 8:Runtime in seconds atT=1000T=1000\(mean over 10 seeds\)\. CDNOTS\-Gated is practical even atd=100d=100\.Methodd=20d=20d=30d=30d=50d=50d=70d=70d=100d=100PCMCI5103275187CDNOTS472050138TCDF46101218CDNOTS\-Gated3857112186345![Refer to caption](https://arxiv.org/html/2606.23880v1/x3.png)Figure 4:Wall\-clock runtime vs\.dd\(log scale\)\. CDNOTS\-Gated \(black\) adds moderate overhead over CDNOTS while achieving substantially higher F1\. CDNOTS\(RCoT\)\-Gated \(dark gray\) achieves higher F1 but at50×50\\timesthe cost atd=100d=100\.
## Appendix GAblation Study

We ablate three design choices in GRACE on the same SCP benchmarks used in the main experiments \(d∈\{50,100\}d\\in\\\{50,100\\\},T∈\{500,1000\}T\\in\\\{500,1000\\\}, 10 seeds each\)\. The baseline is GRACE with full CDNOTS skeleton and the adaptiveλ\\lambdaformula\.

#### Skeleton quality\.

Table[9](https://arxiv.org/html/2606.23880#A7.T9)evaluates three skeleton configurations: \(i\)*No skeleton*—alld2\(L\+1\)−dd^\{2\}\(L\+1\)\-dcandidate edges are passed to the gated model; \(ii\)*Shallow CI*\(max\_degree=1\\texttt\{max\\\_degree\}=1\)—only pairwise and first\-order conditional independence tests; \(iii\)*Full CDNOTS*\(default\)—conditioning sets up to the maximum graph degree\. Without any skeleton pruning, the gated model faces∼\\sim15,000 candidate edges atd=50d=50for∼\\sim100 true edges; noλ\\lambdasetting can recover signal at this ratio \(F1<<0\.02\)\. Shallow CI improves substantially \(F1==0\.46–0\.71\) but still underperforms full CDNOTS by 25–50% in F1, reflecting the spurious edges that higher\-order conditioning would remove\.

Table 9:Skeleton quality ablation: F1 scores \(mean±\\pmstd over 10 seeds\)\.d=50d=50d=100d=100SkeletonT=500T=500T=1000T=1000T=500T=500T=1000T=1000No skeleton\.012±\\pm\.01\.013±\\pm\.01\.006±\\pm\.00\.006±\\pm\.00Shallow CI \(k=1k=1\)\.458±\\pm\.03\.614±\\pm\.07\.697±\\pm\.07\.635±\\pm\.08Full CDNOTS\.886±\\pm\.03\.915±\\pm\.02\.878±\\pm\.04\.920±\\pm\.02
#### Lambda sensitivity\.

Table[10](https://arxiv.org/html/2606.23880#A7.T10)compares fixedλ\\lambdavalues against the adaptive formula \(Eq\.[12](https://arxiv.org/html/2606.23880#S2.E12)\)\. A fixedλ=0\.01\\lambda=0\.01performs comparably atd=50d=50, since both yield similar values when the skeleton is sparse\. Atd=100d=100, the adaptive formula’s1/d1/dterm lowersλ\\lambdaappropriately, maintaining high F1 where a fixed value would overregularize\.λ=0\.05\\lambda=0\.05already degrades atd=100d=100\(F1 drops from 0\.92 to 0\.49\), andλ≥0\.1\\lambda\\geq 0\.1aggressively penalizes gates, suppressing most true edges\.

Table 10:λ\\lambdasensitivity: F1 scores \(mean±\\pmstd over 10 seeds\)\.d=50d=50d=100d=100λ\\lambdaT=500T=500T=1000T=1000T=500T=500T=1000T=1000λ=0\.01\\lambda=0\.01\.739±\\pm\.05\.846±\\pm\.04\.893±\\pm\.04\.895±\\pm\.03λ=0\.05\\lambda=0\.05\.763±\\pm\.07\.765±\\pm\.07\.491±\\pm\.14\.485±\\pm\.13λ=0\.1\\lambda=0\.1\.524±\\pm\.12\.529±\\pm\.11\.246±\\pm\.07\.242±\\pm\.07λ=0\.5\\lambda=0\.5\.008±\\pm\.01\.015±\\pm\.02\.007±\\pm\.01\.007±\\pm\.01Adaptive\.886±\\pm\.03\.915±\\pm\.02\.878±\\pm\.04\.920±\\pm\.02
#### Lag misspecification robustness\.

In practice, the true maximum causal lagLLis unknown and must be specified by the practitioner\. We evaluate robustness to lag misspecification by generating SCP graphs with true max lagL=5L=5\(d=50d=50,T=1000T=1000, 10 seeds\) and running each method with assumed search lagsLsearch∈\{2,3,5,7,10\}L\_\{\\text\{search\}\}\\in\\\{2,3,5,7,10\\\}\. Evaluation uses the ground truth padded \(or truncated\) toLsearchL\_\{\\text\{search\}\}, so that edges beyond the search horizon count as false negatives for all methods equally\.

Table 11:F1 vs\. assumed max lag \(true max lag = 5\),d=50d=50,T=1000T=1000\(mean±\\pmstd over 10 seeds\)\.MethodL=2L=2L=3L=3L=5∗L=5^\{\*\}L=7L=7L=10L=10GRACE\.793±\\pm\.024\.825±\\pm\.036\.906±\\pm\.022\.878±\\pm\.026\.891±\\pm\.022CDNOTS\.626±\\pm\.030\.676±\\pm\.050\.732±\\pm\.060\.714±\\pm\.056\.679±\\pm\.048∗True max lag\.Under underspecification \(Lsearch<5L\_\{\\text\{search\}\}<5\), late\-lag true edges are outside the search space and count as false negatives for all methods equally\. GRACE degrades from0\.9060\.906at the correct lag to0\.7930\.793atL=2L=2, while CDNOTS drops further to0\.6260\.626—a gap of0\.1670\.167vs\.0\.1130\.113points respectively, showing that GRACE is more robust even in the underspecified regime\. Under overspecification \(Lsearch\>5L\_\{\\text\{search\}\}\>5\), the advantage is more striking: GRACE maintainsF1=0\.878F\_\{1\}=0\.878–0\.8910\.891atL∈\{7,10\}L\\in\\\{7,10\\\}\(precision≥0\.908\\geq 0\.908\), because the Hard ConcreteL0L\_\{0\}penalty suppresses the extra false\-lag candidates\. CDNOTS has no such suppression mechanism and loses0\.0530\.053F1 points atL=10L=10relative toL=5L=5, as spurious cross\-lag edges inflate false positives\.

## Appendix HPCMCI\-Gated: Refinement Across Skeleton Sources

To demonstrate thatGRACE’s gated refinement is not tied to a specific skeleton algorithm, we replace the CDNOTS skeleton with PCMCI\[Runge et al\.,[2019](https://arxiv.org/html/2606.23880#bib.bib20)\]and apply the same gating pipeline \(denoted PCMCI\-Gated\)\. PCMCI provides an ideal stress test: it achieves very high recall \(\>90%\>90\\%\) but poor precision that degrades sharply with dimensionality \(from∼30%\{\\sim\}30\\%atd=20d=20to∼6%\{\\sim\}6\\%atd=100d=100\), producing a skeleton with many false positives for gating to prune\.

Table 12:F1 scores for PCMCI vs\. PCMCI\-Gated on sparse SCP benchmarks \(mean±\\pmstd over 10 seeds\)\. Gated refinement consistently improves PCMCI, with the largest gains at highddand sufficientTT\.ddMethodT=300T=300T=500T=500T=1000T=1000T=2000T=200020PCMCI\.436±\\pm\.03\.433±\\pm\.02\.455±\\pm\.03\.454±\\pm\.03PCMCI\-Gated\.427±\\pm\.03\.431±\\pm\.03\.460±\\pm\.04\.615±\\pm\.0730PCMCI\.341±\\pm\.02\.346±\\pm\.03\.354±\\pm\.03\.356±\\pm\.03PCMCI\-Gated\.331±\\pm\.02\.343±\\pm\.03\.537±\\pm\.05\.733±\\pm\.0650PCMCI\.221±\\pm\.02\.228±\\pm\.02\.236±\\pm\.02\.241±\\pm\.02PCMCI\-Gated\.242±\\pm\.03\.427±\\pm\.06\.758±\\pm\.07\.752±\\pm\.0870PCMCI\.161±\\pm\.02\.168±\\pm\.02\.171±\\pm\.02\.177±\\pm\.02PCMCI\-Gated\.345±\\pm\.04\.687±\\pm\.06\.710±\\pm\.07\.696±\\pm\.08100PCMCI\.105±\\pm\.02\.113±\\pm\.02\.117±\\pm\.02\.119±\\pm\.02PCMCI\-Gated\.605±\\pm\.11\.602±\\pm\.12\.602±\\pm\.13\.527±\\pm\.16Table 13:Precision and TPR atT=1000T=1000for PCMCI vs\. PCMCI\-Gated \(mean over 10 seeds\)\. Gating dramatically improves precision at the cost of some recall, converting PCMCI’s high\-recall/low\-precision profile into a more balanced one\.d=20d=20d=30d=30d=50d=50d=70d=70d=100d=100MethodPrecTPRPrecTPRPrecTPRPrecTPRPrecTPRPCMCI\.298\.973\.218\.969\.135\.978\.094\.984\.062\.982PCMCI\-Gated\.334\.752\.420\.755\.806\.719\.743\.695\.661\.590Table 14:SHD for PCMCI vs\. PCMCI\-Gated on sparse SCP benchmarks \(mean±\\pmstd over 10 seeds\)\. Lower is better\. Gating reduces SHD by up to17×17\\timesat highdd\.ddMethodT=300T=300T=500T=500T=1000T=1000T=2000T=200020PCMCI104±\\pm8108±\\pm12104±\\pm15106±\\pm17PCMCI\-Gated87±\\pm688±\\pm879±\\pm1244±\\pm1330PCMCI226±\\pm12227±\\pm15226±\\pm9229±\\pm17PCMCI\-Gated189±\\pm13182±\\pm1183±\\pm835±\\pm1150PCMCI642±\\pm20638±\\pm31627±\\pm17619±\\pm30PCMCI\-Gated459±\\pm24192±\\pm2248±\\pm2250±\\pm2970PCMCI1246±\\pm331233±\\pm341241±\\pm241210±\\pm14PCMCI\-Gated360±\\pm3488±\\pm3678±\\pm3383±\\pm38100PCMCI2575±\\pm412530±\\pm522490±\\pm402478±\\pm44PCMCI\-Gated145±\\pm70149±\\pm90150±\\pm91234±\\pm190Tables[12](https://arxiv.org/html/2606.23880#A8.T12)and[14](https://arxiv.org/html/2606.23880#A8.T14)show that gated refinement consistently improves PCMCI across all dimensionalities\. Atd=50d=50/T=1000T=1000, F1 improves from 0\.236 to 0\.758 \(3\.2×3\.2\\times\) and SHD drops from 627 to 48 \(13×13\\times\); atd=100d=100/T=1000T=1000, F1 improves from 0\.117 to 0\.602 \(5\.1×5\.1\\times\) and SHD from 2490 to 150 \(17×17\\times\)\. Table[13](https://arxiv.org/html/2606.23880#A8.T13)reveals the mechanism: gating converts PCMCI’s extreme recall/low\-precision profile \(e\.g\., 97% TPR / 6% precision atd=100d=100\) into a more balanced one \(59% TPR / 66% precision\)\.

Comparing with CDNOTS\-Gated \(Table[1](https://arxiv.org/html/2606.23880#S3.T1)\), PCMCI\-Gated achieves lower F1 at all configurations—e\.g\., 0\.602 vs\. 0\.920 atd=100d=100/T=1000T=1000—because PCMCI’s skeleton has far more false positives than CDNOTS’s\. This confirms thatGRACEdelivers consistent and substantial gains regardless of the skeleton source—up to5×5\\timesF1 improvement—while the final performance ceiling is set by the skeleton’s recall\.

## Appendix IScore\-Based Skeleton: DYNOTEARS

The preceding appendix demonstrated thatGRACEimproves CI\-based skeletons \(PCMCI\)\. Here we test whether the same gating mechanism can refine a*score\-based*skeleton produced by DYNOTEARS\[Pamfil et al\.,[2020](https://arxiv.org/html/2606.23880#bib.bib16)\], which fits anL1L\_\{1\}\-penalized vector autoregression \(LASSO VAR\) per target variable using cross\-validated penalty selection\. LASSO VAR produces continuous coefficient magnitudes, and the central challenge is converting them to a binary graph: setting the threshold too low retains many false positives, while setting it too high discards true edges\. This is precisely the thresholding problem thatGRACE’s Hard Concrete gates solve automatically\.

#### Oracle threshold analysis\.

To quantify the thresholding difficulty, we sweep all possible coefficient thresholds on each seed and record the value that maximizes F1 given the ground truth \(the “oracle” threshold\)\. Across all configurations \(d∈\{20,50,100\}d\\in\\\{20,50,100\\\},T∈\{500,1000\}T\\in\\\{500,1000\\\}, 10 seeds each\), the oracle threshold clusters in the range 0\.04–0\.09 but varies up to5×5\\timesacross seeds \(e\.g\., 0\.038–0\.188 atd=20d=20\)\. A fixed threshold oft=0\.05t=0\.05recovers∼97%\{\\sim\}97\\%of the oracle F1, making it a reasonable lower bound—but selecting it required access to the ground truth, which is unavailable in practice\.

#### Results\.

Table[15](https://arxiv.org/html/2606.23880#A9.T15)compares four configurations: DYNOTEARS with the default threshold \(t=0t=0, all nonzero coefficients\), DYNOTEARS with the post\-hoc oracle\-informed threshold \(t=0\.05t=0\.05\), DYNOTEARS\-Gated \(GRACE applied to thet=0\.05t=0\.05skeleton\), and CDNOTS\-Gated \(the default GRACE pipeline\)\.

Table 15:F1 scores for DYNOTEARS variants and CDNOTS\-Gated on sparse SCP benchmarks \(mean±\\pmstd over 10 seeds\)\.†\\daggerPost\-hoc threshold selected via oracle analysis; unavailable in practice\.d=20d=20d=50d=50d=100d=100MethodT=500T=500T=1000T=1000T=500T=500T=1000T=1000T=500T=500T=1000T=1000DYNOTEARS \(t=0t=0\)\.238±\\pm\.04\.222±\\pm\.03\.168±\\pm\.03\.150±\\pm\.03\.149±\\pm\.03\.129±\\pm\.03DYNOTEARS \(t=0\.05t=0\.05\)†\.861±\\pm\.05\.892±\\pm\.05\.843±\\pm\.05\.911±\\pm\.04\.822±\\pm\.03\.902±\\pm\.03DYNO\-Gated \(t=0\.05t=0\.05\)†\.871±\\pm\.04\.907±\\pm\.04\.866±\\pm\.05\.924±\\pm\.03\.824±\\pm\.04\.908±\\pm\.03CDNOTS\-Gated\.833±\\pm\.04\.862±\\pm\.04\.825±\\pm\.08\.897±\\pm\.04\.888±\\pm\.04\.907±\\pm\.03Several patterns emerge\. First, DYNOTEARS with the default threshold \(t=0t=0\) achieves F1 of only 0\.13–0\.24, confirming that raw LASSO coefficients without thresholding are unusable as a causal graph—the vast majority of nonzero coefficients are false positives\. Second, the oracle thresholdt=0\.05†t=0\.05^\{\\dagger\}dramatically improves DYNOTEARS to F1 0\.82–0\.91, demonstrating that a good threshold*exists*but requires ground truth to find\. Third, DYNOTEARS\-Gated \(t=0\.05†t=0\.05^\{\\dagger\}\) matches or slightly exceeds CDNOTS\-Gated atd≤50d\\leq 50, showing thatGRACEcan effectively refine score\-based skeletons when the skeleton quality is sufficient\. Fourth, CDNOTS\-Gated retains its advantage atd=100d=100/T=500T=500\(0\.888 vs\. 0\.824\), where the LASSO skeleton becomes sparser and misses more true edges\.

The key takeaway is thatGRACE’s Hard Concrete gates provide the same benefit to score\-based skeletons as to CI\-based ones: automatic, threshold\-free conversion of continuous edge scores into a binary causal graph\. However, the DYNOTEARS\-Gated results rely on a post\-hoc threshold \(†\\dagger\) to pre\-filter the skeleton before gating\. By contrast, CDNOTS\-Gated requires no such threshold—the CI\-based skeleton is already binary—making it the preferred default in practice\. This experiment confirms thatGRACEis compatible with diverse skeleton sources: it delivers consistent improvements over skeletons from both CI\-based methods \(CDNOTS, PCMCI\) and score\-based methods \(DYNOTEARS\), with the gating mechanism providing the same threshold\-free binary decision benefit across all skeleton types\.

## Appendix JLorenz\-96 Benchmark: Multiplicative Interactions

The Lorenz\-96 model\[Lorenz,[1996](https://arxiv.org/html/2606.23880#bib.bib11)\]is a chaotic dynamical system defined by the ODEx˙i=\(xi\+1−xi−2\)⋅xi−1−xi\+F\\dot\{x\}\_\{i\}=\(x\_\{i\+1\}\-x\_\{i\-2\}\)\\cdot x\_\{i\-1\}\-x\_\{i\}\+F, where indices wrap cyclically\. Each variablexix\_\{i\}has four causal parents at lag 1:xi−2x\_\{i\-2\},xi−1x\_\{i\-1\},xix\_\{i\}\(self\), andxi\+1x\_\{i\+1\}\. Critically, the mechanism is*multiplicative*—xi−1x\_\{i\-1\}multiplies\(xi\+1−xi−2\)\(x\_\{i\+1\}\-x\_\{i\-2\}\)—creating interaction effects that violate GRACE’s additive parent assumption\. This benchmark directly tests whether GRACE remains useful when its structural assumptions are misspecified\.

We use the same implementation parameters as CUTS\+\[Cheng et al\.,[2024](https://arxiv.org/html/2606.23880#bib.bib4)\]: forcingF=10F=10\(chaotic regime\), integration time stepΔt=0\.1\\Delta t=0\.1viascipy\.integrate\.odeint, additive Gaussian observation noiseσ=0\.1\\sigma=0\.1, and burn\-in of 1000 steps\. The observed time series are thus noisy discretizations of the underlying chaotic ODE, making the causal discovery task more realistic and challenging\.

Table 16:F1 on Lorenz\-96 with noisy observations \(σ=0\.1\\sigma=0\.1\), mean±\\pmstd over 10 seeds\. GRACE substantially improves over the CDNOTS skeleton at higherNN, demonstrating that gated refinement remains effective even when the true causal mechanism is multiplicative\.NNMethodT=500T=500T=1000T=1000T=2000T=200050CDNOTS\.757±\\pm\.01\.748±\\pm\.01\.760±\\pm\.01GRACE\.764±\\pm\.01\.828±\\pm\.01\.874±\\pm\.01100CDNOTS\.724±\\pm\.01\.717±\\pm\.01\.723±\\pm\.01GRACE\.840±\\pm\.01\.875±\\pm\.00\.894±\\pm\.01Table 17:Precision and recall on Lorenz\-96 atT=1000T=1000\(mean over 10 seeds\)\. GRACE preserves the skeleton’s recall while dramatically improving precision, achieving perfect precision atN=100N=100\.N=50N=50N=100N=100MethodPrecRecPrecRecCDNOTS\.717\.782\.665\.778GRACE\.879\.7821\.000\.778Tables[16](https://arxiv.org/html/2606.23880#A10.T16)and[17](https://arxiv.org/html/2606.23880#A10.T17)show that GRACE substantially improves over CDNOTS on Lorenz\-96 at higher dimensionality\. AtN=100N=100/T=1000T=1000, GRACE achieves F1==0\.875 with perfect precision \(1\.000\) compared to CDNOTS F1==0\.717 with precision 0\.665—a pattern consistent with the SCP results in Section[3](https://arxiv.org/html/2606.23880#S3)\. Both methods share the same recall \(∼0\.78\{\\sim\}0\.78\), confirming that GRACE prunes false positives from the skeleton without removing true edges\. The improvement grows with bothNNandTT: atN=50N=50/T=500T=500the gap is small \(F1 0\.764 vs\. 0\.757\), but atN=100N=100/T=2000T=2000it is substantial \(0\.894 vs\. 0\.723\)\.

These results demonstrate that GRACE’s additive MLP decoders can effectively distinguish true from spurious edges even when the underlying causal mechanism is multiplicative\. While the model cannot perfectly represent the true functional formxi−1⋅\(xi\+1−xi−2\)x\_\{i\-1\}\\cdot\(x\_\{i\+1\}\-x\_\{i\-2\}\), it learns sufficient signal from true parent variables to keep their gates open, while spurious edges provide no predictive value and are pruned\. The observation noise \(σ=0\.1\\sigma=0\.1\) further validates robustness: the method operates on noisy discretizations of the chaotic ODE, not clean trajectories\.

## Appendix KDiverse Graph Topologies

The main experiments \(Section[3](https://arxiv.org/html/2606.23880#S3)\) use sparse random SCPs where edges are sampled uniformly at random\. A natural concern is whetherGRACEgeneralizes to fundamentally different graph structures\. We test on three standard graph families generated viaNetworkX\[Hagberg et al\.,[2008](https://arxiv.org/html/2606.23880#bib.bib8)\]:

- •Scale\-free\(Barabási–Albert,m=2m=2\): hub\-and\-spoke topology with power\-law degree distribution, characteristic of gene regulatory networks and metabolic pathways where a few master regulators influence many downstream targets\[Barabási and Albert,[1999](https://arxiv.org/html/2606.23880#bib.bib2)\]\.
- •Erdős–Rényi\(p=0\.08p=0\.08, directed\): uniformly random edges, producing a denser graph \(∼200\{\\sim\}200cross\-edges atd=50d=50\) with no preferential structure\.
- •Small\-world\(Watts–Strogatz,k=4k=4,p=0\.3p=0\.3\): clustered ring lattice with random rewiring, combining high local clustering with short path lengths, as observed in neural connectomes and social networks\[Watts and Strogatz,[1998](https://arxiv.org/html/2606.23880#bib.bib25)\]\.

For undirected generators \(BA, WS\), edges are directed from lower to higher index to establish a causal ordering\. Lags are drawn uniformly from\{1,…,5\}\\\{1,\\ldots,5\\\}, and causal mechanisms use the same coefficient ranges and mixed linear/nonlinear function pool as the SCP benchmarks\. Autoregressive terms and spectral radius checking ensure stationarity\.

We evaluate atd=50d=50,T=1000T=1000\(8–10 seeds per configuration\) using pair\-level AUROC, which is threshold\-free and enables a fair comparison betweenGRACE’s continuous gate values and CUTS\+’s Gumbel\-Softmax probabilities\.

Table 18:Pair\-level AUROC on diverse graph topologies \(d=50d=50,T=1000T=1000, mean±\\pmstd\)\.GRACEleads on Erdős–Rényi and small\-world graphs; CUTS\+ edges ahead on scale\-free\.TopologyGRACECUTS\+Scale\-free\.902±\\pm\.05\.932±\\pm\.02Erdős–Rényi\.839±\\pm\.02\.803±\\pm\.02Small\-world\.976±\\pm\.01\.966±\\pm\.02Table[18](https://arxiv.org/html/2606.23880#A11.T18)shows thatGRACEgeneralizes well across all three topologies\. On small\-world and Erdős–Rényi graphsGRACEleads in AUROC \(0\.976 and 0\.839 vs\. 0\.966 and 0\.803\)\. Scale\-free is the one topology where CUTS\+ achieves slightly higher AUROC \(0\.932 vs\. 0\.902\), possibly because the hub\-and\-spoke structure suits CUTS\+’s GRU message\-passing architecture\. The method maintains its advantage across fundamentally different graph structures—including the denser Erdős–Rényi graphs \(edge density∼4×\{\\sim\}4\\timeshigher than SCP\)—demonstrating that gated refinement generalizes beyond the sparse random SCPs used in the main experiments\.

## Appendix LElbe River Network: Graph Comparison

Figure[5](https://arxiv.org/html/2606.23880#A12.F5)visualizes the causal graphs recovered on the Elbe River’s main branch \(Section[3\.4](https://arxiv.org/html/2606.23880#S3.SS4)\)\. The CDNOTS skeleton \(center\) recovers all 11 true edges but adds 106 spurious connections, making the graph uninterpretable\.GRACE\-Bootstrap \(right\) prunes nearly all false positives while retaining the chain structure\.

![Refer to caption](https://arxiv.org/html/2606.23880v1/x4.png)Figure 5:Elbe River’s main branch \(d=12d=12\)\[Stein et al\.,[2025](https://arxiv.org/html/2606.23880#bib.bib22)\]\.Left:geographic map of 12 gauging stations colored by elevation, with arrows indicating true flow direction \(upstream→\\todownstream\)\.Center:CDNOTS skeleton—all 11 true edges recovered \(black\) but buried under 106 false positives \(orange\)\.Right:GRACE\-Bootstrap \(N=50N=50,τ=0\.70\\tau=0\.70\)—9 true edges retained with only 1 false positive, recovering the river’s chain structure\.
GRACE: Gated Refinement for Accurate Causal Edge Discovery in High-Dimensional Time Series

Similar Articles

GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training

Reconstructing GRACE Terrestrial Water Storage with Spatio-Temporal Graph Neural Networks: An Application to South America

Towards Continuous-time Causal Foundation Models

Function-Valued Causal Influence in Nonlinear Time Series

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

Submit Feedback

Similar Articles

GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training
Reconstructing GRACE Terrestrial Water Storage with Spatio-Temporal Graph Neural Networks: An Application to South America
Towards Continuous-time Causal Foundation Models
Function-Valued Causal Influence in Nonlinear Time Series
GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution