TTCD: Transformer Integrated Temporal Causal Discovery from Non-Stationary Time Series Data


Summary

The paper introduces TTCD, a novel framework for temporal causal discovery from non-stationary time series data using transformer-based feature learning and reconstruction-guided signal distillation.


# TTCD: Transformer Integrated Temporal Causal Discovery from Non-Stationary Time Series Data
Source: [https://arxiv.org/html/2605.08111](https://arxiv.org/html/2605.08111)
Omar Faruque, Department of Information Systems, University of Maryland, Baltimore County, Baltimore, Maryland, USA (omarfaruque@umbc.edu); Sahara Ali, Department of Data Science, University of North Texas, Denton, Texas, USA (Sahara.Ali@unt.edu); Xue Zheng, Climate Science Section, Lawrence Livermore National Laboratory, Livermore, California, USA (zheng7@llnl.gov); Jianwu Wang, Department of Information Systems, University of Maryland, Baltimore County, Baltimore, Maryland, USA (jianwu@umbc.edu)

###### Abstract

The widespread availability of complex time series data in various domains such as environmental science, epidemiology, and economics demands robust causal discovery methods that can identify intricate contemporaneous and lagged relationships in non-stationary, nonlinear, and noisy settings. Existing constraint-based methods often rely heavily on conditional independence tests that degrade for limited data samples and complex distributions, while score-based methods impose strong statistical assumptions. Recent methods address special cases such as change point detection or distribution shifts, but struggle to provide a unified solution. We propose the Transformer Integrated Temporal Causal Discovery (TTCD) Framework, a novel end-to-end approach that learns contemporaneous and lagged causal relations from non-stationary time series. TTCD introduces a Non-Stationary Feature Learner integrating temporal and frequency-domain attention with dynamic non-stationarity profiling, and a custom Causal Structure Learner. A key innovation is reconstruction-guided causal signal distillation, which distills essential causal signals through the reconstruction process of the transformer decoder, mitigating noise and spurious correlations while preserving meaningful dependencies. The Causal Structure Learner operates on distilled reconstructed signals to infer the underlying causal graph without restrictive assumptions on noise distributions or data generation processes. Experiments on synthetic, benchmark, and real-world datasets show that TTCD consistently outperforms state-of-the-art baselines in both accuracy and consistency with domain knowledge, demonstrating the approach's effectiveness for causal discovery in challenging real-world contexts.

## 1 Introduction

Time series generated by natural systems such as climate, finance, economics, and healthcare often exhibit non-linearity, non-stationarity, different noise types, and autocorrelation (Runge et al., [2019a](https://arxiv.org/html/2605.08111#bib.bib142)). These intricate properties pose significant challenges for understanding dependencies among system components. A common approach to simplify this complexity is to graphically represent the data generation model using directed acyclic graphs (DAGs), which express complex systems in a highly interpretable manner and provide causal insight into the underlying processes (Pearl, [2000](https://arxiv.org/html/2605.08111#bib.bib146)). The DAG representation of a system plays a vital role in decision making and in predicting future conditions in applications such as causal inference (Pearl, [1991](https://arxiv.org/html/2605.08111#bib.bib143); Spirtes et al., [2000](https://arxiv.org/html/2605.08111#bib.bib144)), neuroscience (Rajapakse and Zhou, [2007](https://arxiv.org/html/2605.08111#bib.bib145)), medicine (Heckerman et al., [1992](https://arxiv.org/html/2605.08111#bib.bib159)), and economics (Appiah, [2018](https://arxiv.org/html/2605.08111#bib.bib157); Sanford and Moosa, [2012](https://arxiv.org/html/2605.08111#bib.bib156)). However, learning DAGs from observational time series data is very challenging when controlled experiments with different population sub-groups are impractical or unethical (Spirtes et al., [2000](https://arxiv.org/html/2605.08111#bib.bib144); Peters et al., [2017](https://arxiv.org/html/2605.08111#bib.bib147)).

Several state-of-the-art methods have been developed for causal discovery from temporal data based on constraint-based and score-based methodologies. Constraint-based methods (Runge et al., [2019b](https://arxiv.org/html/2605.08111#bib.bib138); Runge, [2020](https://arxiv.org/html/2605.08111#bib.bib139); Gerhardus and Runge, [2020](https://arxiv.org/html/2605.08111#bib.bib151); Entner and Hoyer, [2010](https://arxiv.org/html/2605.08111#bib.bib149); Huang et al., [2020](https://arxiv.org/html/2605.08111#bib.bib150)) learn conditional independencies through statistical tests to build DAGs. However, conditional independence tests (CITs) require a large number of samples to generate reliable test scores and can struggle with complex data distributions, often producing equivalence classes instead of precise causal graphs (Shah and Peters, [2020](https://arxiv.org/html/2605.08111#bib.bib148); Huang et al., [2018](https://arxiv.org/html/2605.08111#bib.bib216); Glymour et al., [2019](https://arxiv.org/html/2605.08111#bib.bib217)). Errors in early stages can cascade into later stages, and applying CITs at multiple stages can lead to false detections (Li et al., [2019](https://arxiv.org/html/2605.08111#bib.bib215); Triantafillou and Tsamardinos, [2016](https://arxiv.org/html/2605.08111#bib.bib214)).

Score-based causal discovery methods use a score function to quantify a predicted causal graph and optimize it gradually while enforcing the acyclicity constraint (Glymour et al., [2019](https://arxiv.org/html/2605.08111#bib.bib217); Huang et al., [2018](https://arxiv.org/html/2605.08111#bib.bib216); Triantafillou and Tsamardinos, [2016](https://arxiv.org/html/2605.08111#bib.bib214)). By evaluating the entire graph instead of applying sequential tests, they mitigate error propagation and multi-stage inconsistencies. However, the large combinatorial search space of an adjacency matrix makes this optimization challenging and often requires additional DAG constraints. Zheng et al. ([2018](https://arxiv.org/html/2605.08111#bib.bib135)) transform this combinatorial problem into a continuous optimization by formulating an acyclicity constraint using the trace exponential of the predicted adjacency matrix, enabling gradient-based optimization. Building on this, several neural-network-based methods have been proposed (Zheng et al., [2020](https://arxiv.org/html/2605.08111#bib.bib136); Sun et al., [2023](https://arxiv.org/html/2605.08111#bib.bib141); Pamfil et al., [2020](https://arxiv.org/html/2605.08111#bib.bib140); Yu et al., [2019](https://arxiv.org/html/2605.08111#bib.bib126); Löwe et al., [2022](https://arxiv.org/html/2605.08111#bib.bib218)). However, these methods often overfit to noise or spurious correlations in small-sample settings, and most assume stationarity. Recently, transformer architectures have also been explored for analyzing time series data (Wen et al., [2023](https://arxiv.org/html/2605.08111#bib.bib219); Zeng et al., [2023](https://arxiv.org/html/2605.08111#bib.bib220); Kong et al., [2024](https://arxiv.org/html/2605.08111#bib.bib227)).

Causal discovery from non-stationary temporal data remains an active research area (Gong et al., [2024](https://arxiv.org/html/2605.08111#bib.bib228)), and several advanced methods have been proposed in the constraint-based (Ferdous et al., [2023](https://arxiv.org/html/2605.08111#bib.bib175); Zhifeng et al., [2024](https://arxiv.org/html/2605.08111#bib.bib222); Sadeghi et al., [2024](https://arxiv.org/html/2605.08111#bib.bib224)) and score-based (Schäck et al., [2017](https://arxiv.org/html/2605.08111#bib.bib221); Liu and Kuang, [2023](https://arxiv.org/html/2605.08111#bib.bib226); Mameche et al., [2025](https://arxiv.org/html/2605.08111#bib.bib223); Rodas et al., [2021](https://arxiv.org/html/2605.08111#bib.bib225)) categories for this task. However, these methods are designed for specific scenarios such as change point detection, shifts in data distribution, conditional stationarity, changes in causal relationships, or summary graphs. Some existing approaches also require prior knowledge of the noise distribution and parametric information about the data generation. Therefore, in this paper, we propose a causal discovery framework capable of capturing causal structure from non-stationary temporal data without any noise or data distribution assumptions. Our proposed framework integrates a transformer-based non-stationary feature learner with a custom 2D convolution to capture causal relationships between each variable and its temporal parents. The contributions of this paper are three-fold:

- •We propose a non-stationary transformer to learn dominant features from time series data using both temporal and frequency domain attentions with non-stationary profiling and de-stationary feature learning, which focuses attention on important features.
- •We propose a convolution-based Causal Structure Learner to learn causal relationships from distilled signals. The proposed module can identify lagged and contemporaneous causal links simultaneously by incorporating the acyclicity constraint and a sparsity penalty into the optimization process.
- •We conduct extensive evaluations of the proposed framework against state-of-the-art causal discovery methods, along with ablation studies, using synthetic and real-world datasets. The proposed framework outperforms state-of-the-art approaches in most cases, making it a strong contender for time-series causal discovery.

## 2 Related Works

Traditional statistical causal discovery methods were not designed to handle non-linear data. Some methods extend traditional causal discovery to non-linear time series data, such as PCMCI and PCMCI+ (Runge et al., [2019b](https://arxiv.org/html/2605.08111#bib.bib138); Runge, [2020](https://arxiv.org/html/2605.08111#bib.bib139); Bahadori and Liu, [2012](https://arxiv.org/html/2605.08111#bib.bib183)), while other approaches utilize neural networks for these extensions (Yu et al., [2019](https://arxiv.org/html/2605.08111#bib.bib126); Tank et al., [2021](https://arxiv.org/html/2605.08111#bib.bib131); Absar et al., [2023](https://arxiv.org/html/2605.08111#bib.bib119); Zheng et al., [2020](https://arxiv.org/html/2605.08111#bib.bib136); Pamfil et al., [2020](https://arxiv.org/html/2605.08111#bib.bib140); Sun et al., [2023](https://arxiv.org/html/2605.08111#bib.bib141)). For instance, DAG-GNN (Yu et al., [2019](https://arxiv.org/html/2605.08111#bib.bib126)) leverages neural networks and gradient-based optimization to identify causal structures.

Recent research has made inroads into causal discovery techniques applicable to non-stationary time series data, including constraint-based methods (Huang et al., [2020](https://arxiv.org/html/2605.08111#bib.bib150); Sadeghi et al., [2024](https://arxiv.org/html/2605.08111#bib.bib224); Ferdous et al., [2023](https://arxiv.org/html/2605.08111#bib.bib175); Zhifeng et al., [2024](https://arxiv.org/html/2605.08111#bib.bib222)) and score-based methods (Rodas et al., [2021](https://arxiv.org/html/2605.08111#bib.bib225); Schäck et al., [2017](https://arxiv.org/html/2605.08111#bib.bib221); Liu and Kuang, [2023](https://arxiv.org/html/2605.08111#bib.bib226); Mameche et al., [2025](https://arxiv.org/html/2605.08111#bib.bib223)). Causal Discovery from NOnstationary Data (CD-NOD) (Huang et al., [2020](https://arxiv.org/html/2605.08111#bib.bib150)) is a nonparametric framework that identifies causal relations from non-stationary data based on distribution shift. Causal Discovery from Nonstationary Time Series (CD-NOTS) extends CD-NOD to find lagged and instantaneous causal links using CITs (Sadeghi et al., [2024](https://arxiv.org/html/2605.08111#bib.bib224)). Ferdous et al. ([2023](https://arxiv.org/html/2605.08111#bib.bib175)) proposed CDANs, which reduces the conditioning set by considering lagged parents and utilizes changing modules to detect causal edges. Zhifeng et al. ([2024](https://arxiv.org/html/2605.08111#bib.bib222)) introduced a causal discovery method that divides the time series into several stationary intervals using a change detection method and applies a stationary method to individual intervals.

Among score-based methods, State-Dependent Causal Inference (SDCI) (Rodas et al., [2021](https://arxiv.org/html/2605.08111#bib.bib225)) assumes the dynamics of a non-stationary system change based on different states and, conditioning on each state, applies a probabilistic deep learning approach to learn causal graphs. Schäck et al. ([2017](https://arxiv.org/html/2605.08111#bib.bib221)) proposed a method integrating a time-varying autoregressive model and generalized partial directed coherence (PDC), where a Kalman filter is used to predict the PDC parameters. The Latent Intervened Non-stationary learning (LIN) (Liu and Kuang, [2023](https://arxiv.org/html/2605.08111#bib.bib226)) method assumes the data contain both observational and interventional samples and learns causal graphs for each class using a neural network and an acyclicity constraint. The SPACETIME (Mameche et al., [2025](https://arxiv.org/html/2605.08111#bib.bib223)) method considers changes in time and space simultaneously to detect causal graphs from multi-context data using Gaussian processes. Fujiwara et al. ([2023](https://arxiv.org/html/2605.08111#bib.bib172)) combined the Linear Non-Gaussian Acyclic Model (LiNGAM) and the Just-In-Time (JIT) framework to identify causal relations in nonlinear and non-stationary data.

While these methods have significantly contributed to the field of non-stationary time series causal discovery, several challenges persist. Constraint-based methods rely heavily on conditional independence tests and are prone to error propagation; multi-stage testing also increases the risk of false positives or negatives. Though recent score-based causal discovery methods for non-stationary temporal data mitigate these issues to some extent, they target specific challenges, like context change, distribution shift, interventional data, or conditional and local stationarity. Natural non-stationary temporal data does not always match these criteria. By relaxing specific conditions on the data distribution and data generation mechanism, our proposed framework learns non-stationary features from natural temporal data and generates effective temporal causal graphs. A comparison of existing methods is provided in Appendix [A](https://arxiv.org/html/2605.08111#A1).

## 3 Preliminaries

Let's consider a multivariate time series dataset $X=\{x^{1},x^{2},x^{3},\dots,x^{n}\}$ consisting of $n$ variables, each measured for $T$ timesteps. A variable $x^{i}$ $(i\in\{1,\dots,n\})$ at a specific time point $t\in T$ could be caused by other variables at the same timestep ($t$) and by all variables from previous timesteps ($0$ to $t-1$), following the temporal precedence assumption (the output causal graph of Figure [1](https://arxiv.org/html/2605.08111#S3.F1)). The effects from previous timesteps, also called lagged effects, can propagate from infinitely earlier time points, but for DAG learning purposes we consider a maximum time lag, i.e., $l_{max}$.

Definition 1: Consider a time series $X_{t}=(X^{i}_{t})_{i\in\{1,\dots,n\}}$ with continuous distribution. If there is an $l_{max}>0$ and $\forall i\in\{1,\dots,n\}$ there are sets $PA^{x^{i}}_{t}\subseteq X^{n\setminus i}_{t}$ and $PA^{x^{i}}_{0\dots(t-1)}\subseteq X_{0\dots(t-1)}$, the structural equation model is

$$X^{i}_{t}=f_{i}\big(PA^{x^{i}}_{t-l_{max}},\dots,PA^{x^{i}}_{t-1},PA^{x^{i}}_{t},e^{i}_{t}\big),\qquad(1)$$

with noise term $e^{i}_{t}$. So the set of possible cause variables of each time series $x^{i}$ at time $t$ is $PA^{x^{i}}\in[\{X_{t-l_{max}},X_{t-l_{max}+1},\dots,X_{t-1},X_{t}\}-x^{i}]$. The goal is to learn a causal graph $G(V,E)$ such that its vertices represent the time-lagged and current-time variables and its directed edges express causal links, so the vertices and edges can be denoted as $V=\{X_{t-l_{max}},X_{t-l_{max}+1},\dots,X_{t-1},X_{t}\}$ and $E=\{(V_{i},V_{j}):V_{i},V_{j}\in V\}$, respectively. Let the weighted adjacency matrix of the full temporal causal graph $G$ be denoted by $W\in\mathbb{R}^{(n\times(l_{max}+1))\times n}$.
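To make the adjacency-matrix layout concrete, here is a minimal NumPy sketch of $W\in\mathbb{R}^{(n\times(l_{max}+1))\times n}$ and its split into lagged and contemporaneous blocks; the lag-major row ordering and the example edge weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

n, l_max = 4, 2                       # illustrative: 4 variables, max lag 2
# Full temporal adjacency matrix W: one row per (lag, variable) pair,
# one column per effect variable at the current timestep t.
W = np.zeros((n * (l_max + 1), n))

# Hypothetical lagged edge: variable 2 at lag block 1 causes variable 0 at t
W[1 * n + 2, 0] = 0.8
# Hypothetical contemporaneous edge: variable 1 -> variable 3 at timestep t
W[l_max * n + 1, 3] = 0.5

lagged = W[: l_max * n, :]            # blocks for the l_max lagged timesteps
contemporaneous = W[l_max * n :, :]   # n x n block at the current timestep t

print(W.shape, lagged.shape, contemporaneous.shape)
```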

The proposed method works based on the following assumptions. Markov and Faithfulness Assumption: Assume $P(X^{i})$, $i\in\{1,\dots,n\}$ is Markov and faithful to the true/generated causal graph $G$ (Hasan et al., [2023](https://arxiv.org/html/2605.08111#bib.bib181)). Causal Sufficiency Assumption: We assume that there are no hidden/unobserved confounders in the data generation process. Causal Consistency Assumption: Assume causal relations between the variables are consistent through all time steps. Acyclicity Assumption: There are no causal paths that begin and end at the same node. Temporal precedence in the data ensures acyclicity in the time-lagged part of $W$. However, in the contemporaneous part of $W$ at $t$, each node can serve as both the source and target of causal links, so acyclicity must be enforced explicitly. Simultaneously learning the lagged and contemporaneous parts of the adjacency matrix is very challenging for complex datasets, since any variable might be the cause of another effect variable and cycles can occur in the contemporaneous part of the adjacency matrix.

![Refer to caption](https://arxiv.org/html/2605.08111v1/x1.png)Figure 1: Proposed TTCD framework to learn the full temporal causal graph. The Non-Stationary Feature Learner module learns distilled reconstructed features using temporal and frequency domain attentions, a non-stationary profiling network, and a de-stationary factor block. The Causal Structure Learner operates on distilled reconstructed signals to generate a full causal graph.
## 4 Proposed Methodology

The causal graph generation task can be treated as an unsupervised learning process of the adjacency matrix $W$ given $T$ observations of a multivariate time series $X$. To learn a directed adjacency matrix $W$ of the full temporal causal graph $G$, we propose a framework based on unsupervised deep neural networks. The proposed framework learns the instantaneous ($X_{t}\to X_{t}$) and time-lagged ($\{X_{t-l_{max}},X_{t-l_{max}+1},\dots,X_{t-1}\}\to X_{t}$) causal links for a maximum time lag ($l_{max}>0$). Our proposed framework is illustrated in Figure [1](https://arxiv.org/html/2605.08111#S3.F1) and consists of two modules: a Non-Stationary Feature Learner module and a Causal Structure Learner module. The Non-Stationary Feature Learner leverages a transformer-based encoder with a specialized attention mechanism to learn latent representations from temporal and frequency domain features. The decoder reconstructs the input signal from these latent representations, and we use the reconstructed output (prior to denormalization) as input to the Causal Structure Learner. This design serves two purposes: it ensures dimensional alignment with the original data space, and it filters out spurious fluctuations and noise while retaining meaningful dynamics. The Causal Structure Learner module contains $N$ non-sequential custom causal layers (Causal Conv2D) to learn the time-lagged and instantaneous causal relationships of each input variable to its parent variables. Following an analogy similar to Zheng et al. ([2020](https://arxiv.org/html/2605.08111#bib.bib136)) and DAG-GNN (Yu et al., [2019](https://arxiv.org/html/2605.08111#bib.bib126)), we learn causal links from the parameters of the Causal Conv2D layers. The links learned by this module are always unidirectional. The unique design of this module helps to learn the causal links for each input variable independently. Finally, the results of each causal layer are aggregated to generate a causal graph of the input data. Causal graph identifiability of the proposed model is inspired by Peters et al. ([2013](https://arxiv.org/html/2605.08111#bib.bib212); [2011](https://arxiv.org/html/2605.08111#bib.bib213)) and follows similar reasoning. Please refer to Appendix [B](https://arxiv.org/html/2605.08111#A2) for details.

### 4.1 Non-Stationary Feature Learner

The Non-Stationary Feature Learner module of our proposed framework learns a latent representation from the input time series data, leveraging the non-stationary attention introduced by Liu et al. ([2022](https://arxiv.org/html/2605.08111#bib.bib208)). Motivated by the strong performance of their model, we follow a similar transformer strategy to learn features from non-stationary temporal data. The input data is divided into sequential chunks ($In\in\mathbb{R}^{T\times(l_{max}+1)\times n}$) to maintain the inherent temporal order, and an embedding is generated for each chunk ($E\in\mathbb{R}^{T\times(l_{max}+1)\times d_{e}}$). The encoder block of the transformer takes this embedding as input, computes attention scores using non-stationary attention, and learns the latent representation.
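The chunking step can be sketched as a sliding window over the raw series; the shapes follow the text, while the stride of one timestep and the function name are assumptions for illustration.

```python
import numpy as np

def make_chunks(x: np.ndarray, l_max: int) -> np.ndarray:
    """Slice a (timesteps, n) series into overlapping windows of length
    l_max + 1, preserving temporal order within each chunk."""
    timesteps, n = x.shape
    T = timesteps - l_max                # number of usable windows
    # chunks[t] holds x[t : t + l_max + 1]: lagged rows first, current last
    chunks = np.stack([x[t : t + l_max + 1] for t in range(T)])
    return chunks                        # shape (T, l_max + 1, n)

x = np.random.randn(100, 4)              # 100 timesteps, 4 variables
chunks = make_chunks(x, l_max=4)
print(chunks.shape)                      # (96, 5, 4)
```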

The non-stationary transformer can learn important features of the input time series; however, non-stationarity cannot be fully captured through attention alone, because the data normalization process attenuates non-stationary characteristics. Due to normalization, sequences from distinct time series can appear statistically identical, causing the model to generate uniform attention without being aware of important features. Ignoring these crucial non-stationary features limits the quality of the learned representations and weakens overall performance. To tackle this problem, we explicitly derive non-stationary components from the raw input and integrate them into the attention computation so that the transformer can retain significant non-stationarity in its representations. We achieve this by introducing a Non-Stationary Profiling network and a de-stationarizing module, the De-Stationary factor learning Block (DSB). Specifically, the Non-Stationary Profiling Network extracts dynamic, localized statistics, such as local variability or higher-order moments, capturing sample-specific distributional profiles that are often suppressed by standard normalization. These learned profile vectors ($\gamma_{Q}^{(i)},\gamma_{K}^{(i)}\in\mathbb{R}^{T\times d_{e}}$) work as meta-conditioning signals that dynamically adapt the transformer's attention weights for each input. This makes the transformer data-adaptive, not just parameter-adaptive, and goes beyond fixed decomposition.

$$Q^{(i)}=Q\odot\gamma_{Q}^{(i)},\quad K^{(i)}=K\odot\gamma_{K}^{(i)},\quad\text{where }\big[\gamma_{Q}^{(i)},\gamma_{K}^{(i)}\big]=Profile\big(X^{(i)}\big)\qquad(2)$$

The De-Stationary Factor Learning Block (DSB) is designed to explicitly restore and amplify the intrinsic non-stationary characteristics of time series data, which are often attenuated or lost due to standard normalization and sequence embedding steps. In our framework, the non-stationary profiling network focuses on local features, while the DSB captures broader, global non-stationary profiles. As shown in Figure [1](https://arxiv.org/html/2605.08111#S3.F1), the DSB comprises a convolution layer and several linear layers with ReLU activations. This block learns a scaling factor $\tau\in\mathbb{R}^{T\times 1}$ (equivalent to $\sigma^{2}$) and a shifting vector $\Delta\in\mathbb{R}^{T\times(l_{max}+1)}$ (equivalent to $\mu$) from the raw input data and its computed mean $\mu_{x}$ and standard deviation $\sigma_{x}$. These learned de-stationary factors are then integrated into the attention computation (Equation [3](https://arxiv.org/html/2605.08111#S4.E3)) to learn varying attention that accounts for non-stationarity.

$$Attn(Q,K,V,\tau,\Delta)=Softmax\Big(\frac{\tau QK^{\top}+I\Delta^{\top}}{\sqrt{d_{e}}}\Big)V,\qquad(3)$$

where $Q$, $K$, and $V$ represent the query, key, and value matrices of the transformer, each with dimension $\mathbb{R}^{T\times(l_{max}+1)\times d_{e}}$, and $I$ is a vector of all ones. The learned de-stationary factors multiply the learned attention values inside the attention module, and the same factors are shared by all attention modules in the whole transformer.
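A minimal NumPy sketch of the de-stationary attention in Equation (3), assuming single-head attention and treating the $T$ chunks as a batch dimension; in the actual framework `tau` and `delta` would come from the DSB, so here they are random placeholders.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def destationary_attention(Q, K, V, tau, delta):
    """Equation (3): pre-softmax scores are rescaled by the learned factor
    tau and shifted by delta (the I @ delta^T term adds delta_t to every
    row of the t-th score matrix). Q, K, V: (T, L, d_e); tau: (T, 1);
    delta: (T, L)."""
    d_e = Q.shape[-1]
    scores = tau[:, :, None] * (Q @ K.transpose(0, 2, 1))   # (T, L, L)
    scores = scores + delta[:, None, :]                      # I @ delta^T
    return softmax(scores / np.sqrt(d_e)) @ V

rng = np.random.default_rng(0)
T, L, d_e = 8, 5, 16
Q, K, V = rng.normal(size=(3, T, L, d_e))
tau = rng.uniform(0.5, 1.5, size=(T, 1))     # positive scaling factor
delta = rng.normal(size=(T, L))
out = destationary_attention(Q, K, V, tau, delta)
print(out.shape)   # (8, 5, 16)
```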

Recent studies by Zhou et al. ([2022](https://arxiv.org/html/2605.08111#bib.bib233)), Yi et al. ([2023](https://arxiv.org/html/2605.08111#bib.bib235)), and Li et al. ([2025](https://arxiv.org/html/2605.08111#bib.bib234)) have demonstrated that integrating frequency domain attention significantly improves a model's capacity to disentangle non-stationary time series and identify latent causal drivers. We therefore integrate a Frequency Domain Attention alongside the standard temporal attention to further enhance the model's ability to capture complex non-stationary patterns. A Fourier transform converts the time-domain signals into frequency spectra ($Fre\in\mathbb{R}^{T\times((l_{max}+1)/2)\times d_{e}}$), enabling the network to selectively attend to distinct spectral bands and periodic components of non-stationarity in the signal. This frequency domain attention is fused with the time-domain latent features, conditioned by the local profile vectors and de-stationary factors, enabling the model to simultaneously exploit localized distributional and frequency-based dependencies. This multiview representation enhances the learner's ability to detect time-lagged and instantaneous causal links that are modulated by complex non-stationary dynamics. The learned latent representation ($\mathbb{R}^{T\times(l_{max}+1)\times d_{e}}$) of the encoder module is provided as input to the transformer decoder together with the generated input data embeddings. The decoder module also uses non-stationary attention blocks with de-stationary factors to reconstruct the input data ($\mathbb{R}^{T\times(l_{max}+1)\times n}$). The distilled signals learned by the decoder are provided as input to the proposed Causal Structure Learner module to learn causal relationships. Because the output of the decoder module is still on an unnormalized scale, a de-normalization block shifts the output back to the original scale of the input data.
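The conversion of embeddings to frequency spectra can be sketched with a real FFT along the chunk axis; the resulting $\lfloor L/2\rfloor+1$ frequency bins correspond roughly to the $(l_{max}+1)/2$ dimension in the text, while the exact spectral processing is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, L, d_e = 8, 5, 16           # chunks of length l_max + 1 = 5
emb = rng.normal(size=(T, L, d_e))

# Real FFT over the temporal (window) axis gives complex spectra; their
# magnitudes could feed a frequency-domain attention branch.
spectra = np.fft.rfft(emb, axis=1)       # (T, L // 2 + 1, d_e)
magnitude = np.abs(spectra)

print(spectra.shape)    # (8, 3, 16)
```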

### 4.2 Causal Structure Learner

We propose this novel module to learn the lagged and instantaneous causal links of each variable using the distilled reconstructed signals ($\mathbb{R}^{T\times(l_{max}+1)\times n}$) learned by the decoder block. It consists of a separate custom Causal Conv2D layer for each variable, and these layers are organized in a non-sequential pattern. The Causal Conv2D layer takes input structured as in the full causal graph on the right side of Figure [1](https://arxiv.org/html/2605.08111#S3.F1), with lagged data followed by data from the current time point. Each Causal Conv2D layer is designed to learn the causal links of one input variable; for example, the links to $x^{1}$ from all possible parents $PA^{x^{1}}\in\{x^{1}_{t-l_{max}},x^{1}_{t-l_{max}+1},\dots,x^{1}_{t-1},x^{2}_{t-l_{max}},x^{2}_{t-l_{max}+1},\dots,x^{2}_{t-1},x^{2}_{t},\dots,x^{n}_{t-l_{max}},x^{n}_{t-l_{max}+1},\dots,x^{n}_{t-1},x^{n}_{t}\}$, following Definition 1. The variable itself cannot be included in the set of its parent variables. Suppose we have a time series dataset with 4 variables and, for lagged effects, consider a maximum time lag $l_{max}=4$. The input data size will then be $(4\times 5)$: one row for each variable and $l_{max}+1=5$ columns for lagged and contemporaneous data. To learn a causal graph for 4 variables, as shown in Figure [2](https://arxiv.org/html/2605.08111#S4.F2), we employ 4 Causal Conv2D layers. Each of these layers predicts the expectation of a target variable at the current timestep $t$ given all lagged and instantaneous parents (Equation [4](https://arxiv.org/html/2605.08111#S4.E4)).

$$E[x^i \mid PA^{x^i}] = f_{W^{x^i}}(PA^{x^i}) \qquad (4)$$

Here, $f_{W^{x^i}}(\cdot)$ denotes the function learned for target variable $x^i$, and $W^{x^i}$ represents the set of weight parameters of that layer. The adjacency matrix is derived from the learned weights of these Causal Conv2D layers. Each weight parameter of a layer represents the strength of the causal link from a potential parent to the target variable. If a weight parameter $W^{x^k}_{ij} = 0$, the target variable $x^k$ is independent of the candidate cause $x^i$ at timestep $j$. Conversely, if $W^{x^k}_{ij} > 0$, the target variable $x^k$ has a causal edge from parent variable $x^i$ at the specific time lag $j$. After training the weight parameters for all target variables, we apply a thresholding operation to prune causal links with weak dependency strength, $W^{x^k}_{\omega} = (W^{x^k} > \omega)$, where $\omega$ is a minimum threshold. After thresholding, the weight parameters of all variables together form the adjacency matrix of the generated causal graph.
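A toy numpy illustration of the thresholding step (the weight values here are made up for illustration):

```python
import numpy as np

omega = 0.3   # minimum dependency-strength threshold
# hypothetical learned weight slice for one target variable
# (rows: candidate parent variables, columns: time lags)
W = np.array([[0.00, 0.82, 0.11],
              [0.47, 0.05, 0.02],
              [0.21, 0.90, 0.33]])

# W_omega = (W > omega): keep only links whose strength exceeds omega
W_omega = (W > omega).astype(int)
print(W_omega)
```

The surviving 1-entries form the adjacency matrix of the pruned causal graph for this target.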

![Refer to caption](https://arxiv.org/html/2605.08111v1/x2.png)

Figure 2: Example of the proposed custom causal layers with four variables and a time lag of 4.
### 4.3 Optimization

To train the proposed framework, we optimize four terms in the objective function: the transformer reconstruction loss, the target variable estimation loss, the acyclicity constraint, and the sparsity loss.

Reconstruction loss: In the Non-Stationary Feature Learner module, we learn latent representations and reconstruct the input data through the transformer's encoder-decoder structure. To optimize this module, we use the mean squared error (MSE) loss between the input data $X$ and the reconstructed output $\widehat{X}$, defined as:

$$L_r = \frac{1}{T}\sum_{i=1}^{T}\left\|X - \widehat{X}\right\|_2^2 \qquad (5)$$

Acyclicity constraint: To optimize the Causal Structure Learner module, we must ensure the acyclicity of the causal graph. To enforce acyclicity in the adjacency matrix of the learned causal graph, we use an equality constraint similar to that of Zheng et al. ([2018](https://arxiv.org/html/2605.08111#bib.bib135)), formulated as $h(W) = 0$. The function $h(W)$ is defined using the trace of the matrix exponential ($\operatorname{tr}$) of the elementwise product of the adjacency matrix with itself: $h(W) = \operatorname{tr}\, e^{W \odot W} - n$, where $n$ is the number of variables. We cannot use the learned adjacency matrix $W$ directly in this equality function because $W$ contains both time-lagged $(t-l_{max}, t-l_{max}+1, \dots, t-1)$ and contemporaneous $(t)$ edges of the causal graph. Lagged causal edges always point from previous timesteps to the current timestep $t$ and are therefore acyclic, so we apply the acyclicity constraint only to the contemporaneous part of the adjacency matrix, $W^t$.

$$h(W^t) = \operatorname{tr}\left(e^{W^t \odot W^t}\right) - n = 0 \qquad (6)$$

The function $h(W^t)$ equals $0$ if and only if the corresponding matrix $W^t$ does not contain any cycle. This equality constraint cannot be integrated directly into a continuous optimization framework, but it can be solved with continuous optimization after conversion into an unconstrained problem (Zheng et al., [2018](https://arxiv.org/html/2605.08111#bib.bib135)). We therefore transform the equality constraint into an unconstrained subproblem using the augmented Lagrangian method. The continuous form of this constraint is defined as:
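The constraint value is straightforward to compute; a small numpy sketch (using a truncated Taylor series for the matrix exponential instead of a library routine) illustrates that $h$ vanishes for acyclic weights and is positive in the presence of a cycle:

```python
import numpy as np

def h(W_t, terms=30):
    """h(W^t) = tr(e^{W^t ⊙ W^t}) - n via a truncated Taylor series for
    the matrix exponential; zero iff the weighted graph W^t is acyclic."""
    n = W_t.shape[0]
    M = W_t * W_t                       # elementwise (Hadamard) square
    E = np.eye(n)
    P = np.eye(n)
    for k in range(1, terms):
        P = P @ M / k                   # P accumulates M^k / k!
        E = E + P
    return float(np.trace(E) - n)

acyclic = np.array([[0.0, 0.7],
                    [0.0, 0.0]])        # single edge 1 -> 2: no cycle
cyclic = np.array([[0.0, 0.7],
                   [0.5, 0.0]])         # edges 1 -> 2 and 2 -> 1: a cycle
print(h(acyclic), h(cyclic) > 0)
```

Because $W^t \odot W^t$ is entrywise nonnegative, $h(W^t) \ge 0$ always, and every term of the exponential series picks up weight from closed walks, so any cycle pushes the trace above $n$.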

$$h(W^t) = 0 \;\approx\; \min\left[\frac{\rho}{2}\,|h(W^t)|^2 + \alpha\, h(W^t)\right], \qquad (7)$$

where $\alpha > 0$ is the Lagrange multiplier and $\rho > 0$ is the penalty parameter of the augmented Lagrangian.

Target variable estimation loss: The proposed Causal Structure Learner module estimates each target variable from all possible parents to learn the adjacency matrix $W \in \mathbb{R}^{(n\times(l_{max}+1))\times n}$ of the desired causal graph. To improve estimation quality, we must penalize the difference between the estimated and actual values of the target variables. We therefore define a mean squared error loss $L_{tve}$ between the true values and the values estimated through the causal connections:

$$L_{tve}(W) = \frac{1}{T}\left\|X - WX\right\|_F^2 \qquad (8)$$

Sparsity loss: We incorporate an additional penalty to enforce sparsity in the learned adjacency matrix using the $L1$ norm of $W \in \mathbb{R}^{(n\times(l_{max}+1))\times n}$. This penalty encourages fewer causal links with strong relationships. While $L_r$, $L_{tve}$, and the acyclicity constraint (Equation [7](https://arxiv.org/html/2605.08111#S4.E7)) drive the model toward dense, overly connected relationships that minimize loss, potentially increasing the number of non-zero weights, the $L1$ norm of $W$ pushes less significant weights toward zero to keep a minimum number of non-zero entries. The sparsity loss is defined as $L_s = \lambda\|W\|_1$, where $\lambda$ is the sparsity regularization coefficient. Combining all loss terms, the objective function of the proposed framework becomes:

$$\min_W \left[L_r + L_{tve}(W) + \frac{\rho}{2}|h(W^t)|^2 + \alpha\, h(W^t) + L_s\right], \qquad (9)$$

where the third and fourth terms are the augmented Lagrangian of the acyclicity constraint. This objective function can be minimized using any state-of-the-art continuous optimizer.
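A self-contained numpy sketch of Equation 9 (toy shapes and stand-in terms: $W$ is kept square here for simplicity, whereas the paper's $W$ stacks lagged and contemporaneous blocks, and the actual framework computes $L_r$ from the transformer and minimizes everything with gradient-based training in PyTorch):

```python
import numpy as np

def h_acyc(W_t, terms=30):
    """tr(e^{W ⊙ W}) - n via a truncated Taylor series (Eq. 6)."""
    n = W_t.shape[0]
    M, E, P = W_t * W_t, np.eye(W_t.shape[0]), np.eye(W_t.shape[0])
    for k in range(1, terms):
        P = P @ M / k
        E = E + P
    return float(np.trace(E) - n)

def objective(X, X_hat, W, W_t, rho=1.0, alpha=0.1, lam=0.01):
    """Toy evaluation of the combined objective in Eq. 9."""
    T = X.shape[0]
    L_r = np.sum((X - X_hat) ** 2) / T            # reconstruction (Eq. 5)
    L_tve = np.sum((X - X @ W) ** 2) / T          # estimation loss (Eq. 8)
    hW = h_acyc(W_t)                              # acyclicity value (Eq. 6)
    L_aug = 0.5 * rho * hW ** 2 + alpha * hW      # augmented Lagrangian (Eq. 7)
    L_s = lam * np.abs(W).sum()                   # L1 sparsity penalty
    return L_r + L_tve + L_aug + L_s

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))                 # toy data: T=100, n=3
W = rng.standard_normal((3, 3)) * 0.1             # toy weights
np.fill_diagonal(W, 0.0)                          # no self-loops
loss = objective(X, X, W, W)                      # perfect reconstruction case
print(round(loss, 4))
```

Since $h \ge 0$ for nonnegative $W \odot W$ and the remaining terms are squared errors or norms, the toy objective is nonnegative.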

## 5 Experimental Setup

We describe the datasets and evaluation metrics used for performance comparison in this section. Our model is developed using the PyTorch library, and all experiments are conducted on the Google Colab runtime with CPU for easy reproducibility. A fixed random seed is used for all randomized operations to make the experimental results reproducible. The implementation code and datasets used for this study are available at https://anonymous.4open.science/r/TTCD/README.md.

Synthetic Datasets: We used two synthetic datasets to evaluate the performance of our proposed causal discovery method. Since the ground truth causal graphs of synthetic datasets are known, the generated causal graphs can be measured and compared directly. We generated a time series dataset (Dataset-1) of four variables with Gaussian white noise $\varepsilon$, following a data generation process similar to that of (Huang et al., [2020](https://arxiv.org/html/2605.08111#bib.bib150)), which contains both lagged and instantaneous links. A mathematical description and the true causal graph for this dataset are provided in Appendix [C](https://arxiv.org/html/2605.08111#A3). Non-stationary characteristics are incorporated into the generation process to mimic the dynamic properties of real-world natural systems.

The other synthetic dataset (Dataset-2) follows a data generation process similar to that of (Kang et al., [2022](https://arxiv.org/html/2605.08111#bib.bib209)). For this dataset, we used exponential nonlinearity and noise signals drawn from a Poisson distribution. The mathematical equations for this dataset are also given in Appendix [C](https://arxiv.org/html/2605.08111#A3). All variables in this dataset are likewise non-stationary.

Real World Datasets: Two real-world Earth/atmospheric science datasets, namely Turbulence Kinetic Energy (TKE) and Arctic Sea Ice, and the FMRI benchmark data were used to evaluate our work. These natural datasets exhibit high variability, non-stationarity, and complex interactions. TKE refers to the mean kinetic energy per unit mass of eddies in turbulent flow (Hinze, [1975](https://arxiv.org/html/2605.08111#bib.bib152)). The temporal TKE data used in this study represent the TKE evolution during a typical cumulus-topped boundary layer day (local time 05:00–18:00) over the DOE Atmospheric Radiation Measurement (ARM) Southern Great Plains Central Facility. This data file is generated from an idealized numerical simulation using the Weather Research & Forecasting Model (Skamarock et al., [2019](https://arxiv.org/html/2605.08111#bib.bib153)) with modifications from the Large-Eddy Simulation (LES) Symbiotic Simulation and Observation (LASSO) activity, which is developed through the US Department of Energy's ARM facility (Gustafson et al., [2020](https://arxiv.org/html/2605.08111#bib.bib154); Endo et al., [2015](https://arxiv.org/html/2605.08111#bib.bib155)). The dataset also includes the TKE vertical shear production ($SH$) and buoyancy production ($BU$) terms, which together yield the net TKE tendency ($TEND$). Their ground-truth relationships are shown in Figure [3](https://arxiv.org/html/2605.08111#A3.F3)b (Appendix [C](https://arxiv.org/html/2605.08111#A3)), and non-stationarity test results are available in Appendix [E](https://arxiv.org/html/2605.08111#A5).

Arctic sea ice is an important component of the world's climate system and plays a significant role in the rise of extreme weather events. Huang et al. ([2021](https://arxiv.org/html/2605.08111#bib.bib198)) conducted a causal discovery analysis to investigate the links between melting Arctic sea ice and atmospheric variables. We use the same 11 atmospheric variables together with the sea ice extent employed in (Huang et al., [2021](https://arxiv.org/html/2605.08111#bib.bib198)). This time series contains monthly averages from 1980 to 2018 over the Arctic region north of 60°N. The variable names and non-stationarity test results for this dataset are provided in Appendices [D](https://arxiv.org/html/2605.08111#A4) and [E](https://arxiv.org/html/2605.08111#A5), respectively.

The FMRI benchmark dataset (Smith et al., [2011](https://arxiv.org/html/2605.08111#bib.bib229)) provides rich, realistic simulated blood-oxygen-level-dependent (BOLD) time series for modeling brain networks. The dataset measures activity among 5 brain regions via changes in blood flow. Each brain region is treated as a node, and 2400 samples are recorded per node. The ground truth causal graph of this dataset is also provided for performance evaluation.

Evaluation Metrics: We evaluate the performance of our proposed causal discovery method using Structural Hamming Distance (SHD), F1 Score, and False Discovery Rate (FDR). SHD is the number of edge corrections (deletions, insertions) needed to match the predicted graph to the true causal graph. FDR is the fraction of incorrectly predicted edges among all predicted edges, taking the direction of each edge into account. The F1 Score is the harmonic mean of precision and recall; it ranges from 0 to 1, and a higher value indicates a better prediction of the true graph. In contrast, lower SHD and FDR represent better performance of the causal discovery method.
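These three metrics can be computed directly from binary directed adjacency matrices. A small sketch, assuming the common simplification that a reversed edge counts as one deletion plus one insertion:

```python
import numpy as np

def graph_metrics(pred, true):
    """SHD, F1, and FDR for binary directed adjacency matrices.
    SHD counts the edge insertions/deletions needed to match the true
    graph; FDR is the direction-aware fraction of wrong predicted edges."""
    pred, true = np.asarray(pred, bool), np.asarray(true, bool)
    tp = np.sum(pred & true)            # correctly predicted edges
    fp = np.sum(pred & ~true)           # spurious predicted edges
    fn = np.sum(~pred & true)           # missed true edges
    shd = int(fp + fn)
    fdr = fp / max(pred.sum(), 1)
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(true.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return shd, f1, fdr

true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])  # 1->2, 2->3
pred = np.array([[0, 1, 0], [0, 0, 0], [1, 0, 0]])  # 1->2 found, 3->1 spurious
print(graph_metrics(pred, true))
```

Here one true edge is recovered, one is missed, and one spurious edge is added, giving SHD = 2 and FDR = F1 = 0.5.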

## 6 Results

In this section, we present comparative results for time series causal discovery between the proposed method and state-of-the-art methods. To evaluate our proposed method, we considered 8 SOTA methods as baselines: CD-NOD (Huang et al., [2020](https://arxiv.org/html/2605.08111#bib.bib150)), LIN (Liu and Kuang, [2023](https://arxiv.org/html/2605.08111#bib.bib226)), PCMCI+ (Runge, [2020](https://arxiv.org/html/2605.08111#bib.bib139)), DYNOTEARS (Pamfil et al., [2020](https://arxiv.org/html/2605.08111#bib.bib140)), NTS-NOTEARS (Sun et al., [2023](https://arxiv.org/html/2605.08111#bib.bib141)), PCMCI (Runge et al., [2019b](https://arxiv.org/html/2605.08111#bib.bib138)), NOTEARS-MLP (Zheng et al., [2020](https://arxiv.org/html/2605.08111#bib.bib136)), and DAG-GNN (Yu et al., [2019](https://arxiv.org/html/2605.08111#bib.bib126)). The first six methods can learn causal graphs for time series data; among these, CD-NOD and LIN work on non-stationary data. Although the LIN method assumes both intervention and observation samples, in our experiments we set the intervention parameter to 0 to model observational data, meaning no intervention is applied. While NOTEARS-MLP and DAG-GNN were proposed for non-temporal data, we include them due to their strong performance and widespread usage in different domains (Entner and Hoyer, [2010](https://arxiv.org/html/2605.08111#bib.bib149); Huang et al., [2021](https://arxiv.org/html/2605.08111#bib.bib198)).
We transformed the lagged and instantaneous data into a long sequence such as $\{x^1_{t-5}, x^2_{t-5}, x^3_{t-5}, x^4_{t-5}, x^1_{t-4}, x^2_{t-4}, x^3_{t-4}, x^4_{t-4}, \dots, x^1_t, x^2_t, x^3_t, x^4_t\}$, so that the transformed dataset could be applied to the non-temporal methods to find lagged and current-time causal relationships. For a fair comparison, we carefully tuned the hyperparameters of each method to obtain its best evaluation scores. The hyperparameters used for all baseline methods are provided in Appendix [F](https://arxiv.org/html/2605.08111#A6).
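This window flattening can be sketched as follows (the function name is hypothetical; the idea is simply sliding a window of $l_{max}+1$ timesteps and concatenating it into one row per sample):

```python
import numpy as np

def flatten_windows(X, l_max):
    """Turn a (T, n) series into rows [x_{t-l_max}^{1..n}, ..., x_t^{1..n}]
    so non-temporal methods see lagged and current values as one sample."""
    T, n = X.shape
    rows = [X[t - l_max:t + 1].reshape(-1) for t in range(l_max, T)]
    return np.stack(rows)            # shape: (T - l_max, n * (l_max + 1))

X = np.arange(12, dtype=float).reshape(6, 2)   # toy series: T=6, n=2
Z = flatten_windows(X, l_max=2)
print(Z.shape)                                  # (4, 6)
```

Each flattened row then plays the role of one i.i.d. sample for a non-temporal method such as NOTEARS-MLP or DAG-GNN.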

Table 1: Comparative analysis of the full causal graph predicted by different baseline methods for synthetic and real-world datasets.

To evaluate the performance of the baseline methods, we compared the predicted causal graphs in the full temporal graph setting, considering both the edge direction and the time lag of each edge. The quantitative comparison is reported in Table [1](https://arxiv.org/html/2605.08111#S6.T1), where the best scores are marked in bold and underlined values represent the second-best scores. From Table [1](https://arxiv.org/html/2605.08111#S6.T1), our proposed method obtains the best results on the Synthetic Dataset-1, TKE, and Arctic Sea Ice datasets, and comparable results on the other datasets. For Dataset-2, the TCDF method yielded the same SHD and FDR scores but a lower F1 score, indicating that it predicted fewer edges than exist in the ground truth. For the FMRI dataset, DAG-GNN achieved the best F1 score, while the proposed method produced a better FDR score with the same SHD. As this dataset represents chain-like relationships, the proposed method failed to detect all target edges, ultimately generating fewer edges with high causal strengths. Baseline methods for non-stationary data also performed well on the FMRI dataset. Overall, these comparative results demonstrate that our proposed framework generates better quality causal graphs for non-stationary temporal data and identifies true causal edges with fewer spurious links than state-of-the-art baseline models.

Table 2: Ablation analysis between the proposed framework and its different variants.

### 6.1 Ablation Study

A comparative study of the proposed framework and its variants verifies the effectiveness of each component. In the TTCD Normal Transformer variant, we used a standard transformer rather than the non-stationary transformer, keeping the Causal Structure Learner unchanged. The TTCD w/o DSB variant removes the de-stationary factor learning block (DSB) to evaluate its contribution, and TTCD w/o Frequency excludes the frequency-domain attention from the non-stationary transformer while keeping the other modules unchanged. The evaluation results in Table [2](https://arxiv.org/html/2605.08111#S6.T2) show that the non-stationary transformer learns informative latent features better than a standard transformer on multivariate non-stationary data. Moreover, removing either the DSB or the frequency-domain attention block degrades performance across all datasets, demonstrating their effectiveness for robust causal discovery.

## 7 Conclusion

In this paper, we propose TTCD, a score-based causal structure learning method for non-stationary time series data that integrates a non-stationary transformer with a custom Causal Conv2D module. The proposed method leverages temporal- and frequency-domain attention, enhanced by non-stationarity profiling and de-stationary factor learning networks, to learn important non-stationary features and refined reconstructed signals. The custom Causal Structure Learner keeps the causal contributors of each target/effect variable isolated from those of other target variables, which helps estimate a better causal structure from the distilled signals. Unlike many existing methods, the proposed framework does not require prior knowledge about variable independence, noise distributions, or the underlying data generation process. We conducted extensive experiments on synthetic, benchmark, and real-world complex time series datasets to demonstrate the performance of the proposed causal discovery framework. Experimental analysis shows that TTCD achieves superior causal graph learning compared to state-of-the-art baselines. In the future, we will analyze more benchmarks and real-world datasets from other domains and evaluate the sensitivity of different parameters.

## Acknowledgment

This work is supported by NSF grants: CAREER: Big Data Climate Causality \(OAC\-1942714\) and HDR Institute: HARP \- Harnessing Data and Model Revolution in the Polar Regions \(OAC\-2118285\)\. The work at LLNL was performed under the auspices of the U\.S\. Department of Energy by Lawrence Livermore National Laboratory under Contract LLNL\-MI\-2016887\.

## References

- Neural time-invariant causal discovery from time series data. In 2023 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
- M. O. Appiah (2018) Investigating the multivariate Granger causality between energy consumption, economic growth and CO2 emissions in Ghana. Energy Policy 112, pp. 198–208.
- M. T. Bahadori and Y. Liu (2012) On causality inference in time series. In 2012 AAAI Fall Symposium Series.
- Y. Cheung and K. S. Lai (1995) Lag order and critical values of the augmented Dickey–Fuller test. Journal of Business & Economic Statistics 13(3), pp. 277–280.
- D. Colombo, M. H. Maathuis, et al. (2014) Order-independent constraint-based causal structure learning. J. Mach. Learn. Res. 15(1), pp. 3741–3782.
- S. Endo, A. M. Fridlind, W. Lin, A. M. Vogelmann, T. Toto, A. S. Ackerman, G. M. McFarquhar, R. C. Jackson, H. H. Jonsson, and Y. Liu (2015) RACORO continental boundary layer cloud investigations: 2. Large-eddy simulations of cumulus clouds and evaluation with in situ and ground-based observations. Journal of Geophysical Research: Atmospheres 120(12), pp. 5993–6014.
- D. Entner and P. O. Hoyer (2010) On causal discovery from time series data using FCI. Probabilistic Graphical Models, pp. 121–128.
- M. H. Ferdous, U. Hasan, and M. O. Gani (2023) CDANs: temporal causal discovery from autocorrelated and non-stationary time series data.
- D. Fujiwara, K. Koyama, K. Kiritoshi, T. Okawachi, T. Izumitani, and S. Shimizu (2023) Causal discovery for non-stationary non-linear time series data using just-in-time modeling. In Proceedings of the Second Conference on Causal Learning and Reasoning, PMLR Vol. 213, pp. 880–894.
- A. Gerhardus and J. Runge (2020) High-recall causal discovery for autocorrelated time series with latent confounders. Advances in Neural Information Processing Systems 33, pp. 12615–12625.
- C. Glymour, K. Zhang, and P. Spirtes (2019) Review of causal discovery methods based on graphical models. Frontiers in Genetics 10, pp. 524.
- C. Gong, C. Zhang, D. Yao, J. Bi, W. Li, and Y. Xu (2024) Causal discovery from temporal data: an overview and new perspectives. ACM Computing Surveys 57(4), pp. 1–38.
- W. I. Gustafson, A. M. Vogelmann, Z. Li, X. Cheng, K. K. Dumas, S. Endo, K. L. Johnson, B. Krishna, T. Fairless, and H. Xiao (2020) The Large-Eddy Simulation (LES) Atmospheric Radiation Measurement (ARM) Symbiotic Simulation and Observation (LASSO) activity for continental shallow convection. Bulletin of the American Meteorological Society 101(4), pp. E462–E479.
- U. Hasan, E. Hossain, and M. O. Gani (2023) A survey on causal discovery methods for IID and time series data. arXiv preprint arXiv:2303.15027.
- D. E. Heckerman, E. J. Horvitz, and B. N. Nathwani (1992) Toward normative expert systems: Part I. The Pathfinder project. Methods of Information in Medicine 31(02), pp. 90–105.
- J. O. Hinze (1975) Turbulence. McGraw-Hill.
- B. Huang, K. Zhang, Y. Lin, B. Schölkopf, and C. Glymour (2018) Generalized score functions for causal discovery. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1551–1560.
- B. Huang, K. Zhang, J. Zhang, J. Ramsey, R. Sanchez-Romero, C. Glymour, and B. Schölkopf (2020) Causal discovery from heterogeneous/nonstationary data. The Journal of Machine Learning Research 21(1), pp. 3482–3534.
- Y. Huang, M. Kleindessner, A. Munishkin, D. Varshney, P. Guo, and J. Wang (2021) Benchmarking of data-driven causality discovery approaches in the interactions of Arctic sea ice and atmosphere. Frontiers in Big Data 4, pp. 642182.
- A. Hyvärinen, K. Zhang, S. Shimizu, and P. O. Hoyer (2010) Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research 11(5).
- M. Kang, D. Chen, N. Meng, G. Yan, and W. Yu (2022) Identifying unique causal network from nonstationary time series. arXiv preprint arXiv:2211.10085.
- L. Kong, W. Li, H. Yang, Y. Zhang, J. Guan, and S. Zhou (2024) CausalFormer: an interpretable transformer for temporal causal discovery. IEEE Transactions on Knowledge and Data Engineering.
- D. Kwiatkowski, P. C. Phillips, P. Schmidt, and Y. Shin (1992) Testing the null hypothesis of stationarity against the alternative of a unit root: how sure are we that economic time series have a unit root? Journal of Econometrics 54(1-3), pp. 159–178.
- H. Li, V. Cabeli, N. Sella, and H. Isambert (2019) Constraint-based causal structure learning with consistent separating sets. Advances in Neural Information Processing Systems 32.
- R. Li, M. Jiang, Q. Liu, K. Wang, K. Feng, Y. Sun, and X. Zhou (2025) FAITH: frequency-domain attention in two horizons for time series forecasting. Knowledge-Based Systems 309, pp. 112790.
- C. Liu and K. Kuang (2023) Causal structure learning for latent intervened non-stationary data. In International Conference on Machine Learning, pp. 21756–21777.
- Y. Liu, H. Wu, J. Wang, and M. Long (2022) Non-stationary transformers: exploring the stationarity in time series forecasting. Advances in Neural Information Processing Systems 35, pp. 9881–9893.
- S. Löwe, D. Madras, R. Zemel, and M. Welling (2022) Amortized causal discovery: learning to infer causal graphs from time-series data. In Conference on Causal Learning and Reasoning, pp. 509–525.
- S. Mameche, L. Cornanguer, U. Ninad, and J. Vreeken (2025) SPACETIME: causal discovery from non-stationary time series. arXiv preprint arXiv:2501.10235.
- M. Nauta, D. Bucur, and C. Seifert (2019) Causal discovery with attention-based convolutional neural networks. Machine Learning and Knowledge Extraction 1(1), pp. 312–340.
- R. Pamfil, N. Sriwattanaworachai, S. Desai, P. Pilgerstorfer, K. Georgatzis, P. Beaumont, and B. Aragam (2020) DYNOTEARS: structure learning from time-series data. In International Conference on Artificial Intelligence and Statistics, pp. 1595–1605.
- J. Pearl (1991) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Series in Representation and Reasoning.
- J. Pearl (2000) Models, reasoning and inference. Cambridge, UK: Cambridge University Press 19(2), pp. 3.
- J. Peters, D. Janzing, and B. Schölkopf (2013) Causal inference on time series using restricted structural equation models. Advances in Neural Information Processing Systems 26.
- J. Peters, D. Janzing, and B. Schölkopf (2017) Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, Cambridge, MA, USA.
- J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf (2011) Identifiability of causal graphs using functional models. UAI'11, Arlington, Virginia, USA, pp. 589–598. ISBN 9780974903972.
- J. C. Rajapakse and J. Zhou (2007) Learning effective brain connectivity with dynamic Bayesian networks. NeuroImage 37(3), pp. 749–760.
- C. B. Rodas, R. Tu, and H. Kjellstrom (2021) Causal discovery from conditionally stationary time-series. arXiv preprint arXiv:2110.06257.
- J. Runge, S. Bathiany, E. M. Bollt, G. Camps-Valls, D. Coumou, E. R. Deyle, C. Glymour, M. Kretschmer, M. D. Mahecha, J. Muñoz-Marí, E. H. van Nes, J. Peters, R. Quax, M. Reichstein, M. Scheffer, B. Scholkopf, P. Spirtes, G. Sugihara, J. Sun, K. Zhang, and J. Zscheischler (2019a) Inferring causation from time series in Earth system sciences. Nature Communications 10.
- J. Runge, P. Nowack, M. Kretschmer, S. Flaxman, and D. Sejdinovic (2019b) Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances 5(11), eaau4996.
- J. Runge (2020) Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR Vol. 124, pp. 1388–1397.
- A. Sadeghi, A. Gopal, and M. Fesanghary (2024) Causal discovery from nonstationary time series. International Journal of Data Science and Analytics, pp. 1–27.
- A. D. Sanford and I. A. Moosa (2012) A Bayesian network structure for operational risk modelling in structured finance operations. Journal of the Operational Research Society 63, pp. 431–444.
- T. Schäck, M. Muma, M. Feng, C. Guan, and A. M. Zoubir (2017) Robust nonlinear causality analysis of nonstationary multivariate physiological time series. IEEE Transactions on Biomedical Engineering 65(6), pp. 1213–1225.
- R. D. Shah and J. Peters (2020) The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics 48(3).
- W. C. Skamarock, J. B. Klemp, J. Dudhia, D. O. Gill, Z. Liu, J. Berner, W. Wang, J. Powers, M. Duda, D. Barker, et al. (2019) A description of the Advanced Research WRF Version 4. NCAR Tech. Note NCAR/TN-556+STR, 145.
- S. M. Smith, K. L. Miller, G. Salimi-Khorshidi, M. Webster, C. F. Beckmann, T. E. Nichols, J. D. Ramsey, and M. W. Woolrich (2011) Network modelling methods for FMRI. NeuroImage 54(2), pp. 875–891.
- P. Spirtes, C. N. Glymour, and R. Scheines (2000) Causation, Prediction, and Search. MIT Press, Cambridge, MA, USA.
- X. Sun, O. Schulte, G. Liu, and P. Poupart (2023) NTS-NOTEARS: learning nonparametric DBNs with prior knowledge. In International Conference on Artificial Intelligence and Statistics, pp. 1942–1964.
- A. Tank, I. Covert, N. Foti, A. Shojaie, and E. B. Fox (2021) Neural Granger causality. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(8), pp. 4267–4279.
- S. Triantafillou and I. Tsamardinos (2016) Score-based vs constraint-based causal learning in the presence of confounders. In CFA@UAI, pp. 59–67.
- Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, and L. Sun (2023) Transformers in time series: a survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI '23.
- K\. Yi, Q\. Zhang, W\. Fan, S\. Wang, P\. Wang, H\. He, N\. An, D\. Lian, L\. Cao, and Z\. Niu \(2023\)Frequency\-domain mlps are more effective learners in time series forecasting\.Advances in Neural Information Processing Systems36,pp\. 76656–76679\.Cited by:[§4\.1](https://arxiv.org/html/2605.08111#S4.SS1.p3.3)\.
- Y\. Yu, J\. Chen, T\. Gao, and M\. Yu \(2019\)DAG\-gnn: dag structure learning with graph neural networks\.InInternational Conference on Machine Learning,pp\. 7154–7163\.Cited by:[§1](https://arxiv.org/html/2605.08111#S1.p3.1),[§2](https://arxiv.org/html/2605.08111#S2.p1.1),[§4](https://arxiv.org/html/2605.08111#S4.p1.9),[§6](https://arxiv.org/html/2605.08111#S6.p1.3)\.
- A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\)Are transformers effective for time series forecasting?\.InProceedings of the AAAI conference on artificial intelligence,Vol\.37,pp\. 11121–11128\.Cited by:[§1](https://arxiv.org/html/2605.08111#S1.p3.1)\.
- X\. Zheng, B\. Aragam, P\. Ravikumar, and E\. P\. Xing \(2018\)DAGs with NO TEARS: Continuous Optimization for Structure Learning\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2605.08111#S1.p3.1),[§4\.3](https://arxiv.org/html/2605.08111#S4.SS3.p1.13),[§4\.3](https://arxiv.org/html/2605.08111#S4.SS3.p1.17)\.
- X\. Zheng, C\. Dan, B\. Aragam, P\. Ravikumar, and E\. P\. Xing \(2020\)Learning sparse nonparametric DAGs\.InInternational Conference on Artificial Intelligence and Statistics,Cited by:[Table 3](https://arxiv.org/html/2605.08111#A1.T3.1.7.7.1.1.1),[§1](https://arxiv.org/html/2605.08111#S1.p3.1),[§2](https://arxiv.org/html/2605.08111#S2.p1.1),[§4](https://arxiv.org/html/2605.08111#S4.p1.9),[§6](https://arxiv.org/html/2605.08111#S6.p1.3)\.
- H\. Zhifeng, Z\. Weijie, C\. Ruichu, and C\. Wei \(2024\)Non\-stationary causal discovery method based on conditional independence test\.\.Journal of Computer Engineering & Applications60\(10\)\.Cited by:[§1](https://arxiv.org/html/2605.08111#S1.p4.1),[§2](https://arxiv.org/html/2605.08111#S2.p2.1)\.
- T\. Zhou, Z\. Ma, Q\. Wen, X\. Wang, L\. Sun, and R\. Jin \(2022\)Fedformer: frequency enhanced decomposed transformer for long\-term series forecasting\.InInternational conference on machine learning,pp\. 27268–27286\.Cited by:[§4\.1](https://arxiv.org/html/2605.08111#S4.SS1.p3.3)\.

Appendix

## Appendix A: Comparison of Causal Discovery Methods

Causal discovery methods rely on different assumptions about the data distribution and the causal structure of the target system. A comprehensive summary of the assumptions used by existing causal discovery methods is provided in Table [3](https://arxiv.org/html/2605.08111#A1.T3). The four most commonly used assumptions appear as columns: acyclicity, stationarity of the data, the Markov and faithfulness assumptions, and causal sufficiency. The last column lists any additional criteria specific to each method.

Table 3: Assumptions used by different existing causal discovery methods.

| Method | Acyclicity | Stationarity | Markov & Faithfulness | Causal Sufficiency | Others |
|---|---|---|---|---|---|
| Granger Causality | No | Yes | No | Yes | Linear relationship |
| PC-Stable Colombo et al. ([2014](https://arxiv.org/html/2605.08111#bib.bib231)) | Yes | Yes | Yes | Yes | |
| PCMCI Runge et al. ([2019b](https://arxiv.org/html/2605.08111#bib.bib138)) | Yes | Yes | Yes | Yes | |
| PCMCI+ Runge ([2020](https://arxiv.org/html/2605.08111#bib.bib139)) | Yes | Yes | Yes | Yes | |
| NOTEARS-MLP Zheng et al. ([2020](https://arxiv.org/html/2605.08111#bib.bib136)) | Yes | No | Yes | Yes | Linear relationship |
| NTS-NOTEARS Sun et al. ([2023](https://arxiv.org/html/2605.08111#bib.bib141)) | Yes | Yes | Yes | Yes | Nonlinear relationship |
| DYNOTEARS Pamfil et al. ([2020](https://arxiv.org/html/2605.08111#bib.bib140)) | Yes | Yes | Yes | Yes | |
| TCDF Nauta et al. ([2019](https://arxiv.org/html/2605.08111#bib.bib170)) | Yes | Yes | No | Yes | Attention weights capture causal importance |
| CD-NOD Huang et al. ([2020](https://arxiv.org/html/2605.08111#bib.bib150)) | Yes | No | Yes | No | Distribution shifts reveal causal influences |
| TS-FCI Entner and Hoyer ([2010](https://arxiv.org/html/2605.08111#bib.bib149)) | No | Yes | Yes | Yes | |
| VAR-LINGAM Hyvärinen et al. ([2010](https://arxiv.org/html/2605.08111#bib.bib230)) | Yes | Yes | No | Yes | Linear relationship |
| CausalFormer Kong et al. ([2024](https://arxiv.org/html/2605.08111#bib.bib227)) | Yes | Yes | No | Yes | |
| SpaceTime Mameche et al. ([2025](https://arxiv.org/html/2605.08111#bib.bib223)) | Yes | No | Yes | Yes | Distribution shift, independent changes |
| LIN Liu and Kuang ([2023](https://arxiv.org/html/2605.08111#bib.bib226)) | Yes | No | Yes | Yes | Interventional data and equivalence class |
| TTCD (Ours) | Yes | No | Yes | Yes | Nonlinear relationship |

## Appendix B: Identifiability of Causal Graph

Given the assumptions stated earlier and a time series that follows a nonlinear function with additive noise, the full-time causal graph $G$ is identifiable from the data distribution. Equation [1](https://arxiv.org/html/2605.08111#S3.E1) therefore follows an identifiable functional model class (IFMOC) Peters et al. ([2011](https://arxiv.org/html/2605.08111#bib.bib213); [2013](https://arxiv.org/html/2605.08111#bib.bib212)), in which the causal graph is acyclic. Motivated by Peters et al. ([2013](https://arxiv.org/html/2605.08111#bib.bib212)), we derive the following identifiability argument. Assume two different directed acyclic causal graphs $G_1$ and $G_2$ are obtained from the distribution of $X_t$. Suppose an edge between $x^i$ and $x^j$ with time lag $p$, i.e., $x^i_{t-p} \to x^j_t$, exists in $G_1$ but not in $G_2$. By the causal faithfulness assumption, $G_1$ implies $x^i_{t-p} \not\perp\!\!\!\perp x^j_t \mid \{X^k_{t-l} \setminus \{x^i_{t-p}, x^j_t\},\, 1 \leq k \leq n,\, 1 \leq l \leq l_{max}\}$. The Markov condition on $G_2$, however, gives $x^i_{t-p} \perp\!\!\!\perp x^j_t \mid \{X^k_{t-l} \setminus \{x^i_{t-p}, x^j_t\},\, 1 \leq k \leq n,\, 1 \leq l \leq l_{max}\}$. These two statements contradict each other for the same data distribution; hence the full-time causal graphs $G_1$ and $G_2$ must be equal and represent the same IFMOC.

## Appendix C: Synthetic Dataset and Ground Truth Causal Graph

The synthetic dataset-1 is generated using the following equations; the noise signals in this dataset are drawn from a Gaussian distribution. We use a sinusoidal nonlinearity, and the dataset contains both instantaneous and time-lagged causal relationships.

$$X^{1}_{t} = 0.5\,X^{1}_{t-5} + 0.5\,X^{1}_{t-2} + \varepsilon_{1}$$
$$X^{2}_{t} = 0.1\,X^{1}_{t} + 0.7\,X^{1}_{t-1} + 1.5\sin(t/50) + \varepsilon_{2}$$
$$X^{3}_{t} = 0.8\,X^{1}_{t-1} + \varepsilon_{3}$$
$$X^{4}_{t} = 0.2\,X^{4}_{t-1} + 0.4\,X^{3}_{t} + 0.4\,X^{3}_{t-1} + 0.4\,X^{1}_{t-1} + \sin\!\left(\tfrac{t}{50}\right) + \sin\!\left(\tfrac{t}{20}\right) + \varepsilon_{4}$$
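As a sketch, the generating process for dataset-1 can be simulated with NumPy; the series length, burn-in handling, and random seed below are illustrative assumptions, not values specified in the paper.

```python
import numpy as np

def generate_dataset1(T=1000, seed=0):
    """Simulate synthetic dataset-1 (4 variables, Gaussian noise)."""
    rng = np.random.default_rng(seed)
    lmax = 5                                 # largest lag in the equations
    X = np.zeros((T + lmax, 4))              # columns: X1..X4
    eps = rng.normal(size=(T + lmax, 4))     # Gaussian noise terms
    for t in range(lmax, T + lmax):
        X[t, 0] = 0.5 * X[t-5, 0] + 0.5 * X[t-2, 0] + eps[t, 0]
        X[t, 1] = (0.1 * X[t, 0] + 0.7 * X[t-1, 0]
                   + 1.5 * np.sin(t / 50) + eps[t, 1])
        X[t, 2] = 0.8 * X[t-1, 0] + eps[t, 2]
        X[t, 3] = (0.2 * X[t-1, 3] + 0.4 * X[t, 2] + 0.4 * X[t-1, 2]
                   + 0.4 * X[t-1, 0]
                   + np.sin(t / 50) + np.sin(t / 20) + eps[t, 3])
    return X[lmax:]                          # drop the burn-in rows

data = generate_dataset1()
print(data.shape)  # (1000, 4)
```

Note that the instantaneous edges ($X^1_t \to X^2_t$ and $X^3_t \to X^4_t$) are handled by computing the parents earlier within the same loop iteration.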
The synthetic dataset-2 is generated using the equations below, with noise signals drawn from a Poisson distribution. Here we use an exponential nonlinearity via the term $f(x) = x + 5x^{2}e^{-\frac{x^{2}}{20}}$. All variables in this dataset are also non-stationary.

$$X^{1}_{t} = \frac{t + 0.2t}{300}$$
$$X^{2}_{t} = 0.2\,f(X^{2}_{t-1}) + 0.3\,f(X^{1}_{t-1}) + \mathcal{N}(0,1)$$
$$X^{3}_{t} = 0.5\,f(X^{3}_{t-1}) + 0.2\,f(X^{1}_{t-4}) + \mathcal{N}(0,1)$$
$$X^{4}_{t} = 0.7\,f(X^{4}_{t-1}) + 0.5\,f(X^{3}_{t-3}) + 0.8\,f(X^{2}_{t}) + \mathcal{N}(0,1)$$
$$X^{5}_{t} = 0.6\,f(X^{5}_{t-2}) + 0.2\,f(X^{1}_{t-1}) + \mathcal{N}(0,1)$$
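A corresponding sketch for dataset-2 follows, using the standard-normal noise terms shown in the equations; as before, the series length, burn-in, and seed are our assumptions.

```python
import numpy as np

def f(x):
    # Exponential nonlinearity from the paper: f(x) = x + 5 x^2 exp(-x^2 / 20)
    return x + 5 * x**2 * np.exp(-x**2 / 20)

def generate_dataset2(T=1000, seed=0):
    """Simulate synthetic dataset-2 (5 variables, nonlinear lagged links)."""
    rng = np.random.default_rng(seed)
    lmax = 4                                 # largest lag in the equations
    X = np.zeros((T + lmax, 5))              # columns: X1..X5
    n = rng.normal(size=(T + lmax, 5))       # N(0,1) noise terms
    for t in range(lmax, T + lmax):
        X[t, 0] = (t + 0.2 * t) / 300        # deterministic increasing trend
        X[t, 1] = 0.2 * f(X[t-1, 1]) + 0.3 * f(X[t-1, 0]) + n[t, 1]
        X[t, 2] = 0.5 * f(X[t-1, 2]) + 0.2 * f(X[t-4, 0]) + n[t, 2]
        X[t, 3] = (0.7 * f(X[t-1, 3]) + 0.5 * f(X[t-3, 2])
                   + 0.8 * f(X[t, 1]) + n[t, 3])   # contemporaneous X2 -> X4
        X[t, 4] = 0.6 * f(X[t-2, 4]) + 0.2 * f(X[t-1, 0]) + n[t, 4]
    return X[lmax:]
```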
The ground truth causal graph of the synthetic dataset-1 is illustrated in Figure [3](https://arxiv.org/html/2605.08111#A3.F3)a, where X1 is a common cause of all the other variables. The time lag between each cause and effect variable pair is given on the edge connecting them. Figure [3](https://arxiv.org/html/2605.08111#A3.F3)b visualizes the true causal relationships between the variables of the TKE dataset.

![Causal graphs of synthetic dataset-1 and the TKE dataset](https://arxiv.org/html/2605.08111v1/x3.png)

Figure 3: Causal graph of (a) our synthetic dataset-1 and (b) the real world Turbulence Kinetic Energy (TKE) dataset.
## Appendix D: Arctic Sea Ice Data

The Arctic Sea Ice dataset includes the following 11 atmospheric variables together with the sea ice extent. This time series contains monthly averages from 1980 to 2018 over the Arctic region above 60°N.

Table 4:Variables in the Arctic Sea Ice Data\.
## Appendix E: Non-Stationarity Test Results for Real World Datasets

The non-stationarity of the real world TKE and Arctic Sea Ice datasets is evaluated using two statistical tests for time series: the Augmented Dickey–Fuller (ADF) test Cheung and Lai ([1995](https://arxiv.org/html/2605.08111#bib.bib210)) and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test Kwiatkowski et al. ([1992](https://arxiv.org/html/2605.08111#bib.bib211)). The ADF test takes as its null hypothesis that the time series has a unit root and is therefore non-stationary. If the p-value of a series exceeds the 0.05 alpha level, the null hypothesis cannot be rejected and the series is considered non-stationary. The KPSS test works in a complementary way but with the inverse null hypothesis: the time series is stationary. If its p-value falls below the 0.05 alpha level, the null hypothesis is rejected, indicating that the series is non-stationary.

Table 5: Non-stationarity test results for variables of the TKE dataset.

The statistical non-stationarity test results for the TKE dataset are given in Table [5](https://arxiv.org/html/2605.08111#A5.T5), and the results for the Arctic Sea Ice data are available in Table [6](https://arxiv.org/html/2605.08111#A5.T6). The results show that the TKE dataset contains only non-stationary variables, with both test methods agreeing on the outcome. For the Arctic Sea Ice dataset, the ADF test found 4 non-stationary variables, whereas the KPSS method found 3. The Arctic Sea Ice data therefore contain a mixture of non-stationary and stationary variables.

Table 6:Non\-stationarity test results for variables of the Arctic Sea Ice dataset\.
## Appendix F: Hyperparameters

To find the best hyperparameters for the baseline methods, we started from the values suggested by the authors and tuned them gradually to obtain better evaluation results. The results reported in the comparative analysis of the main article were obtained with these tuned hyperparameters. The parameters used to generate the evaluation results are given below.

- PCMCI: conditional independence test = ParCorr, `tau_max` = maximum time lag, `pc_alpha` = None (the model selects the optimal value from {0.05, 0.1, 0.2, 0.3, 0.4, 0.5}), `alpha_level` = 0.01
- PCMCI+: conditional independence test = ParCorr, `tau_max` = maximum time lag, `pc_alpha` = None (the model selects the optimal value from {0.001, 0.005, 0.01, 0.025, 0.05})
- NOTEARS-MLP: `lambda1` = 0.01, `lambda2` = 0.01, `rho` = 1.0, `alpha` = 0.0, `w_threshold` = 0.3
- NTS-NOTEARS: `lambda1` = 0.0005, `lambda2` = 0.001, `w_threshold` = 0.3, `rho` = 1.0, `alpha` = 0.0, `number_of_lags` = maximum time lag
- DYNOTEARS: `tau_max` = maximum time lag, `w_threshold` = 0.01, `lambda_w` = 0.05, `lambda_a` = 0.05
- CD-NOD: `indep_test` = fisherz
- LIN: E (number of interventions) = 1, `no_hidden_layer` = [1, 2], `hidden_dim` = [3, 4]
- Proposed method: `lambda1` = 0.9, `alpha` = 1.0, `rho` = 1.0, `w_threshold` ∈ {0.002, 0.004, 0.007, 0.17}, `hidden_dim` ($d_e$) = 16

## Appendix G: Ablation Study

To understand the effectiveness of the proposed non-stationary transformer with the custom Causal Conv2D module, we created a variant of the proposed framework without the transformer. The architecture of this model variant is illustrated in Figure [4](https://arxiv.org/html/2605.08111#A7.F4). Here, we replaced the transformer in the Non-Stationary Feature Learner with a comparatively simple convolutional neural network module. The module learns latent temporal features of the input data and integrates the factors learned by the de-stationary factor learning MLP. Finally, these rescaled latent features are provided to the Causal Structure Learner module to generate the causal graph. To optimize the Causal Conv2D model, we used three components of the combined loss function: the target estimation loss ($L_{te}$), the acyclicity constraint, and $L_1$ regularization of the learned adjacency matrix. The input data reconstruction loss is not included in the objective function, as we did not use an autoencoder architecture.
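The acyclicity and $L_1$ penalty terms mentioned above can be sketched numerically. The sketch below uses the NOTEARS-style constraint $h(A) = \mathrm{tr}(e^{A \circ A}) - d$, which is zero exactly when the weighted adjacency matrix $A$ is a DAG; the truncated power series for the matrix exponential and the example matrices are illustrative assumptions, and the paper's exact loss weighting is not reproduced here.

```python
import numpy as np

def acyclicity(A):
    """NOTEARS-style penalty h(A) = tr(exp(A * A)) - d (zero iff A is a DAG)."""
    d = A.shape[0]
    M = A * A                       # elementwise square (Hadamard product)
    E, term = np.eye(d), np.eye(d)
    for k in range(1, 30):          # truncated power series for exp(M)
        term = term @ M / k
        E = E + term
    return np.trace(E) - d

def l1_penalty(A):
    """Sparsity regularizer on the learned adjacency matrix."""
    return np.abs(A).sum()

A_dag = np.array([[0.0, 0.8], [0.0, 0.0]])   # acyclic: penalty ~ 0
A_cyc = np.array([[0.0, 0.8], [0.8, 0.0]])   # 2-cycle: penalty > 0
print(round(acyclicity(A_dag), 6), acyclicity(A_cyc) > 0)  # 0.0 True
```

In training, these terms are added to the target estimation loss and driven toward zero, e.g. via an augmented Lagrangian with multipliers `alpha` and `rho` as listed in Appendix F.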

![Causal Conv2D model structure](https://arxiv.org/html/2605.08111v1/x4.png)

Figure 4: The structure of the Causal Conv2D model without the transformer. Instead of a non-stationary transformer, a Conv2D block is used to learn non-stationary features with the de-stationary factor MLP.

For TTCD Causal Conv1D, we used a 1D variant of the proposed custom Causal Conv2D layer. To incorporate this Conv1D layer into the model, we flattened the latent representation generated by the non-stationary transformer of the proposed architecture. The same training and optimization process was used for both models. The experimental analysis of these ablation models shows that the non-stationary transformer and the custom Causal Conv2D layer improve the causal graph learning performance of the proposed model by a significant margin on every evaluation score.

Table 7:Ablation study on normal transformer \(TTCD\-Without\-Transformer\) and 1D design of the causal structure learner \(TTCD\-Causal\-Conv1D\)\.
