REGEN: Reference-Guided Synthetic Multivariate Time Series Generation for Forecasting

arXiv cs.LG Papers

Summary

ReGeN is a reference-guided generative pipeline for multivariate time series. It decomposes observed sequences into a periodic backbone, stochastic residuals, and cross-variable dependencies, then recombines samples of these components to synthesize controllable data. The authors report that the generated data can substitute for real data in forecasting tasks and that it outperforms prior synthetic data generators.

arXiv:2606.05264v1

Cached at: 06/05/26, 08:09 AM

# ReGeN: Reference-Guided Synthetic Multivariate Time Series Generation for Forecasting
Source: [https://arxiv.org/html/2606.05264](https://arxiv.org/html/2606.05264)
Moulik Gupta1, Dhruv Kumar1,2, Murari Mandal1,3,†, Saurabh Deshpande1,†

1 Birla AI Labs, Office of Ananya Birla; 2 Birla Institute of Technology and Science, Pilani; 3 Kalinga Institute of Industrial Technology, Bhubaneswar; † Equal Supervision. {moulik.gupta-c, dhruv.kumar-c, murari.mandal-c, saurabh.deshpande-c}@oab.adityabirla.com

###### Abstract

Training robust multivariate time series forecasting models requires large, diverse corpora, yet many real-world domains provide only a handful of observed sequences. Existing generators fail to resolve this mismatch: prior-based approaches (e.g., CauKer, TimePFN) produce domain-agnostic samples, while data-driven methods (e.g., TimeGAN) treat references as black-box supervision, forfeiting explicit control over periodic structure, local variability, and cross-variable dynamics. We propose ReGeN, a reference-guided generative pipeline that treats observed sequences not as examples to imitate, but as structural scaffolds for controllable synthesis. ReGeN decomposes each reference into three interpretable components: a phase-aligned periodic backbone capturing dominant domain morphology; per-variable stochastic residuals modeled with a deep-kernel Gaussian process; and lag-aware cross-variable dependencies injected through a structural causal model with fitted coupling coefficients. Sampling these components at controllable temperature broadens distributional coverage while preserving domain-grounded structure. We show that ReGeN-generated data consistently substitutes for real sibling data with minimal forecasting degradation, and in strongly periodic domains such as traffic, can outperform the real source itself. We further show that a foundation model pretrained on ReGeN corpora outperforms those pretrained on prior-based and data-driven synthetic alternatives. This suggests that in low-data regimes, how reference data is structurally exploited can matter as much as how much data is available.

## 1 Introduction

Multivariate time series forecasting underpins decision-making across energy grids, traffic networks, cloud infrastructure, and climate systems. Yet its practical deployment remains bottlenecked by data scarcity: most operational multivariate corpora, from building energy portfolios to regional sensor networks to clinical monitoring deployments, contain tens to low hundreds of observed sequences per domain, far too few to train robust forecasting models that generalize beyond the observed distribution (Liu et al., [2024](https://arxiv.org/html/2606.05264#bib.bib51); Zeng et al., [2023](https://arxiv.org/html/2606.05264#bib.bib53); Wang et al., [2025](https://arxiv.org/html/2606.05264#bib.bib52); Ansari et al., [2024](https://arxiv.org/html/2606.05264#bib.bib3); Woo et al., [2024](https://arxiv.org/html/2606.05264#bib.bib2)). Fine-tuning a pretrained model can partially compensate, but only when the target domain is already represented in the pretraining corpus, which is rarely the case for niche industrial, environmental, or infrastructure settings. What is needed is a generation strategy that can read domain-specific structure from a small reference corpus and write it into a larger synthetic one.

Existing synthetic generators fall into two categories, each with a fundamental limitation. Prior-based generators (e.g., ForecastPFN (Dooley et al., [2023](https://arxiv.org/html/2606.05264#bib.bib32)), TimePFN (Taga et al., [2025](https://arxiv.org/html/2606.05264#bib.bib9)), CauKer (Xie et al., [2025](https://arxiv.org/html/2606.05264#bib.bib30))) construct synthetic data from domain-agnostic mathematical primitives (sinusoidal templates, Gaussian process kernel banks, randomly sampled DAGs) with no grounding in any target domain. The resulting corpora may be statistically plausible but carry none of the domain-specific morphology, uncertainty texture, or cross-variable coupling that a forecasting model must actually learn. Data-driven generators (e.g., TimeGAN (Yoon et al., [2019](https://arxiv.org/html/2606.05264#bib.bib31)), C-RNN-GAN (Mogren, [2016](https://arxiv.org/html/2606.05264#bib.bib43))) learn directly from observed data, but treat the entire reference corpus as a black-box training signal for a generative model. This approach requires large collections of real sequences to train reliably, and produces generators with no explicit control over periodic structure, local variability, or cross-variable dependencies. Neither approach is well-suited to the practitioner who has access to a small-to-moderate number of real multivariate sequences from a target domain.

We argue that the gap between these two regimes stems from a missed opportunity: the observed sequences themselves encode rich, domain\-specific structure that neither mathematical priors nor black\-box generators can recover\. Even a handful of real sequences from an energy building encodes its characteristic demand cycle, peak\-load uncertainty, and directed coupling between temperature and cooling load\. This structure is not a limitation to work around; it is a signal to exploit\. A generation strategy that fits every component to observed data would complement both camps: it would not require large corpora, nor sacrifice domain grounding, addressing the regime where most practitioners actually operate\.

This motivates a reference-guided approach to synthetic generation in which every component of the generative pipeline is fitted to observed data from the target domain, rather than sampled from generic priors or learned from a black-box objective. We propose ReGeN, a pipeline that treats observed sequences as a structural scaffold for synthesis by decomposing each reference into three interpretable layers: a phase-aligned periodic template to capture the dominant rhythmic backbone of the domain; per-variable stochastic residuals modeled by a deep-kernel Gaussian process to reproduce local uncertainty around that backbone; and a graph-based lag coupling structure inferred from the reference data to encode directed cross-variable dependencies. New synthetic sequences are generated by composing samples from each layer at controllable temperature, broadening distributional coverage while preserving the structural character of the target domain.

Our key contributions are as follows\.

- Reference-guided generation pipeline: We introduce ReGeN, a modular generator that grounds all three synthesis components in real domain observations rather than generic priors. Component ablations confirm that each layer contributes measurably to downstream performance.
- Comprehensive empirical validation: We evaluate across twelve datasets spanning five domains, three evaluation protocols (TRTR, TSTR, TRSTR), and five forecasting architectures including a foundation model (Moirai-small). In two-thirds of transfer settings, ReGeN synthetic data substitutes for real sibling data within a ±3% MSE margin; in strongly periodic domains it outperforms real-data transfer entirely. Training on the union of real and synthetic data yields consistent gains over real-only training for attention- and state-space-based architectures.
- Superiority over existing generators: We benchmark against both a reference-guided adversarial generator (TimeGAN) and a prior-based causal generator (CauKer) under matched corpus sizes. Foundation models pretrained on ReGeN corpora reduce Moirai MSE by 41% over TimeGAN and by 2.3% over CauKer, establishing that *how* reference data is exploited matters as much as *whether* it is available.

## 2 Related Work

Prior-based and data-driven synthetic generators. Synthetic time series generation has been pursued along two distinct tracks. Data-driven generators such as TimeGAN (Yoon et al., [2019](https://arxiv.org/html/2606.05264#bib.bib31)), C-RNN-GAN (Mogren, [2016](https://arxiv.org/html/2606.05264#bib.bib43)), and RCGAN (Esteban et al., [2017](https://arxiv.org/html/2606.05264#bib.bib44)) learn directly from a target corpus using adversarial and recurrent objectives. They produce realistic sequences when sufficient training data is available, but operate as black-box models with no explicit control over periodic structure, local variability, or cross-variable coupling, and they require enough real sequences to train the generator itself. Prior-based generators take the opposite approach, constructing synthetic data from domain-agnostic mathematical primitives without requiring any real target data. ForecastPFN (Dooley et al., [2023](https://arxiv.org/html/2606.05264#bib.bib32)) trains on Bayesian forecasting priors; Chronos augments pretraining with KernelSynth, which composes GP kernels to generate univariate series (Ansari et al., [2024](https://arxiv.org/html/2606.05264#bib.bib3)); TimePFN extends this to multivariate settings via GP kernel composition with a linear model of coregionalization (Taga et al., [2025](https://arxiv.org/html/2606.05264#bib.bib9)); CauKer combines GP kernels with a randomly sampled causal DAG to produce causally coherent multivariate sequences (Xie et al., [2025](https://arxiv.org/html/2606.05264#bib.bib30)); and SarSim uses SARIMA-based simulation with multi-seasonality and heavy-tailed perturbations for large-scale pretraining (Oreshkin et al., [2026](https://arxiv.org/html/2606.05264#bib.bib46)). These methods are well-suited to foundation model pretraining, but their synthetic structure bears no relationship to any specific target domain, making them unsuitable for dataset-conditioned augmentation.

Structural decomposition and augmentation. A complementary thread draws on classical time series decomposition. STL-style seasonal-trend decomposition separates a signal into interpretable components (Cleveland et al., [1990](https://arxiv.org/html/2606.05264#bib.bib23)), and structured probabilistic models such as Gaussian processes with compositional kernels (Roberts et al., [2013](https://arxiv.org/html/2606.05264#bib.bib24)) provide principled uncertainty over residuals. Lightweight augmentation methods such as TSMix (Darlow et al., [2023](https://arxiv.org/html/2606.05264#bib.bib33)) and mixup-style variants (Aggarwal and Srivastava, [2023](https://arxiv.org/html/2606.05264#bib.bib34)) improve downstream performance through interpolation, but do not model multivariate dependence or directed lag structure. ReGeN occupies a distinct position: unlike prior-based generators it conditions all three components on real reference observations, and unlike data-driven generators it requires only a moderate corpus while providing explicit control over periodic structure, residual uncertainty, and cross-variable coupling.

## 3 ReGeN

We consider the problem of generating synthetic multivariate time series that preserve three salient properties of real temporal systems: recurring within-variable structure, stochastic local variability, and directed dependence across variables. Let $\mathcal{D} = \{\mathbf{X}^{(s)}\}_{s=1}^{S}$ denote a collection of real multivariate sequences with $\mathbf{X}^{(s)} \in \mathbb{R}^{C \times T}$, where $C$ is the number of covariates. Our goal is to construct a generative mechanism for synthetic trajectories $\widetilde{\mathbf{X}} \in \mathbb{R}^{C \times T_{\mathrm{gen}}}$ that reproduces marginal temporal morphology while remaining faithful to cross-variable dynamics. Rather than modeling these three sources of structure monolithically, we explicitly separate (i) a low-frequency periodic backbone, (ii) a stochastic innovation process, and (iii) a graph-structured interaction mechanism, so that periodicity, uncertainty, and causal dependence can be modeled and controlled independently.

![Refer to caption](https://arxiv.org/html/2606.05264v1/Diagrams/reviseddiagram.png)

Figure 1: ReGeN pipeline overview. A: Extract a phase-aligned periodic template and compute residuals from the real multivariate time series. B: Aggregate residuals across series and apply VE-based filtering to retain reliable template–residual structure. C: Fit a CNN+LSTM encoder with an SVGP-based deep kernel prior to model residual dynamics. D: Sample template parameters and GP residuals, then combine them to reconstruct synthetic signals. E: Use the inferred DAG to inject cross-variate dependencies and assemble the final synthetic multivariate dataset.

Structural decomposition of each covariate. For each covariate $c$, we decompose the observed trajectory in normalized space as

$$y_{c,t} = \tau_{c,t} + r_{c,t}, \tag{1}$$

where $\tau_{c,t}$ denotes a recurring structural template and $r_{c,t}$ denotes the residual component. The template is intended to capture phase-locked regularity, while the residual absorbs local departures from this repeating pattern. This separation between recurring structure and stochastic remainder is directly inspired by classical seasonal decomposition and structured probabilistic modelling of temporal signals (Cleveland et al., [1990](https://arxiv.org/html/2606.05264#bib.bib23); Roberts et al., [2013](https://arxiv.org/html/2606.05264#bib.bib24)). To estimate $\tau_{c,t}$, we align observations according to a characteristic period $P$ (specified dataset-wise in Appendix [A](https://arxiv.org/html/2606.05264#A1), Table [5](https://arxiv.org/html/2606.05264#A1.T5)) and average values sharing the same phase. For a real sequence indexed by $s$, this yields a phase template

$$\tau^{(s)}_{c}(p) = \frac{1}{|\mathcal{I}_{p}|} \sum_{t \in \mathcal{I}_{p}} y^{(s)}_{c,t}, \qquad \mathcal{I}_{p} = \{t \mid t \bmod P = p\}, \tag{2}$$

Here, $p \in \{0, \dots, P-1\}$ indexes the phase within a period of length $P$, $t$ indexes discrete time steps, $y^{(s)}_{c,t}$ is the normalized observation of covariate $c$ at time $t$ in sample $s$, and $\mathcal{I}_{p}$ is the set of all time indices assigned to phase $p$. The residual is then defined as

$$r^{(s)}_{c,t} = y^{(s)}_{c,t} - \tau^{(s)}_{c}(t \bmod P). \tag{3}$$

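
The phase averaging in Eq. (2) and the residual in Eq. (3) can be sketched in a few lines of NumPy. This is a minimal illustration for a single covariate of a single sequence, not the authors' implementation; the function names and the toy signal are assumptions made for the example.

```python
import numpy as np

def phase_template(y: np.ndarray, period: int) -> np.ndarray:
    """Average all observations that share the same phase t mod P (Eq. 2)."""
    phases = np.arange(len(y)) % period
    sums = np.bincount(phases, weights=y, minlength=period)
    counts = np.bincount(phases, minlength=period)
    return sums / counts  # assumes len(y) >= period, so every phase is observed

def residual(y: np.ndarray, period: int) -> np.ndarray:
    """Subtract the phase template from the signal (Eq. 3)."""
    tau = phase_template(y, period)
    return y - tau[np.arange(len(y)) % period]

# Toy signal (hypothetical): a daily cycle with P = 24 plus a slow perturbation.
t = np.arange(48)
y = np.sin(2 * np.pi * t / 24) + 0.1 * np.cos(0.7 * t)
tau = phase_template(y, 24)
r = residual(y, 24)
```

By construction the residual averages to zero within every phase, which is why re-applying `phase_template` to `r` returns a vector of zeros.
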
This decomposition is motivated by the observation that many real\-world temporal systems contain a strong periodic component whose amplitude and phase are relatively stable, even when short\-term fluctuations remain highly stochastic\. To retain only structurally meaningful templates, we score each decomposition by the fraction of signal variance explained \(VE\) by the periodic component,

$$\mathrm{VE}^{(s)}_{c} = 1 - \frac{\mathrm{Var}(\mathbf{r}^{(s)}_{c})}{\mathrm{Var}(\mathbf{y}^{(s)}_{c})}. \tag{4}$$

High-scoring decompositions define a library of representative templates for each covariate, together with empirical amplitude statistics that characterize how strongly the periodic component manifests across real samples. We fit the residual model on a filtered mean signal built from the samples that pass the VE filter,

$$\bar{y}_{c,t} = \frac{1}{|\mathcal{S}_{c}|} \sum_{s \in \mathcal{S}_{c}} y^{(s)}_{c,t}, \qquad \mathcal{S}_{c} = \{s \mid \mathrm{VE}^{(s)}_{c} \geq \eta\}, \tag{5}$$

where $\eta$ is the VE threshold. We then compute the periodic template of this filtered mean signal using the same phase-averaging procedure,

$$\bar{\tau}_{c}(p) = \frac{1}{|\mathcal{I}_{p}|} \sum_{t \in \mathcal{I}_{p}} \bar{y}_{c,t}, \tag{6}$$

and define the mean residual as

$$\bar{r}_{c,t} = \bar{y}_{c,t} - \bar{\tau}_{c}(t \bmod P). \tag{7}$$

While averaging suppresses series-specific variability, the resulting residual signal preserves shared temporal dynamics. In this sense, $\bar{r}_{c,t}$ acts as a cleaned residual that washes away idiosyncratic noise while retaining the common residual structure, making it easier to fit the deep kernel learning (DKL) model.
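
As a rough sketch of Eqs. (4) to (7), the snippet below scores each sequence by variance explained, keeps those at or above a threshold, and decomposes their mean. The threshold value `eta = 0.5`, the function names, and the toy data are assumptions for illustration only; the paper does not state them in this section.

```python
import numpy as np

def decompose(y: np.ndarray, period: int):
    """Return (phase template, residual) for one series, as in Eqs. (2)-(3)."""
    phases = np.arange(y.shape[-1]) % period
    tau = np.array([y[phases == p].mean() for p in range(period)])
    return tau, y - tau[phases]

def variance_explained(y: np.ndarray, period: int) -> float:
    """Fraction of variance captured by the periodic template (Eq. 4)."""
    _, r = decompose(y, period)
    return 1.0 - r.var() / y.var()

def filtered_mean_residual(Y: np.ndarray, period: int, eta: float):
    """Y has shape (S, T) for one covariate; implements Eqs. (5)-(7)."""
    ve = np.array([variance_explained(y, period) for y in Y])
    keep = ve >= eta
    if not keep.any():
        raise ValueError("no sequence passes the VE filter")
    y_bar = Y[keep].mean(axis=0)               # Eq. (5)
    tau_bar, r_bar = decompose(y_bar, period)  # Eqs. (6)-(7)
    return ve, tau_bar, r_bar

# Toy corpus (hypothetical): two strongly periodic series and one pure-noise series.
rng = np.random.default_rng(0)
t = np.arange(96)
cycle = np.sin(2 * np.pi * t / 24)
Y = np.stack([
    cycle + 0.1 * rng.standard_normal(96),
    cycle + 0.1 * rng.standard_normal(96),
    rng.standard_normal(96),
])
ve, tau_bar, r_bar = filtered_mean_residual(Y, period=24, eta=0.5)
```

In this toy setup the two periodic series score close to 1 and the noise series scores well below the threshold, so only the first two contribute to the filtered mean.
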

Template-based structural scaffold. After extracting a periodic template for each covariate, we use it to define the coarse structure of a synthetic multivariate sample. For a selected template $\tau_{c,m}$, where $m$ denotes the $m$-th template for covariate $c$, the structural contribution is

$$\widehat{\tau}_{c,t} = \bar{\tau}_{c,m} + a_{c}\big(\tau_{c,m}(t \bmod P) - \bar{\tau}_{c,m}\big), \tag{8}$$

where $\bar{\tau}_{c,m}$ is the template mean and $a_{c}$ is a covariate-specific amplitude factor. This preserves the periodic morphology while allowing different covariates to express it with different strengths.
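
Eq. (8) rescales the template around its own mean. A minimal sketch, assuming the template is stored as a length-$P$ array and tiled to the generation length (the function name and toy values are hypothetical):

```python
import numpy as np

def scaled_template(template: np.ndarray, amplitude: float, length: int) -> np.ndarray:
    """Tile a length-P template to `length` steps and scale it about its mean (Eq. 8)."""
    mean = template.mean()
    phases = np.arange(length) % len(template)
    return mean + amplitude * (template[phases] - mean)

# Toy template (hypothetical) with period P = 4.
tpl = np.array([0.0, 1.0, 2.0, 1.0])
full = scaled_template(tpl, amplitude=1.0, length=8)    # unchanged morphology
damped = scaled_template(tpl, amplitude=0.5, length=8)  # same shape, half the swing
```

An amplitude of 1 reproduces the template, 0 collapses it to its mean, and intermediate values shrink the swing without moving the mean level.
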

Residual dynamics via deep kernel autoregression. After constructing the cleaned mean residual, we model its short-range dependence, local roughness, and uncertainty autoregressively using a deep kernel learning construction for sequential data (Wilson et al., [2016](https://arxiv.org/html/2606.05264#bib.bib25); Al-Shedivat et al., [2017](https://arxiv.org/html/2606.05264#bib.bib26)). For covariate $c$, let

$$\mathbf{u}_{c,t} = [\bar{r}_{c,t-W}, \dots, \bar{r}_{c,t-1}] \tag{9}$$

denote a context window of length $W$. This context is mapped to a latent representation through a neural feature extractor,

$$\mathbf{z}_{c,t} = g_{\theta}(\mathbf{u}_{c,t}) \in \mathbb{R}^{d}, \tag{10}$$

where $g_{\theta}$ is implemented as a convolutional-recurrent encoder with an LSTM recurrent block (Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2606.05264#bib.bib28)). Convolutional layers capture local motifs and short-range temporal patterns, while a recurrent block summarizes their sequential evolution over the window.
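
The context windows of Eq. (9) pair each residual value with the $W$ values that precede it. The sketch below builds those pairs with NumPy; it covers only the windowing step, not the CNN+LSTM encoder $g_{\theta}$, and the function name is an assumption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def context_windows(r_bar: np.ndarray, W: int):
    """Return (contexts, targets) with contexts[i] = r_bar[i:i+W] and targets[i] = r_bar[i+W]."""
    contexts = sliding_window_view(r_bar[:-1], W)  # shape (T - W, W)
    targets = r_bar[W:]                            # shape (T - W,)
    return contexts, targets

# Toy residual (hypothetical): values 0..9 make the alignment easy to read.
contexts, targets = context_windows(np.arange(10.0), W=3)
```

Each row of `contexts` would be fed to the encoder, and the matching entry of `targets` is the value the Gaussian process is asked to explain.
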

The latent state $\mathbf{z}_{c,t}$ parameterizes a Gaussian process over residual dynamics (Rasmussen and Williams, [2006](https://arxiv.org/html/2606.05264#bib.bib27)),

$$f_{c} \sim \mathcal{GP}\big(0, k_{c}(\mathbf{z}, \mathbf{z}')\big), \qquad \bar{r}_{c,t} \mid \mathbf{u}_{c,t} \sim \mathcal{N}\big(f_{c}(\mathbf{z}_{c,t}), \sigma_{c}^{2}\big), \tag{11}$$

where $k_{c}$ is a flexible kernel in latent space. In our case, the kernel combines Matern and radial basis components (Rasmussen and Williams, [2006](https://arxiv.org/html/2606.05264#bib.bib27)), thereby accommodating both rough local behavior and smoother variation. This deep kernel formulation is critical: the neural encoder provides expressive history-dependent features, while the Gaussian process introduces calibrated uncertainty and a principled nonparametric prior over residual trajectories (Wilson et al., [2016](https://arxiv.org/html/2606.05264#bib.bib25); Al-Shedivat et al., [2017](https://arxiv.org/html/2606.05264#bib.bib26)).
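
To make the kernel in Eq. (11) concrete, here is one possible form evaluated on latent features. The text only says the kernel combines Matern and radial basis components; the additive combination, the Matern smoothness of 3/2, the weights, and the lengthscales below are all assumptions chosen for illustration.

```python
import numpy as np

def pairwise_dist(Z1: np.ndarray, Z2: np.ndarray) -> np.ndarray:
    """Euclidean distances between rows of Z1 (n, d) and Z2 (m, d)."""
    diff = Z1[:, None, :] - Z2[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def rbf(d: np.ndarray, lengthscale: float) -> np.ndarray:
    """Radial basis kernel: smooth variation."""
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def matern32(d: np.ndarray, lengthscale: float) -> np.ndarray:
    """Matern kernel with smoothness 3/2 (assumed): rougher local behavior."""
    a = np.sqrt(3.0) * d / lengthscale
    return (1.0 + a) * np.exp(-a)

def combined_kernel(Z1, Z2, ls_rbf=1.0, ls_mat=1.0, w_rbf=0.5, w_mat=0.5):
    """Weighted sum of the two components (an assumed way to combine them)."""
    d = pairwise_dist(Z1, Z2)
    return w_rbf * rbf(d, ls_rbf) + w_mat * matern32(d, ls_mat)

# Toy latent features (hypothetical): six encoded windows with d = 3.
Z = np.random.default_rng(0).standard_normal((6, 3))
K = combined_kernel(Z, Z)
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, `K` is symmetric and positive semi-definite, which is what the Gaussian process prior requires.
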

Autoregressive synthesis of intrinsic trajectories. Given the learned residual model, synthetic residuals are generated autoregressively. At each time step, the model outputs a predictive mean and variance conditioned on the previously generated context. Sampling is performed as

$$\widehat{r}_{c,t} = \mu_{c,t} + \lambda_{c}\sigma_{c,t}\epsilon_{t}, \qquad \epsilon_{t} \sim \mathcal{N}(0, 1), \tag{12}$$

where $\mu_{c,t}$ and $\sigma_{c,t}$ are the predictive moments of the deep kernel model and $\lambda_{c}$ is a temperature parameter controlling stochasticity. Larger temperatures increase diversity, whereas smaller temperatures produce trajectories closer to the posterior mean.
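
The sampling rule in Eq. (12) can be illustrated with a stand-in predictor. In the sketch below, `toy_predict` is a hypothetical placeholder for the fitted deep kernel model (it returns a fixed-variance AR(1)-style prediction); only the sampling loop reflects the equation.

```python
import numpy as np

def toy_predict(context: np.ndarray):
    """Hypothetical stand-in for the fitted model: returns (mu, sigma) for the next step."""
    return 0.8 * context[-1], 0.1

def sample_residuals(predict, init_context, steps, temperature, rng):
    """Autoregressive sampling: r_t = mu_t + temperature * sigma_t * eps_t (Eq. 12)."""
    context = list(init_context)
    W = len(context)
    out = []
    for _ in range(steps):
        mu, sigma = predict(np.asarray(context[-W:]))
        r = mu + temperature * sigma * rng.standard_normal()
        out.append(r)
        context.append(r)  # the sampled value becomes part of the next context
    return np.array(out)

mean_path = sample_residuals(toy_predict, [1.0], 5, 0.0, np.random.default_rng(0))
cool = sample_residuals(toy_predict, [1.0], 5, 1.0, np.random.default_rng(0))
hot = sample_residuals(toy_predict, [1.0], 5, 2.0, np.random.default_rng(0))
```

With temperature 0 the loop follows the predictive mean exactly. For this linear toy predictor, doubling the temperature doubles the deviation from that mean path when the same random seed is used, which shows how the temperature widens coverage without changing the underlying dynamics.
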

We also add a low-amplitude sinusoidal drift term $d_{c,t}$ to capture slow variation beyond the template and residual model. The intrinsic synthetic trajectory for covariate $c$ is then given by

$$\widehat{x}^{\mathrm{int}}_{c,t} = \widehat{\tau}_{c,t} + \widehat{r}_{c,t} + d_{c,t}. \tag{13}$$

This intrinsic signal represents the variable-specific dynamics before cross-variable coupling is imposed.

Directed coupling across variables. Real multivariate systems rarely consist of independent channels. To encode directed dependence, we introduce a graph-based interaction layer defined on a directed acyclic graph $\mathcal{G}$, following the structural causal perspective of DAG-based models and the synthetic time-series generation setting of CauKer (Peters et al., [2017](https://arxiv.org/html/2606.05264#bib.bib29); Xie et al., [2025](https://arxiv.org/html/2606.05264#bib.bib30)). Each node corresponds to a covariate, and each directed edge specifies a parent-child relationship together with admissible lags. Root variables are generated directly from their intrinsic dynamics, whereas child variables are modified using lagged parent information. Appendix Section [B](https://arxiv.org/html/2606.05264#A2) gives the full consensus causal-discovery procedure used to estimate $\mathcal{G}$ and its admissible lag sets from the real data.

For a child variable $c$ with parent set $\mathrm{Pa}(c)$, we define the aggregated parent contribution as

$$g_{c,t} = \sum_{p \in \mathrm{Pa}(c)} \sum_{\ell \in \mathcal{L}_{p \to c}} w_{p,c,\ell}\, s_{p,t-\ell}, \tag{14}$$

where $\mathcal{L}_{p \to c}$ is the set of admissible lags, $w_{p,c,\ell}$ are consensus-derived coupling coefficients estimated from the retained discovery scores in Appendix Section [B](https://arxiv.org/html/2606.05264#A2), and $s_{p,t}$ denotes the full parent signal. In other words, directed dependence is imposed on the entire generated series, so the structural template and residual dynamics are mixed in a single coupling step.

The final child trajectory is produced through a convex combination of its intrinsic component and the parent\-driven term,

$$\widetilde{x}_{c,t} = \alpha_{c}\widehat{x}^{\mathrm{int}}_{c,t} + (1 - \alpha_{c})\,h(g_{c,t}), \tag{15}$$

where $\alpha_{c} \in [0, 1]$ controls the relative strength of endogenous and exogenous dynamics, and $h$ is an optional nonlinear transformation. In the implementation, we sample $\alpha_{c}$ uniformly from $[0.7, 0.9]$ so that the endogenous structure of each covariate is consistently prioritized over parent-specific dynamics. The nonlinear map serves to model saturating or asymmetric causal effects without destabilizing the generated trajectories.
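
The coupling step in Eqs. (14) and (15) can be sketched as follows. The edge representation, the zero contribution for time steps earlier than the lag, and the choice of `tanh` for $h$ are assumptions made for this example; the source only states that $h$ is an optional nonlinearity and that $\alpha_{c}$ is drawn from $[0.7, 0.9]$.

```python
import numpy as np

def parent_contribution(signals: dict, parents: list, length: int) -> np.ndarray:
    """Sum lagged, weighted parent signals (Eq. 14).

    `parents` is a list of (parent_name, lag, weight) triples (hypothetical format).
    Steps with t < lag receive no contribution from that edge (assumed boundary rule).
    """
    g = np.zeros(length)
    for name, lag, weight in parents:
        if lag >= length:
            continue
        g[lag:] += weight * signals[name][: length - lag]
    return g

def couple_child(x_int: np.ndarray, g: np.ndarray, alpha: float, h=np.tanh) -> np.ndarray:
    """Convex combination of intrinsic and parent-driven terms (Eq. 15)."""
    return alpha * x_int + (1.0 - alpha) * h(g)

# Toy example (hypothetical): one parent, lag 1, coupling weight 0.5.
rng = np.random.default_rng(0)
signals = {"temperature": np.arange(5.0)}
g = parent_contribution(signals, [("temperature", 1, 0.5)], length=5)
alpha = rng.uniform(0.7, 0.9)  # range stated in the text
child = couple_child(np.ones(5), g, alpha)
```

In a full generator the nodes would be visited in topological order so that every parent's final signal exists before its children are coupled.
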

## 4 Experiments and Results

We evaluate ReGeN along three complementary axes that together make a cumulative case for reference-guided synthetic generation.

(Q1) Can reference-guided synthetic data substitute for a real sibling dataset? We test whether a corpus generated from one domain can serve as the sole training source for a model deployed on a closely related domain, replacing the real sibling entirely.

(Q2) Does synthetic augmentation on top of real data push performance beyond what real data alone achieves? We ask whether ReGeN-generated series add information that the available real corpus does not already contain.

(Q3) Does the *structured decomposition* in ReGeN produce better synthetic data than alternative reference-guided and prior-based generators? We benchmark against a reference-guided adversarial generator (Yoon et al., [2019](https://arxiv.org/html/2606.05264#bib.bib31)) and a prior-based causal generator (Xie et al., [2025](https://arxiv.org/html/2606.05264#bib.bib30)) on exactly matched corpus sizes, isolating generation quality from data volume and reference availability.

Throughout, we emphasize that synthetic data is useful only when it preserves structure that transfers: periodic morphology, residual uncertainty, and cross\-variable coupling\. Direct structural diagnostics like spectral fidelity, residual calibration, and cross\-variate coupling recovery are provided in Appendix[C](https://arxiv.org/html/2606.05264#A3)\.

Table 1: Zero-shot forecasting transfer across sibling dataset pairs under real-to-real (TRTR) and train-on-synthetic, test-on-real (TSTR) protocols, evaluated on iTransformer, DLinear, and S-Mamba. Each TSTR row directly below a TRTR row shows ReGeN synthetic performance. The Δ(%) column reports the relative change $(\mathrm{TSTR} - \mathrm{TRTR}) / \mathrm{TRTR} \times 100$ in MSE.

| Dom. | Sibling Pair | Type | Run | iTransformer MSE↓ | iTransformer MAE↓ | iTransformer Δ(%) | DLinear MSE↓ | DLinear MAE↓ | DLinear Δ(%) | S-Mamba MSE↓ | S-Mamba MAE↓ | S-Mamba Δ(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Energy | BDG-2 Bear / Panther | TRTR | A→B | 0.32 | 0.36 | - | 0.29 | 0.36 | - | 0.29 | 0.34 | - |
| | | TSTR | D→B | 0.36 | 0.38 | +12.5 | 0.30 | 0.36 | +3.4 | 0.29 | 0.35 | 0.0 |
| | | TRTR | B→A | 0.41 | 0.40 | - | 0.41 | 0.39 | - | 0.42 | 0.34 | - |
| | | TSTR | C→A | 0.41 | 0.43 | 0.0 | 0.43 | 0.43 | +4.9 | 0.40 | 0.33 | −4.8 |
| | BDG-2 Bull / Hog | TRTR | A→B | 0.50 | 0.49 | - | 0.48 | 0.45 | - | 0.49 | 0.46 | - |
| | | TSTR | D→B | 0.50 | 0.48 | 0.0 | 0.49 | 0.46 | +2.1 | 0.48 | 0.45 | −2.0 |
| | | TRTR | B→A | 0.33 | 0.40 | - | 0.36 | 0.39 | - | 0.33 | 0.40 | - |
| | | TSTR | C→A | 0.37 | 0.40 | +12.1 | 0.31 | 0.38 | −13.9 | 0.35 | 0.41 | +6.1 |
| Cloud | Azure VM 2017 / Borg 2011 | TRTR | A→B | 0.58 | 0.49 | - | 0.59 | 0.54 | - | 0.58 | 0.53 | - |
| | | TSTR | D→B | 0.59 | 0.50 | +1.7 | 0.59 | 0.52 | 0.0 | 0.61 | 0.55 | +5.2 |
| | | TRTR | B→A | 0.88 | 0.41 | - | 0.90 | 0.44 | - | 0.90 | 0.41 | - |
| | | TSTR | C→A | 0.90 | 0.45 | +2.3 | 0.92 | 0.42 | +2.2 | 0.97 | 0.42 | +7.8 |
| Traffic | PEMS-04 / PEMS-08 | TRTR | A→B | 0.32 | 0.34 | - | 0.30 | 0.29 | - | 0.29 | 0.30 | - |
| | | TSTR | D→B | 0.30 | 0.29 | −6.3 | 0.28 | 0.28 | −6.7 | 0.26 | 0.30 | −10.3 |
| | | TRTR | B→A | 0.31 | 0.31 | - | 0.32 | 0.31 | - | 0.30 | 0.28 | - |
| | | TSTR | C→A | 0.37 | 0.38 | +19.4 | 0.29 | 0.30 | −9.4 | 0.29 | 0.33 | −3.3 |
| Climate | Subseasonal / Precip. | TRTR | A→B | 1.09 | 0.77 | - | 0.83 | 0.60 | - | 0.76 | 0.57 | - |
| | | TSTR | D→B | 1.01 | 0.73 | −7.3 | 0.90 | 0.64 | +8.4 | 0.74 | 0.56 | −2.6 |
| | | TRTR | B→A | 0.42 | 0.46 | - | 0.40 | 0.47 | - | 1.34 | 1.02 | - |
| | | TSTR | C→A | 0.40 | 0.43 | −4.8 | 0.35 | 0.39 | −12.5 | 1.51 | 1.10 | +12.7 |
| Energy (res.) | Res. PV Power / Load Power | TRTR | A→B | 0.60 | 0.42 | - | 0.55 | 0.40 | - | 0.54 | 0.37 | - |
| | | TSTR | D→B | 0.54 | 0.38 | −10.0 | 0.64 | 0.41 | +16.4 | 0.54 | 0.37 | 0.0 |
| | | TRTR | B→A | 0.25 | 0.22 | - | 0.25 | 0.24 | - | 0.23 | 0.20 | - |
| | | TSTR | C→A | 0.26 | 0.20 | +4.0 | 0.31 | 0.29 | +24.0 | 0.22 | 0.23 | −4.3 |

Colour key (±3% relative margin): TRTR = real-to-real reference; dark green = TSTR beats TRTR by >3%; light green = TSTR better by ≤3%; yellow = within +3% tolerance; red = gap >3%. Run key: $A$, $B$ = real datasets; $C$ = synthetic from $A$; $D$ = synthetic from $B$. $A \to B$, $B \to A$ = TRTR; $D \to B$, $C \to A$ = TSTR.
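
The Δ(%) column and the colour key can be reproduced from the rounded MSE values with a small helper. This is a sketch: the function names are hypothetical, and placing a Δ of exactly 0 in the yellow band is an assumption about how the key treats ties.

```python
def relative_change(tstr_mse: float, trtr_mse: float) -> float:
    """Relative MSE change in percent: (TSTR - TRTR) / TRTR * 100."""
    return (tstr_mse - trtr_mse) / trtr_mse * 100.0

def colour(delta: float, margin: float = 3.0) -> str:
    """Map a relative change to the colour key of Table 1."""
    if delta < -margin:
        return "dark green"   # TSTR beats TRTR by more than the margin
    if delta < 0.0:
        return "light green"  # TSTR better, but within the margin
    if delta <= margin:
        return "yellow"       # TSTR worse, but within tolerance (ties assumed here)
    return "red"              # gap larger than the margin

# Values taken from Table 1.
bear_panther = relative_change(0.36, 0.32)  # iTransformer, BDG-2 Bear / Panther, D->B
pems_smamba = relative_change(0.26, 0.29)   # S-Mamba, PEMS-04 / PEMS-08, D->B
```

For instance, the first TSTR row for iTransformer moves from 0.32 to 0.36 MSE, a relative change of 12.5%, which falls in the red band.
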

### 4.1 Datasets and Evaluation Protocol

We collect a benchmark of twelve real-world time series datasets (nine multivariate and three univariate) (Aksu et al., [2024](https://arxiv.org/html/2606.05264#bib.bib38)) spanning five qualitatively distinct domains: energy consumption (BDG-2 Bear, Panther, Bull, Hog) (Emami et al., [2023](https://arxiv.org/html/2606.05264#bib.bib47)), cloud infrastructure (Azure VM 2017, Borg 2011) (Woo et al., [2023](https://arxiv.org/html/2606.05264#bib.bib50)), urban traffic (PEMS-04, PEMS-08) (Jiang et al., [2023](https://arxiv.org/html/2606.05264#bib.bib49)), climate (Subseasonal Precipitation, Subseasonal) (Mouatadid et al., [2023](https://arxiv.org/html/2606.05264#bib.bib48)), and residential energy (Residential PV Power, Residential Load Power) (Woo et al., [2024](https://arxiv.org/html/2606.05264#bib.bib2)). Within each domain we define source-matched *sibling pairs*: datasets that share the same physical process, sensor modality, and temporal cadence, so that the reference-guided generation procedure can extract a meaningful periodic scaffold from the source and apply it to synthesise training data for the target.

All experiments use a uniform input context of 96 time steps and a prediction horizon of 24 steps, evaluated by mean squared error (MSE) and mean absolute error (MAE). We consider three training protocols: TRTR (Train on Real, Test on Real) serves as the gold-standard upper bound for each setting; TSTR (Train on Synthetic, Test on Real) isolates the standalone value of synthetic data; and TRSTR (Train on the union of Real and Synthetic, Test on Real) measures augmentation benefit. To ensure that differences in TSTR versus TRTR cannot be attributed to corpus size, the synthetic corpus in every transfer setting is size-matched to the corresponding real corpus. Full dataset statistics, system specifications, sibling-pair definitions, and the characteristic period $P$ used for template extraction are summarised in Appendix [A](https://arxiv.org/html/2606.05264#A1).
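
The evaluation setup can be sketched as follows: slice each series into 96-step contexts with 24-step targets, pick the training pool according to the protocol, and score with MSE and MAE. The stride of one step and all function names are assumptions for this example; the paper does not specify the windowing stride here.

```python
import numpy as np

def make_windows(series: np.ndarray, context: int = 96, horizon: int = 24):
    """Slice a (C, T) series into inputs (N, C, context) and targets (N, C, horizon)."""
    C, T = series.shape
    n = T - context - horizon + 1
    if n <= 0:
        raise ValueError("series too short for one context/horizon pair")
    X = np.stack([series[:, i : i + context] for i in range(n)])
    Y = np.stack([series[:, i + context : i + context + horizon] for i in range(n)])
    return X, Y

def training_pool(protocol: str, real: list, synthetic: list) -> list:
    """Select training sequences for TRTR, TSTR, or TRSTR; testing is always on real data."""
    if protocol == "TRTR":
        return list(real)
    if protocol == "TSTR":
        return list(synthetic)
    if protocol == "TRSTR":
        return list(real) + list(synthetic)
    raise ValueError(f"unknown protocol: {protocol}")

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

def mae(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - target)))

# Toy series (hypothetical): 2 covariates, 130 time steps.
series = np.arange(2 * 130, dtype=float).reshape(2, 130)
X, Y = make_windows(series)
```

Because the synthetic corpus is size-matched to the real one, the TRSTR pool in this sketch is twice the size of the TRTR pool, while TSTR and TRTR pools are the same size.
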

We evaluate three forecasting backbones with deliberately distinct inductive biases, together with a foundation model: iTransformer (Liu et al., [2024](https://arxiv.org/html/2606.05264#bib.bib51)) (≈1M parameters, self-attention over variates), DLinear (Zeng et al., [2023](https://arxiv.org/html/2606.05264#bib.bib53)) (4,673 parameters, explicit seasonal-residual decomposition), S-Mamba (Wang et al., [2025](https://arxiv.org/html/2606.05264#bib.bib52)) (≈1.1M parameters, state-space sequence modelling), and Moirai-small (Woo et al., [2024](https://arxiv.org/html/2606.05264#bib.bib2)) (≈13.8M parameters, a masked encoder-based time series foundation model). Evaluating across architectures with such different inductive biases provides a stringent test: if synthetic data genuinely captures domain structure rather than model-specific artefacts, its benefits should appear consistently across all model families.

### 4\.2 Q1: Reference\-Guided Transfer as a Drop\-in Replacement for Real Sibling Data

We report TRTR and TSTR performance across all sibling pairs and backbones in Table [1](https://arxiv.org/html/2606.05264#S4.T1)\. 67% of TSTR cells \(48 of 72\) fall within the $\pm$3% relative MSE band of their TRTR baseline, meaning that in two\-thirds of settings, replacing real sibling data with ReGeN\-generated series costs under 3% in forecast accuracy, a margin within typical evaluation noise\.

Domain\-wise variation\. Traffic is the strongest domain: on PEMS\-04 $\rightarrow$ PEMS\-08, ReGeN *improves* over the real sibling for both iTransformer and DLinear \($-6.3\%$ relative MSE\); the reverse direction yields similar gains \($-9.4\%$, $-10.3\%$\)\. The phase\-aligned template is well\-suited to traffic’s regular intra\-day periodicity, and temperature\-broadened GP sampling extends coverage into rare congestion regimes\. Cloud infrastructure \(Azure VM / Borg\) gains are smaller: aperiodic load spikes dominate several channels, leaving more variance for the GP residual under limited reference observations \(see Appendix [C\.1](https://arxiv.org/html/2606.05264#A3.SS1)\)\. Residential PV is the weakest domain \(DLinear gap: $+16.4\%$\), as solar irradiance is locally variable and poorly approximated by a single periodic template\.
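
The relative\-MSE bookkeeping behind these figures is simple enough to sketch. The snippet below is an illustration under stated assumptions: the band is read as symmetric (an absolute gap of at most 3%), the function names are hypothetical, and `demo_cells` contains made\-up values used only to exercise the counting logic. The PEMS\-08 example uses the iTransformer MSE values listed in Table 3.

```python
from typing import Iterable, Tuple


def relative_gap_pct(trtr_mse: float, tstr_mse: float) -> float:
    """Signed relative MSE change of TSTR versus TRTR, in percent (negative = improvement)."""
    return 100.0 * (tstr_mse - trtr_mse) / trtr_mse


def count_within_band(cells: Iterable[Tuple[float, float]], band_pct: float = 3.0) -> Tuple[int, int]:
    """Count (TRTR, TSTR) cells whose absolute relative gap is at most band_pct."""
    cells = list(cells)
    hits = sum(1 for trtr, tstr in cells if abs(relative_gap_pct(trtr, tstr)) <= band_pct)
    return hits, len(cells)


# PEMS-08 iTransformer MSE values as listed in Table 3 (TRTR 0.32, ReGeN TSTR 0.30).
pems08_gap = relative_gap_pct(0.32, 0.30)

# Hypothetical cells, for illustration only: (TRTR MSE, TSTR MSE).
demo_cells = [(0.50, 0.51), (0.40, 0.44), (0.32, 0.30)]
hits, total = count_within_band(demo_cells)
```

With the Table 3 values, `pems08_gap` evaluates to about $-6.25\%$, in line with the roughly $-6.3\%$ quoted for the PEMS\-04 $\rightarrow$ PEMS\-08 direction.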

### 4\.3 Q2: Synthetic Augmentation Improves Over Real\-Only Training

We compare TRTR against TRSTR, which trains on the union of the real corpus and a size\-matched synthetic corpus, across the same twelve datasets and three backbones in Table [2](https://arxiv.org/html/2606.05264#S4.T2)\.

Table 2: TRTR vs\. TRSTR across twelve datasets and three backbones\. TRSTR adds a size\-matched ReGeN synthetic corpus to the real training pool\. Augmentation improves or matches TRTR in the majority of settings\. Green indicates improvement; red indicates degradation\. Lower is better \($\downarrow$\)\.

| Dataset | iTransformer TRTR MSE | iTransformer TRTR MAE | iTransformer TRSTR MSE | iTransformer TRSTR MAE | DLinear TRTR MSE | DLinear TRTR MAE | DLinear TRSTR MSE | DLinear TRSTR MAE | S-Mamba TRTR MSE | S-Mamba TRTR MAE | S-Mamba TRSTR MSE | S-Mamba TRSTR MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bear | 0.41 | 0.40 | 0.38 | 0.40 | 0.41 | 0.39 | 0.42 | 0.43 | 0.42 | 0.34 | 0.42 | 0.34 |
| Panther | 0.32 | 0.36 | 0.27 | 0.35 | 0.29 | 0.36 | 0.29 | 0.34 | 0.29 | 0.34 | 0.29 | 0.33 |
| Azure VM | 0.88 | 0.41 | 0.78 | 0.30 | 0.90 | 0.44 | 0.88 | 0.42 | 0.90 | 0.41 | 0.87 | 0.40 |
| Borg | 0.58 | 0.49 | 0.55 | 0.40 | 0.59 | 0.54 | 0.55 | 0.54 | 0.58 | 0.53 | 0.54 | 0.50 |
| PEMS-04 | 0.31 | 0.31 | 0.25 | 0.27 | 0.32 | 0.31 | 0.30 | 0.31 | 0.30 | 0.28 | 0.28 | 0.29 |
| PEMS-08 | 0.32 | 0.34 | 0.27 | 0.29 | 0.30 | 0.29 | 0.30 | 0.28 | 0.29 | 0.30 | 0.26 | 0.29 |
| Bull | 0.33 | 0.40 | 0.30 | 0.35 | 0.36 | 0.39 | 0.32 | 0.35 | 0.33 | 0.40 | 0.32 | 0.40 |
| Hog | 0.50 | 0.49 | 0.47 | 0.42 | 0.48 | 0.45 | 0.47 | 0.44 | 0.49 | 0.46 | 0.43 | 0.45 |
| Subseq. | 0.42 | 0.46 | 0.43 | 0.47 | 0.40 | 0.47 | 0.44 | 0.48 | 1.34 | 1.02 | 1.08 | 0.73 |
| Subseq. Prec. | 1.09 | 0.77 | 0.92 | 0.72 | 0.83 | 0.60 | 0.79 | 0.62 | 0.76 | 0.57 | 0.78 | 0.57 |
| Res. PV | 0.25 | 0.22 | 0.24 | 0.23 | 0.25 | 0.24 | 0.31 | 0.28 | 0.23 | 0.20 | 0.29 | 0.26 |
| Res. Load | 0.60 | 0.42 | 0.54 | 0.33 | 0.55 | 0.40 | 0.63 | 0.43 | 0.54 | 0.37 | 0.54 | 0.37 |

Augmentation helps broadly, but the effect is architecture\-dependent\. For iTransformer and S\-Mamba, TRSTR consistently matches or beats TRTR across the majority of datasets: both architectures benefit from the increased effective diversity of the combined corpus, absorbing the additional synthetic variation rather than overfitting the original distribution\. DLinear shows the weakest and most mixed gains\. This is mechanistically interpretable: DLinear’s explicit seasonal\-residual decomposition imposes a strong inductive bias that already extracts much of the periodic structure ReGeN synthesises\. The model therefore has less capacity to benefit from additional variation in the periodic component, and the residual\-level diversity added by the GP component can slightly confuse its residual\-fitting head on some datasets\.

The most practically significant augmentation gains appear where datasets are small and domain\-specific\. Subseasonal S\-Mamba improves from an MSE of 1\.34 \(TRTR\) to 1\.02 \(TRSTR\) \(23\.9% reduction\) while iTransformer and DLinear remain stable\. For Residential Load Power, augmentation yields consistent gains across all three backbones, suggesting that the cross\-variate SCM coupling in the synthesised data supplies cross\-channel patterns not fully captured by the limited real training corpus\.

### 4\.4 Q3: ReGeN Produces Superior Training Signal In Comparison to Available Alternatives

We benchmark ReGeN against two qualitatively distinct synthetic data generators: \(1\) TimeGAN \(Yoon et al\., [2019](https://arxiv.org/html/2606.05264#bib.bib31)\), a reference\-guided adversarial generator that trains a GAN directly on observed data, and \(2\) CauKer \(Xie et al\., [2025](https://arxiv.org/html/2606.05264#bib.bib30)\), a reference\-free, prior\-based generator whose causal graph and kernel bank are sampled from domain\-agnostic priors without conditioning on any real dataset\. Table [4](https://arxiv.org/html/2606.05264#S4.T4) reports full\-corpus TSTR results, where Moirai\-small is pre\-trained on each synthetic corpus and evaluated zero\-shot on the pooled, left\-out real benchmark\.

ReGeN vs\. TimeGAN\. Both methods have access to the same observed data and identical corpus size budgets, so any performance gap reflects the *quality* of what each method extracts from that reference\. On Moirai\-small, ReGeN reduces MSE by 41\.1% relative to TimeGAN\. This advantage is broad: Table [3](https://arxiv.org/html/2606.05264#S4.T3) shows that ReGeN achieves lower MSE and MAE on 10 of 12 datasets\. The two exceptions \(Bull, Hog\) are datasets where the template stage explains less of the original signal variance \(Appendix Table [6](https://arxiv.org/html/2606.05264#A3.T6)\), so TimeGAN’s template\-free design is less exposed to weak periodic fit\.

ReGeN vs\. CauKer vs\. TimeGAN\. CauKer’s domain\-agnostic design means that a corpus generated for cloud data is statistically indistinguishable from one generated for traffic data, making per\-dataset TSTR comparisons methodologically unsound\. We therefore pool all synthetic datasets from all three methods into a single full corpus and compare them in this regime, where CauKer’s design assumption is respected\. Moirai\-small is pre\-trained from scratch on each full corpus and evaluated zero\-shot on the pooled real benchmark\. ReGeN reduces Moirai\-small MSE by 2\.3% relative to CauKer, demonstrating that conditioning generation on real reference data leaves a durable structural imprint that benefits downstream generalization beyond what domain\-agnostic priors alone can provide\.
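
As a check on the two headline percentages, the relative reductions can be recomputed from the Moirai\-small MSE values reported in Table 4 \(TimeGAN 392\.46, CauKer 236\.7, ReGeN 231\.14\)\. This is plain arithmetic on the reported numbers, not an additional result:

$$\frac{392.46-231.14}{392.46}\approx 0.411,\qquad \frac{236.7-231.14}{236.7}\approx 0.023 .$$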

Table 3: Dataset\-wise iTransformer comparison\. The TRTR block is shown as a neutral real\-data reference\. In the TSTR columns, green marks the lower\-error synthetic method between TimeGAN and ReGeN for each metric; lower is better\.

| Dataset | TRTR MSE | TRTR MAE | TimeGAN MSE | TimeGAN MAE | ReGeN (Ours) MSE | ReGeN (Ours) MAE |
| --- | --- | --- | --- | --- | --- | --- |
| Bear | 0.41 | 0.40 | 0.63 | 0.50 | 0.41 | 0.43 |
| Panther | 0.32 | 0.36 | 0.40 | 0.45 | 0.36 | 0.38 |
| Azure VM | 0.88 | 0.41 | 1.04 | 0.51 | 0.90 | 0.45 |
| Borg | 0.58 | 0.49 | 0.85 | 0.69 | 0.59 | 0.50 |
| PEMS-04 | 0.31 | 0.31 | 0.45 | 0.45 | 0.37 | 0.38 |
| PEMS-08 | 0.32 | 0.34 | 0.40 | 0.38 | 0.30 | 0.29 |
| Bull | 0.33 | 0.40 | 0.32 | 0.36 | 0.37 | 0.40 |
| Hog | 0.50 | 0.49 | 0.47 | 0.49 | 0.50 | 0.48 |
| Subseasonal | 0.42 | 0.46 | 0.42 | 0.45 | 0.40 | 0.43 |
| Sub. precip. | 1.09 | 0.77 | 1.11 | 0.80 | 1.01 | 0.73 |
| Residential PV | 0.25 | 0.22 | 0.27 | 0.20 | 0.26 | 0.20 |
| Residential Load | 0.60 | 0.42 | 0.71 | 0.49 | 0.54 | 0.38 |

Table 4: Full\-corpus TSTR evaluation with Moirai\-small, comparing the different synthetic corpora used for pre\-training\. Lower MSE, MAE, MASE, and WQL are better\. Best results in green\.

| Synthetic Corpus | MSE | MAE | MASE | WQL |
| --- | --- | --- | --- | --- |
| TimeGAN | 392.46 | 8.38 | 1.45 | 0.23 |
| CauKer | 236.7 | 6.64 | 1.11 | 0.17 |
| ReGeN (Ours) | 231.14 | 5.86 | 0.98 | 0.17 |

### 4\.5 Appendix C: Ablation and Diagnostic Summary

Appendix C consolidates five complementary results that clarify *why* ReGeN works\. ❶ Table [6](https://arxiv.org/html/2606.05264#A3.T6) shows that the phase\-aligned template already captures a large share of variance in the strongest domains, especially traffic and residential energy, while also revealing the lower\-VE settings in which the residual model must carry more of the burden\. ❷ Figure [2](https://arxiv.org/html/2606.05264#A3.F2) shows that the synthetic samples recover the geometry of the real data closely enough to preserve domain\-level clustering structure\. ❸ Figure [3](https://arxiv.org/html/2606.05264#A3.F3) substantiates that the synthetic series largely match the real datasets’ dominant spectral peaks and low\-frequency decay, supporting strong frequency\-domain fidelity\. ❹ Table [7](https://arxiv.org/html/2606.05264#A3.T7) shows that removing the deep\-kernel GP residual increases mean iTransformer TSTR MSE by 11\.0% across the reported datasets, confirming that residual uncertainty modelling contributes materially to downstream performance\. ❺ Table [8](https://arxiv.org/html/2606.05264#A3.T8) shows that removing SCM\-based mixing increases mean iTransformer TSTR MSE by 6\.0%, indicating that directed cross\-variate coupling provides an additional, consistent gain\. Taken together, these results show that ReGeN’s gains are not driven by one component alone: they emerge from a strong periodic scaffold, realistic geometric and spectral fidelity, and measurable benefits from both residual modelling and SCM\-based mixing\.

## 5 Conclusion

We presented ReGeN, a reference\-guided generative pipeline that decomposes observed multivariate sequences into a phase\-aligned periodic template, a deep\-kernel GP residual, and a fitted structural causal model\. By grounding all three components in real domain observations, it produces synthetic data that inherits the periodic morphology, local uncertainty, and cross\-variable coupling of the target domain without requiring large training corpora\. Across twelve datasets and five domains, ReGeN\-synthesized data substitutes for real sibling data within a 3% MSE margin in two\-thirds of transfer settings, outperforms real\-data transfer in strongly periodic domains, and yields consistent augmentation gains for attention\- and state\-space\-based architectures\.

Limitations\. Our current evaluation has three main limitations\. First, SCM\-based mixing is not uniformly beneficial: in low\-data, higher\-dimensional settings such as Hog, graph estimation can become noisy and the induced cross\-variate coupling may hurt performance\. Second, due to computational constraints, we report full\-corpus pretraining results for Moirai\-small only; extending this comparison to additional foundation models would require training each architecture from scratch on each synthetic corpus\. Third, our component ablations are reported only for iTransformer, so while they support the design rationale, they do not yet establish that the same component\-level effects hold equally strongly across all downstream model families\.

## References

- Embarrassingly simple mixup for time\-series\. External Links: 2304\.04271, [Link](https://arxiv.org/abs/2304.04271)\.
- T\. Aksu, G\. Woo, J\. Liu, X\. Liu, C\. Liu, S\. Savarese, C\. Xiong, and D\. Sahoo \(2024\) GIFT\-eval: a benchmark for general time series forecasting model evaluation\. In NeurIPS Workshop on Time Series in the Age of Large Models\.
- M\. Al\-Shedivat, A\. G\. Wilson, Y\. Saatchi, Z\. Hu, and E\. P\. Xing \(2017\) Learning scalable deep kernels with recurrent structure\. Journal of Machine Learning Research 18\(82\), pp\. 1–37\. External Links: [Link](https://jmlr.org/papers/v18/16-498.html)\.
- A\. F\. Ansari, L\. Stella, C\. Turkmen, X\. Zhang, P\. Mercado, H\. Shen, O\. Shchur, S\. S\. Rangapuram, S\. P\. Arango, S\. Kapoor, J\. Zschiegner, D\. C\. Maddix, H\. Wang, M\. W\. Mahoney, K\. Torkkola, A\. G\. Wilson, M\. Bohlke\-Schneider, and Y\. Wang \(2024\) Chronos: learning the language of time series\. External Links: 2403\.07815, [Link](https://arxiv.org/abs/2403.07815)\.
- L\. Breiman \(2001\) Random forests\. Machine Learning 45, pp\. 5–32\. External Links: [Document](https://dx.doi.org/10.1023/A%3A1010933404324)\.
- R\. B\. Cleveland, W\. S\. Cleveland, J\. E\. McRae, and I\. Terpenning \(1990\) STL: a seasonal\-trend decomposition procedure based on loess\. Journal of Official Statistics 6\(1\), pp\. 3–73\.
- L\. Darlow, M\. Asenov, A\. Joosen, Q\. Deng, J\. Wang, and A\. D\. Barker \(2023\) TSMix: time series data augmentation by mixing sources\. In Proceedings of the 3rd Workshop on Machine Learning and Systems, pp\. 109–114\. External Links: [Document](https://dx.doi.org/10.1145/3578356.3592584)\.
- S\. Dooley, G\. S\. Khurana, C\. Mohapatra, S\. V\. Naidu, and C\. White \(2023\) ForecastPFN: synthetically\-trained zero\-shot forecasting\. In Advances in Neural Information Processing Systems 36\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/0731f0e65559059eb9cd9d6f44ce2dd8-Abstract-Conference.html)\.
- P\. Emami, A\. Sahu, and P\. Graf \(2023\) BuildingsBench: a large\-scale dataset of 900k buildings and benchmark for short\-term load forecasting\. In NeurIPS 2023 Track on Datasets and Benchmarks\. External Links: [Link](https://openreview.net/forum?id=c5rqd6PZn6)\.
- C\. Esteban, S\. L\. Hyland, and G\. Rätsch \(2017\) Real\-valued \(medical\) time series generation with recurrent conditional GANs\. arXiv preprint arXiv:1706\.02633\.
- C\. W\. J\. Granger \(1969\) Investigating causal relations by econometric models and cross\-spectral methods\. Econometrica 37\(3\), pp\. 424–438\. External Links: [Document](https://dx.doi.org/10.2307/1912791)\.
- S\. Hochreiter and J\. Schmidhuber \(1997\) Long short\-term memory\. Neural Computation 9\(8\), pp\. 1735–1780\. External Links: [Document](https://dx.doi.org/10.1162/neco.1997.9.8.1735)\.
- A\. Hyvärinen, K\. Zhang, S\. Shimizu, and P\. O\. Hoyer \(2010\) Estimation of a structural vector autoregression model using non\-Gaussianity\. Journal of Machine Learning Research 11, pp\. 1709–1731\.
- J\. Jiang, C\. Han, W\. Jiang, W\. X\. Zhao, and J\. Wang \(2023\) LibCity: a unified library towards efficient and comprehensive urban spatial\-temporal prediction\. arXiv preprint arXiv:2304\.14343\. External Links: [Link](https://arxiv.org/abs/2304.14343)\.
- Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long \(2024\) iTransformer: inverted transformers are effective for time series forecasting\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=JePfAI8fah)\.
- O\. Mogren \(2016\) C\-RNN\-GAN: continuous recurrent neural networks with adversarial training\. arXiv preprint arXiv:1611\.09904\.
- S\. Mouatadid, P\. Orenstein, G\. E\. Flaspohler, M\. Oprescu, J\. Cohen, F\. Wang, S\. E\. Knight, M\. Geogdzhayeva, S\. J\. Levang, E\. Fraenkel, and L\. Mackey \(2023\) SubseasonalClimateUSA: a dataset for subseasonal forecasting and benchmarking\. In NeurIPS 2023 Track on Datasets and Benchmarks\. External Links: [Link](https://openreview.net/forum?id=pWkrU6raMt)\.
- B\. N\. Oreshkin, M\. Jauhari, R\. K\. Selvam, M\. Wolff, W\. Pan, S\. Ramasubramanian, K\. G\. Olivares, T\. Konstantinova, A\. Potapczynski, M\. Cao, D\. Efimov, M\. W\. Mahoney, and A\. G\. Wilson \(2026\) Zero\-shot forecasting by simulation alone\. External Links: 2601\.00970, [Link](https://arxiv.org/abs/2601.00970)\.
- R\. Pamfil, N\. Sriwattanaworachai, S\. Desai, P\. Pilgerstorfer, P\. Beaumont, K\. Georgatzis, and B\. Aragam \(2020\) DYNOTEARS: structure learning from time\-series data\. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pp\. 1595–1605\.
- J\. Peters, D\. Janzing, and B\. Schölkopf \(2017\) Elements of causal inference: foundations and learning algorithms\. MIT Press\.
- C\. E\. Rasmussen and C\. K\. I\. Williams \(2006\) Gaussian processes for machine learning\. MIT Press\.
- S\. Roberts, M\. Osborne, M\. Ebden, S\. Reece, N\. Gibson, and S\. Aigrain \(2013\) Gaussian processes for time\-series modelling\. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371\(1984\), pp\. 20110550\. External Links: [Document](https://dx.doi.org/10.1098/rsta.2011.0550)\.
- J\. Runge, P\. Nowack, M\. Kretschmer, S\. Flaxman, and D\. Sejdinovic \(2019\) Detecting and quantifying causal associations in large nonlinear time series datasets\. Science Advances 5\(11\), pp\. eaau4996\. External Links: [Document](https://dx.doi.org/10.1126/sciadv.aau4996)\.
- C\. Spearman \(1904\) The proof and measurement of association between two things\. The American Journal of Psychology 15\(1\), pp\. 72–101\. External Links: [Document](https://dx.doi.org/10.2307/1412159)\.
- E\. O\. Taga, M\. E\. Ildiz, and S\. Oymak \(2025\) TimePFN: effective multivariate time series forecasting with synthetic data\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 39, pp\. 20761–20769\.
- L\. van der Maaten and G\. Hinton \(2008\) Visualizing data using t\-SNE\. Journal of Machine Learning Research 9, pp\. 2579–2605\.
- Z\. Wang, F\. Kong, S\. Feng, M\. Wang, X\. Yang, H\. Zhao, D\. Wang, and Y\. Zhang \(2025\) Is Mamba effective for time series forecasting? Neurocomputing 619, pp\. 129178\. External Links: [Document](https://dx.doi.org/10.1016/j.neucom.2024.129178), [Link](https://www.sciencedirect.com/science/article/pii/S0925231224019490)\.
- P\. D\. Welch \(1967\) The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms\. IEEE Transactions on Audio and Electroacoustics 15\(2\), pp\. 70–73\. External Links: [Document](https://dx.doi.org/10.1109/TAU.1967.1161901)\.
- A\. G\. Wilson, Z\. Hu, R\. Salakhutdinov, and E\. P\. Xing \(2016\) Deep kernel learning\. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol\. 51, pp\. 370–378\. External Links: [Link](https://proceedings.mlr.press/v51/wilson16.html)\.
- G\. Woo, C\. Liu, A\. Kumar, and D\. Sahoo \(2023\) Pushing the limits of pre\-training for time series forecasting in the CloudOps domain\. arXiv preprint arXiv:2310\.05063\. External Links: [Link](https://arxiv.org/abs/2310.05063)\.
- G\. Woo, C\. Liu, A\. Kumar, C\. Xiong, S\. Savarese, and D\. Sahoo \(2024\) Unified training of universal time series forecasting transformers\. External Links: 2402\.02592, [Link](https://arxiv.org/abs/2402.02592)\.
- S\. Xie, V\. Feofanov, M\. Alonso, A\. Odonnat, J\. Zhang, T\. Palpanas, L\. Zan, L\. Pan, K\. Zhang, and I\. Redko \(2025\) CauKer: classification time series foundation models can be pretrained on synthetic data only\. External Links: 2508\.02879, [Link](https://arxiv.org/abs/2508.02879)\.
- J\. Yoon, D\. Jarrett, and M\. van der Schaar \(2019\) Time\-series generative adversarial networks\. In Advances in Neural Information Processing Systems 32\. External Links: [Link](https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks)\.
- A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\) Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 37, pp\. 11121–11128\. External Links: [Document](https://dx.doi.org/10.1609/aaai.v37i9.26317)\.

## Appendix A Dataset and System Details

### A\.1 Experimental System Details

All experiments reported in the main paper and supplementary material were run on the same machine\. The hardware consisted of an NVIDIA A10G GPU with 22\.1 GB of VRAM, an AMD EPYC 7R32 CPU with 4 cores and 8 threads, 32 GB of system RAM, and a 242 GB SSD\. The software environment used Ubuntu 24\.04\.4 LTS\. These specifications cover both synthetic\-data generation and downstream forecasting experiments\.

### A\.2 Dataset Details

Table 5: Twelve real\-world datasets used in evaluation, spanning energy, cloud, traffic, and climate, with summary statistics for sampling frequency, characteristic template period $P$, series count, targets, and covariates\.

| Dataset | Source | Domain | Frequency | Template $P$ | # Time Series | # Targets | # Covariates |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BDG-2 Bear | BuildingsBench (Emami et al., [2023](https://arxiv.org/html/2606.05264#bib.bib47)) | Energy | H | 24 | 91 | 1 | 0 |
| BDG-2 Panther | BuildingsBench (Emami et al., [2023](https://arxiv.org/html/2606.05264#bib.bib47)) | Energy | H | 24 | 105 | 1 | 0 |
| Azure VM Traces 2017 | CloudOpsTSF (Woo et al., [2023](https://arxiv.org/html/2606.05264#bib.bib50)) | Cloud | 5T | 288 | 10,000 | 1 | 2 |
| Borg Cluster Data 2011 | CloudOpsTSF (Woo et al., [2023](https://arxiv.org/html/2606.05264#bib.bib50)) | Cloud | 5T | 288 | 10,000 | 2 | 5 |
| PEMS-04 | LibCity (Jiang et al., [2023](https://arxiv.org/html/2606.05264#bib.bib49)) | Traffic | 5T | 288 | 307 | 3 | 0 |
| PEMS-08 | LibCity (Jiang et al., [2023](https://arxiv.org/html/2606.05264#bib.bib49)) | Traffic | 5T | 288 | 170 | 3 | 0 |
| BDG-2 Bull | BuildingsBench (Emami et al., [2023](https://arxiv.org/html/2606.05264#bib.bib47)) | Energy | H | 24 | 41 | 1 | 3 |
| BDG-2 Hog | BuildingsBench (Emami et al., [2023](https://arxiv.org/html/2606.05264#bib.bib47)) | Energy | H | 24 | 24 | 1 | 5 |
| Subseasonal | SubseasonalClimateUSA (Mouatadid et al., [2023](https://arxiv.org/html/2606.05264#bib.bib48)) | Climate | D | 365 | 862 | 4 | 0 |
| Subseasonal Precipitation | SubseasonalClimateUSA (Mouatadid et al., [2023](https://arxiv.org/html/2606.05264#bib.bib48)) | Climate | D | 365 | 862 | 1 | 0 |
| Residential PV Power | LOTSA_Others (Woo et al., [2024](https://arxiv.org/html/2606.05264#bib.bib2)) | Energy | T | 1440 | 233 | 3 | 0 |
| Residential Load Power | LOTSA_Others (Woo et al., [2024](https://arxiv.org/html/2606.05264#bib.bib2)) | Energy | T | 1440 | 271 | 3 | 0 |

Characteristic Period\. $P$ denotes the characteristic period length, in time steps, used for phase alignment in the template\-extraction stage\. We use one full daily cycle for the sub\-daily datasets and one climatological yearly cycle for the daily climate datasets, so the frequency codes map as follows: $\mathrm{H}\mapsto P=24$, $\mathrm{5T}\mapsto P=288$, $\mathrm{T}\mapsto P=1440$, and $\mathrm{D}\mapsto P=365$\.
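
The frequency\-to\-period mapping follows directly from the number of samples per day \(or per year for daily data\)\. The sketch below encodes that mapping and, purely as an illustration, folds a series by phase and averages it\. The `phase_means` helper is a hypothetical stand\-in for a periodic summary; it is not the phase\-aligned template extraction used by ReGeN, whose details are given in the method section\.

```python
from typing import List, Sequence

MINUTES_PER_DAY = 24 * 60
STEP_MINUTES = {"H": 60, "5T": 5, "T": 1}  # sub-daily frequency codes from Table 5


def characteristic_period(freq: str) -> int:
    """Map a frequency code to the period P, in time steps."""
    if freq == "D":
        return 365  # one climatological year for daily data
    return MINUTES_PER_DAY // STEP_MINUTES[freq]  # one daily cycle otherwise


def phase_means(series: Sequence[float], period: int) -> List[float]:
    """Average all observations that share the same phase (index mod period).

    Hypothetical helper for illustration; not the paper's template stage.
    """
    sums = [0.0] * period
    counts = [0] * period
    for t, value in enumerate(series):
        sums[t % period] += value
        counts[t % period] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]
```

For example, `characteristic_period("5T")` gives 288 because a day contains 1440 / 5 five\-minute steps\.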

The 12 datasets in Table [5](https://arxiv.org/html/2606.05264#A1.T5) are grouped into six source\-matched pairs so that each pair shares a common collection setting or application domain while still differing in scale, dimensionality, or predictand\. We briefly summarize those six pairings here because they anchor the cross\-dataset comparisons used throughout the main paper\.

BuildingsBench: Bear and Panther\. Bear and Panther are hourly building\-energy datasets from BuildingsBench, both focused on single\-target load forecasting without additional covariates\. They form a clean univariate pair for evaluating whether a method can transfer across buildings that share broad consumption rhythms but differ in occupancy patterns, control policies, and building\-specific demand variability\.

CloudOpsTSF: Azure VM Traces 2017 and Borg Cluster Data 2011\. Azure VM Traces 2017 and Borg Cluster Data 2011 represent cloud\-resource monitoring at 5\-minute resolution\. This pair is useful because both datasets capture operational infrastructure telemetry, but Azure is a simpler single\-target setting with two covariates, whereas Borg is multivariate and more heterogeneous, making the pair a direct test of how methods scale from lighter to richer cloud traces\.

LibCity: PEMS\-04 and PEMS\-08\. PEMS\-04 and PEMS\-08 are traffic datasets from LibCity with 5\-minute sampling and three target channels per sensor\. They provide a matched transportation pair in which both tasks exhibit strong daily and weekly rhythms, while differing in network size and sensor topology, so they probe whether a model preserves structured periodicity under varying spatial scales\.

BuildingsBench: Bull and Hog\. Bull and Hog return to BuildingsBench, but now in covariate\-rich hourly settings rather than the univariate Bear/Panther case\. Because both datasets model building demand with exogenous drivers, this pair helps separate performance gains from simple periodic load reconstruction versus the harder problem of handling auxiliary variables that may shift or weaken the dominant seasonal pattern\.

SubseasonalClimateUSA: Subseasonal and Subseasonal Precipitation\. Subseasonal and Subseasonal Precipitation come from the same climate benchmark and both aggregate daily measurements across 862 series, but they differ sharply in target dimensionality\. The pair therefore isolates how a method behaves when the broader meteorological setting is held fixed while the predictive task changes from a four\-target subseasonal forecasting problem to a precipitation\-only version with a narrower signal profile\.

LOTSA Others: Residential PV Power and Residential Load Power\. Residential PV Power and Residential Load Power are minute\-level energy datasets drawn from the LOTSA collection\. They form a natural household\-energy pair because both reflect residential behavior at fine temporal resolution, yet PV generation is dominated by solar forcing while load reflects human activity and appliance usage, giving a useful contrast between externally driven and behavior\-driven dynamics\.

## Appendix B Consensus DAG Estimation

The graph $\mathcal{G}$ used in the SCM mixing stage is estimated directly from the real multivariate data through a consensus causal\-discovery pipeline\. We avoid relying on a single discovery routine because different estimators emphasize different dependence structures and can behave differently across datasets\. Instead, we infer candidate edges with four complementary procedures: PCMCI \(Runge et al\., [2019](https://arxiv.org/html/2606.05264#bib.bib54)\), DYNOTEARS \(Pamfil et al\., [2020](https://arxiv.org/html/2606.05264#bib.bib55)\), VARLiNGAM \(Hyvärinen et al\., [2010](https://arxiv.org/html/2606.05264#bib.bib56)\), and a non\-linear Granger\-style test based on random\-forest feature importance \(Granger, [1969](https://arxiv.org/html/2606.05264#bib.bib57); Breiman, [2001](https://arxiv.org/html/2606.05264#bib.bib58)\)\. The final graph retains only edges that recur across the dataset and receive support from multiple methods\.

Per\-series preprocessing\. For each real series $s$, we stack all available variates into a single multivariate trajectory

$$\mathbf{z}^{(s)}_{t}=\bigl(z^{(s)}_{1,t},\dots,z^{(s)}_{C,t}\bigr)^{\top},\qquad t=1,\dots,T_{s}, \tag{16}$$

placing dynamic covariates before the target variates\. Missing values are imputed by linear interpolation, with forward/backward filling at the boundaries when needed\. Each variate is then standardized independently,

$$\widetilde{z}^{(s)}_{c,t}=\frac{z^{(s)}_{c,t}-\mu^{(s)}_{c}}{\sigma^{(s)}_{c}+\varepsilon}, \tag{17}$$

so that discovery is driven by temporal dependence rather than raw scale\. Here $C$ is the total number of variates in the multivariate series, $T_{s}$ is its observed length, $\mu^{(s)}_{c}$ and $\sigma^{(s)}_{c}$ are the mean and standard deviation of variate $c$ in series $s$, and $\varepsilon$ is a small constant that guards against division by zero\.
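
A minimal sketch of this preprocessing for a single variate is given below\. It assumes missing values are encoded as `None`, uses the population standard deviation, and picks `eps = 1e-8`; none of these choices is specified in the text, and the function names are hypothetical\.

```python
import math
from typing import List, Optional, Sequence


def interpolate(values: Sequence[Optional[float]]) -> List[float]:
    """Linear interpolation of interior gaps, with forward/backward fill at the boundaries."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    out = list(values)
    for i in range(known[0]):                       # backward fill before the first observation
        out[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):     # forward fill after the last observation
        out[i] = values[known[-1]]
    for left, right in zip(known, known[1:]):       # interior gaps between observations
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span
            out[i] = (1 - w) * values[left] + w * values[right]
    return out


def standardize(values: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Eq. (17): subtract the mean and divide by (std + eps); eps value is an assumption."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / (sigma + eps) for v in values]
```

Applying `standardize(interpolate(column))` to each variate independently yields the standardized trajectory that the discovery methods consume\.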

Candidate edge discovery\. We run all four discovery methods on each standardized series over a fixed lag window $\ell\in\{0,1,\dots,L_{\max}\}$\. For a given series and method, a candidate edge is indexed by an ordered triple $(u,v,\ell)$, where $u\in\{1,\dots,C\}$ is the source variate, $v\in\{1,\dots,C\}$ is the destination variate, and $\ell$ is the lag relating source time $t-\ell$ to destination time $t$\. Thus, $\ell=0$ denotes an instantaneous or contemporaneous relation, whereas $\ell>0$ denotes a delayed relation\. Every method returns a set of candidate relations of the form

$$\widetilde{z}^{(s)}_{u,t-\ell}\rightarrow\widetilde{z}^{(s)}_{v,t} \tag{18}$$

together with a raw edge\-strength score $a^{(s,m)}_{u,v,\ell}$, where $m\in\{1,\dots,4\}$ indexes the discovery method\. We keep both contemporaneous and positive\-lag edges at this stage\. However, self\-edges with $u=v$ are removed before the final consensus graph is constructed\. The reason is architectural rather than causal: the periodic template stage already carries the dominant self\-lag or autoregressive structure, so re\-inserting self\-links into the SCM mixing graph would double\-count that dependence, degrade signal quality, and typically reduce fidelity rather than improve it\.

Aggregation across series\.The raw scores produced by different algorithms are not directly comparable, so we normalize them within each method and series before aggregation\. Letℰ\(s,m\)\\mathcal\{E\}^\{\(s,m\)\}denote the set of candidate edges returned by methodmmon seriesss\. We define the normalized score

$$\widehat{a}^{(s,m)}_{u,v,\ell}=\begin{cases}\dfrac{\left|a^{(s,m)}_{u,v,\ell}\right|}{\max\limits_{(i,j,r)\in\mathcal{E}^{(s,m)}}\left|a^{(s,m)}_{i,j,r}\right|},&(u,v,\ell)\in\mathcal{E}^{(s,m)},\\[10pt] 0,&\text{otherwise,}\end{cases}\tag{19}$$

where $(i,j,r)$ is simply a dummy edge index ranging over all candidate source variates $i$, destination variates $j$, and lags $r$ returned by method $m$ on series $s$. This rescales all detected edges for a given method/series pair into $[0,1]$ while preserving their relative ordering. We then aggregate each edge across the $S$ real series through two summaries. First, its frequency of occurrence under method $m$ is

$$f^{(m)}_{u,v,\ell}=\frac{1}{S}\sum_{s=1}^{S}\mathbf{1}\!\left[(u,v,\ell)\in\mathcal{E}^{(s,m)}\right].\tag{20}$$

Second, its mean normalized strength over the series in which it appears is

$$\bar{a}^{(m)}_{u,v,\ell}=\frac{\sum_{s=1}^{S}\widehat{a}^{(s,m)}_{u,v,\ell}}{\max\!\left(1,\sum_{s=1}^{S}\mathbf{1}\!\left[(u,v,\ell)\in\mathcal{E}^{(s,m)}\right]\right)}.\tag{21}$$

These two quantities separate stability across series from within-method edge magnitude, yielding a dataset-level summary for each method rather than a separate graph for every series. At this point the edge-lag triple $(u,v,\ell)$ is still represented by *method-specific* summaries $\bar{a}^{(m)}_{u,v,\ell}$ rather than by a single pooled score. In other words, after this step there can still be up to four aggregated strengths for the same $(u,v,\ell)$, one for each discovery method.
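The normalization and aggregation steps in Eqs. (19) to (21) can be sketched for a single discovery method as follows. The data layout (one dictionary of `(u, v, lag) -> raw score` per series) and the function name `aggregate_method` are assumptions made for illustration; the paper does not describe its implementation.

```python
from collections import defaultdict


def aggregate_method(edges_per_series):
    """Aggregate one method's candidate edges across S series.

    `edges_per_series` is a list with one dict per series, mapping an
    edge triple (u, v, lag) to its raw strength score.
    Returns (freq, mean_strength), following Eqs. (20) and (21).
    """
    n_series = len(edges_per_series)
    count = defaultdict(int)
    strength_sum = defaultdict(float)
    for edges in edges_per_series:
        if not edges:
            continue  # this series contributed no candidate edges
        # Eq. (19): divide by the largest absolute score in this series.
        peak = max(abs(a) for a in edges.values())
        for edge, a in edges.items():
            count[edge] += 1
            strength_sum[edge] += abs(a) / peak if peak > 0 else 0.0
    freq = {e: count[e] / n_series for e in count}
    mean_strength = {e: strength_sum[e] / max(1, count[e]) for e in count}
    return freq, mean_strength


toy = [
    {(0, 1, 1): 2.0, (1, 0, 0): -1.0},
    {(0, 1, 1): 0.5},
    {},
    {(1, 0, 0): 4.0, (0, 1, 1): 1.0},
]
freq, mean_strength = aggregate_method(toy)
```

In the toy input, edge `(0, 1, 1)` appears in three of four series, so its frequency is 0.75, while its mean strength is averaged only over those three series and not over all four.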

Consensus graph construction. A vote is counted for an edge only after two levels of filtering. First, a particular discovery method $m$ casts a vote for $(u,v,\ell)$ only if that relation appears in more than a threshold fraction of the dataset. In our implementation, this per-method support condition is

$$f^{(m)}_{u,v,\ell}>\tau_{\mathrm{freq}},\qquad\tau_{\mathrm{freq}}=0.2.\tag{22}$$

Equivalently, the edge must appear in more than 20% of the series for that method. Defining the per-method vote indicator as

$$b^{(m)}_{u,v,\ell}=\mathbf{1}\!\left[f^{(m)}_{u,v,\ell}>\tau_{\mathrm{freq}}\right],\tag{23}$$

the second requirement is cross-method agreement: an edge is kept only if at least two of the four discovery methods vote for it,

$$\sum_{m=1}^{4}b^{(m)}_{u,v,\ell}\geq 2.\tag{24}$$

Thus, an edge enters the final consensus graph only if *both* conditions hold: (a) a method's vote counts only when the edge appears in more than a fraction $\tau_{\mathrm{freq}}$ of the series under that method, and (b) at least two methods vote for that same edge. For edges that survive this voting stage, we then pool the method-specific strengths into a single final consensus score,

$$w_{u,v,\ell}=\frac{\sum_{m=1}^{4}b^{(m)}_{u,v,\ell}\,\bar{a}^{(m)}_{u,v,\ell}}{\sum_{m=1}^{4}b^{(m)}_{u,v,\ell}},\qquad\text{defined whenever }\sum_{m=1}^{4}b^{(m)}_{u,v,\ell}\geq 2.\tag{25}$$

This is the final per-edge, per-lag coefficient used downstream in SCM mixing. Subject to the additional constraint $u\neq v$ that removes self-links from downstream SCM mixing, we then collect all retained lags for each parent-child pair $(u,v)$ into the admissible lag set

$$\mathcal{L}_{u\to v}=\left\{\ell\in\{0,1,\dots,L_{\max}\}:\sum_{m=1}^{4}b^{(m)}_{u,v,\ell}\geq 2\right\}.\tag{26}$$

The resulting consensus graph therefore allows both instantaneous ($\ell=0$) and delayed ($\ell>0$) cross-variate relations, while leaving self-temporal structure to the periodic template and residual components described in Section [3](https://arxiv.org/html/2606.05264#S3).
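The two-level voting rule and the pooled score in Eqs. (22) to (26) can be sketched as below. The inputs are assumed to be per-method dictionaries of edge frequency and mean normalized strength; the function name `consensus_graph` and this data layout are hypothetical, not taken from the paper.

```python
def consensus_graph(freq_by_method, strength_by_method,
                    tau_freq=0.2, min_votes=2):
    """Build consensus weights w[(u, v, lag)] and lag sets L[(u, v)].

    `freq_by_method[m]` and `strength_by_method[m]` map an edge triple
    (u, v, lag) to its frequency (Eq. 20) and mean strength (Eq. 21).
    """
    candidates = set().union(*(f.keys() for f in freq_by_method))
    weights, lag_sets = {}, {}
    for (u, v, lag) in sorted(candidates):
        if u == v:
            continue  # self-links are left to the periodic template
        # Eq. (23): a method votes only if its support exceeds tau_freq.
        voters = [m for m, f in enumerate(freq_by_method)
                  if f.get((u, v, lag), 0.0) > tau_freq]
        if len(voters) < min_votes:
            continue  # Eq. (24): need agreement from at least two methods
        # Eq. (25): average the strengths of the voting methods only.
        total = sum(strength_by_method[m].get((u, v, lag), 0.0)
                    for m in voters)
        weights[(u, v, lag)] = total / len(voters)
        lag_sets.setdefault((u, v), []).append(lag)  # Eq. (26)
    return weights, lag_sets


toy_freq = [
    {(0, 1, 1): 0.6, (1, 1, 1): 0.9, (1, 0, 0): 0.2},
    {(0, 1, 1): 0.3, (1, 0, 0): 0.5},
    {(0, 1, 2): 0.4},
    {(0, 1, 1): 0.1, (1, 1, 1): 0.8},
]
toy_strength = [
    {(0, 1, 1): 0.8, (1, 1, 1): 1.0, (1, 0, 0): 0.3},
    {(0, 1, 1): 0.4, (1, 0, 0): 0.6},
    {(0, 1, 2): 0.5},
    {(0, 1, 1): 0.9, (1, 1, 1): 0.7},
]
weights, lag_sets = consensus_graph(toy_freq, toy_strength)
```

In this toy input only `(0, 1, 1)` survives. The self-edge `(1, 1, 1)` is dropped despite two votes, `(1, 0, 0)` gets a single vote because a frequency of exactly 0.2 does not satisfy the strict inequality, and the fourth method's strength for `(0, 1, 1)` is excluded from the pooled score because that method did not vote.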

## Appendix C Ablation Studies

To keep the main paper compact, we collect the full variance, geometric, spectral, residual, and SCM\-mixing ablations below\.

### C.1 Variance Explained by Covariate

Table 6: Variance explained (VE, %) for each dataset and each available channel, together with the per-dataset average across available channels. Higher values indicate that the phase-aligned template explains a larger fraction of the observed signal variance.

| Dataset | Covariate 1 | Covariate 2 | Covariate 3 | Covariate 4 | Covariate 5 | Covariate 6 | Average |
|---|---|---|---|---|---|---|---|
| Bear | 57.3 | – | – | – | – | – | 57.3 |
| Panther | 39.7 | – | – | – | – | – | 39.7 |
| Azure VM | 9.1 | 25.0 | 14.5 | – | – | – | 16.2 |
| Borg | 59.3 | 15.2 | 14.8 | 0.7 | 35.4 | 59.4 | 30.8 |
| PEMS-04 | 91.9 | 84.7 | 65.4 | – | – | – | 80.7 |
| PEMS-08 | 91.2 | 84.7 | 65.1 | – | – | – | 80.3 |
| Bull | 14.8 | 0.9 | 3.4 | 2.1 | – | – | 5.3 |
| Hog | 3.2 | 0.9 | 0.3 | 1.6 | 8.8 | 1.7 | 2.8 |
| Subseasonal | 30.3 | 97.1 | 97.0 | 97.1 | – | – | 80.4 |
| Subseasonal Prec. | 32.6 | – | – | – | – | – | 32.6 |
| Res. PV | 88.7 | 88.3 | 88.8 | – | – | – | 88.6 |
| Res. Load | 88.1 | 92.0 | 93.6 | – | – | – | 91.2 |

Table [6](https://arxiv.org/html/2606.05264#A3.T6) reports the variance explained (VE) by the phase-aligned periodic template for every dataset and channel. The values span a wide range, from below 1% on individual Hog and Bull channels to above 90% on PEMS and the residential benchmarks, reflecting genuine heterogeneity in how strongly periodic structure dominates across domains. Three broad tiers are visible. High-VE datasets, namely PEMS-04, PEMS-08, Residential PV Power, Residential Load Power, and the non-precipitation Subseasonal channels, have averages above 80%, meaning the template captures most of the signal variance before residual modelling. Moderate-VE datasets, including Bear, Panther, Subseasonal Precipitation, and Borg, sit between roughly 30% and 60% on average, so the template provides a meaningful but incomplete scaffold. Low-VE datasets, namely Azure VM, Bull, and Hog, have averages below 20%, with several individual channels near zero, indicating that periodic structure explains little of the observed variability and the residual model must carry most of the generative burden. These tiers recur as an organizing principle throughout the later ablations: template quality, as measured by VE, consistently predicts residual importance, spectral alignment, and the conditions under which SCM mixing helps or degrades. We report VE in percentage points. Columns *Covariate 1*–*Covariate 6* denote the ordered channels used for each dataset; datasets with fewer than six channels use '–' in the remaining entries.
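The three tiers can be reproduced directly from the per-channel values in Table 6. The sketch below recomputes each per-dataset average and assigns a tier; the exact cut-offs (above 80 for high, below 20 for low, everything else moderate) are an assumption that formalizes the "roughly" stated boundaries in the text, and the names `VE_BY_CHANNEL` and `ve_tier` are hypothetical.

```python
# Per-channel VE (%) copied from Table 6; absent channels are omitted.
VE_BY_CHANNEL = {
    "Bear": [57.3],
    "Panther": [39.7],
    "Azure VM": [9.1, 25.0, 14.5],
    "Borg": [59.3, 15.2, 14.8, 0.7, 35.4, 59.4],
    "PEMS-04": [91.9, 84.7, 65.4],
    "PEMS-08": [91.2, 84.7, 65.1],
    "Bull": [14.8, 0.9, 3.4, 2.1],
    "Hog": [3.2, 0.9, 0.3, 1.6, 8.8, 1.7],
    "Subseasonal": [30.3, 97.1, 97.0, 97.1],
    "Subseasonal Prec.": [32.6],
    "Res. PV": [88.7, 88.3, 88.8],
    "Res. Load": [88.1, 92.0, 93.6],
}


def ve_tier(avg_ve, high=80.0, low=20.0):
    """Map an average VE (%) to a tier; the thresholds are assumptions."""
    if avg_ve > high:
        return "high"
    if avg_ve < low:
        return "low"
    return "moderate"


averages = {name: sum(v) / len(v) for name, v in VE_BY_CHANNEL.items()}
tiers = {name: ve_tier(avg) for name, avg in averages.items()}
low_ve = sorted(name for name, tier in tiers.items() if tier == "low")
```

With these cut-offs the grouping matches the one described above: Azure VM, Bull, and Hog fall in the low tier, and the PEMS, residential, and Subseasonal datasets fall in the high tier.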

### C.2 t-SNE Ablation

To assess geometric fidelity directly, Figure [2](https://arxiv.org/html/2606.05264#A3.F2) visualizes t-SNE projections (van der Maaten and Hinton, [2008](https://arxiv.org/html/2606.05264#bib.bib59)) for all 12 datasets.

Figure 2: t-SNE projections of real and synthetic samples for twelve representative datasets, illustrating broad geometric alignment together with partial separation in lower-density regions.

As a qualitative diagnostic, Figure [2](https://arxiv.org/html/2606.05264#A3.F2) offers a complementary geometric view of the same phenomenon. Across all twelve datasets, the real and synthetic point clouds remain close in the embedding space without becoming fully superposed, suggesting substantial alignment at the level of coarse support while preserving some separation between the two distributions. The absence of exact overlap is arguably favorable, since a synthetic sample that simply collapsed onto the densest empirical regions of the real data would provide comparatively limited additional coverage.

This pattern is particularly visible in the larger datasets, Azure VM Traces 2017 and Borg Cluster Data 2011\. In Azure, the synthetic points extend along the outer envelope and lower\-density arcs of the cloud rather than concentrating only in the most populated real regions\. In Borg, they occupy several bridging and peripheral regions around the main lobes, again indicating coverage of areas that appear relatively sparse in the observed sample\. For the smaller datasets such as Bear, Bull, Hog, and Panther, the two clouds more often come into contact near boundary or transition regions while remaining only partially overlapping overall\. Given the reduced sample size and the inherent variability of two\-dimensional embeddings, this degree of separation is compatible with the view that the synthetic distribution tracks the same broad geometry while allocating somewhat greater mass to nearby regions that are underrepresented in the real data\.

Residential PV Power is the clearest exception, where the separation is more pronounced\. The synthetic cloud forms a distinct island anchored around a small subset of real points rather than spreading across the full real manifold\. Because the periodic template accounts for 88\.6% of signal variance in this dataset, the t\-SNE embedding is driven mostly by residual variation, since the shared periodic structure contributes little to point separation\. The over\-dispersed residuals produced by the elevated temperature range are therefore more visible here than in other datasets, where residual variation is a smaller fraction of the total signal\. This also explains why the forecasting impact remains moderate despite the geometric separation: the template carries most of the predictive signal, leaving relatively little room for the miscalibrated residual to degrade downstream performance\. The PV result points to the temperature range as a dataset\-specific hyperparameter that would benefit from tuning in domains where the residual component is especially sensitive to sampling stochasticity\.

### C.3 Frequency-Domain Ablation

To assess spectral fidelity directly, Figure [3](https://arxiv.org/html/2606.05264#A3.F3) provides a complementary power-spectral comparison using Welch-style PSD estimates (Welch, [1967](https://arxiv.org/html/2606.05264#bib.bib60)).

Figure 3: Power spectral density (PSD) comparison between real and synthetic time series across representative datasets and covariates. Each subplot shows frequency on the x-axis and spectral power density on the y-axis, with the real series shown in blue and the synthetic series shown in orange. The panels are arranged dataset-by-dataset with at most three subplots per row, except for BDG-2 Bull, which is shown with four panels in one row, and Borg Cluster Data 2011, which spans two rows. Overall, the synthetic series preserve the dominant spectral peaks and low-frequency decay patterns of the real data while allowing controlled deviations across datasets and covariates.

The PSD comparisons in Figure [3](https://arxiv.org/html/2606.05264#A3.F3) provide a complementary spectral view of the fidelity story told by the forecasting results. Across the included datasets, the synthetic series preserve the dominant spectral structure of the real data well, and the cases where alignment is imperfect can be traced to specific structural properties of the target domain rather than to a general failure of the pipeline.
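For readers unfamiliar with Welch-style estimates, the following is a minimal pure-Python sketch of the idea: split the series into overlapping segments, apply a window, compute a periodogram per segment, and average. It is a simplified variant (no detrending, naive DFT), and the segment length, overlap, and Hann window are assumptions, since the paper does not list its PSD settings.

```python
import cmath
import math


def welch_psd(x, nperseg=64, noverlap=32, fs=1.0):
    """Simplified one-sided Welch PSD estimate (density scaling)."""
    step = nperseg - noverlap
    # Periodic Hann window.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / nperseg)
              for n in range(nperseg)]
    scale = 1.0 / (fs * sum(w * w for w in window))
    n_freqs = nperseg // 2 + 1
    psd = [0.0] * n_freqs
    n_segments = 0
    for start in range(0, len(x) - nperseg + 1, step):
        seg = [x[start + n] * window[n] for n in range(nperseg)]
        for k in range(n_freqs):
            coeff = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / nperseg)
                        for n in range(nperseg))
            power = abs(coeff) ** 2 * scale
            if 0 < k < nperseg / 2:
                power *= 2.0  # fold negative frequencies (not DC/Nyquist)
            psd[k] += power
        n_segments += 1
    if n_segments == 0:
        raise ValueError("series is shorter than one segment")
    freqs = [k * fs / nperseg for k in range(n_freqs)]
    return freqs, [p / n_segments for p in psd]


# Toy check: a sinusoid at 0.125 cycles per sample.
signal = [math.sin(2 * math.pi * 0.125 * t) for t in range(512)]
freqs, psd = welch_psd(signal)
peak_freq = freqs[max(range(len(psd)), key=psd.__getitem__)]
```

Comparing real and synthetic series then amounts to running the same estimator with identical settings on both and overlaying the two curves, as in Figure 3.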

The clearest successes are the residential energy datasets and the PEMS traffic benchmarks. For Residential PV Power and Residential Load Power (avg. VE $>88\%$), the synthetic PSD tracks the real spectral envelope closely across all three covariates in both datasets. The dominant low-frequency decay, harmonic positions, and relative inter-peak power levels are all reproduced well. PEMS-04 and PEMS-08 (avg. VE $>80\%$) show the same pattern in a different domain, with tight synthetic-to-real alignment across all three covariates, harmonic peaks at the correct positions, and a spectral floor that stays close to the real one. The fact that this holds for both members of the sibling pair points to genuine structural preservation rather than accidental matching. BDG-2 Bear and BDG-2 Panther, despite more moderate VE values (57.3% and 39.7%), also align well. The synthetic PSD reproduces the overall spectral decay and main harmonic positions with only minor excess power at isolated frequencies, indicating that even when the template is not dominant, the DKL residual still learns a spectrally compatible stochastic process rather than introducing spurious structure.

Two cases warrant closer attention. In Borg Cluster Data 2011, the synthetic PSD decays faster than the real one across several covariates, most visibly on covariate 4 (VE 0.7%). With 10,000 series, averaging to compute $\bar{r}$ suppresses high-frequency variability aggressively. Individual-series noise cancels in the mean, and the DKL model then learns a smoother residual process than any individual real series exhibits, causing generated residuals to underrepresent the elevated high-frequency floor of the real corpus. Azure VM Traces 2017, despite also containing 10,000 series, does not show the same pattern because its underlying signal is genuinely broadband and aperiodic. Both real and synthetic PSDs are irregular throughout, which is exactly what a faithful reproduction of a noisy process should look like. The over-smoothing effect becomes visible only when the real signal has a sustained spectral floor that the smoothed residual fails to maintain, a condition present in Borg's more structured channels but not in Azure. BDG-2 Bull, despite similarly low per-channel VE, also avoids this issue because its 41-series mean retains considerably more high-frequency structure. The Borg case is therefore specific to very large corpora with genuine mid-to-high-frequency structure, while the dominant low-frequency features remain correctly reproduced.
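The averaging argument can be illustrated with a toy simulation. Assuming, purely for illustration, that each series carries independent unit-variance Gaussian noise, the variance left in the cross-series mean shrinks roughly as $1/N$. The corpus sizes below (41, matching Bull, and a scaled-down 2,000 standing in for a very large corpus) and all names are hypothetical; this is not the paper's residual pipeline.

```python
import random
from statistics import pvariance


def mean_series(corpus):
    """Pointwise mean across all series in the corpus."""
    n = len(corpus)
    return [sum(column) / n for column in zip(*corpus)]


rng = random.Random(0)  # fixed seed for reproducibility
T = 256                 # series length (arbitrary choice)


def make_noise_corpus(n_series):
    """n_series independent white-noise series of length T."""
    return [[rng.gauss(0.0, 1.0) for _ in range(T)]
            for _ in range(n_series)]


var_small = pvariance(mean_series(make_noise_corpus(41)))
var_large = pvariance(mean_series(make_noise_corpus(2000)))
print(f"41 series:   variance of mean = {var_small:.5f} (1/41   = {1/41:.5f})")
print(f"2000 series: variance of mean = {var_large:.5f} (1/2000 = {1/2000:.5f})")
```

Under this independence assumption, the mean of a very large corpus retains almost none of the per-series noise, which is consistent with a residual model fitted to that mean learning a smoother process than any single real series shows.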

In Subseasonal Precipitation, the synthetic PSD preserves the seasonal harmonic positions and overall spectral shape, but the real signal has unusually sharp nulls between harmonics that the synthetic does not fully reproduce\. This follows from the additive decomposition rather than from a failure of any one component\. The real signal is spectrally very pure, behaving almost like a sinusoidal comb with minimal residual energy between harmonics\. Once any stochastic residual is added, between\-peak power necessarily increases because the residual model cannot enforce destructive interference at specific frequencies\. The mismatch is therefore confined to null depth and does not affect the forecasting\-relevant low\- and mid\-frequency envelope\.
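The claim that any additive residual raises the between-harmonic floor can be made explicit with a short worked equation. It assumes, beyond what the text states, that the residual $r_t$ is zero-mean, wide-sense stationary, and uncorrelated with the periodic template $p_t$; the symbols $p_t$, $r_t$, and $S_{\cdot}(f)$ are introduced here only for illustration.

```latex
x_t = p_t + r_t
\quad\Longrightarrow\quad
S_x(f) = S_p(f) + S_r(f) \;\geq\; S_p(f),
\qquad\text{so at a null } f_0 \text{ with } S_p(f_0)\approx 0:\quad
S_x(f_0) \approx S_r(f_0) > 0 .
```

Because power spectra are non-negative and add under these assumptions, the residual can only fill the nulls between harmonics; it cannot deepen them.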

Taken together, the PSD comparisons confirm that ReGeN reliably preserves the spectral structure most relevant for forecasting across the majority of the included datasets. The cases where alignment is only partial have clear structural explanations, one linked to aggressive averaging in very large corpora and the other to an unusually pure periodic signal where any additive residual raises a spectral floor that would otherwise remain nearly empty. In both cases, the dominant spectral features are still reproduced correctly.

### C.4 Residual Ablation Under TSTR

To isolate the contribution of residual modelling, Table [7](https://arxiv.org/html/2606.05264#A3.T7) compares the standard TSTR setup against a variant in which synthetic data is generated without sampled residuals, reported only for iTransformer. Removing the DKL residual consistently degrades performance, but the magnitude of that degradation is strongly structured by how much signal variance the periodic template already explains.

Table 7: Residual ablation under train-on-synthetic, test-on-real transfer (TSTR), reported only for iTransformer. The TSTR baseline values are copied from Table [1](https://arxiv.org/html/2606.05264#S4.T1); the no-residual columns are provided to show the degradation when residual modelling is removed. Lower MSE and MAE are better.

| Dataset | TSTR MSE ↓ | TSTR MAE ↓ | Without Residual MSE ↓ | Without Residual MAE ↓ | Δ MSE | Δ MAE |
|---|---|---|---|---|---|---|
| BDG-2 Bear | 0.41 | 0.43 | 0.45 | 0.46 | +0.04 | +0.03 |
| BDG-2 Panther | 0.36 | 0.38 | 0.39 | 0.42 | +0.03 | +0.04 |
| Azure VM Traces 2017 | 0.90 | 0.45 | 1.02 | 0.49 | +0.12 | +0.04 |
| Borg Cluster Data 2011 | 0.59 | 0.50 | 0.67 | 0.55 | +0.08 | +0.05 |
| PEMS-04 | 0.37 | 0.38 | 0.40 | 0.41 | +0.03 | +0.03 |
| PEMS-08 | 0.30 | 0.29 | 0.31 | 0.30 | +0.01 | +0.01 |
| BDG-2 Bull | 0.37 | 0.40 | 0.52 | 0.59 | +0.15 | +0.19 |
| BDG-2 Hog | 0.50 | 0.48 | 0.58 | 0.58 | +0.08 | +0.10 |
| Subseasonal | 0.40 | 0.43 | 0.41 | 0.43 | +0.01 | +0.00 |
| Subseasonal Precipitation | 1.01 | 0.73 | 1.14 | 0.80 | +0.13 | +0.07 |
| Residential PV Power | 0.26 | 0.20 | 0.26 | 0.19 | +0.00 | -0.01 |
| Residential Load Power | 0.54 | 0.38 | 0.56 | 0.39 | +0.02 | +0.01 |

The clearest pattern emerges when the residual-removal degradation is related back to the per-dataset variance explained (VE) values in Table [6](https://arxiv.org/html/2606.05264#A3.T6). Datasets where the phase-aligned template captures little of the original signal variance suffer the largest collapse when residuals are removed: BDG-2 Bull (avg. VE 5.3%) degrades by +0.15 MSE and +0.19 MAE, Azure VM Traces 2017 (avg. VE 16.2%) by +0.12 MSE and +0.04 MAE, and BDG-2 Hog (avg. VE 2.8%) by +0.08 MSE and +0.10 MAE. By contrast, datasets where the template is highly explanatory are largely unaffected: PEMS-08 (avg. VE 80.3%) degrades by only +0.01 MSE and +0.01 MAE, Subseasonal (avg. VE 80.4%) by +0.01 MSE and +0.00 MAE, Residential PV Power (avg. VE 88.6%) shows no meaningful change, and Residential Load Power (avg. VE 91.2%) changes by only +0.02 MSE and +0.01 MAE.

To quantify this relationship, we compute the Spearman rank correlation (Spearman, [1904](https://arxiv.org/html/2606.05264#bib.bib61)) between per-dataset average VE and the absolute MSE degradation upon residual removal, finding $\rho=-0.8225$. The same relationship holds for MAE degradation, with $\rho=-0.8858$. These coefficients confirm that template quality is a reliable predictor of residual importance: the less structure the template captures, the more load-bearing the residual model becomes.
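These coefficients can be checked from the published tables. The sketch below takes the average VE column of Table 6 and the signed Δ columns of Table 7, assigns average ranks to ties, and computes the Pearson correlation of the ranks. The tie-handling convention and the use of signed deltas are assumptions about how the reported values were obtained; the helper names are hypothetical.

```python
import math

# Order: Bear, Panther, Azure VM, Borg, PEMS-04, PEMS-08,
#        Bull, Hog, Subseasonal, Subseasonal Prec., Res. PV, Res. Load
AVG_VE = [57.3, 39.7, 16.2, 30.8, 80.7, 80.3, 5.3, 2.8, 80.4, 32.6, 88.6, 91.2]
DELTA_MSE = [0.04, 0.03, 0.12, 0.08, 0.03, 0.01, 0.15, 0.08, 0.01, 0.13, 0.00, 0.02]
DELTA_MAE = [0.03, 0.04, 0.04, 0.05, 0.03, 0.01, 0.19, 0.10, 0.00, 0.07, -0.01, 0.01]


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


rho_mse = spearman(AVG_VE, DELTA_MSE)
rho_mae = spearman(AVG_VE, DELTA_MAE)
print(f"rho (MSE) = {rho_mse:.4f}, rho (MAE) = {rho_mae:.4f}")
```

Under these conventions the rounded table entries give values that agree with the reported $-0.8225$ and $-0.8858$ to four decimal places.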

This has a direct mechanistic interpretation\. In high\-VE settings such as PEMS, Subseasonal, and the residential PV/load benchmarks, the periodic template dominates the generative signal and the residual model acts mainly as a stochastic correction\. In low\-VE settings such as Bull, Hog, and Azure VM, the template fails to capture dominant variability—including irregular load spikes, bursty VM utilization, and aperiodic demand shifts—so the DKL residual must carry much more of the generative burden\. In these cases, removing it erases the structured variability that makes the synthetic series informative as a training source and leaves the generator with a much more rigid view of the process\. The residual model is therefore not uniformly a correction term but, conditionally, a principal generative component whose importance is determined by the degree to which the target domain exhibits stable periodic structure\.

### C.5 SCM Mixing Ablation Under TSTR

To isolate the contribution of multivariate structural coupling, Table [8](https://arxiv.org/html/2606.05264#A3.T8) compares the standard TSTR setup against a variant in which SCM-based mixing is removed during synthetic generation. Since this ablation is meaningful only for multivariate datasets, the univariate BDG-2 Bear, BDG-2 Panther, and Subseasonal Precipitation datasets are omitted. As above, we give the model name in the caption and copy the TSTR baseline values directly from Table [1](https://arxiv.org/html/2606.05264#S4.T1).

Table 8: SCM-mixing ablation under train-on-synthetic, test-on-real transfer (TSTR), reported only for iTransformer. The TSTR baseline values are copied from Table [1](https://arxiv.org/html/2606.05264#S4.T1); the w/o SCM mixing columns are provided to show the degradation when SCM-based mixing is removed. Lower MSE and MAE are better.

| Dataset | TSTR MSE ↓ | TSTR MAE ↓ | Without SCM Mixing MSE ↓ | Without SCM Mixing MAE ↓ | Δ MSE | Δ MAE |
|---|---|---|---|---|---|---|
| Azure VM Traces 2017 | 0.90 | 0.45 | 0.96 | 0.46 | +0.06 | +0.01 |
| Borg Cluster Data 2011 | 0.59 | 0.50 | 0.62 | 0.54 | +0.03 | +0.04 |
| PEMS-04 | 0.37 | 0.38 | 0.38 | 0.39 | +0.01 | +0.01 |
| PEMS-08 | 0.30 | 0.29 | 0.32 | 0.32 | +0.02 | +0.03 |
| BDG-2 Bull | 0.37 | 0.40 | 0.37 | 0.42 | +0.00 | +0.02 |
| BDG-2 Hog | 0.50 | 0.48 | 0.48 | 0.45 | -0.02 | -0.03 |
| Subseasonal | 0.40 | 0.43 | 0.47 | 0.51 | +0.07 | +0.08 |
| Residential PV Power | 0.26 | 0.20 | 0.30 | 0.29 | +0.04 | +0.09 |
| Residential Load Power | 0.54 | 0.38 | 0.56 | 0.42 | +0.02 | +0.04 |

The SCM ablation points to a modest but mostly positive effect. Most datasets worsen slightly without SCM mixing, while BDG-2 Hog improves marginally and BDG-2 Bull is nearly unchanged. The clearest explanation is not dimensionality alone but how much evidence is available per candidate edge in the consensus graph. For Hog, the combination of few series and relatively many channels makes edge selection noisy: with $\tau_{\mathrm{freq}}=0.2$, an edge can survive after appearing in only 5 of the 24 series. That is a low bar, so some spurious dependencies can enter the graph and then be propagated across channels during generation. By contrast, Borg is also higher-dimensional but has vastly more series, so each candidate edge is evaluated against much stronger evidence and removing SCM mixing hurts as expected. Bull sits between these cases, with fewer channels and 41 series, which is consistent with its near-zero SCM effect. The PEMS datasets also lie in a better-supported regime and show small but consistent gains from SCM mixing. Overall, the pattern suggests that SCM mixing helps when the consensus graph is well supported by the available data and becomes close to neutral when evidence is limited, with Hog as the clearest failure case. A stricter or adaptive consensus threshold for low-data, higher-dimensional settings is therefore a natural direction for future work.
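The "5 of the 24 series" figure follows from the strict inequality in the per-method support condition: the smallest admissible count is the first integer strictly greater than $\tau_{\mathrm{freq}}\cdot S$. A small sketch, using exact fractions to avoid floating-point edge cases (the function name `min_series_for_vote` is hypothetical):

```python
import math
from fractions import Fraction


def min_series_for_vote(n_series, tau_freq="0.2"):
    """Smallest number of series an edge must appear in so that its
    frequency strictly exceeds tau_freq (per-method vote condition)."""
    threshold = Fraction(tau_freq) * n_series
    return math.floor(threshold) + 1


hog_min = min_series_for_vote(24)       # Hog: 24 series
bull_min = min_series_for_vote(41)      # Bull: 41 series
borg_min = min_series_for_vote(10_000)  # Borg: 10,000 series
print(hog_min, bull_min, borg_min)
```

The same 20% threshold therefore demands evidence from only 5 series for Hog but from 2,001 for Borg, which illustrates the difference in per-edge support discussed above.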
