TriHead-GAN: A Generative Adversarial Network with Triple-Head Discriminator for Carbon Emission Time Series Generation
Summary
Proposes TriHead-GAN, a transformer-based GAN with a triple-head discriminator that jointly supervises distributional authenticity, cross-variable dependencies, and temporal smoothness for generating realistic carbon emission time series, outperforming baselines on multiple datasets.
# TriHead-GAN: A Generative Adversarial Network with Triple-Head Discriminator for Carbon Emission Time Series Generation
Source: [https://arxiv.org/html/2606.07569](https://arxiv.org/html/2606.07569)
Zesen Wang, Lijuan Lan\*, Yonggang Li, Chunhua Yang
###### Abstract
Accurate carbon emission monitoring is critical for climate policy and emerging regulatory mechanisms such as the EU Carbon Border Adjustment Mechanism, yet city\-level high\-frequency monitoring data remain extremely scarce, severely limiting data\-hungry deep learning models\. Time series generation is a natural remedy, but existing GAN and diffusion\-based generators often provide limited explicit supervision for the domain structure of carbon emission data: they may match marginal distributional statistics while insufficiently preserving cross\-variable correlations between CO2 and co\-emitted pollutants and meteorological factors, and tend to collapse the first\-difference statistics of atmospheric measurements, producing sequences that are smooth on average but lack the realistic step\-wise variability of the underlying signals\. We propose TriHead\-GAN, a Transformer\-based adversarial framework whose triple\-head discriminator jointly supervises three complementary aspects of the joint distribution: distributional authenticity via a Wasserstein critic, cross\-variable dependency via leakage\-free regression of the target variable, and step\-wise temporal smoothness via adjacent\-difference prediction\. The generator combines global self\-attention with local temporal convolution, per\-step noise injection, and an anti\-smoothing loss that matches first\-difference statistics\. Experiments on the self\-collected Changsha Carbon dataset, two public carbon datasets \(China, US\), and the ETTh1 benchmark show that TriHead\-GAN achieves favorable performance over mainstream baselines in the vast majority of settings, and that the resulting synthetic windows improve downstream forecasting accuracy in low\-resource carbon monitoring scenarios\.
## I Introduction
Climate change is one of the most pressing challenges facing humanity, largely driven by greenhouse gas emissions, particularly carbon dioxide \(CO2\)\. The Paris Agreement aims to hold the global temperature increase well below 2 °C above pre\-industrial levels and to pursue efforts to limit the increase to 1\.5 °C, making carbon emission reduction a central policy priority worldwide \[[27](https://arxiv.org/html/2606.07569#bib.bib1)\]\. Annual scientific assessments such as the Global Carbon Budget \[[9](https://arxiv.org/html/2606.07569#bib.bib30)\] and high\-resolution emission inventories such as EDGAR \[[3](https://arxiv.org/html/2606.07569#bib.bib31)\] document the persistent gap between emission targets and observed trends\. Accurate monitoring and prediction of carbon emissions are therefore critical for carbon trading markets, regional emission reduction policies, and carbon neutrality pathway planning\. Beyond science, carbon monitoring is now also tightly coupled with direct economic consequences: the European Union’s Carbon Border Adjustment Mechanism \(CBAM\) entered its definitive regime on 1 January 2026, under which authorised CBAM declarants must annually declare the embedded emissions of covered imports, with the corresponding certificate obligations phasing in over the subsequent years; when actual emissions cannot be adequately determined, default values are applied, raising compliance costs and therefore making verified, high\-quality emissions data economically valuable \[[6](https://arxiv.org/html/2606.07569#bib.bib2),[7](https://arxiv.org/html/2606.07569#bib.bib3)\]\. High\-quality, high\-resolution monitoring data are thus a prerequisite both for scientific carbon management and for industrial competitiveness, which in turn requires reliable predictive models built on top of such data\.
Progress in such predictive models, however, is bottlenecked by the scarcity of training data, which constitutes a concrete applied data\-mining problem: city\-level monitoring stations rarely accumulate enough high\-frequency history to train modern deep learners, yet downstream regulators and policymakers still need reliable predictive pipelines\. Recent deep learning methods \[[21](https://arxiv.org/html/2606.07569#bib.bib17),[11](https://arxiv.org/html/2606.07569#bib.bib18),[14](https://arxiv.org/html/2606.07569#bib.bib10)\] achieve strong nonlinear modeling but remain highly dependent on data scale and quality, and the publicly available carbon emission corpora \(e\.g\., Carbon Monitor \[[20](https://arxiv.org/html/2606.07569#bib.bib11)\]\) are mostly aggregated at the national level and far from sufficient for data\-intensive learners\. Compared to meteorology or power systems, which have decades of high\-frequency observations, urban\-scale carbon emission monitoring suffers from high deployment cost, sparse spatial coverage, and limited historical records\. The recent emergence of time series foundation models such as Sundial and TimeMoE \[[19](https://arxiv.org/html/2606.07569#bib.bib12),[26](https://arxiv.org/html/2606.07569#bib.bib13)\] further raises the data bar: domain\-specific foundation models for carbon emissions typically require millions of training samples, making data scarcity a fundamental obstacle to further progress\. Time series generation has therefore become a natural and pragmatic remedy\.
GAN\-based methods \(e\.g\., TimeGAN \[[29](https://arxiv.org/html/2606.07569#bib.bib5)\], TTS\-GAN \[[17](https://arxiv.org/html/2606.07569#bib.bib6)\]\) and diffusion\-based methods \(e\.g\., Diffusion\-TS \[[30](https://arxiv.org/html/2606.07569#bib.bib7)\]\) have shown promising performance on general time series generation tasks\. However, when applied to carbon emission scenarios, two practical issues remain unresolved\. First, cross\-variable consistency is lacking: CO2 exhibits strong correlations with co\-emitted pollutants \(CO, NO2, O3\) and meteorological factors \(temperature, humidity, pressure\), yet existing discriminators mainly assess overall sample authenticity, allowing generated samples to match marginal distributions while potentially distorting the correlation structure\. Second, temporal dynamics are easily distorted: atmospheric variables change slowly and continuously, but Transformer\-based generators tend to over\-smooth via global attention, and the lack of explicit constraints on local variation can also produce unrealistic abrupt changes\. Although grounded in carbon emission data, these two requirements are general properties of most multivariate time series: temporal dependency spans both global patterns and local abrupt variations, while cross\-variable dependency lies in the internal joint structure among variables rather than their marginal distributions alone\. We therefore target both properties directly and validate the framework beyond carbon data on the general\-purpose ETTh1 benchmark\.
To address the above issues, this paper proposes TriHead\-GAN, a Transformer\-based adversarial framework with a triple\-head discriminator tailored for multivariate carbon emission time series generation\. The discriminator routes the input through three parallel task\-specific 1D\-CNN branches, each feeding its own head: D\-Head evaluates distributional authenticity via the Wasserstein objective; R\-Head constrains cross\-variable dependencies through leakage\-free regression of the target variable from non\-target variables; and T\-Head enforces the temporal smoothness of atmospheric parameters by predicting adjacent\-step differences\. The generator combines self\-attention with local temporal convolution and per\-step noise injection, and is further regularized by an anti\-smoothing loss that matches the first\-difference statistics\. We evaluate TriHead\-GAN on the self\-collected Changsha Carbon dataset, two public national\-level datasets \(China Carbon, US Carbon\), and the ETTh1 benchmark, and observe consistent improvements over competitive baselines\.
Our main contributions are:
- We construct *TriHead\-GAN*, a Transformer\-based adversarial framework specifically designed for multivariate carbon emission time series generation under data scarcity\.
- We propose a *triple\-head discriminator* that simultaneously supervises distributional authenticity, cross\-variable dependency, and step\-wise temporal smoothness, providing cross\-variable consistency\-aware multi\-aspect supervision\.
- We present a systematic applied study on a self\-collected real\-world urban carbon monitoring station and three public datasets, evaluating distribution fidelity, diversity, downstream forecasting utility, and deployment cost, and showing favorable performance over six competitive baselines\.
## II Related Work
### II\-A Carbon Emission Forecasting
Carbon emission forecasting has evolved from statistical and machine learning models to end\-to\-end deep learning and foundation\-model approaches \[[14](https://arxiv.org/html/2606.07569#bib.bib10),[21](https://arxiv.org/html/2606.07569#bib.bib17),[11](https://arxiv.org/html/2606.07569#bib.bib18),[22](https://arxiv.org/html/2606.07569#bib.bib19)\]\. Despite continuous advances, existing studies predominantly rely on national\-level data from platforms such as Carbon Monitor \[[20](https://arxiv.org/html/2606.07569#bib.bib11)\], while city\-level high\-frequency monitoring data remain severely scarce\. This bottleneck motivates our focus on high\-quality time series generation for low\-resource carbon monitoring\.
### II\-B Time Series Generation
Existing approaches can be broadly divided into GAN\-based methods that learn temporal distributions through adversarial training, and diffusion\-based or likelihood\-based methods that model data distributions through denoising or latent\-variable inference\.
GAN\-based methods\. RCGAN establishes recurrent conditional generation and the TSTR paradigm \[[5](https://arxiv.org/html/2606.07569#bib.bib4)\], TimeGAN adds embedding\-space supervision \[[29](https://arxiv.org/html/2606.07569#bib.bib5)\], and TTS\-GAN introduces Transformer\-based adversarial generation \[[17](https://arxiv.org/html/2606.07569#bib.bib6)\]\. Recent studies also show that GAN\-based synthetic time series can improve downstream forecasting in data\-scarce regimes and continue to refine sequence\-feature modeling \[[2](https://arxiv.org/html/2606.07569#bib.bib20),[12](https://arxiv.org/html/2606.07569#bib.bib21)\]\.
Diffusion\-based and likelihood\-based methods\. TimeVAE learns latent representations via variational autoencoding \[[4](https://arxiv.org/html/2606.07569#bib.bib14)\], while Diffusion\-TS, PAD\-TS, and TimeDP use denoising objectives with frequency\-domain, population\-aware, and domain\-prompt designs \[[30](https://arxiv.org/html/2606.07569#bib.bib7),[18](https://arxiv.org/html/2606.07569#bib.bib15),[13](https://arxiv.org/html/2606.07569#bib.bib16)\]; newer extensions further address irregular sampling and graph\-structured spectral relations \[[8](https://arxiv.org/html/2606.07569#bib.bib22),[25](https://arxiv.org/html/2606.07569#bib.bib23)\]\.
Despite strong general performance, existing generators usually emphasise sample authenticity or denoising fidelity, with limited explicit supervision for domain\-specific cross\-variable structure and step\-wise variation\. In carbon emission scenarios, synthetic windows must preserve both the relationships between CO2 and meteorological/pollutant variables and the realistic first\-difference statistics of atmospheric measurements\. This motivates the dedicated discriminator design pursued in this work\.
## III Methodology
### III\-A Problem Formulation
Given a set of real carbon emission time series $\mathcal{X}=\{\mathbf{x}^{(i)}\}_{i=1}^{N}$, where each sample $\mathbf{x}^{(i)}\in\mathbb{R}^{T\times F}$ is a window of length $T$ with $F$ feature variables, our goal is to learn a generative model $G$ that produces synthetic windows $\hat{\mathbf{x}}=G(\mathbf{z})\in\mathbb{R}^{T\times F}$ from random noise $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$, satisfying three criteria: \(1\) distribution consistency \(the generated distribution $p_G$ should approximate $p_{\text{data}}$\); \(2\) cross\-variable consistency \(inter\-variable correlations should conform to the underlying domain structure\); and \(3\) temporal coherence \(variation rates and smoothness should match real sequences\)\.
### III\-B Overall Framework
TriHead\-GAN consists of a Transformer\-based generator and a Triple\-Head Discriminator \(Fig\. [1](https://arxiv.org/html/2606.07569#S3.F1)\)\.
Transformer\-based Generator $G$\. It takes random noise $\mathbf{z}\in\mathbb{R}^{T\times d_z}$ as input, applies linear projection, sinusoidal positional encoding, and a Transformer encoder to extract global temporal dependencies, followed by a Local Temporal Convolution module for fine\-grained local dynamics and per\-step noise injection for temporal diversity, ultimately generating synthetic segments $\hat{\mathbf{x}}\in\mathbb{R}^{T\times F}$\.
Triple\-Head Discriminator $D$\. It takes time series segments \(real or generated\) as input and routes them to three task\-specific heads, each fed by its own parallel CNN branch\. *D\-Head* \($D_d$\) serves as a Wasserstein critic on top of a dedicated 3\-layer 1D\-CNN with spectral normalization \[[23](https://arxiv.org/html/2606.07569#bib.bib25)\]\. *R\-Head* \($D_r$\) has its own 3\-layer CNN branch fed only with non\-target features, providing leakage\-free input for cross\-variable regression\. *T\-Head* \($D_t$\) employs a separate 2\-layer causal CNN branch to enforce temporal coherence by predicting adjacent\-step differences\.
### III\-C Transformer\-based Generator
The generator $G$ uses a Transformer encoder as backbone, augmented with local temporal convolution and per\-step noise injection\.
Input projection\. Each time step’s noise vector is linearly mapped from the noise dimension $d_z$ to the model dimension $d_{\text{model}}$: $\mathbf{h}_0=\text{Linear}(\mathbf{z})\in\mathbb{R}^{T\times d_{\text{model}}}$\.
Positional encoding\. Sinusoidal positional encoding \[[28](https://arxiv.org/html/2606.07569#bib.bib8)\] injects absolute temporal information:
$$\begin{aligned}\text{PE}(t,2k)&=\sin\!\left(t/10000^{2k/d_{\text{model}}}\right),\\ \text{PE}(t,2k+1)&=\cos\!\left(t/10000^{2k/d_{\text{model}}}\right).\end{aligned}\tag{1}$$
The position\-injected feature is then $\mathbf{h}_1=\text{Dropout}(\mathbf{h}_0+\text{PE})$\.
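The encoding in Eq\. \(1\) can be sketched in plain Python as follows\. This is a minimal illustration rather than the authors' implementation: the function name is hypothetical, positions are assumed to be indexed from $t=0$, and $d_{\text{model}}$ is assumed to be even\.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build a seq_len x d_model table of sinusoidal encodings as in Eq. (1)."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = [[0.0] * d_model for _ in range(seq_len)]
    for t in range(seq_len):  # assumption: positions start at t = 0
        for k in range(d_model // 2):
            angle = t / (10000 ** (2 * k / d_model))
            table[t][2 * k] = math.sin(angle)       # even channels use sine
            table[t][2 * k + 1] = math.cos(angle)   # odd channels use cosine
    return table

# T = 24 and d_model = 128 are the values reported in the implementation details.
pe = sinusoidal_positional_encoding(seq_len=24, d_model=128)
```

The table depends only on the window length and model width, so it can be computed once and added to $\mathbf{h}_0$ for every batch\.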
Transformer encoding\. The input passes through $L$ Transformer encoder layers, each comprising multi\-head self\-attention \(MHA\) and a feed\-forward network \(FFN\), so that $\mathbf{h}_{l+1}=\text{TransformerEncoderLayer}(\mathbf{h}_l)$ for $l=1,\ldots,L$\. The self\-attention mechanism enables the generator to perceive information from all time steps, capturing periodic patterns and long\-range dependencies in carbon emission data\.
Local temporal convolution\. The global attention of Transformers tends to over\-smooth outputs and lose fine\-grained local variation\. We add a 2\-layer 1D convolution module after the encoder with a residual connection:
$$\begin{aligned}\mathbf{h}_{\text{local}}&=\text{Conv1d}_2(\text{GELU}(\text{Conv1d}_1(\mathbf{h}_{L+1}))),\\ \mathbf{h}'&=\mathbf{h}_{L+1}+\mathbf{h}_{\text{local}},\end{aligned}\tag{2}$$
where Conv1d uses kernel size 3 with padding to preserve sequence length\.
Per\-step noise injection\. During training, learnable\-scale random noise is injected at each time step to enhance temporal diversity:
$$\mathbf{h}''=\mathbf{h}'+s\cdot\boldsymbol{\epsilon},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I}),\quad s\in\mathbb{R}^{d_{\text{model}}},\tag{3}$$
where $s$ is initialized to $0.05$ and the noise injection is disabled during inference\.
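As a rough illustration of Eq\. \(3\), the sketch below adds per\-channel scaled Gaussian noise to each time step and passes features through unchanged at inference\. The function name and the tiny dimensions are hypothetical, and the scale is held in a plain list here although it is a learnable parameter in the model\.

```python
import random

def inject_step_noise(h, scale, training, rng):
    """Add per-channel scaled Gaussian noise to every time step of h (T x d_model)."""
    if not training:
        # Noise is switched off at inference, so features pass through unchanged.
        return [row[:] for row in h]
    return [[v + s * rng.gauss(0.0, 1.0) for v, s in zip(row, scale)] for row in h]

rng = random.Random(0)
d_model = 4                     # tiny width, for illustration only
scale = [0.05] * d_model        # s, initialised to 0.05 (learnable in the real model)
h = [[0.0] * d_model for _ in range(3)]
noisy = inject_step_noise(h, scale, training=True, rng=rng)
clean = inject_step_noise(h, scale, training=False, rng=rng)
```

Because a fresh noise sample is drawn at every step, two windows generated from the same latent input still differ in their fine\-grained variation during training\.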
Output projection\. Features are mapped back to the data dimension with a Tanh activation that constrains outputs to $[-1,1]$: $\hat{\mathbf{x}}=\tanh(\text{Linear}(\mathbf{h}''))\in\mathbb{R}^{T\times F}$\.
Figure 1: Overview of TriHead\-GAN\. The Transformer generator emits time series that the discriminator processes through three parallel CNN branches, each feeding its own head: D\-Head \(WGAN authenticity\), R\-Head \(cross\-variable regression with leakage\-free input\), and T\-Head \(causal temporal coherence\)\.
### III\-D Triple\-Head Discriminator
The Triple\-Head Discriminator is the core innovation of TriHead\-GAN, evaluating time series quality from three complementary perspectives\. Conceptually, the three heads target the marginal, conditional, and transition components of the joint sequence distribution, respectively\.
Three parallel CNN branches\. The discriminator routes the input $\mathbf{x}$ through three task\-specific 1D\-CNN branches that operate in parallel, each feeding its own head:
$$\mathbf{h}=\text{CNN}_d(\mathbf{x}),\;\;\mathbf{g}=\text{CNN}_r(\tilde{\mathbf{x}}),\;\;\mathbf{u}=\text{CNN}_\tau(\mathbf{x}),\tag{4}$$
where $\text{CNN}_d$ is a 3\-layer spectrally\-normalized 1D\-CNN feeding D\-Head, $\text{CNN}_r$ is a separate 3\-layer 1D\-CNN whose *leakage\-free* input $\tilde{\mathbf{x}}=\mathbf{x}_{:,1:F-1}$ excludes the target column, and $\text{CNN}_\tau$ is a 2\-layer causal 1D\-CNN feeding T\-Head\. Each block stacks convolution, LeakyReLU, and LayerNorm\. Spectral normalization \[[23](https://arxiv.org/html/2606.07569#bib.bib25)\] on $\text{CNN}_d$ enforces Lipschitz continuity and complements gradient penalty for stable WGAN training, while the causal padding of $\text{CNN}_\tau$ prevents future information from leaking into the temporal\-difference prediction\.
D\-Head: authenticity evaluation\. Following WGAN\-GP \[[10](https://arxiv.org/html/2606.07569#bib.bib9)\], D\-Head maps its branch features to a scalar critic score: $s_d=D_d(\mathbf{h})=\text{Linear}(\text{Flatten}(\mathbf{h}))\in\mathbb{R}$\.
R\-Head: structural consistency\. R\-Head imposes cross\-variable structural regularities\. To prevent the target from leaking into its own prediction, $\text{CNN}_r$ receives only the non\-target features $\tilde{\mathbf{x}}=\mathbf{x}_{:,1:F-1}$\. R\-Head then regresses the target variable at each step:
$$\hat{y}_t^{\text{reg}}=\text{Linear}(\mathbf{g}_t),\;\;\mathcal{L}_{\text{reg}}=\frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t^{\text{reg}}-x_{t,F}\right)^{2}.\tag{5}$$
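The leakage\-free split and the loss in Eq\. \(5\) can be illustrated with a small pure\-Python sketch\. The helper names are hypothetical, and `toy_regressor` is only a per\-step linear stand\-in for the $\text{CNN}_r$ branch plus linear head, not the architecture used in the paper\.

```python
def split_leakage_free(window):
    """Split a T x F window into the first F-1 columns (inputs) and the last column (target)."""
    inputs = [row[:-1] for row in window]
    target = [row[-1] for row in window]
    return inputs, target

def regression_loss(predictions, target):
    """Mean squared error over the T steps, as in Eq. (5)."""
    return sum((p - y) ** 2 for p, y in zip(predictions, target)) / len(target)

def toy_regressor(inputs, weights, bias):
    """Hypothetical stand-in for CNN_r + Linear: a per-step linear map of non-target features."""
    return [sum(w * v for w, v in zip(weights, row)) + bias for row in inputs]

# Toy window with F = 3; the last column plays the role of the target (e.g. CO2).
window = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
inputs, target = split_leakage_free(window)
pred = toy_regressor(inputs, weights=[0.0, 1.0], bias=0.1)
loss = regression_loss(pred, target)
```

Since the regressor never sees the target column, a low loss is only achievable when the non\-target variables actually carry information about the target, which is the cross\-variable structure R\-Head is meant to supervise\.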
T\-Head: temporal coherence\. T\-Head consumes the causal\-CNN features $\mathbf{u}$ from its own branch to predict adjacent\-step differences:
$$\hat{\Delta}_t=D_t(\mathbf{u}_t)\in\mathbb{R}^{F},\;\;\mathcal{L}_{\text{temp}}=\frac{1}{(T-1)F}\sum_{t=1}^{T-1}\left\|\hat{\Delta}_t-(\mathbf{x}_{t+1}-\mathbf{x}_t)\right\|^{2}.\tag{6}$$
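A minimal sketch of the quantities behind Eq\. \(6\): the adjacent\-difference targets, the normalised squared error, and a single\-channel causal convolution that pads only on the left\. All function names are hypothetical, and the one\-channel convolution is a simplification of the 2\-layer causal CNN branch\.

```python
def adjacent_differences(window):
    """Targets x_{t+1} - x_t for a T x F window; the result has shape (T-1) x F."""
    return [[b - a for a, b in zip(window[t], window[t + 1])]
            for t in range(len(window) - 1)]

def temporal_loss(pred_diffs, window):
    """Squared error between predicted and true differences, averaged over (T-1)*F entries."""
    target = adjacent_differences(window)
    steps, feats = len(target), len(target[0])
    total = sum((p - d) ** 2
                for prow, drow in zip(pred_diffs, target)
                for p, d in zip(prow, drow))
    return total / (steps * feats)

def causal_conv1d(series, kernel):
    """Single-channel causal convolution: output[t] depends only on series[0..t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(series)   # left padding only, so no future values leak in
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(series))]

window = [[0.0, 1.0], [0.5, 1.0], [1.5, 0.0]]
perfect = temporal_loss(adjacent_differences(window), window)
zero_guess = temporal_loss([[0.0, 0.0], [0.0, 0.0]], window)
```

Changing a later value of the input series leaves all earlier outputs of `causal_conv1d` untouched, which is the property that keeps the prediction of $\mathbf{x}_{t+1}-\mathbf{x}_t$ from seeing $\mathbf{x}_{t+1}$\.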
### III\-E Training Objective
TriHead\-GAN is trained under WGAN\-GP with the auxiliary losses introduced above and linearly warmed\-up weights\.
Anti\-smoothing loss\. To counter over\-smoothed Transformer outputs, we match both the mean and standard deviation of the per\-feature absolute first\-difference distribution:
$$\begin{aligned}\mathcal{L}_{\text{smooth}}&=\text{MSE}\!\left(\mu_{\hat{\Delta}},\,\mu_{\Delta}\right)+\text{MSE}\!\left(\sigma_{\hat{\Delta}},\,\sigma_{\Delta}\right),\\ \mu_{\Delta}&=\frac{1}{(T-1)B}\sum_{b,t}\left|\mathbf{x}_{t+1}^{(b)}-\mathbf{x}_{t}^{(b)}\right|,\\ \sigma_{\Delta}^{2}&=\frac{1}{(T-1)B}\sum_{b,t}\left(\left|\mathbf{x}_{t+1}^{(b)}-\mathbf{x}_{t}^{(b)}\right|-\mu_{\Delta}\right)^{2},\end{aligned}\tag{7}$$
with $\mu_{\hat{\Delta}}$ and $\sigma_{\hat{\Delta}}$ defined analogously on the generated batch\. This complements T\-Head by matching temporal variation at the statistical level\.
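To make Eq\. \(7\) concrete, the sketch below computes the per\-feature mean and standard deviation of absolute first differences over a batch and combines them into the anti\-smoothing loss\. Function names are hypothetical; batches are nested lists of shape $B\times T\times F$, and the variance is the population form written in Eq\. \(7\)\.

```python
import math

def abs_diff_stats(batch):
    """Per-feature mean and std of |x_{t+1} - x_t| over a B x T x F batch."""
    n_feats = len(batch[0][0])
    mus, sigmas = [], []
    for f in range(n_feats):
        diffs = [abs(w[t + 1][f] - w[t][f]) for w in batch for t in range(len(w) - 1)]
        mu = sum(diffs) / len(diffs)
        var = sum((d - mu) ** 2 for d in diffs) / len(diffs)  # population variance, as in Eq. (7)
        mus.append(mu)
        sigmas.append(math.sqrt(var))
    return mus, sigmas

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def anti_smoothing_loss(fake_batch, real_batch):
    mu_f, sd_f = abs_diff_stats(fake_batch)
    mu_r, sd_r = abs_diff_stats(real_batch)
    return mse(mu_f, mu_r) + mse(sd_f, sd_r)

# Toy batches with B = 1, T = 4, F = 1: an oscillating "real" window and a flat "generated" one.
real = [[[0.0], [1.0], [0.0], [1.0]]]
flat = [[[0.5], [0.5], [0.5], [0.5]]]
penalty = anti_smoothing_loss(flat, real)
```

In this toy case the flat window has zero step\-wise variation while the real one changes by 1 at every step, so the loss penalises exactly the kind of over\-smoothed output the term is designed to discourage\.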
Discriminator loss:
$$\begin{aligned}\mathcal{L}_{D}={}&D_d(\hat{\mathbf{x}})-D_d(\mathbf{x})+\lambda_{\text{gp}}\,\text{GP}\\ &+\alpha_w\mathcal{L}_{\text{reg}}^{\text{real}}+\beta_w\mathcal{L}_{\text{reg}}^{\text{fake}}+\delta_w\left(\mathcal{L}_{\text{temp}}^{\text{real}}+\mathcal{L}_{\text{temp}}^{\text{fake}}\right).\end{aligned}\tag{8}$$
Generator loss:
$$\mathcal{L}_{G}=-D_d(\hat{\mathbf{x}})+\gamma_w\mathcal{L}_{\text{reg}}^{\text{fake}}+\delta_w\mathcal{L}_{\text{temp}}^{\text{fake}}+\eta_w\mathcal{L}_{\text{smooth}},\tag{9}$$
where warmup weights follow $w_i(e)=w_i^{\ast}\min(1,e/E_w)$ for $w_i\in\{\alpha,\beta,\gamma,\delta,\eta\}$, with $e$ the current epoch, $E_w$ the warmup length, and $w_i^{\ast}$ the target weight\. Each minibatch uses $n_{\text{critic}}$ critic updates, one generator update, and an EMA update on $G$; $G_{\text{EMA}}$ is used at inference\.
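The way Eqs\. \(8\) and \(9\) combine with the linear warmup can be sketched with scalar placeholders\. This is an illustrative sketch only: the function names are hypothetical, every loss term is passed in as an already\-computed number, the target weights and the 300\-epoch warmup are the values reported in the implementation details, and the EMA rule shown is the common exponential form, which the text does not spell out\.

```python
TARGET_WEIGHTS = {"alpha": 3.0, "beta": 1.0, "gamma": 3.0, "delta": 1.0, "eta": 2.0}

def warmup_weight(target, epoch, warmup_epochs):
    """w(e) = w* . min(1, e / E_w): linear ramp that saturates at the target weight."""
    return target * min(1.0, epoch / warmup_epochs)

def weights_at(epoch, warmup_epochs=300):
    return {name: warmup_weight(w, epoch, warmup_epochs) for name, w in TARGET_WEIGHTS.items()}

def discriminator_loss(critic_fake, critic_real, gp, reg_real, reg_fake,
                       temp_real, temp_fake, w, lambda_gp=10.0):
    """Eq. (8), with every term supplied as a scalar placeholder."""
    return (critic_fake - critic_real + lambda_gp * gp
            + w["alpha"] * reg_real + w["beta"] * reg_fake
            + w["delta"] * (temp_real + temp_fake))

def generator_loss(critic_fake, reg_fake, temp_fake, smooth, w):
    """Eq. (9), with every term supplied as a scalar placeholder."""
    return (-critic_fake + w["gamma"] * reg_fake
            + w["delta"] * temp_fake + w["eta"] * smooth)

def ema_update(ema_params, params, decay=0.999):
    """Assumed standard EMA rule: ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

w_start, w_mid, w_full = weights_at(0), weights_at(150), weights_at(600)
```

At epoch 0 all auxiliary weights are zero, so both networks begin with a plain WGAN\-GP objective and the regression, temporal, and anti\-smoothing terms are phased in gradually\.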
### III\-F Necessity of the Three Heads
We now show why the three heads are jointly necessary and not redundant, both from a statistical and an empirical\-gradient perspective\.
Statistical view\. The heads supervise three different functionals of the joint window distribution $p(\mathbf{x}_{1:T})$\. D\-Head approximates the Lipschitz dual of the Wasserstein\-1 distance between $p_{\text{data}}$ and $p_G$, sensitive to the *marginal* of full windows but less so to conditional and transition moments\. R\-Head, predicting the target $x_F$ from non\-target features, adds a discrepancy on the *conditional* $p(x_F\mid x_1,\dots,x_{F-1})$, which a generator with the right marginal but distorted cross\-variable structure cannot match\. T\-Head, predicting the one\-step difference $x_{t+1}-x_t$, supervises the lag\-one *transition kernel* $p(\mathbf{x}_{t+1}\mid\mathbf{x}_{\leq t})$ tied to local smoothness; the anti\-smoothing loss complements it by matching the first two moments of the absolute first\-difference distribution\. These three functionals \(marginal, conditional, transition\) emphasize different aspects of the joint distribution, indicating complementary rather than overlapping supervision\.
Empirical gradient orthogonality\. To verify that the corresponding training signals are also non\-redundant in parameter space, for each of 50 randomly sampled minibatches on Changsha we compute the generator gradient of the adversarial, regression, and temporal losses and report pairwise cosine similarities: the means are $-0.075_{\pm 0.217}$ \(Adv\. vs\. Reg\.\), $0.024_{\pm 0.141}$ \(Adv\. vs\. Temp\.\), and $0.066_{\pm 0.133}$ \(Reg\. vs\. Temp\.\), all within $\pm 0.08$ of zero with batch\-level values bounded by $\pm 0.4$\. Cosine similarities clustered near zero are an empirical signature of near\-orthogonal gradient directions, indicating that the three heads supply complementary rather than redundant supervisory directions\. Combined with the ablation results in Sec\. [IV\-D](https://arxiv.org/html/2606.07569#S4.SS4), R\-Head and T\-Head therefore act as *complementary regularizers* that constrain views of the joint distribution the adversarial loss may not fully capture\.
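The gradient comparison described above reduces to cosine similarities between flattened gradient vectors, summarised across minibatches\. A minimal sketch follows; the function names are hypothetical, the gradients are assumed to be already flattened into plain lists, and the use of the population standard deviation for the $\pm$ spread is an assumption, since the text does not state which estimator was used\.

```python
import math
import statistics

def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0   # convention for this sketch: a zero gradient counts as orthogonal
    return dot / (n1 * n2)

def mean_and_std(values):
    """Summary across minibatches (population standard deviation is an assumption here)."""
    return statistics.mean(values), statistics.pstdev(values)

# Toy gradients: orthogonal, aligned, and opposed directions.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
aligned = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposed = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
summary = mean_and_std([1.0, 3.0])
```

Values near 0 indicate that two losses push the generator parameters in nearly unrelated directions, whereas values near 1 would suggest that one loss is largely redundant with the other\.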
## IV Experiments
TABLE I: Experimental datasets\. All use sliding window $T=24$ with stride 12\.
TABLE II: Main generation quality \(lower is better\)\. Per dataset, best and second\-best mean values are in bold and underlined; standard deviations over five seeds are reported alongside\.
### IV\-A Experimental Setup
Datasets\. We evaluate TriHead\-GAN on four multivariate datasets summarized in Table [I](https://arxiv.org/html/2606.07569#S4.T1): our self\-collected Changsha Carbon urban monitoring data, two national\-level carbon emission datasets \(China Carbon and US Carbon from Carbon Monitor \[[20](https://arxiv.org/html/2606.07569#bib.bib11)\]\), and the public ETTh1 benchmark\. The Changsha Carbon dataset is collected from a real\-world urban atmospheric monitoring station from late June 2025 to mid\-January 2026 at a 15\-minute sampling cadence, with stable sensor operation and negligible missing values across the recording period\. It reflects a common low\-resource setting in city\-level carbon monitoring: high\-frequency CO2 observations are expensive to deploy and maintain, and the available history is much shorter than that of mature meteorological benchmarks\. It contains 12 raw variables: PM2\.5, PM10, CO, NO2, SO2, O3, CO2, wind speed, wind direction, temperature, humidity, and atmospheric pressure\. With CO2 as the target variable, we retain the 5 variables whose Pearson correlation with CO2 exceeds $|r|=0.3$, namely CO, PM2\.5, PM10, wind speed, and CO2\. The remaining datasets contain 7 variables\. In every dataset the R\-Head target is the canonical last column, CO2 for the three carbon datasets and oil temperature \(OT\) for ETTh1, following each source’s standard target convention\. All datasets use sliding windows of length $T=24$ and stride 12, and preprocessing includes missing\-value interpolation, outlier repair, MinMax normalization to $[0,1]$, and a linear mapping to $[-1,1]$ during training to match the generator output\.
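The windowing and scaling steps described above can be sketched in a few lines of plain Python\. The helper names are hypothetical, interpolation and outlier repair are omitted, and the toy series is synthetic rather than drawn from any of the datasets\.

```python
def minmax_scale(column):
    """Scale one variable to [0, 1]; a constant column maps to 0.0 in this sketch."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]

def to_generator_range(value):
    """Linear map from [0, 1] to [-1, 1], matching the Tanh output range of the generator."""
    return 2.0 * value - 1.0

def sliding_windows(rows, length=24, stride=12):
    """Cut a sequence of time steps into overlapping windows of the given length and stride."""
    return [rows[s:s + length] for s in range(0, len(rows) - length + 1, stride)]

# Synthetic single-variable series with 60 time steps, for illustration only.
rows = [[float(i)] for i in range(60)]
windows = sliding_windows(rows)
scaled = [to_generator_range(v) for v in minmax_scale([2.0, 4.0, 6.0])]
```

With a stride of half the window length, consecutive windows share 12 time steps, so a series with 60 steps yields windows starting at steps 0, 12, 24, and 36\.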
Baselines\. We compare against six representative time series generators, organised into two families\. *GAN\-based methods*: TimeGAN \[[29](https://arxiv.org/html/2606.07569#bib.bib5)\], an embedding\-space GAN with a supervised reconstruction loss; RCGAN \[[5](https://arxiv.org/html/2606.07569#bib.bib4)\], a recurrent conditional GAN that established the TSTR evaluation paradigm; and TTS\-GAN \[[17](https://arxiv.org/html/2606.07569#bib.bib6)\], a Transformer\-based GAN tailored to time series\. *Diffusion\-based methods*: Diffusion\-TS \[[30](https://arxiv.org/html/2606.07569#bib.bib7)\], a denoising diffusion model with a Fourier\-domain loss; PAD\-TS \[[18](https://arxiv.org/html/2606.07569#bib.bib15)\], a population\-aware diffusion model that captures subpopulation heterogeneity; and TimeDP \[[13](https://arxiv.org/html/2606.07569#bib.bib16)\], a multi\-domain diffusion model with domain prompts\.
Metrics\. Generation quality is measured by five complementary metrics, all in lower\-is\-better form\. *Discriminative score* \(DS\) is the deviation from $0.5$ of a post\-hoc real/fake classifier’s error rate; values near $0$ mean real and synthetic windows are indistinguishable\. *Predictive score* \(PS\) is the MAE of a one\-step\-ahead forecaster trained on synthetic and tested on real data \(predicting the final step of each window from the preceding steps\), probing whether the conditional structure needed for downstream prediction is preserved\. *Maximum mean discrepancy* \(MMD\) is the kernel distance between real and generated samples, sensitive to higher\-order moments\. *Fréchet distance* \(FID\) is the Wasserstein\-2 distance between Gaussian fits to real and generated features, summarising overall distribution alignment\. *Autocorrelation difference* \(ACF\) is the absolute gap between real and generated autocorrelation curves over multiple lags, reflecting temporal\-dependence fidelity\. All five metrics are computed by a fixed external protocol applied identically to every method and independent of the TriHead\-GAN discriminator: DS trains a separate two\-layer LSTM classifier and PS a two\-layer GRU forecaster, while MMD and FID are computed in the raw flattened\-window feature space \(FID being the Fréchet distance between Gaussian fits, without any learned embedding\)\.
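Two of these metrics are simple enough to sketch directly\. The code below is a hedged illustration, not the evaluation protocol used in the paper: the function names are hypothetical, the classifier error rate is assumed to be given, and averaging the absolute autocorrelation gaps over lags $1,\dots,\text{max\_lag}$ is an assumed aggregation\.

```python
def discriminative_score(error_rate):
    """Deviation of a post-hoc real/fake classifier's error rate from chance level 0.5."""
    return abs(0.5 - error_rate)

def autocorrelation(series, lag):
    """Sample autocorrelation at one lag; a constant series returns 0.0 in this sketch."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    if var == 0.0:
        return 0.0
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var

def acf_difference(real, fake, max_lag):
    """Mean absolute gap between two autocorrelation curves (aggregation is an assumption)."""
    gaps = [abs(autocorrelation(real, k) - autocorrelation(fake, k))
            for k in range(1, max_lag + 1)]
    return sum(gaps) / len(gaps)

# Toy univariate series: an oscillating "real" signal and a constant "generated" one.
real = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
fake = [1.0] * 6
gap = acf_difference(real, fake, max_lag=2)
```

A generated series that reproduces the temporal dependence of the real one yields a gap of 0, while an over\-smoothed or constant series is penalised at every lag where the real autocorrelation is non\-zero\.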
Implementation details\. The generator uses $d_z=64$, $d_{\text{model}}=128$, 8 attention heads, 4 Transformer encoder layers, and feed\-forward dimension 256\. The discriminator uses 128 hidden channels, a 3\-layer spectrally\-normalized CNN for the D\-Head branch, a separate 3\-layer CNN for the R\-Head branch, and a 2\-layer causal CNN for the T\-Head branch\. We train for 1000 epochs with batch size 32, Adam \($\beta_1=0.5$, $\beta_2=0.9$\), learning rate $10^{-4}$ with cosine annealing, $n_{\text{critic}}=5$, gradient clipping at $1.0$, EMA decay $0.999$, and a 300\-epoch auxiliary\-loss warmup\. Loss weights are $\lambda_{\text{gp}}=10$, $\alpha=3$, $\beta=1$, $\gamma=3$, $\delta=1$, and $\eta=2$\. All baselines are paper\-aligned reimplementations developed with reference to the original papers and their open\-source repositories, where hyperparameters are configured following the recommended values in the original papers\. To control variance, the main comparison is repeated over five seeds \(42, 43, 44, 45, 46\), while all other experiments \(ablation, TSTR, diversity, convergence, sensitivity\) are repeated over three seeds \(42, 43, 44\); reported numbers are the seed\-averaged values\. All models are trained on a single NVIDIA RTX 4060Ti GPU\.
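For reference, the reported hyperparameters can be collected in a single configuration object\. The class and field names below are hypothetical; only the values are taken from the text above\.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TriHeadGANConfig:
    """Hypothetical container for the hyperparameters reported in the text."""
    # Generator
    noise_dim: int = 64            # d_z
    d_model: int = 128
    n_heads: int = 8
    n_encoder_layers: int = 4
    ffn_dim: int = 256
    # Discriminator
    disc_hidden_channels: int = 128
    # Data
    window_length: int = 24
    stride: int = 12
    # Optimisation
    epochs: int = 1000
    batch_size: int = 32
    adam_betas: tuple = (0.5, 0.9)
    learning_rate: float = 1e-4    # with cosine annealing
    n_critic: int = 5
    grad_clip: float = 1.0
    ema_decay: float = 0.999
    warmup_epochs: int = 300
    # Loss weights
    lambda_gp: float = 10.0
    alpha: float = 3.0
    beta: float = 1.0
    gamma: float = 3.0
    delta: float = 1.0
    eta: float = 2.0
    # Seeds
    main_seeds: tuple = (42, 43, 44, 45, 46)
    other_seeds: tuple = (42, 43, 44)

cfg = TriHeadGANConfig()
head_dim = cfg.d_model // cfg.n_heads   # per-head width implied by the reported sizes
```

Keeping the values in one frozen object makes it harder for an ablation or sensitivity run to silently diverge from the main configuration\.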
### IV\-B Main Results
As shown in Table [II](https://arxiv.org/html/2606.07569#S4.T2), TriHead\-GAN attains the best \(or tied\-best\) score in 18 of the 20 \(dataset, metric\) cells and is uniformly first on MMD, FID, and DS across all four datasets\. We attribute this advantage to two design choices: the triple\-head discriminator spreads adversarial pressure across distribution, cross\-variable, and temporal aspects, helping the generator balance multiple statistics more effectively, and the anti\-smoothing loss penalises high\-frequency collapse, keeping short\-range power and autocorrelation aligned with the real signal\.
On the joint\-distribution metrics MMD and FID, TriHead\-GAN improves clearly over the best baseline on every dataset, and it ties with PAD\-TS for the best ACF on US\. The two cells in which TriHead\-GAN is not first are Changsha PS \($0.029$ vs\. TimeDP $0.025$\) and Changsha ACF \($0.031$ vs\. TimeDP $0.027$\); both are univariate marginal statistics on which the strongest diffusion baselines \(TimeDP, PAD\-TS\) are competitive\. These gains stem from the proposed design rather than from any single tuned statistic: the regression and temporal heads add explicit cross\-variable and step\-wise supervision on top of the Wasserstein critic, and the anti\-smoothing loss preserves the first\-difference statistics, so the generator matches the joint structure that marginal\-oriented baselines model less directly\.
The two model families display complementary strengths\. Diffusion\- and likelihood\-based generators are trained with a denoising objective that directly fits fine\-grained per\-variable distributional detail, which makes them naturally strong on the univariate marginal statistics: configured with the recommended settings from their original papers and open\-source code, TimeDP attains the best PS and ACF on Changsha and the lowest DS among baselines on Changsha and ETTh1, and PAD\-TS ties our model for the best ACF on US\. Adversarial training instead optimises a sample\-level discrimination signal that is better suited to capturing the joint structure of the data, which TriHead\-GAN reinforces with explicit cross\-variable and step\-wise supervision through its R\- and T\-heads\. Consequently, TriHead\-GAN keeps the lead on the joint\-distribution metrics, ranking first on DS, MMD, and FID on every dataset, while remaining comparable to the strongest diffusion baselines on the marginal statistics\.
### IV-C Significance Analysis
TABLE III: Number of metrics (out of 5) on which TriHead-GAN is significantly lower than each baseline ($p < 0.05$, paired one-sided $t$-test).

To confirm that the gains in Table [II](https://arxiv.org/html/2606.07569#S4.T2) are not driven by seed variance, we run a paired one-sided $t$-test for every (dataset, baseline, metric) cell on the same five-seed runs (lower is better, $\Delta = \overline{\text{TriHead-GAN}} - \overline{\text{baseline}}$, significant when $\Delta < 0$ and $p < 0.05$). Table [III](https://arxiv.org/html/2606.07569#S4.T3) summarises the significant counts per (dataset, baseline) pair.
Across all $4 \times 6 \times 5 = 120$ comparisons, TriHead-GAN is significantly better at $p < 0.05$ in 105 cells. Of the 15 non-significant cases, 14 fall on the univariate marginal metrics (eight on PS and six on ACF) and one on MMD (China vs. TimeGAN). They arise mainly against the strongest baselines (TTS-GAN and the diffusion models PAD-TS, TimeDP, Diffusion-TS), where the absolute gaps on these marginal statistics are small and TriHead-GAN is statistically comparable rather than worse. This split is by design rather than a deficiency: the triple-head discriminator deliberately distributes supervision across the marginal, conditional, and transition components of the joint distribution instead of over-optimising any single univariate marginal, so diffusion baselines tuned toward marginal denoising fidelity are expected to stay competitive on the per-variable statistics (PS, ACF), whereas TriHead-GAN should, and does, dominate the metrics that reflect joint and cross-variable structure. On the joint-distribution metrics the advantage is nearly uniform: TriHead-GAN is significantly better on DS and FID in all 24 comparisons and on MMD in 23 of 24. The improvements reported in Table [II](https://arxiv.org/html/2606.07569#S4.T2) are therefore consistent with significant gains on the large majority of metric–baseline pairs. These are paired one-sided tests over five seeds without multiple-comparison correction, so we read them as evidence that the improvements are stable across seeds rather than as strong stand-alone statistical claims.
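The per-cell test described in this section can be sketched as follows. This is an illustrative stand-in rather than the authors' script: instead of computing a p-value, it compares the paired $t$-statistic with the tabulated one-sided 5% critical value for 4 degrees of freedom (five seeds), which gives the same decision. The function names and the demo scores are hypothetical.

```python
import math
from statistics import mean, stdev

# One-sided critical value of Student's t at the 5% level with df = 4 (five seeds).
T_CRIT_DF4 = 2.132


def paired_t_statistic(ours, baseline):
    """Paired t-statistic of the per-seed differences (ours - baseline)."""
    diffs = [o - b for o, b in zip(ours, baseline)]
    sd = stdev(diffs)
    m = mean(diffs)
    if sd == 0.0:
        # Degenerate case: identical difference on every seed.
        return math.copysign(math.inf, m) if m != 0 else 0.0
    return m / (sd / math.sqrt(len(diffs)))


def significantly_lower(ours, baseline, t_crit=T_CRIT_DF4):
    """True when 'ours' is lower than 'baseline' at p < 0.05 (one-sided, five seeds)."""
    if len(ours) != 5 or len(baseline) != 5:
        raise ValueError("the critical value above assumes exactly five paired seeds")
    return paired_t_statistic(ours, baseline) < -t_crit


# Hypothetical per-seed scores for one (dataset, baseline, metric) cell.
ours_demo = [0.020, 0.022, 0.019, 0.021, 0.020]
base_demo = [0.030, 0.031, 0.028, 0.033, 0.029]
demo_significant = significantly_lower(ours_demo, base_demo)

# Number of cells tested: datasets x baselines x metrics.
n_comparisons = 4 * 6 * 5
```

Because the test is paired, only the per-seed differences matter, so a small but consistent gap can be significant even when the two means are close.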
### IV-D Ablation Study
To verify the contribution of each component, we design four ablation variants: (1) *w/o T*: removing the temporal coherence head ($\delta = 0$); (2) *w/o R*: removing the regression consistency head ($\alpha = \beta = \gamma = 0$); (3) *w/o AS*: removing the anti-smoothing loss ($\eta = 0$); (4) *MLP gen.*: replacing the Transformer + LocalConv generator with an MLP.
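The four variants amount to overrides of the default configuration. The mapping below is a hypothetical sketch of that bookkeeping (the dictionary keys and the `generator` field are illustrative names, not taken from the paper); the weight values are the defaults reported in the implementation details.

```python
# Default loss weights from the implementation details, plus a generator tag.
DEFAULT_SETUP = {
    "alpha": 3.0, "beta": 1.0, "gamma": 3.0,  # regression head (R-Head)
    "delta": 1.0,                              # temporal head (T-Head)
    "eta": 2.0,                                # anti-smoothing loss
    "generator": "transformer+localconv",
}

# Each ablation zeroes the weights of one component or swaps the generator.
ABLATIONS = {
    "full": {},
    "w/o T": {"delta": 0.0},
    "w/o R": {"alpha": 0.0, "beta": 0.0, "gamma": 0.0},
    "w/o AS": {"eta": 0.0},
    "MLP gen.": {"generator": "mlp"},
}


def setup_for(variant):
    """Return the full setup for a variant: defaults overlaid with its overrides."""
    return {**DEFAULT_SETUP, **ABLATIONS[variant]}


without_r = setup_for("w/o R")
```

Each variant changes exactly one component, so any metric shift can be traced to that component alone.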
Table [IV](https://arxiv.org/html/2606.07569#S4.T4) reveals four findings. *(i) R-Head is the most stable structural contributor:* removing it worsens all five metrics on Changsha and China and most on US and ETTh1, with MMD rising 49%/31%/83% on Changsha/China/ETTh1, confirming the value of explicit cross-variable supervision. *(ii) The anti-smoothing loss is an important temporal regularizer:* removing it inflates Changsha MMD by 107% and FID by 125% and degrades ACF/FID on the other datasets, matching its role of preventing local-variation collapse. *(iii) T-Head gives selective gains*, mainly on PS and ACF (removing it degrades Changsha ACF by 39%, Changsha PS by 21%, and China FID by 39%), while DS is occasionally lower without it, as the critic then faces fewer auxiliary constraints. *(iv) The Transformer generator helps more as variables grow:* an MLP wins three Changsha cells ($F = 5$) and stays competitive on a few ACF cells of the 7-variable datasets, but loses most cells overall, with MMD increasing by 11%–170%. The full TriHead-GAN is thus the most balanced configuration, and the per-component degradation matches the gradient orthogonality in Sec. [III-F](https://arxiv.org/html/2606.07569#S3.SS6): each head covers a distinct aspect, so each ablation hurts a recognisable subset of metrics.
TABLE IV: Ablation study (lower is better). Best and second-best per metric in bold and underlined. w/o T / w/o R / w/o AS drop the temporal, regression, and anti-smoothing components; MLP gen. replaces the Transformer generator with an MLP.
### IV-E Downstream Prediction Enhancement
To validate the practical utility of generated data beyond distributional similarity, we conduct Train-on-Synthetic, Test-on-Real (TSTR) experiments \[[5](https://arxiv.org/html/2606.07569#bib.bib4)\], an evaluation paradigm subsequently adopted and extended by PATE-GAN \[[15](https://arxiv.org/html/2606.07569#bib.bib28)\] and the sample-level audit framework of Alaa et al. \[[1](https://arxiv.org/html/2606.07569#bib.bib29)\]. The Real+Syn protocol additionally simulates the practical use case where a limited real monitoring history is augmented with synthetic windows before training a downstream forecaster. We evaluate three settings: (1) *TRTR* (train and test on real data); (2) *TSTR* (train on synthetic data, test on real); and (3) *Real+Syn* (train on a size-matched 1:1 mixture, test on real). For Real+Syn, the total number of training windows is kept the same as TRTR, with real and synthetic windows sampled at a 1:1 ratio. We use LSTM, GRU, and Transformer predictors, each predicting the last 6 steps from the first 18 steps of a window. This multi-step horizon is more demanding than the one-step predictive score (PS) in Table [II](https://arxiv.org/html/2606.07569#S4.T2), so the two evaluations need not rank methods identically. Table [V](https://arxiv.org/html/2606.07569#S4.T5) reports MAE averaged over the three predictors.
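The data handling behind the three settings can be sketched as below. This is a simplified illustration under stated assumptions: windows are taken to be exactly 24 steps long (18 input + 6 target), the helper names are hypothetical, and the forecasters themselves are omitted.

```python
import random


def split_window(window, n_in=18, n_out=6):
    """Split one window into forecaster input (first 18 steps) and target (last 6)."""
    if len(window) != n_in + n_out:
        raise ValueError("expected a window of n_in + n_out steps")
    return window[:n_in], window[n_in:]


def real_plus_syn(real, synthetic, seed=42):
    """Size-matched 1:1 mixture: same total as the real set, half real, half synthetic."""
    rng = random.Random(seed)
    n_real = len(real) // 2
    n_syn = len(real) - n_real
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)
    return mixed


def mae(pred, target):
    """Mean absolute error over the forecast horizon."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


# Toy stand-ins: tagged tuples instead of real sensor windows.
real_demo = [("real", i) for i in range(10)]
syn_demo = [("syn", i) for i in range(10)]
mixed_demo = real_plus_syn(real_demo, syn_demo)
x_demo, y_demo = split_window(list(range(24)))
```

Keeping the mixture the same size as the TRTR training set means any change in MAE reflects the content of the synthetic windows rather than a larger training set.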
TABLE V: Downstream prediction utility measured by Train-on-Synthetic-Test-on-Real (TSTR). Values are MAE averaged over LSTM, GRU, and Transformer predictors.

TriHead-GAN obtains the best TSTR MAE on all four datasets, and the best Real+Syn MAE on ETTh1; on Changsha, China, and US the best Real+Syn score goes to TimeDP (Changsha) and PAD-TS (China, US), with TriHead-GAN a close second. Notably, under the size-controlled real–synthetic training of the Real+Syn protocol (where the total number of windows matches TRTR), replacing half of the real windows with TriHead-GAN samples still pushes the MAE *below* the TRTR baseline on every dataset, with relative reductions of 11% on Changsha, 9% on China, 4% on US, and 14% on ETTh1, indicating that the synthetic windows carry useful complementary signal.
Across all four datasets, the TSTR MAE of TriHead-GAN improves on the strongest baseline by 3–14% (3% on Changsha, 12% on China, 8% on US, and 14% on ETTh1), suggesting that its cross-variable supervision yields synthetic data that transfers better to the real forecasting task. The strongest baselines stay competitive rather than degenerate (Diffusion-TS and PAD-TS track the TRTR reference closely, e.g., Diffusion-TS reaches 0.0523 on Changsha and 0.0546 on China), so the downstream gains are measured against genuinely strong synthetic data.
### IV-F Diversity and Mode Coverage
Figure 2: Diversity analysis on four datasets: Precision (sample fidelity, $y$-axis) vs. Recall (mode coverage, $x$-axis).

Fig. [2](https://arxiv.org/html/2606.07569#S4.F2) reports the Precision–Recall trade-off using the manifold-based estimator of Kynkäänniemi et al. \[[16](https://arxiv.org/html/2606.07569#bib.bib26)\], refined by the density/coverage criterion of Naeem et al. \[[24](https://arxiv.org/html/2606.07569#bib.bib27)\]. TriHead-GAN occupies a high-Precision operating point on China, US, and ETTh1 while retaining competitive Recall; on Changsha it sits at a balanced high-Precision, moderate-Recall position (Precision $\approx 0.82$, Recall $\approx 0.30$), where TimeGAN and TTS-GAN reach higher Precision only by collapsing Recall to near zero. Across methods, the Precision-oriented regime targeted by TriHead-GAN remains a practically useful operating point for carbon monitoring, where synthetic windows that stay close to observed trajectories yield more reliable training data for downstream forecasters.
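To make the two axes concrete, the following is a minimal sketch of the k-nearest-neighbour Precision/Recall idea: each sample set is approximated by a union of balls whose radii are k-NN distances, Precision is the fraction of generated samples inside the real set's balls, and Recall is the reverse. It works on raw toy points; the feature space, the value of k, and the density/coverage refinement used in the paper are not reproduced here, and the function names are hypothetical.

```python
import math


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour within the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def fraction_covered(queries, support, k):
    """Fraction of queries lying inside at least one k-NN ball of the support set."""
    radii = knn_radii(support, k)
    hits = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii)) for q in queries
    )
    return hits / len(queries)


def precision_recall(real, generated, k=3):
    precision = fraction_covered(generated, real, k)  # fidelity of generated samples
    recall = fraction_covered(real, generated, k)     # coverage of the real modes
    return precision, recall


# Toy example: generated points collapsed onto one corner of the real data.
real_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
collapsed = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01), (0.01, 0.01)]
p_demo, r_demo = precision_recall(real_pts, collapsed, k=1)
```

In the toy example every generated point is realistic (Precision 1.0) but only one of four real points is covered (Recall 0.25), the same high-Precision, collapsed-Recall pattern the text attributes to TimeGAN and TTS-GAN on Changsha.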
### IV-G Qualitative Visualization
Figure 3: t-SNE projections of real (blue) and generated (red) samples on Changsha for TriHead-GAN and five representative baselines.

Figure 4: Autocorrelation function of real vs. generated sequences on Changsha for TriHead-GAN and five representative baselines.

Figs. [3](https://arxiv.org/html/2606.07569#S4.F3) and [4](https://arxiv.org/html/2606.07569#S4.F4) provide qualitative support on Changsha; the same patterns hold on the other three datasets and are omitted for space. The t-SNE projection in Fig. [3](https://arxiv.org/html/2606.07569#S4.F3) shows that TriHead-GAN samples overlap densely with the real distribution and are scattered across the same multi-mode area as the real points, providing evidence against severe mode collapse. TimeGAN and RCGAN partially separate from the real manifold, while Diffusion-TS spreads beyond it, the spatial counterpart of its near-zero autocorrelation after lag 2 in Fig. [4](https://arxiv.org/html/2606.07569#S4.F4). The ACF curves further show that TriHead-GAN, TTS-GAN, and PAD-TS are tightly grouped at short lags but diverge at lags above 10, where the anti-smoothing loss in TriHead-GAN closes the residual gap to the real curve.
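The curves compared in Fig. 4 are sample autocorrelation functions. A minimal sketch of the standard estimator, together with one possible way to summarise the gap between a real and a generated curve, is given below; the gap definition (mean absolute difference over lags) is an assumption for illustration and may differ from the ACF metric reported in the tables.

```python
def acf(x, max_lag):
    """Sample autocorrelation of a 1-D sequence for lags 0..max_lag."""
    n = len(x)
    mu = sum(x) / n
    c0 = sum((v - mu) ** 2 for v in x)
    return [
        sum((x[t] - mu) * (x[t + k] - mu) for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


def acf_gap(real, generated, max_lag):
    """Mean absolute difference between two ACF curves (illustrative definition)."""
    a, b = acf(real, max_lag), acf(generated, max_lag)
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


# Toy sequence: a strictly alternating signal has strongly negative lag-1 correlation.
alternating = [1.0, -1.0] * 8
rho = acf(alternating, 2)
```

A generator that over-smooths its output produces an ACF that decays too slowly at short lags, which is the kind of mismatch this comparison exposes.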
### IV-H Convergence and Training Dynamics
Figure 5: Wasserstein distance estimate during training (1000 epochs) on the four datasets.

Fig. [5](https://arxiv.org/html/2606.07569#S4.F5) reports the per-epoch Wasserstein-distance estimate from the critic. The relative drop from initial to final values is 96.8% on Changsha, 86.2% on China, 63.9% on US, and 88.9% on ETTh1. The curves show overall decreasing trends with narrow seed bands and little of the oscillation typical of unstable WGAN training, supporting the design choice of combining gradient penalty with spectral normalization. The descent has a clear two-phase shape: a steep drop over the first $\sim$200 epochs followed by a long plateau. The steep phase falls within the 300-epoch auxiliary-loss warmup, so by the time R-Head and T-Head reach full strength the critic has already absorbed most of the distance, and the auxiliary heads refine rather than dominate the trajectory.
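The two quantities discussed here can be written down in a few lines. The relative drop is a plain ratio; the warmup schedule is shown as a linear ramp purely as an assumption, since the text gives the warmup length (300 epochs) but not its shape. The function names and the demo values are hypothetical.

```python
def relative_drop(initial, final):
    """Fractional decrease of the critic's distance estimate over training."""
    return (initial - final) / initial


def aux_warmup_scale(epoch, warmup_epochs=300):
    """Multiplier applied to the auxiliary losses; linear ramp is an ASSUMPTION."""
    return min(1.0, epoch / warmup_epochs)


# Hypothetical start and end values chosen to give a 96.8% drop.
drop_demo = relative_drop(1.0, 0.032)
scale_at_200 = aux_warmup_scale(200)
```

Under a linear ramp the auxiliary heads would be at two thirds of their final weight at epoch 200, roughly where the steep phase of the curve ends.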
### IV-I Computational Efficiency
Figure 6: Computational profile on Changsha. Left: per-step training latency vs. batch-32 inference latency (log scale). Right: parameter count (log) vs. peak GPU memory.

Fig. [6](https://arxiv.org/html/2606.07569#S4.F6) summarises the computational profile on Changsha. TriHead-GAN has the largest per-step training time (47.7 ms) among all methods, reflecting the cost of three discriminator branches together with a Transformer generator, though at 1.85M parameters it is smaller than the diffusion baselines (PAD-TS 2.25M, Diffusion-TS 3.76M, TimeDP 8.80M). Its decisive advantage is at inference: sampling a batch of 32 takes only 1.70 ms, three orders of magnitude faster than the diffusion baselines, which require many denoising steps (1,705.7 ms for Diffusion-TS, 4,866.3 ms for PAD-TS, and 3,535.1 ms for TimeDP). This single-pass sampling makes TriHead-GAN well suited to city-level carbon-monitoring deployments: the generator can be trained offline on station history and repeatedly queried to provide auxiliary synthetic windows for downstream forecasters without replacing real sensor observations, so that the added training cost is paid once while the per-query deployment cost remains on par with single-pass GANs and far below iterative diffusion baselines.
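The "three orders of magnitude" statement can be checked directly from the latencies quoted above. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
import math

# Batch-32 inference latencies in milliseconds, as reported in the text.
TRIHEAD_GAN_MS = 1.70
DIFFUSION_MS = {"Diffusion-TS": 1705.7, "PAD-TS": 4866.3, "TimeDP": 3535.1}

# Speedup factor of TriHead-GAN over each diffusion baseline.
speedups = {name: ms / TRIHEAD_GAN_MS for name, ms in DIFFUSION_MS.items()}

# Whole orders of magnitude covered by each speedup.
orders = {name: math.floor(math.log10(s)) for name, s in speedups.items()}
```

Every ratio falls between roughly 1,000x and 2,900x, so each baseline is slower by three full orders of magnitude.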
### IV-J Scalability and Sensitivity
Figure 7: Scalability and loss-weight sensitivity on Changsha (lower is better). (a) training-fraction sweep $\{10, 25, 50, 100\}\%$ normalised to $10\%$; (b)–(d) single-weight sweeps over $\alpha$, $\delta$, $\eta$, each normalised to its $1.0\times$ default (dotted line). Each colour denotes one metric.

Panel (a) of Fig. [7](https://arxiv.org/html/2606.07569#S4.F7) reports the scalability behaviour: PS, FID, and ACF improve monotonically as the training fraction increases, while MMD is noisy below the 25% point and stabilises afterwards. This is consistent with the cross-variable head requiring enough samples to reliably estimate the conditional structure of CO2 given the surrounding meteorological and pollutant variables.
Panels (b)–(d) sweep the three non-trivial auxiliary weights ($\alpha$ for R-Head, $\delta$ for T-Head, and $\eta$ for anti-smoothing) over $\{0.5, 0.75, 1.0, 1.5, 2.0\}\times$ the defaults. The other two regression weights $\beta, \gamma$ are scaled by the same factor as $\alpha$ when $\alpha$ is swept, since all three weight the regression head, and $\lambda_{\text{gp}}$ uses the standard WGAN-GP value. Three observations follow. (i) *Stability:* across the full sweep no configuration collapses, with PS, FID, and ACF confined to $[0.028, 0.034]$, $[0.087, 0.152]$, and $[0.021, 0.039]$. (ii) *Per-weight profile:* larger $\alpha$ keeps improving FID up to $2.0\times$ ($0.135 \to 0.087$); $\delta$ is the most sensitive (its $0.5\times$ setting gives the worst FID of the sweep, 0.152, so under-weighting transition supervision is the riskiest setting), while $\eta$ is the most robust. (iii) *Defaults:* the $1.0\times$ setting lies on the Pareto frontier of the four metrics but is not best on every one, since the three heads optimise different functionals.
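The sweep grid described above can be generated as follows. This is a sketch of the configuration logic only (no training); the function and variable names are hypothetical, while the defaults, the multipliers, and the rule that $\beta$ and $\gamma$ follow $\alpha$ come from the text.

```python
DEFAULT_WEIGHTS = {"alpha": 3.0, "beta": 1.0, "gamma": 3.0, "delta": 1.0, "eta": 2.0}
MULTIPLIERS = (0.5, 0.75, 1.0, 1.5, 2.0)


def sweep(weight):
    """Return (multiplier, weights) pairs for a single-weight sweep.

    Sweeping alpha also rescales beta and gamma, because all three
    weight the regression head; delta and eta are swept on their own.
    """
    coupled = ("alpha", "beta", "gamma") if weight == "alpha" else (weight,)
    grid = []
    for m in MULTIPLIERS:
        weights = dict(DEFAULT_WEIGHTS)
        for name in coupled:
            weights[name] = DEFAULT_WEIGHTS[name] * m
        grid.append((m, weights))
    return grid


alpha_grid = sweep("alpha")
delta_grid = sweep("delta")
```

Three sweeps of five settings give 15 configurations in total, three of which coincide with the default.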
## V Conclusion and Limitations
We propose TriHead\-GAN, a Transformer\-based adversarial framework for carbon emission time series generation\. Its triple\-head discriminator simultaneously supervises distributional authenticity, cross\-variable dependency, and step\-wise temporal smoothness, and the generator is further regularized by local temporal convolution, per\-step noise injection, and an anti\-smoothing loss\. Across four datasets, TriHead\-GAN achieves the strongest distribution fidelity \(ranking first on DS, MMD, and FID on every dataset\) and competitive\-to\-best downstream forecasting utility, with statistically significant improvements over six representative baselines on the large majority of metric–baseline pairs\.
Limitations and future work. TriHead-GAN is not uniformly best: it is second to TimeDP on Changsha PS and ACF and only ties PAD-TS on US ACF, and on the Real+Syn downstream protocol it is narrowly second to TimeDP on Changsha and to PAD-TS on China and US. Two operational constraints follow from the design: (i) the cross-variable head needs enough samples to estimate the conditional structure, so TriHead-GAN suits stations with on the order of a thousand or more sliding windows; (ii) R-Head presupposes a designated target variable, which is natural for carbon monitoring (CO2) but less natural when variables are exchangeable, in which case R-Head can be replaced by a randomly masked target rotation. Future work will pursue this masked rotation and explore integration with time series foundation models for carbon-specific pretraining.
## Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grant 62394341 and Grant 62027811. The authors used ChatGPT to assist with drafting and polishing the manuscript text; the proposed methods were produced and verified by the authors.
## References
- \[1\] A. Alaa, B. Van Breugel, E. S. Saveliev, and M. Van Der Schaar (2022). How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In International Conference on Machine Learning, pp. 290–306.
- \[2\] S. Chatterjee, D. Hazra, and Y. Byun (2025). GAN-based synthetic time-series data generation for improving prediction of demand for electric vehicles. Expert Systems with Applications 264, pp. 125838. DOI: [10.1016/j.eswa.2024.125838](https://dx.doi.org/10.1016/j.eswa.2024.125838).
- \[3\] M. Crippa, E. Solazzo, G. Huang, D. Guizzardi, E. Koffi, M. Muntean, C. Schieberle, R. Friedrich, and G. Janssens-Maenhout (2020). High resolution temporal profiles in the emissions database for global atmospheric research (EDGAR). Scientific Data 7(1), pp. 121.
- \[4\] A. Desai, C. Freeman, Z. Wang, and I. Beaver (2021). TimeVAE: a variational auto-encoder for multivariate time series generation. arXiv preprint arXiv:2111.08095.
- \[5\] C. Esteban, S. L. Hyland, and G. Rätsch (2017). Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633.
- \[6\] European Commission (2024). Carbon border adjustment mechanism (CBAM). Directorate-General for Taxation and Customs Union. Available at https://taxation-customs.ec.europa.eu/carbon-border-adjustment-mechanism_en
- \[7\] European Union (2023). Regulation (EU) 2023/956 of the European Parliament and of the Council of 10 May 2023 establishing a carbon border adjustment mechanism. Official Journal of the European Union, L 130/52. Available at https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32023R0956
- \[8\] G. Fadlon, I. Arbiv, N. Berman, and O. Azencot (2025). A diffusion model for regular time series generation from irregular data with completion and masking. In Advances in Neural Information Processing Systems.
- \[9\] P. Friedlingstein, M. O’Sullivan, M. W. Jones, R. M. Andrew, D. C. E. Bakker, J. Hauck, P. Landschützer, C. Le Quéré, I. T. Luijkx, G. P. Peters, et al. (2023). Global carbon budget 2023. Earth System Science Data 15(12), pp. 5301–5369.
- \[10\] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017). Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems 30.
- \[11\] Z. Hong, Z. Peng, and X. Liu (2026). Stable time series prediction of enterprise carbon emissions based on causal inference. arXiv preprint arXiv:2602.00775.
- \[12\] X. Hou, S. Liu, Z. Peng, Y. Chu, Y. Zhang, and Y. Wang (2025). DLGAN: time series synthesis based on dual-layer generative adversarial networks. arXiv preprint arXiv:2508.21340.
- \[13\] Y. Huang, C. Xu, Y. Wu, W. Li, and J. Bian (2025). TimeDP: learning to generate multi-domain time series with domain prompts. Proceedings of the AAAI Conference on Artificial Intelligence 39(17), pp. 17520–17527. DOI: [10.1609/aaai.v39i17.33926](https://dx.doi.org/10.1609/aaai.v39i17.33926).
- \[14\] Y. Jin, A. Sharifi, Z. Li, S. Chen, S. Zeng, and S. Zhao (2024). Carbon emission prediction models: a review. Science of The Total Environment 927, pp. 172319.
- \[15\] J. Jordon, J. Yoon, and M. Van Der Schaar (2019). PATE-GAN: generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations.
- \[16\] T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila (2019). Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, Vol. 32.
- \[17\] X. Li, V. Metsis, H. Wang, and A. H. H. Ngu (2022). TTS-GAN: a transformer-based time-series generative adversarial network. In International Conference on Artificial Intelligence in Medicine, pp. 133–143.
- \[18\] Y. Li, H. Meng, Z. Bi, I. T. Urnes, and H. Chen (2025). Population aware diffusion for time series generation. Proceedings of the AAAI Conference on Artificial Intelligence 39(17), pp. 18520–18529. DOI: [10.1609/aaai.v39i17.34038](https://dx.doi.org/10.1609/aaai.v39i17.34038).
- \[19\] Y. Liu, G. Qin, Z. Shi, Z. Chen, C. Yang, X. Huang, J. Wang, and M. Long (2025). Sundial: a family of highly capable time series foundation models. In Proceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 267, pp. 39295–39317.
- \[20\] Z. Liu, P. Ciais, Z. Deng, S. J. Davis, B. Zheng, Y. Wang, D. Cui, B. Zhu, X. Dou, P. Ke, et al. (2020). Carbon Monitor, a near-real-time daily dataset of global CO2 emission from fossil fuel and cement production. Scientific Data 7(1), pp. 392.
- \[21\] X. Ma, J. Wang, J. Huang, and Y. Ke (2025). Energy consumption and carbon emission modeling and forecasting study with novel deep learning methods. Expert Systems with Applications 290, pp. 128314. DOI: [10.1016/j.eswa.2025.128314](https://dx.doi.org/10.1016/j.eswa.2025.128314).
- \[22\] D. Maji, K. Yang, P. Shenoy, R. K. Sitaraman, and M. Srivastava (2025). CarbonX: an open-source tool for computational decarbonization using time series foundation models. arXiv preprint arXiv:2510.01521.
- \[23\] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018). Spectral normalization for generative adversarial networks. In The Sixth International Conference on Learning Representations.
- \[24\] M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo (2020). Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pp. 7176–7185.
- \[25\] L. Shen, X. Li, and L. Long (2025). TSGDiff: rethinking synthetic time series generation from a pure graph perspective. arXiv preprint arXiv:2511.12174.
- \[26\] X. Shi, S. Wang, Y. Nie, D. Li, Z. Ye, Q. Wen, and M. Jin (2025). Time-MoE: billion-scale time series foundation models with mixture of experts. In Proceedings of the Thirteenth International Conference on Learning Representations.
- \[27\] United Nations Framework Convention on Climate Change (2015). The Paris agreement. Decision 1/CP.21, Adoption of the Paris Agreement, FCCC/CP/2015/10/Add.1. Available at https://unfccc.int/process-and-meetings/the-paris-agreement
- \[28\] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. Advances in Neural Information Processing Systems 30.
- \[29\] J. Yoon, D. Jarrett, and M. Van der Schaar (2019). Time-series generative adversarial networks. Advances in Neural Information Processing Systems 32.
- \[30\] X. Yuan and Y. Qiao (2024). Diffusion-TS: interpretable diffusion for general time series generation. In The Twelfth International Conference on Learning Representations.