PESD-TSF: A Period-Aware and Explicit Structured Decomposition Framework for Long-Term Time Series Forecasting
Summary
Proposes PESD-TSF, a physics-inspired structured decomposition framework for long-term time series forecasting that addresses periodic perception degradation, trend-noise entanglement, and loss of cross-variable dependencies via multiplicative periodic gating, multi-scale structured encoder, and cross-scale collaborative attention.
# PESD-TSF: A Period-Aware and Explicit Structured Decomposition Framework for Long-Term Time Series Forecasting
Source: [https://arxiv.org/html/2605.16449](https://arxiv.org/html/2605.16449)
###### Abstract
Deep forecasting models often suffer from attenuated periodic perception and entangled trend–noise representations as network depth increases\. Moreover, the widely adopted channel\-independent paradigm, while improving training stability, disrupts intrinsic dynamic coordination among variables, hindering the modeling of cross\-variable consistency in multivariate time series\. To address these issues, we propose PESD\-TSF, a physics\-inspired structured decomposition framework for long\-term time series forecasting that jointly emphasizes interpretability and predictive accuracy\. PESD\-TSF introduces three key designs\. First, a Multiplicative Periodic Gating mechanism incorporates continuous\-time priors to dynamically modulate signal amplitudes, preserving periodic structures across deep layers\. Second, a multi\-scale structured encoder integrates detrended attention with hierarchical sampling to explicitly decouple long\-term trends from high\-frequency variations while retaining fine\-grained temporal semantics\. Third, to recover disrupted inter\-variable dependencies, we propose Cross\-Scale Collaborative Attention \(CSCA\) together with an RLC regularization scheme, which reconstructs global inter\-variable topology in deep feature spaces and enforces physically consistent collaboration through orthogonality and consistency constraints\. Extensive experiments on benchmark datasets from multiple domains demonstrate that PESD\-TSF consistently achieves state\-of\-the\-art performance, with particularly strong gains on multivariate forecasting tasks involving complex inter\-variable coupling, highlighting its superior structural modeling capability and generalization\.
Keywords: Time Series Forecasting, Deep Learning, Decomposition, Interpretability, Spatiotemporal Modeling
## 1 Introduction
Multivariate long-term time series forecasting (LSTF) (Vaswani et al., [2017](https://arxiv.org/html/2605.16449#bib.bib42)) is critical for domains like energy and traffic management (Qian et al., [2019](https://arxiv.org/html/2605.16449#bib.bib43)). While deep learning models have surpassed traditional statistical approaches in accuracy, they often operate as "black boxes" lacking physical inductive bias (Zhang et al., [2026a](https://arxiv.org/html/2605.16449#bib.bib26); Wang et al., [2026b](https://arxiv.org/html/2605.16449#bib.bib30)). Consequently, these models struggle to distinguish robust physical regularities (e.g., periodicity, inter-variable dependency) from stochastic noise, leading to brittle predictions under distribution shifts (Wang et al., [2025](https://arxiv.org/html/2605.16449#bib.bib34)). Specifically, existing architectures face three limitations: (1) Signal Dilution: Static positional encodings fail to preserve periodic patterns (e.g., calendar effects) (Luo and Wang, [2024b](https://arxiv.org/html/2605.16449#bib.bib40)) within deep feature hierarchies; (2) Noise Entanglement: Attention mechanisms are often distracted by abrupt perturbations, hindering the separation of long-term trends from noise; and (3) Loss of Correlation: As shown in Figure [1](https://arxiv.org/html/2605.16449#S1.F1), the prevalent channel-independent strategy artificially severs intrinsic cross-variable dependencies (Xue et al., [2023](https://arxiv.org/html/2605.16449#bib.bib41)), failing to model dynamic system coordination.
Figure 1: Capturing vs. Losing Dependencies. Left: Inherent dynamic synergy in multivariate series. Right: Channel-independent methods sever these crucial connections, hindering consistent feature learning.

To address these challenges, we propose PESD-TSF, a deep decomposition framework guided by physical inductive bias. As illustrated in Figure [2](https://arxiv.org/html/2605.16449#S1.F2), PESD-TSF explicitly decomposes temporal signals into three dimensions: long-term trends, short-term perturbations, and cross-variable collaborations. First, to mitigate periodic attenuation, we design a Multiplicative Periodic Gating mechanism that modulates signal amplitudes using continuous-time priors. Second, we construct a multi-scale structured encoder employing detrended attention and hierarchical sampling to disentangle trends from local variations. Finally, to recover inter-variable dependencies, we introduce Cross-Scale Collaborative Attention (CSCA) with RLC regularization. This module enforces orthogonality and consistency constraints, guiding the model to learn structurally consistent topological dependencies. Extensive experiments demonstrate that PESD-TSF achieves state-of-the-art performance, effectively unifying high forecasting accuracy with physical interpretability.

Figure 2: Schematic of PESD-TSF. The framework explicitly decomposes signals into three physical components: (a) long-term trend, (b) short-term variations, and (c) cross-variable coherence. This structured decomposition unifies high forecasting accuracy with physical interpretability.

In summary, the main contributions of this paper are as follows:
- We propose PESD-TSF, a framework that explicitly decomposes temporal dynamics into three physical dimensions (trend, perturbation, and collaboration) to effectively address long-range dependency attenuation and dynamic variable coupling.
- We design a Multi-scale Structured Encoder integrated with Periodic Gating. This architecture injects continuous temporal priors to mitigate signal dilution and employs hierarchical sampling to disentangle multi-granularity features.
- We introduce a Regularized Latent Component (RLC) module. By imposing orthogonality constraints, it enforces physically interpretable, disentangled representations and reconstructs cross-variable topological consistency.
- Extensive experiments across diverse benchmarks demonstrate that PESD-TSF achieves state-of-the-art performance, validating the superiority of explicit structural decomposition over purely stacking deep architectures.
## 2 Methodology
To address long\-term dependency attenuation and neglected cross\-variable correlations in time series forecasting, we propose PESD\-TSF, a framework grounded in physical inductive bias\. Departing from generic “black\-box” paradigms, PESD\-TSF employs a structured pipeline—period enhancement, multi\-scale decoupling, collaborative reconstruction, and regularization—to explicitly capture underlying sequence dynamics\.
Figure 3: The PESD-TSF framework comprises three core components: (1) Periodic Gating: fusing temporal features through amplitude modulation to capture periodic patterns; (2) Multi-scale Structured Encoder: integrating IntAttention, PatchSampling, and CSCA for multi-resolution feature processing and context aggregation; (3) RLC module: utilizing orthogonal constraints ($\mathcal{L}_{\text{orth}}$) to decouple the representations into trend, seasonal, and consistent components. Finally, a linear layer outputs the prediction result $Y$.

### 2.1 Problem Definition and Overall Architecture
We map historical observations $X \in \mathbb{R}^{B \times L \times C}$ to future states $Y \in \mathbb{R}^{B \times O \times C}$, where $B$ is the batch size, $L$ the input length, $O$ the forecasting horizon, and $C$ the number of variables. As shown in Figure [3](https://arxiv.org/html/2605.16449#S2.F3), PESD-TSF employs a hierarchical architecture starting with Instance Normalization and Periodic Gating to modulate amplitudes via temporal priors. Subsequently, a three-stage encoder disentangles trends, local variations, and correlations, followed by a linear decoder supervised by an RLC module for structural constraints.
### 2.2 Period-Aware Gating and Multi-Scale Embedding
Standard additive positional encoding suffers from Signal Dilution and lacks physical semantics. We propose Multiplicative Periodic Gating to model temporal attributes as continuous priors. Features $\mathcal{M} \in \mathbb{R}^{B \times L \times N_{\text{freq}}}$ are mapped to the embedding subspace $\mathbf{W}_{\text{emb}}^{(k)} \in \mathbb{R}^{V_k \times D_{\text{emb}}}$:

$$E_{b,t}^{(k)} = \mathbf{W}_{\text{emb}}^{(k)}[m_{b,t,k}] \in \mathbb{R}^{D_{\text{emb}}}. \tag{1}$$

Embeddings are concatenated into $E_{\text{cat}}$ and projected to context $E_{\text{time}}$ via $W_{\text{fuse}}$:

$$E_{\text{cat}} = \text{Concat}(E^{(1)}, E^{(2)}, \dots, E^{(N_{\text{freq}})}), \tag{2}$$

$$E_{\text{time}} = E_{\text{cat}} W_{\text{fuse}} \in \mathbb{R}^{B \times L \times D}, \tag{3}$$

where $E_{\text{cat}} \in \mathbb{R}^{B \times L \times (N_{\text{freq}} \cdot D_{\text{emb}})}$ and $D_{\text{emb}}$ is the embedding dimension. $W_{\text{fuse}}$ bridges multi-scale dependencies. A gating network then generates $G_{\text{period}}$:

$$G_{\text{period}} = \sigma(E_{\text{time}} W_{\text{gate}} + b_{\text{gate}}) \in (0,1)^{B \times L \times 1}, \tag{4}$$

where $W_{\text{gate}}, b_{\text{gate}}$ are projection parameters and $\sigma$ is the Sigmoid function. This signal modulates the normalized input $X_{\text{norm}}$ to yield $X'$:

$$X'_{b,t,c} = X_{\text{norm},b,t,c} \cdot (1 + \gamma \cdot G_{\text{period},b,t}), \tag{5}$$

where $\gamma$ controls enhancement intensity. Finally, the $n$-th patch $p_{b,n,c} \in \mathbb{R}^{P}$ of $X'$ is projected to latent space $Z^{(0)}$:

$$Z_{b,n,c}^{(0)} = p_{b,n,c} W_{\text{emb}} \in \mathbb{R}^{D}, \tag{6}$$

where $W_{\text{emb}}$ is the projection matrix. Using zero-padding and stride $S$, the total patch count $N$ is:

$$N = \left\lfloor \frac{L - P}{S} \right\rfloor + 1. \tag{7}$$

This yields $Z^{(0)} \in \mathbb{R}^{B \times N \times C \times D}$, fusing local features with periodic priors for the encoder.
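The gating and patching steps in Eqs. (1) to (7) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the calendar vocabularies (hour of day, day of week), all tensor sizes, the random weights, and the helper names `periodic_gate` and `patchify` are hypothetical, and the sketch omits the zero-padding mentioned above so that the patch count matches Eq. (7) directly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def periodic_gate(x_norm, marks, emb_tables, w_fuse, w_gate, b_gate, gamma=0.5):
    """x_norm: (B, L, C); marks: (B, L, N_freq) integer temporal attributes."""
    # Eq. (1)-(2): one embedding lookup per temporal attribute, then concatenate.
    e_cat = np.concatenate(
        [table[marks[..., k]] for k, table in enumerate(emb_tables)], axis=-1
    )                                          # (B, L, N_freq * D_emb)
    e_time = e_cat @ w_fuse                    # Eq. (3): (B, L, D)
    gate = sigmoid(e_time @ w_gate + b_gate)   # Eq. (4): (B, L, 1), values in (0, 1)
    return x_norm * (1.0 + gamma * gate)       # Eq. (5): gate broadcasts over channels

def patchify(x, patch_len, stride):
    """(B, L, C) -> (B, N, C, P) with N = floor((L - P) / S) + 1 (no padding here)."""
    n = (x.shape[1] - patch_len) // stride + 1
    idx = np.arange(n)[:, None] * stride + np.arange(patch_len)[None, :]  # (N, P)
    return x[:, idx, :].transpose(0, 1, 3, 2)

# Hypothetical sizes and randomly initialised parameters, for illustration only.
rng = np.random.default_rng(0)
B, L, C, D, D_emb, P, S = 2, 96, 3, 16, 4, 16, 8
vocab = [24, 7]  # assumed attributes: hour of day, day of week
marks = np.stack([rng.integers(0, v, size=(B, L)) for v in vocab], axis=-1)
emb_tables = [0.1 * rng.normal(size=(v, D_emb)) for v in vocab]
w_fuse = 0.1 * rng.normal(size=(len(vocab) * D_emb, D))
w_gate, b_gate = 0.1 * rng.normal(size=(D, 1)), np.zeros(1)
w_emb = 0.1 * rng.normal(size=(P, D))

x_norm = rng.normal(size=(B, L, C))
x_mod = periodic_gate(x_norm, marks, emb_tables, w_fuse, w_gate, b_gate)
z0 = patchify(x_mod, P, S) @ w_emb             # Eq. (6): (B, N, C, D)
print(z0.shape)  # (2, 11, 3, 16)
```

Because the gate lies in $(0,1)$, the factor $1 + \gamma G_{\text{period}}$ only amplifies: with $\gamma = 0.5$ each value is scaled by a factor between 1 and 1.5, which is why the modulation cannot erase the underlying signal.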
### 2.3 Multi-scale Structured Decomposition Encoder
The encoder adopts a cascaded three-stage design with physical inductive biases: IntAttention captures robust temporal patterns, PatchSampling aggregates multi-scale features, and Cross-Scale Collaborative Attention (CSCA) models inter-variable dependencies. Together, these stages form a coherent decomposition and reconstruction pipeline.
Stage 1: Integrated Attention \(IntAttention\) — Denoised Trend Extraction
Stage 1 receives embedding $\mathbf{Z}^{(0)} \in \mathbb{R}^{B \times N \times C \times D}$ and aims to extract stable trends via Smoothing Convolution. Using a uniform kernel $K_w = \frac{1}{w}\mathbf{1}_w$ (with kernel size $w = 3$ and same-padding to maintain sequence length $N$), we calculate the local trend $Z_{\text{trend}}^{(0)}$:

$$Z_{\text{trend}}^{(0)} = Z^{(0)} \circledast K_w \in \mathbb{R}^{B \times N \times C \times D}, \tag{8}$$

where $\circledast$ denotes temporal convolution. We subtract this to obtain the detrended component $Z_{\text{det}}^{(0)}$ containing high-frequency fluctuations:

$$Z_{\text{det}}^{(0)} = Z^{(0)} - Z_{\text{trend}}^{(0)}. \tag{9}$$

We adopt an Asymmetric Design. To strictly preserve channel independence, we reshape the input tensor to $(B \cdot C) \times N \times D$ prior to the attention operation. We then use the detrended component $Z_{\text{det}}^{(0)}$ for Queries and Keys to emphasize structural similarity, while Values derive from the complete $Z^{(0)}$:

$$Q = Z_{\text{det}}^{(0)} W_Q, \quad K = Z_{\text{det}}^{(0)} W_K, \quad V = Z^{(0)} W_V. \tag{10}$$

With $H = 8$ attention heads, the output $H_{\text{attn}}$ is defined as:

$$H_{\text{attn}} = \text{Softmax}\left(\frac{QK^{\top}}{\sqrt{D/H}}\right)V. \tag{11}$$

Finally, we apply Residual Connections and Layer Normalization to obtain $Z^{(1)}$:

$$Z^{(1)} = \text{LayerNorm}(Z^{(0)} + H_{\text{attn}}) \in \mathbb{R}^{B \times N \times C \times D}. \tag{12}$$

This preserves original information while injecting denoised contextual features.
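The asymmetric attention of Stage 1 (Eqs. (8) to (12)) can be illustrated with the NumPy sketch below. It is an approximation under assumptions that the text does not fix: zero-valued same-padding for the smoothing kernel, a standard multi-head split with no output projection, and a layer normalization without learnable affine parameters. The function names and tensor sizes are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def moving_average(z, w=3):
    """Uniform kernel of width w along axis 1 of a (batch, N, D) array, zero same-padding."""
    pad = w // 2
    zp = np.pad(z, ((0, 0), (pad, pad), (0, 0)))
    return sum(zp[:, i:i + z.shape[1]] for i in range(w)) / w

def int_attention(z0, w_q, w_k, w_v, n_heads=8):
    """z0: (B, N, C, D) -> (B, N, C, D)."""
    B, N, C, D = z0.shape
    d_h = D // n_heads
    x = z0.transpose(0, 2, 1, 3).reshape(B * C, N, D)   # channel-independent view
    det = x - moving_average(x)                          # Eq. (8)-(9)

    def split(t):                                        # (B*C, N, D) -> (B*C, H, N, d_h)
        return t.reshape(B * C, N, n_heads, d_h).transpose(0, 2, 1, 3)

    # Eq. (10): queries and keys from the detrended signal, values from the full signal.
    q, k, v = split(det @ w_q), split(det @ w_k), split(x @ w_v)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_h))   # Eq. (11)
    h = (attn @ v).transpose(0, 2, 1, 3).reshape(B * C, N, D)
    out = layer_norm(x + h)                              # Eq. (12)
    return out.reshape(B, C, N, D).transpose(0, 2, 1, 3)

rng = np.random.default_rng(0)
B, N, C, D = 2, 11, 3, 16
z0 = rng.normal(size=(B, N, C, D))
w_q, w_k, w_v = (0.1 * rng.normal(size=(D, D)) for _ in range(3))
z1 = int_attention(z0, w_q, w_k, w_v)
print(z1.shape)  # (2, 11, 3, 16)
```

The design choice worth noting is that the attention weights depend only on the detrended signal, so two patches with different levels but similar fluctuations attend to each other, while the values still carry the level information.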
Stage 2: PatchSampling — Multi\-granularity Feature Aggregation
To capture long-period dependencies, PatchSampling compresses $Z^{(1)}$ via a dual-branch structure ($\text{Conv1d}_{\text{time}}$ and $\text{MaxPool}_{\text{time}}$ with stride 2, kernel 3), mapping concatenated outputs to dimension $D$ without information loss:

$$Z_{\text{agg}} = \mathbf{W}_{\text{agg}}\big(\text{Concat}_{\text{dim}=D}[\text{Conv1d}_{\text{time}}(Z^{(1)}),\ \text{MaxPool}_{\text{time}}(Z^{(1)})]\big), \tag{13}$$

where $\text{Conv1d}_{\text{time}}$ denotes 1D convolution and $\text{Concat}_{\text{dim}=D}$ denotes feature concatenation. This reduces the temporal dimension to $N' = \lfloor N/2 \rfloor$, forming coarse-grained features:

$$Z^{(2)} \in \mathbb{R}^{B \times N' \times C \times D}. \tag{14}$$

This provides multi-scale context for the subsequent CSCA stage.
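A rough NumPy sketch of the dual-branch downsampling in Eq. (13) follows. The padding scheme is an assumption: a single padded token on the left is used because, with kernel 3 and stride 2, it yields exactly $N' = \lfloor N/2 \rfloor$ outputs for both even and odd $N$. The convolution is taken to mix all $D$ features; the source does not say whether it is depthwise. Function names and shapes are hypothetical.

```python
import numpy as np

def windows(z, pad_value, kernel=3, stride=2):
    """(B, N, C, D) -> (B, N', C, D, kernel) sliding windows over the token axis.

    Assumption: one padded token on the left, so N' = floor(N / 2) for kernel=3, stride=2.
    """
    zp = np.pad(z, ((0, 0), (1, 0), (0, 0), (0, 0)), constant_values=pad_value)
    n_out = (zp.shape[1] - kernel) // stride + 1
    idx = np.arange(n_out)[:, None] * stride + np.arange(kernel)[None, :]
    return np.moveaxis(zp[:, idx], 2, -1)

def patch_sampling(z1, w_conv, w_agg):
    """z1: (B, N, C, D); w_conv: (kernel, D, D); w_agg: (2D, D)."""
    # Convolution branch: weighted sum over each window (zero padding).
    conv = np.einsum("bncdk,kde->bnce", windows(z1, 0.0), w_conv)
    # Max-pool branch: padding with -inf so the pad never wins the max.
    pool = windows(z1, -np.inf).max(axis=-1)
    # Concatenate along the feature axis and project 2D -> D.
    return np.concatenate([conv, pool], axis=-1) @ w_agg

rng = np.random.default_rng(0)
B, N, C, D = 2, 11, 3, 16
z1 = rng.normal(size=(B, N, C, D))
w_conv = 0.1 * rng.normal(size=(3, D, D))
w_agg = 0.1 * rng.normal(size=(2 * D, D))
z2 = patch_sampling(z1, w_conv, w_agg)
print(z2.shape)  # (2, 5, 3, 16)
```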
Stage 3: Cross-Scale Collaborative Attention (CSCA) - Global Synergy Reconstruction
Stage 3 takes $Z^{(2)}$ as input to extract Global Variable Synergy. Departing from computationally prohibitive dynamic graph modeling, we focus on stable, system-wide correlations. Specifically, we perform Global Average Pooling (GAP) along the time dimension to generate the compact representation $\mathbf{H}_{\text{c}}$:

$$\mathbf{H}_{\text{c}} = \frac{1}{N'}\sum_{t=1}^{N'} Z_{:,t,:,:}^{(2)} \in \mathbb{R}^{B \times C \times D}. \tag{15}$$

This operation aggregates temporal contexts, allowing the subsequent attention mechanism to discover intrinsic, time-invariant dependencies among variables.

We construct self-attention over the variable dimension $C$ on $\mathbf{H}_{\text{c}}$ to capture implicit dependencies, generating cooperative context $\mathbf{C}_{\text{context}}$:

$$\mathbf{C}_{\text{context}} = \text{Softmax}\left(\frac{(\mathbf{H}_{\text{c}}\mathbf{W}_q)(\mathbf{H}_{\text{c}}\mathbf{W}_k)^{\top}}{\sqrt{D}}\right)(\mathbf{H}_{\text{c}}\mathbf{W}_v) \in \mathbb{R}^{B \times C \times D}, \tag{16}$$

where $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v$ are projection matrices. Finally, we inject this global structure back into the local stream via a Structured Conditioning mechanism to obtain $Z_{\text{final}}$:

$$Z_{\text{final},b,t,c} = Z^{(2)}_{b,t,c} + C_{\text{context},b,c}. \tag{17}$$

This forces local dynamics to satisfy system-wide topological constraints while avoiding expensive dynamic graph calculations.
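Eqs. (15) to (17) amount to single-head attention across variables on a time-pooled summary, followed by a broadcast addition. The NumPy sketch below illustrates this with hypothetical shapes and random weights; it is a reading of the equations as written rather than the released implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def csca(z2, w_q, w_k, w_v):
    """z2: (B, N', C, D) -> (z_final with the same shape, attention map (B, C, C))."""
    D = z2.shape[-1]
    h_c = z2.mean(axis=1)                                   # Eq. (15): (B, C, D)
    scores = (h_c @ w_q) @ (h_c @ w_k).transpose(0, 2, 1)   # (B, C, C)
    attn = softmax(scores / np.sqrt(D))                     # each row sums to 1
    context = attn @ (h_c @ w_v)                            # Eq. (16): (B, C, D)
    # Eq. (17): the same per-variable context is added at every time step.
    return z2 + context[:, None, :, :], attn

rng = np.random.default_rng(0)
B, Np, C, D = 2, 5, 3, 16
z2 = rng.normal(size=(B, Np, C, D))
w_q, w_k, w_v = (0.1 * rng.normal(size=(D, D)) for _ in range(3))
z_final, attn = csca(z2, w_q, w_k, w_v)
print(z_final.shape, attn.shape)  # (2, 5, 3, 16) (2, 3, 3)
```

The attention map costs $O(C^2 D)$ per sample and is independent of the sequence length, which is the source of the efficiency argument above. The price is that the learned dependencies are constant over the input window.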
### 2.4 Final Prediction and Regularization Constraints
To mitigate autoregressive error accumulation, we adopt a 'Mixed-Encoding, Independent-Decoding' strategy, flattening the temporal and embedding dimensions of $Z_{\text{final}}$ to construct the regression input:

$$\tilde{Z} = \text{Flatten}(Z_{\text{final}}) \in \mathbb{R}^{B \times C \times (N \cdot D)}, \tag{18}$$

where $B$, $N$, $C$, and $D$ denote the batch size, temporal length, number of variables, and embedding dimension, respectively. A channel-shared linear projection $W_{\text{head}} \in \mathbb{R}^{(N \cdot D) \times O}$ is then applied to obtain the prediction:

$$H_{\text{pred}} = \tilde{Z} W_{\text{head}} \in \mathbb{R}^{B \times C \times O}. \tag{19}$$

The output is transposed to $\mathbb{R}^{B \times O \times C}$ to enable independent variable extrapolation. We then apply exact inverse normalization using preserved statistics $\mu_X$ and $\sigma_X$ to restore the physical scale:

$$\hat{Y} = H_{\text{pred}} \odot \sigma_X + \mu_X, \tag{20}$$

where $\odot$ denotes the Hadamard product, and $\mu_X, \sigma_X$ are replicated along the forecasting horizon $O$ for dimensional alignment.
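The flatten, project, and de-normalize steps of Eqs. (18) to (20) can be sketched as below. The instance-normalization helper, the epsilon added to the standard deviation, and all sizes are assumptions for illustration; the temporal length of the encoder output is written generically as `n_tok`.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Per-sample, per-variable normalization over the time axis of (B, L, C)."""
    mu = x.mean(axis=1, keepdims=True)             # (B, 1, C)
    sigma = x.std(axis=1, keepdims=True) + eps     # (B, 1, C)
    return (x - mu) / sigma, mu, sigma

def predict(z_final, w_head, mu, sigma):
    """z_final: (B, n_tok, C, D); w_head: (n_tok * D, O) shared by all channels."""
    B, n_tok, C, D = z_final.shape
    z_flat = z_final.transpose(0, 2, 1, 3).reshape(B, C, n_tok * D)  # Eq. (18)
    h_pred = (z_flat @ w_head).transpose(0, 2, 1)                    # Eq. (19): (B, O, C)
    return h_pred * sigma + mu        # Eq. (20): statistics broadcast along the horizon

rng = np.random.default_rng(0)
B, L, C, D, n_tok, O = 2, 96, 3, 16, 5, 24
x = 10.0 + 3.0 * rng.normal(size=(B, L, C))
x_norm, mu, sigma = instance_norm(x)
z_final = rng.normal(size=(B, n_tok, C, D))        # stand-in for the encoder output
w_head = 0.1 * rng.normal(size=(n_tok * D, O))
y_hat = predict(z_final, w_head, mu, sigma)
print(y_hat.shape)  # (2, 24, 3)
```

Since `mu` and `sigma` have shape `(B, 1, C)`, NumPy broadcasting performs the replication along the horizon that the text describes.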
#### Regularized Latent Component \(RLC\) as Auxiliary Statistical Alignment
To further enhance robustness and prevent overfitting, we introduce the Regularized Latent Component (RLC) module as an auxiliary regularization constraint operating on the *statistical readout* of latent representations. Unlike post-hoc analysis, the RLC module is fully integrated into the training graph and backpropagates gradients into the encoder.

Concretely, given the encoder output $Z_{\text{final}} \in \mathbb{R}^{B \times N \times C \times D}$, we compute channel-wise statistical descriptors by aggregating over the temporal and embedding dimensions:

$$\mu(Z_{\text{final}}) = \frac{1}{ND}\sum_{n=1}^{N}\sum_{d=1}^{D} Z_{\text{final}}[:, n, :, d], \tag{21}$$

$$\sigma(Z_{\text{final}}) = \sqrt{\frac{1}{ND}\sum_{n,d}\left(Z_{\text{final}}[:, n, :, d] - \mu(Z_{\text{final}})\right)^{2}}. \tag{22}$$

The resulting statistical feature vector is defined as:

$$F_{\text{stat}} = \text{Concat}(\mu(Z_{\text{final}}), \sigma(Z_{\text{final}})) \in \mathbb{R}^{B \times 2C}. \tag{23}$$

Similarly, we extract pooled future statistics from the ground-truth targets:

$$Y_{\text{pool}} = \text{Concat}(\text{Mean}(Y_{\text{gt}}), \text{Std}(Y_{\text{gt}})) \in \mathbb{R}^{B \times 2C}, \tag{24}$$

where statistics are computed over the forecasting horizon.
The RLC module projects $F_{\text{stat}}$ into a compact latent factor space:

$$Z_{\text{rlc}} = F_{\text{stat}} W_{\text{rlc}}, \quad W_{\text{rlc}} \in \mathbb{R}^{2C \times K}. \tag{25}$$

To encourage the extracted factors to capture distinct and complementary statistical patterns, we impose a column orthogonality constraint on the projection matrix:

$$\mathcal{L}_{\text{orth}} = \|W_{\text{rlc}}^{\top} W_{\text{rlc}} - I\|_{F}^{2}, \tag{26}$$

where $K \leq 2C$ is enforced to avoid rank deficiency and information redundancy. This constraint ensures that the linear projection does not introduce additional correlations or amplify perturbations within the statistical latent space (see Appendix [B](https://arxiv.org/html/2605.16449#A2) for detailed analysis).

To further align latent statistical factors with future targets, we maximize the Pearson Correlation Coefficient (PCC) between $Z_{\text{rlc}}$ and $Y_{\text{pool}}$ across the batch dimension:

$$\text{PCC}(\mathbf{u}, \mathbf{v}) = \frac{\sum_{i=1}^{B}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{B}(u_i - \bar{u})^{2}}\sqrt{\sum_{i=1}^{B}(v_i - \bar{v})^{2}}}, \tag{27}$$

where $\bar{u}$ and $\bar{v}$ denote batch-wise means. Computing PCC across the batch enforces distributional consistency, ensuring that relative statistical ordering is preserved while discouraging sample-wise memorization.

Finally, the overall training objective is formulated as:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}}(\hat{Y}, Y_{\text{gt}}) + \lambda_1 \mathcal{L}_{\text{orth}} - \lambda_2 \sum_{k=1}^{K} \text{PCC}(Z_{\text{rlc}}^{(k)}, Y_{\text{pool}}^{(k)}), \tag{28}$$

where $\lambda_1$ and $\lambda_2$ balance prediction accuracy, statistical orthogonality, and consistency constraints.
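The auxiliary terms of Eq. (28) can be computed as in the following NumPy sketch (forward pass only; in training they would be evaluated in an autodiff framework). Several details are assumptions: the small `eps` guarding the PCC denominator, the default weights `lam1` and `lam2`, and the pairing of the $k$-th latent factor with the $k$-th column of $Y_{\text{pool}}$, which follows the indexing of Eq. (28) as written.

```python
import numpy as np

def rlc_stats(z_final):
    """Eq. (21)-(23): (B, N, C, D) -> (B, 2C) channel-wise mean and std."""
    mu = z_final.mean(axis=(1, 3))
    sd = z_final.std(axis=(1, 3))
    return np.concatenate([mu, sd], axis=-1)

def orth_loss(w):
    """Eq. (26): squared Frobenius norm of W^T W - I."""
    k = w.shape[1]
    return float(np.sum((w.T @ w - np.eye(k)) ** 2))

def pcc(u, v, eps=1e-8):
    """Eq. (27): Pearson correlation of two length-B vectors across the batch."""
    u, v = u - u.mean(), v - v.mean()
    return float((u * v).sum() / (np.sqrt((u ** 2).sum() * (v ** 2).sum()) + eps))

def rlc_penalty(z_final, y_gt, w_rlc, lam1=1e-3, lam2=1e-3):
    """Auxiliary part of Eq. (28); y_gt has shape (B, O, C)."""
    f_stat = rlc_stats(z_final)
    y_pool = np.concatenate([y_gt.mean(axis=1), y_gt.std(axis=1)], axis=-1)  # Eq. (24)
    z_rlc = f_stat @ w_rlc                                                   # Eq. (25)
    corr = sum(pcc(z_rlc[:, k], y_pool[:, k]) for k in range(w_rlc.shape[1]))
    return lam1 * orth_loss(w_rlc) - lam2 * corr

rng = np.random.default_rng(0)
B, N, C, D, O, K = 32, 5, 3, 16, 24, 3             # K <= 2C
z_final = rng.normal(size=(B, N, C, D))
y_gt = rng.normal(size=(B, O, C))
w_rlc = np.linalg.qr(rng.normal(size=(2 * C, K)))[0]   # orthonormal columns
print(orth_loss(w_rlc) < 1e-12)                         # True
penalty = rlc_penalty(z_final, y_gt, w_rlc)
```

Because PCC is invariant to shifting and positive rescaling of either argument, the consistency term rewards the ordering of samples within a batch rather than the absolute values of the latent factors.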
## 3 Experiment
To evaluate PESD-TSF, we conduct experiments on three aspects: (1) long-term forecasting, emphasizing global trends and periodic patterns; (2) short-term forecasting, focusing on high-frequency local dynamics; and (3) ablation studies, assessing the contributions of period-aware gating, Cross-Scale Collaborative Attention (CSCA), RLC regularization, and hierarchical decoupling. Detailed settings are provided in Appendix [C](https://arxiv.org/html/2605.16449#A3).
### 3.1 Long-term Forecasting
To validate generalization, we categorize experiments into three groups: (1) Classic Benchmarks (Table [1](https://arxiv.org/html/2605.16449#S3.T1)), covering seven standard datasets like ETT, Electricity, and Weather; (2) Diverse Domain-Specific Datasets (Table [2](https://arxiv.org/html/2605.16449#S3.T2)), extending to Solar, Air Quality, and Medical domains (Luo and Wang, [2024a](https://arxiv.org/html/2605.16449#bib.bib22)); and (3) High-Dimensional Benchmarks (Appendix Table [10](https://arxiv.org/html/2605.16449#A5.T10)), comprising Meter, Atec, and Mobility for massive multivariate series. Following standard protocols ($L = 720$), we evaluate using MSE and MAE (details in Appendix [D.1](https://arxiv.org/html/2605.16449#A4.SS1)). PESD-TSF achieves Top-2 accuracy across most datasets, verifying its ability to capture intrinsic dependencies regardless of domain variations. Compared to DLinear (Zeng et al., [2023](https://arxiv.org/html/2605.16449#bib.bib5)), it reduces MSE and MAE by 14.96% and 14.87% respectively. Against lightweight SOTA models (FITS (Xu et al., [2023](https://arxiv.org/html/2605.16449#bib.bib17)), SparseTSF (Lin et al., [2025](https://arxiv.org/html/2605.16449#bib.bib18))), it achieves a 10.23% to 11.59% error reduction. Notably, PESD-TSF ranks 1st in 35 out of 38 metrics, underscoring superior robustness despite its compact design. Furthermore, Appendix Table [10](https://arxiv.org/html/2605.16449#A5.T10) confirms superior performance on high-dimensional datasets against specialized baselines. For full results, refer to Appendix [E](https://arxiv.org/html/2605.16449#A5).

Table 1: Long-term sequence prediction performance of PESD-TSF. The best and second-best results are highlighted in bold and underlined, respectively.

Table 2: Long-term sequence prediction performance of PESD-TSF. The best and second-best results are highlighted in bold and underlined, respectively.
### 3.2 Short-term Forecasting
We conducted short-term forecasting experiments on PeMS datasets (Input 96, Output 12) (Wang et al., [2023](https://arxiv.org/html/2605.16449#bib.bib6)) and evaluated performance using MAE, MAPE, and RMSE (details in Appendix [D.2](https://arxiv.org/html/2605.16449#A4.SS2)) (Zhang and Yan, [2023](https://arxiv.org/html/2605.16449#bib.bib7)). As shown in Table [3](https://arxiv.org/html/2605.16449#S3.T3), leveraging its effective modeling of complex spatiotemporal dependencies (Liu et al., [2022b](https://arxiv.org/html/2605.16449#bib.bib9)), PESD-TSF demonstrates a distinct performance advantage over channel-independent approaches like PatchTST (Nie et al., [2023](https://arxiv.org/html/2605.16449#bib.bib3)) and DLinear (Zeng et al., [2023](https://arxiv.org/html/2605.16449#bib.bib5)). Specifically, PESD-TSF achieved state-of-the-art results across all metrics on PeMS03, 07, and 08 (Liu et al., [2022a](https://arxiv.org/html/2605.16449#bib.bib8)), comprehensively outperforming TimeMixer (Wang et al., [2024](https://arxiv.org/html/2605.16449#bib.bib1)). Even on PeMS04, it secured a significant lead in MAPE (11.62 vs. 12.53), highlighting its dominance in multivariate spatiotemporal modeling tasks (Qiu et al., [2025b](https://arxiv.org/html/2605.16449#bib.bib23)).

Table 3: Short-term sequence prediction performance of PESD-TSF. The best and second-best results are highlighted in bold and underlined, respectively.
### 3.3 Ablation Studies
We conducted ablation studies (Table [4](https://arxiv.org/html/2605.16449#S3.T4)) to verify the contribution of each component. For comprehensive results, please refer to Table [11](https://arxiv.org/html/2605.16449#A5.T11). Key observations indicate:
- w/o Period: Removing period-aware gating significantly increased errors on datasets with calendar effects (e.g., Traffic, ETTh1), confirming that explicit temporal priors prevent signal dilution.
- w/o CSCA: The removal of this module degraded multivariate forecasting performance (especially on PeMS), highlighting the necessity of modeling dynamic coupling to capture system topology.
- w/o RLC: Removing structured constraints led to higher test errors, demonstrating that orthogonality and consistency constraints effectively curb overfitting and enforce the learning of intrinsic laws.
- w/o Hierarchy: Using single-level resolution caused substantial performance decay in long sequences ($L = 720$), validating the effectiveness of the hierarchical decoupling mechanism in shortening long-range dependency paths.

Table 4: Ablation studies of PESD-TSF on ETT datasets. Performance is measured by MSE and MAE.
### 3.4 Hyperparameter Analysis
This section systematically investigates the sensitivity of the PESD-TSF framework to two pivotal hyperparameters governing model constraints and prior integration: the RLC regularization weight $\lambda$ and the periodic coefficient $\gamma$.

First, regarding the regularization weight $\lambda$, we evaluated the MSE across the logarithmic scale $[10^{-5}, 10^{-1}]$ (Figure [4](https://arxiv.org/html/2605.16449#S3.F4)(a)). The results uncover a scale-dependent sensitivity: smaller datasets (e.g., ETTh1, ETTh2) manifest a distinct "U-shaped" error curve, indicating a critical need for balanced constraint strength, whereas larger datasets (e.g., Traffic, Electricity) exhibit greater robustness with flatter trajectories. Despite these variations, all datasets consistently achieve minimal error near $\lambda = 10^{-3}$. We observe that while unconstrained settings ($\lambda \to 0$) risk overfitting, excessive penalties ($\lambda > 10^{-2}$) overly constrict the latent space, limiting feature flexibility. Thus, $\lambda = 10^{-3}$ is identified as the optimal equilibrium for injecting physical inductive bias without hindering expressiveness.

Second, concerning the periodic coefficient $\gamma$, experiments on Traffic and ETTh1 (Figure [4](https://arxiv.org/html/2605.16449#S3.F4)(b)) reveal a convex performance trend within $[0, 1]$. Values that are too low ($\gamma < 0.1$) fail to effectively leverage periodic priors, while values approaching $1.0$ ($\gamma > 0.9$) overemphasize periodicity, risking the suppression of underlying trends or the fitting of high-frequency noise. Performance remains stable within the moderate range of $[0.3, 0.7]$. Consequently, we adopt a default of $\gamma = 0.5$ to balance local periodic feature extraction and global temporal modeling.

Figure 4: Hyperparameter sensitivity analysis on representative datasets. (a) RLC regularization weight $\lambda$: $\lambda \approx 10^{-3}$ yields the lowest MSE, effectively balancing structural constraints with feature flexibility. (b) Periodic coefficient $\gamma$: The "U-shaped" trend indicates an optimal $\gamma \approx 0.5$, effectively integrating periodic priors without introducing noise.
### 3.5 Interpretability Analysis: Visualization of Learned Spatiotemporal Dependencies
To validate CSCA's spatial modeling, we compare the physical topology with the learned attention matrix on PeMS04 (Figure [5](https://arxiv.org/html/2605.16449#S3.F5); methodologies in Appendix [G](https://arxiv.org/html/2605.16449#A7)). High diagonal activations align with physical sparsity, confirming that PESD-TSF preserves local neighborhood structures. Moreover, the heatmap reveals denser connectivity: distinct vertical stripes (e.g., indices 50, 150) identify global "traffic hubs," demonstrating the capture of dynamic long-range dependencies beyond static GNNs. Quantitatively, we evaluate the Top-5 Topology Matching Rate, defined as the Intersection over Union (IoU) between the top-5 edges with the highest attention weights and the ground-truth physical connections. The model achieves a rate of 20.92% (well above the random baseline of $<1\%$), validating physical interpretability. Regarding the diagonal structure in Figure 5, since the residual connections in our architecture implicitly handle node self-preservation, we superimpose an identity matrix onto the learned cross-variable attention map to represent these inherent self-loops. The resulting visualization confirms that PESD-TSF preserves both local neighborhood structures and global traffic hubs.
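One possible reading of the Top-5 Topology Matching Rate is sketched below. The text does not say whether the top-5 edges are selected per node or over the whole matrix, nor how self-loops are treated; this sketch assumes a per-node selection with self-loops excluded, and the function name `topk_edge_iou` is hypothetical. The toy ring graph is illustrative and unrelated to PeMS04.

```python
import numpy as np

def topk_edge_iou(attn, adj, k=5):
    """IoU between each node's k strongest attention edges and the true edges.

    attn: (C, C) attention weights; adj: (C, C) binary adjacency.
    Assumptions: top-k is taken per row, and self-loops are ignored on both sides.
    """
    c = attn.shape[0]
    scores = attn.astype(float).copy()
    np.fill_diagonal(scores, -np.inf)                  # never select self-loops
    top = np.argsort(-scores, axis=1)[:, :k]           # (C, k) column indices
    pred = np.zeros((c, c), dtype=bool)
    pred[np.arange(c)[:, None], top] = True
    true = adj.astype(bool) & ~np.eye(c, dtype=bool)
    union = (pred | true).sum()
    return float((pred & true).sum() / union) if union else 0.0

# Toy check: a 6-node ring, where the attention equals the adjacency itself.
c = 6
ring = np.zeros((c, c))
for i in range(c):
    ring[i, (i + 1) % c] = ring[i, (i - 1) % c] = 1.0
print(topk_edge_iou(ring, ring, k=2))  # 1.0

rng = np.random.default_rng(0)
random_rate = topk_edge_iou(rng.random((c, c)), ring, k=2)
```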

Figure 5: Comparison of spatial dependencies on PeMS04. (a) Physical ground truth shows local sparse connectivity; (b) CSCA learned matrix displays vertical stripe patterns.
### 3.6 Spectral Verification of Explicit Decomposition
To verify the physical semantics of PESD-TSF's explicit decomposition, we performed spectral density analysis on the components extracted by IntAttention. As shown in Figure [6](https://arxiv.org/html/2605.16449#S3.F6)(a) (ECL), the trend component is highly concentrated in the low-frequency range ($f < 0.1$), effectively capturing long-term trends, while the variation component dominates medium-to-high frequencies ($f > 0.2$). Figure [6](https://arxiv.org/html/2605.16449#S3.F6)(b) (ETTh1) demonstrates even sharper separation, with trends confined to low frequencies and variations exhibiting a broad-band distribution. This consistency confirms that PESD-TSF correctly assigns low frequencies to trends and medium-to-high frequencies to variations. Such adaptive decoupling validates the model's physical inductive bias, minimizing reliance on spurious correlations and significantly enhancing forecasting robustness.
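The kind of spectral check described here can be reproduced on a toy signal. The sketch below applies a width-3 moving average (the smoothing kernel of Stage 1) to a synthetic series and measures how the spectral power of each component is distributed. The signal, its two frequencies, and the thresholds are assumptions chosen for illustration; this is not the paper's data or its analysis script.

```python
import numpy as np

n = 1000
t = np.arange(n)
# Synthetic series: one slow component (f = 0.01) plus one fast component (f = 0.3).
x = np.sin(2 * np.pi * 0.01 * t) + np.sin(2 * np.pi * 0.3 * t)

trend = np.convolve(x, np.ones(3) / 3, mode="same")   # width-3 moving average
variation = x - trend                                  # detrended residual

freqs = np.fft.rfftfreq(n)                             # cycles per sample, 0 to 0.5

def power(sig):
    return np.abs(np.fft.rfft(sig)) ** 2

# Share of each component's spectral power in the band of interest.
trend_low_share = power(trend)[freqs < 0.1].sum() / power(trend).sum()
variation_high_share = power(variation)[freqs > 0.2].sum() / power(variation).sum()
print(trend_low_share > 0.9, variation_high_share > 0.9)  # True True
```

A width-3 uniform kernel has frequency response $(1 + 2\cos 2\pi f)/3$, which is close to 1 at $f = 0.01$ and about 0.13 at $f = 0.3$, so the moving average keeps the slow component almost intact and passes most of the fast component to the residual.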


Figure 6: Spectral verification on (a) ECL and (b) ETTh1. Trend (blue) concentrates in low frequencies, while variation (orange) dominates medium-to-high frequencies. This separation validates PESD-TSF's physical decoupling.

Figure 7: RLC analysis on Solar dataset. (a) Orthogonal loss: Rapid decline of $\mathcal{L}_{\text{orth}}$ confirms effective constraint enforcement. (b) Correlation heatmap: Near-zero off-diagonal elements verify effective decoupling of temporal dynamics.
### 3\.7Empirical Analysis of Latent Space Decoupling
To evaluate the decoupling capability of PESD\-TSF in the statistical latent space, we conducted a detailed analysis of the orthogonal constraint dynamics on the Solar dataset with a prediction length of $O=720$\.
As illustrated in Figure [7](https://arxiv.org/html/2605.16449#S3.F7), the orthogonal loss function $\mathcal{L}_{\text{orth}}$ declines sharply during initial training \(first 500 steps\) and converges to a negligible magnitude \($<10^{-6}$\), confirming that the RLC module effectively enforces orthogonality on the projection basis\. Subsequent low\-amplitude oscillations indicate that the model maintains orthogonality while optimizing the primary task, balancing accuracy and independence\. Figure [7](https://arxiv.org/html/2605.16449#S3.F7) also visualizes the Pearson correlation matrix of the projected latent factors\. Excluding diagonal entries, all off\-diagonal values are 0\.00, indicating eliminated correlations\. This observation aligns with the stability analysis in Appendix [B](https://arxiv.org/html/2605.16449#A2), confirming that PESD\-TSF disentangles statistical variations into independent orthogonal components\. By eliminating redundancy, each factor captures unique temporal patterns, enhancing interpretability and robustness\.
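The two diagnostics in this analysis, the orthogonality loss and the Pearson correlation matrix of the latent factors, can be reproduced in a few lines. The sketch below is an illustration under stated assumptions: it takes the loss to be the squared Frobenius norm of the Gram matrix of the projection weights minus the identity, as written in Appendix B, and it uses small hand-made matrices in place of trained weights and real Solar features. All function and variable names are hypothetical.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def orth_loss(w):
    """Squared Frobenius norm of (W^T W - I) for W of shape (d, k)."""
    gram = matmul(transpose(w), w)
    k = len(gram)
    return sum(
        (gram[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(k)
        for j in range(k)
    )


def pearson_matrix(z):
    """Pearson correlations between the columns of Z (rows are samples).

    Assumes no column is constant, otherwise the normalization divides by zero.
    """
    cols = transpose(z)
    centered = [[v - sum(c) / len(c) for v in c] for c in cols]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [
        [sum(x * y for x, y in zip(ci, cj)) / (ni * nj) for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


# Hypothetical projection weights: one with orthonormal columns, one collapsed.
w_orthonormal = [[0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.5, -0.5]]
w_collapsed = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

# Hypothetical latent factors (4 samples, 2 factors) with uncorrelated columns.
z = [[1.0, 1.0], [2.0, -1.0], [3.0, -1.0], [4.0, 1.0]]
corr = pearson_matrix(z)

print("loss (orthonormal):", orth_loss(w_orthonormal))
print("loss (collapsed):", orth_loss(w_collapsed))
print("off-diagonal correlation:", corr[0][1])
```

The sketch computes both quantities because they measure different things: the loss constrains the columns of the projection matrix, whereas the correlation matrix describes the projected factors, which also depend on the statistics of the input features.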
### 3\.8Further Analysis
We use additional experiments to examine whether the observed gains are robust to stronger baselines and practical deployment constraints, while leaving more fine\-grained sensitivity analyses to Appendix [F](https://arxiv.org/html/2605.16449#A6)\. To ensure that the improvements do not only arise from comparisons with linear baselines, we add representative attention\-based forecasters, including iTransformer, PatchTST, and TimeXer, under the same input length $L=720$\. Table [5](https://arxiv.org/html/2605.16449#S3.T5) reports the average results over four prediction horizons\. PESD\-TSF consistently outperforms the strongest Transformer baseline on Traffic, Electricity, and ETTh1, showing that the proposed structured decomposition remains beneficial when compared with competitive attention\-based alternatives\.
Table 5: Average forecasting performance against strong Transformer baselines under $L=720$\. Results are averaged over $O\in\{96,192,336,720\}$ and reported as MSE/MAE\.
We further examine the practical overhead of PESD\-TSF on the high\-dimensional Mobility dataset with 5,826 variables\. As shown in Table [6](https://arxiv.org/html/2605.16449#S3.T6), although PESD\-TSF has more parameters than iTransformer, it requires substantially less peak memory and shorter training time per epoch\. This indicates that CSCA avoids direct full\-sequence cross\-variable attention: after temporal aggregation, collaboration is modeled mainly along the variable dimension, leading to a favorable efficiency–performance trade\-off in high\-dimensional scenarios\.
Table 6: Efficiency comparison on Mobility with 5,826 variables\. All results are measured on a single GPU under identical settings\.
### 3\.9Structural Order and Generality
We also test whether the three\-stage order is merely a stack of interchangeable modules\. Under $L=720$, $O=720$, the full order $S1\rightarrow S2\rightarrow S3$ achieves 0\.437/0\.463 on ETTh1 and 0\.376/0\.257 on Traffic\. Reordering the stages degrades performance to 0\.448/0\.471 and 0\.390/0\.266, while bypassing Stage 2 yields 0\.452/0\.474 and 0\.394/0\.269\. Removing CSCA also increases the errors to 0\.447/0\.469 and 0\.388/0\.265\. These results support the design choice that PESD\-TSF first extracts denoised temporal structure, then compresses multi\-scale context, and only afterwards reconstructs cross\-variable collaboration from time\-aggregated representations\.
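To make the size of these ablation gaps easier to compare across datasets, the reported pairs can be converted into relative error increases over the full stage order. The sketch below uses only the numbers quoted in this paragraph and assumes each pair is ordered MSE/MAE, as in Table 5; the variant labels and function names are hypothetical.

```python
# Scores quoted in the text, assumed to be (MSE, MAE) pairs.
FULL = {"ETTh1": (0.437, 0.463), "Traffic": (0.376, 0.257)}
VARIANTS = {
    "reordered stages": {"ETTh1": (0.448, 0.471), "Traffic": (0.390, 0.266)},
    "without Stage 2": {"ETTh1": (0.452, 0.474), "Traffic": (0.394, 0.269)},
    "without CSCA": {"ETTh1": (0.447, 0.469), "Traffic": (0.388, 0.265)},
}


def relative_increase(base, value):
    """Percentage increase of value over base."""
    return 100.0 * (value - base) / base


def degradation(full, variants):
    """Map variant -> dataset -> (MSE increase %, MAE increase %)."""
    return {
        name: {
            ds: tuple(relative_increase(b, v) for b, v in zip(full[ds], scores))
            for ds, scores in per_ds.items()
        }
        for name, per_ds in variants.items()
    }


table = degradation(FULL, VARIANTS)
for name, per_ds in table.items():
    for ds, (mse_up, mae_up) in per_ds.items():
        print(f"{name:>16} | {ds:<7} | MSE +{mse_up:.2f}% | MAE +{mae_up:.2f}%")
```

Under this reading, bypassing Stage 2 yields the largest error on both datasets and both metrics, followed by reordering the stages and then removing CSCA.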
This ordering is consistent with the intended decomposition path of PESD\-TSF: temporal regularities are first stabilized by period\-aware modulation, local and multi\-scale variations are then aggregated through hierarchical processing, and cross\-variable collaboration is reconstructed after the temporal representation has been compressed\. The design is therefore not simply an accumulation of modules, but a constrained modeling sequence that reduces interference among trend, perturbation, and dependency learning\. Additional evidence on weakly periodic datasets, the RLC latent dimension, topology matching, and band\-wise spectral energy is provided in Appendix[F](https://arxiv.org/html/2605.16449#A6); these results further support the use of structured priors without over\-claiming strict physical recovery in every dataset\.
Taken together, the supplementary experiments clarify the scope of the proposed model\. The Transformer\-baseline comparison verifies competitiveness against strong sequence models, the Mobility experiment confirms that cross\-variable modeling remains feasible when the number of variables is very large, and the stage\-order ablation shows that the performance gain depends on the proposed decomposition order rather than on parameter count alone\. These results complement the main forecasting tables and suggest that PESD\-TSF is most useful when temporal regularity and inter\-variable coupling must be modeled jointly\.
## 4Conclusion
We propose PESD\-TSF, a structured decomposition framework that jointly models trend, perturbation, and cross\-variable collaboration for long\-term forecasting\. By integrating periodic gating, a multi\-scale structured encoder, CSCA, and RLC regularization, PESD\-TSF embeds physical priors directly into the forecasting process\. Experiments across diverse benchmarks show strong accuracy, scalability, and interpretable latent structure\.
## Acknowledgements
This work was supported in part by the following: the National Natural Science Foundation of China under Grant Nos\. U24A20328, U24A20219, 62272281, the Youth Innovation Technology Project of Higher School in Shandong Province under Grant No\. 2023KJ212, and the Yantai Natural Science Foundation under Grant No\. 2024JCYJ034\.
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here\.
## References
- S\. Chen, C\. Li, S\. Ö\. Arik, N\. C\. Yoder, and T\. Pfister \(2023\)TSMixer: an all\-mlp architecture for time series forecasting\.Trans\. Mach\. Learn\. Res2023\(\),pp\. 1–24\.Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p1.1),[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1)\.
- Z\. Chen, Y\. Hu, Z\. Fu, Z\. Li, J\. Huang, Q\. Huang, and Y\. Wei \(2026\)INTENT: invariance and discrimination\-aware noise mitigation for robust composed image retrieval\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 20463–20471\.Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p3.1)\.
- Z\. Chen, Y\. Hu, Z\. Li, Z\. Fu, H\. Wen, and W\. Guan \(2025\)HUD: hierarchical uncertainty\-aware disambiguation network for composed video retrieval\.InProceedings of the ACM International Conference on Multimedia,pp\. 6143–6152\.Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p3.1)\.
- Q\. Huang, Z\. Zhou, K\. Yang, Z\. Yi, X\. Wang, and Y\. Wang \(2025\)TimeBase: the power of minimalism in efficient long\-term time series forecasting\.InForty\-second International Conference on Machine Learning,Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1)\.
- Z\. Li, Y\. Hu, Z\. Chen, Q\. Huang, G\. Qiu, Z\. Fu, and M\. Liu \(2026a\)ReTrack: evidence\-driven dual\-stream directional anchor calibration network for composed video retrieval\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 23373–23381\.Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p3.1)\.
- Z\. Li, Y\. Hu, Z\. Chen, S\. Zhang, Q\. Huang, Z\. Fu, and Y\. Wei \(2026b\)HABIT: chrono\-synergia robust progressive learning framework for composed image retrieval\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 6762–6770\.Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p3.1)\.
- B\. Lim, S\. Ö\. Arık, N\. Loeff, and T\. Pfister \(2021\)Temporal fusion transformers for interpretable multi\-horizon time series forecasting\.International Journal of Forecasting37\(4\),pp\. 1748–1764\.External Links:[Document](https://dx.doi.org/10.1016/j.ijforecast.2021.03.012)Cited by:[§A\.2](https://arxiv.org/html/2605.16449#A1.SS2.p1.1)\.
- S\. Lin, W\. Lin, W\. Wu, H\. Chen, and C\. P\. Chen \(2025\)SparseTSF: lightweight and robust time series forecasting via sparse modeling\.IEEE Transactions on Pattern Analysis and Machine Intelligence\.Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§3\.1](https://arxiv.org/html/2605.16449#S3.SS1.p1.2)\.
- M\. Liu, A\. Zeng, M\. Chen, Z\. Xu, Q\. Lai, L\. Ma, and Q\. Xu \(2022a\)Scinet: time series modeling and forecasting with sample convolution and interaction\.Advances in Neural Information Processing Systems35,pp\. 5816–5828\.Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§3\.2](https://arxiv.org/html/2605.16449#S3.SS2.p1.1)\.
- P\. Liu, B\. Wu, Y\. Hu, N\. Li, T\. Dai, J\. Bao, and S\. Xia \(2025\)TimeBridge: non\-stationarity matters for long\-term time series forecasting\.InForty\-second International Conference on Machine Learning,Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1)\.
- Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long \(2024\)ITransformer: inverted transformers are effective for time series forecasting\.InThe Twelfth International Conference on Learning Representations,Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1)\.
- Y\. Liu, C\. Li, J\. Wang, and M\. Long \(2023\)Koopa: learning non\-stationary time series dynamics with koopman predictors\.Advances in neural information processing systems36,pp\. 12271–12290\.Cited by:[§A\.1](https://arxiv.org/html/2605.16449#A1.SS1.p1.1)\.
- Y\. Liu, H\. Wu, J\. Wang, and M\. Long \(2022b\)Non\-stationary transformers: exploring the stationarity in time series forecasting\.Advances in neural information processing systems35,pp\. 9881–9893\.Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§3\.2](https://arxiv.org/html/2605.16449#S3.SS2.p1.1)\.
- J\. Lu, F\. Zhang, X\. Zhang, Y\. Sun, and H\. Wang \(2025\)MCNR: multiscale feature\-based latent data component extraction linear regression model\.Expert Systems with Applications,pp\. 128634\.Cited by:[§A\.1](https://arxiv.org/html/2605.16449#A1.SS1.p2.1)\.
- D\. Luo and X\. Wang \(2024a\)DeformableTST: transformer for time series forecasting without over\-reliance on patching\.Advances in Neural Information Processing Systems37,pp\. 88003–88044\.Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§E\.1](https://arxiv.org/html/2605.16449#A5.SS1.p1.2),[§3\.1](https://arxiv.org/html/2605.16449#S3.SS1.p1.2)\.
- D\. Luo and X\. Wang \(2024b\)Moderntcn: a modern pure convolution structure for general time series analysis\.InThe twelfth international conference on learning representations,pp\. 1–43\.Cited by:[§1](https://arxiv.org/html/2605.16449#S1.p1.1)\.
- J\. Ni, S\. Wang, Z\. Liu, X\. Shi, X\. Zhong, Z\. Ye, and W\. Jin \(2025\)U\-cast: learning hierarchical structures for high\-dimensional time series forecasting\.arXiv preprint arXiv:2507\.15119\.Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1)\.
- Z\. Ni, H\. Yu, S\. Liu, J\. Li, and W\. Lin \(2023\)Basisformer: attention\-based time series forecasting with learnable and interpretable basis\.Advances in Neural Information Processing Systems36,pp\. 71222–71241\.Cited by:[§A\.1](https://arxiv.org/html/2605.16449#A1.SS1.p1.1)\.
- Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam \(2023\)A time series is worth 64 words: long\-term forecasting with transformers\.InInternational Conference on Learning Representations,Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§3\.2](https://arxiv.org/html/2605.16449#S3.SS2.p1.1)\.
- B\. N\. Oreshkin, D\. Carpov, N\. Chapados, and Y\. Bengio \(2019\)N\-beats: neural basis expansion analysis for interpretable time series forecasting\.arXiv preprint arXiv:1905\.10437\.Cited by:[§A\.2](https://arxiv.org/html/2605.16449#A1.SS2.p1.1)\.
- Z\. Qian, Y\. Pei, H\. Zareipour, and N\. Chen \(2019\)A review and discussion of decomposition\-based hybrid models for wind energy forecasting applications\.Applied energy235,pp\. 939–953\.Cited by:[§1](https://arxiv.org/html/2605.16449#S1.p1.1)\.
- G\. Qiu, Z\. Chen, Z\. Li, Q\. Huang, Z\. Fu, X\. Song, and Y\. Hu \(2026\)MELT: improve composed image retrieval via the modification frequentation\-rarity balance network\.arXiv preprint arXiv:2603\.29291\.Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p3.1)\.
- X\. Qiu, J\. Hu, L\. Zhou, X\. Wu, J\. Du, B\. Zhang, C\. Guo, A\. Zhou, C\. S\. Jensen, Z\. Sheng, and B\. Yang \(2024\)TFB: towards comprehensive and fair benchmarking of time series forecasting methods\.InProc\. VLDB Endow\.,pp\. 2363–2377\.Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p13.1)\.
- X\. Qiu, X\. Wu, H\. Cheng, X\. Liu, C\. Guo, J\. Hu, and B\. Yang \(2025a\)DBLoss: decomposition\-based loss function for time series forecasting\.InNeurIPS,Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p1.1)\.
- X\. Qiu, X\. Wu, Y\. Lin, C\. Guo, J\. Hu, and B\. Yang \(2025b\)Duet: dual clustering enhanced multivariate time series forecasting\.InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 1,pp\. 1185–1196\.Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§3\.2](https://arxiv.org/html/2605.16449#S3.SS2.p1.1)\.
- D\. Salinas, V\. Flunkert, J\. Gasthaus, and T\. Januschowski \(2020\)DeepAR: probabilistic forecasting with autoregressive recurrent networks\.International journal of forecasting36\(3\),pp\. 1181–1191\.Cited by:[§A\.2](https://arxiv.org/html/2605.16449#A1.SS2.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.Advances in neural information processing systems30\.Cited by:[§1](https://arxiv.org/html/2605.16449#S1.p1.1)\.
- H\. Wang, J\. Lu, and F\. Zhang \(2026a\)EEO\-tfv: escape\-explore optimizer for web\-scale time\-series forecasting and vision analysis\.InProceedings of the ACM Web Conference 2026,pp\. 7271–7282\.Cited by:[§A\.1](https://arxiv.org/html/2605.16449#A1.SS1.p1.1)\.
- H\. Wang, J\. Lu, and F\. Zhang \(2026b\)IdealTSF: can non\-ideal data contribute to enhancing the performance of time series forecasting models?\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 26224–26232\.Cited by:[§1](https://arxiv.org/html/2605.16449#S1.p1.1)\.
- H\. Wang, J\. Peng, F\. Huang, J\. Wang, J\. Chen, and Y\. Xiao \(2023\)MICN: multi\-scale local and global context modeling for long\-term series forecasting\.InICLR,Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p1.1),[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§3\.2](https://arxiv.org/html/2605.16449#S3.SS2.p1.1)\.
- S\. Wang, H\. Wu, X\. Shi, T\. Hu, H\. Luo, L\. Ma, J\. Y\. Zhang, and J\. Zhou \(2024\)TimeMixer: decomposable multiscale mixing for time series forecasting\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§A\.1](https://arxiv.org/html/2605.16449#A1.SS1.p1.1),[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§3\.2](https://arxiv.org/html/2605.16449#S3.SS2.p1.1)\.
- Y\. Wang, H\. Wang, and F\. Zhang \(2025\)A medical image segmentation model with auto\-dynamic convolution and location attention mechanism\.Computer Methods and Programs in Biomedicine261,pp\. 108593\.Cited by:[§A\.2](https://arxiv.org/html/2605.16449#A1.SS2.p2.1),[§1](https://arxiv.org/html/2605.16449#S1.p1.1)\.
- G\. Woo, C\. Liu, D\. Sahoo, A\. Kumar, and S\. Hoi \(2022\)Etsformer: exponential smoothing transformers for time\-series forecasting\.arXiv preprint arXiv:2202\.01381\.Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p1.1)\.
- H\. Wu, T\. Hu, Y\. Liu, H\. Zhou, J\. Wang, and M\. Long \(2023\)TimesNet: temporal 2d\-variation modeling for general time series analysis\.InProceedings of the Eleventh International Conference on Learning Representations,Note:ICLR 2023Cited by:[§A\.1](https://arxiv.org/html/2605.16449#A1.SS1.p1.1),[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1)\.
- H\. Wu, J\. Xu, J\. Wang, and M\. Long \(2021\)Autoformer: decomposition transformers with auto\-correlation for long\-term series forecasting\.Advances in neural information processing systems34,pp\. 22419–22430\.Cited by:[§A\.1](https://arxiv.org/html/2605.16449#A1.SS1.p1.1),[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p1.1),[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1)\.
- Z\. Xu, A\. Zeng, and Q\. Xu \(2023\)FITS: modeling time series with 10k parameters\.CoRR\.Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§3\.1](https://arxiv.org/html/2605.16449#S3.SS1.p1.2)\.
- W\. Xue, T\. Zhou, Q\. Wen, J\. Gao, B\. Ding, and R\. Jin \(2023\)Card: channel aligned robust blend transformer for time series forecasting\.arXiv preprint arXiv:2305\.12095\.Cited by:[§1](https://arxiv.org/html/2605.16449#S1.p1.1)\.
- A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\)Are transformers effective for time series forecasting?\.InProceedings of the AAAI conference on artificial intelligence,Vol\.37,pp\. 11121–11128\.Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§3\.1](https://arxiv.org/html/2605.16449#S3.SS1.p1.2),[§3\.2](https://arxiv.org/html/2605.16449#S3.SS2.p1.1)\.
- F\. Zhang, S\. Fan, and H\. Wang \(2026a\)Time\-tk: a multi\-offset temporal interaction framework combining transformer and kolmogorov\-arnold networks for time series forecasting\.InProceedings of the ACM Web Conference 2026,pp\. 7495–7506\.Cited by:[§A\.2](https://arxiv.org/html/2605.16449#A1.SS2.p2.1),[§1](https://arxiv.org/html/2605.16449#S1.p1.1)\.
- F\. Zhang, S\. Fan, and H\. Wang \(2026b\)TimeSAF: towards llm\-guided semantic asynchronous fusion for time series forecasting\.arXiv preprint arXiv:2604\.12648\.Cited by:[§A\.2](https://arxiv.org/html/2605.16449#A1.SS2.p2.1)\.
- F\. Zhang, Z\. Gu, and H\. Wang \(2026c\)Decoding with structured awareness: integrating directional, frequency\-spatial, and structural attention for medical image segmentation\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 12421–12429\.Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p3.1)\.
- Y\. Zhang and J\. Yan \(2023\)Crossformer: transformer utilizing cross\-dimension dependency for multivariate time series forecasting\.InThe eleventh international conference on learning representations,Cited by:[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1),[§3\.2](https://arxiv.org/html/2605.16449#S3.SS2.p1.1)\.
- H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang \(2021\)Informer: beyond efficient transformer for long sequence time\-series forecasting\.InProceedings of the AAAI conference on artificial intelligence,Vol\.35,pp\. 11106–11115\.Cited by:[§A\.1](https://arxiv.org/html/2605.16449#A1.SS1.p1.1),[Appendix C](https://arxiv.org/html/2605.16449#A3.p9.1)\.
- T\. Zhou, Z\. Ma, Q\. Wen, X\. Wang, L\. Sun, and R\. Jin \(2022\)Fedformer: frequency enhanced decomposed transformer for long\-term series forecasting\.InInternational Conference on Machine Learning,pp\. 27268–27286\.Cited by:[§A\.3](https://arxiv.org/html/2605.16449#A1.SS3.p1.1)\.
## Appendix ARelated Works
### A\.1Attention\-Based Multi\-Scale Time Series Forecasting
Attention\-based models remain one of the most influential paradigms for long\-term time series modeling \(Ni et al\., [2023](https://arxiv.org/html/2605.16449#bib.bib38)\)\. However, conventional single\-scale Transformers suffer from inherent limitations in long\-range dependency preservation and periodic structure representation\. Informer \(Zhou et al\., [2021](https://arxiv.org/html/2605.16449#bib.bib12)\) reduces the computational cost of global attention through the ProbSparse mechanism, enabling larger effective temporal windows, yet it still relies on downstream attention layers to implicitly discover periodic patterns\. Autoformer \(Wu et al\., [2021](https://arxiv.org/html/2605.16449#bib.bib11)\) replaces dot\-product attention with an autocorrelation mechanism, allowing the model to directly capture repetitive temporal segments and thereby enhancing trend\-aware representations\. TimesNet \(Wu et al\., [2023](https://arxiv.org/html/2605.16449#bib.bib4)\) introduces multi\-period convolutional blocks to explicitly unfold multiple temporal variations \(Liu et al\., [2023](https://arxiv.org/html/2605.16449#bib.bib37)\), preventing periodic structures from being implicitly encoded within attention matrices\. More recently, TimeMixer \(Wang et al\., [2024](https://arxiv.org/html/2605.16449#bib.bib1)\) further develops decomposable mixing modules to jointly decompose and model multi\-scale components within a unified framework, achieving improved scale adaptivity\. Web\-scale forecasting studies further highlight the importance of scalable model design under heterogeneous temporal signals \(Wang et al\., [2026a](https://arxiv.org/html/2605.16449#bib.bib27)\)\.
Overall, multi\-scale attention\-based models have evolved from “searching for patterns via global attention scales” \(Lu et al\., [2025](https://arxiv.org/html/2605.16449#bib.bib35)\) toward “explicitly constructing decomposable, multi\-period structures\.” Nevertheless, the structural correlations among variables and between trend components remain insufficiently modeled in a unified manner\. In contrast, PESD\-TSF not only follows the paradigm of explicit decomposition but advances it through a Multi\-Scale Structured Encoder, which leverages the CSCA module to reconstruct cross\-variable collaborative structures in deep feature spaces that are overlooked by prior methods, enabling unified modeling of multi\-scale temporal dynamics and spatial topological dependencies\.
### A\.2Utilization of Temporal Priors and Calendar Features
Long\-term time series forecasting relies not only on numerical patterns in historical observations but also on structured exogenous priors inherent to real\-world systems, such as work cycles, holiday rhythms, and diurnal variations, collectively referred to as calendar\-driven factors\. DeepAR \(Salinas et al\., [2020](https://arxiv.org/html/2605.16449#bib.bib15)\) was among the earliest approaches to integrate temporal prior encodings with sequence modeling, leveraging handcrafted features such as minute\-of\-hour, day\-of\-week, and holiday indicators to improve seasonal forecasting performance\. N\-BEATS \(Oreshkin et al\., [2019](https://arxiv.org/html/2605.16449#bib.bib14)\) explicitly incorporates temporal priors into the representation space through trend and seasonality basis functions, reducing the burden on deep networks to implicitly learn periodic structures\. Temporal Fusion Transformer \(Lim et al\., [2021](https://arxiv.org/html/2605.16449#bib.bib16)\) further combines temporal features, categorical variables, and attention mechanisms, enabling interpretable conditional forecasting guided by external priors\. More recently, Uni2TS maps temporal priors into learnable representations, allowing adaptive prior transfer across datasets\.
Overall, this research direction has evolved from handcrafted calendar features toward learnable continuous\-time embeddings and structured prior encoding \(Zhang et al\., [2026a](https://arxiv.org/html/2605.16449#bib.bib26), [b](https://arxiv.org/html/2605.16449#bib.bib28)\)\. However, a unified framework that jointly models multi\-period temporal patterns, cross\-variable collaboration, and trend\-level interaction structure remains underexplored \(Wang et al\., [2025](https://arxiv.org/html/2605.16449#bib.bib34)\)\. To this end, PESD\-TSF introduces a Multiplicative Periodic Gating mechanism\. Unlike additive positional encodings or simple feature concatenation, this mechanism transforms discrete calendar features into continuous physical priors and applies dynamic amplitude modulation, explicitly enhancing the deep network’s ability to perceive and preserve multi\-frequency periodic patterns\.
### A\.3Latent Factors and Structured Decomposition
Structured decomposition has emerged as a central paradigm for addressing the three\-dimensional structure of long\-term time series forecasting \(Chen et al\., [2023](https://arxiv.org/html/2605.16449#bib.bib39); Qiu et al\., [2025a](https://arxiv.org/html/2605.16449#bib.bib32)\), namely trend, perturbation, and cross\-variable dependencies\. Autoformer \(Wu et al\., [2021](https://arxiv.org/html/2605.16449#bib.bib11)\) introduces a learnable residual decomposition that explicitly models trend and seasonality, endowing Transformer architectures with interpretable sequence dynamics\. FEDformer \(Zhou et al\., [2022](https://arxiv.org/html/2605.16449#bib.bib10)\) incorporates Fourier sparse blocks that project dominant periodic structures into the frequency domain, allowing trend and seasonality to be embedded in latent spaces with low\-rank and sparse representations\. MICN \(Wang et al\., [2023](https://arxiv.org/html/2605.16449#bib.bib6)\) further combines decomposition with mixed convolutional kernels, enabling hierarchical extraction of multi\-period components\. In parallel, other approaches emphasize the orthogonality and decomposability of latent factors; for instance, ETSformer \(Woo et al\., [2022](https://arxiv.org/html/2605.16449#bib.bib13)\) decomposes time series into multiple interpretable latent trajectories, facilitating structural reconstruction in the representation space\.
Overall, the evolution of decomposition\-based methods has shifted from post hoc component separation toward embedding structural priors along the forward modeling path\. However, most existing approaches focus on either trend or periodic components in isolation and lack a unified, interpretable structural constraint that jointly captures trend dynamics, local perturbations, and cross\-variable dependencies\. To address this limitation, PESD\-TSF introduces a Regularized Latent Component \(RLC\) module\. By enforcing orthogonality and physical consistency constraints, RLC further disentangles the latent space into three mutually exclusive dimensions \(trend, perturbation, and collaboration\), thereby providing a mathematical guarantee that the model not only fits the data but also reconstructs dynamical structures consistent with physical intuition\.
Recent studies in other structured perception tasks have also emphasized robust representation calibration under ambiguous or compositional inputs, including evidence\-driven anchor calibration \(Li et al\., [2026a](https://arxiv.org/html/2605.16449#bib.bib44)\), progressive robust learning \(Li et al\., [2026b](https://arxiv.org/html/2605.16449#bib.bib45)\), noise mitigation \(Chen et al\., [2026](https://arxiv.org/html/2605.16449#bib.bib46)\), uncertainty\-aware disambiguation \(Chen et al\., [2025](https://arxiv.org/html/2605.16449#bib.bib47)\), modification\-frequency balancing \(Qiu et al\., [2026](https://arxiv.org/html/2605.16449#bib.bib48)\), and structured attention for medical segmentation \(Zhang et al\., [2026c](https://arxiv.org/html/2605.16449#bib.bib29)\)\. Although these works target retrieval or segmentation rather than forecasting, they reflect a broader trend toward explicitly structured and uncertainty\-aware representation learning, which is conceptually aligned with our motivation to reduce entanglement and improve robustness in temporal latent spaces\.
## Appendix BAnalysis of Orthogonal Projection Stability
We aim to quantify the stability of the RLC projection head by analyzing the response of the latent factors $Z_{rlc}=F_{\text{stat}}W_{rlc}$ to perturbations in the statistical feature space\. The stability of this linear map is determined by the spectral norm \(i\.e\., the maximum singular value\) of its transformation matrix $W_{rlc}$\.
Consider a perturbation $\epsilon\in\mathbb{R}^{B\times 2C}$ introduced to the statistical features $F_{\text{stat}}$\. The resulting variation in the projected latent space is given by:

$$\Delta Z_{rlc}=(F_{\text{stat}}+\epsilon)W_{rlc}-F_{\text{stat}}W_{rlc}=\epsilon W_{rlc} \tag{29}$$

To measure the magnitude of this variation, we examine its Frobenius norm\. Utilizing the consistency property of matrix norms \(i\.e\., $\|\mathbf{A}\mathbf{B}\|_{F}\leq\|\mathbf{A}\|_{F}\|\mathbf{B}\|_{2}$\), we establish the following upper bound:

$$\|\Delta Z_{rlc}\|_{F}=\|\epsilon W_{rlc}\|_{F}\leq\|\epsilon\|_{F}\|W_{rlc}\|_{2} \tag{30}$$

where $\|W_{rlc}\|_{2}$ denotes the spectral norm of the weight matrix\.
During model training, minimizing the orthogonality loss $\mathcal{L}_{\text{orth}}=\|W_{rlc}^{T}W_{rlc}-I\|_{F}^{2}$ encourages $W_{rlc}$ to satisfy the column\-orthonormality condition $W_{rlc}^{T}W_{rlc}=I_{K}$\. From a linear algebra perspective, this condition strictly constrains all non\-zero singular values of $W_{rlc}$ to be 1\. Consequently, as the loss converges, the spectral norm approaches unity:

$$\|W_{rlc}\|_{2}=\sigma_{\max}(W_{rlc})\approx 1 \tag{31}$$

Substituting this spectral norm value back into the inequality, we derive the stability bound:

$$\|\Delta Z_{rlc}\|_{F}\leq\|\epsilon\|_{F}\cdot\underbrace{\|W_{rlc}\|_{2}}_{\approx 1}\approx\|\epsilon\|_{F} \tag{32}$$
This derivation shows that, once $W_{rlc}$ has \(near\-\)orthonormal columns, the linear projection from the statistical readout $F_{\text{stat}}$ to the RLC latent space is approximately 1\-Lipschitz with respect to its input\. Consequently, the Euclidean distance between any two samples in the statistical feature space is not expanded by the projection head\. Because orthogonality is enforced through $\mathcal{L}_{\text{orth}}$, the bound holds to the extent that this loss has converged; in that regime it caps the gain of the readout mechanism, ensuring that the auxiliary task remains numerically stable and does not introduce artificial variance into the gradient backpropagation process\.
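The stability bound can be checked numerically on a small example. The sketch below is an illustration rather than part of the original analysis: it builds a projection matrix with exactly orthonormal columns by modified Gram-Schmidt (standing in for a matrix on which the orthogonality loss has converged), draws a random perturbation, and compares the Frobenius norm of the projected perturbation with that of the perturbation itself. The dimensions, the seed, and all names are hypothetical.

```python
import math
import random


def gram_schmidt(columns):
    """Orthonormalize a list of column vectors with modified Gram-Schmidt."""
    basis = []
    for v in columns:
        w = list(v)
        for b in basis:
            proj = sum(x * y for x, y in zip(w, b))
            w = [x - proj * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis


def frobenius(m):
    return math.sqrt(sum(x * x for row in m for x in row))


def project(eps, w_columns):
    """Compute eps @ W, where W is supplied as a list of its columns."""
    return [[sum(e * c for e, c in zip(row, col)) for col in w_columns] for row in eps]


random.seed(0)
feature_dim, latent_dim, batch = 6, 3, 8  # feature_dim plays the role of 2C, latent_dim of K

# Projection matrix with orthonormal columns, so its spectral norm is 1.
w_columns = gram_schmidt(
    [[random.gauss(0.0, 1.0) for _ in range(feature_dim)] for _ in range(latent_dim)]
)

# Random perturbation of the statistical features.
eps = [[random.gauss(0.0, 0.1) for _ in range(feature_dim)] for _ in range(batch)]

delta_z = project(eps, w_columns)
ratio = frobenius(delta_z) / frobenius(eps)
print(f"||eps W||_F / ||eps||_F = {ratio:.4f}")
```

The ratio never exceeds 1 for a matrix with orthonormal columns, and it is typically strictly below 1 when the latent dimension is smaller than the feature dimension, since the projection discards the part of the perturbation lying outside the span of the columns.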
## Appendix CExperiment Setup
Description
We follow standard LTSF benchmark protocols, consistent with recent state\-of\-the\-art models such as TimeBase and TimeBridge, using a lookback window of 720\. Data splits and preprocessing strictly follow the settings adopted by prior benchmark methods\.
Datasets
Table 7: Dataset statistics\. Var: variables, Length: total length, L: lookback window, O: prediction horizon, Freq: frequency, Scale: total data points\.

| | Dataset | Var | Length | L | O | Freq | Scale |
|---|---|---|---|---|---|---|---|
| Normal | ETTh1 | 7 | 14,400 | 720 | $96\sim 720$ | 1h | 0\.1M |
| | ETTh2 | 7 | 14,400 | 720 | $96\sim 720$ | 1h | 0\.1M |
| | ETTm1 | 7 | 57,600 | 720 | $96\sim 720$ | 15m | 0\.4M |
| | ETTm2 | 7 | 57,600 | 720 | $96\sim 720$ | 15m | 0\.4M |
| | Weather | 21 | 52,696 | 720 | $96\sim 720$ | 10m | 1\.1M |
| | Electricity | 321 | 26,304 | 720 | $96\sim 720$ | 1h | 8\.1M |
| | Traffic | 862 | 17,544 | 720 | $96\sim 720$ | 1h | 15\.0M |
| | Solar | 137 | 52,560 | 720 | $96\sim 720$ | 10m | 7\.2M |
| | Wind | 7 | 48,673 | 720 | $96\sim 720$ | 15m | 0\.4M |
| | METR\-LA | 207 | 34,272 | 720 | $96\sim 720$ | 5m | 7\.1M |
| | AQShunyi | 11 | 35,064 | 720 | $96\sim 720$ | 1h | 0\.4M |
| | AQWan | 11 | 35,064 | 720 | $96\sim 720$ | 1h | 0\.4M |
| | ZafNoo | 11 | 19,225 | 720 | $96\sim 720$ | 30m | 0\.2M |
| | CzeLan | 11 | 19,934 | 720 | $96\sim 720$ | 30m | 0\.2M |
| | PM2\.5 | 184 | 11,688 | 720 | $96\sim 720$ | 3h | 2\.2M |
| | Temp | 184 | 11,688 | 720 | $96\sim 720$ | 3h | 2\.2M |
| | LargeMeter | 2,898 | 26,499 | 1008 | 336 | 30m | 76\.8M |
| | Atec | 1,569 | 6,915 | 1008 | 336 | 10m | 10\.8M |
| | Mobility | 5,826 | 921 | 35 | 7 | 1d | 5\.4M |
We conducted extensive experiments on 20 real\-world datasets, categorized into three distinct groups based on their statistical characteristics and scale:
Classic Multivariate Benchmarks: This category comprises seven standard datasets spanning the energy, meteorology, and traffic domains \(ETTh1\-2, ETTm1\-2, Electricity, Weather, Traffic\)\. Serving as the primary reference for SOTA comparisons, these datasets are used to evaluate the model’s fundamental capability in fitting non\-stationary trends and seasonality\.
Diverse Domain-Specific Datasets: We extend our evaluation to include environmental (AQShunyi, AQWan), traffic (METR-LA), and medical (ZafNoo, CzeLan) domains. Characterized by complex noise distributions and latent dependencies, these datasets are employed to verify the generalization robustness of the model across diverse application scenarios.
High-Dimensional & Strongly Coupled Datasets: We introduce three high-dimensional multivariate datasets: Meter, Mobility, and Atec. These datasets are distinguished by their extreme dimensionality (e.g., Mobility contains 5,826 variables, and Meter contains 2,898) and intense cross-variable dependencies (with average correlation coefficients exceeding 0.84). Spanning cloud computing, energy, and social flow domains, these datasets serve to challenge the computational efficiency of the model when processing thousands of input variables and to validate its ability to capture Global Synergy structures in strongly coupled environments.
The dataset statistics are summarized in Table [7](https://arxiv.org/html/2605.16449#A3.T7).
Baseline
We compare PESD-TSF with a comprehensive set of SOTA baselines covering diverse paradigms:
- Recent SOTA (2024-2025): TimeBase (Huang et al., [2025](https://arxiv.org/html/2605.16449#bib.bib19)), SparseTSF (Lin et al., [2025](https://arxiv.org/html/2605.16449#bib.bib18)), TimeBridge (Liu et al., [2025](https://arxiv.org/html/2605.16449#bib.bib20)), Duet (Qiu et al., [2025b](https://arxiv.org/html/2605.16449#bib.bib23)), DeformableTST (Luo and Wang, [2024a](https://arxiv.org/html/2605.16449#bib.bib22)), U-CAST (Ni et al., [2025](https://arxiv.org/html/2605.16449#bib.bib21)), and FITS (Xu et al., [2023](https://arxiv.org/html/2605.16449#bib.bib17)).
- Transformers: iTransformer (Liu et al., [2024](https://arxiv.org/html/2605.16449#bib.bib2)), PatchTST (Nie et al., [2023](https://arxiv.org/html/2605.16449#bib.bib3)), Crossformer (Zhang and Yan, [2023](https://arxiv.org/html/2605.16449#bib.bib7)), Non-stationary Transformer (Stationary) (Liu et al., [2022b](https://arxiv.org/html/2605.16449#bib.bib9)), Autoformer (Wu et al., [2021](https://arxiv.org/html/2605.16449#bib.bib11)), and Informer (Zhou et al., [2021](https://arxiv.org/html/2605.16449#bib.bib12)).
- Linear, CNN-based & Others: TimeMixer (Wang et al., [2024](https://arxiv.org/html/2605.16449#bib.bib1)), DLinear (Zeng et al., [2023](https://arxiv.org/html/2605.16449#bib.bib5)), TimesNet (Wu et al., [2023](https://arxiv.org/html/2605.16449#bib.bib4)), TSMixer (Chen et al., [2023](https://arxiv.org/html/2605.16449#bib.bib39)), SCINet (Liu et al., [2022a](https://arxiv.org/html/2605.16449#bib.bib8)), and MICN (Wang et al., [2023](https://arxiv.org/html/2605.16449#bib.bib6)).
Implementation Details
All models are implemented in PyTorch and trained on NVIDIA RTX 4060 (8GB) and A800 (80GB) GPUs using the Adam optimizer. The learning rate is selected via grid search from $\{10^{-3}, 10^{-4}, 5\times 10^{-4}\}$.
To ensure the statistical robustness and reproducibility of our results, each experiment was repeated three times using different random seeds (e.g., 2021–2025). Please refer to Appendix [J](https://arxiv.org/html/2605.16449#A10) for specific experimental details.
Specific Hyperparameters: To ensure reproducibility, we specify the key structural parameters:
- Patching: Patch length $P=16$ and stride $S=8$.
- Dimensions: Embedding dimension $D=64$; RLC latent factor count $K=2C$ (matching the input feature dimension).
- Coefficients: Periodic gating intensity $\gamma=0.5$ (default).
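To make the patching setting concrete, the following sketch splits a lookback window of 720 steps into patches of length 16 with stride 8. It assumes plain sliding windows with no end padding, which the paper does not specify; `patch_starts` and `split_into_patches` are hypothetical helper names.

```python
def patch_starts(seq_len, patch_len, stride):
    """Start indices of all full-length patches that fit inside the sequence."""
    return list(range(0, seq_len - patch_len + 1, stride))

def split_into_patches(series, patch_len=16, stride=8):
    """Cut a 1-D series into overlapping patches (no padding, an assumption)."""
    starts = patch_starts(len(series), patch_len, stride)
    return [series[s:s + patch_len] for s in starts]

# A dummy lookback window whose values equal their time index.
window = list(range(720))
patches = split_into_patches(window)

print(len(patches))        # number of patches per variable
print(patches[1][0])       # the second patch starts one stride later
print(patches[-1][-1])     # last time index covered by the final patch
```

Under these assumptions the patch count is $\lfloor (L-P)/S \rfloor + 1$, and consecutive patches overlap by $P-S$ steps. An implementation that pads the end of the window would produce one additional patch.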
Baseline Protocol: Following standard benchmark protocols (Qiu et al., [2024](https://arxiv.org/html/2605.16449#bib.bib31)), we fix the lookback window at $L=720$ for all models to ensure a fair comparison on long-term dependency modeling capabilities, unless a baseline strictly requires a specific input length (e.g., rigid architecture constraints), in which case we follow its official configuration.
Table 8: Long-term sequence prediction performance of PESD-TSF. All results are averaged across four different prediction lengths: $O\in\{96, 192, 336, 720\}$. The best and second-best results are highlighted in bold and underlined, respectively.
## Appendix D Metrics
### D.1 Long-term Forecasting
Table 9: Long-term sequence prediction performance of PESD-TSF. All results are averaged across four different prediction lengths: $O\in\{96, 192, 336, 720\}$. The best and second-best results are highlighted in bold and underlined, respectively.
We adopt Mean Squared Error (MSE) and Mean Absolute Error (MAE) as evaluation metrics for long-term forecasting. Given the ground-truth observations $Y_{i}$ and the corresponding model predictions $\hat{Y}_{i}$, the metrics are defined as:
$$\text{MSE}=\frac{1}{N_{\text{eval}}}\sum_{i=1}^{N_{\text{eval}}}(Y_{i}-\hat{Y}_{i})^{2},\quad\text{MAE}=\frac{1}{N_{\text{eval}}}\sum_{i=1}^{N_{\text{eval}}}|Y_{i}-\hat{Y}_{i}|,$$
where $N_{\text{eval}}$ denotes the total number of predicted values.
### D.2 Short-term Forecasting
For short-term forecasting, we evaluate performance using MAE (as defined above), Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE), defined as:
$$\text{MAPE}=\frac{1}{N_{\text{eval}}}\sum_{i=1}^{N_{\text{eval}}}\left|\frac{Y_{i}-\hat{Y}_{i}}{Y_{i}}\right|\times 100,\quad\text{RMSE}=\sqrt{\frac{1}{N_{\text{eval}}}\sum_{i=1}^{N_{\text{eval}}}(Y_{i}-\hat{Y}_{i})^{2}}.$$
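The four metrics used in this appendix (MSE, MAE, MAPE, RMSE) can be written directly from their definitions. The sketch below works on flat lists of values; the sample numbers are made up for illustration and are not results from the paper. Note that MAPE is undefined when any ground-truth value is zero, so the toy targets are chosen to be nonzero.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over all predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error over all predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; requires nonzero targets."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values for illustration only.
y_true = [2.0, 4.0, 5.0, 10.0]
y_pred = [1.0, 5.0, 5.0, 8.0]

print(mse(y_true, y_pred), mae(y_true, y_pred))
print(rmse(y_true, y_pred), mape(y_true, y_pred))
```

For multivariate forecasts, $N_{\text{eval}}$ counts every predicted value, so the series would first be flattened over samples, horizon steps, and variables before calling these functions.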
## Appendix E Full Results
### E.1 Main Experiments
Tables [8](https://arxiv.org/html/2605.16449#A3.T8), [9](https://arxiv.org/html/2605.16449#A4.T9), and [10](https://arxiv.org/html/2605.16449#A5.T10) present the comprehensive forecasting results derived from our experiments. Specifically, Table [8](https://arxiv.org/html/2605.16449#A3.T8) details performance on Classic Benchmarks (Table [1](https://arxiv.org/html/2605.16449#S3.T1)), covering seven standard datasets including ETT, Electricity, and Traffic, while Table [9](https://arxiv.org/html/2605.16449#A4.T9) extends the evaluation to Diverse Domain-Specific Datasets (Table [2](https://arxiv.org/html/2605.16449#S3.T2)), encompassing Solar, Air Quality, and Medical domains (Luo and Wang, [2024a](https://arxiv.org/html/2605.16449#bib.bib22)). Additionally, Table [10](https://arxiv.org/html/2605.16449#A5.T10) reports results on High-Dimensional Benchmarks. Regarding the lookback window setting: for the datasets in Tables [8](https://arxiv.org/html/2605.16449#A3.T8) and [9](https://arxiv.org/html/2605.16449#A4.T9), the input length was fixed at 720 to strictly evaluate long-range modeling capabilities. In contrast, for the high-dimensional datasets in Table [10](https://arxiv.org/html/2605.16449#A5.T10), we adopted specific protocols: Meter and Atec utilize an input length of 1008 to forecast 336 steps ($3\times$ horizon), whereas Mobility employs an input length of 35 to forecast 7 steps ($5\times$ horizon).
### E.2 Ablation Studies
Table [11](https://arxiv.org/html/2605.16449#A5.T11) presents the full results of the ablation studies discussed in the main text.
Table 10: Performance benchmarking of PESD-TSF on high-dimensional multivariate time series with intense cross-variable dependencies. The best and second-best results are highlighted in bold and underlined, respectively.
Table 11: Full ablation studies on ETT, Weather, Traffic, Electricity, and Solar datasets. Performance is measured by MSE and MAE.
## Appendix F Additional Analysis
### F.1 Weakly Periodic Datasets
We examine whether PESD-TSF remains effective when periodic patterns are weak or not visually obvious. Table [12](https://arxiv.org/html/2605.16449#A6.T12) compares the full model with a variant without Periodic Gating on PM2.5, AQShunyi, and CzeLan. Removing the periodic module increases the error on all three datasets, but the degradation is moderate compared with strongly seasonal datasets. This behavior is consistent with the role of Periodic Gating as an adaptive prior rather than a hard assumption: when calendar-driven structure is weak, the gain of the periodic pathway naturally decreases, while multi-scale decomposition and cross-variable collaboration continue to provide robust forecasting capacity.
Table 12: Applicability on weakly periodic or non-obviously periodic datasets.
### F.2 RLC Latent Dimension
We analyze the latent dimension $K$ in the RLC module, which controls the trade-off between compression and statistical decorrelation. As shown in Table [13](https://arxiv.org/html/2605.16449#A6.T13), overly small $K$ values produce low latent correlation but sacrifice forecasting accuracy due to excessive bottlenecking. In contrast, overly large $K$ values preserve more information but increase redundancy and weaken the regularization effect. The setting $K=2C$ achieves the best MSE/MAE on ETTh1, suggesting a balanced configuration between information preservation and latent-space decorrelation.
Table 13: Ablation on latent dimension $K$ in RLC on ETTh1 ($L=720$, $O=720$).
### F.3 Topology and Frequency Statistics
We further refine the interpretation of the topology and frequency analyses with quantitative evidence. For the learned cross-variable dependency map on PeMS04, PESD-TSF obtains an off-diagonal Top-10 matching rate of $8.42\%$ against the physical adjacency, compared with $1.69\%$ for random selection, indicating statistically meaningful alignment with known topology rather than full mechanistic recovery. For spectral decomposition, the trend component allocates $62.57/24.73/12.70\%$ of its energy to low/mid/high frequency bands, whereas the variation component allocates $28.99/33.64/37.37\%$. These results support a calibrated claim: PESD-TSF concentrates trend information in low frequencies and distributes variation more broadly over mid-to-high frequencies, while avoiding overstatement of strict physical interpretability.
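The exact definition of the Top-10 matching rate is not spelled out here, so the sketch below shows one plausible reading: for each variable, take its $k$ largest off-diagonal attention weights and count how many point to physical neighbors. The random baseline is taken to be the off-diagonal edge density. Both choices are assumptions, as are the function names and the 4-node toy graph.

```python
def topk_matching_rate(attn, adj, k):
    """Share of each row's top-k off-diagonal attention targets that are true edges."""
    n = len(attn)
    hits = 0
    total = 0
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        top = sorted(candidates, key=lambda j: attn[i][j], reverse=True)[:k]
        hits += sum(adj[i][j] for j in top)
        total += len(top)
    return hits / total

def random_baseline(adj):
    """Expected hit rate of uniformly random off-diagonal picks (edge density)."""
    n = len(adj)
    edges = sum(adj[i][j] for i in range(n) for j in range(n) if i != j)
    return edges / (n * (n - 1))

# Toy chain graph 0-1-2-3 as a symmetric binary adjacency matrix.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
# Made-up attention weights; row 3 attends most strongly to a non-neighbor.
attn = [
    [0.5, 0.3, 0.1, 0.1],
    [0.2, 0.4, 0.3, 0.1],
    [0.1, 0.1, 0.5, 0.3],
    [0.3, 0.1, 0.2, 0.4],
]

print(topk_matching_rate(attn, adj, k=1))
print(random_baseline(adj))
```

A matching rate well above the random baseline, as in the PeMS04 figures reported above, indicates alignment with the physical graph without implying that the graph is fully recovered.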
## Appendix G Physical Topology Verification of Cross-Variable Synergy (CSCA)
### G.1 Physical Ground Truth Construction
The physical topology ground truth is derived from the sensor distance matrix provided by the PeMS datasets\. To address the sparsity and directionality of the raw data, we implemented the following preprocessing steps to construct a robust evaluation benchmark:
- Symmetrization: Considering the bidirectional nature of traffic flow influence (e.g., congestion waves propagate both upstream and downstream), we symmetrized the distance matrix such that $A_{ij}=A_{ji}$.
- 2-Hop Expansion: Given the extreme sparsity of the PeMS04 road network (where first-order connections are minimal), we extended the physical neighborhood to include second-order neighbors (i.e., “neighbors of neighbors”) to comprehensively evaluate connectivity. This resulted in a connectivity-based binary adjacency matrix.
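The two preprocessing steps above can be sketched on a small binary connectivity matrix. This is an illustration under assumptions: the raw matrix is treated as already binarized (an entry is 1 wherever a sensor distance is recorded), and the function names and the 4-node toy graph are hypothetical.

```python
def symmetrize(adj):
    """Make the graph undirected: keep an edge if it exists in either direction."""
    n = len(adj)
    return [[1 if adj[i][j] or adj[j][i] else 0 for j in range(n)] for i in range(n)]

def two_hop_expand(adj):
    """Add edges between distinct nodes that share at least one neighbor."""
    n = len(adj)
    out = [row[:] for row in adj]
    for i in range(n):
        for j in range(n):
            if i != j and any(adj[i][m] and adj[m][j] for m in range(n)):
                out[i][j] = 1
    return out

# Toy directed chain 0 -> 1 -> 2 -> 3.
raw = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

sym = symmetrize(raw)
expanded = two_hop_expand(sym)
for row in expanded:
    print(row)
```

In the toy chain, nodes 0 and 3 remain unconnected after expansion because they are three hops apart, while every pair within two hops becomes adjacent.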
### G.2 Attention Map Extraction and Aggregation
The PESD\-TSF model employs a multi\-head cross\-scale attention mechanism\. To extract a unified dependency matrix from the model for visualization, we performed the following aggregation operations:
- Multi-Head Averaging: We averaged the attention weights across all $H$ heads to obtain a representative global view of the learned dependencies.
- Cross-Scale Aggregation: Since the CSCA module captures dependencies across different scales, we spatially aligned and aggregated the attention matrices, projecting them back into the original sensor topology space $\mathbb{R}^{N\times N}$.
- Self-Loop Representation via Residuals: Recognizing that the residual connections in the Transformer architecture implicitly preserve a node’s own historical information, while the attention mechanism focuses on cross-variable interactions, we incorporate an identity matrix into the final visualization. This operation mathematically represents the inherent self-loops maintained by the residual stream, ensuring the heatmap accurately reflects the complete information flow (both self-preserving and interactive) consistent with the physical ground truth.
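The aggregation steps above can be sketched as follows. This is a simplified illustration, assuming every scale already yields an $N\times N$ matrix per head and that aggregation across scales is a plain mean; the paper's alignment and projection step is not reproduced. The function names and the toy weights are hypothetical.

```python
def mean_matrices(mats):
    """Element-wise mean of a list of equally shaped matrices."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[r][c] for m in mats) / len(mats) for c in range(cols)]
            for r in range(rows)]

def aggregate_attention(attn_by_scale):
    """attn_by_scale[s][h] is the N x N attention matrix of head h at scale s."""
    per_scale = [mean_matrices(heads) for heads in attn_by_scale]  # average heads
    fused = mean_matrices(per_scale)                               # average scales
    n = len(fused)
    # Add the identity to stand for the self-loops carried by the residual stream.
    return [[fused[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]

# Toy input: 2 scales, 2 heads each, N = 2 sensors.
attn_by_scale = [
    [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.5], [0.5, 0.0]]],
    [[[0.0, 0.25], [0.25, 0.0]], [[0.0, 0.25], [0.25, 0.0]]],
]

dependency_map = aggregate_attention(attn_by_scale)
for row in dependency_map:
    print(row)
```

The resulting matrix is meant for visualization against the physical adjacency, so the identity term only affects the diagonal and leaves the off-diagonal comparison unchanged.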
## Appendix H Visualization of Explicit Decomposition Results
While PESD\-TSF is designed to explicitly decompose time series into semantically distinct components, quantitative forecasting metrics alone do not fully reveal whether the learned representations exhibit the intended temporal behaviors\. In particular, numerical errors cannot directly reflect how the extracted components are utilized during the forecasting process in the time domain\. To provide further qualitative evidence, we visualize the explicit decomposition results produced by the proposed model\.
Specifically, we select a representative variable from the test set of ETTh1 and visualize a contiguous segment of the input sequence together with the corresponding decomposed components generated by PESD\-TSF\. For clarity and readability, only a single decomposition scale is shown when multi\-scale modeling is applied\.
Figure [8](https://arxiv.org/html/2605.16449#A8.F8) presents the visualization results. The extracted trend component exhibits smooth and slowly varying temporal dynamics that closely align with the long-term evolution of the original signal. In contrast, the residual component primarily captures localized fluctuations and transient variations, without introducing spurious long-term structures. Moreover, the predicted future sequence follows the extrapolated trend rather than fitting short-term perturbations, indicating that the proposed structured decomposition guides the model toward stable long-term forecasting behavior.
Figure 8: Time-domain visualization of explicit decomposition results produced by PESD-TSF on a representative test sample from ETTh1. The figure illustrates the original input sequence, the extracted trend component, the residual component, and the resulting forecasting behavior compared with the ground truth.
## Appendix I Adaptive Gating Behavior Analysis
To further investigate how the proposed periodic gating mechanism responds to varying temporal characteristics, we visualize its activation patterns alongside a proxy measure of signal complexity in the time domain\. Quantitative performance metrics alone cannot reveal whether the gating mechanism adapts meaningfully to local temporal variations during forecasting\.
Specifically, we select a representative variable from the ETTh1 test set and compute local signal volatility using a sliding-window standard deviation as a measure of temporal complexity. Figure [9](https://arxiv.org/html/2605.16449#A9.F9) illustrates the raw input signal and its corresponding local volatility (top), together with the learned periodic gating intensity over time (bottom).
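The local volatility proxy can be sketched as a trailing-window standard deviation. The window length and the use of the population (rather than sample) standard deviation are assumptions, since neither is stated; `rolling_std` and the toy signal are illustrative only.

```python
import statistics

def rolling_std(series, window):
    """Population standard deviation over each trailing window of fixed length."""
    return [
        statistics.pstdev(series[i - window + 1:i + 1])
        for i in range(window - 1, len(series))
    ]

# Toy signal: a flat stretch followed by an oscillating stretch.
signal = [1.0, 1.0, 1.0, 1.0, 5.0, 1.0, 5.0, 1.0]
volatility = rolling_std(signal, window=4)

print(len(volatility))
print(volatility)
```

The output has `len(signal) - window + 1` entries. It is zero over the flat stretch and rises once the window covers the oscillating part, which is the kind of signal the gating intensity is compared against.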
As shown in Figure [9](https://arxiv.org/html/2605.16449#A9.F9), periods with higher local volatility and stronger temporal fluctuations are consistently associated with elevated gating responses. In contrast, during relatively stable intervals, the gating intensity remains close to its learned mean value. This alignment indicates that the periodic gating mechanism adaptively modulates its activation strength in response to changing temporal dynamics, rather than behaving as a static or uniformly activated module.
Importantly, this visualization does not imply explicit causal attribution, but provides qualitative evidence that the proposed gating mechanism exhibits structured and interpretable temporal behavior\. By selectively emphasizing periodic representations in complex or highly varying regions, the gating mechanism contributes to more stable and robust long\-term forecasting\.
Figure 9: Visualization of adaptive periodic gating on a representative ETTh1 test sample. The top panel shows the input signal and local volatility, while the bottom panel presents the corresponding gating intensity. Higher activation aligns with increased temporal complexity.
## Appendix J Statistical Analysis
We repeated all experiments three times and reported the standard deviations for both our model and the second-best baseline, along with the results of statistical significance tests. Tables [14](https://arxiv.org/html/2605.16449#A10.T14) and [15](https://arxiv.org/html/2605.16449#A10.T15) present the results for long-term forecasting, and Table [16](https://arxiv.org/html/2605.16449#A10.T16) presents the results for short-term forecasting.
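As a small illustration of the mean ± standard deviation reporting, the sketch below summarizes one metric over three runs. The run values are hypothetical and not taken from the paper, and the use of the sample standard deviation (dividing by $n-1$) is an assumption.

```python
import statistics

def summarize(runs):
    """Return (mean, sample standard deviation) of a metric across repeated runs."""
    return statistics.mean(runs), statistics.stdev(runs)

def format_result(runs, digits=3):
    """Format a metric as 'mean ± std' with a fixed number of decimals."""
    mean, std = summarize(runs)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"

# Hypothetical MSE values from three seeds; not results from the paper.
mse_runs = [0.410, 0.420, 0.430]

print(format_result(mse_runs))
```

With only three runs the standard deviation is a coarse estimate, which is one reason it is reported alongside significance tests rather than in place of them.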
Table 14: Performance comparison between PESD-TSF and the runner-up baseline TimeBridge. Results are reported as mean $\pm$ standard deviation.
Table 15: Performance comparison between PESD-TSF and the runner-up baseline TimeBase. Results are reported as mean $\pm$ standard deviation.
Table 16: Short-term forecasting performance comparison between PESD-TSF and TimeMixer on PeMS datasets. Results are reported as mean $\pm$ standard deviation.
## Appendix K Future Directions
Although PESD\-TSF has demonstrated strong predictive performance and clear physical interpretability in long\-term time series forecasting, its structured modeling paradigm leaves several promising directions for future exploration\. Motivated by our consistent empirical observation that explicitly reconstructing the underlying dynamical structure is often more effective for generalization than merely increasing model scale, a key direction is to extend PESD\-TSF toward a pre\-trained foundation model for time series analysis\. Unlike existing pretraining paradigms that primarily rely on implicit temporal representations, this direction aims to leverage the explicit decoupling of trends, variations, and cross\-variable correlations to learn transferable, structure\-aware dynamical priors, thereby enabling more robust adaptation in zero\-shot or few\-shot scenarios\.
In addition, for safety\-critical applications such as power grid management, where reliable uncertainty estimation is essential, future work will explore integrating probabilistic frameworks—such as normalizing flows or diffusion models—into the RLC latent space\. By modeling uncertainty under explicit structural consistency constraints, this extension would allow PESD\-TSF to go beyond point forecasts and quantify plausible confidence intervals, providing risk\-aware predictions suitable for high\-stakes decision\-making\.
Finally, we aim to relax the discrete-time assumption by extending the periodicity-aware gating mechanism to continuous-time settings. By incorporating Neural Ordinary Differential Equations (Neural ODEs) to model structured dynamics, PESD-TSF could naturally handle irregular sampling and missing observations without relying on interpolation artifacts, further improving its robustness and applicability in real-world temporal systems.