PMDformer: Patch-Mean Decoupling Information Transformer for Long-term Forecasting
Summary
PMDformer introduces patch-mean decoupling and specialized attention mechanisms to improve shape similarity modeling in long-term time series forecasting, outperforming existing methods on multiple benchmarks.
# PMDformer: Patch-Mean Decoupling Information Transformer for Long-term Forecasting
Source: [https://arxiv.org/html/2606.26549](https://arxiv.org/html/2606.26549)
Ao Hu (1,2), Liangjian Wen (1,7), Jiang Duan (1,6), Yong Dai (5), HE YAN (6), Dongkai Wang (1), Jun Wang (1), Yukun Zhang (4,2), Ruoxi Jiang (2,3), Zenglin Xu (2,3)

(1) Southwestern University of Finance and Economics; (2) Shanghai Academy of AI for Science; (3) Fudan University; (4) Harbin Institute of Technology, Shenzhen; (5) X-Humanoid Research Institute; (6) Chengdu Everimaging Science and Technology Co., Ltd.; (7) Artificial Intelligence and Digital Finance Key Laboratory of Sichuan Province

{huao1105, wlj6816, zenglin}@gmail.com, duanj_t@swufe.edu.cn
###### Abstract
Long-term time series forecasting (LTSF) plays a crucial role in fields such as energy management, finance, and traffic prediction. Transformer-based models have adopted patch-based strategies to capture long-range dependencies, but accurately modeling shape similarities across patches and variables remains challenging due to scale differences. To address this, we introduce patch-mean decoupling (PMD), which separates the trend and residual shape information by subtracting the mean of each patch, preserving the original structure and ensuring that the attention mechanism captures true shape similarities. Furthermore, to model long-range dependencies and capture cross-variable relationships more effectively, we propose Trend Restoration Attention (TRA) and Proximal Variable Attention (PVA). The former reintegrates the decoupled trend from PMD when computing the attention output, while the latter focuses cross-variable attention on the most relevant, recent time segments to avoid overfitting on outdated correlations. Combining these components, we propose PMDformer, a model designed to effectively capture shape similarity in long-term forecasting scenarios. Extensive experiments indicate that PMDformer outperforms existing state-of-the-art methods in stability and accuracy across multiple LTSF benchmarks. The code is available at [https://github.com/aohu1105/PMDformer](https://github.com/aohu1105/PMDformer).
## 1 Introduction
Long\-term time series forecasting \(LTSF\) is a key task in machine learning, with wide applications in areas like energy management\(Box and Jenkins,[1990](https://arxiv.org/html/2606.26549#bib.bib14)\), financial markets\(Huet al\.,[2025c](https://arxiv.org/html/2606.26549#bib.bib13)\), and traffic prediction\(Guoet al\.,[2019](https://arxiv.org/html/2606.26549#bib.bib16); Yiet al\.,[2023b](https://arxiv.org/html/2606.26549#bib.bib17)\)\. Recent Transformer\-based models have drawn inspiration from computer vision\(Dosovitskiyet al\.,[2020](https://arxiv.org/html/2606.26549#bib.bib38)\), increasingly using patch\-based strategies\(Nieet al\.,[2023](https://arxiv.org/html/2606.26549#bib.bib11); Zhang and Yan,[2023](https://arxiv.org/html/2606.26549#bib.bib23); Chenet al\.,[2024](https://arxiv.org/html/2606.26549#bib.bib24); Wanget al\.,[2024c](https://arxiv.org/html/2606.26549#bib.bib28)\)to better capture long\-range dependencies\. Most of these approaches treat variables independently \(VI\)\(Huanget al\.,[2025](https://arxiv.org/html/2606.26549#bib.bib7); Linet al\.,[2024](https://arxiv.org/html/2606.26549#bib.bib8)\), while variable\-dependent \(VD\) methods\(Liuet al\.,[2024a](https://arxiv.org/html/2606.26549#bib.bib10); Luo and Wang,[2024](https://arxiv.org/html/2606.26549#bib.bib27)\)that model interactions across variables have not yet shown clear gains over VI baselines\.
Unlike 2D images with a fixed spatial structure, time series are one-dimensional curves (Germain et al., [2024](https://arxiv.org/html/2606.26549#bib.bib43); Hamilton, [2020](https://arxiv.org/html/2606.26549#bib.bib50)), with the primary focus being on capturing shape similarities between patches or variables (Grabocka et al., [2014](https://arxiv.org/html/2606.26549#bib.bib41); Kacprzyk et al., [2024](https://arxiv.org/html/2606.26549#bib.bib42)) as well as modeling long-range trends (Li et al., [2023](https://arxiv.org/html/2606.26549#bib.bib51)). For instance, two patches may share similar trends, such as gradual increases with comparable rates of change. Identifying such shape correspondence helps the model extract temporally consistent patterns and improves forecast accuracy. However, time series data is inherently non-stationary (Fan et al., [2023](https://arxiv.org/html/2606.26549#bib.bib21); Liu et al., [2022b](https://arxiv.org/html/2606.26549#bib.bib20)), so patch scales can fluctuate substantially over time. As illustrated in the top panels of Figure [2](https://arxiv.org/html/2606.26549#S1.F2), the attention weight of $(P_1, P_3)$ is higher than that of $(P_1, P_2)$, despite the more similar shape between $P_1$ and $P_2$. This occurs because $P_1$, $P_2$, and $P_3$ have different scales, which influence the attention weights so that they fail to reflect true shape similarity. Consequently, the model may learn incorrect similarity relationships, leading to performance degradation. Furthermore, this scale bias is even more pronounced when modeling dependencies between variables, further hindering the effectiveness of VD models.
To balance the scale differences of patches, recent methods have employed Patch Normalization (Liu et al., [2023b](https://arxiv.org/html/2606.26549#bib.bib25)), which Z-score normalizes each patch by subtracting the mean and dividing by the standard deviation. However, dividing by the standard deviation inadvertently distorts the original shape of the patch. As a result, it hampers the model's ability to identify shape similarities across patches or variables. In this paper, we propose a simple yet effective alternative called patch-mean decoupling (PMD). We subtract the mean of each patch, which recenters each patch to zero mean and explicitly separates the long-range trend component, encoded in the patch means, from the residual shape information. Unlike Patch Normalization, our method preserves the original amplitude variations and maintains the intrinsic shape structure, ensuring that the model better captures true shape similarities across patches. As shown in Figure [2](https://arxiv.org/html/2606.26549#S1.F2), with our method attention favors the shape-aligned pair $(P_1, P_2)$ over the shape-unaligned pair $(P_1, P_3)$.
Figure 1: Attention weights for three patches before and after patch-mean decoupling. Scale differences initially obscure true shape similarity, which is clearly revealed after decoupling as increased (red) or decreased (green) correlations, with analogous similarity shown in blue for ($P_3$, $P_1$) and ($P_3$, $P_2$).
Figure 2:Comparison of the MSE of all baselines with our proposed PMDformer\. The results are the averages for all prediction lengths\.
PMD thus enables more shape-focused attention across patches and variables, revealing true similarities obscured by scale. For cross-variable shape modeling, existing methods (Luo and Wang, [2024](https://arxiv.org/html/2606.26549#bib.bib27); Zhang and Yan, [2023](https://arxiv.org/html/2606.26549#bib.bib23)) often compute interactions over the entire historical window. However, cross-variable relationships are often non-stationary and evolve over time, so recent interactions are the most predictive of future dynamics. For example, in financial markets asset correlations often spike sharply during crises. Relying on dependencies from the entire history introduces substantial noise and redundancy, degrading performance. To address this, we introduce proximal variable attention (PVA), which confines self-attention to the most recent patch, the segment most proximal to the prediction horizon. By capturing shape similarities among variables in this temporally relevant window, PVA reduces noise from historical drift and the risk of overfitting.
Complementarily, recentering via PMD inherently attenuates the long\-term trend signal, potentially overlooking global dependencies\. To restore this without disrupting shape matching between temporal patches, we propose trend restoration attention \(TRA\), which explicitly injects the decoupled means \(long\-range trend information\) into the value pathway of the attention mechanism\. This seamless integration allows the model to jointly encode local shape patterns and global trend yielding more stable forecasts\.
Building on the above, we propose PMDformer, which combines a patch-mean decoupling (PMD) module, proximal variable attention (PVA), trend restoration attention (TRA), and a projection layer for final forecasting. Figure [2](https://arxiv.org/html/2606.26549#S1.F2) compares the predictive accuracy of PMDformer with that of other state-of-the-art models. Our contributions are:
- We introduce a novel mechanism that decouples trend and residual shape within the attention module by subtracting each patch's mean, enabling the model to capture shape similarity among temporal patches and variables more effectively.
- We introduce proximal variable attention, which focuses on the most recent patch to capture the most relevant shape similarities, mitigating overfitting.
- We demonstrate the effectiveness of our approach through extensive experiments on a variety of LTSF benchmarks, showing that PMDformer provides more stable and accurate forecasts than current state-of-the-art methods.
## 2 Related Work
Deep learning models have demonstrated remarkable performance in long-term time series forecasting. These models can be broadly divided into Transformer-based models (Vaswani et al., [2017](https://arxiv.org/html/2606.26549#bib.bib19); Wu et al., [2021](https://arxiv.org/html/2606.26549#bib.bib52); Liu et al., [2022a](https://arxiv.org/html/2606.26549#bib.bib31); Zhou et al., [2022](https://arxiv.org/html/2606.26549#bib.bib53)), MLP-based models (Zeng et al., [2023](https://arxiv.org/html/2606.26549#bib.bib54); Li et al., [2023](https://arxiv.org/html/2606.26549#bib.bib51); Wang et al., [2024a](https://arxiv.org/html/2606.26549#bib.bib4); Hu et al., [2025b](https://arxiv.org/html/2606.26549#bib.bib65)), GNN-based models (Huang et al., [2023](https://arxiv.org/html/2606.26549#bib.bib55); Yi et al., [2023a](https://arxiv.org/html/2606.26549#bib.bib22)), and CNN-based models (Wang et al., [2023](https://arxiv.org/html/2606.26549#bib.bib56); Eldele et al., [2024](https://arxiv.org/html/2606.26549#bib.bib57); Hu et al., [2025a](https://arxiv.org/html/2606.26549#bib.bib64)).
#### Transformer\-based time series models\.
The success of Transformers\(Vaswaniet al\.,[2017](https://arxiv.org/html/2606.26549#bib.bib19)\)in NLP has inspired their adaptation for LTSF to capture long\-range dependencies\. Early models treat series as token sequences with efficient attention: Informer\(Zhouet al\.,[2021](https://arxiv.org/html/2606.26549#bib.bib30)\)uses ProbSparse for complexity reduction; Pyraformer\(Liuet al\.,[2022a](https://arxiv.org/html/2606.26549#bib.bib31)\)employs pyramidal attention; Autoformer\(Wuet al\.,[2021](https://arxiv.org/html/2606.26549#bib.bib52)\)adds decomposition; and FEDformer\(Zhouet al\.,[2022](https://arxiv.org/html/2606.26549#bib.bib53)\)incorporates frequency blocks\. Yet, their efficacy is challenged by simple linear models\(Zenget al\.,[2023](https://arxiv.org/html/2606.26549#bib.bib54)\), underscoring needs for better temporal modeling\.
#### Patch\-based time series models\.
Inspired by vision transformers\(Dosovitskiyet al\.,[2020](https://arxiv.org/html/2606.26549#bib.bib38)\), recent works segment time series into overlapping or non\-overlapping patches to bolster local semantic capture\. Transformer\-based examples include PatchTST\(Nieet al\.,[2023](https://arxiv.org/html/2606.26549#bib.bib11)\), which uses variable\-independent shared encoders for temporal patch semantics \(SOTA in LTSF\), and Pathformer\(Chenet al\.,[2024](https://arxiv.org/html/2606.26549#bib.bib24)\)with multi\-scale patches and adaptive path selection for intra/inter\-dependencies\. MLP variants like TSMixer\(Ekambaramet al\.,[2023](https://arxiv.org/html/2606.26549#bib.bib26)\)and PatchMixer\(Gonget al\.,[2023](https://arxiv.org/html/2606.26549#bib.bib59)\)model patch relations via MLPs, while foundation models such as Moirai\(Wooet al\.,[2024](https://arxiv.org/html/2606.26549#bib.bib33)\), Timer\(Liuet al\.,[2024b](https://arxiv.org/html/2606.26549#bib.bib32)\), TimesFM\(Daset al\.,[2024](https://arxiv.org/html/2606.26549#bib.bib34)\), and LLM\-based\(Panet al\.,[2024](https://arxiv.org/html/2606.26549#bib.bib60); Jinet al\.,[2023](https://arxiv.org/html/2606.26549#bib.bib61)\)leverage patches for pretraining and cross\-modal alignment\. Recent TimeBase\(Huanget al\.,[2025](https://arxiv.org/html/2606.26549#bib.bib7)\)employs orthogonalized patches to reduce redundancy for SOTA efficiency, which further underscores patches’ success in LTSF modeling\.
#### Patch\-Normalization\.
Due to the non-stationary nature of time series, some works (Fan et al., [2023](https://arxiv.org/html/2606.26549#bib.bib21); Kim et al., [2021](https://arxiv.org/html/2606.26549#bib.bib62)) apply normalization to mitigate scale discrepancies and stabilize distributions. Among them, patch-level normalization works include SAN (Liu et al., [2023b](https://arxiv.org/html/2606.26549#bib.bib25)), a model-agnostic framework that adaptively normalizes slices by removing non-stationarity for flexible forecasting, and SIN (Han et al., [2024b](https://arxiv.org/html/2606.26549#bib.bib49)), which selectively learns normalization parameters to maximize local invariance and global variability, enabling interpretable long-term predictions. However, these normalization methods distort intrinsic patch shapes by scaling with the standard deviation, which hinders the capture of true shape similarity. In contrast, PMD avoids this distortion by subtracting only the mean, which preserves amplitudes.
## 3 Proposed Method
We consider the task of long-term time series forecasting, where the goal is to predict the future evolution of multiple correlated variables given their historical observations. Formally, let $\mathbf{X}=\{x_t\in\mathbb{R}^{C}\mid t=1,2,\ldots,L\}$ denote an input sequence of length $L$, where $C$ is the number of variables. Each $x_t=(x_t^1,x_t^2,\ldots,x_t^C)$ contains the values of all variables at time $t$. Given $\mathbf{X}$, the objective is to forecast the subsequent $T$ time steps $\hat{\mathbf{Y}}=\{\hat{x}_t\in\mathbb{R}^{C}\mid t=L+1,\ldots,L+T\}$.
### 3.1 The General Structure
Our proposed PMDformer architecture is a unified framework composed of four synergistic modules designed to explicitly decouple the long-term trend from the shape structure, selectively focus on the most relevant inter-variable dependencies, and restore global dynamics accurately for stable forecasting, as illustrated in Figure [3](https://arxiv.org/html/2606.26549#S3.F3). (a) Patch-Mean Decoupling (PMD): This module partitions the input time series into non-overlapping patches and explicitly separates each patch into its long-term trend component and its residual shape component. (b) Proximal Variable Attention (PVA): To capture the most relevant cross-variable dependencies, the PVA module applies self-attention only to the $C$ tokens of the last (proximal) patch, modeling interactions across all variables. (c) Trend Restoration Attention (TRA): This module models the shape similarities across patches. Crucially, it then restores the long-range trend information in the value pathway, enabling the model to accurately capture and use the overall long-term trend. (d) Projection Layer: This final layer combines the learned temporal representations with the reincorporated trend information through a fully connected projection to produce the final predictions.
### 3.2 Model Architecture
#### Patch\-Mean Decoupling \(PMD\) & Embedding\.
We first divide the input sequence $\mathbf{X}=\{x_t\in\mathbb{R}^{C}\}_{t=1}^{L}$ into $N$ non-overlapping patches of length $S$, where $N=\lfloor L/S\rfloor$. For variable $i\in[C]$ and patch index $j\in[N]$, the raw patch vector is

$$\mathbf{P}_j^i=\big(x_{(j-1)S+1}^i,\,x_{(j-1)S+2}^i,\,\ldots,\,x_{jS}^i\big)\in\mathbb{R}^{S}. \tag{1}$$

We then compute its temporal mean and the corresponding mean-decoupled residual:

$$\mu_j^i=\tfrac{1}{S}\sum_{k=1}^{S}x_{(j-1)S+k}^i,\qquad \mathbf{r}_j^i=\mathbf{P}_j^i-\mu_j^i\,\mathbf{1}_S, \tag{2}$$

where $\mathbf{1}_S$ is the $S$-dimensional all-ones vector. Each residual patch is then embedded into a $d$-dimensional representation through a shared linear projection. To encode location, we add learned positional embeddings to form the Transformer token:

$$\mathbf{P}_j^i:=\mathbf{r}_j^i\,\mathbf{W}_E+\mathbf{b}_E+\mathbf{z}_{p_j}, \tag{3}$$

where $\mathbf{W}_E\in\mathbb{R}^{S\times d}$, $\mathbf{b}_E\in\mathbb{R}^{d}$, and $\mathbf{z}_{p_j}\in\mathbb{R}^{d}$ denotes the positional embedding of patch $j$. By removing patch means before embedding, each patch is centered, which alleviates local inconsistencies across patches and variables so that the attention mechanism can focus on shape similarities.
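The decoupling and embedding steps in Eqs. (1) to (3) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `patch_mean_decouple` and `embed`, the random weights, and the toy ramp series are all assumptions made for the example. Following Eq. (1), patches start at the first time step, so any trailing remainder when $L$ is not divisible by $S$ is dropped.

```python
import numpy as np

def patch_mean_decouple(x, patch_len):
    """Split x of shape (C, L) into per-patch means (C, N) and residuals (C, N, S)."""
    n_vars, length = x.shape
    n_patches = length // patch_len                      # N = floor(L / S)
    patches = x[:, :n_patches * patch_len].reshape(n_vars, n_patches, patch_len)  # Eq. (1)
    mu = patches.mean(axis=-1)                           # Eq. (2): patch means (trend)
    residual = patches - mu[..., None]                   # Eq. (2): zero-mean shape
    return mu, residual

def embed(residual, W_E, b_E, pos):
    """Eq. (3): shared linear projection plus a per-patch positional embedding."""
    return residual @ W_E + b_E + pos                    # (C, N, S) @ (S, d) -> (C, N, d)

rng = np.random.default_rng(0)
S, d = 4, 8
ramp = np.array([0.0, 1.0, 2.0, 3.0])
# Toy series with one variable: the same ramp at three levels, then a steeper ramp.
x = np.concatenate([ramp, ramp + 10.0, ramp + 100.0, 2.0 * ramp])[None, :]
mu, r = patch_mean_decouple(x, S)
# Random stand-ins for the learned W_E, b_E, and positional embeddings.
tokens = embed(r, rng.normal(size=(S, d)), np.zeros(d), rng.normal(size=(4, d)))

# For contrast: Z-scoring (patch normalization) also divides by the patch std.
z = r / r.std(axis=-1, keepdims=True)
```

In this toy case the first three patches share one residual despite very different levels, while the steeper fourth patch keeps its larger amplitude under PMD; after Z-scoring it becomes indistinguishable from the others.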
Figure 3: Overview of the proposed PMDformer. The model comprises: (a) the Patch-Mean Decoupling module, which re-centers each patch and separates patches into trend and shape components; (b) Proximal Variable Attention, which operates only on the most recent tokens to capture the variable interactions most relevant for forecasting; (c) Trend Restoration Attention, which restores long-range trends after the value projections; (d) the Projection Linear layer, which adds the trend back to model long-range trend information for stable and accurate predictions.
#### Proximal Variable Attention \(PVA\)\.
Intuitively, accurate time series forecasting hinges on the immediate interactions between variables at the most recent time steps, as these dependencies are most indicative of near-term changes. Therefore, the PVA module is designed to concentrate its attention mechanism on the most proximal (i.e., most recent) tokens to model these critical cross-variable relationships.

Let $N$ be the index of the last (most recent) patch. We collect the most recent tokens of all $C$ variables, denoted as $\mathcal{P}_N=\{\mathbf{P}_N^1,\ldots,\mathbf{P}_N^C\}$, where each token $\mathbf{P}_N^i\in\mathbb{R}^{d}$ is derived from the Patch-Mean Decoupling (PMD) embedding. The PVA then applies Multi-Head Self-Attention (MHSA) exclusively within the set $\mathcal{P}_N$ to capture the cross-variable shape dependencies that are most relevant for forecasting. Following the attention mechanism, a Feed-Forward Network (FFN) is employed to enhance the non-linear feature representation:

$$\hat{\mathcal{P}}_N=\text{LayerNorm}(\text{MHSA}(\mathcal{P}_N)+\mathcal{P}_N), \tag{4}$$

$$\mathcal{P}_N=\text{LayerNorm}(\text{FFN}(\hat{\mathcal{P}}_N)+\hat{\mathcal{P}}_N). \tag{5}$$

Tokens from the earlier historical patches, specifically those indexed $\{1,\ldots,N-1\}$, maintain their original representation derived from the PMD module. Following the PVA operation, the refined token set $\mathcal{P}_N$ is concatenated with these remaining historical tokens along the patch dimension to form the full sequence of shape embeddings, denoted as $\mathcal{P}\in\mathbb{R}^{C\times N\times d}$. Restricting cross-variable attention to the most proximal patch offers two advantages: it enhances model robustness by avoiding spurious long-range couplings from historical noise, and it improves computational efficiency by reducing the complexity from $O(C^{2}N)$ to $O(C^{2})$.
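A rough NumPy sketch of Eqs. (4) and (5) is given below. It makes several simplifying assumptions that are not taken from the paper: a single attention head instead of MHSA, a LayerNorm without learned scale and shift, a two-layer ReLU FFN, and random weights. The function name `proximal_variable_attention` is likewise hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(h, eps=1e-5):
    # Simplified LayerNorm without learned affine parameters (an assumption).
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

def self_attention(h, Wq, Wk, Wv):
    # Single-head stand-in for MHSA; h has one row per variable.
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v   # (C, C) weights

def proximal_variable_attention(P, params):
    """P: (C, N, d) shape embeddings. Only the last patch is mixed across variables."""
    Wq, Wk, Wv, W1, W2 = params
    last = P[:, -1, :]                                       # (C, d): one token per variable
    h = layer_norm(self_attention(last, Wq, Wk, Wv) + last)  # Eq. (4)
    h = layer_norm(np.maximum(h @ W1, 0.0) @ W2 + h)         # Eq. (5)
    out = P.copy()
    out[:, -1, :] = h                                        # patches 1..N-1 stay untouched
    return out

rng = np.random.default_rng(0)
C, N, d = 3, 5, 8
P = rng.normal(size=(C, N, d))
shapes = [(d, d), (d, d), (d, d), (d, 2 * d), (2 * d, d)]
params = [rng.normal(size=s) * 0.1 for s in shapes]
out = proximal_variable_attention(P, params)
```

Only one $C\times C$ attention matrix is formed here, instead of one per patch, which is where the stated reduction from $O(C^{2}N)$ to $O(C^{2})$ comes from.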
#### Trend Restoration Attention \(TRA\)\.
Following the refinement of the most proximal tokens by the PVA module, the TRA module aims to capture temporal shape similarities across all historical patches while preserving long\-range trend information\. This is achieved by applying a parameter\-shared Transformer encoder \(MHSA \+ FFN\) along the patch axis for each variable independently\.
In this design, the Query ($\mathbf{Q}$) and Key ($\mathbf{K}$) projections operate solely on the shape embeddings, ensuring that the resulting attention scores $\mathcal{A}$ emphasize precise inter-patch shape similarity. To counteract the potential loss of global dynamics inherent in shape-focused modeling, we explicitly incorporate the per-patch mean ($\mu^{i}$) into the Value ($\mathbf{V}$) pathway. The additive reintegration is inspired by residual connections in ResNet (He et al., [2016](https://arxiv.org/html/2606.26549#bib.bib63)). Concretely, for the $i$-th variable's patch sequence $\mathbf{P}^{i}\in\mathbb{R}^{N\times d}$, the computation is defined as:

$$\mathbf{Q}^{i}=\mathbf{P}^{i}\mathbf{W}_Q,\qquad \mathbf{K}^{i}=\mathbf{P}^{i}\mathbf{W}_K, \tag{6}$$

$$\mathcal{A}=\text{Softmax}\Big(\frac{\mathbf{Q}^{i}(\mathbf{K}^{i})^{\top}}{\sqrt{d}}\Big), \tag{7}$$

$$\mathbf{V}^{i}=\mathbf{P}^{i}\mathbf{W}_V+\mu^{i}, \tag{8}$$

where $\mathbf{W}_Q,\mathbf{W}_K,\mathbf{W}_V$ are the projection matrices, and $\mu^{i}$ is the per-patch mean (Eq. [2](https://arxiv.org/html/2606.26549#S3.E2)), broadcast to match the dimensions of $\mathbf{P}^{i}\mathbf{W}_V$. This architectural separation allows the $\mathbf{Q}/\mathbf{K}$ pathway to model fine-grained local shape dependencies, while the $\mathbf{V}$ pathway ensures the preservation of the essential global trend dynamics. The resulting trend-integrated tokens are then refined through a Feed-Forward Network (FFN) to enhance the temporal representation learning.
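The separation between the query/key and value pathways in Eqs. (6) to (8) can be checked numerically with a small sketch. It assumes a single head, random weights, and the standard attention output $\mathcal{A}\mathbf{V}^{i}$; the function name `trend_restoration_attention` and the toy means are illustrative only.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def trend_restoration_attention(P_i, mu_i, Wq, Wk, Wv):
    """P_i: (N, d) shape tokens of one variable; mu_i: (N,) per-patch means."""
    Q, K = P_i @ Wq, P_i @ Wk                         # Eq. (6): shape information only
    A = softmax(Q @ K.T / np.sqrt(P_i.shape[-1]))     # Eq. (7): (N, N) attention weights
    V = P_i @ Wv + mu_i[:, None]                      # Eq. (8): mean broadcast over d
    return A, A @ V                                   # assumed standard attention output

rng = np.random.default_rng(0)
N, d = 5, 8
P_i = rng.normal(size=(N, d))
mu_i = np.array([1.0, 2.0, 4.0, 8.0, 16.0])           # toy patch means
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

A, out = trend_restoration_attention(P_i, mu_i, Wq, Wk, Wv)
# Shifting every patch mean leaves the attention weights untouched.
A_shift, out_shift = trend_restoration_attention(P_i, mu_i + 100.0, Wq, Wk, Wv)
```

Because the means enter only through the values, the weights `A` depend on shape alone, while the output carries an attention-weighted average of the patch means, `(A @ mu_i)[:, None]`.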
#### Projection Layer\.
The temporal tokens produced by the TRA module are rich in shape dependencies but still require the final restoration of the global trend information for stable and accurate multi-step forecasting. This final step is essential to fully recover the original scale and long-term dynamics that were decoupled earlier. To achieve this, before generating the multi-step forecasts, we re-incorporate the per-patch trend means ($\mu^{i}$) into the refined shape embeddings:

$$\hat{\mathbf{Y}}^{i}=(\mathbf{P}^{i}+\mu^{i})\,\mathbf{W}_o+\mathbf{b}_o,\quad \hat{\mathbf{Y}}^{i}\in\mathbb{R}^{T}. \tag{9}$$

Here, $\mathbf{W}_o\in\mathbb{R}^{(N\times d)\times T}$ and $\mathbf{b}_o\in\mathbb{R}^{T}$ are the weight matrix and bias vector, respectively. The mean $\mu^{i}$ is implicitly broadcast to align with the dimensions of $\mathbf{P}^{i}$. This final step ensures the model's predictions are well-calibrated with the long-range trend observed in the input series.
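One plausible reading of Eq. (9), given that $\mathbf{W}_o$ has $N\times d$ rows, is that the $N$ trend-restored tokens are flattened into a single vector before the linear map. The sketch below follows that reading; the flattening order, the function name `project`, and the random weights are assumptions rather than details confirmed by the paper.

```python
import numpy as np

def project(P_i, mu_i, W_o, b_o):
    """Eq. (9): add per-patch means back, flatten (N, d) -> (N*d,), map to T steps."""
    restored = P_i + mu_i[:, None]            # mean broadcast across the d features
    return restored.reshape(-1) @ W_o + b_o   # row-major flattening (assumed)

rng = np.random.default_rng(0)
N, d, T = 5, 8, 96
P_i = rng.normal(size=(N, d))                 # refined shape tokens of one variable
mu_i = np.linspace(1.0, 2.0, N)               # toy per-patch means
W_o = rng.normal(size=(N * d, T)) * 0.01
b_o = np.zeros(T)

y_hat = project(P_i, mu_i, W_o, b_o)          # forecast for one variable
trend_part = np.repeat(mu_i, d) @ W_o         # contribution of the means alone
```

Since the map is linear, the forecast splits into a shape part and `trend_part`, so the output layer learns how each patch mean should influence each forecast step.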
### 3.3 Theoretical Analysis
#### Scale Bias Without Patch-Mean Decoupling (PMD)
Consider embedding raw patches $\tilde{\mathbf{x}}=\mathbf{r}+\mu\mathbf{1}$, where $\mathbf{r}$ is the residual and $\mu$ is the patch mean. The attention logit between tokens $(i,j)$ is given by:

$$\tilde{z}_{ij}=\mathbf{q}_i^{\top}\mathbf{k}_j=\tilde{\mathbf{x}}_i^{\top}\mathbf{M}\tilde{\mathbf{x}}_j=\underbrace{\mu_i\mu_j\mathbf{1}^{\top}\mathbf{M}\mathbf{1}}_{\text{mean--mean}}+\underbrace{\mu_i\mathbf{1}^{\top}\mathbf{M}\mathbf{r}_j+\mu_j\mathbf{r}_i^{\top}\mathbf{M}\mathbf{1}}_{\text{cross}}+\underbrace{\mathbf{r}_i^{\top}\mathbf{M}\mathbf{r}_j}_{\text{residual similarity}}, \tag{10}$$

where $\mathbf{M}:=\mathbf{W}_E^{\top}\mathbf{W}_Q^{\top}\mathbf{W}_K\mathbf{W}_E$ and $\mathbf{1}$ is the all-ones vector. The first three terms depend on the means and can affect or even dominate the residual similarity, inducing scale bias.
#### Proposition 1: Sufficient Condition for Level\-Dominated Logits
Let $i$ be a fixed query. A sufficient condition for the mean-dependent part of $\tilde{z}_{ij}$ to dominate the residual similarity uniformly over all keys $j$ is:

$$|\mu_i||\mu_j||\mathbf{1}^{\top}\mathbf{M}\mathbf{1}|\geq\|\mathbf{M}\|_2\|\mathbf{r}_i\|\|\mathbf{r}_j\|+|\mu_i|\|\mathbf{M}\mathbf{1}\|\|\mathbf{r}_j\|+|\mu_j|\|\mathbf{M}\mathbf{1}\|\|\mathbf{r}_i\|, \tag{11}$$

where $\|\cdot\|_2$ represents the spectral norm. This condition guarantees that the mean-dependent terms outweigh the residual term and cross terms, leading to scale-induced bias in attention. This confirms that attention is biased toward scale when the means are large, which motivates the need for patch-mean decoupling in our method.
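The decomposition in Eq. (10) and the spectral-norm bound on the residual term used in Eq. (11) can be verified numerically. In the sketch below, `M` is a hypothetical near-identity $S\times S$ matrix standing in for the learned product of projection weights, and the two patches are synthetic; only the algebra is being checked, not any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 6
one = np.ones(S)
# Hypothetical stand-in for the S x S matrix M of Eq. (10); not learned weights.
M = np.eye(S) + 0.1 * rng.normal(size=(S, S))

def decouple(x):
    """Return the patch mean and the zero-mean residual."""
    mu = x.mean()
    return mu, x - mu * one

x_i = rng.normal(size=S) + 50.0   # two synthetic raw patches at large levels
x_j = rng.normal(size=S) + 80.0
mu_i, r_i = decouple(x_i)
mu_j, r_j = decouple(x_j)

# The three groups of terms in Eq. (10).
mean_mean = mu_i * mu_j * (one @ M @ one)
cross = mu_i * (one @ M @ r_j) + mu_j * (r_i @ M @ one)
residual = r_i @ M @ r_j

raw_logit = x_i @ M @ x_j         # logit on raw patches (no PMD)
pmd_logit = r_i @ M @ r_j         # logit after PMD: only the residual term remains
# Bound on the residual term from Eq. (11): ||M||_2 * ||r_i|| * ||r_j||.
spectral_bound = np.linalg.norm(M, 2) * np.linalg.norm(r_i) * np.linalg.norm(r_j)
```

With patch levels of about 50 and 80, the mean-mean term is several orders of magnitude larger than the residual similarity, which illustrates the scale bias that PMD removes from the logits.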
## 4 Experiment
### 4.1 Experiment Setup
#### Datasets
We conduct experiments on 8 widely used and publicly available real-world datasets: ECL ([https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014](https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014)), Traffic ([https://pems.dot.ca.gov/](https://pems.dot.ca.gov/)), Weather ([https://www.bgc-jena.mpg.de/wetter/](https://www.bgc-jena.mpg.de/wetter/)), Solar ([http://www.nrel.gov/grid/solar-power-data.html](http://www.nrel.gov/grid/solar-power-data.html)), and ETTh1, ETTh2, ETTm1, and ETTm2 ([https://github.com/zhouhaoyi/ETDataset](https://github.com/zhouhaoyi/ETDataset)). Following the experimental protocol established in prior work (Wang et al., [2024b](https://arxiv.org/html/2606.26549#bib.bib36); Qiu et al., [2024](https://arxiv.org/html/2606.26549#bib.bib35); Liu et al., [2023a](https://arxiv.org/html/2606.26549#bib.bib37)), we partition the datasets into training, validation, and test sets with ratios of 6:2:2 for the four ETT datasets and 7:1:2 for the remaining datasets. The detailed statistics of each dataset are summarized in Table [1](https://arxiv.org/html/2606.26549#S4.T1).
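As a small illustration of the split ratios, the helper below computes boundary indices for a chronological train/validation/test split. The chronological ordering, the floor rounding, and the name `split_bounds` are assumptions for this sketch; the actual benchmark loaders may use fixed boundaries instead.

```python
def split_bounds(n_points, ratios):
    """Return (train_end, val_end) indices for a train/val/test split by ratio."""
    total = sum(ratios)
    train_end = n_points * ratios[0] // total
    val_end = n_points * (ratios[0] + ratios[1]) // total
    return train_end, val_end

ETT_RATIOS = (6, 2, 2)      # ETTh1, ETTh2, ETTm1, ETTm2
OTHER_RATIOS = (7, 1, 2)    # ECL, Traffic, Weather, Solar

# Example with a hypothetical series of 1000 time points.
ett_train_end, ett_val_end = split_bounds(1000, ETT_RATIOS)
other_train_end, other_val_end = split_bounds(1000, OTHER_RATIOS)
```

Under these assumptions, the training set is `series[:train_end]`, the validation set is `series[train_end:val_end]`, and the test set is `series[val_end:]`.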
Table 1:Characteristics of Long\-term Time Series Datasets\. This table summarizes key attributes of each dataset, including the application domain; the number of variables; total time points; data split ratios for training, validation, and testing and sampling interval\.
#### Baselines
We compare PMDformer against 9 baselines, including state\-of\-the\-art \(SOTA\) long\-term forecasting models: TQNet\(Linet al\.,[2025](https://arxiv.org/html/2606.26549#bib.bib5)\), TimeBase\(Huanget al\.,[2025](https://arxiv.org/html/2606.26549#bib.bib7)\), SOFTS\(Hanet al\.,[2024a](https://arxiv.org/html/2606.26549#bib.bib6)\), SparseTSF\(Linet al\.,[2024](https://arxiv.org/html/2606.26549#bib.bib8)\), ModernTCN\(Luo and Wang,[2024](https://arxiv.org/html/2606.26549#bib.bib27)\), iTransformer\(Liuet al\.,[2024a](https://arxiv.org/html/2606.26549#bib.bib10)\), TimeMixer\(Wanget al\.,[2024a](https://arxiv.org/html/2606.26549#bib.bib4)\), and PatchTST\(Nieet al\.,[2023](https://arxiv.org/html/2606.26549#bib.bib11)\)\.
#### Setups
Consistent with prior research (Huang et al., [2025](https://arxiv.org/html/2606.26549#bib.bib7)), we use an input length $L$ of 720 and evaluate prediction lengths $T$ of {96, 192, 336, 720}. Results for TimeBase, SparseTSF, iTransformer, TimeMixer, and PatchTST are derived from the TimeBase study, while other outcomes are from our own experiments. All experiments are conducted using PyTorch (Paszke et al., [2019](https://arxiv.org/html/2606.26549#bib.bib40)) on an NVIDIA A100 80GB GPU. The Adam optimizer (Kingma, [2014](https://arxiv.org/html/2606.26549#bib.bib39)) is employed, with learning rates chosen from {2e-4, 5e-4, 1e-3, 1e-2}. The number of patches $N$ is adjusted based on the requirements of each dataset.
Table 2: Comprehensive results for multivariable time series forecasting with a lookback window of 720 time steps. Performance metrics for TQNet (Lin et al., [2025](https://arxiv.org/html/2606.26549#bib.bib5)) and SOFTS (Han et al., [2024a](https://arxiv.org/html/2606.26549#bib.bib6)) were obtained through our experiments, while results for other methods were sourced from TimeBase (Huang et al., [2025](https://arxiv.org/html/2606.26549#bib.bib7)). The best results are highlighted in bold, and the second-best are indicated with underlining.
### 4.2 Main Results
Table[2](https://arxiv.org/html/2606.26549#S4.T2)summarizes the quantitative results for long\-term time series forecasting across multiple prediction horizons and datasets\. As shown, our proposed PMDformer achieves the lowest Mean Squared Error \(MSE\) and Mean Absolute Error \(MAE\) on 7 out of 8 real\-world datasets, outperforming all baselines in the majority of cases\. This success is directly tied to PMDformer’s ability to overcome fundamental limitations in existing architectures\.
Specifically, compared to the patch\-based model TimeBase, PMDformer yields an average MSE reduction of 5\.68% and MAE reduction of 6\.61%\. This improvement stems from our method’s capacity to indentify meaningful shape similarities across patches, a capability that TimeBase’s orthogonal patch selection inherently sacrifices to reduce redundancy\. Moreover, against TQNet, PMDformer achieves an average MSE reduction of 8\.62% and MAE reduction of 9\.96%\. TQNet’s fixed periodic queries constrain its ability to handle diverse cycles, whereas PMDformer’s adaptive proximal variable attention offers greater flexibility in modeling variables’ shape similarities\. Compared to the Transformer\-based iTransformer, PMDformer delivers an average MSE reduction of 11\.44% and MAE reduction of 12\.38%\. iTransformer captures dependencies among variable tokens embedded from the entire historical sequence, which can lead to overfitting on early, weakly relevant variable relationships that degrade future predictions\. In contrast, our PVA module succeed to avoid this by focusing on the shape similarities of variables within the most nearest patch\.
Table 3:Ablation study on PMD module\. We assess different modules for patch\-wise normalization, along with removing PMD module\. Results are averaged across all prediction horizons\.Table 4:Ablation studies on TRA and PVA modules in PMDformer: Performance impacts of replacements, removals, and order swaps across ETTh2, ETTm1, Traffic, and Solar datasets\. Results are averaged across all prediction horizons\.#### PMD Module Analysis\.
We assessed the effectiveness of the PMD module through extensive ablations conducted on five non\-stationary benchmarks: ETTh2, ETTm1, Weather, Traffic, and Solar \(Wen et al\., [2023](https://arxiv.org/html/2606.26549#bib.bib47); Kim et al\., [2025](https://arxiv.org/html/2606.26549#bib.bib48)\)\. Using a fixed input length of 720, we tested the model’s performance across various prediction horizons \(96, 192, 336, and 720\) against several patch\-wise normalization variants: \(i\) mean–variance standardization \(’w/ stdev’\), \(ii\) the Scale\-Adaptive Normalization \(SAN\) \(Liu et al\., [2023b](https://arxiv.org/html/2606.26549#bib.bib25)\) method, and \(iii\) removing the PMD module entirely\. As presented in Table [3](https://arxiv.org/html/2606.26549#S4.T3), PMDformer consistently achieves the best accuracy across all datasets\. We attribute this advantage to the PMD module’s per\-patch centering mechanism, which preserves crucial intra\-patch shape information\. This preservation allows the Transformer architecture to concentrate its attention on modeling shape similarity\. Furthermore, by explicitly injecting the patch mean as a separated trend component into the Transformer pathway, PMDformer can capture and model long\-term trends\. In contrast, SAN explicitly decouples the scale and residual components for independent prediction\. Since global scale estimation is inherently unstable in highly non\-stationary series, this rigid decoupling undermines the joint modeling of scale–shape interactions, leading to overfitting and weaker generalization\.
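The contrast between per\-patch centering and the 'w/ stdev' variant can be illustrated with a small sketch. This is not the authors' implementation: it assumes non\-overlapping patches over a single variable, and the helper names (`split_patches`, `patch_mean_decouple`, `patch_standardize`) are hypothetical. It only shows that subtracting the patch mean removes level differences while keeping amplitude, whereas dividing by the patch standard deviation also erases amplitude.

```python
from statistics import fmean, pstdev


def split_patches(series, patch_len):
    """Split a 1-D series into non-overlapping patches (an assumed layout)."""
    if len(series) % patch_len != 0:
        raise ValueError("series length must be a multiple of patch_len")
    return [series[i:i + patch_len] for i in range(0, len(series), patch_len)]


def patch_mean_decouple(series, patch_len):
    """Return (patch means, mean-centred residual patches)."""
    patches = split_patches(series, patch_len)
    means = [fmean(p) for p in patches]
    residuals = [[x - m for x in p] for p, m in zip(patches, means)]
    return means, residuals


def patch_standardize(series, patch_len, eps=1e-8):
    """'w/ stdev'-style variant: centre each patch, then divide by its std."""
    out = []
    for p in split_patches(series, patch_len):
        m, s = fmean(p), pstdev(p)
        out.append([(x - m) / (s + eps) for x in p])
    return out


# Patches 1 and 2 share a shape at different levels; patch 3 has the same
# outline but a 20x larger amplitude.
series = [1.0, 2.0, 3.0, 2.0, 11.0, 12.0, 13.0, 12.0, 0.0, 20.0, 40.0, 20.0]
means, residuals = patch_mean_decouple(series, patch_len=4)
standardized = patch_standardize(series, patch_len=4)
```

In this toy input the residuals of the first two patches coincide after centering, while the third stays distinguishable by amplitude; after standardization all three patches become nearly identical.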
#### TRA & PVA Analysis\.
To assess the effectiveness of the TRA and PVA modules, we conducted ablation studies on the ETTh2, ETTm1, Traffic, and Solar datasets\. For the TRA module, we tested two alternatives: replacing it with standard self\-attention or removing it entirely\. For the PVA module, we either modified it to compute variable\-wise shape similarity across all patches or removed the module completely\. Additionally, we investigated a structural variant that swaps the sequential order of the two modules\. The experimental outcomes are summarized in Table [4](https://arxiv.org/html/2606.26549#S4.T4)\.
The results show that PMDformer consistently outperforms all ablated variants across every dataset and configuration\. When TRA is replaced with standard self\-attention, performance degrades significantly because the long\-term trend information is neglected\. Similarly, when PVA is forced to compute variable\-wise shape similarity across all historical patches, performance decreases\. This supports our hypothesis that early variable relationships are often only weakly or spuriously correlated with the predictive sequences, justifying PVA’s proximal focus\. Furthermore, removing both TRA and PVA results in the largest performance drop observed, highlighting the dual importance of TRA in modeling temporal patch shapes and long\-range trends, and PVA in capturing relevant variable\-wise shape similarity\. Finally, swapping the original order of TRA and PVA also causes notable performance degradation\. When TRA is applied first, it compresses patch information too early, making it harder for the subsequent variable modeling to identify meaningful cross\-variable dependencies\.
### 4\.3 Parameter Sensitivity Analysis
#### Patch Count for Cross\-Variable Modeling\.
We evaluate the impact of capturing variable patterns within different numbers of patches, where $k\in\{1,2,3,5,7,10\}$\. For each setting, the $k$ patches nearest to the future sequence are selected to capture the shape similarity of variables, thereby further validating the effectiveness of PVA\. Experiments are conducted on the ETTh1 and Solar datasets\. As shown in Figure [4](https://arxiv.org/html/2606.26549#S4.F4)\(a\), the mean squared error \(MSE\) exhibits an overall upward trend as $k$ gradually increases on the ETTh1 dataset\. On the Solar dataset, this increase is more pronounced when predicting 192, 336, or 720 steps ahead, because future sequences are more weakly correlated with early variable relationships\. Moreover, the MSE curves show some fluctuations, indicating that the choice of $k$ can noticeably change prediction performance\. In contrast, across all four prediction horizons, using $k=1$ yields more stable performance than larger $k$\. This is because the nearest patch is typically more closely aligned with the target sequence to be predicted, making it more beneficial for accurate modeling\.
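The effect of restricting cross\-variable similarity to the $k$ nearest patches can be sketched as follows. This is an illustrative simplification rather than the PVA module itself: it assumes the tokens are raw mean\-centred patches with no learned projections and uses a plain scaled dot\-product softmax. The function name `proximal_variable_weights` and the toy data are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def proximal_variable_weights(residual_patches, k=1):
    """Cross-variable attention weights from the last k patches only.

    residual_patches[c][n] is the mean-centred patch n of variable c.
    Returns a C x C matrix whose rows sum to 1.
    """
    # Concatenate the k most recent patches of each variable into one token.
    tokens = [[x for patch in var[-k:] for x in patch] for var in residual_patches]
    scale = math.sqrt(len(tokens[0]))
    scores = [
        [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        for query in tokens
    ]
    return [softmax(row) for row in scores]


# Variable 1 matches variable 0 only in the most recent patch;
# variable 2 matches variable 0 only in the older patch.
residual_patches = [
    [[-1.0, 0.0, 1.0, 0.0], [-1.0, 1.0, -1.0, 1.0]],
    [[1.0, 0.0, -1.0, 0.0], [-1.0, 1.0, -1.0, 1.0]],
    [[-1.0, 0.0, 1.0, 0.0], [1.0, -1.0, 1.0, -1.0]],
]
weights_k1 = proximal_variable_weights(residual_patches, k=1)
weights_k2 = proximal_variable_weights(residual_patches, k=2)
```

In this toy case, moving from `k=1` to `k=2` shifts weight in the first row away from the variable that agrees in the most recent patch and toward the one that only agreed earlier, which is the kind of outdated correlation the proximal focus is meant to suppress.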
#### Patch Size\.
Different patch sizes lead to varying degrees of distinction among patches\. To investigate this, we evaluate multiple patch sizes $\{8,16,24,48,72,120\}$ on the ETTh2 and ECL datasets\. As shown in Figure [4](https://arxiv.org/html/2606.26549#S4.F4)\(b\), both overly small and overly large patch sizes fail to deliver optimal performance\. This is because excessively small patches provide insufficient shape information to distinguish similarity, making it difficult for the attention mechanism to capture underlying temporal dependencies or genuine variable correlations\. Conversely, overly large patches reduce the number of tokens, thereby limiting the model’s ability to capture long\-range dependencies\. Based on these observations, we find that moderate patch sizes, particularly within $\{24,48,72\}$, achieve a better trade\-off and yield more robust performance\.
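The token\-count side of this trade\-off is simple arithmetic. The sketch below assumes an input length of 720 (the value used in the ablations; the length for this sensitivity study is not stated here) and non\-overlapping patches, which is also an assumption. `num_tokens` is a hypothetical helper.

```python
INPUT_LEN = 720  # assumed; taken from the ablation setting
PATCH_SIZES = [8, 16, 24, 48, 72, 120]


def num_tokens(input_len, patch_len, stride=None):
    """Number of patches (tokens) produced by a sliding window.

    With stride equal to patch_len the patches do not overlap.
    """
    stride = patch_len if stride is None else stride
    return (input_len - patch_len) // stride + 1


token_counts = {p: num_tokens(INPUT_LEN, p) for p in PATCH_SIZES}
for patch_len, count in token_counts.items():
    print(f"patch size {patch_len:>3}: {count:>2} tokens of {patch_len} points each")
```

Under these assumptions a patch size of 8 gives 90 short tokens, while a patch size of 120 leaves only 6 tokens for attention to relate.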
Figure 4: Parameter Sensitivity Analysis\. \(a\) Selection of the number of $k$ nearest patches to the prediction sequence for capturing inter\-variable dependencies on these patches\. Superior and more stable performance is achieved when $k=1$\. \(b\) Different patch sizes are used to partition the input sequence, and a moderate patch size yields the optimal choice\.
## 5 Conclusion
In this paper, we tackle challenges in long\-term time series forecasting by emphasizing true shape similarities hidden by scale variations in non\-stationary data\. Our patch\-mean decoupling \(PMD\) separates trends from residual shapes while preserving amplitudes, enabling shape\-focused attention across patches and variables\. PMDformer integrates PMD with proximal variable attention \(PVA\) for recent inter\-variable dependencies and trend restoration attention \(TRA\) for global trend reintegration\. Experiments on LTSF benchmarks show PMDformer surpasses state\-of\-the\-art baselines in accuracy and stability, underscoring the value of shape\-centric Transformer designs\. Future directions include scaling to higher\-dimensional multivariate data and multimodal integrations for applications in energy, finance, and traffic\.
## 6 Acknowledgements
This work was supported by the Major Science and Technology Special Project of the Sichuan Provincial Department of Science and Technology \(Grant No\. 2024ZDZX0002\), the Sichuan Provincial Innovation Group Project \(Grant No\. 2024NSFTD0054\), Fundamental Research Funds for the Central Universities \(JBK202511081\), the Blockchain Research Center of China, the Natural Science Foundation of China \(Grant No\. 62502397\), the National Natural Science Foundation of China \(Grant No\. 72471197\), and the Sichuan Provincial Philosophy and Social Science Fund \(Grant No\. SCJJ25ND091\)\.
## References
- G\. E\. P\. Box and G\. Jenkins \(1990\) Time series analysis, forecasting and control\. Holden\-Day, Inc\.
- P\. Chen, Y\. Zhang, Y\. Cheng, Y\. Shu, Y\. Wang, Q\. Wen, B\. Yang, and C\. Guo \(2024\) Pathformer: multi\-scale transformers with adaptive pathways for time series forecasting\. arXiv preprint arXiv:2402\.05956\.
- A\. Das, W\. Kong, R\. Sen, and Y\. Zhou \(2024\) A decoder\-only foundation model for time\-series forecasting\. In Forty\-first International Conference on Machine Learning\.
- A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, et al\. \(2020\) An image is worth 16x16 words: transformers for image recognition at scale\. arXiv preprint arXiv:2010\.11929\.
- V\. Ekambaram, A\. Jati, N\. Nguyen, P\. Sinthong, and J\. Kalagnanam \(2023\) TSMixer: lightweight mlp\-mixer model for multivariate time series forecasting\. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining\.
- E\. Eldele, M\. Ragab, Z\. Chen, M\. Wu, and X\. Li \(2024\) Tslanet: rethinking transformers for time series representation learning\. arXiv preprint arXiv:2404\.08472\.
- W\. Fan, P\. Wang, D\. Wang, D\. Wang, Y\. Zhou, and Y\. Fu \(2023\) Dish\-ts: a general paradigm for alleviating distribution shift in time series forecasting\. In Proceedings of the AAAI conference on artificial intelligence\.
- T\. Germain, S\. Gruffaz, C\. Truong, A\. Durmus, and L\. Oudre \(2024\) Shape analysis for time series\. Advances in neural information processing systems 37, pp\. 95607–95638\.
- Z\. Gong, Y\. Tang, and J\. Liang \(2023\) Patchmixer: a patch\-mixing architecture for long\-term time series forecasting\. arXiv preprint arXiv:2310\.00655\.
- J\. Grabocka, N\. Schilling, M\. Wistuba, and L\. Schmidt\-Thieme \(2014\) Learning time\-series shapelets\. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp\. 392–401\.
- S\. Guo, Y\. Lin, N\. Feng, C\. Song, and H\. Wan \(2019\) Attention based spatial\-temporal graph convolutional networks for traffic flow forecasting\. Proceedings of the AAAI Conference on Artificial Intelligence\.
- J\. D\. Hamilton \(2020\) Time series analysis\. Princeton university press\.
- L\. Han, X\. Chen, H\. Ye, and D\. Zhan \(2024a\) SOFTS: efficient multivariate time series forecasting with series\-core fusion\. arXiv preprint arXiv:2404\.14197\.
- L\. Han, H\. Ye, and D\. Zhan \(2024b\) SIN: selective and interpretable normalization for long\-term time series forecasting\. In Forty\-first International Conference on Machine Learning\.
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\) Deep residual learning for image recognition\. In CVPR\.
- A\. Hu, L\. Wen, Y\. Dai, S\. Qi, J\. Wang, Z\. Chen, X\. Zhou, D\. Wang, Z\. Xu, and J\. Duan \(2025a\) TimeCNN: refining cross\-variable interaction on time point for time series forecasting\. Neural Networks\.
- A\. Hu, L\. Wen, J\. Duan, Y\. Dai, D\. Wang, S\. Huang, J\. Wang, and Z\. Xu \(2025b\) FDNet: high\-frequency disentanglement network with information\-theoretic guidance for multivariate time series forecasting\. Pattern Recognition\.
- Y\. Hu, Y\. Li, P\. Liu, Y\. Zhu, N\. Li, T\. Dai, S\. Xia, D\. Cheng, and C\. Jiang \(2025c\) FinTSB: a comprehensive and practical benchmark for financial time series forecasting\. arXiv preprint arXiv:2502\.18834\.
- Q\. Huang, L\. Shen, R\. Zhang, S\. Ding, B\. Wang, Z\. Zhou, and Y\. Wang \(2023\) CrossGNN: confronting noisy multivariate time series via cross interaction refinement\. NeurIPS\.
- Q\. Huang, Z\. Zhou, K\. Yang, Z\. Yi, X\. Wang, and Y\. Wang \(2025\) TimeBase: the power of minimalism in long\-term time series forecasting\. In Proceedings of the Forty\-Second International Conference on Machine Learning \(ICML\)\.
- M\. Jin, S\. Wang, L\. Ma, Z\. Chu, J\. Y\. Zhang, X\. Shi, P\. Chen, Y\. Liang, Y\. Li, S\. Pan, et al\. \(2023\) Time\-llm: time series forecasting by reprogramming large language models\. arXiv preprint arXiv:2310\.01728\.
- K\. Kacprzyk, T\. Liu, and M\. van der Schaar \(2024\) Towards transparent time series forecasting\. In The Twelfth International Conference on Learning Representations\.
- H\. Kim, S\. Kim, J\. Mok, and S\. Yoon \(2025\) Battling the non\-stationarity in time series forecasting via test\-time adaptation\. In AAAI\.
- T\. Kim, J\. Kim, Y\. Tae, C\. Park, J\. Choi, and J\. Choo \(2021\) Reversible instance normalization for accurate time\-series forecasting against distribution shift\. In International conference on learning representations\.
- D\. P\. Kingma \(2014\) Adam: a method for stochastic optimization\. arXiv preprint arXiv:1412\.6980\.
- Z\. Li, S\. Qi, Y\. Li, and Z\. Xu \(2023\) Revisiting long\-term time series forecasting: an investigation on linear mapping\. arXiv preprint arXiv:2305\.10721\.
- S\. Lin, H\. Chen, H\. Wu, C\. Qiu, and W\. Lin \(2025\) Temporal query network for efficient multivariate time series forecasting\. In Forty\-second International Conference on Machine Learning\.
- S\. Lin, W\. Lin, W\. Wu, H\. Chen, and J\. Yang \(2024\) SparseTSF: modeling long\-term time series forecasting with 1k parameters\. In Forty\-first International Conference on Machine Learning\.
- S\. Liu, H\. Yu, C\. Liao, J\. Li, W\. Lin, A\. X\. Liu, and S\. Dustdar \(2022a\) Pyraformer: low\-complexity pyramidal attention for long\-range time series modeling and forecasting\. In ICLR\.
- X\. Liu, Y\. Xia, Y\. Liang, J\. Hu, Y\. Wang, L\. Bai, C\. Huang, Z\. Liu, B\. Hooi, and R\. Zimmermann \(2023a\) LargeST: a benchmark dataset for large\-scale traffic forecasting\. In Advances in Neural Information Processing Systems\.
- Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long \(2024a\) ITransformer: inverted transformers are effective for time series forecasting\. ICLR\.
- Y\. Liu, H\. Wu, J\. Wang, and M\. Long \(2022b\) Non\-stationary transformers: exploring the stationarity in time series forecasting\. Advances in neural information processing systems 35, pp\. 9881–9893\.
- Y\. Liu, H\. Zhang, C\. Li, X\. Huang, J\. Wang, and M\. Long \(2024b\) Timer: generative pre\-trained transformers are large time series models\. arXiv preprint arXiv:2402\.02368\.
- Z\. Liu, M\. Cheng, Z\. Li, Z\. Huang, Q\. Liu, Y\. Xie, and E\. Chen \(2023b\) Adaptive normalization for non\-stationary time series forecasting: a temporal slice perspective\. In Thirty\-seventh Conference on Neural Information Processing Systems\.
- D\. Luo and X\. Wang \(2024\) Moderntcn: a modern pure convolution structure for general time series analysis\. In The twelfth international conference on learning representations\.
- Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam \(2023\) A time series is worth 64 words: long\-term forecasting with transformers\. ICLR\.
- Z\. Pan, Y\. Jiang, S\. Garg, A\. Schneider, Y\. Nevmyvaka, and D\. Song \(2024\) S2ip\-llm: semantic space informed prompt learning with llm for time series forecasting\. In Forty\-first International Conference on Machine Learning\.
- A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga, et al\. \(2019\) Pytorch: an imperative style, high\-performance deep learning library\. Advances in neural information processing systems 32\.
- X\. Qiu, J\. Hu, L\. Zhou, X\. Wu, J\. Du, B\. Zhang, C\. Guo, A\. Zhou, C\. S\. Jensen, Z\. Sheng, et al\. \(2024\) Tfb: towards comprehensive and fair benchmarking of time series forecasting methods\. arXiv preprint arXiv:2403\.20150\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. Advances in neural information processing systems\.
- H\. Wang, J\. Peng, F\. Huang, J\. Wang, J\. Chen, and Y\. Xiao \(2023\) MICN: multi\-scale local and global context modeling for long\-term series forecasting\. In ICLR\.
- S\. Wang, H\. Wu, X\. Shi, T\. Hu, H\. Luo, L\. Ma, J\. Y\. Zhang, and J\. ZHOU \(2024a\) TimeMixer: decomposable multiscale mixing for time series forecasting\. In International Conference on Learning Representations \(ICLR\)\.
- Y\. Wang, H\. Wu, J\. Dong, Y\. Liu, M\. Long, and J\. Wang \(2024b\) Deep time series models: a comprehensive survey and benchmark\. arXiv preprint arXiv:2407\.13278\.
- Y\. Wang, H\. Wu, J\. Dong, Y\. Liu, Y\. Qiu, H\. Zhang, J\. Wang, and M\. Long \(2024c\) Timexer: empowering transformers for time series forecasting with exogenous variables\. Advances in Neural Information Processing Systems\.
- Q\. Wen, W\. Chen, L\. Sun, Z\. Zhang, L\. Wang, R\. Jin, T\. Tan, et al\. \(2023\) Onenet: enhancing time series forecasting models under concept drift by online ensembling\. Advances in Neural Information Processing Systems 36, pp\. 69949–69980\.
- G\. Woo, C\. Liu, A\. Kumar, C\. Xiong, S\. Savarese, and D\. Sahoo \(2024\) Unified training of universal time series forecasting transformers\. In ICML\.
- H\. Wu, J\. Xu, J\. Wang, and M\. Long \(2021\) Autoformer: decomposition transformers with auto\-correlation for long\-term series forecasting\. Advances in neural information processing systems 34, pp\. 22419–22430\.
- K\. Yi, Q\. Zhang, W\. Fan, H\. He, L\. Hu, P\. Wang, N\. An, L\. Cao, and Z\. Niu \(2023a\) FourierGNN: rethinking multivariate time series forecasting from a pure graph perspective\. Advances in neural information processing systems 36, pp\. 69638–69660\.
- K\. Yi, Q\. Zhang, W\. Fan, H\. He, L\. Hu, P\. Wang, N\. An, L\. Cao, and Z\. Niu \(2023b\) FourierGNN: rethinking multivariate time series forecasting from a pure graph perspective\. In Thirty\-seventh Conference on Neural Information Processing Systems\.
- A\. Zeng, M\. Chen, L\. Zhang, and Q\. Xu \(2023\) Are transformers effective for time series forecasting? AAAI\.
- Y\. Zhang and J\. Yan \(2023\) Crossformer: transformer utilizing cross\-dimension dependency for multivariate time series forecasting\. In The eleventh international conference on learning representations\.
- H\. Zhou, S\. Zhang, J\. Peng, S\. Zhang, J\. Li, H\. Xiong, and W\. Zhang \(2021\) Informer: beyond efficient transformer for long sequence time\-series forecasting\. In AAAI\.
- T\. Zhou, Z\. Ma, Q\. Wen, X\. Wang, L\. Sun, and R\. Jin \(2022\) Fedformer: frequency enhanced decomposed transformer for long\-term series forecasting\. In International conference on machine learning, pp\. 27268–27286\.
## Appendix A Appendix
### A\.1 Efficiency Analysis
To evaluate the efficiency of our model in handling complex tasks, we conduct experiments under two settings: varying the number of variables and varying the input length\. In the first setting, we fix the input length at 720 and change the number of variables; in the second setting, we fix the number of variables at 100 and test PMDformer with different input lengths\. The batch size is set to 1 in all experiments\. The results are shown in Figure [5](https://arxiv.org/html/2606.26549#A1.F5)\. Under both settings, compared with recent popular models such as PatchTST \(Nie et al\., [2023](https://arxiv.org/html/2606.26549#bib.bib11)\), iTransformer \(Liu et al\., [2024a](https://arxiv.org/html/2606.26549#bib.bib10)\), and ModernTCN \(Luo and Wang, [2024](https://arxiv.org/html/2606.26549#bib.bib27)\), PMDformer requires significantly less GPU memory, thereby reducing the overall computational cost\.
Figure 5: \(a\) Comparison of memory usage with varying number of variables $C$\. \(b\) Comparison of memory usage with varying input sequence length $L$\. PMDformer consistently requires the lowest memory\.

Figure 6: Comparison on synthetic data\. The ground truth alternates between pulse and sine shapes with varying scales\. The ’w/o PMD’ yields smoothed outputs and struggles to recognize the shape similarity, while ’w/ PMD’ effectively fits the shapes and trends\.
### A\.2 Comparison on Synthetic Data
To further validate the effectiveness of our PMD module, we conduct an experiment on a synthetic dataset\. This dataset consists of patches alternating between two different shapes: a sharp pulse wave with large amplitude and a smooth sine wave with small amplitude\. To simulate non\-stationary time series, the patches exhibit varying scales and are augmented with moderate noise\. We compare a standard patch\-based Transformer \(w/o PMD\) against our model incorporating the patch\-mean decoupling module \(w/ PMD\)\. As illustrated in Figure [6](https://arxiv.org/html/2606.26549#A1.F6), the ’w/o PMD’ model struggles to recognize true shape similarities due to scale differences between patches, leading to predictions that resemble mostly smooth curves with inadequate trend fitting\. In contrast, our ’w/ PMD’ model, by removing scale factors, enables attention to focus more effectively on intrinsic shapes, resulting in predictions that better capture both the underlying patterns and long\-range trends\.
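A series of the kind described above could be generated roughly as follows. The exact generator is not given in the text, so every numeric choice here (patch length, amplitude and level ranges, noise level, pulse width) is an assumption, and `make_patch` and `make_synthetic_series` are hypothetical names.

```python
import math
import random


def make_patch(kind, patch_len, amplitude):
    """Build one noise-free patch: a narrow pulse or one sine period."""
    if kind == "pulse":
        mid = patch_len // 2
        return [amplitude if abs(t - mid) <= 1 else 0.0 for t in range(patch_len)]
    return [amplitude * math.sin(2.0 * math.pi * t / patch_len) for t in range(patch_len)]


def make_synthetic_series(num_patches=8, patch_len=24, noise_std=0.05, seed=0):
    """Alternate large-amplitude pulses and small-amplitude sines.

    Each patch gets its own random level and amplitude to mimic
    non-stationarity, plus Gaussian noise.
    """
    rng = random.Random(seed)
    series = []
    for n in range(num_patches):
        level = rng.uniform(-5.0, 5.0)  # assumed range of patch-level offsets
        if n % 2 == 0:
            shape = make_patch("pulse", patch_len, rng.uniform(3.0, 6.0))
        else:
            shape = make_patch("sine", patch_len, rng.uniform(0.5, 1.0))
        series.extend(level + s + rng.gauss(0.0, noise_std) for s in shape)
    return series


series = make_synthetic_series()
```

Because the level offsets here are larger than the sine amplitudes, raw patches of the same shape can sit far apart in value, which is the situation where mean\-centring each patch is expected to help attention compare shapes.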
## Appendix B ON THE USE OF LARGE LANGUAGE MODELS
The authors used large language models \(LLMs\) exclusively for language polishing and minor rephrasing during the final writing stage\. All scientific content, ideas, and initial drafts were created entirely by the authors, and any text improved by LLMs was carefully checked and edited by the authors\. LLMs played no role in developing research questions, designing experiments, analyzing results, or any other aspect of the research itself\.