Nested Spatio-Temporal Time Series Forecasting

arXiv cs.LG Papers

Summary

This paper proposes a nested spatiotemporal forecasting framework that uses spectral clustering to construct semantically coherent macro-level regions, which provide top-down guidance for fine-grained micro-level predictions. Experiments on high-dimensional datasets show consistent improvements over state-of-the-art baselines.

arXiv:2605.16447v1 Announce Type: new

# Nested Spatio-Temporal Time Series Forecasting
Source: [https://arxiv.org/html/2605.16447](https://arxiv.org/html/2605.16447)
Yukai Zhou∗⋄, Ruoxi Jiang†, Junyi An†, Chao Qu†, Zhijian Zhou, Shiyu Wang, Fenglei Cao, Zenglin Xu†, Furao Shen†, Yuan Qi†

###### Abstract

Spatiotemporal forecasting is critical for real-world applications like traffic management, yet capturing reliable interactions remains challenging under noisy and non-stationary conditions. Existing methods primarily rely on historical spatial priors, often failing to account for evolving temporal correlations and suffering from systematic errors. In this work, we propose a nested forecasting framework that couples future macro-level regional trends with micro-level historical observations, enabling top-down guidance from abstract future representations for fine-grained forecasting. Specifically, we employ a spectral clustering-based approach to construct semantically coherent regions, providing both theoretical and empirical evidence that this representation effectively filters systematic noise while preserving essential trends. Building on this, we develop a progressive coarse-to-fine predictor to integrate these representative features into the inference process. This enables the model to leverage trend predictions to anticipate dynamic anomalies, such as periodic offsets, in advance. Furthermore, extensive experiments on multiple high-dimensional datasets demonstrate that our method consistently outperforms state-of-the-art baselines, validating the effectiveness of future macro-guided nested forecasting.

Spatiotemporal Forecasting, Graph Neural Networks, ICML

## 1 Introduction

Spatio-temporal forecasting (STF) plays a pivotal role in modern intelligent systems, supporting diverse applications from urban traffic management to extreme weather prediction (Jin et al., [2024](https://arxiv.org/html/2605.16447#bib.bib89); Kumar et al., [2025](https://arxiv.org/html/2605.16447#bib.bib67); Wang et al., [2025b](https://arxiv.org/html/2605.16447#bib.bib79); Chen et al., [2025](https://arxiv.org/html/2605.16447#bib.bib84); Wang et al., [2025a](https://arxiv.org/html/2605.16447#bib.bib85)). In these domains, accurate forecasting is indispensable for proactive decision-making, such as early congestion control (Hamedmoghadam et al., [2022](https://arxiv.org/html/2605.16447#bib.bib46)) and emergency planning (Guo et al., [2023](https://arxiv.org/html/2605.16447#bib.bib44)). Nevertheless, achieving robust predictions over extended horizons remains a significant challenge, primarily due to the complex spatial interactions and dynamic temporal patterns inherent in real-world systems (Lan et al., [2022](https://arxiv.org/html/2605.16447#bib.bib31); Chen et al., [2024](https://arxiv.org/html/2605.16447#bib.bib83); He et al., [2025](https://arxiv.org/html/2605.16447#bib.bib32); Gao et al., [2025](https://arxiv.org/html/2605.16447#bib.bib78)).

As a specialized task within Multivariate Time Series (MTS) forecasting, STF focuses on improving prediction accuracy by effectively modeling spatial correlations (Shao et al., [2025](https://arxiv.org/html/2605.16447#bib.bib23)). Early approaches (Li et al., [2018](https://arxiv.org/html/2605.16447#bib.bib29); Yu et al., [2018](https://arxiv.org/html/2605.16447#bib.bib69)) typically integrated prior topological structures directly into graph-based learners; however, these priors often require expert knowledge, and might overlook the intricate patterns in the dynamic feature space. To better align with the inductive biases of dynamic data, subsequent methods (Wu et al., [2019](https://arxiv.org/html/2605.16447#bib.bib34), [2020b](https://arxiv.org/html/2605.16447#bib.bib57)) introduced learnable adjacency matrices to adaptively capture latent edges with graph neural networks. Other research (Shao et al., [2022a](https://arxiv.org/html/2605.16447#bib.bib58); Jiang et al., [2023a](https://arxiv.org/html/2605.16447#bib.bib60); Diao et al., [2024](https://arxiv.org/html/2605.16447#bib.bib30); Ma et al., [2025b](https://arxiv.org/html/2605.16447#bib.bib77)) has extended this to the temporal dimension, employing time-varying and multi-view graphs to model evolving interactions. Despite these architectural advancements, existing frameworks face a critical limitation: the fine-grained, full-graph modeling prevalent in current methods is highly susceptible to system noise, which becomes particularly acute with the growth of spatial scale. Within this expanded search space, models are prone to learn spurious correlations (Zhao et al., [2023](https://arxiv.org/html/2605.16447#bib.bib25)), ultimately degrading the robustness of the learned representations.

To mitigate the impact of structural uncertainties and achieve robust forecasting, we investigate the utility of macro-level representations. A standard practice for constructing such representations involves spatial aggregation or slicing; however, existing literature (Ma et al., [2023](https://arxiv.org/html/2605.16447#bib.bib36); Zhang et al., [2024](https://arxiv.org/html/2605.16447#bib.bib19); Fang et al., [2025](https://arxiv.org/html/2605.16447#bib.bib59)) primarily treats these coarse-grained signals as auxiliary inputs to provide robust historical statistics. In this work, we aim to explore a more promising paradigm: leveraging coarse-grained representations to characterize future states, thereby serving as a stable structural guide for the forecasting process. This approach introduces a significant challenge: how to extract coarse-grained signals with high representational fidelity while maintaining topological and semantic alignment with their fine-grained counterparts.

Motivated by these insights, we propose NeST (Nested Spatio-Temporal forecasting), a spatio-temporal forecasting framework that moves beyond micro-level modeling through macro-guided design with future awareness. NeST realizes the cross-horizon modeling of historical fine-grained and future coarse-grained dynamics through two core stages. First, to extract representative macro-dynamics from local observations, we employ semantic spectral clustering (Ng et al., [2001](https://arxiv.org/html/2605.16447#bib.bib99)), operating on an affinity matrix constructed directly from raw feature sequences, adapting to dynamic semantic correlations without relying on static physical priors while yielding a compact representation space. Second, to enable effective multi-scale interaction for temporal forecasting, we introduce a symmetric attention mechanism that facilitates bidirectional information flow across spatio-temporal scales. In particular, macro-level states are predicted multiple steps ahead, providing abstract future context that regularizes fine-grained forecasting. Crucially, the low-rank nature of the macro-state representation significantly reduces computational cost while preserving trend-level expressiveness. Our contributions are summarized as follows:

- We propose NeST, a nested spatio-temporal forecasting framework that introduces macro-guided cross-horizon modeling, using predicted region-level futures as explicit top-down guidance to regularize and enhance fine-grained forecasting.
- We design a computationally efficient multi-scale architecture that leverages semantic spectral clustering for capturing dynamic representative features, enabling robust alignment while better preserving systematic trends.
- We validate the effectiveness of NeST through extensive experiments on multiple large-scale datasets, achieving consistent improvements over state-of-the-art methods across diverse metrics.

## 2 Related Works

Spatio-Temporal Forecasting. The core objective of spatio-temporal forecasting is to predict future system states by capturing complex dependencies present in historical observations. Early works combined recurrent or temporal convolution modules with graph encoders that use fixed topologies to model spatial correlations (Li et al., [2018](https://arxiv.org/html/2605.16447#bib.bib29); Yu et al., [2018](https://arxiv.org/html/2605.16447#bib.bib69)). To relax the reliance on predefined structures, later methods such as GWNET, MTGNN, and AGCRN introduced adaptive embeddings to infer latent spatial dependencies directly from observations (Wu et al., [2019](https://arxiv.org/html/2605.16447#bib.bib34), [2020b](https://arxiv.org/html/2605.16447#bib.bib57); Bai et al., [2020](https://arxiv.org/html/2605.16447#bib.bib18)). Although these data-driven methods learn spatial relations from data, the inferred relational structures are typically static during inference. As a result, they still struggle to capture spatial dependencies that evolve over time, which are common in complex real-world systems (Han et al., [2021](https://arxiv.org/html/2605.16447#bib.bib80)). To capture time-varying connectivity, a recent line of work leverages attention and dynamic message-passing mechanisms that adapt relations over time; representative examples include DSTAGNN, MegaCRN and STAEFormer, among others (Lan et al., [2022](https://arxiv.org/html/2605.16447#bib.bib31); Jiang et al., [2023b](https://arxiv.org/html/2605.16447#bib.bib56); Liu et al., [2023a](https://arxiv.org/html/2605.16447#bib.bib65); Xie et al., [2023](https://arxiv.org/html/2605.16447#bib.bib81); Kong et al., [2024](https://arxiv.org/html/2605.16447#bib.bib61); Gong et al., [2024](https://arxiv.org/html/2605.16447#bib.bib86); Li et al., [2025](https://arxiv.org/html/2605.16447#bib.bib17)). These approaches enable models to track changing dependencies and better handle nonstationary spatio-temporal dynamics (Lyu et al., [2025](https://arxiv.org/html/2605.16447#bib.bib82)). Building on these dynamic capabilities, recent research further targets systemic challenges such as spatial heterogeneity (Ji et al., [2023](https://arxiv.org/html/2605.16447#bib.bib75); Dong et al., [2024](https://arxiv.org/html/2605.16447#bib.bib37)) and the scalability of massive networks (Yuan et al., [2024](https://arxiv.org/html/2605.16447#bib.bib91); Fang et al., [2025](https://arxiv.org/html/2605.16447#bib.bib59)). Despite these advances, most existing architectures remain focused on micro-scale modeling, which leaves them sensitive to noise and short-term irregularities.

Hierarchical Spatio-Temporal Modeling. Hierarchical structures help capture multi-scale spatio-temporal dynamics (Mao et al., [2025](https://arxiv.org/html/2605.16447#bib.bib1)). Early works such as HGCN and HRNR grouped sensors into static regions to summarize macro-scale patterns (Guo et al., [2021](https://arxiv.org/html/2605.16447#bib.bib35); Wu et al., [2020a](https://arxiv.org/html/2605.16447#bib.bib39)), and later extensions (e.g., HSDGNN) modeled more complex structural dependencies (Zhou et al., [2025](https://arxiv.org/html/2605.16447#bib.bib96)). More recent methods introduce flexible cross-scale interactions: HiSTGNN, HIEST and AIMST propose dynamic mechanisms to exchange information between sensor-level and region-level representations, while HSTAN and HSFE apply multi-level attention and feature fusion to integrate global and local correlations (Ma et al., [2022](https://arxiv.org/html/2605.16447#bib.bib38), [2023](https://arxiv.org/html/2605.16447#bib.bib36); Zhang et al., [2024](https://arxiv.org/html/2605.16447#bib.bib19); Marisca et al., [2024](https://arxiv.org/html/2605.16447#bib.bib13)).

Beyond spatial hierarchies, several works explore temporal multi-scale modeling (Challu et al., [2023](https://arxiv.org/html/2605.16447#bib.bib10); Wang et al., [2023](https://arxiv.org/html/2605.16447#bib.bib12); Chen et al., [2023](https://arxiv.org/html/2605.16447#bib.bib53); Wang et al., [2024a](https://arxiv.org/html/2605.16447#bib.bib88)). However, most prior methods treat hierarchy primarily as a mechanism for historical representations. They typically employ a single-stage projection that maps past observations directly to predictions, forcing the model to implicitly infer evolving trends from noisy data and making predictions highly susceptible to input perturbations. On the other hand, a recent work on neural operators (Jiang et al., [2025](https://arxiv.org/html/2605.16447#bib.bib41)) demonstrates that incorporating future information can improve long-term stability and physical consistency. However, its formulation assumes regular spatiotemporal grids and is less suitable for irregular graph-structured data with missing values and high noise. In this work, NeST establishes a hierarchical forecasting paradigm that explicitly predicts future macro-level states as structural guidance. By leveraging representative future dynamics to guide fine-grained generation, NeST stabilizes the forecasting process and alleviates the limitations of single-stage prediction.

![Refer to caption](https://arxiv.org/html/2605.16447v1/x1.png)

Figure 1: Diagram of NeST. (i) Spectral representation: Node-level time series are partitioned via spectral clustering to derive representative regional dynamics $\mathbf{Z}$, which serve as macroscopic structural anchors for the system. (ii) Training phase: We process historical node signals $\mathbf{X}_{t-L+1:t}$ alongside future regional guidance $\mathbf{Z}_{t+1:t+P}$ using decoupled encoders. To bridge the discrepancy between training and inference, we employ a scheduled sampling strategy: ground-truth regional features are provided via teacher-forcing with probability $P_{\rm tf}$, while predicted rollouts $\hat{\mathbf{Z}}_{t+1:t+P}$ are utilized with probability $1-P_{\rm tf}$. Bidirectional information flow between scales is facilitated via cross-attention (MLP is omitted for clarity), which effectively bounds the interaction complexity to the number of clusters $M$ (where $M<N$), ensuring future-oriented guidance while maintaining linear scalability relative to the number of nodes.

## 3 Preliminary

We consider a spatiotemporal system consisting of $N$ correlated sensors, where observations at time $t$ are denoted by $\mathbf{X}_{t}\in\mathbb{R}^{N\times C}$. Our goal is to forecast a future sequence of length $H$ given a historical context of length $L$, denoted as $\mathbf{X}_{t-L+1:t}\in\mathbb{R}^{N\times L\times C}$.

To effectively model long-range dependencies while maintaining computational efficiency, we frame the problem as a patch-based autoregressive forecasting task. At each step, the model predicts a subsequent future patch of length $P$ from the preceding $L$ steps. During training, the model is optimized via single-step supervision on the predicted patch $\hat{\mathbf{X}}_{t+1:t+P}$. During inference, the full horizon $H$ is generated auto-regressively: the model consumes its own predicted patches as context for subsequent iterations until the entire sequence is realized.

## 4 Method

In this section, we present the NeST (Nested Spatio-Temporal) framework. NeST adopts a hierarchical coarse-to-fine paradigm, where fine-grained node-level predictions are guided by stable, macroscopic centroid dynamics to mitigate the impact of localized noise.

### 4.1 From raw data to centroid features

Direct node-level forecasting is challenging due to the high dimensionality of the output space, as well as the presence of local noise, missing values, and short-term irregular fluctuations. To address this, we leverage spectral clustering (Ng et al., [2001](https://arxiv.org/html/2605.16447#bib.bib99); Shi and Malik, [2000](https://arxiv.org/html/2605.16447#bib.bib97)) to extract latent region-level representations that serve as structural anchors for the system, enabling abstract guidance for fine-grained node-level forecasting.

#### Constructing the Temporal Affinity Matrix.

An effective regionalization should reflect temporal coherence, meaning that nodes assigned to the same region exhibit consistent long-term co-movement patterns. Affinity matrices derived from physical proximity or predefined topology are often insufficient, as they fail to capture latent semantic relationships that evolve with the system dynamics.

To address this limitation, we construct a feature-driven affinity matrix $\mathbf{A}\in\mathbb{R}^{N\times N}$ directly from raw temporal observations. Specifically, we partition the training sequence into $\tilde{T}$ non-overlapping temporal chunks, where $\tilde{T}$ is chosen to align with the intrinsic periodicity of the data (see Appendix [A.7](https://arxiv.org/html/2605.16447#A1.SS7)). For each node, we compute an averaged representation within each chunk and define pairwise affinities as

$$\mathbf{A}_{ij}=\exp\!\left(-\frac{1}{2\sigma^{2}\tilde{T}}\sum_{k=1}^{\tilde{T}}\left\lVert\mathbf{X}_{i}^{(k)}-\mathbf{X}_{j}^{(k)}\right\rVert_{2}^{2}\right),\tag{1}$$

where $\mathbf{X}_{i}^{(k)}\in\mathbb{R}^{T_{k}}$ denotes the temporal subsequence of node $i$ in chunk $k$, and $\sigma$ controls the kernel bandwidth. This construction emphasizes similarity in long-term temporal evolution rather than short-term fluctuations.
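As a rough illustration of Eq. (1), the following NumPy sketch builds the affinity matrix for a single-channel series. The function name `temporal_affinity`, the `(T, N)` input layout, and the toy sine/cosine signals are assumptions made for this example. It applies Eq. (1) literally to the raw chunk subsequences and leaves out any within-chunk averaging, whose exact form is not specified here.

```python
import numpy as np

def temporal_affinity(X, num_chunks, sigma):
    """Gaussian-kernel affinity between nodes, following Eq. (1).

    X: array of shape (T, N), one training series per node (single channel).
    num_chunks: number of non-overlapping temporal chunks (T-tilde in the text).
    sigma: kernel bandwidth.
    """
    _, N = X.shape
    # np.array_split tolerates a length that is not divisible by num_chunks.
    chunks = np.array_split(X, num_chunks, axis=0)
    sq_dist = np.zeros((N, N))
    for chunk in chunks:
        # Pairwise squared Euclidean distance between node subsequences.
        diff = chunk[:, :, None] - chunk[:, None, :]      # (T_k, N, N)
        sq_dist += (diff ** 2).sum(axis=0)
    return np.exp(-sq_dist / (2.0 * sigma ** 2 * num_chunks))

# Toy data (hypothetical): nodes 0 and 1 move together, node 2 is out of phase.
t = np.arange(48)
base = np.sin(2 * np.pi * t / 24)
X = np.stack([base, base + 0.1, np.cos(2 * np.pi * t / 24)], axis=1)
A = temporal_affinity(X, num_chunks=4, sigma=1.0)
```

In this toy setting the two co-moving nodes receive an affinity close to 1, while the out-of-phase node is nearly disconnected from them.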

Spectral representation. Given the spatiotemporal affinity matrix $\mathbf{A}$, we compute the normalized graph Laplacian

$$\mathbf{L}_{\rm sym}=\mathbf{I}-\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}},\tag{2}$$

where $\mathbf{D}$ is the degree matrix with $\mathbf{D}_{ii}=\sum_{j}\mathbf{A}_{ij}$. The Laplacian characterizes how node features vary over the learned affinity graph: nodes with strong affinities are encouraged to have similar embeddings, while weakly connected nodes are allowed to diverge. In particular, the low-frequency eigenvectors of $\mathbf{L}_{\rm sym}$ capture dominant and globally consistent correlation structures in the spatiotemporal system. These components provide a principled basis for identifying coherent latent regions aligned by long-range temporal behavior.

We then obtain $M$ regions with $M<N$ by applying K-means clustering to the spectral embeddings. This yields an assignment matrix $\mathbf{S}\in\{0,1\}^{N\times M}$, where $S_{i,m}=1$ indicates that node $i$ is assigned to region $m$. The region-level representation at time $t$ is computed via average pooling:

$$\mathbf{Z}_{t,m}=\frac{\sum_{i=1}^{N}S_{i,m}\mathbf{X}_{t,i}}{\sum_{i=1}^{N}S_{i,m}},\quad m=1,\dots,M,\tag{3}$$

where $\mathbf{X}_{t,i}\in\mathbb{R}^{C}$ denotes the feature of node $i$ at time $t$. This operation compresses node-level observations into compact centroid features for hierarchical forecasting. Crucially, this aggregation acts as a natural low-pass filter that smooths out high-frequency local anomalies while preserving underlying regional trends. A visual demonstration of how these centroid features provide stable anchors compared to noisy raw sequences is provided in Appendix [A.14](https://arxiv.org/html/2605.16447#A1.SS14).
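The pipeline from Eq. (2) to Eq. (3) can be sketched in NumPy as below. This is only an illustration under stated assumptions: the row normalization of the embedding follows the spectral clustering recipe of Ng et al. (2001), the small `kmeans` helper with farthest-point initialization stands in for whatever K-means implementation is actually used, and all function names and the toy 4-node affinity matrix are hypothetical.

```python
import numpy as np

def spectral_embedding(A, M):
    """Row-normalized eigenvectors of L_sym for the M smallest eigenvalues."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)            # eigenvalues in ascending order
    U = vecs[:, :M]
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def kmeans(U, M, iters=20):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [U[0]]
    for _ in range(1, M):
        d = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((U[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = np.argmin(dist, axis=1)
        for m in range(M):
            if np.any(labels == m):
                centers[m] = U[labels == m].mean(axis=0)
    return labels

def region_features(X, labels, M):
    """Average pooling of Eq. (3). X: (T, N, C) -> Z: (T, M, C)."""
    S = np.eye(M)[labels]                       # one-hot assignment, (N, M)
    return np.einsum("nm,tnc->tmc", S, X) / S.sum(axis=0)[None, :, None]

# Toy affinity with two obvious blocks: nodes {0, 1} and nodes {2, 3}.
A = np.array([[1.00, 0.90, 0.01, 0.01],
              [0.90, 1.00, 0.01, 0.01],
              [0.01, 0.01, 1.00, 0.90],
              [0.01, 0.01, 0.90, 1.00]])
labels = kmeans(spectral_embedding(A, M=2), M=2)

X = np.array([[1.0, 3.0, 10.0, 20.0],
              [2.0, 4.0, 30.0, 50.0]])[:, :, None]   # (T=2, N=4, C=1)
Z = region_features(X, labels, M=2)
```

Here each region feature is simply the mean of its member nodes at each time step, which is the low-pass behavior described above.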

To formally ground this intuition, we model the node observations as $\mathbf{X}_{i}=\mathbf{s}_{i}+\epsilon_{i}$, where $\mathbf{s}_{i}$ is the true signal and $\epsilon_{i}$ is independent Gaussian noise. Theorem [1](https://arxiv.org/html/2605.16447#Thmtheorem1) demonstrates that aggregating nodes with similar patterns into latent regions significantly enhances the Signal-to-Noise Ratio (SNR).

###### Theorem 1.

Consider a graph signal where the noise at each node is independent. Let $\mathcal{C}_{m}$ denote a cluster of nodes with cardinality $|\mathcal{C}_{m}|$, and let $\mathbf{Z}_{m}$ be the corresponding cluster center. Then the SNR of the cluster center satisfies

$$\mathrm{SNR}(\mathbf{Z}_{m})\geq[1+(|\mathcal{C}_{m}|-1)\rho_{m}]\cdot\overline{\mathrm{SNR}}_{m},\tag{4}$$

where $\overline{\mathrm{SNR}}_{m}$ is the average SNR of the individual signals within cluster $\mathcal{C}_{m}$, and $\rho_{m}$ is the average correlation coefficient among the true signals in the cluster:

$$\rho_{m}=\frac{1}{|\mathcal{C}_{m}|(|\mathcal{C}_{m}|-1)}\sum_{i\neq j\in\mathcal{C}_{m}}\mathrm{Corr}(\mathbf{s}_{i},\mathbf{s}_{j}).\tag{5}$$

This theorem provides a theoretical guarantee that region-level modeling suppresses local noise, with the guaranteed SNR gain growing with the intra-cluster correlation $\rho_{m}$; detailed proofs and derivations are provided in Appendix [A.4](https://arxiv.org/html/2605.16447#A1.SS4).
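A small numerical check can make the bound in Eq. (4) concrete. The sketch below assumes that SNR means signal variance divided by noise variance, and that all nodes in the cluster share the same signal variance, the same noise variance, and a common pairwise correlation. None of these assumptions are taken from the appendix proof, and the function name `snr_of_mean` is hypothetical. In this special case the bound holds with equality.

```python
import numpy as np

def snr_of_mean(signal_cov, noise_var):
    """SNR of the cluster mean, with SNR taken as signal variance / noise variance."""
    n = len(noise_var)
    w = np.full(n, 1.0 / n)                  # averaging weights
    signal_power = w @ signal_cov @ w        # variance of the averaged true signal
    noise_power = (w ** 2) @ noise_var       # independent noise: variances add
    return signal_power / noise_power

# Hypothetical cluster: 5 nodes, pairwise signal correlation 0.8.
n, rho, signal_var, sigma2 = 5, 0.8, 4.0, 1.0
signal_cov = signal_var * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))
noise_var = np.full(n, sigma2)

snr_node = signal_var / sigma2                    # per-node SNR
snr_cluster = snr_of_mean(signal_cov, noise_var)  # SNR of the cluster center
bound = (1 + (n - 1) * rho) * snr_node            # right-hand side of Eq. (4)
```

With these numbers the per-node SNR is 4 and the cluster-center SNR is 16.8, a factor of $1+(n-1)\rho=4.2$; with uncorrelated signals ($\rho=0$) the factor drops to 1.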

### 4.2 Nested Spatio-Temporal Forecasting

Dual-Horizon Encoding. To capture multi-scale dynamics across time and facilitate information interaction, we employ a projection layer that maps sequenced data $\mathbf{X}$ and $\mathbf{Z}$ into a $d$-dimensional space. At time step $t$, the model processes two inputs that occupy entirely different temporal horizons: the historical node-level sequence $\mathbf{X}_{t-L+1:t}\in\mathbb{R}^{N\times L\times C}$, covering the past $L$ steps, and the regional guidance $\mathbf{Z}_{t+1:t+P}\in\mathbb{R}^{M\times P\times C}$, providing representative features for the targeted future. To unify these inputs at different scales, we first flatten their temporal and feature axes and project them into the latent space:

$$
\begin{aligned}
\mathbf{H}_{x}^{\text{past}} &= \text{Linear}_{x}(\text{Flatten}(\mathbf{X}_{t-L+1:t}))+\mathbf{TE}_{x}+\mathbf{SE}_{\text{node}},\\
\mathbf{H}_{z}^{\text{fut}} &= \text{Linear}_{z}(\text{Flatten}(\mathbf{Z}_{t+1:t+P}))+\mathbf{TE}_{z}+\mathbf{SE}_{\text{region}},
\end{aligned}
$$

where $\mathbf{H}_{x}^{\text{past}}\in\mathbb{R}^{N\times d}$ and $\mathbf{H}_{z}^{\text{fut}}\in\mathbb{R}^{M\times d}$ represent the latent tokens for the past node context and future region guidance, respectively. To explicitly encode structural and periodic dynamics, we incorporate spatial embeddings ($\mathbf{SE}_{\text{node}}$ and $\mathbf{SE}_{\text{region}}$) to capture spatial identities, along with temporal embeddings ($\mathbf{TE}_{x}$ and $\mathbf{TE}_{z}$), which are learnable Time-of-Day and Day-of-Week features averaged across the horizon.
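The shapes involved in the dual-horizon encoding can be traced with a short NumPy sketch. Random arrays stand in for the learnable weights and embeddings, and treating each temporal embedding as a single $d$-dimensional vector shared by all tokens of a window is an assumption of this example rather than a detail confirmed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L, P, C, d = 6, 2, 12, 3, 1, 8           # hypothetical sizes

X_past = rng.normal(size=(N, L, C))            # historical node sequence
Z_fut = rng.normal(size=(M, P, C))             # future regional guidance

# Stand-ins for the learnable parameters of Linear_x and Linear_z.
W_x, b_x = rng.normal(size=(L * C, d)), np.zeros(d)
W_z, b_z = rng.normal(size=(P * C, d)), np.zeros(d)

# Stand-ins for the spatial and (window-averaged) temporal embeddings.
SE_node, SE_region = rng.normal(size=(N, d)), rng.normal(size=(M, d))
TE_x, TE_z = rng.normal(size=(d,)), rng.normal(size=(d,))

# Flatten the temporal and feature axes, project, then add the embeddings.
H_x_past = X_past.reshape(N, L * C) @ W_x + b_x + TE_x + SE_node
H_z_fut = Z_fut.reshape(M, P * C) @ W_z + b_z + TE_z + SE_region
```

Each node and each region ends up as one token of width $d$, regardless of the differing window lengths $L$ and $P$.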

#### Cross-Scale Interaction.

To couple fine-grained historical dynamics with coarse-grained future trends, we introduce a bidirectional cross-attention mechanism. Given a standard attention operator $\text{Attn}(Q,K,V)$, interaction proceeds in two ways.

First, a top-down guidance step allows node-level tokens to query regional tokens:

$$\tilde{\mathbf{H}}_{x}=\text{Attn}(\mathbf{H}_{x}^{\text{past}},\mathbf{H}_{z}^{\text{fut}},\mathbf{H}_{z}^{\text{fut}}),\tag{6}$$

enabling node representations to incorporate macroscopic evolutionary trends.

Next, a bottom-up refinement step updates regional tokens by querying the enriched node features:

$$\tilde{\mathbf{H}}_{z}=\text{Attn}(\mathbf{H}_{z}^{\text{fut}},\tilde{\mathbf{H}}_{x},\tilde{\mathbf{H}}_{x}).\tag{7}$$

This step anchors regional guidance in the most recent fine-grained context. Together, these interactions yield representations that are both locally detailed and globally predictive.
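The two attention steps in Eqs. (6) and (7) can be sketched with a bare single-head scaled dot-product attention. This is a simplification made for illustration: learned query/key/value projections, multiple heads, residual connections, and the MLP are all left out, and the random token matrices are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn(Q, K, V):
    """Single-head scaled dot-product attention without learned projections."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
N, M, d = 6, 2, 4                               # hypothetical sizes
H_x_past = rng.normal(size=(N, d))              # node tokens
H_z_fut = rng.normal(size=(M, d))               # region tokens

# Eq. (6): top-down guidance, score matrix has shape (N, M).
H_x_tilde = attn(H_x_past, H_z_fut, H_z_fut)
# Eq. (7): bottom-up refinement, score matrix has shape (M, N).
H_z_tilde = attn(H_z_fut, H_x_tilde, H_x_tilde)
```

Both score matrices have $N\times M$ entries, so no $N\times N$ interaction is ever formed. In this stripped-down form each updated node token is a convex combination of the region tokens.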

#### Dual-Head Decoding.

Following cross-scale interaction, we decode both fine- and coarse-grained predictions. A node-level decoding head produces

$$\hat{\mathbf{X}}_{t+1:t+P}=\text{Proj}_{x}(\tilde{\mathbf{H}}_{x}),$$

while a regional decoding head predicts the next segment of macro dynamics:

$$\hat{\mathbf{Z}}_{t+P+1:t+2P}=\text{Proj}_{z}(\tilde{\mathbf{H}}_{z}).$$

The predicted regional states are recursively used as guidance for subsequent decoding steps, enabling nested auto-regressive generation.

#### Teacher forcing and multi-step ahead rollout.

During inference, future regional guidance $\mathbf{Z}_{t+1:t+P}$ is unavailable, whereas training relies on teacher forcing and rollout prediction. To bridge this discrepancy, we first introduce a boundary modeling strategy based on masked guidance reconstruction.

During training, a proportion of regional tokens is replaced with zero masks, and a dedicated boundary decoder is trained to recover the missing guidance:

$$\hat{\mathbf{Z}}_{t+1:t+P}=\text{Proj}_{\text{bd}}\!\left(\text{Attn}(\mathbf{H}^{\text{zeros}}_{z},\tilde{\mathbf{H}}_{x},\tilde{\mathbf{H}}_{x})\right),\tag{8}$$

where $\mathbf{H}^{\text{zeros}}_{z}$ is the hidden state encoded from an all-zero mask, capturing the prior state before cross-scale interaction.

Then, building on this initialized boundary $\hat{\mathbf{Z}}_{t+1:t+P}$, we execute a multi-step rollout. Specifically, we let the rollout prediction for future macro-state $\hat{\mathbf{Z}}_{t+P+1:t+2P}$ be fed back as the guidance for the next time window. This establishes a coherent auto-regressive loop where evolving macro-trends continuously anchor and regularize the long-term micro-level predictions.

At inference time, guidance is initialized with zero masks, and the reconstructed $\hat{\mathbf{Z}}_{t+1:t+P}$ serves as the structural anchor for node-level prediction. This strategy aligns training and inference behavior and stabilizes early rollout steps.
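The inference-time control flow described above can be sketched as a generic loop. The interface is hypothetical: `boundary_fn` stands in for the boundary decoder of Eq. (8) and `step_fn` for one forward pass that returns a node patch plus the next regional guidance. The toy persistence functions exist only to make the example runnable and are not the model.

```python
import math
import numpy as np

def nested_rollout(x_hist, horizon, patch_len, boundary_fn, step_fn):
    """x_hist: (N, L, C). Returns node forecasts of shape (N, horizon, C)."""
    L = x_hist.shape[1]
    context = x_hist
    # Boundary step: reconstruct guidance for t+1:t+P without future data.
    guidance = boundary_fn(context)
    patches = []
    for _ in range(math.ceil(horizon / patch_len)):
        x_patch, next_guidance = step_fn(context, guidance)
        patches.append(x_patch)
        # Slide the context: predicted patches replace the oldest observations.
        context = np.concatenate([context, x_patch], axis=1)[:, -L:]
        # The predicted macro-state for the following window becomes guidance.
        guidance = next_guidance
    return np.concatenate(patches, axis=1)[:, :horizon]

N, L, C, P, H = 3, 6, 1, 4, 10                  # hypothetical sizes

def toy_boundary(context):
    # One region: mean of the last observations, repeated over the patch.
    return np.repeat(context[:, -1:].mean(axis=0, keepdims=True), P, axis=1)

def toy_step(context, guidance):
    x_patch = np.repeat(context[:, -1:], P, axis=1)   # per-node persistence
    return x_patch, guidance                          # region trend held fixed

x_hist = np.arange(N * L * C, dtype=float).reshape(N, L, C)
forecast = nested_rollout(x_hist, H, P, toy_boundary, toy_step)
```

With $H=10$ and $P=4$ the loop runs three times and the final patch is truncated to the requested horizon.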

### 4.3 Uncertainty-Aware guidance and Complexity

#### Robustness via Quantile Regression.

To mitigate error accumulation and estimate the uncertainty caused by inaccurate macro-level guidance, we explicitly model regional dynamics as predictive distributions rather than point estimates. We adopt quantile regression (Bassett and Jr., [1978](https://arxiv.org/html/2605.16447#bib.bib2)) to estimate multiple conditional quantiles $\{\tau_{q}\}_{q=1}^{Q}$ of future regional states, thereby capturing epistemic uncertainty in coarse-grained evolution:

$$\hat{\mathbf{Z}}_{t+1:t+P}^{(\tau_{q})}=f^{(\tau_{q})}\!\left(\tilde{\mathbf{H}}_{z}\right),\quad q=1,\dots,Q.\tag{9}$$

For inference, we use the median prediction ($\tau=0.5$) as deterministic guidance for downstream node-level forecasting. This design leverages the inherent stability of macroscopic dynamics while reducing sensitivity to local noise and outliers, leading to more reliable long-horizon predictions. Details of the quantile regression loss are provided in Appendix [A.6](https://arxiv.org/html/2605.16447#A1.SS6.SSS0.Px2).
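For readers unfamiliar with quantile regression, the standard pinball loss is sketched below. Whether the appendix uses exactly this form, and how the quantile levels are weighted, is an assumption here. The function name and the toy targets are hypothetical.

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Mean quantile (pinball) loss for a quantile level tau in (0, 1)."""
    err = y_true - y_pred
    # Under-prediction (err > 0) costs tau * err, over-prediction (1 - tau) * |err|.
    return np.mean(np.maximum(tau * err, (tau - 1.0) * err))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.zeros(3)                       # systematically too low

loss_hi = pinball_loss(y_true, y_pred, tau=0.9)
loss_lo = pinball_loss(y_true, y_pred, tau=0.1)
```

Because every prediction here is too low, the high quantile level penalizes the error nine times more than the low one. At $\tau=0.5$ the loss reduces to half the mean absolute error, which matches the use of the median as point guidance.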

#### Training Objective.

The proposed model is trained end-to-end under a multi-task objective that jointly optimizes: (i) fine-grained node-level forecasting, (ii) uncertainty-aware regional forecasting, and (iii) masked guidance reconstruction to handle missing future context. The overall loss is defined as

$$\mathcal{L}=\mathcal{L}_{x}+\lambda_{1}\mathcal{L}_{z}+\lambda_{2}\mathcal{L}_{\rm bd},\tag{10}$$

where $\mathcal{L}_{x}$ denotes the node-level forecasting loss, $\mathcal{L}_{z}$ supervises the multi-quantile predictions of regional dynamics, and $\mathcal{L}_{\rm bd}$ corresponds to the masked guidance reconstruction loss introduced for boundary modeling. The hyperparameters $\lambda_{1}$ and $\lambda_{2}$ control the trade-off between fine-grained accuracy, macro-level uncertainty modeling, and robustness to missing guidance. Formal definitions of each loss term are deferred to Appendix [A.6](https://arxiv.org/html/2605.16447#A1.SS6).

#### Computational Complexity.

We analyze the computational complexity with respect to the number of nodes $N$, regions $M$ ($M<N$), latent dimension $d$, and the number of cross-attention layers $l$. In each layer, the dominant cost arises from cross-attention between node-level and region-level tokens, resulting in a complexity of $\mathcal{O}(NMd)$ per layer and $\mathcal{O}(lNMd)$ per forward pass. In contrast, a standard node-level transformer with full self-attention incurs $\mathcal{O}(lN^{2}d)$ complexity. Since $M<N$ in practice, the proposed hierarchical design substantially reduces the quadratic dependence on the number of nodes, achieving near-linear scaling while preserving global contextual modeling.
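A back-of-the-envelope comparison of the two leading-order terms is sketched below. The sizes are hypothetical and are not taken from the datasets. Constants and lower-order terms are ignored.

```python
def attention_cost(num_nodes, num_regions, dim, layers):
    """Leading-order operation counts for cross-attention vs. full self-attention."""
    cross = layers * num_nodes * num_regions * dim    # O(l * N * M * d)
    full = layers * num_nodes ** 2 * dim              # O(l * N^2 * d)
    return cross, full

# Hypothetical sizes: 8000 nodes, 100 regions, d = 64, 3 layers.
cross, full = attention_cost(num_nodes=8000, num_regions=100, dim=64, layers=3)
ratio = full // cross
```

The ratio equals $N/M$ (80 in this example), so the saving grows with the number of nodes as long as the number of regions stays fixed.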

Table 1: Performance comparison on GBA, GLA, and CA datasets. Red indicates the best results, and Blue indicates the second-best results. The row "Improv." denotes the relative improvement of our method over the best baseline. NeST consistently outperforms prior methods across datasets and horizons.

| Dataset | Method | H3 MAE | H3 RMSE | H3 MAPE | H6 MAE | H6 RMSE | H6 MAPE | H12 MAE | H12 RMSE | H12 MAPE | Avg MAE | Avg RMSE | Avg MAPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GBA | GWNET (Wu et al., [2019](https://arxiv.org/html/2605.16447#bib.bib34)) | 17.85 | 29.12 | 13.92 | 21.11 | 33.69 | 17.79 | 25.58 | 40.19 | 23.48 | 20.91 | 33.41 | 17.66 |
| GBA | AGCRN (Bai et al., [2020](https://arxiv.org/html/2605.16447#bib.bib18)) | 18.31 | 30.24 | 14.27 | 21.27 | 34.72 | 16.89 | 24.85 | 40.18 | 20.80 | 21.01 | 34.25 | 16.90 |
| GBA | STGODE (Fang et al., [2021](https://arxiv.org/html/2605.16447#bib.bib45)) | 18.84 | 30.51 | 15.43 | 22.04 | 35.61 | 18.42 | 26.22 | 42.90 | 22.83 | 21.79 | 35.37 | 18.26 |
| GBA | DSTAGNN (Lan et al., [2022](https://arxiv.org/html/2605.16447#bib.bib31)) | 19.73 | 31.39 | 15.42 | 24.21 | 37.70 | 20.99 | 30.12 | 46.40 | 28.16 | 23.82 | 37.29 | 20.16 |
| GBA | D2STGNN (Shao et al., [2022c](https://arxiv.org/html/2605.16447#bib.bib48)) | 17.54 | 28.94 | 12.12 | 20.92 | 33.92 | 14.89 | 25.48 | 40.99 | 19.83 | 20.71 | 33.65 | 15.04 |
| GBA | DGCRN (Li et al., [2023](https://arxiv.org/html/2605.16447#bib.bib49)) | 18.02 | 29.49 | 14.13 | 21.08 | 34.03 | 16.94 | 25.25 | 40.63 | 21.15 | 20.91 | 33.83 | 16.88 |
| GBA | STID (Shao et al., [2022b](https://arxiv.org/html/2605.16447#bib.bib71)) | 17.36 | 29.39 | 13.28 | 20.45 | 34.51 | 16.03 | 24.38 | 41.33 | 19.90 | 20.22 | 34.61 | 15.91 |
| GBA | STWave (Fang et al., [2023](https://arxiv.org/html/2605.16447#bib.bib76)) | 17.95 | 29.42 | 13.01 | 20.99 | 34.01 | 15.62 | 24.96 | 40.31 | 20.08 | 20.81 | 33.77 | 15.76 |
| GBA | RPMixer (Yeh et al., [2024](https://arxiv.org/html/2605.16447#bib.bib47)) | 20.31 | 33.34 | 15.64 | 26.95 | 44.02 | 22.75 | 39.66 | 66.44 | 37.35 | 27.77 | 47.72 | 23.87 |
| GBA | BigST (Han et al., [2024](https://arxiv.org/html/2605.16447#bib.bib24)) | 18.70 | 30.27 | 15.55 | 22.21 | 35.33 | 18.54 | 26.98 | 42.73 | 23.68 | 21.95 | 35.54 | 18.50 |
| GBA | PatchSTG (Fang et al., [2025](https://arxiv.org/html/2605.16447#bib.bib59)) | 16.81 | 28.71 | 12.25 | 19.68 | 33.09 | 14.51 | 23.49 | 39.23 | 18.93 | 19.50 | 33.16 | 14.64 |
| GBA | NeST | 16.05 | 27.53 | 10.66 | 18.90 | 31.77 | 12.86 | 22.63 | 37.71 | 16.13 | 18.73 | 31.85 | 12.90 |
| GBA | Improv. | 4.52% | 4.11% | 12.98% | 3.96% | 3.99% | 11.37% | 3.66% | 3.87% | 14.79% | 3.95% | 3.95% | 11.88% |
| GLA | GWNET (Wu et al., [2019](https://arxiv.org/html/2605.16447#bib.bib34)) | 17.28 | 27.68 | 10.18 | 21.31 | 33.70 | 13.02 | 26.99 | 42.51 | 17.64 | 21.20 | 33.58 | 13.18 |
| GLA | AGCRN (Bai et al., [2020](https://arxiv.org/html/2605.16447#bib.bib18)) | 17.27 | 29.70 | 10.78 | 20.38 | 34.82 | 12.70 | 24.59 | 42.59 | 16.03 | 20.25 | 34.84 | 12.87 |
| GLA | STGODE (Fang et al., [2021](https://arxiv.org/html/2605.16447#bib.bib45)) | 18.10 | 30.02 | 11.18 | 21.71 | 36.46 | 13.64 | 26.45 | 45.09 | 17.60 | 21.49 | 36.14 | 13.72 |
| GLA | DSTAGNN (Lan et al., [2022](https://arxiv.org/html/2605.16447#bib.bib31)) | 19.49 | 31.08 | 11.50 | 24.27 | 38.43 | 15.24 | 30.92 | 48.52 | 20.45 | 24.13 | 38.15 | 15.07 |
| GLA | STID (Shao et al., [2022b](https://arxiv.org/html/2605.16447#bib.bib71)) | 16.54 | 27.73 | 10.00 | 19.98 | 34.23 | 12.38 | 24.29 | 42.50 | 16.02 | 19.76 | 34.56 | 12.41 |
| GLA | STWave (Fang et al., [2023](https://arxiv.org/html/2605.16447#bib.bib76)) | 17.48 | 28.05 | 10.06 | 21.08 | 33.58 | 12.56 | 25.82 | 41.28 | 16.51 | 20.96 | 33.48 | 12.70 |
| GLA | RPMixer (Yeh et al., [2024](https://arxiv.org/html/2605.16447#bib.bib47)) | 19.94 | 32.54 | 11.53 | 27.10 | 44.87 | 16.58 | 40.13 | 69.11 | 27.93 | 27.87 | 48.96 | 17.66 |
| GLA | BigST (Han et al., [2024](https://arxiv.org/html/2605.16447#bib.bib24)) | 18.38 | 29.40 | 11.68 | 22.22 | 35.53 | 14.48 | 27.98 | 44.74 | 19.65 | 22.08 | 36.00 | 14.57 |
| GLA | PatchSTG (Fang et al., [2025](https://arxiv.org/html/2605.16447#bib.bib59)) | 15.84 | 26.34 | 9.27 | 19.06 | 31.85 | 11.30 | 23.32 | 39.64 | 14.60 | 18.96 | 32.33 | 11.44 |
| GLA | NeST | 15.12 | 25.44 | 8.80 | 17.94 | 30.22 | 10.69 | 21.85 | 36.91 | 13.49 | 17.89 | 30.52 | 10.74 |
| GLA | Improv. | 4.55% | 3.42% | 5.07% | 5.89% | 5.13% | 5.40% | 6.30% | 6.90% | 7.60% | 5.65% | 5.60% | 6.14% |
| CA | GWNET (Wu et al., [2019](https://arxiv.org/html/2605.16447#bib.bib34)) | 17.14 | 27.81 | 12.62 | 21.68 | 34.16 | 17.14 | 28.58 | 44.13 | 24.24 | 21.72 | 34.20 | 17.40 |
| CA | STGODE (Fang et al., [2021](https://arxiv.org/html/2605.16447#bib.bib45)) | 17.57 | 29.91 | 13.91 | 20.98 | 36.62 | 16.88 | 25.46 | 45.99 | 21.00 | 20.77 | 36.60 | 16.80 |
| CA | STID (Shao et al., [2022b](https://arxiv.org/html/2605.16447#bib.bib71)) | 15.51 | 26.23 | 11.26 | 18.53 | 31.56 | 13.82 | 22.63 | 39.37 | 17.59 | 18.41 | 32.00 | 13.82 |
| CA | STWave (Fang et al., [2023](https://arxiv.org/html/2605.16447#bib.bib76)) | 16.77 | 26.98 | 12.20 | 18.97 | 30.69 | 14.40 | 25.36 | 38.77 | 19.01 | 19.69 | 31.58 | 14.58 |
| CA | RPMixer (Yeh et al., [2024](https://arxiv.org/html/2605.16447#bib.bib47)) | 18.18 | 30.49 | 12.86 | 24.33 | 41.38 | 18.34 | 35.74 | 62.12 | 30.38 | 25.07 | 44.75 | 19.47 |
| CA | BigST (Han et al., [2024](https://arxiv.org/html/2605.16447#bib.bib24)) | 17.15 | 27.92 | 13.03 | 20.44 | 33.16 | 15.87 | 25.49 | 41.09 | 20.97 | 20.32 | 33.45 | 15.91 |
| CA | PatchSTG (Fang et al., [2025](https://arxiv.org/html/2605.16447#bib.bib59)) | 14.69 | 24.82 | 10.51 | 17.41 | 29.43 | 12.83 | 21.20 | 36.13 | 16.00 | 17.35 | 29.79 | 12.79 |
| CA | NeST | 13.98 | 24.17 | 9.37 | 16.59 | 28.30 | 11.24 | 20.31 | 34.44 | 14.18 | 16.54 | 28.55 | 11.28 |
| CA | Improv. | 4.82% | 2.62% | 10.81% | 4.73% | 3.86% | 12.41% | 4.17% | 4.69% | 11.18% | 4.69% | 4.17% | 11.78% |

## 5Experiments

In this section, we present a comprehensive empirical evaluation on diverse large\-scale benchmarks to validate the effectiveness and robustness of our proposed framework\.

### 5\.1Experiment Setup

Dataset. We select the GLA, GBA, and CA datasets from the LargeST benchmark (Liu et al., [2023b](https://arxiv.org/html/2605.16447#bib.bib28)), prioritizing their high node counts to rigorously test our model's scalability and spatial modeling capabilities. Following established settings (Fang et al., [2025](https://arxiv.org/html/2605.16447#bib.bib59)), we chronologically split the data into training, validation, and test sets with a 6:2:2 ratio. The forecasting task uses 12 historical time steps as input to predict the subsequent 12 steps. Dataset details are presented in Appendix [A.1](https://arxiv.org/html/2605.16447#A1.SS1).
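The split and windowing protocol described above can be sketched as follows. This is a minimal illustration on a single time-ordered sequence; the helper names are hypothetical, and applying the windowing separately inside each split is an assumption, since the exact preprocessing order is not stated here.

```python
from typing import List, Sequence, Tuple


def chronological_split(series: Sequence, ratios: Tuple[int, int, int] = (6, 2, 2)):
    """Split a time-ordered sequence into train/val/test parts without shuffling."""
    n = len(series)
    total = sum(ratios)
    train_end = n * ratios[0] // total
    val_end = n * (ratios[0] + ratios[1]) // total
    return series[:train_end], series[train_end:val_end], series[val_end:]


def make_windows(series: Sequence, lookback: int = 12, horizon: int = 12) -> List[tuple]:
    """Build (input, target) pairs: `lookback` past steps -> the next `horizon` steps."""
    pairs = []
    for start in range(len(series) - lookback - horizon + 1):
        x = series[start:start + lookback]
        y = series[start + lookback:start + lookback + horizon]
        pairs.append((x, y))
    return pairs


# Toy sequence standing in for one sensor's readings.
data = list(range(100))
train, val, test = chronological_split(data)
train_pairs = make_windows(train)
print(len(train), len(val), len(test), len(train_pairs))
```

Integer ratios are used instead of floating-point fractions so that the split boundaries are exact.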

Baselines. Our proposed method is compared with 11 advanced baselines: STID (Shao et al., [2022b](https://arxiv.org/html/2605.16447#bib.bib71)), GWNET (Wu et al., [2019](https://arxiv.org/html/2605.16447#bib.bib34)), AGCRN (Bai et al., [2020](https://arxiv.org/html/2605.16447#bib.bib18)), STGODE (Fang et al., [2021](https://arxiv.org/html/2605.16447#bib.bib45)), DGCRN (Li et al., [2023](https://arxiv.org/html/2605.16447#bib.bib49)), DSTAGNN (Lan et al., [2022](https://arxiv.org/html/2605.16447#bib.bib31)), D2STGNN (Shao et al., [2022c](https://arxiv.org/html/2605.16447#bib.bib48)), STWave (Fang et al., [2023](https://arxiv.org/html/2605.16447#bib.bib76)), RPMixer (Yeh et al., [2024](https://arxiv.org/html/2605.16447#bib.bib47)), BigST (Han et al., [2024](https://arxiv.org/html/2605.16447#bib.bib24)), and the current state-of-the-art PatchSTG (Fang et al., [2025](https://arxiv.org/html/2605.16447#bib.bib59)). Detailed descriptions of these methods are provided in Appendix [A.2](https://arxiv.org/html/2605.16447#A1.SS2).

Evaluation Metrics. We evaluate forecasting performance using three standard metrics: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). Their formal definitions are provided in Appendix [A.3](https://arxiv.org/html/2605.16447#A1.SS3).
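For reference, the three metrics can be sketched in their textbook form as below. The definitions actually used in the experiments are those of Appendix A.3; in particular, skipping near-zero ground-truth values in MAPE (the `eps` parameter) is an assumption made here to avoid division by zero.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true: Sequence[float], y_pred: Sequence[float], eps: float = 1e-8) -> float:
    """Mean Absolute Percentage Error in percent.

    Entries whose ground truth is (near) zero are skipped; this masking is an
    assumption of this sketch, not a detail taken from the paper.
    """
    terms = [abs((t - p) / t) for t, p in zip(y_true, y_pred) if abs(t) > eps]
    if not terms:
        return float("nan")
    return 100.0 * sum(terms) / len(terms)


truth = [100.0, 200.0, 400.0]
pred = [110.0, 180.0, 400.0]
print(mae(truth, pred), rmse(truth, pred), mape(truth, pred))
```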

Implementation Details. Consistent with existing literature, the forecasting task is configured with a look-back window $L=12$ and a prediction horizon $H=12$. Comprehensive implementation details, encompassing hyperparameter settings, optimization strategies, and hardware environments, are provided in Appendix [A.5](https://arxiv.org/html/2605.16447#A1.SS5).

### 5\.2Performance Comparisons

Overall Performance. Table [1](https://arxiv.org/html/2605.16447#S4.T1) summarizes the performance of NeST against all baselines. Our framework achieves the best result on every dataset, metric, and horizon (3, 6, and 12 steps). Averaged across the three datasets, NeST improves MAE by 4.71%, RMSE by 4.41%, and MAPE by 9.34%. Notably, the improvements in MAPE exceed 10% on GBA and CA, demonstrating a superior ability to handle high-dimensional volatility. We attribute this to two core mechanisms: semantic regional aggregation and explicit macro-trend regularization. While baseline models often rely on passive historical aggregation, our approach uses future regional guidance to mitigate local noise, keeping fine-grained node predictions aligned with the broader semantic context.
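As a worked check of how an "Improv." entry in Table 1 is obtained, the sketch below computes the relative error reduction for one cell. The function name is hypothetical; the two input values are the average-horizon MAE of PatchSTG and NeST on GBA as reported in Table 1.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Percentage by which `ours` lowers an error metric relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Average-horizon MAE on GBA from Table 1: PatchSTG 19.50 vs. NeST 18.73.
gba_mae_gain = relative_improvement(19.50, 18.73)
print(f"{gba_mae_gain:.2f}%")  # prints 3.95%
```

Because the metrics are errors, a positive value means the proposed method is better; the same formula applies to RMSE and MAPE.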

Table 2: Long-horizon MAE comparison on GLA and CA datasets. We evaluate performance across increasing prediction steps (hours). Best results are highlighted in bold.

| Dataset | Model | 16 (4h) | 20 (5h) | 24 (6h) | 36 (9h) | 48 (12h) |
|---|---|---|---|---|---|---|
| GLA | PatchSTG | 25.63 | 27.08 | 27.92 | 30.42 | 32.43 |
| GLA | NeST | **23.63** | **25.11** | **26.28** | **28.51** | **30.03** |
| CA | PatchSTG | 24.17 | 25.52 | 26.54 | 28.10 | 29.15 |
| CA | NeST | **22.05** | **23.46** | **24.56** | **26.60** | **27.92** |

Long-Horizon Stability. To evaluate the temporal stability of our framework, we compare NeST against the strongest baseline, PatchSTG, on the GLA and CA datasets over an extended 48-step horizon (12 hours). Crucially, these forecasts are generated via auto-regressive rollout, feeding the model's own predictions back as input step by step; this matches realistic deployment scenarios where training a separate model for every horizon is impractical. As detailed in Table [2](https://arxiv.org/html/2605.16447#S5.T2), NeST consistently outperforms PatchSTG across all extended horizons. Although auto-regressive inference typically suffers from cumulative error propagation, our framework mitigates this by conditioning nodal forecasts on predicted regional trends. For example, on the GLA dataset, the MAE gap in our favor widens from 2.00 at step 16 (25.63 vs. 23.63) to 2.40 at step 48 (32.43 vs. 30.03). These results indicate that using macro-trend guidance as top-down context stabilizes long-term rollouts and maintains coherence over extended periods.
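The rollout procedure can be illustrated with the generic sketch below. Here `predict` is a hypothetical stand-in for a trained forecaster that maps a look-back window to a block of future steps; a single series is used for simplicity, and the way NeST injects predicted regional trends at each step is not modeled.

```python
from typing import Callable, List


def autoregressive_rollout(
    predict: Callable[[List[float]], List[float]],
    history: List[float],
    total_steps: int,
    lookback: int = 12,
) -> List[float]:
    """Extend a forecast to `total_steps` by feeding predictions back as input."""
    window = list(history[-lookback:])
    forecast: List[float] = []
    while len(forecast) < total_steps:
        block = predict(window)  # one multi-step prediction from the current window
        forecast.extend(block)
        # Slide the window forward over the model's own outputs.
        window = (window + list(block))[-lookback:]
    return forecast[:total_steps]


def toy_predict(window: List[float]) -> List[float]:
    """Toy forecaster that continues a unit-slope trend for 12 steps."""
    return [window[-1] + k for k in range(1, 13)]


past = [float(i) for i in range(12)]
print(autoregressive_rollout(toy_predict, past, total_steps=48))
```

With a 12-step predictor, a 48-step horizon requires four successive calls, and every call after the first sees only predicted values, which is why errors can accumulate.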

Generalization to Non-Traffic Domains. While our primary evaluation focuses on large-scale traffic networks, we further verify the applicability of NeST on non-traffic datasets, including meteorology (KnowAir), energy (UrbanEV), and classical time-series benchmarks (Electricity, Solar-Energy). As detailed in Appendix [A.13](https://arxiv.org/html/2605.16447#A1.SS13), NeST consistently outperforms recent state-of-the-art models (e.g., MAGE (Ma et al., [2025a](https://arxiv.org/html/2605.16447#bib.bib98)), Air-DualODE (Tian et al., [2025](https://arxiv.org/html/2605.16447#bib.bib100)), and iTransformer (Liu et al., [2024](https://arxiv.org/html/2605.16447#bib.bib101))) across these domains. These supplementary results indicate that our data-driven semantic clustering and cross-scale attention mechanisms capture general spatio-temporal dynamics without relying on traffic-specific physical priors.

### 5\.3Ablation Study

We analyze the contribution of individual components by grouping experiments into two parts: the interaction mechanism \(how regions and nodes communicate\) and the semantic partitioning strategy \(how regions are constructed\)\.

#### Impact of Macro\-Micro Interaction\.

To test our hierarchical coupling design, we compare NeST against two ablated variants: (i) w/o CA (no Cross-Attention), which completely removes the cross-attention module, severing the link between regional and nodal representations; and (ii) w/o FG (no Future Guidance), which replaces the predicted future regional signals $\mathbf{Z}_{t+1:t+P}$ with past region observations $\mathbf{Z}_{t-P+1:t}$. As detailed in Table [3](https://arxiv.org/html/2605.16447#S5.T3), both components prove essential for optimal performance. Removing cross-attention (w/o CA) causes the largest degradation, raising MAE by 1.04 on GBA and 1.11 on GLA. This substantial drop indicates that without the top-down regularization from region-level semantics, the model struggles to handle local noise and degenerates into a purely local predictor. Similarly, the w/o FG variant, which relies solely on past regional patterns, fails to capture evolving temporal shifts. This leads to a notable increase in RMSE (e.g., +1.04 on GBA), suggesting that historical context alone is insufficient for predictive stability. Overall, these results show that explicitly modeling region-node interactions together with future-aware guidance is crucial for robust multi-step forecasting in NeST.

![Refer to caption](https://arxiv.org/html/2605.16447v1/x2.png)

Figure 2: Illustration of the Nested Spatio-Temporal Forecasting mechanism. The left panel shows the spatial distribution of two learned latent regions. The right panels visualize the two-stage decoding process: (1) Macro Generation, where the model first predicts a stable regional trend (see top row); and (2) Micro Guidance, where this trend explicitly guides the forecasting of downstream nodes (bottom rows, indicated by arrows). As shown in the highlighted intervals, this mechanism enables NeST (red) to maintain robustness against local noise and align better with the Ground Truth (gray dashed) compared to PatchSTG (green).

Table 3: Ablation results of Macro-Micro Interaction on GBA and GLA. Best results are highlighted in bold.

| Variant | GBA MAE | GBA RMSE | GBA MAPE (%) | GLA MAE | GLA RMSE | GLA MAPE (%) |
|---|---|---|---|---|---|---|
| w/o CA | 19.76 | 34.11 | 15.33 | 19.00 | 32.76 | 11.37 |
| w/o FG | 19.64 | 32.89 | 14.88 | 18.85 | 32.04 | 11.03 |
| NeST | **18.73** | **31.85** | **12.89** | **17.89** | **30.52** | **10.74** |

#### Effectiveness of Semantic Partitioning\.

We evaluate our spatial partitioning strategy against four baselines: (i) w/ KM (K-Means), which clusters nodes based on raw feature similarity; (ii) w/ RN (Random Node), which selects $M$ individual nodes as regional representatives; (iii) w/ RP (Random Partition), which randomly assigns nodes to clusters; and (iv) w/ DA (Distance Adjacency), which uses static geographic proximity for spectral clustering. As shown in Table [4](https://arxiv.org/html/2605.16447#S5.T4), the quality of regional definitions is pivotal. Arbitrary grouping (w/ RP) raises MAPE on GBA from 12.89% to 14.27% due to the lack of coherent macro patterns, while single-node representation (w/ RN) suffers from sensitivity to local noise. Crucially, our semantic approach outperforms both w/ KM and w/ DA, reducing MAE by 0.21 on GBA and 0.45 on GLA. Unlike w/ KM, which ignores graph topology, or w/ DA, which is limited by static geometry, our method captures functional spatial dependencies, providing a robust foundation for stable regional modeling.
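The two random baselines are simple enough to sketch directly. The code below is an illustrative reading of "Random Partition" and "Random Node" as described above; the function names, the seeding, and the uniform sampling are assumptions, since the exact procedure used in the ablation is not specified in this section.

```python
import random
from typing import Dict, List


def random_partition(n_nodes: int, n_regions: int, seed: int = 0) -> List[int]:
    """w/ RP-style baseline: assign every node to a uniformly random region label."""
    rng = random.Random(seed)
    return [rng.randrange(n_regions) for _ in range(n_nodes)]


def random_nodes(n_nodes: int, n_regions: int, seed: int = 0) -> List[int]:
    """w/ RN-style baseline: pick M distinct nodes to act as regional representatives."""
    rng = random.Random(seed)
    return rng.sample(range(n_nodes), n_regions)


def group_by_region(assignment: List[int]) -> Dict[int, List[int]]:
    """Collect the node indices that share each region label."""
    regions: Dict[int, List[int]] = {}
    for node, label in enumerate(assignment):
        regions.setdefault(label, []).append(node)
    return regions


labels = random_partition(n_nodes=100, n_regions=20)
print(len(group_by_region(labels)), random_nodes(100, 20)[:5])
```

Note that a uniformly random partition may leave some region labels unused, and neither baseline uses any information about the series themselves, which is the point of the comparison.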

Table 4: Ablation results of Semantic Partitioning Strategy on GBA and GLA. Best results are highlighted in bold.

| Variant | GBA MAE | GBA RMSE | GBA MAPE (%) | GLA MAE | GLA RMSE | GLA MAPE (%) |
|---|---|---|---|---|---|---|
| w/ KM | 18.93 | 32.33 | 13.46 | 18.39 | 31.55 | 10.82 |
| w/ RN | 19.44 | 32.96 | 14.42 | 18.46 | 32.42 | 10.85 |
| w/ RP | 19.07 | 32.47 | 14.27 | 18.46 | 31.50 | 11.22 |
| w/ DA | 18.93 | 32.22 | 13.37 | 18.34 | 31.99 | 11.04 |
| NeST | **18.73** | **31.85** | **12.89** | **17.89** | **30.52** | **10.74** |

### 5\.4Computational Efficiency and Runtime Analysis

Table [5](https://arxiv.org/html/2605.16447#S5.T5) compares the empirical runtime of NeST against the strong baseline PatchSTG on the GBA and GLA datasets, with all tests conducted on a single RTX 4090 GPU using a batch size of 64. NeST accelerates both the training and inference stages. Specifically, on the GBA dataset, the training time per epoch drops from 185 to 75 seconds (a 59.5% reduction), and the total inference time decreases from 32 to 20 seconds (a 37.5% reduction). On the larger GLA dataset the gains are smaller but still present: training drops from 235 to 137 seconds per epoch and inference from 42 to 37 seconds. While NeST requires an offline preprocessing step for affinity matrix construction and spectral clustering (e.g., 91 seconds on the GBA dataset), this procedure is executed only once, so the overhead is negligible relative to the overall training process. These results demonstrate that the proposed linear cross-attention mechanism and patch-wise prediction design effectively constrain computational complexity, making the framework highly efficient in practice.
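To put the one-time preprocessing cost in perspective, the sketch below recomputes the reductions from the Table 5 runtimes and estimates after how many epochs the per-epoch saving repays the preprocessing time. The helper names are hypothetical, and the break-even calculation is a simple amortization argument added here for illustration, not an analysis from the paper.

```python
import math


def percent_reduction(before: float, after: float) -> float:
    """Relative runtime reduction in percent."""
    return 100.0 * (before - after) / before


def break_even_epochs(preprocess_s: float, baseline_epoch_s: float, ours_epoch_s: float) -> int:
    """Epochs needed before per-epoch savings repay a one-time preprocessing cost."""
    saved = baseline_epoch_s - ours_epoch_s
    if saved <= 0:
        raise ValueError("no per-epoch saving to amortize against")
    return math.ceil(preprocess_s / saved)


# GBA values from Table 5 and the text: 185 -> 75 s/epoch, 32 -> 20 s inference,
# 91 s of offline preprocessing.
print(f"training:  -{percent_reduction(185, 75):.1f}%")
print(f"inference: -{percent_reduction(32, 20):.1f}%")
print("break-even epochs:", break_even_epochs(91, 185, 75))
```

On GBA the 110 seconds saved per epoch already exceed the 91 seconds of preprocessing, so under this accounting the overhead is recovered within the first training epoch.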

Table 5: Empirical runtime comparison on the GBA and GLA datasets (Batch Size = 64, single RTX 4090 GPU). NeST significantly reduces both training and inference costs.

| Dataset | Model | Training Time (s/epoch) | Inference Time (s) |
|---|---|---|---|
| GBA | PatchSTG | 185 | 32 |
| GBA | NeST (Ours) | 75 | 20 |
| GLA | PatchSTG | 235 | 42 |
| GLA | NeST (Ours) | 137 | 37 |

### 5\.5In\-Depth Analysis

Case Study. Figure [2](https://arxiv.org/html/2605.16447#S5.F2) visualizes the nested forecasting mechanism on two representative latent regions.

Spatial Distribution (Left): The learned regions successfully group nodes with synchronized traffic patterns, regardless of physical distance. For instance, nodes in Region 547 are distributed across disparate road segments yet share a consistent traffic evolution, confirming that our model captures semantic functional similarities beyond mere geographical proximity.

Mechanism Analysis (Right): The right panels illustrate the coarse-to-fine generation process. At the macro level (top plots), NeST first extracts distinct regional trends, such as the bell-shaped pattern in Region 103 and the sharp decline in Region 547. Explicitly guided by these macro-signals (indicated by the dashed arrows), our model (red solid line) achieves precise micro-level predictions. As observed in the highlighted intervals, the macro guidance effectively directs individual nodes to adapt to sudden traffic shifts. Consequently, NeST aligns closely with the Ground Truth (grey line), whereas the baseline PatchSTG (green dashed line) fails to capture these dynamics, exhibiting significant lag or overestimation during trend transitions.

Sensitivity to Region Count $M$. We investigate the sensitivity of NeST to the number of latent regions $M$ on both the GBA and GLA datasets. The results reveal a consistent U-shaped error curve, with the best performance at $M=0.2N$. To further validate this, we conduct a fine-grained analysis on the GBA dataset (detailed in Appendix [A.12](https://arxiv.org/html/2605.16447#A1.SS12)). Specifically, prediction errors decrease as $M$ increases from $0.1N$ to $0.2N$ but rebound when $M$ is further increased to $0.3N$. We attribute this phenomenon to a trade-off in modeling granularity: a small $M$ ($0.1N$) likely causes over-aggregation, smoothing out distinct local patterns, whereas a large $M$ ($0.3N$) becomes susceptible to structural noise and fails to filter spurious correlations. Consequently, $M=0.2N$ strikes the best balance, abstracting stable macro-trends while preserving the necessary micro-dynamics.

![Refer to caption](https://arxiv.org/html/2605.16447v1/x3.png)

Figure 3: Forecasting performance with varying numbers of latent regions ($M$) on GBA and GLA.

## 6Conclusion

In this paper, we proposed NeST, a novel macro-to-micro framework designed to tackle local noise sensitivity and cascading error accumulation in high-dimensional spatio-temporal forecasting. By leveraging data-driven semantic clustering without relying on physical priors, NeST extracts stable regional representations that act as a natural low-pass filter against local volatility. Crucially, explicitly predicted future macro-trends serve as top-down guidance to regularize micro-level dynamics, inherently anchoring the autoregressive rollout and mitigating long-horizon error propagation. Extensive experiments across diverse domains, including traffic, meteorology, energy, and classical time series, demonstrate that NeST achieves state-of-the-art accuracy and exceptional long-term stability. Ultimately, this work provides a highly generalizable paradigm for multiscale sequence modeling in complex, real-world environments.

## Impact Statement and Limitations

This work advances the field of Machine Learning by proposing a robust framework for spatio\-temporal time\-series forecasting\. The proposed method has potential applications in smart city operations, energy management, and environmental monitoring, which can lead to more efficient and sustainable infrastructure\. We are not aware of any immediate negative societal consequences or specific ethical issues associated with this methodological research, provided that standard data privacy and fairness protocols are followed during deployment\.

While NeST demonstrates strong performance, we acknowledge several limitations. First, constructing the feature-driven affinity matrix introduces a preprocessing overhead that scales quadratically ($\mathcal{O}(N^{2})$). Second, our global clustering assumes time-invariant spatial correlations, limiting the model's adaptability to sudden topological shifts or transient events. Third, training via future macro-trend reconstruction (teacher forcing) introduces an exposure bias, creating a training-inference gap during autoregressive generation. Finally, although autoregressive rollout stabilizes long-horizon errors, its sequential nature is slower than direct multi-step models, introducing inference latency that may pose challenges for strict real-time applications at massive scales. Addressing these constraints via dynamic regionalization and scheduled sampling remains a promising avenue for future work.

## Acknowledgements

This work was supported by the Pujiang Talent Program (No. 2025PJA729), the National Natural Science Foundation of China (Grant Nos. 82394432, 92249302, and 62276127), the Shanghai Municipal Science and Technology Major Project (Grant No. 2023SHZDZX02), the Brain Science and Brain-like Intelligence Technology - National Science and Technology Major Project (Grant No. 2021ZD0201300), the Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China (Grant No. JYB2025XDXM118), and the "111" Center (Grant No. B26023).

## References

- L\. Bai, L\. Yao, C\. Li, X\. Wang, and C\. Wang \(2020\)Adaptive graph convolutional recurrent network for traffic forecasting\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.17.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.4.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- R\. Koenker and G\. Bassett Jr\. \(1978\)Regression quantiles\.Econometrica46\(1\),pp\. 33–50\.Cited by:[§4\.3](https://arxiv.org/html/2605.16447#S4.SS3.SSS0.Px1.p1.1)\.
- C\. Challu, K\. G\. Olivares, B\. N\. Oreshkin, F\. G\. Ramírez, M\. M\. Canseco, and A\. Dubrawski \(2023\)NHITS: neural hierarchical interpolation for time series forecasting\.InAAAI,pp\. 6989–6997\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p3.1)\.
- J\. Chen, J\. E\. Lenssen, A\. Feng, W\. Hu, M\. Fey, L\. Tassiulas, J\. Leskovec, and R\. Ying \(2024\)From similarity to superiority: channel clustering for time series forecasting\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1)\.
- L\. Chen, D\. Chen, Z\. Shang, B\. Wu, C\. Zheng, B\. Wen, and W\. Zhang \(2023\)Multi\-scale adaptive graph neural network for multivariate time series forecasting\.IEEE Trans\. Knowl\. Data Eng\.35\(10\),pp\. 10748–10761\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p3.1)\.
- M\. Chen, G\. Pang, W\. Wang, and C\. Yan \(2025\)Information bottleneck\-guided mlps for robust spatial\-temporal forecasting\.InICML,Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1)\.
- Z\. Diao, X\. Wang, D\. Zhang, G\. Xie, J\. Chen, C\. Pei, X\. Meng, K\. Xie, and G\. Zhang \(2024\)DMSTG: dynamic multiview spatio\-temporal networks for traffic forecasting\.IEEE Trans\. Mob\. Comput\.23\(6\),pp\. 6865–6880\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p2.1)\.
- Z\. Dong, R\. Jiang, H\. Gao, H\. Liu, J\. Deng, Q\. Wen, and X\. Song \(2024\)Heterogeneity\-informed meta\-parameter learning for spatiotemporal time series forecasting\.InKDD,pp\. 631–641\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- Y\. Fang, Y\. Liang, B\. Hui, Z\. Shao, L\. Deng, X\. Liu, X\. Jiang, and K\. Zheng \(2025\)Efficient large\-scale traffic forecasting with transformers: a spatial data management perspective\.InKDD,pp\. 307–317\.Cited by:[§A\.3](https://arxiv.org/html/2605.16447#A1.SS3.p1.3),[§1](https://arxiv.org/html/2605.16447#S1.p3.1),[§2](https://arxiv.org/html/2605.16447#S2.p1.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.13.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.24.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.33.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- Y\. Fang, Y\. Qin, H\. Luo, F\. Zhao, B\. Xu, L\. Zeng, and C\. Wang \(2023\)When spatio\-temporal meet wavelets: disentangled traffic forecasting via efficient spectral graph attention networks\.InICDE,pp\. 517–529\.Cited by:[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.10.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.21.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.30.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- Z\. Fang, Q\. Long, G\. Song, and K\. Xie \(2021\)Spatial\-temporal graph ODE networks for traffic flow forecasting\.InKDD,pp\. 364–373\.Cited by:[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.18.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.28.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.5.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- H\. Gao, Z\. Dong, J\. Yong, S\. Fukushima, K\. Taura, and R\. Jiang \(2025\)How different from the past? spatio\-temporal time series forecasting with self\-supervised deviation learning\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1)\.
- Y\. Gong, T\. He, M\. Chen, B\. Wang, L\. Nie, and Y\. Yin \(2024\)Spatio\-temporal enhanced contrastive and contextual learning for weather forecasting\.IEEE Trans\. Knowl\. Data Eng\.36\(8\),pp\. 4260–4274\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- K\. Guo, Y\. Hu, Y\. Sun, S\. Qian, J\. Gao, and B\. Yin \(2021\)Hierarchical graph convolution network for traffic forecasting\.InAAAI,pp\. 151–159\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p2.1)\.
- S\. Guo, Y\. Lin, L\. Gong, C\. Wang, Z\. Zhou, Z\. Shen, Y\. Huang, and H\. Wan \(2023\)Self\-supervised spatial\-temporal bottleneck attentive network for efficient long\-term traffic forecasting\.InICDE,pp\. 1585–1596\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1)\.
- H\. Hamedmoghadam, N\. Zheng, D\. Li, and H\. L\. Vu \(2022\)Percolation\-based dynamic perimeter control for mitigating congestion propagation in urban road networks\.Transportation research part C: emerging technologies145,pp\. 103922\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1)\.
- J\. Han, W\. Zhang, H\. Liu, T\. Tao, N\. Tan, and H\. Xiong \(2024\)BigST: linear complexity spatio\-temporal graph neural network for traffic forecasting on large\-scale road networks\.Proc\. VLDB Endow\.17\(5\),pp\. 1081–1090\.Cited by:[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.12.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.23.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.32.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- L\. Han, B\. Du, L\. Sun, Y\. Fu, Y\. Lv, and H\. Xiong \(2021\)Dynamic and multi\-faceted spatio\-temporal deep learning for traffic speed forecasting\.InKDD,pp\. 547–555\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- S\. He, J\. Ji, and M\. Lei \(2025\)Decomposed spatio\-temporal mamba for long\-term traffic prediction\.InAAAI,pp\. 11772–11780\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1)\.
- J\. Ji, J\. Wang, C\. Huang, J\. Wu, B\. Xu, Z\. Wu, J\. Zhang, and Y\. Zheng \(2023\)Spatio\-temporal self\-supervised learning for traffic flow prediction\.InAAAI,pp\. 4356–4364\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- J\. Jiang, C\. Han, W\. X\. Zhao, and J\. Wang \(2023a\)PDFormer: propagation delay\-aware dynamic long\-range transformer for traffic flow prediction\.InAAAI,pp\. 4365–4373\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p2.1)\.
- R\. Jiang, Z\. Wang, J\. Yong, P\. Jeph, Q\. Chen, Y\. Kobayashi, X\. Song, S\. Fukushima, and T\. Suzumura \(2023b\)Spatio\-temporal meta\-graph learning for traffic forecasting\.InAAAI,pp\. 8078–8086\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- R\. Jiang, X\. Zhang, K\. Jakhar, P\. Y\. Lu, P\. Hassanzadeh, M\. Maire, and R\. Willett \(2025\)Hierarchical implicit neural emulators\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p3.1)\.
- G\. Jin, Y\. Liang, Y\. Fang, Z\. Shao, J\. Huang, J\. Zhang, and Y\. Zheng \(2024\)Spatio\-temporal graph neural networks for predictive learning in urban computing: a survey\.IEEE Trans\. Knowl\. Data Eng\.36\(10\),pp\. 5388–5408\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1)\.
- W\. Kong, Z\. Guo, and Y\. Liu \(2024\)Spatio\-temporal pivotal graph neural networks for traffic flow forecasting\.InAAAI,pp\. 8627–8635\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- R\. Kumar, M\. Bhanu, J\. Mendes\-Moreira, and J\. Chandra \(2025\)Spatio\-temporal predictive modeling techniques for different domains: a survey\.ACM Comput\. Surv\.57\(2\),pp\. 38:1–38:42\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1)\.
- S\. Lan, Y\. Ma, W\. Huang, W\. Wang, H\. Yang, and P\. Li \(2022\)DSTAGNN: dynamic spatial\-temporal aware graph neural network for traffic flow forecasting\.InICML,pp\. 11906–11917\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1),[§2](https://arxiv.org/html/2605.16447#S2.p1.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.19.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.6.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- F\. Li, J\. Feng, H\. Yan, G\. Jin, F\. Yang, F\. Sun, D\. Jin, and Y\. Li \(2023\)Dynamic graph convolutional recurrent network for traffic prediction: benchmark and solution\.ACM Trans\. Knowl\. Discov\. Data17\(1\),pp\. 9:1–9:21\.Cited by:[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.8.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- X\. Li, Y\. Zhang, G\. Long, Y\. Hu, W\. Lu, M\. Chen, C\. Zhang, and Y\. Gong \(2025\)Adaptive traffic forecasting on daily basis: A spatio\-temporal context learning approach\.IEEE Trans\. Knowl\. Data Eng\.37\(8\),pp\. 4446–4459\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- Y\. Li, R\. Yu, C\. Shahabi, and Y\. Liu \(2018\)Diffusion convolutional recurrent neural network: data\-driven traffic forecasting\.InICLR,Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p2.1),[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- H\. Liu, Z\. Dong, R\. Jiang, J\. Deng, J\. Deng, Q\. Chen, and X\. Song \(2023a\)Spatio\-temporal adaptive embedding makes vanilla transformer sota for traffic forecasting\.InCIKM,pp\. 4125–4129\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- X\. Liu, Y\. Xia, Y\. Liang, J\. Hu, Y\. Wang, L\. Bai, C\. Huang, Z\. Liu, B\. Hooi, and R\. Zimmermann \(2023b\)LargeST: a benchmark dataset for large\-scale traffic forecasting\.InNeurIPS,Cited by:[§A\.1](https://arxiv.org/html/2605.16447#A1.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p1.1)\.
- Y\. Liu, T\. Hu, H\. Zhang, H\. Wu, S\. Wang, L\. Ma, and M\. Long \(2024\)iTransformer: inverted transformers are effective for time series forecasting\.InThe Twelfth International Conference on Learning Representations,Cited by:[§A\.13](https://arxiv.org/html/2605.16447#A1.SS13.p3.1),[§5\.2](https://arxiv.org/html/2605.16447#S5.SS2.p3.1)\.
- I\. Loshchilov and F\. Hutter \(2019\)Decoupled weight decay regularization\.InICLR \(Poster\),Cited by:[§A\.5](https://arxiv.org/html/2605.16447#A1.SS5.p1.8)\.
- T\. Lyu, J\. Han, and H\. Liu \(2025\)NRFormer: nationwide nuclear radiation forecasting with spatio\-temporal transformer\.InKDD \(2\),pp\. 4705–4716\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- J\. Ma, B\. Wang, G\. Wang, K\. Yang, Z\. Zhou, P\. Wang, X\. Wang, and Y\. Wang \(2025a\)Less but more: linear adaptive graph learning empowering spatiotemporal forecasting\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§A\.13](https://arxiv.org/html/2605.16447#A1.SS13.p2.1),[§5\.2](https://arxiv.org/html/2605.16447#S5.SS2.p3.1)\.
- J\. Ma, B\. Wang, G\. Wang, K\. Yang, Z\. Zhou, P\. Wang, X\. Wang, and Y\. Wang \(2025b\)Less but more: linear adaptive graph learning empowering spatiotemporal forecasting\.InNeurIPS,Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p2.1)\.
- Q\. Ma, Z\. Zhang, X\. Zhao, H\. Li, H\. Zhao, Y\. Wang, Z\. Liu, and W\. Wang \(2023\)Rethinking sensors modeling: hierarchical information enhanced traffic forecasting\.InCIKM,pp\. 1756–1765\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p3.1),[§2](https://arxiv.org/html/2605.16447#S2.p2.1)\.
- Y\. Ma, P\. Gérard, Y\. Tian, Z\. Guo, and N\. V\. Chawla \(2022\)Hierarchical spatio\-temporal graph neural networks for pandemic forecasting\.InCIKM,pp\. 1481–1490\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p2.1)\.
- Y\. Mao, H\. Zhou, L\. Chen, R\. Qi, Z\. Sun, Y\. Rong, X\. He, M\. Chen, S\. Mumtaz, V\. Frascolla, M\. Guizani, and J\. Rodrigues \(2025\)A survey on spatio\-temporal prediction: from transformers to foundation models\.ACM Comput\. Surv\.58\(4\)\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p2.1)\.
- I\. Marisca, C\. Alippi, and F\. M\. Bianchi \(2024\)Graph\-based forecasting with missing data through spatiotemporal downsampling\.InICML,Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p2.1)\.
- A\. Y\. Ng, M\. I\. Jordan, and Y\. Weiss \(2001\)On spectral clustering: analysis and an algorithm\.InNIPS,pp\. 849–856\.Cited by:[§A\.10](https://arxiv.org/html/2605.16447#A1.SS10.p1.1),[§1](https://arxiv.org/html/2605.16447#S1.p4.1),[§4\.1](https://arxiv.org/html/2605.16447#S4.SS1.p1.1)\.
- Y\. Nie, N\. H\. Nguyen, P\. Sinthong, and J\. Kalagnanam \(2023\)A time series is worth 64 words: long\-term forecasting with transformers\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Jbdc0vTOcol)Cited by:[§A\.13](https://arxiv.org/html/2605.16447#A1.SS13.p3.1)\.
- W\. Shao, Z\. Jin, S\. Wang, Y\. Kang, X\. Xiao, H\. Menouar, Z\. Zhang, J\. Zhang, and F\. D\. Salim \(2022a\)Long\-term spatio\-temporal forecasting via dynamic multiple\-graph attention\.InIJCAI,pp\. 2225–2232\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p2.1)\.
- Z\. Shao, F\. Wang, Y\. Xu, W\. Wei, C\. Yu, Z\. Zhang, D\. Yao, T\. Sun, G\. Jin, X\. Cao, G\. Cong, C\. S\. Jensen, and X\. Cheng \(2025\)Exploring progress in multivariate time series forecasting: comprehensive benchmarking and heterogeneity analysis\.IEEE Trans\. Knowl\. Data Eng\.37\(1\),pp\. 291–305\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p2.1)\.
- Z\. Shao, Z\. Zhang, F\. Wang, W\. Wei, and Y\. Xu \(2022b\)Spatial\-temporal identity: a simple yet effective baseline for multivariate time series forecasting\.InCIKM,pp\. 4454–4458\.Cited by:[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.20.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.29.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.9.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- Z\. Shao, Z\. Zhang, W\. Wei, F\. Wang, Y\. Xu, X\. Cao, and C\. S\. Jensen \(2022c\)Decoupled dynamic spatial\-temporal graph neural network for traffic forecasting\.Proc\. VLDB Endow\.15\(11\),pp\. 2733–2746\.Cited by:[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.7.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- J\. Shi and J\. Malik \(2000\)Normalized cuts and image segmentation\.IEEE Transactions on pattern analysis and machine intelligence22\(8\),pp\. 888–905\.Cited by:[§4\.1](https://arxiv.org/html/2605.16447#S4.SS1.p1.1)\.
- J\. Tian, Y\. Liang, R\. Xu, P\. Chen, C\. Guo, A\. Zhou, L\. Pan, Z\. Rao, and B\. Yang \(2025\)Air quality prediction with physics\-guided dual neural odes in open systems\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§A\.13](https://arxiv.org/html/2605.16447#A1.SS13.p2.1),[§5\.2](https://arxiv.org/html/2605.16447#S5.SS2.p3.1)\.
- H\. Wang, J\. Peng, F\. Huang, J\. Wang, J\. Chen, and Y\. Xiao \(2023\)MICN: multi\-scale local and global context modeling for long\-term series forecasting\.InICLR,Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p3.1)\.
- S\. Wang, H\. Wu, X\. Shi, T\. Hu, H\. Luo, L\. Ma, J\. Y\. Zhang, and J\. Zhou \(2024a\)TimeMixer: decomposable multiscale mixing for time series forecasting\.InICLR,Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p3.1)\.
- S\. Wang, H\. Wu, X\. Shi, T\. Hu, H\. Luo, L\. Ma, J\. Y\. Zhang, and J\. ZHOU \(2024b\)TimeMixer: decomposable multiscale mixing for time series forecasting\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=7oLshfEIC2)Cited by:[§A\.13](https://arxiv.org/html/2605.16447#A1.SS13.p3.1)\.
- Z\. Wang, Y\. Sun, and H\. Zhu \(2025a\)Unifying knowledge from diverse datasets to enhance spatial\-temporal modeling: A granularity\-adaptive geographical embedding approach\.InICML,Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1)\.
- Z\. Wang, Y\. Qin, L\. Zeng, and R\. Zhang \(2025b\)High\-dynamic radar sequence prediction for weather nowcasting using spatiotemporal coherent gaussian representation\.InICLR,Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p1.1)\.
- N\. Wu, W\. X\. Zhao, J\. Wang, and D\. Pan \(2020a\)Learning effective road network representation with hierarchical graph neural networks\.InKDD,pp\. 6–14\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p2.1)\.
- Z\. Wu, S\. Pan, G\. Long, J\. Jiang, X\. Chang, and C\. Zhang \(2020b\)Connecting the dots: multivariate time series forecasting with graph neural networks\.InKDD,pp\. 753–763\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p2.1),[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- Z\. Wu, S\. Pan, G\. Long, J\. Jiang, and C\. Zhang \(2019\)Graph wavenet for deep spatial\-temporal graph modeling\.InIJCAI,pp\. 1907–1913\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p2.1),[§2](https://arxiv.org/html/2605.16447#S2.p1.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.16.2),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.27.2),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.3.2),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- P\. Xie, M\. Ma, T\. Li, S\. Ji, S\. Du, Z\. Yu, and J\. Zhang \(2023\)Spatio\-temporal dynamic graph relation learning for urban metro flow prediction\.IEEE Trans\. Knowl\. Data Eng\.35\(10\),pp\. 9973–9984\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- C\. M\. Yeh, Y\. Fan, X\. Dai, U\. S\. Saini, V\. Lai, P\. O\. Aboagye, J\. Wang, H\. Chen, Y\. Zheng, Z\. Zhuang, L\. Wang, and W\. Zhang \(2024\)RPMixer: shaking up time series forecasting with random projections for large spatial\-temporal data\.InKDD,pp\. 3919–3930\.Cited by:[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.11.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.22.1),[Table 1](https://arxiv.org/html/2605.16447#S4.T1.8.1.31.1),[§5\.1](https://arxiv.org/html/2605.16447#S5.SS1.p2.1)\.
- B\. Yu, H\. Yin, and Z\. Zhu \(2018\)Spatio\-temporal graph convolutional networks: a deep learning framework for traffic forecasting\.InIJCAI,pp\. 3634–3640\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p2.1),[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- Y\. Yuan, J\. Ding, J\. Feng, D\. Jin, and Y\. Li \(2024\)UniST: a prompt\-empowered universal model for urban spatio\-temporal prediction\.InKDD,pp\. 4095–4106\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p1.1)\.
- Y\. Zhang, P\. Wang, B\. Wang, X\. Wang, Z\. Zhao, Z\. Zhou, L\. Bai, and Y\. Wang \(2024\)Adaptive and interactive multi\-level spatio\-temporal network for traffic forecasting\.IEEE Trans\. Intell\. Transp\. Syst\.25\(10\),pp\. 14070–14086\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p3.1),[§2](https://arxiv.org/html/2605.16447#S2.p2.1)\.
- Y\. Zhao, P\. Deng, J\. Liu, X\. Jia, and M\. Wang \(2023\)Causal conditional hidden markov model for multimodal traffic prediction\.InProceedings of the AAAI Conference on Artificial Intelligence,pp\. 4929–4936\.Cited by:[§1](https://arxiv.org/html/2605.16447#S1.p2.1)\.
- Z\. Zhou, R\. Basker, and D\. Yeung \(2025\)Graph neural networks for multivariate time\-series forecasting via learning hierarchical spatiotemporal dependencies\.Eng\. Appl\. Artif\. Intell\.147,pp\. 110304\.Cited by:[§2](https://arxiv.org/html/2605.16447#S2.p2.1)\.

## Appendix AAppendix

### A\.1Dataset\.

Our experiments utilize the GLA, GBA, and CA subsets from the LargeST benchmark (Liu et al., [2023b](https://arxiv.org/html/2605.16447#bib.bib28)), which comprise traffic data collected via the CalTrans PeMS sensor network. We focus on records from 2019, aggregated into 15-minute intervals (96 daily records). As summarized in Table [6](https://arxiv.org/html/2605.16447#A1.T6), these datasets provide a large-scale evaluation environment, covering diverse regional scales with node counts ranging from 2,352 to 8,600 and totaling over 300 million observations.

Table 6: Dataset statistics.

| Dataset | #Nodes | #Samples | #Time Slices | Timespan |
|---|---|---|---|---|
| GBA | 2352 | 82M | 35040 | 01/01/2019 - 12/31/2019 |
| GLA | 3834 | 134M | 35040 | 01/01/2019 - 12/31/2019 |
| CA | 8600 | 301M | 35040 | 01/01/2019 - 12/31/2019 |

### A\.2Baselines

We compare the proposed approach with the following advanced baselines:

- •STID: It proposes an efficient MLP baseline that solves sample indistinguishability in forecasting by incorporating spatial and temporal identity information\.
- •GWNET: It captures hidden spatial dependencies using a learned adaptive matrix and models long\-range temporal trends via stacked dilated 1D convolutions\.
- •AGCRN: It employs node\-adaptive parameter learning and automatic graph generation to capture fine\-grained spatial\-temporal correlations without requiring pre\-defined graphs\.
- •STGODE: It captures synchronous spatial\-temporal dynamics using tensor\-based ordinary differential equations and semantic adjacency matrices\.
- •DGCRN: It leverages hyper\-networks to generate dynamic filter parameters and time\-varying graphs from node attributes, integrating them with static structures while optimizing efficiency by limiting decoder iterations\.
- •DSTAGNN: It replaces pre\-defined graphs with a data\-driven dynamic graph, using multi\-head attention and multi\-scale gated convolutions to capture complex spatial\-temporal dependencies\.
- •D2STGNN: It uses an estimation gate and residual decomposition to decouple and independently model diffusion and inherent traffic signals\.
- •STWave: It decouples traffic into trends and events using a dual\-channel framework with wavelet\-based positional encoding and a query sampling strategy\.
- •RPMixer: It models temporal dynamics via MLPs and spatial relationships through the integration of random projection layers\.
- •BigST: It is a spatial-temporal graph neural network with linear complexity, which allows for the efficient exploitation of long-range dependencies in large-scale traffic forecasting tasks.
- •PatchSTG: It groups nodes via KDTree\-based irregular spatial patching and utilizes depth and breadth attention to model local and global dependencies\.

### A\.3Evaluation Metrics

To comprehensively evaluate the model performance, and in line with previous works (Fang et al., [2025](https://arxiv.org/html/2605.16447#bib.bib59)), we utilize three standard metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE). These metrics provide a multi-faceted assessment of prediction accuracy, capturing both average deviations and sensitivity to extreme values. Let $y_i$ denote the ground truth, $\hat{y}_i$ the predicted value, and $n$ the total number of samples; the specific definitions and characteristics of these metrics are as follows:

- •Mean Absolute Error (MAE): This metric calculates the average of the absolute differences between the predicted and actual values, providing a linear penalty for all errors. Because it does not square the deviations, MAE is more robust to outliers and represents the basic magnitude of the forecasting error.
  $$\text{MAE}=\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i| \tag{11}$$
- •Root Mean Squared Error (RMSE): By squaring the errors before averaging, RMSE assigns a significantly higher weight to large deviations. This makes it a sensitive indicator of the model's stability and its ability to avoid large-scale prediction failures in critical traffic scenarios.
  $$\text{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2} \tag{12}$$
- •Mean Absolute Percentage Error (MAPE): This metric expresses the error as a percentage of the ground truth, offering a scale-independent measure. It is particularly valuable in traffic forecasting for comparing performance across different road segments or time periods with varying traffic volumes.
  $$\text{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|\times 100\% \tag{13}$$
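
As a minimal sketch of Eqs. (11)–(13), the three metrics can be written in plain Python. The function names are illustrative, and skipping samples with a zero ground truth in MAPE is an assumption of this sketch (Eq. (13) is undefined there), not a detail stated above.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error, Eq. (11)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error, Eq. (12)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent, Eq. (13).

    Assumption: samples whose ground truth is exactly zero are skipped.
    """
    pairs = [(y, p) for y, p in zip(y_true, y_pred) if y != 0]
    return 100.0 * sum(abs((y - p) / y) for y, p in pairs) / len(pairs)

y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

On this toy input the single 20-unit error pulls RMSE above MAE, which is the sensitivity to large deviations described above.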

### A\.4Proof of Theorem 1

Theorem: Let $\mathcal{C}_m$ be a cluster of size $|\mathcal{C}_m|$ with center $\mathbf{Z}_m$. The SNR of $\mathbf{Z}_m$ satisfies:

$$\mathrm{SNR}(\mathbf{Z}_m)\geq[1+(|\mathcal{C}_m|-1)\rho_m]\cdot\overline{\mathrm{SNR}}_m, \tag{14}$$

where $\overline{\mathrm{SNR}}_m$ is the average individual SNR, and $\rho_m$ is the average pairwise correlation of true signals $\mathbf{S}$ within the cluster.

###### Proof\.

We consider a single region $\mathcal{C}_m$. For any node $i\in\mathcal{C}_m$, its signal can be decomposed as $\mathbf{X}_i=\mathbf{S}_i+\bm{\epsilon}_i$, where $\mathbf{S}_i$ is the deterministic signal component and $\bm{\epsilon}_i\sim\mathcal{N}(0,\sigma^2\mathbf{I}_C)$ is i.i.d. noise. We define $\bar{\mathbf{S}}_m=\frac{1}{|\mathcal{C}_m|}\sum_{i\in\mathcal{C}_m}\mathbf{S}_i$.

The cluster center is $\mathbf{Z}_m=\frac{1}{|\mathcal{C}_m|}\sum_{i\in\mathcal{C}_m}\mathbf{X}_i=\bar{\mathbf{S}}_m+\frac{1}{|\mathcal{C}_m|}\sum_{i\in\mathcal{C}_m}\bm{\epsilon}_i$. The power of the signal component in $\mathbf{Z}_m$ is $\|\bar{\mathbf{S}}_m\|^2$. The noise component $\frac{1}{|\mathcal{C}_m|}\sum_i\bm{\epsilon}_i$ has covariance $\frac{\sigma^2}{|\mathcal{C}_m|}\mathbf{I}_C$. Thus, the SNR of $\mathbf{Z}_m$ is

$$\operatorname{SNR}(\mathbf{Z}_m)=\frac{\|\bar{\mathbf{S}}_m\|^2}{\sigma^2/|\mathcal{C}_m|}=|\mathcal{C}_m|\cdot\frac{\|\bar{\mathbf{S}}_m\|^2}{\sigma^2}. \tag{15}$$

For an individual node $i\in\mathcal{C}_m$, its SNR is $\operatorname{SNR}(\mathbf{X}_i)=\|\mathbf{S}_i\|^2/\sigma^2$. The average SNR within the cluster is

$$\overline{\operatorname{SNR}}_m=\frac{1}{|\mathcal{C}_m|}\sum_{i\in\mathcal{C}_m}\frac{\|\mathbf{S}_i\|^2}{\sigma^2}. \tag{16}$$

By the Cauchy–Schwarz inequality,

$$\left\|\bar{\mathbf{S}}_m\right\|^2=\frac{1}{|\mathcal{C}_m|^2}\left\|\sum_i\mathbf{S}_i\right\|^2\geq\frac{1}{|\mathcal{C}_m|^2}\left(\sum_i\|\mathbf{S}_i\|\right)^2. \tag{17}$$

Furthermore, recall that the average correlation coefficient $\rho_m$ among the $\mathbf{S}_i$ is defined by

$$\begin{aligned}\rho_m&\triangleq\frac{1}{|\mathcal{C}_m|(|\mathcal{C}_m|-1)}\sum_{i\neq j\in\mathcal{C}_m}\mathrm{Corr}(\mathbf{S}_i,\mathbf{S}_j)\\&=\frac{1}{|\mathcal{C}_m|(|\mathcal{C}_m|-1)}\sum_{i\neq j\in\mathcal{C}_m}\frac{\mathbf{S}_i^{\top}\mathbf{S}_j}{\|\mathbf{S}_i\|\|\mathbf{S}_j\|}.\end{aligned} \tag{18}$$

We then obtain the final form of the bound on $\|\bar{\mathbf{S}}_m\|^2$:

$$\left\|\bar{\mathbf{S}}_m\right\|^2\geq\frac{1+(|\mathcal{C}_m|-1)\rho_m}{|\mathcal{C}_m|}\cdot\frac{1}{|\mathcal{C}_m|}\sum_i\|\mathbf{S}_i\|^2. \tag{19}$$

Substituting this into equation [15](https://arxiv.org/html/2605.16447#A1.E15) and using equation [16](https://arxiv.org/html/2605.16447#A1.E16) yields

$$\operatorname{SNR}(\mathbf{Z}_m)\geq\left[1+(|\mathcal{C}_m|-1)\rho_m\right]\cdot\overline{\operatorname{SNR}}_m. \tag{20}$$

∎

The inequality for the global representation $\mathbf{Z}$ follows by noting that $\operatorname{SNR}(\mathbf{Z})$ is essentially a weighted average of the $\operatorname{SNR}(\mathbf{Z}_m)$, and that the maximum possible enhancement is constrained by the region with the smallest intra-cluster correlation $\rho_m$ and size $|\mathcal{C}_m|$. The factor $N/K$ arises from the dimensionality reduction from $N$ nodes to $M$ regions.

This theorem theoretically validates the denoising mechanism of our framework: the SNR gain scales with both cluster size $|\mathcal{C}_m|$ and intra-cluster correlation $\rho_m$. Since our spectral clustering naturally groups nodes with high signal synchronization (maximizing $\rho_m$), the resulting region-level representations $\mathbf{Z}$ effectively suppress independent local noise while preserving the structural signal fidelity.
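
The quantities in Eqs. (15), (16), and (18) can be evaluated numerically for a toy cluster. The sketch below is illustrative only: the three signals and the noise variance are made-up values, and the signals are chosen with equal norms, a case in which the two sides of Eq. (14) coincide.

```python
import math

def norm_sq(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(x * x for x in v)

def cluster_snr_terms(signals, sigma2):
    """Return (SNR of the cluster center, average node SNR, rho_m)."""
    n = len(signals)
    dim = len(signals[0])
    # Mean signal of the cluster
    s_bar = [sum(s[d] for s in signals) / n for d in range(dim)]
    snr_center = n * norm_sq(s_bar) / sigma2                     # Eq. (15)
    snr_avg = sum(norm_sq(s) for s in signals) / (n * sigma2)    # Eq. (16)
    # Eq. (18): average normalized inner product over ordered pairs i != j
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                dot = sum(a * b for a, b in zip(signals[i], signals[j]))
                total += dot / math.sqrt(norm_sq(signals[i]) * norm_sq(signals[j]))
    rho = total / (n * (n - 1))
    return snr_center, snr_avg, rho

# Hypothetical unit-norm signals and noise variance
signals = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
snr_center, snr_avg, rho = cluster_snr_terms(signals, sigma2=0.5)
bound = (1 + (len(signals) - 1) * rho) * snr_avg                 # right side of Eq. (14)
print(snr_center, bound, rho)
```

Here $\rho_m=1/3$ and both sides equal $10/3$, compared with an average node SNR of 2, so the gain factor is $1+(|\mathcal{C}_m|-1)\rho_m=5/3$.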

### A\.5Implementation Details

All experiments were implemented using the PyTorch framework and conducted on a single NVIDIA A100 80GB GPU. We optimized the model using the AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.16447#bib.bib11)) optimizer with a fixed learning rate of $3\times 10^{-4}$. To prevent overfitting, we employed an early stopping strategy, terminating training if the validation loss did not decrease for 30 consecutive epochs. Regarding the model hyperparameters, the dimension of input embeddings was set to 128, while the latent dimension for the attention mechanism was fixed at 256. The number of latent regions $K$ was set to $0.2N$, where $N$ denotes the number of nodes in each of GBA, GLA, and CA. The single-step prediction horizon $P$ was set to $\{4,6,4\}$ for GBA, GLA, and CA, respectively. Accordingly, the loss balancing coefficients for regional prediction ($\lambda_1$) and boundary prediction ($\lambda_2$) were configured as (0.1, 0.2), (0.3, 0.3), and (0.1, 0.2), respectively.
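
The settings above can be collected into a small configuration together with an early-stopping helper. This is a sketch, not the released training code: the dictionary keys, the `EarlyStopping` class, and `num_regions` are hypothetical names, and rounding $0.2N$ to the nearest integer is an assumption.

```python
# Hyperparameters as reported above; key names are illustrative.
CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 3e-4,
    "patience": 30,        # epochs without validation improvement
    "embed_dim": 128,
    "attn_dim": 256,
    "region_ratio": 0.2,   # number of regions as a fraction of N
}

class EarlyStopping:
    """Signal a stop once the validation loss has not decreased for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def num_regions(num_nodes, ratio=CONFIG["region_ratio"]):
    """Number of latent regions for a dataset (rounding is an assumption)."""
    return round(ratio * num_nodes)

for name, n in [("GBA", 2352), ("GLA", 3834), ("CA", 8600)]:
    print(name, num_regions(n))
```

Under the rounding assumption, this gives 470, 767, and 1720 regions for GBA, GLA, and CA.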

### A\.6Loss Function Definitions

#### Node\-level forecasting loss\.

For node-level prediction, we adopt the Huber loss to balance robustness to outliers and sensitivity to small errors. Given the prediction error $e=\hat{x}-x$, the Huber loss is defined as

$$\ell_{\text{Huber}}(e)=\begin{cases}\frac{1}{2}e^2,&|e|\leq\delta,\\ \delta\left(|e|-\frac{1}{2}\delta\right),&\text{otherwise},\end{cases} \tag{21}$$

where $\delta$ is a predefined threshold. The node-level loss $\mathcal{L}_x$ is computed by averaging the Huber loss over all nodes and prediction horizons.
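
A minimal plain-Python sketch of Eq. (21) follows. The default $\delta=1$ and the flat-list interface of `node_loss` are assumptions made for illustration; the value of $\delta$ used in the experiments is not stated here.

```python
def huber(e, delta=1.0):
    """Huber loss of a single error e, Eq. (21)."""
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e               # quadratic zone for small errors
    return delta * (a - 0.5 * delta)     # linear zone for large errors

def node_loss(preds, targets, delta=1.0):
    """Average Huber loss over a flat list of (node, horizon) predictions."""
    return sum(huber(p - t, delta) for p, t in zip(preds, targets)) / len(preds)

print(huber(0.5), huber(3.0), node_loss([1.5, 4.0], [1.0, 1.0]))
```

The two branches agree at $|e|=\delta$ (both give $\frac{1}{2}\delta^2$), so the loss is continuous, and an error of 3 costs 2.5 rather than the 4.5 a pure squared loss would assign.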

#### Quantile\-based regional forecasting loss\.

To model predictive uncertainty at the regional level, we predict a set of quantiles $\{\tau_q\}_{q=1}^{Q}$ for each region and optimize the corresponding pinball loss. For a given quantile $\tau\in(0,1)$ and prediction error $e=z-\hat{z}_{\tau}$, the pinball loss is defined as

$$\rho_{\tau}(e)=\max(\tau e,(\tau-1)e). \tag{22}$$

The regional forecasting loss $\mathcal{L}_z$ is obtained by averaging the pinball loss across all regions, quantiles, and prediction horizons.
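
The pinball loss of Eq. (22) can be sketched as below. The error is passed as target minus prediction, the convention under which minimizing the loss recovers the $\tau$-quantile; this sign choice, the function names, and the toy numbers are assumptions of the sketch.

```python
def pinball(e, tau):
    """Pinball loss of Eq. (22) for one error e at quantile level tau."""
    return max(tau * e, (tau - 1.0) * e)

def regional_loss(quantile_preds, target, taus):
    """Average pinball loss over the quantile levels for a single target value."""
    return sum(pinball(target - q, tau) for q, tau in zip(quantile_preds, taus)) / len(taus)

taus = (0.1, 0.5, 0.9)
# Hypothetical lower, median, and upper predictions for a true value of 10
loss = regional_loss((8.0, 10.0, 13.0), 10.0, taus)
print(loss)
print(pinball(1.0, 0.9), pinball(-1.0, 0.9))
```

For $\tau=0.9$, a target lying one unit above the prediction costs about 0.9 while one lying one unit below costs about 0.1, which is what pushes the 0.9-level prediction toward the upper end of the distribution.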

#### Masked guidance reconstruction loss\.

The masked guidance reconstruction loss $\mathcal{L}_{\text{bd}}$ applies the same quantile-based pinball loss to the outputs of the guidance decoder. During training, regional inputs are randomly masked, and the decoder is supervised to reconstruct the corresponding future regional states, encouraging robustness to missing or noisy guidance signals.

### A\.7Period\-aligned Chunk Partition for Affinity Construction

To construct the feature-driven affinity matrix $\mathbf{A}$ from raw temporal observations, we partition the training sequence into $\tilde{T}$ non-overlapping temporal chunks. The key motivation is to align the chunking scheme with the intrinsic periodicity of the data, so that $\mathbf{A}$ captures similarity in *long-term temporal evolution* rather than short-term fluctuations.

#### Empirical periodicity analysis\.

For each dataset, we conduct a preliminary periodicity analysis on the training split (e.g., via autocorrelation / spectral inspection) and observe a clear dominant period of approximately $P=100$ time steps across all three datasets. Based on this observation, we set the number of chunks to

$$\tilde{T}=100, \tag{23}$$

and partition the training sequence into $100$ consecutive, non-overlapping chunks. For each node, we compute an averaged representation within each chunk, and define pairwise affinities by aggregating the chunk-wise distances, as described in Eq. (1).

#### Rationale\.

This period-aligned chunking ensures that each chunk corresponds to a consistent seasonal phase of the underlying temporal process, thereby encouraging $\mathbf{A}$ to encode node-wise similarity in coarse-grained periodic dynamics. As a result, the constructed affinity matrix is less sensitive to transient noise and short-term irregular variations.
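
The chunk-averaging step can be illustrated with a short sketch. It only shows the averaging and a summed squared distance between chunk representations; the exact aggregation of Eq. (1) is not reproduced, and the function names, the dropped trailing remainder, and the toy series are all assumptions.

```python
def chunk_means(series, num_chunks):
    """Average a 1-D series within consecutive, non-overlapping chunks.

    Assumption: any trailing samples that do not fill a chunk are dropped.
    """
    size = len(series) // num_chunks
    if size == 0:
        raise ValueError("series is shorter than the number of chunks")
    return [sum(series[c * size:(c + 1) * size]) / size for c in range(num_chunks)]

def chunk_distance(series_i, series_j, num_chunks):
    """Sum of squared differences between the chunk-averaged representations."""
    reps_i = chunk_means(series_i, num_chunks)
    reps_j = chunk_means(series_j, num_chunks)
    return sum((a - b) ** 2 for a, b in zip(reps_i, reps_j))

# Toy series: 'noisy' and 'smooth' share the same coarse profile, 'shifted' does not
noisy = [0.0, 2.0, 0.0, 2.0, 4.0, 6.0, 4.0, 6.0]
smooth = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
shifted = [5.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0]
print(chunk_distance(noisy, smooth, 2), chunk_distance(noisy, shifted, 2))
```

The short-term oscillation in `noisy` averages out within each chunk, so its distance to `smooth` is zero, whereas the out-of-phase `shifted` series stays far away.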

### A\.8Effect of Cross\-Scale Interaction Depth

We study the sensitivity of our model to the depth of the Cross-Scale Interaction module. Specifically, we vary the number of Cross-Transformer layers in $\{4,5,6,7\}$ while keeping all other hyperparameters unchanged. We report the averaged multi-step forecasting performance over all horizons (Steps 1–12) in terms of MAE, RMSE and MAPE.

Table 7: Hyperparameter study on the number of Cross-Transformer layers. Results are averaged over all prediction horizons (Steps 1–12).

| Layers | GBA MAE | GBA RMSE | GBA MAPE | GLA MAE | GLA RMSE | GLA MAPE |
|---|---|---|---|---|---|---|
| 4 | 18.73 | 31.85 | 12.90 | 18.42 | 31.67 | 11.01 |
| 5 | 19.09 | 32.77 | 13.53 | 18.44 | 31.89 | 10.94 |
| 6 | 19.06 | 32.57 | 13.58 | 17.89 | 30.52 | 10.74 |
| 7 | 19.12 | 32.45 | 13.79 | 18.21 | 31.24 | 10.76 |

As shown in Table [7](https://arxiv.org/html/2605.16447#A1.T7), the performance is generally stable across different Cross-Transformer depths on both datasets. On GBA, the best results are achieved with 4 layers, obtaining the lowest MAE, RMSE, and MAPE, while deeper configurations do not bring further gains and may slightly degrade performance. In contrast, on GLA, increasing the depth to a moderate level is beneficial: 6 layers yield the best MAE, RMSE, and MAPE, whereas further increasing the depth to 7 layers leads to a minor performance drop. Overall, the differences across depths remain small (the MAE gap between the best and worst settings is about 2% on GBA and about 3% on GLA), indicating that our model is robust to the choice of Cross-Transformer depth.

### A\.9Effectiveness of Quantile Forecasting

![Refer to caption](https://arxiv.org/html/2605.16447v1/x4.png)

Figure 4: Quantile forecasting under different noise regimes. We visualize the quantile prediction results using three quantile levels $\{0.1,0.5,0.9\}$, where the shaded region denotes the predictive distribution spanned by the lower and upper quantiles. In the high-noise case (left), the quantile band widens to reflect increased uncertainty and covers most of the ground-truth trajectory, indicating well-calibrated uncertainty estimation. In the low-noise case (right), the quantiles collapse into a narrow band and closely match the ground truth, suggesting that the model appropriately reduces uncertainty when the signal is stable.

To assess the reliability of our probabilistic forecasting module, we report quantile predictions at levels $\{0.1,0.5,0.9\}$, corresponding to the lower bound, median, and upper bound, respectively. As shown in Fig. [4](https://arxiv.org/html/2605.16447#A1.F4), the quantile forecasts exhibit desirable behavior across different noise regimes. In the high-noise setting, the predicted quantile band provides a calibrated uncertainty estimate, where the interval between the $0.1$ and $0.9$ quantiles expands to reflect increased stochasticity in the observations. Importantly, this uncertainty band captures the majority of the ground-truth trajectory, indicating that the model can produce informative prediction distributions rather than over-confident point estimates. In contrast, in the low-noise setting, the three quantiles become tightly concentrated and nearly overlap with each other as well as with the ground truth, suggesting that the model correctly reduces predictive uncertainty when the signal is stable. Overall, these results demonstrate that our quantile prediction is both adaptive and well-calibrated, yielding accurate distributional forecasts under varying noise conditions.

### A\.10Algorithmic Framework

The overall algorithmic procedure of our framework consists of two main stages: latent structure initialization and model optimization. First, to construct the hierarchical structure from node signals, we employ Graph Spectral Clustering (Ng et al., [2001](https://arxiv.org/html/2605.16447#bib.bib99)), as detailed in Algorithm [1](https://arxiv.org/html/2605.16447#alg1). This process groups nodes with synchronized patterns into latent regions. Subsequently, the training workflow of NeST, presented in Algorithm [2](https://arxiv.org/html/2605.16447#alg2), integrates a boundary modeling task with a scheduled sampling strategy (teacher forcing decay) to bridge the training-inference gap and jointly optimize the coarse-to-fine forecasting objectives.

Algorithm 1 Graph Spectral Clustering

Input: Node signals $\mathbf{X}\in\mathbb{R}^{N\times T}$, adjacency $\mathbf{A}\in\mathbb{R}^{N\times N}$, clusters $M$, kernel width $\sigma$, K-Means iterations $K$, initializations $n_{\text{init}}$.

Output: Assignment matrix $\mathbf{S}\in\{0,1\}^{N\times M}$, prototypes $\mathbf{C}\in\mathbb{R}^{M\times T}$.

1. Preprocess: $\mathbf{A}_{sym}\leftarrow(\mathbf{A}+\mathbf{A}^{\top})/2$; preprocess $\mathbf{X}$ if needed.
2. Construct similarity (Gaussian): $d_{ij}=\|x_i-x_j\|^2$, $\mathbf{W}_{ij}=\exp\!\left(-\frac{d_{ij}}{2\sigma^2}\right)$, and $\mathbf{W}_{ii}=0$.
3. Degree $\mathbf{D}_{ii}=\sum_j\mathbf{W}_{ij}$; normalized Laplacian $\mathbf{L}_{sym}=\mathbf{I}-\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$.
4. Compute the $M$ smallest eigenvectors $\mathbf{U}_M$ of $\mathbf{L}_{sym}$; row-normalize $\mathbf{U}\leftarrow\mathrm{RowNorm}(\mathbf{U}_M)$.
5. Run K-Means on rows of $\mathbf{U}$ with $n_{\text{init}}$ restarts and $K$ iterations; obtain one-hot assignment $\mathbf{S}$.
6. Prototypes: $\mathbf{C}=(\mathbf{S}^{\top}\mathbf{S})^{-1}\mathbf{S}^{\top}\mathbf{X}$.
7. return $\mathbf{S},\mathbf{C}$.
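
Steps 2, 3, and 6 of Algorithm 1 can be sketched in plain Python as below. The eigen-decomposition and K-Means steps are deliberately left out (they would normally be delegated to a numerical library), so the cluster labels are supplied by hand; the function names and the toy data are hypothetical.

```python
import math

def gaussian_similarity(X, sigma):
    """Step 2: W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), with W_ii = 0."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            W[i][j] = W[j][i] = math.exp(-d / (2 * sigma ** 2))
    return W

def normalized_laplacian(W):
    """Step 3: L_sym = I - D^{-1/2} W D^{-1/2}."""
    n = len(W)
    deg = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * W[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]

def prototypes(X, labels, num_clusters):
    """Step 6: for a one-hot S, (S^T S)^{-1} S^T X is the per-cluster mean of X.

    Assumes every cluster has at least one member.
    """
    T = len(X[0])
    sums = [[0.0] * T for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for x, k in zip(X, labels):
        counts[k] += 1
        for t in range(T):
            sums[k][t] += x[t]
    return [[v / counts[k] for v in sums[k]] for k in range(num_clusters)]

# Toy signals for four nodes; labels stand in for the K-Means output of step 5
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
W = gaussian_similarity(X, sigma=1.0)
L = normalized_laplacian(W)
C = prototypes(X, labels=[0, 0, 1, 1], num_clusters=2)
print(C)
```

Because $\mathbf{S}^{\top}\mathbf{S}$ is diagonal with the cluster sizes on its diagonal, the prototype formula reduces to averaging the member signals, which is what `prototypes` computes directly.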

Algorithm 2 NeST Training Algorithm

Input: Dataset $\mathcal{D}$, max epochs $E$, decay $\gamma$, min prob $r$.

1. Initialize parameters $\theta$, set sampling prob $p_{tf}\leftarrow 1.0$.
2. for epoch $e=1$ to $E$ do
3. $p_{tf}\leftarrow\max(r,\text{Decay}(p_{tf},e,\gamma))$
4. for batch $(u_t,z_{t+1},u_{t+1},z_{t+2})$ in $\mathcal{D}$ do
5. *1. Boundary Gen:*
6. $\hat{z}_{t+1}=f_{\theta}(u_t,\mathbf{0})$
7. *2. Sampling (Scheduled Sampling):*
8. Draw mask $m\sim\text{Bernoulli}(p_{tf})$
9. $\tilde{z}_{t+1}=m\cdot z_{t+1}+(1-m)\cdot\text{StopGrad}(\hat{z}_{t+1})$
10. *3. Prediction:*
11. $\hat{u}_{t+1},\hat{z}_{t+2}=f_{\theta}(u_t,\tilde{z}_{t+1})$
12. *4. Optimization:*
13. $\mathcal{L}_{task}=\mathcal{L}_u(\hat{u}_{t+1},u_{t+1})+\mathcal{L}_{z2}(\hat{z}_{t+2},z_{t+2})+\lambda\cdot\mathcal{L}_{z1}(\hat{z}_{t+1},z_{t+1})$
14. Update $\theta\leftarrow\text{Optimizer}(\theta,\nabla\mathcal{L}_{task})$
15. end for
16. end for
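
The scheduled-sampling part of Algorithm 2 (lines 3, 8, and 9) can be sketched as follows. The form of Decay is not specified above, so a multiplicative decay is assumed here; drawing the mask independently per region is likewise an assumption, and the function names and numbers are hypothetical.

```python
import random

def decay_tf_prob(p_tf, gamma, r):
    """Line 3 with an assumed multiplicative Decay, floored at r."""
    return max(r, p_tf * gamma)

def mix_guidance(z_true, z_pred, p_tf, rng):
    """Lines 8-9: choose ground-truth or predicted guidance for each region.

    In an autodiff framework z_pred would be detached first (StopGrad);
    plain Python floats carry no gradient, so nothing is needed here.
    """
    mask = [1 if rng.random() < p_tf else 0 for _ in z_true]
    mixed = [m * zt + (1 - m) * zp for m, zt, zp in zip(mask, z_true, z_pred)]
    return mixed, mask

rng = random.Random(0)
p_tf = 1.0
for epoch in range(50):
    p_tf = decay_tf_prob(p_tf, gamma=0.9, r=0.2)

z_true = [1.0, 2.0, 3.0]      # hypothetical ground-truth regional states
z_pred = [1.5, 1.5, 2.5]      # hypothetical boundary predictions
mixed, mask = mix_guidance(z_true, z_pred, p_tf, rng)
print(p_tf, mask, mixed)
```

With $p_{tf}=1$ the model always sees ground-truth guidance (pure teacher forcing), and as $p_{tf}$ decays toward $r$ it increasingly conditions on its own predictions, which is how the schedule narrows the training-inference gap.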

### A\.11Prediction Comparison

To qualitatively evaluate the forecasting performance, we visualize the multi-step prediction results of NeST against the state-of-the-art baseline, PatchSTG, on representative nodes from the GLA dataset. Figure [5](https://arxiv.org/html/2605.16447#A1.F5) illustrates the forecasted traffic flows alongside the generated macro-level trends. As observed, the baseline method exhibits significant limitations in capturing rapid trend shifts. For Node 3195 (Left), PatchSTG fails to anticipate the sharp downward trend, predicting a continuous high-traffic volume (blue dashed line) while the ground truth (grey line) drops significantly. This indicates a lack of awareness of the broader contextual evolution. Conversely, for Node 2264 (Right), PatchSTG underestimates the traffic demand, predicting an immediate decline to near-zero values, whereas the actual traffic remains high before eventually dropping.

In contrast, NeST demonstrates superior robustness. By explicitly conditioning the micro-level generation on the predicted macro-trends (Top Row), our model successfully rectifies these deviations. For Node 3195, the macro-module forecasts a clear downward signal, guiding the node predictor to accurately track the flow reduction. Similarly, for Node 2264, the stable high-traffic macro-signal prevents the model from collapsing to zero early. These comparisons highlight that our hierarchical coupling mechanism effectively mitigates forecasting errors caused by local noise and ensures alignment with the true temporal dynamics.

![Refer to caption](https://arxiv.org/html/2605.16447v1/x5.png)

Figure 5: Qualitative comparison of multi-step forecasting results on the GLA dataset. We visualize the predictions for two distinct nodes: Node 3195 (Left) and Node 2264 (Right). (a) Top Panels: The macro-level regional trends predicted by our auxiliary module, which serve as guidance. (b) Bottom Panels: The comparison of fine-grained node predictions. The baseline PatchSTG (blue dashed line) struggles to capture sharp transitions, leading to significant overestimation (Node 3195) or underestimation (Node 2264). By incorporating the macro-guidance, NeST (red solid line) effectively corrects these errors, yielding predictions that closely match the Ground Truth (grey solid line) across the entire horizon.

### A\.12Fine\-grained Sensitivity Analysis of Region Number

Table 8: Fine-grained forecasting performance (MAE) with varying numbers of latent regions ($M$) on the GBA dataset.

| Dataset | $M=0.10N$ | $M=0.15N$ | $M=0.18N$ | $M=0.20N$ | $M=0.22N$ | $M=0.25N$ | $M=0.30N$ |
|---|---|---|---|---|---|---|---|
| GBA | 19.06 | 19.03 | 18.89 | 18.73 | 18.96 | 19.14 | 19.29 |

To comprehensively evaluate the robustness of our spatial granularity design, Table [8](https://arxiv.org/html/2605.16447#A1.T8) provides the detailed, fine-grained forecasting performance of NeST with varying numbers of latent regions ($M$) on the GBA dataset. The MAE is lowest at $M=0.20N$ (18.73) and rises gradually toward both smaller and larger $M$, staying within about 3% of the best value over the whole range.

### A\.13Generalization to Non\-Traffic Domains

To further verify the applicability and robustness of NeST beyond traffic forecasting, we evaluate it on diverse non-traffic spatio-temporal datasets as well as classical long-term time series benchmarks.

Non-Traffic Spatio-Temporal Forecasting. We first evaluate NeST on two representative non-traffic datasets: KnowAir from the meteorology domain and UrbanEV from the energy domain. We compare our method with recent state-of-the-art spatio-temporal baselines, including Air-DualODE (Tian et al., [2025](https://arxiv.org/html/2605.16447#bib.bib100)), PatchSTG, and MAGE (Ma et al., [2025a](https://arxiv.org/html/2605.16447#bib.bib98)). As shown in Table [9](https://arxiv.org/html/2605.16447#A1.T9), NeST consistently achieves the best performance on both datasets. These results indicate that our data-driven semantic clustering and cross-scale attention mechanisms can effectively capture underlying spatio-temporal dynamics without relying on traffic-specific physical priors.

Table 9: Performance comparison (MAE / RMSE) on non-traffic spatio-temporal forecasting datasets.

| Model | KnowAir (Meteorology) | UrbanEV (Energy) |
|---|---|---|
| Air-DualODE | 18.64 / 29.37 | - |
| PatchSTG | 16.08 / 24.70 | 5.16 / 11.53 |
| MAGE | 15.36 / 23.42 | 4.95 / 11.00 |
| NeST (Ours) | **14.87** / **22.89** | **4.37** / **10.26** |

Classical Long-Term Time Series Forecasting. We further evaluate NeST on two standard long-term forecasting benchmarks, Electricity and Solar-Energy, to assess its general time series modeling capability. Following the common setup, both the input length and prediction horizon are set to 96 ($96 \rightarrow 96$). We compare NeST with strong temporal transformer-based baselines, including PatchTST (Nie et al., [2023](https://arxiv.org/html/2605.16447#bib.bib102)), iTransformer (Liu et al., [2024](https://arxiv.org/html/2605.16447#bib.bib101)), and TimeMixer (Wang et al., [2024b](https://arxiv.org/html/2605.16447#bib.bib103)). As reported in Table [10](https://arxiv.org/html/2605.16447#A1.T10), NeST achieves competitive or superior performance on both datasets, demonstrating that the proposed macro-to-micro guidance paradigm is effective not only for irregular spatial graphs, but also for general multivariate time series forecasting.
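As a concrete illustration of the $96 \rightarrow 96$ setup, the following minimal sketch slices a series into (history, future) pairs. The function name `make_windows`, the stride of 1, and the toy series are assumptions made for illustration; this is not the paper's data pipeline.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], input_len: int = 96, horizon: int = 96
) -> List[Tuple[List[float], List[float]]]:
    """Slice a 1-D series into (history, future) pairs with stride 1 (assumed)."""
    total = input_len + horizon
    pairs = []
    for start in range(len(series) - total + 1):
        history = list(series[start:start + input_len])          # model input
        future = list(series[start + input_len:start + total])   # forecast target
        pairs.append((history, future))
    return pairs


# Toy series of 200 steps (hypothetical data, for illustration only).
series = [float(t) for t in range(200)]
pairs = make_windows(series)
```

Each pair holds 96 observed steps followed by the 96 steps to be predicted; a multivariate dataset would apply the same slicing along the time axis for every variable.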

Table 10: Performance comparison (MSE / MAE) on classical long-term time series forecasting benchmarks ($96 \rightarrow 96$).

| Model | Electricity | Solar-Energy |
|---|---|---|
| PatchTST | 0.190 / 0.296 | 0.265 / 0.323 |
| iTransformer | 0.148 / 0.240 | 0.203 / 0.237 |
| TimeMixer | 0.153 / 0.247 | **0.189** / 0.259 |
| NeST (Ours) | **0.141** / **0.235** | 0.201 / **0.211** |

### A.14 Visualization of Centroid Features and Noise Filtering

To intuitively illustrate how the proposed macro-to-micro paradigm handles local anomalies, Figure [6](https://arxiv.org/html/2605.16447#A1.F6) compares raw micro-node sequences with their corresponding macro centroid features. While individual node trajectories often contain high-frequency noise and sharp local fluctuations, the macro centroids effectively preserve the smoother regional trends. This observation suggests that semantic clustering serves as a natural low-pass filter. By providing stable regional anchors, the cross-scale attention mechanism can more reliably correct noisy micro-level predictions. In this way, macro-level consistency helps shield node-level forecasting from localized anomalies and transient structural noise.
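The low-pass effect described above can be reproduced on synthetic data. The sketch below assumes a region's centroid is the plain element-wise mean of its member nodes and that node noise is independent Gaussian; both are simplifying assumptions, and the helper names `centroid` and `roughness` are hypothetical rather than taken from the paper.

```python
import math
import random
from statistics import mean


def centroid(sequences):
    """Element-wise mean of equally long node sequences (one assumed region)."""
    return [mean(step) for step in zip(*sequences)]


def roughness(seq):
    """Mean squared first difference; larger values mean more high-frequency content."""
    return mean((b - a) ** 2 for a, b in zip(seq, seq[1:]))


random.seed(0)
T, n_nodes = 288, 20  # hypothetical sequence length and region size

# Shared smooth regional trend plus independent per-node noise (synthetic).
trend = [math.sin(2 * math.pi * t / T) for t in range(T)]
nodes = [[x + random.gauss(0.0, 0.3) for x in trend] for _ in range(n_nodes)]

macro = centroid(nodes)
node_roughness = [roughness(n) for n in nodes]
macro_roughness = roughness(macro)
```

Under these assumptions, averaging 20 nodes divides the noise variance by roughly 20 while leaving the shared trend intact, so `macro_roughness` comes out well below every entry of `node_roughness`. Real regions will have correlated noise and heterogeneous nodes, so the reduction in practice is likely smaller.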

![Refer to caption](https://arxiv.org/html/2605.16447v1/x6.png)
Figure 6: Visual comparison between raw node sequences (exhibiting high-frequency noise) and their corresponding macro centroid features (preserving smooth trends).
