A Global-Local Graph Attention Network for Traffic Forecasting
Summary
Proposes a Global-Local Graph Attention Network (GLGAT) with pairwise encoding and event-based adjacency matrix for traffic forecasting, effectively capturing spatio-temporal correlations and achieving competitive performance on real-world datasets.
# A Global-Local Graph Attention Network for Traffic Forecasting
Source: [https://arxiv.org/html/2605.16726](https://arxiv.org/html/2605.16726)
###### Abstract
Traffic forecasting is a significant part of intelligent transportation systems. One of its critical challenges is capturing spatio-temporal correlations. In recent years, graph convolutional networks and graph attention networks have replaced traditional statistical models for predicting future traffic. However, neither makes it easy for vertices to have markedly different characteristics. To address this, we propose the Global-Local Graph Attention Network (GLGAT) with pairwise encoding and an event-based adjacency matrix. GLGAT combines a global set of attention matrices shared by the whole graph with local sets of attention matrices assigned to each vertex. Experiments on two real-world traffic datasets show that GLGAT effectively captures spatio-temporal correlations and performs competitively against state-of-the-art baselines.
## 1 Introduction
Nowadays, traffic forecasting has gained more and more attention as a critical part of cars’ self\-driving and route\-planning systems and an essential part of urban intelligent transportation systems\. With an increasing number of real\-time traffic sensors on the road, more and more data become available for forecasting methods to forecast traffic congestion, find time\-saving travel routes, locate bottlenecks of city traffic, and help urban planning\. Moreover, with the rapid development of mobile networks and positioning systems, it has become accessible for many individuals and organizations to use real\-time traffic forecasting\. A forecasting algorithm with higher accuracy will be beneficial for these mentioned areas\.
Traffic forecasting has been studied for decades. Because it predicts values across different locations and times, a defining feature of the problem is that models need to capture spatio-temporal correlations. Early studies used traditional statistical models, e.g., auto-regressive integrated moving average (ARIMA) \[[5](https://arxiv.org/html/2605.16726#bib.bib14),[16](https://arxiv.org/html/2605.16726#bib.bib12),[18](https://arxiv.org/html/2605.16726#bib.bib13)\] and Kalman filtering. However, most of these models use linear architectures, which makes them unsuitable for this highly non-linear problem.
Performance has improved dramatically as a growing number of deep learning models have become practical for this problem \[[4](https://arxiv.org/html/2605.16726#bib.bib16),[9](https://arxiv.org/html/2605.16726#bib.bib15)\]. On the spatial correlation side, researchers first treated the city as an image: they placed sensors at their real-life locations and solved the problem with a Convolutional Neural Network (CNN). More recently, the Graph Convolutional Network (GCN) and the Graph Attention Network (GAT) have become mainstream since they are not limited to Euclidean spatial relationships. On the temporal correlation side, Recurrent Neural Networks (RNN) and their variants, Long Short-Term Memory (LSTM), bidirectional LSTM, and Gated Recurrent Unit (GRU), have become popular sequence-to-sequence approaches in recent works. Transformers, a leading trend in natural language processing (NLP), have also appeared in some recent papers. Some researchers have proposed methods that identify spatio-temporal correlations simultaneously, such as Graph LSTM.
Another feature of the traffic forecasting problem is that each sensor has its own characteristics. For example, sensors on one-way roads might have stronger directional preferences than sensors at crossroads, sensors downtown might be more time-dependent than sensors uptown, and sensors on narrow streets might be more easily affected by their surroundings than sensors on wide streets. However, recent approaches such as GCN and GAT either fail to distinguish between different adjacent sensors or demand large hidden dimensions and high network depth to let each sensor have different characteristics.
To address this problem, we present a new model called the Global-Local Graph Attention Network (GLGAT), an extension of the GAT. More specifically, unlike regular graph attention, which uses three matrices for self-attention, GLGAT gives each sensor a triple of shared matrices for “global” attention and independent learnable matrices that form “local” attention functions. GLGAT supports the multi-head mechanism without losing parallelization, and it allows different adjacency matrices to act on different heads. In our experiments, GLGAT with only a flattened time dimension already performs competitively against state-of-the-art models that use LSTM, GRU, or transformers. In summary, this paper makes the following main contributions:
1. Proposing a new graph attention framework that allows sensors to have independent attention functions with their neighbors. It gives each sensor a more localized preference than the traditional graph attention network.
2. Introducing a pairwise encoding that is direction-and-distance-based. Compared with other encoding methods, it gives sensors more capacity to distinguish neighbors by their geographic relationships.
3. Designing a data-driven adjacency matrix based on the time correlation of speed increase and decrease events between different sensors.
The rest of the paper is organized as follows. Section II reviews related work. Section III elaborates on the methodology and the structure of GLGAT. Section IV presents the experiments and the results. Finally, we conclude in Section V.
## 2 Related Work
In this section, we review literature related to our work. The first half covers graph neural networks, and the second half covers the adjacency matrices that are widely used in traffic forecasting studies.
### 2.1 Graph Neural Networks
The CNN structure has shown a promising ability to extract spatial relationships in Euclidean space in computer vision studies. Many studies adopt CNN to find spatial correlations and beat the performance of traditional statistical models, like ARIMA and its variants \[[5](https://arxiv.org/html/2605.16726#bib.bib14),[16](https://arxiv.org/html/2605.16726#bib.bib12),[18](https://arxiv.org/html/2605.16726#bib.bib13)\]. By marking regular grid maps with traffic data, spatio-temporal data becomes a series of images. By combining CNN with models that detect temporal correlations, many studies improved their prediction accuracy significantly. In our previous work \[[1](https://arxiv.org/html/2605.16726#bib.bib17)\], a model combining CNN, LSTM, and additional meteorological data could reasonably predict taxi demand. Nevertheless, we notice that the CNN model neglects the topological structure of the sensor network, which hampers performance as the number of sensors increases.
The GCN is a solution that preserves both the graph topological structure and the convolution mechanism. Bruna et al. use the graph Laplacian in the graph convolution framework instead of the traditional square-shaped convolution kernel. Defferrard, Bresson, and Vandergheynst use Chebyshev polynomials to reduce the computational complexity. Kipf and Welling provide a first-order Chebyshev polynomial approximation. The GCN is compatible with structures that capture temporal correlation, like RNN, LSTM, GRU, and Transformer, using the original adjacency matrix or its higher powers. Later, noticing the achievements of attention and transformers in NLP, many researchers generalized the attention mechanism to replace the convolution kernel in GCN, building a new mechanism called GAT. Many studies show that replacing GCN with GAT improves performance.
The CNN, GCN, and GAT share the idea that a shared transformation can extract graph-structured information. However, in the traffic forecasting problem, sensors have their own unique characteristics. Globally shared convolution filters or attention matrices need many hidden dimensions to capture sensors’ local characteristics, which may cause high inter-channel redundancy. Besides, widely used encodings, e.g., eigenvectors or sine and cosine functions, cannot represent relative relationships, which hinders the attention mechanism from telling a sensor’s different neighbors apart.
### 2.2 Adjacency Matrix
Adjacency matrices are critical to numerous state-of-the-art deep learning models in traffic forecasting. Several ablation studies show that a suitable adjacency matrix can improve a model’s performance. Most adjacency matrices fall into three categories: connection-based, dynamic, and similarity-based.
Connection-based matrices relate to the connectivity between sensors. The matrix values usually represent whether a road directly connects two sensors or whether vehicles can travel between them within 5 minutes. One variant of this method replaces the binary values with the inverse of the graph hop count or travel time, which extends the relationship to a broader neighborhood. One limitation of connection-based matrices is that they only work with physically short-range correlations. A multi-layer structure is required for models to find correlations between distant sensors.
Unlike static matrices, dynamic matrices allow models to build the adjacency matrix while training\. The initialization of the dynamic adjacency matrix is carried out through specific static methods or randomization\. This method allows the model to find the most suitable matrix for the network structure\. However, it requires more complex network tuning, and it makes the model less explainable\.
Similarity-based adjacency matrices have two major sub-types: functional similarity matrices and traffic pattern similarity matrices. Functional similarity matrices usually represent the points of interest around different sensors. This method can be helpful when forecasting the passenger flow of a subway, where the points of interest around different stations vary a lot. However, it is of little use for predicting the traffic flow at a sensor in the middle of a highway, since cars there cannot freely enter or leave the road. Traffic pattern similarity matrices, which are extracted from training data, represent the similarities of the flow patterns between sensors. For example, Dynamic Time Warping (DTW) and FastDTW \[[10](https://arxiv.org/html/2605.16726#bib.bib19)\] use a temporal graph to find similarities in time series. However, they require sensors to be strongly correlated during most periods. For example, they eliminate sensor pairs that are substantially similar only in a short time slot like the morning rush.
## 3 Methodology
In this section, we formulate the traffic forecasting problem and define the graph model of traffic data and the GAT. Then, we introduce the GLGAT as a refinement of the GAT, along with the pairwise encoding that complements the model. Finally, we present an event-based adjacency matrix that can fit most existing models.
### 3.1 Preliminaries
#### 3.1.1 Graph and Adjacency Matrix
We represent the topological structure of the traffic network as a graph $G=(V,E)$. $V$ is the set of sensors on the road, with $|V|=N$. $E$ is the set of edges, where $e\in E$ if and only if the two ends of $e$ are directly connected by a road. The adjacency matrix $A\in R^{N\times N}$ carries more information about the relationships between vertices. Traditionally, an entry $a\in A$ encodes connectivity: $a$ is one if the row and column vertices are connected and zero otherwise. More broadly, $A$ represents the correlations between vertices: for any $a\in A$, a large $|a|$ suggests a strong relationship between the row vertex and the column vertex, and a small $|a|$ suggests a weak one.
#### 3.1.2 Traffic Forecasting Problem
The goal of the traffic forecasting problem is to predict future traffic from historical traffic data. $X^{t}\in R^{N\times K}$ denotes the $K$ traffic features observed at each sensor at time $t$. Given the graph $G$ and the graph signals of the previous $P$ timesteps, the problem is to obtain a function $F$ that predicts the graph signals of the next $Q$ timesteps:

$$[X^{t-P+1,\dots,t},G]\stackrel{F}{\longrightarrow}[X^{t+1,\dots,t+Q}] \tag{1}$$

where $X^{t-P+1,\dots,t}\in R^{P\times N\times K}$ and $X^{t+1,\dots,t+Q}\in R^{Q\times N\times K}$.
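As a concrete reading of Eq. (1), the sketch below cuts a time-ordered sequence of graph signals into (history, target) pairs. The function name `make_samples` and the stride-1 sliding window are illustrative assumptions, not details given in the paper.

```python
def make_samples(series, p=12, q=12):
    """Split a time-ordered list of graph signals into (history, target) pairs.

    series[t] stands for the graph signal X^t (e.g. an N x K nested list).
    Each sample pairs P consecutive steps with the Q steps that follow them.
    """
    samples = []
    for start in range(len(series) - p - q + 1):
        history = series[start:start + p]            # X^{t-P+1, ..., t}
        target = series[start + p:start + p + q]     # X^{t+1, ..., t+Q}
        samples.append((history, target))
    return samples


# Toy example: 30 timesteps, each "graph signal" is just its own index.
samples = make_samples(list(range(30)), p=12, q=12)
first_history, first_target = samples[0]
```

A forecasting function $F$ is then trained to map each `history` (together with the graph) to its `target`.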
#### 3.1.3 Graph Attention Network
Veličković et al. adopted attention mechanisms to learn the coefficients between vertex pairs and proposed GAT \[[15](https://arxiv.org/html/2605.16726#bib.bib18)\]. Most GAT models use self-attention as the base component of the graph attention layer. For graph data $X_{\mathit{in}}\in R^{N\times K}$, hidden size $H$, and output size $K'$, the query $Q\in R^{N\times H}$, the key $K\in R^{N\times H}$, and the value $V\in R^{N\times H}$ of the self-attention are obtained as follows:

$$\begin{aligned}Q&=W_{Q}(X_{\mathit{in}}\oplus E)+b_{Q}\\ K&=W_{K}(X_{\mathit{in}}\oplus E)+b_{K}\\ V&=W_{V}X_{\mathit{in}}+b_{V}\end{aligned} \tag{2}$$

where $E\in R^{N\times H_{E}}$ is the encoding of each vertex and $\oplus$ is the concatenation operator. $W_{Q},W_{K}\in R^{H\times(K+H_{E})}$ and $W_{V}\in R^{H\times K}$ are linear transform matrices, and $b_{Q},b_{K},b_{V}\in R^{H}$ are the biases of each transformation. For two vertices $v_{i}$ and $v_{j}$, the query of $v_{i}$ is $q_{i}:=Q[i]\in R^{H}$, and the key of $v_{j}$ is $k_{j}:=K[j]\in R^{H}$. The score of $v_{j}$ on $v_{i}$ is $e_{ij}$, which is calculated as

$$e_{ij}=\texttt{GELU}(q_{i}\cdot k_{j}) \tag{3}$$

where $\texttt{GELU}(\cdot)$ \[[3](https://arxiv.org/html/2605.16726#bib.bib10)\] is the activation function introduced by Hendrycks et al., and $\cdot$ is the dot product of vectors. The attention coefficient is then defined as

$$a_{ij}=\begin{cases}0,&A[i,j]=0,\\ \dfrac{\texttt{exp}(e_{ij})}{\sum_{v_{k}\in G,\,A[i,k]\neq 0}\texttt{exp}(e_{ik})},&\text{otherwise},\end{cases} \tag{4}$$

which is constrained by the adjacency matrix $A$. Then, in the hidden graph data $X_{\mathit{hidden}}\in R^{N\times H}$, the feature of $v_{i}$, $x'_{i}:=X_{\mathit{hidden}}[i]$, is computed as the weighted sum of the values in $V$:

$$x'_{i}=\sum_{v_{j}\in G}a_{ij}V[j]. \tag{5}$$

A feed-forward layer is then applied to $X_{\mathit{hidden}}$ with a transform matrix $W_{\mathit{ff}}\in R^{K'\times H}$ and a bias $b_{\mathit{ff}}\in R^{K'}$:

$$X_{\mathit{out}}=W_{\mathit{ff}}X_{\mathit{hidden}}+b_{\mathit{ff}} \tag{6}$$

Since $X_{\mathit{in}}$ and $X_{\mathit{out}}$ have the same graph structure, GAT layers can be stacked into a deep neural network as long as the hidden dimensions match.
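To make the data flow of Eqs. (2) to (6) concrete, here is a minimal NumPy sketch of a single-head layer. The function and parameter names are hypothetical, weights are stored as (output, input) so each row of the data is transformed independently, and every vertex is assumed to have at least one nonzero entry in its adjacency row (for instance a self-loop). The row-maximum subtraction is a standard numerical-stability step that leaves the softmax of Eq. (4) unchanged.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))


def gat_layer(x_in, enc, adj, params):
    """Single-head graph attention layer following Eqs. (2)-(6).

    x_in: (N, K) features, enc: (N, H_E) vertex encoding, adj: (N, N).
    params: dict of weights (hypothetical names), stored as (out, in).
    """
    xe = np.concatenate([x_in, enc], axis=1)          # X_in ⊕ E
    q = xe @ params["W_Q"].T + params["b_Q"]          # (N, H)
    k = xe @ params["W_K"].T + params["b_K"]          # (N, H)
    v = x_in @ params["W_V"].T + params["b_V"]        # (N, H)

    e = gelu(q @ k.T)                                 # Eq. (3): e_ij
    e = np.where(adj != 0, e, -np.inf)                # non-neighbours get weight 0
    w = np.exp(e - e.max(axis=1, keepdims=True))      # exp(-inf) == 0
    a = w / w.sum(axis=1, keepdims=True)              # Eq. (4)

    x_hidden = a @ v                                  # Eq. (5)
    x_out = x_hidden @ params["W_ff"].T + params["b_ff"]   # Eq. (6)
    return x_out, a


rng = np.random.default_rng(0)
N, K, H_E, H, K_out = 4, 2, 3, 5, 2
params = {
    "W_Q": rng.normal(size=(H, K + H_E)), "b_Q": np.zeros(H),
    "W_K": rng.normal(size=(H, K + H_E)), "b_K": np.zeros(H),
    "W_V": rng.normal(size=(H, K)), "b_V": np.zeros(H),
    "W_ff": rng.normal(size=(K_out, H)), "b_ff": np.zeros(K_out),
}
adj = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]])
x_out, attn = gat_layer(rng.normal(size=(N, K)), rng.normal(size=(N, H_E)), adj, params)
```

Each row of `attn` sums to one and is zero wherever `adj` is zero, which is the constraint that Eq. (4) imposes.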
### 3.2 Global-Local Graph Attention Network
The GLGAT has a structure similar to GAT: it first transforms the input into a hidden state with an attention layer and then uses feed-forward layers to compute the output from the hidden state. The feed-forward structure is kept unchanged; the significant change in GLGAT is in the attention layer.
To support the multi-head mechanism and further allow different adjacency matrices to act on different heads, we increase the hidden size to $H'$, defined as

$$H':=H\times H_{\mathit{adj}}\times H_{\mathit{head}} \tag{7}$$

where $H$ is the hidden size of each head, $H_{\mathit{adj}}$ is the number of adjacency matrices used, and $H_{\mathit{head}}$ is the number of heads corresponding to each adjacency matrix.
To allow each vertex to have its own local attention operation, the easiest way is to have $N$ independent attention operations acting on all vertices. However, this increases the computational complexity by about $N$ times. As a trade-off, we keep the generation procedure of the key $K$ and the value $V$ unchanged and assign a query transformation matrix to each vertex. The output size of the query $Q$ is thus changed to

$$H_{Q}:=H'+H_{\mathit{adj}}\cdot H_{\mathit{PE}}, \tag{8}$$

where $H_{\mathit{PE}}$ is the size of the pairwise encoding $\mathit{PE}\in R^{N\times N\times H_{\mathit{PE}}}$, which is discussed in the next subsection. We also assign independent matrices and biases to each vertex. More precisely, for the vertex $v_{i}$, the corresponding query $q_{i}:=Q[i]$ is generated from two parts: $q_{i,\mathit{Global}}\in R^{H_{Q}}$, calculated with “globally” shared parameters, and $q_{i,\mathit{Local}}\in R^{H_{Q}}$, calculated with “locally” owned parameters. The formulas are

$$\begin{aligned}q_{i,\mathit{Global}}&=W_{\mathit{Q\text{-}Global}}(x_{i}\oplus e_{i})+b_{\mathit{Q\text{-}Global}}\\ q_{i,\mathit{Local}}&=W_{\mathit{Q\text{-}Local},i}(x_{i}\oplus e_{i})+b_{\mathit{Q\text{-}Local},i}\end{aligned} \tag{9}$$

where $x_{i}$ is the input feature of $v_{i}$ and $e_{i}:=E[i]$ is the $i$-th row of the traditional encoding $E\in R^{N\times H_{E}}$, matching the vertex $v_{i}$. $W_{\mathit{Q\text{-}Global}}\in R^{H_{Q}\times(K+H_{E})}$ and $b_{\mathit{Q\text{-}Global}}\in R^{H_{Q}}$ are the shared transform matrix and bias. $W_{\mathit{Q\text{-}Local},i}:=W_{\mathit{Q\text{-}Local}}[i]$ and $b_{\mathit{Q\text{-}Local},i}:=b_{\mathit{Q\text{-}Local}}[i]$ are $v_{i}$’s private transform matrix and bias, which extract unique features for each vertex, with $W_{\mathit{Q\text{-}Local}}\in R^{N\times H_{Q}\times(K+H_{E})}$ and $b_{\mathit{Q\text{-}Local}}\in R^{N\times H_{Q}}$. The query $q_{i}$ is obtained by compressing $q_{i,\mathit{Global}}$ and $q_{i,\mathit{Local}}$ as

$$q_{i}=W_{Q}(q_{i,\mathit{Global}}\oplus q_{i,\mathit{Local}})+b_{Q} \tag{10}$$

where $W_{Q}\in R^{H_{Q}\times(2H_{Q})}$ and $b_{Q}\in R^{H_{Q}}$ are transform parameters. The query $Q\in R^{N\times H_{Q}}$ is obtained by stacking the $q_{i}$. Splitting the last dimension separates the query into two parts: the query for attention $Q_{\mathit{AT}}\in R^{N\times H'}$ and the query for pairwise encoding $Q_{\mathit{PE}}\in R^{N\times(H_{\mathit{adj}}\times H_{\mathit{PE}})}$.
Then all matrices are reshaped to fit the multi-head structure, i.e., the last dimension of the graph data is split. The detailed shape changes are

$$\begin{aligned}Q_{\mathit{AT}}:\quad&(N\times H')\longrightarrow(N\times H_{\mathit{adj}}\times H_{\mathit{head}}\times H),\\ Q_{\mathit{PE}}:\quad&(N\times(H_{\mathit{adj}}\cdot H_{\mathit{PE}}))\longrightarrow(N\times H_{\mathit{adj}}\times H_{\mathit{PE}}),\\ K:\quad&(N\times H')\longrightarrow(N\times H_{\mathit{adj}}\times H_{\mathit{head}}\times H),\\ V:\quad&(N\times H')\longrightarrow(N\times H_{\mathit{adj}}\times H_{\mathit{head}}\times H).\end{aligned} \tag{11}$$

For the $n$-th adjacency matrix $A_{n}$ and its $m$-th head, the score of $v_{j}$ on $v_{i}$ is calculated by

$$e_{ij,n,m}=\texttt{GELU}(q_{\mathit{at},i,n,m}\cdot k_{j,n,m}+q_{\mathit{pe},i,n}\cdot\mathit{pe}_{i,j}), \tag{12}$$

where $q_{\mathit{at},i,n,m}:=Q_{\mathit{AT}}[i,n,m]\in R^{H}$, $k_{j,n,m}:=K[j,n,m]\in R^{H}$, $q_{\mathit{pe},i,n}:=Q_{\mathit{PE}}[i,n]\in R^{H_{\mathit{PE}}}$, and $\mathit{pe}_{i,j}:=\mathit{PE}[i,j]\in R^{H_{\mathit{PE}}}$. The attention coefficient is defined as

$$a_{ij,n,m}=\dfrac{\texttt{exp}(e_{ij,n,m})\cdot A_{n}[i,j]}{\sum_{v_{k}\in G}\texttt{exp}(e_{ik,n,m})\cdot A_{n}[i,k]}, \tag{13}$$

where $A_{n}[i,j]\in[0,1]$ is the value in the $i$-th row and $j$-th column of the $n$-th adjacency matrix. Using the attention coefficients as weights and summing the values, we obtain the hidden graph data $X_{\mathit{hidden}}\in R^{N\times H_{\mathit{adj}}\times H_{\mathit{head}}\times H}$ as

$$x_{\mathit{hidden},i,n,m}=\sum_{v_{j}\in G}a_{ij,n,m}v_{j,n,m}, \tag{14}$$

where $v_{j,n,m}:=V[j,n,m]$.
By flattening the last three dimensions, we have

$$X_{\mathit{hidden}}:(N\times H_{\mathit{adj}}\times H_{\mathit{head}}\times H)\longrightarrow(N\times H') \tag{15}$$

Then, using the transform matrix $W_{\mathit{ff}}\in R^{K'\times H'}$ and the bias $b_{\mathit{ff}}\in R^{K'}$, the feed-forward layer generates $X_{\mathit{output}}$ in the same way as in the GAT. As long as the dimensions match, stacking GLGAT layers produces a deeper neural network, since the graph structure is preserved during the transformation.
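The shape bookkeeping of Eqs. (7) to (15) is easier to follow in code. The NumPy sketch below is one possible reading of the attention layer, not the authors' implementation: all parameter names are hypothetical, weights are stored as (output, input), and each vertex is assumed to have at least one positive entry in every adjacency matrix. Entries with $A_{n}[i,j]=0$ are masked before the exponential only for numerical stability, which gives the same result as Eq. (13).

```python
import math
import numpy as np


def gelu(x):
    # Exact GELU, x * Phi(x), applied elementwise.
    return 0.5 * x * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))


def glgat_attention(x, enc, pe, adjs, params, h, h_head):
    """Hidden state of one GLGAT attention layer, Eqs. (7)-(15).

    x: (N, K), enc: (N, H_E), pe: (N, N, H_PE), adjs: (H_adj, N, N) in [0, 1].
    """
    num_v, h_adj, h_pe = x.shape[0], adjs.shape[0], pe.shape[2]
    h_prime = h * h_adj * h_head                               # Eq. (7)
    xe = np.concatenate([x, enc], axis=1)                      # x_i ⊕ e_i

    # Eq. (9): shared ("global") and per-vertex ("local") query projections.
    q_glob = xe @ params["Wq_glob"].T + params["bq_glob"]
    q_loc = np.einsum("nod,nd->no", params["Wq_loc"], xe) + params["bq_loc"]
    # Eq. (10): compress the two parts back to size H_Q.
    q = np.concatenate([q_glob, q_loc], axis=1) @ params["Wq"].T + params["bq"]

    # Eq. (11): split the query, then reshape per adjacency matrix and head.
    q_at = q[:, :h_prime].reshape(num_v, h_adj, h_head, h)
    q_pe = q[:, h_prime:].reshape(num_v, h_adj, h_pe)
    k = (xe @ params["Wk"].T + params["bk"]).reshape(num_v, h_adj, h_head, h)
    v = (x @ params["Wv"].T + params["bv"]).reshape(num_v, h_adj, h_head, h)

    # Eq. (12): content score plus pairwise-encoding score (no head index).
    content = np.einsum("inmh,jnmh->ijnm", q_at, k)
    pairwise = np.einsum("inp,ijp->ijn", q_pe, pe)[..., None]
    e = gelu(content + pairwise)

    # Eq. (13): adjacency-weighted softmax over neighbours j.
    adj_b = adjs.transpose(1, 2, 0)[..., None]                 # (N, N, H_adj, 1)
    e = np.where(adj_b > 0, e, -np.inf)
    w = np.exp(e - e.max(axis=1, keepdims=True)) * adj_b
    a = w / w.sum(axis=1, keepdims=True)

    # Eqs. (14)-(15): weighted sum of values, flattened to (N, H').
    hidden = np.einsum("ijnm,jnmh->inmh", a, v).reshape(num_v, h_prime)
    return hidden, a


rng = np.random.default_rng(0)
N, K, H_E, H_PE, H, H_adj, H_head = 5, 3, 2, 4, 6, 2, 3
H_prime, d_in = H * H_adj * H_head, K + H_E
H_Q = H_prime + H_adj * H_PE                                   # Eq. (8)
params = {
    "Wq_glob": rng.normal(scale=0.1, size=(H_Q, d_in)), "bq_glob": np.zeros(H_Q),
    "Wq_loc": rng.normal(scale=0.1, size=(N, H_Q, d_in)), "bq_loc": np.zeros((N, H_Q)),
    "Wq": rng.normal(scale=0.1, size=(H_Q, 2 * H_Q)), "bq": np.zeros(H_Q),
    "Wk": rng.normal(scale=0.1, size=(H_prime, d_in)), "bk": np.zeros(H_prime),
    "Wv": rng.normal(scale=0.1, size=(H_prime, K)), "bv": np.zeros(H_prime),
}
adjs = rng.uniform(0.1, 1.0, size=(H_adj, N, N))
adjs[0, 0, 1] = 0.0                                            # one missing edge
hidden, attn = glgat_attention(
    rng.normal(size=(N, K)), rng.normal(size=(N, H_E)),
    rng.normal(size=(N, N, H_PE)), adjs, params, H, H_head,
)
```

Only the local query weights grow with $N$; the key and value projections stay shared, which is the trade-off the text describes.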
### 3.3 Pairwise Encoding
The pairwise encoding $\mathit{PE}\in R^{N\times N\times H_{\mathit{PE}}}$ is introduced to provide more information during the query-key comparison step. As the previous subsection shows, $\mathit{PE}$ acts like a supplement to the key and operates with a separate query. The pairwise encoding stores basic correlations between vertices: an $H_{\mathit{PE}}$-dimensional description of the relationship between $v_{i}$ and $v_{j}$ is stored in $\mathit{PE}[i,j]$.
Several methods can form the encoding. Here we demonstrate a direction-and-distance-based version of pairwise encoding. The traffic flow at a sensor is usually influenced by two factors: possible incidents along the road where the sensor is located, and the paths and destinations chosen by the people passing the sensor. Both suggest that the strength of the correlation between sensors depends on the direction of the road. A function $\mathit{ED}(\cdot)$ transforms positions into a direction, classifies all directions into eight classes of forty-five degrees each, and encodes the class as a one-hot vector with label smoothing. For two vertices $v_{i}$ and $v_{j}$ whose sensors are located at $(x_{i},y_{i})$ and $(x_{j},y_{j})$, $\mathit{pe}_{ij}:=\mathit{PE}[i,j]$ and $\mathit{pe}_{ji}:=\mathit{PE}[j,i]$ are formulated with $\mathit{ED}(\cdot)$ and distance functions:

$$\begin{aligned}E_{\mathit{direction},ij}&=\mathit{ED}(x_{i},y_{i},x_{j},y_{j}),\\ E_{\mathit{direction},ji}&=\mathit{ED}(x_{j},y_{j},x_{i},y_{i}),\\ E_{L1}&=|x_{i}-x_{j}|+|y_{i}-y_{j}|,\\ E_{L2}&=\sqrt{(x_{i}-x_{j})^{2}+(y_{i}-y_{j})^{2}}.\end{aligned} \tag{16}$$

$$\mathit{pe}_{ij}=E_{\mathit{direction},ij}\oplus E_{L1}\oplus E_{L2}, \tag{17}$$

$$\mathit{pe}_{ji}=E_{\mathit{direction},ji}\oplus E_{L1}\oplus E_{L2}. \tag{18}$$
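A small sketch may help picture Eqs. (16) to (18). The paper does not specify where the eight sectors start, how much label smoothing is used, or how coordinates are scaled, so the choices below (sector 0 starting at the positive x-axis, smoothing of 0.1 spread evenly over the other classes, raw planar coordinates) are assumptions for illustration, as are the function names.

```python
import math


def direction_one_hot(xi, yi, xj, yj, smoothing=0.1, bins=8):
    """ED(.): label-smoothed one-hot of the bearing from (xi, yi) to (xj, yj)."""
    angle = math.atan2(yj - yi, xj - xi) % (2 * math.pi)     # in [0, 2*pi]
    label = int(angle // (2 * math.pi / bins)) % bins        # 45-degree sectors
    off = smoothing / (bins - 1)
    return [1.0 - smoothing if b == label else off for b in range(bins)]


def pairwise_encoding(coords):
    """PE[i][j] = direction(i -> j) ⊕ L1 distance ⊕ L2 distance (H_PE = 10)."""
    pe = []
    for xi, yi in coords:
        row = []
        for xj, yj in coords:
            l1 = abs(xi - xj) + abs(yi - yj)
            l2 = math.hypot(xi - xj, yi - yj)
            row.append(direction_one_hot(xi, yi, xj, yj) + [l1, l2])
        pe.append(row)
    return pe


pe = pairwise_encoding([(0.0, 0.0), (2.0, 1.0), (-1.0, 3.0)])
```

Note that `pe[i][j]` and `pe[j][i]` share the two distance entries but point in opposite direction classes, which is what lets a query distinguish upstream from downstream neighbors.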
Other information, like the point-of-interest difference between the surroundings of each sensor, can also be encoded in the pairwise encoding. The pairwise encoding can also express sensors’ relative sequential information by comparing the distances between sensors and highway exits in each sensor pair. Besides, traffic forecasting problems at different timesteps can have different $\mathit{PE}$s, in which the historical traffic flow between each pair of sensors during the same time of day can be embedded. A dynamic pairwise encoding is also possible, as with traditional encoding.
### 3.4 Event-Based Adjacency Matrix
The event-based adjacency matrix is a data-driven approach designed for the traffic forecasting problem. The main idea is that, for a given sensor, we find the other sensors that are likely to show a similar traffic flow change just a few timesteps before the chosen sensor does. First, define a “divider” for each sensor and set its value to the average of that sensor’s maximum and minimum historical traffic flow. Then, define an “up-event,” which happens only when the traffic flow increases and surpasses the “divider” value. Similarly, a “down-event” happens only when the traffic flow decreases and drops below the “divider” value. The sets of event times of the vertex $v_{i}$ corresponding to the sensor are called $\mathit{EV}_{i,\mathit{up}}$ and $\mathit{EV}_{i,\mathit{down}}$, respectively:

$$\begin{aligned}\mathit{EV}_{i,\mathit{up}}&=[t_{\mathit{up},1},\dots,t_{\mathit{up},u}],\\ \mathit{EV}_{i,\mathit{down}}&=[t_{\mathit{down},1},\dots,t_{\mathit{down},d}].\end{aligned} \tag{19}$$
For each $t_{\mathit{up},n}\in\mathit{EV}_{i,\mathit{up}}$, we find all $v_{j}$ that have an “up-event” close to $t_{\mathit{up},n}$, such that

$$[t_{\mathit{up},n}-t_{p},\dots,t_{\mathit{up},n}+t_{q}]\cap\mathit{EV}_{j,\mathit{up}}\neq\emptyset, \tag{20}$$

and group them as $\mathit{Group}(t_{\mathit{up},n})$, where $t_{p}$ and $t_{q}$ determine the acceptable range. The collection of all related vertex groups $\mathit{VG}$ of $v_{i}$ is then

$$\mathit{VG}_{\mathit{up},i}=[\mathit{Group}(t_{\mathit{up},1}),\dots,\mathit{Group}(t_{\mathit{up},u})]. \tag{21}$$

Similarly, we have

$$\mathit{VG}_{\mathit{down},i}=[\mathit{Group}(t_{\mathit{down},1}),\dots,\mathit{Group}(t_{\mathit{down},d})]. \tag{22}$$

Based on the frequency with which each vertex appears in $\mathit{VG}_{\mathit{up},i}$ and $\mathit{VG}_{\mathit{down},i}$, the $i$-th row of a new adjacency matrix can be determined. Both the “up-event” adjacency matrix and the “down-event” adjacency matrix are included in our model.
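The procedure of Eqs. (19) to (22) can be sketched in a few lines of plain Python. Several details are not fixed by the text and are assumptions here: the exact crossing test at the divider, the window values `t_p` and `t_q`, normalizing the frequency by the number of events of $v_{i}$ so that entries fall in $[0,1]$, and keeping the vertex itself in its own groups. The function names are hypothetical.

```python
def crossing_events(series, direction):
    """Timesteps where the series crosses its divider upward (+1) or downward (-1)."""
    divider = (max(series) + min(series)) / 2.0
    events = []
    for t in range(1, len(series)):
        prev, cur = series[t - 1], series[t]
        if direction > 0 and prev <= divider < cur:
            events.append(t)
        elif direction < 0 and prev >= divider > cur:
            events.append(t)
    return events


def event_adjacency(flows, direction, t_p=3, t_q=0):
    """Row i holds, for each j, the share of i's events with a j-event nearby."""
    events = [crossing_events(s, direction) for s in flows]
    n = len(flows)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            # Count events t of v_i whose window [t - t_p, t + t_q] hits EV_j.
            hits = sum(
                any(t - t_p <= tj <= t + t_q for tj in events[j])
                for t in events[i]
            )
            adj[i][j] = hits / len(events[i])
    return adj


s0 = [0, 0, 10, 10, 0, 0, 10, 10, 0, 0]   # changes first
s1 = [0, 0, 0, 10, 10, 0, 0, 10, 10, 0]   # same pattern, one step later
s2 = [0, 10, 0, 0, 0, 0, 0, 0, 0, 0]      # single early spike
up = event_adjacency([s0, s1, s2], +1, t_p=2, t_q=0)
down = event_adjacency([s0, s1, s2], -1, t_p=2, t_q=0)
```

In this toy example `up[1][0]` is 1.0 while `up[0][1]` is 0.0: sensor 0 always rises shortly before sensor 1, so the matrix is asymmetric and points from a sensor to the sensors that lead it.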
## 4 Experiments
In this section, we first introduce the datasets and baselines that we use\. Then the setup of our model is shown, followed by the result of the experiments\. Finally, we conduct a detailed ablation study\.
### 4.1 Datasets
Our experiments are performed on two typical real-world datasets, METR-LA and PEMS-BAY \[[6](https://arxiv.org/html/2605.16726#bib.bib8)\].
1. METR-LA: This dataset is collected from loop detectors in the Los Angeles County road network. It contains data from 207 sensors over four months, from March 1st, 2012 to June 30th, 2012, i.e., 34,272 5-minute slices.
2. PEMS-BAY: This dataset is collected by the California Transportation Agencies Performance Measurement System. It contains data from 325 sensors over six months, from January 1st, 2017 to May 31st, 2017, i.e., 52,116 5-minute slices.
Some key features of the two datasets are summarized in Table [1](https://arxiv.org/html/2605.16726#S4.T1).
Table 1: Dataset Statistics.
### 4.2 Baselines
We compare the GLGAT with other traffic forecasting models, including
1. HA: The Historical Average model uses the weighted average of historical data from the same time of day to predict future traffic flow.
2. ARIMA: The Auto-regressive Integrated Moving Average model \[[5](https://arxiv.org/html/2605.16726#bib.bib14),[16](https://arxiv.org/html/2605.16726#bib.bib12),[18](https://arxiv.org/html/2605.16726#bib.bib13)\] with Kalman filter is a classic time series analysis model using a linear architecture.
3. VAR: Vector Auto-Regression \[[7](https://arxiv.org/html/2605.16726#bib.bib3),[8](https://arxiv.org/html/2605.16726#bib.bib4)\] evaluates the relationships between time series, assuming that they are stationary.
4. SVR: Support Vector Regression \[[12](https://arxiv.org/html/2605.16726#bib.bib7)\] adapts the support vector machine model for traffic data.
5. FNN: The Feed Forward Neural Network directly flattens the input data and deals with traffic sequences.
6. FC-LSTM: The Fully Connected Long Short-Term Memory network \[[14](https://arxiv.org/html/2605.16726#bib.bib11)\] uses the LSTM structure to handle data from different timesteps.
7. DCRNN: The Diffusion Convolutional Recurrent Neural Network \[[6](https://arxiv.org/html/2605.16726#bib.bib8)\] uses bidirectional random walks for spatial correlation and an encoder-decoder architecture for temporal dependency.
8. STGCN: The Spatial-Temporal Graph Convolutional Network \[[17](https://arxiv.org/html/2605.16726#bib.bib9)\] combines graph convolution with one-dimensional convolution.
9. GaAN: Gated Attention Networks \[[19](https://arxiv.org/html/2605.16726#bib.bib2)\] use an attention-based network with a convolutional sub-network and a gate mechanism.
10. APTN: The Attention-based Periodic-Temporal neural Network \[[11](https://arxiv.org/html/2605.16726#bib.bib5)\] uses the encoder attention mechanism to capture spatial and periodic dependencies.
11. GST-GAT: The Global Spatial–Temporal Graph Attention Network \[[13](https://arxiv.org/html/2605.16726#bib.bib6)\] uses graph attention with LSTM and a gating fusion mechanism.
The first four methods, HA, ARIMA, VAR, and SVR, are traditional statistical models\. The FNN, FC\-LSTM, DCRNN, and STGCN are deep\-learning models that do not use attention mechanisms\. The last three models, GaAN, APTN, and GST\-GAT, are attention\-based\.
### 4.3 Experiment Setup
The experiment setup follows the setting of the DCRNN paper \[[6](https://arxiv.org/html/2605.16726#bib.bib8)\]. 70% of the data is used as the training set, 10% as the cross-validation set, and the remaining 20% as the testing set. Each sample sequence contains two hours of data at 5-minute intervals. There are thus twenty-four timesteps, divided into two halves. $P$ and $Q$, defined in the previous section, are both twelve, i.e., the first 12 timesteps are the input, and the remaining 12 timesteps are the ground truth that models need to predict.
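The split ratios above can be turned into index ranges as in the sketch below. The text only gives the ratios, so two points are assumptions: that the split is applied chronologically, and that it is applied to stride-1 sliding-window samples of $P+Q=24$ timesteps. The helper name `split_indices` is hypothetical.

```python
def split_indices(num_samples, train=0.7, val=0.1):
    """Chronological train / validation / test index ranges."""
    n_train = round(num_samples * train)
    n_val = round(num_samples * val)
    return (
        range(0, n_train),
        range(n_train, n_train + n_val),
        range(n_train + n_val, num_samples),
    )


# METR-LA: 34,272 five-minute slices, 12 input steps plus 12 target steps.
num_slices, p, q = 34272, 12, 12
num_samples = num_slices - p - q + 1
train_idx, val_idx, test_idx = split_indices(num_samples)
```

Keeping the test range strictly after the training range avoids leaking future traffic into training, which matters because neighboring windows overlap heavily.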
The GLGAT model used in testing has a seven-layer structure. A network block is shared within each layer. The bottom layer uses GLGAT blocks with an input size of three timesteps and an output size of 16. It pads the data with the last timestep twice, and every three adjacent timesteps are grouped. The second layer's GLGAT blocks have input and output sizes equal to 16. The third layer is a time-flatten layer where the data of different timesteps are concatenated into a vector of length 192. The fourth, fifth, and sixth layers are GLGAT blocks with input and output sizes equal to 192. The last layer is a fully connected layer that compresses the data dimension to 12 and outputs the result as predictions for the twelve future timesteps. The initial learning rate is $1.0\times 10^{-4}$. We optimize the model with the Adam optimizer by minimizing the smooth L1 loss [[2](https://arxiv.org/html/2605.16726#bib.bib1)] between $Y_{i}$ and $\hat{Y}_{i}$. Our experiments run on a Windows computer with one Intel(R) Core(TM) i7-10750H CPU, 32GB RAM, and one NVIDIA GeForce RTX 2080 Super GPU.
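The shape arithmetic in this paragraph can be checked with a short sketch. It assumes that "padding twice" means repeating the last timestep two times at the end of the 12-step input, so that a sliding group of three adjacent timesteps yields 12 groups; with 16 features per group, the time-flatten layer then sees 12 × 16 = 192 values. The smooth L1 function follows the Fast R-CNN definition with a threshold of 1. The names `group_timesteps` and `smooth_l1` are hypothetical, and this is not the authors' implementation.

```python
def group_timesteps(steps, group_size=3):
    """Repeat the last timestep (group_size - 1) times, then collect every
    run of group_size adjacent timesteps (one group per original timestep)."""
    padded = list(steps) + [steps[-1]] * (group_size - 1)
    return [padded[i:i + group_size] for i in range(len(steps))]


def smooth_l1(prediction, target):
    """Smooth L1 loss for one value: quadratic near zero, linear elsewhere."""
    diff = abs(prediction - target)
    if diff < 1.0:
        return 0.5 * diff * diff
    return diff - 0.5


inputs = list(range(12))              # indices of the 12 input timesteps
groups = group_timesteps(inputs)      # 12 groups of 3 timesteps each
features_per_group = 16               # output size of the first two layers
flattened_length = len(groups) * features_per_group

print(groups[0], groups[-1], flattened_length)
```

Under this reading, the last group consists of the final timestep repeated three times, and the flattened length matches the 192 stated for the third layer.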
Three conventional metrics, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE), measure the performance of different models. Their formulas are
$$\mathit{MAE}=\dfrac{1}{N}\sum_{i=1}^{N}\left|Y_{i}-\hat{Y}_{i}\right|, \tag{23}$$

$$\mathit{RMSE}=\sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(Y_{i}-\hat{Y}_{i}\right)^{2}}, \tag{24}$$

$$\mathit{MAPE}=\dfrac{100\%}{N}\sum_{i=1}^{N}\left|\dfrac{Y_{i}-\hat{Y}_{i}}{Y_{i}}\right|, \tag{25}$$

where $N$ is the number of testing samples, $Y_{i}$ is the ground truth of the $i$-th sample, and $\hat{Y}_{i}$ is the prediction from the model. Smaller MAE, MAPE, and RMSE values suggest better performance.
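The three metrics can be written directly from Equations (23) to (25). The sketch below uses plain Python lists and assumes every ground-truth value is nonzero, since MAPE divides by $Y_{i}$; how zero or missing readings are handled is not stated here. The sample values are made up for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error, Equation (23)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error, Equation (24)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in percent, Equation (25).

    Assumes no ground-truth value is zero.
    """
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative speeds (made-up values, not results from the paper).
y_true = [50.0, 60.0, 40.0]
y_pred = [45.0, 66.0, 40.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

For these values the absolute errors are 5, 6, and 0, so MAE is 11/3, RMSE is the square root of 61/3, and MAPE is 20/3 percent.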
### 4\.4Experiment Results
Table 2: Traffic prediction on METR-LA. Bold numbers indicate the best performance among these models, and underlined numbers indicate the second-best.

Table 3: Traffic prediction on PEMS-BAY. Bold numbers indicate the best performance among these models, and underlined numbers indicate the second-best.

Table [2](https://arxiv.org/html/2605.16726#S4.T2) and Table [3](https://arxiv.org/html/2605.16726#S4.T3) show the performance metrics of each model for 15-minute, 30-minute, and 60-minute-ahead prediction, where traditional statistical models, deep-learning models without attention, and deep-learning models with attention are separated by horizontal lines.
HA uses traffic flow patterns in the training set and does not consider the immediate past data when testing. Unlike other models, HA's performance does not depend on the forecasting horizon. It performs relatively well for long-term prediction, but it cannot handle sudden changes. Other traditional statistical methods have better short-term prediction accuracy but worse 60-minute forecasting results than HA.
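To see why a historical-average forecast does not change with the horizon, consider the following minimal sketch. It assumes a simple variant that averages training values by time-of-day slot; the exact HA variant used in the experiments may differ (for example, a weighted average over previous periods), and the names `fit_historical_average` and `predict_ha` are hypothetical.

```python
from collections import defaultdict


def fit_historical_average(train_series, slots_per_day):
    """Average the training values observed in each time-of-day slot."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, value in enumerate(train_series):
        slot = t % slots_per_day
        sums[slot] += value
        counts[slot] += 1
    return {slot: sums[slot] / counts[slot] for slot in sums}


def predict_ha(profile, target_index, slots_per_day):
    """Forecast for a target timestep: depends only on its time-of-day slot,
    not on recent observations or on how far ahead the forecast is made."""
    return profile[target_index % slots_per_day]


# Toy data: two "days" of four slots each (real data would use 288 slots per day).
train_series = [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0]
profile = fit_historical_average(train_series, slots_per_day=4)
print(profile, predict_ha(profile, target_index=9, slots_per_day=4))
```

Because the forecast for a target timestep is looked up from the profile alone, it is the same whether it is issued 15 or 60 minutes ahead, which also means it cannot react to a sudden change in the most recent readings.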
Deep-learning models, with or without attention mechanisms, outperform traditional statistical models. Moreover, attention-based models perform comparably to the other deep-learning models. On the METR-LA dataset, our model, GLGAT, listed at the bottom of both tables, improves over the other methods on the majority of metrics. On PEMS-BAY, most of the GLGAT results are the best or the second-best among all compared methods.
### 4\.5Ablation Study
Additional experiments are designed to verify the effect of each component\. We design three ablation models:
1. Ablation-1: Replace the event-based adjacency matrices with the adjacency matrices of connectivity.
2. Ablation-2: Remove the pairwise encoding.
3. Ablation-3: Replace the GLGAT with traditional GAT and remove the pairwise encoding.
Table 4: Ablation study on METR-LA. Bold numbers indicate the best performance among these models.

Table 5: Ablation study on PEMS-BAY. Bold numbers indicate the best performance among these models.

The performance of GLGAT and the three ablation models is shown in Table [4](https://arxiv.org/html/2605.16726#S4.T4) and Table [5](https://arxiv.org/html/2605.16726#S4.T5). The difference between GLGAT and Ablation-1 shows the effectiveness of the event-based adjacency matrices. The improvement from Ablation-3 to Ablation-2 demonstrates the benefit of the GLGAT structure. Furthermore, the improvement from Ablation-2 to the full GLGAT suggests that the pairwise encoding fits the network structure.
## 5Conclusion
This paper proposes GLGAT, a deep learning framework for the traffic forecasting problem. Along with a set of globally shared attention matrices, GLGAT allows each vertex to have specialized local attention matrices. The pairwise encoding and the event-based adjacency matrix are introduced as supplements to the GLGAT structure. Experiments show that GLGAT has competitive performance on two real-world datasets against other state-of-the-art models. It is worth noting that this performance is achieved without using state-of-the-art structures for finding temporal correlations, such as LSTM, GRU, or Transformer. In future work, we will try to combine GLGAT with other models and reinforce the model with additional meteorological data.
## References
- [1] (2021-07). A novel hybrid deep learning model for taxi demand forecasting based on decomposition of time series and fusion of text data. pp. 1–17. External Links: ISSN 10641246, 18758967, [Link](https://doi.org/10.3233/JIFS-210657), [Document](https://dx.doi.org/10.3233/JIFS-210657).
- [2] R. Girshick (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
- [3] D. Hendrycks and K. Gimpel (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
- [4] W. Jiang and J. Luo (2021). Graph neural network for traffic forecasting: a survey. arXiv preprint arXiv:2101.11174.
- [5] S. V. Kumar and L. Vanajakshi (2015-06). Short-term traffic flow prediction using seasonal ARIMA model with limited input data. European Transport Research Review 7(3). External Links: [Link](http://dx.doi.org/10.1007/s12544-015-0170-8), [Document](https://dx.doi.org/10.1007/s12544-015-0170-8).
- [6] Y. Li, R. Yu, C. Shahabi, and Y. Liu (2017). Diffusion convolutional recurrent neural network: data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.
- [7] M. Lippi, M. Bertini, and P. Frasconi (2013). Short-term traffic flow forecasting: an experimental comparison of time-series analysis and supervised learning. IEEE Transactions on Intelligent Transportation Systems 14(2), pp. 871–882. External Links: [Document](https://dx.doi.org/10.1109/TITS.2013.2247040).
- [8] H. Lütkepohl (2005). New introduction to multiple time series analysis. Springer Berlin Heidelberg. External Links: [Link](http://dx.doi.org/10.1007/978-3-540-27752-1), [Document](https://dx.doi.org/10.1007/978-3-540-27752-1).
- [9] L. Qu, W. Li, W. Li, D. Ma, and Y. Wang (2019). Daily long-term traffic flow forecasting based on a deep neural network. Expert Systems with Applications 121, pp. 304–312. External Links: ISSN 0957-4174, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.eswa.2018.12.031), [Link](https://www.sciencedirect.com/science/article/pii/S0957417418308017).
- [10] S. Salvador and P. Chan (2007-10). Toward accurate dynamic time warping in linear time and space. Intell. Data Anal. 11(5), pp. 561–580. External Links: ISSN 1088-467X.
- [11] X. Shi, H. Qi, Y. Shen, G. Wu, and B. Yin (2021). A spatial–temporal attention approach for traffic prediction. IEEE Transactions on Intelligent Transportation Systems 22(8), pp. 4909–4918. External Links: [Document](https://dx.doi.org/10.1109/TITS.2020.2983651).
- [12] A. J. Smola and B. Schölkopf (2004-08). A tutorial on support vector regression. Statistics and Computing 14(3), pp. 199–222. External Links: [Link](http://dx.doi.org/10.1023/b:stco.0000035301.49549.88), [Document](https://dx.doi.org/10.1023/b%3Astco.0000035301.49549.88).
- [13] B. Sun, D. Zhao, X. Shi, and Y. He (2021). Modeling global spatial–temporal graph attention network for traffic prediction. IEEE Access 9, pp. 8581–8594. External Links: [Document](https://dx.doi.org/10.1109/ACCESS.2021.3049556).
- [14] I. Sutskever, O. Vinyals, and Q. V. Le (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
- [15] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
- [16] Y. Wang and M. Papageorgiou (2005). Real-time freeway traffic state estimation based on extended Kalman filter: a general approach. Transportation Research Part B: Methodological 39(2), pp. 141–167. External Links: ISSN 0191-2615, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.trb.2004.03.003), [Link](https://www.sciencedirect.com/science/article/pii/S0191261504000438).
- [17] B. Yu, H. Yin, and Z. Zhu (2018-07). Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. External Links: [Link](http://dx.doi.org/10.24963/ijcai.2018/505), [Document](https://dx.doi.org/10.24963/ijcai.2018/505).
- [18] G. Zhang (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50, pp. 159–175. External Links: ISSN 0925-2312, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0925-2312%2801%2900702-0), [Link](https://www.sciencedirect.com/science/article/pii/S0925231201007020).
- [19] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D. Yeung (2018). GaAN: gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294.