Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
Summary
Proposes GC-MoE, a graph-conditioned mixture of experts framework for traffic forecasting that assigns each node a personalized combination of frozen pretrained spatio-temporal GNN experts based on graph topology and recent input, training only a lightweight routing module (∼17K parameters) and achieving competitive performance on four benchmarks.
# Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
This work was supported by the Emerging Projects program, Infotech Oulu; European Commission (101137711); European Regional Development Fund (A81373, A81376, A81568, A91867); Research Council of Finland (323630); the Strategic Research Council affiliated with Academy of Finland (372355) and Business Finland (8754/31/2022).
© 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Source: [https://arxiv.org/html/2605.30486](https://arxiv.org/html/2605.30486)
###### Abstract
Spatio-temporal forecasting on sensor graphs is commonly tackled with a single backbone architecture applied uniformly across all nodes, although graph regions can exhibit different dynamics. Road segments differ in functional class, structure, and traffic behavior, suggesting that node-wise expert specialization can be useful. We propose *GC-MoE*, a graph-conditioned mixture of experts framework that assigns each node a personalized combination of frozen forecasting experts based on graph topology and the recent traffic input window. GC-MoE combines frozen pretrained spatio-temporal GNN experts with an input-aware, spatially contextualized router while training only a lightweight routing module. We also study a bounded graph-conditioned output refinement layer as an optional extension and include node-adaptive ST-LoRA adapters only as an ablation diagnostic. Across four standard benchmarks (PEMS04, PEMS07, METR-LA, and PEMS-BAY), GC-MoE improves MAE over a zero-parameter ensemble baseline, with competitive RMSE and MAPE, while training only ∼17K parameters on top of 1.5M frozen expert weights. The implementation is available at [https://github.com/Ahghaffari/gc_moe](https://github.com/Ahghaffari/gc_moe).
## I Introduction
Spatio\-temporal \(ST\) forecasting underpins critical urban analytics tasks, such as traffic speed/flow prediction, in which measurements arrive from sensors connected by a road network and evolve over time\. Graph neural network \(GNN\) backbones that couple spatial structure with temporal dynamics, including diffusion recurrent models, graph\-convolutional architectures, and spectral variants, are strong baselines for these problems\[[17](https://arxiv.org/html/2605.30486#bib.bib1),[25](https://arxiv.org/html/2605.30486#bib.bib3),[24](https://arxiv.org/html/2605.30486#bib.bib2),[2](https://arxiv.org/html/2605.30486#bib.bib4)\]\. Recent work has also emphasized the importance of the full spatio\-temporal prediction pipeline, from spatial mapping and graph construction to model training and evaluation, highlighting the impact of graph design and preprocessing choices on downstream forecasting performance\[[8](https://arxiv.org/html/2605.30486#bib.bib12)\]\. Despite steady progress in model design, an important practical limitation remains: Different parts of the network may exhibit distinct dynamics due to differences in topology, road function, and connectivity, suggesting that a uniform backbone may be suboptimal\.
At the same time, progress in spatio-temporal graph neural network (ST-GNN) backbones suggests complementary strengths. Diffusion-based models capture multi-hop propagation\[[17](https://arxiv.org/html/2605.30486#bib.bib1)\]; spectral graph convolutions capture smooth graph signals\[[25](https://arxiv.org/html/2605.30486#bib.bib3)\]; adaptive-graph models learn node-specific structure\[[1](https://arxiv.org/html/2605.30486#bib.bib13)\]. A natural solution is to combine multiple architectures. Classic ensembling (e.g., uniform averaging) improves robustness\[[5](https://arxiv.org/html/2605.30486#bib.bib14)\], but ignores that the best expert may vary by node and condition. Learned ensembling via a meta-learner can improve combinations\[[12](https://arxiv.org/html/2605.30486#bib.bib15)\], yet typical routers depend primarily on input features and neither explicitly encode *graph-topological descriptors* of nodes\[[22](https://arxiv.org/html/2605.30486#bib.bib16),[7](https://arxiv.org/html/2605.30486#bib.bib11)\] nor exploit *spatial neighbor context* to detect network-wide congestion propagation.
Parameter\-efficient fine\-tuning \(PEFT\) offers another approach that freezes a backbone and trains only small adapter modules, such as low\-rank adaptation \(LoRA\)\[[11](https://arxiv.org/html/2605.30486#bib.bib5)\]\. Recent ST\-LoRA variants adapt spatio\-temporal forecasting using node\-adaptive low\-rank modules with small trainable budgets\[[21](https://arxiv.org/html/2605.30486#bib.bib7)\]\. However, PEFT alone does not address architectural heterogeneity; even a well\-adapted single expert may remain suboptimal for certain node roles\. Moreover, in a multi\-expert setting, the interaction between routing and adapter corrections remains poorly understood\.
We introduce *GC-MoE* (Graph-Conditioned Mixture of Experts routing for ST-GNN forecasting), a modular framework that (i) pretrains multiple diverse ST-GNN experts, (ii) freezes them as an expert set, and (iii) learns an *input-aware, spatially contextualized routing mechanism* that assigns per-node expert weights using both static topology features and a dynamic pathway driven by temporally attended input signals with spatial message passing. We additionally study a lightweight graph-conditioned *output refinement* layer as an optional extension. We also evaluate node-adaptive ST-LoRA adapters\[[11](https://arxiv.org/html/2605.30486#bib.bib5),[21](https://arxiv.org/html/2605.30486#bib.bib7)\] as an optional add-on and report the results.
The main contributions of this work are:
- *Input-aware, spatially contextualized graph-conditioned routing.* We propose a dual-pathway router that fuses static topology descriptors with a dynamic representation computed via temporal attention over the input window and spatial neighbor message passing, enabling expert selection that adapts to current traffic conditions, not just static topology.
- *Frozen multi-architecture specialization with low trainable budget.* We combine diverse frozen pretrained experts through learned routing, training only ∼17K parameters while leveraging the representational capacity of the frozen expert set.
- *Optional lightweight output refinement.* We study a bounded graph-conditioned refinement layer that can further improve performance in some settings at negligible parameter cost.
- *Ablation analysis of lightweight extensions.* We evaluate the optional refinement module and use node-adaptive ST-LoRA adapters as a diagnostic ablation to study whether adapter-based expert modification complements routing.
The remainder of this paper is organized as follows. Section [II](https://arxiv.org/html/2605.30486#S2) reviews related work and Section [III](https://arxiv.org/html/2605.30486#S3) formalizes the problem setup. Section [IV](https://arxiv.org/html/2605.30486#S4) presents the proposed GC-MoE framework in detail. Section [V](https://arxiv.org/html/2605.30486#S5) describes the experimental setup, Section [VI](https://arxiv.org/html/2605.30486#S6) reports the empirical results, and Section [VII](https://arxiv.org/html/2605.30486#S7) discusses their implications, limitations, and relation to prior work. Finally, Section [VIII](https://arxiv.org/html/2605.30486#S8) concludes the paper and outlines future directions.
## II Related Work
### II-A Spatio-Temporal Graph Forecasting
ST-GNNs jointly model spatial dependencies among sensors and temporal dynamics for tasks such as traffic speed and flow prediction. Foundational approaches include DCRNN\[[17](https://arxiv.org/html/2605.30486#bib.bib1)\], which couples diffusion convolutions with gated recurrent units; STGCN\[[25](https://arxiv.org/html/2605.30486#bib.bib3)\], which replaces recurrence with purely convolutional temporal blocks interleaved with graph convolutions; and Graph WaveNet\[[24](https://arxiv.org/html/2605.30486#bib.bib2)\], which introduces adaptive adjacency learning and dilated causal convolutions. Spectral variants such as StemGNN\[[2](https://arxiv.org/html/2605.30486#bib.bib4)\] jointly apply graph Fourier and discrete Fourier transforms. Adaptive graph models such as AGCRN\[[1](https://arxiv.org/html/2605.30486#bib.bib13)\] learn node-specific recurrent dynamics via adaptive graph convolution. Recent work by Ghaffari et al.\[[8](https://arxiv.org/html/2605.30486#bib.bib12)\] has further emphasized the importance of the full spatio-temporal prediction pipeline, from spatial mapping and graph construction to model training and evaluation, showing that graph design and preprocessing choices significantly impact downstream forecasting performance.
Recent architectures further improve performance and scalability. PDFormer\[[14](https://arxiv.org/html/2605.30486#bib.bib17)\] models propagation delay patterns with delay-aware spatial attention; STAEformer\[[18](https://arxiv.org/html/2605.30486#bib.bib18)\] shows that transformers with spatio-temporal adaptive embeddings can be highly competitive; BigST\[[10](https://arxiv.org/html/2605.30486#bib.bib19)\] targets large-scale road networks with linear complexity; and UniST\[[26](https://arxiv.org/html/2605.30486#bib.bib20)\] studies prompt-based universal urban spatio-temporal prediction.
Despite this rapid progress, many strong ST\-GNN and transformer\-based forecasting models still apply a single trained backbone uniformly to every node, overlooking the heterogeneous dynamics arising from differences in topology, road function, and connectivity across the network\. This work addresses this limitation by conditioning per\-node expert routing on the graph structure and the recent traffic input window\.
### II-B Mixture of Experts and Routing
MoE models combine the outputs of multiple specialist sub-networks through a learned gating function that produces data-dependent mixture weights. The foundational MoE framework was introduced by Jacobs et al.\[[12](https://arxiv.org/html/2605.30486#bib.bib15)\], where a gating network learns to partition the input space among experts. Shazeer et al.\[[22](https://arxiv.org/html/2605.30486#bib.bib16)\] scaled this paradigm with a sparsely-gated MoE layer, demonstrating significant capacity gains in language modeling while relying on additional load-balancing losses to prevent expert collapse. Switch Transformers\[[7](https://arxiv.org/html/2605.30486#bib.bib11)\] simplified sparse routing to a top-1 expert selection per token, achieving efficient scaling to trillion-parameter models. Zhou et al.\[[28](https://arxiv.org/html/2605.30486#bib.bib29)\] inverted the routing direction with expert-choice selection, where each expert independently selects its top-$k$ inputs, achieving natural load balance without auxiliary losses.
Recent years have seen a surge of large-scale MoE designs that refine routing and expert specialization. Mixtral\[[13](https://arxiv.org/html/2605.30486#bib.bib23)\] demonstrates a practical open-weight sparse MoE architecture in which each token is routed to 2 of 8 experts, achieving performance competitive with much larger dense models while activating only a fraction of parameters per forward pass. DeepSeekMoE\[[3](https://arxiv.org/html/2605.30486#bib.bib24)\] shows that fine-grained expert segmentation combined with shared expert isolation improves expert specialization, enabling a more nuanced division of knowledge between experts. Branch-Train-MiX (BTX)\[[23](https://arxiv.org/html/2605.30486#bib.bib25)\] trains expert LLMs independently on different data domains and subsequently merges them into a unified MoE with lightweight routing, demonstrating that independently trained (or frozen) experts can be effectively combined, a paradigm conceptually close to our approach of routing over frozen pretrained backbones. On the routing-mechanism front, Puigcerver et al.\[[20](https://arxiv.org/html/2605.30486#bib.bib22)\] propose Soft MoE, which replaces discrete token-to-expert assignment with a fully differentiable soft assignment via learned slot projections, avoiding the load-balancing and training instability issues inherent in complex sparse routing.
In the spatio-temporal domain, TESTAM\[[16](https://arxiv.org/html/2605.30486#bib.bib21)\] is the work most closely related to ours. It introduces a time-enhanced spatio-temporal attention model with a mixture of experts for traffic forecasting, where different experts specialize in different temporal traffic patterns (e.g., recurring vs. non-recurring congestion). However, TESTAM differs from GC-MoE in several key aspects. TESTAM is an end-to-end trained MoE architecture, whereas GC-MoE studies a frozen-expert regime in which independently pretrained and architecturally heterogeneous ST-GNN backbones are kept fixed, and only a lightweight graph-conditioned router is trained. Moreover, GC-MoE explicitly conditions per-node routing on hand-crafted graph topology descriptors together with spatially propagated dynamic traffic context.
Compared with prior MoE\-based traffic forecasting models, GC\-MoE specifically focuses on graph\-conditioned, per\-node soft routing over heterogeneous frozen expert architectures, a setting that is distinct from end\-to\-end MoE training over jointly optimized experts\. GC\-MoE fills this gap with a dual\-pathway router that fuses static topology features with a dynamic, spatially propagated traffic context representation to produce per\-node expert mixture weights\.
### II-C Parameter-Efficient Fine-Tuning and LoRA
PEFT methods adapt large pretrained models by updating only a small subset of parameters while keeping the backbone frozen. LoRA\[[11](https://arxiv.org/html/2605.30486#bib.bib5)\] injects trainable low-rank residual matrices into frozen weight matrices, enabling adaptation without increasing inference latency. QLoRA\[[4](https://arxiv.org/html/2605.30486#bib.bib26)\] further reduces memory requirements by combining 4-bit quantization with LoRA adapters, enabling the fine-tuning of massive models on limited hardware. DoRA\[[19](https://arxiv.org/html/2605.30486#bib.bib27)\] decomposes pretrained weights into magnitude and direction components and applies low-rank adaptation only to the directional component, improving learning capacity and stability over standard LoRA. In the spatio-temporal domain, ST-LoRA\[[21](https://arxiv.org/html/2605.30486#bib.bib7)\] extends LoRA with node-adaptive low-rank modules that account for spatial heterogeneity across the sensor graph, achieving competitive forecasting performance with small trainable budgets. Budget allocation approaches such as AdaLoRA\[[27](https://arxiv.org/html/2605.30486#bib.bib6)\] adaptively distribute rank budgets across weight matrices based on importance scores.
More recently, the interaction between PEFT adapters and MoE routing has received increasing attention. LoRAMoE\[[6](https://arxiv.org/html/2605.30486#bib.bib28)\] investigates combining LoRA adapters with MoE-style routing in large language models and finds that naive integration can cause world-knowledge forgetting and training conflicts, requiring careful architectural design to preserve pretrained capabilities. This observation is directly relevant to GC-MoE, where routing is learned over frozen expert backbones. Motivated by this, we include node-adaptive ST-LoRA adapters as an ablation within our frozen multi-expert setting to test whether lightweight expert adaptation complements or interferes with graph-conditioned routing. The resulting findings are reported in Section [VI](https://arxiv.org/html/2605.30486#S6).
## III Problem Setup
Let $\mathcal{G}=(\mathcal{V},\mathcal{E},A)$ be a weighted sensor graph with $|\mathcal{V}|=N$ nodes and adjacency matrix $A\in\mathbb{R}^{N\times N}$. We denote by $\tilde{A}$ a normalized version of $A$ used for one-hop message passing (e.g., row-normalized or symmetrically normalized with self-loops).
Given a historical input window
$$X\in\mathbb{R}^{B\times T\times N\times D},$$
where $B$ is the batch size, $T$ is the history length, $N$ is the number of nodes, and $D$ is the input feature dimension, the goal is to predict the next $H$ steps
$$Y\in\mathbb{R}^{B\times H\times N\times D_{\text{out}}},$$
where $H$ is the forecasting horizon and $D_{\text{out}}$ is the output feature dimension.
We construct a set of $E$ pretrained expert forecasters $\{f^{(e)}\}_{e=1}^{E}$, each producing an expert prediction
$$\hat{Y}^{(e)}=f^{(e)}(X),\qquad\hat{Y}^{(e)}\in\mathbb{R}^{B\times H\times N\times D_{\text{out}}}.$$
For each sample $b$ and node $n$, a router outputs mixture weights $w_{n}^{(b,e)}\geq 0$ such that $\sum_{e=1}^{E}w_{n}^{(b,e)}=1$. The mixture prediction is computed node-wise as
$$\hat{Y}_{b,:,n,:}^{\text{mix}}=\sum_{e=1}^{E}w_{n}^{(b,e)}\,\hat{Y}_{b,:,n,:}^{(e)}.\tag{1}$$
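The node-wise mixture in (1) can be sketched in a few lines of NumPy. This is only an illustration under assumed conventions: the function name `mix_expert_predictions` and the stacking of expert outputs along a leading axis are choices made here, not details taken from the paper's implementation.

```python
import numpy as np

def mix_expert_predictions(expert_preds, weights):
    """Node-wise mixture of expert forecasts, following Eq. (1).

    expert_preds: shape (E, B, H, N, D_out), one forecast per expert
                  (stacking along axis 0 is an assumption of this sketch).
    weights:      shape (B, N, E), non-negative, summing to 1 over E.
    Returns an array of shape (B, H, N, D_out).
    """
    # Sum over the expert axis e. The weight depends on (b, n) only,
    # so it is shared across horizon steps h and output channels d.
    return np.einsum("bne,ebhnd->bhnd", weights, expert_preds)

rng = np.random.default_rng(0)
E, B, H, N, D_out = 3, 2, 12, 5, 1
expert_preds = rng.normal(size=(E, B, H, N, D_out))
weights = rng.random(size=(B, N, E))
weights /= weights.sum(axis=-1, keepdims=True)  # normalize onto the simplex
y_mix = mix_expert_predictions(expert_preds, weights)
```

Because the weights form a convex combination, each mixed value lies between the smallest and largest expert prediction for that entry, and one-hot weights recover a single expert exactly.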
## IV Methodology
GC-MoE consists of two core components: (i) frozen pretrained spatio-temporal GNN experts and (ii) a dual-pathway graph-conditioned router that outputs per-node mixture weights from static topology and dynamic traffic context. In addition, we study a lightweight, bounded graph-conditioned output refinement module as an optional correction layer on top of the mixed prediction, and node-adaptive ST-LoRA adapters as an ablation-only extension. Figure [1](https://arxiv.org/html/2605.30486#S4.F1) illustrates the full design space; the core GC-MoE model used in the main results includes only frozen experts and the graph-conditioned router, while refinement is evaluated separately in ablation. The framework supports an arbitrary number of experts and is not tied to the specific three backbones used in our experiments.
Figure 1: Overview of the GC-MoE design space. Given a historical window $X\in\mathbb{R}^{B\times T\times N\times D}$ and adjacency $A\in\mathbb{R}^{N\times N}$, $E$ frozen spatio-temporal GNN experts produce $\{\hat{Y}^{(e)}\}_{e=1}^{E}$. A dual-pathway graph-conditioned router combines static topology descriptors $\mathbf{s}_{n}$ with an input-aware dynamic representation to output per-node mixture weights $w_{n}^{(b,e)}$. The core GC-MoE model uses the router-based node-wise mixture $\hat{Y}^{\text{mix}}$ as its prediction. An optional bounded graph-conditioned refinement module, shown on the right, can be added as a lightweight correction layer and is evaluated separately in ablation.
### IV-A Frozen Expert Backbones
We use $E=3$ diverse ST-GNN backbones, namely STGCN\[[25](https://arxiv.org/html/2605.30486#bib.bib3)\], GWNet\[[24](https://arxiv.org/html/2605.30486#bib.bib2)\], and AGCRN\[[1](https://arxiv.org/html/2605.30486#bib.bib13)\]. Each expert is pretrained to convergence on the target dataset and then frozen during MoE training. Expert parameters are not updated; only routing and, when enabled, refinement parameters are trained.
### IV-B Dual-Pathway Graph-Conditioned Router
##### Static topology descriptor
For each node $n$, we compute a normalized topology vector $\mathbf{s}_{n}\in\mathbb{R}^{d_{s}}$ with $d_{s}=9$:
$$\mathbf{s}_{n}=\big[\deg,\mathrm{close},\mathrm{clust},\mathrm{PR},\mathrm{btw},\mathrm{kcore},\mathrm{eig},u_{2},u_{3}\big]_{n}\in[0,1]^{9},\tag{2}$$
where the entries correspond to (normalized) degree, closeness centrality, clustering coefficient, PageRank, betweenness centrality, k-core number, eigenvector centrality, and the node-wise entries of the second (Fiedler) and third Laplacian eigenvectors. The 9 topology features were chosen to cover local connectivity, centrality, and spectral position.
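As a partial illustration of (2), the sketch below computes three of the nine descriptors (degree and the second and third Laplacian eigenvectors) and rescales each to $[0,1]$. Several points are assumptions made here because the text does not specify them: min-max scaling as the normalization, the unweighted combinatorial Laplacian, and the helper names `minmax` and `partial_topology_features`. The remaining centrality features would typically come from a graph library and are omitted.

```python
import numpy as np

def minmax(v):
    """Rescale a vector to [0, 1]; constant vectors map to zeros."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def partial_topology_features(A):
    """Return an (N, 3) array: normalized degree, u_2 (Fiedler), u_3.

    A is an (N, N) adjacency matrix. It is symmetrized and binarized
    here, which is an assumption of this sketch.
    """
    A_sym = ((A + A.T) > 0).astype(float)
    np.fill_diagonal(A_sym, 0.0)          # ignore self-loops
    deg = A_sym.sum(axis=1)
    L = np.diag(deg) - A_sym              # combinatorial Laplacian
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    u2, u3 = vecs[:, 1], vecs[:, 2]       # signs are arbitrary
    return np.stack([minmax(deg), minmax(u2), minmax(u3)], axis=1)

# Toy example: a path graph on 5 nodes.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
S_partial = partial_topology_features(A)
```

Eigenvector signs are not unique, so the min-max rescaled spectral features are only defined up to a reflection ($x \mapsto 1-x$); a real pipeline would need a sign convention to make them reproducible.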
##### Static representation with neighborhood smoothing
Let $\mathrm{Proj}(\cdot)$ be a learnable linear map to $\mathbb{R}^{d_{r}}$ and let $\mathbf{e}_{n}\in\mathbb{R}^{d_{r}}$ be a learnable node embedding. The initial static embedding is
$$\mathbf{r}_{n}^{\text{static}}=\mathrm{Proj}(\mathbf{s}_{n})+\mathbf{e}_{n},\qquad\mathbf{r}_{n}^{\text{static}}\in\mathbb{R}^{d_{r}}.\tag{3}$$
Because both $\mathrm{Proj}(\cdot)$ and $\mathbf{e}_{n}$ are learnable, the static pathway remains trainable even though it is built from fixed topology descriptors. We then apply one-hop neighborhood smoothing with a learnable scalar gate $\gamma\in(0,1)$:
$$\mathbf{r}_{n}^{\text{static}}\leftarrow\mathbf{r}_{n}^{\text{static}}+\gamma\,(\tilde{A}\mathbf{r}^{\text{static}})_{n}.\tag{4}$$
The raw topology descriptors are fixed node-wise summaries. Although several encode global graph position, they do not directly aggregate the structural context of neighboring nodes. We therefore apply one-hop smoothing to contextualize each node's representation. Although the topology descriptors and adjacency are fixed, the projection, node embeddings, and smoothing gate are learnable; therefore, the smoothed static representation is recomputed during each forward pass using the current parameters. In practice, this branch is lightweight and adds negligible cost relative to expert inference.
Concretely, $(\tilde{A}\mathbf{r}^{\text{static}})_{n}=\sum_{j\in\mathcal{N}(n)}\tilde{A}_{nj}\,\mathbf{r}_{j}^{\text{static}}$ aggregates the embeddings of all nodes adjacent to $n$, weighted by normalized edge strengths. The update in ([4](https://arxiv.org/html/2605.30486#S4.E4)) then blends this neighborhood summary back into $\mathbf{r}_{n}^{\text{static}}$ with a learnable scalar gate $\gamma=\sigma(\hat{\gamma})$, where $\hat{\gamma}$ is an unconstrained parameter initialized at $0$ (so $\gamma\approx 0.5$ at the start of training).
The gate $\gamma$ controls the trade-off between *self-reliance* (routing based solely on the node's own topology) and *contextual awareness* (routing informed by the structural character of the surrounding subgraph). If neighbors of $n$ are predominantly high-centrality bottleneck nodes, the smoothed embedding shifts toward that regime, causing the router to treat $n$ as part of a critical corridor rather than an isolated hub. Conversely, if $n$ is a high-degree node surrounded by peripheral sensors, the neighborhood signal pulls the embedding away from the hub prototype, enabling finer-grained expert specialization.
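The static pathway in (3) and (4) amounts to one projection, one addition, and one gated matrix product. The sketch below is a forward-pass illustration only; the parameter names (`W_proj`, `b_proj`, `node_emb`, `gamma_hat`) and the inclusion of a bias in the projection are assumptions, and the random values stand in for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def static_representation(S, W_proj, b_proj, node_emb, A_norm, gamma_hat):
    """Static router pathway.

    S:         (N, d_s) fixed topology descriptors
    W_proj:    (d_s, d_r), b_proj: (d_r,)  linear projection (bias assumed)
    node_emb:  (N, d_r) learnable node embeddings
    A_norm:    (N, N) normalized adjacency
    gamma_hat: unconstrained scalar; gamma = sigmoid(gamma_hat)
    """
    r = S @ W_proj + b_proj + node_emb   # Eq. (3)
    gamma = sigmoid(gamma_hat)           # gate constrained to (0, 1)
    return r + gamma * (A_norm @ r)      # Eq. (4): gated one-hop smoothing

rng = np.random.default_rng(0)
N, d_s, d_r = 4, 9, 8
S = rng.random((N, d_s))
W_proj = rng.normal(size=(d_s, d_r))
b_proj = np.zeros(d_r)
node_emb = rng.normal(size=(N, d_r))
A_norm = np.full((N, N), 1.0 / N)        # toy row-normalized adjacency
r_static = static_representation(S, W_proj, b_proj, node_emb, A_norm, gamma_hat=0.0)
```

With `gamma_hat=0.0` the gate equals 0.5, matching the initialization described above, so each node starts by adding half of its neighborhood average to its own embedding.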
##### Dynamic pathway (temporal attention + spatial propagation)
For each sample $b$ and node $n$, we compute a temporally attended dynamic representation:
$$\mathbf{h}_{n}^{(b)}=\sum_{t=1}^{T}\alpha_{t}^{(b,n)}\,g(X_{b,t,n,:}),\qquad\sum_{t=1}^{T}\alpha_{t}^{(b,n)}=1,\tag{5}$$
where $g(\cdot)$ is a learnable feature map to $\mathbb{R}^{d_{r}}$ and $\alpha_{t}^{(b,n)}$ are attention weights over the $T$ history steps. We then inject the 1-hop neighbor context:
$$\mathbf{h}_{n}^{(b)}\leftarrow\mathbf{h}_{n}^{(b)}+(\tilde{A}\mathbf{h}^{(b)})_{n},\tag{6}$$
where $\mathbf{h}^{(b)}\in\mathbb{R}^{N\times d_{r}}$ stacks $\{\mathbf{h}_{n}^{(b)}\}_{n=1}^{N}$.
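A minimal sketch of (5) and (6) follows. The text does not say how the attention weights $\alpha_{t}^{(b,n)}$ are scored, so this example assumes a linear feature map `W_g` for $g(\cdot)$ and a single learned query vector `q` whose dot product with each time step's features gives the attention logit. Both are hypothetical stand-ins, not the paper's design.

```python
import numpy as np

def softmax(x, axis):
    z = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_representation(X, W_g, q, A_norm):
    """Dynamic router pathway.

    X:      (B, T, N, D) input window
    W_g:    (D, d_r) linear feature map standing in for g(.)
    q:      (d_r,) query vector used to score time steps (assumed)
    A_norm: (N, N) normalized adjacency
    Returns h of shape (B, N, d_r) and alpha of shape (B, T, N).
    """
    G = X @ W_g                                   # (B, T, N, d_r)
    scores = G @ q                                # (B, T, N)
    alpha = softmax(scores, axis=1)               # normalize over T
    h = (alpha[..., None] * G).sum(axis=1)        # Eq. (5)
    h = h + np.einsum("nm,bmd->bnd", A_norm, h)   # Eq. (6)
    return h, alpha

rng = np.random.default_rng(0)
B, T, N, D, d_r = 2, 12, 4, 2, 8
X = rng.normal(size=(B, T, N, D))
W_g = rng.normal(size=(D, d_r))
q = rng.normal(size=d_r)
A_norm = np.full((N, N), 1.0 / N)
h, alpha = dynamic_representation(X, W_g, q, A_norm)
```

Note that (6) has no gate, unlike the static smoothing step, so the neighbor aggregate is added at full strength.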
##### Scalar fusion gate
The router fuses static and dynamic representations using a scalar gate $\lambda_{n}^{(b)}\in(0,1)$:
$$\lambda_{n}^{(b)}=\sigma\!\Big(\mathrm{MLP}\big([\mathbf{r}_{n}^{\text{static}}\,\|\,\mathbf{h}_{n}^{(b)}]\big)\Big),\tag{7}$$
where $\sigma(\cdot)$ denotes the logistic sigmoid. The fused router embedding is then
$$\mathbf{r}_{n}^{(b)}=\lambda_{n}^{(b)}\,\mathbf{r}_{n}^{\text{static}}+\big(1-\lambda_{n}^{(b)}\big)\,\mathbf{h}_{n}^{(b)}.\tag{8}$$
Both $\mathbf{r}_{n}^{\text{static}}$ and $\mathbf{h}_{n}^{(b)}$ lie in $\mathbb{R}^{d_{r}}$, hence the fused representation $\mathbf{r}_{n}^{(b)}\in\mathbb{R}^{d_{r}}$.
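The fusion in (7) and (8) is a per-node, per-sample convex combination of the two pathways. In the sketch below the gate MLP is assumed to have one hidden ReLU layer with weights `W_a`, `b_a`, `w_b`, `b_b`; the text only says "MLP", so this architecture and these names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(r_static, h, W_a, b_a, w_b, b_b):
    """Scalar-gated fusion of static and dynamic embeddings.

    r_static: (N, d_r), shared across the batch
    h:        (B, N, d_r)
    W_a: (2*d_r, d_g), b_a: (d_g,), w_b: (d_g,), b_b: scalar
          (one-hidden-layer gate MLP, an assumption of this sketch)
    Returns fused (B, N, d_r) and lam (B, N).
    """
    B = h.shape[0]
    r_b = np.broadcast_to(r_static, (B,) + r_static.shape)
    z = np.concatenate([r_b, h], axis=-1)             # [r_static || h]
    hidden = np.maximum(z @ W_a + b_a, 0.0)           # ReLU hidden layer
    lam = sigmoid(hidden @ w_b + b_b)                 # Eq. (7)
    fused = lam[..., None] * r_b + (1.0 - lam[..., None]) * h   # Eq. (8)
    return fused, lam

rng = np.random.default_rng(0)
B, N, d_r, d_g = 2, 4, 8, 16
r_static = rng.normal(size=(N, d_r))
h = rng.normal(size=(B, N, d_r))
W_a = rng.normal(size=(2 * d_r, d_g))
b_a = np.zeros(d_g)
w_b = rng.normal(size=d_g)
fused, lam = fuse(r_static, h, W_a, b_a, w_b, 0.0)
```

Since $\lambda_{n}^{(b)}$ is a scalar, every coordinate of the fused embedding lies between the corresponding static and dynamic coordinates.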
##### Routing logits and weights
The dynamic routing head outputs logits
$$\mathbf{l}_{n}^{(b,\text{dyn})}=W_{2}\,\mathrm{ReLU}(W_{1}\mathbf{r}_{n}^{(b)}),\qquad\mathbf{l}_{n}^{(b,\text{dyn})}\in\mathbb{R}^{E}.\tag{9}$$
Here $\mathrm{ReLU}(x)=\max(x,0)$, $W_{1}\in\mathbb{R}^{d_{h}\times d_{r}}$ and $W_{2}\in\mathbb{R}^{E\times d_{h}}$ are learnable matrices (with hidden size $d_{h}$), and $\mathbf{l}_{n}^{(b,\text{dyn})}$ denotes pre-softmax routing logits. The final logits include a learnable per-node routing bias table $\boldsymbol{\theta}_{n}\in\mathbb{R}^{E}$:
$$\mathbf{l}_{n}^{(b)}=\boldsymbol{\theta}_{n}+\mathbf{l}_{n}^{(b,\text{dyn})}.\tag{10}$$
Let $l_{n}^{(b,e)}$ denote the $e$-th component of $\mathbf{l}_{n}^{(b)}$; we use a fixed softmax temperature $\tau=1.0$:
$$w_{n}^{(b,e)}=\frac{\exp(l_{n}^{(b,e)}/\tau)}{\sum_{e'=1}^{E}\exp(l_{n}^{(b,e')}/\tau)},\qquad\tau=1.0.\tag{11}$$
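Equations (9) to (11) translate directly into a two-layer head followed by a temperature softmax. The sketch below follows the stated shapes for $W_{1}$, $W_{2}$, and $\boldsymbol{\theta}$; the function name `routing_weights` and the max-subtraction for numerical stability are additions made here.

```python
import numpy as np

def routing_weights(r, W1, W2, theta, tau=1.0):
    """Per-node expert weights from fused router embeddings.

    r:     (B, N, d_r) fused embeddings
    W1:    (d_h, d_r), W2: (E, d_h)   routing head, Eq. (9)
    theta: (N, E) per-node bias table, Eq. (10)
    tau:   softmax temperature, Eq. (11)
    Returns weights of shape (B, N, E) that sum to 1 over E.
    """
    hidden = np.maximum(r @ W1.T, 0.0)      # ReLU(W1 r), shape (B, N, d_h)
    logits = theta + hidden @ W2.T          # theta broadcasts over the batch
    logits = logits / tau
    logits = logits - logits.max(axis=-1, keepdims=True)  # stability only
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
B, N, d_r, d_h, E = 2, 4, 8, 16, 3
r = rng.normal(size=(B, N, d_r))
W1 = rng.normal(size=(d_h, d_r))
W2 = rng.normal(size=(E, d_h))
theta = np.zeros((N, E))
w = routing_weights(r, W1, W2, theta)
```

The bias table $\boldsymbol{\theta}_{n}$ gives every node an input-independent prior over experts; with a zero routing head the weights reduce to a softmax of $\boldsymbol{\theta}_{n}$ alone.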
##### Load-balancing loss
To discourage router collapse, we use the standard load-balancing objective:
$$\mathcal{L}_{\text{balance}}=E\sum_{e=1}^{E}f_{e}\,P_{e},\tag{12}$$
where, averaged over the batch,
$$\begin{aligned}f_{e}&=\frac{1}{BN}\sum_{b=1}^{B}\sum_{n=1}^{N}\mathbf{1}\!\left[\arg\max_{e'}w_{n}^{(b,e')}=e\right],\\P_{e}&=\frac{1}{BN}\sum_{b=1}^{B}\sum_{n=1}^{N}w_{n}^{(b,e)}.\end{aligned}\tag{13}$$
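The load-balancing objective in (12) and (13) can be checked on small hand-made routing tensors. The sketch below is a plain NumPy evaluation of the loss value; the function name `load_balance_loss` is chosen here, and in an autodiff framework gradients would flow only through $P_{e}$ because the argmax in $f_{e}$ is not differentiable.

```python
import numpy as np

def load_balance_loss(w):
    """Load-balancing loss of Eqs. (12)-(13).

    w: (B, N, E) routing weights.
    f_e is the fraction of (sample, node) pairs whose top-1 expert is e;
    P_e is the mean routing weight assigned to expert e.
    """
    E = w.shape[-1]
    top1 = w.argmax(axis=-1).ravel()
    f = np.bincount(top1, minlength=E) / top1.size
    P = w.reshape(-1, E).mean(axis=0)
    return E * float(np.sum(f * P))

# Collapsed routing: every node sends all weight to expert 0.
collapsed = np.zeros((1, 6, 3))
collapsed[..., 0] = 1.0

# Balanced one-hot routing: the three experts each receive two nodes.
balanced = np.eye(3)[[0, 1, 2, 0, 1, 2]][None]

loss_collapsed = load_balance_loss(collapsed)
loss_balanced = load_balance_loss(balanced)
```

For $E=3$, the collapsed case evaluates to 3 and the balanced case to 1, which illustrates why minimizing this term pushes expert usage toward an even split at the population level.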
### IV-C Optional Graph-Conditioned Bounded Output Refinement
Let $\hat{Y}^{\text{mix}}$ denote the mixture prediction from ([1](https://arxiv.org/html/2605.30486#S3.E1)). As an optional extension, we consider a lightweight refinement module that predicts a bounded, graph-smoothed affine correction on top of the router-based mixture output. For each node $n$, we form a context vector
$$\mathbf{c}_{n}=[\mathbf{s}_{n}\,\|\,\bar{\mathbf{w}}_{n}]\in\mathbb{R}^{d_{s}+E},\tag{14}$$
where $\bar{\mathbf{w}}_{n}\in\mathbb{R}^{E}$ is the average routing weight vector for node $n$ across the current batch, i.e., $\bar{\mathbf{w}}_{n}=\frac{1}{B}\sum_{b=1}^{B}\mathbf{w}_{n}^{(b)}$ and $\mathbf{w}_{n}^{(b)}=[w_{n}^{(b,1)},\dots,w_{n}^{(b,E)}]$.
Two small networks produce per-node scale and bias (both bounded via $\tanh$):
$$m_{n}=\beta\,\tanh(\text{scale\_net}(\mathbf{c}_{n})),\tag{15}$$
$$b_{n}=\beta\,\tanh(\text{bias\_net}(\mathbf{c}_{n})),\tag{16}$$
where $m_{n},b_{n}\in(-\beta,\beta)$ are broadcast across $(H,D_{\text{out}})$ for node $n$ and $\beta\in(0,1)$ is a hyperparameter bounding the maximum refinement. In our experiments, we set $\beta=0.3$. The bounded affine correction is
$$\delta=\hat{Y}^{\text{mix}}\odot m+b,\tag{17}$$
where $m$ and $b$ denote the node-wise scale and bias tensors broadcast to the shape of $\hat{Y}^{\text{mix}}$. We then apply one-hop graph smoothing to the correction:
$$\bar{\delta}=\tilde{A}\,\delta.\tag{18}$$
Finally, the refined prediction is
$$\hat{Y}=\hat{Y}^{\text{mix}}+\sigma(g)\,\bar{\delta},\tag{19}$$
where $g$ is a learnable startup gate initialized to a negative value so that $\sigma(g)\approx 0$ at early training stages.
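The refinement steps (14) to (19) can be traced end to end in a short forward pass. The sketch assumes `scale_net` and `bias_net` are single linear layers with weight vectors `W_scale` and `W_bias`; the text describes them only as "small networks", so this is a simplification, and the smoothing in (18) is read as acting on the node axis.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine(y_mix, S, w, A_norm, W_scale, W_bias, g, beta=0.3):
    """Bounded, graph-smoothed affine correction.

    y_mix:  (B, H, N, D_out) mixture prediction
    S:      (N, d_s) topology descriptors
    w:      (B, N, E) routing weights
    A_norm: (N, N) normalized adjacency
    W_scale, W_bias: (d_s + E,) linear stand-ins for scale_net / bias_net
    g:      scalar startup gate (negative at initialization)
    """
    c = np.concatenate([S, w.mean(axis=0)], axis=1)        # Eq. (14)
    m = beta * np.tanh(c @ W_scale)                        # Eq. (15), shape (N,)
    b = beta * np.tanh(c @ W_bias)                         # Eq. (16), shape (N,)
    delta = y_mix * m[None, None, :, None] + b[None, None, :, None]  # Eq. (17)
    delta_bar = np.einsum("nm,bhmd->bhnd", A_norm, delta)  # Eq. (18)
    return y_mix + sigmoid(g) * delta_bar                  # Eq. (19)

rng = np.random.default_rng(0)
B, H, N, D_out, d_s, E = 2, 12, 4, 1, 9, 3
y_mix = rng.normal(size=(B, H, N, D_out))
S = rng.random((N, d_s))
w = np.full((B, N, E), 1.0 / E)
A_norm = np.full((N, N), 1.0 / N)
W_scale = rng.normal(size=d_s + E)
W_bias = rng.normal(size=d_s + E)
y_refined = refine(y_mix, S, w, A_norm, W_scale, W_bias, g=-3.0)
```

With a row-stochastic $\tilde{A}$, the correction at any entry is bounded in magnitude by $\sigma(g)\,\beta\,(\max|\hat{Y}^{\text{mix}}|+1)$, so a negative initial gate keeps the refined output close to the mixture early in training.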
### IV-D Optional: Per-Expert Node-Adaptive ST-LoRA Adapters (Ablation Only)
We evaluate node-adaptive ST-LoRA adapters inspired by LoRA\[[11](https://arxiv.org/html/2605.30486#bib.bib5)\] and ST-LoRA\[[21](https://arxiv.org/html/2605.30486#bib.bib7)\] as an optional add-on. In our experiments, adapters *degraded* performance in the routed multi-expert setting (Section [VI](https://arxiv.org/html/2605.30486#S6)), suggesting that adapter corrections can interfere with routing. We therefore do not include adapters in the final model.
### IV-E Training Objective
The total objective is
$$\mathcal{L}=\mathcal{L}_{\text{MAE}}+\lambda_{1}\mathcal{L}_{\text{balance}}+\lambda_{2}\mathcal{L}_{\text{entropy}}.\tag{20}$$
We use masked MAE:
$$\mathcal{L}_{\text{MAE}}=\frac{1}{|\Omega|}\sum_{(b,h,n)\in\Omega}\left\|Y_{b,h,n,:}-\hat{Y}_{b,h,n,:}\right\|_{1},\tag{21}$$
where $\Omega$ indexes non-missing targets. The entropy term encourages confident (peaked) routing:
$$\mathcal{L}_{\text{entropy}}=\frac{1}{BN}\sum_{b=1}^{B}\sum_{n=1}^{N}\left(-\sum_{e=1}^{E}w_{n}^{(b,e)}\log w_{n}^{(b,e)}\right).\tag{22}$$
The entropy term mildly encourages sharper per-node expert preferences, while the load-balancing term prevents collapse to a single expert at the population level. In practice, we found that the combination stabilized training and yielded selective but non-degenerate routing. We selected $\lambda_{1}=0.01$ and $\lambda_{2}=0.5$ based on validation performance and stable expert utilization.
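The three terms of (20) can be evaluated together as in the sketch below. The boolean `mask` representation of $\Omega$, the small `eps` inside the logarithm, and the function names are assumptions made for this illustration; `load_balance_loss` is repeated so the block is self-contained.

```python
import numpy as np

def masked_mae(y_true, y_pred, mask):
    """Eq. (21). mask: boolean (B, H, N), True where the target is observed."""
    abs_err = np.abs(y_true - y_pred).sum(axis=-1)   # L1 norm over channels
    return float(abs_err[mask].sum() / max(int(mask.sum()), 1))

def routing_entropy(w, eps=1e-12):
    """Eq. (22): mean per-node entropy of the routing weights."""
    return float(np.mean(-np.sum(w * np.log(w + eps), axis=-1)))

def load_balance_loss(w):
    """Eqs. (12)-(13)."""
    E = w.shape[-1]
    top1 = w.argmax(axis=-1).ravel()
    f = np.bincount(top1, minlength=E) / top1.size
    P = w.reshape(-1, E).mean(axis=0)
    return E * float(np.sum(f * P))

def total_loss(y_true, y_pred, mask, w, lambda1=0.01, lambda2=0.5):
    """Eq. (20) with the reported lambda values as defaults."""
    return (masked_mae(y_true, y_pred, mask)
            + lambda1 * load_balance_loss(w)
            + lambda2 * routing_entropy(w))

B, H, N, E = 1, 2, 3, 3
y_true = np.zeros((B, H, N, 1))
y_pred = np.ones((B, H, N, 1))
mask = np.ones((B, H, N), dtype=bool)
w_uniform = np.full((B, N, E), 1.0 / E)
loss = total_loss(y_true, y_pred, mask, w_uniform)
```

Uniform routing has the maximum entropy $\log E$, so the entropy penalty is largest there and shrinks toward zero as each node's weights concentrate on one expert.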
### IV-F Trainable Parameters
In the core GC-MoE model, only the router parameters (including the fusion-gate MLP and the per-node bias table $\{\boldsymbol{\theta}_{n}\}$), node embeddings $\{\mathbf{e}_{n}\}$, and related routing parameters are trainable (approximately 17K parameters). Expert backbones are frozen and are not updated during training. When the optional refinement module is enabled, the trainable parameter count increases slightly to approximately 18K.
## V Experiments
### V-A Datasets and Setup
We evaluate the GC-MoE framework on four widely used traffic forecasting benchmarks summarized in Table [I](https://arxiv.org/html/2605.30486#S5.T1). PEMS04 and PEMS07 are traffic flow datasets collected by the California Department of Transportation (Caltrans) Performance Measurement System [[9](https://arxiv.org/html/2605.30486#bib.bib30)], aggregated at 5-minute intervals. METR-LA and PEMS-BAY are traffic speed datasets collected from loop detectors in Los Angeles County and the San Francisco Bay Area [[17](https://arxiv.org/html/2605.30486#bib.bib1)], respectively, also at 5-minute resolution. For all datasets, the input features consist of the traffic measurement and a time-of-day indicator, and the output is a single-channel forecast. We use the standard chronological splits: 6:2:2 (train/val/test) for PEMS04 and PEMS07, and 7:1:2 for METR-LA and PEMS-BAY, following [[17](https://arxiv.org/html/2605.30486#bib.bib1), [24](https://arxiv.org/html/2605.30486#bib.bib2)].

TABLE I: Dataset statistics. All datasets use 5-minute aggregation intervals.

Following common practice [[17](https://arxiv.org/html/2605.30486#bib.bib1), [24](https://arxiv.org/html/2605.30486#bib.bib2), [25](https://arxiv.org/html/2605.30486#bib.bib3)], we use a history window of $T=12$ steps to predict the next $H=12$ steps. Adjacency matrices are constructed from road-network distances as provided by the benchmark datasets and normalized using the double-transition scheme [[24](https://arxiv.org/html/2605.30486#bib.bib2)]. We report three standard metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE), all averaged over the 12-step horizon (lower is better for all metrics).
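The data preparation described above can be sketched as follows. This is a minimal illustration, not the released pipeline: the helper names are hypothetical, the split is applied to windows rather than to raw timestamps (benchmark code differs on this point), and the double-transition normalization is assumed to mean the forward and backward random-walk matrices $D^{-1}A$ and $D'^{-1}A^{\top}$ as commonly used with Graph WaveNet.

```python
import numpy as np

def sliding_windows(series, t_in=12, h_out=12):
    """Build (history, target) pairs from an array of shape (T_total, N, C)."""
    n = series.shape[0] - t_in - h_out + 1
    x = np.stack([series[i:i + t_in] for i in range(n)])
    y = np.stack([series[i + t_in:i + t_in + h_out] for i in range(n)])
    return x, y  # targets keep all channels here; the forecast channel is selected later

def chronological_split(x, y, ratios=(0.6, 0.2, 0.2)):
    """Split windows in time order; use (0.7, 0.1, 0.2) for METR-LA and PEMS-BAY."""
    n = len(x)
    n_train = int(round(n * ratios[0]))
    n_val = int(round(n * ratios[1]))
    train = (x[:n_train], y[:n_train])
    val = (x[n_train:n_train + n_val], y[n_train:n_train + n_val])
    test = (x[n_train + n_val:], y[n_train + n_val:])
    return train, val, test

def double_transition(adj):
    """Return row-normalized forward (A) and backward (A^T) transition matrices."""
    def row_normalize(a):
        deg = a.sum(axis=1, keepdims=True)
        return a / np.where(deg > 0, deg, 1.0)  # isolated nodes keep all-zero rows
    return row_normalize(adj), row_normalize(adj.T)
```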
### V-B Expert Backbones
We instantiate $E=3$ diverse frozen experts, each representing a different family of ST-GNN architectures:
- Graph WaveNet (GWNet) [[24](https://arxiv.org/html/2605.30486#bib.bib2)]: diffusion convolutions with adaptive adjacency learning and dilated causal temporal convolutions ($\sim$303K parameters on PEMS04).
- STGCN [[25](https://arxiv.org/html/2605.30486#bib.bib3)]: spectral graph convolutions (Chebyshev polynomials) interleaved with 1-D convolutional temporal blocks ($\sim$298K parameters on PEMS04).
- AGCRN [[1](https://arxiv.org/html/2605.30486#bib.bib13)]: adaptive graph convolution with node-specific GRU-based recurrent dynamics ($\sim$903K parameters on PEMS04).

Each expert is independently pretrained to convergence on the target dataset using Adam [[15](https://arxiv.org/html/2605.30486#bib.bib31)] with a learning rate of $10^{-3}$, weight decay of $5\times 10^{-4}$, batch size 64, and early stopping with patience 15 (maximum 200 epochs). After pretraining, all expert parameters are frozen and are not updated during MoE training.
### V-C GC-MoE Training
During MoE training, only the router and (optionally) the output refinement module are optimized using Adam with a learning rate of $10^{-3}$, weight decay of $10^{-5}$, gradient clipping at norm 5, batch size 64, and early stopping with patience 15. The router embedding dimension is 32, and the routing MLP hidden size is 32. We use a fixed softmax temperature $\tau=1.0$. The output refinement hidden dimension is 32 with a bound of $\pm 0.3$. Training is conducted on Tesla P100 16 GB GPUs.
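Two of the training mechanics mentioned above, gradient clipping at norm 5 and early stopping with patience 15, can be sketched in framework-agnostic form. The names below are hypothetical, the clipping is assumed to act on the global norm over all trainable tensors, and the monitored quantity is assumed to be a validation metric where lower is better (the text does not specify which one).

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = float(np.sqrt(sum(float((g ** 2).sum()) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total

class EarlyStopping:
    """Stop when the validation metric has not improved for `patience` epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_metric):
        """Record one epoch; return True when training should stop."""
        if val_metric < self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once per epoch after validation, and the router checkpoint with the best validation value would be the one kept for testing.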
### V-D Baselines and Ablations
We compare GC-MoE (frozen experts + graph-conditioned router) against the following configurations:
- Single experts: Each frozen expert evaluated independently (GWNet, STGCN, AGCRN).
- Ens-Avg: Uniform averaging of the three frozen expert outputs with zero learned parameters, serving as a strong non-parametric baseline.
- Component ablations (on PEMS04): Router only (core GC-MoE), router + adapters, router + refinement, router + refinement + adapters, and MoE partial fine-tune (experts and router unfrozen; LoRA base weights inside adapter blocks remain frozen by design).
- Router ablations (on PEMS04): Dense MLP (no graph features), Switch-style top-1 routing [[7](https://arxiv.org/html/2605.30486#bib.bib11)], expert-choice routing [[28](https://arxiv.org/html/2605.30486#bib.bib29)], hash routing (parameter-free), and the proposed graph-conditioned router.

Ablation studies are conducted on PEMS04, which we use as a representative benchmark for controlled analysis because its scale is typical of traffic forecasting experiments and keeps the computational cost of testing multiple routing and adaptation variants manageable. The final GC-MoE core model is then evaluated on all four benchmarks.
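The difference between the Ens-Avg baseline and a routed mixture can be made concrete with a short NumPy sketch. The tensor layouts and function names are assumptions for illustration: expert predictions are stacked along a leading expert axis, and the router is represented only by its output weights.

```python
import numpy as np

def ens_avg(expert_preds):
    """Uniform average over experts; expert_preds has shape (E, B, H, N, C)."""
    return expert_preds.mean(axis=0)

def routed_mixture(expert_preds, w):
    """Per-node weighted combination; w has shape (B, N, E) and sums to 1 over E."""
    # Each node n in sample b gets its own convex combination of the E forecasts,
    # shared across the H horizon steps and C channels.
    return np.einsum("ebhnc,bne->bhnc", expert_preds, w)
```

With uniform weights of $1/E$ the routed mixture reduces to Ens-Avg, so the baseline is a special case of the routed model and any gain comes from the weights varying across nodes and inputs.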
## VI Results
### VI-A Main Forecasting Performance
TABLE II: Performance comparison across four traffic forecasting benchmarks (lower is better). Bold denotes the best result for each metric and dataset. Ens-Avg is a uniform average of three frozen experts (zero learned parameters). GC-MoE uses the graph-conditioned router with frozen experts ($\sim$17K trainable parameters).

Table [II](https://arxiv.org/html/2605.30486#S6.T2) reports the main forecasting results across PEMS04, METR-LA, PEMS-BAY, and PEMS07. GC-MoE improves MAE over all single backbones and over the zero-parameter ensemble baseline across all four benchmarks, while RMSE and MAPE remain competitive and are best on several datasets. Improvements over Ens-Avg are modest but consistent, demonstrating that node-dependent, input-aware routing provides additional gains beyond simple averaging of diverse experts. The main comparison reports the core GC-MoE model for consistency across all four datasets; the optional refinement module was evaluated separately via an ablation study on PEMS04. On PEMS04, GC-MoE reduces MAE from 18.889 (Ens-Avg) to 18.741, and further to 18.723 with refinement. On METR-LA and PEMS-BAY, GC-MoE achieves the best MAE among all methods while maintaining competitive RMSE and MAPE.
### VI-B Component Ablation and Adapter Analysis
TABLE III: Ablation study on PEMS04 (MAE; lower is better). “Router only” trains only the graph-conditioned router ($\sim$17K params). “+ Refinement” adds the bounded output refinement layer. “MoE partial fine-tune” unfreezes the experts and the router but keeps the LoRA base weights inside the ST-LoRA adapter blocks frozen (1,563K trainable out of 1,813K total). ST-LoRA adapters degrade performance.

Table [III](https://arxiv.org/html/2605.30486#S6.T3) presents the contribution of each component on PEMS04. The following patterns emerge:
#### VI-B1 Router effectiveness
Training only the graph-conditioned router already improves over Ens-Avg, confirming that topology- and context-aware weighting is beneficial even without refinement.
#### VI-B2 Output refinement gains
Adding the bounded graph-conditioned refinement layer yields an additional improvement (18.723 MAE), demonstrating that lightweight affine correction with graph smoothing can further reduce residual errors.
#### VI-B3 Negative result on ST-LoRA adapters
Introducing node-adaptive ST-LoRA adapters degrades performance when combined with routing and refinement. This suggests a conflict between adapter-based expert modification and routing-based expert specialization. Consequently, adapters are excluded from the final model.

In addition, in our PEMS04 experiments, MoE partial fine-tune did not outperform the lightweight routed configuration, reinforcing the effectiveness of frozen-expert routing.
### VI-C Router Architecture Comparison
TABLE IV: Router architecture comparison on PEMS04. All variants use the same frozen experts and output refinement; only the router differs. Expert-choice routing diverged and is omitted.

Table [IV](https://arxiv.org/html/2605.30486#S6.T4) compares alternative routing strategies on PEMS04. The proposed graph-conditioned router achieves the best MAE, RMSE, and MAPE among learned routing mechanisms. We evaluated five routing strategies; four are reported in Table [IV](https://arxiv.org/html/2605.30486#S6.T4), while expert-choice routing diverged during training and is omitted from the table.
#### VI-C1 GC-MoE Router (Ours)
Combines a static pathway (nine topological features smoothed over 1-hop neighborhoods, plus a learnable per-node bias) with a dynamic pathway that applies temporal attention followed by adjacency-based spatial propagation. A scalar gate fuses both pathways before softmax routing.
#### VI-C2 Dense MLP
A two-layer MLP maps a learned node embedding to expert weights without any graph features, serving as a non-graph-aware baseline.
#### VI-C3 Switch (top-1)
Inspired by Switch Transformers [[7](https://arxiv.org/html/2605.30486#bib.bib11)], this router selects exactly one expert per node via a hard top-1 with a straight-through estimator, using node embeddings as input.
#### VI-C4 Expert Choice (top-$k$)
Following the paradigm of [[28](https://arxiv.org/html/2605.30486#bib.bib29)], each expert independently scores all nodes and selects its top-$k$, allowing variable expert assignment per node.
#### VI-C5 Hash
A parameter-free deterministic assignment ($\text{node\_id} \bmod E$) that tests whether learned routing is beneficial at all.

Dense MLP routing (no graph features) performs worse than the proposed router, indicating that static topology descriptors and spatial propagation are important. Sparse top-1 (Switch-style) routing underperforms soft routing, suggesting that soft mixtures better capture gradual specialization across nodes. Deterministic hash routing also underperforms, indicating that learned, topology-aware routing contributes to the gains.
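To make the contrast between the routing rules concrete, the sketch below turns per-node logits into expert weights under soft, hard top-1, and hash routing. It covers the forward pass only: the straight-through estimator used for top-1 affects gradients and is not reproduced here, expert-choice routing is left out, and the function names and the source of the logits are assumptions for illustration.

```python
import numpy as np

def soft_routing(logits, tau=1.0):
    """Softmax over experts with temperature tau; logits has shape (..., E)."""
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def top1_routing(logits):
    """Hard assignment: one-hot weights at the highest-scoring expert."""
    idx = logits.argmax(axis=-1)
    return np.eye(logits.shape[-1])[idx]

def hash_routing(num_nodes, num_experts):
    """Parameter-free assignment: node n always uses expert n mod E."""
    idx = np.arange(num_nodes) % num_experts
    return np.eye(num_experts)[idx]
```

Soft routing keeps every expert in play with graded weights, whereas the other two commit each node to exactly one expert, which is the distinction the comparison above probes.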
### VI-D Parameter Efficiency
TABLE V: Parameter efficiency on PEMS04.

Table [V](https://arxiv.org/html/2605.30486#S6.T5) compares parameter efficiency on PEMS04. GC-MoE trains only $\sim$17K parameters for routing, or $\sim$18K with refinement, while freezing approximately 1.5M expert parameters. It achieves the best MAE among the compared configurations, and MoE partial fine-tuning does not improve over GC-MoE with refinement. The efficiency claim refers to trainable parameters during adaptation, not to expert pretraining or inference cost; since GC-MoE evaluates all frozen experts, its inference cost is closer to an ensemble than to a single backbone.
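The parameter budget can be checked with simple arithmetic from the approximate counts reported for PEMS04 (about 303K, 298K, and 903K parameters for the three experts). The snippet below is only a back-of-the-envelope calculation on those rounded figures; the helper name is hypothetical.

```python
# Approximate expert sizes on PEMS04, in thousands of parameters (rounded values).
experts_k = {"GWNet": 303, "STGCN": 298, "AGCRN": 903}
frozen_k = sum(experts_k.values())  # total frozen backbone weights, about 1.5M

def trainable_share(trainable_k, frozen_k):
    """Fraction of all parameters that receive gradient updates."""
    return trainable_k / (trainable_k + frozen_k)

router_share = trainable_share(17, frozen_k)    # core GC-MoE (router only)
refined_share = trainable_share(18, frozen_k)   # router plus output refinement
print(f"frozen: {frozen_k}K, router: {router_share:.2%}, with refinement: {refined_share:.2%}")
```

Both shares come out slightly above 1% of the total, which illustrates how small the adapted portion is relative to the frozen backbones.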
## VII Discussion
The results suggest that the main benefit of GC-MoE does not come from increasing backbone capacity, but from selectively composing the complementary strengths of diverse frozen experts at the node level. Compared with uniform averaging, the gains are consistent across all four benchmark datasets. This indicates that the improvement is attributable to graph-conditioned, input-aware routing rather than to additional expert capacity.

On PEMS07, the largest of the four benchmarks, the single frozen experts exhibit noticeably larger performance disparities than on the other datasets, where expert metrics are comparatively closer. In this regime of higher expert heterogeneity, GC-MoE demonstrates clearer gains over both the individual experts and Ens-Avg, suggesting that learned routing is particularly beneficial when expert strengths are less uniform and the gap between experts is larger. In contrast, when experts are already closely matched, as on some of the smaller benchmarks, the room for improvement beyond uniform averaging is naturally smaller.
From the perspective of related work, the results position GC\-MoE between classical ensembling and fully trained MoE systems\. Unlike standard ensembles\[[5](https://arxiv.org/html/2605.30486#bib.bib14)\], GC\-MoE does not assign the same weights to all nodes, and unlike typical MoE formulations\[[12](https://arxiv.org/html/2605.30486#bib.bib15),[22](https://arxiv.org/html/2605.30486#bib.bib16),[7](https://arxiv.org/html/2605.30486#bib.bib11)\], it operates over heterogeneous frozen spatio\-temporal experts rather than jointly trained experts of a shared architecture\. The strong performance of the proposed router relative to dense MLP, sparse top\-1, and deterministic hash routing further suggests that explicit graph\-topological conditioning and soft node\-wise mixtures are important design choices in spatio\-temporal forecasting\.
The adapter ablation also provides an informative result\. While parameter\-efficient adaptation is often beneficial in single\-model settings\[[11](https://arxiv.org/html/2605.30486#bib.bib5),[21](https://arxiv.org/html/2605.30486#bib.bib7)\], our results suggest that, in a routed multi\-expert framework, node\-adaptive adapter corrections may interfere with, rather than complement, expert specialization\. This observation appears consistent with prior work showing that the design of LoRA\- and MoE\-style combinations can strongly affect performance\[[6](https://arxiv.org/html/2605.30486#bib.bib28)\], and suggests that parameter\-efficient adaptation is not necessarily complementary to routing in our frozen multi\-expert setting\.
The present study has several limitations. First, the current expert pool contains only three backbones, all from the ST-GNN family; incorporating more diverse experts may reveal stronger specialization effects. Second, the static routing pathway relies on hand-crafted topology descriptors. Although these features are inexpensive and interpretable, they may not fully capture richer structural characteristics that could be learned through more expressive positional or structural encodings. Finally, our experiments are limited to traffic forecasting benchmarks. Although the proposed framework is general at the level of graph-conditioned expert routing, different application domains may exhibit different graph semantics, temporal dynamics, and informative structural descriptors, so the effectiveness of the current design beyond traffic data remains to be validated.
Overall, the findings support the view that lightweight graph\-conditioned routing is a practical way to exploit complementary frozen spatio\-temporal experts, especially when expert behaviors differ meaningfully, while also revealing important constraints on the interaction between routing and parameter\-efficient adaptation\.
## VIII Conclusion and Future Work
We presented GC-MoE, a graph-conditioned mixture of experts framework for spatio-temporal traffic forecasting that routes each node to a personalized combination of frozen pretrained ST-GNN experts. The core of the approach is a dual-pathway router that fuses static topology descriptors, smoothed over one-hop neighborhoods, with a dynamic, temporally attended and spatially propagated representation of the current traffic state, producing per-node soft expert mixture weights. We additionally studied a lightweight bounded-output refinement module as an optional extension.

Experiments on four standard benchmark datasets (PEMS04, METR-LA, PEMS-BAY, and PEMS07) show that GC-MoE improves MAE over every individual expert and over a zero-parameter ensemble baseline, while MAPE is competitive and often improved, and the model trains only $\sim$17–18K parameters on top of $\sim$1.5M frozen backbone weights (roughly 1% of total parameters). In addition, our ablation study indicates that a bounded output refinement layer can provide further gains, whereas node-adaptive ST-LoRA adapters were not beneficial in the routed multi-expert setting.

Several directions remain for future work. First, the current framework uses three expert architectures; incorporating a more diverse pool of experts may further increase complementary coverage across different traffic regimes. Second, our router operates at the node level with a shared set of global experts; a *hierarchical* routing scheme that first clusters nodes into traffic regions and then routes within each region could improve scalability to very large networks while capturing meso-scale spatial structure. Third, extending the static pathway with *learnable* structural or positional encodings derived from random-walk kernels may capture richer graph-structural information. Fourth, investigating *cross-city transfer*, training the router on one city and deploying it on another with a different topology but similar traffic dynamics, would test the generalizability of graph-conditioned routing and move toward foundation-model paradigms for urban spatio-temporal prediction [[26](https://arxiv.org/html/2605.30486#bib.bib20)]. Fifth, an important next step is to evaluate whether the same routing principle remains effective on other spatio-temporal graph learning tasks and datasets with different graph semantics and temporal dynamics. Finally, a deeper analysis of routing dynamics over time (e.g., how expert preferences shift during peak vs. off-peak hours or incident conditions) could yield interpretable insights into the complementary roles learned by different ST-GNN architectures.
## References
- [1] L. Bai, L. Yao, C. Li, X. Wang, and C. Wang (2020) Adaptive graph convolutional recurrent network for traffic forecasting. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 17804–17815.
- [2] D. Cao, Y. Wang, J. Duan, C. Zhang, X. Zhu, C. Huang, Y. Tong, B. Xu, J. Bai, J. Tong, et al. (2020) Spectral temporal graph neural network for multivariate time-series forecasting. Advances in Neural Information Processing Systems 33, pp. 17766–17778.
- [3] D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, et al. (2024) DeepSeekMoE: towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1280–1297.
- [4] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer (2023) QLoRA: efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 36, pp. 10088–10115.
- [5] T. G. Dietterich (2000) Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pp. 1–15.
- [6] S. Dou, E. Zhou, Y. Liu, S. Gao, W. Shen, L. Xiong, Y. Zhou, X. Wang, Z. Xi, X. Fan, et al. (2024) LoRAMoE: alleviating world knowledge forgetting in large language models via MoE-style plugin. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1932–1945.
- [7] W. Fedus, B. Zoph, and N. Shazeer (2022) Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23 (120), pp. 1–39.
- [8] A. Ghaffari, H. Nguyen, L. Lovén, and E. Gilman (2025) STM-Graph: a Python framework for spatio-temporal mapping and graph neural network predictions. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, pp. 6377–6381.
- [9] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan (2019) Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 922–929.
- [10] J. Han, W. Zhang, H. Liu, T. Tao, N. Tan, and H. Xiong (2024) BigST: linear complexity spatio-temporal graph neural network for traffic forecasting on large-scale road networks. Proceedings of the VLDB Endowment 17 (5), pp. 1081–1090.
- [11] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. (2022) LoRA: low-rank adaptation of large language models. ICLR 1 (2), pp. 3.
- [12] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991) Adaptive mixtures of local experts. Neural Computation 3 (1), pp. 79–87.
- [13] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al. (2024) Mixtral of experts. arXiv preprint arXiv:2401.04088.
- [14] J. Jiang, C. Han, W. X. Zhao, and J. Wang (2023) PDFormer: propagation delay-aware dynamic long-range transformer for traffic flow prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, pp. 4365–4373.
- [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [16] H. Lee and S. Ko (2024) TESTAM: a time-enhanced spatio-temporal attention model with mixture of experts. arXiv preprint arXiv:2403.02600.
- [17] Y. Li, R. Yu, C. Shahabi, and Y. Liu (2017) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.
- [18] H. Liu, Z. Dong, R. Jiang, J. Deng, J. Deng, Q. Chen, and X. Song (2023) Spatio-temporal adaptive embedding makes vanilla transformer SOTA for traffic forecasting. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 4125–4129.
- [19] S. Liu, C. Wang, H. Yin, P. Molchanov, Y. F. Wang, K. Cheng, and M. Chen (2024) DoRA: weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning.
- [20] J. Puigcerver, C. Riquelme Ruiz, B. Mustafa, and N. Houlsby (2024) From sparse to soft mixtures of experts. In International Conference on Learning Representations, Vol. 2024, pp. 28435–28445.
- [21] W. Ruan, W. Chen, X. Dang, J. Zhou, W. Li, X. Liu, and Y. Liang (2025) ST-LoRA: low-rank adaptation for spatio-temporal forecasting. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 345–361.
- [22] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
- [23] S. Sukhbaatar, O. Golovneva, V. Sharma, H. Xu, X. V. Lin, B. Rozière, J. Kahn, D. Li, W. Yih, J. Weston, et al. (2024) Branch-train-mix: mixing expert LLMs into a mixture-of-experts LLM. arXiv preprint arXiv:2403.07816.
- [24] Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang (2019) Graph WaveNet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121.
- [25] B. Yu, H. Yin, and Z. Zhu (2017) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875.
- [26] Y. Yuan, J. Ding, J. Feng, D. Jin, and Y. Li (2024) UniST: a prompt-empowered universal model for urban spatio-temporal prediction. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 4095–4106.
- [27] Q. Zhang, M. Chen, A. Bukharin, N. Karampatziakis, P. He, Y. Cheng, W. Chen, and T. Zhao (2023) AdaLoRA: adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512.
- [28] Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. M. Dai, Q. V. Le, J. Laudon, et al. (2022) Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 35, pp. 7103–7114.