EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction

arXiv cs.AI 06/02/26, 04:00 AM Papers
Summary
EnergyMamba proposes a novel spatiotemporal framework combining a graph-enhanced selective state space model and adaptive conformalized quantile regression for accurate and reliable energy consumption prediction with uncertainty estimates, achieving improvements on real-world datasets from Florida, New York, and California.
arXiv:2606.00506v1 Announce Type: new
Cached at: 06/02/26, 03:47 PM
# EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction
Source: [https://arxiv.org/html/2606.00506](https://arxiv.org/html/2606.00506)
Rongchao Xu (Florida State University, Tallahassee, Florida, USA, rx21a@fsu.edu), Lin Jiang (Florida State University, Tallahassee, Florida, USA, lin.jiang@fsu.edu), and Guang Wang (Florida State University, Tallahassee, Florida, USA, guang@cs.fsu.edu)

\(2026\)

###### Abstract\.

Energy consumption prediction is essential for efficient grid management, demand\-side optimization, and sustainable energy planning\. Although advanced machine learning methods have been employed for better prediction performance, existing works have two key limitations: \(1\) they usually formulate this task as a purely time\-series prediction problem without explicitly modeling the spatial dependencies among different regions, and \(2\) they fail to provide reliable predictions with uncertainty estimates under abnormal situations such as extreme weather events\. To advance existing research, we propose EnergyMamba, an uncertainty\-aware spatiotemporal learning framework for accurate and reliable energy consumption prediction, which comprises two key components: \(i\) a novel Graph\-Enhanced Selective State Space Model \(GE\-Mamba\) that injects spatial context learned from the grid topology into the temporal dynamics, enabling coupled spatiotemporal modeling, and \(ii\) an Adaptive Sequential Conformalized Quantile Regression \(AS\-CQR\) module, which includes locally adaptive normalization and an online feedback mechanism to dynamically calibrate prediction intervals under potential distribution shifts\. We evaluate EnergyMamba on four large\-scale real\-world datasets from Florida, New York, and California\. Results show EnergyMamba achieves around 5% improvement in prediction accuracy and 6% improvement in uncertainty quantification over 15 state\-of\-the\-art baselines\.

Keywords: Uncertainty Quantification, State Space Model, Energy Consumption Prediction

Published in: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’26), August 09–13, 2026, Jeju Island, Republic of Korea. Copyright: cc. DOI: 10.1145/3770855.3818841. ISBN: 979-8-4007-2259-2/2026/08. CCS concepts: Information systems → Spatial-temporal systems; Information systems → Data mining.

## 1\.Introduction

Energy consumption prediction has attracted substantial attention from both industry and academia due to its critical role in supporting a wide range of practical applications with significant societal impacts. Accurate and reliable energy consumption prediction is essential for efficient grid management (Jha et al., [2021](https://arxiv.org/html/2606.00506#bib.bib23)), demand-side optimization (Binbusayyis and Sha, [2025](https://arxiv.org/html/2606.00506#bib.bib36)), and sustainable energy planning (Peteleaza et al., [2024](https://arxiv.org/html/2606.00506#bib.bib37)). It also contributes to emergency preparedness (Zhong and Sun, [2010](https://arxiv.org/html/2606.00506#bib.bib22)) by helping stakeholders anticipate potential surges in energy consumption during extreme weather events, ensuring the stability and resilience of the power grid.

Driven by its practical significance, energy consumption prediction has been studied using a variety of approaches, including traditional statistical models (Pun et al., [2019](https://arxiv.org/html/2606.00506#bib.bib25); Alghamdi et al., [2019](https://arxiv.org/html/2606.00506#bib.bib24)), classical machine learning methods (Deng et al., [2026](https://arxiv.org/html/2606.00506#bib.bib60); Lim et al., [2021](https://arxiv.org/html/2606.00506#bib.bib8); Li et al., [2022](https://arxiv.org/html/2606.00506#bib.bib1); Piccialli et al., [2020](https://arxiv.org/html/2606.00506#bib.bib5)), general deep learning methods (Bouktif et al., [2020](https://arxiv.org/html/2606.00506#bib.bib31); Yu et al., [2025](https://arxiv.org/html/2606.00506#bib.bib21), [2026b](https://arxiv.org/html/2606.00506#bib.bib48)), Transformers (Alexandrov et al., [2020](https://arxiv.org/html/2606.00506#bib.bib15); Zeng et al., [2023](https://arxiv.org/html/2606.00506#bib.bib12)), and diffusion models (Jiang et al., [2025a](https://arxiv.org/html/2606.00506#bib.bib43), [b](https://arxiv.org/html/2606.00506#bib.bib44)). More recently, foundation models (Tu et al., [2024](https://arxiv.org/html/2606.00506#bib.bib29)) and large language models (LLMs) (Li et al., [2024a](https://arxiv.org/html/2606.00506#bib.bib53); Zhang et al., [2026](https://arxiv.org/html/2606.00506#bib.bib61); Liang et al., [2025](https://arxiv.org/html/2606.00506#bib.bib32)) have been explored to model complex energy systems. However, existing work has two key limitations. First, most existing work formulates energy consumption prediction as a purely time-series forecasting problem, neglecting the intrinsic spatial dependencies among different regions or grid zones that could provide critical contextual information for accurate prediction. 
Second, state\-of\-the\-art approaches fail to continuously provide reliable predictions with uncertainty estimates, especially under abnormal situations such as extreme weather events, leading to overconfident and potentially misleading predictions\.

In this work, we aim to develop an uncertainty\-aware spatiotemporal framework for energy consumption prediction that explicitly models spatial dependencies while providing reliable uncertainty estimates\. Nevertheless, there are two key challenges to achieving this\. First, energy consumption patterns exhibit strong correlations with geographical factors, temporal factors, and building characteristics \(e\.g\., commercial vs\. residential\)\. These characteristics lead to highly heterogeneous consumption patterns across regions and time periods, making it difficult for existing models to effectively capture fine\-grained spatiotemporal dependencies\. Second, energy consumption behaviors and demand distributions may shift dramatically during extreme events, which makes it challenging to continuously provide reliable predictions and uncertainty estimates\.

To address the above challenges and advance existing research, we propose EnergyMamba, an uncertainty-aware spatiotemporal learning framework for accurate and reliable energy consumption prediction. First, to address the spatiotemporal heterogeneity challenge, we design the Graph-Enhanced Selective State Space Model (GE-Mamba), a novel architecture that injects spatial context learned from grid topology into a bidirectional Mamba, organized within a U-Net structure. GE-Mamba leverages Graph Convolutional Networks to extract spatial context from the grid topology and injects this context directly into an efficient Selective State Space Model for temporal dynamics modeling. This design is motivated by Kirchhoff’s circuit laws: consumption changes at one node propagate to connected nodes through power flow equations, so conditioning temporal dynamics on spatial context enables more accurate modeling. Second, to address the uncertainty quantification challenge, we propose Adaptive Sequential Conformalized Quantile Regression (AS-CQR). Unlike standard conformal prediction, which assumes data exchangeability, AS-CQR is designed for non-stationary time series. It includes (i) a locally adaptive nonconformity measure that normalizes residuals by the predicted interval width, ensuring scale-invariant calibration across different regions, and (ii) an online feedback mechanism that dynamically adjusts the target quantile level based on recent coverage performance, enabling rapid adaptation to distribution shifts.

The key contributions of this paper are as follows:

- Conceptually, different from existing works that treat energy consumption prediction as a purely time-series prediction problem, we formulate it as an uncertainty-aware spatiotemporal prediction task with explicit spatial dependency modeling and reliable uncertainty quantification.
- Technically, we propose EnergyMamba, an uncertainty-aware spatiotemporal learning framework comprising: (i) GE-Mamba, a novel architecture that injects spatial context learned from grid topology into a bidirectional Mamba, organized within a U-Net structure; and (ii) AS-CQR, a distribution-free uncertainty quantification method with locally adaptive normalization and online feedback calibration for reliable prediction under non-stationary conditions.
- Empirically, we evaluate EnergyMamba on four real-world datasets from Florida, New York, and California by comparing it with 15 state-of-the-art baselines across five metrics. Extensive results demonstrate that EnergyMamba achieves around 5% higher prediction accuracy and 6% better uncertainty quantification than the best baseline. Our implementation is available at [https://github.com/UFOdestiny/EnergyMamba](https://github.com/UFOdestiny/EnergyMamba).

## 2\.Data Analysis and Motivation

In this section, we conduct a data-driven analysis to highlight key findings that motivate our design. Detailed descriptions of the data collection, preprocessing, and management procedures are provided in Appendix [A](https://arxiv.org/html/2606.00506#A1).

### 2\.1\.Data Description

In this project, we are collaborating with a municipal utility provider in Florida\. We have access to the utility data collected from over 60,000 smart meters in Leon County, FL\. It provides us with household\-level energy consumption time series at a 30\-minute interval\. The raw data are aggregated at the Census Block Group \(CBG\) level to prevent the disclosure of personally identifiable information and safeguard user privacy\.

### 2\.2\.Data\-driven Insights

#### 2\.2\.1\.Spatial autocorrelation and local heterogeneity\.

We employ Moran’s I statistics (Moran, [1950](https://arxiv.org/html/2606.00506#bib.bib38)) to perform both global and local spatial autocorrelation analyses of energy consumption at the CBG level. Figure [1(a)](https://arxiv.org/html/2606.00506#S2.F1.sf1) visualizes the relationship between a CBG’s own state and that of its neighboring CBGs, in which $z_{i}$ represents the standardized mean load of the $i$-th CBG, while the spatial lag ($Wz_{i}$) denotes the weighted average load of its neighboring CBGs computed using a row-normalized spatial weight matrix $W$. The plot divides the data into four quadrants reflecting distinct local spatial correlations: (1) High-High and Low-Low, indicating spatial clusters where a CBG shares similar patterns with its neighbors (positive correlation); and (2) High-Low and Low-High, highlighting spatial dispersion where a CBG deviates from its surroundings (negative correlation). The fitted slope of the Moran’s I plot in Figure [1(a)](https://arxiv.org/html/2606.00506#S2.F1.sf1) is 0.429 ($p<0.001$), indicating statistically significant positive spatial autocorrelation. From a physical systems perspective, this spatial dependence is not incidental but arises from fundamental circuit constraints. Specifically, Kirchhoff’s Current Law enforces nodal power balance, implying that a local load perturbation at a CBG must be redistributed through adjacent links. This redistribution propagates along the network topology, inducing correlated load deviations at electrically connected neighbors.
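The Moran scatter-plot slope described above can be reproduced on a toy graph. The following is a minimal sketch, assuming a binary neighbor matrix in which every region has at least one neighbor; the function name `morans_i` and the four-region example are illustrative and not taken from the paper's data or code.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I, computed as the slope of the spatial lag Wz on z."""
    z = (values - values.mean()) / values.std()        # standardized loads z_i
    w = weights / weights.sum(axis=1, keepdims=True)   # row-normalized weight matrix W
    lag = w @ z                                        # spatial lag (Wz)_i
    slope = float(z @ lag / (z @ z))                   # least-squares slope through the origin
    return slope, z, lag

def quadrant(z_i, lag_i):
    """Label one region by the sign of its own value and of its neighbors' average."""
    own = "High" if z_i > 0 else "Low"
    nbr = "High" if lag_i > 0 else "Low"
    return f"{own}-{nbr}"

# Toy example (made-up numbers): four regions on a line, 0-1-2-3.
loads = np.array([1.0, 2.0, 3.0, 4.0])
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
moran, z, lag = morans_i(loads, adj)
labels = [quadrant(a, b) for a, b in zip(z, lag)]
```

Because the toy loads increase smoothly along the line, every region falls into a High-High or Low-Low quadrant and the slope is positive.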

![Refer to caption](https://arxiv.org/html/2606.00506v1/x1.png)\(a\)Moran’s I scatter plot\.
![Refer to caption](https://arxiv.org/html/2606.00506v1/x2.png)\(b\)LISA cluster map\.

Figure 1. Spatial autocorrelation. High-Low means a high-consumption CBG surrounded by low-usage neighbors. Two panels summarize spatial autocorrelation in the Florida data: a Moran’s I scatter plot on the left and a LISA cluster map on the right.

Beyond global correlation, Figure [1(b)](https://arxiv.org/html/2606.00506#S2.F1.sf2) projects these statistical associations onto the physical regions to reveal local heterogeneity. (High-High/Low-Low): High-High clusters align with the dense urban core, reflecting potential electrical coupling, while Low-Low clusters dominate the rural periphery with sparse connectivity. (High-Low/Low-High): The map also identifies spatial regions that defy the global trend. For instance, a High-Low node likely identifies a CBG containing concentrated commercial or industrial loads (e.g., a shopping complex or a hospital district) surrounded by lower-density residential neighborhoods. These topological mismatches challenge purely time-series models but provide meaningful signals for spatially aware models to distinguish between normal fluctuations and specific load types.

![Refer to caption](https://arxiv.org/html/2606.00506v1/x3.png)\(a\)Load–residual scaling\.
![Refer to caption](https://arxiv.org/html/2606.00506v1/x4.png)\(b\)Locally adaptive normalization\.

Figure 2. Uncertainty in energy consumption prediction. Two panels illustrate uncertainty behavior: load-residual scaling on the left and the effect of locally adaptive normalization on the right.

#### 2\.2\.2\.Heteroscedastic uncertainty scales with load magnitude\.

Figure [2(a)](https://arxiv.org/html/2606.00506#S2.F2.sf1) reveals a strong linear relationship ($r=0.94$) between the mean hourly load and the standard deviation of prediction residuals, which indicates heteroscedasticity, i.e., the variability of electricity consumption is not constant but proportional to the magnitude of usage; regions with higher base loads naturally exhibit larger absolute fluctuations and prediction errors. As illustrated in Figure [2(b)](https://arxiv.org/html/2606.00506#S2.F2.sf2), this scale dependence causes the raw absolute residual distributions to differ significantly across load tiers. By normalizing these residuals by their respective local standard deviations, the distributions across low, mid, and high-load groups collapse onto a nearly identical curve. This transformation stabilizes the variance, effectively converting the heteroscedastic errors into a scale-independent distribution suitable for standardized analysis.
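The variance-stabilizing effect of local normalization can be illustrated on synthetic data. The sketch below assumes three made-up load tiers whose residual spread is proportional to the mean load; none of the numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic assumption: residual spread is 10% of the tier's mean load.
mean_load = np.array([10.0, 50.0, 200.0])                  # low, mid, high tiers
residuals = rng.standard_normal((3, 5000)) * (0.1 * mean_load)[:, None]

raw_std = residuals.std(axis=1)                            # grows with the load level
normalized = residuals / raw_std[:, None]                  # divide by the local std
normalized_std = normalized.std(axis=1)                    # now the same for every tier
```

After the division, all tiers share unit spread, which is the property a scale-invariant nonconformity score relies on.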

These insights highlight two essential requirements for energy consumption modeling and prediction. First, strong spatial dependencies imply that prediction should account for interactions across neighboring regions. Second, the observed heteroscedasticity and non-stationarity indicate that uncertainty estimation must adapt to changing load levels and distribution shifts.

## 3\.Problem Formulation

### 3\.1\.Graph Construction

We model the energy system as a graph $\mathcal{G}=(\mathbf{V},\mathbf{E},\mathbf{A})$, where $\mathbf{V}=\{v_{1},\ldots,v_{N}\}$ denotes $N$ nodes (e.g., Census Block Groups or counties), $\mathbf{E}$ is the edge set constructed via spatial proximity to approximate physical coupling, and $\mathbf{A}\in\mathbb{R}^{N\times N}$ is the weighted adjacency matrix. We construct $\mathbf{A}$ using a Gaussian kernel with sparsity thresholding:

$$\mathbf{A}_{ij}=\begin{cases}\exp\!\left(-d_{ij}^{2}/\sigma^{2}\right),& i\neq j~\text{and}~\exp\!\left(-d_{ij}^{2}/\sigma^{2}\right)\geq\epsilon,\\[4pt] 0,&\text{otherwise},\end{cases}\tag{1}$$

where $d_{ij}$ is the centroid distance between nodes $v_{i}$ and $v_{j}$, $\sigma>0$ controls spatial decay, and $\epsilon\in[0,1]$ is a sparsity threshold that removes weak connections.
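Eq. (1) translates almost directly into array code. The following is a minimal sketch, assuming planar centroid coordinates and Euclidean distance (a real deployment would likely use geodesic distances); the function name `gaussian_adjacency` and the three example centroids are hypothetical.

```python
import numpy as np

def gaussian_adjacency(coords, sigma, eps):
    """Thresholded Gaussian-kernel adjacency following Eq. (1)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist_sq = (diff ** 2).sum(axis=-1)          # squared centroid distances d_ij^2
    adj = np.exp(-dist_sq / sigma ** 2)         # Gaussian kernel weights
    adj[adj < eps] = 0.0                        # drop weak connections
    np.fill_diagonal(adj, 0.0)                  # no self-edges (the i != j condition)
    return adj

# Three made-up centroids on a line at x = 0, 1, 3.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
A = gaussian_adjacency(coords, sigma=1.0, eps=0.1)
```

With these settings only the pair at distance 1 stays connected (weight $e^{-1}\approx 0.37$); the pairs at distances 2 and 3 fall below the threshold and are removed.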

### 3\.2\.Energy Consumption Prediction

Let $\mathbf{X}\in\mathbb{R}^{N\times T}$ denote the input matrix containing energy consumption readings of all $N$ nodes over $T$ historical time steps. Given the historical observations $\mathbf{X}$ and the adjacency matrix $\mathbf{A}$, the goal is to predict future consumption for the next $T_{\text{out}}$ time steps.

#### 3\.2\.1\.Point Prediction

Point prediction learns a mapping $f$ from inputs to deterministic predictions:

$$\{\mathbf{X},\mathbf{A}\}\xrightarrow{~f~}\hat{\mathbf{Y}}\in\mathbb{R}^{N\times T_{\text{out}}}.\tag{2}$$

#### 3\.2\.2\.Probabilistic Prediction with Uncertainty Quantification

In this work, we focus on probabilistic prediction, which quantifies uncertainty using prediction intervals. Given a target miscoverage rate $\alpha$ (e.g., $\alpha=0.1$ for 90% coverage), we learn a mapping function $\mathcal{F}$ to predict lower, upper, and median quantiles:

$$\{\mathbf{X},\mathbf{A}\}\xrightarrow{~\mathcal{F}~}\big[\hat{q}_{\text{lo}},\,\hat{q}_{\text{up}},\,\hat{q}_{\text{mi}}\big]\in\mathbb{R}^{N\times T_{\text{out}}},\tag{3}$$

where $\hat{q}_{\text{lo}}=\hat{q}_{\alpha/2}(\mathbf{X})$, $\hat{q}_{\text{up}}=\hat{q}_{1-\alpha/2}(\mathbf{X})$, and $\hat{q}_{\text{mi}}=\hat{q}_{0.5}(\mathbf{X})$ represent the lower bound, upper bound, and median of the prediction interval, respectively. The prediction interval $\hat{C}(\mathbf{X})=[\hat{q}_{\text{lo}},\hat{q}_{\text{up}}]$ should satisfy the coverage guarantee:

$$\mathbb{P}\left(Y\in\hat{C}(\mathbf{X})\right)\geq 1-\alpha,\tag{4}$$

where $Y\in\mathbb{R}$ is the ground truth value for an individual prediction target. The median $\hat{q}_{\text{mi}}$ serves as the point prediction $\hat{\mathbf{Y}}$.
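The guarantee in Eq. (4) is usually checked empirically by counting how often the truth lands inside the predicted interval. The sketch below shows one way to do that; the helper names `interval_coverage` and `mean_interval_width` and the five toy targets are assumptions for illustration, not the evaluation code or metrics of the paper.

```python
import numpy as np

def interval_coverage(y, q_lo, q_up):
    """Fraction of targets that fall inside [q_lo, q_up]."""
    inside = (y >= q_lo) & (y <= q_up)
    return float(inside.mean())

def mean_interval_width(q_lo, q_up):
    """Average width of the prediction intervals (narrower is sharper)."""
    return float((q_up - q_lo).mean())

# Made-up targets and intervals.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q_lo = np.array([0.0, 1.5, 3.5, 3.0, 4.0])
q_up = np.array([2.0, 2.5, 4.5, 5.0, 6.0])

alpha = 0.1
coverage = interval_coverage(y, q_lo, q_up)        # the third target is missed
width = mean_interval_width(q_lo, q_up)
under_covered = coverage < 1 - alpha               # would call for wider intervals
```

Coverage and width pull in opposite directions: intervals can always reach the target coverage by becoming very wide, so both quantities are typically reported together.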

![Refer to caption](https://arxiv.org/html/2606.00506v1/x5.png)

Figure 3. Framework of EnergyMamba, which consists of two key components: (i) Graph-Enhanced Selective State Space Model (GE-Mamba), and (ii) Adaptive Sequential Conformalized Quantile Regression (AS-CQR). The overview shows the GE-Mamba backbone and the AS-CQR calibration module and how they connect from input sequences to point and interval predictions.

## 4\.Methodology

The empirical insights derived in Section [2](https://arxiv.org/html/2606.00506#S2) directly inform the design of EnergyMamba, which aims to bridge data-driven learning with physical system principles. First, the spatial dependencies governed by Kirchhoff’s laws motivate (i) the Graph-Enhanced Selective State Space Model (GE-Mamba). Instead of treating spatial aggregation as generic feature extraction, we design the GCN module as a learnable neural surrogate for power flow equations, conditioning Mamba’s state transitions on physics-informed neighborhood context to simulate how local load perturbations propagate through the grid topology. Second, the load-dependent heteroscedasticity and non-stationarity motivate (ii) Adaptive Sequential Conformalized Quantile Regression (AS-CQR). Recognizing that energy systems exhibit both aleatoric uncertainty (scaling with load magnitude due to physical losses) and epistemic uncertainty (arising from distribution shifts during extreme events), AS-CQR employs scale-invariant normalization to handle the former and an online feedback loop to dynamically adapt to the latter.

### 4\.1\.Graph\-Enhanced Selective State Space Model

In energy systems, consumption changes at one substation propagate to connected nodes through power flow equations; ignoring this coupling leads to suboptimal temporal modeling\. We address this through a novel Graph\-Enhanced Mamba architecture that injects spatial context into the selective state space model \(SSM\), organized within a U\-Net encoder\-decoder structure that enables multi\-scale temporal feature extraction with skip connections for preserving fine\-grained patterns\.

#### 4\.1\.1\.Input Embedding

The input matrix $\mathbf{X}\in\mathbb{R}^{N\times T}$ contains energy consumption readings of $N$ nodes over $T$ time steps. Before feeding into the U-Net architecture, each node’s time series is projected into a latent space of dimension $D$. Specifically, each value $x_{n,t}\in\mathbb{R}$ is independently projected to a $D$-dimensional vector via a shared linear layer:

$$\mathbf{h}^{(0)}_{n,t}=\mathbf{W}_{\text{emb}}x_{n,t}+\mathbf{e}_{\text{pos},t},\tag{5}$$

where $\mathbf{W}_{\text{emb}}\in\mathbb{R}^{D\times 1}$ is a learnable projection that maps each scalar to a $D$-dimensional embedding, $\mathbf{e}_{\text{pos},t}\in\mathbb{R}^{D}$ is the learnable temporal position embedding at time step $t$, $D$ is the hidden dimension, and stacking all $\mathbf{h}^{(0)}_{n,t}$ yields $\mathbf{H}^{(0)}\in\mathbb{R}^{N\times T\times D}$ as the initial hidden representation for the first encoder stage.

#### 4\.1\.2\.U\-Net Architecture with GE\-Mamba Backbone

To capture multi-scale temporal patterns, including short-term fluctuations and long-term trends, we organize GE-Mamba blocks within a U-Net style encoder-decoder architecture. The model comprises $S$ encoder stages, a bottleneck, and $S$ decoder stages. Each encoder stage stacks $K$ GE-Mamba blocks followed by temporal downsampling, which halves the temporal resolution and doubles the hidden dimension to $D^{(s)}=2^{s-1}D$. The bottleneck operates on the most compressed representation, while the decoder symmetrically upsamples features using transposed convolutions and skip connections. These skip connections fuse coarse temporal patterns (e.g., weekly trends) with fine-grained variations (e.g., hourly peaks), preserving abnormal events such as sudden demand spikes.
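To make the multi-scale layout concrete, the tensor shapes per stage can be tabulated. This is a small sketch under two assumptions that the text does not state explicitly: odd temporal lengths are floored when halved, and the bottleneck operates on the output of the last downsampling step. The function name `unet_stage_shapes` and the example sizes are hypothetical.

```python
def unet_stage_shapes(seq_len, hidden_dim, num_stages):
    """Return (temporal length, hidden dimension) for encoder stages 1..S and the bottleneck."""
    stages = [
        (seq_len // 2 ** (s - 1), hidden_dim * 2 ** (s - 1))   # D^(s) = 2^(s-1) * D
        for s in range(1, num_stages + 1)
    ]
    # Assumption: the bottleneck sees the output of the final downsampling step.
    bottleneck = (seq_len // 2 ** num_stages, hidden_dim * 2 ** num_stages)
    return stages, bottleneck

# Example sizes chosen for illustration only.
stages, bottleneck = unet_stage_shapes(seq_len=48, hidden_dim=32, num_stages=2)
```

For a 48-step input with $D=32$ and $S=2$, the encoder works at resolutions 48 and 24, and the decoder mirrors these shapes on the way back up.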

#### 4\.1\.3\.GE\-Mamba Block

Each GE-Mamba block integrates spatial context extraction and temporal modeling with residual connections. The block takes hidden representations $\mathbf{H}^{(l-1)}$ from the previous layer and outputs updated representations $\mathbf{H}^{(l)}$.

Spatial Context Extraction. To extract spatial context from the grid topology, we employ a Graph Convolutional Network (GCN) that aggregates information from neighboring nodes. This design is motivated by Kirchhoff’s laws in electrical networks: the power balance at each node depends on flows from all connected edges (Quintela et al., [2009](https://arxiv.org/html/2606.00506#bib.bib35)). For each time step $t$, let $\mathbf{H}_{t}\in\mathbb{R}^{N\times D}$ denote the hidden representation. The spatial context $\mathbf{Z}_{t}\in\mathbb{R}^{N\times D}$ is computed as:

$$\mathbf{Z}_{t}=\text{GCN}(\mathbf{H}_{t},\mathbf{A})=\text{GELU}\left(\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{H}_{t}\mathbf{W}_{\text{gcn}}\right),\tag{6}$$

where $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}_{N}$ is the adjacency matrix augmented with self-loops, $\tilde{\mathbf{D}}\in\mathbb{R}^{N\times N}$ is the diagonal degree matrix with $\tilde{\mathbf{D}}_{ii}=\sum_{j}\tilde{\mathbf{A}}_{ij}$, $\mathbf{W}_{\text{gcn}}\in\mathbb{R}^{D\times D}$ is a learnable weight matrix, and $\text{GELU}(\cdot)$ is the Gaussian Error Linear Unit activation function. The GCN is applied independently at each time step, producing a spatial context sequence $\mathbf{Z}=[\mathbf{Z}_{1},\ldots,\mathbf{Z}_{T}]\in\mathbb{R}^{N\times T\times D}$ that is then fed into the SSM along with $\mathbf{H}^{(l-1)}$ for joint spatiotemporal modeling. The symmetric normalization $\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}$ ensures numerically stable message passing across nodes with varying degrees, which is important for power systems where hub substations connect to many downstream nodes while peripheral nodes have fewer connections.
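A single graph-convolution step of Eq. (6) can be sketched in a few lines. The code below is an illustrative NumPy version, not the released implementation: it uses the common tanh approximation of GELU, and the function names, toy adjacency, and random weights are assumptions.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU (an approximation, not the exact erf form)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def normalized_adjacency(adj):
    """Symmetric normalization D~^(-1/2) (A + I) D~^(-1/2)."""
    a_tilde = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]

def gcn_context(h, adj, w_gcn):
    """Apply Eq. (6) at every time step: h is (N, T, D), the result is (N, T, D)."""
    a_hat = normalized_adjacency(adj)
    # Mix over nodes (i <- j) and over channels (d -> e), independently for each t.
    return gelu(np.einsum("ij,jtd,de->ite", a_hat, h, w_gcn))

rng = np.random.default_rng(0)
N, T, D = 3, 4, 2
A = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.2],
              [0.0, 0.2, 0.0]])
H = rng.standard_normal((N, T, D))
W = rng.standard_normal((D, D))
Z = gcn_context(H, A, W)
```

When the graph has no edges, the normalized adjacency reduces to the identity and the layer degenerates to a per-node linear map followed by GELU, which gives a convenient sanity check.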

Graph-Enhanced Selective State Space Model. The Selective State Space Model (Mamba) (Gu and Dao, [2024](https://arxiv.org/html/2606.00506#bib.bib19)) provides an efficient mechanism for temporal sequence modeling with linear complexity $\mathcal{O}(T)$. The key innovation of our approach is to condition the SSM dynamics on spatial context, enabling the model to adapt its temporal processing based on the state of neighboring nodes. The SSM operates independently on each of the $D$ hidden channels. For a single channel, the continuous-time dynamics are defined by a latent state $\mathbf{s}(t)\in\mathbb{R}^{D_{s}}$, where $D_{s}$ is the state dimension:

$$\mathbf{s}^{\prime}(t)=\mathbf{A}_{\text{ssm}}\mathbf{s}(t)+\mathbf{B}x(t),\quad y(t)=\mathbf{C}\mathbf{s}(t),\tag{7}$$

where $\mathbf{A}_{\text{ssm}}\in\mathbb{R}^{D_{s}\times D_{s}}$ is the state transition matrix, $\mathbf{B}\in\mathbb{R}^{D_{s}\times 1}$ is the input projection matrix, $\mathbf{C}\in\mathbb{R}^{1\times D_{s}}$ is the output projection matrix, and $x(t)\in\mathbb{R}$ is the input for that channel. Since the continuous-time formulation in Eq. ([7](https://arxiv.org/html/2606.00506#S4.E7)) involves derivatives that are not directly applicable to sampled observations such as hourly energy readings, we apply zero-order hold (ZOH) discretization with step size $\Delta>0$ to obtain a recurrence relation suitable for discrete time-series data:

$$\bar{\mathbf{A}}=\exp(\Delta\mathbf{A}_{\text{ssm}}),\quad\bar{\mathbf{B}}=(\Delta\mathbf{A}_{\text{ssm}})^{-1}(\exp(\Delta\mathbf{A}_{\text{ssm}})-\mathbf{I})\cdot\Delta\mathbf{B},\tag{8}$$

where $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ are the discretized state transition and input matrices. Aggregating over all $D$ channels, the discrete recurrence for the full hidden state is:

$$\mathbf{s}_{t}=\bar{\mathbf{A}}\mathbf{s}_{t-1}+\bar{\mathbf{B}}\mathbf{h}_{t},\quad\tilde{\mathbf{h}}_{t}=\mathbf{C}\mathbf{s}_{t},\tag{9}$$

where $\mathbf{h}_{t}\in\mathbb{R}^{D}$ is the input hidden state at time $t$ and $\tilde{\mathbf{h}}_{t}\in\mathbb{R}^{D}$ is the output.
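The discretization in Eq. (8) and the recurrence in Eq. (9) can be sketched for a single channel. The example assumes a diagonal $\mathbf{A}_{\text{ssm}}$ with nonzero negative entries (as is common in Mamba-style models), so the matrix exponential and inverse reduce to element-wise operations; the function names and parameter values are illustrative.

```python
import numpy as np

def zoh_discretize(a_diag, b, delta):
    """Zero-order hold for a diagonal state matrix, following Eq. (8)."""
    a_bar = np.exp(delta * a_diag)
    # (delta*A)^(-1) (exp(delta*A) - I) delta*B simplifies to (exp(delta*a) - 1) / a * b.
    b_bar = (a_bar - 1.0) / a_diag * b
    return a_bar, b_bar

def ssm_scan(x, a_diag, b, c, delta):
    """Run the single-channel recurrence of Eq. (9) over a 1-D input sequence."""
    a_bar, b_bar = zoh_discretize(a_diag, b, delta)
    state = np.zeros_like(a_diag)
    outputs = []
    for x_t in x:
        state = a_bar * state + b_bar * x_t       # s_t = A_bar s_{t-1} + B_bar x_t
        outputs.append(float(c @ state))          # y_t = C s_t
    return np.array(outputs)

# Illustrative parameters: state dimension 2, constant unit input.
a_diag = np.array([-1.0, -0.5])
b = np.array([1.0, 1.0])
c = np.array([1.0, 2.0])
delta = 0.1
y = ssm_scan(np.ones(5), a_diag, b, c, delta)
```

Zero-order hold is exact for inputs that stay constant within each step, so for a constant input the scan matches the closed-form continuous-time solution sampled at multiples of $\Delta$.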

Spatial-Conditioned Selectivity. Building upon the Mamba structure’s input-dependent nature, we enhance its selective information propagation by introducing a mechanism that conditions on both temporal inputs and spatial contexts. Specifically, the static $\mathbf{B}$, $\mathbf{C}$, and $\Delta$ are replaced by time-varying, input-dependent counterparts:

$$\mathbf{B}_{t}=\mathbf{W}_{B}[\mathbf{h}_{t}\|\mathbf{z}_{t}]+\mathbf{b}_{B},\tag{10}$$

$$\mathbf{C}_{t}=\mathbf{W}_{C}[\mathbf{h}_{t}\|\mathbf{z}_{t}]+\mathbf{b}_{C},\tag{11}$$

$$\Delta_{t}=\text{Softplus}(\mathbf{W}_{\Delta}[\mathbf{h}_{t}\|\mathbf{z}_{t}]+b_{\Delta}),\tag{12}$$

where $[\cdot\|\cdot]$ denotes concatenation, $\mathbf{z}_{t}\in\mathbb{R}^{D}$ is the spatial context from Eq. ([6](https://arxiv.org/html/2606.00506#S4.E6)), $\mathbf{W}_{B},\mathbf{W}_{C}\in\mathbb{R}^{D_{s}\times 2D}$ and $\mathbf{W}_{\Delta}\in\mathbb{R}^{1\times 2D}$ are learnable weight matrices, $\mathbf{b}_{B},\mathbf{b}_{C}\in\mathbb{R}^{D_{s}}$ and $b_{\Delta}\in\mathbb{R}$ are learnable biases, and $\text{Softplus}(x)=\log(1+e^{x})$ ensures positivity of the step size. This spatial conditioning has a physical interpretation: the step size $\Delta_{t}$ controls how quickly the model forgets past states, and by conditioning on spatial context, nodes experiencing similar patterns in their neighborhood exhibit coordinated memory behavior.
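Eqs. (10)-(12) can be combined with the zero-order-hold recurrence into a sequential scan for one node. The sketch below is written under stated assumptions rather than as the authors' kernel: $\mathbf{A}_{\text{ssm}}$ is diagonal with one row of $D_s$ negative entries per channel, $\mathbf{B}_t$, $\mathbf{C}_t$, and $\Delta_t$ are shared across channels, and $\mathbf{W}_{\Delta}$ is stored as a flat vector. All names and random parameters are hypothetical.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)                   # log(1 + e^x), numerically stable

def selective_scan(h, z, a_diag, w_b, b_b, w_c, b_c, w_d, b_d):
    """Spatially conditioned selective scan for one node.

    h, z: (T, D) hidden inputs and spatial context; a_diag: (D, Ds). Returns (T, D).
    """
    T, D = h.shape
    state = np.zeros_like(a_diag)
    out = np.empty((T, D))
    for t in range(T):
        u = np.concatenate([h[t], z[t]])          # [h_t || z_t], shape (2D,)
        b_t = w_b @ u + b_b                       # Eq. (10), shape (Ds,)
        c_t = w_c @ u + b_c                       # Eq. (11), shape (Ds,)
        delta_t = softplus(w_d @ u + b_d)         # Eq. (12), positive scalar
        a_bar = np.exp(delta_t * a_diag)          # ZOH with the time-varying step size
        b_bar = (a_bar - 1.0) / a_diag * b_t      # b_t broadcasts over the D channels
        state = a_bar * state + b_bar * h[t][:, None]
        out[t] = state @ c_t
    return out

rng = np.random.default_rng(0)
T, D, Ds = 6, 4, 3
h = rng.standard_normal((T, D))
z = rng.standard_normal((T, D))
a_diag = -np.exp(rng.standard_normal((D, Ds)))    # negative entries keep the state stable
params = dict(
    w_b=rng.standard_normal((Ds, 2 * D)), b_b=np.zeros(Ds),
    w_c=rng.standard_normal((Ds, 2 * D)), b_c=np.zeros(Ds),
    w_d=rng.standard_normal(2 * D), b_d=0.0,
)
y = selective_scan(h, z, a_diag, **params)
```

The explicit loop is for readability only; efficient Mamba implementations use a parallel scan. Zeroing the weight columns that multiply $\mathbf{z}_t$ recovers a purely input-dependent scan, which makes the contribution of the spatial context easy to isolate.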

Bidirectional Processing (BIP). Energy time series exhibit both causal dependencies (past events affect future consumption) and contextual dependencies (understanding patterns requires seeing the full sequence). We employ bidirectional processing:

$$\text{BIP}(\mathbf{H},\mathbf{Z})=\text{Mamba}_{\rightarrow}(\mathbf{H},\mathbf{Z})+\text{Mamba}_{\leftarrow}(\text{Flip}(\mathbf{H}),\text{Flip}(\mathbf{Z})),\tag{13}$$

where $\text{Mamba}_{\rightarrow}$ processes the sequence forward in time, $\text{Mamba}_{\leftarrow}$ processes backward, and $\text{Flip}(\cdot)$ reverses the temporal dimension.
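The effect of Eq. (13) can be seen with any causal sequence mixer. The sketch below swaps in a simple exponential moving average as a stand-in for a unidirectional Mamba, and it flips the backward output once more so that both directions are aligned in time; that re-alignment is an assumption following common bidirectional implementations and is not shown explicitly in Eq. (13).

```python
from functools import partial

import numpy as np

def causal_mixer(h, z, decay):
    """Stand-in for a unidirectional Mamba: a causal moving average of h + z."""
    out = np.zeros_like(h)
    state = np.zeros(h.shape[1:])
    for t in range(h.shape[0]):
        state = decay * state + (1.0 - decay) * (h[t] + z[t])
        out[t] = state
    return out

def bidirectional(h, z, forward_model, backward_model):
    """Sum of a forward pass and a time-reversed pass over (T, D) inputs."""
    forward = forward_model(h, z)
    # Flip in, scan, flip back so index t refers to the same time step (assumption).
    backward = backward_model(h[::-1], z[::-1])[::-1]
    return forward + backward

# A single impulse in the middle of a length-5 sequence.
h = np.zeros((5, 1))
h[2, 0] = 1.0
z = np.zeros_like(h)
mixer = partial(causal_mixer, decay=0.5)
forward_only = mixer(h, z)
y = bidirectional(h, z, mixer, mixer)
```

The forward pass alone leaves every step before the impulse at zero, while the bidirectional output spreads the impulse symmetrically to both sides, which is the contextual behavior the paragraph above describes.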

Block Integration. For a single block at layer $l$, we have

$$\mathbf{Z}^{(l)}=\text{GCN}(\text{RMSNorm}(\mathbf{H}^{(l-1)}),\mathbf{A}),\tag{14}$$

$$\mathbf{H}^{(l)}=\mathbf{H}^{(l-1)}+\text{Dropout}(\text{BIP}(\text{RMSNorm}(\mathbf{H}^{(l-1)}),\mathbf{Z}^{(l)})),\tag{15}$$

where $\text{RMSNorm}(\cdot)$ is Root Mean Square Layer Normalization (Zhang and Sennrich, [2019](https://arxiv.org/html/2606.00506#bib.bib28)).

#### 4\.1\.4\.Output Projection

The final decoder output $\mathbf{H}^{(1)}_{\text{dec}} \in \mathbb{R}^{N \times T \times D}$ is projected to produce both point predictions and quantile estimates, where $D$ is the hidden dimension. We extract the last-time-step representation, denoted by $\mathbf{H}_{\text{out}} \in \mathbb{R}^{N \times D}$, and apply three separate linear projections:

$$\hat{\mathbf{Y}} = \text{RMSNorm}(\mathbf{H}_{\text{out}})\mathbf{W}_{\text{mi}} + \mathbf{b}_{\text{mi}}, \tag{16}$$

$$\hat{q}_{\text{lo}} = \text{RMSNorm}(\mathbf{H}_{\text{out}})\mathbf{W}_{\text{lo}} + \mathbf{b}_{\text{lo}}, \tag{17}$$

$$\hat{q}_{\text{up}} = \text{RMSNorm}(\mathbf{H}_{\text{out}})\mathbf{W}_{\text{up}} + \mathbf{b}_{\text{up}}, \tag{18}$$

where $\mathbf{W}_{\text{mi}}, \mathbf{W}_{\text{lo}}, \mathbf{W}_{\text{up}} \in \mathbb{R}^{D \times T_{\text{out}}}$ are learnable weight matrices, $\mathbf{b}_{\text{mi}}, \mathbf{b}_{\text{lo}}, \mathbf{b}_{\text{up}} \in \mathbb{R}^{T_{\text{out}}}$ are learnable biases, and $T_{\text{out}}$ is the prediction horizon.
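A shape-level sketch of Eqs. (16)–(18) is given below. The function names and random parameters are hypothetical, and the normalization is written without a learnable gain for brevity, which is an assumption.

```python
import numpy as np

def rms_norm(x, eps=1e-8):
    # Gain-free RMSNorm over the feature axis (simplifying assumption).
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def output_heads(H_out, W_mi, b_mi, W_lo, b_lo, W_up, b_up):
    """Median, lower, and upper heads applied to the same normalized features."""
    Hn = rms_norm(H_out)                          # (N, D)
    return Hn @ W_mi + b_mi, Hn @ W_lo + b_lo, Hn @ W_up + b_up

rng = np.random.default_rng(0)
N, D, T_out = 201, 64, 6                          # sizes taken from the Florida setup
H_out = rng.normal(size=(N, D))
# Three (W, b) pairs with shapes (D, T_out) and (T_out,).
params = [rng.normal(scale=0.1, size=s) for s in [(D, T_out), (T_out,)] * 3]
Y_hat, q_lo, q_up = output_heads(H_out, *params)
```

Each head returns one value per node and horizon step, so all three outputs have shape `(N, T_out)`.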

#### 4\.1\.5\.Training Objective

GE-Mamba is trained end-to-end using a composite quantile regression loss. For each quantile level $\tau \in \{\alpha/2,\, 0.5,\, 1-\alpha/2\}$, we adopt the pinball loss:

$$\mathcal{L}_{\tau}(y, \hat{q}_{\tau}) = \begin{cases} \tau \cdot (y - \hat{q}_{\tau}), & y \geq \hat{q}_{\tau}, \\ (1-\tau) \cdot (\hat{q}_{\tau} - y), & y < \hat{q}_{\tau}, \end{cases} \tag{19}$$

where $y$ is the ground truth and $\hat{q}_{\tau}$ is the predicted $\tau$-th quantile. The overall training loss is the sum of the pinball losses across all three quantile levels:

$$\mathcal{L} = \frac{1}{N \cdot T_{\text{out}}} \sum_{n=1}^{N} \sum_{j=1}^{T_{\text{out}}} \Big( \mathcal{L}_{\alpha/2}(y_{n,j}, \hat{q}_{\alpha/2}^{(n,j)}) + \mathcal{L}_{0.5}(y_{n,j}, \hat{q}_{0.5}^{(n,j)}) + \mathcal{L}_{1-\alpha/2}(y_{n,j}, \hat{q}_{1-\alpha/2}^{(n,j)}) \Big). \tag{20}$$

The median head ($\tau = 0.5$) provides the point prediction $\hat{\mathbf{Y}}$, while the lower and upper heads produce quantile bounds for downstream uncertainty calibration via AS-CQR.
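The loss in Eqs. (19)–(20) is short enough to write out directly. The sketch below uses NumPy and hypothetical function names; it illustrates the formulas and is not the authors' training code.

```python
import numpy as np

def pinball(y, q, tau):
    """Elementwise pinball loss of Eq. (19)."""
    diff = y - q
    # tau * (y - q) when y >= q, otherwise (1 - tau) * (q - y).
    return np.where(diff >= 0, tau * diff, (tau - 1.0) * diff)

def composite_loss(y, q_lo, q_mid, q_up, alpha=0.1):
    """Eq. (20): three pinball terms, averaged over nodes and horizon steps."""
    total = (pinball(y, q_lo, alpha / 2)
             + pinball(y, q_mid, 0.5)
             + pinball(y, q_up, 1 - alpha / 2))
    return total.mean()

# One node, one step: truth 10, predicted quantiles 8 / 10 / 12.
y = np.array([[10.0]])
loss = composite_loss(y, np.array([[8.0]]), np.array([[10.0]]), np.array([[12.0]]))
```

In this toy case the lower head contributes 0.05 · 2, the median head contributes 0, and the upper head contributes 0.05 · 2. The asymmetry of the weights is what pushes each head toward its own quantile: overshooting the lower quantile is penalized 19 times more heavily than undershooting it when $\tau = 0.05$.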

### 4\.2\.Adaptive Sequential Conformalized Quantile Regression

Point predictions are valuable for power grid planning, but they fall short in critical grid operations that demand risk-aware decision-making. Existing uncertainty quantification methods face two fundamental challenges in energy systems: (1) Heteroscedasticity, where prediction variance correlates with consumption magnitude; and (2) Non-stationarity, where distribution shifts are driven by seasonality or abnormal scenarios that violate the exchangeability assumption of classical conformal prediction. To address these challenges, we propose Adaptive Sequential Conformalized Quantile Regression (AS-CQR), a distribution-free framework delivering reliable prediction intervals via two innovations:

1. Locally Adaptive Normalization: Nonconformity scores are normalized by the predicted interval width, ensuring scale-invariant calibration.
2. Online Feedback Mechanism: The target coverage level is dynamically adjusted based on recent prediction performance, enabling rapid adaptation to distribution shifts.

#### 4\.2\.1\.Locally Adaptive Nonconformity Measure

Unlike standard Conformalized Quantile Regression (CQR) (Romano et al., [2019](https://arxiv.org/html/2606.00506#bib.bib34)), which uses absolute residual errors, AS-CQR employs a normalized measure that scales the nonconformity score by the predicted interval width:

$$\epsilon_t = \frac{\max\left\{\hat{q}_{\text{lo}}(\mathbf{X}_t) - Y_t,\; Y_t - \hat{q}_{\text{up}}(\mathbf{X}_t)\right\}}{\hat{q}_{\text{up}}(\mathbf{X}_t) - \hat{q}_{\text{lo}}(\mathbf{X}_t) + \delta}, \tag{21}$$

where $\epsilon_t \in \mathbb{R}$ is the nonconformity score at time $t$, $Y_t \in \mathbb{R}$ is the true observation, $\hat{q}_{\text{lo}}(\mathbf{X}_t)$ and $\hat{q}_{\text{up}}(\mathbf{X}_t)$ are the predicted lower and upper quantiles from Eqs. ([17](https://arxiv.org/html/2606.00506#S4.E17))–([18](https://arxiv.org/html/2606.00506#S4.E18)), and $\delta$ is a small constant for numerical stability (e.g., $10^{-6}$). Note that $\epsilon_t$ is a signed quantity: $\epsilon_t < 0$ when $Y_t$ lies inside the predicted interval $[\hat{q}_{\text{lo}}, \hat{q}_{\text{up}}]$, with the magnitude reflecting how deeply the observation is contained within the bounds; $\epsilon_t > 0$ when $Y_t$ falls outside, indicating miscoverage; and $\epsilon_t = 0$ when $Y_t$ lies exactly on a boundary. Combined with the interval-width normalization, this yields a scale-invariant calibration mechanism that produces efficient prediction intervals across regions with different magnitudes, as motivated in Section [2](https://arxiv.org/html/2606.00506#S2). This normalization also makes the calibration scores more comparable across regions and time periods, preventing high-load areas from dominating the adaptation process solely because they exhibit larger absolute residuals.
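The sign convention and the scale invariance of Eq. (21) are easy to check numerically. The following sketch uses a hypothetical helper `nonconformity` and made-up load values.

```python
import numpy as np

def nonconformity(y, q_lo, q_up, delta=1e-6):
    """Signed, width-normalized score of Eq. (21)."""
    raw = np.maximum(q_lo - y, y - q_up)     # negative inside, positive outside
    return raw / (q_up - q_lo + delta)

inside = nonconformity(100.0, 90.0, 110.0)          # observation at the interval center
outside_small = nonconformity(120.0, 90.0, 110.0)   # miss in a low-load region
outside_large = nonconformity(1200.0, 900.0, 1100.0)  # same relative miss at 10x the load
```

The two misses differ by a factor of ten in absolute residual (10 versus 100) but receive essentially the same score of about 0.5, so neither region dominates the calibration window. The centered observation scores about -0.5.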

#### 4\.2\.2\.Interval Construction

At each time step $t$, the calibrated prediction interval is constructed using historical nonconformity scores. Let $\mathcal{E}_t = \{\epsilon_{t-m+1}, \ldots, \epsilon_t\}$ be a sliding window of the $m$ most recent nonconformity scores, where $m = 100$ is the window size. The correction factor is computed as:

$$Q_t = \text{Quantile}_{1-\tilde{\alpha}_t}(\mathcal{E}_t), \tag{22}$$

where $\text{Quantile}_p(\cdot)$ returns the $p$-th empirical quantile of the input set and $\tilde{\alpha}_t \in (0,1)$ is the effective miscoverage rate (detailed in Section [4.2.3](https://arxiv.org/html/2606.00506#S4.SS2.SSS3)). The calibrated prediction interval $\hat{C}(\mathbf{X}_t) = [\hat{C}_{\text{lo}}, \hat{C}_{\text{up}}]$ is:

$$\hat{C}(\mathbf{X}_t) = \left[\hat{q}_{\text{lo}}(\mathbf{X}_t) - Q_t \cdot w_t,\quad \hat{q}_{\text{up}}(\mathbf{X}_t) + Q_t \cdot w_t\right], \tag{23}$$

where $w_t = \hat{q}_{\text{up}}(\mathbf{X}_t) - \hat{q}_{\text{lo}}(\mathbf{X}_t) \in \mathbb{R}_{>0}$ is the raw interval width. The multiplicative correction $Q_t \cdot w_t$ ensures that interval adjustments are proportional to the model's uncertainty, maintaining the locally adaptive property.
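Eqs. (22)–(23) amount to one quantile lookup and one rescaling. The sketch below assumes NumPy's default linearly interpolated quantile and clips the quantile level to [0, 1]; the text does not specify either choice, and the helper name is hypothetical.

```python
import numpy as np

def calibrated_interval(q_lo, q_up, scores, alpha_eff):
    """Widen or shrink the raw interval by Q_t times its own width."""
    level = float(np.clip(1.0 - alpha_eff, 0.0, 1.0))   # guard against drift outside [0, 1]
    Q = np.quantile(np.asarray(scores, dtype=float), level)   # Eq. (22)
    w = q_up - q_lo                                      # raw width w_t
    return q_lo - Q * w, q_up + Q * w                    # Eq. (23)

# Recent misses (positive scores) widen the interval ...
wide = calibrated_interval(90.0, 110.0, [0.1] * 100, alpha_eff=0.1)
# ... while comfortably covered observations (negative scores) shrink it.
narrow = calibrated_interval(90.0, 110.0, [-0.1] * 100, alpha_eff=0.1)
```

With a raw width of 20, a correction factor of 0.1 moves each bound by 2, giving roughly [88, 112] in the first case and [92, 108] in the second.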

#### 4\.2\.3\.Online Feedback Calibration

To handle distribution drift, we integrate an online update mechanism inspired by Adaptive Conformal Inference (ACI) (Gibbs and Candes, [2021](https://arxiv.org/html/2606.00506#bib.bib33)). The key insight is to treat the target miscoverage rate $\alpha \in (0,1)$ as a control variable that is adjusted based on recent coverage performance. The complete online calibration procedure at each time step $t$ is:

1. Predict: Compute quantile predictions $\hat{q}_{\text{lo}}(\mathbf{X}_t)$ and $\hat{q}_{\text{up}}(\mathbf{X}_t)$ using GE-Mamba.
2. Calibrate: Construct the calibrated interval $\hat{C}(\mathbf{X}_t)$ using Eq. ([23](https://arxiv.org/html/2606.00506#S4.E23)) with the current $\tilde{\alpha}_t$.
3. Observe: Receive the true observation $Y_t$.
4. Update: Adjust the effective miscoverage rate for the next step.

Let $\tilde{\alpha}_t \in (0,1)$ denote the effective miscoverage rate at time $t$, initialized as $\tilde{\alpha}_0 = \alpha$. After observing the true value $Y_t$, we update:

$$\tilde{\alpha}_{t+1} = \tilde{\alpha}_t + \gamma \cdot \left(\alpha - \mathbb{I}\{Y_t \notin \hat{C}(\mathbf{X}_t)\}\right), \tag{24}$$

where $\gamma$ is the learning rate (0.005 in our work) that controls adaptation speed and $\mathbb{I}\{\cdot\}$ is the indicator function. This update interacts with the signed nonconformity scores to form a feedback loop:

- Undercoverage ($Y_t \notin \hat{C}$): Since the indicator equals 1, $\tilde{\alpha}_{t+1}$ decreases by $\gamma(1-\alpha)$, raising the quantile threshold $1-\tilde{\alpha}$. Concurrently, the positive nonconformity scores from out-of-interval observations push $Q_t$ upward, jointly producing wider intervals.
- Overcoverage ($Y_t \in \hat{C}$): The indicator equals 0, so $\tilde{\alpha}_{t+1}$ increases by $\gamma\alpha$, lowering the quantile threshold. Combined with the negative nonconformity scores from well-contained observations, $Q_t$ can become negative, actively contracting the interval below the raw quantile predictions.

This feedback mechanism ensures that AS\-CQR can rapidly adapt to distribution shifts for both undercoverage and overcoverage\.
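The four-step procedure and the update in Eq. (24) can be put together in a short loop. This is a sketch under stated assumptions rather than the authors' implementation: the function name is hypothetical, the window holds only scores observed before step $t$ (the interval must be built before $Y_t$ is revealed), the first step falls back to the raw quantiles because no history exists yet, and the quantile level is clipped to [0, 1].

```python
import numpy as np
from collections import deque

def as_cqr_online(y, q_lo, q_up, alpha=0.1, gamma=0.005, m=100, delta=1e-6):
    """Run predict / calibrate / observe / update over 1-D arrays of equal length."""
    scores = deque(maxlen=m)          # sliding window of past nonconformity scores
    alpha_eff = alpha                 # effective miscoverage rate
    lows, ups, covered = [], [], []
    for t in range(len(y)):
        w = q_up[t] - q_lo[t]
        if scores:
            level = float(np.clip(1.0 - alpha_eff, 0.0, 1.0))
            Q = np.quantile(np.fromiter(scores, dtype=float), level)   # Eq. (22)
        else:
            Q = 0.0                   # no history yet: use raw quantiles (assumption)
        lo, up = q_lo[t] - Q * w, q_up[t] + Q * w                      # Eq. (23)
        miss = not (lo <= y[t] <= up)
        alpha_eff += gamma * (alpha - float(miss))                     # Eq. (24)
        scores.append(max(q_lo[t] - y[t], y[t] - q_up[t]) / (w + delta))  # Eq. (21)
        lows.append(lo); ups.append(up); covered.append(not miss)
    return np.array(lows), np.array(ups), np.array(covered), alpha_eff

# A single miss lowers the effective rate by gamma * (1 - alpha).
_, _, cov_miss, a_miss = as_cqr_online(np.array([200.0]), np.array([90.0]), np.array([110.0]))
# Fifty covered steps raise it by 50 * gamma * alpha.
n = 50
_, _, cov_ok, a_ok = as_cqr_online(np.full(n, 100.0), np.full(n, 90.0), np.full(n, 110.0))
```

In the second run the observation always sits at the interval center, so the scores stay near -0.5 and the calibrated interval contracts tightly around the observed value while still covering it.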

#### 4\.2\.4\.Theoretical Guarantee

AS\-CQR inherits the long\-run coverage guarantee of adaptive conformal inference\. Under mild regularity conditions, the update rule in Eq\. \([24](https://arxiv.org/html/2606.00506#S4.E24)\) ensures convergence of the average empirical coverage to the target level:

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}\{Y_t \in \hat{C}(\mathbf{X}_t)\} = 1 - \alpha. \tag{25}$$

This guarantee is distribution-free, making AS-CQR particularly suitable for non-stationary energy systems where parametric assumptions may be unreliable.
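The long-run statement in Eq. (25) can be motivated by a short telescoping argument in the style of adaptive conformal inference. The sketch below assumes that $\tilde{\alpha}_t$ stays in some bounded interval $[a, b]$ for all $t$, a condition not spelled out above; $\text{err}_t$ is shorthand introduced only for this sketch.

```latex
\begin{aligned}
\text{err}_t &:= \mathbb{I}\{Y_t \notin \hat{C}(\mathbf{X}_t)\}, \\
\tilde{\alpha}_{T+1} - \tilde{\alpha}_1
  &= \gamma \sum_{t=1}^{T} (\alpha - \text{err}_t)
  \quad \text{(summing Eq. (24) over } t = 1, \dots, T\text{)}, \\
\left| \frac{1}{T} \sum_{t=1}^{T} \text{err}_t - \alpha \right|
  &= \frac{\left| \tilde{\alpha}_1 - \tilde{\alpha}_{T+1} \right|}{\gamma T}
  \le \frac{b - a}{\gamma T}
  \xrightarrow[T \to \infty]{} 0 .
\end{aligned}
```

Under this assumption the empirical miscoverage approaches $\alpha$ at rate $1/(\gamma T)$, so the empirical coverage approaches $1 - \alpha$. With $\gamma = 0.005$ the bound reads $200(b - a)/T$, which makes the trade-off visible: a smaller $\gamma$ gives smoother intervals but a slower guarantee.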

Table 1. Comparison with 15 state-of-the-art baselines on four datasets. ↓ indicates lower is better. The best results are in bold and the second-best are underlined. ✓ indicates target coverage (≥ 90%) achieved.

## 5\.Evaluation

In this section, we conduct a comprehensive experimental evaluation of our proposed EnergyMamba\. Specifically, we aim to address the following five research questions:

- RQ 1: How does EnergyMamba perform compared to baselines?
- RQ 2: Is EnergyMamba effective for uncertainty quantification?
- RQ 3: Are all components in EnergyMamba effective?
- RQ 4: How does EnergyMamba perform under abnormal scenarios such as extreme weather events?
- RQ 5: Is EnergyMamba computationally efficient?

### 5\.1\.Evaluation Setup

#### 5\.1\.1\.Datasets

We evaluate the performance of EnergyMamba on four real-world energy consumption datasets. The first two datasets originate from Florida, covering 201 census block groups (CBGs) with fine-grained data collected at 30-minute intervals. We denote the dataset spanning the year 2018 as Florida 1 and the dataset spanning 2019 as Florida 2. To further verify generalizability, we incorporate two additional datasets from New York (Operator, [2024b](https://arxiv.org/html/2606.00506#bib.bib40)) and California (Operator, [2024a](https://arxiv.org/html/2606.00506#bib.bib41)) in 2024, both recorded at 1-hour intervals at a coarser spatial granularity (11 regions and 9 zones, respectively). More details are given in Appendix [A](https://arxiv.org/html/2606.00506#A1).

#### 5\.1\.2\.Baselines

We compare EnergyMamba with 15 state-of-the-art baselines in five categories: (1) GNN-based: DCRNN (Li et al., [2018](https://arxiv.org/html/2606.00506#bib.bib10)), STGCN (Yu et al., [2018](https://arxiv.org/html/2606.00506#bib.bib9)), AGCRN (Bai et al., [2020](https://arxiv.org/html/2606.00506#bib.bib11)), STZINB (Zhuang et al., [2022](https://arxiv.org/html/2606.00506#bib.bib26)), UQGNN (Yu et al., [2025](https://arxiv.org/html/2606.00506#bib.bib21)), TrustEnergy (Yu et al., [2026b](https://arxiv.org/html/2606.00506#bib.bib48)); (2) Attention-based: DSTAGNN (Lan et al., [2022](https://arxiv.org/html/2606.00506#bib.bib16)), ASTGCN (Guo et al., [2019](https://arxiv.org/html/2606.00506#bib.bib39)); (3) Transformer-based: GluonTS (Alexandrov et al., [2020](https://arxiv.org/html/2606.00506#bib.bib15)), PatchTST (Zeng et al., [2023](https://arxiv.org/html/2606.00506#bib.bib12)), PowerPM (Tu et al., [2024](https://arxiv.org/html/2606.00506#bib.bib29)); (4) LLM-based: ST-LLM (Liu et al., [2024](https://arxiv.org/html/2606.00506#bib.bib13)), UrbanGPT (Li et al., [2024b](https://arxiv.org/html/2606.00506#bib.bib14)); (5) Mamba-based: G-Mamba (Wang et al., [2024](https://arxiv.org/html/2606.00506#bib.bib42)), U-Mamba (Ma et al., [2024](https://arxiv.org/html/2606.00506#bib.bib20)). Details are given in Appendix [B.1](https://arxiv.org/html/2606.00506#A2.SS1).

#### 5\.1\.3\.Metrics

We use Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to evaluate deterministic prediction, and three other commonly used metrics, namely Mean Prediction Interval Width (MPIW), Interval Score (IS), and Coverage (COV), to evaluate uncertainty quantification. Details are given in Appendix [B.2](https://arxiv.org/html/2606.00506#A2.SS2).
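For readers who want the five metrics in executable form, a minimal sketch follows. Appendix B.2 is not reproduced here, so the formulas are the standard definitions, and the interval score is taken to be the Winkler score; both are assumptions about what the paper uses.

```python
import numpy as np

def point_metrics(y, y_hat):
    """MAE and RMSE of the point (median) prediction."""
    err = y - y_hat
    return {"MAE": np.mean(np.abs(err)), "RMSE": np.sqrt(np.mean(err ** 2))}

def interval_metrics(y, lo, up, alpha=0.1):
    """MPIW, interval score, and empirical coverage of [lo, up]."""
    width = up - lo
    below = np.maximum(lo - y, 0.0)              # how far y falls under the lower bound
    above = np.maximum(y - up, 0.0)              # how far y exceeds the upper bound
    score = width + (2.0 / alpha) * (below + above)   # Winkler / interval score (assumed)
    cov = np.mean((y >= lo) & (y <= up))
    return {"MPIW": np.mean(width), "IS": np.mean(score), "COV": cov}

y = np.array([1.0, 2.0, 3.0])
pm = point_metrics(y, np.array([1.0, 2.0, 5.0]))
im = interval_metrics(y, np.zeros(3), np.full(3, 2.0))
```

The interval score rewards narrow intervals but charges a penalty of $2/\alpha$ per unit of miss, so a method cannot improve it simply by shrinking intervals until coverage collapses.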

#### 5\.1\.4\.Implementation Details

All experiments are conducted on a Linux server with an NVIDIA A100 GPU (80 GB). We use Adam with batch size 128 and an initial learning rate of $1 \times 10^{-3}$, decayed every 15 epochs, and apply early stopping with patience 50 based on validation loss. Datasets are split into training/validation/testing sets at 8:1:1, with input and prediction lengths of 192 and 6 time steps, corresponding to 4 days/3 hours for Florida and 8 days/6 hours for the New York (NYISO) and California (CAISO) datasets. For EnergyMamba, we set hidden dimension $D = 64$, Mamba expansion factor 2, SSM state dimension $D_s = 16$, and $S = 2$ encoder-decoder stages with $K = 2$ GE-Mamba blocks per stage, with dropout 0.1 and gradient clipping 1.0. The model is trained end-to-end with Eq. ([20](https://arxiv.org/html/2606.00506#S4.E20)) using $\tau \in \{0.05, 0.5, 0.95\}$; for AS-CQR, we set $\alpha = 0.1$, $m = 100$, and $\gamma = 0.005$. For deterministic baselines (e.g., DCRNN, STGCN, and AGCRN), we replace their original output layers with the same quantile heads used in EnergyMamba and train them with the same composite pinball loss for fair comparison.

### 5\.2\.Overall Performance Comparison \(RQ 1\)

An overall comparison of EnergyMamba and the baselines is presented in Table [1](https://arxiv.org/html/2606.00506#S4.T1). EnergyMamba achieves the best performance on most metrics for both prediction accuracy and uncertainty quantification. Specifically, averaged over all four datasets, our framework reduces MAE by approximately 5% compared to the best baseline. EnergyMamba also demonstrates superior uncertainty quantification, with around 6% improvement in IS while reaching the target coverage.

![Refer to caption](https://arxiv.org/html/2606.00506v1/x6.png)\(a\)Selective regression\.
![Refer to caption](https://arxiv.org/html/2606.00506v1/x7.png)\(b\)Ideal vs\. Empirical Coverage\.

Figure 4. UQ analyses on the Florida 1 dataset. Two uncertainty-quantification plots for Florida 1: a selective regression curve and an ideal-versus-empirical coverage comparison.
### 5\.3\.Effectiveness of UQ \(RQ 2\)

Furthermore, we adopt selective regression (Sokol et al., [2024](https://arxiv.org/html/2606.00506#bib.bib18); Shah et al., [2022](https://arxiv.org/html/2606.00506#bib.bib17)) to evaluate uncertainty quantification performance. Selective regression allows the model to abstain from making predictions when confidence is insufficient, and is characterized by *coverage* (the fraction of samples selected for prediction, distinct from COV, which denotes the proportion of ground truth observations falling within the predicted bounds) and *risk* (measured by MAE). As shown in Figure [4(a)](https://arxiv.org/html/2606.00506#S5.F4.sf1), in the absence of uncertainty quantification, the MAE remains nearly constant across different coverage levels. In contrast, when uncertainty is taken into account, the error increases as coverage grows: the samples the model is most confident about are predicted most accurately, and the error rises as less confident samples are admitted. Moreover, Figure [4(b)](https://arxiv.org/html/2606.00506#S5.F4.sf2) shows that EnergyMamba achieves the most reliable calibration, as its curve lies closest to the diagonal line. These results confirm that the proposed uncertainty estimates are informative and meaningful.
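A risk-coverage curve of the kind described above can be computed in a few lines. This sketch assumes that the calibrated interval width is used as the uncertainty score, which the text does not state explicitly, and the function name is hypothetical.

```python
import numpy as np

def risk_coverage_curve(y, y_hat, uncertainty, coverages):
    """MAE over the most confident fraction of samples, for each coverage level."""
    order = np.argsort(uncertainty, kind="stable")   # most confident samples first
    abs_err = np.abs(y - y_hat)[order]
    risks = []
    for c in coverages:
        k = max(1, int(round(c * len(y))))           # number of samples kept
        risks.append(abs_err[:k].mean())
    return np.array(risks)

# Toy data in which larger uncertainty goes with larger error.
y = np.zeros(4)
y_hat = np.array([1.0, 2.0, 3.0, 4.0])
width = np.array([0.1, 0.2, 0.3, 0.4])               # stand-in uncertainty scores
risks = risk_coverage_curve(y, y_hat, width, coverages=[0.25, 0.5, 1.0])
```

When uncertainty is informative, the curve rises with coverage as it does here; if the scores were unrelated to the errors, the curve would stay roughly flat, which matches the behavior reported for the setting without uncertainty quantification.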

### 5\.4\.Ablation Study \(RQ 3\)

To evaluate component contributions, we compare EnergyMamba against five variants: (1) w/o GCN: Removes spatial context, treating nodes independently; (2) w/o BIP: Uses unidirectional processing, ignoring backward dependencies; (3) w/o U-Net: Uses a flat stack, removing multi-scale modeling; (4) w/o AS-CQR: Replaces adaptive mechanisms with standard CQR; and (5) GCN + Linear (decoupled): Applies GCN only at the input and output layers. Table [2](https://arxiv.org/html/2606.00506#S5.T2) confirms the effectiveness of each component. Removing GCN significantly degrades performance, validating the necessity of spatial context. The higher MAE in w/o BIP and w/o U-Net highlights the importance of backward dependencies and hierarchical temporal modeling, respectively. In addition, w/o AS-CQR yields higher MPIW and misses target coverage, indicating that the adaptive mechanisms are critical for reliable uncertainty quantification. Finally, the decoupled design fails to capture evolving spatial dependencies, leading to degraded performance.

### 5\.5\.Case Studies for Abnormal Situations \(RQ 4\)

One of the most challenging aspects of energy consumption prediction is handling abnormal or extreme scenarios, such as hurricanes, heat waves, and cold snaps, that often lead to sudden and significant deviations in consumption patterns. Accurate and reliable prediction under these extreme events is critical for ensuring grid stability, optimizing energy resource allocation, and enabling timely decision-making for both utilities and emergency response teams. As shown in Figure [5](https://arxiv.org/html/2606.00506#S6.F5), under different extreme events, EnergyMamba maintains accurate predictions, and the ground truth values fall within the prediction intervals in most cases, whereas the best baseline struggles to adapt to these abrupt changes. These results highlight EnergyMamba's adaptability to both demand drops and surges under abnormal scenarios.

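Table 2. Ablation study on the Florida dataset 1.

Table 3. Computational complexity comparison.

| Category | Method | Train (s) ↓ | Infer (s) ↓ | Mem (MB) ↓ | Params (K) ↓ |
|---|---|---|---|---|---|
| GNN | DCRNN | 15847 | 2.75 | 1108 | 23.7 |
| GNN | STGCN | 15570 | 2.66 | 6308 | 438.1 |
| GNN | AGCRN | 662 | 1.25 | 3781 | 747.4 |
| GNN | STZINB | 1913 | 3.80 | 8109 | 249.9 |
| GNN | UQGNN | 710 | 3.10 | 4039 | 297.7 |
| GNN | TrustEnergy | 3865 | 3.90 | 27119 | 1012.2 |
| Attention | DSTAGNN | 3291 | 2.20 | 2190 | 298.6 |
| Attention | ASTGCN | 1759 | 2.15 | 2125 | 230.1 |
| Transformer | GluonTS | 104 | 0.95 | 1378 | 25.9 |
| Transformer | PatchTST | 152 | 1.85 | 3167 | 145.4 |
| Transformer | PowerPM | 8900 | 2.60 | 16500 | 11085.4 |
| LLM | ST-LLM | 13520 | 5.51 | 24850 | 67462.5 |
| LLM | UrbanGPT | 14200 | 6.00 | 28600 | 72520.8 |
| Mamba | G-Mamba | 6285 | 2.89 | 7499 | 299.5 |
| Mamba | U-Mamba | 11090 | 3.60 | 12582 | 664.3 |
| Ours | EnergyMamba | 7945 | 2.78 | 4125 | 312.8 |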

### 5\.6\.Computational Complexity Analysis \(RQ 5\)

We explicitly evaluate the computational efficiency of EnergyMamba in terms of training time, peak GPU memory usage, and model size. Table [3](https://arxiv.org/html/2606.00506#S5.T3) reports detailed comparisons with representative baselines.

Time Complexity. EnergyMamba completes training in 7,945 seconds. This efficiency primarily comes from the Mamba backbone, which scales linearly with sequence length ($O(T)$), avoiding the quadratic cost of attention-based alternatives. Compared with LLM-based baselines, EnergyMamba is substantially faster and thus more practical for frequent retraining in dynamic grid scenarios. Furthermore, during the inference phase, EnergyMamba achieves a highly competitive latency. It operates at approximately twice the speed of heavy LLM models and also outperforms other state-of-the-art Mamba variants such as G-Mamba and U-Mamba.

Memory Footprint. The peak GPU memory usage of EnergyMamba is 4.03 GB, which is notably lower than many attention-based and LLM-based baselines. This moderate memory demand improves deployability in resource-constrained operational environments.

Parameter Efficiency. With 312.8K trainable parameters, EnergyMamba remains lightweight while still achieving superior predictive performance (RQ 1), indicating that gains come from architectural design rather than brute-force model scaling. Overall, EnergyMamba achieves a favorable balance between accuracy and efficiency.

## 6\.Related Work

### 6\.1\.Energy Consumption Prediction

Energy consumption prediction has attracted considerable attention from both academia and industry. Early approaches largely relied on statistical methods, such as linear regression (Hong et al., [2011](https://arxiv.org/html/2606.00506#bib.bib2)), support vector regression (Sapankevych and Sankar, [2009](https://arxiv.org/html/2606.00506#bib.bib3)), and random forest regression (Wu et al., [2015](https://arxiv.org/html/2606.00506#bib.bib4)), which model relatively simple (often linear) relationships between energy consumption and its explanatory variables. However, these methods often struggle to capture the complex, nonlinear dynamics inherent in energy consumption data. To address this limitation, deep learning approaches have been widely adopted, including general neural forecasting methods (Shen et al., [2026a](https://arxiv.org/html/2606.00506#bib.bib55); Cheng et al., [2025](https://arxiv.org/html/2606.00506#bib.bib58); Yu et al., [2025](https://arxiv.org/html/2606.00506#bib.bib21), [2026a](https://arxiv.org/html/2606.00506#bib.bib49); Shen et al., [2026b](https://arxiv.org/html/2606.00506#bib.bib54)), Transformers (Alexandrov et al., [2020](https://arxiv.org/html/2606.00506#bib.bib15); Shen et al., [2025](https://arxiv.org/html/2606.00506#bib.bib56); Buratto et al., [2024](https://arxiv.org/html/2606.00506#bib.bib6)), and Diffusion (Xu et al., [2026a](https://arxiv.org/html/2606.00506#bib.bib45), [2025](https://arxiv.org/html/2606.00506#bib.bib46), [b](https://arxiv.org/html/2606.00506#bib.bib47)), which excel at mining temporal patterns from historical time series. More recently, the emergence of Foundation Models (Tu et al., [2024](https://arxiv.org/html/2606.00506#bib.bib29)) and Large Language Models (Li et al., [2026a](https://arxiv.org/html/2606.00506#bib.bib50), [2025](https://arxiv.org/html/2606.00506#bib.bib51); Liang et al., [2025](https://arxiv.org/html/2606.00506#bib.bib32)) has introduced a new paradigm, leveraging extensive pre-training to achieve superior generalization.
Despite these advancements, most existing works formulate energy consumption prediction as a purely time\-series prediction problem, neglecting the intrinsic spatial dependencies among different regions or grid zones that could provide critical information\. Furthermore, the reliability of predictions, which is crucial for real\-world decision\-making under extreme weather or grid anomalies, remains underexplored\.

### 6\.2\.Uncertainty\-aware Spatiotemporal Prediction

Quantifying uncertainty is important for robust and reliable decision-making (Yang et al., [2023](https://arxiv.org/html/2606.00506#bib.bib65), [2024](https://arxiv.org/html/2606.00506#bib.bib66)). Most existing uncertainty-aware spatiotemporal prediction works (Li et al., [2026b](https://arxiv.org/html/2606.00506#bib.bib59); Xiao and Liu, [2025](https://arxiv.org/html/2606.00506#bib.bib64)) focus on distribution-based methods or conformal prediction. Distribution-based methods assume that the target variable follows a specific probability distribution and optimize the model to predict the distribution parameters. For example, UQGNN (Yu et al., [2025](https://arxiv.org/html/2606.00506#bib.bib21)) assumes a Gaussian distribution, while STZINB (Zhuang et al., [2022](https://arxiv.org/html/2606.00506#bib.bib26)) leverages a zero-inflated negative binomial distribution. Conformal prediction is a distribution-free framework that constructs valid prediction intervals. CF-GNN (Huang et al., [2024](https://arxiv.org/html/2606.00506#bib.bib27)) integrates conformal prediction with GNNs, while TrustEnergy (Yu et al., [2026b](https://arxiv.org/html/2606.00506#bib.bib48)) further enhances this by incorporating meta-learning to achieve context-aware prediction. Despite these advances, a critical gap remains: most existing methods assume data exchangeability or stationarity, which do not hold in real-world energy systems subject to distribution shifts. They typically lack mechanisms to dynamically adjust uncertainty bounds in response to online feedback, leading to potential miscoverage during abnormal or extreme events. Our work addresses this by incorporating adaptive conformal inference into a spatiotemporal framework, aiming for robust calibration even under non-stationary conditions.

![Refer to caption](https://arxiv.org/html/2606.00506v1/x8.png)\(a\)Hurricane\.
![Refer to caption](https://arxiv.org/html/2606.00506v1/x9.png)\(b\)Heat Wave\.

Figure 5. Prediction results under extreme events. Two case-study plots show prediction intervals under a hurricane and a heat wave, comparing predicted trajectories with observed values during abnormal events.

## 7\.Conclusion

In this paper, we propose EnergyMamba, an uncertainty-aware spatiotemporal framework for accurate and reliable energy consumption prediction. The design of EnergyMamba is directly motivated by empirical insights, addressing key challenges including physical spatial dependencies and load-dependent heteroscedasticity. EnergyMamba consists of two core components: (i) GE-Mamba, which integrates GCN with a Selective State Space Model within a U-Net architecture and, through bidirectional processing and multi-scale feature extraction, captures complex spatiotemporal patterns while maintaining linear computational complexity; and (ii) AS-CQR, which includes locally adaptive normalization and online feedback mechanisms to handle non-stationarity, providing reliable uncertainty estimates even under distribution shifts during extreme weather events. Extensive experiments on four real-world datasets from Florida, New York, and California demonstrate that EnergyMamba outperforms 15 state-of-the-art baselines, improving prediction accuracy by around 5% and uncertainty quantification by around 6%.

## 8\.Limitations and Ethical Considerations

The current implementation of EnergyMamba operates at the regional level \(e\.g\., CBGs\) and does not explicitly model the physical topology of the power grid\. While this design is easily generalizable across different regions and avoids the need to track grid topology changes, it may overlook structural and operational dependencies\. As future work, this framework can be extended to incorporate grid topology and leverage adaptive graph learning to dynamically update network representations\.

We adhere to the KDD Code of Ethics. De-identified data were obtained from a municipal utility provider in Florida under an NDA and are stored and processed exclusively on FSU's secured computing facilities. We declare no conflicts of interest or foreseeable risks.

## Acknowledgment

We thank all the reviewers for their insightful feedback to improve this paper\. This work is partially supported by the FSU Startup Fund, FSU CRC Summer Research Support \(SRS\) Award Program, and FSU/AWS Research Acceleration Fund\.

## References

- A\. Alexandrov, K\. Benidis, M\. Bohlke\-Schneider, V\. Flunkert, J\. Gasthaus, T\. Januschowski, D\. C\. Maddix, S\. Rangapuram, D\. Salinas, J\. Schulz,et al\.\(2020\)Gluonts: probabilistic and neural time series modeling in python\.Journal of Machine Learning Research21\(116\),pp\. 1–6\.Cited by:[§B\.1](https://arxiv.org/html/2606.00506#A2.SS1.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2606.00506#S1.p2.1),[§5\.1\.2](https://arxiv.org/html/2606.00506#S5.SS1.SSS2.p1.1),[§6\.1](https://arxiv.org/html/2606.00506#S6.SS1.p1.1)\.
- T\. Alghamdi, K\. Elgazzar, M\. Bayoumi, T\. Sharaf, and S\. Shah \(2019\)Forecasting traffic congestion using arima modeling\.In2019 15th International Wireless Communications & Mobile Computing Conference \(IWCMC\),Vol\.,pp\. 1227–1232\.External Links:[Document](https://dx.doi.org/10.1109/IWCMC.2019.8766698)Cited by:[§1](https://arxiv.org/html/2606.00506#S1.p2.1)\.
- L\. Bai, L\. Yao, C\. Li, X\. Wang, and C\. Wang \(2020\) Adaptive graph convolutional recurrent network for traffic forecasting\. Advances in neural information processing systems 33, pp\. 17804–17815\.
- A\. Binbusayyis and M\. Sha \(2025\) Energy consumption prediction using modified deep cnn\-bi lstm with attention mechanism\. Heliyon 11\(1\)\.
- S\. Bouktif, A\. Fiaz, A\. Ouni, and M\. A\. Serhani \(2020\) Multi\-sequence lstm\-rnn deep learning and metaheuristics for electric load forecasting\. Energies 13\(2\), pp\. 391\.
- W\. G\. Buratto, R\. N\. Muniz, A\. Nied, and G\. V\. Gonzalez \(2024\) Seq2Seq\-lstm with attention for electricity load forecasting in brazil\. IEEE Access\.
- X\. Cheng, C\. Yang, Y\. Zhao, Y\. Wang, H\. Karimi, and T\. Derr \(2025\) BTS: a comprehensive benchmark for tie strength prediction\. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 2, pp\. 5345–5354\.
- M\. Deng, S\. Lu, J\. Shi, and W\. Zhang \(2026\) Adaptive traffic signal control optimization using a novel road partition and multi\-channel state representation method\. Urban Lifeline 4\(1\), pp\. 9\.
- I\. Gibbs and E\. Candes \(2021\) Adaptive conformal inference under distribution shift\. Advances in Neural Information Processing Systems 34, pp\. 1660–1672\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2021/file/0d441de75945e5acbc865406fc9a2559-Paper.pdf)\.
- A\. Gu and T\. Dao \(2024\) Mamba: linear\-time sequence modeling with selective state spaces\. In First conference on language modeling\. External Links: [Link](https://arxiv.org/abs/2312.00752)\.
- S\. Guo, Y\. Lin, N\. Feng, C\. Song, and H\. Wan \(2019\) Attention based spatial\-temporal graph convolutional networks for traffic flow forecasting\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 33, pp\. 922–929\.
- T\. Hong, P\. Wang, and H\. L\. Willis \(2011\) A naïve multiple linear regression benchmark for short term load forecasting\. In 2011 IEEE power and energy society general meeting, pp\. 1–6\.
- K\. Huang, Y\. Jin, E\. Candes, and J\. Leskovec \(2024\) Uncertainty quantification over graph with conformalized graph neural networks\. Advances in Neural Information Processing Systems 36\.
- N\. Jha, D\. Prashar, M\. Rashid, S\. K\. Gupta, and R\.K\. Saket \(2021\) Electricity load forecasting and feature extraction in smart grid using neural networks\. Computers & Electrical Engineering 96, pp\. 107479\. External Links: ISSN 0045\-7906, [Document](https://dx.doi.org/10.1016/j.compeleceng.2021.107479), [Link](https://www.sciencedirect.com/science/article/pii/S0045790621004341)\.
- L\. Jiang, Y\. Yang, and G\. Wang \(2025a\) HCRide: harmonizing passenger fairness and driver preference for human\-centered ride\-hailing\. In Proceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence, pp\. 10289–10297\.
- L\. Jiang, D\. Yu, R\. Xu, T\. Tang, and G\. Wang \(2025b\) Uncertainty\-aware predict\-then\-optimize framework for equitable post\-disaster power restoration\. In Proceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence, pp\. 9719–9727\.
- S\. Lan, Y\. Ma, W\. Huang, W\. Wang, H\. Yang, and P\. Li \(2022\) Dstagnn: dynamic spatial\-temporal aware graph neural network for traffic flow forecasting\. In International conference on machine learning, pp\. 11906–11917\.
- L\. Li, Z\. Chen, and Y\. Dong \(2026a\) LLM as clinical graph structure refiner: enhancing representation learning in eeg seizure diagnosis\. External Links: 2604\.28178, [Link](https://arxiv.org/abs/2604.28178)\.
- L\. Li, E\. E\. Ozguven, Y\. Zhao, G\. Wang, Y\. Xie, and Y\. Dong \(2025\) TyphoFormer: language\-augmented transformer for accurate typhoon track forecasting\. In Proceedings of the 33rd ACM International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’25, New York, NY, USA, pp\. 1174–1177\. External Links: ISBN 9798400720864, [Link](https://doi.org/10.1145/3748636.3763223), [Document](https://dx.doi.org/10.1145/3748636.3763223)\.
- L\. Li, K\. Yang, J\. Bi, and F\. Luo \(2024a\) STS\-ccl: spatial\-temporal synchronous contextual contrastive learning for urban traffic forecasting\. In ICASSP 2024 \- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\), pp\. 6705–6709\. External Links: [Document](https://dx.doi.org/10.1109/ICASSP48485.2024.10446624)\.
- W\. Li, Q\. Wang, Y\. Liu, M\. L\. Small, and J\. Gao \(2022\) A spatiotemporal decay model of human mobility when facing large\-scale crises\. Proceedings of the National Academy of Sciences 119\(33\), pp\. e2203042119\.
- X\. Li, J\. Cao, M\. Wang, Y\. Wu, L\. Yan, Y\. Zhou, Z\. Sha, and Y\. Ma \(2026b\) FAST: a synergistic framework of attention and state\-space models for spatiotemporal traffic prediction\. External Links: 2604\.13453, [Link](https://arxiv.org/abs/2604.13453)\.
- Y\. Li, R\. Yu, C\. Shahabi, and Y\. Liu \(2018\) Diffusion convolutional recurrent neural network: data\-driven traffic forecasting\. In International Conference on Learning Representations\. External Links: [Link](https://arxiv.org/abs/1707.01926)\.
- Z\. Li, L\. Xia, J\. Tang, Y\. Xu, L\. Shi, L\. Xia, D\. Yin, and C\. Huang \(2024b\) Urbangpt: spatio\-temporal large language models\. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 5351–5362\.
- M\. Liang, Y\. Hu, H\. Weng, J\. Xi, and B\. Yin \(2025\) EnergyGPT: fine\-tuning large language model for multi\-energy load forecasting\. Renewable Energy, pp\. 123313\.
- B\. Lim, S\. Ö\. Arık, N\. Loeff, and T\. Pfister \(2021\) Temporal fusion transformers for interpretable multi\-horizon time series forecasting\. International journal of forecasting 37\(4\), pp\. 1748–1764\.
- R\. Liu, C\. Li, H\. Tang, Y\. Ge, Y\. Shan, and G\. Li \(2024\) St\-llm: large language models are effective temporal learners\. In European Conference on Computer Vision, pp\. 1–18\.
- J\. Ma, F\. Li, and B\. Wang \(2024\) U\-mamba: enhancing long\-range dependency for biomedical image segmentation\. arXiv preprint arXiv:2401\.04722\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2401.04722), [Link](https://arxiv.org/abs/2401.04722)\.
- P\. A\. Moran \(1950\) Notes on continuous stochastic phenomena\. Biometrika 37\(1/2\), pp\. 17–23\.
- California Independent System Operator \(2024a\) Real\-time load data\. External Links: [Link](https://www.caiso.com/TodaysOutlook/Pages/default.aspx)\.
- New York Independent System Operator \(2024b\) Real\-time load data\. External Links: [Link](https://www.nyiso.com/load-data)\.
- D\. Peteleaza, A\. Matei, R\. Sorostinean, A\. Gellert, U\. Fiore, B\. Zamfirescu, and F\. Palmieri \(2024\) Electricity consumption forecasting for sustainable smart cities using machine learning methods\. Internet of Things 27, pp\. 101322\.
- F\. Piccialli, S\. Cuomo, D\. Crisci, E\. Prezioso, and G\. Mei \(2020\) A deep learning approach for facility patient attendance prediction based on medical booking data\. Scientific Reports 10\(1\), pp\. 14623\.
- L\. Pun, P\. Zhao, and X\. Liu \(2019\) A multiple regression approach for traffic flow estimation\. IEEE Access 7, pp\. 35998–36009\. External Links: [Document](https://dx.doi.org/10.1109/ACCESS.2019.2904645)\.
- F\. R\. Quintela, R\. C\. Redondo, N\. R\. Melchor, and M\. Redondo \(2009\) A general approach to kirchhoff’s laws\. IEEE Transactions on Education 52\(2\), pp\. 273–278\.
- Y\. Romano, E\. Patterson, and E\. Candes \(2019\) Conformalized quantile regression\. Advances in neural information processing systems 32\.
- N\. I\. Sapankevych and R\. Sankar \(2009\) Time series prediction using support vector machines: a survey\. IEEE computational intelligence magazine 4\(2\), pp\. 24–38\.
- A\. Shah, Y\. Bu, J\. K\. Lee, S\. Das, R\. Panda, P\. Sattigeri, and G\. W\. Wornell \(2022\) Selective regression under fairness criteria\. In International Conference on Machine Learning, pp\. 19598–19615\. External Links: [Link](https://proceedings.mlr.press/v162/shah22a.html)\.
- B\. Shen, Z\. Cheng, N\. Z\. Gong, F\. Yao, and Y\. Dong \(2026a\) CREDIT: certified ownership verification of deep neural networks against model extraction attacks\. arXiv preprint arXiv:2602\.20419\.
- B\. Shen, E\. Ozguven, Y\. Zhao, G\. Wang, Y\. Xie, and Y\. Dong \(2025\) Learning from the storm: a multivariate machine learning approach to predicting hurricane\-induced economic losses\. In Proceedings of the 1st ACM SIGSPATIAL International Workshop on Spatial Intelligence for Smart and Connected Communities, pp\. 1–4\.
- B\. Shen, M\. S\. Seraj, Z\. Cheng, S\. Chakraborty, and Y\. Dong \(2026b\) CITED: a decision boundary\-aware signature for gnns towards model extraction defense\. arXiv preprint arXiv:2602\.20418\.
- A\. Sokol, N\. Moniz, and N\. Chawla \(2024\) Conformalized selective regression\. arXiv preprint arXiv:2402\.16300\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2402.16300), [Link](https://arxiv.org/abs/2402.16300)\.
- S\. Tu, Y\. Zhang, J\. Zhang, Z\. Fu, Y\. Zhang, and Y\. Yang \(2024\) Powerpm: foundation model for power systems\. Advances in Neural Information Processing Systems 37, pp\. 115233–115260\.
- C\. Wang, O\. Tsepa, J\. Ma, and B\. Wang \(2024\) Graph\-mamba: towards long\-range graph sequence modeling with selective state spaces\. arXiv preprint arXiv:2402\.00789\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2402.00789), [Link](https://arxiv.org/abs/2402.00789)\.
- X\. Wu, J\. He, P\. Zhang, and J\. Hu \(2015\) Power system short\-term load forecasting based on improved random forest with grey relation projection\. Automation of Electric Power Systems 39\(12\), pp\. 50–55\.
- C\. Xiao and Y\. Liu \(2025\) A multifrequency data fusion deep learning model for carbon price prediction\. Journal of Forecasting 44\(2\), pp\. 436–458\.
- R\. Xu, K\. Cai, L\. Jiang, Z\. Hong, Y\. Tian, and G\. Wang \(2026a\) GeoGen: a two\-stage coarse\-to\-fine framework for fine\-grained synthetic location\-based social network trajectory generation\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 40, pp\. 1373–1381\.
- R\. Xu, Z\. Hong, and G\. Wang \(2025\) AutoSTDiff: autoregressive spatio\-temporal denoising diffusion model for asynchronous trajectory generation\. In Proceedings of the 2025 SIAM International Conference on Data Mining \(SDM\), pp\. 538–547\.
- R\. Xu, L\. Jiang, D\. Yu, X\. Li, and G\. Wang \(2026b\) SynHAT: a two\-stage coarse\-to\-fine diffusion framework for synthesizing human activity traces\. arXiv preprint arXiv:2604\.14705\.
- P\. Yang, N\. Akhtar, M\. Shah, and A\. Mian \(2024\) Regulating model reliance on non\-robust features by smoothing input marginal density\. In European Conference on Computer Vision, pp\. 329–347\.
- P\. Yang, N\. Akhtar, Z\. Wen, M\. Shah, and A\. S\. Mian \(2023\) Re\-calibrating feature attributions for model interpretation\. In International Conference on Learning Representations\.
- B\. Yu, H\. Yin, and Z\. Zhu \(2018\) Spatio\-temporal graph convolutional networks: a deep learning framework for traffic forecasting\. Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp\. 3634–3640\. External Links: ISBN 9780999241127\.
- D\. Yu, L\. Jiang, R\. Xu, and G\. Wang \(2026a\) HealthMamba: an uncertainty\-aware spatiotemporal graph state space model for effective and reliable healthcare facility visit prediction\. External Links: 2602\.05286, [Link](https://arxiv.org/abs/2602.05286)\.
- D\. Yu, R\. Xu, D\. Zhuang, Y\. Bu, S\. Wang, and G\. Wang \(2026b\) TrustEnergy: a unified framework for accurate and reliable user\-level energy usage prediction\. Proceedings of the AAAI Conference on Artificial Intelligence 40\(46\), pp\. 39558–39566\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/41307), [Document](https://dx.doi.org/10.1609/aaai.v40i46.41307)\.
- D\. Yu, D\. Zhuang, L\. Jiang, R\. Xu, X\. Ye, Y\. Bu, S\. Wang, and G\. Wang \(2025\) UQGNN: uncertainty quantification of graph neural networks for multivariate spatiotemporal prediction\. In Proceedings of the 33rd ACM International Conference on Advances in Geographic Information Systems, pp\. 52–65\.
- A\. Zeng, M\. Hong, X\. Chen, S\. Xu, A\. Sun, and A\. Srivastava \(2023\) A time series is worth 64 words: long\-term forecasting with transformers\. In International Conference on Learning Representations\. External Links: [Link](https://arxiv.org/abs/2211.05771)\.
- B\. Zhang and R\. Sennrich \(2019\) Root mean square layer normalization\. Advances in neural information processing systems 32\.
- Z\. Zhang, R\. Fu, Y\. He, X\. Shen, Y\. Wang, X\. Du, H\. You, K\. Jin, J\. Shi, and S\. Fong \(2026\) FinSentLLM: multi\-llm and structured semantic signals for enhanced financial sentiment forecasting\. In ICASSP 2026\-2026 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\), pp\. 17682–17686\.
- S\. Zhong and Z\. Sun \(2010\) Challenges and opportunities in emergency management of electric power system blackout\. In 2010 International Conference on E\-Product E\-Service and E\-Entertainment, pp\. 1–4\.
- D\. Zhuang, S\. Wang, H\. Koutsopoulos, and J\. Zhao \(2022\) Uncertainty quantification of sparse travel demand prediction with spatial\-temporal graph neural networks\. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, New York, NY, USA, pp\. 4639–4647\. External Links: ISBN 9781450393850, [Link](https://doi.org/10.1145/3534678.3539093), [Document](https://dx.doi.org/10.1145/3534678.3539093)\.

## Appendix

## Appendix A Data\-driven Analysis

In this part, we provide detailed descriptions of data collection, preprocessing, and analysis procedures\. The four datasets span different spatial regions and scales, temporal resolutions, and time periods, enabling a comprehensive evaluation of our framework’s generalizability\. A summary of datasets is provided in Table[4](https://arxiv.org/html/2606.00506#A1.T4)\.

### A\.1\. Florida Datasets \(Florida 1 & Florida 2\)

We have access to utility data from a municipal utility provider in Florida\. The data include household\-level electricity consumption from smart meters, which automatically record consumption at 30\-minute intervals and transmit the readings to a central data management system\. Each raw record contains a meter ID, a timestamp, and the energy consumption in kilowatt\-hours \(kWh\)\. We partition the Florida data into two datasets to verify the model’s generalization under different distribution shifts caused by different types of extreme weather events\. Florida 1 spans the entire year of 2018 \(January 1 to December 31\) and includes Hurricane Michael \(October 2018\), while Florida 2 covers the year 2019 and features record\-breaking heat waves \(May 2019\)\. Each dataset contains approximately 17,520 time steps \(48 readings per day $\times$ 365 days\)\. We also evaluated our framework on a full 10\-year continuous dataset, confirming its stability under regular multi\-year seasonal cycles\. Since standard seasonality is relatively predictable, we deliberately selected 2018 and 2019 for the main manuscript to stress\-test the model against extreme distribution shifts\.

At the utility scale, our roughly 5% system\-wide accuracy improvement can translate to massive cost savings by reducing reliance on spinning reserves and mitigating millions in over\-procurement penalties\. Furthermore, EnergyMamba achieves a roughly 6% improvement in uncertainty quantification \(Interval Score\)\. Because modern grid operations are deeply risk\-sensitive, this enhanced probabilistic reliability is critical for determining safe reserve margins and preventing blackouts during extreme events\. By jointly optimizing deterministic accuracy and probabilistic forecasting, EnergyMamba delivers both economic and operational value\.

Table 4\. Summary of four real\-world energy consumption datasets used in this study\.

### A\.2\. New York ISO Dataset \(NYISO\)

The New York dataset is sourced from the New York Independent System Operator \(NYISO\)\. The load data are collected by supervisory control and data acquisition \(SCADA\) systems that monitor real\-time power flows across the transmission network\. Load readings are aggregated at the zonal level, where each zone represents a distinct load pocket with similar electrical characteristics\. We collected data for the year 2024, recorded at 1\-hour intervals, resulting in 8,760 time steps per zone\. The hourly resolution aligns with standard electricity market operations and enables evaluation on coarser\-grained temporal patterns\.

### A\.3\. California ISO Dataset \(CAISO\)

The California dataset is obtained from the California Independent System Operator \(CAISO\), which oversees the operation of California’s bulk electric power system\. The dataset covers 9 Transmission Access Charge \(TAC\) areas, including Pacific Gas & Electric \(PG&E\), Southern California Edison \(SCE\), San Diego Gas & Electric \(SDG&E\), Valley Electric Association \(VEA\), and several smaller municipal utilities\. These regions represent California’s three major investor\-owned utilities and associated service territories\. As with the NYISO dataset, we collected hourly load data for the year 2024, yielding 8,760 time steps per region\.

### A\.4\. Data Preprocessing and Management

We applied consistent preprocessing procedures across all datasets to ensure data quality and compatibility with our modeling framework\.

Missing Value Handling\. For the Florida datasets, missing readings \(0\.3% of data\) were imputed using linear interpolation between adjacent time steps\. For the NYISO and CAISO datasets, missing values were rare \(less than 0\.1%\) and were filled using the same imputation strategy\.
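The imputation step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `interpolate_missing` and the toy readings are hypothetical, missing readings are assumed to be encoded as NaN, and gaps at the very start or end of a series (which the text does not discuss) are assumed to be filled with the nearest observed value.

```python
import numpy as np

def interpolate_missing(series):
    """Fill NaN readings by linear interpolation between observed neighbours."""
    values = np.asarray(series, dtype=float)
    missing = np.isnan(values)
    if missing.all():
        raise ValueError("cannot interpolate a series with no observed values")
    idx = np.arange(values.size)
    filled = values.copy()
    # np.interp interpolates linearly between observed points and, as an
    # assumption here, holds the nearest observed value at the series edges.
    filled[missing] = np.interp(idx[missing], idx[~missing], values[~missing])
    return filled

# Toy 30-minute readings in kWh (hypothetical values)
readings = [1.0, np.nan, 3.0, np.nan, np.nan, 6.0]
filled = interpolate_missing(readings)
```

Each meter or zone would be passed through the function separately so that values are never interpolated across nodes.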

Outlier Detection\. We identified outliers using the interquartile range \(IQR\) method\. Values lying more than 1\.5 times the IQR below the first quartile or above the third quartile were flagged and manually inspected\. Genuine extreme values \(e\.g\., during heat waves\) were retained, while obvious measurement errors were corrected using temporal interpolation\.
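A minimal sketch of the IQR flagging rule is given below. The function name `iqr_outlier_mask` and the sample values are hypothetical, and the quartiles use NumPy's default linear interpolation, which may differ from the quantile definition used in the study. The code only flags candidates; the manual inspection described above is not automated.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical load readings with one suspicious spike
load = np.array([10.0, 11.0, 12.0, 11.5, 10.5, 50.0, 11.0, 12.5])
flags = iqr_outlier_mask(load)
suspect_indices = np.flatnonzero(flags)  # positions to inspect by hand
```

Flagged positions that turn out to be measurement errors could then be set to NaN and filled by interpolation, while genuine extremes are left untouched.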

Normalization\. To ensure stable training, we apply a natural\-logarithm transformation to the consumption values of each node independently:

$$X^{\prime}=\ln(X+1), \tag{26}$$

where $X$ denotes the original dataset and $X^{\prime}$ is the normalized dataset\. All values are incremented by 1 to avoid undefined logarithms for zero\-valued entries\.
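Equation (26) can be illustrated with NumPy's `log1p`. The inverse mapping via `expm1`, used to bring predictions back to kWh, is an assumption of this sketch and is not described in the text; the function names are hypothetical.

```python
import numpy as np

def log_normalize(x):
    """Apply X' = ln(X + 1) element-wise; zero consumption maps to zero."""
    return np.log1p(np.asarray(x, dtype=float))

def log_denormalize(x_norm):
    """Invert the transform: X = exp(X') - 1 (assumed post-processing step)."""
    return np.expm1(np.asarray(x_norm, dtype=float))

# Hypothetical consumption values in kWh, including a zero reading
consumption = np.array([0.0, 0.5, 2.0, 10.0])
normalized = log_normalize(consumption)
restored = log_denormalize(normalized)
```

`log1p` and `expm1` are preferred over `np.log(x + 1)` and `np.exp(x) - 1` because they stay accurate for readings close to zero.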

Graph Construction\. Following Section[3\.1](https://arxiv.org/html/2606.00506#S3.SS1), we constructed adjacency matrices based on geographical distances between region centroids as a proximity\-based coupling proxy\. For the Florida datasets, we used census block group \(CBG\) centroids; for NYISO and CAISO, we used the geographical centers of each zone/region\.
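One common way to turn centroid distances into an adjacency matrix is a thresholded Gaussian kernel over great-circle distances. The sketch below assumes that form purely for illustration: the exact weighting is defined in Section 3.1, which is not reproduced here, and the kernel, the bandwidth `sigma_km`, the cutoff `threshold`, the removal of self-loops, and the coordinates are all hypothetical choices.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_matrix(lat_deg, lon_deg):
    """Pairwise great-circle distances (km) between centroids."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def distance_adjacency(lat_deg, lon_deg, sigma_km=50.0, threshold=0.1):
    """Assumed Gaussian-kernel adjacency: exp(-(d / sigma)^2), sparsified."""
    dist = haversine_matrix(lat_deg, lon_deg)
    weights = np.exp(-((dist / sigma_km) ** 2))
    weights[weights < threshold] = 0.0  # drop weak long-range links
    np.fill_diagonal(weights, 0.0)      # assumption: no self-loops
    return weights

# Hypothetical centroids: two nearby regions and one distant region
lats = [30.0, 30.1, 35.0]
lons = [-85.0, -85.0, -85.0]
adjacency = distance_adjacency(lats, lons)
```

With these toy coordinates the two nearby regions receive a strong edge, while the distant region is disconnected after thresholding.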

## Appendix B Experiment Setup

### B\.1\. Baseline

These baselines cover graph, attention, Transformer, LLM, and state\-space paradigms, and include both deterministic and probabilistic forecasting methods \(STZINB, UQGNN, and TrustEnergy\)\.

##### GNN\-based methods\.

DCRNN \(Li et al\., [2018](https://arxiv.org/html/2606.00506#bib.bib10)\) is a diffusion\-convolution seq2seq model; STGCN \(Yu et al\., [2018](https://arxiv.org/html/2606.00506#bib.bib9)\) combines graph and temporal convolutions; AGCRN \(Bai et al\., [2020](https://arxiv.org/html/2606.00506#bib.bib11)\) learns adaptive node\-wise dependencies; STZINB \(Zhuang et al\., [2022](https://arxiv.org/html/2606.00506#bib.bib26)\) is a probabilistic graph model with a zero\-inflated negative binomial formulation; UQGNN \(Yu et al\., [2025](https://arxiv.org/html/2606.00506#bib.bib21)\) focuses on multivariate uncertainty quantification; and TrustEnergy \(Yu et al\., [2026b](https://arxiv.org/html/2606.00506#bib.bib48)\) combines meta\-learning with conformal prediction\.

##### Attention\-based methods\.

DSTAGNN \(Lan et al\., [2022](https://arxiv.org/html/2606.00506#bib.bib16)\) is an attention\-based dynamic graph model, while ASTGCN \(Guo et al\., [2019](https://arxiv.org/html/2606.00506#bib.bib39)\) introduces explicit spatial\-temporal attention into graph convolution\.

##### Transformer\-based methods\.

GluonTS \(Alexandrov et al\., [2020](https://arxiv.org/html/2606.00506#bib.bib15)\) provides strong probabilistic forecasting baselines, PatchTST \(Zeng et al\., [2023](https://arxiv.org/html/2606.00506#bib.bib12)\) is a patch\-based Transformer for long\-range multivariate forecasting, and PowerPM \(Tu et al\., [2024](https://arxiv.org/html/2606.00506#bib.bib29)\) is a pre\-trained power forecasting model based on masked modeling and contrastive learning\.

##### LLM\-based methods\.

ST\-LLM \(Liu et al\., [2024](https://arxiv.org/html/2606.00506#bib.bib13)\) is a spatial\-temporal large language model for traffic forecasting, and UrbanGPT \(Li et al\., [2024b](https://arxiv.org/html/2606.00506#bib.bib14)\) is an instruction\-tuned urban forecasting framework built on an LLM backbone\.

##### Mamba\-based methods\.

G\-Mamba \(Wang et al\., [2024](https://arxiv.org/html/2606.00506#bib.bib42)\) combines graph modeling with Mamba blocks, while U\-Mamba \(Ma et al\., [2024](https://arxiv.org/html/2606.00506#bib.bib20)\) adopts a U\-shaped Mamba architecture for hierarchical sequence modeling\. This category is the closest architectural family to EnergyMamba and is therefore especially relevant for isolating the contribution of our graph\-enhanced design\.

### B\.2\. Metrics

We report MAE and RMSE for deterministic prediction, and Mean Prediction Interval Width \(MPIW\), Interval Score \(IS\), and Coverage \(COV\) for uncertainty quantification\. Here, $y_{i}$ and $\hat{y}_{i}$ denote the ground truth and prediction for the $i$\-th sample, and $[l_{i},u_{i}]$ denotes its prediction interval\. MAE and RMSE measure point\-forecast accuracy, with RMSE assigning a larger penalty to large deviations and thus being more sensitive to peak\-load errors\. For interval prediction, MPIW measures sharpness, $\text{MPIW}=\frac{1}{N}\sum_{i=1}^{N}(u_{i}-l_{i})$, Coverage measures calibration, and IS summarizes both by penalizing intervals that are either too wide or fail to cover the ground truth\. Reporting these metrics together avoids favoring trivially wide intervals that attain high coverage but poor usefulness\.

$$\text{MAE}=\frac{1}{N}\sum_{i=1}^{N}|y_{i}-\hat{y}_{i}|,\quad\text{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}}. \tag{27}$$

$$\text{IS}=\frac{1}{N}\sum_{i=1}^{N}\Big[(u_{i}-l_{i})+\frac{2}{\alpha}(l_{i}-y_{i})\mathbb{I}(y_{i}<l_{i})+\frac{2}{\alpha}(y_{i}-u_{i})\mathbb{I}(y_{i}>u_{i})\Big]. \tag{28}$$

$$\text{Coverage}=\frac{100\%}{N}\sum_{i=1}^{N}\mathbb{I}(l_{i}\leq y_{i}\leq u_{i}). \tag{29}$$

Lower MAE, RMSE, MPIW, and IS are better, while Coverage should be close to the target confidence level\. In practice, a good uncertainty estimator should achieve reliable Coverage without excessively increasing MPIW\. If Coverage is substantially below the target level, the intervals are under\-dispersed and the model is overconfident; if it is much higher than the target, the intervals are often overly conservative\.
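The metrics in Equations (27)–(29), together with MPIW, can be computed as in the sketch below. Here `alpha` is taken to be the target miscoverage level (nominal coverage of 1 − alpha); the default of 0.1, the function names, and the toy data are assumptions made for illustration rather than the settings used in the experiments.

```python
import numpy as np

def point_metrics(y, y_hat):
    """MAE and RMSE as in Eq. (27)."""
    err = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return {"MAE": np.mean(np.abs(err)), "RMSE": np.sqrt(np.mean(err ** 2))}

def interval_metrics(y, lower, upper, alpha=0.1):
    """MPIW, Interval Score (Eq. 28) and Coverage in percent (Eq. 29)."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    width = upper - lower
    below = (lower - y) * (y < lower)  # penalty when truth falls under the interval
    above = (y - upper) * (y > upper)  # penalty when truth exceeds the interval
    return {
        "MPIW": np.mean(width),
        "IS": np.mean(width + (2.0 / alpha) * (below + above)),
        "COV": 100.0 * np.mean((lower <= y) & (y <= upper)),
    }

# Toy example (hypothetical values): symmetric intervals of half-width 1.5
y = np.array([10.0, 12.0, 9.0, 15.0])
y_hat = np.array([11.0, 12.0, 8.0, 13.0])
point = point_metrics(y, y_hat)
interval = interval_metrics(y, y_hat - 1.5, y_hat + 1.5, alpha=0.1)
```

In this toy case only the last observation falls outside its interval, so Coverage is 75% and the single miss raises IS well above MPIW, which shows how IS penalizes sharp but miscalibrated intervals.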

## GenAI Disclosure

In the preparation of this work, the authors utilized Generative AI tools solely for the purpose of language refinement and improving readability\. No AI tools were used to generate scientific concepts, experimental results, or the intellectual content of this paper\. The authors have reviewed all AI\-assisted edits and take full responsibility for the final content of the manuscript\.