Ensemble Score Filtering for Real-Data Energy Consumption Forecast Correction

arXiv cs.LG Papers

Summary

This paper proposes using the Ensemble Score Filter (EnSF), a score-based diffusion data assimilation method, to correct forecasts from a pretrained spatio-temporal energy consumption model using noisy partial observations. Numerical experiments show EnSF significantly improves state estimation over open-loop propagation and outperforms the Ensemble Kalman Filter under nonlinear observations.

arXiv:2605.29072v1 Announce Type: new Abstract: Accurate estimation and forecasting of energy consumption are important for power-system operation, planning, and demand-side management. In practice, however, complete and timely measurements may not always be available, and the observed data can be partial, noisy, or delayed. This motivates the use of learned forecasting models for predicting the evolving consumption state, together with data assimilation methods for sequential forecast correction. In this work, we study a high-dimensional data assimilation problem for real energy-consumption data. \modeltext{The forward prediction is supplied by a pretrained black-box spatio-temporal forecasting model, which is treated as the state propagator in the filtering procedure.} We employ the Ensemble Score Filter (EnSF) to assimilate partial and noisy observations and to correct the forecast trajectory over time. The EnSF uses score-based diffusion models to approximate filtering distributions and avoids retraining neural-network score models during assimilation by using a closed-form score representation and Monte Carlo approximation. Numerical experiments demonstrate that open-loop propagation of the learned forecasting model can become unreliable over long horizons, while EnSF-based correction substantially improves state estimation. Comparisons with the Ensemble Kalman Filter (EnKF) further show that EnSF provides stronger correction under the nonlinear observation setting considered in this work.
Original Article
View Cached Full Text

Cached at: 05/29/26, 09:15 AM

# Ensemble Score Filtering for Real-Data Energy Consumption Forecast Correction
Source: [https://arxiv.org/html/2605.29072](https://arxiv.org/html/2605.29072)
###### Abstract

Accurate estimation and forecasting of energy consumption are important for power\-system operation, planning, and demand\-side management\. In practice, however, complete and timely measurements may not always be available, and the observed data can be partial, noisy, or delayed\. This motivates the use of learned forecasting models for predicting the evolving consumption state, together with data assimilation methods for sequential forecast correction\. In this work, we study a high\-dimensional data assimilation problem for real energy\-consumption data\.The forward prediction is supplied by a pretrained black\-box spatio\-temporal forecasting model, which is treated as the state propagator in the filtering procedure\.We employ the Ensemble Score Filter \(EnSF\) to assimilate partial and noisy observations and to correct the forecast trajectory over time\. The EnSF uses score\-based diffusion models to approximate filtering distributions and avoids retraining neural\-network score models during assimilation by using a closed\-form score representation and Monte Carlo approximation\. Numerical experiments demonstrate that open\-loop propagation of the learned forecasting model can become unreliable over long horizons, while EnSF\-based correction substantially improves state estimation\. Comparisons with the Ensemble Kalman Filter \(EnKF\) further show that EnSF provides stronger correction under the nonlinear observation setting considered in this work\.

###### keywords:

energy assimilation , power consumption , data assimilation , ensemble score filter , ensemble Kalman filter , STLLM

\\affiliation

\[math\]organization=Department of Mathematics, Florida State University, city=Tallahassee, postcode=32306, state=Florida, country=USA\\affiliation\[cs\]organization=Department of Computer Science, Florida State University, city=Tallahassee, postcode=32306, state=Florida, country=USA\\affiliation\[ornl\]organization=Computer Science and Mathematics Division, Oak Ridge National Laboratory, city=Oak Ridge, postcode=37831, state=Tennessee, country=USA

## 1Introduction

\[Introduction writing plan\. The introduction should follow this logic: \(i\) motivate real\-data energy\-consumption forecasting and power\-consumption state estimation; \(ii\) insert a model\-side literature review and motivation for using a pretrained spatio\-temporal forecasting model; \(iii\) explain why learned forward prediction alone is insufficient when accurate state information is delayed, partial, noisy, or unavailable; \(iv\) introduce data assimilation as the model\-data fusion framework for forecast correction; \(v\) review optimal filtering, EnKF, particle filters, and high\-dimensional nonlinear filtering challenges; \(vi\) introduce EnSF briefly as the score\-based data assimilation method used in this paper, leaving technical details to the methodology section; \(vii\) summarize the paper objective and numerical comparisons\. The model\-related literature\-review paragraph should be inserted after the general energy\-consumption forecasting motivation and before the data\-assimilation motivation\. \]

Accurate energy\-consumption forecasting is an important component of power\-system operation, planning, and demand\-side management\. Electricity demand has continued to grow rapidly in recent years\. The International Energy Agency reports that global electricity demand increased by4\.3%4\.3\\%in 2024 and is expected to grow at close to4%4\\%annually through 2027, driven by industrial activity, air conditioning, electrification, and data\-center expansion\[[26](https://arxiv.org/html/2605.29072#bib.bib270)\]\. Reliable load forecasting is therefore essential for generation scheduling, resource allocation, infrastructure planning, and operational reliability\[[40](https://arxiv.org/html/2605.29072#bib.bib271),[23](https://arxiv.org/html/2605.29072#bib.bib272)\]\. However, real energy\-consumption data often exhibit strong temporal variability, user\-level heterogeneity, nonlinear dependence on external conditions, and noisy or incomplete measurements\. Moreover, in practical forecasting settings, complete and timely measurements of the current consumption state may not always be available\. These features make accurate state estimation and long\-horizon prediction challenging, especially for high\-dimensional user\-level consumption trajectories\[[1](https://arxiv.org/html/2605.29072#bib.bib273)\]\.

\[Model\-side literature review placeholder\. A concise literature paragraph on machine\-learning\-based energy\-consumption forecasting should be inserted here if suitable references are provided\. This paragraph should mention representative spatio\-temporal forecasting models, attention\-based time\-series models, graph/spatial dependency modeling if relevant, and the motivation for using a pretrained forecasting model as the forward propagator\. \]

Recent advances in machine learning have provided powerful tools for energy\-consumption forecasting\. In particular, deep spatio\-temporal models can learn temporal dependence from historical consumption records while also extracting cross\-user or cross\-location dependence from high\-dimensional data\. Compared with classical statistical forecasting models, these learned models are better suited for capturing nonlinear temporal patterns and heterogeneous user\-level behavior\. In this work, the forward model is a pretrained spatio\-temporal forecasting model developed for real\-data energy\-consumption prediction\. The model takes historical consumption information as input and predicts the next\-step energy\-consumption state for all users in the dataset\.

The pretrained forecasting model used in this paper is treated as a learned black\-box state propagator\. Its parameters are fixed throughout the data assimilation experiments\. The purpose of the present paper is not to retrain or modify the forecasting model, but to investigate how data assimilation can be used to correct the forecast trajectory generated by such a pretrained model\. The model architecture, training procedure, and dataset construction are described in the following sections\.

Although modern machine\-learning forecasting models can capture complex spatio\-temporal patterns from historical data, their predictions may still deviate from the true consumption trajectory when the model is recursively propagated over long horizons\. This issue is not only a forecasting\-model problem, but also a state\-estimation problem: the latent energy\-consumption state evolves dynamically, while the available observations may be partial, noisy, nonlinear, or delayed\. Therefore, after a learned forward model has been constructed, a natural next step is to combine the model forecast with observational information collected over time\. Data assimilation provides a systematic framework for this model\-data fusion task\[[29](https://arxiv.org/html/2605.29072#bib.bib60)\]\.

Data assimilation plays a crucial role in bridging the gap between model predictions and observational data\. In sequential forecasting problems, a forward model provides a prediction of the system state, while observations provide incomplete and noisy information about the true state\. By systematically integrating these two sources of information, data assimilation methods refine state estimates and account for uncertainties arising from both the model and the data\. This is particularly important for high\-dimensional real\-data forecasting problems, where direct reliance on the forward model alone may lead to inaccurate long\-horizon prediction\[[15](https://arxiv.org/html/2605.29072#bib.bib61)\]\.

The primary mathematical framework for addressing data assimilation problems isoptimal filtering\. The goal of the optimal filtering problem is to approximate the conditional probability density function of the state process given the observational data, which is referred to as thefiltering distribution\[[8](https://arxiv.org/html/2605.29072#bib.bib173),[37](https://arxiv.org/html/2605.29072#bib.bib163)\]\. The corresponding conditional expectation provides the optimal state estimate\. Bayesian inference plays a central role in solving optimal filtering problems, with Kalman\-type filters being a prominent example\. For linear state dynamics and linear observations under Gaussian assumptions, the Kalman filter provides an analytical solution\.

For nonlinear systems, ensemble Kalman filters \(EnKF\) were developed as computationally efficient ensemble\-based approximations\[[19](https://arxiv.org/html/2605.29072#bib.bib3),[24](https://arxiv.org/html/2605.29072#bib.bib11)\]\. EnKF has been widely used in high\-dimensional applications because it avoids explicit propagation of the full covariance matrix and instead represents uncertainty through an ensemble of forecast samples\. However, the EnKF update is based on Gaussian\-type approximations of the filtering distribution, which may become restrictive when the dynamics, observations, or posterior distributions are strongly nonlinear or non\-Gaussian\. In such cases, the filtering distribution may contain structures that cannot be accurately represented by only its ensemble mean and covariance\.

Beyond EnKF, several methods have been proposed for nonlinear optimal filtering, including particle filters\[[2](https://arxiv.org/html/2605.29072#bib.bib12),[16](https://arxiv.org/html/2605.29072#bib.bib13),[21](https://arxiv.org/html/2605.29072#bib.bib14),[27](https://arxiv.org/html/2605.29072#bib.bib15),[32](https://arxiv.org/html/2605.29072#bib.bib16),[34](https://arxiv.org/html/2605.29072#bib.bib17),[9](https://arxiv.org/html/2605.29072#bib.bib18)\], Zakai filters\[[38](https://arxiv.org/html/2605.29072#bib.bib19),[31](https://arxiv.org/html/2605.29072#bib.bib20)\], and methods based on stochastic partial differential equations\[[10](https://arxiv.org/html/2605.29072#bib.bib21),[5](https://arxiv.org/html/2605.29072#bib.bib22),[6](https://arxiv.org/html/2605.29072#bib.bib23),[12](https://arxiv.org/html/2605.29072#bib.bib24),[25](https://arxiv.org/html/2605.29072#bib.bib25),[20](https://arxiv.org/html/2605.29072#bib.bib26)\]\. Particle filters, also known as sequential Monte Carlo methods, are well suited for representing non\-Gaussian filtering distributions in moderate\-dimensional problems\. However, their performance can deteriorate in high\-dimensional settings due to weight degeneracy and the difficulty of high\-dimensional sampling in Bayesian inference\. SPDE\-based filtering methods provide a rigorous probabilistic formulation for nonlinear filtering, but their numerical implementation can become computationally expensive as the state dimension increases\[[8](https://arxiv.org/html/2605.29072#bib.bib173),[39](https://arxiv.org/html/2605.29072#bib.bib27),[14](https://arxiv.org/html/2605.29072#bib.bib28),[17](https://arxiv.org/html/2605.29072#bib.bib29)\]\. These challenges motivate filtering methods that can handle nonlinear and non\-Gaussian state estimation while remaining computationally feasible in high\-dimensional systems\.

In this work, we employ the Ensemble Score Filter \(EnSF\) as a score\-based data assimilation method for correcting forecasts generated by a pretrained energy\-consumption forecasting model\. The EnSF is built upon score\-based diffusion models\[[18](https://arxiv.org/html/2605.29072#bib.bib227),[22](https://arxiv.org/html/2605.29072#bib.bib1),[35](https://arxiv.org/html/2605.29072#bib.bib242),[36](https://arxiv.org/html/2605.29072#bib.bib125)\], which characterize probability distributions through their score functions and generate samples through reverse\-time diffusion processes\. Traditional diffusion\-based filtering methods\[[13](https://arxiv.org/html/2605.29072#bib.bib265)\]estimate score functions with neural networks, which may require repeated training and large storage costs in sequential high\-dimensional filtering problems\. In contrast, EnSF avoids training a score network\. It directly approximates the score function using a closed\-form expression and mini\-batch Monte Carlo estimators, and incorporates observational information through an analytical update mechanism\. This provides a practical score\-based filtering framework for high\-dimensional nonlinear data assimilation\[[4](https://arxiv.org/html/2605.29072#bib.bib120),[33](https://arxiv.org/html/2605.29072#bib.bib124),[30](https://arxiv.org/html/2605.29072#bib.bib123),[11](https://arxiv.org/html/2605.29072#bib.bib122)\]\.

The objective of this paper is to investigate EnSF\-based forecast correction for real\-data energy\-consumption prediction\. The pretrained forecasting model is treated as a black\-box forward propagator, while the filtering procedure sequentially assimilates partial and noisy observations to correct the forecast trajectory\. The main point of the proposed framework is that the learned forecasting model supplies dynamical prediction when complete current\-state information is unavailable, while data assimilation prevents the forecast from relying solely on open\-loop model propagation\. We compare the open\-loop forecast, EnSF\-corrected forecast, and EnKF\-corrected forecast under different observation settings\. The numerical results show that data assimilation substantially improves long\-horizon state estimation and that EnSF provides stronger correction than EnKF in the nonlinear observation setting considered in this work\.

The rest of this paper is organized as follows\. Section[2](https://arxiv.org/html/2605.29072#S2)introduces the pretrained energy\-consumption forecasting model used as the forward propagator\. Section[3](https://arxiv.org/html/2605.29072#S3)formulates the data assimilation problem and presents the Ensemble Score Filter methodology\. Section[4](https://arxiv.org/html/2605.29072#S4)presents numerical experiments on real energy\-consumption data\. These experiments evaluate the behavior of the pretrained forecasting models, demonstrate the need for data assimilation in long\-horizon forecast correction, and compare EnSF with EnKF under partial and nonlinear observation settings\. Finally, the conclusion summarizes the main findings and discusses possible directions for future work\.

## 2Pretrained energy\-consumption forecasting model

The forward model used in this work is a pretrained spatio\-temporal large language model \(STLLM\) designed for energy\-consumption forecasting\. The model adapts several components commonly used in modern large language models, including RMSNorm, rotary positional embeddings \(RoPE\), and SwiGLU feed\-forward networks, to the spatio\-temporal forecasting setting\. Its main architectural feature is a factored attention structure that separates temporal dependency modeling from spatial dependency modeling\.

### 2\.1Input\-output formulation

The input to STLLM is a four\-dimensional tensor

𝐗in∈ℝB×T×N×Di​n,\\mathbf\{X\}^\{\\mathrm\{in\}\}\\in\\mathbb\{R\}^\{B\\times T\\times N\\times D\_\{in\}\},\(1\)whereBBis the batch size,TTis the historical sequence length,NNis the number of users or spatial nodes, andDi​nD\_\{in\}is the input feature dimension\. The model outputs a prediction tensor

𝐙^∈ℝB×H×N×Do​u​t,\\widehat\{\\mathbf\{Z\}\}\\in\\mathbb\{R\}^\{B\\times H\\times N\\times D\_\{out\}\},\(2\)whereHHis the forecast horizon andDo​u​tD\_\{out\}is the output feature dimension\. The notation𝐙^\\widehat\{\\mathbf\{Z\}\}is used here for the model prediction tensor in order to distinguish it from the observation variableYYused in the data assimilation formulation\.

Abstractly, for a historical input window, the pretrained model defines a learned forecasting map

X^tn\+1=fθ​\(Xtn−T\+1,…,Xtn\),\\widehat\{X\}\_\{t\_\{n\+1\}\}=f\_\{\\theta\}\\left\(X\_\{t\_\{n\-T\+1\}\},\\ldots,X\_\{t\_\{n\}\}\\right\),\(3\)whereθ\\thetadenotes the fixed pretrained model parameters\. For notational simplicity in the data\-assimilation formulation, we write the learned forecast map asfθf\_\{\\theta\}\. Throughout this paper, the parametersθ\\thetaare fixed\. The forecasting model is treated as a black\-box state propagator, and the role of data assimilation is to correct the forecasted state using observational information rather than to retrain the model\.

### 2\.2Architecture summary

The architecture consists of an input embedding layer, stacked STLLM blocks, and an output projection layer\. The embedding layer maps the raw input feature into a latent space of dimensionddand adds a learnable node embedding to encode the identity of each user or spatial node\. This allows the model to distinguish different users without requiring an explicit graph adjacency matrix as a mandatory input\.

Each STLLM block contains three main sub\-layers with residual connections: temporal attention, spatial attention, and a SwiGLU feed\-forward network\. Temporal attention is applied along the historical time dimension for each node, and RoPE is used to encode relative temporal position information\. Spatial attention is applied across users at each time step, allowing the model to learn inter\-user dependency patterns adaptively\. The feed\-forward component uses a SwiGLU activation to improve the expressive capability of the hidden representation\. All sub\-layers use RMSNorm and pre\-normalization residual connections\.

After the final STLLM block, the hidden representation at the last input time step is selected and projected into the prediction space\. This produces the next\-step forecast used in the subsequent data\-assimilation procedure\.

Table 1:Summary of the STLLM architectural components\. The model\-related descriptions are marked in blue for review\.ComponentStandard TransformerSTLLMNormalizationLayerNormRMSNormPosition encodingSinusoidal / learnedRoPEFeed\-forward activationReLU / GELUSwiGLUAttention structureJoint attentionFactored temporal\-spatial attentionNormalization placementPost\-normPre\-normSpatial modelingUsually graph\-dependent or implicitAdaptive spatial attention

## 3Estimation of states through data assimilation

### 3\.1The data assimilation problem

We formulate the forecast\-correction task as a sequential data assimilation problem\. LetXtn∈ℝdxX\_\{t\_\{n\}\}\\in\\mathbb\{R\}^\{d\_\{x\}\}denote the energy\-consumption state at timetnt\_\{n\}\. The state dimensiondxd\_\{x\}depends on the set of energy\-consuming users included in the dataset\. The pretrained forecasting model introduced above is used as the forward propagator, and the state evolution can be formulated as

Xtn\+1=fθ​\(Xtn,ηtn\),n=0,1,2,…,X\_\{t\_\{n\+1\}\}=f\_\{\\theta\}\(X\_\{t\_\{n\}\},\\eta\_\{t\_\{n\}\}\),\\qquad n=0,1,2,\\ldots,\(4\)wherefθf\_\{\\theta\}denotes the learned black\-box forecast model andηtn\\eta\_\{t\_\{n\}\}represents model error or stochastic uncertainty\.

The goal of the data assimilation procedure is to fuse observational information aboutXXwith the forecast obtained from the forward model \([4](https://arxiv.org/html/2605.29072#S3.E4)\) in order to obtain an improved estimate of the state\. We denote the observation at timetn\+1t\_\{n\+1\}byYtn\+1Y\_\{t\_\{n\+1\}\}, given by

Ytn\+1=ℋn\+1​\(Xtn\+1\)\+ϵtn\+1,Y\_\{t\_\{n\+1\}\}=\\mathcal\{H\}\_\{n\+1\}\(X\_\{t\_\{n\+1\}\}\)\+\\epsilon\_\{t\_\{n\+1\}\},\(5\)whereYtn\+1Y\_\{t\_\{n\+1\}\}is the observation vector,ℋn\+1\\mathcal\{H\}\_\{n\+1\}is the observation operator that may provide incomplete and indirect measurements of the state, andϵtn\+1\\epsilon\_\{t\_\{n\+1\}\}denotes observational noise\.

With \([4](https://arxiv.org/html/2605.29072#S3.E4)\) and \([5](https://arxiv.org/html/2605.29072#S3.E5)\), the data assimilation problem becomes finding the best estimate of the latent state from all available observations\. The optimal filtering estimate at timetn\+1t\_\{n\+1\}is given by

X^tn\+1:=𝔼​\[Xtn\+1∣Yt1:tn\+1\],\\hat\{X\}\_\{t\_\{n\+1\}\}:=\\mathbb\{E\}\\\!\\left\[X\_\{t\_\{n\+1\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\+1\}\}\\right\],\(6\)whereYt1:tn\+1Y\_\{t\_\{1\}:t\_\{n\+1\}\}denotes the observational information collected fromt1t\_\{1\}totn\+1t\_\{n\+1\}\. Equivalently, the estimate is the conditional expectation ofXtn\+1X\_\{t\_\{n\+1\}\}with respect to theσ\\sigma\-algebra generated by these observations\. For nonlinear high\-dimensional systems, this conditional distribution is generally unavailable in closed form\. Instead, one works on approximating the conditional probability density function of the state,p​\(Xtn\+1∣Yt1:tn\+1\),p\(X\_\{t\_\{n\+1\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\+1\}\}\),which is referred to as the filtering distribution\.

The standard approach to solving the data assimilation problem is the Bayesian filtering framework\. The recursive Bayesian filter consists of a prediction step and an update step\. Suppose that the filtering distributionp​\(Xtn∣Yt1:tn\)p\(X\_\{t\_\{n\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\}\}\)has been approximated at timetnt\_\{n\}\. By using the Chapman–Kolmogorov formula, one propagates the state equation in \([4](https://arxiv.org/html/2605.29072#S3.E4)\) fromtnt\_\{n\}totn\+1t\_\{n\+1\}and obtains the prior filtering distribution

p​\(Xtn\+1∣Yt1:tn\)=∫p​\(Xtn\+1∣Xtn\)​p​\(Xtn∣Yt1:tn\)​𝑑Xtn\.p\(X\_\{t\_\{n\+1\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\}\}\)=\\int p\(X\_\{t\_\{n\+1\}\}\\mid X\_\{t\_\{n\}\}\)p\(X\_\{t\_\{n\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\}\}\)\\,dX\_\{t\_\{n\}\}\.\(7\)Here,p​\(Xtn\+1∣Xtn\)p\(X\_\{t\_\{n\+1\}\}\\mid X\_\{t\_\{n\}\}\)is the transition probability derived from the state dynamics introduced in \([4](https://arxiv.org/html/2605.29072#S3.E4)\)\. The resulting prior filtering distributionp​\(Xtn\+1∣Yt1:tn\)p\(X\_\{t\_\{n\+1\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\}\}\)combines the observational information collected up to timetnt\_\{n\}with the model information propagated to timetn\+1t\_\{n\+1\}\.

After the new observationYtn\+1Y\_\{t\_\{n\+1\}\}becomes available, the update step incorporates it through Bayesian formula:

p​\(Xtn\+1∣Yt1:tn\+1\)∝p​\(Xtn\+1∣Yt1:tn\)​p​\(Ytn\+1∣Xtn\+1\)\.p\(X\_\{t\_\{n\+1\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\+1\}\}\)\\propto p\(X\_\{t\_\{n\+1\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\}\}\)p\(Y\_\{t\_\{n\+1\}\}\\mid X\_\{t\_\{n\+1\}\}\)\.\(8\)If the observational noise is Gaussian with covariance matrixRn\+1R\_\{n\+1\}, the likelihood is

p​\(Ytn\+1∣Xtn\+1\)∝exp⁡\[−12​\(ℋn\+1​\(Xtn\+1\)−Ytn\+1\)⊤​Rn\+1−1​\(ℋn\+1​\(Xtn\+1\)−Ytn\+1\)\]\.p\(Y\_\{t\_\{n\+1\}\}\\mid X\_\{t\_\{n\+1\}\}\)\\propto\\exp\\\!\\left\[\-\\frac\{1\}\{2\}\\left\(\\mathcal\{H\}\_\{n\+1\}\(X\_\{t\_\{n\+1\}\}\)\-Y\_\{t\_\{n\+1\}\}\\right\)^\{\\top\}R\_\{n\+1\}^\{\-1\}\\left\(\\mathcal\{H\}\_\{n\+1\}\(X\_\{t\_\{n\+1\}\}\)\-Y\_\{t\_\{n\+1\}\}\\right\)\\right\]\.\(9\)
This Bayesian formulation provides the foundation for the filtering methods considered in this work\. In the following section, we introduce the Ensemble Score Filter for solving the data assimilation problem by utilizing the concept of score\-based diffusion models within the generative artificial intelligence framework\.

### 3\.2Score\-based diffusion model and training\-free score approximation

We first introduce the score\-based diffusion formulation that will be used to represent and sample from filtering distributions\. The purpose of the diffusion construction is to transform a target distribution into a standard Gaussian distribution through a forward stochastic process, and then recover samples from the target distribution by solving the corresponding reverse\-time process\. A key quantity in this procedure is the score function, which stores the distributional information needed to guide the reverse\-time sampling\.

LetZτ∈ℝdxZ\_\{\\tau\}\\in\\mathbb\{R\}^\{d\_\{x\}\}be a diffusion process defined on the pseudo\-time intervalτ∈\[0,1\]\\tau\\in\[0,1\]\. The variableτ\\tauis independent of the physical time variable in the data assimilation problem\. We consider the forward SDE

d​Zτ=b​\(τ\)​Zτ​d​τ\+σ​\(τ\)​d​Wτ,dZ\_\{\\tau\}=b\(\\tau\)Z\_\{\\tau\}\\,d\\tau\+\\sigma\(\\tau\)\\,dW\_\{\\tau\},\(10\)whereWτW\_\{\\tau\}is a standard Brownian motion, andb​\(τ\)b\(\\tau\)andσ​\(τ\)\\sigma\(\\tau\)are the drift and diffusion coefficients, respectively\. Following the score\-based diffusion construction in\[[36](https://arxiv.org/html/2605.29072#bib.bib125)\], we choose

b​\(τ\)=d​log⁡ατd​τ,σ2​\(τ\)=d​βτ2d​τ−2​d​log⁡ατd​τ​βτ2,b\(\\tau\)=\\frac\{d\\log\\alpha\_\{\\tau\}\}\{d\\tau\},\\qquad\\sigma^\{2\}\(\\tau\)=\\frac\{d\\beta\_\{\\tau\}^\{2\}\}\{d\\tau\}\-2\\frac\{d\\log\\alpha\_\{\\tau\}\}\{d\\tau\}\\beta\_\{\\tau\}^\{2\},\(11\)withατ=1−τ\\alpha\_\{\\tau\}=1\-\\tauandβτ2=τ\\beta\_\{\\tau\}^\{2\}=\\tauforτ∈\[0,1\]\\tau\\in\[0,1\]\. Since \([10](https://arxiv.org/html/2605.29072#S3.E10)\) is linear, this choice gives the conditional density

Qτ​\(Zτ∣Z0\)=𝒩​\(ατ​Z0,βτ2​Idx\)Q\_\{\\tau\}\(Z\_\{\\tau\}\\mid Z\_\{0\}\)=\\mathcal\{N\}\\left\(\\alpha\_\{\\tau\}Z\_\{0\},\\beta\_\{\\tau\}^\{2\}I\_\{d\_\{x\}\}\\right\)\(12\)for any fixed initial valueZ0Z\_\{0\}\. Hence, the forward diffusion maps the initial target distribution toward the standard Gaussian distribution asτ→1\\tau\\to 1\.

To recover samples from the target distribution, one considers the reverse\-time SDE

d​Zτ=\[b​\(τ\)​Zτ−σ2​\(τ\)​𝒮​\(Zτ,τ\)\]​d​τ\+σ​\(τ\)​d​W←τ,dZ\_\{\\tau\}=\\left\[b\(\\tau\)Z\_\{\\tau\}\-\\sigma^\{2\}\(\\tau\)\\mathcal\{S\}\(Z\_\{\\tau\},\\tau\)\\right\]d\\tau\+\\sigma\(\\tau\)d\\overleftarrow\{W\}\_\{\\tau\},\(13\)whered​W←τd\\overleftarrow\{W\}\_\{\\tau\}denotes a backward Itô stochastic integral\[[28](https://arxiv.org/html/2605.29072#bib.bib129),[7](https://arxiv.org/html/2605.29072#bib.bib126)\]\. The score function is defined by

𝒮​\(z,τ\):=∇zlog⁡Qτ​\(z\),\\mathcal\{S\}\(z,\\tau\):=\\nabla\_\{z\}\\log Q\_\{\\tau\}\(z\),\(14\)whereQτQ\_\{\\tau\}denotes the probability density ofZτZ\_\{\\tau\}\. If the score function is available, then the reverse\-time process \([13](https://arxiv.org/html/2605.29072#S3.E13)\) transports samples from the standard Gaussian distribution atτ=1\\tau=1back to the target distribution atτ=0\\tau=0\[[3](https://arxiv.org/html/2605.29072#bib.bib119)\]\.

The main computational issue is the approximation of𝒮​\(z,τ\)\\mathcal\{S\}\(z,\\tau\)\. LetQ0Q\_\{0\}be the target distribution atτ=0\\tau=0\. From \([12](https://arxiv.org/html/2605.29072#S3.E12)\), the marginal density ofZτZ\_\{\\tau\}isQτ​\(z\)=∫ℝdxQτ​\(z∣z0\)​Q0​\(z0\)​𝑑z0Q\_\{\\tau\}\(z\)=\\int\_\{\\mathbb\{R\}^\{d\_\{x\}\}\}Q\_\{\\tau\}\(z\\mid z\_\{0\}\)Q\_\{0\}\(z\_\{0\}\)\\,dz\_\{0\}\. Therefore, the score function can be rewritten as

𝒮​\(z,τ\)\\displaystyle\\mathcal\{S\}\(z,\\tau\)=∇zlog⁡\(∫ℝdxQτ​\(z∣z0\)​Q0​\(z0\)​𝑑z0\)\\displaystyle=\\nabla\_\{z\}\\log\\left\(\\int\_\{\\mathbb\{R\}^\{d\_\{x\}\}\}Q\_\{\\tau\}\(z\\mid z\_\{0\}\)Q\_\{0\}\(z\_\{0\}\)\\,dz\_\{0\}\\right\)\(15\)=∫ℝdx∇zQτ​\(z∣z0\)​Q0​\(z0\)​𝑑z0∫ℝdxQτ​\(z∣z0′\)​Q0​\(z0′\)​𝑑z0′\\displaystyle=\\frac\{\\int\_\{\\mathbb\{R\}^\{d\_\{x\}\}\}\\nabla\_\{z\}Q\_\{\\tau\}\(z\\mid z\_\{0\}\)Q\_\{0\}\(z\_\{0\}\)\\,dz\_\{0\}\}\{\\int\_\{\\mathbb\{R\}^\{d\_\{x\}\}\}Q\_\{\\tau\}\(z\\mid z^\{\\prime\}\_\{0\}\)Q\_\{0\}\(z^\{\\prime\}\_\{0\}\)\\,dz^\{\\prime\}\_\{0\}\}=∫ℝdx−z−ατ​z0βτ2​wτ​\(z,z0\)​Q0​\(z0\)​d​z0,\\displaystyle=\\int\_\{\\mathbb\{R\}^\{d\_\{x\}\}\}\-\\frac\{z\-\\alpha\_\{\\tau\}z\_\{0\}\}\{\\beta\_\{\\tau\}^\{2\}\}w\_\{\\tau\}\(z,z\_\{0\}\)Q\_\{0\}\(z\_\{0\}\)\\,dz\_\{0\},where the weight function is

wτ​\(z,z0\):=Qτ​\(z∣z0\)∫ℝdxQτ​\(z∣z0′\)​Q0​\(z0′\)​𝑑z0′\.w\_\{\\tau\}\(z,z\_\{0\}\):=\\frac\{Q\_\{\\tau\}\(z\\mid z\_\{0\}\)\}\{\\int\_\{\\mathbb\{R\}^\{d\_\{x\}\}\}Q\_\{\\tau\}\(z\\mid z^\{\\prime\}\_\{0\}\)Q\_\{0\}\(z^\{\\prime\}\_\{0\}\)\\,dz^\{\\prime\}\_\{0\}\}\.\(16\)The weight function satisfies∫ℝdxwτ​\(z,z0\)​Q0​\(z0\)​𝑑z0=1\\int\_\{\\mathbb\{R\}^\{d\_\{x\}\}\}w\_\{\\tau\}\(z,z\_\{0\}\)Q\_\{0\}\(z\_\{0\}\)\\,dz\_\{0\}=1\. Equation \([15](https://arxiv.org/html/2605.29072#S3.E15)\) expresses the score function as an expectation with respect to the target distributionQ0Q\_\{0\}\.

In standard score\-based diffusion models, the score function is commonly approximated by training a neural network\. For filtering problems, however, the target distribution changes recursively as new observations are assimilated, and retraining a score network at each assimilation step is computationally inefficient\. Instead, we use a training\-free Monte Carlo approximation of the score\. Suppose that\{z\(m\)\}m=1M\\\{z^\{\(m\)\}\\\}\_\{m=1\}^\{M\}is an ensemble of samples fromQ0Q\_\{0\}\. Given a mini\-batch\{z\(j\)\}j=1J\\\{z^\{\(j\)\}\\\}\_\{j=1\}^\{J\}withJ≤MJ\\leq M, we approximate the score by

𝒮​\(z,τ\)≈𝒮¯​\(z,τ\):=∑j=1J−z−ατ​z\(j\)βτ2​w¯τ​\(z,z\(j\)\),\\mathcal\{S\}\(z,\\tau\)\\approx\\bar\{\\mathcal\{S\}\}\(z,\\tau\):=\\sum\_\{j=1\}^\{J\}\-\\frac\{z\-\\alpha\_\{\\tau\}z^\{\(j\)\}\}\{\\beta\_\{\\tau\}^\{2\}\}\\bar\{w\}\_\{\\tau\}\(z,z^\{\(j\)\}\),\(17\)where the empirical weight is

w¯τ​\(z,z\(j\)\):=Qτ​\(z∣z\(j\)\)∑ℓ=1JQτ​\(z∣z\(ℓ\)\)\.\\bar\{w\}\_\{\\tau\}\(z,z^\{\(j\)\}\):=\\frac\{Q\_\{\\tau\}\(z\\mid z^\{\(j\)\}\)\}\{\\sum\_\{\\ell=1\}^\{J\}Q\_\{\\tau\}\(z\\mid z^\{\(\\ell\)\}\)\}\.\(18\)Thus, the weights are computed by normalizing the Gaussian transition density values\{Qτ​\(z∣z\(j\)\)\}j=1J\\\{Q\_\{\\tau\}\(z\\mid z^\{\(j\)\}\)\\\}\_\{j=1\}^\{J\}\. This approximation avoids neural\-network training and evaluates the score directly from the available ensemble samples\. In practice, the mini\-batch can be a small subset of the full ensemble while still providing sufficient accuracy for filtering problems\[[4](https://arxiv.org/html/2605.29072#bib.bib120)\]\.

### 3\.3The ensemble score filter for state estimation

We now introduce how the Ensemble Score Filter \(EnSF\) is used to solve the data assimilation problem formulated in Section[3\.1](https://arxiv.org/html/2605.29072#S3.SS1)\. The methodology of EnSF is to define score functions corresponding to the filtering distributions and approximate these scores using the training\-free Monte Carlo construction introduced above\. In the present work, the pretrained forecasting modelfθf\_\{\\theta\}is treated as the black\-box forward propagator, and EnSF is used to update the forecasted state by assimilating observational data\.

Assume that𝒮n\|n\\mathcal\{S\}\_\{n\|n\}is the approximated score function corresponding to the posterior filtering distributionp​\(Xtn∣Yt1:tn\)p\(X\_\{t\_\{n\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\}\}\)\. For a uniform discretization of the pseudo\-time interval\[0,1\]\[0,1\], let

0=τ0<τ1<⋯<τl<⋯<τL=1,Δ​τ=τl\+1−τl\.0=\\tau\_\{0\}<\\tau\_\{1\}<\\cdots<\\tau\_\{l\}<\\cdots<\\tau\_\{L\}=1,\\qquad\\Delta\\tau=\\tau\_\{l\+1\}\-\\tau\_\{l\}\.Starting from Gaussian samples atτ=1\\tau=1, we generate an ensemble of state samples\{zn\|n\(m\)\}m=1M\\\{z\_\{n\|n\}^\{\(m\)\}\\\}\_\{m=1\}^\{M\}forp​\(Xtn∣Yt1:tn\)p\(X\_\{t\_\{n\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\}\}\)by solving the reverse\-time SDE with the Euler–Maruyama scheme

Z¯τl\(m\)=Z¯τl\+1\(m\)−\[b​\(τl\+1\)​Z¯τl\+1\(m\)−σ2​\(τl\+1\)​𝒮n\|n​\(Z¯τl\+1\(m\),τl\+1\)\]​Δ​τ\+σ​\(τl\+1\)​Δ​W←τl\+1\(m\),\\displaystyle\\bar\{Z\}\_\{\\tau\_\{l\}\}^\{\(m\)\}=\\bar\{Z\}\_\{\\tau\_\{l\+1\}\}^\{\(m\)\}\-\\left\[b\(\\tau\_\{l\+1\}\)\\bar\{Z\}\_\{\\tau\_\{l\+1\}\}^\{\(m\)\}\-\\sigma^\{2\}\(\\tau\_\{l\+1\}\)\\mathcal\{S\}\_\{n\|n\}\\left\(\\bar\{Z\}\_\{\\tau\_\{l\+1\}\}^\{\(m\)\},\\tau\_\{l\+1\}\\right\)\\right\]\\Delta\\tau\+\\sigma\(\\tau\_\{l\+1\}\)\\Delta\\overleftarrow\{W\}\_\{\\tau\_\{l\+1\}\}^\{\(m\)\},\(19\)forl=L−1,…,0l=L\-1,\\ldots,0, with terminal conditionZ¯τL\(m\)∼𝒩​\(0,Idx\)\\bar\{Z\}\_\{\\tau\_\{L\}\}^\{\(m\)\}\\sim\\mathcal\{N\}\(0,I\_\{d\_\{x\}\}\)\. The generated samples atτ=0\\tau=0are identified with the posterior ensemble at timetnt\_\{n\}, namelyzn\|n\(m\):=Z¯τ0\(m\)z\_\{n\|n\}^\{\(m\)\}:=\\bar\{Z\}\_\{\\tau\_\{0\}\}^\{\(m\)\}form=1,…,Mm=1,\\ldots,M\.

In the prediction step, these posterior ensemble samples are propagated through the pretrained forecasting model\. For each ensemble member, we compute

zn\+1\|n\(m\)=fθ​\(zn\|n\(m\),ηtn\(m\)\),m=1,…,M\.z\_\{n\+1\|n\}^\{\(m\)\}=f\_\{\\theta\}\\left\(z\_\{n\|n\}^\{\(m\)\},\\eta\_\{t\_\{n\}\}^\{\(m\)\}\\right\),\\qquad m=1,\\ldots,M\.\(20\)The predicted ensemble\{zn\+1\|n\(m\)\}m=1M\\\{z\_\{n\+1\|n\}^\{\(m\)\}\\\}\_\{m=1\}^\{M\}provides a sample approximation of the prior filtering distributionp​\(Xtn\+1∣Yt1:tn\)p\(X\_\{t\_\{n\+1\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\}\}\)\. The corresponding prior score is approximated by the Monte Carlo scheme

𝒮¯n\+1\|n​\(z,τ\):=∑j=1J−z−ατ​zn\+1\|n\(j\)βτ2​w¯τ​\(z,zn\+1\|n\(j\)\),τ∈\[0,1\],\\bar\{\\mathcal\{S\}\}\_\{n\+1\|n\}\(z,\\tau\):=\\sum\_\{j=1\}^\{J\}\-\\frac\{z\-\\alpha\_\{\\tau\}z\_\{n\+1\|n\}^\{\(j\)\}\}\{\\beta\_\{\\tau\}^\{2\}\}\\bar\{w\}\_\{\\tau\}\\left\(z,z\_\{n\+1\|n\}^\{\(j\)\}\\right\),\\qquad\\tau\\in\[0,1\],\(21\)where\{zn\+1\|n\(j\)\}j=1J\\\{z\_\{n\+1\|n\}^\{\(j\)\}\\\}\_\{j=1\}^\{J\}is a mini\-batch selected from the predicted ensemble, withJ≤MJ\\leq M\. The empirical weight is given by

w¯τ​\(z,zn\+1\|n\(j\)\):=Qτ​\(z∣zn\+1\|n\(j\)\)∑ℓ=1JQτ​\(z∣zn\+1\|n\(ℓ\)\),\\bar\{w\}\_\{\\tau\}\\left\(z,z\_\{n\+1\|n\}^\{\(j\)\}\\right\):=\\frac\{Q\_\{\\tau\}\\left\(z\\mid z\_\{n\+1\|n\}^\{\(j\)\}\\right\)\}\{\\sum\_\{\\ell=1\}^\{J\}Q\_\{\\tau\}\\left\(z\\mid z\_\{n\+1\|n\}^\{\(\\ell\)\}\\right\)\},\(22\)whereQτQ\_\{\\tau\}is the Gaussian transition density of the forward diffusion process\. We call𝒮¯n\+1\|n\\bar\{\\mathcal\{S\}\}\_\{n\+1\|n\}the prior score associated with the prior filtering distribution\.

To generate state samples that follow the posterior filtering distribution, we update the prior score by incorporating the new observationYtn\+1Y\_\{t\_\{n\+1\}\}\. According to Bayesian rule,

p​\(Xtn\+1∣Yt1:tn\+1\)∝p​\(Xtn\+1∣Yt1:tn\)​p​\(Ytn\+1∣Xtn\+1\)\.p\(X\_\{t\_\{n\+1\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\+1\}\}\)\\propto p\(X\_\{t\_\{n\+1\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\}\}\)p\(Y\_\{t\_\{n\+1\}\}\\mid X\_\{t\_\{n\+1\}\}\)\.Taking the logarithmic gradient motivates the posterior score approximation

𝒮¯n\+1\|n\+1​\(z,τ\)=𝒮¯n\+1\|n​\(z,τ\)\+g​\(τ\)​∇zlog⁡p​\(Ytn\+1∣Xtn\+1\)​\(z\),\\bar\{\\mathcal\{S\}\}\_\{n\+1\|n\+1\}\(z,\\tau\)=\\bar\{\\mathcal\{S\}\}\_\{n\+1\|n\}\(z,\\tau\)\+g\(\\tau\)\\nabla\_\{z\}\\log p\(Y\_\{t\_\{n\+1\}\}\\mid X\_\{t\_\{n\+1\}\}\)\(z\),\(23\)wherep​\(Ytn\+1∣Xtn\+1\)p\(Y\_\{t\_\{n\+1\}\}\\mid X\_\{t\_\{n\+1\}\}\)is the likelihood function defined in \([9](https://arxiv.org/html/2605.29072#S3.E9)\)\. The functiong​\(τ\)g\(\\tau\)is a damping function that controls how the observational information is incorporated in the diffusion domain\. In the current EnSF framework,g​\(τ\)g\(\\tau\)is monotonically decreasing on\[0,1\]\[0,1\]and satisfiesg​\(0\)=1g\(0\)=1andg​\(1\)=0g\(1\)=0\. This condition indicates that the likelihood information is fully incorporated atτ=0\\tau=0, where the target posterior distribution is recovered, while it vanishes atτ=1\\tau=1, where the reverse diffusion process is connected to the Gaussian reference distribution\. In this work, we use

g​\(τ\)=1−τ\.g\(\\tau\)=1\-\\tau\.\(24\)
For the Gaussian observation model in \([9](https://arxiv.org/html/2605.29072#S3.E9)\), the likelihood\-gradient term in \([23](https://arxiv.org/html/2605.29072#S3.E23)\) can be written as

∇zlog⁡p​\(Ytn\+1∣Xtn\+1\)​\(z\)=−D​ℋn\+1​\(z\)⊤​Rn\+1−1​\(ℋn\+1​\(z\)−Ytn\+1\),\\nabla\_\{z\}\\log p\(Y\_\{t\_\{n\+1\}\}\\mid X\_\{t\_\{n\+1\}\}\)\(z\)=\-D\\mathcal\{H\}\_\{n\+1\}\(z\)^\{\\top\}R\_\{n\+1\}^\{\-1\}\\left\(\\mathcal\{H\}\_\{n\+1\}\(z\)\-Y\_\{t\_\{n\+1\}\}\\right\),\(25\)whereD​ℋn\+1​\(z\)D\\mathcal\{H\}\_\{n\+1\}\(z\)is the Jacobian of the observation operator\. For direct partial observations, this expression reduces to a masked linear correction\. For nonlinear observations, such as the arctangent observation operator considered in our experiments, the Jacobian accounts for the nonlinear measurement map\.

Finally, the posterior score𝒮¯n\+1\|n\+1\\bar\{\\mathcal\{S\}\}\_\{n\+1\|n\+1\}is used in the reverse\-time SDE to generate posterior ensemble samples\{zn\+1\|n\+1\(m\)\}m=1M\\\{z\_\{n\+1\|n\+1\}^\{\(m\)\}\\\}\_\{m=1\}^\{M\}that approximatep​\(Xtn\+1∣Yt1:tn\+1\)p\(X\_\{t\_\{n\+1\}\}\\mid Y\_\{t\_\{1\}:t\_\{n\+1\}\}\)\. The filtered state estimate is then computed as the ensemble mean

X^tn\+1=1M​∑m=1Mzn\+1\|n\+1\(m\)\.\\widehat\{X\}\_\{t\_\{n\+1\}\}=\\frac\{1\}\{M\}\\sum\_\{m=1\}^\{M\}z\_\{n\+1\|n\+1\}^\{\(m\)\}\.\(26\)
Thus, one EnSF assimilation step proceeds as follows\. First, posterior samples at timetnt\_\{n\}are generated from the current posterior score model\. Second, these samples are propagated through the pretrained forecasting model to form the prior ensemble\. Third, the prior ensemble is used to construct the Monte Carlo approximation of the prior score\. Fourth, the new observation is incorporated through the likelihood\-gradient correction to obtain the posterior score\. Finally, the reverse\-time diffusion process driven by the posterior score generates the updated ensemble at timetn\+1t\_\{n\+1\}\.

## 4Numerical experiments

### 4\.1Dataset, preprocessing, and model configurations

The numerical experiments are conducted on a real energy\-consumption dataset containing hourly records forN=5,000N=5\{,\}000utility users in Tallahassee, Florida\. Each full yearly record contains8,7368\{,\}736hourly time steps, corresponding to364364days with2424hourly records per day\. Each user has one recorded feature, namely energy consumption measured in kWh\. An adjacency matrix of size5,000×5,0005\{,\}000\\times 5\{,\}000describing spatial proximity between users is also available in the dataset, although the STLLM architecture can learn spatial dependence through attention without relying explicitly on a fixed graph structure\. The numerical examples reported below are conducted on selected test\-set trajectories constructed from these yearly records\.

The data are preprocessed using a LogMinMax normalization procedure\. First, a logarithmic transformation is applied to reduce the heavy right\-skew in the energy\-consumption distribution:

x′=log⁡\(1\+x\)\.x^\{\\prime\}=\\log\(1\+x\)\.\(27\)The transformed data are then scaled to\[0,1\]\[0,1\]using statistics computed from the training set\. Inverse transformations are applied before evaluating forecasting errors, so the reported MAE, MAPE, and RMSE values are computed on the original data scale\. Missing values are masked during both training and evaluation\.

The pretrained models used in the experiments correspond to different historical sequence lengths\. The11\-to\-11model uses one previous time step to predict the next state, the44\-to\-11model uses four previous time steps, and the1212\-to\-11model uses twelve previous time steps\. All model configurations use one\-step forecasting horizonH=1H=1\.

For the reported figures, we use finite test trajectory segments rather than the entire yearly record\. In Example[4\.3](https://arxiv.org/html/2605.29072#S4.SS3), the direct prediction comparison is displayed overt=0,…,800t=0,\\ldots,800\. In the data\-assimilation experiments, the filtering horizon is850850steps, although some figures display a shorter interval for visualization\.

Table 2:Experimental configurations for the pretrained STLLM models\. The train/validation/test counts refer to supervised forecasting samples constructed from the hourly records\.ExperimentYearSeq\. Len\.Batch SizeTrain/Val/TestEpochsHalf20181166980 / 872 / 872289Half220191216980 / 872 / 872126Half4\_12001446980 / 872 / 872149Half4\_22002441746 / 872 / 872113Half4\_32003446980 / 872 / 872161In the numerical examples below, the11\-to\-11,44\-to\-11, and1212\-to\-11models refer to pretrained STLLM models with sequence lengths11,44, and1212, respectively\. The insufficiently trained44\-to\-11model used in Examples[4\.4](https://arxiv.org/html/2605.29072#S4.SS4)and[4\.5](https://arxiv.org/html/2605.29072#S4.SS5)corresponds to the configuration trained with fewer training samples\. The full\-data44\-to\-11model corresponds to a configuration trained with the full training split\.

The exact mapping between the internal checkpoint names Half4\_1, Half4\_2, Half4\_3 and the checkpoints used in the EnSF experiments should be verified before submission\. The paper text can avoid relying on these internal names by using “25%\-data model” and “full\-data model”\.

Table 3:STLLM model hyperparameters\.HyperparameterValueModel dimensiondd64Number of attention headshh8Head dimensiondhd\_\{h\}8Feed\-forward dimensiondf​fd\_\{ff\}384Number of STLLM blocksLL4Dropout rate0\.1Total parameters749,057The models are trained using AdamW with initial learning rate10−310^\{\-3\}and weight decay10−410^\{\-4\}\. A cosine annealing schedule decreases the learning rate from10−310^\{\-3\}to10−610^\{\-6\}over a maximum of2,0002\{,\}000epochs\. The training loss is the masked mean absolute error over non\-missing entries:

ℒ=1\|ℳ\|​∑\(i,j,k\)∈ℳ\|y^i,j,k−yi,j,k\|,\\mathcal\{L\}=\\frac\{1\}\{\|\\mathcal\{M\}\|\}\\sum\_\{\(i,j,k\)\\in\\mathcal\{M\}\}\\left\|\\widehat\{y\}\_\{i,j,k\}\-y\_\{i,j,k\}\\right\|,\(28\)whereℳ\\mathcal\{M\}denotes the set of observed non\-NaN entries\. Gradient norms are clipped to a maximum value of55, and early stopping is applied when the validation MAE does not improve for3030consecutive epochs\.

The final training protocol, hardware description, and random seed information should be verified before submission\. If uncertain, hardware and seed details can be omitted from the final manuscript\.

### 4\.2Data assimilation setup

In the numerical experiments, the state vector represents the energy\-consumption values of50005000users\. Thus, we set

Xtn∈ℝdx,dx=5000\.X\_\{t\_\{n\}\}\\in\\mathbb\{R\}^\{d\_\{x\}\},\\qquad d\_\{x\}=5000\.The real recorded test\-set trajectory is treated as the reference trajectory and is denoted byXtntrueX^\{\\mathrm\{true\}\}\_\{t\_\{n\}\}\. The pretrained forecasting model introduced above is used as the black\-box forward propagator\. For the data assimilation experiments based on the44\-to\-11model, the initial four\-step input window is taken from the reference trajectory\. After assimilation begins, the input window is updated recursively by the assimilated states: at each filtering step, the oldest state in the four\-step window is removed and the newly corrected state is appended\. Therefore, after initialization, the learned forward model is driven by data\-assimilation\-corrected states rather than by the true trajectory\.

All filtering computations are performed on the LogMinMax\-normalized scale used by the pretrained forecasting model\. The observational noise is also added on this normalized scale\. Specifically, for the observation model, we use Gaussian noise

ϵtn\+1∼𝒩​\(0,σobs2​I\),σobs=0\.05\.\\epsilon\_\{t\_\{n\+1\}\}\\sim\\mathcal\{N\}\(0,\\sigma\_\{\\rm obs\}^\{2\}I\),\\qquad\\sigma\_\{\\rm obs\}=0\.05\.After filtering, the estimated states are transformed back to the original energy\-consumption scale before computing the error metrics\. Unless otherwise stated, the data assimilation experiments are run for850850filtering steps\.

For the Ensemble Score Filter, we use an ensemble sizeM=50M=50and discretize the pseudo\-time interval\[0,1\]\[0,1\]in the reverse diffusion process usingL=500L=500steps\. The damping function in the posterior score update is chosen as

g​\(τ\)=1−τ\.g\(\\tau\)=1\-\\tau\.For the EnKF comparison in Example[4\.5](https://arxiv.org/html/2605.29072#S4.SS5), we use the same pretrained forward model, ensemble size, observation masks, observation noise level, and normalized data scale as in the EnSF experiment\.

Partial observations are implemented through blockwise observation masks\. For an observation percentage100/B100/B, the state vector is partitioned intoBBcontiguous blocks with approximately equal sizes\. At filtering stepnn, the block indexed bynmodBn\\mod Bis observed\. Thus, for25%25\\%observation, one quarter of the state components is observed at each filtering step, and every component is observed once every four steps\. In the main numerical examples, we report results for observation levels25%25\\%,50%50\\%, and100%100\\%\.

In Example[4\.4](https://arxiv.org/html/2605.29072#S4.SS4), we use direct partial observations\. The observation model is

Ytn\+1=Mask⁡\(Xtn\+1\+ϵtn\+1\),Y\_\{t\_\{n\+1\}\}=\\operatorname\{Mask\}\\left\(X\_\{t\_\{n\+1\}\}\+\\epsilon\_\{t\_\{n\+1\}\}\\right\),\(29\)whereMask⁡\(⋅\)\\operatorname\{Mask\}\(\\cdot\)selects the observed block at the current filtering step\. In Example[4\.5](https://arxiv.org/html/2605.29072#S4.SS5), we compare EnSF with EnKF under a mixed observation setting\. Among the observed components, half are observed directly and half are observed through the nonlinear arctangent map\. Equivalently, for an observed componentii, the mixed observation operator is given by

ℋm​i​x​\(Xi\)=\{Xi,for directly observed components,arctan⁡\(Xi\),for nonlinear observed components\.\\mathcal\{H\}\_\{mix\}\(X\_\{i\}\)=\\begin\{cases\}X\_\{i\},&\\text\{for directly observed components\},\\\\ \\arctan\(X\_\{i\}\),&\\text\{for nonlinear observed components\}\.\\end\{cases\}\(30\)The corresponding observation is then

Ytn\+1=Mask⁡\(ℋm​i​x​\(Xtn\+1\)\+ϵtn\+1\)\.Y\_\{t\_\{n\+1\}\}=\\operatorname\{Mask\}\\left\(\\mathcal\{H\}\_\{mix\}\(X\_\{t\_\{n\+1\}\}\)\+\\epsilon\_\{t\_\{n\+1\}\}\\right\)\.\(31\)
We evaluate the forecasting and filtering results using mean absolute error \(MAE\), mean absolute percentage error \(MAPE\), and root mean squared error \(RMSE\)\. For a prediction or filtering estimateX~tn\\tilde\{X\}\_\{t\_\{n\}\}and reference stateXtntrueX^\{\\mathrm\{true\}\}\_\{t\_\{n\}\}, these metrics are defined by

MAE⁡\(tn\)=1dx​∑i=1dx\|X~tni−\(Xtntrue\)i\|,\\operatorname\{MAE\}\(t\_\{n\}\)=\\frac\{1\}\{d\_\{x\}\}\\sum\_\{i=1\}^\{d\_\{x\}\}\\left\|\\tilde\{X\}\_\{t\_\{n\}\}^\{i\}\-\\left\(X^\{\\mathrm\{true\}\}\_\{t\_\{n\}\}\\right\)^\{i\}\\right\|,\(32\)MAPE⁡\(tn\)=100dx​∑i=1dx\|X~tni−\(Xtntrue\)i\(Xtntrue\)i\|,\\operatorname\{MAPE\}\(t\_\{n\}\)=\\frac\{100\}\{d\_\{x\}\}\\sum\_\{i=1\}^\{d\_\{x\}\}\\left\|\\frac\{\\tilde\{X\}\_\{t\_\{n\}\}^\{i\}\-\\left\(X^\{\\mathrm\{true\}\}\_\{t\_\{n\}\}\\right\)^\{i\}\}\{\\left\(X^\{\\mathrm\{true\}\}\_\{t\_\{n\}\}\\right\)^\{i\}\}\\right\|,\(33\)and

RMSE⁡\(tn\)=1dx​∑i=1dx\(X~tni−\(Xtntrue\)i\)2\.\\operatorname\{RMSE\}\(t\_\{n\}\)=\\sqrt\{\\frac\{1\}\{d\_\{x\}\}\\sum\_\{i=1\}^\{d\_\{x\}\}\\left\(\\tilde\{X\}\_\{t\_\{n\}\}^\{i\}\-\\left\(X^\{\\mathrm\{true\}\}\_\{t\_\{n\}\}\\right\)^\{i\}\\right\)^\{2\}\}\.\(34\)All three metrics are computed after inverse normalization, so the reported errors are measured in the original energy\-consumption scale\. In Examples[4\.4](https://arxiv.org/html/2605.29072#S4.SS4)and[4\.5](https://arxiv.org/html/2605.29072#S4.SS5), the RMSE curves are computed over all50005000state components\. The trajectory plots, however, are shown only for selected representative dimensions in order to visualize the temporal behavior of individual users\.

### 4\.3Example 1\. Model comparison for the11\-to\-11,44\-to\-11, and1212\-to\-11forward models

In this example, we compare the forecasting performance of the11\-to\-11,44\-to\-11, and1212\-to\-11pretrained forward models on real energy\-consumption data\. We first consider the setting in which the model is supplied with correct input information at each prediction step\. We then consider an open\-loop setting in which the model output is recursively used as input without corrective information\. The purpose of this example is to evaluate the baseline predictive capability of the pretrained models and to demonstrate why data assimilation is needed for long\-horizon forecast correction\.

We first compare the direct prediction performance of the three models over the time intervalt=0,…,800t=0,\\dots,800\. For each model, we compare the model prediction with the corresponding real result\. The error metrics are computed according to the definitions in Section[4\.2](https://arxiv.org/html/2605.29072#S4.SS2)\.

Fig\.[1](https://arxiv.org/html/2605.29072#S4.F1)shows the prediction curves and the corresponding real results for the three models\. In all three cases, the learned forward model captures the dominant oscillatory pattern of the underlying energy\-consumption dynamics\. At the same time, visible discrepancies remain between the prediction and the real result, especially near peaks and troughs\. Nevertheless, the one\-step prediction results remain reasonably accurate, and the longer\-input models exhibit better agreement with the real signal\.

![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/1to1_prediction_vs_real_same_plot.png)\(a\)11\-to\-11model prediction versus real result\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/4to1_prediction_vs_real_same_plot.png)\(b\)44\-to\-11model prediction versus real result\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/12to1_prediction_vs_real_same_plot.png)\(c\)1212\-to\-11model prediction versus real result\.

Figure 1:Comparison between model prediction and real result for the11\-to\-11,44\-to\-11, and1212\-to\-11forward models overt=0,…,800t=0,\\dots,800, when correct input information is supplied at each prediction step\.To provide a more direct comparison among the three models, we plot the time\-dependent MAE, MAPE, and RMSE in Fig\.[2](https://arxiv.org/html/2605.29072#S4.F2)\. The three models exhibit similar temporal error patterns, indicating that they are driven by the same underlying demand dynamics\. However, the44\-to\-11and1212\-to\-11models generally achieve lower errors than the11\-to\-11model, and the1212\-to\-11model gives the best overall performance among the three\. The MAPE curves contain several sharp spikes, which are expected when the true values become relatively small, since percentage\-based errors are more sensitive in that regime\.

![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/mae_comparison_all_models.png)\(a\)MAE comparison\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/mape_comparison_all_models.png)\(b\)MAPE comparison\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/rmse_comparison_all_models.png)\(c\)RMSE comparison\.

Figure 2:Comparison of time\-dependent MAE, MAPE, and RMSE for the11\-to\-11,44\-to\-11, and1212\-to\-11forward models overt=0,…,800t=0,\\dots,800, when correct input information is supplied at each prediction step\.For completeness, we also report the average error statistics over the whole time interval in Table[4](https://arxiv.org/html/2605.29072#S4.T4)\. The results show that both the44\-to\-11and1212\-to\-11models improve upon the11\-to\-11model in all three metrics\. Among the three models, the1212\-to\-11model attains the lowest average MAE, MAPE, and RMSE, indicating the best overall predictive accuracy in this direct\-prediction setting\.

Table 4:Average error statistics for the11\-to\-11,44\-to\-11, and1212\-to\-11models overt=0,…,800t=0,\\dots,800when correct input information is supplied at each prediction step\. Lower values indicate better performance\.ModelMAEMAPE \(%\)RMSE11\-to\-110\.34016827\.3590600\.54469644\-to\-110\.31965825\.8769730\.5267021212\-to\-110\.31798425\.6618810\.522809We next consider the same three pretrained forward models in an open\-loop setting\. In this case, after the initial input is provided, the forecast is propagated solely by the learned model itself, without using corrective information from the true trajectory\. This setting is more difficult because forecast errors are no longer corrected at each step and may accumulate over time\. The purpose of this part of the example is to show that a pretrained forward model alone is not sufficient for reliable long\-horizon state tracking\.

The open\-loop results are compared with the real results in Fig\.[3](https://arxiv.org/html/2605.29072#S4.F3)\. Here, the comparison is performed on a saved subset of representative state entries rather than on the full state vector, since the open\-loop diagnostic data were stored only for those entries\. Even with this reduced comparison, the main behavior is clear: without sequential correction, the open\-loop forecasts lose agreement with the real trajectory\. The discrepancy is already visible for the11\-to\-11and44\-to\-11models and is especially severe for the1212\-to\-11model\.

![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/1to1_openloop_vs_real.png)\(a\)11\-to\-11open\-loop result versus real result\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/4to1_openloop_vs_real.png)\(b\)44\-to\-11open\-loop result versus real result\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/12to1_openloop_vs_real.png)\(c\)1212\-to\-11open\-loop result versus real result\.

Figure 3:Comparison between open\-loop forecasts and real results for the11\-to\-11,44\-to\-11, and1212\-to\-11forward models\. The forecast is propagated without corrective information\. The comparison is shown on the saved subset of representative state entries\.The time\-dependent MAE, MAPE, and RMSE for the open\-loop trajectories are shown in Fig\.[4](https://arxiv.org/html/2605.29072#S4.F4)\. The error levels are much larger than those in the direct\-prediction comparison\. Among the three models, the44\-to\-11model gives the smallest average open\-loop errors, while the1212\-to\-11model performs the worst by a large margin\. This is particularly notable because the1212\-to\-11model performed best in the direct\-prediction setting\. Therefore, strong short\-horizon predictive accuracy does not by itself guarantee reliable long\-horizon open\-loop performance\.

![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/openloop_mae_comparison_all_models.png)\(a\)MAE comparison\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/openloop_mape_comparison_all_models.png)\(b\)MAPE comparison\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/example1_plots/openloop_rmse_comparison_all_models.png)\(c\)RMSE comparison\.

Figure 4:Comparison of time\-dependent MAE, MAPE, and RMSE for the open\-loop forecasts of the11\-to\-11,44\-to\-11, and1212\-to\-11models\. The metrics are computed on the saved subset of representative state entries\.The average open\-loop error statistics are reported in Table[5](https://arxiv.org/html/2605.29072#S4.T5)\. These values confirm the qualitative conclusions drawn from the plots\. In particular, all three models exhibit substantial degradation relative to the direct\-prediction setting, and the deterioration is especially large for the1212\-to\-11model\. This part of the example provides the main motivation for the data\-assimilation results that follow: although the pretrained forward models can achieve reasonable short\-horizon prediction accuracy, their open\-loop forecasts can become unreliable when corrective information is no longer supplied\. Sequential data assimilation is therefore needed to incorporate observational information and improve state estimation over time\.

Table 5:Average error statistics for the open\-loop forecasts of the11\-to\-11,44\-to\-11, and1212\-to\-11models\. The metrics are computed on the saved subset of representative state entries\. Lower values indicate better performance\.ModelMAEMAPE \(%\)RMSE11\-to\-111\.21156273\.9613181\.39250344\-to\-111\.10647368\.1359041\.3198391212\-to\-111\.868828223\.7439662\.236019
### 4\.4Example 2\. EnSF correction for44\-to\-11models under different training schemes

Example[4\.3](https://arxiv.org/html/2605.29072#S4.SS3)shows that the pretrained forward models can provide reasonable short\-horizon predictions when correct input information is supplied, but their open\-loop forecasts may become unreliable when no corrective information is used\. In this example, we apply EnSF\-based data assimilation to improve the forecast results for the44\-to\-11model under two different training schemes\. The first model is trained using only25%25\\%of the available training data and is therefore treated as an insufficiently trained forward model\. The second model is trained using the full training dataset and is treated as a sufficiently trained forward model\.

For both cases, the pretrained44\-to\-11model is used as the black\-box forward propagator, while EnSF sequentially assimilates direct partial observations as described in \([29](https://arxiv.org/html/2605.29072#S4.E29)\)\. We compare the no\-DA open\-loop result with EnSF\-corrected results under observation levels25%25\\%,50%50\\%, and100%100\\%\. The RMSE curves in this example are computed over all50005000state components after inverse normalization\.

Fig\.[5](https://arxiv.org/html/2605.29072#S4.F5)presents the RMSE comparison for the two training schemes\. In both cases, the no\-DA curve has much larger RMSE than the EnSF\-corrected curves, indicating that the learned forward model alone is not sufficient for accurate long\-horizon prediction\. The improvement is especially clear for the model trained with only25%25\\%of the data, where the no\-DA forecast error remains large throughout the experiment\. After EnSF correction, the RMSE is substantially reduced for all observation levels\. As expected, using more observational information generally improves the correction, with the100%100\\%observation case producing the lowest RMSE among the DA results\.

![Refer to caption](https://arxiv.org/html/2605.29072v1/x1.png)\(a\)44\-to\-11model trained with25%25\\%of the data\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/x2.png)\(b\)44\-to\-11model trained with the full dataset\.

Figure 5:RMSE comparison for EnSF correction under different training schemes\. The no\-DA curve corresponds to the open\-loop forecast, while the DA curves correspond to EnSF correction with direct partial observations at25%25\\%,50%50\\%, and100%100\\%observation levels\. The RMSE is computed over all50005000state components\.To further illustrate the effect of EnSF at the trajectory level, Fig\.[6](https://arxiv.org/html/2605.29072#S4.F6)shows representative trajectory comparisons using25%25\\%observations\. These trajectory plots are shown only for selected state dimensions in order to visualize individual temporal behavior\. The blue curves denote the truth, the orange curves denote the open\-loop forecast, and the green curves denote the EnSF\-corrected trajectory\. For the insufficiently trained model, the open\-loop forecast quickly loses agreement with the true trajectory and remains far from the observed temporal pattern, while the EnSF estimate stays much closer to the truth across the displayed dimensions\. For the fully trained model, the open\-loop result is different in scale and shape but still fails to reproduce the true temporal variability\. In both cases, EnSF provides a clear correction by recovering the main oscillatory structures of the true signal\.

![Refer to caption](https://arxiv.org/html/2605.29072v1/x3.png)\(a\)44\-to\-11model trained with25%25\\%of the data, using25%25\\%direct observations\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/x4.png)\(b\)44\-to\-11model trained with the full dataset, using25%25\\%direct observations\.

Figure 6:Representative trajectory comparisons for the44\-to\-11model under EnSF correction\. The EnSF\-corrected trajectory tracks the truth more closely than the open\-loop forecast, even when only partial observations are assimilated\. The displayed trajectories correspond to selected representative state dimensions\.These results demonstrate that the benefit of EnSF is not limited to a specific training scheme of the forward model\. When the forward model is insufficiently trained, EnSF reduces the forecast error by incorporating observational information into the prediction process\. When the forward model is trained on the full dataset, EnSF still improves the result by correcting residual inaccuracies in the learned dynamics\. Thus, the combined framework of a pretrained spatio\-temporal forecasting model and EnSF\-based data assimilation provides a more reliable approach for real\-data energy\-consumption prediction than open\-loop model propagation alone\.

### 4\.5Example 3\. Comparison between EnSF and EnKF for the44\-to\-11model

In Example[4\.4](https://arxiv.org/html/2605.29072#S4.SS4), we demonstrated that EnSF can substantially improve the forecast results of the44\-to\-11model by assimilating partial observations\. We now compare EnSF with the Ensemble Kalman Filter \(EnKF\) on the same data\-assimilation task\. The purpose of this example is to examine whether the score\-based update used by EnSF provides an advantage over the Kalman\-type ensemble update used by EnKF in a high\-dimensional nonlinear observation setting\.

We use the44\-to\-11model trained with25%25\\%of the available data\. This setting is intentionally challenging, since the forward model is not trained on the full dataset and therefore provides a less accurate forecast prior\. Both EnSF and EnKF use the same pretrained model as the black\-box forward propagator and assimilate observations under the same partial\-observation masks and noise level\. In this example, we use the mixed observation model described in \([30](https://arxiv.org/html/2605.29072#S4.E30)\)–\([31](https://arxiv.org/html/2605.29072#S4.E31)\), where half of the observed components are observed directly and the other half are observed through the arctangent function\. We compare the no\-DA open\-loop forecast with data\-assimilation results using25%25\\%,50%50\\%, and100%100\\%observation levels\.

Fig\.[7](https://arxiv.org/html/2605.29072#S4.F7)compares the RMSE curves obtained by EnSF and EnKF\. The RMSE values are computed over all50005000state components after inverse normalization\. Both methods reduce the error relative to the no\-DA open\-loop forecast, confirming again that sequential observational correction is essential in this setting\. However, the EnSF\-corrected curves remain substantially below the no\-DA curve and exhibit relatively low RMSE throughout the time interval\. In contrast, although EnKF also improves upon the open\-loop forecast, its RMSE remains much larger than that of EnSF for most of the experiment\. This indicates that the EnSF update provides a more effective correction mechanism for this nonlinear high\-dimensional filtering problem\.

![Refer to caption](https://arxiv.org/html/2605.29072v1/x5.png)\(a\)EnSF correction\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/x6.png)\(b\)EnKF correction\.

Figure 7:RMSE comparison between EnSF and EnKF for the44\-to\-11model trained with25%25\\%of the data under mixed direct/arctangent observations\. The no\-DA curve corresponds to the open\-loop forecast, while the data\-assimilation curves correspond to correction using25%25\\%,50%50\\%, and100%100\\%observation levels\. The RMSE is computed over all50005000state components\.To further compare the two methods at the trajectory level, Fig\.[8](https://arxiv.org/html/2605.29072#S4.F8)shows representative trajectories using25%25\\%observations\. The displayed trajectories correspond to selected representative state dimensions\. The blue curves denote the truth, the orange curves denote the open\-loop forecast, and the green curves denote the data\-assimilation result\. In the EnSF case, the corrected trajectory follows the main temporal structures of the truth across the displayed dimensions\. The open\-loop forecast, by contrast, remains far from the truth after the initial transient\. In the EnKF case, the corrected trajectory improves upon the open\-loop forecast but does not track the truth as consistently as EnSF\. In several displayed dimensions, EnKF produces overly smooth or displaced trajectories, indicating that the Kalman\-type correction is less effective for this mixed nonlinear observation setting\.

![Refer to caption](https://arxiv.org/html/2605.29072v1/x7.png)\(a\)EnSF correction with25%25\\%observations\.
![Refer to caption](https://arxiv.org/html/2605.29072v1/x8.png)\(b\)EnKF correction with25%25\\%observations\.

Figure 8:Representative trajectory comparison between EnSF and EnKF for the44\-to\-11model trained with25%25\\%of the data under mixed direct/arctangent observations\. The EnSF\-corrected trajectory more closely follows the truth, while the EnKF result shows weaker correction in the displayed dimensions\.Overall, this example demonstrates the advantage of EnSF over EnKF for the forecast\-correction task considered here\. While both methods use the same learned forward model and the same observational information, EnSF achieves a stronger reduction in RMSE and better trajectory recovery\. This result is consistent with the motivation for using a score\-based ensemble filter: the filtering distribution in this problem can be nonlinear and non\-Gaussian, and the EnSF update provides a more flexible correction than the Gaussian update structure used by EnKF\.

## 5Conclusion

In this work, we used the Ensemble Score Filter to improve state estimation for real\-data energy\-consumption forecasting\.The forward dynamics were represented by a pretrained spatio\-temporal forecasting model, which was treated as a black\-box state propagator\.Through numerical experiments, we showed that the pretrained model can provide reasonable short\-horizon predictions when correct input information is available, but its open\-loop forecasts may become unreliable over long horizons\. By assimilating partial and noisy observations, EnSF substantially reduced the forecast error and recovered the main temporal features of the true consumption trajectories\. The improvement was observed for both insufficiently trained and fully trained44\-to\-11forecasting models\.

We also compared EnSF with EnKF under a mixed direct/nonlinear observation setting\. The results show that EnSF achieves lower RMSE and better trajectory recovery than EnKF in the considered high\-dimensional filtering problem\. Overall, these results demonstrate the potential of EnSF as an effective data assimilation tool for correcting learned energy\-consumption forecasts\. Future work may extend the current framework by incorporating additional exogenous variables, especially temperature, into the forecasting model so that power consumption can be predicted and corrected under weather\-dependent demand patterns\. It would also be useful to validate the framework on additional real\-world energy datasets and to study more complex observation patterns and missing\-data mechanisms\.

## References

- \[1\]S\. Akhtar, S\. Shahzad, A\. Zaheer, H\. S\. Ullah, H\. Kilic, R\. Gono, M\. Jasiński, and Z\. Leonowicz\(2023\)Short\-term load forecasting models: a review of challenges, progress, and the road ahead\.Energies16\(10\),pp\. 4060\.External Links:[Document](https://dx.doi.org/10.3390/en16104060)Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p2.2)\.
- \[2\]C\. Andrieu, A\. Doucet, and R\. Holenstein\(2010\)Particle markov chain monte carlo methods\.Journal of the Royal Statistical Society Series B: Statistical Methodology72\(3\),pp\. 269–342\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[3\]F\. Bao, Z\. Zhang, and G\. Zhang\(2023\)A score\-based nonlinear filter for data assimilation\.Journal of Computational Physics514,pp\. 113207\.Cited by:[§3\.2](https://arxiv.org/html/2605.29072#S3.SS2.p3.5)\.
- \[4\]F\. Bao, Z\. Zhang, and G\. Zhang\(2023\)An ensemble score filter for tracking high\-dimensional nonlinear dynamical systems\.Computer Methods in Applied Mechanics and Engineering432,pp\. 117447\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p11.1),[§3\.2](https://arxiv.org/html/2605.29072#S3.SS2.p5.5)\.
- \[5\]F\. Bao, Y\. Cao, and X\. Han\(2015\)Forward backward doubly stochastic differential equations and the optimal filtering of diffusion processes\.arXiv preprint arXiv:1509\.06352\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[6\]F\. Bao, Y\. Cao, A\. Meir, and W\. Zhao\(2016\)A first order scheme for backward doubly stochastic differential equations\.SIAM/ASA Journal on Uncertainty Quantification4\(1\),pp\. 413–445\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[7\]F\. Bao, Y\. Cao, A\. Meir, and W\. Zhao\(2016\)A first order scheme for backward doubly stochastic differential equations\.SIAM/ASA J\. Uncertain\. Quantif\.4\(1\),pp\. 413–445\.External Links:ISSN 2166\-2525,[MathReview \(Alexandre Popier\)](https://www.ams.org/mathscinet-getitem?mr=3490505)Cited by:[§3\.2](https://arxiv.org/html/2605.29072#S3.SS2.p3.1)\.
- \[8\]F\. Bao, Y\. Cao, C\. Webster, and G\. Zhang\(2014\)A hybrid sparse\-grid approach for nonlinear filtering problems based on adaptive\-domain of the Zakai equation approximations\.SIAM/ASA J\. Uncertain\. Quantif\.2\(1\),pp\. 784–804\.External Links:ISSN 2166\-2525,[Document](https://dx.doi.org/10.1137/140952910),[Link](http://dx.doi.org/10.1137/140952910),[MathReview Entry](https://www.ams.org/mathscinet-getitem?mr=3293459)Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1),[§1](https://arxiv.org/html/2605.29072#S1.p8.1)\.
- \[9\]F\. Bao, Y\. Cao, C\. Webster, and G\. Zhang\(2014\)A hybrid sparse\-grid approach for nonlinear filtering problems based on adaptive\-domain of the zakai equation approximations\.SIAM/ASA Journal on Uncertainty Quantification2\(1\),pp\. 784–804\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[10\]F\. Bao, Y\. Cao, and J\. Yong\(2021\)Data informed solution estimation for forward\-backward stochastic differential equations\.Analysis and Applications19\(03\),pp\. 439–464\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[11\]F\. Bao, H\. G\. Chipilski, S\. Liang, G\. Zhang, and J\. S\. Whitaker\(2025\)Nonlinear ensemble filtering with diffusion models: application to the surface quasigeostrophic dynamics\.Monthly Weather Review153\(7\),pp\. 1155–1169\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p11.1)\.
- \[12\]F\. Bao and V\. Maroulas\(2017\)Adaptive meshfree backward sde filter\.SIAM Journal on Scientific Computing39\(6\),pp\. A2664–A2683\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[13\]F\. Bao, Z\. Zhang, and G\. Zhang\(2024\)A score\-based nonlinear filter for data assimilation\.Journal of Computational Physics514,pp\. 113207\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p11.1)\.
- \[14\]R\. Cai, G\. Yang, H\. Averbuch\-Elor, Z\. Hao, S\. Belongie, N\. Snavely, and B\. Hariharan\(2020\)Learning gradient fields for shape generation\.InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16,pp\. 364–381\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[15\]A\. Carrassi, M\. Bocquet, L\. Bertino, and G\. Evensen\(2018\)Data assimilation in the geosciences: an overview of methods, issues, and perspectives\.Wiley Interdisciplinary Reviews: Climate Change9\(5\),pp\. e535\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p7.1)\.
- \[16\]A\. J\. Chorin and X\. Tu\(2009\)Implicit sampling for particle filters\.Proceedings of the National Academy of Sciences106\(41\),pp\. 17249–17254\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[17\]P\. Dhariwal and A\. Nichol\(2021\)Diffusion models beat gans on image synthesis\.Advances in neural information processing systems34,pp\. 8780–8794\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[18\]P\. Dhariwal and A\. Nichol\(2021\)Diffusion models beat gans on image synthesis\.InAdvances in Neural Information Processing Systems,Vol\.34,pp\. 8780–8794\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p11.1)\.
- \[19\]G\. Evensen\(1994\)Sequential data assimilation with a nonlinear quasi\-geostrophic model using monte carlo methods to forecast error statistics\.Journal of Geophysical Research: Oceans99\(C5\),pp\. 10143–10162\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p9.1)\.
- \[20\]E\. Gobet, G\. Pages, H\. Pham, and J\. Printems\(2006\)Discretization and simulation of the zakai equation\.SIAM Journal on Numerical Analysis44\(6\),pp\. 2505–2538\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[21\]N\. J\. Gordon, D\. J\. Salmond, and A\. F\. Smith\(1993\)Novel approach to nonlinear/non\-gaussian bayesian state estimation\.InIEE proceedings F \(radar and signal processing\),Vol\.140,pp\. 107–113\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[22\]J\. Ho, A\. Jain, and P\. Abbeel\(2020\)Denoising diffusion probabilistic models\.InAdvances in Neural Information Processing Systems,Vol\.33,pp\. 6840–6851\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p11.1)\.
- \[23\]T\. Hong and S\. Fan\(2016\)Probabilistic electric load forecasting: a tutorial review\.International Journal of Forecasting32\(3\),pp\. 914–938\.External Links:[Document](https://dx.doi.org/10.1016/j.ijforecast.2015.11.011)Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p2.2)\.
- \[24\]P\. L\. Houtekamer and H\. L\. Mitchell\(1998\)Data assimilation using an ensemble kalman filter technique\.Monthly weather review126\(3\),pp\. 796–811\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p9.1)\.
- \[25\]Y\. Hu, G\. Kallianpur, and J\. Xiong\(2002\)An approximation for the zakai equation\.Applied Mathematics and Optimization45,pp\. 23–44\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[26\]International Energy Agency\(2025\)Electricity 2025: analysis and forecast to 2027\.Note:[https://www\.iea\.org/reports/electricity\-2025](https://www.iea.org/reports/electricity-2025)Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p2.2)\.
- \[27\]K\. Kang, V\. Maroulas, I\. Schizas, and F\. Bao\(2018\)Improved distributed particle filters for tracking in a wireless sensor network\.Computational Statistics & Data Analysis117,pp\. 90–108\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[28\]P\. E\. Kloeden and E\. Platen\(1992\)Numerical solution of stochastic differential equations\.Applications of Mathematics \(New York\), Vol\.23,Springer\-Verlag,Berlin\.External Links:ISBN 3\-540\-54062\-8,[MathReview \(G\. N\. Mil\\cprimeshteĭn\)](https://www.ams.org/mathscinet-getitem?mr=1214374)Cited by:[§3\.2](https://arxiv.org/html/2605.29072#S3.SS2.p3.1)\.
- \[29\]K\. Law, A\. Stuart, and K\. Zygalakis\(2015\)Data assimilation\.Cham, Switzerland: Springer214,pp\. 52\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p6.1)\.
- \[30\]S\. Liang, H\. Tran, F\. Bao, H\. Chipilski, P\. van Leeuwen, and G\. Zhang\(2024\)Ensemble score filter with image inpainting for data assimilation in tracking surface quasi\-geostrophic dynamics with partial observations\.arXiv,pp\. 1–35\.External Links:[Document](https://dx.doi.org/arXiv%3A2501.12419)Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p11.1)\.
- \[31\]S\. Liang, R\. Hu, F\. Bao, R\. Archibald, and G\. Zhang\(2024\)An online algorithm for solving feedback optimal control problems with partial observations\.arXiv preprint arXiv:2404\.05734\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[32\]M\. K\. Pitt and N\. Shephard\(1999\)Filtering via simulation: auxiliary particle filters\.Journal of the American statistical association94\(446\),pp\. 590–599\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[33\]A\. H\. M\. Rafid, J\. Yin, Y\. Geng, S\. Liang, F\. Bao, L\. Ju, and G\. Zhang\(2024\)A scalable real\-time data assimilation framework for predicting turbulent atmosphere dynamics\.SC24\-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis1,pp\. 11–18\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p11.1)\.
- \[34\]C\. Snyder, T\. Bengtsson, P\. Bickel, and J\. Anderson\(2008\)Obstacles to high\-dimensional particle filtering\.Monthly Weather Review136\(12\),pp\. 4629–4640\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[35\]Y\. Song and S\. Ermon\(2019\)Generative modeling by estimating gradients of the data distribution\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p11.1)\.
- \[36\]Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole\(2021\)Score\-based generative modeling through stochastic differential equations\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=PxTIG12RRHS)Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p11.1),[§3\.2](https://arxiv.org/html/2605.29072#S3.SS2.p2.6)\.
- \[37\]P\. J\. van Leeuwen\(2009\)Particle filtering in geophysical systems\.Monthly Weather Review137,pp\. 4089–4114\.External Links:[Document](https://dx.doi.org/10.1175/2009MWR2835.1)Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p8.1)\.
- \[38\]M\. Zakai\(1969\)On the optimal filtering of diffusion processes\.Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete11\(3\),pp\. 230–243\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[39\]H\. Zhang and D\. Laneuville\(2008\)Grid based solution of zakai equation with adaptive local refinement for bearings\-only tracking\.In2008 IEEE Aerospace Conference,pp\. 1–8\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p10.1)\.
- \[40\]E\. Zhou, S\. Gadzanku, C\. Hodge, M\. Campton, S\. de la Rue du Can, and J\. Zhang\(2023\)Best practices in electricity load modeling and forecasting for long\-term power system planning\.Technical reportNational Renewable Energy Laboratory and Lawrence Berkeley National Laboratory\.Cited by:[§1](https://arxiv.org/html/2605.29072#S1.p2.2)\.

Similar Articles

Nested Spatio-Temporal Time Series Forecasting

arXiv cs.LG

This paper proposes a nested spatiotemporal forecasting framework that uses spectral clustering to construct semantically coherent macro-level regions, which provide top-down guidance for fine-grained micro-level predictions. Experiments on high-dimensional datasets show consistent improvements over state-of-the-art baselines.