Towards Continuous Power Forecasting: Practical Continual Learning for Real-World Energy Systems in Nonstationary Time Series
Summary
This paper proposes Continuous Power Forecasting, treating power forecasting as a continual learning problem to handle nonstationary conditions. It evaluates six CL approaches on real-world datasets, showing benefits in adaptation and mitigating catastrophic forgetting.
# Towards Continuous Power Forecasting: Practical Continual Learning for Real-World Energy Systems in Nonstationary Time Series
Source: [https://arxiv.org/html/2606.24955](https://arxiv.org/html/2606.24955)
Intelligent Embedded Systems, University of Kassel, 34121 Kassel, Germany
Email: \{yujiang\.he, frederic\.uhrweiller, bsick\}@uni\-kassel\.de

###### Abstract
Power forecasting models deployed in real\-world energy markets must operate under nonstationary conditions, where data distributions continually evolve due to weather variability, infrastructure upgrades, and changing consumption behaviors\. In practice, these models face strict operational constraints: historical data may be limited or unavailable for repeated retraining, and uninterrupted long\-term service is often required\.
This paper addresses these challenges by proposing the paradigm of Continuous Power Forecasting, which views power forecasting as a continual learning problem rather than a static offline task\. Based on an adaptive continual learning framework for regression, we systematically investigate the practical effectiveness of six representative continual learning approaches from three methodological categories\. These approaches are evaluated under different realistic assumptions regarding data accessibility and update policies\. Experimental validation on real\-world power datasets demonstrates that continual learning enables forecasting models to self\-adapt to distributional drift, accumulate knowledge over time, and mitigate catastrophic forgetting without relying on large\-scale historical data storage\. Beyond performance gains, our study provides practical insights into the stability and adaptation behaviors of different continual learning approaches under realistic operational constraints\. Overall, this work illustrates how continual learning can be pragmatically integrated into industrial power forecasting pipelines, offering a scalable and sustainable solution for long\-term deployment in dynamic environments\.
## 1Introduction
Power forecasting is a core component of modern energy systems, supporting market operations and grid management\. In real\-world deployments, forecasting models must operate over long lifecycles under nonstationary conditions, where data distributions evolve due to changing weather patterns, infrastructure, and consumption behavior\. At the same time, operational constraints such as limited access to historical data and the requirement for uninterrupted service make repeated offline retraining impractical\.
In practice, nonstationarity manifests both as drift in the input distribution and as changes in the underlying relationship between inputs and targets \(concept drift\)\. While collecting large training datasets can partially mitigate such effects, this approach is costly, slow, and often incompatible with data privacy protection requirements\. Conversely, naively fine\-tuning models on incoming data leads to catastrophic forgetting, which weakens long\-term forecasting reliability\.
We address these challenges through the paradigm of Continuous Power Forecasting \(CPF\), which treats power forecasting as a lifelong adaptive process with the support of continual learning \(CL\) rather than a static offline task\. Here, *continuous* emphasizes uninterrupted system\-level operation over the model’s lifecycle, while *continual* denotes the repetitive learning mechanism that enables incremental adaptation and knowledge retention\. The goal is to allow forecasting models to autonomously adapt to distributional drift and accumulate knowledge over time without relying on large\-scale historical data storage\.
Based on the modular CLeaR \(Continual Learning for Regression\) framework\[[6](https://arxiv.org/html/2606.24955#bib.bib1)\], which enables unified comparison of distinct continual learning approaches for regression, this paper investigates the practical effectiveness of six representative CL approaches for power forecasting under realistic operational assumptions\. The main contributions of this work can be summarized as follows:
- We formalize CPF as an operational paradigm for lifelong energy forecasting systems operating under nonstationary conditions and strict operational constraints\.
- We conduct a unified large\-scale empirical evaluation of six representative continual learning approaches under realistic data accessibility constraints, providing mechanism\-level insights into their stability\-plasticity trade\-offs across diverse power entities\.
- We ensure reproducibility by evaluating all implementations on publicly accessible datasets and releasing the code of the CLeaR framework and all evaluated CL approaches \(code for the implementation will be released publicly upon acceptance\)\.
## 2Related Work
CL studies how models can learn multiple tasks in sequence without suffering catastrophic forgetting, a phenomenon fundamentally rooted in the stability\-plasticity dilemma\. A large body of CL research categorizes solutions into three main groups\[[2](https://arxiv.org/html/2606.24955#bib.bib4)\]: regularization\-based approaches\[[8](https://arxiv.org/html/2606.24955#bib.bib5),[21](https://arxiv.org/html/2606.24955#bib.bib6),[17](https://arxiv.org/html/2606.24955#bib.bib7)\] that constrain parameter updates, replay\-based approaches\[[14](https://arxiv.org/html/2606.24955#bib.bib8),[13](https://arxiv.org/html/2606.24955#bib.bib9),[18](https://arxiv.org/html/2606.24955#bib.bib10),[19](https://arxiv.org/html/2606.24955#bib.bib11)\] that utilize stored or synthesized historical samples to anchor optimization, and architecture\-based approaches\[[15](https://arxiv.org/html/2606.24955#bib.bib12)\] that assign dedicated parameter subspaces to different tasks to eliminate interference\. However, standardized CL benchmarks mainly focus on classification settings\[[2](https://arxiv.org/html/2606.24955#bib.bib4),[7](https://arxiv.org/html/2606.24955#bib.bib15),[10](https://arxiv.org/html/2606.24955#bib.bib14)\], leaving open questions about the applicability of these approaches to forecasting scenarios, which remain comparatively less explored in the CL literature\[[6](https://arxiv.org/html/2606.24955#bib.bib1),[1](https://arxiv.org/html/2606.24955#bib.bib16)\]\.
In the specific domain of power systems, continuous forecasting is a practical necessity due to evolving consumption patterns, generation variability, and operational constraints\[[4](https://arxiv.org/html/2606.24955#bib.bib3)\]\. While state\-of\-the\-art forecasting architectures, such as Informer\[[22](https://arxiv.org/html/2606.24955#bib.bib17)\], Autoformer\[[20](https://arxiv.org/html/2606.24955#bib.bib18)\], and PatchTST\[[12](https://arxiv.org/html/2606.24955#bib.bib19)\], have achieved remarkable accuracy on static benchmarks, they mostly rely on offline training or periodic retraining on large\-scale historical datasets\. Such assumptions are often infeasible in real deployments with restricted data accessibility, privacy constraints, or uninterrupted service requirements\.
Related literature on adaptive time\-series forecasting encompasses methods for concept drift detection and model updating, including statistical change detection and drift\-aware learning\[[3](https://arxiv.org/html/2606.24955#bib.bib23),[11](https://arxiv.org/html/2606.24955#bib.bib26)\], online learning algorithms\[[16](https://arxiv.org/html/2606.24955#bib.bib28)\], and ensemble\-based adaptation techniques\[[9](https://arxiv.org/html/2606.24955#bib.bib27)\]\. These approaches aim to maintain predictive performance under evolving distributions, but typically emphasize detection and reactive retraining rather than explicit stability\-plasticity trade\-offs under memory constraints\.
Bounded\-memory continual regression has recently been instantiated in modular frameworks such as CLeaR\[[6](https://arxiv.org/html/2606.24955#bib.bib1),[5](https://arxiv.org/html/2606.24955#bib.bib2)\], which integrates buffer\-based storage, novelty detection, and multiple update strategies for streaming settings\. However, systematic empirical evaluation of heterogeneous CL approaches under realistic operational constraints remains limited\.
Our work builds on these strands by providing a unified empirical evaluation of multiple CL approaches within a consistent protocol on a real\-world power grid dataset\. By situating regularization\-based, replay\-based, and generative approaches in a common framework and analyzing their behavior under diverse nonstationary conditions, we offer insights that extend beyond static benchmark settings\. This perspective contributes to both the CL literature and the applied data science community by clarifying the conditions under which particular adaptation strategies may be preferred in dynamic, resource\-constrained forecasting environments\.
## 3CLeaR Framework and Instantiations
CLeaR is an adaptive and modular CL framework designed for regression problems, where forecasting models must evolve to adapt to changing data distributions over time under streaming data and limited historical storage constraints\[[6](https://arxiv.org/html/2606.24955#bib.bib1)\]\. At the framework level, CLeaR consists of two main modules: a buffer\-based CL module and a novelty detection module\. The CL module mainly contains: \(1\) a neural network\-based forecasting model, \(2\) familiarity and novelty buffers, \(3\) a CL approach used to update model parameters, and \(4\) a trigger mechanism\. The familiarity buffer stores samples consistent with the current model state, while the novelty buffer accumulates samples indicating distributional drift in the data stream\. These buffer types are intrinsic to the CLeaR framework, as continuous monitoring of potential task changes is required for controlled adaptation\. Additional buffers, such as replay buffers, are optional and depend on the specific CL approach adopted in a given instantiation \(see Section [3\.2](https://arxiv.org/html/2606.24955#S3.SS2)\)\. The novelty detection module assigns incoming samples to buffers according to a detection function that quantifies distributional deviation relative to the deployed model\.
Operationally, CLeaR follows a novelty\-driven streaming adaptation loop\. Given a data stream $\{\mathbf{x}_t, y_t\}_{t=1}^{T}$ under bounded memory constraints, CLeaR iteratively: \(1\) receives streaming samples; \(2\) generates forecasts and evaluates detection signals; \(3\) assigns samples to the appropriate buffer\(s\); \(4\) triggers parameter updates when predefined conditions are satisfied; \(5\) replaces the deployed model with the updated version; and \(6\) continues monitoring under the new model state\. Notably, these components are instantiated according to application requirements\. A concrete implementation of CLeaR is referred to as a CLeaR instance, obtained by specifying each defined module\. Figure [1](https://arxiv.org/html/2606.24955#S3.F1) illustrates an instance used in this study\. While the figure depicts a real\-replay instantiation, the detection and update mechanism described below is shared across all instances unless otherwise specified\.
Figure 1: CLeaR instance with real replay, showing dual\-branch novelty detection \(reconstruction and prediction errors\), buffer allocation, and asynchronous model update\.

Section [3\.1](https://arxiv.org/html/2606.24955#S3.SS1) describes the instantiation of the detection and update mechanism adopted in this work, while Section [3\.2](https://arxiv.org/html/2606.24955#S3.SS2) details the CL approaches evaluated within this work\.
### 3\.1Novelty Detection and Update Mechanism
All evaluated CLeaR instances share a common forecasting architecture composed of an autoencoder \(AE\) for representation learning, and a target\-specific multilayer perceptron predictor\. The AE is first trained to learn compact representations of input features\. Once training is completed, the encoder parameters are frozen, and the learned representation is fed into the predictor for supervised training\. This architecture is intentionally modular and extendable\. The neural network can be replaced by more advanced neural networks, extended for multi\-task learning, or substituted with a variational autoencoder \(VAE\) when Generative Replay is required\.
The model provides two measurable error signals: reconstruction error and prediction error\. The former reflects drifts in the input distribution, while the latter captures deviations in the input\-target relationship\. These signals form the basis of novelty detection in this instantiation\. At each time step, the deployed model generates forecasts and computes both errors\. Each error is compared against a dynamically adjusted threshold\. In this instantiation, thresholds are defined as a scaled running mean error of the most recently updated model\. Specifically, after each update, the threshold is re\-estimated as $\tau = \alpha \cdot \bar{e}$, where $\bar{e}$ denotes the mean reconstruction or prediction error evaluated on the samples currently stored in the familiarity and novelty buffers, and $\alpha$ is a user\-defined scaling factor controlling sensitivity\. A smaller $\alpha$ increases sensitivity to deviations, while larger values reduce unnecessary updates\. Thresholds are initialized using the full warm\-up dataset during the warm\-up phase and then adjusted only after each model update\. Errors on familiarity samples reflect stability, whereas errors on novelty samples reflect adaptation quality\. The re\-estimated threshold therefore implicitly reflects the current stability\-plasticity balance achieved by the updated model\.
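The threshold rule can be illustrated with a minimal sketch. The function names `reestimate_threshold` and `is_novel` and the example error values are hypothetical and chosen only for illustration; the only part taken from the text is the rule $\tau = \alpha \cdot \bar{e}$, with $\bar{e}$ averaged over the samples in both buffers.

```python
from statistics import fmean


def reestimate_threshold(familiarity_errors, novelty_errors, alpha):
    """Return tau = alpha * mean error over the familiarity and novelty buffers."""
    errors = list(familiarity_errors) + list(novelty_errors)
    if not errors:
        raise ValueError("both buffers are empty; the threshold is undefined")
    return alpha * fmean(errors)


def is_novel(error, tau):
    """A sample counts as novel when its error exceeds the threshold."""
    return error > tau


# Illustrative error values, not taken from the paper.
tau = reestimate_threshold([0.10, 0.12], [0.30, 0.28], alpha=1.5)
```

With these illustrative numbers the mean error is 0.20, so the threshold is about 0.30. Lowering `alpha` would flag more samples as novel and trigger updates more often.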
Samples exceeding the reconstruction or prediction threshold indicate potential drifts in the input distribution and in the underlying input\-target relationship, respectively\. Input samples $\mathbf{x}$ are stored in the corresponding familiarity or novelty buffer of the autoencoder, while labeled pairs $\{\mathbf{x}, y\}$ are assigned to the buffers associated with the predictor\. This separation enables independent monitoring of distributional and conditional shifts\. Each novelty buffer is associated with a predefined capacity\. When the number of stored novelty samples reaches this capacity, the corresponding model component is scheduled for update\. Through the capacity\-based trigger, the updating frequency is jointly determined by the model’s performance, the sensitivity of the detection thresholds, and the magnitude of distributional drift in the data stream\. Once the model update is finished, the updated model replaces the previously deployed model\. Thereafter, the corresponding threshold is re\-estimated and the corresponding buffers are emptied\. Streaming monitoring continues under the new model state until the next update is triggered\. In this study, updating is performed asynchronously for most instantiations\. The autoencoder and predictor maintain independent buffers and are updated only when their respective novelty buffers reach capacity\. This enables differentiated responses to distributional and conditional drifts\. For Generative Replay, however, synchronous updates are adopted to ensure consistency between representation learning and synthetic data generation\.
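The buffer allocation and capacity\-based trigger for a single component can be sketched as follows. This is a simplified illustration and not the CLeaR implementation: the class `ComponentMonitor` and its method names are hypothetical, and the model update itself is left outside the sketch.

```python
class ComponentMonitor:
    """Buffers and trigger for one component (autoencoder or predictor)."""

    def __init__(self, tau, novelty_capacity):
        self.tau = tau
        self.novelty_capacity = novelty_capacity
        self.familiarity = []
        self.novelty = []

    def observe(self, sample, error):
        """Route a sample to a buffer; return True when an update is due."""
        if error > self.tau:
            self.novelty.append(sample)
        else:
            self.familiarity.append(sample)
        return len(self.novelty) >= self.novelty_capacity

    def finish_update(self, new_tau):
        """After the updated model is deployed: set the new threshold, empty buffers."""
        self.tau = new_tau
        self.familiarity.clear()
        self.novelty.clear()


# Illustrative stream of (sample, error) pairs.
monitor = ComponentMonitor(tau=0.3, novelty_capacity=2)
stream = [("a", 0.1), ("b", 0.5), ("c", 0.2), ("d", 0.4)]
triggers = [monitor.observe(sample, error) for sample, error in stream]
```

In an asynchronous instantiation, one such monitor would exist for the autoencoder (fed with reconstruction errors) and another for the predictor (fed with prediction errors), so each component is updated on its own schedule.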
Different CLeaR instances share this workflow, while differing in the CL approaches used during parameter updating, as described in Section [3\.2](https://arxiv.org/html/2606.24955#S3.SS2)\.
### 3\.2Continual Learning Approaches
Rather than reiterating the standard taxonomy of continual learning, we focus on how different update mechanisms are instantiated within CLeaR under operational CPF constraints\. Specifically, we study three update strategies that differ in how historical information is preserved and accessed during parameter adaptation: \(1\) regularization\-based weight consolidation, \(2\) real replay from explicitly stored historical samples, and \(3\) pseudo replay via learned data approximation through generated or curated samples\. These approaches differ in memory footprint, computational overhead, and their assumptions regarding historical data accessibility, which are critical under streaming CPF settings\. For convenience, each approach is instantiated as a CLeaR instance for experimental validation \(Section [4](https://arxiv.org/html/2606.24955#S4)\), with abbreviations for each instance introduced in Section [4\.2](https://arxiv.org/html/2606.24955#S4.SS2)\.
#### 3\.2\.1Online EWC
Elastic Weight Consolidation \(EWC\)\[[8](https://arxiv.org/html/2606.24955#bib.bib5)\] mitigates catastrophic forgetting by constraining updates to parameters that are important for previously learned data\. Instead of storing historical samples, EWC estimates parameter importance using the Fisher Information \(FI\) and penalizes deviations from previously consolidated parameter values\.
In the online variant\[[17](https://arxiv.org/html/2606.24955#bib.bib7)\], parameter importance is accumulated recursively\. Let $\theta^{*}_{i}$ denote the $i$\-th consolidated weight parameter from the previous update and $F^{*}_{i}$ the corresponding accumulated FI\. When the trigger condition is fulfilled, the model parameters are optimized by minimizing:

$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_{\text{N}}(\theta; \mathcal{D}_N) + \frac{\lambda}{2} \sum_{i} F^{*}_{i} \left(\theta_i - \theta^{*}_{i}\right)^{2}, \tag{1}$$

where $\mathcal{L}_{\text{N}}$ denotes the task loss computed solely on the novelty samples $\mathcal{D}_N$, which is interpreted as representing newly emerged task information, and $\lambda$ controls the strength of regularization\. After optimization, the accumulated FI is updated recursively: $F^{*} = \gamma F^{*} + F_{\text{new}}$, where $F_{\text{new}}$ is estimated using both the novelty and familiarity buffers since historical data are assumed no longer fully accessible\. This design implicitly assumes that the familiarity buffer provides a sufficient summary of recent historical distributions for consolidation purposes\. $\gamma \in [0,1]$ is a decay factor controlling the retention of historical importance\.
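A minimal numerical sketch of Eq. (1) and the recursive FI update is given below. It uses plain Python lists as stand\-ins for parameter vectors; the function names and the example values are hypothetical, and the estimation of $F_{\text{new}}$ from the buffers is not shown.

```python
def ewc_loss(task_loss, theta, theta_star, fisher_star, lam):
    """Task loss on novelty data plus (lam / 2) * sum_i F*_i (theta_i - theta*_i)^2."""
    penalty = sum(
        f * (t - ts) ** 2 for f, t, ts in zip(fisher_star, theta, theta_star)
    )
    return task_loss + 0.5 * lam * penalty


def update_fisher(fisher_star, fisher_new, gamma):
    """Recursive accumulation F* = gamma * F* + F_new, applied element-wise."""
    return [gamma * f_old + f_new for f_old, f_new in zip(fisher_star, fisher_new)]


# Illustrative values: only the first parameter moved away from its anchor.
loss = ewc_loss(
    task_loss=0.5,
    theta=[1.0, 2.0],
    theta_star=[0.0, 2.0],
    fisher_star=[2.0, 5.0],
    lam=4.0,
)
fisher = update_fisher([2.0, 5.0], [1.0, 1.0], gamma=0.5)
```

Here the penalty is driven entirely by the first parameter, since the second one still equals its consolidated value. Setting `gamma` to 0 would discard all previously accumulated importance, while `gamma` equal to 1 would retain it in full.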
#### 3\.2\.2Real Replay\-based Approaches
In real replay, historical samples are explicitly stored and replayed during parameter updates\. Real replay\-based approaches mitigate forgetting by replaying a subset of previously observed samples and jointly optimizing on replayed and novelty data\. When an update is triggered at time $t_k$, the replay buffer $\mathcal{D}_R$ is populated by sampling a fixed number of real historical samples from all data observed prior to the previous update time $t_{k-1}$\. Importantly, the replay buffer is reconstructed at each trigger event\. Let $\mathcal{D}_N$ denote the novelty dataset collected between $t_{k-1}$ and $t_k$\. Model parameters are optimized by minimizing the combined objective:

$$\mathcal{L}_{\text{replay}}(\theta) = \mathcal{L}_{\text{N}}(\theta; \mathcal{D}_N) + \lambda_R \mathcal{L}_{\text{R}}(\theta; \mathcal{D}_R), \tag{2}$$

where $\mathcal{L}_{\text{N}}$ and $\mathcal{L}_{\text{R}}$ denote the task loss evaluated on novelty and replay samples, respectively, and $\lambda_R$ controls the relative contribution of replayed data\. A fixed replay size is adopted for computational tractability and bounded memory usage, making deployment in real\-world streaming scenarios feasible\. The replay approaches differ only in how $\mathcal{D}_R$ is constructed under memory constraints\. Detailed algorithmic procedures and pseudocode are provided in the Appendix\.
##### Random Replay
This approach constructs $\mathcal{D}_R$ at each update trigger by uniformly selecting samples from the historical data available prior to the last completed update\. Each eligible historical sample has equal probability of being included in the replay set, irrespective of its recency\. Under a fixed replay size, Random Replay can be interpreted as a temporal downsampling mechanism\. Given that weather and power time series evolve gradually, such downsampling preserves the overall temporal trends while reducing short\-term fluctuations\. It serves as a baseline mechanism without imposing temporal preference\.
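As a rough sketch, Random Replay amounts to uniform sampling without replacement from the eligible history. The function name `random_replay` and the toy history below are hypothetical; whether the authors sample with or without replacement is not stated in the text, so sampling without replacement is an assumption here.

```python
import random


def random_replay(history, replay_size, rng):
    """Uniformly draw up to replay_size distinct samples from the eligible history."""
    k = min(replay_size, len(history))
    return rng.sample(history, k)


# Toy history of 100 time-indexed samples; a fixed seed keeps the sketch repeatable.
rng = random.Random(0)
history = list(range(100))
replay_set = random_replay(history, replay_size=10, rng=rng)
```

Because every eligible sample is equally likely to be drawn, the replay set covers the whole history at a lower temporal density, which matches the downsampling interpretation above.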
##### Recent Replay
To construct $\mathcal{D}_R$, this approach uniformly selects samples from the historical data observed during the $N_R$ most recent updates\. Formally, if $n$ updates have been completed, $\mathcal{D}_R$ comprises samples from updates $n - N_R + 1$ to $n$, or all available updates if $n < N_R$\. By prioritizing recent updates, this approach emphasizes short\-term stability while allowing older knowledge to gradually fade\. When the given $N_R$ exceeds the total number of currently completed updates, Recent Replay reduces to Random Replay over the historical data, providing an empirical unbiased approximation of the consolidated past experiences\.
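The windowing rule can be sketched as below, assuming the history is kept as one list of samples per completed update (a hypothetical storage layout; the name `recent_replay` is also illustrative).

```python
import random


def recent_replay(samples_by_update, n_recent, replay_size, rng):
    """Sample uniformly from the n_recent most recent updates (all if fewer exist)."""
    if n_recent < 1:
        raise ValueError("n_recent must be at least 1")
    # Covers updates n - N_R + 1 .. n; the slice returns everything when n < N_R.
    window = samples_by_update[-n_recent:]
    pool = [sample for update in window for sample in update]
    return rng.sample(pool, min(replay_size, len(pool)))


# Three completed updates with five tagged samples each.
samples_by_update = [[(u, i) for i in range(5)] for u in (1, 2, 3)]
replay_set = recent_replay(samples_by_update, n_recent=2, replay_size=6,
                           rng=random.Random(0))
```

With `n_recent=2`, only samples tagged with updates 2 and 3 can appear in the replay set. Passing an `n_recent` larger than the number of completed updates makes the pool equal to the full history, which is the Random Replay case.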
##### Recent Replay with Decay
This approach extends the standard Recent Replay by introducing a recency\-based weighting over the most recent $N_R$ updates\. Within this temporal window, samples associated with more recent updates are assigned higher selection probabilities, linearly increasing with recency\. During replay set construction, the number of samples drawn from each update is proportional to its weight, ensuring that recent knowledge is emphasized while still retaining a limited influence from older updates\. If the total number of updates $n$ is smaller than $N_R$, the weighting is applied over all available updates, yielding a gradually decayed sampling distribution across the historical dataset\.
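One possible way to turn the linear recency weighting into per\-update sample counts is sketched below. The weights 1, 2, ..., m (oldest to newest) and the largest\-remainder rounding are assumptions made for this illustration; the text only states that weights increase linearly with recency and that counts are proportional to the weights.

```python
def linear_recency_counts(num_updates, n_recent, replay_size):
    """Split replay_size across the window with weights 1..m, oldest to newest."""
    m = min(num_updates, n_recent)
    weights = list(range(1, m + 1))
    total = sum(weights)
    exact = [replay_size * w / total for w in weights]
    counts = [int(x) for x in exact]
    # Hand out the remaining samples by largest fractional part,
    # breaking ties in favour of more recent updates.
    leftover = replay_size - sum(counts)
    order = sorted(range(m), key=lambda j: (exact[j] - counts[j], j), reverse=True)
    for j in order[:leftover]:
        counts[j] += 1
    return counts


counts_even = linear_recency_counts(num_updates=5, n_recent=3, replay_size=12)
counts_rounded = linear_recency_counts(num_updates=5, n_recent=3, replay_size=10)
counts_short = linear_recency_counts(num_updates=2, n_recent=3, replay_size=9)
```

The sketch does not check whether an update actually holds as many samples as it is allotted; a practical implementation would need to cap each count and redistribute the remainder.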
#### 3\.2\.3Pseudo Replay\-based Approaches
Instead of maintaining a real replay buffer, pseudo replay generates or curates samples representing past knowledge, which are then incorporated into parameter updates\. It reduces direct memory dependence while preserving past information through implicit distribution modeling\.
##### Generative Replay
This approach implements pseudo replay by replacing the AE with a VAE to synthesize historical samples during adaptation\. Unlike a deterministic AE, the VAE encoder regularizes the latent space toward a standard prior, enabling the decoder to synthesize diverse pseudo\-samples by sampling from this learned distribution\.
After each completed update, the model parameters are archived to preserve the consolidated historical state\. When a new update is triggered, pseudo historical sample pairs are constructed using these archived models\. Specifically, latent codes are sampled from the approximated prior distribution\. The archived decoder generates pseudo input sequences from these latent representations, while the archived predictor simultaneously produces the corresponding pseudo outputs from the same representations\. As a result, replay samples form input\-output pairs, ensuring consistency between reconstructed features and targets\.
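The pairing of pseudo inputs and pseudo outputs through a shared latent code can be sketched as follows. The archived decoder and predictor are represented by toy callables, and the latent codes are drawn from a standard normal distribution; both choices are assumptions for illustration and do not reproduce the trained networks or the exact prior used in the paper.

```python
import random


def generate_pseudo_pairs(archived_decoder, archived_predictor,
                          latent_dim, n_samples, rng):
    """Build (pseudo input, pseudo output) pairs from shared latent codes."""
    pairs = []
    for _ in range(n_samples):
        # Assumed prior: independent standard normal latent variables.
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        x_pseudo = archived_decoder(z)    # pseudo input features
        y_pseudo = archived_predictor(z)  # pseudo target from the same code
        pairs.append((x_pseudo, y_pseudo))
    return pairs


# Toy stand-ins for the archived networks (hypothetical, for illustration only).
def toy_decoder(z):
    return [2.0 * v for v in z]


def toy_predictor(z):
    return sum(z)


pairs = generate_pseudo_pairs(toy_decoder, toy_predictor,
                              latent_dim=3, n_samples=4, rng=random.Random(0))
```

Because both callables receive the same latent code, each pseudo target is consistent with its pseudo input, which is the property the paragraph above relies on.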
For updating, the generated pseudo dataset $\mathcal{D}_G$ is combined with the novelty dataset $\mathcal{D}_N$\. Model parameters are optimized using an objective that mirrors Eq\. \([2](https://arxiv.org/html/2606.24955#S3.E2)\) but replaces the real replay data with the generated pseudo dataset\. Because pseudo inputs and outputs are generated jointly from shared latent variables, both the VAE and the predictor must be updated synchronously\. This synchronized optimization preserves alignment between representation learning and predictive mapping, which is critical for maintaining coherent historical knowledge across adaptation steps\. Compared to real replay, generative replay reduces explicit storage requirements at the cost of additional model training and sampling overhead, offering a flexible memory\-compute trade\-off under constrained storage scenarios\.
##### Familiarity\-based Replay
This approach utilizes familiarity samples to strengthen EWC\-based parameter updates\. Unlike other CLeaR instantiations, where familiarity samples are only used for FI estimation and threshold adjustment, this approach incorporates $\widetilde{\mathcal{D}}_F$, a subset of the familiarity buffer, into the model optimization\. When an update is triggered, the optimization objective extends Eq\. \(1\) by replacing $\mathcal{D}_N$ with the combined updating dataset $\mathcal{D}_{\text{comb}} = \mathcal{D}_N \cup \widetilde{\mathcal{D}}_F$\. The proportion of familiarity samples is controlled to prevent excessive dominance over novelty\-driven adaptation\. The Fisher coefficients $F^{*}_{i}$ remain estimated from previous updates and are not modified by familiarity data\. The inclusion of familiarity samples introduces an additional optimization anchor and complements the parameter\-space regularization imposed by EWC\. Since both buffers participate in the update, this approach also mitigates potential misallocation errors arising from imperfect novelty detection by retaining samples deemed stable under the current model\.
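A simple way to form the combined updating dataset with a bounded share of familiarity samples is sketched below. Expressing the bound as a ratio relative to the number of novelty samples, and selecting the subset uniformly at random, are assumptions of this sketch; the text does not specify how the proportion is controlled.

```python
import random


def build_combined_dataset(novelty, familiarity, max_familiarity_ratio, rng):
    """Return novelty samples plus a capped random subset of familiarity samples."""
    if max_familiarity_ratio < 0:
        raise ValueError("max_familiarity_ratio must be non-negative")
    n_fam = min(len(familiarity), int(max_familiarity_ratio * len(novelty)))
    subset = rng.sample(familiarity, n_fam)
    return list(novelty) + subset


# Toy buffers with tagged samples.
novelty = [("n", i) for i in range(10)]
familiarity = [("f", i) for i in range(50)]
combined = build_combined_dataset(novelty, familiarity, 0.5, random.Random(0))
```

In this toy case the combined set holds all 10 novelty samples and 5 familiarity samples, so the update remains driven mainly by novelty data.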
## 4Experimental Validation of Continuous Power Forecasting
### 4\.1Real\-World Power Grid Dataset
The experiments are conducted on a real\-world regional power grid dataset \(available upon reasonable request\)\. The dataset spans approximately 23 months with a uniform temporal resolution of 15 minutes, resulting in 67 764 time steps\. The dataset comprises 95 power entities, including 7 independent generators, 59 independent consumers, and 29 aggregated residual loads\. The residual loads represent consumption units without individual measurements and are estimated from substation\-level aggregated statistics\. All entities are anonymized due to privacy protection\.
Forecasting is formulated as a pointwise mapping conditioned on given meteorological inputs rather than as a sequence\-to\-sequence task\. The meteorological data are derived from a day\-ahead numerical weather prediction \(NWP\) system and include 13 variables such as wind speed, temperature, radiation, cloud cover, and precipitation\. Given the relatively small geographical area \(approximately 83 km²\), spatially uniform weather inputs are assumed for all entities, and spatial variance is not explicitly modeled\. The original hourly NWP data are linearly interpolated to match the 15\-minute resolution of the power measurements\. The interpolation is applied exclusively to NWP and does not involve future observed values\. In addition, five timestamp\-based temporal indicators \(hour, minute, week, weekday, day\-of\-year\) are encoded using sine\-cosine transformations, resulting in 10 cyclical temporal features\. The final input dimensionality is therefore 23\.
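The sine\-cosine encoding of the five temporal indicators can be sketched as follows. The periods (24, 60, 53, 7, 366), the use of ISO week numbers, and the zero\-based offsets are assumptions made for this illustration, since the text does not state them.

```python
import math
from datetime import datetime


def cyclical(value, period):
    """Map a periodic value to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def temporal_features(ts):
    """Encode hour, minute, week, weekday and day-of-year as 10 cyclical features."""
    iso_week = ts.isocalendar()[1]
    indicators = [
        (ts.hour, 24),
        (ts.minute, 60),
        (iso_week - 1, 53),                 # assumed period for week of year
        (ts.weekday(), 7),
        (ts.timetuple().tm_yday - 1, 366),  # assumed period for day of year
    ]
    features = []
    for value, period in indicators:
        features.extend(cyclical(value, period))
    return features


features = temporal_features(datetime(2021, 1, 4, 0, 0))
N_WEATHER = 13  # number of NWP variables stated in the text
input_dim = N_WEATHER + len(features)
```

The encoding places values such as 23:45 and 00:00 close together in feature space, which a raw integer hour would not. Together with the 13 weather variables, the 10 temporal features give the stated input dimensionality of 23.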
Power measurements are normalized by rated or peak capacity for each entity, and weather variables are scaled to the range $[0,1]$ based on historical extrema\.
### 4\.2Continual Learning Protocol
For clarity in the experimental results, each continual learning approach is instantiated as a corresponding CLeaR instance and abbreviated as follows: Online EWC is referred to as $\text{CLeaR}_{\text{E}}$, Random Replay as $\text{CLeaR}_{\text{Ra}}$, Recent Replay as $\text{CLeaR}_{\text{Re}}$, Recent Replay with Decay as $\text{CLeaR}_{\text{ReD}}$, Generative Replay as $\text{CLeaR}_{\text{G}}$, and Familiarity\-based Replay as $\text{CLeaR}_{\text{F}}$\. The data for each power entity is divided into three phases: warm\-up, updating, and test\.
Table 1: Grid search space for the AE/VAE and predictor\. The encoder begins with the specified number of neurons in the first hidden layer, with subsequent layers containing 70% of the neurons in the previous layer\. The decoder mirrors the encoder to reconstruct input features\.

During the warm\-up phase, the first 10 000 samples \(approximately 104 days\) are used to initialize the shared AE \(or VAE for $\text{CLeaR}_{\text{G}}$\) and entity\-specific predictors, identifying optimal architectures and latent dimensions\. This phase allows the model to capture fundamental patterns from limited historical data\. All tunable parameters for the AE/VAE and predictor are specified in Table [1](https://arxiv.org/html/2606.24955#S4.T1)\. Warm\-up training uses 80% of warm\-up samples for training and 20% for validation, with early stopping \(patience 50\), a maximum of 512 epochs, batch size 64, and the Adam optimizer with learning rate 0\.001\. Hyperparameter selection is based on the lowest validation root mean squared error \(RMSE\), using reconstruction error for the AE/VAE and prediction error for the entity\-specific predictor\.
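The 70% shrinkage rule from the Table 1 caption can be illustrated with a short sketch. The number of layers, the first\-layer width of 100, and rounding down to an integer are hypothetical choices; the actual search space is given in Table 1 and is not reproduced here.

```python
def encoder_widths(first_hidden, n_layers, shrink_percent=70):
    """Layer widths where each layer has shrink_percent % of the previous one."""
    widths = [first_hidden]
    for _ in range(n_layers - 1):
        # Integer arithmetic avoids floating-point rounding; floor is an assumption.
        widths.append(widths[-1] * shrink_percent // 100)
    return widths


def decoder_widths(enc_widths):
    """The decoder mirrors the encoder, so its hidden widths are reversed."""
    return list(reversed(enc_widths))


enc = encoder_widths(first_hidden=100, n_layers=3)
dec = decoder_widths(enc)
```

For a hypothetical first layer of 100 neurons and three layers, this gives encoder widths 100, 70, 49 and the mirrored decoder widths 49, 70, 100.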
Table 2: Grid search space for hyperparameters of CLeaR instances during the updating phase\. Replay buffer size for all replay\-based instances \(Ra, Re, ReD\) equals the novelty buffer size\.

Following the warm\-up, each CLeaR instance enters the updating phase, spanning 54 764 samples over approximately 570 days\. During this phase, instances continuously generate forecasts while monitoring detection signals from incoming samples\. When a novelty buffer reaches capacity, the corresponding neural network component is updated according to the selected CL approach\. All hyperparameters controlling these updates are specified via grid search and summarized in Table [2](https://arxiv.org/html/2606.24955#S4.T2)\. For each CLeaR instance, hyperparameter configurations in the updating phase are selected individually for each entity based on the lowest RMSE on the updating dataset \(later referred to as Fitting Error\)\. No additional regularization is applied during this phase, as the CL approaches inherently introduce functional regularization via EWC penalties or replay losses\.
The updating phase is followed by the test phase, where the remaining 3 000 samples \(approximately 31 days\) are used to evaluate the final forecasting performance of each CLeaR instance\. Two baselines are included: $\text{Baseline}_{\text{L}}$, representing the warm\-up model without further updates, and $\text{Baseline}_{\text{U}}$, trained on the full dataset spanning both warm\-up and updating phases, serving as empirical lower and upper performance references, respectively\. This protocol ensures fair, reproducible, and realistic evaluation under operational constraints typical of industrial power forecasting pipelines\.
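The phase sizes quoted above can be cross\-checked with a few lines of arithmetic. The constant names are illustrative; the numbers themselves are taken from the text.

```python
STEPS_PER_DAY = 24 * 4  # 15-minute resolution

WARMUP, UPDATING, TEST = 10_000, 54_764, 3_000

total_steps = WARMUP + UPDATING + TEST
test_start = WARMUP + UPDATING + 1  # first 1-indexed time step of the test phase
phase_days = {
    "warm-up": WARMUP / STEPS_PER_DAY,
    "updating": UPDATING / STEPS_PER_DAY,
    "test": TEST / STEPS_PER_DAY,
}
```

The three phases add up to the 67 764 time steps of the dataset, and the test phase starts at time step 64 765. The phase lengths correspond to roughly 104, 570, and 31 days.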
### 4\.3Evaluation Metrics
Each CLeaR instance is evaluated after the updating phase using three post\-update metrics\[[6](https://arxiv.org/html/2606.24955#bib.bib1)\]: Fitting Error \(FE\), Prediction Error \(PE\), and Forgetting Ratio \(FR\)\. All metrics are computed on the fixed warm\-up, updating, and test datasets using the final model obtained after the last update\. Let $\mathrm{M}_{\mathrm{final}}$ denote the final model obtained at time $t_{\mathrm{final}}$, and $\mathrm{M}_{\mathrm{final}}[\mathbf{x}(t)]$ its output for input $\mathbf{x}(t)$\. We introduce a unified target notation $\mathbf{d}(t)$, where $\mathbf{d}(t) = \mathbf{x}(t)$ for reconstruction and $\mathbf{d}(t) = y(t)$ for forecasting\.
FE measures how well the final model fits all samples that have been seen during the updating phase\. It is defined as the RMSE over the corresponding temporal interval:
$$\mathrm{FE}=\sqrt{\frac{\sum_{t=10\,001}^{t_{\mathrm{final}}}\left(\mathbf{d}(t)-\mathrm{M}_{\mathrm{final}}\left[\mathbf{x}(t)\right]\right)^{2}}{t_{\mathrm{final}}-10\,000}}. \tag{3}$$
Samples collected after $t_{\mathrm{final}}$ but not triggering an additional update are excluded from this calculation, as they do not influence model parameters.
PE evaluates the generalization performance of the final model on unseen data from the test phase:
$$\mathrm{PE}=\sqrt{\frac{\sum_{t=64\,765}^{67\,764}\left(\mathbf{d}(t)-\mathrm{M}_{\mathrm{final}}\left[\mathbf{x}(t)\right]\right)^{2}}{3\,000}}. \tag{4}$$
PE reflects out-of-sample performance beyond both warm-up and updating data.
FR quantifies the relative increase in error on the warm-up dataset after the completion of the updating phase. Let $\mathrm{M}_{0}$ denote the warm-up model (i.e., $\text{Baseline}_{\text{L}}$). FR is defined as:
$$\mathrm{FR}=\max\left\{0,\ \sqrt{\frac{\sum_{t=1}^{10\,000}\left(\mathbf{d}(t)-\mathrm{M}_{\mathrm{final}}\left[\mathbf{x}(t)\right]\right)^{2}}{\sum_{t=1}^{10\,000}\left(\mathbf{d}(t)-\mathrm{M}_{0}\left[\mathbf{x}(t)\right]\right)^{2}}}-1\right\}. \tag{5}$$
If the final model performs no worse than the warm-up model on the warm-up dataset, FR equals zero.
Together, FE, PE, and FR provide complementary perspectives on continual regression performance, respectively capturing fitting quality on updated data, generalization to unseen data, and resistance to catastrophic forgetting\.
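The three metrics in Eqs. 3 to 5 can be sketched in NumPy as follows. This is a minimal illustration, assuming that targets and model outputs are available as arrays indexed by time step (0-based in code, 1-based in the equations) and that the squared error is averaged over all array elements; the function names and array layout are hypothetical and not taken from the authors' code.

```python
import numpy as np

WARMUP_END = 10_000   # warm-up samples: t = 1 .. 10 000
TEST_START = 64_765   # test samples:    t = 64 765 .. 67 764
TEST_END = 67_764


def rmse(target, output):
    """Root mean squared error over all elements of the two arrays."""
    target = np.asarray(target, dtype=float)
    output = np.asarray(output, dtype=float)
    return float(np.sqrt(np.mean((target - output) ** 2)))


def fitting_error(d, final_out, t_final):
    # Eq. 3: 1-based t = 10 001 .. t_final -> 0-based slice [10 000, t_final)
    s = slice(WARMUP_END, t_final)
    return rmse(d[s], final_out[s])


def prediction_error(d, final_out):
    # Eq. 4: RMSE of the final model on the 3 000 test samples
    s = slice(TEST_START - 1, TEST_END)
    return rmse(d[s], final_out[s])


def forgetting_ratio(d, final_out, warmup_out):
    # Eq. 5: relative RMSE increase on the warm-up data, clipped at zero.
    # Both sums cover the same 10 000 samples, so the ratio of RMSEs equals
    # the square root of the ratio of the sums of squared errors.
    s = slice(0, WARMUP_END)
    ratio = rmse(d[s], final_out[s]) / rmse(d[s], warmup_out[s])
    return max(0.0, ratio - 1.0)
```

Here `final_out` and `warmup_out` stand for the outputs of $\mathrm{M}_{\mathrm{final}}$ and $\mathrm{M}_{0}$ on the full series; for the AE component `d` would hold the inputs themselves, for the predictor the power targets.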
### 4.4 Overall Performance Comparison
Using the optimal configurations selected as described in Section [4.2](https://arxiv.org/html/2606.24955#S4.SS2), we report aggregated performance statistics across all 95 entities.
##### AE Component \(Input Feature Reconstruction\)
Table [3](https://arxiv.org/html/2606.24955#S4.T3) summarizes the post-update evaluation results for the AE component across all CLeaR instances. Since the baseline models are not updated during the updating phase, the values of FR and number of updates are not applicable in the corresponding rows.
As expected, $\text{Baseline}_{\text{L}}$, which is trained only on the limited warm-up dataset, exhibits substantially higher FE and PE during later phases. All CLeaR instances achieve markedly lower FE than $\text{Baseline}_{\text{L}}$, indicating that continual updates substantially improve retention of input feature structure.
Clear differences exist between regularization-based and replay-based approaches. The four replay-based variants ($\text{CLeaR}_{\text{Ra}}$, $\text{CLeaR}_{\text{Re}}$, $\text{CLeaR}_{\text{ReD}}$, and $\text{CLeaR}_{\text{G}}$) consistently outperform $\text{CLeaR}_{\text{E}}$ and $\text{CLeaR}_{\text{F}}$ in terms of all three metrics. This suggests that explicit replay of historical samples outperforms regularization-based constraints in preserving both reconstruction accuracy and the stability of feature-level representations.
Table 3: Post-update evaluation of AE component (input feature reconstruction). Results are aggregated across 95 power entities and presented as mean and standard deviation (in parentheses). For reference, the mean errors achieved by AE on validation data for $\text{Baseline}_{\text{L}}$ and $\text{Baseline}_{\text{U}}$ are 0.005 and 0.003, respectively. $\text{Baseline}_{\text{L}}$ and $\text{Baseline}_{\text{U}}$ values in the table represent their performance during the updating/test phases.

| Instances | Fitting Error | Prediction Error | Forgetting Ratio | Number of Updates |
| --- | --- | --- | --- | --- |
| $\text{Baseline}_{\text{U}}$ | 0.003 | 0.021 | / | / |
| $\text{Baseline}_{\text{L}}$ | 0.345 | 0.203 | / | / |
| $\text{CLeaR}_{\text{E}}$ | 0.249 (0.037) | 0.180 (0.072) | 91.270 (13.061) | 13.368 (3.516) |
| $\text{CLeaR}_{\text{F}}$ | 0.243 (0.037) | 0.208 (0.094) | 90.365 (13.612) | 12.358 (3.418) |
| $\text{CLeaR}_{\text{Ra}}$ | 0.014 (0.001) | 0.053 (0.004) | 0.669 (0.123) | 8.442 (1.194) |
| $\text{CLeaR}_{\text{Re}}$ | 0.016 (0.008) | 0.057 (0.019) | 1.503 (1.457) | 8.253 (0.725) |
| $\text{CLeaR}_{\text{ReD}}$ | 0.016 (0.010) | 0.058 (0.023) | 1.533 (1.418) | 8.895 (0.788) |
| $\text{CLeaR}_{\text{G}}$ | 0.011 (0.001) | 0.045 (0.005) | 0.551 (0.215) | 13.484 (3.008) |
Two factors may partly explain the inferior performance of $\text{CLeaR}_{\text{E}}$ and $\text{CLeaR}_{\text{F}}$. First, hyperparameter selection prioritizes forecasting FE rather than reconstruction performance. Therefore, configurations optimal for predictive accuracy do not necessarily preserve the AE's latent representation. Second, without access to historical samples, $\text{CLeaR}_{\text{E}}$ and $\text{CLeaR}_{\text{F}}$ must rely mainly on FI-based regularization to balance stability and plasticity. Under limited model capacity, which is intentionally constrained during warm-up to prevent overfitting, frequent updates (approximately 13 on average) increase the risk of overwriting overlapping latent representations. In Online EWC, the down-weighting parameter $\gamma$ progressively relaxes older constraints, which further contributes to the relatively high FR.
While FR appears disproportionately large for regularization\-based methods, this effect is amplified by normalization in Eq\.[5](https://arxiv.org/html/2606.24955#S4.E5)\. Because the warm\-up reconstruction error is very small \(approximately 0\.005 on average\), even moderate absolute degradation results in a large relative ratio\. Thus, FR magnifies representational drift when the baseline denominator is small\.
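As a purely illustrative calculation with hypothetical numbers (not taken from the experiments), consider the same absolute RMSE increase of 0.05 on the warm-up data for two different warm-up errors, 0.005 and 0.04. Because both sums in Eq. 5 cover the same samples, FR reduces to the ratio of the two RMSE values minus one:

```latex
\mathrm{FR}_{\text{small baseline}} = \frac{0.005 + 0.05}{0.005} - 1 = 10,
\qquad
\mathrm{FR}_{\text{larger baseline}} = \frac{0.04 + 0.05}{0.04} - 1 = 1.25 .
```

The identical absolute degradation thus yields an FR that is eight times larger when the warm-up error is small.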
Among all variants, $\text{CLeaR}_{\text{G}}$ achieves the best overall reconstruction performance, approaching the errors of $\text{Baseline}_{\text{U}}$, which is trained on the full dataset. This indicates that Generative Replay closely approximates full-data retraining under streaming constraints. Notably, $\text{CLeaR}_{\text{Ra}}$ also performs competitively despite employing a simple random replay strategy, highlighting that even non-selective replay provides a strong anchor against representational drift.
Although $\text{CLeaR}_{\text{G}}$ exhibits a relatively high number of updates due to its synchronous triggering mechanism, its low FR and strong FE/PE indicate that Generative Replay can maintain stability even under frequent parameter adjustments.
##### Predictor Component \(Power Forecasting\)
Table [4](https://arxiv.org/html/2606.24955#S4.T4) summarizes the aggregated post-update evaluation metrics for all 95 power entities.
Table 4: Post-update evaluation of the predictor component (power forecasting) for all 95 power entities, comparing various CLeaR instances and baseline models. Results are presented as the mean and standard deviation (in parentheses). For reference, the mean errors achieved by predictor on validation data for $\text{Baseline}_{\text{L}}$ and $\text{Baseline}_{\text{U}}$ are 0.037 (0.020) and 0.041 (0.020), respectively. $\text{Baseline}_{\text{L}}$ and $\text{Baseline}_{\text{U}}$ values in the table represent their performance during the updating/test phases.

| Instances | Fitting Error | Prediction Error | Forgetting Ratio | Number of Updates |
| --- | --- | --- | --- | --- |
| $\text{Baseline}_{\text{U}}$ | 0.037 (0.017) | 0.142 (0.083) | / | / |
| $\text{Baseline}_{\text{L}}$ | 0.157 (0.074) | 0.165 (0.102) | / | / |
| $\text{CLeaR}_{\text{E}}$ | 0.116 (0.059) | 0.140 (0.079) | 2.057 (1.427) | 6.147 (1.941) |
| $\text{CLeaR}_{\text{F}}$ | 0.112 (0.060) | 0.135 (0.076) | 2.081 (1.506) | 5.505 (2.257) |
| $\text{CLeaR}_{\text{Ra}}$ | 0.086 (0.043) | 0.159 (0.083) | 1.079 (1.099) | 7.011 (2.075) |
| $\text{CLeaR}_{\text{Re}}$ | 0.080 (0.040) | 0.161 (0.085) | 1.925 (1.537) | 6.874 (1.842) |
| $\text{CLeaR}_{\text{ReD}}$ | 0.083 (0.040) | 0.158 (0.082) | 2.116 (1.616) | 6.895 (1.861) |
| $\text{CLeaR}_{\text{G}}$ | 0.046 (0.025) | 0.152 (0.086) | 0.344 (0.587) | 13.484 (3.008) |
Consistent with our previous findings regarding input feature reconstruction, all CLeaR instances achieve lower FE than $\text{Baseline}_{\text{L}}$, indicating the necessity and effectiveness of continual updates once the data distribution departs from the initial warm-up temporal window. Among all instances, $\text{CLeaR}_{\text{G}}$ yields the lowest FE and FR, indicating the strongest retention of previously acquired knowledge under continual updates.
However, a different trend emerges in the PE metrics: $\text{CLeaR}_{\text{E}}$ and $\text{CLeaR}_{\text{F}}$ consistently surpass the replay-based instances, achieving lower mean PE than $\text{Baseline}_{\text{U}}$. This difference reflects the task-dependent nature of the stability-plasticity trade-off. In the input feature reconstruction context, the AE deals with NWP data, which exhibits high-frequency but short-term fluctuations with periodic structure. As shown in Table [3](https://arxiv.org/html/2606.24955#S4.T3), $\text{Baseline}_{\text{U}}$ maintains comparable errors on the validation data ($\approx 0.003$) and in the updating phase ($\approx 0.003$), with a minimal PE (0.021). This suggests relatively stable reconstruction statistics where preserving historical knowledge (stability) is more important. Consequently, in this context, the replay-based instances, which can directly replay real/pseudo historical samples, substantially outperform the EWC-based variants across all evaluation metrics by effectively anchoring the model to its initial reconstruction proficiency.
Conversely, the power forecasting context may exhibit gradual drift in the input-output mapping relationship. For $\text{Baseline}_{\text{U}}$, the error rises from 0.041 on validation data to 0.142 in the test phase, indicating considerable concept drift in power data. In this context, the controlled forgetting facilitated by the Online EWC down-weighting parameter $\gamma$ may provide additional plasticity. By decaying outdated constraints, these instances exhibit higher plasticity, allowing them to adapt more quickly to new distribution characteristics. As a result, EWC-based instances achieve lower PE despite higher FR, suggesting that stronger retention is insufficient to guarantee superior generalization under distributional drift.
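The decay of outdated constraints can be illustrated with the running Fisher-information accumulation commonly used in Online EWC, $F^{*}_{i} = \gamma F^{*}_{i-1} + F_{i}$. The sketch below assumes this form and uses scalar Fisher values for brevity; the function names are hypothetical, and the exact penalty used by the CLeaR instances is the one defined in the main text, not this snippet.

```python
def accumulate_fisher(fisher_per_update, gamma):
    """Running Fisher estimate: F_run <- gamma * F_run + F_new."""
    running = 0.0
    for fisher_new in fisher_per_update:
        running = gamma * running + fisher_new
    return running


def constraint_weights(n_updates, gamma):
    """Relative weight of each update's Fisher term in the final running sum.

    The k-th update (0-based) has been down-weighted by gamma once for every
    later update, i.e. gamma ** (n_updates - 1 - k).
    """
    return [gamma ** (n_updates - 1 - k) for k in range(n_updates)]
```

With `gamma = 0.5` and three updates, `constraint_weights` gives `[0.25, 0.5, 1.0]`: the oldest constraint retains a quarter of its original strength, whereas `gamma = 1.0` would keep all constraints at full weight.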
A similar context-dependent behavior is observed in the update frequency. Compared with the AE component, the predictor exhibits fewer updates on average. This difference arises because AE updates (in Table [3](https://arxiv.org/html/2606.24955#S4.T3)) are primarily driven by short-term input distribution changes, whereas predictor updates (in Table [4](https://arxiv.org/html/2606.24955#S4.T4)) are triggered by drifts in the underlying input-output relationship. Such mapping drift occurs less frequently but could require more substantial parameter adaptation when detected.
## 5 Conclusion & Future Work
This study investigates the CLeaR framework for Continuous Power Forecasting \(CPF\) under nonstationary data streams with restricted data accessibility\. Extensive experiments on a real\-world power grid dataset demonstrate that the framework effectively mitigates the limitations imposed by constrained historical storage and evolving operating conditions\. Across 95 heterogeneous entities, the CLeaR instances consistently improve forecasting robustness compared with static baselines, suggesting the practical importance of structured continual adaptation in deployment scenarios\.
Although Generative Replay achieves superior performance across most metrics, simple performance ranking provides only a partial view of continual learning behavior\. The experimental results reveal that performance appears to depend on the nature and intensity of nonstationarity\. In relatively stable regimes, mechanisms emphasizing memory retention tend to maintain stronger generalization\. In contrast, under conditional drift, controlled relaxation of historical constraints may facilitate faster adaptation\. These findings indicate that stronger knowledge retention does not necessarily imply better generalization, particularly when the underlying input\-output relationship evolves over time\.
From a system design perspective, two principles emerge\. First, the selection of a continual learning mechanism must account for site\-specific computational and storage constraints\. Methods that cannot be executed within operational constraints offer limited practical value regardless of theoretical performance\. Second, mechanism selection should be aligned with the form of nonstationarity\. For example, seasonal or cyclic variations could favor stability\-oriented approaches with stronger memory preservation\. Gradual physical degradation could require steady adaptation with moderate forgetting\. Operational drifts or newly introduced forecasting targets prioritize rapid plasticity and efficient re\-initialization\. Continual learning systems for power applications should therefore be drift\-aware and constraint\-aware rather than universally stability\-oriented\.
Several directions are proposed for further investigation\. First, exploring the integration of pre\-trained models and zero\-shot learning may accelerate cold\-start adaptation and reduce dependency on extended warm\-up phases\. Second, incorporating human\-in\-the\-loop feedback into novelty detection and update triggering mechanisms may help bridge the gap between purely data\-driven adaptation and expert\-informed operational knowledge\. Third, selective forgetting mechanisms warrant systematic investigation in the context of long\-horizon power forecasting\. While the present work focuses primarily on preserving past knowledge, long\-horizon power system operation suggests that not all historical patterns remain informative and relevant\. Intentional removal of obsolete or no longer relevant patterns may improve both memory efficiency and long\-term generalization\.
Advancing along these directions will move CPF research beyond static mechanism comparisons toward adaptive, context\-aware continual learning systems capable of operating reliably in dynamic energy infrastructures\.
#### 5.0.1 Acknowledgements
This work was supported within the KonSEnz \(03EI4087B\) project, funded by BMWE: Deutsches Bundesministerium für Wirtschaft und Energie/German Federal Ministry for Economic Affairs and Energy\.
#### 5.0.2 Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article\.
#### 5.0.3 Statement on the Use of AI Tools
Large Language Models \(LLMs\) were used solely to improve the readability and correct grammatical errors in this manuscript\. All scientific content, such as research ideas, the conceptual framework, experimental design, analysis, and conclusions, was developed entirely by the authors\. The authors take full responsibility for the integrity and originality of the work\.
## References
- \[1\] A. Cossu, A. Carta, and D. Bacciu (2020). Continual learning with gated incremental memories for sequential data processing. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
- \[2\] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2021). A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(7), pp. 3366–3385.
- \[3\] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia (2014). A survey on concept drift adaptation. ACM Computing Surveys 46(4), pp. 1–37.
- \[4\] Y. He, J. Henze, and B. Sick (2020). Continuous learning of deep neural networks to improve forecasts for regional energy markets. IFAC-PapersOnLine 53(2), pp. 12175–12182.
- \[5\] Y. He, Z. Huang, and B. Sick (2021). Toward application of continuous power forecasts in a regional flexibility market. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. DOI: [10.1109/IJCNN52387.2021.9533626](https://dx.doi.org/10.1109/IJCNN52387.2021.9533626).
- \[6\] Y. He and B. Sick (2021). CLeaR: an adaptive continual learning framework for regression tasks. AI Perspectives 3(1), p. 2.
- \[7\] Y. Hsu, Y. Liu, A. Ramasamy, and Z. Kira (2018). Re-evaluating continual learning scenarios: a categorization and case for strong baselines. arXiv preprint arXiv:1810.12488.
- \[8\] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences (PNAS) 114(13), pp. 3521–3526.
- \[9\] J. Z. Kolter and M. A. Maloof (2007). Dynamic weighted majority: an ensemble method for drifting concepts. The Journal of Machine Learning Research 8, pp. 2755–2790.
- \[10\] V. Lomonaco and D. Maltoni (2017). Core50: a new dataset and benchmark for continuous object recognition. In Proceedings of the 1st Annual Conference on Robot Learning (CoRL), pp. 17–26.
- \[11\] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang (2018). Learning under concept drift: a review. IEEE Transactions on Knowledge and Data Engineering 31(12), pp. 2346–2363.
- \[12\] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam (2022). A time series is worth 64 words: long-term forecasting with Transformers. arXiv preprint arXiv:2211.14730.
- \[13\] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017). iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2001–2010.
- \[14\] D. Rolnick, A. Ahuja, J. Schwarz, T. Lillicrap, and G. Wayne (2019). Experience replay for continual learning. Advances in Neural Information Processing Systems (NeurIPS) 32.
- \[15\] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671.
- \[16\] D. Sahoo, Q. Pham, J. Lu, and S. C. H. Hoi (2017). Online deep learning: learning deep neural networks on the fly. arXiv preprint arXiv:1711.03705.
- \[17\] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018). Progress & compress: a scalable framework for continual learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 4528–4537.
- \[18\] H. Shin, J. K. Lee, J. Kim, and J. Kim (2017). Continual learning with deep generative replay. Advances in Neural Information Processing Systems (NeurIPS) 30.
- \[19\] G. M. Van de Ven and A. S. Tolias (2018). Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635.
- \[20\] H. Wu, J. Xu, J. Wang, and M. Long (2021). Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems (NeurIPS) 34, pp. 22419–22430.
- \[21\] F. Zenke, B. Poole, and S. Ganguli (2017). Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 3987–3995.
- \[22\] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang (2021). Informer: beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35(12), pp. 11106–11115.
## Appendix
This appendix provides algorithmic details for the three real replay-based continual learning (CL) approaches introduced in Section [3.2](https://arxiv.org/html/2606.24955#S3.SS2) of the main paper, titled Towards Continuous Power Forecasting: Practical Continual Learning for Real-World Energy Systems in Nonstationary Time Series. While the main text focuses on the unified replay loss formulation, this appendix specifies the data structures and sampling procedures used to construct the replay dataset for each approach.
We first define the historical dataset and replay buffer used by all three real replay approaches. The historical dataset $\mathcal{D}_H$ represents the complete collection of observed samples accumulated up to the current model update. Assume the $n$-th model update is completed at time $t_n$. The historical dataset $\mathcal{D}_H$ is defined as:
$$\mathcal{D}_H=\left\{\left(\mathbf{x}(t),\mathbf{y}(t)\right)\mid t_{-1}\leq t\leq t_n,\ t,n\in\mathbb{N}\right\},$$
where $t$ denotes the time step when the corresponding input-output sample pair is obtained and $t_{-1}\equiv 0$ denotes the starting time.
The replay buffer stores a subset of historical samples used for experience replay during model updates. The replay dataset is denoted by $\mathcal{D}_R\subseteq\mathcal{D}_H$. The replay budget $S_R$ specifies the number of historical samples selected for replay at each update step, i.e., $\lvert\mathcal{D}_R\rvert=S_R$. In all real replay approaches considered in this work, the replay budget $S_R$ is treated as a fixed constant.
### Random Replay
Random Replay represents the simplest approach. It assumes that all historical samples stored in $\mathcal{D}_H$ are equally important for retaining past knowledge. Following the completion of a model update (at time $t_n$), Random Replay generates the replay dataset $\mathcal{D}_R$ by uniformly sampling $S_R$ data pairs from the entire historical dataset $\mathcal{D}_H$ without replacement. This sampled replay dataset $\mathcal{D}_R$ is a down-sampled subset of the complete historical data. Since $S_R$ is constant, the probability of selecting a sample corresponding to a very old task gradually decreases as $\mathcal{D}_H$ grows, indirectly reducing the reliance on accessing the most outdated data.
Algorithm [1](https://arxiv.org/html/2606.24955#alg1) describes the procedure for generating $\mathcal{D}_R$.
Algorithm 1: Random Replay Set Generation
1. Input: Historical dataset $\mathcal{D}_H=\{(\mathbf{x}(t),\mathbf{y}(t))\mid t_{-1}<t\leq t_n\}$, replay buffer size $S_R$
2. Output: Replay dataset $\mathcal{D}_R$
3. procedure GenerateRandomReplaySet($\mathcal{D}_H$, $S_R$) % Executed after the $n$-th model training/update is completed. Let $t_{-1}\equiv 0$.
4. if $\lvert\mathcal{D}_H\rvert\leq S_R$ then
5. $\mathcal{D}_R\leftarrow\mathcal{D}_H$ % If history is smaller than buffer size, use all samples
6. else
7. $\mathcal{D}_R\leftarrow\text{UniformRandomSample}(\mathcal{D}_H,S_R)$ % Randomly sample $S_R$ data pairs from $\mathcal{D}_H$ without replacement
8. end if
9. return $\mathcal{D}_R$
10. end procedure
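Algorithm 1 translates almost directly into Python. The sketch below assumes the historical dataset is held in memory as a list of sample tuples; the function name and the optional `rng` argument (added for reproducible sampling) are illustrative and not part of the original algorithm.

```python
import random


def generate_random_replay_set(history, replay_size, rng=None):
    """Uniformly sample up to `replay_size` pairs from `history`.

    history: list of samples, e.g. (x, y) tuples, collected up to t_n.
    Sampling is without replacement, as in Algorithm 1.
    """
    rng = rng or random.Random()
    if len(history) <= replay_size:
        # History is not larger than the buffer: use all samples.
        return list(history)
    return rng.sample(history, replay_size)
```

A call such as `generate_random_replay_set(history, 500)` would be made once after each completed update, and the result kept for the next update.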
### Recent Replay
Recent Replay is a variant that recognizes that, in nonstationary environments, more recent data may be more representative of the current data distribution or subsequent tasks. It therefore prioritizes data from the most recent tasks. This approach requires an additional parameter, $N_R$, which defines the number of the most recent updates whose associated data is considered for replay. Specifically, this corresponds to the data collected within the time intervals delineated by the last $N_R$ model update points. Only samples within this time window are eligible for selection. Following a model update, $\mathcal{D}_R$ is generated by uniformly sampling $S_R$ data pairs from the recent subset for use at the next model update.
Algorithm 2: Recent Replay Set Generation
1. Input: Historical dataset $\mathcal{D}_H=\{(\mathbf{x}(t),\mathbf{y}(t))\mid t_{-1}<t\leq t_n\}$, replay buffer size $S_R$, number of recent updates $N_R$, number of updates $n$
2. Output: Replay dataset $\mathcal{D}_R$
3. procedure GenerateRecentReplaySet($\mathcal{D}_H$, $S_R$, $N_R$, $n$) % Executed after the $n$-th model training/update is completed. Let $t_{-1}\equiv 0$.
4. if $n\leq N_R-1$ then
5. $\mathcal{D}_{\text{Recent}}\leftarrow\{(\mathbf{x}(t),\mathbf{y}(t))\in\mathcal{D}_H\mid t_{-1}\leq t\leq t_n\}$ % Sampling starts from the initial time step $t=t_{-1}$
6. else
7. $\mathcal{D}_{\text{Recent}}\leftarrow\{(\mathbf{x}(t),\mathbf{y}(t))\in\mathcal{D}_H\mid t_{n-N_R}\leq t\leq t_n\}$ % Samples associated with the last $N_R$ update intervals
8. end if
9. $N_{\text{Recent}}\leftarrow\lvert\mathcal{D}_{\text{Recent}}\rvert$
10. if $N_{\text{Recent}}\leq S_R$ then
11. $\mathcal{D}_R\leftarrow\mathcal{D}_{\text{Recent}}$ % If the recent subset is small, use all samples
12. else
13. $\mathcal{D}_R\leftarrow\text{UniformRandomSample}(\mathcal{D}_{\text{Recent}},S_R)$ % Randomly sample $S_R$ data pairs from $\mathcal{D}_{\text{Recent}}$
14. end if
15. return $\mathcal{D}_R$
16. end procedure
When the number of conducted updates $n$ is less than $N_R$, the workflow is identical to Random Replay. However, starting from $n=N_R$, the mechanism enforces a sliding window, in line with the else branch of Algorithm 2. Data corresponding to the oldest update outside the $N_R$ windows are sequentially excluded and thus will not be involved in subsequent replay generations. By focusing on recent experiences, Recent Replay allows the model to adapt more quickly to recent changes in the underlying data distribution. It effectively introduces a controlled mechanism to remove memories related to excessively outdated tasks.
Algorithm [2](https://arxiv.org/html/2606.24955#alg2) describes the Recent Replay sampling process.
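The sliding-window selection of Algorithm 2 can be sketched in Python as follows. The data layout is an assumption made for illustration: each sample is a tuple `(t, x, y)` carrying its time step, and `update_times[k]` holds $t_k$ for $k = 0, \dots, n$. The window bounds mirror those written in Algorithm 2; the function name and the `rng` argument are hypothetical.

```python
import random


def generate_recent_replay_set(history, update_times, replay_size,
                               n_recent, rng=None):
    """history: list of (t, x, y); update_times[k] is t_k for k = 0..n."""
    rng = rng or random.Random()
    n = len(update_times) - 1
    if n <= n_recent - 1:
        # Fewer than N_R updates so far: every sample is eligible.
        recent = list(history)
    else:
        # Keep only samples inside the window t_{n - N_R} <= t <= t_n.
        start, end = update_times[n - n_recent], update_times[n]
        recent = [s for s in history if start <= s[0] <= end]
    if len(recent) <= replay_size:
        return recent
    return rng.sample(recent, replay_size)
```

Compared with Random Replay, the only extra state needed is the list of update times, so the storage overhead of the windowing itself is negligible.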
### Recent Replay with Decay
Recent Replay with Decay is an extension of Recent Replay that refines the sampling strategy by utilizing a simple yet effective mechanism to proportionally prioritize the influence of more recent samples while decaying the impact of older ones. While this approach still focuses on the data associated with the most recent $N_R$ updates, it samples the historical data in a way that samples closer to the current update receive a proportionally higher sampling weight, thereby securing a greater share of the total replay budget $S_R$.
The approach assigns a weight to the data associated with each of the $N_R$ recent updates, typically decaying linearly. For instance, the data corresponding to the most recent update receives a weight of $N_R$, while the data for the oldest update within the window receives a weight of $1$. The total replay budget $S_R$ is then proportionally distributed across these $N_R$ update intervals based on their assigned weights.
This weighting mechanism effectively balances the need for temporal relevance (by assigning the highest priority to recent data) with the retention of foundational historical information (by assigning a positive weight to all $N_R$ update intervals). This makes the approach particularly useful in nonstationary environments where underlying trends change gradually.
Algorithm [3](https://arxiv.org/html/2606.24955#alg3) details the weighted sampling process.
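The linearly weighted budget allocation described above can be sketched in Python. As an assumption for illustration, each sample is a tuple `(t, x, y)` and `update_times[k]` holds $t_k$ for $k = 0, \dots, n$, with $t_{-1} = 0$. Weights run from 1 for the oldest eligible interval to the number of eligible intervals for the newest one. The function name and `rng` argument are hypothetical, and Python's built-in `round` (which rounds halves to the nearest even integer) stands in for the unspecified rounding rule.

```python
import random


def generate_decay_replay_set(history, update_times, replay_size,
                              n_recent, rng=None):
    """history: list of (t, x, y); update_times[k] is t_k for k = 0..n."""
    rng = rng or random.Random()
    n = len(update_times) - 1
    # Case 1 (n <= N_R - 1): all tasks 0..n; case 2: the last N_R tasks.
    first_task = 0 if n <= n_recent - 1 else n - n_recent + 1
    n_tasks = n - first_task + 1
    norm = n_tasks * (n_tasks + 1) // 2          # 1 + 2 + ... + n_tasks
    replay = []
    for m in range(first_task, n + 1):
        lower = update_times[m - 1] if m > 0 else 0   # t_{m-1}, with t_{-1} = 0
        task = [s for s in history if lower < s[0] <= update_times[m]]
        weight = m - first_task + 1              # oldest -> 1, newest -> n_tasks
        # Share of the budget for task m, capped by the samples available.
        quota = min(round(weight * replay_size / norm), len(task))
        if quota > 0:
            replay.extend(rng.sample(task, quota))
    return replay
```

For example, with three eligible intervals and a budget of 12, the intervals receive 2, 4, and 6 samples from oldest to newest. Because each share is rounded independently, the total may deviate slightly from the budget for other settings.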
**Algorithm 3** Recent Replay with Decay Set Generation

1: **Input:** Historical dataset $\mathcal{D}_H = \{(\mathbf{x}(t), \mathbf{y}(t)) \mid t_{-1} < t \leq t_n\}$, replay buffer size $S_R$, number of recent updates $N_R$, number of updates $n$
2: **Output:** Replay dataset $\mathcal{D}_R$
3: **procedure** GenerateRecentReplayDecaySet($\mathcal{D}_H$, $S_R$, $N_R$, $n$) % Executed after the $n$-th model training/update is completed. Let $t_{-1} \equiv 0$.
4:   $\mathcal{D}_R \leftarrow \varnothing$ % Initialize the replay dataset
5:   **if** $n \leq N_R - 1$ **then** % Case 1: Replay all available tasks $m \in \{0, \dots, n\}$
6:     $M_{\text{range}} \leftarrow \{0, 1, \dots, n\}$
7:     $C \leftarrow (n+1)(n+2)/2$ % Normalization constant: $\sum_{i=1}^{n+1} i$
8:   **else** % Case 2: Replay the recent $N_R$ tasks $m \in \{n - N_R + 1, \dots, n\}$
9:     $M_{\text{range}} \leftarrow \{n - N_R + 1, \dots, n\}$
10:     $C \leftarrow N_R (N_R + 1)/2$ % Normalization constant: $\sum_{i=1}^{N_R} i$
11:   **end if**
12:   **for** $m$ **in** $M_{\text{range}}$ **do** % Iterate through tasks to be replayed
13:     $\mathcal{D}_{\text{Task}} \leftarrow \{(\mathbf{x}(t), \mathbf{y}(t)) \in \mathcal{D}_H \mid t_{m-1} < t \leq t_m\}$ % Data specific to task $m$
14:     **if** $n \leq N_R - 1$ **then**
15:       $W \leftarrow m + 1$ % Weight increases linearly from $1$ to $n + 1$
16:     **else**
17:       $W \leftarrow m - (n - N_R)$ % Weight increases linearly from $1$ to $N_R$
18:     **end if**
19:     $S_{\text{Task}} \leftarrow \text{Round}(W \cdot S_R / C)$ % Calculate required sample size for task $m$
20:     $S_{\text{Sample}} \leftarrow \min(S_{\text{Task}}, |\mathcal{D}_{\text{Task}}|)$ % Ensure we do not sample more than available
21:     **if** $S_{\text{Sample}} > 0$ **then**
22:       $\mathcal{D}_{\text{Sample}} \leftarrow \text{UniformRandomSample}(\mathcal{D}_{\text{Task}}, S_{\text{Sample}})$
23:       $\mathcal{D}_R \leftarrow \mathcal{D}_R \cup \mathcal{D}_{\text{Sample}}$
24:     **end if**
25:   **end for**
26:   **return** $\mathcal{D}_R$
27: **end procedure**
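The procedure of Algorithm 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the representation of $\mathcal{D}_H$ as a list of `(t, x, y)` tuples, the `boundaries` list holding the update times $t_0, \dots, t_n$, and the `seed` parameter are all hypothetical. Round is assumed to mean rounding half up, since the rounding mode is not specified; Python's built-in `round` rounds half to even and is therefore avoided. The weight is computed as the task's position inside the replay window, which yields $m + 1$ in Case 1 and $m - (n - N_R)$ in Case 2.

```python
import math
import random


def generate_recent_replay_decay_set(history, boundaries, s_r, n_r, n, seed=0):
    """Sketch of Algorithm 3 (hypothetical data layout, see lead-in).

    history:    list of (t, x, y) samples
    boundaries: boundaries[m] is t_m, the end time of task m; t_{-1} is taken as 0
    s_r:        total replay budget S_R
    n_r:        number of recent updates N_R
    n:          index of the update that has just been completed
    """
    rng = random.Random(seed)  # seeded only so that the sketch is reproducible

    if n <= n_r - 1:
        first_task, num_tasks = 0, n + 1        # Case 1: replay all tasks 0..n
    else:
        first_task, num_tasks = n - n_r + 1, n_r  # Case 2: replay the last N_R tasks
    norm = num_tasks * (num_tasks + 1) // 2     # C = 1 + 2 + ... + num_tasks

    replay = []
    for m in range(first_task, n + 1):
        t_start = boundaries[m - 1] if m > 0 else 0
        t_end = boundaries[m]
        task_data = [s for s in history if t_start < s[0] <= t_end]

        weight = m - first_task + 1             # 1 for the oldest task, num_tasks for the newest
        # Assumption: "Round" is round-half-up.
        quota = math.floor(weight * s_r / norm + 0.5)
        k = min(quota, len(task_data))          # never sample more than is available
        if k > 0:
            replay.extend(rng.sample(task_data, k))  # uniform sampling without replacement
    return replay


# Toy usage: four tasks with ten samples each, window of three updates.
boundaries = [10, 20, 30, 40]
history = [(t, float(t), 2.0 * t) for t in range(1, 41)]
replay = generate_recent_replay_decay_set(history, boundaries, s_r=12, n_r=3, n=3)
print(len(replay))  # 12: quotas of 2, 4 and 6 samples for tasks 1, 2 and 3
```

Because each quota is rounded and then capped by the number of samples available for the task, the size of the returned set can differ from $S_R$. The sketch does not redistribute any unused budget, in line with the algorithm as listed.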