Probabilistic bias adjustment of seasonal forecasts using generative machine learning: A case study of Arctic sea ice predictions

arXiv cs.LG 05/29/26, 04:00 AM Papers
Summary
This paper presents a probabilistic post-processing framework using conditional variational autoencoders (cVAEs) to bias-adjust seasonal forecasts of Arctic sea ice, improving calibration, sharpness, and spectral power over standard methods.
arXiv:2605.29172v1 Announce Type: new Abstract: Seasonal climate predictions support planning and risk management by offering early information of the most likely-to-occur climate conditions in the coming months, and associated uncertainties. Ensemble forecasts enable this by simulating many plausible outcomes, allowing predictions to be expressed as usable probabilities. Large ensembles and high-resolution forecasts strengthen this guidance by better sampling uncertainty and capturing finer-scale processes but come with significant computational cost. Moreover, forecast ensembles drift and exhibit systematic biases and spatio-temporal errors that grow with lead time, requiring careful post-processing and calibration. A probabilistic post-processing framework based on conditional Variational Autoencoders (cVAEs) was developed at the Canadian Center for Climate Modeling and Analysis to generate large ensembles of bias adjusted seasonal predictions of Arctic sea ice. The generative model was designed to learn the observational distribution conditioned on the biased model prediction. This enables generation of arbitrarily large ensembles of well-calibrated, bias corrected forecasts with improved skill. Here, we extend this framework to address the loss of fine-scale energy and the characteristic blurriness in predictions, a known limitation of standard cVAEs. Specifically, we employ a generator in place of the Gaussian parametrized decoder in the cVAE and use Continuous Ranked Probability Score in the objective function instead of the Mean Square Error. We further use a higher resolution target dataset compared to the raw forecast. We show that the adjusted forecasts are better calibrated, more consistent with the observational distribution, and exhibit smaller errors than benchmark predictions, while also enhancing the resolution of the raw forecasts and improving sharpness and spectral power relative to the standard cVAE.
Original Article
View Cached Full Text
Cached at: 05/29/26, 09:18 AM
# Probabilistic bias adjustment of seasonal forecasts using generative machine learning: A case study of Arctic sea ice predictions
Source: [https://arxiv.org/html/2605.29172](https://arxiv.org/html/2605.29172)
###### Abstract

Seasonal climate predictions support planning and risk management by offering early information of the most likely\-to\-occur climate conditions in the coming months, and associated uncertainties\. Ensemble forecasts enable this by simulating many plausible outcomes, allowing predictions to be expressed as usable probabilities\. Large ensembles and high\-resolution forecasts strengthen this guidance by better sampling uncertainty and capturing finer\-scale processes but come with significant computational cost\. Moreover, forecast ensembles drift and exhibit systematic biases and spatio\-temporal errors that grow with lead time, requiring careful post\-processing and calibration\. A probabilistic post\-processing framework based on conditional Variational Autoencoders \(cVAEs\) was developed at the Canadian Center for Climate Modeling and Analysis to generate large ensembles of bias adjusted seasonal predictions of Arctic sea ice\. The generative model was designed to learn the observational distribution conditioned on the biased model prediction\. This enables generation of arbitrarily large ensembles of well\-calibrated, bias corrected forecasts with improved skill\. Here, we extend this framework to address the loss of fine\-scale energy and the characteristic blurriness in predictions, a known limitation of standard cVAEs\. Specifically, we employ a generator in place of the Gaussian parametrized decoder in the cVAE and use Continuous Ranked Probability Score in the objective function instead of the Mean Square Error\. We further use a higher resolution target dataset compared to the raw forecast\. We show that the adjusted forecasts are better calibrated, more consistent with the observational distribution, and exhibit smaller errors than benchmark predictions, while also enhancing the resolution of the raw forecasts and improving sharpness and spectral power relative to the standard cVAE\.

## 1Introduction

Seasonal predictions refer to forecasts on time scales ranging from a few months up to a year\. In Canada, seasonal forecasts of key climate variables are produced by the Canadian Seasonal to Interannual Prediction System \(CanSIPS;Merryfieldet al\.,[2013](https://arxiv.org/html/2605.29172#bib.bib64); Linet al\.,[2020](https://arxiv.org/html/2605.29172#bib.bib65)\), which combines two fully coupled ocean\-atmosphere\-land\-sea ice climate models\. Each month, the models are initialized using estimates of the current state of the climate and then integrated forward in time for up to 12 months\. In the third generation of the Canadian prediction system \(CanSIPSv3\), each model produces 20 ensemble members, 10 initialized on the first of the month, and 10 initialized 5 days before, to form the multi\-model ensemble forecasts\. One of the contributing models is the Canadian Earth System Model version 5\.1\(CanESM5\.1, Sigmondet al\.,[2023](https://arxiv.org/html/2605.29172#bib.bib74)\), developed at the Canadian Centre for Climate Modelling and Analysis \(CCCma\) and derived from CanESM5\.0\(Swartet al\.,[2019](https://arxiv.org/html/2605.29172#bib.bib10)\)\.

Successive versions of CanSIPS have demonstrated skill in a range of variables, including those related to the cryosphere\(Sigmondet al\.,[2013](https://arxiv.org/html/2605.29172#bib.bib68),[2016](https://arxiv.org/html/2605.29172#bib.bib69); Sospedra\-Alfonsoet al\.,[2016a](https://arxiv.org/html/2605.29172#bib.bib13),[b](https://arxiv.org/html/2605.29172#bib.bib12),[2024](https://arxiv.org/html/2605.29172#bib.bib14); Dirksonet al\.,[2021](https://arxiv.org/html/2605.29172#bib.bib67)\)\. Such variables tend to draw skill from their “memory” or persistence of their initial anomalies, and from interactions with local or remote drivers\. However, inherent model deficiencies and limitations of the initialization process lead to complex, often nonlinear systematic errors that grow with integration time \(or lead month, defined as the number of months after initialization for which a forecast is issued\)\. These deficiencies impact forecast skill, hampering CanSIPS’ ability to capitalize on key sources of predictability\. To mitigate such biases, post\-processing techniques are routinely applied, with methods that range from simple lead time dependent climatological bias corrections to more sophisticated machine learnning methods\.

For short\- to medium\-range weather prediction, machine learning and, in particular deep learning, has been implemented to improve the aggregate output statistics of prediction systems, including the ensemble spread of deterministic forecasts\(Saccoet al\.,[2022](https://arxiv.org/html/2605.29172#bib.bib17)\), corrections to probabilistic forecasts\(Grönquistet al\.,[2021](https://arxiv.org/html/2605.29172#bib.bib18)\), and artificial generation of high\-dimensional weather samples from the target forecast distribution using generative models\(Liet al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib15)\)\. For decadal climate predictions, which extend the seasonal forecast range for up to a decade, lead time dependent linear trend corrections\(Kharinet al\.,[2012](https://arxiv.org/html/2605.29172#bib.bib90)\)and other post\-processing methods with various levels of complexity\(Pasternacket al\.,[2018](https://arxiv.org/html/2605.29172#bib.bib91); Nadigaet al\.,[2019](https://arxiv.org/html/2605.29172#bib.bib92)\)have been proposed, including deep learning\-based adjustments\(Sospedra\-Alfonsoet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib73)\)\.

Among cryospheric variables with proven forecasting skill in CanSIPS is sea ice concentration \(SIC\)\(e\.g\., Sigmondet al\.,[2013](https://arxiv.org/html/2605.29172#bib.bib68),[2016](https://arxiv.org/html/2605.29172#bib.bib69)\), defined as the fraction of sea ice within a model grid cell\. CanSIPS’ Arctic SIC forecasts and their downstream products have contributed to several multi\-model efforts and outlooks, including the Arctic Climate Forum, a core activity of the World Meteorological Organization’s \(WMO’s\) Arctic Regional Climate Centre Network \(https://www\.arctic\-rcc\.org/acf\)\. Such forecasting products are important because Arctic sea ice cover is a key regulator of the polar climate system, shaping albedo feedbacks, ocean\-atmosphere heat exchange, and large\-scale circulation\. Moreover, the rapid decline of Arctic sea ice cover\(Serrezeet al\.,[2007](https://arxiv.org/html/2605.29172#bib.bib62); Cavalieri and Parkinson,[2012](https://arxiv.org/html/2605.29172#bib.bib63)\), especially in summer, poses a threat to local communities and ecosystems, but also creates economic opportunities for marine fishing, shipping, tourism and resource extraction\(Kimmritzet al\.,[2019](https://arxiv.org/html/2605.29172#bib.bib57); Zhanget al\.,[2022](https://arxiv.org/html/2605.29172#bib.bib56)\)\. Therefore, preparing, mitigating and planning in response to such changes and their impacts requires accurate and reliable seasonal and longer term predictions\(Wagneret al\.,[2020](https://arxiv.org/html/2605.29172#bib.bib59)\)\.

For seasonal SIC prediction, common post\-processing methods include simple lead dependent climatological bias corrections, linear regression approaches\(Blanchard\-Wrigglesworthet al\.,[2017](https://arxiv.org/html/2605.29172#bib.bib70)\), and more recent machine learning models\(Palermeet al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib72); Heet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib71)\)\. These techniques rely mainly on deterministic corrections and are constrained by the small number of computationally expensive members of the raw forecast ensemble\. As a result, even if they are bias corrected, these forecasts can struggle to isolate the predictable signal from substantial weather noise, particularly in regions of marginal sea ice cover\. Large ensembles can overcome this by sampling many plausible climate realizations and produce reliable probabilistic forecasts, stronger diagnostics of model biases, clearer quantification of internal variability, and better representation of uncertainty, including improved sampling of low\-probability, high\-impact extremes\.

Gooya and Sospedra\-Alfonso \([2025](https://arxiv.org/html/2605.29172#bib.bib102)\), hereafter GSA2025, proposed a probabilistic bias adjustment framework built on a generative ML model for seasonal predictions of Arctic SIC that overcome these limitations\. The framework relies on a conditional Variational Autoencoder \(cVAE\) model tasked with learning the distribution of observations conditioned on the biased ensemble mean of the raw seasonal forecasts\. With this approach, an arbitrary large ensemble of bias adjusted forecasts is produced\. However, while this approach solves the issues discussed above, it has drawbacks: adjusted forecasts are overly smooth at fine spatial scales, a known issue with the original cVAE formulation\(Larsenet al\.,[2016](https://arxiv.org/html/2605.29172#bib.bib41); Xuet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib93); Dortaet al\.,[2018](https://arxiv.org/html/2605.29172#bib.bib46)\)\.

The present study extends the work in GSA2025 to address these shortcomings, in particular the “blurry” appearance of standard cVAE models, where fine\-scale gradients, edges, and localized spatial variations are washed out by the model’s tendency to output the most likely outcome out of the possible ones\. We also examine the potential for this probabilistic framework to produce corrected forecasts at higher spatial resolution, making it effectively a post\-processing and downscaling tool\. The remainder of the paper is organized as follows\. Section[2](https://arxiv.org/html/2605.29172#S2)and appendix[A](https://arxiv.org/html/2605.29172#A1)detail the cVAE framework, its formulation for bias adjustment of seasonal predictions, the cVAE\-CRPS model, the architecture of the neural network, and the training/inference procedures\. The data used for training and evaluation are described in section[3](https://arxiv.org/html/2605.29172#S3)\. Probabilistic and deterministic evaluation metrics are introduced in section[4](https://arxiv.org/html/2605.29172#S4)and appendix[B](https://arxiv.org/html/2605.29172#A2), and the results are presented in section[5](https://arxiv.org/html/2605.29172#S5)\. Discussions and conclusions are provided in section[6](https://arxiv.org/html/2605.29172#S6)\.

## 2Methods

### 2\.1Conditional Variational Autoencoder \(cVAE\)

The generative model of GSA2025 is based on a conditional Variational Autoencoder\(cVAE, Sohnet al\.,[2015](https://arxiv.org/html/2605.29172#bib.bib28)\)that models the target variabley∈𝒴\{y\}\\in\\mathcal\{Y\}with the help of an unobservable \(latent\) variablez∈𝒵\{z\}\\in\\mathcal\{Z\}\(Sohnet al\.,[2015](https://arxiv.org/html/2605.29172#bib.bib28); Prince,[2023](https://arxiv.org/html/2605.29172#bib.bib45)\), conditioned on the input variablex∈𝒳x\\in\\mathcal\{X\}\. The latent variablezzcan be regarded as a lower\-dimensional representation ofy\{y\}that explains the variations in the data in a simpler manner\. The generative process uses samples in the latent space𝒵\\mathcal\{Z\}drawn from the conditional prior distributionpω\(z\|x\)p\_\{\\omega\}\(\{z\|x\}\), which is generally assumed to be Gaussian dependent on the condition\. The generated datay∈𝒴\{y\}\\in\\mathcal\{Y\}are sampled from the distributionpθ\(y\|z,x\)p\_\{\\theta\}\(\{y\}\|z,x\), referred to as probabilisticdecoder\. In a standard cVAE model, the decoder is assumed to be a Gaussian distribution\(Sohnet al\.,[2015](https://arxiv.org/html/2605.29172#bib.bib28)\), but we follow a different approach here \(section[2](https://arxiv.org/html/2605.29172#S2)[2\.3](https://arxiv.org/html/2605.29172#S2.SS3)\)\.

Distribution parameters \(e\.g\., mean and standard deviation of the decoder distribution and the conditional prior\) are learned using maximum likelihood estimation, which optimizes the parametersθ\\thetaandω\\omegaof the neural network models to maximize the log\-likelihood of the observed targets in the generated output distribution\(Sohnet al\.,[2015](https://arxiv.org/html/2605.29172#bib.bib28)\)\. Kingma & Welling \(2014\) showed that the VAE parameters can be efficiently estimated using the variational lower bound of the log\-likelihood as a surrogate objective function\. For the cVAE the surrogate objective function is:

logp\(y\|x\)=KL\(qϕ\(z\|y,x\)∥p\(z\|y,x\)\)\+𝔼qϕ\(z\|y,x\)\[−logqϕ\(z\|y,x\)\+logp\(y,z\|x\)\]≥−KL\(qϕ\(z\|y,x\)∥pω\(z\|x\)\)\+𝔼qϕ\(z\|y,x\)\[logpθ\(y\|z,x\)\],\\begin\{split\}\\log p\(y\|x\)=\\mathrm\{KL\}\\left\(q\_\{\\phi\}\(\{z\}\|y,x\)\\\|p\(\{z\}\|y,x\)\\right\)\+\\mathbb\{E\}\_\{q\_\{\\phi\}\(\{z\}\|y,x\)\}\\left\[\-\\log q\_\{\\phi\}\(\{z\}\|y,x\)\+\\log p\(\{y\},\{z\}\|x\)\\right\]\\\\ \\geq\-\\mathrm\{KL\}\\left\(q\_\{\\phi\}\(\{z\}\|y,x\)\\\|p\_\{\\omega\}\(z\|x\)\\right\)\+\\mathbb\{E\}\_\{q\_\{\\phi\}\(\{z\}\|y,x\)\}\\left\[\\log p\_\{\\theta\}\(\{y\}\|z,x\)\\right\],\\end\{split\}\(1\)
where theKL\\mathrm\{KL\}term refers to the Kullback\-Leibler divergence\. The distributionqϕ\(z\|y,x\)q\_\{\\phi\}\(\{z\}\|y,x\), known as the probabilisticencoderand also assumed to be Gaussian, is introduced to approximate the true intractable posteriorp\(z\|y,x\)p\(\{z\}\|y,x\)\(Kingma and Welling,[2014](https://arxiv.org/html/2605.29172#bib.bib23); Sohnet al\.,[2015](https://arxiv.org/html/2605.29172#bib.bib28)\)\. Its parameters are also learned during the training process through a separate neural network with parametersϕ\\phi\.

### 2\.2Probabilistic bias adjustment of seasonal predictions

For probabilistic bias adjustment of seasonal prediction ensembles, the goal is to learn a mapping from a biasedensemble meanforecastx¯tl\\bar\{x\}\_\{tl\}, wherettdenotes the initialization time andllthe lead month, to the observational distributionp\(Y\|x¯tl\)p\(Y\|\\bar\{x\}\_\{tl\}\)\(GSA2025\)\. Although biased,x¯tl\\bar\{x\}\_\{tl\}simulates the predictable component of the climate system due to both external forcing and internal variability, the latter only possible due to the initialization of the forecast\. The observationytly\_\{tl\}, indexed for simplicity in the same initial\-time and lead\-month notation as the forecast, is the realized outcome of the climate system drawn from a range of possible states, which define the target observational distribution\. We are thus interested in modeling the observational distribution conditioned on the biased ensemble mean, which carries the predictable component of the forecasts\. GSA2025 achieve this by maximizing the log\-likelihood of the observationytly\_\{tl\}in the target distribution conditioned onx¯tl\\bar\{x\}\_\{tl\}\(maxθ\\max\{\{\}\_\{\\theta\}\}log⁡pθ\(Y=ytl\|x¯tl\)\\log p\_\{\\theta\}\(Y=y\_\{tl\}\|\\bar\{x\}\_\{tl\}\)\) with the cVAE model formulated as:

Encoder:qϕ\(z\|ytl,x¯tl\)=𝒩\(μNNϕ\(ytl,x¯tl\),σNNϕ2\(ytl,x¯tl\)I\)\\displaystyle\\textit\{Encoder:\}\\quad q\_\{\\phi\}\(\{z\}\|\{y\_\{tl\}\},\{\\bar\{x\}\_\{tl\}\}\)=\\mathcal\{N\}\\left\(\{\\mu\}\_\{NN\_\{\\phi\}\}\(\{y\_\{tl\}\},\{\\bar\{x\}\_\{tl\}\}\),\{\\sigma\}\_\{NN\_\{\\phi\}\}^\{2\}\(\{y\_\{tl\}\},\{\\bar\{x\}\_\{tl\}\}\)\{I\}\\right\)\(2\)Decoder:pθ\(y\|z,x¯tl\)=𝒩\(μNNθ\(z,x¯tl\),σ2I\)\\displaystyle\\textit\{Decoder:\}\\quad p\_\{\\theta\}\(\{y\}\|\{z\},\{\\bar\{x\}\_\{tl\}\}\)=\\mathcal\{N\}\\left\(\{\\mu\}\_\{NN\_\{\\theta\}\}\(\{z\},\{\\bar\{x\}\_\{tl\}\}\),\{\\sigma^\{2\}\}\{I\}\\right\)Prior:pω\(z\|x¯tl\)=𝒩\(μNNω\(x¯tl\),σNNω2\(x¯tl\)I\)\\displaystyle\\textit\{Prior:\}\\qquad p\_\{\\omega\}\(\{z\}\|\{\\bar\{x\}\_\{tl\}\}\)=\\mathcal\{N\}\\left\(\{\\mu\}\_\{NN\_\{\\omega\}\}\(\{\\bar\{x\}\_\{tl\}\}\),\{\\sigma\}\_\{NN\_\{\\omega\}\}^\{2\}\(\{\\bar\{x\}\_\{tl\}\}\)\{I\}\\right\)
whereI\{I\}is the identity matrix andσ2\{\\sigma^\{2\}\}is a location independent \(constant\) decoder noise\. Under this formulation, theKL\\mathrm\{KL\}term in the objective function \(Eq\.[1](https://arxiv.org/html/2605.29172#S2.E1)\) has a closed form solution\(Sohnet al\.,[2015](https://arxiv.org/html/2605.29172#bib.bib28)\), while the second term reduces to a quantity proportional to the Mean Square Error \(MSE\) between the decoder meanμNNθ\(z,x¯tl\)\{\\mu\}\_\{NN\_\{\\theta\}\}\(\{z\},\{\\bar\{x\}\_\{tl\}\}\)and the target observationytly\_\{tl\}\. Minimizing this MSE is equivalent to maximizing the observation’s log\-likelihood under the decoder’s Gaussian distribution\. The expectation𝔼qϕ\(z\|ytl,x¯tl\)\[log⁡pθ\(Y=ytl\|z,x¯tl\)\]\\mathbb\{E\}\_\{q\_\{\\phi\}\(\{z\}\|y\_\{tl\},\\bar\{x\}\_\{tl\}\)\}\\left\[\\log p\_\{\\theta\}\(Y=y\_\{tl\}\|z,\\bar\{x\}\_\{tl\}\)\\right\]in Eq\.[1](https://arxiv.org/html/2605.29172#S2.E1)is further simplified by drawing samplesztln\{z\}^\{n\}\_\{tl\}\(n=1,…,Nn=1,\\dots,N\) using the posterior distributionqϕ\(z\|ytl,x¯tl\)q\_\{\\phi\}\(\{z\}\|y\_\{tl\},\\bar\{x\}\_\{tl\}\)\. With this formulation, the cVAE objective function simplifies to:

ℒ\(x;θ,ϕ\)\\displaystyle\\mathcal\{L\}\(x;\\theta,\\phi\)=\\displaystyle=KL\(qϕ\(z∣ytl,x¯tl\)∥pω\(z∣x¯tl\)\)−1N∑n=1Nlogpθ\(Y=ytl∣ztln,x¯tl\)\\displaystyle\\mathrm\{KL\}\\Bigl\(q\_\{\\phi\}\(z\\mid y\_\{tl\},\\bar\{x\}\_\{tl\}\)\\,\\\|\\,p\_\{\\omega\}\(z\\mid\\bar\{x\}\_\{tl\}\)\\Bigr\)\-\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\log p\_\{\\theta\}\\bigl\(Y=y\_\{tl\}\\mid z^\{n\}\_\{tl\},\\bar\{x\}\_\{tl\}\\bigr\)\(3\)∝\\displaystyle\\proptoKL\(qϕ\(z∣ytl,x¯tl\)∥pω\(z∣x¯tl\)\)\+1N∑n=1NMSE\(ytl,μNNθ\(ztln,x¯tl\)\)\.\\displaystyle\\mathrm\{KL\}\\Bigl\(q\_\{\\phi\}\(z\\mid y\_\{tl\},\\bar\{x\}\_\{tl\}\)\\,\\\|\\,p\_\{\\omega\}\(z\\mid\\bar\{x\}\_\{tl\}\)\\Bigr\)\+\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\mathrm\{MSE\}\\bigl\(y\_\{tl\},\\mu\_\{\\mathrm\{NN\}\_\{\\theta\}\}\(z^\{n\}\_\{tl\},\\bar\{x\}\_\{tl\}\)\\bigr\)\.
Note that the negative log\-likelihood on the right\-hand side of Eq\.[3](https://arxiv.org/html/2605.29172#S2.E3)is minimized foreachztln\{z\}\_\{tl\}^\{\{n\}\}using MSE\.

### 2\.3The cVAE\-CRPS model

One issue with using this formulation of cVAE models, and more generally models with MSE as their objective function, is the loss of spectral energy at fine scales manifested as smooth \(blurry\) outputs where the high resolution structure is underrepresented\(Laiet al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib16); Larsenet al\.,[2016](https://arxiv.org/html/2605.29172#bib.bib41); Xuet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib93); Harriset al\.,[2022](https://arxiv.org/html/2605.29172#bib.bib94); Dortaet al\.,[2018](https://arxiv.org/html/2605.29172#bib.bib46)\)\. As mentioned above \(section[2](https://arxiv.org/html/2605.29172#S2)[2\.2](https://arxiv.org/html/2605.29172#S2.SS2)\), the MSE in the cVAE loss function results from the normality assumption over the decoder output \(Eq\.[5](https://arxiv.org/html/2605.29172#S2.E5)\)\. Moreover, the decoder assumes unstructured output noise, with the decoder mean often used as the cVAE output\. This treatment implicitly assumes negligible decoder noise\. In practice, output noise is often structured\(Dortaet al\.,[2018](https://arxiv.org/html/2605.29172#bib.bib46)\)and neglecting it leads to a loss of small‑scale detail\(Finnet al\.,[2024b](https://arxiv.org/html/2605.29172#bib.bib95); Gooyaet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib84)\)\.

To address this, we employ a framework that makes no parametric assumptions of the decoder’s output distribution, thus MSE can no longer be derived in the objective function \(Eq\.[3](https://arxiv.org/html/2605.29172#S2.E3)\)\. Instead, we reformulate the log\-likelihood maximization using the Continuous Ranked Probability Score \(CRPS\)\. CRPS is a key metric for evaluating probabilistic forecasts that has gained traction as an objective function for weather and climate applications\(Aletet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib99); Langet al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib100); Kochkovet al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib101); Zhonget al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib96); Oskarssonet al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib98)\)\. As strictly proper scoring rules\(Gneiting and Raftery,[2007](https://arxiv.org/html/2605.29172#bib.bib76); Waghmare and Ziegel,[2026](https://arxiv.org/html/2605.29172#bib.bib77)\)\(relative to suitable distribution classes\), CRPS and log\-likelihood would achieve extrema with the same distribution, theoretically, if the model class contains the true distribution\. In practice, we propose minimizing CRPS as asurrogatefor the−log⁡pθ\(Y=ytlt\|ztln,x¯tl\)\-\\log p\_\{\\theta\}\(Y=y\_\{t\}^\{lt\}\|\{z\}^\{n\}\_\{tl\},\\bar\{x\}\_\{tl\}\)term in Eq\.[3](https://arxiv.org/html/2605.29172#S2.E3)\.

Unlike MSE that requires a single output \(i\.e\., the mean of the Gaussian distribution\), calculating CRPS for a non\-parametric distribution \(Eq\.[7](https://arxiv.org/html/2605.29172#A1.E7)\) requires anoutputensembleY^tlM\\hat\{Y\}^\{M\}\_\{tl\}to be compared with the target observationytl\{y\}\_\{tl\}\. Previous studies using CRPS for training\(Koochaliet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib97); Zhonget al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib96)\)or fine\-tuning\(Oskarssonet al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib98)\)VAE models generally acquire this output ensembleduring trainingby decoding, with the decoder in Eq\.[5](https://arxiv.org/html/2605.29172#S2.E5), theMMdifferent samples from the posterior distributionqϕ\(⋅\)q\_\{\\phi\}\(\\cdot\)\. However, these studies do not explain the connection to the mathematical derivation of the ELBO objective function, i\.e\., the way in which the log\-likelihood in the output distribution of Eq\.[3](https://arxiv.org/html/2605.29172#S2.E3)is to be maximized foreach latentzn\{z\}^\{n\}from the posterior\. In other words, it is for foreachzn\{z\}^\{n\}that the log\-likelihood term is proportional to MSE between the decoder mean and the target when the output is assumed to be normally distributed \(as in Eq\.[5](https://arxiv.org/html/2605.29172#S2.E5)\)\.

Here, we replace the Gaussian decoderpθ\(y\|z,x¯tl\)p\_\{\\theta\}\(\{y\}\|z,\\bar\{x\}\_\{tl\}\)in Eq\.[5](https://arxiv.org/html/2605.29172#S2.E5)with a probabilistic generatorGθ\(z,x¯tl\)G\_\{\\theta\}\(z,\\bar\{x\}\_\{tl\}\)\. This allows us to generate an ensemble of outputs for each generated sampleztln\{z\}^\{\{n\}\}\_\{tl\}in the latent space, needed to calculate the CRPS\. The generator is therefore optimized to output samples from the \(non\-parameterized\) distributionp\(y\|z,x¯tl\)p\(\{y\}\|z,\\bar\{x\}\_\{tl\}\)rather than the parameters of a Gaussian \(i\.e\., mean and standard deviation\) as in standard cVAEs\. Specifically, instead of a Gaussian output distribution, we employ samplesGθ\(ztln,x¯tl\)∼p\(x\|ztln,x¯tl\)G\_\{\\theta\}\(z^\{n\}\_\{tl\},\\bar\{x\}\_\{tl\}\)\\sim p\(\{x\}\|\{z\}^\{n\}\_\{tl\},\\bar\{x\}\_\{tl\}\)\. The objective function then reads \(compare with Eq\.[3](https://arxiv.org/html/2605.29172#S2.E3)\):

ℒ\(x;θ,ϕ\)=KL\(qϕ\(z∣ytl,x¯tl\)∥pω\(z∣x¯tl\)\)\+1N∑n=1NCRPS\(ytl,GθM\(ztln,x¯tl\)\)\\mathcal\{L\}\(x;\\theta,\\phi\)=\\mathrm\{KL\}\\Bigl\(q\_\{\\phi\}\(z\\mid y\_\{tl\},\\bar\{x\}\_\{tl\}\)\\,\\\|\\,p\_\{\\omega\}\(z\\mid\\bar\{x\}\_\{tl\}\)\\Bigr\)\+\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\mathrm\{CRPS\}\(y\_\{tl\},G^\{M\}\_\{\\theta\}\(z^\{n\}\_\{tl\},\\bar\{x\}\_\{tl\}\)\)\(4\)
withGθM\(ztln,x¯tl\)G^\{M\}\_\{\\theta\}\(z^\{n\}\_\{tl\},\\overline\{x\}\_\{tl\}\)indicating theMM\-ensemble of outputs generated for each sampleztln\{z\}^\{n\}\_\{tl\}\.

Drawing on the conditional\-GAN framework\(Daust and Monahan,[2024](https://arxiv.org/html/2605.29172#bib.bib22)\), we transform the cVAE decoder \(Eq\.[5](https://arxiv.org/html/2605.29172#S2.E5)\) into a truly probabilistic generator by using noise\-injection layers that induce stochastic behavior \(section[2](https://arxiv.org/html/2605.29172#S2)[2\.4](https://arxiv.org/html/2605.29172#S2.SS4)\)\. During training, we generate 10 outputs for each latent sampleztln\{z\}^\{n\}\_\{tl\}\. CRPS is then calculated at each grid cell using Eq\.[7](https://arxiv.org/html/2605.29172#A1.E7)for the 10\-member ensemble and is averaged on the spatial dimension\. We refer to this model as cVAE\-CRPS\. The decoder in cVAE versus cVAE\-CRPS can be regarded as mappingztln\{z\}^\{n\}\_\{tl\}samples from the lower dimensional latent space to a Gaussian distribution \(cVAE\) versus a collection of points \(cVAE\-CRPS\) in the higher dimensional output space\. The cVAE\-CRPS model is formulated as follows:

Encoder:qϕ\(z\|ytl,x¯tl\)=𝒩\(μNNϕ\(ytl,x¯tl\),σNNϕ2\(ytl,x¯tl\)I\)\\displaystyle\\textit\{Encoder:\}\\quad q\_\{\\phi\}\(\{z\}\|\{y\_\{tl\}\},\{\\bar\{x\}\_\{tl\}\}\)=\\mathcal\{N\}\\left\(\{\\mu\}\_\{NN\_\{\\phi\}\}\(\{y\_\{tl\}\},\{\\bar\{x\}\_\{tl\}\}\),\{\\sigma\}\_\{NN\_\{\\phi\}\}^\{2\}\(\{y\_\{tl\}\},\{\\bar\{x\}\_\{tl\}\}\)\{I\}\\right\)\(5\)Decoder:p\(y\|z,x¯tl\)∼GθM\(z,x¯tl\)\\displaystyle\\textit\{Decoder:\}\\quad p\(\{y\}\|\{z\},\{\\bar\{x\}\_\{tl\}\}\)\\sim G^\{M\}\_\{\\theta\}\(z,\\overline\{x\}\_\{tl\}\)Prior:pω\(z\|x¯tl\)=𝒩\(μNNω\(x¯tl\),σNNω2\(x¯tl\)I\)\\displaystyle\\textit\{Prior:\}\\qquad p\_\{\\omega\}\(\{z\}\|\{\\bar\{x\}\_\{tl\}\}\)=\\mathcal\{N\}\\left\(\{\\mu\}\_\{NN\_\{\\omega\}\}\(\{\\bar\{x\}\_\{tl\}\}\),\{\\sigma\}\_\{NN\_\{\\omega\}\}^\{2\}\(\{\\bar\{x\}\_\{tl\}\}\)\{I\}\\right\)

### 2\.4Architecture and inference

Our cVAE setup follows the architecture ofSohnet al\.\([2015](https://arxiv.org/html/2605.29172#bib.bib28)\)with some modifications\. The encoder and prior networks are series of double convolution processing blocks using a variation of ConvNeXt blocks\(Dheeshjithet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib80); Liuet al\.,[2022](https://arxiv.org/html/2605.29172#bib.bib81)\)with partial convolution layers\(Liuet al\.,[2018](https://arxiv.org/html/2605.29172#bib.bib87)\)followed by maxpooling downsampling, mapping to a 1000\-dimensional latent space\. The decoder reverses the encoder and prior network operations using upsampling and double ConvNeXt blocks, followed by a final convolution layer projecting back to the data space\. For the cVAE\-CRPS model, we inject random noise as a channel in each of the upsampling and processing blocks in the decoder\. FollowingSohnet al\.\([2015](https://arxiv.org/html/2605.29172#bib.bib28)\), we create an additional deterministic network combining an encoder and a decoder without a latent space or noise injection\. We employ this network to provide an initial deterministic \(average\) estimate of the datay~tl\\tilde\{y\}\_\{tl\}\. This initial estimate is added as an extra condition to the prior network,p\(z\|x¯tl,y~tl\)=𝒩\(μNNω\(x¯tl,y~tl\),σNNω2\(x¯tl,y~tl\)I\)p\(\{z\}\|\{\\bar\{x\}\_\{tl\},\\tilde\{y\}\_\{tl\}\}\)=\\mathcal\{N\}\\left\(\{\\mu\}\_\{NN\_\{\\omega\}\}\(\{\\bar\{x\}\_\{tl\},\\tilde\{y\}\_\{tl\}\}\),\{\\sigma\}\_\{NN\_\{\\omega\}\}^\{2\}\(\{\\bar\{x\}\_\{tl\}\},\\tilde\{y\}\_\{tl\}\)\{I\}\\right\), and added to the main decoder outputbeforethe final convolution layer\. This helps the model focus on learning variations around the estimated deterministic average rather than reconstructing the most likely signal\. Details of the architecture and training can be found in the appendix[A](https://arxiv.org/html/2605.29172#A1)\.

At inference time, samples from the prior distribution are passed to the decoder to generate the output ensemble\. While theKL\\mathrm\{KL\}divergence term in the loss function penalizes disagreement between the latent structure and the prior distribution, latent encodings often diverge from falling perfectly under the prior distribution\(An and Jeon,[2023](https://arxiv.org/html/2605.29172#bib.bib31)\)\. This becomes specially important for extremes and the generation of ensembles that reflect correct tail behavior\(GSA2025; Gooyaet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib84); Oliveiraet al\.,[2022](https://arxiv.org/html/2605.29172#bib.bib32)\)\. The explicit formulation of the prior and structured latent space in cVAEs enables control over data generation\(GSA2025; Prince,[2023](https://arxiv.org/html/2605.29172#bib.bib45); Oliveiraet al\.,[2022](https://arxiv.org/html/2605.29172#bib.bib32); Szwarcmanet al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib27); Mooerset al\.,[2021](https://arxiv.org/html/2605.29172#bib.bib34); Hsieh and Wu,[2024](https://arxiv.org/html/2605.29172#bib.bib33); Gooyaet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib84)\)\. cVAEs benefit from the property of autoencoders that group similar samples closer together in the latent space and structure them to the normal prior distribution through the KL term\. Thus, samples belonging to more common events are expected to lie within regions where the prior distribution has a higher probability\(Oliveiraet al\.,[2022](https://arxiv.org/html/2605.29172#bib.bib32)\), with less common samples falling on the distribution tails\. Scaling the standard deviation of the prior distribution at inference time then allows sampling a wider range of internal climate variability\(GSA2025; Oliveiraet al\.,[2022](https://arxiv.org/html/2605.29172#bib.bib32)\)\.

In a well\-calibrated forecasting system, the spread in the ensemble prediction is expected to match the error in the ensemble mean forecast in the large ensemble limit\. We use this as a criterion for selecting a proper scaling factor of the prior standard deviation at inference time to control the dispersion of the generated ensemble\. Specifically, the scaling factor is derived by equating the spread of the ensemble output and the RMSE of the ensemble mean over the validation period\. This is implemented for a sufficiently large output ensemble \(200 members\)\. To avoid biasing the scaling factor to specific regions, we use spatially averaged ensemble spread and RMSE for our selection criterion\. Consequently, for every latent sample drawn from the scaled distribution, we sample the output distribution once using the stochastic cVAE\-CRPS generator\. A more detailed discussion about scaling the prior distribution is given in the appendix[C](https://arxiv.org/html/2605.29172#A3)\.

## 3Data

We post\-process retrospective seasonal forecasts \(or hindcasts\) of monthly mean Arctic SIC produced with CanSIPSv3’s CanESM5 model\. Each raw ensemble forecast consists of 10 members of 12\-month predictions initialized at the beginning of every month starting in January 1981 until the present\. The ensemble forecast is provided on a1×11\\times 1standard resolution\. To train the cVAE\-based model, we use hindcasts spanning January 1980 to December 2015, and reserve the forecasts issued in January20162016to December20182018as a validation set\. Adjusted forecasts are tested for years20192019to20232023\. We employ a temporal mask to ensure that there is no data leaking from the test sample to the validation sample, or from the validation to the training sample\(Sospedra\-Alfonsoet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib73)\)\.

The target observational data are from the satellite\-based NOAA/NSIDC Climate Data Record of passive microwave SIC v5\(Meieret al\.,[2024](https://arxiv.org/html/2605.29172#bib.bib103)\)spanning January 1981 to December 2024\. These data are provided on the NSIDC Sea Ice Polar Stereographic North grid at 25 km resolution\. We re\-arrange the observational data into a structure analogous to the forecasts consisting of monthly values spanning 12 months from the start of every month\. In addition, we project the forecasts on the same polar stereographic grid as the observations\. This implies that for latitudes south to80∘80^\{\\circ\}where the sea\-ice edge lies in most months, the cVAE model learns the fine scale variability conditioned on a coarser resolution field\. Remarkably, not only the cVAE model bias\-corrects, calibrates, and boosts the forecast ensemble, but also downscales it to a finer resolution\.

As a benchmark, we employ a lead month dependent climatological mean adjustment, for which forecast anomalies relative to the ensemble mean forecast are superimposed to the observed climatology\. Specifically, if we denote the forecasts asxkjmlx\_\{kjml\}withkkindicating the ensemble member,jjandmmthe initial year and month, respectively, andllthe lead month, the bias corrected forecast reads:

xkjml′=xkjml−x¯¯ml\+y¯ml\{x\}^\{\\prime\}\_\{kjml\}=x\_\{kjml\}\-\\overline\{\\overline\{x\}\}\_\{ml\}\+\\overline\{y\}\_\{ml\}\(6\)wherex¯¯ml\\overline\{\\overline\{x\}\}\_\{ml\}andy¯ml\\overline\{y\}\_\{ml\}denote the lead\-month dependent climatological mean of the ensemble mean forecast and observational data, respectively\.

## 4Evaluation metrics

The corrected ensemble using cVAE\-CRPS \(hereafter Nadj\) is compared to the climatological mean bias\-corrected benchmark ensemble \(hereafter Badj\)\. The Badj ensemble is limited by the number of members in the raw ensemble forecasts \(10 in this case\)\. For consistency, we evaluate the Nadj ensemble using 10 members only, but emphasize that its size can be made arbitrarily large\.

We assess the prediction skill of marginal \(pixel\-wise\) distributions using the RMSE of the ensemble mean forecast and CRPS \(Eq\.[7](https://arxiv.org/html/2605.29172#A1.E7)\) relative to the observational data \(hereafter Obs\), averaged spatially and over initial times\. We also employ rank histograms to assess whether forecast ensembles are well\-calibrated, in which case the Obs should be indistinguishable from the ensemble members\. Specifically, we count the rank of the Obs relative to the sorted ensemble forecast members at each grid point and initial time, group the ranks to create a histogram, and report the Cumulative Distribution Function \(CDF\)\(Daust and Monahan,[2024](https://arxiv.org/html/2605.29172#bib.bib22)\)\. For a calibrated forecast, each rank should have the same probability of occurrence, so the histogram should be a uniform distribution and its CDF the 1:1 straight line\. If the CDF has more weight at the \(peak\) tails, corresponding to a \(reversed\) U\-shaped histogram, the ensemble forecast is \(underconfident\) overconfident\. We report CDF plots for critical marginal ice regions, defined as grid cells with0\.15≤SIC≤0\.900\.15\\leq\\hbox\{SIC\}\\leq 0\.90\. This avoids biased results toward fully covered or open ocean regions\. In addition, we compare the Spread over Error \(SOE\) ratio, which measures the reliability of the ensemble and is defined in terms of the ensemble variance and the MSE of the ensemble mean forecast \(Eq\.[8](https://arxiv.org/html/2605.29172#A2.E8)\)\. SOE==1 indicates that the members and observations are statistically indistinguishable\(Hoet al\.,[2013](https://arxiv.org/html/2605.29172#bib.bib86)\), whereas SOE<<1 or SOE\>\>1 indicate overconfidence or underconfidence, respectively\. We compute the SOE ratio at each grid point and averaged it spatially and across initialization times, similar to RMSE and CRPS\. Details are provided in appendix[B](https://arxiv.org/html/2605.29172#A2)\.

Finally, we evaluate the corrected ensemble using several metrics that reflect the spatial and temporal structure of the forecasts\. We compute RMSE and anomaly correlation coefficient \(ACC\) of sea ice area \(SIA, Eq\.[9](https://arxiv.org/html/2605.29172#A2.E9)\) and extent \(SIE, Eq\.[10](https://arxiv.org/html/2605.29172#A2.E10)\), as well as pattern correlation averaged over the initialization times\. We compute the Integrated Ice Edge Error \(IIEE, Eq\.[11](https://arxiv.org/html/2605.29172#A2.E11)\), which measures the difference between the areas enclosed by predicted and true ice edges\(Goesslinget al\.,[2016](https://arxiv.org/html/2605.29172#bib.bib85)\)\. We also compare the distribution of sea ice concentration values pooling all marginal ice grid cells and initialization times using quantile\-quantile \(QQ\) plots\. As discussed in the previous paragraph for CDFs, QQ plots are reported for the critical marginal ice regions\. Finally, we use the radially averaged power spectral density \(RAPSD\) to compare the covariance structure of the adjusted forecasts and the verification data, and to diagnose the smoothness of cVAE outputs\(Aletet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib99)\)\. For each target month, RAPSD is calculated by annularly averaging the two\-dimensional Fourier spectra around the origin of the wavenumber domain\.

## 5Results

Figure[1](https://arxiv.org/html/2605.29172#S5.F1)shows the probabilistic performance of the adjusted forecasts\. For the Nadj ensemble \(10 members\), the CDF curves at various lead times correspond to rank histograms close to the uniform distribution \(Fig\.[1](https://arxiv.org/html/2605.29172#S5.F1)a\), indicating a relatively well\-calibrated ensemble\. This is further supported by the SOE ratio \(Fig\.[1](https://arxiv.org/html/2605.29172#S5.F1)b\), which shows clear improvements relative to Badj with values ranging from approximately 0\.5 to 0\.6 for all forecast months\. By contrast, the Badj ensemble \(10 members\) appears largely overconfident, with heavy\-tailed CDF and SOE ratios below 0\.3 for all lead times\. Because larger ensembles reduce sampling noise in the SOE ratio \(Eq\.[8](https://arxiv.org/html/2605.29172#A2.E8)\), we also show that for Nadj using 200 members \(Fig\.[1](https://arxiv.org/html/2605.29172#S5.F1)b\), providing a more accurate assessment of the true reliability of the Nadj ensemble\. The values are consistently closer to 1 than the 10\-member ensemble, ranging from 0\.7 to 0\.9 for all forecast months\. In terms of QQ plots for SIC over marginal sea ice regions \(Fig\.[1](https://arxiv.org/html/2605.29172#S5.F1)c\), the Nadj ensemble shows a very good agreement with the Obs while Badj underestimates the observed quantiles, notably early and late in the forecast\. In addition, Fig\.[1](https://arxiv.org/html/2605.29172#S5.F1)d shows clear improvements in CRPS by Nadj relative to the Badj ensemble\. Notably, the Badj ensemble consistently shows degraded performance at the start of the forecast, indicating potential deficiencies in the initialization procedure\. These deficiencies seem to be mitigated for the Nadj ensemble, at least for the metrics in Fig\.[1](https://arxiv.org/html/2605.29172#S5.F1)b\-d, whose performance decays with lead time as expected\.

![Refer to caption](https://arxiv.org/html/2605.29172v1/x1.png)Figure 1:a\) CDF of rank histograms of the Nadj/Badj \(10 members\) versus lead month measured at marginal ice grid cells\. Only three lead month are plotted for visibility\. b\) SOE versus lead month showing reliability\. Nadj \(200\) shows SOE for the full 200 member ensemble\. c\) QQ plots at three lead months comparing the distribution of SIC at marginal ice grid cells with obs\. d\) Average CRPS comparison between Nadj/Badj corrected ensembles \(10 members\) and the target observation at each grid cell\.In terms of deterministic measures, the ensemble mean forecasts of Nadj outperform Badj both at the grid cell level and across the sea ice domain \(Fig\.[2](https://arxiv.org/html/2605.29172#S5.F2)\)\. Figure[2](https://arxiv.org/html/2605.29172#S5.F2)a shows the RMSE averaged in space as a function of the forecast months, indicating smaller errors for Nadj than for Badj\. The ensemble variance is also shown, which is generally smaller for Badj than for Nadj, and it has values commensurable with RMSE for Nadj as expected from the SOE ratios \(Fig\.[1](https://arxiv.org/html/2605.29172#S5.F1)b\)\. For the bulk measures of integrated sea ice, Fig\.[2](https://arxiv.org/html/2605.29172#S5.F2)b shows the RMSE of the sea ice area \(SIA, Eq\.[9](https://arxiv.org/html/2605.29172#A2.E9)\) and extent \(SIE, Eq\.[10](https://arxiv.org/html/2605.29172#A2.E10)\) as well as the \(initial\-\)time\-averaged errors over the ice edge \(IIEE, Eq\.[11](https://arxiv.org/html/2605.29172#A2.E11)\)\. For all metrics, the ensemble mean Nadj forecast outperforms Badj by a large margin\. Similarly, the pattern correlation averaged over time is improved for the ensemble mean Nadj forecast relative to Badj \(Fig\.[2](https://arxiv.org/html/2605.29172#S5.F2)c\)\. In the case of ACC, on the other hand, both Badj and Nadj have a similar performance, with Badj showing slightly better skill over Nadj early in the forecast, although these results are highly susceptible to sampling error as the test period is only five years \(2019−20232019\-2023\)\. We note that the cVAE is not tasked with learning the forecast time dependence explicitly\. The phase of the interannual variability in the Nadj forecasts is learned through the predictable component in the condition, given by the ensemble mean of the \(biased\) raw forecast\. Therefore, there is no expectation that Nadj would significantly outperform Badj in terms of ACC unless it is provided with extra information e\.g\., from other variables or conditioning fields\. Remarkably, the cVAE model largely captures the time dependence of the integrated quantities while generating more accurate and reliable ensemble forecasts despite its conditioning field being biased\.

![Refer to caption](https://arxiv.org/html/2605.29172v1/x2.png)Figure 2:a\) Average RMSE \(solid\) between the ensemble mean forecast from Nadj/Badj compared to observation at each grid cell\. The dashed line shows the average mean ensemble spread calculated in the same manner\. b\) For each lead month, RMSE of SIA \(solid line\) and SIE \(dashed line\) over initialization time, and average IIEE \(dotted line\) over initialization time is compared between ensemble mean forecast from Nadj/Badj and obs\. c\) ACC in SIE and SIA, as well as pattern correlation relative to Obs over initialization times as a function of lead month\. \(d\) RAPSD ratio Nadj/Badj to observation for predictions of March maximum on lead month 1 averaged over2019−20232019\-2023\(5 years\) and across ensemble members \(5×\\times10 members\)\. The shading shows the range associated with different members\. \(e\) same as \(d\) but for September minimum\. Please note the y\-axis limits are different in panels \(d\) versus \(e\)\.We further assess the spatial covariance and physical realism of the generated SIC fields using the RAPSD metric \(Fig\.[2](https://arxiv.org/html/2605.29172#S5.F2)d,e\)\. Specifically, we compute the RAPSD ratios of the adjusted forecasts relative to the Obs for thetargetmonths of March \(Fig\.[2](https://arxiv.org/html/2605.29172#S5.F2)d\) and September \(Fig\.[2](https://arxiv.org/html/2605.29172#S5.F2)d\), representative of the monthly maximum and minimum Arctic sea ice cover, respectively, for the test period2019−20232019\-2023and using 10\-member ensembles for both Nadj and Badj\. On average across all initial times and ensemble members, Badj exhibits spectral\-energy biases at both large and small spatial scales for both target months\. The loss of fine\-scale energy in Badj primarily reflects the coarse resolution of the raw forecasts relative to the target observations\. By contrast, Nadj exhibits relatively small spectral\-energy biases at both large and small spatial scales\. This is noteworthy, as the post\-processing tool not only corrects mean\-state biases and calibrates the raw forecast ensemble but also effectively performs downscaling by leveraging the finer resolution of the target observations\. The increased spatial resolution is evident in Fig\.[3](https://arxiv.org/html/2605.29172#S5.F3), which shows a randomly chosen ensemble member for Nadj and Badj forecasts, as well as the target observation as a reference, for September and March20232023\.

One key motivation to modify the original cVAE model of GSA2025 is the spectral\-energy bias exhibited at fine scales\. The RAPSD curves for Nadj produced with cVAE\-CRPS and the standard cVAE model \(Nadj\_mse\) show that the former does a better job at representing fine\-scale variability \(Fig\.[A3](https://arxiv.org/html/2605.29172#A3.F3)\), while the later shows a clear drop in energy at fine scales\. The improvements are seen for both March and September sea ice, as well as for short and long lead times\. The loss of fine scale variability and smoothness of outputs becomes evident in SIC maps \(Fig\.[A4](https://arxiv.org/html/2605.29172#A3.F4)\) which is greatly improved with cVAE\-CRPS\. Moreover, the cVAE\-CRPS has lower CRPS and higher SOE \(Fig A5\), with similar or better ensemble mean performance regarding errors in IE and IA, IIEE, pattern correlation, and ACC across all lead months \(not shown\)\.

At fine spatial scales, both Bajd and Nadj \(Nadj\_mse\) models show lower skill in RAPSD for the September minimum and larger uncertainty\. The Badj shows a larger drop in RAPSD ratios compared to March maximum \(note the y\-axis limits are different on Fig\. 2d and 2e\) at scales smaller than 100 km\. On the other hand, Nadj ratios remain closer to 1, however, the cVAE\-CRPS model shows a tenancy to overshoot fine\-scale variability for the September sea ice minimum\. On longer lead months, RAPSD ratios show consistent performance on average, but with larger uncertainty reflected in the wider range values across the ensemble members \(Shading in Figs\. A6 and A3\)\. This is consistent with earlier results showing higher uncertainty for predictions over longer lead months\. For SIC, which is bounded between 0 and 1, the uncertainty in RAPSD ratios likely reflects uncertainty in representing the transitional regions between fully ice\-covered and open\-water conditions, particularly near the marginal ice zone and ice edge\. In this regard, the lower RAPSD skill and increased uncertainty during the September minimum are consistent with previous studies reporting reduced predictability of Arctic sea ice concentrations in summer time and increased sensitivity of the ice edge during the melt season\(Sigmondet al\.,[2016](https://arxiv.org/html/2605.29172#bib.bib69)\)\. This becomes clearer using uncertainty maps\.

Figure[4](https://arxiv.org/html/2605.29172#S5.F4)shows the maps for the ensemble standard deviations for the target months of September and March, averaged over the test sample\. This variability reflects the underlying forecast uncertainty and is expected to be greatest along the sea\-ice edge, where freezing and melting processes fluctuate more strongly\. The figure shows that the Nadj ensemble forecasts capture this behavior, with the variance being stronger along the observed sea ice edges \(red dotted lines indicatingSIC≥0\.15SIC\\geq 0\.15\) while exhibiting more realistic patterns than Badj\. This is further confirmed by their SOE ratios computed with SIE \(Fig\.[A7](https://arxiv.org/html/2605.29172#A3.F7)\), which are consistently closer to 1 on all lead months for Nadj than for Badj\. The uncertainty grows larger with lead months, both in magnitude \(e\.g\., Fig\.[A7](https://arxiv.org/html/2605.29172#A3.F7)\) and spatial coverage \(Fig\.[4](https://arxiv.org/html/2605.29172#S5.F4)\), consistent with greater uncertainty in their RAPSD ratios \(Fig\.[2](https://arxiv.org/html/2605.29172#S5.F2)d,e\), especially at small spatial scales\. For Badj, the uncertainty in September sea ice at a 12\-month lead is particularly striking \(Fig\.[4](https://arxiv.org/html/2605.29172#S5.F4)\), as it encompasses most of the Arctic ocean despite the high observed SIC towards the pole \(Fig\.[3](https://arxiv.org/html/2605.29172#S5.F3)\), indicative of permanent sea ice cover\. By contrast, the Nadj uncertainty for the same target and lead month is more closely confined to the observed edges of the sea ice cover \(Fig\.[4](https://arxiv.org/html/2605.29172#S5.F4)\), as expected\.

![Refer to caption](https://arxiv.org/html/2605.29172v1/x3.png)Figure 3:Nadj \(first row\), Badj \(second row\), and observation \(third row\) SIC maps for target time of20232023March maximum \(t3 in first and second columns\), and20232023September minimum \(t9 in third and forth columns\) on lead months 1 \(l1\), and 12 \(l12\)\. For the Badj/Nadj ensembles, a random ensemble member is chosen\.![Refer to caption](https://arxiv.org/html/2605.29172v1/x4.png)Figure 4:Nadj and Badj standard deviation \(std\) across ensemble averaged for predictions on target months of March maximum \(first row, t3\), and September minimum \(second row, t9\) on lead months 1 \(l1 in first and second columns\), and 12 \(l12 in third and forth columns\) over2019−20232019\-2023\. The dashed magenta line shows the mean ice edge from observations over the same period\.
## 6Discussions and conclusions

This work presents a probabilistic post\-processing framework developed at the Canadian Centre for Climate Modeling and Analysis \(CCCma\) to generate arbitrarily large ensembles of bias\-adjusted, high\-resolution seasonal forecasts\. The proposed methodology builds on the conditional Variational Autoencoder model ofGooya and Sospedra\-Alfonso \([2025](https://arxiv.org/html/2605.29172#bib.bib102)\)to bias adjust seasonal predictions of Arctic Sea Ice\. The key innovation in this work is replacing the cVAE’s Gaussian‑parametrized decoder with a generator that learns a non\-parametric output distribution\. Training is carried out using CRPS as a surrogate reconstruction loss \(cVAE‑CRPS\) within the standard evidence lower bound objective function\. By moving away from the Gaussian assumption and thus from MSE\-based objective function, this approach preserves fine\-scale structure and avoids the overly smooth outputs that typically affect conventional cVAEs\.

We apply the cVAE\-CRPS model for the probabilistic bias adjustment of seasonal predictions of Arctic Sea Ice Concentrations produced with CCCma’s CanESM5 model\. We used several deterministic and probabilistic metrics to evaluate the performance for marginal distributions \(pixel\-level\) and to capture the spatial structure of the adjusted forecasts relative to a climatological mean bias adjustment as a benchmark\. The forecast ensemble adjusted with cVAE\-CRPS consistently outperforms the benchmark for all metrics and lead months by a considerable margin\. Specifically, the cVAE\-CRPS adjusted ensemble produces rank histograms that are nearly uniform \(Fig\. 1a\), with a SOE ratio closer to 1 compared to the benchmark \(Fig\. 1b\)\. It shows strong agreement with the verification data in the distribution of SIC evidenced with QQ plots \(Fig\. 1c\), particularly over the critical marginal sea ice region where uncertainty is greatest, and it shows substantial CRPS improvements across the ensemble \(Fig\. 1d\)\. In addition, the model significantly reduces errors in the ensemble mean forecast both at grid cell level \(Fig\. 2\) and for integrated measures of sea ice area and extent across the Arctic domain, notably in predicting the edges of the sea ice cover\. Importantly, the cVAE\-CRPS model reduces spectral\-energy bias not only at large spatial scales but also at fine scales \(Fig\. 2d,e\), successfully reconstructing the target’s fine\-scale variability which the benchmark fails to achieve\.

The cVAE\-based approach formulates the probabilistic bias adjustment of seasonal forecast ensembles by modeling the target observational distribution conditioned on the biased ensemble mean forecast from the dynamical model, in this case CCCma’s CanESM5\. The ensemble mean forecast carries information of CanESM5’s predictable component due to both external forcing and internal variability, although it drifts toward the model preferred climatology typically reducing its predictive capacity\. The cVAE thus models the distribution of realizable outcomes in the observational distribution given this \(biased\) signal from the climate model\. We argue that, in this context, the cVAE performs bias correction by identifyinganaloguerepresentations in the observations, via the conditional prior distribution, given the \(predictable\) signal from the climate model\. However, unlike traditional ensemble analogue methods that search the observational record directly in data space, the cVAE identifies analogues within a structured lower\-dimensional latent space\. This offers a key advantage: in the structured and continuous latent space of a cVAE, the network learns to more effectively disentangle the condition’s relevant features, enabling it to infer analogues from a continuum of lower\-dimensional, observationally\-based structured representations\. Crucially, the prediction skill of cVAE\-CRPS is limited by the conditionining field, as evident with ACC results, unless additional information is provided, e\.g\., from other input variables or conditions\.

The primary advantage of cVAE models and generative machine learning approaches more broadly over traditional post\-processing methods is that they are fundamentally probabilistic\. This is especially important for chaotic systems such as weather and climate that are inherently uncertain\. A proper representation of the full range of possible outcomes is thus essential for informed decision making\. This is where possible challenges with underdispersivity in generative models merit special attention\. Variability in cVAE based models comes potentially from two sources; stochasticity in latent space sampling constrained by the prior distribution, and uncertainty in the output space formulated in the decoder distribution\. For the former, we showed that the explicit mathematical formulation of cVAEs, combined with their continuous and structured latent space, enables controlled sampling over a broader range of internal variability\. Specifically, we identified a scaling factor for the prior’s standard deviation such that the uncertainty in a sufficiently large generated ensemble matches the ensemble\-mean forecast error over a validation period \(section 2d\)\. For the output noise, we showed that ignoring the variability in the decoder output, e\.g\., by taking the mean of a Gaussian decoder as in traditional cVAEs, leads to smooth fields where fine scale structure is lost \(section 5\)\. To address this, we implemented a cVAE\-CRPS framework that enables the generation of sharp SIC maps from the output distribution rather than assuming an explicit Gaussian form\. As a consequence, we were able to maintain fine\-scale energy and produce sharp outputs\. Furthermore, this model is naturally suitable for downscaling, which is another important application for reliable high\-resolution forecasting\.

Our results show that the cVAE\-CRPS model is able to generate relatively skillful, reliable, and well\-calibrated forecast ensembles that outperform the climatological bias correction and the standard cVAE model of GSA2025\. The results are based on a 10\-member cVAE\-CRPS ensemble generated to mirror the benchmark ensemble, ensuring a fair comparison\. We showed that the performance improves significantly by increasing the size of the cVAE\-CRPS ensemble output \(Fig\. 1b\), which is possible in the framework of generative, probabilistic post\-processing methods\. Moreover, we showed that the cVAE\-CRPS model is able not only to bias correct, calibrate, and boost the size of the raw forecast ensemble but also to downscale the forecasts to the resolution of the target data while retaining its fine\-scale spectral\-energy\. Importantly, this framework is flexible enough to allow conditioning on several variables or to perform a multivariate bias adjustment\. Future work will examine whether incorporating additional variables or covariates, particularly those with higher predictability, can extend the predictability horizon in forecasts of other target variables\. Finally, the current framework generates adjusted forecast ensembles without explicit dependence on lead time\. In principle, the generated ensembles members should be independent if they sample unpredictable variability\. However, due to potential limitations with the optimization process, model architecture, and training data, residual temporal dependencies \(e\.g\., due to higher frequency information than the monthly data\) might exist that are not accounted for\. To address this limitation, future work will explore sequence models aimed at generating adjusted ensemble forecasts traceable in time\.

###### Acknowledgements\.

We thank Mr\. Johannes Exenberger for his advice and valuable input during ideation and early stages of this research\. The authors declare no conflict of interest\.\\datastatementThe seasonal predictions ofSICfrom CanESM5 using CMIP6 forcing used in this study are not currently publicly available\. However, SIC seasonal hindcast data initialized in 1980\-2023 used here will be made available at the time of publication in a public repository\. Observational verification data from the satellite\-based NOAA/NSIDC Climate Data Record of passive microwave SIC v5 are available athttps://nsidc\.org/data/g02202/versions/5/\. Research grade codes used for training the models can be found athttps://github\.com/ParsaGooya/SI\_seasonal\_forecast\_adjustment\. A production ready software package is currently being developed and will be ready before publication\. All other inquires should be directed to P\. Gooya\.

## References

- F\. Alet, I\. Price, A\. El\-Kadi, D\. Masters, S\. Markou, T\. R\. Andersson, J\. Stott, R\. Lam, M\. Willson, A\. Sanchez\-Gonzalez, and P\. Battaglia \(2025\)Skillful joint probabilistic weather forecasting from marginals\.External Links:2506\.10772,[Link](https://arxiv.org/abs/2506.10772)Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p2.1),[§4](https://arxiv.org/html/2605.29172#S4.p3.1)\.
- Distributional learning of variational autoencoder: application to synthetic data generation\.36,pp\. 57825–57851\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/b456a00e145ad56f6f251f79f8c8a7de-Paper-Conference.pdf)Cited by:[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p2.1)\.
- E\. Blanchard\-Wrigglesworth, A\. Barthélemy, M\. Chevallier, R\. Cullather, N\. Fučkar, F\. Massonnet, P\. Posey, W\. Wang, J\. Zhang, C\. Ardilouze, C\. M\. Bitz, G\. Vernieres, A\. Wallcraft, and M\. Wang \(2017\)Multi\-model seasonal forecast of arctic sea\-ice: forecast uncertainty at pan\-arctic and regional scales\.Climate Dynamics49\(4\),pp\. 1399–1410\.External Links:[Document](https://dx.doi.org/10.1007/s00382-016-3388-9),[Link](https://doi.org/10.1007/s00382-016-3388-9),ISSN 1432\-0894Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p5.1)\.
- D\. J\. Cavalieri and C\. L\. Parkinson \(2012\)Arctic sea ice variability and trends, 1979&ndash;2010\.The Cryosphere6\(4\),pp\. 881–889\.External Links:[Link](https://tc.copernicus.org/articles/6/881/2012/),[Document](https://dx.doi.org/10.5194/tc-6-881-2012)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p4.1)\.
- K\. Daust and A\. Monahan \(2024\)Capturing climatic variability: using deep learning for stochastic downscaling\.arXiv preprint arXiv:2406\.02587\.Note:Submitted to Artificial Intelligence for the Earth Systems AMS JournalExternal Links:[Document](https://dx.doi.org/10.48550/arXiv.2406.02587)Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p7.2),[§4](https://arxiv.org/html/2605.29172#S4.p2.4)\.
- S\. Dheeshjith, A\. Subel, A\. Adcroft, J\. Busecke, C\. Fernandez\-Granda, S\. Gupta, and L\. Zanna \(2025\)Samudra: an ai global ocean emulator for climate\.Geophysical Research Letters52\(10\),pp\. e2024GL114318\.Note:e2024GL114318 2024GL114318External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1029/2024GL114318),[Link](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2024GL114318),https://agupubs\.onlinelibrary\.wiley\.com/doi/pdf/10\.1029/2024GL114318Cited by:[Appendix A](https://arxiv.org/html/2605.29172#A1.p1.14),[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p1.2)\.
- A\. Dirkson, B\. Denis, M\. Sigmond, and W\. J\. Merryfield \(2021\)Development and calibration of seasonal probabilistic forecasts of ice\-free dates and freeze\-up dates\.Weather and Forecasting36,pp\. 301–324\.External Links:[Document](https://dx.doi.org/10.1175/WAF-D-20-0066.1),[Link](https://doi.org/10.1175/WAF-D-20-0066.1)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p2.1)\.
- G\. Dorta, S\. Vicente, L\. Agapito, N\. D\. F\. Campbell, and I\. Simpson \(2018\)Structured uncertainty prediction networks\.External Links:1802\.07079,[Link](https://arxiv.org/abs/1802.07079)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p6.1),[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p1.1)\.
- T\. S\. Finn, C\. Durand, A\. Farchi, M\. Bocquet, and J\. Brajard \(2024a\)Towards diffusion models for large\-scale sea\-ice modelling\.External Links:2406\.18417,[Link](https://arxiv.org/abs/2406.18417)Cited by:[Appendix A](https://arxiv.org/html/2605.29172#A1.p3.1)\.
- T\. S\. Finn, C\. Durand, A\. Farchi, M\. Bocquet, P\. Rampal, and A\. Carrassi \(2024b\)Generative diffusion for regional surrogate models from sea\-ice simulations\.Journal of Advances in Modeling Earth Systems16\(10\),pp\. e2024MS004395\.Note:e2024MS004395 2024MS004395External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1029/2024MS004395),[Link](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2024MS004395),https://agupubs\.onlinelibrary\.wiley\.com/doi/pdf/10\.1029/2024MS004395Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p1.1)\.
- T\. Gneiting and A\. E\. Raftery \(2007\)Strictly proper scoring rules, prediction, and estimation\.Journal of the American Statistical Association102,pp\. 359–378\.External Links:[Document](https://dx.doi.org/10.1198/016214506000001437),[Link](https://doi.org/10.1198/016214506000001437)Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p2.1)\.
- H\. F\. Goessling, S\. Tietsche, J\. J\. Day, E\. Hawkins, and T\. Jung \(2016\)Predictability of the arctic sea ice edge\.Geophysical Research Letters43\(4\),pp\. 1642–1650\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1002/2015GL067232),[Link](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2015GL067232),https://agupubs\.onlinelibrary\.wiley\.com/doi/pdf/10\.1002/2015GL067232Cited by:[Appendix B](https://arxiv.org/html/2605.29172#A2.p12.1),[§4](https://arxiv.org/html/2605.29172#S4.p3.1)\.
- P\. Gooya, R\. Sospedra\-Alfonso, and J\. Exenberger \(2025\)Toward generative machine learning for boosting ensembles of climate simulations\.External Links:[Link](http://dx.doi.org/10.22541/essoar.175336391.14023985/v1),[Document](https://dx.doi.org/10.22541/essoar.175336391.14023985/v1)Cited by:[Appendix C](https://arxiv.org/html/2605.29172#A3.p2.1),[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p1.1),[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p2.1)\.
- P\. Gooya and R\. Sospedra\-Alfonso \(2025\)Probabilistic bias adjustment of seasonal predictions of arctic sea ice concentration\.InNeurIPS 2025 Workshop on Tackling Climate Change with Machine Learning,External Links:[Link](https://www.climatechange.ai/papers/neurips2025/88)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p6.1),[§6](https://arxiv.org/html/2605.29172#S6.p1.1)\.
- P\. Grönquist, C\. Yao, T\. Ben\-Nun, N\. Dryden, P\. Dueben, S\. Li, and T\. Hoefler \(2021\)Deep learning for post\-processing ensemble weather forecasts\.Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences379\(2194\),pp\. 20200092\.External Links:[Document](https://dx.doi.org/10.1098/rsta.2020.0092),[Link](https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2020.0092),https://royalsocietypublishing\.org/doi/pdf/10\.1098/rsta\.2020\.0092Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p3.1)\.
- L\. Harris, A\. T\. T\. McRae, M\. Chantry, P\. D\. Dueben, and T\. N\. Palmer \(2022\)A generative deep learning approach to stochastic downscaling of precipitation forecasts\.Journal of Advances in Modeling Earth Systems14\(10\),pp\. e2022MS003120\.Note:e2022MS003120 2022MS003120External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1029/2022MS003120),[Link](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2022MS003120),https://agupubs\.onlinelibrary\.wiley\.com/doi/pdf/10\.1029/2022MS003120Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p1.1)\.
- Z\. He, Y\. Wang, J\. Brajard, X\. Wang, and Z\. Shen \(2025\)Improving seasonal arctic sea ice predictions with the combination of machine learning and earth system model\.EGUsphere \[preprint\]\.External Links:[Document](https://dx.doi.org/10.5194/egusphere-2024-4092),[Link](https://doi.org/10.5194/egusphere-2024-4092)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p5.1)\.
- C\. K\. Ho, E\. Hawkins, L\. Shaffrey, J\. Bröcker, L\. Hermanson, J\. M\. Murphy, D\. M\. Smith, and R\. Eade \(2013\)Examining reliability of seasonal to decadal sea surface temperature forecasts: the role of ensemble dispersion\.Geophysical Research Letters40\(21\),pp\. 5770–5775\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1002/2013GL057630),[Link](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2013GL057630),https://agupubs\.onlinelibrary\.wiley\.com/doi/pdf/10\.1002/2013GL057630Cited by:[Appendix B](https://arxiv.org/html/2605.29172#A2.p3.2),[§4](https://arxiv.org/html/2605.29172#S4.p2.4)\.
- M\. Hsieh and C\. Wu \(2024\)Developing an explainable variational autoencoder \(vae\) framework for accurate representation of local circulation in taiwan\.Journal of Geophysical Research: Atmospheres129\(12\),pp\. e2024JD041167\.Note:e2024JD041167 2024JD041167External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1029/2024JD041167),[Link](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2024JD041167),https://agupubs\.onlinelibrary\.wiley\.com/doi/pdf/10\.1029/2024JD041167Cited by:[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p2.1)\.
- V\. V\. Kharin, G\. J\. Boer, W\. J\. Merryfield, J\. F\. Scinocca, and W\.\-S\. Lee \(2012\)Statistical adjustment of decadal predictions in a changing climate\.Geophys\. Res\. Lett\.39\.External Links:[Link](https://doi.org/10.1029/2012GL052647),[Document](https://dx.doi.org/10.1029/2012GL052647)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p3.1)\.
- M\. Kimmritz, F\. Counillon, L\. H\. Smedsrud, I\. Bethke, N\. Keenlyside, F\. Ogawa, and Y\. Wang \(2019\)Impact of ocean and sea ice initialisation on seasonal prediction skill in the arctic\.Journal of Advances in Modeling Earth Systems11\(12\),pp\. 4147–4166\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1029/2019MS001825),[Link](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2019MS001825),https://agupubs\.onlinelibrary\.wiley\.com/doi/pdf/10\.1029/2019MS001825Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p4.1)\.
- D\. P\. Kingma and M\. Welling \(2014\)Auto\-encoding variational bayes\.External Links:[Link](http://arxiv.org/abs/1312.6114v10)Cited by:[§2\.1](https://arxiv.org/html/2605.29172#S2.SS1.p4.4)\.
- D\. Kochkov, J\. Yuval, I\. Langmore, P\. Norgaard, J\. Smith, G\. Mooers, M\. Klöwer, J\. Lottes, S\. Rasp, P\. Düben, S\. Hatfield, P\. Battaglia, A\. Sanchez\-Gonzalez, M\. Willson, M\. P\. Brenner, and S\. Hoyer \(2024\)Neural general circulation models for weather and climate\.Nature632\(8027\),pp\. 1060–1066\(English\)\.Note:© 2024\. The Author\(s\)\.External Links:ISSN 1476\-4687, 0028\-0836,[Document](https://dx.doi.org/10.1038/s41586-024-07744-y),[Link](https://doi.org/10.1038/s41586-024-07744-y)Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p2.1)\.
- A\. Koochali, E\. Tahaei, A\. Dengel, and S\. Ahmed \(2025\)VAEneu: a new avenue for vae application on probabilistic forecasting\.Applied Intelligence55\(7\),pp\. 483\.External Links:ISSN 1573\-7497,[Document](https://dx.doi.org/10.1007/s10489-024-06203-5),[Link](https://doi.org/10.1007/s10489-024-06203-5)Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p3.6)\.
- C\. Lai, P\. Hassanzadeh, A\. Sheshadri, M\. Sonnewald, R\. Ferrari, and V\. Balaji \(2024\)Machine learning for climate physics and simulations\.Annual Review of Condensed Matter Physics\.External Links:ISSN 1947\-5454,[Link](https://www.annualreviews.org/content/journals/10.1146/annurev-conmatphys-043024-114758),[Document](https://dx.doi.org/https%3A//doi.org/10.1146/annurev-conmatphys-043024-114758)Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p1.1)\.
- S\. Lang, M\. Alexe, M\. C\. A\. Clare, C\. Roberts, R\. Adewoyin, Z\. B\. Bouallègue, M\. Chantry, J\. Dramsch, P\. D\. Dueben, S\. Hahner, P\. Maciel, A\. Prieto\-Nemesio, C\. O’Brien, F\. Pinault, J\. Polster, B\. Raoult, S\. Tietsche, and M\. Leutbecher \(2024\)AIFS\-crps: ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score\.External Links:2412\.15832,[Link](https://arxiv.org/abs/2412.15832)Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p2.1)\.
- A\. B\. L\. Larsen, S\. K\. Sønderby, H\. Larochelle, and O\. Winther \(2016\)Autoencoding beyond pixels using a learned similarity metric\.48,pp\. 1558–1566\.External Links:[Link](https://proceedings.mlr.press/v48/larsen16.html),[Document](https://dx.doi.org/10.48550/arXiv.1512.09300)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p6.1),[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p1.1)\.
- L\. Li, R\. Carver, I\. Lopez\-Gomez, F\. Sha, and J\. Anderson \(2024\)Generative emulation of weather forecast ensembles with diffusion models\.Science Advances10\(13\),pp\. eadk4489\.External Links:[Document](https://dx.doi.org/10.1126/sciadv.adk4489),[Link](https://www.science.org/doi/abs/10.1126/sciadv.adk4489),https://www\.science\.org/doi/pdf/10\.1126/sciadv\.adk4489Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p3.1)\.
- H\. Lin, W\. J\. Merryfield, R\. Muncaster, G\. C\. Smith, M\. Markovic, F\. Dupont, F\. Roy, J\. Lemieux, A\. Dirkson, V\. V\. Kharin, W\. Lee, M\. Charron, and A\. Erfani \(2020\)The canadian seasonal to interannual prediction system version 2 \(cansipsv2\)\.Weather and Forecasting35\(4\),pp\. 1317 – 1343\.External Links:[Document](https://dx.doi.org/10.1175/WAF-D-19-0259.1),[Link](https://journals.ametsoc.org/view/journals/wefo/35/4/wafD190259.xml)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p1.1)\.
- G\. Liu, F\. A\. Reda, K\. J\. Shih, T\. Wang, A\. Tao, and B\. Catanzaro \(2018\)Image inpainting for irregular holes using partial convolutions\.External Links:1804\.07723,[Link](https://arxiv.org/abs/1804.07723)Cited by:[Appendix A](https://arxiv.org/html/2605.29172#A1.p1.14),[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p1.2)\.
- Z\. Liu, H\. Mao, C\. Wu, C\. Feichtenhofer, T\. Darrell, and S\. Xie \(2022\)A convnet for the 2020s\.In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Vol\.,pp\. 11966–11976\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52688.2022.01167)Cited by:[Appendix A](https://arxiv.org/html/2605.29172#A1.p1.14),[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p1.2)\.
- W\. Meier, F\. Fetterer, A\. Windnagel, J\. S\. Stewart, and T\. Stafford \(2024\)NOAA/nsidc climate data record of passive microwave sea ice concentration, version 5\.National Snow and Ice Data Center\.Note:\[Date Accessed 02\-26\-2025\.\]External Links:[Document](https://dx.doi.org/10.7265/RJZB-PF78),[Link](https://nsidc.org/data/g02202/versions/5/)Cited by:[§3](https://arxiv.org/html/2605.29172#S3.p2.1)\.
- W\. J\. Merryfield, W\. Lee, G\. J\. Boer, V\. V\. Kharin, J\. F\. Scinocca, G\. M\. Flato, R\. S\. Ajayamohan, J\. C\. Fyfe, Y\. Tang, and S\. Polavarapu \(2013\)The canadian seasonal to interannual prediction system\. part i: models and initialization\.Monthly Weather Review141\(8\),pp\. 2910 – 2945\.External Links:[Document](https://dx.doi.org/10.1175/MWR-D-12-00216.1),[Link](https://journals.ametsoc.org/view/journals/mwre/141/8/mwr-d-12-00216.1.xml)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p1.1)\.
- G\. Mooers, J\. Tuyls, S\. Mandt, M\. Pritchard, and T\. G\. Beucler \(2021\)Generative modeling of atmospheric convection\.pp\. 98–105\.External Links:ISBN 9781450388481,[Link](https://doi.org/10.1145/3429309.3429324),[Document](https://dx.doi.org/10.1145/3429309.3429324)Cited by:[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p2.1)\.
- B\. T\. Nadiga, T\. Verma, W\. Weijer, and N\. M\. Urban \(2019\)Enhancing skill of initialized decadal predictions using a dynamic model of drift\.Geophys\. Res\. Lett\.46,pp\. 9991–9999\.External Links:[Link](https://doi.org/10.1029/2019GL084223),[Document](https://dx.doi.org/10.1029/2019GL084223)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p3.1)\.
- A\. Odena, V\. Dumoulin, and C\. Olah \(2016\)Deconvolution and checkerboard artifacts\.Distill1\(10\),pp\. e3\.External Links:ISSN 2476\-0757,[Document](https://dx.doi.org/10.23915/distill.00003)Cited by:[Appendix A](https://arxiv.org/html/2605.29172#A1.p3.1)\.
- D\. A\. B\. Oliveira, J\. G\. Diaz, B\. Zadrozny, C\. D\. Watson, and X\. X\. Zhu \(2022\)Controlling weather field synthesis using variational autoencoders\.\(\),pp\. 5027–5030\.External Links:[Document](https://dx.doi.org/10.1109/IGARSS46834.2022.9884668)Cited by:[Appendix C](https://arxiv.org/html/2605.29172#A3.p2.1),[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p2.1)\.
- J\. Oskarsson, T\. Landelius, M\. P\. Deisenroth, and F\. Lindsten \(2024\)Probabilistic weather forecasting with hierarchical graph neural networks\.External Links:2406\.04759,[Link](https://arxiv.org/abs/2406.04759)Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p2.1),[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p3.6)\.
- C\. Palerme, T\. Lavergne, J\. Rusin, A\. Melsom, J\. Brajard, A\. F\. Kvanum, A\. Macdonald Sørensen, L\. Bertino, and M\. Müller \(2024\)Improving short\-term sea ice concentration forecasts using deep learning\.The Cryosphere18\(4\),pp\. 2161–2176\.External Links:[Link](https://tc.copernicus.org/articles/18/2161/2024/),[Document](https://dx.doi.org/10.5194/tc-18-2161-2024)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p5.1)\.
- A\. Pasternack, J\. Bhend, M\. A\. Liniger, H\. W\. Rust, W\. A\. Muller, and U\. Ulbrich \(2018\)Parametric decadal climate forecast recalibration \(DeFoReSt 1\.0\)\.Geosci\. Model Dev\.11,pp\. 351–368\.External Links:[Link](https://gmd.copernicus.org/articles/11/351/2018/)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p3.1)\.
- S\. J\.D\. Prince \(2023\)Understanding deep learning\.The MIT Press\.External Links:[Link](http://udlbook.com/)Cited by:[§2\.1](https://arxiv.org/html/2605.29172#S2.SS1.p1.10),[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p2.1)\.
- M\. Sacco, J\. Ruiz, M\. Pulido, and P\. Tandeo \(2022\)Evaluation of Machine Learning Techniques for Forecast Uncertainty Quantification\.Quarterly Journal of the Royal Meteorological Society148\(749\),pp\. 3470–3490\.External Links:[Link](https://imt-atlantique.hal.science/hal-03685523),[Document](https://dx.doi.org/10.1002/qj.4362)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p3.1)\.
- S\. Sankarapandian and B\. Kulis \(2021\)β\\beta\-Annealed variational autoencoder for glitches\.External Links:2107\.10667,[Link](https://arxiv.org/abs/2107.10667)Cited by:[Appendix A](https://arxiv.org/html/2605.29172#A1.p6.4)\.
- M\. C\. Serreze, M\. M\. Holland, and J\. Stroeve \(2007\)Perspectives on the arctic’s shrinking sea\-ice cover\.Science315\(5818\),pp\. 1533–1536\.External Links:[Document](https://dx.doi.org/10.1126/science.1139426),[Link](https://www.science.org/doi/abs/10.1126/science.1139426),https://www\.science\.org/doi/pdf/10\.1126/science\.1139426Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p4.1)\.
- M\. Sigmond, J\. C\. Fyfe, G\. M\. Flato, V\. V\. Kharin, and W\. J\. Merryfield \(2013\)Seasonal forecast skill of arctic sea ice area in a dynamical forecast system\.Geophysical Research Letters40\(3\),pp\. 529–534\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1002/grl.50129),[Link](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/grl.50129),https://agupubs\.onlinelibrary\.wiley\.com/doi/pdf/10\.1002/grl\.50129Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p2.1),[§1](https://arxiv.org/html/2605.29172#S1.p4.1)\.
- M\. Sigmond, M\. C\. Reader, G\. M\. Flato, W\. J\. Merryfield, and A\. Tivy \(2016\)Skillful seasonal forecasts of arctic sea ice retreat and advance dates in a dynamical forecast system\.Geophysical Research Letters43\(24\),pp\. 12,457–12,465\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1002/2016GL071396),[Link](https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2016GL071396),https://agupubs\.onlinelibrary\.wiley\.com/doi/pdf/10\.1002/2016GL071396Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p2.1),[§1](https://arxiv.org/html/2605.29172#S1.p4.1),[§5](https://arxiv.org/html/2605.29172#S5.p5.1)\.
- M\. Sigmond, J\. Anstey, V\. Arora, R\. Digby, N\. Gillett, V\. Kharin, W\. Merryfield, C\. Reader, J\. Scinocca, N\. Swart, J\. Virgin, C\. Abraham, J\. Cole, N\. Lambert, W\. Lee, Y\. Liang, E\. Malinina, L\. Rieger, K\. von Salzen, C\. Seiler, C\. Seinen, A\. Shao, R\. Sospedra\-Alfonso, L\. Wang, and D\. Yang \(2023\)Improvements in the cana dian earth system model \(canesm\) through systematic model analysis: canesm5\.0 and canesm5\.1\.Geosci\. Model Dev\.16,pp\. 6553–6591\.External Links:[Document](https://dx.doi.org/10.5194/gmd-16-6553-2023),[Link](https://doi.org/10.5194/gmd-16-6553-2023)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p1.1)\.
- K\. Sohn, H\. Lee, and X\. Yan \(2015\)Learning structured output representation using deep conditional generative models\.28,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2015/file/8d55a249e6baa5c06772297520da2051-Paper.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.29172#A1.p5.2),[§2\.1](https://arxiv.org/html/2605.29172#S2.SS1.p1.10),[§2\.1](https://arxiv.org/html/2605.29172#S2.SS1.p2.2),[§2\.1](https://arxiv.org/html/2605.29172#S2.SS1.p4.4),[§2\.2](https://arxiv.org/html/2605.29172#S2.SS2.p3.9),[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p1.2)\.
- R\. Sospedra\-Alfonso, W\. J\. Merryfield, V\. V\. Kharin, W\. Lee, H\. Lin, G\. T\. Diro, and R\. Muncaster \(2024\)Evaluation of soil moisture in the canadian seasonal to interannual prediction system version 2\.1 \(cansipsv2\.1\)\.J\. Appl\. Meteorol\. Clim\.63\(1\),pp\. 143–164\.External Links:[Link](https://doi.org/10.1175/JAMC-D-23-0034.1),[Document](https://dx.doi.org/10.1175/JAMC-D-23-0034.1)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p2.1)\.
- R\. Sospedra\-Alfonso, W\. J\. Merryfield, and V\. V\. Kharin \(2016a\)Representation of snow in the canadian seasonal to interannual prediction system: part ii\. potential predictability and hindcast skill\.J\. Hydrometeorol\.17\(9\),pp\. 2511–2535\.External Links:[Link](http://dx.doi.org/10.1175/JHM-D-16-0027.s1),[Document](https://dx.doi.org/10.1175/JHM-D-16-0027.s1)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p2.1)\.
- R\. Sospedra\-Alfonso, L\. Mudryk, W\. J\. Merryfield, and C\. Derksen \(2016b\)Representation of snow in the canadian seasonal to interannual prediction system: part i\. initialization\.J\. Hydrometeorol17\(5\),pp\. 1467–1488\.External Links:[Link](https://doi.org/10.1175/JHM-D-14-0223.1),[Document](https://dx.doi.org/10.1175/JHM-D-14-0223.1)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p2.1)\.
- R\. Sospedra\-Alfonso, P\. Gooya, and J\. Exenberger \(2025\)Adjustment of decadal ocean carbon sink predictions using deep learning\.Artificial Intelligence for the Earth Systems\.External Links:[Document](https://dx.doi.org/10.1175/AIES-D-24-0107.1),[Link](https://doi.org/10.1175/AIES-D-24-0107.1)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p3.1),[§3](https://arxiv.org/html/2605.29172#S3.p1.5)\.
- N\. C\. Swart, J\. N\. S\. Cole, V\. V\. Kharin, M\. Lazare, J\. F\. Scinocca, N\. P\. Gillett, J\. Anstey, V\. Arora, J\. R\. Christian, S\. Hanna, Y\. Jiao, W\. G\. Lee, F\. Majaess, O\. A\. Saenko, C\. Seiler, C\. Seinen, A\. Shao, M\. Sigmond, L\. Solheim, K\. von Salzen, D\. Yang, and B\. Winter \(2019\)The canadian earth system model version 5 \(canesm5\.0\.3\)\.Geoscientific Model Development12\(11\),pp\. 4823–4873\.External Links:[Link](https://gmd.copernicus.org/articles/12/4823/2019/),[Document](https://dx.doi.org/10.5194/gmd-12-4823-2019)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p1.1)\.
- D\. Szwarcman, J\. Guevara, M\. M\. G\. Macedo, B\. Zadrozny, C\. Watson, L\. Rosa, and D\. A\. B\. Oliveira \(2024\)Quantizing reconstruction losses for improving weather data synthesis\.Scientific Reports14\(1\),pp\. 3396\.External Links:[Document](https://dx.doi.org/10.1038/s41598-024-52773-2),[Link](https://doi.org/10.1038/s41598-024-52773-2)Cited by:[§2\.4](https://arxiv.org/html/2605.29172#S2.SS4.p2.1)\.
- K\. Waghmare and J\. Ziegel \(2026\)Proper scoring rules for estimation and forecast evaluation\.Annual Review of Statistics and Its Application13,pp\. 271–296\.External Links:[Document](https://dx.doi.org/10.1146/annurev-statistics-042424-050626),[Link](https://doi.org/10.1146/annurev-statistics-042424-050626)Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p2.1)\.
- P\. M\. Wagner, N\. Hughes, P\. Bourbonnais, J\. Stroeve, L\. Rabenstein, U\. Bhatt, J\. Little, H\. Wiggins, and A\. Fleming \(2020\)Sea\-ice information and forecast needs for industry maritime stakeholders\.Polar Geography43\(2\-3\),pp\. 160–187\.External Links:[Document](https://dx.doi.org/10.1080/1088937X.2020.1766592),[Link](https://doi.org/10.1080/1088937X.2020.1766592),https://doi\.org/10\.1080/1088937X\.2020\.1766592Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p4.1)\.
- C\. Xu, Q\. Wang, and L\. Sun \(2025\)Likelihood\-free variational autoencoders\.External Links:2504\.17622,[Link](https://arxiv.org/abs/2504.17622)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p6.1),[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p1.1)\.
- Y\. Zhang, M\. Bushuk, M\. Winton, B\. Hurlin, T\. Delworth, M\. Harrison, L\. Jia, F\. Lu, A\. Rosati, and X\. Yang \(2022\)Subseasonal\-to\-seasonal arctic sea ice forecast skill improvement from sea ice concentration assimilation\.Journal of Climate35\(13\),pp\. 4233 – 4252\.External Links:[Document](https://dx.doi.org/10.1175/JCLI-D-21-0548.1),[Link](https://journals.ametsoc.org/view/journals/clim/35/13/JCLI-D-21-0548.1.xml)Cited by:[§1](https://arxiv.org/html/2605.29172#S1.p4.1)\.
- X\. Zhong, L\. Chen, H\. Li, J\. Liu, X\. Fan, J\. Feng, K\. Dai, J\. Luo, J\. Wu, and B\. Lu \(2024\)FuXi\-ens: a machine learning model for medium\-range ensemble weather forecasting\.External Links:2405\.05925,[Link](https://arxiv.org/abs/2405.05925)Cited by:[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p2.1),[§2\.3](https://arxiv.org/html/2605.29172#S2.SS3.p3.6)\.

## Appendix AArchitecture and Training

The building blocks of the cVAE model are convolution blocks based on a modified version of ConvNeXt blocks\(Liuet al\.,[2022](https://arxiv.org/html/2605.29172#bib.bib81)\)as used byDheeshjithet al\.\([2025](https://arxiv.org/html/2605.29172#bib.bib80)\)\. We replaced all 2D convolutions with partial convolution\(Liuet al\.,[2018](https://arxiv.org/html/2605.29172#bib.bib87)\)layers\. This is a natural choice for the Arctic region, where there is an irregular land mask with small islands\. The partial convolution layer automatically ignores these regions while processing the data\. The encoder and prior networks follow the same architectures\. The encoder input \(ytly\_\{tl\},x¯tl\\bar\{x\}\_\{tl\}\) and the prior network input \(y~tl\\tilde\{y\}\_\{tl\},x¯tl\\bar\{x\}\_\{tl\}\), wherey~tl\\tilde\{y\}\_\{tl\}is an initial estimate produced by a deterministic network \(details below\)\. In addition, we add three extra conditioning fields, uniformly enteringsin⁡\(2π12\(t\+l\)\)\\sin\(\\frac\{2\\pi\}\{12\}\(t\+l\)\),cos⁡\(2π12\(t\+l\)\)\\cos\(\\frac\{2\\pi\}\{12\}\(t\+l\)\), and\(l12\)\(\\frac\{l\}\{12\}\)as input channels wherettis the initializationmonth at a given yearandllis the lead month\. These networks encode their input into the mean \(μNNϕ\{\\mu\}\_\{NN\_\{\\phi\}\}andμNNω\{\\mu\}\_\{NN\_\{\\omega\}\}\) and variance \(σNNϕ2\{\\sigma\}\_\{NN\_\{\\phi\}\}^\{2\}andσNNω2\{\\sigma\}\_\{NN\_\{\\omega\}\}^\{2\}\) of the latent and prior distributions, respectively\. The architecture is as follows:

∙Input \(5\)→3×3partial convolution \(16\)→Layer normalization \(16\)→DoubleConvNeXt \(32\)\\displaystyle\\bullet\\text\{Input \(5\)\}\\rightarrow\\text\{ $3\\times 3$ partial convolution \(16\)\}\\rightarrow\\text\{ Layer normalization \(16\)\}\\rightarrow\\text\{ DoubleConvNeXt \(32\)\}→MaxPool \(32\)→DoubleConvNeXt \(64\)→MaxPool \(64\)→DoubleConvNeXt \(128\)→\\displaystyle\\rightarrow\\text\{ MaxPool \(32\)\}\\rightarrow\\text\{ DoubleConvNeXt \(64\)\}\\rightarrow\\text\{ MaxPool \(64\)\}\\rightarrow\\text\{ DoubleConvNeXt \(128\)\}\\rightarrowMaxPool \(128\)→DoubleConvNeXt \(256\)→MaxPool \(256\)→DoubleConvNeXt \(256\)→\\displaystyle\\text\{ MaxPool \(128\)\}\\rightarrow\\text\{ DoubleConvNeXt \(256\)\}\\rightarrow\\text\{ MaxPool \(256\)\}\\rightarrow\\text\{ DoubleConvNeXt \(256\)\}\\rightarrowLayer normalization \(256\)→Dense \(2×1000\)\\displaystyle\\text\{ Layer normalization \(256\)\}\\rightarrow\\text\{ Dense \($2\\times 1000$\) \}
The decoder takes samples from the latent space and reverses the operations in the encoder \(prior\) networks using upsampling and double ConvNeXt blocks\. The upsampling blocks are composed of bilinear interpolation to double the resolution of the input, followed by a masked convolution with a 3 × 3 kernel smoothing the interpolated fields\. The combination of interpolation with convolution results in less checkerboard effects compared to a transposed convolution\(Odenaet al\.,[2016](https://arxiv.org/html/2605.29172#bib.bib88)\)\. For the cVAE\-CRPS’s generator, layers of random noise are injected \(concatenated\) before each convolution operation in the ConvNeXt blocks, as well as in the upsampling blocks after the bilinear interpolation and before the 3 × 3 convolution\. Finally, an output block, which is a combination of ”layer normalization, ReLu activation, and1×11\\times 1convolution”, maps the decoder output back to the SIC space\. Like in\(Finnet al\.,[2024a](https://arxiv.org/html/2605.29172#bib.bib89)\), we use ReLu activation before the last convolution to help improve the representation of continuous\-discrete sea\-ice processes\. The decoder is as follows:

∙Latent samples \(1000\)→Dense \(256\)→Upsampling \(256 \+ Noise\)\\displaystyle\\bullet\\text\{Latent samples \(1000\)\}\\rightarrow\\text\{ Dense \(256\)\}\\rightarrow\\text\{ Upsampling \(256 \+ Noise\)\}→DoubleConvNeXt \(128 \+ Noise\)→Upsampling \(128 \+ Noise\)\\displaystyle\\rightarrow\\text\{ DoubleConvNeXt \(128 \+ Noise\)\}\\rightarrow\\text\{ Upsampling \(128 \+ Noise\)\}→DoubleConvNeXt \(64 \+ Noise\)→Upsampling \(64 \+ Noise\)\\displaystyle\\rightarrow\\text\{ DoubleConvNeXt \(64 \+ Noise\)\}\\rightarrow\\text\{ Upsampling \(64 \+ Noise\)\}→DoubleConvNeXt \(32 \+ Noise\)→Upsampling \(32 \+ Noise\)\\displaystyle\\rightarrow\\text\{ DoubleConvNeXt \(32 \+ Noise\)\}\\rightarrow\\text\{ Upsampling \(32 \+ Noise\)\}→DoubleConvNeXt \(16 \+ Noise\)\\displaystyle\\rightarrow\\text\{ DoubleConvNeXt \(16 \+ Noise\)\}→Layer normalization \(16\)→ReLu \(16\)→1×1partial convolution \(1\)\\displaystyle\\rightarrow\\text\{ Layer normalization \(16\)\}\\rightarrow\\text\{ ReLu \(16\) \}\\rightarrow\\text\{ $1\\times 1$ partial convolution \(1\) \}
The deterministic network used to produce the initial estimate of the bias corrected inputy~tl\\tilde\{y\}\_\{tl\}feeding into the encoder and prior networks is composed of the same encoder and decoder architectures without latent space \(i\.e\., by removing the last ”Layer normalization \+ Dense” layers in the encoder, and the first ”Dense” layer in the decoder\) or noise injection\. Importantly, the same output block \(layer normalization \+ ReLu \+ 1×\\times1 convolution\) is shared between the generator \(decoder\) and the deterministic network to encourage realistic outputs from the deterministic model\(Sohnet al\.,[2015](https://arxiv.org/html/2605.29172#bib.bib28)\)\. Lastly, the output of the last DoubleConvNeXt layer of the deterministic model \(which has 16 channels\) is summed to the corresponding layer output in the decoder before passing to the output block\.

We first trained the deterministic model separately using MSE as objective function\. We found that this pretraining improves the probabilistic skill of the cVAE\-CRPS helping it to focus on learning variability around the deterministic average guess\. After the pretraining stage, all network were trained end\-to\-end using Adam optimizer, a cosine decreasing learning rate with a maximum of 0\.0001, and an effective batch size of 4 \(2×\\times2 gradient accumulation steps\)\. The KL divergence and CRPS \(surrogate for log\-likelihood\) terms in the loss function \(Eq\.[4](https://arxiv.org/html/2605.29172#S2.E4)\) were normalized based on their dimensionalities \(1000 for KL and432×304432\\times 304for the CRPS\)\. Given that the dimensionality of the output is anO\(100\)O\(100\)of the latent space’s dimension, the KL term was weighed withβ=0\.01\\beta=0\.01which was annealed linearly from 0 over the first 10 epochs during training\(Sankarapandian and Kulis,[2021](https://arxiv.org/html/2605.29172#bib.bib49)\)\. The loss over the validation set was used as criterion for early stopping with a buffer of 10 epochs\.

We found that the spectral energy is sensitive to noise injection in the decoder’s upsampling blocks on spatial scales≲100\\lesssim 100km, especially for predictions in the summer target months\. Specifically, excess energy \(ratios\>1\>1in Fig\.[2](https://arxiv.org/html/2605.29172#S5.F2)\) is found for September minimum over the validation period when noise is injected at every spatial resolution\. We noticed that removing the noise injection layer over the smallest spatial scale \(i\.e\., after the last upsampling block\) improves performance\. However, to avoid enforcing the spatial scale over which the stochasticity is introduced, we maintained all noise injection layers\. Instead, we added a pooled CRPS term to the loss function, i\.e\., in addition to optimizing CRPS in each grid cell, we calculated a 2×\\times2 average\-pooled CRPS and computed the mean of both terms\. The CRPS is defined as:

CRPS\(y,Y^M\)=1M∑k=1M\|y^k−y\|−12M2∑k=1M∑k′=1M\|y^k−y^k′\|\\mathrm\{CRPS\}\(y,\\hat\{Y\}^\{M\}\)=\\frac\{1\}\{M\}\\sum^\{M\}\_\{k=1\}\|\\hat\{y\}\_\{k\}\-y\|\-\\frac\{1\}\{2M^\{2\}\}\\sum^\{M\}\_\{k=1\}\\sum^\{M\}\_\{k^\{\\prime\}=1\}\|\\hat\{y\}\_\{k\}\-\\hat\{y\}\_\{k^\{\\prime\}\}\|\(7\)
where y is the target observation andy^k\\hat\{y\}\_\{k\}is thekk\-th ensemble member of the cVAE\-CRPSMM\-member output ensembleY^M\\hat\{Y\}^\{M\}\. During training, the output ensemble corresponds toGθM\(ztln,x¯tl\)G^\{M\}\_\{\\theta\}\(z^\{n\}\_\{tl\},\\bar\{x\}\_\{tl\}\)for eachztlnz^\{n\}\_\{tl\}sampled from the posteriorqϕ\(z\|ytl,x¯tl\)q\_\{\\phi\}\(z\|y\_\{tl\},\\bar\{x\}\_\{tl\}\), withM=10M=10\(Eq\.[4](https://arxiv.org/html/2605.29172#S2.E4)\)\. During evaluation, it corresponds toGθM\(ztln,x¯tl\)G^\{M\}\_\{\\theta\}\(z^\{n\}\_\{tl\},\\bar\{x\}\_\{tl\}\)for eachztlnz^\{n\}\_\{tl\}sampled from the priorpω\(z\|x¯tl\)p\_\{\\omega\}\(\{z\}\|\{\\bar\{x\}\_\{tl\}\}\), withMMarbitrary large\.

## Appendix BEvaluation Metrics

Rank histograms are produced by ranking the verification data \(Obs\) over the predicted ensemble at each grid cell, sorting the ensemble members in ascending order and ranking the observational target in that sequence\. We evaluate the ranks only across grid cells of marginal sea ice, defined as0\.15≤SIC≤0\.90\.15\\leq\\hbox\{SIC\}\\leq 0\.9\. This choice is made to avoid results dominated by the many easy\-to\-predict grid cells with 0 or 1 values, corresponding to open ocean or fully ice\-covered regions, respectively\. For each lead month, we compute CDFs by pulling the rankings of all grid cells of marginal ice cover and of all initialization months\.

QQ plots comparing quantiles of the adjusted ensembles \(Nadj or Badj\) with observations, are constructed, for each lead month, from distributions formed by pooling all grid cells of marginal sea ice cover, all ensemble members, and all initialization months\.

SOE ratios forMM\-member adjusted forecast ensemblesY^M\\hat\{Y\}^\{M\}are computed as\(Hoet al\.,[2013](https://arxiv.org/html/2605.29172#bib.bib86)\):

SOE=M\+1MσY^M2MSE\\displaystyle\\quad\\text\{SOE\}=\\sqrt\{\\frac\{M\+1\}\{M\}\\frac\{\\sigma^\{2\}\_\{\\hat\{Y\}^\{M\}\}\}\{\\hbox\{MSE\}\}\}\(8\)
where MSE denotes the mean square error of the ensemble mean andσY^M2\\sigma^\{2\}\_\{\\hat\{Y\}^\{M\}\}the variance of the ensemble around the ensemble mean, averaged over initialization times\. This calculation is done at each grid cell and the area weighted average is reported at each lead month\. RMSE and standard deviation are calculated as square root of MSE and square root of variance above before averaging, both of which are reported as spatial averages\. CRPS is calculated using Eq\.[7](https://arxiv.org/html/2605.29172#A1.E7)and reported following the same averaging protocol as standard deviation\.

For area integrated measures, the total sea ice area \(SIA\) is:

SIAtl=∫ArcticSICtlda\\displaystyle\\quad\\text\{SIA\}\_\{tl\}=\\int\_\{Arctic\}\\text\{SIC\}\_\{tl\}\\quad da\(9\)
where SIC denotes sea ice cover for observationsytly\_\{tl\}or the ensemble meany^¯tl\\bar\{\\hat\{y\}\}\_\{tl\}of Nadj or Badj\.

Similarly, sea ice extent \(SIE\) is computed as an area integral of grid cells with SIC\>0\.15\>0\.15:

SIEtl=∫ArcticI\(SICtl\>0\.15\)da\\displaystyle\\quad\\text\{SIE\}\_\{tl\}=\\int\_\{Arctic\}I\(\\text\{SIC\}\_\{tl\}\>15\)\\quad da\(10\)
whereI\(\.\)I\(\.\)denotes the indicator function\.

IIEE, which captures the differences along the ice edges, quantifies the area where the predicted and true ice concentrations differ\(Goesslinget al\.,[2016](https://arxiv.org/html/2605.29172#bib.bib85)\):

IIEEtl=∫S≥50∘N\|I\(ytl\>0\.15\)−I\(y^¯tl\>0\.15\)\|da\\displaystyle\\quad\\text\{IIEE\}\_\{tl\}=\\int\_\{S\_\{\\geq 50^\{\\circ\}N\}\}\|I\(y\_\{tl\}\>15\)\-I\(\\bar\{\\hat\{y\}\}\_\{tl\}\>15\)\|\\quad da\(11\)
SIA, SIE and IIEE, as well as pattern correlations, are all calculated at each lead month and initialization time, then averaged over initialization times\. Finally, for both the SIE and SIA estimates, anomalies relative to monthly climatology are used to calculate the ACC, defined for each lead month as the Pearson correlation coefficient across the initialization time dimension\.

## Appendix CScaling the standard deviation of the prior distribution

Figures[A1](https://arxiv.org/html/2605.29172#A3.F1)and[A2](https://arxiv.org/html/2605.29172#A3.F2)show results analogous to those of Fig\.[1](https://arxiv.org/html/2605.29172#S5.F1), except that the standard deviation of the prior distribution for the cVAE\-CRPS model has not been scaled\. Even in this case, the cVAE\-CRPS model is superior to Badj in terms of CDFs, QQ plots \(i\.e\., the distribution of SIC over grid cells of marginal sea ice cover\) and deterministic performance metrics such as RMSEs, IIEE, and pattern correlation\. Nevertheless, the generated ensemble is underdispersive or overconfident \(e\.g\. SOE≪1\\ll 1in Fig\.[A1](https://arxiv.org/html/2605.29172#A3.F1)b\), which can be improved with increasing ensemble size \(e\.g\. 200\-member ensemble in Fig\.[A1](https://arxiv.org/html/2605.29172#A3.F1)b\) and improved further with proper scaling\.

We scaled the prior’s standard deviation at inference time by comparison of the spatially averaged RMSE and the standard deviation for a sufficiently large ensemble over the validation set\. This enabled sampling a wider range of internal variability at inference time\(Oliveiraet al\.,[2022](https://arxiv.org/html/2605.29172#bib.bib32); Gooyaet al\.,[2025](https://arxiv.org/html/2605.29172#bib.bib84)\)\. With a proper scaling factor \(3\.25 based on the validation period of2016−20192016\-2019\), the corrected ensemble was shown to be reliable and well\-calibrated\. In selecting a criterion for determining the scaling factor, care was taken to avoid excessively widening the sampling space, which could lead to undercondifent ensembles, reduced prediction skill, or the generation of unrealistic output fields \(hallucination\)\. We found that the spatially averaged error to spread ratio used here, taken over the full domain and validation period, provides a reasonable criterion and avoids overfitting to particular metrics or regions\.

![Refer to caption](https://arxiv.org/html/2605.29172v1/x5.png)Figure A1:Same as Figure 1 but for the cVAE without scaling the standard deviation of the prior distributions at inference time\. a\) CDF of rank histograms of the Nadj/Badj versus lead months measured at marginal ice grid cells\. b\) SOE versus lead month showing reliability\. c\) QQ plots at three lead months comparing the distribution of SIC at marginal ice grid cells with obs\. d\) RMSE \(solid\) over initialization time between the ensemble mean Nadj/Badj compared to Obs at grid cells level averaged over the entire region\. The dashed line shows the global mean ensemble spread averaged over initialization time\. e\) For each lead month, RMSE of SIA \(solid line\) and SIE \(dashed line\) over initialization time, and average IIEE \(dotted line\) over intialization time is compared between ensemble mean Nadj/Badj and obs\. f\) same as \(e\) but for pattern correlation relative to obs\.![Refer to caption](https://arxiv.org/html/2605.29172v1/x6.png)Figure A2:Same as Figure 2 but for the cVAE without scaling the standard deviation of the prior distributions at inference time\. a\) Average RMSE \(solid\) between the ensemble mean Nadj/Badj compared to observation at each grid cell\. The dashed line shows the average mean ensemble spread calculated in the same manner\. b\) For each lead month, RMSE of SIA \(solid line\) and SIE \(dashed line\) over initialization time, and average IIEE \(dotted line\) over initialization time is compared between ensemble mean Nadj/Badj and obs\. c\) ACC in SIE and SIA, as well as pattern correlation relative to Obs over initialization times as a function of lead month\. RAPSD results are not shown due to similarity\.![Refer to caption](https://arxiv.org/html/2605.29172v1/x7.png)Figure A3:RASD ratio Nadj and Nadj\_mse to observation for predictions of March maximum \(a, c\) and September minimum \(b,d\) on lead month 1 \(a, b\) and 12 \(c, d\) over2019−20232019\-2023\(5 years\) and across ensemble members \(5 x 10 members\)\. The shading shows the range associated with different members\. Please note the y\-axis limits are different in panels \(a, c\) versus \(b, d\)\. To acquire the Nadj\_mse ensemble, for every latent sample from the scaled prior distribution, the decoder meanNNθ\(z,x¯tl\)NN\_\{\\theta\}\(\{z\},\{\\bar\{x\}\_\{tl\}\}\)in Eq\. 2 is used assuming negligible decoder noise\.![Refer to caption](https://arxiv.org/html/2605.29172v1/x8.png)Figure A4:Nadj \(first row\), Nadj\_mse \(second row\), and observation \(third row\) SIC maps for target time of20232023March maximum \(t3 in first and second columns\), and20232023September minimum \(t9 in third and forth columns\) on lead months 1 \(l1\), and 12 \(l12\)\. For the Nadj/Nadj\_mse ensembles, a random ensemble member is chosen\. The land is removed for better visibility of ice edges\. To acquire the Nadj\_mse ensemble, for every latent sample from the scaled prior distribution, the decoder meanNNθ\(z,x¯tl\)NN\_\{\\theta\}\(\{z\},\{\\bar\{x\}\_\{tl\}\}\)in Eq\. 2 is used assuming negligible decoder noise\.![Refer to caption](https://arxiv.org/html/2605.29172v1/x9.png)Figure A5:Comparison of CRPS \(a\) and SOE \(b\) between Nadj and Nadj\_mse \(10 members\)\.![Refer to caption](https://arxiv.org/html/2605.29172v1/x10.png)Figure A6:\(a\) RASD ratio Nadj/Badj to observation for predictions of March maximum on lead month 12 over2019−20232019\-2023\(5 years\) and across ensemble members \(5 x 10 members\)\. The shading shows the range associated with different members\. \(b\) same as \(a\) but for September minimum\. Please note the y\-axis limits are different in panels \(a\) versus \(b\)\.![Refer to caption](https://arxiv.org/html/2605.29172v1/Figure_S7.png)Figure A7:Badj/Nadj versus observation SOE in integrated SIE\. The variability in SIE across ensemble members is due to differences in estimation of the edge of ice separated from open ocean by grid cells withSIC≥0\.15SIC\\geq 0\.15\. Thus, SOE shows how reliable the ensemble is in predicting the ice edge compared to the observation\. The Badj is overconfident in predicting the edge of the ice as reflected in the variability patterns in Fig\. 5, and also more biased as seen in Fig\. 2\.
Probabilistic bias adjustment of seasonal forecasts using generative machine learning: A case study of Arctic sea ice predictions

Similar Articles

AdaWeather: Adaptively Mixing Probabilistic Weather Forecasts with Logarithmic Regret

On the quantitative analysis of decoder-based generative models

SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction

A Quantum Inspired Variational Kernel and Explainable AI Framework for Cross Region Solar and Wind Energy Forecasting

CF-JEPA: Mask-free forward prediction with asymmetric encoder utilization for time-series representation learning

Submit Feedback

Similar Articles

AdaWeather: Adaptively Mixing Probabilistic Weather Forecasts with Logarithmic Regret
On the quantitative analysis of decoder-based generative models
SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction
A Quantum Inspired Variational Kernel and Explainable AI Framework for Cross Region Solar and Wind Energy Forecasting
CF-JEPA: Mask-free forward prediction with asymmetric encoder utilization for time-series representation learning