Njord: A Probabilistic Graph Neural Network for Ensemble Ocean Forecasting
Summary
Njord is a probabilistic graph neural network for ensemble ocean forecasting that provides uncertainty estimates and achieves state-of-the-art performance on global and regional benchmarks, improving surface temperature prediction.
View Cached Full Text
Cached at: 05/18/26, 06:41 AM
# Njord: A Probabilistic Graph Neural Network for Ensemble Ocean Forecasting
Source: [https://arxiv.org/html/2605.15470](https://arxiv.org/html/2605.15470)
Daniel Holmberg University of Helsinki daniel\.holmberg@helsinki\.fi &Joel Oskarsson ETH AI Center joel\.oskarsson@outlook\.com Erik Larsson Linköping University erik\.larsson@liu\.se &Fredrik Lindsten Linköping University fredrik\.lindsten@liu\.se &Teemu Roos University of Helsinki teemu\.roos@helsinki\.fi
###### Abstract
Ocean dynamics are inherently chaotic, yet existing machine learning ocean models produce only deterministic forecasts\. We introduce*Njord*, a probabilistic data\-driven model for ocean forecasting, applicable to both global and regional domains\. Njord combines a deep latent variable framework with a graph neural network architecture, enabling sampling each forecast step in a single forward pass\. We apply Njord globally at0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution and regionally to the Baltic Sea at2km/2\\text\{\\,\}\\mathrm\{km\}\\text\{/\}resolution\. To scale to these large ocean grids we introduce K\-means cluster meshes that adapt to irregular sea surface geometry\. Experiments demonstrate strong performance on both domains compared to deterministic machine learning baselines, while also providing uncertainty estimates from the sampled ensemble forecasts\. On the global OceanBench benchmark, Njord achieves the lowest errors on average across upper\-ocean variables when evaluated against real\-world observations, with the largest improvements in surface temperature prediction\.
## 1Introduction
Accurate ocean forecasting is essential for a wide range of applications, from maritime navigation and fisheries management to coastal hazard mitigation and environmental monitoring\[[33](https://arxiv.org/html/2605.15470#bib.bib9)\]\. While numerical ocean models have long served as the backbone of operational forecasting, they are computationally expensive and require substantial infrastructure to run at the resolutions needed in operational global\[[35](https://arxiv.org/html/2605.15470#bib.bib30)\]and regional applications\[[27](https://arxiv.org/html/2605.15470#bib.bib13)\], taking on the order of hours on CPU clusters\[[14](https://arxiv.org/html/2605.15470#bib.bib18),[22](https://arxiv.org/html/2605.15470#bib.bib5)\]\.
Recent advances have demonstrated that machine learning ocean models can match or even surpass the accuracy of physics\-based systems at a fraction of the computational cost\[[14](https://arxiv.org/html/2605.15470#bib.bib18),[9](https://arxiv.org/html/2605.15470#bib.bib19),[46](https://arxiv.org/html/2605.15470#bib.bib20),[25](https://arxiv.org/html/2605.15470#bib.bib21)\]
Figure 1:Njord\.at global short\-range \(1–10 days\) timescales\. These models are however, deterministic: they produce a single trajectory and are typically trained with mean squared error, which encourages predictions toward the conditional mean of the future state rather than capturing the full predictive distribution\. Consequently, they tend to smooth over fine\-scale variance and offer limited insight into the probability of extreme events\. For operational decision\-making, where risk mitigation relies on understanding forecast confidence and the full spectrum of possible scenarios, a single trajectory is often insufficient\. Probabilistic forecasting adds critical value here by modeling distributions over future states, allowing for the generation of ensembles that capture dynamic variability and quantify uncertainty\. In the atmospheric domain, conditional generative models\[[39](https://arxiv.org/html/2605.15470#bib.bib2),[43](https://arxiv.org/html/2605.15470#bib.bib34)\]have emerged as a promising framework for generating skillful and spatially coherent ensemble forecasts, but no comparable probabilistic data\-driven model exists for the ocean\.
In this work we propose, to our knowledge, the first probabilistic data\-driven model for short\-range high\-resolution global and regional ocean forecasting\. The model uses a latent variable framework built on hierarchicalGraph Neural Networks \(GNNs\)\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\], and produces calibrated ensemble forecasts of the depth\-resolved ocean state\. In a regional setting the model also conditions on boundary data from an independent global ocean model, where previous such emulators either lack boundary forcing\[[7](https://arxiv.org/html/2605.15470#bib.bib17)\], or depend on data from the very system they aim to replace at the boundary\[[22](https://arxiv.org/html/2605.15470#bib.bib5)\]\.
#### Our contributions are:
1. 1\.We introduce*Njord*111In old Norse mythology, Njord is a god of the sea\. See[Figure1](https://arxiv.org/html/2605.15470#S1.F1)\., the first generative ensemble forecasting model for global ocean physics, operating at0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution\.
2. 2\.Njord employs aGNNarchitecture using a new clustering\-based graph layout, which better conforms to the irregular geometry of the ocean surface\.
3. 3\.In addition to existing variables included in high\-resolution machine learning ocean models, we incorporate sea ice\. Sea ice is an integral component of ocean physics simulations, but requires additional constraints to ensure physically realistic fields\.
4. 4\.We follow the OceanBench\[[13](https://arxiv.org/html/2605.15470#bib.bib24)\]evaluation for global ocean emulators, and show that Njord achieves competitive errors compared with state of the art models, while adding information about uncertainties through the ensemble approach\. For key surface variables Njord achieves the lowest error both compared to analysis data and direct observations\.
5. 5\.We further demonstrate that the same framework can be applied to regional ocean modeling by constructing Njord\-Baltic for the Baltic Sea at2km/2\\text\{\\,\}\\mathrm\{km\}\\text\{/\}resolution\. Njord\-Baltic achieves errors comparable to a deterministic baseline while also providing probabilistic forecasts\.
## 2Related work
#### Ocean emulators\.
At medium\-range timescales, data\-driven global ocean models\[[46](https://arxiv.org/html/2605.15470#bib.bib20),[14](https://arxiv.org/html/2605.15470#bib.bib18),[9](https://arxiv.org/html/2605.15470#bib.bib19),[25](https://arxiv.org/html/2605.15470#bib.bib21)\]have demonstrated good performance for deterministic forecasting\. Regional approaches have also shown promising results: OceanNet\[[7](https://arxiv.org/html/2605.15470#bib.bib17)\]is able to outperform a dynamical ocean model in predictingSea Surface Height \(SSH\), and SeaCast\[[22](https://arxiv.org/html/2605.15470#bib.bib5)\]is more skillful at forecasting the Mediterranean Sea at4km/4\\text\{\\,\}\\mathrm\{km\}\\text\{/\}resolution than the operational numerical model\. Beyond medium\-range forecasting, deep learning has been applied to seasonal ocean forecasting\[[3](https://arxiv.org/html/2605.15470#bib.bib40),[45](https://arxiv.org/html/2605.15470#bib.bib33)\], and climate prediction\[[10](https://arxiv.org/html/2605.15470#bib.bib39)\]\. Despite these advancements, ensemble ocean forecasting with data\-driven models is largely unexplored\. Recently, FuXi\-ONS\[[26](https://arxiv.org/html/2605.15470#bib.bib25)\]extended global ocean emulation to the ensemble setting at a1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}grid spacing and 5 day intervals by adding learned perturbations on top of a deterministic core\. This model is not included in our comparison as it focuses on different timescales and resolutions than our setting, and is also not trained to match the distribution of future states\.
#### Probabilistic weather forecasting\.
Our approach is heavily inspired by advancements in data\-driven ensemble weather forecasting\. GenCast\[[43](https://arxiv.org/html/2605.15470#bib.bib34)\]introduced diffusion\-based ensemble forecasting for medium\-range weather\. Diffusion models are effective at capturing high\-resolution details\[[42](https://arxiv.org/html/2605.15470#bib.bib49),[32](https://arxiv.org/html/2605.15470#bib.bib1)\], but are computationally demanding, as generating a single forecast requires many sequential forward passes\. More recently, training generative forecasting models using theContinuous Ranked Probability Score \(CRPS\)as training objective has been a popular approach both at global\[[30](https://arxiv.org/html/2605.15470#bib.bib23),[41](https://arxiv.org/html/2605.15470#bib.bib37),[6](https://arxiv.org/html/2605.15470#bib.bib38),[2](https://arxiv.org/html/2605.15470#bib.bib22)\]and regional scales\[[31](https://arxiv.org/html/2605.15470#bib.bib41),[38](https://arxiv.org/html/2605.15470#bib.bib50)\]\. These approaches differ primarily in how stochasticity is introduced and in how theCRPSis estimated\. TheCRPS\-based models are at least an order of magnitude more efficient during inference than diffusion\-based approaches\. Our work builds on the Graph\-EFM\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\]latent variable model, which uses a combination of variational training and theCRPS\-based objective\. Graph\-EFM is purely a weather model, and we extend the approach to the ocean domain through architectural modifications that enable efficient modeling on substantially larger and more irregular ocean grids\.
Figure 2:One\-step prediction in the Njord model\. Residuals are predicted at timett, which are then added to the previous stateXt−1X^\{t\-1\}in order to produce the sampleX^t\\hat\{X\}^\{t\}\. The corresponding overview of Njord\-Baltic is shown in[Figure˜12](https://arxiv.org/html/2605.15470#A3.F12)in the appendix\.
## 3Background
### 3\.1Problem formulation
We are tackling the forecasting task of mapping a sequence of initial ocean statesX−p\+1:0=\(X−p\+1,…,X0\)X^\{\-p\+1:0\}=\(X^\{\-p\+1\},\\ldots,X^\{0\}\), whereppis the number of past steps, to a sequence of future statesX1:T=\(X1,…,XT\)X^\{1:T\}=\(X^\{1\},\\ldots,X^\{T\}\)\. Each stateXt∈ℝN×dxX^\{t\}\\in\\mathbb\{R\}^\{N\\times d\_\{x\}\}containsdxd\_\{x\}ocean variables atNNdifferent grid locations\. The state variables include quantities modeled at multiple vertical depth levels and surface\-level variables\. In addition to the initial states, the model receives additional forcing inputs on the sameNNgrid points\. This forcingF−p\+1:TF^\{\-p\+1:T\}includes:1\)known dynamic factors, such as the day of year,2\)static features, e\.g\. depth, and3\)atmospheric forcing at the surface, given by a weather model\.The atmospheric forcing includes relevant fields related to wind, pressure and temperature, that drive ocean surface dynamics\. In[Appendix˜B](https://arxiv.org/html/2605.15470#A2)we provide complete lists of all state and forcing variables\.
For ensemble forecasting we are specifically interested in a probabilistic mapping from initial states to forecasts\. We thus aim to model a distributionp\(X1:T∣X−p\+1:0,F−p\+1:T\)p\(X^\{1:T\}\\mid X^\{\-p\+1:0\},F^\{\-p\+1:T\}\)describing the possible future ocean states\. Under the Markov assumption of states only depending onppprevious ones, this distribution decomposes over time and it is sufficient to build a model forp\(Xt∣Xt−p:t−1,F−p\+1:T\)p\(X^\{t\}\\mid X^\{t\-p:t\-1\},F^\{\-p\+1:T\}\)\. Ensemble members, samples ofX1:TX^\{1:T\}, can then be produced by sequentially samplingXtX^\{t\}from this distribution, and conditioning on the sampled value for the next time step\.
### 3\.2Ensemble forecasting with latent variable models
The ensemble forecasting problem outlined above can be approached by training a deep generative model to approximately produce samples fromp\(Xt∣Xt−p:t−1,F−p\+1:T\)p\\mathopen\{\}\\mathclose\{\{\\left\(X^\{t\}\\mid X^\{t\-p:t\-1\},F^\{\-p\+1:T\}\}\}\\right\)\. One family of such models are deep latent variable models, similar to conditional variational auto\-encoders\[[44](https://arxiv.org/html/2605.15470#bib.bib45)\], which learn to map from conditioning inputs and a latent random variableZtZ^\{t\}to samples from a distribution\. Typically the set of conditioning inputs is restricted to only a subset of forcing and previous states\. In our setting such a model is comprised of a neural networkfθf\_\{\\theta\}realizing the mapping
X^t=fθ\(Xt−2:t−1,Ft−2:t,Zt\),Zt∼𝒩\(μt,I\)\\hat\{X\}^\{t\}=f\_\{\\theta\}\\mathopen\{\}\\mathclose\{\{\\left\(X^\{t\-2:t\-1\},F^\{t\-2:t\},Z^\{t\}\}\}\\right\),\\qquad Z^\{t\}\\sim\\mathcal\{N\}\\mathopen\{\}\\mathclose\{\{\\left\(\\mu^\{t\},I\}\}\\right\)\(1\)whereZtZ^\{t\}is chosen to be an isotropic Gaussian\. This sampling implicitly specifies the model distributionp^θ\(Xt∣Xt−2:t−1,Ft−2:t\)\\hat\{p\}\_\{\\theta\}\\mathopen\{\}\\mathclose\{\{\\left\(X^\{t\}\\mid X^\{t\-2:t\-1\},F^\{t\-2:t\}\}\}\\right\)\. As sampling in these models only requires a single forward\-pass withfθf\_\{\\theta\}, it is far more computationally efficient than the iterative sampling of diffusion models\[[43](https://arxiv.org/html/2605.15470#bib.bib34),[20](https://arxiv.org/html/2605.15470#bib.bib46),[16](https://arxiv.org/html/2605.15470#bib.bib51),[32](https://arxiv.org/html/2605.15470#bib.bib1)\]\. Recently there has been an increasing interest for using deep latent\-variable models in earth system modeling\[[24](https://arxiv.org/html/2605.15470#bib.bib48),[39](https://arxiv.org/html/2605.15470#bib.bib2),[30](https://arxiv.org/html/2605.15470#bib.bib23),[2](https://arxiv.org/html/2605.15470#bib.bib22),[6](https://arxiv.org/html/2605.15470#bib.bib38)\]\.
In this work we build on the Graph\-EFM\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\]latent variable model, previously used for weather forecasting\. Graph\-EFM uses a constructed mesh graph andGNNlayers\[[4](https://arxiv.org/html/2605.15470#bib.bib3)\]in order to capture spatial relationships over the forecasted area\. The neural networkfθf\_\{\\theta\}is implemented as aGNNfollowing the encode\-process\-decode architecture\. Gridded inputs are first mapped to latent representations on the mesh graph\. The processor part of the model then consists of a hierarchy ofGNNlayers operating over different spatial resolutions, withZtZ^\{t\}being integrated at the coarsest level by adding it to the latent representation\. AsGNNs map from this coarse representation down through the hierarchy, and finally decodes back to the original grid points, the stochasticity fromZtZ^\{t\}can affect all outputs of the model\. These outputs are then added to the previous stateXt−1X^\{t\-1\}through a residual connection, finally producing the random sampleX^t\\hat\{X\}^\{t\}\. In Graph\-EFMZtZ^\{t\}is additionally sampled from a prior distribution with a learned mean, asμt=gθ\(Xt−2:t−1,Ft−2:t\)\\mu^\{t\}=g\_\{\\theta\}\\mathopen\{\}\\mathclose\{\{\\left\(X^\{t\-2:t\-1\},F^\{t\-2:t\}\}\}\\right\)\. This prior mapping is realized as another hierarchicalGNN\. The model is trained by optimizing theEvidence Lower Bound \(ELBO\), which also requires training a separate encoder network forZtZ^\{t\}\. This encoder has a similar structure to the prior, but is only required as an auxiliary module for training, and not for forecasting\. For further details about Graph\-EFM see[Appendix˜A](https://arxiv.org/html/2605.15470#A1)andOskarssonet al\.\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\]\.
At inference time, each ensemble member is sampled at time stepttby1\)drawing a sampleZt∼𝒩\(μt,I\)Z^\{t\}\\sim\\mathcal\{N\}\\mathopen\{\}\\mathclose\{\{\\left\(\\mu^\{t\},I\}\}\\right\), and2\)propagating this sample throughfθf\_\{\\theta\}together with conditioning variables,in order to produceX^t\\hat\{X\}^\{t\}\. This is repeated auto\-regressively acrossTTtime steps, conditioning each step on previously sampled states, to obtain full state trajectories\. A complete ensemble is generated by repeating this with independent samples, which is fully parallelizable across members\.
## 4Njord: A graph\-based probabilistic ocean model
To tackle the problem of probabilistic ocean forecasting, we follow the latent variable approach, providing efficient ensemble sampling and model training that directly targets the distribution of future ocean states\. We introduce Njord, a graph\-based forecasting model for the global Ocean, and Njord\-Baltic, a regional adaptation for the Baltic Sea \([Section˜4\.5](https://arxiv.org/html/2605.15470#S4.SS5)\)\. Njord extends the hierarchical GNN architecture of Graph\-EFM\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\]with several key modifications tailored to high\-resolution ocean modeling\. Graph\-based modeling is appealing for ocean forecasting due to the inherent ability to work with irregular grids\. As opposed to methods that work with regular latitude\-longitude grids, Njord only operates over the actual ocean grid points\. This avoids wasting memory and compute on updating latent representations located over land, where there are no ocean dynamics to model\.
Njord instantiates the latent variable framework from[Section˜3\.2](https://arxiv.org/html/2605.15470#S3.SS2), and we make the specific choice of including two previous statesXt−2:t−1X^\{t\-2:t\-1\}and both past and future forcingFt−2:tF^\{t\-2:t\}as gridded inputs to the networksfθf\_\{\\theta\}andgθg\_\{\\theta\}\. Including multiple past states helps the model capture higher\-order dynamics, and forcing information across time is generally useful for fields with clear relationships to the atmosphere \(see[Section˜E\.1](https://arxiv.org/html/2605.15470#A5.SS1)for an ablation\)\. These gridded outputs are mapped to our hierarchical ocean graph, whereZtZ^\{t\}is integrated in the latent state, providing the stochasticity for our probabilistic ocean forecast\. An overview of the method is shown in[Figure˜2](https://arxiv.org/html/2605.15470#S2.F2)\.
\(a\)Icosahedral
\(b\)Cluster
Figure 3:Example of graph node placement in the Red Sea\.### 4\.1A graph adapted to ocean geometry
Graph\-based global weather forecasting models use icosahedral meshes\[[28](https://arxiv.org/html/2605.15470#bib.bib10),[39](https://arxiv.org/html/2605.15470#bib.bib2),[29](https://arxiv.org/html/2605.15470#bib.bib52)\]for constructing the spatial graph that the model operates over\. These meshes are constructed by iteratively subdividing an icosahedron, with each subdivision quadrupling the number of nodes and edges\[[28](https://arxiv.org/html/2605.15470#bib.bib10)\]\. As the size of the graph heavily impacts memory requirements, choosing the number of nodes and edges is a crucial choice in practice\. However, icosahedral refinement only allows discrete subdivision levels, limiting the available graph sizes\. Additionally, while the icosahedral shape is well\-motivated for the atmosphere surrounding Earth, it is less suitable for modeling the ocean surface\. Masking can be applied to only include nodes over the ocean, but without any adaptation to node placement this risks breaking apart large and important parts of the icosahedral graph connectivity\.
To construct a graph better adapted to the geometry of the global ocean we instead place the graph nodes based on the density of ocean grid points\. We apply spherical K\-means clustering of the ocean grid point 3D Cartesian coordinates, with latitude\-based area weights to ensure equitable spatial coverage\. As shown in[Figure˜3](https://arxiv.org/html/2605.15470#S4.F3), this leads to more evenly spaced nodes and avoids missing important bays, straits and canals \(see[Section˜C\.1](https://arxiv.org/html/2605.15470#A3.SS1)for more examples and[Section˜E\.2](https://arxiv.org/html/2605.15470#A5.SS2)for an empirical model comparison\)\. The clustering based approach can be used to place an arbitrary number of nodes, equal to the number of clusters\. Edges are then constructed via spherical Delaunay triangulation, followed by filtering of edges crossing land masses\. The procedure is repeated multiple times with a decreasing number of clusters, resulting in ocean graphs of increasing coarseness\. These then make up the hierarchy used by theGNNlayers in the model\. As the graph and grid points in Njord do not cover land, we instead add the minimum distance to the coastline as a static input feature in each grid point, informing the model where the ocean surface ends\. Further details on how the grid points are connected to the graph, land\-crossing edge filtering, and static graph features are given in[Appendix˜C](https://arxiv.org/html/2605.15470#A3)\.
### 4\.2Training objective
Similar to Graph\-EFM\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\], Njord is trained using a combination of theELBOandCRPS\. We apply masking to adapt to the ocean’s bathymetric structure and weight the loss for each prediction based on the area of the grid cell, the depth\-level and the inverse variance of daily change in each variable\[[22](https://arxiv.org/html/2605.15470#bib.bib5)\]\. To use theELBO, we train also an additional encoder network as a variational approximationqϕq\_\{\\phi\}\. This is parametrized as another Gaussian distribution, similar to the prior\. The full loss function for predictions at timettis
ℒ=−𝔼qϕ\[logpθ\(Xt∣Xt−2:t−1,Ft−2:t,Zt\)\]\+λKLDKL\(qϕ\(Zt∣Xt−2:t,Ft−2:t\)∥pθ\(Zt∣Xt−2:t−1,Ft−2:t\)\)\+λCRPSℒCRPS\.\\begin\{split\}\\mathcal\{L\}=&\-\\mathbb\{E\}\_\{q\_\{\\phi\}\}\\\!\\mathopen\{\}\\mathclose\{\{\\left\[\\log p\_\{\\theta\}\(X^\{t\}\\mid X^\{t\-2:t\-1\},F^\{t\-2:t\},Z^\{t\}\)\}\}\\right\]\\\\ &\+\\lambda\_\{\\text\{KL\}\}\\,D\_\{\\mathrm\{KL\}\}\\\!\\mathopen\{\}\\mathclose\{\{\\left\(q\_\{\\phi\}\(Z^\{t\}\\mid X^\{t\-2:t\},F^\{t\-2:t\}\)\\,\\\|\\,p\_\{\\theta\}\(Z^\{t\}\\mid X^\{t\-2:t\-1\},F^\{t\-2:t\}\)\}\}\\right\)\+\\lambda\_\{\\mathrm\{\\lx@glossaries@gls@link\{acronym\}\{CRPS\}\{\{\{\}\}CRPS\}\}\}\\mathcal\{L\}\_\{\\mathrm\{\\lx@glossaries@gls@link\{acronym\}\{CRPS\}\{\{\{\}\}CRPS\}\}\}\.\\end\{split\}\(2\)The likelihoodpθp\_\{\\theta\}is Gaussian with meanX^t\\hat\{X\}^\{t\}and variance based on the earlier mentioned weightings\. We compute theCRPStermℒCRPS\\mathcal\{L\}\_\{\\mathrm\{\\lx@glossaries@gls@link\{acronym\}\{CRPS\}\{\{\{\}\}CRPS\}\}\}using the almost\-fairCRPSestimator\[[30](https://arxiv.org/html/2605.15470#bib.bib23)\]from two independently sampled forecasts\. Training follows a curriculum in multiple phases, gradually activating the KL term \(λKL\\lambda\_\{\\text\{KL\}\}\), introducing multi\-step rollout training \(summing[Equation˜2](https://arxiv.org/html/2605.15470#S4.E2)fort=1,2,…t=1,2,\\dots\), and finally enabling theCRPSloss \(λCRPS\\lambda\_\{\\mathrm\{\\lx@glossaries@gls@link\{acronym\}\{CRPS\}\{\{\{\}\}CRPS\}\}\}\)\. For more details about the training objective, including the exact training curriculum, see[Appendix˜D](https://arxiv.org/html/2605.15470#A4)\.
### 4\.3Sea ice modeling with physical constraints
Unlike other machine learning models for high\-resolution short term ocean forecasting\[[14](https://arxiv.org/html/2605.15470#bib.bib18),[9](https://arxiv.org/html/2605.15470#bib.bib19),[46](https://arxiv.org/html/2605.15470#bib.bib20)\], Njord also models sea ice\. These ice\-related variables have specific physical constraints:Sea Ice Concentration \(SIC\)is bounded to\[0,1\]\[0,1\]andSea Ice Thickness \(SIT\)must be positive\. A large fraction of values also sit exactly at these limits\. Without explicitly enforcing these constraints, autoregressive forecast rollout can produce unphysical predictions outside of these bounds\. As a probabilistic model, this is especially a concern for Njord, as we want the full probability distribution to conform to the constraints, not just the conditional mean\.
Instead of enforcing constraints only through post\-processing, we aim to also prevent the model from encountering negative sea ice inputs during training\. In[Section˜E\.3](https://arxiv.org/html/2605.15470#A5.SS3), we compare different strategies at global1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution and find that a combination of soft clamping and a*density channel*\[[18](https://arxiv.org/html/2605.15470#bib.bib27),[5](https://arxiv.org/html/2605.15470#bib.bib28)\]performs best\. A binary density channeld∈\{0,1\}d\\in\\\{0,1\\\}is constructed fromSICasd=𝕀\[SIC\>0\]d=\\mathbb\{I\}\[\\mathrm\{SIC\}\>0\]and appended to the model state\. The model predicts this channel jointly with all other variables\. The predicted density logit is passed through a sigmoid and thresholded at0\.50\.5\. Where the predicted density falls below the threshold, the density channel and all ice variables are set to zero in the next state; otherwise, the predicted ice values are retained\. This ensures that locations predicted to be ice\-free receive clean zero\-ice inputs, rather than small residual values that may accumulate over rollouts\. The raw, pre\-threshold predictions are still used for the loss computation\.
### 4\.4Scaling to high\-resolution global grids
Scaling Njord to global ocean grids at0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}presents several modeling and training challenges\. Similar toAletet al\.\[[2](https://arxiv.org/html/2605.15470#bib.bib22)\], we note that much of the memory footprint in graph\-based models can be attributed to the many edges used for encoding the gridded data to the graph\. We modify the Interaction\[[4](https://arxiv.org/html/2605.15470#bib.bib3)\]and Propagation NetworkGNNs\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\]in the model to use separate dimensionalities for the original grid\-level embeddings, edge representations and graph node representations\. This allows us to substantially reduce the memory used for edge representations, and instead scale up the hidden dimensionality in the coreGNNprocessor of the model\. Exact dimensionalities are given in[Section˜A\.3](https://arxiv.org/html/2605.15470#A1.SS3)\. We additionally apply gradient checkpointing\[[19](https://arxiv.org/html/2605.15470#bib.bib53),[8](https://arxiv.org/html/2605.15470#bib.bib6)\]between time steps to allow for training on auto\-regressive rollouts\.
We further reduce compute and training time by following a two\-stage training schedule, where we pretrain on1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution data before finetuning at0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}\. The same graph is used for both resolutions, only swapping the encoding and decoding edges that connect the mesh to the high\-resolution grid\. The full training curriculum is outlined in[Appendix˜D](https://arxiv.org/html/2605.15470#A4)\.
### 4\.5Extension to regional modeling: Njord\-Baltic
The framework underpinning the global Njord model is general, and can also be applied to train regional ocean models\. We exemplify this by building Njord\-Baltic, a probabilistic ocean forecasting model for the Baltic Sea\. Quadrilateral graphs have previously been used for regional ocean models\[[23](https://arxiv.org/html/2605.15470#bib.bib8)\], but these suffer from some of the same limitations as icosahedral graphs in the global setting\. We instead use our clustering\-based graph construction also for Njord\-Baltic\.
For regional models, boundary conditions can become an additional consideration\. We handle this by introducing an additional boundary forcing inputBtB^\{t\}, and include this from timest−2,t−1t\-2,t\-1andttfor predictingX^t\\hat\{X\}^\{t\}\. This boundary forcing contains information about the surrounding ocean state, and can come from a global forecasting model or reanalysis data\. We use a separate encoder for the boundary inputs\[[1](https://arxiv.org/html/2605.15470#bib.bib16)\], allowing these to stay at the original coarse resolution of the global model\. An overview of the regional setup for Njord\-Baltic is shown in[Figure˜12](https://arxiv.org/html/2605.15470#A3.F12)in the appendix\.
## 5Experiments
Both global and regional models are trained on ocean reanalysis and atmospheric reanalysis data spanning 1993–2021, after which the models are finetuned on analysis data from 2023\. The finetuning is helpful to achieve a better calibrated ensemble, as the operational initial conditions are produced by the same model\. In the global data, there is also a slight mismatch in the bathymetry used by the reanalysis and analysis, and because we connect the GNN mesh to the surface grid, we prefer this to be the exact same for input and output\. During evaluation, 52 forecasts are initialized weekly \(every Tuesday\) throughout 2024, following the OceanBench\[[13](https://arxiv.org/html/2605.15470#bib.bib24)\]benchmark, and evaluated over a 10\-day forecast horizon\.
### 5\.1Global ocean forecasting
#### Global ocean data\.
The global ocean state for training is taken from the GLORYS12 global ocean reanalysis\[[34](https://arxiv.org/html/2605.15470#bib.bib31)\], a1/12∘1/12^\{\\circ\}resolution NEMO\[[37](https://arxiv.org/html/2605.15470#bib.bib32)\]based reanalysis assimilating satellite altimetry,Sea Surface Temperature \(SST\), and in\-situ profiles\. GLORYS12 provides daily\-mean fields ofSea Surface Height \(SSH\),Sea Ice Concentration \(SIC\),Sea Ice Thickness \(SIT\),Temperature \(T\),Salinity \(S\),Zonal Current \(U\), andMeridional Current \(V\)at multiple depth levels\. For the global Njord configuration, we use these variables at six representative depths\. We additionally finetune on operational GLO12\[[35](https://arxiv.org/html/2605.15470#bib.bib30)\]analysis data\. Surface atmospheric forcing is obtained from the ERA5 global reanalysis\[[21](https://arxiv.org/html/2605.15470#bib.bib29)\], produced by ECMWF at0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution\. We use eight surface atmospheric variables, bilinearly interpolated to the ocean model grid\. During evaluation we switch to operational 10\-day IFS\[[12](https://arxiv.org/html/2605.15470#bib.bib54)\]atmospheric forecasts for the forcing\. Further details are listed in[Table˜1](https://arxiv.org/html/2605.15470#A2.T1)\.
#### Global baselines\.
We use the machine learning models from OceanBench\[[13](https://arxiv.org/html/2605.15470#bib.bib24)\]as global baselines, namely GLONET\[[14](https://arxiv.org/html/2605.15470#bib.bib18)\], WenHai\[[9](https://arxiv.org/html/2605.15470#bib.bib19)\], and XiHe\[[46](https://arxiv.org/html/2605.15470#bib.bib20)\]\. To also provide a deterministic counterpart with a comparable training strategy and data splits to Njord, we extend SeaCast\[[22](https://arxiv.org/html/2605.15470#bib.bib5)\]to the global setting by adopting a global icosahedral mesh instead of the regional quadrilateral mesh it normally uses\. As a physics\-based operational baseline, we use GLO12\[[35](https://arxiv.org/html/2605.15470#bib.bib30)\]\. We also include a persistence baseline, which repeats the last initial state over the whole forecast horizon\. Njord, SeaCast, and GLONET operate at0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}, whereas WenHai, XiHe, and GLO12 use the native0\.083°/0\.083\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}simulation grid\.
#### Global results\.
Njord samples a single next\-step ensemble member in3s/3\\text\{\\,\}\\mathrm\{s\}\\text\{/\}on one AMD MI250X GPU\. In these experiments, we generate a 20\-member global ensemble\. Lead\-time performance is evaluated in[Figure˜4](https://arxiv.org/html/2605.15470#S5.F4)for selected variables\.Root Mean Squared Error \(RMSE\)is computed for the ensemble mean and compared against the GLO12 analysis at0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}\. Full results from the global experiments are presented in[Appendix˜E](https://arxiv.org/html/2605.15470#A5), including scores for all models on OceanBench\. Across all variables, Njord demonstrates competitive performance relative to machine learning baselines, with stable error growth over the 10\-day forecast horizon\. SeaCast exhibits similar behavior, as it is pretrained on reanalysis and fine\-tuned on analysis data, consistent with Njord, whereas GLONET represents a state\-of\-the\-art model trained solely on reanalysis\. Results for the two other reanalysis\-based models XiHe and WenHai are shown in[Tables9](https://arxiv.org/html/2605.15470#A5.T9)–[11](https://arxiv.org/html/2605.15470#A5.T11)\. The GLO12 forecast constitutes a particularly strong baseline, as it is generated by the same system used to produce the analysis targets\. Due to the sparsity of ocean observations, outperforming the physical simulator on the analysis benchmark is inherently challenging\. On the other hand, machine learning models trained exclusively on reanalysis data may be biased toward that dataset\. They indeed outperform GLO12 on a number of reanalysis fields as seen in[Table˜9](https://arxiv.org/html/2605.15470#A5.T9)\. A more independent assessment of performance is therefore comparing against observations\. Nonetheless, analysis evaluation remains important, as initial conditions are derived from the same system, enabling comparison of consistent trajectories\.
Figure 4:RMSEforSea Surface Temperature \(SST\),Sea Surface Height \(SSH\),Sea Surface Salinity \(SSS\), andSea Ice Thickness \(SIT\)at each lead time relative to GLO12 analysis\.Figure 5:SSRaveraged over all global ocean variables\.TheSpread\-Skill Ratio \(SSR\)in[Figure˜5](https://arxiv.org/html/2605.15470#S5.F5)measures the ratio between ensemble spread and forecast error, providing an assessment of probabilistic calibration\. The 20\-member Njord ensemble exhibits a slight underdispersion at short lead times, but theSSRincreases steadily toward 1, indicating good calibration and a reliable representation of forecast uncertainty\. This behavior is consistent across the ocean variables\.
Spatial performance forSSTat a 10\-day lead time is shown in Figure[6](https://arxiv.org/html/2605.15470#S5.F6)\. The Njord ensemble mean accurately captures the global thermal structure of the GLO12 analysis, including sharp gradients associated with major current systems\. The ensemble spread highlights regions of increased uncertainty, aligning with dynamically active areas such as the Gulf Stream, the Brazil–Malvinas Confluence, the Agulhas Retroflection, and the Kuroshio Extension\. Forecasts of Arctic sea ice distribution is shown in Figure[7](https://arxiv.org/html/2605.15470#S5.F7)\. Njord reproduces large\-scale ice thickness gradients and the marginal ice zone\. Ensemble spread peaks in the Kara Sea and other highly dynamic marginal zones, reflecting uncertainty in sea ice transport over the 10\-day forecast horizon\.
Figure 6:GlobalSSTat a10d10\\text\{\\,\}\\mathrm\{d\}lead, initialized on 2024\-01\-30\. Ground truth is GLO12 analysis\.Figure 7:ArcticSITat10d10\\text\{\\,\}\\mathrm\{d\}lead time, initialized 2024\-01\-30\. Ground truth is GLO12 analysis\.
#### Observation results\.
The OceanBench\[[13](https://arxiv.org/html/2605.15470#bib.bib24)\]observation track enables independent verification using the IV\-TT CLASS\-4 dataset, which comprises temperature and salinity measured by a global array of autonomous floats,SSTfrom surface drifting buoys,Sea Level Anomaly \(SLA\)from along\-track satellite measurements, and surface currents at 15 m depth from drifter buoys\. As shown in[Table˜11](https://arxiv.org/html/2605.15470#A5.T11), Njord achieves the lowestRMSEfor 0–5 m temperature across all lead times, outperforming both the operational GLO12 baseline and other machine learning emulators\. Njord tends to perform on par with the best model for each observable in the upper ocean, less than100m100\\text\{\\,\}\\mathrm\{m\}depth, on the observation track\.
Figure 8:GlobalSSTpredictions evaluated on satellite measurements\.To further evaluateSSTforecasts outside of OceanBench, we compare the predicted potential temperature of the uppermost ocean layer against a global ocean bias\-adjustedSSTproduct\[[11](https://arxiv.org/html/2605.15470#bib.bib4)\], based on multi\-sensor satellite observations\.[Figure˜8](https://arxiv.org/html/2605.15470#S5.F8)shows globally averagedSSTRMSEover a 10\-day forecast horizon, where all models are interpolated to the0\.1°/0\.1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}SSTgrid\. Njord maintains the lowest error across all lead times compared to all other models\.
### 5\.2Regional ocean forecasting
#### Regional ocean data\.
The Baltic Sea state is obtained from the Baltic Sea Physics Reanalysis, produced with the NEMO ocean model\[[37](https://arxiv.org/html/2605.15470#bib.bib32)\]at2km/2\\text\{\\,\}\\mathrm\{km\}\\text\{/\}horizontal resolution\. We use daily\-mean fields ofSLA,SIC,SIT, and the three\-dimensionalT,S,U, andV, subsampled to five representative depth levels listed in[Table˜2](https://arxiv.org/html/2605.15470#A2.T2)\. For the regional configuration, GLORYS12 provides lateral boundary forcing during training, supplying the three\-dimensional ocean state along the open boundaries of the Baltic domain\. During evaluation, GLO12 forecasts are used as boundary forcing\. Atmospheric forcing follows the global setup: ERA5 is used during training and operational IFS forecasts during evaluation, both interpolated to the Baltic Sea grid\.
#### Regional baselines\.
We compare Njord\-Baltic to SeaCast\[[23](https://arxiv.org/html/2605.15470#bib.bib8)\], which serves as the primary baseline\. We also include GLO12 forecasts from OceanBench for reference, although they are not expected to be competitive at kilometer\-scale resolution due to their coarser0\.083°/0\.083\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}grid\. Also here we include a persistence baseline\.
#### Regional results\.
Njord\-Baltic samples a single next\-step ensemble member in1s/1\\text\{\\,\}\\mathrm\{s\}\\text\{/\}on one AMD MI250X GPU\. In these experiments, we generate a 5\-member ensemble\. Both Njord\-Baltic and SeaCast are evaluated under two training regimes: pretraining on reanalysis data and subsequent fine\-tuning on operational analysis data\. Fine\-tuning leads to a consistent improvement in probabilistic calibration, as reflected by higherSSRvalues in[Figure˜9](https://arxiv.org/html/2605.15470#S5.F9)\. It also reduces the error as seen in[Figure˜10](https://arxiv.org/html/2605.15470#S5.F10)containing model performance forT,S,Uat47m47\\text\{\\,\}\\mathrm\{m\}depth, as well asSIT\.
Figure 9:SSRaveraged over Baltic Sea variables\.Across variables, Njord\-Baltic achieves RMSE values comparable to SeaCast while providing probabilistic forecasts\. In this regional setting, GLO12 exhibits a relatively flat error curve, similar to a climatological baseline\. Both Njord\-Baltic and SeaCast clearly outperform persistence\. Njord\-Baltic matches SeaCast in deterministic accuracy while additionally providing calibrated ensemble forecasts\.
Figure 10:RMSE forTemperature \(T\),Salinity \(S\),Zonal Current \(U\)at47m47\\text\{\\,\}\\mathrm\{m\}depth, as well asSea Ice Thickness \(SIT\), all relative to NEMO analysis\.Figure 11:Baltic SeaSSTat10d10\\text\{\\,\}\\mathrm\{d\}lead time, initialized 2024\-03\-05\. Ground truth is NEMO analysis\.SpatialSSTfields at a 10\-day lead time are shown in[Figure˜11](https://arxiv.org/html/2605.15470#S5.F11)\. The ensemble mean captures the large\-scale thermal structure but appears smoother than the ground truth\. Individual ensemble members recover sharper gradients and localized features\. The ensemble spread highlights regions of increased uncertainty, particularly in the Danish Straits \(Kattegat and Skagerrak\), where exchange with the North Sea introduces strong variability\. The standard deviation is zero in ice\-covered regions such as the very north of the Baltic Sea during winter\. Fields and metrics for additional variables can be found in[Appendix˜E](https://arxiv.org/html/2605.15470#A5)\.
## 6Discussion
#### Limitations\.
While the current vertical resolution \(5 for the Baltic Sea and 6 for the global ocean\) is sufficient to benchmark the model, future probabilistic forecasting models should extend to a larger portion of the vertical column\. The horizontal resolution can also be increased to the native0\.083°/0\.083\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}in the global case, to be able to resolve even finer ocean dynamics\. We note that Njord\-Baltic exhibits some visual artifacts inSLAstandard deviation fields, potentially due to higher noise\-to\-signal ratio in the satellite\-derived ground\-truthSLAtraining targets\. As an ensemble forecasting model, computing the minimum\-error ensemble mean requires sampling multiple ensemble members, resulting in a computational cost proportionally higher than that of deterministic models\. Given the fast and parallelizable sampling, this is a small price for reliable uncertainty estimates\.
#### Conclusion\.
We have introduced*Njord*, the first probabilistic generative machine learning model for global and regional ocean forecasting\. By combining hierarchicalGNNswith a flexible K\-means cluster mesh construction, we successfully scale generative ensemble modeling to high\-resolution ocean grids with irregular coastal geometries\. The results show that Njord not only matches or exceeds the accuracy of deterministic machine learning models and traditional numerical systems, but also provides well\-calibrated uncertainty estimates\. This is important for operational use, where understanding forecast uncertainties in the ocean currents or advancing sea ice can help improve decision making\. Avenues for future work include scaling the approach to more vertical levels, additional variables, shorter timesteps and longer timescales\. Especially interesting is the prospect of combined oceanic and atmospheric modeling into a coupled ensemble system\.
## Acknowledgments and Disclosure of Funding
The authors would like to thank Simon Adamov \(ETHZ/MeteoSwiss\) for assistance with the atmospheric forcing data and Lars Axell and Erik Mulder \(SMHI\) for useful discussions\. This research was supported by the ETH AI Center through an ETH AI Center postdoctoral fellowship to Joel Oskarsson, the Research Council of Finland \(grant no: 361902\), the Swedish Research Council \(grant no: 2020\-04122, 2024\-05011\), the Wallenberg AI, Autonomous Systems and Software Program \(WASP\) funded by the Knut and Alice Wallenberg Foundation, and the Excellence Center at Linköping–Lund in Information Technology \(ELLIIT\)\. Daniel Holmberg acknowledges support from the Fulbright\-KAUTE Foundation Award for conducting research at UC Santa Barbara\. Our computations were enabled by the LUMI supercomputer, owned by the EuroHPC Joint Undertaking and hosted by CSC–IT Center for Science, and the Berzelius resource at the National Supercomputer Centre, provided by the Knut and Alice Wallenberg Foundation\. This work was supported as part of the Swiss AI Initiative by a grant from the Swiss National Supercomputing Centre \(CSCS\) under project ID a122 on Alps\.
## References
- \[1\]S\. Adamov, J\. Oskarsson, L\. Denby, T\. Landelius, K\. Hintz, S\. Christiansen, I\. Schicker, C\. Osuna, F\. Lindsten, O\. Fuhrer,et al\.\(2025\)Building machine learning limited area models: kilometer\-scale weather forecasting in realistic settings\.arXiv preprint arXiv:2504\.09340\.Cited by:[§4\.5](https://arxiv.org/html/2605.15470#S4.SS5.p2.4)\.
- \[2\]F\. Alet, I\. Price, A\. El\-Kadi, D\. Masters, S\. Markou, T\. R\. Andersson, J\. Stott, R\. Lam, M\. Willson, A\. Sanchez\-Gonzalez,et al\.\(2025\)Skillful joint probabilistic weather forecasting from marginals\.arXiv preprint arXiv:2506\.10772\.Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p1.6),[§4\.4](https://arxiv.org/html/2605.15470#S4.SS4.p1.1)\.
- \[3\]T\. R\. Andersson, J\. S\. Hosking, M\. Pérez\-Ortiz, B\. Paige, A\. Elliott, C\. Russell, S\. Law, D\. C\. Jones, J\. Wilkinson, T\. Phillips,et al\.\(2021\)Seasonal Arctic sea ice forecasting with probabilistic deep learning\.Nature Communications12\(1\),pp\. 5124\.Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px1.p1.2)\.
- \[4\]\(2016\)Interaction networks for learning about objects, relations and physics\.InAdvances in Neural Information Processing Systems,Vol\.29\.Cited by:[§A\.2](https://arxiv.org/html/2605.15470#A1.SS2.SSS0.Px1.p1.6),[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p2.8),[§4\.4](https://arxiv.org/html/2605.15470#S4.SS4.p1.1)\.
- \[5\]C\. Bodnar, W\. P\. Bruinsma, A\. Lucic, M\. Stanley, A\. Allen, J\. Brandstetter, P\. Garvan, M\. Riechert, J\. A\. Weyn, H\. Dong,et al\.\(2025\)A foundation model for the Earth system\.Nature641\(8065\),pp\. 1180–1187\.Cited by:[§E\.3](https://arxiv.org/html/2605.15470#A5.SS3.SSS0.Px3.p1.4),[§4\.3](https://arxiv.org/html/2605.15470#S4.SS3.p2.4)\.
- \[6\]B\. Bonev, T\. Kurth, A\. Mahesh, M\. Bisson, J\. Kossaifi, K\. Kashinath, A\. Anandkumar, W\. D\. Collins, M\. S\. Pritchard, and A\. Keller\(2025\)FourCastNet 3: a geometric approach to probabilistic machine\-learning weather forecasting at scale\.arXiv preprint arXiv:2507\.12144\.Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p1.6)\.
- \[7\]A\. Chattopadhyay, M\. Gray, T\. Wu, A\. B\. Lowe, and R\. He\(2024\)OceanNet: a principled neural operator\-based digital twin for regional oceans\.Scientific Reports14\(21181\)\.Cited by:[§1](https://arxiv.org/html/2605.15470#S1.p4.1),[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px1.p1.2)\.
- \[8\]T\. Chen, B\. Xu, C\. Zhang, and C\. Guestrin\(2016\)Training deep nets with sublinear memory cost\.arXiv preprint arXiv:1604\.06174\.Cited by:[§A\.3](https://arxiv.org/html/2605.15470#A1.SS3.p1.8),[§D\.2](https://arxiv.org/html/2605.15470#A4.SS2.SSS0.Px4.p1.5),[§4\.4](https://arxiv.org/html/2605.15470#S4.SS4.p1.1)\.
- \[9\]Y\. Cui, R\. Wu, X\. Zhang, Z\. Zhu, B\. Liu, J\. Shi, J\. Chen, H\. Liu, S\. Zhou, L\. Su,et al\.\(2025\)Forecasting the eddying ocean with a deep neural network\.Nature Communications16\(1\),pp\. 2268\.Cited by:[§E\.4](https://arxiv.org/html/2605.15470#A5.SS4.p1.1),[§1](https://arxiv.org/html/2605.15470#S1.p2.1),[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px1.p1.2),[§4\.3](https://arxiv.org/html/2605.15470#S4.SS3.p1.1),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px2.p1.2)\.
- \[10\]S\. Dheeshjith, A\. Subel, A\. Adcroft, J\. Busecke, C\. Fernandez\-Granda, S\. Gupta, and L\. Zanna\(2025\)Samudra: an AI global ocean emulator for climate\.Geophysical Research Letters52\(10\)\.Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px1.p1.2)\.
- \[11\]E\.U\. Copernicus Marine Service Information\(2026\)ODYSSEA global ocean \- sea surface temperature multi\-sensor L3 observations\.Copernicus Marine Environment Monitoring Service \(CMEMS\)\.External Links:[Document](https://dx.doi.org/10.48670/moi-00164),[Link](https://doi.org/10.48670/moi-00164)Cited by:[§E\.5](https://arxiv.org/html/2605.15470#A5.SS5.p1.1),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px4.p2.1)\.
- \[12\]ECMWF\(2024\)Integrated forecasting system\.Note:[https://www\.ecmwf\.int/en/forecasts/documentation\-and\-support/changes\-ecmwf\-model](https://www.ecmwf.int/en/forecasts/documentation-and-support/changes-ecmwf-model)Cited by:[Appendix B](https://arxiv.org/html/2605.15470#A2.p1.1),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px1.p1.2)\.
- \[13\]A\. El Aouni, Q\. Gaudel, J\. E\. Johnson, R\. Charly, J\. Le Sommer, R\. Fablet, M\. Drevillon, Y\. Drillet, P\. Y\. Le Traon,et al\.\(2025\)OceanBench: a benchmark for data\-driven global ocean forecasting systems\.InNeural Information Processing Systems,Vol\.39\.Cited by:[Appendix B](https://arxiv.org/html/2605.15470#A2.p1.1),[§E\.4](https://arxiv.org/html/2605.15470#A5.SS4.p1.1),[item 4](https://arxiv.org/html/2605.15470#S1.I1.i4.p1.1),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px2.p1.2),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px4.p1.1),[§5](https://arxiv.org/html/2605.15470#S5.p1.1),[footnote 3](https://arxiv.org/html/2605.15470#footnote3)\.
- \[14\]A\. El Aouni, Q\. Gaudel, C\. Regnier, S\. Van Gennip, O\. Le Galloudec, M\. Drevillon, Y\. Drillet, and J\. Lellouche\(2025\)GLONET: mercator’s end\-to\-end neural global ocean forecasting system\.Journal of Geophysical Research: Machine Learning and Computation2\(3\)\.Cited by:[§E\.4](https://arxiv.org/html/2605.15470#A5.SS4.p1.1),[§E\.6](https://arxiv.org/html/2605.15470#A5.SS6.p1.3),[§1](https://arxiv.org/html/2605.15470#S1.p1.1),[§1](https://arxiv.org/html/2605.15470#S1.p2.1),[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px1.p1.2),[§4\.3](https://arxiv.org/html/2605.15470#S4.SS3.p1.1),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px2.p1.2)\.
- \[15\]C\. A\. T\. Ferro\(2014\)Fair scores for ensemble forecasts\.Quarterly Journal of the Royal Meteorological Society140\(683\),pp\. 1917–1923\.Cited by:[§D\.1](https://arxiv.org/html/2605.15470#A4.SS1.p2.9)\.
- \[16\]T\. S\. Finn, C\. Durand, A\. Farchi, M\. Bocquet, and J\. Brajard\(2024\)Towards diffusion models for large\-scale sea\-ice modelling\.InICML 2024 Workshop on Machine Learning for Earth System Modeling,Cited by:[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p1.6)\.
- \[17\]V\. Fortin, M\. Abaza, F\. Anctil, and R\. Turcotte\(2014\)Why should ensemble spread match the RMSE of the ensemble mean?\.Journal of Hydrometeorology15\(4\),pp\. 1708 – 1713\.Cited by:[§E\.6](https://arxiv.org/html/2605.15470#A5.SS6.p2.1)\.
- \[18\]J\. Gordon, W\. P\. Bruinsma, A\. Y\. Foong, J\. Requeima, Y\. Dubois, and R\. E\. Turner\(2020\)Convolutional conditional neural processes\.InInternational Conference on Learning Representations,Cited by:[§E\.3](https://arxiv.org/html/2605.15470#A5.SS3.SSS0.Px3.p1.4),[§4\.3](https://arxiv.org/html/2605.15470#S4.SS3.p2.4)\.
- \[19\]A\. Griewank and A\. Walther\(2000\)Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation\.ACM Transactions on Mathematical Software26\(1\),pp\. 19–45\.Cited by:[§4\.4](https://arxiv.org/html/2605.15470#S4.SS4.p1.1)\.
- \[20\]V\. Hatanpää, E\. Ku, J\. Stock, M\. Emani, S\. Foreman, C\. Jung, S\. Madireddy, T\. Nguyen, V\. Sastry, R\. A\. Sinurat,et al\.\(2025\)AERIS: argonne Earth systems model for reliable and skillful predictions\.InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis,pp\. 72–85\.Cited by:[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p1.6)\.
- \[21\]H\. Hersbach, B\. Bell, P\. Berrisford, S\. Hirahara, A\. Horányi, J\. Muñoz\-Sabater, J\. Nicolas, C\. Peubey, R\. Radu, D\. Schepers,et al\.\(2020\)The ERA5 global reanalysis\.Quarterly Journal of the Royal Meteorological Society146\(730\),pp\. 1999–2049\.Cited by:[Appendix B](https://arxiv.org/html/2605.15470#A2.p1.1),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px1.p1.2)\.
- \[22\]D\. Holmberg, E\. Clementi, I\. Epicoco, and T\. Roos\(2025\)Accurate Mediterranean Sea forecasting via graph\-based deep learning\.Scientific Reports15\(45051\)\.Cited by:[§A\.3](https://arxiv.org/html/2605.15470#A1.SS3.p1.8),[§D\.1](https://arxiv.org/html/2605.15470#A4.SS1.p1.14),[§E\.6](https://arxiv.org/html/2605.15470#A5.SS6.p1.3),[§1](https://arxiv.org/html/2605.15470#S1.p1.1),[§1](https://arxiv.org/html/2605.15470#S1.p4.1),[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px1.p1.2),[§4\.2](https://arxiv.org/html/2605.15470#S4.SS2.p1.2),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px2.p1.2)\.
- \[23\]D\. Holmberg, E\. Clementi, and T\. Roos\(2024\)Regional ocean forecasting with hierarchical graph neural networks\.InNeurIPS 2024 Workshop on Tackling Climate Change with Machine Learning,Cited by:[§C\.1](https://arxiv.org/html/2605.15470#A3.SS1.p3.3),[§D\.2](https://arxiv.org/html/2605.15470#A4.SS2.SSS0.Px3.p1.11),[§4\.5](https://arxiv.org/html/2605.15470#S4.SS5.p1.1),[§5\.2](https://arxiv.org/html/2605.15470#S5.SS2.SSS0.Px2.p1.1)\.
- \[24\]Y\. Hu, L\. Chen, Z\. Wang, and H\. Li\(2023\)SwinVRNN: a data\-driven ensemble forecasting model via learned distribution perturbation\.Journal of Advances in Modeling Earth Systems15\(2\)\.Cited by:[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p1.6)\.
- \[25\]Q\. Huang, Y\. Niu, X\. Zhong, A\. Guo, L\. Chen, D\. Zhang, X\. Zhang, and H\. Li\(2025\)FuXi\-Ocean: a global ocean forecasting system with sub\-daily resolution\.InAdvances in Neural Information Processing Systems,Vol\.38\.Cited by:[§1](https://arxiv.org/html/2605.15470#S1.p2.1),[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px1.p1.2)\.
- \[26\]Q\. Huang, X\. Zhong, A\. Guo, Z\. Peng, L\. Chen, and H\. Li\(2026\)Data\-driven ensemble prediction of the global ocean\.arXiv preprint arXiv:2603\.19591\.Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px1.p1.2)\.
- \[27\]T\. Kärnä, P\. Ljungemyr, S\. Falahat, I\. Ringgaard, L\. Axell, V\. Korabel, J\. Murawski, I\. Maljutenko, A\. Lindenthal, S\. Jandt\-Scheelke,et al\.\(2021\)Nemo\-Nordic 2\.0: operational marine forecast model for the Baltic Sea\.Geoscientific Model Development14\(9\),pp\. 5731–5749\.Cited by:[§1](https://arxiv.org/html/2605.15470#S1.p1.1)\.
- \[28\]R\. Lam, A\. Sanchez\-Gonzalez, M\. Willson, P\. Wirnsberger, M\. Fortunato, F\. Alet, S\. Ravuri, T\. Ewalds, Z\. Eaton\-Rosen, W\. Hu,et al\.\(2023\)Learning skillful medium\-range global weather forecasting\.Science382\(6677\),pp\. 1416–1421\.Cited by:[§C\.1](https://arxiv.org/html/2605.15470#A3.SS1.p1.1),[§C\.3](https://arxiv.org/html/2605.15470#A3.SS3.p1.11),[§4\.1](https://arxiv.org/html/2605.15470#S4.SS1.p1.1)\.
- \[29\]S\. Lang, M\. Alexe, M\. Chantry, J\. Dramsch, F\. Pinault, B\. Raoult, M\. C\. Clare, C\. Lessig, M\. Maier\-Gerber, L\. Magnusson,et al\.\(2024\)AIFS–ECMWF’s data\-driven forecasting system\.arXiv preprint arXiv:2406\.01465\.Cited by:[§4\.1](https://arxiv.org/html/2605.15470#S4.SS1.p1.1)\.
- \[30\]S\. Lang, M\. Alexe, M\. C\. Clare, C\. Roberts, R\. Adewoyin, Z\. Ben Bouallègue, M\. Chantry, J\. Dramsch, P\. D\. Dueben, S\. Hahner,et al\.\(2026\)AIFS\-CRPS: ensemble forecasting using a model trained with a loss function based on the continuous ranked probability score\.npj Artificial Intelligence2\(1\),pp\. 18\.Cited by:[§D\.1](https://arxiv.org/html/2605.15470#A4.SS1.p2.9),[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p1.6),[§4\.2](https://arxiv.org/html/2605.15470#S4.SS2.p1.8)\.
- \[31\]E\. Larsson, J\. Oskarsson, T\. Landelius, and F\. Lindsten\(2025\)CRPS\-LAM: regional ensemble weather forecasting from matching marginals\.InEurIPS 2025 Workshop on AI for Climate and Conservation,Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px2.p1.1)\.
- \[32\]E\. Larsson, J\. Oskarsson, T\. Landelius, and F\. Lindsten\(2025\)Diffusion\-LAM: probabilistic limited area weather forecasting with diffusion\.InICLR 2025 Workshop on Tackling Climate Change with Machine Learning,Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p1.6)\.
- \[33\]P\. Y\. Le Traon, A\. Reppucci, E\. Alvarez Fanjul, L\. Aouf, A\. Behrens, M\. Belmonte, A\. Bentamy, L\. Bertino, V\. E\. Brando, M\. B\. Kreiner,et al\.\(2019\)From observation to information and users: the Copernicus Marine Service perspective\.Frontiers in Marine Science6,pp\. 234\.Cited by:[§1](https://arxiv.org/html/2605.15470#S1.p1.1)\.
- \[34\]J\. Lellouche, E\. Greiner, R\. Bourdallé\-Badie, G\. Garric, A\. Melet, M\. Drévillon, C\. Bricaud, M\. Hamon, O\. Le Galloudec, C\. Regnier,et al\.\(2021\)The Copernicus global 1/12 oceanic and sea ice GLORYS12 reanalysis\.Frontiers in Earth Science9,pp\. 698876\.Cited by:[Appendix B](https://arxiv.org/html/2605.15470#A2.p1.1),[§E\.4](https://arxiv.org/html/2605.15470#A5.SS4.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px1.p1.2)\.
- \[35\]J\. Lellouche, E\. Greiner, G\. Ruggiero, R\. Bourdallé\-Badie, C\. Testut, O\. Le Galloudec, M\. Benkiran, and G\. Garric\(2023\)Evolution of the Copernicus Marine Service global ocean analysis and forecasting high\-resolution system: potential benefit for a wide range of users\.InEuroGOOS International Conference,Vol\.10,pp\. 242–251\.Cited by:[Appendix B](https://arxiv.org/html/2605.15470#A2.p1.1),[§E\.4](https://arxiv.org/html/2605.15470#A5.SS4.p1.1),[§E\.6](https://arxiv.org/html/2605.15470#A5.SS6.p1.3),[§1](https://arxiv.org/html/2605.15470#S1.p1.1),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px1.p1.2),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px2.p1.2)\.
- \[36\]I\. Loshchilov and F\. Hutter\(2019\)Decoupled weight decay regularization\.InInternational Conference on Learning Representations,Cited by:[§D\.2](https://arxiv.org/html/2605.15470#A4.SS2.SSS0.Px4.p1.5)\.
- \[37\]G\. Madec and the NEMO team\(2016\)NEMO ocean engine\.Technical reportInstitut Pierre\-Simon Laplace\.Cited by:[Appendix B](https://arxiv.org/html/2605.15470#A2.p1.1),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px1.p1.2),[§5\.2](https://arxiv.org/html/2605.15470#S5.SS2.SSS0.Px1.p1.1)\.
- \[38\]E\. M\. Nordhagen, H\. H\. Haugen, A\. F\. S\. Salihi, M\. S\. Ingstad, T\. N\. Nipen, I\. A\. Seierstad, I\. Frogner, M\. Clare, S\. Lang, M\. Chantry,et al\.\(2025\)High\-resolution probabilistic data\-driven weather modeling with a stretched\-grid\.arXiv preprint arXiv:2511\.23043\.Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px2.p1.1)\.
- \[39\]J\. Oskarsson, T\. Landelius, M\. P\. Deisenroth, and F\. Lindsten\(2024\)Probabilistic weather forecasting with hierarchical graph neural networks\.InAdvances in Neural Information Processing Systems,Vol\.37\.Cited by:[§A\.1](https://arxiv.org/html/2605.15470#A1.SS1.p1.1),[§A\.2](https://arxiv.org/html/2605.15470#A1.SS2.SSS0.Px2.p1.2),[§A\.3](https://arxiv.org/html/2605.15470#A1.SS3.p1.8),[§C\.1](https://arxiv.org/html/2605.15470#A3.SS1.p1.1),[§1](https://arxiv.org/html/2605.15470#S1.p3.1),[§1](https://arxiv.org/html/2605.15470#S1.p4.1),[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p1.6),[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p2.8),[§4\.1](https://arxiv.org/html/2605.15470#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.15470#S4.SS2.p1.2),[§4\.4](https://arxiv.org/html/2605.15470#S4.SS4.p1.1),[§4](https://arxiv.org/html/2605.15470#S4.p1.1)\.
- \[40\]J\. Oskarsson, T\. Landelius, and F\. Lindsten\(2023\)Graph\-based neural weather prediction for limited area modeling\.InNeurIPS 2023 Workshop on Tackling Climate Change with Machine Learning,Cited by:[§C\.1](https://arxiv.org/html/2605.15470#A3.SS1.p3.3)\.
- \[41\]L\. Pacchiardi, R\. A\. Adewoyin, P\. Dueben, and R\. Dutta\(2024\)Probabilistic forecasting with generative networks via scoring rule minimization\.Journal of Machine Learning Research25\(45\),pp\. 1–64\.Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px2.p1.1)\.
- \[42\]J\. Pathak, Y\. Cohen, P\. Garg, P\. Harrington, N\. Brenowitz, D\. Durran, M\. Mardani, A\. Vahdat, S\. Xu, K\. Kashinath,et al\.\(2026\)Kilometer\-scale convection\-allowing model emulation using generative diffusion modeling\.Science Advances12\(5\),pp\. eadv0423\.Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px2.p1.1)\.
- \[43\]I\. Price, A\. Sanchez\-Gonzalez, F\. Alet, T\. R\. Andersson, A\. El\-Kadi, D\. Masters, T\. Ewalds, J\. Stott, S\. Mohamed, P\. Battaglia,et al\.\(2025\)Probabilistic weather forecasting with machine learning\.Nature637\(8044\),pp\. 84–90\.Cited by:[§D\.2](https://arxiv.org/html/2605.15470#A4.SS2.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2605.15470#S1.p3.1),[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p1.6)\.
- \[44\]K\. Sohn, H\. Lee, and X\. Yan\(2015\)Learning structured output representation using deep conditional generative models\.InAdvances in Neural Information Processing Systems,C\. Cortes, N\. Lawrence, D\. Lee, M\. Sugiyama, and R\. Garnett \(Eds\.\),Vol\.28\.Cited by:[§3\.2](https://arxiv.org/html/2605.15470#S3.SS2.p1.3)\.
- \[45\]C\. Wang, M\. S\. Pritchard, N\. Brenowitz, Y\. Cohen, B\. Bonev, T\. Kurth, D\. Durran, and J\. Pathak\(2024\)Coupled ocean\-atmosphere dynamics in a machine learning Earth system model\.arXiv preprint arXiv:2406\.08632\.Cited by:[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px1.p1.2)\.
- \[46\]X\. Wang, R\. Wang, N\. Hu, P\. Wang, P\. Huo, G\. Wang, H\. Wang, S\. Wang, J\. Zhu, J\. Xu,et al\.\(2024\)XiHe: a data\-driven model for global ocean eddy\-resolving forecasting\.arXiv preprint arXiv:2402\.02995\.Cited by:[§E\.4](https://arxiv.org/html/2605.15470#A5.SS4.p1.1),[§1](https://arxiv.org/html/2605.15470#S1.p2.1),[§2](https://arxiv.org/html/2605.15470#S2.SS0.SSS0.Px1.p1.2),[§4\.3](https://arxiv.org/html/2605.15470#S4.SS3.p1.1),[§5\.1](https://arxiv.org/html/2605.15470#S5.SS1.SSS0.Px2.p1.2)\.
## Appendix AModel Details
### A\.1Graph\-EFM details
We adopt the probabilistic framework of Graph\-EFM\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\], a latent variable model in which stochasticity is introduced through latent variablesZZdefined on the mesh graph\. The generative model factorizes as:
p^θ\(X1:T∣X−1:0,F−p\+1:T\)=∏t=1T∫pθ\(Xt∣Zt,Xt−2:t−1,Ft−2:t\)pθ\(Zt∣Xt−2:t−1,Ft−2:t\)dZt\\hat\{p\}\_\{\\theta\}\(X^\{1:T\}\\mid X^\{\-1:0\},F^\{\-p\+1:T\}\)=\\prod\_\{t=1\}^\{T\}\\int p\_\{\\theta\}\(X^\{t\}\\mid Z^\{t\},X^\{t\-2:t\-1\},F^\{t\-2:t\}\)\\,p\_\{\\theta\}\(Z^\{t\}\\mid X^\{t\-2:t\-1\},F^\{t\-2:t\}\)\\,\\mathrm\{d\}Z^\{t\}\(3\)whereXt−2:t−1X^\{t\-2:t\-1\}denotes the previous states andFt−2:tF^\{t\-2:t\}all forcing inputs \(boundary, atmosphere\) available at timett\. The priorpθ\(Zt∣Xt−2:t−1,Ft−2:t\)p\_\{\\theta\}\(Z^\{t\}\\mid X^\{t\-2:t\-1\},F^\{t\-2:t\}\)is a diagonal Gaussian parameterized by aGNNthat encodes the previous state onto the graph\. The likelihoodpθ\(Xt∣Zt,Xt−2:t−1,Ft−2:t\)p\_\{\\theta\}\(X^\{t\}\\mid Z^\{t\},X^\{t\-2:t\-1\},F^\{t\-2:t\}\)is parameterized by the decoder, which combines the latent sampleZtZ^\{t\}with the encoded previous state to produce a predictive Gaussian over the next state\. Since the marginal likelihood is intractable, theELBOis optimized:
logp^θ\(Xt∣Xt−2:t−1,Ft−2:t\)≥𝔼qϕ\(Zt∣Xt−2:t,Ft−2:t\)\[logpθ\(Xt∣Zt,Xt−2:t−1,Ft−2:t\)\]⏟reconstruction / likelihood−λKLDKL\(qϕ\(Zt∣Xt−2:t,Ft−2:t\)∥pθ\(Zt∣Xt−2:t−1,Ft−2:t\)\)⏟KL divergence\\begin\{split\}\\log\\hat\{p\}\_\{\\theta\}\(X^\{t\}\\mid X^\{t\-2:t\-1\},F^\{t\-2:t\}\)\\geq&\\underbrace\{\\mathbb\{E\}\_\{q\_\{\\phi\}\(Z^\{t\}\\mid X^\{t\-2:t\},F^\{t\-2:t\}\)\}\\\!\\mathopen\{\}\\mathclose\{\{\\left\[\\log p\_\{\\theta\}\(X^\{t\}\\mid Z^\{t\},X^\{t\-2:t\-1\},F^\{t\-2:t\}\)\}\}\\right\]\}\_\{\\text\{reconstruction / likelihood\}\}\\\\ &\-\\underbrace\{\\lambda\_\{\\text\{KL\}\}\\,D\_\{\\mathrm\{KL\}\}\\\!\\mathopen\{\}\\mathclose\{\{\\left\(q\_\{\\phi\}\(Z^\{t\}\\mid X^\{t\-2:t\},F^\{t\-2:t\}\)\\,\\\|\\,p\_\{\\theta\}\(Z^\{t\}\\mid X^\{t\-2:t\-1\},F^\{t\-2:t\}\)\}\}\\right\)\}\_\{\\text\{KL divergence\}\}\\end\{split\}\(4\)whereqϕ\(Zt∣Xt−2:t,Ft−2:t\)q\_\{\\phi\}\(Z^\{t\}\\mid X^\{t\-2:t\},F^\{t\-2:t\}\)is the approximate posterior \(encoder\), conditioned on both the previous and current state, andλKL\\lambda\_\{\\text\{KL\}\}is a weighting factor for the KL term\.
#### Encoder\.
The encoder takes as input the grid embedding conditioned on both the previous state and the target stateXtX^\{t\}, maps it to the mesh via the grid\-to\-meshGNN, processes it through the mesh hierarchy, and outputs the parameters\(μq,σq\)\(\\mu\_\{q\},\\sigma\_\{q\}\)of a diagonal Gaussian over the latent variables at the top mesh level:
qϕ\(Zt∣Xt−2:t,Ft−2:t\)=𝒩\(μq\(Xt−2:t,Ft−2:t\),diag\(σq2\(Xt−2:t,Ft−2:t\)\)\)q\_\{\\phi\}\(Z^\{t\}\\mid X^\{t\-2:t\},F^\{t\-2:t\}\)=\\mathcal\{N\}\\\!\\mathopen\{\}\\mathclose\{\{\\left\(\\mu\_\{q\}\(X^\{t\-2:t\},F^\{t\-2:t\}\),\\,\\mathrm\{diag\}\\\!\\mathopen\{\}\\mathclose\{\{\\left\(\\sigma\_\{q\}^\{2\}\(X^\{t\-2:t\},F^\{t\-2:t\}\)\}\}\\right\)\}\}\\right\)\(5\)
#### Prior\.
The prior is parameterized by a separate network with the same architecture as the encoder but conditioned only on the previous state:
pθ\(Zt∣Xt−2:t−1,Ft−2:t\)=𝒩\(gθ\(Xt−2:t−1,Ft−2:t\),I\)p\_\{\\theta\}\(Z^\{t\}\\mid X^\{t\-2:t\-1\},F^\{t\-2:t\}\)=\\mathcal\{N\}\\\!\\mathopen\{\}\\mathclose\{\{\\left\(g\_\{\\theta\}\(X^\{t\-2:t\-1\},F^\{t\-2:t\}\),\\,I\}\}\\right\)\(6\)
#### Decoder\.
Given a latent sampleZt∼qϕZ^\{t\}\\sim q\_\{\\phi\}\(for theELBO\) orZt∼pθZ^\{t\}\\sim p\_\{\\theta\}\(during inference, and for theCRPSloss term\), the decoder injectsZtZ^\{t\}into the mesh representation at the top level, processes it through the full mesh hierarchy \(upward and downward sweeps\), decodes back to the grid, and outputs the predicted residualr^t\\hat\{r\}^\{t\}\. The forecast is computed as a residual:X^t=fθ\(Xt−2:t−1,Ft−2:t,Zt\)=Xt−1\+r^t\\hat\{X\}^\{t\}=f\_\{\\theta\}\\mathopen\{\}\\mathclose\{\{\\left\(X^\{t\-2:t\-1\},F^\{t\-2:t\},Z^\{t\}\}\}\\right\)=X^\{t\-1\}\+\\hat\{r\}^\{t\}\.
#### Ensemble generation\.
At inference, ensemble members are generated by drawing independent samplesZ1t,…,ZMt∼pθ\(Zt∣Xt−2:t−1,Ft−2:t\)Z^\{t\}\_\{1\},\\ldots,Z^\{t\}\_\{M\}\\sim p\_\{\\theta\}\(Z^\{t\}\\mid X^\{t\-2:t\-1\},F^\{t\-2:t\}\)from the prior\. Each sample is decoded independently through the same decoder, producing an ensemble of forecasts\{X^mt\}m=1M\\\{\\hat\{X\}^\{t\}\_\{m\}\\\}\_\{m=1\}^\{M\}\. The latent variables are defined at the coarsest mesh graph level, which has relatively few nodes, so the cost of generating additional ensemble members is dominated by the decoder pass\.
### A\.2GNNlayer formulations
Graph\-EFM uses two different kinds ofGNNlayers: Interaction Networks and Propagation networks\. To increase scalability, we generalize these in Njord to have more flexible input dimensionalities\. All different kinds ofGNNsare described in this section\.
#### Interaction Networks\.
TheGNNlayers in the encode–process–decode architecture are based on Interaction Networks\[[4](https://arxiv.org/html/2605.15470#bib.bib3)\]\. For a graph𝒢=\(𝒱,ℰ\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}\)with sender node representations𝐇S\\mathbf\{H\}^\{S\}, receiver node representations𝐇R\\mathbf\{H\}^\{R\}, and edge representations𝐄\\mathbf\{E\}, all sharing dimensionalitydzd\_\{z\}, the update𝐇R,𝐄←GNN\(𝒢,𝐇S,𝐄,𝐇R\)\\mathbf\{H\}^\{R\},\\mathbf\{E\}\\leftarrow\\mathrm\{\\lx@glossaries@gls@link\{acronym\}\{GNN\}\{\{\{\}\}GNN\}\}\(\\mathcal\{G\},\\mathbf\{H\}^\{S\},\\mathbf\{E\},\\mathbf\{H\}^\{R\}\)is
e~α→β\\displaystyle\\tilde\{e\}\_\{\\alpha\\to\\beta\}←MLPe\(eα→β,𝐇αS,𝐇βR\)\\displaystyle\\leftarrow\\mathrm\{MLP\}\_\{e\}\\\!\\bigl\(e\_\{\\alpha\\to\\beta\},\\,\\mathbf\{H\}^\{S\}\_\{\\alpha\},\\,\\mathbf\{H\}^\{R\}\_\{\\beta\}\\bigr\)\(7\)eα→β\\displaystyle e\_\{\\alpha\\to\\beta\}←eα→β\+e~α→β\\displaystyle\\leftarrow e\_\{\\alpha\\to\\beta\}\+\\tilde\{e\}\_\{\\alpha\\to\\beta\}\(8\)𝐇βR\\displaystyle\\mathbf\{H\}^\{R\}\_\{\\beta\}←𝐇βR\+MLPa\(𝐇βR,∑α∈𝒩e\(β\)e~α→β\)\\displaystyle\\leftarrow\\mathbf\{H\}^\{R\}\_\{\\beta\}\+\\mathrm\{MLP\}\_\{a\}\\\!\\\!\\mathopen\{\}\\mathclose\{\{\\left\(\\mathbf\{H\}^\{R\}\_\{\\beta\},\\;\\sum\_\{\\alpha\\in\\mathcal\{N\}\_\{e\}\(\\beta\)\}\\tilde\{e\}\_\{\\alpha\\to\\beta\}\}\}\\right\)\(9\)where𝒩e\(β\)=\{α:\(α,β\)∈ℰ\}\\mathcal\{N\}\_\{e\}\(\\beta\)=\\\{\\alpha:\(\\alpha,\\beta\)\\in\\mathcal\{E\}\\\}are the incoming neighbors of nodeβ\\beta\. The edge residual in Eq\. \([8](https://arxiv.org/html/2605.15470#A1.E8)\) allows edge representations to accumulate information across successiveGNNlayers\. These layers are used for same\-level mesh processing throughout the hierarchy\.
#### Propagation Networks\.
Interaction Networks are biased towards retaining existing receiver node representations\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\]: when MLPs are initialized with outputs near zero, Eqs\. \([8](https://arxiv.org/html/2605.15470#A1.E8)\)–\([9](https://arxiv.org/html/2605.15470#A1.E9)\) produce no change toeα→βe\_\{\\alpha\\to\\beta\}or𝐇βR\\mathbf\{H\}^\{R\}\_\{\\beta\}\. To encourage information flow from senders to receivers, Propagation Networks\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\]modify Eqs\. \([7](https://arxiv.org/html/2605.15470#A1.E7)\)–\([9](https://arxiv.org/html/2605.15470#A1.E9)\) to
e~α→β\\displaystyle\\tilde\{e\}\_\{\\alpha\\to\\beta\}←𝐇αS\+MLPe\(eα→β,𝐇αS,𝐇βR\)\\displaystyle\\leftarrow\\mathbf\{H\}^\{S\}\_\{\\alpha\}\+\\mathrm\{MLP\}\_\{e\}\\\!\\bigl\(e\_\{\\alpha\\to\\beta\},\\,\\mathbf\{H\}^\{S\}\_\{\\alpha\},\\,\\mathbf\{H\}^\{R\}\_\{\\beta\}\\bigr\)\(10\)𝐇~βR\\displaystyle\\tilde\{\\mathbf\{H\}\}^\{R\}\_\{\\beta\}←1\|𝒩e\(β\)\|∑α∈𝒩e\(β\)e~α→β\\displaystyle\\leftarrow\\frac\{1\}\{\|\\mathcal\{N\}\_\{e\}\(\\beta\)\|\}\\sum\_\{\\alpha\\in\\mathcal\{N\}\_\{e\}\(\\beta\)\}\\tilde\{e\}\_\{\\alpha\\to\\beta\}\(11\)𝐇βR\\displaystyle\\mathbf\{H\}^\{R\}\_\{\\beta\}←𝐇~βR\+MLPa\(𝐇βR,𝐇~βR\)\\displaystyle\\leftarrow\\tilde\{\\mathbf\{H\}\}^\{R\}\_\{\\beta\}\+\\mathrm\{MLP\}\_\{a\}\\\!\\bigl\(\\mathbf\{H\}^\{R\}\_\{\\beta\},\\,\\tilde\{\\mathbf\{H\}\}^\{R\}\_\{\\beta\}\\bigr\)\(12\)For MLPs initialized near zero, this reduces to averaging neighboring sender representations, encouraging propagation by construction\. These layers are used for inter\-level mappings in the mesh hierarchy\.
#### Flexible Interaction Networks\.
The standard Interaction Networks require all representations to share dimensionalitydzd\_\{z\}\. We relax this constraint by allowing separate dimensionalitiesdsd\_\{s\},drd\_\{r\}, andded\_\{e\}for sender nodes, receiver nodes, and edges, respectively\. The edge MLP maps fromde\+ds\+drd\_\{e\}\+d\_\{s\}\+d\_\{r\}todrd\_\{r\}, and the aggregation MLP maps fromdr\+drd\_\{r\}\+d\_\{r\}todrd\_\{r\}, so that the output matches the receiver node dimensionality\. Because the edge message dimensionalitydrd\_\{r\}differs from the edge input dimensionalityded\_\{e\}, the edge residual update in Eq\. \([8](https://arxiv.org/html/2605.15470#A1.E8)\) is omitted; only the receiver node representations𝐇R\\mathbf\{H\}^\{R\}are returned\. This is not a problem, as the updated edge representations are not needed where these layers are used in Njord\.
#### Flexible Propagation Networks\.
Similarly, we extend Propagation Networks to heterogeneous dimensions\. These layers are slightly more restrictive, as the updated receiver representation will always have the same dimensionality as the sender nodes\. Still, the key flexibility is to allow for a different edge dimensionality\. In the flexible Propagation Network the edge MLP maps fromde\+ds\+drd\_\{e\}\+d\_\{s\}\+d\_\{r\}todsd\_\{s\}, and the aggregation MLP maps fromds\+drd\_\{s\}\+d\_\{r\}todsd\_\{s\}, so that the output matches the sender node dimensionality\. As with the Flexible Interaction Network, the edge residual update can be omitted without issues\.
### A\.3Scaling to ocean grids
For comparison, the Graph\-EFM weather model\[[39](https://arxiv.org/html/2605.15470#bib.bib2)\]is applied on 29 040 global grid nodes and 63 784 regional grid nodes\. Our global ocean model operates on 676 736 grid nodes out of a 680×\\times1440 bounding box \(979 200 total\); and the Baltic Sea regional model on 147 701 grid nodes out of a 738×\\times763 bounding box \(563 094 total\)\. In all cases theGNNprocesses only the interior sea nodes\. This order\-of\-magnitude increase in spatial scale produces grid\-to\-mesh and mesh\-to\-grid graphs with𝒪\(106\)\\mathcal\{O\}\(10^\{6\}\)edges \(Tables[3](https://arxiv.org/html/2605.15470#A3.T3)–[4](https://arxiv.org/html/2605.15470#A3.T4)\), making the capacity of the edge\-embedding MLPs the dominant memory bottleneck\. Since the static edge features are low\-dimensional \(3–4 features\), we reduce the hidden and output dimensionality of the graph encoding and decoding edge MLPs from 256 \(as in SeaCast\[[22](https://arxiv.org/html/2605.15470#bib.bib5)\]\) tode=32d\_\{e\}=32, with no noteworthy effect on forecast skill while yielding significant savings in compute and memory\. Similarly, the grid and bottom\-mesh\-level node representations use dimensionalitydg=128d\_\{g\}=128, compared todz=256d\_\{z\}=256for the mesh processing layers\. Linear projection layers map betweendgd\_\{g\}anddzd\_\{z\}at the encoder–processor and processor–decoder boundaries\. Njord is configured to use 6 processing layers, which amount to 22M trainable parameters in total\. All of these choices, together with gradient checkpointing\[[8](https://arxiv.org/html/2605.15470#bib.bib6)\]at each autoregressive step, enable training on the large ocean grids\.
## Appendix BData Details
Tables[1](https://arxiv.org/html/2605.15470#A2.T1)and[2](https://arxiv.org/html/2605.15470#A2.T2)detail the comprehensive set of variables used to train and evaluate the global ocean model and the Baltic Sea regional model, respectively\. These encompass the internal physical state variables predicted by the model, alongside the external conditioning inputs, which include atmospheric forcings, lateral boundary conditions \(for the regional model\), and static geographic fields\. We train the global model using the GLORYS12 global ocean reanalysis\[[34](https://arxiv.org/html/2605.15470#bib.bib31)\]and finetune on operational GLO12 analysis data\[[35](https://arxiv.org/html/2605.15470#bib.bib30)\]\. The regional model is trained on the2km/2\\text\{\\,\}\\mathrm\{km\}\\text\{/\}Baltic Sea Physics Reanalysis\[[37](https://arxiv.org/html/2605.15470#bib.bib32)\], using GLORYS12 for lateral boundary forcing\. During evaluation, regional boundaries are forced by GLO12 forecasts sourced from OceanBench\[[13](https://arxiv.org/html/2605.15470#bib.bib24)\], which also supplies our global baseline model data\. For both configurations, surface atmospheric forcing uses the ERA5 reanalysis\[[21](https://arxiv.org/html/2605.15470#bib.bib29)\]during training and operational 10\-day IFS forecasts\[[12](https://arxiv.org/html/2605.15470#bib.bib54)\]during evaluation222Data usage and licensing: OceanBench data is provided under the EUPL\-1\.2 license\. ERA5 data is provided by the Copernicus Climate Change Service under the ECMWF Copernicus License\. IFS operational data is provided by ECMWF or through OceanBench\. Oceanographic data was obtained using E\.U\. Copernicus Marine Service Information under the Copernicus Marine Service License \(DOIs: 10\.48670/moi\-00021, 10\.48670/moi\-00016, 10\.48670/moi\-00013, 10\.48670/moi\-00010\)\.\.
Table 1:Variables, static fields, and forcing features for the global ocean dataset\.AbbreviationUnitVertical levelState variablesSea surface height above geoidzosmSea surfaceSea ice area fractionsiconc–Sea surfaceSea ice thicknesssithickmSea surfaceZonal sea water velocityuom/s0, 47, 92, 222, 318, 541 mMeridional sea water velocityvom/s0, 47, 92, 222, 318, 541 mSea water salinitysoPSU0, 47, 92, 222, 318, 541 mSea water potential temperaturethetao°C0, 47, 92, 222, 318, 541 mForcing fieldsSea floor depth below geoiddepthomSea floorMean dynamic topographymdtmSea surfaceSine of longitudesin\_lon––Cosine of longitudecos\_lon––Sine of latitudesin\_lat––Cosine of latitudecos\_lat––Distance to coastcoast\_distm–Sine of time of yearsin\_t––Cosine of time of yearcos\_t––Atmospheric forcing2\-meter air temperaturesotemair°CSea surfaceZonal 10\-meter windsowinu10m/sSea surfaceMeridional 10\-meter windsowinv10m/sSea surfaceDownward shortwave radiation fluxsosudoswW/m2Sea surfaceDownward longwave radiation fluxsosudolwW/m2Sea surfaceTotal precipitation ratesowapreckg m\-2s\-1Sea surface2\-meter dew point temperaturesod2m°CSea surfaceMean sea level pressuresomslprePaSea surfaceTable 2:Variables, static fields, and forcing features for the Baltic Sea dataset\.AbbreviationUnitVertical levelState variablesSea level anomalyslamSea surfaceSea ice area fractionsiconc–Sea surfaceSea ice thicknesssithickmSea surfaceZonal sea water velocityuom/s1, 9, 28, 47, 91 mMeridional sea water velocityvom/s1, 9, 28, 47, 91 mSea water salinitysoPSU1, 9, 28, 47, 91 mSea water potential temperaturethetao°C1, 9, 28, 47, 91 mForcing fieldsSea floor depth below geoiddepthomSea floorMean dynamic topographymdtmSea surfaceProjected x\-coordinatex\_coordm–Projected y\-coordinatey\_coordm–Distance to coastcoast\_distm–Sine of time of yearsin\_t––Cosine of time of yearcos\_t––Atmospheric forcing2\-meter air temperaturesotemair°CSea surfaceZonal 10\-meter windsowinu10m/sSea surfaceMeridional 10\-meter windsowinv10m/sSea surfaceDownward shortwave radiation fluxsosudoswW/m2Sea surfaceDownward longwave radiation fluxsosudolwW/m2Sea surfaceTotal precipitation ratesowapreckg m\-2s\-1Sea surface2\-meter dew point temperaturesod2m°CSea surfaceMean sea level pressuresomslprePaSea surfaceBoundary forcingSea surface height above geoidzosmSea surfaceZonal sea water velocityuom/s0, 47, 92 mMeridional sea water velocityvom/s0, 47, 92 mSea water salinitysoPSU0, 47, 92 mSea water potential temperaturethetao°C0, 47, 92 m
## Appendix CGraph details
### C\.1Cluster\-based graph construction
Previous global graph\-based forecasting models have used icosahedral meshes\[[28](https://arxiv.org/html/2605.15470#bib.bib10),[39](https://arxiv.org/html/2605.15470#bib.bib2)\]for constructing the graph\. Icosahedral meshes are constructed by iteratively subdividing an icosahedron, with each subdivision quadrupling the number of nodes\. This coarse refinement factor creates a practical limitation because the jumps in resolution between different splits are large, e\.g\., 7 splits produce 115 016 nodes after masking out land, compared to 28 753 for 6 splits\. We had to restrict our comparison with icosahedral meshes to 6 splits as it was not possible to fit higher than that in memory during training\. The icosahedral structure both limits the ability to choose the number of mesh nodes, and is poorly adapted to the irregular geometry of the ocean surface\. For better adapting to this geometry, we propose using spatial K\-means clustering with Delaunay triangulation to construct hierarchical meshes\.
For the global model, we use spherical K\-means clustering of the ocean grid point 3D Cartesian coordinates, with latitude\-based area weights to ensure equitable spatial coverage\. Same\-level edges are constructed via spherical Delaunay triangulation \(computed as the convex hull of the 3D points on the unit sphere\), followed by land\-crossing edge filtering\. The refinement factor between levels becomes a continuous parameter rather than a fixed quadrupling, enabling finer control over the mesh hierarchy\. The first mesh level is obtained by clustering theNNocean grid nodes intoN/r0N/r\_\{0\}clusters, wherer0r\_\{0\}is the grid\-to\-first\-mesh refinement factor\. Subsequent levels cluster the previous level’s nodes by a refinement factorrr\. In our global model,r0=20r\_\{0\}=20is used from the grid to the first mesh level and produces 33 777 mesh nodes, compared to 28 753 or 115 016 for 6 or 7 icosahedral splits\. Subsequent levels cluster the previous level’s nodes by a refinement factorr=4r=4\. We apply land\-crossing edge filtering so that the meshes conform to coastlines, bays, and straits\.
For regional graph\-based models, quadrilateral meshes have been used\[[40](https://arxiv.org/html/2605.15470#bib.bib7),[23](https://arxiv.org/html/2605.15470#bib.bib8)\]\. While more flexible in refinement, they do not necessarily line up well with irregular coastlines\. We use Euclidean K\-means applied to the ocean grid node positions\. For the regional mesh the grid\-to\-first\-mesh refinement factor is chosen asr0=9r\_\{0\}=9\. Subsequent levels also use a factorr=9r=9\. This makes the mesh comparable in size to the quadrilateral mesh using a3×33\\times 3coarsening factor\. The Mesh edges at each level are constructed via 2D Delaunay triangulation of the cluster centers, and edges crossing land areas are filtered out\. Inter\-level edges connect each node to its nearest neighbor at the adjacent level\. The full Njord\-Baltic framework, including the regional cluster graph, is shown in[Figure˜12](https://arxiv.org/html/2605.15470#A3.F12)\.
Figure 12:One\-step prediction in the Njord\-Baltic model\. Residuals are predicted at timett, which are then added to the previous stateXt−1X^\{t\-1\}in order to produce the sampleX^t\\hat\{X\}^\{t\}\.Our improved graph construction building on spatial clustering leads to graphs that better conform to the exact geometry of the sea surface\. This is especially noticeable around complex coastlines, such as around islands and straits\. Here we provide additional illustrations of the difference between our cluster\-based graphs and existing graph creation methods\.[Figure˜13](https://arxiv.org/html/2605.15470#A3.F13)shows the full graphs for the global ocean model, and[Figure˜14](https://arxiv.org/html/2605.15470#A3.F14)shows graphs for the Baltic Sea model\. For Njord\-Baltic, all spatial processing and coordinate embeddings are defined in a Lambert Conformal Conic projection centered at \(20°E, 60°N\), which is used also for the graph\. We further show examples of mesh node placement for specific regions in[Figures15](https://arxiv.org/html/2605.15470#A3.F15)–[16](https://arxiv.org/html/2605.15470#A3.F16)and[Figures17](https://arxiv.org/html/2605.15470#A3.F17)–[18](https://arxiv.org/html/2605.15470#A3.F18), for both nodes at the first mesh level𝒢0\\mathcal\{G\}\_\{0\}and the last𝒢2\\mathcal\{G\}\_\{2\}\. These chosen regions serve as clear examples of how our graph creation leads to different node placements\. Nodes in the cluster\-based graphs conform to the coastlines, and are nicely spaced throughout narrow straits\. For the higher graph level𝒢2\\mathcal\{G\}\_\{2\}, the quadrilateral and icosahedral graphs can completely lack any mesh nodes in key areas, since there is nothing in the graph creation that favors placing nodes over sea rather than over land\.
\(a\)Global cluster graph\.
\(b\)Global icosahedral graph\.
Figure 13:Global graphs used by Njord, with grid nodes in blue, encoding/decoding edges in black, and the hierarchical meshes colored in yellow\.\(a\)Baltic Sea cluster graph\.
\(b\)Baltic Sea quadrilateral graph\.
Figure 14:Regional graphs used by Njord, with grid nodes in blue, M2G and G2M edges in black, and the hierarchical meshes colored in yellow\.\(a\)Icosahedral, level 0
\(b\)Cluster, level 0
\(c\)Icosahedral, level 2
\(d\)Cluster, level 2
Figure 15:Example of mesh node placement in the Gulf of California \(latitude22∘22^\{\\circ\}N–33∘33^\{\\circ\}N, longitude117∘117^\{\\circ\}W–105∘105^\{\\circ\}W\)\.\(a\)Icosahedral, level 0
\(b\)Cluster, level 0
\(c\)Icosahedral, level 2
\(d\)Cluster, level 2
Figure 16:Example of mesh node placement in the northern Red Sea and Suez Canal \(latitude10∘10^\{\\circ\}N–33∘33^\{\\circ\}N, longitude28∘28^\{\\circ\}E–44∘44^\{\\circ\}E\)\.\(a\)Quadrilateral, level 0
\(b\)Cluster, level 0
\(c\)Quadrilateral, level 2
\(d\)Cluster, level 2
Figure 17:Example of mesh node placement in the Bråviken bay and Östergötland Archipelago, on the Swedish east coast\.\(a\)Quadrilateral, level 0
\(b\)Cluster, level 0
\(c\)Quadrilateral, level 2
\(d\)Cluster, level 2
Figure 18:Example of mesh node placement in the Turku Archipelago in south\-western Finland\.
### C\.2Grid–mesh connections
Grid\-to\-mesh \(G2M\) edges connect each grid node to all mesh nodes within a radius of0\.67dm0\.67\\,d\_\{m\}, wheredmd\_\{m\}is the mean edge length at the bottom mesh level\. Mesh\-to\-grid \(M2G\) edges connect each interior grid node to itsk=3k=3nearest mesh nodes\. Edges crossing land are filtered out: an edge is removed if its midpoint is closer to a land grid node than to a sea grid node, or if the edge exceeds a maximum length threshold\. For the regional model \(projected coordinates\), this threshold is20km/20\\text\{\\,\}\\mathrm\{km\}\\text\{/\}; for the global model \(unit\-sphere chord distance\), it is 0\.1, corresponding to approximately5\.7∘5\.7^\{\\circ\}of great\-circle arc\.
### C\.3Static graph features
Each edge in the graph carries a small set of static features that are embedded by an MLP before being used in theGNNlayers\. For the regional model, edge features consist of the edge length and the 2D displacement vector\(Δx,Δy\)\(\\Delta x,\\Delta y\)in projected coordinates, yielding 3 features per edge\. For the global model, edge features consist of the chord length and the 3D Cartesian displacement\(Δx,Δy,Δz\)\(\\Delta x,\\Delta y,\\Delta z\)on the unit sphere, yielding 4 features per edge\. Mesh node features at each hierarchy level are similarly embedded\. For the regional model, these are the projected position\(x,y\)\(x,y\)and the Voronoi cell area \(3 features\)\. For the global model, node features are\(sinλ,cosλ,sinϕ,cosϕ\)\(\\sin\\lambda,\\cos\\lambda,\\sin\\phi,\\cos\\phi\), whereλ\\lambdaandϕ\\phiare the longitude and latitude, plus the spherical Voronoi cell area \(5 features\)\. GraphCast\[[28](https://arxiv.org/html/2605.15470#bib.bib10)\]usescosϕ\\cos\\phi,sinλ\\sin\\lambda, andcosλ\\cos\\lambdaas mesh node features; we additionally includesinϕ\\sin\\phi, which is directly proportional to the Coriolis parameterf=2Ωsinϕf=2\\Omega\\sin\\phigoverning the influence of Earth’s rotation on ocean currents\. The Voronoi cell area for each mesh node is computed as the area of the corresponding Voronoi cell: in projected coordinates for the regional model and on the unit sphere \(via spherical Voronoi tessellation\) for the global model\. Coastal nodes whose Voronoi cells extend over land are assigned zero area\. All edge features are normalized by the longest mesh edge length, and mesh node features are min–max normalized across hierarchy levels\.
### C\.4Graph statistics
Tables[3](https://arxiv.org/html/2605.15470#A3.T3)through[6](https://arxiv.org/html/2605.15470#A3.T6)summarize the hierarchical node and edge counts for the various graph architectures evaluated in this work\. For the global configuration, we have the K\-means cluster mesh[Table˜3](https://arxiv.org/html/2605.15470#A3.T3)\) and the icosahedral mesh[4](https://arxiv.org/html/2605.15470#A3.T4)across both the1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}pretraining and0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}finetuning resolutions\. For the Baltic Sea, we use a cluster graph[Table˜6](https://arxiv.org/html/2605.15470#A3.T6)for Njord and a quadrilateral mesh[Table˜6](https://arxiv.org/html/2605.15470#A3.T6)for SeaCast\.
Table 3:Number of nodes and edges in the global cluster graph\.DatasetResolutionGraphNodesEdges𝒢0\\mathcal\{G\}\_\{0\}33777197348𝒢0,1/𝒢1,0\\mathcal\{G\}\_\{0,1\}/\\mathcal\{G\}\_\{1,0\}\-33777Rea\./Ana\.1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}/0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}𝒢1\\mathcal\{G\}\_\{1\}840948198𝒢1,2/𝒢2,1\\mathcal\{G\}\_\{1,2\}/\\mathcal\{G\}\_\{2,1\}\-8409𝒢2\\mathcal\{G\}\_\{2\}208811592Total44274341510Reanalysis1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}𝒢G2M\\mathcal\{G\}\_\{\\text\{G2M\}\}\-73077𝒢M2G\\mathcal\{G\}\_\{\\text\{M2G\}\}\-126594Grid42348\-Reanalysis0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}𝒢G2M\\mathcal\{G\}\_\{\\text\{G2M\}\}\-1167675𝒢M2G\\mathcal\{G\}\_\{\\text\{M2G\}\}\-2021392Grid676736\-Analysis0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}𝒢G2M\\mathcal\{G\}\_\{\\text\{G2M\}\}\-1165790𝒢M2G\\mathcal\{G\}\_\{\\text\{M2G\}\}\-2016228Grid675219\-Table 4:Number of nodes and edges in the global icosahedral graph\.DatasetResolutionGraphNodesEdges𝒢0\\mathcal\{G\}\_\{0\}28753166354𝒢0,1/𝒢1,0\\mathcal\{G\}\_\{0,1\}/\\mathcal\{G\}\_\{1,0\}\-28753Rea\./Ana\.1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}/0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}𝒢1\\mathcal\{G\}\_\{1\}719440662𝒢1,2/𝒢2,1\\mathcal\{G\}\_\{1,2\}/\\mathcal\{G\}\_\{2,1\}\-7194𝒢2\\mathcal\{G\}\_\{2\}18179862Total37764288772Reanalysis1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}𝒢G2M\\mathcal\{G\}\_\{\\text\{G2M\}\}\-67914𝒢M2G\\mathcal\{G\}\_\{\\text\{M2G\}\}\-126408Grid42348\-Reanalysis0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}𝒢G2M\\mathcal\{G\}\_\{\\text\{G2M\}\}\-1088358𝒢M2G\\mathcal\{G\}\_\{\\text\{M2G\}\}\-2016626Grid676736\-Analysis0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}𝒢G2M\\mathcal\{G\}\_\{\\text\{G2M\}\}\-1086477𝒢M2G\\mathcal\{G\}\_\{\\text\{M2G\}\}\-2011528Grid675219\-Table 5:Number of nodes and edges in the Baltic Sea cluster graph\.
Table 6:Number of nodes and edges in the Baltic Sea quadrilateral graph\.
## Appendix DTraining Details
### D\.1Loss functions
The base loss function accounts for the ocean’s bathymetric structure, following SeaCast\[[22](https://arxiv.org/html/2605.15470#bib.bib5)\]:
ℒbase=1T∑t=1T∑i=1Cλi∑l=1LiwlNl∑v=1Nlavℓ\(X^v,it,Xv,it,σ^v,it\)\\mathcal\{L\}\_\{\\mathrm\{base\}\}=\\frac\{1\}\{T\}\\sum\_\{t=1\}^\{T\}\\sum\_\{i=1\}^\{C\}\\lambda\_\{i\}\\sum\_\{l=1\}^\{L\_\{i\}\}\\frac\{w\_\{l\}\}\{N\_\{l\}\}\\sum\_\{v=1\}^\{N\_\{l\}\}a\_\{v\}\\,\\ell\\mathopen\{\}\\mathclose\{\{\\left\(\\hat\{X\}^\{t\}\_\{v,i\},\\,X^\{t\}\_\{v,i\},\\,\\hat\{\\sigma\}^\{t\}\_\{v,i\}\}\}\\right\)\(13\)whereTTis the number of autoregressive rollout steps,CCis the number of variables,LiL\_\{i\}is the number of depth levels for variableii,NlN\_\{l\}is the number of ocean grid nodes at depth levelll,ava\_\{v\}is the latitude–longitude area weight for grid cellvv\(normalized to unit mean\),wlw\_\{l\}is a depth\-level weight that we configure as1/Li1/L\_\{i\}for depth\-resolved variables and 0\.5 for all surface\-level variables, and lastlyλi\\lambda\_\{i\}is the inverse variance of one\-step time differences for variableii\. The per\-entry lossℓ\\ellis theELBOandCRPSfor Njord\.
TheCRPSterm in Eq\. \([2](https://arxiv.org/html/2605.15470#S4.E2)\) employs the almost\-fairCRPS\(afCRPS\) estimator introduced by\[[30](https://arxiv.org/html/2605.15470#bib.bib23)\], which interpolates between the biased and unbiased \(fair\)CRPSestimators\[[15](https://arxiv.org/html/2605.15470#bib.bib14)\]\. The loss can be written as:
ℒCRPS=1M∑m=1M\|x^m−x\|−1−ε2M\(M−1\)∑m=1M∑m′=1M\|x^m−x^m′\|,ε=1−αM\\mathcal\{L\}\_\{\\mathrm\{\\lx@glossaries@gls@link\{acronym\}\{CRPS\}\{\{\{\}\}CRPS\}\}\}=\\frac\{1\}\{M\}\\sum\_\{m=1\}^\{M\}\|\\hat\{x\}\_\{m\}\-x\|\-\\frac\{1\-\\varepsilon\}\{2M\(M\-1\)\}\\sum\_\{m=1\}^\{M\}\\sum\_\{m^\{\\prime\}=1\}^\{M\}\|\\hat\{x\}\_\{m\}\-\\hat\{x\}\_\{m^\{\\prime\}\}\|,\\qquad\\varepsilon=\\frac\{1\-\\alpha\}\{M\}\(14\)wherex^m\\hat\{x\}\_\{m\}is one scalar dimension from an ensemble memberX^mt\\hat\{X\}^\{t\}\_\{m\}sampled fromp^θ\\hat\{p\}\_\{\\theta\},xxis the target, andα∈\(0,1\]\\alpha\\in\(0,1\]controls the interpolation \(α=1\\alpha=1recovers the fairCRPS\)\. We useα=0\.95\\alpha=0\.95andM=2M=2ensemble members during training\. TheCRPSloss is normalized per variable by dividing by the per\-variable standard deviation and reduced using the same spatial mask and area weighting as Eq\. \([13](https://arxiv.org/html/2605.15470#A4.E13)\)\.
### D\.2Training curriculum
#### Global schedule\.
The global Njord model follows a two\-stage resolution training schedule, transitioning from coarse to fine resolution as detailed in Table[7](https://arxiv.org/html/2605.15470#A4.T7)\. The model is first pretrained on1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution data for 325 epochs using cosine learning rate annealing\. Subsequently, it is finetuned on0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution data for an additional 165 epochs, utilizing a 5\-epoch linear warmup followed by cosine decay\. Because Njord’s grid\-to\-mesh encoder’s uses Flexible Propagation Networks \(Eq\. \([11](https://arxiv.org/html/2605.15470#A1.E11)\)\) with mean aggregation no rescaling of incoming messages used for similar models with sum aggregation\[[43](https://arxiv.org/html/2605.15470#bib.bib34)\]is needed when moving to higher resolution\.
Finetuning with the afCRPSloss requires sampling two more trajectories, which makes it more expensive to train\. Fortunately you can get away with doing this for only a few epochs\. For the global case which has a much larger grid than the regional model, we especially pay attention to this by choosing a highλCRPS=106\\lambda\_\{\\mathrm\{\\lx@glossaries@gls@link\{acronym\}\{CRPS\}\{\{\{\}\}CRPS\}\}\}=10^\{6\}for 5 epochs only\. The model is further fine\-tuned on analysis data as seen in[Table˜7](https://arxiv.org/html/2605.15470#A4.T7)\. Note when looking at epochs and GPUh for analysis fine\-tuning that it consists of less training data \(1 year\) compared to reanalysis \(28 years\)\.
Table 7:Training schedule and hyperparameter configuration for the global Njord model\. Pretraining \(1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}\) follows a cosine annealing schedule from10−310^\{\-3\}to10−510^\{\-5\}, while finetuning \(0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}\) incorporates a 5\-epoch linear warmup from10−510^\{\-5\}to10−410^\{\-4\}followed by cosine decay from10−410^\{\-4\}to10−510^\{\-5\}\. Finetuning on the analysis dataset uses a constant10−510^\{\-5\}learning rate\.DatasetResolutionEpochsλKL\\lambda\_\{\\mathrm\{KL\}\}λCRPS\\lambda\_\{\\mathrm\{\\lx@glossaries@gls@link\{acronym\}\{CRPS\}\{\{\{\}\}CRPS\}\}\}UnrollingTTGPUhReanalysis1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}100001600Reanalysis1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}2000\.1011300Reanalysis1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}250\.102320Reanalysis0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}5001110Reanalysis0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}1500\.1013300Reanalysis0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}50\.102210Reanalysis0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}50\.110610^\{6\}2530Analysis0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}1000\.110610^\{6\}2590Analysis0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}1000\.110710^\{7\}71970
#### Regional schedule\.
The Njord\-Baltic model is trained for 350 epochs using a staged curriculum and cosine learning rate annealing\. As shown in Table[8](https://arxiv.org/html/2605.15470#A4.T8), the process begins with 100 epochs of pure autoencoder training \(T=1,λKL=0T=1,\\lambda\_\{\\mathrm\{KL\}\}=0\) to establish the base representation\. We then introduce the KL divergence term \(λKL=0\.1\\lambda\_\{\\mathrm\{KL\}\}=0\.1\) for 200 epochs to align the prior with the approximate posterior\. The final 50 epochs focus on temporal consistency and calibration: first by unrolling the model to two steps \(T=2T=2\) for 25 epochs, and then by incorporating theCRPSloss \(λCRPS=104\\lambda\_\{\\mathrm\{\\lx@glossaries@gls@link\{acronym\}\{CRPS\}\{\{\{\}\}CRPS\}\}\}=10^\{4\}\) for the remaining 25 epochs to optimize the ensemble spread\. Lastly, Njord\-Baltic is finetuned for 50 epochs on 1 year of analysis data\.
Table 8:Training schedule and hyperparameter configuration for the Njord\-Baltic model, using a cosine learning rate annealing schedule from10−310^\{\-3\}to10−510^\{\-5\}over 350 epochs\. Finetuning on the analysis dataset uses a constant10−510^\{\-5\}learning rate\.
#### Deterministic baseline\.
We train SeaCast as a baseline using the 3\-level icosahedral mesh in the global setting, and quadrilateral mesh in the regional setting\. Note that SeaCast is a regional model but we generalize to the globe\. SeaCast is configured with 3 processing layers and latent dimension 256\[[23](https://arxiv.org/html/2605.15470#bib.bib8)\], and trained for 175 epochs with the weighted MSE loss and cosine learning rate annealing where 150 epochs are single\-step training, followed by 25 epochs with two\-step autoregressive rollout to improve temporal stability\. The learning rate scheduler is 5 epoch linear warmup from10−510^\{\-5\}to10−410^\{\-4\}followed by cosine decay from10−410^\{\-4\}to10−510^\{\-5\}\. In the global setting, SeaCast is further trained on0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}data for 60 epochs, where 50 are 1\-step prediction, and the last 10 are 2\-step prediction with 5 epoch linear warmup from10−510^\{\-5\}to10−410^\{\-4\}followed by cosine decay from10−410^\{\-4\}to10−510^\{\-5\}\. Lastly it is finetuned on analysis data for 75 epochs, where 50 epochs are 2\-step training and the last 25 epochs use 7 steps, with a constant10−510^\{\-5\}learning rate\. In the regional setting, the SeaCast model is finetuned on Baltic Sea analysis data for 25 epochs with 2\-step prediction and a constant10−510^\{\-5\}learning rate\.
#### Implementation details\.
All models are optimized with AdamW\[[36](https://arxiv.org/html/2605.15470#bib.bib35)\]withβ1=0\.9\\beta\_\{1\}=0\.9,β2=0\.95\\beta\_\{2\}=0\.95and a weight decay of0\.10\.1\. Gradient checkpointing\[[8](https://arxiv.org/html/2605.15470#bib.bib6)\]is employed at each autoregressive step to fit the model into GPU memory for global training at0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution\. The training uses bfloat 16 precision and evaluation is performed with float 32 precision\. All models are trained using 64 AMD MI250X GPUs, except global0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}reanalysis training which ran on 128 AMD MI250X GPUs\. The GPUs each have 64GB VRAM\.
## Appendix EAdditional results
### E\.1Input and forcing steps
We study the effect of the temporal context provided to Njord by comparing the default configuration, which uses two input states\(t−2,t−1\)\(t\{\-\}2,t\{\-\}1\)and three forcing steps\(t−2,t−1,t\)\(t\{\-\}2,t\{\-\}1,t\), against two ablations: \(i\) the same two input states with a single forcing step att−1t\{\-\}1, and \(ii\) a single input state and forcing step att−1t\{\-\}1\. This ablation compares models trained using the1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}pretraining schedule\. Performance is reported per variable, depth and lead time as the normalizedRMSEdifference between the target configurationAAand the baselineBBasΔCRPS=CRPSA−CRPSBCRPSB\\Delta\_\{\\text\{CRPS\}\}\\;=\\;\\frac\{\\text\{CRPS\}\_\{A\}\-\\text\{CRPS\}\_\{B\}\}\{\\text\{CRPS\}\_\{B\}\}\. Negative \(blue\) values therefore indicate that the default configuration outperforms the ablation, while positive \(red\) values indicate degradation\.[Figure˜19](https://arxiv.org/html/2605.15470#A5.F19)show that the default configuration with two input steps and three forcing steps is broadly preferred\. Sea ice predictions specifically seems to benefit most from having two input steps\.
Figure 19:Ensemble meanCRPSscorecards\. The heatmaps display the relative difference between Njord with 2 input steps and 3 forcing steps versus using one two input steps and 1 forcing step and 1 input step and 1 forcing step across all ocean variables\. Blue indicates better performance by the original approach with 2 input steps and 3 forcing steps\.
### E\.2Graph type
The global cluster graph generally achieves lowerRMSEandCRPScompared to the icosahedral graph as seen in[Figure˜20](https://arxiv.org/html/2605.15470#A5.F20)\. In this experiment two Njord models with the cluster and icosahedral graphs in[Tables3](https://arxiv.org/html/2605.15470#A3.T3)–[4](https://arxiv.org/html/2605.15470#A3.T4)are trained on1°/1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}reanalysis data according to the pretraining schedule in[Table˜7](https://arxiv.org/html/2605.15470#A4.T7)\. The slight performance advantage from the cluster graph is partly attributed to the higher spatial density of the cluster graph, which utilizes 33,777 mesh nodes, whereas the three\-layer \(6, 5, 4 split\) icosahedral graph contains 28,753 nodes\. As demonstrated in[Figures15](https://arxiv.org/html/2605.15470#A3.F15)–[16](https://arxiv.org/html/2605.15470#A3.F16), the K\-means clustering approach also ensures that mesh nodes are evenly distributed over sea areas, maintaining coverage even within narrow bays and complex coastal geometries\. The comparison could be made even stronger by training on higher resolution data with both meshes, but it becomes unnecessarily computationally expensive to do so for this ablation\.
The cluster graph has the added benefit that it allows for a continuous selection of node counts, where one can choose the mesh resolution to what fits in GPU memory\. In contrast, icosahedral graphs are constrained by discrete subdivision levels, resulting in significant jumps in resolution between splits\. For example, after masking land areas, an increase from 6 to 7 splits results in a roughly fourfold increase in nodes, from 28,753 to 115,016, that then did not fit in memory anymore with a high hidden dimension\.
Figure 20:The heatmaps display the relative difference inRMSEandCRPSbetween Njord trained with the global cluster graph and the icosahedral graph across all ocean variables\.
### E\.3Sea ice treatment
We evaluate three strategies for enforcing sea ice constraints:
#### Unconstrained\.
The unconstrained model predicts state incrementsΔXt\\Delta X^\{t\}directly, with the next state obtained asX^t\+1=Xt\+ΔXt\\hat\{X\}^\{t\+1\}=X^\{t\}\+\\Delta X^\{t\}\. No bounds are enforced during training or inference\. We observe that this leads to unphysical sea ice values already at short lead times, with concentrations outside\[0,1\]\[0,1\]and negative thicknesses appearing at moderate rollout lengths\.
#### Output clamping\.
To constrain the sea ice variables to realistic bounds we apply smooth invertible activation functions in the residual update itself\. For variables with both a lower and upper bound \(heresiconcwith bounds\[0,1\]\[0,1\]\), we use a rescaled sigmoid:
fsig\(x\)=xL\+\(xU−xL\)σ\(x\),fsig−1\(y\)=σ−1\(y−xLxU−xL\)f\_\{\\mathrm\{sig\}\}\(x\)=x^\{L\}\+\(x^\{U\}\-x^\{L\}\)\\,\\sigma\(x\),\\qquad f\_\{\\mathrm\{sig\}\}^\{\-1\}\(y\)=\\sigma^\{\-1\}\\\!\\\!\\mathopen\{\}\\mathclose\{\{\\left\(\\frac\{y\-x^\{L\}\}\{x^\{U\}\-x^\{L\}\}\}\}\\right\)\(15\)For variables with only a lower bound \(heresithickwithxL=0x^\{L\}=0\), we use a shifted softplus:fsp\(x\)=xL\+softplus\(x\)f\_\{\\mathrm\{sp\}\}\(x\)=x^\{L\}\+\\mathrm\{softplus\}\(x\), with corresponding inverse\. The clamped next state is then computed as
X^vt\+1=f\(f−1\(Xvt\)\+ΔXvt\)\\hat\{X\}^\{t\+1\}\_\{v\}=f\\\!\\mathopen\{\}\\mathclose\{\{\\left\(f^\{\-1\}\(X^\{t\}\_\{v\}\)\+\\Delta X^\{t\}\_\{v\}\}\}\\right\)\(16\)This formulation operates in the unconstrained latent space off−1f^\{\-1\}, adds the predicted increment there, and maps back throughff\. Bothffandf−1f^\{\-1\}are smooth and differentiable everywhere\.
#### Density channel\.
Purely using a soft clamping may lead to accumulation of small deviations from zero ice over time, so in addition to the clamping we adopt a density channel mechanism\[[18](https://arxiv.org/html/2605.15470#bib.bib27)\]used previously for handling missing wave data in the Aurora foundation model\[[5](https://arxiv.org/html/2605.15470#bib.bib28)\]\. In Aurora, each ocean wave variable receives its own density channel indicating whether a measurement is present \(1\) or absent \(0\), allowing the model to represent the absence of wave data\. In our case a single binary density channeld∈\{0,1\}d\\in\\\{0,1\\\}is constructed fromSIC:d=𝕀\[SIC\>0\]d=\\mathbb\{I\}\[\\text\{SIC\}\>0\]\. This channel is appended to the model state, and predicted alongside all other variables\. During autoregressive rollout, the predicted density logitd^raw\\hat\{d\}\_\{\\mathrm\{raw\}\}is passed through a sigmoid and thresholded at0\.50\.5\. Where the predicted density falls below this threshold \(indicating no ice\), the density channel,SIC, andSITare all set to their normalized\-zero values in the feedback state passed to the next step\. Where density exceeds the threshold, the density channel is set to its normalized\-one value\. This ensures that the model receives a clean, zero\-ice input at ice\-free locations rather than a small residual value that can accumulate\. The raw predictions before applying the threshold are used for loss computation\.
The unconstrained variation misses the structure of sea ice near the Antarctic, compared to clamping or density \+ clamping, denoted as just density\. Postprocessing to sea ice bounds looks fairly good, but the correlation remains lower than when clamping is used in[Figure˜22](https://arxiv.org/html/2605.15470#A5.F22)\. Purely clamping leads to an unfavorable buildup ofSICnear the equator, present in all ensemble members at high lead times\. Clamping \+ density leads to betterCRPScompared to the unconstrained approach as seen in[Figure˜23](https://arxiv.org/html/2605.15470#A5.F23), by a fair amount which is a bit surprising, considering it should affect mainly sea ice\. Purely clamping leads to a very comparableCRPSto when the density channel is used on top\.
Figure 21:Spatial evaluation ofSICat a 30\-day lead time\. The panels compare the ground truth against the ensemble mean and ensemble standard deviation for different ways of handling ice boundaries\.Figure 22:Log\-scaled scatter density heatmaps evaluating predicted versus observedSICandSITat a 30\-day lead time\. Each panel includes a 1:1 reference line \(dashed\), a linear line of best fit \(solid red\), and the Pearson correlation coefficient \(rr\)\.Figure 23:Ensemble meanCRPSscorecards\. The heatmaps display the relative difference between the density \+ clamping model versus the unconstrained and clamping only approaches across all ocean variables\. Blue indicates better performance by the density \+ clamping approach\.
### E\.4OceanBench evaluation
To assess the global forecasting capabilities of Njord, we evaluate it using the OceanBench framework\[[13](https://arxiv.org/html/2605.15470#bib.bib24)\]\. OceanBench provides a standardized benchmark for data\-driven ocean forecasting by comparing models against the operational physics\-based system, GLO12\[[35](https://arxiv.org/html/2605.15470#bib.bib30)\], as well as state\-of\-the\-art deep learning baselines including GLONET\[[14](https://arxiv.org/html/2605.15470#bib.bib18)\], WenHai\[[9](https://arxiv.org/html/2605.15470#bib.bib19)\], and XiHe\[[46](https://arxiv.org/html/2605.15470#bib.bib20)\]\. In OceanBench the native resolution of each model is used when comparing to the analysis or reanalysis targets\.
#### Evaluation on the GLORYS12 reanalysis track\.
[Table˜9](https://arxiv.org/html/2605.15470#A5.T9)presents the evaluation against the independent GLORYS12 reanalysis\[[34](https://arxiv.org/html/2605.15470#bib.bib31)\]track333We use the updated evaluation procedure and values from the live OceanBench webpage \([https://oceanbench\.lab\.dive\.edito\.eu](https://oceanbench.lab.dive.edito.eu/)\)\. These have been updated after the publication of the original paper\[[13](https://arxiv.org/html/2605.15470#bib.bib24)\]\.\. In this setting, Njord demonstrates competitive performance compared to the physics\-based GLO12 system\. Njord shows particular strength forecasting zonal and meridional currents, where it consistently outperforms GLO12 \(indicated by the blue cells\), as well as temperature down to50m50\\text\{\\,\}\\mathrm\{m\}\. Njord is skillful at predicting geostrophic currents at the surface\. Geostrophic currents provide a diagnostic of large\-scale ocean circulation and transport\. Under the geostrophic approximation, these currents are derived directly from forecasted SSH\. Because Njord maintains highly accurate and stable predictions of SSH, this fidelity translates directly into superior geostrophic current forecasts\.
Table 9:Scorecard for the GLORYS12 reanalysis track\. Colors represent the normalizedRMSEdifference with respect to the GLO12 operational baseline\. Blue cells indicate that the model outperforms GLO12, while red indicates higher error\.≤−0\.2\\leq\-0\.20≥\+0\.2\\geq\+0\.2Normalized RMSE Difference w\.r\.t\. GLO12
#### Evaluation on the GLO12 analysis track\.
[Table˜10](https://arxiv.org/html/2605.15470#A5.T10)presents the evaluation on the GLO12 analysis track\. Because the GLO12 forecast model and the GLO12 analysis share the exact same underlying physical parameterizations, it achieves the lowestRMSEwhen evaluated against its own analysis fields that are produced through a weekly data assimilation cycle applied to GLO12 forecasts\. Especially if the observations are sparse, which they generally are in the global ocean, it can be difficult to outperform the physical simulator on this benchmark\. On the other hand, the machine learning models are biased to perform better than GLO12 on the reanalysis benchmark\. A more independent view of performance is shown when comparing to observations in[Table˜11](https://arxiv.org/html/2605.15470#A5.T11)and[Figures˜24](https://arxiv.org/html/2605.15470#A5.F24)and[25](https://arxiv.org/html/2605.15470#A5.F25), where the machine learning models generally show favorable results compared to GLO12\.
Table 10:Scorecard for the GLO12 analysis track\. Colors represent the normalizedRMSEdifference with respect to the GLO12 operational baseline\. Blue cells indicate that the model outperforms GLO12, while red indicates higher error\.≤−1\.0\\leq\-1\.00≥\+1\.0\\geq\+1\.0Normalized RMSE Difference w\.r\.t\. GLO12
#### Evaluation on the observation track\.
[Table˜11](https://arxiv.org/html/2605.15470#A5.T11)presents the evaluation of the models against thein\-situobservations curated within the IV\-TT CLASS\-4 framework\. At the surface and near\-surface layers, Njord demonstrates strong predictive skill, consistently outperforming the operational GLO12 baseline when evaluated against surface drifting buoys measurements ofSST, shallow Argo \(global array of autonomous ocean profiling floats\) measurements of 0–5m temperature and salinity, and drifters measurements of 15m currents\. However, Njord’s performance degrades relative to the baseline at intermediate depth layers, which is particularly evident when compared against deeper Argo profiles for 100–300m temperature and salinity\. This decrease in skill at depth is an expected limitation, as Njord currently models fewer vertical depth levels than the other models\. Consequently, this lower vertical resolution provides a less granular representation of the 3D ocean state, which becomes apparent when validating against observational data spread across the water column\.
Table 11:Scorecard for the observation track within the IV\-TT CLASS\-4 framework\. Colors represent the normalizedRMSEdifference with respect to the GLO12 operational baseline\. Blue indicate that the model outperforms GLO12 against observations, while red indicates a higher error\.≤−0\.3\\leq\-0\.30≥\+0\.3\\geq\+0\.3Normalized RMSE Difference w\.r\.t\. GLO12
### E\.5Evaluation against SST observations
To evaluate the models’SSTforecasts, we used the potential temperature of the uppermost ocean layer, benchmarking these predictions against global ocean adjustedSST\[[11](https://arxiv.org/html/2605.15470#bib.bib4)\]\.
Figure 24:GlobalRMSEofSSTby forecast lead time, where Njord has the lowest error compared to satellite measurements\.The dataset merges multi\-sensor satellite observations into a Level\-3 global grid\.[Figure˜24](https://arxiv.org/html/2605.15470#A5.F24)shows the globally averagedRMSEforSSTover a 10\-day forecast horizon where the models are interpolated to the0\.1°/0\.1\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}SSTgrid\. Njord maintains the lowestRMSEacross all lead times\. The other machine learning models show higher error growth, and WenHai especially so, which is interesting considering it has atmospheric forcing that e\.g\. GLONET lacks\. By day 10, Njord’sRMSEis below 0\.6°C, remaining lower than the other machine learning models and the GLO12 baseline\.
[Figure˜25](https://arxiv.org/html/2605.15470#A5.F25)maps the normalizedRMSEdifference for lead times of 1, 4, 7, and 10 days\. Compared to GLONET and XiHe, Njord shows lower error across most ocean basins\. Compared to GLO12, Njord performs slightly better across the global ocean\. Note that GLO12 and XiHe operate at 3 times higher resolution\.
Figure 25:Spatial distribution of normalizedRMSEdifference forSSTbetween Njord ensemble mean and three baselines\. Blue indicates lower error for Njord\.
### E\.6Global metrics
We evaluate Njord on a global scale against the operational physics\-based GLO12 model\[[35](https://arxiv.org/html/2605.15470#bib.bib30)\], and deep learning baselines including GLONET\[[14](https://arxiv.org/html/2605.15470#bib.bib18)\], our global variant of SeaCast\[[22](https://arxiv.org/html/2605.15470#bib.bib5)\], and a Persistence forecast\. The ground truth used for verification is the global analysis\. We use this as reference, because the initial conditions are from the same product, and the only time a highSSRin training translated to equally highSSRduring evaluation was when we initialized and evaluated on the same source of data\. TheRMSEresults are expected to be similar to what is shown in[Table˜10](https://arxiv.org/html/2605.15470#A5.T10), but here all models are evaluated on the same0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution\. To do this GLO12 is downsampled from its native0\.083°/0\.083\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution using bilinear interpolation\. GLONET is like Njord a0\.25°/0\.25\\text\{\\,\}\\mathrm\{\\SIUnitSymbolDegree\}\\text\{/\}resolution model\. Note that this model is expected to behave differently because it is trained only on reanalysis\. SeaCast, on the other hand, provides a deterministic baseline trained in a similar manner as Njord\.
Ensemble metrics for Njord, specifically theSSR, theCRPS, and theRMSEof the ensemble mean, are compared against the deterministic baselines\. For the deterministic models \(GLO12, GLONET, WenHai, XiHe, SeaCast, and Persistence\),Mean Absolute Error \(MAE\)is shown alongside Njord’sCRPSas a comparable deterministic reference\. Mathematically, theCRPSevaluates the distance between the predictiveCumulative Distribution Function \(CDF\)and the empiricalCDFof the observation\. In the deterministic limit, where the predictive distribution is a Dirac delta function \(a point mass\) at the predicted value, theCRPSreduces exactly to theMAE\. We use the unbiasedCRPSestimator, corresponding toα=1\\alpha=1in[Equation˜14](https://arxiv.org/html/2605.15470#A4.E14)\. TheSSRcomputation includes the finite ensemble size correction fromFortinet al\.\[[17](https://arxiv.org/html/2605.15470#bib.bib55)\]\. To account for varying grid cell areas, the metrics are weighted by the cosine of the latitude, normalized to unit mean\.
[Figures26](https://arxiv.org/html/2605.15470#A5.F26)–[29](https://arxiv.org/html/2605.15470#A5.F29)report these metrics per variable and lead time\. The figure panels are organized into three columns:SSRon the left,CRPS\(orMAE\) in the middle, and RMSE on the right\. TheSSRis defined as the ratio between the standard deviation of the ensemble and theRMSEof the ensemble mean; values close to one indicate a well\-calibrated ensemble\. Because zonal and meridional currents exhibit very similar error accumulation patterns, only the zonal components are shown here\.
Figure 26:Surface variables:SSH,SIC, andSIT\. Columns from left to right showRMSE,CRPS, andSSR\.Figure 27:Temperature at six different depths\. Columns from left to right showRMSE,CRPS, andSSR\.Figure 28:Salinity at six different depths\. Columns from left to right showRMSE,CRPS, andSSR\.Figure 29:Zonal current at six different depths\. Columns from left to right showRMSE,CRPS, andSSR\.
### E\.7Ensemble size comparison
[Figure˜30](https://arxiv.org/html/2605.15470#A5.F30)demonstrates the impact of varying ensemble size \(MM\) on Njord’s predictive performance, measured as the relativeRMSEdifference compared to a baseline ofM=5M=5\. Increasing the ensemble size toM=20M=20yields a systematicRMSEreduction across all evaluated variables and depth levels\. While larger ensembles produce more accurate deterministic mean forecasts by better sampling predictive uncertainty, these gains must be weighed against the linear increase in computational cost\.
Figure 30:NormalizedRMSEdifference for various variables and depth levels, comparing ensemble sizesM∈\{10,15,20\}M\\in\\\{10,15,20\\\}against a baseline ensemble size ofM=5M=5\. Values below zero indicate a reduction inRMSErelative to the baseline\.
### E\.8Global forecasts
To illustrate the qualitative behavior of Njord on a global scale, we present example ensemble forecasts initialized on 24 December 2024 at a lead time of 10 days in[Figures31](https://arxiv.org/html/2605.15470#A5.F31)–[37](https://arxiv.org/html/2605.15470#A5.F37)\. For each variable, we show the analysis target, the ensemble mean, the ensemble standard deviation, and three individual ensemble members\.
The individual ensemble members appear sharp and exhibit noticeable variability, whereas the ensemble mean is smoother due to averaging\. Sea\-ice fields display well\-defined edges and are exactly zero in ice\-free regions, reflecting the use of clamping and a dedicated density channel\. The ensemble standard deviation is elevated near the ice edge, where its position varies between members, and forSIT, it is also elevated within regions where ice is present\.
For potential temperature andSSH, the ensemble spread is most pronounced in dynamically active regions characterized by sharp thermal fronts and turbulent eddies\. This is particularly evident along major western boundary currents, such as the Gulf Stream and Kuroshio, as well as the Agulhas Retroflection and the Antarctic Circumpolar Current\. In these regions, small spatial disagreements between members regarding the exact placement of a meandering current or a newly formed eddy translate into high local variance\.
In contrast, the uncertainty patterns for salinity and ocean currents are governed by distinctly different physical drivers\. Salinity deviations are heavily dominated by freshwater dynamics, with the highest ensemble spread localized around massive river outflows, such as the Amazon, Congo, and Ganges\-Brahmaputra plumes\. In these areas, slight variations in predicted coastal winds or surface currents drastically shift the floating lenses of fresh water\. For velocity, particularly the zonal current, the ensemble standard deviation features a striking band of uncertainty directly across the Equator\. This highlights the model’s spread in resolving Tropical Instability Waves and the chaotic shear between opposing equatorial currents, compounding the variance already seen in eddy\-rich boundary regions\.
Figure 31:Sea ice concentration at lead time 10 d, init 2024\-12\-24\.Figure 32:Sea ice thickness at lead time 10 d, init 2024\-12\-24\.Figure 33:Temperature at the surface, lead time 10 d, init 2024\-12\-24\.Figure 34:Salinity at the surface, lead time 10 d, init 2024\-12\-24\.Figure 35:Zonal current at the surface, lead time 10 d, init 2024\-12\-24\.Figure 36:Meridional current at the surface, lead time 10 d, init 2024\-12\-24\.Figure 37:Sea surface height at lead time 10 d, init 2024\-12\-24\.
### E\.9Regional metrics
We evaluate Njord on the Baltic Sea against the deterministic SeaCast baseline and a persistence forecast\. For both Njord\-Baltic and SeaCast we report two variants: a*reanalysis*model, pretrained on reanalysis only, and an*analysis*model, additionally finetuned on operational analysis\. All systems are initialized from the Baltic Sea analysis, forced with the IFS 10\-day atmospheric forecast at the surface, and constrained at the lateral boundary by the GLO12 10\-day ocean forecast from OceanBench\. GLO12 forecasts covering the Baltic Sea are also used as a reference in the comparison\. While it retains some large\-scale skill, it lacks the spatial resolution required to capture fine\-scale dynamics in the Baltic Sea\. Interpolation of GLO12 to the regional grid also results in some missing values in narrow coastal regions which are not used in the error calculation\. The ground truth used for verification is the Baltic Sea analysis\. Ensemble metrics for Njord\-Baltic:RMSEof the ensemble mean,CRPS, andSSR, are compared against deterministicRMSEandMAEfor SeaCast and persistence\. The SeaCast/persistenceMAEis shown alongsideCRPSas a deterministic reference\.
[Figures38](https://arxiv.org/html/2605.15470#A5.F38)–[41](https://arxiv.org/html/2605.15470#A5.F41)report metrics per variable and lead time\. TheSSRis defined as the ratio between the standard deviation of the ensemble and theRMSEof the ensemble mean; values close to one indicate a well\-calibrated ensemble\. Because zonal and meridional currents exhibit similar error accumulation patterns, only the zonal components are shown here for brevity\.
Figure 38:Surface variables:SLA,SICandSIT\. Reanalysis variants are shown dashed and analysis variants solid\.Figure 39:Temperature at 1, 9, 28, 47 and 91 m depth\.Figure 40:Salinity at 1, 9, 28, 47 and 91 m depth\.Figure 41:Meridional current at 1, 9, 28, 47 and 91 m depth\.
### E\.10Regional forecasts
To illustrate the qualitative behavior of Njord\-Baltic in a high\-resolution regional context, we present example ensemble forecasts for the Baltic Sea, initialized on 20 February 2024 at a lead time of 10 days, in[Figures42](https://arxiv.org/html/2605.15470#A5.F42)–[48](https://arxiv.org/html/2605.15470#A5.F48)\. For each variable, we show the analysis target, the ensemble mean, the ensemble standard deviation, and three individual ensemble members\.
The individual ensemble members appear sharp and exhibit noticeable variability, whereas the ensemble mean is smoother due to averaging\. Sea\-ice fields display well\-defined edges and are exactly zero in ice\-free regions, reflecting the use of clamping and a dedicated density channel\. For this late\-winter date, ice is primarily confined to the Bay of Bothnia and the Gulf of Finland\. The ensemble standard deviation clearly highlights the marginal ice zone, marking the uncertainty in the exact location of the ice edge\. ForSIT, uncertainty also extends into the interior of the ice pack, reflecting ensemble disagreements on dynamic thickening processes such as ridging\.
The physical drivers of uncertainty for salinity and potential temperature are distinctly visible\. Salinity variance is overwhelmingly concentrated in the Skagerrak and Kattegat straits\. This transition zone is highly dynamic, as dense, saline North Sea water forcefully mixes with the fresh, brackish outflow of the Baltic; small ensemble disagreements on the exact timing, volume, or extent of these inflows create massive local variance\. Potential temperature uncertainty is also elevated in these straits but extends further into the Baltic Proper, reflecting complex thermal mixing fronts and internal mesoscale eddies\.
Surface currents exhibit widespread uncertainty across the entire basin\. Because the Baltic Sea’s surface circulation is heavily wind\-driven, ensemble spread in these velocities reflects the chaotic, rapid response of surface waters to varying meteorological forcing across the members\. This variance naturally peaks in the narrow, high\-flow bottlenecks of the Danish straits\.
Finally, SLA shows elevated uncertainty not just in the straits, but also at the northern and eastern extremities of the basin\. In a shallow, enclosed sea, water levels are highly sensitive to wind stress piling water up against the coasts \(storm surges and seiches\)\. Furthermore, some visual artifacts remain apparent in the SLA standard deviation\. This could potentially arise because SLA fields are derived from interpolating sparse along\-track satellite observations, resulting in noisy targets that may require higher temporal resolution and denser training samples to model smoothly\.
Figure 42:Sea ice concentration at lead time 10 d, init 2024\-02\-20\.Figure 43:Sea ice thickness at lead time 10 d, init 2024\-02\-20\.Figure 44:Temperature at the surface, lead time 10 d, init 2024\-02\-20\.Figure 45:Salinity at the surface, lead time 10 d, init 2024\-02\-20\.Figure 46:Zonal current at the surface, lead time 10 d, init 2024\-02\-20\.Figure 47:Meridional current at the surface, lead time 10 d, init 2024\-02\-20\.Figure 48:Sea level anomaly at lead time 10 d, init 2024\-02\-20\.Similar Articles
Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
Proposes GC-MoE, a graph-conditioned mixture of experts framework for traffic forecasting that assigns each node a personalized combination of frozen pretrained spatio-temporal GNN experts based on graph topology and recent input, training only a lightweight routing module (∼17K parameters) and achieving competitive performance on four benchmarks.
FFJORD: Free-form continuous dynamics for scalable reversible generative models
FFJORD introduces a scalable reversible generative model using continuous dynamics and Hutchinson's trace estimator to enable unbiased log-density estimation without architectural constraints. The method achieves state-of-the-art results on density estimation and image generation while maintaining efficient sampling.
Scalable Uncertainty Quantification for Extreme Weather Forecasting via Empirical Neural Tangent Kernels
The paper proposes Neural Tangent Kernel-based uncertainty quantification for deterministic deep learning weather models, achieving sharper adaptive prediction intervals during extreme events without retraining.
Reconstructing and forecasting disease trajectories of patients with Alzheimer's disease using routine data in resource-constrained settings
This paper introduces GNOVA, a GRU-Neural ODE Variational Autoencoder framework for reconstructing and forecasting Alzheimer's disease cognitive trajectories from routine clinical data without expensive neuroimaging or biomarkers, achieving low error and uncertainty estimation on the ADNI dataset.
AdaWeather: Adaptively Mixing Probabilistic Weather Forecasts with Logarithmic Regret
Introduces AdaWeather, an adaptive framework that combines multiple probabilistic weather forecasts using machine learning and mixture of experts, achieving logarithmic regret compared to the best static mixture of experts and showing empirical improvements in temperature forecasting.