Towards Scalable One-Step Generative Modeling for Autoregressive Dynamical System Forecasting

Summary

This paper introduces MeLISA, a latent-free autoregressive generative surrogate for forecasting high-dimensional physical dynamics that uses pixel-space MeanFlow to achieve efficient one-step generation. It demonstrates superior long-horizon statistical accuracy and inference speed compared to neural operators on turbulent flow benchmarks.

# Towards Scalable One-Step Generative Modeling for Autoregressive Dynamical System Forecasting
Source: [https://arxiv.org/html/2605.05540](https://arxiv.org/html/2605.05540)
Tianyue Yang (The Center for Computational Science, University College London) and Xiao Xue (The Center for Computational Science, University College London, x.xue@ucl.ac.uk)

###### Abstract

Fast surrogate modeling for high-dimensional physical dynamics requires more than low short-term error: useful models must roll out efficiently while preserving the statistical structure of long trajectories. Neural operators provide inexpensive autoregressive forecasts but can drift in turbulent regimes, whereas rolling diffusion and latent generative surrogates can represent stochastic transitions at the cost of multi-step denoising, noise-schedule design, or auxiliary compression models. We propose MeanFlow Long-term Invariant Spatiotemporal Consistency Autoregressive Models (MeLISA), a latent-free autoregressive generative surrogate built on pixel-space MeanFlow. MeLISA defines a blockwise stochastic transition kernel that generates each forecast block with a single model evaluation, avoiding latent encoders and iterative diffusion solvers at inference time. To stabilize long-horizon rollouts, MeLISA combines a *Window-Consistency MeanFlow* objective that learns conditional spatiotemporal generation from partially observed temporal windows with a *Time Increment Consistency* loss that constrains multi-lag finite increments and targets temporal-correlation structure. We evaluate MeLISA with compact UNet and scalable DiT backbones on two high-resolution benchmarks, extended 2D Kolmogorov flow at $256\times 256$ and turbulent channel-flow slice at $192\times 192$. MeLISA outperforms neural-operator baselines on short-term forecasting accuracy and long-horizon statistical metrics, including energy spectra, turbulent kinetic energy, and mixing-rate-related dynamics, while achieving inference speeds comparable to, and in some cases faster than, neural operators. Compact 3.7–5.7M-parameter variants already deliver strong parameter efficiency, and DiT variants provide a scalable path up to 150M parameters. Overall, MeLISA benefits both rollout efficiency and long-horizon statistical accuracy.

## 1 Introduction

Accurate yet efficient simulation of complex dynamical systems remains a central challenge in computational physics. These systems are typically governed by nonlinear partial differential equations (PDEs) [[12](https://arxiv.org/html/2605.05540#bib.bib101)], for which obtaining exact solutions is often intractable. As a result, a wide range of numerical techniques have been developed, including Direct Numerical Simulation (DNS) [[33](https://arxiv.org/html/2605.05540#bib.bib98), [66](https://arxiv.org/html/2605.05540#bib.bib136), [13](https://arxiv.org/html/2605.05540#bib.bib137), [60](https://arxiv.org/html/2605.05540#bib.bib139)]. While such methods can achieve high fidelity by resolving fine-scale structures, they often come with substantial computational cost, making them impractical for many real-world scenarios. To alleviate this computational burden, a variety of reduced-fidelity approaches have been proposed that trade physical resolution for efficiency, including Reynolds-Averaged Navier–Stokes (RANS) [[1](https://arxiv.org/html/2605.05540#bib.bib99)] and Large Eddy Simulation (LES) [[57](https://arxiv.org/html/2605.05540#bib.bib100)]. Nevertheless, these methods can remain prohibitively costly in regimes that demand fine spatial resolution, for instance, near solid boundaries where steep gradients must be resolved. Motivated by these limitations and enabled by recent advances in deep learning, data-driven surrogate models have emerged as a promising alternative. Such surrogates aim to approximate the underlying dynamics directly from data, and have been successfully applied across a broad range of scientific domains, including weather forecasting [[15](https://arxiv.org/html/2605.05540#bib.bib156), [4](https://arxiv.org/html/2605.05540#bib.bib157)], quantum chemistry [[69](https://arxiv.org/html/2605.05540#bib.bib158), [73](https://arxiv.org/html/2605.05540#bib.bib159)], materials science [[48](https://arxiv.org/html/2605.05540#bib.bib160)], and fluid dynamics [[76](https://arxiv.org/html/2605.05540#bib.bib161)].

Deterministic neural operators have become a widely used class of data-driven surrogates for PDE-governed and spatiotemporal forecasting problems [[35](https://arxiv.org/html/2605.05540#bib.bib15), [45](https://arxiv.org/html/2605.05540#bib.bib28), [59](https://arxiv.org/html/2605.05540#bib.bib56), [43](https://arxiv.org/html/2605.05540#bib.bib166)]. When used autoregressively, they map a finite context window to future states and can roll out efficiently. However, long-horizon autoregressive prediction remains challenging: small one-step errors are repeatedly fed back as inputs, causing error accumulation and distribution shift over the rollout [[50](https://arxiv.org/html/2605.05540#bib.bib192)]. This issue is especially pronounced in turbulent or chaotic regimes, where small biases in high-frequency content or temporal correlations can lead to drift in trajectory-level statistics such as energy spectra, turbulent kinetic energy, and invariant-measure-related quantities [[25](https://arxiv.org/html/2605.05540#bib.bib41), [55](https://arxiv.org/html/2605.05540#bib.bib135), [27](https://arxiv.org/html/2605.05540#bib.bib105)].

A complementary direction uses generative models, including diffusion models [[22](https://arxiv.org/html/2605.05540#bib.bib106), [65](https://arxiv.org/html/2605.05540#bib.bib143)], flow matching [[40](https://arxiv.org/html/2605.05540#bib.bib107)], and one-step variants such as consistency models [[64](https://arxiv.org/html/2605.05540#bib.bib130), [44](https://arxiv.org/html/2605.05540#bib.bib133)], which view sample generation as a transport process from a base distribution to the data distribution. These methods have recently been adopted as autoregressive surrogates for physical dynamics [[31](https://arxiv.org/html/2605.05540#bib.bib146), [62](https://arxiv.org/html/2605.05540#bib.bib141), [6](https://arxiv.org/html/2605.05540#bib.bib140), [76](https://arxiv.org/html/2605.05540#bib.bib161), [58](https://arxiv.org/html/2605.05540#bib.bib147), [15](https://arxiv.org/html/2605.05540#bib.bib156)] and explicitly model conditional distributions over future states. However, in their current form they typically (i) require multi-step denoising or SDE/ODE integration at every rollout step, leading to long inference latency, (ii) rely on frame-wise progressive noise schedules [[6](https://arxiv.org/html/2605.05540#bib.bib140), [61](https://arxiv.org/html/2605.05540#bib.bib148)] to roll out stably, and (iii) operate in a latent space induced by an auxiliary VAE or encoder [[61](https://arxiv.org/html/2605.05540#bib.bib148), [31](https://arxiv.org/html/2605.05540#bib.bib146)], introducing additional training and inference complexity.

In this work, we build on the recently proposed pixel MeanFlow (p-MF) framework [[46](https://arxiv.org/html/2605.05540#bib.bib149)], a one-step generative model that operates directly in pixel space and avoids both multi-step solvers and latent encoders. To turn p-MF into an autoregressive surrogate that retains fast rollout speed while preserving long-horizon statistical structure, we address two specific questions: how should single-frame MeanFlow be extended to a *window-conditioned spatiotemporal* generator so that one-step generation becomes non-trivial under multi-frame temporal context, and how can long-horizon temporal correlations and mixing behavior be explicitly enforced when only short windows are observed during training? Our answers, *Window-Consistency MeanFlow* and *Time Increment Consistency*, together yield a stochastic autoregressive surrogate that requires a single function evaluation per forecast block (1 NFE per block, where NFE denotes the number of function evaluations), combining direct autoregressive rollout with the conditional-distribution modeling of generative methods. See Appendix [B](https://arxiv.org/html/2605.05540#A2) for an extensive discussion. Our key contributions are summarized as follows:

- We introduce MeanFlow Long-term Invariant Spatiotemporal Consistency Autoregressive Models (MeLISA), a stochastic autoregressive surrogate that requires 1 NFE per block, combining direct autoregressive rollout with the conditional-distribution modeling of generative methods, without progressive noise schedules or multi-step SDE solvers.
- We propose *Window-Consistency MeanFlow*, the first extension of pixel MeanFlow from single-frame generation to a window-conditioned spatiotemporal transition kernel, where masked temporal context is what makes one-step generative forecasting non-trivial.
- We propose *Time Increment Consistency*, a finite-lag regularizer that directly constrains temporal covariance and mixing structure of the rollout, supplying long-horizon constraints that pointwise state-reconstruction losses provably cannot.
- MeLISA performs generative forecasting directly in pixel space at up to $256\times 256$, removing the VAE / latent-encoder / fidelity-enhancement dependencies that current diffusion-based scientific surrogates typically rely on.
- On extended 2D Kolmogorov flow ($256\times 256$) and a projected turbulent channel-flow observable ($192\times 192$), MeLISA matches or exceeds neural-operator baselines on short-term accuracy at comparable rollout speed, while substantially improving recovery of energy spectra, turbulent kinetic energy, and mixing-rate-related dynamics. Compact 3.7–5.7M-parameter variants are already parameter-efficient, and a DiT instantiation scales to 150M parameters.

## 2 Related Works

#### Generative models for dynamical systems.

Diffusion models and their latent-space variants, including Latent Diffusion Models (LDM) [[62](https://arxiv.org/html/2605.05540#bib.bib141), [6](https://arxiv.org/html/2605.05540#bib.bib140), [9](https://arxiv.org/html/2605.05540#bib.bib150)] and Latent Flow Matching (LFM) [[37](https://arxiv.org/html/2605.05540#bib.bib151), [52](https://arxiv.org/html/2605.05540#bib.bib129)], have been widely adopted for dynamical systems forecasting and have achieved state-of-the-art performance. Nevertheless, in the absence of distillation or related acceleration techniques, these approaches typically require many sampling steps at inference time, leading to substantial computational overhead. In contrast, our method builds on a one-step generative formulation, eliminating the need for specialized multi-step sampling procedures. Moreover, we perform generative modeling directly in pixel space, avoiding dimensionality reduction altogether. This enables high-resolution generation up to $256\times 256$ without training auxiliary latent-space components such as variational autoencoders (VAEs) or fidelity enhancement modules [[63](https://arxiv.org/html/2605.05540#bib.bib120), [55](https://arxiv.org/html/2605.05540#bib.bib135), [39](https://arxiv.org/html/2605.05540#bib.bib169), [80](https://arxiv.org/html/2605.05540#bib.bib170)]. This design also makes our approach more general and robust, as it reduces the number of system-specific architectural and training choices required in practice. In particular, the one-step nature of our model removes the need to specify a progressive noise schedule, a component that most rolling or autoregressive diffusion methods depend on for multi-step sampling [[6](https://arxiv.org/html/2605.05540#bib.bib140), [61](https://arxiv.org/html/2605.05540#bib.bib148), [26](https://arxiv.org/html/2605.05540#bib.bib142), [53](https://arxiv.org/html/2605.05540#bib.bib162)].

#### One-step generative models.

One-step (or few-step) generative models have recently gained momentum, motivated by the view of generative modeling as distribution transport [[65](https://arxiv.org/html/2605.05540#bib.bib143)]. Representative examples include Consistency Models (CM/sCM) [[64](https://arxiv.org/html/2605.05540#bib.bib130), [44](https://arxiv.org/html/2605.05540#bib.bib133)], inductive Moment Matching (iMM) [[82](https://arxiv.org/html/2605.05540#bib.bib132)], and Shortcut Diffusion [[14](https://arxiv.org/html/2605.05540#bib.bib131)]. In the image generation domain, the MeanFlow (MF) family, including the original MF [[16](https://arxiv.org/html/2605.05540#bib.bib109)], improved MeanFlow (i-MF) [[17](https://arxiv.org/html/2605.05540#bib.bib110)], and pixel MeanFlow (p-MF) [[46](https://arxiv.org/html/2605.05540#bib.bib149)], has shown particularly strong performance on both conditional and unconditional generation tasks. Modern generative models for image and video synthesis often achieve few-step or one-step sampling by distilling large pre-trained diffusion models. For example, Consistency Distillation (CD/sCD) [[64](https://arxiv.org/html/2605.05540#bib.bib130), [44](https://arxiv.org/html/2605.05540#bib.bib133)] has been successfully applied to pre-trained image generators [[47](https://arxiv.org/html/2605.05540#bib.bib153)] as well as video diffusion models [[49](https://arxiv.org/html/2605.05540#bib.bib154), [71](https://arxiv.org/html/2605.05540#bib.bib155)]. In contrast, training one-step generative models *from scratch* on video-like data has received comparatively little attention; to the best of our knowledge, this setting has not been systematically explored, nor has it been studied in the context of physical dynamical systems.

## 3 Background

#### Problem setup.

Let the sequential dataset be denoted by $\mathcal{D}\in\mathbb{R}^{B\times T\times F}$, where $B$ is the number of trajectories, $T$ is the trajectory length, and $F$ is the feature dimension. We represent a sampled window from discrete physical time $\tau\in\mathbb{Z}^{+}$ to $\tau+W$ as

$$\bar{x}^{\tau:\tau+W} := \left(x^{\tau},\ldots,x^{\tau+W-1}\right) \in \mathbb{R}^{W\times F},$$

and use the shorthand $\bar{x}_{W}^{\tau}$ when the meaning is clear. In probabilistic forecasting, given an input window $\bar{x}_{W_{\mathrm{in}}}^{\tau}$, the goal is to predict a future window $\hat{\bar{x}}_{W_{\mathrm{out}}}^{\tau+W_{\mathrm{in}}}$, where $W_{\mathrm{in}}$ and $W_{\mathrm{out}}$ denote the input and output window lengths, respectively, and may be chosen arbitrarily. This forecasting problem is formalized by the predictive distribution

$$\hat{\bar{X}}_{W_{\mathrm{out}}}^{\tau+W_{\mathrm{in}}} \sim p\left(\hat{\bar{X}}_{W_{\mathrm{out}}}^{\tau+W_{\mathrm{in}}} \mid \bar{x}_{W_{\mathrm{in}}}^{\tau}\right). \tag{1}$$
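To make the windowing notation concrete, the following is a minimal PyTorch sketch (an illustration under assumed conventions, not the authors' code) of drawing an input window $\bar{x}_{W_{\mathrm{in}}}^{\tau}$ and its target block from the dataset tensor $\mathcal{D}$ of shape $(B, T, F)$:

```python
import torch

def sample_window_pair(data: torch.Tensor, w_in: int, w_out: int):
    """Draw one (input window, target window) pair per trajectory.

    data: tensor of shape (B, T, F), matching the dataset D in the problem setup.
    Returns x_bar^{tau : tau+W_in} and the target block x_bar^{tau+W_in : tau+W_in+W_out}.
    """
    B, T, _ = data.shape
    W = w_in + w_out
    tau = torch.randint(0, T - W + 1, (B,))            # random window start per trajectory
    idx = tau[:, None] + torch.arange(W)               # (B, W) frame indices
    window = data[torch.arange(B)[:, None], idx]       # gather windows: (B, W, F)
    return window[:, :w_in], window[:, w_in:]
```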

#### Diffusion models and flow matching.

Diffusion models can be viewed as distribution transport methods that map a simple base distribution $p_{\mathrm{base}}$ (typically Gaussian) to the target data distribution $p_{\mathrm{data}}$. In denoising diffusion probabilistic models (DDPM) [[22](https://arxiv.org/html/2605.05540#bib.bib106)], this transport is realized through a Markovian diffusion process indexed by diffusion time $t$. A neural denoiser $D_{\theta}$ is trained to predict the injected noise at each time step:

$$\mathcal{L}_{\mathrm{DDPM}}=\mathbb{E}_{t,x,\epsilon}\left[\left\|D_{\theta}(z_{t},t)-\epsilon\right\|_{2}^{2}\right], \tag{2}$$

where $z_{t}$ denotes a noisy sample along the diffusion path at time $t$ [[22](https://arxiv.org/html/2605.05540#bib.bib106)]. This forward–reverse diffusion process can equivalently be described through a stochastic differential equation (SDE) formulation [[65](https://arxiv.org/html/2605.05540#bib.bib143)].
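As a point of reference for Eq. (2), here is a minimal PyTorch sketch of the DDPM noise-prediction loss; the noise-schedule function `alpha_bar` and the denoiser signature `D(z_t, t)` are assumptions for illustration:

```python
import torch

def ddpm_loss(D, x, alpha_bar):
    """Noise-prediction objective of Eq. (2): D(z_t, t) regresses the injected noise."""
    t = torch.rand(x.shape[0], device=x.device)                # diffusion time per sample
    eps = torch.randn_like(x)
    a = alpha_bar(t).view(-1, *([1] * (x.dim() - 1)))          # broadcast over field dims
    z_t = a.sqrt() * x + (1 - a).sqrt() * eps                  # noisy sample on the path
    return ((D(z_t, t) - eps) ** 2).mean()
```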

Flow matching approaches the same transport problem from a different perspective by modeling the dynamics with an ordinary differential equation (ODE). In particular, one defines an interpolating noisy state

$$z_{t}=a_{t}x+b_{t}\epsilon, \tag{3}$$

where $\epsilon\sim\mathcal{N}(0,I)$ is Gaussian noise, and $a_{t}$ and $b_{t}$ are time-dependent coefficients. In this work, we focus on the linear parameterization $a_{t}=1-t$ and $b_{t}=t$ [[42](https://arxiv.org/html/2605.05540#bib.bib163)]. The model $D_{\theta}$ is then trained to predict the transport velocity field

$$v(z_{t}\mid x,\epsilon)=\frac{\mathrm{d}z_{t}}{\mathrm{d}t}=a^{\prime}_{t}x+b^{\prime}_{t}\epsilon, \tag{4}$$

via the conditional FM loss [[40](https://arxiv.org/html/2605.05540#bib.bib107)]

$$\mathcal{L}_{\mathrm{FM}}=\mathbb{E}_{t,x,\epsilon}\left[\left\|D_{\theta}(z_{t},t)-v(z_{t}\mid x,\epsilon)\right\|^{2}\right], \tag{5}$$

where $z_{t}$ is the noisy interpolated sample induced by the FM parameterization.
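Under the linear parameterization $a_t=1-t$, $b_t=t$, the conditional target in Eq. (4) reduces to $v=\epsilon-x$, so the FM loss of Eq. (5) can be sketched in a few lines (again assuming a denoiser signature `D(z_t, t)`):

```python
import torch

def fm_loss(D, x):
    """Conditional flow-matching loss, Eqs. (3)-(5), on the linear path a_t=1-t, b_t=t."""
    t = torch.rand(x.shape[0], device=x.device).view(-1, *([1] * (x.dim() - 1)))
    eps = torch.randn_like(x)
    z_t = (1 - t) * x + t * eps        # interpolant, Eq. (3)
    v = eps - x                        # velocity a'_t x + b'_t eps with a'_t = -1, b'_t = 1
    return ((D(z_t, t) - v) ** 2).mean()
```

At sampling time, one integrates $\mathrm{d}z_t/\mathrm{d}t = D_\theta(z_t,t)$ from $t=1$ (noise) to $t=0$ (data), which is what makes multiple network evaluations necessary and motivates the MeanFlow construction below.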

#### MeanFlow models.

Standard flow-matching sampling requires multiple evaluations of the velocity network. MeanFlow improves efficiency by predicting the *average* velocity over an interval $[r,t]$ [[16](https://arxiv.org/html/2605.05540#bib.bib109)]:

$$u(z_{t},r,t)=\frac{1}{t-r}\int_{r}^{t}v(z_{\tau}\mid x,\epsilon)\,d\tau, \tag{6}$$

which enables one-step transport via $z_{r}=z_{t}-(t-r)\,u(z_{t},r,t)$. MeanFlow trains a network $D_{\theta}$ to predict $u$ using a self-consistency identity derived from this definition. To improve stability, Geng et al. [[17](https://arxiv.org/html/2605.05540#bib.bib110)] define

$$V_{\theta}(z_{t},r,t)=D_{\theta}+(t-r)\,\operatorname{sg}\left(\partial_{t}D_{\theta}+v(z_{t}\mid x,\epsilon)\,\partial_{z_{t}}D_{\theta}\right), \tag{7}$$

and optimize the direct regression objective, namely the improved MeanFlow (i-MF) objective

$$\mathcal{L}_{\mathrm{i\text{-}MF}}=\mathbb{E}_{t,x,\epsilon}\left[\left\|V_{\theta}(z_{t},r,t)-v(z_{t}\mid x,\epsilon)\right\|^{2}\right]. \tag{8}$$

This reformulation yields more stable training and better sample quality. Pixel MeanFlow (p-MF) further reparameterizes the model to predict a pixel-space target [[34](https://arxiv.org/html/2605.05540#bib.bib164)]:

$$D_{\theta}(z_{t},r,t)=z_{t}-t\,u(z_{t},r,t), \tag{9}$$

so that the same i-MF objective can be applied while biasing prediction toward the data manifold.
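The objective in Eqs. (7)-(8), combined with the p-MF parameterization of Eq. (9), can be sketched with forward-mode autodiff, since $\partial_t u + v\,\partial_{z_t}u$ is exactly a Jacobian-vector product with tangent $(v, 0, 1)$. This is a minimal reading of the equations, not the authors' released code; `r` and `t` are assumed to be tensors broadcastable against `x` with `r <= t`:

```python
import torch
from torch.func import jvp

def imf_loss(D, x, r, t):
    """i-MF objective (Eqs. (7)-(8)) for a pixel-space model D (Eq. (9))."""
    eps = torch.randn_like(x)
    z_t = (1 - t) * x + t * eps
    v = eps - x                                          # instantaneous velocity target
    u_fn = lambda z, r_, t_: (z - D(z, r_, t_)) / t_     # average velocity from Eq. (9)
    # Total derivative du/dt along the path: one JVP with tangent (v, 0, 1).
    u, dudt = jvp(u_fn, (z_t, r, t), (v, torch.zeros_like(r), torch.ones_like(t)))
    V = u + (t - r) * dudt.detach()                      # detach() plays the role of sg(.)
    return ((V - v) ** 2).mean()
```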

## 4 Methodology: MeanFlow Long-term Invariant Spatiotemporal Consistency Autoregressive Models

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/fig1.png)

Figure 1: This diagram illustrates the two key mechanisms of MeanFlow Long-term Invariant Spatiotemporal Consistency Autoregressive (MeLISA) models. Panel (a) shows the Window-Consistency MeanFlow (WinC-MF) objective, in which a MeanFlow denoiser is trained on a temporal window with partially observed, unmasked guidance frames. Panel (b) shows the Time Increment Consistency (TIC) loss: given a context window, MeLISA first produces a reconstruction, and TIC then penalizes the discrepancy between the predicted and ground-truth temporal increments across different lags.

Our method, MeLISA, incorporates physics-aware prediction into the MeanFlow framework through two key mechanisms: the *Window-Consistency MeanFlow* objective and the *Time Increment Consistency* regularizer.

### 4.1 Window Consistency MeanFlow

Prior work on autoregressive forecasting has shown that long-horizon stability improves with larger input or output windows [[51](https://arxiv.org/html/2605.05540#bib.bib165), [35](https://arxiv.org/html/2605.05540#bib.bib15), [59](https://arxiv.org/html/2605.05540#bib.bib56), [53](https://arxiv.org/html/2605.05540#bib.bib162)]. This is especially important for consistency-sensitive physical statistics, such as autocorrelation functions. Motivated by this observation, we propose a MeanFlow formulation that explicitly enforces temporal consistency over a window.

Following Eq. [9](https://arxiv.org/html/2605.05540#S3.E9), the average velocity for a single sample is

$$u_{\theta}(z_{t},r,t)=\frac{1}{t}\bigl[z_{t}-D_{\theta}(z_{t},r,t)\bigr], \tag{10}$$

which induces the instantaneous velocity

$$V_{\theta}(z_{t},r,t)=u_{\theta}+(t-r)\,\operatorname{sg}\left(\partial_{t}u_{\theta}+v(z_{t}\mid x,\epsilon)\,\partial_{z_{t}}u_{\theta}\right). \tag{11}$$

We extend this construction from a single frame to a temporal window. For a clean window $\bar{x}^{\tau:\tau+W}$, we define the noisy window

$$\bar{z}_{t}^{\tau:\tau+W}=(1-t)\,\bar{x}^{\tau:\tau+W}+t\,\epsilon, \tag{12}$$

where the diffusion times $r$ and $t$ are shared across all frames in the window. The corresponding average velocity is

$$\bar{u}_{\theta}^{\tau:\tau+W}(\bar{z}_{t}^{\tau:\tau+W},r,t)=\frac{1}{t}\left[\bar{z}_{t}^{\tau:\tau+W}-D_{\theta}(\bar{z}_{t}^{\tau:\tau+W},r,t)\right]. \tag{13}$$

To enforce consistency under partial observation, let $\mathcal{M}_{\tau_{M}}$ denote a masking operator that removes the frames indexed by $\tau_{M}$ within the window. We then define

$$\bar{u}_{\theta}^{\tau:\tau+W}(\bar{z}_{t}^{\tau:\tau+W},r,t,\tau_{M})=\frac{1}{t}\left[\bar{z}_{t}^{\tau:\tau+W}-D_{\theta}\left(\bar{z}_{t}^{\tau:\tau+W},\mathcal{M}_{\tau_{M}}(\bar{x}^{\tau:\tau+W}),r,t\right)\right]. \tag{14}$$

The on-window instantaneous velocity is

$$\bar{V}_{\theta}^{\tau:\tau+W}(\bar{z}_{t}^{\tau:\tau+W},r,t,\tau_{M})=\bar{u}_{\theta}^{\tau:\tau+W}+(t-r)\,\operatorname{sg}\left(\partial_{t}\bar{u}_{\theta}^{\tau:\tau+W}+\bar{v}^{\tau:\tau+W}\,\partial_{\bar{z}_{t}^{\tau:\tau+W}}\bar{u}_{\theta}^{\tau:\tau+W}\right), \tag{15}$$

where

$$\bar{v}^{\tau:\tau+W}\bigl(\bar{z}_{t}^{\tau:\tau+W}\mid\bar{x}^{\tau:\tau+W},\epsilon\bigr)=\epsilon-\bar{x}^{\tau:\tau+W}. \tag{16}$$

This yields the Window-Consistency MeanFlow (WinC-MF) objective,

$$\mathcal{L}_{\mathrm{WinC\text{-}MF}}=\mathbb{E}_{t,r,x,\epsilon,\tau,\tau_{M}}\left\|\bar{V}_{\theta}^{\tau:\tau+W}(\bar{z}_{t}^{\tau:\tau+W},r,t,\tau_{M})-\bar{v}^{\tau:\tau+W}\bigl(\bar{z}_{t}^{\tau:\tau+W}\mid\bar{x}^{\tau:\tau+W},\epsilon\bigr)\right\|^{2}_{2}. \tag{17}$$

This objective encourages temporally consistent predictions from partially observed windows; its connection to self-supervised learning (SSL) is discussed in Appendix [C.1](https://arxiv.org/html/2605.05540#A3.SS1). This is illustrated in panel (a) of Fig. [1](https://arxiv.org/html/2605.05540#S4.F1). In practice, we sample the mask over later frames from a Bernoulli distribution, while always retaining the first frame as a temporal reference.
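A minimal sketch of how the WinC-MF objective of Eqs. (12)-(17) might be assembled is given below. The denoiser signature `D(z, ctx, r, t)` (masked context passed as a separate input), the Bernoulli rate `p_mask`, and the zeroing of masked frames are assumptions; the paper's exact conditioning layout (cf. Eq. (19)) may differ:

```python
import torch
from torch.func import jvp

def winc_mf_loss(D, x_win, r, t, p_mask=0.5):
    """Window-Consistency MeanFlow loss for a clean window x_win of shape (B, W, ...).
    Diffusion times r, t are shared across all W frames, per Eq. (12)."""
    B, W = x_win.shape[:2]
    keep = torch.rand(B, W, device=x_win.device) > p_mask       # Bernoulli mask over frames
    keep[:, 0] = True                                           # first frame is always observed
    m = keep.float().view(B, W, *([1] * (x_win.dim() - 2)))
    ctx = m * x_win                                             # masked context M_{tau_M}(x_bar)
    eps = torch.randn_like(x_win)
    z_t = (1 - t) * x_win + t * eps                             # noisy window, Eq. (12)
    v = eps - x_win                                             # window velocity, Eq. (16)
    u_fn = lambda z, r_, t_: (z - D(z, ctx, r_, t_)) / t_       # masked average velocity, Eq. (14)
    u, dudt = jvp(u_fn, (z_t, r, t), (v, torch.zeros_like(r), torch.ones_like(t)))
    V = u + (t - r) * dudt.detach()                             # on-window velocity, Eq. (15)
    return ((V - v) ** 2).mean()                                # Eq. (17)
```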

### 4.2 Time Increment Consistency

Let $x^{\tau}$ denote the ground-truth field at physical time $\tau$, and $\hat{x}^{\tau}$ its reconstruction. For a window of length $W$, we define the lag-$w$ temporal increment as

$$\Delta x^{\tau,\tau+w}=x^{\tau+w}-x^{\tau},\qquad w=1,\dots,W-1. \tag{18}$$

Given a window $\bar{x}^{\tau:\tau+W}$, the MeLISA reconstruction is

$$\hat{\bar{x}}^{\tau:\tau+W}=D_{\theta}\left(\mathrm{Concat}\left[\epsilon,\,\mathcal{M}_{\tau_{M}}(\bar{x}^{\tau:\tau+W}),\,\tau_{M}\right],0,1\right). \tag{19}$$

We then define the Time Increment Consistency (TIC) loss as

$$\mathcal{L}_{\mathrm{TIC}}=\sum_{w=1}^{W-1}\kappa_{w}\,\mathbb{E}_{x,\tau,\tau_{M},\epsilon}\left[\left\|\Delta x^{\tau,\tau+w}-\Delta\hat{x}^{\tau,\tau+w}\right\|_{2}^{2}\right], \tag{20}$$

where $\kappa_{w}$ is a lag-dependent weight. This objective enforces agreement between the temporal increments of the reconstructed and ground-truth trajectories across multiple lags. The full training objective is

$$\mathcal{L}_{\mathrm{MeLISA}}=\mathcal{L}_{\mathrm{WinC\text{-}MF}}+\mathcal{L}_{\mathrm{TIC}}. \tag{21}$$

TIC encourages recovery of long-range spatiotemporal correlations by constraining second-order temporal structure, which cannot be captured by per-frame state reconstruction losses alone. This mechanism is illustrated in panel (b) of Fig. [1](https://arxiv.org/html/2605.05540#S4.F1). Further discussion is provided in Appendix [C.2](https://arxiv.org/html/2605.05540#A3.SS2). The training procedure of MeLISA is summarized in Algorithm [1](https://arxiv.org/html/2605.05540#alg1). Inference for each forecast block requires only one model evaluation (1 NFE); see Algorithm [3](https://arxiv.org/html/2605.05540#alg3). MeLISA's inference is as fast as that of neural operators; see Appendix [D.5](https://arxiv.org/html/2605.05540#A4.SS5) for details.
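The two remaining pieces, the TIC loss of Eqs. (18)-(20) and the 1-NFE block rollout, admit equally short sketches. Both are illustrations under assumptions (the `D(z, ctx, r, t)` signature from the sketch above, zero-padded future frames as the masked context, and $W_{\mathrm{out}}\ge W_{\mathrm{in}}$), not the paper's Algorithms 1 and 3 verbatim:

```python
import torch

def tic_loss(x_win, x_hat, kappa):
    """TIC loss, Eqs. (18)-(20): match lag-w increments relative to the window start.
    x_win, x_hat: ground truth and reconstruction, shape (B, W, ...); kappa: W-1 lag weights."""
    W = x_win.shape[1]
    loss = 0.0
    for w in range(1, W):
        d_true = x_win[:, w] - x_win[:, 0]       # Delta x^{tau, tau+w}
        d_pred = x_hat[:, w] - x_hat[:, 0]       # Delta x_hat^{tau, tau+w}
        loss = loss + kappa[w - 1] * ((d_true - d_pred) ** 2).mean()
    return loss

@torch.no_grad()
def rollout(D, x_obs, n_blocks, w_in=2, w_out=4):
    """1-NFE-per-block autoregressive rollout (cf. Eq. (19) and Algorithm 3)."""
    traj = [x_obs]                                                     # observed guidance frames
    for _ in range(n_blocks):
        last = traj[-1][:, -w_in:]                                     # context for this block
        pad = torch.zeros_like(last[:, :1]).repeat(
            1, w_out, *([1] * (last.dim() - 2)))                       # unobserved (masked) slots
        ctx = torch.cat([last, pad], dim=1)                            # masked window context
        eps = torch.randn_like(ctx)                                    # fresh noise: stochastic kernel
        x_hat = D(eps, ctx, torch.zeros(()), torch.ones(()))           # single evaluation (1 NFE)
        traj.append(x_hat[:, -w_out:])                                 # keep the generated block
    return torch.cat(traj, dim=1)
```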

## 5 Experiments

#### Dataset.

In this section, we evaluate MeLISA on two high-resolution benchmark datasets: the $192\times 192$ turbulent channel flow dataset (TCF192) and the $256\times 256$ 2D Kolmogorov flow dataset (KF256). These datasets represent two distinct types of two-dimensional dynamics. KF256 is governed by a closed-form 2D partial differential equation (PDE) with periodic external forcing, whereas TCF192 is obtained by slicing a 3D turbulent flow and therefore corresponds to projected dynamics rather than a fully observed closed system. As a result, TCF192 contains stronger non-Markovian effects, making its long-range statistics more challenging to model. See Appendix [G](https://arxiv.org/html/2605.05540#A7) for more details.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/fig2_tcf.png)

Figure 2: Rollout results on an uncurated test trajectory from TCF192, shown as absolute error with respect to the ground truth over the prediction horizon. We compare three autoregressive baselines with one MeLISA variant, MeLISA-$\Delta$-S. Here, $t$ denotes the frame index within the trajectory. MeLISA exhibits substantially improved stability and markedly slower error accumulation.
#### MeLISA model instantiations.

We evaluate MeLISA with two backbone families. MeLISA-$\Upsilon$ uses an attention-augmented UNet architecture following the DDPM style [[22](https://arxiv.org/html/2605.05540#bib.bib106), [54](https://arxiv.org/html/2605.05540#bib.bib171)], extended with 3D convolutions to process spatiotemporal inputs. We consider two compact variants: MeLISA-$\Upsilon$-XS (3M parameters) and MeLISA-$\Upsilon$-S (5M parameters), highlighting the parameter efficiency of the framework. To study scaling behavior, we also instantiate MeLISA with diffusion transformers (DiTs) [[56](https://arxiv.org/html/2605.05540#bib.bib172)] under the pixel-space formulation of Li and He [[34](https://arxiv.org/html/2605.05540#bib.bib164)]. We denote this family by MeLISA-$\Delta$, and scale it from 10M parameters (MeLISA-$\Delta$-S) to 150M parameters (MeLISA-$\Delta$-B), limited by the size of the training data. Following prior work [[34](https://arxiv.org/html/2605.05540#bib.bib164)], we train MeLISA-$\Delta$ with the Muon optimizer [[41](https://arxiv.org/html/2605.05540#bib.bib173)], which has been shown to provide faster convergence and greater stability for the p-MF formulation [[46](https://arxiv.org/html/2605.05540#bib.bib149)]. Importantly, to remain consistent with the design of WinC-MF, we do not introduce explicit temporal attention modules in either backbone family. Full architectural details and training hyperparameters are provided in Table [3](https://arxiv.org/html/2605.05540#A1.T3).

#### Baseline models.

We benchmark MeLISA against deterministic autoregressive baselines. Neural operators [[35](https://arxiv.org/html/2605.05540#bib.bib15), [45](https://arxiv.org/html/2605.05540#bib.bib28)] are widely used for dynamical-system forecasting and have shown strong performance across many benchmarks. We consider three representative neural-operator models: the original Fourier Neural Operator (FNO) [[35](https://arxiv.org/html/2605.05540#bib.bib15)], the U-shaped Neural Operator (UNO) [[59](https://arxiv.org/html/2605.05540#bib.bib56)], which adopts a UNet-style architecture for improved parameter efficiency, and the Local Fourier Neural Operator (Local FNO) [[43](https://arxiv.org/html/2605.05540#bib.bib166)], which incorporates local convolutional kernels to improve predictive accuracy. For a fair comparison, all neural-operator baselines are trained with two input frames and four output frames, matching the MeLISA setting. We defer discussion of rolling generative models, the family most closely related to MeLISA and increasingly used for spatiotemporal prediction [[62](https://arxiv.org/html/2605.05540#bib.bib141), [6](https://arxiv.org/html/2605.05540#bib.bib140), [37](https://arxiv.org/html/2605.05540#bib.bib151), [53](https://arxiv.org/html/2605.05540#bib.bib162)], to Appendix [B](https://arxiv.org/html/2605.05540#A2).

#### Metrics.

We evaluate all models in an autoregressive rollout setting, initialized with the first two frames from each test trajectory. Performance is measured using five complementary metrics: relative $L_{2}$ error (R$L_{2}$), structural similarity index measure (SSIM), power spectral density discrepancy (PSDD), turbulent kinetic energy difference (TKED), and mixing rate difference (MRD). SSIM captures perceptual and structural agreement by emphasizing local patterns such as edges, textures, and contrast [[72](https://arxiv.org/html/2605.05540#bib.bib117), [24](https://arxiv.org/html/2605.05540#bib.bib118)], while R$L_{2}$ measures pointwise reconstruction accuracy. Since both metrics primarily reflect instantaneous prediction quality, we report them over the *short-term prediction horizon*, corresponding to the first 40 frames for both datasets. In contrast, PSDD measures how well a model preserves the distribution of energy across spatial frequencies, which is crucial for multi-scale dynamics and turbulent flows [[21](https://arxiv.org/html/2605.05540#bib.bib121), [68](https://arxiv.org/html/2605.05540#bib.bib122)]. TKED quantifies the discrepancy in trajectory-level turbulent kinetic energy (TKE) relative to the ground truth. The mixing rate is defined as the decay exponent of the full-trajectory autocorrelation curve, estimated from the first 20 frames, and MRD measures the corresponding error in this quantity. Because PSDD, TKED, and MRD characterize trajectory-level statistical behavior, they are computed over the *full rollout trajectory*. Further details are provided in Appendix [F](https://arxiv.org/html/2605.05540#A6). All reported results are averaged over repeated evaluations.
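To make the mixing-rate metric concrete, below is one plausible estimator (an assumed implementation; the paper's exact procedure is given in Appendix F): fit an exponential decay to the trajectory autocorrelation over the first 20 frames and report the decay exponent. MRD is then the error of this exponent against the ground-truth trajectory.

```python
import numpy as np

def mixing_rate(traj, n_lags=20):
    """Estimate the decay exponent of the temporal autocorrelation of one trajectory.
    traj: array of shape (T, ...); each frame is flattened before correlating."""
    x = traj.reshape(traj.shape[0], -1)
    x = x - x.mean(axis=0)                                   # remove the temporal mean field
    acf = np.array([(x[: x.shape[0] - w] * x[w:]).sum(axis=1).mean()
                    for w in range(n_lags)])                 # autocorrelation at lags 0..n_lags-1
    acf = acf / acf[0]
    # Fit ACF ~ exp(-lambda * lag); the negative slope of log|ACF| is the mixing rate.
    lam = -np.polyfit(np.arange(n_lags), np.log(np.abs(acf) + 1e-12), 1)[0]
    return lam

# MRD compares the exponent of a predicted rollout against the ground truth:
# mrd = abs(mixing_rate(prediction) - mixing_rate(ground_truth))
```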

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/fig3.png)

Figure 3: (a) Radially averaged energy spectra for KF256 and TCF192. The neural-operator baselines tend to overemphasize low-frequency modes and introduce spurious peaks at higher frequencies. In contrast, MeLISA accurately reproduces the high-frequency tail without any explicit spectral regularization. (b) Spatial distribution of turbulent kinetic energy (TKE), shown together with scaled *absolute error* relative to the ground truth for all models. Only the MeLISA variants correctly recover the kinetic energy near the boundaries. (c) Time-averaged mean field on KF256 test trajectories, shown with scaled absolute error. MeLISA again reproduces the correct large-scale pattern more faithfully. Local-FNO is omitted because it cannot roll out stably over the full trajectory length. Error scaling is applied only for visualization.
### 5.1 Turbulent Channel Flow at $192\times 192$

We report results on the TCF192 dataset in this section. Rollout visualizations are shown in Fig. [2](https://arxiv.org/html/2605.05540#S5.F2), and the corresponding metrics are summarized in Table [1](https://arxiv.org/html/2605.05540#S5.T1). Despite having only $\sim$3M parameters, MeLISA-$\Upsilon$-XS performs competitively against autoregressive neural operators and achieves the best TKED score. The MeLISA-$\Delta$ variants obtain the strongest short-term metrics (R$L_{2}$ and SSIM). The radial energy spectrum for TCF192 is shown in the right panel of Fig. [3](https://arxiv.org/html/2605.05540#S5.F3)(a). All models capture the low- to mid-frequency range reasonably well, but most exhibit a spurious peak in the high-frequency tail, deviating from the expected spectral decay. Among all methods, the $\Upsilon$ variants match the tail behavior most faithfully. The spatial distribution of turbulent kinetic energy is shown in Fig. [3](https://arxiv.org/html/2605.05540#S5.F3)(b). Neural operators fail to reproduce the correct near-boundary behavior, whereas all MeLISA variants recover the boundary energy distribution much more accurately.

Table 1: TCF192 results for all models on short-term and long-term evaluation metrics with uncertainties. The $\Delta$ variants achieve the strongest short-term predictive performance, whereas the $\Upsilon$ variants preserve long-range statistical structure more effectively. Short-term metrics (R$L_2$, SSIM) are computed over the first 40 frames, and long-term metrics (PSDD, TKED, MRD) are evaluated over the full 625-frame trajectory.

| Model Name | Model Size | R$L_2$ ↓ | SSIM ↑ | PSDD ↓ | TKED ↓ | MRD ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| **MeLISA Models (1-NFE)** | | | | | | |
| MeLISA-$\Upsilon$-XS | 3.68M | 0.0831(12) | 0.882(20) | 0.0897(21) | 1.20(12)$\times 10^{-6}$ | 0.475(12) |
| MeLISA-$\Upsilon$-S | 5.65M | 0.0532(88) | 0.925(16) | 0.0635(90) | 1.42(13)$\times 10^{-6}$ | 0.137(13) |
| MeLISA-$\Delta$-S | 10.5M | 0.0396(33) | 0.956(8) | 0.0660(5) | 1.47(1)$\times 10^{-6}$ | 0.196(3) |
| **Autoregressive Neural Operators** | | | | | | |
| FNO | 6.68M | 0.0617(37) | 0.908(7) | 0.0649(13) | 1.59(12)$\times 10^{-6}$ | 0.476(36) |
| UNO | 5.98M | 0.0617(80) | 0.906(16) | 0.0639(15) | 1.89(23)$\times 10^{-6}$ | 0.538(19) |
| Local-FNO | 6.76M | 0.0579(47) | 0.913(9) | 0.0772(54) | 1.99(54)$\times 10^{-6}$ | 0.507(6) |
### 5.2 Kolmogorov Flow at $256\times 256$

We summarize the results on KF256 in this section, with particular emphasis on the scaling behavior of MeLISA. The metrics are summarized in Table [2](https://arxiv.org/html/2605.05540#S5.T2). We train three MeLISA-$\Delta$ variants on KF256, ranging from 10.5M to 150M parameters. Interestingly, although the short-term metrics remain competitive, they slightly degrade with increasing model size, whereas the long-term metrics, especially PSDD and TKED, generally improve with scale, albeit not monotonically; we attribute this to our cost-matching training setup (see Appendix [A.1](https://arxiv.org/html/2605.05540#A1.SS1)) and the limited size of the dataset. Comparing the two backbone families, the MeLISA-$\Upsilon$ variants achieve stronger short-term prediction metrics than the $\Delta$ variants. Overall, all MeLISA models outperform the neural-operator baselines on short-term accuracy and on most long-term statistical metrics. The radial energy spectrum is shown in the left panel of Fig. [3](https://arxiv.org/html/2605.05540#S5.F3)(a). Neural operators exhibit a clear bias at high frequencies, with UNO in particular producing a spurious peak in the spectral tail. In contrast, all MeLISA variants recover the correct high-frequency decay much more faithfully. The time-averaged mean field is shown in Fig. [3](https://arxiv.org/html/2605.05540#S5.F3)(c), where all MeLISA models reproduce the overall spatial pattern accurately.

Table 2: KF256 results for all models on short-term and long-term evaluation metrics with uncertainties. Short-term metrics are computed over the first 40 frames, and long-term metrics are evaluated over the full 320-frame trajectory. Long-term metrics for Local-FNO are omitted because Local-FNO cannot roll out stably over the full trajectory length.

| Model Name | Model Size | R$L_2$ ↓ | SSIM ↑ | PSDD ↓ | TKED ↓ | MRD ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| **MeLISA Models (1-NFE)** | | | | | | |
| MeLISA-$\Upsilon$-XS | 3.68M | 0.130(15) | 0.925(10) | 0.353(16) | 6.89(72) | 0.065(54) |
| MeLISA-$\Upsilon$-S | 5.65M | 0.137(15) | 0.919(11) | 0.347(18) | 6.73(71) | 0.068(39) |
| MeLISA-$\Delta$-S | 10.5M | 0.194(33) | 0.867(29) | 0.450(10) | 5.17(50) | 0.096(41) |
| MeLISA-$\Delta$-M | 58.3M | 0.156(31) | 0.903(44) | 0.445(78) | 4.24(64) | 0.086(20) |
| MeLISA-$\Delta$-B | 150M | 0.174(44) | 0.887(36) | 0.342(37) | 4.73(26) | 0.130(4) |
| **Autoregressive Neural Operators** | | | | | | |
| FNO | 6.68M | 0.192(12) | 0.845(10) | 1.15(5) | 5.72(92) | 0.168(88) |
| UNO | 5.98M | 0.187(19) | 0.857(9) | 1.42(19) | 6.00(57) | 0.338(22) |
| Local-FNO | 6.76M | 0.549(85) | 0.744(31) | - | - | - |

## 6 Conclusion

We introduced MeLISA, a latent-free autoregressive generative surrogate for dynamical systems that combines the efficiency of pixel-space MeanFlow with training objectives designed to improve long-horizon stability and physical consistency. By formulating forecasting directly in pixel space, MeLISA avoids the additional modeling and inference overhead associated with latent compression, while still enabling one-step generation over multiple future frames. To address the difficulty of preserving realistic dynamics under repeated rollout, we proposed two key components: the *Window-Consistency MeanFlow* objective, which encourages temporally coherent predictions from partially observed windows, and the *Time Increment Consistency* regularizer, which constrains temporal increments across multiple lags to better preserve long-range spatiotemporal structure.

Across two challenging high-resolution benchmarks, turbulent channel flow at $192\times 192$ and Kolmogorov flow at $256\times 256$, MeLISA consistently delivered strong short-term predictive accuracy while more faithfully recovering trajectory-level physical statistics than autoregressive neural-operator baselines. In particular, the proposed framework showed improved preservation of energy spectra, turbulent kinetic energy, and mixing-related behavior over long rollouts, indicating that accurate frame-wise prediction alone is insufficient for physically realistic surrogate modeling, and that explicit regularization of temporal structure is beneficial.

Overall, these results position MeLISA as a promising next-generation generative surrogate for scientific machine learning: it is efficient at inference, scalable to high-resolution fields, and better aligned with the statistical requirements of physical dynamical systems. Future work may extend this framework to fully three-dimensional systems, longer-context and multi-scale forecasting settings, and broader classes of partially observed or non-Markovian dynamics.

## 7 Acknowledgments

The authors acknowledge the use of resources provided by the Isambard-AI National AI Research Resource (AIRR). Isambard-AI is operated by the University of Bristol and funded by the UK Government's Department for Science, Innovation and Technology (DSIT) through UK Research and Innovation (UKRI). This work was supported under the project proposal "GenTurb3D".

## References

- [1] G. Alfonsi (2009). Reynolds-averaged Navier–Stokes equations for turbulence modeling.
- [2] V. Armegioiu, Y. Ramic, and S. Mishra (2025). Rectified flows for fast multiscale fluid flow modeling. arXiv preprint arXiv:2506.03111.
- [3] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas (2023). Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15619–15629.
- [4] K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian (2022). Pangu-Weather: a 3D high-resolution model for fast and accurate global weather forecast. arXiv preprint arXiv:2211.02556.
- [5] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. (2023). Stable Video Diffusion: scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.
- [6] S. R. Cachay, M. Aittala, K. Kreis, N. Brenowitz, A. Vahdat, M. Mardani, and R. Yu (2025). Elucidated rolling diffusion models for probabilistic forecasting of complex dynamics. arXiv preprint arXiv:2506.20024.
- [7] S. Chen and G. D. Doolen (1998). Lattice Boltzmann method for fluid flows. Annual Review of Fluid Mechanics 30(1), pp. 329–364.
- [8] X. Chen, Y. Wang, L. Zhang, S. Zhuang, X. Ma, J. Yu, Y. Wang, D. Lin, Y. Qiao, and Z. Liu (2023). SEINE: short-to-long video diffusion model for generative transition and prediction. In The Twelfth International Conference on Learning Representations.
- [9] P. Du, M. H. Parikh, X. Fan, X. Liu, and J. Wang (2024). Conditional neural field latent diffusion model for generating spatiotemporal turbulence. Nature Communications 15(1), pp. 10416.
- [10] A. ElGazzar and M. van Gerven (2025). Probabilistic forecasting via autoregressive flow matching. arXiv preprint arXiv:2503.10375.
- [11] A. Ergasti, G. G. Tarollo, F. Botti, T. Fontanini, C. Ferrari, M. Bertozzi, and A. Prati (2025). RFLAV: rolling flow matching for infinite audio video generation. arXiv preprint arXiv:2503.08307.
- [12] L. C. Evans (2022). Partial Differential Equations. Vol. 19, American Mathematical Society.
- [13] R. Eymard, T. Gallouët, and R. Herbin (2000). Finite volume methods. Handbook of Numerical Analysis 7, pp. 713–1018.
- [14] K. Frans, D. Hafner, S. Levine, and P. Abbeel (2024). One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557.
- [15] Z. Gao, X. Shi, B. Han, H. Wang, X. Jin, D. Maddix, Y. Zhu, M. Li, and Y. B. Wang (2023). PreDiff: precipitation nowcasting with latent diffusion models. Advances in Neural Information Processing Systems 36, pp. 78621–78656.
- [16] Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He (2025). Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447.
- [17] Z. Geng, Y. Lu, Z. Wu, E. Shechtman, J. Z. Kolter, and K. He (2025). Improved mean flows: on the challenges of fastforward generative models. arXiv preprint arXiv:2512.02012.
- [18] V. Gupta, T. Koren, and Y. Singer (2018). Shampoo: preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pp. 1842–1850.
- [19] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2021). Masked autoencoders are scalable vision learners. arXiv:2111.06377.
- [20] Flax: a neural network library and ecosystem for JAX. http://github.com/google/flax
- [21] F. Hess, Z. Monfared, M. Brenner, and D. Durstewitz (2023). Generalized teacher forcing for learning chaotic dynamics. arXiv preprint arXiv:2306.04406.
- [22] J. Ho, A. Jain, and P. Abbeel (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, pp. 6840–6851.
- [23] E. Hoogeboom, J. Heek, and T. Salimans (2023). Simple diffusion: end-to-end diffusion for high resolution images. In International Conference on Machine Learning, pp. 13213–13232.
- [24] A. Hore and D. Ziou (2010). Image quality metrics: PSNR vs. SSIM. In 2010 20th International Conference on Pattern Recognition, pp. 2366–2369.
- [25] R. Jiang, P. Y. Lu, E. Orlova, and R. Willett (2024). Training neural operators to preserve invariant measures of chaotic attractors. Advances in Neural Information Processing Systems 36.
- [26] T. Karras, M. Aittala, T. Aila, and S. Laine (2022). Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems 35, pp. 26565–26577.
- [27] S. Khodakarami, V. Oommen, A. Bora, and G. E. Karniadakis (2025). Mitigating spectral bias in neural operators via high-frequency scaling for physical systems. arXiv preprint arXiv:2503.13695.
- [28] J. Kim, J. Kang, J. Choi, and B. Han (2024). FIFO-Diffusion: generating infinite videos from text without training. Advances in Neural Information Processing Systems 37, pp. 89834–89868.
- [29] D. P. Kingma and J. Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [30] D. Kochkov, J. A. Smith, A. Alieva, Q. Wang, M. P. Brenner, and S. Hoyer (2021). Machine learning-accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences 118(21). https://www.pnas.org/content/118/21/e2101784118
- [31] G. Kohl, L. Chen, and N. Thuerey (2026). Benchmarking autoregressive conditional diffusion models for turbulent flow simulation. Neural Networks, pp. 108641.
- [32] J. Latt, B. Chopard, O. Malaspinas, M. Deville, and A. Michler (2008). Straight velocity boundaries in the lattice Boltzmann method. Physical Review E 77(5), pp. 056703.
- [33] M. Lee and R. D. Moser (2015). Direct numerical simulation of turbulent channel flow up to $Re_{\tau}\approx 5200$. Journal of Fluid Mechanics 774, pp. 395–415.
- [34] T. Li and K. He (2026). Back to basics: let denoising generative models denoise. arXiv:2511.13720.
- [35] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar (2020). Fourier neural operator for parametric partial differential equations. arXiv preprint arXiv:2010.08895.
- [36] Z. Li, M. Liu-Schiaffini, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar (2022). Learning chaotic dynamics in dissipative systems. Advances in Neural Information Processing Systems 35, pp. 16768–16781.
- [37] S. H. Lim, Y. Wang, A. Yu, E. Hart, M. W. Mahoney, X. S. Li, and N. B. Erichson (2024). Elucidating the design choice of probability paths in flow matching for forecasting. arXiv preprint arXiv:2410.03229.
- [38] H. Lin, P. Hu, M. Ren, Z. Gao, Z. Ma, T. Wu, S. Z. Li, et al. (2025). On the design of one-step diffusion via shortcutting flow paths. arXiv preprint arXiv:2512.11831.
- [39] T. Y. L. Lin, J. Yao, L. Chiang, J. Berner, and A. Anandkumar (2026). Decoupled diffusion sampling for inverse problems on function spaces. arXiv:2601.23280.
- [40] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le (2022). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
- [41] J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, et al. (2025). Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982.
- [42] Q. Liu (2022). Rectified flow: a marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577.
- [43] M. Liu-Schiaffini, J. Berner, B. Bonev, T. Kurth, K. Azizzadenesheli, and A. Anandkumar (2024). Neural operators with localized integral and differential kernels. arXiv preprint arXiv:2402.16845.
- [44] C. Lu and Y. Song (2024). Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081.
- [45] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis (2021). Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence 3(3), pp. 218–229.
- [46] Y. Lu, S. Lu, Q. Sun, H. Zhao, Z. Jiang, X. Wang, T. Li, Z. Geng, and K. He (2026). One-step latent-free image generation with pixel mean flows. arXiv preprint arXiv:2601.22158.
- [47] S. Luo, Y. Tan, L. Huang, J. Li, and H. Zhao (2023). Latent consistency models: synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378.
- [48] X. Luo, Z. Wang, Q. Wang, X. Shao, J. Lv, L. Wang, Y. Wang, and Y. Ma (2025). CrystalFlow: a flow-based generative model for crystalline materials. Nature Communications 16(1), pp. 9267.
- [49] X. Mao, Z. Jiang, F. Wang, J. Zhang, H. Chen, M. Chi, Y. Wang, and W. Luo (2025). OSV: one step is enough for high-quality image to video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 12585–12594.
- [50] M. McCabe, P. Harrington, S. Subramanian, and J. Brown (2023). Towards stability of autoregressive neural operators. arXiv preprint arXiv:2306.10619.
- [51] M. McCabe, P. Mukhopadhyay, T. Marwah, B. R. Blancard, F. Rozet, C. Diaconu, L. Meyer, K. W. Wong, H. Sotoudeh, A. Bietti, et al. (2025). Walrus: a cross-domain foundation model for continuum dynamics. arXiv preprint arXiv:2511.15684.
- [52] R. Mokady, A. Hertz, K. Aberman, Y. Pritch, and D. Cohen-Or (2023). Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6038–6047.
- [53] R. Molinaro, S. Lanthaler, B. Raonić, T. Rohner, V. Armegioiu, S. Simonis, D. Grund, Y. Ramic, Z. Y. Wan, F. Sha, et al. (2024). Generative AI for fast and accurate statistical computation of fluids. arXiv preprint arXiv:2409.18359.
- [54] A. Q. Nichol and P. Dhariwal (2021). Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162–8171.
- \[55\]V\. Oommen, A\. Bora, Z\. Zhang, and G\. E\. Karniadakis\(2024\)Integrating neural operators with diffusion models improves spectral representation in turbulence modeling\.arXiv preprint arXiv:2409\.08477\.Cited by:[§1](https://arxiv.org/html/2605.05540#S1.p2.1),[§2](https://arxiv.org/html/2605.05540#S2.SS0.SSS0.Px1.p1.1)\.
- \[56\]W\. Peebles and S\. Xie\(2023\)Scalable diffusion models with transformers\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 4195–4205\.Cited by:[Table 8](https://arxiv.org/html/2605.05540#A2.T8.8.8.6.1),[§5](https://arxiv.org/html/2605.05540#S5.SS0.SSS0.Px2.p1.7)\.
- \[57\]U\. Piomelli\(1999\)Large\-eddy simulation: achievements and challenges\.Progress in aerospace sciences35\(4\),pp\. 335–362\.Cited by:[§1](https://arxiv.org/html/2605.05540#S1.p1.1)\.
- \[58\]I\. Price, A\. Sanchez\-Gonzalez, F\. Alet, T\. R\. Andersson, A\. El\-Kadi, D\. Masters, T\. Ewalds, J\. Stott, S\. Mohamed, P\. Battaglia,et al\.\(2025\)Probabilistic weather forecasting with machine learning\.Nature637\(8044\),pp\. 84–90\.Cited by:[§1](https://arxiv.org/html/2605.05540#S1.p3.1)\.
- \[59\]M\. A\. Rahman, Z\. E\. Ross, and K\. Azizzadenesheli\(2022\)U\-no: u\-shaped neural operators\.arXiv preprint arXiv:2204\.11127\.Cited by:[§1](https://arxiv.org/html/2605.05540#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.05540#S4.SS1.p1.1),[§5](https://arxiv.org/html/2605.05540#S5.SS0.SSS0.Px3.p1.1)\.
- \[60\]J\. N\. Reddy\(1993\)An introduction to the finite element method\.New York27\(14\)\.Cited by:[§1](https://arxiv.org/html/2605.05540#S1.p1.1)\.
- \[61\]D\. Ruhe, J\. Heek, T\. Salimans, and E\. Hoogeboom\(2024\)Rolling diffusion models\.arXiv preprint arXiv:2402\.09470\.Cited by:[§B\.1](https://arxiv.org/html/2605.05540#A2.SS1.SSS0.Px1.p1.5),[Table 8](https://arxiv.org/html/2605.05540#A2.T8.3.3.2),[§1](https://arxiv.org/html/2605.05540#S1.p3.1),[§2](https://arxiv.org/html/2605.05540#S2.SS0.SSS0.Px1.p1.1)\.
- \[62\]S\. Rühling Cachay, B\. Zhao, H\. Joren, and R\. Yu\(2023\)Dyffusion: a dynamics\-informed diffusion model for spatiotemporal forecasting\.Advances in neural information processing systems36,pp\. 45259–45287\.Cited by:[§B\.1](https://arxiv.org/html/2605.05540#A2.SS1.SSS0.Px2.p2.1),[Table 8](https://arxiv.org/html/2605.05540#A2.T8.1.1.2),[Table 8](https://arxiv.org/html/2605.05540#A2.T8.1.1.6.1),[§1](https://arxiv.org/html/2605.05540#S1.p3.1),[§2](https://arxiv.org/html/2605.05540#S2.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2605.05540#S5.SS0.SSS0.Px3.p1.1)\.
- \[63\]D\. Shu, Z\. Li, and A\. B\. Farimani\(2023\)A physics\-informed diffusion model for high\-fidelity flow field reconstruction\.Journal of Computational Physics478,pp\. 111972\.Cited by:[§2](https://arxiv.org/html/2605.05540#S2.SS0.SSS0.Px1.p1.1)\.
- \[64\]Y\. Song, P\. Dhariwal, M\. Chen, and I\. Sutskever\(2023\)Consistency models\.Cited by:[§1](https://arxiv.org/html/2605.05540#S1.p3.1),[§2](https://arxiv.org/html/2605.05540#S2.SS0.SSS0.Px2.p1.1)\.
- \[65\]Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole\(2020\)Score\-based generative modeling through stochastic differential equations\.arXiv preprint arXiv:2011\.13456\.Cited by:[§A\.1](https://arxiv.org/html/2605.05540#A1.SS1.p1.1),[§B\.1](https://arxiv.org/html/2605.05540#A2.SS1.SSS0.Px1.p1.5),[§B\.1](https://arxiv.org/html/2605.05540#A2.SS1.SSS0.Px1.p2.3),[§1](https://arxiv.org/html/2605.05540#S1.p3.1),[§2](https://arxiv.org/html/2605.05540#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2605.05540#S3.SS0.SSS0.Px2.p1.6)\.
- \[66\]J\. C\. Strikwerda\(2004\)Finite difference schemes and partial differential equations\.SIAM\.Cited by:[§1](https://arxiv.org/html/2605.05540#S1.p1.1)\.
- \[67\]S\. Succi\(2001\)The Lattice Boltzmann Equation for Fluid Dynamics and Beyond\.Oxford University Press\.Cited by:[§C\.2](https://arxiv.org/html/2605.05540#A3.SS2.SSS0.Px3.p1.1)\.
- \[68\]E\. Volkmann, A\. Brändle, D\. Durstewitz, and G\. Koppe\(2024\)A scalable generative model for dynamical system reconstruction from neuroimaging data\.Advances in Neural Information Processing Systems37,pp\. 80328–80362\.Cited by:[§5](https://arxiv.org/html/2605.05540#S5.SS0.SSS0.Px4.p1.3)\.
- \[69\]I\. von Glehn, J\. S\. Spencer, and D\. Pfau\(2022\)A self\-attention ansatz for ab\-initio quantum chemistry\.arXiv preprint arXiv:2211\.13672\.Cited by:[§1](https://arxiv.org/html/2605.05540#S1.p1.1)\.
- \[70\]G\. Wang, Y\. Jiao, Q\. Xu, Y\. Wang, and C\. Yang\(2021\)Deep generative learning via schrödinger bridge\.InInternational conference on machine learning,pp\. 10794–10804\.Cited by:[§B\.2](https://arxiv.org/html/2605.05540#A2.SS2.SSS0.Px2.p1.1)\.
- \[71\]X\. Wang, S\. Zhang, H\. Zhang, Y\. Liu, Y\. Zhang, C\. Gao, and N\. Sang\(2023\)Videolcm: video latent consistency model\.arXiv preprint arXiv:2312\.09109\.Cited by:[§2](https://arxiv.org/html/2605.05540#S2.SS0.SSS0.Px2.p1.1)\.
- \[72\]Z\. Wang, A\. C\. Bovik, H\. R\. Sheikh, and E\. P\. Simoncelli\(2004\)Image quality assessment: from error visibility to structural similarity\.IEEE transactions on image processing13\(4\),pp\. 600–612\.Cited by:[§F\.2](https://arxiv.org/html/2605.05540#A6.SS2.p1.2),[§5](https://arxiv.org/html/2605.05540#S5.SS0.SSS0.Px4.p1.3)\.
- \[73\]A\. Wong, M\. J\. Shafiee, B\. Chwyl, and F\. Li\(2018\)Ferminets: learning generative machines to generate efficient neural networks via generative synthesis\.arXiv preprint arXiv:1809\.05989\.Cited by:[§1](https://arxiv.org/html/2605.05540#S1.p1.1)\.
- \[74\]P\. Wu, A\. Majumdar, K\. Stone, Y\. Lin, I\. Mordatch, P\. Abbeel, and A\. Rajeswaran\(2023\)Masked trajectory models for prediction, representation, and control\.External Links:2305\.02968,[Link](https://arxiv.org/abs/2305.02968)Cited by:[§C\.1](https://arxiv.org/html/2605.05540#A3.SS1.p1.1)\.
- \[75\]T\. Wu, Z\. Fan, X\. Liu, H\. Zheng, Y\. Gong, J\. Jiao, J\. Li, J\. Guo, N\. Duan, W\. Chen,et al\.\(2023\)Ar\-diffusion: auto\-regressive diffusion model for text generation\.Advances in Neural Information Processing Systems36,pp\. 39957–39974\.Cited by:[§B\.1](https://arxiv.org/html/2605.05540#A2.SS1.SSS0.Px1.p1.5)\.
- \[76\]X\. Xue, T\. Yang, M\. Gao, L\. Pan, M\. Wang, K\. Zhu, S\. Wang, J\. Li, M\. F\. ten Eikelder, and P\. V\. Coveney\(2026\)Uni\-flow: a unified autoregressive\-diffusion model for complex multiscale flows\.arXiv preprint arXiv:2602\.15592\.Cited by:[§1](https://arxiv.org/html/2605.05540#S1.p1.1),[§1](https://arxiv.org/html/2605.05540#S1.p3.1)\.
- \[77\]X\. Xue, H\. Yao, and L\. Davidson\(2022\)Synthetic turbulence generator for lattice Boltzmann method at the interface between RANS and LES\.Physics of Fluids34\(5\),pp\. 055118\.Cited by:[§G\.1](https://arxiv.org/html/2605.05540#A7.SS1.SSS0.Px3.p1.12)\.
- \[78\]M\. Yan, C\. Huang, P\. Bienstman, P\. Tino, W\. Lin, and J\. Sun\(2024\)Emerging opportunities and challenges for the future of reservoir computing\.Nature Communications15\(1\),pp\. 2056\.Cited by:[§D\.1](https://arxiv.org/html/2605.05540#A4.SS1.p1.6)\.
- \[79\]R\. Yang, P\. Srivastava, and S\. Mandt\(2023\)Diffusion probabilistic modeling for video generation\.Entropy25\(10\),pp\. 1469\.Cited by:[§B\.1](https://arxiv.org/html/2605.05540#A2.SS1.SSS0.Px1.p1.5)\.
- \[80\]T\. Yang and X\. Xue\(2026\)MENO: meanflow\-enhanced neural operators for dynamical systems\.External Links:2604\.06881,[Link](https://arxiv.org/abs/2604.06881)Cited by:[§2](https://arxiv.org/html/2605.05540#S2.SS0.SSS0.Px1.p1.1)\.
- \[81\]H\. Zhang, A\. Siarohin, W\. Menapace, M\. Vasilkovsky, S\. Tulyakov, Q\. Qu, and I\. Skorokhodov\(2025\)Alphaflow: understanding and improving meanflow models\.arXiv preprint arXiv:2510\.20771\.Cited by:[§B\.3](https://arxiv.org/html/2605.05540#A2.SS3.p1.8)\.
- \[82\]L\. Zhou, S\. Ermon, and J\. Song\(2025\)Inductive moment matching\.arXiv preprint arXiv:2503\.07565\.Cited by:[§2](https://arxiv.org/html/2605.05540#S2.SS0.SSS0.Px2.p1.1)\.
- \[83\]R\. Zwanzig\(1961\)Memory effects in irreversible thermodynamics\.Physical Review124\(4\),pp\. 983\.Cited by:[§C\.2](https://arxiv.org/html/2605.05540#A3.SS2.SSS0.Px3.p1.1),[§C\.2](https://arxiv.org/html/2605.05540#A3.SS2.p1.1)\.


## Appendix A Hyperparameters

### A.1 MeLISA Settings

We list the backbone architecture hyperparameters for MeLISA in Table 3. Here, *Depth* refers to the number of spatial downsampling stages in the UNet backbone and to the number of transformer layers in the DiT backbone. Our UNet implementation uses the flax.nnx API (flax [20] is a deep learning library for JAX) and is adapted from the flax.linen DDPM-style UNet of Song et al. [65]. Our DiT implementation also uses flax.nnx and follows Lu et al. [46], which builds on the design of Li and He [34]. In both backbones, we do not introduce separate temporal attention; instead, all input frames are processed jointly as a single large image.

Table 3: Architecture hyperparameter settings for MeLISA backbone models.

The model hyperparameters for MeLISA are listed in Table 4. The masking rate is chosen such that, on average, two frames are revealed during training, including the first frame, which is always retained as a temporal reference. The TIC lag weights increase with lag, thereby penalizing error accumulation over temporal evolution more strongly at longer horizons. Our sampling strategy for the case $t \neq r$, corresponding to the flow-matching limit of MeanFlow models [16, 17, 46], as well as the joint sampler for $t$ and $r$, is directly adapted from Lu et al. [46].

Table 4: MeLISA model settings.

The optimization hyperparameters are listed in Table 5. For the UNet-based variants, we use a linearly decaying learning-rate schedule, which we found helpful for learning fine-scale details. For the DiT-based variants, we follow Lu et al. [46] and use the Muon optimizer [41], a second-order method based on the Newton–Schulz iteration and developed for large-scale transformer training as an improvement over Shampoo [18]. Its learning rate is therefore *not* directly comparable to that of Adam. For larger models, we adopt cosine annealing schedules to reduce overfitting. We also decrease the number of training epochs as model size increases in order to approximately match training cost across models. The training random seed is fixed at 42 for all models.

Table 5: Optimization hyperparameter settings for MeLISA backbone models.

| Model | Optimizer | Learning Rate | Schedule | Epochs | Batch Size |
| --- | --- | --- | --- | --- | --- |
| MeLISA-$\Upsilon$-XS (UNet) | Adam [29] | 1e-4 | Linear Decay | 500 | 256 |
| MeLISA-$\Upsilon$-S (UNet) | Adam [29] | 1e-4 | Linear Decay | 500 | 256 |
| MeLISA-$\Delta$-S (DiT) | Muon [41] | 5e-3 | Constant | 300 | 256 |
| MeLISA-$\Delta$-M (DiT) | Muon [41] | 5e-3 | Cosine Annealing | 200 | 256 |
| MeLISA-$\Delta$-B (DiT) | Muon [41] | 5e-3 | Cosine Annealing | 150 | 256 |
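For readers reproducing the UNet-variant settings, a minimal optax sketch is shown below. The steps-per-epoch value is illustrative only, since the table specifies epochs and batch size but not dataset size.

```python
# Minimal sketch of the UNet-variant optimizer setup with optax.
# `steps_per_epoch` is a hypothetical placeholder; it depends on the
# dataset size and the batch size of 256.
import optax

steps_per_epoch = 100                      # illustrative assumption
total_steps = 500 * steps_per_epoch        # 500 epochs, as in Table 5

# Linearly decaying learning rate from 1e-4 to 0 over training.
schedule = optax.linear_schedule(init_value=1e-4, end_value=0.0,
                                 transition_steps=total_steps)
optimizer = optax.adam(learning_rate=schedule)
```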
### A.2 Baseline Settings

For the neural operator baselines, we follow Li et al. [35] and retain as many Fourier modes as feasible to improve spectral fidelity. We keep the model depth fixed across architectures and match overall model capacity as closely as possible by adjusting the channel expansion factor. All baselines are trained with two input frames and four output frames, consistent with the MeLISA setting. The architecture hyperparameters are summarized in Table 6.

Table 6: Architecture hyperparameter settings for neural operator models.

The optimization settings for the neural operator baselines are listed in Table 7. These settings are kept identical across all models. We adopt the learning-rate configuration of Li et al. [35], replacing the piecewise-constant decay schedule with a smooth linear decay.

Table 7: Optimization hyperparameter settings for neural operator models.
### A.3 MeLISA Training

Algorithm 1: MeLISA Model Training

Input: ordered dataset $\mathcal{D}$ (empirical distribution $\hat{p}_{\mathcal{D}}$), denoiser $D_{\theta}$, window length $W$, iterations $K$, batch size $B$, learning rate $\eta$, masking rate $\upsilon$, lag weights $\{\kappa_{w}\}_{w=1}^{W-1}$

1: Initialize parameters $\theta$ of network $D_{\theta}$
2: for $k=1,\dots,K$ do
3: Sample a batch of windows $\{\bar{x}_{i}^{\tau_{i}:\tau_{i}+W}\}_{i=1}^{B}\sim\hat{p}_{\mathcal{D}}$
4: Sample $\{\epsilon_{i}\}_{i=1}^{B}\sim\mathcal{N}(0,I)$, diffusion times $\{t_{i}\}_{i=1}^{B}$, and reference times $\{r_{i}\}_{i=1}^{B}$
5: Sample masks $\{\tau_{M,i}\}_{i=1}^{B}$ over later frames from $\mathrm{Bernoulli}(\upsilon)$, always retaining the first frame
6: for $i=1,\dots,B$ do
7: $\bar{z}_{t_{i}}^{\tau_{i}:\tau_{i}+W}\leftarrow(1-t_{i})\,\bar{x}_{i}^{\tau_{i}:\tau_{i}+W}+t_{i}\,\epsilon_{i}$
8: $\bar{v}_{i}^{\tau_{i}:\tau_{i}+W}\leftarrow\epsilon_{i}-\bar{x}_{i}^{\tau_{i}:\tau_{i}+W}$
9: $\bar{u}_{\theta,i}\leftarrow\dfrac{1}{t_{i}}\left[\bar{z}_{t_{i}}^{\tau_{i}:\tau_{i}+W}-D_{\theta}\!\left(\mathrm{Concat}\left[\bar{z}_{t_{i}}^{\tau_{i}:\tau_{i}+W},\ \mathcal{M}_{\tau_{M,i}}\!\left(\bar{x}_{i}^{\tau_{i}:\tau_{i}+W}\right),\ \tau_{M,i}\right],r_{i},t_{i}\right)\right]$
10: $\bar{V}_{\theta,i}\leftarrow\bar{u}_{\theta,i}+(t_{i}-r_{i})\cdot\operatorname{sg}\!\left(\partial_{t}\bar{u}_{\theta,i}+\bigl(\bar{v}_{i}^{\tau_{i}:\tau_{i}+W}\bigr)^{\top}\nabla_{\bar{z}}\bar{u}_{\theta,i}\right)$
11: $\hat{\bar{x}}_{i}^{\tau_{i}:\tau_{i}+W}\leftarrow D_{\theta}\!\left(\mathrm{Concat}\left[\epsilon_{i},\ \mathcal{M}_{\tau_{M,i}}\!\left(\bar{x}_{i}^{\tau_{i}:\tau_{i}+W}\right),\ \tau_{M,i}\right],0,1\right)$
12: end for
13: $\mathcal{L}_{\mathrm{WinC\text{-}MF}}\leftarrow\dfrac{1}{B}\sum_{i=1}^{B}\left\|\bar{V}_{\theta,i}-\bar{v}_{i}^{\tau_{i}:\tau_{i}+W}\right\|_{2}^{2}$
14: $\mathcal{L}_{\mathrm{TIC}}\leftarrow\dfrac{1}{B}\sum_{i=1}^{B}\sum_{w=1}^{W-1}\kappa_{w}\left\|\bigl(x_{i}^{\tau_{i}+w}-x_{i}^{\tau_{i}}\bigr)-\bigl(\hat{x}_{i}^{\tau_{i}+w}-\hat{x}_{i}^{\tau_{i}}\bigr)\right\|_{2}^{2}$
15: $\mathcal{L}_{\mathrm{MeLISA}}\leftarrow\mathcal{L}_{\mathrm{WinC\text{-}MF}}+\mathcal{L}_{\mathrm{TIC}}$
16: $\theta\leftarrow\theta-\eta\,\nabla_{\theta}\mathcal{L}_{\mathrm{MeLISA}}$
17: end for
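The training loop above maps naturally onto JAX. The following is a minimal single-window sketch under several simplifying assumptions: `denoise` stands in for $D_{\theta}$ with an assumed signature `(params, input, r, t)`, frames are stacked on the leading axis, the mask pattern is appended as an extra channel, and batching, the $(t, r)$ sampler, and exact loss normalization are elided.

```python
# Minimal JAX sketch of one MeLISA training objective (single window,
# no batching). All shapes and the `denoise` signature are assumptions.
import jax
import jax.numpy as jnp

def melisa_loss(params, denoise, window, mask, eps, t, r, kappa):
    """window: (W, H, Wpx, C) clean frames; mask: (W,) with mask[0] = 1;
    eps: noise shaped like window; t, r: scalar times; kappa: (W-1,)."""
    m = mask[:, None, None, None]
    # Conditioning: revealed frames plus the mask pattern as a channel.
    cond = jnp.concatenate([window * m,
                            jnp.broadcast_to(m, window.shape)], axis=-1)

    def u_fn(z, r_, t_):
        inp = jnp.concatenate([z, cond], axis=-1)   # Concat[z, M(x), tau_M]
        x_pred = denoise(params, inp, r_, t_)       # p-MF: network predicts X
        return (z - x_pred) / t_                    # induced average velocity

    z_t = (1.0 - t) * window + t * eps              # OT-FM interpolation
    v = eps - window                                # conditional velocity

    # Total derivative du/dt along the path: dz/dt = v, dr/dt = 0, dt/dt = 1.
    u, dudt = jax.jvp(u_fn, (z_t, r, t),
                      (v, jnp.zeros_like(r), jnp.ones_like(t)))
    V = u + (t - r) * jax.lax.stop_gradient(dudt)   # i-MF velocity estimate
    loss_wincmf = jnp.mean((V - v) ** 2)

    # One-step generation from pure noise (t=1 -> r=0) for the TIC term.
    x_hat = denoise(params, jnp.concatenate([eps, cond], axis=-1), 0.0, 1.0)
    d_true = window[1:] - window[0]                 # anchored increments
    d_pred = x_hat[1:] - x_hat[0]
    loss_tic = jnp.mean(kappa[:, None, None, None] * (d_true - d_pred) ** 2)
    return loss_wincmf + loss_tic                   # up to normalization
```

The `jax.jvp` call is how the $\operatorname{sg}(\partial_{t}\bar{u}+\bar{v}^{\top}\nabla_{\bar{z}}\bar{u})$ term of step 10 is typically realized: one forward-mode pass along the path direction yields the total derivative at negligible extra cost.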

## Appendix B Overview of Autoregressive Diffusion Models

This section provides a conceptual and theoretical overview of diffusion-based autoregressive modeling for physical dynamical systems and situates MeLISA within this landscape.

### B.1 Diffusion Autoregressive Models

#### Rolling diffusion.

One major line of work extends video diffusion models [5, 79, 28, 8] to temporal forecasting, typically by applying a window-based noise schedule. A representative early framework is the rolling sequence diffusion model [75, 61], which, like MeLISA, operates on sampled windows $\bar{x}_{i}^{\tau_{i}:\tau_{i}+W}$. These methods define a progressive noise schedule $\bar{\sigma}(t_{i};\sigma_{\min},\sigma_{\max},\rho)$, parameterized by the minimum noise level $\sigma_{\min}$, the maximum noise level $\sigma_{\max}$, and a curvature parameter $\rho$. Traditional DDPM-based variants, such as rolling diffusion models [61], typically use either constant-noise or variance-preserving schedules; see Song et al. [65] for details. More recent methods, such as elucidated rolling diffusion models (ERDM) [6], adopt EDM-inspired schedules [26] with explicit curvature control. These approaches are generally framed as *probabilistic forecasting models*, whose main advantage over deterministic autoregressive methods, such as neural operators, is their ability to represent predictive uncertainty. The general training procedure is summarized in Algorithm 2. Because inference in these models typically requires tens to hundreds of diffusion steps, their sampling speed is substantially slower than that of deterministic models.

Theoretically, following the score-based SDE formulation of diffusion models [65], rolling diffusion models can be interpreted as a physical-time-dependent extension of probabilistic diffusion. For a window-dependent variance schedule, the time derivative of the covariance matrix is

$$\dot{\mathbf{G}}_{\mathrm{window}}(t)=\operatorname{diag}\!\bigl(2\bar{\sigma}_{1}(t)\dot{\bar{\sigma}}_{1}(t)\mathbf{I}_{D},\ldots,2\bar{\sigma}_{W}(t)\dot{\bar{\sigma}}_{W}(t)\mathbf{I}_{D}\bigr). \tag{22}$$

Note that $\dot{\mathbf{G}}_{\mathrm{window}}(t)$ is block-diagonal and depends on the position of each frame within the rollout window. The corresponding forward diffusion SDE can be written as

$$\mathrm{d}\mathbf{x}_{t}=\mathbf{K}_{\mathrm{window}}(t)\,\mathrm{d}\omega(t), \tag{23}$$

where $\mathbf{K}_{\mathrm{window}}(t)$ is any matrix square root satisfying

$$\mathbf{K}_{\mathrm{window}}(t)\mathbf{K}_{\mathrm{window}}(t)^{\top}=\dot{\mathbf{G}}_{\mathrm{window}}(t). \tag{24}$$

Under standard regularity assumptions, the associated reverse-time SDE is

$$\mathrm{d}\mathbf{x}_{t}=-\dot{\mathbf{G}}_{\mathrm{window}}(t)\,\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t})\,\mathrm{d}t+\mathbf{K}_{\mathrm{window}}(t)\,\mathrm{d}\bar{\omega}(t), \tag{25}$$

and the corresponding probability-flow ODE is

$$\frac{\mathrm{d}\mathbf{x}_{t}}{\mathrm{d}t}=-\frac{1}{2}\,\dot{\mathbf{G}}_{\mathrm{window}}(t)\,\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t}). \tag{26}$$

In practice, the score is approximated by a denoiser network through the noisy conditional distribution induced by the rolling schedule. Hence, inference in rolling diffusion models amounts to numerically integrating either the reverse-time SDE or its probability-flow ODE counterpart while repeatedly advancing the physical rollout window.
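For concreteness, a per-frame rolling schedule can be sketched as below. Both the EDM-style $\rho$-parameterization and the clipped linear frame-time mapping $t_{w}=\mathrm{clip}(t+w/W)$ are illustrative assumptions of ours, not the specific schedule used by any of the cited methods.

```python
# Sketch of a window-dependent rolling noise schedule.
# The EDM-style rho-schedule and the frame-time mapping are assumptions.
import jax.numpy as jnp

def sigma_bar(t, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """rho-parameterized schedule evaluated at t in [0, 1]."""
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return (lo + t * (hi - lo)) ** rho

def window_sigmas(t, W):
    """Frames deeper in the window sit at larger effective noise times."""
    t_w = jnp.clip(t + jnp.arange(W) / W, 0.0, 1.0)
    return sigma_bar(t_w)                    # shape (W,), one sigma per frame
```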

Algorithm 2: General Rolling Diffusion Model Training

Input: ordered dataset $\mathcal{D}$ (empirical distribution $\hat{p}_{\mathcal{D}}$), denoiser $D_{\theta}$, window length $W$, iterations $K$, batch size $B$, learning rate $\eta$, noise schedule parameters $\sigma_{\min},\sigma_{\max},\rho$, preconditioning parameters $P_{\mathrm{mean}},P_{\mathrm{std}}$

1: Initialize parameters $\theta$ of network $D_{\theta}$
2: for $k=1,\ldots,K$ do
3: Sample a batch of windows $\left\{\bar{x}_{i}^{\tau_{i}:\tau_{i}+W}\right\}_{i=1}^{B}\sim\hat{p}_{\mathcal{D}}$
4: Sample $\left\{\epsilon_{i}\right\}_{i=1}^{B}\sim\mathcal{N}(0,I_{W\times D})$ and noise times $\left\{t_{i}\right\}_{i=1}^{B}\sim\mathcal{U}([0,1))$
5: for $i=1,\ldots,B$ do
6: $\boldsymbol{\sigma}_{i}=(\sigma_{i,1},\ldots,\sigma_{i,W})\leftarrow\bar{\sigma}(t_{i};\sigma_{\min},\sigma_{\max},\rho)$
7: $\bar{z}_{i}^{\tau_{i}:\tau_{i}+W}\leftarrow\bar{x}_{i}^{\tau_{i}:\tau_{i}+W}+\boldsymbol{\sigma}_{i}\odot\epsilon_{i}$
8: $\hat{x}_{i}^{\tau_{i}:\tau_{i}+W}\leftarrow c_{\mathrm{skip}}(\boldsymbol{\sigma}_{i})\odot\bar{z}_{i}^{\tau_{i}:\tau_{i}+W}+c_{\mathrm{out}}(\boldsymbol{\sigma}_{i})\odot D_{\theta}\!\left(c_{\mathrm{in}}(\boldsymbol{\sigma}_{i})\odot\bar{z}_{i}^{\tau_{i}:\tau_{i}+W},\,c_{\mathrm{noise}}(\boldsymbol{\sigma}_{i})\right)$
9: end for
10: $\mathcal{L}_{\mathrm{RD}}\leftarrow\dfrac{1}{B}\sum_{i=1}^{B}\dfrac{1}{W}\sum_{w=1}^{W}\lambda(\sigma_{i,w})\,f(\sigma_{i,w};P_{\mathrm{mean}},P_{\mathrm{std}})\,\left\|\left(\bar{x}_{i}^{\tau_{i}:\tau_{i}+W}\right)_{w}-\left(\hat{x}_{i}^{\tau_{i}:\tau_{i}+W}\right)_{w}\right\|_{2}^{2}$
11: $\theta\leftarrow\theta-\eta\,\nabla_{\theta}\mathcal{L}_{\mathrm{RD}}$
12: end for
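The preconditioned denoiser evaluation in step 8 follows the EDM convention [26]. For reference, the standard EDM coefficient choices can be written as below; individual methods, including those benchmarked here, may tune $\sigma_{\mathrm{data}}$ or the coefficients differently.

```python
# Standard EDM preconditioning coefficients (given here for reference;
# sigma_data is dataset-dependent and 0.5 is only a common default).
import jax.numpy as jnp

def edm_preconditioning(sigma, sigma_data=0.5):
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / jnp.sqrt(sigma**2 + sigma_data**2)
    c_in = 1.0 / jnp.sqrt(sigma**2 + sigma_data**2)
    c_noise = 0.25 * jnp.log(sigma)
    return c_skip, c_out, c_in, c_noise
```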

This viewpoint clarifies both the strength and the limitation of rolling diffusion models. Their main strength is that they provide a principled probabilistic formulation of uncertainty propagation over multiple future frames. However, because generation is tied to reverse-time integration, they are inherently *solver-based*: each forecast block requires multiple denoising evaluations across diffusion time, followed by a rolling update of the window. As a result, their inference procedure is not naturally compatible with a one-step autoregressive schema.

#### Autoregressive transition kernel.

A complementary theoretical perspective is to view autoregressive generative forecasting directly through a *transition kernel*. Let $\bar{c}$ denote the observed context and let $\hat{\bar{X}}_{\mathrm{blk}}$ denote a future block to be generated. Then an autoregressive generative surrogate may be written abstractly as

$$\hat{\bar{X}}_{\mathrm{blk}}\sim p_{\theta}\!\left(\hat{\bar{X}}_{\mathrm{blk}}\mid\bar{c}\right). \tag{27}$$

This approach has been validated in both pixel space [53, 62] and VAE-based latent space [15]. In these existing autoregressive diffusion formulations, the kernel is instantiated in a relatively restrictive way: the model is conditioned on a fixed, short temporal context and predicts either a single next snapshot or a fixed future block under a pre-specified input–output geometry. Such formulations are probabilistically sound, but they do not always fully exploit the structure of a partially observed spatiotemporal window, and they offer limited flexibility with respect to varying context length or forecast horizon.

### B.2 Flow Matching Autoregressive Models

#### Rolling flow matching models.

Analogous to rolling diffusion models, window-based rolling flow matching models have recently been explored in video generation [11], although the literature remains much sparser than that for diffusion-based methods. Simple rolling autoregressive schemes have also been studied for forecasting physical dynamical systems [10]. However, to the best of our knowledge, there is still no dedicated frame-dependent noise schedule for flow-matching models comparable to the schedules used in diffusion-based approaches such as ERDM [6].

#### Flow matching transition kernel models.

Transition kernel formulations have received greater attention in the flow-matching literature. Recent work has employed rectified flows [2] for spatiotemporal prediction, while Lim et al. [37] use flow-matching-based Schrödinger bridge models [70] to learn a transition kernel in latent space.

### B.3 MeanFlow Model Family

Before discussing the role and advantages of MeLISA, we briefly review the development of MeanFlow models. The original MeanFlow (o-MF) model [16] was introduced as a one-step alternative to standard optimal-transport flow matching (OT-FM). OT-FM defines the interpolation path

$$z_{t}=(1-t)x+t\epsilon, \tag{28}$$

where $\epsilon\sim\mathcal{N}(0,\mathbf{I})$. MeanFlow models the time-averaged velocity between diffusion time $t$ and a reference time $r$:

$$u(z_{t},r,t)=\frac{1}{t-r}\int_{r}^{t}v(z_{\tau}\mid x,\epsilon)\,d\tau, \tag{29}$$

which enables single-step transport from $z_{t}$ to $z_{r}$ via $z_{r}=z_{t}-(t-r)\,u(z_{t},r,t)$. Differentiating Eq. (29) yields the *MeanFlow identity*

$$u(z_{t},r,t)=v(z_{t}\mid x,\epsilon)-(t-r)\frac{d}{dt}u(z_{t},r,t), \tag{30}$$

which leads to the o-MF objective

$$\mathcal{L}_{\mathrm{o\text{-}MF}}=\left\|u_{\theta}(z_{t},r,t)-\operatorname{sg}(u_{\mathrm{tgt}})\right\|^{2}. \tag{31}$$

This can be interpreted as a self-supervised loss [38] defined over a velocity interval. However, o-MF has been reported to exhibit unstable training dynamics, including the phenomenon that the training loss may increase even as generation quality improves [17, 81]. This behavior is largely attributed to samples near the flow-matching limit $t=r$, where MeanFlow degenerates to standard OT-FM. $\alpha$-Flow [81] addresses this issue by introducing an annealing schedule that smoothly recovers the flow-matching objective. A more direct remedy was proposed by Geng et al. [17], who introduced improved MeanFlow (i-MF). Instead of supervising the interval-averaged velocity directly, i-MF uses Eq. (30) to construct an estimate of the instantaneous velocity:

$$V_{\theta}(z_{t},r,t)\equiv u_{\theta}+(t-r)\,\operatorname{sg}\!\left(\partial_{t}u_{\theta}+v(z_{t}\mid x,\epsilon)\,\partial_{z_{t}}u_{\theta}\right), \tag{32}$$

and optimizes the objective

$$\mathcal{L}_{\mathrm{i\text{-}MF}}=\mathbb{E}_{t,x,\epsilon}\left[\left\|V_{\theta}(z_{t},r,t)-v(z_{t}\mid x,\epsilon)\right\|^{2}\right]. \tag{33}$$

Both o-MF and i-MF were originally developed in VAE-induced latent space for $256\times 256$ image generation. To remove the VAE bottleneck, Li and He [34] proposed the Just-image Transformer (JiT), enabling pixel-space image generation up to $1024\times 1024$. Building on this idea, Lu et al. [46] introduced pixel MeanFlow (p-MF), which reparameterizes the prediction target in pixel space as

$$X_{\theta}(z_{t},r,t)=z_{t}-t\,u(z_{t},r,t),\qquad u(z_{t},r,t)=\frac{1}{t}\bigl[z_{t}-X_{\theta}(z_{t},r,t)\bigr]. \tag{34}$$

The induced $u$ is then used within the i-MF objective. In physical dynamical-system forecasting, many diffusion-based methods operate either in VAE latent space [15, 37] or at relatively low pixel resolutions (typically up to $128\times 128$). To the best of our knowledge, p-MF is the only one-step diffusion-based formulation that has been validated in high-dimensional pixel space. MeLISA is therefore built on top of p-MF.
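In code, Eq. (34) amounts to a small wrapper around the network: the model predicts clean pixels and the average velocity is induced algebraically, which also makes the one-step sampler explicit. A minimal sketch, assuming a network `net(z, r, t)` that returns the $X$-prediction:

```python
# p-MF reparameterization (Eq. 34): the network predicts clean pixels X,
# and the interval-averaged velocity u is induced from them.
def induced_velocity(net, z_t, r, t):
    x_pred = net(z_t, r, t)         # X_theta(z_t, r, t), pixel-space output
    return (z_t - x_pred) / t       # u(z_t, r, t) = (z_t - X_theta) / t

def one_step_sample(net, eps):
    # One-NFE generation: z_0 = z_1 - (1 - 0) * u(z_1, 0, 1)
    #                         = X_theta(z_1, 0, 1), with z_1 = eps.
    return net(eps, 0.0, 1.0)
```

Setting $t=1$ and $r=0$ collapses the transport step to $z_{0}=X_{\theta}(\epsilon,0,1)$, which is why a single network call suffices at inference.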

### B.4 Birth of MeLISA: The One-Step Dilemma

We now position MeLISA relative to existing diffusion-based forecasting paradigms. From the perspective of rolling diffusion, one-step generation appears at odds with the core idea of iterative denoising: to fully exploit one-step efficiency, all future frames would ideally be generated in a single network evaluation. From the transition-kernel perspective, however, we still seek to preserve the richer temporal conditioning provided by window-based generative models.

MeLISA occupies an intermediate point between these two views. On the one hand, it remains a probabilistic autoregressive model, since stochasticity enters through Gaussian noise injected into the masked future portion of the input window. On the other hand, unlike rolling SDE-based models, MeLISA does not require multi-step reverse-time integration. Its sampling procedure, summarized in Algorithm 3, takes a context buffer $\bar{c}$ and rollout block size $S=W-W_{\mathrm{ctx}}$, constructs

$$\tilde{\bar{x}}\leftarrow\mathrm{Concat}\!\left[\epsilon,\;\mathcal{M}_{\mathrm{fut}}\!\left([\bar{c};\,\mathbf{0}_{S\times F}]\right),\;\tau_{M}\right], \tag{35}$$

and applies a single denoising map

$$\hat{\bar{x}}=D_{\theta}(\tilde{\bar{x}},0,1),\qquad\hat{\bar{X}}_{\mathrm{blk}}=\operatorname{Tail}_{S}(\hat{\bar{x}}). \tag{36}$$

This defines the stochastic transition kernel

$$p_{\theta}\!\left(\hat{\bar{X}}_{\mathrm{blk}}\mid\bar{c}\right)=\int\delta\!\Bigl(\hat{\bar{X}}_{\mathrm{blk}}-\operatorname{Tail}_{S}\!\bigl(D_{\theta}(\tilde{\bar{x}}(\epsilon),0,1)\bigr)\Bigr)\,p(\epsilon)\,d\epsilon, \tag{37}$$

where $\delta(\cdot)$ denotes the Dirac delta distribution. Hence, MeLISA preserves the probabilistic transition-kernel interpretation of autoregressive generation, but defines the kernel over a *partially observed spatiotemporal window* rather than a single-step state transition.

This perspective yields two key advantages. First, compared with standard autoregressive diffusion kernels, MeLISA conditions on a richer object (a masked temporal window), which allows it to exploit multi-frame structure more effectively. Second, compared with SDE-based rolling diffusion models such as ERDM, MeLISA supports one-step inference: each future block is generated with a single model evaluation instead of iterative reverse-time integration. In this sense, MeLISA combines the probabilistic expressivity of stochastic autoregressive modeling with the efficiency and flexibility of direct window-conditioned generation.

This interpretation also explains MeLISA's practical flexibility at inference time. Since the model conditions only on the most recent $W_{\mathrm{ctx}}$ frames and predicts blocks of size $S$, it naturally supports arbitrary observed input lengths $W_{\mathrm{in}}\geq W_{\mathrm{ctx}}$ and arbitrary forecast horizons $W_{\mathrm{out}}$ via repeated blockwise rollout. Unlike classical rolling diffusion models, whose formulations are tightly coupled to diffusion-time integration, MeLISA decouples *probabilistic forecasting* from *multi-step SDE solving*. This yields a one-step stochastic autoregressive model that is particularly well suited for long-horizon dynamical prediction. Table 8 compares MeLISA with the models discussed in this section. We note that, due to the novelty of our formulation, MeLISA has *no strict counterpart* that can serve as a direct benchmark.

Table 8: Model-level comparison between MeLISA and existing diffusion-based generative models. Here, UQ denotes uncertainty quantification and TK denotes transition kernel. CoNFiLD performs long-horizon prediction in a non-autoregressive manner by directly predicting all latent variables in a single diffusion batch. In contrast, MeLISA models are 1-NFE, latent-free, and do not require a specialized backbone. Under this analysis, MeLISA has no strict counterpart among existing methods.

Algorithm 3: MeLISA Autoregressive Inference

Input: trained denoiser $D_{\theta}$, observed context $\bar{x}_{W_{\mathrm{in}}}^{\tau}$, forecast horizon $W_{\mathrm{out}}$, model window length $W$, conditioning length $W_{\mathrm{ctx}}<W$

1: Set rollout block size $S\leftarrow W-W_{\mathrm{ctx}}$
2: Let $\operatorname{Head}_{q}(\cdot)$ and $\operatorname{Tail}_{q}(\cdot)$ denote the first and last $q$ frames of a sequence, respectively
3: Initialize context buffer $\bar{c}\leftarrow\operatorname{Tail}_{W_{\mathrm{ctx}}}\!\left(\bar{x}_{W_{\mathrm{in}}}^{\tau}\right)$
4: Initialize forecast sequence $\hat{\mathcal{Y}}\leftarrow[\ ]$
5: while $|\hat{\mathcal{Y}}|<W_{\mathrm{out}}$ do
6: Form a partially observed rollout window $\tilde{\bar{x}}\leftarrow\mathrm{Concat}\left[\epsilon,\ \mathcal{M}_{\mathrm{fut}}\!\left(\left[\bar{c};\,\mathbf{0}_{S\times F}\right]\right),\ \tau_{M}\right]$, where $\mathcal{M}_{\mathrm{fut}}$ masks the final $S$ frames and preserves the first $W_{\mathrm{ctx}}$ frames, $\epsilon$ is Gaussian noise, and $\tau_{M}$ is the masking time indices
7: $\hat{\bar{x}}\leftarrow D_{\theta}\!\left(\tilde{\bar{x}},\,0,\,1\right)$
8: $\hat{\bar{x}}_{\mathrm{blk}}\leftarrow\operatorname{Tail}_{S}\!\left(\hat{\bar{x}}\right)$
9: $q\leftarrow\min\!\left(S,\;W_{\mathrm{out}}-|\hat{\mathcal{Y}}|\right)$
10: $\hat{\mathcal{Y}}\leftarrow\left[\hat{\mathcal{Y}};\,\operatorname{Head}_{q}\!\left(\hat{\bar{x}}_{\mathrm{blk}}\right)\right]$
11: Update the context buffer with the most recent $W_{\mathrm{ctx}}$ frames: $\bar{c}\leftarrow\operatorname{Tail}_{W_{\mathrm{ctx}}}\!\left(\left[\bar{c};\,\operatorname{Head}_{q}\!\left(\hat{\bar{x}}_{\mathrm{blk}}\right)\right]\right)$
12: end while
13: return $\hat{\bar{X}}_{W_{\mathrm{out}}}^{\tau+W_{\mathrm{in}}}\leftarrow\hat{\mathcal{Y}}$
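A compact JAX sketch of this rollout loop is given below; `denoise` is assumed to wrap the trained $D_{\theta}$, frames are stacked on the leading axis, and the conditioning channels are simplified relative to the full Concat in step 6.

```python
# Sketch of MeLISA blockwise one-step autoregressive inference
# (cf. Algorithm 3). Shapes and the `denoise` signature are assumptions.
import jax
import jax.numpy as jnp

def rollout(denoise, context, W, W_ctx, W_out, key):
    S = W - W_ctx                                   # rollout block size
    ctx = context[-W_ctx:]                          # Tail_{W_ctx} of the input
    frames, n_done = [], 0
    while n_done < W_out:
        key, sub = jax.random.split(key)
        eps = jax.random.normal(sub, (W,) + ctx.shape[1:])
        # Partially observed window: context revealed, future frames zeroed.
        cond = jnp.concatenate([ctx, jnp.zeros((S,) + ctx.shape[1:])], axis=0)
        mask = jnp.concatenate([jnp.ones(W_ctx), jnp.zeros(S)])
        inp = jnp.concatenate(
            [eps, cond,
             jnp.broadcast_to(mask[:, None, None, None], eps.shape)],
            axis=-1)
        x_hat = denoise(inp, 0.0, 1.0)              # single model evaluation
        block = x_hat[-S:]                          # Tail_S: generated frames
        q = min(S, W_out - n_done)                  # Head_q of the block
        frames.append(block[:q])
        n_done += q
        ctx = jnp.concatenate([ctx, block[:q]], axis=0)[-W_ctx:]
    return jnp.concatenate(frames, axis=0)
```

Arbitrary $W_{\mathrm{in}}\geq W_{\mathrm{ctx}}$ and $W_{\mathrm{out}}$ follow directly from the slicing: the loop only ever sees the last $W_{\mathrm{ctx}}$ frames and truncates the final block.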

## Appendix C Theoretical Origin of MeLISA

### C.1 Window Consistency: A Self-supervised Learning Perspective

We now discuss the motivation behind the Window-Consistency MeanFlow (WinC-MF) objective in Eq. (17). The main challenge is that MeanFlow is fundamentally a one-step generative method: the model should receive enough temporal context to infer meaningful dynamics, but not so much information that forecasting reduces to a deterministic copy task. Our solution is to condition on *masked trajectories*. In this way, the model can exploit partial temporal structure while uncertainty remains over the missing portions of the window. This design is closely related to masked modeling approaches such as Masked Autoencoders (MAEs) [19] and Masked Trajectory Models (MTMs) [74], which likewise learn predictive representations from partially observed inputs.

Using the notation from the main text, and suppressing the explicit $\mathrm{Concat}$ operation for simplicity, let

$$\mathcal{F}_{\mathcal{M}}=\sigma\!\left(\bar{z}_{t}^{\tau:\tau+W},\,\mathcal{M}_{\tau_{M}}(\bar{x}^{\tau:\tau+W}),\,\tau_{M},\,t,\,r\right) \tag{38}$$

denote the information available to the model: the noisy window, the mask pattern, and the diffusion-time variables. Define the target field as

$$\hat{X}_{t}^{\tau:\tau+W}=\bar{z}_{t}^{\tau:\tau+W}-t\,u\!\left(\bar{z}_{t}^{\tau:\tau+W},\,r,\,t\right). \tag{39}$$

From a self-supervised learning viewpoint, WinC-MF asks the model to predict this target from partial information. Under squared loss, the optimal predictor is the conditional expectation

$$\mathbb{E}\!\left[\hat{X}_{t}^{\tau:\tau+W}\mid\mathcal{F}_{\mathcal{M}}\right]. \tag{40}$$

Indeed, for any $\mathcal{F}_{\mathcal{M}}$-measurable predictor $D$, the prediction error decomposes into an irreducible term and an approximation term:

$$\mathbb{E}\!\left[\left\|\hat{X}_{t}^{\tau:\tau+W}-D\right\|^{2}\right]=\mathbb{E}\!\left[\left\|\hat{X}_{t}^{\tau:\tau+W}-\mathbb{E}\!\left[\hat{X}_{t}^{\tau:\tau+W}\mid\mathcal{F}_{\mathcal{M}}\right]\right\|^{2}\right]+\mathbb{E}\!\left[\left\|\mathbb{E}\!\left[\hat{X}_{t}^{\tau:\tau+W}\mid\mathcal{F}_{\mathcal{M}}\right]-D\right\|^{2}\right]. \tag{41}$$

The first term does not depend on $D$, so the loss is minimized uniquely by the conditional expectation. Although MeLISA is not trained with a direct regression loss in $X$-space, the same conclusion applies to WinC-MF because its objective differs only by linear transformations of the underlying target.

This perspective makes the role of masking especially clear. If the full clean trajectory were revealed, then the target would become measurable with respect to the conditioning variables, so the conditional variance would collapse to zero. In that regime, there would be no irreducible uncertainty left in the task: the model could simply recover the target directly from the observed trajectory, and forecasting would lose its probabilistic character. By contrast, under nontrivial masking, some future components remain hidden and are not determined by $\mathcal{F}_{\mathcal{M}}$. Whenever the target depends genuinely on those hidden components, it is no longer fully measurable given the conditioning information, and the corresponding conditional variance remains strictly positive on a set of nonzero probability. In other words, masking prevents the trivial *copy-everything* solution by preserving uncertainty over the unobserved part of the window.

Therefore, WinC-MF can be viewed as a form of self-supervised conditional inference: the model learns to reconstruct a target defined over a temporal window from a partially observed and noisy version of that same window. The observed entries provide enough structure to encode local temporal regularities, while the masked entries force the model to infer plausible hidden dynamics rather than memorize the full trajectory. This is precisely what allows WinC-MF to enrich MeanFlow with temporal context without collapsing the forecasting problem into deterministic reconstruction.

### C.2 Time Increment Consistency Loss

We now discuss the role of the Time Increment Consistency (TIC) loss in MeLISA. The central question is why a one-step pixel-space autoregressive generative model should remain physically consistent over long rollouts, especially in turbulent regimes. Our answer is that MeLISA does not rely on state reconstruction alone. In addition to the MeanFlow objective, it imposes an explicit finite-lag temporal-increment constraint, which regularizes how the resolved observable evolves across time. This provides a concrete mechanism for the benchmark classes considered in this work. In closed systems such as KF256, TIC approximates consistency with the integrated PDE tendency of the vorticity field. In projected systems such as TCF192, which is obtained from an LBM-LES turbulent channel-flow simulation, TIC constrains the finite-lag evolution of the reduced observable in exactly the sense suggested by projected-dynamics formalisms such as Mori–Zwanzig [83]. More broadly, TIC acts on lagged fluctuation structure rather than only on pointwise reconstruction, which is the level at which many physically relevant turbulent statistics are defined.

#### Setup and notation.

Let $H\cong\mathbb{R}^{d}$ denote the finite-dimensional resolved state space after spatial discretization of the observable of interest. At discrete physical time $\tau\in\mathbb{Z}^{+}$, let

$$x^{\tau}\in H \tag{42}$$

denote the ground-truth resolved state, and let $\hat{x}^{\tau}$ denote the corresponding model reconstruction or prediction. For a rollout window of length $W$, define the lag-$w$ temporal increment by

$$\Delta x^{\tau,\tau+w}:=x^{\tau+w}-x^{\tau},\qquad w=1,\dots,W-1, \tag{43}$$

and similarly

$$\Delta\hat{x}^{\tau,\tau+w}:=\hat{x}^{\tau+w}-\hat{x}^{\tau}. \tag{44}$$

The TIC loss used in MeLISA is

$$\mathcal{L}_{\mathrm{TIC}}=\sum_{w=1}^{W-1}\kappa_{w}\,\mathbb{E}_{x,\tau,\tau_{M},\epsilon}\left[\left\|\Delta x^{\tau,\tau+w}-\Delta\hat{x}^{\tau,\tau+w}\right\|_{2}^{2}\right], \tag{45}$$

with lag weights $\kappa_{w}\geq 0$. This is the anchored increment form implemented in MeLISA: each future frame is compared to the same anchor time $\tau$, rather than only to its immediate predecessor. The key point is that TIC is not a generic smoothness prior. In turbulence, rapid fluctuation is often physical, so a smoothness penalty by itself would be poorly justified. TIC instead penalizes mismatch in finite-lag evolution, which is much closer to the way temporal structure is characterized in turbulence theory and in statistical descriptions of dynamical systems.

A first useful observation is that finite-lag increments are directly tied to temporal covariance decay. If $\{X^{\tau}\}_{\tau\in\mathbb{Z}}$ is a second-order stationary $H$-valued process with mean $\mu=\mathbb{E}[X^{\tau}]$ and covariance operator

$$\Gamma(w):=\mathbb{E}\!\left[(X^{\tau+w}-\mu)\otimes(X^{\tau}-\mu)\right], \tag{46}$$

then expanding the squared increment gives

$$\mathbb{E}\!\left[\|X^{\tau+w}-X^{\tau}\|_{2}^{2}\right]=2\,\operatorname{tr}\!\left(\Gamma(0)-\Gamma(w)\right). \tag{47}$$

So at second order, finite-lag increments are not auxiliary quantities: they are exactly equivalent to temporal covariance decay. In particular, matching increment statistics necessarily constrains how quickly temporal correlations are lost along the rollout.
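Eq. (47) is easy to verify numerically on a toy stationary process. The sketch below uses a scalar AR(1) process $X^{\tau+1}=aX^{\tau}+e^{\tau}$ with unit-variance noise, for which $\Gamma(w)=a^{w}\Gamma(0)$ and $\Gamma(0)=1/(1-a^{2})$.

```python
# Numerical check of Eq. (47) on a scalar stationary AR(1) process.
import jax
import jax.numpy as jnp

a, n, w = 0.9, 200_000, 5
noise = jax.random.normal(jax.random.PRNGKey(0), (n,))

def step(x_prev, e):
    x = a * x_prev + e                    # X_{tau+1} = a X_tau + e_tau
    return x, x

_, x = jax.lax.scan(step, 0.0, noise)

gamma0 = 1.0 / (1.0 - a**2)               # stationary variance Gamma(0)
lhs = jnp.mean((x[w:] - x[:-w]) ** 2)      # E||X^{tau+w} - X^tau||^2
rhs = 2.0 * (gamma0 - a**w * gamma0)       # 2 (Gamma(0) - Gamma(w))
print(lhs, rhs)                            # the two agree for large n
```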

The same point admits a spectral interpretation. For any fixed $a\in H$, define the scalar projected process

$$z^{\tau}:=\langle a,X^{\tau}\rangle \tag{48}$$

with autocovariance

$$R_{z}(w):=\mathbb{E}\!\left[(z^{\tau+w}-\mathbb{E}z^{\tau})(z^{\tau}-\mathbb{E}z^{\tau})\right]. \tag{49}$$

Then

$$\mathbb{E}\!\left[|z^{\tau+w}-z^{\tau}|^{2}\right]=2\bigl(R_{z}(0)-R_{z}(w)\bigr). \tag{50}$$

Under the Wiener–Khinchin representation

$$R_{z}(w)=\frac{1}{2\pi}\int_{-\pi}^{\pi}e^{iw\omega}S_{z}(\omega)\,\mathrm{d}\omega, \tag{51}$$

this becomes

$$\mathbb{E}\!\left[|z^{\tau+w}-z^{\tau}|^{2}\right]=\frac{1}{2\pi}\int_{-\pi}^{\pi}2\bigl(1-\cos(w\omega)\bigr)S_{z}(\omega)\,\mathrm{d}\omega. \tag{52}$$

Hence TIC is also a temporal-spectrum-aware regularizer: it penalizes mismatch in the energy carried by temporal frequencies, with lag-dependent weighting through the filter $2(1-\cos(w\omega))$.

A second observation is that TIC reduces to tendency matching at short lags. If $x:[0,T]\to H$ is a smooth trajectory satisfying

$$\dot{x}(t)=F(x(t)), \tag{53}$$

then the fundamental theorem of calculus gives

$$x(t+\delta)-x(t)=\int_{t}^{t+\delta}F(x(s))\,\mathrm{d}s, \tag{54}$$

and a first-order Taylor expansion yields

$$\frac{x(t+\delta)-x(t)}{\delta}=F(x(t))+O(\delta),\qquad\delta\to 0. \tag{55}$$

Thus, at sufficiently short lags, TIC acts as a discrete tendency-matching regularizer: it encourages the model not only to predict plausible states, but also to reproduce the local time evolution that connects them.

This also clarifies why a state-only objective is insufficient. Consider a deterministic one-step predictor $f(c^{\tau})$ trained from context $c^{\tau}$ by the squared loss

$$\mathcal{L}_{\mathrm{MSE}}(f)=\mathbb{E}\left[\|f(c^{\tau})-x^{\tau+1}\|_{2}^{2}\right]. \tag{56}$$

Conditioning on $c^{\tau}$ and completing the square shows that the minimizer is

$$f^{\star}(c^{\tau})=\mathbb{E}\!\left[x^{\tau+1}\mid c^{\tau}\right]. \tag{57}$$

So a pointwise squared-loss predictor is driven toward a conditional mean. In chaotic, turbulent, or partially observed systems, however, the physically relevant object is often a full conditional law, or at least a correct multi-time organization of the resolved observable. MeLISA addresses the first issue through its Window-Consistency MeanFlow formulation, but a state-only loss still does not directly constrain finite-lag structure. TIC supplies that missing constraint by forcing the rollout to evolve with the correct lagged increments.

These general observations become especially transparent on our two benchmarks.

#### KF256: finite-lag consistency with the vorticity dynamics.

For KF256, the learned state is the discretized vorticity field. Let $\omega$ denote the continuous vorticity, let $S_{t}$ be the exact flow map of the underlying Kolmogorov-flow dynamics, and let $\mathcal{O}$ be the observation operator mapping $\omega$ to the discretized field used by the network. Then

$$x^{\tau}=\mathcal{O}\omega(\tau\Delta t),\qquad x^{\tau+w}=\mathcal{O}S_{w\Delta t}\omega(\tau\Delta t), \tag{58}$$

so the finite-lag increment is

$$\Delta x^{\tau,\tau+w}=\mathcal{O}\!\left(S_{w\Delta t}\omega(\tau\Delta t)-\omega(\tau\Delta t)\right). \tag{59}$$

Equivalently,

$$\Delta x^{\tau,\tau+w}=\int_{\tau\Delta t}^{(\tau+w)\Delta t}\mathcal{O}\,\partial_{t}\omega(s)\,\mathrm{d}s. \tag{60}$$

If the vorticity equation is written abstractly as

$$\partial_{t}\omega=\mathcal{N}(\omega)+g_{\mathrm{K}}, \tag{61}$$

where $\mathcal{N}$ collects the advective and dissipative terms and $g_{\mathrm{K}}$ denotes the Kolmogorov forcing contribution, then

$$\Delta x^{\tau,\tau+w}=\int_{\tau\Delta t}^{(\tau+w)\Delta t}\mathcal{O}\bigl(\mathcal{N}(\omega(s))+g_{\mathrm{K}}(s)\bigr)\,\mathrm{d}s. \tag{62}$$

This gives TIC a direct interpretation on KF256: it penalizes mismatch in the time-integrated vorticity tendency over finite lags. In the short-lag regime,

$$\Delta x^{\tau,\tau+w}=w\Delta t\,\mathcal{O}\bigl(\partial_{t}\omega(\tau\Delta t)\bigr)+O\!\left((w\Delta t)^{2}\right), \tag{63}$$

so TIC becomes a discrete consistency condition on the true PDE evolution. This is the cleanest setting for TIC, because the resolved field is generated by a closed equation and finite-lag increments are simply finite-time displacements under that dynamics.

#### TCF192: finite-lag consistency for a projected LBM-LES observable.

The interpretation for TCF192 is different. Here the target is a 2D observable extracted from a turbulent channel-flow simulation generated by LBM-LES [7, 67], rather than the full state of a closed 2D system. Accordingly, the learned field should be viewed as a projected observable of a higher-dimensional flow, not as a state that necessarily satisfies a closed autonomous equation on its own. This is exactly the setting in which projected-dynamics formalisms such as Mori–Zwanzig [83] become relevant.

Let $\mathcal{X}(t)\in\mathcal{X}$ denote the full channel-flow state, evolving according to

$$\dot{\mathcal{X}}(t)=\mathcal{R}(\mathcal{X}(t)). \tag{64}$$

Let

$$\Pi:\mathcal{X}\to H \tag{65}$$

be the observation operator extracting the 2D field used for learning, and define the reduced observable

$$x(t):=\Pi\mathcal{X}(t). \tag{66}$$

Because $\Pi$ removes unresolved degrees of freedom, $x(t)$ is generally non-Markovian when viewed in isolation. In the Mori–Zwanzig framework, this reduced evolution can be written as the sum of a resolved drift term, a memory term, and an orthogonal or noise term. If $\mathcal{L}$ denotes the Liouville operator and $\mathcal{P}$ is a projection onto resolved observables with $\mathcal{Q}=I-\mathcal{P}$, then a standard identity has the form

$$\frac{\mathrm{d}}{\mathrm{d}t}\xi(t)=\mathcal{P}\mathcal{L}\xi(t)+\mathcal{P}\mathcal{L}\int_{0}^{t}e^{(t-s)\mathcal{Q}\mathcal{L}}\mathcal{Q}\mathcal{L}\xi(s)\,\mathrm{d}s+\mathcal{P}\mathcal{L}e^{t\mathcal{Q}\mathcal{L}}\mathcal{Q}\phi_{0}, \tag{67}$$

where $\xi(t)$ is the resolved observable and $\phi_{0}$ is the initial observable. Integrating this identity over a finite lag $\delta$ shows that the increment of the reduced observable naturally decomposes into

$$x(t+\delta)-x(t)=\text{resolved drift contribution}+\text{memory contribution}+\text{unresolved contribution}. \tag{68}$$

This is the key point for TCF192: finite-lag increments of the learned slice accumulate not only instantaneous resolved dynamics, but also memory and unresolved forcing inherited from the full turbulent system. A finite-lag increment loss is therefore exactly the right kind of regularizer for this setting.

In other words, TIC is not imposing an artificial closed 2D PDE on the TCF192 field\. Instead, it regularizes the part of the projected dynamics that is actually observable and learnable, namely its finite\-lag evolution\. This viewpoint is physically meaningful beyond formal projection theory\. Space\-time correlations are classical objects in wall\-bounded turbulence, and recent channel\-flow studies report well\-defined lagged relationships among turbulent kinetic energy, dissipation, and wall\-normal structure\. From this perspective, preserving finite\-lag increments is a principled way to preserve the temporal organization of the reduced observable, even when no closed equation is available for that observable alone\.

In summary, TIC regularizes the temporal organization of MeLISA rollouts at a level that state reconstruction alone cannot enforce\. On KF256, where the resolved variable is the vorticity field of a closed system, TIC approximates consistency with the integrated governing dynamics\. On TCF192, where the target is a projected observable produced by LBM\-LES, TIC constrains the finite\-lag behavior of the reduced field, including the effects of memory and unresolved forcing\. This is precisely why TIC improves long\-horizon physical realism: it encourages the model not only to predict plausible states, but also to evolve them with the correct lagged structure\.

## Appendix D Ablation Studies

### D.1 Effect of WinC-MF Mechanism

In this section, we study how MeLISA depends on the number of guidance frames used during autoregressive rollout. We note that WinC-MF cannot be removed or ablated, as it is the principal mechanism that enables MeLISA to perform spatiotemporal prediction; the distinction between WinC-MF and other possible formulations is discussed in detail in Appendix [B](https://arxiv.org/html/2605.05540#A2). Specifically, on the KF256 dataset, we examine the trend of both short-term and long-term metrics as the number of guidance frames varies from 1 to 5. For short-term prediction, we focus on the R$L_{2}$ metric, shown in Fig. [4](https://arxiv.org/html/2605.05540#A4.F4), computed over the first 40 frames as in the main text. Most models remain stable with 1–3 guidance frames but deteriorate sharply when 4 or 5 frames are provided. In contrast, MeLISA-$\Delta$-M and MeLISA-$\Delta$-B remain stable across the entire range. This behavior is consistent with prior observations that pixel-space self-supervised objectives typically require sufficiently large models to be effective [[19](https://arxiv.org/html/2605.05540#bib.bib168), [3](https://arxiv.org/html/2605.05540#bib.bib190), [78](https://arxiv.org/html/2605.05540#bib.bib42)]. Under the training masking rate of 0.8, the expected number of revealed frames is $1+5\times(1-0.8)=2$, including the first frame that is always observed. Interestingly, the largest model, MeLISA-$\Delta$-B, achieves its best performance with 3 guidance frames. This suggests that, with further scaling and larger training datasets, MeLISA may exhibit stronger emergent context utilization, pointing toward the possibility of a generative foundation model for dynamical systems.

Another important metric is PSDD. The dependence of full-trajectory PSDD on the number of guidance frames is shown in Fig. [5](https://arxiv.org/html/2605.05540#A4.F5). Interestingly, MeLISA-$\Delta$-M exhibits a consistent improvement in PSDD as more context frames are provided. MeLISA-$\Upsilon$-XS, despite being the smallest model, achieves the best overall spectral accuracy, highlighting its parameter efficiency. To further assess physical consistency, we also examine TKED as a function of the number of guidance frames. The trend in Fig. [6](https://arxiv.org/html/2605.05540#A4.F6) shows that MeLISA-$\Delta$-M and MeLISA-$\Delta$-B achieve the strongest long-term physical consistency, with TKED values that are smaller when 3 or 5 frames are revealed.

Overall, these results suggest that MeLISA\-Δ\\Delta\-M and MeLISA\-Δ\\Delta\-B make the most effective use of additional context\. This provides encouraging evidence that larger MeLISA models can develop stronger context utilization and more faithfully capture the underlying physical dynamics, pointing toward the potential for MeLISA\-based foundation models for dynamical systems\.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/rl2.png)
Figure 4: Prediction R$L_{2}$ loss against the number of MeLISA guidance frames.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/PSDD.png)
Figure 5: Long-term PSDD loss against the number of MeLISA guidance frames. The y-axis is in log scale.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/TKED.png)
Figure 6: TKED loss against the number of MeLISA guidance frames. The y-axis is in log scale.
### D.2 Effect of TIC Loss

To evaluate the role of the TIC loss, we additionally trained MeLISA-$\Delta$-M on KF256 without the TIC term. Autoregressive rollout results for an uncurated initial condition are shown in Fig. [7](https://arxiv.org/html/2605.05540#A4.F7), and the corresponding statistics are summarized in Table [9](https://arxiv.org/html/2605.05540#A4.T9). The ablation reveals a characteristic failure mode: without TIC, the prediction gradually collapses toward a channel-flow-like mean field. The long-term spectral error (PSDD) also degrades substantially. This behavior is consistent with the theoretical analysis in Appendix [C.2](https://arxiv.org/html/2605.05540#A3.SS2): WinC-MF alone is insufficient to recover the correct process, because the model can partially satisfy the objective by regressing toward a mean-field solution rather than preserving the correct long-range dynamics.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/tic.png)
Figure 7: Autoregressive rollout results for MeLISA-$\Delta$-M with two context frames, comparing the full model against a variant trained without TIC under the same optimization setting. $t$ denotes the frame index within the trajectory.

Table 9: KF256 results for MeLISA-$\Delta$-M trained with and without TIC loss. Short-term metrics are computed over the first 40 frames, while long-term metrics are evaluated over the full 320-frame trajectory.
### D.3 Rollout Stability

We study the temporal rollout stability of MeLISA on the KF256 dataset. Short-range error accumulation is shown in Fig. [8](https://arxiv.org/html/2605.05540#A4.F8). Overall, MeLISA exhibits better long-horizon error stability, whereas the neural-operator baselines achieve lower error only during the first few rollout steps.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/rl2_rollout.png)
Figure 8: Accumulation of R$L_{2}$ error over the first 40 rollout frames on KF256.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/psdd_rollout.png)
Figure 9: Accumulation of PSDD over autoregressive rollouts on KF256, evaluated over the full 320-frame trajectory.

Although the spatial error remains visually reasonable, the spectral error grows much more severely during rollout (Fig. [9](https://arxiv.org/html/2605.05540#A4.F9)). Autoregressive neural operators rapidly lose high-frequency content, whereas all MeLISA variants quickly reach a stable spectral-error level. To further assess long-horizon stability under extreme conditions, we roll out all autoregressive models for up to 9998 frames, using two guidance frames for MeLISA. Local-FNO is omitted from this experiment because it cannot stably roll out even to the full in-distribution trajectory length of 320 frames. As shown in Fig. [10](https://arxiv.org/html/2605.05540#A4.F10), all MeLISA variants remain stable up to the final frame and produce plausible results.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/rollout_kf256.png)
Figure 10: Stress test of maximum rollout length for MeLISA models. White frames indicate either unavailable ground truth or NaN outputs from autoregressive baselines.
### D.4 Probabilistic Forecasting

We emphasize that MeLISA is intrinsically a probabilistic forecasting model. To evaluate this capability, we compute the unscaled continuous ranked probability score (CRPS; see Appendix [F](https://arxiv.org/html/2605.05540#A6)) on the KF256 dataset; note that absolute CRPS values are not directly comparable across systems with different resolutions or dynamical complexity. For each initial condition, we generate an ensemble of 10 rollout trajectories by varying the random seed and compute the CRPS with respect to the ground-truth trajectory. The temporal evolution of CRPS is shown in Fig. [11](https://arxiv.org/html/2605.05540#A4.F11).

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/crps_rollout_short.png)

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/crps_rollout.png)

Figure 11: Evolution of CRPS on KF256. The upper panel shows short-term CRPS, and the lower panel shows trajectory-level CRPS over long rollouts.

In the short-term prediction regime (upper panel of Fig. [11](https://arxiv.org/html/2605.05540#A4.F11)), the MeLISA-$\Upsilon$ variants achieve lower CRPS. Over longer trajectories (lower panel of Fig. [11](https://arxiv.org/html/2605.05540#A4.F11)), however, the MeLISA-$\Delta$ variants exhibit greater long-range consistency. For a qualitative illustration, we show the rollout ensemble for a single initial condition in Fig. [12](https://arxiv.org/html/2605.05540#A4.F12).

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/ensem_kf256.png)
Figure 12: Trajectory ensemble generated by MeLISA-$\Delta$-B for a single initial condition. The random seeds are obtained by adding multiples of 10 to 42.
### D.5 Training and Inference Costs

We report both training and inference costs in this section. Each neural-operator baseline requires roughly 2–3 days of training on a single NVIDIA A100 80GB GPU. All MeLISA models are trained on 32 GH200 nodes (4 GPUs per node) for 4 hours, corresponding to approximately 5 days on a single 4-GPU GH200 node. This training cost is comparable to that reported in prior work [[6](https://arxiv.org/html/2605.05540#bib.bib140), [53](https://arxiv.org/html/2605.05540#bib.bib162)].

For inference, we report JAX GFLOP counts per model call and per trajectory in Table [10](https://arxiv.org/html/2605.05540#A4.T10). Because the neural-operator baselines involve Fourier-space operations that are not reliably captured by standard PyTorch profilers, their FLOP counts would not be directly comparable, and we therefore do not report them. To assess practical efficiency, we additionally measure wall-clock rollout time per trajectory on a single NVIDIA 4090 GPU in Table [11](https://arxiv.org/html/2605.05540#A4.T11). Overall, MeLISA achieves rollout speeds comparable to those of neural operators.

Table 10: GFLOP counts recorded using `jax.jit(model).trace(*inputs).lower()` on a single NVIDIA 4090 GPU.

Table 11: Inference time per trajectory in seconds for MeLISA and neural-operator baselines on a single NVIDIA 4090 GPU. Times are reported to two significant figures.

## Appendix E More Results

We provide additional qualitative and statistical results for KF256 and TCF192 in this section. Figure [13](https://arxiv.org/html/2605.05540#A5.F13) presents a representative rollout example on KF256. Consistent with the main text, the autoregressive baselines accumulate error rapidly as the rollout proceeds, whereas MeLISA-$\Delta$-S maintains substantially better stability over time. The error maps further show that MeLISA not only reduces the overall error magnitude, but also avoids the localized error amplification that becomes pronounced in the baseline models at later steps.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/fig2.png)
Figure 13: Rollout results on an uncurated test trajectory from KF256, shown as absolute error with respect to the ground truth over the prediction horizon. We compare three autoregressive baselines with one MeLISA variant, MeLISA-$\Delta$-S. Here, $t$ denotes the frame index within the trajectory. MeLISA exhibits substantially improved stability and markedly slower error accumulation.

Figures [14](https://arxiv.org/html/2605.05540#A5.F14) and [15](https://arxiv.org/html/2605.05540#A5.F15) report trajectory autocorrelation statistics for KF256 and TCF192, respectively. These plots provide a complementary view of rollout quality by measuring how well each model preserves temporal dependence over both short and long horizons. On KF256, MeLISA tracks the decay profile of the ground-truth autocorrelation more faithfully across the full trajectory, indicating improved recovery of long-range temporal structure. On TCF192, the difference is even more pronounced: the neural-operator baselines exhibit clear ringing artifacts induced by the windowed autoregressive rollout procedure, while MeLISA produces much smoother curves that better match the physical decay behavior. These results further support the claim that MeLISA preserves long-term temporal statistics more reliably than deterministic autoregressive baselines.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/kf256_autocorr_short.png)

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/kf256_autocorr_long.png)

Figure 14: Autocorrelation on the KF256 dataset. The upper panel shows the first 40 frames, while the lower panel shows the full 320-frame trajectory.

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/tcf192_autocorr_short.png)

![Refer to caption](https://arxiv.org/html/2605.05540v1/images/tcf192_autocorr_long.png)

Figure 15: Autocorrelation on the TCF192 dataset. The upper panel shows the first 40 frames, while the lower panel shows the full 625-frame trajectory. Neural-operator baselines exhibit pronounced ringing due to the windowed autoregressive rollout setting, whereas the MeLISA curves remain much smoother.
## Appendix F Metrics

We evaluate predicted trajectories against ground-truth trajectories using a combination of short-horizon field-wise metrics and long-horizon statistical metrics. Let $z,\hat{z}\in\mathbb{R}^{B\times T\times H\times W}$ denote the reference and predicted trajectories, respectively.

### F.1 Relative $L_{2}$ Error

To quantify overall prediction error over the short-horizon rollout, we compute a trajectory-level relative $L_{2}$ error over the first $T_{\mathrm{eval}}=40$ retained steps. For each trajectory $j$, the predicted and reference fields are flattened across time and space, and the error is defined as

$$\mathcal{E}_{\mathrm{r}L_{2}}=\frac{1}{B}\sum_{j=1}^{B}\frac{\left\lVert\hat{z}^{\,j}_{1:T_{\mathrm{eval}}}-z^{\,j}_{1:T_{\mathrm{eval}}}\right\rVert_{2}}{\left\lVert z^{\,j}_{1:T_{\mathrm{eval}}}\right\rVert_{2}+\varepsilon},\tag{69}$$

where $\|\cdot\|_{2}$ denotes the Euclidean norm after flattening all retained temporal and spatial entries of a trajectory, and $\varepsilon$ is a small constant for numerical stability.
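As a reference for how this metric is computed, a short sketch follows, assuming trajectories of shape $(B,T,H,W)$; the function name and the value of $\varepsilon$ are illustrative.

```python
# Sketch of the trajectory-level relative L2 error of Eq. (69).
# Assumptions: z_hat, z are arrays of shape (B, T, H, W); eps is illustrative.
import jax.numpy as jnp

def relative_l2(z_hat, z, t_eval=40, eps=1e-8):
    zh = z_hat[:, :t_eval].reshape(z_hat.shape[0], -1)  # flatten time and space
    zr = z[:, :t_eval].reshape(z.shape[0], -1)
    num = jnp.linalg.norm(zh - zr, axis=1)   # per-trajectory error norm
    den = jnp.linalg.norm(zr, axis=1) + eps  # per-trajectory reference norm
    return jnp.mean(num / den)               # average over the batch
```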

### F.2 Structural Similarity Index Measure

To quantify structural agreement between predicted and reference 2D fields, we use the Structural Similarity Index Measure (SSIM) [[72](https://arxiv.org/html/2605.05540#bib.bib117)]. SSIM compares local luminance, contrast, and structure between a reference field $x\in\mathbb{R}^{H\times W}$ and a prediction $y\in\mathbb{R}^{H\times W}$.

#### Gaussian\-window local statistics\.

Local moments are estimated using a Gaussian-weighted window of size $w\times w$ with standard deviation $\sigma$. Denoting Gaussian smoothing by $\mathcal{H}(\cdot)$, the local means are

$$\mu_{x}=\mathcal{H}(x),\qquad\mu_{y}=\mathcal{H}(y),\tag{70}$$

and the local variances and covariance are

$$\sigma_{x}^{2}=\mathcal{H}(x^{2})-\mu_{x}^{2},\qquad\sigma_{y}^{2}=\mathcal{H}(y^{2})-\mu_{y}^{2},\qquad\sigma_{xy}=\mathcal{H}(xy)-\mu_{x}\mu_{y}.\tag{71}$$

In our implementation, $\mathcal{H}$ is applied via a separable Gaussian convolution with "same" padding, using a normalized 1D Gaussian kernel along each spatial axis.

#### SSIM map and aggregation\.

The SSIM map is given by

$$\mathrm{SSIM}(x,y)=\frac{(2\mu_{x}\mu_{y}+C_{1})(2\sigma_{xy}+C_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+C_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2})},\tag{72}$$

and the scalar SSIM score is the spatial average:

$$\overline{\mathrm{SSIM}}(x,y)=\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\mathrm{SSIM}_{ij}(x,y).\tag{73}$$

#### Stability constants and data range\.

To stabilize the computation, we use

$$C_{1}=(K_{1}L)^{2},\qquad C_{2}=(K_{2}L)^{2},\tag{74}$$

where the data range $L$ is computed per sample from the reference field:

$$L=\max(x)-\min(x).\tag{75}$$

We use the standard constants $K_{1}=0.01$ and $K_{2}=0.03$. If the reference field is constant (i.e., $L=0$), the implementation returns $\overline{\mathrm{SSIM}}=1$ when prediction and reference agree within numerical tolerance, and $0$ otherwise.
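Putting Eqs. (70)–(75) together, the following is a compact sketch for a single pair of 2D fields; the $11\times 11$ window with $\sigma=1.5$ is the conventional default and is assumed here rather than taken from the paper.

```python
# Sketch of Gaussian-window SSIM (Eqs. 70-75) for single-channel 2D fields.
# Assumptions: window size 11 and sigma 1.5 are conventional defaults, not
# necessarily the paper's settings; constant-field handling is omitted.
import jax.numpy as jnp
from jax.scipy.signal import convolve2d

def gaussian_kernel(w=11, sigma=1.5):
    ax = jnp.arange(w) - (w - 1) / 2.0
    g = jnp.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    g = g / g.sum()
    return jnp.outer(g, g)  # separable kernel materialized in 2D

def ssim(x, y, k1=0.01, k2=0.03):
    smooth = lambda f: convolve2d(f, gaussian_kernel(), mode="same")
    mu_x, mu_y = smooth(x), smooth(y)                  # Eq. (70)
    var_x = smooth(x * x) - mu_x ** 2                  # Eq. (71)
    var_y = smooth(y * y) - mu_y ** 2
    cov_xy = smooth(x * y) - mu_x * mu_y
    L = x.max() - x.min()    # per-sample data range from the reference field
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2              # Eq. (74)
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))  # Eq. (72)
    return ssim_map.mean()                             # spatial average, Eq. (73)
```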

### F.3 Mixing Rate Difference

To assess whether predicted trajectories reproduce the temporal decorrelation behavior of the reference dynamics, we estimate a mixing rate from the empirical autocovariance and compare it to the ground-truth value.

#### Empirical covariance\.

Given a trajectory tensor $x\in\mathbb{R}^{B\times T\times H\times W}$, we first compute the global mean over all trajectories, times, and spatial locations,

$$\bar{x}=\frac{1}{BTHW}\sum_{b=1}^{B}\sum_{t=1}^{T}\sum_{u=1}^{H}\sum_{v=1}^{W}x_{b,t,u,v},\tag{76}$$

and define the centred field $x^{\prime}=x-\bar{x}$. For a lag $\ell\in\{0,\dots,K\}$, the empirical covariance is

$$\widehat{C}(\ell)=\frac{1}{B\,m_{\ell}\,H\,W}\sum_{b=1}^{B}\sum_{k=1}^{m_{\ell}}\sum_{u=1}^{H}\sum_{v=1}^{W}x^{\prime}_{b,k,u,v}\,x^{\prime}_{b,k+\ell,u,v},\tag{77}$$

where

$$m_{\ell}=T-K-\ell.\tag{78}$$

#### Normalized autocorrelation and exponential fit\.

The covariance is normalized by its zero-lag value,

$$\overline{C}(\ell)=\frac{\widehat{C}(\ell)}{\widehat{C}(0)},\tag{79}$$

and we estimate the mixing rate $\lambda$ by fitting the exponential decay model

$$\overline{C}(\ell)\approx\exp(-\lambda\ell),\qquad\ell=0,\dots,K,\tag{80}$$

with the constraint $\lambda\geq 0$.

#### Reported error\.

Let $\lambda_{\mathrm{gt}}$ be the fitted mixing rate of the ground-truth trajectory and $\lambda_{\mathrm{pred}}$ that of the prediction. The reported metric is the relative mixing-rate error

$$\mathcal{E}_{\mathrm{mix}}=\frac{\left|\lambda_{\mathrm{pred}}-\lambda_{\mathrm{gt}}\right|}{\lambda_{\mathrm{gt}}}.\tag{81}$$

Lower values indicate better agreement with the temporal mixing behavior of the reference dynamics.
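A sketch of the whole pipeline (Eqs. 76–81) is given below; the log-linear least-squares fit for $\lambda$ is one simple way to realize the constrained exponential fit and is an assumption on our part, as is the default $K$.

```python
# Sketch of the mixing-rate estimate (Eqs. 76-81) for (B, T, H, W) arrays.
# Assumption: lambda is fitted by least squares on log C(l), clipped to >= 0;
# the paper's exact fitting routine may differ.
import jax.numpy as jnp

def mixing_rate(x, K=20):
    xc = x - x.mean()                  # center by the global mean, Eq. (76)
    T = x.shape[1]
    covs = []
    for ell in range(K + 1):           # empirical covariance, Eqs. (77)-(78)
        m_ell = T - K - ell
        covs.append(jnp.mean(xc[:, :m_ell] * xc[:, ell:ell + m_ell]))
    c_bar = jnp.stack(covs) / covs[0]  # normalize by the zero lag, Eq. (79)
    lags = jnp.arange(K + 1.0)
    logs = jnp.log(jnp.clip(c_bar, 1e-6, None))
    lam = -jnp.sum(lags * logs) / jnp.sum(lags ** 2)  # least squares in log space
    return jnp.maximum(lam, 0.0)       # enforce lambda >= 0, Eq. (80)

def mixing_rate_error(z_hat, z):
    lam_gt = mixing_rate(z)
    return jnp.abs(mixing_rate(z_hat) - lam_gt) / lam_gt  # Eq. (81)
```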

### F.4 Power Spectral Density Discrepancy

To measure agreement in spatial frequency content, we compute a Power Spectral Density Discrepancy (PSDD) between predicted and reference fields. This metric compares the radially averaged Fourier power spectra of the two signals and is particularly useful for assessing whether the model reproduces the correct distribution of energy across spatial scales.

#### Frame\-wise Fourier representation\.

The PSDD is evaluated on all retained frames after flattening batch and time dimensions, so that each frame is treated as an independent sample of shape $(H,W,1)$. For each sample $x$, we compute the shifted 2D discrete Fourier transform

$$\widetilde{F}_{x}=\mathrm{fftshift}\left(\mathcal{F}\{x\}\right),\tag{82}$$

and similarly for $y$.

#### Power spectrum and radial averaging\.

The unnormalized power spectrum is

$$P_{x}(i,j)=\left|\widetilde{F}_{x}(i,j)\right|^{2}.\tag{83}$$

For each spatial frequency location $(i,j)$, we define a discrete radial bin index

$$r(i,j)=\left\lfloor\sqrt{\left(i-\lfloor H/2\rfloor\right)^{2}+\left(j-\lfloor W/2\rfloor\right)^{2}}\right\rfloor.\tag{84}$$

The radially averaged spectrum is then obtained by averaging $P_{x}(i,j)$ over all positions belonging to the same radius bin:

$$R_{x}(k)=\frac{1}{|\Omega_{k}|}\sum_{(i,j)\in\Omega_{k}}P_{x}(i,j),\qquad\Omega_{k}=\{(i,j):r(i,j)=k\}.\tag{85}$$

In our implementation, channels are aggregated before radial averaging.

#### Cropping and normalization\.

We retain radial bins $k\in\{0,1,\dots,K_{r}-1\}$, where $K_{r}$ denotes the number of retained radial bins. The cropped radial spectrum is normalized to sum to one:

$$\widehat{R}_{x}(k)=\frac{R_{x}(k)}{\sum_{k^{\prime}=0}^{K_{r}-1}R_{x}(k^{\prime})+\varepsilon},$$

and analogously for $\widehat{R}_{y}(k)$.

#### LogarithmicL1L\_\{1\}discrepancy\.

The PSDD used in the analysis is the mean absolute difference between the logarithms of the normalized radial spectra:

$$\mathcal{L}_{\mathrm{PSDD}}(x,y)=\frac{1}{K_{r}}\sum_{k=0}^{K_{r}-1}\left|\log\left(\widehat{R}_{x}(k)+\varepsilon\right)-\log\left(\widehat{R}_{y}(k)+\varepsilon\right)\right|,\tag{86}$$

where $K_{r}$ (220 for KF256, all bins for TCF192) is the number of retained radial bins. The final score is obtained by averaging this quantity over all flattened frames.
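A per-frame sketch of Eqs. (82)–(86) follows; the helper names are illustrative, and the radial cropping is implemented implicitly by JAX's default behavior of dropping out-of-bounds scatter indices.

```python
# Sketch of the per-frame PSDD (Eqs. 82-86) for single-channel 2D fields.
# Assumptions: helper names are illustrative; JAX drops out-of-bounds scatter
# indices by default, which here realizes the crop to the first k_r bins.
import jax.numpy as jnp

def radial_spectrum(x, k_r, eps=1e-8):
    H, W = x.shape
    P = jnp.abs(jnp.fft.fftshift(jnp.fft.fft2(x))) ** 2   # Eqs. (82)-(83)
    ii, jj = jnp.meshgrid(jnp.arange(H), jnp.arange(W), indexing="ij")
    r = jnp.floor(jnp.sqrt((ii - H // 2) ** 2 + (jj - W // 2) ** 2)).astype(int)
    sums = jnp.zeros(k_r).at[r.ravel()].add(P.ravel())    # radial bin sums
    counts = jnp.zeros(k_r).at[r.ravel()].add(1.0)        # bin populations
    R = sums / jnp.maximum(counts, 1.0)                   # bin averages, Eq. (85)
    return R / (R.sum() + eps)                            # normalized spectrum

def psdd(x, y, k_r=220, eps=1e-8):
    Rx, Ry = radial_spectrum(x, k_r, eps), radial_spectrum(y, k_r, eps)
    return jnp.mean(jnp.abs(jnp.log(Rx + eps) - jnp.log(Ry + eps)))  # Eq. (86)
```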

### F.5 Turbulent Kinetic Energy Difference

To evaluate whether the predicted trajectories reproduce the amplitude of temporal fluctuations at each spatial location, we compute a Turbulent Kinetic Energy Difference (TKED).

#### Turbulent fluctuation energy\.

For a trajectory tensor $x\in\mathbb{R}^{B\times T\times H\times W}$, we first compute the temporal mean for each trajectory and spatial position,

$$\bar{x}_{b,u,v}=\frac{1}{T}\sum_{t=1}^{T}x_{b,t,u,v},\tag{87}$$

and define the turbulent fluctuation field

$$x^{\prime}_{b,t,u,v}=x_{b,t,u,v}-\bar{x}_{b,u,v}.\tag{88}$$

The corresponding kinetic-energy map is then

$$\mathrm{TKE}_{x}(u,v)=\frac{1}{B}\sum_{b=1}^{B}\left(\frac{1}{T}\sum_{t=1}^{T}\left(x^{\prime}_{b,t,u,v}\right)^{2}\right).\tag{89}$$

This yields a spatial map describing the average fluctuation energy at each grid location.

#### Normalized discrepancy\.

Let $\mathrm{TKE}_{\hat{z}}$ and $\mathrm{TKE}_{z}$ denote the predicted and reference temporal kinetic-energy maps. The reported discrepancy is

$$\mathcal{E}_{\mathrm{TKED}}=\frac{\frac{1}{HW}\sum_{u=1}^{H}\sum_{v=1}^{W}\left(\mathrm{TKE}_{\hat{z}}(u,v)-\mathrm{TKE}_{z}(u,v)\right)^{2}}{\frac{1}{HW}\sum_{u=1}^{H}\sum_{v=1}^{W}\mathrm{TKE}_{z}(u,v)}.\tag{90}$$

Lower values indicate that the predicted trajectories reproduce the temporal fluctuation energy of the reference trajectories more faithfully.
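The following sketch mirrors Eqs. (87)–(90) for arrays of shape $(B,T,H,W)$; the function name is illustrative.

```python
# Sketch of the TKED metric (Eqs. 87-90) for arrays of shape (B, T, H, W).
import jax.numpy as jnp

def tked(z_hat, z):
    def tke_map(x):
        fluct = x - x.mean(axis=1, keepdims=True)  # temporal fluctuations, Eqs. (87)-(88)
        return jnp.mean(fluct ** 2, axis=(0, 1))   # (H, W) energy map, Eq. (89)
    tke_pred, tke_ref = tke_map(z_hat), tke_map(z)
    return jnp.mean((tke_pred - tke_ref) ** 2) / jnp.mean(tke_ref)  # Eq. (90)
```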

### F.6 Continuous Ranked Probability Score

To evaluate probabilistic predictions, we use the Continuous Ranked Probability Score (CRPS), which measures the agreement between an ensemble forecast distribution and the corresponding observed target. For a scalar target value $y$ and an ensemble forecast $\{x^{(n)}\}_{n=1}^{M}$, the empirical CRPS is defined as

$$\mathrm{CRPS}\left(\{x^{(n)}\}_{n=1}^{M},y\right)=\frac{1}{M}\sum_{n=1}^{M}\left|x^{(n)}-y\right|-\frac{1}{2M^{2}}\sum_{n=1}^{M}\sum_{m=1}^{M}\left|x^{(n)}-x^{(m)}\right|.\tag{91}$$

The first term measures the mean absolute deviation of the ensemble members from the observation, while the second term corrects for ensemble spread. Lower values indicate better calibrated and more accurate probabilistic forecasts.

#### Field\-wise ensemble evaluation\.

For spatial fields, the CRPS is computed independently at each spatial location. Let $\hat{z}_{b,t,u,v}^{(n)}$ denote the $n$th ensemble member for batch element $b$ at time step $t$ and spatial location $(u,v)$, and let $z_{b,t,u,v}$ be the corresponding ground-truth value. Then the pointwise CRPS is

$$\mathrm{CRPS}_{b,t,u,v}=\mathrm{CRPS}\left(\{\hat{z}_{b,t,u,v}^{(n)}\}_{n=1}^{M},\;z_{b,t,u,v}\right).\tag{92}$$
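A minimal sketch of the empirical CRPS of Eq. (91), with the field-wise evaluation of Eq. (92) expressed as a `vmap` over spatial locations; the function names are illustrative.

```python
# Sketch of the empirical CRPS (Eq. 91) and its field-wise form (Eq. 92).
import jax
import jax.numpy as jnp

def crps_empirical(ens, y):
    """ens: ensemble of shape (M,); y: scalar observation."""
    term1 = jnp.mean(jnp.abs(ens - y))                          # deviation term
    term2 = 0.5 * jnp.mean(jnp.abs(ens[:, None] - ens[None, :]))  # spread term
    return term1 - term2

def crps_field(ens_field, y_field):
    """ens_field: (M, H, W) ensemble; y_field: (H, W) ground truth."""
    M, H, W = ens_field.shape
    flat = jax.vmap(crps_empirical, in_axes=(1, 0))(
        ens_field.reshape(M, -1), y_field.reshape(-1))
    return flat.reshape(H, W)  # pointwise CRPS map, Eq. (92)
```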

## Appendix G Dataset Details

### G.1 Turbulent Channel Flow

#### Dataset summary\.

The turbulent channel-flow dataset used in this work is generated with an in-house numerical solver based on the lattice Boltzmann method.

#### Numerical solver: lattice Boltzmann method\.

Ground-truth trajectories are produced with the large-eddy LBM solver described in Section S1. LBM is a mesoscopic CFD method that, instead of discretizing the Navier–Stokes equations directly, evolves a discrete set of particle-distribution functions $f_{i}(\mathbf{x},t)$ on a regular lattice through local collision and streaming steps; macroscopic fields (density, velocity, pressure) are recovered as low-order moments of $f_{i}$ [[32](https://arxiv.org/html/2605.05540#bib.bib79)]. The collision step is strictly local and the streaming step only moves data to nearest-neighbor lattice sites, which makes LBM massively parallel and numerically well suited to high-Reynolds-number wall-bounded flows, motivating its use as the reference solver here.

#### Flow configuration\.

The simulation domain is $L_{x}\times L_{y}\times L_{z}=768\times 192\times 192$ lattice units, with $x$, $y$, and $z$ denoting the streamwise, wall-normal, and spanwise directions. The flow is driven by a uniform streamwise body force at a friction Reynolds number $Re_{\tau}=180$ (bulk Reynolds number $Re=3250$), with periodic boundary conditions in $x$ and $z$, no-slip walls at the top and bottom, and a sponge layer near the outlet to suppress spurious reflections. To accelerate transition to a developed turbulent state, a cubic tripping obstacle of size $20\times 20\times 100$ is placed at $x=192$ [[77](https://arxiv.org/html/2605.05540#bib.bib74)]; after $10^{6}$ timesteps the obstacle is removed and the flow is evolved for a further 50 domain flow-through times before sampling begins on the mid-plane cross-section at $x=384$.

#### Computing costs\.

Generating the full dataset required approximately $10^{6}$ CPU hours on the ARCHER2 supercomputer (2,048 cores per run, 32 runs).

### G.2 2D Kolmogorov Flow

#### Dataset summary\.

Based on the governing equations described below, the Kolmogorov flow dataset consists of 70 independent simulation trajectories of the two-dimensional incompressible Navier-Stokes equations with Kolmogorov forcing at Reynolds number $\mathrm{Re}=1000$. All simulations are performed on a doubly periodic square domain, discretized using a uniform Cartesian grid of size $256\times 256$. The community code can be found in [[30](https://arxiv.org/html/2605.05540#bib.bib95)].

Each simulation produces a full spatiotemporal trajectory of the vorticity field consisting of 320 consecutive temporal frames. From each trajectory, multiple overlapping temporal fragments are extracted to construct supervised learning samples. The validation and test sets are split at the trajectory level.

#### Governing equation\.

The Kolmogorov flow dataset is generated from numerical simulations of the two-dimensional incompressible Navier-Stokes equations with a spatially periodic body force [[30](https://arxiv.org/html/2605.05540#bib.bib95)]. The governing equations read

$$\frac{\partial\mathbf{u}}{\partial t}+\mathbf{u}\cdot\nabla\mathbf{u}=-\nabla p+\nu\nabla^{2}\mathbf{u}+\mathbf{f}(\mathbf{x}),\qquad\nabla\cdot\mathbf{u}=0,\tag{93}$$

where $\mathbf{u}(\mathbf{x},t)=(u,v)$ is the velocity field, $p$ is the pressure, and $\nu$ denotes the kinematic viscosity. The external forcing is given by the Kolmogorov forcing

$$\mathbf{f}(\mathbf{x})=\left(\sin(ny),\;0\right),\tag{94}$$

with forcing wavenumber $n$, which injects energy at a prescribed spatial scale while preserving periodicity.

The system is evolved on a two-dimensional periodic square domain $\Omega=[0,2\pi]^{2}$. Under this forcing, the Kolmogorov flow admits laminar solutions at low Reynolds numbers and undergoes a sequence of instabilities and transitions to spatiotemporally chaotic dynamics as the Reynolds number increases, making it a canonical testbed for studies of turbulence, transition, and data-driven modeling [[36](https://arxiv.org/html/2605.05540#bib.bib27)].
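For illustration, a small sketch of how the forcing field of Eq. (94) can be sampled on the simulation grid; the grid indexing convention (rows as $y$) and the function name are assumptions, not taken from the community code.

```python
# Sketch: sampling the Kolmogorov forcing f(x) = (sin(n y), 0) of Eq. (94)
# on an N x N grid over [0, 2*pi]^2. Assumption: rows index the y coordinate.
import jax.numpy as jnp

def kolmogorov_forcing(n, N=256):
    y = jnp.linspace(0.0, 2.0 * jnp.pi, N, endpoint=False)
    f_u = jnp.broadcast_to(jnp.sin(n * y)[:, None], (N, N))  # streamwise component
    f_v = jnp.zeros((N, N))                                  # zero transverse component
    return f_u, f_v
```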
