Patch-PODiff-ViT: Structured Latent Diffusion with Patchwise POD for Super-Resolution and Uncertainty Quantification
Summary
Patch-PODiff-ViT introduces a structured latent diffusion framework using patchwise Proper Orthogonal Decomposition (POD) for super-resolution and uncertainty quantification, enabling efficient diffusion with a fixed linear orthonormal basis and analytic propagation of predictive variance.
# Patch-PODiff-ViT: Structured Latent Diffusion with Patchwise POD for Super-Resolution and Uncertainty Quantification
Source: [https://arxiv.org/html/2606.31290](https://arxiv.org/html/2606.31290)
- Onkar Jadhav, School of Earth and Oceans, UWA Oceans Institute, University of Western Australia, Crawley, WA, Australia (onkar.jadhav@uwa.edu.au)
- Tim French, School of Physics, Mathematics and Computing, Computer Science and Software Engineering, University of Western Australia, Crawley, WA, Australia
- Matthew Rayson, School of Earth and Oceans, UWA Oceans Institute, University of Western Australia, Crawley, WA, Australia
- Nicole L. Jones, School of Earth and Oceans, UWA Oceans Institute, University of Western Australia, Crawley, WA, Australia
###### Abstract
Diffusion models enable probabilistic super-resolution and conditional generation, but pixel-space methods are computationally expensive and learned latent spaces often lack interpretable uncertainty quantification. We introduce Patch-PODiff-ViT, a structured latent diffusion framework in which the latent space is defined by patchwise Proper Orthogonal Decomposition (POD), a fixed linear orthonormal basis over local patches, rather than learned by a nonlinear autoencoder. This yields low-dimensional, variance-ordered tokens that preserve spatial structure and enable efficient diffusion with a Vision Transformer. Because the decoder is fixed, linear, and orthonormal, latent coefficient uncertainty can be propagated analytically to physical-space predictive variance, without Monte Carlo estimation in pixel space. Across sea surface temperature, medical imaging, and natural images, the method achieves strong reconstruction with fewer parameters and lower memory, while producing well-calibrated spatial uncertainty that closely matches empirical ensembles.
## 1 Introduction
High-resolution spatial fields arise across a wide range of applications, including climate modeling (Price et al., [2023](https://arxiv.org/html/2606.31290#bib.bib19); Watt and Mansfield, [2024](https://arxiv.org/html/2606.31290#bib.bib33); Jadhav et al., [2025](https://arxiv.org/html/2606.31290#bib.bib61)), medical imaging (Moser et al., [2024](https://arxiv.org/html/2606.31290#bib.bib41)), and natural image generation (Rombach et al., [2022](https://arxiv.org/html/2606.31290#bib.bib39)). While modern pipelines provide coarse observations at scale, resolving fine-scale spatial structure remains computationally expensive. Super-resolution methods address this challenge (Leinonen et al., [2020](https://arxiv.org/html/2606.31290#bib.bib20); Stengel et al., [2020](https://arxiv.org/html/2606.31290#bib.bib21); Saharia et al., [2022](https://arxiv.org/html/2606.31290#bib.bib40); Leinonen et al., [2023](https://arxiv.org/html/2606.31290#bib.bib45)), but accurate reconstruction alone is often insufficient. Reliable uncertainty quantification is essential for downstream decision-making, particularly in regimes with sharp gradients, localized extremes, and incomplete observations.
Diffusion-based models (Ho et al., [2020](https://arxiv.org/html/2606.31290#bib.bib9); Song et al., [2021b](https://arxiv.org/html/2606.31290#bib.bib10)) provide a powerful framework for probabilistic super-resolution and conditional generation (Saharia et al., [2022](https://arxiv.org/html/2606.31290#bib.bib40)), producing high-fidelity samples and enabling ensemble-based uncertainty estimation. However, operating in pixel space is computationally expensive at high resolutions, leading to large models, high memory usage, and slow sampling, which makes ensemble generation costly in practice (Ho et al., [2020](https://arxiv.org/html/2606.31290#bib.bib9)).
Latent diffusion models mitigate this cost by operating in a lower-dimensional space learned via autoencoders (Rombach et al., [2022](https://arxiv.org/html/2606.31290#bib.bib39); Vahdat et al., [2021](https://arxiv.org/html/2606.31290#bib.bib15)). While effective for natural images, these latent representations are typically nonlinear and lack a direct correspondence to spatial structure, making uncertainty propagation to physical-space predictive variance less interpretable (Böhm et al., [2019](https://arxiv.org/html/2606.31290#bib.bib58)) and limiting their applicability in settings where structure and interpretability are important. In contrast, many spatial fields exhibit strong local structure that can be efficiently represented using linear reduced-order methods such as Proper Orthogonal Decomposition (POD) (Sirovich, [1987](https://arxiv.org/html/2606.31290#bib.bib43); Berkooz et al., [1993](https://arxiv.org/html/2606.31290#bib.bib42); Benner et al., [2015](https://arxiv.org/html/2606.31290#bib.bib25)). POD yields an orthonormal, variance-ordered basis that captures dominant spatial patterns and defines a geometrically meaningful latent space, where coefficients correspond to progressively finer scales. Importantly, this structure is often more pronounced at the local level: individual $p\times p$ patches can be represented using far fewer modes than the full field while retaining the same variance, enabling substantial compression. Such local structure is not limited to physical systems, and is often observed in medical and natural images, particularly at the patch level.
Despite its widespread use in scientific computing (Coscia et al., [2024](https://arxiv.org/html/2606.31290#bib.bib35); Du et al., [2024](https://arxiv.org/html/2606.31290#bib.bib48)), POD remains underexplored as a structured latent space for diffusion-based generative modeling. Although it provides a linear and interpretable mapping between latent coefficients and spatial fields, this structure has not been fully leveraged for efficient and principled uncertainty propagation.
In this work, we introduce Patch-PODiff-ViT, an extension of PODiff (Jadhav et al., [2026](https://arxiv.org/html/2606.31290#bib.bib60)), a conditional diffusion framework operating in a structured latent space defined by patchwise POD representations. Instead of learning a latent space, we construct a fixed, variance-ordered basis over local patches, yielding low-dimensional tokens that preserve spatial locality and scale separation. Diffusion is performed over these tokens using a Vision Transformer denoiser, enabling global spatial reasoning with improved efficiency. Crucially, the linear POD structure allows predictive uncertainty in latent space to be propagated analytically to the physical domain, providing a tractable and interpretable alternative to explicit full-resolution covariance estimation without additional learned components.
This formulation connects reduced-order and generative modeling, showing that structured linear representations combined with expressive denoisers enable efficient and interpretable probabilistic inference. Unlike pixel-space diffusion, it scales to high-resolution fields with lower computational cost, and unlike learned latent diffusion, it preserves a direct and tractable link between latent variables and spatial statistics.
We evaluate the method on three domains: sea surface temperature (SST) downscaling, medical image super-resolution, and natural image reconstruction. Across all datasets, Patch-PODiff-ViT achieves strong reconstruction performance with fewer parameters and lower memory than pixel-space diffusion, while producing well-calibrated, spatially meaningful uncertainty estimates that closely match empirical ensemble statistics.
##### Contributions.
This paper makes four contributions. (i) We introduce Patch-PODiff-ViT, a structured latent diffusion framework using patchwise POD and transformer-based denoising. (ii) We provide a theoretical foundation: Proposition [1](https://arxiv.org/html/2606.31290#Thmproposition1) bounds reconstruction error under variance truncation, and Proposition [2](https://arxiv.org/html/2606.31290#Thmproposition2) enables closed-form propagation of latent uncertainty to pixel space under a block-diagonal approximation. (iii) We show improved computational efficiency via a low-dimensional structured latent representation. (iv) We validate the method across geophysical, medical, and natural image datasets, achieving strong reconstruction and well-calibrated uncertainty.
## 2 Method
Patch-PODiff-ViT performs conditional generative modeling in a structured latent space defined by patchwise POD, reducing fields from $H\times W$ pixels to $P\times K$ latent tokens (Figure [1](https://arxiv.org/html/2606.31290#S2.F1)). Each image is decomposed into $P$ patches, projected onto a shared POD basis $\Phi$, and denoised by a Vision Transformer in token space. For super-resolution, conditioning and target fields are encoded in the same latent space. At inference, denoised coefficients are decoded linearly via $\hat{\mathbf{u}}_{p}=\bar{\mathbf{u}}+\Phi\hat{\mathbf{a}}_{p}$ and stitched to reconstruct the field, enabling analytic uncertainty propagation through the POD decoder.
### 2.1 Patchwise POD Representation
Let $\{\mathbf{u}_{i}\}_{i=1}^{N}$ denote a set of high-resolution training fields with $\mathbf{u}_{i}\in\mathbb{R}^{H\times W\times C}$. Each field is decomposed into patches of size $p\times p$, extracted with stride $r\leq p$, yielding
$$\mathbf{u}_{i}=\{\mathbf{u}_{i,p}\}_{p=1}^{P},\quad\mathbf{u}_{i,p}\in\mathbb{R}^{s},\quad s=C\cdot p^{2},$$
where $P$ denotes the number of patches per field. When $r=p$, the patches are non-overlapping; when $r<p$, they overlap. The patch size $p$ is treated as a hyperparameter, and its effect on reconstruction quality and spectral efficiency is studied in Appendix [I](https://arxiv.org/html/2606.31290#A9). For single-channel datasets, $C=1$.
We construct a global POD basis by pooling all training patches and computing the economy SVD of the centered patch matrix. The top-$K$ singular vectors form the orthonormal basis $\Phi\in\mathbb{R}^{s\times K}$ with singular values $\sigma_{1}\geq\cdots\geq\sigma_{K}\geq 0$. We select the truncation level $K$ to satisfy the energy criterion
$$\frac{\sum_{k=1}^{K}\sigma_{k}^{2}}{\sum_{k=1}^{s}\sigma_{k}^{2}}\geq\eta,\qquad\eta=0.99.\tag{1}$$
This choice is supported by Proposition [1](https://arxiv.org/html/2606.31290#Thmproposition1), which shows that the expected reconstruction error is bounded by a fraction $(1-\eta)$ of the total patchwise variance. The shared patch basis keeps token dimensions fixed and improves statistical efficiency by pooling local patches.
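The basis construction described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `extract_patches` and `fit_patch_pod` are hypothetical, patches are flattened in row-major order, and the full patch matrix is assumed to fit in memory.

```python
import numpy as np

def extract_patches(field, p, r):
    """Split an (H, W, C) field into flattened p x p patches with stride r."""
    H, W, C = field.shape
    rows = range(0, H - p + 1, r)
    cols = range(0, W - p + 1, r)
    return np.stack([field[i:i + p, j:j + p, :].reshape(-1)
                     for i in rows for j in cols])       # (P, s), s = C * p**2

def fit_patch_pod(fields, p, r, eta=0.99):
    """Pool all training patches and fit a shared, truncated POD basis."""
    X = np.concatenate([extract_patches(f, p, r) for f in fields])
    u_bar = X.mean(axis=0)                                # global patch mean
    # Economy SVD of the centered patch matrix; rows of Vt are the POD modes.
    _, sing, Vt = np.linalg.svd(X - u_bar, full_matrices=False)
    energy = np.cumsum(sing**2) / np.sum(sing**2)
    # Smallest K whose retained energy reaches eta, as in Eq. (1).
    K = min(int(np.searchsorted(energy, eta)) + 1, len(sing))
    return u_bar, Vt[:K].T, sing[:K]                      # Phi has shape (s, K)
```

For large training sets, a randomized or incremental SVD would likely be needed in place of the dense `np.linalg.svd` call; the text does not say which solver is used.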
##### Latent encoding.
Let $\bar{\mathbf{u}}$ denote the global patch mean. Each patch is encoded as
$$\mathbf{a}_{i,p}=\Phi^{\top}(\mathbf{u}_{i,p}-\bar{\mathbf{u}})\in\mathbb{R}^{K}.$$
To normalize variance across modes, we standardize the coefficients per mode as
$$\tilde{\mathbf{a}}_{i,p}=\Lambda^{-1/2}\mathbf{a}_{i,p},\qquad\Lambda=\mathrm{diag}(\sigma_{1}^{2},\dots,\sigma_{K}^{2}).$$
The POD encoder is fixed and parameter-free. In particular, no gradients pass through it during training.
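A small sketch of the encoder and its inverse may help make the two steps concrete. The names `encode`, `decode`, and `scale` are illustrative. Here `scale` stands for the diagonal of $\Lambda^{1/2}$; whether the $\sigma_k$ in the text carry a $1/\sqrt{n}$ sample-size normalization is not stated, so the sketch simply takes the per-mode scale as an input.

```python
import numpy as np

def encode(patches, u_bar, Phi, scale):
    """Project (P, s) patches onto the POD basis and standardize per mode."""
    a = (patches - u_bar) @ Phi        # a = Phi^T (u - u_bar), shape (P, K)
    return a / scale                   # Lambda^{-1/2} a

def decode(a_tilde, u_bar, Phi, scale):
    """Undo the standardization and apply the linear POD decoder."""
    return u_bar + (a_tilde * scale) @ Phi.T
```

Because $\Phi$ has orthonormal columns, `decode(encode(x))` returns the orthogonal projection of each centered patch onto the retained modes, and the round trip is exact for patches that already lie in their span.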
### 2.2 Latent Diffusion over Patch Tokens
##### Token sequence.
We represent each field $\mathbf{u}_{i}$ as a sequence of $P$ latent tokens
$$\tilde{\mathbf{A}}_{i}=[\tilde{\mathbf{a}}_{i,1},\dots,\tilde{\mathbf{a}}_{i,P}]\in\mathbb{R}^{P\times K}.$$
The total latent dimensionality $P\times K$ is substantially smaller than the pixel-space dimensionality $H\times W$, enabling efficient diffusion over the full spatial extent of the field.
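As a rough worked example, assume a single-channel $256\times 256$ field, non-overlapping $16\times 16$ patches, and $K=26$ retained modes (the upper end of the range the paper reports for its main datasets; the pairing of this $K$ with this resolution is an assumption made only for illustration):

$$P=\left(\tfrac{256}{16}\right)^{2}=256,\qquad P\times K=256\times 26=6{,}656,\qquad H\times W=65{,}536,\qquad\frac{H\times W}{P\times K}\approx 9.8.$$

Under these assumed values the latent sequence is roughly an order of magnitude smaller than the pixel grid, and a smaller $K$ increases the ratio proportionally.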
##### Forward process.
We define the forward diffusion process as
$$q(\tilde{\mathbf{A}}_{t}\mid\tilde{\mathbf{A}}_{0})=\mathcal{N}\!\left(\sqrt{\bar{\alpha}_{t}}\,\tilde{\mathbf{A}}_{0},\;(1-\bar{\alpha}_{t})\mathbf{I}\right),\tag{2}$$
where $\bar{\alpha}_{t}=\prod_{j=1}^{t}\alpha_{j}$ follows a cosine noise schedule (Nichol and Dhariwal, [2021](https://arxiv.org/html/2606.31290#bib.bib12)) over $T=1{,}000$ steps (Ho et al., [2020](https://arxiv.org/html/2606.31290#bib.bib9)).
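The forward process in Eq. (2) can be sampled in closed form. The sketch below assumes the standard cosine schedule of Nichol and Dhariwal with offset `s = 0.008`; the offset is not given in the text, and the function names are illustrative.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level alpha_bar_t for t = 0..T under a cosine schedule."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]                       # normalized so that alpha_bar_0 = 1

def q_sample(A0, t, alpha_bar, rng):
    """Draw A_t ~ N(sqrt(alpha_bar_t) A0, (1 - alpha_bar_t) I) as in Eq. (2)."""
    eps = rng.standard_normal(A0.shape)
    A_t = np.sqrt(alpha_bar[t]) * A0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return A_t, eps
```

At `t = 0` the sample equals the clean tokens, and at `t = T` it is essentially pure Gaussian noise, which is why standardizing the coefficients per mode beforehand keeps all modes on a comparable noise scale.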
##### Training objective.
We train a denoising network $\epsilon_{\theta}$ to predict the injected noise:
$$\mathcal{L}(\theta)=\mathbb{E}_{\tilde{\mathbf{A}}_{0},\,t,\,\epsilon}\!\left[\|\epsilon-\epsilon_{\theta}(\tilde{\mathbf{A}}_{t},\mathbf{C},t)\|_{2}^{2}\right],\tag{3}$$
where $\mathbf{C}$ denotes the conditioning signal derived from the low-resolution input (Section [2.3](https://arxiv.org/html/2606.31290#S2.SS3)), and $t\sim\mathrm{Uniform}\{1,\dots,T\}$.
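A one-batch Monte Carlo estimate of Eq. (3) can be written as below. This is only a sketch of the loss evaluation: `eps_model` is a hypothetical callable standing in for the ViT denoiser, and actual training would use an automatic-differentiation framework rather than plain NumPy.

```python
import numpy as np

def diffusion_loss(eps_model, A0, C, alpha_bar, rng):
    """One-sample estimate of Eq. (3) for a batch A0 of shape (B, P, K).

    eps_model(A_t, C, t) is a placeholder for the denoiser and must return
    an array with the same shape as A_t.
    """
    B = A0.shape[0]
    T = len(alpha_bar) - 1
    t = rng.integers(1, T + 1, size=B)                 # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(A0.shape)
    ab = alpha_bar[t][:, None, None]
    A_t = np.sqrt(ab) * A0 + np.sqrt(1.0 - ab) * eps   # forward process, Eq. (2)
    residual = eps - eps_model(A_t, C, t)
    return float(np.mean(np.sum(residual**2, axis=(1, 2))))
```

As a sanity check, a predictor that always outputs zero gives a loss close to $P\times K$, the expected squared norm of the injected noise.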
### 2.3 Conditioning on Low-Resolution Input
For super-resolution, we upsample the low-resolution input $\mathbf{x}^{\mathrm{LR}}$ via bicubic interpolation, $\mathbf{x}^{\mathrm{up}}=\mathcal{U}(\mathbf{x}^{\mathrm{LR}})\in\mathbb{R}^{H\times W}$, and encode it using the same patch-POD pipeline as the high-resolution fields, yielding a conditioning token sequence $\mathbf{C}\in\mathbb{R}^{P\times K}$. Encoding both HR and LR fields in the same latent space enables the denoiser to learn a structured residual in coefficient space. Each token $\mathbf{c}_{p}$ represents the POD coefficients of the upsampled LR patch at position $p$, providing spatially aligned conditioning.
We apply token-wise additive conditioning by projecting noisy HR tokens and LR tokens to $d_{\mathrm{model}}$ and combining them as
$$\mathbf{h}_{p}=\mathbf{W}_{\mathrm{in}}\,\tilde{\mathbf{a}}_{t,p}+\mathbf{W}_{\mathrm{cond}}\,\mathbf{c}_{p},\tag{4}$$
yielding fused tokens $\mathbf{H}\in\mathbb{R}^{P\times d_{\mathrm{model}}}$. Here $\mathbf{W}_{\mathrm{in}}$ and $\mathbf{W}_{\mathrm{cond}}$ are learnable projection matrices in $\mathbb{R}^{d_{\mathrm{model}}\times K}$. Additive conditioning ensures token alignment and preserves locality.
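Eq. (4) amounts to two matrix products applied to every token position. A minimal sketch, with `fuse_tokens` as a hypothetical name and the weights passed in as plain arrays rather than learned parameters:

```python
import numpy as np

def fuse_tokens(a_t, c, W_in, W_cond):
    """Eq. (4): h_p = W_in a_{t,p} + W_cond c_p for every token p.

    a_t, c:        (P, K) noisy HR tokens and LR conditioning tokens.
    W_in, W_cond:  (d_model, K) projection matrices.
    """
    return a_t @ W_in.T + c @ W_cond.T      # (P, d_model)
```

Since the HR token and the LR token at position `p` are summed after projection, the sequence length stays at $P$, whereas concatenating conditioning tokens along the sequence axis would double it.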
### 2.4 Vision Transformer Denoising Architecture
The denoising network is a Vision Transformer operating on the token sequence $\mathbf{H}\in\mathbb{R}^{P\times d_{\mathrm{model}}}$. Tokens are augmented with 2D positional embeddings and processed by $L$ transformer blocks with multi-head self-attention, MLP layers, and timestep-conditioned Adaptive LayerNorm (Peebles and Xie, [2023](https://arxiv.org/html/2606.31290#bib.bib52)). A linear head predicts the noise $\hat{\epsilon}\in\mathbb{R}^{P\times K}$.
### 2.5 Reconstruction and Uncertainty Quantification
At inference, latent tokens are sampled using DDIM (Song et al., [2021a](https://arxiv.org/html/2606.31290#bib.bib14)) with $S=100$ steps and de-normalised to recover POD coefficients,
$$\hat{\mathbf{a}}_{p}=\Lambda^{1/2}\hat{\tilde{\mathbf{a}}}_{p}.$$
Each patch is then reconstructed by the linear decoder
$$\hat{\mathbf{u}}_{p}=\bar{\mathbf{u}}+\Phi\hat{\mathbf{a}}_{p}.\tag{5}$$
The full field is assembled using a fixed linear stitching operator $\mathcal{S}$. We generate $M$ independent latent samples to estimate coefficient-level covariance, and then propagate this covariance through the fixed POD decoder to obtain pixel-space predictive variance. Thus, sampling is performed in the low-dimensional POD space, while spatial uncertainty is obtained through the known decoder geometry.
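The decoding and stitching step can be sketched for the simplest case. The code below assumes non-overlapping patches (stride $r=p$) stored in row-major order, so that stitching reduces to tiling; the overlapping case, where $\mathcal{S}$ would also average contributions, is not covered. The name `decode_and_stitch` and the `scale` argument (the diagonal of $\Lambda^{1/2}$) are illustrative.

```python
import numpy as np

def decode_and_stitch(a_tilde, u_bar, Phi, scale, H, W, C, p):
    """De-normalise tokens, decode each patch (Eq. 5), and tile them into a field.

    a_tilde: (P, K) sampled standardized tokens, with P = (H // p) * (W // p).
    Assumes non-overlapping patches in row-major order.
    """
    patches = u_bar + (a_tilde * scale) @ Phi.T           # (P, s), s = C * p**2
    gh, gw = H // p, W // p
    grid = patches.reshape(gh, gw, p, p, C)
    # Interleave patch-grid and within-patch axes to recover image layout.
    return grid.transpose(0, 2, 1, 3, 4).reshape(H, W, C)
```

Every operation here is linear (or affine) in `a_tilde`, which is the property the uncertainty propagation relies on.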
Figure 1: Overview of the proposed Patch-PODiff-ViT pipeline for conditional super-resolution in a structured latent space.
### 2.6 Theoretical Guarantees
The linear POD encoder-decoder yields two key results.
#### 2.6.1 Proposition 1: Reconstruction Error Bound
###### Proposition 1 (Patchwise POD Reconstruction Error Bound).
Let the random field $\mathbf{u}\in\mathbb{R}^{d}$ be partitioned into $P$ non-overlapping patches $\mathbf{u}_{p}\in\mathbb{R}^{s}$. Let $\bar{\mathbf{u}}$ be the global patch mean and $\Phi\in\mathbb{R}^{s\times K}$ the POD basis retaining energy fraction $\eta$. Define $\hat{\mathbf{u}}_{p}=\bar{\mathbf{u}}+\Phi\Phi^{\top}(\mathbf{u}_{p}-\bar{\mathbf{u}})$ and let $\hat{\mathbf{u}}$ be the reconstructed field. Then
$$\mathbb{E}\!\left[\|\mathbf{u}-\hat{\mathbf{u}}\|^{2}\right]\leq(1-\eta)\sum_{p=1}^{P}\mathbb{E}\!\left[\|\mathbf{u}_{p}-\bar{\mathbf{u}}\|^{2}\right].$$
The result follows from the Eckart–Young theorem; see Appendix [A](https://arxiv.org/html/2606.31290#A1). We use $\eta=0.99$ and validate this bound in Section [4.2](https://arxiv.org/html/2606.31290#S4.SS2).
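The bound can be checked numerically on the data used to fit the basis, where the truncation error equals the discarded singular-value energy. The script below uses hypothetical synthetic patches with a geometrically decaying spectrum; it illustrates the in-sample case only, whereas on held-out fields the statement is about the expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: n patches of dimension s with a decaying spectrum.
s, n, eta = 64, 500, 0.99
X = rng.standard_normal((n, s)) * (0.7 ** np.arange(s))
u_bar = X.mean(axis=0)
Xc = X - u_bar

_, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
energy = np.cumsum(sing**2) / np.sum(sing**2)
K = int(np.searchsorted(energy, eta)) + 1     # smallest K with energy >= eta
Phi = Vt[:K].T

recon = u_bar + Xc @ Phi @ Phi.T              # u_hat_p = u_bar + Phi Phi^T (u_p - u_bar)
err = np.sum((X - recon) ** 2)                # total squared reconstruction error
bound = (1 - eta) * np.sum(Xc ** 2)           # (1 - eta) * total centered energy
print(K, err <= bound)
```

Because $K$ is chosen as the smallest index whose retained energy reaches $\eta$, the retained fraction is at least $\eta$ and the error never exceeds the bound in this in-sample setting.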
#### 2.6.2 Proposition 2: Analytic Uncertainty Propagation
###### Proposition 2 (Analytic Uncertainty Propagation under Patchwise POD).
Let $\hat{\mathbf{u}}_{p}=\bar{\mathbf{u}}+\Phi\mathbf{a}_{p}$ with $\Sigma_{a_{p}}=\mathrm{Cov}(\mathbf{a}_{p})$. Then:
1. (i) $\Sigma_{u_{p}}=\Phi\,\Sigma_{a_{p}}\,\Phi^{\top}$,
2. (ii) for non-overlapping patches, assuming $\mathrm{Cov}(\mathbf{a}_{p},\mathbf{a}_{q})=0$ for $p\neq q$, $\Sigma_{u}=\mathrm{blkdiag}(\Phi\Sigma_{a_{1}}\Phi^{\top},\dots,\Phi\Sigma_{a_{P}}\Phi^{\top})$,
3. (iii) for overlapping patches, $\Sigma_{u}=\mathcal{S}\tilde{\Sigma}\mathcal{S}^{\top}$, where $\tilde{\Sigma}$ stacks the patch covariances.
The result follows from linearity; see Appendix [B](https://arxiv.org/html/2606.31290#A2). In practice, we estimate $\hat{\Sigma}_{a_{p}}$ from $M$ generated latent samples and obtain pixel-level predictive variance as $\mathrm{diag}(\Phi\hat{\Sigma}_{a_{p}}\Phi^{\top})$ at cost $\mathcal{O}(PMK^{2}+PsK)$, without explicitly forming or storing a full $HW\times HW$ pixel-space covariance matrix.
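Part (i) of the proposition translates directly into a few array contractions. The sketch below is an assumption-laden illustration: `analytic_patch_variance` is a hypothetical name, the input is taken to be de-normalised coefficients, and the unbiased $1/(M-1)$ covariance estimator is assumed since the text does not specify one. The einsum is written for clarity and is not tuned to match the complexity quoted above.

```python
import numpy as np

def analytic_patch_variance(samples, Phi):
    """Pixel-level variance diag(Phi Sigma_a Phi^T) for every patch.

    samples: (M, P, K) de-normalised latent coefficients a_p.
    Phi:     (s, K) POD basis.
    Returns: (P, s) per-pixel predictive variances, one row per patch.
    """
    M = samples.shape[0]
    centered = samples - samples.mean(axis=0)
    # Per-patch K x K sample covariance of the coefficients.
    Sigma = np.einsum('mpk,mpl->pkl', centered, centered) / (M - 1)
    # Diagonal of Phi Sigma Phi^T without forming any s x s matrix.
    return np.einsum('sk,pkl,sl->ps', Phi, Sigma, Phi)
```

For non-overlapping patches and the same $M$ samples, this output agrees to floating-point precision with the variance of the decoded patches, because the decoder is linear; only $K\times K$ covariances are ever stored.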
### 2.7 Computational Efficiency
Patch-PODiff-ViT is efficient due to (i) low-dimensional tokens ($K\ll s$), (ii) resolution-independent tokenisation, and (iii) parameter-free POD encoding. See Section [4.5](https://arxiv.org/html/2606.31290#S4.SS5) for quantitative comparisons.
### 2.8 Limitations
Patch-PODiff-ViT is most effective when spatial fields exhibit low-rank local structure. For highly turbulent or discontinuous fields with slow singular value decay, more POD modes are required, increasing latent dimensionality $P\times K$ and narrowing the efficiency gap relative to pixel-space methods. We observe this in Appendix [J](https://arxiv.org/html/2606.31290#A10), where at $\mathrm{Pe}=O(10^{6})$, up to $K=150$ modes are needed to retain $99\%$ variance, compared to $K=2$–$26$ in our main datasets. The POD basis is fixed after construction, so distribution shifts may require recomputation, with adaptive or online extensions left for future work. Proposition [2](https://arxiv.org/html/2606.31290#Thmproposition2) uses a block-diagonal covariance approximation that neglects cross-patch terms, trading modeling fidelity for tractability. This affects only the covariance map, since the ViT still captures cross-patch dependencies during sampling, and overlapping aggregation via $\mathcal{S}$ can partially mitigate boundary effects. Discarded-mode uncertainty is not modeled explicitly, but remains small under $\eta=0.99$ by Proposition [1](https://arxiv.org/html/2606.31290#Thmproposition1).
## 3 Experimental Setup
### 3.1 Datasets
We evaluate on three domains: geophysical fields, medical images, and natural images. For SST, we use daily Western Australian coast fields at $640\times 480$ from ROMS (Shchepetkin and McWilliams, [2005](https://arxiv.org/html/2606.31290#bib.bib46)), conditioned on ACCESS-S2 inputs at $53\times 31$ (Wedd et al., [2022](https://arxiv.org/html/2606.31290#bib.bib47)) bicubically interpolated to the target grid. We train on 1998–2009 ($\approx$4,000 fields), validate on 2010, and test on 2011, which includes the documented West Australian marine heatwave. Land pixels are masked and metrics are computed over ocean pixels only. For medical imaging, we use NIH ChestX-ray14 (Wang et al., [2017](https://arxiv.org/html/2606.31290#bib.bib53)), resizing frontal radiographs to $256\times 256$ and following the official train/test split. For natural images, we use FFHQ (Karras et al., [2019](https://arxiv.org/html/2606.31290#bib.bib57)) at $256\times 256$ with an 80/20 random split using seed 0. For X-ray and FFHQ, low-resolution inputs are produced by $4\times$ bicubic downsampling to $64\times 64$ followed by upsampling to $256\times 256$ before encoding.
### 3.2 Baselines and Implementation Details
We compare against five baselines covering deterministic, latent diffusion, pixel-space diffusion, transformer diffusion, and POD-based variants.
U-Net: A deterministic convolutional U-Net (Ronneberger et al., [2015](https://arxiv.org/html/2606.31290#bib.bib44)) with a symmetric encoder–decoder, four resolution levels, base width $C=128$, channel widths $(C,2C,4C,8C)$, skip connections, and an $\ell_{2}$ loss. The same architecture is used across all datasets.
VAE-LDM: A latent diffusion model following Rombach et al. ([2022](https://arxiv.org/html/2606.31290#bib.bib39)), with a convolutional VAE encoder of base width $C=128$ that compresses $256\times 256$ inputs to $64\times 64\times 4$ latent maps. The denoiser uses the same ViT configuration as Patch-PODiff-ViT, providing a controlled learned-latent baseline, although the latent dimensionalities are not identical. The VAE is trained on the same training split and selected using validation reconstruction quality before training the latent diffusion model.
DiT: A Diffusion Transformer (Peebles and Xie, [2023](https://arxiv.org/html/2606.31290#bib.bib52)) operating directly on $16\times 16$ pixel patches without POD compression. It uses the same ViT configuration as Patch-PODiff-ViT, namely $d_{\mathrm{model}}=512$, $L=12$, and $H=8$ attention heads, controlling for denoiser capacity; both methods use the same patch sequence length, while Patch-PODiff-ViT compresses each token to $K$ POD modes.
PixelDiff: A pixel-space diffusion model using the same U-Net backbone as the deterministic baseline, trained with a DDPM objective and conditioned on the bicubically upsampled LR input via channel concatenation, following the residual diffusion paradigm of CorrDiff (Mardani et al., [2025](https://arxiv.org/html/2606.31290#bib.bib59)).
Fullfield-PODiff: A non-patchwise POD baseline using a single full-field POD basis, with $K$ selected by the same $\eta=0.99$ criterion, isolating the contribution of patchwise structure.
All diffusion models use a cosine noise schedule with $T=1{,}000$ training timesteps, $S=100$ inference steps, and $M=100$ ensemble samples for uncertainty. Models are trained with AdamW at learning rate $10^{-4}$ using identical preprocessing and hardware. Hyperparameters are selected on validation splits and fixed for test evaluation. Standard deviations are reported in Appendix [D](https://arxiv.org/html/2606.31290#A4).
Patch-PODiff-ViT Implementation Details: The POD basis is computed from training patches at $\eta=0.99$. The ViT denoiser uses $d_{\mathrm{model}}=512$, $L=12$, $H=8$ heads, and MLP ratio $4.0$ ($\approx 70$M parameters), and is trained for 1M steps with batch size 64, AdamW (lr $=10^{-4}$, weight decay $10^{-4}$), gradient clipping 1.0, and EMA 0.9995. We use a cosine schedule ($T=1{,}000$) and DDIM sampling ($S=100$) at inference. Uncertainty estimates use $M=100$ ensemble members unless stated otherwise. The default patch size and resulting $K$ values are reported alongside the main results. Patch size sensitivity is in Appendix [I](https://arxiv.org/html/2606.31290#A9). All models are trained on AMD Instinct MI250X GPUs.
### 3.3 Evaluation Metrics
We evaluate reconstruction using RMSE, PSNR, SSIM (Wang et al., [2004](https://arxiv.org/html/2606.31290#bib.bib54)), LPIPS (Zhang et al., [2018](https://arxiv.org/html/2606.31290#bib.bib55)), and FID (Heusel et al., [2017](https://arxiv.org/html/2606.31290#bib.bib56)). Uncertainty is assessed with CRPS (Gneiting and Raftery, [2007](https://arxiv.org/html/2606.31290#bib.bib26)), MACE, and empirical coverage at nominal levels $\{50,60,70,80,90,95,99\}\%$ using $M=100$ samples, with SST UQ computed over all 365 test days in 2011. Efficiency is measured by parameter count, peak GPU memory, training time, per-sample inference time, and $M=100$ ensemble inference time. All models use identical preprocessing: SST is min–max normalized over training ocean pixels with land masked out, while Chest X-ray and FFHQ are scaled to $[0,1]$, with LPIPS and FID using the required pretrained-network input ranges.
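The uncertainty metrics can be sketched from an ensemble as follows. The definitions used here are assumptions, since the text does not spell them out: CRPS uses the standard ensemble estimator $\mathbb{E}|X-y|-\tfrac{1}{2}\mathbb{E}|X-X'|$, coverage uses central quantile intervals, and MACE is taken to be the mean absolute gap between empirical and nominal coverage over the listed levels.

```python
import numpy as np

def ensemble_crps(samples, y):
    """Ensemble CRPS estimate E|X - y| - 0.5 E|X - X'|, averaged over pixels.

    samples: (M, ...) ensemble members; y: (...) reference field.
    Builds an (M, M, ...) array, so it suits small fields or tiled evaluation.
    """
    term1 = np.mean(np.abs(samples - y), axis=0)
    term2 = np.mean(np.abs(samples[:, None] - samples[None, :]), axis=(0, 1))
    return float(np.mean(term1 - 0.5 * term2))

def empirical_coverage(samples, y, level):
    """Fraction of pixels whose reference value lies in the central interval."""
    lo = np.quantile(samples, 0.5 - level / 2, axis=0)
    hi = np.quantile(samples, 0.5 + level / 2, axis=0)
    return float(np.mean((y >= lo) & (y <= hi)))

def mace(samples, y, levels=(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99)):
    """Mean absolute gap between empirical and nominal coverage."""
    gaps = [abs(empirical_coverage(samples, y, lv) - lv) for lv in levels]
    return float(np.mean(gaps))
```

For masked domains such as SST, `samples` and `y` would first be restricted to ocean pixels, matching the evaluation protocol described above.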
## 4 Results
We evaluate Patch-PODiff-ViT using reconstruction accuracy, theoretical bound validation, uncertainty quantification, computational efficiency, and downstream SST ensemble forecasting. The experiments test three claims: patchwise POD provides a compact but spatially structured latent space; the fixed linear decoder enables accurate propagation of latent uncertainty to pixel-space variance; and the efficiency advantage is largest when local patch spectra decay rapidly. Unless stated otherwise, all results use $16\times 16$ patches, an $\eta=0.99$ energy threshold, and $M=100$ ensemble members. Sensitivity to the ensemble size $M$ is examined in Appendix [I.4](https://arxiv.org/html/2606.31290#A9.SS4).
### 4.1 Reconstruction Accuracy
Figure 2: Qualitative comparison across three domains. Columns: LR input, U-Net, PixelDiff, VAE-LDM, Patch-PODiff-ViT (ours), GT. SST (rows 1–2): full reconstruction and absolute error. Chest X-ray (rows 3–4): full reconstruction and zoomed absolute error. FFHQ (rows 5–6): full reconstruction and zoomed absolute error. Patch-PODiff-ViT consistently produces lower error across all domains. Full comparisons with zoomed regions are in Appendix [E](https://arxiv.org/html/2606.31290#A5).
Table 1: Reconstruction quality across three datasets. Lower is better for RMSE, LPIPS, FID; higher for PSNR, SSIM. Standard deviations are provided in Table [5](https://arxiv.org/html/2606.31290#A4.T5) of Appendix [D](https://arxiv.org/html/2606.31290#A4). Results are averaged over 3 seed runs.
Table [1](https://arxiv.org/html/2606.31290#S4.T1) shows that Patch-PODiff-ViT achieves consistently strong reconstruction performance across all datasets, with the best result on most reported metrics. Among diffusion baselines, DiT is the strongest competitor, suggesting that the gains arise not only from the transformer denoiser but also from the structured POD latent space. Appendix [I.5](https://arxiv.org/html/2606.31290#A9.SS5) confirms that variance-ordered compression, not linear structure alone, drives these gains. Additionally, since VAE-LDM uses the same denoiser configuration, these gains suggest that the structured POD latent representation is advantageous in these settings, though the comparison does not fully equalize latent dimensionality. Figure [2](https://arxiv.org/html/2606.31290#S4.F2) shows qualitative results across all three domains. Patch-PODiff-ViT consistently preserves large-scale structure and fine-scale detail, including fronts in SST, anatomical boundaries in Chest X-ray, and texture in FFHQ, while competing methods exhibit smoothing or local distortions near sharp gradients. This trend is reflected quantitatively in Table [1](https://arxiv.org/html/2606.31290#S4.T1) and further supported by extended comparisons in Appendix [E](https://arxiv.org/html/2606.31290#A5).
Fullfield-PODiff further ablates patchwise structure: despite using a POD latent space, its FID is consistently worse than that of Patch-PODiff-ViT. This suggests that local patchwise POD, rather than POD compression alone, drives the improvement. The U-Net underperforms all diffusion baselines across perceptual metrics, consistent with the known limitations of deterministic regression for high-frequency detail recovery.
### 4.2 Validation of the POD Reconstruction Bound
We empirically validate the retained-energy reconstruction bound in Appendix [C](https://arxiv.org/html/2606.31290#A3). Across all datasets, mean empirical POD errors remain below the expected bound. Because the bound holds in expectation, the sample-wise satisfaction rates (above 97%) serve only as empirical diagnostics. This confirms that the retained-energy criterion in Eq. ([1](https://arxiv.org/html/2606.31290#S2.E1)) provides reliable control of patchwise reconstruction error in the settings considered.
### 4.3 Uncertainty Quantification
Uncertainty quantification is a central component of the proposed method. We evaluate it along three axes: (i) validity of analytic propagation (Proposition [2](https://arxiv.org/html/2606.31290#Thmproposition2)), (ii) calibration of predictive intervals, and (iii) sharpness of the predictive distribution.
Figure 3: Analytic versus empirical uncertainty on SST. Left: analytic standard deviation computed via Proposition [2](https://arxiv.org/html/2606.31290#Thmproposition2). Middle: empirical standard deviation from an ensemble of $M=100$ samples. Right: difference map, with Pearson correlation $r=0.983$. The close agreement indicates that the analytic formulation captures both the spatial structure and magnitude of uncertainty without explicitly forming full pixel-space covariance estimates.
Analytic vs. empirical uncertainty: Figure [3](https://arxiv.org/html/2606.31290#S4.F3) evaluates the covariance propagation step in Proposition [2](https://arxiv.org/html/2606.31290#Thmproposition2) by comparing pixel-wise standard deviation maps obtained from $\mathrm{diag}(\Phi\hat{\Sigma}_{a_{p}}\Phi^{\top})$ with empirical standard deviation maps computed from $M=100$ decoded samples. The two agree closely on SST ($r=0.9828$), with similar correlations on Chest X-ray ($r=0.986$) and FFHQ ($r=0.953$). Additional maps are provided in Appendix [G](https://arxiv.org/html/2606.31290#A7). Including covariance estimation, analytic propagation costs $\mathcal{O}(PMK^{2}+PsK)$ and avoids explicitly forming full pixel-space covariance estimates. Appendix [H](https://arxiv.org/html/2606.31290#A8) further supports the block-diagonal covariance approximation: off-diagonal cross-patch covariance is small for SST and Chest X-ray and moderate for FFHQ, consistent with the mild patch-boundary effects observed on natural images. Spatially, uncertainty concentrates near SST coastlines and thermal fronts, X-ray lung boundaries and rib edges, and FFHQ features such as eyes and hair.
Calibration and Sharpness: Figure [4](https://arxiv.org/html/2606.31290#S4.F4) shows that Patch-PODiff-ViT closely follows the ideal calibration line across datasets and nominal levels. It achieves the lowest MACE, with $0.008$ on SST, $0.0046$ on Chest X-ray, and $0.0084$ on FFHQ, while DiT and VAE-LDM show larger deviations. For sharpness, Patch-PODiff-ViT obtains the lowest CRPS on SST and Chest X-ray. On FFHQ, VAE-LDM attains lower CRPS but higher MACE, indicating a sharper but less calibrated predictive distribution. Full coverage and CRPS values are provided in Appendix [F](https://arxiv.org/html/2606.31290#A6).

Figure 4: Reliability diagrams comparing empirical versus nominal coverage for SST, FFHQ, and Chest X-ray. The dashed line denotes perfect calibration. Patch-PODiff-ViT remains close to the ideal line across datasets, indicating well-calibrated predictive intervals.

Together, these results show that Patch-PODiff-ViT produces uncertainty estimates that are both accurate and well calibrated. The POD-based analytic propagation enables efficient uncertainty estimation while maintaining competitive calibration and sharpness across diverse domains.
### 4.4 Downstream Application: Spatial Uncertainty for Ensemble Forecasting

Figure 5: SST ensemble downstream application. (a) Ground truth SST field, (b) Patch-PODiff-ViT ensemble mean, and (c) predictive standard deviation in physical units ($^{\circ}$C). The uncertainty map highlights regions of higher predictive uncertainty along coastal areas and strong thermal gradients.

Figure [5](https://arxiv.org/html/2606.31290#S4.F5) illustrates Patch-PODiff-ViT on a representative day from the 2011 West Australian marine heatwave. The ensemble mean recovers the large-scale warm water mass, sharp meridional gradient, and coastal fine-scale structure, while the predictive standard deviation highlights higher uncertainty near coastlines and strong thermal gradients. From an application perspective, the model provides both a high-resolution estimate and a spatial confidence map, helping identify regions that require greater caution.
### 4.5 Computational Efficiency

Table 2: Computational cost comparison across diffusion models.

Table [2](https://arxiv.org/html/2606.31290#S4.T2) compares computational cost. Patch-PODiff-ViT uses 70M parameters and trains in 8.7 h, compared with 220M and 30.6 h for VAE-LDM. It also reduces peak memory from 16.1 GB to 8.6 GB and generates an $M=100$ ensemble in 11.036 s, compared with 26.038 s for VAE-LDM and 189.530 s for PixelDiff. DiT has similar sampling time due to the matched transformer backbone, but Patch-PODiff-ViT achieves better reconstruction and calibration through POD-compressed tokens.

Additional ablations on patch size and denoiser architecture are provided in Appendix [I](https://arxiv.org/html/2606.31290#A9), showing that $16 \times 16$ performs best among the tested patch sizes and that ViT self-attention is critical, as replacing the ViT with a per-token MLP substantially degrades performance (FID 3.986 to 12.59).
## References
- \[1\]\(2015\)A survey of projection\-based model reduction methods for parametric dynamical systems\.SIAM Review57\(4\),pp\. 483–531\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p3.1)\.
- \[2\]G\. Berkooz, P\. Holmes, and J\. L\. Lumley\(1993\)The proper orthogonal decomposition in the analysis of turbulent flows\.Annual review of fluid mechanics25\(1\),pp\. 539–575\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p3.1)\.
- \[3\]V\. Böhm, F\. Lanusse, and U\. Seljak\(2019\)Uncertainty quantification with generative models\.arXiv preprint arXiv:1910\.10046\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p3.1)\.
- \[4\]D\. Coscia, N\. Demo, and G\. Rozza\(2024\)Generative adversarial reduced order modelling\.Scientific Reports14\(1\),pp\. 3826\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p4.1)\.
- \[5\]P\. Du, M\. H\. Parikh, X\. Fan, X\. Liu, and J\. Wang\(2024\)Conditional neural field latent diffusion model for generating spatiotemporal turbulence\.Nature Communications15\(1\),pp\. 10416\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p4.1)\.
- \[6\]T\. Gneiting and A\. E\. Raftery\(2007\)Strictly proper scoring rules, prediction, and estimation\.Journal of the American Statistical Association102\(477\),pp\. 359–378\.Cited by:[§3\.3](https://arxiv.org/html/2606.31290#S3.SS3.p1.4)\.
- \[7\]M\. Heusel, H\. Ramsauer, T\. Unterthiner, B\. Nessler, and S\. Hochreiter\(2017\)GANs trained by a two time\-scale update rule converge to a local Nash equilibrium\.InAdvances in Neural Information Processing Systems,Vol\.30\.Cited by:[§3\.3](https://arxiv.org/html/2606.31290#S3.SS3.p1.4)\.
- \[8\]J\. Ho, A\. Jain, and P\. Abbeel\(2020\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.31290#S2.SS2.SSS0.Px2.p1.2)\.
- \[9\]O\. Jadhav, T\. French, I\. Janekovic, N\. L\. Jones, and M\. D\. Rayson\(2025\)Deep learning\-based statistical downscaling of sea surface temperature using a residual corrective neural network\.ESS Open Archive2025\(0813\)\.External Links:[Document](https://dx.doi.org/10.22541/essoar.175510661.11731470/v1),[Link](https://essopenarchive.org/doi/abs/10.22541/essoar.175510661.11731470/v1)\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p1.1)\.
- \[10\]O\. Jadhav, T\. French, M\. Rayson, and N\. L\. Jones\(2026\)PODiff: latent diffusion in proper orthogonal decomposition space for scientific super\-resolution\.arXiv preprint arXiv:2605\.03399\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p5.1)\.
- \[11\]T\. Karras, S\. Laine, and T\. Aila\(2019\)A style\-based generator architecture for generative adversarial networks\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 4401–4410\.Cited by:[§3\.1](https://arxiv.org/html/2606.31290#S3.SS1.p1.8)\.
- \[12\]J\. Leinonen, U\. Hamann, D\. Nerini, U\. Germann, and G\. Franch\(2023\)Latent diffusion models for generative precipitation nowcasting with accurate uncertainty quantification\.arXiv preprint arXiv:2304\.12891\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p1.1)\.
- \[13\]J\. Leinonen, D\. Nerini, and A\. Berne\(2020\)Stochastic super\-resolution for downscaling time\-evolving atmospheric fields with a generative adversarial network\.IEEE Transactions on Geoscience and Remote Sensing59\(9\),pp\. 7211–7223\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p1.1)\.
- \[14\]M\. Mardani, N\. Brenowitz, Y\. Cohen, J\. Pathak, C\. Chen, C\. Liu, A\. Vahdat, M\. A\. Nabian, T\. Ge, A\. Subramaniam,et al\.\(2025\)Residual corrective diffusion modeling for km\-scale atmospheric downscaling\.Communications Earth & Environment6\(1\),pp\. 124\.Cited by:[§3\.2](https://arxiv.org/html/2606.31290#S3.SS2.p5.1)\.
- \[15\]B\. B\. Moser, A\. S\. Shanbhag, F\. Raue, S\. Frolov, S\. Palacio, and A\. Dengel\(2024\)Diffusion models, image super\-resolution, and everything: a survey\.IEEE Transactions on Neural Networks and Learning Systems\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p1.1)\.
- \[16\]A\. Q\. Nichol and P\. Dhariwal\(2021\)Improved denoising diffusion probabilistic models\.InInternational conference on machine learning,pp\. 8162–8171\.Cited by:[§2\.2](https://arxiv.org/html/2606.31290#S2.SS2.SSS0.Px2.p1.2)\.
- \[17\]W\. Peebles and S\. Xie\(2023\)Scalable diffusion models with transformers\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 4195–4205\.Cited by:[§2\.4](https://arxiv.org/html/2606.31290#S2.SS4.p1.3),[§3\.2](https://arxiv.org/html/2606.31290#S3.SS2.p4.5)\.
- \[18\]I\. Price, A\. Sanchez\-Gonzalez, F\. Alet, T\. R\. Andersson, A\. El\-Kadi, D\. Masters, T\. Ewalds, J\. Stott, S\. Mohamed, P\. Battaglia,et al\.\(2023\)Gencast: diffusion\-based ensemble forecasting for medium\-range weather\.arXiv preprint arXiv:2312\.15796\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p1.1)\.
- \[19\]R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer\(2022\)High\-resolution image synthesis with latent diffusion models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 10684–10695\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p1.1),[§1](https://arxiv.org/html/2606.31290#S1.p3.1),[§3\.2](https://arxiv.org/html/2606.31290#S3.SS2.p3.3)\.
- \[20\]O\. Ronneberger, P\. Fischer, and T\. Brox\(2015\)U\-net: convolutional networks for biomedical image segmentation\.InInternational Conference on Medical image computing and computer\-assisted intervention,pp\. 234–241\.Cited by:[§3\.2](https://arxiv.org/html/2606.31290#S3.SS2.p2.3)\.
- \[21\]C\. Saharia, J\. Ho, W\. Chan, T\. Salimans, D\. J\. Fleet, and M\. Norouzi\(2022\)Image super\-resolution via iterative refinement\.IEEE transactions on pattern analysis and machine intelligence45\(4\),pp\. 4713–4726\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p1.1),[§1](https://arxiv.org/html/2606.31290#S1.p2.1)\.
- \[22\]A\.F\. Shchepetkin and J\.C\. McWilliams\(2005\)The regional oceanic modeling system \(ROMS\): a split\-explicit, free\-surface, topography\-following\-coordinate oceanic model\.Ocean Modelling9\(4\),pp\. 347–404\.Cited by:[§3\.1](https://arxiv.org/html/2606.31290#S3.SS1.p1.8)\.
- \[23\]L\. Sirovich\(1987\)Turbulence and the dynamics of coherent structures\. I\. coherent structures\.Quarterly of applied mathematics45\(3\),pp\. 561–571\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p3.1)\.
- \[24\]J\. Song, C\. Meng, and S\. Ermon\(2021\)Denoising diffusion implicit models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=St1giarCHLP)Cited by:[§2\.5](https://arxiv.org/html/2606.31290#S2.SS5.p1.1)\.
- \[25\]Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole\(2021\)Score\-based generative modeling through stochastic differential equations\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=PxTIG12RRHS)Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p2.1)\.
- \[26\]K\. Stengel, A\. Glaws, D\. Hettinger, and R\. N\. King\(2020\)Adversarial super\-resolution of climatological wind and solar data\.Proceedings of the National Academy of Sciences117\(29\),pp\. 16805–16815\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p1.1)\.
- \[27\]A\. Vahdat, K\. Kreis, and J\. Kautz\(2021\)Score\-based generative modeling in latent space\.Advances in neural information processing systems34,pp\. 11287–11302\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p3.1)\.
- \[28\]X\. Wang, Y\. Peng, L\. Lu, Z\. Lu, M\. Bagheri, and R\. M\. Summers\(2017\)ChestX\-ray8: hospital\-scale chest x\-ray database and benchmarks on weakly\-supervised classification and localization of common thorax diseases\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pp\. 2097–2106\.Cited by:[§3\.1](https://arxiv.org/html/2606.31290#S3.SS1.p1.8)\.
- \[29\]Z\. Wang, A\. C\. Bovik, H\. R\. Sheikh, and E\. P\. Simoncelli\(2004\)Image quality assessment: from error visibility to structural similarity\.IEEE Transactions on Image Processing13\(4\),pp\. 600–612\.Cited by:[§3\.3](https://arxiv.org/html/2606.31290#S3.SS3.p1.4)\.
- \[30\]R\. A\. Watt and L\. A\. Mansfield\(2024\)Generative diffusion\-based downscaling for climate\.arXiv preprint arXiv:2404\.17752\.Cited by:[§1](https://arxiv.org/html/2606.31290#S1.p1.1)\.
- \[31\]R\. Wedd, O\. Alves, C\. de Burgh\-Day, C\. Down, M\. Griffiths, H\.H\. Hendon, D\. Hudson, S\. Li, E\.P\. Lim, A\.G\. Marshall,et al\.\(2022\)ACCESS\-S2: the upgraded bureau of meteorology multi\-week to seasonal prediction system\.Journal of Southern Hemisphere Earth Systems Science72\(3\),pp\. 218–242\.Cited by:[§3\.1](https://arxiv.org/html/2606.31290#S3.SS1.p1.8)\.
- \[32\]R\. Zhang, P\. Isola, A\. A\. Efros, E\. Shechtman, and O\. Wang\(2018\)The unreasonable effectiveness of deep features as a perceptual metric\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pp\. 586–595\.Cited by:[§3\.3](https://arxiv.org/html/2606.31290#S3.SS3.p1.4)\.
## Appendix A Proposition 1 Proof

###### Proposition 1 (Patchwise POD Reconstruction Error Bound).

Let a random field $\mathbf{u} \in \mathbb{R}^{d}$ be partitioned into $P$ non-overlapping patches $\{\mathbf{u}_p\}_{p=1}^{P}$, where each patch $\mathbf{u}_p \in \mathbb{R}^{s}$ and $d = Ps$.

Let $\bar{\mathbf{u}} \in \mathbb{R}^{s}$ be the global patch mean and $\Phi \in \mathbb{R}^{s \times K}$ the shared POD basis formed by the top-$K$ eigenvectors of the pooled patch covariance, with corresponding eigenvalues

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_s \geq 0.$$

Assume the retained modes satisfy

$$\frac{\sum_{k=1}^{K} \lambda_k}{\sum_{k=1}^{s} \lambda_k} \geq \eta, \qquad \eta \in (0,1).$$

Define the patchwise POD reconstruction

$$\hat{\mathbf{u}}_p = \bar{\mathbf{u}} + \Phi \Phi^{\top} (\mathbf{u}_p - \bar{\mathbf{u}}),$$

and let $\hat{\mathbf{u}}$ be the global reconstruction obtained by reassembling the reconstructed patches. Then

$$\mathbb{E}\!\left[\|\mathbf{u} - \hat{\mathbf{u}}\|^{2}\right] \leq (1 - \eta) \sum_{p=1}^{P} \mathbb{E}\!\left[\|\mathbf{u}_p - \bar{\mathbf{u}}\|^{2}\right].$$
###### Proof.

Because the patches are non-overlapping,

$$\|\mathbf{u} - \hat{\mathbf{u}}\|^{2} = \sum_{p=1}^{P} \|\mathbf{u}_p - \hat{\mathbf{u}}_p\|^{2}.$$

Taking expectations gives

$$\mathbb{E}\!\left[\|\mathbf{u} - \hat{\mathbf{u}}\|^{2}\right] = \sum_{p=1}^{P} \mathbb{E}\!\left[\|\mathbf{u}_p - \hat{\mathbf{u}}_p\|^{2}\right].$$

For each patch $p$, define the centered patch $\tilde{\mathbf{u}}_p := \mathbf{u}_p - \bar{\mathbf{u}}$. Then

$$\mathbf{u}_p - \hat{\mathbf{u}}_p = (\mathbf{I} - \Phi \Phi^{\top}) \tilde{\mathbf{u}}_p,$$

and therefore

$$\|\mathbf{u}_p - \hat{\mathbf{u}}_p\|^{2} = \|(\mathbf{I} - \Phi \Phi^{\top}) \tilde{\mathbf{u}}_p\|^{2}.$$

Since $\Phi$ is constructed from the pooled patch covariance, the POD projection identity applies to the pooled patch distribution:

$$\frac{1}{P} \sum_{p=1}^{P} \mathbb{E}\!\left[\|(\mathbf{I} - \Phi \Phi^{\top}) \tilde{\mathbf{u}}_p\|^{2}\right] = \sum_{k=K+1}^{s} \lambda_k.$$

Using the retained-energy assumption,

$$\sum_{k=K+1}^{s} \lambda_k = \sum_{k=1}^{s} \lambda_k - \sum_{k=1}^{K} \lambda_k \leq (1 - \eta) \sum_{k=1}^{s} \lambda_k.$$

The total variance of the pooled centered patches is

$$\sum_{k=1}^{s} \lambda_k = \frac{1}{P} \sum_{p=1}^{P} \mathbb{E}\!\left[\|\mathbf{u}_p - \bar{\mathbf{u}}\|^{2}\right].$$

Combining the previous two equations gives

$$\frac{1}{P} \sum_{p=1}^{P} \mathbb{E}\!\left[\|\mathbf{u}_p - \hat{\mathbf{u}}_p\|^{2}\right] \leq (1 - \eta) \frac{1}{P} \sum_{p=1}^{P} \mathbb{E}\!\left[\|\mathbf{u}_p - \bar{\mathbf{u}}\|^{2}\right].$$

Multiplying both sides by $P$ yields

$$\sum_{p=1}^{P} \mathbb{E}\!\left[\|\mathbf{u}_p - \hat{\mathbf{u}}_p\|^{2}\right] \leq (1 - \eta) \sum_{p=1}^{P} \mathbb{E}\!\left[\|\mathbf{u}_p - \bar{\mathbf{u}}\|^{2}\right].$$

Finally, using the non-overlapping patch decomposition,

$$\mathbb{E}\!\left[\|\mathbf{u} - \hat{\mathbf{u}}\|^{2}\right] \leq (1 - \eta) \sum_{p=1}^{P} \mathbb{E}\!\left[\|\mathbf{u}_p - \bar{\mathbf{u}}\|^{2}\right].$$

This completes the proof. ∎
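The projection identity used in the proof can be checked numerically. The following is a minimal sketch on synthetic data, not the authors' pipeline: the patch dimension, sample count, number of retained modes, and the random "patches" are all hypothetical, and the pooled covariance is assumed to use a $1/N$ normalisation. It sets $\eta$ to the energy fraction actually achieved by $K$ modes, in which case the bound holds with equality; any smaller $\eta$ leaves slack.

```python
import numpy as np

rng = np.random.default_rng(0)
s, N, K = 16, 5000, 4                       # patch dimension, pooled patch count, retained modes (all hypothetical)
scales = 2.0 ** -np.arange(s)               # synthetic, fast-decaying per-coordinate spread
patches = rng.normal(size=(N, s)) * scales + 3.0   # stand-in for pooled patch vectors u_p

u_bar = patches.mean(axis=0)                # global patch mean
centered = patches - u_bar
cov = centered.T @ centered / N             # pooled patch covariance (1/N normalisation assumed)
lam, vecs = np.linalg.eigh(cov)             # eigh returns eigenvalues in ascending order
lam, vecs = lam[::-1], vecs[:, ::-1]        # reorder so lambda_1 >= ... >= lambda_s
Phi = vecs[:, :K]                           # shared POD basis, orthonormal columns

eta_K = lam[:K].sum() / lam.sum()           # retained-energy fraction achieved by K modes
residual = centered - centered @ Phi @ Phi.T            # (I - Phi Phi^T) applied to each centered patch
mean_sq_error = np.mean(np.sum(residual**2, axis=1))    # average squared reconstruction error
tail_energy = lam[K:].sum()                             # sum of discarded eigenvalues
total_variance = np.mean(np.sum(centered**2, axis=1))   # equals the sum of all eigenvalues
bound = (1.0 - eta_K) * total_variance
```

Here `mean_sq_error` matches `tail_energy` up to floating-point error, which is the pooled projection identity, and it does not exceed `bound`.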
## Appendix B Proposition 2 Proof

###### Proposition 2 (Analytic Uncertainty Propagation under Patchwise POD).

Let an input field be represented by $P$ patches. For each patch $p$, let the reconstruction be

$$\hat{\mathbf{u}}_p = \bar{\mathbf{u}} + \Phi \mathbf{a}_p,$$

where $\bar{\mathbf{u}} \in \mathbb{R}^{s}$ is the global patch mean, $\Phi \in \mathbb{R}^{s \times K}$ is the shared retained POD basis, and $\mathbf{a}_p \in \mathbb{R}^{K}$ is a random latent coefficient vector with covariance $\Sigma_{a_p} := \mathrm{Cov}(\mathbf{a}_p)$.

Then the following hold:

1. (i) Patch-level covariance. The covariance of the reconstructed patch is

   $$\Sigma_{u_p} := \mathrm{Cov}(\hat{\mathbf{u}}_p) = \Phi \Sigma_{a_p} \Phi^{\top}.$$

2. (ii) Non-overlapping patch assembly. Suppose patches are reassembled without overlap into $\hat{\mathbf{u}} = [\hat{\mathbf{u}}_1^{\top} \cdots \hat{\mathbf{u}}_P^{\top}]^{\top}$. If the latent variables are mutually uncorrelated across distinct patches, i.e. $\mathrm{Cov}(\mathbf{a}_p, \mathbf{a}_q) = \mathbf{0}$ for $p \neq q$, then the global covariance is block-diagonal:

   $$\Sigma_u := \mathrm{Cov}(\hat{\mathbf{u}}) = \mathrm{blkdiag}\left(\Phi \Sigma_{a_1} \Phi^{\top}, \dots, \Phi \Sigma_{a_P} \Phi^{\top}\right).$$

3. (iii) Overlapping patches with linear aggregation. Suppose overlapping reconstructed patches are concatenated into $\tilde{\mathbf{u}} = [\hat{\mathbf{u}}_1^{\top} \cdots \hat{\mathbf{u}}_P^{\top}]^{\top}$, and the final field is obtained by a fixed linear operator $\mathcal{S}$: $\hat{\mathbf{u}} = \mathcal{S} \tilde{\mathbf{u}}$. If patch latents are mutually uncorrelated across patches, then

   $$\mathrm{Cov}(\tilde{\mathbf{u}}) = \tilde{\Sigma} := \mathrm{blkdiag}\left(\Phi \Sigma_{a_1} \Phi^{\top}, \dots, \Phi \Sigma_{a_P} \Phi^{\top}\right),$$

   and the global covariance is

   $$\Sigma_u = \mathcal{S} \tilde{\Sigma} \mathcal{S}^{\top}.$$
###### Proof.

The result follows from linear covariance propagation. For each patch, $\hat{\mathbf{u}}_p = \bar{\mathbf{u}} + \Phi \mathbf{a}_p$. Since $\bar{\mathbf{u}}$ is deterministic,

$$\mathrm{Cov}(\hat{\mathbf{u}}_p) = \mathrm{Cov}(\Phi \mathbf{a}_p) = \Phi \, \mathrm{Cov}(\mathbf{a}_p) \, \Phi^{\top},$$

proving part (i).

Parts (ii) and (iii) follow by writing the global reconstruction as either a concatenation (non-overlapping case) or a linear transformation via $\mathcal{S}$ (overlapping case), and applying the identity $\mathrm{Cov}(\mathbf{B}\mathbf{x}) = \mathbf{B} \, \mathrm{Cov}(\mathbf{x}) \, \mathbf{B}^{\top}$, together with the assumption that cross-patch covariances vanish. ∎

###### Corollary 1 (Pointwise predictive variance).

Under Proposition [2](https://arxiv.org/html/2606.31290#Thmproposition2), the predictive variance at within-patch pixel index $\ell$ in patch $p$ is $[\Sigma_{u_p}]_{\ell\ell} = [\Phi \Sigma_{a_p} \Phi^{\top}]_{\ell\ell}$. In the overlapping case, for global pixel index $\ell$, $\mathrm{Var}[\hat{u}_\ell] = [\mathcal{S} \tilde{\Sigma} \mathcal{S}^{\top}]_{\ell\ell}$.
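The pointwise variance in Corollary 1 can be illustrated for a single patch with a short NumPy sketch. Everything here is a stand-in: the basis is a random orthonormal matrix rather than a POD basis, the latent covariance is arbitrary, and the function name `analytic_pixel_variance` is hypothetical. The sketch estimates the latent covariance from $M$ samples, propagates its diagonal through the linear decoder, and compares it with the sample variance of the decoded ensemble.

```python
import numpy as np

def analytic_pixel_variance(Phi, Sigma_a):
    """Return diag(Phi @ Sigma_a @ Phi.T) without forming the s x s matrix."""
    return np.einsum("ik,kl,il->i", Phi, Sigma_a, Phi)

rng = np.random.default_rng(0)
s, K, M = 256, 8, 100                              # pixels per patch, latent size, ensemble size
Phi, _ = np.linalg.qr(rng.normal(size=(s, K)))     # random orthonormal columns (stand-in for a POD basis)
u_bar = rng.normal(size=s)                         # stand-in for the global patch mean
A = rng.normal(size=(K, K))
Sigma_true = A @ A.T / K                           # arbitrary positive semi-definite latent covariance

a = rng.multivariate_normal(np.zeros(K), Sigma_true, size=M)   # latent samples, shape (M, K)
Sigma_hat = np.cov(a, rowvar=False)                # sample covariance of the latents, shape (K, K)

var_analytic = analytic_pixel_variance(Phi, Sigma_hat)   # analytic route: O(s K^2) per patch
decoded = u_bar + a @ Phi.T                        # decode every sample, shape (M, s)
var_empirical = decoded.var(axis=0, ddof=1)        # pixel-space route
```

In this idealised setting the decoder is exactly linear and both routes use the same samples, so the two variance maps agree to floating-point precision. Any nonlinear step applied after decoding would break that exact equality.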
## Appendix C POD Basis Analysis

### C.1 Singular Value Decay and Compression

Figure [6](https://arxiv.org/html/2606.31290#A3.F6) shows the cumulative retained energy of the patchwise POD basis across datasets. The shaded bands indicate variation across patches, and the dashed line marks the $99\%$ energy threshold used in the main experiments.

The required number of modes varies noticeably: $K_{99}=2$ for SST, $K_{99}=11$ for Chest X-ray, and $K_{99}=26$ for FFHQ. This reflects increasing local complexity from smooth geophysical fields to anatomical structure and natural image texture. The ordering also explains the observed compression trends: datasets with faster spectral decay allow stronger dimensionality reduction, while more complex datasets require larger latent dimensions to preserve fine-scale structure.
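Selecting the number of modes for a given energy threshold reduces to a cumulative sum over the eigenvalue spectrum. The sketch below assumes the threshold is applied to the cumulative eigenvalue fraction, as in the retained-energy criterion of Proposition 1; the function name `modes_for_energy` and the toy spectrum are hypothetical and are not taken from any of the three datasets.

```python
import numpy as np

def modes_for_energy(eigvals, eta):
    """Smallest K whose leading eigenvalues retain at least a fraction eta of the total energy."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending order
    energy = np.cumsum(lam) / lam.sum()                     # cumulative retained-energy curve
    # The small tolerance guards against rounding when eta sits exactly on a curve value.
    k = int(np.searchsorted(energy, eta - 1e-12)) + 1
    return min(k, lam.size)

toy_spectrum = np.array([70.0, 20.0, 7.0, 2.5, 0.5])        # hypothetical eigenvalues
k95 = modes_for_energy(toy_spectrum, 0.95)
k99 = modes_for_energy(toy_spectrum, 0.99)
```

A spectrum that decays faster reaches the threshold with fewer modes, which is the mechanism behind the ordering of $K_{99}$ across datasets described above.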
### C.2 POD Mode Statistics

Table [3](https://arxiv.org/html/2606.31290#A3.T3) reports the mean, median, and maximum number of POD modes needed to retain $95\%$ and $99\%$ variance across patches. The median $K_{99}$ values match those used in the main experiments, showing that the selected latent dimensions are representative. SST and Chest X-ray show small mean–median gaps, indicating uniform spectral decay, while FFHQ shows a mild tail of slower-decaying patches due to higher texture variability. Overall, the statistics support a shared patchwise POD basis whose dimensionality remains stable within each dataset while adapting to domain complexity.

Table 3: POD mode statistics per dataset. Mean, median, and maximum number of modes required to retain $95\%$ and $99\%$ of patchwise variance, computed across all patches within each dataset.

Figure 6: Patchwise cumulative retained energy as a function of mode index $K$ for each dataset. Shaded bands indicate variability across patches. The dashed line marks the $\eta=0.99$ energy threshold. SST reaches $K_{99}=2$, Chest X-ray $K_{99}=11$, and FFHQ $K_{99}=26$, reflecting differing degrees of local low-rank structure across domains.
### C.3 Validation of the POD Reconstruction Bound

We validate the reconstruction error bound in Proposition [1](https://arxiv.org/html/2606.31290#Thmproposition1) by comparing the empirical POD reconstruction error with the theoretical bound for each dataset. Table [4](https://arxiv.org/html/2606.31290#A3.T4) shows that the mean empirical POD reconstruction error stays below the expected bound for all datasets, with ratios of 0.94–0.98; sample-wise satisfaction is reported as an empirical diagnostic, not as a deterministic guarantee. This supports the retained-energy criterion in Eq. [1](https://arxiv.org/html/2606.31290#S2.E1) as a reliable control on patchwise reconstruction error.

Table 4: Empirical validation of Proposition [1](https://arxiv.org/html/2606.31290#Thmproposition1). The mean reconstruction error remains below the expected theoretical bound across all datasets.
## Appendix D Extended Reconstruction Results

Table [5](https://arxiv.org/html/2606.31290#A4.T5) extends Table [1](https://arxiv.org/html/2606.31290#S4.T1) by reporting mean $\pm$ standard deviation across three seed runs for RMSE, PSNR, SSIM, and LPIPS across all models and datasets. FID is reported once per dataset, since it is a distribution-level metric rather than a per-sample statistic.

Table 5: Reconstruction quality across three datasets. Lower values indicate better performance for RMSE, LPIPS, and FID, while higher values are better for PSNR and SSIM. Results are reported as mean $\pm$ standard deviation across three independent training runs, where each run first averages the metric over the test set. FID is computed at the dataset level and reported once.
## Appendix E Qualitative Comparisons

Figure 7: Qualitative comparison on SST. Columns show upscaled LR input, U-Net, PixelDiff, VAE-LDM, Patch-PODiff-ViT, and ground truth (GT). Top row shows full-field reconstructions, middle row shows a zoomed region of interest, and bottom row shows absolute reconstruction error. Patch-PODiff-ViT consistently preserves both large-scale structure and fine-scale variability, particularly in regions with strong spatial gradients.

Figures [7](https://arxiv.org/html/2606.31290#A5.F7), [8](https://arxiv.org/html/2606.31290#A5.F8), and [9](https://arxiv.org/html/2606.31290#A5.F9) present extended qualitative super-resolution results for SST, Chest X-ray, and FFHQ respectively, complementing the three-domain overview in Figure [2](https://arxiv.org/html/2606.31290#S4.F2). The SST figure shows one representative scene; the X-ray and FFHQ figures each show two scenes. All include full reconstructions, zoomed regions of interest, and corresponding error maps, consistent with the quantitative trends reported in Section [4.1](https://arxiv.org/html/2606.31290#S4.SS1).
On SST \(Figure[7](https://arxiv.org/html/2606.31290#A5.F7)\), the low\-resolution input and U\-Net baseline fail to recover fine\-scale thermal fronts, particularly in the zoomed coastal region\. PixelDiff and VAE\-LDM partially recover large\-scale structure but exhibit smoothing near sharp meridional gradients\. Patch\-PODiff\-ViT produces the lowest error, preserving both the large\-scale warm water mass and fine\-scale frontal structure in the zoomed region\.
On Chest X\-ray \(Figure[8](https://arxiv.org/html/2606.31290#A5.F8)\), the low\-resolution input and U\-Net baseline exhibit higher reconstruction errors than diffusion\-based methods, particularly along rib edges and lung boundaries\. PixelDiff and VAE\-LDM reduce these errors but introduce noticeable smoothing of anatomical structures\. In contrast, Patch\-PODiff\-ViT produces sharper boundaries and more accurate structural detail, with consistently lower error in both global views and zoomed regions\.
On FFHQ \(Figure[9](https://arxiv.org/html/2606.31290#A5.F9)\), baseline methods tend to smooth or distort high\-frequency textures, especially in regions such as hair and eyes\. Patch\-PODiff\-ViT better preserves fine\-scale detail, resulting in more faithful reconstructions and reduced error in high\-detail regions, as seen in the zoomed panels\.
Figure 8: Qualitative comparison on Chest X-ray. Each scene shows (top) full reconstruction, (middle) zoomed region, and (bottom) absolute error. Columns correspond to LR input, U-Net, PixelDiff, VAE-LDM, Patch-PODiff-ViT, and ground truth. Patch-PODiff-ViT recovers sharper anatomical structures with lower reconstruction error, particularly along rib edges and lung boundaries.

Figure 9: Qualitative comparison on FFHQ. Each scene shows (top) full reconstruction, (middle) zoomed region, and (bottom) absolute error. Columns correspond to LR input, U-Net, PixelDiff, VAE-LDM, Patch-PODiff-ViT, and ground truth. Patch-PODiff-ViT preserves fine-scale texture more effectively, with lower error in high-detail regions such as hair and facial features (Scene 1). Gains are more modest in regions with sharp synthetic boundaries such as the cartoon detail in Scene 2.
## Appendix F Coverage and Calibration Details

### F.1 CRPS and MACE Summary

Table [6](https://arxiv.org/html/2606.31290#A6.T6) reports CRPS and MACE for all models using $M=100$ ensemble samples, with SST averaged over all 2011 test days. Patch-PODiff-ViT achieves the lowest CRPS on SST and Chest X-ray, indicating sharper predictive distributions in these domains. On FFHQ, VAE-LDM attains a lower CRPS ($0.00892$) than Patch-PODiff-ViT ($0.0116$), but this comes with a higher MACE ($0.0121$ vs. $0.0084$), indicating improved sharpness at the cost of reduced calibration. Across all datasets, Patch-PODiff-ViT consistently achieves the lowest MACE, demonstrating better calibration of predictive uncertainty.

Table 6: CRPS (mean $\pm$ standard deviation) and MACE across datasets. Lower values indicate better performance. CRPS is computed over $M=100$ ensemble samples, while MACE is evaluated against nominal coverage levels.
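For readers who want to reproduce a CRPS figure from an ensemble, a common sample-based estimator is $\mathrm{CRPS} \approx \mathbb{E}|X - y| - \tfrac{1}{2}\,\mathbb{E}|X - X'|$, with both expectations taken over ensemble members. The sketch below implements that estimator pixelwise. Whether the reported numbers use exactly this estimator is an assumption, and the function name `crps_ensemble` and the toy field are hypothetical.

```python
import numpy as np

def crps_ensemble(samples, y):
    """Pixelwise CRPS for an ensemble `samples` of shape (M, ...) and a truth `y` of shape (...)."""
    M = samples.shape[0]
    abs_err = np.mean(np.abs(samples - y), axis=0)             # E|X - y|
    ordered = np.sort(samples, axis=0)
    # Sorted-sample identity: sum_{i,j} |x_i - x_j| = 2 * sum_i (2i - M - 1) * x_(i), i = 1..M
    w = (2 * np.arange(1, M + 1) - M - 1).reshape((M,) + (1,) * (samples.ndim - 1))
    spread = 2.0 * np.sum(w * ordered, axis=0) / M**2          # E|X - X'| without an M x M array
    return abs_err - 0.5 * spread

rng = np.random.default_rng(0)
truth = rng.normal(size=(8, 8))                                # hypothetical 8 x 8 target field
ensemble = truth + 0.3 * rng.normal(size=(100, 8, 8))          # hypothetical M = 100 members
crps_map = crps_ensemble(ensemble, truth)                      # one CRPS value per pixel
```

Sorting avoids materialising all $M^2$ member pairs per pixel, which matters once the field is image-sized. Averaging `crps_map` over pixels and test samples gives a single summary score.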
### F.2 Full Coverage Tables

Tables [7](https://arxiv.org/html/2606.31290#A6.T7)–[9](https://arxiv.org/html/2606.31290#A6.T9) report empirical coverage at nominal levels $\{50, 60, 70, 80, 90, 95, 99\}\%$ for all models across SST, Chest X-ray, and FFHQ. For each level, we report the empirical coverage along with the deviation $\Delta = \mathrm{empirical} - \mathrm{nominal}$, where positive values indicate over-coverage and negative values indicate under-coverage. Across all datasets, Patch-PODiff-ViT exhibits the smallest deviations from nominal coverage, indicating consistently well-calibrated uncertainty estimates. In contrast, DiT and VAE-LDM tend to show moderate over- or under-coverage depending on the dataset, while PixelDiff exhibits larger deviations, particularly at higher confidence levels.

Table 7: Empirical coverage at each nominal level for SST. $\Delta = \mathrm{empirical} - \mathrm{nominal}$; positive values indicate over-coverage.

Table 8: Empirical coverage at each nominal level for Chest X-ray.

Table 9: Empirical coverage at each nominal level for FFHQ.
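The coverage deviation $\Delta$ and an aggregate calibration error can be computed from an ensemble along the following lines. This is a sketch under two assumptions that this appendix does not confirm: predictive intervals are taken as central intervals between ensemble quantiles, and MACE is the mean of $|\Delta|$ over the nominal levels. The function name `empirical_coverage` and the synthetic data are hypothetical.

```python
import numpy as np

def empirical_coverage(samples, y, levels):
    """Fraction of pixels whose truth falls inside the central interval at each nominal level."""
    coverage = []
    for p in levels:
        lo = np.quantile(samples, 0.5 - p / 2, axis=0)   # lower edge of the central p-interval
        hi = np.quantile(samples, 0.5 + p / 2, axis=0)   # upper edge
        coverage.append(np.mean((y >= lo) & (y <= hi)))
    return np.array(coverage)

levels = np.array([0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99])
rng = np.random.default_rng(0)
n_pix, M = 20000, 100
mu = rng.normal(size=n_pix)                      # hypothetical per-pixel predictive mean
samples = mu + rng.normal(size=(M, n_pix))       # ensemble members
y = mu + rng.normal(size=n_pix)                  # truth drawn from the same law as the members

coverage = empirical_coverage(samples, y, levels)
delta = coverage - levels                        # positive: over-coverage, negative: under-coverage
mace = np.mean(np.abs(delta))                    # assumed definition of MACE
```

Because the synthetic truth is drawn from the same distribution as the ensemble members, this toy forecast is calibrated by construction, so `delta` stays small at every level.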
## Appendix G Uncertainty Map Comparisons

Figure 10: Analytic versus empirical uncertainty on Chest X-ray. Left: analytic standard deviation from Proposition [2](https://arxiv.org/html/2606.31290#Thmproposition2). Middle: empirical standard deviation computed from $M=100$ ensemble samples. Right: difference map with Pearson correlation $r=0.986$. High uncertainty is concentrated along lung boundaries and rib edges.

Figure 11: Analytic versus empirical uncertainty on FFHQ. Left: analytic standard deviation from Proposition [2](https://arxiv.org/html/2606.31290#Thmproposition2). Middle: empirical standard deviation computed from $M=100$ ensemble samples. Right: difference map with Pearson correlation $r=0.953$. The faint grid pattern reflects patch-boundary effects from independent patchwise covariance propagation and deterministic stitching.

Figures [10](https://arxiv.org/html/2606.31290#A7.F10) and [11](https://arxiv.org/html/2606.31290#A7.F11) extend the analytic versus empirical uncertainty comparison from SST (Figure [3](https://arxiv.org/html/2606.31290#S4.F3)) to Chest X-ray and FFHQ, further supporting the calibration results in Section [4.3](https://arxiv.org/html/2606.31290#S4.SS3).

For Chest X-ray, analytic and empirical maps agree strongly ($r=0.986$), with high uncertainty concentrated along lung boundaries and rib edges, where structural variability and reconstruction difficulty are greatest. For FFHQ, agreement remains high ($r=0.9528$), with uncertainty concentrated around fine-scale features such as eyes, mouth, and hair. The faint grid pattern in the FFHQ difference map reflects patch-boundary effects from independent patchwise covariance propagation and deterministic stitching. While cross-patch covariance affects off-diagonal entries of the global covariance, pointwise variances are determined by the propagated within-patch covariance and any aggregation or post-processing applied after decoding. These effects are most visible for FFHQ, where texture varies strongly across neighboring patches.
Figure 12: Cross\-patch covariance energy\. SST and Chest X\-ray show very low off\-diagonal energy, while FFHQ shows a moderate but still limited ratio, consistent with stronger texture\-driven cross\-patch dependence\.
## Appendix H Cross\-Patch Covariance Analysis
Proposition [2](https://arxiv.org/html/2606.31290#Thmproposition2) uses a block\-diagonal covariance approximation that neglects cross\-patch covariance\. To quantify this, we measure the fraction of latent covariance energy in off\-diagonal cross\-patch blocks\. Figure [12](https://arxiv.org/html/2606.31290#A7.F12) shows low off\-diagonal energy for SST and Chest X\-ray, with mean ratios of $1.01\%$ and $3.90\%$, respectively, and a moderate but still limited ratio for FFHQ \($11.72\%$\)\. This supports the block\-diagonal approximation for structured physical and medical fields, while explaining the mild FFHQ patch\-boundary discrepancies observed in uncertainty maps\.
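One way to compute such a ratio is sketched below. The exact energy definition is not given in this appendix, so the sketch assumes energy means the squared Frobenius norm of the latent sample covariance, and that latents are stored as `(samples, patches, modes)`. The function name and toy data are illustrative only.

```python
import numpy as np

def cross_patch_energy_ratio(Z):
    """Fraction of covariance energy outside the within-patch diagonal blocks.

    Z has shape (N, P, K): N samples, P patches, K latent coefficients each.
    Energy is taken to be the squared Frobenius norm (an assumption).
    """
    N, P, K = Z.shape
    C = np.cov(Z.reshape(N, P * K), rowvar=False)    # (P*K, P*K)
    blocks = C.reshape(P, K, P, K)                   # blocks[i, :, j, :] = Cov(patch i, patch j)
    total = np.sum(C ** 2)
    within = sum(np.sum(blocks[i, :, i, :] ** 2) for i in range(P))
    return float((total - within) / total)

rng = np.random.default_rng(0)

# Independent patches: the ratio should be close to zero.
Z_indep = rng.standard_normal((5000, 4, 3))
print(cross_patch_energy_ratio(Z_indep))

# Identical patches: every block is equal, so the ratio is 1 - 1/P.
base = rng.standard_normal((5000, 1, 3))
Z_shared = np.repeat(base, 4, axis=1)
print(cross_patch_energy_ratio(Z_shared))
```

A small ratio means the block-diagonal approximation discards little of the covariance energy, which matches how the reported percentages are interpreted above.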
## Appendix I Ablation Studies
### I\.1 Energy Threshold
Table [10](https://arxiv.org/html/2606.31290#A9.T10) evaluates $\eta\in\{0.99,0.95,0.90\}$\. Lower $\eta$ reduces retained modes and latent dimensionality, but degrades reconstruction\. Pixel metrics change moderately from $\eta=0.99$ to $0.95$, whereas FID worsens sharply at $\eta=0.90$ \(e\.g\., $3.986\rightarrow 12.14$ on SST and $9.17\rightarrow 27.92$ on FFHQ\), supporting $\eta=0.99$ as the default\.
Table 10: Effect of energy threshold $\eta$ on reconstruction quality\. Lower is better for RMSE, LPIPS, FID; higher for PSNR, SSIM\.
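The link between the threshold and the number of retained modes can be sketched as follows. It assumes the usual POD convention that energy is the sum of squared singular values and that the threshold picks the smallest number of leading modes whose cumulative energy fraction reaches it. The helper name and the synthetic patches are hypothetical.

```python
import numpy as np

def modes_for_energy(singular_values, eta):
    """Smallest K whose leading modes capture at least a fraction eta of the energy."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    k = int(np.searchsorted(cumulative, eta)) + 1
    return min(k, energy.size)

# Toy snapshot matrix: 500 flattened 16x16 patches with a decaying spectrum.
rng = np.random.default_rng(0)
scales = 1.0 / np.arange(1, 257) ** 2          # hypothetical spectral decay
patches = rng.standard_normal((500, 256)) * scales
patches -= patches.mean(axis=0)

# POD via SVD: rows of Vt are the spatial modes, ordered by variance.
_, s, Vt = np.linalg.svd(patches, full_matrices=False)
for eta in (0.99, 0.95, 0.90):
    K = modes_for_energy(s, eta)
    basis = Vt[:K].T                           # (256, K) truncated orthonormal basis
    print(f"eta={eta:.2f} -> K={K}, basis shape {basis.shape}")
```

Because the modes are variance-ordered, lowering the threshold only ever drops trailing modes, so the retained count is non-increasing as the threshold decreases.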
### I\.2 Patch Size
Table [11](https://arxiv.org/html/2606.31290#A9.T11) compares patch sizes $8\times 8$, $16\times 16$, and $32\times 32$\. The $16\times 16$ setting consistently achieves the best RMSE and FID across datasets\. Smaller patches increase sequence length and reduce per\-patch compressibility, while larger patches require more POD modes to retain the same energy, offsetting dimensionality reduction\. Overall, $16\times 16$ provides the best trade\-off between locality, compression, and sequence length\.
Table 11: Effect of patch size on reconstruction quality\. Lower is better for RMSE and FID; higher for PSNR and SSIM\.
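The sequence-length side of this trade-off is simple arithmetic, sketched below for a non-overlapping patch grid. The 128 by 128 input size is an assumption chosen for illustration, not a value stated for the main datasets, and the helper name is hypothetical.

```python
def patch_grid(height, width, patch):
    """Return (number of tokens, pixels per patch) for a non-overlapping patch grid."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = (height // patch) * (width // patch)
    return tokens, patch * patch

# Assumed 128 x 128 input; self-attention cost grows with the square of the token count.
for patch in (8, 16, 32):
    tokens, pixels = patch_grid(128, 128, patch)
    print(f"patch {patch:>2}: {tokens:>3} tokens, {pixels:>4} pixels per patch, "
          f"attention pairs {tokens ** 2}")
```

Halving the patch side quadruples the token count, while doubling it quadruples the pixels each POD basis must represent, which is consistent with the intermediate setting giving the best balance.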
### I\.3 Denoiser Architecture: ViT vs\. Per\-Token MLP
Table [12](https://arxiv.org/html/2606.31290#A9.T12) compares the ViT denoiser with a per\-token MLP that processes patches independently\. Removing cross\-patch communication substantially degrades performance, increasing FID by $2.2$–$3.2\times$ and worsening RMSE across all datasets\. This confirms that self\-attention is important for modeling inter\-patch dependencies\.
Table 12: ViT denoiser versus per\-token MLP\. Lower is better for RMSE, LPIPS, FID; higher for PSNR, SSIM\.
### I\.4 Effect of Ensemble Size
Table 13: MACE and empirical coverage at selected nominal levels as a function of ensemble size $M$, averaged over all 2011 SST test days\.

Table [13](https://arxiv.org/html/2606.31290#A9.T13) reports MACE and selected empirical coverage levels for different ensemble sizes $M$, averaged over all 2011 SST test days\. Calibration improves from $M=25$ to $M=50$, but gains beyond $M=100$ are marginal, indicating diminishing returns\. We therefore use $M=100$ as the default, balancing ensemble inference cost and calibration accuracy\. Notably, even $M=50$ achieves MACE below 0\.01, suggesting that useful uncertainty estimates remain available under tighter compute budgets\.
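A minimal sketch of how coverage and MACE can be computed from an ensemble follows. The paper's exact definitions are not given in this appendix, so two assumptions are made: prediction intervals are central intervals taken from ensemble quantiles, and MACE is the mean absolute gap between empirical and nominal coverage over the chosen levels. Function names and the synthetic data are illustrative.

```python
import numpy as np

def empirical_coverage(samples, truth, level):
    """Fraction of pixels whose truth lies inside the central `level` ensemble interval.

    samples: (M, ...) ensemble members; truth: (...) reference field.
    """
    lo = np.quantile(samples, (1.0 - level) / 2.0, axis=0)
    hi = np.quantile(samples, (1.0 + level) / 2.0, axis=0)
    return float(np.mean((truth >= lo) & (truth <= hi)))

def mace(samples, truth, levels):
    """Mean absolute calibration error over nominal levels (assumed definition)."""
    cov = np.array([empirical_coverage(samples, truth, a) for a in levels])
    return float(np.mean(np.abs(cov - np.asarray(levels)))), cov

# Synthetic check: truth and ensemble drawn from the same distribution.
rng = np.random.default_rng(0)
levels = [0.5, 0.8, 0.9, 0.95]
truth = rng.standard_normal(20000)
for M in (25, 50, 100):
    ensemble = rng.standard_normal((M, truth.size))
    score, cov = mace(ensemble, truth, levels)
    print(f"M={M:>3}: MACE={score:.4f}, coverage={np.round(cov, 3)}")
```

Under these assumptions, quantile intervals from a small ensemble are slightly too narrow, so the calibration error shrinks as `M` grows, which mirrors the diminishing-returns pattern described above without reproducing the reported values.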
### I\.5 Effect of Compression Level
To disentangle POD compression from POD structure, we evaluate a variant retaining all $K=256$ modes on SST\. Despite identical ViT architecture and linear POD structure, this uncompressed variant yields RMSE 0\.0046 and FID 5\.18, worse than Patch\-PODiff\-ViT \(RMSE 0\.0030, FID 3\.986\) and comparable to DiT \(RMSE 0\.0040, FID 5\.01\)\. This indicates that variance\-ordered truncation to the data’s intrinsic dimensionality, not merely the linearity of the decoder, is the primary driver of reconstruction quality\.
## Appendix J Advection\-Dominated Regime Analysis
Table 14: Median $K_{99}$ and reconstruction error across datasets\.

We test Patch\-PODiff\-ViT in a strongly advection\-dominated setting to validate the limitation in Section [2\.8](https://arxiv.org/html/2606.31290#S2.SS8)\. Synthetic fields are generated from the 2D convection–diffusion equation with $\nu\in[10^{-5},10^{-4}]$ and velocities in $[-1,1]^{2}$, yielding $\mathrm{Pe}=O(10^{6})$, on a $128\times 128$ grid with $4\times$ downsampling\. Table [14](https://arxiv.org/html/2606.31290#A10.T14) shows that this regime requires $K_{99}=150$ modes, far more than the main datasets, indicating much slower spectral decay\. Patch\-PODiff\-ViT still achieves $\mathrm{RMSE}=0.0291$, improving over U\-Net by $6.8\times$, but the larger latent size reduces the compression advantage\. Thus, the method is most efficient when local patch spectra decay rapidly\. Highly advective regimes may require adaptive bases, overlapping aggregation, or hybrid local–global covariance modeling\.
## NeurIPS Paper Checklist
1. 1\.Claims
2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
3. Answer:\[Yes\]
4. Justification: The abstract and introduction summarize the proposed Patch\-PODiff\-ViT framework, its structured POD latent representation, uncertainty propagation, computational efficiency, and empirical evaluation\. The claims are supported by the theoretical results, experiments, ablations, and limitations discussed in the paper\.
5. Guidelines: - •The answer\[N/A\]means that the abstract and introduction do not include the claims made in the paper\. - •The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations\. A\[No\]or\[N/A\]answer to this question will not be perceived well by the reviewers\. - •The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings\. - •It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper\.
6. 2\.Limitations
7. Question: Does the paper discuss the limitations of the work performed by the authors?
8. Answer:\[Yes\]
9. Justification: Section[2\.8](https://arxiv.org/html/2606.31290#S2.SS8)discusses limitations related to slow spectral decay, fixed POD bases, cross\-patch covariance approximations, and discarded\-mode uncertainty\. Appendix[J](https://arxiv.org/html/2606.31290#A10)further evaluates an advection\-dominated regime where the compression advantage is reduced\.
10. Guidelines: - •The answer\[N/A\]means that the paper has no limitation while the answer\[No\]means that the paper has limitations, but those are not discussed in the paper\. - •The authors are encouraged to create a separate “Limitations” section in their paper\. - •The paper should point out any strong assumptions and how robust the results are to violations of these assumptions \(e\.g\., independence assumptions, noiseless settings, model well\-specification, asymptotic approximations only holding locally\)\. The authors should reflect on how these assumptions might be violated in practice and what the implications would be\. - •The authors should reflect on the scope of the claims made, e\.g\., if the approach was only tested on a few datasets or with a few runs\. In general, empirical results often depend on implicit assumptions, which should be articulated\. - •The authors should reflect on the factors that influence the performance of the approach\. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting\. Or a speech\-to\-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon\. - •The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size\. - •If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness\. - •While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper\. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community\. Reviewers will be specifically instructed to not penalize honesty concerning limitations\.
11. 3\.Theory assumptions and proofs
12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
13. Answer:\[Yes\]
14. Justification: The assumptions and statements for Proposition[1](https://arxiv.org/html/2606.31290#Thmproposition1)and Proposition[2](https://arxiv.org/html/2606.31290#Thmproposition2)are provided in the main text, with complete proofs given in Appendices[A](https://arxiv.org/html/2606.31290#A1)and[B](https://arxiv.org/html/2606.31290#A2)\.
15. Guidelines: - •The answer\[N/A\]means that the paper does not include theoretical results\. - •All the theorems, formulas, and proofs in the paper should be numbered and cross\-referenced\. - •All assumptions should be clearly stated or referenced in the statement of any theorems\. - •The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition\. - •Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material\. - •Theorems and Lemmas that the proof relies upon should be properly referenced\.
16. 4\.Experimental result reproducibility
17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)?
18. Answer:\[Yes\]
19. Justification: The paper specifies the datasets, train/validation/test splits, preprocessing, POD basis construction, model architectures, training hyperparameters, sampling settings, and evaluation metrics\. The processed SST benchmark fields will be released with documentation and access terms consistent with the original data providers, while X\-ray and FFHQ use publicly available datasets\.
20. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •If the paper includes experiments, a\[No\]answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not\. - •If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable\. - •Depending on the contribution, reproducibility can be accomplished in various ways\. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model\. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model \(e\.g\., in the case of a large language model\), releasing of a model checkpoint, or other means that are appropriate to the research performed\. - •While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution\. For example: \(a\) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm\. \(b\) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully\. \(c\) If the contribution is a new model \(e\.g\., a large language model\), then there should either be a way to access this model for reproducing the results or a way to reproduce the model \(e\.g\., with an open\-source dataset or instructions for how to construct the dataset\)\. \(d\) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility\. In the case of closed\-source models, it may be that access to the model is limited in some way \(e\.g\., to registered users\), but it should be possible for other researchers to have some path to reproducing or verifying the results\.
21. 5\.Open access to data and code
22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
23. Answer:\[No\]
24. Justification: ChestX\-ray14 and FFHQ are publicly available\. Upon acceptance, we will release the complete Patch\-PODiff\-ViT codebase, preprocessing pipelines, trained model configurations, evaluation scripts, and the processed SST benchmark fields needed to reproduce the main results\. These will be hosted in a public repository or university\-supported public link with documentation and reproduction instructions\.
25. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments requiring code\. - •While we encourage the release of code and data, we understand that this might not be possible, so\[No\]is an acceptable answer\. Papers cannot be rejected simply for not including code, unless this is central to the contribution \(e\.g\., for a new open\-source benchmark\)\. - •The instructions should contain the exact command and environment needed to run to reproduce the results\. See the NeurIPS code and data submission guidelines \([https://neurips\.cc/public/guides/CodeSubmissionPolicy](https://neurips.cc/public/guides/CodeSubmissionPolicy)\) for more details\. - •The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc\. - •The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines\. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why\. - •At submission time, to preserve anonymity, the authors should release anonymized versions \(if applicable\)\. - •Providing as much information as possible in supplemental material \(appended to the paper\) is recommended, but including URLs to data and code is permitted\.
26. 6\.Experimental setting/details
27. Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
28. Answer:\[Yes\]
29. Justification: Sections[3\.1](https://arxiv.org/html/2606.31290#S3.SS1)–[3\.3](https://arxiv.org/html/2606.31290#S3.SS3)describe the datasets, preprocessing, baselines, optimizer, learning rate, diffusion schedule, sampling steps, ensemble size, and evaluation metrics\. Additional implementation details and ablations are provided in the appendices\.
30. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them\. - •The full details can be provided either with the code, in appendix, or as supplemental material\.
31. 7\.Experiment statistical significance
32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
33. Answer:\[Yes\]
34. Justification: Reconstruction metrics are reported as mean and standard deviation across three seed runs in Appendix[D](https://arxiv.org/html/2606.31290#A4)\. CRPS is also reported with standard deviation, and empirical coverage is reported across multiple nominal levels\.
35. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The authors should answer\[Yes\]if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper\. - •The factors of variability that the error bars are capturing should be clearly stated \(for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions\)\. - •The method for calculating the error bars should be explained \(closed form formula, call to a library function, bootstrap, etc\.\) - •The assumptions made should be given \(e\.g\., Normally distributed errors\)\. - •It should be clear whether the error bar is the standard deviation or the standard error of the mean\. - •It is OK to report 1\-sigma error bars, but one should state it\. The authors should preferably report a 2\-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified\. - •For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range \(e\.g\., negative error rates\)\. - •If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text\.
36. 8\.Experiments compute resources
37. Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
38. Answer:\[Yes\]
39. Justification: The paper reports hardware type, training time, peak GPU memory, parameter count, per\-sample inference time, and ensemble inference time for the main diffusion models in Section[4\.5](https://arxiv.org/html/2606.31290#S4.SS5)\. These values provide the compute requirements needed to reproduce the main experiments\.
40. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage\. - •The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute\. - •The paper should disclose whether the full research project required more compute than the experiments reported in the paper \(e\.g\., preliminary or failed experiments that didn’t make it into the paper\)\.
41. 9\.Code of ethics
42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
43. Answer:\[Yes\]
44. Justification: The research uses established public or licensed datasets, does not involve new human\-subject data collection, and does not deploy the proposed model in a real\-world decision\-making system\. We have reviewed the NeurIPS Code of Ethics and believe the work conforms to it\.
45. Guidelines: - •The answer\[N/A\]means that the authors have not reviewed the NeurIPS Code of Ethics\. - •If the authors answer\[No\], they should explain the special circumstances that require a deviation from the Code of Ethics\. - •The authors should make sure to preserve anonymity \(e\.g\., if there is a special consideration due to laws or regulations in their jurisdiction\)\.
46. 10\.Broader impacts
47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
48. Answer:\[Yes\]
49. Justification: The work may support uncertainty\-aware scientific and medical super\-resolution by providing calibrated spatial confidence maps\. However, generated high\-resolution outputs should not be treated as direct observations in high\-stakes settings and should require domain validation, uncertainty reporting, and expert oversight\.
50. Guidelines: - •The answer\[N/A\]means that there is no societal impact of the work performed\. - •If the authors answer\[N/A\]or\[No\], they should explain why their work has no societal impact or why the paper does not address societal impact\. - •Examples of negative societal impacts include potential malicious or unintended uses \(e\.g\., disinformation, generating fake profiles, surveillance\), fairness considerations \(e\.g\., deployment of technologies that could make decisions that unfairly impact specific groups\), privacy considerations, and security considerations\. - •The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments\. However, if there is a direct path to any negative applications, the authors should point it out\. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation\. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster\. - •The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from \(intentional or unintentional\) misuse of the technology\. - •If there are negative societal impacts, the authors could also discuss possible mitigation strategies \(e\.g\., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML\)\.
51. 11\.Safeguards
52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
53. Answer:\[N/A\]
54. Justification: The paper does not release a high\-risk foundation model, scraped dataset, or general\-purpose image generator\. FFHQ is used only as a standard benchmark for reconstruction evaluation, and the proposed method is not presented for identity recognition, surveillance, or unrestricted image generation\.
55. Guidelines: - •The answer\[N/A\]means that the paper poses no such risks\. - •Released models that have a high risk for misuse or dual\-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters\. - •Datasets that have been scraped from the Internet could pose safety risks\. The authors should describe how they avoided releasing unsafe images\. - •We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort\.
56. 12\.Licenses for existing assets
57. Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
58. Answer:\[Yes\]
59. Justification: Existing datasets and methods are cited in the paper\. ChestX\-ray14 is used under the NIH Clinical Center terms with required attribution; FFHQ is used under its Creative Commons BY\-NC\-SA 4\.0 license; ROMS is distributed under an MIT/X\-style license; and ACCESS\-S2 data are used under Bureau of Meteorology data\-access terms\. The processed SST benchmark fields will be released with appropriate attribution, documentation, and access terms consistent with the ROMS and ACCESS\-S2 data providers\.
60. Guidelines: - •The answer\[N/A\]means that the paper does not use existing assets\. - •The authors should cite the original paper that produced the code package or dataset\. - •The authors should state which version of the asset is used and, if possible, include a URL\. - •The name of the license \(e\.g\., CC\-BY 4\.0\) should be included for each asset\. - •For scraped data from a particular source \(e\.g\., website\), the copyright and terms of service of that source should be provided\. - •If assets are released, the license, copyright information, and terms of use in the package should be provided\. For popular datasets,[paperswithcode\.com/datasets](https://arxiv.org/html/2606.31290v1/paperswithcode.com/datasets)has curated licenses for some datasets\. Their licensing guide can help determine the license of a dataset\. - •For existing datasets that are re\-packaged, both the original license and the license of the derived asset \(if it has changed\) should be provided\. - •If this information is not available online, the authors are encouraged to reach out to the asset’s creators\.
61. 13\.New assets
62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
63. Answer:\[N/A\]
64. Justification: The paper does not introduce a new dataset or benchmark asset\. The codebase will be made available upon acceptance, with full documentation for reproducing the Patch\-PODiff\-ViT pipeline across all three datasets\. Access to the SST benchmark fields will follow the pathway described in Item 5\.
65. Guidelines: - •The answer\[N/A\]means that the paper does not release new assets\. - •Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates\. This includes details about training, license, limitations, etc\. - •The paper should discuss whether and how consent was obtained from people whose asset is used\. - •At submission time, remember to anonymize your assets \(if applicable\)\. You can either create an anonymized URL or include an anonymized zip file\.
66. 14\.Crowdsourcing and research with human subjects
67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
68. Answer:\[N/A\]
69. Justification: The paper does not involve crowdsourcing, user studies, or new human\-subject data collection\.
70. Guidelines: - •The answer\[N/A\]means that the paper does not involve crowdsourcing nor research with human subjects\. - •Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper\. - •According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector\.
71. 15\.Institutional review board \(IRB\) approvals or equivalent for research with human subjects
72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
73. Answer:\[N/A\]
74. Justification: The paper does not involve new human\-subject research, crowdsourcing, or interaction with study participants\. ChestX\-ray14 and FFHQ are pre\-existing datasets used as benchmarks\.
75. Guidelines: - •The answer\[N/A\]means that the paper does not involve crowdsourcing nor research with human subjects\. - •Depending on the country in which research is conducted, IRB approval \(or equivalent\) may be required for any human subjects research\. If you obtained IRB approval, you should clearly state this in the paper\. - •We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution\. - •For initial submissions, do not include any information that would break anonymity \(if applicable\), such as the institution conducting the review\.
76. 16\.Declaration of LLM usage
77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does*not*impact the core methodology, scientific rigor, or originality of the research, declaration is not required\.
78. Answer:\[N/A\]
79. Justification: The core method does not use LLMs as a model component, experimental tool, or source of scientific results\.
80. Guidelines: - •The answer\[N/A\]means that the core method development in this research does not involve LLMs as any important, original, or non\-standard components\. - •Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described\.