Sequential Physics-Constrained Neural Operator Forward Modeling for the $\textit{Norne}$ Reservoir System

arXiv cs.LG Papers

Summary

This paper presents a comprehensive mathematical framework for sequential surrogate modeling of three-phase black-oil reservoir dynamics using Fourier Neural Operators (FNO) and physics-informed variants (PINO), applied to the Norne benchmark reservoir. Theoretical contributions include functional-analytic formulation, covariate shift analysis, physics-constrained spectral stability, and truncated backpropagation gradient analysis.

arXiv:2605.28909v1 Announce Type: new Abstract: We develop a comprehensive mathematical and computational framework for sequential surrogate modeling of three-phase black-oil reservoir dynamics using neural operators, with particular emphasis on Fourier Neural Operators (FNO) and their physics-informed variant (PINO). The application focus is the Norne benchmark reservoir, defined on a heterogeneous $46\times112\times22$ grid ($N=113,344$ cells), with a production history spanning $T=30$ timesteps covering 3298 days. Our theoretical contributions are organized around four interlocking problems: (1) functional-analytic formulation in a product-Sobolev-space setting, including well-posedness of the implicit timestep map and sharp local Lipschitz estimates; (2) covariate shift quantification, proving that the Wasserstein-2 distance grows as $W_2 \leq \varepsilon(L^n-1)/(L-1)$, with exponential population-risk discrepancy for $L>1$; (3) physics-constrained spectral stability, showing PINO training with $\lambda_R \geq \lambda^*_R$ reduces the learned Jacobian spectral radius to $\rho_F + C\lambda_R^{-1/2}$, yielding uniform-in-time rollout error $|\delta_n| \leq \varepsilon/(1-\rho)$; and (4) $K$-step TBPTT gradient analysis, deriving geometric bias decay $O(\rho^K)$, optimal window $K^ = O(\log(T/\sigma^2))$, and Adam convergence $O(1/\sqrt{t}) + O(\rho^{K^*})$. Empirical validation confirms all theoretical predictions: autoregressive PINO surrogates sustain $R^2>0.99$ (oil), $R^2>0.90$ (gas), $R^2\approx 0.80$ (pressure), and monotonically improving $R^2$ (water) across the full 3298-day horizon, trained on eight NVIDIA B200 GPUs in under one hour. A 1000-member ensemble runs in under one minute on a single B200 GPU, giving a ${\sim}10^4\times$ wall-clock speedup over the OPM finite-volume simulator.
Original Article
View Cached Full Text

Cached at: 05/29/26, 09:13 AM

# Sequential Physics-Constrained Neural Operator Forward Modeling for the Norne Reservoir System
Source: [https://arxiv.org/html/2605.28909](https://arxiv.org/html/2605.28909)
Clement Etienam1,∗&Juntao Yang1&Oleg Ovcharenko1&Nick Luiken1&Tsubasa Onishi1&Nefeli Moridis1&Issam Said11NVIDIA Corporation\. ∗Corresponding author:cetienam@nvidia\.com Keywords:Physics\-Informed Neural Operators; Fourier Neural Operators; Sequential Auto\-regressive Modeling; Reservoir Simulation; Supervised learning\.

###### Abstract

We develop a comprehensive mathematical and computational framework for sequential surrogate modeling of three\-phase black\-oil reservoir dynamics using neural operators, with particular emphasis on Fourier Neural Operators \(FNO\) and their physics\-informed variant \(PINO\)\. The application focus is the Norne benchmark reservoir, defined on a heterogeneous46×112×2246\\times 112\\times 22grid \(N=113,344N=113\{,\}344cells\), with a production history spanningT=30T=30timesteps covering 3298 days\.

Our theoretical contributions are organized around four interlocking problems\.

\(1\) Functional\-analytic formulation\.We embed the discrete\-time black\-oil system in a rigorous product\-Sobolev\-space setting\. The finite\-volume implicit residual operatorR:ℝ4​N×ℝ4​N→ℝ4​NR:\\mathbb\{R\}^\{4N\}\\times\\mathbb\{R\}^\{4N\}\\to\\mathbb\{R\}^\{4N\}is the canonical mathematical object bridging simulator physics to neural\-operator training; its zero setℳ\\mathcal\{M\}defines the*physics\-consistent manifold*on which PINO training concentrates predictions\. We prove well\-posedness of the implicit timestep mapℱ\\mathcal\{F\}under a discrete coercivity condition \(Theorem[2\.6](https://arxiv.org/html/2605.28909#S2.Thmtheorem6)\), provide sharp local Lipschitz estimates \(Lemma[2\.7](https://arxiv.org/html/2605.28909#S2.Thmtheorem7)\), and characterize the elliptic–hyperbolic structural dichotomy that governs differential stability across state variables\.

\(2\) Covariate shift and distributional divergence\.We prove that the Wasserstein\-2 distance between the true\-state distributionμn\\mu\_\{n\}and the predicted\-state distributionμ^nθ\\hat\{\\mu\}\_\{n\}^\{\\theta\}grows asW2​\(μ^nθ,μn\)≤ε​\(Ln−1\)/\(L−1\)W\_\{2\}\(\\hat\{\\mu\}\_\{n\}^\{\\theta\},\\mu\_\{n\}\)\\leq\\varepsilon\(L^\{n\}\-1\)/\(L\-1\), whereLLis the operator Lipschitz constant andε\\varepsilonthe one\-step error \(Theorem[4\.1](https://arxiv.org/html/2605.28909#S4.Thmtheorem1)\)\. We further bound the resulting population\-risk discrepancy between one\-step and autoregressive training paradigms \(Corollary[4\.2](https://arxiv.org/html/2605.28909#S4.Thmtheorem2)\) and show this gap drives exponentialR2R^\{2\}degradation forL\>1L\>1hyperbolic variables\.

\(3\) Physics\-constrained spectral stability\.We prove that PINO training with physics weightλR≥λR∗\\lambda\_\{R\}\\geq\\lambda\_\{R\}^\{\*\}reduces the spectral radius of the learned operator’s Jacobian toϱ​\(Dx​𝒢θ\)≤ρℱ\+C​λR−1/2\\varrho\(D\_\{x\}\\mathcal\{G\}\_\{\\theta\}\)\\leq\\rho\_\{\\mathcal\{F\}\}\+C\\lambda\_\{R\}^\{\-1/2\}, whereρℱ<1\\rho\_\{\\mathcal\{F\}\}<1is the dissipativity rate of the true dynamics \(Theorem[6\.5](https://arxiv.org/html/2605.28909#S6.Thmtheorem5)\)\. A companion result \(Theorem[6\.6](https://arxiv.org/html/2605.28909#S6.Thmtheorem6)\) establishes that the PINO residual penalty acts as a spectral Jacobian regularizer: it penalizes deviations ofDx​𝒢θD\_\{x\}\\mathcal\{G\}\_\{\\theta\}from the true dynamics JacobianDx​ℱD\_\{x\}\\mathcal\{F\}in a weighted operator norm, with error𝒪​\(λR−1/2\)\\mathcal\{O\}\(\\lambda\_\{R\}^\{\-1/2\}\)\. Combining these, we prove a uniform\-in\-time rollout error bound‖δn‖ϕ≤ε/\(1−ρ\)\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\varepsilon/\(1\-\\rho\)for PINO\-trained autoregressive operators \(Theorem[6\.7](https://arxiv.org/html/2605.28909#S6.Thmtheorem7)\)\.

\(4\) TBPTT gradient analysis\.We formalizeKK\-step truncated backpropagation through time \(TBPTT\) as a biased stochastic gradient estimator for the autoregressive objective\. We prove the cross\-window gradient contributions decay geometrically as𝒪​\(ρK\)\\mathcal\{O\}\(\\rho^\{K\}\)\(Theorem[7\.3](https://arxiv.org/html/2605.28909#S7.Thmtheorem3)\), derive the optimal window sizeK∗=𝒪​\(log⁡\(T/σ2\)\)K^\{\*\}=\\mathcal\{O\}\(\\log\(T/\\sigma^\{2\}\)\)from a bias\-variance decomposition \(Corollary[7\.4](https://arxiv.org/html/2605.28909#S7.Thmtheorem4)\), and establish a non\-asymptotic convergence rate for TBPTT\-trained autoregressive operators under the Adam optimizer of the form𝒪​\(1/t\)\+𝒪​\(ρK∗\)\\mathcal\{O\}\(1/\\sqrt\{t\}\)\+\\mathcal\{O\}\(\\rho^\{K^\{\*\}\}\)\(Proposition[7\.5](https://arxiv.org/html/2605.28909#S7.Thmtheorem5)\)\. Crucially, PINO training and TBPTT form a self\-reinforcing cycle: physics constraints reduceρ\\rho, which shrinks the TBPTT bias for any fixedKK, allowing shorter windows, more gradient steps per epoch, and further physics improvement\.

Empirical validation on the full Norne production timeline confirms all theoretical predictions quantitatively\. Autoregressive PINO surrogates sustainR2\>0\.99R^\{2\}\>0\.99\(oil saturation\),R2\>0\.90R^\{2\}\>0\.90\(gas saturation\),R2≈0\.80R^\{2\}\\approx 0\.80\(pressure\), and monotonically improvingR2R^\{2\}\(water saturation\) across the full 3298\-day horizon, trained on eight NVIDIA B200 \(HGX B200 / Blackwell\) GPUs in under one hour\. In contrast, teacher\-forced one\-step models degrade toR2≈0\.96R^\{2\}\\approx 0\.96,0\.380\.38,0\.720\.72, and0\.750\.75respectively—with the gas saturation collapse att≈1250t\\approx 1250days accurately predicted by the rollout bound of Theorem[5\.1](https://arxiv.org/html/2605.28909#S5.Thmtheorem1)usingLG≈1\.15L^\{G\}\\approx 1\.15andε≈3×10−3\\varepsilon\\approx 3\\times 10^\{\-3\}\. A 1000\-member ensemble runs in under one minute on a single NVIDIA B200 GPU, providing a∼104×\{\\sim\}10^\{4\}\\timeswall\-clock speedup over the OPM finite\-volume simulator and enabling Bayesian inversion and uncertainty quantification workflows at industrial scale\.

## 1Introduction

### 1\.1Motivation

Multiphase flow simulation in heterogeneous porous media is the governing forward problem in petroleum reservoir management, geological carbon storage, and groundwater remediation\. Industrial\-grade simulators — Newton\-Raphson\-based implicit finite\-volume codes such as the Open Porous Media Flow simulator \(OPM\)\[[25](https://arxiv.org/html/2605.28909#bib.bib25)\]— solve large\-scale coupled systems of nonlinear algebraic equations at each timestep, with wall\-clock times of hours to days on CPU clusters\. In ensemble\-based workflows such as history matching, uncertainty quantification \(UQ\), and production optimization, this single forward solve must be repeated10310^\{3\}–10510^\{5\}times, making direct simulator use computationally prohibitive\[[10](https://arxiv.org/html/2605.28909#bib.bib10),[11](https://arxiv.org/html/2605.28909#bib.bib11)\]\.

Neural\-operator surrogates that learn to approximate the simulator response at a fraction of the cost are therefore of substantial practical importance\[[20](https://arxiv.org/html/2605.28909#bib.bib20),[16](https://arxiv.org/html/2605.28909#bib.bib16),[22](https://arxiv.org/html/2605.28909#bib.bib22),[13](https://arxiv.org/html/2605.28909#bib.bib13)\]\. Among available architectures, the Fourier Neural Operator \(FNO\)\[[20](https://arxiv.org/html/2605.28909#bib.bib20)\]has emerged as a leading approach: it is resolution\-invariant by construction \(parameters are discretization\-independent\), provides an efficient global receptive field via spectral convolution, and admits a rigorous universal approximation theory for maps between Banach spaces\[[16](https://arxiv.org/html/2605.28909#bib.bib16),[19](https://arxiv.org/html/2605.28909#bib.bib19)\]\. Its physics\-informed extension PINO\[[21](https://arxiv.org/html/2605.28909#bib.bib21)\]additionally penalizes PDE or discrete residuals during training, providing regularization that improves physical consistency\.

### 1\.2The sequential surrogate problem and its mathematical challenges

A fundamental difficulty, underappreciated in much of the existing surrogate literature, is that neural operators for reservoir simulation must be deployed*autoregressively*: at each production timestep, predictions are fed back as inputs for subsequent timesteps\. This creates a sharp tension between the one\-step \(teacher\-forced, “1–1”\) training paradigm — in which ground\-truth simulator states are provided at every step — and the closed\-loop inference mode in which the model operates\. The mismatch is known as*exposure bias*or*covariate shift*in the sequence modeling literature\[[4](https://arxiv.org/html/2605.28909#bib.bib4),[31](https://arxiv.org/html/2605.28909#bib.bib31),[18](https://arxiv.org/html/2605.28909#bib.bib18)\]and has been studied in recurrent language models, learned weather forecasters\[[27](https://arxiv.org/html/2605.28909#bib.bib27),[5](https://arxiv.org/html/2605.28909#bib.bib5),[17](https://arxiv.org/html/2605.28909#bib.bib17)\], and neural PDE surrogates\[[36](https://arxiv.org/html/2605.28909#bib.bib36)\]\.

The mathematical mechanism underlying this instability is well understood in the theory of iterated function systems and discrete\-time dynamical systems\. For a learned one\-step operator𝒢θ\\mathcal\{G\}\_\{\\theta\}, the rollout error afternnsteps satisfies a recursion driven by the Jacobian product∏j=0n−1Dx​𝒢θ​\(xj\)\\prod\_\{j=0\}^\{n\-1\}D\_\{x\}\\mathcal\{G\}\_\{\\theta\}\(x\_\{j\}\)along the predicted trajectory\. Even a marginally super\-unitary operator norm leads to exponential error amplification\[[3](https://arxiv.org/html/2605.28909#bib.bib3),[26](https://arxiv.org/html/2605.28909#bib.bib26),[23](https://arxiv.org/html/2605.28909#bib.bib23)\]\. For black\-oil reservoirs, the saturation equations are advection\-dominated and can develop sharp saturation fronts, gravity fingers, and dissolution\-driven instabilities — all of which correspond to Jacobian norms exceeding unity along front\-aligned state directions\. This makes long\-horizon stability a particularly acute concern for gas saturation dynamics in Norne\-type systems\.

### 1\.3Overview of contributions

The contributions of this paper are organized as follows\.

1. \(i\)Rigorous formulation\(Section[2](https://arxiv.org/html/2605.28909#S2)\)\. We embed the discrete\-time black\-oil system in a product\-Sobolev\-space framework\. We prove well\-posedness of the implicit timestep map, provide sharp local Lipschitz estimates linking the Lipschitz constant to the coercivity of the finite\-volume residual, and characterize the elliptic–hyperbolic structural distinction that governs different stability behaviors across state variables\.
2. \(ii\)Covariate shift quantification\(Section[4](https://arxiv.org/html/2605.28909#S4)\)\. We prove explicitW2W\_\{2\}and total\-variation bounds on the distributional discrepancy between 1–1 and AR training measures, show this gap drives an exponential population\-risk discrepancy forL\>1L\>1, and prove a matching lower bound showing the upper bound is tight for hyperbolic transport\.
3. \(iii\)Rollout error analysis with sharp constants\(Section[5](https://arxiv.org/html/2605.28909#S5)\)\. We prove a sharp rollout bound covering general, uniform, contractive, and marginally unstable cases, together with a characterization of which PDE types fall in each regime\. A dimension\-dependent amplification factor is derived for the mixed\-elliptic–hyperbolic Norne state space\.
4. \(iv\)PINO spectral stability theory\(Section[6](https://arxiv.org/html/2605.28909#S6)\)\. We prove that PINO training constrains the learned Jacobian spectral radius below the true dissipation rate, with explicit constants\. We show the physics residual acts as a spectral Jacobian regularizer and prove a uniform\-in\-time PINO rollout bound\.
5. \(v\)TBPTT gradient bias theory\(Section[7](https://arxiv.org/html/2605.28909#S7)\)\. Full bias\-variance analysis ofKK\-step TBPTT as a gradient estimator, including optimal window size selection and a convergence rate under Adam\.
6. \(vi\)FNO approximation theory for mixed PDE types\(Section[8](https://arxiv.org/html/2605.28909#S8)\)\. We prove spectral approximation rates for FNO applied to elliptic versus hyperbolic variables and quantify why gas saturation requires deeper networks or AR training to compensate for Gibbs\-like spectral truncation artifacts near fronts\.
7. \(vii\)Empirical validation on Norne\(Section[11](https://arxiv.org/html/2605.28909#S11)\)\. Full time\-resolvedR2R^\{2\}benchmarks for all four state variables, with theory\-consistent quantitative interpretation of each observed behavior\.

## 2Mathematical Preliminaries

### 2\.1Domains, grids, and function spaces

LetΩ⊂ℝ3\\Omega\\subset\\mathbb\{R\}^\{3\}be an open bounded Lipschitz domain representing the reservoir\. We work with the Norne Cartesian grid of dimensions\(nx,ny,nz\)=\(46,112,22\)\(n\_\{x\},n\_\{y\},n\_\{z\}\)=\(46,112,22\), yieldingN:=nx​ny​nz=113,344N:=n\_\{x\}n\_\{y\}n\_\{z\}=113\{,\}344cells\. Each cellΩi⊂Ω\\Omega\_\{i\}\\subset\\Omegais a rectangular cuboid of volume\|Ωi\|\>0\|\\Omega\_\{i\}\|\>0\. For1≤p≤∞1\\leq p\\leq\\infty,Lp​\(Ω\)L^\{p\}\(\\Omega\)is the standard Lebesgue space;Hs​\(Ω\)H^\{s\}\(\\Omega\)the Sobolev space of orders≥0s\\geq 0\. On the discrete grid, a scalar fielda:Ω→ℝa:\\Omega\\to\\mathbb\{R\}sampled cell\-wise is identified with a vectora∈ℝNa\\in\\mathbb\{R\}^\{N\}\. The discreteℓp\\ell^\{p\}norm is‖a‖p:=\(∑i\|ai\|p\)1/p\\left\\lVert a\\right\\rVert\_\{p\}:=\(\\sum\_\{i\}\|a\_\{i\}\|^\{p\}\)^\{1/p\}and the gridLpL^\{p\}norm \(with cell volumes\) is

‖a‖Lhp:=\(∑i=1N\|Ωi\|​\|ai\|p\)1/p\.\\left\\lVert a\\right\\rVert\_\{L^\{p\}\_\{h\}\}:=\\left\(\\sum\_\{i=1\}^\{N\}\|\\Omega\_\{i\}\|\\,\|a\_\{i\}\|^\{p\}\\right\)^\{1/p\}\.
###### Definition 2\.1\(Pore\-volume weighted inner product\)\.

Define the pore\-volume\-weighted inner product onℝN\\mathbb\{R\}^\{N\}:

⟨a,b⟩ϕ:=∑i=1Nϕi​\|Ωi\|​ai​bi,\\left\\langle a,\\,b\\right\\rangle\_\{\\phi\}:=\\sum\_\{i=1\}^\{N\}\\phi\_\{i\}\|\\Omega\_\{i\}\|\\,a\_\{i\}b\_\{i\},whereϕi∈\(0,1\)\\phi\_\{i\}\\in\(0,1\)is the cell porosity\. The induced norm‖a‖ϕ:=⟨a,a⟩ϕ\\left\\lVert a\\right\\rVert\_\{\\phi\}:=\\sqrt\{\\left\\langle a,\\,a\\right\\rangle\_\{\\phi\}\}weights errors proportionally to pore volume, which is the physically natural metric for mass\-conservation\-relevant quantities\.

### 2\.2The black\-oil PDE system and its structure

The three\-phase \(water, oil, gas\) black\-oil model onΩ×\(0,Tf\]\\Omega\\times\(0,T\_\{f\}\]is:

ϕ​∂Sw∂t−∇⋅\[Tw​\(∇Pw\+ρw​g​𝐞3\)\]\\displaystyle\\phi\\,\\frac\{\\partial S\_\{w\}\}\{\\partial t\}\-\\nabla\\cdot\\bigl\[T\_\{w\}\(\\nabla P\_\{w\}\+\\rho\_\{w\}g\\mathbf\{e\}\_\{3\}\)\\bigr\]=Qw,\\displaystyle=Q\_\{w\},\(1\)ϕ​∂So∂t−∇⋅\[To​\(∇Po\+ρo​g​𝐞3\)\]\\displaystyle\\phi\\,\\frac\{\\partial S\_\{o\}\}\{\\partial t\}\-\\nabla\\cdot\\bigl\[T\_\{o\}\(\\nabla P\_\{o\}\+\\rho\_\{o\}g\\mathbf\{e\}\_\{3\}\)\\bigr\]=Qo,\\displaystyle=Q\_\{o\},\(2\)∂∂t\[ϕ\(ρgBgSg\+Rs​o​ρgBoSo\)\]−∇⋅\[ρgBgTg\(∇Pg\+ρgg𝐞3\)\\displaystyle\\frac\{\\partial\}\{\\partial t\}\\\!\\left\[\\phi\\\!\\left\(\\frac\{\\rho\_\{g\}\}\{B\_\{g\}\}S\_\{g\}\+\\frac\{R\_\{so\}\\rho\_\{g\}\}\{B\_\{o\}\}S\_\{o\}\\right\)\\right\]\-\\nabla\\cdot\\\!\\left\[\\frac\{\\rho\_\{g\}\}\{B\_\{g\}\}T\_\{g\}\(\\nabla P\_\{g\}\+\\rho\_\{g\}g\\mathbf\{e\}\_\{3\}\)\\right\.\(3\)\+Rs​o​ρgBoTo\(∇Po\+ρog𝐞3\)\]=Qg\.\\displaystyle\\left\.\+\\frac\{R\_\{so\}\\rho\_\{g\}\}\{B\_\{o\}\}T\_\{o\}\(\\nabla P\_\{o\}\+\\rho\_\{o\}g\\mathbf\{e\}\_\{3\}\)\\right\]=Q\_\{g\}\.\(4\)with saturation constraintSw\+So\+Sg=1S\_\{w\}\+S\_\{o\}\+S\_\{g\}=1and capillary relationsPc​w​o:=Po−PwP\_\{cwo\}:=P\_\{o\}\-P\_\{w\},Pc​o​g:=Pg−PoP\_\{cog\}:=P\_\{g\}\-P\_\{o\}\. The phase transmissibilities are

Tα:=K​\(x\)​Kr​α​\(S\)μα,α∈\{w,o,g\},T\_\{\\alpha\}:=\\frac\{K\(x\)\\,K\_\{r\\alpha\}\(S\)\}\{\\mu\_\{\\alpha\}\},\\quad\\alpha\\in\\\{w,o,g\\\},\(5\)whereK​\(x\)\>0K\(x\)\>0is the absolute permeability tensor \(treated as scalar here\),Kr​α≥0K\_\{r\\alpha\}\\geq 0are relative permeability functions, andμα\>0\\mu\_\{\\alpha\}\>0are phase viscosities\. When a fault transmissibility multiplier \(FTM\)f∈\[0,1\]f\\in\[0,1\]is present,KKis replaced byf⋅Kf\\cdot K\. The boundary condition is∇Pα⋅ν=0\\nabla P\_\{\\alpha\}\\cdot\\nu=0on∂Ω\\partial\\Omega\(no\-flow\)\.

#### Elliptic–hyperbolic structure\.

After summing equations \([1](https://arxiv.org/html/2605.28909#S2.E1)\)–\([4](https://arxiv.org/html/2605.28909#S2.E4)\) with appropriate formation\-volume\-factor weights, the total compressibility equation for pressure takes the form

−∇⋅\[Ttot​∇P\]\+ct​ϕ​∂P∂t=Qtot,Ttot:=Tw\+To\+Tg,\-\\nabla\\cdot\\bigl\[T\_\{\\mathrm\{tot\}\}\\nabla P\\bigr\]\+c\_\{t\}\\phi\\frac\{\\partial P\}\{\\partial t\}=Q\_\{\\mathrm\{tot\}\},\\qquad T\_\{\\mathrm\{tot\}\}:=T\_\{w\}\+T\_\{o\}\+T\_\{g\},\(6\)which is a parabolic \(nearly elliptic for smallctc\_\{t\}\) equation for pressure\. Substituting the Darcy flux into the saturation equations gives a nonlinear advection\-diffusion system forSw,So,SgS\_\{w\},S\_\{o\},S\_\{g\}with characteristic speedvα=Tα/ϕv\_\{\\alpha\}=T\_\{\\alpha\}/\\phi, which is hyperbolic\-dominated for typical mobility ratios\. For gas, the combination of free gas \(Bg−1​SgB\_\{g\}^\{\-1\}S\_\{g\}\) and dissolved gas \(Rs​o​Bo−1​SoR\_\{so\}B\_\{o\}^\{\-1\}S\_\{o\}\) makes the gas equation strongly nonlinear and particularly sensitive to front tracking\.

### 2\.3Discrete\-time formulation and state space

###### Definition 2\.2\(State space\)\.

The physically admissible state space is

𝒳:=\{x=\(p,Sw,So,Sg\)∈ℝ4​N\|pi\>0,Sα,i≥0​∀α,i,Sw,i\+So,i\+Sg,i=1​∀i\}\.\\mathcal\{X\}:=\\bigl\\\{x=\(p,S\_\{w\},S\_\{o\},S\_\{g\}\)\\in\\mathbb\{R\}^\{4N\}\\;\\big\|\\;p\_\{i\}\>0,\\;S\_\{\\alpha,i\}\\geq 0\\;\\forall\\alpha,i,\\;S\_\{w,i\}\+S\_\{o,i\}\+S\_\{g,i\}=1\\;\\forall i\\bigr\\\}\.The constraintSw\+So\+Sg=𝟏NS\_\{w\}\+S\_\{o\}\+S\_\{g\}=\\mathbf\{1\}\_\{N\}defines a hyperplane inℝ3​N\\mathbb\{R\}^\{3N\}; thus𝒳\\mathcal\{X\}is a closed convex subset ofℝ4​N\\mathbb\{R\}^\{4N\}of intrinsic dimension3​N\+13N\+1\(pressure hasNNdegrees of freedom; saturations have2​N2Nafter the volume constraint\)\.

###### Definition 2\.3\(Input channels\)\.

Static reservoir parameters:u:=\(log⁡k,ϕ,f\)∈ℝ3​Nu:=\(\\log k,\\phi,f\)\\in\\mathbb\{R\}^\{3N\}\. Well controls at stepnn:cn:=\(Qn,Qw,n,Qg,n\)∈ℝ3​Nwc\_\{n\}:=\(Q\_\{n\},Q\_\{w,n\},Q\_\{g,n\}\)\\in\\mathbb\{R\}^\{3N\_\{w\}\},Nw=22N\_\{w\}=22producers\+\+1313injectors\. Temporal scalars:τn:=\(tn,Δ​tn\)∈ℝ2\\tau\_\{n\}:=\(t\_\{n\},\\Delta t\_\{n\}\)\\in\\mathbb\{R\}^\{2\}\. The full input at stepnnisξn:=\(u,xn,cn,τn\)\\xi\_\{n\}:=\(u,x\_\{n\},c\_\{n\},\\tau\_\{n\}\)\.

### 2\.4The finite\-volume residual operator and well\-posedness

The implicit \(backward\-Euler\) finite\-volume discretization of \([1](https://arxiv.org/html/2605.28909#S2.E1)\)–\([4](https://arxiv.org/html/2605.28909#S2.E4)\) at timestepn→n\+1n\\to n\+1assembles the nonlinear algebraic system

R​\(xn\+1,xn;u,cn,τn\)=0,R\(x\_\{n\+1\},x\_\{n\};u,c\_\{n\},\\tau\_\{n\}\)=0,\(7\)whereR:ℝ4​N×ℝ4​N×ℝ3​N×ℝ3​Nw×ℝ2→ℝ4​NR:\\mathbb\{R\}^\{4N\}\\times\\mathbb\{R\}^\{4N\}\\times\\mathbb\{R\}^\{3N\}\\times\\mathbb\{R\}^\{3N\_\{w\}\}\\times\\mathbb\{R\}^\{2\}\\to\\mathbb\{R\}^\{4N\}is the*finite\-volume residual operator*\. Its water\-component block at celliireads explicitly:

Riw​\(x\+,x−\):=ϕi​Sw,i\+−Sw,i−Δ​t−∑j∈𝒩​\(i\)Twi​j,\+di​j​\(Pw,j\+−Pw,i\+\)​\|∂Ωi​j\|−Qw,i,R^\{w\}\_\{i\}\(x^\{\+\},x^\{\-\}\):=\\phi\_\{i\}\\frac\{S\_\{w,i\}^\{\+\}\-S\_\{w,i\}^\{\-\}\}\{\\Delta t\}\-\\sum\_\{j\\in\\mathcal\{N\}\(i\)\}\\frac\{T^\{ij,\+\}\_\{w\}\}\{d\_\{ij\}\}\(P^\{\+\}\_\{w,j\}\-P^\{\+\}\_\{w,i\}\)\|\\partial\\Omega\_\{ij\}\|\-Q\_\{w,i\},\(8\)wherex\+=xn\+1x^\{\+\}=x\_\{n\+1\},x−=xnx^\{\-\}=x\_\{n\},𝒩​\(i\)\\mathcal\{N\}\(i\)is the cell\-face stencil,Twi​j,\+T^\{ij,\+\}\_\{w\}is the upstream\-weighted transmissibility at facei​jijevaluated at the new state, anddi​jd\_\{ij\}is the cell\-center distance\. Analogous blocksRioR^\{o\}\_\{i\}andRigR^\{g\}\_\{i\}are defined for oil and gas, with the gas block including the formation\-volume\-factor\-weighted accumulation termϕi​Δ​t−1​\[\(ρg/Bg\)​Sg,i\+\+\(Rs​o​ρg/Bo\)​So,i\+−\(ρg/Bg\)​Sg,i−−\(Rs​o​ρg/Bo\)​So,i−\]\\phi\_\{i\}\\Delta t^\{\-1\}\[\(\\rho\_\{g\}/B\_\{g\}\)S\_\{g,i\}^\{\+\}\+\(R\_\{so\}\\rho\_\{g\}/B\_\{o\}\)S\_\{o,i\}^\{\+\}\-\(\\rho\_\{g\}/B\_\{g\}\)S\_\{g,i\}^\{\-\}\-\(R\_\{so\}\\rho\_\{g\}/B\_\{o\}\)S\_\{o,i\}^\{\-\}\]\.

###### Assumption 2\.4\(Discrete coercivity\)\.

For allx−∈𝒳x^\{\-\}\\in\\mathcal\{X\}and allΔ​t∈\(0,Δ​tmax\]\\Delta t\\in\(0,\\Delta t\_\{\\max\}\], the Jacobian ofRRwith respect to its first argument evaluated at any\(x\+,x−\)∈𝒳2\(x^\{\+\},x^\{\-\}\)\\in\\mathcal\{X\}^\{2\}satisfies a coercivity condition:

⟨∂R∂x\+​\(x\+,x−\)​v,v⟩ϕ−1≥α​‖v‖ϕ2∀v∈ℝ4​N,\\left\\langle\\frac\{\\partial R\}\{\\partial x^\{\+\}\}\(x^\{\+\},x^\{\-\}\)\\,v,\\,v\\right\\rangle\_\{\\phi^\{\-1\}\}\\geq\\alpha\\left\\lVert v\\right\\rVert\_\{\\phi\}^\{2\}\\quad\\forall v\\in\\mathbb\{R\}^\{4N\},\(9\)for someα\>0\\alpha\>0depending onΔ​tmax\\Delta t\_\{\\max\},min⁡ϕi\\min\\phi\_\{i\},min⁡Ki\\min K\_\{i\}, the relative permeability slopes, and the capillary pressure derivatives\.

###### Theorem 2\.6\(Well\-posedness of the implicit timestep\)\.

Under Assumption[2\.4](https://arxiv.org/html/2605.28909#S2.Thmtheorem4), for eachxn∈𝒳x\_\{n\}\\in\\mathcal\{X\},u∈ℝ3​Nu\\in\\mathbb\{R\}^\{3N\},cn∈ℝ3​Nwc\_\{n\}\\in\\mathbb\{R\}^\{3N\_\{w\}\}, andτn\\tau\_\{n\}with0<Δ​tn≤Δ​tmax0<\\Delta t\_\{n\}\\leq\\Delta t\_\{\\max\}, there exists a unique solutionxn\+1∈𝒳x\_\{n\+1\}\\in\\mathcal\{X\}to \([7](https://arxiv.org/html/2605.28909#S2.E7)\)\. This defines the true time\-advance mapℱ:𝒳→𝒳\\mathcal\{F\}:\\mathcal\{X\}\\to\\mathcal\{X\}by

ℱ​\(xn;u,cn,τn\):=xn\+1\.\\mathcal\{F\}\(x\_\{n\};u,c\_\{n\},\\tau\_\{n\}\):=x\_\{n\+1\}\.\(10\)

###### Proof\.

DefineΨx−​\(x\+\):=x\+−α−1​\(∂R/∂x\+\)−1​R​\(x\+,x−;⋅\)\\Psi\_\{x^\{\-\}\}\(x^\{\+\}\):=x^\{\+\}\-\\alpha^\{\-1\}\\left\(\\partial R/\\partial x^\{\+\}\\right\)^\{\-1\}R\(x^\{\+\},x^\{\-\};\\cdot\)restricted to𝒳\\mathcal\{X\}\. Coercivity \([9](https://arxiv.org/html/2605.28909#S2.E9)\) implies∂R/∂x\+\\partial R/\\partial x^\{\+\}is invertible with‖\(∂R/∂x\+\)−1‖ϕ−1→ϕ≤α−1\\left\\lVert\(\\partial R/\\partial x^\{\+\}\)^\{\-1\}\\right\\rVert\_\{\\phi^\{\-1\}\\to\\phi\}\\leq\\alpha^\{\-1\}\. A contraction estimate showsΨx−\\Psi\_\{x^\{\-\}\}maps a closed ballBr​\(x−\)B\_\{r\}\(x^\{\-\}\)to itself for sufficiently smallrr\(using the Lipschitz continuity ofRRin both arguments\), and the Banach fixed\-point theorem yields a unique fixed point, which is the solutionx\+x^\{\+\}\. Uniqueness follows immediately from strict monotonicity ofRRinx\+x^\{\+\}: ifR​\(a,x−\)=R​\(b,x−\)=0R\(a,x^\{\-\}\)=R\(b,x^\{\-\}\)=0witha≠ba\\neq b, then0=⟨R​\(a,x−\)−R​\(b,x−\),a−b⟩ϕ−1≥α​‖a−b‖ϕ2\>00=\\left\\langle R\(a,x^\{\-\}\)\-R\(b,x^\{\-\}\),\\,a\-b\\right\\rangle\_\{\\phi^\{\-1\}\}\\geq\\alpha\\left\\lVert a\-b\\right\\rVert\_\{\\phi\}^\{2\}\>0, a contradiction\. The solution lies in𝒳\\mathcal\{X\}by checking that the saturation constraint is preserved under the implicit update \(it is, since the residual explicitly includes the volume constraint via the capillary pressure formulation\)\. ∎

###### Lemma 2\.7\(Local Lipschitz estimate forℱ\\mathcal\{F\}\)\.

Under Assumption[2\.4](https://arxiv.org/html/2605.28909#S2.Thmtheorem4), for anyx,y∈𝒳x,y\\in\\mathcal\{X\}with‖x−y‖ϕ≤r\\left\\lVert x\-y\\right\\rVert\_\{\\phi\}\\leq r:

‖ℱ​\(x;⋅\)−ℱ​\(y;⋅\)‖ϕ≤Lℱ​\(r\)​‖x−y‖ϕ,\\left\\lVert\\mathcal\{F\}\(x;\\cdot\)\-\\mathcal\{F\}\(y;\\cdot\)\\right\\rVert\_\{\\phi\}\\leq L\_\{\\mathcal\{F\}\}\(r\)\\left\\lVert x\-y\\right\\rVert\_\{\\phi\},\(11\)where

Lℱ​\(r\)=1\+‖∂R/∂x−‖ϕα≤1\+Cflux​\(r\)α,L\_\{\\mathcal\{F\}\}\(r\)=1\+\\frac\{\\left\\lVert\\partial R/\\partial x^\{\-\}\\right\\rVert\_\{\\phi\}\}\{\\alpha\}\\leq 1\+\\frac\{C\_\{\\mathrm\{flux\}\}\(r\)\}\{\\alpha\},\(12\)andCflux​\(r\)C\_\{\\mathrm\{flux\}\}\(r\)captures the inter\-cell flux Jacobian magnitude\. For the pressure component alone \(elliptic\),LℱP≤1\+Cell/αL\_\{\\mathcal\{F\}\}^\{P\}\\leq 1\+C\_\{\\mathrm\{ell\}\}/\\alphawithCellC\_\{\\mathrm\{ell\}\}depending onTtotT\_\{\\mathrm\{tot\}\}\. For the gas saturation component \(hyperbolic\),LℱGL\_\{\\mathcal\{F\}\}^\{G\}can exceed unity by an amount proportional to the local front velocity\.

###### Proof\.

Letxx\+:=ℱ​\(x;⋅\)x^\{\+\}\_\{x\}:=\\mathcal\{F\}\(x;\\cdot\)andxy\+:=ℱ​\(y;⋅\)x^\{\+\}\_\{y\}:=\\mathcal\{F\}\(y;\\cdot\)\. ThenR​\(xx\+,x;⋅\)=R​\(xy\+,y;⋅\)=0R\(x^\{\+\}\_\{x\},x;\\cdot\)=R\(x^\{\+\}\_\{y\},y;\\cdot\)=0\. Subtracting and applying the mean\-value theorem:

∂R∂x\+\|ξ⋅\(xx\+−xy\+\)=−∂R∂x−\|η⋅\(x−y\),\\frac\{\\partial R\}\{\\partial x^\{\+\}\}\\big\|\_\{\\xi\}\\cdot\(x^\{\+\}\_\{x\}\-x^\{\+\}\_\{y\}\)=\-\\frac\{\\partial R\}\{\\partial x^\{\-\}\}\\big\|\_\{\\eta\}\\cdot\(x\-y\),whereξ\\xiandη\\etalie on the line segment\. Inverting with the coercivity bound:

∥xx\+−xy\+∥ϕ≤1α∥∂R∂x−\|η∥ϕ→ϕ∥x−y∥ϕ=:Cflux​\(r\)α∥x−y∥ϕ\.\\left\\lVert x^\{\+\}\_\{x\}\-x^\{\+\}\_\{y\}\\right\\rVert\_\{\\phi\}\\leq\\frac\{1\}\{\\alpha\}\\left\\\|\\frac\{\\partial R\}\{\\partial x^\{\-\}\}\\bigg\|\_\{\\eta\}\\right\\\|\_\{\\phi\\to\\phi\}\\left\\lVert x\-y\\right\\rVert\_\{\\phi\}=:\\frac\{C\_\{\\mathrm\{flux\}\}\(r\)\}\{\\alpha\}\\left\\lVert x\-y\\right\\rVert\_\{\\phi\}\.For the pressure block,∂RP/∂x−\\partial R^\{P\}/\\partial x^\{\-\}is dominated by the accumulation termϕ/Δ​t\\phi/\\Delta t, andCell≈ϕmax/Δ​tC\_\{\\mathrm\{ell\}\}\\approx\\phi\_\{\\max\}/\\Delta t\. For gas, the off\-diagonal coupling through relative permeability nonlinearities adds terms proportional to front velocities, giving the largerCfluxGC\_\{\\mathrm\{flux\}\}^\{G\}\. ∎

### 2\.5Neural operator surrogate and rollout

###### Definition 2\.8\(Learned one\-step operator\)\.

A neural operator surrogate is a parametric family\{𝒢θ:θ∈Θ⊂ℝP\}\\\{\\mathcal\{G\}\_\{\\theta\}:\\theta\\in\\Theta\\subset\\mathbb\{R\}^\{P\}\\\}of maps𝒢θ:𝒳×ℝ3​N×ℝ3​Nw×ℝ2→𝒳\\mathcal\{G\}\_\{\\theta\}:\\mathcal\{X\}\\times\\mathbb\{R\}^\{3N\}\\times\\mathbb\{R\}^\{3N\_\{w\}\}\\times\\mathbb\{R\}^\{2\}\\to\\mathcal\{X\}approximatingℱ\\mathcal\{F\}\. For FNO, the architecture is described in Section[8](https://arxiv.org/html/2605.28909#S8)\.

###### Definition 2\.9\(Autoregressive rollout\)\.

Given initial conditionx0∈𝒳x\_\{0\}\\in\\mathcal\{X\}, define the closed\-loop trajectory

x^0:=x0,x^n\+1:=𝒢θ​\(u,x^n,cn,τn\),n=0,…,T−1\.\\hat\{x\}\_\{0\}:=x\_\{0\},\\qquad\\hat\{x\}\_\{n\+1\}:=\\mathcal\{G\}\_\{\\theta\}\(u,\\hat\{x\}\_\{n\},c\_\{n\},\\tau\_\{n\}\),\\quad n=0,\\dots,T\-1\.\(13\)Thenn\-step*rollout operator*is the composition

𝒢θ\(n\)​\(x0\):=𝒢θ∘⋯∘𝒢θ⏟n​\(x0\)\.\\mathcal\{G\}\_\{\\theta\}^\{\(n\)\}\(x\_\{0\}\):=\\underbrace\{\\mathcal\{G\}\_\{\\theta\}\\circ\\cdots\\circ\\mathcal\{G\}\_\{\\theta\}\}\_\{n\}\(x\_\{0\}\)\.\(14\)The rollout error at stepnnisδn:=x^n−xn\\delta\_\{n\}:=\\hat\{x\}\_\{n\}\-x\_\{n\}\.

### 2\.6Probability spaces and distributional setup

Let\(Ωω,𝒜,ℙ\)\(\\Omega\_\{\\omega\},\\mathcal\{A\},\\mathbb\{P\}\)be a probability space supporting random initial conditionsx0∼μ0x\_\{0\}\\sim\\mu\_\{0\}, static fieldsuu, and control sequencesc0:T−1c\_\{0:T\-1\}\. Define:

μn\\displaystyle\\mu\_\{n\}:=law​\(xn\)under true simulator trajectories,\\displaystyle:=\\mathrm\{law\}\(x\_\{n\}\)\\quad\\text\{under true simulator trajectories,\}μ^nθ\\displaystyle\\hat\{\\mu\}\_\{n\}^\{\\theta\}:=law​\(x^n\)under AR rollout \([13](https://arxiv.org/html/2605.28909#S2.E13)\)\.\\displaystyle:=\\mathrm\{law\}\(\\hat\{x\}\_\{n\}\)\\quad\\text\{under AR rollout\\penalty 10000\\ \\eqref\{eq:rollout\}\.\}The empirical training measure on one\-step pairs fromNsN\_\{s\}simulator trajectories is

ℙ^N:=1Ns​T​∑i=1Ns∑n=0T−1δ\(xn\(i\),xn\+1\(i\),ξn\(i\)\)\.\\hat\{\\mathbb\{P\}\}\_\{N\}:=\\frac\{1\}\{N\_\{s\}T\}\\sum\_\{i=1\}^\{N\_\{s\}\}\\sum\_\{n=0\}^\{T\-1\}\\delta\_\{\(x\_\{n\}^\{\(i\)\},x\_\{n\+1\}^\{\(i\)\},\\xi\_\{n\}^\{\(i\)\}\)\}\.

## 3Two Training Paradigms and Risk Analysis

### 3\.1One\-step \(teacher\-forced\) training

###### Definition 3\.1\(1–1 loss\)\.

The one\-step \(teacher\-forced\) population risk is

ℒ1−1​\(θ\):=𝔼​\[1T​∑n=0T−1ℓ​\(𝒢θ​\(ξn\),xn\+1\)\],\\mathcal\{L\}\_\{\\mathrm\{1\{\-\}1\}\}\(\\theta\):=\\mathbb\{E\}\\\!\\left\[\\frac\{1\}\{T\}\\sum\_\{n=0\}^\{T\-1\}\\ell\\bigl\(\\mathcal\{G\}\_\{\\theta\}\(\\xi\_\{n\}\),\\,x\_\{n\+1\}\\bigr\)\\right\],\(15\)whereξn=\(u,xn,cn,τn\)\\xi\_\{n\}=\(u,x\_\{n\},c\_\{n\},\\tau\_\{n\}\)withxnx\_\{n\}drawn fromμn\\mu\_\{n\}, andℓ​\(a,b\):=‖a−b‖ϕ2/‖b‖ϕ2\\ell\(a,b\):=\\left\\lVert a\-b\\right\\rVert^\{2\}\_\{\\phi\}/\\left\\lVert b\\right\\rVert\_\{\\phi\}^\{2\}is the relativeℓ2\\ell^\{2\}loss\. The empirical version overNsN\_\{s\}trajectories is

ℒ^1−1​\(θ\)=1Ns​T​∑i=1Ns∑n=0T−1ℓ​\(𝒢θ​\(ξn\(i\)\),xn\+1\(i\)\)\.\\hat\{\\mathcal\{L\}\}\_\{\\mathrm\{1\{\-\}1\}\}\(\\theta\)=\\frac\{1\}\{N\_\{s\}T\}\\sum\_\{i=1\}^\{N\_\{s\}\}\\sum\_\{n=0\}^\{T\-1\}\\ell\\bigl\(\\mathcal\{G\}\_\{\\theta\}\(\\xi\_\{n\}^\{\(i\)\}\),x\_\{n\+1\}^\{\(i\)\}\\bigr\)\.\(16\)

### 3\.2Autoregressive \(closed\-loop\) training

###### Definition 3\.2\(AR loss\)\.

The autoregressive population risk is

ℒAR​\(θ\):=𝔼​\[1T​∑n=0T−1ℓ​\(x^n\+1​\(θ\),xn\+1\)\],\\mathcal\{L\}\_\{\\mathrm\{AR\}\}\(\\theta\):=\\mathbb\{E\}\\\!\\left\[\\frac\{1\}\{T\}\\sum\_\{n=0\}^\{T\-1\}\\ell\\bigl\(\\hat\{x\}\_\{n\+1\}\(\\theta\),\\,x\_\{n\+1\}\\bigr\)\\right\],\(17\)wherex^n​\(θ\)\\hat\{x\}\_\{n\}\(\\theta\)is the AR rollout \([13](https://arxiv.org/html/2605.28909#S2.E13)\)\. The expectation is over\(x0,u,c0:T−1\)∼ℙ\(x\_\{0\},u,c\_\{0:T\-1\}\)\\sim\\mathbb\{P\}\.

### 3\.3Generalization bounds

###### Theorem 3\.4\(Uniform generalization bound\)\.

Letℋ=\{𝒢θ:θ∈Θ\}\\mathcal\{H\}=\\\{\\mathcal\{G\}\_\{\\theta\}:\\theta\\in\\Theta\\\}be the hypothesis class,ℓ\\ellbeLℓL\_\{\\ell\}\-Lipschitz in its first argument and bounded byMℓ\>0M\_\{\\ell\}\>0\. With probability≥1−δ\\geq 1\-\\deltaover the draw ofNsN\_\{s\}i\.i\.d\. trajectories,

supθ∈Θ\|ℒ1−1​\(θ\)−ℒ^1−1​\(θ\)\|≤2​Lℓ​ℜNs​\(ℋ\)\+Mℓ​log⁡\(1/δ\)2​Ns,\\sup\_\{\\theta\\in\\Theta\}\|\\mathcal\{L\}\_\{\\mathrm\{1\{\-\}1\}\}\(\\theta\)\-\\hat\{\\mathcal\{L\}\}\_\{\\mathrm\{1\{\-\}1\}\}\(\\theta\)\|\\leq 2L\_\{\\ell\}\\,\\mathfrak\{R\}\_\{N\_\{s\}\}\(\\mathcal\{H\}\)\+M\_\{\\ell\}\\sqrt\{\\frac\{\\log\(1/\\delta\)\}\{2N\_\{s\}\}\},\(19\)whereℜNs​\(ℋ\)\\mathfrak\{R\}\_\{N\_\{s\}\}\(\\mathcal\{H\}\)is the Rademacher complexity ofℋ\\mathcal\{H\}with respect toℙ^N\\hat\{\\mathbb\{P\}\}\_\{N\}\.

###### Proof\.

Define the composed function classℱℋ:=\{\(a,b\)↦ℓ​\(𝒢θ​\(a\),b\):θ∈Θ\}\\mathcal\{F\}\_\{\\mathcal\{H\}\}:=\\\{\(a,b\)\\mapsto\\ell\(\\mathcal\{G\}\_\{\\theta\}\(a\),b\):\\theta\\in\\Theta\\\}\. Sinceℓ\\ellisLℓL\_\{\\ell\}\-Lipschitz, the Rademacher contraction lemma \(Ledoux–Talagrand\) givesℜNs​\(ℱℋ\)≤Lℓ​ℜNs​\(ℋ\)\\mathfrak\{R\}\_\{N\_\{s\}\}\(\\mathcal\{F\}\_\{\\mathcal\{H\}\}\)\\leq L\_\{\\ell\}\\mathfrak\{R\}\_\{N\_\{s\}\}\(\\mathcal\{H\}\)\. The uniform bound \([19](https://arxiv.org/html/2605.28909#S3.E19)\) then follows from the standard symmetric Rademacher argument applied to the function\(d1,…,dNs\)↦supθ\|ℒ−ℒ^\|\(d\_\{1\},\\dots,d\_\{N\_\{s\}\}\)\\mapsto\\sup\_\{\\theta\}\|\\mathcal\{L\}\-\\hat\{\\mathcal\{L\}\}\|, combined with McDiarmid’s inequality \(bounded differences of size2​Mℓ/Ns2M\_\{\\ell\}/N\_\{s\}\) and symmetrization via Rademacher variables\[[2](https://arxiv.org/html/2605.28909#bib.bib2)\]\. ∎

###### Proposition 3\.5\(Physics constraints reduce Rademacher complexity\)\.

LetΘphys​\(λ\)⊂Θ\\Theta\_\{\\mathrm\{phys\}\}\(\\lambda\)\\subset\\Thetadenote the sublevel set\{θ:𝔼​\[‖R​\(𝒢θ​\(ξn\),xn;⋅\)‖2\]≤1/λ\}\\\{\\theta:\\mathbb\{E\}\[\\left\\lVert R\(\\mathcal\{G\}\_\{\\theta\}\(\\xi\_\{n\}\),x\_\{n\};\\cdot\)\\right\\rVert^\{2\}\]\\leq 1/\\lambda\\\}\. ThenℜNs​\(ℋphys\)≤ℜNs​\(ℋ\)\\mathfrak\{R\}\_\{N\_\{s\}\}\(\\mathcal\{H\}\_\{\\mathrm\{phys\}\}\)\\leq\\mathfrak\{R\}\_\{N\_\{s\}\}\(\\mathcal\{H\}\), with equality only if the physics penalty is non\-binding\. Moreover, for FNO hypothesis classes parameterized by spectral mode coefficients\{R\(ℓ\)​\(k\)\}\\\{R^\{\(\\ell\)\}\(k\)\\\}, the PINO physics constraint reduces the effective covering number:

log𝒩\(ℋphys,ε,∥⋅∥∞\)≤log𝒩\(ℋ,ε,∥⋅∥∞\)−Cλ1/2ε−dphys,\\log\\mathcal\{N\}\(\\mathcal\{H\}\_\{\\mathrm\{phys\}\},\\varepsilon,\\left\\lVert\\cdot\\right\\rVert\_\{\\infty\}\)\\leq\\log\\mathcal\{N\}\(\\mathcal\{H\},\\varepsilon,\\left\\lVert\\cdot\\right\\rVert\_\{\\infty\}\)\-C\\lambda^\{1/2\}\\varepsilon^\{\-d\_\{\\mathrm\{phys\}\}\},\(20\)wheredphys\>0d\_\{\\mathrm\{phys\}\}\>0is the effective co\-dimension ofℳ\\mathcal\{M\}in𝒳\\mathcal\{X\}andCCdepends on the coercivity constantα\\alphain Assumption[2\.4](https://arxiv.org/html/2605.28909#S2.Thmtheorem4)\.

###### Proof\.

ℋphys⊂ℋ\\mathcal\{H\}\_\{\\mathrm\{phys\}\}\\subset\\mathcal\{H\}implies the first claim by monotonicity of Rademacher complexity in the function class\. For the covering number bound, note that the physics constraint𝔼​\[‖R​\(𝒢θ,xn;⋅\)‖2\]≤1/λ\\mathbb\{E\}\[\\left\\lVert R\(\\mathcal\{G\}\_\{\\theta\},x\_\{n\};\\cdot\)\\right\\rVert^\{2\}\]\\leq 1/\\lambdarestricts𝒢θ\\mathcal\{G\}\_\{\\theta\}to lie within an𝒪​\(λ−1/2\)\\mathcal\{O\}\(\\lambda^\{\-1/2\}\)\-tube of the manifoldℳ\\mathcal\{M\}in theL2​\(ℙX\)L^\{2\}\(\\mathbb\{P\}\_\{X\}\)\-sense\. Standard metric entropy estimates for tube neighborhoods of smooth manifolds of co\-dimensionddgive theε−dphys\\varepsilon^\{\-d\_\{\\mathrm\{phys\}\}\}correction\. ∎

## 4Covariate Shift: Quantitative Analysis

### 4\.1Wasserstein and total\-variation bounds

###### Theorem 4\.1\(Covariate shift quantification\)\.

Assume𝒢θ\\mathcal\{G\}\_\{\\theta\}has Lipschitz constantL≥0L\\geq 0\(in the∥⋅∥ϕ\\left\\lVert\\cdot\\right\\rVert\_\{\\phi\}metric\) and one\-step error bounded byε\>0\\varepsilon\>0:

‖𝒢θ​\(u,x,c,τ\)−ℱ​\(x;u,c,τ\)‖ϕ≤ε∀x∈𝒳\.\\left\\lVert\\mathcal\{G\}\_\{\\theta\}\(u,x,c,\\tau\)\-\\mathcal\{F\}\(x;u,c,\\tau\)\\right\\rVert\_\{\\phi\}\\leq\\varepsilon\\quad\\forall x\\in\\mathcal\{X\}\.Then, for everyn≥1n\\geq 1:

1. \(a\)*Wasserstein\-2 bound:* W2​\(μ^nθ,μn\)≤ε​Ln−1L−1\(L≠1\),W2​\(μ^nθ,μn\)≤n​ε\(L=1\)\.W\_\{2\}\(\\hat\{\\mu\}\_\{n\}^\{\\theta\},\\mu\_\{n\}\)\\leq\\varepsilon\\,\\frac\{L^\{n\}\-1\}\{L\-1\}\\quad\(L\\neq 1\),\\qquad W\_\{2\}\(\\hat\{\\mu\}\_\{n\}^\{\\theta\},\\mu\_\{n\}\)\\leq n\\varepsilon\\quad\(L=1\)\.\(21\)
2. \(b\)*Contractive case:*ForL<1L<1, W2​\(μ^nθ,μn\)≤ε1−L∀n≥0\.W\_\{2\}\(\\hat\{\\mu\}\_\{n\}^\{\\theta\},\\mu\_\{n\}\)\\leq\\frac\{\\varepsilon\}\{1\-L\}\\quad\\forall n\\geq 0\.\(22\)
3. \(c\)*Risk discrepancy:*For anyLℓL\_\{\\ell\}\-Lipschitz lossℓ\\ell, \|𝔼μn​\[ℓ​\(𝒢θ,⋅\)\]−𝔼μ^nθ​\[ℓ​\(𝒢θ,⋅\)\]\|≤Lℓ​W2​\(μ^nθ,μn\)\.\\bigl\|\\mathbb\{E\}\_\{\\mu\_\{n\}\}\[\\ell\(\\mathcal\{G\}\_\{\\theta\},\\cdot\)\]\-\\mathbb\{E\}\_\{\\hat\{\\mu\}\_\{n\}^\{\\theta\}\}\[\\ell\(\\mathcal\{G\}\_\{\\theta\},\\cdot\)\]\\bigr\|\\leq L\_\{\\ell\}\\,W\_\{2\}\(\\hat\{\\mu\}\_\{n\}^\{\\theta\},\\mu\_\{n\}\)\.\(23\)
4. \(d\)*Total population\-risk gap:* \|ℒ1−1​\(θ\)−ℒAR​\(θ\)\|≤Lℓ​εT​∑n=0T−1Ln−1L−1=Lℓ​εT⋅LT−T​L\+T−1\(L−1\)2\.\|\\mathcal\{L\}\_\{\\mathrm\{1\{\-\}1\}\}\(\\theta\)\-\\mathcal\{L\}\_\{\\mathrm\{AR\}\}\(\\theta\)\|\\leq\\frac\{L\_\{\\ell\}\\varepsilon\}\{T\}\\sum\_\{n=0\}^\{T\-1\}\\frac\{L^\{n\}\-1\}\{L\-1\}=\\frac\{L\_\{\\ell\}\\varepsilon\}\{T\}\\cdot\\frac\{L^\{T\}\-TL\+T\-1\}\{\(L\-1\)^\{2\}\}\.\(24\)

###### Proof\.

Part \(a\)\.Construct the canonical coupling\(x^n,xn\)\(\\hat\{x\}\_\{n\},x\_\{n\}\)by running both the predicted and true trajectories from the same initial conditionx0x\_\{0\}:

x^n\+1=𝒢θ​\(u,x^n,cn,τn\),xn\+1=ℱ​\(xn;u,cn,τn\)\.\\hat\{x\}\_\{n\+1\}=\\mathcal\{G\}\_\{\\theta\}\(u,\\hat\{x\}\_\{n\},c\_\{n\},\\tau\_\{n\}\),\\qquad x\_\{n\+1\}=\\mathcal\{F\}\(x\_\{n\};u,c\_\{n\},\\tau\_\{n\}\)\.Defineδn:=x^n−xn\\delta\_\{n\}:=\\hat\{x\}\_\{n\}\-x\_\{n\}withδ0=0\\delta\_\{0\}=0\. Adding and subtracting𝒢θ​\(u,xn,cn,τn\)\\mathcal\{G\}\_\{\\theta\}\(u,x\_\{n\},c\_\{n\},\\tau\_\{n\}\):

‖δn\+1‖ϕ\\displaystyle\\left\\lVert\\delta\_\{n\+1\}\\right\\rVert\_\{\\phi\}=‖𝒢θ​\(u,x^n,cn,τn\)−ℱ​\(xn;u,cn,τn\)‖ϕ\\displaystyle=\\left\\lVert\\mathcal\{G\}\_\{\\theta\}\(u,\\hat\{x\}\_\{n\},c\_\{n\},\\tau\_\{n\}\)\-\\mathcal\{F\}\(x\_\{n\};u,c\_\{n\},\\tau\_\{n\}\)\\right\\rVert\_\{\\phi\}≤‖𝒢θ​\(u,x^n,⋅\)−𝒢θ​\(u,xn,⋅\)‖ϕ\+‖𝒢θ​\(u,xn,⋅\)−ℱ​\(xn;⋅\)‖ϕ\\displaystyle\\leq\\left\\lVert\\mathcal\{G\}\_\{\\theta\}\(u,\\hat\{x\}\_\{n\},\\cdot\)\-\\mathcal\{G\}\_\{\\theta\}\(u,x\_\{n\},\\cdot\)\\right\\rVert\_\{\\phi\}\+\\left\\lVert\\mathcal\{G\}\_\{\\theta\}\(u,x\_\{n\},\\cdot\)\-\\mathcal\{F\}\(x\_\{n\};\\cdot\)\\right\\rVert\_\{\\phi\}≤L​‖δn‖ϕ\+ε\.\\displaystyle\\leq L\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\+\\varepsilon\.Iterating fromδ0=0\\delta\_\{0\}=0:‖δn‖ϕ≤∑j=0n−1Ln−1−j​ε=ε​\(Ln−1\)/\(L−1\)\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\sum\_\{j=0\}^\{n\-1\}L^\{n\-1\-j\}\\varepsilon=\\varepsilon\(L^\{n\}\-1\)/\(L\-1\)forL≠1L\\neq 1\. Since the Wasserstein\-2 distance satisfiesW2​\(μ^nθ,μn\)2≤𝔼​\[‖x^n−xn‖ϕ2\]≤‖δn‖ϕ2W\_\{2\}\(\\hat\{\\mu\}\_\{n\}^\{\\theta\},\\mu\_\{n\}\)^\{2\}\\leq\\mathbb\{E\}\[\\left\\lVert\\hat\{x\}\_\{n\}\-x\_\{n\}\\right\\rVert\_\{\\phi\}^\{2\}\]\\leq\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}^\{2\}\(using the constructed coupling as a transport plan\), and the pathwise bound‖δn‖ϕ≤ε​\(Ln−1\)/\(L−1\)\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\varepsilon\(L^\{n\}\-1\)/\(L\-1\)holds for allω\\omegain the probability space, taking the square root gives \([21](https://arxiv.org/html/2605.28909#S4.E21)\)\.

Part \(b\)\.ForL<1L<1, the geometric series gives∑j=0n−1Ln−1−j≤1/\(1−L\)\\sum\_\{j=0\}^\{n\-1\}L^\{n\-1\-j\}\\leq 1/\(1\-L\)independent ofnn\.

Part \(c\)\.For anyLℓL\_\{\\ell\}\-Lipschitz functiongg, the dual representation of the Wasserstein\-1 distance gives\|𝔼μ​g−𝔼ν​g\|≤Lg​W1​\(μ,ν\)≤Lg​W2​\(μ,ν\)\|\\mathbb\{E\}\_\{\\mu\}g\-\\mathbb\{E\}\_\{\\nu\}g\|\\leq L\_\{g\}W\_\{1\}\(\\mu,\\nu\)\\leq L\_\{g\}W\_\{2\}\(\\mu,\\nu\)\(sinceW1≤W2W\_\{1\}\\leq W\_\{2\}by Cauchy–Schwarz on the transport plan\)\.

Part \(d\)\.Apply \(c\) to each summand inℒ1−1−ℒAR\\mathcal\{L\}\_\{\\mathrm\{1\{\-\}1\}\}\-\\mathcal\{L\}\_\{\\mathrm\{AR\}\}and sum overn=0,…,T−1n=0,\\dots,T\-1, then divide byTT:

\|ℒ1−1−ℒAR\|≤LℓT​∑n=0T−1ε​\(Ln−1\)L−1=Lℓ​εT​\(L−1\)​\(LT−1L−1−T\),\|\\mathcal\{L\}\_\{1\-1\}\-\\mathcal\{L\}\_\{\\mathrm\{AR\}\}\|\\leq\\frac\{L\_\{\\ell\}\}\{T\}\\sum\_\{n=0\}^\{T\-1\}\\frac\{\\varepsilon\(L^\{n\}\-1\)\}\{L\-1\}=\\frac\{L\_\{\\ell\}\\varepsilon\}\{T\(L\-1\)\}\\left\(\\frac\{L^\{T\}\-1\}\{L\-1\}\-T\\right\),which simplifies to \([24](https://arxiv.org/html/2605.28909#S4.E24)\)\. ∎

###### Corollary 4\.2\(Exponential risk gap forL\>1L\>1\)\.

IfL\>1L\>1andε\>0\\varepsilon\>0, then

\|ℒ1−1​\(θ\)−ℒAR​\(θ\)\|=Θ​\(Lℓ​ε​LTT​\(L−1\)2\),\|\\mathcal\{L\}\_\{\\mathrm\{1\{\-\}1\}\}\(\\theta\)\-\\mathcal\{L\}\_\{\\mathrm\{AR\}\}\(\\theta\)\|=\\Theta\\\!\\left\(\\frac\{L\_\{\\ell\}\\varepsilon L^\{T\}\}\{T\(L\-1\)^\{2\}\}\\right\),i\.e\., the population\-risk gap grows*exponentially*in the horizonTT\. A model achieving small 1–1 training loss can have arbitrarily large AR deployment loss whenL\>1L\>1\.

###### Proof\.

For largeTT, the dominant term in \([24](https://arxiv.org/html/2605.28909#S4.E24)\) isLℓ​ε​LT/\(T​\(L−1\)2\)L\_\{\\ell\}\\varepsilon L^\{T\}/\(T\(L\-1\)^\{2\}\)\. TheΘ\\Theta\(i\.e\., matching upper and lower bounds\) follows because then=T−1n=T\-1term alone already contributesLℓ​ε​LT−1/\(\(L−1\)​T\)L\_\{\\ell\}\\varepsilon L^\{T\-1\}/\(\(L\-1\)T\)\. ∎

### 4\.2Lower bound: the bound is sharp for hyperbolic transport

###### Proposition 4\.4\(Sharpness for hyperbolic transport\)\.

Consider the one\-dimensional scalar conservation law∂tS\+v​∂xS=0\\partial\_\{t\}S\+v\\partial\_\{x\}S=0on a periodic grid ofNNcells, discretized by a first\-order upwind scheme with CFL numberν=v​Δ​t/Δ​x\\nu=v\\Delta t/\\Delta x\. Forν∈\(0,1\)\\nu\\in\(0,1\), the discrete time\-advance operator has‖Dx​ℱ‖op,ℓ2∈\[1−ν,1\]\\\|D\_\{x\}\\mathcal\{F\}\\\|\_\{\\mathrm\{op\},\\ell^\{2\}\}\\in\[1\-\\nu,\\,1\]and the bound in Theorem[4\.1](https://arxiv.org/html/2605.28909#S4.Thmtheorem1)\(a\) is tight to within a constant factor: for anyθ\\thetawith‖𝒢θ−ℱ‖∞=ε\\left\\lVert\\mathcal\{G\}\_\{\\theta\}\-\\mathcal\{F\}\\right\\rVert\_\{\\infty\}=\\varepsilon, there exists an initial conditionx0x\_\{0\}such that

‖x^n−xn‖ℓ2≥c​ε​n\\left\\lVert\\hat\{x\}\_\{n\}\-x\_\{n\}\\right\\rVert\_\{\\ell^\{2\}\}\\geq c\\,\\varepsilon\\,nfor a universal constantc\>0c\>0\.

###### Proof\.

The upwind Jacobian has spectrum\{1−ν​\(1−e−i​k​Δ​x\):k=0,…,N−1\}\\\{1\-\\nu\(1\-e^\{\-ik\\Delta x\}\):k=0,\\dots,N\-1\\\}, giving‖Dx​ℱ‖op,ℓ2=1\\\|D\_\{x\}\\mathcal\{F\}\\\|\_\{\\mathrm\{op\},\\ell^\{2\}\}=1fork=0k=0\(the DC mode\)\. For a perturbationε\\varepsilonaligned with the DC eigenvector \(a constant shift\),𝒢θ​\(x\)=ℱ​\(x\)\+ε​𝟏N\\mathcal\{G\}\_\{\\theta\}\(x\)=\\mathcal\{F\}\(x\)\+\\varepsilon\\mathbf\{1\}\_\{N\}\(constant modeling offset\)\. Then‖x^n−xn‖ℓ2=n​ε​‖𝟏N‖ℓ2=n​ε​N\\left\\lVert\\hat\{x\}\_\{n\}\-x\_\{n\}\\right\\rVert\_\{\\ell^\{2\}\}=n\\varepsilon\\left\\lVert\\mathbf\{1\}\_\{N\}\\right\\rVert\_\{\\ell^\{2\}\}=n\\varepsilon\\sqrt\{N\}, matching theL=1L=1bound withc=1/Nc=1/\\sqrt\{N\}\. ForL\>1L\>1, a perturbation aligned with the growing eigenmode \(exists whenν\>1\\nu\>1or near resonance\) gives exponential growth\. ∎

## 5Rollout Error: Sharp Analysis

### 5\.1The fundamental rollout error recursion

###### Theorem 5\.1\(Decomposed rollout error bound\)\.

LetL≥0L\\geq 0be the Lipschitz constant of𝒢θ\\mathcal\{G\}\_\{\\theta\}and define the open\-loop one\-step residual on true states:

en:=‖𝒢θ​\(ξn\)−xn\+1‖ϕ\.e\_\{n\}:=\\left\\lVert\\mathcal\{G\}\_\{\\theta\}\(\\xi\_\{n\}\)\-x\_\{n\+1\}\\right\\rVert\_\{\\phi\}\.\(25\)Then for alln≥1n\\geq 1:

1. \(a\)*General bound:* ‖δn‖ϕ≤∑j=0n−1Ln−1−j​ej\.\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\sum\_\{j=0\}^\{n\-1\}L^\{n\-1\-j\}\\,e\_\{j\}\.\(26\)
2. \(b\)*Uniform error*\(ej≤εe\_\{j\}\\leq\\varepsilonfor alljj\): ‖δn‖ϕ≤ε​Ln−1L−1\(L≠1\),‖δn‖ϕ≤n​ε\(L=1\)\.\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\varepsilon\\,\\frac\{L^\{n\}\-1\}\{L\-1\}\\quad\(L\\neq 1\),\\qquad\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq n\\varepsilon\\quad\(L=1\)\.\(27\)
3. \(c\)*Contractive case*\(L<1L<1\): ‖δn‖ϕ≤ε1−L,∀n≥0\.\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\frac\{\\varepsilon\}\{1\-L\},\\quad\\forall n\\geq 0\.\(28\)
4. \(d\)*Marginally unstable case*\(L=1\+ηL=1\+\\eta,0<η≪10<\\eta\\ll 1\): ‖δn‖ϕ≤ε​eη​n−1η≤ε​n​eη​nη​n=ε​n​eη​n\.\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\varepsilon\\,\\frac\{e^\{\\eta n\}\-1\}\{\\eta\}\\leq\\frac\{\\varepsilon ne^\{\\eta n\}\}\{\\eta n\}=\\varepsilon ne^\{\\eta n\}\.\(29\)

###### Proof\.

Part \(a\)\.Add and subtract𝒢θ​\(ξn\)\\mathcal\{G\}\_\{\\theta\}\(\\xi\_\{n\}\)evaluated at the true state:

‖δn\+1‖ϕ\\displaystyle\\left\\lVert\\delta\_\{n\+1\}\\right\\rVert\_\{\\phi\}=‖𝒢θ​\(u,x^n,cn,τn\)−xn\+1‖ϕ\\displaystyle=\\left\\lVert\\mathcal\{G\}\_\{\\theta\}\(u,\\hat\{x\}\_\{n\},c\_\{n\},\\tau\_\{n\}\)\-x\_\{n\+1\}\\right\\rVert\_\{\\phi\}≤‖𝒢θ​\(u,x^n,cn,τn\)−𝒢θ​\(u,xn,cn,τn\)‖ϕ\+‖𝒢θ​\(u,xn,cn,τn\)−xn\+1‖ϕ\\displaystyle\\leq\\left\\lVert\\mathcal\{G\}\_\{\\theta\}\(u,\\hat\{x\}\_\{n\},c\_\{n\},\\tau\_\{n\}\)\-\\mathcal\{G\}\_\{\\theta\}\(u,x\_\{n\},c\_\{n\},\\tau\_\{n\}\)\\right\\rVert\_\{\\phi\}\+\\left\\lVert\\mathcal\{G\}\_\{\\theta\}\(u,x\_\{n\},c\_\{n\},\\tau\_\{n\}\)\-x\_\{n\+1\}\\right\\rVert\_\{\\phi\}≤L​‖δn‖ϕ\+en\.\\displaystyle\\leq L\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\+e\_\{n\}\.This is the fundamental recursion‖δn\+1‖ϕ≤L​‖δn‖ϕ\+en\\left\\lVert\\delta\_\{n\+1\}\\right\\rVert\_\{\\phi\}\\leq L\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\+e\_\{n\}\. Withδ0=0\\delta\_\{0\}=0, unrolling gives \([26](https://arxiv.org/html/2605.28909#S5.E26)\):‖δn‖ϕ≤∑j=0n−1Ln−1−j​ej\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\sum\_\{j=0\}^\{n\-1\}L^\{n\-1\-j\}e\_\{j\}\.

Part \(b\)\.Setej=εe\_\{j\}=\\varepsilonand sum the geometric series:‖δn‖ϕ≤ε​∑j=0n−1Ln−1−j=ε​\(Ln−1\)/\(L−1\)\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\varepsilon\\sum\_\{j=0\}^\{n\-1\}L^\{n\-1\-j\}=\\varepsilon\(L^\{n\}\-1\)/\(L\-1\)\.

Part \(c\)\.ForL<1L<1,‖δn‖ϕ≤ε​∑k=0n−1Lk≤ε/\(1−L\)\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\varepsilon\\sum\_\{k=0\}^\{n\-1\}L^\{k\}\\leq\\varepsilon/\(1\-L\)\.

Part \(d\)\.ForL=1\+ηL=1\+\\eta,\(1\+η\)n≤eη​n\(1\+\\eta\)^\{n\}\\leq e^\{\\eta n\}\(Bernoulli inequality in reverse\), soε​\(Ln−1\)/\(L−1\)≤ε​\(eη​n−1\)/η≤ε​n​eη​n\\varepsilon\(L^\{n\}\-1\)/\(L\-1\)\\leq\\varepsilon\(e^\{\\eta n\}\-1\)/\\eta\\leq\\varepsilon ne^\{\\eta n\}\. ∎

###### Theorem 5\.2\(Dimension\-dependent amplification in mixed\-type systems\)\.

For the Norne system with statex=\(p,Sw,So,Sg\)∈ℝ4​Nx=\(p,S\_\{w\},S\_\{o\},S\_\{g\}\)\\in\\mathbb\{R\}^\{4N\}, writeδn=\(δnP,δnW,δnO,δnG\)\\delta\_\{n\}=\(\\delta\_\{n\}^\{P\},\\delta\_\{n\}^\{W\},\\delta\_\{n\}^\{O\},\\delta\_\{n\}^\{G\}\)for the component errors\. AssumeDx​ℱD\_\{x\}\\mathcal\{F\}has block structure reflecting the elliptic–hyperbolic coupling, with blocksAP​P,AP​S,AS​P,AS​SA^\{PP\},A^\{PS\},A^\{SP\},A^\{SS\}\(pressure–pressure, pressure–saturation, saturation–pressure, saturation–saturation\)\. Then:

1. \(a\)The pressure error satisfies ‖δnP‖ϕ≤\(LP\)n​‖δ0P‖ϕ\+∑j=0n−1\(LP\)n−1−j​\(ejP\+CP​S​‖δjS‖ϕ\),\\left\\lVert\\delta\_\{n\}^\{P\}\\right\\rVert\_\{\\phi\}\\leq\(L^\{P\}\)^\{n\}\\left\\lVert\\delta\_\{0\}^\{P\}\\right\\rVert\_\{\\phi\}\+\\sum\_\{j=0\}^\{n\-1\}\(L^\{P\}\)^\{n\-1\-j\}\(e\_\{j\}^\{P\}\+C\_\{PS\}\\left\\lVert\\delta\_\{j\}^\{S\}\\right\\rVert\_\{\\phi\}\),\(30\)whereLP≤1\+𝒪​\(\(Δ​t\)−1​α−1\)L^\{P\}\\leq 1\+\\mathcal\{O\}\(\(\\Delta t\)^\{\-1\}\\alpha^\{\-1\}\)is close to \(or below\) unity due to elliptic regularization, andCP​SC\_\{PS\}captures saturation\-to\-pressure coupling\.
2. \(b\)The gas saturation error can grow faster than the overall system Lipschitz constant suggests: ‖δnG‖ϕ≥cG​εG​\(LG\)n−1LG−1\\left\\lVert\\delta\_\{n\}^\{G\}\\right\\rVert\_\{\\phi\}\\geq c\_\{G\}\\,\\varepsilon^\{G\}\\,\\frac\{\(L^\{G\}\)^\{n\}\-1\}\{L^\{G\}\-1\}\(31\)for someLG\>LPL^\{G\}\>L^\{P\}when the mobility ratioM:=Kr​g/\(μg​Kr​o/μo\)\>1M:=K\_\{rg\}/\(\\mu\_\{g\}K\_\{ro\}/\\mu\_\{o\}\)\>1\(favorable displacement unstable\), which is typical in Norne\-type gas injection scenarios\.

###### Proof sketch\.

Part \(a\): The pressure blockRPR^\{P\}is governed by the pressure equation \([6](https://arxiv.org/html/2605.28909#S2.E6)\)\. The implicit Jacobian block\(∂RP/∂p\+\)−1\(\\partial R^\{P\}/\\partial p^\{\+\}\)^\{\-1\}is bounded above in operator norm byCellC\_\{\\mathrm\{ell\}\}\(a fixed multiple ofTtotT\_\{\\mathrm\{tot\}\}\-related constants\) due to elliptic coercivity of the pressure operator\. This givesLP<1L^\{P\}<1orLP≈1L^\{P\}\\approx 1, resulting in controlled error accumulation\. The coupling termCP​S​‖δjS‖C\_\{PS\}\\left\\lVert\\delta\_\{j\}^\{S\}\\right\\rVertaccounts for saturation changes feeding back into the pressure equation through the total mobilityTtotT\_\{\\mathrm\{tot\}\}\.

Part \(b\): The gas saturation blockRGR^\{G\}includes the nonlinear relative permeabilityKr​g​\(Sg\)K\_\{rg\}\(S\_\{g\}\)and the dissolved gas coupling\. ForM\>1M\>1, the gas front advances faster than stable displacement, and small perturbations inSgS\_\{g\}near the front are amplified by the characteristic speed mismatch\. A first\-order WKB analysis of the linearized transport equation near the front shows an effective amplification factorLG=LG​\(front velocity\)\>1L^\{G\}=L^\{G\}\(\\text\{front velocity\}\)\>1\. The lower bound \([31](https://arxiv.org/html/2605.28909#S5.E31)\) follows by constructing an explicit perturbation aligned with the unstable front mode\. ∎

### 5\.2The AR training advantage: Lipschitz constant reduction on the predicted manifold

###### Assumption 5\.3\(AR contractivity on predicted manifold\)\.

Under autoregressive training with lossℒAR\\mathcal\{L\}\_\{\\mathrm\{AR\}\}, the effective Lipschitz constant of𝒢θ\\mathcal\{G\}\_\{\\theta\}on the predicted\-state distributionμ^nθ\\hat\{\\mu\}\_\{n\}^\{\\theta\}satisfies

L~:=supn𝔼x^∼μ^nθ​\[‖Dx​𝒢θ​\(u,x^,c,τ\)‖op,ϕ\]<L,\\tilde\{L\}:=\\sup\_\{n\}\\mathbb\{E\}\_\{\\hat\{x\}\\sim\\hat\{\\mu\}\_\{n\}^\{\\theta\}\}\\bigl\[\\\|D\_\{x\}\\mathcal\{G\}\_\{\\theta\}\(u,\\hat\{x\},c,\\tau\)\\\|\_\{\\mathrm\{op\},\\phi\}\\bigr\]<L,whereLLis the global Lipschitz constant\.

###### Proposition 5\.4\(Stability improvement from AR training\)\.

Under Assumption[5\.3](https://arxiv.org/html/2605.28909#S5.Thmtheorem3), the expected rollout error under AR training satisfies

𝔼​‖δn‖ϕ≤L~n​𝔼​‖δ0‖ϕ\+ε1−L~\(L~<1\),\\mathbb\{E\}\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\tilde\{L\}^\{n\}\\,\\mathbb\{E\}\\left\\lVert\\delta\_\{0\}\\right\\rVert\_\{\\phi\}\+\\frac\{\\varepsilon\}\{1\-\\tilde\{L\}\}\\quad\(\\tilde\{L\}<1\),\(32\)whereas under 1–1 training withL\>1L\>1the bound from \([27](https://arxiv.org/html/2605.28909#S5.E27)\) grows as\(Ln−1\)​ε/\(L−1\)\(L^\{n\}\-1\)\\varepsilon/\(L\-1\)\. The relative improvement is

AR error bound1–1 error bound≤ε/\(1−L~\)ε​\(Ln−1\)/\(L−1\)→0as​n→∞​for​L\>1,L~<1\.\\frac\{\\text\{AR error bound\}\}\{\\text\{1\-\-1 error bound\}\}\\leq\\frac\{\\varepsilon/\(1\-\\tilde\{L\}\)\}\{\\varepsilon\(L^\{n\}\-1\)/\(L\-1\)\}\\to 0\\quad\\text\{as \}n\\to\\infty\\text\{ for \}L\>1,\\tilde\{L\}<1\.\(33\)

###### Proof\.

Under AR training, the effective recursion on the predicted manifold usesL~<1\\tilde\{L\}<1:𝔼​‖δn\+1‖ϕ≤L~​𝔼​‖δn‖ϕ\+𝔼​en≤L~​𝔼​‖δn‖ϕ\+ε\\mathbb\{E\}\\left\\lVert\\delta\_\{n\+1\}\\right\\rVert\_\{\\phi\}\\leq\\tilde\{L\}\\,\\mathbb\{E\}\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\+\\mathbb\{E\}e\_\{n\}\\leq\\tilde\{L\}\\,\\mathbb\{E\}\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\+\\varepsilon\. Iterating fromn=0n=0and usingδ0=0\\delta\_\{0\}=0:𝔼​‖δn‖ϕ≤ε​∑k=0n−1L~k≤ε/\(1−L~\)\\mathbb\{E\}\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\varepsilon\\sum\_\{k=0\}^\{n\-1\}\\tilde\{L\}^\{k\}\\leq\\varepsilon/\(1\-\\tilde\{L\}\)\. The ratio \([33](https://arxiv.org/html/2605.28909#S5.E33)\) follows directly\. ∎

## 6PINO: Physics\-Constrained Stability Theory

### 6\.1PINO objective and the physics\-consistent manifold

###### Definition 6\.1\(PINO loss\)\.

The PINO training objective augments the supervised loss with the discrete finite\-volume residual penalty:

ℒPINO​\(θ\):=ℒAR​\(θ\)\+λR​𝔼​\[1T​∑n=0T−1‖R​\(x^n\+1,x~n;u,cn,τn\)‖ϕ2\],\\mathcal\{L\}\_\{\\mathrm\{PINO\}\}\(\\theta\):=\\mathcal\{L\}\_\{\\mathrm\{AR\}\}\(\\theta\)\+\\lambda\_\{R\}\\,\\mathbb\{E\}\\\!\\left\[\\frac\{1\}\{T\}\\sum\_\{n=0\}^\{T\-1\}\\left\\lVert R\(\\hat\{x\}\_\{n\+1\},\\tilde\{x\}\_\{n\};u,c\_\{n\},\\tau\_\{n\}\)\\right\\rVert^\{2\}\_\{\\phi\}\\right\],\(34\)wherex~n\\tilde\{x\}\_\{n\}is eitherxnx\_\{n\}\(teacher\-forced physics penalty\) orx^n\\hat\{x\}\_\{n\}\(closed\-loop physics penalty\), andλR\>0\\lambda\_\{R\}\>0is the physics weight\.

###### Definition 6\.2\(Physics\-consistent manifold\)\.

For fixed\(u,cn,τn\)\(u,c\_\{n\},\\tau\_\{n\}\), define

ℳn:=\{\(a,b\)∈𝒳2:R​\(a,b;u,cn,τn\)=0\}\.\\mathcal\{M\}\_\{n\}:=\\\{\(a,b\)\\in\\mathcal\{X\}^\{2\}:R\(a,b;u,c\_\{n\},\\tau\_\{n\}\)=0\\\}\.\(35\)By Theorem[2\.6](https://arxiv.org/html/2605.28909#S2.Thmtheorem6),ℳn\\mathcal\{M\}\_\{n\}is the graph ofℱ\\mathcal\{F\}:ℳn=\{\(xn\+1,xn\):\(xn\+1,xn\)∈𝒳2,xn\+1=ℱ​\(xn;⋅\)\}\\mathcal\{M\}\_\{n\}=\\\{\(x\_\{n\+1\},x\_\{n\}\):\(x\_\{n\+1\},x\_\{n\}\)\\in\\mathcal\{X\}^\{2\},\\,x\_\{n\+1\}=\\mathcal\{F\}\(x\_\{n\};\\cdot\)\\\}\.

### 6\.2Jacobian structure on the physics\-consistent manifold

###### Lemma 6\.3\(Jacobian from implicit differentiation\)\.

Let𝒢θ​\(ξn\)=x^n\+1\\mathcal\{G\}\_\{\\theta\}\(\\xi\_\{n\}\)=\\hat\{x\}\_\{n\+1\}satisfyR​\(x^n\+1,xn;u,cn,τn\)=0R\(\\hat\{x\}\_\{n\+1\},x\_\{n\};u,c\_\{n\},\\tau\_\{n\}\)=0\(i\.e\., the prediction lies onℳn\\mathcal\{M\}\_\{n\}\)\. Then the JacobianJ:=Dx​𝒢θ​\(ξn\)∈ℝ4​N×4​NJ:=D\_\{x\}\\mathcal\{G\}\_\{\\theta\}\(\\xi\_\{n\}\)\\in\\mathbb\{R\}^\{4N\\times 4N\}satisfies

J=−\(∂R∂x\+\|x^n\+1\)−1​∂R∂x−\|xn\.J=\-\\left\(\\frac\{\\partial R\}\{\\partial x^\{\+\}\}\\bigg\|\_\{\\hat\{x\}\_\{n\+1\}\}\\right\)^\{\-1\}\\frac\{\\partial R\}\{\\partial x^\{\-\}\}\\bigg\|\_\{x\_\{n\}\}\.\(36\)Moreover, if𝒢θ=ℱ\\mathcal\{G\}\_\{\\theta\}=\\mathcal\{F\}exactly, thenJ=Jℱ:=Dx​ℱ​\(xn;⋅\)J=J\_\{\\mathcal\{F\}\}:=D\_\{x\}\\mathcal\{F\}\(x\_\{n\};\\cdot\)\.

###### Proof\.

DifferentiatingR​\(𝒢θ​\(u,x,c,τ\),x;⋅\)=0R\(\\mathcal\{G\}\_\{\\theta\}\(u,x,c,\\tau\),x;\\cdot\)=0with respect toxx:

∂R∂x\+⋅Dx​𝒢θ\+∂R∂x−=0,\\frac\{\\partial R\}\{\\partial x^\{\+\}\}\\cdot D\_\{x\}\\mathcal\{G\}\_\{\\theta\}\+\\frac\{\\partial R\}\{\\partial x^\{\-\}\}=0,which gives \([36](https://arxiv.org/html/2605.28909#S6.E36)\) upon inverting∂R/∂x\+\\partial R/\\partial x^\{\+\}\(which is invertible by Assumption[2\.4](https://arxiv.org/html/2605.28909#S2.Thmtheorem4)\)\. The second claim follows since if𝒢θ=ℱ\\mathcal\{G\}\_\{\\theta\}=\\mathcal\{F\}, this reduces to the implicit differentiation ofℱ\\mathcal\{F\}\. ∎

###### Assumption 6\.4\(Discrete dissipativity\)\.

The true time\-advance mapℱ\\mathcal\{F\}is dissipative in the pore\-volume metric: there exist0<ρℱ<10<\\rho\_\{\\mathcal\{F\}\}<1andR0\>0R\_\{0\}\>0such that

‖ℱ​\(x;⋅\)−ℱ​\(y;⋅\)‖ϕ≤ρℱ​‖x−y‖ϕfor all​x,y∈𝒳​with​‖x−y‖ϕ\>R0\.\\left\\lVert\\mathcal\{F\}\(x;\\cdot\)\-\\mathcal\{F\}\(y;\\cdot\)\\right\\rVert\_\{\\phi\}\\leq\\rho\_\{\\mathcal\{F\}\}\\left\\lVert x\-y\\right\\rVert\_\{\\phi\}\\quad\\text\{for all \}x,y\\in\\mathcal\{X\}\\text\{ with \}\\left\\lVert x\-y\\right\\rVert\_\{\\phi\}\>R\_\{0\}\.\(37\)For‖x−y‖ϕ≤R0\\left\\lVert x\-y\\right\\rVert\_\{\\phi\}\\leq R\_\{0\}\(near\-front regime\), the Lipschitz constant isLℱL\_\{\\mathcal\{F\}\}\. The global Lipschitz constant ismax⁡\(ρℱ,Lℱ\)\\max\(\\rho\_\{\\mathcal\{F\}\},L\_\{\\mathcal\{F\}\}\)\.

###### Theorem 6\.5\(Spectral stability under PINO training\)\.

Suppose Assumption[6\.4](https://arxiv.org/html/2605.28909#S6.Thmtheorem4)holds\. Letθ^\\hat\{\\theta\}be a minimizer ofℒPINO​\(θ\)\\mathcal\{L\}\_\{\\mathrm\{PINO\}\}\(\\theta\)over a hypothesis class such that for anyθ\\theta,𝒢θ\\mathcal\{G\}\_\{\\theta\}isC1C^\{1\}inxx\. Then for allξ=\(u,x,c,τ\)\\xi=\(u,x,c,\\tau\)withx∈ℳn,x:=\{a:\(a,b\)∈ℳn​for some​b\}x\\in\\mathcal\{M\}\_\{n,x\}:=\\\{a:\(a,b\)\\in\\mathcal\{M\}\_\{n\}\\text\{ for some \}b\\\}:

1. \(a\)The spectral radius of the learned Jacobian satisfies ϱ​\(Dx​𝒢θ^​\(ξ\)\)≤ρℱ\+CJλR,\\varrho\\bigl\(D\_\{x\}\\mathcal\{G\}\_\{\\hat\{\\theta\}\}\(\\xi\)\\bigr\)\\leq\\rho\_\{\\mathcal\{F\}\}\+\\frac\{C\_\{J\}\}\{\\sqrt\{\\lambda\_\{R\}\}\},\(38\)whereCJC\_\{J\}depends on the Sobolev regularity ofRR, the coercivity constantα\\alpha, and the Lipschitz constant of the relative permeability functions\.
2. \(b\)The induced operator norm satisfies ‖Dx​𝒢θ^​\(ξ\)‖op,ϕ≤ρℱ\+CJλR\.\\\|D\_\{x\}\\mathcal\{G\}\_\{\\hat\{\\theta\}\}\(\\xi\)\\\|\_\{\\mathrm\{op\},\\phi\}\\leq\\rho\_\{\\mathcal\{F\}\}\+\\frac\{C\_\{J\}\}\{\\sqrt\{\\lambda\_\{R\}\}\}\.\(39\)
3. \(c\)ForλR≥λR∗:=\(CJ/\(1−ρℱ\)\)2\\lambda\_\{R\}\\geq\\lambda\_\{R\}^\{\*\}:=\(C\_\{J\}/\(1\-\\rho\_\{\\mathcal\{F\}\}\)\)^\{2\}, the learned operator is contractive:‖Dx​𝒢θ^‖op,ϕ<1\\\|D\_\{x\}\\mathcal\{G\}\_\{\\hat\{\\theta\}\}\\\|\_\{\\mathrm\{op\},\\phi\}<1\.

###### Proof\.

Step 1: Residual magnitude at minimizer\.At a minimizerθ^\\hat\{\\theta\}ofℒPINO\\mathcal\{L\}\_\{\\mathrm\{PINO\}\}, standard KKT/first\-order optimality conditions imply

𝔼​\[‖R​\(𝒢θ^​\(ξn\),x~n;⋅\)‖ϕ2\]≤C0λR\\mathbb\{E\}\\\!\\left\[\\left\\lVert R\(\\mathcal\{G\}\_\{\\hat\{\\theta\}\}\(\\xi\_\{n\}\),\\tilde\{x\}\_\{n\};\\cdot\)\\right\\rVert^\{2\}\_\{\\phi\}\\right\]\\leq\\frac\{C\_\{0\}\}\{\\lambda\_\{R\}\}for some constantC0C\_\{0\}depending on the supervised loss value\. Letrn:=R​\(𝒢θ^​\(ξn\),xn;⋅\)r\_\{n\}:=R\(\\mathcal\{G\}\_\{\\hat\{\\theta\}\}\(\\xi\_\{n\}\),x\_\{n\};\\cdot\), so𝔼​\[‖rn‖ϕ2\]≤C0/λR\\mathbb\{E\}\[\\left\\lVert r\_\{n\}\\right\\rVert^\{2\}\_\{\\phi\}\]\\leq C\_\{0\}/\\lambda\_\{R\}\.

Step 2: Perturbed implicit differentiation\.From the residual equationR​\(x^n\+1,xn;⋅\)=rnR\(\\hat\{x\}\_\{n\+1\},x\_\{n\};\\cdot\)=r\_\{n\}\(off manifold byrnr\_\{n\}\), differentiating inxnx\_\{n\}:

∂R∂x\+⋅J\+∂R∂x−=∂rn∂x\.\\frac\{\\partial R\}\{\\partial x^\{\+\}\}\\cdot J\+\\frac\{\\partial R\}\{\\partial x^\{\-\}\}=\\frac\{\\partial r\_\{n\}\}\{\\partial x\}\.Thus

J=Jℱ\+\(∂R∂x\+\)−1​∂rn∂x,J=J\_\{\\mathcal\{F\}\}\+\\left\(\\frac\{\\partial R\}\{\\partial x^\{\+\}\}\\right\)^\{\-1\}\\frac\{\\partial r\_\{n\}\}\{\\partial x\},whereJℱ=−\(∂R/∂x\+\)−1​\(∂R/∂x−\)J\_\{\\mathcal\{F\}\}=\-\(\\partial R/\\partial x^\{\+\}\)^\{\-1\}\(\\partial R/\\partial x^\{\-\}\)is the true dynamics Jacobian \(cf\. Lemma[6\.3](https://arxiv.org/html/2605.28909#S6.Thmtheorem3)\)\.

Step 3: Norm bound\.Using coercivity \(‖\(∂R/∂x\+\)−1‖op,ϕ≤1/α\\\|\(\\partial R/\\partial x^\{\+\}\)^\{\-1\}\\\|\_\{\\mathrm\{op\},\\phi\}\\leq 1/\\alpha\):

‖J−Jℱ‖op,ϕ≤1α​‖∂rn∂x‖ϕ≤‖∇xrn‖Hϕ1α\.\\\|J\-J\_\{\\mathcal\{F\}\}\\\|\_\{\\mathrm\{op\},\\phi\}\\leq\\frac\{1\}\{\\alpha\}\\left\\\|\\frac\{\\partial r\_\{n\}\}\{\\partial x\}\\right\\\|\_\{\\phi\}\\leq\\frac\{\\left\\lVert\\nabla\_\{x\}r\_\{n\}\\right\\rVert\_\{H^\{1\}\_\{\\phi\}\}\}\{\\alpha\}\.By the Poincaré–Sobolev embedding and the residual bound:‖∇xrn‖L2≤CR​‖rn‖H1≤CR′​‖rn‖ϕ1/2​‖rn‖H11/2\\left\\lVert\\nabla\_\{x\}r\_\{n\}\\right\\rVert\_\{L^\{2\}\}\\leq C\_\{R\}\\left\\lVert r\_\{n\}\\right\\rVert\_\{H^\{1\}\}\\leq C\_\{R\}^\{\\prime\}\\left\\lVert r\_\{n\}\\right\\rVert\_\{\\phi\}^\{1/2\}\\left\\lVert r\_\{n\}\\right\\rVert\_\{H^\{1\}\}^\{1/2\}\(interpolation\), giving‖J−Jℱ‖op,ϕ≤CR′′​‖rn‖ϕ1/2/α≤CJ/λR\\\|J\-J\_\{\\mathcal\{F\}\}\\\|\_\{\\mathrm\{op\},\\phi\}\\leq C\_\{R\}^\{\\prime\\prime\}\\left\\lVert r\_\{n\}\\right\\rVert\_\{\\phi\}^\{1/2\}/\\alpha\\leq C\_\{J\}/\\sqrt\{\\lambda\_\{R\}\}\.

Step 4: Spectral radius\.By Weyl’s inequality for matrices,\|ϱ​\(J\)−ϱ​\(Jℱ\)\|≤‖J−Jℱ‖op,ϕ≤CJ/λR\|\\varrho\(J\)\-\\varrho\(J\_\{\\mathcal\{F\}\}\)\|\\leq\\\|J\-J\_\{\\mathcal\{F\}\}\\\|\_\{\\mathrm\{op\},\\phi\}\\leq C\_\{J\}/\\sqrt\{\\lambda\_\{R\}\}\. Sinceϱ​\(Jℱ\)≤ρℱ\\varrho\(J\_\{\\mathcal\{F\}\}\)\\leq\\rho\_\{\\mathcal\{F\}\}\(by Assumption[6\.4](https://arxiv.org/html/2605.28909#S6.Thmtheorem4)\),ϱ​\(J\)≤ρℱ\+CJ/λR\\varrho\(J\)\\leq\\rho\_\{\\mathcal\{F\}\}\+C\_\{J\}/\\sqrt\{\\lambda\_\{R\}\}, which gives \([38](https://arxiv.org/html/2605.28909#S6.E38)\) and \([39](https://arxiv.org/html/2605.28909#S6.E39)\)\.

Part \(c\)\.Settingρℱ\+CJ/λR<1\\rho\_\{\\mathcal\{F\}\}\+C\_\{J\}/\\sqrt\{\\lambda\_\{R\}\}<1givesλR\>\(CJ/\(1−ρℱ\)\)2=:λR∗\\lambda\_\{R\}\>\(C\_\{J\}/\(1\-\\rho\_\{\\mathcal\{F\}\}\)\)^\{2\}=:\\lambda\_\{R\}^\{\*\}\. ∎

###### Theorem 6\.6\(PINO residual as Jacobian spectral regularizer\)\.

The physics residual term inℒPINO\\mathcal\{L\}\_\{\\mathrm\{PINO\}\}penalizes the deviation of the learned Jacobian from the true dynamics Jacobian in a weighted operator norm\. Specifically, for the PINO minimizerθ^\\hat\{\\theta\}:

𝔼​\[‖Dx​𝒢θ^​\(ξn\)−Dx​ℱ​\(ξn\)‖op,ϕ2\]≤CJ2α2​λR\.\\mathbb\{E\}\\bigl\[\\\|D\_\{x\}\\mathcal\{G\}\_\{\\hat\{\\theta\}\}\(\\xi\_\{n\}\)\-D\_\{x\}\\mathcal\{F\}\(\\xi\_\{n\}\)\\\|\_\{\\mathrm\{op\},\\phi\}^\{2\}\\bigr\]\\leq\\frac\{C\_\{J\}^\{2\}\}\{\\alpha^\{2\}\\lambda\_\{R\}\}\.\(40\)Consequently, PINO training implicitly learns the spectral structure of the true dynamics Jacobian, with error in the learned spectrum decaying as𝒪​\(λR−1/2\)\\mathcal\{O\}\(\\lambda\_\{R\}^\{\-1/2\}\)\.

###### Proof\.

From Step 2–3 of the proof of Theorem[6\.5](https://arxiv.org/html/2605.28909#S6.Thmtheorem5):‖J−Jℱ‖op,ϕ≤CJ​‖rn‖ϕ/α\\\|J\-J\_\{\\mathcal\{F\}\}\\\|\_\{\\mathrm\{op\},\\phi\}\\leq C\_\{J\}\\left\\lVert r\_\{n\}\\right\\rVert\_\{\\phi\}/\\alpha\. Squaring and taking expectations:𝔼​\[‖J−Jℱ‖op,ϕ2\]≤\(CJ/α\)2​𝔼​\[‖rn‖ϕ2\]≤CJ2/\(α2​λR\)\\mathbb\{E\}\[\\\|J\-J\_\{\\mathcal\{F\}\}\\\|\_\{\\mathrm\{op\},\\phi\}^\{2\}\]\\leq\(C\_\{J\}/\\alpha\)^\{2\}\\mathbb\{E\}\[\\left\\lVert r\_\{n\}\\right\\rVert\_\{\\phi\}^\{2\}\]\\leq C\_\{J\}^\{2\}/\(\\alpha^\{2\}\\lambda\_\{R\}\), which is \([40](https://arxiv.org/html/2605.28909#S6.E40)\)\. The spectral claim follows from the expected form of Weyl’s inequality:𝔼​\[\|ϱ​\(J\)−ϱ​\(Jℱ\)\|2\]≤𝔼​\[‖J−Jℱ‖op,ϕ2\]≤CJ2/\(α2​λR\)\\mathbb\{E\}\[\|\\varrho\(J\)\-\\varrho\(J\_\{\\mathcal\{F\}\}\)\|^\{2\}\]\\leq\\mathbb\{E\}\[\\\|J\-J\_\{\\mathcal\{F\}\}\\\|\_\{\\mathrm\{op\},\\phi\}^\{2\}\]\\leq C\_\{J\}^\{2\}/\(\\alpha^\{2\}\\lambda\_\{R\}\)\. ∎

###### Theorem 6\.7\(Uniform\-in\-time PINO rollout bound\)\.

SupposeλR≥λR∗\\lambda\_\{R\}\\geq\\lambda\_\{R\}^\{\*\}\(so‖Dx​𝒢θ^‖op,ϕ≤ρ<1\\\|D\_\{x\}\\mathcal\{G\}\_\{\\hat\{\\theta\}\}\\\|\_\{\\mathrm\{op\},\\phi\}\\leq\\rho<1\) and the one\-step error is bounded byε\\varepsilon\. Then

‖δn‖ϕ≤ε1−ρ,∀n≥0\.\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\leq\\frac\{\\varepsilon\}\{1\-\\rho\},\\quad\\forall n\\geq 0\.\(41\)Moreover, if the training loss also ensuresen→0e\_\{n\}\\to 0as training progresses, thensupn‖δn‖ϕ→0\\sup\_\{n\}\\left\\lVert\\delta\_\{n\}\\right\\rVert\_\{\\phi\}\\to 0uniformly in time\.

###### Proof\.

This is Part \(c\) of Theorem[5\.1](https://arxiv.org/html/2605.28909#S5.Thmtheorem1)applied withL=ρ<1L=\\rho<1\. Under PINO training,L=‖Dx​𝒢θ^‖op,ϕ≤ρL=\\\|D\_\{x\}\\mathcal\{G\}\_\{\\hat\{\\theta\}\}\\\|\_\{\\mathrm\{op\},\\phi\}\\leq\\rhoby Theorem[6\.5](https://arxiv.org/html/2605.28909#S6.Thmtheorem5)\(c\)\. ∎

## 7KK\-Step Truncated Backpropagation Through Time

### 7\.1TBPTT as a biased stochastic gradient estimator

###### Definition 7\.1\(TBPTT gradient estimator\)\.

Partition\{0,…,T−1\}\\\{0,\\dots,T\-1\\\}into windows𝒲r=\{r​K,…,min⁡\(\(r\+1\)​K−1,T−1\)\}\\mathcal\{W\}\_\{r\}=\\\{rK,\\dots,\\min\(\(r\+1\)K\-1,T\-1\)\\\}forr=0,…,⌈T/K⌉−1r=0,\\dots,\\lceil T/K\\rceil\-1\. The TBPTT gradient estimator is

gK​\(θ\):=∑r=0⌈T/K⌉−1∇θ\(∑n∈𝒲rℓ​\(x^n\+1,xn\+1\)\)\|x^r​K=sg​\(x^r​K\),g\_\{K\}\(\\theta\):=\\sum\_\{r=0\}^\{\\lceil T/K\\rceil\-1\}\\nabla\_\{\\theta\}\\\!\\left\(\\sum\_\{n\\in\\mathcal\{W\}\_\{r\}\}\\ell\(\\hat\{x\}\_\{n\+1\},x\_\{n\+1\}\)\\right\)\\bigg\|\_\{\\hat\{x\}\_\{rK\}=\\mathrm\{sg\}\(\\hat\{x\}\_\{rK\}\)\},\(42\)wheresg\\mathrm\{sg\}denotes the stop\-gradient \(detach\) operation: the predicted state at each window boundary is treated as a constant for the purposes of gradient computation\.

### 7\.2Bias decay theorem

###### Theorem 7\.3\(TBPTT gradient bias decay\)\.

Letg​\(θ\):=∇θℒAR​\(θ\)g\(\\theta\):=\\nabla\_\{\\theta\}\\mathcal\{L\}\_\{\\mathrm\{AR\}\}\(\\theta\)be the exact AR gradient\. Assume the closed\-loop dynamics satisfy the average contractivity condition: there existsρ∈\[0,1\)\\rho\\in\[0,1\)such that

supn𝔼​\[‖Dx​𝒢θ​\(ξ^n\)‖op,ϕ\]≤ρ\.\\sup\_\{n\}\\mathbb\{E\}\\bigl\[\\\|D\_\{x\}\\mathcal\{G\}\_\{\\theta\}\(\\hat\{\\xi\}\_\{n\}\)\\\|\_\{\\mathrm\{op\},\\phi\}\\bigr\]\\leq\\rho\.\(43\)Then the bias of the TBPTT estimator \([42](https://arxiv.org/html/2605.28909#S7.E42)\) satisfies:

‖𝔼​\[g​\(θ\)\]−𝔼​\[gK​\(θ\)\]‖≤Cg​ρK1−ρ⋅T,\\left\\lVert\\mathbb\{E\}\[g\(\\theta\)\]\-\\mathbb\{E\}\[g\_\{K\}\(\\theta\)\]\\right\\rVert\\leq\\frac\{C\_\{g\}\\,\\rho^\{K\}\}\{1\-\\rho\}\\cdot T,\(44\)whereCg:=supn,θ𝔼​\[‖Dθ​𝒢θ​\(ξ^n\)‖ϕ→ℝP⋅‖∂ℓ/∂x^n\+1‖ϕ\]C\_\{g\}:=\\sup\_\{n,\\theta\}\\mathbb\{E\}\[\\left\\lVert D\_\{\\theta\}\\mathcal\{G\}\_\{\\theta\}\(\\hat\{\\xi\}\_\{n\}\)\\right\\rVert\_\{\\phi\\to\\mathbb\{R\}^\{P\}\}\\cdot\\left\\lVert\\partial\\ell/\\partial\\hat\{x\}\_\{n\+1\}\\right\\rVert\_\{\\phi\}\]\.

###### Proof\.

The gradient bias is

𝔼​\[g​\(θ\)\]−𝔼​\[gK​\(θ\)\]=1T​∑n=0T−1∑m=0⌊m/K⌋<⌊n/K⌋n−1𝔼​\[∂ℓ∂x^n\+1​∏j=m\+1nDx​𝒢θ​\(ξ^j\)​Dθ​𝒢θ​\(ξ^m\)\],\\mathbb\{E\}\[g\(\\theta\)\]\-\\mathbb\{E\}\[g\_\{K\}\(\\theta\)\]=\\frac\{1\}\{T\}\\sum\_\{n=0\}^\{T\-1\}\\sum\_\{\\begin\{subarray\}\{c\}m=0\\\\ \\lfloor m/K\\rfloor<\\lfloor n/K\\rfloor\\end\{subarray\}\}^\{n\-1\}\\mathbb\{E\}\\\!\\left\[\\frac\{\\partial\\ell\}\{\\partial\\hat\{x\}\_\{n\+1\}\}\\prod\_\{j=m\+1\}^\{n\}D\_\{x\}\\mathcal\{G\}\_\{\\theta\}\(\\hat\{\\xi\}\_\{j\}\)D\_\{\\theta\}\\mathcal\{G\}\_\{\\theta\}\(\\hat\{\\xi\}\_\{m\}\)\\right\],where the inner sum runs over all cross\-window pairs \(mmandnnin different windows\)\. For such pairs,n−m≥Kn\-m\\geq K\. Taking norms and using the bound \([43](https://arxiv.org/html/2605.28909#S7.E43)\) and the submultiplicativity of operator norms:

‖𝔼​\[∂ℓ∂x^n\+1​∏j=m\+1nDx​𝒢θ​\(ξ^j\)​Dθ​𝒢θ​\(ξ^m\)\]‖\\displaystyle\\left\\lVert\\mathbb\{E\}\\\!\\left\[\\frac\{\\partial\\ell\}\{\\partial\\hat\{x\}\_\{n\+1\}\}\\prod\_\{j=m\+1\}^\{n\}D\_\{x\}\\mathcal\{G\}\_\{\\theta\}\(\\hat\{\\xi\}\_\{j\}\)D\_\{\\theta\}\\mathcal\{G\}\_\{\\theta\}\(\\hat\{\\xi\}\_\{m\}\)\\right\]\\right\\rVert≤𝔼​\[‖∂ℓ∂x^n\+1‖ϕ​∏j=m\+1n‖Dx​𝒢θ​\(ξ^j\)‖op,ϕ⋅‖Dθ​𝒢θ​\(ξ^m\)‖ϕ→ℝP\]\\displaystyle\\leq\\mathbb\{E\}\\\!\\left\[\\left\\lVert\\frac\{\\partial\\ell\}\{\\partial\\hat\{x\}\_\{n\+1\}\}\\right\\rVert\_\{\\phi\}\\prod\_\{j=m\+1\}^\{n\}\\\|D\_\{x\}\\mathcal\{G\}\_\{\\theta\}\(\\hat\{\\xi\}\_\{j\}\)\\\|\_\{\\mathrm\{op\},\\phi\}\\cdot\\left\\lVert D\_\{\\theta\}\\mathcal\{G\}\_\{\\theta\}\(\\hat\{\\xi\}\_\{m\}\)\\right\\rVert\_\{\\phi\\to\\mathbb\{R\}^\{P\}\}\\right\]≤Cg​ρn−m≤Cg​ρK\.\\displaystyle\\leq C\_\{g\}\\,\\rho^\{n\-m\}\\leq C\_\{g\}\\,\\rho^\{K\}\.The number of cross\-window pairs\(m,n\)\(m,n\)withn−m≥Kn\-m\\geq Kis at mostT​\(T−1\)/2≤T2/2T\(T\-1\)/2\\leq T^\{2\}/2\. However, for eachnn, the sum overmmtelescopes:∑m:n−m≥Kρn−m≤∑s=K∞ρs=ρK/\(1−ρ\)\\sum\_\{m:n\-m\\geq K\}\\rho^\{n\-m\}\\leq\\sum\_\{s=K\}^\{\\infty\}\\rho^\{s\}=\\rho^\{K\}/\(1\-\\rho\)\. Summing overn=0,…,T−1n=0,\\dots,T\-1and dividing byTT:

‖𝔼​\[g\]−𝔼​\[gK\]‖≤1T​∑n=0T−1Cg​ρK1−ρ=Cg​ρK1−ρ,\\left\\lVert\\mathbb\{E\}\[g\]\-\\mathbb\{E\}\[g\_\{K\}\]\\right\\rVert\\leq\\frac\{1\}\{T\}\\sum\_\{n=0\}^\{T\-1\}\\frac\{C\_\{g\}\\rho^\{K\}\}\{1\-\\rho\}=\\frac\{C\_\{g\}\\rho^\{K\}\}\{1\-\\rho\},which is \([44](https://arxiv.org/html/2605.28909#S7.E44)\) \(up to the factorTTfrom the unnormalized version\)\. The factorTTappears when working with the unnormalized loss∑nℓ\\sum\_\{n\}\\ellrather than the averagedℒAR=T−1​∑nℓ\\mathcal\{L\}\_\{\\mathrm\{AR\}\}=T^\{\-1\}\\sum\_\{n\}\\ell\. ∎

###### Corollary 7\.4\(Bias–variance optimal window size\)\.

LetσK2:=𝕍​\[gK​\(θ\)\]/ℜ\\sigma\_\{K\}^\{2\}:=\\mathbb\{V\}\[g\_\{K\}\(\\theta\)\]/\\mathfrak\{R\}be the per\-window gradient variance \(approximately constant inKKfor moderateKK\)\. The mean\-squared error ofgKg\_\{K\}as an estimator ofggis

MSE​\(K\)=‖𝔼​\[g\]−𝔼​\[gK\]‖2\+σK2≤Cg2​ρ2​K\(1−ρ\)2\+σ2​KT,\\mathrm\{MSE\}\(K\)=\\left\\lVert\\mathbb\{E\}\[g\]\-\\mathbb\{E\}\[g\_\{K\}\]\\right\\rVert^\{2\}\+\\sigma\_\{K\}^\{2\}\\leq\\frac\{C\_\{g\}^\{2\}\\rho^\{2K\}\}\{\(1\-\\rho\)^\{2\}\}\+\\frac\{\\sigma^\{2\}K\}\{T\},\(45\)where the second term reflects that largerKKmeans fewer independent windows and higher gradient variance\. Minimizing overK≥1K\\geq 1:

K∗=log⁡\(2​T​Cg2​log⁡\(1/ρ\)σ2​\(1−ρ\)2\)2​log⁡\(1/ρ\),K^\{\*\}=\\frac\{\\log\\\!\\left\(\\frac\{2TC\_\{g\}^\{2\}\\log\(1/\\rho\)\}\{\\sigma^\{2\}\(1\-\\rho\)^\{2\}\}\\right\)\}\{2\\log\(1/\\rho\)\},\(46\)which grows logarithmically inTTand decreases asρ→0\\rho\\to 0\(more contractive dynamics allow shorter windows with the same bias\)\.

###### Proof\.

The MSE is the sum of bias squared and variance\. The variance termσK2≈σ2​K/T\\sigma\_\{K\}^\{2\}\\approx\\sigma^\{2\}K/Treflects that for fixed total data, a window of sizeKKgivesT/KT/Kindependent gradient estimates, each with varianceσ2\\sigma^\{2\}, so the aggregated variance scales as\(T/K\)−1⋅σ2=σ2​K/T\(T/K\)^\{\-1\}\\cdot\\sigma^\{2\}=\\sigma^\{2\}K/T\. DifferentiatingMSE​\(K\)\\mathrm\{MSE\}\(K\)with respect toKK\(treatingKKas continuous\):

dd​K​MSE​\(K\)=−2​Cg2​log⁡\(1/ρ\)​ρ2​K\(1−ρ\)2\+σ2T=0,\\frac\{\\,\\mathrm\{d\}\}\{\\,\\mathrm\{d\}K\}\\mathrm\{MSE\}\(K\)=\-\\frac\{2C\_\{g\}^\{2\}\\log\(1/\\rho\)\\rho^\{2K\}\}\{\(1\-\\rho\)^\{2\}\}\+\\frac\{\\sigma^\{2\}\}\{T\}=0,givingρ2​K∗=σ2​\(1−ρ\)2/\(2​T​Cg2​log⁡\(1/ρ\)\)\\rho^\{2K^\{\*\}\}=\\sigma^\{2\}\(1\-\\rho\)^\{2\}/\(2TC\_\{g\}^\{2\}\\log\(1/\\rho\)\), from which \([46](https://arxiv.org/html/2605.28909#S7.E46)\) follows by taking logarithms\. ∎

###### Proposition 7\.5\(Convergence rate under Adam with TBPTT\)\.

Suppose the PINO\-AR objectiveℒPINO​\(θ\)\\mathcal\{L\}\_\{\\mathrm\{PINO\}\}\(\\theta\)isβ1\\beta\_\{1\}\-smooth and satisfies the Polyak–Łojasiewicz \(PL\) condition:‖∇ℒPINO​\(θ\)‖2≥2​μ​\(ℒPINO​\(θ\)−ℒ∗\)\\left\\lVert\\nabla\\mathcal\{L\}\_\{\\mathrm\{PINO\}\}\(\\theta\)\\right\\rVert^\{2\}\\geq 2\\mu\(\\mathcal\{L\}\_\{\\mathrm\{PINO\}\}\(\\theta\)\-\\mathcal\{L\}^\{\*\}\)for someμ\>0\\mu\>0\. Letθ\(t\)\\theta^\{\(t\)\}be the Adam iterates with learning rateηt\\eta\_\{t\}, using the TBPTT gradient estimatorgKg\_\{K\}\. WithK=K∗K=K^\{\*\}from \([46](https://arxiv.org/html/2605.28909#S7.E46)\) and step sizesηt=η/t\\eta\_\{t\}=\\eta/\\sqrt\{t\}, the expected suboptimality satisfies

𝔼​\[ℒPINO​\(θ\(t\)\)−ℒ∗\]≤CAdamt\+Cg​ρK∗1−ρ,\\mathbb\{E\}\\bigl\[\\mathcal\{L\}\_\{\\mathrm\{PINO\}\}\(\\theta^\{\(t\)\}\)\-\\mathcal\{L\}^\{\*\}\\bigr\]\\leq\\frac\{C\_\{\\mathrm\{Adam\}\}\}\{\\sqrt\{t\}\}\+\\frac\{C\_\{g\}\\rho^\{K^\{\*\}\}\}\{1\-\\rho\},\(47\)whereCAdamC\_\{\\mathrm\{Adam\}\}depends onμ\\mu,β1\\beta\_\{1\},η\\eta, andσ\\sigma, and the second term is the irreducible TBPTT bias\. ForK=K∗K=K^\{\*\}, the TBPTT bias is𝒪​\(1/T\)\\mathcal\{O\}\(1/\\sqrt\{T\}\), matching the statistical rate\.

###### Proof\.

Standard Adam convergence under the PL condition and stochastic gradients with bounded biasb:=‖𝔼​\[g\]−𝔼​\[gK\]‖b:=\\left\\lVert\\mathbb\{E\}\[g\]\-\\mathbb\{E\}\[g\_\{K\}\]\\right\\rVertgives rate𝒪​\(1/t\)\+b\\mathcal\{O\}\(1/\\sqrt\{t\}\)\+b\. The bias termb≤Cg​ρK/\(1−ρ\)b\\leq C\_\{g\}\\rho^\{K\}/\(1\-\\rho\)from Theorem[7\.3](https://arxiv.org/html/2605.28909#S7.Thmtheorem3)\. AtK=K∗K=K^\{\*\},ρK∗=𝒪​\(1/T\)\\rho^\{K^\{\*\}\}=\\mathcal\{O\}\(1/\\sqrt\{T\}\)\(from the definition ofK∗K^\{\*\}\), giving the stated bound\. ∎

## 8FNO Architecture: Approximation Theory for Mixed\-Type PDEs

### 8\.1Neural operators in Banach spaces

###### Theorem 8\.1\(Universal approximation for neural operators\[[16](https://arxiv.org/html/2605.28909#bib.bib16)\]\)\.

LetV=Hs​\(Ω\)V=H^\{s\}\(\\Omega\)andW=Hs′​\(Ω\)W=H^\{s^\{\\prime\}\}\(\\Omega\)fors,s′≥0s,s^\{\\prime\}\\geq 0\. LetG:V→WG:V\\to Wbe a continuous nonlinear operator\. Then for any compact𝒦⊂V\\mathcal\{K\}\\subset Vandε\>0\\varepsilon\>0, there exists a neural operatorGθG\_\{\\theta\}\(in the FNO family\) such that

supv∈𝒦‖Gθ​\(v\)−G​\(v\)‖W<ε\.\\sup\_\{v\\in\\mathcal\{K\}\}\\left\\lVert G\_\{\\theta\}\(v\)\-G\(v\)\\right\\rVert\_\{W\}<\\varepsilon\.

### 8\.2FNO layer and spectral parameterization

An FNO layer transforms a latent channel fieldv\(ℓ\)∈ℝdv×Nv^\{\(\\ell\)\}\\in\\mathbb\{R\}^\{d\_\{v\}\\times N\}:

v\(ℓ\+1\)​\(x\)=σ​\(W\(ℓ\)​v\(ℓ\)​\(x\)\+\(ℱ−1​\[R\(ℓ\)⋅v^\(ℓ\)\]\)​\(x\)\),v^\{\(\\ell\+1\)\}\(x\)=\\sigma\\\!\\left\(W^\{\(\\ell\)\}v^\{\(\\ell\)\}\(x\)\+\\bigl\(\\mathcal\{F\}^\{\-1\}\[R^\{\(\\ell\)\}\\cdot\\hat\{v\}^\{\(\\ell\)\}\]\\bigr\)\(x\)\\right\),\(48\)wherev^\(ℓ\):=ℱ​\[v\(ℓ\)\]\\hat\{v\}^\{\(\\ell\)\}:=\\mathcal\{F\}\[v^\{\(\\ell\)\}\]is the discrete Fourier transform \(DFT\) on the grid,R\(ℓ\)​\(k\)∈ℂdv×dvR^\{\(\\ell\)\}\(k\)\\in\\mathbb\{C\}^\{d\_\{v\}\\times d\_\{v\}\}are learnable complex weights truncated tok∈\[−kmax,kmax\]3k\\in\[\-k\_\{\\max\},k\_\{\\max\}\]^\{3\},W\(ℓ\)∈ℝdv×dvW^\{\(\\ell\)\}\\in\\mathbb\{R\}^\{d\_\{v\}\\times d\_\{v\}\}is a pointwise linear map, andσ\\sigmais a nonlinear activation\. The spectral truncation introduces a bias toward low\-frequency features\.

###### Proposition 8\.2\(Spectral approximation rates for elliptic vs\. hyperbolic\)\.

1. \(a\)*Elliptic \(pressure\)\.*Ifp∈Hs\+2​\(Ω\)p\\in H^\{s\+2\}\(\\Omega\)is the pressure field \(from elliptic regularity applied to \([6](https://arxiv.org/html/2605.28909#S2.E6)\) withf∈Hs​\(Ω\)f\\in H^\{s\}\(\\Omega\)\), the FNO truncation error satisfies ‖p−pkmax‖L2≤Cell​kmax−\(s\+2\)​‖f‖Hs\.\\left\\lVert p\-p\_\{k\_\{\\max\}\}\\right\\rVert\_\{L^\{2\}\}\\leq C\_\{\\mathrm\{ell\}\}\\,k\_\{\\max\}^\{\-\(s\+2\)\}\\left\\lVert f\\right\\rVert\_\{H^\{s\}\}\.\(49\)Thekmax−\(s\+2\)k\_\{\\max\}^\{\-\(s\+2\)\}decay is fast for smooth source terms\.
2. \(b\)*Hyperbolic \(gas saturation\)\.*IfSgS\_\{g\}has a sharp front of widthδ≪1\\delta\\ll 1\(discontinuity or near\-discontinuity\), the Gibbs phenomenon gives ‖Sg−Sg,kmax‖L2≥cG​kmax−1/2,\\left\\lVert S\_\{g\}\-S\_\{g,k\_\{\\max\}\}\\right\\rVert\_\{L^\{2\}\}\\geq c\_\{G\}\\,k\_\{\\max\}^\{\-1/2\},\(50\)wherecG\>0c\_\{G\}\>0is a constant depending on the jump magnitude\. Thekmax−1/2k\_\{\\max\}^\{\-1/2\}rate is the optimalL2L^\{2\}rate for functions of bounded variation with a single jump discontinuity\.

Consequently, FNO with fixedkmaxk\_\{\\max\}has dramatically better approximation quality for pressure than for gas saturation, explaining the differentialR2R^\{2\}behavior observed in Figure[2](https://arxiv.org/html/2605.28909#S11.F2)\.

###### Proof\.

Part \(a\): Standard Sobolev spectral approximation theory\. The Fourier coefficients ofppdecay as\|p^​\(k\)\|≤C​\(1\+\|k\|2\)−\(s\+2\)/2\|\\hat\{p\}\(k\)\|\\leq C\(1\+\|k\|^\{2\}\)^\{\-\(s\+2\)/2\}by elliptic regularity, and truncation at\|k\|≤kmax\|k\|\\leq k\_\{\\max\}gives‖p−pkmax‖L22=∑\|k\|\>kmax\|p^​\(k\)\|2≤C2​∑\|k\|\>kmax\(1\+\|k\|2\)−\(s\+2\)≤C′⁣2​kmax−\(2​s\+4−d\)\\left\\lVert p\-p\_\{k\_\{\\max\}\}\\right\\rVert\_\{L^\{2\}\}^\{2\}=\\sum\_\{\|k\|\>k\_\{\\max\}\}\|\\hat\{p\}\(k\)\|^\{2\}\\leq C^\{2\}\\sum\_\{\|k\|\>k\_\{\\max\}\}\(1\+\|k\|^\{2\}\)^\{\-\(s\+2\)\}\\leq C^\{\\prime 2\}k\_\{\\max\}^\{\-\(2s\+4\-d\)\}\(ford=3d=3\), which is \([49](https://arxiv.org/html/2605.28909#S8.E49)\)\.

Part \(b\): For a function with a jump discontinuity in one direction, the partial Fourier sum overshoots by the Gibbs constant≈8\.9%\\approx 8\.9\\%of the jump magnitude near the discontinuity\. TheL2L^\{2\}error for truncation at\|k\|≤kmax\|k\|\\leq k\_\{\\max\}of a function inB​V​\(Ω\)BV\(\\Omega\)\(bounded variation\) satisfies‖Sg−Sg,kmax‖L2≥c​kmax−1/2\\left\\lVert S\_\{g\}\-S\_\{g,k\_\{\\max\}\}\\right\\rVert\_\{L^\{2\}\}\\geq ck\_\{\\max\}^\{\-1/2\}\[[40](https://arxiv.org/html/2605.28909#bib.bib40)\], which gives \([50](https://arxiv.org/html/2605.28909#S8.E50)\)\. ∎

## 9Training Algorithm

Algorithm[1](https://arxiv.org/html/2605.28909#alg1)presents the complete autoregressive PINO training step withKK\-step TBPTT and a Peacemann well\-response head\. It directly implements the objectives in Definitions[3\.2](https://arxiv.org/html/2605.28909#S3.Thmtheorem2)and[6\.1](https://arxiv.org/html/2605.28909#S6.Thmtheorem1), with the detach operation at TBPTT boundaries corresponding to the stop\-gradient in Definition[7\.1](https://arxiv.org/html/2605.28909#S7.Thmtheorem1)\.

Algorithm 1training\_step: AR PINO \+KK\-step TBPTT \+ Peacemann head1:Model

GθG\_\{\\theta\}; horizon

TT; TBPTT window

KK; physics weight

λR\\lambda\_\{R\}; component weights

wP,wW,wO,wG,wΠw\_\{P\},w\_\{W\},w\_\{O\},w\_\{G\},w\_\{\\Pi\}\.

2:Inputs:

𝐗​\[k\]∈ℝB×T×nz×nx×ny\\mathbf\{X\}\[k\]\\in\\mathbb\{R\}^\{B\\times T\\times n\_\{z\}\\times n\_\{x\}\\times n\_\{y\}\}for

k∈\{perm,poro,fault,Q,Qg,Qw,dt,t,pini,sini,sgini,soini\}k\\in\\\{\\texttt\{perm,poro,fault,Q,Qg,Qw,dt,t,pini,sini,sgini,soini\}\\\}\.

3:Targets:

𝐘P,𝐘W,𝐘O,𝐘G∈ℝB×T×nz×nx×ny\\mathbf\{Y\}\_\{P\},\\mathbf\{Y\}\_\{W\},\\mathbf\{Y\}\_\{O\},\\mathbf\{Y\}\_\{G\}\\in\\mathbb\{R\}^\{B\\times T\\times n\_\{z\}\\times n\_\{x\}\\times n\_\{y\}\}; Peacemann target

𝐘Π∈ℝB×66×T\\mathbf\{Y\}\_\{\\Pi\}\\in\\mathbb\{R\}^\{B\\times 66\\times T\}\.

4:Loss scalar

ℓ∈ℝ\\ell\\in\\mathbb\{R\}; updated metrics

Rα2R^\{2\}\_\{\\alpha\}, RMSEαfor

α∈\{P,W,O,G\}\\alpha\\in\\\{P,W,O,G\\\}\.

5:

ℓwin←0\\ell\_\{\\mathrm\{win\}\}\\leftarrow 0;

ℓtotal←0\\ell\_\{\\mathrm\{total\}\}\\leftarrow 0;

𝐬^−1←∅\\hat\{\\mathbf\{s\}\}\_\{\-1\}\\leftarrow\\varnothing\.// Window accumulator, running total, prev\. state

6:for

n=0n=0to

T−1T\-1do// AR loop over timesteps

7:if

n=0n=0then

𝐗\(n\)←slice​𝐗​at step 0\\mathbf\{X\}^\{\(n\)\}\\leftarrow\\text\{slice \}\\mathbf\{X\}\\text\{ at step 0\}\.// First step: use true initial state from data

8:else

𝐗\(n\)←slice​𝐗​at step​n\\mathbf\{X\}^\{\(n\)\}\\leftarrow\\text\{slice \}\\mathbf\{X\}\\text\{ at step \}n, but replacepini,sini,sgini,soiniwith

\(P^n−1,S^wn−1,S^gn−1,S^on−1\)\(\\hat\{P\}^\{n\-1\},\\hat\{S\}\_\{w\}^\{n\-1\},\\hat\{S\}\_\{g\}^\{n\-1\},\\hat\{S\}\_\{o\}^\{n\-1\}\)\.// AR: feed predicted state as new initial condition

9:endif

10:

𝐔\(n\)←Pack​\(𝐗\(n\)\)\\mathbf\{U\}^\{\(n\)\}\\leftarrow\\mathrm\{Pack\}\(\\mathbf\{X\}^\{\(n\)\}\)\.// Stack all input channels into single tensor

11:// Forward pass: predict pressure and saturations

12:

P^n←Gθ​\(𝐔\(n\),pressure\)∈ℝB×1×nz×nx×ny\\hat\{P\}^\{n\}\\leftarrow G\_\{\\theta\}\(\\mathbf\{U\}^\{\(n\)\},\\texttt\{pressure\}\)\\in\\mathbb\{R\}^\{B\\times 1\\times n\_\{z\}\\times n\_\{x\}\\times n\_\{y\}\}\.

13:

S^wn,S^on,S^gn←Gθ​\(𝐔\(n\),saturation\)∈ℝB×3×nz×nx×ny\\hat\{S\}\_\{w\}^\{n\},\\hat\{S\}\_\{o\}^\{n\},\\hat\{S\}\_\{g\}^\{n\}\\leftarrow G\_\{\\theta\}\(\\mathbf\{U\}^\{\(n\)\},\\texttt\{saturation\}\)\\in\\mathbb\{R\}^\{B\\times 3\\times n\_\{z\}\\times n\_\{x\}\\times n\_\{y\}\}\.

14:// Supervised loss at stepnn

15:

ℓsup\(n\)←wP​ℒ​\(P^n,𝐘P,n\)\+wW​ℒ​\(S^wn,𝐘W,n\)\+wO​ℒ​\(S^on,𝐘O,n\)\+wG​ℒ​\(S^gn,𝐘G,n\)\\ell^\{\(n\)\}\_\{\\mathrm\{sup\}\}\\leftarrow w\_\{P\}\\,\\mathcal\{L\}\(\\hat\{P\}^\{n\},\\mathbf\{Y\}\_\{P,n\}\)\+w\_\{W\}\\,\\mathcal\{L\}\(\\hat\{S\}\_\{w\}^\{n\},\\mathbf\{Y\}\_\{W,n\}\)\+w\_\{O\}\\,\\mathcal\{L\}\(\\hat\{S\}\_\{o\}^\{n\},\\mathbf\{Y\}\_\{O,n\}\)\+w\_\{G\}\\,\\mathcal\{L\}\(\\hat\{S\}\_\{g\}^\{n\},\\mathbf\{Y\}\_\{G,n\}\)\.

16:ifPINO enabledand\(epoch

mod\\bmodpino\_freq

=0=0\)then// PINO residual \(evaluated periodically for efficiency\)

17:

𝐫P\(n\),𝐫S\(n\),𝐫G\(n\)←BlackOilResidual​\(𝐗\(n\),P^n,S^wn,S^on,S^gn\)\\mathbf\{r\}\_\{P\}^\{\(n\)\},\\mathbf\{r\}\_\{S\}^\{\(n\)\},\\mathbf\{r\}\_\{G\}^\{\(n\)\}\\leftarrow\\mathrm\{BlackOilResidual\}\(\\mathbf\{X\}^\{\(n\)\},\\hat\{P\}^\{n\},\\hat\{S\}\_\{w\}^\{n\},\\hat\{S\}\_\{o\}^\{n\},\\hat\{S\}\_\{g\}^\{n\}\)\.// Eq\. \([7](https://arxiv.org/html/2605.28909#S2.E7)\)

18:

ℓphys\(n\)←λR​\(λP​‖𝐫P\(n\)‖2\+λS​‖𝐫S\(n\)‖2\+λG​‖𝐫G\(n\)‖2\)\\ell^\{\(n\)\}\_\{\\mathrm\{phys\}\}\\leftarrow\\lambda\_\{R\}\(\\lambda\_\{P\}\\left\\lVert\\mathbf\{r\}\_\{P\}^\{\(n\)\}\\right\\rVert^\{2\}\+\\lambda\_\{S\}\\left\\lVert\\mathbf\{r\}\_\{S\}^\{\(n\)\}\\right\\rVert^\{2\}\+\\lambda\_\{G\}\\left\\lVert\\mathbf\{r\}\_\{G\}^\{\(n\)\}\\right\\rVert^\{2\}\)\.

19:else

ℓphys\(n\)←0\\ell^\{\(n\)\}\_\{\\mathrm\{phys\}\}\\leftarrow 0\.

20:endif

21:

ℓwin←ℓwin\+ℓsup\(n\)\+ℓphys\(n\)\\ell\_\{\\mathrm\{win\}\}\\leftarrow\\ell\_\{\\mathrm\{win\}\}\+\\ell^\{\(n\)\}\_\{\\mathrm\{sup\}\}\+\\ell^\{\(n\)\}\_\{\\mathrm\{phys\}\}\.

22:ifTBPTTand\(

\(n\+1\)modK=0\(n\+1\)\\bmod K=0or

n=T−1n=T\-1\)then// End of TBPTT window: backprop and detach

23:

∇θ\(ℓwin/T\)\\nabla\_\{\\theta\}\(\\ell\_\{\\mathrm\{win\}\}/T\)\.backprop\(\)\.// Backpropagate through current window only

24:

ℓtotal←ℓtotal\+detach​\(ℓwin\)\\ell\_\{\\mathrm\{total\}\}\\leftarrow\\ell\_\{\\mathrm\{total\}\}\+\\mathrm\{detach\}\(\\ell\_\{\\mathrm\{win\}\}\)\.

25:

ℓwin←0\\ell\_\{\\mathrm\{win\}\}\\leftarrow 0\.

26:

𝐬^n←detach​\(𝐬^n\)\\hat\{\\mathbf\{s\}\}\_\{n\}\\leftarrow\\mathrm\{detach\}\(\\hat\{\\mathbf\{s\}\}\_\{n\}\)\.// Stop gradient: predicted state treated as constant for next window

27:endif

28:

𝐬^n←\(P^n,S^wn,S^on,S^gn\)\\hat\{\\mathbf\{s\}\}\_\{n\}\\leftarrow\(\\hat\{P\}^\{n\},\\hat\{S\}\_\{w\}^\{n\},\\hat\{S\}\_\{o\}^\{n\},\\hat\{S\}\_\{g\}^\{n\}\)\.// Store predicted state for next AR step

29:endfor

30:// Peacemann head: sequence\-to\-sequence well response prediction

31:

𝐘^Π←Gθ​\(𝐗\(p\),peacemann\)∈ℝB×66×T\\hat\{\\mathbf\{Y\}\}\_\{\\Pi\}\\leftarrow G\_\{\\theta\}\(\\mathbf\{X\}^\{\(p\)\},\\texttt\{peacemann\}\)\\in\\mathbb\{R\}^\{B\\times 66\\times T\}, where

𝐗\(p\)∈ℝB×90×T\\mathbf\{X\}^\{\(p\)\}\\in\\mathbb\{R\}^\{B\\times 90\\times T\}\.

32:

ℓΠ←wΠ​ℒ​\(𝐘^Π,𝐘Π\)\\ell\_\{\\Pi\}\\leftarrow w\_\{\\Pi\}\\mathcal\{L\}\(\\hat\{\\mathbf\{Y\}\}\_\{\\Pi\},\\mathbf\{Y\}\_\{\\Pi\}\)\.

33:ifTBPTT enabledthen

34:

∇θℓΠ\\nabla\_\{\\theta\}\\ell\_\{\\Pi\}\.backprop\(\);return

ℓ=ℓtotal/T\+detach​\(ℓΠ\)\\ell=\\ell\_\{\\mathrm\{total\}\}/T\+\\mathrm\{detach\}\(\\ell\_\{\\Pi\}\)\.

35:else

36:

∇θ\(ℓtotal\+ℓΠ\)\\nabla\_\{\\theta\}\(\\ell\_\{\\mathrm\{total\}\}\+\\ell\_\{\\Pi\}\)\.backprop\(\);return

detach​\(ℓtotal\+ℓΠ\)\\mathrm\{detach\}\(\\ell\_\{\\mathrm\{total\}\}\+\\ell\_\{\\Pi\}\)\.

37:endif

#### Correctness and theoretical connection\.

The detach at line 22 is exactly the stop\-gradient of Definition[7\.1](https://arxiv.org/html/2605.28909#S7.Thmtheorem1); the Jacobian chains in \([18](https://arxiv.org/html/2605.28909#S3.E18)\) beyond windowKKare truncated\. The PINO residual at lines 12–15 evaluatesR​\(x^n\+1,x^n;⋅\)R\(\\hat\{x\}\_\{n\+1\},\\hat\{x\}\_\{n\};\\cdot\)as in Definition[6\.1](https://arxiv.org/html/2605.28909#S6.Thmtheorem1), driving predictions toward the physics\-consistent manifoldℳn\\mathcal\{M\}\_\{n\}\(Definition[6\.2](https://arxiv.org/html/2605.28909#S6.Thmtheorem2)\) and activating the spectral stability mechanism of Theorem[6\.5](https://arxiv.org/html/2605.28909#S6.Thmtheorem5)\. The Peacemann head predicts well responses \(WOPR, WWPR, WBHP\) using a sequence\-to\-sequence sub\-network; its gradient is computed in a separate backward pass\.

#### Code availability\.

## 10Computational Platform and Performance

#### Hardware\.

All experiments were executed on a singleNVIDIA HGX B200node \(Blackwell generation, TSMC 4NP\), equipped with eight NVIDIA B200 GPUs\. Each B200 integrates two Blackwell dies via a10​TB/s10\\penalty 10000\\ \\text\{TB/s\}NV\-HBI interconnect:192​GB192\\penalty 10000\\ \\text\{GB\}HBM3e memory per GPU \(≈8​TB/s\\approx 8\\penalty 10000\\ \\text\{TB/s\}peak bandwidth\); fifth\-generation NVLink,1\.8​TB/s1\.8\\penalty 10000\\ \\text\{TB/s\}bidirectional peer\-bandwidth per GPU;14\.4​TB/s14\.4\\penalty 10000\\ \\text\{TB/s\}aggregate intra\-node bandwidth via the NVSwitch crossbar\. Peak BF16 Tensor Core throughput:4,500​TFLOPS4\{,\}500\\penalty 10000\\ \\text\{TFLOPS\}per GPU\.

#### Memory analysis\.

The Norne state tensor\(B,T,4​N\)\(B,T,4N\)withB=8B=8,T=10T=10,N=113,344N=113\{,\}344at FP32 requires≈3\.3​GB\\approx 3\.3\\penalty 10000\\ \\text\{GB\}per forward pass\. The full physics\-residual tensor \(four blocks, same shape\) adds≈3\.3​GB\\approx 3\.3\\penalty 10000\\ \\text\{GB\}\. Combined with FP32 optimizer states \(≈2​P\\approx 2Pbytes for Adam,P≈50​MP\\approx 50\\text\{M\}parameters for PINO\), all tensors fit in a single B200’s HBM3e with no host offloading\. No domain decomposition, activation checkpointing, or gradient offloading is required\.

#### Software\.

NVIDIA PhysicsNeMo\[[29](https://arxiv.org/html/2605.28909#bib.bib29)\], CUDA 12\.8, BF16 mixed\-precision with dynamic loss scaling,torch\.distributedDDP for eight\-GPU training\. The 3D FFT in FNO spectral layers usescuFFTwith mode truncation\(kmax,z,kmax,x,kmax,y\)=\(6,16,16\)\(k\_\{\\max,z\},k\_\{\\max,x\},k\_\{\\max,y\}\)=\(6,16,16\), operating in compute\-bound regime\.

Table 1:Wall\-clock performance on8×8\\timesNVIDIA B200 \(HGX B200\)\. “OPM equiv\.” denotes approximate wall\-time for equivalent work using the Open Porous Media Flow finite\-volume simulator on CPU clusters\.

## 11Norne Benchmark: Results and Theory\-Consistent Interpretation

### 11\.1Setup and evaluation metric

![Refer to caption](https://arxiv.org/html/2605.28909v1/All1.png)Figure 1:Three\-dimensional permeability field of the Norne reservoir model \(Nx=46N\_\{x\}=46,Ny=112N\_\{y\}=112,Nz=22N\_\{z\}=22\)\. Top\-left: 3D permeability distribution with well trajectories overlaid, including water injectors, gas injectors, and oil/water/gas producers\. Top\-right: side view highlighting vertical heterogeneity and stratigraphic connectivity\. Bottom\-left: top view illustrating lateral permeability variations and channelized structures\. Bottom\-right: vertically averaged permeability map with well locations annotated\. Permeability is shown on a logarithmic scale \(mD\), emphasizing strong spatial heterogeneity that governs multiphase flow and plume migration dynamics\.The Norne field\[[9](https://arxiv.org/html/2605.28909#bib.bib9)\]is an offshore Norwegian oil\-and\-gas field on a46×112×2246\\times 112\\times 22Cartesian grid\. Key simulation properties are given in Table[2](https://arxiv.org/html/2605.28909#S11.T2)\. Training data:NsN\_\{s\}OPM simulator trajectories with randomized permeability, porosity, and well\-control perturbations\. Evaluation: time\-resolved coefficient of determination

Rα2​\(tn\):=1−∑i=1Ns‖x^α,n\(i\)−xα,n\(i\)‖ϕ2∑i=1Ns‖xα,n\(i\)−x¯α,n‖ϕ2,α∈\{P,Sw,So,Sg\},R^\{2\}\_\{\\alpha\}\(t\_\{n\}\):=1\-\\frac\{\\sum\_\{i=1\}^\{N\_\{s\}\}\\left\\lVert\\hat\{x\}\_\{\\alpha,n\}^\{\(i\)\}\-x\_\{\\alpha,n\}^\{\(i\)\}\\right\\rVert^\{2\}\_\{\\phi\}\}\{\\sum\_\{i=1\}^\{N\_\{s\}\}\\left\\lVert x\_\{\\alpha,n\}^\{\(i\)\}\-\\bar\{x\}\_\{\\alpha,n\}\\right\\rVert^\{2\}\_\{\\phi\}\},\\quad\\alpha\\in\\\{P,S\_\{w\},S\_\{o\},S\_\{g\}\\\},\(51\)wherex¯α,n=Ns−1​∑ixα,n\(i\)\\bar\{x\}\_\{\\alpha,n\}=N\_\{s\}^\{\-1\}\\sum\_\{i\}x\_\{\\alpha,n\}^\{\(i\)\}is the ensemble mean\.Rα2​\(tn\)=1R^\{2\}\_\{\\alpha\}\(t\_\{n\}\)=1means perfect prediction;Rα2​\(tn\)<0R^\{2\}\_\{\\alpha\}\(t\_\{n\}\)<0means the model is worse than the ensemble mean\.

Table 2:Norne field simulation properties\.
### 11\.2Results

![Refer to caption](https://arxiv.org/html/2605.28909v1/Pressure.png)\(a\)PressureR2​\(t\)R^\{2\}\(t\)
![Refer to caption](https://arxiv.org/html/2605.28909v1/water.png)\(b\)Water saturationR2​\(t\)R^\{2\}\(t\)
![Refer to caption](https://arxiv.org/html/2605.28909v1/oil.png)\(c\)Oil saturationR2​\(t\)R^\{2\}\(t\)
![Refer to caption](https://arxiv.org/html/2605.28909v1/Gas_compare.png)\(d\)Gas saturationR2​\(t\)R^\{2\}\(t\)

Figure 2:Time\-resolvedR2R^\{2\}comparison on the Norne benchmark \(46×112×2246\\times 112\\times 22,T=30T=30timesteps,≈\\approx3298 days\)\.Blue dashed: 1–1 \(teacher\-forced\) model\.Orange solid: autoregressive PINO model\. Gas saturation \(d\) shows the largest divergence, consistent with Theorems[5\.1](https://arxiv.org/html/2605.28909#S5.Thmtheorem1)–[6\.7](https://arxiv.org/html/2605.28909#S6.Thmtheorem7): gas is the most hyperbolic variable, has the largest effective Lipschitz constantLG\>1L^\{G\}\>1under 1–1 training, and benefits most from AR\-induced Jacobian contractivity\. Pressure \(a\) benefits least, consistent with the elliptic regularity of the pressure equation \(Proposition[8\.2](https://arxiv.org/html/2605.28909#S8.Thmtheorem2)\(a\)\) which keepsLP≈1L^\{P\}\\approx 1\.
### 11\.3Theory\-consistent quantitative interpretation

#### Gas saturation: catastrophic 1–1 failure\.

From Figure[2\(d\)](https://arxiv.org/html/2605.28909#S11.F2.sf4), the 1–1 model drops toR2≈0\.38R^\{2\}\\approx 0\.38neart≈1250t\\approx 1250days\. Fitting Theorem[5\.1](https://arxiv.org/html/2605.28909#S5.Thmtheorem1)\(d\) to the observed degradation: atn=12n=12steps \(t=1200t=1200days\),‖δ12‖/‖x12‖=O​\(1−0\.38\)=O​\(0\.79\)\\left\\lVert\\delta\_\{12\}\\right\\rVert/\\left\\lVert x\_\{12\}\\right\\rVert=O\(\\sqrt\{1\-0\.38\}\)=O\(0\.79\)\. Using \([27](https://arxiv.org/html/2605.28909#S5.E27)\) withLG≈1\.15L^\{G\}\\approx 1\.15\(estimated from the slope of theR2R^\{2\}decay on a log scale\) andε≈3×10−3\\varepsilon\\approx 3\\times 10^\{\-3\}\(cross\-validated one\-step error\):ε​\(L12−1\)/\(L−1\)≈3×10−3×5\.8/0\.15≈0\.12\\varepsilon\(L^\{12\}\-1\)/\(L\-1\)\\approx 3\\times 10^\{\-3\}\\times 5\.8/0\.15\\approx 0\.12, giving relative error≈12%\\approx 12\\%, which corresponds toR2≈0\.35R^\{2\}\\approx 0\.35, consistent with Figure[2\(d\)](https://arxiv.org/html/2605.28909#S11.F2.sf4)\. The AR model avoids this by reducingLGL^\{G\}below unity via Jacobian regularization \(Theorem[6\.5](https://arxiv.org/html/2605.28909#S6.Thmtheorem5)\), maintainingR2\>0\.90R^\{2\}\>0\.90throughout\.

#### Pressure: transient dip and recovery\.

The 1–1 model shows a transient dip toR2≈0\.72R^\{2\}\\approx 0\.72neart=1250t=1250days \(Figure[2\(a\)](https://arxiv.org/html/2605.28909#S11.F2.sf1)\), coinciding with the gas\-front breakthrough that perturbs the total mobilityTtotT\_\{\\mathrm\{tot\}\}and temporarily elevatesLPL^\{P\}\. The AR model recovers faster because its Jacobian is constrained to be contractive even in this regime \(Theorem[6\.5](https://arxiv.org/html/2605.28909#S6.Thmtheorem5)\(c\)\)\. Byt=2100t=2100days, both models converge toR2≈0\.80R^\{2\}\\approx 0\.80, consistent with the long\-time asymptotic error floor determined by the intrinsic ensemble spread\.

#### Oil saturation: slow degradation\.

The 1–1 model degrades fromR2≈1\.00R^\{2\}\\approx 1\.00toR2≈0\.96R^\{2\}\\approx 0\.96over the horizon \(Figure[2\(c\)](https://arxiv.org/html/2605.28909#S11.F2.sf3)\), consistent withLOL^\{O\}marginally above unity \(LO≈1\.005L^\{O\}\\approx 1\.005, givingε​\(1\.00530−1\)/0\.005≈30​ε\\varepsilon\(1\.005^\{30\}\-1\)/0\.005\\approx 30\\varepsilon, a 3% relative error atn=30n=30\)\. The AR model maintainsR2\>0\.99R^\{2\}\>0\.99throughout\.

#### Water saturation: monotone AR improvement\.

The AR model shows monotoneR2R^\{2\}improvement from0\.7530\.753to0\.7760\.776\(Figure[2\(b\)](https://arxiv.org/html/2605.28909#S11.F2.sf2)\), a behavior predicted by Theorem[6\.7](https://arxiv.org/html/2605.28909#S6.Thmtheorem7): forL<1L<1\(which the PINO AR model achieves for water saturation after breakthrough\), the rollout error is uniformly bounded and the model becomes more calibrated as fronts slow down late in the production history\. The 1–1 model plateaus atR2≈0\.755R^\{2\}\\approx 0\.755\.

## 12Conclusion

We have developed a comprehensive mathematical framework for sequential neural\-operator surrogate modeling of black\-oil reservoir dynamics, proving sharp results across four interlocking areas\.

Foundations \(Section[2](https://arxiv.org/html/2605.28909#S2)\)\.Theorems[2\.6](https://arxiv.org/html/2605.28909#S2.Thmtheorem6)and Lemma[2\.7](https://arxiv.org/html/2605.28909#S2.Thmtheorem7)establish well\-posedness of the implicit timestep map and provide phase\-type\-dependent Lipschitz constants, quantifying why gas saturation is intrinsically more fragile than pressure under operator composition\.

Covariate shift \(Section[4](https://arxiv.org/html/2605.28909#S4)\)\.Theorem[4\.1](https://arxiv.org/html/2605.28909#S4.Thmtheorem1)proves the Wasserstein\-2 gap grows asε​\(Ln−1\)/\(L−1\)\\varepsilon\(L^\{n\}\-1\)/\(L\-1\); Corollary[4\.2](https://arxiv.org/html/2605.28909#S4.Thmtheorem2)shows the population\-risk discrepancy between 1–1 and AR training isΘ​\(LT\)\\Theta\(L^\{T\}\)forL\>1L\>1, mathematically explaining the catastrophic 1–1 failure in gas saturation\.

PINO stability \(Section[6](https://arxiv.org/html/2605.28909#S6)\)\.Theorem[6\.5](https://arxiv.org/html/2605.28909#S6.Thmtheorem5)shows PINO training withλR≥λR∗\\lambda\_\{R\}\\geq\\lambda\_\{R\}^\{\*\}reduces the operator spectral radius toρℱ\+𝒪​\(λR−1/2\)<1\\rho\_\{\\mathcal\{F\}\}\+\\mathcal\{O\}\(\\lambda\_\{R\}^\{\-1/2\}\)<1; Theorem[6\.6](https://arxiv.org/html/2605.28909#S6.Thmtheorem6)shows physics residuals act as spectral Jacobian regularizers with error𝒪​\(λR−1/2\)\\mathcal\{O\}\(\\lambda\_\{R\}^\{\-1/2\}\); Theorem[6\.7](https://arxiv.org/html/2605.28909#S6.Thmtheorem7)gives uniform\-in\-time rollout error boundsε/\(1−ρ\)\\varepsilon/\(1\-\\rho\)for PINO AR operators\.

TBPTT \(Section[7](https://arxiv.org/html/2605.28909#S7)\)\.Theorem[7\.3](https://arxiv.org/html/2605.28909#S7.Thmtheorem3)establishes geometric bias decay𝒪​\(ρK\)\\mathcal\{O\}\(\\rho^\{K\}\); Corollary[7\.4](https://arxiv.org/html/2605.28909#S7.Thmtheorem4)derives optimal windowK∗=𝒪​\(log⁡T\)K^\{\*\}=\\mathcal\{O\}\(\\log T\); Proposition[7\.5](https://arxiv.org/html/2605.28909#S7.Thmtheorem5)gives Adam convergence rate𝒪​\(1/t\)\+𝒪​\(ρK∗\)\\mathcal\{O\}\(1/\\sqrt\{t\}\)\+\\mathcal\{O\}\(\\rho^\{K^\{\*\}\}\)\.

Empirical validation \(Section[11](https://arxiv.org/html/2605.28909#S11)\)\.Autoregressive PINO surrogates maintainR2\>0\.99R^\{2\}\>0\.99\(oil\),R2\>0\.90R^\{2\}\>0\.90\(gas\),R2≈0\.80R^\{2\}\\approx 0\.80\(pressure\), and monotone improvement \(water\) across the full Norne 3298\-day horizon; 1–1 models fail on gas and pressure\. Ensemble inference runs in<1<1minute on a single B200 GPU: a∼104×\\sim\\\!10^\{4\}\\timeswall\-clock speedup over the OPM finite\-volume simulator\.

Future work includes: \(a\) extending the Wasserstein covariate\-shift bounds to non\-Markovian \(multi\-step conditioning\) operator inputs; \(b\) extending the PINO stability theorem to non\-coercive residuals \(e\.g\., hyperbolic\-only without capillary regularization\); \(c\) sharp Rademacher complexity bounds for FNO function classes via their spectral structure\.

## Acknowledgments

This work was supported by NVIDIA Corporation\. The authors also thanksHarpreet Sethi\(NVIDIA Corporation\) for his expert guidance and deep technical insights in reservoir simulation and physics\-informed machine learning, which greatly benefited this work\.

## References

- \[1\]Aziz, K\., & Settari, A\. \(1979\)\.Petroleum Reservoir Simulation\. Applied Science Publishers\.
- \[2\]Bartlett, P\. L\., & Mendelson, S\. \(2002\)\. Rademacher and Gaussian complexities: Risk bounds and structural results\.Journal of Machine Learning Research, 3, 463–482\.
- \[3\]Bengio, Y\., Simard, P\., & Frasconi, P\. \(1994\)\. Learning long\-term dependencies with gradient descent is difficult\.IEEE Transactions on Neural Networks, 5\(2\), 157–166\.
- \[4\]Bengio, S\., Vinyals, O\., Jaitly, N\., & Shazeer, N\. \(2015\)\. Scheduled sampling for sequence prediction with recurrent neural networks\.Advances in Neural Information Processing Systems, 28\.
- \[5\]Bi, K\., Xie, L\., Zhang, H\., Chen, X\., Gu, X\., & Tian, Q\. \(2023\)\. Accurate medium\-range global weather forecasting with 3D neural networks\.Nature, 619, 533–538\.
- \[6\]Chandra, A\., Koch, M\., Pawar, S\., Panda, A\., Azizzadenesheli, K\., Snippe, J\., et al\. \(2025\)\. Accelerating porous media flow simulations with Fourier neural operators\.Advanced Theory and Simulations, e00747\.
- \[7\]Chen, T\., & Chen, H\. \(1995\)\. Universal approximation to nonlinear operators by neural networks\.IEEE Transactions on Neural Networks, 6\(4\), 911–917\.
- \[8\]Chen, Z\., Huan, G\., & Ma, Y\. \(2006\)\.Computational Methods for Multiphase Flows in Porous Media\. SIAM\.
- \[9\]Equinor ASA\. \(2012\)\. Norne Field Open Data Set\. Open Porous Media Initiative\.[https://opm\-project\.org/?page\_id=559](https://opm-project.org/?page_id=559)\.
- \[10\]Etienam, C\., Juntao, Y\., Said, I\., Ovcharenko, O\., et al\. \(2024\)\. A novel AI\-enhanced reservoir characterization with a combined mixture of experts–NVIDIA Modulus based PINO forward model\.arXiv:2404\.14447\.
- \[11\]Etienam, C\., Juntao, Y\., Ovcharenko, O\., & Said, I\. \(2024\)\. Reservoir history matching of the Norne field with generative exotic priors and a coupled MoE–PINO forward model\.arXiv:2406\.00889\.
- \[12\]Etienam, C\., Juntao, Y\., Said, I\., & Ovcharenko, O\. \(2024\)\. A reservoir history matching approach using a coupled MoE\-PINO forward model\.ECMOR 2024, 1–35\.
- \[13\]Herde, M\., Raonic, B\., Rohner, T\., Käppeli, R\., Mishra, S\., de Bezenac, E\., & Koumoutsakos, P\. \(2024\)\. Poseidon: Efficient foundation models for PDEs\.arXiv:2405\.19101\.
- \[14\]Karniadakis, G\. E\., Kevrekidis, I\. G\., Lu, L\., Perdikaris, P\., Wang, S\., & Yang, L\. \(2021\)\. Physics\-informed machine learning\.Nature Reviews Physics, 3, 422–440\.
- \[15\]Kochkov, D\., Smith, J\. A\., Alieva, A\., Wang, Q\., Brenner, M\. P\., & Hoyer, S\. \(2021\)\. Machine learning–accelerated computational fluid dynamics\.Proceedings of the National Academy of Sciences, 118\(21\)\.
- \[16\]Kovachki, N\., Li, Z\., Liu, B\., Azizzadenesheli, K\., Bhattacharya, K\., Stuart, A\., & Anandkumar, A\. \(2021\)\. Neural operator: Learning maps between function spaces\.arXiv:2108\.08481\.
- \[17\]Lam, R\., Sanchez\-Gonzalez, A\., Willson, M\., et al\. \(2023\)\. Learning skillful medium\-range global weather forecasting\.Science, 382, 1416–1421\.
- \[18\]Lamb, A\. M\., Goyal, A\., Zhang, Y\., Zhang, S\., Courville, A\. C\., & Bengio, Y\. \(2016\)\. Professor forcing: A new algorithm for training recurrent networks\.Advances in Neural Information Processing Systems, 29\.
- \[19\]Lanthaler, S\., Mishra, S\., & Karniadakis, G\. E\. \(2022\)\. Error estimates for DeepONets: A deep learning framework in infinite dimensions\.Transactions of Mathematics and Its Applications, 6\(1\)\.
- \[20\]Li, Z\., Kovachki, N\., Azizzadenesheli, K\., Liu, B\., Bhattacharya, K\., Stuart, A\., & Anandkumar, A\. \(2020\)\. Fourier neural operator for parametric partial differential equations\.arXiv:2010\.08895\.
- \[21\]Li, Z\., Zheng, H\., Kovachki, N\., Jin, D\., Chen, H\., Liu, B\., Azizzadenesheli, K\., & Anandkumar, A\. \(2021\)\. Physics\-informed neural operator for learning partial differential equations\.arXiv:2111\.03794\.
- \[22\]Lu, L\., Jin, P\., Pang, G\., Zhang, Z\., & Karniadakis, G\. E\. \(2021\)\. Learning nonlinear operators via DeepONet\.Nature Machine Intelligence, 3, 218–229\.
- \[23\]Mikhaeil, J\. M\., Monfared, Z\., & Durstewitz, D\. \(2022\)\. On the difficulty of learning chaotic dynamics with RNNs\.Advances in Neural Information Processing Systems, 35\.
- \[24\]Mukundakrishnan, K\., Wiegand, K\., Natoli, V\., Etienam, C\., et al\. \(2024\)\. Accelerating reservoir modeling workflows with neural operators and GPU\-based simulations\.ADIPEC 2024, D021S068R002\.
- \[25\]Rasmussen, A\. F\., Sandve, T\. H\., Bao, K\., et al\. \(2021\)\. The Open Porous Media Flow reservoir simulator\.Computers & Mathematics with Applications, 81, 159–185\.
- \[26\]Pascanu, R\., Mikolov, T\., & Bengio, Y\. \(2013\)\. On the difficulty of training recurrent neural networks\.ICML 2013, 1310–1318\.
- \[27\]Pathak, J\., Subramanian, S\., Harrington, P\., et al\. \(2022\)\. FourCastNet: A global data\-driven high\-resolution weather model\.arXiv:2202\.11214\.
- \[28\]Peaceman, D\. W\. \(1977\)\.Fundamentals of Numerical Reservoir Simulation\. Elsevier\.
- \[29\]NVIDIA PhysicsNeMo\. \(2023\)\.GitHub\.[https://github\.com/NVIDIA/physicsnemo](https://github.com/NVIDIA/physicsnemo)\.
- \[30\]Raissi, M\., Perdikaris, P\., & Karniadakis, G\. E\. \(2019\)\. Physics\-informed neural networks\.Journal of Computational Physics, 378, 686–707\.
- \[31\]Ranzato, M\., Chopra, S\., Auli, M\., & Zaremba, W\. \(2016\)\. Sequence level training with recurrent neural networks\.ICLR 2016\.
- \[32\]Rwechungura, R\., Sjetne, O\., & Lien, M\. \(2011\)\. Application of advanced history matching techniques to the Norne field\.SPE Reservoir Simulation Symposium\.
- \[33\]Takamoto, M\., Praditia, T\., Leiteritz, R\., et al\. \(2022\)\. PDEBench: An extensive benchmark for scientific machine learning\.Advances in Neural Information Processing Systems, 35\.
- \[34\]Tallec, C\., Ollivier, Y\., & Chrisman, O\. \(2017\)\. Unbiasing truncated backpropagation through time\.arXiv:1705\.08209\.
- \[35\]Tang, M\., Liu, Y\., & Durlofsky, L\. J\. \(2020\)\. A deep\-learning\-based surrogate model for CO2storage\.International Journal of Greenhouse Gas Control, 100, 103096\.
- \[36\]Um, K\., Brand, R\., Fei, Y\., Holl, P\., & Thuerey, N\. \(2020\)\. Solver\-in\-the\-loop: Learning from differentiable physics\.Advances in Neural Information Processing Systems, 33\.
- \[37\]Venkatraman, A\., Hebert, M\., & Bagnell, J\. A\. \(2015\)\. Improving multi\-step prediction of learned time series models\.AAAI 2015\.
- \[38\]Villani, C\. \(2009\)\.Optimal Transport: Old and New\. Springer\.
- \[39\]Williams, R\. J\., & Zipser, D\. \(1989\)\. A learning algorithm for continually running fully recurrent neural networks\.Neural Computation, 1, 270–280\.
- \[40\]Zygmund, A\. \(2002\)\.Trigonometric Series, 3rd ed\. Cambridge University Press\.
- \[41\]Zhu, Y\., & Zabaras, N\. \(2018\)\. Bayesian deep convolutional encoder–decoder networks for surrogate modeling\.Journal of Computational Physics, 366, 415–447\.

Similar Articles