
# A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers
Source: [https://arxiv.org/html/2605.05488](https://arxiv.org/html/2605.05488)
Taeyoung Kim (Center for AI and Natural Sciences, Korea Institute for Advanced Study, Seoul, South Korea 02455; taeyoungkim@kias.re.kr) and Joon-Hyuk Ko (Center for AI and Natural Sciences, Korea Institute for Advanced Study, Seoul, South Korea 02455; jhko725@kias.re.kr). Equal contribution. Correspondence to: Taeyoung Kim <taeyoungkim@kias.re.kr>, Joon-Hyuk Ko <jhko725@kias.re.kr>.

###### Abstract

We propose an architecture that augments the Flux Neural Operator (Flux NO), which combines the classical finite volume method (FVM) with neural operators, with ViT-based context injection. Our model is formulated as a hypernetwork: it extracts solution dynamics over a finite temporal window, encodes them with a recurrent Vision Transformer, and generates the parameters of a context-conditioned neural operator. This enables the model to infer and solve conservation laws without explicit access to the governing equation or PDE coefficients. Experimentally, we show that the proposed method preserves the robustness, generalization ability, and long-time prediction advantages of Flux NO over standard neural operators, while delivering reliable numerical solutions across a broad range of conservative systems, including previously unseen fluxes. Our code is available at [https://github.com/xx257xx/CONTEXT_FLUX_NO](https://github.com/xx257xx/CONTEXT_FLUX_NO).

## 1 Introduction

Neural-network-based methods for scientific computing have rapidly emerged as a major research direction, with a wide range of paradigms introduced in quick succession. This evolution can be broadly understood as three successive shifts. First, physics-informed neural networks (PINNs) were proposed to solve partial differential equations (PDEs) by directly optimizing neural networks subject to the governing equations together with initial and boundary conditions (Raissi et al., [2019](https://arxiv.org/html/2605.05488#bib.bib34)). Second, operator learning introduced a different perspective: rather than solving each PDE instance independently, neural operators learn the solution map of a prescribed PDE family, enabling direct prediction of forward or inverse solutions from input conditions (Li et al., [2021](https://arxiv.org/html/2605.05488#bib.bib32); Lu et al., [2021](https://arxiv.org/html/2605.05488#bib.bib31); Kovachki et al., [2023](https://arxiv.org/html/2605.05488#bib.bib33)). More recently, inspired by the few-shot and in-context capabilities of Transformer-based foundation models (Brown et al., [2020](https://arxiv.org/html/2605.05488#bib.bib35); Dosovitskiy et al., [2020](https://arxiv.org/html/2605.05488#bib.bib5)), this viewpoint has been extended to scientific machine learning, giving rise to PDE foundation models that aim to solve diverse classes of PDEs by conditioning on contextual information such as observed dynamics, equation families, or domain structure (Hao et al., [2024](https://arxiv.org/html/2605.05488#bib.bib8); Herde et al., [2024](https://arxiv.org/html/2605.05488#bib.bib9); Subramanian et al., [2024](https://arxiv.org/html/2605.05488#bib.bib36)).

Motivated by this line of development, we revisit the classical finite volume method (FVM), in which the evolution of conservation laws is governed by numerical fluxes at cell interfaces (LeVeque, [2002](https://arxiv.org/html/2605.05488#bib.bib15)), and combine it with neural operators in the spirit of the Flux Neural Operator (Flux NO) (Tran et al., [2024](https://arxiv.org/html/2605.05488#bib.bib37)). Building on this formulation, we propose a recurrent ViT-based context injection mechanism that lifts Flux NO into a foundation-model framework. The resulting model infers the underlying dynamics from short solution trajectories and adapts its numerical-flux operator accordingly, without requiring explicit knowledge of PDE coefficients or closed-form flux expressions.

Our main contributions are as follows:

- We formulate an in-context flux-learning problem for parametric conservation laws, where a short observed trajectory is used to infer a latent numerical flux operator.
- We introduce a context-conditioned Flux Neural Operator in which a recurrent ViT encoder produces a compact context code that conditions the finite-volume flux operator.
- We show that enforcing a conservative flux-difference update improves autoregressive stability and OOD robustness compared with generic PDE foundation-model baselines on one-dimensional conservation-law benchmarks and a related diffusive Burgers-type problem.

## 2 Background

This section reviews the ingredients that motivate our architecture. We emphasize two points. First, conservation laws require numerical updates that respect flux-difference structure; Flux NOs encode this conservative structure but are not inherently designed for in-context adaptation across unseen flux functions. Second, recent PDE foundation models provide context-conditioned adaptability, but often do so with generic prediction architectures that do not explicitly preserve conservative numerical structure.

### 2.1 Conservation laws and Flux Neural Operators

We consider conservation laws of the form

$$\partial_t \bm{u} + \nabla \cdot \bm{F}(\bm{u}; \bm{p}) = 0, \tag{1}$$

where $\bm{u}(t,\bm{x}) \in \mathbb{R}^{d}$ is the conserved state and $\bm{F}(\bm{u};\bm{p})$ is the physical flux, possibly parameterized by coefficients $\bm{p}$. The key structure of Eq. 1 is that temporal evolution is determined by flux imbalance. In a one-dimensional finite volume discretization, this leads to the semi-discrete update

$$\frac{d}{dt}\bar{u}_{i}(t) = -\frac{1}{\Delta x}\left(\hat{f}_{i+\frac{1}{2}}(t) - \hat{f}_{i-\frac{1}{2}}(t)\right), \tag{2}$$

and, after time discretization, to the conservative update

$$\bar{u}^{\,n+1}_{i} = \bar{u}^{\,n}_{i} - \frac{\Delta t}{\Delta x}\left(\hat{f}^{\,n}_{i+\frac{1}{2}} - \hat{f}^{\,n}_{i-\frac{1}{2}}\right). \tag{3}$$

The telescoping flux-difference structure ensures discrete conservation under suitable boundary conditions and is particularly important for nonlinear hyperbolic problems, where smooth solutions can develop shocks and long-time prediction requires stable transport behavior.

Operator learning provides a data-driven framework for approximating solution maps between function spaces. Neural operators such as DeepONet (Lu et al., [2021](https://arxiv.org/html/2605.05488#bib.bib31)) and the Fourier Neural Operator (Li et al., [2021](https://arxiv.org/html/2605.05488#bib.bib32)) learn such maps from data and can be evaluated rapidly on new inputs. However, many neural operators predict future solution fields directly and therefore do not explicitly enforce the conservative structure in Eq. 3. This can lead to conservation errors or unstable error accumulation during autoregressive rollout.

Flux Neural Operators address this issue by combining neural operators with the finite volume viewpoint (Kim and Kang, [2025](https://arxiv.org/html/2605.05488#bib.bib12); Kim et al., [2025](https://arxiv.org/html/2605.05488#bib.bib13)). Instead of directly predicting the next solution snapshot, Flux NO learns a numerical flux operator,

$$\hat{f}_{i+\frac{1}{2}} = G_{\Theta}\left(S_{i+\frac{1}{2}}(\bm{u}^{\,n})\right), \tag{4}$$

where $S_{i+\frac{1}{2}}(\bm{u}^{\,n})$ denotes a local or nonlocal stencil representation around the interface $i+\frac{1}{2}$, and $G_{\Theta}$ is a neural operator. The next state is then obtained by substituting this learned flux into the finite volume update:

$$\bar{u}^{\,n+1}_{i} = \bar{u}^{\,n}_{i} - \frac{\Delta t}{\Delta x}\left(G_{\Theta}(S_{i+\frac{1}{2}}(\bm{u}^{\,n})) - G_{\Theta}(S_{i-\frac{1}{2}}(\bm{u}^{\,n}))\right). \tag{5}$$

Thus, the model is constrained to evolve the solution through flux differences, giving it an inductive bias aligned with conservation laws. Since errors enter through a conservative residual rather than an unconstrained global prediction, this structure is especially useful for robust long-time rollout and resolution transfer.
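To make the flux-difference structure concrete, the following is a minimal JAX sketch of one step of Eq. 5 on a periodic grid; `flux_net` is a stand-in for the learned flux operator $G_{\Theta}$, applied here to a two-cell interface stencil as an illustrative choice.

```python
# Minimal sketch of the conservative update in Eq. (5), assuming a periodic
# grid and a two-cell interface stencil; `flux_net` stands in for G_Theta.
import jax.numpy as jnp

def flux_no_step(u, flux_net, dt, dx):
    # Stencil at interface i+1/2: (u_i, u_{i+1}) with periodic wrap-around.
    stencil = jnp.stack([u, jnp.roll(u, -1)], axis=-1)
    f_right = flux_net(stencil)        # \hat{f}_{i+1/2} for every cell i
    f_left = jnp.roll(f_right, 1)      # \hat{f}_{i-1/2}, shared with cell i-1
    # The state changes only through flux differences, so the terms telescope
    # and jnp.sum(u) is conserved up to round-off.
    return u - dt / dx * (f_right - f_left)
```

Because neighboring cells share each interface flux, the update telescopes under periodic boundaries, and the sum of cell averages is conserved to machine precision regardless of how inaccurate `flux_net` is.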

### 2.2 PDE foundation models and context conditioning

Recent work has begun to move from single-equation neural operators toward foundation models for PDEs. The goal is to train models that can operate across broader families of equations, coefficients, discretizations, and physical regimes by conditioning on contextual information, such as short observed trajectories, equation descriptors, simulation metadata, or prompt-like input–output examples.

Several approaches use transformer-style architectures, patch tokenization, autoregressive sequence modeling, or hypernetwork conditioning to enable such cross-system adaptation (Yang and Osher, [2024](https://arxiv.org/html/2605.05488#bib.bib26); Yang et al., [2025](https://arxiv.org/html/2605.05488#bib.bib27); Hao et al., [2024](https://arxiv.org/html/2605.05488#bib.bib8); Morel et al., [2025](https://arxiv.org/html/2605.05488#bib.bib18)). These methods provide a mechanism for in-context generalization: a single trained model can adapt its behavior based on the observed task context, without explicit retraining for each new equation instance.

However, many PDE foundation models remain generic predictors of future states or latent solution fields. Their architectures are typically designed around sequence modeling or global operator regression, rather than the conservative numerical structure specific to hyperbolic conservation laws. Consequently, they may lack an explicit finite-volume update rule, interface flux representation, or guaranteed flux-difference form, which can be important in shock-dominated regimes, long-time rollout, and resolution transfer.

Our method combines context-conditioned adaptation with a conservative numerical backbone. A short trajectory segment is encoded into a context vector, and a hypernetwork uses this vector to generate the parameters of a Flux NO target network. Thus, the model does not merely condition a generic predictor on context; it conditions the numerical flux operator itself. Compared with standard neural operators, the resulting model evolves states through a flux-difference update. Compared with Flux NO, it replaces a fixed flux operator with a context-generated one. Compared with generic PDE foundation models, it injects context into a structure-preserving solver, enabling adaptation to unseen flux functions while retaining the finite-volume inductive bias needed for conservation laws.

## 3 In-Context Flux Neural Operator

### 3.1 Problem setting

For conservation laws in Eq. 1, our goal is to learn a context-conditioned evolution operator from short trajectory observations. Let $\bm{u}(t,\bm{x})$ be the continuous solution and let $\bm{u}^{n} \in \mathbb{R}^{d \times N_{\bm{x}}}$ denote its grid-sampled state at time $t = n\Delta t$, where $N_{\bm{x}} := N_{x_1} \times \cdots \times N_{x_n}$. Given a context trajectory

$$\bm{U}^{n-k+1:n} = (\bm{u}^{n-k+1}, \ldots, \bm{u}^{n}) \in \mathbb{R}^{k \times d \times N_{\bm{x}}},$$

we seek to predict the next state $\bm{u}^{n+1}$.

Rather than learning this map as an unconstrained input–output predictor, we decompose the problem into two stages: first infer a latent representation of the underlying dynamics from the observed trajectory, and then use this representation to instantiate a context-conditioned Flux NO. This naturally leads to a hypernetwork formulation,

$$\bm{c} = \mathscr{E}(\bm{U}^{n-k+1:n}), \qquad \Theta = H(\bm{c}), \qquad \bm{u}^{n+1} = \mathscr{F}(\bm{u}^{n}, \Delta t; \Theta), \tag{6}$$

where $\mathscr{E}$ is the context encoder, $H$ maps the context vector to target-network parameters, and $\mathscr{F}$ is the Flux NO target network. The encoder is not given the analytical flux function, PDE coefficients, or equation labels; all conditioning information must be inferred from the observed solution history.
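Structurally, the three stages of Eq. 6 compose as in the following sketch, where `encoder`, `hypernet`, and `flux_no_apply` are assumed interfaces rather than the released implementation:

```python
# Sketch of the hypernetwork pipeline in Eq. (6); the callables are
# placeholders for the encoder E, hypernetwork H, and target network F.
def predict_next_state(encoder, hypernet, flux_no_apply, U_context, dt):
    c = encoder(U_context)                 # context code c in R^e
    theta = hypernet(c)                    # generated parameters Theta in R^q
    u_n = U_context[-1]                    # most recent observed state u^n
    return flux_no_apply(theta, u_n, dt)   # u^{n+1} = F(u^n, dt; Theta)
```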

### 3.2 Context Encoder and Hypernetwork

Given a short trajectory segment, the encoder extracts a compact context vector and maps it to the parameters of the Flux NO target network. We impose an information bottleneck,

$$\mathbb{R}^{k \times d_{\mathrm{in}} \times N_{\bm{x}}} \longrightarrow \mathbb{R}^{e} \longrightarrow \mathbb{R}^{q}, \qquad e \ll q, \tag{7}$$

where $e$ is the context dimension and $q$ is the number of generated target-network parameters. When grid coordinates are used, they are appended as additional input channels.

#### Temporal recurrent mixing and spatial attention.

The encoder is designed to process temporal and spatial axes separately while respecting causality along time. We therefore adopt a temporally recurrent Vision Transformer design inspired by TRecViT (Patraucean et al., [2025](https://arxiv.org/html/2605.05488#bib.bib21)), where temporal mixing is handled by gated linear recurrent units (De et al., [2024](https://arxiv.org/html/2605.05488#bib.bib4); Botev et al., [2024](https://arxiv.org/html/2605.05488#bib.bib1)) and spatial mixing by transformer blocks.

Given $\bm{U}^{n-k+1:n} \in \mathbb{R}^{k \times d \times N_{\bm{x}}}$, the encoder first tokenizes each time slice using a ViT patch embedding with learnable positional encodings (Dosovitskiy et al., [2020](https://arxiv.org/html/2605.05488#bib.bib5)):

$$\bm{V}^{(0)} = \mathrm{PatchEmbed}(\bm{U}^{n-k+1:n}) \in \mathbb{R}^{k \times P \times e}, \tag{8}$$

where $P$ is the number of spatial patches. Each encoder layer alternates between temporal recurrent mixing for each spatial token and spatial self-attention for each time step:

$$\widehat{\bm{V}}^{(\ell)}_{:,p} = \mathrm{TemporalBlock}^{(\ell)}\left(\bm{V}^{(\ell)}_{:,p}\right), \qquad \bm{V}^{(\ell+1)}_{t,:} = \mathrm{SpatialTransformer}^{(\ell)}\left(\widehat{\bm{V}}^{(\ell)}_{t,:}\right). \tag{9}$$

In our implementation, the temporal block is a residual recurrent block based on a gated linear recurrent unit with a causal depthwise one-dimensional convolution. This alternating structure allows the encoder to propagate information through the observed trajectory while modeling spatial interactions at each time step.
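The per-token and per-step factorization in Eq. 9 maps naturally onto `jax.vmap`; a minimal sketch, assuming `temporal_block` consumes a (time, embed) sequence and `spatial_transformer` a (tokens, embed) set:

```python
# One encoder layer (Eq. 9): temporal recurrence per spatial token, then
# spatial self-attention per time step. Both sub-modules are assumed
# callables; vmap supplies the "for each token / for each step" loops.
import jax

def encoder_layer(V, temporal_block, spatial_transformer):
    # V has shape (k, P, e): (time steps, spatial tokens, embedding dim).
    V_hat = jax.vmap(temporal_block, in_axes=1, out_axes=1)(V)          # over tokens
    return jax.vmap(spatial_transformer, in_axes=0, out_axes=0)(V_hat)  # over steps
```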

After the final layer, we apply token-wise layer normalization and average the final temporal state over spatial tokens:

$$\bm{c} = \frac{1}{P}\sum_{p=1}^{P} \mathrm{LayerNorm}\left(\bm{V}^{(L)}_{k,p}\right) \in \mathbb{R}^{e}. \tag{10}$$

The hypernetwork then maps this context vector to the target-network parameters,

$$\Theta = H(\bm{c}) \in \mathbb{R}^{q}. \tag{11}$$

### 3.3 Flux Neural Operator Target Network

The target network is a Flux NO whose parameters are generated from the context vector. Thus, unlike the original Flux NO with fixed parameters, our model instantiates a different numerical flux operator for each inferred dynamics.

For clarity, we describe the one-dimensional case. Given the current state $\bm{u}^{n}$, we construct left- and right-shifted stencil features $V^{l}$ and $V^{r}$ under periodic boundary conditions. These features contain local solution values around cell interfaces, together with grid coordinates when used. The generated Flux NO maps them to numerical fluxes,

$$\hat{f}^{\,n}_{i+\frac{1}{2}} = G_{\Theta}(V^{r}_{i}), \qquad \hat{f}^{\,n}_{i-\frac{1}{2}} = G_{\Theta}(V^{l}_{i}), \tag{12}$$

and advances the solution by the finite-volume update

$$u^{n+1}_{i} = u^{n}_{i} - \frac{\Delta t}{\Delta x}\left(\hat{f}^{\,n}_{i+\frac{1}{2}} - \hat{f}^{\,n}_{i-\frac{1}{2}}\right). \tag{13}$$

Equivalently,

$$\bm{u}^{n+1} = \bm{u}^{n} - \frac{\Delta t}{\Delta x}\left(G_{\Theta}(V^{r}) - G_{\Theta}(V^{l})\right). \tag{14}$$

This form makes the conservative structure explicit: the model predicts fluxes, and the solution changes only through flux differences across neighboring interfaces.

The flux operator $G_{\Theta}$ is implemented as a depth-$L$ neural operator acting on the stencilized state:

$$\begin{aligned}
z^{(0)}(x) &= W_{\mathrm{lift};\Theta}\, V(x),\\
\widetilde{z}^{(\ell)}(x) &= \int k^{(\ell)}_{\Theta}(x, x')\, z^{(\ell-1)}(x')\, dx', \qquad \ell = 1, \ldots, L,\\
z^{(\ell)}(x) &= z^{(\ell-1)}(x) + \sigma\left(\widetilde{z}^{(\ell)}(x)\right), \qquad \ell = 1, \ldots, L,\\
G_{\Theta}(x) &= W_{\mathrm{proj};\Theta}\, z^{(L)}(x).
\end{aligned} \tag{15}$$

Here $W_{\mathrm{lift};\Theta}$, $W_{\mathrm{proj};\Theta}$, and $k^{(\ell)}_{\Theta}$ are generated by the hypernetwork. In this way, the context vector instantiates the numerical flux operator itself, rather than merely modulating intermediate activations. The overall architecture is illustrated in Figure [1](https://arxiv.org/html/2605.05488#S3.F1).
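As an illustration of Eq. 15, the sketch below realizes the kernel integral as a truncated Fourier-space multiplication in the style of FNO; the weight layout in `theta` and the mode truncation are assumptions, since all parameters are produced by the hypernetwork:

```python
# Sketch of Eq. (15) with an FNO-style spectral kernel; theta is a dict of
# hypernetwork-generated weights with an assumed layout:
#   theta["W_lift"]: (d_in, width), theta["W_proj"]: (width, 1),
#   theta["K"]: (L, modes, width, width) complex spectral weights.
import jax
import jax.numpy as jnp

def flux_operator(theta, V, modes=8):
    z = V @ theta["W_lift"]                  # lift: (N, d_in) -> (N, width)
    for K_l in theta["K"]:                   # L kernel-integral layers
        z_hat = jnp.fft.rfft(z, axis=0)      # spatial Fourier transform
        mixed = jnp.einsum("kio,ki->ko", K_l, z_hat[:modes])
        z_hat = jnp.zeros_like(z_hat).at[:modes].set(mixed)  # keep low modes
        z_tilde = jnp.fft.irfft(z_hat, n=z.shape[0], axis=0)
        z = z + jax.nn.gelu(z_tilde)         # residual nonlinearity
    return z @ theta["W_proj"]               # project to flux values
```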

![Refer to caption](https://arxiv.org/html/2605.05488v1/Integrated_HFNO.png)

Figure 1: Overview of HFluxNO. (a) A temporally recurrent Vision Transformer encodes the context trajectory by alternating temporal recurrent mixing and spatial attention, producing a context vector that is mapped by a hypernetwork to Flux NO parameters. (b) The generated Flux NO target network predicts numerical fluxes, which are used in a conservative finite-volume update.

## 4 Experiments

### 4.1 Baselines

To demonstrate the efficacy of our method, we selected the following state-of-the-art models from recent literature. For a fair performance comparison, we implement all models in JAX (Bradbury et al., [2018](https://arxiv.org/html/2605.05488#bib.bib10)), porting over the original implementations where necessary. We provide a brief description of the baselines below, and refer readers to [Appendix A](https://arxiv.org/html/2605.05488#A1) for additional details. In preliminary experiments, ICON exhibited substantially larger prediction errors than the other baselines on the 1D cubic conservation-law benchmark. Because its performance was not competitive in this setting, we trained ICON using a single random seed and omit it from the main quantitative comparisons for clarity. Its results are reported separately in [Appendix A](https://arxiv.org/html/2605.05488#A1).

#### ICON

(Yang and Osher, [2024](https://arxiv.org/html/2605.05488#bib.bib26); Yang et al., [2025](https://arxiv.org/html/2605.05488#bib.bib27)) is a decoder-only transformer language model repurposed for operator learning. Instead of language tokens, the model is trained to ingest the PDE solution field sampled at discrete time points and generate the output at some future point in time.

#### DPOT

(Hao et al., [2024](https://arxiv.org/html/2605.05488#bib.bib8)) is a non-transformer model that first compresses the input trajectory using spatial patch embedding, followed by a learnable weighted sum along the time axis. Subsequently, Fourier attention layers (Guibas et al., [2021](https://arxiv.org/html/2605.05488#bib.bib6); Hao et al., [2024](https://arxiv.org/html/2605.05488#bib.bib8)) are applied to the aggregated context to learn kernel integral transforms conditioned on the input context.

#### DISCO

(Morel et al., [2025](https://arxiv.org/html/2605.05488#bib.bib18)) is another hypernetwork-based architecture, with the axial vision transformer of McCabe et al. ([2024](https://arxiv.org/html/2605.05488#bib.bib17)) as the hypernetwork, and a neural ordinary differential equation (Chen et al., [2018](https://arxiv.org/html/2605.05488#bib.bib3); Kidger, [2021](https://arxiv.org/html/2605.05488#bib.bib11)) with a U-Net (Ronneberger et al., [2015](https://arxiv.org/html/2605.05488#bib.bib23)) vector field as the target network. Next time-step predictions are generated by numerically integrating the U-Net vector field with an adaptive Runge-Kutta solver, which can make this method more computationally expensive than its counterparts.

### 4.2 Datasets

While large, high-quality PDE datasets have become available in recent years (Takamoto et al., [2022](https://arxiv.org/html/2605.05488#bib.bib28); Ohana et al., [2024](https://arxiv.org/html/2605.05488#bib.bib20); Koehler et al., [2024](https://arxiv.org/html/2605.05488#bib.bib14)), many of them offer limited variety in the equational form of the PDEs, their coefficient values, and the function family from which initial conditions are sampled. Although this is a natural consequence of the difficulty of generating high-quality PDE solutions, the limitation makes it difficult to gauge the generalization capabilities of trained multiphysics neural operators. We therefore perform experiments with newly generated datasets designed to test both in-distribution performance and controlled forms of out-of-distribution generalization. For the newly generated data, we provide a brief description of each dataset below and give in-depth simulation details in [Appendix B](https://arxiv.org/html/2605.05488#A2).

#### 1D Cubic Conservation Laws

We first consider the problem of learning a family of 1D cubic conservation laws, as proposed by Yang and Osher ([2024](https://arxiv.org/html/2605.05488#bib.bib26)). The governing equation is

$$u_t + \left(c_1 u + c_2 u^2 + c_3 u^3\right)_x = 0, \qquad x \in [0,1], \qquad (c_1, c_2, c_3) \sim \mathrm{Unif}([-1,1]^3),$$

with periodic boundary conditions. The initial conditions $u(0)$ are sampled from a 1D Gaussian random field with a periodic covariance function.

#### 1D Shallow Water Equations

Next, we consider a parametrized form of the 1D shallow-water equations. Let $m = hu$ denote the momentum. The state is $q = (h, m)^\top$, and the governing equation is

$$q_t + F(q)_x = 0, \qquad x \in [0,1], \tag{16}$$

with flux

$$F(q) = \begin{pmatrix} \alpha m \\ \gamma m^2/h + \frac{1}{2}\beta h^2 \end{pmatrix}, \qquad (\alpha, \gamma, \beta) \sim \mathrm{Unif}\left([0.5, 1.5] \times [0.5, 1.5] \times [8, 12]\right). \tag{17}$$

The standard shallow-water equations correspond to $(\alpha, \gamma, \beta) = (1, 1, g)$. For the initial conditions, we sample $m(0)$ from a Gaussian random field and $h(0)$ from a lognormal random field to ensure that the water height remains non-negative.

#### 1D Viscous Burgers Equation

The last equation we simulate is the viscous Burgers equation, given as

$$u_t + \left(a\, u^2\right)_x = \nu\, u_{xx}, \qquad x \in [0,1], \qquad (a, \nu) \sim \mathrm{Unif}\left([0.5, 1.5] \times [0.005, 0.015]\right). \tag{18}$$

Note that this equation is not a conservation law (Eq. 1) due to the presence of a dissipative term on the right-hand side. We therefore include this dataset in our benchmarks to gauge whether our HFluxNO model can handle more general cases beyond the strictly conservative setting it was motivated by.

For all simulated datasets, the equations are solved on the time interval $t \in [0, 0.4]$ with a sampling period of $\Delta t = 0.005$. We generate 100 initial conditions per coefficient choice, and 1000, 100, and 100 coefficient choices for the training, validation, and test datasets, respectively.

### 4.3 Model training and evaluation

All models were trained on the mean squared error between model predictions and data for a single time-step prediction. We set the context length to $k = 20$ for most of our experiments. All models were trained for 50,000 gradient steps with the AdamW optimizer and a linear warm-up, cosine decay learning-rate schedule.
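A minimal sketch of this optimization setup in optax follows; only the step budget, optimizer, and schedule shape come from the text, while the peak learning rate, warm-up length, and weight decay are illustrative assumptions:

```python
# AdamW with linear warm-up and cosine decay; hyperparameter values other
# than the 50,000-step budget are assumed for illustration.
import optax

total_steps = 50_000
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=1e-3,        # assumed peak learning rate
    warmup_steps=1_000,     # assumed warm-up length
    decay_steps=total_steps,
)
optimizer = optax.adamw(learning_rate=schedule, weight_decay=1e-4)  # assumed decay
```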

We evaluate the trained models on (i) in-distribution accuracy, (ii) out-of-distribution robustness (shock-dominated regimes, sine fluxes), and (iii) long-time rollout beyond the training horizon. As evaluation metrics, we use the relative $l^2$ and $l^\infty$ norms, defined in (19),

$$\mathrm{Rel.}\; l^{2}(u, u_{\mathrm{target}}) := \frac{\|u - u_{\mathrm{target}}\|_{2}}{\|u_{\mathrm{target}}\|_{2}}, \qquad \mathrm{Rel.}\; l^{\infty}(u, u_{\mathrm{target}}) := \frac{\|u - u_{\mathrm{target}}\|_{\infty}}{\|u_{\mathrm{target}}\|_{\infty}}, \tag{19}$$

computed over space at each time and then averaged over time, or over the full spatiotemporal grid.
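Both metrics are one-liners; in JAX they can be written as below and applied either per time slice or to the full trajectory:

```python
# Relative l2 and l-infinity errors of Eq. (19); pass a single time slice
# for the per-time variant or the whole trajectory for the global variant.
import jax.numpy as jnp

def rel_l2(u, u_target):
    return jnp.linalg.norm(u - u_target) / jnp.linalg.norm(u_target)

def rel_linf(u, u_target):
    return jnp.max(jnp.abs(u - u_target)) / jnp.max(jnp.abs(u_target))
```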

## 5 Results

### 5.1 In-distribution predictions

We first evaluate the trained models under the in-distribution setting, using a test dataset of 100 coefficient combinations with 100 initial conditions each, sampled from the same distributions as the training data. We consider two types of model predictions: (i) a single-step forecast, where the context is fed into the model to predict the solution $\Delta t$ later, and (ii) a short autoregressive rollout, where the model output is recursively fed back into the model as input context 20 times, generating a short prediction trajectory over a time horizon of $20\Delta t$.
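The rollout protocol amounts to a sliding context window; a sketch, assuming `model` maps a length-$k$ context to the next state:

```python
# Autoregressive rollout: append each prediction to the context window and
# drop the oldest frame. `model` is an assumed (k, ...) -> (...) interface.
import jax.numpy as jnp

def rollout(model, U_context, n_steps=20):
    preds = []
    for _ in range(n_steps):
        u_next = model(U_context)
        preds.append(u_next)
        U_context = jnp.concatenate([U_context[1:], u_next[None]], axis=0)
    return jnp.stack(preds)
```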

Table 1: In-distribution (ID) prediction accuracy on the 1D benchmark datasets. Reported as mean ± std over three training runs. The best results are in bold, and the runner-ups are underlined.

From the results in [Table 1](https://arxiv.org/html/2605.05488#S5.T1), we find that the models often have markedly higher relative $l^\infty$ errors than relative $l^2$ errors, which stems from the difficulty of exactly capturing shock-front locations over time, as opposed to capturing the overall form of the solution. The baseline models show an interesting trade-off: DISCO performs worse than DPOT in several single-step settings, but its stronger dynamical prior pays off over longer autoregressive rollouts, where it outperforms the other, more flexible baselines.

In contrast to this trade-off between single-step and autoregressive performance among the baselines, we find that HFluxNO consistently outperforms the baselines in both single-step prediction and autoregressive rollout. This indicates that the choice of prior structure in the model matters greatly, and that our architecture design based on the finite volume method is highly effective for learning hyperbolic conservation laws.

#### Long-time prediction capabilities

To further stress-test the predictive capabilities of the trained models, we generated long-time predictions corresponding to a rollout time of $t_{\mathrm{rollout}} = 0.4$. From the results shown in [Fig. 2](https://arxiv.org/html/2605.05488#S5.F2), we see that our model consistently maintains lower error over time compared with the baselines. Furthermore, the way error accumulates in the model predictions over time differs ([Fig. 2](https://arxiv.org/html/2605.05488#S5.F2), right panel). DPOT and, to a lesser extent, DISCO quickly accumulate high-frequency artifacts with increasing rollout length, a well-known problem plaguing autoregressive neural operator architectures (Lippe et al., [2023](https://arxiv.org/html/2605.05488#bib.bib16); Worrall et al., [2024](https://arxiv.org/html/2605.05488#bib.bib25)). In contrast, our model does not suffer from such artifacts, with errors stemming only from a slight misprediction of the wave propagation speed. This indicates that our model has properly learned the local physics of the problem, owing to the effectiveness of the built-in inductive biases.

![Refer to caption](https://arxiv.org/html/2605.05488v1/x1.png)

Figure 2: Long-time prediction performance of the in-context neural operator models across different datasets.

#### Out-of-distribution generalization.

We next evaluate the out-of-distribution (OOD) generalization capability of the trained models by (i) assessing model performance on a new dataset generated using a different, shock-dominated initial-condition distribution. Additionally, for the cubic conservation law experiment, we further test models on (ii) seen initial conditions (GRFs) but unseen equations (sine-flux dynamics), and (iii) unseen initial conditions and equations.

These settings respectively test robustness to shifted initial-condition distributions and generalization to a different flux family. Datasets were generated analogously to the in-distribution dataset (details are provided in [Appendix B](https://arxiv.org/html/2605.05488#A2)), and models were evaluated directly on these OOD test sets without fine-tuning. The quantitative results for these OOD settings are reported in Table [2](https://arxiv.org/html/2605.05488#S5.T2), and qualitative examples are shown in Figure [3](https://arxiv.org/html/2605.05488#S5.F3).

![Refer to caption](https://arxiv.org/html/2605.05488v1/x2.png)

Figure 3: Qualitative rollout examples. The top row (a) shows an in-distribution cubic test trajectory, whereas the bottom row (b) shows an OOD trajectory with shock-dominated initial conditions and sine-flux dynamics.

Table 2: OOD generalization performance. The Cubic, Shallow Water, and Viscous Burgers rows use shock-dominated initial conditions. The sine-flux rows evaluate cubic-trained models on unseen sine-flux dynamics without retraining, either with GRF initial conditions or shock-dominated initial conditions. Reported as mean ± std over three training runs.

#### Limitations.

Our experiments focus primarily on one-dimensional conservation-law dynamics, with one additional diffusive Burgers-type benchmark beyond the strictly conservative setting. Although the proposed architecture is motivated by a conservative finite-volume structure, its performance on higher-dimensional systems, complex geometries, strongly coupled multiphysics problems, and real-world noisy observations remains to be investigated. In addition, the current study evaluates context adaptation on selected equation families, and broader generalization across substantially different PDE classes is left for future work.

## 6 Conclusion

In this work, we proposed HFluxNO, which extends Flux NO into a context-adaptive foundation model for conservation-law dynamics. To handle temporal causality in the input trajectory, we designed a context-injection encoder hypernetwork based on a temporally recurrent Vision Transformer, while using the original Flux NO architecture as the target network. This design allows the model to infer latent governing dynamics from short solution histories and instantiate a context-conditioned conservative flux operator.

Through training and evaluation against recent baseline models, HFluxNO showed competitive or improved performance across several settings, including in-distribution prediction, out-of-distribution generalization with respect to initial conditions and flux functions, and long-time autoregressive prediction. The benchmark problems considered in this paper include one-dimensional scalar conservation laws, one-dimensional vector-valued conservation laws, and a viscous Burgers-type equation with an explicit diffusive term. These results suggest that combining in-context adaptation with a conservative flux-difference inductive bias can be beneficial for neural solvers of conservation-law dynamics.

Future work will extend this framework to richer multiphysics settings, including higher-dimensional systems, more diverse equation families, and more complex physical regimes.

## References

- A. Botev, S. De, S. L. Smith, A. Fernando, G. Muraru, R. Haroun, L. Berrada, R. Pascanu, P. G. Sessa, R. Dadashi, L. Hussenot, J. Ferret, S. Girgin, O. Bachem, A. Andreev, K. Kenealy, T. Mesnard, C. Hardin, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, P. Tafti, A. Joulin, N. Fiedel, E. Senter, Y. Chen, S. Srinivasan, G. Desjardins, D. Budden, A. Doucet, S. Vikram, A. Paszke, T. Gale, S. Borgeaud, C. Chen, A. Brock, A. Paterson, J. Brennan, M. Risdal, R. Gundluru, N. Devanathan, P. Mooney, N. Chauhan, P. Culliton, L. G. Martins, E. Bandy, D. Huntsperger, G. Cameron, A. Zucker, T. Warkentin, L. Peran, M. Giang, Z. Ghahramani, C. Farabet, K. Kavukcuoglu, D. Hassabis, R. Hadsell, Y. W. Teh, and N. de Freitas (2024) RecurrentGemma: Moving Past Transformers for Efficient Open Language Models.
- J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang (2018) JAX: composable transformations of Python+NumPy programs. [http://github.com/jax-ml/jax](http://github.com/jax-ml/jax).
- T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901.
- R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud (2018) Neural Ordinary Differential Equations. In Advances in Neural Information Processing Systems, Vol. 31.
- S. De, S. L. Smith, A. Fernando, A. Botev, G. Cristian-Muraru, A. Gu, R. Haroun, L. Berrada, Y. Chen, S. Srinivasan, G. Desjardins, A. Doucet, D. Budden, Y. W. Teh, R. Pascanu, N. D. Freitas, and C. Gulcehre (2024) Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv:2402.19427.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2020) An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
- J. Guibas, M. Mardani, Z. Li, A. Tao, A. Anandkumar, and B. Catanzaro (2021) Efficient Token Mixing for Transformers via Adaptive Fourier Neural Operators. In International Conference on Learning Representations.
- Z. Hao, C. Su, S. Liu, J. Berner, C. Ying, H. Su, A. Anandkumar, J. Song, and J. Zhu (2024) DPOT: Auto-Regressive Denoising Operator Transformer for Large-Scale PDE Pre-Training. In Proceedings of the 41st International Conference on Machine Learning, pp. 17616–17635.
- M. Herde, B. Raonić, T. Rohner, R. Käppeli, R. Molinaro, E. de Bézenac, and S. Mishra (2024) Poseidon: Efficient Foundation Models for PDEs. In Advances in Neural Information Processing Systems, Vol. 37, pp. 72525–72624.
- D. I. Ketcheson, K. T. Mandli, A. J. Ahmadia, A. Alghamdi, M. Quezada de Luna, M. Parsani, M. G. Knepley, and M. Emmett (2012) PyClaw: Accessible, Extensible, Scalable Tools for Wave Propagation Problems. SIAM Journal on Scientific Computing 34(4), pp. C210–C231.
- P. Kidger (2021) On Neural Differential Equations. Ph.D. Thesis, University of Oxford. arXiv:2202.02435.
- T. Kim, Y. Ha, and M. Kang (2025) Neural operators learn the local physics of magnetohydrodynamics. Computers & Fluids 297, pp. 106661.
- T. Kim and M. Kang (2025) Approximating Numerical Fluxes Using Fourier Neural Operators for Hyperbolic Conservation Laws. Communications in Computational Physics 37(2), pp. 420–456.
- F. Koehler, S. Niedermayr, R. Westermann, and N. Thuerey (2024) APEBench: A Benchmark for Autoregressive Neural Emulators of PDEs. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- N. B. Kovachki, Z. Li, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. M. Stuart, and A. Anandkumar (2023) Neural operator: learning maps between function spaces with applications to PDEs. Journal of Machine Learning Research 24(89), pp. 1–97.
- R. J. LeVeque (2002) Finite Volume Methods for Hyperbolic Problems. Cambridge University Press, Cambridge.
- Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anandkumar (2021) Fourier neural operator for parametric partial differential equations. In International Conference on Learning Representations.
- P. Lippe, B. Veeling, P. Perdikaris, R. Turner, and J. Brandstetter (2023) PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers. In Advances in Neural Information Processing Systems, Vol. 36, pp. 67398–67433.
- L. Lu, P. Jin, G. Pang, Z. Zhang, and G. E. Karniadakis (2021) Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature Machine Intelligence 3(3), pp. 218–229.
- M. McCabe, B. Régaldo-Saint Blancard, L. Parker, R. Ohana, M. Cranmer, A. Bietti, M. Eickenberg, S. Golkar, G. Krawezik, F. Lanusse, M. Pettee, T. Tesileanu, K. Cho, and S. Ho (2024) Multiple Physics Pretraining for Spatiotemporal Surrogate Models. In Advances in Neural Information Processing Systems, Vol. 37, pp. 119301–119335.
- R. Morel, J. Han, and E. Oyallon (2025) DISCO: learning to DISCover an evolution Operator for multi-physics-agnostic prediction. In Forty-Second International Conference on Machine Learning.
- S. Müller, L. Schüler, A. Zech, and F. Heße (2022) GSTools v1.3: a toolbox for geostatistical modelling in Python. Geoscientific Model Development 15(7), pp. 3161–3182.
- R. Ohana, M. McCabe, L. Meyer, R. Morel, F. J. Agocs, M. Beneitez, M. Berger, B. Burkhart, S. B. Dalziel, D. B. Fielding, D. Fortunato, J. A. Goldberg, K. Hirashima, Y. Jiang, R. R. Kerswell, S. Maddu, J. Miller, P. Mukhopadhyay, S. S. Nixon, J. Shen, R. Watteaux, B. R. Blancard, F. Rozet, L. H. Parker, M. Cranmer, and S. Ho (2024) The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning. In Advances in Neural Information Processing Systems, Vol. 37, pp. 44989–45037.
- V. Patraucean, X. O. He, J. Heyward, C. Zhang, M. S. M. Sajjadi, G. Muraru, A. Zholus, M. Karami, R. Goroshin, Y. Chen, S. Osindero, J. Carreira, and R. Pascanu (2025) TRecViT: A Recurrent Video Transformer. Transactions on Machine Learning Research.
- M. Raissi, P. Perdikaris, and G. E. Karniadakis (2019) Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, pp. 686–707.
- O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241.
- S. Subramanian, P. Harrington, K. Keutzer, W. Bhimji, D. Morozov, M. W. Mahoney, and A. Gholami (2024) Towards foundation models for scientific machine learning: characterizing scaling and transfer behavior. In Advances in Neural Information Processing Systems, Vol. 37.
- A. Tran, A. Mathews, L. Xie, and C. S. Ong (2024) Flux neural operator for hyperbolic partial differential equations. Transactions on Machine Learning Research.
- D. E. Worrall, M. Cranmer, J. N. Kutz, and P. Battaglia (2024) Spectral Shaping for Neural PDE Surrogates.
- L. Yang, S. Liu, and S. J. Osher (2025) Fine-tune language models as multi-modal differential equation solvers. Neural Networks 188, pp. 107455.
- L. Yang and S. J. Osher (2024) PDE generalization of in-context operator networks: A study on 1D scalar nonlinear conservation laws. Journal of Computational Physics 519, pp. 113379.

## Appendix A Additional details on baselines

### A.1 ICON

For ICON, we conducted experiments using the code from Yang and Osher [[2024](https://arxiv.org/html/2605.05488#bib.bib26)], adopting the original experimental setup and protocol as closely as possible. Since the baseline models are designed to predict the solution after $\Delta t = 0.005$, we trained one ICON model to predict the same forward time interval. In addition, following the original setting of Yang and Osher [[2024](https://arxiv.org/html/2605.05488#bib.bib26)], we trained another ICON model to predict the solution after $\Delta t = 0.1$. We denote these two models as ICON ($\tau = 0.005$) and ICON ($\tau = 0.1$), respectively.

The performance on the in-distribution test dataset is reported in Table [3](https://arxiv.org/html/2605.05488#A1.T3). For ICON ($\tau = 0.005$), we performed autoregressive rollout by fixing the randomly sampled context and recursively feeding the model output back as input. As shown in the table, this leads to very poor rollout performance. In contrast, ICON ($\tau = 0.1$) predicts the target state with a single inference step and performs better than the former setting, but it still substantially lags behind the baseline models. Since ICON ($\tau = 0.005$) is effectively not meaningful for long-horizon prediction, we conducted the long-time prediction and OOD test experiments using ICON ($\tau = 0.1$). The corresponding results are summarized in Table [4](https://arxiv.org/html/2605.05488#A1.T4).

Table 3: In-distribution prediction accuracy of ICON on the 1D cubic conservation law. We report single-step prediction errors and errors after $\Delta t = 0.1$, which corresponds to a 20-step autoregressive rollout for $\tau = 0.005$ and a single step for $\tau = 0.1$. Lower is better.

Table 4: Generalization performance of ICON on the 1D conservation laws. We report long-time prediction performance on the cubic conservation law using two-step inference with the $\tau = 0.1$ model, as well as OOD generalization under shock-dominated initial conditions, sine-flux dynamics, and their combination. Lower is better.
### A.2 Model Complexity and Computational Cost

Table 5: Model size and compute budget for the cubic conservation law dataset.

## Appendix B Data generation

We generate numerical trajectories using classical finite-volume or finite-difference solvers and use them as supervised training data. For each equation family, we sample equation parameters from a prescribed distribution and independently sample initial conditions. Each pair of equation parameters and initial condition defines one trajectory. The equation parameters are stored as metadata but are not provided as model inputs during training or evaluation.

Unless otherwise stated, all simulations are performed on the periodic spatial domain $x \in [0,1]$. For all newly generated 1D datasets, we use $N_x = 100$ spatial grid cells and save $N_t = 100$ time snapshots. We sample 1000, 100, and 100 coefficient choices for the training, validation, and test datasets, respectively. For each coefficient choice, we generate 100 independent initial conditions. This results in 100,000 training trajectories, 10,000 validation trajectories, and 10,000 test trajectories.

### B.1 1D Cubic Conservation Laws

We consider the one-dimensional scalar conservation law

$$u_t + f(u)_x = 0, \qquad x \in [0,1],$$

with periodic boundary conditions. The flux is given by

$$f(u) = a u^3 + b u^2 + c u, \qquad (a, b, c) \sim \mathrm{Unif}([-1,1]^3).$$

Initial conditions are sampled from a mean-zero periodic Gaussian random field with covariance kernel

$$k(x, x') = \exp\left(-\left(1 - \cos(2\pi(x - x'))\right)\right).$$
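On a uniform periodic grid this covariance matrix is circulant, so the field can be sampled spectrally; a minimal sketch (not the authors' generator):

```python
# Spectral sampling of the mean-zero periodic GRF above: the circulant
# covariance is diagonalized by the DFT, so its eigenvalues are the FFT of
# the first covariance row.
import numpy as np

def sample_periodic_grf(n_x=100, rng=np.random.default_rng(0)):
    x = np.arange(n_x) / n_x
    cov_row = np.exp(-(1.0 - np.cos(2.0 * np.pi * x)))     # k(x_i, x_0)
    lam = np.clip(np.fft.fft(cov_row).real, 0.0, None)     # guard round-off
    z = rng.standard_normal(n_x) + 1j * rng.standard_normal(n_x)
    return np.sqrt(n_x) * np.real(np.fft.ifft(np.sqrt(lam) * z))
```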
The trajectories are generated using PyClaw [Ketcheson et al., [2012](https://arxiv.org/html/2605.05488#bib.bib22)] with a custom scalar Riemann solver for the cubic flux. We use the MC total-variation-diminishing limiter, one wave family, a desired CFL number of 0.5, and a maximum CFL number of 0.9. The resulting dataset has shape

$$[N_c, N_{\mathrm{init}}, N_t, N_x, N_q] = [1000, 100, 100, 100, 1]$$

for the training split, where $N_q = 1$ is the number of conserved variables.

#### OOD dataset simulations

For the out-of-distribution experiments, we consider both different types of initial conditions and different equation forms.

The shock-dominated initial conditions were generated as random periodic step functions with a variable number of steps and variable step heights. The number of steps ranged from 1 to 5, and the step heights ranged from -1 to 1.
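One simple way to realize such initial conditions is sketched below; the break-point sampling is an assumption, since only the step-count and height ranges are specified:

```python
# Random periodic step function with 1-5 steps and heights in [-1, 1];
# break points are drawn uniformly, which is an illustrative choice.
import numpy as np

def random_periodic_steps(n_x=100, rng=np.random.default_rng(0)):
    n_steps = rng.integers(1, 6)
    edges = np.sort(rng.uniform(0.0, 1.0, size=n_steps))
    heights = rng.uniform(-1.0, 1.0, size=n_steps)
    x = np.arange(n_x) / n_x
    # Wrapping the last segment around x = 0 keeps the function periodic.
    idx = np.searchsorted(edges, x, side="right") % n_steps
    return heights[idx]
```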

The different equation form considered was a sine-flux conservation law, with flux

$$f(u) = a \sin(bu), \qquad (a, b) \sim \mathrm{Unif}([-1,1]^2).$$

For all three types of OOD datasets created (different initial conditions, different equations, and both different initial conditions and equations), we again sampled 100 coefficient choices and 100 initial conditions per coefficient choice, resulting in 10,000 test trajectories.

### B.2 1D Parametric Shallow Water Equations

We consider a two-component parametric shallow-water-type conservation law. The state is

$$q = (h, m)^\top, \qquad m = hu,$$

where $h$ is the water height and $m$ is the momentum. The governing equation is

$$q_t + F(q)_x = 0, \qquad x \in [0,1],$$

with flux

$$F(q) = \begin{pmatrix} \alpha m \\ \gamma m^2/h + \frac{1}{2}\beta h^2 \end{pmatrix}.$$

The parameters are sampled as

$$(\alpha, \gamma, \beta) \sim \mathrm{Unif}\left([0.5, 1.5] \times [0.5, 1.5] \times [8, 12]\right).$$

The standard shallow water equations correspond to $(\alpha, \gamma, \beta) = (1, 1, g)$.

For the initial conditions, we sample $m(0)$ from a Gaussian random field with a Gaussian covariance function

$$k(x, x') = \sigma^2\left(1 - \exp\left(-\frac{s^2 |x - x'|^2}{l^2}\right)\right),$$

where we used $\sigma^2 = 0.5$ and $l = 0.3$.

To ensure positivity of the water height, $h(0)$ is sampled from a lognormal random field whose covariance function was set identical to that of $m(0)$. In the numerical solver, a small height floor $h_{\mathrm{floor}} = 10^{-8}$ is used for divisions and square roots. The Gaussian and lognormal random fields on a periodic lattice were generated using the gstools [Müller et al., [2022](https://arxiv.org/html/2605.05488#bib.bib19)] package.
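A hedged sketch of this field construction with gstools follows; the `Fourier` generator for periodic fields is assumed available (gstools ≥ 1.5), `mode_no` is an illustrative choice, and the lognormal field is obtained here by exponentiating a Gaussian field, one standard construction that may differ from the authors' exact pipeline:

```python
# Periodic Gaussian and lognormal random fields via gstools; generator
# arguments and the exponentiation step are assumptions, not the authors'
# exact pipeline.
import numpy as np
import gstools as gs

x = np.arange(100) / 100
model = gs.Gaussian(dim=1, var=0.5, len_scale=0.3)
m0 = gs.SRF(model, generator="Fourier", period=1.0, mode_no=32, seed=0)(x)
g = gs.SRF(model, generator="Fourier", period=1.0, mode_no=32, seed=1)(x)
h0 = np.exp(g)   # lognormal field, hence a positive water height
```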

Trajectories are generated using PyClaw [Ketcheson et al., [2012](https://arxiv.org/html/2605.05488#bib.bib22)] with a custom Roe-type approximate Riemann solver. We use the MC total-variation-diminishing limiter, two wave families, a desired CFL number of 0.5, and a maximum CFL number of 0.9. The resulting dataset has two state channels, corresponding to $h$ and $m$, and hence has shape

$$[N_c, N_{\mathrm{init}}, N_t, N_x, N_q] = [1000, 100, 100, 100, 2]$$

for the training split.

#### OOD dataset simulations

For the out-of-distribution experiment, we generated a dataset with a different initial-condition family for $h(0)$: shock-dominated versions of $h(0)$ were generated using random periodic step functions, as in the cubic conservation law case. The minimum and maximum numbers of steps were set to 1 and 5, respectively, and the minimum and maximum step heights were set to 0.5 and 4.5 to abide by the non-negativity constraint. The initial conditions for $m(0)$ were kept identical to the in-distribution case, as we found that using step functions for both fields resulted in excessively irregular solutions. Likewise, we sampled 100 coefficient choices and 100 initial conditions per coefficient choice, resulting in 10,000 test trajectories.

### B.3 1D Viscous Burgers Equation

We consider the parametric viscous Burgers-type equation

$$u_t + a\left(u^2\right)_x = b\, u_{xx}, \qquad x \in [0,1],$$

with periodic boundary conditions. The parameters are sampled as

$$(a, b) \sim \mathrm{Unif}\left([0.5, 1.5] \times [0.005, 0.015]\right).$$

This dataset contains an explicit diffusion term and is included to test whether the proposed architecture can handle dynamics beyond strictly hyperbolic conservation laws.

Initial conditions are sampled from the same class of one-dimensional Gaussian random fields used for the scalar conservation-law experiments. The equation is solved using an explicit finite-volume/finite-difference scheme: the nonlinear advective term is discretized with a local Rusanov flux, while the diffusion term is discretized using a centered second-order finite difference. Periodic boundary conditions are imposed throughout the simulation.
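A minimal sketch of one step of the described scheme on a periodic grid, with a fixed time step for clarity (the actual solver chooses it adaptively, as described below):

```python
# One explicit step: Rusanov (local Lax-Friedrichs) flux for a*(u^2)_x plus
# a centered second-order difference for the diffusion term (the viscosity
# b in the text is called nu here).
import numpy as np

def burgers_step(u, a, nu, dt, dx):
    u_r = np.roll(u, -1)                                        # right neighbor
    f, f_r = a * u**2, a * u_r**2                               # physical flux f(u) = a u^2
    alpha = np.maximum(np.abs(2 * a * u), np.abs(2 * a * u_r))  # local |f'(u)|
    f_half = 0.5 * (f + f_r) - 0.5 * alpha * (u_r - u)          # flux at i+1/2
    adv = (f_half - np.roll(f_half, 1)) / dx                    # conservative difference
    diff = nu * (u_r - 2.0 * u + np.roll(u, 1)) / dx**2         # centered diffusion
    return u - dt * adv + dt * diff
```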

The internal time step is chosen adaptively using both advective and diffusive stability constraints, with a CFL number of 0.4. The solver is forced to land exactly on each saved output time by shortening the final internal step before a saved snapshot if necessary. The resulting dataset has shape

$$[N_c, N_{\mathrm{init}}, N_t, N_x, N_q] = [1000, 100, 100, 100, 1]$$

for the training split.

#### OOD dataset simulations

For the out-of-distribution experiment, we generated a shock-dominated initial-condition dataset. The initial conditions were again periodic random step functions, with parameters identical to the cubic conservation law case. As before, we sampled 100 coefficient choices and 100 initial conditions per coefficient choice, resulting in 10,000 test trajectories.
