Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning

arXiv cs.LG Papers

Summary

Proposes REEF-GP, a post-hoc uncertainty quantification framework that fits a Gaussian process to the residuals of a frozen neural operator using its internal embeddings, enabling geometry-aware and calibrated uncertainties at low cost.

arXiv:2606.17513v1 Announce Type: new Abstract: Neural operators provide fast surrogates for PDEs but their deterministic predictions limit their use in tasks requiring uncertainty quantification (UQ), especially under geometric variability. Existing approaches primarily model uncertainty in network parameters, largely overlooking the geometry-aware representations learned by the operator itself. We propose REEF-GP (Residual on Embedded Features Gaussian Process), a post-hoc UQ framework that fits a GP to the residuals of a frozen neural operator whose internal embeddings define the kernel feature space. Rather than learning a separate feature map, REEF-GP adapts the operator's intrinsic coordinate-feature representations to construct geometry-aware uncertainties. To ensure stability and scalability on unstructured domains, REEF-GP incorporates spectral-normalized projections, heteroscedastic geometry-aware noise, and efficient subset-based training that avoids restrictive low-rank approximations. Across five PDE benchmarks with varying geometries, REEF-GP preserves predictive accuracy while achieving calibrated uncertainty estimates competitive with deep ensembles but at a fraction of their cost. Our approach remains robust under geometric distribution shift, with uncertainty concentrating in physically meaningful regions (e.g., shock fronts). Our results demonstrate that accurate and scalable post-hoc UQ for neural operators can be achieved directly in their learned feature space, offering a practical alternative to parameter-centric approaches.
Original Article
View Cached Full Text

Cached at: 06/17/26, 05:39 AM

# Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning
Source: [https://arxiv.org/html/2606.17513](https://arxiv.org/html/2606.17513)
Oriol Vendrell\-Gallart Nima Negarandeh Ramin Bostanabad Department of Mechanical and Aerospace Engineering University of California, Irvine Irvine, CA 92617 \{ovendrel,nnegaran,raminb\}@uci\.edu

###### Abstract

Neural operators provide fast surrogates for PDEs but their deterministic predictions limit their use in tasks requiring uncertainty quantification \(UQ\), especially under geometric variability\. Existing approaches primarily model uncertainty in network parameters, largely overlooking the geometry\-aware representations learned by the operator itself\. We propose REEF\-GP \(Residual on Embedded Features Gaussian Process\), a post\-hoc UQ framework that fits a GP to the residuals of a frozen neural operator whose internal embeddings define the kernel feature space\. Rather than learning a separate feature map, REEF\-GP adapts the operator’s intrinsic coordinate\-feature representations to construct geometry\-aware uncertainties\. To ensure stability and scalability on unstructured domains, REEF\-GP incorporates spectral\-normalized projections, heteroscedastic geometry\-aware noise, and efficient subset\-based training that avoids restrictive low\-rank approximations\. Across five PDE benchmarks with varying geometries, REEF\-GP preserves predictive accuracy while achieving calibrated uncertainty estimates competitive with deep ensembles but at a fraction of their cost\. Our approach remains robust under geometric distribution shift, with uncertainty concentrating in physically meaningful regions \(e\.g\., shock fronts\)\. Our results demonstrate that accurate and scalable post\-hoc UQ for neural operators can be achieved directly in their learned feature space, offering a practical alternative to parameter\-centric approaches\.

## 1Introduction

Neural operators are increasingly used to approximate the solution of partial differential equations \(PDEs\) in domains such as climatologyPathaket al\.\([2022](https://arxiv.org/html/2606.17513#bib.bib63)\); Bonevet al\.\([2023](https://arxiv.org/html/2606.17513#bib.bib78)\), fluid dynamicsLiet al\.\([2021](https://arxiv.org/html/2606.17513#bib.bib9)\); Rahmanet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib77)\), plasma physicsGopakumaret al\.\([2023](https://arxiv.org/html/2606.17513#bib.bib82)\); Careyet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib83)\), and solid mechanicsKhorramiet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib79)\); Jeyarajet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib81)\); Liet al\.\([2023a](https://arxiv.org/html/2606.17513#bib.bib13)\)\. Driven by this growing adoption, recent years have seen substantial advances in their architectural design and training strategies to bridge the accuracy gap with traditional solvers while improving scalability and accommodating PDE solutions over varying geometries\. In this regard, Transolver and its extensionsWuet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib15)\); Luoet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib16)\); Zhouet al\.\([2026](https://arxiv.org/html/2606.17513#bib.bib61)\)stand out as prominent examples that leverage transformer\-based backbones to achieve high accuracy, scale effectively to large problems, and natively operate on point clouds or mesh\-based datasets\.

Despite recent successes, most neural operators are built deterministically and provide point estimates that overlook uncertainties\. This limitation hinders their adoption in scientific applications where UQ is criticalMouliet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib80)\); Psaroset al\.\([2023](https://arxiv.org/html/2606.17513#bib.bib76)\)or in downstream tasks that rely on probabilistic predictions \(e\.g\., adaptive data collectionMusekampet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib71)\)\)\. Existing methods for addressing this issue broadly fall into train\-time and post\-hoc categories\. Train\-time approaches include the gold\-standard deep ensemblesLakshminarayananet al\.\([2017](https://arxiv.org/html/2606.17513#bib.bib35)\), where predictions of multiple independent models are combined together; Bayesian approximations via stochastic regularization techniques such as MC DropoutGal and Ghahramani \([2016](https://arxiv.org/html/2606.17513#bib.bib29)\); and inherently probabilistic models like DINOZAURMatveevet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib12)\), which uses a diffusion multiplier to introduce stochasticity to Fourier neural operators \(FNOs\)Liet al\.\([2021](https://arxiv.org/html/2606.17513#bib.bib9)\)\.

In contrast to train\-time UQ, post\-hoc methods add probabilistic components to a pretrained model at inference time\. This alternative is particularly attractive since training neural operators can be an expensive and resource\-intensive process, especially for inherently probabilistic variants\. Last\-layer Laplace approximationDaxbergeret al\.\([2021](https://arxiv.org/html/2606.17513#bib.bib33)\)is a popular and generic post\-hoc technique that has been recently applied to operator learningMagnaniet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib57)\)\. Our method belongs to this post\-hoc category but, unlike Laplace approximation, it builds Gaussian processes \(GPs\)Rasmussen and Williams \([2005](https://arxiv.org/html/2606.17513#bib.bib37)\)whose mean and kernel functions are constructed around a pretrained neural operator, see Figure[1](https://arxiv.org/html/2606.17513#S1.F1)\.

Kernel methods and GPsMoraet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib7)\); Batlleet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib1)\); Loweryet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib5)\)have recently been used for operator learning and benefit from theoretical guarantees and connections to Bayesian inferenceBatlleet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib6)\)\. However, they suffer from scalability issues related to dataset size and dimensionality, especially in the case of mesh\-based data\. Additionally, Bayesian inference in infinite\-dimensional spaces is inherentlybrittleOwhadiet al\.\([2015](https://arxiv.org/html/2606.17513#bib.bib45)\): unlike in finite settings, if the assumed prior is even slightly misspecified regarding the operator’s regularity, the posterior may converge to an incorrect solution with high confidence rather than contracting to the truth\. Consequently, standard approximations unaware of the underlying infinite dimensional nature of the problem introduce inductive biases that are liable to yield uncalibrated predictions\. Reliable UQ therefore requires not just approximating the posterior of a fixed model, but actively learning from data the prior that best describes the geometry of the operatorOwhadi and Yoo \([2019](https://arxiv.org/html/2606.17513#bib.bib43)\)\.

We introduce REEF\-GP, a novel post\-hoc UQ framework that models the discrepancy of a frozen pretrained neural operator with respect to the true solution as a function of both spatial coordinates and latent geometric features\. Our main contributions are as follows:

- •We design a geometry\-aware deep kernel that leverages the internal representations of a pretrained operator\. These hidden layers encode deformed representations of the input geometry and REEF\-GP reuses them rather than imposing a prior of its own\.
- •We develop training and inference procedures based on stochastic subset optimization and product of experts to make REEF\-GP practical at the scale of operator learning datasets\.
- •We demonstrate competitive calibration on five challenging 2D and 3D benchmarks, including settings with geometric distribution shift\.

![Refer to caption](https://arxiv.org/html/2606.17513v1/x1.png)Figure 1:REEF\-GP architecture\.𝒢θ\\mathcal\{G\}\_\{\\theta\}is a frozen neural operator that maps the input functionsu,au,ato the output functionvv\. REEF\-GP models𝒢θ\\mathcal\{G\}\_\{\\theta\}’s residual discrepancy by operating on the concatenation \(C\) of𝐮,𝐱,𝐡\\mathbf\{u\},\\mathbf\{x\},\\mathbf\{h\}in the functional regression form \(F\)\. Internally, it transforms \(ρ\\rho\) the learned geometry\-aware embeddings of𝒢θ\\mathcal\{G\}\_\{\\theta\}’s internal layers\(𝐡𝟏,𝐡𝐥,𝐡𝐋\)\(\\mathbf\{h\_\{1\}\},\\mathbf\{h\_\{l\}\},\\mathbf\{h\_\{L\}\}\)to a new space where the kernel operates\. In this example, the embedding in the feature\-coordinate kernel space tears off the geometry exactly at the shock wave locations\. The discrepancyδ\\deltais added to the base prediction after recovering \(R\) the point cloud form to obtain the output distribution\.
## 2Background and Related Work

### 2\.1Geometry\-Aware Operator Learning

Let𝒰\\mathcal\{U\}and𝒱\\mathcal\{V\}be two Banach spaces of functions defined on a reference bounded domainΩ⊂ℝd\\Omega\\subset\\mathbb\{R\}^\{d\}, wheredddenotes the spatial dimension\. Each PDE instance is associated with a continuous geometric descriptora∈𝒜a\\in\\mathcal\{A\}, which corresponds to a specific physical domainDa⊆ΩD\_\{a\}\\subseteq\\Omega\. For any input function and geometry pair\(u,a\)∈𝒰×𝒜\(u,a\)\\in\\mathcal\{U\}\\times\\mathcal\{A\}, the corresponding PDE solutionv∈𝒱v\\in\\mathcal\{V\}is defined onDaD\_\{a\}and satisfies the governing equations:

𝒫​\(v,u\)​\(𝐱\)\\displaystyle\\mathcal\{P\}\(v,u\)\(\\mathbf\{x\}\)=0,𝐱∈Da,\\displaystyle=0,\\quad\\mathbf\{x\}\\in D\_\{a\},\(1\)ℬ​\(v,u\)​\(𝐱\)\\displaystyle\\mathcal\{B\}\(v,u\)\(\\mathbf\{x\}\)=0,𝐱∈∂Da,\\displaystyle=0,\\quad\\mathbf\{x\}\\in\\partial D\_\{a\},where𝒫\\mathcal\{P\}andℬ\\mathcal\{B\}denote the differential and boundary operators, respectively\. Assuming a unique solution operator𝒢†\\mathcal\{G\}^\{\\dagger\}exists, we can write:

𝒢†:𝒰×𝒜→𝒱\.\\mathcal\{G\}^\{\\dagger\}:\\mathcal\{U\}\\times\\mathcal\{A\}\\to\\mathcal\{V\}\.\(2\)
#### Neural Operators\.

Under this framework, the objective is to learn the infinite\-dimensional operator𝒢†\\mathcal\{G\}^\{\\dagger\}given some data\. To this end, these models construct a parametric operator𝒢θ\\mathcal\{G\}\_\{\\theta\}that approximates𝒢†\\mathcal\{G\}^\{\\dagger\}by minimizing the empirical risk over the dataset𝒟\\mathcal\{D\}containingMMtriplets\{\(ui,ai,vi\)\}i=1M\\\{\(u\_\{i\},a\_\{i\},v\_\{i\}\)\\\}\_\{i=1\}^\{M\}, wherevi=𝒢†​\(ui,ai\)v\_\{i\}=\\mathcal\{G\}^\{\\dagger\}\(u\_\{i\},a\_\{i\}\)\. While𝒢†\\mathcal\{G\}^\{\\dagger\}acts on continuous spaces, in practice we only have access to discrete numerical evaluations\{\(𝐮i,𝐚i,𝐯i\)\}i=1M\\\{\(\\mathbf\{u\}\_\{i\},\\mathbf\{a\}\_\{i\},\\mathbf\{v\}\_\{i\}\)\\\}\_\{i=1\}^\{M\}\. For a given instance, the continuous domain is discretized as a mesh or point cloud𝐚i=\{𝐱j\}j=1N⊂Dai\\mathbf\{a\}\_\{i\}=\\\{\\mathbf\{x\}\_\{j\}\\\}\_\{j=1\}^\{N\}\\subset D\_\{a\_\{i\}\}and𝐯i=\{vi​\(𝐱i,j\)\}j=1N∈ℝN\\mathbf\{v\}\_\{i\}=\\\{v\_\{i\}\(\\mathbf\{x\}\_\{i,j\}\)\\\}\_\{j=1\}^\{N\}\\in\\mathbb\{R\}^\{N\}collects the corresponding nodal solution values\.

A large body of work in operator learning has focused on designing architectures that can represent solution operators across discretizations and geometries\. Early spectral approaches such as FNOs are particularly effective on regular discretizationsLiet al\.\([2021](https://arxiv.org/html/2606.17513#bib.bib9)\)but more recent geometry\-aware variants accommodate irregular domains, meshes, and point cloudsLiet al\.\([2023a](https://arxiv.org/html/2606.17513#bib.bib13),[b](https://arxiv.org/html/2606.17513#bib.bib22)\); Wuet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib15)\)\. Transformer\-based architectures such as Transolver are especially relevant in this setting as they natively operate on general geometries and preserve spatially aligned hidden states that can later be exploited for UQ\.

### 2\.2Gaussian Processes

GPsRasmussen and Williams \([2005](https://arxiv.org/html/2606.17513#bib.bib37)\)provide a principled Bayesian framework for regression, yielding closed\-form posteriors with calibrated uncertainty estimates\. We apply them to operator learning via a*functional regression*perspectiveMoraet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib7)\): instead of directly learning𝒢†\\mathcal\{G\}^\{\\dagger\}, which outputs an infinite\-dimensional function, we learn its evaluation functional, so the pointwise evaluation of the operator is represented by a finite\-dimensional surrogate

𝒢~†:ℝdu×ℝda×Ω→ℝ,\(𝐮,𝐚,𝐱\)↦v​\(𝐱\),\\widetilde\{\\mathcal\{G\}\}^\{\\dagger\}:\\mathbb\{R\}^\{d\_\{u\}\}\\times\\mathbb\{R\}^\{d\_\{a\}\}\\times\\Omega\\to\\mathbb\{R\},\\qquad\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\\mapsto v\(\\mathbf\{x\}\),\(3\)where𝐮∈ℝdu\\mathbf\{u\}\\in\\mathbb\{R\}^\{d\_\{u\}\}and𝐚∈ℝda\\mathbf\{a\}\\in\\mathbb\{R\}^\{d\_\{a\}\}are discrete encodings of the input functionuuand geometryaa, and𝐱∈Da\\mathbf\{x\}\\in D\_\{a\}is a query coordinate\. Operator learning thus reduces to scalar regression over an augmented input𝐳=\(𝐮,𝐚,𝐱\)\\mathbf\{z\}=\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\. The complete derivation is provided in Appendix[A](https://arxiv.org/html/2606.17513#A1)\. GivenMMsamples each evaluated atNNspatial points, the training set containsM​NMNpairs\{\(𝐳i,vi\)\}i=1M​N=\{𝐙,𝐯\}\\\{\(\\mathbf\{z\}\_\{i\},v\_\{i\}\)\\\}\_\{i=1\}^\{MN\}=\\\{\\mathbf\{Z\},\\mathbf\{v\}\\\}, where each𝐳i=\(𝐮s​\(i\),𝐚s​\(i\),𝐱i\)\\mathbf\{z\}\_\{i\}=\(\\mathbf\{u\}\_\{s\(i\)\},\\mathbf\{a\}\_\{s\(i\)\},\\mathbf\{x\}\_\{i\}\)ands​\(i\)s\(i\)identifies the sample to which pointiibelongs\. A GP prior over the latent functionf​\(𝐳\)f\(\\mathbf\{z\}\)takes the form

f​\(𝐳\)∼𝒢​𝒫​\(m​\(𝐳;𝜷\),k​\(𝐳,𝐳′;ϕ\)\),f\(\\mathbf\{z\}\)\\sim\\mathcal\{GP\}\\big\(m\(\\mathbf\{z\};\\boldsymbol\{\\beta\}\),\\ k\(\\mathbf\{z\},\\mathbf\{z\}^\{\\prime\};\\boldsymbol\{\\phi\}\)\\big\),\(4\)with parametric mean and covariance functionsm​\(⋅\)m\(\\cdot\)andk​\(⋅,⋅\)k\(\\cdot,\\cdot\)\. Given noisy observationsvi=f​\(𝐳i\)\+ϵv\_\{i\}=f\(\\mathbf\{z\}\_\{i\}\)\+\\epsilonwithϵ∼𝒩​\(0,λ2\)\\epsilon\\sim\\mathcal\{N\}\(0,\\lambda^\{2\}\), the hyperparameters\{𝜷,ϕ,λ2\}\\\{\\boldsymbol\{\\beta\},\\boldsymbol\{\\phi\},\\lambda^\{2\}\\\}are typically optimized via maximum likelihood estimation\. The posterior mean and covariance at a test input𝐳∗\\mathbf\{z\}\_\{\*\}then admit the closed formsm¯​\(𝐳∗\)=m​\(𝐳∗\)\+k​\(𝐳∗,𝐙\)​\(𝐊\+λ2​𝐈\)−1​\(𝐯−m​\(𝐙\)\)\\bar\{m\}\(\\mathbf\{z\}\_\{\*\}\)=m\(\\mathbf\{z\}\_\{\*\}\)\+k\(\\mathbf\{z\}\_\{\*\},\\mathbf\{Z\}\)\(\\mathbf\{K\}\+\\lambda^\{2\}\\mathbf\{I\}\)^\{\-1\}\(\\mathbf\{v\}\-m\(\\mathbf\{Z\}\)\)andk¯​\(𝐳∗,𝐳∗′\)=k​\(𝐳∗,𝐳∗′\)−k​\(𝐳∗,𝐙\)​\(𝐊\+λ2​𝐈\)−1​k​\(𝐙,𝐳∗′\)\\bar\{k\}\(\\mathbf\{z\}\_\{\*\},\\mathbf\{z\}\_\{\*\}^\{\\prime\}\)=k\(\\mathbf\{z\}\_\{\*\},\\mathbf\{z\}\_\{\*\}^\{\\prime\}\)\-k\(\\mathbf\{z\}\_\{\*\},\\mathbf\{Z\}\)\(\\mathbf\{K\}\+\\lambda^\{2\}\\mathbf\{I\}\)^\{\-1\}k\(\\mathbf\{Z\},\\mathbf\{z\}\_\{\*\}^\{\\prime\}\)\.

#### Scalability and Kernel Design\.

Standard GPs are fundamentally limited by their𝒪​\(N3\)\\mathcal\{O\}\(N^\{3\}\)time and𝒪​\(N2\)\\mathcal\{O\}\(N^\{2\}\)memory complexity, making exact inference intractable for operator learning where datasets can easily contain millions of discretization points\. Moreover, commonly used scalable GP approximations can introduce structural assumptions that are poorly matched to geometry\-varying PDE surrogates\. For instance, tensor\-product kernelsWilson and Nickisch \([2015](https://arxiv.org/html/2606.17513#bib.bib41)\); Zheet al\.\([2019](https://arxiv.org/html/2606.17513#bib.bib42)\)are naturally tied to regular grids and impose separable covariance structure, which can be restrictive for non\-stationary and anisotropic solution fields on irregular domains\. Likewise, inducing\-point approximations such as NyströmWilliams and Seeger \([2000](https://arxiv.org/html/2606.17513#bib.bib39)\), SGPRTitsias \([2009](https://arxiv.org/html/2606.17513#bib.bib38)\), and SVGPHensmanet al\.\([2015](https://arxiv.org/html/2606.17513#bib.bib40)\)rely on low\-rank approximations that can limit the GP’s flexibility\.

Prior GP\-based approaches to operator learning involve fixed\-geometry settingsBatlleet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib1)\); Moraet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib7)\)where the input dimensionality remains moderate and Kronecker structure can be exploited for scalability\. These advantages disappear with geometric variability\. A natural alternative is deep kernel learning \(DKL\) which, when combined with scalable approximations, has enabled GPs to address large\-scale and high\-dimensional problems\. However, in settings governed by strong structural constraints \(e\.g\., PDE\-driven problems\) kernel design is critical for achieving both expressivity and well\-calibrated uncertaintyWilsonet al\.\([2016](https://arxiv.org/html/2606.17513#bib.bib27)\)\. In this work, we address this challenge by leveraging the learnt embeddings of a pretrained neural operator, thus encoding geometry and physics\-aware structure directly into the GP\.

### 2\.3Scalable Uncertainty Quantification for Neural Operators

From a Bayesian perspective, UQ in deep networks such as neural operators amounts to characterizing the predictive distribution induced by uncertain model parameters or representations\. In principle, this requires marginalizing over a posterior involving millions of parameters\. Since this is computationally intractable, practical UQ relies on scalable approximations that are either train\-time or post\-hoc\.

#### Train\-time UQ\.

These methods consider uncertainties during training, typically via weight\-space stochasticity or repeated retraining as in deep ensemblesLakshminarayananet al\.\([2017](https://arxiv.org/html/2606.17513#bib.bib35)\)\. Computationally efficient alternatives include SWAGMaddoxet al\.\([2019](https://arxiv.org/html/2606.17513#bib.bib62)\), which performs approximate Bayesian model averaging using weight trajectories collected during training, and Monte Carlo \(MC\) dropoutGal and Ghahramani \([2016](https://arxiv.org/html/2606.17513#bib.bib29)\), where uncertainty is estimated via repeated stochastic forward passes with dropout activated at inference time\. These ideas have also been adapted to operator learning; for example, Probabilistic Neural Operator \(PNO\)Bülteet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib58),[2025](https://arxiv.org/html/2606.17513#bib.bib59)\)uses dropout during training to generate functional samples and optimize an energy\-score objective\. In broader deep\-learning settings, related architectural modifications such as Spectral\-normalized GPsLiuet al\.\([2023](https://arxiv.org/html/2606.17513#bib.bib51)\)have also been proposed to improve uncertainty estimation\.

#### Post\-hoc UQ\.

To preserve the computational efficiency of deterministic training, post\-hoc methods augment a frozen pretrained base model for UQ\. The most widely used approach is the Laplace approximationMacKay \([1992](https://arxiv.org/html/2606.17513#bib.bib31)\); Ritteret al\.\([2018](https://arxiv.org/html/2606.17513#bib.bib32)\); Daxbergeret al\.\([2021](https://arxiv.org/html/2606.17513#bib.bib33)\), which fits a local Gaussian posterior around the maximum a posteriori \(MAP\) estimate of the weights \(for scalability, this strategy is often restricted to the last layer\)\. Laplace approximation has also been explored for neural operatorsMagnaniet al\.\([2022](https://arxiv.org/html/2606.17513#bib.bib34)\); Weberet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib56)\), and more recently LUNOMagnaniet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib57)\)applies it to the internal spectral representations and propagates them through a linearized upstream network\. Beyond operator learning, Deep Vecchia Ensembles \(DVE\)Jimenez and Katzfuss \([2025](https://arxiv.org/html/2606.17513#bib.bib52)\)move from weight\-space to feature\-space by fitting a GP on internal neural representations\.

While train\-time methods benefit from rich epistemic exploration, they are prohibitively expensive for large\-scale operator learning on unstructured meshes\. In contrast, post\-hoc UQ methods are cheaper but beyond DVE they mostly rely on local weight\-space approximations instead of fully exploiting the geometry\-aware structure encoded in the pretrained operator\. These limitations motivate uncertainty models that operate directly in the coordinate and feature space of frozen neural operators\.

## 3Method: REEF\-GP

Most existing UQ methods for neural operators act in weight space, either during training or through post\-hoc local approximations\. With REEF\-GP \(Figure[1](https://arxiv.org/html/2606.17513#S1.F1)\) we pursue a complementary perspective: rather than perturbing the model parameters, we model the structural inadequacy of a pretrained neural operator directly in the input\-coordinate space induced by the operator and its internal representations\. More specifically, we approach the problem from the perspective of the Kennedy–O’Hagan \(KOH\) discrepancy frameworkKennedy and O’Hagan \([2002](https://arxiv.org/html/2606.17513#bib.bib54)\)\. While KOH was originally introduced for Bayesian calibration of computer models and fusing high\- and low\-fidelity datasetsHigdonet al\.\([2004](https://arxiv.org/html/2606.17513#bib.bib2)\); Loeppkyet al\.\([2006](https://arxiv.org/html/2606.17513#bib.bib4)\), we reformulate it for post\-hoc UQ in operator learning\. We treat the pretrained neural operator𝒢θ\\mathcal\{G\}\_\{\\theta\}as a deterministic computer model whose parameters remain strictly frozen\. Since𝒢θ\\mathcal\{G\}\_\{\\theta\}is an approximation to𝒢†\\mathcal\{G\}^\{\\dagger\}, there exists a residual discrepancy between the two which we interpret as*model inadequacy*\.

Building on the functional regression view introduced in Section[2\.2](https://arxiv.org/html/2606.17513#S2.SS2), we write the true solution evaluated at a spatial query coordinate𝐱∈Da\\mathbf\{x\}\\in D\_\{a\}as:

v​\(𝐱\)=𝒢θ​\(𝐮,𝐚\)​\(𝐱\)\+δ​\(𝐮,𝐚,𝐱\)\+ϵ,v\(\\mathbf\{x\}\)=\\mathcal\{G\}\_\{\\theta\}\(\\mathbf\{u\},\\mathbf\{a\}\)\(\\mathbf\{x\}\)\+\\delta\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\+\\epsilon,\(5\)whereδ​\(⋅\)\\delta\(\\cdot\)denotes the model inadequacy andϵ\\epsiloncaptures stochastic variability or observational noise\.

Unlike classical KOH formulations which treat the computer model as a black box, we exploit its internal structure\. A neural operator can be written as a composition ofLLlatent transformations:

𝒢θ​\(𝐮,𝐚\)=ℒL∘ℒL−1∘⋯∘ℒ1​\(𝐮,𝐚\)\.\\mathcal\{G\}\_\{\\theta\}\(\\mathbf\{u\},\\mathbf\{a\}\)=\\mathcal\{L\}\_\{L\}\\circ\\mathcal\{L\}\_\{L\-1\}\\circ\\dots\\circ\\mathcal\{L\}\_\{1\}\(\\mathbf\{u\},\\mathbf\{a\}\)\.\(6\)Therefore,δ​\(⋅\)\\delta\(\\cdot\)is the result of cumulative approximation errors propagated through the network’s internal states \(this observation is consistent with the representational limits and spectral bias in deep learningRahamanet al\.\([2019](https://arxiv.org/html/2606.17513#bib.bib55)\)\)\. Additionally, recent analyses indicate that compared to the highly constrained output layer, intermediate representations often contain strictly richer and less\-compressed information regarding the underlying physicsSkeanet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib65)\); Jimenez and Katzfuss \([2025](https://arxiv.org/html/2606.17513#bib.bib52)\)\. In the case of operator learning, we demonstrate thismultiscalerepresentation with a detailed layer\-wise analysis in Appendix[C](https://arxiv.org/html/2606.17513#A3), where we also illustrate the layer\-wise geometry deformation using t\-SNEvan der Maaten and Hinton \([2008](https://arxiv.org/html/2606.17513#bib.bib60)\)\.

Motivated by this multiscale error structure, we enrich the discrepancy model with hidden representations extracted from the frozen𝒢θ\\mathcal\{G\}\_\{\\theta\}\. Let𝐡l​\(𝐮,𝐚,𝐱\)\\mathbf\{h\}\_\{l\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)denote the spatially aligned hidden feature at layerlland query coordinate𝐱\\mathbf\{x\}\. We selectL′≤LL^\{\\prime\}\\leq Linformative layers \(the selection procedure is described in Appendix[B\.1](https://arxiv.org/html/2606.17513#A2.SS1)with ablations in Appendix[D\.1](https://arxiv.org/html/2606.17513#A4.SS1)\) and concatenate them into the augmented state𝐡​\(𝐮,𝐚,𝐱\)=⨁l=1L′𝐡l​\(𝐮,𝐚,𝐱\)\\mathbf\{h\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)=\\bigoplus\_\{l=1\}^\{L^\{\\prime\}\}\\mathbf\{h\}\_\{l\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\. We then refine the discrepancy model by replacing the black\-box correction term with a feature\-aware alternative defined on the coordinate\-feature state:

v​\(𝐱\)=𝒢θ​\(𝐮,𝐚\)​\(𝐱\)\+δ​\(𝐮,𝐱,𝐡​\(𝐮,𝐚,𝐱\)\)\+ϵ\.v\(\\mathbf\{x\}\)=\\mathcal\{G\}\_\{\\theta\}\(\\mathbf\{u\},\\mathbf\{a\}\)\(\\mathbf\{x\}\)\+\\delta\(\\mathbf\{u\},\\mathbf\{x\},\\mathbf\{h\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\)\+\\epsilon\.\(7\)This yields a discrepancy model conditioned on location and the local regime given by the operator\.

### 3\.1Probabilistic Discrepancy Modeling

To obtain calibrated uncertainties, we model the feature\-aware discrepancy probabilistically by first decomposing it into two parts: a latent noiseless residual correction functionf​\(⋅\)f\(\\cdot\)and a heteroscedastic noise termε​\(⋅\)\\varepsilon\(\\cdot\)that accounts for the unexplained variability:

δ​\(𝐮,𝐱,𝐡​\(𝐮,𝐚,𝐱\)\)=f​\(𝐮,𝐱,𝐡​\(𝐮,𝐚,𝐱\)\)\+ε​\(𝐮,𝐱,𝐡​\(𝐮,𝐚,𝐱\)\)\.\\delta\(\\mathbf\{u\},\\mathbf\{x\},\\mathbf\{h\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\)=f\(\\mathbf\{u\},\\mathbf\{x\},\\mathbf\{h\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\)\+\\varepsilon\(\\mathbf\{u\},\\mathbf\{x\},\\mathbf\{h\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\)\.\(8\)Then, we place a zero\-mean GP prior over the residual, that isf∼𝒢​𝒫​\(0,kϕ​\(𝐳¯,𝐳¯′\)\)f\\sim\\mathcal\{GP\}\(0,k\_\{\\boldsymbol\{\\phi\}\}\(\\mathbf\{\\bar\{z\}\},\\mathbf\{\\bar\{z\}\}^\{\\prime\}\)\)where𝐳¯\\mathbf\{\\bar\{z\}\}is the augmented state𝐳¯=\(𝐮,𝐱,𝐡​\(𝐮,𝐚,𝐱\)\)\\mathbf\{\\bar\{z\}\}=\(\\mathbf\{u\},\\mathbf\{x\},\\mathbf\{h\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\)andkϕk\_\{\\boldsymbol\{\\phi\}\}is the kernel with hyperparametersϕ\\boldsymbol\{\\phi\}\. Next, we jointly model the unresolved error and the independent observational noise as:

ε​\(𝐳¯\)\+ϵ∼𝒩​\(0,σ𝝍2​\(𝐳¯\)\+σn2\),\\varepsilon\(\\mathbf\{\\bar\{z\}\}\)\+\\epsilon\\sim\\mathcal\{N\}\\\!\\left\(0,\\sigma\_\{\\boldsymbol\{\\psi\}\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\)\+\\sigma\_\{n\}^\{2\}\\right\),\(9\)whereσn2\\sigma\_\{n\}^\{2\}is a global observation\-noise term modeled asϵ∼𝒩​\(0,σn2\)\\epsilon\\sim\\mathcal\{N\}\(0,\\sigma\_\{n\}^\{2\}\)\. Although many PDE benchmarks are effectively deterministic, retainingσn2\\sigma\_\{n\}^\{2\}allows our framework to accommodate stochastic simulators or measurement noise in experimental settings\. Moreover, we parametrizeσ𝝍2\\sigma\_\{\\boldsymbol\{\\psi\}\}^\{2\}by a spectral\-normalizedLiuet al\.\([2023](https://arxiv.org/html/2606.17513#bib.bib51)\)dense network that maps𝐳¯\\mathbf\{\\bar\{z\}\}to a strictly positive value\. The spectral normalization mitigates feature collapse which, in turn, enables a GP to surrogateδ​\(⋅\)\\delta\(\\cdot\)\.

With the independent Gaussian priors and Equations[8](https://arxiv.org/html/2606.17513#S3.E8)and[9](https://arxiv.org/html/2606.17513#S3.E9), the marginal prior for the solution value at a query point𝐱\\mathbf\{x\}is:

v​\(𝐱\)∼𝒩​\(𝒢θ​\(𝐮,𝐚\)​\(𝐱\),kϕ​\(𝐳¯,𝐳¯\)\+σ𝝍2​\(𝐳¯\)\+σn2\)\.v\(\\mathbf\{x\}\)\\sim\\mathcal\{N\}\\Big\(\\mathcal\{G\}\_\{\\theta\}\(\\mathbf\{u\},\\mathbf\{a\}\)\(\\mathbf\{x\}\),\\;k\_\{\\boldsymbol\{\\phi\}\}\(\\mathbf\{\\bar\{z\}\},\\mathbf\{\\bar\{z\}\}\)\+\\sigma\_\{\\boldsymbol\{\\psi\}\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\)\+\\sigma\_\{n\}^\{2\}\\Big\)\.\(10\)The kernel design, hyperparameter estimation, and inference are detailed in the next sections\.

### 3\.2Kernel Design for Infinite\-dimensional Inference

The rationale behind defining the discrepancyδ\\deltaas a function of the operator’s internal state is rooted in a fundamental theoretical challenge: the inherent brittleness of Bayesian inference in infinite\-dimensional spacesOwhadiet al\.\([2015](https://arxiv.org/html/2606.17513#bib.bib45)\)\. In finite\-dimensional settings, posterior contraction is often robust to moderate prior misspecification, but in function spaces even mild mismatch between the prior regularity and the target operator can lead to highly confident yet incorrect inference\.

To avoid imposing a misspecified prior, we draw inspiration from Kernel Flows \(KF\)Owhadi and Yoo \([2019](https://arxiv.org/html/2606.17513#bib.bib43)\)which advocates for learning the kernel geometry from data without the need of relying on physics informed priorsSpitieris and Steinsland \([2023](https://arxiv.org/html/2606.17513#bib.bib53)\)\. Our key modification is to avoid learning the manifold from scratch: we assume that the pretrained neural operator has already deformed the original input\-geometry space into a physics\-aware latent representation through its hidden layers\. In Appendix[C](https://arxiv.org/html/2606.17513#A3)we provide a more formal motivation for this latent geometry and its connection to KF and DKL\.

We therefore construct a deep kernelWilsonet al\.\([2016](https://arxiv.org/html/2606.17513#bib.bib27)\)over the augmented state \(see Appendix[B\.2](https://arxiv.org/html/2606.17513#A2.SS2)for architecture details\)\. To respect the distinct topological roles of spatial location, input constraints, and latent operator state, we leverage kernel closure properties and define the covariance as the product kernel:

kϕ​\(𝐳¯,𝐳¯′\)=kspace​\(𝐱,𝐱′\)⋅kfun​\(𝐮,𝐮′\)⋅klatent​\(𝐡,𝐡′\)\.k\_\{\\boldsymbol\{\\phi\}\}\(\\mathbf\{\\bar\{z\}\},\\mathbf\{\\bar\{z\}\}^\{\\prime\}\)=k\_\{\\text\{space\}\}\(\\mathbf\{x\},\\mathbf\{x\}^\{\\prime\}\)\\cdot k\_\{\\text\{fun\}\}\(\\mathbf\{u\},\\mathbf\{u\}^\{\\prime\}\)\\cdot k\_\{\\text\{latent\}\}\(\\mathbf\{h\},\\mathbf\{h\}^\{\\prime\}\)\.\(11\)Here,kspacek\_\{\\text\{space\}\}acts on physical coordinates,kfunk\_\{\\text\{fun\}\}acts on the input function representation when present, andklatentk\_\{\\text\{latent\}\}acts on the hidden operator embeddings\. For the latent component, we further define:

klatent​\(𝐡,𝐡′\)=kbase​\(ρ​\(𝐡\),ρ​\(𝐡′\)\),k\_\{\\text\{latent\}\}\(\\mathbf\{h\},\\mathbf\{h\}^\{\\prime\}\)=k\_\{\\text\{base\}\}\\\!\\left\(\\rho\(\\mathbf\{h\}\),\\rho\(\\mathbf\{h\}^\{\\prime\}\)\\right\),\(12\)wherekbasek\_\{\\text\{base\}\}is a stationary kernel andρ​\(⋅\)\\rho\(\\cdot\)is a warping function parameterized with a spectral\-normalized MLP\. This construction yields a non\-stationary prior in the original coordinate space: spatial correlations are modulated by similarity in the learned latent state, allowing the discrepancy model to adapt to different local physical regimes while remaining anchored to the pretrained operator’s internal geometry\.

### 3\.3Scalable Training

Under the functional regression formulation of Section[2\.2](https://arxiv.org/html/2606.17513#S2.SS2), the training set containsM​NMNelements\. In operator learningM​NMNis often in the order of millions and so exact GP training is infeasible due to𝒪​\(\(M​N\)3\)\\mathcal\{O\}\(\(MN\)^\{3\}\)time and𝒪​\(\(M​N\)2\)\\mathcal\{O\}\(\(MN\)^\{2\}\)memory requirements\. Hence, again inspired by KFOwhadi and Yoo \([2019](https://arxiv.org/html/2606.17513#bib.bib43)\), we develop a stochastic subset strategy that jointly optimizes all model hyperparameters\. At each optimization step, we randomly sample a subset ofNs≪M​NN\_\{s\}\\ll MNpoints and evaluate𝜹\\boldsymbol\{\\delta\}defined as theNsN\_\{s\}\-dimensional vector of structural residuals with entriesδi=v​\(𝐱i\)−𝒢θ​\(𝐮s​\(i\),𝐚s​\(i\)\)​\(𝐱i\)\\delta\_\{i\}=v\(\\mathbf\{x\}\_\{i\}\)\-\\mathcal\{G\}\_\{\\theta\}\(\\mathbf\{u\}\_\{s\(i\)\},\\mathbf\{a\}\_\{s\(i\)\}\)\(\\mathbf\{x\}\_\{i\}\)\. Then, we update the hyperparameters using mini\-batch negative log marginal likelihood as the loss:

ℒ​\(ϕ,𝝍\)=12​𝜹⊤​𝐊ϕ​𝝍−1​𝜹\+12​log⁡\|𝐊ϕ​𝝍\|\+Ns2​log⁡\(2​π\),\\mathcal\{L\}\(\\boldsymbol\{\\phi\},\\boldsymbol\{\\psi\}\)=\\frac\{1\}\{2\}\\boldsymbol\{\\delta\}^\{\\top\}\\mathbf\{K\}\_\{\\boldsymbol\{\\phi\}\\boldsymbol\{\\psi\}\}^\{\-1\}\\boldsymbol\{\\delta\}\+\\frac\{1\}\{2\}\\log\|\\mathbf\{K\}\_\{\\boldsymbol\{\\phi\}\\boldsymbol\{\\psi\}\}\|\+\\frac\{N\_\{s\}\}\{2\}\\log\(2\\pi\),\(13\)where𝐊ϕ​𝝍=kϕ​\(𝐙¯,𝐙¯\)\+diag​\(σ𝝍2​\(𝐙¯\)\)\+σn2​𝐈\\mathbf\{K\}\_\{\\boldsymbol\{\\phi\}\\boldsymbol\{\\psi\}\}=k\_\{\\boldsymbol\{\\phi\}\}\(\\mathbf\{\\bar\{Z\}\},\\mathbf\{\\bar\{Z\}\}\)\+\\mathrm\{diag\}\\\!\\left\(\\sigma\_\{\\boldsymbol\{\\psi\}\}^\{2\}\(\\mathbf\{\\bar\{Z\}\}\)\\right\)\+\\sigma\_\{n\}^\{2\}\\mathbf\{I\}is the tractableNs×NsN\_\{s\}\\times N\_\{s\}joint covariance matrix evaluated at the augmented states𝐙¯\\mathbf\{\\bar\{Z\}\}\. By drawing randomNsN\_\{s\}\-sized subsets at each epoch, we strictly decouple the memory constraints from the true mesh resolutionNN\. In our studies, this training procedure combined with spectral normalization and heteroscedastic noise remains stable\. The implementation, empirical validation and optimization settings are detailed in Appendix[B\.4](https://arxiv.org/html/2606.17513#A2.SS4)\.

We highlight that the KF loss introduced inOwhadi and Yoo \([2019](https://arxiv.org/html/2606.17513#bib.bib43)\)discards UQ and focuses on building kernelinterpolantswhose accuracy is robust to the training data size\. We borrow the notion of robustness behind KF loss and extend it to accommodate UQ\. In Appendix[C](https://arxiv.org/html/2606.17513#A3)we further elaborate in this regard and detail how REEF\-GP differs from standard DKL with inducing points\.

### 3\.4Inference via Generalized Product of Experts

Conditioning the GP onM​NMNtraining points is infeasible for fast inference\. To address this issue, we use generalized Product of Experts \(gPoE\)Cao and Fleet \([2015](https://arxiv.org/html/2606.17513#bib.bib72)\)where we first randomly drawKKsupport subsets of the data to constructKKindependent GP experts that share the same hyperparameters\. We then aggregate their predictive means and variances in precision space to obtain the posterior of our discrepancy surrogate\. Finally, this posterior is added to the frozen base operator’s prediction to form the final predictive distribution\. This formulation strictly bounds the test\-time memory footprint to𝒪​\(Ns2\)\\mathcal\{O\}\(N\_\{s\}^\{2\}\)per expert\. Full mathematical details of the gPoE aggregation are provided in Appendix[B\.5](https://arxiv.org/html/2606.17513#A2.SS5)\.

## 4Experiments

#### Base Neural Operator\.

We use TransolverWuet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib15)\)as the base deterministic neural operator in all experiments\. Unlike operators with global spectral filters \(e\.g\., FNOs\), Transolver preserves pointwise spatial correspondence between blocks\. This allows us to extract the spatially\-aligned hidden features𝐡​\(𝐮,𝐚,𝐱\)\\mathbf\{h\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)required for our augmented state\. By keeping the base architecture fixed, we isolate the effects of our UQ formulation\. Hyperparameter configurations are detailed in Appendix[E](https://arxiv.org/html/2606.17513#A5)\.

#### Benchmarks\.

We use a combination of 2D \(*Elasticity, Airfoil*, and*Pipe*\) and 3D \(*ShapeNet Car*and*Ahmed Body*\) datasets involving point clouds\. In the main text we present results on the representative 2D*Airfoil*and 3D*ShapeNet Car*benchmarks; full results across all five datasets are reported in Appendix[I](https://arxiv.org/html/2606.17513#A9)\. Dataset details are included in Appendix[F](https://arxiv.org/html/2606.17513#A6)\.

#### Baselines\.

We compare REEF\-GP against six UQ baselines: \(1\) Deep Ensembles, \(2\) MC Dropout, \(3\) PNO, \(4\) Input Perturbation, \(5\) Laplace Approximation \(LUNO\-LA\), and \(6\) Deep Vecchia Ensemble \(DVE\-spatial\)\. All baselines use the same pretrained Transolver architecture\. Implementation details are provided in Appendix[G](https://arxiv.org/html/2606.17513#A7)and evaluation metrics are defined in Appendix[H](https://arxiv.org/html/2606.17513#A8)\.

Table 1:Evaluation metrics on representative 2D \(Airfoil\) and 3D \(ShapeNet Car\) benchmarks\.Lower is better \(↓\\downarrow\)\. Best inbold, second bestunderlined\. Base is the deterministic Transolver without UQ; rankings exclude Base\. REEF\-GP is our approach\. Metrics are defined in Appendix[H](https://arxiv.org/html/2606.17513#A8)\.MethodrL2↓\\downarrowNLL↓\\downarrowCRPS↓\\downarrowNIS↓\\downarrowES↓\\downarrow*Airfoil \(2D\)*\(%\)\(×10−4\\times 10^\{\-4\}\)\(×10−2\\times 10^\{\-2\}\)Base \(no UQ\)1\.241\.24±0\.12\\pm 0\.12––––Ensemble1\.06\\mathbf\{1\.06\}−3\.46¯\\underline\{\-3\.46\}28\.63\\mathbf\{28\.63\}3\.89\\mathbf\{3\.89\}0\.35\\mathbf\{0\.35\}MC Dropout1\.581\.58±0\.21\\pm 0\.21−3\.38\-3\.38±0\.15\\pm 0\.1544\.8844\.88±5\.66\\pm 5\.665\.55¯\\underline\{5\.55\}±0\.91\\pm 0\.910\.550\.55±0\.08\\pm 0\.08PNO2\.322\.32±1\.40\\pm 1\.40−3\.28\-3\.28±0\.65\\pm 0\.6574\.2774\.27±56\.12\\pm 56\.128\.268\.26±6\.39\\pm 6\.390\.790\.79±0\.47\\pm 0\.47Perturbation3\.333\.33±0\.19\\pm 0\.19−3\.32\-3\.32±0\.05\\pm 0\.0581\.0581\.05±5\.67\\pm 5\.679\.719\.71±0\.66\\pm 0\.661\.081\.08±0\.06\\pm 0\.06LUNO\-LA1\.24¯\\underline\{1\.24\}±0\.12\\pm 0\.12−3\.07\-3\.07±0\.82\\pm 0\.8238\.70¯\\underline\{38\.70\}±2\.50\\pm 2\.506\.196\.19±0\.54\\pm 0\.540\.470\.47±0\.04\\pm 0\.04DVE\-spatial2\.092\.09±0\.25\\pm 0\.2519\.9819\.98±7\.32\\pm 7\.3288\.4188\.41±12\.53\\pm 12\.5325\.8925\.89±3\.09\\pm 3\.090\.920\.92±0\.10\\pm 0\.10REEF\-GP1\.24¯\\underline\{1\.24\}±0\.12\\pm 0\.12−3\.51\\mathbf\{\-3\.51\}±0\.06\\pm 0\.0639\.8039\.80±2\.35\\pm 2\.355\.695\.69±0\.47\\pm 0\.470\.45¯\\underline\{0\.45\}±0\.05\\pm 0\.05*ShapeNet Car \(3D\)*\(%\)Base \(no UQ\)8\.628\.62±0\.13\\pm 0\.13––––Ensemble7\.73\\mathbf\{7\.73\}4\.16¯\\underline\{4\.16\}1\.88\\mathbf\{1\.88\}30\.28¯\\underline\{30\.28\}194\.57\\mathbf\{194\.57\}MC Dropout9\.049\.04±0\.59\\pm 0\.5917\.2117\.21±4\.21\\pm 4\.212\.522\.52±0\.26\\pm 0\.2667\.9667\.96±9\.04\\pm 9\.04267\.68267\.68±17\.66\\pm 17\.66PNO13\.7613\.76±1\.74\\pm 1\.744\.984\.98±1\.52\\pm 1\.523\.533\.53±0\.50\\pm 0\.5051\.1051\.10±12\.17\\pm 12\.17337\.16337\.16±43\.04\\pm 43\.04Perturbation8\.628\.62±0\.13\\pm 0\.1313\.4413\.44±1\.14\\pm 1\.142\.282\.28±0\.03\\pm 0\.0350\.8250\.82±1\.10\\pm 1\.10224\.73224\.73±3\.97\\pm 3\.97LUNO\-LA8\.628\.62±0\.13\\pm 0\.1333\.5533\.55±23\.43\\pm 23\.432\.472\.47±0\.13\\pm 0\.1373\.4773\.47±14\.91\\pm 14\.91271\.32271\.32±12\.10\\pm 12\.10DVE\-spatial13\.0813\.08±0\.26\\pm 0\.2611\.2211\.22±3\.03\\pm 3\.033\.763\.76±0\.10\\pm 0\.1096\.1796\.17±10\.43\\pm 10\.43396\.70396\.70±6\.79\\pm 6\.79REEF\-GP8\.52¯\\underline\{8\.52\}±0\.14\\pm 0\.142\.84\\mathbf\{2\.84\}±0\.09\\pm 0\.092\.05¯\\underline\{2\.05\}±0\.03\\pm 0\.0328\.81\\mathbf\{28\.81\}±1\.65\\pm 1\.65224\.31¯\\underline\{224\.31\}±4\.18\\pm 4\.18
### 4\.1Results

Table[1](https://arxiv.org/html/2606.17513#S4.T1)reports the evaluation metrics on the two representative benchmarks\. Results on the remaining three cases are deferred to Appendix[I](https://arxiv.org/html/2606.17513#A9)\(Table[10](https://arxiv.org/html/2606.17513#A9.T10)\) and are consistent with the trends discussed here\.

#### Predictive accuracy is preserved\.

REEF\-GP does not degrade the base operator’s predictive accuracy in any of the five datasets \(Tables[1](https://arxiv.org/html/2606.17513#S4.T1),[10](https://arxiv.org/html/2606.17513#A9.T10)\), and rL2 remains identical or very similar to the deterministic baseline\. LUNO\-LA also has this attractive property but Perturbation and all the train\-time methods \(MC Dropout, PNO, and DVE\-spatial\) incur substantial rL2 penalties relative to Base\. The only method that \(expectedly\) improves rL2 over Base is Deep Ensembles but at the cost of substantial additional computation\.

#### UQ matches Deep Ensembles on 3D benchmarks\.

On*ShapeNet Car*, REEF\-GP achieves NLL of2\.842\.84, outperforming even Deep Ensembles \(4\.164\.16\) and reducing NLL by an order of magnitude relative to the strongest post\-hoc competitor LUNO\-LA \(33\.5533\.55\)\. On*Ahmed Body*, our NLL is tied with the Deep Ensembles \(Table[10](https://arxiv.org/html/2606.17513#A9.T10)\)\. Regarding the rest of the metrics and datasets, REEF\-GP is competitive to either the best or second\-best performing baselines\.

![Refer to caption](https://arxiv.org/html/2606.17513v1/x2.png)
![Refer to caption](https://arxiv.org/html/2606.17513v1/x3.png)

Figure 2:Predictive standard deviation fields on representative 2D \(*Airfoil*\) and 3D \(*ShapeNet Car*\) benchmarks\.Top two rows show an airfoil sample and bottom two rows a car sample\. For each sample the rows show in order: ground truth, base prediction, base error and the standard deviation of a sample for each of the UQ baselines\.Spatially coherent uncertainty maps\.Aggregate metrics can be inconclusive when best values shift across datasets and scoring rules\. In geometric operator learning, spatial consistency between predicted uncertainty and prediction error is itself a meaningful signal\. Figure[8](https://arxiv.org/html/2606.17513#A3.F8)shows per\-method standard deviation fields on*Airfoil*and*ShapeNet Car*alongside the base operator’s prediction and error\. Our uncertainty concentrates around the shock fronts on*Airfoil*and the front bumper region on*Car*, closely tracking where the operator’s error is largest\. Calibration is therefore not just numerically tight but spatially meaningful\.

![Refer to caption](https://arxiv.org/html/2606.17513v1/x4.png)Figure 3:Compute costs on representative 2D and 3D benchmarks:Train overhead, evaluation time, peak train memory, and peak evaluation memory on*Airfoil*\(2D\) and*ShapeNet Car*\(3D\)\. All axes use log scale\. REEF\-GP’s train and evaluation times are competitive but its memory requirements are higher\. We define training overhead as the additional wall\-clock time required to add UQ capabilities to a pretrained deterministic base Transolver\.Cost\-quality tradeoff\.Figure[3](https://arxiv.org/html/2606.17513#S4.F3)compares compute cost across all UQ methods on*Airfoil*and*ShapeNet Car*\. REEF\-GP achieves training overhead an order of magnitude lower than Deep Ensembles and competitive with all post\-hoc baselines\. Eval time and train memory are competitive across methods, with REEF\-GP sitting close to the median\. The only tradeoff is in eval memory, where REEF\-GP carries an𝒪​\(Ns2\)\\mathcal\{O\}\(N\_\{s\}^\{2\}\)cost from per\-expert kernel matrices; absolute values nonetheless remain well within commodity GPU memory\.

### 4\.2Ablation Studies

We conduct five ablations using the*Airfoil*benchmark to dissect the impact of each architectural and methodological choice; full results, tables, and visualizations are deferred to Appendix[D](https://arxiv.org/html/2606.17513#A4)\. \(i\)Layer selection\(Appendix[D\.1](https://arxiv.org/html/2606.17513#A4.SS1)\): our CKA\-based three\-layer selection\{L0,l∗,Llast\}\\\{L\_\{0\},l^\{\*\},L\_\{\\text\{last\}\}\\\}matches concatenating the entire 8\-layer Transolver stack at significantly lower memory cost, and substantially outperforms single\-layer variants\. \(ii\)Size of training data\(Appendix[D\.2](https://arxiv.org/html/2606.17513#A4.SS2)\): REEF\-GP adapts gracefully to base operators of varying quality, with the predicted uncertainty correctly inflating to track the operator’s deteriorating accuracy asMtrainM\_\{\\text\{train\}\}decreases from750750to5050, while the GP’srL2closely follows the base operator’s across two orders of magnitude in training\-set size \(Figure[11](https://arxiv.org/html/2606.17513#A4.F11)\)\. \(iii\)Stochastic subset size\(Appendix[D\.3](https://arxiv.org/html/2606.17513#A4.SS3)\): performance is essentially insensitive to the per\-expert support sizeNsN\_\{s\}across two orders of magnitude \(from500500to25,00025\{,\}000\), with overlapping error bars on every probabilistic metric\. We adoptNs=5,000N\_\{s\}=5\{,\}000as a middle\-of\-the\-range default that balances conditioning quality against the cubic Cholesky cost\. \(iv\)Number of gPoE experts\(Appendix[D\.4](https://arxiv.org/html/2606.17513#A4.SS4)\): calibration is essentially insensitive to the expert countK∈\{1,5,10\}K\\in\\\{1,5,10\\\}\. \(v\)Heteroscedastic vs\. homoscedastic noise\(Appendix[D\.5](https://arxiv.org/html/2606.17513#A4.SS5)\): both noise models yield comparable aggregate metrics, but the heteroscedastic variant produces sharper, error\-localized uncertainty fields, motivating its use as our default\.

### 4\.3Geometric Out\-Of\-Distribution \(OOD\) Robustness

To assess whether REEF\-GP remains well\-calibrated under geometric distribution shift, we partition each test set into four quartiles by maximum mean discrepancy \(MMD\) distance from the training distribution, ranging from Q1 \(closest\) to Q4 \(most geometrically novel\)\. A reliable UQ method should produce stable calibration across quartiles, inflating its predicted uncertainty as test geometries drift away from training\. Across all benchmarks, REEF\-GP provides the second\-best rL2 \(after Deep Ensemble\) and its NLL negligibly changes from Q1 to Q4, while LUNO\-LA’s NLL and Perturbation’s NLL significantly increase\. Full per\-quartile rL2 and NLL metrics, MMD construction details, and per\-method comparisons are reported in Appendix[J](https://arxiv.org/html/2606.17513#A10)and Table[11](https://arxiv.org/html/2606.17513#A10.T11)\.

## 5Conclusions

Our results show that the coordinate–feature space induced by a pretrained neural operator can be used for post\-hoc UQ\. By fitting REEF\-GP directly on the neural operator’s internal representations, we obtain calibrated uncertainty maps that match deep ensemble across five benchmarks but at a fraction of the training cost of a neural operator\. The resulting uncertainty is also spatially interpretable: it concentrates around shock fronts and other regions where the operator’s approximation is known to break down, and it remains stable as test geometries drift away from the training distribution\. We view this as evidence that the geometry a neural operator learns while solving a PDE already encodes much of what is needed to quantify its own uncertainty, opening a complementary axis to parameter\-centric approaches and a natural starting point for UQ on increasingly large geometry\-aware operators\.

#### Limitations and future work\.

REEF\-GP relies on neural operators whose hidden states preserve pointwise spatial correspondence, as in Transolver; extending it to backbones with latent or spectral internal representations \(e\.g\., FNO\) is left for future work\. Our current experiments involve stationary problems with single\-output responses\. Time\-dependent or vector\-valued solutions require explicit inclusion of time as an input and multi\-output GP extensions\. These are natural next steps given the modular post\-hoc design of REEF\-GP\. Finally, although training overhead is an order of magnitude lower than deep ensembles, peak evaluation memory scales as𝒪​\(Ns2\)\\mathcal\{O\}\(N\_\{s\}^\{2\}\)per expert, which is the dominant cost on large 3D meshes\.

## References

- \[1\]\(2018\)Test\-time data augmentation for estimation of heteroscedastic aleatoric uncertainty in deep neural networks\.External Links:[Link](https://openreview.net/forum?id=rJZz-knjz)Cited by:[Appendix G](https://arxiv.org/html/2606.17513#A7.SS0.SSS0.Px5.p1.3)\.
- \[2\]P\. Batlle, Y\. Chen, B\. Hosseini, H\. Owhadi, and A\. M\. Stuart\(2025\)Error analysis of kernel/gp methods for nonlinear and parametric pdes\.Journal of Computational Physics520,pp\. 113488\.Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p4.1)\.
- \[3\]P\. Batlle, M\. Darcy, B\. Hosseini, and H\. Owhadi\(2024\)Kernel methods are competitive for operator learning\.Journal of Computational Physics496\.External Links:ISSN 0021\-9991,[Document](https://dx.doi.org/ARTN%20112549%0A10.1016/j.jcp.2023.112549)Cited by:[Appendix A](https://arxiv.org/html/2606.17513#A1.p1.2),[§1](https://arxiv.org/html/2606.17513#S1.p4.1),[§2\.2](https://arxiv.org/html/2606.17513#S2.SS2.SSS0.Px1.p2.1)\.
- \[4\]B\. Bonev, T\. Kurth, C\. Hundt, J\. Pathak, M\. Baust, K\. Kashinath, and A\. Anandkumar\(2023\)Spherical fourier neural operators: learning stable dynamics on the sphere\.Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p1.1)\.
- \[5\]C\. Bülte, P\. Scholl, and G\. Kutyniok\(2024\)Probabilistic predictions with fourier neural operators\.External Links:[Link](https://openreview.net/forum?id=orKA6gJwlB)Cited by:[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px1.p1.1)\.
- \[6\]C\. Bülte, P\. Scholl, and G\. Kutyniok\(2025\)Probabilistic neural operators for functional uncertainty quantification\.Transactions on Machine Learning Research\.Note:External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=gangoPXSRw)Cited by:[Appendix G](https://arxiv.org/html/2606.17513#A7.SS0.SSS0.Px3.p1.3),[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px1.p1.1)\.
- \[7\]Y\. Cao and D\. J\. Fleet\(2015\)Generalized product of experts for automatic and principled fusion of gaussian process predictions\.External Links:1410\.7827,[Link](https://arxiv.org/abs/1410.7827)Cited by:[§B\.5](https://arxiv.org/html/2606.17513#A2.SS5.p1.7),[§3\.4](https://arxiv.org/html/2606.17513#S3.SS4.p1.4)\.
- \[8\]N\. Carey, L\. Zanisi, S\. Pamela, V\. Gopakumar, J\. Omotani, J\. Buchanan, J\. Brandstetter, F\. Paischer, G\. Galletti, and P\. Setinek\(2025\)Neural operator surrogate models of plasma edge simulations: feasibility and data efficiency\.External Links:2502\.17386,[Link](https://arxiv.org/abs/2502.17386)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p1.1)\.
- \[9\]A\. X\. Chang, T\. Funkhouser, L\. Guibas, P\. Hanrahan, Q\. Huang, Z\. Li, S\. Savarese, M\. Savva, S\. Song, H\. Su, J\. Xiao, L\. Yi, and F\. Yu\(2015\)ShapeNet: an information\-rich 3d model repository\.External Links:1512\.03012,[Link](https://arxiv.org/abs/1512.03012)Cited by:[Appendix F](https://arxiv.org/html/2606.17513#A6.SS0.SSS0.Px4.p1.2)\.
- \[10\]H\. Chen, L\. Zheng, R\. A\. Kontar, and G\. Raskutti\(2022\)Gaussian process parameter estimation using mini\-batch stochastic gradient descent: convergence guarantees and empirical benefits\.Journal of Machine Learning Research23\(227\),pp\. 1–59\.External Links:[Link](http://jmlr.org/papers/v23/20-1365.html)Cited by:[§B\.4](https://arxiv.org/html/2606.17513#A2.SS4.p1.1)\.
- \[11\]E\. Daxberger, A\. Kristiadi, A\. Immer, R\. Eschenhagen, M\. Bauer, and P\. Hennig\(2021\)Laplace redux \- effortless bayesian deep learning\.InAdvances in Neural Information Processing Systems,M\. Ranzato, A\. Beygelzimer, Y\. Dauphin, P\.S\. Liang, and J\. W\. Vaughan \(Eds\.\),Vol\.34,pp\. 20089–20103\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2021/file/a7c9585703d275249f30a088cebba0ad-Paper.pdf)Cited by:[Appendix G](https://arxiv.org/html/2606.17513#A7.SS0.SSS0.Px4.p1.4),[§1](https://arxiv.org/html/2606.17513#S1.p3.1),[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px2.p1.1)\.
- \[12\]Y\. Gal and Z\. Ghahramani\(2016\)Dropout as a bayesian approximation: representing model uncertainty in deep learning\.InProceedings of the 33rd International Conference on International Conference on Machine Learning \- Volume 48,ICML’16,pp\. 1050–1059\.Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p2.1),[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px1.p1.1)\.
- \[13\]V\. Gopakumar, S\. Pamela, L\. Zanisi, Z\. Li, A\. Anandkumar, and M\. Team\(2023\)Fourier neural operator for plasma modelling\.External Links:2302\.06542,[Link](https://arxiv.org/abs/2302.06542)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p1.1)\.
- \[14\]K\. He, X\. Zhang, S\. Ren, and J\. Sun\(2015\)Deep residual learning for image recognition\.External Links:1512\.03385,[Link](https://arxiv.org/abs/1512.03385)Cited by:[Appendix C](https://arxiv.org/html/2606.17513#A3.SS0.SSS0.Px1.p2.3)\.
- \[15\]D\. Hendrycks and K\. Gimpel\(2023\)Gaussian error linear units \(gelus\)\.External Links:1606\.08415,[Link](https://arxiv.org/abs/1606.08415)Cited by:[Appendix E](https://arxiv.org/html/2606.17513#A5.SS0.SSS0.Px1.p1.1)\.
- \[16\]J\. Hensman, A\. Matthews, and Z\. Ghahramani\(2015\-09–12 May\)Scalable Variational Gaussian Process Classification\.InProceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics,G\. Lebanon and S\. V\. N\. Vishwanathan \(Eds\.\),Proceedings of Machine Learning Research, Vol\.38,San Diego, California, USA,pp\. 351–360\.External Links:[Link](https://proceedings.mlr.press/v38/hensman15.html)Cited by:[Appendix C](https://arxiv.org/html/2606.17513#A3.SS0.SSS0.Px2.p2.1),[§2\.2](https://arxiv.org/html/2606.17513#S2.SS2.SSS0.Px1.p1.2)\.
- \[17\]D\. Higdon, M\. Kennedy, J\. C\. Cavendish, J\. A\. Cafeo, and R\. D\. Ryne\(2004\)Combining field data and computer simulations for calibration and prediction\.SIAM Journal on Scientific Computing26\(2\),pp\. 448–466\.External Links:ISSN 1064\-8275 1095\-7197,[Document](https://dx.doi.org/10.1137/s1064827503426693),[Link](https://arxiv.org/html/2606.17513v1/%3CGo%20to%20ISI%3E://WOS:000226823100006)Cited by:[§3](https://arxiv.org/html/2606.17513#S3.p1.3)\.
- \[18\]D\. Jeyaraj, H\. Eivazi, J\. Tröger, S\. Wittek, S\. Hartmann, and A\. Rausch\(2025\)A neural operator based hybrid microscale model for multiscale simulation of rate\-dependent materials\.External Links:2506\.16918,[Link](https://arxiv.org/abs/2506.16918)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p1.1)\.
- \[19\]F\. Jimenez and M\. Katzfuss\(2025\-03–05 May\)Vecchia gaussian process ensembles on internal representations of deep neural networks\.InProceedings of The 28th International Conference on Artificial Intelligence and Statistics,Y\. Li, S\. Mandt, S\. Agrawal, and E\. Khan \(Eds\.\),Proceedings of Machine Learning Research, Vol\.258,pp\. 3403–3411\.External Links:[Link](https://proceedings.mlr.press/v258/jimenez25a.html)Cited by:[Appendix G](https://arxiv.org/html/2606.17513#A7.SS0.SSS0.Px6.p1.3),[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2606.17513#S3.p3.2)\.
- \[20\]M\. C\. Kennedy and A\. O’Hagan\(2002\-01\)Bayesian calibration of computer models\.Journal of the Royal Statistical Society Series B: Statistical Methodology63\(3\),pp\. 425–464\.External Links:ISSN 1369\-7412,[Document](https://dx.doi.org/10.1111/1467-9868.00294),[Link](https://doi.org/10.1111/1467-9868.00294),https://academic\.oup\.com/jrsssb/article\-pdf/63/3/425/49590547/jrsssb\_63\_3\_425\.pdfCited by:[§3](https://arxiv.org/html/2606.17513#S3.p1.3)\.
- \[21\]M\. S\. Khorrami, P\. Goyal, J\. R\. Mianroodi, B\. Svendsen, P\. Benner, and D\. Raabe\(2025\)A physics\-encoded fourier neural operator approach for surrogate modeling of divergence\-free stress fields in solids\.External Links:2408\.15408,[Link](https://arxiv.org/abs/2408.15408)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p1.1)\.
- \[22\]S\. Kornblith, M\. Norouzi, H\. Lee, and G\. E\. Hinton\(2019\)Similarity of neural network representations revisited\.CoRRabs/1905\.00414\.External Links:[Link](http://arxiv.org/abs/1905.00414),1905\.00414Cited by:[§B\.1](https://arxiv.org/html/2606.17513#A2.SS1.SSS0.Px1.p1.6)\.
- \[23\]B\. Lakshminarayanan, A\. Pritzel, and C\. Blundell\(2017\)Simple and scalable predictive uncertainty estimation using deep ensembles\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 6405–6416\.External Links:ISBN 9781510860964Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p2.1),[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px1.p1.1)\.
- \[24\]Z\. Li, D\. Z\. Huang, B\. Liu, and A\. Anandkumar\(2023\-01\)Fourier neural operator with learned deformations for pdes on general geometries\.J\. Mach\. Learn\. Res\.24\(1\)\.External Links:ISSN 1532\-4435Cited by:[Appendix F](https://arxiv.org/html/2606.17513#A6.SS0.SSS0.Px1.p1.3),[Appendix F](https://arxiv.org/html/2606.17513#A6.SS0.SSS0.Px2.p1.2),[Appendix F](https://arxiv.org/html/2606.17513#A6.SS0.SSS0.Px3.p1.2),[§1](https://arxiv.org/html/2606.17513#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.17513#S2.SS1.SSS0.Px1.p2.1)\.
- \[25\]Z\. Li, N\. Kovachki, K\. Azizzadenesheli, B\. Liu, K\. Bhattacharya, A\. Stuart, and A\. Anandkumar\(2021\)Fourier neural operator for parametric partial differential equations\.External Links:2010\.08895,[Link](https://arxiv.org/abs/2010.08895)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p1.1),[§1](https://arxiv.org/html/2606.17513#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.17513#S2.SS1.SSS0.Px1.p2.1)\.
- \[26\]Z\. Li, N\. B\. Kovachki, C\. Choy, B\. Li, J\. Kossaifi, S\. P\. Otta, M\. A\. Nabian, M\. Stadler, C\. Hundt, K\. Azizzadenesheli, and A\. Anandkumar\(2023\)Geometry\-informed neural operator for large\-scale 3d PDEs\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=86dXbqT5Ua)Cited by:[Appendix F](https://arxiv.org/html/2606.17513#A6.SS0.SSS0.Px5.p1.2),[§2\.1](https://arxiv.org/html/2606.17513#S2.SS1.SSS0.Px1.p2.1)\.
- \[27\]J\. Z\. Liu, S\. Padhy, J\. Ren, Z\. Lin, Y\. Wen, G\. Jerfel, Z\. Nado, J\. Snoek, D\. Tran, and B\. Lakshminarayanan\(2023\)A simple approach to improve single\-model deep uncertainty via distance\-awareness\.Journal of Machine Learning Research24\(42\),pp\. 1–63\.External Links:[Link](http://jmlr.org/papers/v24/22-0479.html)Cited by:[§B\.2](https://arxiv.org/html/2606.17513#A2.SS2.SSS0.Px1.p1.3),[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.17513#S3.SS1.p1.13)\.
- \[28\]J\. Loeppky, D\. Bingham, and W\. Welch\(2006\)Computer model calibration or tuning in practice\.Technometrics, submitted for publication\.Cited by:[§3](https://arxiv.org/html/2606.17513#S3.p1.3)\.
- \[29\]I\. Loshchilov and F\. Hutter\(2019\)Decoupled weight decay regularization\.External Links:1711\.05101,[Link](https://arxiv.org/abs/1711.05101)Cited by:[Appendix E](https://arxiv.org/html/2606.17513#A5.SS0.SSS0.Px1.p1.1)\.
- \[30\]M\. Lowery, J\. Turnage, Z\. Morrow, J\. D\. Jakeman, A\. Narayan, S\. Zhe, and V\. Shankar\(2024\)Kernel neural operators \(knos\) for scalable, memory\-efficient, geometrically\-flexible operator learning\.arXiv preprint arXiv:2407\.00809\.Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p4.1)\.
- \[31\]H\. Luo, H\. Wu, H\. Zhou, L\. Xing, Y\. Di, J\. Wang, and M\. Long\(2025\)Transolver\+\+: an accurate neural solver for pdes on million\-scale geometries\.External Links:2502\.02414,[Link](https://arxiv.org/abs/2502.02414)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p1.1)\.
- \[32\]B\. MacDonald, P\. Ranjan, and H\. Chipman\(2015\)GPfit: an r package for fitting a gaussian process model to deterministic simulator outputs\.Journal of Statistical Software64\(12\),pp\. 1–23\.External Links:ISSN 1548\-7660,[Link](https://arxiv.org/html/2606.17513v1/%3CGo%20to%20ISI%3E://WOS:000352916200001)Cited by:[§B\.2](https://arxiv.org/html/2606.17513#A2.SS2.SSS0.Px2.p1.11)\.
- \[33\]D\. J\. C\. MacKay\(1992\-05\)Bayesian interpolation\.Neural Computation4\(3\),pp\. 415–447\.External Links:ISSN 0899\-7667,[Document](https://dx.doi.org/10.1162/neco.1992.4.3.415),[Link](https://doi.org/10.1162/neco.1992.4.3.415),https://direct\.mit\.edu/neco/article\-pdf/4/3/415/812340/neco\.1992\.4\.3\.415\.pdfCited by:[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px2.p1.1)\.
- \[34\]W\. J\. Maddox, T\. Garipov, P\. Izmailov, D\. Vetrov, and A\. G\. Wilson\(2019\)A simple baseline for bayesian uncertainty in deep learning\.InProceedings of the 33rd International Conference on Neural Information Processing Systems,Cited by:[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px1.p1.1)\.
- \[35\]E\. Magnani, N\. Krämer, R\. Eschenhagen, L\. Rosasco, and P\. Hennig\(2022\)Approximate bayesian neural operators: uncertainty quantification for parametric pdes\.CoRRabs/2208\.01565\.External Links:[Link](https://doi.org/10.48550/arXiv.2208.01565),[Document](https://dx.doi.org/10.48550/ARXIV.2208.01565),2208\.01565Cited by:[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px2.p1.1)\.
- \[36\]E\. Magnani, M\. Pförtner, T\. Weber, and P\. Hennig\(2025\)Linearization turns neural operators into function\-valued gaussian processes\.External Links:[Link](https://openreview.net/forum?id=4Z04wVQ9FY)Cited by:[Appendix G](https://arxiv.org/html/2606.17513#A7.SS0.SSS0.Px4.p1.4),[§1](https://arxiv.org/html/2606.17513#S1.p3.1),[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px2.p1.1)\.
- \[37\]A\. Matveev, S\. Ghosh, A\. Hussain, J\. Leahy, and M\. Michaelides\(2025\)Light\-weight diffusion multiplier and uncertainty quantification for fourier neural operators\.External Links:2508\.00643,[Link](https://arxiv.org/abs/2508.00643)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p2.1)\.
- \[38\]C\. Mora, A\. Yousefpour, S\. Hosseinmardi, H\. Owhadi, and R\. Bostanabad\(2025\)Operator learning with gaussian processes\.Computer Methods in Applied Mechanics and Engineering434,pp\. 117581\.External Links:ISSN 0045\-7825,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.cma.2024.117581),[Link](https://www.sciencedirect.com/science/article/pii/S0045782524008351)Cited by:[Appendix A](https://arxiv.org/html/2606.17513#A1.p1.2),[§1](https://arxiv.org/html/2606.17513#S1.p4.1),[§2\.2](https://arxiv.org/html/2606.17513#S2.SS2.SSS0.Px1.p2.1),[§2\.2](https://arxiv.org/html/2606.17513#S2.SS2.p1.1)\.
- \[39\]S\. C\. Mouli, D\. C\. Maddix, S\. Alizadeh, G\. Gupta, A\. Stuart, M\. W\. Mahoney, and Y\. Wang\(2024\)Using uncertainty quantification to characterize and improve out\-of\-domain learning for pdes\.External Links:2403\.10642,[Link](https://arxiv.org/abs/2403.10642)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p2.1)\.
- \[40\]D\. Musekamp, M\. Kalimuthu, D\. Holzmüller, M\. Takamoto, and M\. Niepert\(2025\)Active learning for neural PDE solvers\.External Links:[Link](https://openreview.net/forum?id=x4ZmQaumRg)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p2.1)\.
- \[41\]H\. Owhadi, C\. Scovel, and T\. Sullivan\(2015\)On the brittleness of bayesian inference\.SIAM Review57\(4\),pp\. 566–582\.External Links:[Document](https://dx.doi.org/10.1137/130938633),[Link](https://doi.org/10.1137/130938633),https://doi\.org/10\.1137/130938633Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p4.1),[§3\.2](https://arxiv.org/html/2606.17513#S3.SS2.p1.1)\.
- \[42\]H\. Owhadi and G\. R\. Yoo\(2019\)Kernel flows: from learning kernels from data into the abyss\.Journal of Computational Physics389,pp\. 22–47\.External Links:ISSN 0021\-9991,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jcp.2019.03.040),[Link](https://www.sciencedirect.com/science/article/pii/S0021999119302232)Cited by:[§B\.4](https://arxiv.org/html/2606.17513#A2.SS4.p1.1),[Appendix C](https://arxiv.org/html/2606.17513#A3.p1.1),[§1](https://arxiv.org/html/2606.17513#S1.p4.1),[§3\.2](https://arxiv.org/html/2606.17513#S3.SS2.p2.1),[§3\.3](https://arxiv.org/html/2606.17513#S3.SS3.p1.8),[§3\.3](https://arxiv.org/html/2606.17513#S3.SS3.p2.1)\.
- \[43\]J\. Pathak, S\. Subramanian, P\. Harrington, S\. Raja, A\. Chattopadhyay, M\. Mardani, T\. Kurth, D\. Hall, Z\. Li, K\. Azizzadenesheli, P\. Hassanzadeh, K\. Kashinath, and A\. Anandkumar\(2022\)FourCastNet: a global data\-driven high\-resolution weather model using adaptive fourier neural operators\.External Links:2202\.11214,[Link](https://arxiv.org/abs/2202.11214)Cited by:[Appendix G](https://arxiv.org/html/2606.17513#A7.SS0.SSS0.Px5.p1.3),[§1](https://arxiv.org/html/2606.17513#S1.p1.1)\.
- \[44\]A\. F\. Psaros, X\. Meng, Z\. Zou, L\. Guo, and G\. E\. Karniadakis\(2023\)Uncertainty quantification in scientific machine learning: methods, metrics, and comparisons\.Journal of Computational Physics477,pp\. 111902\.External Links:ISSN 0021\-9991,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jcp.2022.111902),[Link](https://www.sciencedirect.com/science/article/pii/S0021999122009652)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p2.1)\.
- \[45\]N\. Rahaman, A\. Baratin, D\. Arpit, F\. Draxler, M\. Lin, F\. Hamprecht, Y\. Bengio, and A\. Courville\(2019\-09–15 Jun\)On the spectral bias of neural networks\.InProceedings of the 36th International Conference on Machine LearningICLR 2024 Workshop on AI4DifferentialEquations In ScienceForty\-second International Conference on Machine LearningNeurIPS 2024 Workshop on Bayesian Decision\-making and UncertaintyMedical Imaging with Deep LearningForty\-second International Conference on Machine LearningThe Thirteenth International Conference on Learning RepresentationsThe Thirty\-eighth Annual Conference on Neural Information Processing SystemsProceedings of the 40th International Conference on Machine Learning,K\. Chaudhuri and R\. Salakhutdinov \(Eds\.\),Proceedings of Machine Learning ResearchICML’23, Vol\.97,pp\. 5301–5310\.External Links:[Link](https://proceedings.mlr.press/v97/rahaman19a.html)Cited by:[§3](https://arxiv.org/html/2606.17513#S3.p3.2)\.
- \[46\]M\. A\. Rahman, R\. J\. George, M\. Elleithy, D\. Leibovici, Z\. Li, B\. Bonev, C\. White, J\. Berner, R\. A\. Yeh, J\. Kossaifi, K\. Azizzadenesheli, and A\. Anandkumar\(2024\)Pretraining codomain attention neural operators for solving multiphysics PDEs\.External Links:[Link](https://openreview.net/forum?id=wSpIdUXZYX)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p1.1)\.
- \[47\]C\. E\. Rasmussen and C\. K\. I\. Williams\(2005\-11\)Gaussian processes for machine learning\.The MIT Press\.External Links:ISBN 9780262256834,[Document](https://dx.doi.org/10.7551/mitpress/3206.001.0001),[Link](https://doi.org/10.7551/mitpress/3206.001.0001),https://direct\.mit\.edu/book\-pdf/2514321/book\_9780262256834\.pdfCited by:[§1](https://arxiv.org/html/2606.17513#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.17513#S2.SS2.p1.1)\.
- \[48\]H\. Ritter, A\. Botev, and D\. Barber\(2018\)A scalable laplace approximation for neural networks\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Skdvd2xAZ)Cited by:[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px2.p1.1)\.
- \[49\]O\. Skean, M\. R\. Arefin, D\. Zhao, N\. N\. Patel, J\. Naghiyev, Y\. LeCun, and R\. Shwartz\-Ziv\(2025\)Layer by layer: uncovering hidden representations in language models\.External Links:[Link](https://openreview.net/forum?id=WGXb7UdvTX)Cited by:[§3](https://arxiv.org/html/2606.17513#S3.p3.2)\.
- \[50\]M\. Spitieris and I\. Steinsland\(2023\)Bayesian calibration of imperfect computer models using physics\-informed priors\.Journal of Machine Learning Research24\(108\),pp\. 1–39\.External Links:[Link](http://jmlr.org/papers/v24/22-0676.html)Cited by:[§3\.2](https://arxiv.org/html/2606.17513#S3.SS2.p2.1)\.
- \[51\]M\. Titsias\(2009\-16–18 Apr\)Variational learning of inducing variables in sparse gaussian processes\.InProceedings of the Twelfth International Conference on Artificial Intelligence and Statistics,D\. van Dyk and M\. Welling \(Eds\.\),Proceedings of Machine Learning Research, Vol\.5,Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA,pp\. 567–574\.External Links:[Link](https://proceedings.mlr.press/v5/titsias09a.html)Cited by:[Appendix C](https://arxiv.org/html/2606.17513#A3.SS0.SSS0.Px2.p2.1),[§2\.2](https://arxiv.org/html/2606.17513#S2.SS2.SSS0.Px1.p1.2)\.
- \[52\]N\. Umetani and B\. Bickel\(2018\-07\)Learning three\-dimensional flow for interactive aerodynamic design\.ACM Trans\. Graph\.37\(4\)\.External Links:ISSN 0730\-0301,[Link](https://doi.org/10.1145/3197517.3201325),[Document](https://dx.doi.org/10.1145/3197517.3201325)Cited by:[Appendix F](https://arxiv.org/html/2606.17513#A6.SS0.SSS0.Px4.p1.2)\.
- \[53\]L\. van der Maaten and G\. Hinton\(2008\)Visualizing data using t\-sne\.Journal of Machine Learning Research9\(86\),pp\. 2579–2605\.External Links:[Link](http://jmlr.org/papers/v9/vandermaaten08a.html)Cited by:[§3](https://arxiv.org/html/2606.17513#S3.p3.2)\.
- \[54\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 6000–6010\.External Links:ISBN 9781510860964Cited by:[Appendix C](https://arxiv.org/html/2606.17513#A3.SS0.SSS0.Px1.p2.3)\.
- \[55\]T\. Weber, E\. Magnani, M\. Pförtner, and P\. Hennig\(2024\)Uncertainty quantification for fourier neural operators\.External Links:[Link](https://openreview.net/forum?id=knSgoNJcnV)Cited by:[§2\.3](https://arxiv.org/html/2606.17513#S2.SS3.SSS0.Px2.p1.1)\.
- \[56\]C\. Williams and M\. Seeger\(2000\)Using the nyström method to speed up kernel machines\.InAdvances in Neural Information Processing Systems,T\. Leen, T\. Dietterich, and V\. Tresp \(Eds\.\),Vol\.13,pp\.\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2000/file/19de10adbaa1b2ee13f77f679fa1483a-Paper.pdf)Cited by:[§2\.2](https://arxiv.org/html/2606.17513#S2.SS2.SSS0.Px1.p1.2)\.
- \[57\]A\. G\. Wilson, Z\. Hu, R\. Salakhutdinov, and E\. P\. Xing\(2016\-09–11 May\)Deep kernel learning\.InProceedings of the 19th International Conference on Artificial Intelligence and Statistics,A\. Gretton and C\. C\. Robert \(Eds\.\),Proceedings of Machine Learning Research, Vol\.51,Cadiz, Spain,pp\. 370–378\.External Links:[Link](https://proceedings.mlr.press/v51/wilson16.html)Cited by:[Appendix C](https://arxiv.org/html/2606.17513#A3.p1.1),[§2\.2](https://arxiv.org/html/2606.17513#S2.SS2.SSS0.Px1.p2.1),[§3\.2](https://arxiv.org/html/2606.17513#S3.SS2.p3.6)\.
- \[58\]A\. G\. Wilson and H\. Nickisch\(2015\)Kernel interpolation for scalable structured gaussian processes \(kiss\-gp\)\.InProceedings of the 32nd International Conference on Machine Learning \- Volume 37,ICML’15,pp\. 1775–1784\.Cited by:[§2\.2](https://arxiv.org/html/2606.17513#S2.SS2.SSS0.Px1.p1.2)\.
- \[59\]H\. Wu, H\. Luo, H\. Wang, J\. Wang, and M\. Long\(2024\)Transolver: a fast transformer solver for pdes on general geometries\.External Links:2402\.02366,[Link](https://arxiv.org/abs/2402.02366)Cited by:[Appendix C](https://arxiv.org/html/2606.17513#A3.SS0.SSS0.Px1.p1.7),[Appendix E](https://arxiv.org/html/2606.17513#A5.p1.1),[§1](https://arxiv.org/html/2606.17513#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.17513#S2.SS1.SSS0.Px1.p2.1),[§4](https://arxiv.org/html/2606.17513#S4.SS0.SSS0.Px1.p1.1)\.
- \[60\]X\. Yang and H\. Owhadi\(2024\)A mini\-batch method for solving nonlinear pdes with gaussian processes\.External Links:2306\.00307,[Link](https://arxiv.org/abs/2306.00307)Cited by:[§B\.4](https://arxiv.org/html/2606.17513#A2.SS4.p1.1)\.
- \[61\]A\. Yousefpour, Z\. Z\. Foumani, M\. Shishehbor, C\. Mora, and R\. Bostanabad\(2024\)GP\+: a python library for kernel\-based learning via gaussian processes\.Advances in Engineering Software195,pp\. 103686\.External Links:ISSN 0965\-9978,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.advengsoft.2024.103686),[Link](https://www.sciencedirect.com/science/article/pii/S0965997824000930)Cited by:[§B\.2](https://arxiv.org/html/2606.17513#A2.SS2.SSS0.Px2.p1.11)\.
- \[62\]S\. Zhe, W\. Xing, and R\. M\. Kirby\(2019\-16–18 Apr\)Scalable high\-order gaussian process regression\.InProceedings of the Twenty\-Second International Conference on Artificial Intelligence and Statistics,K\. Chaudhuri and M\. Sugiyama \(Eds\.\),Proceedings of Machine Learning Research, Vol\.89,pp\. 2611–2620\.External Links:[Link](https://proceedings.mlr.press/v89/zhe19a.html)Cited by:[§2\.2](https://arxiv.org/html/2606.17513#S2.SS2.SSS0.Px1.p1.2)\.
- \[63\]H\. Zhou, H\. Wu, H\. Shangguan, Y\. Ma, H\. Weng, J\. Wang, and M\. Long\(2026\)Transolver\-3: scaling up transformer solvers to industrial\-scale geometries\.External Links:2602\.04940,[Link](https://arxiv.org/abs/2602.04940)Cited by:[§1](https://arxiv.org/html/2606.17513#S1.p1.1)\.

## Appendix AOperator Learning with Gaussian Processes

We adopt thefunctional regressionperspective ofMoraet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib7)\)as opposed to the operator\-valued approach ofBatlleet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib1)\)\(we note that neither works study the UQ properties of GPs in the context of operator learning\)\. Specifically, instead of learning the map𝒢†\\mathcal\{G\}^\{\\dagger\}directly \(which outputs an infinite\-dimensional function\), we learn the evaluation functional associated with the operator\. Formally, we define a real\-valued bilinear form𝒢~†\\tilde\{\\mathcal\{G\}\}^\{\\dagger\}acting on the input space and the dual space of the output:

𝒢~†:\(𝒰×𝒜\)×𝒱∗→ℝ,\(\(u,a\),φ\)↦⟨φ,𝒢†​\(u,a\)⟩𝒱∗×𝒱,\\tilde\{\\mathcal\{G\}\}^\{\\dagger\}:\(\\mathcal\{U\}\\times\\mathcal\{A\}\)\\times\\mathcal\{V\}^\{\*\}\\to\\mathbb\{R\},\\quad\(\(u,a\),\\varphi\)\\mapsto\\langle\\varphi,\\mathcal\{G\}^\{\\dagger\}\(u,a\)\\rangle\_\{\\mathcal\{V\}^\{\*\}\\times\\mathcal\{V\}\},\(14\)where𝒱∗\\mathcal\{V\}^\{\*\}is the dual space of𝒱\\mathcal\{V\}and⟨⋅,⋅⟩\\langle\\cdot,\\cdot\\rangledenotes the duality pairing\. In particular, by choosingφ\\varphito be the Dirac delta functionalδ𝐱\\delta\_\{\\mathbf\{x\}\}centered at𝐱∈Da\\mathbf\{x\}\\in D\_\{a\}, the pairing recovers the pointwise evaluation of the solution field:

𝒢~†​\(\(u,a\),δ𝐱\)=⟨δ𝐱,v⟩=v​\(𝐱\)\.\\tilde\{\\mathcal\{G\}\}^\{\\dagger\}\(\(u,a\),\\delta\_\{\\mathbf\{x\}\}\)=\\langle\\delta\_\{\\mathbf\{x\}\},v\\rangle=v\(\\mathbf\{x\}\)\.\(15\)This transformation effectively converts the operator learning problem into a scalar regression task, where the spatial coordinate𝐱\\mathbf\{x\}becomes an input variable\. The process can be extended to vector\-valued cases by either repeating the above process or solving a multi\-output regression problem\.

In practice, we only have access to finite discretizations\. Letℰu:𝒰→ℝdu\\mathcal\{E\}\_\{u\}:\\mathcal\{U\}\\to\\mathbb\{R\}^\{d\_\{u\}\}be a bounded linear observation operator that extracts discretedud\_\{u\}dimensional features from the continuous input function \(e\.g\., sensor measurements or nodes of a mesh in computer simulations\) such that𝐮=ℰu​\(u\)\\mathbf\{u\}=\\mathcal\{E\}\_\{u\}\(u\)\. Similarly, letℰa:𝒜→ℝda\\mathcal\{E\}\_\{a\}:\\mathcal\{A\}\\to\\mathbb\{R\}^\{d\_\{a\}\}denote a discretization operator that produces the discretedad\_\{a\}\-dimensional representation of the geometry such that𝐚=ℰa​\(a\)\\mathbf\{a\}=\\mathcal\{E\}\_\{a\}\(a\)\. Consequently, the infinite\-dimensional functional𝒢~†\\tilde\{\\mathcal\{G\}\}^\{\\dagger\}is approximated by a finite\-dimensional surrogate:

f:ℝdu×ℝda×Ω→ℝ,\(𝐮,𝐚,𝐱\)↦v​\(𝐱\)\.f:\\mathbb\{R\}^\{d\_\{u\}\}\\times\\mathbb\{R\}^\{d\_\{a\}\}\\times\\Omega\\to\\mathbb\{R\},\\quad\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\\mapsto v\(\\mathbf\{x\}\)\.\([3](https://arxiv.org/html/2606.17513#S2.E3)\)
This formulation allows us to treat the triplet𝐳=\(𝐮,𝐚,𝐱\)\\mathbf\{z\}=\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)as the joint input to a regression model\. By placing a GP prior overff, i\.e\.,f​\(𝐳\)∼𝒢​𝒫​\(m​\(𝐳\),k​\(𝐳,𝐳′\)\)f\(\\mathbf\{z\}\)\\sim\\mathcal\{GP\}\(m\(\\mathbf\{z\}\),k\(\\mathbf\{z\},\\mathbf\{z\}^\{\\prime\}\)\), we can infer the solution valuev​\(𝐱\)v\(\\mathbf\{x\}\)at any query point𝐱\\mathbf\{x\}within the irregular domainDaD\_\{a\}, while naturally obtaining uncertainty estimates via the GP posterior\.

## Appendix BImplementation Details of REEF\-GP

This appendix details our design choices as well as training and inference algorithms that can be used to instantiate REEF\-GP and reproduce our results\. GitHub link will be available after review\.

### B\.1Layer Selection

REEF\-GP requires an augmented state𝐳¯\\mathbf\{\\bar\{z\}\}that aggregates hidden features𝐡l​\(𝐮,𝐚,𝐱\)\\mathbf\{h\}\_\{l\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)fromL′L^\{\\prime\}selected layers of a frozen base neural operator \(Transolver in this case\)\. Naively concatenating allL=8L=8Transolver layers would produce a very high\-dimensional representation that increases the costs while providing negligible gains as these representations have redundancies\. Internal Transolver layers exhibit strong feature correlation so consecutive blocks contribute overlapping information, see Figure[4](https://arxiv.org/html/2606.17513#A2.F4)\. We therefore select a small subset of layers that jointly span the network’s representational hierarchy: an early layer \(preserving raw geometric structure\), a late layer \(encoding the predicted physics\), and a middle layer that is maximally distinct from both\. Below, we provide a heuristic for choosing the middle layer which has been used in all our studies\. Unlike DVE, we observe that not all layers are needed, see ablation studies in Table[3](https://arxiv.org/html/2606.17513#A4.T3)\.

#### CKA\-based middle layer selection\.

For the middle layer, we use a centered kernel alignment \(CKA\)Kornblithet al\.\([2019](https://arxiv.org/html/2606.17513#bib.bib73)\)criterion to identify the block whose representation is jointly most distinct from the first and last layers\. We pass a batch ofBBsamples through the frozen Transolver and extract the hidden state at each candidate layerll\. Flattening across the batch and spatial dimensions yields𝐇l∈ℝ\(B⋅N\)×Dh\\mathbf\{H\}\_\{l\}\\in\\mathbb\{R\}^\{\(B\\cdot N\)\\times D\_\{h\}\}\. We randomly subsamplePs=5,000P\_\{s\}=5\{,\}000rows and center each subsampled tensor before computing CKA\. For two centered matrices𝐗\\mathbf\{X\}and𝐘\\mathbf\{Y\}, linear CKA is:

CKA​\(𝐗,𝐘\)=‖𝐘⊤​𝐗‖F2‖𝐗⊤​𝐗‖F⋅‖𝐘⊤​𝐘‖F\.\\text\{CKA\}\(\\mathbf\{X\},\\mathbf\{Y\}\)=\\frac\{\\\|\\mathbf\{Y\}^\{\\top\}\\mathbf\{X\}\\\|\_\{F\}^\{2\}\}\{\\\|\\mathbf\{X\}^\{\\top\}\\mathbf\{X\}\\\|\_\{F\}\\cdot\\\|\\mathbf\{Y\}^\{\\top\}\\mathbf\{Y\}\\\|\_\{F\}\}\.\(16\)The middle layerl∗l^\{\*\}is then chosen to minimize the combined similarity to the boundary layers:

l∗=arg⁡minl∈\{1,…,L−1\}⁡\[CKA​\(𝐇l,𝐇0\)\+CKA​\(𝐇l,𝐇L\)\]\.l^\{\*\}=\\arg\\min\_\{l\\in\\\{1,\\dots,L\-1\\\}\}\\big\[\\,\\text\{CKA\}\(\\mathbf\{H\}\_\{l\},\\mathbf\{H\}\_\{0\}\)\+\\text\{CKA\}\(\\mathbf\{H\}\_\{l\},\\mathbf\{H\}\_\{L\}\)\\,\\big\]\.\(17\)This procedure runs once per dataset and seed, takes negligible time relative to GP training, and selects three layers in total:\{L0,l∗,Llast\}\\\{L\_\{0\},l^\{\*\},L\_\{\\text\{last\}\}\\\}\. Their outputs are concatenated along the feature dimension to form𝐡​\(𝐮,𝐚,𝐱\)\\mathbf\{h\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\)\. Hidden dimensions vary per dataset \(Table[8](https://arxiv.org/html/2606.17513#A5.T8)\), so the resulting augmented state has dimension3​Dh3D\_\{h\}\.

![Refer to caption](https://arxiv.org/html/2606.17513v1/x5.png)Figure 4:Layer\-wise CKA similarity across benchmarks\.For each candidate Transolver layer𝐇l\\mathbf\{H\}\_\{l\}, we plot CKA similarity to the input layer𝐇0\\mathbf\{H\}\_\{0\}\(solid\) and to the output layer𝐇L\\mathbf\{H\}\_\{L\}\(dashed\) as a function of depth\. Intermediate layers are jointly distinct from both endpoints and our middle\-layer selection chooses the maximally distinct one\.

### B\.2Deep Kernel Design

The GP prior onffin Equation[3](https://arxiv.org/html/2606.17513#S2.E3)has a deep kernel that operates on three input types \(spatial coordinates𝐱\\mathbf\{x\}, discretized input function𝐮\\mathbf\{u\}, and the selected embeddings𝐡\\mathbf\{h\}from Transolver\)\.

#### Latent feature projection\.

We project the high\-dimensional augmented state𝐡\\mathbf\{h\}through a learned MLPρ:ℝ3​Dh→ℝDl\\rho:\\mathbb\{R\}^\{3D\_\{h\}\}\\to\\mathbb\{R\}^\{D\_\{l\}\}that mixes the three layers and projects them into a compact latent space for the kernel\. The network consists of two spectrally normalized linear layers with a ReLU activation:

ρ​\(𝐡\)=W2⋅ReLU​\(W1​𝐡\)\\rho\(\\mathbf\{h\}\)=W\_\{2\}\\cdot\\text\{ReLU\}\(W\_\{1\}\\mathbf\{h\}\)\(18\)Spectral normalization bounds the Lipschitz constant ofρ\\rho, preventing the projection from collapsing the input geometry manifold and preserving the latent distance metric for the kernelLiuet al\.\([2023](https://arxiv.org/html/2606.17513#bib.bib51)\)\.

#### Kernel composition\.

The composite kernel is a product of three base kernels:

kϕ​\(𝐳¯,𝐳¯′\)=σ2⋅kspace​\(𝐱,𝐱′\)⋅kfun​\(𝐮,𝐮′\)⋅klatent​\(ρ​\(𝐡\),ρ​\(𝐡′\)\),k\_\{\\boldsymbol\{\\phi\}\}\(\\mathbf\{\\bar\{z\}\},\\mathbf\{\\bar\{z\}\}^\{\\prime\}\)=\\sigma^\{2\}\\cdot k\_\{\\text\{space\}\}\(\\mathbf\{x\},\\mathbf\{x\}^\{\\prime\}\)\\cdot k\_\{\\text\{fun\}\}\(\\mathbf\{u\},\\mathbf\{u\}^\{\\prime\}\)\\cdot k\_\{\\text\{latent\}\}\(\\rho\(\\mathbf\{h\}\),\\rho\(\\mathbf\{h\}^\{\\prime\}\)\),\(19\)whereσ2\\sigma^\{2\}is a learned global outputscale,kspacek\_\{\\text\{space\}\}acts on physical coordinates \(d=2d=2or33\),kfunk\_\{\\text\{fun\}\}acts on the discretized input function when present \(e\.g\., tension traction on the top edge for*Elasticity*\), andklatentk\_\{\\text\{latent\}\}acts on the projected augmented state\. The function\-dimension factorkfunk\_\{\\text\{fun\}\}is omitted when no input function is present\. We use radial basis function \(RBF\) with automatic relevance determination \(ARD\) lengthscales as the base kernel across all the datasets\. Additionally, as recommended inMacDonaldet al\.\([2015](https://arxiv.org/html/2606.17513#bib.bib3)\); Yousefpouret al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib70)\), change of variable is used so that the lengthscaleswware all optimized in thelog10\\log\_\{10\}\-space, e\.g\.,kspace​\(𝐱,𝐱′\)=exp​\(−∑r=1d10wr​\(xr−xr′\)2\)k\_\{\\text\{space\}\}\(\\mathbf\{x\},\\mathbf\{x\}^\{\\prime\}\)=\\text\{exp\}\\left\(\-\\sum\_\{r=1\}^\{d\}10^\{w\_\{r\}\}\(x\_\{r\}\-x^\{\\prime\}\_\{r\}\)^\{2\}\\right\)\. To ensure numerical stability, we restrict the search interval to\[−5,3\]\[\-5,3\]\.

#### Hyperprior Design\.

To avoid overfitting, we place Gaussian priors on the lengthscales and log\-process variance:

wspace\\displaystyle w\_\{\\text\{space\}\}∼𝒩​\(−1\.5,0\.52\),\\displaystyle\\sim\\mathcal\{N\}\(\-1\.5,\\,0\.5^\{2\}\),wfun\\displaystyle w\_\{\\text\{fun\}\}∼𝒩​\(0\.0,0\.52\),\\displaystyle\\sim\\mathcal\{N\}\(0\.0,\\,0\.5^\{2\}\),wlatent\\displaystyle w\_\{\\text\{latent\}\}∼𝒩​\(−1\.0,0\.52\),\\displaystyle\\sim\\mathcal\{N\}\(\-1\.0,\\,0\.5^\{2\}\),log⁡σ2\\displaystyle\\log\\sigma^\{2\}∼𝒩​\(0\.0,0\.52\)\.\\displaystyle\\sim\\mathcal\{N\}\(0\.0,0\.5^\{2\}\)\.These priors are fixed across all experiments\. No prior is used for the parameters ofρ​\(⋅\)\\rho\(\\cdot\)\.

### B\.3Heteroscedastic Noise Likelihood

The structural errorε\\varepsilonand the observational noiseϵ\\epsilonare functionally indistinguishable from the GP’s perspective as both contribute to the diagonal of the covariance matrix\. They are jointly absorbed into a single input\-dependent noise varianceσ𝝍2​\(𝐳¯\)\+σn2\\sigma\_\{\\boldsymbol\{\\psi\}\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\)\+\\sigma\_\{n\}^\{2\}parametrized by an MLP as:

σ𝝍2​\(𝐳¯\)\+σn2=Softplus​\(g𝝍​\(𝐳¯\)\)\+σmin2,\\sigma\_\{\\boldsymbol\{\\psi\}\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\)\+\\sigma\_\{n\}^\{2\}=\\text\{Softplus\}\\big\(g\_\{\\boldsymbol\{\\psi\}\}\(\\mathbf\{\\bar\{z\}\}\)\\big\)\+\\sigma\_\{\\min\}^\{2\},\(20\)whereg𝝍g\_\{\\boldsymbol\{\\psi\}\}is a 3\-layer MLP with hidden dimensions\[64,32\]\[64,32\], output dimension of 1, GELU activations, layer normalization at the input, and spectral normalization on each linear layer\. We setσmin2=10−3\\sigma\_\{\\min\}^\{2\}=10^\{\-3\}to ensure numerical conditioning of the covariance matrix\. The final linear layer’s bias is initialized to−3\.0\-3\.0so thatσ𝝍2\\sigma\_\{\\boldsymbol\{\\psi\}\}^\{2\}starts at a small value\. The MLP receives the same input representation used by the kernel\{𝐱,𝐮,𝐡\}\\\{\\mathbf\{x\},\\mathbf\{u\},\\mathbf\{h\}\\\}and it returns a diagonal noise covariance:

p​\(v​\(𝐱\)∣f​\(𝐳¯\)\)=𝒩​\(f​\(𝐳¯\),σ𝝍2​\(𝐳¯\)\+σn2\)\.p\(v\(\\mathbf\{x\}\)\\mid f\(\\mathbf\{\\bar\{z\}\}\)\)=\\mathcal\{N\}\\big\(f\(\\mathbf\{\\bar\{z\}\}\),\\;\\sigma\_\{\\boldsymbol\{\\psi\}\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\)\+\\sigma\_\{n\}^\{2\}\\big\)\.\(21\)

### B\.4Training

The full functional regression problem hasM​N∈\[5×105,5×106\]MN\\in\[5\\times 10^\{5\},5\\times 10^\{6\}\]training points across our benchmarks, making exact GP training intractable\. To circumvent this, we draw inspiration from Kernel Flows \(KF\)Owhadi and Yoo \([2019](https://arxiv.org/html/2606.17513#bib.bib43)\)and employ stochastic subsampling\. While KF loss typically minimizes an RKHS stability metric, we directly minimize the mini\-batch negative marginal log\-likelihood \(NLL\) which, unlike KF loss, also prioritizes uncertainty calibration\. Although optimizing on mini\-batches yields a biased estimator of the full\-data NLL, recent works have successfully leveraged random mini\-batchesOwhadi and Yoo \([2019](https://arxiv.org/html/2606.17513#bib.bib43)\); Chenet al\.\([2022](https://arxiv.org/html/2606.17513#bib.bib74)\); Yang and Owhadi \([2024](https://arxiv.org/html/2606.17513#bib.bib75)\)and our empirical studies demonstrate similar trends, see Figure[5](https://arxiv.org/html/2606.17513#A2.F5)\. We train using a fixed max\-iteration budget and the configurations detailed in Table[2](https://arxiv.org/html/2606.17513#A2.T2)\.

![Refer to caption](https://arxiv.org/html/2606.17513v1/x6.png)Figure 5:NLL loss curves across benchmarks for five seeds\.Solid lines represent the mean NLL, with shaded regions indicating one standard deviation\. To prevent overfitting of the GP hyperparameters, we continuously track the validation loss and employ early stopping\. The optimal stopping point, where validation loss reaches its minimum, is marked by the dotted vertical line and point in each subplot\.Table 2:Hyperparameter configurations for our method across benchmarks\.Almost the same configuration is used for all five datasets with minimal per\-task tuning, demonstrating robustness across varied geometries and physical regimes\.HyperparameterElasticityAirfoilPipeShapeNet CarAhmed BodyLatent dimDlD\_\{l\}6464646464Spatial kernelkspacek\_\{\\text\{space\}\}RBFRBFRBFRBFRBFInput function kernelkfunk\_\{\\text\{fun\}\}RBF–––RBFLatent kernelklatentk\_\{\\text\{latent\}\}RBFRBFRBFRBFRBFSubset sizeNsN\_\{s\}5,0005,0005,0005,0005,000Number of expertsKK55555Eval chunk sizeNqN\_\{q\}10,00010,00010,00010,00010,000Max iterations500500500500500NN learning rateηnn\\eta\_\{\\text\{nn\}\}2×10−32\\times 10^\{\-3\}2×10−32\\times 10^\{\-3\}2×10−42\\times 10^\{\-4\}2×10−32\\times 10^\{\-3\}2×10−32\\times 10^\{\-3\}GP learning rateηgp\\eta\_\{\\text\{gp\}\}5×10−25\\times 10^\{\-2\}5×10−25\\times 10^\{\-2\}1×10−31\\times 10^\{\-3\}5×10−25\\times 10^\{\-2\}5×10−25\\times 10^\{\-2\}Weight decay \(NN\)1×10−41\\times 10^\{\-4\}1×10−41\\times 10^\{\-4\}1×10−41\\times 10^\{\-4\}1×10−41\\times 10^\{\-4\}1×10−41\\times 10^\{\-4\}#### Numerical stability\.

GP marginal likelihood and predictive computations are performed using exact Cholesky decomposition \(jitter10−610^\{\-6\}\) when the kernel remains positive definite, and we fall back to conjugate gradient \(CG\) approximation if it becomes mildly ill\-conditioned during training\. With a support set ofNs=5000N\_\{s\}=5000points, Cholesky is computationally feasible \(O​\(Ns3\)O\(N\_\{s\}^\{3\}\)cost\)\. We additionally clip gradients to a maximum norm of1010before each optimizer step\.

#### Reproducibility details\.

Base Transolver models are previously trained using NVIDIA A100 GPUs \(40 GB\)\. All post\-hoc UQ method fitting and inference for both our method and the baselines are performed on a single workstation equipped with an NVIDIA RTX 6000 Ada Generation GPU \(48 GB\)\. This split reflects realistic deployment: the deterministic operator is trained once leveraging HPC resources, while the post\-hoc UQ component is fit and queried on a workstation\. All experiments usefloat32precision\. Baselines for MC Dropout, PNO, Perturbation, LUNO\-LA, and DVE\-spatial share the same Transolver backbone described in Appendix[E](https://arxiv.org/html/2606.17513#A5)\. Random seeds\{0,1,2,3,4\}\\\{0,1,2,3,4\\\}are used across all experiments\.

### B\.5Inference

At inference time, conditioning on the full support set ofM​NMNtraining points remains computationally prohibitive\. We employ a uniformly weighted generalized Product of Experts \(gPoE\)Cao and Fleet \([2015](https://arxiv.org/html/2606.17513#bib.bib72)\):KKrandom support subsets of sizeNsN\_\{s\}conditionKKGP experts, each producing a posterior predictive meanδ¯k​\(𝐳¯∗\)\\bar\{\\delta\}\_\{k\}\(\\mathbf\{\\bar\{z\}\}\_\{\*\}\)and varianceσk2​\(𝐳¯∗\)\\sigma\_\{k\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\_\{\*\}\)at a query point𝐳¯∗=\(𝐮∗,𝐱∗,𝐡​\(𝐮∗,𝐚∗,𝐱∗\)\)\\mathbf\{\\bar\{z\}\}\_\{\*\}=\(\\mathbf\{u\}\_\{\*\},\\mathbf\{x\}\_\{\*\},\\mathbf\{h\}\(\\mathbf\{u\}\_\{\*\},\\mathbf\{a\}\_\{\*\},\\mathbf\{x\}\_\{\*\}\)\)\. Unlike mixture aggregation, gPoE combines experts multiplicatively in precision space, yielding a closed\-form Gaussian:

1σ∗2​\(𝐳¯∗\)=1K​∑k=1K1σk2​\(𝐳¯∗\),δ¯∗​\(𝐳¯∗\)=σ∗2​\(𝐳¯∗\)⋅1K​∑k=1Kδ¯k​\(𝐳¯∗\)σk2​\(𝐳¯∗\)\.\\frac\{1\}\{\\sigma\_\{\*\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\_\{\*\}\)\}=\\frac\{1\}\{K\}\\sum\_\{k=1\}^\{K\}\\frac\{1\}\{\\sigma\_\{k\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\_\{\*\}\)\},\\quad\\quad\\bar\{\\delta\}\_\{\*\}\(\\mathbf\{\\bar\{z\}\}\_\{\*\}\)=\\sigma\_\{\*\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\_\{\*\}\)\\cdot\\frac\{1\}\{K\}\\sum\_\{k=1\}^\{K\}\\frac\{\\bar\{\\delta\}\_\{k\}\(\\mathbf\{\\bar\{z\}\}\_\{\*\}\)\}\{\\sigma\_\{k\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\_\{\*\}\)\}\.\(22\)
The aggregated discrepancy is added back to the frozen base operator to form the final predictive distribution:

v​\(𝐱∗\)∼𝒩​\(𝒢θ​\(𝐮∗,𝐚∗\)​\(𝐱∗\)\+δ¯∗​\(𝐳¯∗\),σ∗2​\(𝐳¯∗\)\)\.v\(\\mathbf\{x\}\_\{\*\}\)\\sim\\mathcal\{N\}\\Big\(\\mathcal\{G\}\_\{\\theta\}\(\\mathbf\{u\}\_\{\*\},\\mathbf\{a\}\_\{\*\}\)\(\\mathbf\{x\}\_\{\*\}\)\+\\bar\{\\delta\}\_\{\*\}\(\\mathbf\{\\bar\{z\}\}\_\{\*\}\),\\;\\sigma\_\{\*\}^\{2\}\(\\mathbf\{\\bar\{z\}\}\_\{\*\}\)\\Big\)\.\(23\)This formulation bounds the test\-time memory footprint strictly to𝒪​\(Ns2\)\\mathcal\{O\}\(N\_\{s\}^\{2\}\)per expert, circumventing the𝒪​\(\(M​N\)2\)\\mathcal\{O\}\(\(MN\)^\{2\}\)requirement of the full support set, while remaining exact within each expert’s local approximation\. The GP experts use the same hyperparameters and only differ in their conditioning data\.

#### Memory complexity at inference\.

The dominant memory cost is the𝒪​\(Ns2\)\\mathcal\{O\}\(N\_\{s\}^\{2\}\)kernel matrix stored once per expert \(re\-used across all query batches\)\. Query points are processed in chunks ofNq=10,000N\_\{q\}=10\{,\}000to bound peak memory; this gives a worst\-case per\-batch peak of𝒪​\(Ns⋅Nq\)\\mathcal\{O\}\(N\_\{s\}\\cdot N\_\{q\}\)\.

## Appendix CDeformation Flow of the Geometry

In Section[3\.2](https://arxiv.org/html/2606.17513#S3.SS2)we introduce a deep kernel formulation that relies on the frozen internal representations of a pretrained neural operator rather than learning a feature map from scratch\. This design connects two established lines of work: deep kernel learning \(DKL\)Wilsonet al\.\([2016](https://arxiv.org/html/2606.17513#bib.bib27)\), which composes a stationary kernel with a learned feature extractor, and KFOwhadi and Yoo \([2019](https://arxiv.org/html/2606.17513#bib.bib43)\), which constructs data\-dependent kernels through iterative, flow\-based deformations of the input space\. Mechanically, REEF\-GP is closest to DKL but conceptually several of our design choices are inspired by KF\. The remainder of this appendix formalizes these connections and empirically analyzes the layer\-wise geometry deformation of the Transolver backbone\.

#### Residual Networks as KF\.

KF constructs a deep kernel by evaluating a base kernelK1K\_\{1\}on a deformed input space\. For two input states𝐳\\mathbf\{z\}and𝐳′\\mathbf\{z\}^\{\\prime\}, the kernel takes the formK​\(𝐳,𝐳′\)=K1​\(FL​\(𝐳\),FL​\(𝐳′\)\)K\(\\mathbf\{z\},\\mathbf\{z\}^\{\\prime\}\)=K\_\{1\}\(F\_\{L\}\(\\mathbf\{z\}\),F\_\{L\}\(\\mathbf\{z\}^\{\\prime\}\)\)whereFLF\_\{L\}is a flow map learned via a sequence of incremental deformations:

Fl\+1​\(𝐳\)=Fl​\(𝐳\)\+τ​Gl\+1​\(Fl​\(𝐳\)\),F\_\{l\+1\}\(\\mathbf\{z\}\)=F\_\{l\}\(\\mathbf\{z\}\)\+\\tau\\,G\_\{l\+1\}\(F\_\{l\}\(\\mathbf\{z\}\)\),\(24\)withτ\\tauacting as a discrete step size andGl\+1G\_\{l\+1\}a transformation in the span of kernel evaluations\. This incremental update mirrors the residual blocks within TransolverWuet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib15)\)and motivates us to leverage its internal geometry\-aware representations\.

Transolver is built upon transformer blocksVaswaniet al\.\([2017](https://arxiv.org/html/2606.17513#bib.bib28)\)with residual connectionsHeet al\.\([2015](https://arxiv.org/html/2606.17513#bib.bib66)\)\. Let𝐡l\\mathbf\{h\}\_\{l\}denote the latent state at layerllcorresponding to a spatial coordinate𝐱\\mathbf\{x\}\. The layer\-wise update reads:

𝐡l\+1=𝐡l\+ℱattn​\(𝐡l\)\+ℱMLP​\(𝐡l\+ℱattn​\(𝐡l\)\),\\mathbf\{h\}\_\{l\+1\}=\\mathbf\{h\}\_\{l\}\+\\mathcal\{F\}\_\{\\text\{attn\}\}\(\\mathbf\{h\}\_\{l\}\)\+\\mathcal\{F\}\_\{\\text\{MLP\}\}\(\\mathbf\{h\}\_\{l\}\+\\mathcal\{F\}\_\{\\text\{attn\}\}\(\\mathbf\{h\}\_\{l\}\)\),\(25\)whereℱattn\\mathcal\{F\}\_\{\\text\{attn\}\}is the Physics\-Attention mechanism\. Structurally, Transolver executes a discrete flow mapping that deforms the original coordinate and input space𝒰×𝒜×Ω\\mathcal\{U\}\\times\\mathcal\{A\}\\times\\Omegainto a structured latent manifold\. We do not claim that Transolver is a KF model \(it is trained via empirical risk minimization against the PDE solution, not the KF stability loss\) but the layer\-wise update has the same residual\-flow geometry, which is what suggests that the intermediate hidden states carry useful information beyond the final embedding alone\.

Figures[6](https://arxiv.org/html/2606.17513#A3.F6)and[7](https://arxiv.org/html/2606.17513#A3.F7)show the t\-SNE representation of the internal embeddings of all the Transolver layers for two airfoil samples\. The network learns to deform the geometry into latent representations that resolve the physics of the problem\.

![Refer to caption](https://arxiv.org/html/2606.17513v1/x7.png)Figure 6:Deformation flow of the geometry for an airfoil sample with shock waves above and below the airfoil\.Each panel shows the two\-dimensional t\-SNE representations of the pointwise embeddings of the initial geometry at each of the 8 layers of the Transolver backbone\. Across depth, the network progressively deforms the original geometry into a latent manifold, supporting the view that intermediate layers consistently encode geometry\-aware structure\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x8.png)Figure 7:Deformation flow of the geometry for an airfoil sample with a shock wave below the airfoil\.Each panel shows the two\-dimensional t\-SNE representations of the pointwise embeddings of the initial geometry at each of the 8 layers of the Transolver backbone\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x9.png)
![Refer to caption](https://arxiv.org/html/2606.17513v1/x10.png)

Figure 8:Feature map from Transolver embeddings to the GP kernel space\.Top: airfoil sample with shock waves at both upper and lower surfaces\. Bottom: airfoil sample with shock wave at the lower surface\. The first three columns show t\-SNE projections of the three CKA\-selected Transolver layers used to construct the augmented state𝐡​\(𝐮,𝐚,𝐱\)\\mathbf\{h\}\(\\mathbf\{u\},\\mathbf\{a\},\\mathbf\{x\}\); the rightmost column shows the embedding after the spectral\-normalized projectionρ​\(⋅\)\\rho\(\\cdot\)used inside the kernel\. The GP kernel space tears precisely at the shock wave region, indicating that the projection reshapes the operator’s latent geometry into a metric space where uncertainty can localize around physically meaningful error regions\.
#### Relation to DKL\.

Using learned internal representations inside a stationary kernel with learnable hyperparameters is essentially the same as DKL\. However, two design choices distinguish REEF\-GP from a standard DKL pipeline, and both follow from the post\-hoc UQ goal\. First, the feature extractor is not trained jointly with the GP:𝒢θ\\mathcal\{G\}\_\{\\theta\}has already produced a physics\-aware deformation by solving the PDE under empirical risk minimization, and we reuse that deformation rather than relearning one\. The internal layers deform the geometry into a manifold where the physical dynamics are smoothly resolvable, but the resulting embeddings are not, on their own, adapted to provide UQ\.

Second, scaling DKL to operator learning datasets \(M​N∼106MN\\sim 10^\{6\}\) typically relies on sparse approximations such as inducing pointsTitsias \([2009](https://arxiv.org/html/2606.17513#bib.bib38)\); Hensmanet al\.\([2015](https://arxiv.org/html/2606.17513#bib.bib40)\), which impose a low\-rank covariance structure that can constrain calibration in ways that are difficult to control without additional regularization\. We take a different route, again inspired by KF: rather than committing to a fixed low\-rank structure, we condition on random subsets of the training data at every step \(Appendix[B\.4](https://arxiv.org/html/2606.17513#A2.SS4)\)\. KF itself was designed for interpolation and explicitly discards UQ information, so we keep its stochastic\-subset philosophy while replacing the RKHS\-stability loss with a mini\-batch negative log marginal likelihood that retains the calibration signal\.

#### Adapting the Flow for UQ\.

The Transolver embeddings are shaped by predictive accuracy, not by the geometry a stationary kernel would prefer for UQ\. We therefore need to compress them into a new latent space whose role is to calibrate uncertainty rather than to relearn the flow\. Figure[8](https://arxiv.org/html/2606.17513#A3.F8)illustrates the mechanism\. Through the layer selection algorithm we choose three internal representations that are maximally distinct, project them through a spectral\-normalized MLP, and use the resulting embedding to augment the input features before feeding them to the stationary kernel\. The projection yields a geometry\-aware embedding suited for UQ: as seen in Figure[8](https://arxiv.org/html/2606.17513#A3.F8), the representation tears off at the shock wave regions, bringing some interpretability to the metric space used to quantify uncertainty\.

## Appendix DAblation Studies

We use the*Airfoil*benchmark for the ablation studies\. This dataset is sufficiently challenging to surface meaningful differences between design choices, while remaining feasible to retrain across many configurations\. The training details and hardware used are described in Appendix[B\.4](https://arxiv.org/html/2606.17513#A2.SS4)\.

### D\.1Layer Selection

We compare four strategies for constructing the augmented state𝐡\\mathbf\{h\}: \(i\) using only the input layerL0L\_\{0\}, \(ii\) only the output layerLlastL\_\{\\text\{last\}\}, \(iii\) concatenating the full stack of all layers, and \(iv\) our default CKA\-based selection of three layers\{L0,l∗,Llast\}\\\{L\_\{0\},l^\{\*\},L\_\{\\text\{last\}\}\\\}\. Results are reported in Table[3](https://arxiv.org/html/2606.17513#A4.T3)and visualized in Figure[9](https://arxiv.org/html/2606.17513#A4.F9)\. Single\-layer variants \(L0L\_\{0\}orLlastL\_\{\\text\{last\}\}alone\) underperform across all probabilistic metrics, withL0L\_\{0\}alone being the weakest\. Our CKA\-selected three\-layer combination is statistically indistinguishable from concatenating the full stack \(overlapping error bars on every metric\), at lower memory and compute cost\.

Table 3:Layer selection ablation\.Effect of the layer set used to construct the augmented state𝐡\\mathbf\{h\}\. Lower is better \(↓\\downarrow\)\.L0L\_\{0\}uses only the input layer;LlastL\_\{\\text\{last\}\}only the output; All layers concatenates the entire stack;REEF\-GPselects\{L0,l∗,Llast\}\\\{L\_\{0\},l^\{\*\},L\_\{\\text\{last\}\}\\\}via CKA\. Best per metric inbold, second bestunderlined\.MethodrL2↓\\downarrowNLL↓\\downarrowCRPS↓\\downarrowNIS↓\\downarrowES↓\\downarrow*Airfoil \(2D\)*\(%\)\(×10−4\\times 10^\{\-4\}\)\(×10−3\\times 10^\{\-3\}\)REEF\-GP1\.24\\mathbf\{1\.24\}±0\.12\\pm 0\.12−3\.51¯\\underline\{\-3\.51\}±0\.06\\pm 0\.0639\.80¯\\underline\{39\.80\}±2\.35\\pm 2\.3556\.89¯\\underline\{56\.89\}±4\.66\\pm 4\.660\.45\\mathbf\{0\.45\}±0\.05\\pm 0\.05LlastL\_\{\\text\{last\}\}only1\.24\\mathbf\{1\.24\}±0\.12\\pm 0\.12−3\.36\-3\.36±0\.12\\pm 0\.1240\.2540\.25±2\.41\\pm 2\.4160\.7060\.70±5\.10\\pm 5\.100\.46¯\\underline\{0\.46\}±0\.05\\pm 0\.05L0L\_\{0\}only1\.24\\mathbf\{1\.24\}±0\.12\\pm 0\.12−2\.75\-2\.75±0\.14\\pm 0\.1440\.8940\.89±2\.46\\pm 2\.4667\.7267\.72±5\.50\\pm 5\.500\.46¯\\underline\{0\.46\}±0\.05\\pm 0\.05All layers \(L0​…​LlastL\_\{0\}\\dots L\_\{\\text\{last\}\}\)1\.24\\mathbf\{1\.24\}±0\.12\\pm 0\.12−3\.54\\mathbf\{\-3\.54\}±0\.06\\pm 0\.0639\.71\\mathbf\{39\.71\}±2\.40\\pm 2\.4055\.99\\mathbf\{55\.99\}±4\.74\\pm 4\.740\.45\\mathbf\{0\.45\}±0\.05\\pm 0\.05![Refer to caption](https://arxiv.org/html/2606.17513v1/x11.png)Figure 9:Layer\-selection ablation on*Airfoil*\.Top: ground truth field, base prediction, absolute error\. Bottom: predicted standard deviation under four layer\-selection variants\. All variants concentrate uncertainty around the shock front; the CKA\-selected three\-layer combination \(REEF\-GP\) closely matches the full\-stack baseline at reduced time and memory costs\.
### D\.2Number of Training Samples

We jointly vary the number of training samples used to train the base Transolver and learn the GP discrepancy model\. The test set is held fixed across all configurations\. As the operator is trained on more data, its predictions become more accurate, leaving smaller residuals for the GP to model\. We observe that REEF\-GP adapts gracefully to base operators of varying quality, recovering well\-calibrated uncertainty in both data\-rich and data\-poor regimes\. Results are reported in Table[4](https://arxiv.org/html/2606.17513#A4.T4)and visualized in Figure[10](https://arxiv.org/html/2606.17513#A4.F10)and Figure[11](https://arxiv.org/html/2606.17513#A4.F11)\.

Table 4:Training samples ablation on*Airfoil*\.Effect of the number of training samples used to fit our GP discrepancy model\. Lower is better \(↓\\downarrow\)\. Best per metric inbold, second bestunderlined\.MethodrL2↓\\downarrowNLL↓\\downarrowCRPS↓\\downarrowNIS↓\\downarrowES↓\\downarrow*Airfoil \(2D\)*\(%\)\(×10−3\\times 10^\{\-3\}\)\(×10−2\\times 10^\{\-2\}\)Mtrain=50M\_\{\\text\{train\}\}=509\.689\.68±0\.80\\pm 0\.800\.340\.34±1\.87\\pm 1\.8732\.5532\.55±3\.04\\pm 3\.0462\.1662\.16±11\.72\\pm 11\.723\.703\.70±0\.38\\pm 0\.38Mtrain=150M\_\{\\text\{train\}\}=1503\.923\.92±1\.14\\pm 1\.14−2\.17\-2\.17±0\.63\\pm 0\.6311\.2211\.22±2\.75\\pm 2\.7518\.4518\.45±5\.47\\pm 5\.471\.491\.49±0\.42\\pm 0\.42Mtrain=250M\_\{\\text\{train\}\}=2502\.182\.18±0\.68\\pm 0\.68−3\.07\-3\.07±0\.06\\pm 0\.066\.356\.35±1\.46\\pm 1\.469\.119\.11±0\.87\\pm 0\.870\.800\.80±0\.22\\pm 0\.22Mtrain=500M\_\{\\text\{train\}\}=5001\.24¯\\underline\{1\.24\}±0\.12\\pm 0\.12−3\.51¯\\underline\{\-3\.51\}±0\.06\\pm 0\.063\.98¯\\underline\{3\.98\}±0\.24\\pm 0\.245\.69¯\\underline\{5\.69\}±0\.47\\pm 0\.470\.45¯\\underline\{0\.45\}±0\.05\\pm 0\.05Mtrain=750M\_\{\\text\{train\}\}=7501\.03\\mathbf\{1\.03\}±0\.03\\pm 0\.03−3\.60\\mathbf\{\-3\.60\}±0\.02\\pm 0\.023\.46\\mathbf\{3\.46\}±0\.04\\pm 0\.045\.06\\mathbf\{5\.06\}±0\.08\\pm 0\.080\.38\\mathbf\{0\.38\}±0\.01\\pm 0\.01![Refer to caption](https://arxiv.org/html/2606.17513v1/x12.png)Figure 10:Training samples ablation on*Airfoil*\.Each row corresponds to a different training\-set sizeMtrainM\_\{\\text\{train\}\}, applied jointly to the base Transolver and the GP discrepancy model\. Columns: ground truth field, base prediction, absolute error of the base prediction, and predicted standard deviation\. AsMtrainM\_\{\\text\{train\}\}decreases, the base operator’s error grows; REEF\-GP correctly inflates its predicted uncertainty to track this growth, with the spatial structure of the predicted standard deviation closely matching the spatial structure of the error\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x13.png)Figure 11:Relative L2 error vs\. training\-set size on*Airfoil*\.Both curves show mean across 5 seeds with shaded±1\\pm 1standard deviation\. REEF\-GP closely follows the base Transolver’s accuracy across two orders of magnitude inMtrainM\_\{\\text\{train\}\}, confirming that adding GP\-based uncertainty quantification does not degrade predictive performance\.
### D\.3Stochastic Training Subset Size

We ablate the per\-expert support\-set sizeNsN\_\{s\}used both during stochastic training \(each gradient step conditions the GP on a random size\-NsN\_\{s\}subset of the training residuals\) and during inference \(each gPoE expert is conditioned on an independent size\-NsN\_\{s\}subset\)\. LargerNsN\_\{s\}provides each GP with more conditioning data per step, at the cost of𝒪​\(Ns3\)\\mathcal\{O\}\(N\_\{s\}^\{3\}\)per Cholesky factorization\. Results are reported in Table[5](https://arxiv.org/html/2606.17513#A4.T5)and visualized in Figure[12](https://arxiv.org/html/2606.17513#A4.F12)\. We observe that REEF\-GP recovers the same calibration quality across two orders of magnitude inNsN\_\{s\}\(from500500to25,00025\{,\}000\)\. Figure[12](https://arxiv.org/html/2606.17513#A4.F12)confirms this visually for a random test sample: the predicted uncertainty fields are nearly identical across all five subset sizes, concentrating around the shock front in every variant\. Following these observations, we adoptNs=5000N\_\{s\}=5000as a middle\-of\-the\-range default that balances conditioning quality against the cubic cost of Cholesky factorization\.

Table 5:Subset\-size ablation on*Airfoil*\.Effect of the per\-expert support\-set sizeNsN\_\{s\}used at inference\. For the results in text we useNs=5000N\_\{s\}=5000\. Lower is better \(↓\\downarrow\)\.MethodrL2↓\\downarrowNLL↓\\downarrowCRPS↓\\downarrowNIS↓\\downarrowES↓\\downarrow*Airfoil \(2D\)*\(%\)\(×10−4\\times 10^\{\-4\}\)\(×10−3\\times 10^\{\-3\}\)Ns=500N\_\{s\}=5001\.241\.24±0\.12\\pm 0\.12−3\.50\-3\.50±0\.08\\pm 0\.0839\.9139\.91±2\.39\\pm 2\.3957\.0057\.00±5\.53\\pm 5\.530\.450\.45±0\.05\\pm 0\.05Ns=1000N\_\{s\}=10001\.241\.24±0\.12\\pm 0\.12−3\.50\-3\.50±0\.08\\pm 0\.0839\.8739\.87±2\.41\\pm 2\.4156\.8156\.81±5\.92\\pm 5\.920\.450\.45±0\.05\\pm 0\.05Ns=5000N\_\{s\}=50001\.241\.24±0\.12\\pm 0\.12−3\.51\-3\.51±0\.06\\pm 0\.0639\.8039\.80±2\.35\\pm 2\.3556\.8956\.89±4\.66\\pm 4\.660\.450\.45±0\.05\\pm 0\.05Ns=10000N\_\{s\}=100001\.241\.24±0\.12\\pm 0\.12−3\.51\-3\.51±0\.06\\pm 0\.0639\.8039\.80±2\.36\\pm 2\.3656\.9456\.94±4\.64\\pm 4\.640\.450\.45±0\.05\\pm 0\.05Ns=25000N\_\{s\}=250001\.241\.24±0\.12\\pm 0\.12−3\.51\-3\.51±0\.06\\pm 0\.0639\.8039\.80±2\.38\\pm 2\.3857\.3057\.30±4\.68\\pm 4\.680\.450\.45±0\.05\\pm 0\.05![Refer to caption](https://arxiv.org/html/2606.17513v1/x14.png)Figure 12:Subset\-size ablation on*Airfoil*\.Top: ground truth field, base prediction, absolute error\. Bottom: predicted standard deviation under five subset sizesNsN\_\{s\}\. All variants concentrate uncertainty around the shock front\.
### D\.4Number of Experts

We compare three values ofKK, the number of GP experts used in the gPoE inference scheme:K=1K=1, our defaultK=5K=5, andK=10K=10\. Results on*Airfoil*are reported in Table[6](https://arxiv.org/html/2606.17513#A4.T6)and visualized in Figure[13](https://arxiv.org/html/2606.17513#A4.F13)\. Calibration is essentially insensitive toKKon this benchmark: all three settings produce statistically indistinguishable metrics\.

Table 6:Number of gPoE experts ablation on*Airfoil*\.Effect of the number of GP expertsKKused in the generalized Product of Experts inference scheme\. Lower is better \(↓\\downarrow\)\.MethodrL2↓\\downarrowNLL↓\\downarrowCRPS↓\\downarrowNIS↓\\downarrowES↓\\downarrow*Airfoil \(2D\)*\(%\)\(×10−4\\times 10^\{\-4\}\)\(×10−3\\times 10^\{\-3\}\)K=1K=11\.241\.24±0\.12\\pm 0\.12−3\.51\-3\.51±0\.06\\pm 0\.0639\.8239\.82±2\.36\\pm 2\.3656\.9056\.90±4\.65\\pm 4\.650\.450\.45±0\.05\\pm 0\.05K=5K=51\.241\.24±0\.12\\pm 0\.12−3\.51\-3\.51±0\.06\\pm 0\.0639\.8039\.80±2\.35\\pm 2\.3556\.8956\.89±4\.66\\pm 4\.660\.450\.45±0\.05\\pm 0\.05K=10K=101\.241\.24±0\.12\\pm 0\.12−3\.51\-3\.51±0\.06\\pm 0\.0639\.8039\.80±2\.35\\pm 2\.3556\.8956\.89±4\.66\\pm 4\.660\.450\.45±0\.05\\pm 0\.05![Refer to caption](https://arxiv.org/html/2606.17513v1/x15.png)Figure 13:Number of gPoE experts ablation on*Airfoil*\.Top: ground truth field, base prediction, absolute error\. Bottom: predicted standard deviation under three values ofKK, the number of GP experts in the gPoE inference scheme\.
### D\.5Heteroscedastic vs\. Homoscedastic Noise

We compare our heteroscedastic noise modelσ𝝍2​\(𝐳¯\)\\sigma^\{2\}\_\{\\boldsymbol\{\\psi\}\}\(\\mathbf\{\\bar\{z\}\}\), which is input\-dependent and parameterized by an MLP, against a homoscedastic baseline that fits a single scalar noise variance\. Results are reported in Table[7](https://arxiv.org/html/2606.17513#A4.T7)and visualized in Figure[14](https://arxiv.org/html/2606.17513#A4.F14)\. The two noise models yield comparable aggregate metrics, with overlapping error bars on every probabilistic score\. The qualitative difference is spatial: both models concentrate uncertainty around the shock front, but the heteroscedastic model produces a sharper, higher\-magnitude uncertainty signal localized to the error region, while the homoscedastic model assigns non\-negligible uncertainty across regions where the operator’s error is small, with a compressed numerical range overall\. We adopt the heteroscedastic model as our default as it provides more structurally meaningful uncertainties\.

Table 7:Heteroscedastic vs\. homoscedastic noise ablation on*Airfoil*\.Effect of the noise model used in the GP likelihood\. Heteroscedastic uses an input\-dependent varianceσ𝝍2​\(𝐳¯\)\\sigma^\{2\}\_\{\\boldsymbol\{\\psi\}\}\(\\mathbf\{\\bar\{z\}\}\); homoscedastic uses a single scalar variance\.MethodrL2↓\\downarrowNLL↓\\downarrowCRPS↓\\downarrowNIS↓\\downarrowES↓\\downarrow*Airfoil \(2D\)*\(%\)\(×10−4\\times 10^\{\-4\}\)\(×10−3\\times 10^\{\-3\}\)Heteroscedastic1\.241\.24±0\.12\\pm 0\.12−3\.51\-3\.51±0\.06\\pm 0\.0639\.8039\.80±2\.35\\pm 2\.3556\.8956\.89±4\.66\\pm 4\.660\.450\.45±0\.05\\pm 0\.05Homoscedastic1\.241\.24±0\.12\\pm 0\.12−3\.41\-3\.41±0\.14\\pm 0\.1438\.8238\.82±2\.66\\pm 2\.6658\.8458\.84±5\.84\\pm 5\.840\.450\.45±0\.05\\pm 0\.05![Refer to caption](https://arxiv.org/html/2606.17513v1/x16.png)Figure 14:Heteroscedastic vs\. homoscedastic noise ablation on*Airfoil*\.Top: ground truth, base prediction, absolute error\. Bottom: predicted standard deviation under each noise model\. The heteroscedastic model concentrates uncertainty around the shock front where the operator’s error is largest\.

## Appendix EBase Neural Operator Configuration

To ensure a rigorous evaluation of our post\-hoc UQ framework, we train a deterministic TransolverWuet al\.\([2024](https://arxiv.org/html/2606.17513#bib.bib15)\)model as required for each of the benchmark tasks and baseline configurations\. The architecture is configured based on the original setup to achieve near state\-of\-the\-art predictive accuracy\.

#### Transolver\.

The model architecture consists of a sequence of tranformer blocks that rely on a Physics\-Attention mechanism that adaptively groups the discretized spatial domain into learnable slices\. Table[8](https://arxiv.org/html/2606.17513#A5.T8)details the specific configurations used for the evaluated benchmarks\. Across all tasks, we use the GELUHendrycks and Gimpel \([2023](https://arxiv.org/html/2606.17513#bib.bib69)\)activation function and optimize the network using AdamWLoshchilov and Hutter \([2019](https://arxiv.org/html/2606.17513#bib.bib68)\)\. The learning rate is decayed using a OneCycle learning rate scheduler to ensure stable convergence\. We use the relativeL2L^\{2\}error as the loss function\. All models are trained for a fixed number of epochs, with results averaged over five random seeds to ensure statistical robustness\.

Table 8:Architecture and hyperparameter configurations\.The base Transolver slightly varies across the benchmarks\.HyperparameterElasticityAirfoilPipeShapeNet CarAhmed BodyHidden channels \(DhD\_\{h\}\)128128128256256Transformer layers \(LL\)88888Physics slices \(SS\)6464643232Attention heads \(HH\)88888Epochs500500500500500Batch size14411Initial learning rate1×10−31\\times 10^\{\-3\}1×10−31\\times 10^\{\-3\}1×10−31\\times 10^\{\-3\}1×10−31\\times 10^\{\-3\}1×10−31\\times 10^\{\-3\}Weight decay1×10−41\\times 10^\{\-4\}1×10−41\\times 10^\{\-4\}1×10−41\\times 10^\{\-4\}1×10−41\\times 10^\{\-4\}1×10−41\\times 10^\{\-4\}
#### Dataset Splits for UQ Evaluation\.

While standard operator learning benchmarks often assume abundant data, the value of UQ is most pronounced in data\-constrained environments where the surrogate model is forced to generalize\. To rigorously test the calibration of REEF\-GP and the baselines under these conditions, we evaluate all main experiments using a reduced data split\. Table[9](https://arxiv.org/html/2606.17513#A5.T9)shows the data splits for each dataset\.

Table 9:Dataset splits\.The number of samples slightly varies across the five benchmarks\.ElasticityAirfoilPipeShapeNet CarAhmed BodyTrain \(Mt​r​a​i​nM\_\{train\}\)500500500500400Validation \(Mv​a​lM\_\{val\}\)10010010010050Test \(Mt​e​s​tM\_\{test\}\)250250250250100

## Appendix FBenchmarks

We use three 2D datasets and two 3D datasets to evaluate REEF\-GP against competing methods detailed in Appendix[G](https://arxiv.org/html/2606.17513#A7)\.

#### Elasticity\.

This benchmark provides the stress field of an elastic material subject to a tension traction𝐭\\mathbf\{t\}applied on the top edge of a unit cell with an irregular shape void in the interiorLiet al\.\([2023a](https://arxiv.org/html/2606.17513#bib.bib13)\)\. There are a total ofM=2000M=2000samples\. For each sample the input domain is discretized as a point cloud comprisingN=972N=972spatial nodes\.

#### Airfoil\.

This benchmark provides the resulting Mach number field from a transonic flow over 2D airfoil geometries, governed by the Euler equationLiet al\.\([2023a](https://arxiv.org/html/2606.17513#bib.bib13)\)\. The spatial domain is discretized into a structured mesh around the airfoil which we fix toN=2820N=2820points per sample\. There are a total ofM=2490M=2490samples\.

#### Pipe\.

This benchmark provides the fluid velocity through a pipe of changing shapeLiet al\.\([2023a](https://arxiv.org/html/2606.17513#bib.bib13)\)\. The input domain is discretized viaN=129×27N=129\\times 27spatial points and there are a total ofM=2310M=2310samples\.

#### ShapeNet Car\.

This benchmark provides the surface pressure field resulting from turbulent aerodynamic flow over diverse 3D vehicle geometries extracted from the "car" category of ShapeNetChanget al\.\([2015](https://arxiv.org/html/2606.17513#bib.bib67)\), governed by the Navier\-Stokes equationsUmetani and Bickel \([2018](https://arxiv.org/html/2606.17513#bib.bib50)\)\. The input domain for each vehicle surface is discretized as an unstructured point cloud, which we fix toN=3500N=3500spatial points per sample\. There are a total ofM=889M=889samples\.

#### Ahmed Body\.

This benchmark provides vehicle aerodynamics simulations based on the Ahmed\-body shapesLiet al\.\([2023b](https://arxiv.org/html/2606.17513#bib.bib22)\)\. There are a total ofM=551M=551shapes and we fix the spatial points toN=5000N=5000per sample\. As additional inputs we consider the inlet velocity and Reynolds number as they change across samples\.

## Appendix GUQ Baselines Configuration

To assess the performance of our post\-hoc UQ framework, we compare it against six established UQ methodologies\. To ensure a fair and rigorous comparison, all baselines are implemented using the exact same base Transolver architecture and evaluated over the same dataset splits\.

#### Deep Ensembles\.

We ensemble55independent Transolver models, each of them trained from scratch using the same hyperparameters but initialized with a different random seed\. The final predictive mean and variance are computed empirically from the ensemble’s55distinct forward passes\. While highly accurate, this method incurs a5×5\\timesmultiplier in total training cost\.

#### Monte Carlo \(MC\) Dropout\.

Following standard practices, dropout is injected directly into the transformer blocks\. Specifically, within the Physics\-Attention mechanism, dropout is applied in two distinct locations: first, directly to the attention matrix after the softmax operation \(randomly dropping query\-key connections prior to multiplication with the value matrix\), and second, to the output of the linear projection immediately following the deslicing step, before the tensor is passed to subsequent network components\. We use a standard dropout rate ofp=0\.1p=0\.1\. At inference time, dropout is kept active, and we draw5050stochastic forward passes to compute the predictive mean and variance\.

#### Probabilistic Neural Operator \(PNO\)\.

PNOBülteet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib59)\)extends the MC Dropout approach by optimizing the network directly for probabilistic evaluation during training\. Using Transolver as the base architecture with the identical dropout locations described above, we implement PNO by modifying the training objective\. Instead of the deterministic relativeL2L^\{2\}error, PNO minimizes the empirical Energy Score \(ES\) defined in Equation[30](https://arxiv.org/html/2606.17513#A8.E30)on the training set\. To balance computational feasibility with stable training, we useK=5K=5forward passes during training, andK=50K=50forward passes during evaluation\.

#### Function\-Valued Laplace Approximation \(LUNO\-LA\)\.

We implement a last\-layer Laplace approximation derived from the LUNO frameworkMagnaniet al\.\([2025](https://arxiv.org/html/2606.17513#bib.bib57)\)using thelaplace\-torchlibraryDaxbergeret al\.\([2021](https://arxiv.org/html/2606.17513#bib.bib33)\)\. LUNO leverages the concept ofcurryingto translate weight\-space uncertainty into a continuous, function\-valued Gaussian Process over the spatial domain\. Because Transolver relies on standard pointwise linear projections in its final layers \(rather than complex spectral convolutions\), the network mapping is inherently linear with respect to the last layer’s weights\. Thus, applying a Laplace approximation to the final layer natively yields the exact LUNO covariance kernelk​\(x,x′\)=𝐡​\(x\)⊤​Σ​𝐡​\(x′\)k\(x,x^\{\\prime\}\)=\\mathbf\{h\}\(x\)^\{\\top\}\\Sigma\\mathbf\{h\}\(x^\{\\prime\}\), where𝐡​\(x\)\\mathbf\{h\}\(x\)are the spatially\-aligned Transolver features andΣ\\Sigmais the weight posterior covariance computed using a full Hessian structure\. In our experiments we find that utilizing all spatial points makes covariance construction and inversion memory\-prohibitive and unstable, preventing convergence to a robust solution\. To overcome this bottleneck and ensure well\-conditioned covariance inversion, we uniformly subsample a maximum of50​k50kspatial points across the dataset during optimization\. The prior precision of the Laplace approximation is optimized via grid search on a validation set\.

#### Input Perturbation\.

Drawing on test\-time data augmentationAyhan and Berens \([2018](https://arxiv.org/html/2606.17513#bib.bib64)\)and initial\-condition perturbationPathaket al\.\([2022](https://arxiv.org/html/2606.17513#bib.bib63)\), we derive predictive uncertainty from a fully trained, deterministic Transolver model’s sensitivity to input variations\. At inference time, we inject independent and identically distributed \(i\.i\.d\.\) Gaussian noise \(σn=10−2\\sigma\_\{n\}=10^\{\-2\}\) directly into the input tensors\. For each test sample, we execute5050independent forward passes, sampling a new noise mask for the inputs prior to each pass\. The final predictive mean and variance are computed empirically across these5050stochastic forward passes\.

#### Deep Vecchia Ensemble \(DVE\-spatial\)\.

We adapt the Deep Vecchia Ensemble \(DVE\)Jimenez and Katzfuss \([2025](https://arxiv.org/html/2606.17513#bib.bib52)\), utilizing the author’s original implementation, which constructs an ensemble of GPs over the internal hidden\-layer representations of a trained neural network\. The original DVE computes the nearest\-neighbor conditioning sets for its Vecchia approximation using distances strictly in the intermediate feature space\. However, in dense spatial and operator learning tasks, geometrically neighboring points frequently exhibit nearly identical embeddings, which leads to severely ill\-conditioned covariance matrices\. To evaluate DVE as a viable baseline in our setting, we adapt it into*DVE\-spatial*by concatenating the physical spatial coordinates to the intermediate embeddings prior to computing the distance metrics\. Although DVE was designed for standard regression rather than operator learning and it might require further specialized modifications for this domain, it serves as a highly relevant benchmark for UQ applied directly within a network’s feature space\. Because the layer\-wise nearest\-neighbor searches and GP inference are computationally expensive, we configure DVE\-spatial to balance its cost based on its default settings\. Specifically, we restrict the Vecchia nearest\-neighbor conditioning set size tom=8m=8and utilize an approximate nearest neighbor search by partitioning the feature space into500500cells and searching across the1515closest cells for each query\.

## Appendix HMetrics

We evaluate predictive accuracy with the relativeL2L^\{2\}error, and probabilistic calibration with four standard scoring rules: NLL, CRPS, NIS, and ES\. Throughout,MtestM\_\{\\text\{test\}\}denotes the number of test instances; for instanceii,𝐯i∈ℝN\\mathbf\{v\}\_\{i\}\\in\\mathbb\{R\}^\{N\}is the ground\-truth nodal solution with entriesvi,jv\_\{i,j\}, and the predictive distribution at nodejjis Gaussian with meanμi,j\\mu\_\{i,j\}and varianceσi,j2\\sigma^\{2\}\_\{i,j\}\.

#### RelativeL2L^\{2\}Error\.

Quantifies the predictive accuracy of the operator:

rL2=1Mtest​∑i=1Mtest‖𝒢†​\(ui,ai\)−𝒢θ​\(𝐮i,𝐚i\)‖𝒱‖𝒢†​\(ui,ai\)‖𝒱,\\mathrm\{rL2\}=\\frac\{1\}\{M\_\{\\text\{test\}\}\}\\sum\_\{i=1\}^\{M\_\{\\text\{test\}\}\}\\frac\{\\\|\\mathcal\{G\}^\{\\dagger\}\(u\_\{i\},a\_\{i\}\)\-\\mathcal\{G\}\_\{\\theta\}\(\\mathbf\{u\}\_\{i\},\\mathbf\{a\}\_\{i\}\)\\\|\_\{\\mathcal\{V\}\}\}\{\\\|\\mathcal\{G\}^\{\\dagger\}\(u\_\{i\},a\_\{i\}\)\\\|\_\{\\mathcal\{V\}\}\},\(26\)where∥⋅∥𝒱\\\|\\cdot\\\|\_\{\\mathcal\{V\}\}is theL2L^\{2\}norm over the physical domain\.

#### Negative Log\-Likelihood \(NLL\)\.

A strictly proper scoring rule that penalizes both miscalibration and overconfidence:

NLL=1Mtest​N​∑i=1Mtest∑j=1N\[12​log⁡\(2​π​σi,j2\)\+\(vi,j−μi,j\)22​σi,j2\]\.\\mathrm\{NLL\}=\\frac\{1\}\{M\_\{\\text\{test\}\}N\}\\sum\_\{i=1\}^\{M\_\{\\text\{test\}\}\}\\sum\_\{j=1\}^\{N\}\\left\[\\frac\{1\}\{2\}\\log\(2\\pi\\sigma^\{2\}\_\{i,j\}\)\+\\frac\{\(v\_\{i,j\}\-\\mu\_\{i,j\}\)^\{2\}\}\{2\\sigma^\{2\}\_\{i,j\}\}\\right\]\.\(27\)

#### Continuous Ranked Probability Score \(CRPS\)\.

A probabilistic generalization of the mean absolute error\. For Gaussian predictives it admits the closed form:

CRPS=1Mtest​N​∑i=1Mtest∑j=1Nσi,j​\[zi,j​\(2​Φ​\(zi,j\)−1\)\+2​ϕ​\(zi,j\)−1π\],\\mathrm\{CRPS\}=\\frac\{1\}\{M\_\{\\text\{test\}\}N\}\\sum\_\{i=1\}^\{M\_\{\\text\{test\}\}\}\\sum\_\{j=1\}^\{N\}\\sigma\_\{i,j\}\\\!\\left\[z\_\{i,j\}\\bigl\(2\\Phi\(z\_\{i,j\}\)\-1\\bigr\)\+2\\phi\(z\_\{i,j\}\)\-\\tfrac\{1\}\{\\sqrt\{\\pi\}\}\\right\],\(28\)with standardized errorzi,j=\(vi,j−μi,j\)/σi,jz\_\{i,j\}=\(v\_\{i,j\}\-\\mu\_\{i,j\}\)/\\sigma\_\{i,j\}, andΦ,ϕ\\Phi,\\phithe standard\-normal CDF and PDF\.

#### Negatively Oriented Interval Score \(NIS\)\.

Evaluates the quality of central prediction intervals\. For confidence level1−α1\-\\alpha, letLi,jL\_\{i,j\}andUi,jU\_\{i,j\}be the predicted lower and upper bounds at nodejj:

NIS=1Mtest​N​∑i=1Mtest∑j=1N\[\(Ui,j−Li,j\)\+2α​\(Li,j−vi,j\)\+\+2α​\(vi,j−Ui,j\)\+\],\\mathrm\{NIS\}=\\frac\{1\}\{M\_\{\\text\{test\}\}N\}\\sum\_\{i=1\}^\{M\_\{\\text\{test\}\}\}\\sum\_\{j=1\}^\{N\}\\left\[\(U\_\{i,j\}\-L\_\{i,j\}\)\+\\frac\{2\}\{\\alpha\}\(L\_\{i,j\}\-v\_\{i,j\}\)\_\{\+\}\+\\frac\{2\}\{\\alpha\}\(v\_\{i,j\}\-U\_\{i,j\}\)\_\{\+\}\\right\],\(29\)where\(x\)\+=max⁡\(x,0\)\(x\)\_\{\+\}=\\max\(x,0\)\. The first term rewards sharpness; the latter two penalize miscoverage\. We report NIS atα=0\.05\\alpha=0\.05\.

#### Energy Score \(ES\)\.

A multivariate proper scoring rule, estimated by Monte Carlo withK=50K=50samples𝐯i\(k\)\\mathbf\{v\}\_\{i\}^\{\(k\)\}drawn from the predictive:

ES=1Mtest​∑i=1Mtest\[1K​∑k=1K‖𝐯i\(k\)−𝐯i‖2−12​K​\(K−1\)​∑k≠k′‖𝐯i\(k\)−𝐯i\(k′\)‖2\]\.\\mathrm\{ES\}=\\frac\{1\}\{M\_\{\\text\{test\}\}\}\\sum\_\{i=1\}^\{M\_\{\\text\{test\}\}\}\\left\[\\frac\{1\}\{K\}\\sum\_\{k=1\}^\{K\}\\\|\\mathbf\{v\}\_\{i\}^\{\(k\)\}\-\\mathbf\{v\}\_\{i\}\\\|\_\{2\}\-\\frac\{1\}\{2K\(K\-1\)\}\\sum\_\{k\\neq k^\{\\prime\}\}\\\|\\mathbf\{v\}\_\{i\}^\{\(k\)\}\-\\mathbf\{v\}\_\{i\}^\{\(k^\{\\prime\}\)\}\\\|\_\{2\}\\right\]\.\(30\)The first term rewards accuracy; the second penalizes overconfidence by rewarding sample diversity\.

## Appendix IAdditional Results

This section presents the results of REEF\-GP and the UQ baselines on the remaining three benchmarks:*Elasticity*\(2D\),*Pipe*\(2D\), and*Ahmed Body*\(3D\)\. The quantitative results are reported in Table[10](https://arxiv.org/html/2606.17513#A9.T10), while spatial uncertainty visualizations and compute cost trade\-offs are provided in Figures[15](https://arxiv.org/html/2606.17513#A9.F15)–[17](https://arxiv.org/html/2606.17513#A9.F17)and Figure[18](https://arxiv.org/html/2606.17513#A9.F18), respectively\.

#### Consistent preservation of predictive accuracy\.

REEF\-GP successfully preserves the predictive accuracy of the base Transolver across all three additional datasets\. As shown in Table[10](https://arxiv.org/html/2606.17513#A9.T10), the rL2 of REEF\-GP matches that of the deterministic Base model perfectly\. In contrast, train\-time modifications such as MC Dropout, PNO, and DVE\-spatial consistently suffer from degraded rL2 performance\.

#### Competitive probabilistic performance\.

REEF\-GP remains highly competitive across all probabilistic metrics\. Most notably, on the complex 3D*Ahmed Body*dataset, REEF\-GP achieves an NLL of4\.214\.21, effectively tying with the computationally expensive Deep Ensembles \(4\.204\.20\) while outperforming all other post\-hoc and train\-time baselines\. On the 2D benchmarks \(*Elasticity*and*Pipe*\), REEF\-GP yields well\-calibrated posteriors with NLL and CRPS scores closely tracking the top\-performing methods\.

#### Spatial alignment of uncertainty\.

Figures[15](https://arxiv.org/html/2606.17513#A9.F15),[16](https://arxiv.org/html/2606.17513#A9.F16), and[17](https://arxiv.org/html/2606.17513#A9.F17)further confirm the spatial coherence of REEF\-GP\. The predicted standard deviation tracks the localized prediction errors of the base operator, although in*Elasticity*benchmark our REEF\-GP uncertainty is quite regular across the entire domain\.

Cost\-quality tradeoff\.Figure[18](https://arxiv.org/html/2606.17513#A9.F18)compares the compute cost across all UQ methods on all available benchmarks\. It confirms that REEF\-GP consistently achieves a training overhead an order of magnitude lower than Deep Ensembles while remaining competitive on evaluation time and train memory\. The only tradeoff is in evaluation memory, although it remains well within commodity GPU memory\.

Table 10:Evaluation metrics for the three benchmarks not in the main text\.Lower is better \(↓\\downarrow\)\. Best inbold, second bestunderlined\. Rankings exclude Base\. See Appendix[H](https://arxiv.org/html/2606.17513#A8)for the definition of the metrics\.MethodrL2↓\\downarrowNLL↓\\downarrowCRPS↓\\downarrowNIS↓\\downarrowES↓\\downarrow*Elasticity \(2D\)*\(%\)Base \(no UQ\)1\.681\.68±0\.16\\pm 0\.16––––Ensemble1\.24\\mathbf\{1\.24\}2\.45\\mathbf\{2\.45\}1\.30\\mathbf\{1\.30\}13\.88\\mathbf\{13\.88\}53\.73\\mathbf\{53\.73\}MC Dropout1\.58¯\\underline\{1\.58\}±0\.12\\pm 0\.1219\.8719\.87±3\.22\\pm 3\.222\.062\.06±0\.14\\pm 0\.1459\.7659\.76±4\.89\\pm 4\.8997\.2997\.29±7\.60\\pm 7\.60PNO2\.392\.39±0\.14\\pm 0\.143\.083\.08±0\.07\\pm 0\.072\.652\.65±0\.15\\pm 0\.1530\.5830\.58±2\.04\\pm 2\.04118\.18118\.18±7\.12\\pm 7\.12Perturbation2\.172\.17±0\.14\\pm 0\.142\.852\.85±0\.05\\pm 0\.052\.432\.43±0\.10\\pm 0\.1029\.1329\.13±0\.71\\pm 0\.71125\.12125\.12±4\.99\\pm 4\.99LUNO\-LA1\.681\.68±0\.16\\pm 0\.162\.80¯\\underline\{2\.80\}±0\.36\\pm 0\.361\.85¯\\underline\{1\.85\}±0\.18\\pm 0\.1825\.32¯\\underline\{25\.32\}±5\.82\\pm 5\.8287\.8287\.82±10\.50\\pm 10\.50DVE\-spatial5\.255\.25±0\.23\\pm 0\.2310\.4810\.48±0\.73\\pm 0\.736\.366\.36±0\.30\\pm 0\.30162\.11162\.11±10\.59\\pm 10\.59317\.43317\.43±15\.28\\pm 15\.28REEF\-GP1\.681\.68±0\.16\\pm 0\.162\.832\.83±0\.10\\pm 0\.101\.971\.97±0\.15\\pm 0\.1525\.9425\.94±2\.70\\pm 2\.7086\.37¯\\underline\{86\.37\}±7\.96\\pm 7\.96*Pipe \(2D\)*\(%\)\(×10−4\\times 10^\{\-4\}\)\(×10−3\\times 10^\{\-3\}\)\(×10−2\\times 10^\{\-2\}\)Base \(no UQ\)0\.520\.52±0\.02\\pm 0\.02––––Ensemble0\.42\\mathbf\{0\.42\}−6\.32\\mathbf\{\-6\.32\}2\.61\\mathbf\{2\.61\}3\.68\\mathbf\{3\.68\}3\.38\\mathbf\{3\.38\}MC Dropout0\.640\.64±0\.03\\pm 0\.03−5\.58¯\\underline\{\-5\.58\}±0\.08\\pm 0\.085\.12¯\\underline\{5\.12\}±0\.20\\pm 0\.206\.96¯\\underline\{6\.96\}±0\.28\\pm 0\.285\.78¯\\underline\{5\.78\}±0\.29\\pm 0\.29PNO0\.770\.77±0\.05\\pm 0\.05−5\.12\-5\.12±0\.05\\pm 0\.057\.777\.77±0\.36\\pm 0\.3611\.0711\.07±0\.51\\pm 0\.517\.687\.68±0\.46\\pm 0\.46Perturbation2\.492\.49±0\.25\\pm 0\.25−4\.22\-4\.22±0\.10\\pm 0\.1021\.1821\.18±1\.59\\pm 1\.5926\.3526\.35±0\.81\\pm 0\.8122\.2822\.28±1\.35\\pm 1\.35LUNO\-LA0\.52¯\\underline\{0\.52\}±0\.02\\pm 0\.02−4\.63\-4\.63±0\.15\\pm 0\.1510\.1110\.11±1\.30\\pm 1\.3016\.6116\.61±2\.16\\pm 2\.168\.638\.63±0\.90\\pm 0\.90DVE\-spatial3\.693\.69±0\.89\\pm 0\.890\.600\.60±1\.60\\pm 1\.6033\.9833\.98±8\.33\\pm 8\.3369\.7269\.72±12\.12\\pm 12\.1235\.2935\.29±7\.82\\pm 7\.82REEF\-GP0\.52¯\\underline\{0\.52\}±0\.02\\pm 0\.02−4\.73\-4\.73±0\.02\\pm 0\.028\.198\.19±0\.09\\pm 0\.0913\.8513\.85±0\.29\\pm 0\.297\.217\.21±0\.16\\pm 0\.16*Ahmed Body \(3D\)*\(%\)\(×103\\times 10^\{3\}\)Base \(no UQ\)6\.626\.62±0\.05\\pm 0\.05––––Ensemble5\.98\\mathbf\{5\.98\}4\.20\\mathbf\{4\.20\}6\.27\\mathbf\{6\.27\}104\.01\\mathbf\{104\.01\}1\.03\\mathbf\{1\.03\}MC Dropout6\.946\.94±0\.42\\pm 0\.4215\.1015\.10±5\.25\\pm 5\.258\.968\.96±1\.42\\pm 1\.42227\.43227\.43±37\.79\\pm 37\.791\.341\.34±0\.07\\pm 0\.07PNO11\.5911\.59±0\.94\\pm 0\.948\.388\.38±1\.68\\pm 1\.6815\.4315\.43±1\.63\\pm 1\.63244\.62244\.62±30\.07\\pm 30\.071\.811\.81±0\.15\\pm 0\.15Perturbation7\.517\.51±0\.14\\pm 0\.146\.886\.88±0\.80\\pm 0\.808\.438\.43±0\.11\\pm 0\.11144\.79144\.79±3\.94\\pm 3\.941\.241\.24±0\.02\\pm 0\.02LUNO\-LA6\.62¯\\underline\{6\.62\}±0\.05\\pm 0\.0516\.9516\.95±10\.78\\pm 10\.788\.07¯\\underline\{8\.07\}±0\.32\\pm 0\.32206\.69206\.69±50\.88\\pm 50\.881\.341\.34±0\.09\\pm 0\.09DVE\-spatial13\.5413\.54±1\.59\\pm 1\.597\.907\.90±1\.72\\pm 1\.7217\.6617\.66±1\.52\\pm 1\.52400\.41400\.41±24\.53\\pm 24\.532\.652\.65±0\.24\\pm 0\.24REEF\-GP6\.636\.63±0\.05\\pm 0\.054\.21¯\\underline\{4\.21\}±0\.08\\pm 0\.088\.888\.88±0\.79\\pm 0\.79136\.68¯\\underline\{136\.68\}±3\.12\\pm 3\.121\.19¯\\underline\{1\.19\}±0\.01\\pm 0\.01![Refer to caption](https://arxiv.org/html/2606.17513v1/x17.png)Figure 15:Predictive standard deviation fields on*Elasticity*\.The rows show in order: ground truth, base prediction, base error and the standard deviation of a sample for each of the UQ baselines\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x18.png)Figure 16:Predictive standard deviation fields on*Pipe*\.The rows show in order: ground truth, base prediction, base error and the standard deviation of a sample for each of the UQ baselines\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x19.png)Figure 17:Predictive standard deviation fields on*Ahmed*\.The rows show in order: ground truth, base prediction, base error and the standard deviation of a sample for each of the UQ baselines\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x20.png)Figure 18:Compute costs on all benchmarks:Train overhead \(the additional time needed on top of a trained base operator to enable UQ\), evaluation time, peak train memory, and peak evaluation memory on all benchmarks\. REEF\-GP maintains an excellent balance between training overhead and inference speed, while peak evaluation memory is higher\.

## Appendix JGeometric OOD

#### Maximum Mean Discrepancy \(MMD\)\.

Given two point clouds𝒳\\mathcal\{X\}and𝒴\\mathcal\{Y\}, each comprisingPPpoints, the squared MMD is defined as:

M​M​D2​\(𝒳,𝒴\)=1P2​∑i=1P∑j=1Pk​\(xi,xj\)\+1P2​∑i=1P∑j=1Pk​\(yi,yj\)−2P2​∑i=1P∑j=1Pk​\(xi,yj\)MMD^\{2\}\(\\mathcal\{X\},\\mathcal\{Y\}\)=\\frac\{1\}\{P^\{2\}\}\\sum\_\{i=1\}^\{P\}\\sum\_\{j=1\}^\{P\}k\(x\_\{i\},x\_\{j\}\)\+\\frac\{1\}\{P^\{2\}\}\\sum\_\{i=1\}^\{P\}\\sum\_\{j=1\}^\{P\}k\(y\_\{i\},y\_\{j\}\)\-\\frac\{2\}\{P^\{2\}\}\\sum\_\{i=1\}^\{P\}\\sum\_\{j=1\}^\{P\}k\(x\_\{i\},y\_\{j\}\)\(31\)wherek​\(⋅,⋅\)k\(\\cdot,\\cdot\)denotes the Gaussian \(Radial Basis Function\) kernel, given byk​\(x,y\)=exp⁡\(−‖x−y‖22​l2\)k\(x,y\)=\\exp\\left\(\-\\frac\{\|\|x\-y\|\|^\{2\}\}\{2l^\{2\}\}\\right\)\. To appropriately scale the kernel to the data distribution, the lengthscale parameterllis determined via a median heuristic computed as the median of the pairwiseL2L\_\{2\}distances sampled from the training point clouds\.

Table 11:OOD robustness across MMD\-based geometric quantiles\.Test samples are partitioned into four groups by MMD distance from the training distribution: Q1 \(closest, in\-distribution\) through Q4 \(furthest, most geometrically novel\)\. For each method we report rL2 \(↓\\downarrow\) and NLL \(↓\\downarrow\); rL2 in percent\. Best inbold, second bestunderlined\.MethodQ1Q2Q3Q4rL2 \(%\)NLLrL2 \(%\)NLLrL2 \(%\)NLLrL2 \(%\)NLL*Elasticity \(2D\)*Ensemble1\.08\\mathbf\{1\.08\}2\.59\\mathbf\{2\.59\}1\.09\\mathbf\{1\.09\}2\.23\\mathbf\{2\.23\}1\.35\\mathbf\{1\.35\}2\.43\\mathbf\{2\.43\}1\.42\\mathbf\{1\.42\}2\.56\\mathbf\{2\.56\}MC Dropout1\.39¯\\underline\{1\.39\}±0\.11\\pm 0\.1119\.0419\.04±3\.48\\pm 3\.481\.42¯\\underline\{1\.42\}±0\.12\\pm 0\.1219\.6619\.66±3\.32\\pm 3\.321\.72¯\\underline\{1\.72\}±0\.13\\pm 0\.1320\.0520\.05±2\.93\\pm 2\.931\.81¯\\underline\{1\.81\}±0\.12\\pm 0\.1220\.7620\.76±3\.22\\pm 3\.22PNO2\.202\.20±0\.15\\pm 0\.152\.932\.93±0\.09\\pm 0\.092\.232\.23±0\.13\\pm 0\.133\.003\.00±0\.07\\pm 0\.072\.512\.51±0\.15\\pm 0\.153\.143\.14±0\.08\\pm 0\.082\.622\.62±0\.13\\pm 0\.133\.253\.25±0\.06\\pm 0\.06Perturbation1\.991\.99±0\.13\\pm 0\.132\.71¯\\underline\{2\.71\}±0\.04\\pm 0\.042\.012\.01±0\.14\\pm 0\.142\.752\.75±0\.04\\pm 0\.042\.302\.30±0\.14\\pm 0\.142\.91¯\\underline\{2\.91\}±0\.07\\pm 0\.072\.402\.40±0\.16\\pm 0\.163\.023\.02±0\.07\\pm 0\.07LUNO\-LA1\.491\.49±0\.15\\pm 0\.152\.59\\mathbf\{2\.59\}±0\.28\\pm 0\.281\.521\.52±0\.16\\pm 0\.162\.60¯\\underline\{2\.60\}±0\.30\\pm 0\.301\.821\.82±0\.16\\pm 0\.162\.932\.93±0\.41\\pm 0\.411\.921\.92±0\.18\\pm 0\.183\.093\.09±0\.46\\pm 0\.46DVE\-spatial4\.784\.78±0\.19\\pm 0\.199\.699\.69±0\.59\\pm 0\.594\.894\.89±0\.20\\pm 0\.209\.689\.68±0\.69\\pm 0\.695\.445\.44±0\.27\\pm 0\.2710\.7110\.71±0\.82\\pm 0\.825\.885\.88±0\.27\\pm 0\.2711\.8811\.88±0\.84\\pm 0\.84REEF\-GP1\.491\.49±0\.15\\pm 0\.152\.732\.73±0\.09\\pm 0\.091\.521\.52±0\.16\\pm 0\.162\.692\.69±0\.08\\pm 0\.081\.821\.82±0\.16\\pm 0\.162\.922\.92±0\.11\\pm 0\.111\.921\.92±0\.18\\pm 0\.182\.97¯\\underline\{2\.97\}±0\.13\\pm 0\.13*Airfoil \(2D\)*Ensemble0\.85\\mathbf\{0\.85\}−3\.60¯\\underline\{\-3\.60\}0\.95\\mathbf\{0\.95\}−3\.56\\mathbf\{\-3\.56\}1\.13\\mathbf\{1\.13\}−3\.49\\mathbf\{\-3\.49\}1\.31\\mathbf\{1\.31\}−3\.19\-3\.19MC Dropout1\.381\.38±0\.22\\pm 0\.22−3\.49\-3\.49±0\.13\\pm 0\.131\.401\.40±0\.19\\pm 0\.19−3\.45¯\\underline\{\-3\.45\}±0\.13\\pm 0\.131\.601\.60±0\.18\\pm 0\.18−3\.36\-3\.36±0\.14\\pm 0\.141\.941\.94±0\.27\\pm 0\.27−3\.22\-3\.22±0\.20\\pm 0\.20PNO2\.062\.06±1\.40\\pm 1\.40−3\.36\-3\.36±0\.65\\pm 0\.652\.122\.12±1\.31\\pm 1\.31−3\.36\-3\.36±0\.63\\pm 0\.632\.392\.39±1\.47\\pm 1\.47−3\.26\-3\.26±0\.68\\pm 0\.682\.732\.73±1\.41\\pm 1\.41−3\.13\-3\.13±0\.63\\pm 0\.63Perturbation3\.203\.20±0\.19\\pm 0\.19−3\.35\-3\.35±0\.05\\pm 0\.053\.133\.13±0\.16\\pm 0\.16−3\.36\-3\.36±0\.05\\pm 0\.053\.413\.41±0\.19\\pm 0\.19−3\.31\-3\.31±0\.05\\pm 0\.053\.603\.60±0\.22\\pm 0\.22−3\.23¯\\underline\{\-3\.23\}±0\.06\\pm 0\.06LUNO\-LA1\.031\.03±0\.10\\pm 0\.10−3\.37\-3\.37±0\.52\\pm 0\.521\.13¯\\underline\{1\.13\}±0\.13\\pm 0\.13−3\.28\-3\.28±0\.57\\pm 0\.571\.30¯\\underline\{1\.30\}±0\.11\\pm 0\.11−2\.92\-2\.92±1\.04\\pm 1\.041\.51¯\\underline\{1\.51\}±0\.17\\pm 0\.17−2\.68\-2\.68±1\.16\\pm 1\.16DVE\-spatial1\.881\.88±0\.27\\pm 0\.2717\.3617\.36±6\.55\\pm 6\.551\.971\.97±0\.26\\pm 0\.2618\.0118\.01±6\.45\\pm 6\.452\.142\.14±0\.25\\pm 0\.2520\.4820\.48±7\.39\\pm 7\.392\.382\.38±0\.20\\pm 0\.2024\.1424\.14±9\.44\\pm 9\.44REEF\-GP1\.03¯\\underline\{1\.03\}±0\.10\\pm 0\.10−3\.61\\mathbf\{\-3\.61\}±0\.05\\pm 0\.051\.13¯\\underline\{1\.13\}±0\.13\\pm 0\.13−3\.56\\mathbf\{\-3\.56\}±0\.07\\pm 0\.071\.30¯\\underline\{1\.30\}±0\.11\\pm 0\.11−3\.47¯\\underline\{\-3\.47\}±0\.07\\pm 0\.071\.51¯\\underline\{1\.51\}±0\.17\\pm 0\.17−3\.41\\mathbf\{\-3\.41\}±0\.07\\pm 0\.07*Pipe \(2D\)*Ensemble0\.38\\mathbf\{0\.38\}−6\.27\\mathbf\{\-6\.27\}0\.31\\mathbf\{0\.31\}−6\.63\\mathbf\{\-6\.63\}0\.39\\mathbf\{0\.39\}−6\.58\\mathbf\{\-6\.58\}0\.60\\mathbf\{0\.60\}−5\.80\\mathbf\{\-5\.80\}MC Dropout0\.540\.54±0\.05\\pm 0\.05−5\.59¯\\underline\{\-5\.59\}±0\.06\\pm 0\.060\.520\.52±0\.04\\pm 0\.04−5\.80¯\\underline\{\-5\.80\}±0\.04\\pm 0\.040\.680\.68±0\.07\\pm 0\.07−5\.68¯\\underline\{\-5\.68\}±0\.11\\pm 0\.110\.840\.84±0\.02\\pm 0\.02−5\.25¯\\underline\{\-5\.25\}±0\.16\\pm 0\.16PNO0\.650\.65±0\.06\\pm 0\.06−5\.20\-5\.20±0\.04\\pm 0\.040\.630\.63±0\.05\\pm 0\.05−5\.20\-5\.20±0\.04\\pm 0\.040\.770\.77±0\.06\\pm 0\.06−5\.15\-5\.15±0\.04\\pm 0\.041\.041\.04±0\.07\\pm 0\.07−4\.92\-4\.92±0\.07\\pm 0\.07Perturbation2\.062\.06±0\.19\\pm 0\.19−4\.33\-4\.33±0\.08\\pm 0\.082\.482\.48±0\.24\\pm 0\.24−4\.29\-4\.29±0\.09\\pm 0\.092\.482\.48±0\.26\\pm 0\.26−4\.22\-4\.22±0\.11\\pm 0\.112\.972\.97±0\.30\\pm 0\.30−4\.05\-4\.05±0\.11\\pm 0\.11LUNO\-LA0\.47¯\\underline\{0\.47\}±0\.05\\pm 0\.05−4\.69\-4\.69±0\.15\\pm 0\.150\.40¯\\underline\{0\.40\}±0\.03\\pm 0\.03−4\.67\-4\.67±0\.16\\pm 0\.160\.49¯\\underline\{0\.49\}±0\.06\\pm 0\.06−4\.64\-4\.64±0\.16\\pm 0\.160\.72¯\\underline\{0\.72\}±0\.03\\pm 0\.03−4\.54\-4\.54±0\.14\\pm 0\.14DVE\-spatial3\.343\.34±0\.95\\pm 0\.950\.120\.12±1\.04\\pm 1\.043\.733\.73±0\.84\\pm 0\.840\.720\.72±1\.77\\pm 1\.773\.553\.55±0\.90\\pm 0\.900\.260\.26±1\.54\\pm 1\.544\.134\.13±0\.88\\pm 0\.881\.311\.31±2\.16\\pm 2\.16REEF\-GP0\.47¯\\underline\{0\.47\}±0\.05\\pm 0\.05−4\.72\-4\.72±0\.05\\pm 0\.050\.40¯\\underline\{0\.40\}±0\.03\\pm 0\.03−4\.85\-4\.85±0\.03\\pm 0\.030\.49¯\\underline\{0\.49\}±0\.06\\pm 0\.06−4\.79\-4\.79±0\.05\\pm 0\.050\.72¯\\underline\{0\.72\}±0\.03\\pm 0\.03−4\.57\-4\.57±0\.03\\pm 0\.03*ShapeNet Car \(3D\)*Ensemble5\.81\\mathbf\{5\.81\}3\.13¯\\underline\{3\.13\}6\.84\\mathbf\{6\.84\}4\.26¯\\underline\{4\.26\}8\.55\\mathbf\{8\.55\}4\.48¯\\underline\{4\.48\}9\.77\\mathbf\{9\.77\}4\.81¯\\underline\{4\.81\}MC Dropout7\.197\.19±0\.74\\pm 0\.7414\.7714\.77±4\.73\\pm 4\.738\.028\.02±0\.53\\pm 0\.5321\.2521\.25±5\.54\\pm 5\.549\.849\.84±0\.54\\pm 0\.5417\.0317\.03±3\.67\\pm 3\.6711\.1511\.15±0\.57\\pm 0\.5715\.8615\.86±3\.09\\pm 3\.09PNO12\.5012\.50±1\.71\\pm 1\.714\.354\.35±1\.24\\pm 1\.2412\.0812\.08±1\.45\\pm 1\.454\.544\.54±1\.21\\pm 1\.2114\.6114\.61±1\.83\\pm 1\.835\.015\.01±1\.47\\pm 1\.4715\.8915\.89±2\.03\\pm 2\.036\.046\.04±2\.19\\pm 2\.19Perturbation6\.656\.65±0\.13\\pm 0\.136\.616\.61±0\.65\\pm 0\.657\.627\.62±0\.13\\pm 0\.1310\.7910\.79±1\.14\\pm 1\.149\.489\.48±0\.15\\pm 0\.1515\.1715\.17±1\.70\\pm 1\.7010\.7910\.79±0\.13\\pm 0\.1321\.4121\.41±1\.77\\pm 1\.77LUNO\-LA6\.596\.59±0\.11\\pm 0\.1119\.7919\.79±13\.47\\pm 13\.477\.647\.64±0\.13\\pm 0\.1330\.4430\.44±21\.51\\pm 21\.519\.519\.51±0\.17\\pm 0\.1737\.4137\.41±25\.97\\pm 25\.9710\.7910\.79±0\.12\\pm 0\.1246\.9946\.99±33\.15\\pm 33\.15DVE\-spatial10\.6710\.67±0\.29\\pm 0\.298\.278\.27±1\.90\\pm 1\.9011\.1211\.12±0\.27\\pm 0\.279\.669\.66±2\.65\\pm 2\.6514\.1714\.17±0\.23\\pm 0\.2312\.3012\.30±3\.51\\pm 3\.5116\.4616\.46±0\.29\\pm 0\.2914\.7514\.75±4\.15\\pm 4\.15REEF\-GP6\.47¯\\underline\{6\.47\}±0\.12\\pm 0\.122\.29\\mathbf\{2\.29\}±0\.02\\pm 0\.027\.54¯\\underline\{7\.54\}±0\.14\\pm 0\.142\.68\\mathbf\{2\.68\}±0\.09\\pm 0\.099\.42¯\\underline\{9\.42\}±0\.18\\pm 0\.183\.00\\mathbf\{3\.00\}±0\.11\\pm 0\.1110\.71¯\\underline\{10\.71\}±0\.14\\pm 0\.143\.42\\mathbf\{3\.42\}±0\.17\\pm 0\.17*Ahmed Body \(3D\)*Ensemble5\.16\\mathbf\{5\.16\}3\.90\\mathbf\{3\.90\}6\.35\\mathbf\{6\.35\}4\.77¯\\underline\{4\.77\}5\.83\\mathbf\{5\.83\}4\.03¯\\underline\{4\.03\}6\.60\\mathbf\{6\.60\}4\.09\\mathbf\{4\.09\}MC Dropout6\.216\.21±0\.46\\pm 0\.4610\.7510\.75±3\.22\\pm 3\.227\.217\.21±0\.46\\pm 0\.4618\.7518\.75±5\.92\\pm 5\.926\.716\.71±0\.39\\pm 0\.3913\.1313\.13±3\.39\\pm 3\.397\.617\.61±0\.43\\pm 0\.4317\.7517\.75±8\.58\\pm 8\.58PNO11\.0511\.05±1\.05\\pm 1\.055\.865\.86±0\.80\\pm 0\.8011\.6311\.63±0\.95\\pm 0\.9512\.7012\.70±3\.55\\pm 3\.5511\.6111\.61±0\.78\\pm 0\.785\.135\.13±0\.52\\pm 0\.5212\.0712\.07±1\.10\\pm 1\.109\.849\.84±2\.26\\pm 2\.26Perturbation6\.746\.74±0\.26\\pm 0\.265\.425\.42±0\.45\\pm 0\.457\.987\.98±0\.13\\pm 0\.137\.977\.97±1\.15\\pm 1\.157\.157\.15±0\.18\\pm 0\.186\.476\.47±0\.68\\pm 0\.688\.158\.15±0\.17\\pm 0\.177\.667\.66±1\.39\\pm 1\.39LUNO\-LA5\.74¯\\underline\{5\.74\}±0\.12\\pm 0\.1210\.1210\.12±5\.77\\pm 5\.776\.946\.94±0\.08\\pm 0\.0826\.0026\.00±17\.84\\pm 17\.846\.48¯\\underline\{6\.48\}±0\.16\\pm 0\.169\.459\.45±5\.13\\pm 5\.137\.337\.33±0\.23\\pm 0\.2322\.2222\.22±14\.74\\pm 14\.74DVE\-spatial13\.1713\.17±1\.52\\pm 1\.528\.028\.02±1\.49\\pm 1\.4913\.9313\.93±1\.68\\pm 1\.688\.938\.93±2\.19\\pm 2\.1913\.0013\.00±1\.48\\pm 1\.486\.146\.14±1\.12\\pm 1\.1214\.0514\.05±1\.69\\pm 1\.698\.508\.50±2\.10\\pm 2\.10REEF\-GP5\.825\.82±0\.15\\pm 0\.154\.06¯\\underline\{4\.06\}±0\.15\\pm 0\.156\.92¯\\underline\{6\.92\}±0\.09\\pm 0\.094\.34\\mathbf\{4\.34\}±0\.08\\pm 0\.086\.496\.49±0\.17\\pm 0\.174\.01\\mathbf\{4\.01\}±0\.16\\pm 0\.167\.29¯\\underline\{7\.29\}±0\.23\\pm 0\.234\.42¯\\underline\{4\.42\}±0\.06\\pm 0\.06

#### OOD Evaluation Results\.

Table[11](https://arxiv.org/html/2606.17513#A10.T11)reports the quantitative performance of all baselines partitioned across the four geometric MMD quartiles\. A robust UQ method should seamlessly scale its predictive uncertainty as the test geometry deviates from the training manifold, resulting in a stable NLL even as the deterministic error \(rL2\) naturally increases\.

Across all benchmarks, REEF\-GP demonstrates strong out\-of\-distribution stability\. For instance, in the*ShapeNet Car*dataset, as geometries shift from Q1 \(most similar to the training samples\) to Q4 \(most different from training samples\), REEF\-GP’s NLL increases only slightly and it remains the best\-calibrated method in every quartile\. In contrast, other post\-hoc methods suffer calibration collapse under identical shifts: Perturbation’s NLL and and LUNO\-LA’s NLL more than triple between Q1 and Q4, and similar trends are observed on the Ahmed Body benchmark\. Furthermore, REEF\-GP consistently delivers competitive metrics with deep ensembles\.

#### Qualitative OOD Uncertainty\.

Figures[19](https://arxiv.org/html/2606.17513#A10.F19)through[22](https://arxiv.org/html/2606.17513#A10.F22)and Figures[23](https://arxiv.org/html/2606.17513#A10.F23)through[26](https://arxiv.org/html/2606.17513#A10.F26)visualize the spatial distribution of uncertainty for the 3D benchmarks as the test geometries become increasingly distinct from the training set \(progressing towards Q4\)\. REEF\-GP’s predicted uncertainty remains coherent with the base model error regions over the geometry\.

![Refer to caption](https://arxiv.org/html/2606.17513v1/x21.png)Figure 19:Predictive standard deviation fields on a test*Car*sample from Q1\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x22.png)Figure 20:Predictive standard deviation fields on a test*Car*sample from Q2\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x23.png)Figure 21:Predictive standard deviation fields on a test*Car*sample from Q3\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x24.png)Figure 22:Predictive standard deviation fields on a test*Car*sample from Q4\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x25.png)Figure 23:Predictive standard deviation fields on a test*Ahmed*sample from Q1\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x26.png)Figure 24:Predictive standard deviation fields on a test*Ahmed*sample from Q2\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x27.png)Figure 25:Predictive standard deviation fields on a test*Ahmed*sample from Q3\.![Refer to caption](https://arxiv.org/html/2606.17513v1/x28.png)Figure 26:Predictive standard deviation fields on a test*Ahmed*sample from Q4\.

Similar Articles

Scalable Uncertainty Reasoning in Knowledge Graphs

arXiv cs.AI

This thesis proposes a modular framework for scalable uncertainty reasoning in knowledge graphs, addressing imprecise attribute values, probabilistic triple existence, and incomplete schema through tailored algebraic, logical, and geometric techniques.

Reading Calibrated Uncertainty from Language Model Trajectories

arXiv cs.LG

This paper introduces a method to calibrate uncertainty in language models by extracting eleven scale-invariant geometric features from per-layer MLP update trajectories and feeding them to a sparse linear probe, outperforming MSP under selective abstention by up to 21 AURC points.