A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning

arXiv cs.LG 05/27/26, 04:00 AM Papers
pac-bayesian physics-informed machine-learning generalization pde multi-task-learning neural-networks
Summary
This paper develops a PAC-Bayesian framework for physics-informed machine learning, providing high-probability generalization guarantees for unbounded losses. It proposes a multi-task perspective that jointly handles data fidelity, PDE residuals, and boundary conditions, and introduces a self-bounding learning algorithm.
arXiv:2605.26341v1 Announce Type: new
Cached at: 05/27/26, 09:08 AM
# A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning
Source: [https://arxiv.org/html/2605.26341](https://arxiv.org/html/2605.26341)
Thien V\. Nguyen, Université Jean Monnet Saint\-Étienne, CNRS, Institut d’Optique Graduate School, Laboratoire Hubert Curien UMR 5516, F\-42023, Saint\-Étienne, France, van\.thien\.nguyen@univ\-st\-etienne\.fr

Amaury Habrard, Université Jean Monnet Saint\-Étienne, CNRS, Institut d’Optique Graduate School, Laboratoire Hubert Curien UMR 5516, Inria, F\-42023, Saint\-Étienne, France; Institut Universitaire de France, amaury\.habrard@univ\-st\-etienne\.fr

Benjamin Guedj, Inria and University College London, France and United Kingdom, b\.guedj@ucl\.ac\.uk

###### Abstract

Physics\-informed machine learning \(PIML\) integrates mechanistic knowledge, typically in the form of partial differential equations \(PDE\), into data\-driven models\. Despite strong empirical performance, its statistical generalisation properties remain poorly understood, particularly in the regression setting with unbounded losses\. Existing analyses rely on approximation or stability arguments and do not fully capture how physical structure influences generalisation from finite data\. In this work, we develop a PAC\-Bayesian framework for PIML that provides high\-probability generalisation guarantees in the presence of unbounded losses\. We adopt a multi\-task perspective that jointly treats data fidelity, PDE residuals, initial and boundary conditions, avoiding the looseness induced by standard union\-bound approaches\. Our analysis leverages the structure of physics\-informed objectives to derive novel bounds where the complexity scales with input\-gradient norms of the losses, revealing a direct link between physical regularity and generalisation\. We instantiate this framework under Sobolev and Poincaré\-type assumptions, yielding two classes of bounds that trade off statistical complexity and smoothness in different regimes\. Building on these results, we propose a self\-bounding\-aware learning algorithm that directly optimises tractable surrogates of the derived bounds, along with a practical procedure to estimate the associated constants in realistic settings\. Empirical evaluations on standard PDE benchmarks demonstrate that our bounds are non\-vacuous, significantly tighter than union\-bound baselines, and can be effectively minimised during training\. Overall, our results provide a principled statistical foundation for the generalisation of physics\-informed models\.

## 1 Introduction

Physics\-informed machine learning \(PIML, Karniadakis et al\., [2021](https://arxiv.org/html/2605.26341#bib.bib25)\) aims to integrate prior scientific knowledge, typically expressed as partial differential equations \(PDEs\), into data\-driven models\. By constraining the hypothesis space through physical laws, PIML has demonstrated strong empirical performance across a range of applications, including forward and inverse problems, scientific simulation, and hybrid modelling\. A central premise underlying these methods is that physical structure should improve generalisation by reducing the effective complexity of the learned model\.

Despite this intuition, the statistical mechanisms by which physical constraints influence generalisation remain poorly understood\. Existing theoretical analyses of PIML largely focus on approximation error or optimisation behaviour, and only partially address the fundamental statistical question: how well does a model trained on finite data generalise to unseen inputs? This gap is particularly pronounced in realistic settings, where losses are unbounded and multiple heterogeneous objectives \(data fidelity, PDE residuals, and boundary conditions\) are optimised jointly\.

Among the most prominent approaches, physics\-informed neural networks \(PINNs\) Raissi et al\. \([2019](https://arxiv.org/html/2605.26341#bib.bib26)\) enforce physical constraints by penalising violations of the governing equations during training, while alternative formulations based on kernel methods Doumèche et al\. \([2025a](https://arxiv.org/html/2605.26341#bib.bib34)\) and variational principles Rojas et al\. \([2024](https://arxiv.org/html/2605.26341#bib.bib44)\) have recently gained attention\. However, the theoretical understanding of the generalisation capabilities of PIML methods remains a challenge\. A large body of results has studied the difficulty of learning PINNs by establishing convergence rates Shin et al\. \([2023](https://arxiv.org/html/2605.26341#bib.bib33)\); Doumèche et al\. \([2025b](https://arxiv.org/html/2605.26341#bib.bib29)\)\. Ryck and Mishra \([2022](https://arxiv.org/html/2605.26341#bib.bib27)\) and Mishra and Molinaro \([2022](https://arxiv.org/html/2605.26341#bib.bib31)\) proposed a general scheme for deriving generalisation bounds based on stability properties and approximation results\. Some papers have studied generalisation based on \(local\) Rademacher complexity Jiao et al\. \([2022](https://arxiv.org/html/2605.26341#bib.bib32)\); Lu et al\. \([2022](https://arxiv.org/html/2605.26341#bib.bib30)\); Xu et al\. \([2025](https://arxiv.org/html/2605.26341#bib.bib28)\)\. Related works are discussed in Appendix [D](https://arxiv.org/html/2605.26341#A4)\.

In this paper, we provide a novel view of the PIML problem through PAC\-Bayes theory \(Alquier, [2024](https://arxiv.org/html/2605.26341#bib.bib37); Guedj, [2019](https://arxiv.org/html/2605.26341#bib.bib38); Hellström et al\., [2025](https://arxiv.org/html/2605.26341#bib.bib36)\)\. This theory provides a powerful framework to study model performance considering randomised predictors and establishing robust, flexible generalisation guarantees\. A central feature of this framework is the trade\-off between empirical performance and model complexity, typically quantified via an information\-theoretic divergence between a data\-dependent posterior distribution and a prior\. This perspective is particularly appealing for PIML, as it enables the integration of physical knowledge in the form of structured priors or constraints on the hypothesis space\.

However, applying PAC\-Bayes tools to PIML raises significant challenges\. Early PAC\-Bayesian works concentrate on bounded losses for classification, whereas PIML is inherently a regression framework with unbounded loss functions \(for example, squared error or the residual of a differential operator\) that can exhibit complex tail behavior\. Consequently, standard PAC\-Bayes bounds, whose derivations typically rely on boundedness assumptions or sub\-Gaussian tails, do not apply directly\. Extending PAC\-Bayesian guarantees to this PIML regime therefore requires controlling exponential moments of potentially heavy\-tailed losses, a non\-trivial task that requires additional structural assumptions, such as boundedness of higher\-order moments of the loss Haddouche and Guedj \([2023](https://arxiv.org/html/2605.26341#bib.bib40)\); Holland \([2019](https://arxiv.org/html/2605.26341#bib.bib39)\), or of the cumulant generating function \(CGF\) Casado et al\. \([2024](https://arxiv.org/html/2605.26341#bib.bib4)\); Rodríguez\-Gálvez et al\. \([2024](https://arxiv.org/html/2605.26341#bib.bib16)\)\. However, these assumptions introduce new parameters or constants that are difficult to estimate or optimise in practice, and their connections to the bound, the data, or the underlying physical constraints remain less intuitive\.

In this work, we address this gap by developing PAC\-Bayesian generalisation bounds tailored to physics\-informed learning in regression settings with unbounded losses\. Our approach explicitly accounts for the hybrid nature of PIML, where the learning objective combines a data\-fitting term and a physics\-based regularisation term encoding prior knowledge through a differential operator\. We show how this structure can be leveraged within the PAC\-Bayesian framework to derive meaningful generalisation guarantees going beyond the classical bounded\-loss setting\. In particular, we provide bounds that capture the interplay between data, model complexity, and the strength of the physical prior, thereby offering new insights into when and how physics improves learning\.

Our key insight is that PIML admits a natural multi\-task structure in which data and physics constraints can be treated jointly within a single PAC\-Bayesian analysis\. Exploiting this structure allows us to derive significantly tighter bounds than standard approaches based on independent treatment of each loss\. Moreover, we show that the interaction between physical constraints and generalisation can be captured through input\-gradient dependent complexity terms, revealing a direct connection between the smoothness induced by physics and statistical performance\.

Our contributions are as follows:

- We treat the PIML problem from a multi\-task point of view and propose two different smoothness assumptions, namely Sobolev [3\.2](https://arxiv.org/html/2605.26341#S3.Thmtheorem2) \(stronger\) and Poincaré [3\.4](https://arxiv.org/html/2605.26341#S3.Thmtheorem4) \(weaker\), to derive two new bounds \(Theorem [3\.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3) and Theorem [3\.4](https://arxiv.org/html/2605.26341#S3.Thmtheorem4)\) where the complexity scales with the weighted norm of the gradient of the losses with respect to \(w\.r\.t\.\) the input\.
- To complement our theory, we examine the practical plausibility of the underlying assumptions by means of a principled procedure to estimate the Sobolev and Poincaré constants so that they reflect real model and data distributions\. Using these estimators, we present a self‑bounding PIML algorithm that directly targets the derived generalisation bounds by minimising their stochastic surrogate objectives, yielding a practical training procedure that aligns optimisation with the theoretical guarantees\.
- We evaluate our algorithm and the associated generalisation bounds on PDE benchmarks\. Results show that our bounds are substantially tighter than classic union‑bound baselines, and that the self‑bounding procedure reliably reduces the bound in practice\. Interestingly, we show that informed priors can be built by exploiting only the PDE structure on the input domain, without additional labeled data to learn the prior, thereby improving bound tightness while staying label\-efficient\.

## 2 Preliminaries

### 2\.1 Physics\-informed machine learning

A PDE with equation constraints, initial \(ICs\) and boundary conditions \(BCs\) can be formulated as

$$\mathcal{D}[u](\bm{\mathrm{x}})=0,\ \bm{\mathrm{x}}\in\Omega;\qquad \mathcal{I}[u](0,\bm{\mathrm{x}})=0,\ \bm{\mathrm{x}}\in\Omega_{0};\qquad \mathcal{B}[u](\bm{\mathrm{x}})=0,\ \bm{\mathrm{x}}\in\partial\Omega, \tag{1}$$

where $\mathcal{D},\mathcal{I},\mathcal{B}$ are \(residual\) derivative, initial and boundary operators, respectively\. The aim is to learn the target function $u:\mathbb{R}^{d}\rightarrow\mathbb{R}$, which maps an input vector $\bm{\mathrm{x}}\in\Omega\subset\mathbb{R}^{d}$ to an output value $y\in\mathbb{R}$\. Here $\bm{\mathrm{x}}\in\mathbb{R}^{d}$ stands for the input coordinate, composed of time and spatial position\. $\Omega_{0}$ corresponds to the $t=0$ instance, while $\partial\Omega$ corresponds to extreme spatial positions\. Our goal is to learn a parametric model $u_{\bm{\theta}}$ that approximates the true function $u$, with $\bm{\theta}\in\mathbb{R}^{d}$ a parameter set belonging to a model class $\Theta$\. As an abuse of notation, $\bm{\theta}\in\Theta$ designates both a model and its parameters, with $d_{\bm{\theta}}:=d$\. We also assume that $u_{\bm{\theta}}$ belongs to the Sobolev space $H^{1}(\Omega)$\. Typically, learning is done by minimising the following physics\-informed machine learning risk:

$$\begin{aligned}
\hat{\mathcal{R}}_{\textrm{PIML}_{\lambda}}(\bm{\theta})
&=\lambda_{d}\hat{\mathcal{R}}_{d}(\bm{\theta})+\lambda_{p}\hat{\mathcal{R}}_{p}(\bm{\theta})+\lambda_{ic}\hat{\mathcal{R}}_{ic}(\bm{\theta})+\lambda_{bc}\hat{\mathcal{R}}_{bc}(\bm{\theta})\\
&=\frac{\lambda_{d}}{|S_{d}|}\sum_{(\bm{\mathrm{x}},y)\in S_{d}}\ell_{d}\left(u_{\bm{\theta}}(\bm{\mathrm{x}}),y\right)+\frac{\lambda_{p}}{|S_{p}|}\sum_{\bm{\mathrm{x}}\in S_{p}}\ell_{p}\left(\mathcal{D}[u_{\bm{\theta}}](\bm{\mathrm{x}})\right)\\
&\quad+\frac{\lambda_{ic}}{|S_{ic}|}\sum_{\bm{\mathrm{x}}\in S_{ic}}\ell_{ic}\left(\mathcal{I}[u_{\bm{\theta}}](\bm{\mathrm{x}})\right)+\frac{\lambda_{bc}}{|S_{bc}|}\sum_{\bm{\mathrm{x}}\in S_{bc}}\ell_{bc}\left(\mathcal{B}[u_{\bm{\theta}}](\bm{\mathrm{x}})\right),
\end{aligned} \tag{2}$$

where $\ell_{d},\ell_{p},\ell_{ic},\ell_{bc}$ are data\-fidelity, PDE, IC and BC loss functions, respectively\. Note that $\ell_{p},\ell_{ic},\ell_{bc}$ are the physics\-based loss terms\. We let $S_{d},S_{p},S_{ic},S_{bc}$ be the corresponding sets of observation and PDE, IC, BC collocation points, drawn independently across datasets and i\.i\.d\. within each dataset\. We consider a flexible framework where each sample can be generated by a different probability distribution, modeling potentially diverse acquisition procedures\.
Note that $S_{d}$ is generated over $\Omega\times\mathbb{R}$, while the others are generated over $\Omega$\. For convenience, we use an abuse of notation with generic losses $\ell_{i}(\bm{\theta},\bm{\mathrm{x}})$, where $\bm{\mathrm{x}}$ denotes a data point from either $\Omega\times\mathbb{R}$ or $\Omega$ depending on the loss considered; however, the input gradient $\nabla_{\bm{\mathrm{x}}}\ell_{i}(\bm{\theta},\bm{\mathrm{x}})$ is always considered with respect to the input space $\Omega$\. The hyperparameters $\lambda_{d},\lambda_{p},\lambda_{ic},\lambda_{bc}$ control the relative importance of these loss terms\. This formulation encourages the learned model to not only match the observed data, but also to remain consistent with the underlying physical laws encoded by the PDE\. It can be seen as a particular multi\-task problem with a shared backbone and deterministic, non\-learned heads determined by the governing PDE system\. Hence, to streamline the subsequent theoretical analysis, we consider a physics‑informed optimisation problem comprising $N_{L}$ loss components, where the $i$\-th component is $\ell_{i}$ and is trained on a dataset $S_{i}$ sampled from an underlying distribution $D_{i}$\. Besides, for an arbitrary loss function $\ell$, denote $\mathcal{R}(\bm{\theta})=\mathbb{E}_{\bm{\mathrm{x}}}\,\ell(\bm{\theta},\bm{\mathrm{x}})$ and $\hat{\mathcal{R}}(\bm{\theta})=\frac{1}{|S|}\sum_{\bm{\mathrm{x}}\in S}\ell(\bm{\theta},\bm{\mathrm{x}})$ as its population \(true\) and empirical risks on its training set $S$\.
This way, we can rewrite the weighted empirical PIML risk above as $\hat{\mathcal{R}}_{\textrm{PIML}_{\lambda}}(\bm{\theta})=\sum_{i=1}^{N_{L}}\lambda_{i}\hat{\mathcal{R}}_{i}(\bm{\theta})$\.
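To make the structure of the weighted empirical risk concrete, the following minimal sketch evaluates it on a toy problem\. Everything specific here is an assumption made for illustration and does not come from the paper: the ODE $u'+u=0$ with $u(0)=1$, the two\-parameter model `u`, the finite\-difference residual, squared losses, and the omission of a boundary term\.

```python
import math

def u(theta, x):
    """Hypothetical one-dimensional model u_theta(x) = a * exp(b * x)."""
    a, b = theta
    return a * math.exp(b * x)

def pde_residual(theta, x, h=1e-5):
    """Residual D[u](x) = u'(x) + u(x), with u' from a central difference."""
    du = (u(theta, x + h) - u(theta, x - h)) / (2.0 * h)
    return du + u(theta, x)

def ic_residual(theta):
    """Residual I[u](0) = u(0) - 1 for the initial condition u(0) = 1."""
    return u(theta, 0.0) - 1.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def empirical_piml_risk(theta, S_d, S_p, lam_d=1.0, lam_p=1.0, lam_ic=1.0):
    """Weighted sum of empirical risks with squared losses (no BC term here)."""
    r_d = mean((u(theta, x) - y) ** 2 for x, y in S_d)
    r_p = mean(pde_residual(theta, x) ** 2 for x in S_p)
    r_ic = ic_residual(theta) ** 2
    return lam_d * r_d + lam_p * r_p + lam_ic * r_ic

# Toy observations from the exact solution u(x) = exp(-x) on [0, 1],
# plus a small set of collocation points for the residual term.
S_d = [(x, math.exp(-x)) for x in (0.1, 0.4, 0.9)]
S_p = [i / 10 for i in range(11)]

risk_exact = empirical_piml_risk((1.0, -1.0), S_d, S_p)
risk_wrong = empirical_piml_risk((1.0, 0.5), S_d, S_p)
```

The parameters `(1.0, -1.0)` reproduce the exact solution, so every term is numerically close to zero, whereas `(1.0, 0.5)` violates both the data and the residual terms\.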

Training PINNs is challenging and typically requires adapting the loss weights $\{\lambda_{i}\}$ to improve convergence McClenny and Braga\-Neto \([2020](https://arxiv.org/html/2605.26341#bib.bib9)\); Wang et al\. \([2022](https://arxiv.org/html/2605.26341#bib.bib8)\), which yields different $\{\lambda_{i}\}$ across runs even for the same training data\. Consequently, it is inappropriate to rely on these adaptively weighted risks when studying the generalisation of the learned model\. The simplest alternative is to consider the total risk $\mathcal{R}_{\mathrm{PIML}}(\bm{\theta})=\sum_{i=1}^{N_{L}}\mathcal{R}_{i}(\bm{\theta})$\. Besides, in a general multi\-task learning scenario, we have to deal with diverse training sizes, so it is also of interest to consider a sample\-weighted risk \(inspired by Zakerinia et al\. \([2024](https://arxiv.org/html/2605.26341#bib.bib3)\)\) $\mathcal{R}_{\mathrm{PIML}}^{\mathcal{S}}(\bm{\theta})=\frac{1}{M}\sum_{i=1}^{N_{L}}m_{i}\mathcal{R}_{i}(\bm{\theta})$, where $m_{i}$ is the sample size of the $i$\-th task and $M=\sum_{i=1}^{N_{L}}m_{i}$ is the total sample size\. For convenience, let $S=\{S_{i}\}_{i=1}^{N_{L}}$ be the collection of training sets in the PIML context\. From these notions, we consider two corresponding versions of the generalisation gap:

$$\mathrm{gen}(\bm{\theta},S)=\mathcal{R}_{\mathrm{PIML}}(\bm{\theta})-\hat{\mathcal{R}}_{\mathrm{PIML}}(\bm{\theta})=\sum_{i=1}^{N_{L}}\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\big(\mathcal{R}_{i}(\bm{\theta})-\ell_{i}(\bm{\theta},\bm{\mathrm{x}}_{i,j})\big), \tag{3}$$

$$\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S)=\mathcal{R}_{\mathrm{PIML}}^{\mathcal{S}}(\bm{\theta})-\hat{\mathcal{R}}_{\mathrm{PIML}}^{\mathcal{S}}(\bm{\theta})=\frac{1}{M}\sum_{i=1}^{N_{L}}\sum_{j=1}^{m_{i}}\big(\mathcal{R}_{i}(\bm{\theta})-\ell_{i}(\bm{\theta},\bm{\mathrm{x}}_{i,j})\big). \tag{4}$$
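The difference between the two gaps is only in how tasks are weighted, which the following short sketch makes explicit\. The function name, the toy numbers, and the premise that the population risks $\mathcal{R}_{i}(\bm{\theta})$ are known \(realistic only in synthetic experiments\) are assumptions of this illustration\.

```python
def generalisation_gaps(pop_risks, losses):
    """Return (gen, gen_S) following equations (3) and (4).

    pop_risks[i] is the population risk R_i(theta) of task i (assumed known);
    losses[i] is the list of per-sample losses l_i(theta, x_{i,j}).
    """
    M = sum(len(task_losses) for task_losses in losses)
    gen = 0.0
    gen_s = 0.0
    for r_i, task_losses in zip(pop_risks, losses):
        m_i = len(task_losses)
        deviations = sum(r_i - loss for loss in task_losses)
        gen += deviations / m_i   # equation (3): each task weighted equally
        gen_s += deviations / M   # equation (4): each sample weighted equally
    return gen, gen_s

# Two hypothetical tasks with very different sample sizes.
pop_risks = [0.5, 2.0]
losses = [[0.4, 0.6, 0.2, 0.4], [1.0]]
gen, gen_s = generalisation_gaps(pop_risks, losses)
```

In this toy case the single\-sample task dominates the total gap, while the sample\-weighted gap discounts it by its share $m_{i}/M$ of the data\.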

### 2\.2 PAC\-Bayes bounds for regression

Recent advances Alquier and Guedj \([2018](https://arxiv.org/html/2605.26341#bib.bib14)\); Haddouche and Guedj \([2023](https://arxiv.org/html/2605.26341#bib.bib40)\); Haddouche et al\. \([2021](https://arxiv.org/html/2605.26341#bib.bib10)\) in PAC‑Bayesian theory have removed a key restriction of classical generalisation guarantees by extending them to regression settings with unbounded losses, enabling PAC‑Bayes analyses for a broader class of loss functions\. These developments make the bounds finite even when individual losses are unbounded, thereby widening the theory’s applicability\.

### Bounds with KL divergence

A line of work builds on the variational representation of the Kullback–Leibler \(KL\) divergence to obtain PAC‑Bayes bounds that accommodate heavier tails\. Alquier et al\. \([2016](https://arxiv.org/html/2605.26341#bib.bib11)\) present an oracle bound with a complexity term $\frac{1}{\lambda m}\left[\textrm{KL}(\rho\|\pi)+\ln\frac{\Psi_{\pi,D}(\lambda)}{\delta}\right]$ for a fixed $\lambda$, where $\rho$ and $\pi$ are respectively the posterior and prior distributions belonging to $\mathcal{M}(\Theta)$, the set of probability measures over the hypothesis class $\Theta$, and $\Psi_{\pi,D}(\lambda):=\mathbb{E}_{\bm{\theta}\sim\pi}\,\mathbb{E}_{S\sim D^{m}}\left[\exp\big(\lambda m(\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta}))\big)\right]$\. In the following, we assume $\rho$ to be absolutely continuous with respect to $\pi$, and $\pi$ to be independent of the training data used to learn $\rho$\. To turn this oracle‑style statement into practical data‑dependent bounds, one must control the moment\-generating behavior of the loss\. Prior works obtain such control under various tail assumptions Alquier and Guedj \([2018](https://arxiv.org/html/2605.26341#bib.bib14)\); Catoni \([2004](https://arxiv.org/html/2605.26341#bib.bib13)\); Germain et al\. \([2016](https://arxiv.org/html/2605.26341#bib.bib15)\), or by assuming a bounded CGF Rodríguez\-Gálvez et al\. \([2024](https://arxiv.org/html/2605.26341#bib.bib16)\)\. An alternative is to impose assumptions on the data distribution itself, often taken to be Gaussian Guo et al\. \([2025](https://arxiv.org/html/2605.26341#bib.bib18)\); Shalaeva et al\. \([2019](https://arxiv.org/html/2605.26341#bib.bib17)\)\.

A practical complication is that many of these bounds depend on a tuning parameter $\lambda$ that must be fixed independently of the training data\. Early solutions grid over $\lambda$ and apply a union bound, but this does not guarantee an optimal choice of $\lambda$\. More recent work Casado et al\. \([2024](https://arxiv.org/html/2605.26341#bib.bib4)\) uses a Cramér–Chernoff argument together with the generalised inverse to produce bounds that hold uniformly over $\lambda>0$\. The analysis is centered on the CGF $\Lambda_{\bm{\theta}}(\lambda):=\ln\mathbb{E}_{\bm{\mathrm{x}}\sim D}\big[e^{\lambda(\mathcal{R}(\bm{\theta})-\ell(\bm{\theta},\bm{\mathrm{x}}))}\big]$, and yields a complexity $\inf_{\lambda}\left[\frac{\textrm{KL}(\rho\|\pi)+\ln\frac{m}{\delta}}{\lambda(m-1)}+\frac{\mathbb{E}_{\bm{\theta}\sim\rho}[\Lambda_{\bm{\theta}}(\lambda)]}{\lambda}\right]$, hence enabling us to conveniently solve for the optimal $\lambda$\.
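As a small illustration of why a bound that holds uniformly over $\lambda$ is convenient, the sketch below minimises the complexity term above in closed form and checks the result against a grid\. It assumes a sub\-Gaussian proxy $\Lambda_{\bm{\theta}}(\lambda)\leq\sigma^{2}\lambda^{2}/2$ with a known, hypothetical constant `sigma2`; this proxy and all numerical values are assumptions of the example, not quantities from the paper\.

```python
import math

def chernoff_complexity(lam, kl, m, delta, sigma2):
    """Complexity term for a fixed lambda, with the CGF replaced by the
    sub-Gaussian proxy sigma2 * lam**2 / 2 (an assumption of this sketch)."""
    a = (kl + math.log(m / delta)) / (m - 1)
    return a / lam + sigma2 * lam / 2.0

def optimal_lambda(kl, m, delta, sigma2):
    """Minimiser of a / lam + sigma2 * lam / 2 over lam > 0."""
    a = (kl + math.log(m / delta)) / (m - 1)
    return math.sqrt(2.0 * a / sigma2)

# Hypothetical values for the KL term, sample size, confidence and proxy.
kl, m, delta, sigma2 = 5.0, 1000, 0.05, 0.25
lam_star = optimal_lambda(kl, m, delta, sigma2)
best = chernoff_complexity(lam_star, kl, m, delta, sigma2)

# A coarse grid search should never beat the closed-form minimiser.
grid = [0.01 * k for k in range(1, 1001)]
grid_best = min(chernoff_complexity(lam, kl, m, delta, sigma2) for lam in grid)
```

Under this proxy the minimum equals $\sqrt{2\sigma^{2}(\textrm{KL}(\rho\|\pi)+\ln\frac{m}{\delta})/(m-1)}$, so no grid over $\lambda$ and no associated union bound are needed\.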

Following prior work, Casado et al\. \([2024](https://arxiv.org/html/2605.26341#bib.bib4)\) introduce model‑dependent assumptions that allow one to solve for the $\lambda$ minimising the above complexity term and obtain explicit bounds\. As an example, they consider regularisation based on the model’s input gradient under a log‑Sobolev type condition, $\Lambda_{\bm{\theta}}(\lambda)\leq\frac{C}{2}\lambda^{2}\|\nabla_{\bm{\mathrm{x}}}\ell\|_{2}^{2}$, which produces a bound with a *multiplicative complexity term*: the KL divergence is scaled by a hypothesis‑dependent factor derived from the gradient norm\.

### Beyond KL divergence

The variational KL representation is not the only route to PAC‑Bayes change‑of‑measure inequalities\. Several recent works develop bounds using alternative divergences: Rényi divergences Bégin et al\. \([2016](https://arxiv.org/html/2605.26341#bib.bib20)\), $f$\-divergences Alquier and Guedj \([2018](https://arxiv.org/html/2605.26341#bib.bib14)\); Ohnishi and Honorio \([2021](https://arxiv.org/html/2605.26341#bib.bib6)\); Picard\-Weibel and Guedj \([2022](https://arxiv.org/html/2605.26341#bib.bib12)\), or Wasserstein distance Viallard et al\. \([2023](https://arxiv.org/html/2605.26341#bib.bib19)\), among others\. The work of Ohnishi and Honorio \([2021](https://arxiv.org/html/2605.26341#bib.bib6)\) introduces several change‑of‑measure tools and applies them to different loss classes\. One consequence is a multiplicative‑complexity bound on the second\-order moment of the loss as follows\.

###### Theorem \(generalised from Ohnishi and Honorio \([2021](https://arxiv.org/html/2605.26341#bib.bib6)\)\)\.

For any confidence level $\delta\in(0,1)$, any prior $\pi\in\mathcal{M}(\Theta)$, and any posterior $\rho\in\mathcal{M}(\Theta)$, with probability at least $1-\delta$ we have:

$$\mathcal{R}(\rho)\leq\hat{\mathcal{R}}(\rho)+\sqrt{\frac{\bar{D}\,\mathbb{E}_{\bm{\theta}\sim\pi}\mathbb{E}_{S\sim D^{m}}\left[\big(\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta})\big)^{2}\right]}{\delta}}, \tag{5}$$

where $\bar{D}=\chi^{2}(\rho\|\pi)+1$, $\mathcal{R}(\rho)=\mathbb{E}_{\bm{\theta}\sim\rho}\big[\mathcal{R}(\bm{\theta})\big]$ and $\hat{\mathcal{R}}(\rho)=\mathbb{E}_{\bm{\theta}\sim\rho}\big[\hat{\mathcal{R}}(\bm{\theta})\big]$\. The proof can be found in Appendix [A\.2](https://arxiv.org/html/2605.26341#A1.SS2)\. This bound gives a complexity that scales with the variance of the loss function, at the cost of a scaling factor $\frac{\bar{D}}{\delta}$\. It can thus be used to directly derive a bound under a bounded loss‑variance assumption or, as we show later, when combined with the Poincaré assumption, to obtain a bound involving the input gradient $\|\nabla_{\bm{\mathrm{x}}}\ell\|_{2}^{2}$\.
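To give a feel for the magnitudes in \(5\), the following sketch evaluates its right\-hand side under a bounded loss\-variance assumption\. It uses the standard fact that, for a fixed $\bm{\theta}$ and $m$ i\.i\.d\. samples, $\mathbb{E}_{S}[(\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta}))^{2}]$ equals the loss variance divided by $m$\. The function name and all numerical inputs are hypothetical\.

```python
import math

def chi2_bound(emp_risk, chi2, prior_loss_variance, m, delta):
    """Right-hand side of the chi-squared bound (5).

    prior_loss_variance stands for E_{theta ~ pi} Var_x[l(theta, x)], a
    quantity assumed known or upper bounded in this sketch. For i.i.d.
    samples, E_S[(R - R_hat)^2] equals Var_x[l] / m for each fixed theta.
    """
    second_moment = prior_loss_variance / m
    d_bar = chi2 + 1.0
    return emp_risk + math.sqrt(d_bar * second_moment / delta)

# Hypothetical inputs: empirical risk 0.10, chi-squared divergence 3,
# prior-averaged loss variance 0.5, 2000 samples, confidence 95%.
bound = chi2_bound(emp_risk=0.10, chi2=3.0, prior_loss_variance=0.5,
                   m=2000, delta=0.05)
```

The dependence on $\delta$ is polynomial \($1/\sqrt{\delta}$\) rather than logarithmic, which is the price of the scaling factor $\bar{D}/\delta$ mentioned above\.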

## 3 PAC\-Bayes bounds with input gradient\-dependent complexity

Although the aforementioned bounds can be applied naively to each individual loss term in PIML and then combined via a union bound, doing so yields loose guarantees\. First, minimising the bound for each loss independently typically produces different optimal values of $\lambda$, so $\sum_{i}\inf_{\lambda}g_{i}(\lambda)\geq\inf_{\lambda}\sum_{i}g_{i}(\lambda)$\. Second, the union bound forces a sum of KL terms and also amplifies the penalty associated with the confidence parameter $\delta$\. To alleviate this looseness, we adopt a multi‑task learning perspective introduced by Zakerinia and Lampert \([2025](https://arxiv.org/html/2605.26341#bib.bib1)\)\. Specifically, the PIML problem can be regarded as an extreme multi‑task instance: all tasks share the same learnable architecture and differ only in fixed components used to compute each loss, with additional operators $\mathcal{D},\mathcal{I},\mathcal{B}$ forming the PDE/IC/BC residuals\. Then, we can obtain substantially tighter bounds by treating the losses as a single composite risk and exploiting the following i\.i\.d\. assumption over the collection of training sets\.

###### Assumption 3\.1 \(i\.i\.d\. sampling across tasks and closure under losses\)\.

Let $\{S_{i}\}_{i=1}^{N_{L}}$ denote the collection of training datasets, where each dataset is given by $S_{i}=\{\bm{\mathrm{x}}_{i,j}\}_{j=1}^{m_{i}}$\. We assume that, for each $i\in\{1,\dots,N_{L}\}$, the samples $\{\bm{\mathrm{x}}_{i,j}\}_{j=1}^{m_{i}}$ are drawn independently and identically distributed according to an underlying distribution $D_{i}$, i\.e\., $\bm{\mathrm{x}}_{i,j}\overset{\mathrm{i.i.d.}}{\sim}D_{i}$\. Moreover, the datasets $\{S_{i}\}_{i=1}^{N_{L}}$ are mutually independent across tasks\. Finally, for any measurable loss function $\ell_{i}$ applied independently to each sample $\bm{\mathrm{x}}_{i,j}$, the transformed variables $\{\ell_{i}(\bm{\theta},\bm{\mathrm{x}}_{i,j})\}_{j=1}^{m_{i}}$ remain i\.i\.d\. within each task\.

Below we derive a generic sample\-weighted PAC\-Bayes\-Chernoff bound for PIML\.

###### Theorem\.

Under Assumption [3\.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1), for any confidence level $\delta\in(0,1)$, any prior $\pi\in\mathcal{M}(\Theta)$, any posterior $\rho\in\mathcal{M}(\Theta)$, and any $\lambda>0$, with probability at least $1-\delta$ we have:

gen𝒮\(ρ,S\)≤infλ\{KL\(ρ∥π\)\+ln⁡Mδλ\(M−1\)\+∑i=1NLmi𝔼ρ\[Λ𝜽\(i\)\(λ\)\]Mλ\}\.\\mathrm\{gen\}^\{\\mathcal\{S\}\}\(\\rho,S\)\\leq\\inf\_\{\\lambda\}\\left\\\{\\frac\{\\textrm\{KL\}\(\\rho\\\|\\pi\)\+\\ln\\frac\{M\}\{\\delta\}\}\{\\lambda\(M\-1\)\}\+\\frac\{\\sum\_\{i=1\}^\{N\_\{L\}\}m\_\{i\}\\operatorname\*\{\\mathbb\{E\}\}\_\{\\rho\}\[\\Lambda\_\{\\bm\{\\theta\}\}^\{\(i\)\}\(\\lambda\)\]\}\{M\\lambda\}\\right\\\}\.\(6\)The proof of[˜3\.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1)can be found in Appendix[A\.2](https://arxiv.org/html/2605.26341#A1.SS2)\. This result is central to our approach\. Unlike standard PAC\-Bayes analyses that treat each loss independently and combine guarantees via union bounds,[˜3\.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1)leverages the joint structure of PIML to control all tasks simultaneously\. This leads to tighter bounds by avoiding both the duplication of complexity terms and suboptimal choices of tuning parameters across tasks\. It holds for allλ\>0\\lambda\>0and can be optimised overλ\\lambda; performing this optimisation incurs aln⁡M\\ln Mpenalty that is applied jointly to all losses\. Note that the complexity term obtained by applying a union bound toCasadoet al\.\([2024](https://arxiv.org/html/2605.26341#bib.bib4)\)is∑i=1NLinfλ\{KL\(ρ\|\|π\)\+lnNLmiδλ\(M−1mi\)\+mi𝔼Q\[Λ𝜽\(i\)\(λ\)\]λM\}\\sum\_\{i=1\}^\{N\_\{L\}\}\\inf\_\{\\lambda\}\\\{\\frac\{\\textrm\{KL\}\(\\rho\|\|\\pi\)\+\\ln\\frac\{N\_\{L\}m\_\{i\}\}\{\\delta\}\}\{\\lambda\(M\-\\frac\{1\}\{m\_\{i\}\}\)\}\+\\frac\{m\_\{i\}\\operatorname\*\{\\mathbb\{E\}\}\_\{Q\}\[\\Lambda\_\{\\bm\{\\theta\}\}^\{\(i\)\}\(\\lambda\)\]\}\{\\lambda M\}\\\}\. Even without invoking the sum of individual infima, this comparison already shows that our multi‑task formulation leaves the KL term and the confidence term unamplified, thereby producing a strictly tighter bound\. 
Theorem[3\.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1)will be used later on, to derive bounds for both PIML loss and gradient\.
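To make the comparison above concrete, the following sketch evaluates both expressions numerically. It rests on assumptions that are not part of the paper: $M$ is taken to be the total sample size $\sum_i m_i$, and each CGF is given a hypothetical sub-Gaussian form $\Lambda^{(i)}(\lambda)=\lambda^2 s_i/2$, so that each infimum has the closed form $\inf_{\lambda>0}(a/\lambda+b\lambda)=2\sqrt{ab}$. The function names are illustrative.

```python
import math

def joint_bound(kl, delta, m, s2):
    """Right-hand side of Eq. (6), assuming Lambda_i(lam) = lam^2 * s2[i] / 2."""
    M = sum(m)  # assumption: M is the total number of samples
    a = (kl + math.log(M / delta)) / (M - 1)
    b = sum(mi * si for mi, si in zip(m, s2)) / (2 * M)
    # inf over lam > 0 of a/lam + b*lam is attained at lam = sqrt(a/b)
    return 2 * math.sqrt(a * b)

def union_bound(kl, delta, m, s2):
    """Sum of per-loss infima, following the union-bound expression in the text."""
    M, n_losses = sum(m), len(m)
    total = 0.0
    for mi, si in zip(m, s2):
        a = (kl + math.log(n_losses * mi / delta)) / (M - 1 / mi)
        b = mi * si / (2 * M)
        total += 2 * math.sqrt(a * b)
    return total

joint = joint_bound(5.0, 0.05, [1000, 1000, 1000], [1.0, 1.0, 1.0])
union = union_bound(5.0, 0.05, [1000, 1000, 1000], [1.0, 1.0, 1.0])
print(joint, union, union / joint)
```

In this balanced toy setting with three losses, the union expression is larger by a factor close to $\sqrt{3}$, which illustrates the cost of repeating the KL and confidence terms once per loss.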

### 3\.1Bounds from Sobolev assumption

To make the above bound more interpretable, we need to control $\Lambda_{\bm{\theta}}^{(i)}(\lambda)$ by a quantity that reflects properties of the target physical function, such as its regularity (smoothness) over the input domain. Concretely, we assume that the underlying measure on the Sobolev space satisfies a *$\Phi$-Sobolev inequality* (Chafaï ([2004](https://arxiv.org/html/2605.26341#bib.bib41)), Bakry et al. ([2014](https://arxiv.org/html/2605.26341#bib.bib43))). Specifically, for sufficiently smooth $f\in H^{1}(\Omega)$, the $\Phi$-entropy of $f$ is bounded by a constant multiple of $\|\nabla f\|_{L^{2}}^{2}$, which provides a flexible framework for capturing different regimes of regularity and concentration. In our setting, this assumption acts as a smoothness condition on the function class: it penalises irregular behaviour and enables refined stability and concentration guarantees.

###### Assumption 3\.2\(Sobolev\-smoothness of the PIML model\)\.

Let $\Omega\subset\mathbb{R}^{d_i}$ be a convex, bounded domain and let $\{D_i\}_{i=1}^{N_L}$ denote the underlying data distribution on $\Omega$. We say that the PIML model class $u_{\bm{\theta}}(\bm{\mathrm{x}})$ satisfies the Sobolev-smoothness assumption for all $\bm{\theta}$ if, for every loss term indexed by $i$, there exists $C_{S,i}>0$ such that

$$\Lambda_{\bm{\theta}}^{(i)}(\lambda)\leq\frac{C_{S,i}}{2}\lambda^{2}\,\mathbb{E}_{\bm{\mathrm{x}}\sim D}\Big[\|\nabla_{\bm{\mathrm{x}}}\ell_{i}(\bm{\theta},\bm{\mathrm{x}})\|_{2}^{2}\Big].$$

Assumption [3.2](https://arxiv.org/html/2605.26341#S3.Thmtheorem2) implies that the model $u_{\bm{\theta}}(\bm{\mathrm{x}})$ is sufficiently smooth for the data loss and all physical residual losses to be smooth in the sense of the $\Phi$-Sobolev inequality. It can be used to derive an oracle bound, as shown in Lemma [A.3](https://arxiv.org/html/2605.26341#A1.SS3) of Appendix [A.3](https://arxiv.org/html/2605.26341#A1.SS3). We can make this bound empirical with the following assumption, which bounds the input-gradients of the losses.
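As a rough illustration of how an oracle bound can arise, one can substitute the Sobolev-smoothness inequality into Eq. (6) and optimise over $\lambda$. The derivation below is only a sketch: the shorthand $G_i$ is ours and does not appear in the paper, and the exact statement of Lemma A.3 may differ.

```latex
% Shorthand (not in the paper):
%   G_i := E_{theta ~ rho} E_{x ~ D} || grad_x ell_i(theta, x) ||_2^2
\begin{align*}
\mathrm{gen}^{\mathcal{S}}(\rho,S)
 &\le \inf_{\lambda>0}\left\{
      \frac{\mathrm{KL}(\rho\|\pi)+\ln\frac{M}{\delta}}{\lambda(M-1)}
      +\frac{\lambda}{2M}\sum_{i=1}^{N_L} m_i\,C_{S,i}\,G_i\right\} \\
 &= \sqrt{\frac{2\left(\mathrm{KL}(\rho\|\pi)+\ln\frac{M}{\delta}\right)
      \sum_{i=1}^{N_L} m_i\,C_{S,i}\,G_i}{M(M-1)}},
\end{align*}
% using inf_{lambda>0} (a/lambda + b*lambda) = 2*sqrt(a*b),
% attained at lambda = sqrt(a/b).
```

The square-root dependence on the product of the KL term and the weighted gradient norms is the structure that reappears in the empirical bounds of this section.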

###### Assumption 3\.3\(Input\-data Lipschitz\)\.

For any $\bm{\theta}\in\Theta$, any $\bm{\mathrm{x}}\in\mathcal{X}$ and every task $i$, we assume that the input-gradient is bounded, i.e., there exists $L_{i}>0$ such that $\|\nabla_{\bm{\mathrm{x}}}\ell_{i}(\bm{\theta},\bm{\mathrm{x}})\|_{2}^{2}\leq L_{i}$.

Assumption [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3) is generally satisfied by standard PINNs under Sobolev smoothness. Bounded gradient norms are in particular sub-Gaussian, which allows us to control the CGF of the gradient norms. Combining this with Theorem [3.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1), we obtain the following empirical bound.

###### Theorem 3.3 (PAC-Bayes-Sobolev bound).

Under Assumptions [3.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1), [3.2](https://arxiv.org/html/2605.26341#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3), for any confidence level $\delta\in(0,1)$, any prior $\pi\in\mathcal{M}(\Theta)$ and any posterior $\rho\in\mathcal{M}(\Theta)$, with probability at least $1-\delta$ we have:

$$\mathrm{gen}^{\mathcal{S}}(\rho,S)\leq\frac{1}{M}\sqrt{2\,\|\hat{\nabla}_{\mathrm{PIML}_{S}}(\rho)\|_{2}^{2}\,K(\rho,\pi,\delta)+L_{\mathrm{PIML}_{S}}\,K(\rho,\pi,\delta)^{\frac{3}{2}}}, \tag{7}$$

where

- $\|\hat{\nabla}_{\mathrm{PIML}_{S}}(\rho)\|_{2}^{2}:=\sum_{i=1}^{N_L}C_{S,i}\,\mathbb{E}_{\bm{\theta}\sim\rho}\sum_{j=1}^{m_i}\Big[\|\nabla_{\bm{\mathrm{x}}}\ell_{i}(\bm{\theta},\bm{\mathrm{x}}_{i,j})\|_{2}^{2}\Big]$;
- $K(\rho,\pi,\delta):=\frac{M\left(\mathrm{KL}(\rho\|\pi)+\ln\frac{2M}{\delta}\right)}{M-1}$;
- $L_{\mathrm{PIML}_{S}}:=\sqrt{2\sum_{i=1}^{N_L}m_{i}C_{S,i}^{2}L_{i}^{2}}$.

The proof of Theorem [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3) can be found in Appendix [A.3](https://arxiv.org/html/2605.26341#A1.SS3). The bound in Theorem [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3) exhibits a key structural property: the complexity term scales with the input-gradient norms of the losses. This reveals that smoother models, in the sense of smaller input gradients, enjoy tighter generalisation guarantees. In the PIML setting, this provides a formal explanation for the regularising effect of physical constraints, which implicitly enforce smoothness through differential operators. On the one hand, the KL term in $K(\rho,\pi,\delta)$ favours posteriors close to the prior, which can be learned efficiently as long as it stays independent of the posterior training data. On the other hand, the weighted gradient terms show how input-gradient regularisation can help reduce the generalisation gap.
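The right-hand side of Eq. (7) is straightforward to evaluate once the squared input-gradient norms are available. The sketch below is a minimal illustration under stated assumptions: $M$ is taken to be the total sample size, the expectation over $\rho$ is replaced by a single posterior draw, and the function name and argument layout are hypothetical.

```python
import math

def sobolev_bound(grad_sq, c_s, lip, kl, delta):
    """Evaluate the right-hand side of Eq. (7).

    grad_sq[i][j]: squared input-gradient norm of loss i at sample j, for one
                   posterior draw (one-sample Monte Carlo estimate; assumption).
    c_s[i], lip[i]: the constants C_{S,i} and L_i.
    kl: KL(rho || pi); delta: confidence level.
    """
    m = [len(g) for g in grad_sq]
    M = sum(m)  # assumption: M is the total number of samples
    K = M * (kl + math.log(2 * M / delta)) / (M - 1)
    grad_term = sum(c * sum(g) for c, g in zip(c_s, grad_sq))
    l_piml = math.sqrt(2 * sum(mi * (c * l) ** 2 for mi, c, l in zip(m, c_s, lip)))
    return math.sqrt(2 * grad_term * K + l_piml * K ** 1.5) / M

smooth = [[0.01] * 100, [0.02] * 100]  # small input gradients
rough = [[1.0] * 100, [2.0] * 100]     # large input gradients
print(sobolev_bound(smooth, [1.0, 1.0], [1.0, 1.0], kl=5.0, delta=0.05))
print(sobolev_bound(rough, [1.0, 1.0], [1.0, 1.0], kl=5.0, delta=0.05))
```

With all other quantities fixed, the model with smaller input gradients receives the smaller bound, which is the qualitative behaviour described above.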

### 3\.2Bounds from Poincaré assumption

In addition, and as a baseline counterpart to the $\Phi$-Sobolev assumption, we can impose a *Poincaré inequality*, which links directly to the $\chi^{2}$-divergence bound in Theorem [2](https://arxiv.org/html/2605.26341#S2.SSx2). It typically provides a weaker but fundamental form of regularity (Evans ([2010](https://arxiv.org/html/2605.26341#bib.bib42)), Bakry et al. ([2014](https://arxiv.org/html/2605.26341#bib.bib43))). For sufficiently smooth $f\in H^{1}(\Omega)$, this inequality controls the variance of $f$ by its Dirichlet energy, linking global fluctuations to the squared gradient norm, and can be plugged directly into the $\chi^{2}$-divergence bound in Theorem [2](https://arxiv.org/html/2605.26341#S2.SSx2). As in the previous section, it is natural to make the following smoothness assumption.

###### Assumption 3\.4\(Poincaré\-smoothness of the PIML model\)\.

Let $\Omega\subset\mathbb{R}^{d}$ be a convex, bounded domain and let $\{D_i\}_{i=1}^{N_L}$ denote the uniform probability measure on $\Omega$. We say that the PIML model $u_{\bm{\theta}}(\bm{\mathrm{x}})$ satisfies the Poincaré-smoothness assumption if, for every loss term indexed by $i$, there exists $C_{P,i}>0$ such that

$$\mathbb{V}_{\bm{\mathrm{x}}\sim D}[\ell_{i}(\bm{\theta},\bm{\mathrm{x}})]\leq C_{P,i}\,\mathbb{E}_{\bm{\mathrm{x}}\sim D}\|\nabla\ell_{i}(\bm{\theta},\bm{\mathrm{x}})\|_{2}^{2}.$$
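A Poincaré inequality of this form can be checked numerically on a simple domain. The sketch below works on the one-dimensional interval $[0,1]$ with the uniform measure, where the optimal constant is known to be $1/\pi^2$ and is attained by $\cos(\pi x)$. The test functions stand in for $\bm{\mathrm{x}}\mapsto\ell_i(\bm{\theta},\bm{\mathrm{x}})$ and are purely illustrative; the function name is hypothetical.

```python
import math

def poincare_ratio(f, df, n=20000):
    """Var[f] / E[f'^2] under the uniform measure on [0, 1] (midpoint rule)."""
    xs = [(k + 0.5) / n for k in range(n)]
    mean = sum(f(x) for x in xs) / n
    var = sum((f(x) - mean) ** 2 for x in xs) / n
    energy = sum(df(x) ** 2 for x in xs) / n
    return var / energy

# The extremal function cos(pi x) attains the optimal constant 1 / pi^2.
r_cos = poincare_ratio(lambda x: math.cos(math.pi * x),
                       lambda x: -math.pi * math.sin(math.pi * x))
# Any other smooth function gives a ratio no larger than that.
r_sq = poincare_ratio(lambda x: x * x, lambda x: 2 * x)
print(r_cos, r_sq, 1 / math.pi ** 2)
```

For a given loss, the ratio computed this way is a lower estimate of the constant $C_{P,i}$ needed for the inequality to hold for that particular model.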

We now combine the Poincaré-smoothness and Lipschitzness assumptions with Theorem [2](https://arxiv.org/html/2605.26341#S2.SSx2) and Theorem [3.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1) to derive the following two empirical bounds, which correspond to the two versions of the generalisation gap presented in Section [2.1](https://arxiv.org/html/2605.26341#S2.SS1).

###### Theorem 3.4 (PAC-Bayes-Poincaré bounds).

Under Assumptions [3.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1), [3.4](https://arxiv.org/html/2605.26341#S3.Thmtheorem4) and [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3), for any confidence level $\delta\in(0,1)$, any prior $\pi\in\mathcal{M}(\Theta)$, any posterior $\rho\in\mathcal{M}(\Theta)$ and any $\lambda>0$, with probability at least $1-\delta$ we have:

$$\mathrm{gen}(\rho,S)\leq\sqrt{\frac{\left(2\,\|\hat{\nabla}_{\mathrm{PIML}_{P}}(\pi)\|_{2}^{2}+L_{\mathrm{PIML}_{P}}\,K(\delta)^{\frac{1}{2}}\right)\bar{D}}{\delta}}, \tag{8}$$

$$\mathrm{gen}^{\mathcal{S}}(\rho,S)\leq\frac{1}{M}\sqrt{\frac{\left(2\,\|\hat{\nabla}_{\mathrm{PIML}_{P}}^{\mathcal{S}}(\pi)\|_{2}^{2}+L_{\mathrm{PIML}_{P}}^{\mathcal{S}}\,K(\delta)^{\frac{1}{2}}\right)\bar{D}}{\delta}}, \tag{9}$$

where

- $\|\hat{\nabla}_{\mathrm{PIML}_{P}}(\pi)\|_{2}^{2}:=\sum_{i=1}^{N_L}\frac{C_{P,i}}{m_{i}^{2}}\,\mathbb{E}_{\bm{\theta}\sim\pi}\sum_{j=1}^{m_i}\|\nabla\ell_{i}(\bm{\theta},\bm{\mathrm{x}}_{i,j})\|_{2}^{2}$;
- $\|\hat{\nabla}_{\mathrm{PIML}_{P}}^{\mathcal{S}}(\pi)\|_{2}^{2}:=\sum_{i=1}^{N_L}C_{P,i}\,\mathbb{E}_{\bm{\theta}\sim\pi}\sum_{j=1}^{m_i}\|\nabla\ell_{i}(\bm{\theta},\bm{\mathrm{x}}_{i,j})\|_{2}^{2}$;
- $L_{\mathrm{PIML}_{P}}:=\sqrt{2\sum_{i=1}^{N_L}\frac{C_{P,i}^{2}L_{i}^{2}}{m_{i}^{3}}}$;
- $L_{\mathrm{PIML}_{P}}^{\mathcal{S}}:=\sqrt{2\sum_{i=1}^{N_L}m_{i}C_{P,i}^{2}L_{i}^{2}}$;
- $K(\delta):=\frac{M\ln\frac{2M}{\delta}}{M-1}$.

The proof of this theorem can be found in Appendix [A.4](https://arxiv.org/html/2605.26341#A1.SS4). The complexity terms of these bounds have a structure similar to that of the Sobolev-based bound in Theorem [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3). However, the multiplicative term is now scaled by an expected weighted sum of the input gradients under the prior distribution. The two Poincaré bounds are looser than their Sobolev counterpart: the confidence penalty scales as $\frac{1}{\delta}$ in the former versus $\ln\big(\frac{1}{\delta}\big)$ in the latter, and the $\chi^{2}+1$ term grows much faster (e.g., exponentially in the case of Gaussian distributions). As a result, Poincaré bounds are less flexible from an optimisation perspective.
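The difference in growth between the two divergence terms can be seen in the isotropic Gaussian case, where both have closed forms in terms of the squared distance $r^2$ between the posterior and prior means. The sketch below assumes that prior and posterior share the same variance $\sigma^2$; the function names are illustrative.

```python
import math

def kl_gauss(r2, sigma):
    """KL(N(mu_rho, sigma^2 I) || N(mu_pi, sigma^2 I)), r2 = ||mu_rho - mu_pi||^2."""
    return r2 / (2 * sigma ** 2)

def chi2_plus_one_gauss(r2, sigma):
    """chi^2 divergence plus one for the same pair of isotropic Gaussians."""
    return math.exp(r2 / sigma ** 2)

for r2 in (0.0, 1.0, 4.0, 9.0):
    print(r2, kl_gauss(r2, 1.0), chi2_plus_one_gauss(r2, 1.0))
```

The KL term grows linearly in $r^2/\sigma^2$ while the $\chi^2+1$ term grows exponentially, so moving the posterior away from the prior is penalised far more heavily in the Poincaré bounds.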

## 4Experimental evaluation on physics\-informed neural networks

In this section, we show how to instantiate the bounds of the previous section in practice, starting with the estimation of the constants $\{C_{S,i},C_{P,i},L_{i}\}$ from Assumptions [3.2](https://arxiv.org/html/2605.26341#S3.Thmtheorem2), [3.4](https://arxiv.org/html/2605.26341#S3.Thmtheorem4) and [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3).

### 4\.1Refining the assumptions with localisation and clipping

Tighter bounds require finite constants for which the proposed inequalities hold uniformly over the hypothesis class. Empirical observations (Appendix [B.2](https://arxiv.org/html/2605.26341#A2.SS2)) show that, when sampling models at increasing distance $R$ from a well-trained model, the input-gradients of the losses (in particular the PDE and data-fidelity terms) grow rapidly, exceeding $10^{8}$ already at $R=1$. This explosive growth severely loosens the bounds and substantially impairs the posterior's ability to learn.

However, once training has largely converged, models typically operate in a region where all loss components remain small and exhibit limited variability. We empirically observe that the losses are much smaller than $100$, so clipping at that value almost surely leaves the loss value unchanged. Nevertheless, such clipping is highly beneficial when estimating the constants, as it ensures that the losses remain locally bounded in the neighbourhood of a well-trained model, which in turn bounds their variances and stabilises the associated CGF. As a consequence, we can also clip input-gradients while still being able to find constants satisfying the Sobolev and Poincaré assumptions. To avoid unnecessarily loose bounds, we select $L_{i}$ by balancing its trade-off with the Poincaré and Sobolev constants, exploiting their asymptotic proportionality to restrict the search to $C_{P,i}$ and $L_{i}$. We first select candidate pairs minimising $C_{P,i}L_{i}$ and then choose the one with the smallest $C_{P,i}$, while estimating the constants within a radius $R$ around the prior. The radius is chosen so that sampled models remain close to the prior; in practice, $R^{2}\sim(3\sigma)^{2}d_{\theta}$ suffices, where $d_{\theta}$ is the number of learnable parameters of the model. Finally, observe that choosing a sufficiently large clipping threshold leaves the loss values unchanged, while the bounds with clipped values are naturally upper-bounded by their unclipped counterparts. Consequently, the bounds stated in Theorems [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3) and [3.4](https://arxiv.org/html/2605.26341#S3.Thmtheorem4) remain valid when the associated constants are estimated using the above scheme. For improved training behaviour and interpretability, we therefore employ the unclipped bounds during both training and evaluation.
Further details are deferred to Appendix[B\.3](https://arxiv.org/html/2605.26341#A2.SS3)\.
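The selection rule described in this subsection can be sketched as follows. This is a simplified illustration rather than the procedure of Appendix B.3: the empirical Poincaré ratio is computed for a single model (in practice one would take the largest ratio over models sampled within radius $R$ of the prior), and all function names are hypothetical.

```python
import math

def poincare_constant(losses, grad_sq, tau):
    """Empirical Var[loss] / mean(clipped squared gradient norm) for one model."""
    n = len(losses)
    mean = sum(losses) / n
    var = sum((v - mean) ** 2 for v in losses) / n
    energy = sum(min(g, tau) for g in grad_sq) / n  # clip squared norms at tau
    return var / energy

def select_threshold(candidates, k):
    """candidates: list of (L, C_P) pairs. Keep the k pairs with the smallest
    product C_P * L, then return the pair with the smallest C_P among them."""
    shortlist = sorted(candidates, key=lambda p: p[0] * p[1])[:k]
    return min(shortlist, key=lambda p: p[1])

def localisation_radius(sigma, d_theta):
    """Radius R with R^2 = (3 sigma)^2 d_theta, as suggested in the text."""
    return 3 * sigma * math.sqrt(d_theta)

losses = [0.0, 1.0, 2.0, 3.0]
grad_sq = [1.0, 4.0, 9.0, 16.0]
candidates = [(tau, poincare_constant(losses, grad_sq, tau)) for tau in (1.0, 5.0, 20.0)]
best_L, best_C = select_threshold(candidates, k=2)
print(candidates, best_L, best_C)
```

The toy numbers show the trade-off: a smaller clipping threshold lowers $L_i$ but shrinks the clipped gradient energy, which forces a larger $C_{P,i}$.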

### 4\.2Self\-bounding\-aware algorithm

The optimisation of the presented bounds is described in Algorithm [1](https://arxiv.org/html/2605.26341#alg1). It consists of two main steps: 1) learning a prior $\pi$ from the prior training sets, then estimating the constants from the calibration set; and 2) learning the posterior $\rho$ to minimise the bounds via their respective stochastic surrogates.

Step 1. The prior model's parameters $\bm{\theta}_{\pi}$ are obtained after $N_{T}$ iterations of a mini-batch gradient descent algorithm. At each iteration, we randomly sample a set of mini-batches from the prior datasets $S_{prior}$ and update $\bm{\theta}_{\pi}$ by minimising the adaptively weighted risk $\hat{\mathcal{R}}_{\mathrm{PIML}_{\lambda}}(\bm{\theta})$. To ensure that $\bm{\theta}_{\pi}$ achieves good performance, we use the neural tangent kernel (NTK) training scheme of Wang et al. ([2022](https://arxiv.org/html/2605.26341#bib.bib8)). From this learned prior $\bm{\theta}_{\pi}$, we search for the constants of the Sobolev, Poincaré and Lipschitz assumptions (Lines 6-11 of Algorithm [1](https://arxiv.org/html/2605.26341#alg1)). We begin by clipping all losses at $100$, then use a calibration dataset $S_{calib}$ to search for the clipping threshold $L_{i}$ that favours small bounds.

Step 2. Given the learned prior $\pi=\mathcal{N}(\bm{\theta}_{\pi},\sigma^{2}\mathbf{1}_{d_{\bm{\theta}}})$ and the set of estimated constants $\{(C_{S,i},C_{P,i},L_{i})\}_{i=1}^{N_L}$, we initialise the posterior parameter $\bm{\theta}_{\rho}$ from $\bm{\theta}_{\pi}$ and optimise it for $N_{T^{\prime}}$ iterations. At each iteration, we randomly draw a set of mini-batches from $S_{post}$ and sample $\bm{\theta}^{\prime}:=\bm{\theta}_{\rho}+\bm{\epsilon}$ with $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\sigma^{2}\mathbf{1}_{d_{\theta}})$. Note that in a balanced setting (i.e., $m_{i}=m,\;\forall i$), the sample-centric bounds can be rewritten as an equally-weighted bound, and we can thus learn to directly minimise the bound (self-bounding). Otherwise, in the more general scenario with different sample sizes, there is no guarantee that reducing the bound $\mathcal{U}^{\mathcal{S}}(\rho)$ on $\mathcal{R}_{\mathrm{PIML}}^{\mathcal{S}}(\rho)$ will also reduce the bound on the target true risk $\mathcal{R}_{\mathrm{PIML}}(\rho)$. To alleviate this issue, we can instead minimise the following union bounds, which directly control the true risk of each loss (bounding-aware):

$$\sum_{i=1}^{N_L}\left\{\hat{\mathcal{R}}_{i}(\bm{\theta}^{\prime})+\sqrt{2\frac{C_{S,i}}{m_{i}}\sum_{j=1}^{m_i}\Big[\|\nabla_{\bm{\mathrm{x}}}\ell_{i}(\bm{\theta}^{\prime},\bm{\mathrm{x}}_{i,j})\|_{2}^{2}\Big]K_{i}(\bm{\theta}_{\rho})+\sqrt{2}\,L_{i}C_{S,i}K_{i}(\bm{\theta}_{\rho})^{\frac{3}{2}}}\;\right\}, \tag{10}$$

$$\sum_{i=1}^{N_L}\left\{\hat{\mathcal{R}}_{i}(\bm{\theta}^{\prime})+\sqrt{\frac{N_{L}}{\delta}\bigg(2\frac{C_{P,i}}{m_{i}^{2}}\sum_{j=1}^{m_i}\|\nabla\ell_{i}(\pi,\bm{\mathrm{x}}_{i,j})\|_{2}^{2}+\sqrt{2}\,\frac{L_{i}C_{P,i}}{m_{i}}K_{i}(\bm{\theta}_{\pi})^{\frac{1}{2}}\bigg)\exp\left(\frac{r^{2}(\bm{\theta}_{\rho})}{\sigma^{2}}\right)}\right\}, \tag{11}$$

where $\|\nabla\ell_{i}(\pi,\bm{\mathrm{x}}_{i,j})\|_{2}^{2}:=\mathbb{E}_{\bm{\theta}\sim\pi}\|\nabla\ell_{i}(\bm{\theta},\bm{\mathrm{x}}_{i,j})\|_{2}^{2}$, $r^{2}(\bm{\theta}):=\|\bm{\theta}-\bm{\theta}_{\pi}\|_{2}^{2}$ and $K_{i}(\bm{\theta}):=\frac{1}{m_{i}-1}\left(\frac{r^{2}(\bm{\theta})}{2\sigma^{2}}+\ln\frac{2N_{L}m_{i}}{\delta}\right)$.
After learning, we can obtain the final bound𝒰\(ρ\)\\mathcal\{U\}\(\\rho\)by solving an optimisation problem \(Step 2, Line 8 of Algorithm[1](https://arxiv.org/html/2605.26341#alg1)\) with the linear constraint offered by the bounds in Equation[7](https://arxiv.org/html/2605.26341#S3.E7)or[9](https://arxiv.org/html/2605.26341#S3.E9)\. See Appendix[B\.4](https://arxiv.org/html/2605.26341#A2.SS4)for more details\.
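One evaluation of the Sobolev surrogate in Eq. (10) can be sketched as follows. The code is only an illustration under stated assumptions: the empirical risks and squared input-gradient norms are passed in as plain numbers already computed at the perturbed parameters $\bm{\theta}'$, whereas an actual training loop would obtain them from the network with automatic differentiation. All function names are hypothetical.

```python
import math
import random

def perturb(theta_rho, sigma, rng):
    """Draw theta' = theta_rho + eps with eps ~ N(0, sigma^2 I)."""
    return [t + rng.gauss(0.0, sigma) for t in theta_rho]

def k_term(r2, sigma, m_i, n_losses, delta):
    """K_i(theta) as defined below Eq. (11)."""
    return (r2 / (2 * sigma ** 2) + math.log(2 * n_losses * m_i / delta)) / (m_i - 1)

def sobolev_objective(emp_risk, grad_sq, c_s, lip, theta_rho, theta_pi, sigma, delta):
    """Eq. (10), from per-loss statistics computed at theta'.

    emp_risk[i]: empirical risk of loss i; grad_sq[i][j]: squared
    input-gradient norm of loss i at sample j.
    """
    r2 = sum((a - b) ** 2 for a, b in zip(theta_rho, theta_pi))
    n_losses = len(emp_risk)
    total = 0.0
    for risk, g, c, l in zip(emp_risk, grad_sq, c_s, lip):
        m_i = len(g)
        k_i = k_term(r2, sigma, m_i, n_losses, delta)
        inner = 2 * c / m_i * sum(g) * k_i + math.sqrt(2) * l * c * k_i ** 1.5
        total += risk + math.sqrt(inner)
    return total

theta_pi = [0.0, 0.0, 0.0]
stats = dict(emp_risk=[0.1, 0.2], grad_sq=[[0.5] * 50, [0.3] * 40],
             c_s=[1.0, 1.0], lip=[1.0, 1.0], sigma=0.1, delta=0.05)
near = sobolev_objective(theta_rho=theta_pi, theta_pi=theta_pi, **stats)
far = sobolev_objective(theta_rho=[1.0, 1.0, 1.0], theta_pi=theta_pi, **stats)
print(near, far)
```

For fixed loss statistics, the objective grows as the posterior mean moves away from the prior mean, through the $r^2(\bm{\theta}_\rho)/(2\sigma^2)$ term inside $K_i$.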

Algorithm 1: Optimisation of the physics-informed PAC-Bayes bounds

Input: prior datasets $S_{prior}$; calibration datasets $S_{calib}$; posterior datasets $S_{post}$; initial parameters $\bm{\theta}_{0}\in\mathbb{R}^{d_{\bm{\theta}}}$; predefined variance parameter $\sigma$; numbers of iterations $N_{T}$ and $N_{T^{\prime}}$; number of preselected clipping values $k$; number of draws per search $N_{draw}$.

Step 1: prior learning and constant estimation

1. $\bm{\theta}_{\pi}\leftarrow\bm{\theta}_{0}$
2. for $t=1$ to $N_{T}$ do
3. Draw a set of mini-batches from $S_{prior}$
4. $\bm{\theta}_{\pi}\leftarrow$ Update $\bm{\theta}_{\pi}$ with $\hat{\mathcal{R}}_{\mathrm{PIML}_{\lambda}}(\bm{\theta}_{\pi})$
5. end for
6. for loss $i=1$ to $N_{L}$ do
7. $\mathcal{S}_{\mathcal{T}}\leftarrow\{\,\mathcal{C}_{p}(\tau,1,N_{draw})\;|\;\tau\in\mathcal{T}\,\}$
8. $\mathcal{L}\leftarrow\mathrm{Smallest}_{k}(\mathcal{S}_{\mathcal{T}})$
9. $L_{i}\leftarrow\arg\min_{\tau\in\mathcal{L}}\mathcal{C}_{p}(\tau,1,N_{draw})$
10. Select $C_{P,i},C_{S,i}$ from the clipped gradient norm of loss $i$ at $L_{i}$
11. end for

Step 2: bound minimisation

1. $\bm{\theta}_{\rho}\leftarrow\bm{\theta}_{\pi}$
2. for $t=1$ to $N_{T^{\prime}}$ do
3. Draw a set of mini-batches from $S_{post}$
4. Sample a noise $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\sigma^{2}\mathbf{1}_{d_{\theta}})$
5. $\bm{\theta}^{\prime}\leftarrow\bm{\theta}_{\rho}+\bm{\epsilon}$
6. $\bm{\theta}_{\rho}\leftarrow$ Update $\bm{\theta}_{\rho}$ with either the direct surrogates (self-bounding), or Equation [10](https://arxiv.org/html/2605.26341#S4.E10) or [11](https://arxiv.org/html/2605.26341#S4.E11) (bounding-aware)
7. end for
8. (Optional) Compute $\mathcal{U}^{\mathcal{S}}(\rho)$, then solve the optimisation problem to obtain $\mathcal{U}(\rho)$
9. return $\bm{\theta}_{\rho},\;\mathcal{U}(\rho)$

### 4\.3Experiments

In this section, we empirically illustrate that our PAC\-Bayesian framework, with Algorithm[1](https://arxiv.org/html/2605.26341#alg1)above, is able to provide generalisation guarantees with non\-vacuous bounds for the PIML risk\.

Benchmark. For a comprehensive evaluation, we consider three benchmark problems: 1D-Reaction, 1D-Wave, and Convection. For each physical loss, we split the samples into three disjoint sets: 10k prior, 30k posterior, and 20k calibration points. Beyond a baseline with 30k posterior samples and 10k calibration points from observational data, we also study a limited-data regime with only 300 (and as few as 2) observations to reflect practical data scarcity. In these restricted cases, we use all of these samples to train the posterior, and reuse the constants of the initial-condition loss $\ell_{ic}$ for the data loss $\ell_{d}$ (as both are approximation errors). Note that the prior model is trained using only the physics-based loss terms in all settings. See Appendix [B.5](https://arxiv.org/html/2605.26341#A2.SS5) for more details on the implementation.
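The three-way split matters because the prior and the estimated constants must not depend on the data used to train and certify the posterior. A minimal sketch of such a disjoint split is given below, using the sample sizes quoted above; the function name and the use of integer indices in place of collocation points are illustrative assumptions.

```python
import random

def split_disjoint(points, sizes, seed=0):
    """Shuffle once, then cut into consecutive, non-overlapping chunks."""
    if sum(sizes.values()) > len(points):
        raise ValueError("not enough points for the requested split")
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    out, start = {}, 0
    for name, size in sizes.items():
        out[name] = shuffled[start:start + size]
        start += size
    return out

# Indices stand in for the collocation points of one physical loss.
splits = split_disjoint(range(60_000),
                        {"prior": 10_000, "posterior": 30_000, "calibration": 20_000})
print({name: len(chunk) for name, chunk in splits.items()})
```

Because each chunk is a slice of the same shuffled list, no point can appear in more than one set.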

Bounds. We denote the bound obtained via Theorem [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3) as Ours-Sob. For the Poincaré-based bounds, the bound using Inequality [8](https://arxiv.org/html/2605.26341#S3.E8) is denoted Ours-Poi., while the sample-weighted version based on Inequality [9](https://arxiv.org/html/2605.26341#S3.E9) is written as Ours-Poi.$_{\mathcal{S}}$. To our knowledge, no generalisation bounds incorporating priors exist for PIML; we therefore compare against baseline approaches under the same assumptions. In particular, U-Sob. denotes the union bound over individual losses using the Sobolev bound (inspired by Casado et al. ([2024](https://arxiv.org/html/2605.26341#bib.bib4))), and U-Poi. analogously denotes the union of Poincaré bounds.

Table 1: Empirical test risk / generalisation bounds on 1D-Wave with different observation data sample sizes. Here $\ell_{d}\sim 1\mathrm{e}{-4}$, which is very small compared to the total risk $\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$.

| Bound | Ours-Sob. | Ours-Poi. | Ours-Poi.$_{\mathcal{S}}$ | U-Sob. | U-Poi. |
| --- | --- | --- | --- | --- | --- |
| $m_{d}=3\mathrm{e}4$ | 4.22e-1 / **6.76e-1** | 4.41e-1 / 7.32e-1 | 4.41e-1 / 7.32e-1 | 4.25e-1 / 6.80e-1 | 4.47e-1 / 1.12 |
| $m_{d}=300$ | 4.23e-1 / **7.01e-1** | 4.39e-1 / 7.58e-1 | 4.46e-1 / 1.02e-1 | 4.23e-1 / 7.08e-1 | 4.46e-1 / 1.43 |
| $m_{d}=2$ | 4.45e-1 / **1.75** | 4.49e-1 / 2.02 | 4.50e-1 / 4.19 | 4.45e-1 / 1.76 | 4.50e-1 / 4.12 |

Results. We first evaluate the bounds numerically in a vanilla setting with equal sample sizes, obtained by setting $m_{d}=3\times 10^{4}$. Figure [1](https://arxiv.org/html/2605.26341#S4.F1) shows that our bounds provide tighter guarantees than their union-based counterparts, i.e., under $2\times$ the test risk in all cases. Sobolev-based bounds (Ours-Sob. and U-Sob.) are also better than their Poincaré-based counterparts, with Ours-Sob. achieving the smallest values on all problems. Figure [2](https://arxiv.org/html/2605.26341#S4.F2) then depicts the results in a more restricted case where only $m_{d}=300$ observational points are available. While the bounds on 1D-Wave increase only slightly, the effects on 1D-Reaction and Convection are more significant, with the ratio $\mathcal{U}(\rho)/\hat{\mathcal{R}}_{\mathrm{test}}(\rho)\sim 10$. This arises partly from overly pessimistic constants when reusing the constants of $\ell_{ic}$ for $\ell_{d}$, and partly from the significantly larger gradients (less regularity and greater difficulty) exhibited by $\ell_{d}$. In addition, Table [1](https://arxiv.org/html/2605.26341#S4.T1) shows that if the model learns well and achieves a low $\ell_{d}$ (together with a small $\|\nabla_{\bm{\mathrm{x}}}\ell_{d}(\bm{\theta},\bm{\mathrm{x}})\|_{2}^{2}$), it is possible to achieve tight bounds with only a few samples of observational data. Additional results and baselines are available in Appendix [B.6](https://arxiv.org/html/2605.26341#A2.SS6).
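The ratio between each bound and the corresponding test risk can be read off Table 1 directly. The short script below does this for the $m_d=3\mathrm{e}4$ row on 1D-Wave; the values are copied from the table, and the dictionary keys are just labels for the columns.

```python
# (test risk, bound) pairs copied from the m_d = 3e4 row of Table 1 (1D-Wave).
row = {
    "Ours-Sob.": (4.22e-1, 6.76e-1),
    "Ours-Poi.": (4.41e-1, 7.32e-1),
    "Ours-Poi.S": (4.41e-1, 7.32e-1),
    "U-Sob.": (4.25e-1, 6.80e-1),
    "U-Poi.": (4.47e-1, 1.12),
}
ratios = {name: bound / risk for name, (risk, bound) in row.items()}
for name, r in ratios.items():
    print(f"{name}: bound / test risk = {r:.2f}")
```

On this row, the three multi-task bounds stay below twice the test risk, while the union of Poincaré bounds exceeds that factor.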

![Refer to caption](https://arxiv.org/html/2605.26341v1/x1.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x2.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x3.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x4.png)

Figure 1: Test error and generalisation bounds when $m_{d}=30$k.

![Refer to caption](https://arxiv.org/html/2605.26341v1/x5.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x6.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x7.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x8.png)

Figure 2: Test error and generalisation bounds when $m_{d}=300$.

## 5Conclusion

We introduced a PAC-Bayesian framework for physics-informed machine learning that provides generalisation guarantees in regression settings with unbounded losses. By adopting a multi-task perspective, we derived bounds that jointly control data and physics objectives, avoiding the looseness of standard union-bound approaches. Our analysis reveals that the generalisation gap is governed by complexity terms that depend on input gradients, establishing a direct connection between physical regularity and statistical performance. We discuss the limitations of our work in Appendix [C](https://arxiv.org/html/2605.26341#A3).

Beyond the theoretical contributions, we proposed a self\-bounding\-aware learning algorithm and a practical procedure to estimate the required constants, enabling the optimisation of generalisation bounds in realistic settings\. Empirical results on standard PDE benchmarks demonstrate that our bounds are non\-vacuous and can be effectively minimised during training\.

To the best of our knowledge, this paper presents the first PAC-Bayesian framework for PIML, accompanied by an effective algorithm for optimising the proposed bounds. Overall, this work provides a principled statistical foundation for physics-informed learning and opens several directions for future research, including tighter data-dependent analyses, constructions of physics-informed priors, extensions to more complex physical systems, and connections with implicit regularisation in deep learning.

## References

- \[1\]\(2018\)Simpler PAC\-Bayesian bounds for hostile data\.Machine Learning107\(5\)\.External Links:ISSN 1573\-0565,[Link](http://dx.doi.org/10.1007/s10994-017-5690-0),[Document](https://dx.doi.org/10.1007/s10994-017-5690-0)Cited by:[§2\.2](https://arxiv.org/html/2605.26341#S2.SS2.p1.1),[§2](https://arxiv.org/html/2605.26341#S2.SSx1.p1.11),[§2](https://arxiv.org/html/2605.26341#S2.SSx2.p1.1)\.
- \[2\]P\. Alquier, J\. Ridgway, and N\. Chopin\(2016\)On the properties of variational approximations of Gibbs posteriors\.Journal of Machine Learning Research17\(236\),pp\. 1–41\.External Links:[Link](http://jmlr.org/papers/v17/15-290.html)Cited by:[§2](https://arxiv.org/html/2605.26341#S2.SSx1.p1.11)\.
- \[3\]P\. Alquier\(2024\-01\)User\-friendly introduction to PAC\-Bayes bounds\.Foundations and Trends in Machine Learning17\(2\),pp\. 174–303\.Cited by:[Appendix D](https://arxiv.org/html/2605.26341#A4.p2.1),[§1](https://arxiv.org/html/2605.26341#S1.p4.1)\.
- \[4\]D\. Bakry, I\. Gentil, and M\. Ledoux\(2014\)Analysis and geometry of markov diffusion operators\.Springer\.Cited by:[§3\.1](https://arxiv.org/html/2605.26341#S3.SS1.p1.6),[§3\.2](https://arxiv.org/html/2605.26341#S3.SS2.p1.5)\.
- [5] L. Bégin, P. Germain, F. Laviolette, and J. Roy (2016). PAC-Bayesian bounds based on the Rényi divergence. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 51, Cadiz, Spain, 9–11 May, pp. 435–444. [Link](https://proceedings.mlr.press/v51/begin16.html)
- [6] I. Casado, L. A. Ortega, A. Pérez, and A. R. Masegosa (2024). PAC-Bayes-Chernoff bounds for unbounded losses. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=CyzZeND3LB)
- [7] O. Catoni (2004). Statistical learning theory and stochastic optimization. Ecole d'été de probabilités de Saint-Flour XXXI-2001. Springer. Lecture Notes in Mathematics, no. 1851. [Link](https://hal.science/hal-00104952)
- [8] D. Chafaï (Aug 2004). Entropies, convexity, and functional inequalities: on $\Phi$-entropies and $\Phi$-Sobolev inequalities. Journal of Mathematics of Kyoto University 44.
- [9] M. D. Donsker and S. R. S. Varadhan (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics 28(1), pp. 1–47. [Document](https://dx.doi.org/https%3A//doi.org/10.1002/cpa.3160280102), [Link](https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.3160280102), https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpa.3160280102
- [10] N. Doumèche, F. Bach, G. Biau, and C. Boyer (2024). Physics-informed machine learning as a kernel method. In The Thirty Seventh Annual Conference on Learning Theory, June 30 - July 3, 2023, Edmonton, Canada, Proceedings of Machine Learning Research, pp. 1399–1450.
- [11] N. Doumèche, F. Bach, G. Biau, and C. Boyer (2025). Physics-informed kernel learning. J. Mach. Learn. Res. 26, pp. 124:1–124:39. [Link](https://jmlr.org/papers/v26/24-1536.html)
- [12] N. Doumèche, G. Biau, and C. Boyer (2025). On the convergence of PINNs. Bernoulli 31(3), pp. 2127–2151.
- [13] L. C. Evans (2010). Partial differential equations. 2nd edition, American Mathematical Society.
- [14] I. Gat, Y. Adi, A. Schwing, and T. Hazan (2022). On the importance of gradient norm in PAC-Bayesian bounds. In Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=uvE-fQHA4t_)
- [15] P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien (2016). PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, Vol. 29. [Link](https://proceedings.neurips.cc/paper_files/paper/2016/file/84d2004bf28a2095230e8e14993d398d-Paper.pdf)
- [16] P. Germain, A. Lacasse, M. Marchand, S. Shanian, and F. Laviolette (2009). From PAC-Bayes bounds to KL regularization. In Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta (Eds.), Vol. 22. [Link](https://proceedings.neurips.cc/paper_files/paper/2009/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf)
- [17] B. Guedj (2019). A primer on PAC-Bayesian learning. In Proceedings of the second congress of the French Mathematical Society, Vol. 33. [Link](https://arxiv.org/abs/1901.05353)
- [18] R. Guo, R. Jin, X. Li, and Y. Zhou (2025). PAC-Bayes bounds for multivariate linear regression and linear autoencoders. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=S1zkFSby8G)
- [19] M. Haddouche, B. Guedj, O. Rivasplata, and J. Shawe-Taylor (2021). PAC-Bayes unleashed: generalisation bounds with unbounded losses. Entropy 23(10). ISSN 1099-4300. [Link](https://www.mdpi.com/1099-4300/23/10/1330), [Document](https://dx.doi.org/10.3390/e23101330)
- [20] M. Haddouche and B. Guedj (2023). PAC-Bayes generalisation bounds for heavy-tailed losses through supermartingales. Transactions on Machine Learning Research. ISSN 2835-8856. [Link](https://openreview.net/forum?id=qxrwt6F3sf)
- [21] M. Haddouche, P. Viallard, U. Şimşekli, and B. Guedj (Feb 2025). A PAC-Bayesian Link Between Generalisation and Flat Minima. In ALT 2025 - 36th International Conference on Algorithmic Learning Theory, Milan, Italy, pp. 1–31. [Link](https://hal.science/hal-04455639)
- [22] F. Hellström, G. Durisi, B. Guedj, and M. Raginsky (2025). Generalization bounds: perspectives from information theory and PAC-Bayes. Foundations and Trends in Machine Learning.
- [23] M. Holland (2019). PAC-Bayes under potentially heavy tails. In Advances in Neural Information Processing Systems, Vol. 32. [Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/3a20f62a0af1aa152670bab3c602feed-Paper.pdf)
- [24] Y. Jiao, Y. Lai, D. Li, X. Lu, F. Wang, Y. Wang, and J. Z. Yang (2022). A rate of convergence of physics informed neural networks for the linear second order elliptic PDEs. Communications in Computational Physics 31(4), pp. 1272–1295.
- [25] G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang (May 2021). Physics-informed machine learning. Nature Reviews Physics 3(6). ISSN 2522-5820.
- [26] Y. Lu, H. Chen, J. Lu, L. Ying, and J. H. Blanchet (2022). Machine learning for elliptic PDEs: fast rate generalization bound, neural scaling law and minimax optimality. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.
- [27] L. McClenny and U. Braga-Neto (2020). Self-adaptive physics-informed neural networks using a soft attention mechanism. J. Comput. Phys. 474, pp. 111722. [Link](https://api.semanticscholar.org/CorpusID:221586462)
- [28] S. Mishra and R. Molinaro (Apr 2022). Estimates on the generalization error of physics-informed neural networks for approximating a class of inverse problems for PDEs. IMA Journal of Numerical Analysis 42(2), pp. 981–1022.
- [29] Y. Ohnishi and J. Honorio (2021). Novel change of measure inequalities with applications to PAC-Bayesian bounds and Monte Carlo estimation. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 130, 13–15 Apr, pp. 1711–1719. [Link](https://proceedings.mlr.press/v130/ohnishi21a.html)
- [30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. [Link](http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf)
- [31] A. Picard-Weibel and B. Guedj (2022). On change of measure inequalities for f-divergences. ArXiv abs/2202.05568. [Link](https://api.semanticscholar.org/CorpusID:246823569)
- [32] M. Raissi, P. Perdikaris, and G. E. Karniadakis (2019). Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics.
- [33] B. Rodríguez-Gálvez, R. Thobaben, and M. Skoglund (2024). More PAC-Bayes bounds: from bounded losses, to losses with general tail behaviors, to anytime validity. Journal of Machine Learning Research 25(110), pp. 1–43. [Link](http://jmlr.org/papers/v25/23-1360.html)
- [34] S. Rojas, P. Maczuga, J. Muñoz-Matute, D. Pardo, and M. Paszyński (2024). Robust variational physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering 425, pp. 116904. ISSN 0045-7825.
- [35] T. D. Ryck and S. Mishra (2022). Generic bounds on the approximation error for physics-informed (and) operator learning. In Annual Conference on Neural Information Processing Systems.
- [36] A. Scampicchio, L. F. Toso, R. Rickenbach, J. Anderson, and M. Zeilinger (2026). Physics-informed learning under mixing: how physical knowledge speeds up learning. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=IvLVPbeoRx)
- [37] V. Shalaeva, A. F. Esfahani, P. Germain, and M. Petreczky (2019). Improved PAC-Bayesian bounds for linear regression. In AAAI Conference on Artificial Intelligence. [Link](https://api.semanticscholar.org/CorpusID:208857400)
- [38] Y. Shin, Z. Zhang, and G. E. Karniadakis (2023). Error estimates of residual minimization using neural networks for linear PDEs. Journal of Machine Learning for Modeling and Computing 4(4).
- [39] M. Takamoto, T. Praditia, R. Leiteritz, D. MacKinlay, F. Alesiani, D. Pflüger, and M. Niepert (2022). PDEBENCH: An Extensive Benchmark for Scientific Machine Learning. In Advances in Neural Information Processing Systems, Vol. 35, pp. 1596–1611.
- [40] P. Viallard, M. Haddouche, U. Şimşekli, and B. Guedj (2023). Learning via Wasserstein-based high probability generalisation bounds. In Annual Conference on Neural Information Processing Systems (NeurIPS). [Link](https://arxiv.org/abs/2306.04375), [Document](https://dx.doi.org/10.48550/arXiv.2306.04375)
- [41] S. Wang, X. Yu, and P. Perdikaris (2022). When and why PINNs fail to train: a neural tangent kernel perspective. Journal of Computational Physics 449, pp. 110768. ISSN 0021-9991. [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jcp.2021.110768), [Link](https://www.sciencedirect.com/science/article/pii/S002199912100663X)
- [42] X. Xu, Y. Li, and Z. Huang (2025). Refined generalization analysis of the deep Ritz method and physics-informed neural networks. In Proceedings of the International Conference on Machine Learning (ICML).
- [43] H. Zakerinia, A. Behjati, and C. Lampert (2024). More flexible PAC-Bayesian meta-learning by learning learning algorithms. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 235, 21–27 Jul, pp. 58122–58139. [Link](https://proceedings.mlr.press/v235/zakerinia24a.html)
- [44] H. Zakerinia and C. Lampert (2025). Fast rate bounds for multi-task and meta-learning with different sample sizes. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. [Link](https://openreview.net/forum?id=NYcnZaA4vO)

## Appendix A Proofs

### A.1 Auxiliary lemmas

###### Lemma A.1 (General PAC-Bayesian Bound).

For any distribution $D$ on $\mathcal{X}$, any hypothesis set $\Theta$, any prior $\pi\in\mathcal{M}(\Theta)$, any measurable function $\varphi:\Theta\times\mathcal{X}^{m}\rightarrow\mathbb{R}$, and any posterior $\rho\in\mathcal{M}(\Theta)$, we have:

$$\mathbb{P}_{S\sim D^{m}}\left[\mathbb{E}_{\bm{\theta}\sim\rho}\varphi(\bm{\theta},S)\leq\mathrm{KL}(\rho\|\pi)+\ln\left(\frac{1}{\delta}\mathbb{E}_{S'\sim D^{m}}\mathbb{E}_{\bm{\theta}'\sim\pi}e^{\varphi(\bm{\theta}',S')}\right)\right]\geq 1-\delta. \tag{12}$$
###### Proof\.

Applying the Donsker-Varadhan variational formula [[9](https://arxiv.org/html/2605.26341#bib.bib7)], we obtain

$$\mathbb{E}_{\bm{\theta}\sim\rho}\varphi(\bm{\theta},S)\leq\mathrm{KL}(\rho\|\pi)+\ln\left[\mathbb{E}_{\bm{\theta}'\sim\pi}\left(e^{\varphi(\bm{\theta}',S)}\right)\right]. \tag{13}$$

Now applying Markov's inequality to the exponential term, we have:

$$\begin{aligned}
&\mathbb{P}_{S\sim D^{m}}\left[\mathbb{E}_{\bm{\theta}'\sim\pi}\left(e^{\varphi(\bm{\theta}',S)}\right)\leq\frac{1}{\delta}\mathbb{E}_{S'\sim D^{m}}\mathbb{E}_{\bm{\theta}'\sim\pi}e^{\varphi(\bm{\theta}',S')}\right]\geq 1-\delta\\
\Leftrightarrow\;&\mathbb{P}_{S\sim D^{m}}\left[\ln\left(\mathbb{E}_{\bm{\theta}'\sim\pi}\left(e^{\varphi(\bm{\theta}',S)}\right)\right)\leq\ln\left(\frac{1}{\delta}\mathbb{E}_{S'\sim D^{m}}\mathbb{E}_{\bm{\theta}'\sim\pi}e^{\varphi(\bm{\theta}',S')}\right)\right]\geq 1-\delta.
\end{aligned} \tag{14}$$

Combining [14](https://arxiv.org/html/2605.26341#A1.E14) with [13](https://arxiv.org/html/2605.26341#A1.E13) completes the proof. ∎
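
As a sanity check of the Donsker-Varadhan step in Equation 13, the following minimal Python sketch evaluates both sides of the inequality on a finite hypothesis set. The five-element set, the randomly drawn prior, posterior and $\varphi$, and the helper names `kl_divergence` and `donsker_varadhan_sides` are illustrative assumptions and do not come from the paper.

```python
import math
import random


def kl_divergence(rho, pi):
    """KL(rho || pi) for two discrete distributions on the same finite set."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)


def donsker_varadhan_sides(rho, pi, phi):
    """Return (lhs, rhs) of E_rho[phi] <= KL(rho || pi) + ln E_pi[exp(phi)]."""
    lhs = sum(r * f for r, f in zip(rho, phi))
    log_mgf = math.log(sum(p * math.exp(f) for p, f in zip(pi, phi)))
    return lhs, kl_divergence(rho, pi) + log_mgf


def random_distribution(rng, k):
    """Draw a strictly positive probability vector of length k."""
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
pi = random_distribution(rng, 5)                   # hypothetical prior
rho = random_distribution(rng, 5)                  # hypothetical posterior
phi = [rng.uniform(-2.0, 2.0) for _ in range(5)]   # hypothetical phi(theta, S)
lhs, rhs = donsker_varadhan_sides(rho, pi, phi)

# The Gibbs posterior rho*(theta) proportional to pi(theta) * exp(phi(theta))
# attains equality in the variational formula.
z = sum(p * math.exp(f) for p, f in zip(pi, phi))
gibbs = [p * math.exp(f) / z for p, f in zip(pi, phi)]
gibbs_lhs, gibbs_rhs = donsker_varadhan_sides(gibbs, pi, phi)
```

For an arbitrary posterior the left-hand side stays below the right-hand side, while the Gibbs posterior closes the gap, which is why the formula is tight.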

###### Lemma A.2 ([[29](https://arxiv.org/html/2605.26341#bib.bib6)], Lemma 9).

For any prior $\pi\in\mathcal{M}(\Theta)$, any measurable function $\varphi:\Theta\times\mathcal{X}^{m}\rightarrow\mathbb{R}$, and any posterior $\rho\in\mathcal{M}(\Theta)$, we have

$$\mathbb{E}_{\bm{\theta}\sim\rho}[\varphi]\leq\sqrt{\left(\chi^{2}(\rho\|\pi)+1\right)\mathbb{E}_{\bm{\theta}\sim\pi}[\varphi^{2}]}. \tag{15}$$
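
The inequality in Equation 15 can be checked numerically on discrete distributions, where it reduces to the Cauchy-Schwarz inequality and $\chi^{2}(\rho\|\pi)+1=\sum_{k}\rho_{k}^{2}/\pi_{k}$. The sketch below is only an illustration under that finite-support assumption; the function names and the random test cases are hypothetical.

```python
import math
import random


def chi2_plus_one(rho, pi):
    """chi^2(rho || pi) + 1, which equals sum_k rho_k^2 / pi_k in the discrete case."""
    return sum(r * r / p for r, p in zip(rho, pi))


def chi2_change_of_measure(rho, pi, phi):
    """Return (lhs, rhs) of E_rho[phi] <= sqrt((chi^2 + 1) * E_pi[phi^2])."""
    lhs = sum(r * f for r, f in zip(rho, phi))
    second_moment = sum(p * f * f for p, f in zip(pi, phi))
    return lhs, math.sqrt(chi2_plus_one(rho, pi) * second_moment)


def random_distribution(rng, k):
    """Draw a strictly positive probability vector of length k."""
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(1)
results = []
for _ in range(100):
    pi = random_distribution(rng, 6)
    rho = random_distribution(rng, 6)
    phi = [abs(rng.gauss(0.0, 1.0)) for _ in range(6)]  # e.g. |R - R_hat| >= 0
    results.append(chi2_change_of_measure(rho, pi, phi))
```

Every pair in `results` satisfies `lhs <= rhs`; when `rho == pi` the factor `chi2_plus_one` equals 1 and the bound becomes Jensen's inequality for the square.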

###### Proposition A.3 (Multi-task Cramér-Chernoff).

Define the sample-weighted cumulant generating function (CGF) as $\Lambda_{\bm{\theta}}^{\mathcal{S}}(\lambda)=\frac{1}{M}\sum_{i=1}^{N_{L}}m_{i}\Lambda_{\bm{\theta}}^{(i)}(\lambda)$, where $\Lambda_{\bm{\theta}}^{(i)}$ is the CGF of the $i$-th task. For any $\bm{\theta}\in\Theta$ and $a\in\mathbb{R}$,

$$\mathbb{P}_{S}\left[\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S)\geq a\right]\leq e^{-M\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(a)}, \tag{16}$$

where $\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(a)=\sup_{\lambda}\{\lambda a-\Lambda_{\bm{\theta}}^{\mathcal{S}}(\lambda)\}$ is the Cramér transform of the sample-weighted CGF above.

###### Proof\.

Applying Markov's inequality and using Assumption [3.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1), we have, for any $\lambda>0$,

$$\begin{aligned}
\mathbb{P}_{S}\Big[\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S)\geq a\Big]
&=\mathbb{P}_{S}\Big[e^{M\lambda\,\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S)}\geq e^{M\lambda a}\Big]\\
&\leq e^{-M\lambda a}\,\mathbb{E}_{S}\Big[e^{M\lambda\,\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S)}\Big]\\
&=e^{-M\lambda a}\,\mathbb{E}_{S}\left[\exp\left(\lambda\sum_{i=1}^{N_{L}}\sum_{j=1}^{m_{i}}\left(\mathcal{R}_{i}(\bm{\theta})-\ell_{i}(\bm{\theta},\mathbf{x}_{i,j})\right)\right)\right]\\
&=e^{-M\lambda a}\,\mathbb{E}_{S}\left[\prod_{i=1}^{N_{L}}\prod_{j=1}^{m_{i}}\exp\left(\lambda\left(\mathcal{R}_{i}(\bm{\theta})-\ell_{i}(\bm{\theta},\mathbf{x}_{i,j})\right)\right)\right]\\
&=e^{-M\lambda a}\left[\prod_{i=1}^{N_{L}}\prod_{j=1}^{m_{i}}\mathbb{E}_{\mathbf{x}_{i,j}\sim\mathcal{D}_{i}}\exp\left(\lambda\left(\mathcal{R}_{i}(\bm{\theta})-\ell_{i}(\bm{\theta},\mathbf{x}_{i,j})\right)\right)\right]\\
&=e^{-M\lambda a}\left[\prod_{i=1}^{N_{L}}\prod_{j=1}^{m_{i}}e^{\Lambda_{\bm{\theta}}^{(i)}(\lambda)}\right]=e^{-M\lambda a}\,e^{\sum_{i=1}^{N_{L}}m_{i}\Lambda_{\bm{\theta}}^{(i)}(\lambda)}\\
&=e^{-M(\lambda a-\Lambda_{\bm{\theta}}^{\mathcal{S}}(\lambda))}.
\end{aligned}$$

Since this holds for all $\lambda>0$, we have

$$\begin{aligned}
\mathbb{P}_{S}\Big[\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S)\geq a\Big]
&\leq\inf_{\lambda}e^{-M(\lambda a-\Lambda_{\bm{\theta}}^{\mathcal{S}}(\lambda))}\\
&=e^{-M\sup_{\lambda}(\lambda a-\Lambda_{\bm{\theta}}^{\mathcal{S}}(\lambda))}=e^{-M\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(a)},
\end{aligned}$$

where the last equality comes from the definition of the Cramér transform. This completes the proof. ∎
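
To make the Cramér transform in Proposition A.3 concrete, the sketch below evaluates it for a hypothetical setting in which each task CGF is sub-Gaussian, $\Lambda_{\bm{\theta}}^{(i)}(\lambda)=\sigma_{i}^{2}\lambda^{2}/2$. Under that assumption the sample-weighted CGF is again quadratic and the transform has the closed form $a^{2}/(2\bar{\sigma}^{2})$ with $\bar{\sigma}^{2}=\frac{1}{M}\sum_{i}m_{i}\sigma_{i}^{2}$. The task sizes, the $\sigma_{i}$ values and the grid search are illustrative choices, not the paper's procedure.

```python
import math


def weighted_cgf(lam, sizes, sigmas):
    """Sample-weighted CGF (1/M) * sum_i m_i * sigma_i^2 * lam^2 / 2."""
    M = sum(sizes)
    return sum(m * 0.5 * (s * lam) ** 2 for m, s in zip(sizes, sigmas)) / M


def cramer_transform(a, sizes, sigmas, lam_max=10.0, steps=20_000):
    """Grid approximation of sup_{0 <= lam <= lam_max} (lam * a - weighted CGF)."""
    best = 0.0  # the value at lam = 0
    for k in range(1, steps + 1):
        lam = lam_max * k / steps
        best = max(best, lam * a - weighted_cgf(lam, sizes, sigmas))
    return best


# Hypothetical tasks: data fit, PDE residual, boundary condition.
sizes = [100, 400, 50]      # m_i
sigmas = [1.0, 0.5, 2.0]    # assumed sub-Gaussian parameters
a = 0.3
M = sum(sizes)

numeric = cramer_transform(a, sizes, sigmas)
sigma_bar_sq = sum(m * s ** 2 for m, s in zip(sizes, sigmas)) / M
closed_form = a ** 2 / (2.0 * sigma_bar_sq)
tail_bound = math.exp(-M * numeric)  # right-hand side of Equation 16
```

The grid value agrees with the closed form, and `tail_bound` decays exponentially in the total sample size `M` rather than in any single `m_i`.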

### A.2 Proofs of oracle bounds

###### Proof.

Applying Lemma [A.2](https://arxiv.org/html/2605.26341#A1.Thmtheorem2) with $\varphi=|\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta})|$ and combining it with Jensen's inequality, we have

$$\left|\mathbb{E}_{\bm{\theta}\sim\rho}[\mathcal{R}(\bm{\theta})]-\mathbb{E}_{\bm{\theta}\sim\rho}[\hat{\mathcal{R}}(\bm{\theta})]\right|\leq\mathbb{E}_{\bm{\theta}\sim\rho}\left|\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta})\right|\leq\sqrt{\bar{D}\,\mathbb{E}_{\bm{\theta}\sim\pi}\left[(\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta}))^{2}\right]}. \tag{17}$$

Now using Markov's inequality and Fubini's theorem, we have, for any $\epsilon>0$,

$$\begin{aligned}
\mathbb{P}_{S\sim D^{m}}\left[\mathbb{E}_{\bm{\theta}\sim\pi}\left[(\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta}))^{2}\right]\geq\epsilon\right]
&\leq\frac{1}{\epsilon}\mathbb{E}_{S\sim D^{m}}\mathbb{E}_{\bm{\theta}\sim\pi}\left[(\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta}))^{2}\right]\\
&=\frac{1}{\epsilon}\mathbb{E}_{\bm{\theta}\sim\pi}\mathbb{E}_{S\sim D^{m}}\left[(\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta}))^{2}\right].
\end{aligned}$$

Setting $\delta=\frac{1}{\epsilon}\mathbb{E}_{\bm{\theta}\sim\pi}\mathbb{E}_{S\sim D^{m}}\left[(\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta}))^{2}\right]$, we can rewrite this inequality as

$$\mathbb{P}_{S\sim D^{m}}\left[\mathbb{E}_{\bm{\theta}\sim\pi}\left[(\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta}))^{2}\right]\leq\frac{1}{\delta}\mathbb{E}_{\bm{\theta}\sim\pi}\mathbb{E}_{S\sim D^{m}}\left[(\mathcal{R}(\bm{\theta})-\hat{\mathcal{R}}(\bm{\theta}))^{2}\right]\right]\geq 1-\delta. \tag{18}$$

Finally, combining [18](https://arxiv.org/html/2605.26341#A1.E18) with [17](https://arxiv.org/html/2605.26341#A1.E17) completes the proof. ∎

###### Proof\.

Applying Lemma [A.1](https://arxiv.org/html/2605.26341#A1.Thmtheorem1) with $\varphi(\bm{\theta},S)=M_{1}\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))$ for $0<M_{1}<M$, it holds with probability at least $1-\delta$ that

$$\mathbb{E}_{\bm{\theta}\sim\rho}M_{1}\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))\leq\mathrm{KL}(\rho\|\pi)+\ln\left(\frac{1}{\delta}\mathbb{E}_{S\sim\mathcal{X}^{n}}\mathbb{E}_{\bm{\theta}\sim\pi}e^{M_{1}\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))}\right).$$

Since the prior $\pi$ is independent of the training sample $S$, we can swap the expectations using Fubini's theorem and then control $\mathbb{E}_{S}e^{M_{1}\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))}$ for any fixed $\bm{\theta}\in\Theta$. This is done by leveraging Proposition [A.3](https://arxiv.org/html/2605.26341#A1.Thmtheorem3) and the survival-function argument presented in [[6](https://arxiv.org/html/2605.26341#bib.bib4)]. First, from Proposition [A.3](https://arxiv.org/html/2605.26341#A1.Thmtheorem3) and following the same proof as Lemma 6 in [[6](https://arxiv.org/html/2605.26341#bib.bib4)], we have that

$$\mathbb{P}_{S}\left(M\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))\geq c\right)\leq\mathbb{P}_{X\sim\exp(1)}(X\geq c). \tag{19}$$

Since $X\sim\exp(1)$, we get $kX\sim\exp(1/k)$ for any $k>0$ (rate parameterisation). Thus, multiplying the random variable $M\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))$ by $\frac{M_{1}}{M}$, we have that

$$\mathbb{P}_{S}\left[M_{1}\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))\geq a\right]\leq\mathbb{P}_{X\sim\exp\left(\frac{M}{M_{1}}\right)}(X\geq a). \tag{20}$$

Since $X\sim\exp(\frac{M}{M_{1}})$, we have $e^{X}\sim\mathrm{Pareto}(\frac{M}{M_{1}},1)$. Thus, for any $t>1$ we have

$$\mathbb{P}_{S}\left[e^{M_{1}\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))}\geq t\right]\leq\mathbb{P}_{X\sim\mathrm{Pareto}\left(\frac{M}{M_{1}},1\right)}(X\geq t). \tag{21}$$

Now, the term $\mathbb{E}_{S}e^{M_{1}\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))}$ can be bounded by using the survival function. Since $\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}\geq 0$, both random variables are at least 1, so both survival functions equal 1 for $t\leq 1$ and the comparison in Equation 21 extends to all $t\geq 0$:

$$\begin{aligned}
\mathbb{E}_{S}e^{M_{1}\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))}
&=\int_{0}^{\infty}\mathbb{P}_{S}\left[e^{M_{1}\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))}\geq t\right]dt\\
&\leq\int_{0}^{\infty}\mathbb{P}_{X\sim\mathrm{Pareto}\left(\frac{M}{M_{1}},1\right)}(X\geq t)\,dt\\
&=\mathbb{E}_{X\sim\mathrm{Pareto}\left(\frac{M}{M_{1}},1\right)}X\\
&=\frac{\frac{M}{M_{1}}}{\frac{M}{M_{1}}-1}=\frac{M}{M-M_{1}}.
\end{aligned}$$

Besides, note that

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}\sim\rho}M_{1}\Lambda_{\bm{\theta}}^{\mathcal{S}^{*}}(\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S))
&=\mathbb{E}_{\bm{\theta}\sim\rho}M_{1}\sup_{\lambda}\left\{\lambda\,\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S)-\Lambda_{\bm{\theta}}(\lambda)\right\}\\
&\geq M_{1}\sup_{\lambda}\mathbb{E}_{\bm{\theta}\sim\rho}\left\{\lambda\,\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S)-\Lambda_{\bm{\theta}}(\lambda)\right\}\\
&=M_{1}\sup_{\lambda}\left\{\lambda\,\mathrm{gen}^{\mathcal{S}}(\rho,S)-\mathbb{E}_{\bm{\theta}\sim\rho}[\Lambda_{\bm{\theta}}(\lambda)]\right\}:=M_{1}\Lambda^{*}_{\rho}(\mathrm{gen}^{\mathcal{S}}(\rho,S)).
\end{aligned}$$

So we have that

$$\mathbb{P}_{S}\left[M_{1}\Lambda^{*}_{\rho}(\mathrm{gen}^{\mathcal{S}}(\rho,S))\leq\mathrm{KL}(\rho\|\pi)+\ln\frac{1}{\delta}+\ln\frac{M}{M-M_{1}}\right]\geq 1-\delta.$$

Choosing $M_{1}=M-1$ and then taking the generalised inverse of both sides, we get

$$\mathbb{P}_{S}\left[\mathrm{gen}^{\mathcal{S}}(\rho,S)\leq\inf_{\lambda}\left\{\frac{\mathrm{KL}(\rho\|\pi)+\ln\frac{M}{\delta}}{\lambda(M-1)}+\frac{\mathbb{E}_{\bm{\theta}\sim\rho}[\Lambda_{\bm{\theta}}(\lambda)]}{\lambda}\right\}\right]\geq 1-\delta.$$

∎
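
The constant $\frac{M}{M-M_{1}}$ in the proof above is the mean of a Pareto variable with shape $k=M/M_{1}$ and scale 1. As an illustration only, the sketch below recovers this mean by integrating the survival function numerically; the trapezoidal rule, the change of variable $t=e^{u}$ and the truncation point are implementation choices made here, not taken from the paper.

```python
import math


def pareto_mean_via_survival(k, steps=200_000):
    """E[Y] for Y ~ Pareto(shape k > 1, scale 1) from E[Y] = int_0^inf P(Y >= t) dt.

    On [0, 1] the survival function equals 1 and contributes exactly 1.
    On [1, inf) it equals t^(-k); with t = e^u the integral becomes
    int_0^inf exp(-(k - 1) * u) du, truncated where the integrand is ~ e^-40.
    """
    if k <= 1.0:
        raise ValueError("the mean is finite only for shape k > 1")
    u_max = 40.0 / (k - 1.0)
    h = u_max / steps
    total = 0.0
    for j in range(steps + 1):
        weight = 0.5 if j in (0, steps) else 1.0  # trapezoidal end-point weights
        total += weight * math.exp(-(k - 1.0) * j * h)
    return 1.0 + h * total


M, M1 = 10, 9                      # hypothetical sizes with M1 = M - 1
numeric_mean = pareto_mean_via_survival(M / M1)
closed_form = M / (M - M1)         # equals k / (k - 1) with k = M / M1
```

With $M_{1}=M-1$ the mean equals $M$, which is the source of the $\ln\frac{M}{\delta}$ term in the final bound.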

### A.3 Proofs of bounds with Sobolev-smoothness assumption

###### Lemma.

Under Assumption [3.2](https://arxiv.org/html/2605.26341#S3.Thmtheorem2), for any confidence level $\delta\in(0,1)$, any prior $\pi\in\mathcal{M}(\Theta)$, and any posterior $\rho\in\mathcal{M}(\Theta)$, with probability at least $1-\delta$ we have:

$$\mathrm{gen}^{\mathcal{S}}(\rho,S)\leq\sqrt{2\left(\|\nabla_{\mathrm{PIML}_{S}}\|_{2}^{2}\right)\frac{\mathrm{KL}(\rho\|\pi)+\ln\frac{1}{\delta}}{M(M-1)}}, \tag{22}$$

where $\|\nabla_{\mathrm{PIML}_{S}}\|_{2}^{2}:=\sum_{i=1}^{N_{L}}m_{i}C_{S,i}\,\mathbb{E}_{\bm{\theta}\sim\rho}\mathbb{E}_{\mathbf{x}\sim D_{i}}\Big[\|\nabla_{\mathbf{x}}\ell_{i}(\bm{\theta},\mathbf{x})\|_{2}^{2}\Big]$.

###### Proof\.

Applying Theorem [3.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1) combined with Assumption [3.2](https://arxiv.org/html/2605.26341#S3.Thmtheorem2), we have that

$$\mathbb{P}_{S}\left[\mathrm{gen}^{\mathcal{S}}(\rho,S)\leq\inf_{\lambda}g(\lambda)\right]\geq 1-\delta, \tag{23}$$

where

$$g(\lambda)=\frac{\mathrm{KL}(\rho\|\pi)+\ln\frac{M}{\delta}}{\lambda(M-1)}+\frac{\lambda}{2}\sum_{i=1}^{N_{L}}\frac{m_{i}}{M}C_{S,i}\,\mathbb{E}_{\bm{\theta}\sim\rho}\mathbb{E}_{\mathbf{x}_{i}\sim D_{i}}\left[\|\nabla_{\mathbf{x}}\ell_{i}(\bm{\theta},\mathbf{x}_{i})\|_{2}^{2}\right].$$

Since this inequality holds for all $\lambda>0$, it holds in particular for the $\lambda$ that minimises the right-hand side (RHS) of Equation [23](https://arxiv.org/html/2605.26341#A1.E23). Writing $g(\lambda)=A/\lambda+B\lambda$ with $A=\frac{\mathrm{KL}(\rho\|\pi)+\ln\frac{M}{\delta}}{M-1}$ and $B=\frac{1}{2M}\|\nabla_{\mathrm{PIML}_{S}}\|_{2}^{2}$, the minimiser is $\lambda^{*}=\sqrt{A/B}$ with $g(\lambda^{*})=2\sqrt{AB}$. Plugging it into the RHS, we obtain

$$\mathrm{gen}^{\mathcal{S}}(\rho,S)\leq\sqrt{2\left(\|\nabla_{\mathrm{PIML}_{S}}\|_{2}^{2}\right)\frac{\mathrm{KL}(\rho\|\pi)+\ln\frac{M}{\delta}}{M(M-1)}}, \tag{24}$$

which can finally be rewritten as

$$\mathrm{gen}^{\mathcal{S}}(\rho,S)\leq\frac{1}{M}\sqrt{2\left(\|\nabla_{\mathrm{PIML}_{S}}\|_{2}^{2}\right)\frac{M\left(\mathrm{KL}(\rho\|\pi)+\ln\frac{M}{\delta}\right)}{M-1}}. \tag{25}$$

∎
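
The minimisation over $\lambda$ in this proof has the generic form $g(\lambda)=A/\lambda+B\lambda$, whose minimiser is $\lambda^{*}=\sqrt{A/B}$ with value $2\sqrt{AB}$. The sketch below works through that step with $A$ and $B$ read off from $g$ as written after Equation 23; the numerical values of the KL term, the gradient-norm term, $M$ and $\delta$ are hypothetical and chosen only for illustration.

```python
import math


def g(lam, A, B):
    """Objective A / lam + B * lam from the proof, for lam > 0."""
    return A / lam + B * lam


def sobolev_bound(kl, grad_norm_sq, M, delta):
    """Minimise g over lam > 0; return (lam_star, A, B, minimal value)."""
    A = (kl + math.log(M / delta)) / (M - 1)   # coefficient of 1 / lam
    B = grad_norm_sq / (2.0 * M)               # coefficient of lam
    lam_star = math.sqrt(A / B)                # stationary point of g
    return lam_star, A, B, g(lam_star, A, B)


# Hypothetical inputs (not taken from the paper's experiments).
kl, grad_norm_sq, M, delta = 5.0, 1200.0, 1000, 0.05
lam_star, A, B, bound = sobolev_bound(kl, grad_norm_sq, M, delta)

# Closed form obtained by substituting lam_star back into g.
closed_form = math.sqrt(
    2.0 * grad_norm_sq * (kl + math.log(M / delta)) / (M * (M - 1))
)
```

Here `grad_norm_sq` stands for the sum $\sum_{i}m_{i}C_{S,i}\mathbb{E}[\|\nabla_{\mathbf{x}}\ell_{i}\|_{2}^{2}]$, so it grows with the sample sizes; the value returned by `sobolev_bound` matches `closed_form` up to floating-point error.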

###### Proof\.

First, applying Lemma[A\.3](https://arxiv.org/html/2605.26341#A1.SS3), with probability at least1−δ21\-\\frac\{\\delta\}\{2\}, it holds that

$$\mathrm{gen}^{\mathcal{S}}(\rho,S)\leq\frac{1}{M}\sqrt{2\left(\|\nabla_{\mathrm{PIML}_S}\|_2^2\right)\frac{M\left(\mathrm{KL}(\rho\|\pi)+\ln\frac{2M}{\delta}\right)}{M-1}}. \tag{26}$$

Now we need to replace the true gradient term $\|\nabla_{\mathrm{PIML}_S}\|_2^2$ by its empirical counterpart. By Assumption [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3), $\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2\leq L_i$. Since this quantity takes values in $[0,L_i]$, Hoeffding's lemma implies that $\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2$ is $\frac{L_i^2}{4}$-sub-Gaussian, i.e.,

$$\mathbb{E}_{\bm{\mathrm{x}}_i\sim D_i}\,e^{\lambda C_{S,i}\left(\mathbb{E}_{\bm{\mathrm{x}}_i\sim D_i}\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2-\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2\right)}\leq e^{\frac{\lambda^2C_{S,i}^2L_i^2}{8}}. \tag{27}$$

Applying Theorem [3.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1) with the "losses" $C_{S,i}\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2$, then combining with Equation [27](https://arxiv.org/html/2605.26341#A1.E27) to bound the CGF, we obtain with probability at least $1-\frac{\delta}{2}$ that

$$\|\nabla_{\mathrm{PIML}_S}(\rho)\|_2^2\leq\|\hat{\nabla}_{\mathrm{PIML}_S}(\rho)\|_2^2+\inf_{\lambda}\left\{\frac{M\left(\mathrm{KL}(\rho\|\pi)+\ln\frac{2M}{\delta}\right)}{\lambda(M-1)}+\frac{\lambda\sum_{i=1}^{N_L}m_iC_{S,i}^2L_i^2}{8}\right\}. \tag{28}$$

Solving for the $\lambda$ that minimises the RHS, we have that:

$$\|\nabla_{\mathrm{PIML}_S}(\rho)\|_2^2\leq\|\hat{\nabla}_{\mathrm{PIML}_S}(\rho)\|_2^2+\sqrt{\frac{M\left(\mathrm{KL}(\rho\|\pi)+\ln\frac{2M}{\delta}\right)}{2(M-1)}\sum_{i=1}^{N_L}m_iC_{S,i}^2L_i^2}. \tag{29}$$

Finally, combining [29](https://arxiv.org/html/2605.26341#A1.E29) and [26](https://arxiv.org/html/2605.26341#A1.E26) via a union bound, we complete the proof. ∎
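The step from (28) to (29) minimises a function of the form $a/\lambda+b\lambda/8$ over $\lambda>0$. The short Python sketch below checks the closed-form minimiser $\lambda^\star=\sqrt{8a/b}$ and the minimum value $\sqrt{ab/2}$ against a brute-force grid search. The names `a`, `b` and the numeric values are illustrative assumptions: `a` stands for $M(\mathrm{KL}(\rho\|\pi)+\ln\frac{2M}{\delta})/(M-1)$ and `b` for $\sum_i m_iC_{S,i}^2L_i^2$.

```python
import math


def objective(lam: float, a: float, b: float) -> float:
    """The expression inside the infimum of (28): a / lam + b * lam / 8."""
    return a / lam + b * lam / 8.0


def optimal_lambda(a: float, b: float) -> float:
    """Stationary point of the objective, obtained from -a / lam^2 + b / 8 = 0."""
    return math.sqrt(8.0 * a / b)


def minimum_value(a: float, b: float) -> float:
    """Closed-form minimum sqrt(a * b / 2), i.e. the square-root term in (29)."""
    return math.sqrt(a * b / 2.0)


# Arbitrary illustrative values (not taken from the paper).
a, b = 3.7, 12.5
lam_star = optimal_lambda(a, b)
closed_form = minimum_value(a, b)

# Brute-force check on a grid of lambda values in (0, 1000].
grid_min = min(objective(0.01 * k, a, b) for k in range(1, 100001))
```

The grid minimum can only approach the closed-form value from above, which is a quick way to confirm that the infimum was resolved correctly.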

### A.4 Proofs of bounds with Poincaré-smoothness assumption

Lemma. Under Assumption [3.4](https://arxiv.org/html/2605.26341#S3.Thmtheorem4), for any confidence level $\delta\in(0,1)$, any prior $\pi\in\mathcal{M}(\Theta)$, any posterior $\rho\in\mathcal{M}(\Theta)$, any $\lambda>0$ and any $\alpha>1$, with probability at least $1-\delta$ we have:

$$\mathrm{gen}(\rho,S)\leq\sqrt{\frac{\bar{D}\,\|\nabla_{\mathrm{PIML}_P}\|_2^2}{\delta}}, \tag{30}$$

$$\mathrm{gen}^{\mathcal{S}}(\rho,S)\leq\frac{1}{M}\sqrt{\frac{\bar{D}\,\|\nabla_{\mathrm{PIML}_P}^{\mathcal{S}}\|_2^2}{\delta}}, \tag{31}$$

where $\|\nabla_{\mathrm{PIML}_P}(\pi)\|_2^2:=\sum_{i=1}^{N_L}\frac{C_{P,i}}{m_i}\mathbb{E}_{\bm{\theta}\sim\pi}\mathbb{E}_{\bm{\mathrm{x}}\sim D_i}\|\nabla\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2$ and $\|\nabla_{\mathrm{PIML}_P}^{\mathcal{S}}(\pi)\|_2^2:=\sum_{i=1}^{N_L}m_iC_{P,i}\,\mathbb{E}_{\bm{\theta}\sim\pi}\mathbb{E}_{\bm{\mathrm{x}}\sim D_i}\|\nabla\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2$.

###### Proof\.

We proceed by directly applying Theorem [2](https://arxiv.org/html/2605.26341#S2.SSx2) with the corresponding composite loss functions. First, let us consider $\varphi=\mathrm{gen}(\bm{\theta},S)$. Leveraging the i.i.d. assumption between training points and the Poincaré assumption, we have

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}\sim\pi}\,\mathbb{E}_{\bm{\mathrm{x}}_{i,j}\sim D_i}\left(\mathrm{gen}(\bm{\theta},S)\right)^2
&=\mathbb{E}_{\bm{\theta}\sim\pi}\,\mathbb{E}_{\bm{\mathrm{x}}_{i,j}\sim D_i}\left(\sum_{i=1}^{N_L}\frac{1}{m_i}\sum_{j=1}^{m_i}\left(\mathcal{R}_i(\bm{\theta})-\ell_i(\bm{\theta},\bm{\mathrm{x}}_{i,j})\right)\right)^2\\
&=\mathbb{E}_{\bm{\theta}\sim\pi}\sum_{i=1}^{N_L}\sum_{j=1}^{m_i}\mathbb{E}_{\bm{\mathrm{x}}_{i,j}\sim D_i}\left(\frac{\mathcal{R}_i(\bm{\theta})-\ell_i(\bm{\theta},\bm{\mathrm{x}}_{i,j})}{m_i}\right)^2\\
&=\mathbb{E}_{\bm{\theta}\sim\pi}\sum_{i=1}^{N_L}\frac{1}{m_i}\mathbb{V}[\ell_i(\bm{\theta})]\\
&\leq\mathbb{E}_{\bm{\theta}\sim\pi}\sum_{i=1}^{N_L}\frac{C_{P,i}}{m_i}\mathbb{E}_{\bm{\mathrm{x}}\sim D_i}\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2\quad(\text{Assumption 3.4}),
\end{aligned}$$

where the second equality comes from Assumption [3.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1).

Similarly, for Inequality [31](https://arxiv.org/html/2605.26341#A1.E31), let us consider the sample-centric generalisation gap $\varphi=M\,\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S)$. Leveraging the i.i.d. assumption between training points and the Poincaré assumption, we also have

$$\begin{aligned}
\mathbb{E}_{\bm{\theta}\sim\pi}\,\mathbb{E}_{\bm{\mathrm{x}}_{i,j}\sim D_i}\left(M\,\mathrm{gen}^{\mathcal{S}}(\bm{\theta},S)\right)^2
&=\mathbb{E}_{\bm{\theta}\sim\pi}\,\mathbb{E}_{\bm{\mathrm{x}}_{i,j}\sim D_i}\left(\sum_{i=1}^{N_L}\sum_{j=1}^{m_i}\left(\mathcal{R}_i(\bm{\theta})-\ell_i(\bm{\theta},\bm{\mathrm{x}}_{i,j})\right)\right)^2\\
&=\mathbb{E}_{\bm{\theta}\sim\pi}\sum_{i=1}^{N_L}\sum_{j=1}^{m_i}\mathbb{E}_{\bm{\mathrm{x}}_{i,j}\sim D_i}\left(\mathcal{R}_i(\bm{\theta})-\ell_i(\bm{\theta},\bm{\mathrm{x}}_{i,j})\right)^2\\
&=\mathbb{E}_{\bm{\theta}\sim\pi}\sum_{i=1}^{N_L}m_i\,\mathbb{V}[\ell_i(\bm{\theta})]\\
&\leq\mathbb{E}_{\bm{\theta}\sim\pi}\sum_{i=1}^{N_L}m_iC_{P,i}\,\mathbb{E}_{\bm{\mathrm{x}}\sim D_i}\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2\quad(\text{Assumption 3.4}).
\end{aligned}$$

Then by simply plugging these results into Theorem [2](https://arxiv.org/html/2605.26341#S2.SSx2), we finish the proof. ∎
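The key step in both chains is that, for independent samples, all cross terms vanish, so the second moment of the gap reduces to a weighted sum of per-loss variances. The following Python sketch verifies the identity $\mathbb{E}\big(\sum_i\frac{1}{m_i}\sum_j(\mathcal{R}_i-\ell_{i,j})\big)^2=\sum_i\frac{1}{m_i}\mathbb{V}[\ell_i]$ exactly, for a fixed $\bm{\theta}$, by enumerating every outcome of a toy problem. The two discrete loss distributions and the sample sizes are hypothetical values chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy setting: two loss terms, each with a small discrete
# distribution given as (loss value, probability) pairs.
loss_dists = [
    [(Fraction(0), Fraction(1, 2)), (Fraction(2), Fraction(1, 2))],
    [(Fraction(1), Fraction(1, 4)), (Fraction(3), Fraction(1, 2)),
     (Fraction(7), Fraction(1, 4))],
]
sample_sizes = [2, 3]  # m_i, the number of training points per loss term


def mean(dist):
    return sum(v * p for v, p in dist)


def variance(dist):
    mu = mean(dist)
    return sum(p * (v - mu) ** 2 for v, p in dist)


def second_moment_of_gap(dists, sizes):
    """Exact E[(sum_i (1/m_i) sum_j (R_i - l_ij))^2] over all i.i.d. draws."""
    # One slot per training point, listed in task order.
    slots = [(i, d) for i, (d, m) in enumerate(zip(dists, sizes))
             for _ in range(m)]
    risks = [mean(d) for d in dists]
    total = Fraction(0)
    for outcome in product(*(d for _, d in slots)):
        prob, gap = Fraction(1), Fraction(0)
        for (i, _), (value, p) in zip(slots, outcome):
            prob *= p
            gap += (risks[i] - value) / sizes[i]
        total += prob * gap ** 2
    return total


lhs = second_moment_of_gap(loss_dists, sample_sizes)
rhs = sum(variance(d) / m for d, m in zip(loss_dists, sample_sizes))
```

Because the arithmetic uses exact fractions, `lhs` and `rhs` agree exactly rather than up to floating-point error. Dropping the $1/m_i$ weights gives the sample-centric version with $m_i\mathbb{V}[\ell_i]$ in the same way.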

###### Proof\.

The proof for these two inequalities follows the same argument as the proof of Theorem [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3). First, applying Lemma [A.4](https://arxiv.org/html/2605.26341#A1.SS4) with $\delta'=\frac{\delta}{2}$, we have that

$$\mathrm{gen}(\rho,S)\leq\sqrt{\frac{2}{\delta}\bar{D}\,\|\nabla_{\mathrm{PIML}_P}(\bm{\theta})\|_2^2}, \tag{32}$$

$$\mathrm{gen}^{\mathcal{S}}(\rho,S)\leq\frac{1}{M}\sqrt{\frac{2}{\delta}\bar{D}\,\|\nabla_{\mathrm{PIML}_P}^{\mathcal{S}}(\bm{\theta})\|_2^2}. \tag{33}$$
Exploiting the Lipschitzness assumption like in the proof of Theorem[3\.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3), we obtain that

$$\mathbb{E}_{\bm{\mathrm{x}}_i\sim D_i}\,e^{\lambda\frac{C_{P,i}}{m_i^2}\left(\mathbb{E}_{\bm{\mathrm{x}}_i\sim D_i}\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2-\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2\right)}\leq e^{\frac{\lambda^2C_{P,i}^2L_i^2}{8m_i^4}}, \tag{34}$$

$$\mathbb{E}_{\bm{\mathrm{x}}_i\sim D_i}\,e^{\lambda C_{P,i}\left(\mathbb{E}_{\bm{\mathrm{x}}_i\sim D_i}\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2-\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2\right)}\leq e^{\frac{\lambda^2C_{P,i}^2L_i^2}{8}}. \tag{35}$$

Now plugging [34](https://arxiv.org/html/2605.26341#A1.E34) and [35](https://arxiv.org/html/2605.26341#A1.E35), respectively, into Theorem [3.1](https://arxiv.org/html/2605.26341#S3.Thmtheorem1) with the losses $\frac{C_{P,i}}{m_i^2}\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2$ and $C_{P,i}\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2$, then solving for $\lambda$, we obtain with probability at least $1-\frac{\delta}{2}$ that

$$\|\nabla_{\mathrm{PIML}_P}(\rho')\|_2^2\leq\|\hat{\nabla}_{\mathrm{PIML}_P}(\rho')\|_2^2+\sqrt{\frac{M\left(\mathrm{KL}(\rho'\|\pi)+\ln\frac{2M}{\delta}\right)}{2(M-1)}\sum_{i=1}^{N_L}\frac{C_{P,i}^2L_i^2}{m_i^3}}, \tag{36}$$

$$\|\nabla_{\mathrm{PIML}_P}^{\mathcal{S}}(\rho')\|_2^2\leq\|\hat{\nabla}_{\mathrm{PIML}_P}^{\mathcal{S}}(\rho')\|_2^2+\sqrt{\frac{M\left(\mathrm{KL}(\rho'\|\pi)+\ln\frac{2M}{\delta}\right)}{2(M-1)}\sum_{i=1}^{N_L}m_iC_{P,i}^2L_i^2}. \tag{37}$$

Setting $\rho'=\pi$, so that $\mathrm{KL}(\rho'\|\pi)=0$, we obtain that

$$\|\nabla_{\mathrm{PIML}_P}(\pi)\|_2^2\leq\|\hat{\nabla}_{\mathrm{PIML}_P}(\pi)\|_2^2+\sqrt{\frac{M\ln\frac{2M}{\delta}}{2(M-1)}\sum_{i=1}^{N_L}\frac{C_{P,i}^2L_i^2}{m_i^3}}, \tag{38}$$

$$\|\nabla_{\mathrm{PIML}_P}^{\mathcal{S}}(\pi)\|_2^2\leq\|\hat{\nabla}_{\mathrm{PIML}_P}^{\mathcal{S}}(\pi)\|_2^2+\sqrt{\frac{M\ln\frac{2M}{\delta}}{2(M-1)}\sum_{i=1}^{N_L}m_iC_{P,i}^2L_i^2}. \tag{39}$$

Combining [38](https://arxiv.org/html/2605.26341#A1.E38) with [32](https://arxiv.org/html/2605.26341#A1.E32), and [39](https://arxiv.org/html/2605.26341#A1.E39) with [33](https://arxiv.org/html/2605.26341#A1.E33), via a union bound, then rewriting the result using the above notation, we complete the proof. ∎

## Appendix B Additional Information on Experiment Setup, Implementation and Results

### B.1 Benchmarks

1D-Wave. This problem considers a one-dimensional hyperbolic partial differential equation (PDE) that is commonly used as a benchmark for wave propagation phenomena. The governing equation is given below (for clarity, $\pi$ denotes the numeric constant $\pi\approx 3.14$ in this section, to avoid confusion with the prior distribution in the bounds):

$$\begin{aligned}
\frac{\partial^2u}{\partial t^2}-4\frac{\partial^2u}{\partial x^2}&=0, & x\in(0,1),\;t\in(0,1),\\
u(x,0)&=\sin(\pi x)+\frac{1}{2}\sin(\beta\pi x), & x\in[0,1],\\
\frac{\partial u(x,0)}{\partial t}&=0, & x\in[0,1],\\
u(0,t)&=u(1,t)=0, & t\in[0,1].
\end{aligned} \tag{40}$$
The corresponding analytic solution is

$$u(x,t)=\sin(\pi x)\cos(2\pi t)+\frac{1}{2}\sin(\beta\pi x)\cos(2\beta\pi t). \tag{41}$$

In our experiments, we set $\beta=4$. Despite its relatively simple form, this problem poses challenges for physics-informed neural networks (PINNs) due to the presence of multiple frequency components in the solution. As described above, we have five different equations; for clarity, in the experiments we therefore denote the physical loss terms as follows:

$$\begin{aligned}
\ell_p(\bm{\theta},\bm{\mathrm{x}})&=\left(\frac{\partial^2u_{\bm{\theta}}}{\partial t^2}-4\frac{\partial^2u_{\bm{\theta}}}{\partial x^2}\right)^2,\\
\ell_{ic}(\bm{\theta},\bm{\mathrm{x}})&=\left(u_{\bm{\theta}}(x,0)-\sin(\pi x)-\frac{1}{2}\sin(\beta\pi x)\right)^2,\\
\ell_{ig}(\bm{\theta},\bm{\mathrm{x}})&=\left(\frac{\partial u_{\bm{\theta}}(x,0)}{\partial t}\right)^2,\\
\ell_{b_1}(\bm{\theta},\bm{\mathrm{x}})&=\left(u_{\bm{\theta}}(0,t)\right)^2,\\
\ell_{b_2}(\bm{\theta},\bm{\mathrm{x}})&=\left(u_{\bm{\theta}}(1,t)\right)^2.
\end{aligned} \tag{42}$$

As depicted in Figure [3](https://arxiv.org/html/2605.26341#A2.F3), the solution map is smoother than in the other two problems, so it is easier for models to achieve a small data-fidelity error. However, the equation contains second-order derivatives, which are the main obstacle in terms of both learning and computation. Indeed, the loss $\ell_p$ and its derivative $\|\nabla_{\bm{\theta}}\ell_p\|$ are the dominant terms in both the empirical loss and the complexity of the bound.
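As a sanity check on the 1D-Wave benchmark, the analytic solution (41) can be evaluated numerically and plugged into the governing equation. The sketch below is a minimal illustration in plain Python, not the authors' implementation: the function names and the central finite-difference step `h` are assumptions, and finite differences stand in for the automatic differentiation a PINN would use.

```python
import math

BETA = 4  # frequency parameter used in the experiments


def u_wave(x: float, t: float, beta: float = BETA) -> float:
    """Analytic solution (41) of the 1D-Wave benchmark."""
    return (math.sin(math.pi * x) * math.cos(2 * math.pi * t)
            + 0.5 * math.sin(beta * math.pi * x)
            * math.cos(2 * beta * math.pi * t))


def pde_residual(x: float, t: float, h: float = 1e-4) -> float:
    """Central-difference estimate of u_tt - 4 u_xx at (x, t)."""
    u_tt = (u_wave(x, t + h) - 2 * u_wave(x, t) + u_wave(x, t - h)) / h ** 2
    u_xx = (u_wave(x + h, t) - 2 * u_wave(x, t) + u_wave(x - h, t)) / h ** 2
    return u_tt - 4 * u_xx


def initial_velocity(x: float, h: float = 1e-4) -> float:
    """Central-difference estimate of du/dt at t = 0."""
    return (u_wave(x, h) - u_wave(x, -h)) / (2 * h)


# Squared residuals in the spirit of the loss terms in (42), evaluated on
# the exact solution, where they should all be close to zero.
loss_p = pde_residual(0.3, 0.7) ** 2
loss_ig = initial_velocity(0.3) ** 2
loss_b1 = u_wave(0.0, 0.4) ** 2
loss_b2 = u_wave(1.0, 0.4) ** 2
```

The PDE residual is only zero up to the finite-difference truncation error, which grows with $\beta$ because of the higher-frequency component.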

1D-Reaction. This problem considers a one-dimensional nonlinear ordinary differential equation (ODE) that arises in the modeling of chemical reactions. The equation is given by

$$\begin{aligned}
\frac{\partial u}{\partial t}-\kappa u(1-u)&=0, & x\in(0,2\pi),\;t\in(0,1),\\
u(x,0)&=\exp\left(-\frac{(x-\pi)^2}{2(\pi/4)^2}\right), & x\in[0,2\pi],\\
u(0,t)&=u(2\pi,t), & t\in[0,1].
\end{aligned} \tag{43}$$
The corresponding analytical solution is

$$u(x,t)=\frac{h(x)e^{\kappa t}}{h(x)e^{\kappa t}+1-h(x)}, \tag{44}$$

where

$$h(x)=\exp\left(-\frac{(x-\pi)^2}{2(\pi/4)^2}\right). \tag{45}$$

In our experiments, we set $\kappa=5$. We consider the following three physical losses:

$$\begin{aligned}
\ell_p(\bm{\theta},\bm{\mathrm{x}})&=\left(\frac{\partial u_{\bm{\theta}}}{\partial t}-\kappa u_{\bm{\theta}}(1-u_{\bm{\theta}})\right)^2,\\
\ell_{ic}(\bm{\theta},\bm{\mathrm{x}})&=\left(u_{\bm{\theta}}(x,0)-h(x)\right)^2,\\
\ell_b(\bm{\theta},\bm{\mathrm{x}})&=\left(u_{\bm{\theta}}(0,t)-u_{\bm{\theta}}(2\pi,t)\right)^2.
\end{aligned} \tag{46}$$

This problem has drawn much attention because early PINN training suffers from a failure mode caused by the non-linear term of the equation. As shown in Figure [3](https://arxiv.org/html/2605.26341#A2.F3), the solution map contains sharp boundaries around the central high-value area, making it hard for neural networks to learn. Therefore, the data-fidelity loss $\ell_d$ slightly dominates the PDE residual loss $\ell_p$ in this case, as shown by the empirical risks in Table [3](https://arxiv.org/html/2605.26341#A2.T3). However, as the total empirical risk is very small ($\sim 10^{-3}$), the complexity term dominates the bound, which will likely favor a posterior extremely close to the prior and render the self-bounding-aware algorithm useless. To deal with this, we amplify each loss term by a factor of $100$ and use this scaled version during the entire computation of Algorithm [1](https://arxiv.org/html/2605.26341#alg1). Once the final bound on the scaled risks is computed, we simply divide it by $100$ to obtain the bound on the original risks.
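The closed-form solution (44) with $h$ from (45) can be checked against the reaction equation, the initial condition and the periodic boundary condition. The following Python sketch is illustrative only: the function names and the finite-difference step `dt` are assumptions rather than part of the paper's code.

```python
import math

KAPPA = 5.0  # reaction coefficient used in the experiments


def h0(x: float) -> float:
    """Initial profile h(x) from (45): a Gaussian bump centred at pi."""
    return math.exp(-((x - math.pi) ** 2) / (2 * (math.pi / 4) ** 2))


def u_reaction(x: float, t: float, kappa: float = KAPPA) -> float:
    """Analytic solution (44) of the 1D-Reaction benchmark."""
    growth = h0(x) * math.exp(kappa * t)
    return growth / (growth + 1.0 - h0(x))


def ode_residual(x: float, t: float, kappa: float = KAPPA,
                 dt: float = 1e-5) -> float:
    """Central-difference estimate of u_t - kappa * u * (1 - u)."""
    u_t = (u_reaction(x, t + dt, kappa) - u_reaction(x, t - dt, kappa)) / (2 * dt)
    u = u_reaction(x, t, kappa)
    return u_t - kappa * u * (1.0 - u)


# Squared residuals in the spirit of (46), evaluated on the exact solution.
loss_p = ode_residual(2.5, 0.3) ** 2
loss_ic = (u_reaction(1.0, 0.0) - h0(1.0)) ** 2
loss_b = (u_reaction(0.0, 0.5) - u_reaction(2 * math.pi, 0.5)) ** 2
```

All three quantities vanish up to numerical error, since $h$ is symmetric about $\pi$ and the solution is the logistic flow started from $h(x)$.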

Convection. This problem considers a hyperbolic partial differential equation (PDE) that serves as a standard benchmark for transport-dominated dynamics [[39](https://arxiv.org/html/2605.26341#bib.bib50)], also known as advection. The governing equation is

$$\begin{aligned}
\frac{\partial u}{\partial t}+\beta\frac{\partial u}{\partial x}&=0, & x\in(0,2\pi),\;t\in(0,1),\\
u(x,0)&=\sin(x), & x\in[0,2\pi],\\
u(0,t)&=u(2\pi,t), & t\in[0,1].
\end{aligned} \tag{47}$$
The corresponding analytic solution is

$$u(x,t)=\sin(x-\beta t). \tag{48}$$

In our experiments, we set $\beta=50$. Despite the simple closed-form expression, this problem remains challenging for physics-informed neural networks (PINNs) due to the high-frequency and rapidly propagating solution features. As the equations of this problem have the same form as those of 1D-Reaction, we consider the following physical losses in the experiments:

$$\begin{aligned}
\ell_p(\bm{\theta},\bm{\mathrm{x}})&=\left(\frac{\partial u_{\bm{\theta}}}{\partial t}+\beta\frac{\partial u_{\bm{\theta}}}{\partial x}\right)^2,\\
\ell_{ic}(\bm{\theta},\bm{\mathrm{x}})&=\left(u_{\bm{\theta}}(x,0)-\sin(x)\right)^2,\\
\ell_b(\bm{\theta},\bm{\mathrm{x}})&=\left(u_{\bm{\theta}}(0,t)-u_{\bm{\theta}}(2\pi,t)\right)^2.
\end{aligned} \tag{49}$$
![Refer to caption](https://arxiv.org/html/2605.26341v1/x9.png)\(a\)1D\-Wave
![Refer to caption](https://arxiv.org/html/2605.26341v1/x10.png)\(b\)1D\-Reaction
![Refer to caption](https://arxiv.org/html/2605.26341v1/x11.png)\(c\)Convection

Figure 3: Visualisation of the solution $u$ for the three benchmarks.

Since the total empirical risk is relatively small ($\sim 10^{-2}$), we also scale the losses by a factor of $10$ during the entire computation of Algorithm [1](https://arxiv.org/html/2605.26341#alg1), then divide the resulting scaled bound by $10$ to obtain the bound on the original risk.

### B.2 Empirical observations on gradient distribution

We empirically examine the feasibility of the bounds introduced above, which require the existence of finite constants ensuring that the proposed inequalities hold uniformly over the hypothesis class\. In particular, these assumptions implicitly rely on bounding the Lipschitz continuity of the model, i\.e\., controlling the supremum of the gradient norm over the parameter space\.

To probe this requirement, we focus on the empirical behavior of the Lipschitz constant, approximated by the maximum gradient norm. Starting from a trained model $\bm{\theta}_\pi$, for each distance we generate $100$ perturbed models by sampling the direction at random, and evaluate the corresponding gradient norms. This procedure provides a direct estimate of how the effective Lipschitz constant evolves as one moves away from a well-trained solution. The results are reported in Figure [4](https://arxiv.org/html/2605.26341#A2.F4), where we plot the maximum observed gradient norm of the data-fidelity loss $\ell_d$ and the PDE loss $\ell_p$ as a function of the distance $\|\bm{\theta}_\pi-\bm{\theta}\|_2$. Our observations reveal a pronounced growth of the gradient norm with the distance from the trained model. In particular, the estimated Lipschitz constant increases rapidly and exhibits unbounded behavior as one explores regions further away in parameter space. Even for moderate perturbations at a distance of $1$, the gradient norm of the PDE loss can already reach a magnitude of $10^8$. This leads to significantly looser bounds, which may become vacuous in practice. As a consequence, any constant chosen to uniformly bound the gradient over such regions must be extremely large, effectively diverging when the domain is not restricted to a small neighborhood of the trained model.
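The probing procedure above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: the text does not specify how directions are drawn, so the sketch assumes isotropic directions obtained by normalising a standard Gaussian vector, and `score_fn` is a hypothetical callable standing in for "maximum input-gradient norm of a loss at parameters $\bm{\theta}$".

```python
import math
import random


def sample_at_distance(theta_pi, distance, rng):
    """Return theta with ||theta - theta_pi||_2 equal to `distance`.

    The direction is drawn isotropically by normalising a Gaussian vector
    (an assumption; the paper only says directions are sampled at random).
    """
    direction = [rng.gauss(0.0, 1.0) for _ in theta_pi]
    norm = math.sqrt(sum(d * d for d in direction))
    return [t + distance * d / norm for t, d in zip(theta_pi, direction)]


def max_over_perturbations(score_fn, theta_pi, distance, n_models=100, seed=0):
    """Largest value of `score_fn` over `n_models` perturbed parameter vectors."""
    rng = random.Random(seed)  # fixed seed for reproducible draws
    return max(score_fn(sample_at_distance(theta_pi, distance, rng))
               for _ in range(n_models))


def toy_score(theta):
    """Hypothetical stand-in for a gradient-norm evaluation: the norm of theta."""
    return math.sqrt(sum(v * v for v in theta))


# With theta_pi at the origin, every perturbed model has norm `distance`.
worst_case = max_over_perturbations(toy_score, [0.0] * 5, distance=2.0)
```

Repeating the call over a range of distances yields one curve of the kind plotted against $\|\bm{\theta}_\pi-\bm{\theta}\|_2$, once `toy_score` is replaced by an actual gradient-norm evaluation.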

![Refer to caption](https://arxiv.org/html/2605.26341v1/x12.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x13.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x14.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x15.png)\(a\)1D\-Wave
![Refer to caption](https://arxiv.org/html/2605.26341v1/x16.png)\(b\)1D\-Reaction
![Refer to caption](https://arxiv.org/html/2605.26341v1/x17.png)\(c\)Convection

Figure 4: Gradient norm of the losses $\ell_d$ and $\ell_p$ w.r.t. the inputs in three benchmarks.

These observations suggest that, to obtain tighter bounds, it is crucial to restrict attention to local neighborhoods of a well-trained model (used as the prior). In particular, low bound values arise when the posterior does not diverge significantly from the prior. Therefore, estimating the relevant constants within a ball centered at the prior, with radius large enough to contain the posterior, suffices to ensure the conditions are satisfied in practice.

### B.3 Leveraging local loss boundedness for constant estimation

Although the input-gradient of the losses explodes quickly, we empirically observe that the losses themselves remain small and exhibit limited variability in the neighborhood of the learned prior model. Figure [5](https://arxiv.org/html/2605.26341#A2.F5) shows that in all three benchmarks, the dominant losses $\ell_d$ and $\ell_p$ are much smaller than $100$. Therefore, if we apply clipping at that value, it almost surely has no impact on the loss value. Nevertheless, such clipping is highly beneficial when estimating those constants, as it ensures that the losses remain locally bounded in the neighborhood of the prior model, which in turn bounds their variances and stabilizes the associated CGF. As a consequence, we can also clip input-gradients while always being able to find constants satisfying the Sobolev and Poincaré assumptions. It is now necessary to identify proper constants that tighten the bounds.

![Refer to caption](https://arxiv.org/html/2605.26341v1/x18.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x19.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x20.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x21.png)\(a\)1D\-Wave
![Refer to caption](https://arxiv.org/html/2605.26341v1/x22.png)\(b\)1D\-Reaction
![Refer to caption](https://arxiv.org/html/2605.26341v1/x23.png)\(c\)Convection

Figure 5: Distribution of $\ell_d$ and $\ell_p$ around well-trained models in three benchmarks.

There is an inherent trade-off between the clipping thresholds $L_i$ and the resulting Sobolev constant $C_{S,i}$ and Poincaré constant $C_{P,i}$. Concretely, reducing $L_i$ will result in larger $C_{S,i}$ and $C_{P,i}$, while choosing an excessively large $L_i$ will produce loose bounds. Therefore, it is desirable to select the $L_i$ that reduces the bound values in Theorems [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3) and [3.4](https://arxiv.org/html/2605.26341#S3.Thmtheorem4), i.e., by minimising those constants along with the quantities $L_iC_{S,i}$ and $L_iC_{P,i}$. For this purpose, we use a calibration set $S_{calib}$, disjoint from $S_{prior}$ and $S_{post}$, to compute the CGF, the variance of the losses and the expected gradient. Figure [6](https://arxiv.org/html/2605.26341#A2.F6) shows the curve $\frac{2\Lambda_{\bm{\theta}}^{(i)}(\lambda)}{\lambda^2\,\mathbb{E}_{\bm{\mathrm{x}}\sim D}\left[\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2\right]}$ at different distances, and we can see that the ratio of interest generally increases with distance, but decreases with $\lambda$.
Besides, by using L'Hôpital's rule, we have that $\lim_{\lambda\rightarrow 0^+}\frac{2\Lambda_{\bm{\theta}}^{(i)}(\lambda)}{\lambda^2\,\mathbb{E}_{\bm{\mathrm{x}}\sim D}\left[\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2\right]}=\frac{\mathbb{V}_{\bm{\mathrm{x}}\sim D}[\ell_i(\bm{\theta},\bm{\mathrm{x}})]}{\mathbb{E}_{\bm{\mathrm{x}}\sim D}\|\nabla\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2}$, which directly relates to the quantity of interest in the Poincaré assumption. Therefore, we can simply restrict the search to $L_i$ and $C_{P,i}$ only. Figure [7](https://arxiv.org/html/2605.26341#A2.F7) then shows that $\frac{\mathbb{V}_{\bm{\mathrm{x}}\sim D}[\ell_i(\bm{\theta},\bm{\mathrm{x}})]}{\mathbb{E}_{\bm{\mathrm{x}}\sim D}\|\nabla\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2}$ generally increases with distance, so we can restrict the search for $L_i$ by looking at the value of the ratios at distance $1$.
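The limit obtained from L'Hôpital's rule can be illustrated numerically. The sketch below assumes that $\Lambda_{\bm{\theta}}^{(i)}(\lambda)$ is the cumulant generating function of the centred loss, $\ln\mathbb{E}\,e^{\lambda(\mathbb{E}[\ell]-\ell)}$, which matches the form of the exponents used in the proofs but is an assumption here. It uses a hypothetical three-point loss distribution and checks that $2\Lambda(\lambda)/\lambda^2$ approaches the variance as $\lambda\to 0^+$; dividing both sides by the expected squared gradient norm then gives the ratio in the text.

```python
import math

# Hypothetical discrete loss distribution (values and probabilities).
loss_values = [0.0, 1.0, 4.0]
loss_probs = [0.5, 0.3, 0.2]

loss_mean = sum(p * v for v, p in zip(loss_values, loss_probs))
loss_var = sum(p * (v - loss_mean) ** 2 for v, p in zip(loss_values, loss_probs))


def cgf(lam: float) -> float:
    """CGF of the centred loss: ln E[exp(lam * (E[l] - l))]."""
    return math.log(sum(p * math.exp(lam * (loss_mean - v))
                        for v, p in zip(loss_values, loss_probs)))


def scaled_cgf(lam: float) -> float:
    """The quantity 2 * Lambda(lam) / lam^2, whose limit at 0+ is the variance."""
    return 2.0 * cgf(lam) / lam ** 2


# Evaluate the ratio for a decreasing sequence of lambda values.
ratios = [scaled_cgf(lam) for lam in (1.0, 1e-1, 1e-2, 1e-3)]
```

The gap between `ratios[-1]` and `loss_var` shrinks linearly in $\lambda$, governed by the third cumulant of the centred loss.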

![Refer to caption](https://arxiv.org/html/2605.26341v1/x24.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x25.png)

Figure 6: Curves of $\frac{\Lambda_{\bm{\theta}}^{(i)}(\lambda)}{0.5\lambda^2\,\mathbb{E}_{\bm{\mathrm{x}}\sim D}\left[\|\nabla_{\bm{\mathrm{x}}}\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2\right]}$ at different distances $\|\bm{\theta}_\pi-\bm{\theta}\|_2$ between $\bm{\theta}$ and the prior model $\bm{\theta}_\pi$. Note that $\bm{\theta}$ is sampled at random directions.

The search is described in lines 6-11 of Algorithm [1](https://arxiv.org/html/2605.26341#alg1). In detail, for each loss $i$, the function $\mathcal{C}_p(\tau,1,N_{draw})$ samples $N_{draw}$ models at distance $1$, computes the ratio $\frac{\mathbb{V}_{\bm{\mathrm{x}}\sim D}[\ell_i(\bm{\theta},\bm{\mathrm{x}})]}{\mathbb{E}_{\bm{\mathrm{x}}\sim D}\|\nabla\ell_i(\bm{\theta},\bm{\mathrm{x}})\|_2^2}$ when the gradient is clipped at $\tau$, then returns the maximum value as $\mathcal{S}_\tau$. These values form a set $\mathcal{S}_{\mathcal{T}}$. Here we vary $\tau$ from $1$ to $10^4$. Next, the function $\mathrm{Smallest}_k(\mathcal{S}_{\mathcal{T}})$ selects the threshold candidates corresponding to the $k$ smallest values in $\mathcal{S}_{\mathcal{T}}$, and returns a list of indices $\mathcal{L}$. Then we choose the one with the smallest $\mathcal{C}_p(\tau,1,N_{draw})$ as $L_i$. This scheme allows us to avoid overly loose bounds by limiting the factors that control the bounds ($L_iC_{S,i}$, $L_iC_{P,i}$ and $C_{P,i}$, $C_{S,i}$).
Finally, from the chosen $L_i$, we randomly sample $N_{draw}$ models one more time to refine the search, and determine the value of $(C_{P,i},C_{S,i})$ by choosing the maximum value at radius less than or equal to $1$. This method enables an effective estimation of the constants without a costly optimization procedure. However, as the search relies heavily on the draw of models at random directions, the resulting constants may change from one run to another, although they generally have the same order of magnitude. We thus set random seeds to obtain deterministic results. Table [2](https://arxiv.org/html/2605.26341#A2.T2) presents an example of the set of constants estimated in the three benchmarks. Although they measure the same kind of approximation error, the data-fidelity loss $\ell_d$ and the initial condition loss $\ell_{ic}$ have very different constant values. This is because the interior region is more difficult to learn, and the loss exhibits much higher gradients there than at the initial/boundary area. As a consequence, the Sobolev and Poincaré constants are much larger in the latter case. Therefore, using $\ell_{ic}$ to alleviate the lack of observation calibration data will likely inflate the bounds.

![Refer to caption](https://arxiv.org/html/2605.26341v1/x26.png)
![Refer to caption](https://arxiv.org/html/2605.26341v1/x27.png)

Figure 7: Scatter plots of $\frac{\mathbb{V}_{\bm{\mathrm{x}}}[\ell(\bm{\theta},\bm{\mathrm{x}})]}{\operatorname*{\mathbb{E}}_{\bm{\mathrm{x}}\sim D}\|\nabla_{\bm{\mathrm{x}}}\bar{\ell}(\bm{\mathrm{x}})\|_{2}^{2}}$ for $\ell_{d}$ and $\ell_{p}$. Here the Assumption holds with $C_{P_{d}}=2\mathrm{e}{-}5$ and $C_{P_{d}}=1\mathrm{e}{-}2$ within the radius of $1$, i.e., $\|\bm{\theta}-\bm{\theta}_{\pi}\|_{2}\leq 1$. In the case of $\ell_{p}$, the ratio decreases when $\|\bm{\theta}-\bm{\theta}_{\pi}\|_{2}\geq 10$, since the loss values are clipped and their variance is therefore reduced, while the gradient does not decrease with distance.

Table 2: An example of the constants estimated by our selection algorithm.

| Constant | 1D-Wave | 1D-Reaction | Convection |
| --- | --- | --- | --- |
| $L_{d}$ | 1 | 30 | 400 |
| $C_{S,d}$ | 7e-6 | 2e-1 | 7e-4 |
| $C_{P,d}$ | 2e-5 | 1e-1 | 7e-4 |
| $L_{p}$ | 4.4e3 | 30 | 6.2e3 |
| $C_{S,p}$ | 2e-2 | 5e-1 | 2e-2 |
| $C_{P,p}$ | 2e-1 | 8e-1 | 4e-3 |
| $L_{ic}$ | 2 | 5 | 4 |
| $C_{S,ic}$ | 3e-2 | 2 | 8e-1 |
| $C_{P,ic}$ | 4e-2 | 2 | 9e-1 |
| $L_{ig}$ | 10 | | |
| $C_{S,ig}$ | 7e-3 | - | - |
| $C_{P,ig}$ | 2e-2 | - | - |
| $L_{b_{1}}$ | 4 | - | - |
| $C_{S,b_{1}}$ | 2e-2 | - | - |
| $C_{P,b_{1}}$ | 2e-2 | - | - |
| $L_{b_{2}}$ | 3 | | |
| $C_{S,b_{2}}$ | 2e-2 | - | - |
| $C_{P,b_{2}}$ | 1e-5 | - | - |
| $L_{b}$ | - | 2 | 10 |
| $C_{S,b}$ | - | 1e-1 | 2e-3 |
| $C_{P,b}$ | - | 1e-1 | 2e-3 |

### B.4 Details on the self-bounding-aware algorithms and the computation of the bounds

Note that in a balanced setting (i.e., $m_{i}=m,\;\forall i$), the sample-centric bounds on $\mathrm{gen}^{\mathcal{S}}(\rho,S)$ can be rewritten as an equally-weighted bound by simply dividing both sides by $m$. This way, we can learn to directly minimise the bound (self-bounding) through the following objectives, which correspond to the bounds in Theorems [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3) and [3.4](https://arxiv.org/html/2605.26341#S3.Thmtheorem4):

$$
\sum_{i=1}^{N_{L}}\hat{\mathcal{R}}_{ic}(\bm{\theta}^{\prime})+\sqrt{2\sum_{i=1}^{N_{L}}\frac{C_{S,i}}{m^{2}}\sum_{j=1}^{m}\Big[\|\nabla_{\bm{\mathrm{x}}}\ell_{i}(\bm{\theta}^{\prime},\bm{\mathrm{x}}_{i,j})\|_{2}^{2}\Big]K(\bm{\theta}_{\rho})+\frac{L_{\mathrm{PIML}_{S}}}{m^{2}}K(\bm{\theta}_{\rho})^{\frac{3}{2}}}, \tag{50}
$$

$$
\sum_{i=1}^{N_{L}}\hat{\mathcal{R}}_{ic}(\bm{\theta}^{\prime})+\sqrt{\frac{\left(2\sum_{i=1}^{N_{L}}\frac{C_{P,i}}{m^{2}}\sum_{j=1}^{m}\|\nabla\ell_{i}(\pi,\bm{\mathrm{x}}_{i,j})\|_{2}^{2}+L_{\mathrm{PIML}_{P}}K(\bm{\theta}_{\pi})\right)}{\delta}e^{\left(\frac{\|\bm{\theta}_{\rho}-\bm{\theta}_{\pi}\|_{2}^{2}}{\sigma^{2}}\right)}}, \tag{51}
$$

where $\|\nabla\ell_{i}(\pi,\bm{\mathrm{x}}_{i,j})\|_{2}^{2}:=\operatorname*{\mathbb{E}}_{\bm{\theta}\sim\pi}\|\nabla\ell_{i}(\bm{\theta},\bm{\mathrm{x}}_{i,j})\|_{2}^{2}$.
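As an illustration of how the Sobolev-type objective (50) is assembled from its ingredients, here is a small sketch that evaluates it for given numbers. The function name `sobolev_objective`, its argument names, and the numerical values are hypothetical. The sketch also assumes one particular reading of (50), namely that the factor $2$ and $K(\bm{\theta}_{\rho})$ multiply only the gradient term, and it takes $K(\bm{\theta}_{\rho})$ and $L_{\mathrm{PIML}_{S}}$ as precomputed scalars.

```python
import math

def sobolev_objective(emp_risks, c_s, grad_sq, k_val, l_piml_s):
    """Evaluate objective (50) in the balanced setting (m samples per loss).

    emp_risks: empirical risk of each loss term
    c_s:       Sobolev constant C_{S,i} of each loss term
    grad_sq:   grad_sq[i][j] = squared input-gradient norm of loss i at sample j
    k_val:     value of the complexity term K(theta_rho)
    l_piml_s:  value of the constant L_{PIML_S}
    """
    m = len(grad_sq[0])
    # sum_i C_{S,i} / m^2 * sum_j ||grad_x loss_i||^2
    grad_term = sum(c / m**2 * sum(g) for c, g in zip(c_s, grad_sq))
    complexity = 2.0 * grad_term * k_val + l_piml_s / m**2 * k_val ** 1.5
    return sum(emp_risks) + math.sqrt(complexity)

# Hypothetical inputs: two loss terms, two samples each.
value = sobolev_objective(
    emp_risks=[0.1, 0.2],
    c_s=[0.5, 0.5],
    grad_sq=[[1.0, 1.0], [2.0, 2.0]],
    k_val=0.1,
    l_piml_s=4.0,
)
```

In an actual training loop the same expression would be built from differentiable tensors so that it can be minimised by gradient descent; the scalar version here only shows the arithmetic.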

In the general unbalanced setting, we only obtain a sample-weighted bound $\mathcal{U}_{\mathrm{PIML}}^{\mathcal{S}}(\rho)$ rather than the target equally-weighted bound $\mathcal{U}_{\mathrm{PIML}}(\rho)$. We therefore have to solve an optimisation problem to obtain the target bound from the sample-weighted one. For convenience, let us denote $\|\hat{\nabla}_{\bm{\mathrm{x}}}\ell_{i}(\rho^{\prime})\|_{2}^{2}:=\operatorname*{\mathbb{E}}_{\bm{\theta}\sim\rho^{\prime}}\frac{1}{m_{i}}\sum_{j=1}^{m_{i}}\|\hat{\nabla}_{\bm{\mathrm{x}}}\ell_{i}(\bm{\theta},\bm{\mathrm{x}}_{i,j})\|_{2}^{2}$ and $K_{i}(\rho^{\prime},\pi,\delta)=\frac{\mathrm{KL}(\rho^{\prime}\|\pi)+\ln\frac{2m_{i}}{\delta}}{m_{i}-1}$ for an arbitrary distribution $\rho^{\prime}$ over model parameters. We can then formulate the following linear optimisation problem to obtain the bound $\mathcal{U}_{\mathrm{PIML}}(\rho)$ from the sample-weighted bound $\mathcal{U}_{\mathrm{PIML}}^{\mathcal{S}}(\rho)$:

$$
\begin{aligned}
\max\quad & \sum_{i=1}^{N_{L}}\mathcal{R}_{i}(\rho) \qquad \mathrm{s.t.}\\
& \mathcal{R}_{i}(\rho)\leq\hat{\mathcal{R}}_{i}(\rho)+\mathcal{U}_{i}(\rho),\\
& \sum_{i=1}^{N_{L}}\frac{m_{i}}{M}\mathcal{R}_{i}(\rho)\leq\sum_{i=1}^{N_{L}}\frac{m_{i}}{M}\hat{\mathcal{R}}_{i}(\rho)+\mathcal{U}^{\mathcal{S}}(\rho),
\end{aligned}
\tag{52}
$$

whose constraints hold simultaneously with probability at least $1-\delta$ over the draw of the collection posterior set $S_{post}$. Here

$$
\mathcal{U}_{i}(\rho):=\sqrt{2C_{S,i}\left[\|\hat{\nabla}_{\bm{\mathrm{x}}}\ell_{i}(\rho)\|_{2}^{2}\right]K\left(\rho,\pi,\frac{\delta}{N_{L}}\right)+\sqrt{2}C_{S,i}L_{i}K_{i}\left(\rho,\pi,\frac{\delta}{N_{L}}\right)^{\frac{3}{2}}},
$$

or

$$
\mathcal{U}_{i}(\rho):=\sqrt{\frac{N_{L}\bar{D}}{\delta}\left(2\frac{C_{P,i}}{m_{i}}\operatorname*{\mathbb{E}}_{\bm{\theta}\sim\pi}\left[\|\hat{\nabla}_{\bm{\mathrm{x}}}\ell_{i}(\bm{\theta})\|_{2}^{2}\right]+\sqrt{2}\frac{C_{P,i}L_{i}}{m_{i}}K_{i}\left(\pi,\pi,\frac{\delta}{N_{L}}\right)^{\frac{1}{2}}\right)},
$$

for Sobolev or Poincaré bounds, respectively. $\mathcal{U}^{\mathcal{S}}(\rho)$ is the right-hand side of inequalities ([7](https://arxiv.org/html/2605.26341#S3.E7)) or ([9](https://arxiv.org/html/2605.26341#S3.E9)).
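For intuition about problem (52), the sketch below solves a small instance by hand. It is not the solver used in the paper (which relies on SciPy, see Appendix B.5); it swaps in a greedy rule that is exact for this linear programme only under an additional assumption made here: each true risk is non-negative, as is the case for squared-error losses. Under that assumption (52) is a fractional knapsack with unit values, so filling the risks with the smallest weights $m_{i}/M$ first is optimal. The function name and all numbers are hypothetical.

```python
def max_equal_weight_risk(caps, weights, budget):
    """Maximise sum(r) s.t. 0 <= r[i] <= caps[i] and sum(weights[i] * r[i]) <= budget.

    caps[i]    plays the role of  R_hat_i + U_i        (per-loss constraint)
    weights[i] plays the role of  m_i / M              (sample-weighted constraint)
    budget     plays the role of  sum_i (m_i/M) R_hat_i + U^S
    """
    order = sorted(range(len(caps)), key=lambda i: weights[i])  # cheapest first
    r = [0.0] * len(caps)
    remaining = budget
    for i in order:
        r[i] = max(0.0, min(caps[i], remaining / weights[i]))
        remaining -= weights[i] * r[i]
    return r

# Hypothetical unbalanced instance: 300 data samples, 30000 physics samples.
m = [300, 30000]
M = sum(m)
weights = [mi / M for mi in m]
caps = [0.5, 0.1]      # loose per-loss bounds
budget = 0.08          # sample-weighted bound
r = max_equal_weight_risk(caps, weights, budget)
tightened_bound = sum(r)   # never larger than the union bound sum(caps)
```

In this instance the poorly sampled data term stays at its own cap, while the sample-weighted constraint tightens the well-sampled physics term, so the improvement over the plain union bound is small, which is in line with the discussion that follows.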

A natural question arises: which of these constraints provides the relevant training objective? In practice, there is no guarantee that reducing the bound on $\mathcal{R}_{\mathrm{PIML}}^{\mathcal{S}}(\rho)$ will also reduce the bound on the target true risk $\mathcal{R}_{\mathrm{PIML}}(\rho)$. Indeed, even if the learning algorithm successfully reduces the bound on $\mathcal{R}_{\mathrm{PIML}}^{\mathcal{S}}(\rho)$, it may at the same time increase the bound on the data-fidelity risk, since the complexity term is more easily inflated with a smaller sample size. Therefore, for simplicity, we only use the sample-weighted bound to tighten the union bound after learning, rather than integrating it directly into the training objective. By combining the bounds over single losses, we derive the surrogate objectives (bounding-aware) in ([10](https://arxiv.org/html/2605.26341#S4.E10)) and ([11](https://arxiv.org/html/2605.26341#S4.E11)).

### B.5 Implementations

Throughout the experiments, we mainly aim to verify the exploration ability of the self-bounding-aware algorithm, rather than to obtain state-of-the-art performance. For dataset generation, we uniformly sample 100k collocation points in 2D over the domain associated with each physical loss term: $\Omega$ for $\ell_{d}$, $\Omega_{0}$ for $\ell_{i}$, and $\partial\Omega$ for $\ell_{b}$. These are partitioned into prior, calibration, posterior, and test sets containing 10k, 20k, 30k, and 40k samples, respectively. For the data-fidelity loss, we instead sample 80k points, using 10k for calibration, 30k for the posterior set, and the remaining 40k for testing. In the unbalanced setting, we use a fixed subset consisting of the first $m_{d}$ samples of the posterior set. For each benchmark, datasets are generated once and reused across all experimental configurations to ensure a consistent evaluation protocol.
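The partitioning described above can be sketched as follows. This is only an illustration of the split sizes: the helper `split_indices`, the use of a seeded shuffle, and the dictionary keys are assumptions, not details taken from the paper.

```python
import random

def split_indices(n_total, sizes, seed=0):
    """Shuffle range(n_total) once and cut it into named, disjoint index lists."""
    assert sum(sizes.values()) == n_total
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)   # fixed seed: the split is generated once and reused
    out, start = {}, 0
    for name, size in sizes.items():
        out[name] = idx[start:start + size]
        start += size
    return out

# Physics losses: 100k points -> prior / calibration / posterior / test.
physics = split_indices(
    100_000,
    {"prior": 10_000, "calibration": 20_000, "posterior": 30_000, "test": 40_000},
)
# Data-fidelity loss: 80k points -> calibration / posterior / test.
data = split_indices(
    80_000,
    {"calibration": 10_000, "posterior": 30_000, "test": 40_000},
)
# Unbalanced setting: keep only the first m_d posterior samples of the data term.
m_d = 300
data_posterior_unbalanced = data["posterior"][:m_d]
```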

We use a standard PINN with a two-dimensional input, three hidden layers of $256$ neurons each, and a scalar output. Each hidden layer employs a $\tanh$ activation function. The resulting network contains approximately $d_{\bm{\theta}}\approx 1.3\times 10^{5}$ parameters. Accordingly, we choose the radius $R$ such that $(3\sigma)^{2}d_{\bm{\theta}}\sim R^{2}=1$. We normalise the model's input using the mean and standard deviation computed from the calibration set of the PDE loss. For Step 1, the prior model is initialised with a $\frac{1}{\sqrt{\mathrm{fan}_{in}}}$-scaled uniform distribution, where $\mathrm{fan}_{in}$ is the number of input connections of the neurons in the considered layer, and then trained for $N_{T}=30{,}000$ iterations on 1D-Wave and Convection, and for $N_{T}=10{,}000$ iterations on 1D-Reaction. This ensures that the self-bounding-aware algorithm can still search for a posterior distribution distinct from the prior, rather than staying at an already good prior and learning nothing in the second stage. We then estimate the constants by setting $N_{draw}=10$, $k=10$ and searching for $\tau$ in the range from $1$ to $10^{4}$. For each random seed, we train a single prior, which is then used both to estimate the constants and to initialise the posterior for all bounds in Stage 2. This ensures that all bounds use identical constants and initialisations, enabling a fair comparison.
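The parameter count and the choice of radius can be checked with a few lines of arithmetic. The sketch below assumes fully connected layers with biases (the helper `mlp_param_count` is a hypothetical name) and uses the variance $\sigma^{2}=10^{-6}$ reported for Step 2.

```python
def mlp_param_count(widths):
    """Number of weights and biases of a fully connected network with these layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

# 2 inputs, three hidden layers of width 256, scalar output.
d_theta = mlp_param_count([2, 256, 256, 256, 1])

# (3 * sigma)^2 * d_theta should be of the order of R^2 = 1.
sigma_sq = 1e-6
radius_sq = 9.0 * sigma_sq * d_theta
```

With these widths the count is 132,609, i.e. about $1.3\times 10^{5}$, and $(3\sigma)^{2}d_{\bm{\theta}}\approx 1.19$, which is of order one as stated.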

During Step 2, to optimise the bound, we fix the confidence parameter $\delta=0.05$ and the variance $\sigma^{2}=10^{-6}$ for both the prior distribution $\pi$ and the posterior distribution $\rho$. The parameters of the posterior distribution are learned for $N_{T^{\prime}}=10{,}000$ iterations. We use the Adam optimiser with a batch size of $300$ in both stages. The learning rate is initialised at $10^{-3}$ for Stage 1 and $10^{-7}$ for Stage 2, with an exponential decay of $0.95$ after every $1000$ iterations. All experiments are implemented in PyTorch [[30](https://arxiv.org/html/2605.26341#bib.bib45)], and the optimisation problem ([52](https://arxiv.org/html/2605.26341#A2.E52)) is solved using `scipy.optimize.minimize`. The expected empirical risk is estimated via Monte Carlo simulation by sampling 100 models from the posterior distribution. The models are trained on a single NVIDIA RTX 5090 GPU.
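The learning-rate schedule can be written as a one-line function. This sketch assumes a staircase interpretation of "a decay of 0.95 after every 1000 iterations" (the rate is constant within each block of 1000 steps); the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, base_lr, decay=0.95, every=1000):
    """Staircase exponential decay: multiply by `decay` once per `every` iterations."""
    return base_lr * decay ** (step // every)

# Stage 1 starts at 1e-3, Stage 2 at 1e-7.
stage1_lr_after_first_decay = learning_rate(1000, 1e-3)
stage2_lr_at_end = learning_rate(10_000, 1e-7)   # after the 10,000 Stage 2 iterations
```

Under this reading, the Stage 2 learning rate ends at roughly $6\times 10^{-8}$, so the posterior moves in very small steps, consistent with the small radius $R=1$ chosen above.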

### B.6 Detailed results

Table [3](https://arxiv.org/html/2605.26341#A2.T3) reports detailed results for the balanced setting with $m_{d}=30\mathrm{k}$. Among all methods, Ours-Sob. consistently achieves the smallest bound values across all benchmarks. The KL divergences between the posterior and the prior are generally larger for the Sobolev-type bounds, reflecting their weaker constraints compared to the Poincaré-type counterparts. In particular, Ours-Sob., which uses the total training size as the effective sample size, permits a wider optimisation region and consequently attains substantially larger KL divergences. In nearly all cases, the bound values decrease after posterior learning, indicating that the self-bounding-aware algorithm effectively minimises the target bound.

Tables [4](https://arxiv.org/html/2605.26341#A2.T4) and [5](https://arxiv.org/html/2605.26341#A2.T5) report detailed results for the unbalanced setting with $m_{d}=300$. While the bounds increase only slightly for the 1D-Wave problem, they inflate much more rapidly for the other two problems. This behaviour is primarily caused by the small training sample size in one partition, as well as by the use of the constants of $\ell_{ic}$ (which are orders of magnitude larger) for $\ell_{d}$ to compensate for the absence of calibration data. Regarding the Poincaré-type bounds, Ours-Poi. clearly achieves much lower bounds than Ours-$\mathrm{Poi.}_{\mathcal{S}}$ and $\mathrm{U}_{\mathrm{Poi.}}$. This can be explained by the fact that the union Poincaré bounds grow with a factor of $\sqrt{N_{L}}$, and the sample-weighted constraint (used to obtain Ours-$\mathrm{Poi.}_{\mathcal{S}}$) can partially compensate for it. As a result, it is better to directly use the equally-weighted bound in inequality ([8](https://arxiv.org/html/2605.26341#S3.E8)). Moreover, the Poincaré-type bounds improve only slightly after training, and in some cases even deteriorate, which indicates that under this stricter constraint it is more difficult for the learning algorithm to minimise them. In contrast, the algorithm remains effective at reducing the Sobolev-type bounds and the empirical risks simultaneously. Using the PAC-Bayes-Sobolev bound of Theorem [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3) helps to tighten the bound obtained by the baseline union approach. However, this gain is smaller than the one obtained by the Ours-Sob. bounds in the balanced setting. This suggests that integrating the sample-weighted constraint into the training objective could be of great interest.

Designing such objectives is generally nontrivial and often requires substantial effort. However, in physics-informed settings, the typical abundance of physics-based samples can be leveraged to derive tighter bounds in the unbalanced regime. In practice, the sample size associated with the physical losses can be made arbitrarily large. To take advantage of this, we can combine all physical losses into a single loss $\mathcal{R}_{\mathrm{physics}}(\rho)$, for example $\mathcal{R}_{\mathrm{physics}}(\rho)=\mathcal{R}_{p}(\rho)+\mathcal{R}_{ic}(\rho)+\mathcal{R}_{b}(\rho)$ in 1D-Reaction and Convection. As mentioned in Appendix [B.4](https://arxiv.org/html/2605.26341#A2.SS4), we can obtain a bound on $\mathcal{R}_{\mathrm{physics}}(\rho)$ in both the balanced setting (as a direct result) and the unbalanced setting (by solving the optimisation problem ([52](https://arxiv.org/html/2605.26341#A2.E52))). Informally, with probability at least $1-\delta^{\prime}$, where $\delta^{\prime}\in(0,\delta)$, we can always obtain a bound of the form $\mathcal{R}_{\mathrm{physics}}(\rho)\leq\hat{\mathcal{R}}_{\mathrm{physics}}(\rho)+\mathcal{C}_{\mathrm{physics}}(\delta^{\prime})$. Applying a union bound together with the bound on $\mathcal{R}_{d}(\rho)$, which holds with probability at least $1-(\delta-\delta^{\prime})$, then yields a bound on $\mathcal{R}_{\mathrm{PIML}}(\rho)$.
Allocating a smaller confidence budget $\delta^{\prime}$ to the bound on $\mathcal{R}_{\mathrm{physics}}(\rho)$ has only a limited effect on $\mathcal{C}_{\mathrm{physics}}(\delta^{\prime})$, owing to the large associated sample size, while assigning a larger budget $\delta-\delta^{\prime}$ to the data term mitigates the inflation of the bound induced by the small value of $m_{d}$. To validate this idea, we employ the PAC-Bayes-Sobolev bound of Theorem [3.3](https://arxiv.org/html/2605.26341#S3.Thmtheorem3) with $\delta^{\prime}=\delta/2$ and obtain a Sobolev-type bound of $1.61\pm 0.19$ on 1D-Wave when $m_{d}=2$. This bound is substantially tighter than Ours-Sob. ($1.75$ in Table [1](https://arxiv.org/html/2605.26341#S4.T1)). Furthermore, the improvement achieved by this strategy is considerably larger than that obtained by solving the optimisation problem ([52](https://arxiv.org/html/2605.26341#A2.E52)) (i.e., from $1.76$ for $\mathrm{U}_{\mathrm{Sob.}}$ to $1.61$, compared to from $1.76$ to $1.75$).
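The effect of the confidence budget can be illustrated with the complexity term $K_{i}$ defined in Appendix B.4. The sketch below is a hedged illustration, not a reproduction of the reported numbers: the KL value and the sample sizes are hypothetical, and only the $\delta$-dependent term $K_{i}$ is evaluated rather than the full bound.

```python
import math

def k_term(kl, m, delta):
    """K_i(rho, pi, delta) = (KL + ln(2 m / delta)) / (m - 1)."""
    return (kl + math.log(2.0 * m / delta)) / (m - 1)

delta = 0.05
kl = 1.0                       # hypothetical KL(rho || pi)
m_physics, m_data = 30_000, 300

# Cost of halving the confidence budget, for each sample size.
increase_physics = k_term(kl, m_physics, delta / 2) - k_term(kl, m_physics, delta)
increase_data = k_term(kl, m_data, delta / 2) - k_term(kl, m_data, delta)
```

Halving $\delta$ adds exactly $\ln 2/(m-1)$ to $K_{i}$, so the penalty is about a hundred times smaller for the physics term with $30{,}000$ samples than for a data term with $300$ samples. This is why spending less of the budget on the physics bound costs little.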

Table 3: Results for the balanced setting with $m_{d}=30\mathrm{k}$. Reported values are averages and standard deviations over 5 runs for the empirical (Emp.) risks, KL divergence, and bound values before ($\mathcal{U}(\pi)$) and after ($\mathcal{U}(\rho)$) Stage 2 of Algorithm [1](https://arxiv.org/html/2605.26341#alg1). Blue numbers indicate that the bound decreases after training.

| Dataset | Metric | Ours-Sob. | Ours-Poi. | $\mathrm{U}_{\mathrm{Sob.}}$ | $\mathrm{U}_{\mathrm{Poi.}}$ |
| --- | --- | --- | --- | --- | --- |
| 1D-Wave | Emp. train risk $\hat{\mathcal{R}}_{d}(\rho)$ | (1.17 ± 0.11)e-4 | (1.24 ± 0.18)e-4 | (1.22 ± 0.14)e-4 | (1.25 ± 0.19)e-4 |
| | Emp. train risk $\hat{\mathcal{R}}_{p}(\rho)$ | (4.05 ± 0.42)e-1 | (4.33 ± 0.40)e-1 | (4.14 ± 0.45)e-1 | (4.36 ± 0.39)e-1 |
| | Emp. train risk $\hat{\mathcal{R}}_{ic}(\rho)$ | (1.37 ± 0.52)e-4 | (1.54 ± 0.92)e-4 | (1.47 ± 0.71)e-4 | (1.55 ± 0.94)e-4 |
| | Emp. train risk $\hat{\mathcal{R}}_{ig}(\rho)$ | (1.00 ± 0.04)e-2 | (1.02 ± 0.10)e-2 | (1.04 ± 0.07)e-2 | (1.03 ± 0.10)e-2 |
| | Emp. train risk $\hat{\mathcal{R}}_{b_{1}}(\rho)$ | (1.05 ± 0.11)e-4 | (1.15 ± 0.12)e-4 | (1.10 ± 0.14)e-4 | (1.15 ± 0.13)e-4 |
| | Emp. train risk $\hat{\mathcal{R}}_{b_{2}}(\rho)$ | (1.14 ± 0.06)e-4 | (1.18 ± 0.14)e-4 | (1.18 ± 0.09)e-4 | (1.18 ± 0.14)e-4 |
| | Emp. train risk $\sum\hat{\mathcal{R}}_{\mathrm{train}}(\rho)$ | (4.16 ± 0.42)e-1 | (4.44 ± 0.40)e-1 | (4.25 ± 0.46)e-1 | (4.46 ± 0.39)e-3 |
| | Emp. test risk $\hat{\mathcal{R}}_{d}(\rho)$ | (1.15 ± 0.11)e-4 | (1.24 ± 0.19)e-4 | (1.20 ± 0.14)e-4 | (1.25 ± 0.19)e-4 |
| | Emp. test risk $\hat{\mathcal{R}}_{p}(\rho)$ | (4.11 ± 0.52)e-1 | (4.35 ± 0.40)e-1 | (4.24 ± 0.51)e-1 | (4.37 ± 0.39)e-1 |
| | Emp. test risk $\hat{\mathcal{R}}_{ic}(\rho)$ | (1.54 ± 0.92)e-4 | (1.54 ± 0.92)e-4 | (1.49 ± 0.71)e-4 | (1.54 ± 0.93)e-4 |
| | Emp. test risk $\hat{\mathcal{R}}_{ig}(\rho)$ | (9.92 ± 1.15)e-3 | (1.02 ± 0.10)e-2 | (1.05 ± 0.10)e-2 | (1.02 ± 0.10)e-2 |
| | Emp. test risk $\hat{\mathcal{R}}_{b_{1}}(\rho)$ | (1.10 ± 0.12)e-4 | (1.15 ± 0.12)e-4 | (1.15 ± 0.13)e-4 | (1.16 ± 0.13)e-4 |
| | Emp. test risk $\hat{\mathcal{R}}_{b_{2}}(\rho)$ | (1.11 ± 0.11)e-4 | (1.18 ± 0.14)e-4 | (1.17 ± 0.14)e-4 | (1.18 ± 0.14)e-4 |
| | Emp. test risk $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (4.22 ± 0.52)e-1 | (4.48 ± 0.48)e-1 | (4.25 ± 0.46)e-1 | (4.46 ± 0.39)e-3 |
| | KL | 3.20 ± 0.84 | (9.74 ± 5.94)e-3 | (9.67 ± 1.77)e-1 | (2.09 ± 1.21)e-3 |
| | $\mathcal{U}(\pi)$ | (7.06 ± 1.56)e-1 | (7.20 ± 1.18)e-1 | (7.11 ± 1.56)e-1 | 1.13 ± 0.24 |
| | $\mathcal{U}(\rho)$ | (**6.76** ± 1.48)e-1 | (7.14 ± 1.13)e-1 | (6.80 ± 1.45)e-1 | 1.12 ± 0.24 |
| | $\mathcal{U}(\pi)-\mathcal{U}(\rho)$ | 3.0e-2 | 6e-3 | 3.1e-2 | 1.0e-2 |
| 1D-Reaction | Emp. train risk $\hat{\mathcal{R}}_{d}(\rho)$ | (2.66 ± 0.41)e-3 | (2.99 ± 0.47)e-3 | (2.79 ± 0.43)e-3 | (3.00 ± 0.47)e-3 |
| | Emp. train risk $\hat{\mathcal{R}}_{p}(\rho)$ | (1.09 ± 0.03)e-3 | (1.21 ± 0.15)e-3 | (1.08 ± 0.03)e-3 | (1.26 ± 0.21)e-3 |
| | Emp. train risk $\hat{\mathcal{R}}_{ic}(\rho)$ | (6.92 ± 0.24)e-5 | (8.03 ± 1.81)e-5 | (6.89 ± 0.25)e-5 | (8.43 ± 2.46)e-5 |
| | Emp. train risk $\hat{\mathcal{R}}_{b}(\rho)$ | (2.97 ± 0.15)e-4 | (3.69 ± 0.94)e-4 | (2.97 ± 0.16)e-4 | (3.97 ± 1.41)e-4 |
| | Emp. train risk $\sum\hat{\mathcal{R}}_{\mathrm{train}}(\rho)$ | (4.12 ± 0.42)e-3 | (4.65 ± 0.50)e-3 | (4.24 ± 0.44)e-3 | (4.75 ± 0.56)e-3 |
| | Emp. test risk $\hat{\mathcal{R}}_{d}(\rho)$ | (2.60 ± 0.40)e-3 | (2.89 ± 0.45)e-3 | (2.73 ± 0.42)e-3 | (2.91 ± 0.45)e-3 |
| | Emp. test risk $\hat{\mathcal{R}}_{p}(\rho)$ | (1.10 ± 0.03)e-3 | (1.10 ± 0.01)e-3 | (1.08 ± 0.03)e-3 | (1.15 ± 0.07)e-3 |
| | Emp. test risk $\hat{\mathcal{R}}_{ic}(\rho)$ | (6.94 ± 0.25)e-5 | (7.20 ± 0.41)e-5 | (6.91 ± 0.25)e-5 | (7.60 ± 1.07)e-5 |
| | Emp. test risk $\hat{\mathcal{R}}_{b}(\rho)$ | (2.98 ± 0.16)e-4 | (3.14 ± 0.14)e-4 | (2.97 ± 0.16)e-4 | (3.38 ± 0.52)e-4 |
| | Emp. test risk $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (4.06 ± 0.41)e-3 | (4.37 ± 0.45)e-3 | (4.18 ± 0.45)e-3 | (4.48 ± 0.46)e-3 |
| | KL | 11.07 ± 2.39 | (2.34 ± 2.87)e-2 | (4.67 ± 2.90)e-1 | (5.91 ± 6.53)e-3 |
| | $\mathcal{U}(\pi)$ | (7.39 ± 1.17)e-3 | (7.64 ± 1.45)e-3 | (7.99 ± 1.28)e-3 | (1.18 ± 0.25)e-2 |
| | $\mathcal{U}(\rho)$ | (**7.28** ± 1.23)e-3 | (7.54 ± 1.40)e-3 | (7.59 ± 1.10)e-3 | (1.18 ± 0.25)e-2 |
| | $\mathcal{U}(\pi)-\mathcal{U}(\rho)$ | 1.1e-4 | 1e-4 | 4.0e-4 | 0. |
| Convection | Emp. train risk $\hat{\mathcal{R}}_{d}(\rho)$ | (2.76 ± 0.86)e-2 | (2.89 ± 0.89)e-2 | (2.80 ± 0.86)e-2 | (2.89 ± 0.89)e-2 |
| | Emp. train risk $\hat{\mathcal{R}}_{p}(\rho)$ | (2.67 ± 0.40)e-2 | (3.81 ± 0.72)e-2 | (2.76 ± 0.35)e-2 | (3.96 ± 0.75)e-2 |
| | Emp. train risk $\hat{\mathcal{R}}_{ic}(\rho)$ | (3.07 ± 0.17)e-4 | (4.71 ± 2.49)e-4 | (3.34 ± 0.63)e-4 | (4.76 ± 2.56)e-4 |
| | Emp. train risk $\hat{\mathcal{R}}_{b}(\rho)$ | (1.18 ± 0.34)e-3 | (1.14 ± 0.28)e-3 | (1.17 ± 0.34)e-3 | (1.14 ± 0.28)e-3 |
| | Emp. train risk $\sum\hat{\mathcal{R}}_{\mathrm{train}}(\rho)$ | (5.58 ± 1.11)e-2 | (6.86 ± 1.57)e-2 | (5.71 ± 1.09)e-2 | (7.01 ± 1.60)e-2 |
| | Emp. test risk $\hat{\mathcal{R}}_{d}(\rho)$ | (2.76 ± 0.84)e-2 | (2.86 ± 0.87)e-2 | (2.78 ± 0.85)e-2 | (2.87 ± 0.87)e-2 |
| | Emp. test risk $\hat{\mathcal{R}}_{p}(\rho)$ | (2.71 ± 0.35)e-2 | (3.45 ± 0.61)e-2 | … | … |
}2\}\(2\.77±0\.36\)e\-2\(2\.77\\\!\\pm\\\!0\.36\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}\(3\.68±0\.67\)e\-2\(3\.68\\\!\\pm\\\!0\.67\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}ℛ^ic\(ρ\)\\hat\{\\mathcal\{R\}\}\_\{ic\}\(\\rho\)\(3\.24±0\.53\)e\-4\(3\.24\\\!\\pm\\\!0\.53\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}4\}\(4\.54±2\.43\)e\-4\(4\.54\\\!\\pm\\\!2\.43\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}4\}\(3\.34±0\.63\)e\-4\(3\.34\\\!\\pm\\\!0\.63\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}4\}\(4\.66±2\.62\)e\-4\(4\.66\\\!\\pm\\\!2\.62\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}4\}ℛ^b\(ρ\)\\hat\{\\mathcal\{R\}\}\_\{b\}\(\\rho\)\(1\.17±0\.34\)e\-3\(1\.17\\\!\\pm\\\!0\.34\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}3\}\(1\.15±0\.29\)e\-3\(1\.15\\\!\\pm\\\!0\.29\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}3\}\(1\.17±0\.34\)e\-3\(1\.17\\\!\\pm\\\!0\.34\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}3\}\(1\.14±0\.28\)e\-3\(1\.14\\\!\\pm\\\!0\.28\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}3\}∑ℛ^test\(ρ\)\\sum\\hat\{\\mathcal\{R\}\}\_\{\\mathrm\{test\}\}\(\\rho\)\(5\.61±1\.03\)e\-2\(5\.61\\\!\\pm\\\!1\.03\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}\(6\.47±1\.43\)e\-2\(6\.47\\\!\\pm\\\!1\.43\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}\(5\.70±1\.07\)e\-2\(5\.70\\\!\\pm\\\!1\.07\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}\(6\.71±1\.50\)e\-2\(6\.71\\\!\\pm\\\!1\.50\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}KL17\.30±9\.8017\.30\\\!\\pm\\\!9\.80\(5\.83±3\.28\)e\-1\(5\.83\\\!\\pm\\\!3\.28\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}1\}10\.30±4\.7010\.30\\\!\\pm\\\!4\.70\(1\.04±0\.55\)e\-2\(1\.04\\\!\\pm\\\!0\.55\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}𝒰\(π\)\\mathcal\{U\}\(\\pi\)\(8\.88±1\.85\)e\-2\(8\.88\\\!\\pm\\\!1\.85\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}\(9\.26±2\.40\)e\-2\(9\.26\\\!\\pm\\\!2\.40\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}\(9\.55±2\.13\)e\-2\(9\.55\\\!\\pm\\\!2\.13\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}
\(1\.28±0\.34\)e\-1\(1\.28\\\!\\pm\\\!0\.34\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}1\}𝒰\(ρ\)\\mathcal\{U\}\(\\rho\)\(7\.64±1\.14\)e\-2\(\\mathbf\{7\.64\}\\\!\\pm\\\!1\.14\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}\(9\.00±2\.27\)e\-2\(9\.00\\\!\\pm\\\!2\.27\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}\(8\.31±1\.49\)e\-2\(8\.31\\\!\\pm\\\!1\.49\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}\(1\.27±0\.32\)e\-1\(1\.27\\\!\\pm\\\!0\.32\)\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}1\}𝒰\(π\)−𝒰\(ρ\)\\mathcal\{U\}\(\\pi\)\-\\mathcal\{U\}\(\\rho\)1\.24e\-21\.24\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}2\.7e\-32\.7\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}3\}1\.24e\-21\.24\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}2\}1e\-31\\mathrm\{e\\scalebox\{1\.2\}\[1\.0\]\{\-\}3\}

Table 4: Results for Poincaré bounds in the unbalanced setting with $m_d=300$. Reported values are averages and standard deviations over 5 runs for the empirical (Emp.) risks, KL divergence, and bound values before ($\mathcal{U}(\pi)$) and after ($\mathcal{U}(\rho)$) Stage 2 of Algorithm [1](https://arxiv.org/html/2605.26341#alg1). Blue and red numbers indicate decreases and increases in the bound after training, respectively. Empirical risks and KL are reported as two values per row, listed here in their original order under the first two method columns.

| Dataset | Metric | Term | Ours-Poi. | Ours-Poi.$_{\mathcal{S}}$ | $\mathrm{U}_{\mathrm{Poi.}}$ |
|---|---|---|---|---|---|
| 1D-Wave | Emp. Train Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (1.25 ± 0.22)e-4 | (1.25 ± 0.22)e-4 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (4.33 ± 0.39)e-1 | (4.36 ± 0.39)e-1 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (1.54 ± 0.92)e-4 | (1.55 ± 0.94)e-4 | |
| | | $\hat{\mathcal{R}}_{ig}(\rho)$ | (1.02 ± 0.09)e-2 | (1.03 ± 0.10)e-2 | |
| | | $\hat{\mathcal{R}}_{b_{1}}(\rho)$ | (1.15 ± 0.12)e-4 | (1.15 ± 0.12)e-4 | |
| | | $\hat{\mathcal{R}}_{b_{2}}(\rho)$ | (1.18 ± 0.14)e-4 | (1.18 ± 0.14)e-4 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{train}}(\rho)$ | (4.46 ± 0.40)e-1 | (4.48 ± 0.39)e-1 | |
| | Emp. Test Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (1.24 ± 0.16)e-4 | (1.25 ± 0.19)e-4 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (4.35 ± 0.39)e-1 | (4.38 ± 0.39)e-1 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (1.54 ± 0.92)e-4 | (1.54 ± 0.93)e-4 | |
| | | $\hat{\mathcal{R}}_{ig}(\rho)$ | (1.02 ± 0.09)e-2 | (1.02 ± 0.10)e-2 | |
| | | $\hat{\mathcal{R}}_{b_{1}}(\rho)$ | (1.15 ± 0.12)e-4 | (1.16 ± 0.13)e-4 | |
| | | $\hat{\mathcal{R}}_{b_{2}}(\rho)$ | (1.19 ± 0.11)e-4 | (1.18 ± 0.14)e-4 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (4.46 ± 0.40)e-1 | (4.48 ± 0.39)e-1 | |
| | KL | | (9.54 ± 5.52)e-3 | (1.67 ± 0.83)e-3 | |
| | $\mathcal{U}(\pi)$ | | (7.22 ± 1.18)e-1 | (8.07 ± 1.14)e-1 | 1.21 ± 0.24 |
| | $\mathcal{U}(\rho)$ | | (7.16 ± 1.12)e-1 | (8.03 ± 1.08)e-1 | 1.21 ± 0.23 |
| | $\mathcal{U}(\pi)-\mathcal{U}(\rho)$ | | 6e-3 | 4e-3 | 0 |
| 1D-Reaction | Emp. Train Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (3.22 ± 0.49)e-3 | (3.22 ± 0.49)e-3 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (1.31 ± 0.27)e-3 | (1.32 ± 0.28)e-3 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (8.81 ± 3.04)e-5 | (8.83 ± 3.10)e-5 | |
| | | $\hat{\mathcal{R}}_{b}(\rho)$ | (4.23 ± 1.83)e-4 | (4.26 ± 1.87)e-4 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{train}}(\rho)$ | (5.04 ± 0.65)e-3 | (5.05 ± 0.66)e-3 | |
| | Emp. Test Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (2.93 ± 0.47)e-3 | (2.94 ± 0.47)e-3 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (1.29 ± 0.22)e-3 | (1.29 ± 0.23)e-3 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (8.57 ± 2.57)e-5 | (8.60 ± 2.63)e-5 | |
| | | $\hat{\mathcal{R}}_{b}(\rho)$ | (4.02 ± 1.62)e-4 | (4.04 ± 1.66)e-4 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (4.71 ± 0.59)e-3 | (4.72 ± 0.61)e-3 | |
| | KL | | (2.62 ± 2.60)e-4 | (8.92 ± 7.74)e-5 | |
| | $\mathcal{U}(\pi)$ | | (7.06 ± 1.20)e-2 | (1.36 ± 0.23)e-1 | (1.37 ± 0.24)e-1 |
| | $\mathcal{U}(\rho)$ | | (7.06 ± 1.20)e-2 | (1.36 ± 0.23)e-1 | (1.38 ± 0.24)e-1 |
| | $\mathcal{U}(\pi)-\mathcal{U}(\rho)$ | | 0 | 0 | −1e-3 |
| Convection | Emp. Train Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (2.90 ± 0.98)e-2 | (2.90 ± 0.98)e-2 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (4.08 ± 0.77)e-2 | (4.08 ± 0.77)e-2 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (4.79 ± 2.59)e-4 | (4.79 ± 2.59)e-4 | |
| | | $\hat{\mathcal{R}}_{b}(\rho)$ | (1.14 ± 0.28)e-3 | (1.14 ± 0.28)e-3 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{train}}(\rho)$ | (7.14 ± 1.70)e-2 | (7.15 ± 1.70)e-2 | |
| | Emp. Test Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (2.93 ± 0.47)e-2 | (2.87 ± 0.87)e-2 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (4.08 ± 0.80)e-2 | (4.08 ± 0.80)e-2 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (4.77 ± 2.75)e-4 | (4.77 ± 2.75)e-4 | |
| | | $\hat{\mathcal{R}}_{b}(\rho)$ | (1.13 ± 0.27)e-3 | (1.13 ± 0.27)e-3 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (7.11 ± 1.62)e-2 | (7.11 ± 1.62)e-2 | |
| | KL | | (9.52 ± 4.20)e-6 | (3.42 ± 1.11)e-6 | |
| | $\mathcal{U}(\pi)$ | | 3.09 ± 1.21 | 5.96 ± 2.05 | 6.15 ± 2.41 |
| | $\mathcal{U}(\rho)$ | | 3.09 ± 1.21 | 5.97 ± 2.06 | 6.15 ± 2.42 |
| | $\mathcal{U}(\pi)-\mathcal{U}(\rho)$ | | 0 | −1e-2 | 0 |

Table 5: Results for Sobolev bounds in the unbalanced setting with $m_d=300$. Reported values are averages and standard deviations over 5 runs for the empirical (Emp.) risks, KL divergence, and bound values before ($\mathcal{U}(\pi)$) and after ($\mathcal{U}(\rho)$) Stage 2 of Algorithm [1](https://arxiv.org/html/2605.26341#alg1). Blue and red numbers indicate decreases and increases in the bound after training, respectively. Empirical risks and KL are reported once per row.

| Dataset | Metric | Term | Ours-Sob. | $\mathrm{U}_{\mathrm{Sob.}}$ |
|---|---|---|---|---|
| 1D-Wave | Emp. Train Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (1.15 ± 0.10)e-4 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (4.10 ± 0.46)e-1 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (1.33 ± 0.46)e-4 | |
| | | $\hat{\mathcal{R}}_{ig}(\rho)$ | (1.00 ± 0.04)e-2 | |
| | | $\hat{\mathcal{R}}_{b_{1}}(\rho)$ | (1.06 ± 0.10)e-4 | |
| | | $\hat{\mathcal{R}}_{b_{2}}(\rho)$ | (1.13 ± 0.05)e-4 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{train}}(\rho)$ | (4.23 ± 0.46)e-1 | |
| | Emp. Test Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (1.15 ± 0.11)e-4 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (4.24 ± 0.51)e-1 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (1.38 ± 0.48)e-4 | |
| | | $\hat{\mathcal{R}}_{ig}(\rho)$ | (1.05 ± 0.10)e-2 | |
| | | $\hat{\mathcal{R}}_{b_{1}}(\rho)$ | (1.12 ± 0.11)e-4 | |
| | | $\hat{\mathcal{R}}_{b_{2}}(\rho)$ | (1.12 ± 0.13)e-4 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (4.23 ± 0.46)e-1 | |
| | KL | | 1.82 ± 0.48 | |
| | $\mathcal{U}(\pi)$ | | (7.28 ± 1.54)e-1 | (7.36 ± 1.55)e-1 |
| | $\mathcal{U}(\rho)$ | | (**7.01** ± 1.52)e-1 | (7.08 ± 1.53)e-1 |
| | $\mathcal{U}(\pi)-\mathcal{U}(\rho)$ | | 2.7e-2 | 2.8e-2 |
| 1D-Reaction | Emp. Train Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (3.11 ± 0.49)e-3 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (1.21 ± 0.10)e-3 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (7.60 ± 0.92)e-5 | |
| | | $\hat{\mathcal{R}}_{b}(\rho)$ | (3.32 ± 0.40)e-4 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{train}}(\rho)$ | (4.72 ± 0.44)e-3 | |
| | Emp. Test Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (2.85 ± 0.46)e-3 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (1.21 ± 0.10)e-3 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (7.62 ± 0.92)e-5 | |
| | | $\hat{\mathcal{R}}_{b}(\rho)$ | (3.32 ± 0.40)e-4 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (4.46 ± 0.41)e-3 | |
| | KL | | (2.38 ± 0.93)e-1 | |
| | $\mathcal{U}(\pi)$ | | (5.85 ± 0.64)e-2 | (5.89 ± 0.65)e-2 |
| | $\mathcal{U}(\rho)$ | | (**5.77** ± 0.61)e-2 | (5.81 ± 0.62)e-2 |
| | $\mathcal{U}(\pi)-\mathcal{U}(\rho)$ | | 8e-4 | 8e-2 |
| Convection | Emp. Train Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (2.88 ± 0.97)e-2 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (4.01 ± 0.80)e-2 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (4.56 ± 2.10)e-4 | |
| | | $\hat{\mathcal{R}}_{b}(\rho)$ | (1.13 ± 0.28)e-3 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{train}}(\rho)$ | (7.04 ± 1.73)e-2 | |
| | Emp. Test Risk | $\hat{\mathcal{R}}_{d}(\rho)$ | (2.84 ± 0.87)e-2 | |
| | | $\hat{\mathcal{R}}_{p}(\rho)$ | (4.01 ± 0.80)e-2 | |
| | | $\hat{\mathcal{R}}_{ic}(\rho)$ | (4.56 ± 2.12)e-4 | |
| | | $\hat{\mathcal{R}}_{b}(\rho)$ | (1.13 ± 0.28)e-3 | |
| | | $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (7.01 ± 1.62)e-2 | |
| | KL | | (3.16 ± 2.82)e-1 | |
| | $\mathcal{U}(\pi)$ | | 2.29 ± 0.88 | 2.28 ± 0.88 |
| | $\mathcal{U}(\rho)$ | | **2.24** ± 0.85 | 2.25 ± 0.85 |
| | $\mathcal{U}(\pi)-\mathcal{U}(\rho)$ | | 4e-2 | 3e-2 |

Table 6: Comparison of the expected empirical risk under our posterior and under a Gaussian distribution centered at a deterministically NTK-trained model using the full training data (prior and posterior datasets combined).

| Dataset | Emp. Test Risk | Ours-Sob. | NTK-60k |
|---|---|---|---|
| 1D-Wave | $\hat{\mathcal{R}}_{d}(\rho)$ | (1.15 ± 0.11)e-4 | (9.77 ± 0.45)e-5 |
| | $\hat{\mathcal{R}}_{p}(\rho)$ | (4.11 ± 0.52)e-1 | (3.97 ± 0.29)e-1 |
| | $\hat{\mathcal{R}}_{ic}(\rho)$ | (1.54 ± 0.92)e-4 | (1.09 ± 0.05)e-4 |
| | $\hat{\mathcal{R}}_{ig}(\rho)$ | (9.92 ± 1.15)e-3 | (1.04 ± 0.12)e-2 |
| | $\hat{\mathcal{R}}_{b_{1}}(\rho)$ | (1.10 ± 0.12)e-4 | (1.00 ± 0.05)e-4 |
| | $\hat{\mathcal{R}}_{b_{2}}(\rho)$ | (1.11 ± 0.11)e-4 | (1.03 ± 0.06)e-4 |
| | $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (4.22 ± 0.52)e-1 | (4.08 ± 0.30)e-1 |
| 1D-Reaction | $\hat{\mathcal{R}}_{d}(\rho)$ | (2.60 ± 0.40)e-3 | (6.74 ± 0.28)e-5 |
| | $\hat{\mathcal{R}}_{p}(\rho)$ | (1.10 ± 0.03)e-3 | (1.18 ± 0.07)e-3 |
| | $\hat{\mathcal{R}}_{ic}(\rho)$ | (6.94 ± 0.25)e-5 | (7.25 ± 0.28)e-5 |
| | $\hat{\mathcal{R}}_{b}(\rho)$ | (2.98 ± 0.16)e-4 | (3.10 ± 0.25)e-4 |
| | $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (4.06 ± 0.41)e-3 | (1.63 ± 0.09)e-3 |
| Convection | $\hat{\mathcal{R}}_{d}(\rho)$ | (2.76 ± 0.84)e-2 | (3.39 ± 0.27)e-4 |
| | $\hat{\mathcal{R}}_{p}(\rho)$ | (2.71 ± 0.35)e-2 | (1.21 ± 0.36)e-2 |
| | $\hat{\mathcal{R}}_{ic}(\rho)$ | (3.24 ± 0.53)e-4 | (3.13 ± 0.13)e-4 |
| | $\hat{\mathcal{R}}_{b}(\rho)$ | (1.17 ± 0.34)e-3 | (2.79 ± 0.66)e-4 |
| | $\sum\hat{\mathcal{R}}_{\mathrm{test}}(\rho)$ | (5.61 ± 1.03)e-2 | (1.30 ± 0.37)e-2 |

For completeness, we also compare our self-bounding-aware algorithm against the best results obtained by a state-of-the-art PINN training approach based on the neural tangent kernel (NTK) \[[41](https://arxiv.org/html/2605.26341#bib.bib8)\]. Since our method outputs a posterior distribution rather than a single deterministic model, we compare the expected empirical test risk under our posterior with that obtained from a Gaussian distribution centered at the deterministic NTK-trained model. The NTK model is trained for 60k iterations on the full training set, obtained by combining the prior and posterior datasets. In contrast, our self-bounding-aware algorithm uses a total of only 40k iterations for 1D-Wave and 1D-Convection, and 20k iterations for 1D-Reaction.
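The comparison protocol above, evaluating the expected empirical risk under a Gaussian centered at a deterministic model, can be approximated by Monte Carlo sampling. The following is a minimal sketch under stated assumptions: the isotropic standard deviation `sigma`, the number of samples, the helper name `expected_risk_gaussian`, and the toy linear model are all hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def expected_risk_gaussian(loss_fn, center, sigma, n_samples=200, seed=0):
    """Monte Carlo estimate of E_{w ~ N(center, sigma^2 I)}[loss_fn(w)].

    Returns the sample mean and its standard error.
    """
    rng = np.random.default_rng(seed)
    # Draw parameter vectors around the deterministic solution.
    draws = center + sigma * rng.standard_normal((n_samples, center.size))
    losses = np.array([loss_fn(w) for w in draws])
    return losses.mean(), losses.std(ddof=1) / np.sqrt(n_samples)

# Toy stand-in for an empirical test risk: MSE of a linear model (assumption).
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((50, 3))
w_det = np.array([1.0, -2.0, 0.5])   # plays the role of the deterministic model
y = X @ w_det

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

risk_point = mse(w_det)  # risk of the deterministic model itself
risk_mean, risk_se = expected_risk_gaussian(mse, w_det, sigma=0.1)
```

In this toy setting the expected risk under the Gaussian is strictly larger than the risk of the center, which is why comparing a posterior against a Gaussian around the deterministic model, rather than against the point estimate, puts both methods on the same footing.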

The results are reported in Table [6](https://arxiv.org/html/2605.26341#A2.T6) under Ours-Sob. Overall, the NTK-based method achieves lower total test risk on all three benchmarks. Nevertheless, for 1D-Wave, Ours-Sob. attains a slightly lower risk on one of the initial-condition losses, $\hat{\mathcal{R}}_{ig}$. For the more challenging 1D-Reaction and 1D-Convection problems, the NTK method significantly outperforms our approach, particularly on the data-fidelity term. Our objective, however, was not to obtain the best numerical results but to provide a novel, principled statistical framework for PIML. Even so, our posterior achieves PDE residual risks of the same order of magnitude as the NTK model and, in some cases, lower values, notably for all physical losses in 1D-Reaction.

One possible explanation is that, for problems whose solutions exhibit sharp transitions or high-frequency structures, sufficiently long training can substantially reduce the approximation error of deterministic models. In contrast, it may be inherently difficult to further reduce the PDE residual in order to accurately capture fine-scale variations. Consequently, when such highly trained models are used as priors, the self-bounding-aware algorithm may have limited room to further tighten the bound. This opens the door to new ways of integrating physics priors in PAC-Bayes and to more efficient self-bounding algorithms.

## Appendix C Limitations

This paper presents a PAC-Bayesian analysis for physics-informed machine learning and provides both generalisation guarantees and practical algorithms, supported by extensive experiments on three benchmarks. Despite promising results with tight bounds, several limitations remain. On the theoretical side, the Poincaré bounds, which involve a $\chi^{2}$-divergence and a prior-dependent complexity term, are less flexible and difficult to minimise in practice. It also remains unclear how the physical constraint impacts the generalisation behaviour of the data-fidelity loss. On the experimental side, the estimation of constants, although effective, is stochastic and depends strongly on the number of draws $N_{draw}$, so a fixed random seed is required to obtain reproducible results. The use of the constants of $\ell_{ic}$ to alleviate the scarcity of observational data is clearly suboptimal and inflates the bound. Finding an alternative deterministic approach to estimate the constants and avoid this looseness is one of the main open questions. Finally, the surrogate objective in the unbalanced setting does not directly integrate the constraint offered by the sample-centric bound, and designing such objectives is a direction for future work.
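The dependence of a sampled constant on the number of draws and on the random seed can be illustrated with a small sketch. This is not the paper's estimation procedure, which is not reproduced here: the function `estimate_sup_grad_norm`, the toy function, and the uniform sampling domain are hypothetical, chosen only to show why a fixed seed is needed for repeatable values.

```python
import numpy as np

def estimate_sup_grad_norm(grad_fn, n_draw, dim, seed):
    """Estimate sup_x ||grad_fn(x)|| over [0, 1]^dim from n_draw uniform draws."""
    rng = np.random.default_rng(seed)
    xs = rng.uniform(0.0, 1.0, size=(n_draw, dim))
    return max(float(np.linalg.norm(grad_fn(x))) for x in xs)

# Hypothetical smooth function f(x) = sin(3 x_0) * x_1 and its input gradient.
def grad_f(x):
    return np.array([3.0 * np.cos(3.0 * x[0]) * x[1], np.sin(3.0 * x[0])])

# Same seed and same number of draws: the estimate is reproducible.
est_a = estimate_sup_grad_norm(grad_f, n_draw=100, dim=2, seed=0)
est_b = estimate_sup_grad_norm(grad_f, n_draw=100, dim=2, seed=0)

# A different seed or more draws generally changes the estimate.
est_other_seed = estimate_sup_grad_norm(grad_f, n_draw=100, dim=2, seed=1)
est_more_draws = estimate_sup_grad_norm(grad_f, n_draw=1000, dim=2, seed=0)
```

For this toy function the true supremum of the gradient norm on the unit square is 3, so every sampled estimate is a lower bound on it; a sampled maximum of this kind can only underestimate the constant, which is one reason a deterministic alternative is attractive.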

## Appendix D Related works

Generalisation of PIML. Despite tremendous empirical success, the generalisation of physics-informed machine learning (PIML) methods remains an open question. A common line of work focuses on the convergence of physics-informed models to the PDE solution \[[38](https://arxiv.org/html/2605.26341#bib.bib33), [12](https://arxiv.org/html/2605.26341#bib.bib29)\]. In particular, \[[12](https://arxiv.org/html/2605.26341#bib.bib29)\] addresses overfitting of PINNs by introducing a Sobolev regularisation term that enforces estimator regularity, and shows consistency and strong convergence properties. Other works derive approximation error bounds under stability assumptions on the underlying PDEs \[[35](https://arxiv.org/html/2605.26341#bib.bib27), [28](https://arxiv.org/html/2605.26341#bib.bib31)\]. Several papers study generalisation via (local) Rademacher complexity \[[24](https://arxiv.org/html/2605.26341#bib.bib32), [26](https://arxiv.org/html/2605.26341#bib.bib30), [42](https://arxiv.org/html/2605.26341#bib.bib28)\]; notably, \[[42](https://arxiv.org/html/2605.26341#bib.bib28)\] obtains generalisation bounds for second-order elliptic PDEs by adopting a multi-task perspective and assuming the underlying solutions lie in Barron or Sobolev spaces. \[[10](https://arxiv.org/html/2605.26341#bib.bib35), [11](https://arxiv.org/html/2605.26341#bib.bib34)\] treat the PIML problem as a kernel regression task and use Fourier methods to construct tractable estimators with quantified convergence rates. \[[36](https://arxiv.org/html/2605.26341#bib.bib46)\] views physical loss terms as a form of regularisation in addition to the data-fidelity loss and shows that, under a knowledge alignment assumption (i.e., the regulariser is approximately zero at the true solution), the estimator converges at a faster rate.
Together, these works highlight the importance of encoding prior knowledge about the physical system—via regularity, stability, or data\-physics alignment—to establish generalisation guarantees for approximation error or the total PIML loss\. In the same vein, we establish a connection between generalisation, data, and physical regularity by employing a PAC\-Bayesian framework to derive bounds based on input gradient complexity, and empirically show that, under sufficient regularity of the true solution, tight bounds can be obtained even in low data regimes\.

PAC\-Bayesian analysis for PIML\. PAC\-Bayes theory has emerged as a powerful framework for analysing modern machine learning models through data\-dependent probabilistic guarantees that balance empirical risk and model complexity via information\-theoretic divergences \[[3](https://arxiv.org/html/2605.26341#bib.bib37),[17](https://arxiv.org/html/2605.26341#bib.bib38),[22](https://arxiv.org/html/2605.26341#bib.bib36)\]\. While early PAC\-Bayesian analyses primarily focused on bounded\-loss classification settings, more recent works have extended the framework to regression and heavy\-tailed losses through higher\-order moment control \[[20](https://arxiv.org/html/2605.26341#bib.bib40),[23](https://arxiv.org/html/2605.26341#bib.bib39)\], assumptions on the cumulant generating function \(CGF\) \[[6](https://arxiv.org/html/2605.26341#bib.bib4),[33](https://arxiv.org/html/2605.26341#bib.bib16)\], and structural assumptions on the data distribution \[[37](https://arxiv.org/html/2605.26341#bib.bib17),[18](https://arxiv.org/html/2605.26341#bib.bib18)\]\. These advances are particularly relevant for physics\-informed machine learning \(PIML\), where losses induced by PDE residuals or physical constraints are typically unbounded and may exhibit complex tail behaviour\. Moreover, PAC\-Bayes naturally accommodates prior physical knowledge through structured priors and constrained hypothesis spaces\. More recently, a line of work has explored model\-dependent assumptions leading to generalisation bounds that incorporate regularisation terms based on parameter norms \[[16](https://arxiv.org/html/2605.26341#bib.bib47)\], parameter gradients \[[21](https://arxiv.org/html/2605.26341#bib.bib48)\], and input gradients \[[14](https://arxiv.org/html/2605.26341#bib.bib49),[6](https://arxiv.org/html/2605.26341#bib.bib4)\]\. Closest to our work, \[[6](https://arxiv.org/html/2605.26341#bib.bib4)\] derive empirical PAC\-Bayes bounds with input\-gradient\-scaled complexity terms\.
However, their analysis focuses on a single loss function, and it remains unclear how the resulting constants can be estimated in practical PIML settings\. In contrast, we adopt a multi\-task perspective, under two different smoothness assumptions, to derive joint bounds that are tighter than standard union\-bound constructions\. Furthermore, to the best of our knowledge, our work is among the first to provide both PAC\-Bayesian generalisation guarantees and a complete practical procedure for learning and computing the bounds in PIML settings\.
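
To make the comparison with union\-bound constructions concrete, the following is a textbook illustration and not one of the bounds derived in this paper\. It assumes, purely for exposition, $K$ losses bounded in $[0,1]$, a common sample size $n \ge 8$, and a single prior $\pi$ and posterior $\rho$ shared across tasks; the paper itself targets unbounded losses\. Under these assumptions, the classical Maurer\-type PAC\-Bayes bound gives, for each task $k$, with probability at least $1-\delta$ and simultaneously for all $\rho$,

$$
\mathbb{E}_{h\sim\rho}\big[R_k(h)\big] \;\le\; \mathbb{E}_{h\sim\rho}\big[\hat{R}_k(h)\big] + \sqrt{\frac{\mathrm{KL}(\rho\,\|\,\pi) + \ln\!\big(2\sqrt{n}/\delta\big)}{2n}} .
$$

Applying this bound to each of the $K$ tasks at confidence level $\delta/K$ and summing yields, with probability at least $1-\delta$,

$$
\sum_{k=1}^{K}\mathbb{E}_{h\sim\rho}\big[R_k(h)\big] \;\le\; \sum_{k=1}^{K}\mathbb{E}_{h\sim\rho}\big[\hat{R}_k(h)\big] + K\sqrt{\frac{\mathrm{KL}(\rho\,\|\,\pi) + \ln\!\big(2K\sqrt{n}/\delta\big)}{2n}} .
$$

In this illustration the complexity term is paid $K$ times and the confidence term grows by $\ln K$, which is the kind of looseness that a joint multi\-task analysis is meant to avoid\.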