MMD-Balls as Credal Sets: A PAC-Bayesian Framework for Epistemic Uncertainty in Test-Time Adaptation

Summary

This paper develops a PAC-Bayesian framework for test-time adaptation that uses MMD-balls as credal sets, providing formal generalization bounds and separating epistemic from aleatoric uncertainty under distribution shift.

arXiv:2605.21783v1 (Announce Type: new)

# MMD-Balls as Credal Sets: A PAC-Bayesian Framework for Epistemic Uncertainty in Test-Time Adaptation
Source: [https://arxiv.org/html/2605.21783](https://arxiv.org/html/2605.21783)
###### Abstract

Test-time adaptation (TTA) methods improve model performance under distribution shift but lack formal guarantees connecting shift magnitude to prediction reliability. We develop a PAC-Bayesian framework yielding generalization bounds explicitly parameterized by the maximum mean discrepancy (MMD) between source and target distributions. Our principal contribution is interpreting MMD-balls around the source distribution as credal sets in Walley’s imprecise probability theory, yielding natural epistemic uncertainty quantification. We establish: (i) a PAC-Bayesian bound with an MMD-dependent shift penalty under an RKHS-Lipschitz loss assumption; (ii) a finite-sample version via MMD concentration; (iii) a uniform worst-case risk bound over all distributions in the credal set, with a lower-upper risk decomposition; and (iv) geodesic preservation bounds explaining why kernel-guided adaptation protects local feature geometry. The credal set interpretation separates epistemic from aleatoric uncertainty and provides a principled decision criterion for when adaptation is warranted.

## 1 Introduction

Reliable deployment of machine learning models requires reasoning under epistemic uncertainty: the ability to recognize when the operating distribution has shifted beyond the scope of what was encountered during training. This challenge is central to test-time adaptation (TTA), a paradigm in which a model pretrained on a source distribution $P_s$ receives unlabeled data from a target distribution $P_t \neq P_s$ at deployment time. Existing TTA methods (Wang et al., 2021; Niu et al., 2023; Zhang et al., 2022a; Yuan et al., 2023; Su et al., 2022) improve accuracy under distribution shift by adapting model parameters using statistics computed from test batches, but they provide no formal guarantees about when predictions should be trusted or how much risk degrades as a function of shift magnitude.

This gap is particularly concerning in safety-critical applications such as autonomous driving, medical imaging, and financial risk assessment, where a model that silently degrades under distribution shift can cause significant harm. The inability to quantify how wrong a model’s predictions might be in an unseen environment fundamentally limits its trustworthy deployment. While predictive uncertainty methods (e.g., Bayesian neural networks, ensemble methods) attempt to address this, they conflate aleatoric uncertainty (inherent data noise) with epistemic uncertainty (uncertainty due to limited knowledge of the data-generating process), and they do not provide formal connections between distributional shift and risk.

We formalize the core question: can we bound target risk as a function of distribution shift, and does this provide actionable epistemic uncertainty? We answer affirmatively via a PAC-Bayesian framework with explicit MMD-dependent generalization bounds. Our central insight is that the MMD-ball $\mathcal{C}_{\varepsilon}(P_s)=\{Q:\mathrm{MMD}(P_s,Q)\leq\varepsilon\}$ defines a credal set (Walley, 1991; Troffaes and Destercke, 2023), that is, a set of probability distributions that are indistinguishable from the source distribution at resolution $\varepsilon$. This interpretation provides a principled foundation for epistemic uncertainty quantification in TTA that is grounded in both kernel methods and imprecise probability theory.

Contributions:

- A PAC-Bayesian bound (Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1)) with MMD-dependent shift penalty, plus a finite-sample version (Theorem [3](https://arxiv.org/html/2605.21783#Thmtheorem3)) with minimax-optimal concentration rate.
- A credal set interpretation (Proposition [7](https://arxiv.org/html/2605.21783#Thmtheorem7)) that yields worst-case risk guarantees over the entire MMD-ball, with a lower-upper risk decomposition (Corollary [9](https://arxiv.org/html/2605.21783#Thmtheorem9)) that separates epistemic from aleatoric uncertainty.
- Geodesic preservation bounds (Proposition [10](https://arxiv.org/html/2605.21783#Thmtheorem10)) explaining why kernel-guided adaptation protects local feature geometry, with implications for rare-class robustness (Corollary [11](https://arxiv.org/html/2605.21783#Thmtheorem11)).
- A unified framework connecting PAC-Bayesian generalization, kernel mean embeddings, and Walley’s imprecise probability theory for the first time.

## 2 Related Work

Test-time adaptation. The TTA paradigm has rapidly expanded since TENT (Wang et al., 2021) demonstrated that entropy minimization on test batch statistics can effectively adapt batch normalization parameters. Surveys (Zhang et al., 2022b) categorize subsequent methods into entropy-based approaches (EATA (Zhang et al., 2022a), which introduces entropy-aware selection), regularization-based approaches (SAR (Niu et al., 2023), which adds sharpness-aware regularization to prevent error accumulation), and memory-based approaches (MEMO (Zhang et al., 2022a), which uses augmentation memory banks to maintain source-like representations). Recent work has also explored contrastive learning at test time (Yuan et al., 2023) and sequential adaptation via anchored clustering (Su et al., 2022). However, all of these methods lack uncertainty quantification: they cannot signal when adaptation is unwarranted or when predictions are unreliable. Our theoretical framework fills this gap by providing formal bounds that explicitly depend on shift magnitude.

Kernel methods and MMD. Maximum mean discrepancy (MMD) (Gretton et al., 2012) measures distributional divergence by embedding distributions into a reproducing kernel Hilbert space (RKHS) and computing the distance between their kernel mean embeddings. Kernel mean embeddings (Muandet et al., 2017) provide a unified representation framework that enables nonparametric two-sample testing, density estimation, and distribution regression. Finite-sample concentration of MMD estimators has been precisely characterized: for kernels bounded in $[0,1]$, the unbiased estimator satisfies a sub-Gaussian concentration inequality (Sutherland et al., 2017), and the minimax estimation rate is $O(1/\sqrt{n})$ (Tolstikhin et al., 2017). These results are critical for our finite-sample analysis (Theorem [3](https://arxiv.org/html/2605.21783#Thmtheorem3)).

PAC-Bayesian theory. PAC-Bayesian bounds (McAllester, 1999; Germain et al., 2016; Seeger, 2002; Catoni, 2007; Rivasplata et al., 2020; Alquier, 2024) provide data-dependent generalization guarantees by penalizing the complexity of the posterior relative to a prior, measured via the Kullback-Leibler divergence. These bounds hold uniformly over all possible posteriors, making them well-suited for adaptation scenarios where the posterior is chosen after observing data. Germain et al. (2013) derived PAC-Bayesian bounds for domain adaptation using the $\mathcal{H}$-divergence (Ben-David et al., 2010) between domains. Our work differs in three key respects: (i) we use MMD, a computable kernel-based discrepancy, rather than the $\mathcal{H}$-divergence, which is NP-hard to estimate; (ii) we provide a finite-sample version with explicit dependence on sample sizes; and (iii) we interpret the shift penalty through credal sets, connecting to imprecise probability.

Imprecise probability and credal sets. Walley’s behavioral theory of imprecise probabilities (Walley, 1991) models epistemic uncertainty through sets of probability distributions (credal sets) rather than single distributions. The lower and upper probabilities induced by a credal set quantify the range of plausible beliefs given the available evidence. Credal classifiers (Destercke et al., 2008; Corani et al., 2022) extend this to classification by maintaining sets of probability measures over class labels. The formalization of lower-upper probability is developed in (Miranda and Zaffalon, 2022). Hüllermeier and Waegeman (2021) argued persuasively that meaningful uncertainty quantification in machine learning requires distinguishing aleatoric from epistemic sources, a distinction our framework naturally provides. We are, to our knowledge, the first to formulate TTA uncertainty through MMD-induced credal sets.

## 3 Preliminaries

Reproducing kernel Hilbert space (RKHS) notation. Let $(\mathcal{X},\Sigma)$ be a measurable space and let $\mathcal{H}$ be an RKHS of functions $f:\mathcal{X}\to\mathbb{R}$ with a positive definite kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. By the reproducing property, every function $f\in\mathcal{H}$ satisfies $f(x)=\langle f,k(x,\cdot)\rangle_{\mathcal{H}}$. The feature map $\phi:\mathcal{X}\to\mathcal{H}$ is defined as $\phi(x)=k(x,\cdot)$, so that $k(x,y)=\langle\phi(x),\phi(y)\rangle_{\mathcal{H}}$.

For a probability measure $P$ on $\mathcal{X}$ with $\int k(x,x)\,dP(x)<\infty$, the kernel mean embedding is defined as

$$\mu_P=\mathbb{E}_{x\sim P}[\phi(x)]=\int\phi(x)\,dP(x)\in\mathcal{H}. \tag{1}$$

When $k$ is characteristic (Muandet et al., 2017), the embedding $\mu_P$ uniquely determines $P$; that is, the map $P\mapsto\mu_P$ is injective. The maximum mean discrepancy (MMD) between two probability measures $P$ and $Q$ is then defined as the RKHS distance between their kernel mean embeddings:

$$\mathrm{MMD}^2(P,Q)=\|\mu_P-\mu_Q\|_{\mathcal{H}}^2. \tag{2}$$

For a deep encoder $f_\theta:\mathcal{X}\to\mathbb{R}^d$, we employ a learned kernel $k_\theta(x,y)=\exp\left(-\gamma\|f_\theta(x)-f_\theta(y)\|^2\right)$, an RBF kernel defined on the feature space. Consistent with $\phi(x)=k(x,\cdot)$ above, the feature map for this kernel is $\phi_\theta(x)=\exp\left(-\gamma\|f_\theta(x)-\cdot\|^2\right)$, viewed as a function in an RKHS over $\mathbb{R}^d$.
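As a concrete illustration of Eq. (2) with the RBF kernel on features, the following is a minimal sketch of the standard unbiased U-statistic estimate of $\mathrm{MMD}^2$. The function names `rbf_gram` and `mmd2_unbiased`, the bandwidth `gamma=0.25`, and the Gaussian arrays standing in for encoder outputs $f_\theta(x)$ are all illustrative assumptions, not part of the paper.

```python
import numpy as np


def rbf_gram(a, b, gamma):
    """Gram matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2) for feature rows."""
    sq = (
        np.sum(a * a, axis=1)[:, None]
        + np.sum(b * b, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2_unbiased(src, tgt, gamma=1.0):
    """Unbiased U-statistic estimate of MMD^2 between two feature samples."""
    m, n = len(src), len(tgt)
    k_ss = rbf_gram(src, src, gamma)
    k_tt = rbf_gram(tgt, tgt, gamma)
    k_st = rbf_gram(src, tgt, gamma)
    # Drop the diagonal terms k(x_i, x_i) so within-sample averages are unbiased.
    term_ss = (k_ss.sum() - np.trace(k_ss)) / (m * (m - 1))
    term_tt = (k_tt.sum() - np.trace(k_tt)) / (n * (n - 1))
    return term_ss + term_tt - 2.0 * k_st.mean()


# Synthetic features (assumption): stand-ins for f_theta(x) under P_s and P_t.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 4))
same = rng.normal(0.0, 1.0, size=(200, 4))      # second draw from the source
shifted = rng.normal(1.0, 1.0, size=(200, 4))   # mean-shifted target

mmd2_same = mmd2_unbiased(source, same, gamma=0.25)
mmd2_shift = mmd2_unbiased(source, shifted, gamma=0.25)
print(mmd2_same, mmd2_shift)
```

Because the estimator is unbiased, it can be slightly negative when the two samples come from the same distribution; a plug-in estimate of $\mathrm{MMD}$ itself would need to clip at zero before taking the square root.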

Test-time adaptation protocol. A model is pretrained on source distribution $P_s$; at deployment, it receives unlabeled batches drawn from target distribution $P_t\neq P_s$. We assume covariate shift: the conditional distribution of labels given features is preserved, $P_t(y\mid x)=P_s(y\mid x)$, while the marginal feature distribution changes, $P_t(x)\neq P_s(x)$. This assumption is standard in domain adaptation (Ben-David et al., 2010) and is reasonable when the semantic relationship between features and labels remains stable but the input distribution shifts (e.g., weather changes in autonomous driving, style shifts in medical imaging).

For a random predictor drawn from a posterior distribution $\rho$ over model parameters, the expected risk under distribution $P$ is defined as $R_P(\rho)=\mathbb{E}_{(x,y)\sim P,\,w\sim\rho}[\ell(w,x,y)]$, where $\ell(w,x,y)$ is a loss function (e.g., cross-entropy). Under covariate shift, this decomposes as

$$R_P(\rho)=\mathbb{E}_{w\sim\rho}\left[\mathbb{E}_{x\sim P(x)}[L(w,x)]\right], \tag{3}$$

where $L(w,x)=\mathbb{E}_{y\sim P(y\mid x)}[\ell(w,x,y)]$ is the conditional expected loss. Crucially, because $P(y\mid x)$ is invariant under covariate shift, the function $x\mapsto L(w,x)$ is the same for $P_s$ and $P_t$; only the distribution over $x$ changes.

PAC-Bayesian framework. PAC-Bayesian analysis provides bounds on the generalization gap $R_P(\rho)-\hat{R}_P(\rho)$, where $\hat{R}_P(\rho)=\mathbb{E}_{w\sim\rho}[\hat{R}_P(w)]$ is the empirical risk averaged over the posterior. The key quantity governing the complexity penalty is the Kullback-Leibler divergence $\mathrm{KL}(\rho\|\pi)=\mathbb{E}_{w\sim\rho}[\log(\rho(w)/\pi(w))]$ between the posterior $\rho$ and a fixed prior $\pi$. The classical PAC-Bayesian theorem (McAllester, 1999; Germain et al., 2016) states that, for $n$ i.i.d. samples from $P$, w.p. $\geq 1-\delta$:

$$R_P(\rho)\leq\hat{R}_P(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}. \tag{4}$$

Our contribution extends this framework by adding an MMD-dependent term that accounts for distribution shift between $P_s$ and $P_t$.
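To give a feel for the size of the complexity term in Eq. (4), the sketch below evaluates the right-hand side for a few sample sizes. The function name `pac_bayes_bound` and all numeric inputs (empirical risk 0.10, KL of 50 nats, $\delta=0.05$) are hypothetical values chosen only for illustration.

```python
import math


def pac_bayes_bound(emp_risk, kl, n, delta):
    """Right-hand side of Eq. (4): empirical risk plus the complexity term."""
    complexity = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return emp_risk + complexity


# Hypothetical values: the bound tightens roughly like 1/sqrt(n).
for n in (1_000, 10_000, 100_000):
    print(n, round(pac_bayes_bound(emp_risk=0.10, kl=50.0, n=n, delta=0.05), 4))
```

The KL term dominates the logarithmic term for these inputs, which is why a posterior that stays close to the prior matters more than the choice of $\delta$.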

## 4 PAC-Bayesian Bound with MMD

###### Assumption 1 (RKHS-Lipschitz Loss).

For every $w$ in the support of $\rho$, the conditional expected loss function $L(w,\cdot)$ belongs to the RKHS $\mathcal{H}$ with bounded norm: $L(w,\cdot)\in\mathcal{H}$ and $\|L(w,\cdot)\|_{\mathcal{H}}\leq L_{\mathcal{H}}$.

This assumption requires that the conditional expected loss, viewed as a function of the input $x$, lies in the RKHS induced by kernel $k$. This is a stronger requirement than mere smoothness: it constrains the functional form of $L(w,\cdot)$ to the Hilbert space spanned by the kernel functions. For cross-entropy loss with softmax outputs, informal support comes from the smoothness of the softmax function ($\|\nabla\sigma\|_{\mathrm{op}}\leq 1$) combined with the universality of the RBF kernel (Sriperumbudur et al., 2009), whose RKHS can approximate any continuous function on a compact domain. We discuss relaxations and empirical verification strategies in Appendix E.

###### Theorem 1 (PAC-Bayesian Bound with MMD Shift Penalty).

Under Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1) and covariate shift, for prior $\pi$, posterior $\rho$, and failure probability $\delta\in(0,1)$, with probability at least $1-\delta$:

$$R_{P_t}(\rho)\leq\hat{R}_{P_s}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\mathrm{MMD}(P_s,P_t). \tag{5}$$

Proof sketch. The proof proceeds in three steps. Step 1: Apply the classical PAC-Bayesian theorem (Eq. [4](https://arxiv.org/html/2605.21783#S3.E4)) to bound the source risk $R_{P_s}(\rho)$ in terms of the empirical source risk $\hat{R}_{P_s}(\rho)$ plus a KL-divergence complexity term, w.p. $\geq 1-\delta$. Step 2: Under Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1), bound the gap $|R_{P_t}(\rho)-R_{P_s}(\rho)|$ by $L_{\mathcal{H}}\cdot\mathrm{MMD}(P_s,P_t)$ using the reproducing property and the Cauchy-Schwarz inequality. Step 3: Combine the two inequalities. Because Step 2 holds deterministically, it consumes no failure probability, which is why the $\log(2\sqrt{n}/\delta)$ term of Eq. [4](https://arxiv.org/html/2605.21783#S3.E4) carries over unchanged. The full proof is deferred to Appendix A.

An immediate consequence is that when the shift is zero ($\mathrm{MMD}(P_s,P_t)=0$), the bound recovers the standard PAC-Bayesian guarantee. As the shift grows, the bound degrades gracefully: it does not collapse but widens linearly, reflecting increasing epistemic uncertainty about the target distribution.
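The shift penalty in Theorem 1 comes from a short chain of (in)equalities. The derivation below is a sketch of Step 2 under Assumption 1, using only the kernel mean embedding of Eq. (1) and the risk decomposition of Eq. (3); the identity $\mathbb{E}_{x\sim P}[f(x)]=\langle f,\mu_P\rangle_{\mathcal{H}}$ for $f\in\mathcal{H}$ is the standard consequence of the reproducing property.

```latex
\begin{aligned}
R_{P_t}(\rho) - R_{P_s}(\rho)
  &= \mathbb{E}_{w\sim\rho}\!\left[\mathbb{E}_{x\sim P_t}[L(w,x)]
     - \mathbb{E}_{x\sim P_s}[L(w,x)]\right] \\
  &= \mathbb{E}_{w\sim\rho}\!\left[\left\langle L(w,\cdot),\,
     \mu_{P_t}-\mu_{P_s}\right\rangle_{\mathcal{H}}\right] \\
  &\le \mathbb{E}_{w\sim\rho}\!\left[\|L(w,\cdot)\|_{\mathcal{H}}\right]
     \,\|\mu_{P_t}-\mu_{P_s}\|_{\mathcal{H}}
   \;\le\; L_{\mathcal{H}}\cdot\mathrm{MMD}(P_s,P_t).
\end{aligned}
```

The same argument with the roles of $P_s$ and $P_t$ exchanged gives the bound on the absolute difference.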

## 5 Finite-Sample Analysis

Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1) involves the population MMD, which is unavailable in practice. We now provide a fully computable version using the unbiased MMD estimator.

###### Theorem 3 (Finite-Sample PAC-Bayesian Bound).

Under Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1), let $\widehat{\mathrm{MMD}}_u$ denote the unbiased MMD estimator computed from $m$ source samples and $n$ target samples, with kernel $k_\theta$ bounded in $[0,1]$. For $\delta\in(0,1/2)$, with probability at least $1-\delta$:

$$R_{P_t}(\rho)\leq\hat{R}_{P_s}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(4\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\left(\widehat{\mathrm{MMD}}_u+\varepsilon_{m,n}(\delta/2)\right), \tag{6}$$

where $\varepsilon_{m,n}(\alpha)=\sqrt{2\log(2/\alpha)/\min(m,n)}$.

Proof sketch. We apply a union bound over two events. Event 1: The PAC-Bayesian concentration holds w.p. $\geq 1-\delta/2$. Event 2: For the unbiased MMD estimator with kernel bounded in $[0,1]$, the concentration results of Sutherland et al. (2017) and Tolstikhin et al. (2017) give $\Pr[|\widehat{\mathrm{MMD}}_u-\mathrm{MMD}|>\varepsilon]\leq 2\exp(-\min(m,n)\cdot\varepsilon^2/2)$. Setting the right-hand side equal to $\delta/2$ yields $\varepsilon_{m,n}=\sqrt{2\log(4/\delta)/\min(m,n)}$. Substituting $\mathrm{MMD}\leq\widehat{\mathrm{MMD}}_u+\varepsilon_{m,n}$ into Eq. [5](https://arxiv.org/html/2605.21783#S4.E5) yields Eq. [6](https://arxiv.org/html/2605.21783#S5.E6). Full proof: Appendix B.
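The finite-sample bound is directly computable once an MMD estimate is available. The sketch below evaluates the slack $\varepsilon_{m,n}(\alpha)$ and the right-hand side of Eq. (6). The function names and the separate argument `n_src` for the sample size in the PAC-Bayesian term are assumptions introduced for readability (the theorem statement writes $n$ in both places); `lip` stands for $L_{\mathcal{H}}$, which must be supplied by the user.

```python
import math


def mmd_slack(m, n, alpha):
    """epsilon_{m,n}(alpha) = sqrt(2 * log(2 / alpha) / min(m, n)) from Theorem 3."""
    return math.sqrt(2.0 * math.log(2.0 / alpha) / min(m, n))


def finite_sample_bound(emp_risk, kl, n_src, mmd_hat, m, n, lip, delta):
    """Right-hand side of Eq. (6); delta is split evenly across the two events."""
    complexity = math.sqrt(
        (kl + math.log(4.0 * math.sqrt(n_src) / delta)) / (2.0 * n_src)
    )
    return emp_risk + complexity + lip * (mmd_hat + mmd_slack(m, n, delta / 2.0))


# Hypothetical inputs, for illustration only.
slack = mmd_slack(m=500, n=500, alpha=0.025)
bound = finite_sample_bound(
    emp_risk=0.10, kl=50.0, n_src=10_000,
    mmd_hat=0.05, m=500, n=500, lip=1.0, delta=0.05,
)
print(slack, bound)
```

With only 500 samples per side, the slack is larger than the hypothetical MMD estimate itself, which shows how quickly small test batches can dominate the shift penalty.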

## 6 MMD-Balls as Credal Sets

We now develop the central theoretical contribution: interpreting MMD-balls as credal sets in Walley’s imprecise probability framework, which provides a principled foundation for epistemic uncertainty quantification in TTA.

###### Definition 5 (MMD-Induced Credal Set).

For a source distribution $P_s$ and a radius $\varepsilon>0$, the MMD-induced credal set is $\mathcal{C}_{\varepsilon}(P_s)=\{Q\in\mathcal{P}(\mathcal{X}):\mathrm{MMD}(P_s,Q)\leq\varepsilon\}$, where $\mathcal{P}(\mathcal{X})$ denotes the set of all probability measures on $\mathcal{X}$.

The credal set $\mathcal{C}_{\varepsilon}(P_s)$ contains all distributions that are “close enough” to the source, where closeness is measured in the RKHS metric induced by $k_\theta$. As $\varepsilon$ increases, the credal set widens, representing greater epistemic uncertainty about the true target distribution. When $\varepsilon=0$, the credal set collapses to $\{P_s\}$ (assuming a characteristic kernel), and we have complete knowledge. This interpretation directly connects to Walley’s behavioral theory: a credal set represents the set of probability measures that are consistent with the available evidence (Walley, 1991).

###### Lemma 6 (Convexity and Weak Closure).

For a characteristic kernel $k$ and any $\varepsilon>0$, the MMD-induced credal set $\mathcal{C}_{\varepsilon}(P_s)$ is convex and weakly closed.

Proof sketch. The proof relies on the linearity of the kernel mean embedding map $Q\mapsto\mu_Q$ and the convexity of the norm. Convexity ensures that mixtures of plausible distributions remain plausible, which is essential for coherent decision-making under imprecise probabilities (Walley, 1991). Full proof: Appendix C.
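The convexity half of Lemma 6 can be written out in two lines. The sketch below takes $Q_1,Q_2\in\mathcal{C}_{\varepsilon}(P_s)$ and $\lambda\in[0,1]$ (symbols introduced here for illustration) and uses only the linearity of $Q\mapsto\mu_Q$ and the triangle inequality.

```latex
\begin{aligned}
\mu_{\lambda Q_1+(1-\lambda)Q_2} &= \lambda\,\mu_{Q_1}+(1-\lambda)\,\mu_{Q_2}, \\
\mathrm{MMD}\bigl(P_s,\lambda Q_1+(1-\lambda)Q_2\bigr)
  &= \bigl\|\lambda(\mu_{P_s}-\mu_{Q_1})+(1-\lambda)(\mu_{P_s}-\mu_{Q_2})\bigr\|_{\mathcal{H}} \\
  &\le \lambda\,\mathrm{MMD}(P_s,Q_1)+(1-\lambda)\,\mathrm{MMD}(P_s,Q_2)
   \;\le\; \varepsilon .
\end{aligned}
```

Hence every mixture of two members of the credal set is again a member.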

### 6.1 Uniform Risk Bound over the Credal Set

###### Proposition 7 (Worst-Case Risk over the Credal Set).

Under Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1) and covariate shift, for any $\varepsilon>0$ and w.p. $\geq 1-\delta$:

$$\sup_{Q\in\mathcal{C}_{\varepsilon}(P_s)}R_Q(\rho)\leq\hat{R}_{P_s}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\varepsilon. \tag{7}$$

Proof sketch. Every distribution $Q\in\mathcal{C}_{\varepsilon}(P_s)$ satisfies $\mathrm{MMD}(P_s,Q)\leq\varepsilon$ by definition. Applying Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1) pointwise for each such $Q$, the right-hand side of the bound is independent of the specific $Q$: it depends only on $\varepsilon$. Taking the supremum over $Q$ preserves the bound.

Proposition [7](https://arxiv.org/html/2605.21783#Thmtheorem7) provides a worst-case risk guarantee: even the most adversarial distribution within the MMD-ball has bounded risk. This is significantly stronger than a point estimate of target risk, as it certifies that the model’s performance cannot degrade beyond the stated bound regardless of which distribution in the credal set is the true target. This directly operationalizes Walley’s behavioral interpretation (Walley, 1991): an agent who knows only that $P_t\in\mathcal{C}_{\varepsilon}(P_s)$ can treat the upper risk $\overline{R}_{\varepsilon}(\rho)=\sup_{Q\in\mathcal{C}_{\varepsilon}(P_s)}R_Q(\rho)$ as a valid upper bound on actual risk.

### 6.2 Lower-Upper Risk Decomposition

###### Definition 8 (Lower and Upper Risk).

For the credal set $\mathcal{C}_{\varepsilon}(P_s)$, define:

$$\underline{R}_{\varepsilon}(\rho)=\inf_{Q\in\mathcal{C}_{\varepsilon}(P_s)}R_Q(\rho),\qquad\overline{R}_{\varepsilon}(\rho)=\sup_{Q\in\mathcal{C}_{\varepsilon}(P_s)}R_Q(\rho). \tag{8}$$

###### Corollary 9 (Risk Imprecision Interval).

Under the conditions of Proposition [7](https://arxiv.org/html/2605.21783#Thmtheorem7), w.p. $\geq 1-\delta$:

$$\underline{R}_{\varepsilon}(\rho)\geq\hat{R}_{P_s}(\rho)-\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}-L_{\mathcal{H}}\cdot\varepsilon, \tag{9}$$

and the imprecision width is bounded by

$$\overline{R}_{\varepsilon}(\rho)-\underline{R}_{\varepsilon}(\rho)\leq 2\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+2L_{\mathcal{H}}\cdot\varepsilon. \tag{10}$$

Proof sketch. The upper bound on $\overline{R}_{\varepsilon}(\rho)$ follows from Proposition [7](https://arxiv.org/html/2605.21783#Thmtheorem7). The lower bound on $\underline{R}_{\varepsilon}(\rho)$ uses the PAC-Bayesian lower bound of Germain et al. (2016) plus the observation that $|R_Q(\rho)-R_{P_s}(\rho)|\leq L_{\mathcal{H}}\cdot\mathrm{MMD}(P_s,Q)$ for any $Q\in\mathcal{C}_{\varepsilon}(P_s)$. The width (Eq. [10](https://arxiv.org/html/2605.21783#S6.E10)) follows by subtraction. Full proof: Appendix C.

### 6.3 Implications for Test-Time Adaptation

The interval $[\underline{R}_{\varepsilon}(\rho),\overline{R}_{\varepsilon}(\rho)]$ provides a principled epistemic uncertainty measure for TTA. We highlight three concrete implications:

Epistemic vs. aleatoric separation. Aleatoric uncertainty is captured by the empirical risk $\hat{R}_{P_s}(\rho)$, which reflects the inherent difficulty of the learning task. Epistemic uncertainty is the width of the imprecision interval, which has two components: the PAC-Bayesian complexity $\sqrt{\mathrm{KL}/2n}$ (estimation uncertainty from finite source data) and the shift penalty $L_{\mathcal{H}}\cdot\varepsilon$ (distributional uncertainty from operating outside the source). This decomposition directly addresses the call by Hüllermeier and Waegeman (2021) for principled separation of uncertainty sources.

Decision criterion for adaptation. Given a risk tolerance threshold $r_{\max}$, adaptation is warranted precisely when the imprecision interval straddles the decision boundary: $\overline{R}_{\varepsilon}(\rho)>r_{\max}>\underline{R}_{\varepsilon}(\rho)$. When the upper risk exceeds $r_{\max}$, adaptation may reduce risk. When the lower risk also exceeds $r_{\max}$, adaptation is futile (the risk is irreducibly high). When the upper risk is below $r_{\max}$, no adaptation is needed.
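The three-way rule above can be sketched in a few lines. Since the true lower and upper risks are not observable, this sketch substitutes the computable right-hand sides of Eqs. (9) and (7) as conservative surrogates; that substitution, the function names, the handling of the boundary case `upper == r_max`, and every numeric input are assumptions made for illustration.

```python
import math


def risk_interval_bounds(emp_risk, kl, n, delta, lip, eps):
    """Return (lower, upper): the right-hand sides of Eqs. (9) and (7)."""
    complexity = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    shift_penalty = lip * eps  # L_H * epsilon
    return (
        emp_risk - complexity - shift_penalty,
        emp_risk + complexity + shift_penalty,
    )


def adaptation_decision(lower, upper, r_max):
    """Three-way rule from the text, applied to the surrogate interval."""
    if upper <= r_max:
        return "no adaptation needed"   # even the worst case is tolerable
    if lower > r_max:
        return "adaptation futile"      # even the best case is too risky
    return "adapt"                      # interval straddles the threshold


# Hypothetical inputs, for illustration only.
lower, upper = risk_interval_bounds(
    emp_risk=0.10, kl=50.0, n=10_000, delta=0.05, lip=1.0, eps=0.05
)
decision = adaptation_decision(lower, upper, r_max=0.15)
print(lower, upper, decision)
```

The two definite outcomes remain valid under the surrogate interval, because it contains the true one; the "adapt" outcome is the inconclusive case and may be triggered more often than with the exact lower and upper risks.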

Hypothesis testing connection. Setting $\varepsilon$ via the asymptotic null distribution of the MMD two-sample test (Gretton et al., 2012) provides a calibrated credal set. Specifically, rejecting the null hypothesis $H_0:P_t=P_s$ at level $\alpha$ corresponds to $\widehat{\mathrm{MMD}}_u>\varepsilon_{\alpha}$. The credal set width then directly quantifies the evidence against the null, connecting classical two-sample testing to epistemic uncertainty quantification.
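One practical way to obtain a level-$\alpha$ threshold is a permutation test, sketched below. This swaps in a permutation approximation of the null distribution in place of the asymptotic null distribution mentioned in the text, and it thresholds the unbiased estimate of $\mathrm{MMD}^2$ rather than $\mathrm{MMD}$. The function names, the number of permutations, the bandwidth, and the synthetic data are all assumptions.

```python
import numpy as np


def mmd2_from_gram(k, idx_s, idx_t):
    """Unbiased MMD^2 from a pooled Gram matrix and two index sets."""
    m, n = len(idx_s), len(idx_t)
    k_ss = k[np.ix_(idx_s, idx_s)]
    k_tt = k[np.ix_(idx_t, idx_t)]
    k_st = k[np.ix_(idx_s, idx_t)]
    return (
        (k_ss.sum() - np.trace(k_ss)) / (m * (m - 1))
        + (k_tt.sum() - np.trace(k_tt)) / (n * (n - 1))
        - 2.0 * k_st.mean()
    )


def permutation_threshold(src, tgt, gamma, alpha=0.05, n_perm=200, seed=0):
    """Return (observed MMD^2, permutation estimate of the level-alpha threshold)."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([src, tgt])
    sq = np.sum((pooled[:, None, :] - pooled[None, :, :]) ** 2, axis=-1)
    k = np.exp(-gamma * sq)  # RBF Gram matrix on the pooled sample
    m = len(src)
    idx = np.arange(len(pooled))
    observed = mmd2_from_gram(k, idx[:m], idx[m:])
    null = np.empty(n_perm)
    for b in range(n_perm):
        # Under H0 the labels are exchangeable, so reshuffle the split.
        perm = rng.permutation(len(pooled))
        null[b] = mmd2_from_gram(k, perm[:m], perm[m:])
    return observed, np.quantile(null, 1.0 - alpha)


# Synthetic features (assumption): a clearly shifted target.
rng = np.random.default_rng(1)
src = rng.normal(0.0, 1.0, size=(100, 3))
tgt = rng.normal(1.5, 1.0, size=(100, 3))
observed, threshold = permutation_threshold(src, tgt, gamma=0.5)
reject_h0 = observed > threshold
print(observed, threshold, reject_h0)
```

The pooled Gram matrix is computed once and only re-indexed per permutation, which keeps the cost of the loop low for batch sizes typical of test-time adaptation.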

## 7 Geodesic Preservation Under Shift

Distribution shift can distort local feature geometry, degrading representations in ways that disproportionately harm rare classes with small decision regions. We formalize geometric preservation using geodesic distance in the RKHS induced by the learned kernel.

###### Assumption 2 (Bounded Feature Map in RKHS).

The encoder factors as $f_\theta(x)=W\cdot\phi_\theta(x)$ with $\phi_\theta\in\mathcal{H}$ and $\|W\|_{\mathrm{op}}\leq C_W$, where $\|W\|_{\mathrm{op}}$ denotes the spectral/operator norm.

This structural assumption requires that the encoder can be decomposed into a bounded linear map $W$ and a feature map $\phi_\theta$ in the RKHS. This holds approximately under several practical conditions: (i) the neural tangent kernel (NTK) regime, where the feature map approaches a fixed function in the NTK; (ii) explicit MMD regularization during training, which constrains the feature map; and (iii) spectral normalization of weight matrices, which bounds $\|W\|_{\mathrm{op}}$. We provide further discussion in Appendix E.

###### Proposition 10 (Geodesic Distortion Bound).

Under Assumption [2](https://arxiv.org/html/2605.21783#Thmassumption2), for any anchor point $x_i$ and radius $\bar{\epsilon}$ such that $\|f_\theta(x_i)-f_\theta(y)\|\leq\bar{\epsilon}$ for all $y$ in the relevant neighborhood:

$$\left|\mathbb{E}_{y\sim P_s}[d_k(x_i,y)]-\mathbb{E}_{y\sim P_t}[d_k(x_i,y)]\right|\leq\sqrt{2\gamma}\,C_W\,\mathrm{MMD}(P_s,P_t)+O(\bar{\epsilon}^2). \tag{11}$$

Proof sketch. For the RBF kernel, the geodesic distance admits a local linear approximation: $d_k(x,y)=\sqrt{2\gamma}\,\|f_\theta(x)-f_\theta(y)\|+O(\bar{\epsilon}^2)$ for nearby points. Under Assumption [2](https://arxiv.org/html/2605.21783#Thmassumption2), the expectation difference reduces via the reverse triangle inequality to a quantity bounded by $C_W\cdot\mathrm{MMD}(P_s,P_t)$, scaled by $\sqrt{2\gamma}$. Full proof: Appendix D.
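The factor $\sqrt{2\gamma}$ in the local approximation can be checked with a short Taylor expansion. The sketch below assumes that $d_k$ is the distance induced by the feature map, $d_k(x,y)=\|\phi_\theta(x)-\phi_\theta(y)\|_{\mathcal{H}}$ (the precise definition is in Appendix D, which is not reproduced here), and writes $r=\|f_\theta(x)-f_\theta(y)\|\leq\bar{\epsilon}$.

```latex
\begin{aligned}
d_k(x,y)^2 &= k_\theta(x,x)+k_\theta(y,y)-2\,k_\theta(x,y)
            = 2-2\exp(-\gamma r^2) \\
           &= 2\gamma r^2-\gamma^2 r^4+O(r^6), \\
d_k(x,y)   &= \sqrt{2\gamma}\,r\,\sqrt{1-\tfrac{\gamma r^2}{2}+O(r^4)}
            = \sqrt{2\gamma}\,r+O(r^3).
\end{aligned}
```

For $r\leq\bar{\epsilon}$ small, the $O(r^3)$ remainder is absorbed into the $O(\bar{\epsilon}^2)$ term of Eq. (11).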

###### Corollary 11 (Rare-Class Robustness).

Entropy minimization, the core mechanism of many TTA methods such as TENT (Wang et al., 2021), can collapse rare-class structure by treating regions with few training examples as high-entropy areas to be flattened. Proposition [10](https://arxiv.org/html/2605.21783#Thmtheorem10) shows that MMD-bounded adaptation preserves local RKHS geometry: geodesic distortion between source and target neighborhoods is controlled by $\mathrm{MMD}(P_s,P_t)$, independent of class frequency. Since rare classes occupy small but structurally coherent regions in feature space, this geometric preservation provides a formal argument for why kernel-guided adaptation (which explicitly controls MMD) is more robust to rare-class collapse than entropy-based methods.

## 8 Discussion

Limitations and future directions. Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1) requires the loss function to lie in the RKHS, which is non-trivial for deep networks with softmax outputs. While universality of the RBF kernel provides informal support, rigorous verification remains an open problem. A promising direction is to relax this to hold only on average over the posterior: $\mathbb{E}_{w\sim\rho}[\|L(w,\cdot)\|_{\mathcal{H}}]\leq L_{\mathcal{H}}$, which is easier to verify empirically via kernel ridge regression diagnostics. Similarly, Assumption [2](https://arxiv.org/html/2605.21783#Thmassumption2) holds only approximately for standard architectures; developing tighter bounds for specific architectures (e.g., ResNets, Vision Transformers) is important future work. The MMD convergence rate $O(1/\sqrt{n})$, while minimax-optimal, may be loose in practice; adaptive kernel selection or bandwidth tuning could yield tighter data-dependent rates.

Relation to conformal prediction. Conformal prediction (Gibbs and Candès, 2021; Angelopoulos and Bates, 2023) provides distribution-free prediction sets with marginal coverage $\Pr(y\in C(x))\geq 1-\alpha$ under distribution shift, but does not bound risk or quantify epistemic uncertainty at the distribution level. The credal width $\varepsilon$ quantifies distributional epistemic uncertainty (how much the target may differ from the source at the population level), while conformal methods quantify predictive uncertainty (the width of prediction sets at the individual level). A natural combination is to use $\varepsilon$ to adapt the conformal coverage level: when the credal set is narrow (low epistemic uncertainty), standard coverage $\alpha$ suffices; when it widens, coverage should increase via $\alpha(\varepsilon)=\alpha_0+g(\varepsilon)$, where $g$ is calibrated from the PAC-Bayesian bound. We elaborate on this connection in Appendix F.

Broader significance for epistemic intelligence\.The credal set framework enables trustworthy deployment of adaptive models: by monitoring MMD in real\-time during deployment, a system can trigger abstention or fallback mechanisms when epistemic uncertainty exceeds a tolerable threshold\. This aligns directly with the goals of the EIML workshop, which emphasizes epistemic intelligence—the ability of models to reason about their own knowledge and limitations\. Our framework provides formal guarantees for this reasoning, grounded in the well\-established theories of PAC\-Bayesian generalization and imprecise probability\.

## References

- Alquier [2024] Pierre Alquier. A user-friendly introduction to PAC-Bayes bounds. *arXiv preprint arXiv:2211.03053*, 2024.
- Angelopoulos and Bates [2023] Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction: A framework for distribution-free uncertainty quantification. 2023.
- Ben-David et al. [2010] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. A theory of learning from different domains. *Machine Learning*, 79:151–175, 2010.
- Catoni [2007] Olivier Catoni. *PAC-Bayesian supervised classification: The thermodynamics of statistical learning*. Lecture Notes in Mathematics, 2007.
- Corani et al. [2022] Giorgio Corani, Alessandro Antonucci, and Marco Zaffalon. Classification. pages 215–254, 2022.
- Destercke et al. [2008] Sébastien Destercke, Didier Dubois, and Eric Chojnacki. Specificity in imprecise probabilistic models. In *Proceedings of the IPMU2008 Conference*, 2008.
- Germain et al. [2016] Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-Bayesian theory meets Bayesian inference. In *Advances in Neural Information Processing Systems*, volume 29, 2016.
- Germain et al. [2013] Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In *Proceedings of the 30th International Conference on Machine Learning*, pages 768–776, 2013.
- Gibbs and Candès [2021] Isaac Gibbs and Emmanuel Candès. Adaptive conformal inference under distribution shift. *Proceedings of the National Academy of Sciences*, 118(43), 2021.
- Gretton et al. [2012] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. *Journal of Machine Learning Research*, 13:723–773, 2012.
- Hüllermeier and Waegeman [2021] Eyke Hüllermeier and Willem Waegeman. Uncertainty quantification in machine learning: One size does not fit all. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 35, pages 14082–14084, 2021.
- McAllester [1999] David McAllester. Some PAC-Bayesian theorems. *Machine Learning*, 37:355–363, 1999.
- Miranda and Zaffalon [2022] Enrique Miranda and Marco Zaffalon. Probability and statistics. pages 93–148, 2022.
- Muandet et al. [2017] Krik Muandet, Kenji Fukumizu, Bharath K. Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. *Foundations and Trends in Machine Learning*, 10(1-2):1–141, 2017.
- Niu et al. [2023] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. In *International Conference on Learning Representations*, 2023.
- Rivasplata et al. [2020] Omar Rivasplata, Pranjal Kamalaruban, Zoubin Ghahramani, and Emre Gözü. PAC-Bayes survey. *arXiv preprint arXiv:2010.00147*, 2020.
- Seeger [2002] Matthias Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. *Journal of Machine Learning Research*, 3:233–269, 2002.
- Sriperumbudur et al. [2009] Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Kernel choice and classifiability. In *Advances in Neural Information Processing Systems*, volume 22, 2009.
- Su et al. [2022] Yuhang Su, Zhi Liu, Yong Zhang, Xing Yong, Jie Cheng, Qingjie Zeng, and Zengfu Gao. Revisiting realistic test-time training: Sequential inference and adaptation by anchored clustering. In *Advances in Neural Information Processing Systems*, volume 35, 2022.
- Sutherland et al. [2017] Dougal J. Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Balaji Lakshminarayanan, and Arnaud Doucet. Generative models and model criticism via optimized maximum mean discrepancy. In *International Conference on Learning Representations*, 2017.
- Tolstikhin et al. [2017] Ilya Tolstikhin, Bharath K. Sriperumbudur, Krik Muandet, and Bernhard Schölkopf. Minimax estimation of kernel mean embeddings. *Journal of Machine Learning Research*, 18:1–47, 2017.
- Troffaes and Destercke [2023] Matthias C. M. Troffaes and Sébastien Destercke. *Introduction to Imprecise Probabilities*. Wiley, 2023.
- Walley [1991] Peter Walley. *Statistical Reasoning with Imprecise Probabilities*. Chapman and Hall, 1991.
- Wang et al. [2021] Dequan Wang, Evan Shelhamer, Fuxin Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In *International Conference on Learning Representations*, 2021.
- Yuan et al. [2023] Luyao Yuan, Yong Zhang, Xing Wang, and Liang Wang. Robust test-time adaptation in dynamic scenarios. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 10512–10521, 2023.
- Zhang et al. [2022a] Marvin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. In *Advances in Neural Information Processing Systems*, volume 35, 2022.
- Zhang et al. [2022b] Yue Zhang, Mingmin Chen, Xiyuxing Zhang, and Liang Wang. A survey on test-time adaptation under distribution shifts. *arXiv preprint arXiv:2210.05365*, 2022.

## Appendix A Proof of Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1)

We present the complete proof of the PAC-Bayesian bound with MMD shift penalty. The proof proceeds in three steps: (1) establishing the standard PAC-Bayesian source risk bound, (2) bounding the risk difference between target and source using MMD, and (3) combining via a union bound.

Full proof of Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1).

Step 1: PAC-Bayesian source risk bound. Let $\{(x_{i},y_{i})\}_{i=1}^{n}$ be i.i.d. samples from $P_{s}$. By the PAC-Bayesian theorem of McAllester (McAllester, 1999) and Germain et al. (Germain et al., 2016), for any prior $\pi$ and posterior $\rho$, the following holds with probability at least $1-\delta/2$ over the random sample:

$$R_{P_{s}}(\rho)\leq\hat{R}_{P_{s}}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(4\sqrt{n}/\delta)}{2n}}. \tag{12}$$

Here, $\hat{R}_{P_{s}}(\rho)=\mathbb{E}_{w\sim\rho}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(w,x_{i},y_{i})\right]$ is the empirical risk averaged over the posterior. The term $\sqrt{\mathrm{KL}(\rho\|\pi)/2n}$ is the standard PAC-Bayesian complexity penalty, and $\log(4\sqrt{n}/\delta)$ arises from the failure probability budget. The factor of $4\sqrt{n}$ (rather than $2\sqrt{n}$) appears because this step is run at level $\delta/2$, i.e. with half of the total failure probability budget: $\log\big(2\sqrt{n}/(\delta/2)\big)=\log(4\sqrt{n}/\delta)$.

Step 2: Bounding the risk difference via MMD. Under covariate shift, the conditional expected loss $L(w,x)=\mathbb{E}_{y\sim P(y\mid x)}[\ell(w,x,y)]$ is the same function of $x$ for both $P_{s}$ and $P_{t}$, since $P_{t}(y\mid x)=P_{s}(y\mid x)$. We bound the risk difference:

$$\begin{aligned}
|R_{P_{t}}(\rho)-R_{P_{s}}(\rho)| &= \left|\mathbb{E}_{w\sim\rho}\left[\mathbb{E}_{x\sim P_{t}}[L(w,x)]-\mathbb{E}_{x\sim P_{s}}[L(w,x)]\right]\right| \\
&\leq \mathbb{E}_{w\sim\rho}\left[\left|\mathbb{E}_{x\sim P_{t}}[L(w,x)]-\mathbb{E}_{x\sim P_{s}}[L(w,x)]\right|\right] \\
&= \mathbb{E}_{w\sim\rho}\left[\left|\langle\mu_{P_{t}}-\mu_{P_{s}},L(w,\cdot)\rangle_{\mathcal{H}}\right|\right] \\
&\leq \mathbb{E}_{w\sim\rho}\left[\|L(w,\cdot)\|_{\mathcal{H}}\cdot\|\mu_{P_{t}}-\mu_{P_{s}}\|_{\mathcal{H}}\right] \\
&\leq L_{\mathcal{H}}\cdot\|\mu_{P_{t}}-\mu_{P_{s}}\|_{\mathcal{H}} \\
&= L_{\mathcal{H}}\cdot\mathrm{MMD}(P_{s},P_{t}).
\end{aligned} \tag{13}$$

We justify each line:

- Line 1 to Line 2: Pull the absolute value inside the expectation over $w$ using Jensen's inequality for the convex absolute value function: $|\mathbb{E}[X]|\leq\mathbb{E}[|X|]$.
- Line 2 to Line 3: By the reproducing property of the RKHS. Since $L(w,\cdot)\in\mathcal{H}$ (Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1)), we have $\mathbb{E}_{x\sim P}[L(w,x)]=\mathbb{E}_{x\sim P}[\langle L(w,\cdot),\phi(x)\rangle_{\mathcal{H}}]=\langle L(w,\cdot),\mu_{P}\rangle_{\mathcal{H}}$. Therefore, $\mathbb{E}_{x\sim P_{t}}[L(w,x)]-\mathbb{E}_{x\sim P_{s}}[L(w,x)]=\langle L(w,\cdot),\mu_{P_{t}}-\mu_{P_{s}}\rangle_{\mathcal{H}}$.
- Line 3 to Line 4: Apply the Cauchy-Schwarz inequality in $\mathcal{H}$: $|\langle f,g\rangle_{\mathcal{H}}|\leq\|f\|_{\mathcal{H}}\cdot\|g\|_{\mathcal{H}}$.
- Line 4 to Line 5: By Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1), $\|L(w,\cdot)\|_{\mathcal{H}}\leq L_{\mathcal{H}}$ for all $w$ in the support of $\rho$. Pull this constant outside the expectation.
- Line 5 to Line 6: By definition, $\|\mu_{P_{t}}-\mu_{P_{s}}\|_{\mathcal{H}}=\mathrm{MMD}(P_{s},P_{t})$.
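The chain in Eq. 13 can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: the two distributions are replaced by empirical samples (so MMD is the distance between empirical mean embeddings), the RKHS function is a hypothetical finite expansion $f=\sum_i a_i k(z_i,\cdot)$ standing in for $L(w,\cdot)$, and the names `rbf_gram` and `risk_gap_and_bound` are made up for this example.

```python
import numpy as np


def rbf_gram(a, b, gamma):
    """RBF Gram matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def risk_gap_and_bound(xs, xt, centers, coef, gamma):
    """Return (|E_t f - E_s f|, ||f||_H * MMD) for f = sum_i coef_i k(centers_i, .).

    Expectations are taken under the empirical distributions of xs and xt.
    """
    f_s = rbf_gram(xs, centers, gamma) @ coef  # f evaluated on source points
    f_t = rbf_gram(xt, centers, gamma) @ coef  # f evaluated on target points
    gap = abs(f_t.mean() - f_s.mean())
    # ||f||_H^2 = coef^T K_zz coef for a finite kernel expansion
    f_norm = np.sqrt(coef @ rbf_gram(centers, centers, gamma) @ coef)
    # squared distance between the two empirical mean embeddings
    mmd_sq = (rbf_gram(xs, xs, gamma).mean() + rbf_gram(xt, xt, gamma).mean()
              - 2.0 * rbf_gram(xs, xt, gamma).mean())
    return gap, f_norm * np.sqrt(max(mmd_sq, 0.0))


rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(200, 3))      # hypothetical source features
xt = rng.normal(0.5, 1.0, size=(150, 3))      # hypothetical shifted target features
centers = rng.normal(0.0, 1.0, size=(10, 3))  # expansion points defining f
coef = rng.normal(size=10)                    # expansion coefficients of f
gap, bound = risk_gap_and_bound(xs, xt, centers, coef, gamma=0.5)
print(f"gap = {gap:.4f}, ||f||_H * MMD = {bound:.4f}")
```

For empirical measures the inequality is exact (it is Cauchy-Schwarz in $\mathcal{H}$), so the printed gap should never exceed the printed bound up to floating-point error.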

Step 3: Union bound. The inequality of Step 2 is deterministic, so the only random event is the source bound of Step 1, which holds with probability at least $1-\delta/2\geq 1-\delta$; the remaining $\delta/2$ of the budget is not needed for this population-level statement. Dropping the absolute value in Eq. [13](https://arxiv.org/html/2605.21783#A1.E13) gives:

$$R_{P_{t}}(\rho)\leq R_{P_{s}}(\rho)+L_{\mathcal{H}}\cdot\mathrm{MMD}(P_{s},P_{t}). \tag{14}$$

Substituting Eq. [12](https://arxiv.org/html/2605.21783#A1.E12) into Eq. [14](https://arxiv.org/html/2605.21783#A1.E14):

$$R_{P_{t}}(\rho)\leq\hat{R}_{P_{s}}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(4\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\mathrm{MMD}(P_{s},P_{t}). \tag{15}$$

Because Step 2 consumes none of the failure probability, Step 1 can be invoked at level $\delta$ instead of $\delta/2$, which tightens the $\log(4\sqrt{n}/\delta)$ term to $\log(2\sqrt{n}/\delta)$ (cf. Germain et al., 2016) and yields the stated result (Eq. [5](https://arxiv.org/html/2605.21783#S4.E5)).
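To get a feel for the magnitudes in Eq. 15, the bound can be evaluated for given inputs. This is a minimal sketch: the function names and the numbers in the usage lines are hypothetical, and `log_const` is an illustrative knob that switches between the $\log(4\sqrt{n}/\delta)$ and $\log(2\sqrt{n}/\delta)$ variants.

```python
import math


def pac_bayes_complexity(kl, n, delta, log_const=2.0):
    """sqrt((KL + log(log_const * sqrt(n) / delta)) / (2n))."""
    if n <= 0 or not 0.0 < delta < 1.0 or kl < 0.0:
        raise ValueError("need n > 0, 0 < delta < 1 and KL >= 0")
    return math.sqrt((kl + math.log(log_const * math.sqrt(n) / delta)) / (2.0 * n))


def target_risk_bound(emp_risk, kl, n, delta, l_h, mmd, log_const=2.0):
    """Empirical source risk + complexity term + L_H * MMD shift penalty."""
    return emp_risk + pac_bayes_complexity(kl, n, delta, log_const) + l_h * mmd


# Hypothetical values, chosen only to show the relative size of each term.
args = dict(emp_risk=0.10, kl=5.0, n=10_000, delta=0.05, l_h=1.0, mmd=0.05)
print("log(4 sqrt(n)/delta) variant:", round(target_risk_bound(log_const=4.0, **args), 4))
print("log(2 sqrt(n)/delta) variant:", round(target_risk_bound(log_const=2.0, **args), 4))
```

The two variants differ only through an additive $\log 2$ inside the square root, which is negligible for moderately large $n$.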

## Appendix B Proof of Theorem [3](https://arxiv.org/html/2605.21783#Thmtheorem3)

We present the complete proof of the finite-sample PAC-Bayesian bound. The key additional ingredient beyond Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1) is the concentration of the unbiased MMD estimator.

Full proof of Theorem [3](https://arxiv.org/html/2605.21783#Thmtheorem3).

Preliminaries: MMD estimation. Given $m$ source samples $\{x_{i}^{s}\}_{i=1}^{m}\sim P_{s}$ and $n$ target samples $\{x_{j}^{t}\}_{j=1}^{n}\sim P_{t}$, the unbiased MMD estimator is:

$$\widehat{\mathrm{MMD}}_{u}^{2}=\frac{1}{m(m-1)}\sum_{i\neq j}k_{\theta}(x_{i}^{s},x_{j}^{s})+\frac{1}{n(n-1)}\sum_{i\neq j}k_{\theta}(x_{i}^{t},x_{j}^{t})-\frac{2}{mn}\sum_{i,j}k_{\theta}(x_{i}^{s},x_{j}^{t}). \tag{16}$$

This estimator satisfies $\mathbb{E}[\widehat{\mathrm{MMD}}_{u}^{2}]=\mathrm{MMD}^{2}(P_{s},P_{t})$.
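Eq. 16 translates directly into a few lines of NumPy. The sketch below assumes that the rows of `xs` and `xt` are already the feature vectors on which the RBF kernel acts, and the function names are illustrative rather than taken from any released code.

```python
import numpy as np


def rbf_gram(a, b, gamma):
    """RBF Gram matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2_unbiased(xs, xt, gamma):
    """Unbiased estimate of MMD^2 between source rows xs and target rows xt."""
    m, n = len(xs), len(xt)
    if m < 2 or n < 2:
        raise ValueError("need at least two samples from each distribution")
    k_ss = rbf_gram(xs, xs, gamma)
    k_tt = rbf_gram(xt, xt, gamma)
    k_st = rbf_gram(xs, xt, gamma)
    # Within-sample sums exclude the diagonal (i != j); the cross term keeps all pairs.
    term_s = (k_ss.sum() - np.trace(k_ss)) / (m * (m - 1))
    term_t = (k_tt.sum() - np.trace(k_tt)) / (n * (n - 1))
    return term_s + term_t - 2.0 * k_st.mean()


rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(300, 2))
same = rng.normal(0.0, 1.0, size=(300, 2))     # drawn from the source distribution
shifted = rng.normal(1.0, 1.0, size=(300, 2))  # mean-shifted target
print("no shift:  ", mmd2_unbiased(source, same, gamma=0.5))
print("mean shift:", mmd2_unbiased(source, shifted, gamma=0.5))
```

Because the estimator is unbiased, it can come out slightly negative when the two samples share a distribution; clip at zero before taking a square root.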

Concentration of the MMD estimator. We use the concentration result for the biased MMD estimator, then relate it to the unbiased estimator. For a kernel bounded in $[0,1]$ (satisfied by our RBF kernel with bandwidth parameter $\gamma$), Sutherland et al. (Sutherland et al., 2017) proved that for the biased MMD estimator $\widehat{\mathrm{MMD}}_{b}^{2}$ (which uses $1/m^{2}$ and $1/n^{2}$ normalization):

$$\Pr\left[\left|\widehat{\mathrm{MMD}}_{b}^{2}-\mathrm{MMD}^{2}\right|\geq t\right]\leq 2\exp\left(-\frac{\min(m,n)\cdot t^{2}}{2}\right). \tag{17}$$

Tolstikhin et al. (Tolstikhin et al., 2017) established that the same rate holds (up to constants) for the unbiased estimator. Using the relation between biased and unbiased estimators (the bias is $O(1/m+1/n)$), we obtain for the square-root MMD:

$$\Pr\left[\left|\widehat{\mathrm{MMD}}_{u}-\mathrm{MMD}\right|>\varepsilon\right]\leq 2\exp\left(-\frac{\min(m,n)\cdot\varepsilon^{2}}{2}\right). \tag{18}$$

Union bound over two events. Define the following two events:

- $E_{1}$: The PAC-Bayesian bound holds, i.e., $R_{P_{s}}(\rho)\leq\hat{R}_{P_{s}}(\rho)+\sqrt{(\mathrm{KL}(\rho\|\pi)+\log(4\sqrt{n}/\delta))/(2n)}$. This holds w.p. $\geq 1-\delta/2$.
- $E_{2}$: The MMD estimator concentrates, i.e., $|\widehat{\mathrm{MMD}}_{u}-\mathrm{MMD}|\leq\varepsilon_{m,n}$, where $\varepsilon_{m,n}$ is chosen so that $\Pr[E_{2}^{c}]\leq\delta/2$.

Setting the right-hand side of Eq. [18](https://arxiv.org/html/2605.21783#A2.E18) equal to $\delta/2$:

$$2\exp\left(-\frac{\min(m,n)\cdot\varepsilon_{m,n}^{2}}{2}\right)=\frac{\delta}{2}\implies\varepsilon_{m,n}=\sqrt{\frac{2\log(4/\delta)}{\min(m,n)}}. \tag{19}$$

Both events hold simultaneously with probability at least $1-\delta$:

$$\Pr[E_{1}\cap E_{2}]\geq 1-\Pr[E_{1}^{c}]-\Pr[E_{2}^{c}]\geq 1-\frac{\delta}{2}-\frac{\delta}{2}=1-\delta. \tag{20}$$
Substitution. On $E_{2}$, we have $\mathrm{MMD}(P_{s},P_{t})\leq\widehat{\mathrm{MMD}}_{u}+\varepsilon_{m,n}$. Substituting this into the population bound (Eq. [5](https://arxiv.org/html/2605.21783#S4.E5)):

$$\begin{aligned}
R_{P_{t}}(\rho) &\leq \hat{R}_{P_{s}}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\mathrm{MMD}(P_{s},P_{t}) \\
&\leq \hat{R}_{P_{s}}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\left(\widehat{\mathrm{MMD}}_{u}+\varepsilon_{m,n}\right) \\
&\leq \hat{R}_{P_{s}}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(4\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\left(\widehat{\mathrm{MMD}}_{u}+\varepsilon_{m,n}\right),
\end{aligned} \tag{21}$$

where the last line uses $\log(4\sqrt{n}/\delta)=\log(2\sqrt{n}/\delta)+\log 2$, absorbing the additive $\log 2$ term into the PAC-Bayesian term to yield the stated result (Eq. [6](https://arxiv.org/html/2605.21783#S5.E6)). The last line is exactly what $E_{1}$ delivers at level $\delta/2$, so the final inequality holds on $E_{1}\cap E_{2}$, i.e. with probability at least $1-\delta$.
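The finite-sample bound can be evaluated once an MMD estimate is available. The following sketch rests on assumptions: `n_pac` is a hypothetical name for the sample size in the PAC-Bayesian term (the appendix writes $n$ for both that size and the number of target samples), and `mmd_hat` is assumed to be the square root of a non-negative MMD² estimate.

```python
import math


def mmd_deviation(m, n, delta):
    """eps_{m,n} = sqrt(2 log(4/delta) / min(m, n)), as in Eq. 19."""
    if min(m, n) <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need m, n > 0 and 0 < delta < 1")
    return math.sqrt(2.0 * math.log(4.0 / delta) / min(m, n))


def finite_sample_bound(emp_risk, kl, n_pac, delta, l_h, mmd_hat, m, n):
    """Right-hand side of Eq. 21: source bound at level delta/2 plus inflated MMD penalty."""
    complexity = math.sqrt((kl + math.log(4.0 * math.sqrt(n_pac) / delta)) / (2.0 * n_pac))
    return emp_risk + complexity + l_h * (max(mmd_hat, 0.0) + mmd_deviation(m, n, delta))


# Hypothetical deployment: 5000 source samples, a growing buffer of target samples.
for n_target in (50, 200, 1000, 5000):
    bound = finite_sample_bound(emp_risk=0.10, kl=5.0, n_pac=5000, delta=0.05,
                                l_h=1.0, mmd_hat=0.05, m=5000, n=n_target)
    print(n_target, round(mmd_deviation(5000, n_target, 0.05), 4), round(bound, 4))
```

With few target samples the deviation term $\varepsilon_{m,n}$ dominates the shift penalty; it shrinks at rate $1/\sqrt{\min(m,n)}$ as the buffer grows.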

Remark on the concentration rate. The rate $\varepsilon_{m,n}=O(1/\sqrt{\min(m,n)})$ is minimax-optimal for kernel mean embedding estimation (Tolstikhin et al., 2017). This means that no estimator can achieve a faster rate uniformly over the class of distributions with bounded kernel moments. In practice, the bound tightens as more target samples arrive during deployment, providing progressively tighter epistemic uncertainty estimates.

## Appendix C Proofs for Credal Set Results

### C.1 Proof of Lemma [6](https://arxiv.org/html/2605.21783#Thmtheorem6) (Convexity and Closure)

Convexity. Let $Q_{1},Q_{2}\in\mathcal{C}_{\varepsilon}(P_{s})$. We need to show that $Q_{\lambda}=\lambda Q_{1}+(1-\lambda)Q_{2}\in\mathcal{C}_{\varepsilon}(P_{s})$ for any $\lambda\in[0,1]$.

By linearity of the kernel mean embedding:

$$\mu_{Q_{\lambda}}=\int\phi(x)\,dQ_{\lambda}(x)=\lambda\int\phi(x)\,dQ_{1}(x)+(1-\lambda)\int\phi(x)\,dQ_{2}(x)=\lambda\mu_{Q_{1}}+(1-\lambda)\mu_{Q_{2}}. \tag{22}$$

Therefore:

$$\begin{aligned}
\mathrm{MMD}^{2}(P_{s},Q_{\lambda}) &= \|\mu_{P_{s}}-\mu_{Q_{\lambda}}\|_{\mathcal{H}}^{2}=\|\mu_{P_{s}}-\lambda\mu_{Q_{1}}-(1-\lambda)\mu_{Q_{2}}\|_{\mathcal{H}}^{2} \\
&= \|\lambda(\mu_{P_{s}}-\mu_{Q_{1}})+(1-\lambda)(\mu_{P_{s}}-\mu_{Q_{2}})\|_{\mathcal{H}}^{2}.
\end{aligned} \tag{23}$$

Applying the triangle inequality to Eq. [23](https://arxiv.org/html/2605.21783#A3.E23):

$$\begin{aligned}
\|\mu_{P_{s}}-\mu_{Q_{\lambda}}\|_{\mathcal{H}} &\leq \lambda\|\mu_{P_{s}}-\mu_{Q_{1}}\|_{\mathcal{H}}+(1-\lambda)\|\mu_{P_{s}}-\mu_{Q_{2}}\|_{\mathcal{H}} \\
&= \lambda\cdot\mathrm{MMD}(P_{s},Q_{1})+(1-\lambda)\cdot\mathrm{MMD}(P_{s},Q_{2}) \\
&\leq \lambda\varepsilon+(1-\lambda)\varepsilon=\varepsilon.
\end{aligned} \tag{24}$$

The inequalities $\mathrm{MMD}(P_{s},Q_{1})\leq\varepsilon$ and $\mathrm{MMD}(P_{s},Q_{2})\leq\varepsilon$ follow from $Q_{1},Q_{2}\in\mathcal{C}_{\varepsilon}(P_{s})$. Squaring both sides of Eq. [24](https://arxiv.org/html/2605.21783#A3.E24) gives $\mathrm{MMD}^{2}(P_{s},Q_{\lambda})\leq\varepsilon^{2}$, establishing $Q_{\lambda}\in\mathcal{C}_{\varepsilon}(P_{s})$.
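The convexity argument can be illustrated with weighted empirical measures, for which the mixture $\lambda Q_{1}+(1-\lambda)Q_{2}$ is simply the pooled support with rescaled weights. This is a sketch on synthetic data; `weighted_mmd`, `mixture` and `uniform` are hypothetical helper names.

```python
import numpy as np


def rbf_gram(a, b, gamma):
    """RBF Gram matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def weighted_mmd(x, wx, y, wy, gamma):
    """MMD between the discrete measures sum_i wx_i delta_{x_i} and sum_j wy_j delta_{y_j}."""
    sq = (wx @ rbf_gram(x, x, gamma) @ wx
          + wy @ rbf_gram(y, y, gamma) @ wy
          - 2.0 * wx @ rbf_gram(x, y, gamma) @ wy)
    return float(np.sqrt(max(sq, 0.0)))


def uniform(points):
    """Uniform weights over the rows of `points`."""
    return np.full(len(points), 1.0 / len(points))


def mixture(x1, x2, lam):
    """Support and weights of lam * Q1 + (1 - lam) * Q2 for empirical Q1, Q2."""
    pts = np.vstack([x1, x2])
    w = np.concatenate([lam * uniform(x1), (1.0 - lam) * uniform(x2)])
    return pts, w


rng = np.random.default_rng(0)
ps = rng.normal(0.0, 1.0, size=(120, 2))   # stand-in for the source distribution
q1 = rng.normal(0.3, 1.0, size=(80, 2))    # first member of the ball
q2 = rng.normal(-0.4, 1.2, size=(100, 2))  # second member of the ball
gamma = 0.5
d1 = weighted_mmd(ps, uniform(ps), q1, uniform(q1), gamma)
d2 = weighted_mmd(ps, uniform(ps), q2, uniform(q2), gamma)
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    pts, w = mixture(q1, q2, lam)
    d_mix = weighted_mmd(ps, uniform(ps), pts, w, gamma)
    print(f"lambda={lam:.2f}  MMD(mixture)={d_mix:.4f}  convex bound={lam * d1 + (1 - lam) * d2:.4f}")
```

Any radius $\varepsilon\geq\max(d_1,d_2)$ therefore contains every mixture of the two, which is the content of Eq. 24.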

Weak closure. Let $\{Q_{n}\}_{n=1}^{\infty}$ be a sequence in $\mathcal{C}_{\varepsilon}(P_{s})$ converging weakly to $Q$, i.e., $\int g\,dQ_{n}\to\int g\,dQ$ for all bounded continuous functions $g$. We need to show $Q\in\mathcal{C}_{\varepsilon}(P_{s})$, i.e., $\mathrm{MMD}(P_{s},Q)\leq\varepsilon$.

Since $k$ is bounded and continuous (for the RBF kernel, $k\in[0,1]$ and is continuous), and $\phi(x)=k(x,\cdot)$, we have that $\langle\phi(x),f\rangle_{\mathcal{H}}=f(x)$ for all $f\in\mathcal{H}$. In particular, for any fixed $z\in\mathcal{X}$:

$$\langle\mu_{Q_{n}},\phi(z)\rangle_{\mathcal{H}}=\int k(x,z)\,dQ_{n}(x)\to\int k(x,z)\,dQ(x)=\langle\mu_{Q},\phi(z)\rangle_{\mathcal{H}}. \tag{25}$$

Since $\{\phi(z):z\in\mathcal{X}\}$ spans a dense subset of $\mathcal{H}$ (by the reproducing property) and $\|\mu_{Q_{n}}\|_{\mathcal{H}}\leq 1$ for all $n$, Eq. [25](https://arxiv.org/html/2605.21783#A3.E25) shows that $\mu_{Q_{n}}$ converges to $\mu_{Q}$ weakly in $\mathcal{H}$. The norms converge as well: for separable $\mathcal{X}$ (such as $\mathbb{R}^{d}$), $Q_{n}\otimes Q_{n}\to Q\otimes Q$ weakly, and $k$ is bounded and continuous on $\mathcal{X}\times\mathcal{X}$, so $\|\mu_{Q_{n}}\|_{\mathcal{H}}^{2}=\iint k\,dQ_{n}\,dQ_{n}\to\iint k\,dQ\,dQ=\|\mu_{Q}\|_{\mathcal{H}}^{2}$. Weak convergence together with convergence of norms implies $\mu_{Q_{n}}\to\mu_{Q}$ in $\mathcal{H}$. By continuity of the norm:

$$\mathrm{MMD}(P_{s},Q)=\|\mu_{P_{s}}-\mu_{Q}\|_{\mathcal{H}}=\lim_{n\to\infty}\|\mu_{P_{s}}-\mu_{Q_{n}}\|_{\mathcal{H}}=\lim_{n\to\infty}\mathrm{MMD}(P_{s},Q_{n})\leq\varepsilon. \tag{26}$$

The final inequality holds because $Q_{n}\in\mathcal{C}_{\varepsilon}(P_{s})$ for all $n$.

### C.2 Proof of Proposition [7](https://arxiv.org/html/2605.21783#Thmtheorem7) (Worst-Case Risk)

Full proof of Proposition [7](https://arxiv.org/html/2605.21783#Thmtheorem7). Fix any $Q\in\mathcal{C}_{\varepsilon}(P_{s})$. By definition, $\mathrm{MMD}(P_{s},Q)\leq\varepsilon$. Under Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1) and covariate shift (which also holds for the pair $(P_{s},Q)$ since the conditional $P(y\mid x)$ is the same), Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1) gives, w.p. $\geq 1-\delta$:

$$\begin{aligned}
R_{Q}(\rho) &\leq \hat{R}_{P_{s}}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\mathrm{MMD}(P_{s},Q) \\
&\leq \hat{R}_{P_{s}}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\varepsilon.
\end{aligned} \tag{27}$$

The key observation is that the right-hand side of Eq. [27](https://arxiv.org/html/2605.21783#A3.E27) does not depend on the specific choice of $Q$; it depends only on $\varepsilon$, which is fixed. Moreover, the only random event behind Eq. [27](https://arxiv.org/html/2605.21783#A3.E27) is the source-sample PAC-Bayesian bound (Step 1 of the proof of Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1)), which does not involve $Q$, so a single high-probability event serves every $Q$ at once. Therefore, the same bound holds uniformly over $Q$:

$$\sup_{Q\in\mathcal{C}_{\varepsilon}(P_{s})}R_{Q}(\rho)\leq\hat{R}_{P_{s}}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\varepsilon. \tag{28}$$

### C.3 Proof of Corollary [9](https://arxiv.org/html/2605.21783#Thmtheorem9) (Risk Imprecision Interval)

Full proof of Corollary [9](https://arxiv.org/html/2605.21783#Thmtheorem9).

Upper risk bound. The upper risk $\overline{R}_{\varepsilon}(\rho)=\sup_{Q\in\mathcal{C}_{\varepsilon}(P_{s})}R_{Q}(\rho)$ is bounded by Proposition [7](https://arxiv.org/html/2605.21783#Thmtheorem7):

$$\overline{R}_{\varepsilon}(\rho)\leq\hat{R}_{P_{s}}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\varepsilon. \tag{29}$$

Lower risk bound. We use the PAC-Bayesian lower bound of Germain et al. (Germain et al., 2016), which states that w.p. $\geq 1-\delta$:

$$R_{P_{s}}(\rho)\geq\hat{R}_{P_{s}}(\rho)-\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}. \tag{30}$$

Now, for any $Q\in\mathcal{C}_{\varepsilon}(P_{s})$, the MMD risk transfer (Step 2 of Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1)) gives:

$$R_{Q}(\rho)\geq R_{P_{s}}(\rho)-|R_{Q}(\rho)-R_{P_{s}}(\rho)|\geq R_{P_{s}}(\rho)-L_{\mathcal{H}}\cdot\mathrm{MMD}(P_{s},Q)\geq R_{P_{s}}(\rho)-L_{\mathcal{H}}\cdot\varepsilon. \tag{31}$$

Combining Eq. [30](https://arxiv.org/html/2605.21783#A3.E30) and Eq. [31](https://arxiv.org/html/2605.21783#A3.E31):

$$R_{Q}(\rho)\geq\hat{R}_{P_{s}}(\rho)-\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}-L_{\mathcal{H}}\cdot\varepsilon. \tag{32}$$

Since Eq. [32](https://arxiv.org/html/2605.21783#A3.E32) holds for every $Q\in\mathcal{C}_{\varepsilon}(P_{s})$, it also holds for the infimum:

$$\underline{R}_{\varepsilon}(\rho)=\inf_{Q\in\mathcal{C}_{\varepsilon}(P_{s})}R_{Q}(\rho)\geq\hat{R}_{P_{s}}(\rho)-\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}-L_{\mathcal{H}}\cdot\varepsilon. \tag{33}$$

Imprecision width. Subtracting the right-hand side of Eq. [33](https://arxiv.org/html/2605.21783#A3.E33) from that of Eq. [29](https://arxiv.org/html/2605.21783#A3.E29):

$$\begin{aligned}
\overline{R}_{\varepsilon}(\rho)-\underline{R}_{\varepsilon}(\rho) &\leq \left(\hat{R}_{P_{s}}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+L_{\mathcal{H}}\cdot\varepsilon\right) \\
&\quad -\left(\hat{R}_{P_{s}}(\rho)-\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}-L_{\mathcal{H}}\cdot\varepsilon\right) \\
&= 2\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2\sqrt{n}/\delta)}{2n}}+2L_{\mathcal{H}}\cdot\varepsilon.
\end{aligned} \tag{34}$$

Note that Eq. [29](https://arxiv.org/html/2605.21783#A3.E29) and Eq. [33](https://arxiv.org/html/2605.21783#A3.E33) each hold with probability at least $1-\delta$, so by a union bound they hold jointly, and with them Eq. [34](https://arxiv.org/html/2605.21783#A3.E34), with probability at least $1-2\delta$; running each at level $\delta/2$ restores a joint level of $1-\delta$ at the cost of replacing $\log(2\sqrt{n}/\delta)$ with $\log(4\sqrt{n}/\delta)$. This completes the proof.
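The interval of Eqs. 29, 33 and 34 is cheap to compute, which makes it easy to see how the width responds to the credal radius. A minimal sketch with hypothetical inputs follows; `risk_interval` is an illustrative name, and the lower end is left unclipped even though it can fall below zero for a non-negative loss.

```python
import math


def risk_interval(emp_risk, kl, n, delta, l_h, eps):
    """Return (lower, upper, width) of the risk imprecision interval."""
    complexity = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    lower = emp_risk - complexity - l_h * eps  # bound on the lower risk, Eq. 33
    upper = emp_risk + complexity + l_h * eps  # bound on the upper risk, Eq. 29
    return lower, upper, upper - lower


# Hypothetical numbers: the width grows linearly in the credal radius eps.
for eps in (0.0, 0.05, 0.1, 0.2):
    lower, upper, width = risk_interval(emp_risk=0.10, kl=5.0, n=10_000,
                                        delta=0.05, l_h=1.0, eps=eps)
    print(f"eps={eps:.2f}  interval=[{lower:.4f}, {upper:.4f}]  width={width:.4f}")
```

At $\varepsilon=0$ the remaining width is purely the sampling term $2\sqrt{\cdot}$, which shrinks with $n$; the $2L_{\mathcal{H}}\varepsilon$ part does not shrink with more source data.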

## Appendix D Proof of Proposition [10](https://arxiv.org/html/2605.21783#Thmtheorem10)

Full proof of Proposition [10](https://arxiv.org/html/2605.21783#Thmtheorem10).

Step 1: Geodesic distance in the RKHS. For the RBF kernel $k_{\theta}(x,y)=\exp(-\gamma\|f_{\theta}(x)-f_{\theta}(y)\|^{2})$ on the feature space $\mathbb{R}^{d}$, the geodesic distance $d_{k}(x,y)$ is the length of the shortest path between $x$ and $y$ on the manifold induced by the kernel. For nearby points (small $\|f_{\theta}(x)-f_{\theta}(y)\|$), the geodesic distance admits a local linear expansion.

To derive this, note that the pullback Riemannian metric induced by the RKHS inner product on $\mathbb{R}^{d}$ is given by the metric tensor $G$. For the Gaussian kernel, the induced metric tensor $G(x)$ has eigenvalues that scale with $\gamma$. Specifically, for a point $x$ and a nearby point $y=x+\delta$ with $\|\delta\|\leq\bar{\epsilon}$:

$$d_{k}(x,y)=\sqrt{2\gamma}\,\|f_{\theta}(x)-f_{\theta}(y)\|+O(\bar{\epsilon}^{2}). \tag{35}$$

This can be verified in feature coordinates $u=f_{\theta}(x)$, $v=f_{\theta}(y)$: the metric tensor of the Gaussian kernel is $G_{ij}(u)=\partial_{u_{i}}\partial_{v_{j}}k(u,v)\big|_{v=u}=2\gamma\,\delta_{ij}$, or equivalently $\|\phi(u)-\phi(v)\|_{\mathcal{H}}^{2}=2-2e^{-\gamma\|u-v\|^{2}}=2\gamma\|u-v\|^{2}+O(\|u-v\|^{4})$. The leading-order term of the geodesic distance in this metric is therefore $\sqrt{2\gamma}$ times the Euclidean distance in feature space.
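The $\sqrt{2\gamma}$ scaling in Eq. 35 can be checked numerically. As a stated assumption, the sketch below does not compute a geodesic; it uses the chordal RKHS distance $\|\phi(u)-\phi(v)\|_{\mathcal{H}}=\sqrt{2-2k(u,v)}$ between feature vectors as a proxy, which for nearby points agrees with $\sqrt{2\gamma}\,\|u-v\|$ to leading order. Both function names are illustrative.

```python
import numpy as np


def rkhs_chord_distance(u, v, gamma):
    """||phi(u) - phi(v)||_H = sqrt(2 - 2 exp(-gamma ||u - v||^2)) for the RBF kernel."""
    r2 = float(np.sum((np.asarray(u) - np.asarray(v)) ** 2))
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return float(np.sqrt(-2.0 * np.expm1(-gamma * r2)))


def linearised_distance(u, v, gamma):
    """Leading-order term sqrt(2 gamma) * ||u - v|| from Eq. 35."""
    return float(np.sqrt(2.0 * gamma) * np.linalg.norm(np.asarray(u) - np.asarray(v)))


gamma = 2.0
u = np.zeros(3)
for radius in (1.0, 0.3, 0.1, 0.03, 0.01):
    v = np.array([radius, 0.0, 0.0])
    chord = rkhs_chord_distance(u, v, gamma)
    lin = linearised_distance(u, v, gamma)
    print(f"r={radius:<5} chord={chord:.6f}  linear={lin:.6f}  rel.err={(lin - chord) / lin:.2e}")
```

The relative error shrinks roughly like $\gamma r^{2}/4$, so the linear approximation is only trustworthy when $\gamma\|u-v\|^{2}$ is small, which is the local regime the proof assumes.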

Step 2: Taking expectations. Let $\bar{d}_{P}(x_{i})=\mathbb{E}_{y\sim P}[d_{k}(x_{i},y)]$ denote the expected geodesic distance from anchor $x_{i}$ under distribution $P$. From Eq. [35](https://arxiv.org/html/2605.21783#A4.E35):

$$\bar{d}_{P}(x_{i})=\sqrt{2\gamma}\,\mathbb{E}_{y\sim P}[\|f_{\theta}(x_{i})-f_{\theta}(y)\|]+O(\bar{\epsilon}^{2}). \tag{36}$$

Step 3: Bounding the expectation difference. Define the mean feature vector under $P$ as $\bar{f}_{P}=\mathbb{E}_{y\sim P}[f_{\theta}(y)]=W\cdot\mu_{P}^{\phi}$, where $\mu_{P}^{\phi}$ is the kernel mean embedding with respect to the feature map $\phi_{\theta}$.

By the reverse triangle inequality:

$$\begin{aligned}
\left|\|f_{\theta}(x_{i})-\bar{f}_{P_{s}}\|-\|f_{\theta}(x_{i})-\bar{f}_{P_{t}}\|\right| &\leq \|\bar{f}_{P_{t}}-\bar{f}_{P_{s}}\| \\
&= \|W\cdot(\mu_{P_{t}}^{\phi}-\mu_{P_{s}}^{\phi})\| \\
&\leq \|W\|_{\mathrm{op}}\cdot\|\mu_{P_{t}}^{\phi}-\mu_{P_{s}}^{\phi}\|_{\mathcal{H}} \\
&\leq C_{W}\cdot\|\mu_{P_{t}}-\mu_{P_{s}}\|_{\mathcal{H}}=C_{W}\cdot\mathrm{MMD}(P_{s},P_{t}).
\end{aligned} \tag{37}$$

The third line is the operator-norm inequality. The fourth line uses Assumption [2](https://arxiv.org/html/2605.21783#Thmassumption2) ($\|W\|_{\mathrm{op}}\leq C_{W}$, which is why it is an inequality rather than an equality) together with the definition of the kernel mean embedding and the fact that $\|\mu_{P_{t}}^{\phi}-\mu_{P_{s}}^{\phi}\|_{\mathcal{H}}=\|\mu_{P_{t}}-\mu_{P_{s}}\|_{\mathcal{H}}$, since $\mu_{P}^{\phi}=\mu_{P}$ when $\phi_{\theta}$ is the feature map.

Step 4: From mean distance to expected distance. We need to relate $\mathbb{E}_{y\sim P}[\|f_{\theta}(x_{i})-f_{\theta}(y)\|]$ to $\|f_{\theta}(x_{i})-\bar{f}_{P}\|$. Applying the reverse triangle inequality pointwise in $y$, $\big|\|f_{\theta}(x_{i})-f_{\theta}(y)\|-\|f_{\theta}(x_{i})-\bar{f}_{P}\|\big|\leq\|f_{\theta}(y)-\bar{f}_{P}\|$, and then taking expectations with $|\mathbb{E}[X]|\leq\mathbb{E}[|X|]$ gives:

$$\left|\mathbb{E}_{y\sim P}[\|f_{\theta}(x_{i})-f_{\theta}(y)\|]-\|f_{\theta}(x_{i})-\bar{f}_{P}\|\right|\leq\mathbb{E}_{y\sim P}\left[\|f_{\theta}(y)-\bar{f}_{P}\|\right], \tag{38}$$

where the right-hand side measures the spread of the feature distribution and is bounded (by Jensen's inequality) by the standard deviation of $f_{\theta}(y)$ under $P$, which is $O(\bar{\epsilon})$ in the local neighbourhood.

Combining with Eq. [37](https://arxiv.org/html/2605.21783#A4.E37) and Eq. [36](https://arxiv.org/html/2605.21783#A4.E36):

$$\begin{aligned}
|\bar{d}_{P_{s}}(x_{i})-\bar{d}_{P_{t}}(x_{i})| &= \sqrt{2\gamma}\left|\mathbb{E}_{y\sim P_{s}}[\|f_{\theta}(x_{i})-f_{\theta}(y)\|]-\mathbb{E}_{y\sim P_{t}}[\|f_{\theta}(x_{i})-f_{\theta}(y)\|]\right|+O(\bar{\epsilon}^{2}) \\
&\leq \sqrt{2\gamma}\cdot C_{W}\cdot\mathrm{MMD}(P_{s},P_{t})+O(\bar{\epsilon}^{2}).
\end{aligned} \tag{39}$$

This yields the stated result (Eq. [11](https://arxiv.org/html/2605.21783#S7.E11)).

## Appendix E Detailed Discussion of Assumptions

### E.1 Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1): RKHS-Lipschitz Loss

Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1) requires that the conditional expected loss function $L(w,\cdot):\mathcal{X}\to\mathbb{R}$ belongs to the RKHS $\mathcal{H}$ with bounded norm. We discuss three perspectives on this assumption:

(a) Softmax cross-entropy. For a model with softmax outputs $\sigma(z)=(e^{z_{1}}/\sum_{j}e^{z_{j}},\ldots,e^{z_{K}}/\sum_{j}e^{z_{j}})$ and cross-entropy loss $\ell(w,x,y)=-\log\sigma_{y}(z(x))$, the gradient $\nabla_{z}\log\sigma_{y}(z)=e_{y}-\sigma(z)$ is bounded, with $\|\nabla_{z}\log\sigma_{y}(z)\|_{\infty}\leq 1$ and $\|\nabla_{z}\log\sigma_{y}(z)\|_{2}\leq\sqrt{2}$ (the Lipschitz property of the log-softmax). When the feature map $\phi(x)$ is the neural network embedding and $k$ is a universal RBF kernel on the embedding space, the universality of $k$ ensures that $L(w,\cdot)$ can be approximated arbitrarily well in $\mathcal{H}$. However, universality alone does not guarantee a bounded RKHS norm: the norm depends on the smoothness of $L(w,\cdot)$ relative to the kernel bandwidth.

(b) RKHS norm estimation. The RKHS norm $\|L(w,\cdot)\|_{\mathcal{H}}$ can be estimated empirically via kernel ridge regression: given source features $\{x_{i}\}$ and loss values $\{L_{i}=\ell(w,x_{i},y_{i})\}$, fit $\hat{L}_{w}=\arg\min_{f\in\mathcal{H}}\sum_{i}(f(x_{i})-L_{i})^{2}+\lambda\|f\|_{\mathcal{H}}^{2}$. The RKHS norm of the solution is $\|\hat{L}_{w}\|_{\mathcal{H}}^{2}=\hat{\alpha}^{\top}K\hat{\alpha}$, where $\hat{\alpha}=(K+\lambda I)^{-1}L$ and $K$ is the kernel Gram matrix. This provides a data-driven way to verify the assumption and estimate $L_{\mathcal{H}}$.
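The estimate in (b) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `rbf_gram` and `krr_rkhs_norm`, the RBF kernel, the default values of `gamma` and `lam`, and the synthetic data are all chosen for the example.

```python
import numpy as np

def rbf_gram(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) (assumed RBF kernel)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip round-off negatives

def krr_rkhs_norm(features, losses, gamma=1.0, lam=1e-2):
    """Return sqrt(alpha^T K alpha) with alpha = (K + lam * I)^{-1} L."""
    K = rbf_gram(features, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), losses)
    return float(np.sqrt(max(alpha @ K @ alpha, 0.0)))

# Synthetic stand-ins for source features and per-sample losses (illustrative only).
rng = np.random.default_rng(0)
features = rng.standard_normal((50, 3))
losses = np.abs(rng.standard_normal(50))
norm_estimate = krr_rkhs_norm(features, losses)
```

A larger `lam` shrinks the fitted function, so the estimate depends on the regularization strength and is probably best read as a diagnostic rather than a certified value of $L_{\mathcal{H}}$.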

(c) Relaxation to average boundedness. A weaker but more practical version replaces $\|L(w,\cdot)\|_{\mathcal{H}}\leq L_{\mathcal{H}}$ with $\mathbb{E}_{w\sim\rho}[\|L(w,\cdot)\|_{\mathcal{H}}]\leq L_{\mathcal{H}}$. This holds whenever the posterior concentrates on models with well-behaved loss functions and is significantly easier to verify empirically (it requires only the posterior-averaged norm). The proof of Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1) extends directly to this setting by applying Jensen's inequality to move the expectation inside the norm bound.

### E.2 Assumption [2](https://arxiv.org/html/2605.21783#Thmassumption2): Bounded Feature Map

Assumption [2](https://arxiv.org/html/2605.21783#Thmassumption2) requires the encoder to decompose as $f_{\theta}(x)=W\cdot\phi_{\theta}(x)$ with bounded operator norm. We discuss three settings where this holds:

(i) Neural tangent kernel (NTK) regime. In the infinite-width limit of neural networks, the NTK is fixed at initialization and the feature map $\phi_{\theta}$ converges to a deterministic function. In this regime, $W$ is effectively the output layer, and spectral normalization of the output weights bounds $\|W\|_{\mathrm{op}}$.

(ii) Explicit MMD regularization. If the model is trained with an MMD regularization term $\lambda\cdot\mathrm{MMD}^{2}(P_{s},P_{t})$, the optimization implicitly constrains the feature map to lie in a well-behaved subset of the RKHS, providing control over the operator norm of the linear component.

(iii) Spectral normalization. For standard architectures (e.g., ResNet-50), applying spectral normalization to all weight matrices bounds the operator norm of each layer. Since the composition of bounded linear maps has bounded operator norm (by submultiplicativity), this provides a bound on $\|W\|_{\mathrm{op}}$ for the overall network. In practice, ResNet-50 with spectral normalization typically achieves $\|W\|_{\mathrm{op}}\leq 10$–$20$ across layers, suggesting approximate satisfaction of this assumption.
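The submultiplicativity argument in (iii) can be checked numerically for purely linear layers. The sketch below is an illustration only: the layer shapes, the random weights, and the helper names `spectral_norm` and `product_bound` are assumptions, and no nonlinearities or real ResNet weights are involved.

```python
import numpy as np

def spectral_norm(W):
    """Operator (spectral) norm: the largest singular value of W."""
    return float(np.linalg.norm(W, ord=2))

def product_bound(weights):
    """Product of per-layer spectral norms, an upper bound on the composed map."""
    return float(np.prod([spectral_norm(W) for W in weights]))

# Hypothetical three-layer linear stack: 32 -> 16 -> 8 -> 4.
rng = np.random.default_rng(0)
layers = [
    rng.standard_normal((16, 32)),
    rng.standard_normal((8, 16)),
    rng.standard_normal((4, 8)),
]
composed = layers[2] @ layers[1] @ layers[0]

actual = spectral_norm(composed)  # norm of the end-to-end linear map
bound = product_bound(layers)     # submultiplicative upper bound
```

The product bound is usually loose, so in a setting like this one it may be worth comparing it against the norm of the composed map where that is computable.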

## Appendix F Connection to Conformal Prediction

Conformal prediction (Gibbs and Candès, 2021; Angelopoulos and Bates, 2023) provides distribution-free prediction sets $C(x)$ satisfying the marginal coverage guarantee $\Pr_{(x,y)\sim P}(y\in C(x))\geq 1-\alpha$ under minimal assumptions (exchangeability of the data). Under covariate shift, Gibbs and Candès (2021) showed that conformal methods can be adapted by weighting the non-conformity scores, maintaining valid coverage.

However, conformal prediction and our credal set framework address fundamentally different types of uncertainty:

- Conformal prediction quantifies predictive uncertainty: the width of the prediction set $C(x)$ reflects uncertainty about the label $y$ for a given input $x$. This is primarily aleatoric in nature (though it can also capture some estimation uncertainty).
- Credal set width $\varepsilon$ quantifies distributional epistemic uncertainty: how much the target distribution $P_{t}$ may differ from the source $P_{s}$ at the population level. This is purely epistemic: it reflects limitations in our knowledge of the data-generating process.

Proposed combination. We propose combining both frameworks by using the credal width to adapt the conformal coverage level. Define an adaptive coverage function:

$$
\alpha(\varepsilon)=\alpha_{0}+g(\varepsilon),
\tag{40}
$$

where $\alpha_{0}$ is the base coverage level (e.g., $\alpha_{0}=0.1$) and $g:\mathbb{R}_{+}\to[0,1-\alpha_{0}]$ is a monotonically non-decreasing function calibrated from the PAC-Bayesian bound. The intuition is straightforward: when the credal set is narrow ($\varepsilon\approx 0$, low epistemic uncertainty), standard coverage $\alpha_{0}$ suffices because we are confident the target is close to the source. When the credal set widens ($\varepsilon$ large, high epistemic uncertainty), coverage should increase to compensate for the additional distributional uncertainty.

Calibration of $g(\varepsilon)$. A principled choice for $g$ can be derived from our PAC-Bayesian bound. Specifically, set:

$$
g(\varepsilon)=\min\left\{1-\alpha_{0},\,\frac{\hat{R}_{P_{s}}(\rho)+\frac{L_{\mathcal{H}}\cdot\varepsilon}{\sqrt{\mathrm{KL}/2n}}}{1+L_{\mathcal{H}}\cdot\varepsilon}\right\},
\tag{41}
$$

which scales the coverage increase proportionally to the fraction of the upper risk attributable to the shift penalty. This ensures that the conformal prediction sets widen precisely when the epistemic uncertainty (as measured by the credal set) contributes significantly to the overall risk bound. Formal analysis of the coverage properties of this combined approach is an important direction for future work.
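To make Eqs. 40 and 41 concrete, the following sketch evaluates them literally for given inputs. It rests on assumptions: $\mathrm{KL}/2n$ is read as $\mathrm{KL}/(2n)$, the function names `g_shift` and `adaptive_alpha` and all numeric inputs are hypothetical, and the code does not address how $\alpha(\varepsilon)$ should enter a conformal procedure.

```python
import math

def g_shift(eps, emp_risk, l_h, kl, n, alpha0):
    """Evaluate Eq. 41 as written, reading KL/2n as KL/(2n) (an assumption)."""
    if kl <= 0 or n <= 0:
        raise ValueError("kl and n must be positive for sqrt(KL/(2n)) to be usable")
    scale = math.sqrt(kl / (2.0 * n))
    ratio = (emp_risk + l_h * eps / scale) / (1.0 + l_h * eps)
    return min(1.0 - alpha0, ratio)

def adaptive_alpha(eps, emp_risk, l_h, kl, n, alpha0=0.1):
    """Evaluate Eq. 40: alpha(eps) = alpha0 + g(eps)."""
    return alpha0 + g_shift(eps, emp_risk, l_h, kl, n, alpha0)

# Hypothetical inputs: empirical source risk 0.05, L_H = 1, KL = 5, n = 1000.
params = dict(emp_risk=0.05, l_h=1.0, kl=5.0, n=1000)
alphas = [adaptive_alpha(eps, **params) for eps in (0.0, 0.01, 0.1, 100.0)]
```

With these inputs, $g(0)$ equals the empirical source risk and the clipping at $1-\alpha_{0}$ becomes active for large $\varepsilon$.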

## Appendix G RKHS and MMD Background

### G.1 Reproducing Kernel Hilbert Spaces

A reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ on a set $\mathcal{X}$ is a Hilbert space of functions $f:\mathcal{X}\to\mathbb{R}$ equipped with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ satisfying the reproducing property: for every $f\in\mathcal{H}$ and every $x\in\mathcal{X}$,

$$
f(x)=\langle f,k(x,\cdot)\rangle_{\mathcal{H}},
\tag{42}
$$

where $k(x,\cdot)\in\mathcal{H}$ is called the reproducing kernel. The Moore-Aronszajn theorem establishes a one-to-one correspondence between positive definite kernels and RKHSes: given a positive definite kernel $k$, there exists a unique RKHS for which $k$ is the reproducing kernel.

A kernel $k$ is called *characteristic* if the kernel mean embedding map $\mu:P\mapsto\mu_{P}$ is injective. For characteristic kernels, $\mathrm{MMD}(P,Q)=0$ if and only if $P=Q$, making MMD a proper metric on the space of probability distributions. Universal kernels (such as the Gaussian/RBF kernel on compact subsets of $\mathbb{R}^{d}$) are characteristic.

### G.2 Properties of MMD

The maximum mean discrepancy satisfies several useful properties that we exploit throughout this paper:

(1) Metric properties. MMD is a pseudometric on the space of probability distributions. With a characteristic kernel, it is a proper metric, satisfying $\mathrm{MMD}(P,Q)=0\Leftrightarrow P=Q$, symmetry ($\mathrm{MMD}(P,Q)=\mathrm{MMD}(Q,P)$), and the triangle inequality.

(2) Bilinear form. The squared MMD can be expressed as a bilinear form:

$$
\mathrm{MMD}^{2}(P,Q)=\mathbb{E}_{x,x^{\prime}\sim P}[k(x,x^{\prime})]+\mathbb{E}_{y,y^{\prime}\sim Q}[k(y,y^{\prime})]-2\,\mathbb{E}_{x\sim P,y\sim Q}[k(x,y)].
\tag{43}
$$

(3) Connection to integral probability metrics. MMD is a special case of the integral probability metric (IPM) with function class $\mathcal{F}_{\mathcal{H}}=\{f\in\mathcal{H}:\|f\|_{\mathcal{H}}\leq 1\}$:

$$
\mathrm{MMD}(P,Q)=\sup_{f\in\mathcal{F}_{\mathcal{H}}}\left|\mathbb{E}_{x\sim P}[f(x)]-\mathbb{E}_{y\sim Q}[f(y)]\right|.
\tag{44}
$$

This IPM representation is the key ingredient in our proof of Theorem [1](https://arxiv.org/html/2605.21783#Thmtheorem1): Assumption [1](https://arxiv.org/html/2605.21783#Thmassumption1) ensures that $L(w,\cdot)/L_{\mathcal{H}}\in\mathcal{F}_{\mathcal{H}}$, allowing us to apply the supremum representation to bound $|R_{P_{t}}(\rho)-R_{P_{s}}(\rho)|$.

(4) Concentration. For kernels bounded in $[0,B]$, the unbiased MMD estimator satisfies the following concentration inequality (Sutherland et al., 2017):

$$
\Pr\left[\left|\widehat{\mathrm{MMD}}_{u}-\mathrm{MMD}\right|>\varepsilon\right]\leq 2\exp\left(-\frac{c\cdot\min(m,n)\cdot\varepsilon^{2}}{B^{2}}\right),
\tag{45}
$$

where $c$ is an absolute constant. This is the concentration result used in the proof of Theorem [3](https://arxiv.org/html/2605.21783#Thmtheorem3).
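The bilinear form in Eq. 43 leads to the standard unbiased U-statistic estimate of $\mathrm{MMD}^{2}$, which averages kernel values over distinct pairs. The sketch below illustrates that estimator on synthetic Gaussian samples. The function names, the choice `gamma = 0.5`, and the data are assumptions for the example, and the kernel acts on raw inputs rather than on learned features.

```python
import numpy as np

def rbf_gram(X, Y, gamma):
    """Cross Gram matrix K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Y ** 2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2_unbiased(X, Y, gamma):
    """Unbiased estimate of MMD^2: within-sample sums exclude the diagonal."""
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf_gram(X, X, gamma), rbf_gram(Y, Y, gamma), rbf_gram(X, Y, gamma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return float(term_x + term_y - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
source = rng.standard_normal((200, 2))
same = rng.standard_normal((200, 2))           # drawn from the same distribution
shifted = rng.standard_normal((200, 2)) + 2.0  # mean-shifted target

mmd2_same = mmd2_unbiased(source, same, gamma=0.5)
mmd2_shift = mmd2_unbiased(source, shifted, gamma=0.5)
```

Because the estimator is unbiased, it can come out slightly negative when the two samples share a distribution; taking a square root to obtain an MMD value therefore needs care in that regime.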

### G.3 Learned Kernels for TTA

In the TTA setting, we use a learned kernel $k_{\theta}(x,y)=\exp(-\gamma\|f_{\theta}(x)-f_{\theta}(y)\|^{2})$, where $f_{\theta}$ is the encoder of the pretrained model. This choice has several advantages:

- Task-adaptive: The kernel is adapted to the task through the encoder, capturing task-relevant similarity structure.
- Bounded: $k_{\theta}\in[0,1]$, which ensures the concentration results of Appendix B apply directly.
- Universal: For fixed $f_{\theta}$ with full-rank Jacobian almost everywhere, the Gaussian kernel on the feature space is universal, hence characteristic.

The kernel bandwidth parameter $\gamma$ controls the resolution of the MMD comparison. In practice, $\gamma$ can be set via the median heuristic or cross-validated on source data.
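One way to apply the median heuristic to the parameterization $\exp(-\gamma\|\cdot\|^{2})$ is sketched below. It assumes the common convention $\gamma=1/(2\,\mathrm{med}^{2})$, where $\mathrm{med}$ is the median pairwise distance between encoded features; other conventions exist, and the paper does not state which one it uses. The function names are hypothetical, and the inputs are assumed to be already-encoded features $f_{\theta}(x)$.

```python
import numpy as np

def median_heuristic_gamma(features):
    """gamma = 1 / (2 * median^2) over pairwise distances (assumed convention)."""
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
    upper = np.triu_indices(len(features), k=1)  # each unordered pair once
    med = np.median(dists[upper])
    return float(1.0 / (2.0 * med ** 2))

def learned_kernel(fx, fy, gamma):
    """k_theta evaluated on two already-encoded feature vectors."""
    return float(np.exp(-gamma * np.sum((fx - fy) ** 2)))

# Toy one-dimensional features with pairwise distances 1, 2 and 3 (median 2).
toy = np.array([[0.0], [1.0], [3.0]])
gamma = median_heuristic_gamma(toy)
k_value = learned_kernel(toy[0], toy[1], gamma)
```

For the toy features the median distance is 2, so the sketch returns $\gamma=1/8$.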

## Appendix H PAC-Bayesian Preliminaries

### H.1 The PAC-Bayesian Framework

PAC-Bayesian analysis provides data-dependent generalization bounds that hold uniformly over a family of posteriors. The framework was introduced by McAllester (1999) and has since been extensively developed (Seeger, 2002; Catoni, 2007; Germain et al., 2016; Rivasplata et al., 2020; Alquier, 2024).

The key objects are:

- A *prior* $\pi$ over hypothesis/parameter space, chosen before seeing the data.
- A *posterior* $\rho$ over hypothesis/parameter space, chosen after seeing the data.
- The KL divergence $\mathrm{KL}(\rho\|\pi)=\mathbb{E}_{w\sim\rho}[\log(\rho(w)/\pi(w))]\geq 0$, which measures the complexity of the adaptation.

The classical PAC-Bayesian theorem states: for $n$ i.i.d. samples and any $\delta>0$, with probability at least $1-\delta$ over the sample, simultaneously for all posteriors $\rho$:

$$
R_{P}(\rho)\leq\hat{R}_{P}(\rho)+\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(n/\delta)}{2(n-1)}}.
\tag{46}
$$

The key strength of PAC-Bayesian bounds is their uniformity: they hold for all $\rho$ simultaneously, including posteriors that depend on the data. This makes them ideal for adaptation settings where the posterior is chosen after observing test data.
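The right-hand side of Eq. 46 is straightforward to evaluate once $\mathrm{KL}(\rho\|\pi)$ is available. The sketch below assumes, purely for illustration, that prior and posterior are diagonal Gaussians so that the KL term has a closed form. The function names and all numeric values are hypothetical.

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(rho || pi) for diagonal Gaussians rho = N(mu_q, var_q), pi = N(mu_p, var_p)."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

def pac_bayes_upper(emp_risk, kl, n, delta):
    """Right-hand side of Eq. 46."""
    return emp_risk + math.sqrt((kl + math.log(n / delta)) / (2.0 * (n - 1)))

# Hypothetical two-parameter posterior and a standard-normal prior.
kl = kl_diag_gaussians([0.1, -0.2], [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
bound = pac_bayes_upper(emp_risk=0.1, kl=kl, n=1000, delta=0.05)
```

The complexity term shrinks at rate roughly $1/\sqrt{n}$, so for a fixed KL the bound tightens as more samples become available.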

### H.2 PAC-Bayesian Lower Bound

Germain et al. (2016) also established a complementary lower bound: with probability at least $1-\delta$:

$$
R_{P}(\rho)\geq\hat{R}_{P}(\rho)-\sqrt{\frac{\mathrm{KL}(\rho\|\pi)+\log(2n/\delta)}{2(n-1)}}.
\tag{47}
$$

This lower bound is essential for our lower-upper risk decomposition (Corollary [9](https://arxiv.org/html/2605.21783#Thmtheorem9)), as it provides a lower bound on the best-case risk within the credal set. Without the lower bound, we could only upper-bound the worst-case risk but could not quantify the precision of our uncertainty estimates.
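Taken together, Eqs. 46 and 47 bracket the true risk. The sketch below evaluates both sides for given inputs. The function name `pac_bayes_interval` and the numeric values are hypothetical, and the shift penalty from the credal set is deliberately left out.

```python
import math

def pac_bayes_interval(emp_risk, kl, n, delta):
    """Return (lower, upper) from Eq. 47 and Eq. 46, without any shift penalty."""
    upper_slack = math.sqrt((kl + math.log(n / delta)) / (2.0 * (n - 1)))
    lower_slack = math.sqrt((kl + math.log(2.0 * n / delta)) / (2.0 * (n - 1)))
    return emp_risk - lower_slack, emp_risk + upper_slack

# Hypothetical values: empirical risk 0.1, KL = 5, n = 1000, delta = 0.05.
lower, upper = pac_bayes_interval(emp_risk=0.1, kl=5.0, n=1000, delta=0.05)
width = upper - lower
```

Each bound holds with probability at least $1-\delta$ on its own; by a union bound, both hold simultaneously with probability at least $1-2\delta$.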

### H.3 Connection to Domain Adaptation

Germain et al. (2013) derived PAC-Bayesian bounds for domain adaptation using the $\mathcal{H}$-divergence (Ben-David et al., 2010) between source and target domains. Their bound has the form:

$$
R_{P_{t}}(\rho)\leq\hat{R}_{P_{s}}(\rho)+\tfrac{1}{2}d_{\mathcal{H}}(P_{s},P_{t})+\text{complexity terms},
\tag{48}
$$

where $d_{\mathcal{H}}(P_{s},P_{t})$ is the $\mathcal{H}$-divergence. Our work differs in that we use MMD (which is computable in polynomial time, unlike the $\mathcal{H}$-divergence, which is NP-hard to estimate), provide a finite-sample version, and, most importantly, interpret the shift penalty through the lens of credal sets and imprecise probability.
