Variational Inference for Evidential Deep Learning

arXiv cs.LG Papers

Summary

A mathematically principled framework, Variational Inference Evidential Deep Learning (VI-EDL), is proposed to address limitations in conventional Evidential Deep Learning by reformulating it through variational inference, deriving an Evidence Lower Bound, establishing a generalization bound, and achieving state-of-the-art performance on visual and medical datasets.

arXiv:2605.26477v1 Announce Type: new Abstract: While Deep Neural Networks (DNNs) achieve remarkable performance, they tend to produce overconfident predictions. Evidential Deep Learning (EDL) mitigates this by formulating predictions as a Dirichlet distribution over class probabilities to explicitly quantify epistemic uncertainty. However, we found that conventional EDL suffers from two fundamental limitations: a Kullback-Leibler (KL) penalty that only suppresses the evidence of negative classes, which produces excessively high evidence and therefore decreases the model's ability to quantify uncertainty, and the absence of a theoretical guarantee for setting the Dirichlet parameter $\alpha=e+1$. In this paper, we propose a mathematically principled framework, Variational Inference Evidential Deep Learning (VI-EDL). By reformulating evidential learning through the lens of variational inference, we derive an Evidence Lower Bound (ELBO), which prevents the evidence from growing excessively. Theoretically, we rigorously establish a generalization bound and reveal how the predicted uncertainty, feature and network complexity affect this bound, and why setting $\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$ can minimize it. Extensive experiments on standard visual and medical datasets demonstrate that VI-EDL achieves state-of-the-art performance, showing excellent performance in out-of-distribution detection, noise detection and autonomous driving scenarios. The code is available at https://github.com/seutjw/VI-EDL.


# Variational Inference for Evidential Deep Learning
Source: [https://arxiv.org/html/2605.26477](https://arxiv.org/html/2605.26477)
Jiawei Tang, Xinyan Du, Hui Liu, Junhui Hou, and Yuheng Jia

Jiawei Tang and Xinyan Du are with the School of Computer Science and Engineering, Southeast University, Nanjing 210096, China. Hui Liu is with the School of Computing Information Sciences, Saint Francis University, Hong Kong SAR, China. Junhui Hou is with the Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China. Yuheng Jia is with the School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, and also with the Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China.

###### Abstract

While Deep Neural Networks (DNNs) achieve remarkable performance, they tend to produce overconfident predictions. Evidential Deep Learning (EDL) mitigates this by formulating predictions as a Dirichlet distribution over class probabilities to explicitly quantify epistemic uncertainty. However, we found that conventional EDL suffers from two fundamental limitations: a Kullback-Leibler (KL) penalty that only suppresses the evidence of negative classes, which produces excessively high evidence and therefore decreases the model's ability to quantify uncertainty, and the absence of a theoretical guarantee for setting the Dirichlet parameter $\alpha=e+1$. In this paper, we propose a mathematically principled framework, Variational Inference Evidential Deep Learning (VI-EDL). By reformulating evidential learning through the lens of variational inference, we derive an Evidence Lower Bound (ELBO), which prevents the evidence from growing excessively. Theoretically, we rigorously establish a generalization bound and reveal how the predicted uncertainty, feature and network complexity affect this bound, and why setting $\boldsymbol{\alpha}=\mathbf{e}+\mathbf{1}$ can minimize it. Extensive experiments on standard visual and medical datasets demonstrate that VI-EDL achieves state-of-the-art performance, showing excellent performance in out-of-distribution detection, noise detection and autonomous driving scenarios. The code is available at https://github.com/seutjw/VI-EDL.

## I Introduction

Deep Neural Networks (DNNs) have achieved remarkable success in a wide range of predictive tasks [[22](https://arxiv.org/html/2605.26477#bib.bib1),[15](https://arxiv.org/html/2605.26477#bib.bib2)]. However, their tendency to produce overconfident point estimates [[14](https://arxiv.org/html/2605.26477#bib.bib3),[28](https://arxiv.org/html/2605.26477#bib.bib4)], even when encountering out-of-distribution (OOD) samples [[16](https://arxiv.org/html/2605.26477#bib.bib5),[34](https://arxiv.org/html/2605.26477#bib.bib6),[23](https://arxiv.org/html/2605.26477#bib.bib30),[26](https://arxiv.org/html/2605.26477#bib.bib31),[25](https://arxiv.org/html/2605.26477#bib.bib32)], remains a critical bottleneck for deployment in safety-critical domains like autonomous driving [[18](https://arxiv.org/html/2605.26477#bib.bib7),[10](https://arxiv.org/html/2605.26477#bib.bib8),[27](https://arxiv.org/html/2605.26477#bib.bib25)] and biomedical applications [[9](https://arxiv.org/html/2605.26477#bib.bib9)]. While Bayesian Neural Networks [[4](https://arxiv.org/html/2605.26477#bib.bib29)] and MC-Dropout [[11](https://arxiv.org/html/2605.26477#bib.bib28)] offer some solutions, they typically require computationally expensive multiple sampling. To address this, Evidential Deep Learning (EDL) [[30](https://arxiv.org/html/2605.26477#bib.bib10)] has emerged as a promising paradigm. Instead of applying a standard softmax function to output deterministic probabilities, EDL restructures the prediction pipeline into an evidence-based chain. Specifically, EDL reformulates the standard classification paradigm by estimating a class probability distribution rather than providing a single point estimate.

EDL models the predicted class probability distribution as a Dirichlet distribution [[17](https://arxiv.org/html/2605.26477#bib.bib11),[7](https://arxiv.org/html/2605.26477#bib.bib12)]. Given an input $\mathbf{x}$, a neural network $f_{\theta}$ is employed to extract a non-negative evidence vector $\mathbf{e}=[e_{1},\dots,e_{K}]\geq 0$ for $K$ classes, typically utilizing an activation function such as Softplus. This evidence vector is then used to parameterize the Dirichlet distribution $\text{Dir}(\mathbf{p}|\boldsymbol{\alpha})$, where $\boldsymbol{\alpha}=[\alpha_{1},...,\alpha_{K}]$ are the Dirichlet parameters, with each element defined as $\alpha_{k}=e_{k}+1,\forall k\in[1,2,...,K]$, and the expected class probability vector is $\mathbf{p}=[\hat{p}_{1},...,\hat{p}_{K}]$. Accordingly, the uncertainty is quantified by $u=K/S$ and the expected class probabilities are $\hat{p}_{k}=\alpha_{k}/S$, where $k\in[1,2,...,K]$ and $S=\sum_{k=1}^{K}\alpha_{k}$.
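The evidence-to-uncertainty chain described above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the example logits are assumptions made here, not part of the paper or its released code.

```python
import math

def softplus(z):
    """Numerically stable softplus: log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dirichlet_summary(logits):
    """Map raw network outputs (hypothetical logits) to evidence,
    Dirichlet parameters, expected probabilities and u = K / S."""
    evidence = [softplus(z) for z in logits]   # e_k >= 0
    alpha = [e + 1.0 for e in evidence]        # alpha_k = e_k + 1
    strength = sum(alpha)                      # S = sum_k alpha_k
    probs = [a / strength for a in alpha]      # p_hat_k = alpha_k / S
    uncertainty = len(alpha) / strength        # u = K / S
    return evidence, alpha, probs, uncertainty

# One class with strong evidence versus almost no evidence at all.
confident = dirichlet_summary([8.0, -4.0, -4.0])
vague = dirichlet_summary([-4.0, -4.0, -4.0])
print("u (confident):", confident[3])
print("u (vague):    ", vague[3])
```

With almost no evidence, $S$ stays close to $K$ and the uncertainty approaches 1; strong evidence for a single class drives $u$ toward 0.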

To train the model, standard EDL minimizes an expected Mean Squared Error (MSE) risk. This term measures the expected discrepancy between the ground-truth label vector $\mathbf{y}=[y_{1},...,y_{K}]\in\{0,1\}^{K}$ and the expected class probability vector $\mathbf{p}$. By evaluating the expectation over the simplex, this objective implicitly minimizes the variance of the predictions. Furthermore, to prevent the model from generating misleading evidence for incorrect classes, a Kullback-Leibler (KL) divergence term is introduced. It penalizes the divergence between the predicted distribution (excluding the evidence of the true class) and a flat Dirichlet distribution $\text{Dir}(\mathbf{p}|\mathbf{1})$, where $\mathbf{1}$ represents an all-one vector.

The overall loss function of standard EDL is formulated as:

$$\mathcal{L}_{EDL}=\underbrace{\mathbb{E}_{\mathbf{p}\sim\text{Dir}(\boldsymbol{\alpha})}\left[\|\mathbf{y}-\mathbf{p}\|_{2}^{2}\right]}_{\text{Expected MSE}}+\lambda_{t}\cdot\underbrace{D_{KL}\left(\text{Dir}(\mathbf{p}|\tilde{\boldsymbol{\alpha}})\parallel\text{Dir}(\mathbf{p}|\mathbf{1})\right)}_{\text{KL divergence penalty}},\tag{1}$$

where $\lambda_{t}$ is an annealing coefficient that gradually ramps up the KL penalty during the initial training epochs, and $\tilde{\boldsymbol{\alpha}}=\mathbf{y}+(\mathbf{1}-\mathbf{y})\odot\boldsymbol{\alpha}$ represents the modified Dirichlet parameters after masking out the true class evidence, where $\mathbf{1}\in\mathbb{R}^{K}$ represents an all-one vector.

Despite its conceptual elegance, the conventional EDL framework suffers from two drawbacks:

- In EDL, the KL divergence term only suppresses the evidence of negative classes while leaving the positive class unconstrained. This inherently motivates the network to take an optimization shortcut, blindly inflating feature and weight magnitudes to produce excessively high target evidence. Driven by the MSE term, the network tends to act as a numerical amplifier and inevitably maps even non-semantic random noise to massive spurious evidence. This directly decreases the model's ability to quantify uncertainty.
- EDL maps the generated evidence $\mathbf{e}$ to the Dirichlet parameters $\boldsymbol{\alpha}$ using a heuristically defined function $\boldsymbol{\alpha}=\mathbf{e}+\mathbf{1}$. Despite its empirical effectiveness, this formulation lacks a principled mathematical justification.
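The first drawback can be checked numerically. The sketch below evaluates the standard closed form of $D_{KL}(\text{Dir}(\tilde{\boldsymbol{\alpha}})\parallel\text{Dir}(\mathbf{1}))$ from Eq. (1) for two hypothetical predictions that differ only in their true-class evidence. The `digamma` helper is a hand-written approximation (recurrence plus asymptotic series) used here to stay within the standard library; all names and numbers are illustrative assumptions.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x upward by recurrence, then use the
    asymptotic series ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def kl_to_uniform(alpha):
    """Closed-form KL( Dir(alpha) || Dir(1) )."""
    k = len(alpha)
    s = sum(alpha)
    return (math.lgamma(s) - sum(math.lgamma(a) for a in alpha)
            - math.lgamma(k)
            + sum((a - 1.0) * (digamma(a) - digamma(s)) for a in alpha))

def masked_alpha(alpha, y):
    """alpha_tilde = y + (1 - y) * alpha: the true-class entry is reset to 1."""
    return [yk + (1 - yk) * a for a, yk in zip(alpha, y)]

y = [1, 0, 0]
modest = [3.0, 1.5, 1.2]       # moderate true-class evidence
inflated = [3000.0, 1.5, 1.2]  # same negatives, inflated true-class evidence

penalty_modest = kl_to_uniform(masked_alpha(modest, y))
penalty_inflated = kl_to_uniform(masked_alpha(inflated, y))
print(penalty_modest, penalty_inflated)                 # identical penalties
print(kl_to_uniform(modest), kl_to_uniform(inflated))   # unmasked KL does react
```

Because the mask resets the true-class entry to 1 before the KL term is computed, the penalty cannot depend on how large the true-class evidence is, whereas a KL term over the unmasked parameters grows with it.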

In this work, we propose resolving these two problems by recasting EDL in a probabilistic framework. We transform the evidence extraction process into a variational inference (VI) [[3](https://arxiv.org/html/2605.26477#bib.bib13)] problem and derive a rigorous Evidence Lower Bound (ELBO) [[19](https://arxiv.org/html/2605.26477#bib.bib24)], restoring the Bayesian legitimacy of the EDL paradigm. Moreover, the form of the ELBO naturally dictates a global regularization across all classes, structurally preventing magnitude-induced overconfidence. In addition, we design a cosine prototype layer for the evidence network to further control the magnitude of the generated evidence.

Furthermore, our framework provides a definitive theoretical closure to the heuristic $\boldsymbol{\alpha}=\mathbf{e}+\mathbf{1}$ mapping. By imposing a generalized Dirichlet prior $\text{Dir}(\boldsymbol{\lambda})$ in the probabilistic model, our variational formulation rigorously derives the posterior parameterization as $\boldsymbol{\alpha}=\mathbf{e}+\boldsymbol{\lambda}$ via Bayesian conjugate updating. Consequently, we conduct a rigorous generalization bound analysis, theoretically proving that setting the prior parameter to $\boldsymbol{\lambda}=\mathbf{1}$ optimally minimizes the generalization error bound. This finding retrospectively validates the empirical design of standard EDL while firmly grounding it in statistical learning theory.

The main contributions of this work are summarized as follows:

- A Principled Variational Framework for EDL: We provide a novel probabilistic derivation of the evidential objective using Variational Inference (VI-EDL). By formulating the evidence generation as a conjugate update, we replace heuristic terms with a mathematically rigorous ELBO, which guarantees the theoretical soundness and interpretability of our model. Along with the cosine prototype layer, it also prevents the evidence of all classes from growing excessively.
- Theoretical Generalization Guarantees: We analytically derive a generalization bound of our proposed model and delve into several theoretical insights. In Section [IV-B](https://arxiv.org/html/2605.26477#S4.SS2), we prove that the predicted uncertainty (Insight 3), feature and network complexity (Insight 4) affect this bound, and setting $\boldsymbol{\alpha}=\mathbf{e}+\mathbf{1}$ can minimize it (Insight 2). Therefore, we fill the fundamental theoretical gap of standard EDL.
- Extensive Empirical Validation: We evaluate our framework on various benchmarks, including standard visual datasets and medical datasets. Experimental results demonstrate that VI-EDL significantly outperforms state-of-the-art evidential baselines, achieving superior performance in out-of-distribution (OOD) detection, noise detection, and autonomous driving scenarios.

The remainder of this paper is organized as follows. Section [II](https://arxiv.org/html/2605.26477#S2) briefly reviews the related literature on uncertainty estimation and evidential deep learning. Section [III](https://arxiv.org/html/2605.26477#S3) elaborates on the proposed Variational Inference Evidential Deep Learning (VI-EDL) framework, including the derivation of the Evidence Lower Bound and the design of the cosine prototype layer. Section [IV](https://arxiv.org/html/2605.26477#S4) provides a rigorous theoretical analysis of our approach, establishing a generalization bound and delving into its insights. Section [V](https://arxiv.org/html/2605.26477#S5) presents extensive experimental results, ablation studies, and robustness analyses across multiple benchmarks to demonstrate the superiority of the proposed method. Finally, Section [VI](https://arxiv.org/html/2605.26477#S6) concludes this paper and outlines potential avenues for future research.

## II Related Works

Evidential Deep Learning (EDL) has gained substantial traction as a scalable and deterministic approach to uncertainty quantification. Pioneered by Sensoy et al. [[30](https://arxiv.org/html/2605.26477#bib.bib10)], standard EDL leverages Subjective Logic to map DNN outputs to the parameters of a Dirichlet distribution, estimating both class probabilities and epistemic uncertainty. Following this paradigm, a line of research has emerged to improve evidential models. I-EDL [[8](https://arxiv.org/html/2605.26477#bib.bib15)] incorporates the Fisher information matrix to assess the informativeness of the evidence carried by samples. Re-EDL [[6](https://arxiv.org/html/2605.26477#bib.bib14)] treats the Dirichlet prior weight as an adjustable hyperparameter and provides a more flexible and simplified optimization objective. F-EDL [[36](https://arxiv.org/html/2605.26477#bib.bib16)] introduces a flexible Dirichlet distribution to capture a more expressive and adaptive representation of uncertainty.

Beyond theoretical advancements, the deterministic uncertainty quantification capability of EDL has facilitated its widespread adoption across various safety-critical domains. In the medical field, evidential models have been successfully applied to disease diagnosis and medical image analysis [[24](https://arxiv.org/html/2605.26477#bib.bib33),[1](https://arxiv.org/html/2605.26477#bib.bib34)], where reliable uncertainty estimates are crucial for clinical decision-making. Similarly, in autonomous driving systems, EDL has been extensively leveraged for robust perception and multi-modal sensor fusion [[13](https://arxiv.org/html/2605.26477#bib.bib35),[35](https://arxiv.org/html/2605.26477#bib.bib36)], effectively handling the information conflict and uncertainty arising from noisy, open-world environments. Furthermore, recent studies have extended evidential frameworks to bioinformatics and molecular property prediction [[32](https://arxiv.org/html/2605.26477#bib.bib37)], providing trustworthy predictions for complex domain data.

While these EDL methods demonstrate empirical progress, they share a critical vulnerability: failing to impose any restriction on the unbounded growth of positive class evidence. Our proposed VI-EDL resolves this magnitude-induced overconfidence. By deriving the evidential objective strictly from the Evidence Lower Bound (ELBO) and enforcing a cosine prototype evidence layer, our framework strictly bounds the maximum achievable evidence, ensuring the reliability of evidential models in complex environments.

## III Proposed Method

In this section, we reconstruct Evidential Deep Learning from the probabilistic perspective of Variational Inference (VI). We first formulate the classification task as a latent variable generative model and derive the Evidence Lower Bound (ELBO) objective. Subsequently, we introduce a distance-aware cosine prototype evidence network to generate the evidence.

### III-A Notations

Let $\mathcal{X}\subset\mathbb{R}^{n\times d}$ be the input space for a $K$-class classification task. Given an input $\mathbf{x}\in\mathcal{X}$, the ground-truth label is represented as a one-hot vector $\mathbf{y}=[y_{1},...,y_{K}]\in\{0,1\}^{K}$, where $y_{k}=1$ if the sample belongs to class $k$ and $y_{k}=0$ otherwise. Let $\mathbf{p}=[p_{1},\dots,p_{K}]$ represent a categorical class probability vector. To ensure valid probability semantics (summing to 1), $\mathbf{p}$ must reside on the $(K-1)$-dimensional unit simplex, i.e., $\sum_{k=1}^{K}p_{k}=1$ and $\forall k\in[1,2,...,K],p_{k}\geq 0$.

### III-B Variational Inference Framework for EDL

Variational Posterior and ELBO. From a probabilistic perspective, we treat the class probability distribution $\mathbf{p}$ as a latent variable. In a purely analytical Bayesian setting, we assign a Dirichlet prior over $\mathbf{p}$, denoted as $P(\mathbf{p})=\text{Dir}(\mathbf{p};\boldsymbol{\lambda})\propto\prod_{k=1}^{K}p_{k}^{\lambda_{k}-1}$. If one perfectly observes the categorical pseudo-counts $\mathbf{e}=[e_{1},\dots,e_{K}]$ (i.e., the evidence of each class, where $\forall k\in[1,2,...,K],e_{k}\geq 0$), the likelihood follows a Multinomial distribution $P(\mathbf{e}|\mathbf{p})\propto\prod_{k=1}^{K}p_{k}^{e_{k}}$. According to Bayes' theorem, the exact posterior is obtained by multiplying the likelihood and the prior, where the exponents add up:

$$P(\mathbf{p}|\mathbf{e})\propto P(\mathbf{e}|\mathbf{p})P(\mathbf{p})\propto\prod_{k=1}^{K}p_{k}^{(e_{k}+\lambda_{k})-1}.\tag{2}$$

This Dirichlet-Multinomial conjugacy dictates that the exact posterior is $P(\mathbf{p}|\mathbf{e})=\text{Dir}(\mathbf{p};\mathbf{e}+\boldsymbol{\lambda})$. However, in our actual classification task, these explicit pseudo-counts are latent; we can only observe the raw input $\mathbf{x}$, rendering the true posterior computationally intractable. We apply variational inference to solve this by defining an approximate posterior $q_{\phi}(\mathbf{p}|\mathbf{x})$. To structurally ground this approximation in Bayesian principles, we design its variational family to explicitly mimic the additive conjugate posterior update. Specifically, we employ a neural network $f_{\phi}$ to extract evidence $\mathbf{e}=f_{\phi}(\mathbf{x})\in\mathbb{R}^{K}$ from the input, then parameterize the variational posterior using this extracted evidence combined with the prior. It is defined as:

$$q_{\phi}(\mathbf{p}|\mathbf{x})=\text{Dir}(\mathbf{p};\boldsymbol{\alpha}),\quad\text{where }\boldsymbol{\alpha}=\mathbf{e}+\boldsymbol{\lambda}.\tag{3}$$
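The additive update behind Eqs. (2) and (3) can be illustrated with a toy example. The sketch assumes a flat prior and hand-picked pseudo-counts; the function names and numbers are hypothetical and only show that multiplying likelihood and prior adds their exponents.

```python
import math

def conjugate_update(prior, counts):
    """Dirichlet-Multinomial conjugacy: posterior parameters = counts + prior."""
    return [e + lam for e, lam in zip(counts, prior)]

def dirichlet_mean(alpha):
    """Expected class probabilities under Dir(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

def log_kernel(p, exponents_plus_one):
    """Unnormalised Dirichlet log-density: sum_k (a_k - 1) * log p_k."""
    return sum((a - 1.0) * math.log(pk) for a, pk in zip(exponents_plus_one, p))

prior = [1.0, 1.0, 1.0]        # lambda = 1 (flat prior)
counts = [4.0, 0.0, 1.0]       # hypothetical pseudo-counts e
posterior = conjugate_update(prior, counts)
mean = dirichlet_mean(posterior)
print(posterior, mean)

# Check Eq. (2) at one point of the simplex: the posterior kernel equals
# the likelihood kernel plus the prior kernel in log space.
p = [0.5, 0.2, 0.3]
log_likelihood = sum(e * math.log(pk) for e, pk in zip(counts, p))
log_prior = log_kernel(p, prior)
log_posterior = log_kernel(p, posterior)
```

In VI-EDL the counts are not observed; the network output $f_{\phi}(\mathbf{x})$ plays their role, which is why the variational family is parameterized as $\mathbf{e}+\boldsymbol{\lambda}$.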
We optimize the model by maximizing the log marginal likelihood of the observed label $\mathbf{y}$. By introducing the variational posterior $q_{\phi}(\mathbf{p}|\mathbf{x})$, we derive the Evidence Lower Bound (ELBO) explicitly:

$$\begin{aligned}
\log P(\mathbf{y}|\mathbf{x})&=\log\int P(\mathbf{y}|\mathbf{p})P(\mathbf{p})\,d\mathbf{p}\\
&=\log\int q_{\phi}(\mathbf{p}|\mathbf{x})\frac{P(\mathbf{y}|\mathbf{p})P(\mathbf{p})}{q_{\phi}(\mathbf{p}|\mathbf{x})}\,d\mathbf{p}\\
&\geq\mathbb{E}_{q_{\phi}(\mathbf{p}|\mathbf{x})}\left[\log\frac{P(\mathbf{y}|\mathbf{p})P(\mathbf{p})}{q_{\phi}(\mathbf{p}|\mathbf{x})}\right]\\
&=\mathbb{E}_{q_{\phi}(\mathbf{p}|\mathbf{x})}[\log P(\mathbf{y}|\mathbf{p})]-D_{KL}(q_{\phi}(\mathbf{p}|\mathbf{x})\parallel P(\mathbf{p})).
\end{aligned}\tag{4}$$
Maximizing this strict lower bound is mathematically equivalent to minimizing our variational loss function. To provide further flexibility in balancing the accuracy-uncertainty trade-off, we introduce a hyper-parameter $\beta$, leading to the final objective:

$$\mathcal{L}_{ELBO}=\underbrace{-\mathbb{E}_{q_{\phi}(\mathbf{p}|\mathbf{x})}[\log P(\mathbf{y}|\mathbf{p})]}_{\text{Expected Negative Log-Likelihood}}+\beta\cdot\underbrace{D_{KL}(q_{\phi}(\mathbf{p}|\mathbf{x})\parallel P(\mathbf{p}))}_{\text{Prior Regularization}}.\tag{5}$$

Fundamentally, the role of the prior regularization term is to suppress the evidence across all classes, thereby mitigating overconfidence caused by excessively high evidence.

Prior and Likelihood. We assume that $\mathbf{p}$ is sampled from a Dirichlet prior distribution parameterized by a prior vector $\boldsymbol{\lambda}=[\lambda_{1},\dots,\lambda_{K}]$:

$$P(\mathbf{p})=\text{Dir}(\mathbf{p};\boldsymbol{\lambda}).\tag{6}$$

To ensure a valid probability density function without inducing boundary sparsity, we enforce the constraint $\lambda_{k}\geq 1,\forall k\in[1,K]$. In this paper, we specifically set the prior vector to $\boldsymbol{\lambda}=\mathbf{1}_{K}$, where $\mathbf{1}_{K}$ is a $K$-dimensional vector of ones. The detailed theoretical rationale for this selection is deferred to Insight 2 in Section [IV-B](https://arxiv.org/html/2605.26477#S4.SS2). To maintain the coherence of the derivation, we retain the general notation $\boldsymbol{\lambda}$ in the subsequent formulations.

For the generation of the observation $\mathbf{y}$, we assume it is drawn from a Gaussian likelihood distribution centered at the latent variable $\mathbf{p}$:

$$P(\mathbf{y}|\mathbf{p})\propto\exp\left(-\frac{1}{2\sigma^{2}}\|\mathbf{y}-\mathbf{p}\|_{2}^{2}\right).\tag{7}$$

Derivation of the Overall Variational Loss. We now derive the two components of the ELBO in Eq. ([5](https://arxiv.org/html/2605.26477#S3.E5)). For the Expected Negative Log-Likelihood term $-\mathbb{E}_{q_{\phi}(\mathbf{p}|\mathbf{x})}[\log P(\mathbf{y}|\mathbf{p})]$, with Eq. ([7](https://arxiv.org/html/2605.26477#S3.E7)) we have

$$\begin{aligned}
-\mathbb{E}_{q_{\phi}(\mathbf{p}|\mathbf{x})}[\log P(\mathbf{y}|\mathbf{p})]&\propto\mathbb{E}_{q_{\phi}}\left[\|\mathbf{y}-\mathbf{p}\|_{2}^{2}\right]\\
&=\sum_{k=1}^{K}\left((y_{k}-\mathbb{E}_{q_{\phi}}[p_{k}])^{2}+\text{Var}_{q_{\phi}}(p_{k})\right)\\
&=\sum_{k=1}^{K}\left(y_{k}-\frac{\alpha_{k}}{S}\right)^{2}+\sum_{k=1}^{K}\frac{\alpha_{k}(S-\alpha_{k})}{S^{2}(S+1)}\\
&=\underbrace{\sum_{k=1}^{K}(y_{k}-\hat{p}_{k})^{2}}_{\text{Predictive Bias}}+\underbrace{\sum_{k=1}^{K}\frac{\hat{p}_{k}(1-\hat{p}_{k})}{S+1}}_{\text{Variance}},
\end{aligned}\tag{8}$$

where $S=\sum_{k=1}^{K}\alpha_{k}$ and $\hat{p}_{k}=\alpha_{k}/S$ for each class $k$. Minimizing Eq. ([8](https://arxiv.org/html/2605.26477#S3.E8)) therefore minimizes the predictive bias and the variance of the class probability distribution simultaneously.
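The bias-plus-variance decomposition in Eq. (8) can be sanity-checked against a Monte Carlo estimate of $\mathbb{E}_{q_{\phi}}[\|\mathbf{y}-\mathbf{p}\|_{2}^{2}]$. The sketch below draws Dirichlet samples by normalizing Gamma variates from the standard library; the parameter values, sample count and seed are arbitrary choices made for illustration.

```python
import random

def expected_mse(alpha, y):
    """Closed form of Eq. (8): returns (predictive bias, variance)."""
    s = sum(alpha)
    p_hat = [a / s for a in alpha]
    bias = sum((yk - pk) ** 2 for yk, pk in zip(y, p_hat))
    variance = sum(pk * (1.0 - pk) for pk in p_hat) / (s + 1.0)
    return bias, variance

def sample_dirichlet(alpha, rng):
    """Draw p ~ Dir(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def monte_carlo_mse(alpha, y, n=20000, seed=0):
    """Sample-based estimate of E[ ||y - p||^2 ] under Dir(alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        p = sample_dirichlet(alpha, rng)
        total += sum((yk - pk) ** 2 for yk, pk in zip(y, p))
    return total / n

alpha = [5.0, 1.0, 2.0]
y = [1, 0, 0]
bias, variance = expected_mse(alpha, y)
estimate = monte_carlo_mse(alpha, y)
print("closed form:", bias + variance, "Monte Carlo:", estimate)
```

The variance term shrinks as $S$ grows, which is the sense in which the expected MSE rewards larger total evidence.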

The Prior Regularization (KL divergence penalty) term can be written as $D_{KL}(q_{\phi}\parallel P)=\mathbb{E}_{q_{\phi}}[\log q_{\phi}(\mathbf{p})]-\mathbb{E}_{q_{\phi}}[\log P(\mathbf{p})]$ by definition. Expanding these expectations and using some properties of the Dirichlet distribution, we obtain the exact analytical form:

$$\begin{aligned}
D_{KL}(\boldsymbol{\alpha})={}&\log\Gamma(S)-\sum_{k=1}^{K}\log\Gamma(\alpha_{k})+\sum_{k=1}^{K}(\alpha_{k}-\lambda_{k})(\psi(\alpha_{k})-\psi(S))\\
&-\log\Gamma(\|\boldsymbol{\lambda}\|_{1})+\sum_{k=1}^{K}\log\Gamma(\lambda_{k}),
\end{aligned}\tag{9}$$

where $\psi(\cdot)$ is the digamma function. The detailed derivation of Eq. ([9](https://arxiv.org/html/2605.26477#S3.E9)) is provided in Section [A](https://arxiv.org/html/2605.26477#A1) of the Appendix.
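Eq. (9) can be evaluated directly. The sketch below is a standard-library illustration, with a hand-written `digamma` approximation standing in for a library routine; it shows that the divergence vanishes when $\boldsymbol{\alpha}=\boldsymbol{\lambda}$ and grows as evidence is added to any single class, including the true one. Function names and test values are assumptions made here.

```python
import math

def digamma(x):
    """Digamma for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def kl_dirichlet(alpha, lam):
    """Eq. (9): KL( Dir(alpha) || Dir(lam) ) in closed form."""
    s = sum(alpha)
    return (math.lgamma(s) - sum(math.lgamma(a) for a in alpha)
            + sum((a - l) * (digamma(a) - digamma(s)) for a, l in zip(alpha, lam))
            - math.lgamma(sum(lam)) + sum(math.lgamma(l) for l in lam))

lam = [1.0, 1.0, 1.0]
# Add growing evidence to one class only and watch the penalty increase.
penalties = [kl_dirichlet([1.0 + e, 1.0, 1.0], lam) for e in (0.0, 1.0, 10.0, 100.0)]
print(penalties)
```

Unlike the masked penalty of standard EDL, nothing in this expression singles out the true class, so excess evidence is penalized wherever it appears.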

Since the prior parameter vector $\boldsymbol{\lambda}$ is fixed before training, the terms involving only $\boldsymbol{\lambda}$ are strict constants with respect to the network output $\boldsymbol{\alpha}$, yielding zero gradients during backpropagation. Dropping these independent constants, the effective objective for the KL divergence is given by:

$$\tilde{D}_{KL}(\boldsymbol{\alpha})=\log\Gamma(S)-\sum_{k=1}^{K}\log\Gamma(\alpha_{k})+\sum_{k=1}^{K}(\alpha_{k}-\lambda_{k})(\psi(\alpha_{k})-\psi(S)),\tag{10}$$

where $\Gamma(\cdot)$ represents the Gamma function. Integrating the expected reconstruction error in Eq. ([8](https://arxiv.org/html/2605.26477#S3.E8)) and the effective KL divergence in Eq. ([10](https://arxiv.org/html/2605.26477#S3.E10)) into Eq. ([5](https://arxiv.org/html/2605.26477#S3.E5)), the overall variational loss function for our model is formulated as:

$$\begin{aligned}
\mathcal{L}_{VI-EDL}(\boldsymbol{\alpha},\mathbf{y})={}&\mathcal{L}_{MSE}(\boldsymbol{\alpha},\mathbf{y})+\beta\cdot\tilde{D}_{KL}(\boldsymbol{\alpha})\\
={}&\sum_{k=1}^{K}(y_{k}-\hat{p}_{k})^{2}+\sum_{k=1}^{K}\frac{\hat{p}_{k}(1-\hat{p}_{k})}{S+1}\\
&+\beta\cdot\bigg[\log\Gamma(S)-\sum_{k=1}^{K}\log\Gamma(\alpha_{k})+\sum_{k=1}^{K}(\alpha_{k}-\lambda_{k})(\psi(\alpha_{k})-\psi(S))\bigg].
\end{aligned}\tag{11}$$
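A per-sample evaluation of Eq. (11) might look like the following sketch. It is not the authors' implementation (their code is in the linked repository); it is a plain-Python illustration with a hand-written `digamma`, an arbitrary $\beta=0.1$, and made-up Dirichlet parameters, intended to show how the two terms pull in opposite directions as true-class evidence grows.

```python
import math

def digamma(x):
    """Digamma for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def vi_edl_loss(alpha, y, beta=1.0, lam=None):
    """Eq. (11) for one sample: returns (L_MSE, effective KL, total loss)."""
    lam = [1.0] * len(alpha) if lam is None else lam
    s = sum(alpha)
    p_hat = [a / s for a in alpha]
    # Expected MSE of Eq. (8): predictive bias plus variance.
    mse = (sum((yk - pk) ** 2 for yk, pk in zip(y, p_hat))
           + sum(pk * (1.0 - pk) for pk in p_hat) / (s + 1.0))
    # Effective KL of Eq. (10): lambda-only constants are dropped.
    kl_eff = (math.lgamma(s) - sum(math.lgamma(a) for a in alpha)
              + sum((a - l) * (digamma(a) - digamma(s)) for a, l in zip(alpha, lam)))
    return mse, kl_eff, mse + beta * kl_eff

y = [1, 0, 0]
mse_small, kl_small, total_small = vi_edl_loss([11.0, 1.0, 1.0], y, beta=0.1)
mse_big, kl_big, total_big = vi_edl_loss([1001.0, 1.0, 1.0], y, beta=0.1)
print(mse_small, kl_small, total_small)
print(mse_big, kl_big, total_big)
```

Inflating the true-class evidence from 10 to 1000 lowers the MSE term slightly but raises the KL term substantially, so for this choice of $\beta$ the total loss is higher for the inflated prediction.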

### III-C Cosine Prototype Evidence Layer

While the aforementioned KL prior regularization term in Eq. ([5](https://arxiv.org/html/2605.26477#S3.E5)) mitigates the issue of excessively large evidence from an optimization perspective, relying solely on loss-driven regularization may still leave room for vulnerability under extreme anomalies. To structurally enforce this constraint and further avoid overconfidence, we introduce a cosine prototype evidence layer in this section to replace the final linear classifier of $f_{\phi}$, thereby explicitly bounding the maximum amount of generated evidence. Specifically, given an input $\mathbf{x}$, the backbone network projects it into a lower-dimensional embedding space as a feature vector $\mathbf{\hat{x}}$. For each class $k\in\{1,\dots,K\}$, we introduce a learnable class prototype vector $\mathbf{r}_{k}$, representing the semantic anchor of that class. The evidence $e_{k}$ for class $k$ is formulated based on the cosine similarity between the feature $\mathbf{\hat{x}}$ and the prototype $\mathbf{r}_{k}$:

$$e_{k}=\text{Softplus}\Big(\gamma\cdot\big(\cos(\mathbf{\hat{x}},\mathbf{r}_{k})-m\big)\Big),\tag{12}$$

where $\cos(\mathbf{\hat{x}},\mathbf{r}_{k})=\frac{\mathbf{\hat{x}}^{\top}\mathbf{r}_{k}}{\|\mathbf{\hat{x}}\|_{2}\|\mathbf{r}_{k}\|_{2}}$, $\gamma$ is a learnable scale parameter to amplify the bounded cosine similarity, and $m$ is a learnable margin parameter serving as a similarity threshold.

This cosine prototype layer constrains the evidence to $e_{k}\in[0,\text{Softplus}(\gamma\cdot(1-m))]$, which controls the growth of the evidence, and it explicitly forces the evidence network to learn compact intra-class representations. Specifically, when an unfamiliar sample is evaluated, its feature $\mathbf{\tilde{x}}$ will naturally exhibit low cosine similarity with all known class prototypes $\mathbf{r}_{k}$. Bounded by the margin $m$ and the Softplus function, the network will output near-zero evidence for all classes, thereby providing high uncertainty estimates.
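A minimal sketch of Eq. (12) for a single feature vector is given below. The prototypes, the scale $\gamma=10$ and the margin $m=0.5$ are hypothetical fixed values chosen for illustration (in the paper they are learned), and a positive $\gamma$ is assumed so that the stated upper bound holds.

```python
import math

def softplus(z):
    """Numerically stable softplus: log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def prototype_evidence(feature, prototypes, gamma, margin):
    """Eq. (12): e_k = Softplus(gamma * (cos(x_hat, r_k) - m)) for every class."""
    return [softplus(gamma * (cosine(feature, r) - margin)) for r in prototypes]

prototypes = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical class anchors r_1, r_2
gamma, margin = 10.0, 0.5               # hypothetical scale and margin
upper_bound = softplus(gamma * (1.0 - margin))

aligned = prototype_evidence([5.0, 0.1], prototypes, gamma, margin)
rescaled = prototype_evidence([5000.0, 100.0], prototypes, gamma, margin)
unfamiliar = prototype_evidence([-1.0, -1.0], prototypes, gamma, margin)
print(aligned, rescaled, unfamiliar, upper_bound)
```

Multiplying the feature by 1000 leaves the evidence unchanged, because only the direction of $\mathbf{\hat{x}}$ enters the cosine; this is the mechanism by which inflating feature magnitudes stops being a shortcut to large evidence. A feature pointing away from every prototype yields near-zero evidence for all classes.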

### III-D Model Training and Prediction

During the training phase, we employ an annealing warm-up technique. Specifically, we introduce an epoch-dependent annealing factor $\lambda_{t}=\min(1.0,t/E_{warmup})$, where $t$ is the current epoch and $E_{warmup}$ is the number of warm-up epochs. This factor dynamically scales the KL divergence term, allowing the network to focus on fitting the data likelihood in the initial stages and gradually imposing the uncertainty regularization as training stabilizes.

Consequently, the annealing variational loss function at epoch $t$ is formulated as:

$$\mathcal{L}_{anneal}^{(t)}=\mathcal{L}_{MSE}(\boldsymbol{\alpha},\mathbf{y})+\lambda_{t}\cdot\beta\cdot\tilde{D}_{KL}(q_{\phi}(\mathbf{p}|\mathbf{x})\parallel P(\mathbf{p})).\tag{13}$$
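The warm-up schedule and Eq. (13) reduce to a few lines. In this sketch the MSE and effective KL values are passed in as plain numbers (placeholders chosen for illustration), and the warm-up length of 10 epochs is an assumed value.

```python
def annealing_factor(epoch, warmup_epochs):
    """lambda_t = min(1.0, t / E_warmup)."""
    return min(1.0, epoch / warmup_epochs)

def annealed_loss(mse, kl_eff, epoch, warmup_epochs, beta):
    """Eq. (13): L_MSE + lambda_t * beta * effective KL."""
    return mse + annealing_factor(epoch, warmup_epochs) * beta * kl_eff

# Placeholder loss terms; only the schedule varies across epochs here.
schedule = [annealing_factor(t, 10) for t in (1, 5, 10, 25)]
losses = [annealed_loss(0.05, 3.0, t, 10, beta=0.1) for t in (1, 5, 10, 25)]
print(schedule)
print(losses)
```

The factor rises linearly and is capped at 1, so after the warm-up period the objective coincides with Eq. (11).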
Algorithm 1: Training process of VI-EDL

1: Input: Training dataset $\mathcal{D}=[X,Y]$.

2: Parameters: Hyper-parameter $\beta$ and warm-up epochs $E_{warmup}$.

3: Output: The optimal evidence network $f_{\phi^{*}}$, class prototype vectors $[\mathbf{r}^{*}_{1},...,\mathbf{r}^{*}_{K}]$, scale parameter $\gamma^{*}$ and margin parameter $m^{*}$.

4: Initialize the parameters $\phi$, $[\mathbf{r}_{1},...,\mathbf{r}_{K}]$, $\gamma$ and $m$;

5: for epoch $t$ = 1, 2, … do

6: Calculate annealing factor $\lambda_{t}=\min(1.0,t/E_{warmup})$;

7: for batch = 1, 2, … do

8: Sample a mini-batch $(x,y)$ from $\mathcal{D}$;

9: Extract the feature $\mathbf{\hat{x}}$ for each sample by the backbone network;

10: Obtain evidence $e_{k}$ of each class $k$ by Eq. (12);

11: Calculate $\alpha_{k}=e_{k}+1$, $S=\sum_{k=1}^{K}\alpha_{k}$ and $\hat{p}_{k}=\alpha_{k}/S$;

12: Calculate the annealing variational loss by Eq. ([13](https://arxiv.org/html/2605.26477#S3.E13));

13: Update all parameters via gradient descent;

14: end for

15: end for

The pseudo-code is detailed in Algorithm [1](https://arxiv.org/html/2605.26477#alg1). After training, the optimized network $f_{\phi^{*}}$ can be deployed for deterministic prediction and epistemic uncertainty quantification simultaneously in a single forward pass. Given a new test instance, the model predicts the categorical distribution by calculating the expected probabilities $\hat{p}_{k}=\alpha_{k}/S$, with $S=\sum_{k=1}^{K}\alpha_{k}$. The final predicted class label $\hat{y}$ is determined by the maximum expected probability, $\hat{y}=\arg\max_{k\in\{1,\dots,K\}}\hat{p}_{k}$. Meanwhile, the epistemic uncertainty $u$ of the prediction is calculated as $u=\frac{\sum_{k=1}^{K}\lambda_{k}}{S}$ with $\lambda_{k}=1$ for all $k$, so that $u=K/S$.
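The prediction rule above amounts to a few arithmetic steps on the evidence vector. The following sketch assumes the evidence has already been produced by the network; the function name `predict` and the example evidence values are illustrative.

```python
def predict(evidence):
    # alpha_k = e_k + 1 and S = sum_k alpha_k.
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    # Expected class probabilities p_k = alpha_k / S.
    probs = [a / strength for a in alpha]
    # Predicted label: index of the largest expected probability.
    label = max(range(len(probs)), key=probs.__getitem__)
    # Epistemic uncertainty u = K / S (uniform prior, lambda_k = 1).
    uncertainty = len(alpha) / strength
    return probs, label, uncertainty


confident = predict([9.0, 0.0, 0.0])   # strong evidence for class 0
no_evidence = predict([0.0, 0.0, 0.0])  # no evidence at all
```

With zero evidence the probabilities are uniform and the uncertainty equals 1; with evidence $(9,0,0)$ the total strength is $S=12$ and the uncertainty drops to $3/12=0.25$.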

## IV Theoretical Analysis

In this section, we establish the generalization bound [[31](https://arxiv.org/html/2605.26477#bib.bib26), [2](https://arxiv.org/html/2605.26477#bib.bib27)] for the proposed model and discuss the insights it provides.

### IV-A Basic Assumptions and the Generalization Bound

We first introduce two essential assumptions regarding the evidence capacity and the input space, which serve as the foundation for our complexity analysis\.

###### Assumption IV.1.

In Evidential Deep Learning (EDL), the output uncertainty is defined as the ratio of the prior capacity to the total distribution density: $\mu=\frac{\|\boldsymbol{\lambda}\|_{1}}{\|e\|_{1}+\|\boldsymbol{\lambda}\|_{1}}\in[0,1]$. We assume a robustness constraint $\mu\geq\mu_{min}$, which mathematically imposes a strict upper bound on the total generated evidence $\|e\|_{1}$:

$$\|e\|_{1}\leq\|\boldsymbol{\lambda}\|_{1}\left(\frac{1}{\mu_{min}}-1\right)\triangleq M.\tag{14}$$
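The step from the robustness constraint to Eq. (14) can be spelled out as follows, assuming $\mu_{min}>0$ (stated here explicitly so that the division is valid):

$$
\mu\geq\mu_{min}
\;\Longleftrightarrow\;
\frac{\|\boldsymbol{\lambda}\|_{1}}{\|e\|_{1}+\|\boldsymbol{\lambda}\|_{1}}\geq\mu_{min}
\;\Longleftrightarrow\;
\|e\|_{1}+\|\boldsymbol{\lambda}\|_{1}\leq\frac{\|\boldsymbol{\lambda}\|_{1}}{\mu_{min}}
\;\Longleftrightarrow\;
\|e\|_{1}\leq\|\boldsymbol{\lambda}\|_{1}\left(\frac{1}{\mu_{min}}-1\right)=M.
$$

As an illustrative example with values chosen only for concreteness, $K=10$ classes with $\boldsymbol{\lambda}=\mathbf{1}_{K}$ and $\mu_{min}=0.1$ give $M=10\times(10-1)=90$, so the total evidence may not exceed 90.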

###### Assumption IV.2.

The input feature space $\mathcal{X}$ of the neural network is bounded by a maximum radius $R>0$, such that for any input $x\in\mathcal{X}$, $\|x\|_{2}\leq R$.

Crucially, Assumption [IV.1](https://arxiv.org/html/2605.26477#S4.Thmassumption1) indicates that the total amount of evidence is finite, and Assumption [IV.2](https://arxiv.org/html/2605.26477#S4.Thmassumption2) implies that the input space is bounded, both of which are mild and standard assumptions. Under these assumptions, we can obtain the generalization bound for our proposed model.

###### Theorem IV.1.

With probability at least $1-\delta$, the expected true risk $\mathcal{L}_{true}(\phi)$ of the VI-EDL model is bounded by:

$$
\begin{aligned}
\mathcal{L}_{true}(\phi)\leq{}&\mathcal{L}_{emp}(\phi)+\mathcal{O}\Bigg(\Bigg[\bigg(2+\frac{1}{(K+1)^{2}}+\frac{2}{K(K+1)}+2\beta+\frac{\beta}{\min_{j}\lambda_{j}}\bigg)\|\boldsymbol{\lambda}\|_{1}+\beta\Bigg]\\
&\times\left(\frac{1}{\mu_{min}}-1\right)\times\frac{R\sqrt{K}\left(\prod_{l=1}^{L-1}L_{\sigma_{l}}\right)\left(\prod_{l=1}^{L}\|W_{l}\|_{2}\right)}{\sqrt{n}}\Bigg)+3B\sqrt{\frac{\log(2/\delta)}{2n}},
\end{aligned}
\tag{15}
$$

where $\mathcal{L}_{true}$ represents the true generalization risk (prediction error on unseen samples) and $\mathcal{L}_{emp}$ represents the empirical risk (prediction error on the training set). We assume the evidence network $f_{\phi}$ is an $L$-layer neural network with weight matrices $W_{l}$ and $L_{\sigma_{l}}$-Lipschitz activation functions $\sigma_{l}$ ($l\in\{1,2,\dots,L\}$), and the loss function $\mathcal{L}_{VI\text{-}EDL}$ takes values in $[0,B]\subset\mathbb{R}$.

The detailed step-by-step proofs are provided in Section [B](https://arxiv.org/html/2605.26477#A2) of the Appendix, and the derivation of Theorem [IV.1](https://arxiv.org/html/2605.26477#S4.Thmtheorem1) proceeds in three progressive steps. First, we establish the global Lipschitz continuity of our variational loss function $\mathcal{L}_{VI\text{-}EDL}$ with respect to the generated evidence to bound its gradient. Second, leveraging the Ledoux-Talagrand contraction lemma and the structural unrolling of the neural network, we decouple the loss function and bound the expected Rademacher complexity of the hypothesis space. Finally, we employ McDiarmid's inequality to tightly concentrate the empirical Rademacher complexity around its expectation, bridging the gap between empirical and true risks.

### IV-B Theoretical Insights of the Generalization Bound

Theorem [IV.1](https://arxiv.org/html/2605.26477#S4.Thmtheorem1) demonstrates that the true generalization risk $\mathcal{L}_{true}$ is rigorously bounded by the empirical risk $\mathcal{L}_{emp}$ and an additional mathematical gap. This bound formally guarantees the generalization capability of the proposed VI-EDL framework, ensuring that the model does not merely memorize training data but learns transferable patterns from the underlying class probability distribution. Moreover, it encapsulates the interplay between the data, the network architecture, and the uncertainty quantification mechanism, yielding several insights into the learning dynamics of the proposed model:

- 1. Sample Size ($n$) and Dimensionality of the Label Space ($K$): The generalization bound explicitly decreases at a rate of $\mathcal{O}(1/\sqrt{n})$. As the number of training samples $n$ increases, the bound becomes tighter, which guarantees that the empirical error converges to the true expected risk. The number of classes $K$ plays a dual role. If $K=0$, the bound tends to infinity, indicating that the learning problem is fundamentally unsolvable. In contrast, as $K\to\infty$, the uncertainty is maximally compressed by the known labels, pushing the generalization bound to its minimum valid state. Intuitively, if a task possesses an infinitely comprehensive label space that encompasses all possible categories, any given instance would be completely interpretable by the known semantics. Under such ideal conditions, the uncertainty naturally vanishes, as the very concept of "unknowns" ceases to exist.
- 2. Optimality of the Uniform Prior ($\boldsymbol{\lambda}$): The generalization bound is a strictly monotonically increasing linear function of $\|\boldsymbol{\lambda}\|_{1}$. To achieve the tightest possible bound, one must minimize $\|\boldsymbol{\lambda}\|_{1}$. Given the validity constraint $\lambda_{i}\geq 1$ for all $i$ (required to prevent singular Dirichlet prior densities), the global minimum is achieved when $\boldsymbol{\lambda}=\mathbf{1}_{K}$, where $\mathbf{1}_{K}$ is a $K$-dimensional vector of ones. This shows that initializing with a uniform Dirichlet prior is not merely an empirical heuristic, but the theoretically optimal choice.
- 3. Trade-off via Minimum Uncertainty ($\mu_{min}$): The minimum uncertainty $\mu_{min}$ enters the bound through the factor $(1/\mu_{min}-1)$, which acts as an uncertainty amplifier. If $\mu_{min}\to 0$, the amplifier tends to infinity. In this case, the evidence provided by the model can grow without bound, indicating severe overconfidence. However, while forcing $\mu_{min}=1$ makes the complexity term of the bound vanish, it completely destroys the predictive capability of the model: the model then outputs no evidence and total uncertainty for every instance, inflating the empirical error $\mathcal{L}_{emp}(\phi)$. Therefore, $\mu_{min}$ must be a balanced value, ideally learned dynamically by the model itself.
- 4. Network Capacity and Feature Complexity ($R,W_{l},\sigma_{l}$): The bound is directly proportional to the maximum feature radius $R$, the spectral norms of the weight matrices $\|W_{l}\|_{2}$, and the Lipschitz constants of the activation functions $L_{\sigma_{l}}$. Larger values imply a more complex hypothesis family, which increases the difficulty of fitting and degrades generalization. This also highlights the necessity of architectural regularization in evidential networks.
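To make these dependencies concrete, the sketch below evaluates the complexity term inside $\mathcal{O}(\cdot)$ in Eq. (15). It assumes the constant hidden by the $\mathcal{O}$ notation equals 1, which is an assumption made only for illustration, so the absolute values are not meaningful; only the trends are. The function name `complexity_gap` and all numeric inputs are hypothetical.

```python
import math


def complexity_gap(lam, beta, mu_min, radius, lipschitz, spectral_norms, n):
    # lam: prior vector lambda (length K); lipschitz: constants of the L-1
    # activations; spectral_norms: ||W_l||_2 for the L weight matrices.
    K = len(lam)
    lam_l1 = sum(lam)
    coeff = (2 + 1 / (K + 1) ** 2 + 2 / (K * (K + 1))
             + 2 * beta + beta / min(lam))
    bracket = coeff * lam_l1 + beta
    amplifier = 1 / mu_min - 1  # factor controlled by mu_min
    network = (radius * math.sqrt(K) * math.prod(lipschitz)
               * math.prod(spectral_norms) / math.sqrt(n))
    return bracket * amplifier * network


# Illustrative setting: K = 10, two-layer network with unit norms.
common = dict(beta=0.5, radius=1.0, lipschitz=[1.0], spectral_norms=[1.0, 1.0])
base = complexity_gap([1.0] * 10, mu_min=0.1, n=100, **common)
larger_prior = complexity_gap([2.0] * 10, mu_min=0.1, n=100, **common)
more_data = complexity_gap([1.0] * 10, mu_min=0.1, n=400, **common)
total_uncertainty = complexity_gap([1.0] * 10, mu_min=1.0, n=100, **common)
```

In this setting, doubling every $\lambda_{j}$ enlarges the term, quadrupling $n$ halves it, and $\mu_{min}=1$ makes it vanish, in line with Insights 1 to 3.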

![Refer to caption](https://arxiv.org/html/2605.26477v1/cifar.png)

(a) CIFAR-10

![Refer to caption](https://arxiv.org/html/2605.26477v1/blood.png)

(b) BloodMNIST

![Refer to caption](https://arxiv.org/html/2605.26477v1/path.png)

(c) PathMNIST

Figure 1: Comparison of feature magnitude $\|\hat{\mathbf{x}}\|$. The left panel of each subfigure is the original image. The right panel shows the kernel density curve, which illustrates the density distribution of the 512-dimensional activation values extracted by the convolutional kernels for the given image. The dotted line indicates the 95th percentile, and the area under this curve equals 1. (a) A natural image from CIFAR-10 serves as a control, showing concentrated activations. (b) & (c) Two medical samples from BloodMNIST and PathMNIST exhibit a broader density distribution, triggering larger feature magnitude $\|\hat{\mathbf{x}}\|$.

## V Experiments

### V-A Experimental Setup

Baselines. Following standard evaluation protocols, we primarily focus on comparing our VI-EDL with other EDL methods, including the traditional EDL [[30](https://arxiv.org/html/2605.26477#bib.bib10)], $\mathcal{I}$-EDL [[8](https://arxiv.org/html/2605.26477#bib.bib15)], Re-EDL [[6](https://arxiv.org/html/2605.26477#bib.bib14)], and F-EDL [[36](https://arxiv.org/html/2605.26477#bib.bib16)]. Additionally, we present the results of other uncertainty quantification methods, including the probabilistic posterior network NatPN [[5](https://arxiv.org/html/2605.26477#bib.bib17)], the deterministic baseline Softmax (CE), and the Deep Ensemble [[21](https://arxiv.org/html/2605.26477#bib.bib20)] method using 5 model instances ($M=5$).

Datasets. We select CIFAR-10 [[20](https://arxiv.org/html/2605.26477#bib.bib18)], SVHN [[12](https://arxiv.org/html/2605.26477#bib.bib19)], Flowers [[29](https://arxiv.org/html/2605.26477#bib.bib21)], CIFAR-100 [[20](https://arxiv.org/html/2605.26477#bib.bib18)], and MedMNIST [[33](https://arxiv.org/html/2605.26477#bib.bib22)] (including Blood, Path, Tissue, and OrganMNIST) for evaluation. In detail, we utilize four widely used natural image datasets in our experiments:

- CIFAR-10 and CIFAR-100 [[20](https://arxiv.org/html/2605.26477#bib.bib18)]: Both datasets consist of 60,000 color images with a resolution of $32\times 32$ pixels. CIFAR-10 is categorized into 10 classes of natural objects and animals (with 6,000 images per class), while CIFAR-100 contains 100 fine-grained classes (with 600 images per class).
- SVHN [[12](https://arxiv.org/html/2605.26477#bib.bib19)]: This dataset contains over 600,000 color images with a resolution of $32\times 32$ pixels. The images depict real-world printed digits (0 to 9) cropped from house number plates in Google Street View.
- Flowers [[29](https://arxiv.org/html/2605.26477#bib.bib21)]: This dataset consists of 8,189 high-resolution images of flowers commonly occurring in the United Kingdom. The images are categorized into 102 distinct flower species.

Moreover, we also evaluate our model on four subsets from the MedMNIST v2 benchmark [[33](https://arxiv.org/html/2605.26477#bib.bib22)], which provides standardized 2D biomedical images resized to $28\times 28$ pixels:

- BloodMNIST: Comprises 17,092 optical microscope images depicting normal peripheral blood cells. The images are categorized into 8 distinct cell types.
- PathMNIST: Contains 107,180 histological images of colorectal cancer tissues. The images are classified into 9 different tissue types.
- TissueMNIST: Consists of 236,386 monochromatic images showing human kidney cortex cells. These cells are segmented from wide-field microscopy images and classified into 8 categories.
- OrganMNIST: Derived from 3D abdominal CT scans that are processed into 2D bounding-box crops. It is categorized into 11 distinct abdominal organs. Taking the Axial view subset as an example, it contains 58,850 images.

Implementation. SVHN [[12](https://arxiv.org/html/2605.26477#bib.bib19)], Flowers [[29](https://arxiv.org/html/2605.26477#bib.bib21)] and CIFAR-100 [[20](https://arxiv.org/html/2605.26477#bib.bib18)] are utilized as OOD data for CIFAR-10 [[20](https://arxiv.org/html/2605.26477#bib.bib18)], while PathMNIST [[33](https://arxiv.org/html/2605.26477#bib.bib22)], TissueMNIST [[33](https://arxiv.org/html/2605.26477#bib.bib22)] and OrganMNIST [[33](https://arxiv.org/html/2605.26477#bib.bib22)] are used for BloodMNIST [[33](https://arxiv.org/html/2605.26477#bib.bib22)]. ResNet-18 serves as the backbone network for both CIFAR-10 and BloodMNIST. The Adam optimizer is employed with a learning rate of $1\times 10^{-3}$ for CIFAR-10 and BloodMNIST. The hyper-parameter $\beta$ is set to 0.5 for CIFAR-10 and 0.1 for BloodMNIST, selected from the range [0.1:0.1:1.0] (from 0.1 to 1.0 in steps of 0.1) on the validation set. The batch size is set to 256, and the numbers of warm-up and maximum epochs are set to 20 and 30, respectively, for CIFAR-10 and BloodMNIST. Reported results are averaged over 5 runs. The compute resources are summarized in Section [C](https://arxiv.org/html/2605.26477#A3) of the Appendix.

![Refer to caption](https://arxiv.org/html/2605.26477v1/myplot.png)

Figure 2: Visualization of the predicted class probability distributions for EDL baselines and our method on the BloodMNIST and PathMNIST datasets. Each vertex of the triangles represents a specific class (with top-3 evidence), and the total evidence of all classes (Total Ev) is presented below the triangles. As shown in this figure, our method generates high evidence for the ID input and the lowest evidence for the OOD input, indicating its effectiveness in OOD detection on these datasets.

TABLE I: OOD detection results on CIFAR-10, with mean and standard deviation reported over five runs. The best results are highlighted in bold. All results are shown in percentages (%).

TABLE II: OOD detection results on BloodMNIST, with mean and standard deviation reported over five runs. The best results are highlighted in bold. All results are shown in percentages (%).

### V-B Out-of-Distribution and Noise Robustness Evaluation

To comprehensively evaluate the effectiveness of our proposed VI-EDL framework, we conduct extensive experiments focusing on out-of-distribution detection and robustness against data noise. We report the classification accuracy on in-distribution (ID) samples (ID ACC (↑)) and the capability to distinguish unknown from known samples (OOD AUROC (↑) and OOD FPR95 (↓)).
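For readers unfamiliar with the two OOD metrics, the following sketch computes them from per-sample uncertainty scores. It assumes that OOD samples are the positive class and that a higher uncertainty means "more likely OOD"; conventions differ between papers, and this is not necessarily the exact evaluation code used here. The function names and the toy scores are illustrative.

```python
import math


def auroc(id_scores, ood_scores):
    # Probability that a random OOD sample receives a higher uncertainty
    # than a random ID sample (ties count as one half).
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_95_tpr(id_scores, ood_scores):
    # Pick the threshold that flags at least 95% of OOD samples, then
    # report the fraction of ID samples that are wrongly flagged.
    ranked = sorted(ood_scores, reverse=True)
    needed = math.ceil(0.95 * len(ranked))
    threshold = ranked[needed - 1]
    return sum(s >= threshold for s in id_scores) / len(id_scores)


# Toy uncertainty scores, chosen only for illustration.
id_unc = [0.1, 0.2, 0.3, 0.4]
ood_unc = [0.35, 0.6, 0.7, 0.9]
toy_auroc = auroc(id_unc, ood_unc)
toy_fpr95 = fpr_at_95_tpr(id_unc, ood_unc)
```

On the toy scores, 15 of the 16 ID/OOD pairs are ranked correctly (AUROC 0.9375), and one of the four ID samples exceeds the threshold that retains all OOD samples (FPR95 0.25).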

OOD Detection. We first train the models on the CIFAR-10 and BloodMNIST datasets and run OOD detection tests on three other datasets each, with the results summarized in Tables [I](https://arxiv.org/html/2605.26477#S5.T1) and [II](https://arxiv.org/html/2605.26477#S5.T2). These results demonstrate that our proposed VI-EDL outperforms existing baselines across most uncertainty metrics. Notably, EDL baselines often suffer from a noticeable degradation in ID ACC due to the optimization conflict introduced by heuristic KL penalties. In contrast, our method maintains a highly competitive ID ACC while achieving the highest AUROC and the lowest FPR95 in most settings. This indicates that the cosine prototype layer and the KL penalty on all classes effectively neutralize magnitude-induced overconfidence when encountering OOD samples. Furthermore, it is noteworthy that the overall performance in Table [II](https://arxiv.org/html/2605.26477#S5.T2) is generally inferior to that in Table [I](https://arxiv.org/html/2605.26477#S5.T1). We hypothesize that this discrepancy arises because the maximum feature magnitude of the BloodMNIST dataset, after being mapped through the CNN backbone, is larger than that of CIFAR-10.

To verify this, we extract representative samples from the MedMNIST benchmark (specifically, BloodMNIST and PathMNIST) and compare their pre-logits feature magnitudes $\|\hat{\mathbf{x}}\|$ mapped by a standard ResNet-18 backbone against a natural image from CIFAR-10. The results are illustrated in Fig. [1](https://arxiv.org/html/2605.26477#S4.F1). As shown in Fig. [1](https://arxiv.org/html/2605.26477#S4.F1)(a), the feature activations of the natural image in CIFAR-10 are well-concentrated within a narrow range. In contrast, the two medical images in Fig. [1](https://arxiv.org/html/2605.26477#S4.F1)(b) & (c) exhibit relatively heavy-tailed density distributions. This phenomenon likely occurs because CNNs are highly sensitive to domain-specific interferences such as cell staining variations and acquisition artifacts. Coupled with the fact that biological tissues possess inherently more complex structures than natural images, these factors over-stimulate the convolutional filters, causing the network to output abnormally high feature values. As indicated by Insight 4 in Section [IV-B](https://arxiv.org/html/2605.26477#S4.SS2), a larger feature radius yields a looser generalization bound, which consequently leads to a degradation in the method's performance.

![Refer to caption](https://arxiv.org/html/2605.26477v1/ID_d.png)

(a) Normal: Clear Weather

![Refer to caption](https://arxiv.org/html/2605.26477v1/real_d.png)

(b) Anomalous: Rainy Condition

![Refer to caption](https://arxiv.org/html/2605.26477v1/OOD_fog.png)

(c) Anomalous: Foggy Condition

Figure 3: Comparison of normal and anomalous weather in autonomous driving. (a) Normal: a typical driving scene with optimal lighting. (b) Anomalous: a rainy night where dense water droplets distort features. (c) Anomalous: a foggy environment with extreme atmospheric scattering.

TABLE III: Noise detection results on CIFAR-10, with mean and standard deviation reported over five runs. A higher value of $\sigma$ indicates a more intense additive Gaussian noise perturbation on the original data. The best results are highlighted in bold. All results are shown in percentages (%).

Visualization Analysis. We visualize the generated evidence of each EDL method in Table [II](https://arxiv.org/html/2605.26477#S5.T2) to further analyze the performance of our method. Since the 8-class BloodMNIST (9-class PathMNIST) dataset naturally produces a 7-dimensional (8-dimensional) probability simplex, we dynamically project the predictions onto a 2-simplex (triangle) by extracting the top-3 categories with the highest evidence for each instance. Moreover, we construct the hardest possible OOD samples by pairing a randomly selected ID image with its closest OOD counterpart, which is retrieved by minimizing the Mean Squared Error (MSE) between the fixed ID sample and all available OOD samples. As illustrated in Figure [2](https://arxiv.org/html/2605.26477#S5.F2), for confident ID samples, the evidence is sharply concentrated at the corresponding category vertex for all methods, showing high confidence. However, for OOD inputs, other EDL baselines still generate relatively high evidence. In contrast, the evidence of our method is suppressed across all dimensions, indicating high uncertainty.
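The two preparation steps of this visualization can be sketched as follows. The exact projection is not specified in the text, so renormalizing the Dirichlet parameters $\alpha_{k}=e_{k}+1$ of the three selected classes is an assumption of this sketch; images are represented as flat lists of pixel values, and both function names are hypothetical.

```python
def top3_simplex_point(evidence):
    # Indices of the three classes with the highest evidence.
    top = sorted(range(len(evidence)), key=lambda k: evidence[k], reverse=True)[:3]
    # Assumed projection: renormalize alpha_k = e_k + 1 over these classes
    # to obtain barycentric coordinates on the triangle.
    alpha = [evidence[k] + 1.0 for k in top]
    total = sum(alpha)
    return top, [a / total for a in alpha]


def closest_ood(id_image, ood_images):
    # Index of the OOD image with the smallest pixel-wise MSE to the ID image.
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    return min(range(len(ood_images)), key=lambda j: mse(id_image, ood_images[j]))


classes, coords = top3_simplex_point([0.0, 5.0, 1.0, 0.2, 3.0])
hardest = closest_ood([0.0, 0.0], [[1.0, 1.0], [0.1, 0.0], [5.0, 5.0]])
```

Here the selected classes are 1, 4 and 2, and the second OOD image is returned as the nearest neighbor of the ID image.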

A plausible reason for the failure of these EDL baselines may lie in the feature extraction stage. Given a raw input image $\mathbf{x}$, the backbone network maps it to a deep feature representation $\hat{\mathbf{x}}$. In medical domains, raw images are inevitably subject to physical acquisition variations. When processed by a deep CNN backbone, these non-semantic perturbations may over-stimulate convolutional filters, inducing abnormally large feature magnitudes $\|\hat{\mathbf{x}}\|$ (see Fig. [1](https://arxiv.org/html/2605.26477#S4.F1)). Conventional linear evidence layers conflate this raw intensity with semantic relevance, generating excessively high evidence. In contrast, by applying a KL divergence penalty and a cosine prototype layer, our method decouples semantic alignment from feature intensity and suppresses the evidence of all classes, ensuring that an abnormally large $\|\hat{\mathbf{x}}\|$ does not generate high evidence.

Robustness Against Data Noise. We also assess the model's robustness against varying degrees of data noise on CIFAR-10, as detailed in Table [III](https://arxiv.org/html/2605.26477#S5.T3). Our method consistently outperforms all baseline approaches across different noise intensities and effectively distinguishes heavily corrupted inputs from clean data. The results corroborate that combining a global KL divergence penalty with the cosine prototype mechanism regulates the evidence generation process, thereby preventing the model from producing high-confidence predictions on noisy samples.

### V-C Autonomous Driving Scenario

In this section, we run our VI-EDL and two other baseline EDL methods (Re-EDL and F-EDL) in a real-world autonomous driving scenario on the BDD100K dataset, which is a large-scale and highly diverse driving video dataset including standard driving scenes. To assess these methods' robustness under anomalous conditions, we construct an anomalous set by sampling 1,000 clean images from the dataset and applying synthetic perturbations. As illustrated in Fig. [3](https://arxiv.org/html/2605.26477#S5.F3), we generate 500 images simulating rainy conditions and 500 simulating foggy conditions. The three methods are then tasked with identifying this newly constructed anomalous dataset. The results are detailed in Table [IV](https://arxiv.org/html/2605.26477#S5.T4). Notably, VI-EDL exhibits the largest discrepancy in uncertainty estimates between normal and anomalous driving conditions. Specifically, the uncertainty difference (Unc. Diff.) of our proposed model reaches 0.1380, which is about 4.63 times that of Re-EDL and 2.13 times that of F-EDL. This uncertainty gap demonstrates a superior capability of VI-EDL to reliably differentiate between normal and anomalous driving scenarios.

We attribute this superior performance to the structural constraints of VI-EDL. These weather conditions typically introduce complex visual distortions that yield aberrant feature magnitudes, and our cosine prototype layer and KL penalty effectively neutralize magnitude-induced overconfidence. This also indicates that our method is practical: when the estimated uncertainty exceeds a predefined threshold, the autonomous driving system can proactively issue an alert, prompting human driver intervention and thereby enhancing overall driving safety.

![Refer to caption](https://arxiv.org/html/2605.26477v1/sensi-auroc.png)

(a) AUROC ($\uparrow$)

![Refer to caption](https://arxiv.org/html/2605.26477v1/sensi-fpr.png)

(b) FPR95 ($\downarrow$)

Figure 4: Sensitivity analysis of the hyper-parameter $\beta$ on CIFAR-10 vs SVHN in terms of AUROC ($\uparrow$) and FPR95 ($\downarrow$). The red solid line represents the performance of our method VI-EDL as $\beta$ varies, and the dashed lines represent the performance of three other EDL baselines, which are fixed at their results under optimal parameters.

TABLE IV: Comparison of uncertainty (Unc.) across different methods on normal and anomalous samples in the BDD100K dataset. A larger uncertainty difference (Unc. Diff.) indicates a better ability to distinguish OOD samples from ID samples, and the best Unc. Diff. is highlighted in bold.

TABLE V: Ablation study on KL divergence and cosine prototype layer. CIFAR-10 is used as the ID dataset, while SVHN, Flowers and CIFAR-100 are used as OOD datasets. The best results are highlighted in bold.

### V-D Ablation Study and Parameter Sensitivity Analysis

Ablation Study. In this section, we independently validate the efficacy of the two proposed components: the KL divergence term and the cosine prototype layer. The quantitative results are summarized in Table [V](https://arxiv.org/html/2605.26477#S5.T5), where a ✓ indicates the inclusion of a component and a ✗ denotes its ablation (setting the KL hyper-parameter $\beta$ to zero or substituting the cosine prototype layer with a Softplus layer). As observed, the removal of either component leads to a noticeable degradation in both ID accuracy and OOD detection capabilities, and our original model yields the highest overall performance. These results demonstrate that both components are effective in enhancing the model's performance.

Parameter Sensitivity Analysis. In our VI-EDL framework, the hyper-parameter $\beta$ is selected from $[0.01,0.03,0.05,0.1]$ across all datasets and experiments by cross-validation. To investigate the robustness of our model, we evaluate the OOD detection metrics (AUROC and FPR95) across a broad spectrum of $\beta$ values ranging from $0.002$ to $0.6$, as illustrated in Figure [4](https://arxiv.org/html/2605.26477#S5.F4). The results show that, across a wide range of $\beta$ values, VI-EDL achieves AUROC scores that outperform the three EDL baselines in most cases, while its FPR95 remains lower than all three baselines under every tested $\beta$ setting, indicating the robustness of our model to the choice of its hyper-parameter $\beta$.

## VI Conclusion

In this paper, we presented VI-EDL, a principled variational inference framework that reconstructs Evidential Deep Learning (EDL). Recognizing the drawbacks of conventional EDL, we reformulated the evidence generation process as an approximate posterior inference problem. By replacing heuristic evidence modeling with a theoretically grounded variational formulation and a cosine prototype evidence layer, VI-EDL improves the interpretability of evidential learning while preventing evidence from growing excessively across all classes. We also established a generalization bound and showed in Section [IV-B](https://arxiv.org/html/2605.26477#S4.SS2) that the predicted uncertainty (Insight 3) and the feature and network complexity (Insight 4) affect this bound, and that setting $\boldsymbol{\alpha}=\mathbf{e}+\mathbf{1}$ can minimize it (Insight 2). Extensive empirical evaluations across diverse benchmarks demonstrate that VI-EDL achieves state-of-the-art performance in out-of-distribution (OOD) detection, noise detection and autonomous driving scenarios.

## References

- [1] (2026) Evidential retriever: uncertainty-aware medical image retrieval. In Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, Vol. 315, pp. 2208–2232.
- [2] P. L. Bartlett and S. Mendelson (2002) Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3, pp. 463–482.
- [3] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2016) Variational inference: A review for statisticians. CoRR abs/1601.00670.
- [4] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural network. In Proceedings of the International Conference on Machine Learning, ICML 2015, pp. 1613–1622.
- [5] B. Charpentier, O. Borchert, D. Zügner, S. Geisler, and S. Günnemann. Natural posterior network: deep Bayesian predictive uncertainty for exponential family distributions. In International Conference on Learning Representations, ICLR 2022.
- [6] M. Chen, J. Gao, and C. Xu (2025) Revisiting essential and nonessential settings of evidential deep learning. IEEE Trans. Pattern Anal. Mach. Intell. 47(10), pp. 8658–8673.
- [7] A. P. Dempster (2008) Upper and lower probabilities induced by a multivalued mapping. In Classic Works of the Dempster-Shafer Theory of Belief Functions, pp. 57–72.
- [8] D. Deng, G. Chen, Y. Yu, F. Liu, and P. Heng (2023) Uncertainty estimation by Fisher information-based evidential deep learning. In Proceedings of the International Conference on Machine Learning, ICML 2023, Proceedings of Machine Learning Research, pp. 7596–7616.
- [9] A. Esteva, A. Robicquet, B. Ramsundar, V. Kuleshov, M. DePristo, K. Chou, C. Cui, G. Corrado, S. Thrun, and J. Dean (2019) A guide to deep learning in healthcare. Nature Medicine 25(1), pp. 24–29.
- [10] D. Feng, L. Rosenbaum, and K. Dietmayer (2018) Towards safe autonomous driving: capture uncertainty in the deep neural network for lidar 3D vehicle detection. In International Conference on Intelligent Transportation Systems, ITSC 2018, pp. 3266–3273.
- [11] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning, ICML 2016, pp. 1050–1059.
- [12] M. Geerts, K. Shaikh, J. D. Weerdt, and S. vanden Broucke (2022) Predicting the state of a house using Google Street View - an analysis of deep binary classification models for the assessment of the quality of Flemish houses. In Research Challenges in Information Science - 16th International Conference, RCIS 2022, pp. 703–710.
- [13] M. N. Geletu, J. Lauffenburger, T. Josso-Laurain, M. Devanne, and M. M. Wogari (2024) Evidential deep learning for sensor fusion. In 27th International Conference on Information Fusion, FUSION 2024, Venice, Italy, July 8-11, 2024, pp. 1–8.
- [14] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning, ICML 2017, pp. 1321–1330.
- [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 770–778.
- [16] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, ICLR 2017.
- [17] A. Jøsang (2016) Subjective logic - A formalism for reasoning under uncertainty. Artificial Intelligence: Foundations, Theory, and Algorithms, Springer.
- [18] A. Kendall and Y. Gal (2017) What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, NeurIPS 2017, pp. 5574–5584.
- [19] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, ICLR 2014, Y. Bengio and Y. LeCun (Eds.).
- [20] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report.
- [21] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, NeurIPS 2017, pp. 6402–6413.
- \[22\]Y\. LeCun, Y\. Bengio, and G\. Hinton\(2015\)Deep learning\.Nature521\(7553\),pp\. 436–444\.Cited by:[§I](https://arxiv.org/html/2605.26477#S1.p1.1)\.
- \[23\]K\. Lee, K\. Lee, H\. Lee, and J\. Shin\(2018\)A simple unified framework for detecting out\-of\-distribution samples and adversarial attacks\.InAdvances in Neural Information Processing Systems on Neural Information Processing Systems 2018, NeurIPS 2018,pp\. 7167–7177\.Cited by:[§I](https://arxiv.org/html/2605.26477#S1.p1.1)\.
- \[24\]H\. Li, Y\. Nan, J\. D\. Ser, and G\. Yang\(2023\)Region\-based evidential deep learning to quantify uncertainty and improve robustness of brain tumor segmentation\.Neural Comput\. Appl\.35\(30\),pp\. 22071–22085\.Cited by:[§II](https://arxiv.org/html/2605.26477#S2.p2.1)\.
- \[25\]S\. Liang, Y\. Li, and R\. SrikantEnhancing the reliability of out\-of\-distribution image detection in neural networks\.InInternational Conference on Learning Representations, ICLR 2018,Cited by:[§I](https://arxiv.org/html/2605.26477#S1.p1.1)\.
- \[26\]W\. Liu, X\. Wang, J\. D\. Owens, and Y\. LiEnergy\-based out\-of\-distribution detection\.InAdvances in Neural Information Processing Systems on Neural Information Processing Systems 2020, NeurIPS 2020,Cited by:[§I](https://arxiv.org/html/2605.26477#S1.p1.1)\.
- \[27\]K\. Muhammad, A\. Ullah, J\. Lloret, J\. D\. Ser, and V\. H\. C\. de Albuquerque\(2021\)Deep learning for safe autonomous driving: current challenges and future directions\.IEEE Trans\. Intell\. Transp\. Syst\.22\(7\),pp\. 4316–4336\.External Links:[Document](https://dx.doi.org/10.1109/TITS.2020.3032227)Cited by:[§I](https://arxiv.org/html/2605.26477#S1.p1.1)\.
- \[28\]A\. M\. Nguyen, J\. Yosinski, and J\. Clune\(2015\)Deep neural networks are easily fooled: high confidence predictions for unrecognizable images\.InIEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015,pp\. 427–436\.Cited by:[§I](https://arxiv.org/html/2605.26477#S1.p1.1)\.
- \[29\]M\. Nilsback and A\. Zisserman\(2008\)Automated flower classification over a large number of classes\.InSixth Indian Conference on Computer Vision, Graphics & Image Processing, ICVGIP 2008,pp\. 722–729\.Cited by:[3rd item](https://arxiv.org/html/2605.26477#S5.I1.i3.p1.1),[§V\-A](https://arxiv.org/html/2605.26477#S5.SS1.p2.1),[§V\-A](https://arxiv.org/html/2605.26477#S5.SS1.p4.2)\.
- \[30\]M\. Sensoy, L\. M\. Kaplan, and M\. Kandemir\(2018\)Evidential deep learning to quantify classification uncertainty\.InAdvances in Neural Information Processing Systems Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018,pp\. 3183–3193\.Cited by:[§I](https://arxiv.org/html/2605.26477#S1.p1.1),[§II](https://arxiv.org/html/2605.26477#S2.p1.1),[§V\-A](https://arxiv.org/html/2605.26477#S5.SS1.p1.2)\.
- \[31\]S\. Shalev\-Shwartz and S\. Ben\-David\(2014\)Understanding machine learning \- from theory to algorithms\.Cambridge University Press\.External Links:ISBN 978\-1\-10\-705713\-5Cited by:[§IV](https://arxiv.org/html/2605.26477#S4.p1.1)\.
- \[32\]A\. P\. Soleimany, A\. Amini, S\. Goldman, D\. Rus, S\. Bhatia, and C\. Coley\(2021\)Evidential deep learning for guided molecular property prediction and discovery\.ACS Central Science\.Cited by:[§II](https://arxiv.org/html/2605.26477#S2.p2.1)\.
- \[33\]J\. Yang, R\. Shi, D\. Wei, Z\. Liu, L\. Zhao, B\. Ke, H\. Pfister, and B\. Ni\(2021\)MedMNIST v2: A large\-scale lightweight benchmark for 2d and 3d biomedical image classification\.CoRRabs/2110\.14795\.Cited by:[§V\-A](https://arxiv.org/html/2605.26477#S5.SS1.p2.1),[§V\-A](https://arxiv.org/html/2605.26477#S5.SS1.p3.1),[§V\-A](https://arxiv.org/html/2605.26477#S5.SS1.p4.2)\.
- \[34\]J\. Yang, K\. Zhou, Y\. Li, and Z\. Liu\(2021\)Generalized out\-of\-distribution detection: A survey\.CoRRabs/2110\.11334\.Cited by:[§I](https://arxiv.org/html/2605.26477#S1.p1.1)\.
- \[35\]Q\. Yang, Y\. Zhao, and H\. Cheng\(2026\)Uncertainty\-aware evidential fusion for multi\-modal object detection in autonomous driving\.Drones10\(2\)\.Cited by:[§II](https://arxiv.org/html/2605.26477#S2.p2.1)\.
- \[36\]T\. Yoon and H\. Kim\(2025\)Uncertainty estimation by flexible evidential deep learning\.CoRRabs/2510\.18322\.Cited by:[§II](https://arxiv.org/html/2605.26477#S2.p1.1),[§V\-A](https://arxiv.org/html/2605.26477#S5.SS1.p1.2)\.

## Appendix A: Detailed Derivation of Eq. ([9](https://arxiv.org/html/2605.26477#S3.E9))

First, recall the probability density function of a Dirichlet distribution. The variational posterior $q_{\phi}(\mathbf{p})$ is defined over the $(K-1)$-dimensional unit simplex as:

$$q_{\phi}(\mathbf{p})=\frac{1}{B(\boldsymbol{\alpha})}\prod_{k=1}^{K}p_{k}^{\alpha_{k}-1},\tag{A.1}$$

where $B(\boldsymbol{\alpha})$ is the multivariate Beta function, expanding to:

$$B(\boldsymbol{\alpha})=\frac{\prod_{k=1}^{K}\Gamma(\alpha_{k})}{\Gamma\left(\sum_{k=1}^{K}\alpha_{k}\right)}=\frac{\prod_{k=1}^{K}\Gamma(\alpha_{k})}{\Gamma(S)},\tag{A.2}$$

with $S=\sum_{k=1}^{K}\alpha_{k}$. Similarly, the prior distribution is

$$P(\mathbf{p})=\frac{1}{B(\boldsymbol{\lambda})}\prod_{k=1}^{K}p_{k}^{\lambda_{k}-1}.\tag{A.3}$$
By definition, the KL divergence between $q_{\phi}$ and $P$ is given by:

$$\begin{aligned}D_{KL}(q_{\phi}\parallel P)&=\mathbb{E}_{q_{\phi}}\left[\log\frac{q_{\phi}(\mathbf{p})}{P(\mathbf{p})}\right]\\&=\mathbb{E}_{q_{\phi}}[\log q_{\phi}(\mathbf{p})-\log P(\mathbf{p})].\end{aligned}\tag{A.4}$$
Next, we expand the log-densities of both distributions:

$$\log q_{\phi}(\mathbf{p})=-\log B(\boldsymbol{\alpha})+\sum_{k=1}^{K}(\alpha_{k}-1)\log p_{k},\tag{A.5}$$

$$\log P(\mathbf{p})=-\log B(\boldsymbol{\lambda})+\sum_{k=1}^{K}(\lambda_{k}-1)\log p_{k}.\tag{A.6}$$
Substituting these expansions back into Equation [A.4](https://arxiv.org/html/2605.26477#A1.E4), we group the terms together:

$$\begin{aligned}D_{KL}(q_{\phi}\parallel P)&=\mathbb{E}_{q_{\phi}}\left[\log B(\boldsymbol{\lambda})-\log B(\boldsymbol{\alpha})+\sum_{k=1}^{K}(\alpha_{k}-\lambda_{k})\log p_{k}\right]\\&=\log B(\boldsymbol{\lambda})-\log B(\boldsymbol{\alpha})+\sum_{k=1}^{K}(\alpha_{k}-\lambda_{k})\mathbb{E}_{q_{\phi}}[\log p_{k}].\end{aligned}\tag{A.7}$$
To evaluate the expectation term, we use a well-known property of the Dirichlet distribution: the expected log-probability of a category is expressed through the digamma function $\psi(\cdot)$, the logarithmic derivative of the gamma function $\Gamma(\cdot)$. Specifically:

$$\mathbb{E}_{q_{\phi}}[\log p_{k}]=\psi(\alpha_{k})-\psi\left(\sum_{j=1}^{K}\alpha_{j}\right)=\psi(\alpha_{k})-\psi(S).\tag{A.8}$$
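The identity in Eq. (A.8) can be sanity-checked numerically. The following is a minimal sketch, not part of the released VI-EDL code: the helper names (`digamma`, `sample_dirichlet`, `mc_expected_log_p`), the concentration vector and the sample count are illustrative assumptions, and the digamma function is approximated with a standard recurrence plus asymptotic series rather than taken from a library.

```python
import math
import random


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    # psi(x) = psi(x + 1) - 1/x, applied until the series below is accurate.
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def mc_expected_log_p(alpha, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[log p_k] for every class k."""
    rng = random.Random(seed)
    sums = [0.0] * len(alpha)
    for _ in range(n_samples):
        for k, pk in enumerate(sample_dirichlet(alpha, rng)):
            sums[k] += math.log(pk)
    return [s / n_samples for s in sums]


alpha = [2.0, 3.0, 5.0]          # hypothetical Dirichlet parameters
S = sum(alpha)
closed_form = [digamma(a) - digamma(S) for a in alpha]   # right-hand side of Eq. (A.8)
monte_carlo = mc_expected_log_p(alpha)

for k, (c, m) in enumerate(zip(closed_form, monte_carlo)):
    print(f"k={k}: closed form {c:.4f}, Monte Carlo {m:.4f}")
```

The two columns should agree up to Monte Carlo noise, which shrinks as the sample count grows.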
Plugging Equation [A.8](https://arxiv.org/html/2605.26477#A1.E8) into Equation [A.7](https://arxiv.org/html/2605.26477#A1.E7), we obtain:

$$D_{KL}(q_{\phi}\parallel P)=\log B(\boldsymbol{\lambda})-\log B(\boldsymbol{\alpha})+\sum_{k=1}^{K}(\alpha_{k}-\lambda_{k})(\psi(\alpha_{k})-\psi(S)).\tag{A.9}$$
Finally, we fully unpack the logarithmic multivariate Beta functions:

$$-\log B(\boldsymbol{\alpha})=\log\Gamma(S)-\sum_{k=1}^{K}\log\Gamma(\alpha_{k}),\tag{A.10}$$

$$\log B(\boldsymbol{\lambda})=\sum_{k=1}^{K}\log\Gamma(\lambda_{k})-\log\Gamma(\lVert\boldsymbol{\lambda}\rVert_{1}).\tag{A.11}$$
Substituting these into Equation [A.9](https://arxiv.org/html/2605.26477#A1.E9) and rearranging the terms yields the analytical form presented in the main text:

$$\begin{aligned}D_{KL}(\boldsymbol{\alpha}\parallel\boldsymbol{\lambda})={}&\log\Gamma(S)-\sum_{k=1}^{K}\log\Gamma(\alpha_{k})\\&+\sum_{k=1}^{K}(\alpha_{k}-\lambda_{k})(\psi(\alpha_{k})-\psi(S))\\&-\log\Gamma(\lVert\boldsymbol{\lambda}\rVert_{1})+\sum_{k=1}^{K}\log\Gamma(\lambda_{k}),\end{aligned}\tag{A.12}$$

which concludes the derivation of Eq. [9](https://arxiv.org/html/2605.26477#S3.E9).
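As an illustration, the closed form in Eq. (A.12) translates almost line by line into code. This is a sketch under stated assumptions: `dirichlet_kl` and `digamma` are hypothetical helper names, the digamma function is approximated by a recurrence plus asymptotic series, and the authors' repository may implement the term differently.

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def dirichlet_kl(alpha, lam):
    """KL( Dir(alpha) || Dir(lam) ), assembled from the three lines of Eq. (A.12)."""
    S = sum(alpha)
    # log Gamma(S) - sum_k log Gamma(alpha_k)
    kl = math.lgamma(S) - sum(math.lgamma(a) for a in alpha)
    # sum_k (alpha_k - lambda_k) (psi(alpha_k) - psi(S))
    kl += sum((a - lk) * (digamma(a) - digamma(S)) for a, lk in zip(alpha, lam))
    # -log Gamma(||lambda||_1) + sum_k log Gamma(lambda_k)
    kl += -math.lgamma(sum(lam)) + sum(math.lgamma(lk) for lk in lam)
    return kl


# For K = 2, alpha = (2, 1) and a uniform prior, the integral can be done by hand:
# KL = log(2) - 1/2.
print(dirichlet_kl([2.0, 1.0], [1.0, 1.0]), math.log(2.0) - 0.5)
```

When `alpha` equals `lam` the three groups of terms cancel and the divergence is zero, which is a quick way to test an implementation.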

## Appendix B: Proofs for Theorem [IV.1](https://arxiv.org/html/2605.26477#S4.Thmtheorem1)

To derive the generalization bound of our model, we start with the overall variational loss function $\mathcal{L}_{VI-EDL}$. To decouple this loss during the analysis, we must first establish its Lipschitz continuity. Let $z=(x,y)$ denote a sample pair (feature and label) from a dataset $\mathcal{D}$, and let $h(z)=\mathcal{L}_{VI-EDL}(\alpha,y)$ denote the variational loss function.

###### Theorem B\.1\.

Given the ground-truth label space $\mathcal{Y}=\{y\in\mathbb{R}^{K}\mid y_{i}\in[0,1],\sum y_{i}=1\}$, the variational loss function $\mathcal{L}_{VI-EDL}$ is globally Lipschitz continuous with respect to the network output. The Lipschitz constant $L_{h}$ is bounded by:

$$L_{h}\leq 2+\frac{1}{(K+1)^{2}}+\frac{2}{K(K+1)}+\beta\left(2+\frac{1}{\min_{j}\lambda_{j}}+\frac{1}{\lVert\boldsymbol{\lambda}\rVert_{1}}\right).\tag{B.1}$$

###### Proof\.

To prove that the variational loss function $h(z)=\mathcal{L}_{VI-EDL}(\alpha,y)$ is globally Lipschitz continuous over the valid domain $\Omega=\{\alpha\in\mathbb{R}^{K}\mid\alpha_{i}\geq 1,\forall i\}$, we first state the theoretical basis for our proof.

According to the mean value theorem for multivariate functions, a function is $L$-Lipschitz continuous if its gradient norm is uniformly bounded. In our coordinate space, it suffices to prove that the absolute values of all partial derivatives are uniformly bounded by a constant $L>0$, i.e., $\left|\frac{\partial\mathcal{L}_{VI-EDL}}{\partial\alpha_{i}}\right|\leq L,\forall\alpha\in\Omega$. We now analyze the Mean Squared Error ($\mathcal{L}_{MSE}$) and the effective KL divergence ($\tilde{D}_{KL}$) term by term.

Step 1: Bounding the Gradient of $\mathcal{L}_{MSE}$. We first recall the functional form of the MSE loss, which represents the expected $\mathcal{L}_{2}$ error over the class probability distribution:

$$\mathcal{L}_{MSE}(\alpha)=\sum_{j=1}^{K}(y_{j}-\hat{p}_{j})^{2}+\frac{1}{S+1}\left(1-\sum_{j=1}^{K}\hat{p}_{j}^{2}\right),\tag{B.2}$$

where $S=\sum_{j=1}^{K}\alpha_{j}$ and $\hat{p}_{j}=\frac{\alpha_{j}}{S}$. To compute the gradient, we first note that $\frac{\partial\hat{p}_{j}}{\partial\alpha_{i}}=\frac{1}{S}(\delta_{ij}-\hat{p}_{j})$. Given $\hat{p}_{j}\in[0,1]$, it follows that $\left|\frac{\partial\hat{p}_{j}}{\partial\alpha_{i}}\right|\leq\frac{1}{S}$.

Applying the chain rule to the predictive error term $f_{1}(\alpha)=\sum_{j=1}^{K}(y_{j}-\hat{p}_{j})^{2}$:

$$\left|\frac{\partial f_{1}}{\partial\alpha_{i}}\right|=\left|-2\sum_{j=1}^{K}(y_{j}-\hat{p}_{j})\frac{\partial\hat{p}_{j}}{\partial\alpha_{i}}\right|\leq 2\sum_{j=1}^{K}1\cdot\frac{1}{S}=\frac{2K}{S}\leq 2,\tag{B.3}$$

where the last inequality uses $S\geq K$, which holds on $\Omega$ because every $\alpha_{i}\geq 1$. Next, for the variance penalty term $f_{2}(\alpha)=\frac{1}{S+1}\left(1-\sum_{j=1}^{K}\hat{p}_{j}^{2}\right)$, applying the product rule yields:

$$\left|\frac{\partial f_{2}}{\partial\alpha_{i}}\right|=\left|-\frac{1}{(S+1)^{2}}\left(1-\sum_{j=1}^{K}\hat{p}_{j}^{2}\right)+\frac{1}{S+1}\left(-2\sum_{j=1}^{K}\hat{p}_{j}\frac{\partial\hat{p}_{j}}{\partial\alpha_{i}}\right)\right|\leq\frac{1}{(K+1)^{2}}+\frac{2}{K(K+1)}.\tag{B.4}$$

Thus, the gradient of $\mathcal{L}_{MSE}$ is uniformly bounded by a constant.
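The bound from Step 1 can be probed numerically. The sketch below is an illustration only: it evaluates Eq. (B.2) at random points of $\Omega$ with one-hot labels, approximates each partial derivative by a central finite difference, and compares the largest value found against $2+\frac{1}{(K+1)^{2}}+\frac{2}{K(K+1)}$. The function names, the sampling range for $\alpha$ and the choice $K=5$ are assumptions made for the example.

```python
import random


def mse_loss(alpha, y):
    """Expected squared error under Dir(alpha), written as in Eq. (B.2)."""
    S = sum(alpha)
    p_hat = [a / S for a in alpha]
    fit = sum((yj - pj) ** 2 for yj, pj in zip(y, p_hat))
    variance = (1.0 - sum(pj * pj for pj in p_hat)) / (S + 1.0)
    return fit + variance


def partial_derivative(alpha, y, i, h=1e-6):
    """Central finite difference of mse_loss with respect to alpha[i]."""
    up = list(alpha)
    down = list(alpha)
    up[i] += h
    down[i] -= h
    return (mse_loss(up, y) - mse_loss(down, y)) / (2.0 * h)


def max_abs_partial(K, trials=200, seed=0):
    """Largest |dL_MSE / d alpha_i| seen over random points with alpha_i >= 1."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        alpha = [1.0 + rng.uniform(0.0, 20.0) for _ in range(K)]
        label = rng.randrange(K)
        y = [1.0 if j == label else 0.0 for j in range(K)]   # one-hot label
        for i in range(K):
            worst = max(worst, abs(partial_derivative(alpha, y, i)))
    return worst


K = 5
bound = 2.0 + 1.0 / (K + 1) ** 2 + 2.0 / (K * (K + 1))
worst = max_abs_partial(K)
print(f"largest |partial| observed: {worst:.4f}, bound from Step 1: {bound:.4f}")
```

A random scan like this cannot prove the bound, but a violation would immediately expose an algebra mistake.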

Step 2: Bounding the Gradient of the Effective KL Divergence. The effective part of the generalized KL divergence (dropping prior-only constant terms) is given by:

$$\tilde{D}_{KL}(\alpha)=\log\Gamma(S)-\sum_{k=1}^{K}\log\Gamma(\alpha_{k})+\sum_{k=1}^{K}(\alpha_{k}-\lambda_{k})(\psi(\alpha_{k})-\psi(S)).\tag{B.5}$$

Taking the partial derivative with respect to $\alpha_{i}$ and using the identity $\frac{\partial}{\partial x}\log\Gamma(x)=\psi(x)$, the first-order digamma terms cancel, leaving:

$$\frac{\partial\tilde{D}_{KL}}{\partial\alpha_{i}}=(\alpha_{i}-\lambda_{i})\psi^{\prime}(\alpha_{i})-(S-\lVert\boldsymbol{\lambda}\rVert_{1})\psi^{\prime}(S),\tag{B.6}$$

where $\psi^{\prime}(\cdot)$ is the trigamma function. To bound this, we use the inequality $\frac{1}{x}<\psi^{\prime}(x)<\frac{1}{x}+\frac{1}{x^{2}}$ for $x>0$. Since $\alpha_{i}\geq\lambda_{i}\geq 1$:

$$0\leq(\alpha_{i}-\lambda_{i})\psi^{\prime}(\alpha_{i})<1-\frac{\lambda_{i}}{\alpha_{i}}+\frac{\alpha_{i}-\lambda_{i}}{\alpha_{i}^{2}}\leq 1+\frac{1}{\lambda_{i}}.\tag{B.7}$$

Similarly, since $S\geq\lVert\boldsymbol{\lambda}\rVert_{1}\geq K\geq 1$, the second term is bounded by $1+\frac{1}{\lVert\boldsymbol{\lambda}\rVert_{1}}$. Applying the triangle inequality:

$$\left|\frac{\partial\tilde{D}_{KL}}{\partial\alpha_{i}}\right|<2+\frac{1}{\min_{j}\lambda_{j}}+\frac{1}{\lVert\boldsymbol{\lambda}\rVert_{1}}.\tag{B.8}$$
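Both the cancellation behind Eq. (B.6) and the bound in Eq. (B.8) are easy to check numerically. The following sketch compares the closed-form derivative with a finite difference of Eq. (B.5), then scans random points with $\alpha_{i}\geq\lambda_{i}\geq 1$. The helper names, the test point, and the all-ones prior are illustrative assumptions; the digamma and trigamma functions are approximated by recurrence plus asymptotic series.

```python
import math
import random


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def trigamma(x):
    """Trigamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 10.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi'(x) ~ 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7)
    return result + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))


def effective_kl(alpha, lam):
    """Effective KL term of Eq. (B.5), without the prior-only constants."""
    S = sum(alpha)
    value = math.lgamma(S) - sum(math.lgamma(a) for a in alpha)
    return value + sum((a - lk) * (digamma(a) - digamma(S)) for a, lk in zip(alpha, lam))


def kl_partial(alpha, lam, i):
    """Closed-form partial derivative from Eq. (B.6)."""
    S = sum(alpha)
    return (alpha[i] - lam[i]) * trigamma(alpha[i]) - (S - sum(lam)) * trigamma(S)


lam = [1.0, 1.0, 1.0, 1.0]                       # hypothetical prior, lambda_k = 1
bound = 2.0 + 1.0 / min(lam) + 1.0 / sum(lam)    # right-hand side of Eq. (B.8)

# 1) Compare Eq. (B.6) with a central finite difference of Eq. (B.5).
alpha = [3.5, 1.2, 7.8, 20.1]
h = 1e-5
up, down = list(alpha), list(alpha)
up[0] += h
down[0] -= h
numeric = (effective_kl(up, lam) - effective_kl(down, lam)) / (2.0 * h)
analytic = kl_partial(alpha, lam, 0)

# 2) Scan random points with alpha_i >= lambda_i and record the largest partial.
rng = random.Random(0)
worst = 0.0
for _ in range(1000):
    point = [lk + rng.uniform(0.0, 100.0) for lk in lam]
    worst = max(worst, max(abs(kl_partial(point, lam, i)) for i in range(len(lam))))

print(f"analytic {analytic:.6f} vs finite difference {numeric:.6f}")
print(f"largest |partial| observed {worst:.4f}, bound {bound:.4f}")
```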
Conclusion. Combining the results from Step 1 and Step 2, and accounting for the trade-off hyper-parameter $\beta$, the infinity norm of the gradient $\nabla_{\alpha}\mathcal{L}_{VI}$ is uniformly bounded across the domain $\Omega$ by the structural constant $L_{h}$:

$$L_{h}\leq 2+\frac{1}{(K+1)^{2}}+\frac{2}{K(K+1)}+\beta\left(2+\frac{1}{\min_{j}\lambda_{j}}+\frac{1}{\lVert\boldsymbol{\lambda}\rVert_{1}}\right)\triangleq\left|\frac{\partial\mathcal{L}_{VI-EDL}}{\partial\alpha_{i}}\right|.\tag{B.9}$$

This matches the Lipschitz constant $L_{h}$ stated in Theorem [B.1](https://arxiv.org/html/2605.26477#A2.Thmtheorem1), which completes the proof of Theorem [B.1](https://arxiv.org/html/2605.26477#A2.Thmtheorem1). ∎

Let $\mathcal{H}$ be the hypothesis space composed of the loss functions $h(z)$, and let $\mathcal{S}=\{z_{1},\dots,z_{n}\}$ be a dataset of size $n$ drawn i.i.d. from $\mathcal{D}$. Based on Theorem [B.1](https://arxiv.org/html/2605.26477#A2.Thmtheorem1) and the Ledoux-Talagrand contraction inequality, we can decouple the loss function and establish the expected Rademacher complexity bound.

###### Theorem B\.2\.

Assume the evidence network $f_{\phi}$ is an $L$-layer feed-forward neural network parameterized by weight matrices $W_{l}$ and $L_{\sigma_{l}}$-Lipschitz activation functions $\sigma_{l}$, where $l\in[1,L]$. The expected Rademacher complexity of the hypothesis space $\mathcal{H}$ over the sample set $\mathcal{S}$ is upper-bounded by:

$$\mathbb{E}_{\mathcal{S}}[\hat{\mathfrak{R}}_{n}(\mathcal{H})]\leq\mathcal{O}\left(L_{h}\cdot M\cdot\frac{R\sqrt{K}\left(\prod_{l=1}^{L-1}L_{\sigma_{l}}\right)\left(\prod_{l=1}^{L}\lVert W_{l}\rVert_{2}\right)}{\sqrt{n}}\right).\tag{B.10}$$

###### Proof\.

To bound the expected Rademacher complexity of the hypothesis spaceℋ\\mathcal\{H\}, we rely on a fundamental theorem in statistical learning theory that allows us to iteratively peel off Lipschitz\-continuous operations\. We first formally introduce this lemma\.

###### Lemma B\.1\.

Let $\phi:\mathbb{R}\to\mathbb{R}$ be an $L_{\phi}$-Lipschitz continuous function such that $\phi(0)=0$. For any hypothesis class $\mathcal{F}$ of real-valued functions and any sample set $S$, the empirical Rademacher complexity satisfies:

$$\hat{\mathfrak{R}}_{n}(\phi\circ\mathcal{F})\leq L_{\phi}\cdot\hat{\mathfrak{R}}_{n}(\mathcal{F}).\tag{B.11}$$

Step 1: Decoupling the Loss Function. The hypothesis space $\mathcal{H}$ consists of functions $h(z)=\mathcal{L}_{VI}(f_{\phi}(x),y)$. Structurally, this is a composition of the variational loss $\mathcal{L}_{VI}$ and the neural network output $f_{\phi}(x)$.

According to Theorem [B.1](https://arxiv.org/html/2605.26477#A2.Thmtheorem1), the variational loss function is $L_{h}$-Lipschitz continuous with respect to the network’s evidence output. Applying Lemma [B.1](https://arxiv.org/html/2605.26477#A2.Thmlemma1) to this composite structure decouples the loss function from the hypothesis space and bounds its complexity by the complexity of the network output space:

$$\hat{\mathfrak{R}}_{n}(\mathcal{H})\leq L_{h}\cdot\hat{\mathfrak{R}}_{n}(\mathcal{F}_{M}),\tag{B.12}$$

where $\mathcal{F}_{M}=\{x\mapsto e\in\mathbb{R}^{K}_{\geq 0}\mid e=f_{\phi}(x),\lVert e\rVert_{1}\leq M\}$ is the neural network function family restricted by the evidence capacity constraint (Assumption [IV.1](https://arxiv.org/html/2605.26477#S4.Thmassumption1)). To explicitly extract the capacity bound $M$, we define a normalized "unit-evidence" function family $\mathcal{F}_{1}=\{x\mapsto\tilde{e}\mid\tilde{e}=f_{\phi}(x)/M,\lVert\tilde{e}\rVert_{1}\leq 1\}$. By the absolute homogeneity of Rademacher complexity, the complexity scales linearly: $\hat{\mathfrak{R}}_{n}(\mathcal{F}_{M})=M\cdot\hat{\mathfrak{R}}_{n}(\mathcal{F}_{1})$.
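The two properties used in this step, contraction under a Lipschitz map and linear scaling under multiplication by $M$, can be illustrated on a small finite function class where the empirical Rademacher complexity is computable exactly. This is a toy sketch with scalar-valued functions: the class `F`, its size, and the choice of $\tanh$ as the 1-Lipschitz map are assumptions made for the example, not objects from the proof.

```python
import itertools
import math
import random


def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite function class.

    values[f][i] is the output of function f on sample i. The expectation over
    the sign vector is computed by enumerating all 2^n sign patterns.
    """
    n = len(values[0])
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        total += max(sum(s * v for s, v in zip(signs, row)) for row in values)
    return total / (2 ** n * n)


rng = random.Random(0)
n_samples, n_functions = 8, 20
# Hypothetical function class: each row lists one function's values on the sample.
F = [[rng.uniform(-2.0, 2.0) for _ in range(n_samples)] for _ in range(n_functions)]

r_F = empirical_rademacher(F)

# Contraction: tanh is 1-Lipschitz and tanh(0) = 0, so the complexity cannot grow.
r_tanh_F = empirical_rademacher([[math.tanh(v) for v in row] for row in F])

# Homogeneity: scaling every function by M > 0 scales the complexity by M.
M = 3.0
r_MF = empirical_rademacher([[M * v for v in row] for row in F])

print(f"R(F) = {r_F:.4f}, R(tanh o F) = {r_tanh_F:.4f}, R(M F) / M = {r_MF / M:.4f}")
```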

Step 2: Structural Unrolling of the Neural Network. The evidence extractor $f_{\phi}(x)$ is an $L$-layer feed-forward neural network. Mathematically, it can be expressed as a nested composition of linear transformations and non-linear activations:

$$f_{\phi}(x)=W_{L}\sigma_{L-1}\left(W_{L-1}\sigma_{L-2}\left(\dots\sigma_{1}(W_{1}x)\dots\right)\right),\tag{B.13}$$

where $W_{l}$ represents the weight matrix of the $l$-th linear layer, and $\sigma_{l}$ represents the element-wise non-linear activation function.

Crucially, each fundamental operation in this nested structure preserves Lipschitz continuity, allowing us to apply Lemma [B.1](https://arxiv.org/html/2605.26477#A2.Thmlemma1) recursively:

- Linear Layers: The mapping $x\mapsto W_{l}x$ is Lipschitz continuous, with Lipschitz constant given by the spectral norm of the weight matrix, i.e., $\lVert W_{l}\rVert_{2}$.
- Activation Functions: The widely used activation functions in modern neural networks are Lipschitz continuous. For instance, the standard ReLU function $\sigma(z)=\max(0,z)$ has a Lipschitz constant of $L_{\sigma}=1$. Similarly, the Tanh function has $L_{\sigma}=1$, and the Sigmoid function has $L_{\sigma}=1/4$; each constant follows from bounding the derivative. As a special case, a layer without an activation function can be regarded as $\sigma(x)=x$ with $L_{\sigma}=1$. We denote the Lipschitz constant of the $l$-th activation as $L_{\sigma_{l}}$.

Since the network output is a $K$-dimensional vector, applying the Ledoux-Talagrand inequality along with the standard vector-valued Rademacher bound introduces a dimension factor of $\sqrt{K}$. By recursively peeling off the output weight matrix $W_{L}$, the last activation $\sigma_{L-1}$, and continuing down to the input layer, the complexity of the normalized network is bounded in terms of the raw input space $\mathcal{X}$:

$$\begin{aligned}\hat{\mathfrak{R}}_{n}(\mathcal{F}_{1})&\leq\frac{\sqrt{K}}{M}\cdot\lVert W_{L}\rVert_{2}\cdot\hat{\mathfrak{R}}_{n}\left(\sigma_{L-1}\circ W_{L-1}\dots\sigma_{1}\circ W_{1}\mathcal{X}\right)\\&\leq\frac{\sqrt{K}}{M}\cdot\lVert W_{L}\rVert_{2}\cdot L_{\sigma_{L-1}}\cdot\hat{\mathfrak{R}}_{n}\left(W_{L-1}\dots\sigma_{1}\circ W_{1}\mathcal{X}\right)\\&\quad\vdots\\&\leq\frac{\sqrt{K}}{M}\left(\prod_{l=1}^{L-1}L_{\sigma_{l}}\right)\left(\prod_{l=1}^{L}\lVert W_{l}\rVert_{2}\right)\hat{\mathfrak{R}}_{n}(\mathcal{X}).\end{aligned}\tag{B.14}$$
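The factor $\left(\prod_{l}L_{\sigma_{l}}\right)\left(\prod_{l}\lVert W_{l}\rVert_{2}\right)$ that appears in Eq. (B.14) is the usual Lipschitz upper bound of the network itself. The sketch below illustrates that underlying fact rather than the Rademacher bound: it builds a small random ReLU network in plain Python, estimates each spectral norm by power iteration, and checks that no sampled input pair exceeds the product. The layer widths, weight scale and helper names are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def norm2(v):
    return math.sqrt(sum(t * t for t in v))


def spectral_norm(W, iters=500):
    """Largest singular value of W, estimated by power iteration on W^T W."""
    cols = len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols
    Wt = [list(col) for col in zip(*W)]
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        scale = norm2(u)
        v = [t / scale for t in u]
    return norm2(matvec(W, v))


def relu_mlp(weights, x):
    """f(x) = W_L relu(W_{L-1} ... relu(W_1 x)), the form of Eq. (B.13) with ReLU."""
    for W in weights[:-1]:
        x = [max(0.0, t) for t in matvec(W, x)]
    return matvec(weights[-1], x)


rng = random.Random(0)
sizes = [6, 8, 8, 3]   # hypothetical widths: 6 inputs, two hidden layers, K = 3 outputs
weights = [
    [[rng.gauss(0.0, 0.5) for _ in range(sizes[l])] for _ in range(sizes[l + 1])]
    for l in range(len(sizes) - 1)
]

# ReLU is 1-Lipschitz, so the activation product equals 1 and only the norms remain.
lipschitz_bound = math.prod(spectral_norm(W) for W in weights)

worst_ratio = 0.0
for _ in range(2000):
    x1 = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]
    x2 = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]
    out_gap = norm2([a - b for a, b in zip(relu_mlp(weights, x1), relu_mlp(weights, x2))])
    in_gap = norm2([a - b for a, b in zip(x1, x2)])
    worst_ratio = max(worst_ratio, out_gap / in_gap)

print(f"largest observed ratio {worst_ratio:.4f}, product of spectral norms {lipschitz_bound:.4f}")
```

In practice the observed ratio tends to sit well below the product, which is one reason norm-product bounds of this kind are often loose.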
Step 3: Bounding the Base Input Space. Finally, we must bound the base complexity of the raw input features $\hat{\mathfrak{R}}_{n}(\mathcal{X})$. Recall Assumption [IV.2](https://arxiv.org/html/2605.26477#S4.Thmassumption2), which restricts the input space to a bounded ball $\lVert x\rVert_{2}\leq R$. Using the Cauchy-Schwarz inequality and Jensen’s inequality, we obtain:

$$\begin{aligned}\hat{\mathfrak{R}}_{n}(\mathcal{X})&=\frac{1}{n}\mathbb{E}_{\sigma}\left[\sup_{\lVert x\rVert_{2}\leq R}\sum_{i=1}^{n}\sigma_{i}x_{i}\right]\leq\frac{R}{n}\mathbb{E}_{\sigma}\left[\left\lVert\sum_{i=1}^{n}\sigma_{i}x_{i}\right\rVert_{2}\right]\\&\leq\frac{R}{n}\sqrt{\mathbb{E}_{\sigma}\left[\left\lVert\sum_{i=1}^{n}\sigma_{i}x_{i}\right\rVert_{2}^{2}\right]}.\end{aligned}\tag{B.15}$$

Since the Rademacher variables $\sigma_{i}\in\{-1,1\}$ are independent and zero-mean, the cross terms in the squared norm vanish (i.e., $\mathbb{E}[\sigma_{i}\sigma_{j}]=0$ for $i\neq j$), leaving only $\sum_{i=1}^{n}\lVert x_{i}\rVert_{2}^{2}\leq nR^{2}$. Thus, the input complexity is bounded by:

$$\hat{\mathfrak{R}}_{n}(\mathcal{X})\leq\frac{R}{n}\sqrt{nR^{2}}=\frac{R}{\sqrt{n}}.\tag{B.16}$$
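Two facts carry this step: the cross terms of the squared norm vanish in expectation, and Jensen's inequality moves the expectation inside the square root. For a small sample both can be verified exactly by enumerating every sign vector. The sketch below does so for hypothetical inputs; the values of `n`, `d`, `R` and the way the points are generated are assumptions made for the example.

```python
import itertools
import math
import random


def norm2(v):
    return math.sqrt(sum(t * t for t in v))


rng = random.Random(0)
n, d, R = 10, 3, 2.0

# Hypothetical inputs, rescaled so that every ||x_i||_2 <= R.
xs = []
for _ in range(n):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    scale = R * rng.uniform(0.2, 1.0) / norm2(v)
    xs.append([scale * t for t in v])

mean_norm = 0.0      # E_sigma || sum_i sigma_i x_i ||_2
mean_sq_norm = 0.0   # E_sigma || sum_i sigma_i x_i ||_2^2
for signs in itertools.product((-1.0, 1.0), repeat=n):
    s = [sum(sig * x[j] for sig, x in zip(signs, xs)) for j in range(d)]
    mean_norm += norm2(s)
    mean_sq_norm += sum(t * t for t in s)
mean_norm /= 2 ** n
mean_sq_norm /= 2 ** n

sum_sq = sum(norm2(x) ** 2 for x in xs)   # sum_i ||x_i||_2^2

print(f"E||sum||^2 = {mean_sq_norm:.6f}, sum of squared norms = {sum_sq:.6f}")
print(f"E||sum|| = {mean_norm:.6f} <= sqrt(E||sum||^2) = {math.sqrt(mean_sq_norm):.6f}")
print(f"sum of squared norms <= n R^2 = {n * R * R:.6f}")
```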
Conclusion. Substituting the base input bound back into the unrolled network bound, and multiplying by the loss Lipschitz constant and capacity bound ($L_{h}\cdot M$), we arrive at the overall expected Rademacher complexity bound:

$$\mathbb{E}_{\mathcal{S}}[\hat{\mathfrak{R}}_{n}(\mathcal{H})]\leq\mathcal{O}\left(L_{h}\cdot M\cdot\frac{R\sqrt{K}\left(\prod_{l=1}^{L-1}L_{\sigma_{l}}\right)\left(\prod_{l=1}^{L}\lVert W_{l}\rVert_{2}\right)}{\sqrt{n}}\right).\tag{B.17}$$

This establishes the upper bound stated in Theorem [B.2](https://arxiv.org/html/2605.26477#A2.Thmtheorem2), which completes the proof of Theorem [B.2](https://arxiv.org/html/2605.26477#A2.Thmtheorem2). ∎

To bridge the gap between empirical observations and true distributions, we introduce the generalization theorem based on concentration inequalities. Assuming the loss function is globally bounded, $h(z)\in[0,B]$, we apply McDiarmid’s inequality to concentrate the empirical Rademacher complexity around its expectation.

###### Theorem B\.3\.

The expected true risk $\mathcal{L}_{true}(\phi)$ for any hypothesis $h\in\mathcal{H}$ is bounded by the empirical risk $\mathcal{L}_{emp}(\phi)$ and the expected Rademacher complexity $\mathbb{E}_{\mathcal{S}}[\hat{\mathfrak{R}}_{n}(\mathcal{H})]$ with probability at least $1-\delta$:

$$\mathcal{L}_{true}(\phi)\leq\mathcal{L}_{emp}(\phi)+2\mathbb{E}_{\mathcal{S}}[\hat{\mathfrak{R}}_{n}(\mathcal{H})]+3B\sqrt{\frac{\log(2/\delta)}{2n}}.\tag{B.18}$$

###### Proof\.

According to the standard Rademacher generalization theorem, with probability at least $1-\delta/2$, the true expected risk is bounded by:

$$\mathcal{L}_{true}(\phi)\leq\mathcal{L}_{emp}(\phi)+2\hat{\mathfrak{R}}_{n}(\mathcal{H})+B\sqrt{\frac{\log(2/\delta)}{2n}}.\tag{B.19}$$

To bridge the gap between the empirical complexity $\hat{\mathfrak{R}}_{n}(\mathcal{H})$ and its expectation $\mathbb{E}_{\mathcal{S}}[\hat{\mathfrak{R}}_{n}(\mathcal{H})]$, we apply McDiarmid’s inequality. Since the loss function is globally bounded by $B$, altering a single sample $z_{i}$ changes $\hat{\mathfrak{R}}_{n}(\mathcal{H})$ by at most $B/n$. By the bounded differences inequality, with probability at least $1-\delta/2$, we have:

$$\hat{\mathfrak{R}}_{n}(\mathcal{H})\leq\mathbb{E}_{\mathcal{S}}[\hat{\mathfrak{R}}_{n}(\mathcal{H})]+B\sqrt{\frac{\log(2/\delta)}{2n}}.\tag{B.20}$$

(Note: The logarithmic term structurally absorbs the union bound splitting constants for simplicity in asymptotic notation). Substituting this concentration inequality into the standard generalization theorem via the union bound, we establish the bound valid with probability at least $1-\delta$:

$$\begin{aligned}\mathcal{L}_{true}(\phi)&\leq\mathcal{L}_{emp}(\phi)+2\left(\mathbb{E}_{\mathcal{S}}[\hat{\mathfrak{R}}_{n}(\mathcal{H})]+B\sqrt{\frac{\log(2/\delta)}{2n}}\right)+B\sqrt{\frac{\log(2/\delta)}{2n}}\\&=\mathcal{L}_{emp}(\phi)+2\mathbb{E}_{\mathcal{S}}[\hat{\mathfrak{R}}_{n}(\mathcal{H})]+3B\sqrt{\frac{\log(2/\delta)}{2n}},\end{aligned}\tag{B.21}$$

which completes the proof of Theorem [B.3](https://arxiv.org/html/2605.26477#A2.Thmtheorem3). ∎
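To give a feel for the size of the deviation term, the arithmetic of Eqs. (B.19) to (B.21) can be written out directly. This is a minimal sketch: the function names and all numerical inputs (loss bound, sample sizes, confidence level, risk and complexity values) are hypothetical and chosen only for illustration.

```python
import math


def concentration_term(B, n, delta):
    """B * sqrt(log(2/delta) / (2n)), the deviation term in Eqs. (B.19) and (B.20)."""
    return B * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def generalization_bound(emp_risk, expected_rademacher, B, n, delta):
    """Right-hand side of Eq. (B.18)."""
    return emp_risk + 2.0 * expected_rademacher + 3.0 * concentration_term(B, n, delta)


# The deviation term decays like 1/sqrt(n): 100x more data shrinks it 10x.
for n in (1_000, 10_000, 100_000):
    print(n, 3.0 * concentration_term(B=1.0, n=n, delta=0.05))

# Combining (B.19) and (B.20) term by term reproduces the factor 3 in (B.18).
t = concentration_term(B=1.0, n=10_000, delta=0.05)
two_step = 0.1 + 2.0 * (0.02 + t) + t
one_step = generalization_bound(0.1, 0.02, B=1.0, n=10_000, delta=0.05)
print(two_step, one_step)
```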

Finally, to prove Theorem [IV.1](https://arxiv.org/html/2605.26477#S4.Thmtheorem1), we combine the intermediate results by substituting the explicit expressions into the concentration-based framework.

###### Proof\.

Step 1: Substitute the Expected Complexity Bound\.Recall the concentration\-based generalization bound from Theorem[B\.3](https://arxiv.org/html/2605.26477#A2.Thmtheorem3):

ℒt​r​u​e​\(ϕ\)≤ℒe​m​p​\(ϕ\)\+2​𝔼𝒮​\[ℜ^n​\(ℋ\)\]\+3​B​log⁡\(2/δ\)2​n\.\\mathcal\{L\}\_\{true\}\(\\phi\)\\leq\\mathcal\{L\}\_\{emp\}\(\\phi\)\+2\\mathbb\{E\}\_\{\\mathcal\{S\}\}\[\\hat\{\\mathfrak\{R\}\}\_\{n\}\(\\mathcal\{H\}\)\]\+3B\\sqrt\{\\frac\{\\log\(2/\\delta\)\}\{2n\}\}\.\(B\.22\)By substituting the expected Rademacher complexity bound derived in Theorem[B\.2](https://arxiv.org/html/2605.26477#A2.Thmtheorem2)into the above inequality, we obtain the intermediate structural bound:

$$
\mathcal{L}_{true}(\phi) \leq \mathcal{L}_{emp}(\phi) + \mathcal{O}\bigg(L_{h}\cdot M\cdot\frac{R\sqrt{K}\left(\prod_{l=1}^{L-1}L_{\sigma_{l}}\right)\left(\prod_{l=1}^{L}\lVert W_{l}\rVert_{2}\right)}{\sqrt{n}}\bigg) + 3B\sqrt{\frac{\log(2/\delta)}{2n}}. \tag{B.23}
$$

(Note: The constant coefficient $2$ is naturally absorbed by the $\mathcal{O}(\cdot)$ notation.)

Step 2: Explicit Algebraic Expansion of $L_{h}\cdot M$. The core of the proof lies in explicitly expanding the product of the Lipschitz constant $L_{h}$ and the evidence capacity $M$. Recall the formulations from Theorem [B.1](https://arxiv.org/html/2605.26477#A2.Thmtheorem1) and Assumption [IV.1](https://arxiv.org/html/2605.26477#S4.Thmassumption1):

$$
\begin{aligned}
L_{h} &= \left(2+\frac{1}{(K+1)^{2}}+\frac{2}{K(K+1)}+2\beta+\frac{\beta}{\min_{j}\lambda_{j}}\right)+\frac{\beta}{\lVert\boldsymbol{\lambda}\rVert_{1}}, \\
M &= \lVert\boldsymbol{\lambda}\rVert_{1}\left(\frac{1}{\mu_{min}}-1\right).
\end{aligned} \tag{B.24}
$$

For clarity, let $C=2+\frac{1}{(K+1)^{2}}+\frac{2}{K(K+1)}+2\beta+\frac{\beta}{\min_{j}\lambda_{j}}$, which serves as a constant scalar dependent on the label space dimensionality and fixed parameters. Thus, $L_{h}$ can be compactly rewritten as $L_{h}=C+\frac{\beta}{\lVert\boldsymbol{\lambda}\rVert_{1}}$.

Now, we multiply $L_{h}$ and $M$ directly:

$$
\begin{aligned}
L_{h}\cdot M &= \left(C+\frac{\beta}{\lVert\boldsymbol{\lambda}\rVert_{1}}\right)\cdot\left[\lVert\boldsymbol{\lambda}\rVert_{1}\left(\frac{1}{\mu_{min}}-1\right)\right] \\
&= \left(C\cdot\lVert\boldsymbol{\lambda}\rVert_{1}+\frac{\beta}{\lVert\boldsymbol{\lambda}\rVert_{1}}\cdot\lVert\boldsymbol{\lambda}\rVert_{1}\right)\times\left(\frac{1}{\mu_{min}}-1\right).
\end{aligned} \tag{B.25}
$$

Crucially, the prior capacity term $\lVert\boldsymbol{\lambda}\rVert_{1}$ cancels in the second term, reducing the structural product to a function that grows linearly in $\lVert\boldsymbol{\lambda}\rVert_{1}$ for fixed $C$:

$$
L_{h}\cdot M = \left(C\cdot\lVert\boldsymbol{\lambda}\rVert_{1}+\beta\right)\times\left(\frac{1}{\mu_{min}}-1\right). \tag{B.26}
$$

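
The cancellation leading to (B.26) can be checked numerically. The sketch below computes $L_h$ and $M$ from (B.24), multiplies them, and compares the result with the closed form in (B.26). The function names and the example values of $K$, $\beta$, $\boldsymbol{\lambda}$ and $\mu_{min}$ are hypothetical, and $\boldsymbol{\lambda}$ is assumed to have strictly positive entries so that $\lVert\boldsymbol{\lambda}\rVert_1$ is simply their sum.

```python
import math
from typing import Sequence, Tuple


def lipschitz_constant(K: int, beta: float, lam: Sequence[float]) -> Tuple[float, float]:
    """Return (C, L_h) as defined in (B.24); assumes every lam[j] > 0."""
    l1_norm = sum(lam)  # ||lambda||_1 for positive entries
    C = 2 + 1 / (K + 1) ** 2 + 2 / (K * (K + 1)) + 2 * beta + beta / min(lam)
    return C, C + beta / l1_norm


def evidence_capacity(lam: Sequence[float], mu_min: float) -> float:
    """Return M = ||lambda||_1 * (1 / mu_min - 1) from (B.24)."""
    return sum(lam) * (1 / mu_min - 1)


def product_closed_form(K: int, beta: float, lam: Sequence[float], mu_min: float) -> float:
    """Return (C * ||lambda||_1 + beta) * (1 / mu_min - 1) as in (B.26)."""
    C, _ = lipschitz_constant(K, beta, lam)
    return (C * sum(lam) + beta) * (1 / mu_min - 1)


# Illustrative values only (not taken from the paper).
K, beta, mu_min = 10, 0.5, 0.01
lam = [1.0] * K
_, L_h = lipschitz_constant(K, beta, lam)
direct = L_h * evidence_capacity(lam, mu_min)
closed = product_closed_form(K, beta, lam, mu_min)
print(math.isclose(direct, closed, rel_tol=1e-12))
```

Both routes agree up to floating-point rounding, which is all the identity claims: no approximation is introduced between (B.25) and (B.26).
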
Step 3: Final Synthesis. Finally, substituting the expanded exact expression of $(L_{h}\cdot M)$ back into the intermediate structural bound in Step 1, we arrive at the grand unified generalization bound:

$$
\begin{aligned}
\mathcal{L}_{true}(\phi) \leq{}& \mathcal{L}_{emp}(\phi) + \mathcal{O}\Bigg(\Bigg[\bigg(2+\frac{1}{(K+1)^{2}}+\frac{2}{K(K+1)}+2\beta+\frac{\beta}{\min_{j}\lambda_{j}}\bigg)\lVert\boldsymbol{\lambda}\rVert_{1}+\beta\Bigg]\times\bigg(\frac{1}{\mu_{min}}-1\bigg) \\
&\times\frac{R\sqrt{K}\big(\prod_{l=1}^{L-1}L_{\sigma_{l}}\big)\big(\prod_{l=1}^{L}\lVert W_{l}\rVert_{2}\big)}{\sqrt{n}}\Bigg) + 3B\sqrt{\frac{\log(2/\delta)}{2n}},
\end{aligned} \tag{B.27}
$$

This comprehensively completes the formal derivation presented in Theorem [IV.1](https://arxiv.org/html/2605.26477#S4.Thmtheorem1). ∎
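
As a rough illustration of how the network-dependent factor in (B.27) behaves, the sketch below evaluates $L_h \cdot M \cdot R\sqrt{K}\,\big(\prod_l L_{\sigma_l}\big)\big(\prod_l \lVert W_l\rVert_2\big)/\sqrt{n}$ for given spectral norms. It assumes the constant hidden by $\mathcal{O}(\cdot)$ equals 1, which the paper does not state, and the function name `complexity_term` and all example values are hypothetical.

```python
import math
from typing import Sequence


def complexity_term(
    lh_times_m: float,
    R: float,
    K: int,
    act_lipschitz: Sequence[float],
    spectral_norms: Sequence[float],
    n: int,
) -> float:
    """Evaluate the argument of O(.) in (B.27), taking the hidden constant as 1.

    lh_times_m     : the product L_h * M from (B.26)
    R              : the scalar R appearing in (B.27)
    K              : number of classes
    act_lipschitz  : L_{sigma_l} for l = 1..L-1 (one per activation)
    spectral_norms : ||W_l||_2 for l = 1..L (one per weight matrix)
    n              : number of training samples
    """
    if len(act_lipschitz) != len(spectral_norms) - 1:
        raise ValueError("expected L - 1 activation constants for L weight matrices")
    if n <= 0:
        raise ValueError("n must be positive")
    network_factor = math.prod(act_lipschitz) * math.prod(spectral_norms)
    return lh_times_m * R * math.sqrt(K) * network_factor / math.sqrt(n)


# Illustrative two-layer network with a 1-Lipschitz activation (e.g. ReLU).
print(complexity_term(1.0, 1.0, 4, [1.0], [2.0, 0.5], 100))
```

The product over layers means a single weight matrix with a large spectral norm inflates the whole term, while the $1/\sqrt{n}$ factor shrinks it as more data is available.
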

## Appendix C Compute Resources

The compute resources used in the experiments are listed as follows:

- CPU: AMD EPYC 7642 48-Core Processor $\times$ 2,
- GPU: NVIDIA GeForce RTX 4090 24G $\times$ 1,
- MEM: 500GB,
- Maximum total computing time per dataset: $\approx$ 5 min.
