Implicit Variational Rejection Sampling

arXiv cs.LG 06/15/26, 04:00 AM Papers
Summary
The article proposes Implicit Variational Rejection Sampling (IVRS), which integrates implicit distributions with rejection sampling to improve posterior approximation in variational inference, and introduces the Implicit Resampling Evidence Lower Bound (IR-ELBO) as a tighter variational lower bound.
arXiv:2606.14235v1 Announce Type: new Abstract: Variational Inference (VI) is a fundamental inference technique in Bayesian machine learning for approximating complex posterior distributions. Traditional VI often relies on the mean-field factorization, which can inadequately capture true posterior complexity. Recent advancements have leveraged neural networks to model implicit distributions, offering increased flexibility. However, the practical constraints of neural network architectures still produces inaccuracies. In this paper, we propose a method called Implicit Variational Rejection Sampling (IVRS), which integrates implicit distributions with rejection sampling to improve the posterior approximation. Our method uses neural networks to construct implicit proposal distributions, and rejection sampling with a discriminator network that estimates the density ratio between the implicit proposal and the true posterior for refining the approximation. Towards this end, we introduce the Implicit Resampling Evidence Lower Bound (IR-ELBO) as a metric to characterize the resampled distribution's quality and derive a tighter variational lower bound. Experimental results demonstrate that our method outperforms traditional variational inference techniques.
Original Article
View Cached Full Text
Cached at: 06/15/26, 09:12 AM
# Implicit Variational Rejection Sampling
Source: [https://arxiv.org/html/2606.14235](https://arxiv.org/html/2606.14235)
Shigui LiSouth China University of TechnologyWei ChenSouth China University of TechnologyJiacheng LiSouth China University of TechnologyZhiqi LinSouth China University of Technology Delu ZengCorresponding author\. Email:dlzeng@scut\.edu\.cnSouth China University of TechnologyXinghao DingXiamen UniversityJohn PaisleyColumbia UniversityQibin ZhaoCorresponding author\. Email:qibin\.zhao@riken\.jpRIKEN AIP

###### Abstract

Variational Inference \(VI\) is a fundamental inference technique in Bayesian machine learning for approximating complex posterior distributions\. Traditional VI often relies on the mean\-field factorization, which can inadequately capture true posterior complexity\. Recent advancements have leveraged neural networks to model implicit distributions, offering increased flexibility\. However, the practical constraints of neural network architectures still produces inaccuracies\. In this paper, we propose a method called Implicit Variational Rejection Sampling \(IVRS\), which integrates implicit distributions with rejection sampling to improve the posterior approximation\. Our method uses neural networks to construct implicit proposal distributions, and rejection sampling with a discriminator network that estimates the density ratio between the implicit proposal and the true posterior for refining the approximation\. Towards this end, we introduce the Implicit Resampling Evidence Lower Bound \(IR\-ELBO\) as a metric to characterize the resampled distribution’s quality and derive a tighter variational lower bound\. Experimental results demonstrate that our method outperforms traditional variational inference techniques\.

## 1Introduction

Variational Inference \(VI\) has emerged as a fundamental technique in Bayesian machine learning for approximating complex posterior distributions\[jordan1999introduction,hoffman2013stochastic\]\. Traditional VI methods frequently rely on a mean\-field assumption\[blei2017variational\], which trades off posterior expressiveness for computational tractability\. To address this limitation, implicit distributions, which are typically modeled using neural networks, have been proposed to leverage their flexibility when approximating complex posterior distributions\[mescheder2017adversarial,huszar2017variational,titsias2019unbiased,shi2017kernel\]; such implicit and diffusion\-based constructions have also been extended to richer Bayesian models such as deep Gaussian processes\[xu2024sparse,xu2026diffusion\]\. While neural networks are highly expressive in theory\[hornik1989multilayer,krizhevsky2012imagenet,lecun2015deep\], they may still struggle to match complex true posteriors in practice, especially under limited capacity, poor initialization, or optimization difficulty\[arora2017generalization\]\. As a result, the posterior approximations with neural networks are not robust as an off\-the\-shelf method\.

To improve neural network posterior approximations, we propose Implicit Variational Rejection Sampling \(IVRS\), which leverages rejection sampling\[gilks1992adaptive\]to better exploit the strengths of implicit distributions\. We first use neural networks to construct implicit distributions that serve as proposals\. We then design an acceptance probability function related to the density ratio between the proposal distribution and the true posterior, and apply rejection sampling to generate resampled samples\. A discriminator network is used to approximate the density ratio, thereby refining the proposal distribution into a more accurate posterior approximation\. By incorporating adversarial training techniques, this approach enables us to construct an Implicitly Resampled Evidence Lower Bound \(IR\-ELBO\)\.

We summarize our contributions below:

1. 1\)We introduce Implicit Variational Rejection Sampling \(IVRS\) to combine implicit distributions with rejection sampling for more accurate variational inference with neural networks\.
2. 2\)By incorporating adversarial training techniques, we construct an Implicit Resampling Evidence Lower Bound \(IR\-ELBO\) and analyze the resampled distribution, particularly its reduced KL divergence from the true posterior, providing theoretical support for improved accuracy\.
3. 3\)We demonstrate through experiments that IVRS can outperform traditional variational inference methods in terms of accuracy and efficiency\.

## 2Model Framework

### 2\.1Bayesian Generative Models

Consider an unsupervised generative model for a dataset𝒟=\{𝐱i\}i=1N\\mathcal\{D\}=\\\{\\mathbf\{x\}\_\{i\}\\\}\_\{i=1\}^\{N\}that has latent variables𝐳\\mathbf\{z\}and model parametersθ\\mathbf\{\\theta\}\. The joint distribution of the model has the form

p\(𝐱,𝐳\|θ\)=p\(𝐳\)p\(𝐱\|𝐳,θ\),p\(\\mathbf\{x\},\\mathbf\{z\}\|\\theta\)=p\(\\mathbf\{z\}\)p\(\\mathbf\{x\}\|\\mathbf\{z\},\\theta\),\(1\)wherep\(𝐳\)p\(\\mathbf\{z\}\)is the prior distribution of𝐳\\mathbf\{z\}andp\(𝐱\|𝐳,θ\)p\(\\mathbf\{x\}\|\\mathbf\{z\},\\theta\)is a parametric generative model\. In more traditional models, such as the Gaussian Mixture Model \(GMM\), this distribution is often specified through manual design\. With the advance of deep learning, and generative models in particular as exemplified by Variational Autoencoders \(VAEs\), this distribution is frequently parameterized using a neural network\. While this framework can be extended to supervised learning scenarios, we develop our method in the unsupervised learning regime\.

### 2\.2Variational Inference

The goal of inference is to model the posterior distribution of the latent variables in Equation[1](https://arxiv.org/html/2606.14235#S2.E1)\. For non\-conjugate models, such as those involving deep learning architectures, the posterior distribution is highly complex and requires approximate methods\. Variational inference \(VI\) provides a KL divergence approximation to the posterior distributionp\(𝐳\|𝐱\)p\(\\mathbf\{z\}\|\\mathbf\{x\}\)using a predefined variational distributionq\(𝐳\|ϕ\)q\(\\mathbf\{z\}\|\\phi\)by maximizing the Evidence Lower Bound \(ELBO\),

ℒ\(𝐱,θ,ϕ\)=𝔼q\(𝐳\|ϕ\)\[log⁡p\(𝐱,𝐳\|θ\)−log⁡q\(𝐳\|ϕ\)\]\.\\mathcal\{L\}\(\\mathbf\{x\},\\theta,\\phi\)=\\mathbb\{E\}\_\{q\(\\mathbf\{z\}\|\\phi\)\}\\left\[\\log p\(\\mathbf\{x\},\\mathbf\{z\}\|\\theta\)\-\\log q\(\\mathbf\{z\}\|\\phi\)\\right\]\.\(2\)The traditional mean\-field approximation assumes a factorized form for the variational distribution,

q\(𝐳\|𝐱,ϕ\)=∏i=1mq\(𝐳i\|𝐱,ϕi\),q\(\\mathbf\{z\}\|\\mathbf\{x\},\\phi\)=\\textstyle\\prod\_\{i=1\}^\{m\}q\(\\mathbf\{z\}\_\{i\}\|\\mathbf\{x\},\\phi\_\{i\}\),\(3\)wheremmrepresents the number of factors in the decomposition, and eachq\(𝐳i\|𝐱,ϕi\)q\(\\mathbf\{z\}\_\{i\}\|\\mathbf\{x\},\\phi\_\{i\}\)is often a simple parametric distribution\.

The mean field approximation sacrifices accuracy for tractability\. To address this limitation, the implicit variational inference method employs a neural network parameterization of the variational distribution\. These methods aim to capture more accurate, complex posterior structures by leveraging the expressive power of neural networks\. They take the form

𝐳∼qϕ\(𝐳\|𝐱\)⟶𝐳=fϕ\(𝐱,ϵ\),ϵ∼p\(ϵ\)\.\\mathbf\{z\}\\sim q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\\quad\\longrightarrow\\quad\\mathbf\{z\}=f\_\{\\phi\}\\left\(\\mathbf\{x\},\\epsilon\\right\),\\penalty 10000\\ \\penalty 10000\\ \\epsilon\\sim p\(\\epsilon\)\.\(4\)Here,ϕ\\phirepresents the parameters of a neural network, whileϵ\\epsilonis drawn independently from a simple distribution such as a Gaussian\. Recently, various algorithms have been proposed to effectively train such models, including Adversarial Variational Bayes\[mescheder2017adversarial\]and Semi\-implicit Variational Inference\[yin2018semi\]\.

Despite their flexibility, traditional neural networks face practical limitations, and their structural design often relies on empirical heuristics\. Rejection sampling\[naesseth2017reparameterization,jankowiak2023reparameterized\]offers a flexible approach to more robust implicit distribution learning without requiring additional model capacity or architectural changes\. We therefore view rejection sampling as a complementary mechanism for improving implicit VI, particularly when the variational family lacks sufficient support\. In the following section, we introduce a method that incorporates rejection sampling to address these challenges\.

## 3Proposed Method

### 3\.1Rejection Sampling

Rejection sampling is a standard statistical technique for generating samples from a target distribution via a proposal distribution\. Given a target distributionptar\(𝐳\)p\_\{\\mathrm\{tar\}\}\(\\mathbf\{z\}\)and a proposal distributionqpro\(𝐳\)q\_\{\\mathrm\{pro\}\}\(\\mathbf\{z\}\), rejection sampling accepts a sample𝐳∼qpro\(𝐳\)\\mathbf\{z\}\\sim q\_\{\\mathrm\{pro\}\}\(\\mathbf\{z\}\)with probability defined by an acceptance probability functiona\(𝐳\)a\(\\mathbf\{z\}\)that is proportional to the ratio of the target density to the proposal density,

a\(𝐳\)=ptar\(𝐳\)/Mqpro\(𝐳\),a\(\\mathbf\{z\}\)=p\_\{\\mathrm\{tar\}\}\(\\mathbf\{z\}\)/Mq\_\{\\mathrm\{pro\}\}\(\\mathbf\{z\}\),\(5\)whereM\>0M\>0and the choice ofMMensures that the acceptance probability is less than or equal to 1\. In our model, the target distributionptar\(𝐳\)=pθ\(𝐳\|𝐱\)p\_\{\\mathrm\{tar\}\}\(\\mathbf\{z\}\)=p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\), being the true posterior of the latent variables of model structure Equation[1](https://arxiv.org/html/2606.14235#S2.E1)\. We define the proposal distribution to be the implicit distribution in Equation[5](https://arxiv.org/html/2606.14235#S3.E5),qpro\(𝐳\)=qϕ\(𝐳\|𝐱\)q\_\{\\mathrm\{pro\}\}\(\\mathbf\{z\}\)=q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\. Therefore, we write the acceptance rate function asa\(𝐳;𝐱,θ,ϕ\)a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)\.

However, unlike traditional rejection sampling, calculating the acceptance rate functiona\(𝐳;𝐱,θ,ϕ\)a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)in our model is not analytical\. The primary challenges are two\-fold: i\) The target distributionpθ\(𝐳\|𝐱\)p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)is the true posterior, typically only representable in the unnormalized joint likelihood form of Bayes rule in Equation[1](https://arxiv.org/html/2606.14235#S2.E1); ii\) The proposal distributionqϕ\(𝐳\|𝐱\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)is an implicit distribution, often generated by a structure that makes it difficult to determine its probability density function\. We next turn to our proposal for addressing these two issues\.

### 3\.2The Acceptance Probability Function

To address these two challenges, we first express the target distributionpθ\(𝐳\|𝐱\)p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)using Bayes rule,

pθ\(𝐳\|𝐱\)=pθ\(𝐱\|𝐳\)p\(𝐳\)/pθ\(𝐱\),p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)=p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)p\(\\mathbf\{z\}\)/p\_\{\\theta\}\(\\mathbf\{x\}\),\(6\)wherepθ\(𝐱\|𝐳\)p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)is the likelihood,p\(𝐳\)p\(\\mathbf\{z\}\)is the prior, andpθ\(𝐱\)=∫pθ\(𝐱\|𝐳\)p\(𝐳\)𝑑𝐳p\_\{\\theta\}\(\\mathbf\{x\}\)=\\int p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)p\(\\mathbf\{z\}\)\\,d\\mathbf\{z\}is the evidence\. To ensure the acceptance probability is within the range\[0,1\]\[0,1\], we can ignore the evidence term provided we choose an appropriate scaling factorMMsuch that

a\(𝐳;𝐱,θ,ϕ\)=pθ\(𝐱\|𝐳\)p\(𝐳\)Mqϕ\(𝐳\|𝐱\)≤1\.a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)=\\frac\{p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)p\(\\mathbf\{z\}\)\}\{Mq\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\\leq 1\.\(7\)To ensure this constraint in practice, the acceptance rate function is usually constructed as

a\(𝐳;𝐱,θ,ϕ\)=min⁡\[pθ\(𝐱\|𝐳\)p\(𝐳\)Mqϕ\(𝐳\|𝐱\),1\]\.a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)=\\min\\left\[\\frac\{p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)p\(\\mathbf\{z\}\)\}\{Mq\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\},1\\right\]\.\(8\)However, the min function makes gradient\-based optimization for variational posterior parameters challenging\. To address this, we adopt the fully differentiable approximation ofgrover2018variational,

a\(𝐳;𝐱,θ,ϕ\)=pθ\(𝐱\|𝐳\)p\(𝐳\)pθ\(𝐱\|𝐳\)p\(𝐳\)\+Mqϕ\(𝐳\|𝐱\)∈\(0,1\)a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)=\\frac\{p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)p\(\\mathbf\{z\}\)\}\{p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)p\(\\mathbf\{z\}\)\+Mq\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\\in\(0,1\)\(9\)This approximation addresses the first challenge\.

For the second challenge, where the proposal distributionqϕ\(𝐳\|𝐱\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)is implicitly modeled by a neural networkϕ\\phi, we estimate it using adversarial training\. Specifically, we address the challenge of computing the termlog⁡p\(𝐳\)−log⁡qϕ\(𝐳\|𝐱\)\\log p\(\\mathbf\{z\}\)\-\\log q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)by introducing an additional discriminative networkT\(𝐱,𝐳\)T\(\\mathbf\{x\},\\mathbf\{z\}\), which distinguishes between pairs\(𝐱,𝐳\)\(\\mathbf\{x\},\\mathbf\{z\}\)sampled from the true joint distributionp\(𝐱,𝐳\)p\(\\mathbf\{x\},\\mathbf\{z\}\), and pairs\(𝐱,𝐳\)\(\\mathbf\{x\},\\mathbf\{z\}\)sampled using the implicit proposal distributionqϕ\(𝐳\|𝐱\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\. The objectiveD\(T\)D\(T\)for this discriminatorT\(𝐱,𝐳\)T\(\\mathbf\{x\},\\mathbf\{z\}\)is

D\(T\)=\\displaystyle D\(T\)\\penalty 10000\\ =\\penalty 10000𝔼p\(𝐱\)𝔼qϕ\(𝐳\|𝐱\)\[log⁡σ\(T\(𝐱,𝐳\)\)\]\\displaystyle\\mathbb\{E\}\_\{p\(\\mathbf\{x\}\)\}\\mathbb\{E\}\_\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\\left\[\\log\\sigma\(T\(\\mathbf\{x\},\\mathbf\{z\}\)\)\\right\]\(10\)\+\\displaystyle\+\\penalty 10000𝔼p\(𝐱\)𝔼p\(𝐳\)\[log⁡\(1−σ\(T\(𝐱,𝐳\)\)\)\],\\displaystyle\\mathbb\{E\}\_\{p\(\\mathbf\{x\}\)\}\\mathbb\{E\}\_\{p\(\\mathbf\{z\}\)\}\\left\[\\log\(1\-\\sigma\(T\(\\mathbf\{x\},\\mathbf\{z\}\)\)\)\\right\],whereσ\(t\)=11\+e−t\\sigma\(t\)=\\frac\{1\}\{1\+e^\{\-t\}\}denotes the sigmoid function\. Bygoodfellow2014generativeandmescheder2017adversarial, the optimal discriminatorT∗\(𝐱,𝐳\)T^\{\*\}\(\\mathbf\{x\},\\mathbf\{z\}\)is

T∗\(𝐱,𝐳\)=log⁡qϕ\(𝐳\|𝐱\)−log⁡p\(𝐳\)\.T^\{\*\}\(\\mathbf\{x\},\\mathbf\{z\}\)=\\log q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\-\\log p\(\\mathbf\{z\}\)\.\(11\)We see thatT∗T^\{\*\}can be directly substituted into Equation \([9](https://arxiv.org/html/2606.14235#S3.E9)\) to compute the implicit proposal distribution,

a\(𝐳;𝐱,θ,ϕ\)=pθ\(𝐱\|𝐳\)pθ\(𝐱\|𝐳\)\+Mexp⁡\(T∗\(𝐱,𝐳\)\)\.a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)=\\frac\{p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)\}\{p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)\+M\\exp\\left\(T^\{\*\}\(\\mathbf\{x\},\\mathbf\{z\}\)\\right\)\}\.\(12\)As a result, we can effectively perform rejection sampling even in the absence of explicit analytical forms for the proposal distribution\. Numerous methods are available for estimating density ratios of non\-analytical distributions\. We employ adversarial training here\[goodfellow2014generative,mescheder2017adversarial\], although alternative estimators, such as recent diffusion\- and Schrödinger\-bridge\-based density\-ratio estimation\[chen2025dequantified\], are equally compatible with our framework\. Additionally, the expectation in Equation \([10](https://arxiv.org/html/2606.14235#S3.E10)\) is on the outermost layer, so Monte Carlo estimation remains unbiased and is suitable for mini\-batch algorithms\.

Algorithm 1Sampler forrθ,ϕ\(𝐳\|𝐱\)r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)0:

aθ,ϕ\(𝐳;θ,ϕ\)a\_\{\\theta,\\phi\}\(\\mathbf\{z\};\\theta,\\phi\),

qϕ\(𝐳\|𝐱\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)
0:

𝐳∼rθ,ϕ\(𝐳\|𝐱\)\\mathbf\{z\}\\sim r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)
1:Perform gradient ascent on

D\(Tη\)D\(T\_\{\\eta\}\)in Equation \([10](https://arxiv.org/html/2606.14235#S3.E10)\) with respect to

η\\etato obtain

Tη∗T\_\{\\eta\}^\{\*\}
2:whileTruedo

3:

𝐳←\\mathbf\{z\}\\leftarrowsample from implict proposal

qϕ\(𝐳\|𝐱\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)by Equation \([4](https://arxiv.org/html/2606.14235#S2.E4)\)

4:Compute acceptance probability

a\(𝐳;𝐱,θ,ϕ\)a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)by Equation \([12](https://arxiv.org/html/2606.14235#S3.E12)\)

5:Sample uniform

u∼𝒰\[0,1\]u\\sim\\mathcal\{U\}\[0,1\]
6:if

u<a\(𝐳;𝐱,θ,ϕ\)u<a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)then

7:Output sample

𝐳\\mathbf\{z\}
8:endif

9:endwhile

### 3\.3Implicit Resampling ELBO

Unlike previous implicit variational inference, we us as our variational approximation the distribution resampled via rejection sampling, denoted as

rθ,ϕ\(𝐳\|𝐱\)=qϕ\(𝐳\|𝐱\)a\(𝐳;𝐱,θ,ϕ\)Zθ,ϕ\(𝐱\),r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)=\\frac\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)\}\{Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\)\},\(13\)whereZθ,ϕ\(𝐱\)=𝔼qϕ\(𝐳\|𝐱\)\[a\(𝐳;𝐱,θ,ϕ\)\]\.Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\)=\\mathbb\{E\}\_\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\[a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)\]\.

To sample fromrθ,ϕr\_\{\\theta,\\phi\}, we follow the procedure defined in Algorithm[1](https://arxiv.org/html/2606.14235#alg1)\. First, we use a neural network parameterized byη\\etato represent the discriminative networkTη\(𝐱,𝐳\)T\_\{\\eta\}\(\\mathbf\{x\},\\mathbf\{z\}\)\. By using gradient\-based optimization, we obtain a local optimal valueTη∗\(𝐱,𝐳\)T\_\{\\eta\}^\{\*\}\(\\mathbf\{x\},\\mathbf\{z\}\)\. Then, through an accept\-reject step, we resample from the implicit proposal distributionrθ,ϕr\_\{\\theta,\\phi\}\. We then define theImplicit Resampling Evidence Lower Bound\(IR\-ELBO\) on the marginal log\-likelihood of𝐱\\mathbf\{x\}\. This involves using the implicit distribution as the proposal distribution and the resampled distributionrθ,ϕr\_\{\\theta,\\phi\}as the variational distribution\. By Jensen’s inequality, we have that,

log⁡pθ\(𝐱\)\\displaystyle\\log p\_\{\\theta\}\(\\mathbf\{x\}\)≥𝔼rθ,ϕ\(𝐳\|𝐱\)\[log⁡pθ\(𝐱,𝐳\)rθ,ϕ\(𝐳\|𝐱\)\]\\displaystyle\\geq\\mathbb\{E\}\_\{r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\\left\[\\log\\frac\{p\_\{\\theta\}\(\\mathbf\{x\},\\mathbf\{z\}\)\}\{r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\\right\]\(14\)=𝔼rθ,ϕ\(𝐳\|𝐱\)\[log⁡pθ\(𝐱,𝐳\)Zθ,ϕ\(𝐱\)qϕ\(𝐳\|𝐱\)a\(𝐳;𝐱,θ,ϕ\)\]\.\\displaystyle=\\mathbb\{E\}\_\{r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\\left\[\\log\\frac\{p\_\{\\theta\}\(\\mathbf\{x\},\\mathbf\{z\}\)Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\)\}\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)\}\\right\]\.In this equation, we can use the discriminative networkTη∗\(𝐱,𝐳\)T\_\{\\eta\}^\{\*\}\(\\mathbf\{x\},\\mathbf\{z\}\)described in Algorithm[1](https://arxiv.org/html/2606.14235#alg1)to compute the probability density function of the implicit distribution\. Using Equations \([9](https://arxiv.org/html/2606.14235#S3.E9)\) and \([11](https://arxiv.org/html/2606.14235#S3.E11)\), we therefore have that

𝔼rθ,ϕ\[log⁡pθ\(𝐱,𝐳\)Zθ,ϕ\(𝐱\)qϕ\(𝐳\|𝐱\)a\(𝐳;𝐱,θ,ϕ\)\]=\\displaystyle\\mathbb\{E\}\_\{r\_\{\\theta,\\phi\}\}\\left\[\\log\\frac\{p\_\{\\theta\}\(\\mathbf\{x\},\\mathbf\{z\}\)Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\)\}\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)\}\\right\]\\penalty 10000\\ =\(15\)𝔼rθ,ϕ\[log⁡\(pθ\(𝐱\|𝐳\)exp⁡\(Tη∗\(𝐱,𝐳\)\)\+M\)\]\+log⁡Zθ,ϕ\(𝐱\)\.\\displaystyle\\penalty 10000\\ \\penalty 10000\\ \\penalty 10000\\ \\mathbb\{E\}\_\{r\_\{\\theta,\\phi\}\}\\bigg\[\\log\\bigg\(\\frac\{p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)\}\{\\exp\\left\(T\_\{\\eta\}^\{\*\}\\left\(\\mathbf\{x\},\\mathbf\{z\}\\right\)\\right\)\}\+M\\bigg\)\\bigg\]\+\\log Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\)\.For the termlog⁡Zθ,ϕ\(𝐱\)\\log Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\), we can again use Jensen’s inequality to define a lower bound,

log⁡Zθ,ϕ\(𝐱\)\\displaystyle\\log Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\)≥𝔼qϕ\(𝐳\|𝐱\)\[log⁡a\(𝐳;θ,ϕ\)\]\\displaystyle\\geq\\mathbb\{E\}\_\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\[\\log a\(\\mathbf\{z\};\\theta,\\phi\)\]\(16\)=𝔼qϕ\[log⁡pθ\(𝐱\|𝐳\)pθ\(𝐱\|𝐳\)\+Mexp⁡\(Tη∗\(𝐱,𝐳\)\)\]\.\\displaystyle=\\mathbb\{E\}\_\{q\_\{\\phi\}\}\\bigg\[\\log\\frac\{p\_\{\\theta\}\\left\(\\mathbf\{x\}\|\\mathbf\{z\}\\right\)\}\{p\_\{\\theta\}\\left\(\\mathbf\{x\}\|\\mathbf\{z\}\\right\)\+M\\exp\\left\(T\_\{\\eta\}^\{\*\}\\left\(\\mathbf\{x\},\\mathbf\{z\}\\right\)\\right\)\}\\bigg\]\.Substituting the lower bound forlog⁡Zθ,ϕ\(𝐱\)\\log Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\)from Equation \([16](https://arxiv.org/html/2606.14235#S3.E16)\) into Equation \([15](https://arxiv.org/html/2606.14235#S3.E15)\) yields the final loss function, which we call the IR\-ELBO\. Similar to Equation \([10](https://arxiv.org/html/2606.14235#S3.E10)\), since the expectation is applied to the outermost layer, the Monte Carlo approximation of this objective function remains unbiased and is appropriate for mini\-batch algorithms\. Samples from the implicit distribution can be directly obtained, and by adjusting the parameterMMthey can be resampled\.

### 3\.4Implicit Variational Rejection Sampling

Using the above derivations, we propose a new inference method for generative models called Implicit Variational Rejection Sampling \(IVRS\), which combines the strengths of implicit distributions and rejection sampling to achieve a more accurate posterior approximation\. The algorithm is summarized in Algorithm[2](https://arxiv.org/html/2606.14235#alg2)\. Optimization of the discriminator networkTη\(𝐱,𝐳\)T\_\{\\eta\}\(\\mathbf\{x\},\\mathbf\{z\}\)is reflected in both Algorithm[1](https://arxiv.org/html/2606.14235#alg1)and Algorithm[2](https://arxiv.org/html/2606.14235#alg2)— these steps can be merged in practice\.

0:Data

𝐱\\mathbf\{x\}, model parameters

θ\\theta, neural network parameters

ϕ\\phiand

η\\eta
0:Optimized parameters

θ∗\\theta^\{\*\}and

ϕ∗\\phi^\{\*\}
1:Sample Generation:Generate samples

\{𝐳i\}i=1Q\\\{\\mathbf\{z\}\_\{i\}\\\}\_\{i=1\}^\{Q\}from implicit proposal

qϕ\(𝐳\|𝐱\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\.

2:Density Ratio:Use discriminator network

Tη\(𝐱,𝐳\)T\_\{\\eta\}\(\\mathbf\{x\},\\mathbf\{z\}\)to estimate density ratio for each

𝐳i\\mathbf\{z\}\_\{i\}\.

3:Rejection Sampling:Accept/reject

𝐳i\\mathbf\{z\}\_\{i\}based on acceptance function

a\(𝐳i;𝐱i,θ,ϕ\)a\(\\mathbf\{z\}\_\{i\};\\mathbf\{x\}\_\{i\},\\theta,\\phi\)\. \(Alg\.[1](https://arxiv.org/html/2606.14235#alg1)\)

4:ELBO Optimization:Compute the IR\-ELBO by Equation \([15](https://arxiv.org/html/2606.14235#S3.E15)\) and Equation \([16](https://arxiv.org/html/2606.14235#S3.E16)\)\.

5:Update

θ←θ\+α∇θIR\-ELBO\\theta\\leftarrow\\theta\+\\alpha\\nabla\_\{\\theta\}\\text\{IR\-ELBO\}\.

6:Update

ϕ←ϕ\+β∇ϕIR\-ELBO\\phi\\leftarrow\\phi\+\\beta\\nabla\_\{\\phi\}\\text\{IR\-ELBO\}\.

7:Repeat steps 1 to 6 until convergence\.

Algorithm 2Implicit Variational Rejection Sampling#### Discussion of IVRS\.

We briefly analyze the properties of the resampling distributionrθ,ϕ\(𝐳\|𝐱\)r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)and show that it is indeed a better approximation compared to the implicit proposal distributionqϕ\(𝐳\|𝐱\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\. To that end, we directly compute the KL divergence betweenrθ,ϕ\(𝐳\|𝐱\)r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)and the true posteriorpθ\(𝐳\|𝐱\)p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\),

KL\(rθ,ϕ\(𝐳\|𝐱\)\|\|pθ\(𝐳\|𝐱\)\)=\\displaystyle\\mathrm\{KL\}\\left\(r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\|\|p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\\right\)\\penalty 10000\\ =\\penalty 10000\\ \\penalty 10000\\ \\penalty 10000\\\(17\)∫qϕ\(𝐳\|𝐱\)a\(𝐳;𝐱,θ,ϕ\)Zθ,ϕ\(𝐱\)log⁡qϕ\(𝐳\|𝐱\)a\(𝐳;𝐱,θ,ϕ\)Zθ,ϕ\(𝐱\)pθ\(𝐳\|𝐱\)d𝐳\.\\displaystyle\\penalty 10000\\ \\penalty 10000\\ \\penalty 10000\\ \\penalty 10000\\ \\penalty 10000\\ \\int\{\\frac\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)\}\{Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\)\}\\log\\frac\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)a\(\\mathbf\{z\};\\mathbf\{x\},\\theta,\\phi\)\}\{Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\)p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\}d\\mathbf\{z\}\.We directly substitute Equation \([9](https://arxiv.org/html/2606.14235#S3.E9)\) into the above KL divergence\. The RHS can then be rewritten as

∫pθ\(𝐳\|𝐱\)qϕ\(𝐳\|𝐱\)Λlog⁡qϕ\(𝐳\|𝐱\)Λd𝐳,\\int\{\\frac\{p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\{\\Lambda\}\\log\\frac\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\{\\Lambda\}\}d\\mathbf\{z\},\(18\)whereΛ=Zθ,ϕ\(𝐱\)\(pθ\(𝐳\|𝐱\)\+Mqϕ\(𝐳\|𝐱\)\)\\Lambda=Z\_\{\\theta,\\phi\}\(\\mathbf\{x\}\)\\left\(p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\+Mq\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\\right\)\.

We now characterize how this divergence depends on the scaling parameterMM\.

###### Proposition 1\.

Consider the idealized setting in which the discriminator is optimal, so that the acceptance probability in Equation \([9](https://arxiv.org/html/2606.14235#S3.E9)\) uses the exact density ratio\. ThenKL\(rθ,ϕ\(𝐳\|𝐱\)∥pθ\(𝐳\|𝐱\)\)\\mathrm\{KL\}\\left\(r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\\parallel p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\\right\)is a monotonically non\-increasing function ofMM\. Moreover, asM→0M\\rightarrow 0the acceptance rate approaches11andrθ,ϕr\_\{\\theta,\\phi\}approachesqϕ\(𝐳\|𝐱\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\), whereas asM→∞M\\rightarrow\\inftythe acceptance rate approaches0andrθ,ϕr\_\{\\theta,\\phi\}approaches the true posteriorpθ\(𝐳\|𝐱\)p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\.

The key step is to write the acceptance probability in the reduced formaM\(𝐳\)=1/\(1\+Mw\(𝐳\)\)a\_\{M\}\(\\mathbf\{z\}\)=1/\(1\+M\\,w\(\\mathbf\{z\}\)\), with density ratiow\(𝐳\)=qϕ\(𝐳\|𝐱\)/pθ\(𝐳\|𝐱\)w\(\\mathbf\{z\}\)=q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)/p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\. Differentiating the divergence then givesddMKL\(rθ,ϕ∥pθ\)=−Covrθ,ϕ\(cM,log⁡cM\)≤0\\tfrac\{d\}\{dM\}\\mathrm\{KL\}\(r\_\{\\theta,\\phi\}\\parallel p\_\{\\theta\}\)=\-\\mathrm\{Cov\}\_\{r\_\{\\theta,\\phi\}\}\\\!\\big\(c\_\{M\},\\log c\_\{M\}\\big\)\\leq 0, wherecM=w/\(1\+Mw\)c\_\{M\}=w/\(1\+Mw\)and the inequality follows becauseCov\(X,f\(X\)\)≥0\\mathrm\{Cov\}\(X,f\(X\)\)\\geq 0for any monotone non\-decreasingff\. A complete derivation is provided in Appendix[A](https://arxiv.org/html/2606.14235#A1)\. It follows that, in our algorithm,MMcontributes positively to model improvement by drivingrθ,ϕr\_\{\\theta,\\phi\}to provide a more accurate approximation compared to the implicit proposal distributionqϕ\(𝐳\|𝐱\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\), thereby achieving a tighter variational lower bound\. In our experiments we empirically determineMMby cross\-validation\. When the discriminator is only approximately optimal, the acceptance probability and the induced resampled distribution become approximate, so the monotonicity above is best understood as an idealized analytical property rather than a strict guarantee\.

## 4Experiments

We compare IVRS with other ELBO\-based variational inference methods, including the Adversarial Variational Bayes \(AVB\)\[mescheder2017adversarial\]and Semi\-Implicit Variational Inference \(SIVI\)\[yin2018semi\], across a range of unsupervised learning tasks\. These include some illustrative studies on several toy examples, followed by a comparison one a regression tast with Bayesian Neural Networks \(BNNs\) and an autoencoder task\. We also consider variants of SIVI, such as UIVI\[titsias2019unbiased\]\. We use the Adam optimizer\[kingma2014adam\]and empirically selectMMfor training via cross\-validation\. All experiments were conducted on an RTX 4090\. The code will be available with a final draft of the paper\.

![Refer to caption](https://arxiv.org/html/2606.14235v1/x1.png)\(a\)1D Gaussian
![Refer to caption](https://arxiv.org/html/2606.14235v1/x2.png)\(b\)Rejection sampling
![Refer to caption](https://arxiv.org/html/2606.14235v1/x3.png)\(c\)Banana shape
![Refer to caption](https://arxiv.org/html/2606.14235v1/x4.png)\(d\)Rejection sampling
![Refer to caption](https://arxiv.org/html/2606.14235v1/x5.png)\(e\)1D Laplace
![Refer to caption](https://arxiv.org/html/2606.14235v1/x6.png)\(f\)Rejection sampling
![Refer to caption](https://arxiv.org/html/2606.14235v1/x7.png)\(g\)X\-shape
![Refer to caption](https://arxiv.org/html/2606.14235v1/x8.png)\(h\)Rejection sampling
![Refer to caption](https://arxiv.org/html/2606.14235v1/x9.png)\(i\)1D GMM
![Refer to caption](https://arxiv.org/html/2606.14235v1/x10.png)\(j\)Rejection sampling
![Refer to caption](https://arxiv.org/html/2606.14235v1/x11.png)\(k\)2D GMM
![Refer to caption](https://arxiv.org/html/2606.14235v1/x12.png)\(l\)Rejection sampling

Figure 1:Density estimation tasks for 1D and 2D toy datasets\. The subplots sequentially show the kernel density curves estimated using the original implicit distributionQQbase for the target distributionPP, as well as the curves after applying rejection sampling\. The color gradient represents the density of the generated samples, with red indicating higher concentration and blue indicating sparser regions\. Table[1](https://arxiv.org/html/2606.14235#S4.T1)shows quantitative values\.Table 1:A comparison of cross\-entropy values before and after rejection sampling for different targets using implicit distributions as proposal distributions\. We observe that rejection sampling provides improved approximate posterior samples\. \(Lower is better\.\)### 4\.1Density Estimation of Toy Datasets

We IVRS to approximate six low dimensional synthetic distributions: A 1D Gaussian distribution, a 1D Laplace distribution, a 1D bimodal Gaussian Mixture Model \(GMM\), a 2D banana\-shaped distribution, a 2D X\-shaped mixture of Gaussians, and a 2D bimodal GMM\. The six densities are pictured in Figure[1](https://arxiv.org/html/2606.14235#S4.F1)and listed in Table[1](https://arxiv.org/html/2606.14235#S4.T1)along with the performance results\.

The dimension ofϵ\\epsilonwas set to 10, and the network approximationfϕf\_\{\\phi\}was parameterized by a 4\-layer MLP with layer widths\[20,40,20,2\]\[20,40,20,2\]\. The output offϕf\_\{\\phi\}was then combined with Gaussian noise\. Since this task is straightforward, we adopt a Monte Carlo estimator to estimateqϕq\_\{\\phi\}andKL\(qϕ\|\|ptar\)\\mathrm\{KL\}\(q\_\{\\phi\}\|\|p\_\{\\text\{tar\}\}\)for gradient\-based optimization\. The key difference in our method is that the trained neural network was used as the proposal distribution, followed by rejection sampling using the acceptance probability function defined in Equation[9](https://arxiv.org/html/2606.14235#S3.E9)\. Followingyin2018semi, we used 100 iterations for each inner\-loop of Monte Carlo sampling to estimate the entropy of the implicit distribution\. All methods were trained with 50,000 parameter updates\.

Table 2:Quantitative results for six UCI regression tasks\. More accurate posterior sampling allows IVRS to outperform several other VI approximations for learning the same Bayesian neural network\.Figure[1](https://arxiv.org/html/2606.14235#S4.F1)shows the contour plots of the synthetic distributions, along with kernel density estimates from samples drawn from the trained implicit distributionsqϕq\_\{\\phi\}\. Additionally, Figure[1](https://arxiv.org/html/2606.14235#S4.F1)compares the distributions before and after applying rejection sampling\. The results show the neural network’s slight misalignment with the target distribution contour\. However, after applying rejection sampling, our method demonstrates improved approximation with better alignment to the target distribution as expected\. Due to the challenges in normalizing the distribution after rejection sampling, we instead report the cross\-entropy between the target distributions and the approximate distributions in Table[1](https://arxiv.org/html/2606.14235#S4.T1)\. We conducted 10 runs and report the mean and standard deviation of the cross\-entropy values\. As shown, our method outperforms across all toy target distributions by reducing the cross\-entropy, further validating its effectiveness\.

### 4\.2Bayesian Neural Networks

We next consider the problem of sampling from the posterior of a Bayesian neural network \(BNN\)\. In this scenario, the latent model variables𝐳\\mathbf\{z\}correspond to the BNN weights\. We utilize a two\-layer network with 50 hidden units and ReLU activation functions\. We compare our method with AVB, SIVI, and several SIVI variants, including UIVI, SIVI\-SM\[yu2023semi\], and KSIVI\[cheng2024kernel\]\. We use six common UCI datasets to perform these experiments\. Each dataset is randomly partitioned, with 90% used for training and 10% for testing\. Both the proposal distributionϕ\\phiand the discriminatorη\\etaare modeled using four\-layer fully connected neural networks\. The results are averaged over 10 random trials\.

Table[2](https://arxiv.org/html/2606.14235#S4.T2)presents the average test root mean squared error \(RMSE\) and negative log\-likelihood \(NLL\) along with their standard deviations\. The six UCI datasets considered are indicated by their names\. As is evident, IVRS achieves competitive or superior performance compared to the baselines on nearly all problems, indicating that rejection sampling provides a more accurate representation of the BNN model variables\. Additional ablations—fixed\-architecture comparisons, discriminator\-optimization sensitivity, latent\-dimensionality scaling, and a decomposition of the gains—are provided in Appendix[C](https://arxiv.org/html/2606.14235#A3)\.

### 4\.3Sensitivity to the Rejection HyperparameterMM

The hyperparameterMMcontrols the acceptance probability and directly shapes the resampled variational distributionrθ,ϕ\(𝐳∣𝐱\)r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\\mid\\mathbf\{x\}\): increasingMMtightens the approximation by pushingrθ,ϕr\_\{\\theta,\\phi\}toward the true posterior, at the cost of a lower acceptance rate and higher computational overhead\. We evaluate IVRS withM∈\{0\.1,1,10,100,500\}M\\in\\\{0\.1,1,10,100,500\\\}on the three UCI BNN benchmarks \(Boston,Concrete,Protein\), keeping all other settings fixed and averaging over1010runs\. As shown in Table[3](https://arxiv.org/html/2606.14235#S4.T3), largerMMconsistently lowers both NLL and RMSE—empirically confirming the monotonicity established in Proposition[1](https://arxiv.org/html/2606.14235#Thmproposition1)—while the acceptance rate decreases and the relative cost grows only moderately \(within22–3×3\\timeseven atM=500M=500\)\. The main results in Table[2](https://arxiv.org/html/2606.14235#S4.T2)correspond toM=100M=100; moderate valuesM∈\[10,100\]M\\in\[10,100\]already capture most of the gains at a reasonable cost, which is why we selectMMin this range via lightweight cross\-validation\.

Table 3:Impact of the rejection hyperparameterMMon performance and efficiency\. Relative cost is computed as the inverse of the acceptance rate\. Lower NLL and RMSE indicate better performance\.
### 4\.4Variational Autoencoder Task on MNIST

Variational Autoencoders \(VAEs\)\[kingma2013auto\]are a popular method for unsupervised feature learning and dimensionality reduction that learn encoderϕ\\phiand decoderθ\\thetaparameters by maximizing the Evidence Lower Bound \(ELBO\) as defined in Equation[2](https://arxiv.org/html/2606.14235#S2.E2)\. Typical VAE implementations utilize Gaussian distributions and amortized inference to approximate a complex posterior distributionqϕq\_\{\\phi\}\. To improve this approximation various approaches such as Adversarial Variational Bayes \(AVB\) and Semi\-Implicit Variational Inference \(SIVI\) have been proposed, leveraging adversarial training and semi\-implicit hierarchical structures, respectively\.

To evaluate our proposed Implicit Variational Rejection Sampling \(IVRS\) on the VAE inference problem, we conducted experiments on the MNIST dataset and compare with AVB and other methods\. We trained our model on 60,000 training samples and evaluated performance on 10,000 test samples\. The encoder is designed as a two\-layer convolutional neural network \(CNN\) to map to the latent space, while the decoder consists of a four\-layer transposed convolution network for image reconstruction\. Figure[2](https://arxiv.org/html/2606.14235#S4.F2)shows examples sampled from the test set and our model, demonstrating that our method generates images with a closer resemblance to the ground truth compared to the baseline AVB method, owing to the improved sampling quality introduced by rejection sampling\.

![Refer to caption](https://arxiv.org/html/2606.14235v1/x13.png)\(a\)Ground Truth
![Refer to caption](https://arxiv.org/html/2606.14235v1/x14.png)\(b\)AVB
![Refer to caption](https://arxiv.org/html/2606.14235v1/x15.png)\(c\)IVRS \(ours\)

Figure 2:Examples of 16 MNIST images from the test set using the VAE learned by AVB and IVRS\. Our method generates images with a closer resemblance to the ground truth compared to the baseline due to the rejection sampling\.Table 4:An empirical time analysis comparing the AVB method and our proposed IVRS on the MNIST dataset\. GPU\-accelerated parallelization is able to reduce the impact of additional computation required by our method\.To evaluate the efficiency of our model, we conducted an empirical time analysis comparing IVRS to the baseline AVB on MNIST\. The results are shown in Table[4](https://arxiv.org/html/2606.14235#S4.T4)\. Using a batch size of 64 during training, we observed that our model incurs only a slight increase in computational cost, showing the advantage of our GPU\-accelerated parallel sampling implementation\.

Table 5:Comparison of reported Negative Log Evidence \(NLE\) values across different algorithms on the MNIST dataset\. An asterisk \(⋆\\star\) indicates results obtained with data augmentation\. Thebest,second\-best, andthird\-bestresults are highlighted\.We also quantitatively benchmarked our model against several well\-established autoencoding methods, including Importance Weighted Autoencoders \(IWAE\)\[burda2015importance\], Hamiltonian Variational Inference \(HVI\)\[salimans2015markov\], and Normalizing Flows \(NF\)\[rezende2015variational\]\. Many approaches have also adopted deep architectures for the VAE encoder to achieve more effective feature extraction\. Therefore, we also report comparisons with recent deep architecture\-based VAE improvements such as NVAE\[vahdat2020nvae\], CR\-NVAE\[sinha2021consistency\], and Efficient\-VDVAE\[hazami2022efficientvdvae\]\.

As shown in Table[5](https://arxiv.org/html/2606.14235#S4.T5), our vanilla IVRS method achieves a Negative Log Evidence \(NLE\) score of81\.7881\.78, surpassing traditional variational inference methods like SIVI and AVB\. Additionally, we considered integrating IVRS within the NVAE architecture, which further boosts performance to an NLE of77\.3677\.36\. This also represents an improvement over the original NVAE at78\.0178\.01\. Although CR\-NVAE\[sinha2021consistency\]is the one approach to demonstrate slightly better results, that approach relies on additional data augmentation techniques\. While data augmentation is highly effective in preventing overfitting, our focus is on demonstrating the competitiveness of rejection sampling in enhancing implicit variational inference, but we include those results for completeness\. We further compare against the more recent DVP\-VAE\[kuzina2024hierarchical\], a hierarchical VAE with a diffusion\-based VampPrior, which reports an NLE of77\.1077\.10; this is directly comparable to our IVRS\+NVAE result of77\.3677\.36, while IVRS additionally offers a general inference\-refinement mechanism that is orthogonal to the choice of prior\.

### 4\.5Results on CIFAR\-10 and ImageNet

To further evaluate IVRS on higher\-dimensional image data, we conducted VAE learning experiments on both the CIFAR\-10 and ImageNet datasets and compare against baseline inference methods\.

The CIFAR\-10 dataset consists of 50,000 training images and 10,000 test images, each consisting of32×3232\\times 32pixels and 3 color channels \(RGB\) across 10 classes\. The ImageNet dataset\[deng2009imagenet\], a large\-scale dataset commonly used for image classification and generative modeling tasks, includes over 14 million images spanning 1,000 object categories\. For our experiments, we utilized a subset of ImageNet consisting of64×6464\\times 64pixel color images, which are more manageable for generative models like VAEs\. This subset introduces greater inference complexity due to the diversity of object categories and higher resolution compared to CIFAR\-10\. Given the increased complexity and higher dimensionality of these datasets compared to MNIST, we adapted the VAE architecture by directly employing the encoder from the deep architecture NVAE\[vahdat2020nvae\]\.

For quantitative evaluation, we calculate bits per dimension \(bpd\), a standard metric for high\-dimensional image datasets\. Table[6](https://arxiv.org/html/2606.14235#S4.T6)reports the bpd scores for various variational inference methods on CIFAR\-10, while Table[7](https://arxiv.org/html/2606.14235#S4.T7)presents the corresponding results on ImageNet\. As can be seen, the proposed IVRS consistently achieves lower bpd values than a range of classic VAE\-based approaches, indicating improved density estimation quality\. Notably, while CR\-NVAE\[sinha2021consistency\]attains strong performance by leveraging additional data augmentation, our controlled experiments under the same augmentation protocol show that incorporating IVRS further improves performance\. In particular, on CIFAR\-10, CR\-NVAE achieves 2\.51 bpd with augmentation, and adding IVRS \(withM=1M=1\) reduces this to 2\.43 bpd\. This demonstrates that IVRS provides a non\-trivial gain even on top of strong augmentation\-based baselines, suggesting that the proposed inference refinement is orthogonal and complementary to data augmentation\. For a more recent point of comparison, the DVP\-VAE\[kuzina2024hierarchical\]reports2\.732\.73bpd on CIFAR\-10, which is on par with our IVRS\+NVAE result of2\.762\.76bpd, while IVRS remains complementary as an inference\-refinement step\. Moreover, despite the additional rejection sampling step, the overall computational overhead remains modest, indicating that IVRS is practical for large\-scale VAE training scenarios\.

Table 6:Comparison of reported Bits per Dimension \(bpd\) values among various algorithms for the CIFAR\-10 dataset\. An asterisk \(⋆\\star\) indicates results obtained using data augmentation\. Thebest,second\-best, andthird\-bestresults are highlighted\.Table 7:Comparison of reported Bits per Dimension \(bpd\) values among various algorithms for the Imagenet 64×\\times64 dataset\. An asterisk \(⋆\\star\) indicates results obtained using data augmentation\. Thebest,second\-best, andthird\-bestresults are highlighted\.

## 5Conclusion

We have introduced Implicit Variational Rejection Sampling, a novel posterior approximation method that enhances variational inference by integrating the flexibility of implicit distributions with the precision of rejection sampling\. By optimizing the derived Implicit Resampled Evidence Lower Bound \(IR\-ELBO\), IVRS provides a mechanism for enhancing posterior approximation over traditional VI methods\. Our experimental results demonstrate the effectiveness of IVRS across various tasks, including regression problems and VAE\-based image modeling, achieving superior quantitative performance over related inference approaches\. The supplement below contains more practical details about implementation and limitations, along with an in depth description of related work and ablation study\.

## References

## Appendix AProof of the Monotonicity of the KL Divergence inMM

Here we give a rigorous proof of Proposition[1](https://arxiv.org/html/2606.14235#Thmproposition1): in the idealized setting with an optimal discriminator, the divergenceKL\(rθ,ϕ\(𝐳\|𝐱\)∥pθ\(𝐳\|𝐱\)\)\\mathrm\{KL\}\\\!\\left\(r\_\{\\theta,\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\\parallel p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\\right\)is monotonically non\-increasing inMM\. We first record an elementary covariance inequality\.

###### Lemma 1\.

LetXXbe a real\-valued random variable and letffbe a monotone non\-decreasing function such that the expectations below exist\. ThenCov\(X,f\(X\)\)≥0\\mathrm\{Cov\}\(X,f\(X\)\)\\geq 0\.

###### Proof\.

LetX′X^\{\\prime\}be an independent copy ofXX\. Then

Cov\(X,f\(X\)\)=12𝔼\[\(X−X′\)\(f\(X\)−f\(X′\)\)\]\.\\mathrm\{Cov\}\(X,f\(X\)\)=\\tfrac\{1\}\{2\}\\,\\mathbb\{E\}\\\!\\left\[\(X\-X^\{\\prime\}\)\\big\(f\(X\)\-f\(X^\{\\prime\}\)\\big\)\\right\]\.Sinceffis monotone non\-decreasing,\(X−X′\)\(f\(X\)−f\(X′\)\)≥0\(X\-X^\{\\prime\}\)\\big\(f\(X\)\-f\(X^\{\\prime\}\)\\big\)\\geq 0for every pair\(X,X′\)\(X,X^\{\\prime\}\)\. Taking expectations givesCov\(X,f\(X\)\)≥0\\mathrm\{Cov\}\(X,f\(X\)\)\\geq 0\. ∎

###### Proof of Proposition[1](https://arxiv.org/html/2606.14235#Thmproposition1)\.

Writeπ\(𝐳\)=pθ\(𝐳\|𝐱\)=pθ\(𝐱\|𝐳\)p\(𝐳\)/pθ\(𝐱\)\\pi\(\\mathbf\{z\}\)=p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)=p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)p\(\\mathbf\{z\}\)/p\_\{\\theta\}\(\\mathbf\{x\}\)for the true posterior, and let the acceptance probability in Equation \([9](https://arxiv.org/html/2606.14235#S3.E9)\) be

aM\(𝐳\)=pθ\(𝐱\|𝐳\)p\(𝐳\)pθ\(𝐱\|𝐳\)p\(𝐳\)\+Mqϕ\(𝐳\|𝐱\)\.a\_\{M\}\(\\mathbf\{z\}\)=\\frac\{p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)p\(\\mathbf\{z\}\)\}\{p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)p\(\\mathbf\{z\}\)\+Mq\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\.Usingpθ\(𝐱\|𝐳\)p\(𝐳\)=pθ\(𝐱\)π\(𝐳\)p\_\{\\theta\}\(\\mathbf\{x\}\|\\mathbf\{z\}\)p\(\\mathbf\{z\}\)=p\_\{\\theta\}\(\\mathbf\{x\}\)\\pi\(\\mathbf\{z\}\), this can be rewritten as

aM\(𝐳\)=pθ\(𝐱\)π\(𝐳\)pθ\(𝐱\)π\(𝐳\)\+Mqϕ\(𝐳\|𝐱\)\.a\_\{M\}\(\\mathbf\{z\}\)=\\frac\{p\_\{\\theta\}\(\\mathbf\{x\}\)\\pi\(\\mathbf\{z\}\)\}\{p\_\{\\theta\}\(\\mathbf\{x\}\)\\pi\(\\mathbf\{z\}\)\+Mq\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\.Defining the density ratio and rescaled parameter

w\(𝐳\)=qϕ\(𝐳\|𝐱\)π\(𝐳\),M~=Mpθ\(𝐱\),w\(\\mathbf\{z\}\)=\\frac\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\}\{\\pi\(\\mathbf\{z\}\)\},\\qquad\\tilde\{M\}=\\frac\{M\}\{p\_\{\\theta\}\(\\mathbf\{x\}\)\},givesaM\(𝐳\)=1/\(1\+M~w\(𝐳\)\)a\_\{M\}\(\\mathbf\{z\}\)=1/\(1\+\\tilde\{M\}w\(\\mathbf\{z\}\)\)\. Sincepθ\(𝐱\)p\_\{\\theta\}\(\\mathbf\{x\}\)is a positive constant independent of𝐳\\mathbf\{z\}, we may absorb it into the scalar parameter and equivalently writeaM\(𝐳\)=1/\(1\+Mw\(𝐳\)\)a\_\{M\}\(\\mathbf\{z\}\)=1/\(1\+Mw\(\\mathbf\{z\}\)\)\.

The induced resampled distribution is

rM\(𝐳\)=qϕ\(𝐳\|𝐱\)aM\(𝐳\)ZM,ZM=𝔼qϕ\[aM\(𝐳\)\]\.r\_\{M\}\(\\mathbf\{z\}\)=\\frac\{q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\\,a\_\{M\}\(\\mathbf\{z\}\)\}\{Z\_\{M\}\},\\qquad Z\_\{M\}=\\mathbb\{E\}\_\{q\_\{\\phi\}\}\[a\_\{M\}\(\\mathbf\{z\}\)\]\.Usingqϕ\(𝐳\|𝐱\)=w\(𝐳\)π\(𝐳\)q\_\{\\phi\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)=w\(\\mathbf\{z\}\)\\pi\(\\mathbf\{z\}\), we obtain

rM\(𝐳\)=π\(𝐳\)cM\(𝐳\)Z~M,cM\(𝐳\)=w\(𝐳\)1\+Mw\(𝐳\),Z~M=𝔼π\[cM\(𝐳\)\]\.r\_\{M\}\(\\mathbf\{z\}\)=\\frac\{\\pi\(\\mathbf\{z\}\)\\,c\_\{M\}\(\\mathbf\{z\}\)\}\{\\tilde\{Z\}\_\{M\}\},\\quad c\_\{M\}\(\\mathbf\{z\}\)=\\frac\{w\(\\mathbf\{z\}\)\}\{1\+Mw\(\\mathbf\{z\}\)\},\\quad\\tilde\{Z\}\_\{M\}=\\mathbb\{E\}\_\{\\pi\}\[c\_\{M\}\(\\mathbf\{z\}\)\]\.Hence the divergence to the true posterior is

KL\(rM∥π\)=1Z~M𝔼π\[cMlog⁡cM\]−log⁡Z~M\.\\mathrm\{KL\}\(r\_\{M\}\\parallel\\pi\)=\\frac\{1\}\{\\tilde\{Z\}\_\{M\}\}\\mathbb\{E\}\_\{\\pi\}\[c\_\{M\}\\log c\_\{M\}\]\-\\log\\tilde\{Z\}\_\{M\}\.Differentiating with respect toMMand using

dcMdM=−cM2,dZ~MdM=−𝔼π\[cM2\],d\(cMlog⁡cM\)dM=−cM2\(1\+log⁡cM\),\\frac\{dc\_\{M\}\}\{dM\}=\-c\_\{M\}^\{2\},\\quad\\frac\{d\\tilde\{Z\}\_\{M\}\}\{dM\}=\-\\mathbb\{E\}\_\{\\pi\}\[c\_\{M\}^\{2\}\],\\quad\\frac\{d\(c\_\{M\}\\log c\_\{M\}\)\}\{dM\}=\-c\_\{M\}^\{2\}\(1\+\\log c\_\{M\}\),the two terms involving𝔼π\[cM2\]/Z~M\\mathbb\{E\}\_\{\\pi\}\[c\_\{M\}^\{2\}\]/\\tilde\{Z\}\_\{M\}cancel, leaving

ddMKL\(rM∥π\)=−1Z~M𝔼π\[cM2log⁡cM\]\+𝔼π\[cM2\]Z~M2𝔼π\[cMlog⁡cM\]\.\\frac\{d\}\{dM\}\\mathrm\{KL\}\(r\_\{M\}\\parallel\\pi\)=\-\\frac\{1\}\{\\tilde\{Z\}\_\{M\}\}\\mathbb\{E\}\_\{\\pi\}\[c\_\{M\}^\{2\}\\log c\_\{M\}\]\+\\frac\{\\mathbb\{E\}\_\{\\pi\}\[c\_\{M\}^\{2\}\]\}\{\\tilde\{Z\}\_\{M\}^\{2\}\}\\mathbb\{E\}\_\{\\pi\}\[c\_\{M\}\\log c\_\{M\}\]\.Rewriting the expectations underrMr\_\{M\}via𝔼rM\[g\]=𝔼π\[cMg\]/Z~M\\mathbb\{E\}\_\{r\_\{M\}\}\[g\]=\\mathbb\{E\}\_\{\\pi\}\[c\_\{M\}g\]/\\tilde\{Z\}\_\{M\}yields

ddMKL\(rM∥π\)=−𝔼rM\[cMlog⁡cM\]\+𝔼rM\[cM\]𝔼rM\[log⁡cM\]=−CovrM\(cM,log⁡cM\)\.\\frac\{d\}\{dM\}\\mathrm\{KL\}\(r\_\{M\}\\parallel\\pi\)=\-\\mathbb\{E\}\_\{r\_\{M\}\}\[c\_\{M\}\\log c\_\{M\}\]\+\\mathbb\{E\}\_\{r\_\{M\}\}\[c\_\{M\}\]\\,\\mathbb\{E\}\_\{r\_\{M\}\}\[\\log c\_\{M\}\]=\-\\mathrm\{Cov\}\_\{r\_\{M\}\}\(c\_\{M\},\\log c\_\{M\}\)\.Applying Lemma[1](https://arxiv.org/html/2606.14235#Thmlemma1)withX=cMX=c\_\{M\}and the monotone non\-decreasing functionf\(X\)=log⁡Xf\(X\)=\\log XgivesCovrM\(cM,log⁡cM\)≥0\\mathrm\{Cov\}\_\{r\_\{M\}\}\(c\_\{M\},\\log c\_\{M\}\)\\geq 0\. Therefore

ddMKL\(rM∥pθ\(𝐳\|𝐱\)\)≤0,\\frac\{d\}\{dM\}\\mathrm\{KL\}\\\!\\left\(r\_\{M\}\\parallel p\_\{\\theta\}\(\\mathbf\{z\}\|\\mathbf\{x\}\)\\right\)\\leq 0,so the divergence is monotonically non\-increasing inMMin the idealized exact\-ratio setting\. When the discriminator is imperfect, the acceptance probability and the induced resampled distribution are only approximate, so this monotonicity should be interpreted as an idealized analytical property rather than a strict guarantee\. ∎

## Appendix BRelated Work

The main idea of implicit variational inference is to transform a simple base distribution into a more expressive one using a deep neural network\[mescheder2017adversarial,huszar2017variational,titsias2019unbiased,shi2017kernel\]\. To avoid density ratio estimation, semi\-implicit variational inference has been proposed, where the variational distributions are formed through a semi\-implicit hierarchical construction, and surrogate ELBOs \(asymptotically unbiased\) are employed for training\[yin2018semi,molchanov2019doubly,moens2021efficient,lim2024particle,yu2023semi,cheng2024kernel\]\. These methods have made improvements to implicit variational inference or semi\-implicit variational inference at various levels\. However, these methods do not address the limited expressiveness of neural networks, which may still fall short in certain scenarios\. Our work aims to address this issue by proposing an implicit rejection sampling algorithm\.

Beyond standard latent\-variable models, implicit and diffusion\-based variational inference has been increasingly applied to richer Bayesian models, including deep Gaussian processes and their dynamical extensions\[xu2024sparse,xu2026diffusion,xu2024neural,xu2025variational,xu2025bayesian,xu2025fully\], Bayesian last\-layer models\[xu2024flexible\], heavy\-tailed Student\-ttprocesses\[xu2024sparsestudent,xu2026sparse\], and nonlinear probabilistic latent\-variable models for industrial sensing\[chen2024analyzing\]\. In parallel, diffusion\-based generative models have motivated new tools for posterior sampling and density\-ratio estimation\[chen2025dequantified,li2026evodiff,chen2024rethinking\], which are complementary to the discriminator\-based density\-ratio estimation used in IVRS\.

On the other hand, rejection sampling is a classical method to generate samples from a distribution using samples drawn from a different distribution\[gilks1992adaptive,grover2018variational,azadi2018discriminator,stimper2022resampling,verine2024optimal\]\. Specifically,grover2018variationalandstimper2022resamplinghave embedded latent rejection sampling within their training processes, applying it within a variational inference and a normalizing flow framework, respectively\. We acknowledge their contributions; however, in this work, we address a different problem—leveraging implicit distributions as proposal distributions for rejection sampling and variational inference, creating a novel variational inference algorithm\.

The prior inbauer2019resampledis reformulated as a resampled distribution, whereas our method explicitly derives a new evidence lower bound \(IR\-ELBO\) based on rejection sampling\. While both approaches use density ratio estimation, our work focuses on leveraging rejection sampling to construct a tighter variational objective by directly approximating the posterior through implicit distributions\.

Equation \(2\) injankowiak2023reparameterizedand our Equation \(13\) share a similar mathematical form, but a key difference lies in the nature of the variational distribution: injankowiak2023reparameterized, the distribution is explicit, while in our work, it is implicit\. This distinction is significant as it aligns with our objective of improving implicit variational inference by utilizing rejection sampling to create more flexible posterior approximations\.

Our method can also be extended to importance samplingburda2015importance, and could be adapted for use with mixtures of variational distributions\. Using mixtures could enhance the expressivity of the approximate posterior and align well with the results fromhotti2024efficientandkviman2023cooperation\. Combining the importance\-weighted ELBO \(IW\-ELBO\)burda2015importancewith Eq\. \(16\) could provide a tighter bound\. This would involve integrating importance weighting within the rejection sampling framework, potentially amplifying the advantages of both techniques\. This presents an exciting direction for future work\.

## Appendix CAblation Study

In this section, we provide additional ablation experiments that complement the sensitivity analysis of the rejection hyperparameterMMin the main text\. We focus on three representative UCI Bayesian neural network regression benchmarks:Boston Housing,Concrete, andProtein\. These datasets allow us to systematically examine the trade\-off between posterior accuracy and computational efficiency induced by rejection sampling\.

### C\.1Fixed Architecture Comparison

To isolate the contribution of rejection sampling from increased network capacity, we conduct controlled experiments where all methods share identical neural network architectures\. This allows us to explicitly disentangle the effect of posterior refinement via rejection sampling from that of simply adding more model parameters\.

Specifically, we compare the following settings:

1. 1\.Baseline implicit variational inference \(Implicit VI\) with a fixed 4\-layer MLP\.
2. 2\.IVRS applied to the same architecture \(no increase in network capacity\)\.
3. 3\.Implicit VI with increased network capacity \(2×\\timesand 3×\\timeshidden width\)\.

All models are trained and evaluated under identical optimization settings\.

Table[8](https://arxiv.org/html/2606.14235#A3.T8)reports the results on three representative UCI BNN regression datasets\. Several important observations emerge:

- •Rejection sampling outperforms capacity increase\.Applying IVRS with the same base architecture consistently yields larger improvements in NLL and RMSE than doubling or tripling the network width\. For example, on the Boston dataset, IVRS achieves a4\.9%4\.9\\%NLL improvement over the baseline, whereas doubling network capacity yields only a1\.1%1\.1\\%improvement\.
- •Capacity alone cannot replicate IVRS gains\.Even with 3×\\timesnetwork capacity, implicit VI underperforms IVRS using the original architecture across all datasets\. This indicates that the performance gains of IVRS do not stem from increased representational capacity\.
- •Complementary mechanisms\.These results suggest that rejection sampling improves posterior approximation in a manner fundamentally different from architectural scaling\. IVRS refines the variational distribution through sample\-level selection rather than function\-level expressiveness\.

Overall, this ablation demonstrates that IVRS provides a principled and parameter\-efficient alternative to increasing network capacity, reinforcing the claim that rejection sampling contributes meaningfully beyond standard neural network design choices\.

Table 8:Ablation study with fixed neural network architecture\. All models use a 4\-layer MLP; \[kk\] denotes hidden width per layer\. Lower NLL and RMSE indicate better performance\.
### C\.2Complete Hyperparameter Settings forMM

For completeness and reproducibility, we summarize the values of the rejection hyperparameterMMused across all experiments\.

#### Toy Density Estimation \(Table 1\)\.

For low\-dimensional toy distributions, we set:

- •1D distributions \(Gaussian, Laplace, GMM\):M=0\.1M=0\.1
- •2D distributions \(Banana, X\-shape, GMM\):M=500M=500

Higher\-dimensional targets require largerMMto sufficiently refine the proposal distribution, consistent with the analysis in Section 3\.4\.

#### UCI Regression \(BNN Experiments, Table 2\)\.

For Bayesian neural network regression, we selectedMMvia cross\-validation on held\-out validation sets:

- •Boston, Concrete, Protein, Power:M=100M=100
- •Yacht, Wine:M=10M=10

These BNN posteriors are relatively low\-dimensional \(on the order of hundreds of parameters\)\. AtM=100M=100, acceptance rates remain practical \(approximately5050–60%60\\%\), yielding substantial accuracy improvements \(44–13%13\\%NLL reduction\) with moderate computational overhead \(1\.61\.6–2\.1×2\.1\\times\)\.

#### VAE Experiments\.

For large\-scale VAE training, we fixM=1M=1for all datasets:

- •MNIST
- •CIFAR\-10
- •ImageNet 64×\\times64

Unlike BNN tasks, VAE training involves significantly larger datasets \(e\.g\., 60K for MNIST, 50K for CIFAR\-10, and over 1M samples for ImageNet\)\. To maintain reasonable training time, we use a smallMM, which yields high acceptance rates \(approximately80%80\\%\) while still demonstrating the benefit of rejection sampling\. These experiments primarily serve to validate the general applicability of IVRS across different model classes\.

### C\.3Training and Inference with Rejection Sampling

We clarify the role of rejection sampling during both training and inference in IVRS, as this point may be a source of potential confusion\.

#### Training Phase\.

During training, rejection sampling is an integral component of IVRS\. The variational distribution is defined as the resampled distributionrθ,ϕ\(z∣x\)r\_\{\\theta,\\phi\}\(z\\mid x\), and all expectations in the proposed IR\-ELBO are taken with respect to this distribution\. Consequently, rejection sampling is explicitly applied during training to generate samples fromrθ,ϕ\(z∣x\)r\_\{\\theta,\\phi\}\(z\\mid x\), as described in Algorithm 1 and Algorithm 2\.

#### Inference Phase\.

At test time, posterior samples are also drawn from the same resampled variational distributionrθ,ϕ\(z∣x\)r\_\{\\theta,\\phi\}\(z\\mid x\)using the identical rejection sampling procedure\. This ensures consistency between training and inference, and allows the improved posterior approximation obtained by IVRS to directly translate into better predictive performance\.

#### Practical Considerations\.

In practice, the acceptance rate is controlled by the hyperparameterMMand remains sufficiently high in all reported experiments \(see Appendix[C\.2](https://arxiv.org/html/2606.14235#A3.SS2)\)\. For large\-scale settings such as VAE training, we adopt a smallerMMto balance computational efficiency and posterior refinement\. Importantly, no additional approximation is introduced at test time beyond the rejection sampling mechanism already used during training\.

### C\.4Sensitivity to Discriminator Optimization

The quality of the discriminator is a critical part of the method\. In our framework the discriminator is not a peripheral component: it is used to approximate the density\-ratio quantity that enters the acceptance probability, the induced resampled posterior, and the practical training objective\. If the discriminator is inaccurate, the resulting error does not remain local but propagates through the entire procedure\. The idealized analysis of Proposition[1](https://arxiv.org/html/2606.14235#Thmproposition1)should therefore be understood as relying on a sufficiently accurate—in the strongest case optimal—discriminator\. In practical optimization, however, the discriminator is only approximately trained, so the acceptance function, the resampled posterior, and the optimized objective are all approximate; we therefore do not claim a general theoretical guarantee that approximation errors in the discriminator cannot degrade the posterior approximation\. What is optimized in practice is better viewed as a discriminator\-parameterized surrogate objective, which is also why Algorithm[2](https://arxiv.org/html/2606.14235#alg2)updates the discriminator and the variational model in alternation rather than fully solving the discriminator subproblem at each outer step\.

Our experiments show that the method is sensitive to discriminator optimization, but not in the trivial sense that stronger discriminator training always helps—overly aggressive updates can degrade both acceptance behavior and downstream posterior quality\. On the Concrete dataset, under a fixed evaluation protocol with100100accepted posterior samples, we observe the behavior in Table[9](https://arxiv.org/html/2606.14235#A3.T9): a moderate setting is stable, whereas making the discriminator much stronger \(more inner steps\) or updating it too frequently drives the acceptance rate toward zero and substantially worsens RMSE/NLL\. This indicates that the main failure mode is not underfitting of the discriminator, but*imbalance*in the alternating optimization between the discriminator and the variational model; the sensitivity is primarily a training\-dynamics issue rather than a purely approximation\-theoretic one\.

Table 9:Effect of discriminator optimization on IVRS \(Concrete,100100accepted samples\)\. The “strong” setting drives the acceptance rate down to approximately0\.00020\.0002\.
### C\.5Effect of Latent Dimensionality

Although our approach uses a smooth rejection\-inspired mechanism rather than classical exact rejection sampling, it still inherits a related scalability issue: as the effective latent dimensionality grows, acceptance behavior can deteriorate and tuning becomes more delicate, so the method is not immune to a curse\-of\-dimensionality effect\. On MNIST, varying the latent dimensionality from1616to392392under a fixed evaluation protocol gives the results in Table[10](https://arxiv.org/html/2606.14235#A3.T10): the best NLE is achieved in a moderate range \(around3232–6464dimensions\), while larger latent spaces gradually reduce the calibrated acceptance rate and worsen NLE\. Once the latent space becomes too large, the proposal is harder to refine effectively by rejection\-style resampling\. Overall, IVRS is most suitable when one has a moderately expressive proposal, manageable latent dimensionality, and sufficient room for posterior refinement through selective resampling\.

Table 10:Effect of latent dimensionality on IVRS \(MNIST\)\. “Accept\. rate” is the calibrated acceptance rate\.
### C\.6Decomposing the Source of the Gains

To understand where the improvements come from, we perform controlled decomposition ablations on the Concrete dataset using the same backbone, training protocol, and evaluation budget, while isolating four cases:baseline\(implicit train \+ implicit test\),reweight\_only\(IVRS\-style train \+ implicit test\),selection\_only\(implicit train \+ IVRS\-style test\), andfull\_ivrs\(IVRS\-style train \+ IVRS\-style test\)\. Table[11](https://arxiv.org/html/2606.14235#A3.T11)reports RMSE/NLL under two discriminator settings \(M=1M\{=\}1andM=0\.1M\{=\}0\.1, both withdisc\-update\-every=5,disc\-steps=1\)\. The gain cannot be attributed to density\-ratio reweighting alone, and “implicit regularization” from the training objective by itself is not consistently beneficial; rather, the improvement comes from the*interaction*between reweighting and selection rather than from either component in isolation\.

Table 11:Controlled decomposition ablation on Concrete\. Only the full combination of reweighting \(train\) and selection \(test\) consistently improves over the baseline\.

## Appendix DLimitations and Additional Discussion

Despite the clear advantages of IVRS, which combines the accuracy of rejection sampling with the flexibility of implicit distributions, there are some limitations\. One key limitation is the need for empirical hand\-tuning of the hyperparameterMMin the acceptance probability function\. Additionally, while our experiments have shown improved bpd scores on standard datasets like CIFAR\-10, further testing on more complex and higher\-dimensional datasets is necessary to fully assess the robustness of our approach\.

### D\.1Robustness of the Method in High\-Dimensional Settings

We acknowledge the challenges in high\-dimensional scenarios due to the curse of dimensionality\. To mitigate this, our approach leverages implicit proposal distributions parameterized by neural networks, which are designed to closely approximate the target distribution\. This reduces the rejection rate and improves efficiency even in higher dimensions\.

In our experiments, we have evaluated the proposed method on datasets with moderately high\-dimensional posterior distributions \(e\.g\., several hundred dimensions\)\. Results demonstrate that the acceptance rates remain manageable, and the method performs favorably compared to baseline variational inference \(VI\) approaches\. We recognize, however, that testing on more extreme high\-dimensional cases \(e\.g\., thousands of dimensions\) could provide additional insights\. This is a valuable direction for future work to implement more scalable architectures to further explore this aspect\.

### D\.2Computational Efficiency Compared to Other Methods

Computational overhead is a critical factor in assessing the practicality of our approach\. The addition of the discriminator network introduces some computational cost, particularly for estimating the density ratio and acceptance probability\. To address this issue:

- •Comparison with Baselines:In our experiments, we observed that the computational cost of training the discriminator is offset by the reduction in bias due to tighter approximation of the Evidence Lower Bound \(ELBO\)\. Compared to standard VI methods, our approach exhibits a trade\-off where the marginal improvement in accuracy justifies the additional computation\.
- •Efficiency in Large\-Scale Settings:To improve scalability, we used a lightweight architecture for the discriminator and optimized its training through mini\-batch techniques\. However, we acknowledge that in large\-scale datasets or extremely high\-dimensional problems, further optimization \(e\.g\., parallelization or approximate methods\) may be necessary\.
- •Maximum Iteration Limit:We introduce a maximum iteration limit for the rejection sampling process to prevent it from getting stuck in an infinite loop

### D\.3Cost and Difficulty of Training the Discriminator \(Eq\.[10](https://arxiv.org/html/2606.14235#S3.E10)\)

In our framework, the discriminator approximates the density ratio between the proposal distributionqϕ\(z\|x\)q\_\{\\phi\}\(z\|x\)and the true posteriorp\(z\|x\)p\(z\|x\), which determines the acceptance probability in the rejection sampling step\. It is parameterized using the same architecture as in AVB and trained with a standard binary cross\-entropy loss\. As a result, the computational cost remains comparable to AVB, and no additional model complexity is introduced\.

The discriminator directly influences the sampling process by defining the acceptance probabilitya\(z;x,θ,ϕ\)a\(z;x,\\theta,\\phi\), making its accuracy crucial to overall performance\. However, we found that full convergence of the discriminator at each update step is not necessary in practice\. Instead, we adopt an alternating optimization strategy, where the discriminator is updated for a few epochs \(typically 5–10\) before updating the variational parameters, similar to the alternating training scheme used in Generative Adversarial Networks \(GANs\)\.

### D\.4Manual Tuning of the HyperparameterMM

The hyperparameterMMcontrols the acceptance threshold in rejection sampling and directly governs the trade\-off between posterior accuracy and computational efficiency\. In our experiments,MMis selected via cross\-validation on held\-out validation sets\.

As demonstrated in the ablation studies in Appendix[C](https://arxiv.org/html/2606.14235#A3), increasingMMconsistently improves posterior approximation quality, as reflected by lower NLL and RMSE, while simultaneously reducing the acceptance rate and increasing computational cost\. This behavior aligns with the theoretical analysis in Section 3\.4, which shows that the KL divergence between the resampled variational distribution and the true posterior decreases monotonically withMM\.

Importantly, we find that IVRS is not overly sensitive to the exact choice ofMMwithin a reasonable range\. Moderate values ofMMalready provide substantial accuracy gains over the implicit proposal distribution, while maintaining practical acceptance rates\. This allowsMMto be chosen using lightweight cross\-validation without extensive tuning\.

While automatic or adaptive selection ofMMis an interesting direction for future work, our results indicate that the current strategy provides a reliable and interpretable balance between accuracy and efficiency across a wide range of tasks\.
Implicit Variational Rejection Sampling

Similar Articles

Variational Inference for Evidential Deep Learning

Zeroth-Order Non-Log-Concave Sampling with Variance Reduction and Applications to Inverse Problems

Hierarchical Variational Policies for Reward-Guided Diffusion

Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

@SOURADIPCHAKR18: Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts…

Submit Feedback

Similar Articles

Variational Inference for Evidential Deep Learning
Zeroth-Order Non-Log-Concave Sampling with Variance Reduction and Applications to Inverse Problems
Hierarchical Variational Policies for Reward-Guided Diffusion
Learning Uncertainty from Sequential Internal Dispersion in Large Language Models
@SOURADIPCHAKR18: Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts…