Zeroth-Order Non-Log-Concave Sampling with Variance Reduction and Applications to Inverse Problems
Summary
Proposes a variance-reduced zeroth-order Langevin sampling method for non-log-concave distributions, establishing the first non-asymptotic convergence guarantees, and applies it to inverse problems with score-based generative priors.
# Zeroth-Order Non-Log-Concave Sampling with Variance Reduction and Applications to Inverse Problems
Source: [https://arxiv.org/html/2605.30573](https://arxiv.org/html/2605.30573)
###### Abstract
Sampling from high-dimensional, non-log-concave distributions with unnormalized densities remains a fundamental challenge in machine learning, particularly in black-box settings where gradient information is inaccessible or computationally prohibitive. While Langevin dynamics provides a principled framework for sampling when gradients are accessible, its extension to black-box settings suffers from high variance and lacks non-asymptotic convergence guarantees for non-log-concave sampling. To address these limitations, we propose a variance-reduced zeroth-order Langevin sampling method. Our method employs a gradient estimator that substantially reduces the variance of the classical batched zeroth-order estimator and eliminates the unfavorable dimensional dependence of the batch size required for accurate estimation, enabling practical and stable sampling. We establish the first non-asymptotic convergence guarantees for zeroth-order non-log-concave sampling in terms of $\varepsilon$-relative Fisher information and, under a Poincaré inequality assumption, squared total variation distance. We further propose ZO-APMC, a posterior sampling algorithm for black-box inverse problems with pre-trained score-based generative priors, establishing the first non-asymptotic convergence guarantees for such methods. We validate our theory through synthetic experiments and demonstrate strong empirical performance on practical linear and nonlinear inverse problems.
Machine Learning, ICML
## 1 Introduction
We study the problem of sampling from a distribution $\pi\propto\exp(-f)$ with potential function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ when we only have access to zeroth-order (ZO) evaluations of the potential function. This problem is of fundamental importance when gradients are unavailable or prohibitively expensive, and has been investigated by several recent works (Liu and Wang, [2020](https://arxiv.org/html/2605.30573#bib.bib74); Roy et al., [2022](https://arxiv.org/html/2605.30573#bib.bib38); He et al., [2024](https://arxiv.org/html/2605.30573#bib.bib22)). In the special case where $\pi$ is strongly log-concave and $f$ has a Lipschitz continuous gradient, this problem is well understood; for instance, Roy et al. ([2022](https://arxiv.org/html/2605.30573#bib.bib38)) establish non-asymptotic complexity bounds for Langevin Monte Carlo (LMC) sampling. In contrast, to the best of our knowledge, the non-log-concave setting remains largely unexplored. Our first main contribution is to take an initial step toward a theory of non-log-concave ZO sampling by proposing a novel ZO estimator that enables sampling from non-log-concave distributions. The main challenge is that standard ZO estimators based on finite-difference evaluations along random Gaussian directions exhibit high variance. Controlling this variance typically requires a batch size scaling as $\mathcal{O}(d)$ with the dimension $d$ of the sampling variable, leading to substantial function-evaluation and memory costs in high-dimensional settings. To mitigate this issue, we propose a novel variance-reduced ZO estimator that uses only $\mathcal{O}(1)$ function evaluations per iteration, making the batch size independent of the ambient dimension and substantially reducing the associated memory costs.
Our theoretical analysis builds on the LMC framework with gradient access developed by Balasubramanian et al. ([2022](https://arxiv.org/html/2605.30573#bib.bib34)), and is based on a sampling analogue of *stationary point analysis*, a technique that has been shown to be highly effective in non-convex optimization (Nesterov and others, [2018](https://arxiv.org/html/2605.30573#bib.bib88)).
Beyond the foundational sampling setting, we further extend our method and theoretical analysis to posterior sampling for solving ill-posed inverse problems using score-based generative model (SGM) priors in black-box settings, where privileged information about the forward model, such as its derivative, pseudo-inverse (Song et al., [2023](https://arxiv.org/html/2605.30573#bib.bib21)), or parametrization (Chung et al., [2023a](https://arxiv.org/html/2605.30573#bib.bib23)), is unavailable or computationally prohibitive. Such scenarios arise in a wide range of applications: the forward operator may be defined through large PDE-based simulators whose derivatives or pseudo-inverses are typically inaccessible or undefined (Evensen and Van Leeuwen, [1996](https://arxiv.org/html/2605.30573#bib.bib26); Oliver et al., [2008](https://arxiv.org/html/2605.30573#bib.bib24); Iglesias et al., [2013](https://arxiv.org/html/2605.30573#bib.bib25)); simulators may rely on legacy code that cannot be adapted to modern auto-differentiation frameworks (Harbaugh et al., [2000](https://arxiv.org/html/2605.30573#bib.bib79)); the forward model may be a proprietary (closed-source) system, as in commercial MRI scanners (Karakuzu et al., [2025](https://arxiv.org/html/2605.30573#bib.bib80)); the underlying physics may involve discontinuities (Moës et al., [1999](https://arxiv.org/html/2605.30573#bib.bib76); Tan et al., [2018](https://arxiv.org/html/2605.30573#bib.bib77); Lopez-Gomez et al., [2022](https://arxiv.org/html/2605.30573#bib.bib78)); or the forward model may be implemented as a rule-based expert system (Rotshtein and Rakytyanska, [2012](https://arxiv.org/html/2605.30573#bib.bib84); Huang et al., [2024](https://arxiv.org/html/2605.30573#bib.bib18); Gong et al., [2025](https://arxiv.org/html/2605.30573#bib.bib85)).
In such ill-posed inverse problem settings, the choice of a flexible and expressive prior is crucial. SGMs have recently emerged as powerful plug-and-play priors due to their ability to model high-dimensional non-log-concave data distributions, while remaining applicable across a wide range of inverse problems without re-training. They have demonstrated strong empirical performance across a wide range of applications, including image restoration (Wang et al., [2023](https://arxiv.org/html/2605.30573#bib.bib7); Rout et al., [2023](https://arxiv.org/html/2605.30573#bib.bib8)), medical imaging (Song et al., [2022](https://arxiv.org/html/2605.30573#bib.bib10); Sun et al., [2024](https://arxiv.org/html/2605.30573#bib.bib11)), and music generation (Rout et al., [2025](https://arxiv.org/html/2605.30573#bib.bib17)), where access to gradients of the forward operator is typically assumed. More recently, SGMs have been adopted to develop black-box posterior sampling methods for inverse problems where the likelihood score is unavailable, showing promising empirical performance (Tang et al., [2024](https://arxiv.org/html/2605.30573#bib.bib31); Huang et al., [2024](https://arxiv.org/html/2605.30573#bib.bib18); Zheng et al., [2025a](https://arxiv.org/html/2605.30573#bib.bib27)). However, these approaches rely on heuristic approximations and currently lack rigorous convergence guarantees to the target posterior under standard probabilistic discrepancy measures such as Fisher information (FI) or total variation (TV) distance. In fact, rigorous guarantees remain rare even for posterior sampling methods with access to the likelihood score; when available, they typically rely on restrictive assumptions such as linear forward operators, which are often violated in practice (Daras et al., [2024](https://arxiv.org/html/2605.30573#bib.bib9)).
A more detailed discussion of both gradient-based and black-box posterior sampling methods is provided in Appendix [A](https://arxiv.org/html/2605.30573#A1), along with a complementary comparison to the proposed approach in Table [3](https://arxiv.org/html/2605.30573#A1.T3).
The second main contribution of this work is the development of a theoretically grounded plug-and-play Monte Carlo (PMC) method for solving black-box inverse problems using only forward model evaluations and a pre-trained SGM prior. We position this as an important step toward posterior sampling in black-box settings that offers an algorithm with formal convergence guarantees and a solid foundation for future advances. We encounter two key challenges in designing a practical ZO posterior sampling algorithm: standard LMC often converges slowly, and ZO estimates require batch sizes that scale with the problem dimension, leading to prohibitive computational and memory costs in high-dimensional settings. To address these challenges, we combine annealed LMC with our variance-reduced ZO estimator, enabling accurate forward model gradient approximation using a practical number of function evaluations at each iteration. Our theoretical results for posterior sampling account for the effect of the annealing schedule and the SGM estimation error on convergence.
Specifically, the key contributions of this work are the following:
- We establish the first non-asymptotic complexity guarantees for ZO sampling, achieving an $\varepsilon$-relative FI error after $\mathcal{O}(1/\varepsilon^{4})$ iterations. Under a Poincaré inequality assumption on the target distribution, this rate also yields $\varepsilon$-accuracy in squared TV distance. Furthermore, with decaying parameters, we show weak convergence to the target distribution.
- We propose a novel variance-reduced ZO gradient estimator that achieves the convergence guarantees above using only $\mathcal{O}(1)$ function evaluations per iteration, eliminating the $\mathcal{O}(d)$ batch-size scaling of the standard ZO estimator.
- We propose a variance-reduced ZO annealed PMC algorithm (ZO-APMC) for black-box inverse problems, based on annealed LMC and a pre-trained SGM prior, and establish non-asymptotic and weak convergence guarantees.
- We verify our theoretical findings with numerical and statistical experiments, and further demonstrate that ZO-APMC consistently outperforms existing black-box posterior sampling methods in MRI reconstruction and black hole imaging, while delivering competitive performance on the Navier–Stokes inverse problem.
## 2 Preliminaries
### 2.1 Zeroth-Order Sampling
The Langevin diffusion, defined as the solution to the stochastic differential equation

$$d\bm{x}_{t} = -\nabla f(\bm{x}_{t})\,dt + \sqrt{2}\,d\bm{B}_{t}, \tag{1}$$

has $\pi\propto\exp(-f)$ as its unique stationary distribution and converges to it as $t\rightarrow\infty$ under mild conditions. Here, $(\bm{B}_{t})_{t\geq 0}$ denotes a standard $d$-dimensional Brownian motion. Discretizing this stochastic process with step size $\gamma>0$ yields the standard Langevin Monte Carlo (LMC) algorithm:
$$\bm{x}_{(k+1)\gamma} \coloneqq \bm{x}_{k\gamma} - \gamma\nabla f(\bm{x}_{k\gamma}) + \sqrt{2}\,(\bm{B}_{(k+1)\gamma} - \bm{B}_{k\gamma}). \tag{2}$$

In this work, we assume black-box access to the potential function $f$; therefore, the gradient $\nabla f(\bm{x}_{k\gamma})$ cannot be computed. Instead, we consider its ZO estimate (Nesterov and Spokoiny, [2017](https://arxiv.org/html/2605.30573#bib.bib32)), defined as
$$\widetilde{\nabla} f_{\mu}(\bm{x},\bm{u}) \coloneqq \frac{f(\bm{x}+\mu\bm{u}) - f(\bm{x})}{\mu}\,\bm{u}, \tag{3}$$

where $\bm{u}\sim\mathcal{N}(0,I)$ and $\mu>0$ is the smoothing parameter. We note that the ZO estimator is biased, since $\nabla f_{\mu}(\bm{x}) \coloneqq \mathbb{E}_{\bm{u}}[\widetilde{\nabla} f_{\mu}(\bm{x},\bm{u})] \neq \nabla f(\bm{x})$. However, this bias vanishes as $\mu\rightarrow 0$; see Lemma [1](https://arxiv.org/html/2605.30573#Thmlemma1) in Appendix [B.1](https://arxiv.org/html/2605.30573#A2.SS1). Replacing $\nabla f(\bm{x}_{k\gamma})$ in ([2](https://arxiv.org/html/2605.30573#S2.E2)) with its ZO estimate $(1/b)\sum_{i=1}^{b}\widetilde{\nabla} f_{\mu}(\bm{x}_{k\gamma},\bm{u}_{k}^{i})$ yields a naive ZO-LMC algorithm, where $b$ is the batch size. Roy et al. ([2022](https://arxiv.org/html/2605.30573#bib.bib38)) analyze this approach for strongly log-concave target distributions, an assumption often violated in practice, and establish Wasserstein-2 convergence guarantees. However, their analysis requires the batch size to scale as $\mathcal{O}(d)$, resulting in prohibitively large memory requirements in high-dimensional settings. In this work, we instead provide an analysis for non-log-concave target distributions and use $\mathcal{O}(1)$ function evaluations per iteration. This property is ensured by the proposed variance-reduced ZO estimator $\bm{g}_{k}$, defined in Section [3](https://arxiv.org/html/2605.30573#S3).
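To make the estimator in (3) and the naive ZO-LMC update concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function names (`zo_gradient`, `batched_zo_gradient`, `naive_zo_lmc`) and the test potential are illustrative assumptions. The Brownian increment in (2) is simulated as $\sqrt{\gamma}\,\bm{z}$ with $\bm{z}\sim\mathcal{N}(0,I)$, so the noise term becomes $\sqrt{2\gamma}\,\bm{z}$.

```python
import numpy as np


def zo_gradient(f, x, u, mu):
    """Single-direction ZO estimate (3): (f(x + mu*u) - f(x)) / mu * u."""
    return (f(x + mu * u) - f(x)) / mu * u


def batched_zo_gradient(f, x, mu, b, rng):
    """Average of b single-direction estimates with fresh Gaussian directions.

    f(x) is evaluated once and shared, so the batch costs b + 1 evaluations.
    """
    d = x.shape[0]
    directions = rng.standard_normal((b, d))
    fx = f(x)
    diffs = np.array([f(x + mu * u) - fx for u in directions]) / mu
    return diffs @ directions / b


def naive_zo_lmc(f, x0, gamma, mu, b, n_steps, rng):
    """Naive ZO-LMC: update (2) with the gradient replaced by a batched ZO estimate."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        g = batched_zo_gradient(f, x, mu, b, rng)
        # Brownian increment over a step of length gamma is N(0, gamma * I).
        x = x - gamma * g + np.sqrt(2.0 * gamma) * rng.standard_normal(x.shape[0])
    return x
```

For a linear potential $f(\bm{x}) = \bm{a}^{\top}\bm{x}$ the single-direction estimate equals $(\bm{a}^{\top}\bm{u})\,\bm{u}$, whose mean over $\bm{u}$ is $\bm{a}$ but whose variance grows with $d$, which is why the naive scheme needs large batches.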
Figure 1: Illustration of weighted annealing in PMC through weighted posteriors $\bigl(\pi_{\sigma_{k}}^{(\alpha_{k})}\bigr)_{k=0}^{N-1}$. Solid lines and shaded regions denote the distribution mean and density, respectively, while unshaded regions correspond to $\nabla\log\pi(\bm{x})=0$. By gradually reducing the prior smoothing parameter $\sigma_{k}$ and its weight $\alpha_{k}$ relative to the likelihood $\ell$, weighted annealing enables PMC to escape plateaus in $\nabla\log\pi(\bm{x})$.
### 2.2 Black-Box Inverse Problems
For the posterior sampling part of our work, we consider a general black\-box inverse problem setting modeled as
$$\bm{y} = \bm{A}(\bm{x}) + \xi, \quad \bm{x}\in\mathbb{R}^{d}, \quad \bm{y},\xi\in\mathbb{R}^{m}, \tag{4}$$

where we assume only black-box access to the forward model $\bm{A}(\cdot)$. In this setting, gradient information is unavailable, and $\bm{A}$ can only be queried through input–output evaluations. The operator $\bm{A}:\mathbb{R}^{d}\to\mathbb{R}^{m}$ models the response of the imaging system, where $m\ll d$, and $\xi\in\mathbb{R}^{m}$ represents measurement noise. The objective is to recover the unknown signal $\bm{x}$ from the noisy measurements $\bm{y}$. In many practical settings, the mapping $\bm{x}\rightarrow\bm{y}$ is many-to-one, making the reconstruction task an ill-posed inverse problem, since $\bm{x}$ cannot be uniquely recovered from $\bm{y}$. In a Bayesian framework, one introduces $p(\bm{x})\propto\exp(-h)$ as the *prior* and samples from the *posterior* $\pi(\bm{x}|\bm{y})$, which is given by Bayes' rule: $\pi(\bm{x}|\bm{y})\propto\ell(\bm{y}|\bm{x})\,p(\bm{x})$, where $\ell(\bm{y}|\bm{x})\propto\exp(-f)$ is the *likelihood* induced by ([4](https://arxiv.org/html/2605.30573#S2.E4)). Thus, by applying Bayes' rule to the LMC iterates in ([2](https://arxiv.org/html/2605.30573#S2.E2)) and replacing $\nabla f(\bm{x}_{k\gamma})$ with its naive ZO estimate, we readily obtain the following ZO posterior sampling LMC algorithm:
$$\bm{x}_{(k+1)\gamma} \coloneqq \bm{x}_{k\gamma} - \gamma\biggl(\frac{1}{b}\sum_{i=1}^{b}\widetilde{\nabla} f_{\mu}(\bm{x}_{k\gamma},\bm{u}_{k}^{i}) + \nabla h(\bm{x}_{k\gamma})\biggr) + \sqrt{2}\,(\bm{B}_{(k+1)\gamma} - \bm{B}_{k\gamma}), \tag{5}$$

where $f$ and $h$ denote the potential functions of the likelihood and the prior, respectively. While $\widetilde{\nabla} f_{\mu}(\bm{x}_{k\gamma},\bm{u}_{k}^{i})$ can be computed via forward model function evaluations under the assumption of Gaussian measurement noise in ([4](https://arxiv.org/html/2605.30573#S2.E4)), the prior score $-\nabla h(\bm{x}_{k\gamma})$ is unknown.
SGMs have been proposed as powerful models for high-dimensional non-log-concave priors. At their core, they learn the perturbed score function $-\nabla h_{\sigma}(\bm{x}) \coloneqq \nabla\log p_{\sigma}(\bm{x})$, where $p_{\sigma}(\bm{x}) \coloneqq \int_{\mathbb{R}^{d}} p(\bm{z})\,\mathcal{N}(\bm{x}|\bm{z},\sigma^{2}I)\,d\bm{z}$ with a small perturbation $\sigma>0$. This score is learned by a deep neural network using a score matching objective (Hyvärinen and Dayan, [2005](https://arxiv.org/html/2605.30573#bib.bib5); Vincent, [2011](https://arxiv.org/html/2605.30573#bib.bib6)) and can be estimated via Tweedie's formula (Robbins, [1992](https://arxiv.org/html/2605.30573#bib.bib90); Miyasawa and others, [1961](https://arxiv.org/html/2605.30573#bib.bib89); Efron, [2011](https://arxiv.org/html/2605.30573#bib.bib29)). We denote the SGM estimator by $\mathcal{S}_{\theta}(\bm{x},\sigma)\approx-\nabla h_{\sigma}(\bm{x})$, where $\mathcal{S}_{\theta}$ is conditioned on the noise level $\sigma$ and parametrized by $\theta$. In practice, LMC algorithms suffer from slow convergence and mode collapse when sampling from high-dimensional multimodal distributions. To alleviate this, SGMs are trained across multiple noise scales $(\sigma_{k})_{k=0}^{N-1}$, where larger noise levels produce smoother approximations of the data distribution, making score estimation and sampling well-posed. Replacing $-\nabla h(\bm{x}_{k\gamma})$ in ([5](https://arxiv.org/html/2605.30573#S2.E5)) with the SGM estimate, we obtain a naive ZO annealed LMC for posterior sampling:
$$\bm{x}_{(k+1)\gamma} \coloneqq \bm{x}_{k\gamma} - \gamma\biggl(\frac{1}{b}\sum_{i=1}^{b}\widetilde{\nabla} f_{\mu}(\bm{x}_{k\gamma},\bm{u}_{k}^{i}) - \alpha_{k}\mathcal{S}_{\theta}(\bm{x}_{k\gamma},\sigma_{k})\biggr) + \sqrt{2}\,(\bm{B}_{(k+1)\gamma} - \bm{B}_{k\gamma}), \tag{6}$$

where $(\alpha_{k})_{k=0}^{N-1}$ denotes a weighted annealing schedule. This corresponds to sampling from a sequence of weighted posterior distributions $\pi_{\sigma_{k}}^{(\alpha_{k})}(\bm{x}|\bm{y})\propto\ell(\bm{y}|\bm{x})\,p_{\sigma_{k}}^{\alpha_{k}}(\bm{x})$. Initially, the posterior is dominated by the smoothed prior $p_{\sigma_{k}}$, and its less concentrated density facilitates escape from low-probability noisy regions of $\nabla h$. As the iterations progress, the likelihood contributes more strongly while the smoothed prior $p_{\sigma_{k}}$ approaches the true prior $p$, guiding the iterates toward the target posterior. An illustration of this process is provided in Fig. [1](https://arxiv.org/html/2605.30573#S2.F1). In practice, both the noise scales $(\sigma_{k})_{k=0}^{N-1}$ and annealing weights $(\alpha_{k})_{k=0}^{N-1}$ decrease over the iterations until reaching prescribed minimum values, after which they remain fixed (Song and Ermon, [2019](https://arxiv.org/html/2605.30573#bib.bib28), [2020](https://arxiv.org/html/2605.30573#bib.bib30); Sun et al., [2024](https://arxiv.org/html/2605.30573#bib.bib11)). The precise definitions of these schedules are provided in ([13](https://arxiv.org/html/2605.30573#S3.E13)).
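A single update of the form (6) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the likelihood potential assumes isotropic Gaussian noise with a known standard deviation `noise_std`, the names `likelihood_potential`, `annealed_zo_lmc_step`, and `gaussian_prior_score` are hypothetical, and the analytic score of a smoothed standard Gaussian prior stands in for a pre-trained network $\mathcal{S}_{\theta}$. The schedules for $\alpha_{k}$ and $\sigma_{k}$ are left to the caller.

```python
import numpy as np


def likelihood_potential(x, y, forward, noise_std):
    """f(x) = ||y - A(x)||^2 / (2 * noise_std^2), using only evaluations of A."""
    residual = y - forward(x)
    return 0.5 * float(residual @ residual) / noise_std**2


def gaussian_prior_score(x, sigma):
    """Toy stand-in for S_theta: score of N(0, I) smoothed by N(0, sigma^2 I)."""
    return -x / (1.0 + sigma**2)


def annealed_zo_lmc_step(x, f, score, alpha, sigma, gamma, mu, b, rng):
    """One update of the form (6) with a batched ZO likelihood gradient."""
    d = x.shape[0]
    directions = rng.standard_normal((b, d))
    fx = f(x)
    diffs = np.array([f(x + mu * u) - fx for u in directions]) / mu
    grad_f = diffs @ directions / b
    # Drift combines the ZO likelihood gradient with the weighted prior score.
    drift = grad_f - alpha * score(x, sigma)
    noise = rng.standard_normal(d)
    return x - gamma * drift + np.sqrt(2.0 * gamma) * noise
```

A caller would typically bind the measurement and forward model first, for example `f = lambda x: likelihood_potential(x, y, forward, noise_std)`, and then iterate the step while decreasing `alpha` and `sigma`.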
### 2.3 Sampling Analogue of Stationary Point Analysis
Before introducing our proposed variance-reduced ZO sampling method with its convergence guarantees, we explain why convergence in relative FI, defined as $\mathrm{FI}(\nu\|\pi) \coloneqq \int_{\mathbb{R}^{d}}\nu(\bm{x})\,\|\nabla\log\nu(\bm{x}) - \nabla\log\pi(\bm{x})\|_{2}^{2}\,d\bm{x}$, serves as the sampling analogue of stationary point analysis in non-convex optimization. Consider the minimization of the Kullback–Leibler (KL) divergence over the Wasserstein space of probability distributions:

$$\hat{\nu} = \operatorname*{arg\,min}_{\nu}\mathrm{KL}(\nu\|\pi), \tag{7}$$

where $\mathrm{KL}(\nu\|\pi) \coloneqq \int_{\mathbb{R}^{d}}\nu(\bm{x})\log\frac{\nu(\bm{x})}{\pi(\bm{x})}\,d\bm{x}$, and $\nu$ and $\pi$ denote the estimate and target distributions, respectively. Similar to the gradient concept in Euclidean space, the Wasserstein gradient of $\mathrm{KL}(\cdot\|\pi)$ at $\nu$ is $\nabla\log(\nu/\pi)$ (Ambrosio et al., [2008](https://arxiv.org/html/2605.30573#bib.bib48)), and its expected squared norm under $\nu$ is $\mathrm{FI}(\nu\|\pi)$. If $\nu_{t}$ evolves under the Langevin diffusion in ([1](https://arxiv.org/html/2605.30573#S2.E1)), then $\frac{d}{dt}\mathrm{KL}(\nu_{t}\|\pi) = -\mathrm{FI}(\nu_{t}\|\pi)$ (Ambrosio et al., [2008](https://arxiv.org/html/2605.30573#bib.bib48); Villani, [2009](https://arxiv.org/html/2605.30573#bib.bib49)), showing that the Langevin diffusion is a gradient flow in probability space. From an optimization perspective, $\mathrm{FI}(\nu_{t}\|\pi)$ plays a role analogous to the squared gradient norm in non-convex optimization. Unlike optimization, however, the condition $\mathrm{FI}(\nu\|\pi)=0$ implies $\nu=\pi$, making FI a natural discrepancy measure for quantifying convergence to the target distribution. Accordingly, following Balasubramanian et al. ([2022](https://arxiv.org/html/2605.30573#bib.bib34)), we characterize convergence by requiring the output distribution $\nu$ to satisfy $\mathrm{FI}(\nu\|\pi)\leq\varepsilon$, analogous to reaching an $\varepsilon$-stationary point in optimization.
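As a sanity check of the identity $\frac{d}{dt}\mathrm{KL}(\nu_{t}\|\pi) = -\mathrm{FI}(\nu_{t}\|\pi)$, the following worked example evaluates both sides in closed form. The setting is an illustrative assumption chosen for tractability and does not appear in the paper: $d=1$, target $\pi=\mathcal{N}(0,1)$ (so $f(x)=x^{2}/2$), and initial law $\nu_{0}=\mathcal{N}(m_{0},1)$.

```latex
\begin{align*}
% Relative Fisher information and KL between unit-variance Gaussians
\nu = \mathcal{N}(m,1),\ \pi = \mathcal{N}(0,1):\quad
  &\nabla\log\nu(x) - \nabla\log\pi(x) = -(x-m) + x = m, \\
  &\mathrm{FI}(\nu\|\pi) = \int \nu(x)\, m^{2}\, dx = m^{2}, \qquad
   \mathrm{KL}(\nu\|\pi) = \tfrac{1}{2} m^{2}. \\[4pt]
% Langevin diffusion (1) with f(x) = x^2/2 is an Ornstein--Uhlenbeck process
dx_t = -x_t\,dt + \sqrt{2}\,dB_t,\ \nu_0 = \mathcal{N}(m_0,1):\quad
  &\nu_t = \mathcal{N}\bigl(m_0 e^{-t},\, 1\bigr), \\
  &\mathrm{KL}(\nu_t\|\pi) = \tfrac{1}{2} m_0^{2} e^{-2t}, \\
  &\frac{d}{dt}\mathrm{KL}(\nu_t\|\pi) = -m_0^{2} e^{-2t} = -\mathrm{FI}(\nu_t\|\pi).
\end{align*}
```

In this example $\mathrm{FI}(\nu\|\pi)\leq\varepsilon$ is equivalent to $|m|\leq\sqrt{\varepsilon}$, mirroring how a small gradient norm certifies an approximate stationary point.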
## 3 Methods
In this section, we propose the variance\-reduced ZO gradient estimator and present theoretical guarantees for sampling from non\-log\-concave distributions\. We then extend our analysis to posterior sampling with SGM priors, accounting for both the bias due to the annealing schedules and the SGM estimation error\.
### 3.1 Variance-Reduced Zeroth-Order Sampling
As mentioned previously, the naive ZO estimate requires a batch size of $\mathcal{O}(d)$ to estimate the gradient accurately, which is prohibitive to compute at each iteration. Using small batch sizes $b\ll d$, while computationally attractive, leads to noisy estimates. Thus, inspired by Li et al. ([2021](https://arxiv.org/html/2605.30573#bib.bib4)), we combine the large- and small-batch estimators to define the following variance-reduced ZO estimator:
$$\bm{g}_{k} := \begin{cases} \displaystyle\frac{1}{b}\sum_{i=1}^{b}\widetilde{\nabla} f_{\mu}(\bm{x}_{k\gamma},\bm{u}_{k}^{i}), & \text{w.p. } p, \\[6pt] \displaystyle\bm{g}_{k-1} + \frac{1}{b'}\sum_{i=1}^{b'}\Bigl(\widetilde{\nabla} f_{\mu}(\bm{x}_{k\gamma},\bm{u}_{k}^{i}) - \widetilde{\nabla} f_{\mu}(\bm{x}_{(k-1)\gamma},\bm{u}_{k}^{i})\Bigr), & \text{w.p. } 1-p, \end{cases} \tag{8}$$

where $k\geq 1$ is the iteration index, and $b, b'\geq 1$ denote batch sizes, with $b$ a large batch size used with probability $p\in(0,1]$ and $b'\ll b$ a much smaller one. For the initial step $k=0$, we use the batch estimate $\bm{g}_{0} \coloneqq \frac{1}{b}\sum_{i=1}^{b}\widetilde{\nabla} f_{\mu}(\bm{x}_{0},\bm{u}_{0}^{i})$. The key motivation behind this construction is to mitigate the high computational cost of ZO estimation, which would otherwise require a batch size on the order of $\mathcal{O}(d)$ at every iteration. Instead, a large-batch estimate is computed only intermittently, with probability $p<1$. With the remaining probability $1-p$, the estimate is updated recursively using a small batch of fresh function evaluations together with the previous estimate. More specifically, the update approximates the change in the gradient between consecutive iterates, exploiting the strong correlation between $\bm{x}_{(k-1)\gamma}$ and $\bm{x}_{k\gamma}$.
Under a suitable regularity condition on $f$, this gradient variation can be accurately estimated using only a small batch size $b'\ll b$, thereby substantially reducing the per-iteration computational cost. We formalize this condition in the following assumption.
###### Assumption 1\.
The potential function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is $L_{1}$-Lipschitz continuous: $\|f(\bm{x}_{1}) - f(\bm{x}_{2})\|_{2} \leq L_{1}\|\bm{x}_{1} - \bm{x}_{2}\|_{2}$, for all $\bm{x}_{1},\bm{x}_{2}\in\mathbb{R}^{d}$ and for some $L_{1}>0$.
Assumption [1](https://arxiv.org/html/2605.30573#Thmassumption1) is restrictive since it does not hold globally even for simple distributions such as Gaussians. However, it is satisfied by continuously differentiable potentials on compact domains, which commonly arise in practice through normalization or gradient clipping. In the optimization literature, variance-reduction techniques achieving $\mathcal{O}(1)$ per-iteration cost typically rely on a Lipschitz continuity condition on stochastic gradients to control their variance (Cutkosky and Orabona, [2019](https://arxiv.org/html/2605.30573#bib.bib82); Li et al., [2021](https://arxiv.org/html/2605.30573#bib.bib4); Liu et al., [2025](https://arxiv.org/html/2605.30573#bib.bib83)). Since gradients are unavailable in our setting and must instead be approximated using ZO estimators, we impose Lipschitz continuity on the potential function $f$ to obtain analogous variance-control properties.
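The estimator in (8) can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the authors' code: the names `zo_batch`, `vr_zo_update`, and `expected_evals_per_iter` are hypothetical, `b_small` plays the role of $b'$, and passing `g_prev=None` triggers the large-batch initialization used for $\bm{g}_{0}$. The same directions $\bm{u}_{k}^{i}$ are reused at both iterates in the recursive branch, as in (8).

```python
import numpy as np


def zo_batch(f, x, directions, mu):
    """Average of single-direction ZO estimates (3) over the rows of `directions`."""
    fx = f(x)  # shared across the batch
    diffs = np.array([f(x + mu * u) - fx for u in directions]) / mu
    return diffs @ directions / directions.shape[0]


def vr_zo_update(f, x_curr, x_prev, g_prev, mu, b, b_small, p, rng):
    """One draw of g_k following (8)."""
    d = x_curr.shape[0]
    if g_prev is None or rng.random() < p:
        # Large-batch refresh (always used for the initial estimate g_0).
        directions = rng.standard_normal((b, d))
        return zo_batch(f, x_curr, directions, mu)
    # Recursive branch: correct g_{k-1} by the estimated gradient change,
    # using the SAME directions at the current and previous iterates.
    directions = rng.standard_normal((b_small, d))
    return (g_prev
            + zo_batch(f, x_curr, directions, mu)
            - zo_batch(f, x_prev, directions, mu))


def expected_evals_per_iter(b, b_small, p):
    """Expected evaluations of f per iteration for this particular sketch."""
    return p * (b + 1) + (1 - p) * 2 * (b_small + 1)
```

In this sketch the large-batch branch costs $b+1$ evaluations and the recursive branch $2(b'+1)$, so choosing $b$ on the order of $1/p$ keeps the expected per-iteration cost bounded independently of $d$.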
Using the proposed estimator, we obtain the following variance\-reduced ZO\-LMC algorithm
$$\bm{x}_{(k+1)\gamma} \coloneqq \bm{x}_{k\gamma} - \gamma\bm{g}_{k} + \sqrt{2}\,(\bm{B}_{(k+1)\gamma} - \bm{B}_{k\gamma}). \tag{9}$$

We state our main results for the following continuous-time interpolation of LMC, defined for $t\in[k\gamma,(k+1)\gamma]$:
$$\bm{x}_{t} \coloneqq \bm{x}_{k\gamma} - (t-k\gamma)\,\bm{g}_{k} + \sqrt{2}\,(\bm{B}_{t} - \bm{B}_{k\gamma}), \tag{10}$$

and we write $\nu_{t}$ for the law of $\bm{x}_{t}$. Before presenting our main results, we first provide intuition for the proposed variance-reduction mechanism through the following bound on the estimation error.
###### Proposition 1\.
Suppose Assumption [1](https://arxiv.org/html/2605.30573#Thmassumption1) holds, and let $(\bm{x}_{k\gamma})_{k\geq 0}$ be generated by ([9](https://arxiv.org/html/2605.30573#S3.E9)), where $\bm{g}_{k}$ is given by ([8](https://arxiv.org/html/2605.30573#S3.E8)). Define $e_{k}^{2} := \mathbb{E}[\|\bm{g}_{k} - \nabla f_{\mu}(\bm{x}_{k\gamma})\|_{2}^{2}]$. Then, for any $\gamma>0$,
$$e_{k+1}^{2} \leq \frac{p\,\sigma_{k+1}^{2}}{b} + (1-p)\,e_{k}^{2} + \frac{4(1-p)\,d\,L_{1}^{2}\,\Delta_{k}}{\mu^{2}b'}, \tag{11}$$

where $\sigma_{k+1}^{2} \coloneqq \mathbb{E}[\|\widetilde{\nabla} f_{\mu}(\bm{x}_{(k+1)\gamma},\bm{u}_{k+1}^{i})\|^{2}]$ and $\Delta_{k} \coloneqq \mathbb{E}[\|\bm{x}_{(k+1)\gamma} - \bm{x}_{k\gamma}\|^{2}]$.
Proposition [1](https://arxiv.org/html/2605.30573#Thmproposition1) (proof provided in Appendix [B.2](https://arxiv.org/html/2605.30573#A2.SS2)) gives an intuition for the variance-reduction mechanism. When $p=1$, the estimation error $e_{k+1}^{2}$ is controlled solely by the variance term $\sigma_{k+1}^{2}/b$, which can only be reduced by increasing the batch size $b$. In contrast, when $p<1$, the variance term is scaled down by the factor $p$ while keeping $b$ fixed, at the cost of introducing an error-propagation term $(1-p)e_{k}^{2}$ together with an additional term reflecting how similar consecutive iterates remain. Consequently, smaller values of $p$ reduce the variance term but increase error propagation, leading to a trade-off between variance reduction and convergence speed. Unlike in stochastic optimization (Li et al., [2021](https://arxiv.org/html/2605.30573#bib.bib4)), the additional error term depends on the discretization error of the Langevin diffusion through $\Delta_{k}$ and is amplified as the ZO smoothing parameter $\mu$ decreases (through the factor $1/\mu^{2}$), resulting in an additional trade-off between ZO estimation bias and discretization error. As shown in our main results, these trade-offs can be controlled through suitable choices of the parameters $\mu$, $\gamma$, and $p$. Furthermore, controlling the discretization error requires the following assumption, which is standard in LMC analysis.
###### Assumption 2\.
The gradient of $f$ is $L_{2}$-Lipschitz continuous: $\|\nabla f(\bm{x}_{1}) - \nabla f(\bm{x}_{2})\|_{2} \leq L_{2}\|\bm{x}_{1} - \bm{x}_{2}\|_{2}$, for all $\bm{x}_{1},\bm{x}_{2}\in\mathbb{R}^{d}$ and for some $L_{2}>0$.
We are now ready to state our first main result\.
###### Theorem 1\.
Let $\pi\propto\exp(-f)$ be the target distribution, where the potential $f$ satisfies Assumptions [1](https://arxiv.org/html/2605.30573#Thmassumption1) and [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constants $L_{1}$ and $L_{2}$, respectively. Define $L_{m} \coloneqq \max\{L_{1},L_{2}\}$. Let $N\geq 1$ denote the total number of iterations, and let $(\nu_{t})_{t\geq 0}$ denote the law of the continuous-time interpolation ([10](https://arxiv.org/html/2605.30573#S3.E10)) generated by step size $\gamma = L_{m}^{-1}N^{-3/4}d^{-7/4}$, probability $p = L_{m}N^{-1/4}d^{-1/4}$, batch size $b = \lceil p^{-1}\rceil$, and smoothing parameter of ZO estimates $\mu = L_{m}^{-1/2}N^{-1/8}d^{-5/8}$. Then, the time-averaged law $\bar{\nu}_{N\gamma} \coloneqq \frac{1}{N\gamma}\int_{0}^{N\gamma}\nu_{t}\,dt$ satisfies $\mathrm{FI}(\bar{\nu}_{N\gamma}\|\pi)\leq\varepsilon$ after $\mathcal{O}(d^{7}L_{m}^{4}/\varepsilon^{4})$ iterations with $\mathcal{O}(1)$ function evaluations per iteration.
Theorem 1 (the full statement and the proof are provided in Appendix [B.3](https://arxiv.org/html/2605.30573#A2.SS3)) shows that the proposed variance-reduced ZO-LMC converges with only $\mathcal{O}(1)$ function evaluations per iteration, substantially reducing the memory and per-iteration computational costs compared to naive ZO-LMC in high-dimensional settings. The high polynomial dependence on $d$ in the number of iterations arises from the interaction between the ZO approximation error and the Langevin discretization error, which is captured by the last term in the upper bound of Proposition [1](https://arxiv.org/html/2605.30573#Thmproposition1) and reflects the intrinsic difficulty of sampling from non-log-concave distributions using ZO information. In practice, however, our experiments show that accurate samples can be obtained using significantly fewer iterations.
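As a rough illustration of the parameter choices in Theorem 1, the following sketch evaluates $\gamma$, $p$, $b$, and $\mu$ for given $N$, $d$, and $L_m$. The function name and the clipping of $p$ to at most 1 are assumptions of this sketch and are not part of the theorem statement. The sketch also reports the product $p\,b$, which always lies in $[1,2]$ because $b=\lceil p^{-1}\rceil$; if the batch of size $b$ is drawn only with probability $p$ (our reading of the mechanism, which is defined earlier in the paper), this is consistent with the $\mathcal{O}(1)$ per-iteration cost.

```python
import math


def theorem1_hyperparameters(N: int, d: int, L_m: float) -> dict:
    """Evaluate the constant hyperparameters stated in Theorem 1.

    N: total number of iterations, d: dimension, L_m: max(L_1, L_2).
    """
    if N < 1 or d < 1 or L_m <= 0:
        raise ValueError("need N >= 1, d >= 1 and L_m > 0")

    gamma = N ** -0.75 * d ** -1.75 / L_m          # step size
    p_raw = L_m * N ** -0.25 * d ** -0.25          # probability as stated
    # Assumption of this sketch: clip so that p is a valid probability.
    p = min(1.0, p_raw)
    b = math.ceil(1.0 / p)                         # batch size ceil(1/p)
    mu = L_m ** -0.5 * N ** -0.125 * d ** -0.625   # ZO smoothing parameter

    # Since 1/p <= b < 1/p + 1, the product p * b lies in [1, 1 + p].
    return {"gamma": gamma, "p": p, "b": b, "mu": mu, "p_times_b": p * b}


if __name__ == "__main__":
    for name, value in theorem1_hyperparameters(N=10_000, d=16, L_m=1.0).items():
        print(f"{name} = {value}")
```

Note how quickly $\gamma$ shrinks with $d$: the $d^{-7/4}$ factor in the step size is one place where the $d^{7}$ iteration complexity becomes visible.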
We next present a stronger convergence guarantee under the following Poincaré inequality assumption\.
###### Assumption 3\.
For every smooth, compactly supported function $\phi:\mathbb{R}^{d}\to\mathbb{R}$, the target distribution $\pi\propto\exp(-f)$ satisfies the Poincaré inequality: $\operatorname{Var}_{\pi}(\phi)\leq C_{\mathrm{PI}}\,\mathbb{E}_{\pi}[\|\nabla\phi\|_{2}^{2}]$ for some $C_{\mathrm{PI}}>0$.
Assumption 3 enforces concentration by requiring sufficient growth of the potential at infinity, thereby preventing heavy tails, and is analogous to the Polyak-Łojasiewicz inequality in optimization. This assumption holds for a broad class of distributions, including log-concave distributions and certain non-log-concave cases such as Gaussian convolutions of bounded-support distributions (Chewi et al., [2024](https://arxiv.org/html/2605.30573#bib.bib35)). Combining the Poincaré inequality with Theorem [1](https://arxiv.org/html/2605.30573#Thmtheorem1), we obtain the following convergence guarantee in squared TV distance.
###### Corollary 1\.
Let $\pi\propto\exp(-f)$ be the target distribution, where the potential $f$ satisfies Assumptions [1](https://arxiv.org/html/2605.30573#Thmassumption1) and [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constants $L_{1}$ and $L_{2}$, respectively, and Assumption [3](https://arxiv.org/html/2605.30573#Thmassumption3) with constant $C_{\mathrm{PI}}$. Define $L_{m}\coloneqq\max\{L_{1},L_{2}\}$. Let $N\geq 1$ denote the total number of iterations, and let $(\nu_{t})_{t\geq 0}$ denote the law of the continuous-time interpolation ([10](https://arxiv.org/html/2605.30573#S3.E10)) generated by the same step size $\gamma$, probability $p$, batch size $b$, and smoothing parameter of ZO estimates $\mu$ as in Theorem [1](https://arxiv.org/html/2605.30573#Thmtheorem1). Then, the time-averaged law $\bar{\nu}_{N\gamma}\coloneqq\frac{1}{N\gamma}\int_{0}^{N\gamma}\nu_{t}\,dt$ satisfies $\|\bar{\nu}_{N\gamma}-\pi\|^{2}_{\mathrm{TV}}\leq\varepsilon$ after $\mathcal{O}(d^{7}L_{m}^{4}C_{\mathrm{PI}}^{4}/\varepsilon^{4})$ iterations with $\mathcal{O}(1)$ function evaluations per iteration.
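One way to read the extra $C_{\mathrm{PI}}^{4}$ factor is the following back-of-the-envelope argument. It assumes that the Poincaré inequality yields a bound of the form $\|\nu-\pi\|_{\mathrm{TV}}^{2}\leq c\,C_{\mathrm{PI}}\,\mathrm{FI}(\nu\|\pi)$ for an absolute constant $c>0$; the constant $c$ and this exact form are assumptions of the sketch, and the precise inequality used by the authors is the one in the appendix.

```latex
% Assumed transfer inequality (c > 0 is a hypothetical absolute constant):
\|\nu-\pi\|_{\mathrm{TV}}^{2} \;\le\; c\, C_{\mathrm{PI}}\, \mathrm{FI}(\nu\,\|\,\pi).

% Apply Theorem 1 with the rescaled target accuracy
\varepsilon' \;=\; \frac{\varepsilon}{c\, C_{\mathrm{PI}}}
\quad\Longrightarrow\quad
\mathrm{FI}(\bar{\nu}_{N\gamma}\,\|\,\pi) \le \varepsilon'
\quad\Longrightarrow\quad
\|\bar{\nu}_{N\gamma}-\pi\|_{\mathrm{TV}}^{2} \le \varepsilon .

% The iteration count of Theorem 1 then becomes
N \;=\; \mathcal{O}\!\left(\frac{d^{7} L_{m}^{4}}{\varepsilon'^{4}}\right)
  \;=\; \mathcal{O}\!\left(\frac{d^{7} L_{m}^{4}\, C_{\mathrm{PI}}^{4}}{\varepsilon^{4}}\right).
```

Under this reading, the dependence on $d$, $L_{m}$, and $\varepsilon$ is inherited unchanged from Theorem 1, and only the accuracy target is rescaled by $C_{\mathrm{PI}}$.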
The complete statement and the proof are provided in Appendix [B.4](https://arxiv.org/html/2605.30573#A2.SS4). Building on Theorem [1](https://arxiv.org/html/2605.30573#Thmtheorem1) and the fact that $\mathrm{FI}(\nu\|\pi)=0$ implies $\nu=\pi$, we obtain the following result, whose formal statement and proof are provided in Appendix [B.5](https://arxiv.org/html/2605.30573#A2.SS5).
###### Theorem 2\.
Let $\pi\propto\exp(-f)$ be the target distribution, where the potential $f$ satisfies Assumptions [1](https://arxiv.org/html/2605.30573#Thmassumption1) and [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constants $L_{1}$ and $L_{2}$, respectively. Define $L_{m}\coloneqq\max\{L_{1},L_{2}\}$. Let $(\nu_{t})_{t\geq 0}$ denote the law of the continuous-time interpolation ([10](https://arxiv.org/html/2605.30573#S3.E10)) generated by time-varying step size $\gamma_{k}=\frac{C_{\gamma}}{k^{3/2}}$, probability $p_{k}=\frac{1}{2k^{1/2}}$, batch size $b_{k}=\lceil p_{k}^{-1}\rceil$, and smoothing parameter of ZO estimates $\mu_{k}=\frac{C_{\mu}}{k^{1/8}}$ for $k\geq 1$, where $C_{\gamma},C_{\mu}>0$ are constants. Then, the time-averaged law $\bar{\nu}_{\tau_{n}}\coloneqq\frac{1}{\tau_{n}}\int_{0}^{\tau_{n}}\nu_{t}\,dt$, where $\tau_{n}\coloneqq\sum_{k=1}^{n}\gamma_{k}$, converges weakly to $\pi$.
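The time-varying schedules of Theorem 2 can be tabulated with a few lines of code. The sketch below is illustrative only: the function names and the default values $C_\gamma=C_\mu=1$ are assumptions, since the theorem only requires these constants to be positive.

```python
import math


def theorem2_schedules(k: int, C_gamma: float = 1.0, C_mu: float = 1.0):
    """Return (gamma_k, p_k, b_k, mu_k) as stated in Theorem 2 for k >= 1."""
    if k < 1:
        raise ValueError("the schedules are defined for k >= 1")
    gamma_k = C_gamma / k ** 1.5           # step size C_gamma / k^(3/2)
    p_k = 1.0 / (2.0 * math.sqrt(k))       # probability 1 / (2 k^(1/2))
    b_k = math.ceil(1.0 / p_k)             # batch size ceil(1 / p_k)
    mu_k = C_mu / k ** 0.125               # smoothing parameter C_mu / k^(1/8)
    return gamma_k, p_k, b_k, mu_k


def elapsed_time(n: int, C_gamma: float = 1.0) -> float:
    """Return tau_n, the sum of the step sizes gamma_1, ..., gamma_n."""
    return sum(theorem2_schedules(k, C_gamma)[0] for k in range(1, n + 1))


if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        print(k, theorem2_schedules(k))
    print("tau_64 =", elapsed_time(64))
```

The batch size $b_k$ grows like $2\sqrt{k}$ while $p_k$ decays like $1/(2\sqrt{k})$, so their product stays bounded, mirroring the constant-parameter choice $b=\lceil p^{-1}\rceil$ of Theorem 1.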
### 3.2 Zeroth-Order Posterior Sampling with SGM Prior
In this section, we propose the ZO-APMC algorithm for solving black-box inverse problems with an SGM prior and extend the theoretical analysis of the previous section to this setting by incorporating the bias due to the annealing schedules and the SGM estimation error. We replace the naive ZO gradient estimator in ([6](https://arxiv.org/html/2605.30573#S2.E6)) with the proposed variance-reduction mechanism in ([8](https://arxiv.org/html/2605.30573#S3.E8)), yielding the following ZO-APMC iterates:
$$\bm{x}_{(k+1)\gamma}\coloneqq\bm{x}_{k\gamma}-\gamma\bigl(\bm{g}_{k}-\alpha_{k}\mathcal{S}_{\theta}(\bm{x}_{k\gamma},\sigma_{k})\bigr)+\sqrt{2}\bigl(\bm{B}_{(k+1)\gamma}-\bm{B}_{k\gamma}\bigr),\tag{12}$$

where $(\alpha_{k})_{k=0}^{N-1}$ and $(\sigma_{k})_{k=0}^{N-1}$ are annealing and noise schedules, respectively. Following Song and Ermon ([2020](https://arxiv.org/html/2605.30573#bib.bib30)) and Sun et al. ([2024](https://arxiv.org/html/2605.30573#bib.bib11)), we define, for all $k\geq 0$,

$$\alpha_{k}\coloneqq\max\{\alpha_{0}\rho_{1}^{k},1\}\quad\text{and}\quad\sigma_{k}\coloneqq\max\{\sigma_{0}\rho_{2}^{k},\sigma_{\text{min}}\},\tag{13}$$

where $\rho_{1},\rho_{2}\in(0,1)$ denote decay rates, $\sigma_{0}\geq\sigma_{\text{min}}$ and $\alpha_{0}\geq 1$ are initial values, and $\sigma_{\text{min}}>0$ is the minimum noise level. These parameters are selected using the principled techniques of Song and Ermon ([2020](https://arxiv.org/html/2605.30573#bib.bib30)) such that there exist indices $K_{\alpha},K_{\sigma}<N-1$ satisfying $\alpha_{k}=1$ for all $k\geq K_{\alpha}$ and $\sigma_{k}=\sigma_{\text{min}}$ for all $k\geq K_{\sigma}$. Our analysis relies on these properties together with a continuous-time interpolation of the ZO-APMC iterates incorporating the annealing and noise schedules. For $t\in[k\gamma,(k+1)\gamma]$, the interpolation is defined by

$$\bm{x}_{t}\coloneqq\bm{x}_{k\gamma}-(t-k\gamma)\bigl(\bm{g}_{k}-\alpha_{k}\mathcal{S}_{\theta}(\bm{x}_{k\gamma},\sigma_{k})\bigr)+\sqrt{2}\bigl(\bm{B}_{t}-\bm{B}_{k\gamma}\bigr).\tag{14}$$

Here, $\mathcal{S}_{\theta}(\bm{x}_{k\gamma},\sigma_{k})$ estimates the perturbed score $-\nabla h_{\sigma_{k}}(\bm{x}_{k\gamma})$, rather than the true score $-\nabla h(\bm{x}_{k\gamma})$. Following Sun et al. ([2024](https://arxiv.org/html/2605.30573#bib.bib11)), we therefore impose the following assumption to quantify the discrepancy between the perturbed and true scores.
###### Assumption 4\.
Let $h_{\sigma_{k}}$ be the potential function of the perturbed prior, defined by $p_{\sigma_{k}}(\bm{x})\propto\exp(-h_{\sigma_{k}}(\bm{x}))$ with $p_{\sigma_{k}}(\bm{x})\coloneqq\int_{\mathbb{R}^{d}}p(\bm{z})\,\mathcal{N}(\bm{x}|\bm{z},\sigma_{k}^{2}I)\,d\bm{z}$, where $p(\bm{x})$ denotes the unperturbed prior. We assume that for any $\sigma_{k}>0$ and $\bm{x}\in\mathbb{R}^{d}$, $\|\nabla h_{\sigma_{k}}(\bm{x})-\nabla h(\bm{x})\|_{2}\leq C_{1}\sigma_{k}$ for some constant $C_{1}>0$.
Under Assumption [4](https://arxiv.org/html/2605.30573#Thmassumption4), it follows that $\nabla h_{\sigma_{k}}\to\nabla h$ as $\sigma_{k}\to 0$. For special cases, such as Gaussian priors, the discrepancy between the perturbed and true scores can be characterized analytically. However, deriving a closed-form bound for this discrepancy is generally intractable for arbitrary prior distributions.
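For concreteness, the Gaussian case can be worked out by hand. The isotropic zero-mean prior with variance $s^{2}$, the radius $R$, and the noise cap $\sigma_{0}$ below are choices made for this illustration and do not come from the analysis itself.

```latex
% Illustrative prior: p = N(0, s^2 I), so that
h(\bm{x}) = \frac{\|\bm{x}\|_2^2}{2 s^2} + \text{const},
\qquad
\nabla h(\bm{x}) = \frac{\bm{x}}{s^2}.

% Convolving with N(0, \sigma^2 I) gives p_\sigma = N(0, (s^2 + \sigma^2) I), hence
\nabla h_{\sigma}(\bm{x}) = \frac{\bm{x}}{s^2 + \sigma^2}.

% Discrepancy between the perturbed and true scores:
\|\nabla h_{\sigma}(\bm{x}) - \nabla h(\bm{x})\|_2
  = \frac{\sigma^2}{s^2 (s^2 + \sigma^2)}\, \|\bm{x}\|_2
  \;\le\; \frac{\sigma^2}{s^4}\, \|\bm{x}\|_2 .

% On a ball \|\bm{x}\|_2 \le R and for \sigma \le \sigma_0 this is at most
C_1 \sigma \quad\text{with}\quad C_1 = \frac{\sigma_0 R}{s^4}.
```

In this example the discrepancy vanishes at rate $\sigma^{2}$ for each fixed $\bm{x}$, which is compatible with the linear-in-$\sigma_k$ bound of Assumption 4 on bounded regions.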
To account for the SGM estimation error, $\mathcal{S}_{\theta}(\bm{x}_{k\gamma},\sigma_{k})\approx-\nabla h_{\sigma_{k}}(\bm{x}_{k\gamma})$, we require an additional assumption. Since SGMs are trained using the denoising score matching objective (Song et al., [2021b](https://arxiv.org/html/2605.30573#bib.bib1); Ho et al., [2020](https://arxiv.org/html/2605.30573#bib.bib2)), their optimal score network satisfies $\mathcal{S}_{\theta}(\bm{x}_{k\gamma},\sigma_{k})=-\nabla h_{\sigma_{k}}(\bm{x}_{k\gamma})$ with probability 1. Following this characterization and prior work (Sun et al., [2024](https://arxiv.org/html/2605.30573#bib.bib11)), we impose the following assumption.
###### Assumption 5\.
For any $\sigma_{k}>0$ and all $\bm{x}\in\mathbb{R}^{d}$, the score network satisfies $\|\mathcal{S}_{\theta}(\bm{x},\sigma_{k})+\nabla h_{\sigma_{k}}(\bm{x})\|_{2}\leq\varepsilon_{\sigma_{k}}<\infty$ and $\|\mathcal{S}_{\theta}(\bm{x},\sigma_{k})\|_{2}\leq C_{2}\sigma_{k}^{-1}$.
Unlike Sun et al. ([2024](https://arxiv.org/html/2605.30573#bib.bib11)), we relax the uniform norm bound on the score network to the noise-scale-dependent bound $\|\mathcal{S}_{\theta}(\bm{x},\sigma_{k})\|_{2}\leq C_{2}\sigma_{k}^{-1}$. This assumption is consistent with empirical observations in seminal score-based generative modeling works (Song and Ermon, [2019](https://arxiv.org/html/2605.30573#bib.bib28); Song et al., [2021b](https://arxiv.org/html/2605.30573#bib.bib1); Song and Ermon, [2020](https://arxiv.org/html/2605.30573#bib.bib30)), where the score magnitude is observed to scale as $\|\mathcal{S}_{\theta}(\bm{x},\sigma_{k})\|_{2}\propto\sigma_{k}^{-1}$. Additionally, in contrast to prior work (Yang and Wibisono, [2022](https://arxiv.org/html/2605.30573#bib.bib56); Lee et al., [2022](https://arxiv.org/html/2605.30573#bib.bib55); Sun et al., [2024](https://arxiv.org/html/2605.30573#bib.bib11)), our analysis does not require the score network to be Lipschitz continuous, a condition that is often violated in practice.
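Putting the iterate (12) and the schedules (13) together, a single ZO-APMC update can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the variance-reduced estimate $\bm{g}_k$ is taken as a given input (its construction is defined earlier in the paper), `score_fn` is a hypothetical stand-in for the pre-trained score network $\mathcal{S}_\theta$, and all function and argument names are ours. The Brownian increment over one step of length $\gamma$ has law $\mathcal{N}(0,\gamma I)$, so $\sqrt{2}(\bm{B}_{(k+1)\gamma}-\bm{B}_{k\gamma})$ is sampled as $\sqrt{2\gamma}\,\bm{\xi}$ with standard normal $\bm{\xi}$.

```python
import math
import random


def annealing_weight(k, alpha0, rho1):
    """alpha_k = max(alpha0 * rho1**k, 1), as in (13)."""
    return max(alpha0 * rho1 ** k, 1.0)


def noise_level(k, sigma0, rho2, sigma_min):
    """sigma_k = max(sigma0 * rho2**k, sigma_min), as in (13)."""
    return max(sigma0 * rho2 ** k, sigma_min)


def zo_apmc_step(x, g, score_fn, k, gamma,
                 alpha0, rho1, sigma0, rho2, sigma_min, rng):
    """One update of (12) on a list-of-floats state x.

    g        : ZO estimate g_k of the likelihood gradient (given, not computed here)
    score_fn : hypothetical callable (x, sigma) -> list, standing in for S_theta
    rng      : random.Random instance supplying the Gaussian noise
    """
    alpha_k = annealing_weight(k, alpha0, rho1)
    sigma_k = noise_level(k, sigma0, rho2, sigma_min)
    s = score_fn(x, sigma_k)
    noise_scale = math.sqrt(2.0 * gamma)  # sqrt(2) times a N(0, gamma I) increment
    return [
        xi - gamma * (gi - alpha_k * si) + noise_scale * rng.gauss(0.0, 1.0)
        for xi, gi, si in zip(x, g, s)
    ]


if __name__ == "__main__":
    # Toy setup (assumption): standard Gaussian prior, whose perturbed score is
    # -x / (1 + sigma^2), and a placeholder gradient estimate g = x.
    toy_score = lambda x, sigma: [-xi / (1.0 + sigma ** 2) for xi in x]
    rng = random.Random(0)
    x = [1.0, -2.0]
    for k in range(5):
        x = zo_apmc_step(x, g=x, score_fn=toy_score, k=k, gamma=0.01,
                         alpha0=4.0, rho1=0.5, sigma0=1.0, rho2=0.5,
                         sigma_min=0.1, rng=rng)
    print(x)
```

Passing the random generator explicitly keeps the step reproducible, and keeping $\bm{g}_k$ as an input separates the Langevin update from the choice of gradient estimator.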
We now present our main result for the ZO\-APMC posterior sampling algorithm \([12](https://arxiv.org/html/2605.30573#S3.E12)\) with SGM prior\.
###### Theorem 3\.
Let $\pi\propto\ell(\bm{y}|\bm{x})p(\bm{x})$ be the posterior with the likelihood $\ell(\bm{y}|\bm{x})\propto\exp(-f(\bm{x}))$ and the prior $p(\bm{x})\propto\exp(-h(\bm{x}))$. Suppose the likelihood potential $f$ satisfies Assumptions [1](https://arxiv.org/html/2605.30573#Thmassumption1) and [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constants $L_{f_{1}}$ and $L_{f_{2}}$, respectively, the prior potential $h$ satisfies Assumption [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constant $L_{h}$ and Assumption [4](https://arxiv.org/html/2605.30573#Thmassumption4), and the SGM satisfies Assumption [5](https://arxiv.org/html/2605.30573#Thmassumption5) with decreasing error $\varepsilon_{\sigma_{k}}=\mathcal{O}(k^{-1/2})$ for $\sigma_{k}>0$ and $k\geq 1$. Define $L_{m}\coloneqq\max\{L_{f_{2}}+L_{h},L_{f_{1}}\}$. Let $(\nu_{t})_{t\geq 0}$ denote the law of the interpolation in ([14](https://arxiv.org/html/2605.30573#S3.E14)) generated by the step size $\gamma$, probability $p$, batch size $b$, and smoothing parameter of ZO estimates $\mu$ stated in Theorem [1](https://arxiv.org/html/2605.30573#Thmtheorem1), with the annealing and noise schedules defined in ([13](https://arxiv.org/html/2605.30573#S3.E13)). Then, the time-averaged law $\bar{\nu}_{N\gamma}\coloneqq\frac{1}{N\gamma}\int_{0}^{N\gamma}\nu_{t}\,dt$ satisfies $\mathrm{FI}(\bar{\nu}_{N\gamma}\|\pi)\leq\varepsilon$ after $\mathcal{O}(d^{7}L_{m}^{4}/\varepsilon^{4})$ iterations with $\mathcal{O}(1)$ function evaluations per iteration.
The full theorem statement and the proof are provided in Appendix [B.6](https://arxiv.org/html/2605.30573#A2.SS6). Intuitively, the theorem guarantees that the sample distribution produced by ZO-APMC increasingly captures the posterior score information that governs directions toward high-probability, measurement-consistent reconstructions in inverse problems. This is achieved using only forward model evaluations without any gradient computation in ([4](https://arxiv.org/html/2605.30573#S2.E4)). Moreover, ZO-APMC achieves $\varepsilon$-accuracy with only $\mathcal{O}(1)$ function evaluations per iteration, compared to the $\mathcal{O}(d)$ batch sizes typically required by the standard batched ZO estimator. This substantially reduces the per-iteration computational and memory costs, enabling scalable posterior sampling for black-box inverse problems. Although the iteration complexity exhibits a high-order polynomial dependence on the dimension, our empirical results on high-dimensional tasks such as MRI reconstruction show that ZO-APMC converges in practice with substantially fewer iterations, matching the iteration count of its gradient-based counterpart APMC (Sun et al., [2024](https://arxiv.org/html/2605.30573#bib.bib11)).
Leveraging this result and assuming that the target posteriorπ\\pisatisfies a Poincaré inequality, we obtain a stronger convergence guarantee\.
###### Corollary 2\.
Let $\pi\propto\ell(\bm{y}|\bm{x})p(\bm{x})$ be the posterior with the likelihood $\ell(\bm{y}|\bm{x})\propto\exp(-f(\bm{x}))$ and the prior $p(\bm{x})\propto\exp(-h(\bm{x}))$. Suppose the likelihood potential $f$ satisfies Assumptions [1](https://arxiv.org/html/2605.30573#Thmassumption1) and [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constants $L_{f_{1}}$ and $L_{f_{2}}$, respectively, the target posterior $\pi$ satisfies Assumption [3](https://arxiv.org/html/2605.30573#Thmassumption3) with constant $C_{\mathrm{PI}}>0$, the prior potential $h$ satisfies Assumption [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constant $L_{h}$ and Assumption [4](https://arxiv.org/html/2605.30573#Thmassumption4), and the SGM satisfies Assumption [5](https://arxiv.org/html/2605.30573#Thmassumption5) with decreasing error $\varepsilon_{\sigma_{k}}=\mathcal{O}(k^{-1/2})$ for $\sigma_{k}>0$ and $k\geq 1$. Define $L_{m}\coloneqq\max\{L_{f_{2}}+L_{h},L_{f_{1}}\}$. Let $(\nu_{t})_{t\geq 0}$ denote the law of the interpolation ([14](https://arxiv.org/html/2605.30573#S3.E14)) generated by the step size $\gamma$, probability $p$, batch size $b$, and smoothing parameter of ZO estimates $\mu$ stated in Theorem [1](https://arxiv.org/html/2605.30573#Thmtheorem1), with the annealing and noise schedules defined in ([13](https://arxiv.org/html/2605.30573#S3.E13)). Then, the time-averaged law $\bar{\nu}_{N\gamma}\coloneqq\frac{1}{N\gamma}\int_{0}^{N\gamma}\nu_{t}\,dt$ satisfies $\|\bar{\nu}_{N\gamma}-\pi\|^{2}_{\mathrm{TV}}\leq\varepsilon$ after $\mathcal{O}(d^{7}L_{m}^{4}C_{\mathrm{PI}}^{4}/\varepsilon^{4})$ iterations with $\mathcal{O}(1)$ function evaluations per iteration.
Corollary [2](https://arxiv.org/html/2605.30573#Thmcorollary2) (full statement and proof are in Appendix [B.7](https://arxiv.org/html/2605.30573#A2.SS7)) intuitively implies that, for posterior distributions without flat valleys, widely separated modes, or heavy tails, the reconstructions produced by ZO-APMC are statistically indistinguishable from samples drawn from the true posterior, up to an $\varepsilon$ error, after $\mathcal{O}(d^{7}L_{m}^{4}C_{\mathrm{PI}}^{4}/\varepsilon^{4})$ iterations using $\mathcal{O}(1)$ memory on average.
Finally, we establish weak convergence of the generated samples to the target posterior, showing that they asymptotically recover the desired posterior law\.
###### Theorem 4\.
Let $\pi\propto\ell(\bm{y}|\bm{x})p(\bm{x})$ be the posterior with the likelihood $\ell(\bm{y}|\bm{x})\propto\exp(-f(\bm{x}))$ and the prior $p(\bm{x})\propto\exp(-h(\bm{x}))$. Suppose the likelihood potential $f$ satisfies Assumptions [1](https://arxiv.org/html/2605.30573#Thmassumption1) and [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constants $L_{f_{1}}$ and $L_{f_{2}}$, respectively, the prior potential $h$ satisfies Assumption [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constant $L_{h}$ and Assumption [4](https://arxiv.org/html/2605.30573#Thmassumption4), and the SGM satisfies Assumption [5](https://arxiv.org/html/2605.30573#Thmassumption5) with decreasing error $\varepsilon_{\sigma_{k}}=\mathcal{O}(k^{-1/2})$ for $\sigma_{k}>0$ and $k\geq 1$. Define $L_{m}\coloneqq\max\{L_{f_{2}}+L_{h},L_{f_{1}}\}$. Let $(\nu_{t})_{t\geq 0}$ denote the law of the interpolation ([14](https://arxiv.org/html/2605.30573#S3.E14)) generated by time-varying step size $\gamma_{k}$, probability $p_{k}$, batch size $b_{k}$, and smoothing parameter of ZO estimates $\mu_{k}$ stated in Theorem [2](https://arxiv.org/html/2605.30573#Thmtheorem2), with the annealing and noise schedules defined in ([13](https://arxiv.org/html/2605.30573#S3.E13)). Then, the time-averaged law $\bar{\nu}_{\tau_{n}}\coloneqq\frac{1}{\tau_{n}}\int_{0}^{\tau_{n}}\nu_{t}\,dt$, where $\tau_{n}\coloneqq\sum_{k=1}^{n}\gamma_{k}$, converges weakly to $\pi$.
The complete statement and the proof are provided in Appendix[B\.8](https://arxiv.org/html/2605.30573#A2.SS8)\. Next, we validate our theoretical results through statistical and numerical experiments, and demonstrate the performance of ZO\-APMC in practical settings\.
Figure 2: (a) Convergence of ZO-APMC with $b=10$, $b'=5$, and $\varepsilon_{k^{\ast}}=2.5$ for various $p$, alongside APMC convergence with gradient access. (b) Convergence results for fixed per-iteration cost. Each $\bm{\times}$ indicates a parameter setting $(p,b)$ for which the FI drops below $0.01$ after 2000 iterations. (c) Comparison of sample statistics generated by ZO-APMC and APMC with those of the analytical ground truth posterior.

Table 1: Quantitative comparison with baselines for MRI reconstruction (SD: sample standard deviation). The best values of each metric for black-box and gradient-access settings are highlighted in **bold** and <u>underline</u>, respectively.

| Method | PSNR (dB) ↑ | SSIM ↑ | NRMSE ↓ | SD ↓ | MSE ↓ |
| --- | --- | --- | --- | --- | --- |
| PnPDM | 30.81 | 0.946 | 3.76e-2 | 2.16e-2 | 8.46e-4 |
| DPS | 34.38 | 0.965 | 2.54e-2 | 2.06e-2 | 4.07e-4 |
| APMC | <u>36.55</u> | <u>0.973</u> | <u>1.99e-2</u> | <u>2.0e-2</u> | <u>2.55e-4</u> |
| Forward-GSG | 27.8 | 0.918 | 5.42e-2 | 3.26e-2 | 19.1e-4 |
| Central-GSG | 27.78 | 0.917 | 5.43e-2 | 3.27e-2 | 19.2e-4 |
| SCG | 7.1 | 0.711 | 7.67 | 1.38 | 0.21 |
| DPG | 32.17 | 0.953 | 5.4e-2 | **2.69e-2** | 6.5e-4 |
| EnKG | 31.32 | 0.934 | 5.72e-2 | 2.92e-2 | 6.72e-4 |
| ZO-APMC (ours) | **35.29** | **0.966** | **2.28e-2** | 2.99e-2 | **3.29e-4** |
Figure 3: Visualization of averaged reconstructions generated by APMC in the setting with gradient access and ZO-APMC in the black-box setting. Despite relying solely on function evaluations, ZO-APMC achieves reconstruction quality comparable to APMC.

Figure 4: Visualization of mean reconstructions for the black-hole imaging inverse problem. Two representative test cases are shown (top and bottom rows), with ground truths in the left column and black-box method reconstructions in the remaining columns. PSNR (dB) and closure-phase error ($\chi_{\mathrm{cph}}^{2}$, lower is better) are reported below each reconstruction.

Table 2: Quantitative comparison with baselines for black-hole imaging (SD: sample standard deviation). Best values for black-box and gradient-access settings are shown in **bold** and <u>underline</u>, respectively.

| Method | PSNR ↑ | Blurred PSNR ↑ | $\chi_{\text{cph}}^{2}$ ↓ | $\chi_{\text{camp}}^{2}$ ↓ | SD ↓ |
| --- | --- | --- | --- | --- | --- |
| PnPDM | <u>26.48</u> | <u>32.31</u> | <u>11.48</u> | 23.54 | 4.5e-2 |
| DPS | 25.61 | 30.84 | 12.39 | <u>17.72</u> | <u>4.32e-2</u> |
| APMC | 26.23 | 31.32 | 11.78 | 19.23 | 4.34e-2 |
| Forward-GSG | 26.21 | 31.47 | 6.77 | 14.06 | 2.99e-2 |
| Central-GSG | 21.63 | 23.73 | 80.31 | 78.5 | 4.5e-2 |
| SCG | 22.21 | 25.51 | 23.72 | 14.23 | 1.7e-2 |
| DPG | 12.33 | 14.02 | 8.17 | 30.44 | **1.6e-2** |
| EnKG | 22.86 | 27.69 | 64.37 | 33.44 | 0.925 |
| ZO-APMC (ours) | **26.71** | **32.86** | **5.42** | **11.23** | 3.02e-2 |

Figure 5: Visualization of mean samples for the Navier–Stokes inverse problem across black-box methods.
## 4 Experiments
Baselines. Our primary focus is on methods that assume black-box access to the forward model. Accordingly, we compare ZO-APMC against SCG (Huang et al., [2024](https://arxiv.org/html/2605.30573#bib.bib18)), DPG (Tang et al., [2024](https://arxiv.org/html/2605.30573#bib.bib31)), EnKG (Zheng et al., [2025a](https://arxiv.org/html/2605.30573#bib.bib27)), and Forward-GSG and Central-GSG proposed by Zheng et al. ([2025a](https://arxiv.org/html/2605.30573#bib.bib27)). The GSG methods resemble DPS (Chung et al., [2023b](https://arxiv.org/html/2605.30573#bib.bib19)), but approximate the likelihood score via Tweedie's formula combined with ZO approximations. For completeness, we additionally evaluate gradient-based methods in settings where the forward model gradients are available. Specifically, we compare against DPS (Chung et al., [2023b](https://arxiv.org/html/2605.30573#bib.bib19)), PnPDM (Wu et al., [2024](https://arxiv.org/html/2605.30573#bib.bib63)), and APMC (Sun et al., [2024](https://arxiv.org/html/2605.30573#bib.bib11)), an annealed LMC posterior sampling method with gradient access and the gradient-based counterpart of ZO-APMC.
### 4.1 Toy Experiments
Numerical Validation. To validate the convergence behavior predicted by our theory, we consider a synthetic inverse problem with a bimodal 2D Gaussian mixture prior, a random forward model $\bm{A}$, and measurement noise $\xi\sim\mathcal{N}(0,I)$. To mimic SGM estimation error, we use the analytical prior score corrupted by additive Gaussian noise with standard deviation $\varepsilon_{k^{*}}=2.5$. We generate 1000 samples with ZO-APMC from 20 random initializations of $\bm{A}$ and report the mean FI relative to the analytical posterior. Fig. [2](https://arxiv.org/html/2605.30573#S3.F2)a shows that, with $b=10$ and $b'=5$, ZO-APMC converges to near-zero FI for $p\in\{1,0.75,0.5\}$, but becomes unstable at $p=0.3$ due to the reduced total number of function evaluations. To validate convergence under $\mathcal{O}(1)$ batch complexity, we vary $p$ and $b$ while keeping the per-iteration cost fixed ($pb=10$) in Fig. [2](https://arxiv.org/html/2605.30573#S3.F2)b. Across all parameter pairs, ZO-APMC converges below $0.01$ FI, consistent with our theoretical predictions.
Statistical Validation. To validate ZO-APMC's recovery of posterior mode statistics, we consider a compressed sensing problem with a bimodal Gaussian mixture prior constructed from $32\times 32$ CelebA (Liu et al., [2015](https://arxiv.org/html/2605.30573#bib.bib64)) images, a random forward model $\bm{A}\in\mathbb{R}^{115\times 1024}$, and Gaussian measurement noise $\xi\sim\mathcal{N}(0,0.01I)$. The prior modes correspond to the "male" and "female" attributes shifted by $+1$ and $-1$, respectively, to ensure clear separation. We train a shallow customized SGM (Nichol and Dhariwal, [2021](https://arxiv.org/html/2605.30573#bib.bib65)) on this prior and generate 1000 samples using ZO-APMC and APMC, where ZO-APMC uses $p=0.5$, $b=50$, and $b'=5$. Fig. [2](https://arxiv.org/html/2605.30573#S3.F2)c shows that ZO-APMC accurately recovers the mean and variance of both posterior modes, closely matching gradient-based APMC up to a slight increase in variance due to ZO estimation. This excess variance can be further reduced by increasing $b$. Additional results and details are provided in Appendix [C](https://arxiv.org/html/2605.30573#A3).
### 4.2 Magnetic Resonance Imaging (MRI)
Imaging inverse problems (e.g., MRI reconstruction) are widely used benchmarks. Although our primary focus is on more challenging black-box forward models, we additionally evaluate our method on a linear MRI reconstruction problem for completeness and to demonstrate the effectiveness of our variance-reduction mechanism in a high-dimensional setting.
Problem Setting. We consider a radial subsampling mask with an acceleration factor of $4\times$. For evaluation, we use the SGM prior from Sun et al. ([2024](https://arxiv.org/html/2605.30573#bib.bib11)), which is pre-trained on the FastMRI brain dataset (Zbontar et al., [2019](https://arxiv.org/html/2605.30573#bib.bib66)), and evaluate all methods on a separate test set provided in that work to ensure a consistent comparison. We randomly select 40 images of size $256\times 256$ and, for each method, generate 20 reconstructions per image, average them, and report the mean image-quality metrics across the test set. For ZO-APMC, we use $p=0.2$, $b=10^{4}$, and $b'=10^{3}$.
Results. Fig. [3](https://arxiv.org/html/2605.30573#S3.F3) demonstrates that ZO-APMC and APMC yield visually indistinguishable reconstructions for representative brain MRI cases showing pathology, with ZO-APMC accurately capturing fine details without gradient information. Table [1](https://arxiv.org/html/2605.30573#S3.T1) shows that ZO-APMC consistently achieves higher reconstruction quality than the other black-box baselines in all image-quality metrics and closely matches APMC, which has gradient access. Our method yields a slightly higher SD than DPG, but this can be alleviated by increasing $p$ or $b$, albeit at a larger computational cost.
### 4.3 Black-Hole Imaging
Problem Setting. Black-hole interferometric imaging reconstructs black-hole images from "visibility" measurements acquired by Earth-based telescope arrays. We use the SGM prior pre-trained on GRMHD (Wong et al., [2022](https://arxiv.org/html/2605.30573#bib.bib67)) images of size $64\times 64$, the highly nonlinear forward model, and 100 test images provided by InverseBench (Zheng et al., [2025b](https://arxiv.org/html/2605.30573#bib.bib16)). For each method, we generate 5 reconstructions per image, average them, and report the resulting mean metrics across the test set. Since the image sizes are small, we use $p=1$ with $b=1024$. Evaluation is based on the chi-square errors of the closure phases ($\chi_{\text{cph}}^{2}$) and closure amplitudes ($\chi_{\text{camp}}^{2}$), which quantify how well the reconstructions fit the measurements. Because the black-hole imaging system captures only low spatial frequencies, we follow Akiyama et al. ([2019](https://arxiv.org/html/2605.30573#bib.bib68)) and compute PSNR for both the original and blurred reconstructions at the system's intrinsic resolution.
Results. Fig. [4](https://arxiv.org/html/2605.30573#S3.F4) shows two representative cases of black-hole reconstructions generated by ZO-APMC and other black-box baselines, alongside the ground truth. ZO-APMC produces reconstructions that most closely match the ground truth, whereas baselines introduce artifacts and lose fine details. Table 2 shows that ZO-APMC outperforms all black-box baselines across all metrics except SD, which can be mitigated by increasing $p$ or the batch size $b$ at additional cost.
### 4.4 Navier–Stokes Equation
Problem Setting. The Navier–Stokes equation is a standard fluid-dynamics benchmark (Iglesias et al., [2013](https://arxiv.org/html/2605.30573#bib.bib25)), widely used from ocean dynamics to climate modeling, where atmospheric observations calibrate initial conditions for numerical forecasts. Computing forward model gradients via auto-differentiation is impractical because it requires differentiating through a PDE solver. We use the Navier–Stokes forward model, 10 test images, and the pre-trained SGM prior provided by InverseBench (Zheng et al., [2025b](https://arxiv.org/html/2605.30573#bib.bib16)). For each black-box method, we generate 5 samples per test image, average them, and report the mean NRMSE across the 10 test images. We repeat this procedure for 3 noise levels $\sigma_{\mathrm{noise}}\in\{0,1,2\}$. Additional experimental details are provided in Appendix [C.5](https://arxiv.org/html/2605.30573#A3.SS5).
Results. Fig. [5](https://arxiv.org/html/2605.30573#S3.F5) shows results obtained with $\sigma_{\text{noise}}=1$, demonstrating that ZO-APMC generates samples that qualitatively preserve key flow features, comparable to EnKG and DPG, whereas SCG fails to do so. Moreover, EnKG produces noticeably noisier samples than ZO-APMC. Quantitative results in Appendix [C.5](https://arxiv.org/html/2605.30573#A3.SS5) show that, although ZO-APMC does not outperform EnKG and DPG in NRMSE, it achieves comparable performance to DPG while providing convergence guarantees and a principled understanding of how hyperparameters affect performance, unlike competing methods that rely on heuristic approximations.
## 5 Conclusion
We established a theoretically grounded framework for variance\-reduced zeroth\-order \(ZO\) sampling in non\-log\-concave settings, with applications to inverse problems\. We derived the first non\-asymptotic complexity guarantees for non\-log\-concave ZO sampling and for black\-box posterior sampling with SGM priors\. Empirically, our proposed ZO\-APMC method achieved state\-of\-the\-art performance among black\-box methods in MRI reconstruction and black\-hole imaging, while remaining competitive on the Navier–Stokes inverse problem\. Future work includes extending our method to underdamped Langevin dynamics and latent diffusion models for improved sampling efficiency\.
## Acknowledgements
This work was supported in part by National Institutes of Health grant NIH/NHLBI R01-HL153430 (PI: Sharif) and by NSF grants DMS-2502560 and CNS-2313109.
## Impact Statement
This work establishes the first theoretically grounded framework for zeroth-order sampling from non-log-concave distributions, addressing a fundamental gap in derivative-free probabilistic inference. We further develop the first zeroth-order posterior sampling framework with convergence guarantees for black-box inverse problems using score-based generative priors. These advances broaden the applicability of sampling methods to settings where gradients are inaccessible, undefined, or prohibitively expensive, including scientific simulators, medical imaging systems, and physics-based inverse problems. By enabling principled probabilistic inference under limited forward model access, our work may expand the use of machine learning in scientific and engineering domains. As with other sampling and generative methods, practical deployment in application domains requires careful empirical validation and appropriate domain expertise.
## References
- K\. Akiyama, A\. Alberdi, W\. Alef, K\. Asada, R\. Azulay, A\. Baczko, D\. Ball, M\. Baloković, J\. Barrett, D\. Bintley,et al\.\(2019\)First m87 event horizon telescope results\. iv\. imaging the central supermassive black hole\.The Astrophysical Journal Letters875\(1\),pp\. L4\.Cited by:[§4\.3](https://arxiv.org/html/2605.30573#S4.SS3.p1.5)\.
- L\. Ambrosio, N\. Gigli, and G\. Savaré \(2008\)Gradient flows in metric spaces and in the space of probability measures\.2nd edition,Lectures in Mathematics ETH Zürich,Birkhäuser Verlag,Basel\.Cited by:[§2\.3](https://arxiv.org/html/2605.30573#S2.SS3.p1.16)\.
- K\. Balasubramanian, S\. Chewi, M\. A\. Erdogdu, A\. Salim, and S\. Zhang \(2022\)Towards a theory of non\-log\-concave sampling: first\-order stationarity guarantees for langevin monte carlo\.InConference on Learning Theory,pp\. 2896–2923\.Cited by:[§B\.5](https://arxiv.org/html/2605.30573#A2.SS5.SSS0.Px1.p12.19),[§B\.8](https://arxiv.org/html/2605.30573#A2.SS8.SSS0.Px1.p7.19),[§1](https://arxiv.org/html/2605.30573#S1.p1.6),[§2\.3](https://arxiv.org/html/2605.30573#S2.SS3.p1.16),[Lemma 2](https://arxiv.org/html/2605.30573#Thmlemma2)\.
- C\. A\. Bouman and G\. T\. Buzzard \(2023\)Generative plug and play: posterior sampling for inverse problems\.In2023 59th Annual Allerton Conference on Communication, Control, and Computing \(Allerton\),pp\. 1–7\.Cited by:[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.3.2.1.1)\.
- A\. A\. Chael, M\. D\. Johnson, K\. L\. Bouman, L\. L\. Blackburn, K\. Akiyama, and R\. Narayan \(2018\)Interferometric imaging directly with closure phases and closure amplitudes\.The Astrophysical Journal857\(1\),pp\. 23\.Cited by:[§C\.4](https://arxiv.org/html/2605.30573#A3.SS4.p1.5)\.
- S\. Chewi, M\. A\. Erdogdu, M\. Li, R\. Shen, and M\. S\. Zhang \(2024\)Analysis of langevin monte carlo from poincare to log\-sobolev\.Foundations of Computational Mathematics,pp\. 1–51\.Cited by:[§B\.1](https://arxiv.org/html/2605.30573#A2.SS1.p3.1),[§3\.1](https://arxiv.org/html/2605.30573#S3.SS1.p8.1),[Lemma 3](https://arxiv.org/html/2605.30573#Thmlemma3)\.
- H\. Chung, J\. Kim, S\. Kim, and J\. C\. Ye \(2023a\)Parallel diffusion models of operator and image for blind inverse problems\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 6059–6069\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- H\. Chung, J\. Kim, M\. T\. Mccann, M\. L\. Klasky, and J\. C\. Ye \(2023b\)Diffusion posterior sampling for general noisy inverse problems\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=OnD9zGAGT0k)Cited by:[§A\.1](https://arxiv.org/html/2605.30573#A1.SS1.p1.1),[§A\.2](https://arxiv.org/html/2605.30573#A1.SS2.p2.1),[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.7.4.1.1.1),[§4](https://arxiv.org/html/2605.30573#S4.p1.1)\.
- F\. Coeurdoux, N\. Dobigeon, and P\. Chainais \(2024\)Plug\-and\-play split gibbs sampler: embedding deep generative priors in bayesian inference\.IEEE Transactions on Image Processing33,pp\. 3496–3507\.Cited by:[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.14.11.1.1.1)\.
- A\. Cutkosky and F\. Orabona \(2019\)Momentum\-based variance reduction in non\-convex sgd\.Advances in neural information processing systems32\.Cited by:[§3\.1](https://arxiv.org/html/2605.30573#S3.SS1.p2.2),[Remark 2](https://arxiv.org/html/2605.30573#Thmremark2.p1.3)\.
- G\. Daras, H\. Chung, C\. Lai, Y\. Mitsufuji, J\. C\. Ye, P\. Milanfar, A\. G\. Dimakis, and M\. Delbracio \(2024\)A survey on diffusion models for inverse problems\.External Links:2410\.00083,[Link](https://arxiv.org/abs/2410.00083)Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p3.1)\.
- B\. Efron \(2011\)Tweedie’s formula and selection bias\.Journal of the American Statistical Association106\(496\),pp\. 1602–1614\.Cited by:[§2\.2](https://arxiv.org/html/2605.30573#S2.SS2.p2.9)\.
- G\. Evensen and P\. J\. Van Leeuwen \(1996\)Assimilation of geosat altimeter data for the agulhas current using the ensemble kalman filter with a quasigeostrophic model\.Monthly weather review124\(1\),pp\. 85–96\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- B\. T\. Feng, J\. Smith, M\. Rubinstein, H\. Chang, K\. L\. Bouman, and W\. T\. Freeman \(2023\)Score\-based diffusion models as principled priors for inverse imaging\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 10520–10531\.Cited by:[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.5.2.1.1.1)\.
- A\. Gong, W\. He, Y\. Cao, G\. Zhou, and H\. Zhu \(2025\)Interpretability metrics and optimization methods for belief rule based expert systems\.Expert Systems with Applications,pp\. 128363\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- A\. Guillin, C\. Léonard, L\. Wu, and N\. Yao \(2009\)Transportation\-information inequalities for markov processes\.Probability theory and related fields144\(3\),pp\. 669–695\.Cited by:[Lemma 4](https://arxiv.org/html/2605.30573#Thmlemma4)\.
- A\. W\. Harbaugh, E\. R\. Banta, M\. C\. Hill, and M\. G\. McDonald \(2000\)MODFLOW\-2000, the U\.S\. geological survey modular ground\-water model: user guide to modularization concepts and the ground\-water flow process\.Technical Report 00\-92,Open\-File Report,U\.S\. Geological Survey\.External Links:[Document](https://dx.doi.org/10.3133/ofr200092),[Link](https://doi.org/10.3133/ofr200092)Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- Y\. He, K\. Rojas, and M\. Tao \(2024\)Zeroth\-order sampling methods for non\-log\-concave distributions: alleviating metastability by denoising diffusion\.Advances in Neural Information Processing Systems37,pp\. 71122–71161\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p1.6)\.
- Y\. He and W\. Sun \(2007\)Stability and convergence of the crank–nicolson/adams–bashforth scheme for the time\-dependent navier–stokes equations\.SIAM Journal on Numerical Analysis45\(2\),pp\. 837–869\.Cited by:[§C\.5](https://arxiv.org/html/2605.30573#A3.SS5.p1.10)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[§A\.1](https://arxiv.org/html/2605.30573#A1.SS1.p3.1),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p3.2)\.
- Y\. Huang, A\. Ghatare, Y\. Liu, Z\. Hu, Q\. Zhang, C\. Shama Sastry, S\. Gururani, S\. Oore, and Y\. Yue \(2024\)Symbolic music generation with non\-differentiable rule guided diffusion\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 19772–19797\.Cited by:[§A\.1](https://arxiv.org/html/2605.30573#A1.SS1.p2.2),[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.10.7.2.1.1),[§1](https://arxiv.org/html/2605.30573#S1.p2.1),[§1](https://arxiv.org/html/2605.30573#S1.p3.1),[§4](https://arxiv.org/html/2605.30573#S4.p1.1)\.
- A\. Hyvärinen and P\. Dayan \(2005\)Estimation of non\-normalized statistical models by score matching\.\.Journal of Machine Learning Research6\(4\)\.Cited by:[§2\.2](https://arxiv.org/html/2605.30573#S2.SS2.p2.9)\.
- M\. A\. Iglesias, K\. J\. Law, and A\. M\. Stuart \(2013\)Ensemble kalman methods for inverse problems\.Inverse Problems29\(4\),pp\. 045001\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1),[§4\.4](https://arxiv.org/html/2605.30573#S4.SS4.p1.1)\.
- A\. Jalal, M\. Arvinte, G\. Daras, E\. Price, A\. G\. Dimakis, and J\. Tamir \(2021\)Robust compressed sensing mri with deep generative priors\.Advances in neural information processing systems34,pp\. 14938–14954\.Cited by:[Table 3](https://arxiv.org/html/2605.30573#A1.T3.4.2.3.1.1)\.
- A\. Karakuzu, N\. Blostein, A\. V\. Caron, A\. Boré, F\. Rheault, M\. Descoteaux, and N\. Stikov \(2025\)Rethinking mri as a measurement device through modular and portable pipelines\.Magnetic Resonance Materials in Physics, Biology and Medicine,pp\. 1–17\.Cited by:[§A\.2](https://arxiv.org/html/2605.30573#A1.SS2.p1.1),[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- T\. Karras, M\. Aittala, T\. Aila, and S\. Laine \(2022\)Elucidating the design space of diffusion\-based generative models\.Advances in neural information processing systems35,pp\. 26565–26577\.Cited by:[§A\.2](https://arxiv.org/html/2605.30573#A1.SS2.p2.1)\.
- B\. Kawar, G\. Vaksman, and M\. Elad \(2021\)Snips: solving noisy inverse problems stochastically\.Advances in Neural Information Processing Systems34,pp\. 21757–21769\.Cited by:[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.12.9.1.1.1)\.
- G\. Lan \(2020\)First\-order and stochastic optimization methods for machine learning\.Vol\.1,Springer\.Cited by:[Lemma 1](https://arxiv.org/html/2605.30573#Thmlemma1)\.
- R\. Laumont, V\. D\. Bortoli, A\. Almansa, J\. Delon, A\. Durmus, and M\. Pereyra \(2022\)Bayesian imaging using plug & play priors: when langevin meets tweedie\.SIAM Journal on Imaging Sciences15\(2\),pp\. 701–737\.Cited by:[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.13.10.1.1.1)\.
- H\. Lee, J\. Lu, and Y\. Tan \(2022\)Convergence for score\-based generative modeling with polynomial complexity\.Advances in Neural Information Processing Systems35,pp\. 22870–22882\.Cited by:[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p4.2)\.
- Z\. Li, H\. Bao, X\. Zhang, and P\. Richtárik \(2021\)PAGE: a simple and optimal probabilistic gradient estimator for nonconvex optimization\.InInternational conference on machine learning,pp\. 6286–6295\.Cited by:[§3\.1](https://arxiv.org/html/2605.30573#S3.SS1.p1.2),[§3\.1](https://arxiv.org/html/2605.30573#S3.SS1.p2.2),[§3\.1](https://arxiv.org/html/2605.30573#S3.SS1.p4.13)\.
- J\. Liu, R\. Anirudh, J\. J\. Thiagarajan, S\. He, K\. A\. Mohan, U\. S\. Kamilov, and H\. Kim \(2023\)Dolce: a model\-based probabilistic diffusion framework for limited\-angle ct reconstruction\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 10498–10508\.Cited by:[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.8.5.1.1.1)\.
- L\. Liu and Z\. Wang \(2020\)One\-point gradient estimators for zeroth\-order stochastic gradient langevin dynamics\.InOPT2020: 12th Annual Workshop on Optimization for Machine Learning,External Links:[Link](https://opt-ml.org/oldopt/papers/2020/paper_96.pdf)Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p1.6)\.
- Y\. Liu, M\. Chen, C\. Ji, H\. Zhang, and R\. Wang \(2025\)SERENA: a unified stochastic recursive variance reduced gradient framework for riemannian non\-convex optimization\.InForty\-second International Conference on Machine Learning,Cited by:[§3\.1](https://arxiv.org/html/2605.30573#S3.SS1.p2.2)\.
- Z\. Liu, P\. Luo, X\. Wang, and X\. Tang \(2015\)Deep learning face attributes in the wild\.InProceedings of the IEEE international conference on computer vision,pp\. 3730–3738\.Cited by:[§C\.2](https://arxiv.org/html/2605.30573#A3.SS2.p1.5),[§4\.1](https://arxiv.org/html/2605.30573#S4.SS1.p2.9)\.
- I\. Lopez\-Gomez, C\. Christopoulos, H\. L\. Langeland Ervik, O\. R\. Dunbar, Y\. Cohen, and T\. Schneider \(2022\)Training physics\-based machine\-learning parameterizations with gradient\-free ensemble kalman methods\.Journal of Advances in Modeling Earth Systems14\(8\),pp\. e2022MS003105\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- K\. Miyasawaet al\.\(1961\)An empirical bayes estimator of the mean of a normal population\.Bull\. Inst\. Internat\. Statist38\(181\-188\),pp\. 1–2\.Cited by:[§2\.2](https://arxiv.org/html/2605.30573#S2.SS2.p2.9)\.
- N\. Moës, J\. Dolbow, and T\. Belytschko \(1999\)A finite element method for crack growth without remeshing\.International journal for numerical methods in engineering46\(1\),pp\. 131–150\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- Y\. Nesterovet al\.\(2018\)Lectures on convex optimization\.Vol\.137,Springer\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p1.6)\.
- Y\. Nesterov and V\. Spokoiny \(2017\)Random gradient\-free minimization of convex functions\.Foundations of Computational Mathematics17\(2\),pp\. 527–566\.Cited by:[§2\.1](https://arxiv.org/html/2605.30573#S2.SS1.p1.7)\.
- A\. Q\. Nichol and P\. Dhariwal \(2021\)Improved denoising diffusion probabilistic models\.InInternational conference on machine learning,pp\. 8162–8171\.Cited by:[§C\.2](https://arxiv.org/html/2605.30573#A3.SS2.p1.5),[§4\.1](https://arxiv.org/html/2605.30573#S4.SS1.p2.9)\.
- D\. S\. Oliver, A\. C\. Reynolds, and N\. Liu \(2008\)Inverse theory for petroleum reservoir characterization and history matching\.Cambridge, New York\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- H\. E\. Robbins \(1992\)An empirical bayes approach to statistics\.InBreakthroughs in Statistics: Foundations and basic theory,pp\. 388–394\.Cited by:[§2\.2](https://arxiv.org/html/2605.30573#S2.SS2.p2.9)\.
- A\. P\. Rotshtein and H\. B\. Rakytyanska \(2012\)Inverse inference based on fuzzy rules\.InFuzzy Evidence in Identification, Forecasting and Diagnosis,pp\. 193–233\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- L\. Rout, Y\. Chen, N\. Ruiz, A\. Kumar, C\. Caramanis, S\. Shakkottai, and W\. Chu \(2025\)RB\-modulation: training\-free stylization using reference\-based modulation\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=bnINPG5A32)Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p3.1)\.
- L\. Rout, N\. Raoof, G\. Daras, C\. Caramanis, A\. Dimakis, and S\. Shakkottai \(2023\)Solving linear inverse problems provably via posterior sampling with latent diffusion models\.Advances in Neural Information Processing Systems36,pp\. 49960–49990\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p3.1)\.
- A\. Roy, L\. Shen, K\. Balasubramanian, and S\. Ghadimi \(2022\)Stochastic zeroth\-order discretizations of langevin diffusions for bayesian inference\.Bernoulli28\(3\),pp\. 1810–1834\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p1.6),[§2\.1](https://arxiv.org/html/2605.30573#S2.SS1.p1.17)\.
- J\. Song, C\. Meng, and S\. Ermon \(2021a\)Denoising diffusion implicit models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=St1giarCHLP)Cited by:[§A\.1](https://arxiv.org/html/2605.30573#A1.SS1.p3.1)\.
- J\. Song, A\. Vahdat, M\. Mardani, and J\. Kautz \(2023\)Pseudoinverse\-guided diffusion models for inverse problems\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- Y\. Song and S\. Ermon \(2019\)Generative modeling by estimating gradients of the data distribution\.Advances in neural information processing systems32\.Cited by:[§2\.2](https://arxiv.org/html/2605.30573#S2.SS2.p2.17),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p4.2)\.
- Y\. Song and S\. Ermon \(2020\)Improved techniques for training score\-based generative models\.Advances in neural information processing systems33,pp\. 12438–12448\.Cited by:[§C\.1](https://arxiv.org/html/2605.30573#A3.SS1.p1.23),[§2\.2](https://arxiv.org/html/2605.30573#S2.SS2.p2.17),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p1.11),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p1.3),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p4.2)\.
- Y\. Song, L\. Shen, L\. Xing, and S\. Ermon \(2022\)Solving inverse problems in medical imaging with score\-based generative models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=vaRCHVj0uGI)Cited by:[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.6.3.2.1.1),[§1](https://arxiv.org/html/2605.30573#S1.p3.1)\.
- Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole \(2021b\)Score\-based generative modeling through stochastic differential equations\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=PxTIG12RRHS)Cited by:[§A\.1](https://arxiv.org/html/2605.30573#A1.SS1.p1.1),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p3.2),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p4.2)\.
- H\. Sun and K\. L\. Bouman \(2021\)Deep probabilistic imaging: uncertainty quantification and multi\-modal solution characterization for computational imaging\.InProceedings of the AAAI conference on artificial intelligence,Vol\.35,pp\. 2628–2637\.Cited by:[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.4.1.2.1.1),[§C\.4](https://arxiv.org/html/2605.30573#A3.SS4.p1.5)\.
- Y\. Sun, Z\. Wu, Y\. Chen, B\. T\. Feng, and K\. L\. Bouman \(2024\)Provable probabilistic imaging using score\-based generative priors\.IEEE Transactions on Computational Imaging\.Cited by:[§A\.2](https://arxiv.org/html/2605.30573#A1.SS2.p1.1),[Table 3](https://arxiv.org/html/2605.30573#A1.T3),[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.15.12.2.1.1),[§C\.1](https://arxiv.org/html/2605.30573#A3.SS1.p1.23),[§C\.3](https://arxiv.org/html/2605.30573#A3.SS3.p1.13),[§C\.4](https://arxiv.org/html/2605.30573#A3.SS4.p1.17),[§1](https://arxiv.org/html/2605.30573#S1.p3.1),[§2\.2](https://arxiv.org/html/2605.30573#S2.SS2.p2.17),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p1.14),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p1.3),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p3.2),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p4.2),[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p6.3),[§4\.2](https://arxiv.org/html/2605.30573#S4.SS2.p2.5),[§4](https://arxiv.org/html/2605.30573#S4.p1.1)\.
- Z\. Tan, C\. M\. Kaul, K\. G\. Pressel, Y\. Cohen, T\. Schneider, and J\. Teixeira \(2018\)An extended eddy\-diffusivity mass\-flux scheme for unified representation of subgrid\-scale turbulence and convection\.Journal of Advances in Modeling Earth Systems10\(3\),pp\. 770–800\.Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p2.1)\.
- H\. Tang, T\. Xie, A\. Feng, H\. Wang, C\. Zhang, and Y\. Bai \(2024\)Solving general noisy inverse problem via posterior sampling: a policy gradient viewpoint\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 2116–2124\.Cited by:[§A\.1](https://arxiv.org/html/2605.30573#A1.SS1.p2.2),[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.9.6.2.1.1),[§1](https://arxiv.org/html/2605.30573#S1.p3.1),[§4](https://arxiv.org/html/2605.30573#S4.p1.1)\.
- C\. Villani \(2009\)Optimal transport: old and new\.Grundlehren der Mathematischen Wissenschaften \[Fundamental Principles of Mathematical Sciences\], Vol\.338,Springer\-Verlag,Berlin\.Cited by:[§2\.3](https://arxiv.org/html/2605.30573#S2.SS3.p1.16)\.
- P\. Vincent \(2011\)A connection between score matching and denoising autoencoders\.Neural computation23\(7\),pp\. 1661–1674\.Cited by:[§2\.2](https://arxiv.org/html/2605.30573#S2.SS2.p2.9)\.
- Y\. Wang, J\. Yu, and J\. Zhang \(2023\)Zero\-shot image restoration using denoising diffusion null\-space model\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=mRieQgMtNTQ)Cited by:[§1](https://arxiv.org/html/2605.30573#S1.p3.1)\.
- G\. N\. Wong, B\. S\. Prather, V\. Dhruv, B\. R\. Ryan, M\. Mościbrodzka, C\. Chan, A\. V\. Joshi, R\. Yarza, A\. Ricarte, H\. Shiokawa,et al\.\(2022\)Patoka: simulating electromagnetic observables of black hole accretion\.The Astrophysical Journal Supplement Series259\(2\),pp\. 64\.Cited by:[§C\.4](https://arxiv.org/html/2605.30573#A3.SS4.p1.17),[§4\.3](https://arxiv.org/html/2605.30573#S4.SS3.p1.5)\.
- Z\. Wu, Y\. Sun, Y\. Chen, B\. Zhang, Y\. Yue, and K\. Bouman \(2024\)Principled probabilistic imaging using diffusion models as plug\-and\-play priors\.Advances in Neural Information Processing Systems37,pp\. 118389–118427\.Cited by:[§A\.2](https://arxiv.org/html/2605.30573#A1.SS2.p2.1),[§4](https://arxiv.org/html/2605.30573#S4.p1.1)\.
- K\. Y\. Yang and A\. Wibisono \(2022\)Convergence of the inexact langevin algorithm and score\-based generative models in kl divergence\.arXiv preprint arXiv:2211\.01512\.Cited by:[§3\.2](https://arxiv.org/html/2605.30573#S3.SS2.p4.2)\.
- J\. Zbontar, F\. Knoll, A\. Sriram, T\. Murrell, Z\. Huang, M\. J\. Muckley, A\. Defazio, R\. Stern, P\. Johnson, M\. Bruno,et al\.\(2019\)FastMRI: an open dataset and benchmarks for accelerated mri\.External Links:1811\.08839,[Link](https://arxiv.org/abs/1811.08839)Cited by:[§C\.3](https://arxiv.org/html/2605.30573#A3.SS3.p2.3),[§4\.2](https://arxiv.org/html/2605.30573#S4.SS2.p2.5)\.
- H\. Zheng, W\. Chu, A\. Wang, N\. B\. Kovachki, R\. Baptista, and Y\. Yue \(2025a\)Ensemble kalman diffusion guidance: a derivative\-free method for inverse problems\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=XPEEsKneKs)Cited by:[§A\.1](https://arxiv.org/html/2605.30573#A1.SS1.p1.1),[Table 3](https://arxiv.org/html/2605.30573#A1.T3.5.11.8.2.1.1),[§C\.4](https://arxiv.org/html/2605.30573#A3.SS4.p1.17),[§C\.4](https://arxiv.org/html/2605.30573#A3.SS4.p1.5),[§C\.5](https://arxiv.org/html/2605.30573#A3.SS5.p3.8),[Table 4](https://arxiv.org/html/2605.30573#A3.T4),[Table 4](https://arxiv.org/html/2605.30573#A3.T4.2.1),[§1](https://arxiv.org/html/2605.30573#S1.p3.1),[§4](https://arxiv.org/html/2605.30573#S4.p1.1)\.
- H\. Zheng, W\. Chu, B\. Zhang, Z\. Wu, A\. Wang, B\. Feng, C\. Zou, Y\. Sun, N\. B\. Kovachki, Z\. E\. Ross, K\. Bouman, and Y\. Yue \(2025b\)InverseBench: benchmarking plug\-and\-play diffusion priors for inverse problems in physical sciences\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=U3PBITXNG6)Cited by:[§C\.3](https://arxiv.org/html/2605.30573#A3.SS3.p1.13),[§C\.4](https://arxiv.org/html/2605.30573#A3.SS4.p1.17),[§C\.5](https://arxiv.org/html/2605.30573#A3.SS5.p3.8),[§4\.3](https://arxiv.org/html/2605.30573#S4.SS3.p1.5),[§4\.4](https://arxiv.org/html/2605.30573#S4.SS4.p1.1)\.
## Appendix A Comparison to Related Work
In this section, we compare ZO\-APMC against existing posterior sampling methods based on SGM priors, highlighting their strengths and limitations\.
### A.1 Black-box Posterior Sampling Algorithms
Three prior works study posterior sampling with SGM priors for black-box (derivative-free) forward models. The most recent is EnKG, which estimates the likelihood score using ensemble-based statistical linearization (Zheng et al., [2025a](https://arxiv.org/html/2605.30573#bib.bib27)). Unlike our algorithm, EnKG is not designed to sample from the full posterior and therefore offers limited uncertainty quantification. In our experiments, we observed that EnKG also becomes highly inefficient when evaluating the forward model is cheaper than evaluating the score network, which is the case in MRI reconstruction with a linear forward model. In addition, as reflected in our results, it produced noticeably lower reconstruction quality than our method in black hole imaging and MRI reconstruction. However, it performed better than our method on the Navier–Stokes inverse problem, because noisy function evaluations of the forward model make the guidance term unstable. The EnKG paper also proposes the Forward-GSG and Central-GSG baselines, which are derived from the DPS algorithm but estimate the guidance term with a mini-batch ZO estimate. Because they incur additional approximation error from the heuristic approximation of the guidance term (Chung et al., [2023b](https://arxiv.org/html/2605.30573#bib.bib19)), they generate samples with larger reconstruction errors, as reflected in our results. However, they require fewer iterations because they use the SDE from diffusion models (Song et al., [2021b](https://arxiv.org/html/2605.30573#bib.bib1)) rather than Langevin sampling, which needs more iterations.
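To make the phrase "mini-batch ZO estimate" concrete, the following sketch contrasts forward- and central-difference Gaussian-smoothing gradient estimates on a toy function. It is a generic illustration under stated assumptions, not the implementation of the Forward-GSG or Central-GSG baselines: the function names, the quadratic test potential, the smoothing radius `mu`, and the batch size are all hypothetical.

```python
import numpy as np


def zo_forward(f, x, mu, batch, rng):
    """Mean over the batch of (f(x + mu*u) - f(x)) / mu * u, with u ~ N(0, I)."""
    u = rng.standard_normal((batch, x.size))
    fx = f(x)
    diffs = np.array([(f(x + mu * ui) - fx) / mu for ui in u])
    return (diffs[:, None] * u).mean(axis=0)


def zo_central(f, x, mu, batch, rng):
    """Mean over the batch of (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u."""
    u = rng.standard_normal((batch, x.size))
    diffs = np.array([(f(x + mu * ui) - f(x - mu * ui)) / (2 * mu) for ui in u])
    return (diffs[:, None] * u).mean(axis=0)


def f(x):
    """Toy black-box function; its true gradient is x (used only as a check)."""
    return 0.5 * float(x @ x)


rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
g_fwd = zo_forward(f, x, mu=1e-3, batch=20000, rng=rng)
g_ctr = zo_central(f, x, mu=1e-3, batch=20000, rng=rng)
```

Both estimators only call `f`, never its derivative. The forward version needs `batch + 1` function evaluations and the central version `2 * batch`, which is the kind of cost that matters when each call runs a simulator.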
The other two are SCG (Huang et al., [2024](https://arxiv.org/html/2605.30573#bib.bib18)) and DPG (Tang et al., [2024](https://arxiv.org/html/2605.30573#bib.bib31)), which cast diffusion guidance in a stochastic control framework and steer the sampling process via an estimated value function. Similar to EnKG, neither method targets the posterior distribution, because the likelihood score term is intractable, so, unlike our algorithm, they do not support uncertainty quantification. Moreover, SCG relies on a threshold parameter that disables the likelihood-based guidance in certain iterations. We found that improper tuning of this threshold leads to severely degraded performance, yet the method provides no theoretical guidance for selecting it. As a result, practitioners must rely on grid search or extensive hyperparameter tuning. In contrast, in our setting $p$ and $b$ can be adjusted easily once a per-iteration budget is determined. Although we did not test this, SCG is based on a value function rather than a gradient approximation, so it could perform better than our method on rule-based inverse problems.
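The remark about choosing the two parameters from a per-iteration budget can be illustrated with a generic PAGE-style refresh of a ZO gradient estimate. This is only a sketch under assumptions: it reads `p` as the probability of recomputing a large-batch estimate and `b` as the small batch size, as in the PAGE estimator of Li et al. (2021), while the exact estimator of this paper is defined in Section 3.1 and may differ. The large batch size `big_b`, the cost accounting in `expected_evals`, and the toy function are hypothetical.

```python
import numpy as np


def zo_grad(f, x, u, mu):
    """Average of forward-difference ZO estimates over the rows of u."""
    fx = f(x)
    diffs = np.array([(f(x + mu * ui) - fx) / mu for ui in u])
    return (diffs[:, None] * u).mean(axis=0)


def page_zo_update(f, g_prev, x_prev, x_new, p, big_b, b, mu, rng):
    """One PAGE-style refresh of a ZO gradient estimate (illustrative only).

    With probability p: fresh large-batch estimate at x_new (big_b directions).
    Otherwise: previous estimate plus a small-batch correction that reuses the
    same b directions at x_new and x_prev.
    """
    if rng.random() < p:
        u = rng.standard_normal((big_b, x_new.size))
        return zo_grad(f, x_new, u, mu)
    u = rng.standard_normal((b, x_new.size))
    return g_prev + zo_grad(f, x_new, u, mu) - zo_grad(f, x_prev, u, mu)


def expected_evals(p, big_b, b):
    """Expected calls to f per refresh under the accounting assumed above."""
    return p * (big_b + 1) + (1 - p) * 2 * (b + 1)


def f(x):
    """Toy black-box function; its true gradient is x."""
    return 0.5 * float(x @ x)


rng = np.random.default_rng(0)
x_prev = np.array([1.0, -2.0, 0.5])
x_new = x_prev + 0.01
g_prev = x_prev.copy()  # pretend the previous estimate was exact
g_new = page_zo_update(f, g_prev, x_prev, x_new, p=0.0, big_b=1000, b=4, mu=1e-3, rng=rng)
cost = expected_evals(p=0.1, big_b=1000, b=4)  # 0.1 * 1001 + 0.9 * 10
```

Under this accounting, fixing a budget of forward-model calls per iteration leaves a one-dimensional trade-off between `p` and `b`, which is far easier to tune than a threshold with no cost interpretation.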
Similar to our approach, DPG also relies on Monte Carlo mini\-batch estimates of the posterior score\. However, unlike our method, DPG does not include any variance\-reduction mechanism, and incorporating such mechanisms is nontrivial due to the DDPM\(Hoet al\.,[2020](https://arxiv.org/html/2605.30573#bib.bib2)\)and DDIM\(Songet al\.,[2021a](https://arxiv.org/html/2605.30573#bib.bib3)\)sampling procedures it depends on\. In addition, their update rule couples the likelihood and prior score terms, preventing the decoupling that enables our variance\-reduced design\. Empirically, these differences lead to clear performance gaps: in both the black hole and MRI reconstruction tasks, our method achieves substantially better reconstruction quality, whereas on the Navier–Stokes example DPG performs slightly better, which may be due to instability in the ZO score approximations for this specific problem\.
### A.2 Gradient-based Posterior Sampling Algorithms
Many gradient-based posterior sampling methods with SGM priors have been proposed for solving inverse problems. As can be seen from our MRI reconstruction experiments (Table [1](https://arxiv.org/html/2605.30573#S3.T1)), APMC (Sun et al., [2024](https://arxiv.org/html/2605.30573#bib.bib11)), which is the closest work to ours, generated slightly better reconstructions than the proposed ZO-APMC algorithm. This shows that using the exact gradient of the forward model is preferable when it is available. However, this may not be possible in commercial MRI applications or closed-source systems (Karakuzu et al., [2025](https://arxiv.org/html/2605.30573#bib.bib80)), where one has access only to the input and output of the forward operator. In that scenario, our method performs best among black-box posterior samplers, with performance comparable to the gradient-based method. A similar trend appears in the black hole imaging experiments. Like our work, the APMC paper derives an upper bound on the FI. However, beyond establishing an FI bound in the black-box setting, we additionally prove convergence in squared total variation distance and show weak asymptotic convergence to the target distribution without assuming Lipschitzness of the score network.
Besides APMC, we used the DPS (Chung et al., [2023b](https://arxiv.org/html/2605.30573#bib.bib19)) and PnP-DM (Wu et al., [2024](https://arxiv.org/html/2605.30573#bib.bib63)) posterior sampling algorithms as gradient-based baselines. Although our method treats the forward model as a black box, it performed better than DPS in our experiments. We attribute this to the heuristic approximation of the guidance term in DPS, whose error, known as the Jensen gap, cannot be controlled explicitly, whereas we control the estimator error through the PAGE variance-reduction mechanism. PnP-DM provides a principled framework for posterior sampling, but, unlike our method, it applies to inverse problems only when implemented within the EDM framework (Karras et al., [2022](https://arxiv.org/html/2605.30573#bib.bib62)). Moreover, although PnP-DM establishes upper bounds on FI, it provides neither asymptotic convergence to the true posterior nor non-asymptotic guarantees in squared total variation distance. Consequently, it remains unclear whether samples produced after many iterations are distributed according to the true posterior.
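For readers unfamiliar with the term, the Jensen gap in DPS can be written out as follows. The notation here is an assumption borrowed from the diffusion literature rather than from this paper: $\bm{x}_t$ is the noisy diffusion state, $\bm{x}_0$ the clean signal, and $\hat{\bm{x}}_0$ the posterior-mean denoiser output. DPS replaces the intractable time-dependent likelihood by the likelihood evaluated at the denoised point:

$$p(\bm{y} | \bm{x}_t) = \mathbb{E}_{\bm{x}_0 \sim p(\bm{x}_0 | \bm{x}_t)}\left[ p(\bm{y} | \bm{x}_0) \right] \approx p(\bm{y} | \hat{\bm{x}}_0), \qquad \hat{\bm{x}}_0 \coloneqq \mathbb{E}[\bm{x}_0 | \bm{x}_t].$$

The Jensen gap is the error of exchanging the expectation and the likelihood,

$$\left| \mathbb{E}\left[ p(\bm{y} | \bm{x}_0) \,\middle|\, \bm{x}_t \right] - p\left(\bm{y} \,\middle|\, \mathbb{E}[\bm{x}_0 | \bm{x}_t]\right) \right|,$$

which vanishes only in special cases, for example when $p(\bm{x}_0 | \bm{x}_t)$ concentrates on a single point. No sampler parameter drives this error to zero, in contrast to an estimator error that shrinks with the batch size.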
Table 3: A conceptual overview of posterior sampling approaches for probabilistic imaging. The *"Annealing"* column highlights distinctions among MCMC-based methods. The *"Black-box access"* column shows whether the corresponding method assumes black-box access and works when gradients of the forward model are unavailable. The $\bm{A}(\cdot)$ column shows the assumption on the type of forward model. This table extends that of Sun et al. ([2024](https://arxiv.org/html/2605.30573#bib.bib11)) by incorporating additional black-box posterior sampling algorithms.

| Category | Reference | Generative prior | Model agnostic | $\bm{A}(\cdot)$ | Convergence guarantees | Annealing | Black-box access |
|---|---|---|---|---|---|---|---|
| Variational Bayesian | (Sun and Bouman, [2021](https://arxiv.org/html/2605.30573#bib.bib39)) | ✗ | ✗ | General | ✗ | – | ✗ |
| | (Feng et al., [2023](https://arxiv.org/html/2605.30573#bib.bib40)) | ✓ | ✗ | General | ✗ | – | ✗ |
| DM-based | (Song et al., [2022](https://arxiv.org/html/2605.30573#bib.bib10)) | ✓ | ✓ | Linear | ✗ | – | ✗ |
| | (Chung et al., [2023b](https://arxiv.org/html/2605.30573#bib.bib19)) | ✓ | ✓ | General | ✗ | – | ✗ |
| | (Liu et al., [2023](https://arxiv.org/html/2605.30573#bib.bib41)) | ✓ | ✗ | Linear | ✗ | – | ✗ |
| | (Tang et al., [2024](https://arxiv.org/html/2605.30573#bib.bib31)) | ✓ | ✗ | General | ✗ | – | ✓ |
| | (Huang et al., [2024](https://arxiv.org/html/2605.30573#bib.bib18)) | ✓ | ✗ | General | ✗ | – | ✓ |
| | (Zheng et al., [2025a](https://arxiv.org/html/2605.30573#bib.bib27)) | ✓ | ✗ | General | ✗ | – | ✓ |
| MCMC-based | (Jalal et al., [2021](https://arxiv.org/html/2605.30573#bib.bib42)) | ✓ | ✓ | Linear | ✓¹ | ✓ | ✗ |
| | (Kawar et al., [2021](https://arxiv.org/html/2605.30573#bib.bib43)) | ✓ | ✓ | Linear | ✗ | ✓ | ✗ |
| | (Laumont et al., [2022](https://arxiv.org/html/2605.30573#bib.bib44)) | ✗ | ✓ | General | ✓ | ✗ | ✗ |
| | (Coeurdoux et al., [2024](https://arxiv.org/html/2605.30573#bib.bib45)) | ✓ | ✓ | Linear | ✗ | ✗ | ✗ |
| | (Bouman and Buzzard, [2023](https://arxiv.org/html/2605.30573#bib.bib46)) | ✗ | ✓ | Linear | ✓² | ✗ | ✗ |
| | (Sun et al., [2024](https://arxiv.org/html/2605.30573#bib.bib11)) | ✓ | ✓ | General | ✓ | ✓ | ✗ |
| MCMC-based | Ours | ✓ | ✓ | General | ✓ | ✓ | ✓ |

¹ Requires $\bm{A}(\cdot)$ to be a Gaussian random matrix. ² Guarantees on asymptotic convergence.
## Appendix B Theoretical Results
Notation. Throughout the proof, we work within the probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ denotes the sample space, $\mathcal{F}$ the $\sigma$-algebra, and $\mathbb{P}$ the probability measure. We define the following filtration containing all randomness up to iteration $k$:

$$\mathcal{F}_k \coloneqq \sigma(\bm{x}_0, \bm{u}_0^i, \bm{Z}_0, \ldots, \bm{u}_k^i, \bm{Z}_k), \tag{15}$$

where $\bm{Z}_k \coloneqq \bm{B}_{(k+1)\gamma} - \bm{B}_{k\gamma}$ and $\bm{u}_k^i$ denotes the collection of Gaussian vectors used in the ZO estimates at iteration $k$.
⌈⋅⌉\\lceil\\cdot\\rceilis the ceiling function and it is defined as⌈x⌉≔min\{n∈ℤ:n≥x\}\\lceil x\\rceil\\coloneqq\\min\\\{n\\in\{\\mathbb\{Z\}\}:n\\geq x\\\}\.
For two nonnegative sequences\(an\)n≥0\(a\_\{n\}\)\_\{n\\geq 0\}and\(bn\)n≥0\(b\_\{n\}\)\_\{n\\geq 0\}, the notationan≲bna\_\{n\}\\lesssim b\_\{n\}denotes that there exists a constantC\>0C\>0, independent ofnn, such that
an≤Cbn,a\_\{n\}\\leq Cb\_\{n\},for alln≥0n\\geq 0\.
For a random variableX:Ω→ℝnX:\\Omega\\to\\mathbb\{R\}^\{n\}, we write its expectation as
𝔼\[X\]=∫ΩX\(ω\)ℙ\(dω\)\.\\mathbb\{E\}\[X\]=\\int\_\{\\Omega\}X\(\\omega\)\\,\\mathbb\{P\}\(d\\omega\)\.
The posterior distribution of interest is of the form
$$\pi(\bm{x}\mid\bm{y}) \propto \ell(\bm{y}\mid\bm{x})\,p(\bm{x}),$$
where $f(\bm{x})$ and $h(\bm{x})$ denote the potential functions of the likelihood $\ell(\bm{y}\mid\bm{x}) \propto \exp(-f(\bm{x}))$ and the prior $p(\bm{x}) \propto \exp(-h(\bm{x}))$, respectively. Moreover, we define $h_{\sigma_k}$ as the potential function of the perturbed prior $p_{\sigma_k}(\bm{x}) \propto \exp(-h_{\sigma_k}(\bm{x}))$, where $p_{\sigma}(\bm{x}) \coloneqq \int_{\mathbb{R}^d} p(\bm{z})\,\mathcal{N}(\bm{x}\mid\bm{z},\sigma^2 I)\,d\bm{z}$ and $p(\bm{x})$ is the unperturbed prior. For simplicity, we omit the explicit dependence on $\bm{y}$. Recall that
−∇logπ\(𝒙\)=∇f\(𝒙\)\+∇h\(𝒙\)\.\-\\nabla\\log\\pi\(\{\\bm\{x\}\}\)=\\nabla f\(\{\\bm\{x\}\}\)\+\\nabla h\(\{\\bm\{x\}\}\)\.\(16\)
We denote the zeroth\-order approximation of the forward model gradient as follows
∇~fμ\(𝒙kγ,𝒖\)≔f\(𝒙kγ\+μ𝒖\)−f\(𝒙kγ\)μ𝒖,\\widetilde\{\\nabla\}f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{k\\gamma\},\{\\bm\{u\}\}\)\\coloneqq\\frac\{f\(\{\\bm\{x\}\}\_\{k\\gamma\}\+\\mu\{\\bm\{u\}\}\)\-f\(\{\\bm\{x\}\}\_\{k\\gamma\}\)\}\{\\mu\}\{\\bm\{u\}\},\(17\)where𝒖∼𝒩\(0,I\)\{\\bm\{u\}\}\\sim\{\\mathcal\{N\}\}\(0,I\)andμ\>0\\mu\>0\. The expectation of the zeroth\-order approximation is denoted as∇fμ\(𝒙kγ\)≔𝔼𝒖\[∇~fμ\(𝒙kγ,𝒖\)\]\\nabla f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{k\\gamma\}\)\\coloneqq\\mathbb\{E\}\_\{\{\\bm\{u\}\}\}\\left\[\\widetilde\{\\nabla\}f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{k\\gamma\},\{\\bm\{u\}\}\)\\right\]\. For notational convenience, we also define
Δk≔𝔼\[‖𝒙\(k\+1\)γ−𝒙kγ‖22\],\\Delta\_\{k\}\\coloneqq\\mathbb\{E\}\\big\[\\\|\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\}\-\{\\bm\{x\}\}\_\{k\\gamma\}\\\|\_\{2\}^\{2\}\\big\],\(18\)as the expected squaredℓ2\\ell\_\{2\}\-distance between consecutive iterates\. For convenience, we recall the definition of the Kullback–Leibler \(KL\) divergence between two probability densitiesν\\nuandπ\\pi:
KL\(ν∥π\)=∫ℝnν\(x\)logν\(x\)π\(x\)dx\.\\mathrm\{KL\}\(\\nu\\\|\\pi\)=\\int\_\{\\mathbb\{R\}^\{n\}\}\\nu\(x\)\\log\\frac\{\\nu\(x\)\}\{\\pi\(x\)\}\\,dx\.
The Fisher information \(FI\) is given by
FI\(ν∥π\)=∫ℝn‖∇logν\(x\)π\(x\)‖22ν\(x\)𝑑x=∫ℝn‖∇logν\(x\)−∇logπ\(x\)‖22ν\(x\)𝑑x\.\\mathrm\{FI\}\(\\nu\\\|\\pi\)=\\int\_\{\\mathbb\{R\}^\{n\}\}\\bigl\\\|\\nabla\\log\\tfrac\{\\nu\(x\)\}\{\\pi\(x\)\}\\bigr\\\|\_\{2\}^\{2\}\\nu\(x\)\\,dx=\\int\_\{\\mathbb\{R\}^\{n\}\}\\\|\\nabla\\log\\nu\(x\)\-\\nabla\\log\\pi\(x\)\\\|\_\{2\}^\{2\}\\nu\(x\)\\,dx\.
The total variation (TV) distance between two probability measures $\nu$ and $\pi$ on a measurable space $(\mathcal{X},\mathcal{F})$ is given by
$$\|\nu-\pi\|_{\mathrm{TV}} \coloneqq \sup_{A\in\mathcal{F}} \bigl|\nu(A)-\pi(A)\bigr| = \frac{1}{2}\int_{\mathcal{X}} \bigl|d\nu - d\pi\bigr|.$$
Unless otherwise stated, $\|\cdot\|$ denotes the $\ell_2$-norm, i.e., $\|\cdot\|_2$. $C^{1,1}_{L}$ denotes the class of differentiable functions $f:\mathbb{R}^d\to\mathbb{R}$ whose gradients are $L$-Lipschitz, i.e.,
‖∇f\(𝒙\)−∇f\(𝒚\)‖2≤L‖𝒙−𝒚‖2,∀𝒙,𝒚∈ℝd\.\\\|\\nabla f\(\{\\bm\{x\}\}\)\-\\nabla f\(\{\\bm\{y\}\}\)\\\|\_\{2\}\\leq L\\\|\{\\bm\{x\}\}\-\{\\bm\{y\}\}\\\|\_\{2\},\\quad\\forall\{\\bm\{x\}\},\{\\bm\{y\}\}\\in\\mathbb\{R\}^\{d\}\.\(19\)
### B\.1Lemmas
We begin by reviewing the key lemmas from the zeroth\-order optimization and non\-log\-concave sampling literature\. The following lemma summarizes key properties of zeroth\-order approximations used in our analysis\.
###### Lemma 1\(Lemma 6\.2 under Section 6\.1\.2\.1 inLan \([2020](https://arxiv.org/html/2605.30573#bib.bib33)\)\)\.
Suppose thatf\(𝐱\)∈CL1,1f\(\{\\bm\{x\}\}\)\\in C\_\{L\}^\{1,1\}, i\.e\.,ffis continuously differentiable and its gradient isLL\-Lipschitz, and letfμ\(𝐱\):=𝔼𝐮\[f\(𝐱\+μ𝐮\)\]f\_\{\\mu\}\(\{\\bm\{x\}\}\):=\\mathbb\{E\}\_\{\\bm\{u\}\}\\\!\\left\[f\(\{\\bm\{x\}\}\+\\mu\{\\bm\{u\}\}\)\\right\]\. Then the following statements hold:
1. \(a\)fμ∈CLμ1,1\(ℝd\)f\_\{\\mu\}\\in C\_\{L\_\{\\mu\}\}^\{1,1\}\(\\mathbb\{R\}^\{d\}\), whereLμ≤LL\_\{\\mu\}\\leq L,
2. \(b\)‖∇fμ\(𝒙\)−∇f\(𝒙\)‖≤12μL\(d\+3\)32\\\|\\nabla f\_\{\\mu\}\(\{\\bm\{x\}\}\)\-\\nabla f\(\{\\bm\{x\}\}\)\\\|\\leq\\tfrac\{1\}\{2\}\\mu L\(d\+3\)^\{\\tfrac\{3\}\{2\}\},
3. \(c\)𝔼𝒖\[‖∇~fμ\(𝒙,𝒖\)‖2\]≤12μ2L2\(d\+6\)3\+2\(d\+4\)‖∇f\(𝒙\)‖22\\mathbb\{E\}\_\{\{\\bm\{u\}\}\}\[\\\|\\widetilde\{\\nabla\}f\_\{\\mu\}\(\{\\bm\{x\}\},\{\\bm\{u\}\}\)\\\|^\{2\}\]\\leq\\tfrac\{1\}\{2\}\\mu^\{2\}L^\{2\}\(d\+6\)^\{3\}\+2\(d\+4\)\\\|\\nabla f\(\{\\bm\{x\}\}\)\\\|\_\{2\}^\{2\},
where∇~fμ\(𝐱,𝐮\):=f\(𝐱\+μ𝐮\)−f\(𝐱\)μ𝐮\\widetilde\{\\nabla\}f\_\{\\mu\}\(\{\\bm\{x\}\},\{\\bm\{u\}\}\):=\\frac\{f\(\{\\bm\{x\}\}\+\\mu\{\\bm\{u\}\}\)\-f\(\{\\bm\{x\}\}\)\}\{\\mu\}\{\\bm\{u\}\}for𝐮∼𝒩\(0,I\)\{\\bm\{u\}\}\\sim\\mathcal\{N\}\(0,I\)and any𝐱∈ℝd\{\\bm\{x\}\}\\in\\mathbb\{R\}^\{d\},μ\>0\\mu\>0\.
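As an illustration of the estimator in Lemma 1, the following minimal NumPy sketch implements the single-sample Gaussian-smoothing estimate and its batched average, and compares the resulting error with the bias bound in part (b). The test potential `f`, the helper names, and all parameter values are assumptions chosen for this example; they are not taken from the paper.

```python
import numpy as np

def zo_gradient(f, x, mu, u):
    """Single-sample estimate (f(x + mu*u) - f(x)) / mu * u for a direction u."""
    return (f(x + mu * u) - f(x)) / mu * u

def batched_zo_gradient(f, x, mu, batch_size, rng):
    """Average of `batch_size` independent single-sample estimates."""
    us = rng.standard_normal((batch_size, x.shape[0]))
    return np.mean([zo_gradient(f, x, mu, u) for u in us], axis=0)

# Hypothetical test potential: f(x) = sum_i log cosh(x_i).
# Its gradient is tanh(x) and its Hessian is diag(sech^2(x_i)), so L = 1.
def f(x):
    return np.sum(np.log(np.cosh(x)))

def grad_f(x):
    return np.tanh(x)

rng = np.random.default_rng(0)
d, mu, L = 5, 1e-2, 1.0
x = rng.standard_normal(d)

g_hat = batched_zo_gradient(f, x, mu, batch_size=20000, rng=rng)
err = np.linalg.norm(g_hat - grad_f(x))
bias_bound = 0.5 * mu * L * (d + 3) ** 1.5  # right-hand side of Lemma 1(b)

print(f"estimation error (bias + Monte Carlo noise): {err:.4f}")
print(f"Lemma 1(b) bound on the bias alone:          {bias_bound:.4f}")
```

The printed error mixes the smoothing bias with Monte Carlo noise, so it is not directly comparable to the bound in (b), which controls the bias only; increasing `batch_size` reduces the noise component.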
The following lemma concerns the density evolution of an interpolated diffusion process\.
###### Lemma 2\(Lemma 23 inBalasubramanianet al\.\([2022](https://arxiv.org/html/2605.30573#bib.bib34)\)\)\.
Consider the stochastic process defined by
𝒙t:=𝒙0−tg0\+2𝑩t,withg0=g\(𝒙0\),𝒙0∼ν0\{\\bm\{x\}\}\_\{t\}:=\{\\bm\{x\}\}\_\{0\}\-tg\_\{0\}\+\\sqrt\{2\}\{\\bm\{B\}\}\_\{t\},\\quad\\text\{with $g\_\{0\}=g\(\{\\bm\{x\}\}\_\{0\}\)$\},\\quad\{\\bm\{x\}\}\_\{0\}\\sim\\nu\_\{0\}whereg0g\_\{0\}is integrable and\{Bt\}t≥0\\\{B\_\{t\}\\\}\_\{t\\geq 0\}is a standard Brownian motion inℝd\{\\mathbb\{R\}\}^\{d\}independent of\(𝐱0,g0\)\(\{\\bm\{x\}\}\_\{0\},g\_\{0\}\)\. Then, writingνt\\nu\_\{t\}for the probability density of𝐱t\{\\bm\{x\}\}\_\{t\}, we have
ddtKL\(νt∥π\)≤−34FI\(νt∥π\)\+𝔼\[‖∇f\(𝒙t\)−g0‖2\],\\frac\{d\}\{dt\}\\mathrm\{KL\}\(\\nu\_\{t\}\\\|\\pi\)\\leq\-\\frac\{3\}\{4\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)\+\\mathbb\{E\}\\left\[\\\|\\nabla f\(\{\\bm\{x\}\}\_\{t\}\)\-g\_\{0\}\\\|^\{2\}\\right\],where we recall thatπ∝e−f\\pi\\propto e^\{\-f\}, and the expectation in the last term is with respect tox0∼ν0x\_\{0\}\\sim\\nu\_\{0\}andxt∼νtx\_\{t\}\\sim\\nu\_\{t\}\.
We also use the following lemma, taken from Chewi et al. ([2024](https://arxiv.org/html/2605.30573#bib.bib35)), to bound the expected squared norm of the score of $\pi$ in terms of the Fisher information.
###### Lemma 3\(Lemma 20 inChewiet al\.\([2024](https://arxiv.org/html/2605.30573#bib.bib35)\)\)\.
Assume that∇logπ\(𝐱\)\\nabla\\log\\pi\(\{\\bm\{x\}\}\)isLπL\_\{\\pi\}\-Lipschitz\. For any probability measureν\\nu, it holds that
𝔼ν\[‖∇logπ\(𝒙\)‖2\]≤FI\(ν∥π\)\+2dLπ\.\\mathbb\{E\}\_\{\\nu\}\\left\[\\\|\\nabla\\log\\pi\(\{\\bm\{x\}\}\)\\\|^\{2\}\\right\]\\leq\\mathrm\{FI\}\\left\(\\nu\\\|\\pi\\right\)\+2dL\_\{\\pi\}\.
We use the following lemma to derive an upper bound on total variation \(TV\) distance\.
###### Lemma 4\(Theorem 3\.1 inGuillinet al\.\([2009](https://arxiv.org/html/2605.30573#bib.bib36)\)\)\.
Ifπ\\pisatisfies a Poincaré inequality, i\.e\. for every smooth, compactly supportedf:ℝd→ℝf:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\},
Varπ\(f\)≤CPI𝔼π\[‖∇f‖2\],\\mathrm\{Var\}\_\{\\pi\}\(f\)\\;\\leq\\;C\_\{\\mathrm\{PI\}\}\\,\\mathbb\{E\}\_\{\\pi\}\\\!\\bigl\[\\\|\\nabla f\\\|^\{2\}\\bigr\],then for any probability measureν\\nu,
‖ν−π‖TV2≤4CPIFI\(ν∥π\)\.\\\|\\nu\-\\pi\\\|\_\{\\mathrm\{TV\}\}^\{2\}\\;\\leq\\;4C\_\{\\mathrm\{PI\}\}\\,\\mathrm\{FI\}\(\\nu\\\|\\pi\)\.
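As a quick sanity check of Lemma 4 (an illustrative example of ours, not part of the cited result), consider the Gaussian pair $\pi = \mathcal{N}(0,\sigma^2 I)$ and $\nu = \mathcal{N}(m,\sigma^2 I)$ with $m \in \mathbb{R}^d$. The Gaussian $\pi$ satisfies a Poincaré inequality with $C_{\mathrm{PI}} = \sigma^2$, and both sides of the bound can be computed in closed form:

$$
\begin{aligned}
\nabla\log\frac{\nu(x)}{\pi(x)} &= -\frac{x-m}{\sigma^2} + \frac{x}{\sigma^2} = \frac{m}{\sigma^2},
\qquad
\mathrm{FI}(\nu\|\pi) = \frac{\|m\|^2}{\sigma^4},
\qquad
4C_{\mathrm{PI}}\,\mathrm{FI}(\nu\|\pi) = \frac{4\|m\|^2}{\sigma^2},\\
\|\nu-\pi\|_{\mathrm{TV}} &= 2\Phi\!\left(\frac{\|m\|}{2\sigma}\right) - 1 \;\le\; \frac{\|m\|}{\sqrt{2\pi}\,\sigma},
\qquad\text{so}\qquad
\|\nu-\pi\|_{\mathrm{TV}}^2 \;\le\; \frac{\|m\|^2}{2\pi\sigma^2} \;\le\; \frac{4\|m\|^2}{\sigma^2},
\end{aligned}
$$

where $\Phi$ is the standard normal distribution function. The inequality of Lemma 4 therefore holds in this example, and it is informative only when $\|m\| < \sigma/2$, since the TV distance never exceeds one.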
### B\.2Proof of Proposition[1](https://arxiv.org/html/2605.30573#Thmproposition1)
#### Proposition[1](https://arxiv.org/html/2605.30573#Thmproposition1)
Suppose Assumption[1](https://arxiv.org/html/2605.30573#Thmassumption1)holds, and let\(𝐱kγ\)k≥0\(\{\\bm\{x\}\}\_\{k\\gamma\}\)\_\{k\\geq 0\}be generated by \([9](https://arxiv.org/html/2605.30573#S3.E9)\), where𝐠k\{\\bm\{g\}\}\_\{k\}is given by \([8](https://arxiv.org/html/2605.30573#S3.E8)\)\. Defineek2:=𝔼\[‖𝐠k−∇fμ\(𝐱kγ\)‖22\]e\_\{k\}^\{2\}:=\\mathbb\{E\}\[\\\|\{\\bm\{g\}\}\_\{k\}\-\\nabla f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{k\\gamma\}\)\\\|\_\{2\}^\{2\}\]\. Then, for anyγ\>0\\gamma\>0,
ek\+12≤pσk\+12b\+\(1−p\)ek2\+4\(1−p\)dL12Δkμ2b′,e^\{2\}\_\{k\+1\}\\leq\\frac\{p\\sigma\_\{k\+1\}^\{2\}\}\{b\}\+\(1\-p\)e\_\{k\}^\{2\}\+\\frac\{4\(1\-p\)dL\_\{1\}^\{2\}\\Delta\_\{k\}\}\{\\mu^\{2\}b^\{\\prime\}\},\(20\)whereσk\+12≔𝔼\[‖∇~fμ\(𝐱\(k\+1\)γ,𝐮k\+1i\)‖2\]\\sigma^\{2\}\_\{k\+1\}\\coloneqq\\mathbb\{E\}\[\\\|\\widetilde\{\\nabla\}f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\},\{\\bm\{u\}\}\_\{k\+1\}^\{i\}\)\\\|^\{2\}\]andΔk≔𝔼\[‖𝐱\(k\+1\)γ−𝐱kγ‖2\]\\Delta\_\{k\}\\coloneqq\\mathbb\{E\}\[\\\|\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\}\-\{\\bm\{x\}\}\_\{k\\gamma\}\\\|^\{2\}\]
Proof\.Using the definition of𝒈k\{\\bm\{g\}\}\_\{k\}in \([8](https://arxiv.org/html/2605.30573#S3.E8)\), we can boundek\+12e\_\{k\+1\}^\{2\}as follows
ek\+12\\displaystyle e\_\{k\+1\}^\{2\}=p𝔼\[‖∇fμ\(𝒙\(k\+1\)γ\)−1b∑i=1b∇~fμ\(𝒙\(k\+1\)γ,𝒖k\+1i\)‖2\]\\displaystyle=p\\,\\mathbb\{E\}\\\!\\left\[\\left\\\|\\nabla f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\}\)\-\\frac\{1\}\{b\}\\sum\_\{i=1\}^\{b\}\\widetilde\{\\nabla\}f\_\{\\mu\}\\bigl\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\},\{\\bm\{u\}\}\_\{k\+1\}^\{i\}\\bigr\)\\right\\\|^\{2\}\\right\]\+\(1−p\)𝔼\[‖∇fμ\(𝒙\(k\+1\)γ\)−𝒈k−1b′∑i=1b′\(∇~fμ\(𝒙\(k\+1\)γ,𝒖k\+1i\)−∇~fμ\(𝒙kγ,𝒖k\+1i\)\)‖2\]\\displaystyle\\quad\+\(1\-p\)\\,\\mathbb\{E\}\\\!\\left\[\\left\\\|\\nabla f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\}\)\-\{\\bm\{g\}\}\_\{k\}\-\\frac\{1\}\{b^\{\\prime\}\}\\sum\_\{i=1\}^\{b^\{\\prime\}\}\\Bigl\(\\widetilde\{\\nabla\}f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\},\{\\bm\{u\}\}\_\{k\+1\}^\{i\}\)\-\\widetilde\{\\nabla\}f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{k\\gamma\},\{\\bm\{u\}\}\_\{k\+1\}^\{i\}\)\\Bigr\)\\right\\\|^\{2\}\\right\]\(21\)whereb′b^\{\\prime\},bbdenote the small and large batch sizes, respectively, and𝒖k\+1i∼𝒩\(0,I\)\{\\bm\{u\}\}\_\{k\+1\}^\{i\}\\sim\{\\mathcal\{N\}\}\(0,I\)inℝd\{\\mathbb\{R\}\}^\{d\}\. Here,kkandiidenote the iteration and sample indices, respectively, for the sampled Gaussian vectors\. We can upper bound the first expectation as
𝔼\[‖∇fμ\(𝒙\(k\+1\)γ\)−1b∑i=1b∇~fμ\(𝒙\(k\+1\)γ,𝒖k\+1i\)‖2\]\\displaystyle\\mathbb\{E\}\\left\[\\left\\\|\\nabla f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\}\)\-\\frac\{1\}\{b\}\\sum\_\{i=1\}^\{b\}\\widetilde\{\\nabla\}f\_\{\\mu\}\\bigl\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\},\{\\bm\{u\}\}\_\{k\+1\}^\{i\}\\bigr\)\\right\\\|^\{2\}\\right\]=1b𝔼\[‖∇fμ\(𝒙\(k\+1\)γ\)−∇~fμ\(𝒙\(k\+1\)γ,𝒖k\+1i\)‖2\]\\displaystyle=\\frac\{1\}\{b\}\\mathbb\{E\}\\left\[\\left\\\|\\nabla f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\}\)\-\\widetilde\{\\nabla\}f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\},\{\\bm\{u\}\}\_\{k\+1\}^\{i\}\)\\right\\\|^\{2\}\\right\]\(22\)≤1b𝔼\[‖∇~fμ\(𝒙\(k\+1\)γ,𝒖k\+1i\)‖2\]⏟=σk\+12\.\\displaystyle\\leq\\frac\{1\}\{b\}\\underbrace\{\\mathbb\{E\}\\left\[\\left\\\|\\widetilde\{\\nabla\}f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\},\{\\bm\{u\}\}\_\{k\+1\}^\{i\}\)\\right\\\|^\{2\}\\right\]\}\_\{=\\sigma\_\{k\+1\}^\{2\}\}\.\(23\)In \([22](https://arxiv.org/html/2605.30573#A2.E22)\), we expand the square and use𝔼\[∇~fμ\(𝒙\(k\+1\)γ,𝒖k\+1i\)\|ℱk\]=∇fμ\(𝒙\(k\+1\)γ\)\\mathbb\{E\}\[\\widetilde\{\\nabla\}f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\},\{\\bm\{u\}\}\_\{k\+1\}^\{i\}\)\|\\mathcal\{F\}\_\{k\}\]=\\nabla f\_\{\\mu\}\(\{\\bm\{x\}\}\_\{\(k\+1\)\\gamma\}\)with the fact that𝒖k\+1i\{\\bm\{u\}\}\_\{k\+1\}^\{i\}are identically and independently distributed\. In \([23](https://arxiv.org/html/2605.30573#A2.E23)\), we use the second\-moment bound on the variance with definition ofσk\+12\\sigma\_\{k\+1\}^\{2\}\. Plugging this upper bound into \([21](https://arxiv.org/html/2605.30573#A2.E21)\), we get
$$
\begin{aligned}
e_{k+1}^2
&\le \frac{p\sigma_{k+1}^2}{b} + (1-p)\,\mathbb{E}\Biggl[\Bigl\|\nabla f_\mu(\bm{x}_{(k+1)\gamma}) - \bm{g}_k - \frac{1}{b'}\sum_{i=1}^{b'}\Bigl(\widetilde{\nabla} f_\mu(\bm{x}_{(k+1)\gamma},\bm{u}_{k+1}^i) - \widetilde{\nabla} f_\mu(\bm{x}_{k\gamma},\bm{u}_{k+1}^i)\Bigr)\Bigr\|^2\Biggr] && \text{(24)}\\
&= \frac{p\sigma_{k+1}^2}{b} + (1-p)\,\mathbb{E}\Biggl[\Bigl\|\nabla f_\mu(\bm{x}_{k\gamma}) - \bm{g}_k + \nabla f_\mu(\bm{x}_{(k+1)\gamma}) - \nabla f_\mu(\bm{x}_{k\gamma}) \\
&\qquad\qquad - \frac{1}{b'}\sum_{i=1}^{b'}\Bigl(\widetilde{\nabla} f_\mu(\bm{x}_{(k+1)\gamma},\bm{u}_{k+1}^i) - \widetilde{\nabla} f_\mu(\bm{x}_{k\gamma},\bm{u}_{k+1}^i)\Bigr)\Bigr\|^2\Biggr] && \text{(25)}\\
&= \frac{p\sigma_{k+1}^2}{b} + (1-p)e_k^2 + \frac{1-p}{b'}\,\mathbb{E}\Biggl[\Bigl\|\nabla f_\mu(\bm{x}_{(k+1)\gamma}) - \nabla f_\mu(\bm{x}_{k\gamma}) - \Bigl(\widetilde{\nabla} f_\mu(\bm{x}_{(k+1)\gamma},\bm{u}_{k+1}^i) - \widetilde{\nabla} f_\mu(\bm{x}_{k\gamma},\bm{u}_{k+1}^i)\Bigr)\Bigr\|^2\Biggr] && \text{(26)}\\
&\le \frac{p\sigma_{k+1}^2}{b} + (1-p)e_k^2 + \frac{1-p}{b'}\,\mathbb{E}\Biggl[\Bigl\|\widetilde{\nabla} f_\mu(\bm{x}_{(k+1)\gamma},\bm{u}_{k+1}^i) - \widetilde{\nabla} f_\mu(\bm{x}_{k\gamma},\bm{u}_{k+1}^i)\Bigr\|^2\Biggr] && \text{(27)}\\
&= \frac{p\sigma_{k+1}^2}{b} + (1-p)e_k^2 + \frac{1-p}{\mu^2 b'}\,\mathbb{E}\Biggl[\Bigl|\Bigl(f(\bm{x}_{(k+1)\gamma}+\mu\bm{u}_{k+1}^i) - f(\bm{x}_{k\gamma}+\mu\bm{u}_{k+1}^i)\Bigr) - \Bigl(f(\bm{x}_{(k+1)\gamma}) - f(\bm{x}_{k\gamma})\Bigr)\Bigr|^2 \bigl\|\bm{u}_{k+1}^i\bigr\|^2\Biggr] && \text{(28)}\\
&\le \frac{p\sigma_{k+1}^2}{b} + (1-p)e_k^2 + \frac{4(1-p)L_1^2\Delta_k}{\mu^2 b'}\,\mathbb{E}\Bigl[\bigl\|\bm{u}_{k+1}^i\bigr\|^2\Bigr] && \text{(29)}\\
&= \frac{p\sigma_{k+1}^2}{b} + (1-p)e_k^2 + \frac{4(1-p)dL_1^2\Delta_k}{\mu^2 b'}, && \text{(30)}
\end{aligned}
$$
where $\Delta_k \coloneqq \mathbb{E}\left[\|\bm{x}_{(k+1)\gamma}-\bm{x}_{k\gamma}\|^2\right]$. Note that we add and subtract $\nabla f_\mu(\bm{x}_{k\gamma})$ in ([25](https://arxiv.org/html/2605.30573#A2.E25)). To get ([26](https://arxiv.org/html/2605.30573#A2.E26)), we use the fact that the random vectors $\bm{u}_{k+1}^i \sim \mathcal{N}(0,I)$ are i.i.d., take conditional expectations with respect to $\mathcal{F}_k$, and then use the definition of $e_k^2$. We use the second-moment bound on the variance in ([27](https://arxiv.org/html/2605.30573#A2.E27)) and the definition of the zeroth-order estimator to get ([28](https://arxiv.org/html/2605.30573#A2.E28)). Following these steps, we apply Assumption [1](https://arxiv.org/html/2605.30573#Thmassumption1) and use the independence of $\bm{u}_{k+1}^i$ from $\bm{x}_{k\gamma}$ and $\bm{x}_{(k+1)\gamma}$ to obtain ([29](https://arxiv.org/html/2605.30573#A2.E29)). This concludes our proof. $\square$
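The two branches analyzed in the proof above can be sketched in code. The following NumPy snippet is a minimal illustration whose structure is inferred from the decomposition of $e_{k+1}^2$ in the proof: with probability $p$ the estimator is refreshed with a large batch of size $b$, and otherwise the previous estimate is corrected with a small batch of size $b'$ that reuses the same Gaussian directions at both iterates. The function names, the quadratic test potential, and the parameter values are assumptions made for this example.

```python
import numpy as np

def zo_gradient(f, x, mu, u):
    """Single-sample estimate (f(x + mu*u) - f(x)) / mu * u."""
    return (f(x + mu * u) - f(x)) / mu * u

def vr_zo_update(f, x_new, x_old, g_old, mu, p, b, b_small, rng):
    """One update of a variance-reduced ZO estimate (assumed structure).

    With probability p: fresh average over b directions at x_new.
    Otherwise: g_old plus the average over b_small shared directions of the
    difference between the estimates at x_new and at x_old.
    """
    d = x_new.shape[0]
    if rng.random() < p:
        us = rng.standard_normal((b, d))
        return np.mean([zo_gradient(f, x_new, mu, u) for u in us], axis=0)
    us = rng.standard_normal((b_small, d))
    diffs = [zo_gradient(f, x_new, mu, u) - zo_gradient(f, x_old, mu, u) for u in us]
    return g_old + np.mean(diffs, axis=0)

# Hypothetical quadratic potential, whose exact gradient is x.
def f(x):
    return 0.5 * np.dot(x, x)

rng = np.random.default_rng(1)
x_old = np.ones(4)
x_new = x_old + 0.01 * rng.standard_normal(4)
g_old = x_old.copy()  # stand-in for the previous estimate (exact gradient here)

g_new = vr_zo_update(f, x_new, x_old, g_old, mu=1e-2, p=0.1, b=64, b_small=2, rng=rng)
print("error of updated estimate:", np.linalg.norm(g_new - x_new))
```

Because the correction branch evaluates both points along the same directions, its size scales with the distance between consecutive iterates, which is the role played by $\Delta_k$ in the bound.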
### B\.3Proof of Theorem[1](https://arxiv.org/html/2605.30573#Thmtheorem1)
#### Theorem[1](https://arxiv.org/html/2605.30573#Thmtheorem1)
Let\(νt\)t≥0\(\\nu\_\{t\}\)\_\{t\\geq 0\}denote the law of interpolation \([10](https://arxiv.org/html/2605.30573#S3.E10)\), and letπ∝exp\(−f\)\\pi\\propto\\mathrm\{exp\}\(\-f\)be the target distribution, where the potential functionffsatisfies Assumptions[1](https://arxiv.org/html/2605.30573#Thmassumption1)and[2](https://arxiv.org/html/2605.30573#Thmassumption2)with Lipschitz constantsL1L\_\{1\}andL2L\_\{2\}, respectively\. Then, for any step sizeγ∈\(0,1Lm52ϕ\(μ\)\]\\gamma\\in\\left\(0,\\frac\{1\}\{L\_\{m\}\\sqrt\{52\\phi\(\\mu\)\}\}\\right\], whereϕ\(μ\)≔1\+4\(1−p\)dpμ2b′\\phi\(\\mu\)\\coloneqq 1\+\\frac\{4\(1\-p\)d\}\{p\\mu^\{2\}b^\{\\prime\}\}andLm≔max\{L1,L2\}L\_\{m\}\\coloneqq\\max\\\{L\_\{1\},L\_\{2\}\\\}, and for anyN≥1N\\geq 1, the Fisher information satisfies
1Nγ∫0NγFI\(νt∥π\)𝑑t≤C0Nγ\+13d\(d\+2\)2b\+138μ2L22\(d\+3\)3\+14γLm2ϕ\(μ\)d,\\frac\{1\}\{N\\gamma\}\\int\_\{0\}^\{N\\gamma\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)\\,dt\\leq\\frac\{C\_\{0\}\}\{N\\gamma\}\+\\frac\{13d\(d\+2\)\}\{2b\}\+\\frac\{13\}\{8\}\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\}\+14\\gamma L\_\{m\}^\{2\}\\phi\(\\mu\)d,\(31\)whereC0\>0C\_\{0\}\>0is a numerical constant\. Furthermore, let
$$\gamma = \frac{1}{L_m N^{3/4} d^{7/4}},\quad p = \frac{L_m}{N^{1/4} d^{1/4}},\quad b = \left\lceil \frac{1}{p} \right\rceil,\quad\text{and}\quad \mu^2 = \frac{1}{L_m N^{1/4} d^{5/4}}. \qquad \text{(32)}$$
Then, to achieve $\mathrm{FI}(\bar{\nu}_{N\gamma}\|\pi) \le \varepsilon$ for the time-averaged law $\bar{\nu}_{N\gamma} \coloneqq \frac{1}{N\gamma}\int_0^{N\gamma} \nu_t\,dt$, $\mathcal{O}\left(\frac{d^{7} L_m^{4}}{\varepsilon^{4}}\right)$ iterations with $\mathcal{O}(1)$ function evaluations per iteration are sufficient.
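To make the parameter choices in the theorem concrete, the following Python sketch computes $\gamma$, $p$, $b$, and $\mu^2$ from $N$, $d$, and $L_m$, and evaluates the right-hand side of the Fisher information bound for a user-supplied value of $C_0$. The function names and the example inputs are assumptions for illustration; in particular, $C_0$ depends on the initialization and is treated here as a free input.

```python
import math

def theorem1_parameters(N, d, L_m):
    """Return (gamma, p, b, mu_sq) for the parameter choices in the theorem."""
    gamma = 1.0 / (L_m * N ** 0.75 * d ** 1.75)
    p = L_m / (N ** 0.25 * d ** 0.25)
    if not 0.0 < p <= 1.0:
        # p is a probability, which forces N >= L_m**4 / d.
        raise ValueError("need N >= L_m**4 / d so that p lies in (0, 1]")
    b = math.ceil(1.0 / p)
    mu_sq = 1.0 / (L_m * N ** 0.25 * d ** 1.25)
    return gamma, p, b, mu_sq

def phi(mu_sq, p, d, b_small):
    """phi(mu) = 1 + 4 (1 - p) d / (p mu^2 b')."""
    return 1.0 + 4.0 * (1.0 - p) * d / (p * mu_sq * b_small)

def step_size_ok(gamma, L_m, phi_val):
    """Check the step-size condition gamma <= 1 / (L_m sqrt(52 phi(mu)))."""
    return gamma <= 1.0 / (L_m * math.sqrt(52.0 * phi_val))

def fisher_bound(N, d, L_m, L_2, b_small, C0):
    """Right-hand side of the averaged Fisher information bound."""
    gamma, p, b, mu_sq = theorem1_parameters(N, d, L_m)
    phi_val = phi(mu_sq, p, d, b_small)
    return (C0 / (N * gamma)
            + 13.0 * d * (d + 2) / (2.0 * b)
            + 13.0 / 8.0 * mu_sq * L_2 ** 2 * (d + 3) ** 3
            + 14.0 * gamma * L_m ** 2 * phi_val * d)

# Example inputs (hypothetical values).
N, d, L_m = 65536, 16, 1.0
gamma, p, b, mu_sq = theorem1_parameters(N, d, L_m)
print("gamma =", gamma, "p =", p, "b =", b, "mu^2 =", mu_sq)
print("step size admissible:", step_size_ok(gamma, L_m, phi(mu_sq, p, d, b_small=1)))
print("bound:", fisher_bound(N, d, L_m, L_2=1.0, b_small=1, C0=1.0))
```

Every term of the bound shrinks as $N$ grows under these choices, which is consistent with the $N^{-1/4}$ rate behind the stated iteration complexity.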
Proof\.At initialization, we chooseγ0,b0,μ0\>0\\gamma\_\{0\},b\_\{0\},\\mu\_\{0\}\>0andp0∈\(0,1\]p\_\{0\}\\in\(0,1\], and define𝒈0\{\\bm\{g\}\}\_\{0\}as
𝒈0≔1b0∑i=1b0∇~fμ0\(𝒙0,𝒖0i\),𝒖0i∼𝒩\(0,I\)\.\{\\bm\{g\}\}\_\{0\}\\coloneqq\\frac\{1\}\{b\_\{0\}\}\\sum\_\{i=1\}^\{b\_\{0\}\}\\widetilde\{\\nabla\}f\_\{\\mu\_\{0\}\}\(\{\\bm\{x\}\}\_\{0\},\{\\bm\{u\}\}\_\{0\}^\{i\}\),\\quad\{\\bm\{u\}\}\_\{0\}^\{i\}\\sim\{\\mathcal\{N\}\}\(0,I\)\.We recall that the interpolation argument \([10](https://arxiv.org/html/2605.30573#S3.E10)\) for the proposed algorithm is given as
𝒙t≔𝒙kγ−\(t−kγ\)𝒈k\+2\(𝑩t−𝑩kγ\),\{\\bm\{x\}\}\_\{t\}\\coloneqq\{\\bm\{x\}\}\_\{k\\gamma\}\-\(t\-k\\gamma\)\{\\bm\{g\}\}\_\{k\}\+\\sqrt\{2\}\(\{\\bm\{B\}\}\_\{t\}\-\{\\bm\{B\}\}\_\{k\\gamma\}\),\(33\)where\(𝑩t\)t≥0\(\{\\bm\{B\}\}\_\{t\}\)\_\{t\\geq 0\}is Brownian motion and𝒈k\{\\bm\{g\}\}\_\{k\}is the proposed variance\-reduced ZO estimator\. Using this interpolation with Lemma[2](https://arxiv.org/html/2605.30573#Thmlemma2), we obtain
$$
\begin{aligned}
\frac{d}{dt}\mathrm{KL}(\nu_t\|\pi)
&\le -\frac{3}{4}\mathrm{FI}(\nu_t\|\pi) + \mathbb{E}\bigl[\|\nabla f(\bm{x}_t) - \bm{g}_k\|^2\bigr]\\
&= -\frac{3}{4}\mathrm{FI}(\nu_t\|\pi) + \mathbb{E}\bigl[\|\nabla f(\bm{x}_t) - \nabla f(\bm{x}_{k\gamma}) + \nabla f(\bm{x}_{k\gamma}) - \nabla f_\mu(\bm{x}_{k\gamma}) + \nabla f_\mu(\bm{x}_{k\gamma}) - \bm{g}_k\|^2\bigr]\\
&\le -\frac{3}{4}\mathrm{FI}(\nu_t\|\pi) + 3L_2^2\,\mathbb{E}\bigl[\|\bm{x}_t - \bm{x}_{k\gamma}\|^2\bigr] + \frac{3}{4}\mu^2 L_2^2 (d+3)^3 + 3\,\mathbb{E}\bigl[\|\nabla f_\mu(\bm{x}_{k\gamma}) - \bm{g}_k\|^2\bigr]. && \text{(34)}
\end{aligned}
$$
We use Jensen's inequality, Lemma [1](https://arxiv.org/html/2605.30573#Thmlemma1), and Assumption [2](https://arxiv.org/html/2605.30573#Thmassumption2) to obtain the last inequality. Let $e_k^2 \coloneqq \mathbb{E}[\|\nabla f_\mu(\bm{x}_{k\gamma}) - \bm{g}_k\|^2]$ be the mean squared estimation error. Then, using Proposition [1](https://arxiv.org/html/2605.30573#Thmproposition1), we have
$$e_{k+1}^2 \le \frac{p\sigma_{k+1}^2}{b} + (1-p)e_k^2 + \frac{4(1-p)dL_1^2\Delta_k}{\mu^2 b'}, \qquad \text{(35)}$$
where $\Delta_k \coloneqq \mathbb{E}[\|\bm{x}_{(k+1)\gamma} - \bm{x}_{k\gamma}\|^2]$ and $\sigma_{k+1}^2 \coloneqq \mathbb{E}[\|\widetilde{\nabla} f_\mu(\bm{x}_{(k+1)\gamma},\bm{u}_{k+1}^i)\|^2]$. We can further upper bound $\sigma_{k+1}^2$ by using the definition of $\widetilde{\nabla} f_\mu(\bm{x}_{(k+1)\gamma},\bm{u}_{k+1}^i)$ together with Assumption [1](https://arxiv.org/html/2605.30573#Thmassumption1) as follows:
$$
\begin{aligned}
\sigma_{k+1}^2 &= \mathbb{E}\bigl[\|\widetilde{\nabla} f_\mu(\bm{x}_{(k+1)\gamma},\bm{u}_{k+1}^i)\|^2\bigr] \le L_1^2\,\mathbb{E}\bigl[\|\bm{u}_{k+1}^i\|^4\bigr] && \text{(36)}\\
&= L_1^2\, d(d+2).
\end{aligned}
$$
Plugging this into ([35](https://arxiv.org/html/2605.30573#A2.E35)), we have
$$e_{k+1}^2 \le \frac{p\,d(d+2)L_1^2}{b} + (1-p)e_k^2 + \frac{4(1-p)dL_1^2\Delta_k}{\mu^2 b'} \qquad \text{(37)}$$
for $k \ge 1$. Dividing both sides by $p$ and rearranging the terms, we get a recursive upper bound on the error term
ek2≤d\(d\+2\)L12b\+\(1−pp\)4dL12μ2b′Δk−1p\(ek\+12−ek2\)\.e\_\{k\}^\{2\}\\leq\\frac\{d\(d\+2\)L\_\{1\}^\{2\}\}\{b\}\+\\left\(\\frac\{1\-p\}\{p\}\\right\)\\frac\{4dL\_\{1\}^\{2\}\}\{\\mu^\{2\}b^\{\\prime\}\}\\Delta\_\{k\}\-\\frac\{1\}\{p\}\(e\_\{k\+1\}^\{2\}\-e\_\{k\}^\{2\}\)\.\(38\)Using the interpolation in \([10](https://arxiv.org/html/2605.30573#S3.E10)\), we can upper bound𝔼\[‖𝒙t−𝒙kγ‖2\]\\mathbb\{E\}\[\\\|\{\\bm\{x\}\}\_\{t\}\-\{\\bm\{x\}\}\_\{k\\gamma\}\\\|^\{2\}\]in \([34](https://arxiv.org/html/2605.30573#A2.E34)\) as
$$
\begin{aligned}
\mathbb{E}\bigl[\|\bm{x}_t - \bm{x}_{k\gamma}\|^2\bigr] &= (t-k\gamma)^2\,\mathbb{E}\bigl[\|\bm{g}_k\|^2\bigr] + 2(t-k\gamma)d\\
&\le \gamma^2\,\mathbb{E}\bigl[\|\bm{g}_k\|^2\bigr] + 2\gamma d = \Delta_k, && \text{(39)}
\end{aligned}
$$
for $t \in [k\gamma,(k+1)\gamma]$, where the cross term vanishes because the Brownian increment $\bm{B}_t - \bm{B}_{k\gamma}$ has zero mean and is independent of $\bm{g}_k$. Using this property and plugging the upper bound in ([38](https://arxiv.org/html/2605.30573#A2.E38)) into ([34](https://arxiv.org/html/2605.30573#A2.E34)), we get
ddtKL\(νt∥π\)≤−34FI\(νt∥π\)\+3Lm2ϕ\(μ\)Δk\+3d\(d\+2\)L12b−3p\(ek\+12−ek2\)\+34μ2L22\(d\+3\)3,\\frac\{d\}\{dt\}\\mathrm\{KL\}\(\\nu\_\{t\}\\\|\\pi\)\\leq\-\\frac\{3\}\{4\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)\+3L\_\{m\}^\{2\}\\phi\(\\mu\)\\Delta\_\{k\}\+\\frac\{3d\(d\+2\)L\_\{1\}^\{2\}\}\{b\}\-\\frac\{3\}\{p\}\(e\_\{k\+1\}^\{2\}\-e\_\{k\}^\{2\}\)\+\\frac\{3\}\{4\}\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\},\(40\)whereϕ\(μ\)≔1\+\(1−pp\)4dμ2b′\\phi\(\\mu\)\\coloneqq 1\+\\left\(\\frac\{1\-p\}\{p\}\\right\)\\frac\{4d\}\{\\mu^\{2\}b^\{\\prime\}\}\. We can upper bound the term involvingΔk\\Delta\_\{k\}as
$$
\begin{aligned}
\Delta_k &\coloneqq \mathbb{E}\bigl[\|\bm{x}_{(k+1)\gamma} - \bm{x}_{k\gamma}\|^2\bigr] = \gamma^2\,\mathbb{E}\bigl[\|\bm{g}_k\|^2\bigr] + 2\gamma d && \text{(41)}\\
&= \gamma^2\,\mathbb{E}\bigl[\|\bm{g}_k - \nabla f_\mu(\bm{x}_{k\gamma}) + \nabla f_\mu(\bm{x}_{k\gamma}) - \nabla f(\bm{x}_{k\gamma}) + \nabla f(\bm{x}_{k\gamma}) - \nabla f(\bm{x}_t) + \nabla f(\bm{x}_t)\|^2\bigr] + 2\gamma d\\
&\le 4\gamma^2 L_2^2 \Delta_k + \gamma^2\mu^2 L_2^2 (d+3)^3 + 4\gamma^2 e_k^2 + 4\gamma^2\,\mathbb{E}\bigl[\|\nabla f(\bm{x}_t)\|^2\bigr] + 2\gamma d && \text{(42)}\\
&\le 4\gamma^2 L_m^2 \phi(\mu)\Delta_k + \frac{4\gamma^2 d(d+2)L_1^2}{b} - \frac{4\gamma^2}{p}\bigl(e_{k+1}^2 - e_k^2\bigr) + \gamma^2\mu^2 L_2^2 (d+3)^3 + 4\gamma^2\,\mathbb{E}\bigl[\|\nabla f(\bm{x}_t)\|^2\bigr] + 2\gamma d. && \text{(43)}
\end{aligned}
$$
We use interpolation ([10](https://arxiv.org/html/2605.30573#S3.E10)) to obtain ([41](https://arxiv.org/html/2605.30573#A2.E41)). Following that, we use Assumption [2](https://arxiv.org/html/2605.30573#Thmassumption2) together with ([39](https://arxiv.org/html/2605.30573#A2.E39)), Jensen's inequality, and Lemma [1](https://arxiv.org/html/2605.30573#Thmlemma1) to obtain the inequality ([42](https://arxiv.org/html/2605.30573#A2.E42)).
Finally, we use the upper bound in \([38](https://arxiv.org/html/2605.30573#A2.E38)\) with the fact thatL2≤LmL\_\{2\}\\leq L\_\{m\}to obtain \([43](https://arxiv.org/html/2605.30573#A2.E43)\)\. Rearranging the terms, we get
$$\bigl(1 - 4\gamma^2 L_m^2 \phi(\mu)\bigr)\Delta_k \le \frac{4\gamma^2 d(d+2)L_1^2}{b} - \frac{4\gamma^2}{p}\bigl(e_{k+1}^2 - e_k^2\bigr) + \gamma^2\mu^2 L_2^2 (d+3)^3 + 4\gamma^2\,\mathbb{E}\bigl[\|\nabla f(\bm{x}_t)\|^2\bigr] + 2\gamma d.$$
Taking $\gamma \le \frac{1}{L_m\sqrt{52\phi(\mu)}}$ gives $4\gamma^2 L_m^2\phi(\mu) \le \frac{1}{13}$, so that $1 - 4\gamma^2 L_m^2\phi(\mu) \ge \frac{12}{13}$. Multiplying both sides by $\frac{13}{12}$, we get
$$\Delta_k \le \frac{13}{3}\frac{\gamma^2 d(d+2)L_1^2}{b} - \frac{13}{3}\frac{\gamma^2}{p}\bigl(e_{k+1}^2 - e_k^2\bigr) + \frac{13}{12}\gamma^2\mu^2 L_2^2 (d+3)^3 + \frac{13\gamma^2}{3}\,\mathbb{E}\bigl[\|\nabla f(\bm{x}_t)\|^2\bigr] + \frac{13}{6}\gamma d.$$
Plugging this upper bound into ([40](https://arxiv.org/html/2605.30573#A2.E40)), we obtain
ddtKL\(νt∥π\)≤\\displaystyle\\frac\{d\}\{dt\}\\mathrm\{KL\}\(\\nu\_\{t\}\\\|\\pi\)\\leq−34FI\(νt∥π\)\+\[3\+13γ2Lm2ϕ\(μ\)\]d\(d\+2\)b−1p\[3\+13γ2Lm2ϕ\(μ\)\]\(ek\+12−ek2\)\\displaystyle\-\\frac\{3\}\{4\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)\+\\left\[3\+13\\gamma^\{2\}L\_\{m\}^\{2\}\\phi\(\\mu\)\\right\]\\frac\{d\(d\+2\)\}\{b\}\-\\frac\{1\}\{p\}\\left\[3\+13\\gamma^\{2\}L\_\{m\}^\{2\}\\phi\(\\mu\)\\right\]\(e\_\{k\+1\}^\{2\}\-e\_\{k\}^\{2\}\)\+\[34\+134γ2Lm2ϕ\(μ\)\]μ2L22\(d\+3\)3\+13γ2Lm2ϕ\(μ\)𝔼\[‖∇f\(𝒙t\)‖2\]\+132γLm2ϕ\(μ\)d\\displaystyle\+\\left\[\\frac\{3\}\{4\}\+\\frac\{13\}\{4\}\\gamma^\{2\}L\_\{m\}^\{2\}\\phi\(\\mu\)\\right\]\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\}\+13\\gamma^\{2\}L\_\{m\}^\{2\}\\phi\(\\mu\)\\mathbb\{E\}\[\\\|\\nabla f\(\{\\bm\{x\}\}\_\{t\}\)\\\|^\{2\}\]\+\\frac\{13\}\{2\}\\gamma L\_\{m\}^\{2\}\\phi\(\\mu\)d≤\\displaystyle\\leq−12FI\(νt∥π\)\+13d\(d\+2\)4b−1p\[3\+13γ2Lm2ϕ\(μ\)\]\(ek\+12−ek2\)\+1316μ2L22\(d\+3\)3\\displaystyle\-\\frac\{1\}\{2\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)\+\\frac\{13d\(d\+2\)\}\{4b\}\-\\frac\{1\}\{p\}\\left\[3\+13\\gamma^\{2\}L\_\{m\}^\{2\}\\phi\(\\mu\)\\right\]\(e\_\{k\+1\}^\{2\}\-e\_\{k\}^\{2\}\)\+\\frac\{13\}\{16\}\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\}\+\[132\+26γLm\]γLm2ϕ\(μ\)d\.\\displaystyle\+\\left\[\\frac\{13\}\{2\}\+26\\gamma L\_\{m\}\\right\]\\gamma L\_\{m\}^\{2\}\\phi\(\\mu\)d\.\(44\)We use the boundγ≤1Lm52ϕ\(μ\)\\gamma\\leq\\frac\{1\}\{L\_\{m\}\\sqrt\{52\\phi\(\\mu\)\}\}and apply Lemma[3](https://arxiv.org/html/2605.30573#Thmlemma3)to obtain \([44](https://arxiv.org/html/2605.30573#A2.E44)\)\. We define
$$\mathcal{L}_k \coloneqq \mathrm{KL}(\nu_{k\gamma}\|\pi) + \frac{\gamma}{p}\bigl(3 + 13\gamma^2 L_m^2 \phi(\mu)\bigr)e_k^2.$$
Integrating both sides of ([44](https://arxiv.org/html/2605.30573#A2.E44)) over $t \in [k\gamma,(k+1)\gamma]$ and using the definition of $\mathcal{L}_k$, we get
$$\mathcal{L}_{k+1} - \mathcal{L}_k \le -\frac{1}{2}\int_{k\gamma}^{(k+1)\gamma} \mathrm{FI}(\nu_t\|\pi)\,dt + \frac{13}{4}\frac{\gamma d(d+2)}{b} + \frac{13}{16}\gamma\mu^2 L_2^2 (d+3)^3 + 7\gamma^2 L_m^2 \phi(\mu) d. \qquad \text{(45)}$$
Iterating this inequality for $k \in \{0,1,\ldots,N-1\}$, and rearranging the terms, we obtain
1Nγ∫0NγFI\(νt∥π\)𝑑t≤2ℒ0Nγ\+132d\(d\+2\)b\+138μ2L22\(d\+3\)3\+14γLm2ϕ\(μ\)d,\\frac\{1\}\{N\\gamma\}\\int\_\{0\}^\{N\\gamma\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)\\,dt\\leq\\frac\{2\{\\mathcal\{L\}\}\_\{0\}\}\{N\\gamma\}\+\\frac\{13\}\{2\}\\frac\{d\(d\+2\)\}\{b\}\+\\frac\{13\}\{8\}\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\}\+14\\gamma L\_\{m\}^\{2\}\\phi\(\\mu\)d,\(46\)where
2ℒ0=2KL\(ν0∥π\)\+2γp\(3\+13γ2Lm2ϕ\(μ\)\)e02≤2KL\(ν0∥π\)\+26γ4pe02⏟≔C0<∞,2\{\\mathcal\{L\}\}\_\{0\}=2\\mathrm\{KL\}\(\\nu\_\{0\}\\\|\\pi\)\+\\frac\{2\\gamma\}\{p\}\\left\(3\+13\\gamma^\{2\}L\_\{m\}^\{2\}\\phi\(\\mu\)\\right\)e\_\{0\}^\{2\}\\leq\\underbrace\{2\\mathrm\{KL\}\(\\nu\_\{0\}\\\|\\pi\)\+\\frac\{26\\gamma\}\{4p\}e\_\{0\}^\{2\}\}\_\{\\coloneqq C\_\{0\}\}<\\infty,where we useγ≤1Lm52ϕ\(μ\)\\gamma\\leq\\frac\{1\}\{L\_\{m\}\\sqrt\{52\\phi\(\\mu\)\}\}andC0\>0C\_\{0\}\>0is a numerical constant\. This proves the upper bound on Fisher information in the theorem statement\. Furthermore, sincep∈\(0,1\]p\\in\(0,1\], we obtain
ϕ\(μ\)=1\+\(1−p\)4dpμ2b′≤1\+4dpμ2b′=1\+4d5/2N1/2b′\.\\phi\(\\mu\)=1\+\\frac\{\(1\-p\)4d\}\{p\\mu^\{2\}b^\{\\prime\}\}\\leq 1\+\\frac\{4d\}\{p\\mu^\{2\}b^\{\\prime\}\}=1\+\\frac\{4d^\{5/2\}N^\{1/2\}\}\{b^\{\\prime\}\}\.We can chooseN≥b′4d5/2N\\geq\\frac\{b^\{\\prime\}\}\{4d^\{5/2\}\}and obtain
ϕ\(μ\)≤1\+4d5/2N1/2b′≤8d5/2N1/2b′\\phi\(\\mu\)\\leq 1\+\\frac\{4d^\{5/2\}N^\{1/2\}\}\{b^\{\\prime\}\}\\leq\\frac\{8d^\{5/2\}N^\{1/2\}\}\{b^\{\\prime\}\}\(47\)Plugging this upper bound into \([46](https://arxiv.org/html/2605.30573#A2.E46)\), we obtain
1Nγ∫0NγFI\(νt∥π\)𝑑t≤C0Nγ\+13d\(d\+2\)2b\+138μ2L22\(d\+3\)3\+112γLm2d7/2N1/2b′\.\\frac\{1\}\{N\\gamma\}\\int\_\{0\}^\{N\\gamma\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)\\,dt\\leq\\frac\{C\_\{0\}\}\{N\\gamma\}\+\\frac\{13d\(d\+2\)\}\{2b\}\+\\frac\{13\}\{8\}\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\}\+\\frac\{112\\gamma L\_\{m\}^\{2\}d^\{7/2\}N^\{1/2\}\}\{b^\{\\prime\}\}\.Using the convexity of Fisher information, we obtain
FI\(ν¯Nγ∥π\)≤C0Nγ\+13d\(d\+2\)2b\+138μ2L22\(d\+3\)3\+112γLm2d2pμ2b′\.\\mathrm\{FI\}\(\\bar\{\\nu\}\_\{N\\gamma\}\\\|\\pi\)\\leq\\frac\{C\_\{0\}\}\{N\\gamma\}\+\\frac\{13d\(d\+2\)\}\{2b\}\+\\frac\{13\}\{8\}\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\}\+\\frac\{112\\gamma L\_\{m\}^\{2\}d^\{2\}\}\{p\\mu^\{2\}b^\{\\prime\}\}\.\(48\)Let
$$\gamma=\frac{1}{L_{m}N^{3/4}d^{7/4}},\quad p=\frac{L_{m}}{N^{1/4}d^{1/4}},\quad b=\left\lceil\frac{1}{p}\right\rceil,\quad\text{and}\quad\mu^{2}=\frac{1}{L_{m}N^{1/4}d^{5/4}}.$$
Plugging these values into ([48](https://arxiv.org/html/2605.30573#A2.E48)), we obtain
FI\(ν¯Nγ∥π\)≤C0Lmd7/4N1/4\+132Lm\(d\+2\)3/2N1/4\+138Lm\(d\+3\)3N1/4d5/4\+112d7/4Lmb′N1/4\.\\mathrm\{FI\}\(\\bar\{\\nu\}\_\{N\\gamma\}\\\|\\pi\)\\leq\\frac\{C\_\{0\}L\_\{m\}d^\{7/4\}\}\{N^\{1/4\}\}\+\\frac\{13\}\{2\}\\frac\{L\_\{m\}\(d\+2\)^\{3/2\}\}\{N^\{1/4\}\}\+\\frac\{13\}\{8\}\\frac\{L\_\{m\}\(d\+3\)^\{3\}\}\{N^\{1/4\}d^\{5/4\}\}\+\\frac\{112d^\{7/4\}L\_\{m\}\}\{b^\{\\prime\}N^\{1/4\}\}\.This impliesFI\(ν¯Nγ∥π\)=𝒪\(d7/4LmN1/4\)\.\\mathrm\{FI\}\(\\bar\{\\nu\}\_\{N\\gamma\}\\\|\\pi\)=\\mathcal\{O\}\\left\(\\frac\{d^\{7/4\}L\_\{m\}\}\{N^\{1/4\}\}\\right\)\.Therefore, to obtainFI\(ν¯Nγ∥π\)≤ε\\mathrm\{FI\}\(\\bar\{\\nu\}\_\{N\\gamma\}\\\|\\pi\)\\leq\\varepsilon, the sufficient number of iterations is
$$N=\mathcal{O}\left(\frac{d^{7}L_{m}^{4}}{\varepsilon^{4}}\right).$$
Note that the per-iteration cost is $pb+(1-p)b'=\mathcal{O}(1)$; therefore, the total number of function evaluations is $\mathcal{O}\left(\frac{d^{7}L_{m}^{4}}{\varepsilon^{4}}\right)$ as well.
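As a quick check of the substitution, the exponents can be worked out directly from the parameter choices above. This is only a sketch of the arithmetic: numerical constants are suppressed in the last line, and $b'$ is treated as a fixed constant.

```latex
\begin{align*}
N\gamma &= \frac{N^{1/4}}{L_m d^{7/4}}, &
p\mu^{2} &= \frac{1}{N^{1/2} d^{3/2}}, \\
\frac{C_0}{N\gamma} &= \frac{C_0 L_m d^{7/4}}{N^{1/4}}, &
\frac{\gamma L_m^{2} d^{2}}{p\mu^{2} b'} &= \frac{L_m d^{7/4}}{b' N^{1/4}}, \\
\frac{d^{7/4} L_m}{N^{1/4}} \leq \varepsilon
&\iff N \geq \frac{d^{7} L_m^{4}}{\varepsilon^{4}}.
\end{align*}
```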
It remains to verify that the iteration complexityN=𝒪\(d7Lm4ε4\)N=\\mathcal\{O\}\\\!\\left\(\\frac\{d^\{7\}L\_\{m\}^\{4\}\}\{\\varepsilon^\{4\}\}\\right\)satisfies the conditions
N≥b′4d5/2,p∈\(0,1\],γ≤1Lm52ϕ\(μ\)\.N\\geq\\frac\{b^\{\\prime\}\}\{4d^\{5/2\}\},\\quad p\\in\(0,1\],\\quad\\gamma\\leq\\frac\{1\}\{L\_\{m\}\\sqrt\{52\\phi\(\\mu\)\}\}\.The first condition holds sinceb′=𝒪\(1\)b^\{\\prime\}=\\mathcal\{O\}\(1\)is a numerical constant\. The second condition requires
N≥Lm4d,N\\geq\\frac\{L\_\{m\}^\{4\}\}\{d\},which is satisfied by the derived iteration complexity\. Finally, using the upper bound in \([47](https://arxiv.org/html/2605.30573#A2.E47)\), we chooseNNsuch that
γ≤1Lm416d5/2N1/2b′≤1Lm52ϕ\(μ\),\\gamma\\leq\\frac\{1\}\{L\_\{m\}\\sqrt\{\\frac\{416d^\{5/2\}N^\{1/2\}\}\{b^\{\\prime\}\}\}\}\\leq\\frac\{1\}\{L\_\{m\}\\sqrt\{52\\phi\(\\mu\)\}\},which verifies the last condition\. Substituting the definition ofγ\\gammaand simplifying yields the additional requirement
N≥416b′d,N\\geq\\frac\{416\}\{b^\{\\prime\}d\},which is again satisfied by the derived iteration complexity\. This completes the proof\.
□\\square
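The parameter choices and side conditions used in the proof above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names `theorem1_parameters`, `phi`, and `step_size_ok` are hypothetical, and the value of `b_prime` used in any call is an assumption.

```python
import math


def theorem1_parameters(N, d, L_m):
    """Return (gamma, p, b, mu_sq) for the fixed-parameter choice in the proof."""
    gamma = 1.0 / (L_m * N**0.75 * d**1.75)
    p = L_m / (N**0.25 * d**0.25)
    # p must be a probability; this is the condition N >= L_m**4 / d.
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]; increase N so that N >= L_m**4 / d")
    b = math.ceil(1.0 / p)
    mu_sq = 1.0 / (L_m * N**0.25 * d**1.25)
    return gamma, p, b, mu_sq


def phi(p, mu_sq, d, b_prime):
    """phi(mu) = 1 + 4 (1 - p) d / (p mu^2 b')."""
    return 1.0 + 4.0 * (1.0 - p) * d / (p * mu_sq * b_prime)


def step_size_ok(gamma, p, mu_sq, d, L_m, b_prime):
    """Check the step-size condition gamma <= 1 / (L_m sqrt(52 phi(mu)))."""
    return gamma <= 1.0 / (L_m * math.sqrt(52.0 * phi(p, mu_sq, d, b_prime)))
```

For example, `theorem1_parameters(10**6, 10, 1.0)` returns a valid parameter set, and `step_size_ok` can then be called with an assumed `b_prime` to confirm the last condition of the proof for that instance.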
### B\.4Proof of Corollary[1](https://arxiv.org/html/2605.30573#Thmcorollary1)
#### Corollary[1](https://arxiv.org/html/2605.30573#Thmcorollary1)
Let\(νt\)t≥0\(\\nu\_\{t\}\)\_\{t\\geq 0\}denote the law of interpolation \([10](https://arxiv.org/html/2605.30573#S3.E10)\), and letπ∝exp\(−f\)\\pi\\propto\\mathrm\{exp\}\(\-f\)be the target distribution, where the potential functionffsatisfies Assumptions[1](https://arxiv.org/html/2605.30573#Thmassumption1),[2](https://arxiv.org/html/2605.30573#Thmassumption2)with Lipschitz constantsL1L\_\{1\},L2L\_\{2\}, respectively, and Assumption[3](https://arxiv.org/html/2605.30573#Thmassumption3)with constantCPIC\_\{\\mathrm\{PI\}\}\. Then, for any step sizeγ∈\(0,1Lm52ϕ\(μ\)\]\\gamma\\in\\left\(0,\\frac\{1\}\{L\_\{m\}\\sqrt\{52\\phi\(\\mu\)\}\}\\right\], whereϕ\(μ\)≔1\+4\(1−p\)dpμ2b′\\phi\(\\mu\)\\coloneqq 1\+\\frac\{4\(1\-p\)d\}\{p\\mu^\{2\}b^\{\\prime\}\}andLm≔max\{L1,L2\}L\_\{m\}\\coloneqq\\max\\\{L\_\{1\},L\_\{2\}\\\}, and for anyN≥1N\\geq 1, we have
‖ν¯Nγ−π‖TV2≤4CPIC0Nγ\+26CPId\(d\+2\)b\+132CPIμ2L22\(d\+3\)3\+56CPIγLm2ϕ\(μ\)d,\\\|\\bar\{\\nu\}\_\{N\\gamma\}\-\\pi\\\|\_\{\\mathrm\{TV\}\}^\{2\}\\leq\\frac\{4C\_\{\\mathrm\{PI\}\}C\_\{0\}\}\{N\\gamma\}\+\\frac\{26C\_\{\\mathrm\{PI\}\}d\(d\+2\)\}\{b\}\+\\frac\{13\}\{2\}C\_\{\\mathrm\{PI\}\}\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\}\+56C\_\{\\mathrm\{PI\}\}\\gamma L\_\{m\}^\{2\}\\phi\(\\mu\)d,\(49\)whereC0\>0C\_\{0\}\>0is a numerical constant\. Furthermore, let
$$\gamma=\frac{1}{L_{m}N^{3/4}d^{7/4}},\quad p=\frac{L_{m}}{N^{1/4}d^{1/4}},\quad b=\left\lceil\frac{1}{p}\right\rceil,\quad\text{and}\quad\mu^{2}=\frac{1}{L_{m}N^{1/4}d^{5/4}}.\tag{50}$$
Then, to achieve $\|\bar{\nu}_{N\gamma}-\pi\|_{\mathrm{TV}}^{2}\leq\varepsilon$ for the time-averaged law $\bar{\nu}_{N\gamma}\coloneqq\frac{1}{N\gamma}\int_{0}^{N\gamma}\nu_{t}\,dt$, $\mathcal{O}\left(\frac{d^{7}L_{m}^{4}C_{\mathrm{PI}}^{4}}{\varepsilon^{4}}\right)$ iterations with $\mathcal{O}(1)$ function evaluations per iteration are sufficient.
Proof. Recall that, by Theorem [1](https://arxiv.org/html/2605.30573#Thmtheorem1), we have
1Nγ∫0NγFI\(νt∥π\)𝑑t≤C0Nγ\+132d\(d\+2\)b\+138μ2L22\(d\+3\)3\+14γLm2ϕ\(μ\)d\.\\frac\{1\}\{N\\gamma\}\\int\_\{0\}^\{N\\gamma\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)\\,dt\\leq\\frac\{C\_\{0\}\}\{N\\gamma\}\+\\frac\{13\}\{2\}\\frac\{d\(d\+2\)\}\{b\}\+\\frac\{13\}\{8\}\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\}\+14\\gamma L\_\{m\}^\{2\}\\phi\(\\mu\)d\.\(51\)SinceFI\(⋅∥π\)\\mathrm\{FI\}\(\\cdot\\\|\\pi\)is convex with respect to its first argument, Jensen’s inequality yields
FI\(ν¯Nγ∥π\)≤C0Nγ\+132d\(d\+2\)b\+138μ2L22\(d\+3\)3\+14γLm2ϕ\(μ\)d\.\\mathrm\{FI\}\(\\bar\{\\nu\}\_\{N\\gamma\}\\\|\\pi\)\\leq\\frac\{C\_\{0\}\}\{N\\gamma\}\+\\frac\{13\}\{2\}\\frac\{d\(d\+2\)\}\{b\}\+\\frac\{13\}\{8\}\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\}\+14\\gamma L\_\{m\}^\{2\}\\phi\(\\mu\)d\.\(52\)Using Assumption[3](https://arxiv.org/html/2605.30573#Thmassumption3)and invoking Lemma[4](https://arxiv.org/html/2605.30573#Thmlemma4), we obtain
$$\|\bar{\nu}_{N\gamma}-\pi\|_{\mathrm{TV}}^{2}\leq\frac{4C_{\mathrm{PI}}C_{0}}{N\gamma}+26C_{\mathrm{PI}}\frac{d(d+2)}{b}+\frac{13}{2}C_{\mathrm{PI}}\mu^{2}L_{2}^{2}(d+3)^{3}+56C_{\mathrm{PI}}\gamma L_{m}^{2}\phi(\mu)d.\tag{53}$$
This completes the proof of the upper bound in the corollary. The convergence rate then follows by an argument similar to that of Theorem [1](https://arxiv.org/html/2605.30573#Thmtheorem1). Choosing $N\geq\frac{b'}{4d^{5/2}}$ and using the fact that $p\in(0,1]$, we have
$$\phi(\mu)=1+\frac{4(1-p)d}{p\mu^{2}b'}\leq\frac{8d}{p\mu^{2}b'},\tag{54}$$
and therefore
‖ν¯Nγ−π‖TV2≤4CPIC0Nγ\+26CPId\(d\+2\)b\+132CPIμ2L22\(d\+3\)3\+448CPIγLm2d2pμ2b′\.\\\|\\bar\{\\nu\}\_\{N\\gamma\}\-\\pi\\\|\_\{\\mathrm\{TV\}\}^\{2\}\\leq\\frac\{4C\_\{\\mathrm\{PI\}\}C\_\{0\}\}\{N\\gamma\}\+26C\_\{\\mathrm\{PI\}\}\\frac\{d\(d\+2\)\}\{b\}\+\\frac\{13\}\{2\}C\_\{\\mathrm\{PI\}\}\\mu^\{2\}L\_\{2\}^\{2\}\(d\+3\)^\{3\}\+448C\_\{\\mathrm\{PI\}\}\\frac\{\\gamma L\_\{m\}^\{2\}d^\{2\}\}\{p\\mu^\{2\}b^\{\\prime\}\}\.\(55\)Similar to the previous part, letting
γ=1LmN3/4d7/4,p=LmN1/4d1/4b=⌈1p⌉,andμ2=1LmN1/4d5/4,\\gamma=\\frac\{1\}\{L\_\{m\}N^\{3/4\}d^\{7/4\}\},\\quad p=\\frac\{L\_\{m\}\}\{N^\{1/4\}d^\{1/4\}\}\\quad b=\\left\\lceil\\frac\{1\}\{p\}\\right\\rceil,\\quad\\text\{and\}\\quad\\mu^\{2\}=\\frac\{1\}\{L\_\{m\}N^\{1/4\}d^\{5/4\}\},\(56\)we obtain that‖ν¯Nγ−π‖TV2≤𝒪\(CPId7/4LmN1/4\)\.\\\|\\bar\{\\nu\}\_\{N\\gamma\}\-\\pi\\\|\_\{\\mathrm\{TV\}\}^\{2\}\\leq\\mathcal\{O\}\\left\(\\frac\{C\_\{\\mathrm\{PI\}\}d^\{7/4\}L\_\{m\}\}\{N^\{1/4\}\}\\right\)\.Therefore, to obtain‖ν¯Nγ−π‖TV2≤ε\\\|\\bar\{\\nu\}\_\{N\\gamma\}\-\\pi\\\|\_\{\\mathrm\{TV\}\}^\{2\}\\leq\\varepsilon, the sufficient number of iterations is
N=𝒪\(CPI4d7Lm4ε4\)\.N=\\mathcal\{O\}\\left\(\\frac\{C\_\{\\mathrm\{PI\}\}^\{4\}d^\{7\}L\_\{m\}^\{4\}\}\{\\varepsilon^\{4\}\}\\right\)\.\(57\)Note that the per\-iteration cost ispb\+\(1−p\)b′=𝒪\(1\)pb\+\(1\-p\)b^\{\\prime\}=\\mathcal\{O\}\(1\); therefore, the number of function evaluations is𝒪\(CPI4d7Lm4ε4\)\\mathcal\{O\}\\left\(\\frac\{C\_\{\\mathrm\{PI\}\}^\{4\}d^\{7\}L\_\{m\}^\{4\}\}\{\\varepsilon^\{4\}\}\\right\)as well\. The remaining conditions onNNcan be verified by following the same arguments as in the proof of Theorem[1](https://arxiv.org/html/2605.30573#Thmtheorem1)\.□\\square
### B\.5Proof of Theorem[2](https://arxiv.org/html/2605.30573#Thmtheorem2)
#### Theorem[2](https://arxiv.org/html/2605.30573#Thmtheorem2)
Let\(νt\)t≥0\(\\nu\_\{t\}\)\_\{t\\geq 0\}denote the law of interpolation \([10](https://arxiv.org/html/2605.30573#S3.E10)\), and letπ∝exp\(−f\)\\pi\\propto\\mathrm\{exp\}\(\-f\)be the target distribution, where the potential functionffsatisfies Assumptions[1](https://arxiv.org/html/2605.30573#Thmassumption1)and[2](https://arxiv.org/html/2605.30573#Thmassumption2)with Lipschitz constantsL1L\_\{1\}andL2L\_\{2\}, respectively\. Define the time\-varying parameters as follows
γk=Cγk3/2,pk=12k1/2,bk=⌈1pk⌉,andμk=Cμk1/8,∀k≥1,\\gamma\_\{k\}=\\frac\{C\_\{\\gamma\}\}\{k^\{3/2\}\},\\quad p\_\{k\}=\\frac\{1\}\{2k^\{1/2\}\},\\quad b\_\{k\}=\\left\\lceil\\frac\{1\}\{p\_\{k\}\}\\right\\rceil,\\quad\\text\{and\}\\quad\\mu\_\{k\}=\\frac\{C\_\{\\mu\}\}\{k^\{1/8\}\},\\quad\\forall k\\geq 1,\(58\)whereLm≔max\{L1,L2\}L\_\{m\}\\coloneqq\\max\\\{L\_\{1\},L\_\{2\}\\\}, andCγ,Cμ\>0C\_\{\\gamma\},C\_\{\\mu\}\>0are numerical constants\. Then, the time averaged lawν¯τn≔1τn∫0τnνt𝑑t\\bar\{\\nu\}\_\{\\tau\_\{n\}\}\\coloneqq\\frac\{1\}\{\\tau\_\{n\}\}\\int\_\{0\}^\{\\tau\_\{n\}\}\\nu\_\{t\}\\,dt, whereτn≔∑k=1nγk\\tau\_\{n\}\\coloneqq\\sum\_\{k=1\}^\{n\}\\gamma\_\{k\}, converges weakly toπ\\pi\.
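The time-varying schedule in (58) can be written out as a small Python helper. This is a sketch under stated assumptions: the function names `schedule` and `phi_k` are hypothetical, and the constants are fixed to one particular choice, $C_{\mu}=\sqrt{8d/b'}$ and $C_{\gamma}=\frac{C_{\mu}}{L_{m}}\sqrt{\frac{b'}{832d}}$, which is the value of $C_{\gamma}$ selected later in the proof.

```python
import math


def schedule(k, d, L_m, b_prime):
    """Return (gamma_k, p_k, b_k, mu_k) from (58) for iteration k >= 1."""
    # Assumed constants: one admissible choice, not the only one.
    C_mu = math.sqrt(8.0 * d / b_prime)
    C_gamma = (C_mu / L_m) * math.sqrt(b_prime / (832.0 * d))
    gamma_k = C_gamma / k**1.5
    p_k = 1.0 / (2.0 * math.sqrt(k))
    b_k = math.ceil(1.0 / p_k)
    mu_k = C_mu / k**0.125
    return gamma_k, p_k, b_k, mu_k


def phi_k(p_k, mu_k, d, b_prime):
    """phi_k(mu_k) = 1 + 4 (1 - p_k) d / (p_k mu_k^2 b')."""
    return 1.0 + 4.0 * (1.0 - p_k) * d / (p_k * mu_k**2 * b_prime)
```

With these helpers, the step-size condition $\gamma_{k}\leq\frac{1}{L_{m}\sqrt{52\phi_{k}(\mu_{k})}}$ can be checked numerically for any range of $k$ before running the sampler.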
Proof\.Given time\-varying parametersγk,bk,pk,μk\\gamma\_\{k\},b\_\{k\},p\_\{k\},\\mu\_\{k\}at iterationkk, define the cumulative timeτn\\tau\_\{n\}and averaged lawν¯τn\\bar\{\\nu\}\_\{\\tau\_\{n\}\}at iterationnnas
τn≔∑k=1nγk,ν¯τn≔1τn∫0τnνt𝑑t,\\tau\_\{n\}\\coloneqq\\sum\_\{k=1\}^\{n\}\\gamma\_\{k\},\\qquad\\bar\{\\nu\}\_\{\\tau\_\{n\}\}\\coloneqq\\frac\{1\}\{\\tau\_\{n\}\}\\int\_\{0\}^\{\\tau\_\{n\}\}\\nu\_\{t\}\\,dt,whereνt\\nu\_\{t\}denotes the law of the process𝒙t\{\\bm\{x\}\}\_\{t\}under the following continuous\-time interpolation
𝒙t≔𝒙τn−\(t−τn\)𝒈n\+2\(𝑩t−𝑩τn\),t∈\[τn,τn\+1\]\.\{\\bm\{x\}\}\_\{t\}\\coloneqq\{\\bm\{x\}\}\_\{\\tau\_\{n\}\}\-\(t\-\\tau\_\{n\}\)\\,\{\\bm\{g\}\}\_\{n\}\+\\sqrt\{2\}\\,\\bigl\(\{\\bm\{B\}\}\_\{t\}\-\{\\bm\{B\}\}\_\{\\tau\_\{n\}\}\\bigr\),\\qquad t\\in\[\\tau\_\{n\},\\tau\_\{n\+1\}\]\.\(59\)gng\_\{n\}is defined as follows
𝒈n:=\{1bn∑i=1bn∇~fμn\(𝒙τn,𝒖ni\)w\.p\.pn,𝒈n−1\+1b′∑i=1b′\(∇~fμn\(𝒙τn,𝒖ni\)−∇~fμn\(𝒙τn−1,𝒖ni\)\)w\.p\.1−pn,\{\\bm\{g\}\}\_\{n\}:=\\begin\{cases\}\\displaystyle\\frac\{1\}\{b\_\{n\}\}\\sum\_\{i=1\}^\{b\_\{n\}\}\\widetilde\{\\nabla\}f\_\{\\mu\_\{n\}\}\(\{\\bm\{x\}\}\_\{\\tau\_\{n\}\},\{\\bm\{u\}\}\_\{n\}^\{i\}\)&\\text\{w\.p\. \}p\_\{n\},\\\\\[4\.30554pt\] \\displaystyle\{\\bm\{g\}\}\_\{n\-1\}\+\\frac\{1\}\{b^\{\\prime\}\}\\sum\_\{i=1\}^\{b^\{\\prime\}\}\\left\(\\widetilde\{\\nabla\}f\_\{\\mu\_\{n\}\}\(\{\\bm\{x\}\}\_\{\\tau\_\{n\}\},\{\\bm\{u\}\}\_\{n\}^\{i\}\)\-\\widetilde\{\\nabla\}f\_\{\\mu\_\{n\}\}\(\{\\bm\{x\}\}\_\{\\tau\_\{n\-1\}\},\{\\bm\{u\}\}\_\{n\}^\{i\}\)\\right\)&\\text\{w\.p\. \}1\-p\_\{n\},\\end\{cases\}\(60\)for alln≥1n\\geq 1\. At initialization, we chooseγ0,b0,μ0\>0\\gamma\_\{0\},b\_\{0\},\\mu\_\{0\}\>0andp0∈\(0,1\]p\_\{0\}\\in\(0,1\], and define𝒈0\{\\bm\{g\}\}\_\{0\}as
𝒈0≔1b0∑i=1b0∇~fμ0\(𝒙0,𝒖0i\),𝒖0i∼𝒩\(0,I\)\.\{\\bm\{g\}\}\_\{0\}\\coloneqq\\frac\{1\}\{b\_\{0\}\}\\sum\_\{i=1\}^\{b\_\{0\}\}\\widetilde\{\\nabla\}f\_\{\\mu\_\{0\}\}\(\{\\bm\{x\}\}\_\{0\},\{\\bm\{u\}\}\_\{0\}^\{i\}\),\\quad\{\\bm\{u\}\}\_\{0\}^\{i\}\\sim\{\\mathcal\{N\}\}\(0,I\)\.We establish weak convergence by adapting the proof of Theorem[1](https://arxiv.org/html/2605.30573#Thmtheorem1)\. To this end, we verify that the step sizes\(γk\)k≥1\(\\gamma\_\{k\}\)\_\{k\\geq 1\}satisfy the step\-size condition of Theorem[1](https://arxiv.org/html/2605.30573#Thmtheorem1), namely,
γk∈\(0,1Lm52ϕk\(μk\)\],whereϕk\(μk\)=1\+4\(1−pk\)dpkμk2b′\\gamma\_\{k\}\\in\\left\(0,\\frac\{1\}\{L\_\{m\}\\sqrt\{52\\phi\_\{k\}\(\\mu\_\{k\}\)\}\}\\right\],\\quad\\text\{where\}\\quad\\phi\_\{k\}\(\\mu\_\{k\}\)=1\+\\frac\{4\(1\-p\_\{k\}\)d\}\{p\_\{k\}\\mu\_\{k\}^\{2\}b^\{\\prime\}\}\(61\)for allk≥1k\\geq 1\. Using the definitions ofμk\\mu\_\{k\}andpkp\_\{k\}, we have
ϕk\(μk\)\\displaystyle\\phi\_\{k\}\(\\mu\_\{k\}\)=1\+4\(1−pk\)dpkμk2b′\\displaystyle=1\+\\frac\{4\(1\-p\_\{k\}\)d\}\{p\_\{k\}\\mu\_\{k\}^\{2\}b^\{\\prime\}\}≤1\+4dpkμk2b′\\displaystyle\\leq 1\+\\frac\{4d\}\{p\_\{k\}\\mu\_\{k\}^\{2\}b^\{\\prime\}\}≤1\+8dk3/4b′Cμ2\\displaystyle\\leq 1\+\\frac\{8dk^\{3/4\}\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}\}≤16dk3/4b′Cμ2,\\displaystyle\\leq\\frac\{16dk^\{3/4\}\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}\},\(62\)where the last inequality is satisfied by selecting a constantCμC\_\{\\mu\}such that
$$C_{\mu}\leq\sqrt{\frac{8d}{b'}},$$
which ensures $\frac{8dk^{3/4}}{b'C_{\mu}^{2}}\geq 1$ for all $k\geq 1$. Then, we have
1Lm52ϕk\(μk\)≥1Lm832dk3/4b′Cμ2=CμLmk3/8b′832d\.\\frac\{1\}\{L\_\{m\}\\sqrt\{52\\phi\_\{k\}\(\\mu\_\{k\}\)\}\}\\geq\\frac\{1\}\{L\_\{m\}\\sqrt\{\\frac\{832dk^\{3/4\}\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}\}\}\}=\\frac\{C\_\{\\mu\}\}\{L\_\{m\}k^\{3/8\}\}\\sqrt\{\\frac\{b^\{\\prime\}\}\{832d\}\}\.Choosing
Cγ=CμLmb′832d,C\_\{\\gamma\}=\\frac\{C\_\{\\mu\}\}\{L\_\{m\}\}\\sqrt\{\\frac\{b^\{\\prime\}\}\{832d\}\},we have
γk=Cγk3/2≤CμLmk3/8b′832d≤1Lm52ϕk\(μk\),\\gamma\_\{k\}=\\frac\{C\_\{\\gamma\}\}\{k^\{3/2\}\}\\leq\\frac\{C\_\{\\mu\}\}\{L\_\{m\}k^\{3/8\}\}\\sqrt\{\\frac\{b^\{\\prime\}\}\{832d\}\}\\leq\\frac\{1\}\{L\_\{m\}\\sqrt\{52\\phi\_\{k\}\(\\mu\_\{k\}\)\}\},which verifies \([61](https://arxiv.org/html/2605.30573#A2.E61)\)\. By definition ofpk=12k1/2p\_\{k\}=\\frac\{1\}\{2k^\{1/2\}\}, we havepk∈\(0,1\]p\_\{k\}\\in\(0,1\]fork≥1k\\geq 1\. Therefore, we can follow the same steps in Theorem[1](https://arxiv.org/html/2605.30573#Thmtheorem1)up to \([45](https://arxiv.org/html/2605.30573#A2.E45)\) since the same arguments remain valid fort∈\[τn−1,τn\]t\\in\[\\tau\_\{n\-1\},\\tau\_\{n\}\]with time\-varying parameters\. We obtain
$$\begin{aligned}\mathrm{KL}(\nu_{\tau_{n}}\|\pi)-\mathrm{KL}(\nu_{\tau_{n-1}}\|\pi)&\leq-\frac{1}{2}\int_{\tau_{n-1}}^{\tau_{n}}\mathrm{FI}(\nu_{t}\|\pi)\,dt+\frac{13\gamma_{n}d(d+2)L_{1}^{2}}{4b_{n}}+\frac{13}{16}\gamma_{n}\mu_{n}^{2}L_{2}^{2}(d+3)^{3}\\&\quad+7\gamma_{n}^{2}dL_{m}^{2}\phi_{n}(\mu_{n})-\frac{\gamma_{n}}{p_{n}}\left(3+13\gamma_{n}^{2}L_{m}^{2}\phi_{n}(\mu_{n})\right)\left(e_{n}^{2}-e_{n-1}^{2}\right)\end{aligned}\tag{63}$$
on the interval $[\tau_{n-1},\tau_{n}]$. Plugging the definitions of $\gamma_{n}$, $b_{n}$, $p_{n}$, and $\mu_{n}$ into ([63](https://arxiv.org/html/2605.30573#A2.E63)), we get
KL\(ντn∥π\)−KL\(ντn−1∥π\)\\displaystyle\\mathrm\{KL\}\(\\nu\_\{\\tau\_\{n\}\}\\\|\\pi\)\-\\mathrm\{KL\}\(\\nu\_\{\\tau\_\{n\-1\}\}\\\|\\pi\)≤−12∫τn−1τnFI\(νt∥π\)𝑑t\+A1n2\+A2n7/4\+A3n9/4−cn\(en2−en−12\),\\displaystyle\\leq\-\\frac\{1\}\{2\}\\int\_\{\\tau\_\{n\-1\}\}^\{\\tau\_\{n\}\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)dt\+\\frac\{A\_\{1\}\}\{n^\{2\}\}\+\\frac\{A\_\{2\}\}\{n^\{7/4\}\}\+\\frac\{A\_\{3\}\}\{n^\{9/4\}\}\-c\_\{n\}\\left\(e\_\{n\}^\{2\}\-e\_\{n\-1\}^\{2\}\\right\),\(64\)where
A1≔13d\(d\+2\)L12Cγ8,A2≔13L22\(d\+3\)3CγCμ216,A3≔112d2Lm2Cγ2b′Cμ2,A\_\{1\}\\coloneqq\\frac\{13d\(d\+2\)L\_\{1\}^\{2\}C\_\{\\gamma\}\}\{8\},\\quad A\_\{2\}\\coloneqq\\frac\{13L\_\{2\}^\{2\}\(d\+3\)^\{3\}C\_\{\\gamma\}C\_\{\\mu\}^\{2\}\}\{16\},\\quad A\_\{3\}\\coloneqq\\frac\{112d^\{2\}L\_\{m\}^\{2\}C\_\{\\gamma\}^\{2\}\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}\},and
cn≔γnpn\(3\+13γn2Lm2ϕn\(μn\)\)c\_\{n\}\\coloneqq\\frac\{\\gamma\_\{n\}\}\{p\_\{n\}\}\\left\(3\+13\\gamma\_\{n\}^\{2\}L\_\{m\}^\{2\}\\phi\_\{n\}\(\\mu\_\{n\}\)\\right\)forn≥1n\\geq 1\. Iterating the bound in \([64](https://arxiv.org/html/2605.30573#A2.E64)\), we obtain
KL\(ντn∥π\)≤KL\(ν0∥π\)\\displaystyle\\mathrm\{KL\}\(\\nu\_\{\\tau\_\{n\}\}\\\|\\pi\)\\leq\\mathrm\{KL\}\(\\nu\_\{0\}\\\|\\pi\)−12∫0τnFI\(νt∥π\)𝑑t\+A1S1\+A2S2\+A3S3−∑k=1nck\(ek2−ek−12\)\\displaystyle\-\\frac\{1\}\{2\}\\int\_\{0\}^\{\\tau\_\{n\}\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)dt\+A\_\{1\}S\_\{1\}\+A\_\{2\}S\_\{2\}\+A\_\{3\}S\_\{3\}\-\\sum\_\{k=1\}^\{n\}c\_\{k\}\\left\(e\_\{k\}^\{2\}\-e\_\{k\-1\}^\{2\}\\right\)\(65\)where we have
∑k=1nk−2≤∑k=1∞k−2⏟≔S1<∞,∑k=1nk−7/4≤∑k=1∞k−7/4⏟≔S2<∞\.\\sum\_\{k=1\}^\{n\}k^\{\-2\}\\leq\\underbrace\{\\sum\_\{k=1\}^\{\\infty\}k^\{\-2\}\}\_\{\\coloneqq S\_\{1\}\}<\\infty,\\quad\\sum\_\{k=1\}^\{n\}k^\{\-7/4\}\\leq\\underbrace\{\\sum\_\{k=1\}^\{\\infty\}k^\{\-7/4\}\}\_\{\\coloneqq S\_\{2\}\}<\\infty\.
$$\sum_{k=1}^{n}k^{-9/4}\leq\underbrace{\sum_{k=1}^{\infty}k^{-9/4}}_{\coloneqq S_{3}}<\infty.$$
Thus, $A_{1}S_{1}$, $A_{2}S_{2}$, and $A_{3}S_{3}$ are bounded constants independent of $n$. Furthermore, if $(c_{n})_{n\geq 1}$ is a nonnegative and decreasing sequence (i.e., $0\leq c_{n+1}<c_{n}$), we can bound the summation in ([65](https://arxiv.org/html/2605.30573#A2.E65)) as follows
$$-\sum_{k=1}^{n}c_{k}\left(e_{k}^{2}-e_{k-1}^{2}\right)=c_{1}e_{0}^{2}+\sum_{k=1}^{n-1}(c_{k+1}-c_{k})e_{k}^{2}-c_{n}e_{n}^{2}\leq c_{1}e_{0}^{2},\tag{66}$$
where the inequality holds because $(c_{k+1}-c_{k})e_{k}^{2}\leq 0$ and $c_{n}e_{n}^{2}\geq 0$. Thus, it remains to show that $(c_{n})_{n\geq 1}$ is a nonnegative and decreasing sequence. Substituting the parameters $\gamma_{n}$, $b_{n}$, $p_{n}$, and $\mu_{n}$ into the definition of $c_{n}$, we obtain
$$c_{n}=\frac{6C_{\gamma}}{n}+\frac{26L_{m}^{2}C_{\gamma}^{3}}{n^{4}}+\frac{208dL_{m}^{2}C_{\gamma}^{3}}{b'C_{\mu}^{2}}\frac{1}{n^{13/4}}\left(1-\frac{1}{2n^{1/2}}\right)\tag{67}$$
for $n\geq 1$, where we use $\frac{\gamma_{n}}{p_{n}}=\frac{2C_{\gamma}}{n}$ and $\frac{4d}{p_{n}\mu_{n}^{2}b'}=\frac{8dn^{3/4}}{b'C_{\mu}^{2}}$. For notational convenience, define $F\coloneqq 26L_{m}^{2}C_{\gamma}^{3}$, which is a constant independent of $n$; then we rewrite $c_{n}$ as
cn=6Cγn\+Fn4\+8Fdb′Cμ2n13/4\(1−12n1/2\)\.c\_\{n\}=\\frac\{6C\_\{\\gamma\}\}\{n\}\+\\frac\{F\}\{n^\{4\}\}\+\\frac\{8Fd\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}n^\{13/4\}\}\\left\(1\-\\frac\{1\}\{2n^\{1/2\}\}\\right\)\.Nonnegativity is immediate since1−12n1/2\>01\-\\frac\{1\}\{2n^\{1/2\}\}\>0forn≥1n\\geq 1\. Therefore, all the terms incnc\_\{n\}are nonnegative forn≥1n\\geq 1\. To show\(cn\)n≥1\(c\_\{n\}\)\_\{n\\geq 1\}is decreasing, define the continuous extensionc\(x\)c\(x\), forx≥1x\\geq 1, as
c\(x\)=6Cγx\+Fx4\+8Fdb′Cμ2x13/4\(1−12x1/2\)\.c\(x\)=\\frac\{6C\_\{\\gamma\}\}\{x\}\+\\frac\{F\}\{x^\{4\}\}\+\\frac\{8Fd\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}x^\{13/4\}\}\\left\(1\-\\frac\{1\}\{2x^\{1/2\}\}\\right\)\.Taking the derivative yields
c′\(x\)=−6Cγx2−4Fx5−26Fdb′Cμ2x17/4\(1−1526x1/2\)\.c^\{\\prime\}\(x\)=\-\\frac\{6C\_\{\\gamma\}\}\{x^\{2\}\}\-\\frac\{4F\}\{x^\{5\}\}\-\\frac\{26Fd\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}x^\{17/4\}\}\\left\(1\-\\frac\{15\}\{26x^\{1/2\}\}\\right\)\.Forx≥1x\\geq 1, we have
1−1526x1/2\>0,1\-\\frac\{15\}\{26x^\{1/2\}\}\>0,and hencec′\(x\)<0c^\{\\prime\}\(x\)<0\. Therefore,c\(x\)c\(x\)is decreasing on\[1,∞\)\[1,\\infty\), which implies that\(cn\)n≥1\(c\_\{n\}\)\_\{n\\geq 1\}is decreasing\. Hence, using the upper bound \([66](https://arxiv.org/html/2605.30573#A2.E66)\) in \([65](https://arxiv.org/html/2605.30573#A2.E65)\) yields
KL\(ντn∥π\)\\displaystyle\\mathrm\{KL\}\(\\nu\_\{\\tau\_\{n\}\}\\\|\\pi\)≤KL\(ν0∥π\)−12∫0τnFI\(νt∥π\)𝑑t\+A1S1\+A2S2\+A3S3\+c1e02\\displaystyle\\leq\\mathrm\{KL\}\(\\nu\_\{0\}\\\|\\pi\)\-\\frac\{1\}\{2\}\\int\_\{0\}^\{\\tau\_\{n\}\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)dt\+A\_\{1\}S\_\{1\}\+A\_\{2\}S\_\{2\}\+A\_\{3\}S\_\{3\}\+c\_\{1\}e\_\{0\}^\{2\}\(68\)Rearranging the terms, using the convexity of Fisher information with Jensen’s inequality, and multiplying both sides by2/τn2/\\tau\_\{n\}, we get
$$\begin{aligned}\mathrm{FI}(\bar{\nu}_{\tau_{n}}\|\pi)&\leq\frac{1}{\tau_{n}}\int_{0}^{\tau_{n}}\mathrm{FI}(\nu_{t}\|\pi)\,dt\\&\leq\frac{2\,\mathrm{KL}(\nu_{0}\|\pi)}{\tau_{n}}+\frac{2}{\tau_{n}}\Bigl(A_{1}S_{1}+A_{2}S_{2}+A_{3}S_{3}+c_{1}e_{0}^{2}\Bigr),\end{aligned}\tag{69}$$
where $A_{1}S_{1}+A_{2}S_{2}+A_{3}S_{3}<\infty$ does not depend on $n$. On the other hand, for $t\in\left[\tau_{n},\tau_{n+1}\right]$, integrating ([44](https://arxiv.org/html/2605.30573#A2.E44)) between $\tau_{n}$ and $t$ and dropping the negative integral of the Fisher information gives
$$\begin{aligned}\mathrm{KL}(\nu_{t}\|\pi)&\leq\mathrm{KL}(\nu_{\tau_{n}}\|\pi)+\frac{13(t-\tau_{n})d(d+2)L_{1}^{2}}{4b_{n+1}}+\frac{13}{16}(t-\tau_{n})\mu_{n+1}^{2}L_{2}^{2}(d+3)^{3}\\&\quad+7(t-\tau_{n})\gamma_{n+1}dL_{m}^{2}\phi_{n+1}(\mu_{n+1})-\frac{(t-\tau_{n})}{p_{n+1}}\left(3+13\gamma_{n+1}^{2}L_{m}^{2}\phi_{n+1}(\mu_{n+1})\right)(e_{n+1}^{2}-e_{n}^{2})\\&\leq\mathrm{KL}(\nu_{0}\|\pi)+2A_{1}S_{1}+2A_{2}S_{2}+2A_{3}S_{3}+2c_{1}e_{0}^{2}+c_{n+1}e_{n}^{2}.\end{aligned}\tag{70}$$
To obtain the second inequality, we bound all terms except the $\mathrm{KL}$ divergence, $c_{1}e_{0}^{2}$, and $c_{n+1}e_{n}^{2}$ by their finite limits as $n\to\infty$. We then apply the bound on $\mathrm{KL}(\nu_{\tau_{n}}\|\pi)$ derived from ([68](https://arxiv.org/html/2605.30573#A2.E68)). By Assumption [1](https://arxiv.org/html/2605.30573#Thmassumption1) and Lemma [1](https://arxiv.org/html/2605.30573#Thmlemma1), the initial error satisfies $e_{0}<\infty$, and hence $c_{1}e_{0}^{2}<\infty$. Therefore, to show that $\{\mathrm{KL}(\nu_{t}\|\pi)\mid t\geq 0\}$ is bounded, it remains to prove that $c_{n+1}e_{n}^{2}<\infty$ for all $n\geq 1$. Since $(c_{n})_{n\geq 1}$ is decreasing, it is uniformly bounded. Therefore, it remains to show that $e_{n}^{2}<\infty$ for all $n\geq 1$. To this end, we apply Proposition [1](https://arxiv.org/html/2605.30573#Thmproposition1) with time-varying parameters:
en\+12≤\(1−pn\+1\)en2\+pn\+1σn\+12bn\+1\+4\(1−pn\+1\)dL12μn\+12b′𝔼\[‖𝒙τn\+1−𝒙τn‖2\],e\_\{n\+1\}^\{2\}\\leq\(1\-p\_\{n\+1\}\)e\_\{n\}^\{2\}\+\\frac\{p\_\{n\+1\}\\sigma\_\{n\+1\}^\{2\}\}\{b\_\{n\+1\}\}\+\\frac\{4\(1\-p\_\{n\+1\}\)dL\_\{1\}^\{2\}\}\{\\mu\_\{n\+1\}^\{2\}b^\{\\prime\}\}\\mathbb\{E\}\[\\\|\{\\bm\{x\}\}\_\{\\tau\_\{n\+1\}\}\-\{\\bm\{x\}\}\_\{\\tau\_\{n\}\}\\\|^\{2\}\],\(71\)where
$$\sigma_{n+1}^{2}\coloneqq\mathbb{E}\left[\|\widetilde{\nabla}f_{\mu_{n+1}}(\bm{x}_{\tau_{n+1}},\bm{u}_{n+1}^{i})\|^{2}\right].$$
The second term can be bounded using Assumption[1](https://arxiv.org/html/2605.30573#Thmassumption1)together with Lemma[1](https://arxiv.org/html/2605.30573#Thmlemma1), which imply thatσn2≤Cσ\\sigma\_\{n\}^\{2\}\\leq C\_\{\\sigma\}for some constantCσ\>0C\_\{\\sigma\}\>0\. Hence,
pn\+1σn\+12bn\+1≲pn\+12,\\frac\{p\_\{n\+1\}\\sigma\_\{n\+1\}^\{2\}\}\{b\_\{n\+1\}\}\\lesssim p\_\{n\+1\}^\{2\},\(72\)wherean≲bna\_\{n\}\\lesssim b\_\{n\}denotes that there exists a constantC\>0C\>0, independent ofnn, such thatan≤Cbna\_\{n\}\\leq Cb\_\{n\}\.
To bound the last term in \([71](https://arxiv.org/html/2605.30573#A2.E71)\), we use the interpolation \([10](https://arxiv.org/html/2605.30573#S3.E10)\) to obtain
4\(1−pn\+1\)dL12μn\+12b′𝔼\[‖𝒙τn\+1−𝒙τn‖2\]\\displaystyle\\frac\{4\(1\-p\_\{n\+1\}\)dL\_\{1\}^\{2\}\}\{\\mu\_\{n\+1\}^\{2\}b^\{\\prime\}\}\\mathbb\{E\}\[\\\|\{\\bm\{x\}\}\_\{\\tau\_\{n\+1\}\}\-\{\\bm\{x\}\}\_\{\\tau\_\{n\}\}\\\|^\{2\}\]≲1−pn\+1μn\+12\(γn\+12𝔼\[‖𝒈n‖2\]\+γn\+1\)\\displaystyle\\lesssim\\frac\{1\-p\_\{n\+1\}\}\{\\mu\_\{n\+1\}^\{2\}\}\\left\(\\gamma\_\{n\+1\}^\{2\}\\mathbb\{E\}\[\\\|\{\\bm\{g\}\}\_\{n\}\\\|^\{2\}\]\+\\gamma\_\{n\+1\}\\right\)\(73\)≲1μn\+12\(γn\+12en2\+γn\+12𝔼\[‖∇fμn\(𝒙τn\)‖2\]\+γn\+1\)\\displaystyle\\lesssim\\frac\{1\}\{\\mu\_\{n\+1\}^\{2\}\}\\left\(\\gamma\_\{n\+1\}^\{2\}e\_\{n\}^\{2\}\+\\gamma\_\{n\+1\}^\{2\}\\mathbb\{E\}\[\\\|\\nabla f\_\{\\mu\_\{n\}\}\(\{\\bm\{x\}\}\_\{\\tau\_\{n\}\}\)\\\|^\{2\}\]\+\\gamma\_\{n\+1\}\\right\)\(74\)≲γn\+12μn\+12\(en2\+1\)\+γn\+1μn\+12\\displaystyle\\lesssim\\frac\{\\gamma\_\{n\+1\}^\{2\}\}\{\\mu\_\{n\+1\}^\{2\}\}\(e\_\{n\}^\{2\}\+1\)\+\\frac\{\\gamma\_\{n\+1\}\}\{\\mu\_\{n\+1\}^\{2\}\}\(75\)≲\(n\+1\)−11/4\(en2\+1\)\+\(n\+1\)−5/4\\displaystyle\\lesssim\(n\+1\)^\{\-11/4\}\(e\_\{n\}^\{2\}\+1\)\+\(n\+1\)^\{\-5/4\}\(76\)≲pn\+111/2\(en2\+1\)\+pn\+15/2\.\\displaystyle\\lesssim p\_\{n\+1\}^\{11/2\}\(e\_\{n\}^\{2\}\+1\)\+p\_\{n\+1\}^\{5/2\}\.\(77\)
We add and subtract∇fμn\(𝒙τn\)\\nabla f\_\{\\mu\_\{n\}\}\(\{\\bm\{x\}\}\_\{\\tau\_\{n\}\}\)inside the squared norm in \([73](https://arxiv.org/html/2605.30573#A2.E73)\), apply the inequality\(a\+b\)2≤2a2\+2b2\(a\+b\)^\{2\}\\leq 2a^\{2\}\+2b^\{2\}and usepn\+1∈\(0,1\]p\_\{n\+1\}\\in\(0,1\]to obtain \([74](https://arxiv.org/html/2605.30573#A2.E74)\)\. Since Lipschitz continuity is preserved under Gaussian smoothing,fμnf\_\{\\mu\_\{n\}\}is Lipschitz continuous wheneverffis Lipschitz continuous\. Applying this property yields \([75](https://arxiv.org/html/2605.30573#A2.E75)\)\. Substituting the definitions ofγn\+1\\gamma\_\{n\+1\}andμn\+1\\mu\_\{n\+1\}into \([75](https://arxiv.org/html/2605.30573#A2.E75)\) gives \([76](https://arxiv.org/html/2605.30573#A2.E76)\), and using the definition ofpn\+1p\_\{n\+1\}yields the final inequality\.
Combining \([72](https://arxiv.org/html/2605.30573#A2.E72)\) and \([77](https://arxiv.org/html/2605.30573#A2.E77)\), we obtain, for some constantG\>0G\>0,
en\+12≤\(1−pn\+1\+Gpn\+111/2\)en2\+G\(pn\+12\+pn\+111/2\+pn\+15/2\)\.e\_\{n\+1\}^\{2\}\\leq\\left\(1\-p\_\{n\+1\}\+Gp\_\{n\+1\}^\{11/2\}\\right\)e\_\{n\}^\{2\}\+G\\left\(p\_\{n\+1\}^\{2\}\+p\_\{n\+1\}^\{11/2\}\+p\_\{n\+1\}^\{5/2\}\\right\)\.\(78\)
Sincepn→0p\_\{n\}\\to 0, there existsn0≥1n\_\{0\}\\geq 1such that
Gpn\+19/2≤12Gp\_\{n\+1\}^\{9/2\}\\leq\\frac\{1\}\{2\}for alln≥n0n\\geq n\_\{0\}\. Moreover, sincepn\+1∈\(0,1\]p\_\{n\+1\}\\in\(0,1\],
pn\+12\+pn\+111/2\+pn\+15/2≤3pn\+1\.p\_\{n\+1\}^\{2\}\+p\_\{n\+1\}^\{11/2\}\+p\_\{n\+1\}^\{5/2\}\\leq 3p\_\{n\+1\}\.Therefore,
en\+12≤\(1−pn\+12\)en2\+3Gpn\+1,n≥n0\.e\_\{n\+1\}^\{2\}\\leq\\left\(1\-\\frac\{p\_\{n\+1\}\}\{2\}\\right\)e\_\{n\}^\{2\}\+3Gp\_\{n\+1\},\\qquad n\\geq n\_\{0\}\.
DefiningM≔6GM\\coloneqq 6G, we obtain
en\+12≤\(1−pn\+12\)en2\+Mpn\+12\.e\_\{n\+1\}^\{2\}\\leq\\left\(1\-\\frac\{p\_\{n\+1\}\}\{2\}\\right\)e\_\{n\}^\{2\}\+\\frac\{Mp\_\{n\+1\}\}\{2\}\.Since the right\-hand side is a convex combination ofen2e\_\{n\}^\{2\}andMM,
en\+12≤max\{en2,M\},n≥n0\.e\_\{n\+1\}^\{2\}\\leq\\max\\\{e\_\{n\}^\{2\},M\\\},\\qquad n\\geq n\_\{0\}\.By induction,
en2≤max\{en02,M\},n≥n0\.e\_\{n\}^\{2\}\\leq\\max\\\{e\_\{n\_\{0\}\}^\{2\},M\\\},\\qquad n\\geq n\_\{0\}\.
For the finitely many indices1≤n≤n01\\leq n\\leq n\_\{0\}, finiteness ofen2e\_\{n\}^\{2\}follows recursively from \([78](https://arxiv.org/html/2605.30573#A2.E78)\)\. Consequently,
en2≤max\{e02,…,en02,M\}<∞e\_\{n\}^\{2\}\\leq\\max\\\{e\_\{0\}^\{2\},\\ldots,e\_\{n\_\{0\}\}^\{2\},M\\\}<\\infty\(79\)for alln≥1n\\geq 1\. That provescn\+1en2<∞c\_\{n\+1\}e\_\{n\}^\{2\}<\\inftyfor alln≥1n\\geq 1\.
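The boundedness argument can be illustrated numerically. The sketch below iterates the recursion $e_{n+1}^{2}=(1-\frac{p_{n+1}}{2})e_{n}^{2}+\frac{Mp_{n+1}}{2}$ with equality (the worst case allowed by the inequality), using $p_{n+1}=\frac{1}{2(n+1)^{1/2}}$ from (58). It assumes for simplicity that the recursion holds from $n_{0}=0$; the function name and the values of $e_{0}^{2}$ and $M$ are hypothetical.

```python
def worst_case_error_sequence(e0_sq, M, n_steps):
    """Iterate e_{n+1}^2 = (1 - p_{n+1}/2) e_n^2 + M p_{n+1}/2 with equality."""
    errors = [e0_sq]
    for n in range(n_steps):
        p_next = 1.0 / (2.0 * (n + 1) ** 0.5)  # p_{n+1} from the schedule (58)
        # Convex combination of the previous error and M with weight p_{n+1}/2.
        e_next = (1.0 - p_next / 2.0) * errors[-1] + M * p_next / 2.0
        errors.append(e_next)
    return errors
```

Because every iterate is a convex combination of the previous iterate and $M$, the sequence never leaves the interval between $e_{0}^{2}$ and $M$, which is the content of the induction step.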
Combining this with ([70](https://arxiv.org/html/2605.30573#A2.E70)) implies that $\{\mathrm{KL}(\nu_{t}\|\pi)\mid t\geq 0\}$ is uniformly bounded. By the convexity of the KL divergence, this implies that the sequence $\{\mathrm{KL}(\bar{\nu}_{\tau_{n}}\|\pi)\}_{n\in\mathbb{N}}$ is uniformly bounded as well. Since the sublevel sets of $\mathrm{KL}(\cdot\|\pi)$ are weakly compact, $(\bar{\nu}_{\tau_{n}})_{n\in\mathbb{N}}$ is tight. To establish that $\bar{\nu}_{\tau_{n}}\rightharpoonup\pi$ weakly, it suffices to verify that every cluster point of $(\bar{\nu}_{\tau_{n}})_{n\in\mathbb{N}}$ equals $\pi$.
Consider a subsequence of\(ν¯τn\)n∈ℕ\(\\bar\{\\nu\}\_\{\\tau\_\{n\}\}\)\_\{n\\in\\mathbb\{N\}\}converging to some limitν¯\\bar\{\\nu\}\. Takingn→∞n\\to\\inftyin \([69](https://arxiv.org/html/2605.30573#A2.E69)\) and noting thatτn→∞\\tau\_\{n\}\\to\\infty, we obtainFI\(ν¯τn∥π\)→0\\mathrm\{FI\}\(\\bar\{\\nu\}\_\{\\tau\_\{n\}\}\\\|\\pi\)\\rightarrow 0\. Therefore, the same holds along the subsequence\. By the weak lower semicontinuity of the Fisher information along the subsequence, we haveFI\(ν¯∥π\)=0\\mathrm\{FI\}\(\\bar\{\\nu\}\\\|\\pi\)=0\. Writingψ:=dν¯dπ\\psi:=\\frac\{d\\bar\{\\nu\}\}\{d\\pi\}, this meansψ∈domℰ\\sqrt\{\\psi\}\\in\\text\{dom\}~\\mathcal\{E\}andℰ\(ψ\)=0\\mathcal\{E\}\(\\sqrt\{\\psi\}\)=0, whereℰ\\mathcal\{E\}denotes the Dirichlet energy \(i\.e\., the squaredL2\(π\)L^\{2\}\(\\pi\)\-norm of the gradient; see Section 3 inBalasubramanianet al\.\([2022](https://arxiv.org/html/2605.30573#bib.bib34)\)\)\. Since∇logπ\\nabla\\log\\piis Lipschitz by Assumption[2](https://arxiv.org/html/2605.30573#Thmassumption2),π\\pihas a continuous and strictly positive density onℝd\\mathbb\{R\}^\{d\}, soℰ\(ψ\)=0\\mathcal\{E\}\(\\sqrt\{\\psi\}\)=0, which implies thatψ\\psimust be a constantπ\\pi\-a\.e\., henceν¯=π\\bar\{\\nu\}=\\pi\.□\\square
### B\.6Proof of Theorem[3](https://arxiv.org/html/2605.30573#Thmtheorem3)
#### Theorem[3](https://arxiv.org/html/2605.30573#Thmtheorem3)
Letπ∝ℓ\(𝐲\|𝐱\)p\(𝐱\)\\pi\\propto\\ell\(\{\\bm\{y\}\}\|\{\\bm\{x\}\}\)p\(\{\\bm\{x\}\}\)be the posterior with the likelihoodℓ\(𝐲\|𝐱\)∝exp\(−f\(𝐱\)\)\\ell\(\{\\bm\{y\}\}\|\{\\bm\{x\}\}\)\\propto\\mathrm\{exp\}\(\-f\(\{\\bm\{x\}\}\)\)and the priorp\(𝐱\)∝exp\(−h\(𝐱\)\)p\(\{\\bm\{x\}\}\)\\propto\\mathrm\{exp\}\(\-h\(\{\\bm\{x\}\}\)\)\. Suppose the likelihood potentialffsatisfies Assumptions[1](https://arxiv.org/html/2605.30573#Thmassumption1),[2](https://arxiv.org/html/2605.30573#Thmassumption2)with Lipschitz constantsLf1L\_\{f\_\{1\}\},Lf2L\_\{f\_\{2\}\}, respectively, the prior potentialhhsatisfies Assumption[2](https://arxiv.org/html/2605.30573#Thmassumption2)with Lipschitz constantLhL\_\{h\}and Assumption[4](https://arxiv.org/html/2605.30573#Thmassumption4), and the SGM satisfies Assumption[5](https://arxiv.org/html/2605.30573#Thmassumption5)with decreasing errorεσk=𝒪\(k−1/2\)\\varepsilon\_\{\\sigma\_\{k\}\}=\\mathcal\{O\}\(k^\{\-1/2\}\)forσk\>0\\sigma\_\{k\}\>0, andk≥1k\\geq 1\. DefineLm≔max\{Lf2\+Lh,Lf1\}L\_\{m\}\\coloneqq\\max\\\{L\_\{f\_\{2\}\}\+L\_\{h\},L\_\{f\_\{1\}\}\\\}\. Let\(νt\)t≥0\(\\nu\_\{t\}\)\_\{t\\geq 0\}denote the law of interpolation generated by \([14](https://arxiv.org/html/2605.30573#S3.E14)\) with annealing and noise schedules defined in \([13](https://arxiv.org/html/2605.30573#S3.E13)\)\. For any step sizeγ∈\(0,1Lm85ϕ\(μ\)\]\\gamma\\in\\left\(0,\\frac\{1\}\{L\_\{m\}\\sqrt\{85\\phi\(\\mu\)\}\}\\right\], whereϕ\(μ\)≔1\+4\(1−p\)dpμ2b′\\phi\(\\mu\)\\coloneqq 1\+\\frac\{4\(1\-p\)d\}\{p\\mu^\{2\}b^\{\\prime\}\}, and for anyN≥1N\\geq 1, the Fisher information satisfies
$$\frac{1}{N\gamma}\int_0^{N\gamma}\mathrm{FI}(\nu_t\|\pi)\,dt\leq\frac{C_0}{N\gamma}+\frac{17d(d+2)L_{f_1}^2}{2b}+\frac{17\mu^2L_{f_2}^2(d+3)^3}{8}+\bar{\sigma}^2+\bar{\varepsilon}_\sigma^2+\bar{\alpha}^2+32\gamma L_m^2d\phi(\mu),\tag{80}$$
where
$$\bar{\sigma}^2\coloneqq\frac{51C_1^2}{2N}\sum_{k=0}^{N-1}\sigma_k^2,\quad\bar{\varepsilon}_\sigma^2\coloneqq\frac{51}{2N}\sum_{k=0}^{N-1}\varepsilon_{\sigma_k}^2,\quad\text{and}\quad\bar{\alpha}^2\coloneqq\frac{51C_2^2}{2N}\sum_{k=0}^{N-1}\frac{(\alpha_k-1)^2}{\sigma_k^2},$$
and $C_0$, $C_1$, $C_2$ are positive constants. Furthermore, let
$$\gamma=\frac{1}{L_mN^{3/4}d^{7/4}},\quad p=\frac{L_m}{N^{1/4}d^{1/4}},\quad b=\left\lceil\frac{1}{p}\right\rceil,\quad\text{and}\quad\mu^2=\frac{1}{L_mN^{1/4}d^{5/4}},$$
and let the initial parameters satisfy $\sigma_0\geq\sigma_{\min}$, $\varepsilon_{\sigma_0}>0$, and $\alpha_0\geq 1$, where $\sigma_{\min}>0$ is the minimum noise level. Then, to achieve $\mathrm{FI}(\bar{\nu}_{N\gamma}\|\pi)\leq\varepsilon$ for the time-averaged law $\bar{\nu}_{N\gamma}\coloneqq\frac{1}{N\gamma}\int_0^{N\gamma}\nu_t\,dt$, $\mathcal{O}\left(\frac{d^7L_m^4}{\varepsilon^4}\right)$ iterations with $\mathcal{O}(1)$ function evaluations per iteration are sufficient.
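As a rough illustration of the parameter choices in the theorem statement, the following sketch computes $\gamma$, $p$, $b$, and $\mu^2$ from $N$, $d$, and $L_m$, and compares $\gamma$ with the step-size cap $1/(L_m\sqrt{85\phi(\mu)})$. The function name `theorem3_parameters` and the default small batch size `b_prime=1` are assumptions made for this example; they are not taken from the paper.

```python
import math

def theorem3_parameters(N, d, L_m, b_prime=1):
    """Parameter choices from the theorem statement.

    b_prime (the small batch size b') is an assumed input; the statement
    above does not fix its value.
    """
    gamma = 1.0 / (L_m * N**0.75 * d**1.75)
    p = L_m / (N**0.25 * d**0.25)
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]; increase N or d")
    b = math.ceil(1.0 / p)
    mu_sq = 1.0 / (L_m * N**0.25 * d**1.25)
    # phi(mu) = 1 + 4(1 - p) d / (p mu^2 b')
    phi = 1.0 + 4.0 * (1.0 - p) * d / (p * mu_sq * b_prime)
    gamma_max = 1.0 / (L_m * math.sqrt(85.0 * phi))
    # Expected number of function-evaluation batches per iteration.
    expected_cost = p * b + (1.0 - p) * b_prime
    return {"gamma": gamma, "p": p, "b": b, "mu_sq": mu_sq,
            "phi": phi, "gamma_max": gamma_max, "expected_cost": expected_cost}

params = theorem3_parameters(N=10_000, d=10, L_m=1.0)
print(params)
```

For these illustrative values the chosen $\gamma$ falls below the cap and the expected per-iteration cost stays below two, in line with the $\mathcal{O}(1)$ claim.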
Proof. At initialization, we choose $\gamma_0,b_0,\mu_0>0$ and $p_0\in(0,1]$, and define $\bm{g}_0$ as
$$\bm{g}_0\coloneqq\frac{1}{b_0}\sum_{i=1}^{b_0}\widetilde{\nabla}f_{\mu_0}(\bm{x}_0,\bm{u}_0^i),\quad\bm{u}_0^i\sim\mathcal{N}(0,I).$$
We recall the interpolation in ([14](https://arxiv.org/html/2605.30573#S3.E14)), given as
$$\bm{x}_t\coloneqq\bm{x}_{k\gamma}-(t-k\gamma)\left(\bm{g}_k-\alpha_k\mathcal{S}_\theta(\bm{x}_{k\gamma})\right)+\sqrt{2}(\bm{B}_t-\bm{B}_{k\gamma})\quad\text{for}\quad t\in[k\gamma,(k+1)\gamma],\tag{81}$$
where $\bm{g}_k$ denotes the variance-reduced zeroth-order estimate of the negative likelihood score $\nabla f(\bm{x}_{k\gamma})$ at iteration $k$, and $(\alpha_k)_{k=0}^{N-1}$ and $(\sigma_k)_{k=0}^{N-1}$ denote the annealing and noise schedules, respectively. By Assumptions [2](https://arxiv.org/html/2605.30573#Thmassumption2) and [4](https://arxiv.org/html/2605.30573#Thmassumption4) and the triangle inequality, the posterior score function $\nabla\log\pi(\bm{x})$ is Lipschitz continuous with Lipschitz constant $L_h+L_{f_2}$. Furthermore, by Assumptions [4](https://arxiv.org/html/2605.30573#Thmassumption4) and [5](https://arxiv.org/html/2605.30573#Thmassumption5), the error between the negative prior score and its estimate scaled by the annealing parameter can be bounded by
$$\begin{aligned}\|\nabla h(\bm{x}_{k\gamma})+\alpha_k\mathcal{S}_\theta(\bm{x}_{k\gamma})\|&\leq\|\nabla h(\bm{x}_{k\gamma})-\nabla h_{\sigma_k}(\bm{x}_{k\gamma})+\nabla h_{\sigma_k}(\bm{x}_{k\gamma})+\mathcal{S}_\theta(\bm{x}_{k\gamma},\sigma_k)+(\alpha_k-1)\mathcal{S}_\theta(\bm{x}_{k\gamma})\|\\&\leq C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}.\end{aligned}\tag{82}$$
Here we recall that $\mathcal{S}_\theta(\bm{x}_{k\gamma},\sigma_k)$ estimates the perturbed score $-\nabla h_{\sigma_k}(\bm{x}_{k\gamma})$, so applying the triangle inequality to the three differences yields ([82](https://arxiv.org/html/2605.30573#A2.E82)). We are now ready to prove the theorem.
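To make the objects in (81) concrete, here is a minimal NumPy sketch of the batched estimator at initialization and of the interpolation evaluated at $t=(k+1)\gamma$. The two-point Gaussian-smoothing form used for $\widetilde{\nabla}f_\mu$ is an assumption (its definition is not restated in this section), and the toy potential `f`, the toy `score`, and both function names are hypothetical stand-ins rather than the paper's implementation.

```python
import numpy as np

def zo_gradient(f, x, mu, batch, rng):
    """Batched two-point Gaussian-smoothing estimate of grad f(x).

    Assumed form: mean over i of (f(x + mu*u_i) - f(x)) / mu * u_i, u_i ~ N(0, I).
    """
    u = rng.standard_normal((batch, x.size))
    fx = f(x)
    slopes = np.array([(f(x + mu * ui) - fx) / mu for ui in u])
    return (slopes[:, None] * u).mean(axis=0)

def interpolation_endpoint(x, g, score, alpha, sigma, gamma, rng):
    """Evaluate (81) at t = (k+1)*gamma: drift step plus Brownian increment."""
    brownian = np.sqrt(gamma) * rng.standard_normal(x.size)  # B_{(k+1)gamma} - B_{k gamma}
    return x - gamma * (g - alpha * score(x, sigma)) + np.sqrt(2.0) * brownian

# Toy ingredients (hypothetical): quadratic potential and a Gaussian-prior score.
rng = np.random.default_rng(0)
f = lambda x: 0.5 * np.sum((x - 1.0) ** 2)       # grad f(x) = x - 1
score = lambda x, sigma: -x / (1.0 + sigma**2)   # score of N(0, I) smoothed at level sigma
x0 = np.zeros(3)
g0 = zo_gradient(f, x0, mu=1e-3, batch=20_000, rng=rng)
x1 = interpolation_endpoint(x0, g0, score, alpha=1.0, sigma=0.1, gamma=1e-2, rng=rng)
```

With a large batch the estimate `g0` concentrates around the true gradient of the toy potential; the variance-reduced estimator analyzed in the proof is designed to avoid needing such a large batch at every iteration.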
Combining Lemma [2](https://arxiv.org/html/2605.30573#Thmlemma2) with the interpolation in ([81](https://arxiv.org/html/2605.30573#A2.E81)), it follows that for every $t\in[k\gamma,(k+1)\gamma]$,
$$\frac{d}{dt}\mathrm{KL}(\nu_t\|\pi)\leq-\frac{3}{4}\mathrm{FI}(\nu_t\|\pi)+\mathbb{E}\left[\|\nabla\log\pi(\bm{x}_t)+\bm{g}_k-\alpha_k\mathcal{S}_\theta(\bm{x}_{k\gamma},\sigma_k)\|^2\right].$$
Adding and subtracting $\nabla f_\mu(\bm{x}_t)$, $\nabla f_\mu(\bm{x}_{k\gamma})$, $\nabla h(\bm{x}_{k\gamma})$, and $\nabla h_{\sigma_k}(\bm{x}_{k\gamma})$ inside the squared norm in the expectation, and applying the inequality $(a+b+c+d)^2\leq 4a^2+4b^2+4c^2+4d^2$ together with ([82](https://arxiv.org/html/2605.30573#A2.E82)) and Lemma [1](https://arxiv.org/html/2605.30573#Thmlemma1), yields
$$\begin{aligned}\frac{d}{dt}\mathrm{KL}(\nu_t\|\pi)\leq{}&-\frac{3}{4}\mathrm{FI}(\nu_t\|\pi)+4\mathbb{E}\left[\|\bm{g}_k-\nabla f_\mu(\bm{x}_{k\gamma})\|^2\right]+4L_\pi^2\mathbb{E}\left[\|\bm{x}_t-\bm{x}_{k\gamma}\|^2\right]\\&+\mu^2L_{f_2}^2(d+3)^3+4\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2,\end{aligned}\tag{83}$$
where $L_\pi\coloneqq L_{f_2}+L_h$. Using Proposition [1](https://arxiv.org/html/2605.30573#Thmproposition1), and subsequently applying steps ([36](https://arxiv.org/html/2605.30573#A2.E36)), ([37](https://arxiv.org/html/2605.30573#A2.E37)), and ([38](https://arxiv.org/html/2605.30573#A2.E38)), we can upper bound the mean squared estimation error $e_k^2\coloneqq\mathbb{E}[\|\bm{g}_k-\nabla f_\mu(\bm{x}_{k\gamma})\|^2]$ as follows:
$$e_k^2\leq\frac{d(d+2)L_{f_1}^2}{b}+\left(\frac{1-p}{p}\right)\frac{4dL_{f_1}^2}{\mu^2b'}\Delta_k-\frac{1}{p}\left(e_{k+1}^2-e_k^2\right).\tag{84}$$
Plugging this upper bound into ([83](https://arxiv.org/html/2605.30573#A2.Ex76)), we get
$$\begin{aligned}\frac{d}{dt}\mathrm{KL}(\nu_t\|\pi)&\leq-\frac{3}{4}\mathrm{FI}(\nu_t\|\pi)+4L_\pi^2\mathbb{E}\left[\|\bm{x}_t-\bm{x}_{k\gamma}\|^2\right]+\mu^2L_{f_2}^2(d+3)^3+\frac{4d(d+2)L_{f_1}^2}{b}\\&\quad+\left(\frac{1-p}{p}\right)\frac{16dL_{f_1}^2}{\mu^2b'}\Delta_k-\frac{4}{p}\left(e_{k+1}^2-e_k^2\right)+4\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2.\end{aligned}\tag{85}$$
By the interpolation in ([81](https://arxiv.org/html/2605.30573#A2.E81)), we have
$$\begin{aligned}\mathbb{E}\left[\|\bm{x}_t-\bm{x}_{k\gamma}\|^2\right]&=(t-k\gamma)^2\mathbb{E}\left[\|\bm{g}_k-\alpha_k\mathcal{S}_\theta(\bm{x}_{k\gamma})\|^2\right]+2(t-k\gamma)d\\&\leq\gamma^2\mathbb{E}\left[\|\bm{g}_k-\alpha_k\mathcal{S}_\theta(\bm{x}_{k\gamma})\|^2\right]+2\gamma d\\&=\mathbb{E}\left[\|\bm{x}_{(k+1)\gamma}-\bm{x}_{k\gamma}\|^2\right]\\&=\Delta_k\end{aligned}\tag{86}$$
for $t\in[k\gamma,(k+1)\gamma]$. Substituting the bound in ([86](https://arxiv.org/html/2605.30573#A2.E86)) into ([85](https://arxiv.org/html/2605.30573#A2.E85)) yields
$$\begin{aligned}\frac{d}{dt}\mathrm{KL}(\nu_t\|\pi)&\leq-\frac{3}{4}\mathrm{FI}(\nu_t\|\pi)+4L_m^2\underbrace{\left[1+\left(\frac{1-p}{p}\right)\frac{4d}{\mu^2b'}\right]}_{\coloneqq\phi(\mu)}\Delta_k+\mu^2L_{f_2}^2(d+3)^3+\frac{4d(d+2)L_{f_1}^2}{b}\\&\quad-\frac{4}{p}\left(e_{k+1}^2-e_k^2\right)+4\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2,\end{aligned}\tag{87}$$
where $L_m\coloneqq\max\{L_{f_1},L_\pi\}$. We next use the interpolation ([81](https://arxiv.org/html/2605.30573#A2.E81)) to derive an upper bound on $\Delta_k$:
$$\begin{aligned}\Delta_k&=\gamma^2\mathbb{E}\left[\|\bm{g}_k-\alpha_k\mathcal{S}_\theta(\bm{x}_{k\gamma},\sigma_k)\|^2\right]+2\gamma d\\&\leq5\gamma^2e_k^2+\frac{5\gamma^2\mu^2L_{f_2}^2(d+3)^3}{4}+5\gamma^2L_\pi^2\Delta_k+5\gamma^2\mathbb{E}\left[\|\nabla\log\pi(\bm{x}_t)\|^2\right]\\&\quad+5\gamma^2\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2+2\gamma d,\end{aligned}\tag{88}$$
where we add and subtract $\nabla f_\mu(\bm{x}_{k\gamma})$, $\nabla f(\bm{x}_{k\gamma})$, $\nabla f(\bm{x}_t)$, $\nabla h(\bm{x}_t)$, $\nabla h(\bm{x}_{k\gamma})$, and $\nabla h_{\sigma_k}(\bm{x}_{k\gamma})$ inside the squared norm in the expectation, and then apply the convexity of the squared $\ell_2$ norm together with part (b) of Assumption [5](https://arxiv.org/html/2605.30573#Thmassumption5) to obtain ([88](https://arxiv.org/html/2605.30573#A2.E88)). Using the bound on $e_k^2$ in ([84](https://arxiv.org/html/2605.30573#A2.E84)), we get
$$\begin{aligned}\Delta_k&\leq5\gamma^2L_m^2\phi(\mu)\Delta_k+\frac{5\gamma^2d(d+2)L_{f_1}^2}{b}+\frac{5\gamma^2\mu^2L_{f_2}^2(d+3)^3}{4}+5\gamma^2\mathbb{E}\left[\|\nabla\log\pi(\bm{x}_t)\|^2\right]-\frac{5\gamma^2}{p}\left(e_{k+1}^2-e_k^2\right)\\&\quad+5\gamma^2\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2+2\gamma d.\end{aligned}$$
Assume that $\gamma\leq\frac{1}{L_m\sqrt{85\phi(\mu)}}$, so that $5\gamma^2L_m^2\phi(\mu)\leq\frac{1}{17}$. Rearranging the terms, we then have
$$\begin{aligned}\frac{16}{17}\Delta_k&\leq\frac{5\gamma^2d(d+2)L_{f_1}^2}{b}+\frac{5\gamma^2\mu^2L_{f_2}^2(d+3)^3}{4}+5\gamma^2\mathbb{E}\left[\|\nabla\log\pi(\bm{x}_t)\|^2\right]-\frac{5\gamma^2}{p}\left(e_{k+1}^2-e_k^2\right)\\&\quad+5\gamma^2\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2+2\gamma d.\end{aligned}$$
Multiplying both sides by $\frac{17}{16}$, we get
$$\begin{aligned}\Delta_k&\leq\frac{85\gamma^2d(d+2)L_{f_1}^2}{16b}+\frac{85\gamma^2\mu^2L_{f_2}^2(d+3)^3}{64}+\frac{85}{16}\gamma^2\mathbb{E}\left[\|\nabla\log\pi(\bm{x}_t)\|^2\right]-\frac{85\gamma^2}{16p}\left(e_{k+1}^2-e_k^2\right)\\&\quad+\frac{85}{16}\gamma^2\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2+\frac{17}{8}\gamma d.\end{aligned}$$
We can use Lemma [3](https://arxiv.org/html/2605.30573#Thmlemma3) to upper bound $\mathbb{E}[\|\nabla\log\pi(\bm{x}_t)\|^2]$:
$$\begin{aligned}\Delta_k&\leq\frac{85\gamma^2d(d+2)L_{f_1}^2}{16b}+\frac{85\gamma^2\mu^2L_{f_2}^2(d+3)^3}{64}+\frac{85}{16}\gamma^2\mathrm{FI}(\nu_t\|\pi)+\frac{85}{8}\gamma^2L_\pi d-\frac{85\gamma^2}{16p}\left(e_{k+1}^2-e_k^2\right)\\&\quad+\frac{85}{16}\gamma^2\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2+\frac{17}{8}\gamma d.\end{aligned}\tag{89}$$
We can combine the terms $\frac{85}{8}\gamma^2L_\pi d$ and $\frac{17}{8}\gamma d$ using the assumption $\gamma\leq\frac{1}{L_m\sqrt{85\phi(\mu)}}$. In particular,
$$\gamma\leq\frac{1}{\sqrt{85L_m^2\left(1+\left(\frac{1-p}{p}\right)\frac{4d}{\mu^2b'}\right)}}\leq\frac{1}{L_m\sqrt{85}},$$
where we use the fact that $\left(\frac{1-p}{p}\right)\frac{4d}{\mu^2b'}\geq0$. Consequently,
$$\frac{85}{8}\gamma^2dL_\pi\leq\frac{\sqrt{85}\,\gamma dL_\pi}{8L_m}\leq\frac{\sqrt{85}}{8}\gamma d,$$
where we use $L_\pi\leq L_m$. Therefore,
$$\frac{85}{8}\gamma^2L_\pi d+\frac{17}{8}\gamma d\leq\left(\frac{\sqrt{85}}{8}+\frac{17}{8}\right)\gamma d\leq4\gamma d.$$
Substituting this bound into ([89](https://arxiv.org/html/2605.30573#A2.E89)) yields the simplified bound
$$\begin{aligned}\Delta_k&\leq\frac{85\gamma^2d(d+2)L_{f_1}^2}{16b}+\frac{85\gamma^2\mu^2L_{f_2}^2(d+3)^3}{64}+\frac{85}{16}\gamma^2\mathrm{FI}(\nu_t\|\pi)-\frac{85\gamma^2}{16p}\left(e_{k+1}^2-e_k^2\right)\\&\quad+\frac{85}{16}\gamma^2\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2+4\gamma d.\end{aligned}\tag{90}$$
Plugging ([90](https://arxiv.org/html/2605.30573#A2.Ex95)) into ([87](https://arxiv.org/html/2605.30573#A2.E87)), we get
$$\begin{aligned}\frac{d}{dt}\mathrm{KL}(\nu_t\|\pi)&\leq\left(-\frac{3}{4}+\frac{85\gamma^2L_m^2\phi(\mu)}{4}\right)\mathrm{FI}(\nu_t\|\pi)+\left(1+\frac{85\gamma^2L_m^2\phi(\mu)}{16}\right)\frac{4d(d+2)L_{f_1}^2}{b}\\&\quad+\left(1+\frac{85\gamma^2L_m^2\phi(\mu)}{16}\right)\mu^2L_{f_2}^2(d+3)^3-\frac{4}{p}\left(1+\frac{85\gamma^2L_m^2\phi(\mu)}{16}\right)\left(e_{k+1}^2-e_k^2\right)\\&\quad+4\left(1+\frac{85\gamma^2L_m^2\phi(\mu)}{16}\right)\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2+16\gamma dL_m^2\phi(\mu)\\&\leq-\frac{1}{2}\mathrm{FI}(\nu_t\|\pi)+\frac{17d(d+2)L_{f_1}^2}{4b}+\frac{17}{16}\mu^2L_{f_2}^2(d+3)^3-\frac{4}{p}\left(1+\frac{85\gamma^2L_m^2\phi(\mu)}{16}\right)\left(e_{k+1}^2-e_k^2\right)\\&\quad+\frac{17}{4}\left(C_1\sigma_k+\varepsilon_{\sigma_k}+(\alpha_k-1)C_2\sigma_k^{-1}\right)^2+16\gamma dL_m^2\phi(\mu),\end{aligned}\tag{91}$$
where we use the fact that $85\gamma^2L_m^2\phi(\mu)\leq1$ to get ([91](https://arxiv.org/html/2605.30573#A2.E91)).
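The numerical constants used between (88) and (91) can be sanity-checked in a few lines. The sketch below evaluates them in the worst case allowed by the step-size condition, $85\gamma^2L_m^2\phi(\mu)=1$; the variable names are illustrative only.

```python
import math

# Worst case permitted by the step-size condition:
# s stands for 85 * gamma^2 * L_m^2 * phi(mu), which is at most 1.
s = 1.0

# Coefficient left on Delta_k after moving 5*gamma^2*L_m^2*phi(mu)*Delta_k to the left.
contraction = 1.0 - 5.0 * s / 85.0             # 16/17

# Fisher-information coefficient after substituting the Delta_k bound.
fi_coeff = -0.75 + s / 4.0                     # -1/2

# Common factor multiplying the batch, smoothing, and schedule bias terms.
inflation = 1.0 + s / 16.0                     # 17/16

# Constant in front of gamma*d when merging the two dimension-dependent terms.
drift_const = (math.sqrt(85.0) + 17.0) / 8.0   # roughly 3.28, bounded by 4

print(contraction, fi_coeff, inflation, drift_const)
```

Smaller values of `s` only make `fi_coeff` more negative and `inflation` closer to one, so the worst case is the relevant one for the stated bounds.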
Integrating both sides over $[k\gamma,(k+1)\gamma]$, we get
$$\begin{aligned}\mathrm{KL}(\nu_{(k+1)\gamma}\|\pi)-\mathrm{KL}(\nu_{k\gamma}\|\pi)&\leq-\frac{1}{2}\int_{k\gamma}^{(k+1)\gamma}\mathrm{FI}(\nu_t\|\pi)\,dt+\frac{17\gamma d(d+2)L_{f_1}^2}{4b}+\frac{17}{16}\gamma\mu^2L_{f_2}^2(d+3)^3\\&\quad-\frac{4\gamma}{p}\left(1+\frac{85\gamma^2L_m^2\phi(\mu)}{16}\right)\left(e_{k+1}^2-e_k^2\right)\\&\quad+\frac{51\gamma}{4}\left(C_1^2\sigma_k^2+\varepsilon_{\sigma_k}^2+(\alpha_k-1)^2C_2^2\sigma_k^{-2}\right)+16\gamma^2dL_m^2\phi(\mu),\end{aligned}\tag{92}$$
where we use the inequality $(a+b+c)^2\leq3a^2+3b^2+3c^2$ to get the second-to-last term. Let $\mathcal{L}_k\coloneqq\mathrm{KL}(\nu_{k\gamma}\|\pi)+\frac{4\gamma}{p}\left[1+\frac{85}{16}\gamma^2L_m^2\phi(\mu)\right]e_k^2$. Summing over $k=0,\ldots,N-1$, multiplying both sides by $\frac{2}{N\gamma}$, rearranging the terms, and using the fact that $\mathcal{L}_N\geq0$, we get
$$\frac{1}{N\gamma}\int_0^{N\gamma}\mathrm{FI}(\nu_t\|\pi)\,dt\leq\frac{2\mathcal{L}_0}{N\gamma}+\frac{17d(d+2)L_{f_1}^2}{2b}+\frac{17}{8}\mu^2L_{f_2}^2(d+3)^3+\bar{\sigma}^2+\bar{\varepsilon}_\sigma^2+\bar{\alpha}^2+32\gamma L_m^2d\phi(\mu),\tag{93}$$
where $\bar{\sigma}^2\coloneqq\frac{51C_1^2}{2N}\sum_{k=0}^{N-1}\sigma_k^2$, $\bar{\varepsilon}_\sigma^2\coloneqq\frac{51}{2N}\sum_{k=0}^{N-1}\varepsilon_{\sigma_k}^2$, and $\bar{\alpha}^2\coloneqq\frac{51C_2^2}{2N}\sum_{k=0}^{N-1}\frac{(\alpha_k-1)^2}{\sigma_k^2}$. Since $85\gamma^2L_m^2\phi(\mu)\leq1$, we have
$$\mathcal{L}_0=\mathrm{KL}(\nu_0\|\pi)+\frac{4\gamma}{p}\left(1+\frac{85\gamma^2L_m^2\phi(\mu)}{16}\right)e_0^2\leq\mathrm{KL}(\nu_0\|\pi)+\frac{17\gamma e_0^2}{4p}.$$
We can define $C_0\coloneqq2\mathrm{KL}(\nu_0\|\pi)+\frac{17\gamma e_0^2}{2p}$, which is a numerical constant. This completes the proof of the Fisher information upper bound.
To establish the iteration complexity, we observe that, apart from the bias induced by the score network approximation error $\bar{\varepsilon}_\sigma^2$ and the biases due to the annealing and noise schedules, $\bar{\alpha}^2$ and $\bar{\sigma}^2$, respectively, all terms in the upper bound above also appear in the upper bound of Theorem [1](https://arxiv.org/html/2605.30573#Thmtheorem1), only with different coefficients. Therefore, choosing the same parameter values, namely
$$\gamma=\frac{1}{L_mN^{3/4}d^{7/4}},\quad p=\frac{L_m}{N^{1/4}d^{1/4}},\quad b=\left\lceil\frac{1}{p}\right\rceil,\quad\text{and}\quad\mu^2=\frac{1}{L_mN^{1/4}d^{5/4}},$$
we obtain the following result:
$$\frac{1}{N\gamma}\int_0^{N\gamma}\mathrm{FI}(\nu_t\|\pi)\,dt\leq\mathcal{O}\left(\frac{d^{7/4}L_m}{N^{1/4}}\right)+\bar{\sigma}^2+\bar{\varepsilon}_\sigma^2+\bar{\alpha}^2.\tag{94}$$
We recall that the noise schedule $(\sigma_k)_{k=0}^{N-1}$ and the annealing schedule $(\alpha_k)_{k=0}^{N-1}$ are defined as
$$\alpha_k=\max\{\alpha_0\rho_1^k,1\}\quad\text{and}\quad\sigma_k=\max\{\sigma_0\rho_2^k,\sigma_{\min}\},\tag{95}$$
where $\rho_1,\rho_2\in(0,1)$ denote decay rates, $\sigma_0>0$, $\alpha_0>0$, and $\sigma_{\min}>0$ is the minimum noise level. By the definition in ([13](https://arxiv.org/html/2605.30573#S3.E13)), there exist indices $K_\alpha<N-1$ and $K_\sigma<N-1$, independent of $N$, such that $\alpha_k=1$ for all $k\geq K_\alpha$ and $\sigma_k=\sigma_{\min}$ for all $k\geq K_\sigma$.
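The schedules in (95) and the saturation indices $K_\alpha$ and $K_\sigma$ can be illustrated with a short sketch. The helper names `schedules` and `first_index_at_floor` and the numerical values below are chosen for this example only.

```python
def schedules(N, alpha0, rho1, sigma0, rho2, sigma_min):
    """Annealing and noise schedules: geometric decay clipped at a floor."""
    alphas = [max(alpha0 * rho1**k, 1.0) for k in range(N)]
    sigmas = [max(sigma0 * rho2**k, sigma_min) for k in range(N)]
    return alphas, sigmas

def first_index_at_floor(values, floor):
    """Smallest k with values[k] == floor, or None if the floor is never reached."""
    for k, v in enumerate(values):
        if v == floor:
            return k
    return None

alphas, sigmas = schedules(N=10, alpha0=4.0, rho1=0.5,
                           sigma0=1.0, rho2=0.5, sigma_min=0.1)
K_alpha = first_index_at_floor(alphas, 1.0)   # alpha_k = 1 from this index on
K_sigma = first_index_at_floor(sigmas, 0.1)   # sigma_k = sigma_min from this index on
print(alphas, sigmas, K_alpha, K_sigma)
```

Because both sequences decay geometrically, each saturation index grows only logarithmically in the ratio between the initial value and its floor.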
We next analyze the convergence behavior of the bias contributions due to the noise schedule, annealing schedule, and score network error separately\.
To bound $\bar{\sigma}^2$, we proceed as
$$\begin{aligned}\bar{\sigma}^2&=\frac{51C_1^2}{2N}\sum_{k=0}^{N-1}\max\{\sigma_0^2\rho_2^{2k},\sigma_{\min}^2\}\\&\overset{(*)}{=}\frac{51C_1^2}{2N}\sum_{k=0}^{K_\sigma-1}\sigma_0^2\rho_2^{2k}+\frac{51C_1^2}{2N}\sum_{k=K_\sigma}^{N-1}\sigma_{\min}^2\\&=\underbrace{\frac{51C_1^2\sigma_0^2\left(1-\rho_2^{2K_\sigma}\right)}{2N\left(1-\rho_2^2\right)}}_{\mathcal{O}(N^{-1})}+\frac{51C_1^2(N-K_\sigma)\sigma_{\min}^2}{2N},\end{aligned}\tag{96}$$
where $(*)$ follows from the definition of the annealing and noise schedules ([13](https://arxiv.org/html/2605.30573#S3.E13)). Setting
$$\sigma_{\min}=\mathcal{O}(N^{-1/8}),$$
we obtain
$$\bar{\sigma}^2=\mathcal{O}(N^{-1})+\mathcal{O}(N^{-1/4})=\mathcal{O}(N^{-1/4}).\tag{97}$$
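As a quick numerical check of the closed form in (96) and the resulting $N^{-1/4}$ scaling, the sketch below compares a direct evaluation of $\bar{\sigma}^2$ with the split geometric-sum expression. The values of $C_1$, $\sigma_0$, and $\rho_2$ are arbitrary, and $\sigma_{\min}=N^{-1/8}$ is one concrete instance of the stated order.

```python
import math

def sigma_bar_sq_direct(N, C1, sigma0, rho2, sigma_min):
    """Direct evaluation of (51 C1^2 / 2N) * sum_k sigma_k^2."""
    total = sum(max(sigma0 * rho2**k, sigma_min) ** 2 for k in range(N))
    return 51.0 * C1**2 * total / (2.0 * N)

def sigma_bar_sq_closed(N, C1, sigma0, rho2, sigma_min):
    """Closed form: geometric part up to K_sigma, constant floor afterwards."""
    # K_sigma: first index at which the decaying term falls to the floor.
    K = next((k for k in range(N) if sigma0 * rho2**k <= sigma_min), N)
    geometric = sigma0**2 * (1.0 - rho2 ** (2 * K)) / (1.0 - rho2**2)
    return 51.0 * C1**2 * (geometric + (N - K) * sigma_min**2) / (2.0 * N)

N = 10_000
sigma_min = N ** (-1.0 / 8.0)
direct = sigma_bar_sq_direct(N, 1.0, 1.0, 0.9, sigma_min)
closed = sigma_bar_sq_closed(N, 1.0, 1.0, 0.9, sigma_min)
normalized = direct * N**0.25   # stays bounded if sigma_bar^2 = O(N^{-1/4})
print(direct, closed, normalized)
```

The normalized value stays close to $51C_1^2/2$, which reflects the fact that the floor term dominates the geometric part for large $N$.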
To bound $\bar{\alpha}^2$, we use that $(\sigma_k)_{k=0}^{N-1}$ is a nonincreasing sequence and that there exists a constant $K_\alpha<N-1$, independent of $N$, such that $\alpha_k=1$ for all $k\geq K_\alpha$. Thus, we obtain
$$\begin{aligned}\bar{\alpha}^2&=\frac{51C_2^2}{2N}\sum_{k=0}^{N-1}\frac{(\alpha_k-1)^2}{\sigma_k^2}\\&=\frac{51C_2^2}{2N}\sum_{k=0}^{K_\alpha-1}\frac{(\alpha_0\rho_1^k-1)^2}{\sigma_k^2}\\&\leq\frac{51C_2^2}{2N\sigma_{\min}^2}\sum_{k=0}^{K_\alpha-1}(\alpha_0\rho_1^k-1)^2\\&\leq\frac{51C_2^2}{2N\sigma_{\min}^2}\sum_{k=0}^{K_\alpha-1}\left(\alpha_0^2\rho_1^{2k}-2\alpha_0\rho_1^k+1\right)\\&\leq\frac{51C_2^2\alpha_0^2}{2N\sigma_{\min}^2}\sum_{k=0}^{K_\alpha-1}\rho_1^{2k}+\frac{51C_2^2K_\alpha}{2N\sigma_{\min}^2}\\&=\underbrace{\frac{51C_2^2\alpha_0^2\left(1-\rho_1^{2K_\alpha}\right)}{2N\sigma_{\min}^2\left(1-\rho_1^2\right)}}_{\mathcal{O}(N^{-3/4})}+\underbrace{\frac{51C_2^2K_\alpha}{2N\sigma_{\min}^2}}_{\mathcal{O}(N^{-3/4})}\\&=\mathcal{O}\left(N^{-3/4}\right).\end{aligned}\tag{98}$$
Finally, to bound $\bar{\varepsilon}_\sigma^2$, under the assumption that the SGM estimation error decays as $\varepsilon_{\sigma_k}^2\leq\frac{C'}{k}$ for some constant $C'>0$, we proceed as
$$\bar{\varepsilon}_\sigma^2=\frac{51}{2N}\sum_{k=0}^{N-1}\varepsilon_{\sigma_k}^2\leq\frac{51\varepsilon_{\sigma_0}^2}{2N}+\frac{51C'}{2N}\sum_{k=1}^{N-1}\frac{1}{k}=\mathcal{O}\left(\frac{\log N}{N}\right).\tag{99}$$
Combining the results ([97](https://arxiv.org/html/2605.30573#A2.E97)), ([98](https://arxiv.org/html/2605.30573#A2.E98)), and ([99](https://arxiv.org/html/2605.30573#A2.E99)), we obtain
$$\begin{aligned}\frac{1}{N\gamma}\int_0^{N\gamma}\mathrm{FI}(\nu_t\|\pi)\,dt&=\mathcal{O}\left(\frac{d^{7/4}L_m}{N^{1/4}}\right)+\mathcal{O}\left(\frac{1}{N^{1/4}}\right)+\mathcal{O}\left(\frac{1}{N^{3/4}}\right)+\mathcal{O}\left(\frac{\log N}{N}\right)\\&=\mathcal{O}\left(\frac{d^{7/4}L_m}{N^{1/4}}\right).\end{aligned}\tag{100}$$
By the convexity of the Fisher information and Jensen's inequality, it follows that
$$\mathrm{FI}(\bar{\nu}_{N\gamma}\|\pi)=\mathcal{O}\left(\frac{d^{7/4}L_m}{N^{1/4}}\right).$$
Therefore, to obtain $\mathrm{FI}(\bar{\nu}_{N\gamma}\|\pi)\leq\varepsilon$, the sufficient number of iterations is
$$N=\mathcal{O}\left(\frac{d^7L_m^4}{\varepsilon^4}\right).$$
Note that the expected per-iteration cost is $pb+(1-p)b'=\mathcal{O}(1)$; therefore, the total number of function evaluations is $\mathcal{O}\left(\frac{d^7L_m^4}{\varepsilon^4}\right)$ as well. In the proof of Theorem [1](https://arxiv.org/html/2605.30573#Thmtheorem1), we verify that $N\geq\mathcal{O}\left(\frac{d^7L_m^4}{\varepsilon^4}\right)$ satisfies the conditions
$$p\in(0,1]\quad\text{and}\quad\gamma\leq\frac{1}{L_m\sqrt{85\phi(\mu)}}.\tag{101}$$
Therefore, all required conditions on $N$ hold under the stated iteration complexity, completing the proof. $\square$
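The last step, inverting the rate $d^{7/4}L_m/N^{1/4}\leq\varepsilon$ to obtain $N$, can be sketched as follows. The hidden constant `c` is an assumption set to one for illustration, and the function names are hypothetical.

```python
import math

def rate(N, d, L_m, c=1.0):
    """Leading-order bound c * d^{7/4} * L_m / N^{1/4}; c is an assumed constant."""
    return c * d**1.75 * L_m / N**0.25

def iterations_for_accuracy(eps, d, L_m, c=1.0):
    """Smallest integer N (up to rounding) with rate(N, d, L_m, c) <= eps."""
    return math.ceil((c * d**1.75 * L_m / eps) ** 4)

N_half = iterations_for_accuracy(eps=0.5, d=2, L_m=1.0)
N_quarter = iterations_for_accuracy(eps=0.25, d=2, L_m=1.0)
print(N_half, N_quarter, N_quarter / N_half)
```

Halving the target accuracy multiplies the iteration count by roughly sixteen, which is the practical meaning of the $\varepsilon^{-4}$ dependence.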
### B.7 Proof of Corollary [2](https://arxiv.org/html/2605.30573#Thmcorollary2)
#### Corollary [2](https://arxiv.org/html/2605.30573#Thmcorollary2)
Let $\pi\propto\ell(\bm{y}|\bm{x})p(\bm{x})$ be the posterior with the likelihood $\ell(\bm{y}|\bm{x})\propto\exp(-f(\bm{x}))$ and the prior $p(\bm{x})\propto\exp(-h(\bm{x}))$. Suppose the likelihood potential $f$ satisfies Assumptions [1](https://arxiv.org/html/2605.30573#Thmassumption1) and [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constants $L_{f_1}$ and $L_{f_2}$, respectively; the target posterior $\pi$ satisfies Assumption [3](https://arxiv.org/html/2605.30573#Thmassumption3) with constant $C_{\mathrm{PI}}>0$; the prior potential $h$ satisfies Assumption [2](https://arxiv.org/html/2605.30573#Thmassumption2) with Lipschitz constant $L_h$ and Assumption [4](https://arxiv.org/html/2605.30573#Thmassumption4); and the SGM satisfies Assumption [5](https://arxiv.org/html/2605.30573#Thmassumption5) with decreasing error $\varepsilon_{\sigma_k}=\mathcal{O}(k^{-1/2})$ for $\sigma_k>0$ and $k\geq 1$. Define $L_m\coloneqq\max\{L_{f_2}+L_h,L_{f_1}\}$. Let $(\nu_t)_{t\geq 0}$ denote the law of the interpolation generated by ([14](https://arxiv.org/html/2605.30573#S3.E14)) with the annealing and noise schedules defined in ([13](https://arxiv.org/html/2605.30573#S3.E13)). For any step size $\gamma\in\left(0,\frac{1}{L_m\sqrt{85\phi(\mu)}}\right]$, where $\phi(\mu)\coloneqq 1+\frac{4(1-p)d}{p\mu^2 b'}$, and for any $N\geq 1$, we have
‖ν¯Nγ−π‖TV2\\displaystyle\\\|\\bar\{\\nu\}\_\{N\\gamma\}\-\\pi\\\|\_\{\\mathrm\{TV\}\}^\{2\}≤16LmC0CPIdϕ\(μ\)N\+34d\(d\+2\)CPILf12b\+172μ2CPILf22\(d\+3\)3\+σ¯2\+ε¯σ2\+α¯2,\\displaystyle\\leq 16L\_\{m\}\\sqrt\{\\frac\{C\_\{0\}C\_\{\\mathrm\{PI\}\}d\\phi\(\\mu\)\}\{N\}\}\+\\frac\{34d\(d\+2\)C\_\{\\mathrm\{PI\}\}L\_\{f\_\{1\}\}^\{2\}\}\{b\}\+\\frac\{17\}\{2\}\\mu^\{2\}C\_\{\\mathrm\{PI\}\}L\_\{f\_\{2\}\}^\{2\}\(d\+3\)^\{3\}\+\\bar\{\\sigma\}^\{2\}\+\\bar\{\\varepsilon\}\_\{\\sigma\}^\{2\}\+\\bar\{\\alpha\}^\{2\},where
σ¯2≔102CPIC12N∑k=0N−1σk2,ε¯σ2≔102CPIN∑k=0N−1εσk2,andα¯2≔102CPIC22N∑k=0N−1\(αk−1\)2σk2,\\bar\{\\sigma\}^\{2\}\\coloneqq\\frac\{102C\_\{\\mathrm\{PI\}\}C\_\{1\}^\{2\}\}\{N\}\\sum\_\{k=0\}^\{N\-1\}\\sigma\_\{k\}^\{2\},\\quad\\bar\{\\varepsilon\}\_\{\\sigma\}^\{2\}\\coloneqq\\frac\{102C\_\{\\mathrm\{PI\}\}\}\{N\}\\sum\_\{k=0\}^\{N\-1\}\\varepsilon\_\{\\sigma\_\{k\}\}^\{2\},\\quad\\text\{and\}\\quad\\bar\{\\alpha\}^\{2\}\\coloneqq\\frac\{102C\_\{\\mathrm\{PI\}\}C\_\{2\}^\{2\}\}\{N\}\\sum\_\{k=0\}^\{N\-1\}\\frac\{\(\\alpha\_\{k\}\-1\)^\{2\}\}\{\\sigma\_\{k\}^\{2\}\},andC0,C1,C2C\_\{0\},C\_\{1\},C\_\{2\}are positive constants\. Furthermore, let
$$\gamma=\frac{1}{L_{m}N^{3/4}d^{7/4}},\quad p=\frac{L_{m}}{N^{1/4}d^{1/4}},\quad b=\left\lceil\frac{1}{p}\right\rceil,\quad\text{and}\quad\mu^{2}=\frac{1}{L_{m}N^{1/4}d^{5/4}},$$
and let the initial parameters satisfy $\sigma_{0}\geq\sigma_{\text{min}}$, $\varepsilon_{\sigma_{0}}>0$, and $\alpha_{0}\geq 1$, where $\sigma_{\text{min}}>0$ is the minimum noise level. Then, to achieve $\|\bar{\nu}_{N\gamma}-\pi\|_{\mathrm{TV}}^{2}\leq\varepsilon$ for the time-averaged law $\bar{\nu}_{N\gamma}\coloneqq\frac{1}{N\gamma}\int_{0}^{N\gamma}\nu_{t}\,dt$, $\mathcal{O}\left(\frac{d^{7}L_{m}^{4}C_{\mathrm{PI}}^{4}}{\varepsilon^{4}}\right)$ iterations with $\mathcal{O}(1)$ function evaluations per iteration are sufficient.
Proof. Recall from the proof of Theorem [3](https://arxiv.org/html/2605.30573#Thmtheorem3) that inequality ([93](https://arxiv.org/html/2605.30573#A2.E93)) reads
1Nγ∫0NγFI\(νt∥π\)𝑑t\\displaystyle\\frac\{1\}\{N\\gamma\}\\int\_\{0\}^\{N\\gamma\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)dt≤C0Nγ\+17d\(d\+2\)Lf122b\+178μ2Lf22\(d\+3\)3\+σ¯2\+ε¯σ2\+α¯2\+32γLm2dϕ\(μ\),\\displaystyle\\leq\\frac\{C\_\{0\}\}\{N\\gamma\}\+\\frac\{17d\(d\+2\)L\_\{f\_\{1\}\}^\{2\}\}\{2b\}\+\\frac\{17\}\{8\}\\mu^\{2\}L\_\{f\_\{2\}\}^\{2\}\(d\+3\)^\{3\}\+\\bar\{\\sigma\}^\{2\}\+\\bar\{\\varepsilon\}\_\{\\sigma\}^\{2\}\+\\bar\{\\alpha\}^\{2\}\+32\\gamma L\_\{m\}^\{2\}d\\phi\(\\mu\),whereC0=2KL\(ν0∥π\)\+17γe022pC\_\{0\}=2\\mathrm\{KL\}\(\\nu\_\{0\}\\\|\\pi\)\+\\frac\{17\\gamma e\_\{0\}^\{2\}\}\{2p\}\. By the convexity of the Fisher information, we have
FI\(ν¯Nγ∥π\)\\displaystyle\\mathrm\{FI\}\(\\bar\{\\nu\}\_\{N\\gamma\}\\\|\\pi\)≤1Nγ∫0NγFI\(νt∥π\)𝑑t\\displaystyle\\leq\\frac\{1\}\{N\\gamma\}\\int\_\{0\}^\{N\\gamma\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)dt≤C0Nγ\+17d\(d\+2\)Lf122b\+178μ2Lf22\(d\+3\)3\+σ¯2\+ε¯σ2\+α¯2\+32γLm2dϕ\(μ\)\\displaystyle\\leq\\frac\{C\_\{0\}\}\{N\\gamma\}\+\\frac\{17d\(d\+2\)L\_\{f\_\{1\}\}^\{2\}\}\{2b\}\+\\frac\{17\}\{8\}\\mu^\{2\}L\_\{f\_\{2\}\}^\{2\}\(d\+3\)^\{3\}\+\\bar\{\\sigma\}^\{2\}\+\\bar\{\\varepsilon\}\_\{\\sigma\}^\{2\}\+\\bar\{\\alpha\}^\{2\}\+32\\gamma L\_\{m\}^\{2\}d\\phi\(\\mu\)whereν¯Nγ≔1Nγ∫0Nγνt𝑑t\\bar\{\\nu\}\_\{N\\gamma\}\\coloneqq\\frac\{1\}\{N\\gamma\}\\int\_\{0\}^\{N\\gamma\}\\nu\_\{t\}dt\. By Assumption[3](https://arxiv.org/html/2605.30573#Thmassumption3), we invoke Lemma[4](https://arxiv.org/html/2605.30573#Thmlemma4)and get
‖ν¯Nγ−π‖TV2\\displaystyle\\\|\\bar\{\\nu\}\_\{N\\gamma\}\-\\pi\\\|^\{2\}\_\{\\mathrm\{TV\}\}≤4CPIFI\(ν¯Nγ∥π\)\\displaystyle\\leq 4C\_\{\\mathrm\{PI\}\}\\mathrm\{FI\}\(\\bar\{\\nu\}\_\{N\\gamma\}\\\|\\pi\)≤4C0CPINγ\+34d\(d\+2\)CPILf12b\+172μ2CPILf22\(d\+3\)3\+4CPI\(σ¯2\+ε¯σ2\+α¯2\)\\displaystyle\\leq\\frac\{4C\_\{0\}C\_\{\\mathrm\{PI\}\}\}\{N\\gamma\}\+\\frac\{34d\(d\+2\)C\_\{\\mathrm\{PI\}\}L\_\{f\_\{1\}\}^\{2\}\}\{b\}\+\\frac\{17\}\{2\}\\mu^\{2\}C\_\{\\mathrm\{PI\}\}L\_\{f\_\{2\}\}^\{2\}\(d\+3\)^\{3\}\+4C\_\{\\mathrm\{PI\}\}\(\\bar\{\\sigma\}^\{2\}\+\\bar\{\\varepsilon\}\_\{\\sigma\}^\{2\}\+\\bar\{\\alpha\}^\{2\}\)\+128γCPILm2dϕ\(μ\)\.\\displaystyle\\quad\+128\\gamma C\_\{\\mathrm\{PI\}\}L\_\{m\}^\{2\}d\\phi\(\\mu\)\.\(102\)Let
$$\gamma=\frac{1}{L_{m}N^{3/4}d^{7/4}},\quad p=\frac{L_{m}}{N^{1/4}d^{1/4}},\quad b=\left\lceil\frac{1}{p}\right\rceil,\quad\text{and}\quad\mu^{2}=\frac{1}{L_{m}N^{1/4}d^{5/4}}.$$
Substituting these choices into ([102](https://arxiv.org/html/2605.30573#A2.E102)) and following the derivation steps from ([94](https://arxiv.org/html/2605.30573#A2.E94)) to ([100](https://arxiv.org/html/2605.30573#A2.E100)) in the proof of Theorem [3](https://arxiv.org/html/2605.30573#Thmtheorem3), we obtain
$$\|\bar{\nu}_{N\gamma}-\pi\|^{2}_{\mathrm{TV}}=\mathcal{O}\left(\frac{C_{\mathrm{PI}}d^{7/4}L_{m}}{N^{1/4}}+\frac{C_{\mathrm{PI}}\log N}{N}\right)\tag{103}$$
provided that the annealing and noise schedules are defined as in ([13](https://arxiv.org/html/2605.30573#S3.E13)) and the SGM estimation error obeys $\varepsilon_{\sigma_{k}}=\mathcal{O}(k^{-1/2})$ for all $k\geq 1$, with initial value $\varepsilon_{\sigma_{0}}>0$. Therefore, to achieve $\|\bar{\nu}_{N\gamma}-\pi\|_{\mathrm{TV}}^{2}\leq\varepsilon$, the sufficient number of iterations is
$$N=\mathcal{O}\left(\frac{C_{\mathrm{PI}}^{4}d^{7}L_{m}^{4}}{\varepsilon^{4}}\right).$$
Furthermore, the per-iteration computation cost is $pb+(1-p)b^{\prime}=\mathcal{O}(1)$; therefore, the total number of function evaluations is $\mathcal{O}\left(\frac{C_{\mathrm{PI}}^{4}d^{7}L_{m}^{4}}{\varepsilon^{4}}\right)$ as well. $\square$
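The parameter choices of Corollary 2 can be tabulated for a given horizon $N$. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and `const` stands in for the unspecified constant hidden in the $\mathcal{O}(\cdot)$ bound.

```python
import math


def corollary2_parameters(N, d, L_m):
    """Step size, refresh probability, batch size and smoothing level
    from Corollary 2 for a horizon of N iterations (illustrative helper)."""
    # root = N^{1/4} d^{1/4}, computed with nested square roots
    root = math.sqrt(math.sqrt(N)) * math.sqrt(math.sqrt(d))
    p = L_m / root
    if not 0 < p <= 1:
        raise ValueError("N is too small: p must lie in (0, 1].")
    gamma = 1.0 / (L_m * root**3 * d)   # 1 / (L_m N^{3/4} d^{7/4})
    b = math.ceil(root / L_m)           # equals ceil(1 / p)
    mu_sq = 1.0 / (L_m * root * d)      # 1 / (L_m N^{1/4} d^{5/4})
    return {"gamma": gamma, "p": p, "b": b, "mu_sq": mu_sq}


def iterations_for_accuracy(eps, d, L_m, C_PI, const=1.0):
    """N of order C_PI^4 d^7 L_m^4 / eps^4; `const` is an assumed
    placeholder for the constant hidden in the O(.) notation."""
    return math.ceil(const * C_PI**4 * d**7 * L_m**4 / eps**4)
```

For example, `corollary2_parameters(10_000, 1, 1.0)` gives a refresh probability of 0.1 and a large batch of size 10, so the expected large-batch cost per iteration, $pb$, stays at order one.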
### B\.8Proof of Theorem[4](https://arxiv.org/html/2605.30573#Thmtheorem4)
#### Theorem[4](https://arxiv.org/html/2605.30573#Thmtheorem4)
Letπ∝ℓ\(𝐲\|𝐱\)p\(𝐱\)\\pi\\propto\\ell\(\{\\bm\{y\}\}\|\{\\bm\{x\}\}\)p\(\{\\bm\{x\}\}\)be the posterior with the likelihoodℓ\(𝐲\|𝐱\)∝exp\(−f\(𝐱\)\)\\ell\(\{\\bm\{y\}\}\|\{\\bm\{x\}\}\)\\propto\\mathrm\{exp\}\(\-f\(\{\\bm\{x\}\}\)\)and the priorp\(𝐱\)∝exp\(−h\(𝐱\)\)p\(\{\\bm\{x\}\}\)\\propto\\mathrm\{exp\}\(\-h\(\{\\bm\{x\}\}\)\)\. Suppose the likelihood potentialffsatisfies Assumptions[1](https://arxiv.org/html/2605.30573#Thmassumption1),[2](https://arxiv.org/html/2605.30573#Thmassumption2)with Lipschitz constantsLf1L\_\{f\_\{1\}\},Lf2L\_\{f\_\{2\}\}, respectively, the prior potentialhhsatisfies Assumption[2](https://arxiv.org/html/2605.30573#Thmassumption2)with Lipschitz constantLhL\_\{h\}and Assumption[4](https://arxiv.org/html/2605.30573#Thmassumption4), and the SGM satisfies Assumption[5](https://arxiv.org/html/2605.30573#Thmassumption5)with decreasing errorεσk=𝒪\(k−1/2\)\\varepsilon\_\{\\sigma\_\{k\}\}=\\mathcal\{O\}\(k^\{\-1/2\}\)forσk\>0\\sigma\_\{k\}\>0, andk≥1k\\geq 1\. DefineLm≔max\{Lf2\+Lh,Lf1\}L\_\{m\}\\coloneqq\\max\\\{L\_\{f\_\{2\}\}\+L\_\{h\},L\_\{f\_\{1\}\}\\\}\. Let\(νt\)t≥0\(\\nu\_\{t\}\)\_\{t\\geq 0\}denote the law of interpolation generated by \([14](https://arxiv.org/html/2605.30573#S3.E14)\) with annealing and noise schedules defined in \([13](https://arxiv.org/html/2605.30573#S3.E13)\)\. Define the time\-varying parameters as follows
$$\gamma_{k}=\frac{C_{\gamma}}{k^{3/2}},\quad p_{k}=\frac{1}{2k^{1/2}},\quad b_{k}=\left\lceil\frac{1}{p_{k}}\right\rceil,\quad\text{and}\quad\mu_{k}=\frac{C_{\mu}}{k^{1/8}},$$
where $C_{\gamma},C_{\mu}$ are positive constants. Let the initial parameters satisfy $\sigma_{0}\geq\sigma_{\text{min}}$, $\varepsilon_{\sigma_{0}}>0$, and $\alpha_{0}\geq 1$, where $\sigma_{\text{min}}>0$ is the minimum noise level. Then the time-averaged law $\bar{\nu}_{\tau_{n}}\coloneqq\frac{1}{\tau_{n}}\int_{0}^{\tau_{n}}\nu_{t}\,dt$, where $\tau_{n}\coloneqq\sum_{k=1}^{n}\gamma_{k}$, converges weakly to $\pi$.
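The time-varying parameters of Theorem 4 can be checked numerically against the step-size condition $\gamma_{k}\leq 1/(L_{m}\sqrt{85\phi_{k}(\mu_{k})})$ that the proof verifies. The sketch below is illustrative only: the function names are hypothetical, and it assumes the boundary choice $C_{\mu}=\sqrt{8d/b^{\prime}}$ together with $C_{\gamma}=\frac{C_{\mu}}{L_{m}}\sqrt{b^{\prime}/(1360d)}$ from the proof.

```python
import math


def theorem4_schedules(k, d, b_prime, L_m):
    """Return (gamma_k, p_k, b_k, mu_k) for iteration k >= 1."""
    C_mu = math.sqrt(8 * d / b_prime)                       # assumed boundary choice
    C_gamma = (C_mu / L_m) * math.sqrt(b_prime / (1360 * d))
    gamma_k = C_gamma / k**1.5
    p_k = 1.0 / (2.0 * math.sqrt(k))
    b_k = math.ceil(2.0 * math.sqrt(k))                     # equals ceil(1 / p_k)
    mu_k = C_mu / k**0.125
    return gamma_k, p_k, b_k, mu_k


def phi(p, mu, d, b_prime):
    """phi(mu) = 1 + 4 (1 - p) d / (p mu^2 b')."""
    return 1.0 + 4.0 * (1.0 - p) * d / (p * mu**2 * b_prime)


def step_size_condition_holds(k, d, b_prime, L_m):
    """Check gamma_k <= 1 / (L_m sqrt(85 phi_k(mu_k)))."""
    gamma_k, p_k, _, mu_k = theorem4_schedules(k, d, b_prime, L_m)
    return gamma_k <= 1.0 / (L_m * math.sqrt(85.0 * phi(p_k, mu_k, d, b_prime)))
```

Calling `step_size_condition_holds(k, d, b_prime, L_m)` over a range of `k` gives a quick sanity check of the constants before running a sampler with these schedules.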
Proof\.Given time\-varying parametersγk,bk,pk,μk\\gamma\_\{k\},b\_\{k\},p\_\{k\},\\mu\_\{k\}at iterationkk, define the cumulative timeτn\\tau\_\{n\}and averaged lawν¯τn\\bar\{\\nu\}\_\{\\tau\_\{n\}\}at iterationnnas
τn≔∑k=1nγk,ν¯τn≔1τn∫0τnνt𝑑t,\\tau\_\{n\}\\coloneqq\\sum\_\{k=1\}^\{n\}\\gamma\_\{k\},\\qquad\\bar\{\\nu\}\_\{\\tau\_\{n\}\}\\coloneqq\\frac\{1\}\{\\tau\_\{n\}\}\\int\_\{0\}^\{\\tau\_\{n\}\}\\nu\_\{t\}\\,dt,whereνt\\nu\_\{t\}denotes the law of the process𝒙t\{\\bm\{x\}\}\_\{t\}under the following continuous\-time interpolation
$$\bm{x}_{t}\coloneqq\bm{x}_{\tau_{n-1}}-(t-\tau_{n-1})\,\bigl(\bm{g}_{n}-\alpha_{n}\mathcal{S}_{\theta}(\bm{x}_{\tau_{n}},\sigma_{n})\bigr)+\sqrt{2}\,\bigl(\bm{B}_{t}-\bm{B}_{\tau_{n-1}}\bigr),\qquad t\in[\tau_{n-1},\tau_{n}].\tag{104}$$
Here, $\bm{g}_{n}$ is defined as
$$\bm{g}_{n}:=\begin{cases}\displaystyle\frac{1}{b_{n}}\sum_{i=1}^{b_{n}}\widetilde{\nabla}f_{\mu_{n}}(\bm{x}_{\tau_{n}},\bm{u}_{n}^{i})&\text{w.p. }p_{n},\\[4.3pt]\displaystyle\bm{g}_{n-1}+\frac{1}{b^{\prime}}\sum_{i=1}^{b^{\prime}}\left(\widetilde{\nabla}f_{\mu_{n}}(\bm{x}_{\tau_{n}},\bm{u}_{n}^{i})-\widetilde{\nabla}f_{\mu_{n}}(\bm{x}_{\tau_{n-1}},\bm{u}_{n}^{i})\right)&\text{w.p. }1-p_{n},\end{cases}\tag{105}$$
for all $n\geq 1$. At initialization, we choose $\gamma_{0},b_{0},\mu_{0}>0$ and $p_{0}\in(0,1]$, and define
$$\bm{g}_{0}\coloneqq\frac{1}{b_{0}}\sum_{i=1}^{b_{0}}\widetilde{\nabla}f_{\mu_{0}}(\bm{x}_{0},\bm{u}_{0}^{i}),\quad\bm{u}_{0}^{i}\sim\mathcal{N}(0,I).$$
We establish weak convergence by adapting the proof of Theorem [3](https://arxiv.org/html/2605.30573#Thmtheorem3). To this end, we first verify that the step sizes $(\gamma_{k})_{k\geq 1}$ satisfy the step-size condition of Theorem [3](https://arxiv.org/html/2605.30573#Thmtheorem3), namely,
γk∈\(0,1Lm85ϕk\(μk\)\],whereϕk\(μk\)=1\+4\(1−pk\)dpkμk2b′\\gamma\_\{k\}\\in\\left\(0,\\frac\{1\}\{L\_\{m\}\\sqrt\{85\\phi\_\{k\}\(\\mu\_\{k\}\)\}\}\\right\],\\quad\\text\{where\}\\quad\\phi\_\{k\}\(\\mu\_\{k\}\)=1\+\\frac\{4\(1\-p\_\{k\}\)d\}\{p\_\{k\}\\mu\_\{k\}^\{2\}b^\{\\prime\}\}\(106\)fork≥1k\\geq 1\. Using the definitions ofμk\\mu\_\{k\}andpkp\_\{k\}, we have
ϕk\(μk\)\\displaystyle\\phi\_\{k\}\(\\mu\_\{k\}\)=1\+4\(1−pk\)dpkμk2b′\\displaystyle=1\+\\frac\{4\(1\-p\_\{k\}\)d\}\{p\_\{k\}\\mu\_\{k\}^\{2\}b^\{\\prime\}\}≤1\+4dpkμk2b′\\displaystyle\\leq 1\+\\frac\{4d\}\{p\_\{k\}\\mu\_\{k\}^\{2\}b^\{\\prime\}\}≤1\+8dk3/4b′Cμ2\\displaystyle\\leq 1\+\\frac\{8dk^\{3/4\}\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}\}≤16dk3/4b′Cμ2,\\displaystyle\\leq\\frac\{16dk^\{3/4\}\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}\},\(107\)where the last inequality is satisfied by selecting a constantCμC\_\{\\mu\}such that
$$C_{\mu}\leq\sqrt{\frac{8d}{b^{\prime}}},$$
so that $\frac{8dk^{3/4}}{b^{\prime}C_{\mu}^{2}}\geq 1$ for all $k\geq 1$. Then, we have
1Lm85ϕk\(μk\)≥1Lm1360dk3/4b′Cμ2=CμLmk3/8b′1360d\.\\displaystyle\\frac\{1\}\{L\_\{m\}\\sqrt\{85\\phi\_\{k\}\(\\mu\_\{k\}\)\}\}\\geq\\frac\{1\}\{L\_\{m\}\\sqrt\{\\frac\{1360dk^\{3/4\}\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}\}\}\}=\\frac\{C\_\{\\mu\}\}\{L\_\{m\}k^\{3/8\}\}\\sqrt\{\\frac\{b^\{\\prime\}\}\{1360d\}\}\.Choosing
Cγ=CμLmb′1360d,C\_\{\\gamma\}=\\frac\{C\_\{\\mu\}\}\{L\_\{m\}\}\\sqrt\{\\frac\{b^\{\\prime\}\}\{1360d\}\},we have
γk=Cγk3/2≤CμLmk3/8b′1360d≤1Lm85ϕk\(μk\),\\gamma\_\{k\}=\\frac\{C\_\{\\gamma\}\}\{k^\{3/2\}\}\\leq\\frac\{C\_\{\\mu\}\}\{L\_\{m\}k^\{3/8\}\}\\sqrt\{\\frac\{b^\{\\prime\}\}\{1360d\}\}\\leq\\frac\{1\}\{L\_\{m\}\\sqrt\{85\\phi\_\{k\}\(\\mu\_\{k\}\)\}\},which implies \([106](https://arxiv.org/html/2605.30573#A2.E106)\) holds fork≥1k\\geq 1\. Furthermore, we note thatpk∈\(0,1\]p\_\{k\}\\in\(0,1\]holds for allk≥1k\\geq 1by definition\. Therefore, we can follow the same steps up to \([B\.6](https://arxiv.org/html/2605.30573#A2.Ex100)\) in Theorem[3](https://arxiv.org/html/2605.30573#Thmtheorem3)since the same arguments remain valid fort∈\[τn−1,τn\]t\\in\[\\tau\_\{n\-1\},\\tau\_\{n\}\]with time\-varying parameters\. We obtain
KL\(ντn∥π\)−KL\(ντn−1∥π\)\\displaystyle\\mathrm\{KL\}\(\\nu\_\{\\tau\_\{n\}\}\\\|\\pi\)\-\\mathrm\{KL\}\(\\nu\_\{\\tau\_\{n\-1\}\}\\\|\\pi\)≤−12∫τn−1τnFI\(νt∥π\)𝑑t\+17γnd\(d\+2\)Lf124bn\+1716γnμn2Lf22\(d\+3\)3\\displaystyle\\leq\-\\frac\{1\}\{2\}\\int\_\{\\tau\_\{n\-1\}\}^\{\\tau\_\{n\}\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)dt\+\\frac\{17\\gamma\_\{n\}d\(d\+2\)L\_\{f\_\{1\}\}^\{2\}\}\{4b\_\{n\}\}\+\\frac\{17\}\{16\}\\gamma\_\{n\}\\mu\_\{n\}^\{2\}L\_\{f\_\{2\}\}^\{2\}\(d\+3\)^\{3\}−4γnpn\(1\+85γn2Lm2ϕn\(μn\)16\)\(en2−en−12\)\\displaystyle\\quad\-\\frac\{4\\gamma\_\{n\}\}\{p\_\{n\}\}\\left\(1\+\\frac\{85\\gamma\_\{n\}^\{2\}L\_\{m\}^\{2\}\\phi\_\{n\}\(\\mu\_\{n\}\)\}\{16\}\\right\)\\left\(e\_\{n\}^\{2\}\-e\_\{n\-1\}^\{2\}\\right\)\+51γn4\(C12σn2\+εσn2\+\(αn−1\)2C22σn−2\)\+16γn2dLm2ϕn\(μn\)\\displaystyle\\quad\+\\frac\{51\\gamma\_\{n\}\}\{4\}\(C\_\{1\}^\{2\}\\sigma\_\{n\}^\{2\}\+\\varepsilon\_\{\\sigma\_\{n\}\}^\{2\}\+\(\\alpha\_\{n\}\-1\)^\{2\}C\_\{2\}^\{2\}\\sigma\_\{n\}^\{\-2\}\)\+16\\gamma\_\{n\}^\{2\}dL\_\{m\}^\{2\}\\phi\_\{n\}\(\\mu\_\{n\}\)\(108\)fort∈\[τn−1,τn\]t\\in\[\\tau\_\{n\-1\},\\tau\_\{n\}\]\. Plugging the definitions ofγk\\gamma\_\{k\},bkb\_\{k\},pkp\_\{k\}, andμk\\mu\_\{k\}given in Theorem[4](https://arxiv.org/html/2605.30573#Thmtheorem4)into \([108](https://arxiv.org/html/2605.30573#A2.E108)\), we get
KL\(ντn∥π\)−KL\(ντn−1∥π\)\\displaystyle\\mathrm\{KL\}\(\\nu\_\{\\tau\_\{n\}\}\\\|\\pi\)\-\\mathrm\{KL\}\(\\nu\_\{\\tau\_\{n\-1\}\}\\\|\\pi\)≤−12∫τn−1τnFI\(νt∥π\)𝑑t\+A1n2\+A2n7/4\+A3n9/4−cn\(en2−en−12\)\\displaystyle\\leq\-\\frac\{1\}\{2\}\\int\_\{\\tau\_\{n\-1\}\}^\{\\tau\_\{n\}\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)dt\+\\frac\{A\_\{1\}\}\{n^\{2\}\}\+\\frac\{A\_\{2\}\}\{n^\{7/4\}\}\+\\frac\{A\_\{3\}\}\{n^\{9/4\}\}\-c\_\{n\}\\left\(e\_\{n\}^\{2\}\-e\_\{n\-1\}^\{2\}\\right\)\+51γn4\(C12σn2\+εσn2\+\(αn−1\)2σn−2C22\),\\displaystyle\\quad\+\\frac\{51\\gamma\_\{n\}\}\{4\}\(C\_\{1\}^\{2\}\\sigma\_\{n\}^\{2\}\+\\varepsilon\_\{\\sigma\_\{n\}\}^\{2\}\+\(\\alpha\_\{n\}\-1\)^\{2\}\\sigma\_\{n\}^\{\-2\}C\_\{2\}^\{2\}\),\(109\)where
A1≔17d\(d\+2\)Lf12Cγ8,A2≔17Lf22\(d\+3\)3CγCμ216,A3≔256d2Lm2Cγ2b′Cμ2,A\_\{1\}\\coloneqq\\frac\{17d\(d\+2\)L\_\{f\_\{1\}\}^\{2\}C\_\{\\gamma\}\}\{8\},\\quad A\_\{2\}\\coloneqq\\frac\{17L\_\{f\_\{2\}\}^\{2\}\(d\+3\)^\{3\}C\_\{\\gamma\}C\_\{\\mu\}^\{2\}\}\{16\},\\quad A\_\{3\}\\coloneqq\\frac\{256d^\{2\}L\_\{m\}^\{2\}C\_\{\\gamma\}^\{2\}\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}\},and
$$c_{n}\coloneqq\frac{\gamma_{n}}{p_{n}}\left(4+\frac{85\gamma_{n}^{2}L_{m}^{2}\phi_{n}(\mu_{n})}{4}\right),$$
for $n\geq 1$, which is the coefficient of $\left(e_{n}^{2}-e_{n-1}^{2}\right)$ in ([108](https://arxiv.org/html/2605.30573#A2.E108)). We use ([107](https://arxiv.org/html/2605.30573#A2.E107)) to obtain $A_{3}$. Iterating the bound in ([109](https://arxiv.org/html/2605.30573#A2.E109)), we obtain
$$\begin{aligned}\mathrm{KL}(\nu_{\tau_{n}}\|\pi)\leq\mathrm{KL}(\nu_{0}\|\pi)&-\frac{1}{2}\int_{0}^{\tau_{n}}\mathrm{FI}(\nu_{t}\|\pi)\,dt+A_{1}S_{1}+A_{2}S_{2}+A_{3}S_{3}-\sum_{k=1}^{n}c_{k}\left(e_{k}^{2}-e_{k-1}^{2}\right)\\&+\frac{51}{4}\sum_{k=1}^{n}\gamma_{k}\left(C_{1}^{2}\sigma_{k}^{2}+\varepsilon_{\sigma_{k}}^{2}+(\alpha_{k}-1)^{2}C_{2}^{2}\sigma_{k}^{-2}\right),\end{aligned}\tag{110}$$
where we have
$$\sum_{k=1}^{n}k^{-2}\leq\underbrace{\sum_{k=1}^{\infty}k^{-2}}_{\coloneqq S_{1}}<\infty,\quad\sum_{k=1}^{n}k^{-7/4}\leq\underbrace{\sum_{k=1}^{\infty}k^{-7/4}}_{\coloneqq S_{2}}<\infty,\quad\sum_{k=1}^{n}k^{-9/4}\leq\underbrace{\sum_{k=1}^{\infty}k^{-9/4}}_{\coloneqq S_{3}}<\infty.$$
Thus, $A_{1}S_{1}$, $A_{2}S_{2}$, and $A_{3}S_{3}$ are uniformly bounded constants that are independent of $n$. Furthermore, if $(c_{n})_{n\geq 1}$ is a nonnegative and decreasing sequence (i.e., $0\leq c_{n+1}<c_{n}$), we can bound the summation in ([110](https://arxiv.org/html/2605.30573#A2.E110)) as follows:
$$-\sum_{k=1}^{n}c_{k}\left(e_{k}^{2}-e_{k-1}^{2}\right)=c_{1}e_{0}^{2}+\sum_{k=1}^{n-1}(c_{k+1}-c_{k})e_{k}^{2}-c_{n}e_{n}^{2}\leq c_{1}e_{0}^{2}.\tag{111}$$
Thus, it remains to show that $(c_{n})_{n\geq 1}$ is a nonnegative and decreasing sequence. Substituting the parameters $\gamma_{n}$, $b_{n}$, $p_{n}$, and $\mu_{n}$ into the definition of $c_{n}$, we obtain
$$c_{n}=\frac{8C_{\gamma}}{n}+\frac{85L_{m}^{2}C_{\gamma}^{3}}{2}\frac{1}{n^{4}}+\frac{340dL_{m}^{2}C_{\gamma}^{3}}{b^{\prime}C_{\mu}^{2}}\frac{1}{n^{13/4}}\left(1-\frac{1}{2n^{1/2}}\right)$$
for $n\geq 1$. Let $F\coloneqq\frac{85L_{m}^{2}C_{\gamma}^{3}}{2}$, which is a constant that does not depend on $n$. Then, we have
cn=8Cγn\+Fn4\+8Fdb′Cμ2n13/4\(1−12n1/2\)\.c\_\{n\}=\\frac\{8C\_\{\\gamma\}\}\{n\}\+\\frac\{F\}\{n^\{4\}\}\+\\frac\{8Fd\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}n^\{13/4\}\}\\left\(1\-\\frac\{1\}\{2n^\{1/2\}\}\\right\)\.Nonnegativity is immediate since1−12n1/2\>01\-\\frac\{1\}\{2n^\{1/2\}\}\>0forn≥1n\\geq 1\. Therefore, all the terms incnc\_\{n\}are nonnegative forn≥1n\\geq 1\. To show\(cn\)n≥1\(c\_\{n\}\)\_\{n\\geq 1\}is decreasing, define the continuous extensionc\(x\)c\(x\), forx≥1x\\geq 1, as
c\(x\)=8Cγx\+Fx4\+8Fdb′Cμ2x13/4\(1−12x1/2\)\.c\(x\)=\\frac\{8C\_\{\\gamma\}\}\{x\}\+\\frac\{F\}\{x^\{4\}\}\+\\frac\{8Fd\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}x^\{13/4\}\}\\left\(1\-\\frac\{1\}\{2x^\{1/2\}\}\\right\)\.Taking the derivative yields
c′\(x\)=−8Cγx2−4Fx5−26Fdb′Cμ2x17/4\(1−1526x1/2\)\.c^\{\\prime\}\(x\)=\-\\frac\{8C\_\{\\gamma\}\}\{x^\{2\}\}\-\\frac\{4F\}\{x^\{5\}\}\-\\frac\{26Fd\}\{b^\{\\prime\}C\_\{\\mu\}^\{2\}x^\{17/4\}\}\\left\(1\-\\frac\{15\}\{26x^\{1/2\}\}\\right\)\.Forx≥1x\\geq 1, we have
1−1526x1/2\>0,1\-\\frac\{15\}\{26x^\{1/2\}\}\>0,and hencec′\(x\)<0c^\{\\prime\}\(x\)<0\. Therefore,c\(x\)c\(x\)is decreasing on\[1,∞\)\[1,\\infty\), which implies that\(cn\)n≥1\(c\_\{n\}\)\_\{n\\geq 1\}is decreasing\. Hence, using the upper bound \([111](https://arxiv.org/html/2605.30573#A2.E111)\) in \([110](https://arxiv.org/html/2605.30573#A2.E110)\), we get
KL\(ντn∥π\)≤KL\(ν0∥π\)\\displaystyle\\mathrm\{KL\}\(\\nu\_\{\\tau\_\{n\}\}\\\|\\pi\)\\leq\\mathrm\{KL\}\(\\nu\_\{0\}\\\|\\pi\)−12∫0τnFI\(νt∥π\)𝑑t\+A1S1\+A2S2\+A3S3\+c1e02\\displaystyle\-\\frac\{1\}\{2\}\\int\_\{0\}^\{\\tau\_\{n\}\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)dt\+A\_\{1\}S\_\{1\}\+A\_\{2\}S\_\{2\}\+A\_\{3\}S\_\{3\}\+c\_\{1\}e\_\{0\}^\{2\}\+514∑k=1nγk\(C12σk2\+εσk2\+\(αk−1\)2C22σk−2\)\.\\displaystyle\+\\frac\{51\}\{4\}\\sum\_\{k=1\}^\{n\}\\gamma\_\{k\}\(C\_\{1\}^\{2\}\\sigma\_\{k\}^\{2\}\+\\varepsilon\_\{\\sigma\_\{k\}\}^\{2\}\+\(\\alpha\_\{k\}\-1\)^\{2\}C\_\{2\}^\{2\}\\sigma\_\{k\}^\{\-2\}\)\.\(112\)We next show that the last term in \([112](https://arxiv.org/html/2605.30573#A2.E112)\), capturing the effects of the noise and annealing schedules and the SGM error, is uniformly bounded\.
For the noise schedule\(σk\)k=1n\(\\sigma\_\{k\}\)\_\{k=1\}^\{n\}, using monotonicity, we obtain
\begin{align}\frac{51C_{1}^{2}}{4}\sum_{k=1}^{n}\gamma_{k}\sigma_{k}^{2}&\leq\frac{51C_{1}^{2}\sigma_{0}^{2}}{4}\sum_{k=1}^{n}\gamma_{k}=\frac{51C_{1}^{2}\sigma_{0}^{2}C_{\gamma}}{4}\sum_{k=1}^{n}\frac{1}{k^{3/2}}\leq\underbrace{\frac{51C_{1}^{2}\sigma_{0}^{2}C_{\gamma}}{4}\sum_{k=1}^{\infty}\frac{1}{k^{3/2}}}_{\coloneqq C_{\sigma}}\tag{113}\\&<\infty.\tag{114}\end{align}
For the annealing and noise schedules\(αk\)k=1n\(\\alpha\_\{k\}\)\_\{k=1\}^\{n\}and\(σk\)k=1n\(\\sigma\_\{k\}\)\_\{k=1\}^\{n\}, respectively, using monotonicity together with the boundsαk≤α0\\alpha\_\{k\}\\leq\\alpha\_\{0\}andσk≥σmin\\sigma\_\{k\}\\geq\\sigma\_\{\\text\{min\}\}, we obtain
\begin{align}\frac{51C_{2}^{2}}{4}\sum_{k=1}^{n}\frac{\gamma_{k}(\alpha_{k}-1)^{2}}{\sigma_{k}^{2}}&\leq\frac{51C_{2}^{2}(\alpha_{0}-1)^{2}}{4\sigma_{\text{min}}^{2}}\sum_{k=1}^{n}\gamma_{k}\leq\underbrace{\frac{51C_{2}^{2}(\alpha_{0}-1)^{2}C_{\gamma}}{4\sigma_{\text{min}}^{2}}\sum_{k=1}^{\infty}\frac{1}{k^{3/2}}}_{\coloneqq C_{\alpha}}\tag{115}\\&<\infty.\tag{116}\end{align}
Lastly, for the SGM error decay\(εσk\)k=1n\(\\varepsilon\_\{\\sigma\_\{k\}\}\)\_\{k=1\}^\{n\}, we have
514∑k=1nγkεσk2≤514∑k=1∞𝒪\(1k5/2\)⏟≔Cεσ<∞\.\\frac\{51\}\{4\}\\sum\_\{k=1\}^\{n\}\\gamma\_\{k\}\\varepsilon\_\{\\sigma\_\{k\}\}^\{2\}\\leq\\underbrace\{\\frac\{51\}\{4\}\\sum\_\{k=1\}^\{\\infty\}\\mathcal\{O\}\\left\(\\frac\{1\}\{k^\{5/2\}\}\\right\)\}\_\{\\coloneqq C\_\{\\varepsilon\_\{\\sigma\}\}\}<\\infty\.\(117\)Using the upper bounds \([113](https://arxiv.org/html/2605.30573#A2.E113)\), \([115](https://arxiv.org/html/2605.30573#A2.E115)\), and \([117](https://arxiv.org/html/2605.30573#A2.E117)\) to upper bound the last term in \([112](https://arxiv.org/html/2605.30573#A2.E112)\), we obtain
KL\(ντn∥π\)≤KL\(ν0∥π\)\\displaystyle\\mathrm\{KL\}\(\\nu\_\{\\tau\_\{n\}\}\\\|\\pi\)\\leq\\mathrm\{KL\}\(\\nu\_\{0\}\\\|\\pi\)−12∫0τnFI\(νt∥π\)𝑑t\+A1S1\+A2S2\+A3S3\+c1e02\+Cα\+Cσ\+Cεσ\.\\displaystyle\-\\frac\{1\}\{2\}\\int\_\{0\}^\{\\tau\_\{n\}\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)dt\+A\_\{1\}S\_\{1\}\+A\_\{2\}S\_\{2\}\+A\_\{3\}S\_\{3\}\+c\_\{1\}e\_\{0\}^\{2\}\+C\_\{\\alpha\}\+C\_\{\\sigma\}\+C\_\{\\varepsilon\_\{\\sigma\}\}\.\(118\)Rearranging the terms, using the convexity of Fisher information and multiplying both sides by2/τn2/\\tau\_\{n\}, we get
FI\(ν¯τn∥π\)\\displaystyle\\mathrm\{FI\}\(\\bar\{\\nu\}\_\{\\tau\_\{n\}\}\\\|\\pi\)≤1τn∫0τnFI\(νt∥π\)𝑑t\\displaystyle\\leq\\frac\{1\}\{\\tau\_\{n\}\}\\int\_\{0\}^\{\\tau\_\{n\}\}\\mathrm\{FI\}\(\\nu\_\{t\}\\\|\\pi\)\\,dt≤2KL\(ν0∥π\)τn\+2τn\(A1S1\+A2S2\+A3S3\+c1e02\+Cα\+Cσ\+Cεσ\),\\displaystyle\\leq\\frac\{2\\,\\mathrm\{KL\}\(\\nu\_\{0\}\\\|\\pi\)\}\{\\tau\_\{n\}\}\+\\frac\{2\}\{\\tau\_\{n\}\}\\Bigl\(A\_\{1\}S\_\{1\}\+A\_\{2\}S\_\{2\}\+A\_\{3\}S\_\{3\}\+c\_\{1\}e\_\{0\}^\{2\}\+C\_\{\\alpha\}\+C\_\{\\sigma\}\+C\_\{\\varepsilon\_\{\\sigma\}\}\\Bigr\),\(119\)whereA1S1\+A2S2\+A3S3\+Cα\+Cεσ\+Cσ<∞A\_\{1\}S\_\{1\}\+A\_\{2\}S\_\{2\}\+A\_\{3\}S\_\{3\}\+C\_\{\\alpha\}\+C\_\{\\varepsilon\_\{\\sigma\}\}\+C\_\{\\sigma\}<\\inftyand does not depend onnn\. On the other hand, ift∈\[τn,τn\+1\]t\\in\\left\[\\tau\_\{n\},\\tau\_\{n\+1\}\\right\], integrating \([91](https://arxiv.org/html/2605.30573#A2.E91)\) betweenτn\\tau\_\{n\}andttand dropping the negative integral over the Fisher information give us
$$\begin{aligned}\mathrm{KL}(\nu_{t}\|\pi)&\leq\mathrm{KL}(\nu_{\tau_{n}}\|\pi)+\frac{17(t-\tau_{n})d(d+2)L_{f_{1}}^{2}}{4b_{n+1}}+\frac{17}{16}(t-\tau_{n})\mu_{n+1}^{2}L_{f_{2}}^{2}(d+3)^{3}\\&\quad+4(t-\tau_{n})\gamma_{n+1}dL_{m}^{2}\phi_{n+1}(\mu_{n+1})-\frac{4(t-\tau_{n})}{p_{n+1}}\left(1+\frac{85\gamma_{n+1}^{2}L_{m}^{2}\phi_{n+1}(\mu_{n+1})}{16}\right)(e_{n+1}^{2}-e_{n}^{2})\\&\quad+\frac{51(t-\tau_{n})}{4}\left(C_{1}^{2}\sigma_{n+1}^{2}+\varepsilon_{\sigma_{n+1}}^{2}+(\alpha_{n+1}-1)^{2}C_{2}^{2}\sigma_{n+1}^{-2}\right)\\&\leq\mathrm{KL}(\nu_{0}\|\pi)+2A_{1}S_{1}+2A_{2}S_{2}+2A_{3}S_{3}+2c_{1}e_{0}^{2}+c_{n+1}e_{n}^{2}+2C_{\alpha}+2C_{\varepsilon_{\sigma}}+2C_{\sigma}.\end{aligned}\tag{120}$$
Here the parameters carry the index $n+1$ because they are the ones in force on $[\tau_{n},\tau_{n+1}]$. We obtain ([120](https://arxiv.org/html/2605.30573#A2.E120)) by bounding all terms except the $\mathrm{KL}$ divergence, $c_{1}e_{0}^{2}$, and $c_{n+1}e_{n}^{2}$ by their finite limits as $n\rightarrow\infty$. We then apply the bound on $\mathrm{KL}(\nu_{\tau_{n}}\|\pi)$ derived from ([118](https://arxiv.org/html/2605.30573#A2.E118)). By Assumption [1](https://arxiv.org/html/2605.30573#Thmassumption1) and Lemma [1](https://arxiv.org/html/2605.30573#Thmlemma1), the initial error satisfies $e_{0}<\infty$, and hence $c_{1}e_{0}^{2}<\infty$.
In addition, following the steps from ([78](https://arxiv.org/html/2605.30573#A2.E78)) to ([79](https://arxiv.org/html/2605.30573#A2.E79)) in the proof of Theorem [2](https://arxiv.org/html/2605.30573#Thmtheorem2), we have $e_{n}^{2}<\infty$. Combining these with ([120](https://arxiv.org/html/2605.30573#A2.E120)) implies that $\{\mathrm{KL}(\nu_{t}\|\pi)\mid t\geq 0\}$ is uniformly bounded. By the convexity of the KL divergence, the sequence $\{\mathrm{KL}(\bar{\nu}_{\tau_{n}}\|\pi)\}_{n\in\mathbb{N}}$ is uniformly bounded as well. Since the sublevel sets of $\mathrm{KL}(\cdot\|\pi)$ are weakly compact, $(\bar{\nu}_{\tau_{n}})_{n\in\mathbb{N}}$ is tight. To establish that $\bar{\nu}_{\tau_{n}}\rightharpoonup\pi$ weakly, it suffices to verify that every cluster point of $(\bar{\nu}_{\tau_{n}})_{n\in\mathbb{N}}$ equals $\pi$.
Consider a subsequence\(ν¯τn\)n∈ℕ\(\\bar\{\\nu\}\_\{\\tau\_\{n\}\}\)\_\{n\\in\\mathbb\{N\}\}converging to some limitν¯\\bar\{\\nu\}\. Takingn→∞n\\to\\inftyin \([119](https://arxiv.org/html/2605.30573#A2.E119)\) and noting thatτn→∞\\tau\_\{n\}\\to\\infty, we obtainFI\(ν¯τn∥π\)→0\\mathrm\{FI\}\(\\bar\{\\nu\}\_\{\\tau\_\{n\}\}\\\|\\pi\)\\to 0\. Therefore, the same holds along the subsequence\. By the weak lower semicontinuity of the Fisher information along the subsequence, we haveFI\(ν¯∥π\)=0\\mathrm\{FI\}\(\\bar\{\\nu\}\\\|\\pi\)=0\. Writingψ:=dν¯dπ\\psi:=\\frac\{d\\bar\{\\nu\}\}\{d\\pi\}, this meansψ∈domℰ\\sqrt\{\\psi\}\\in\\text\{dom\}~\\mathcal\{E\}andℰ\(ψ\)=0\\mathcal\{E\}\(\\sqrt\{\\psi\}\)=0, whereℰ\\mathcal\{E\}denotes the Dirichlet energy \(i\.e\., the squaredL2\(π\)L^\{2\}\(\\pi\)\-norm of the gradient; see Section 3 inBalasubramanianet al\.\([2022](https://arxiv.org/html/2605.30573#bib.bib34)\)\)\. Since∇logπ\\nabla\\log\\piis Lipschitz by Assumption[2](https://arxiv.org/html/2605.30573#Thmassumption2),π\\pihas a continuous and strictly positive density onℝd\\mathbb\{R\}^\{d\}, soℰ\(ψ\)=0\\mathcal\{E\}\(\\sqrt\{\\psi\}\)=0, which implies thatψ\\psimust be a constantπ\\pi\-a\.e\., henceν¯=π\\bar\{\\nu\}=\\pi\.□\\square
## Appendix CExtended Experimental Results
### C\.1Numerical Validation
Similar to the related works\(Sunet al\.,[2024](https://arxiv.org/html/2605.30573#bib.bib11); Song and Ermon,[2020](https://arxiv.org/html/2605.30573#bib.bib30)\), we run ZO\-APMC with an exponential annealing schedule:
$$\sigma_{k}\coloneqq\max\{\sigma_{0}\rho_{2}^{k},\sigma_{\text{min}}\},\quad\alpha_{k}=\max\{\alpha_{0}\sigma_{k}^{2},1\},\tag{121}$$
where $\rho_{2}$ is the decay rate and $k$ is the step index. We always choose $\alpha_{0}\leq 1/\sigma_{\text{min}}^{2}$ so that $\alpha_{k}$ converges to 1. Our definition of the annealing and noise schedules in ([13](https://arxiv.org/html/2605.30573#S3.E13)), which uses two independent decay rates $\rho_{1}$ and $\rho_{2}$, one per schedule, is more general than this definition, since one can recover ([121](https://arxiv.org/html/2605.30573#A3.E121)) by setting $\rho_{2}$, $\alpha_{0}$, $\sigma_{0}$, and $\sigma_{\text{min}}$ appropriately. For our numerical validation results, we set $\sigma_{0}=10$, $\alpha_{0}=10$, $\rho_{2}=0.975$, $\sigma_{\text{min}}=0$, and $\gamma=0.1$. For the ZO estimator, we choose the smoothing parameter $\mu=10^{-4}$. We run ZO-APMC with 1000 sample points initialized from the uniform distribution $\mathrm{U}[-50,50]^{2}$ on the $[-50,50]^{2}$ grid for $N=2000$ iterations. At each step, we fit a Gaussian mixture model (GMM) to the current samples, which allows us to evaluate the density at an arbitrary point of the $[-50,50]^{2}$ grid. Then, for each intermediate GMM distribution, we calculate the empirical Fisher information relative to the target posterior, which can be computed analytically. We discretize $[-50,50]^{2}$ into $1000\times 1000$ unit areas and calculate the Fisher information for each unit area. Summing over the grid gives the approximate relative Fisher information.
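The grid-based approximation of the relative Fisher information, $\mathrm{FI}(\nu\|\pi)=\int\nu\,\|\nabla\log(\nu/\pi)\|^{2}\,dx$, can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that both log-densities are available as callables (in the experiments, $\nu$ would be the fitted GMM), uses finite differences for the gradient, and the names `relative_fisher_information` and `gaussian_logpdf` are hypothetical.

```python
import numpy as np


def relative_fisher_information(log_nu, log_pi, lo=-50.0, hi=50.0, n=1000):
    """Riemann-sum approximation of FI(nu || pi) on an n x n grid over [lo, hi]^2."""
    xs = np.linspace(lo, hi, n)
    h = xs[1] - xs[0]                          # grid spacing
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    pts = np.stack([X, Y], axis=-1)            # shape (n, n, 2)
    log_ratio = log_nu(pts) - log_pi(pts)      # log(nu / pi) on the grid
    gx, gy = np.gradient(log_ratio, h, h)      # finite-difference gradient
    nu = np.exp(log_nu(pts))                   # density weights
    return float(np.sum(nu * (gx**2 + gy**2)) * h * h)


def gaussian_logpdf(mean):
    """Log-density of an isotropic unit-variance Gaussian in R^2 (toy stand-in)."""
    mean = np.asarray(mean, dtype=float)

    def logpdf(pts):
        return -0.5 * np.sum((pts - mean) ** 2, axis=-1) - np.log(2 * np.pi)

    return logpdf
```

As a sanity check, for two unit-variance Gaussians with means $m_{1}$ and $m_{2}$ the relative Fisher information equals $\|m_{1}-m_{2}\|^{2}$, which the grid sum reproduces closely in two dimensions.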
As stated in Section[4\.1](https://arxiv.org/html/2605.30573#S4.SS1), all results in the numerical validation experiments are obtained by repeating the same experimental procedure across 20 random seeds, each corresponding to a different randomly generated forward operator\. Reported Fisher information and KL divergence values are averaged over these runs\.
Figure 6: Effect of $p$ on the convergence of ZO-APMC to the true posterior distribution in terms of relative Fisher information. The solid lines show the mean values and the shaded areas show the minimum and maximum ranges.

Convergence in Fisher Information. We provide additional experiments illustrating the effect of $p$ on FI convergence in Fig. [6](https://arxiv.org/html/2605.30573#A3.F6). At each iteration, ZO-APMC performs a zeroth-order estimate with a large batch size $b=10$ with probability $p$, while with probability $1-p$ it uses a smaller batch size $b^{\prime}=1$, whose gradient estimate is aggregated with the previous step's update.
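The probabilistic switch between a fresh large-batch estimate and a small-batch correction of the previous estimate can be sketched as below. This is a minimal illustration under stated assumptions: the two-point Gaussian-smoothing form of `zo_grad` is assumed (the estimator's exact definition is given elsewhere in the paper), and both function names are hypothetical.

```python
import numpy as np


def zo_grad(f, x, u, mu):
    """Assumed two-point Gaussian-smoothing estimator: (f(x + mu u) - f(x)) / mu * u."""
    return (f(x + mu * u) - f(x)) / mu * u


def vr_zo_estimate(f, x, x_prev, g_prev, p, b, b_prime, mu, rng):
    """One variance-reduced update of the gradient estimate.

    With probability p: fresh estimate from a large batch of size b.
    Otherwise: previous estimate plus a small-batch (b_prime) correction
    that reuses the same directions at x and x_prev.
    """
    d = x.shape[0]
    if rng.random() < p:
        U = rng.standard_normal((b, d))
        return np.mean([zo_grad(f, x, u, mu) for u in U], axis=0)
    U = rng.standard_normal((b_prime, d))
    diffs = [zo_grad(f, x, u, mu) - zo_grad(f, x_prev, u, mu) for u in U]
    return g_prev + np.mean(diffs, axis=0)
```

Sharing the directions `u` between the two evaluation points is what keeps the correction small when consecutive iterates are close, so the small batch suffices on most iterations.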
To further demonstrate the benefit of the variance-reduction mechanism, we compare the naive ZO-APMC approach ($p=1$), discussed after Proposition [1](https://arxiv.org/html/2605.30573#Thmproposition1), with ZO-APMC with $p<1$. We use the same inverse problem setting and analytical prior explained in Section [4.1](https://arxiv.org/html/2605.30573#S4.SS1). We note that we can approximate FI accurately because our generated samples live in $\mathbb{R}^{2}$. In higher dimensions, this approximation quickly becomes unreliable and expensive. Importantly, in $\mathbb{R}^{2}$, ZO estimators do not suffer from the curse of dimensionality that primarily motivates our variance-reduction mechanism. To better mimic high-dimensional behavior and induce higher variance in the ZO estimates, we inject synthetic noise with standard deviation $\sigma_{\mathrm{noise}}=10$ into the likelihood score estimates. Given a fixed budget of 6 likelihood score approximations per iteration on average, which corresponds to 12 forward model evaluations, we run ZO-APMC with different pairs $(p,b)\in\{(1,6),(0.4,9),(0.2,14),(0.1,24)\}$. For each pair, we choose $b'$ such that $(1-p)b'+pb\leq 6$, so that the algorithm remains within the per-iteration budget. This ensures a fair comparison between ZO-APMC with $p<1$ and the naive ZO-APMC ($p=1$) approach.
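The budget constraint $(1-p)b'+pb\leq 6$ bounds the small batch size for each pair $(p,b)$. The helper below is a hypothetical illustration that computes the largest admissible integer $b'$ with exact rational arithmetic; the text does not state which $b'$ values were actually used.

```python
import math
from fractions import Fraction


def max_small_batch(p, b, budget):
    """Largest integer b' with (1 - p) * b' + p * b <= budget, or None when p == 1."""
    p = Fraction(str(p))  # exact arithmetic avoids floating-point edge cases at the boundary
    if p == 1:
        return None  # the small batch is never used when every step refreshes
    return math.floor((budget - p * b) / (1 - p))


# The (p, b) pairs and the budget of 6 come from the text; b' is derived here.
for p, b in [(1, 6), (0.4, 9), (0.2, 14), (0.1, 24)]:
    print(p, b, max_small_batch(p, b, budget=6))
```

Using `Fraction` rather than floats matters here because the bound can land exactly on an integer, where floating-point rounding could flip the floor.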
Figure 7: Comparison of naive ZO-APMC ($p=1$) and ZO-APMC with $p<1$ for Fisher information convergence under Gaussian (left) and Laplace (right) measurement noise types in an inverse problem. Each configuration uses the same number of forward model evaluations per iteration on average.

In addition to the Gaussian noise model, we also consider a Laplace noise model, which is commonly used in practice. Accordingly, we conduct experiments under both noise modeling assumptions. We present the results in Fig. [7](https://arxiv.org/html/2605.30573#A3.F7). They are consistent with our theoretical results in Theorem [3](https://arxiv.org/html/2605.30573#Thmtheorem3). When $p=1$, ZO-APMC converges to the largest, and hence most suboptimal, Fisher information value. This suggests that when $b$ is not large enough, the upper bound in Theorem [3](https://arxiv.org/html/2605.30573#Thmtheorem3) becomes suboptimal and can be improved only by increasing $b$. In contrast, when $p<1$, we can increase $b$ without changing the per-iteration budget. Doing so improves the convergence of the FI: ZO-APMC with each tested value of $p<1$ converges to a smaller value. We further note that the convergence for $p=0.1$ is worse than for $p=0.4$. This is consistent with Theorem [3](https://arxiv.org/html/2605.30573#Thmtheorem3), because decreasing $p$ increases the discretization error (the second term) in the upper bound. We present a similar comparison for statistical validation in the subsequent section, after the weak convergence results.
Weak Convergence. To verify the weak convergence of ZO-APMC to the target posterior, we consider the same synthetic inverse problem setting as in the previous experiments, but use time-varying schedules for the step size $\gamma_{k}$, refresh probability $p_{k}$, batch size $b_{k}$, and ZO smoothing parameter $\mu_{k}$. Following Theorem [4](https://arxiv.org/html/2605.30573#Thmtheorem4), we employ decreasing schedules for $\gamma_{k}$, $p_{k}$, and $\mu_{k}$, together with an increasing schedule for $b_{k}$. Specifically, letting $t=k/N\in[0,1]$ denote the normalized iteration index, where $N$ is the total number of iterations, we linearly decay the step size from $\gamma_{0}=0.1$ to $\gamma_{\min}=0.001$, the refresh probability from its initial value $p_{0}$ (reported in Fig. [8](https://arxiv.org/html/2605.30573#A3.F8)) to $p_{\min}=0.01$, and the smoothing parameter from $\mu_{0}=10^{-4}$ to $\mu_{\min}=10^{-5}$. Consistent with the theoretical relation $b_{k}=\lceil 1/p_{k}\rceil$, the batch size is increased linearly from $b_{0}=10$ to $b_{\max}=100$. Unlike the asymptotic schedules in Theorem [4](https://arxiv.org/html/2605.30573#Thmtheorem4), where $\gamma_{k},p_{k},\mu_{k}\to 0$ and $b_{k}\to\infty$, we use bounded practical schedules for numerical stability and finite computational cost. Moreover, consistent with Theorem [4](https://arxiv.org/html/2605.30573#Thmtheorem4), the SGM estimation error is linearly decreased from $\varepsilon_{\sigma_{0}}=2.5$ to $0$ throughout sampling.
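The bounded linear schedules described above can be written as a small helper. This is a sketch with hypothetical names; the endpoint values are those quoted in the text, while rounding the interpolated batch size to the nearest integer is an assumption.

```python
def linear(t, start, end):
    """Linear interpolation between start (t = 0) and end (t = 1)."""
    return start + t * (end - start)


def schedules(k, N, p0, gamma0=0.1, gamma_min=1e-3, p_min=0.01,
              mu0=1e-4, mu_min=1e-5, b0=10, b_max=100):
    """Step size, refresh probability, smoothing parameter, and batch size at step k."""
    t = k / N  # normalized iteration index in [0, 1]
    return {
        "gamma": linear(t, gamma0, gamma_min),
        "p": linear(t, p0, p_min),
        "mu": linear(t, mu0, mu_min),
        # Rounding to an integer batch size is an assumption of this sketch.
        "b": int(round(linear(t, b0, b_max))),
    }
```

For example, `schedules(0, N, p0)` returns the initial values and `schedules(N, N, p0)` returns the terminal ones, with $b_{\max}=100$ matching $\lceil 1/p_{\min}\rceil$ for $p_{\min}=0.01$.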
Fig. [8](https://arxiv.org/html/2605.30573#A3.F8) shows that, across different initial values $p_{0}$, ZO-APMC achieves near-zero KL divergence, providing empirical support for the weak convergence of the generated sample distribution toward the target posterior predicted by Theorem [4](https://arxiv.org/html/2605.30573#Thmtheorem4). Despite using only ZO likelihood score estimates ([8](https://arxiv.org/html/2605.30573#S3.E8)), ZO-APMC closely matches the convergence behavior of APMC using exact likelihood scores.
Figure 8: Convergence of APMC and ZO-APMC to the target posterior in $\mathrm{KL}$ divergence for varying initial values $p_{0}$. The probability parameter $p$ is decreased throughout sampling, and the reported values denote its initialization. ZO-APMC converges for all initial values $p_{0}$.

Figure 9: Comparison of naive ZO-APMC ($p=1$) and ZO-APMC with $p<1$ for estimating the ground truth posterior mean and variance statistics under Gaussian (left) and Laplace (right) measurement noise types. The first row shows the posterior means and the second row shows the posterior variances. The colorbar for each statistic is shown at the end of its corresponding row. Each column shows the statistics estimated by ZO-APMC with the specified probability $p$.
### C.2 Statistical Validation
The SGM used in this experiment is a U-Net taken from (Nichol and Dhariwal, [2021](https://arxiv.org/html/2605.30573#bib.bib65)), with some of its layers removed to process $32\times 32$ images taken from the CelebA (Liu et al., [2015](https://arxiv.org/html/2605.30573#bib.bib64)) dataset. Each image is normalized to $[-1,1]$ and downscaled to $32\times 32$ pixels for simplicity. The forward operator is generated as a random Gaussian matrix, and for each test image we inject Gaussian measurement noise with variance 0.01. We construct a bimodal distribution by selecting male and female images from the CelebA dataset and fitting a Gaussian mixture model (GMM) to the combined data. To ensure adequate separation, the two modes are shifted by $+1$ and $-1$. The SGM prior is then trained on samples drawn from this synthetic multimodal distribution. Because the synthetic Gaussian images lack the structural richness of natural images, the score network's results on this dataset should not be taken as representative of its performance on real-world data. For comparison, we compute the target modes and posterior statistics using the statistics derived from male and female images in the CelebA dataset.
In addition to the statistical validation experiments in Section [4.1](https://arxiv.org/html/2605.30573#S4.SS1), we compare naive ZO-APMC ($p=1$) with ZO-APMC for $p<1$ under Gaussian and Laplace noise, two widely used noise models. We run each ZO-APMC algorithm for $N=5000$ iterations to generate $1000$ samples with different values of $(p,b)\in\{(1,2),(0.4,4),(0.2,7),(0.1,12)\}$, where $b'$ is selected such that the number of gradient approximations per iteration satisfies $(1-p)b'+pb\leq 2$. Because the Laplace likelihood and Gaussian prior are not a conjugate pair, closed-form posterior statistics are not available. Therefore, when modeling Laplace noise, we estimate the ground truth posterior statistics numerically by running standard Langevin Monte Carlo and computing empirical estimates of the mean and variance. In this procedure, we use the likelihood score directly and do not apply any annealing schedule to the sampling parameters. We present the results in Fig. [9](https://arxiv.org/html/2605.30573#A3.F9). Under Gaussian measurement noise, the naive ZO-APMC has significantly higher variance than the ground truth variance; reducing it would require increasing the batch size and therefore the number of forward evaluations per iteration. Instead, our ZO-APMC with $p<1$ reduces this variance by using a larger batch size $b$ together with a small $p$, as explained previously. Fig. [9](https://arxiv.org/html/2605.30573#A3.F9) shows that choosing $p=0.4$ reduces the sample variance significantly, making it very similar to the ground truth variance. However, the variance reduction is not monotonic in $p$. For $p=0.2$ and $p=0.1$, we observe a slight increase in variance, although it remains lower than that obtained with $p=1$.
This phenomenon follows from the trade-off characterized in Proposition [1](https://arxiv.org/html/2605.30573#Thmproposition1) between variance reduction and propagated estimation error. Specifically, decreasing $p$ increases the influence of accumulated errors from previous iterations, leading to larger zeroth-order estimation error and preventing convergence to the target statistics. We observe a similar trend under Laplace noise modeling. The choice $p=0.4$ yields the most accurate posterior variance estimate, whereas decreasing $p$ further to $p=0.2$ leads to increased variance. Lastly, across all the experiments, we keep the ZO smoothing parameter fixed at $\mu=10^{-4}$ to isolate its effect. With $\mu$ fixed, decreasing $p$ means that large batch sizes are used less frequently, thereby increasing the bias-induced error predicted by Proposition [1](https://arxiv.org/html/2605.30573#Thmproposition1). We observe this effect at $p=0.1$ in the estimated posterior mean under both noise models. The effect is stronger under Laplace noise modeling because the Fisher information upper bound in Theorem [3](https://arxiv.org/html/2605.30573#Thmtheorem3) depends on the smoothness properties of the log-likelihood. While $\|\bm{y}-\bm{A}\bm{x}\|_{2}^{2}$ is globally smooth with Lipschitz gradient constant $2\|\bm{A}^{\top}\bm{A}\|_{\mathrm{op}}$, $\|\bm{y}-\bm{A}\bm{x}\|_{1}$ is non-smooth and only differentiable almost everywhere. Consequently, the zeroth-order smoothing bias is larger under Laplace noise modeling, although this effect can be mitigated by choosing a smaller value of $\mu$.
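The smoothness constant quoted above can be checked in two lines. The following worked derivation is a standard calculation added for illustration, writing $g(\bm{x})=\|\bm{y}-\bm{A}\bm{x}\|_{2}^{2}$ and $h(\bm{x})=\|\bm{y}-\bm{A}\bm{x}\|_{1}$ (both names are introduced here, not in the paper):

```latex
\nabla g(\bm{x}) = 2\bm{A}^{\top}(\bm{A}\bm{x}-\bm{y}),
\qquad
\|\nabla g(\bm{x})-\nabla g(\bm{x}')\|_{2}
  = \|2\bm{A}^{\top}\bm{A}(\bm{x}-\bm{x}')\|_{2}
  \le 2\|\bm{A}^{\top}\bm{A}\|_{\mathrm{op}}\,\|\bm{x}-\bm{x}'\|_{2},
\\[4pt]
\nabla h(\bm{x}) = -\bm{A}^{\top}\operatorname{sign}(\bm{y}-\bm{A}\bm{x})
\quad\text{wherever every entry of } \bm{y}-\bm{A}\bm{x} \text{ is nonzero}.
```

The gradient of $h$ jumps whenever a residual entry changes sign, so no finite Lipschitz constant for the gradient exists, which is the contrast drawn between the two noise models.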
### C.3 MRI Reconstruction
We evaluate the reconstruction quality of the samples generated by ZO-APMC and the baseline methods using peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), normalized root mean square error (NRMSE), and mean standard deviation (SD). Given an estimate $\hat{\bm{x}}\in\mathbb{R}^{d}$ and the ground truth $\bm{x}_{\text{GT}}\in\mathbb{R}^{d}$, we define the error metrics as
$$\begin{aligned}
\text{MSE}(\hat{\bm{x}},\bm{x}_{\text{GT}}) &\coloneqq \frac{1}{d}\|\hat{\bm{x}}-\bm{x}_{\text{GT}}\|_{2}^{2},\quad \text{NRMSE}(\hat{\bm{x}},\bm{x}_{\text{GT}})\coloneqq\frac{\|\hat{\bm{x}}-\bm{x}_{\text{GT}}\|_{2}}{\|\bm{x}_{\text{GT}}\|_{2}},\\
\text{PSNR}(\hat{\bm{x}},\bm{x}_{\text{GT}}) &\coloneqq 10\log_{10}\!\left(\frac{\max(\bm{x}_{\text{GT}})^{2}}{\text{MSE}(\hat{\bm{x}},\bm{x}_{\text{GT}})}\right),
\end{aligned}$$

where $d$ is the dimensionality of $\bm{x}$, and $\mathrm{max}$ denotes the maximum possible value of the signal (e.g., $1$ for normalized data or $255$ for 8-bit images). In addition, we compute SD as follows: for each test image, we first calculate the standard deviation across generated samples at each pixel location, and then average these pixel-wise standard deviations to obtain a mean standard deviation for the generated image. We run ZO-APMC and APMC using the annealing and noise schedules defined in ([121](https://arxiv.org/html/2605.30573#A3.E121)), with step size $\gamma=5\times 10^{-6}$, initial noise and annealing parameters $\sigma_{0}=348$ and $\alpha_{0}=10^{4}$, minimum noise level $\sigma_{\min}=0.01$, decay rate $\rho_{2}=0.99$, and ZO smoothing parameter $\mu=10^{-4}$. Unless otherwise specified, these hyperparameters follow the implementation and settings of Sun et al. ([2024](https://arxiv.org/html/2605.30573#bib.bib11)). For all other baselines, we utilize the implementations and parameters provided by InverseBench (Zheng et al., [2025b](https://arxiv.org/html/2605.30573#bib.bib16)).
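The metrics defined above translate directly into NumPy. The sketch below is illustrative: the function names are hypothetical, `max_val` is passed explicitly as the maximum possible signal value, and the population standard deviation (`ddof=0`) is an assumption, since the text does not specify the normalization. SSIM is omitted because it requires a windowed implementation.

```python
import numpy as np


def mse(x_hat, x_gt):
    """Mean squared error: squared l2 distance divided by the number of entries d."""
    return float(np.mean((x_hat - x_gt) ** 2))


def nrmse(x_hat, x_gt):
    """l2 error normalized by the l2 norm of the ground truth."""
    return float(np.linalg.norm(x_hat - x_gt) / np.linalg.norm(x_gt))


def psnr(x_hat, x_gt, max_val=1.0):
    """PSNR in dB; max_val is the maximum possible signal value (1 for normalized data)."""
    return float(10.0 * np.log10(max_val ** 2 / mse(x_hat, x_gt)))


def mean_sd(samples):
    """Pixel-wise std across samples (axis 0), averaged over pixels; ddof=0 is assumed."""
    return float(np.mean(np.std(samples, axis=0)))
```

Here `samples` is expected to have shape `(num_samples, H, W)`, so that the standard deviation is taken across generated samples at each pixel location.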
While our theoretical results do not imply gains in sampling speed, for completeness, we report runtimes of each method to generate a sample\. The per\-sample runtimes \(seconds\) are as follows: PnPDM: 68\.3, DPS: 25\.4, APMC: 23\.77, Forward\-GSG: 38\.12, Central\-GSG: 37\.83, DPG: 60\.54, and ZO\-APMC: 50\.53, measured on an NVIDIA H100 GPU\.
Ablation Study. Among the inverse problems considered in this work, MRI reconstruction involves the largest image size ($256\times 256$), which necessitates a larger batch size in our ZO estimator to accurately compute the forward model gradient. To identify the optimal value of $p$, we subsample examples from the validation set of FastMRI (Zbontar et al., [2019](https://arxiv.org/html/2605.30573#bib.bib66)) and evaluate reconstruction quality across different values, $p\in\{0.1,0.2,0.4,0.5\}$, as illustrated in Fig. [10](https://arxiv.org/html/2605.30573#A3.F10).
Figure 10: Comparison of the ground-truth brain MRI with APMC and ZO-APMC reconstructions for various probabilities $p\in\{0.1,0.2,0.4,0.5\}$, using a large batch size of $b=10^{4}$ and a small batch size of $b'=10^{3}$. PSNR values for each reconstruction are displayed in the lower-left corner of the corresponding image.

As indicated by the orange arrow, reducing $p$ excessively while keeping $b$ fixed produces visible artifacts in the generated samples. ZO-APMC maintains reasonable reconstruction quality down to $p=0.2$. Even when using a smaller batch size of $b'=10^{3}$, an order of magnitude lower than $b$, in about half of the iterations on average, ZO-APMC maintains high reconstruction quality that is very close, both visually and quantitatively, to the reconstruction of APMC. As opposed to APMC, ZO-APMC achieves this without any gradient information and uses only forward model function evaluations. Because the performance gain beyond $p=0.2$ is not significant and the gap between $p=0.5$ and $p=0.2$ can be further reduced by averaging multiple parallel outputs, we set $p=0.2$ for our brain MRI inverse problem experiments.
Moreover, ZO estimators are widely recognized in the literature for exhibiting high variance in high-dimensional settings, as they rely on first-order approximations of the function along random directions. To evaluate our proposed variance-reduction mechanism, we compare the reconstructions of our method with those of DPS and APMC, which do not assume a black-box setting and have access to gradients of the forward model. Results in Fig. [11](https://arxiv.org/html/2605.30573#A3.F11) show that, although ZO-APMC assumes a black-box setting and uses noisy forward model evaluations to approximate the gradient of the forward model, its variance is similar to that of DPS and APMC, which assume access to the gradients, thanks to our proposed variance-reduction mechanism.
Figure 11:Comparison of the ground\-truth brain MRI with reconstructions from ZO\-APMC and the gradient\-based approaches DPS and APMC\. Each method generates 20 samples from the same measurements; the first row shows the mean reconstructions and the second row shows the corresponding variance maps\. Owing to its variance\-reduction mechanism, ZO\-APMC produces variance maps comparable to those of the gradient\-based algorithms despite relying on noisy evaluations of the forward model\.
### C.4 Black-Hole Imaging
In black-hole imaging, very long baseline interferometry (VLBI) uses an array of ground-based telescopes. Each telescope pair $(a,b)$ at time $t$ produces a complex visibility $V^{a,b}_{t}$. To mitigate atmospheric and thermal phase errors, visibilities are combined into noise-robust *closure* measurements (Chael et al., [2018](https://arxiv.org/html/2605.30573#bib.bib73)): closure phases $\bm{y}^{\mathrm{cph}}_{t,(a,b,c)}$ and log-closure amplitudes $\bm{y}^{\mathrm{camp}}_{t,(a,b,c,d)}$. Following Sun and Bouman ([2021](https://arxiv.org/html/2605.30573#bib.bib39)); Zheng et al. ([2025a](https://arxiv.org/html/2605.30573#bib.bib27)), we use the following likelihood model:
$$\ell(\bm{y}\mid\bm{x})=\sum_{t}\frac{\bigl\|\mathcal{A}^{\mathrm{cph}}_{t}(\bm{x})-\bm{y}^{\mathrm{cph}}_{t}\bigr\|_{2}^{2}}{2\beta_{\mathrm{cph}}^{2}}+\sum_{t}\frac{\bigl\|\mathcal{A}^{\mathrm{camp}}_{t}(\bm{x})-\bm{y}^{\mathrm{camp}}_{t}\bigr\|_{2}^{2}}{2\beta_{\mathrm{camp}}^{2}}+\frac{\rho}{2}\,\Bigl\|\sum_{i}x_{i}-y^{\mathrm{flux}}\Bigr\|_{2}^{2}.\tag{122}$$

Here, $\mathcal{A}^{\mathrm{cph}}_{t}$ and $\mathcal{A}^{\mathrm{camp}}_{t}$ map an image $\bm{x}$ to predicted closure phases and log-closure amplitudes, respectively; $\beta_{\mathrm{cph}}$ and $\beta_{\mathrm{camp}}$ are instrument-specific noise scales. The first two sums act as chi-squared penalties for the closure measurements, while the final term enforces the total-flux constraint with weight $\rho$ and target flux $y^{\mathrm{flux}}$. For our experiments, we use the GRMHD dataset (Wong et al., [2022](https://arxiv.org/html/2605.30573#bib.bib67)), the pre-trained SGM prior (Sun et al., [2024](https://arxiv.org/html/2605.30573#bib.bib11)), and the forward model implementation and baseline methods provided by Zheng et al. ([2025b](https://arxiv.org/html/2605.30573#bib.bib16)). For EnKG, we adopt the hyperparameter settings recommended by Zheng et al. ([2025a](https://arxiv.org/html/2605.30573#bib.bib27)), while for the baseline methods we use the hyperparameters provided by Zheng et al. ([2025b](https://arxiv.org/html/2605.30573#bib.bib16)). For ZO-APMC, we use a batch size of $b=1024$, set $p=1$, and choose a ZO smoothing parameter of $\mu=0.01$. We use $p=1$ because the image size is relatively small ($64\times 64$), making the large batch size practical.
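To make the structure of (122) concrete, the sketch below evaluates the three terms for user-supplied operators. It is an illustration only: the function name, argument layout, and the representation of the per-time operators as Python callables are assumptions, and it does not reproduce the InverseBench forward model. It evaluates the expression exactly as written, without any phase wrapping that a practical closure-phase implementation might apply.

```python
import numpy as np


def closure_data_misfit(x, A_cph, A_camp, y_cph, y_camp, y_flux,
                        beta_cph, beta_camp, rho):
    """Evaluate Eq. (122). A_cph / A_camp are lists of callables, one per time t."""
    # Chi-squared-style penalty on closure phases, summed over time.
    cph = sum(np.sum((A(x) - y) ** 2) for A, y in zip(A_cph, y_cph)) / (2.0 * beta_cph ** 2)
    # Chi-squared-style penalty on log-closure amplitudes, summed over time.
    camp = sum(np.sum((A(x) - y) ** 2) for A, y in zip(A_camp, y_camp)) / (2.0 * beta_camp ** 2)
    # Total-flux constraint with weight rho.
    flux = 0.5 * rho * (np.sum(x) - y_flux) ** 2
    return float(cph + camp + flux)
```

Because the function only calls the operators, it fits the black-box setting: a zeroth-order estimator needs nothing beyond such evaluations.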
The per\-sample runtimes \(seconds\) of each method are PnPDM: 25\.84, DPS: 15\.02, APMC: 14\.32, Forward\-GSG: 156\.72, Central\-GSG: 155\.66, SCG: 213\.78, DPG: 152\.86, EnKG: 422\.25, and ZO\-APMC: 154\.22, measured on an NVIDIA H100 GPU\.
### C.5 Navier–Stokes Equation
In our experiments, we study the two-dimensional Navier–Stokes equations for a viscous, incompressible fluid in vorticity form on a torus. Let $u\in C([0,T];H^{r}_{\mathrm{per}}((0,2\pi)^{2},\mathbb{R}^{2}))$ for any $r>0$ denote the velocity field, and let $w=\nabla\times u$ be the vorticity. The initial vorticity is $w_{0}\in L^{2}_{\mathrm{per}}((0,2\pi)^{2};\mathbb{R})$, the viscosity coefficient is $\nu\in\mathbb{R}_{+}$, and the forcing term is $f\in L^{2}_{\mathrm{per}}((0,2\pi)^{2};\mathbb{R})$. The solution operator $\mathcal{G}$ maps the initial vorticity to the vorticity at time $T$, i.e. $\mathcal{G}:w_{0}\mapsto w_{T}$. In our experiments, we implement $\mathcal{G}$ using a pseudo-spectral solver following He and Sun ([2007](https://arxiv.org/html/2605.30573#bib.bib75)). The governing equations are:
\begin{align}
\partial_{t}w(x,t)+u(x,t)\cdot\nabla w(x,t) &= \nu\Delta w(x,t)+f(x), & x\in(0,2\pi)^{2},\ t\in(0,T], \tag{123}\\
\nabla\cdot u(x,t) &= 0, & x\in(0,2\pi)^{2},\ t\in(0,T], \tag{124}\\
w(x,0) &= w_{0}(x), & x\in(0,2\pi)^{2}. \tag{125}
\end{align}
The task is to infer the initial vorticity field from noisy and sparsely observed vorticity data at time $T=1$. Since Eq. (21) admits no closed-form solution, the corresponding derivative of the solution operator is also unavailable. Furthermore, computing accurate numerical derivatives via automatic differentiation through the solver is challenging, since the computation graph can span thousands of discrete time steps.
We follow the approach in Zheng et al. ([2025a](https://arxiv.org/html/2605.30573#bib.bib27), [b](https://arxiv.org/html/2605.30573#bib.bib16)) and first solve the equation up to time $T=5$ starting from random Gaussian initial conditions; the resulting vorticity fields are highly nontrivial due to the nonlinearity of the Navier–Stokes equations. We use the SGM prior, which was pre-trained on $20{,}000$ vorticity fields, and the test set from InverseBench, consisting of $10$ samples of size $128\times 128$. For EnKG, we use the hyperparameters recommended by Zheng et al. ([2025a](https://arxiv.org/html/2605.30573#bib.bib27)), while for the baseline methods we adopt the settings provided by Zheng et al. ([2025b](https://arxiv.org/html/2605.30573#bib.bib16)). For ZO-APMC, we set the ZO smoothing parameter to $\mu=0.01$ and the batch size to $b=1024$, and use $p=1$. We observed degraded performance for $p<1$, which may arise from the violation of Assumption [1](https://arxiv.org/html/2605.30573#Thmassumption1), as the Navier–Stokes forward operator exhibits strong nonlinearity arising from the underlying PDE dynamics. Quantitative results are presented in Table [4](https://arxiv.org/html/2605.30573#A3.T4). Our method outperforms Forward-GSG, Central-GSG, and SCG, and achieves performance comparable to DPG, while additionally providing theoretical guarantees of convergence to the target posterior that the baseline methods lack. Table [4](https://arxiv.org/html/2605.30573#A3.T4) also shows that EnKG outperforms ZO-APMC in this setting. One possible explanation is that ZO-APMC estimates gradients using Gaussian perturbations, which may move inputs off the data manifold and adversely affect solver stability in highly nonlinear PDE-based problems such as Navier–Stokes. In contrast, EnKG avoids such perturbations and appears more robust in this regime.
We refer the reader to Section 4.2 of Zheng et al. ([2025a](https://arxiv.org/html/2605.30573#bib.bib27)) for a more detailed discussion. The per-sample runtimes (seconds) of each method are Forward-GSG: 4536.87, Central-GSG: 4534.32, SCG: 2112.72, DPG: 5133.93, EnKG: 3422.25, and ZO-APMC: 4531.61, measured on an NVIDIA H100 GPU.
Table 4: Quantitative results for the Navier–Stokes inverse problem. For each noise level $\sigma_{\text{noise}}$, the best-performing method is shown in bold. Baseline results are taken from Zheng et al. ([2025a](https://arxiv.org/html/2605.30573#bib.bib27)).

| Method | NRMSE ($\sigma_{\text{noise}}=0$) $\downarrow$ | NRMSE ($\sigma_{\text{noise}}=1$) $\downarrow$ | NRMSE ($\sigma_{\text{noise}}=2$) $\downarrow$ |
| --- | --- | --- | --- |
| Forward-GSG | 1.687 | 1.612 | 1.454 |
| Central-GSG | 2.203 | 2.117 | 1.746 |
| SCG | 0.908 | 0.928 | 0.966 |
| DPG | 0.325 | 0.408 | 0.466 |
| EnKG | **0.120** | **0.191** | **0.294** |
| ZO-APMC (Ours) | 0.459 | 0.463 | 0.472 |