Large-scale Uncertainty Quantification for Latent Variable Models Using Subsampling Markov Chain Monte Carlo

arXiv cs.LG 06/02/26, 04:00 AM Papers
uncertainty-quantification latent-variable-models markov-chain-monte-carlo subsampling stochastic-gradient-langevin-dynamics bayesian-inference
Summary
This paper develops a scaling limit theory for SGLD-Gibbs to provide principled hyperparameter tuning guidance for meaningful uncertainty quantification in large-scale latent variable models.
arXiv:2606.00309v1 Announce Type: new
# Large-scale Uncertainty Quantification for Latent Variable Models Using Subsampling Markov Chain Monte Carlo
Source: [https://arxiv.org/html/2606.00309](https://arxiv.org/html/2606.00309)
###### Abstract

Stochastic gradient Langevin dynamics combined with Gibbs updates \(SGLD–Gibbs\) provides a highly scalable approach to approximate Bayesian inference in latent variable models\. However, it remains unclear how to tune the algorithm’s hyperparameters in a principled manner to ensure the uncertainty estimates are statistically meaningful\. In this work, we address this gap in tuning guidance by developing a statistical scaling limit theory for SGLD–Gibbs\. We derive a joint asymptotic limit for the global parameters and latent variables under appropriate space\-time rescaling\. We show that global parameters converge to a diffusion\-type limit, while each latent variable converges to a jump process, reflecting the use of intermittent Gibbs updates\. This joint jump\-diffusion structure reveals how latent\-variable randomness contributes to the stationary distribution of the global parameters\. We leverage our results to propose explicit guidance on hyperparameter tuning for SGLD–Gibbs that ensures meaningful uncertainty quantification\. Numerical experiments show that SGLD–Gibbs with our tuning guidance leads to better parameter estimates, uncertainty quantification, and predictive performance than stochastic variational inference\.

## 1 Introduction

Stochastic gradient methods such as stochastic gradient descent \(SGD\) and stochastic gradient Langevin dynamics \(SGLD\) have become central tools for large\-scale optimization and approximate Bayesian inference \(Nemirovski et al\., [2009](https://arxiv.org/html/2606.00309#bib.bib36); Moulines & Bach, [2011](https://arxiv.org/html/2606.00309#bib.bib32); Bottou et al\., [2018](https://arxiv.org/html/2606.00309#bib.bib4); Welling & Teh, [2011](https://arxiv.org/html/2606.00309#bib.bib48); Nemeth & Fearnhead, [2021](https://arxiv.org/html/2606.00309#bib.bib35)\)\. For approximate sampling, latent variable models \(LVMs\) are among the most frequently cited applications of SG\(L\)D\. Examples include Gaussian mixture models, mixed\-membership stochastic block models \(Li et al\., [2016](https://arxiv.org/html/2606.00309#bib.bib24)\), latent Dirichlet allocation \(Patterson & Teh, [2013](https://arxiv.org/html/2606.00309#bib.bib37)\), Bayesian matrix factorization \(Ahn et al\., [2015](https://arxiv.org/html/2606.00309#bib.bib1)\), mixed effects models \(Danaher, [2023](https://arxiv.org/html/2606.00309#bib.bib9)\), and discrete choice models \(Loaiza\-Maya & Nibbering, [2023](https://arxiv.org/html/2606.00309#bib.bib25); Loaiza\-Maya et al\., [2024](https://arxiv.org/html/2606.00309#bib.bib26)\)\. In these applications, an *SGLD–Gibbs* scheme is often used, where SGLD update steps are constructed using one or more conditional draws of the latent variables, enabling approximate posterior sampling with per\-iteration costs that scale favorably with data size\.

However, there is little rigorous guidance on tuning the algorithmic hyperparameters of SGLD–Gibbs such as the step size, minibatch size, and inverse temperature\. Moreover, how to obtain meaningful uncertainty quantification using SGLD–Gibbs remains unclear\. A substantial body of work \(Walk, [1977](https://arxiv.org/html/2606.00309#bib.bib45); Pflug, [1986](https://arxiv.org/html/2606.00309#bib.bib39); Kushner & Yin, [2003](https://arxiv.org/html/2606.00309#bib.bib21); Negrea et al\., [2023](https://arxiv.org/html/2606.00309#bib.bib34); Wang et al\., [2025](https://arxiv.org/html/2606.00309#bib.bib46)\) has employed scaling\-limit analyses to study standard SG\(L\)D\. These approaches relate SG\(L\)D sample paths to continuous\-time stochastic processes and yield characterizations of optimization accuracy, asymptotic behavior, and numerical efficiency\. Such analyses have proven particularly useful for understanding hyperparameter tuning and uncertainty quantification \(Mandt et al\., [2017](https://arxiv.org/html/2606.00309#bib.bib27); Negrea et al\., [2023](https://arxiv.org/html/2606.00309#bib.bib34)\)\. However, existing scaling\-limit results do not directly apply to latent variable models\.

In this work, we address this gap by jointly analyzing the global parameters and latent variables under appropriate space\-time rescaling, which provides a unified asymptotic characterization of SGLD–Gibbs dynamics\. We show that the global parameters converge to a diffusion\-type limit, while each latent variable converges to an independent jump process\. We further demonstrate that the interaction between the global diffusion and latent\-variable jumps fundamentally alters the noise structure of the global parameters\. In particular, the latent variables contribute an additional source of variability determined by the number of Gibbs samples used per iteration\. We use our results to derive concrete guidance for uncertainty quantification and hyperparameter tuning\. Empirically, we find that SGLD–Gibbs yields improved accuracy and more reliable uncertainty quantification compared to stochastic variational inference in applications to mixture modeling and topic modeling\.

### 1\.1 Related Work and Alternative Approaches

Given their widespread use, SG\(L\)D methods have been studied from many perspectives, including finite\-sample error bounds, convergence rates, and stationary distributions \(e\.g\., McLeish, [1976](https://arxiv.org/html/2606.00309#bib.bib29); Ruppert, [1988](https://arxiv.org/html/2606.00309#bib.bib43); Polyak & Juditsky, [1992](https://arxiv.org/html/2606.00309#bib.bib40); Kushner & Yin, [2003](https://arxiv.org/html/2606.00309#bib.bib21); Negrea et al\., [2023](https://arxiv.org/html/2606.00309#bib.bib34); Rakhlin et al\., [2011](https://arxiv.org/html/2606.00309#bib.bib42); Dieuleveut et al\., [2020](https://arxiv.org/html/2606.00309#bib.bib10); Mou et al\., [2020](https://arxiv.org/html/2606.00309#bib.bib31); Cheng et al\., [2020](https://arxiv.org/html/2606.00309#bib.bib5); Srikant, [2024](https://arxiv.org/html/2606.00309#bib.bib44); Anastasiou et al\., [2019](https://arxiv.org/html/2606.00309#bib.bib2); Ge et al\., [2015](https://arxiv.org/html/2606.00309#bib.bib12); Jin et al\., [2017](https://arxiv.org/html/2606.00309#bib.bib18)\)\. Most relevant to our work is scaling\-limit theory for stochastic approximation algorithms, which shows that, under appropriate rescaling, SGD and SGLD trajectories converge to Ornstein–Uhlenbeck diffusions \(Kushner & Huang, [1981](https://arxiv.org/html/2606.00309#bib.bib22); Kushner & Yang, [1993](https://arxiv.org/html/2606.00309#bib.bib23); Kushner & Yin, [2003](https://arxiv.org/html/2606.00309#bib.bib21); Negrea et al\., [2023](https://arxiv.org/html/2606.00309#bib.bib34)\)\. Further results characterize mixing times, stationary covariances, and the behavior of averaged iterates \(Mandt et al\., [2017](https://arxiv.org/html/2606.00309#bib.bib27); Negrea et al\., [2023](https://arxiv.org/html/2606.00309#bib.bib34); Collins\-Woodfin et al\., [2024](https://arxiv.org/html/2606.00309#bib.bib7); Qian et al\., [2024](https://arxiv.org/html/2606.00309#bib.bib41); Kushner & Yang, [1993](https://arxiv.org/html/2606.00309#bib.bib23)\)\.

Several recent works extend this theory to improve uncertainty quantification for stochastic gradient algorithms\. Wang et al\. \([2025](https://arxiv.org/html/2606.00309#bib.bib46)\) develop non\-asymptotic functional error bounds for diffusion approximations to scaling limits\. Wang et al\. \([2026](https://arxiv.org/html/2606.00309#bib.bib47)\) develop discrete\-time proxy theories for stochastic gradient algorithms, clarifying when diffusion\-based uncertainty quantification remains valid in large\-batch or non\-asymptotic regimes\. A separate line of work studies scaling limits of SGD in high\-dimensional regimes where the parameter dimension $d \to \infty$, yielding mean\-field or dynamical equations for low\-dimensional summary statistics \(Arous et al\., [2022](https://arxiv.org/html/2606.00309#bib.bib3); Collins\-Woodfin et al\., [2023](https://arxiv.org/html/2606.00309#bib.bib6); Mignacco et al\., [2021](https://arxiv.org/html/2606.00309#bib.bib30)\)\.

Variational Bayesian methods, including mean\-field variational Bayes, online variational Bayes, stochastic variational inference \(SVI\), and related variational approximations, can also provide scalable inference for latent variable models \(Hoffman et al\., [2013](https://arxiv.org/html/2606.00309#bib.bib16), [2010](https://arxiv.org/html/2606.00309#bib.bib15); Kucukelbir et al\., [2017](https://arxiv.org/html/2606.00309#bib.bib20)\)\. Nonetheless, their ability to quantify posterior uncertainty can be fundamentally limited \(Gelman et al\., [2013](https://arxiv.org/html/2606.00309#bib.bib13); Margossian et al\., [2025](https://arxiv.org/html/2606.00309#bib.bib28); Giordano et al\., [2018](https://arxiv.org/html/2606.00309#bib.bib14)\)\. For example, Margossian et al\. \([2025](https://arxiv.org/html/2606.00309#bib.bib28)\) show that when the true posterior distribution exhibits dependence structure, variational approximations based on factorization cannot, in general, correctly estimate posterior uncertainty\. Depending on the divergence being minimized, uncertainty estimates produced by variational Bayes are often poorly calibrated, even under correct model specification\.

## 2 Preliminaries and Problem Setup

This section introduces the class of latent variable models considered in this work, describes the SGLD–Gibbs algorithm, and reviews preliminary results concerning scaling limits for stochastic gradient methods\.

### 2\.1 Latent Variable Models

We consider a general class of latent variable models in which each observation is associated with an unobserved latent variable. Let $\{(X_i, z_i)\}_{i=1}^{n}$ denote independent pairs of observed data $X_i \in \mathcal{X}$ and latent variables $z_i \in \mathcal{Z}$. The joint distribution is parameterized by a global parameter $\theta \in \Theta \subset \mathbb{R}^d$ and admits the factorization

$$p(X_i, z_i \mid \theta) = p(z_i \mid \theta)\, p(X_i \mid z_i, \theta), \tag{2}$$

with prior distribution $\pi_0(\theta)$ on $\theta$. The marginal likelihood of the observations is given by $p(X_i \mid \theta) = \int p(X_i, z_i \mid \theta)\, dz_i$, and the corresponding log-likelihood is $\ell(\theta; X_i) := \log p(X_i \mid \theta)$. This formulation encompasses a wide range of commonly used models, including mixture models, mixed-membership stochastic block models, topic models, and Bayesian matrix factorization. See Murphy ([2023](https://arxiv.org/html/2606.00309#bib.bib33)) for a systematic discussion of learning and approximate Bayesian inference in such models.

### 2\.2 SGLD with Gibbs Updates

We study stochastic gradient Langevin dynamics combined with Gibbs updates for latent variables. Let $b \in \{1, \dots, n\}$ denote the minibatch size. At iteration $k$, a minibatch of indices $I_k = \{I_k(1), \dots, I_k(b)\}$ is sampled uniformly from $\{1, \dots, n\}$ with replacement. This convention is mainly for theoretical convenience. Similar scaling-limit results for sampling without replacement were established in Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)), and we expect the conclusions here to remain the same under sampling without replacement, up to constant-level deviations when $b$ is of order $n$. Given the current global parameter $\theta_k$, the algorithm proceeds in two steps:

#### \(i\) Gibbs updates of latent variables\.

For each $i \in I_k$, the latent variable is resampled from its conditional posterior,

$$z_{i,k+1} \sim p(z_i \mid X_i, \theta_k), \tag{3}$$

while latent variables not in the minibatch remain unchanged.

#### \(ii\) SGLD update of global parameters\.

Using the stochastic gradient estimator with refreshed latent variables given by

$$G_k(\theta) := \frac{1}{n} \nabla \log \pi_0(\theta) + \frac{1}{b} \sum_{i \in I_k} \nabla_\theta \log p\left(X_i, z_{i,k+1} \mid \theta\right),$$

the global parameter is updated via

$$\theta_{k+1} = \theta_k + \frac{h}{2}\, \Gamma\, G_k(\theta_k) + \sqrt{\frac{h}{\beta}}\, \Gamma^{1/2} \xi_k, \tag{4}$$

where $h > 0$ is the step size, $\beta \in (0, \infty]$ is the inverse temperature, $\Gamma \in \mathbb{R}^{d \times d}$ is a positive definite preconditioning matrix, and $\xi_k \sim \mathcal{N}(0, I_d)$. The full procedure is summarized in [Algorithm 1](https://arxiv.org/html/2606.00309#alg1).

Algorithm 1: SGLD–Gibbs for Latent Variable Models

1. Input: step size $h$, batch size $b$, inverse temperature $\beta$, preconditioner $\Gamma$, initial values $(\theta_0, \{z_{i,0}\}_{i=1}^{n})$
2. for $k = 0, 1, 2, \ldots$ do
3. Sample minibatch $I_k \subset \{1, \ldots, n\}$ with $|I_k| = b$
4. for each $i \in I_k$ do
5. Sample $z_{i,k+1} \sim p(z_i \mid X_i, \theta_k)$
6. end for
7. Update $\theta_{k+1}$ using the SGLD step with updated $\{z_{i,k+1}\}_{i \in I_k}$
8. end for
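As a concrete illustration of Algorithm 1, the following is a minimal Python sketch for a toy symmetric two-component Gaussian mixture: $z_i \in \{-1, +1\}$ with equal probability and $X_i \mid z_i \sim \mathcal{N}(z_i \theta, 1)$ with scalar $\theta$. The toy model, the $\mathcal{N}(0, \sigma_0^2)$ prior, the function name `sgld_gibbs`, and all numerical values are assumptions made for illustration and are not taken from the paper. For this model the Gibbs conditional is $P(z_i = +1 \mid X_i, \theta) = 1/(1 + e^{-2\theta X_i})$ and the complete-data score is $z_i X_i - \theta$.

```python
import numpy as np

def sgld_gibbs(X, theta0, h, b, beta, gamma=1.0, prior_sd=10.0,
               n_iters=2000, seed=0):
    """SGLD-Gibbs for the toy model z_i in {-1,+1}, X_i | z_i ~ N(z_i * theta, 1)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    z = rng.choice([-1.0, 1.0], size=n)      # arbitrary initial latent labels
    theta = float(theta0)
    trace = np.empty(n_iters)
    for k in range(n_iters):
        # Minibatch of size b, sampled uniformly with replacement.
        idx = rng.integers(0, n, size=b)
        x_batch = X[idx]
        # (i) Gibbs step: P(z_i = +1 | X_i, theta) = sigmoid(2 * theta * X_i).
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * theta * x_batch))
        z_new = np.where(rng.random(b) < p_plus, 1.0, -1.0)
        z[idx] = z_new                       # latents outside the batch stay unchanged
        # (ii) Stochastic gradient: prior term / n + minibatch mean of complete-data scores.
        grad = -theta / (n * prior_sd**2) + np.mean(z_new * x_batch - theta)
        # SGLD step with a scalar preconditioner gamma (one-dimensional version of eq. 4).
        noise = np.sqrt(h / beta) * np.sqrt(gamma) * rng.standard_normal()
        theta = theta + 0.5 * h * gamma * grad + noise
        trace[k] = theta
    return trace, z

# Usage on simulated data (all values are illustrative).
rng = np.random.default_rng(1)
n = 2000
z_true = rng.choice([-1.0, 1.0], size=n)
X = 2.0 * z_true + rng.standard_normal(n)
trace, z = sgld_gibbs(X, theta0=1.0, h=0.05, b=50, beta=float(n), n_iters=3000)
print("mean of theta over the last 1000 iterations:", trace[-1000:].mean())
```

The new latent draws are stored in `z_new` before being written back, so the gradient uses one fresh draw per minibatch entry even when an index is sampled twice.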

### 2\.3 Scaling Limits for Stochastic Gradient Methods

In a general setup that does not involve latent variables, assume observations are i.i.d. from an unknown distribution $P_\star$. The model is parameterized by a global parameter $\theta \in \Theta \subset \mathbb{R}^d$ and admits a likelihood of the form $p(X_i \mid \theta)$ for each observation $X_i$, with prior distribution $\pi_0(\theta)$ on $\theta$. The optimal parameter is the maximizer of the expected log-likelihood, $\theta_\star := \operatorname*{arg\,max}_{\theta} \mathbb{E}\left[\ell(\theta; X)\right]$, where $X \sim P_\star$.

Recall that $I_k \subset \{1, \dots, n\}$ denotes the $k$th minibatch. SGLD uses the one-step update given in [Equation 4](https://arxiv.org/html/2606.00309#S2.E4), where the stochastic gradient estimator is now

$$G_k(\theta) := \frac{1}{n} \nabla \log \pi_0(\theta) + \frac{1}{b} \sum_{i \in I_k} \nabla_\theta \log p\left(X_i \mid \theta\right). \tag{5}$$

Scaling limit theory relates discrete-time stochastic gradient algorithms to continuous-time stochastic processes under appropriate space-time rescaling. Let $\theta^{(n)}_k \in \mathbb{R}^d$ denote the global parameter at iteration $k$ and $\hat{\theta}^{(n)}$ denote a critical point satisfying the first-order condition $\sum_{i=1}^{n} \nabla \ell(\hat{\theta}^{(n)}; X_i) = 0$. Define the rescaled, continuous-time process

$$\vartheta^{(n)}_t = n^{\mathfrak{w}} \left( \theta^{(n)}_{\lfloor n^{\mathfrak{a}} t \rfloor} - \hat{\theta}^{(n)} \right), \tag{6}$$

where $\mathfrak{w} > 0$ and $\mathfrak{a} > 0$ denote the spatial and temporal scaling exponents.

Then, under an appropriate scaling regime, this process converges in distribution to an Ornstein–Uhlenbeck process whose drift and diffusion coefficients depend on the preconditioner $\Gamma$ and the first- and second-order Fisher information matrices

$$I_\star := \mathbb{E}\left[ \left[\nabla_\theta \ell(\theta_\star; X)\right]^{\otimes 2} \right], \qquad J_\star := -\mathbb{E}\left[ \nabla_\theta^{\otimes 2} \ell(\theta_\star; X) \right], \tag{7}$$

where $a^{\otimes 2} := a \otimes a$ denotes the outer product, and $\nabla^{\otimes 2}$ denotes the Hessian operator.

Here $I_\star$ quantifies the variability of the log-likelihood gradient at $\theta_\star$, while $J_\star$ captures the local second-order behavior of the log-likelihood around $\theta_\star$.

###### Theorem 2\.1 \(Negrea et al\. \([2023](https://arxiv.org/html/2606.00309#bib.bib34)\), Theorem 1\)\.

Consider the SGLD algorithm with step size $h^{(n)} = c_h n^{-\mathfrak{h}}$, batch size $b^{(n)} = \lfloor c_b n^{\mathfrak{b}} \rfloor$, and inverse temperature $\beta^{(n)} = c_\beta n^{\mathfrak{t}}$. Let $\mathfrak{a} = \mathfrak{h}$ and $\mathfrak{w} = \min\{\mathfrak{b} + \mathfrak{h}, \mathfrak{t}\}/2$. Then, under mild regularity conditions, as $n \to \infty$,

$$\vartheta^{(n)}_t \Rightarrow \vartheta_t \tag{8}$$

in the Skorokhod topology in probability, where $\Rightarrow$ denotes weak convergence and $\vartheta_t$ is an Ornstein–Uhlenbeck process solving

$$d\vartheta_t = -\frac{1}{2} B \vartheta_t\, dt + \sqrt{A}\, dW_t, \tag{9}$$

with

$$\begin{aligned} B &= c_h \Gamma J_\star \mathbf{1}\{\mathfrak{a} = \mathfrak{h}\}, \\ A &= \frac{c_h}{c_\beta} \Gamma\, \mathbf{1}\{\mathfrak{h} + \mathfrak{b} \leq \mathfrak{t}\} + \frac{c_h^2}{4 c_b} \Gamma I_\star \Gamma^\top \mathbf{1}\{\mathfrak{t} \leq \mathfrak{b} + \mathfrak{h}\}. \end{aligned} \tag{10}$$

As discussed by Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)), one implication of this result is that the characterization of the full-path limiting dynamics relates the mixing behavior of SGLD to the spectral properties of the limiting Ornstein–Uhlenbeck process. Heuristically, the asymptotic mixing time scales inversely with the smallest eigenvalue of the corresponding drift matrix $B$. This motivates choosing the preconditioner $\Gamma$ to approximate $J_\star^{-1}$, which optimizes mixing speed in the asymptotic regime.
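To make the preconditioning heuristic concrete, the sketch below compares the spectrum of the drift matrix $B = c_h \Gamma J_\star$ for $\Gamma = I$ and $\Gamma = J_\star^{-1}$. The matrix `J_star`, the constant `c_h`, and the function name `drift_spectrum` are made-up values and names for illustration, and the ratio of largest to smallest eigenvalue is used only as a rough proxy for how unevenly different directions mix.

```python
import numpy as np

def drift_spectrum(c_h, Gamma, J_star):
    """Eigenvalues of B = c_h * Gamma @ J_star, sorted in increasing order."""
    B = c_h * Gamma @ J_star
    # For symmetric positive definite Gamma and J_star the eigenvalues are real.
    return np.sort(np.linalg.eigvals(B).real)

c_h = 1.0
J_star = np.array([[10.0, 2.0],
                   [2.0, 0.5]])        # hypothetical, ill-conditioned information matrix

eig_identity = drift_spectrum(c_h, np.eye(2), J_star)
eig_precond = drift_spectrum(c_h, np.linalg.inv(J_star), J_star)

print("Gamma = I      :", eig_identity, "ratio", eig_identity[-1] / eig_identity[0])
print("Gamma = J^{-1} :", eig_precond, "ratio", eig_precond[-1] / eig_precond[0])
```

With $\Gamma = J_\star^{-1}$ the drift matrix reduces to $c_h I_d$, so every direction contracts at the same rate and the slowest direction no longer limits mixing.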

### 2\.4 Uncertainty Quantification

[Theorem 2.1](https://arxiv.org/html/2606.00309#S2.Thmtheorem1) shows how the scaling of the step size, minibatch size, and inverse temperature determines which noise sources remain active in the limiting diffusion, and hence the form of the stationary covariance, as summarized by the following standard result.

###### Proposition 2\.2 \(Negrea et al\. \([2023](https://arxiv.org/html/2606.00309#bib.bib34)\), Corollary 1\)\.

For the Ornstein–Uhlenbeck process $(\vartheta_t)_{t \geq 0}$ defined by ([9](https://arxiv.org/html/2606.00309#S2.E9)) with known $\vartheta_0$, the law of $\vartheta_t$ is Gaussian with mean $e^{-tB/2} \vartheta_0$ and covariance

$$\Sigma_t = \int_0^t e^{-sB/2} A\, e^{-sB^\top/2}\, ds. \tag{11}$$

If a stationary distribution exists, then $\vartheta_t$ admits $\mathcal{N}(0, \Sigma_\infty)$ as its stationary law, where $\Sigma_\infty$ solves

$$\frac{1}{2} B \Sigma_\infty + \frac{1}{2} \Sigma_\infty B^\top = A. \tag{12}$$
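Equation (12) is a continuous-time Lyapunov equation, so $\Sigma_\infty$ can be computed numerically once $A$ and $B$ are known. The following is a minimal NumPy sketch that solves it by vectorization with Kronecker products; the function name `stationary_covariance` and the example matrices are illustrative assumptions, and the approach is only practical for small $d$ because it forms a $d^2 \times d^2$ linear system.

```python
import numpy as np

def stationary_covariance(B, A):
    """Solve (1/2) B S + (1/2) S B^T = A for S by vectorizing the equation."""
    d = B.shape[0]
    eye = np.eye(d)
    # With row-major vec: vec(B S) = (B kron I) vec(S), vec(S B^T) = (I kron B) vec(S).
    K = 0.5 * (np.kron(B, eye) + np.kron(eye, B))
    S = np.linalg.solve(K, A.reshape(-1)).reshape(d, d)
    return 0.5 * (S + S.T)   # remove round-off asymmetry

# Hypothetical drift and diffusion matrices (B has eigenvalues with positive real part).
B = np.array([[2.0, 0.5],
              [0.0, 1.0]])
A = np.array([[1.0, 0.2],
              [0.2, 0.5]])
Sigma_inf = stationary_covariance(B, A)
residual = 0.5 * B @ Sigma_inf + 0.5 * Sigma_inf @ B.T - A
print(Sigma_inf)
print("max |residual|:", np.abs(residual).max())
```

When $B = c I_d$ for a scalar $c > 0$, the equation reduces to $\Sigma_\infty = A / c$, which gives a quick sanity check for the solver.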

Thus, Proposition [2.2](https://arxiv.org/html/2606.00309#S2.Thmtheorem2) enables the targeting of a desired stationary covariance to ensure meaningful uncertainty quantification. In the Bayesian setting, the Bernstein–von Mises theorem states that the posterior is approximately $\mathcal{N}(\hat{\theta}^{(n)}, J_\star^{-1}/n)$ (Kleijn & van der Vaart, [2012](https://arxiv.org/html/2606.00309#bib.bib19)). Thus, one possible goal when using SGLD for uncertainty quantification is to obtain samples with a distribution that is approximately equal to $\mathcal{N}(\hat{\theta}^{(n)}, J_\star^{-1}/n)$. However, the sampling distribution of $\hat{\theta}^{(n)}$ is asymptotically normal with mean $\theta_\star$ and covariance equal to $J_\star^{-1} I_\star J_\star^{-1}/n$ (White, [1982](https://arxiv.org/html/2606.00309#bib.bib49)). The matrix $S_\star := J_\star^{-1} I_\star J_\star^{-1}$ is known as the “sandwich” covariance matrix, and it suggests that for proper uncertainty quantification we want the stationary SGLD distribution to be approximately $\mathcal{N}(\hat{\theta}^{(n)}, J_\star^{-1} I_\star J_\star^{-1}/n)$. When the model is correctly specified, $I_\star = J_\star$, so the sandwich covariance is equal to $J_\star^{-1}$ and the Bayesian posterior provides correct uncertainty quantification. In the misspecified setting, tuning SGLD to have stationary covariance $S_\star/n$ will correctly capture the sampling uncertainty. Alternatively, the model-based and sampling-based uncertainties can be combined, as in the bagged posterior (Huggins & Miller, [2024](https://arxiv.org/html/2606.00309#bib.bib17)). Based on these statistical considerations, Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)) propose to target either the Bernstein–von Mises uncertainty or the bagged posterior uncertainty, with the sampling uncertainty arising as a special case of the latter. The recommended tunings derived from [Theorem 2.1](https://arxiv.org/html/2606.00309#S2.Thmtheorem1) and Proposition [2.2](https://arxiv.org/html/2606.00309#S2.Thmtheorem2) are summarized in [Table 1](https://arxiv.org/html/2606.00309#S2.T1).
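The gap between the model-based covariance $J_\star^{-1}$ and the sandwich covariance $S_\star$ can be seen in a one-parameter example. The sketch below uses plug-in estimates of $I_\star$ and $J_\star$ for a working model $\mathcal{N}(\theta, 1)$ fitted to data that actually have variance 4; the example, the function name `sandwich_covariance`, and the sample size are assumptions chosen for illustration.

```python
import numpy as np

def sandwich_covariance(scores, hessians):
    """Plug-in estimates of I, J and S = J^{-1} I J^{-1}.

    scores:   array of shape (n, d), per-observation gradients of the log-likelihood
    hessians: array of shape (n, d, d), per-observation Hessians of the log-likelihood
    """
    n = scores.shape[0]
    I_hat = scores.T @ scores / n          # average outer product of the scores
    J_hat = -hessians.mean(axis=0)         # negative average Hessian
    J_inv = np.linalg.inv(J_hat)
    return I_hat, J_hat, J_inv @ I_hat @ J_inv

# Working model N(theta, 1); the data are N(0, 4), so the variance is misspecified.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=20000)
theta_hat = x.mean()                               # maximizer of the working log-likelihood
scores = (x - theta_hat).reshape(-1, 1)            # d/dtheta of -(x - theta)^2 / 2
hessians = -np.ones((x.size, 1, 1))                # second derivative is -1
I_hat, J_hat, S_hat = sandwich_covariance(scores, hessians)
print("J^{-1} (model-based):", np.linalg.inv(J_hat)[0, 0])
print("sandwich S          :", S_hat[0, 0])
```

Here $J_\star = 1$ while $I_\star$ equals the true data variance, so the sandwich estimate should be close to 4 and the model-based value understates the sampling variability of $\hat{\theta}^{(n)}$.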

Table 1: Recommended tuning parameter combinations and the corresponding asymptotic covariances of SGLD in Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)). BvM: targeting the asymptotic covariance of the posterior based on the Bernstein–von Mises theorem. Bagged post.: targeting the asymptotic covariance of the (generalized) bagged posterior (Huggins & Miller, [2024](https://arxiv.org/html/2606.00309#bib.bib17)). If targeting the sandwich covariance, use the bagged posterior tuning with $w_1 = 1$ and $w_2 = 0$.

## 3 Main Results

In this section we present our main theoretical results\. We first define the scaling\-limit objects for both the global parameter and latent variables, together with the associated*latent\-involved Fisher information matrix*\. We then state our main joint scaling\-limit theorem for SGLD–Gibbs, followed by corollaries on uncertainty quantification and latent\-variable mixing\. All technical assumptions and proofs are deferred to the supplementary materials\.

### 3\.1 Scaling\-limit objects and information matrices

We use the same definition of the rescaled global-parameter process $\vartheta^{(n)}_t$ as in ([6](https://arxiv.org/html/2606.00309#S2.E6)) and define the latent-variable process by

$$\zeta^{(n)}_{i,t} = z^{(n)}_{i, \lfloor n^{\mathfrak{a}} t \rfloor}, \tag{13}$$

where $z^{(n)}_{i,k}$ denotes the latent variable associated with observation $i$ at the $k$th iteration.

In contrast to $I_\star$ defined in [Section 2.3](https://arxiv.org/html/2606.00309#S2.SS3), we define the *latent-involved Fisher information matrix*

$$\widetilde{I}_\star := \mathbb{E}_{X, Z \mid \theta_\star}\left[ \left(\nabla_\theta \log p(X, Z \mid \theta_\star)\right)^{\otimes 2} \right] = I_\star + M_\star, \tag{14}$$

where

$$M_\star := \mathbb{E}_{X \mid \theta_\star}\Big[ \mathrm{Var}_{Z \mid X, \theta_\star}\big( \nabla_\theta \log p(X, Z \mid \theta_\star) \big) \Big] \succeq 0 \tag{15}$$

is the “Jensen gap” that quantifies the additional *algorithm-induced uncertainty* due to estimating the marginal likelihood with only a single Gibbs sample. It follows from the definition of $M_\star$ that the gap will be smaller when, on average, there is less uncertainty about $Z$ given $X$.
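The decomposition $\widetilde{I}_\star = I_\star + M_\star$ can be checked numerically on a small example. The sketch below uses a toy model that is an assumption of this illustration, not one from the paper: $Z = \pm 1$ with probability $1/2$ and $X \mid Z \sim \mathcal{N}(Z\theta, 1)$, for which the complete-data score is $ZX - \theta$, the marginal score is $X \tanh(\theta X) - \theta$, and $\mathrm{Var}(ZX - \theta \mid X) = X^2 (1 - \tanh^2(\theta X))$. Expectations are approximated with Gauss–Hermite quadrature, so the identity holds only up to quadrature error.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def information_decomposition(theta, n_nodes=80):
    """Return (I_star, M_star, I_tilde) for Z = +/-1 w.p. 1/2, X | Z ~ N(Z * theta, 1)."""
    eps, w = hermegauss(n_nodes)       # Gauss-Hermite rule for the weight exp(-x^2 / 2)
    w = w / w.sum()                    # normalize so weighted sums approximate N(0, 1) expectations
    # Marginal law of X: equal mixture of N(theta, 1) and N(-theta, 1).
    x = np.concatenate([theta + eps, -theta + eps])
    wx = np.concatenate([0.5 * w, 0.5 * w])
    post_mean_z = np.tanh(theta * x)                       # E[Z | X]
    I_star = np.sum(wx * (post_mean_z * x - theta) ** 2)   # second moment of the marginal score
    M_star = np.sum(wx * (1.0 - post_mean_z**2) * x**2)    # E[ Var(Z X - theta | X) ]
    # Second moment of the complete-data score, integrating over (Z, eps) with X = Z * theta + eps.
    I_tilde = 0.0
    for z in (-1.0, 1.0):
        I_tilde += 0.5 * np.sum(w * (z * (z * theta + eps) - theta) ** 2)
    return I_star, M_star, I_tilde

for theta in (0.5, 1.0, 2.0):
    I_star, M_star, I_tilde = information_decomposition(theta)
    print(f"theta={theta}: I={I_star:.4f}  M={M_star:.4f}  "
          f"I+M={I_star + M_star:.4f}  I_tilde={I_tilde:.4f}")
```

In this toy model $\widetilde{I}_\star = 1$ for every $\theta$, while $M_\star$ shrinks as the two components separate, which matches the remark that the gap is smaller when $Z$ is well determined by $X$.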

### 3\.2 Joint scaling limit for global and latent parameters

We now show that, under an appropriate asymptotic scaling, the rescaled global parameter and a fixed latent variable (e.g., $z_1$) converge jointly in distribution. The limiting process has independent diffusion and jump components. Specifically, the global parameter process converges to an Ornstein–Uhlenbeck process whose drift and diffusion structure is analogous to that in [Theorem 2.1](https://arxiv.org/html/2606.00309#S2.Thmtheorem1), except that the diffusion matrix is modified to explicitly incorporate algorithm-induced uncertainty arising in SGLD–Gibbs. Meanwhile, the latent-variable process converges to a Poisson-driven Gibbs jump process, with jumps drawn from the true conditional posterior of the latent variable.

###### Theorem 3\.1 \(Joint scaling limit of SGLD–Gibbs\)\.

Consider the SGLD–Gibbs algorithm with the same polynomial scaling of tuning parameters as in [Theorem 2.1](https://arxiv.org/html/2606.00309#S2.Thmtheorem1). Let $\mathfrak{a} = \mathfrak{h}$ and $\mathfrak{w} = \min\{\mathfrak{b} + \mathfrak{h}, \mathfrak{t}\}/2$. Assume the regularity conditions stated in [Section A.1](https://arxiv.org/html/2606.00309#A1.SS1). Then, as $n \to \infty$,

$$\bigl(\vartheta^{(n)}_t, \zeta^{(n)}_{1,t}\bigr) \Rightarrow \bigl(\vartheta_t, \zeta_{1,t}\bigr), \tag{16}$$

in the Skorokhod topology in probability, where the limiting processes $\vartheta_t$ and $\zeta_{1,t}$ are independent and defined as follows:

1. The global-parameter limit $\vartheta_t$ is an Ornstein–Uhlenbeck process solving

   $$d\vartheta_t = -\frac{1}{2} B \vartheta_t\, dt + \sqrt{A}\, dW_t, \tag{17}$$

   with

   $$\begin{aligned} B &= c_h \Gamma J_\star \mathbf{1}\{\mathfrak{a} = \mathfrak{h}\}, \\ A &= \frac{c_h}{c_\beta} \Gamma\, \mathbf{1}\{\mathfrak{h} + \mathfrak{b} \leq \mathfrak{t}\} + \frac{c_h^2}{4 c_b} \Gamma \widetilde{I}_\star \Gamma^\top \mathbf{1}\{\mathfrak{t} \leq \mathfrak{b} + \mathfrak{h}\}. \end{aligned} \tag{18}$$

2. When $\mathfrak{h} + \mathfrak{b} \leq 1$, the latent-variable limit $\zeta_{1,t}$ is a pure-jump Markov process with generator

   $$(\mathcal{L} f)(z) = \lambda \int \bigl\{ f(z') - f(z) \bigr\}\, p(z' \mid X_1, \theta_\star)\, dz', \tag{19}$$

   where $\lambda := c_b \mathbf{1}\{\mathfrak{h} + \mathfrak{b} = 1\}$.

For global parameters, the joint scaling limit in [Theorem 3.1](https://arxiv.org/html/2606.00309#S3.Thmtheorem1) yields the same scaling regimes in $(\mathfrak{h}, \mathfrak{b}, \mathfrak{t})$ as [Theorem 2.1](https://arxiv.org/html/2606.00309#S2.Thmtheorem1). The key difference in latent variable models is that the diffusion term involves $\widetilde{I}_\star$, which captures additional uncertainty arising from the latent variables and Gibbs sampling. This latent-induced contribution persists in the scaling limit and affects the stationary covariance of the global parameters, beyond what is implied by the marginal likelihood alone.
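As a worked illustration of how $\widetilde{I}_\star$ enters the stationary covariance, consider the balanced regime $\mathfrak{t} = \mathfrak{h} + \mathfrak{b}$, in which both indicators in (18) equal one, together with the idealized preconditioner $\Gamma = J_\star^{-1}$. These two choices are assumptions made here for illustration and are not claimed to be the paper's recommended tuning. Substituting into (18) and the stationarity equation (12), and using the symmetry of $J_\star$, gives:

```latex
\begin{aligned}
B &= c_h J_\star^{-1} J_\star = c_h I_d, \\
A &= \frac{c_h}{c_\beta} J_\star^{-1}
     + \frac{c_h^2}{4 c_b} J_\star^{-1} \widetilde{I}_\star J_\star^{-1}, \\
\tfrac{1}{2} B \Sigma_\infty + \tfrac{1}{2} \Sigma_\infty B^\top
  &= c_h \Sigma_\infty = A
  \;\Longrightarrow\;
  \Sigma_\infty = \frac{1}{c_\beta} J_\star^{-1}
     + \frac{c_h}{4 c_b} J_\star^{-1} \left( I_\star + M_\star \right) J_\star^{-1}.
\end{aligned}
```

Compared with the same calculation using $I_\star$ in place of $\widetilde{I}_\star$, the stationary covariance is inflated by $\frac{c_h}{4 c_b} J_\star^{-1} M_\star J_\star^{-1}$, which is positive semidefinite because $M_\star \succeq 0$.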

The behavior of the latent-variable dynamics depends critically on the relation between the scaling exponents $\mathfrak{h}$ and $\mathfrak{b}$. If $\mathfrak{h} + \mathfrak{b} = 1$, each latent variable evolves as a pure-jump Markov process with a nondegenerate limiting intensity, yielding a meaningful joint jump-diffusion limit as characterized in [Theorem 3.1](https://arxiv.org/html/2606.00309#S3.Thmtheorem1). If $\mathfrak{h} + \mathfrak{b} < 1$, the latent-variable dynamics become degenerate on the macroscopic timescale; this is because $z_1$ is refreshed too infrequently relative to the evolution of $\theta$, so the latent-variable process effectively freezes and loses any meaningful limiting behavior. If $\mathfrak{h} + \mathfrak{b} > 1$, latent-variable updates occur on a much faster time scale than the global parameters. In this scenario, the latent-variable generator diverges in the limit, and thus the sequence of latent-variable processes does not admit a well-defined limit.
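A heuristic way to see these three regimes: under sampling with replacement, a fixed index lands in a given minibatch with probability $1 - (1 - 1/n)^b \approx b/n$, and one unit of rescaled time spans about $n^{\mathfrak{h}}$ iterations, so the expected number of refreshes of $z_1$ per unit of rescaled time is roughly $c_b n^{\mathfrak{h} + \mathfrak{b} - 1}$. The Python sketch below evaluates this count at finite $n$; the function names and numerical values are illustrative assumptions.

```python
import math

def refresh_rate(n, c_b, exp_h, exp_b):
    """Expected refreshes of one latent variable per unit of rescaled time at sample size n."""
    b = max(1, math.floor(c_b * n**exp_b))       # minibatch size floor(c_b * n^b)
    p_hit = 1.0 - (1.0 - 1.0 / n) ** b           # P(index is in the minibatch), with replacement
    return n**exp_h * p_hit                      # n^h iterations per unit of rescaled time

def limiting_rate(c_b, exp_h, exp_b):
    """Limit of refresh_rate as n grows: c_b, 0, or infinity depending on exp_h + exp_b."""
    total = exp_h + exp_b
    if math.isclose(total, 1.0):
        return c_b
    return 0.0 if total < 1.0 else math.inf

for exp_h, exp_b in [(0.5, 0.5), (0.3, 0.3), (0.8, 0.5)]:
    rates = [refresh_rate(n, 2.0, exp_h, exp_b) for n in (10**3, 10**5, 10**7)]
    print((exp_h, exp_b), [round(r, 4) for r in rates],
          "limit:", limiting_rate(2.0, exp_h, exp_b))
```

The three exponent pairs correspond to the balanced, frozen, and divergent cases described above: the finite-$n$ rate stabilizes near $c_b$, decays toward zero, or grows without bound.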

###### Corollary 3\.2 \(Latent\-variable limit: stationarity and mixing\)\.

Assume the regime of Theorem [3.1](https://arxiv.org/html/2606.00309#S3.Thmtheorem1) with $\mathfrak{h} + \mathfrak{b} = 1$, in which the latent-variable limit is a pure-jump Markov process with rate $\lambda = c_b$. Let $\mu_t$ denote the law of $\zeta_{1,t}$ with initial law $\mu_0$. Define the $\varepsilon$-mixing time in total variation by

$$t_{\mathrm{mix}}(\varepsilon) := \inf\bigl\{ t \geq 0 : d_{\mathrm{TV}}(\mu_t, \pi) \leq \varepsilon \bigr\}. \tag{20}$$

Then $\pi(\cdot) = p(\,\cdot \mid X_1, \theta_\star)$ is the unique stationary distribution and, for all $t \geq 0$,

$$\mu_t = e^{-\lambda t} \mu_0 + (1 - e^{-\lambda t}) \pi. \tag{21}$$

Consequently, $d_{\mathrm{TV}}(\mu_t, \pi) = e^{-\lambda t} d_{\mathrm{TV}}(\mu_0, \pi)$, and hence

$$t_{\mathrm{mix}}(\varepsilon) = \frac{1}{\lambda} \log\left(\frac{1}{\varepsilon}\right). \tag{22}$$

This result implies that $\zeta_{1,t}$ evolves as a refresh process that intermittently resamples from $p(z_1\mid X_1,\theta^{\star})$. Since each latent-variable update acts as an instantaneous refresh in the scaling limit, the effective mixing time is of order $1/\lambda$. For the global parameter, the relevant mixing heuristic is inherited from the Ornstein–Uhlenbeck scaling limit via the drift matrix $B$ of the limiting Ornstein–Uhlenbeck process, as in Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)).
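
The corollary can be checked numerically on a finite state space. The following is a minimal sketch, assuming a latent variable with three states and arbitrary illustrative distributions; the helper names and the optional `d0` argument are not from the paper. Note that the closed form in (22) corresponds to the worst case $d_{\mathrm{TV}}(\mu_0,\pi)=1$; for a general starting law, solving the exponential decay for $t$ gives $\frac{1}{\lambda}\log\bigl(d_{\mathrm{TV}}(\mu_0,\pi)/\varepsilon\bigr)$ whenever this is nonnegative.

```python
import math


def tv_distance(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def law_at_time(mu0, pi, lam, t):
    """Law of the refresh process at time t: a mixture of mu0 and pi, as in (21)."""
    w = math.exp(-lam * t)
    return [w * m + (1.0 - w) * s for m, s in zip(mu0, pi)]


def mixing_time(eps, lam, d0=1.0):
    """Smallest t with exp(-lam * t) * d0 <= eps; d0 = 1 recovers (22)."""
    return max(0.0, math.log(d0 / eps) / lam)


if __name__ == "__main__":
    mu0 = [1.0, 0.0, 0.0]   # illustrative initial law
    pi = [0.2, 0.3, 0.5]    # illustrative stationary law
    lam = 2.0
    for t in (0.0, 0.5, 1.0):
        print(t, tv_distance(law_at_time(mu0, pi, lam, t), pi))
    print("t_mix(0.01) =", mixing_time(0.01, lam))
```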

Table 2:Recommended tuning parameter combinations and the corresponding asymptotic covariances of SGLD–Gibbs\. See the caption of[Table1](https://arxiv.org/html/2606.00309#S2.T1)for further explanation\.
### 3.3 Scaling limit under generalized Gibbs approximation

We now consider a natural extension of the SGLD–Gibbs update that is commonly used in practice, in which the single latent-variable draw in the stochastic gradient is replaced by an average over multiple conditional samples. Letting $S$ denote the number of Gibbs samples per iteration, for each $i\in I_k$ we now draw $z_{i,k+1}^{(1)},\dots,z_{i,k+1}^{(S)}\stackrel{\text{i.i.d.}}{\sim}p(z_i\mid X_i,\theta_k)$. We then define the averaged stochastic gradient

$$G_k^{(S)}(\theta):=\frac{1}{n}\nabla\log\pi_0(\theta)+\frac{1}{bS}\sum_{i\in I_k}\sum_{s=1}^{S}\nabla_{\theta}\log p\!\left(X_i,z_{i,k+1}^{(s)}\mid\theta\right). \tag{23–24}$$

For $S=1$, this reduces to the standard SGLD–Gibbs gradient $G_k(\theta)$ defined in [Section 2.2](https://arxiv.org/html/2606.00309#S2.Ex6). [Theorem 3.1](https://arxiv.org/html/2606.00309#S3.Thmtheorem1) generalizes to this case, with the only difference being that the diffusion matrix $A$ is replaced by

$$A_S=\frac{c_h}{c_\beta}\Gamma\,\mathbf{1}\{\mathfrak{h}+\mathfrak{b}\leq\mathfrak{t}\}+\frac{c_h^{2}}{4c_b}\Gamma\widetilde{I}_{\star}^{(S)}\Gamma^{\top}\mathbf{1}\{\mathfrak{t}\leq\mathfrak{b}+\mathfrak{h}\}, \tag{25–26}$$

where $\widetilde{I}_{\star}^{(S)}:=I_{\star}+\frac{1}{S}M_{\star}$. Hence, when $S>1$, the "Jensen gap" contribution decreases from $M_{\star}$ to $M_{\star}/S$ and goes to zero as $S\to\infty$ (i.e., as the marginal likelihood estimate error goes to zero). However, smaller algorithm-induced uncertainty comes at the cost of $S$ times as many gradient evaluations per iteration. Our result thus helps quantify the trade-off between computational cost and accuracy of the uncertainty quantification.
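
The $1/S$ scaling of the latent-variable noise can be seen in a small simulation. The sketch below is not the paper's implementation: it assumes a hypothetical scalar toy model, $z_i\mid\theta\sim N(\theta,1)$ and $x_i\mid z_i\sim N(z_i,1)$, for which the conditional $p(z_i\mid x_i,\theta)$ is $N((x_i+\theta)/2,\,1/2)$ and $\nabla_\theta\log p(x_i,z_i\mid\theta)=z_i-\theta$. The minibatch is held fixed so that only the Gibbs draws contribute to the variance.

```python
import random
import statistics


def averaged_gradient(theta, batch, n, S, rng, grad_log_prior=lambda th: 0.0):
    """Averaged stochastic gradient in the form of (23)-(24) for a toy scalar model.

    Toy model (an assumption for illustration only):
        z_i | theta ~ N(theta, 1),   x_i | z_i ~ N(z_i, 1),
    so that z_i | x_i, theta ~ N((x_i + theta) / 2, 1/2) and
    d/dtheta log p(x_i, z_i | theta) = z_i - theta.
    """
    b = len(batch)
    total = 0.0
    for x in batch:
        for _ in range(S):
            z = rng.gauss((x + theta) / 2.0, 0.5 ** 0.5)  # exact conditional draw
            total += z - theta
    return grad_log_prior(theta) / n + total / (b * S)


def gradient_variance(S, reps=4000, seed=0):
    """Monte Carlo variance of the averaged gradient over the Gibbs draws only."""
    rng = random.Random(seed)
    batch = [0.3, -1.2, 0.8, 2.0]  # fixed minibatch, so only latent noise varies
    draws = [averaged_gradient(0.5, batch, n=1000, S=S, rng=rng) for _ in range(reps)]
    return statistics.variance(draws)


if __name__ == "__main__":
    for S in (1, 10):
        print(S, gradient_variance(S))
```

In this toy model the latent-induced variance is $1/(2bS)$, so moving from $S=1$ to $S=10$ should cut it by roughly a factor of ten, at ten times the number of gradient evaluations.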

### 3.4 Parameter tuning and uncertainty quantification

To retain a nondegenerate and interpretable joint scaling limit, we focus on the regime $\mathfrak{h}+\mathfrak{b}=1$, in which the latent-variable component admits a nontrivial macroscopic limit. Our discussion covers general $S\geq 1$, with the special case $S=1$ recovering the standard SGLD–Gibbs update with a single conditional draw. To keep the Gaussian noise active in the limit, we also set $\mathfrak{t}=1$. Thus, we parameterize the tuning constants following the SGLD scaling-limit literature by setting

$$\beta^{(n)}=\frac{n}{w_2},\quad\text{and}\quad h^{(n)}=\frac{4w_1\,b^{(n)}}{n}, \tag{27}$$

where $w_1,w_2>0$ are constants. With these choices, the stationary covariance of the limiting OU process depends on the choice of preconditioner $\Gamma$ and on $S$. This leads to two natural tuning strategies.
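
As a small worked example, the parameterization in (27) can be wrapped in a helper. This is a sketch only: the function name `tuning_constants` is hypothetical, and treating $w_2=0$ as an infinite inverse temperature (no injected Gaussian noise) is a convention adopted here for illustration.

```python
def tuning_constants(n: int, b: int, w1: float, w2: float):
    """Step size h and inverse temperature beta following (27).

    Convention assumed here: w2 == 0 is mapped to beta = infinity,
    i.e. no Gaussian noise is injected.
    """
    if w1 <= 0:
        raise ValueError("w1 must be positive (w1 = 0 would give a zero step size)")
    if w2 < 0:
        raise ValueError("w2 must be nonnegative")
    h = 4.0 * w1 * b / n
    beta = float("inf") if w2 == 0 else n / w2
    return h, beta


if __name__ == "__main__":
    # w1 = w2 = 1/2 reproduces h = 2b/n and beta = 2n (values reported for the GMM study).
    print(tuning_constants(n=30_000, b=50, w1=0.5, w2=0.5))
    # w1 = w2 = 1 reproduces h = 4b/n and beta = n (values reported for the LDA study).
    print(tuning_constants(n=10_000, b=100, w1=1.0, w2=1.0))
```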

#### Bernstein–von Mises covariance via $\Gamma=\bigl(\widetilde{I}_{\star}^{(S)}\bigr)^{-1}$.

If we choose the preconditioner $\Gamma=\bigl(\widetilde{I}_{\star}^{(S)}\bigr)^{-1}$, then, by taking the weights to satisfy $w_1+w_2=1$, the stationary covariance is $J_{\star}^{-1}$, recovering Bayes-type uncertainty quantification. Note that taking $w_2=0$ (so that $\beta^{(n)}=\infty$ in (27) and no Gaussian noise is injected) results in using SGD–Gibbs rather than SGLD–Gibbs.

#### Bagged posterior and sandwich-type covariances via $\Gamma=J_{\star}^{-1}$.

Alternatively, if we choose $\Gamma=J_{\star}^{-1}$, the stationary covariance takes the bagged posterior form

$$\begin{aligned}&w_1\,J_{\star}^{-1}\,\widetilde{I}_{\star}^{(S)}\,J_{\star}^{-1}+w_2\,J_{\star}^{-1}\\&\quad=w_1J_{\star}^{-1}I_{\star}J_{\star}^{-1}+w_2J_{\star}^{-1}+\frac{w_1}{S}\,J_{\star}^{-1}M_{\star}J_{\star}^{-1}.\end{aligned} \tag{28–29}$$

The third term isolates the algorithm-induced contribution and decays as $S$ increases. A sandwich-type covariance can thus be recovered by taking $w_1=1$ and $w_2=0$. The resulting tuning recommendations are summarized in [Table 2](https://arxiv.org/html/2606.00309#S3.T2).
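
Both tuning strategies can be sanity-checked in one dimension. The sketch below makes assumptions that are not stated in this section: the limiting OU process is taken to have drift $B=\tfrac{c_h}{2}\Gamma J_{\star}$ (in the style of Negrea et al., 2023), both indicators in (25)–(26) are taken to equal one, and $c_h=4w_1c_b$, $c_\beta=1/w_2$ are read off from (27). In the scalar case the stationary variance $Q$ solves $2BQ=A_S$. The numerical values of $J_\star$, $I_\star$, $M_\star$ are arbitrary.

```python
def stationary_variance(gamma, J, I, M, S, w1, w2, c_b=1.0):
    """Stationary variance of a scalar limiting OU process.

    Assumes (not stated in this section) drift B = (c_h / 2) * gamma * J,
    diffusion A_S as in (25)-(26) with both indicators equal to one,
    and c_h = 4 * w1 * c_b, c_beta = 1 / w2 as implied by (27).
    """
    I_tilde = I + M / S                       # I_tilde^(S) = I + M / S
    c_h = 4.0 * w1 * c_b
    A = c_h * w2 * gamma + c_h ** 2 / (4.0 * c_b) * gamma * I_tilde * gamma
    B = 0.5 * c_h * gamma * J
    return A / (2.0 * B)                      # scalar Lyapunov equation 2 * B * Q = A


if __name__ == "__main__":
    J, I, M, S = 2.0, 3.0, 1.5, 3             # arbitrary illustrative values
    I_tilde = I + M / S
    # Bernstein-von Mises tuning: gamma = 1 / I_tilde, w1 + w2 = 1  ->  1 / J
    print(stationary_variance(1.0 / I_tilde, J, I, M, S, w1=0.3, w2=0.7), 1.0 / J)
    # Bagged-posterior tuning: gamma = 1 / J  ->  w1 * I_tilde / J**2 + w2 / J
    print(stationary_variance(1.0 / J, J, I, M, S, w1=1.0, w2=1.0),
          I_tilde / J ** 2 + 1.0 / J)
```

For matrices the same check would require solving the Lyapunov equation $BQ+QB^{\top}=A_S$ rather than dividing.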

![Refer to caption](https://arxiv.org/html/2606.00309v1/x1.png) a. Empirical stationary density of the rescaled global parameter $\vartheta^{(n)}$ under SGLD–Gibbs, compared with the Ornstein–Uhlenbeck stationary distribution predicted by the scaling limit.
![Refer to caption](https://arxiv.org/html/2606.00309v1/x2.png) b. Empirical stationary behavior of representative latent assignments, consistent with the jump-type limiting dynamics.

Figure 1: Validation of the joint scaling limit on a synthetic Gaussian mixture model.

### 3.5 Proof sketch of [Theorem 3.1](https://arxiv.org/html/2606.00309#S3.Thmtheorem1)

Our proof follows the spirit of Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)). We establish weak convergence of the processes in the Skorokhod topology in probability by first proving almost sure convergence along subsequences. This is achieved by showing that the difference between the approximate generator and the limiting generator, evaluated on smooth test functions with compact support, vanishes uniformly. We divide the proof into two parts.

In Part 1, we consider arguments that are sufficiently far from the support of the test function. The main idea is to control the probability that the global-parameter process jumps back into the support of the test function. A key difference with Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)) is that we must impose assumptions on the joint likelihood uniformly over latent-variable values to ensure that, even in the presence of additional uncertainty induced by the latent-variable updates, the probability that the global parameter jumps back into the support can still be controlled at a comparable scale.

In Part 2, we analyze arguments that lie in or are close to the support of the test function. We perform a Taylor expansion of the joint approximate generator. The drift term converges in the same way as in Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)). For the gradient component of the diffusion term, a similar approach applies, but the resulting limit now explicitly incorporates additional variability arising from sampling the latent variables. We then have to control a new term that bridges the infinitesimal operator associated with Gibbs updates and the generator of a Poisson jump process. The most technically challenging new aspect is to show that all cross terms involving both the global parameters and the latent variables vanish in the limit. This vanishing further implies that, in the asymptotic regime, the contribution of any single observation becomes asymptotically independent of the global-parameter dynamics, which in turn allows the joint limiting distribution to factorize into independent components.

## 4 Experiments

Our experiments are designed to answer three questions. First, we verify the predictions of our scaling-limit theory: under the hyperparameter tuning guided by [Table 2](https://arxiv.org/html/2606.00309#S3.T2), samples produced by SGLD–Gibbs should match the stationary distributions predicted by the limiting processes. On synthetic models with known ground-truth parameters, we directly compare empirical posteriors of both global parameters and latent variables with their theoretical stationary distributions. Second, we evaluate the quality of uncertainty quantification provided by SGLD–Gibbs and compare it with stochastic variational inference (SVI), which we assess using posterior variances and rank-uniformity calibration diagnostics. Third, on real datasets where direct verification against ground-truth parameters is not possible, we evaluate downstream performance using clustering accuracy (ARI and AMI) for mixture models and held-out perplexity for topic models. Complete experimental details are provided in Appendix [C](https://arxiv.org/html/2606.00309#A3).

### 4.1 Synthetic Gaussian Mixture Model

![Refer to caption](https://arxiv.org/html/2606.00309v1/x3.png) a. Rank-uniformity diagnostic for global parameters. SGLD–Gibbs yields empirical ranks closer to the uniform reference line than stochastic variational inference (SVI), indicating better-calibrated uncertainty.
![Refer to caption](https://arxiv.org/html/2606.00309v1/x4.png) b. Representative posterior marginals for selected parameters under SGLD–Gibbs and SVI, together with the true parameter values. SGLD–Gibbs concentrates more accurately around the ground truth. Increasing $S$ reduces uncertainty significantly.

Figure 2: Synthetic Gaussian GMM: uncertainty quantification and posterior accuracy.

We generate synthetic data with sample size $n=30{,}000$ with observations $x_i\in\mathbb{R}^{8}$. The data are drawn from a finite Gaussian mixture with $6$ clusters. We run SGLD–Gibbs using a minibatch size $b=50$ and consider Gibbs updates with $S\in\{1,10\}$ samples. For SGLD, we follow the sandwich tuning and choose the step size and inverse temperature as $h=2b/n$ and $\beta=2n$. We use a preconditioner for the global parameters constructed from an estimate of $J_{\star}^{-1}$, as described in [Table 2](https://arxiv.org/html/2606.00309#S3.T2). As a baseline, we apply SVI with a standard mean-field variational family. We use a diagonal-covariance GMM with Normal–Gamma variational factors for the component means and precisions.

[Figure 1](https://arxiv.org/html/2606.00309#S3.F1) demonstrates that our scaling limit theory accurately predicts the distributions of the global and latent variables. As shown in [Figure 2](https://arxiv.org/html/2606.00309#S4.F2)(a), SGLD–Gibbs also produces substantially better calibrated uncertainty than SVI, with empirical ranks lying close to the uniform reference. Finally, [Figure 2](https://arxiv.org/html/2606.00309#S4.F2)(b) demonstrates that SGLD–Gibbs also provides more accurate parameter estimates. Moreover, as predicted by our theory, increasing the number of Gibbs draws to $S=10$ leads to a clear reduction in posterior uncertainty, resulting in visibly tighter marginal distributions compared to the $S=1$ case.
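
For readers unfamiliar with rank-based calibration checks, the idea can be sketched generically: over many replications, the rank of the true parameter among posterior draws should be uniform if the posterior is well calibrated. The example below is an illustration only and is not the paper's procedure (which is described in its Appendix C). It assumes a hypothetical conjugate toy model, $\theta\sim N(0,1)$ and $x\mid\theta\sim N(\theta,1)$, with exact posterior $N(x/2,1/2)$; the `posterior_sd_scale` argument is an invented knob that mimics an under-dispersed approximation.

```python
import random


def rank_statistic(true_value, draws):
    """Number of posterior draws strictly below the true value."""
    return sum(d < true_value for d in draws)


def simulate_ranks(reps=1000, num_draws=19, posterior_sd_scale=1.0, seed=0):
    """Ranks for a conjugate toy model: theta ~ N(0,1), x | theta ~ N(theta,1).

    The exact posterior is N(x/2, 1/2); posterior_sd_scale != 1 mimics an
    over- or under-dispersed approximation.
    """
    rng = random.Random(seed)
    ranks = []
    for _ in range(reps):
        theta = rng.gauss(0.0, 1.0)
        x = rng.gauss(theta, 1.0)
        sd = posterior_sd_scale * 0.5 ** 0.5
        draws = [rng.gauss(x / 2.0, sd) for _ in range(num_draws)]
        ranks.append(rank_statistic(theta, draws))
    return ranks


def extreme_rank_fraction(ranks, num_draws=19):
    """Fraction of ranks at either end; about 2 / (num_draws + 1) if uniform."""
    return sum(r in (0, num_draws) for r in ranks) / len(ranks)


if __name__ == "__main__":
    print("calibrated:     ", extreme_rank_fraction(simulate_ranks()))
    print("under-dispersed:", extreme_rank_fraction(simulate_ranks(posterior_sd_scale=0.3)))
```

An under-dispersed approximation piles ranks up at the extremes, which is the kind of departure from the uniform reference line that such a diagnostic is meant to reveal.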

### 4.2 Synthetic Latent Dirichlet Allocation

![Refer to caption](https://arxiv.org/html/2606.00309v1/x5.png) a. Rank-uniformity diagnostic over topic-word probabilities. SGRLD–Gibbs yields empirical ranks closer to the uniform reference line, indicating better-calibrated uncertainty.
![Refer to caption](https://arxiv.org/html/2606.00309v1/x6.png) b. Posterior marginals for selected topic-word probabilities under SGRLD–Gibbs and SVI, with true parameter values indicated. SGRLD–Gibbs concentrates more accurately around the ground truth with higher uncertainty. Increasing $S$ slightly reduces uncertainty.

Figure 3: Synthetic LDA: uncertainty calibration and posterior accuracy.

We generate a synthetic topic model with vocabulary size $V=50$, number of topics $K=3$, and corpus size $d=10000$ documents. Each document is generated from a Latent Dirichlet Allocation (LDA) model with fixed topic proportions and topic-word distributions. We run SGLD–Gibbs using a minibatch size $b=100$ and consider Gibbs updates with $S\in\{1,5\}$ samples. For SGLD we use the bagged posterior tuning with $w_1=w_2=1$ (so $h=4b/n$ and $\beta=n$) and for the preconditioner we follow Patterson & Teh ([2013](https://arxiv.org/html/2606.00309#bib.bib37)). This preconditioner is motivated by the natural geometry of the simplex, taking a diagonal form proportional to $\mathrm{diag}(\theta)$. The resulting algorithm is known as *stochastic gradient Riemannian Langevin dynamics* (SGRLD). As a baseline, we apply stochastic variational inference (SVI) to a semi-collapsed LDA model. The document-level topic proportions are integrated out analytically, and we use a mean-field variational family in which the global topic-word distributions are modeled by Dirichlet factors and the latent topic assignments by categorical factors.

[Figure 3](https://arxiv.org/html/2606.00309#S4.F3)(a) shows that SGRLD–Gibbs yields ranks substantially closer to uniform than SVI, indicating superior calibration. [Figure 3](https://arxiv.org/html/2606.00309#S4.F3)(b) shows that the SGRLD–Gibbs posterior concentrates more tightly around the ground-truth values and better matches the target uncertainty, whereas SVI exhibits noticeable miscalibration and bias in several coordinates. Using $S=5$ only slightly reduces uncertainty compared to $S=1$ (far less than in GMM), likely due to the small $S$ and the strong autocorrelation of Gibbs updates in LDA.

### 4.3 Real-data evaluation: Flow Cytometry (GMM) and 20 Newsgroups (LDA)

Table 3: Performance comparison on real datasets. For Flow Cytometry (GMM), clustering quality is evaluated using adjusted Rand index (ARI) and adjusted mutual information (AMI). For 20 Newsgroups (LDA), predictive performance is measured by held-out perplexity (lower is better).

We further evaluate SGLD–Gibbs on two real-world latent variable models: a diagonal-covariance Gaussian mixture model (GMM) on a flow cytometry dataset and latent Dirichlet allocation (LDA) on the 20 Newsgroups corpus.

#### Flow Cytometry \(GMM\)\.

We measure inferential quality in terms of the adjusted Rand index (ARI) and adjusted mutual information (AMI), which compare the inferred clustering to the provided expert clustering. [Table 3](https://arxiv.org/html/2606.00309#S4.T3) shows that SGLD–Gibbs attains substantially higher ARI and AMI than SVI.
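
For reference, the ARI can be computed directly from the contingency table of the two labelings. The following is a minimal standard-library sketch of the usual pair-counting formula; it is not the paper's evaluation code, and the function name and the handling of degenerate inputs are choices made here for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two flat clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have equal length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # convention chosen here for trivially small inputs
    pair_counts = Counter(zip(labels_a, labels_b))  # contingency table cells
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. one cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabeled
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # maximally discordant pairs
```

The index is invariant to relabeling of clusters, which is why it is suitable for comparing an inferred clustering against expert annotations.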

#### 20 Newsgroups \(LDA\)\.

For LDA, we compare methods based on predictive performance using held-out perplexity. As shown in [Table 3](https://arxiv.org/html/2606.00309#S4.T3), SGLD–Gibbs achieves much better perplexity ($\sim 350$ nats) than SVI.
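
As a reminder of the metric, perplexity is the exponential of the negative average per-token log-probability on held-out text. The sketch below shows only this standard definition; how the paper obtains the held-out log-probabilities (for example, its document-splitting protocol) is not described in this section and is not reproduced here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity exp(-mean log-probability) from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one held-out token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    # A model that predicts uniformly over a 50-word vocabulary has perplexity 50.
    uniform_log_probs = [math.log(1.0 / 50.0)] * 200
    print(perplexity(uniform_log_probs))
```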

Overall, these real\-data results complement our synthetic experiments, demonstrating that in practice SGLD–Gibbs retains the scalability of SVI but provides superior task\-specific performance\.

## 5 Discussion and Future Work

This work provides a scaling\-limit perspective on SGLD–Gibbs in latent variable models, clarifying how uncertainty quantification and algorithmic tuning are shaped by the interaction between stochastic\-gradient dynamics and latent\-variable updates\. The resulting joint jump–diffusion limit explains how additional algorithm\-induced uncertainty due to estimating the marginal likelihood with Gibbs samples contributes to the effective noise of the global parameters and yields principled guidance for hyperparameter scaling\. Empirically, our results suggest that SGLD–Gibbs achieves better\-calibrated posterior uncertainty and better predictive performance than variational methods\.

There are a number of limitations of our work that motivate directions for future research. Our results are for latent variable models in which each data point is associated with a single local latent variable refreshed through Gibbs updates. Many important models exhibit more complex dependency structures, such as Bayesian matrix factorization, mixed-effects models, or hierarchical topic models, where latent variables are shared across observations or interact globally. Extending the joint scaling-limit framework to such settings would require capturing richer coupling between latent variables and global parameters, and may lead to qualitatively different limiting dynamics and hence distinct tuning guidance. Moreover, when the latent-variable updates do not mix rapidly, for instance under SGLD or Gibbs samplers with more complicated dependency structures, the resulting latent variables can exhibit temporal dependence and nontrivial interactions with the global-parameter iterates. Analyzing such regimes is an interesting direction for future work. Beyond these model-specific extensions, another important direction is to compare scaling-limit-based tuning with more algorithmic tuning criteria, such as the KSD-based strategy of Coullon et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib8)).

## Impact Statement

This paper presents work whose goal is to advance the field of probabilistic machine learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.

## References

- Ahn et al\. \(2015\)Ahn, S\., Korattikara, A\., Liu, N\., Rajan, S\., and Welling, M\.Large\-scale distributed Bayesian matrix factorization using stochastic gradient MCMC, 2015\.URL[https://arxiv\.org/abs/1503\.01596](https://arxiv.org/abs/1503.01596)\.
- Anastasiou et al\. \(2019\)Anastasiou, A\., Balasubramanian, K\., and Erdogdu, M\. A\.Normal approximation for stochastic gradient descent via non\-asymptotic rates of martingale CLT\.In Beygelzimer, A\. and Hsu, D\. \(eds\.\),*Proceedings of the Thirty\-Second Conference on Learning Theory*, volume 99 of*Proceedings of Machine Learning Research*, pp\. 115–137\. PMLR, 25–28 Jun 2019\.URL[https://proceedings\.mlr\.press/v99/anastasiou19a\.html](https://proceedings.mlr.press/v99/anastasiou19a.html)\.
- Arous et al\. \(2022\)Arous, G\. B\., Gheissari, R\., and Jagannath, A\.High\-dimensional limit theorems for SGD: effective dynamics and critical scaling\.In*Proceedings of the 36th International Conference on Neural Information Processing Systems*, NIPS ’22, Red Hook, NY, USA, 2022\. Curran Associates Inc\.ISBN 9781713871088\.
- Bottou et al\. \(2018\)Bottou, L\., Curtis, F\. E\., and Nocedal, J\.Optimization methods for large\-scale machine learning\.*SIAM Review*, 60\(2\):223–311, 2018\.doi:10\.1137/16M1080173\.URL[https://doi\.org/10\.1137/16M1080173](https://doi.org/10.1137/16M1080173)\.
- Cheng et al\. \(2020\)Cheng, X\., Yin, D\., Bartlett, P\., and Jordan, M\.Stochastic gradient and Langevin processes\.In*International Conference on Machine Learning*, pp\. 1810–1819\. PMLR, 2020\.
- Collins\-Woodfin et al\. \(2023\)Collins\-Woodfin, E\., Paquette, C\., Paquette, E\., and Seroussi, I\.Hitting the high\-dimensional notes: An ode for sgd learning dynamics on glms and multi\-index models, 2023\.URL[https://arxiv\.org/abs/2308\.08977](https://arxiv.org/abs/2308.08977)\.
- Collins\-Woodfin et al\. \(2024\)Collins\-Woodfin, E\., Seroussi, I\., Malaxechebarría, B\. n\. G\., Mackenzie, A\. W\., Paquette, E\., and Paquette, C\.The high line: exact risk and learning rate curves of stochastic adaptive learning rate algorithms\.In*Proceedings of the 38th International Conference on Neural Information Processing Systems*, NIPS ’24, Red Hook, NY, USA, 2024\. Curran Associates Inc\.ISBN 9798331314385\.
- Coullon et al\. \(2023\)Coullon, J\., South, L\., and Nemeth, C\.Efficient and generalizable tuning strategies for stochastic gradient mcmc\.*Statistics and Computing*, 33\(3\), April 2023\.ISSN 0960\-3174\.doi:10\.1007/s11222\-023\-10233\-3\.URL[https://doi\.org/10\.1007/s11222\-023\-10233\-3](https://doi.org/10.1007/s11222-023-10233-3)\.
- Danaher \(2023\)Danaher, P\. J\.Optimal microtargeting of advertising\.*Journal of Marketing Research*, 60\(3\):564–584, 2023\.doi:10\.1177/00222437221116034\.URL[https://doi\.org/10\.1177/00222437221116034](https://doi.org/10.1177/00222437221116034)\.
- Dieuleveut et al\. \(2020\)Dieuleveut, A\., Durmus, A\., and Bach, F\.Bridging the gap between constant step size stochastic gradient descent and Markov chains\.*The Annals of Statistics*, 48\(3\):1348 – 1382, 2020\.URL[https://doi\.org/10\.1214/19\-AOS1850](https://doi.org/10.1214/19-AOS1850)\.
- Ethier & Kurtz \(2009\)Ethier, S\. N\. and Kurtz, T\. G\.*Markov Processes: Characterization and Convergence*\.Wiley Series in Probability and Statistics\. John Wiley & Sons, 2009\.ISBN 9780470412035\.
- Ge et al\. \(2015\)Ge, R\., Huang, F\., Jin, C\., and Yuan, Y\.Escaping from saddle points—online stochastic gradient for tensor decomposition\.In*Conference on learning theory*, pp\. 797–842\. PMLR, 2015\.
- Gelman et al\. \(2013\)Gelman, A\., Carlin, J\., Stern, H\., Dunson, D\., Vehtari, A\., and Rubin, D\.*Bayesian Data Analysis*\.Chapman & Hall/CRC Texts in Statistical Science Series\. CRC, Boca Raton, Florida, third edition, 2013\.ISBN 9781439840955 1439840954\.URL[https://stat\.columbia\.edu/~gelman/book/](https://stat.columbia.edu/~gelman/book/)\.
- Giordano et al\. \(2018\)Giordano, R\., Broderick, T\., and Jordan, M\. I\.Covariances, robustness and Variational Bayes\.*J\. Mach\. Learn\. Res\.*, 19\(1\):1981–2029, January 2018\.ISSN 1532\-4435\.
- Hoffman et al\. \(2010\)Hoffman, M\., Bach, F\., and Blei, D\.Online learning for Latent Dirichlet Allocation\.In Lafferty, J\., Williams, C\., Shawe\-Taylor, J\., Zemel, R\., and Culotta, A\. \(eds\.\),*Advances in Neural Information Processing Systems*, volume 23\. Curran Associates, Inc\., 2010\.URL[https://proceedings\.neurips\.cc/paper\_files/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0\-Paper\.pdf](https://proceedings.neurips.cc/paper_files/paper/2010/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf)\.
- Hoffman et al\. \(2013\)Hoffman, M\. D\., Blei, D\. M\., Wang, C\., and Paisley, J\.Stochastic variational inference\.*Journal of Machine Learning Research*, 14\(40\):1303–1347, 2013\.URL[http://jmlr\.org/papers/v14/hoffman13a\.html](http://jmlr.org/papers/v14/hoffman13a.html)\.
- Huggins & Miller \(2024\)Huggins, J\. H\. and Miller, J\. W\.Reproducible parameter inference using bagged posteriors\.*Electronic Journal of Statistics*, 18\(1\), 2024\.ISSN 1935\-7524\.doi:10\.1214/24\-ejs2237\.
- Jin et al\. \(2017\)Jin, C\., Ge, R\., Netrapalli, P\., Kakade, S\. M\., and Jordan, M\. I\.How to escape saddle points efficiently\.In*International conference on machine learning*, pp\. 1724–1732\. PMLR, 2017\.
- Kleijn & van der Vaart \(2012\)Kleijn, B\. and van der Vaart, A\.The Bernstein\-Von\-Mises theorem under misspecification\.*Electronic Journal of Statistics*, \(6\):354–381, 2012\.
- Kucukelbir et al\. \(2017\)Kucukelbir, A\., Tran, D\., Ranganath, R\., Gelman, A\., and Blei, D\. M\.Automatic differentiation variational inference\.*J\. Mach\. Learn\. Res\.*, 18\(1\):430–474, January 2017\.ISSN 1532\-4435\.
- Kushner & Yin \(2003\)Kushner, H\. and Yin, G\.*Stochastic Approximation and Recursive Algorithms and Applications*\.Stochastic Modelling and Applied Probability\. Springer New York, 2003\.ISBN 9780387008943\.URL[https://books\.google\.com/books?id=\_0bIieuUJGkC](https://books.google.com/books?id=_0bIieuUJGkC)\.
- Kushner & Huang \(1981\)Kushner, H\. J\. and Huang, H\.Asymptotic properties of stochastic approximations with constant coefficients\.*SIAM Journal on Control and Optimization*, 19\(1\):87–105, 1981\.doi:10\.1137/0319007\.URL[https://doi\.org/10\.1137/0319007](https://doi.org/10.1137/0319007)\.
- Kushner & Yang \(1993\)Kushner, H\. J\. and Yang, J\.Stochastic approximation with averaging of the iterates: Optimal asymptotic rate of convergence for general processes\.*SIAM Journal on Control and Optimization*, 31\(4\):1045–1062, 1993\.doi:10\.1137/0331047\.URL[https://doi\.org/10\.1137/0331047](https://doi.org/10.1137/0331047)\.
- Li et al\. \(2016\)Li, W\., Ahn, S\., and Welling, M\.Scalable MCMC for mixed membership stochastic blockmodels\.In Gretton, A\. and Robert, C\. C\. \(eds\.\),*Proceedings of the 19th International Conference on Artificial Intelligence and Statistics*, volume 51 of*Proceedings of Machine Learning Research*, pp\. 723–731, Cadiz, Spain, 09–11 May 2016\. PMLR\.URL[https://proceedings\.mlr\.press/v51/li16d\.html](https://proceedings.mlr.press/v51/li16d.html)\.
- Loaiza\-Maya & Nibbering \(2023\)Loaiza\-Maya, R\. and Nibbering, D\.Fast Variational Bayes methods for multinomial probit models\.*Journal of Business & Economic Statistics*, 41\(4\):1352–1363, 2023\.doi:10\.1080/07350015\.2022\.2139267\.URL[https://doi\.org/10\.1080/07350015\.2022\.2139267](https://doi.org/10.1080/07350015.2022.2139267)\.
- Loaiza\-Maya et al\. \(2024\)Loaiza\-Maya, R\., Nibbering, D\., and Zhu, D\.Hybrid unadjusted Langevin methods for high\-dimensional latent variable models\.*Journal of Econometrics*, 241\(2\):105741, 2024\.ISSN 0304\-4076\.doi:https://doi\.org/10\.1016/j\.jeconom\.2024\.105741\.URL[https://www\.sciencedirect\.com/science/article/pii/S0304407624000873](https://www.sciencedirect.com/science/article/pii/S0304407624000873)\.
- Mandt et al\. \(2017\)Mandt, S\., Hoffman, M\. D\., and Blei, D\. M\.Stochastic gradient descent as approximate Bayesian inference\.*Journal of Machine Learning Research*, 18\(134\):1–35, 2017\.URL[http://jmlr\.org/papers/v18/17\-214\.html](http://jmlr.org/papers/v18/17-214.html)\.
- Margossian et al\. \(2025\)Margossian, C\. C\., Pillaud\-Vivien, L\., and Saul, L\. K\.Variational inference for uncertainty quantification: An analysis of trade\-offs\.*Journal of Machine Learning Research*, 26\(202\):1–41, 2025\.
- Mcleish \(1976\)Mcleish, D\. L\.Functional and random central limit theorems for the Robbins\-Munro process, 1976\.URL[https://www\.jstor\.org/stable/3212676](https://www.jstor.org/stable/3212676)\.
- Mignacco et al\. \(2021\)Mignacco, F\., Krzakala, F\., Urbani, P\., and Zdeborová, L\.Dynamical mean\-field theory for stochastic gradient descent in Gaussian mixture classification\*\.*Journal of Statistical Mechanics: Theory and Experiment*, 2021\(12\):124008, December 2021\.ISSN 1742\-5468\.doi:10\.1088/1742\-5468/ac3a80\.URL[http://dx\.doi\.org/10\.1088/1742\-5468/ac3a80](http://dx.doi.org/10.1088/1742-5468/ac3a80)\.
- Mou et al\. \(2020\)Mou, W\., Li, C\. J\., Wainwright, M\. J\., Bartlett, P\. L\., and Jordan, M\. I\.On linear stochastic approximation: Fine\-grained Polyak\-Ruppert and non\-asymptotic concentration\.In Abernethy, J\. and Agarwal, S\. \(eds\.\),*Proceedings of Thirty Third Conference on Learning Theory*, volume 125 of*Proceedings of Machine Learning Research*, pp\. 2947–2997\. PMLR, 09–12 Jul 2020\.URL[https://proceedings\.mlr\.press/v125/mou20a\.html](https://proceedings.mlr.press/v125/mou20a.html)\.
- Moulines & Bach \(2011\)Moulines, E\. and Bach, F\.Non\-asymptotic analysis of stochastic approximation algorithms for machine learning\.In Shawe\-Taylor, J\., Zemel, R\., Bartlett, P\., Pereira, F\., and Weinberger, K\. \(eds\.\),*Advances in Neural Information Processing Systems*, volume 24\. Curran Associates, Inc\., 2011\.URL[https://proceedings\.neurips\.cc/paper\_files/paper/2011/file/40008b9a5380fcacce3976bf7c08af5b\-Paper\.pdf](https://proceedings.neurips.cc/paper_files/paper/2011/file/40008b9a5380fcacce3976bf7c08af5b-Paper.pdf)\.
- Murphy \(2023\)Murphy, K\. P\.*Probabilistic Machine Learning: Advanced Topics*\.MIT Press, 2023\.URL[http://probml\.github\.io/book2](http://probml.github.io/book2)\.
- Negrea et al\. \(2023\)Negrea, J\., Yang, J\., Feng, H\., Roy, D\. M\., and Huggins, J\. H\.Tuning stochastic gradient algorithms for statistical inference via large\-sample asymptotics, 2023\.URL[https://arxiv\.org/abs/2207\.12395](https://arxiv.org/abs/2207.12395)\.
- Nemeth & Fearnhead \(2021\)Nemeth, C\. and Fearnhead, P\.Stochastic gradient Markov chain Monte Carlo\.*Journal of the American Statistical Association*, 116\(533\):433–450, 2021\.doi:10\.1080/01621459\.2020\.1847120\.URL[https://doi\.org/10\.1080/01621459\.2020\.1847120](https://doi.org/10.1080/01621459.2020.1847120)\.
- Nemirovski et al\. \(2009\)Nemirovski, A\., Juditsky, A\., Lan, G\., and Shapiro, A\.Robust stochastic approximation approach to stochastic programming\.*SIAM Journal on Optimization*, 19\(4\):1574–1609, 2009\.URL[https://doi\.org/10\.1137/070704277](https://doi.org/10.1137/070704277)\.
- Patterson & Teh \(2013\)Patterson, S\. and Teh, Y\. W\.Stochastic gradient Riemannian Langevin dynamics on the probability simplex\.In Burges, C\., Bottou, L\., Welling, M\., Ghahramani, Z\., and Weinberger, K\. \(eds\.\),*Advances in Neural Information Processing Systems*, volume 26\. Curran Associates, Inc\., 2013\.URL[https://proceedings\.neurips\.cc/paper/2013/file/309928d4b100a5d75adff48a9bfc1ddb\-Paper\.pdf](https://proceedings.neurips.cc/paper/2013/file/309928d4b100a5d75adff48a9bfc1ddb-Paper.pdf)\.
- Pedregosa et al\. \(2011\)Pedregosa, F\., Varoquaux, G\., Gramfort, A\., Michel, V\., Thirion, B\., Grisel, O\., Blondel, M\., Prettenhofer, P\., Weiss, R\., Dubourg, V\., et al\.Scikit\-learn: Machine learning in python\.*Journal of Machine Learning Research*, 12:2825–2830, 2011\.
- Pflug (1986) Pflug, G. C. Stochastic minimization with constant step-size: Asymptotic laws. *SIAM Journal on Control and Optimization*, 24(4):655–666, 1986. doi:10.1137/0324039. URL [https://doi.org/10.1137/0324039](https://doi.org/10.1137/0324039).
- Polyak & Juditsky (1992) Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. *SIAM Journal on Control and Optimization*, 30(4):838–855, 1992. URL [https://doi.org/10.1137/0330046](https://doi.org/10.1137/0330046).
- Qian et al. (2024) Qian, X., Xie, Z., Liu, X., and Zhang, S. Almost sure convergence rates and concentration of stochastic approximation and reinforcement learning with Markovian noise, 2024. URL [https://arxiv.org/abs/2411.13711](https://arxiv.org/abs/2411.13711).
- Rakhlin et al. (2011) Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. *arXiv preprint arXiv:1109.5647*, 2011.
- Ruppert (1988) Ruppert, D. Efficient estimations from a slowly convergent Robbins-Monro process. 1988.
- Srikant (2024) Srikant, R. Rates of convergence in the central limit theorem for Markov chains, with an application to TD learning, 2024.
- Walk (1977) Walk, H. An invariance principle for the Robbins-Monro process in a Hilbert space. *Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete*, 39:135–150, 1977. URL [https://api.semanticscholar.org/CorpusID:119733417](https://api.semanticscholar.org/CorpusID:119733417).
- Wang et al. (2025) Wang, X., Kasprzak, M. J., Negrea, J., Bourguin, S., and Huggins, J. H. Quantitative error bounds for scaling limits of stochastic iterative algorithms, 2025. URL [https://arxiv.org/abs/2501.12212](https://arxiv.org/abs/2501.12212).
- Wang et al. (2026) Wang, Y., Ding, J., and Huggins, J. H. Accurate large-sample uncertainty quantification using stochastic gradient Markov chain Monte Carlo. In *Proceedings of the 43rd International Conference on Machine Learning*, Proceedings of Machine Learning Research. PMLR, 2026.
- Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Getoor, L. and Scheffer, T. (eds.), *ICML*, pp. 681–688. Omnipress, 2011. URL [http://dblp.uni-trier.de/db/conf/icml/icml2011.html#WellingT11](http://dblp.uni-trier.de/db/conf/icml/icml2011.html#WellingT11).
- White (1982) White, H. Maximum likelihood estimation of misspecified models. *Econometrica: Journal of the Econometric Society*, pp. 1–25, 1982.

## Appendix A Preliminaries for Proof of Main Result

### A.1 Assumptions

Recall that we assume throughout that $X_i \sim P$ independently for all $i \in \mathbb{N}$. We denote $\ell(\theta; X, z) := \log p(X, z \mid \theta)$.

###### Assumption A\.1\.

$\nabla\log\pi_0$ is $L_0$-Lipschitz, and $\log p(x, z \mid \cdot) \in C^2(\Theta)$ for each $(x, z) \in \mathcal{X} \times \mathbb{R}^m$.

###### Assumption A\.2\.

The exponents satisfy $\mathfrak{h} - \mathfrak{w} - \mathfrak{a}/3 > 0$ and $\mathbb{E}\bigl[\|\nabla\log p(X_1, \cdot \mid \theta^*)\|_\infty^{p_2}\bigr] < \infty$ for some $p_2 > \frac{1}{\mathfrak{h} - \mathfrak{w} - \mathfrak{a}/3}$.

###### Assumption A\.3\.

For some $q_3 \in [0, \mathfrak{w}]$ and $p_3 := \frac{1}{\mathfrak{h} + q_3 - \mathfrak{w} - \mathfrak{a}/3}$, the local critical points satisfy $\|\hat{\theta}^{(n)} - \theta^*\| \in o_p(1/n^{q_3})$, and $\mathbb{E}\bigl[\|\nabla^{\otimes 2}\log p(X_1, \cdot \mid \cdot)\|^{p_3}\bigr] < \infty$.

Let

$$\ell(\theta; x_i) := \log p(x_i \mid \theta) = \log \int p(x_i, z_i \mid \theta)\,dz_i \tag{A.1}$$

denote the log-likelihood function with the latent variable $z_i$ marginalized out. For any $r > 0$, let

$$B^{(n)}(r) := B\bigl(\hat{\theta}^{(n)}, r/n^{\mathfrak{w}}\bigr) \tag{A.2}$$

denote the ball centered at the MLE $\hat{\theta}^{(n)}$ with radius $r/n^{\mathfrak{w}}$, where $\mathfrak{w}$ is the scaling exponent.

###### Assumption A\.4\.

There is a non-decreasing sequence $r_{J,n} \to \infty$ such that, in probability,

$$\sup_{\theta \in B(r_{J,n})} \Bigl\| \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{\otimes 2} \ell(\theta; X_i) + J_\star \Bigr\| \to 0. \tag{A.3}$$

###### Assumption A\.5\.

There is a non-decreasing sequence $r_{I,n} \to \infty$ such that, in probability,

$$\sup_{\theta \in B(r_{I,n})} \Bigl\| \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{z \mid \theta, X}\bigl[\nabla_\theta \log p(X_i, z_i \mid \theta)^{\otimes 2}\bigr] - \tilde{I}_\star \Bigr\| \to 0. \tag{A.4}$$

###### Assumption A\.6\.

If $\theta_n \to \theta^*$ as $n \to \infty$, then for almost every $X$ and $z$,

$$p(z \mid X, \theta_n) \to p(z \mid X, \theta^*) \tag{A.5}$$

as $n \to \infty$.

### A.2 Technical Lemmas

We will make use of the following two technical results in our proof\.

###### Proposition A.7 (Approximation of Markov chains (Ethier & Kurtz, [2009](https://arxiv.org/html/2606.00309#bib.bib11))).

Let

$$A : C_c^\infty(\mathbb{R}^d) \to C(\mathbb{R}^d) \tag{A.6}$$

be a linear operator, and suppose that the closure of the graph of $A$ with respect to the graph norm

$$\|f\|_A := \|f\|_\infty + \|Af\|_\infty, \qquad f \in C_c^\infty(\mathbb{R}^d), \tag{A.7}$$

generates a Feller semigroup $(T_t)_{t \geq 0}$ on $\mathbb{R}^d$. Let $(\theta_t)_{t \geq 0}$ be a Markov process with forward operator semigroup $(T_t)_{t \geq 0}$. Let $\{(\theta^{(n)}_k)_{k \in \mathbb{N} \cup \{0\}}\}_{n \in \mathbb{N}}$ be a sequence of discrete-time Markov chains on $\mathbb{R}^d$ with respective transition kernels $\{U^{(n)}\}_{n \in \mathbb{N}}$. Suppose that $0 < \alpha^{(n)} \to \infty$, and define

$$A^{(n)} := \alpha^{(n)}\bigl(U^{(n)} - I\bigr), \qquad T_t^{(n)} := \bigl(U^{(n)}\bigr)^{\lfloor \alpha^{(n)} t \rfloor}, \qquad \theta_t^{(n)} := \theta^{(n)}_{\lfloor \alpha^{(n)} t \rfloor}. \tag{A.8}$$

If

$$\|A^{(n)} f - A f\|_\infty \longrightarrow 0 \quad \text{for all } f \in C_c^\infty(\mathbb{R}^d), \tag{A.9}$$

then

- (a) $T_t^{(n)} \to T_t$ for each $t > 0$, and
- (b) if $\theta^{(n)}(0) \Rightarrow \theta(0)$, then $$\theta^{(n)}(\cdot) \Rightarrow \theta(\cdot) \quad \text{in the Skorokhod topology}. \tag{A.10}$$
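
As a rough numerical illustration of condition (A.9), the sketch below uses the simplest chain one could pick: a symmetric random walk on $\mathbb{R}$ with steps $\pm n^{-1/2}$ and time scaling $\alpha^{(n)} = n$, whose rescaled generator should approach that of standard Brownian motion, $Af = \tfrac{1}{2} f''$. The test function $e^{-x^2}$ (smooth and rapidly decaying rather than compactly supported), the grid, and all function names are assumptions made for illustration; none of them come from the paper.

```python
import math

def f(x):
    """Smooth, rapidly decaying test function (illustrative stand-in for C_c^infty)."""
    return math.exp(-x * x)

def limit_generator(x):
    """A f(x) = 0.5 * f''(x), the generator of standard Brownian motion applied to f."""
    return 0.5 * (4.0 * x * x - 2.0) * math.exp(-x * x)

def discrete_generator(x, n):
    """A^(n) f(x) = alpha_n * (U^(n) f(x) - f(x)) with alpha_n = n and steps +-n^(-1/2)."""
    step = 1.0 / math.sqrt(n)
    one_step_mean = 0.5 * (f(x + step) + f(x - step))  # U^(n) f(x)
    return n * (one_step_mean - f(x))

def sup_error(n, grid):
    """Grid approximation of the sup-norm distance in (A.9)."""
    return max(abs(discrete_generator(x, n) - limit_generator(x)) for x in grid)

grid = [i / 10.0 for i in range(-50, 51)]
errors = [sup_error(n, grid) for n in (10, 100, 1000, 10000)]
print(errors)  # the sup-norm gap shrinks as n grows
```

In this toy case the gap decays at rate roughly $1/n$, which is the fourth-order Taylor remainder of $f$; the proof in Appendix B controls the analogous remainders for the SGLD-Gibbs chain.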

###### Lemma A.8 (Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34))).

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(\mathcal{X}, \tau)$ be a topological space endowed with the $\sigma$-field $\mathcal{F}_{\mathcal{X}} := \sigma(\tau)$, let $(X_n)_{n \in \mathbb{N}}$ be a sequence of $\mathcal{X}$-valued random elements, and let $x \in \mathcal{X}$. If for every subsequence $(n_m)$ there exists a sub-subsequence $(n_{m_k})$ such that

$$X_{n_{m_k}} \longrightarrow x \quad \text{almost surely as } k \to \infty, \tag{A.11}$$

then

$$X_n \xrightarrow{\;\mathbb{P}\;} x. \tag{A.12}$$

If $(\mathcal{X}, \tau)$ is first countable, then the converse also holds: if

$$X_n \xrightarrow{\;\mathbb{P}\;} x, \tag{A.13}$$

then for every subsequence $(n_m)$ there exists a sub-subsequence $(n_{m_k})$ such that

$$X_{n_{m_k}} \longrightarrow x \quad \text{almost surely as } k \to \infty. \tag{A.14}$$
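
To make the gap between the two modes of convergence in Lemma A.8 concrete, here is a minimal sketch of the classical "typewriter" sequence, which is not part of the paper. With $U$ uniform on $[0,1)$, the indicator $X_n = \mathbf{1}\{U \in [j/2^k, (j+1)/2^k)\}$, where $n = 2^k + j$, satisfies $\mathbb{P}(X_n = 1) = 2^{-k} \to 0$, yet equals one infinitely often for every outcome; along the subsequence $n = 2^k$ it converges to zero for every $U > 0$. The function name and the sample point `u = 0.3` are illustrative assumptions.

```python
def typewriter(n, u):
    """X_n(u): indicator that u lies in the n-th dyadic interval [j/2^k, (j+1)/2^k)."""
    k = n.bit_length() - 1   # level k with 2^k <= n < 2^(k+1)
    j = n - (1 << k)         # position of the interval within level k
    width = 1.0 / (1 << k)
    return 1 if j * width <= u < (j + 1) * width else 0

u = 0.3  # one fixed outcome of the uniform draw (illustrative)

# Every level contains exactly one interval that covers u, so X_n(u) = 1 infinitely often.
hits_per_level = [
    sum(typewriter(n, u) for n in range(1 << k, 1 << (k + 1))) for k in range(10)
]

# Along n = 2^k the interval is [0, 2^-k), which eventually excludes u.
along_subsequence = [typewriter(1 << k, u) for k in range(2, 20)]

print(hits_per_level, along_subsequence)
```

The full sequence therefore converges in probability but not almost surely, while a suitable sub-subsequence does converge almost surely, which is the situation the lemma is designed to handle.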

### A.3 Reduction to almost-sure convergence along subsequences

Define the random quantities

$$\begin{aligned}
\Phi^{(n)} &:= \max\left\{\Phi^{(n)}_1, \Phi^{(n)}_2, \Phi^{(n)}_3\right\},\\
\Phi^{(n)}_1 &:= n^{q_3}\,\|\hat{\theta}^{(n)} - \theta^\star\|,\\
\Phi^{(n)}_2 &:= \sup_{\theta \in B(r_{J,n})} \left\| \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{\otimes 2} \ell(\theta; X_i) + J_\star \right\|,\\
\Phi^{(n)}_3 &:= \sup_{\theta \in B(r_{I,n})} \left\| \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{Z_i \mid X_i, \theta}\left[\nabla_\theta \log p(X_i, Z_i \mid \theta)^{\otimes 2}\right] - \tilde{I}_\star \right\|.
\end{aligned} \tag{A.15}$$

By assumption, $\Phi^{(n)} \xrightarrow{\mathbb{P}} 0$. By Lemma 1 of Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)), for every subsequence $(n_m)_{m \in \mathbb{N}}$ there exists a further subsequence $(n_{m_k})_{k \in \mathbb{N}}$ such that

$$\Phi^{(n_{m_k})} \xrightarrow{\mathrm{a.s.}} 0. \tag{A.16}$$

It therefore suffices to establish weak convergence of $\vartheta^{(n)}$ along any subsequence for which $\Phi^{(n)} \to 0$ almost surely.

Fix such a subsequence $(n_m)$ and define the event

$$\Omega^{(0)} := \bigcap_{j=1}^{3} \Omega^{(j)}, \tag{A.17}$$

where

$$\begin{aligned}
\Omega^{(1)} &:= \left\{\Phi^{(n_m)} \to 0\right\},\\
\Omega^{(2)} &:= \left\{\max_{1 \leq i \leq n} \|\nabla \ell(\theta^*; X_i, \cdot)\| \leq n^{1/p_2}\ \text{a.b.f.o.}\right\},\\
\Omega^{(3)} &:= \left\{\max_{1 \leq i \leq n} \|\nabla^{\otimes 2} \ell(\cdot; X_i, \cdot)\|_\infty \leq n^{1/p_3}\ \text{a.b.f.o.}\right\}.
\end{aligned} \tag{A.18}$$

Here a.b.f.o. abbreviates "all but finitely often", that is, for all but finitely many $n$. By the assumed moment conditions and Lemma 2 of Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)) applied to the power functions $t \mapsto t^{p_2}$ and $t \mapsto t^{p_3}$, the event $\Omega^{(0)}$ has probability one. We will show that on $\Omega^{(0)}$, all remainder terms appearing in the generator expansions are negligible, and the convergence of the discrete-time Markov generators to their continuous-time limit follows.

## Appendix B Proof of Main Theorem

### B.1 Overview

Our proof follows the spirit of Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)). We establish weak convergence of the processes in the Skorokhod topology in probability by first proving almost sure convergence along subsequences. This is achieved by showing that the difference between the approximate generator and the limiting generator, evaluated on smooth test functions with compact support, vanishes uniformly. Using Lemma [A.8](https://arxiv.org/html/2606.00309#A1.Thmtheorem8), this then yields weak convergence in the Skorokhod topology in probability. As in Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)), we divide the proof into two parts for technical reasons.

In Part 1, we consider arguments that are sufficiently far from the support of the test function. The main idea is to control the probability that the global-parameter process jumps back into the support of the test function. A key difference with Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)) is that we must impose assumptions on the joint likelihood uniformly over latent variable values to ensure that, even in the presence of additional uncertainty induced by the latent-variable updates, the probability that the global parameter jumps back into the support can still be controlled at a comparable scale.

In Part 2, we analyze arguments that lie in or are close to the support of the test function. We perform a Taylor expansion of the joint approximate generator. In [Section B.5](https://arxiv.org/html/2606.00309#A2.SS5), as a result of [Equation B.44](https://arxiv.org/html/2606.00309#A2.E44), the drift term converges in the same way as in Negrea et al. ([2023](https://arxiv.org/html/2606.00309#bib.bib34)). In [Section B.6](https://arxiv.org/html/2606.00309#A2.SS6), for the gradient component of the diffusion term, a similar approach applies, but the resulting limit now explicitly incorporates additional variability arising from sampling the latent variables. In [Section B.7](https://arxiv.org/html/2606.00309#A2.SS7), we introduce a new term that bridges the infinitesimal operator associated with Gibbs updates and the generator of a Poisson jump process. The most technically challenging new aspect is in [Sections B.5](https://arxiv.org/html/2606.00309#A2.SS5) and [B.9](https://arxiv.org/html/2606.00309#A2.SS9), showing that all cross terms involving both the global parameters and the latent variables vanish in the limit. This vanishing further implies that, in the asymptotic regime, the contribution of any single observation becomes asymptotically independent of the global-parameter dynamics, which in turn allows the joint limiting distribution to factorize into independent components.

### B.2 Notation useful for the proof

We first introduce notation for the increments of the localized algorithm. Throughout, we condition on $\vartheta^{(n)}_0 = \vartheta$ and $\zeta^{(n)}_0 = \zeta$, and write $\zeta^{(n)}_1 = \tilde{\zeta}$ for the updated latent variable.

Define the following components of the one-step increment:

$$\Delta^{(n)}_{\xi} := w^{(n)} \sqrt{h^{(n)} (\beta^{(n)})^{-1} \Gamma}\,\xi_1, \tag{B.1}$$

$$\Delta^{(n)}_{\pi_0} := \frac{h^{(n)} w^{(n)} \Gamma}{2n} \nabla \log \pi_0\!\left(\hat{\theta}^{(n)} + (w^{(n)})^{-1} \vartheta\right), \tag{B.2}$$

$$\Delta^{(n)}_{\ell} := \frac{h^{(n)} w^{(n)} \Gamma}{2 b^{(n)}} \sum_{j=1}^{b^{(n)}} \nabla_\theta \ell\!\left(\hat{\theta}^{(n)} + (w^{(n)})^{-1} \vartheta;\, X_{I^{(n)}_1(j)},\, \tilde{\zeta}_{I^{(n)}_1(j)}\right), \tag{B.3}$$

and set

$$\Delta^{(n)} = \Delta^{(n)}_{\xi} + \Delta^{(n)}_{\pi_0} + \Delta^{(n)}_{\ell}. \tag{B.4}$$

The latent variables are updated according to

$$\tilde{\zeta}_i \sim p\!\left(z \mid X_i,\, \hat{\theta}^{(n)} + (w^{(n)})^{-1} \vartheta\right), \qquad i = 1, \dots, n, \tag{B.5}$$

independently conditional on the observations and the current parameter value.
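
The increments (B.1)-(B.5) can be sketched in code for a scalar toy model. Everything model-specific below is an assumption made for illustration and does not come from the paper: the latent model $z_i \sim N(0,1)$, $x_i \mid z_i, \theta \sim N(\theta + z_i, 1)$, for which $\nabla_\theta \ell(\theta; x, z) = x - \theta - z$ and $z \mid x, \theta \sim N((x - \theta)/2, 1/2)$; a standard normal prior; a scalar preconditioner `gamma`; minibatch indices drawn with replacement; and Gibbs refreshes restricted to the minibatch, since only those latents enter (B.3). The function name and signature are hypothetical.

```python
import math
import random

def localized_increment(vartheta, theta_hat, xs, b, h, w, beta, gamma, rng):
    """One-step increment Delta^(n) of (B.4) for a scalar toy latent model (illustrative)."""
    n = len(xs)
    theta = theta_hat + vartheta / w  # global parameter in original coordinates
    batch = [rng.randrange(n) for _ in range(b)]  # minibatch indices I_1(j), with replacement
    # Gibbs refresh as in (B.5), restricted to the minibatch:
    # z_i | x_i, theta ~ N((x_i - theta) / 2, 1/2) in this toy model.
    zeta_new = {
        i: rng.gauss((xs[i] - theta) / 2.0, math.sqrt(0.5)) for i in sorted(set(batch))
    }
    delta_xi = w * math.sqrt(h * gamma / beta) * rng.gauss(0.0, 1.0)  # (B.1)
    delta_prior = h * w * gamma / (2.0 * n) * (-theta)                # (B.2), N(0, 1) prior
    grad_sum = sum(xs[i] - theta - zeta_new[i] for i in batch)
    delta_lik = h * w * gamma / (2.0 * b) * grad_sum                  # (B.3)
    return delta_xi + delta_prior + delta_lik, zeta_new

data_rng = random.Random(1)
xs = [data_rng.gauss(1.0, math.sqrt(2.0)) for _ in range(200)]
theta_hat = sum(xs) / len(xs)  # MLE of theta in the marginal model x_i ~ N(theta, 2)
increment, refreshed = localized_increment(
    vartheta=0.5, theta_hat=theta_hat, xs=xs, b=10,
    h=0.01, w=len(xs) ** 0.5, beta=1.0, gamma=1.0, rng=random.Random(0),
)
print(increment, len(refreshed))
```

The particular values of `h`, `w`, and `beta` are placeholders; in the paper these follow the polynomial scalings in $n$ whose exponents are $\mathfrak{h}$, $\mathfrak{w}$, and $\mathfrak{t}$.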

We next define a sequence of generator-like operators acting on test functions $f$ by

$$A^{(n)} f(\vartheta, \zeta_1) := \alpha^{(n)} \Big( \mathbb{E}\big[f(\vartheta + \Delta^{(n)}, \tilde{\zeta}_1)\big] - f(\vartheta, \zeta_1) \Big), \tag{B.6}$$

where the expectation is taken over all algorithmic randomness, including minibatch sampling, Gaussian noise, and Gibbs updates, conditional on the observations.

For sufficiently smooth test functions $f$, the generator $A$ of the limiting Lévy process is given by

$$\begin{aligned}
(Af)(\vartheta, \zeta_1) &= -\langle B\vartheta, \nabla_\vartheta f(\vartheta, \zeta_1) \rangle + \frac{1}{2} A : \nabla_\vartheta^{\otimes 2} f(\vartheta, \zeta_1)\\
&\quad + \lambda \left( \int f(\vartheta, z)\, p(z \mid X_1, \theta^\star)\, dz - f(\vartheta, \zeta_1) \right),
\end{aligned} \tag{B.7}$$

with

$$\begin{aligned}
B &= c_h \Gamma J_\star \mathbf{1}\{\mathfrak{a} = \mathfrak{h}\},\\
A &= \frac{c_h}{c_\beta} \Gamma \mathbf{1}\{\mathfrak{h} + \mathfrak{b} \leq \mathfrak{t}\} + \frac{c_h^2}{4 c_b} \Gamma \tilde{I}_\star \Gamma^\top \mathbf{1}\{\mathfrak{t} \leq \mathfrak{b} + \mathfrak{h}\},\\
\lambda &= c_b.
\end{aligned} \tag{B.8}$$
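
As a sanity check on how the drift and diffusion coefficients in (B.7) shape the stationary law of the global parameter, consider the following scalar special case. It is an illustrative calculation rather than a result from the paper: take $d = 1$, write $\kappa > 0$ and $\sigma^2 > 0$ (hypothetical symbols) for the scalar values of the matrices $B$ and $A$, and formally apply the generator as written in (B.7) to $f(\vartheta, \zeta_1) = \vartheta^2$, ignoring that this $f$ is not compactly supported. Because $f$ does not depend on the latent coordinate, the jump term vanishes, and stationarity of a law $\pi$ with finite second moment forces the expected generator to be zero:

$$(Af)(\vartheta) = -2\kappa\,\vartheta^2 + \sigma^2, \qquad \mathbb{E}_\pi\bigl[(Af)(\vartheta)\bigr] = 0 \;\Longrightarrow\; \mathbb{E}_\pi[\vartheta^2] = \frac{\sigma^2}{2\kappa}.$$

In this scalar case, any extra contribution to the diffusion coefficient, such as the $\tilde{I}_\star$ term in (B.8), inflates the stationary variance, while a stronger drift shrinks it.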
Fix a realization $X^{(n)} \in \Omega^{(0)}$. We want to show that for all $f \in C^\infty_c(\mathbb{R}^{d+s})$ and any $\zeta_1$,

$$\lim_{m \to \infty} \sup_{\vartheta \in \mathbb{R}^d} \|A^{(n_m)} f(\vartheta, \zeta_1) - Af(\vartheta, \zeta_1)\| = 0. \tag{B.9}$$

We will show this in two parts. To begin, note that, for any test function $f$ with compact support, there exists a compact set $K_0$ such that $f(\vartheta) = 0$ for all $\vartheta \in K_0^c$. First, we will identify an enlarged set $K_1 \supseteq K_0$ such that

$$\lim_{m \to \infty} \sup_{\vartheta \in K_1^c} \|A^{(n_m)} f(\vartheta, \zeta_1) - Af(\vartheta, \zeta_1)\| = 0. \tag{B.10}$$

Second, we will show that

$$\lim_{m \to \infty} \sup_{\vartheta \in K_1} \|A^{(n_m)} f(\vartheta, \zeta_1) - Af(\vartheta, \zeta_1)\| = 0. \tag{B.11}$$

### B.3 Part 1.

For all $\vartheta \in K_0^c$, we have $f(\vartheta) = 0$, $\nabla_\vartheta f(\vartheta) = 0$, and $\nabla_\vartheta^{\otimes 2} f(\vartheta) = 0$. Therefore, for any $K_1 \supseteq K_0$,

$$\begin{aligned}
\sup_{\vartheta \in K_1^c} \|A^{(n_m)} f(\vartheta, \zeta_1) - Af(\vartheta, \zeta_1)\| &= \alpha^{(n_m)} \sup_{\vartheta \in K_1^c} \mathbb{E}\bigl[f(\vartheta + \Delta^{(n_m)}(\vartheta), \tilde{\zeta}_1)\bigr]\\
&\leq \alpha^{(n_m)} \|f\|_\infty \sup_{\vartheta \in K_1^c} \mathbb{P}\bigl[\vartheta + \Delta^{(n_m)}(\vartheta) \in K_0\bigr].
\end{aligned} \tag{B.12}$$
Let

$$K_1 := \{\vartheta : \|\vartheta\| \leq 2R_0 + 2c_0\}, \qquad R_0 := \sup_{\vartheta \in K_0} \|\vartheta\|, \tag{B.13}$$

where

$$c_0 := \frac{c_h}{2}\bigl(3 + \|\Gamma \nabla \log \pi_0(\theta^*)\|\bigr) + \sqrt{c_h / c_\beta\,\Gamma}. \tag{B.14}$$
Then, for $\vartheta \in K_1^c$, under the assumption that $\Gamma \nabla \log \pi_0$ is $L_0$-Lipschitz,

$$\begin{aligned}
\|\Delta_{\pi_0}^{(n_m)}(\vartheta)\| &= \frac{h^{(n_m)} w^{(n_m)}}{2 n_m} \bigl\|\Gamma \nabla \log \pi_0\bigl(\hat{\theta}^{(n_m)} + (w^{(n_m)})^{-1} \vartheta\bigr)\bigr\|\\
&\leq \frac{c_h n_m^{\mathfrak{w} - \mathfrak{h} - 1}}{2} \Bigl( \|\Gamma \nabla \log \pi_0(\theta^*)\| + L_0 \|\hat{\theta}^{(n_m)} - \theta^*\| + \frac{L_0 \|\vartheta\|}{n_m^{\mathfrak{w}}} \Bigr).
\end{aligned} \tag{B.15}$$
Similarly,

$$\begin{aligned}
\|\Delta_{\ell}^{(n_m)}(\vartheta)\| &\leq \frac{h^{(n_m)} w^{(n_m)} \|\Gamma\|}{2 b^{(n_m)}} \Bigl\| \sum_{i=1}^{b^{(n_m)}} \nabla \ell\bigl(\hat{\theta}^{(n_m)} + (w^{(n_m)})^{-1} \vartheta;\, X_{I_1^{(n_m)}(i)},\, \tilde{\zeta}_{I_1^{(n_m)}(i)}\bigr) \Bigr\|\\
&\leq \frac{c_h n_m^{\mathfrak{w} - \mathfrak{h}} \|\Gamma\|}{2 b^{(n_m)}} \sum_{i=1}^{b^{(n_m)}} \Bigl( \bigl\|\nabla \ell\bigl(\theta^*;\, X_{I_1^{(n_m)}(i)},\, \tilde{\zeta}_{I_1^{(n_m)}(i)}\bigr)\bigr\|\\
&\qquad + L\bigl(X_{I_1^{(n_m)}(i)},\, \tilde{\zeta}_{I_1^{(n_m)}(i)}\bigr) \|\hat{\theta}^{(n_m)} - \theta^*\|\\
&\qquad + L\bigl(X_{I_1^{(n_m)}(i)},\, \tilde{\zeta}_{I_1^{(n_m)}(i)}\bigr) \frac{\|\vartheta\|}{n_m^{\mathfrak{w}}} \Bigr).
\end{aligned} \tag{B.16}$$
We define the (random) Lipschitz constants

$$\begin{aligned}
L(X_i) &:= \|\nabla^{\otimes 2} \ell(\cdot; X_i, \cdot)\|_\infty,\\
L_*(X^{(n_m)}) &:= \max_{i \in [n_m]} \|\nabla \ell(\theta^*; X_i, \cdot)\|,\\
L(X^{(n_m)}, \tilde{\zeta}^{(n_m)}) &:= \max_{i \in [n_m]} L(X_i).
\end{aligned} \tag{B.17}$$
Since $X^{(n_m)} \in \Omega^{(0)}$, we have that $\Phi^{(n_m)} \to 0$ and, for all sufficiently large $m$, $\max_{i \in [n_m]} \|\nabla \ell(\theta^*; X_i, \cdot)\|_\infty \leq n_m^{1/p_2}$ and $\max_{i \in [n_m]} \|\nabla^{\otimes 2} \ell(\cdot; X_i, \cdot)\|_\infty \leq n_m^{1/p_3}$. Thus, take $m$ large enough that all of the following hold:

$$\sup_{m' \geq m} \Phi^{(n_{m'})} \leq \min(1, L_0^{-1}), \tag{B.18}$$

$$1 \geq \sup_{m' \geq m} \frac{L_*(X^{(n_{m'})})}{n_{m'}^{1/p_2}}, \tag{B.19}$$

$$n_m \geq \max\left( (2 c_h \|\Gamma\|)^{1/(1/p_3 - \mathfrak{h})},\; (2 c_h L_0 \|\Gamma\|)^{\frac{1}{\mathfrak{h} + 1 - \mathfrak{a} - \mathfrak{w}}} \right), \tag{B.20}$$

$$1 \geq \sup_{m' \geq m} \frac{L(X^{(n_{m'})})}{n_{m'}^{1/p_3}}. \tag{B.21}$$
Then, using that $0 < \mathfrak{w} < 1$,

$$\|\Delta_{\pi_0}^{(n_m)}(\vartheta)\| \leq \frac{c_h \|\Gamma\|}{2} \bigl( \|\nabla \log \pi_0(\theta^*)\| + 1 \bigr) + \frac{1}{4} \|\vartheta\|, \qquad \|\Delta_{\ell}^{(n_m)}(\vartheta)\| \leq c_h \|\Gamma\| + \frac{1}{4} \|\vartheta\|. \tag{B.22}$$
Therefore, for $\vartheta \in K_1^c$,

$$\|\vartheta\| - \|\Delta_{\pi_0}^{(n_m)}(\vartheta)\| - \|\Delta_{\ell}^{(n_m)}(\vartheta)\| - R_0 \geq \frac{1}{2} \|\vartheta\| - \frac{c_h \|\Gamma\|}{2} \bigl(3 + \|\nabla \log \pi_0(\theta^*)\|\bigr) - R_0 \geq \sqrt{c_h \|\Gamma\| / c_\beta}. \tag{B.23}$$
Consequently,

$$\lim_{m \to \infty} \sup_{\vartheta \in K_1^c} \|A^{(n_m)} f(\vartheta, \zeta_1) - Af(\vartheta, \zeta_1)\| \leq \lim_{m \to \infty} \alpha^{(n_m)} \|f\|_\infty\, \mathbb{P}\!\left( \|\xi_1\| \geq n_m^{\mathfrak{h}/2 + \mathfrak{t}/2 - \mathfrak{w}} \right) = 0. \tag{B.24}$$
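
The last step of (B.24) works because a Gaussian tail decays faster than any polynomial rate grows. The sketch below illustrates this for a scalar standard normal $\xi_1$, using the identity $\mathbb{P}(|\xi_1| \geq t) = \operatorname{erfc}(t/\sqrt{2})$. The values `a = 1.0` and `c = 0.25` are illustrative stand-ins for the time-scaling exponent $\mathfrak{a}$ and for the exponent $\mathfrak{h}/2 + \mathfrak{t}/2 - \mathfrak{w}$, which is assumed here to be positive; they are not values taken from the paper.

```python
import math

def scaled_tail(n, a, c):
    """n^a * P(|xi| >= n^c) for a scalar standard normal xi."""
    threshold = n ** c
    return (n ** a) * math.erfc(threshold / math.sqrt(2.0))

# Illustrative exponents: a stands in for the time scaling, c for h/2 + t/2 - w.
values = [scaled_tail(n, a=1.0, c=0.25) for n in (10, 100, 1000, 10000)]
print(values)  # polynomial growth n^a is overwhelmed by the Gaussian tail
```

For small positive `c` the product can grow over an initial range of `n` before the tail takes over, so the limit statement is asymptotic and does not imply monotone decay.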

### B.4 Part 2.

We take a partial second-order Taylor expansion of the test function $f$ with respect to the global variable $\vartheta$:

$$\begin{aligned}
A^{(n_m)} f(\vartheta, \zeta_1) &= \alpha^{(n_m)} \bigl( \mathbb{E}[f(\vartheta + \Delta^{(n_m)}(\vartheta), \tilde{\zeta}_1)] - f(\vartheta, \zeta_1) \bigr)\\
&= \alpha^{(n_m)} \Bigl( \mathbb{E}[f(\vartheta + \Delta^{(n_m)}(\vartheta), \tilde{\zeta}_1) - f(\vartheta, \tilde{\zeta}_1)] + \mathbb{E}[f(\vartheta, \tilde{\zeta}_1) - f(\vartheta, \zeta_1)] \Bigr)\\
&= n_m^{\mathfrak{a}}\, \mathbb{E}\left\langle \nabla_\vartheta f(\vartheta, \tilde{\zeta}_1), \Delta^{(n_m)}(\vartheta) \right\rangle + n_m^{\mathfrak{a}}\, \mathbb{E}\left\langle \frac{1}{2} \nabla_\vartheta^{\otimes 2} f(\vartheta, \tilde{\zeta}_1) \Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta) \right\rangle\\
&\quad + n_m^{\mathfrak{a}}\, \mathbb{E}\!\left[ \frac{1}{6} \nabla_\vartheta^{\otimes 3} f(\vartheta + S \Delta^{(n_m)}(\vartheta), \tilde{\zeta}_1) \bigl(\Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta)\bigr) \right]\\
&\quad + n_m^{\mathfrak{a}}\, \mathbb{E}[f(\vartheta, \tilde{\zeta}_1) - f(\vartheta, \zeta_1)]
\end{aligned} \tag{B.25}$$

for some $S \in [0, 1]$, where we used $\alpha^{(n_m)} = n_m^{\mathfrak{a}}$. Rearranging terms yields

$$\begin{aligned}
A^{(n_m)} f(\vartheta, \zeta_1) &= n_m^{\mathfrak{a}}\, \mathbb{E}\left\langle \nabla_\vartheta f(\vartheta, \zeta_1), \Delta^{(n_m)}(\vartheta) \right\rangle + n_m^{\mathfrak{a}}\, \mathbb{E}\left\langle \frac{1}{2} \nabla_\vartheta^{\otimes 2} f(\vartheta, \zeta_1) \Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta) \right\rangle\\
&\quad + n_m^{\mathfrak{a}}\, \mathbb{E}\!\left[ \frac{1}{6} \nabla_\vartheta^{\otimes 3} f(\vartheta + S \Delta^{(n_m)}(\vartheta), \tilde{\zeta}_1) \bigl(\Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta)\bigr) \right]\\
&\quad + n_m^{\mathfrak{a}}\, \mathbb{E}[f(\vartheta, \tilde{\zeta}_1) - f(\vartheta, \zeta_1)]\\
&\quad + n_m^{\mathfrak{a}}\, \mathbb{E}\left\langle \nabla_\vartheta f(\vartheta, \tilde{\zeta}_1) - \nabla_\vartheta f(\vartheta, \zeta_1), \Delta^{(n_m)}(\vartheta) \right\rangle\\
&\quad + n_m^{\mathfrak{a}}\, \mathbb{E}\left\langle \frac{1}{2} \nabla_\vartheta^{\otimes 2} f(\vartheta, \tilde{\zeta}_1) \Delta^{(n_m)}(\vartheta) - \frac{1}{2} \nabla_\vartheta^{\otimes 2} f(\vartheta, \zeta_1) \Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta) \right\rangle.
\end{aligned} \tag{B.26}$$

Therefore,

$$\begin{aligned}
\|A^{(n_m)} f - Af\| \leq{} & \underbrace{\Bigl\| n_m^{\mathfrak{a}}\, \mathbb{E}\left\langle \nabla_\vartheta f(\vartheta, \zeta_1), \Delta^{(n_m)}(\vartheta) \right\rangle + \left\langle \tfrac{1}{2} B\vartheta, \nabla_\vartheta f(\vartheta, \zeta_1) \right\rangle \Bigr\|}_{R_1} && \text{(B.27)}\\
& + \underbrace{\Bigl\| n_m^{\mathfrak{a}}\, \mathbb{E}\left\langle \tfrac{1}{2} \nabla_\vartheta^{\otimes 2} f(\vartheta, \zeta_1) \Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta) \right\rangle - \tfrac{1}{2} A : \nabla_\vartheta^{\otimes 2} f(\vartheta, \zeta_1) \Bigr\|}_{R_2} && \text{(B.28)}\\
& + \underbrace{\Bigl\| \lambda \Bigl( \int f(\vartheta, y)\, p(y \mid X_1, \theta^*)\, dy - f(\vartheta, \zeta_1) \Bigr) - n_m^{\mathfrak{a}}\, \mathbb{E}[f(\vartheta, \tilde{\zeta}_1) - f(\vartheta, \zeta_1)] \Bigr\|}_{R_3} && \text{(B.29)}\\
& + \underbrace{\Bigl\| n_m^{\mathfrak{a}}\, \mathbb{E}\left\langle \nabla_\vartheta f(\vartheta, \tilde{\zeta}_1) - \nabla_\vartheta f(\vartheta, \zeta_1), \Delta^{(n_m)}(\vartheta) \right\rangle \Bigr\|}_{R_4} && \text{(B.30)}\\
& + \underbrace{\Bigl\| n_m^{\mathfrak{a}}\, \mathbb{E}\left\langle \tfrac{1}{2} \nabla_\vartheta^{\otimes 2} f(\vartheta, \tilde{\zeta}_1) \Delta^{(n_m)}(\vartheta) - \tfrac{1}{2} \nabla_\vartheta^{\otimes 2} f(\vartheta, \zeta_1) \Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta) \right\rangle \Bigr\|}_{R_5} && \text{(B.31)}\\
& + \underbrace{\Bigl\| n_m^{\mathfrak{a}}\, \mathbb{E}\!\left[ \tfrac{1}{6} \nabla_\vartheta^{\otimes 3} f(\vartheta + S \Delta^{(n_m)}(\vartheta), \tilde{\zeta}_1) \bigl(\Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta), \Delta^{(n_m)}(\vartheta)\bigr) \right] \Bigr\|}_{R_6}. && \text{(B.32)}
\end{aligned}$$

We denote the six terms above by $R_1, \dots, R_6$ and will show that each of them vanishes as $n \to \infty$.

### B.5 $R_1$ (drift term)

We have

$$n_m^{\mathfrak{a}}\, \mathbb{E}[\Delta^{(n_m)}(\vartheta)] = n_m^{\mathfrak{a}}\, \mathbb{E}\bigl[ \Delta_{\xi}^{(n_m)}(\vartheta) + \Delta_{\pi_0}^{(n_m)}(\vartheta) + \Delta_{\ell}^{(n_m)}(\vartheta) \bigr]. \tag{B.33}$$

#### Noise term.

$$n_m^{\mathfrak{a}}\, \mathbb{E}[\Delta_{\xi}^{(n_m)}(\vartheta)] = n_m^{\mathfrak{a}}\, \mathbb{E}\!\left[ w^{(n_m)} \sqrt{h^{(n_m)} (\beta^{(n_m)})^{-1} \Gamma}\, \xi_1 \right] = n_m^{\mathfrak{a}}\, w^{(n_m)} \sqrt{h^{(n_m)} (\beta^{(n_m)})^{-1} \Gamma}\, \mathbb{E}[\xi_1] = 0. \tag{B.34}$$

#### Prior term.

$$\begin{aligned}
n_m^{\mathfrak{a}}\, \mathbb{E}[\Delta_{\pi_0}^{(n_m)}(\vartheta)] &= n_m^{\mathfrak{a}}\, \mathbb{E}\!\left[ \frac{h^{(n_m)} w^{(n_m)} \Gamma}{2 n_m}\, \nabla \log \pi_0\bigl(\hat{\theta}^{(n_m)} + (w^{(n_m)})^{-1} \vartheta\bigr) \right] && \text{(B.35)}\\
&= \frac{c_h}{2}\, n_m^{\mathfrak{a} - \mathfrak{h} + \mathfrak{w} - 1}\, \Gamma\, \nabla \log \pi_0\bigl(\hat{\theta}^{(n_m)} + (w^{(n_m)})^{-1} \vartheta\bigr), && \text{(B.36)}
\end{aligned}$$

so that, for $\vartheta \in K_1$ and by the $L_0$-Lipschitz property of $\nabla \log \pi_0$,

$$\bigl\| n_m^{\mathfrak{a}}\, \mathbb{E}[\Delta_{\pi_0}^{(n_m)}(\vartheta)] \bigr\| \leq \frac{c_h n_m^{\mathfrak{a} - \mathfrak{h} + \mathfrak{w} - 1} \|\Gamma\|}{2} \Biggl( \|\nabla \log \pi_0(\theta^*)\| + L_0 \Bigl( \Phi^{(n_m)} + \frac{2R_0 + 2c_0}{n_m^{\mathfrak{w}}} \Bigr) \Biggr), \tag{B.37}$$

which vanishes uniformly on $K_1$, since $\mathfrak{a} - \mathfrak{h} + \mathfrak{w} - 1 < 0$.

#### Likelihood term.

$$
\begin{align}
n_m^{\mathfrak{a}}\,\mathbb{E}\,\Delta_{\ell}^{(n_m)}(\vartheta)
&= n_m^{\mathfrak{a}}\,\mathbb{E}\,\frac{h^{(n_m)}w^{(n_m)}\Gamma}{2b^{(n_m)}}\sum_{i=1}^{b^{(n_m)}}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_{I_1^{(n_m)}(i)},\tilde{\zeta}_{I_1^{(n_m)}(i)}\bigr) \tag{B.38}\\
&= \frac{c_h\,n_m^{\mathfrak{a}-\mathfrak{h}+\mathfrak{w}-1}\,\Gamma}{2}\,\mathbb{E}_{\tilde{\zeta}}\sum_{i=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta;\,X_i,\tilde{\zeta}_i\bigr) \tag{B.39}\\
&= \frac{c_h\,n_m^{\mathfrak{a}-\mathfrak{h}+\mathfrak{w}-1}\,\Gamma}{2}\sum_{i=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta;\,X_i\bigr) \tag{B.40}\\
&= \frac{c_h\,n_m^{\mathfrak{a}-\mathfrak{h}+\mathfrak{w}-1}\,\Gamma}{2}\sum_{i=1}^{n_m}\Bigl[\nabla\ell\bigl(\hat{\theta}^{(n_m)};X_i\bigr)+n_m^{-\mathfrak{w}}\,\nabla^{2}\ell\bigl(\hat{\theta}^{(n_m)};X_i\bigr)\vartheta\Bigr] \tag{B.41}\\
&= \frac{c_h\,n_m^{\mathfrak{a}-\mathfrak{h}}\,\Gamma}{2}\,\frac{\vartheta}{n_m}\sum_{i=1}^{n_m}\int_0^1\nabla^{\otimes 2}\ell\Bigl(\hat{\theta}^{(n_m)}+\frac{s}{n_m^{\mathfrak{w}}}\vartheta;\,X_i\Bigr)\,ds, \tag{B.42}
\end{align}
$$

where we used that $\ell(\theta;X_i)$ denotes the (marginal) log-likelihood with $z$ marginalized, and

$$
\mathbb{E}_{\tilde{\zeta}}\bigl[g(\tilde{\zeta}_i)\bigr]=\int g(\tilde{\zeta}_i)\,p\bigl(\tilde{\zeta}_i\mid X_i,\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta\bigr)\,d\tilde{\zeta}_i. \tag{B.43}
$$

Indeed,

$$
\begin{align}
\mathbb{E}_{\tilde{\zeta}}\!\left[\nabla\ell\bigl(\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)\right]
&= \int\nabla\log p\bigl(X_i,\tilde{\zeta}_i\mid\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta\bigr)\,p\bigl(\tilde{\zeta}_i\mid X_i,\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta\bigr)\,d\tilde{\zeta}_i \tag{B.44}\\
&= \frac{\nabla p\bigl(X_i\mid\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta\bigr)}{p\bigl(X_i\mid\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta\bigr)} \tag{B.45}\\
&= \nabla\ell\bigl(\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta;\,X_i\bigr). \tag{B.46}
\end{align}
$$
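The identity (B.44)–(B.46) says that averaging the complete-data score over the conditional law of the latent variable gives the marginal score. The following is a minimal numerical sketch of that identity on a toy model that is not from the paper: a hypothetical two-component mixture with $\zeta\sim\mathrm{Bernoulli}(1/2)$, $X\mid\zeta=0\sim N(0,1)$ and $X\mid\zeta=1\sim N(\theta,1)$. All function names are illustrative assumptions.

```python
import math

def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def marginal_loglik(theta, x):
    # Toy model (an assumption for illustration): equal-weight mixture of
    # N(0, 1) and N(theta, 1), with the component label zeta marginalized out.
    return math.log(0.5 * normal_pdf(x, 0.0) + 0.5 * normal_pdf(x, theta))

def expected_complete_score(theta, x):
    # Posterior probability that zeta = 1 given x and theta.
    w0 = 0.5 * normal_pdf(x, 0.0)
    w1 = 0.5 * normal_pdf(x, theta)
    r = w1 / (w0 + w1)
    # Complete-data score in theta: (x - theta) if zeta = 1, and 0 if zeta = 0.
    return r * (x - theta)

def marginal_score_fd(theta, x, eps=1e-6):
    # Central finite difference of the marginal log-likelihood.
    return (marginal_loglik(theta + eps, x) - marginal_loglik(theta - eps, x)) / (2.0 * eps)

theta, x = 0.7, 1.3
lhs = expected_complete_score(theta, x)   # left-hand side of (B.44)
rhs = marginal_score_fd(theta, x)         # right-hand side of (B.46), numerically
print(abs(lhs - rhs) < 1e-6)
```

The finite difference is used only so that the two sides are computed by independent routes; in this toy model both can also be written in closed form.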
Moreover, by the definition of $\hat{\theta}^{(n_m)}$,

$$
\sum_{i=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)};X_i\bigr)=0. \tag{B.47}
$$

Hence, when $\mathfrak{a}+\mathfrak{h}<1$, $n_m^{\mathfrak{a}}\,\mathbb{E}\,\Delta_{\ell}^{(n_m)}(\vartheta)$ vanishes and the drift term is inactive in the limit.

When $\mathfrak{a}=\mathfrak{h}$ and $n_m$ is large enough that $r_{J,n_m}\geq R_0+c_0$,

$$
R_1=\Bigl\|n_m^{\mathfrak{a}}\,\mathbb{E}\bigl\langle\nabla_{\vartheta}f(\vartheta,\zeta_1),\Delta^{(n_m)}(\vartheta)\bigr\rangle+\Bigl\langle\frac{c_h\Gamma J_{\star}}{2}\vartheta,\nabla_{\vartheta}f(\vartheta,\zeta_1)\Bigr\rangle\Bigr\|\leq c_h\,\|f\|_{\infty}\,\|\Gamma\|\,(R_0+c_0)\,\Phi^{(n_m)}, \tag{B.48}
$$

which vanishes uniformly on $K_1$.

### B.6 $R_2$ (diffusion term)

We write

$$
n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\bigl(\Delta^{(n_m)}(\vartheta)\bigr)^{\otimes 2}\right]
= n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\bigl(\Delta_{\xi}^{(n_m)}(\vartheta)+\Delta_{\pi_0}^{(n_m)}(\vartheta)+\Delta_{\ell}^{(n_m)}(\vartheta)\bigr)^{\otimes 2}\right]. \tag{B.49}
$$

Given the independence among the three terms, the cross terms vanish. The potentially non-vanishing contributions therefore come from $n_m^{\mathfrak{a}}\,\mathbb{E}\bigl[(\Delta_{\xi}^{(n_m)}(\vartheta))^{\otimes 2}\bigr]$ and $n_m^{\mathfrak{a}}\,\mathbb{E}\bigl[(\Delta_{\ell}^{(n_m)}(\vartheta))^{\otimes 2}\bigr]$.

To obtain a non-trivial limit, we only consider $\mathfrak{a}=\mathfrak{h}$.

#### Noise variance.

$$
n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\bigl(\Delta_{\xi}^{(n_m)}(\vartheta)\bigr)^{\otimes 2}\right]
=\frac{c_h}{c_{\beta}}\,n_m^{\mathfrak{a}+2\mathfrak{w}-\mathfrak{h}-\mathfrak{t}}\,\Gamma
=\frac{c_h}{c_{\beta}}\,n_m^{2\mathfrak{w}-\mathfrak{t}}\,\Gamma, \tag{B.50}
$$

where the second equality uses $\mathfrak{a}=\mathfrak{h}$. Thus when $\mathfrak{w}<\mathfrak{t}/2$, the corresponding diffusion term is inactive in the limit. When $\mathfrak{w}=\mathfrak{t}/2$,

$$
\Bigl\|n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\bigl(\Delta_{\xi}^{(n_m)}(\vartheta)\bigr)^{\otimes 2}\right]-\frac{c_h\Gamma}{c_{\beta}}\Bigr\|=0. \tag{B.51}
$$

#### Stochastic gradient variance.

$$
\begin{align}
n_m^{\mathfrak{a}}\,\mathbb{E}\bigl[\Delta_{\ell}^{(n_m)}(\vartheta)\bigr]^{2}
&= n_m^{\mathfrak{a}}\,\mathbb{E}\left[\frac{h^{(n_m)}w^{(n_m)}}{2b^{(n_m)}}\sum_{i=1}^{b^{(n_m)}}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_{I_1^{(n_m)}(i)},\tilde{\zeta}_{I_1^{(n_m)}(i)}\bigr)\right]^{2} \tag{B.52}\\
&= \frac{c_h^{2}}{4c_b^{2}}\,n_m^{\mathfrak{a}-2\mathfrak{h}+2\mathfrak{w}-2\mathfrak{b}}\,\Gamma\,\mathbb{E}\left[\sum_{i=1}^{b^{(n_m)}}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_{I_1^{(n_m)}(i)},\tilde{\zeta}_{I_1^{(n_m)}(i)}\bigr)\right]^{2}\Gamma' \tag{B.53}\\
&= \frac{c_h^{2}}{4c_b}\,n_m^{\mathfrak{a}-2\mathfrak{h}+2\mathfrak{w}-\mathfrak{b}}\,\Gamma\left[\frac{1}{n_m}\,\mathbb{E}_{\tilde{\zeta}}\sum_{i=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)^{2}\right]\Gamma' \tag{B.54}\\
&\quad+\frac{c_h^{2}}{4c_b^{2}}\,n_m^{\mathfrak{a}-2\mathfrak{h}+2\mathfrak{w}-2\mathfrak{b}}\,\Gamma\Biggl[\mathbb{E}\sum_{i=1}^{b^{(n_m)}}\sum_{\substack{i'=1\\ i'\neq i}}^{b^{(n_m)}}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_{I_1^{(n_m)}(i)},\tilde{\zeta}_{I_1^{(n_m)}(i)}\bigr) \tag{B.55}\\
&\qquad\qquad\otimes\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_{I_1^{(n_m)}(i')},\tilde{\zeta}_{I_1^{(n_m)}(i')}\bigr)\Biggr]\Gamma' \tag{B.56}\\
&= \frac{c_h^{2}}{4c_b}\,n_m^{\mathfrak{a}-2\mathfrak{h}+2\mathfrak{w}-\mathfrak{b}}\,\Gamma\Biggl[\frac{1}{n_m}\,\mathbb{E}_{\tilde{\zeta}}\sum_{i=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)^{2}\Biggr]\Gamma' \tag{B.57}\\
&\quad+\frac{c_h^{2}}{4c_b^{2}}\,n_m^{\mathfrak{a}-2\mathfrak{h}+2\mathfrak{w}-2\mathfrak{b}}\,\frac{b^{(n_m)}(b^{(n_m)}-1)}{n_m^{2}}\,\Gamma\Biggl[\sum_{i=1}^{n_m}\sum_{i'=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i\bigr) \tag{B.58}\\
&\qquad\qquad\otimes\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_{i'}\bigr)\Biggr]\Gamma' \tag{B.59}\\
&= \frac{c_h^{2}}{4c_b}\,n_m^{\mathfrak{a}-2\mathfrak{h}+2\mathfrak{w}-\mathfrak{b}}\,\Gamma\Biggl[\frac{1}{n_m}\,\mathbb{E}_{\tilde{\zeta}}\sum_{i=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)^{2}\Biggr]\Gamma' \tag{B.60}\\
&\quad+\frac{c_h^{2}}{4c_b^{2}}\,n_m^{\mathfrak{a}-2\mathfrak{h}+2\mathfrak{w}-2\mathfrak{b}}\,b^{(n_m)}(b^{(n_m)}-1)\,\Gamma\Biggl(\frac{1}{n_m}\sum_{i=1}^{n_m}\int_0^1\nabla^{\otimes 2}\ell\Bigl(\hat{\theta}^{(n_m)}+\frac{s}{n_m^{\mathfrak{w}}}\vartheta;\,X_i\Bigr)\,ds\,\frac{1}{n_m^{\mathfrak{w}}}\vartheta\Biggr)^{\otimes 2}\Gamma'. \tag{B.61}
\end{align}
$$
Note that

$$
\begin{align}
&\Biggl\|\frac{c_h^{2}}{4c_b^{2}}\,n_m^{\mathfrak{a}-2\mathfrak{h}+2\mathfrak{w}-2\mathfrak{b}}\,b^{(n_m)}(b^{(n_m)}-1)\,\Gamma\Biggl(\frac{1}{n_m}\sum_{i=1}^{n_m}\int_0^1\nabla^{\otimes 2}\ell\Bigl(\hat{\theta}^{(n_m)}+\frac{s}{n_m^{\mathfrak{w}}}\vartheta;\,X_i\Bigr)\,ds\,\frac{1}{n_m^{\mathfrak{w}}}\vartheta\Biggr)^{\otimes 2}\Gamma f'\Biggr\| \tag{B.62}\\
&\quad\leq\sqrt{d}\,c_h^{2}\,n_m^{\mathfrak{a}-2\mathfrak{h}+2\mathfrak{w}}\,\|\Gamma\|^{2}\,\|\nabla^{\otimes 2}f\|_{\infty}\,\frac{(2R_0+2c_0)^{2}}{n_m^{2\mathfrak{w}}}\,\bigl(J_{\star}+\Phi^{(n_m)}\bigr)^{2}. \tag{B.63}
\end{align}
$$

Since $\mathfrak{a}=\mathfrak{h}$ and $\mathfrak{h}>0$, this term vanishes uniformly on $K_1$.

Thus when $\mathfrak{w}<(\mathfrak{h}+\mathfrak{b})/2$, the corresponding diffusion term is inactive in the limit. When $\mathfrak{w}=(\mathfrak{h}+\mathfrak{b})/2$ and $n_m$ is large enough that $r_{I,n_m}\geq R_0+c_0$,

$$
\Bigl\|n_m^{\mathfrak{a}}\,\mathbb{E}\bigl[\Delta_{\ell}^{(n_m)}(\vartheta)\bigr]^{2}-\frac{c_h^{2}}{4c_b}\,\Gamma\tilde{I}_{\star}\Gamma'\Bigr\|\leq\frac{c_h^{2}}{4c_b}\,\|\Gamma\|^{2}\,\Phi^{(n_m)}, \tag{B.64}
$$

which vanishes uniformly on $K_1$. Hence, $R_2$ vanishes uniformly on $K_1$.
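The split into a diagonal part with $b^{(n_m)}$ terms and an off-diagonal part with $b^{(n_m)}(b^{(n_m)}-1)$ terms, used in (B.54)–(B.59), rests on a counting identity for mini-batches drawn with replacement. The sketch below checks that identity by exhaustive enumeration in a simplified setting: the per-datum gradients are fixed scalars (no latent resampling, no $\Gamma$, no prefactors), and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def second_moment_enumerated(g, b):
    """E[(sum of g over the batch)^2], averaging over all n^b equally likely
    ordered batches of size b drawn with replacement from range(len(g))."""
    n = len(g)
    total = Fraction(0)
    for batch in product(range(n), repeat=b):
        s = sum(g[i] for i in batch)
        total += s * s
    return total / n ** b

def second_moment_formula(g, b):
    """Diagonal part b * mean(g^2) plus off-diagonal part b(b-1) * mean(g)^2."""
    n = len(g)
    mean_sq = sum(x * x for x in g) / n
    mean = sum(g) / n
    return b * mean_sq + b * (b - 1) * mean * mean

# Toy per-datum "gradients" (illustrative values only).
g = [Fraction(1), Fraction(-2), Fraction(3), Fraction(5, 2)]
b = 3
print(second_moment_enumerated(g, b) == second_moment_formula(g, b))
```

Exact rational arithmetic is used so that the comparison is an equality rather than a tolerance check. When the mean of `g` is zero, as in (B.47) at $\vartheta=0$, only the diagonal part survives.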

### B.7 $R_3$ (jump term)

We consider

$$
\begin{align}
n_m^{\mathfrak{a}}\,\mathbb{E}\bigl[f(\vartheta,\tilde{\zeta}_1)-f(\vartheta,\zeta_1)\bigr]
&= c_b\,n_m^{\mathfrak{a}+\mathfrak{b}-1}\left[\int f(\vartheta,y)\,p\bigl(y\mid X_1,\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta\bigr)\,dy-f(\vartheta,\zeta_1)\right] \tag{B.65}\\
&= c_b\,n_m^{\mathfrak{a}+\mathfrak{b}-1}\left[\int f(\vartheta,y)\,p(y\mid X_1,\theta^{*})\,dy-f(\vartheta,\zeta_1)\right] \tag{B.66}\\
&\quad+c_b\,n_m^{\mathfrak{a}+\mathfrak{b}-1}\left[\int f(\vartheta,y)\,\bigl(p(y\mid X_1,\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta)-p(y\mid X_1,\theta^{*})\bigr)\,dy\right]. \tag{B.67}
\end{align}
$$

Since $\|\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta-\theta^{*}\|\leq\|\hat{\theta}^{(n_m)}-\theta^{*}\|+n_m^{-\mathfrak{w}}\|\vartheta\|$ vanishes uniformly on $K_1$, under [Assumption A.6](https://arxiv.org/html/2606.00309#A1.Thmtheorem6), so does $\|p(y\mid X_1,\hat{\theta}^{(n_m)}+n_m^{-\mathfrak{w}}\vartheta)-p(y\mid X_1,\theta^{*})\|$.

Thus when $\mathfrak{b}+\mathfrak{h}<1$, this term is inactive in the limit. When $\mathfrak{b}+\mathfrak{h}=1$,

$$
R_3=\Bigl\|\lambda\Bigl(\int f(\vartheta,y)\,p(y\mid X_1,\theta^{*})\,dy-f(\vartheta,\zeta_1)\Bigr)-n_m^{\mathfrak{a}}\,\mathbb{E}\bigl[f(\vartheta,\tilde{\zeta}_1)-f(\vartheta,\zeta_1)\bigr]\Bigr\| \tag{B.68}
$$

vanishes uniformly on $K_1$ with $\lambda=c_b$.
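The exponent conditions met so far can be collected mechanically. The sketch below is a hypothetical helper, not part of the paper: it assumes power-law schedules $h^{(n)}=c_h n^{-\mathfrak{h}}$, $w^{(n)}=n^{\mathfrak{w}}$, $\beta=c_\beta n^{\mathfrak{t}}$ and $b^{(n)}=c_b n^{\mathfrak{b}}$, which is how the prefactors in (B.36), (B.50), (B.57) and (B.65) read, and it simply reports the sign of each power of $n_m$.

```python
from fractions import Fraction

def term_exponents(a, h, w, b, t):
    """Power of n_m multiplying each rescaled term, as read off the
    prefactors in (B.36), (B.50), (B.57) and (B.65)."""
    return {
        "prior drift": a - h + w - 1,
        "injected noise": a + 2 * w - h - t,
        "stochastic gradient noise": a - 2 * h + 2 * w - b,
        "latent jump": a + b - 1,
    }

def classify(exponent):
    """Negative power: term vanishes; zero: term survives; positive: term blows up."""
    if exponent < 0:
        return "vanishes"
    if exponent == 0:
        return "active"
    return "diverges"

# One illustrative choice (an assumption, not the paper's recommendation):
# a = h, b + h = 1, w = (h + b) / 2 and t = 2 w.
h = Fraction(1, 2)
b = 1 - h
w = (h + b) / 2
t = 2 * w
report = {name: classify(e) for name, e in term_exponents(h, h, w, b, t).items()}
for name, status in report.items():
    print(f"{name}: {status}")
```

With this choice the prior drift vanishes while the injected noise, the stochastic gradient noise and the latent jump all stay active, which corresponds to the boundary cases singled out in Sections B.6 and B.7.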

### B.8 $R_4$ (gradient mismatch term)

Recall that $\nabla_{\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta}f(\vartheta,\zeta_1)$ is non-zero only when the latent variable $\zeta_1$ is updated, i.e., when

$$
A_1:=\bigl\{1\in I_1^{(n_m)}\bigr\} \tag{B.69}
$$

occurs. Therefore, we analyze

$$
n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\bigl\langle\nabla_{\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta}f(\vartheta,\zeta_1),\Delta^{(n_m)}(\vartheta)\bigr\rangle\right] \tag{B.70}
$$

by conditioning on $A_1$.

By the law of total expectation,

$$
n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\bigl\langle\nabla_{\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta}f(\vartheta,\zeta_1),\Delta^{(n_m)}(\vartheta)\bigr\rangle\right]
= n_m^{\mathfrak{a}}\,\mathbb{P}(A_1)\,\mathbb{E}\!\left[\bigl\langle\nabla_{\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta}f(\vartheta,\zeta_1),\Delta^{(n_m)}(\vartheta)\bigr\rangle\,\Big|\,A_1\right]. \tag{B.71}
$$

Since the mini-batch is sampled with replacement,

$$
\mathbb{P}(A_1)=1-\Bigl(1-\frac{1}{n_m}\Bigr)^{b^{(n_m)}}. \tag{B.72}
$$

Note that $\Delta_{\xi}^{(n_m)}(\vartheta)$ and $\Delta_{\pi_0}^{(n_m)}(\vartheta)$ are independent of $A_1$, so $\nabla_{\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta}f(\vartheta,\zeta_1)$ brings in a factor that is bounded by $n_m^{\mathfrak{b}-1}\|\nabla f\|_{\infty}$. This keeps the effects of the Gaussian noise and the prior inactive in the limit (as discussed in [Section B.5](https://arxiv.org/html/2606.00309#A2.SS5)), so we only consider $\Delta_{\ell}^{(n_m)}(\vartheta)$. Under $A_1$, the conditional expectation of the increment satisfies

$$
\begin{aligned}
\mathbb{E}\bigl[\Delta_{\ell}^{(n_m)}(\vartheta)\mid A_1\bigr]
=\frac{c_h\,n_m^{-\mathfrak{h}+\mathfrak{w}}}{2}\Biggl[&\frac{1}{n_m\,\mathbb{P}(A_1)}\,\Gamma\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr)\\
&+\Bigl(1-\frac{1}{n_m\,\mathbb{P}(A_1)}\Bigr)\frac{1}{n_m-1}\sum_{i=2}^{n_m}\Gamma\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)\Biggr].
\end{aligned} \tag{B.73}
$$

Substituting ([B.73](https://arxiv.org/html/2606.00309#A2.E73)) into ([B.71](https://arxiv.org/html/2606.00309#A2.E71)) yields an explicit decomposition of the gradient mismatch term.
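The weights inside the bracket of (B.73) can be checked by brute force on a small example. The sketch below is a simplified, hypothetical setting: scalar per-datum gradients `g[i]` stand in for $\Gamma\nabla\ell(\cdot;X_i,\tilde{\zeta}_i)$, index `0` plays the role of datum $1$, and the prefactor $c_h n_m^{-\mathfrak{h}+\mathfrak{w}}/2$ is dropped. It compares the mini-batch mean conditioned on $A_1$, computed by enumerating every batch, with the two-term weighted formula.

```python
from fractions import Fraction
from itertools import product

def conditional_mean_enumerated(g, b):
    """Average of the mini-batch mean over all ordered with-replacement batches
    of size b that contain index 0 (the event A_1). All batches are equally likely."""
    n = len(g)
    total = Fraction(0)
    count = 0
    for batch in product(range(n), repeat=b):
        if 0 in batch:
            total += sum(g[i] for i in batch) / b
            count += 1
    return total / count

def conditional_mean_formula(g, b):
    """Bracket of (B.73): weight 1/(n P(A_1)) on datum 0 and the remaining
    weight on the average over the other n - 1 data points."""
    n = len(g)
    p_a1 = 1 - Fraction(n - 1, n) ** b
    weight = 1 / (n * p_a1)
    rest = sum(g[1:]) / (n - 1)
    return weight * g[0] + (1 - weight) * rest

# Toy per-datum "gradients" (illustrative values only).
g = [Fraction(4), Fraction(-1), Fraction(2), Fraction(7, 3), Fraction(0)]
b = 3
print(conditional_mean_enumerated(g, b) == conditional_mean_formula(g, b))
```

The formula follows from splitting the unconditional mean over $A_1$ and its complement: on the complement, every draw is uniform over the other $n_m-1$ indices.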

Assume $\mathfrak{b}-1<0$, so that $b^{(n_m)}/n_m\to 0$. Then

$$
\mathbb{P}(A_1)=1-\Bigl(1-\frac{1}{n_m}\Bigr)^{b^{(n_m)}}=\frac{b^{(n_m)}}{n_m}+O\!\left(\frac{\bigl(b^{(n_m)}\bigr)^{2}}{n_m^{2}}\right). \tag{B.74}
$$

Moreover,

$$
\frac{1}{n_m\,\mathbb{P}(A_1)}=\frac{1}{b^{(n_m)}}+o\!\left(\frac{1}{b^{(n_m)}}\right). \tag{B.75}
$$
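A quick numerical look at (B.74) and (B.75) may help. The sketch below assumes, purely for illustration, a batch size $b=\mathrm{round}(\sqrt{n})$, i.e. $\mathfrak{b}=1/2$ and $c_b=1$; the function names are hypothetical.

```python
import math

def inclusion_probability(n, b):
    """P(A_1) = 1 - (1 - 1/n)^b for b draws with replacement from n items,
    evaluated through log1p/expm1 to limit cancellation for large n."""
    return -math.expm1(b * math.log1p(-1.0 / n))

def check_expansion(n, b):
    """Return P(A_1), the first-order error |P(A_1) - b/n|, and the ratio
    b / (n P(A_1)), which (B.75) says tends to 1."""
    p = inclusion_probability(n, b)
    return p, abs(p - b / n), b / (n * p)

results = {}
for n in (10**3, 10**4, 10**5, 10**6):
    b = round(n ** 0.5)          # illustrative schedule with b/n -> 0
    results[n] = (b,) + check_expansion(n, b)
    _, p, err, ratio = results[n]
    print(f"n={n:>8} b={b:>5} P(A1)={p:.6e} |P-b/n|={err:.3e} b/(nP)={ratio:.6f}")
```

In each row the first-order error stays below $(b/n)^2$, consistent with the $O\bigl((b^{(n_m)})^2/n_m^2\bigr)$ remainder, and the ratio approaches $1$ as $n$ grows.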
Assuming $\mathfrak{a}=\mathfrak{h}$, the dominant contribution becomes

$$
\begin{aligned}
n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\bigl\langle\nabla_{\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta}f(\vartheta,\zeta_1),\Delta^{(n_m)}(\vartheta)\bigr\rangle\right]
&=\frac{c_h\Gamma}{2}\,n_m^{\mathfrak{w}-1}\,\mathbb{E}_{\tilde{\zeta}}\!\left[\bigl\langle\nabla_{\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta}f(\vartheta,\zeta_1),\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr)\bigr\rangle\right]\\
&\quad+\frac{c_h\Gamma}{2}\,n_m^{\mathfrak{w}-1}\Biggl\langle\mathbb{E}_{\tilde{\zeta}}\!\left[\nabla_{\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta}f(\vartheta,\zeta_1)\right],\frac{b^{(n_m)}-1}{n_m-1}\sum_{i=2}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i\bigr)\Biggr\rangle.
\end{aligned} \tag{B.76}
$$

Note that

$$
\begin{aligned}
\frac{1}{n_m}\sum_{i=2}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i\bigr)
&=\frac{1}{n_m}\sum_{i=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i\bigr)-\frac{1}{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1\bigr)\\
&=n_m^{-\mathfrak{w}}\bigl(J_{\star}\vartheta+R_J\bigr)-n_m^{-1}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1\bigr),
\end{aligned} \tag{B.77}
$$

where $R_J$ vanishes uniformly on $K_1$ as $n_m\to\infty$. Keeping only the dominant terms,

$$
\begin{aligned}
\text{([B.8](https://arxiv.org/html/2606.00309#A2.Ex3))}\;\lesssim\;& n_m^{\mathfrak{w}-1}\,\|\nabla f\|_{\infty}\,\mathbb{E}_{\tilde{\zeta}}\bigl\|\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr)\bigr\|\\
&+\|\nabla f\|_{\infty}\Bigl(n_m^{\mathfrak{b}-1}J_{\star}\vartheta+n_m^{\mathfrak{b}+\mathfrak{w}-2}\,\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1\bigr)\Bigr).
\end{aligned} \tag{B.78}
$$

Since $\mathfrak{w}-1<0$ and $\mathfrak{b}-1<0$, and for $m$ large enough $\max_{1\leq i\leq n_m}\|\nabla\ell(\theta^{*};X_i,\cdot)\|\leq n_m^{1/p_2}$, all terms in ([B.8](https://arxiv.org/html/2606.00309#A2.Ex3)) vanish, and hence $R_4$ vanishes uniformly on $K_1$.

### B.9 $R_5$ (Hessian mismatch term)

Let

$$
A_1:=\bigl\{1\in I_1^{(n_m)}\bigr\},\qquad\mathbb{P}(A_1)=1-\Bigl(1-\frac{1}{n_m}\Bigr)^{b^{(n_m)}}. \tag{B.79}
$$

Again we focus on

$$
\begin{aligned}
&n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\bigl(\Delta_{\ell}^{(n_m)}(\vartheta)\bigr)^{\otimes 2}:\bigl(\nabla_{\vartheta\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta\vartheta}f(\vartheta,\zeta_1)\bigr)\right]\\
&\quad=\mathbb{P}(A_1)\,n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\bigl(\Delta_{\ell}^{(n_m)}(\vartheta)\bigr)^{\otimes 2}:\bigl(\nabla_{\vartheta\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta\vartheta}f(\vartheta,\zeta_1)\bigr)\,\Big|\,A_1\right].
\end{aligned} \tag{B.80}
$$

Under the event $A_1=\{1\in I_1^{(n_m)}\}$, the conditional second moment of the likelihood increment admits the decomposition

$$
\begin{align}
&\mathbb{E}\!\left[\bigl(\Delta_{\ell}^{(n_m)}(\vartheta)\bigr)^{\otimes 2}\mid A_1\right] \tag{B.81}\\
&=\left(\frac{c_h\,n_m^{-\mathfrak{h}+\mathfrak{w}}}{2}\right)^{2}\Gamma\Biggl\{\frac{1}{b^{(n_m)}}\Biggl[\frac{\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr)^{\otimes 2}}{n_m\,\mathbb{P}(A_1)} \tag{B.82}\\
&\qquad+\left(1-\frac{1}{n_m\,\mathbb{P}(A_1)}\right)\frac{1}{n_m-1}\sum_{i=2}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)^{\otimes 2}\Biggr] \tag{B.83}\\
&\quad+\frac{b^{(n_m)}-1}{b^{(n_m)}}\Biggl[\left(\frac{1}{n_m\,\mathbb{P}(A_1)}\right)^{2}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr)^{\otimes 2} \tag{B.84}\\
&\qquad+\frac{2}{n_m\,\mathbb{P}(A_1)}\left(1-\frac{1}{n_m\,\mathbb{P}(A_1)}\right)\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr)\otimes\left(\frac{1}{n_m-1}\sum_{i=2}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)\right) \tag{B.85}\\
&\qquad+\left(1-\frac{1}{n_m\,\mathbb{P}(A_1)}\right)^{2}\left(\frac{1}{n_m-1}\sum_{i=2}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)\right)^{\otimes 2}\Biggr]\Biggr\}\Gamma'. \tag{B.86}
\end{align}
$$
Assuming $\mathfrak{a}=\mathfrak{h}$, the dominant contribution of

$$
n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\bigl(\Delta_{\ell}^{(n_m)}(\vartheta)\bigr)^{\otimes 2}:\bigl(\nabla_{\vartheta\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta\vartheta}f(\vartheta,\zeta_1)\bigr)\right]
$$

becomes

$$
\begin{align}
&\frac{c_h^{2}\,n_m^{-\mathfrak{h}+2\mathfrak{w}+\mathfrak{b}-1}}{4}\,\Gamma\Biggl\{\frac{1}{b^{(n_m)}}\Biggl[\mathbb{E}_{\tilde{\zeta}}\Bigl[\frac{\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr)^{\otimes 2}}{b^{(n_m)}}:\bigl(\nabla_{\vartheta\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta\vartheta}f(\vartheta,\zeta_1)\bigr)\Bigr] \tag{B.87}\\
&\qquad+\left(\frac{b^{(n_m)}-1}{b^{(n_m)}}\right)\frac{1}{n_m-1}\sum_{i=2}^{n_m}\mathbb{E}_{\tilde{\zeta}}\Bigl[\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)^{\otimes 2}\Bigr]:\mathbb{E}_{\tilde{\zeta}}\Bigl[\bigl(\nabla_{\vartheta\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta\vartheta}f(\vartheta,\zeta_1)\bigr)\Bigr]\Biggr] \tag{B.88}\\
&\quad+\frac{b^{(n_m)}-1}{b^{(n_m)}}\Biggl[\frac{1}{\bigl(b^{(n_m)}\bigr)^{2}}\,\mathbb{E}_{\tilde{\zeta}}\Bigl[\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr)^{\otimes 2}:\bigl(\nabla_{\vartheta\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta\vartheta}f(\vartheta,\zeta_1)\bigr)\Bigr] \tag{B.89}\\
&\qquad+\frac{2\bigl(b^{(n_m)}-1\bigr)}{\bigl(b^{(n_m)}\bigr)^{2}}\,\mathbb{E}_{\tilde{\zeta}}\Bigl[\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr):\bigl(\nabla_{\vartheta\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta\vartheta}f(\vartheta,\zeta_1)\bigr)\Bigr] \tag{B.90}\\
&\qquad\qquad\otimes\mathbb{E}_{\tilde{\zeta}}\Bigl[\frac{1}{n_m-1}\sum_{i=2}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)\Bigr] \tag{B.91}\\
&\qquad+\left(\frac{b^{(n_m)}-1}{b^{(n_m)}}\right)^{2}\mathbb{E}_{\tilde{\zeta}}\left(\frac{1}{n_m-1}\sum_{i=2}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)\right)^{\otimes 2}:\mathbb{E}_{\tilde{\zeta}}\Bigl[\bigl(\nabla_{\vartheta\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta\vartheta}f(\vartheta,\zeta_1)\bigr)\Bigr]\Biggr]\Biggr\}\Gamma'. \tag{B.92}
\end{align}
$$

Next, we rewrite the terms indexed by $2,\dots,n_m$ in terms of sums over $1,\dots,n_m$ and terms with index $1$.

$$
\begin{align}
&\frac{1}{n_m-1}\sum_{i=2}^{n_m}\mathbb{E}_{\tilde{\zeta}}\Bigl[\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)^{\otimes 2}\Bigr] \tag{B.93}\\
&\quad=\frac{n_m}{n_m-1}\left(\tilde{I}_{\star}+R_I-n_m^{-1}\,\mathbb{E}_{\tilde{\zeta}}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr)^{\otimes 2}\right), \tag{B.94}\\
&\mathbb{E}_{\tilde{\zeta}}\Bigl[\frac{1}{n_m-1}\sum_{i=2}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)\Bigr] \tag{B.95}\\
&\quad=\frac{n_m}{n_m-1}\left(n_m^{-\mathfrak{w}}\bigl(J_{\star}\vartheta+R_J\bigr)-n_m^{-1}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1\bigr)\right), \tag{B.96}\\
&\mathbb{E}_{\tilde{\zeta}}\left(\frac{1}{n_m-1}\sum_{i=2}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)\right)^{\otimes 2} \tag{B.97}\\
&\quad=\frac{1}{(n_m-1)^{2}}\Biggl\{\left(\sum_{i=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i\bigr)\right)^{\otimes 2}+\sum_{i=1}^{n_m}\mathbb{E}_{\tilde{\zeta}}\Bigl[\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i,\tilde{\zeta}_i\bigr)^{\otimes 2}\Bigr] \tag{B.98}\\
&\qquad-\sum_{i=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i\bigr)^{\otimes 2}-2\,\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1\bigr)\left(\sum_{i=1}^{n_m}\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_i\bigr)\right) \tag{B.99}\\
&\qquad+2\,\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1\bigr)^{\otimes 2}-\mathbb{E}_{\tilde{\zeta}}\Bigl[\nabla\ell\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;\,X_1,\tilde{\zeta}_1\bigr)^{\otimes 2}\Bigr]\Biggr\}. \tag{B.100}
\end{align}
$$
Keeping only the dominant terms, the quantity in ([B.87](https://arxiv.org/html/2606.00309#A2.E87)) satisfies

\begin{align}
\text{(B.87)}\lesssim{}& n_m^{-1}\mathbb{E}_{\tilde{\zeta}}\bigl[\nabla\ell\!\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;X_1,\tilde{\zeta}_1\bigr)^{\otimes 2}:\bigl(\nabla_{\vartheta\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta\vartheta}f(\vartheta,\zeta_1)\bigr)\bigr]+n_m^{\mathfrak{b}-1}\|\nabla^{\otimes 2}\|_{\infty}\tilde{I}_{\star} \tag{B.101}\\
&+n_m^{\mathfrak{b}-1-\mathfrak{w}}\mathbb{E}_{\tilde{\zeta}}\bigl[\nabla\ell\!\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;X_1,\tilde{\zeta}_1\bigr)\otimes J_{\star}\vartheta:\bigl(\nabla_{\vartheta\vartheta}f(\vartheta,\tilde{\zeta}_1)-\nabla_{\vartheta\vartheta}f(\vartheta,\zeta_1)\bigr)\bigr] \tag{B.102}\\
&+\|\nabla^{\otimes 2}\|_{\infty}\left(n_m^{\mathfrak{b}-\mathfrak{h}-1}(J_{*}\vartheta)^{\otimes 2}+n_m^{2\mathfrak{b}-2}(\tilde{I}_{\star}-I_{\star})+n_m^{2\mathfrak{b}-2-\mathfrak{w}}J_{\star}\vartheta\otimes\nabla\ell\!\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;X_1\bigr)\right) \tag{B.103}\\
\lesssim{}&\|\nabla^{\otimes 2}\|_{\infty}\Bigg\{n_m^{-1}\mathbb{E}_{\tilde{\zeta}}\bigl\|\nabla\ell\!\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;X_1,\tilde{\zeta}_1\bigr)\bigr\|^{2}_{\infty}+n_m^{\mathfrak{b}-1}\tilde{I}_{\star}+n_m^{\mathfrak{b}-1-\mathfrak{w}}\mathbb{E}_{\tilde{\zeta}}\bigl\|\nabla\ell\!\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;X_1,\tilde{\zeta}_1\bigr)\bigr\| \tag{B.104}\\
&+n_m^{\mathfrak{b}-\mathfrak{h}-1}(J_{*}\vartheta)^{\otimes 2}+n_m^{2\mathfrak{b}-2}(\tilde{I}_{\star}-I_{\star})+n_m^{2\mathfrak{b}-2-\mathfrak{w}}J_{\star}\vartheta\otimes\nabla\ell\!\bigl(\hat{\theta}^{(n_m)}+(w^{(n_m)})^{-1}\vartheta;X_1\bigr)\Bigg\}. \tag{B.105}
\end{align}

Since $\mathfrak{w}-1<0$ and $\mathfrak{b}-1<0$, and since for $m$ large enough $\max_{1\leq i\leq n_m}\|\nabla\ell(\theta^{*};X_i,\cdot)\|\leq n_m^{1/p_2}$, the remainder $R_5$ vanishes uniformly on $K_1$.

### B.10 $R_6$ (third-order remainder term)

By the triangle inequality,

\begin{align}
&n_m^{\mathfrak{a}}\,\mathbb{E}\!\left[\frac{1}{6}\,\nabla_{\vartheta}^{\otimes 3}f\bigl(\vartheta+S\Delta^{(n_m)}(\vartheta),\tilde{\zeta}_1\bigr)\bigl(\Delta^{(n_m)}(\vartheta),\Delta^{(n_m)}(\vartheta),\Delta^{(n_m)}(\vartheta)\bigr)\right] \tag{B.106}\\
&\quad\leq\frac{27}{6}\,n_m^{\mathfrak{a}}\,\|\nabla_{\vartheta}^{\otimes 3}f\|_{\infty}\,\Bigl(\mathbb{E}\|\Delta_{\xi}^{(n_m)}(\vartheta)\|^{3}+\mathbb{E}\|\Delta_{\pi_0}^{(n_m)}(\vartheta)\|^{3}+\mathbb{E}\|\Delta_{\ell}^{(n_m)}(\vartheta)\|^{3}\Bigr). \tag{B.107}
\end{align}
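The constant $27$ in (B.107) can be recovered from an elementary bound. The following is only a sketch under two assumptions not restated in this section: that the increment splits as $\Delta^{(n_m)}=\Delta_{\xi}^{(n_m)}+\Delta_{\pi_0}^{(n_m)}+\Delta_{\ell}^{(n_m)}$ (as the three-term bound suggests), and that $\|\nabla_{\vartheta}^{\otimes 3}f\|_{\infty}$ bounds the trilinear form uniformly. Writing $a=\|\Delta_{\xi}^{(n_m)}\|$, $b=\|\Delta_{\pi_0}^{(n_m)}\|$, $c=\|\Delta_{\ell}^{(n_m)}\|$ (shorthand introduced here for illustration),

$$
\begin{aligned}
\bigl|\nabla_{\vartheta}^{\otimes 3}f(\cdot)\bigl(\Delta^{(n_m)},\Delta^{(n_m)},\Delta^{(n_m)}\bigr)\bigr| &\leq \|\nabla_{\vartheta}^{\otimes 3}f\|_{\infty}\,\|\Delta^{(n_m)}\|^{3},\\
\|\Delta^{(n_m)}\|^{3} &\leq (a+b+c)^{3}\leq \bigl(3\max\{a,b,c\}\bigr)^{3}\leq 27\,(a^{3}+b^{3}+c^{3}).
\end{aligned}
$$

Taking expectations and multiplying by $n_m^{\mathfrak{a}}/6$ then gives the right-hand side of (B.107). Convexity of $x\mapsto x^{3}$ would give the sharper constant $9$, so the stated bound holds with room to spare.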
Moreover,

$$
\mathbb{E}\|\Delta_{\xi}^{(n_m)}(\vartheta)\|^{3}\leq n_m^{-\frac{3}{2}(\mathfrak{h}+\mathfrak{t}-2\mathfrak{w})}\,\left(\frac{c_h}{2c_{\beta}}\,\|\Gamma\|^{3/2}\,2^{3/2}\,\frac{\Gamma\!\left(\frac{d+3}{2}\right)}{\Gamma\!\left(\frac{d}{2}\right)}\right),\qquad\mathfrak{a}-\frac{3}{2}(\mathfrak{h}+\mathfrak{t}-2\mathfrak{w})<0. \tag{B.108}
$$

Also,

$$
\mathbb{E}\|\Delta_{\pi_0}^{(n_m)}(\vartheta)\|^{3}\leq\left(\frac{c_h}{2}\,n_m^{-\mathfrak{h}+\mathfrak{w}-1}\,\|\Gamma\|\right)^{3}\left(\|\nabla\log\pi_0(\theta^{*})\|+L_0\|\hat{\theta}^{(n_m)}-\theta^{*}\|+\frac{L_0(2R_0+2c_0)}{n_m^{\mathfrak{w}}}\right)^{3},\qquad\mathfrak{a}-3\mathfrak{h}+3\mathfrak{w}-3<0. \tag{B.109}
$$

Finally,

$$
\mathbb{E}\|\Delta_{\ell}^{(n_m)}(\vartheta)\|^{3}\leq\left(\frac{c_h\|\Gamma\|}{2}\right)^{3}\left(n_m^{1/p_2-\mathfrak{h}+\mathfrak{w}}+n_m^{1/p_3-\mathfrak{h}+\mathfrak{w}}\,\Phi^{(n_m)}+n_m^{1/p_3-\mathfrak{h}}\right)^{3}. \tag{B.110}
$$

Therefore, $R_6$ vanishes uniformly.

## Appendix C Experimental Details

This appendix provides additional details on the experimental setups used in Section [4](https://arxiv.org/html/2606.00309#S4), including the synthetic data-generating distributions and a description of the real data. Code to reproduce all experiments is available at [https://github\.com/shawngyn\-stack/LVM\_scaling\_limits](https://github.com/shawngyn-stack/LVM_scaling_limits).

### C.1 Gaussian Mixture Model (GMM) Experiments

We consider a Gaussian mixture model with $K$ components in $\mathbb{R}^{d}$ and dataset size $N$. For each observation $i\in\{1,\dots,N\}$, a latent cluster label is drawn as

$$
z_i\sim\mathrm{Categorical}(\pi),\qquad z_i\in\{1,\dots,K\}, \tag{C.1}
$$

and the observation is generated according to

$$
X_i\mid z_i=k\sim\mathcal{N}\!\bigl(\mu_k,\ \mathrm{diag}(\sigma_k^{2})\bigr), \tag{C.2}
$$

where $\mu_k\in\mathbb{R}^{d}$ denotes the component mean and $\sigma_k\in\mathbb{R}_{+}^{d}$ parameterizes a diagonal covariance matrix. The global parameters are $\{\mu_k,\sigma_k\}_{k=1}^{K}$, while the latent variables are the cluster assignments $\{z_i\}_{i=1}^{N}$.
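The generative process in (C.1)–(C.2) can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' code: the function name `sample_gmm`, the mixture weights, means, and scales below are hypothetical values chosen for the example, and labels are 0-based rather than $1,\dots,K$.

```python
import random

def sample_gmm(n, weights, means, sigmas, seed=0):
    """Draw n points from a diagonal-covariance Gaussian mixture, as in (C.1)-(C.2)."""
    rng = random.Random(seed)
    labels, points = [], []
    for _ in range(n):
        # z_i ~ Categorical(pi); 0-based labels here, the text uses 1..K
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # X_i | z_i = k ~ N(mu_k, diag(sigma_k^2)), sampled coordinate by coordinate
        x = [rng.gauss(m, s) for m, s in zip(means[k], sigmas[k])]
        labels.append(k)
        points.append(x)
    return labels, points

# Hypothetical settings: K = 2 components in R^2, N = 1000 observations
weights = [0.3, 0.7]
means = [[-2.0, 0.0], [2.0, 1.0]]
sigmas = [[0.5, 0.5], [1.0, 0.3]]
z, X = sample_gmm(1000, weights, means, sigmas)
```

Here `z` plays the role of the latent assignments $\{z_i\}$ that a Gibbs step would resample, while `means` and `sigmas` correspond to the global parameters.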

### C.2 Latent Dirichlet Allocation (LDA)

We consider the standard LDA generative model with $K$ topics and a vocabulary of size $V$. Let $D$ denote the number of documents. For each topic $k\in\{1,\dots,K\}$, we draw a topic-word distribution

$$
\beta_k\sim\mathrm{Dirichlet}(\beta\,\mathbf{1}_V),\qquad\beta_k\in\Delta^{V-1}, \tag{C.3}
$$

independently across $k$. Here $\beta>0$ is a symmetric Dirichlet hyperparameter and $\mathbf{1}_V$ denotes the all-ones vector in $\mathbb{R}^{V}$.

For each document $d\in\{1,\dots,D\}$, we draw a document-topic proportion vector

$$
\theta_d\sim\mathrm{Dirichlet}(\alpha\,\mathbf{1}_K),\qquad\theta_d\in\Delta^{K-1}, \tag{C.4}
$$

independently across documents, where $\alpha>0$ is a symmetric concentration parameter. We then sample the document length $L_d$ independently from a discrete uniform distribution over a fixed range. Conditional on $\theta_d$, each token $n\in\{1,\dots,L_d\}$ is generated by first sampling a topic assignment

$$
z_{dn}\mid\theta_d\sim\mathrm{Categorical}(\theta_d), \tag{C.5}
$$

and then sampling the observed word

$$
w_{dn}\mid z_{dn}=k\sim\mathrm{Categorical}(\beta_k). \tag{C.6}
$$

We collect each document as a sequence of word indices $w_{d,1:L_d}\in\{1,\dots,V\}^{L_d}$.

In our experiments, the global parameters correspond to the topic-word distributions $\{\beta_k\}_{k=1}^{K}$ (equivalently, the matrix $\beta\in\mathbb{R}^{K\times V}$ with rows on the simplex), while the latent variables are the token-level topic assignments $\{z_{dn}\}$. The document-topic proportions $\{\theta_d\}$ are integrated out, and Gibbs updates are performed over the token-level assignments $\{z_{dn}\}$. For further details, we refer the reader to Patterson & Teh ([2013](https://arxiv.org/html/2606.00309#bib.bib37)).
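The generative process in (C.3)–(C.6) can be sketched as follows, using only the Python standard library. This is a minimal illustration under stated assumptions rather than the authors' implementation: the helper names `dirichlet` and `sample_lda`, the concentration values, and the length range are hypothetical, and word and topic indices are 0-based. Dirichlet draws are obtained by normalizing independent Gamma variates.

```python
import random

def dirichlet(rng, conc, dim):
    """Symmetric Dirichlet(conc * 1_dim) draw via normalized Gamma variates."""
    g = [rng.gammavariate(conc, 1.0) for _ in range(dim)]
    total = sum(g)
    return [v / total for v in g]

def sample_lda(D, K, V, alpha, beta, len_range, seed=0):
    """Generate a synthetic corpus following (C.3)-(C.6)."""
    rng = random.Random(seed)
    topics = [dirichlet(rng, beta, V) for _ in range(K)]  # beta_k, (C.3)
    docs, assignments = [], []
    for _ in range(D):
        theta = dirichlet(rng, alpha, K)                  # theta_d, (C.4)
        length = rng.randint(*len_range)                  # L_d, uniform on an inclusive range
        z = rng.choices(range(K), weights=theta, k=length)  # z_dn, (C.5)
        w = [rng.choices(range(V), weights=topics[k])[0] for k in z]  # w_dn, (C.6)
        docs.append(w)
        assignments.append(z)
    return topics, docs, assignments

# Hypothetical settings chosen for illustration only
topics, docs, zs = sample_lda(D=50, K=3, V=20, alpha=0.5, beta=0.5, len_range=(20, 40))
```

Note that this sketch samples $\theta_d$ explicitly in order to generate data; the inference procedure described above integrates $\theta_d$ out instead.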

### C.3 Real-World Datasets

#### Flow cytometry \(GMM\)\.

We use a real flow cytometry dataset in which each observation corresponds to a single cell\. As input features, we retain only the four fluorescence intensity channels FL1\.H–FL4\.H, which measure marker\-specific protein expression levels\. All features are standardized prior to model fitting\. Ground\-truth cell population labels are available and are used for evaluation\.
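Feature standardization of this kind can be sketched as a per-channel z-score. The snippet below is an assumption-laden illustration: the function name `standardize` and the intensity values are made up, and the choice of the population standard deviation (rather than the sample one) is an assumption, since the text does not specify it.

```python
import statistics

def standardize(columns):
    """Z-score each feature column: subtract its mean, divide by its standard deviation."""
    out = []
    for col in columns:
        mu = statistics.fmean(col)
        sd = statistics.pstdev(col)  # population std; an assumption, not stated in the text
        out.append([(v - mu) / sd for v in col])
    return out

# Made-up intensities for two channels (stand-ins for FL1.H and FL2.H)
channels = [[120.0, 340.0, 560.0, 210.0], [15.0, 22.0, 9.0, 30.0]]
scaled = standardize(channels)
```

After this step each channel has mean zero and unit variance, which puts the four fluorescence channels on a common scale before fitting the mixture model.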

#### 20 Newsgroups \(LDA\)\.

For topic-modeling experiments, we use the 20 Newsgroups dataset as provided by scikit-learn (Pedregosa et al., [2011](https://arxiv.org/html/2606.00309#bib.bib38)). Documents are represented as bags of words after standard preprocessing.

### C.4 Setup of SVI and computational cost

We used the default schedule recommended by Hoffman et al. (2013) and verified that it converges; results are robust to moderate changes in $\tau_0$ and $\kappa$. The hyperparameter values are as follows.

GMM: mean-field Normal–Gamma variational family initialized via k-means; minibatch size $256$; prior parameters $\alpha_0=1.0$, $\beta_0=1.0$, $a_0=2.0$, $b_0=2.0$; Robbins–Monro step-size schedule $\rho_t=(\tau_0+t)^{-\kappa}$ with $\tau_0=10$ and $\kappa=0.7$.

LDA (synthetic and real data): semi-collapsed SVI with variational family $q(\pi)=\prod_{k}\mathrm{Dirichlet}(\lambda_k)$ and categorical local factors; minibatch size $64$; $10$ local coordinate-ascent updates per document; symmetric priors $\alpha=\beta=0.1$; same Robbins–Monro schedule.
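The Robbins–Monro schedule $\rho_t=(\tau_0+t)^{-\kappa}$ with $\tau_0=10$ and $\kappa=0.7$ is straightforward to compute; any $\kappa\in(0.5,1]$ yields step sizes with $\sum_t\rho_t=\infty$ and $\sum_t\rho_t^{2}<\infty$. The sketch below also shows the convex-combination update used in standard SVI. The function names, the starting index $t=1$, and the toy parameter vectors are assumptions for illustration and are not taken from the authors' code.

```python
def robbins_monro_step(t, tau0=10.0, kappa=0.7):
    """Step size rho_t = (tau0 + t) ** (-kappa)."""
    return (tau0 + t) ** (-kappa)

def svi_update(lam, lam_hat, rho):
    """Blend the current global parameter with a minibatch estimate:
    lam <- (1 - rho) * lam + rho * lam_hat."""
    return [(1.0 - rho) * l + rho * lh for l, lh in zip(lam, lam_hat)]

# Toy example: three updates toward a fixed (hypothetical) minibatch estimate
lam = [1.0, 1.0]
for t in range(1, 4):
    rho = robbins_monro_step(t)
    lam = svi_update(lam, [2.0, 0.5], rho)
```

Because the step sizes decay, early minibatches move the variational parameters substantially while later ones make progressively smaller corrections.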

In our synthetic GMM experiments, SVI typically stabilizes within roughly 20 iterations, whereas SGLD–Gibbs requires an initial burn-in period of about 200 iterations, followed by additional iterations used for posterior sampling. On a per-iteration basis, the cost of SGLD–Gibbs is approximately $0.3$–$0.6\times$ that of SVI when $S=1$, and approximately $1.6$–$3.0\times$ when $S=5$. Consequently, the overall cost of SGLD–Gibbs depends on the number of retained posterior samples. In our experiments, the total cost needed to obtain stable SGLD–Gibbs estimates was roughly $5$–$10\times$ that of SVI. This additional cost was associated with accuracy improvements of about $20$–$50\%$, together with better-calibrated uncertainty quantification.