Accurate Large-sample Uncertainty Quantification using Stochastic Gradient Markov Chain Monte Carlo
Summary
This paper proposes new discrete-time approximations for stochastic gradient Langevin dynamics (SGLD) with and without momentum, enabling accurate predictions of stationary covariance, iterate average covariance, and integrated autocorrelation time. The method provides improved tuning guidance for large-sample uncertainty quantification, especially under model misspecification.
# Accurate Large-sample Uncertainty Quantification using Stochastic Gradient Markov Chain Monte Carlo
Source: [https://arxiv.org/html/2606.00293](https://arxiv.org/html/2606.00293)
###### Abstract
Tuning algorithms such as stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD) for approximate sampling and uncertainty quantification remains challenging, particularly in the practically relevant settings where the batch size is large or the model is misspecified. Existing theory that provides tuning guidance relies on continuous-time limits or strong statistical assumptions, which can become quantitatively inaccurate in these regimes. We address these shortcomings by proposing new discrete-time approximations to SG(L)D with and without momentum, which enable accurate predictions of the stationary covariance, iterate average covariance, and integrated autocorrelation time. Moreover, we prove quantitative, non-asymptotic error bounds showing that these estimates are sufficiently accurate for practical tuning and uncertainty quantification. Numerical experiments demonstrate that our theory yields improved tuning guidance across a range of models and data-generating distributions where existing approaches fail, including when using the $\beta$-divergence rather than log-loss to obtain statistically robust inferences.
## 1Introduction
Stochastic gradient–based methods have become the default tool for large\-sample optimization in machine learning\. Algorithms such as stochastic gradient descent \(SGD\) and its variants dominate modern practice because subsampling dramatically reduces per\-iteration computational cost while having strong empirical performance and favorable generalization properties\(Bottou,[2010](https://arxiv.org/html/2606.00293#bib.bib6); Hardt et al\.,[2016](https://arxiv.org/html/2606.00293#bib.bib20); Goodfellow et al\.,[2016](https://arxiv.org/html/2606.00293#bib.bib17)\)\.
From a Bayesian perspective, subsampling\-based Markov chain Monte Carlo \(MCMC\) methods seem to offer an analogous path toward scalable sampling and uncertainty quantification \(UQ\)\. In particular, stochastic gradient MCMC \(SG\-MCMC\) algorithms such as stochastic gradient Langevin dynamics \(SGLD\) replace full\-data likelihood gradients with unbiased minibatch estimates, promising posterior sampling at a computational cost comparable to SGD\(Welling & Teh,[2011](https://arxiv.org/html/2606.00293#bib.bib73); Li et al\.,[2016](https://arxiv.org/html/2606.00293#bib.bib40); Raginsky et al\.,[2017](https://arxiv.org/html/2606.00293#bib.bib57); Brosse et al\.,[2018](https://arxiv.org/html/2606.00293#bib.bib7); Nemeth & Fearnhead,[2021](https://arxiv.org/html/2606.00293#bib.bib54)\)\. In practice, however, SG\-MCMC methods are notoriously difficult to tune because the step size, batch size, and temperature parameters must be carefully chosen to control discretization bias and mixing behavior while simultaneously providing accurate UQ\(Nemeth & Fearnhead,[2021](https://arxiv.org/html/2606.00293#bib.bib54); Coullon et al\.,[2023](https://arxiv.org/html/2606.00293#bib.bib9); Negrea et al\.,[2023](https://arxiv.org/html/2606.00293#bib.bib53); Rajpal et al\.,[2025](https://arxiv.org/html/2606.00293#bib.bib58); Kim et al\.,[2024](https://arxiv.org/html/2606.00293#bib.bib34); Mauri & Zanella,[2024](https://arxiv.org/html/2606.00293#bib.bib47); Alexos et al\.,[2022](https://arxiv.org/html/2606.00293#bib.bib4); Paulin et al\.,[2025](https://arxiv.org/html/2606.00293#bib.bib55); Akyildiz & Sabanis,[2024](https://arxiv.org/html/2606.00293#bib.bib3)\)\. These challenges are exacerbated when the statistical model is misspecified, a setting in which standard Bayesian posteriors are no longer well\-calibrated\. 
The same calibration issue applies when using a generalized Bayesian loss, whether the model is correctly specified or not\(Bissiri et al\.,[2016](https://arxiv.org/html/2606.00293#bib.bib5); Jewson et al\.,[2018](https://arxiv.org/html/2606.00293#bib.bib29)\)\.
Recent work has begun to address these challenges by explicitly combining algorithmic and statistical asymptotic perspectives. For example, Mandt et al. ([2017](https://arxiv.org/html/2606.00293#bib.bib46)) adopted a heuristic perspective motivated by two lines of work. The first considers scaling limits in stochastic approximations and shows that, after appropriate rescaling of space and time, the iterates jointly converge to a continuous-time Ornstein–Uhlenbeck process (Kushner & Huang, [1981](https://arxiv.org/html/2606.00293#bib.bib37); Pflug, [1986](https://arxiv.org/html/2606.00293#bib.bib56); Walk, [1977](https://arxiv.org/html/2606.00293#bib.bib70); Kushner & Yang, [1993](https://arxiv.org/html/2606.00293#bib.bib38); Kushner & Yin, [2003](https://arxiv.org/html/2606.00293#bib.bib36)). The second concerns the asymptotics of the Bayesian posterior, known as Bernstein–von Mises (or Bayesian Central Limit) theorems (Kleijn & van der Vaart, [2012](https://arxiv.org/html/2606.00293#bib.bib35); Van der Vaart, [2000](https://arxiv.org/html/2606.00293#bib.bib67)).
More recently, Negrea et al. ([2023](https://arxiv.org/html/2606.00293#bib.bib53)) and Wang et al. ([2025](https://arxiv.org/html/2606.00293#bib.bib72)) formalize and extend the heuristic arguments of Mandt et al. ([2017](https://arxiv.org/html/2606.00293#bib.bib46)) by analyzing stochastic gradient algorithms through joint limits in which both the dataset size and algorithm parameters (e.g., step size and batch size) scale together. Further, Wang & Huggins ([2026](https://arxiv.org/html/2606.00293#bib.bib71)) extend the results of Negrea et al. ([2023](https://arxiv.org/html/2606.00293#bib.bib53)) to models with local latent variables. These results, which characterize the limiting stochastic process of the iterate sample paths, make it possible to determine not only the limiting stationary distribution (which is important for UQ) but also the mixing time and iterate average distribution, which determine the algorithm's computational efficiency and the accuracy of posterior expectation estimates. Hence, these results are able to provide precise tuning advice that maximizes computational efficiency while targeting the desired form of UQ, such as frequentist coverage (White, [1982](https://arxiv.org/html/2606.00293#bib.bib74)), Bayesian model uncertainty (Kleijn & van der Vaart, [2012](https://arxiv.org/html/2606.00293#bib.bib35)), or both (Huggins & Miller, [2024](https://arxiv.org/html/2606.00293#bib.bib26)).
A major limitation of these results, however, is that they rely on taking continuous\-time stochastic differential equation \(SDE\) limits, which approximate discrete\-time algorithms only in the vanishing step\-size regime\(Wang et al\.,[2025](https://arxiv.org/html/2606.00293#bib.bib72); Li et al\.,[2019](https://arxiv.org/html/2606.00293#bib.bib42)\)\. These limiting approximations become quantitatively inaccurate precisely in the large batch\-size regimes most relevant to practice\. The problem is that using a large batch size requires using a relatively large step size\(Goyal et al\.,[2017](https://arxiv.org/html/2606.00293#bib.bib18); Negrea et al\.,[2023](https://arxiv.org/html/2606.00293#bib.bib53)\), so continuous\-time approximations can substantially mischaracterize stationary covariance structure, which can result in inaccurate UQ\.
[Figure 1](https://arxiv.org/html/2606.00293#S1.F1) illustrates how these issues can arise even in simple misspecified linear models. In this example, as the batch size increases, the accuracy of the tuning rules derived from SDE limits decreases rapidly, so the stationary covariance fails to match the sandwich covariance $\mathcal{S}_{\star}$ (White, [1982](https://arxiv.org/html/2606.00293#bib.bib74)). Such failures persist even with increasing data size, highlighting a fundamental limitation of continuous-time approximations for guiding practical tuning decisions (Wang et al., [2025](https://arxiv.org/html/2606.00293#bib.bib72)).
Recent work has used discrete-time approximations to stochastic gradient algorithms that remain valid at large batch sizes and/or large step sizes (Dieuleveut et al., [2020](https://arxiv.org/html/2606.00293#bib.bib11); Liu et al., [2021](https://arxiv.org/html/2606.00293#bib.bib43); Ziyin et al., [2022](https://arxiv.org/html/2606.00293#bib.bib78)). While promising, existing results either assume a constant noise covariance, apply only to linear models, or do not account for model misspecification. Moreover, most approximations lack rigorous non-asymptotic error guarantees, and none provide estimates for the mixing time or iterate-average distribution. [Figure 1](https://arxiv.org/html/2606.00293#S1.F1) illustrates how, as a result, they can fall short of providing reliable guidance for uncertainty quantification (in this case, due to model misspecification).
Figure 1: Misspecified linear regression with heteroskedastic noise. Data are generated according to $y_{n}\sim\mathcal{N}(x_{n}^{\top}\theta_{\star},\,1+\|x_{n}\|_{2}^{2})$, where $\theta_{\star}\sim\mathcal{N}(0,I_{D})$ is fixed and $x_{n}\overset{\text{iid}}{\sim}\mathcal{N}(0,I_{D})$. A linear model is fitted using constant-step-size SGD. $\mathcal{S}_{\star}=\mathcal{J}_{\star}^{-1}\mathcal{I}_{\star}\mathcal{J}_{\star}^{-1}$: sandwich covariance; $\hat{\mathcal{S}}$: covariance obtained under step-size tuning rules derived from different theories.

Table 1: Comparison of approximations used to tune SG(L)D for sampling. References are to the works most directly relevant to tuning. **Large batch:** Is the approach accurate for large batch sizes? **Non-const. noise:** Does the approach account for non-constant stochastic gradient noise? **General model/loss:** Does the approach account for model misspecification or the use of a generalized loss? **Mixing:** Does the approach provide mixing time and iterate average covariance estimates? **Bounds:** Are quantitative error bounds available?

| Approach | Large batch | Non-const. noise | General model/loss | Mixing | Bounds |
|---|---|---|---|---|---|
| Continuous-time (Mandt et al., [2017](https://arxiv.org/html/2606.00293#bib.bib46); Negrea et al., [2023](https://arxiv.org/html/2606.00293#bib.bib53); Wang et al., [2025](https://arxiv.org/html/2606.00293#bib.bib72)) | ✗ | ✗ | ✓ | ✓ | ✓ |
| Discrete quadratic + constant noise (Dieuleveut et al., [2020](https://arxiv.org/html/2606.00293#bib.bib11); Liu et al., [2021](https://arxiv.org/html/2606.00293#bib.bib43)) | ✓ | ✗ | ✓ | ✗ | ✓ |
| Linear regression + well-specified (Ziyin et al., [2022](https://arxiv.org/html/2606.00293#bib.bib78)) | ✓ | ✓ | ✗ | ✗ | ✗ |
| Discrete quadratic + exact noise (this work) | ✓ | ✓ | ✓ | ✓ | ✓ |

In this work, we address these limitations by developing a discrete-time theoretical framework for SGD and SGLD that remains accurate at large batch sizes and under model misspecification. [Table 1](https://arxiv.org/html/2606.00293#S1.T1) compares our approach to alternatives. Our contributions are as follows:
1. (minor) We introduce a *proxy algorithm framework* that clarifies the differences and limitations of existing approaches, and thereby helps identify where further theory is needed. ([Section 3](https://arxiv.org/html/2606.00293#S3))
2. (major) We derive a *new discrete-time approximation for SGD and SGLD* (with and without momentum) that remains accurate for large batch sizes and misspecified models. ([Section 4](https://arxiv.org/html/2606.00293#S4))
3. (major) We provide *quantitative, non-asymptotic error analyses* demonstrating that the resulting stationary covariance estimates are sufficiently accurate for practical tuning *for the purpose of sampling and uncertainty quantification*. ([Section 4.2](https://arxiv.org/html/2606.00293#S4.SS2))
4. (major) We use our results to propose a practical, tuning-free procedure for scalable uncertainty quantification ([Algorithm 1](https://arxiv.org/html/2606.00293#alg1)). Through numerical experiments, we show that our theory provides improved tuning guidance for a range of models, batch size regimes, and loss functions. ([Section 6](https://arxiv.org/html/2606.00293#S6))
5. (minor) Finally, while our focus in the paper is on uncertainty quantification and sampling, our results also shed light on the training dynamics and generalization behavior of SGD and its use for frequentist inference (Jantre et al., [2024](https://arxiv.org/html/2606.00293#bib.bib28); Hwang et al., [2022](https://arxiv.org/html/2606.00293#bib.bib27); Chang et al., [2017](https://arxiv.org/html/2606.00293#bib.bib8); Lyle et al., [2020](https://arxiv.org/html/2606.00293#bib.bib44); Mandt et al., [2017](https://arxiv.org/html/2606.00293#bib.bib46); Zhu et al., [2019](https://arxiv.org/html/2606.00293#bib.bib77); Lewkowycz et al., [2020](https://arxiv.org/html/2606.00293#bib.bib39); Keskar et al., [2017](https://arxiv.org/html/2606.00293#bib.bib33); Hoffer et al., [2017](https://arxiv.org/html/2606.00293#bib.bib23); Mori & Ueda, [2020](https://arxiv.org/html/2606.00293#bib.bib50)). For completeness, we illustrate some of these directions, which may be of interest to the wider ML community, through preliminary experiments (Appendix [E](https://arxiv.org/html/2606.00293#A5)).
## 2Background
### 2\.1Setting
Let $\{x_{n}\}_{n=1}^{N}$ denote the observed data with $x_{n}\in\mathbb{X}$. For parameter $\theta\in\mathbb{R}^{D}$, assume an observation-level differentiable loss or negative log-likelihood $\ell:\mathbb{X}\times\mathbb{R}^{D}\to\mathbb{R}$ and regularizer $\mathcal{R}:\mathbb{R}^{D}\to\mathbb{R}$, which in the sampling setting we should interpret as a negative log prior $-\log\pi_{0}(\theta)$ (up to an additive constant). Together, these lead to the negative potential (or loss)

$$\mathcal{L}(\theta):=N^{-1}\sum_{n=1}^{N}\ell(x_{n},\theta)+N^{-1}\mathcal{R}(\theta). \tag{2}$$

Define the stochastic gradient

$$G_{t}(\theta):=B^{-1}\sum_{n\in S_{t}}\nabla\ell(x_{n},\theta)+N^{-1}\nabla\mathcal{R}(\theta), \tag{3}$$

where $S_{t}=\{I_{t1},I_{t2},\dots,I_{tB}\}$ is a set of $B$ independent random integers sampled uniformly from $\{1,\dots,N\}$ either with or without replacement. *Stochastic gradient Langevin dynamics* (SGLD; Welling & Teh, [2011](https://arxiv.org/html/2606.00293#bib.bib73)) is a Markov chain Monte Carlo (MCMC) algorithm with the single-step update equation

$$\theta_{t}=\theta_{t-1}-\Lambda\,G_{t}(\theta_{t-1})+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}, \tag{4}$$

where $\Lambda\in\mathbb{R}^{D\times D}$ is a positive definite step size matrix, $\beta\in(0,\infty]$ is the inverse temperature (canonically set to $\beta=N$), and $\xi_{t-1}\overset{\text{iid}}{\sim}\mathcal{N}(0,I)$. SGLD is the prototypical example of a *subsampling MCMC* algorithm, variants of which have been applied for learning a wide variety of large-sample models (Ahn et al., [2012](https://arxiv.org/html/2606.00293#bib.bib1); Nemeth & Fearnhead, [2021](https://arxiv.org/html/2606.00293#bib.bib54); Aicher et al., [2025](https://arxiv.org/html/2606.00293#bib.bib2); Kim et al., [2024](https://arxiv.org/html/2606.00293#bib.bib34); Rajpal et al., [2025](https://arxiv.org/html/2606.00293#bib.bib58); Mauri & Zanella, [2024](https://arxiv.org/html/2606.00293#bib.bib47); Alexos et al., [2022](https://arxiv.org/html/2606.00293#bib.bib4); Paulin et al., [2025](https://arxiv.org/html/2606.00293#bib.bib55)). If $\beta=\infty$ (with $1/\infty:=0$), then SGLD reduces to SGD. Setting $\Lambda=\lambda I_{D}$ for some $\lambda>0$ results in the usual formulation of SG(L)D with fixed step size $\lambda$.
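The update in Equations 3 and 4 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the callback signatures `grad_loss(x, theta)` and `grad_reg(theta)`, and the use of a Cholesky factor in place of the symmetric matrix square root (which gives injected noise with the same covariance $2\beta^{-1}\Lambda$) are assumptions made here, not part of the paper.

```python
import numpy as np

def stochastic_gradient(theta, data, grad_loss, grad_reg, batch_size, rng, replace=True):
    """Minibatch gradient G_t(theta) of Equation 3 (hypothetical helper)."""
    n_total = len(data)
    # Indices S_t, drawn uniformly with or without replacement.
    idx = rng.choice(n_total, size=batch_size, replace=replace)
    batch_grad = np.mean([grad_loss(data[n], theta) for n in idx], axis=0)
    return batch_grad + grad_reg(theta) / n_total

def sgld_step(theta, step_matrix, inv_temp, grad, rng):
    """One update of Equation 4; inv_temp = np.inf gives plain SGD."""
    drift = theta - step_matrix @ grad
    if np.isinf(inv_temp):
        return drift
    # Assumption: use a Cholesky factor F with F F^T = 2 * Lambda / beta.
    factor = np.linalg.cholesky(2.0 * step_matrix / inv_temp)
    return drift + factor @ rng.standard_normal(theta.shape[0])
```

Passing `inv_temp=np.inf` makes the step deterministic given the minibatch, which matches the convention $1/\infty:=0$ stated above.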
### 2\.2Uncertainty Quantification
Both SGD and SGLD have been used for quantifying uncertainty about model parameters (Welling & Teh, [2011](https://arxiv.org/html/2606.00293#bib.bib73); Ahn et al., [2012](https://arxiv.org/html/2606.00293#bib.bib1); Nemeth & Fearnhead, [2021](https://arxiv.org/html/2606.00293#bib.bib54); Mandt et al., [2017](https://arxiv.org/html/2606.00293#bib.bib46)). If the observations are i.i.d. from an unknown distribution $P_{\star}$, the optimal parameter is given by $\theta_{\star}:=\operatorname*{arg\,min}_{\theta}\mathbb{E}\left[\ell(X,\theta)\right]$, where $X\sim P_{\star}$. In the Bayesian setting, the Bernstein–von Mises theorem states that the posterior is approximately $\mathcal{N}(\widehat{\theta},\mathcal{J}_{\star}^{-1}/N)$, where $\widehat{\theta}$ denotes the minimizer of $\mathcal{L}$ and $\mathcal{J}_{\star}:=\mathbb{E}\left[\nabla_{\theta}^{2}\ell\left(X,\theta_{\star}\right)\right]$ (Kleijn & van der Vaart, [2012](https://arxiv.org/html/2606.00293#bib.bib35)). Thus, one possible goal when using SG(L)D is to obtain samples with a distribution that is approximately equal to $\mathcal{N}(\widehat{\theta},\mathcal{J}_{\star}^{-1}/N)$. However, the sampling distribution of $\widehat{\theta}$ is asymptotically normal with mean $\theta_{\star}$ and covariance equal to $\mathcal{J}_{\star}^{-1}\mathcal{I}_{\star}\mathcal{J}_{\star}^{-1}/N$, where $\mathcal{I}_{\star}:=\mathbb{E}[\nabla_{\theta}\ell\left(X,\theta_{\star}\right)\nabla_{\theta}\ell\left(X,\theta_{\star}\right)^{\top}]$ (White, [1982](https://arxiv.org/html/2606.00293#bib.bib74)). The matrix $\mathcal{J}_{\star}^{-1}\mathcal{I}_{\star}\mathcal{J}_{\star}^{-1}$ is known as the "sandwich" covariance matrix, and it suggests that for proper uncertainty quantification we want the stationary SG(L)D distribution to be approximately $\mathcal{N}(\widehat{\theta},\mathcal{J}_{\star}^{-1}\mathcal{I}_{\star}\mathcal{J}_{\star}^{-1}/N)$. When the model is correctly specified, $\mathcal{I}_{\star}=\mathcal{J}_{\star}$, so the sandwich covariance is equal to $\mathcal{J}_{\star}^{-1}$ and the Bayesian posterior (and the Laplace approximation) provides correct uncertainty quantification. In the case of a generalized loss, there is no notion of well-specification, and so the "model covariance" $\mathcal{J}_{\star}^{-1}$ is not a coherent target for uncertainty quantification (Bissiri et al., [2016](https://arxiv.org/html/2606.00293#bib.bib5); Jewson et al., [2018](https://arxiv.org/html/2606.00293#bib.bib29)). Therefore, tuning SG(L)D to satisfy $\Sigma_{\theta}\approx\mathcal{J}_{\star}^{-1}\mathcal{I}_{\star}\mathcal{J}_{\star}^{-1}/N$ will capture the sampling uncertainty in both the model-based and generalized loss settings.
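To make the two candidate targets concrete, the sketch below computes plug-in estimates of the model covariance $\mathcal{J}_{\star}^{-1}/N$ and the sandwich covariance $\mathcal{J}_{\star}^{-1}\mathcal{I}_{\star}\mathcal{J}_{\star}^{-1}/N$ for linear regression. It assumes the unit-variance squared loss $\ell((z,y),\theta)=\tfrac12(y-\theta^{\top}z)^{2}$, no regularizer, and empirical averages in place of expectations; the function name is hypothetical.

```python
import numpy as np

def sandwich_covariance(features, targets):
    """Plug-in model and sandwich covariances for least squares (illustrative)."""
    n = features.shape[0]
    theta_hat, *_ = np.linalg.lstsq(features, targets, rcond=None)
    residuals = targets - features @ theta_hat
    # J-hat: average per-observation Hessian z z^T.
    hessian = features.T @ features / n
    # I-hat: average outer product of per-observation gradients -(y - theta^T z) z.
    scores = features * residuals[:, None]
    score_cov = scores.T @ scores / n
    j_inv = np.linalg.inv(hessian)
    model_cov = j_inv / n
    sandwich_cov = j_inv @ score_cov @ j_inv / n
    return theta_hat, model_cov, sandwich_cov

# Heteroskedastic data in the style of the Figure 1 caption.
rng = np.random.default_rng(0)
n_obs, dim = 20000, 3
x = rng.standard_normal((n_obs, dim))
theta_star = rng.standard_normal(dim)
y = x @ theta_star + np.sqrt(1.0 + np.sum(x**2, axis=1)) * rng.standard_normal(n_obs)
theta_hat, model_cov, sandwich_cov = sandwich_covariance(x, y)
```

Under this heteroskedastic noise the two matrices differ substantially, which is the kind of misspecification gap the sandwich target is meant to capture.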
## 3Proxy Algorithms
Given the extensive use of SGD and SGLD, both algorithms have been studied from a wide variety of perspectives\. Many such analyses can be viewed as proposing a*proxy algorithm*: an alternative stochastic process that is “close” to the actual algorithm of interest\. The idea is to characterize important properties of the proxy algorithm, then argue either heuristically or rigorously that these properties can be transferred back to apply to the original \(exact\) algorithm\. We will focus on proxy algorithms that, at least implicitly, require that the loss is well\-approximated by a quadratic function:
$$\mathcal{L}(\theta_{t})\approx\tilde{\mathcal{L}}(\theta_{t}):=\tfrac{1}{2}\big(\theta_{t}-\widehat{\theta}^{(N)}\big)^{\top}\widehat{H}\big(\theta_{t}-\widehat{\theta}^{(N)}\big)+\mathrm{const}, \tag{5}$$

where $\widehat{H}:=\nabla^{2}\mathcal{L}(\widehat{\theta})$ is the Hessian of the loss (evaluated at $\widehat{\theta}$). While such a condition may seem quite limiting, it turns out to be reasonable in many interesting settings.
##### Continuous\-time proxies\.
Perhaps the most popular proxy approach is to replace the discrete dynamics of the iterative algorithm by a continuous-time stochastic process (Mandt et al., [2017](https://arxiv.org/html/2606.00293#bib.bib46); Zhu et al., [2019](https://arxiv.org/html/2606.00293#bib.bib77); Negrea et al., [2023](https://arxiv.org/html/2606.00293#bib.bib53)). Let $\widehat{C}=\operatorname{Cov}(G_{1}(\widehat{\theta}))$ denote the gradient noise covariance at the minimizer and let $W_{t}$ be a $D$-dimensional Brownian motion. Focusing on the case of SGD for clarity, the Ornstein–Uhlenbeck process $(\vartheta_{t})_{t\geq 0}$ defined by the stochastic differential equation (SDE)

$$\mathrm{d}\vartheta_{t}=-\Lambda\widehat{H}\vartheta_{t}\,\mathrm{d}t+\Lambda\widehat{C}^{1/2}\,\mathrm{d}W_{t} \tag{6}$$

provides a proxy to the discrete-time dynamics after appropriate rescaling and discretization (see footnote 1). This approach can be made rigorous via both numerical analysis and statistical (large-sample) perspectives (Kushner & Yang, [1993](https://arxiv.org/html/2606.00293#bib.bib38); Kushner & Huang, [1981](https://arxiv.org/html/2606.00293#bib.bib37); Kushner & Yin, [2003](https://arxiv.org/html/2606.00293#bib.bib36); Negrea et al., [2023](https://arxiv.org/html/2606.00293#bib.bib53); Wang et al., [2025](https://arxiv.org/html/2606.00293#bib.bib72)). Similar types of arguments have also been widely used to study MCMC algorithms that do not use subsampling (Roberts & Rosenthal, [1998](https://arxiv.org/html/2606.00293#bib.bib60); Dalalyan, [2017](https://arxiv.org/html/2606.00293#bib.bib10); Roberts & Rosenthal, [2001](https://arxiv.org/html/2606.00293#bib.bib61); Wibisono, [2018](https://arxiv.org/html/2606.00293#bib.bib75)).

Footnote 1: Li et al. ([2017](https://arxiv.org/html/2606.00293#bib.bib41), [2019](https://arxiv.org/html/2606.00293#bib.bib42)) propose using stochastic modified equations (SMEs) to approximate SGD and perform error analysis. However, SMEs serve as a close approximation to SGD only for small learning rates, making it challenging to justify this approach for non-vanishing values of $\lambda$ (Li et al., [2017](https://arxiv.org/html/2606.00293#bib.bib41)). Moreover, in most cases, SMEs are not conducive to exact analysis; this leads to, for example, Li et al. ([2019](https://arxiv.org/html/2606.00293#bib.bib42), Section 5.1) focusing on cases with an explicit solution that match [Equation 6](https://arxiv.org/html/2606.00293#S3.E6).
The continuous-time approach is appealing because $(\vartheta_{t})_{t\geq 0}$ is a Gaussian process, so its properties are straightforward to analyze (Mandt et al., [2017](https://arxiv.org/html/2606.00293#bib.bib46); Negrea et al., [2023](https://arxiv.org/html/2606.00293#bib.bib53); Kushner & Yang, [1993](https://arxiv.org/html/2606.00293#bib.bib38)). For example, if the process has stationary distribution $\pi_{\vartheta}$, the covariance matrix of the stationary distribution $\Sigma_{\vartheta}:=\operatorname{Cov}(\pi_{\vartheta})$ must satisfy $\Sigma_{\vartheta}\widehat{H}+\widehat{H}\Sigma_{\vartheta}=\Lambda\widehat{C}$ (Gardiner, [1985](https://arxiv.org/html/2606.00293#bib.bib13)). In particular, setting

$$\Lambda=(\Sigma\widehat{H}+\widehat{H}\Sigma)\widehat{C}^{-1} \tag{7}$$

results in a stationary covariance of $\Sigma_{\vartheta}=\Sigma$ (see footnote 2). Furthermore, Negrea et al. ([2023](https://arxiv.org/html/2606.00293#bib.bib53)) show that the asymptotic mixing time is heuristically equal to $2/\lambda_{\min}(\Lambda\widehat{H})$ iterations, where $\lambda_{\min}(A)$ denotes the minimum eigenvalue of matrix $A$. This result suggests that, to optimize mixing time, one should set $\Lambda\propto\widehat{H}^{-1}$.

Footnote 2: Or, when we are interested in characterizing the stationary covariance, if $\Sigma_{\vartheta}$ and $\widehat{H}$ commute, then $\Sigma_{\vartheta}=\frac{1}{2}\Lambda\widehat{C}\widehat{H}^{-1}$.
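As a small numerical illustration of the continuous-time tuning rule, the sketch below evaluates Equation 7 and the heuristic mixing time $2/\lambda_{\min}(\Lambda\widehat{H})$ on a diagonal (hence commuting) toy problem. The function names and the toy matrices are assumptions chosen only for illustration.

```python
import numpy as np

def continuous_time_step_matrix(target_cov, hessian, noise_cov):
    """Equation 7: Lambda = (Sigma H + H Sigma) C^{-1}."""
    return (target_cov @ hessian + hessian @ target_cov) @ np.linalg.inv(noise_cov)

def heuristic_mixing_time(step_matrix, hessian):
    """Heuristic mixing time 2 / lambda_min(Lambda H), in iterations."""
    eigvals = np.linalg.eigvals(step_matrix @ hessian)
    return 2.0 / np.min(eigvals.real)

# Toy diagonal problem (all matrices commute).
hessian = np.diag([1.0, 2.0])
noise_cov = np.diag([0.5, 0.25])
target_cov = np.diag([0.1, 0.2])

step_matrix = continuous_time_step_matrix(target_cov, hessian, noise_cov)
# In the commuting case the stationary covariance is (1/2) Lambda C H^{-1}.
implied_cov = 0.5 * step_matrix @ noise_cov @ np.linalg.inv(hessian)
mixing_time = heuristic_mixing_time(step_matrix, hessian)
```

Here `implied_cov` recovers `target_cov`, consistent with the commuting-case formula in footnote 2, and the mixing time is governed by the smallest eigenvalue of $\Lambda\widehat{H}$.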
##### Discrete\-time proxies\.
The continuous-time proxy approach requires the step size matrix $\Lambda$ to be sufficiently small that *(i)* the continuous-time dynamics (driven by Gaussian noise) are a good approximation to the discrete-time dynamics and *(ii)* the gradient noise is approximately constant (that is, $G_{t}(\theta_{t-1})\approx G_{t}(\widehat{\theta})$ for all $t=1,\dots,T$). However, in practice it is often desirable to use a relatively large batch size (e.g., 1%–10% of the data), in which case following the guidance of Negrea et al. ([2023](https://arxiv.org/html/2606.00293#bib.bib53)) requires the use of a relatively large $\Lambda$, which is exactly the regime in which the continuous-time theory often breaks down, leading to inaccurate predictions about real algorithm behavior (Liu et al., [2021](https://arxiv.org/html/2606.00293#bib.bib43); Ziyin et al., [2022](https://arxiv.org/html/2606.00293#bib.bib78)). The importance of capturing the location-dependence of noise has been widely observed (Simsekli et al., [2019](https://arxiv.org/html/2606.00293#bib.bib62), [2020](https://arxiv.org/html/2606.00293#bib.bib63); Hodgkinson & Mahoney, [2021](https://arxiv.org/html/2606.00293#bib.bib22); Meng et al., [2020](https://arxiv.org/html/2606.00293#bib.bib48); Mori et al., [2022](https://arxiv.org/html/2606.00293#bib.bib51); Ziyin et al., [2022](https://arxiv.org/html/2606.00293#bib.bib78)).
A number of papers aim to overcome these limitations by using the discrete\-time proxy algorithm
$$\psi_{t}=\psi_{t-1}-\frac{\Lambda}{B}\sum_{n\in S_{t}}\widehat{H}_{n}(\psi_{t-1}-\widehat{\theta}), \tag{8}$$

where $\widehat{H}_{n}:=\nabla^{2}\ell(x_{n},\widehat{\theta})$. Assuming it exists, let $\pi_{\psi}$ denote the stationary distribution of the proxy algorithm given in [Equation 8](https://arxiv.org/html/2606.00293#S3.E8), let $\Sigma_{\psi}:=\operatorname{Cov}(\pi_{\psi})$, and for $\psi_{\infty}\sim\pi_{\psi}$, let $\overline{C}_{\psi}:=\mathbb{E}[\operatorname{Cov}\{G_{1}(\psi_{\infty})\}]$ denote the expected covariance of the gradient noise. Liu et al. ([2021](https://arxiv.org/html/2606.00293#bib.bib43)) show that the stationary covariance $\Sigma_{\psi}$ of the discrete-time update described in [Equation 8](https://arxiv.org/html/2606.00293#S3.E8) satisfies

$$\Lambda\widehat{H}\Sigma_{\psi}+\Sigma_{\psi}\widehat{H}\Lambda=\Lambda\left(\overline{C}_{\psi}+\widehat{H}\Sigma_{\psi}\widehat{H}\right)\Lambda. \tag{9}$$

Dieuleveut et al. ([2020](https://arxiv.org/html/2606.00293#bib.bib11)) provide exact discrete-time analyses of constant-step stochastic gradient descent, treating SGD as a time-homogeneous Markov chain rather than as a discretization of a continuous-time diffusion. In the quadratic setting, they also show that SGD converges to a stationary distribution whose covariance satisfies [Equation 9](https://arxiv.org/html/2606.00293#S3.E9), as also given by Liu et al. ([2021](https://arxiv.org/html/2606.00293#bib.bib43)). Notably, the higher-order covariance term $\Lambda\overline{C}_{\psi}\Lambda$, which captures finite learning-rate effects intrinsic to the discrete-time dynamics, is missing from diffusion-based SDE approximations.
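If the noise covariance $\overline{C}_{\psi}$ is treated as a known, fixed matrix (a simplifying assumption, since in general it depends on $\Sigma_{\psi}$), Equation 9 is linear in $\Sigma_{\psi}$ and can be solved by vectorization. Writing $M=\Lambda\widehat{H}$, it is equivalent to the discrete Lyapunov equation $\Sigma_{\psi}=(I-M)\Sigma_{\psi}(I-M)^{\top}+\Lambda\overline{C}_{\psi}\Lambda$. The sketch below, with hypothetical function names, solves it with Kronecker products; for comparison, a second function drops the quadratic term $\Lambda\widehat{H}\Sigma_{\psi}\widehat{H}\Lambda$, which in the scalar case reproduces the continuous-time value $\lambda c/(2h)$.

```python
import numpy as np

def discrete_stationary_cov(step_matrix, hessian, noise_cov):
    """Solve Equation 9 for Sigma, treating the noise covariance as fixed."""
    dim = hessian.shape[0]
    m = step_matrix @ hessian  # M = Lambda H, so M^T = H Lambda
    eye = np.eye(dim)
    # vec(M S) + vec(S M^T) - vec(M S M^T) with column-major vec.
    operator = np.kron(eye, m) + np.kron(m, eye) - np.kron(m, m)
    rhs = (step_matrix @ noise_cov @ step_matrix).reshape(-1, order="F")
    return np.linalg.solve(operator, rhs).reshape(dim, dim, order="F")

def small_step_stationary_cov(step_matrix, hessian, noise_cov):
    """Same equation with the Lambda H Sigma H Lambda term dropped."""
    dim = hessian.shape[0]
    m = step_matrix @ hessian
    eye = np.eye(dim)
    operator = np.kron(eye, m) + np.kron(m, eye)
    rhs = (step_matrix @ noise_cov @ step_matrix).reshape(-1, order="F")
    return np.linalg.solve(operator, rhs).reshape(dim, dim, order="F")

# Scalar check: lambda = 0.5, h = 1, c = 2.
lam, h, c = np.array([[0.5]]), np.array([[1.0]]), np.array([[2.0]])
sigma_discrete = discrete_stationary_cov(lam, h, c)    # lambda c / (h (2 - lambda h))
sigma_small_step = small_step_stationary_cov(lam, h, c)  # lambda c / (2 h)
```

In this scalar example the discrete-time covariance is larger than the small-step value by the factor $2/(2-\lambda h)$, so the gap grows with the step size.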
A key challenge when using [Equation 9](https://arxiv.org/html/2606.00293#S3.E9) is that it only provides an *implicit* characterization of $\Sigma_{\psi}$ because the average noise covariance $\overline{C}_{\psi}$ also depends on the stationary distribution $\pi_{\psi}$. Hence, $\overline{C}_{\psi}$ must either be approximated or, in special cases, computed exactly. An important special case is the linear regression model, where $x_{n}=(z_{n},y_{n})\in\mathbb{R}^{D}\times\mathbb{R}$ and the observation-level loss is $\ell(x_{n},\theta)=\frac{1}{2\sigma^{2}}(y_{n}-\theta^{\top}z_{n})^{2}$. In a follow-up to Liu et al. ([2021](https://arxiv.org/html/2606.00293#bib.bib43)), Ziyin et al. ([2022](https://arxiv.org/html/2606.00293#bib.bib78)) show that, assuming $z_{n}\sim\mathcal{N}(0,A)$ and the model is well-specified (i.e., $y_{n}\sim\mathcal{N}(\theta_{\star}^{\top}z_{n},\sigma^{2})$ for some $\theta_{\star}\in\mathbb{R}^{D}$), for large $N$,

$$\overline{C}_{\psi}\approx B^{-1}\left(A\Sigma_{\psi}A+\operatorname{Tr}\left[A\Sigma_{\psi}\right]A+\sigma^{2}A\right). \tag{10}$$

Using this approximate expected covariance for SGD noise, Ziyin et al. ([2022](https://arxiv.org/html/2606.00293#bib.bib78)) are able to show, for example, better test loss estimation, the benefits of negative regularization, the role of overparameterization in the steady-state dynamics of SGD, and power-law tail behavior of SGD noise.
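Equation 10 is simple enough to transcribe directly. The helper below is a hypothetical, literal implementation of the right-hand side; it makes visible that the approximate noise covariance grows with the stationary covariance and reduces to $\sigma^{2}A/B$ when $\Sigma_{\psi}=0$.

```python
import numpy as np

def linear_regression_noise_cov(feature_cov, stationary_cov, noise_var, batch_size):
    """Right-hand side of Equation 10 (literal transcription)."""
    a, s = feature_cov, stationary_cov
    return (a @ s @ a + np.trace(a @ s) * a + noise_var * a) / batch_size

feature_cov = np.array([[1.0, 0.2], [0.2, 0.5]])
at_minimizer = linear_regression_noise_cov(feature_cov, np.zeros((2, 2)), 2.0, 4)
away_from_minimizer = linear_regression_noise_cov(feature_cov, 0.1 * np.eye(2), 2.0, 4)
```

The difference between the two outputs is the location-dependent part of the noise that constant-noise proxies ignore.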
Hence, using discrete-time proxies rather than continuous-time ones can lead to more precise tuning advice and new insights. Nevertheless, as summarized in [Table 1](https://arxiv.org/html/2606.00293#S1.T1), existing approaches are not yet sufficiently reliable for practical use: some leave the noise covariance implicit (Dieuleveut et al., [2020](https://arxiv.org/html/2606.00293#bib.bib11)), others rely on heuristic approximations to the noise covariance (Liu et al., [2021](https://arxiv.org/html/2606.00293#bib.bib43)), and others focus on restricted settings, such as well-specified models with $N\gg D$ (Ziyin et al., [2022](https://arxiv.org/html/2606.00293#bib.bib78)). In addition, they do not characterize the mixing time or iterate average error. Our results, which are presented in the next section, address all of these limitations, as described in the last row of [Table 1](https://arxiv.org/html/2606.00293#S1.T1) and illustrated in [Figure 1](https://arxiv.org/html/2606.00293#S1.F1).
## 4A New Proxy Algorithm for Analyzing SG\(L\)D
Our approach to creating an improved proxy algorithm is to apply a second-order Taylor approximation to each loss term $\ell_{n}(\theta):=\ell(x_{n},\theta)$:

$$\tilde{\ell}_{n}(\theta):=\ell_{n}(\widehat{\theta})+\nabla\ell_{n}(\widehat{\theta})^{\top}(\theta-\widehat{\theta})+\frac{1}{2}(\theta-\widehat{\theta})^{\top}\nabla^{2}\ell_{n}(\widehat{\theta})(\theta-\widehat{\theta}).$$

We apply SG(L)D (with or without momentum) to the approximation $\tilde{\mathcal{L}}(\theta):=N^{-1}\sum_{n=1}^{N}\tilde{\ell}_{n}(\theta)+N^{-1}\mathcal{R}(\theta)$. Letting $\mathcal{J}_{n}:=\nabla^{2}\ell_{n}(\widehat{\theta})$ and using [the Taylor approximation above](https://arxiv.org/html/2606.00293#S4.Ex6) and [Equation 4](https://arxiv.org/html/2606.00293#S2.E4), the update equation for our proxy algorithm is

$$\begin{aligned}\psi_{t}&=\psi_{t-1}-\Lambda\Bigl[G_{t}(\widehat{\theta})+\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\Bigr]\\&\quad+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}.\end{aligned} \tag{12}$$
Assuming the iterates\(ψt\)t≥0\(\\psi\_\{t\}\)\_\{t\\geq 0\}have a well\-defined stationary distribution, the stationary covarianceΣψ\\Sigma\_\{\\psi\}provides an approximationΣ^θ:=Σψ\\widehat\{\\Sigma\}\_\{\\theta\}:=\\Sigma\_\{\\psi\}toΣθ\\Sigma\_\{\\theta\}\. The quadratic form of[Section4](https://arxiv.org/html/2606.00293#S4.Ex6)facilitates analyses that allow us to address limitations of previous work\. First, the quadratic loss results is the linear structure of the SG\(L\)D update given in[Equation12](https://arxiv.org/html/2606.00293#S4.E12), which makes it amenable to direct analysis\. Building on the techniques of previous work\(Liu et al\.,[2021](https://arxiv.org/html/2606.00293#bib.bib43); Ziyin et al\.,[2022](https://arxiv.org/html/2606.00293#bib.bib78)\), in[Section4\.1](https://arxiv.org/html/2606.00293#S4.SS1)we derive an*exact*, solvable relationship betweenΣψ\\Sigma\_\{\\psi\}andΛ\\Lambda\.*Thus, unlike previous results, we do not require any additional assumptions or approximations*\.
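The proxy update in Equation 12 can be sketched as a single step function. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes no regularizer, minibatches sampled with replacement, that $G_{t}(\widehat{\theta})$ is the minibatch average of the per-sample gradients at $\widehat{\theta}$, and that $\nabla G_{t}(\widehat{\theta})$ is the minibatch average of the per-sample Hessians. The name `proxy_step` and its arguments are hypothetical.

```python
import numpy as np

def proxy_step(psi, theta_hat, grads, hessians, Lam, B, rng, beta=np.inf):
    """One step of the linearized (proxy) SG(L)D update.

    grads    : (N, D) per-sample gradients evaluated at theta_hat
    hessians : (N, D, D) per-sample Hessians evaluated at theta_hat
    Lam      : (D, D) symmetric positive-definite learning rate matrix
    beta     : inverse temperature; np.inf gives plain SGD (no injected noise)
    """
    N = grads.shape[0]
    idx = rng.integers(0, N, size=B)      # assumption: sampling with replacement
    G = grads[idx].mean(axis=0)           # minibatch gradient at theta_hat
    dG = hessians[idx].mean(axis=0)       # minibatch Hessian at theta_hat
    new_psi = psi - Lam @ (G + dG @ (psi - theta_hat))
    if not np.isinf(beta):
        # A Cholesky factor of 2 * Lam / beta yields noise with the same
        # covariance as the matrix square root written in Equation 12.
        factor = np.linalg.cholesky(2.0 / beta * Lam)
        new_psi = new_psi + factor @ rng.standard_normal(psi.shape[0])
    return new_psi
```

Since the gradients and Hessians are frozen at $\widehat{\theta}$, each step is an affine map of the previous iterate, which is what makes the stationary analysis tractable.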
In addition, the use of a Taylor series approximation for the observation-level losses lends itself to rigorous error analysis. Specifically, we are able to bound the Wasserstein distance between the distributions of $\psi_{t}$ and $\theta_{t}$ in [Section 4.2](https://arxiv.org/html/2606.00293#S4.SS2). Using this result we obtain relative error bounds on the marginal standard deviation and covariance matrix estimates under standard assumptions, which hold for logistic regression and, assuming a bounded parameter space, for Poisson and gamma regression as well (see, e.g., Brosse et al., [2018](https://arxiv.org/html/2606.00293#bib.bib7); Moulines & Bach, [2011](https://arxiv.org/html/2606.00293#bib.bib52); Toulis et al., [2014](https://arxiv.org/html/2606.00293#bib.bib66)):
1. (A) The observation-level losses $\ell_{1},\dots,\ell_{N}$ are convex.
2. (B) For each $n=1,\dots,N$, for finite positive $L_{n}$ and $M_{n}$, the loss $\ell_{n}$ is $L_{n}$-smooth and satisfies $\sup_{\theta}\sum_{d=1}^{D}\lVert\nabla^{2}(\partial_{d}\ell_{n}(\theta))\rVert^{2}\leq M_{n}^{2}$.
3. (C) For some $\mu>0$, the loss $\mathcal{L}$ is $\mu$-strongly convex.
###### Theorem 4.1.
If Assumptions [(A)](https://arxiv.org/html/2606.00293#S4.I1.i1)–[(C)](https://arxiv.org/html/2606.00293#S4.I1.i3) hold and $\Lambda=\lambda I_{D}$ for some $\lambda\in(0,1/(2L))$, then there exist constants $C_{v}$ and $C_{s}$ independent of $\lambda$ such that

$$\lVert\Sigma_{\theta}-\Sigma_{\psi}\rVert/\lVert\Sigma_{\theta}\rVert\leq C_{v}\lambda^{1/2}\tag{13}$$

and, for $d=1,\dots,D$,

$$|\sigma_{\theta,d}-\sigma_{\psi,d}|/\sigma_{\theta,d}\leq C_{s}\lambda^{1/2}.\tag{14}$$

Hence, it follows from our results that the approximation $\widehat{\Sigma}_{\theta}:=\Sigma_{\psi}$ is close enough to $\Sigma_{\theta}$ to provide a practically useful estimate.
### 4.1 Stationary Analysis
To prove our main result [Theorem 4.1](https://arxiv.org/html/2606.00293#S4.Thmtheorem1), we first obtain an exact relationship between the learning rate matrix $\Lambda$, the stationary covariance $\Sigma_{\psi}$, and the average noise $\overline{C}_{\psi}$.
###### Proposition 4.2.
Assuming the iterates $(\psi_{t})_{t\geq 0}$ have a well-defined stationary distribution, the stationary covariance $\Sigma_{\psi}$ satisfies

$$\Lambda\widehat{H}\Sigma_{\psi}+\Sigma_{\psi}\widehat{H}\Lambda=\Lambda\big(\overline{C}_{\psi}+\widehat{H}\Sigma_{\psi}\widehat{H}\big)\Lambda+2\beta^{-1}\Lambda.\tag{15}$$

It follows from [Equation 15](https://arxiv.org/html/2606.00293#S4.E15) that to obtain a solvable relationship between $\Lambda$ and $\Sigma_{\psi}$ using [Proposition 4.2](https://arxiv.org/html/2606.00293#S4.Thmtheorem2), we must compute the expected covariance $\overline{C}_{\psi}$. Such a calculation is feasible when using $\mathcal{N}(0,\Gamma^{-1})$, with $\Gamma\in\mathbb{R}^{D\times D}$ positive-definite, as a prior for $\theta$; that is, using $\mathcal{R}(\theta)=\frac{1}{2}\theta^{\top}\Gamma\theta$.
###### Theorem 4.3.
For the proxy algorithm [Equation 12](https://arxiv.org/html/2606.00293#S4.E12), if $\mathcal{R}(\theta)=\frac{1}{2}\theta^{\top}\Gamma\theta$ and the mini-batches are sampled with replacement, then

$$\overline{C}_{\psi}=\frac{1}{B}\left(\mathcal{I}-\frac{\lVert\Gamma\widehat{\theta}\rVert^{2}}{N^{2}}+\frac{1}{N}\sum_{n=1}^{N}\mathcal{J}_{n}\Sigma_{\psi}\mathcal{J}_{n}-\mathcal{J}\Sigma_{\psi}\mathcal{J}\right),\tag{16}$$

where $\mathcal{I}:=\frac{1}{N}\sum_{n=1}^{N}\nabla\ell_{n}\bigl(\widehat{\theta}\bigr)\nabla\ell_{n}\bigl(\widehat{\theta}\bigr)^{\top}$. If the mini-batches are sampled without replacement, the same result holds but with the right-hand side multiplied by $(N-B)/(N-1)$.

Plugging [Equation 16](https://arxiv.org/html/2606.00293#S4.E16) into [Equation 15](https://arxiv.org/html/2606.00293#S4.E15) provides an exact relationship between $\Lambda$ and $\Sigma_{\psi}$. Hence, given a fixed learning rate matrix $\Lambda$ (or a scalar learning rate $\lambda$), we can, in principle, compute the stationary covariance $\Sigma_{\psi}$ as an estimate for $\Sigma_{\theta}$.
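For a fixed $\Lambda$, the relationship obtained by combining Equations 15 and 16 is linear in $\Sigma_{\psi}$, so one way to compute $\Sigma_{\psi}$ is to vectorize and solve a $D^{2}\times D^{2}$ linear system. The sketch below does this under stated assumptions: `C0` stands for the part of $B\,\overline{C}_{\psi}$ that does not depend on $\Sigma_{\psi}$ (the first two terms inside the parentheses of Equation 16), $\mathcal{J}$ is taken to be the average of the $\mathcal{J}_{n}$, and minibatches are sampled with replacement. The function name and argument layout are hypothetical.

```python
import numpy as np

def stationary_cov(Lam, H, J_list, C0, B, beta=np.inf):
    """Solve Equations 15 and 16 for Sigma_psi given a fixed Lam.

    Lam, H : (D, D) learning rate matrix and Hessian at theta_hat
    J_list : list of (D, D) per-sample Hessians J_n
    C0     : (D, D) Sigma-independent part of B * C_bar (assumption, see lead-in)
    """
    D = H.shape[0]
    I = np.eye(D)
    J_bar = np.mean(J_list, axis=0)

    def op(A, Bm):
        # Matrix of the map X -> A @ X @ Bm acting on the row-major vec of X.
        return np.kron(A, Bm.T)

    lhs = op(Lam @ H, I) + op(I, H @ Lam) - op(Lam @ H, H @ Lam)
    fluct = sum(op(Lam @ J, J @ Lam) for J in J_list) / len(J_list)
    fluct = fluct - op(Lam @ J_bar, J_bar @ Lam)
    lhs = lhs - fluct / B

    temp = 0.0 if np.isinf(beta) else 2.0 / beta
    rhs = Lam @ C0 @ Lam / B + temp * Lam
    Sigma = np.linalg.solve(lhs, rhs.reshape(-1)).reshape(D, D)
    return 0.5 * (Sigma + Sigma.T)   # remove rounding asymmetry
```

The dense Kronecker formulation costs $O(D^{6})$ and is only meant for small $D$; it is a way to check the algebra rather than a scalable solver.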
Our final result in this section improves upon the heuristic mixing time estimate of Negrea et al. ([2023](https://arxiv.org/html/2606.00293#bib.bib53)) (see Appendix [A](https://arxiv.org/html/2606.00293#A1) for more on mixing time). Unlike in the stationary covariance case, our result is identical to theirs except for an additive $-1$.
###### Proposition 4.4.
Consider the proxy update in [Equation 12](https://arxiv.org/html/2606.00293#S4.E12) and suppose that $0<\lambda<2/\mu_{\max}(\widehat{H})$, where $\mu_{\max}(A)$ and $\mu_{\min}(A)$ denote the largest and smallest eigenvalues of a matrix $A$, respectively. Under $L$-smoothness, it simplifies to $0<\lambda<2/L$, which is consistent with the step-size condition used in Dieuleveut et al. ([2020](https://arxiv.org/html/2606.00293#bib.bib11)). Under this condition, the resulting SG(L)D Markov chain admits a unique stationary distribution $\pi_{\theta}$. For each $v\in\mathbb{R}^{D}$, define the projection $f_{v}(\theta):=v^{\top}\theta$ and let

$$\rho_{k,v}:=\mathrm{Corr}_{\pi_{\theta}}\bigl(v^{\top}\theta_{0},v^{\top}\theta_{k}\bigr),\tag{17}$$

and

$$\tau_{\mathrm{int}}(f_{v}):=1+2\sum_{k=1}^{\infty}\rho_{k,v}.\tag{18}$$

Then the worst-case integrated autocorrelation time $\tau:=\sup_{v}\tau_{\mathrm{int}}(f_{v})$ is equal to $2/\mu_{\min}(\Lambda\widehat{H})-1$ iterations.
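The closed form in Proposition 4.4 is straightforward to evaluate numerically. The sketch below computes $2/\mu_{\min}(\Lambda\widehat{H})-1$; the function name `worst_case_iat` is hypothetical, and the code assumes $\Lambda$ and $\widehat{H}$ are symmetric positive-definite so that the eigenvalues of $\Lambda\widehat{H}$ are real and positive.

```python
import numpy as np

def worst_case_iat(Lam, H):
    """Worst-case integrated autocorrelation time, in iterations."""
    # Lam @ H is similar to a symmetric positive-definite matrix, so its
    # eigenvalues are real; .real only discards rounding noise.
    mu = np.linalg.eigvals(Lam @ H).real
    return 2.0 / mu.min() - 1.0

# Hypothetical example: scalar learning rate 0.1, Hessian eigenvalues 1 and 4.
tau = worst_case_iat(0.1 * np.eye(2), np.diag([1.0, 4.0]))
```

As a sanity check on the additive $-1$: in one dimension, if the lag-$k$ autocorrelation decays geometrically as $(1-a)^{k}$ with $a=\lambda\widehat{H}$, then $1+2\sum_{k\geq 1}(1-a)^{k}=2/a-1$.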
### 4.2 Error Analysis
We assess the accuracy of our proxy algorithm by bounding the 2-Wasserstein distance between the distributions of $\theta_{t}$ and $\psi_{t}$. The 2-Wasserstein distance between distributions $\pi$ and $\tilde{\pi}$ is given by

$$W_{2}(\pi,\tilde{\pi})=\inf\mathbb{E}(\lVert\theta-\tilde{\theta}\rVert^{2})^{1/2},\tag{19}$$

where the infimum is over all joint distributions of $(\theta,\tilde{\theta})$ such that $\theta\sim\pi$ and $\tilde{\theta}\sim\tilde{\pi}$. A small Wasserstein distance between distributions implies the covariance and marginal standard deviations are also close. Let $\sigma_{\theta,d}:=\Sigma_{\theta,dd}^{1/2}$ and $\sigma_{\psi,d}:=\Sigma_{\psi,dd}^{1/2}$. Then, by Huggins et al. ([2020](https://arxiv.org/html/2606.00293#bib.bib25), Theorem 3.4), $W_{2}(\pi_{\theta},\pi_{\psi})\leq\varepsilon$ implies that

$$\begin{aligned}&|\sigma_{\theta,d}-\sigma_{\psi,d}|\leq\varepsilon~(d=1,\dots,D)\\ &\lVert\Sigma_{\theta}-\Sigma_{\psi}\rVert\leq 2\varepsilon(\lVert\Sigma_{\theta}\rVert^{1/2}\wedge\lVert\Sigma_{\psi}\rVert^{1/2}+\varepsilon).\end{aligned}\tag{20}$$

Hence, bounding $W_{2}(\pi_{\theta},\pi_{\psi})$ enables us to bound the error of the proxy stationary covariance $\Sigma_{\psi}$.
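To see the implications in Equation 20 on a concrete case, the sketch below uses the standard closed form for the 2-Wasserstein distance between two Gaussians and checks both inequalities. The Gaussian setting is an assumption made purely for illustration (the paper's stationary distributions need not be Gaussian), the norm on covariances is taken to be the spectral norm, and $\wedge$ is read as a minimum applied before $\varepsilon$ is added.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, S1, m2, S2):
    """2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
    root1 = np.real(sqrtm(S1))
    cross = np.real(sqrtm(root1 @ S2 @ root1))
    w2_sq = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return np.sqrt(max(w2_sq, 0.0))   # clip tiny negative rounding error

# Hypothetical covariances, for illustration only.
S_theta = np.diag([1.0, 4.0])
S_psi = np.diag([4.0, 9.0])
eps = w2_gaussian(np.zeros(2), S_theta, np.zeros(2), S_psi)

sd_gap = np.abs(np.sqrt(np.diag(S_theta)) - np.sqrt(np.diag(S_psi))).max()
cov_gap = np.linalg.norm(S_theta - S_psi, 2)
cov_bound = 2.0 * eps * (min(np.linalg.norm(S_theta, 2) ** 0.5,
                             np.linalg.norm(S_psi, 2) ** 0.5) + eps)
```

In this example the standard-deviation gap and the covariance gap both sit below the corresponding right-hand sides, as the result predicts.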
We first give a bound on the Wasserstein distance between the distributions of $\theta_{t}$ and $\psi_{t}$, which we denote by, respectively, $\pi_{\theta,t}$ and $\pi_{\psi,t}$.
###### Theorem 4.5.
If Assumptions [(A)](https://arxiv.org/html/2606.00293#S4.I1.i1)–[(C)](https://arxiv.org/html/2606.00293#S4.I1.i3) hold and $\Lambda=\lambda I_{D}$ for some $\lambda\in(0,1/(2L))$, then, letting $\bar{\beta}:=1-\lambda\mu\left(1-2\lambda L\right)$, $\overline{M^{p}}:=N^{-1}\sum_{n=1}^{N}M_{n}^{p}$ for $p\in\{1,2\}$, and $C_{s}:=\mathbb{E}(\lVert\psi_{s}-\widehat{\theta}\rVert^{4})$, for all $t=1,2,\dots$,

$$\begin{align}&W_{2}^{2}(\pi_{\theta,t},\pi_{\psi,t})\tag{21}\\ &\quad\leq\bar{\beta}^{t}W_{2}^{2}(\theta_{0},\psi_{0})+\lambda\left\{\frac{\lambda\overline{M^{2}}}{2}+\frac{\overline{M}^{2}}{4\mu}\right\}\sum_{s=1}^{t}\bar{\beta}^{t-s}C_{s-1}.\tag{22}\end{align}$$

[Theorem 4.5](https://arxiv.org/html/2606.00293#S4.Thmtheorem5) is quite general, and we conjecture it could be useful beyond our application to bounding the stationary covariance error. Typically we would expect to take $\psi_{0}=\theta_{0}$, in which case the first term on the right-hand side of [Equation 21](https://arxiv.org/html/2606.00293#S4.E21) is zero. We note that [Theorem 4.5](https://arxiv.org/html/2606.00293#S4.Thmtheorem5) is similar in spirit to the 2-Wasserstein bound provided by Jin et al. ([2024](https://arxiv.org/html/2606.00293#bib.bib31)) for a continuous-time Langevin-based proxy algorithm that uses Poissonized data subsampling; however, Jin et al. ([2024](https://arxiv.org/html/2606.00293#bib.bib31)) do not use their proxy algorithm to estimate the stationary covariance of SG(L)D.
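The right-hand side of the bound in Theorem 4.5 is a discounted sum of fourth moments, and can be evaluated directly once those moments (or upper bounds on them) are available. The sketch below is a hypothetical helper: the name `w2_sq_bound`, the argument order, and the convention that `C[i]` holds $C_{i}$ for $i=0,\dots,t-1$ are all assumptions made for illustration.

```python
import numpy as np

def w2_sq_bound(lam, mu, L, M_bar, M2_bar, C, w0_sq=0.0):
    """Evaluate the right-hand side of the bound in Theorem 4.5.

    M_bar, M2_bar : averages of M_n and M_n^2 over the data
    C             : sequence [C_0, ..., C_{t-1}] of fourth moments (or bounds)
    w0_sq         : squared W2 distance between the initial distributions
    """
    if not 0.0 < lam < 1.0 / (2.0 * L):
        raise ValueError("lam must lie in (0, 1/(2L))")
    C = np.asarray(C, dtype=float)
    t = C.shape[0]
    beta_bar = 1.0 - lam * mu * (1.0 - 2.0 * lam * L)
    coef = lam * (lam * M2_bar / 2.0 + M_bar ** 2 / (4.0 * mu))
    weights = beta_bar ** np.arange(t - 1, -1, -1)   # beta_bar^(t-s), s = 1..t
    return beta_bar ** t * w0_sq + coef * np.dot(weights, C)
```

If every $C_{s}$ is bounded by a constant $c$, the discounted sum is at most $c/(1-\bar{\beta})=c/[\lambda\mu(1-2\lambda L)]$, so the bound does not grow with $t$.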
Using [Equation 21](https://arxiv.org/html/2606.00293#S4.E21) to obtain an explicit quantitative bound requires upper-bounding the 4th moment of $\psi_{t}$, which we do in [Lemma F.2](https://arxiv.org/html/2606.00293#A6.Thmtheorem2). The following corollary gives our main error bound, which for simplicity we state for the case of SGD since the SGLD case is qualitatively identical.
###### Corollary 4.6.
Under the same assumptions as [Theorem 4.5](https://arxiv.org/html/2606.00293#S4.Thmtheorem5) and with $\beta=\infty$ (i.e., for the case of SGD), if $\lambda<\min\{B\hat{\mu}/(200L^{2}),1/(4L)\}$, then there exists an explicit constant $A$ given in [Section F.5](https://arxiv.org/html/2606.00293#A6.Ex13) such that $W_{2}(\pi_{\theta},\pi_{\psi})\leq A\,\lambda/B$.
### 4.3 SGLD with Momentum
Our theoretical results extend to the case of SG(L)D with momentum. These extensions are tight, in the sense that we recover our non-momentum results as special cases. Due to space limitations, we defer details to Appendix [B](https://arxiv.org/html/2606.00293#A2).
**Algorithm 1** SG(L)D with Target Covariance Tuning. DQ+exact: discrete quadratic + exact noise (this work). CT: continuous time. DQ+const: discrete quadratic + constant noise. LR+WS: linear regression + well-specified.

**Input:** Dataset $\{x_{n}\}_{n=1}^{N}$, tuning method choice, per-sample loss $\ell(\theta;x)$, offline subsample size $M$, inverse temperature $\beta$, batch size $B$, number of iterations $T$

**Step 1: Offline UQ tuning**

1. Subsample $M$ observations $\{x_{m}^{\prime}\}_{m=1}^{M}\subseteq\{x_{n}\}_{n=1}^{N}$
2. Use subsample to obtain MAP estimate $\hat{\theta}$

   *Estimate sandwich covariance at $\hat{\theta}$ for UQ:*
3. $\widehat{\mathcal{J}}\leftarrow\frac{1}{M}\sum_{m=1}^{M}\nabla^{2}\ell(\hat{\theta};x_{m}^{\prime})$
4. $\widehat{\mathcal{I}}\leftarrow\frac{1}{M}\sum_{m=1}^{M}\nabla\ell(\hat{\theta};x_{m}^{\prime})\nabla\ell(\hat{\theta};x_{m}^{\prime})^{\top}$
5. $\widehat{\mathcal{S}}\leftarrow\widehat{\mathcal{J}}^{-1}\,\widehat{\mathcal{I}}\,\widehat{\mathcal{J}}^{-1}$
6. *Determine best $\Lambda$ using chosen tuning method:*
   - DQ+exact: solve eqs. ([15](https://arxiv.org/html/2606.00293#S4.E15)) and ([16](https://arxiv.org/html/2606.00293#S4.E16)) with $\Sigma_{\psi}=\widehat{\mathcal{S}}$
   - CT: use eq. ([7](https://arxiv.org/html/2606.00293#S3.E7)) with $\Sigma=\widehat{\mathcal{S}}$
   - DQ+const: solve eq. ([9](https://arxiv.org/html/2606.00293#S3.E9)) with $\Sigma_{\psi}=\widehat{\mathcal{S}}$ and $\overline{C}_{\psi}=\widehat{\mathcal{J}}$
   - LR+WS: solve eqs. ([9](https://arxiv.org/html/2606.00293#S3.E9)) and ([10](https://arxiv.org/html/2606.00293#S3.E10)) with $\Sigma_{\psi}=\widehat{\mathcal{S}}$

**Step 2: Preconditioned SG(L)D sampling**

7. Initialize $\theta_{0}\leftarrow\hat{\theta}$ (or any warm start).
8. **for** $t=1$ to $T$ **do**
9. &nbsp;&nbsp;Sample minibatch $\mathcal{B}_{t}\subset\{1,\ldots,N\}$ with $|\mathcal{B}_{t}|=B$.
10. &nbsp;&nbsp;Compute gradient $g_{t}\leftarrow\frac{1}{B}\sum_{n\in\mathcal{B}_{t}}\nabla\ell(\theta_{t-1};x_{n})$
11. &nbsp;&nbsp;Sample update $\theta_{t}\sim\mathcal{N}(\theta_{t-1}-\Lambda\,g_{t},2\,\beta^{-1}\Lambda)$
12. **end for**
13. **return** $\{\theta_{t}\}_{t=1}^{T}$ and $\Lambda$.
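The sandwich covariance estimate in Step 1 of Algorithm 1 can be sketched as follows. For concreteness the example uses a unit-variance least-squares loss $\ell(\theta;(z,y))=\frac{1}{2}(y-\theta^{\top}z)^{2}$, which is an assumption made for illustration; the helper names `least_squares_terms` and `sandwich_covariance` are hypothetical.

```python
import numpy as np

def least_squares_terms(theta, Z, y):
    """Per-sample gradients and Hessians of 0.5 * (y - theta^T z)^2."""
    resid = y - Z @ theta
    grads = -resid[:, None] * Z                  # shape (M, D)
    hessians = Z[:, :, None] * Z[:, None, :]     # shape (M, D, D)
    return grads, hessians

def sandwich_covariance(grads, hessians):
    """Return (S_hat, J_hat, I_hat) with S_hat = J_hat^{-1} I_hat J_hat^{-1}."""
    J_hat = hessians.mean(axis=0)
    I_hat = grads.T @ grads / grads.shape[0]
    # Two linear solves instead of an explicit inverse; J_hat and I_hat are
    # symmetric, so solve(J, solve(J, I).T) equals J^{-1} I J^{-1}.
    S_hat = np.linalg.solve(J_hat, np.linalg.solve(J_hat, I_hat).T)
    return S_hat, J_hat, I_hat
```

In practice `theta` would be the MAP estimate computed on the offline subsample, and `S_hat` is then passed to whichever tuning rule is chosen in the final line of Step 1.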
Table 2: Results for linear regression experiments with simulated data. Calibration error is the Kolmogorov–Smirnov distance to $\mathrm{Unif}(0,1)$. Covariance error is $\lVert\mathcal{S}_{\star}-\hat{\mathcal{S}}\rVert_{F}/\lVert\mathcal{S}_{\star}\rVert_{F}$. Within each metric row and loss block, for a fixed batch size $B$, bold indicates methods whose 95% confidence intervals overlap with the confidence interval of the method with the lowest mean error. Confidence intervals are computed over 30 independent runs and are reported in the full table in Appendix D. Sandwich Gauss is included as the target sandwich Gaussian reference, while NUTS and the exact posterior are included to illustrate the discrepancy between the posterior distribution and the sandwich target. Columns are grouped by loss: log loss and $\beta$-loss ($\beta=1.5$).

| Metric | $B$ | Log loss: Posterior | Log loss: CT | Log loss: LR+WS | Log loss: DQ+exact | $\beta$-loss: NUTS | $\beta$-loss: Sandwich Gauss | $\beta$-loss: CT | $\beta$-loss: LR+WS | $\beta$-loss: DQ+exact |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Calibration error | $16$ | 0.418 | 0.171 | 0.529 | 0.169 | 0.195 | 0.156 | 0.201 | 0.178 | 0.172 |
| Calibration error | $\lfloor 0.1\times N\rfloor$ | 0.418 | 0.179 | 0.517 | 0.174 | 0.195 | 0.156 | 0.196 | 0.177 | 0.190 |
| Covariance error | $16$ | 0.943 | 0.672 | 0.995 | 0.664 | 0.795 | 0.000 | 0.640 | 1.115 | 0.695 |
| Covariance error | $\lfloor 0.1\times N\rfloor$ | 0.943 | 0.975 | 0.996 | 0.672 | 0.799 | 0.000 | 1.006 | 1.322 | 0.748 |

Table 3: Results for linear regression experiments with Boston housing data. See [Table 2](https://arxiv.org/html/2606.00293#S4.T2) caption for further explanation. $\infty$ denotes divergence under this tuning guidance. Columns are grouped by loss: log loss and $\beta$-loss ($\beta=1.5$).

| Metric | $B$ | Log loss: Posterior | Log loss: CT | Log loss: LR+WS | Log loss: DQ+exact | $\beta$-loss: NUTS | $\beta$-loss: Sandwich Gauss | $\beta$-loss: CT | $\beta$-loss: LR+WS | $\beta$-loss: DQ+exact |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Covariance error | $16$ | 0.358 | 0.247 | $9.23\times 10^{8}$ | 0.337 | 2.528 | 0 | 2.054 | $\infty$ | 2.782 |
| Covariance error | $\lfloor 0.1\times N\rfloor$ | 0.358 | 0.589 | $1.40\times 10^{7}$ | 0.352 | 2.528 | 0 | 3.126 | $\infty$ | 1.398 |

Table 4: Results for Poisson regression experiments. See [Table 2](https://arxiv.org/html/2606.00293#S4.T2) caption for further explanation.

| $B$ | Method | Simulated: calib. err. | Simulated: cov. err. | Credit: cov. err. |
| --- | --- | --- | --- | --- |
| $16$ | CT | 0.069 | 0.207 | 0.132 |
| $16$ | DQ+const | 0.646 | 0.672 | 0.982 |
| $16$ | DQ+exact | 0.074 | 0.208 | 0.157 |
| $\lfloor 0.1\times N\rfloor$ | CT | 0.089 | 0.230 | 0.191 |
| $\lfloor 0.1\times N\rfloor$ | DQ+const | 1.376 | 0.991 | 0.997 |
| $\lfloor 0.1\times N\rfloor$ | DQ+exact | 0.075 | 0.211 | 0.154 |
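The two metrics named in the Table 2 caption can be computed as in the sketch below. How the values compared against $\mathrm{Unif}(0,1)$ are produced is not specified in this section, so the input `u` is a hypothetical array of values expected to be uniform under perfect calibration; the function names are likewise hypothetical.

```python
import numpy as np
from scipy.stats import kstest

def covariance_error(S_star, S_hat):
    """Relative Frobenius error between target and estimated covariance."""
    return np.linalg.norm(S_star - S_hat, "fro") / np.linalg.norm(S_star, "fro")

def calibration_error(u):
    """Kolmogorov-Smirnov distance between the sample u and Unif(0, 1)."""
    return kstest(u, "uniform").statistic

# Hypothetical inputs, for illustration only.
cov_err = covariance_error(np.eye(2), 2.0 * np.eye(2))
calib_err = calibration_error(np.linspace(0.05, 0.95, 10))
```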
## 5 A General Procedure for Calibrated SG(L)D Sampling
[Algorithm 1](https://arxiv.org/html/2606.00293#alg1) outlines a practical tuning procedure for SG(L)D uncertainty calibration, which covers all the approaches listed in [Table 1](https://arxiv.org/html/2606.00293#S1.T1). When tuning $\Lambda$ using our proposed approach (DQ+exact), [Algorithm 1](https://arxiv.org/html/2606.00293#alg1) is applicable to the large-sample, low-to-moderate dimensional regime. Its computational complexity is $O(MD^{2}+D^{3})+O\left(T(BD+D^{2})\right)$, where the first term corresponds to a one-time offline cost using a subsample of size $M$, and the second term is the cost of $T$ stochastic gradient iterations with minibatch size $B\ll N$. The $O(D^{3})$ term is incurred only once and becomes negligible when $N\gg D^{3}$. Since the mixing time of tuned SG(L)D is $O(1)$ epochs (equivalently $O(N/B)$ iterations), relative Monte Carlo error $\delta$ is achievable with $T=O(N/[B\delta])$ iterations. Hence, the overall computational complexity is $O\bigl(N[D+D^{2}/B]/\delta\bigr)$. This result also suggests a benefit to using a large batch size $B\gg D$, which reduces the number of preconditioning operations and improves the computational efficiency of SG(L)D.
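The per-iteration cost discussed above comes from one minibatch gradient plus one matrix-vector product with $\Lambda$. A minimal sketch of the sampling loop in Step 2 of Algorithm 1 is given below. The callback `grad_fn(theta, batch)`, which is expected to return the minibatch-averaged gradient, is a hypothetical interface, and drawing minibatches without replacement is an assumption.

```python
import numpy as np

def sgld_sample(grad_fn, theta0, Lam, N, B, T, rng, beta=np.inf):
    """Preconditioned SG(L)D; returns the iterates as a (T, D) array.

    grad_fn : callable (theta, batch_indices) -> (D,) minibatch gradient
    Lam     : (D, D) symmetric positive-definite learning rate matrix
    beta    : inverse temperature; np.inf gives SGD with no injected noise
    """
    D = theta0.shape[0]
    factor = None if np.isinf(beta) else np.linalg.cholesky(2.0 / beta * Lam)
    theta = np.array(theta0, dtype=float)
    samples = np.empty((T, D))
    for t in range(T):
        batch = rng.choice(N, size=B, replace=False)   # assumption: no replacement
        theta = theta - Lam @ grad_fn(theta, batch)    # O(D^2) preconditioning
        if factor is not None:
            theta = theta + factor @ rng.standard_normal(D)
        samples[t] = theta
    return samples
```

For a generalized linear model the gradient call costs $O(BD)$, so each iteration matches the $O(BD+D^{2})$ figure quoted in the text.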
The tuning procedure also requires $O(D^{2})$ memory to store and manipulate quantities such as the Hessian $\hat{\mathcal{J}}$ and the Fisher information $\hat{\mathcal{I}}$, making accurate computation challenging in very high-dimensional settings. A promising future direction is to develop structured, low-rank, diagonal, or trajectory-based approximations of $\hat{\mathcal{J}}$ or $\hat{\mathcal{I}}$ to improve scalability.
## 6 Experiments
We compare the accuracy of the learning rate tuning guidance provided by our theory versus previous work (see [Table 1](https://arxiv.org/html/2606.00293#S1.T1)). For fair comparison, we follow [Algorithm 1](https://arxiv.org/html/2606.00293#alg1), with the only difference across approaches being how $\Lambda$ is determined. In our experiments, we use SGD (so $\beta=\infty$). The code for all experiments is publicly available at [https://github.com/wangyu1369/large-sample-sgmcmc-uq](https://github.com/wangyu1369/large-sample-sgmcmc-uq).
In our experiments we compute $\Lambda$ using a numerical optimization procedure since obtaining a closed-form solution is challenging (Hammarling, [1982](https://arxiv.org/html/2606.00293#bib.bib19); Ye et al., [1998](https://arxiv.org/html/2606.00293#bib.bib76)). Specifically, we substitute the stationary noise expression from [Equation 16](https://arxiv.org/html/2606.00293#S4.E16) into the stationary covariance equation [Equation 15](https://arxiv.org/html/2606.00293#S4.E15) and set $\Sigma_{\psi}=\widehat{\mathcal{S}}$. This yields a matrix equation of the form $F(\Lambda)=0$, where $\Lambda$ is the only unknown. We solve this system numerically by vectorizing $\Lambda$ and applying `scipy.optimize.root` with the Powell hybrid method. While our approach requires solving [Equations 15](https://arxiv.org/html/2606.00293#S4.E15) and [16](https://arxiv.org/html/2606.00293#S4.E16) jointly, this cost is incurred only once per problem and is negligible compared to the dominant cost of running SG(L)D trajectories. We empirically verify this in [Section D.2](https://arxiv.org/html/2606.00293#A4.SS2).
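The root-finding step described above can be sketched as follows. This is a minimal sketch rather than the released code: the helper names, the argument `C0` (the part of $B\,\overline{C}_{\psi}$ in Equation 16 that does not depend on $\Sigma_{\psi}$), the use of sampling with replacement, and the choice of starting point `Lam0` are all assumptions. The call `scipy.optimize.root(..., method="hybr")` selects SciPy's Powell hybrid solver.

```python
import numpy as np
from scipy.optimize import root

def noise_cov(Sigma, J_list, C0, B):
    """Average noise covariance of Equation 16 for a given Sigma."""
    J_bar = np.mean(J_list, axis=0)
    fluct = np.mean([J @ Sigma @ J for J in J_list], axis=0) - J_bar @ Sigma @ J_bar
    return (C0 + fluct) / B

def residual(Lam, Sigma, H, C_bar, beta=np.inf):
    """F(Lam): left-hand side minus right-hand side of Equation 15."""
    temp = 0.0 if np.isinf(beta) else 2.0 / beta
    return (Lam @ H @ Sigma + Sigma @ H @ Lam
            - Lam @ (C_bar + H @ Sigma @ H) @ Lam - temp * Lam)

def solve_learning_rate(Sigma, H, C_bar, Lam0, beta=np.inf):
    """Solve F(Lam) = 0 for Lam by vectorizing it; returns (Lam, success)."""
    D = H.shape[0]

    def fun(x):
        return residual(x.reshape(D, D), Sigma, H, C_bar, beta).reshape(-1)

    sol = root(fun, Lam0.reshape(-1), method="hybr")
    return sol.x.reshape(D, D), sol.success
```

Because $\Sigma_{\psi}$ is fixed at the target, `C_bar` is a known matrix and $F$ is quadratic in $\Lambda$. Note that $\Lambda=0$ is always a root when $\beta=\infty$, so the starting point should be a nonzero guess such as a continuous-time tuning.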
### 6.1 Robust Linear Regression
While standard Bayesian inference and maximum likelihood estimation optimize the KL divergence, the resulting log loss is notoriously sensitive to outliers and misspecification, allowing a single atypical datapoint to dominate the gradient. The $\beta$-divergence provides a robust alternative by downweighting low-probability observations through a tunable power parameter $\beta$, effectively controlling heavy-tailed effects (Ghosh & Basu, [2016](https://arxiv.org/html/2606.00293#bib.bib16); Jewson et al., [2018](https://arxiv.org/html/2606.00293#bib.bib29), [2024](https://arxiv.org/html/2606.00293#bib.bib30)). We consider a regression setting with observations $x=(y,z)$, where $y$ denotes the response and $z$ the covariates. Then the $\beta$-divergence loss is defined as $\ell(\theta;x)=-\frac{1}{\beta-1}f(y;\theta,z)^{\beta-1}+\frac{1}{\beta}\int f\left(y^{\prime};\theta,z\right)^{\beta}\mathrm{d}y^{\prime}$, where $f(y;\theta,z)$ denotes the likelihood. For a detailed discussion of tuning guidance under the $\beta$-divergence loss, see [Section D.3](https://arxiv.org/html/2606.00293#A4.SS3).
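For a Gaussian linear regression likelihood the integral in the $\beta$-divergence loss has a closed form, $\int f^{\beta}\,\mathrm{d}y^{\prime}=(2\pi\sigma^{2})^{(1-\beta)/2}\beta^{-1/2}$, which follows from the Gaussian normalizing constant. The sketch below evaluates the loss under that assumption; the fixed, known noise variance `sigma2` and the function name are hypothetical, and the argument is called `beta_div` to avoid confusion with the inverse temperature.

```python
import numpy as np

def beta_loss_gaussian(y, z, theta, sigma2, beta_div):
    """Beta-divergence loss for y | z ~ N(z^T theta, sigma2), with beta_div > 1."""
    mean = z @ theta
    dens = np.exp(-0.5 * (y - mean) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)
    # Closed form of the integral of the density raised to the power beta_div.
    integral = (2.0 * np.pi * sigma2) ** (0.5 * (1.0 - beta_div)) / np.sqrt(beta_div)
    return -dens ** (beta_div - 1.0) / (beta_div - 1.0) + integral / beta_div

# Hypothetical example with the beta value used in the experiments.
theta = np.array([1.0, -1.0])
z = np.array([0.5, 0.25])
typical = beta_loss_gaussian(0.3, z, theta, sigma2=2.0, beta_div=1.5)
outlier = beta_loss_gaussian(50.0, z, theta, sigma2=2.0, beta_div=1.5)
```

As the residual grows the density term vanishes, so the loss of an outlier approaches the constant `integral / beta_div` instead of growing quadratically as the log loss does.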
##### *Simulated misspecified data with outliers.*
First, we consider a misspecified linear regression model with heteroskedastic errors. Data $\{(x_{n},y_{n})\}_{n=1}^{N}$ are generated according to $y_{n}\mid x_{n}\sim\mathcal{N}\left(x_{n}^{\top}\theta_{\star},\;1+\lVert x_{n}\rVert_{2}^{2}\right)$, where the true parameter $\theta_{\star}\sim\mathcal{N}(0,I_{D})$ is fixed throughout the experiment, and the covariates are drawn independently as $x_{n}\sim\mathcal{N}(0,I_{D})$.
To model heavy-tailed contamination, a fraction $p\in[0,1]$ of samples is selected uniformly at random and replaced by outliers. For these contaminated observations, responses are generated as $y_{n}\mid x_{n}\sim\mathcal{N}\left(x_{n}^{\top}\theta_{\star}+b,\;s^{2}\bigl(1+\lVert x_{n}\rVert_{2}^{2}\bigr)\right)$, where $s>1$ controls variance inflation and $b$ introduces a mean shift. Unless otherwise specified, we set $D=50$, $N=5000$, $p=0.01$, $b=5.0$, and $s=5.0$.
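The data-generating process just described can be sketched as below. The sketch assumes the second argument of $\mathcal{N}(\cdot,\cdot)$ is a variance and that the number of contaminated observations is $p N$ rounded to the nearest integer; the function name, the seed handling, and the returned outlier mask are hypothetical additions for illustration.

```python
import numpy as np

def simulate_contaminated_regression(N=5000, D=50, p=0.01, b=5.0, s=5.0, seed=0):
    """Heteroskedastic linear regression data with a fraction p of outliers."""
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(D)
    X = rng.standard_normal((N, D))
    var = 1.0 + np.sum(X ** 2, axis=1)            # variance 1 + ||x_n||^2
    y = X @ theta_star + np.sqrt(var) * rng.standard_normal(N)

    n_out = int(round(p * N))                     # assumption: rounded count
    out_idx = rng.choice(N, size=n_out, replace=False)
    # Outliers: mean shifted by b, standard deviation inflated by s.
    y[out_idx] = (X[out_idx] @ theta_star + b
                  + s * np.sqrt(var[out_idx]) * rng.standard_normal(n_out))

    is_outlier = np.zeros(N, dtype=bool)
    is_outlier[out_idx] = True
    return X, y, theta_star, is_outlier
```

With the default settings this produces 50 contaminated responses out of 5000.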
As shown in [Table 2](https://arxiv.org/html/2606.00293#S4.T2), LR+WS performs poorly under log loss due to its reliance on well-specified model assumptions. Replacing the log loss with the $\beta$-divergence substantially improves its quantile calibration. However, comparable calibration does not imply accurate uncertainty quantification: existing tunings (CT and LR+WS) exhibit significant covariance mismatch, particularly for large batch sizes. In contrast, our tuning consistently yields a stationary covariance closest to the target sandwich covariance $\mathcal{S}_{\star}$, with the largest gains observed in the large batch size regime.
##### *Boston housing data* (Harrison & Rubinfeld, [1978](https://arxiv.org/html/2606.00293#bib.bib21))
We next consider the Boston housing dataset, where the linear model is strongly misspecified\. In this setting, LR\+WS becomes unstable and produces extremely large covariance errors\. These results corroborate the simulated experiments and highlight the importance of our tuning guidance for accurate uncertainty quantification in misspecified and large\-batch regimes\.
### 6.2 Poisson Regression
Finally, we demonstrate how our theory provides tuning guidance for the more challenging case of Poisson regression.
##### *Simulated data.*
We first consider simulated data generated from the assumed model $y_{n}\sim\text{Poisson}(\exp\{x_{n}^{\top}\theta_{\star}\})$, where $\theta_{\star}\sim\mathcal{N}(0,I_{D})$ and $x_{n}\overset{\text{iid}}{\sim}\mathcal{N}(0,I_{D})$. We use a sample size of $N=5{,}000$ and set dimension $D=50$.
##### *Credit data* (Hofmann, [1994](https://arxiv.org/html/2606.00293#bib.bib24))
The German credit data contains data on $D=20$ variables and the credibility of $N=1{,}000$ loan applicants.
For both, we consider batch sizes $B\in\{16,\lfloor 0.1\times N\rfloor\}$. The results in [Table 4](https://arxiv.org/html/2606.00293#S4.T4) show that while continuous-time tuning remains competitive for small batch sizes, its accuracy degrades substantially as the batch size increases, leading to larger covariance and calibration errors. In contrast, the tuning guidance derived from our discrete-time theory consistently yields more accurate uncertainty quantification in the large-batch regime. Existing discrete-time approaches based on quadratic objectives and constant noise suffer from severe miscalibration once these restrictive assumptions are violated. The improved calibration and covariance accuracy demonstrate that our method provides reliable UQ beyond the regimes where continuous-time or constant-noise approximations apply.
### 6.3 Neural Network
We further compare the stationary covariance predicted by the different theories in [Table 1](https://arxiv.org/html/2606.00293#S1.T1) on a real-world neural network task. Specifically, we fit a two-hidden-layer $\tanh$ neural network with hidden widths $(2,3)$ on the *Diabetes* dataset (Efron et al., [2004](https://arxiv.org/html/2606.00293#bib.bib12)).
As we can see from [Figure 2](https://arxiv.org/html/2606.00293#S6.F2), at small learning rates, all methods are comparable. However, as the learning rate increases, continuous-time approximations become quantitatively inaccurate, while our discrete-time method remains accurate. While our non-asymptotic error analysis assumes convexity, the characterization of the stationary covariance $\Sigma_{\psi}$ ([Proposition 4.2](https://arxiv.org/html/2606.00293#S4.Thmtheorem2)) and minibatch noise $\overline{C}_{\psi}$ ([Theorem 4.3](https://arxiv.org/html/2606.00293#S4.Thmtheorem3)) does not rely on this assumption. Empirically, we observe that our discrete-time proxy remains effective for neural networks, suggesting that the approach can potentially extend beyond the convex setting and motivating future work on non-convex analysis.
Figure 2: Covariance prediction error for a neural network with hidden widths $(2,3)$ on the *Diabetes* dataset. The error is measured as $\lVert\Sigma_{\psi}-\Sigma_{\theta}\rVert_{F}/\lVert\Sigma_{\theta}\rVert_{F}$, where $\Sigma_{\theta}$ is the empirical stationary covariance estimated from SGD tail iterates and $\Sigma_{\psi}$ is the covariance predicted by each theory. Shaded regions denote $95\%$ confidence intervals for the mean across 30 independent repetitions.
## 7 Conclusion
We study uncertainty quantification for stochastic gradient methods from a discrete-time perspective. Our results show that accurate characterization of SGD and SGLD stationary behavior requires moving beyond continuous-time approximations, particularly at large batch sizes and non-vanishing learning rates. By explicitly modeling the stationary covariance and minibatch-induced noise structure, our framework provides principled and practical tuning strategies for SGD and SGLD under both well-specified and misspecified settings. Empirically, we demonstrate improved covariance estimation and calibration across synthetic and real-world tasks when using [Algorithm 1](https://arxiv.org/html/2606.00293#alg1).
A limitation of our finite-sample analysis is that it relies on strong convexity assumptions. Empirically, we observe that the proposed discrete-time proxy remains effective for neural networks ([Section 6.3](https://arxiv.org/html/2606.00293#S6.SS3)), suggesting that the approach may extend beyond convex settings. Developing finite-sample guarantees for non-convex models remains an important direction for future work. Another natural extension is to characterize more precisely the regimes of learning rate and batch size in which continuous-time approximations fail, and to identify sharp thresholds at which discrete-time effects dominate stationary behavior.
## Acknowledgments
Y\. Wang and J\. H\. Huggins were partially supported by National Science Foundation CAREER award IIS\-2340586\.
## Impact Statement
This paper presents work whose goal is to advance the field of probabilistic machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
## References
- Ahn et al. (2012) Ahn, S., Korattikara, A., and Welling, M. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring. In Langford, J. and Pineau, J. (eds.), *Proceedings of the 29th International Conference on Machine Learning (ICML-12)*, ICML ’12, pp. 1591–1598, New York, NY, USA, July 2012. Omnipress. ISBN 978-1-4503-1285-1.
- Aicher et al. (2025) Aicher, C., Putcha, S., Nemeth, C., Fearnhead, P., and Fox, E. Stochastic Gradient MCMC for Nonlinear State Space Models. *Bayesian Analysis*, 20(1):83–105, 2025.
- Akyildiz & Sabanis (2024) Akyildiz, O. D. and Sabanis, S. Nonasymptotic analysis of Stochastic Gradient Hamiltonian Monte Carlo under local conditions for nonconvex optimization. *Journal of Machine Learning Research*, 25(113):1–34, 2024.
- Alexos et al. (2022) Alexos, A., Boyd, A. J., and Mandt, S. Structured stochastic gradient MCMC. In *International Conference on Machine Learning*, pp. 414–434. PMLR, 2022.
- Bissiri et al. (2016) Bissiri, P. G., Holmes, C. C., and Walker, S. G. A general framework for updating belief distributions. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, 78(5):1103–1130, 2016. doi:10.1111/rssb.12158.
- Bottou (2010) Bottou, L. Large-scale machine learning with stochastic gradient descent. In *Proceedings of COMPSTAT’2010: 19th International Conference on Computational Statistics, Paris, France, August 22-27, 2010 Keynote, Invited and Contributed Papers*, pp. 177–186. Springer, 2010.
- Brosse et al. (2018) Brosse, N., Durmus, A., and Moulines, E. The promises and pitfalls of stochastic gradient Langevin dynamics. In *Advances in Neural Information Processing Systems*, 2018.
- Chang et al. (2017) Chang, H.-S., Learned-Miller, E., and McCallum, A. Active bias: Training more accurate neural networks by emphasizing high variance samples. *Advances in Neural Information Processing Systems*, 30, 2017.
- Coullon et al. (2023) Coullon, J., South, L., and Nemeth, C. Efficient and generalizable tuning strategies for stochastic gradient MCMC. *Statistics and Computing*, 33(3):66, 2023. ISSN 0960-3174. doi:10.1007/s11222-023-10233-3.
- Dalalyan (2017) Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. *Journal of the Royal Statistical Society Series B: Statistical Methodology*, 79(3):651–676, 2017. doi:10.1111/rssb.12183.
- Dieuleveut et al. (2020) Dieuleveut, A., Durmus, A., and Bach, F. Bridging the gap between constant step size stochastic gradient descent and Markov chains. *Annals of Statistics*, 48(3):1348–1382, 2020. doi:10.1214/19-AOS1850.
- Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. *The Annals of Statistics*, 32(2):407–499, 2004. doi:10.1214/009053604000000067.
- Gardiner (1985) Gardiner, C. W. Handbook of stochastic methods for physics, chemistry and the natural sciences. *Springer series in synergetics*, 1985.
- Gelman et al. (1995) Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. *Bayesian data analysis*. Chapman and Hall/CRC, 1995.
- Geyer (1992) Geyer, C. J. Practical Markov Chain Monte Carlo. *Statistical Science*, 7(4):473–483, 1992. doi:10.1214/ss/1177011137.
- Ghosh & Basu (2016) Ghosh, A. and Basu, A. Robust Bayes estimation using the density power divergence. *Annals of the Institute of Statistical Mathematics*, 68(2):413–437, 2016. ISSN 0020-3157. doi:10.1007/s10463-014-0499-0.
- Goodfellow et al\. \(2016\)Goodfellow, I\., Bengio, Y\., and Courville, A\.*Deep Learning*\.MIT Press, 2016\.
- Goyal et al\. \(2017\)Goyal, P\., Dollár, P\., Girshick, R\., Noordhuis, P\., Wesolowski, L\., Kyrola, A\., Tulloch, A\., Jia, Y\., and He, K\.Accurate, large minibatch SGD: Training ImageNet in 1 hour\.*arXiv preprint arXiv:1706\.02677*, 2017\.
- Hammarling \(1982\)Hammarling, S\. J\.Numerical solution of the stable, non\-negative definite Lyapunov equation\.*IMA Journal of Numerical Analysis*, 2\(3\):303–323, 1982\.doi:10\.1093/imanum/2\.3\.303\.
- Hardt et al\. \(2016\)Hardt, M\., Recht, B\., and Singer, Y\.Train faster, generalize better: Stability of stochastic gradient descent\.In*International Conference on Machine Learning*, pp\. 1225–1234\. PMLR, 2016\.
- Harrison & Rubinfeld \(1978\)Harrison, D\. and Rubinfeld, D\. L\.Hedonic housing prices and the demand for clean air\.*Journal of Environmental Economics and Management*, 5\(1\):81–102, 1978\.doi:10\.1016/0095\-0696\(78\)90006\-2\.
- Hodgkinson & Mahoney \(2021\)Hodgkinson, L\. and Mahoney, M\.Multiplicative noise and heavy tails in stochastic optimization\.In*International Conference on Machine Learning*, pp\. 4262–4274\. PMLR, 2021\.
- Hoffer et al\. \(2017\)Hoffer, E\., Hubara, I\., and Soudry, D\.Train longer, generalize better: closing the generalization gap in large batch training of neural networks\.*Advances in Neural Information Processing Systems*, 30, 2017\.
- Hofmann \(1994\)Hofmann, H\.Statlog \(German Credit Data\)\.UCI Machine Learning Repository, 1994\.
- Huggins et al\. \(2020\)Huggins, J\., Kasprzak, M\., Campbell, T\., and Broderick, T\.Validated variational inference via practical posterior error bounds\.In*International Conference on Artificial Intelligence and Statistics*, pp\. 1792–1802\. PMLR, 2020\.
- Huggins & Miller \(2024\)Huggins, J\. H\. and Miller, J\. W\.Reproducible parameter inference using bagged posteriors\.*Electronic Journal of Statistics*, 18\(1\), 2024\.ISSN 1935\-7524\.doi:10\.1214/24\-ejs2237\.
- Hwang et al\. \(2022\)Hwang, S\., Choi, J\., and Choi, J\.Uncertainty\-Based Selective Clustering for Active Learning\.*IEEE Access*, 10:110983–110991, 2022\.doi:10\.1109/ACCESS\.2022\.3216065\.
- Jantre et al\. \(2024\)Jantre, S\., Urban, N\. M\., Qian, X\., and Yoon, B\.\-J\.Learning Active Subspaces for Effective and Scalable Uncertainty Quantification in Deep Neural Networks\.In*ICASSP 2024 \- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\)*, pp\. 5330–5334, 2024\.doi:10\.1109/ICASSP48485\.2024\.10448265\.
- Jewson et al\. \(2018\)Jewson, J\., Smith, J\. Q\., and Holmes, C\.Principles of Bayesian Inference Using General Divergence Criteria\.*Entropy*, 20\(6\):442, 2018\.doi:10\.3390/e20060442\.
- Jewson et al\. \(2024\)Jewson, J\., Smith, J\. Q\., and Holmes, C\.On the Stability of General Bayesian Inference\.*Bayesian Analysis*, pp\. 1 – 31, 2024\.doi:10\.1214/24\-BA1502\.
- Jin et al\. \(2024\)Jin, K\., Liu, C\., and Latz, J\.Subsampling Error in Stochastic Gradient Langevin Diffusions\.In*International Conference on Artificial Intelligence and Statistics*, pp\. 1414–1422\. PMLR, 2024\.
- Jones \(2004\)Jones, G\. L\.On the Markov chain central limit theorem\.*Probability Surveys*, 1:299 – 320, 2004\.doi:10\.1214/154957804100000051\.
- Keskar et al\. \(2017\)Keskar, N\. S\., Mudigere, D\., Nocedal, J\., Smelyanskiy, M\., and Tang, P\. T\. P\.On Large\-Batch Training for Deep Learning: Generalization Gap and Sharp Minima\.In*International Conference on Learning Representations*, 2017\.
- Kim et al\. \(2024\)Kim, S\., Jung, S\., Kim, S\., and Lee, J\.Learning to Explore for Stochastic Gradient MCMC\.In*Proceedings of the 41st International Conference on Machine Learning*, ICML’24\. JMLR\.org, 2024\.
- Kleijn & van der Vaart \(2012\)Kleijn, B\. and van der Vaart, A\.The Bernstein\-Von\-Mises theorem under misspecification\.*Electronic Journal of Statistics*, 6:354–381, 2012\.doi:10\.1214/12\-EJS675\.
- Kushner & Yin \(2003\)Kushner, H\. and Yin, G\. G\.*Stochastic approximation and recursive algorithms and applications*\.Springer, 2003\.doi:10\.1007/b97441\.
- Kushner & Huang \(1981\)Kushner, H\. J\. and Huang, H\.Asymptotic properties of stochastic approximations with constant coefficients\.*SIAM Journal on Control and Optimization*, 19\(1\):87–105, 1981\.doi:10\.1137/0319007\.
- Kushner & Yang \(1993\)Kushner, H\. J\. and Yang, J\.Stochastic Approximation with Averaging of the Iterates: Optimal Asymptotic Rate of Convergence for General Processes\.*SIAM Journal on Control and Optimization*, 31\(4\):1045–1062, 1993\.ISSN 0363\-0129\.doi:10\.1137/0331047\.
- Lewkowycz et al\. \(2020\)Lewkowycz, A\., Bahri, Y\., Dyer, E\., Sohl\-Dickstein, J\., and Gur\-Ari, G\.The large learning rate phase of deep learning: the catapult mechanism\.*arXiv preprint arXiv:2003\.02218*, 2020\.
- Li et al\. \(2016\)Li, C\., Chen, C\., Carlson, D\., and Carin, L\.Preconditioned stochastic gradient langevin dynamics for deep neural networks\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 30, 2016\.doi:10\.1609/aaai\.v30i1\.10200\.
- Li et al\. \(2017\)Li, Q\., Tai, C\., and E, W\.Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms\.In Precup, D\. and Teh, Y\. W\. \(eds\.\),*Proceedings of the 34th International Conference on Machine Learning*, volume 70 of*Proceedings of Machine Learning Research*, pp\. 2101–2110\. PMLR, 06–11 Aug 2017\.
- Li et al\. \(2019\)Li, Q\., Tai, C\., and E, W\.Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations\.*Journal of Machine Learning Research*, 20\(40\):1–47, 2019\.
- Liu et al\. \(2021\)Liu, K\., Ziyin, L\., and Ueda, M\.Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent\.In*Proceedings of the 38th International Conference on Machine Learning*, volume 139 of*Proceedings of Machine Learning Research*, pp\. 7045–7056\. PMLR, 18–24 Jul 2021\.
- Lyle et al\. \(2020\)Lyle, C\., Schut, L\., Ru, R\., Gal, Y\., and van der Wilk, M\.A Bayesian Perspective on Training Speed and Model Selection\.In*Advances in Neural Information Processing Systems*, volume 33, pp\. 10396–10408, 2020\.
- MacKay \(1992\)MacKay, D\. J\.A practical Bayesian framework for backpropagation networks\.*Neural Computation*, 4\(3\):448–472, 1992\.
- Mandt et al\. \(2017\)Mandt, S\., Hoffman, M\. D\., and Blei, D\. M\.Stochastic Gradient Descent as Approximate Bayesian Inference\.*Journal of Machine Learning Research*, 18\(134\):1–35, 2017\.
- Mauri & Zanella \(2024\)Mauri, L\. and Zanella, G\.Robust Approximate Sampling via Stochastic Gradient Barker Dynamics\.In Dasgupta, S\., Mandt, S\., and Li, Y\. \(eds\.\),*Proceedings of The 27th International Conference on Artificial Intelligence and Statistics*, volume 238 of*Proceedings of Machine Learning Research*, pp\. 2107–2115\. PMLR, 02–04 May 2024\.
- Meng et al\. \(2020\)Meng, Q\., Gong, S\., Chen, W\., Ma, Z\.\-M\., and Liu, T\.\-Y\.Dynamic of stochastic gradient descent with state\-dependent noise\.*arXiv preprint arXiv:2006\.13719*, 2020\.
- Merad & Gaïffas \(2025\)Merad, I\. and Gaïffas, S\.Convergence and concentration properties of constant step\-size SGD through Markov chains\.*Electronic Journal of Statistics*, 19\(2\):5843 – 5894, 2025\.doi:10\.1214/25\-EJS2471\.
- Mori & Ueda \(2020\)Mori, T\. and Ueda, M\.Improved generalization by noise enhancement\.*arXiv preprint arXiv:2009\.13094*, 2020\.
- Mori et al\. \(2022\)Mori, T\., Ziyin, L\., Liu, K\., and Ueda, M\.Power\-law escape rate of SGD\.In*International Conference on Machine Learning*, pp\. 15959–15975\. PMLR, 2022\.
- Moulines & Bach \(2011\)Moulines, E\. and Bach, F\.Non\-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning\.In*Advances in Neural Information Processing Systems*, volume 24, 2011\.
- Negrea et al\. \(2023\)Negrea, J\., Yang, J\., Feng, H\., Roy, D\. M\., and Huggins, J\. H\.Tuning stochastic gradient algorithms for statistical inference via large\-sample asymptotics, 2023\.arXiv preprint arXiv:2207\.12395\.
- Nemeth & Fearnhead \(2021\)Nemeth, C\. and Fearnhead, P\.Stochastic Gradient Markov Chain Monte Carlo\.*Journal of the American Statistical Association*, 116\(533\):433–450, 2021\.doi:10\.1080/01621459\.2020\.1847120\.
- Paulin et al\. \(2025\)Paulin, D\., Whalley, P\. A\., Chada, N\. K\., and Leimkuhler, B\. J\.Sampling from bayesian neural network posteriors with symmetric minibatch splitting langevin dynamics\.In*Proceedings of The 28th International Conference on Artificial Intelligence and Statistics*, volume 258 of*Proceedings of Machine Learning Research*, pp\. 5014–5022\. PMLR, 03–05 May 2025\.
- Pflug \(1986\)Pflug, G\. C\.Stochastic minimization with constant step\-size: asymptotic laws\.*SIAM Journal on Control and Optimization*, 24\(4\):655–666, 1986\.doi:10\.1137/0324039\.
- Raginsky et al\. \(2017\)Raginsky, M\., Rakhlin, A\., and Telgarsky, M\.Non\-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis\.In*Proceedings of the 2017 Conference on Learning Theory*, volume 65 of*Proceedings of Machine Learning Research*, pp\. 1674–1703\. PMLR, 07–10 Jul 2017\.
- Rajpal et al\. \(2025\)Rajpal, R\., Leimkuhler, B\., and Jiang, Y\.Adaptive Stepsizing for Stochastic Gradient Langevin Dynamics in Bayesian Neural Networks\.*arXiv preprint arXiv:2511\.11666*, 2025\.
- Rissanen \(1983\)Rissanen, J\.A Universal Prior for Integers and Estimation by Minimum Description Length\.*The Annals of Statistics*, 11\(2\):416 – 431, 1983\.doi:10\.1214/aos/1176346150\.
- Roberts & Rosenthal \(1998\)Roberts, G\. O\. and Rosenthal, J\. S\.Optimal Scaling of Discrete Approximations to Langevin Diffusions\.*Journal of the Royal Statistical Society Series B: Statistical Methodology*, 60\(1\):255–268, 01 1998\.ISSN 1369\-7412\.doi:10\.1111/1467\-9868\.00123\.
- Roberts & Rosenthal \(2001\)Roberts, G\. O\. and Rosenthal, J\. S\.Optimal scaling for various Metropolis\-Hastings algorithms\.*Statistical Science*, 16\(4\):351 – 367, 2001\.doi:10\.1214/ss/1015346320\.
- Simsekli et al\. \(2019\)Simsekli, U\., Sagun, L\., and Gurbuzbalaban, M\.A tail\-index analysis of stochastic gradient noise in deep neural networks\.In Chaudhuri, K\. and Salakhutdinov, R\. \(eds\.\),*Proceedings of the 36th International Conference on Machine Learning*, volume 97 of*Proceedings of Machine Learning Research*, pp\. 5827–5837\. PMLR, 09–15 Jun 2019\.
- Simsekli et al\. \(2020\)Simsekli, U\., Sener, O\., Deligiannidis, G\., and Erdogdu, M\. A\.Hausdorff dimension, heavy tails, and generalization in neural networks\.*Advances in Neural Information Processing Systems*, 33:5138–5151, 2020\.
- Sokal \(1997\)Sokal, A\.*Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms*, pp\. 131–192\.Springer US, Boston, MA, 1997\.doi:10\.1007/978\-1\-4899\-0319\-8\_6\.
- Teh et al\. \(2016\)Teh, Y\. W\., Thiery, A\. H\., and Vollmer, S\. J\.Consistency and Fluctuations For Stochastic Gradient Langevin Dynamics\.*Journal of Machine Learning Research*, 17\(7\):1–33, 2016\.
- Toulis et al\. \(2014\)Toulis, P\., Airoldi, E\., and Rennie, J\.Statistical analysis of stochastic gradient methods for generalized linear models\.In*Proceedings of the 31st International Conference on Machine Learning*, volume 32 of*Proceedings of Machine Learning Research*, pp\. 667–675, Beijing, China, 22–24 Jun 2014\. PMLR\.
- Van der Vaart \(2000\)Van der Vaart, A\. W\.*Asymptotic statistics*, volume 3\.Cambridge University Press, 2000\.
- Vershynin \(2018\)Vershynin, R\.*Random Vectors in High Dimensions*, pp\. 38–69\.Cambridge Series in Statistical and Probabilistic Mathematics\. Cambridge University Press, 2018\.
- Vollmer et al\. \(2016\)Vollmer, S\. J\., Zygalakis, K\. C\., and Teh, Y\. W\.Exploration of the \(Non\-\)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics\.*Journal of Machine Learning Research*, 17\(159\):1–48, 2016\.
- Walk \(1977\)Walk, H\.An invariance principle for the Robbins\-Monro process in a Hilbert space\.*Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete*, 39\(2\):135–150, 1977\.
- Wang & Huggins \(2026\)Wang, X\. and Huggins, J\. H\.Large\-scale Uncertainty Quantification for Latent Variable Models Using Subsampling Markov Chain Monte Carlo\.In*International Conference on Machine Learning*, PMLR, 2026\.
- Wang et al\. \(2025\)Wang, X\., Kasprzak, M\. J\., Negrea, J\., Bourguin, S\., and Huggins, J\. H\.Quantitative Error Bounds for Scaling Limits of Stochastic Iterative Algorithms\.*arXiv*, 2025\.doi:10\.48550/arxiv\.2501\.12212\.
- Welling & Teh \(2011\)Welling, M\. and Teh, Y\. W\.Bayesian learning via stochastic gradient Langevin dynamics\.In*Proceedings of the 28th International Conference on Machine Learning \(ICML\-11\)*, pp\. 681–688, 2011\.
- White \(1982\)White, H\.Maximum likelihood estimation of misspecified models\.*Econometrica*, 50\(1\):1–25, January 1982\.doi:10\.2307/1912526\.
- Wibisono \(2018\)Wibisono, A\.Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem\.In Bubeck, S\., Perchet, V\., and Rigollet, P\. \(eds\.\),*Proceedings of the 31st Conference on Learning Theory*, volume 75 of*Proceedings of Machine Learning Research*, pp\. 2093–3027\. PMLR, 2018\.
- Ye et al\. \(1998\)Ye, H\., Michel, A\. N\., and Hou, L\.Stability theory for hybrid dynamical systems\.*IEEE Transactions on Automatic Control*, 43\(4\):461–474, 1998\.
- Zhu et al\. \(2019\)Zhu, Z\., Wu, J\., Yu, B\., Wu, L\., and Ma, J\.The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects\.In*Proceedings of the 36th International Conference on Machine Learning*, volume 97 of*Proceedings of Machine Learning Research*, pp\. 7654–7663\. PMLR, 09–15 Jun 2019\.
- Ziyin et al\. \(2022\)Ziyin, L\., Liu, K\., Mori, T\., and Ueda, M\.Strength of Minibatch Noise in SGD\.In*International Conference on Learning Representations*, 2022\.
## Appendix A Discussions on Mixing Speed
Beyond matching a desired stationary covariance, practical uncertainty quantification also requires that the SG\(L\)D Markov chain mixes rapidly so that Monte Carlo estimates are accurate at low computational cost\. Let $(\theta_t)_{t\geq 0}$ denote the SG\(L\)D iterates and let $\pi_\theta$ be their stationary distribution\. Given a scalar functional $f\colon\mathbb{R}^{D}\to\mathbb{R}$, let $\pi_\theta(f):=\int f(\theta)\,\pi_\theta(\mathrm{d}\theta)$ denote its expectation under the invariant distribution, which is the quantity we ultimately want to estimate\. Given $T$ iterates, the standard Monte Carlo estimator for $\pi_\theta(f)$ is $\hat{f}_T:=T^{-1}\sum_{t=1}^{T}f(\theta_t)$\.
To isolate the effect of mixing, suppose the chain is started at stationarity: $\theta_0\sim\pi_\theta$\. Letting $\rho_k(f):=\mathrm{Corr}_{\pi_\theta}(f(\theta_0),f(\theta_k))$ denote the lag\-$k$ autocorrelation of the stationary time series $(f(\theta_t))_{t\geq 0}$, the *integrated autocorrelation time*
$$\tau_{\mathrm{int}}(f):=1+2\sum_{k=1}^{\infty}\rho_k(f)\tag{A.1}$$
\(Geyer, [1992](https://arxiv.org/html/2606.00293#bib.bib15); Sokal, [1997](https://arxiv.org/html/2606.00293#bib.bib64)\) quantifies how much serial dependence inflates Monte Carlo variance relative to $T$ i\.i\.d\. draws: rapid mixing corresponds to fast decay and/or negative values of $\rho_k(f)$, which results in small $\tau_{\mathrm{int}}(f)$\. In particular, $\hat{f}_T$ is unbiased and, under standard regularity conditions, its variance takes the form
$$\operatorname{Var}(\hat{f}_T)\approx\frac{\mathrm{Var}_{\pi_\theta}(f)}{T}\,\tau_{\mathrm{int}}(f),\tag{A.2}$$
\(Jones, [2004](https://arxiv.org/html/2606.00293#bib.bib32); Geyer, [1992](https://arxiv.org/html/2606.00293#bib.bib15)\) where $\mathrm{Var}_{\pi_\theta}(f)$ is the marginal variance of $f(\theta)$ when $\theta\sim\pi_\theta$\. Equivalently, the *effective sample size* is $T/\tau_{\mathrm{int}}(f)$ \(Gelman et al\., [1995](https://arxiv.org/html/2606.00293#bib.bib14)\), making $\tau_{\mathrm{int}}(f)$ a direct measure of sampling efficiency\.
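The quantities in Equations A\.1 and A\.2 can be estimated from a single chain\. The following is a minimal sketch, not the procedure used in the paper: the function names are hypothetical, and the infinite sum is truncated at the first non\-positive empirical autocorrelation, which is a simple heuristic assumed here for illustration \(it ignores the negative autocorrelations mentioned above\)\.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Empirical autocorrelations rho_0, ..., rho_max_lag of a 1-D series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    return np.array([np.dot(x[: n - k], x[k:]) / (n * var)
                     for k in range(max_lag + 1)])

def integrated_autocorr_time(x, max_lag=200):
    """Estimate tau_int = 1 + 2 * sum_{k>=1} rho_k (Equation A.1).

    Assumption: the sum is truncated at the first non-positive
    empirical autocorrelation (heuristic, not from the paper).
    """
    rho = autocorrelation(x, max_lag)
    tau = 1.0
    for k in range(1, max_lag + 1):
        if rho[k] <= 0.0:
            break
        tau += 2.0 * rho[k]
    return tau

def effective_sample_size(x, max_lag=200):
    """Effective sample size T / tau_int."""
    return len(x) / integrated_autocorr_time(x, max_lag)

# Illustration on an AR(1) series x_t = a * x_{t-1} + noise, for which
# rho_k = a**k and hence tau_int = (1 + a) / (1 - a), i.e. 3 when a = 0.5.
rng = np.random.default_rng(0)
a, T = 0.5, 200_000
eps = rng.standard_normal(T)
x = np.empty(T)
x[0] = eps[0]
for t in range(1, T):
    x[t] = a * x[t - 1] + eps[t]
tau_hat = integrated_autocorr_time(x)
ess_hat = effective_sample_size(x)
```

For an SG\(L\)D chain, `x` would be the series $f(\theta_t)$ for the functional of interest, collected after burn\-in\.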
## Appendix B SGLD with Momentum
SGLD with momentum $\kappa$ is defined by the one\-step update equations
$$\left\{\begin{array}{l}m_t=\kappa m_{t-1}+G_t(\theta_{t-1})\\ \theta_t=\theta_{t-1}-\Lambda m_t+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}.\end{array}\right.\tag{B.3}$$
Combining [Equation B\.3](https://arxiv.org/html/2606.00293#A2.E3) with the approximation given in [Section 4](https://arxiv.org/html/2606.00293#S4.Ex6) leads to the proxy algorithm with one\-step update
$$\left\{\begin{array}{l}\nu_t=\kappa\nu_{t-1}+G_t(\widehat{\theta})+\nabla G_t(\widehat{\theta})\left(\psi_{t-1}-\widehat{\theta}\right)\\ \psi_t=\psi_{t-1}-\Lambda\nu_t+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}\end{array}\right.\tag{B.6}$$
We first present the stationary covariance of SGLD with momentum, which recovers the case without momentum by taking $\kappa=0$\. Proofs of results in this section are in Appendix [G](https://arxiv.org/html/2606.00293#A7)\.
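The update in Equation B\.3 can be sketched in a few lines\. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the stochastic gradient $G_t$ is taken to be a mean minibatch gradient of a least\-squares loss \(its scaling in the paper is not restated here\), and the matrix square root $\sqrt{2\beta^{-1}\Lambda}$ is replaced by a Cholesky factor, which yields injected noise with the same covariance\.

```python
import numpy as np

def sgld_momentum_step(theta, m, grad_est, Lam, kappa, beta, rng):
    """One step of Equation B.3.

    grad_est: callable returning a stochastic gradient G_t(theta).
    Lam:      symmetric positive definite preconditioner (D x D).
    beta:     inverse temperature; np.inf switches the injected noise off.
    """
    m_new = kappa * m + grad_est(theta)
    theta_new = theta - Lam @ m_new
    if np.isfinite(beta):
        # Any factor F with F F^T = 2 Lam / beta gives the same noise law.
        F = np.linalg.cholesky(2.0 * Lam / beta)
        theta_new = theta_new + F @ rng.standard_normal(theta.shape[0])
    return theta_new, m_new

def minibatch_grad(theta, X, y, batch_size, rng):
    """Hypothetical G_t: mean least-squares gradient on a random minibatch."""
    idx = rng.choice(X.shape[0], size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / batch_size

# Toy run on synthetic linear-regression data (all values are illustrative).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(500)
theta, m = np.zeros(3), np.zeros(3)
Lam = 0.05 * np.eye(3)
samples = []
for _ in range(2000):
    theta, m = sgld_momentum_step(
        theta, m, lambda th: minibatch_grad(th, X, y, 32, rng),
        Lam, kappa=0.5, beta=500.0, rng=rng)
    samples.append(theta)
samples = np.array(samples)
```

Setting `kappa=0` gives plain SGLD, and `beta=np.inf` gives SGD with momentum, matching the special cases discussed in the text\.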
###### Proposition B\.1\.
If the iterates are updated according to [Equation B\.6](https://arxiv.org/html/2606.00293#A2.E6) and they have a stationary distribution, then the stationary covariance $\Sigma_\psi$ \(written $\Sigma$ in the display below\) satisfies
$$\begin{aligned}(1-\kappa)(\Lambda\widehat{H}\Sigma+\Sigma\widehat{H}\Lambda)+\frac{\kappa}{1-\kappa^{2}}(\Lambda\widehat{H}\Lambda\widehat{H}\Sigma+\Sigma\widehat{H}\Lambda\widehat{H}\Lambda)&=\Lambda\overline{C}_{\psi}\Lambda+\frac{1+\kappa^{2}}{1-\kappa^{2}}\Lambda\widehat{H}\Sigma\widehat{H}\Lambda+(1+\kappa^{2})\frac{2\Lambda}{\beta}.\end{aligned}\tag{B.7}$$
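Equation B\.7 is linear in the unknown covariance, so for moderate $D$ it can be solved by vectorization\. The sketch below makes a simplifying assumption that should be checked against the paper's definitions: $\overline{C}_\psi$ is treated as a known, fixed matrix `C`, although it may itself depend on the stationary distribution\. The function name is hypothetical, and the dense Kronecker system costs $O(D^6)$, so this is meant for illustration only\.

```python
import numpy as np

def stationary_cov_momentum(Lam, H, C, kappa, beta):
    """Solve Equation B.7 for Sigma, treating C (C-bar_psi) as fixed.

    Uses the row-major identity vec(A X B) = kron(A, B.T) vec(X).
    """
    D = H.shape[0]
    I = np.eye(D)
    LH = Lam @ H   # Lambda H-hat
    HL = H @ Lam   # H-hat Lambda
    a = 1.0 - kappa
    b = kappa / (1.0 - kappa**2)
    c = (1.0 + kappa**2) / (1.0 - kappa**2)
    # Left-hand side of B.7 minus the Sigma-dependent term on the right.
    op = (a * (np.kron(LH, I) + np.kron(I, HL.T))
          + b * (np.kron(LH @ LH, I) + np.kron(I, (HL @ HL).T))
          - c * np.kron(LH, HL.T))
    rhs = Lam @ C @ Lam + (1.0 + kappa**2) * 2.0 * Lam / beta
    Sigma = np.linalg.solve(op, rhs.reshape(-1)).reshape(D, D)
    return 0.5 * (Sigma + Sigma.T)  # remove round-off asymmetry

# Scalar sanity check with kappa = 0, where B.7 reduces to
# 2*lam*h*sigma = lam^2*c + lam^2*h^2*sigma + 2*lam/beta.
lam, h, c0, beta0 = 0.1, 2.0, 3.0, 5.0
sigma_closed_form = (lam**2 * c0 + 2 * lam / beta0) / (2 * lam * h - lam**2 * h**2)
sigma_numeric = stationary_cov_momentum(
    np.array([[lam]]), np.array([[h]]), np.array([[c0]]), 0.0, beta0)[0, 0]
```

Passing `beta=np.inf` drops the temperature term and corresponds to SGD with momentum\.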
###### Proposition B\.2\.
Under the same hypotheses as [Proposition B\.1](https://arxiv.org/html/2606.00293#A2.Thmtheorem1), the iterate average $\bar{\psi}_k=\frac{1}{k}\sum_{k'=1}^{k}\psi_{k'}$ has stationary covariance
$$\Sigma_{\psi}^{(k)}=\frac{1}{k^{2}}\left(k\Sigma_{\psi}+2\sum_{k'=1}^{k-1}\left(I-\Lambda\widehat{H}\right)^{k'}\Sigma_{\psi}\right),\tag{B.8}$$
where $\Sigma_\psi$ is defined by [Equation B\.7](https://arxiv.org/html/2606.00293#A2.E7)\.
In the momentum setting, we obtain 2\-Wasserstein error bounds analogous to the non\-momentum case\. The main difference is that the contraction factor and the corresponding constants now depend on the momentum parameter $\kappa$\.
###### Theorem B\.3\.
If assumptions [\(A\)](https://arxiv.org/html/2606.00293#S4.I1.i1)–[\(C\)](https://arxiv.org/html/2606.00293#S4.I1.i3) hold and $\Lambda=\lambda I_D$ for some $\lambda>0$, $\kappa\in(0,1)$, $\lambda\in(0,(1-\kappa)/(4L))$, and $\frac{2\lambda L^{2}}{1-\kappa}+\frac{2L^{2}\kappa(1+\lambda)}{(1-\kappa)^{2}}+\kappa<\mu$, then, letting $\bar{\beta}:=\rho(A)<1$ for the coefficient matrix $A$ defined in [Equation G\.40](https://arxiv.org/html/2606.00293#A7.E40), $\overline{M^{p}}:=N^{-1}\sum_{n=1}^{N}M_n^{p}$ for $p\in\{1,2\}$, $C_s:=\mathbb{E}\|\psi_s-\widehat{\theta}\|^{4}$, and $\mathcal{P}$ as in [Equation G\.16](https://arxiv.org/html/2606.00293#A7.E16), for all $t=1,2,\dots$,
$$W_2^{2}\bigl(\pi_{\theta,t},\pi_{\psi,t}\bigr)\leq\bar{\beta}^{\,t}\,W_2^{2}\bigl(\pi_{\theta,0},\pi_{\psi,0}\bigr)+\mathcal{P}\,\sum_{s=1}^{t}\bar{\beta}^{\,t-s}\,C_{s-1}.\tag{B.9}$$
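To see how the bound in Equation B\.9 behaves over long horizons, the following worked step assumes, purely for illustration, that the fourth moments are uniformly bounded, $C_s\leq C_{\max}$ for all $s$, and that $\mathcal{P}\geq 0$\. The symbol $C_{\max}$ is introduced here and does not appear in the paper\.

```latex
\begin{aligned}
\sum_{s=1}^{t}\bar{\beta}^{\,t-s}\,C_{s-1}
  &\le C_{\max}\sum_{j=0}^{t-1}\bar{\beta}^{\,j}
   = C_{\max}\,\frac{1-\bar{\beta}^{\,t}}{1-\bar{\beta}}, \\
W_2^{2}\bigl(\pi_{\theta,t},\pi_{\psi,t}\bigr)
  &\le \bar{\beta}^{\,t}\,W_2^{2}\bigl(\pi_{\theta,0},\pi_{\psi,0}\bigr)
   + \frac{\mathcal{P}\,C_{\max}}{1-\bar{\beta}} .
\end{aligned}
```

Under these assumptions the contribution of the initialization decays geometrically at rate $\bar{\beta}$, while the second term is a time\-uniform floor controlled by $\mathcal{P}$ and the spectral gap $1-\bar{\beta}$\.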
Finally, we find that, with the momentum proportional to $\lambda$, the Wasserstein error remains of order $\lambda/B$; moreover, the bound recovers the non\-momentum result when $\kappa=0$\.
###### Corollary B\.4\.
Under the same assumptions as [Theorem B\.3](https://arxiv.org/html/2606.00293#A2.Thmtheorem3) with $\beta=\infty$, assume the *scaled momentum* regime $\kappa=c_{\kappa}\lambda$ with $0<c_{\kappa}\leq\min\Big\{\tfrac{\mu^{2}}{32L^{2}},\ \tfrac{\hat{\mu}}{c_{1}\hat{L}^{3}}\Big\}$ and $\lambda\leq\min\Big\{1,\ \tfrac{1}{\hat{L}},\ \tfrac{\mu}{c_{2}L^{2}},\ \tfrac{B\hat{\mu}}{c_{3}L^{2}},\ (\hat{\mu}/c_{4})^{1/4}\Big\}$\. Then there exists a constant $A_{\star}>0$ \(given in [Section G\.4](https://arxiv.org/html/2606.00293#A7.Ex9)\) such that
$$W_2(\pi_\theta,\pi_\psi)\leq A_{\star}\frac{\lambda}{B}.\tag{B.10}$$
## Appendix C Application to High\-dimensional Problems
##### Dense setting\.
Let $D$ denote the parameter dimension, let $I\sim\mathrm{Unif}(\{1,\dots,N\})$ be independent, and define $g_I:=\nabla_\theta\ell(x_I,y_I,\hat{\theta})$, $\tau_2^{2}:=\mathbb{E}\|g_I\|^{2}$, and $\tau_4^{4}:=\mathbb{E}\|g_I\|^{4}$\. Assume there exist constants $c_2,c_4<\infty$ independent of $D$ such that $\tau_2^{2}\leq c_2D$ and $\tau_4^{4}\leq c_4D^{2}$\. Such bounds hold, for example, under sub\-Gaussian designs with uniformly bounded GLM weights; see Vershynin \([2018](https://arxiv.org/html/2606.00293#bib.bib68)\)\. Let $\xi\sim\mathcal{N}(0,I_D)$ so that $\mathbb{E}\|\xi\|^{2}=D$ and $\mathbb{E}\|\xi\|^{4}=D(D+2)$\. Corollaries [4\.6](https://arxiv.org/html/2606.00293#S4.Thmtheorem6) and [B\.4](https://arxiv.org/html/2606.00293#A2.Thmtheorem4) yield
$$W_2(\pi_\theta,\pi_\psi)\ \leq\ A_{\mathrm{eff}}\Big(\frac{\lambda}{B}+\frac{1}{\beta}\Big),\tag{C.1}$$
where $A_{\mathrm{eff}}$ is the explicit constant in the corresponding corollary \(e\.g\., $A_{\mathrm{eff}}=\sqrt{A}$ when the corollary is stated as $W_2^{2}\leq A(\lambda/B+1/\beta)^{2}$; see [Sections F\.5](https://arxiv.org/html/2606.00293#A6.Ex13) and [G\.4](https://arxiv.org/html/2606.00293#A7.Ex9) for the explicit definitions\)\. If the curvature constants entering these corollaries \(e\.g\., $\mu,L,\hat{\mu},\hat{L}$\) are bounded above and below by constants independent of $D$, then inserting the bounds on $\tau_2$, $\tau_4$, and the Gaussian moments above into [Sections F\.5](https://arxiv.org/html/2606.00293#A6.Ex13) and [G\.4](https://arxiv.org/html/2606.00293#A7.Ex9) shows that there exists $C>0$ independent of $D$ such that $A_{\mathrm{eff}}\leq CD$, and hence
$$W_2(\pi_\theta,\pi_\psi)\ \leq\ C\,D\Big(\frac{\lambda}{B}+\frac{1}{\beta}\Big).\tag{C.2}$$
In particular, if $B\geq cD$ then $W_2(\pi_\theta,\pi_\psi)\leq C'(\lambda+D/\beta)$, and thus $W_2(\pi_\theta,\pi_\psi)\leq C''\lambda$ uniformly in $D$ when $\beta=\infty$ or when $\beta\geq c'''D/\lambda$\. If instead $\beta=N$ and $N/D\in[\gamma_{\min},\gamma_{\max}]$ for fixed $0<\gamma_{\min}\leq\gamma_{\max}<\infty$, then $D/\beta=D/N\in[1/\gamma_{\max},1/\gamma_{\min}]$ and the bound need not vanish as $D\to\infty$\.
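The step from Equation C\.2 to the dimension\-free statements can be spelled out as follows\. The explicit expressions for $C'$ and $C''$ are one valid choice, written out here for illustration; the paper does not specify them\.

```latex
\begin{aligned}
W_2(\pi_\theta,\pi_\psi)
  &\le C\,D\Big(\frac{\lambda}{B}+\frac{1}{\beta}\Big)
   \le \frac{C}{c}\,\lambda + C\,\frac{D}{\beta}
   \le C'\Big(\lambda+\frac{D}{\beta}\Big),
   \qquad C' := C\max\{1/c,\,1\}
   \quad (\text{using } B \ge cD), \\
W_2(\pi_\theta,\pi_\psi)
  &\le C'\Big(1+\frac{1}{c'''}\Big)\lambda =: C''\lambda
   \qquad \text{when } \beta \ge c'''\,D/\lambda
   \ \ (\text{so that } D/\beta \le \lambda/c''').
\end{aligned}
```

When $\beta=\infty$ the term $D/\beta$ vanishes and the second line holds with $C''=C'$\.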
##### Sparse setting\.
Alternatively, we can consider the sparse regime\. Let $S\subseteq[D]$ with $|S|=s\ll D$ and let $P_S$ be the coordinate projector\. Assume both the exact and proxy chains evolve on the affine subspace $\mathcal{A}_S:=\hat{\theta}+\mathrm{range}(P_S)$; this holds, for example, if $P_{S^c}\theta_0=P_{S^c}\psi_0=P_{S^c}\hat{\theta}$ and $P_S$ is applied to every drift and injected\-noise term, so that $\theta_t,\psi_t\in\mathcal{A}_S$ for all $t$\. Define the restricted gradient moments at $\hat{\theta}$ by $g_{I,S}:=P_S\nabla_\theta\ell(x_I,y_I,\hat{\theta})$, $\tau_{2,S}^{2}:=\mathbb{E}\|g_{I,S}\|^{2}$, and $\tau_{4,S}^{4}:=\mathbb{E}\|g_{I,S}\|^{4}$\. Assume the curvature constants, when restricted to $\mathcal{A}_S$, are bounded above and below by constants independent of $D$ and $s$, and there exist constants $c_2,c_4<\infty$ independent of $D$ and $s$ such that $\tau_{2,S}^{2}\leq c_2\,s$ and $\tau_{4,S}^{4}\leq c_4\,s^{2}$\. A sufficient condition is isotropic sub\-Gaussian designs on $S$ with uniformly bounded per\-sample weights; see Vershynin \([2018](https://arxiv.org/html/2606.00293#bib.bib68)\)\. If the injected noise is also projected, then for $\xi\sim\mathcal{N}(0,I_D)$ it holds that $\mathbb{E}\|P_S\xi\|^{2}=s$ and $\mathbb{E}\|P_S\xi\|^{4}=s(s+2)$\. Then the same inspection of the constants in [Corollary 4\.6](https://arxiv.org/html/2606.00293#S4.Thmtheorem6) and [Corollary B\.4](https://arxiv.org/html/2606.00293#A2.Thmtheorem4) yields a constant $C>0$ independent of $D$ and $s$ such that
$$W_2(\pi_\theta,\pi_\psi)\leq Cs\left(\frac{\lambda}{B}+\frac{1}{\beta}\right).\tag{C.3}$$
In particular, if $B\geq cs$ then $W_2(\pi_\theta,\pi_\psi)\leq C'(\lambda+s/\beta)$, hence $W_2(\pi_\theta,\pi_\psi)\leq C''\lambda$ for $\beta=\infty$ and also for $\beta\geq c'''\,s/\lambda$\. If the injected noise is not rank\-$s$, then the diffusion moments scale with $D$ \(since $\mathbb{E}\|\xi\|^{2}=D$ and $\mathbb{E}\|\xi\|^{4}=D(D+2)$\), so the temperature\-dependent contribution generally scales with $D/\beta$ rather than $s/\beta$\.
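The Gaussian moment identities used above, and the difference between projected and unprojected noise, can be checked by simulation\. This is a small Monte Carlo sketch with a hypothetical function name and illustrative sizes; the coordinate projector $P_S$ is implemented as a 0/1 mask\.

```python
import numpy as np

def noise_moments(D, support, n_samples, rng):
    """Monte Carlo estimates of E||P_S xi||^2 and E||P_S xi||^4 for
    xi ~ N(0, I_D), where P_S keeps only the coordinates in `support`."""
    xi = rng.standard_normal((n_samples, D))
    mask = np.zeros(D)
    mask[support] = 1.0
    sq_norm = np.sum((xi * mask) ** 2, axis=1)
    return sq_norm.mean(), (sq_norm ** 2).mean()

rng = np.random.default_rng(0)
D, s = 50, 5
# Projected noise: theory gives s and s * (s + 2).
m2_proj, m4_proj = noise_moments(D, np.arange(s), 100_000, rng)
# Unprojected noise: theory gives D and D * (D + 2).
m2_full, m4_full = noise_moments(D, np.arange(D), 100_000, rng)
```

With these sizes the projected second moment is about $s=5$ while the unprojected one is about $D=50$, which is the gap between the $s/\beta$ and $D/\beta$ scalings described above\.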
## Appendix D Additional Experiment Details
### D\.1 Empirical Validation of the Wasserstein Bound
We empirically validate the Wasserstein error bound in [Corollary 4\.6](https://arxiv.org/html/2606.00293#S4.Thmtheorem6) using Poisson regression in both well\-specified and misspecified settings\. We generate covariates $x_i\in\mathbb{R}^{D}$ and responses according to two synthetic data\-generating mechanisms\. In the well\-specified setting, the responses are generated from
$$y_i\sim\mathrm{Poisson}\left(\exp\{x_i^{\top}\theta_{\star}\}\right),\tag{D.1}$$
and the fitted model is also Poisson regression\. In the misspecified setting, the responses are generated from a negative binomial model,
$$y_i\sim\mathrm{NegBin}(r_i,p_i),\qquad p_i=\frac{r_i}{r_i+\exp\{x_i^{\top}\theta_{\star}\}},\tag{D.2}$$
but are still fitted using a Poisson model\. We use sample size $N=2{,}000$ and dimension $D=50$\.
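The two data\-generating mechanisms in Equations D\.1 and D\.2 can be simulated as in the sketch below\. Several choices here are assumptions made for illustration and are not stated in the text: standard normal covariates, the scale of $\theta_\star$, and a single dispersion value `r` shared by all observations in place of per\-observation $r_i$\. The function name is hypothetical\.

```python
import numpy as np

def simulate_counts(N, D, rng, misspecified=False, r=5.0, scale=0.1):
    """Draw (X, y, theta_star) from Equation D.1 or, if misspecified, D.2.

    Assumptions (not from the paper): Gaussian covariates, theta_star with
    i.i.d. N(0, scale^2) entries, and a constant dispersion r_i = r.
    """
    X = rng.standard_normal((N, D))
    theta_star = scale * rng.standard_normal(D)
    mean = np.exp(X @ theta_star)
    if misspecified:
        # NumPy's negative binomial counts failures before r successes and
        # has mean r * (1 - p) / p, which equals `mean` when p = r / (r + mean).
        p = r / (r + mean)
        y = rng.negative_binomial(r, p)
    else:
        y = rng.poisson(mean)
    return X, y, theta_star

rng = np.random.default_rng(0)
X_ws, y_ws, theta_ws = simulate_counts(2000, 50, rng, misspecified=False)
X_ms, y_ms, theta_ms = simulate_counts(2000, 50, rng, misspecified=True)
```

Both mechanisms share the conditional mean $\exp\{x_i^\top\theta_\star\}$; the negative binomial responses have conditional variance $\mu_i+\mu_i^2/r$ with $\mu_i=\exp\{x_i^\top\theta_\star\}$, so a Poisson fit is misspecified through overdispersion\.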
To instantiate the theoretical upper bound, we use plug\-in estimates evaluated at $\widehat{\theta}$\. Specifically, we estimate the smoothness constant $L$ and strong convexity constant $\mu$ using the largest and smallest eigenvalues of the empirical Hessian, respectively\. The higher\-order quantities $\overline{M}$, $\overline{M^{2}}$, and $\tau_4$ are estimated empirically from the sample gradients\. We then evaluate $W_2(\pi_\theta,\pi_\psi)$ across different batch sizes $B$ and different values of the ratio $\lambda/B$\.
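A sketch of the plug\-in step for Poisson regression is given below\. It covers only $L$, $\mu$, and $\tau_4^4=\mathbb{E}\|g_I\|^4$; the quantities $\overline{M}$ and $\overline{M^2}$ are omitted because their definitions are not restated in this appendix\. The function names are hypothetical, and normalizing the Hessian by $1/N$ is an assumption about the paper's scaling convention\.

```python
import numpy as np

def poisson_hessian(X, theta):
    """Empirical Hessian of the mean Poisson negative log-likelihood,
    (1/N) * sum_i exp(x_i^T theta) x_i x_i^T (1/N scaling is assumed)."""
    w = np.exp(X @ theta)
    return (X * w[:, None]).T @ X / X.shape[0]

def plug_in_curvature(X, theta_hat):
    """Return (L, mu): largest and smallest Hessian eigenvalues at theta_hat."""
    eig = np.linalg.eigvalsh(poisson_hessian(X, theta_hat))  # ascending order
    return eig[-1], eig[0]

def gradient_fourth_moment(X, y, theta_hat):
    """Estimate tau_4^4 = E||g_I||^4 from the per-sample Poisson gradients
    g_i = (exp(x_i^T theta_hat) - y_i) x_i."""
    resid = np.exp(X @ theta_hat) - y
    g = X * resid[:, None]
    return np.mean(np.sum(g ** 2, axis=1) ** 2)

# Tiny hand-checkable example: at theta = 0 every weight exp(x^T theta) is 1.
X_demo = np.array([[1.0, 0.0], [0.0, 2.0]])
y_demo = np.array([0.0, 0.0])
L_demo, mu_demo = plug_in_curvature(X_demo, np.zeros(2))
tau4_demo = gradient_fourth_moment(X_demo, y_demo, np.zeros(2))
```

In the demo the Hessian is $\mathrm{diag}(0.5, 2)$, so `L_demo` is 2 and `mu_demo` is 0\.5, and the per\-sample gradients have squared norms 1 and 4, so `tau4_demo` is $(1+16)/2=8.5$\.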
As shown in [Figure D\.1](https://arxiv.org/html/2606.00293#A4.F1), the empirical Wasserstein distance exhibits a clear, approximately linear scaling in $\lambda/B$ across all batch sizes\. Moreover, in both the well\-specified Poisson setting and the misspecified negative binomial setting, the empirical curves remain below the theoretical upper bound\. This confirms that the bound captures the correct dependence on $\lambda/B$, although it is conservative in magnitude\.
Figure D\.1: Empirical validation of the Wasserstein error bound in [Corollary 4\.6](https://arxiv.org/html/2606.00293#S4.Thmtheorem6)\. Left: well\-specified Poisson data fitted with a Poisson model\. Right: misspecified negative binomial data fitted with a Poisson model\.
### D\.2Computational Cost of DeterminingΛ\\Lambda
In this subsection, we benchmark the wall\-clock cost of computing the preconditionerΛ\\Lambda—by solving the matrix equation induced by[Equations15](https://arxiv.org/html/2606.00293#S4.E15)and[16](https://arxiv.org/html/2606.00293#S4.E16)—against the cost of running the resulting MCMC chain\.[TablesD\.1](https://arxiv.org/html/2606.00293#A4.T1)and[D\.2](https://arxiv.org/html/2606.00293#A4.T2)show thatΛ\\Lambdacan be computed extremely quickly relative to sampling: across all tested dimensions,Λ\\Lambdaconstruction takes at most a few milliseconds and is typically10−510^\{\-5\}–10−310^\{\-3\}of the MCMC runtime\. In the following exploratory study, we fixN=1000N=1000andB=64B=64, and run MCMC for 10 epochs\.
Table D.1: Linear regression: comparison of $\Lambda$ computation and MCMC running time. Entries are mean wall-clock time (seconds) averaged over independent runs.

| $D$ | $\Lambda$ computation time | MCMC time | $\Lambda$ computation time / MCMC |
| --- | --- | --- | --- |
| 5 | $9.2\times 10^{-5}$ | 1.49 | $6.2\times 10^{-5}$ |
| 10 | $8.4\times 10^{-4}$ | 1.67 | $5.0\times 10^{-4}$ |
| 20 | $6.5\times 10^{-4}$ | 1.83 | $3.6\times 10^{-4}$ |
| 50 | $8.0\times 10^{-4}$ | 3.03 | $2.6\times 10^{-4}$ |

Table D.2: Poisson regression: comparison of $\Lambda$ computation and MCMC time. Entries are mean wall-clock time (seconds) averaged over independent runs.

| $D$ | $\Lambda$ computation time | MCMC time | $\Lambda$ computation time / MCMC |
| --- | --- | --- | --- |
| 5 | $9.5\times 10^{-5}$ | 1.71 | $5.6\times 10^{-5}$ |
| 10 | $4.0\times 10^{-3}$ | 1.74 | $2.3\times 10^{-3}$ |
| 20 | $8.5\times 10^{-4}$ | 1.79 | $4.8\times 10^{-4}$ |
| 50 | $7.8\times 10^{-4}$ | 1.81 | $4.3\times 10^{-4}$ |
### D\.3Details of theβ\\beta\-divergence\.
Under the definition of the $\beta$-divergence $\ell^{(\beta)}(y,f(\cdot;\theta))=-\frac{1}{\beta-1}f(y;\theta)^{\beta-1}+\frac{1}{\beta}\int f(z;\theta)^{\beta}\,dz$, the loss $\mathcal{L}$ can be rewritten as follows:

$$\begin{aligned}
\mathcal{L}(\theta)&=\frac{1}{N}\sum_{n=1}^{N}\ell^{(\beta)}(y_n,f(\cdot;\theta))+\frac{1}{N}\mathcal{R}(\theta)\\
&=-\frac{1}{N}\sum_{n=1}^{N}\frac{1}{\beta-1}f(y_n;\theta)^{\beta-1}+\frac{1}{\beta}\int f(z;\theta)^{\beta}\,dz+\frac{1}{N}\mathcal{R}(\theta)\\
&=\frac{1}{N}\sum_{n=1}^{N}\tilde{\ell}_n^{(\beta)}+\frac{1}{N}\left(\Omega^{(\beta)}(\theta)+\mathcal{R}(\theta)\right),
\end{aligned}\tag{D.3}$$

where $\tilde{\ell}_n^{(\beta)}(\theta)=-\frac{1}{\beta-1}f(y_n;\theta)^{\beta-1}$ and $\Omega^{(\beta)}(\theta)=\frac{N}{\beta}\int f(z;\theta)^{\beta}\,dz$.

Equivalently, the loss $\mathcal{L}$ can be written as

$$\mathcal{L}(\theta)=\frac{1}{N}\sum_{n=1}^{N}\ell_n^{(\beta)}(\theta),\tag{D.4}$$

where $\ell_n^{(\beta)}(\theta)=\tilde{\ell}_n^{(\beta)}+\frac{1}{\beta}\int f(z;\theta)^{\beta}\,dz+\frac{1}{N}\mathcal{R}(\theta)$.
Similarly, we can compute𝒥^\(β\)=1N∑n=1N∇2ℓn\(β\)\(θ^\)\\widehat\{\\mathcal\{J\}\}^\{\(\\beta\)\}=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\nabla^\{2\}\\ell\_\{n\}^\{\(\\beta\)\}\(\\hat\{\\theta\}\)andℐ^\(β\)=1N∑n=1N∇ℓn\(β\)\(θ^\)\(∇ℓn\(β\)\(θ^\)\)⊤\\widehat\{\\mathcal\{I\}\}^\{\(\\beta\)\}=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\nabla\\ell\_\{n\}^\{\(\\beta\)\}\(\\hat\{\\theta\}\)\(\\nabla\\ell\_\{n\}^\{\(\\beta\)\}\(\\hat\{\\theta\}\)\)^\{\\top\}to use our[Algorithm1](https://arxiv.org/html/2606.00293#alg1)underβ\\beta\-divergence loss\.
### D\.4Full Experiment Results
Due to space constraints, we are unable to report parameter errors and detailed confidence intervals in the main text\. This subsection therefore presents the complete experimental results for all settings\. To aid interpretation of the reported relative covariance errors, we additionally compare the marginal variances of the estimated covariance𝒮^\\hat\{\\mathcal\{S\}\}with those of the target covariance𝒮⋆\\mathcal\{S\}\_\{\\star\}\.
[Figure D.2](https://arxiv.org/html/2606.00293#A4.F2) shows that, for both datasets, tuning with our results yields the desired marginal variances at both small and large batch sizes. The continuous-time tuning performs well when the batch size is small, since a small batch size requires a small learning rate; however, its variances are too large when the batch size is large. The large-sample+well-specified tuning, on the other hand, leads to excessive variance in both the small and large batch size regimes because the assumption that the model is well-specified is violated. [Figure D.3](https://arxiv.org/html/2606.00293#A4.F3) shows that theories based on the heuristic SGD noise $\overline{C}=\frac{1}{B}H$ lead to an excessively large stationary covariance for the simulated data but too small a covariance for the German credit data. The continuous-time tuning leads to too large a covariance at the large batch size in both cases. Our theory, on the other hand, is accurate in all scenarios.


Figure D\.2:Comparison of step size tuning guidance for linear regression with\(top\)simulated misspecified data with heteroskedastic noise and\(bottom\)the classic Boston housing dataset\.

Figure D.3: Comparison of step size tuning guidance for Poisson regression with (top) simulated well-specified data and (bottom) the German credit data.

Table D.3: Full results for linear regression experiments with simulated data. Calibration error is the Kolmogorov–Smirnov distance to $\mathrm{Unif}(0,1)$; lower is better. Covariance error is $\lVert\mathcal{S}_{\star}-\hat{\mathcal{S}}\rVert_F/\lVert\mathcal{S}_{\star}\rVert_F$; lower is better. Within each metric row and loss block, for a fixed batch size $B$, bold indicates methods whose 95% confidence intervals overlap with the confidence interval of the method with the lowest mean error.

Log loss; $\beta$-loss ($\beta=1.5$)

$B$; Posterior; CT; LR+WS; DQ+exact; NUTS; Sandwich Gauss; CT; LR+WS; DQ+exact

Calibration error: $B=16$: 0.418, 0.195, 0.156; $B=\lfloor 0.1\times N\rfloor$: 0.418, 0.195, 0.156

Covariance error: $B=16$: 0.943, 0.795, 0.000; $B=\lfloor 0.1\times N\rfloor$: 0.943, 0.799, 0.000

Table D.4: Full results for linear regression experiments with Boston housing data. See [Table D.3](https://arxiv.org/html/2606.00293#A4.T3) caption for further explanation.

Log loss; $\beta$-loss ($\beta=1.5$)

$B$; Posterior; CT; LR+WS; DQ+exact; NUTS; Sandwich Gauss; CT; LR+WS; DQ+exact

Covariance error: $B=16$: 0.358, $9.23\times 10^{8}$ [$3.62\times 10^{4}$, $6.17\times 10^{9}$], 2.5280$\infty$; $B=\lfloor 0.1\times N\rfloor$: 0.358, $1.40\times 10^{7}$ [$4.85\times 10^{3}$, $9.01\times 10^{7}$], 2.5280$\infty$

Table D.5: Results for Poisson regression experiments. See [Table D.3](https://arxiv.org/html/2606.00293#A4.T3) caption for further explanation.

| $B$ | Method | Simulated: calib. err. | Simulated: cov. err. | Credit: cov. err. |
| --- | --- | --- | --- | --- |
| 16 | CT | 0.069 [0.062, 0.075] | 0.207 [0.199, 0.215] | 0.132 [0.112, 0.168] |
| 16 | DQ+const | 0.646 [0.639, 1.344] | 0.672 [0.664, 0.678] | 0.982 [0.975, 0.987] |
| 16 | DQ+exact | 0.074 [0.068, 0.080] | 0.208 [0.201, 0.217] | 0.157 [0.132, 0.193] |
| $\lfloor 0.1\times N\rfloor$ | CT | 0.089 [0.078, 0.100] | 0.230 [0.218, 0.245] | 0.191 [0.155, 0.240] |
| $\lfloor 0.1\times N\rfloor$ | DQ+const | 1.376 [1.370, 1.382] | 0.991 [0.990, 0.992] | 0.997 [0.996, 0.999] |
| $\lfloor 0.1\times N\rfloor$ | DQ+exact | 0.075 [0.066, 0.083] | 0.211 [0.203, 0.220] | 0.154 [0.138, 0.181] |
## Appendix EApplication: Stationary Covariance for a Fixed Learning Rate
In this section, we discuss how our theory can be used to justify the stationary covariance structure at a fixed learning rate\.
### E\.1Linear Regression
As an illustration of the usefulness of, and new insights provided by[Theorem4\.3](https://arxiv.org/html/2606.00293#S4.Thmtheorem3), we first focus on the special case of linear regression without regularization \(i\.e\., whereℛ≡0\\mathcal\{R\}\\equiv 0\)\. Since in the case of linear regression the proxy algorithm is identical to the exact algorithm, we will give all our results in terms of the original process\(θt\)t≥0\(\\theta\_\{t\}\)\_\{t\\geq 0\}\. In linear regression we can specialize[Equation16](https://arxiv.org/html/2606.00293#S4.E16)to obtain
C¯θ=1B\(N−1∑n=1Nxnxn⊤Σθxnxn⊤−H^ΣθH^\)\+1BN∑n=1Nrn2xnxn⊤,\\displaystyle\\begin\{aligned\} \\overline\{C\}\_\{\\theta\}&=\\textstyle\\frac\{1\}\{B\}\(N^\{\-1\}\\sum\_\{n=1\}^\{N\}x\_\{n\}x\_\{n\}^\{\\top\}\\Sigma\_\{\\theta\}x\_\{n\}x\_\{n\}^\{\\top\}\-\\widehat\{H\}\\Sigma\_\{\\theta\}\\widehat\{H\}\)\+\\frac\{1\}\{BN\}\\sum\_\{n=1\}^\{N\}r\_\{n\}^\{2\}x\_\{n\}x\_\{n\}^\{\\top\},\\end\{aligned\}\(E\.1\)wherern=yn−θ^⊤xnr\_\{n\}=y\_\{n\}\-\\widehat\{\\theta\}^\{\\top\}x\_\{n\}is the residual andH^=N−1∑n=1Nxnxn⊤\\widehat\{H\}=N^\{\-1\}\\sum\_\{n=1\}^\{N\}x\_\{n\}x\_\{n\}^\{\\top\}\.
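Equation E.1 can be evaluated directly for a candidate stationary covariance. The sketch below is illustrative only; the function name `linreg_noise_cov` and its argument layout are assumptions, not part of the paper. It uses the identity $x_n x_n^\top\Sigma_\theta x_n x_n^\top=(x_n^\top\Sigma_\theta x_n)\,x_n x_n^\top$ to avoid forming $N$ separate $D\times D$ products.

```python
import numpy as np

def linreg_noise_cov(X, y, theta_hat, Sigma, B):
    """Evaluate the right-hand side of Equation E.1 for a given Sigma."""
    N = X.shape[0]
    H_hat = X.T @ X / N                          # N^{-1} sum_n x_n x_n^T
    r = y - X @ theta_hat                        # residuals r_n
    q = np.einsum("nd,de,ne->n", X, Sigma, X)    # x_n^T Sigma x_n
    fourth = (X * q[:, None]).T @ X / N          # N^{-1} sum_n x_n x_n^T Sigma x_n x_n^T
    resid = (X * (r ** 2)[:, None]).T @ X / N    # N^{-1} sum_n r_n^2 x_n x_n^T
    return (fourth - H_hat @ Sigma @ H_hat + resid) / B
```

With `Sigma` set to zero only the residual term survives, which is the part of the noise that remains when the iterate sits exactly at $\widehat{\theta}$.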
##### Relation to large\-sample approximation ofZiyin et al\. \([2022](https://arxiv.org/html/2606.00293#bib.bib78)\)\.
We can recover the approximation given in [Equation 10](https://arxiv.org/html/2606.00293#S3.E10) by making the same simplifying assumptions and approximations (see [Section E.3.1](https://arxiv.org/html/2606.00293#A5.SS3.SSS1) for details). First, if $x_n\sim\mathcal{N}(0,A)$ and $N$ is large, then, using the properties of the Gaussian, the first term on the right-hand side of [Equation E.1](https://arxiv.org/html/2606.00293#A5.E1) is well-approximated by $2A\Sigma_{\theta}A+\operatorname{Tr}[A\Sigma_{\psi}]A$ and $\widehat{H}\approx A$. Hence, the first two terms together are approximately equal to $A\Sigma_{\theta}A+\operatorname{Tr}[A\Sigma_{\psi}]A$. However, in many scenarios the covariates may not be normally distributed (e.g., they may be binary or have heavier tails) and $N$ may not be large relative to the parameter/covariate dimension $D$. To simplify the final term in [Equation E.1](https://arxiv.org/html/2606.00293#A5.E1), we must also assume the model is well-specified, which implies that $x_n$ and $r_n$ are independent and $r_n\sim\mathcal{N}(0,\sigma^{2})$. Hence, when $N$ is large, the final term is approximately $\sigma^{2}A$. However, when the model is misspecified, the term $N^{-1}\sum_{n=1}^{N}r_n^{2}x_n x_n^{\top}$ can capture additional variability due to, for example, a poor model fit, heteroskedastic errors, and/or heavy-tailed errors. We illustrate this latter point next.
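The Gaussian fourth-moment identity used here, $\mathbb{E}[xx^{\top}\Sigma xx^{\top}]=2A\Sigma A+\operatorname{Tr}[A\Sigma]A$ for $x\sim\mathcal{N}(0,A)$ and symmetric $\Sigma$, can be checked by Monte Carlo. The following is a small sanity-check sketch; the function name, sample size, and seed are arbitrary choices made for illustration.

```python
import numpy as np

def gaussian_fourth_moment_check(A, Sigma, n_samples=200_000, seed=0):
    """Compare a Monte Carlo estimate of E[x x^T Sigma x x^T] for
    x ~ N(0, A) with the closed form 2 A Sigma A + Tr[A Sigma] A.

    Assumes Sigma is symmetric.
    """
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(np.zeros(A.shape[0]), A, size=n_samples)
    q = np.einsum("nd,de,ne->n", x, Sigma, x)     # x^T Sigma x for each draw
    empirical = (x * q[:, None]).T @ x / n_samples
    exact = 2 * A @ Sigma @ A + np.trace(A @ Sigma) * A
    return empirical, exact
```

Replacing the Gaussian draws with binary or heavy-tailed covariates in such a check is one way to see how far the approximation can drift when the normality assumption fails.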
##### Numerical illustrations\.
To validate our theory, we compare the predicted stationary covariance structure obtained from combining [Proposition 4.2](https://arxiv.org/html/2606.00293#S4.Thmtheorem2) and [Equation E.1](https://arxiv.org/html/2606.00293#A5.E1) with predictions based on (1) the continuous-time theory and (2) the discrete-time theory that assumes large $N$ and a well-specified model. We focus on the effect of varying the (scalar) learning rate.
##### *Simulated misspecified data\.*
First, we consider a misspecified simulated dataset with heteroskedastic error generated according to the model
$$y_n\sim\mathcal{N}\left(x_n^{\top}\theta_{\star},\,1+\|x_n\|_2^{2}\right),\tag{E.2}$$

where $\theta_{\star}\sim\mathcal{N}(0,I_D)$ is fixed and $x_n\overset{\text{iid}}{\sim}\mathcal{N}(0,I_D)$. We take $D=20$ and $N=2{,}000$. [Figure E.1](https://arxiv.org/html/2606.00293#A5.F1) (left) illustrates the predicted covariance for the parameters $(\theta_1,\theta_2)^{\top}$. The results show that our theory delivers the most accurate covariance predictions across all learning rates. In contrast, the continuous-time theory underestimates the parameter variances, while the discrete-time approximation that assumes $N$ is large and the model is correct overestimates them.
##### *Boston housing data\.*
Next, we reconsider the real-world Boston housing data. Similar to the results on simulated data, [Figure E.1](https://arxiv.org/html/2606.00293#A5.F1) (right) demonstrates that our theory can accurately predict the covariance. The alternative approximations consistently underestimate it.


Figure E\.1:Comparison of estimated stationary covariance structure for linear regression at3σ3\\sigmaconfidence region on\(left\)simulated misspecified data with heteroskedastic noise and\(right\)the classic Boston housing dataset withλ=0\.1\\lambda=0\.1andB=32B=32\. Our theory provides more accurate stationary covariance predictions in both cases\.
### E\.2Poisson Regression
Similar to the linear regression experiments, we compare the stationary covariance predicted by our theory with those derived from continuous\-time theory and the discrete\-time quadratic loss proxy with constant noise \(that is, usingC¯ψ≈1BH^\\overline\{C\}\_\{\\psi\}\\approx\\frac\{1\}\{B\}\\widehat\{H\}in[Equation15](https://arxiv.org/html/2606.00293#S4.E15)\)\. However, unlike in linear regression, the proxy algorithm is no longer exact, and so we must rely on our error analysis to justify its use\.


Figure E.2: Comparison of estimated stationary covariance structure for Poisson regression at $3\sigma$ confidence region with (left) simulated well-specified data and (right) the German credit data, with $\lambda=0.1$ and $B=32$.

| Learning rate $\lambda$ | continuous-time | discrete-quadratic + constant noise | discrete-quadratic + exact noise |
| --- | --- | --- | --- |
| $\lVert\Sigma_{\psi}-\Sigma_{\theta}\rVert_F$ for Poisson regression on well-specified simulated dataset | | | |
| 0.1 | 0.237 | 0.302 | 0.030 |
| 0.3 | 0.479 | 0.631 | 0.096 |
| 0.5 | 0.545 | 0.651 | 0.202 |
| $\lVert\Sigma_{\psi}-\Sigma_{\theta}\rVert_F$ for Poisson regression on misspecified German credit dataset | | | |
| 0.1 | 0.0367 | 0.037 | 0.004 |
| 0.3 | 0.098 | 0.099 | 0.025 |
| 0.5 | 0.124 | 0.126 | 0.041 |

Table E.1: Comparison of difference between estimated stationary covariance structure $\Sigma_{\psi}$ and the ground truth using Frobenius norm for Poisson regression.

[Figure E.2](https://arxiv.org/html/2606.00293#A5.F2) (right) shows that our theory provides an accurate estimate of the stationary covariance while alternatives provide severe underestimates.

For both the simulated and real-world datasets, our approximation improves accuracy, with errors that are 3–10 times smaller than those of the baseline approaches, as shown in [Table E.1](https://arxiv.org/html/2606.00293#A5.T1).
### E\.3Optimal weight decay and batch size
A direct application of accurate stationary covariance prediction is to estimate the test loss. To simplify our analysis, we will focus on linear regression. The test loss depends on the stationary covariance through $\mathcal{L}_{\text{test}}=\frac{1}{M}\sum_{m=1}^{M}(y_m-\widehat{\theta}^{\top}x_m)^{2}+\frac{1}{M}\sum_{m=1}^{M}x_m^{\top}\Sigma_{\theta}x_m$ (see [Section E.3.2](https://arxiv.org/html/2606.00293#A5.SS3.SSS2)), where $\{(x_m,y_m)\}_{m=1}^{M}$ is the test dataset. As illustrated in [Figure E.3](https://arxiv.org/html/2606.00293#A5.F3), our theory offers the most accurate test loss estimation across different weight decay values and batch sizes.


Figure E.3: Comparison of estimated test loss for ridge regression ($\Gamma=\gamma I_D$) on simulated misspecified data with heteroskedastic noise considered in [Equation E.2](https://arxiv.org/html/2606.00293#A5.E2). (left) We set $\lambda=0.1$, $B=32$. (right) We set $\lambda=0.1$, $\gamma=0$.

#### E.3.1 More discussion about [Section E.1](https://arxiv.org/html/2606.00293#A5.SS1)
Recall that in linear regression we can specialize[Equation16](https://arxiv.org/html/2606.00293#S4.E16)to obtain
C¯θ\\displaystyle\\overline\{C\}\_\{\\theta\}=1B\(N−1∑n=1Nxnxn⊤Σθxnxn⊤−H^ΣθH^\)\+1BN∑n=1Nrn2xnxn⊤,\\displaystyle=\\frac\{1\}\{B\}\(N^\{\-1\}\\sum\_\{n=1\}^\{N\}x\_\{n\}x\_\{n\}^\{\\top\}\\Sigma\_\{\\theta\}x\_\{n\}x\_\{n\}^\{\\top\}\-\\widehat\{H\}\\Sigma\_\{\\theta\}\\widehat\{H\}\)\+\\frac\{1\}\{BN\}\\sum\_\{n=1\}^\{N\}r\_\{n\}^\{2\}x\_\{n\}x\_\{n\}^\{\\top\},\(E\.3\)
wherern=yn−θ^⊤xnr\_\{n\}=y\_\{n\}\-\\widehat\{\\theta\}^\{\\top\}x\_\{n\}is the residual andH^=N−1∑n=1Nxnxn⊤\\widehat\{H\}=N^\{\-1\}\\sum\_\{n=1\}^\{N\}x\_\{n\}x\_\{n\}^\{\\top\}\.
Now suppose that the data $\{(x_n,y_n)\}_{n=1}^{N}$ are generated from a linear model; that is, there exists a $\theta_{\star}\in\mathbb{R}^{D}$ such that $y_n=x_n^{\top}\theta_{\star}+\epsilon_n$, where $\epsilon_n\overset{\text{iid}}{\sim}\mathcal{N}(0,\sigma^{2})$, for $n=1,2,\dots,N$. We focus on the MSE loss defined as

$$\mathcal{L}(\theta)=\frac{1}{N}\sum_{n=1}^{N}\ell(x_n,y_n,\theta)=\frac{1}{2N\sigma^{2}}\sum_{n=1}^{N}(y_n-x_n^{\top}\theta)^{2}.\tag{E.4}$$

Since $\widehat{\theta}\sim\mathcal{N}\left(\theta_{\star},\sigma^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\right)$, where $\mathbf{X}\in\mathbb{R}^{N\times D}$, we have
$$\begin{aligned}
\frac{1}{N}\mathbb{E}\left[\sum_{i=1}^{N}r_i^{2}x_i x_i^{\top}\right]
&=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[\left(y_i-x_i^{\top}\widehat{\theta}\right)^{2}\right]x_i x_i^{\top}\\
&=\frac{1}{N}\sum_{i=1}^{N}\left(\mathbb{E}\left[\left(y_i-\mathbb{E}\left[y_i\right]\right)^{2}\right]+\mathbb{E}\left[\left(x_i^{\top}\widehat{\theta}-\mathbb{E}\left[y_i\right]\right)^{2}\right]\right)x_i x_i^{\top}\\
&=\frac{1}{N}\sum_{i=1}^{N}\sigma^{2}\left(I+\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\right)x_i x_i^{\top}\\
&=\sigma^{2}\left(A+\frac{1}{N}I\right).
\end{aligned}\tag{E.5}$$

Hence

$$\lim_{N\to\infty}\frac{1}{N}\mathbb{E}\left[\sum_{i=1}^{N}r_i^{2}x_i x_i^{\top}\right]=\sigma^{2}A.\tag{E.6}$$

Under the assumption that $x_n\sim\mathcal{N}(0,A)$, we have

$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}x_n x_n^{\top}\Sigma_{\theta}x_n x_n^{\top}-\widehat{H}\Sigma_{\theta}\widehat{H}=A\Sigma_{\theta}A+\operatorname{Tr}[A\Sigma_{\psi}]A.\tag{E.7}$$

We thus recover exactly the result of Lemma 1 in Ziyin et al. ([2022](https://arxiv.org/html/2606.00293#bib.bib78)).
#### E\.3\.2Test Loss of Linear Regression
The test loss in linear regression can be decomposed as follows:

$$\begin{aligned}
\mathcal{L}_{\text{test}}
&=\mathbb{E}\!\left[\frac{1}{M}\sum_{m=1}^{M}\bigl(y_m-\theta_t^{\top}x_m\bigr)^{2}\right] &&\text{(E.8)}\\
&=\mathbb{E}\!\left[\frac{1}{M}\sum_{m=1}^{M}\bigl(y_m-\hat{\theta}^{\top}x_m+(\hat{\theta}-\theta_t)^{\top}x_m\bigr)^{2}\right] &&\text{(E.9)}\\
&=\frac{1}{M}\sum_{m=1}^{M}\bigl(y_m-\hat{\theta}^{\top}x_m\bigr)^{2}+\frac{1}{M}\sum_{m=1}^{M}x_m^{\top}\mathbb{E}\!\left[(\theta_t-\hat{\theta})(\theta_t-\hat{\theta})^{\top}\right]x_m &&\text{(E.10)}\\
&=\frac{1}{M}\sum_{m=1}^{M}\bigl(y_m-\hat{\theta}^{\top}x_m\bigr)^{2}+\frac{1}{M}\sum_{m=1}^{M}x_m^{\top}\Sigma_{\theta}x_m, &&\text{(E.11)}
\end{aligned}$$

where $\Sigma_{\theta}=\mathbb{E}\!\left[(\theta_t-\hat{\theta})(\theta_t-\hat{\theta})^{\top}\right]$. The expectation is over the stationary distribution of $\theta_t$, and the cross term obtained by expanding the square in (E.9) vanishes because the stationary mean of $\theta_t$ is $\hat{\theta}$ (Lemma F.1).
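The final expression of this decomposition is cheap to evaluate once a prediction of $\Sigma_{\theta}$ is available. A minimal sketch follows; the function name `predicted_test_loss` is hypothetical, and `Sigma_theta` is assumed to be supplied by whichever theory is being compared.

```python
import numpy as np

def predicted_test_loss(X_test, y_test, theta_hat, Sigma_theta):
    """Evaluate Equation E.11: squared error at theta_hat plus the
    average of x_m^T Sigma_theta x_m over the test set."""
    resid = y_test - X_test @ theta_hat
    fit_term = np.mean(resid ** 2)
    spread_term = np.mean(
        np.einsum("md,de,me->m", X_test, Sigma_theta, X_test)
    )
    return fit_term + spread_term
```

The second term is the only place where the learning rate, batch size, and weight decay enter through $\Sigma_{\theta}$, which is why an inaccurate covariance prediction translates directly into an inaccurate test loss estimate.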
## Appendix FProofs from Main Text
###### Lemma F\.1\.
Assume the parameters $\psi$ are updated according to the discrete-time proxy algorithm [Section 4](https://arxiv.org/html/2606.00293#S4.Ex6), that $\mathcal{R}(\psi)=\frac{1}{2}\psi^{\top}\Gamma\psi$, and that the stationary distribution of $\psi$ exists. Then the stationary mean $\mu_{\psi}$ satisfies $\mu_{\psi}=\widehat{\theta}$.
###### Proof\.
Assume that $\psi_0$ is sampled from the stationary distribution. Then, taking expectations, we have

$$\begin{aligned}
\mu_{\psi}=\mathbb{E}\left[\psi_t\right]
&=\mathbb{E}\left[\psi_{t-1}-\Lambda\left\{G_t(\widehat{\theta})+\nabla G_t(\widehat{\theta})\left(\psi_{t-1}-\widehat{\theta}\right)\right\}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}\right]\\
&=\mu_{\psi}-\Lambda\mathbb{E}(G_t(\widehat{\theta}))-\Lambda\mathbb{E}\left(\nabla G_t(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\right).
\end{aligned}\tag{F.1}$$

Since $\nabla G_t(\widehat{\theta})$ depends only on the minibatch drawn at step $t$, which is independent of $\psi_{t-1}$, and $\mathbb{E}[\nabla G_t(\widehat{\theta})]=\mathcal{J}+\frac{1}{N}\Gamma$, we have

$$\Lambda\mathbb{E}(G_t(\widehat{\theta}))+\Lambda\left(\mathcal{J}+\frac{1}{N}\Gamma\right)(\mu_{\psi}-\widehat{\theta})=0.\tag{F.2}$$

Since $\widehat{\theta}$ satisfies $\nabla\mathcal{L}(\widehat{\theta})=0$, we have

$$\mathbb{E}(G_t(\widehat{\theta}))=\frac{\Gamma}{N}\widehat{\theta}+\frac{1}{N}\sum_{n=1}^{N}\nabla\ell\left(x_n,y_n,\widehat{\theta}\right)=0.\tag{F.3}$$

Therefore, combining the previous two displayed equations, and provided $\Lambda\left(\mathcal{J}+\frac{1}{N}\Gamma\right)$ is invertible, we conclude that $\mu_{\psi}=\widehat{\theta}$.
∎
### F\.1Proof of[Proposition4\.2](https://arxiv.org/html/2606.00293#S4.Thmtheorem2)
###### Proof\.
The proxy algorithm leads to the discrete\-time update
ψt=ψt−1−Λ\[Gt\(θ^\)\+∇Gt\(θ^\)\(ψt−1−θ^\)\]\+2β−1Λξt−1,\\displaystyle\\psi\_\{t\}=\\psi\_\{t\-1\}\-\\Lambda\\Bigl\[G\_\{t\}\(\\widehat\{\\theta\}\)\+\\nabla G\_\{t\}\(\\widehat\{\\theta\}\)\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\)\\,\\Bigr\]\+\\sqrt\{2\\beta^\{\-1\}\\Lambda\}\\,\\xi\_\{t\-1\},\(F\.4\)whereξt−1∼𝒩\(0,I\)\\xi\_\{t\-1\}\\sim\\mathcal\{N\}\(0,I\)\. Let
ηt−1=\[Gt\(θ^\)\+∇Gt\(θ^\)\(ψt−1−θ^\)\]−𝔼\[Gt\(θ^\)\+∇Gt\(θ^\)\(ψt−1−θ^\)\],\\displaystyle\\eta\_\{t\-1\}=\\Bigl\[G\_\{t\}\(\\widehat\{\\theta\}\)\+\\nabla G\_\{t\}\(\\widehat\{\\theta\}\)\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\)\\,\\Bigr\]\-\\mathbb\{E\}\\Bigl\[G\_\{t\}\(\\widehat\{\\theta\}\)\+\\nabla G\_\{t\}\(\\widehat\{\\theta\}\)\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\)\\,\\Bigr\],\(F\.5\)it follows from[EquationF\.3](https://arxiv.org/html/2606.00293#A6.E3)thatCov\(ηt−1\)=C¯ψ\\text\{Cov\}\(\\eta\_\{t\-1\}\)=\\overline\{C\}\_\{\\psi\}\. Noting that𝒥=1N∑n=1N𝒥n\\mathcal\{J\}=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\mathcal\{J\}\_\{n\}, we have
H^=𝔼\[∇Gt\(θ^\)\]=𝒥\+1NΓ\.\\displaystyle\\widehat\{H\}=\\mathbb\{E\}\\bigl\[\\nabla G\_\{t\}\(\\widehat\{\\theta\}\)\\bigr\]=\\mathcal\{J\}\+\\frac\{1\}\{N\}\\Gamma\.\(F\.6\)
[Equation12](https://arxiv.org/html/2606.00293#S4.E12)can also be rewritten as
$$\begin{aligned}
\psi_t-\widehat{\theta}
&=\psi_{t-1}-\widehat{\theta}-\Lambda\Bigl[G_t(\widehat{\theta})+\nabla G_t(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\Bigr]+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}\\
&=\psi_{t-1}-\widehat{\theta}-\Lambda\mathbb{E}\Bigl[G_t(\widehat{\theta})+\nabla G_t(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\Bigr]-\Lambda\eta_{t-1}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}\\
&=\psi_{t-1}-\widehat{\theta}-\Lambda\left(\frac{1}{N}\sum_{n=1}^{N}\mathcal{J}_n+\frac{1}{N}\Gamma\right)(\psi_{t-1}-\widehat{\theta})-\Lambda\eta_{t-1}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}\\
&=\left(I-\Lambda\mathcal{J}-\frac{1}{N}\Lambda\Gamma\right)\left(\psi_{t-1}-\widehat{\theta}\right)-\Lambda\eta_{t-1}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}\\
&=\left(I-\Lambda\widehat{H}\right)\left(\psi_{t-1}-\widehat{\theta}\right)-\Lambda\eta_{t-1}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}.
\end{aligned}\tag{F.7}$$

Note that
Σψ=𝔼\[\(ψt−θ^\)\(ψt−θ^\)⊤\]=\(I−ΛH^\)Σψ\(I−ΛH^\)⊤\+ΛC¯ψΛ\+2Λβ\.\\displaystyle\\begin\{aligned\} \\Sigma\_\{\\psi\}&=\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\}\-\\widehat\{\\theta\}\\right\)\\left\(\\psi\_\{t\}\-\\widehat\{\\theta\}\\right\)^\{\\top\}\\right\]\\\\ &=\\left\(I\-\\Lambda\\widehat\{H\}\\right\)\\Sigma\_\{\\psi\}\\left\(I\-\\Lambda\\widehat\{H\}\\right\)^\{\\top\}\+\\Lambda\\overline\{C\}\_\{\\psi\}\\Lambda\+\\frac\{2\\Lambda\}\{\\beta\}\.\\end\{aligned\}\(F\.8\)Then, after some algebra, we have
ΛH^Σψ\+ΣψH^Λ=Λ\(C¯ψ\+H^ΣψH^\)Λ\+2Λβ\.\\displaystyle\\Lambda\\widehat\{H\}\\Sigma\_\{\\psi\}\+\\Sigma\_\{\\psi\}\\widehat\{H\}\\Lambda=\\Lambda\\left\(\\overline\{C\}\_\{\\psi\}\+\\widehat\{H\}\\Sigma\_\{\\psi\}\\widehat\{H\}\\right\)\\Lambda\+\\frac\{2\\Lambda\}\{\\beta\}\.\(F\.9\)∎
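For a fixed noise covariance, the stationarity condition in Equation F.8 is a discrete Lyapunov equation in $\Sigma_{\psi}$ and can be solved with a single linear solve. The sketch below is an illustration under stated assumptions rather than the paper's procedure: the name `stationary_cov` is hypothetical, $\Lambda$ and $\widehat{H}$ are taken to be symmetric, and $\overline{C}_{\psi}$ is treated as a constant matrix, whereas in the paper it depends linearly on $\Sigma_{\psi}$, so a full solution would fold that dependence into the linear system or iterate.

```python
import numpy as np

def stationary_cov(Lam, H_hat, C_bar, beta):
    """Solve Sigma = A Sigma A^T + Lam C_bar Lam + 2 Lam / beta,
    with A = I - Lam H_hat (Equation F.8), for fixed C_bar.

    Assumes Lam and H_hat are symmetric and the spectral radius
    of A is below one, so the solution exists and is unique.
    """
    D = Lam.shape[0]
    A = np.eye(D) - Lam @ H_hat
    Q = Lam @ C_bar @ Lam + 2.0 * Lam / beta
    # Row-major vectorization: vec(A S A^T) = (A kron A) vec(S).
    lhs = np.eye(D * D) - np.kron(A, A)
    Sigma = np.linalg.solve(lhs, Q.ravel()).reshape(D, D)
    return 0.5 * (Sigma + Sigma.T)   # remove round-off asymmetry
```

The Kronecker system has size $D^{2}\times D^{2}$, so this direct approach is only sensible for moderate $D$; a dedicated Lyapunov solver would scale better.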
### F\.2Proof of[Theorem4\.3](https://arxiv.org/html/2606.00293#S4.Thmtheorem3)
The covariance of the gradient noise for parameterψ\\psiis given by
C\(ψ\)=\{1B\[1N∑n=1N∇ℓ~n\(ψ\)∇ℓ~n\(ψ\)⊤−∇ℒ~\(ψ\)∇ℒ~\(ψ\)⊤\]if with replacementN−BB\(N−1\)\[1N∑n=1N∇ℓ~n\(ψ\)∇ℓ~n\(ψ\)⊤−∇ℒ~\(ψ\)∇ℒ~\(ψ\)⊤\]if without replacement\.\\displaystyle C\(\\psi\)=\\begin\{cases\}\\frac\{1\}\{B\}\\left\[\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\nabla\\tilde\{\\ell\}\_\{n\}\(\\psi\)\\nabla\\tilde\{\\ell\}\_\{n\}\(\\psi\)^\{\\top\}\-\\nabla\\tilde\{\\mathcal\{L\}\}\(\\psi\)\\nabla\\tilde\{\\mathcal\{L\}\}\(\\psi\)^\{\\top\}\\right\]&\\text\{if with replacement\}\\\\ \\frac\{N\-B\}\{B\(N\-1\)\}\\left\[\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\nabla\\tilde\{\\ell\}\_\{n\}\(\\psi\)\\nabla\\tilde\{\\ell\}\_\{n\}\(\\psi\)^\{\\top\}\-\\nabla\\tilde\{\\mathcal\{L\}\}\(\\psi\)\\nabla\\tilde\{\\mathcal\{L\}\}\(\\psi\)^\{\\top\}\\right\]&\\text\{if without replacement\.\}\\end\{cases\}\(F\.10\)We focus on sampling with replacement since our results can be easily extended to the sampling without replacement case by substituting eachC\(ψ\)C\(\\psi\)term withN−BN−1C\(ψ\)\\frac\{N\-B\}\{N\-1\}C\(\\psi\)\.
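Equation F.10 and the finite-population correction can be written compactly given the per-example gradients at a fixed parameter value. This is a minimal sketch; the function name and the `(N, D)` array layout are assumptions made for illustration.

```python
import numpy as np

def minibatch_grad_cov(per_example_grads, B, with_replacement=True):
    """Covariance of the average of B per-example gradients (Equation F.10)."""
    g = np.asarray(per_example_grads, dtype=float)   # shape (N, D)
    N = g.shape[0]
    g_bar = g.mean(axis=0)                           # full-data gradient
    pop_cov = g.T @ g / N - np.outer(g_bar, g_bar)   # population covariance
    C = pop_cov / B
    if not with_replacement:
        C = C * (N - B) / (N - 1)                    # finite-population correction
    return C
```

Two limiting cases make the correction factor easy to interpret: with $B=1$ the two sampling schemes agree, and with $B=N$ sampling without replacement uses the full dataset, so the gradient noise vanishes.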
We have
C\(ψt−1\)=1NB∑n=1N∇ℓn\(ψt−1\)∇ℓn\(ψt−1\)⊤−1B∇ℒ\(ψt−1\)∇ℒ\(ψt−1\)⊤=1B1N∑n=1N\[∇ℓ\(xn,yn,θ^\)\+𝒥n\(ψt−1−θ^\)\]\[∇ℓ\(xn,yn,θ^\)\+𝒥n\(ψt−1−θ^\)\]⊤⏟C3\(ψt−1\)−1B\[1N∑n=1N∇ℓ\(xn,yn,θ^\)\+𝒥n\(ψt−1−θ^\)\]\[1N∑n=1N∇ℓ\(xn,yn,θ^\)\+𝒥n\(ψt−1−θ^\)\]⊤⏟C4\(ψt−1\)\.\\displaystyle\\begin\{aligned\} C\(\\psi\_\{t\-1\}\)&=\\frac\{1\}\{NB\}\\sum\_\{n=1\}^\{N\}\\nabla\\ell\_\{n\}\\left\(\\psi\_\{t\-1\}\\right\)\\nabla\\ell\_\{n\}\\left\(\\psi\_\{t\-1\}\\right\)^\{\\top\}\-\\frac\{1\}\{B\}\\nabla\\mathcal\{L\}\\left\(\\psi\_\{t\-1\}\\right\)\\nabla\\mathcal\{L\}\\left\(\\psi\_\{t\-1\}\\right\)^\{\\top\}\\\\ &=\\underbrace\{\\frac\{1\}\{B\}\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\left\[\\nabla\\ell\\left\(x\_\{n\},y\_\{n\},\\widehat\{\\theta\}\\right\)\+\\mathcal\{J\}\_\{n\}\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)\\right\]\\left\[\\nabla\\ell\\left\(x\_\{n\},y\_\{n\},\\widehat\{\\theta\}\\right\)\+\\mathcal\{J\}\_\{n\}\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)\\right\]^\{\\top\}\}\_\{C\_\{3\}\(\\psi\_\{t\-1\}\)\}\\\\ &\\quad\-\\underbrace\{\\frac\{1\}\{B\}\\left\[\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\nabla\\ell\\left\(x\_\{n\},y\_\{n\},\\widehat\{\\theta\}\\right\)\+\\mathcal\{J\}\_\{n\}\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)\\right\]\\left\[\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\nabla\\ell\\left\(x\_\{n\},y\_\{n\},\\widehat\{\\theta\}\\right\)\+\\mathcal\{J\}\_\{n\}\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)\\right\]^\{\\top\}\}\_\{C\_\{4\}\(\\psi\_\{t\-1\}\)\}\.\\end\{aligned\}\(F\.11\)Note that
$$
\begin{aligned}
\mathbb{E}\left[C_{3}(\psi_{t-1})\right]
&=\frac{1}{BN}\sum_{n=1}^{N}\left[\nabla\ell(x_{n},y_{n},\widehat{\theta})\right]\left[\nabla\ell(x_{n},y_{n},\widehat{\theta})\right]^{\top}
+\frac{1}{BN}\sum_{n=1}^{N}\mathcal{J}_{n}\,\mathbb{E}\left[(\psi_{t-1}-\widehat{\theta})(\psi_{t-1}-\widehat{\theta})^{\top}\right]\mathcal{J}_{n}^{\top}\\
&=\frac{1}{BN}\sum_{n=1}^{N}\left[\nabla\ell(x_{n},y_{n},\widehat{\theta})\right]\left[\nabla\ell(x_{n},y_{n},\widehat{\theta})\right]^{\top}
+\frac{1}{BN}\sum_{n=1}^{N}\mathcal{J}_{n}\Sigma_{\psi}\mathcal{J}_{n}\\
&=\frac{1}{B}\mathcal{I}+\frac{1}{BN}\sum_{n=1}^{N}\mathcal{J}_{n}\Sigma_{\psi}\mathcal{J}_{n}.
\end{aligned}
\tag{F.12}
$$

Also note that, using [Equation F.3](https://arxiv.org/html/2606.00293#A6.E3),

$$
\begin{aligned}
\mathbb{E}\left[C_{4}(\psi_{t-1})\right]
&=\frac{1}{B}\left\{\left(\frac{1}{N}\sum_{n=1}^{N}\nabla\ell(x_{n},y_{n},\widehat{\theta})\right)\left(\frac{1}{N}\sum_{n=1}^{N}\nabla\ell(x_{n},y_{n},\widehat{\theta})\right)^{\top}
+\mathcal{J}\,\mathbb{E}\left[(\psi_{t-1}-\widehat{\theta})(\psi_{t-1}-\widehat{\theta})^{\top}\right]\mathcal{J}\right\}\\
&=\frac{1}{B}\left(\frac{1}{N^{2}}\Gamma\widehat{\theta}\widehat{\theta}^{\top}\Gamma^{\top}+\mathcal{J}\Sigma_{\psi}\mathcal{J}\right).
\end{aligned}
\tag{F.13}
$$

Therefore, we have

$$
\begin{aligned}
\overline{C}_{\psi}
&=\mathbb{E}\left[C_{3}(\psi_{t-1})\right]-\mathbb{E}\left[C_{4}(\psi_{t-1})\right]\\
&=\frac{1}{B}\left(\mathcal{I}-\frac{1}{N^{2}}\Gamma\widehat{\theta}\widehat{\theta}^{\top}\Gamma^{\top}+\frac{1}{N}\sum_{n=1}^{N}\mathcal{J}_{n}\Sigma_{\psi}\mathcal{J}_{n}-\mathcal{J}\Sigma_{\psi}\mathcal{J}\right).
\end{aligned}
\tag{F.14}
$$
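As an illustration of how the right-hand side of (F.14) could be evaluated numerically, the following is a minimal sketch rather than the authors' implementation. It assumes that $\mathcal{I}$ is the average outer product of per-example gradients at $\widehat{\theta}$ (as in the last line of (F.12)), that $\mathcal{J}$ is the average of the $\mathcal{J}_{n}$, and it uses the mean-gradient outer product from the first line of (F.13) in place of the $\Gamma\widehat{\theta}$ form. The function name and the synthetic inputs are hypothetical.

```python
import numpy as np


def mean_noise_covariance(grads, hessians, sigma_psi, batch_size):
    """Evaluate the right-hand side of (F.14) from per-example quantities.

    grads:     (N, D) per-example gradients at theta_hat
    hessians:  (N, D, D) per-example Hessians J_n at theta_hat
    sigma_psi: (D, D) stationary covariance of the proxy iterates
    """
    n = grads.shape[0]
    fisher = grads.T @ grads / n          # (1/N) sum_n g_n g_n^T
    g_bar = grads.mean(axis=0)            # (1/N) sum_n g_n
    j_bar = hessians.mean(axis=0)         # assumed: J = (1/N) sum_n J_n
    # (1/N) sum_n J_n Sigma J_n
    per_example = np.einsum("nij,jk,nkl->il", hessians, sigma_psi, hessians) / n
    pooled = j_bar @ sigma_psi @ j_bar    # J Sigma J
    return (fisher - np.outer(g_bar, g_bar) + per_example - pooled) / batch_size


# Synthetic inputs, for illustration only.
rng = np.random.default_rng(0)
N, D, B = 50, 3, 8
grads = rng.normal(size=(N, D))
raw = rng.normal(size=(N, D, D))
hessians = raw @ raw.transpose(0, 2, 1) / D   # symmetric PSD per-example Hessians
sigma_psi = 0.1 * np.eye(D)
C_bar = mean_noise_covariance(grads, hessians, sigma_psi, B)
```

When all per-example Hessians coincide, the last two terms cancel and the result reduces to the empirical gradient covariance divided by $B$, which is a convenient sanity check.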
### F.3 Proof of [Proposition 4.4](https://arxiv.org/html/2606.00293#S4.Thmtheorem4)
We analyze the mixing behavior under the proxy dynamics of [Section 4](https://arxiv.org/html/2606.00293#S4.Ex6). For each coordinate projection $f_{i}(\theta):=\theta_{i}$, the theoretical lag-$k$ autocorrelation is defined as

$$
\begin{aligned}
\rho_{k,i}
&:=\mathrm{Corr}_{\pi_{\theta}}\bigl(\theta_{0,i},\theta_{k,i}\bigr)
=\frac{\mathrm{Cov}_{\pi_{\theta}}(\theta_{0,i},\theta_{k,i})}{\mathrm{Var}_{\pi_{\theta}}(\theta_{0,i})}
=\frac{\mathbb{E}_{\pi_{\theta}}\left[(\theta_{0,i}-\widehat{\theta}_{i})(\theta_{k,i}-\widehat{\theta}_{i})\right]}{(\Sigma_{\psi})_{ii}}\\
&=\frac{\bigl(\mathbb{E}_{\pi_{\theta}}\left[(\theta_{0}-\widehat{\theta})(\theta_{k}-\widehat{\theta})^{\top}\right]\bigr)_{ii}}{(\Sigma_{\psi})_{ii}}.
\end{aligned}
\tag{F.15}
$$
Under the proxy update [Equation 12](https://arxiv.org/html/2606.00293#S4.E12), the iterates satisfy

$$
\psi_{t}-\widehat{\theta}=(I-\Lambda\widehat{H})(\psi_{t-1}-\widehat{\theta})-\Lambda\eta_{t-1}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1},
\tag{F.16}
$$

where $\xi_{t-1}\sim\mathcal{N}(0,I)$ and

$$
\eta_{t-1}=\Bigl[G_{t}(\widehat{\theta})+\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\Bigr]-\mathbb{E}\Bigl[G_{t}(\widehat{\theta})+\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\Bigr].
\tag{F.17}
$$

Iterating forward,

$$
\psi_{t+k}-\widehat{\theta}=(I-\Lambda\widehat{H})(\psi_{t+k-1}-\widehat{\theta})-\Lambda\eta_{t+k-1}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t+k-1}.
\tag{F.18}
$$
Define the lag-$k$ cross-covariance matrix

$$
\Xi_{k}:=\mathbb{E}_{\pi_{\psi}}\left[(\psi_{t+k}-\widehat{\theta})(\psi_{t}-\widehat{\theta})^{\top}\right].
\tag{F.19}
$$

Multiplying (F.18) by $(\psi_{t}-\widehat{\theta})^{\top}$ and taking expectations, the Gaussian term drops out because $\xi_{t+k-1}$ is mean-zero and independent of $\psi_{t}$. Then, for $k\geq 1$,

$$
\Xi_{k}=(I-\Lambda\widehat{H})\,\Xi_{k-1}-\Lambda\,\mathbb{E}_{\pi_{\psi}}\left[\eta_{t+k-1}(\psi_{t}-\widehat{\theta})^{\top}\right].
\tag{F.20}
$$

Since $\eta_{t+k-1}$ is conditionally mean-zero given the past,

$$
\mathbb{E}_{\pi_{\psi}}\left[\eta_{t+k-1}(\psi_{t}-\widehat{\theta})^{\top}\right]
=\mathbb{E}\left[\mathbb{E}\left[\eta_{t+k-1}(\psi_{t}-\widehat{\theta})^{\top}\mid\mathcal{F}_{t+k-1}\right]\right]=0.
\tag{F.21}
$$

Therefore,

$$
\Xi_{k}=(I-\Lambda\widehat{H})^{k}\,\Xi_{0}=(I-\Lambda\widehat{H})^{k}\,\Sigma_{\psi}.
\tag{F.22}
$$
Consequently, the lag-$k$ autocorrelation for coordinate $i$ is

$$
\rho_{k,i}=\frac{\bigl((I-\Lambda\widehat{H})^{k}\Sigma_{\psi}\bigr)_{ii}}{(\Sigma_{\psi})_{ii}}.
\tag{F.23}
$$
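The formula (F.23) is straightforward to evaluate once $\widehat{H}$, $\Lambda$, and $\Sigma_{\psi}$ are available. The sketch below is illustrative only: to obtain some stationary covariance it assumes a simplified recursion with Gaussian noise alone (the minibatch noise $\eta$ is ignored), and finds $\Sigma$ by iterating $\Sigma\leftarrow A\Sigma A^{\top}+Q$. The function names and the numerical values are hypothetical.

```python
import numpy as np


def stationary_covariance(A, Q, iters=2000):
    """Fixed-point iteration for Sigma = A Sigma A^T + Q.

    Converges when the spectral radius of A is below one.
    """
    sigma = np.zeros_like(Q)
    for _ in range(iters):
        sigma = A @ sigma @ A.T + Q
    return sigma


def lag_autocorrelation(H, Lam, sigma, k):
    """Coordinate-wise rho_{k,i} from (F.23)."""
    A = np.eye(H.shape[0]) - Lam @ H
    xi_k = np.linalg.matrix_power(A, k) @ sigma   # (I - Lam H)^k Sigma
    return np.diag(xi_k) / np.diag(sigma)


# Illustrative values (assumptions, not taken from the paper).
H = np.array([[2.0, 0.5], [0.5, 1.0]])
lam, beta = 0.1, 10.0
Lam = lam * np.eye(2)
A = np.eye(2) - Lam @ H
# Additive Gaussian noise only: Q = 2 * beta^{-1} * Lam.
sigma = stationary_covariance(A, 2.0 / beta * Lam)
rhos = np.array([lag_autocorrelation(H, Lam, sigma, k) for k in range(6)])
```

Each row of `rhos` holds the autocorrelations of all coordinates at one lag; the first row is identically one by construction.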
When $\Lambda=\lambda I$ and $\widehat{H}$ is symmetric positive definite, the eigenvalues of $I-\lambda\widehat{H}$ are $1-\lambda\mu_{i}(\widehat{H})$. Hence

$$
\|I-\lambda\widehat{H}\|_{2}=\max_{i}|1-\lambda\mu_{i}(\widehat{H})|.
\tag{F.24}
$$

According to the condition $0 < \lambda < 2/\mu_{\max}(\widehat{H})$, we have $0 < \lambda\mu_{i}(\widehat{H}) < 2$ for all eigenvalues $\mu_{i}(\widehat{H})$, and therefore

$$
|1-\lambda\mu_{i}(\widehat{H})| < 1.
\tag{F.25}
$$

Since $\widehat{H}$ is symmetric, so is $A:=I-\lambda\widehat{H}$, and the spectral norm of a symmetric matrix $A$ satisfies

$$
\|A\|_{2}:=\sup_{\|x\|_{2}=1}\|Ax\|_{2}=\max_{i}|\mu_{i}(A)|.
\tag{F.26}
$$

Hence,

$$
\|I-\lambda\widehat{H}\|_{2}=\max_{i}|1-\lambda\mu_{i}(\widehat{H})| < 1.
\tag{F.27}
$$

Therefore, the Neumann series converges and

$$
\sum_{k=0}^{\infty}(I-\lambda\widehat{H})^{k}=(\lambda\widehat{H})^{-1}.
\tag{F.28}
$$

Hence,

$$
\sum_{k=1}^{\infty}\rho_{k,i}
=\frac{\bigl(((\Lambda\widehat{H})^{-1}-I)\Sigma_{\psi}\bigr)_{ii}}{(\Sigma_{\psi})_{ii}}
=\frac{\bigl((\Lambda\widehat{H})^{-1}\Sigma_{\psi}\bigr)_{ii}}{(\Sigma_{\psi})_{ii}}-1.
\tag{F.29}
$$
The coordinate-wise integrated autocorrelation time

$$
\tau_{\mathrm{int}}(f_{i}):=1+2\sum_{k=1}^{\infty}\rho_{k,i}
\tag{F.30}
$$

thus satisfies

$$
\tau_{\mathrm{int}}(f_{i})=2\,\frac{\bigl((\Lambda\widehat{H})^{-1}\Sigma_{\psi}\bigr)_{ii}}{(\Sigma_{\psi})_{ii}}-1.
\tag{F.31}
$$
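The closed form (F.31) can be checked numerically against a truncated version of the defining sum (F.30). The following is a minimal sketch under the assumption $\Lambda=\lambda I$; the function names, the test matrices, and the truncation length are hypothetical choices, and $\Sigma_{\psi}$ is replaced by an arbitrary symmetric positive definite matrix.

```python
import numpy as np


def iat_closed_form(H, lam, sigma):
    """tau_int(f_i) from (F.31) with Lambda = lam * I."""
    m = np.linalg.solve(lam * H, sigma)           # (lam H)^{-1} Sigma
    return 2.0 * np.diag(m) / np.diag(sigma) - 1.0


def iat_truncated_sum(H, lam, sigma, n_lags=5000):
    """1 + 2 * sum_{k=1}^{n_lags} rho_{k,i}, with rho_{k,i} from (F.23)."""
    d = H.shape[0]
    A = np.eye(d) - lam * H
    power = np.eye(d)
    total = np.zeros(d)
    for _ in range(n_lags):
        power = power @ A                         # now equals A^k
        total += np.diag(power @ sigma) / np.diag(sigma)
    return 1.0 + 2.0 * total


# Illustrative inputs (assumptions, not taken from the paper).
H = np.array([[2.0, 0.5], [0.5, 1.0]])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
lam = 0.1                                         # satisfies lam < 2 / mu_max(H)
tau_exact = iat_closed_form(H, lam, sigma)
tau_sum = iat_truncated_sum(H, lam, sigma)
```

The two results agree up to the truncation error, which decays geometrically at rate $\|I-\lambda\widehat{H}\|_{2}$.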
Let $w:=\Sigma_{\psi}^{1/2}v$. Then

$$
\frac{v^{\top}(\Lambda\widehat{H})^{-1}\Sigma_{\psi}v}{v^{\top}\Sigma_{\psi}v}
=\frac{w^{\top}\Bigl(\Sigma_{\psi}^{1/2}(\Lambda\widehat{H})^{-1}\Sigma_{\psi}^{-1/2}\Bigr)w}{w^{\top}w}.
\tag{F.32}
$$

The matrix

$$
M:=\Sigma_{\psi}^{1/2}(\Lambda\widehat{H})^{-1}\Sigma_{\psi}^{-1/2}
\tag{F.33}
$$

is similar to $(\Lambda\widehat{H})^{-1}$, hence $\mu_{\max}(M)=\mu_{\max}((\Lambda\widehat{H})^{-1})=1/\mu_{\min}(\Lambda\widehat{H})$.

By Rayleigh–Ritz,

$$
\sup_{v\neq 0}\frac{v^{\top}(\Lambda\widehat{H})^{-1}\Sigma_{\psi}v}{v^{\top}\Sigma_{\psi}v}
=\sup_{w\neq 0}\frac{w^{\top}Mw}{w^{\top}w}
=\mu_{\max}(M)=\frac{1}{\mu_{\min}(\Lambda\widehat{H})}.
\tag{F.34}
$$

Then, writing $f_{v}(\theta):=v^{\top}\theta$ for the linear functional in direction $v$, we have

$$
\tau:=\sup_{v}\tau_{\mathrm{int}}(f_{v})=2\cdot\frac{1}{\mu_{\min}(\Lambda\widehat{H})}-1.
\tag{F.35}
$$
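One way to see where the value in (F.35) comes from is to evaluate the directional integrated autocorrelation time along the eigenvector of $\widehat{H}$ with the smallest eigenvalue, where it equals $2/\mu_{\min}(\lambda\widehat{H})-1$ for any symmetric positive definite $\Sigma_{\psi}$. The sketch below assumes $\Lambda=\lambda I$ and $f_{v}(\theta)=v^{\top}\theta$; function names and inputs are hypothetical.

```python
import numpy as np


def iat_along(H, lam, sigma, v):
    """tau_int(f_v) = 2 * v^T (lam H)^{-1} Sigma v / (v^T Sigma v) - 1."""
    num = v @ np.linalg.solve(lam * H, sigma @ v)
    den = v @ sigma @ v
    return 2.0 * num / den - 1.0


def iat_worst_case(H, lam):
    """Right-hand side of (F.35) with Lambda = lam * I."""
    mu_min = np.linalg.eigvalsh(lam * H)[0]       # eigenvalues are ascending
    return 2.0 / mu_min - 1.0


# Illustrative inputs (assumptions, not taken from the paper).
H = np.array([[2.0, 0.5], [0.5, 1.0]])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
lam = 0.1
eigvals, eigvecs = np.linalg.eigh(H)
v_slow = eigvecs[:, 0]                            # slowest-mixing direction
tau_slow = iat_along(H, lam, sigma, v_slow)
tau_sup = iat_worst_case(H, lam)
```

Here `tau_slow` and `tau_sup` coincide, reflecting that the slowest direction of $\widehat{H}$ drives the worst-case autocorrelation time.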
### F.4 Proof of [Theorem 4.5](https://arxiv.org/html/2606.00293#S4.Thmtheorem5)
Since there exists a coupling of $\theta_{0}\sim\nu$ and $\psi_{0}\sim\nu^{\prime}$ such that $W_{2}^{2}(\nu,\nu^{\prime})=\mathbb{E}(\|\theta_{0}-\psi_{0}\|^{2})$, we assume $(\theta_{0},\psi_{0})$ follow this joint distribution. Using the recursions for $\theta_{t}$ and $\psi_{t}$, and using the assumption that $\Lambda=\lambda I$, we have

$$
\begin{aligned}
\|\theta_{t}-\psi_{t}\|^{2}
&=\|\theta_{t-1}-\psi_{t-1}\|^{2}
+\underbrace{\Bigl\|\lambda\Bigl[G_{t}(\theta_{t-1})-G_{t}(\widehat{\theta})-\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\Bigr]\Bigr\|^{2}}_{\star}\\
&\quad\underbrace{-\,2\lambda\left\langle\theta_{t-1}-\psi_{t-1},\,G_{t}(\theta_{t-1})-G_{t}(\widehat{\theta})-\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\right\rangle}_{\star\star}.
\end{aligned}
\tag{F.36}
$$

Let $(\mathcal{F}_{t})_{t\geq 0}$ denote the filtration associated with $\{(\theta_{t},\psi_{t})\}_{t\geq 0}$ and $\mathbb{E}_{t}:=\mathbb{E}(\cdot\mid\mathcal{F}_{t})$. Let $I$ denote an independent random variable uniformly distributed on $\{1,\dots,N\}$. We can bound the expected squared error as
\begin{align}
\mathbb{E}_{t-1}(\star)
&=\lambda^{2}\,\mathbb{E}_{t-1}\left[\left\|G_{t}(\theta_{t-1})-G_{t}(\widehat{\theta})-\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\right\|^{2}\right] \tag{F.37}\\
&\leq\frac{\lambda^{2}}{B}\sum_{n\in S_{t}}\mathbb{E}_{t-1}\left[\left\|\nabla\ell_{n}(\theta_{t-1})-\nabla\ell_{n}(\widehat{\theta})-\mathcal{J}_{n}(\psi_{t-1}-\widehat{\theta})\right\|^{2}\right] \tag{F.38}\\
&=\lambda^{2}\,\mathbb{E}_{t-1}\left[\left\|\nabla\ell_{I}(\theta_{t-1})-\nabla\ell_{I}(\widehat{\theta})-\mathcal{J}_{I}(\psi_{t-1}-\widehat{\theta})\right\|^{2}\right] \tag{F.39}\\
&\leq 2\lambda^{2}\,\mathbb{E}_{t-1}\left[\left\|\nabla\ell_{I}(\theta_{t-1})-\nabla\ell_{I}(\psi_{t-1})\right\|^{2}\right]
+2\lambda^{2}\,\mathbb{E}_{t-1}\left[\left\|\nabla\ell_{I}(\psi_{t-1})-\nabla\ell_{I}(\widehat{\theta})-\mathcal{J}_{I}(\psi_{t-1}-\widehat{\theta})\right\|^{2}\right]. \tag{F.40}
\end{align}

It follows from Taylor's remainder theorem and Assumption [(B)](https://arxiv.org/html/2606.00293#S4.I1.i2) that

$$
\left\|\nabla\ell_{n}(\psi_{t-1})-\nabla\ell_{n}(\widehat{\theta})-\mathcal{J}_{n}(\psi_{t-1}-\widehat{\theta})\right\|^{2}\leq\frac{M_{n}^{2}}{4}\|\psi_{t-1}-\widehat{\theta}\|^{4}.
\tag{F.41}
$$

Using the fact that convexity and $L$-smoothness imply $L$-co-coercivity, we thus obtain

\begin{align}
&\mathbb{E}_{t-1}(\star) \tag{F.42}\\
&\leq 2L\lambda^{2}\,\mathbb{E}_{t-1}\left[\left\langle\theta_{t-1}-\psi_{t-1},\,\nabla\ell_{I}(\theta_{t-1})-\nabla\ell_{I}(\psi_{t-1})\right\rangle\right]
+\frac{\lambda^{2}}{2}\,\mathbb{E}_{t-1}\left[M_{I}^{2}\right]\|\psi_{t-1}-\widehat{\theta}\|^{4} \tag{F.43}\\
&=2L\lambda^{2}\left\langle\theta_{t-1}-\psi_{t-1},\,\nabla\mathcal{L}(\theta_{t-1})-\nabla\mathcal{L}(\psi_{t-1})\right\rangle
+\frac{\lambda^{2}\,\overline{M^{2}}}{2}\|\psi_{t-1}-\widehat{\theta}\|^{4}. \tag{F.44}
\end{align}

Furthermore, for any $c > 0$, and again using Taylor's remainder theorem and Assumption [(B)](https://arxiv.org/html/2606.00293#S4.I1.i2), we have
$$
\begin{aligned}
\mathbb{E}_{t-1}(\star\star)
&=-2\lambda\,\mathbb{E}_{t-1}\left[\left\langle\theta_{t-1}-\psi_{t-1},\,\nabla\ell_{I}(\theta_{t-1})-\nabla\ell_{I}(\widehat{\theta})-\mathcal{J}_{I}(\psi_{t-1}-\widehat{\theta})\right\rangle\right]\\
&=-2\lambda\left\langle\theta_{t-1}-\psi_{t-1},\,\nabla\mathcal{L}(\theta_{t-1})-\nabla\mathcal{L}(\widehat{\theta})-\mathcal{J}(\psi_{t-1}-\widehat{\theta})\right\rangle\\
&\leq-2\lambda\left\langle\theta_{t-1}-\psi_{t-1},\,\nabla\mathcal{L}(\theta_{t-1})-\nabla\mathcal{L}(\psi_{t-1})\right\rangle
+\lambda\overline{M}\,\|\theta_{t-1}-\psi_{t-1}\|\,\|\psi_{t-1}-\widehat{\theta}\|^{2}\\
&\leq-2\lambda\left\langle\theta_{t-1}-\psi_{t-1},\,\nabla\mathcal{L}(\theta_{t-1})-\nabla\mathcal{L}(\psi_{t-1})\right\rangle
+2\lambda c\,\|\theta_{t-1}-\psi_{t-1}\|^{2}
+\frac{\lambda\overline{M}^{2}}{8c}\|\psi_{t-1}-\widehat{\theta}\|^{4}.
\end{aligned}
\tag{F.45}
$$

Thus, using Assumption [(C)](https://arxiv.org/html/2606.00293#S4.I1.i3) and choosing $c=\mu/2$, we have

\begin{align}
\mathbb{E}(\|\theta_{t}-\psi_{t}\|^{2})
&\leq\bigl(1-2\lambda\mu+2\lambda c+2\lambda^{2}\mu L\bigr)\,\mathbb{E}(\|\theta_{t-1}-\psi_{t-1}\|^{2})
+\left\{\frac{\lambda^{2}\,\overline{M^{2}}}{2}+\frac{\lambda\overline{M}^{2}}{8c}\right\}\mathbb{E}\left(\|\psi_{t-1}-\widehat{\theta}\|^{4}\right) \tag{F.46}\\
&=\bigl\{1-\lambda\mu(1-2\lambda L)\bigr\}\,\mathbb{E}(\|\theta_{t-1}-\psi_{t-1}\|^{2})
+\lambda\left\{\frac{\lambda\,\overline{M^{2}}}{2}+\frac{\overline{M}^{2}}{4\mu}\right\}\mathbb{E}\left(\|\psi_{t-1}-\widehat{\theta}\|^{4}\right). \tag{F.47}
\end{align}

Hence, we obtain the overall bound given in [Equation 21](https://arxiv.org/html/2606.00293#S4.E21).
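To make the one-step recursion (F.47) concrete, the sketch below unrolls it under the simplifying assumption that $\mathbb{E}\|\psi_{t-1}-\widehat{\theta}\|^{4}$ is replaced by a constant (for instance a stationary bound). It is not a restatement of Equation 21, only an illustration of how the contraction factor and the drift term interact. Function names and numerical values are hypothetical.

```python
def contraction_factor(lam, mu, L):
    """Coefficient 1 - lam*mu*(1 - 2*lam*L) multiplying the previous error in (F.47)."""
    return 1.0 - lam * mu * (1.0 - 2.0 * lam * L)


def drift_coefficient(lam, mu, M2_bar, M_bar):
    """Coefficient lam*(lam*M2_bar/2 + M_bar^2/(4*mu)) multiplying the fourth moment."""
    return lam * (lam * M2_bar / 2.0 + M_bar**2 / (4.0 * mu))


def unroll_bound(e0, fourth_moment, lam, mu, L, M2_bar, M_bar, steps):
    """Iterate e_t = rho * e_{t-1} + b with a constant fourth-moment value (assumption)."""
    rho = contraction_factor(lam, mu, L)
    b = drift_coefficient(lam, mu, M2_bar, M_bar) * fourth_moment
    e = e0
    for _ in range(steps):
        e = rho * e + b
    return e


# Illustrative values (assumptions): lam < 1/(2L) makes the factor contractive.
lam, mu, L = 0.01, 1.0, 10.0
rho = contraction_factor(lam, mu, L)
limit = drift_coefficient(lam, mu, 4.0, 2.0) * 3.0 / (1.0 - rho)
bound_after_many_steps = unroll_bound(5.0, 3.0, lam, mu, L, 4.0, 2.0, 5000)
```

With a contractive factor, the unrolled bound forgets the initial error geometrically and approaches the fixed point $b/(1-\rho)$.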
### F.5 Proof of [Corollary 4.6](https://arxiv.org/html/2606.00293#S4.Thmtheorem6)
First, we give a lemma bounding the stationary fourth moment of $\psi_{t-1}$.
###### Lemma F\.2\.
Under the conditions of [Corollary 4.6](https://arxiv.org/html/2606.00293#S4.Thmtheorem6), if $\hat{\mu}$ and $\hat{L}$ denote, respectively, the smallest and largest eigenvalues of $\widehat{H}=\nabla^{2}\mathcal{L}(\widehat{\theta})$ and $\lambda\leq\min\{1/(4\hat{\mu}),\,B\hat{\mu}/(200L^{2})\}$, then $\psi_{\infty}\sim\pi_{\psi}$ satisfies

$$
\mathbb{E}(\|\psi_{\infty}-\widehat{\theta}\|^{4})
\leq 96\,\frac{\lambda^{2}\tau_{4}^{4}}{\hat{\mu}^{2}B^{2}}
+24\,\frac{\lambda\tau_{4}^{2}}{\hat{\mu}^{2}B\beta}\,D
+12\,\frac{D^{2}}{\hat{\mu}^{2}\beta^{2}}
+48\,\frac{\lambda D(D+2)}{\hat{\mu}\beta^{2}},
\tag{F.48}
$$

where $\tau_{4}^{4}:=N^{-1}\sum_{n=1}^{N}\bigl\|\nabla\ell(x_{n},y_{n},\widehat{\theta})\bigr\|^{4}$.
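For orientation, the bound (F.48) and its step-size condition can be evaluated with a few lines of code. This is a minimal sketch with hypothetical function and argument names; the inputs are whatever estimates of $\tau_{4}$, $\hat{\mu}$, and $L$ are available in a given application.

```python
def step_size_ok(lam, mu_hat, L, B):
    """Check lam <= min{1/(4*mu_hat), B*mu_hat/(200*L^2)} from Lemma F.2."""
    return lam <= min(1.0 / (4.0 * mu_hat), B * mu_hat / (200.0 * L**2))


def fourth_moment_bound(lam, tau4, mu_hat, B, beta, D):
    """Right-hand side of (F.48)."""
    return (
        96.0 * lam**2 * tau4**4 / (mu_hat**2 * B**2)
        + 24.0 * lam * tau4**2 * D / (mu_hat**2 * B * beta)
        + 12.0 * D**2 / (mu_hat**2 * beta**2)
        + 48.0 * lam * D * (D + 2) / (mu_hat * beta**2)
    )


# Illustrative values (assumptions, not taken from the paper).
lam, tau4, mu_hat, L, B, beta, D = 1e-3, 1.5, 1.0, 2.0, 32, 100.0, 5
if step_size_ok(lam, mu_hat, L, B):
    bound = fourth_moment_bound(lam, tau4, mu_hat, B, beta, D)
```

Only the third term survives as $\lambda\to 0$, so the bound is then governed by $D^{2}/(\hat{\mu}^{2}\beta^{2})$.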
###### Proof\.
The recursion for $\psi_{t}$ can be rewritten as

$$
\psi_{t}-\widehat{\theta}=(I-\lambda\widehat{H})(\psi_{t-1}-\widehat{\theta})-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1},
\tag{F.49}
$$

where $\eta_{t-1}:=G_{t}(\widehat{\theta})+\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})-\widehat{H}(\psi_{t-1}-\widehat{\theta})$ and $\xi_{t}\sim\mathcal{N}(0,I_{D})$.
Since the minibatch at time $t$ is independent of $\psi_{t-1}$ and $\mathbb{E}[G_{t}(\widehat{\theta})]=\nabla\mathcal{L}(\widehat{\theta})=0$, $\mathbb{E}[\nabla G_{t}(\widehat{\theta})]=\nabla^{2}\mathcal{L}(\widehat{\theta})=\widehat{H}$, we have $\mathbb{E}_{t-1}[\eta_{t-1}]=0$. Using the multinomial formula and the fact that the expected gradient is zero at $\widehat{\theta}$, we obtain

$$
\mathbb{E}\bigl\{\|G_{t}(\widehat{\theta})\|^{4}\bigr\}
\leq\frac{3}{B^{2}}\underbrace{\mathbb{E}\left\{\bigl\|\nabla\ell(x_{I},y_{I},\widehat{\theta})\bigr\|^{4}\right\}}_{=:\,\tau_{4}^{4}}
\quad\text{and}\quad
\mathbb{E}\|G_{t}(\widehat{\theta})\|^{2}\leq\frac{\tau_{4}^{2}}{B}.
\tag{F.50}
$$

Fix $u:=\psi_{t-1}-\widehat{\theta}$, which is $\mathcal{F}_{t-1}$-measurable. With minibatch sampling *with replacement*, we can write
$$
\nabla G_{t}(\widehat{\theta})=\frac{1}{B}\sum_{b=1}^{B}\nabla^{2}\ell(x_{I_{b}},y_{I_{b}},\widehat{\theta}),\qquad I_{1},\dots,I_{B}\ \text{i.i.d. and independent of }\mathcal{F}_{t-1}.
\tag{F.51}
$$

Let $H_{I}:=\nabla^{2}\ell(x_{I},y_{I},\widehat{\theta})$ and $H:=\mathbb{E}[H_{I}]=\nabla^{2}\mathcal{L}(\widehat{\theta})$. Define the i.i.d. random vectors

$$
Z_{b}:=(H_{I_{b}}-H)u,\qquad b=1,\dots,B.
\tag{F.52}
$$

Then $\mathbb{E}_{t-1}[Z_{b}]=0$ and

$$
(\nabla G_{t}(\widehat{\theta})-H)u=\frac{1}{B}\sum_{b=1}^{B}Z_{b}.
\tag{F.53}
$$

Moreover, $L$-smoothness at $\widehat{\theta}$ gives $\|H_{I}\|_{2}\leq L$, so $\|H\|_{2}\leq\mathbb{E}\|H_{I}\|_{2}\leq L$ and $\|H_{I}-H\|_{2}\leq\|H_{I}\|_{2}+\|H\|_{2}\leq 2L$; hence

$$
\|Z_{b}\|\leq 2L\|u\|,\qquad\mathbb{E}_{t-1}\|Z_{b}\|^{2}\leq 4L^{2}\|u\|^{2},\qquad\mathbb{E}_{t-1}\|Z_{b}\|^{4}\leq 16L^{4}\|u\|^{4}.
\tag{F.54}
$$
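The remainder of the proof relies on second- and fourth-moment bounds for a sum $S=\sum_{b}Z_{b}$ of i.i.d. mean-zero vectors, namely $\mathbb{E}\|S\|^{2}=B\,\mathbb{E}\|Z_{1}\|^{2}$ and $\mathbb{E}\|S\|^{4}\leq B\,\mathbb{E}\|Z_{1}\|^{4}+3B(B-1)(\mathbb{E}\|Z_{1}\|^{2})^{2}\leq 3B^{2}\,\mathbb{E}\|Z_{1}\|^{4}$. As a sanity check, these can be verified exactly by enumeration for a small discrete distribution. The sketch below is illustrative only: the support points are random and hypothetical, and $Z$ is taken uniform on a finite centered set rather than defined through Hessians.

```python
import itertools

import numpy as np


def exact_moments_of_sum(support, B):
    """Exact E||S||^2 and E||S||^4 for S = sum of B i.i.d. uniform draws from `support`."""
    n = len(support)
    m2 = 0.0
    m4 = 0.0
    for idx in itertools.product(range(n), repeat=B):
        s = support[list(idx)].sum(axis=0)        # one realisation of S
        sq = float(s @ s)
        m2 += sq
        m4 += sq * sq
    return m2 / n**B, m4 / n**B


# Hypothetical centered support: 4 points in R^2, so that E[Z] = 0.
rng = np.random.default_rng(1)
support = rng.normal(size=(4, 2))
support -= support.mean(axis=0)
B = 3

sq_norms = np.sum(support**2, axis=1)
z2 = sq_norms.mean()                              # E||Z||^2
z4 = np.mean(sq_norms**2)                         # E||Z||^4
s2, s4 = exact_moments_of_sum(support, B)
middle = B * z4 + 3 * B * (B - 1) * z2**2         # intermediate bound
```

Enumeration is exponential in $B$, so this check is only practical for very small batch sizes; it is meant to validate the constants, not to be used in computation.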
Using independence and $\mathbb{E}_{t-1}[Z_{b}]=0$, the cross terms vanish:

$$
\mathbb{E}_{t-1}\Big\|\frac{1}{B}\sum_{b=1}^{B}Z_{b}\Big\|^{2}
=\frac{1}{B^{2}}\sum_{b=1}^{B}\mathbb{E}_{t-1}\|Z_{b}\|^{2}
\leq\frac{4L^{2}}{B}\|u\|^{2}.
\tag{F.55}
$$

Let $S:=\sum_{b=1}^{B}Z_{b}$. Then

$$
\|S\|^{2}=\sum_{b=1}^{B}\|Z_{b}\|^{2}+2\sum_{1\leq i < j\leq B}\langle Z_{i},Z_{j}\rangle.
\tag{F.56}
$$

Since $\mathbb{E}_{t-1}[Z_{b}]=0$ and the $Z_{b}$'s are independent, every term of $\|S\|^{4}$ containing an unpaired factor has zero conditional expectation. Together with $\langle Z_{i},Z_{j}\rangle^{2}\leq\|Z_{i}\|^{2}\|Z_{j}\|^{2}$, we have

$$
\mathbb{E}_{t-1}\|S\|^{4}
\leq\mathbb{E}_{t-1}\Big(\sum_{b=1}^{B}\|Z_{b}\|^{2}\Big)^{2}
+4\,\mathbb{E}_{t-1}\Big(\sum_{i < j}\langle Z_{i},Z_{j}\rangle\Big)^{2}
\leq B\,\mathbb{E}_{t-1}\|Z_{1}\|^{4}+3B(B-1)\bigl(\mathbb{E}_{t-1}\|Z_{1}\|^{2}\bigr)^{2}.
\tag{F.57}
$$

Using $(\mathbb{E}_{t-1}\|Z_{1}\|^{2})^{2}\leq\mathbb{E}_{t-1}\|Z_{1}\|^{4}$, we get $\mathbb{E}_{t-1}\|S\|^{4}\leq 3B^{2}\,\mathbb{E}_{t-1}\|Z_{1}\|^{4}$, hence

$$
\mathbb{E}_{t-1}\Big\|\frac{1}{B}\sum_{b=1}^{B}Z_{b}\Big\|^{4}
=\frac{1}{B^{4}}\mathbb{E}_{t-1}\|S\|^{4}
\leq\frac{3}{B^{2}}\,\mathbb{E}_{t-1}\|Z_{1}\|^{4}
\leq\frac{3}{B^{2}}\cdot 16L^{4}\|u\|^{4}
=\frac{48L^{4}}{B^{2}}\|u\|^{4}.
\tag{F.58}
$$

Then we have
\begin{align}
\mathbb{E}_{t-1}\bigl\|(\nabla G_{t}(\widehat{\theta})-\widehat{H})(\psi_{t-1}-\widehat{\theta})\bigr\|^{2}
&\leq\frac{4L^{2}}{B}\,\|\psi_{t-1}-\widehat{\theta}\|^{2}, \tag{F.59}\\
\mathbb{E}_{t-1}\bigl\|(\nabla G_{t}(\widehat{\theta})-\widehat{H})(\psi_{t-1}-\widehat{\theta})\bigr\|^{4}
&\leq\frac{48L^{4}}{B^{2}}\,\|\psi_{t-1}-\widehat{\theta}\|^{4}. \tag{F.60}
\end{align}

Since $\eta_{t-1}=G_{t}(\widehat{\theta})+(\nabla G_{t}(\widehat{\theta})-\widehat{H})(\psi_{t-1}-\widehat{\theta})$, the inequalities $\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}$ and $\|a+b\|^{4}\leq 8\|a\|^{4}+8\|b\|^{4}$ combined with [Equation F.50](https://arxiv.org/html/2606.00293#A6.E50), [Equation F.59](https://arxiv.org/html/2606.00293#A6.E59), [Equation F.60](https://arxiv.org/html/2606.00293#A6.E60) yield

\begin{align}
\mathbb{E}_{t-1}\|\eta_{t-1}\|^{2}
&\leq\frac{2\tau_{4}^{2}}{B}+\frac{8L^{2}}{B}\,\|\psi_{t-1}-\widehat{\theta}\|^{2}, \tag{F.61}\\
\mathbb{E}_{t-1}\|\eta_{t-1}\|^{4}
&\leq\frac{24\tau_{4}^{4}}{B^{2}}+\frac{384L^{4}}{B^{2}}\,\|\psi_{t-1}-\widehat{\theta}\|^{4}. \tag{F.62}
\end{align}

By [Equation F.49](https://arxiv.org/html/2606.00293#A6.E49), $\mathbb{E}_{t-1}[\eta_{t-1}]=0$, $\mathbb{E}[\xi_{t-1}]=0$, [Equation F.61](https://arxiv.org/html/2606.00293#A6.E61), and $\|I-\lambda\widehat{H}\|_{2}\leq 1-\lambda\hat{\mu}$, we can obtain
\begin{align}
\mathbb{E}_{t-1}\|\psi_{t}-\widehat{\theta}\|^{2}
&=\|(I-\lambda\widehat{H})(\psi_{t-1}-\widehat{\theta})\|^{2}+\lambda^{2}\,\mathbb{E}_{t-1}\|\eta_{t-1}\|^{2}+2\beta^{-1}\lambda D \tag{F.63}\\
&\leq(1-\lambda\hat{\mu})^{2}\|\psi_{t-1}-\widehat{\theta}\|^{2}+\lambda^{2}\,\mathbb{E}_{t-1}\|\eta_{t-1}\|^{2}+2\beta^{-1}\lambda D \tag{F.64}\\
&\leq\Bigl((1-\lambda\hat{\mu})^{2}+8\lambda^{2}L^{2}/B\Bigr)\|\psi_{t-1}-\widehat{\theta}\|^{2}+2\lambda^{2}\tau_{4}^{2}/B+2\beta^{-1}\lambda D. \tag{F.65}
\end{align}

Taking full expectation and letting $t\to\infty$ yields

$$
\mathbb{E}\|\psi_{\infty}-\widehat{\theta}\|^{2}
\leq\frac{\frac{2\lambda^{2}\tau_{4}^{2}}{B}+2\beta^{-1}\lambda D}{\lambda\hat{\mu}(2-\lambda\hat{\mu})-\frac{8\lambda^{2}L^{2}}{B}}.
\tag{F.66}
$$

Since $\lambda\leq 1/(4\hat{\mu})$ and $\lambda\leq B\hat{\mu}/(200L^{2})$, we have $2-\lambda\hat{\mu}\geq 7/4$ and $\frac{8\lambda^{2}L^{2}}{B}\leq\frac{1}{25}\lambda\hat{\mu}$, hence

$$
\mathbb{E}\|\psi_{\infty}-\widehat{\theta}\|^{2}
\leq\frac{6}{5}\Bigl(\frac{\lambda\tau_{4}^{2}}{\hat{\mu}B}+\frac{D}{\hat{\mu}\beta}\Bigr).
\tag{F.67}
$$

Then we expand the fourth moment of $\psi_{t}$ and take conditional expectation,
\begin{align}
\mathbb{E}_{t-1}\|\psi_{t}-\widehat{\theta}\|^{4}
&=\mathbb{E}_{t-1}\|(I-\lambda\widehat{H})(\psi_{t-1}-\widehat{\theta})\|^{4}
+\mathbb{E}_{t-1}\|-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}\|^{4} \notag\\
&\quad+4\,\mathbb{E}_{t-1}\|-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}\|^{2}\Big\langle(I-\lambda\widehat{H})(\psi_{t-1}-\widehat{\theta}),\,-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}\Big\rangle \tag{F.68}\\
&\quad+2\,\mathbb{E}_{t-1}\Big(\|(I-\lambda\widehat{H})(\psi_{t-1}-\widehat{\theta})\|^{2}\,\|-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}\|^{2}\Big) \notag\\
&\quad+4\,\mathbb{E}_{t-1}\Big\langle(I-\lambda\widehat{H})(\psi_{t-1}-\widehat{\theta}),\,-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}\Big\rangle^{2}. \tag{F.69}
\end{align}

Using $\langle a,b\rangle^{2}\leq\|a\|^{2}\|b\|^{2}$ and the Cauchy–Schwarz inequality,

$$
\begin{aligned}
\mathbb{E}_{t-1}\|\psi_{t}-\widehat{\theta}\|^{4}
&\leq\mathbb{E}_{t-1}\|(I-\lambda\widehat{H})(\psi_{t-1}-\widehat{\theta})\|^{4}
+3\,\mathbb{E}_{t-1}\|-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}\|^{4}\\
&\quad+8\,\mathbb{E}_{t-1}\Big(\|(I-\lambda\widehat{H})(\psi_{t-1}-\widehat{\theta})\|^{2}\,\|-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}\|^{2}\Big).
\end{aligned}
$$

Moreover,
$$
\begin{aligned}
\mathbb{E}_{t-1}\|-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}\|^{2}
&=\lambda^{2}\,\mathbb{E}_{t-1}\|\eta_{t-1}\|^{2}+2\beta^{-1}\lambda D\\
&\leq\lambda^{2}\Big(\frac{2\tau_{4}^{2}}{B}+\frac{8L^{2}}{B}\|\psi_{t-1}-\widehat{\theta}\|^{2}\Big)+2\beta^{-1}\lambda D,
\end{aligned}
\tag{F.70}
$$

and using $\|u+v\|^{4}\leq 8\|u\|^{4}+8\|v\|^{4}$ together with $\mathbb{E}\|\xi\|^{4}=D(D+2)$,

$$
\begin{aligned}
\mathbb{E}_{t-1}\|-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}\|^{4}
&\leq 8\lambda^{4}\,\mathbb{E}_{t-1}\|\eta_{t-1}\|^{4}+32\beta^{-2}\lambda^{2}D(D+2)\\
&\leq\frac{192\lambda^{4}\tau_{4}^{4}}{B^{2}}+\frac{3072\lambda^{4}L^{4}}{B^{2}}\|\psi_{t-1}-\widehat{\theta}\|^{4}+32\beta^{-2}\lambda^{2}D(D+2).
\end{aligned}
\tag{F.71}
$$

Combining [Equation F.69](https://arxiv.org/html/2606.00293#A6.E69), [Equation F.71](https://arxiv.org/html/2606.00293#A6.E71) and $\|I-\lambda\widehat{H}\|_{2}\leq 1-\lambda\hat{\mu}$, we obtain
$$
\begin{aligned}
\mathbb{E}_{t-1}\|\psi_{t}-\widehat{\theta}\|^{4}
&\leq\Bigl((1-\lambda\hat{\mu})^{4}+\frac{64(1-\lambda\hat{\mu})^{2}\lambda^{2}L^{2}}{B}+\frac{9216\lambda^{4}L^{4}}{B^{2}}\Bigr)\|\psi_{t-1}-\widehat{\theta}\|^{4}\\
&\quad+16(1-\lambda\hat{\mu})^{2}\Bigl(\lambda^{2}\tau_{4}^{2}/B+\lambda D/\beta\Bigr)\|\psi_{t-1}-\widehat{\theta}\|^{2}
+576\lambda^{4}\tau_{4}^{4}/B^{2}+96\beta^{-2}\lambda^{2}D(D+2).
\end{aligned}
\tag{F.72}
$$

Taking full expectation and letting $t\to\infty$ gives

$$
\begin{aligned}
&\Bigl(1-(1-\lambda\hat{\mu})^{4}-\frac{64(1-\lambda\hat{\mu})^{2}\lambda^{2}L^{2}}{B}-\frac{9216\lambda^{4}L^{4}}{B^{2}}\Bigr)\mathbb{E}\|\psi_{\infty}-\widehat{\theta}\|^{4}\\
&\qquad\leq 16(1-\lambda\hat{\mu})^{2}\Bigl(\frac{\lambda^{2}\tau_{4}^{2}}{B}+\frac{\lambda D}{\beta}\Bigr)\mathbb{E}\|\psi_{\infty}-\widehat{\theta}\|^{2}
+\frac{576\lambda^{4}\tau_{4}^{4}}{B^{2}}+96\beta^{-2}\lambda^{2}D(D+2).
\end{aligned}
\tag{F.73}
$$

Under $\lambda\hat{\mu}\leq 1/4$ and $\lambda\leq B\hat{\mu}/(200L^{2})$,

$$
\begin{aligned}
1-(1-\lambda\hat{\mu})^{4}-\frac{64(1-\lambda\hat{\mu})^{2}\lambda^{2}L^{2}}{B}-\frac{9216\lambda^{4}L^{4}}{B^{2}}
&\geq(4\lambda\hat{\mu}-6\lambda^{2}\hat{\mu}^{2})-64\lambda^{2}L^{2}/B-9216\lambda^{4}L^{4}/B^{2}\\
&\geq(5/2)\lambda\hat{\mu}-(8/25)\lambda\hat{\mu}-(36/625)\lambda\hat{\mu}\\
&\geq 2\lambda\hat{\mu}.
\end{aligned}
\tag{F.74}
$$

Using $(1-\lambda\hat{\mu})^{2}\leq 1$, [Equation F.67](https://arxiv.org/html/2606.00293#A6.E67), and [Equation F.74](https://arxiv.org/html/2606.00293#A6.E74) in [Equation F.73](https://arxiv.org/html/2606.00293#A6.E73) yields
\begin{align}
\mathbb{E}\|\psi_{\infty}-\widehat{\theta}\|^{4}
&\leq\frac{48}{5\hat{\mu}^{2}}\Bigl(\frac{\lambda\tau_{4}^{2}}{B}+\frac{D}{\beta}\Bigr)^{2}
+72\,\frac{\lambda^{2}\tau_{4}^{4}}{\hat{\mu}^{2}B^{2}}
+48\,\frac{\lambda D(D+2)}{\hat{\mu}\beta^{2}} \tag{F.75}\\
&\leq\Bigl(72+\frac{48}{5}\Bigr)\frac{\lambda^{2}\tau_{4}^{4}}{\hat{\mu}^{2}B^{2}}
+\frac{96}{5}\,\frac{\lambda\tau_{4}^{2}}{\hat{\mu}^{2}B\beta}\,D
+\frac{48}{5}\,\frac{D^{2}}{\hat{\mu}^{2}\beta^{2}}
+48\,\frac{\lambda D(D+2)}{\hat{\mu}\beta^{2}} \tag{F.76}\\
&\leq 96\,\frac{\lambda^{2}\tau_{4}^{4}}{\hat{\mu}^{2}B^{2}}
+24\,\frac{\lambda\tau_{4}^{2}}{\hat{\mu}^{2}B\beta}\,D
+12\,\frac{D^{2}}{\hat{\mu}^{2}\beta^{2}}
+48\,\frac{\lambda D(D+2)}{\hat{\mu}\beta^{2}}. \tag{F.77}
\end{align}

∎
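The numerical constants in the proof of Lemma F.2 can be checked on a grid over the admissible step sizes. The sketch below uses the dimensionless variables $x:=\lambda\hat{\mu}\in(0,1/4]$ and $r:=\lambda L^{2}/(B\hat{\mu})\in[0,1/200]$, so that $\lambda^{2}L^{2}/B=x\,r\,$; these variable names and the grid resolution are assumptions made for illustration. It verifies that the prefactor implied by (F.66) stays below $6/5$, as used in (F.67), and that the left-hand side of (F.74) is at least $2\lambda\hat{\mu}$.

```python
import numpy as np


def second_moment_prefactor(x, r):
    """Prefactor c in (F.66): E||psi - theta_hat||^2 <= c * (lam*tau^2/(mu*B) + D/(mu*beta)).

    Obtained by dividing numerator and denominator of (F.66) by lam * mu_hat.
    """
    return 2.0 / ((2.0 - x) - 8.0 * r)


def fourth_moment_gap_ratio(x, r):
    """Left-hand side of (F.74) divided by lam * mu_hat, using lam^2 L^2 / B = x * r."""
    q = x * r
    lhs = 1.0 - (1.0 - x) ** 4 - 64.0 * (1.0 - x) ** 2 * q - 9216.0 * q**2
    return lhs / x


# Grid over the admissible region (resolution is an arbitrary choice).
xs = np.linspace(1e-6, 0.25, 200)
rs = np.linspace(0.0, 1.0 / 200.0, 50)
X, R = np.meshgrid(xs, rs)
worst_prefactor = second_moment_prefactor(X, R).max()
worst_gap_ratio = fourth_moment_gap_ratio(X, R).min()
```

A grid check of this kind does not replace the algebra, but it guards against transcription errors in constants such as $36/625$ and $6/5$.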
###### Proof of [Corollary 4.6](https://arxiv.org/html/2606.00293#S4.Thmtheorem6).
Taking the limit $t\to\infty$ and combining with [Lemma F.2](https://arxiv.org/html/2606.00293#A6.Thmtheorem2) yields
$$
\begin{split}
W_{2}^{2}(\pi_{\theta},\pi_{\psi})
&\leq\frac{\lambda}{1-\bar{\beta}}\left\{\frac{\lambda\overline{M^{2}}}{2}+\frac{\overline{M}^{2}}{4\mu}\right\}\,\mathbb{E}\,\bigl\|\psi_{\infty}-\widehat{\theta}\bigr\|^{4}\\
&\leq\underbrace{\frac{2}{\mu}\left\{\frac{\overline{M^{2}}}{8L}+\frac{\overline{M}^{2}}{4\mu}\right\}}_{C_{0}}\times\left\{\frac{96\lambda^{2}\tau_{4}^{4}}{\hat{\mu}^{2}B^{2}}+\frac{24\lambda\tau_{4}^{2}}{\hat{\mu}^{2}B\beta}D+\frac{12D^{2}}{\hat{\mu}^{2}\beta^{2}}+\frac{48\lambda D(D+2)}{\hat{\mu}\beta^{2}}\right\}\\
&\leq\underbrace{\frac{96C_{0}\tau_{4}^{4}}{\hat{\mu}^{2}}}_{A_{0}}\,\frac{\lambda^{2}}{B^{2}}+\underbrace{\frac{12C_{0}\tau_{4}^{2}}{\hat{\mu}^{2}}\,D}_{A_{1}}\,\frac{2\lambda}{B\beta}+\underbrace{\Bigl(\frac{12C_{0}}{\hat{\mu}^{2}}\,D^{2}+\frac{48C_{0}}{\hat{\mu}^{2}}\,D(D+2)\Bigr)}_{A_{2}}\,\frac{1}{\beta^{2}}\\
&\leq A^{2}\left(\frac{\lambda}{B}+\frac{1}{\beta}\right)^{2},
\end{split}
$$
where $A^{2}=\max\{A_{0},A_{1},A_{2}\}$. ∎
### F.6 Proof of [Theorem 4.1](https://arxiv.org/html/2606.00293#S4.Thmtheorem1)
###### Proof\.
Using [Corollary 4.6](https://arxiv.org/html/2606.00293#S4.Thmtheorem6), the proof of [Theorem 4.1](https://arxiv.org/html/2606.00293#S4.Thmtheorem1) is almost immediate. Under Assumptions [(A)](https://arxiv.org/html/2606.00293#S4.I1.i1)–[(C)](https://arxiv.org/html/2606.00293#S4.I1.i3), there exists a constant $c>0$ such that $\Sigma_{\theta}\prec c\lambda I$ and $\Sigma_{\psi}\prec c\lambda I$ (Dieuleveut et al., [2020](https://arxiv.org/html/2606.00293#bib.bib11), Theorem 4), and therefore $\sigma_{\theta,d}$ and $\sigma_{\psi,d}$ are of order $\lambda^{1/2}$. Hence, it follows from [Corollary 4.6](https://arxiv.org/html/2606.00293#S4.Thmtheorem6) and [Equation 20](https://arxiv.org/html/2606.00293#S4.E20) that the relative errors of the stationary standard deviations and covariance satisfy [Equations 13](https://arxiv.org/html/2606.00293#S4.E13) and [14](https://arxiv.org/html/2606.00293#S4.E14). ∎
## Appendix G Proofs for Momentum Results
### G.1 Proof of [Proposition B.1](https://arxiv.org/html/2606.00293#A2.Thmtheorem1)
###### Proof\.
Let
$$
\eta_{t-1}=\Bigl[G_{t}(\widehat{\theta})+\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\Bigr]-\mathbb{E}\Bigl[G_{t}(\widehat{\theta})+\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\Bigr].
\tag{G.1}
$$
Then, we have
$$
\psi_{t}-\widehat{\theta}=\psi_{t-1}-\widehat{\theta}-\Lambda\,\kappa\,m_{t-1}-\Lambda\,\mathbb{E}\Bigl[G_{t}(\widehat{\theta})+\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\Bigr]-\Lambda\,\eta_{t-1}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}.
\tag{G.2}
$$
Since by assumption $\mathcal{R}(\psi)=\frac{1}{2}\psi^{\top}\Gamma\psi$, we have
$$
\mathbb{E}\bigl(G_{t}(\widehat{\theta})\bigr)=\frac{1}{N}\sum_{n=1}^{N}\nabla\ell\left(x_{n},y_{n},\widehat{\theta}\right)+\frac{1}{N}\Gamma\widehat{\theta}=0.
\tag{G.3}
$$
Then [Equation G.2](https://arxiv.org/html/2606.00293#A7.E2) can be rewritten as
$$
\begin{aligned}
\psi_{t}-\widehat{\theta}
&=\psi_{t-1}-\widehat{\theta}-\Lambda\kappa m_{t-1}-\frac{\Lambda}{N}\sum_{n=1}^{N}\left\{\mathcal{J}_{n}\left(\psi_{t-1}-\widehat{\theta}\right)\right\}-\frac{1}{N}\Lambda\Gamma(\psi_{t-1}-\widehat{\theta})-\Lambda\eta_{t-1}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}\\
&=\left(I-\Lambda\mathcal{J}-\frac{\Lambda}{N}\Gamma\right)\left(\psi_{t-1}-\widehat{\theta}\right)-\Lambda\kappa m_{t-1}-\Lambda\eta_{t-1}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}\\
&=\left(I-\Lambda\widehat{H}\right)\left(\psi_{t-1}-\widehat{\theta}\right)-\Lambda\kappa m_{t-1}-\Lambda\eta_{t-1}+\sqrt{2\beta^{-1}\Lambda}\,\xi_{t-1}.
\end{aligned}
\tag{G.4}
$$
Assuming $\psi_{t}$ and $m_{t}$ are jointly sampled from the stationary distribution, we have
$$
\begin{aligned}
\Sigma_{\psi}&=\mathbb{E}\left[\left(\psi_{t}-\widehat{\theta}\right)\left(\psi_{t}-\widehat{\theta}\right)^{\top}\right]\\
&=\left(I-\Lambda\widehat{H}\right)\Sigma_{\psi}\left(I-\Lambda\widehat{H}\right)^{\top}+\kappa^{2}\Lambda M\Lambda+\Lambda\overline{C}_{\psi}\Lambda-(D+D^{\top})+\frac{2\Lambda}{\beta},
\end{aligned}
\tag{G.5}
$$
where $D=\kappa\left(I-\Lambda\widehat{H}\right)\mathbb{E}\left[\left(\psi_{t-1}-\widehat{\theta}\right)m_{t-1}^{\top}\right]\Lambda$ and $M=\mathbb{E}\left[m_{t-1}m_{t-1}^{\top}\right]$.
The rest of the proof mainly follows the proof of Liu et al. ([2021](https://arxiv.org/html/2606.00293#bib.bib43), Theorem 3). According to [Equation B.3](https://arxiv.org/html/2606.00293#A2.E3), we have $\Lambda m_{t}=\psi_{t-1}-\psi_{t}$, so
ΛMΛ=𝔼\[\(ψt−1−θ^−ψt−2\+θ^−2β−1Λξt−1\)\(ψt−1−θ^−ψt−2\+θ^\)⊤−2β−1Λξt−1\]=2Σψ−𝔼\[\(ψt−1−θ^\)\(ψt−2⊤−θ^\)\]−𝔼\[\(ψt−2−θ^\)\(ψt−1⊤−θ^\)\]\+2Λβ\.\\displaystyle\\begin\{aligned\} \\Lambda M\\Lambda&=\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\-\\psi\_\{t\-2\}\+\\widehat\{\\theta\}\-\\sqrt\{2\\beta^\{\-1\}\\Lambda\}\\,\\xi\_\{t\-1\}\\right\)\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\-\\psi\_\{t\-2\}\+\\widehat\{\\theta\}\\right\)^\{\\top\}\-\\sqrt\{2\\beta^\{\-1\}\\Lambda\}\\,\\xi\_\{t\-1\}\\right\]\\\\ &=2\\Sigma\_\{\\psi\}\-\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)\\left\(\\psi\_\{t\-2\}^\{\\top\}\-\\widehat\{\\theta\}\\right\)\\right\]\-\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\-2\}\-\\widehat\{\\theta\}\\right\)\\left\(\\psi\_\{t\-1\}^\{\\top\}\-\\widehat\{\\theta\}\\right\)\\right\]\+\\frac\{2\\Lambda\}\{\\beta\}\.\\end\{aligned\}\(G\.6\)and
D=κ\(I−ΛH^\)𝔼\[\(ψt−1−θ^\)mt−1⊤\]Λ=κ\(I−ΛH^\)𝔼\[\(ψt−1−θ^\)\(ψt−2−ψt−1\+2β−1Λξt−1\)⊤\]=κ\(I−ΛH^\)\(𝔼\[\(ψt−1−θ^\)\(ψt−2−θ^\)⊤\]−Σψ\)\.\\displaystyle\\begin\{aligned\} D&=\\kappa\\left\(I\-\\Lambda\\widehat\{H\}\\right\)\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)m\_\{t\-1\}^\{\\top\}\\right\]\\Lambda\\\\ &=\\kappa\\left\(I\-\\Lambda\\widehat\{H\}\\right\)\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)\\left\(\\psi\_\{t\-2\}\-\\psi\_\{t\-1\}\+\\sqrt\{2\\beta^\{\-1\}\\Lambda\}\\,\\xi\_\{t\-1\}\\right\)^\{\\top\}\\right\]\\\\ &=\\kappa\\left\(I\-\\Lambda\\widehat\{H\}\\right\)\\left\(\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)\\left\(\\psi\_\{t\-2\}\-\\widehat\{\\theta\}\\right\)^\{\\top\}\\right\]\-\\Sigma\_\{\\psi\}\\right\)\.\\end\{aligned\}\(G\.7\)Note that
𝔼\[\(ψt−1−θ^\)\(ψt−2−θ^\)⊤\]=𝔼\[\(ψt−θ^\)\(ψt−1−θ^\)⊤\]=𝔼\[\(\(I−ΛH^\)\(ψt−1−θ^\)−Λκmt−1−Ληt−1\+2β−1Λξt−1\)\(ψt−1−θ^\)⊤\]=\(I−ΛH^\)Σψ−Λκ𝔼\[mt−1\(ψt−1−θ^\)⊤\]=\(I−ΛH^\)Σψ−κ𝔼\[\(ψt−2−ψt−1\)\(ψt−1−θ^\)⊤\]=\(I−ΛH^\)Σψ\+κΣψ−κ𝔼\[\(ψt−2−θ^\)\(ψt−1−θ^\)⊤\]\.\\displaystyle\\begin\{aligned\} \\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)\\left\(\\psi\_\{t\-2\}\-\\widehat\{\\theta\}\\right\)^\{\\top\}\\right\]&=\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\}\-\\widehat\{\\theta\}\\right\)\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)^\{\\top\}\\right\]\\\\ &=\\mathbb\{E\}\\left\[\\left\(\\left\(I\-\\Lambda\\widehat\{H\}\\right\)\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)\-\\Lambda\\kappa m\_\{t\-1\}\-\\Lambda\\eta\_\{t\-1\}\+\\sqrt\{2\\beta^\{\-1\}\\Lambda\}\\,\\xi\_\{t\-1\}\\right\)\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)^\{\\top\}\\right\]\\\\ &=\\left\(I\-\\Lambda\\widehat\{H\}\\right\)\\Sigma\_\{\\psi\}\-\\Lambda\\kappa\\mathbb\{E\}\\left\[m\_\{t\-1\}\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)^\{\\top\}\\right\]\\\\ &=\\left\(I\-\\Lambda\\widehat\{H\}\\right\)\\Sigma\_\{\\psi\}\-\\kappa\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\-2\}\-\\psi\_\{t\-1\}\\right\)\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)^\{\\top\}\\right\]\\\\ &=\\left\(I\-\\Lambda\\widehat\{H\}\\right\)\\Sigma\_\{\\psi\}\+\\kappa\\Sigma\_\{\\psi\}\-\\kappa\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\-2\}\-\\widehat\{\\theta\}\\right\)\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)^\{\\top\}\\right\]\.\\end\{aligned\}\(G\.8\)Solving, we obtain
𝔼\[\(ψt−1−θ^\)\(ψt−2−θ^\)⊤\]\\displaystyle\\mathbb\{E\}\\left\[\\left\(\\psi\_\{t\-1\}\-\\widehat\{\\theta\}\\right\)\\left\(\\psi\_\{t\-2\}\-\\widehat\{\\theta\}\\right\)^\{\\top\}\\right\]=11\+κ\[\(I−ΛH^\)Σψ\+κΣψ\]\.\\displaystyle=\\frac\{1\}\{1\+\\kappa\}\\left\[\\left\(I\-\\Lambda\\widehat\{H\}\\right\)\\Sigma\_\{\\psi\}\+\\kappa\\Sigma\_\{\\psi\}\\right\]\.\(G\.9\)
Substituting Equations G.6, G.7, and G.9 into Equation G.5, we conclude that
\(1−κ\)\(ΛH^Σ\+ΣH^Λ\)\+κ1−κ2\(ΛH^ΛH^Σ\+ΣH^ΛH^Λ\)=ΛC¯ψΛ\+1\+κ21−κ2ΛH^ΣH^Λ\+\(1\+κ2\)2Λβ\.\\displaystyle\(1\-\\kappa\)\(\\Lambda\\widehat\{H\}\\Sigma\+\\Sigma\\widehat\{H\}\\Lambda\)\+\\frac\{\\kappa\}\{1\-\\kappa^\{2\}\}\(\\Lambda\\widehat\{H\}\\Lambda\\widehat\{H\}\\Sigma\+\\Sigma\\widehat\{H\}\\Lambda\\widehat\{H\}\\Lambda\)=\\Lambda\\overline\{C\}\_\{\\psi\}\\Lambda\+\\frac\{1\+\\kappa^\{2\}\}\{1\-\\kappa^\{2\}\}\\Lambda\\widehat\{H\}\\Sigma\\widehat\{H\}\\Lambda\+\(1\+\\kappa^\{2\}\)\\frac\{2\\Lambda\}\{\\beta\}\.\(G\.10\)
∎
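To make Equation G.10 concrete, the following sketch solves its scalar version for the stationary variance. It assumes a one-dimensional problem in which $\Lambda$, $\widehat{H}$, and $\overline{C}_{\psi}$ are the scalars `lam`, `h`, and `c`, and it treats `c` as a known constant; the function name and the numerical values are hypothetical and chosen only for illustration.

```python
def stationary_variance_scalar(lam, h, c, beta, kappa=0.0):
    """Solve the scalar version of Equation (G.10) for the variance.

    lam: step size, h: scalar Hessian, c: gradient-noise variance,
    beta: inverse temperature, kappa: momentum parameter in [0, 1).
    """
    # Coefficient of the variance on the left-hand side of (G.10).
    lhs = 2.0 * (1.0 - kappa) * lam * h \
        + kappa / (1.0 - kappa**2) * 2.0 * (lam * h) ** 2
    # Coefficient of the variance on the right-hand side of (G.10).
    rhs = (1.0 + kappa**2) / (1.0 - kappa**2) * (lam * h) ** 2
    # Terms on the right-hand side that do not involve the variance.
    source = lam**2 * c + (1.0 + kappa**2) * 2.0 * lam / beta
    denom = lhs - rhs
    if denom <= 0.0:
        raise ValueError("no positive solution for these parameters")
    return source / denom

# Illustrative values only (not taken from the paper).
lam, h, c, beta = 0.05, 2.0, 1.5, 10.0
var0 = stationary_variance_scalar(lam, h, c, beta, kappa=0.0)
# With kappa = 0 the momentum terms in (G.5) vanish, so var0 should be a
# fixed point of var = (1 - lam*h)^2 var + lam^2 c + 2 lam / beta.
fixed_point_residual = var0 - (
    (1 - lam * h) ** 2 * var0 + lam**2 * c + 2 * lam / beta
)
```

The check at `kappa = 0` only confirms consistency between Equations G.5 and G.10 in the momentum-free case; it does not validate the momentum terms.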
### G.2 Proof of [Theorem B.3](https://arxiv.org/html/2606.00293#A2.Thmtheorem3)
Since there exists a coupling of $\theta_{0}\sim\nu$ and $\psi_{0}\sim\nu^{\prime}$ with $W_{2}^{2}(\nu,\nu^{\prime})=\mathbb{E}\|\theta_{0}-\psi_{0}\|^{2}$, we take $(\theta_{0},\psi_{0})$ from this joint distribution and initialize the momenta at $m_{0}=\nu_{0}=0$. The proof proceeds in three steps: first, a coupled $2\times 2$ linear recursion for $(\mathbb{E}\|\theta_{t}-\psi_{t}\|^{2},\mathbb{E}\|m_{t}-\nu_{t}\|^{2})$ with coefficient matrix $A$; second, a weighted Lyapunov function $V_{t}=\mathbb{E}\|\theta_{t}-\psi_{t}\|^{2}+\gamma\mathbb{E}\|m_{t}-\nu_{t}\|^{2}$ whose optimal weight $\gamma^{\star}$ yields the sharpest contraction rate $\bar{\beta}=\rho(A)$; third, iteration of the resulting one-step recursion.
By $L$-smoothness of $\nabla\ell_{n}$ and $\|a+b\|^{2}\leq(1+c)\|a\|^{2}+(1+1/c)\|b\|^{2}$ with $c=(1-\kappa)/\kappa$,
$$
\begin{aligned}
\mathbb{E}_{t-1}\|m_{t}-\nu_{t}\|^{2}
&=\mathbb{E}_{t-1}\big\|\kappa(m_{t-1}-\nu_{t-1})+G_{t}(\theta_{t-1})-G_{t}(\widehat{\theta})-\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\big\|^{2}\\
&\leq\kappa\|m_{t-1}-\nu_{t-1}\|^{2}+\frac{1}{1-\kappa}\mathbb{E}_{t-1}\big\|G_{t}(\theta_{t-1})-G_{t}(\widehat{\theta})-\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})\big\|^{2}\\
&\leq\kappa\|m_{t-1}-\nu_{t-1}\|^{2}+\frac{1}{1-\kappa}\Big(2L^{2}\|\theta_{t-1}-\psi_{t-1}\|^{2}+\frac{\overline{M^{2}}}{2}\|\psi_{t-1}-\widehat{\theta}\|^{4}\Big).
\end{aligned}
\tag{G.11}
$$
Similarly, expanding $\mathbb{E}_{t-1}\|\theta_{t}-\psi_{t}\|^{2}$ and bounding the inner-product term via Taylor's theorem and $2ab\leq\mu a^{2}+b^{2}/\mu$,
$$
\begin{aligned}
\mathbb{E}_{t-1}\|\theta_{t}-\psi_{t}\|^{2}
&\leq\Big(1-\lambda\mu+\lambda\kappa+\frac{2\lambda^{2}L^{2}}{1-\kappa}\Big)\|\theta_{t-1}-\psi_{t-1}\|^{2}+\lambda\kappa(1+\lambda)\|m_{t-1}-\nu_{t-1}\|^{2}\\
&\quad+\Big(\frac{\lambda^{2}\overline{M^{2}}}{2(1-\kappa)}+\frac{\lambda\overline{M}^{2}}{4\mu}\Big)\|\psi_{t-1}-\widehat{\theta}\|^{4}.
\end{aligned}
\tag{G.12}
$$
Taking full expectation, this yields the linear recursion
$$
\begin{pmatrix}\mathbb{E}\|\theta_{t}-\psi_{t}\|^{2}\\ \mathbb{E}\|m_{t}-\nu_{t}\|^{2}\end{pmatrix}
\leq\underbrace{\begin{pmatrix}1-\lambda\mu+\lambda\kappa+2\lambda^{2}L^{2}/(1-\kappa)&\lambda\kappa(1+\lambda)\\ 2L^{2}/(1-\kappa)&\kappa\end{pmatrix}}_{A:=(a_{ij})_{2\times 2}}
\begin{pmatrix}\mathbb{E}\|\theta_{t-1}-\psi_{t-1}\|^{2}\\ \mathbb{E}\|m_{t-1}-\nu_{t-1}\|^{2}\end{pmatrix}
+\begin{pmatrix}\mathcal{A}\\ \mathcal{B}\end{pmatrix}C_{t-1},
\tag{G.13}
$$
where $\mathcal{A}=\dfrac{\lambda^{2}}{2(1-\kappa)}\overline{M^{2}}+\dfrac{\lambda}{4\mu}\overline{M}^{2}$ and $\mathcal{B}=\dfrac{\overline{M^{2}}}{2(1-\kappa)}$.
For $\gamma>0$, let $V_{t}:=\mathbb{E}\|\theta_{t}-\psi_{t}\|^{2}+\gamma\mathbb{E}\|m_{t}-\nu_{t}\|^{2}$. From [Equation G.13](https://arxiv.org/html/2606.00293#A7.E13),
$$
\begin{aligned}
V_{t}&\leq(a_{11}+\gamma a_{21})\mathbb{E}\|\theta_{t-1}-\psi_{t-1}\|^{2}+(a_{12}+\gamma a_{22})\mathbb{E}\|m_{t-1}-\nu_{t-1}\|^{2}+(\mathcal{A}+\gamma\mathcal{B})C_{t-1}\\
&\leq\beta(\gamma)\,V_{t-1}+(\mathcal{A}+\gamma\mathcal{B})\,C_{t-1},
\end{aligned}
\tag{G.14}
$$
where $\beta(\gamma):=\max\{a_{11}+\gamma a_{21},\ a_{22}+a_{12}/\gamma\}$.
Minimizing $\beta(\gamma)$ over $\gamma>0$, with the minimum attained at $\gamma^{\star}=\big[(a_{22}-a_{11})+\sqrt{(a_{11}-a_{22})^{2}+4a_{12}a_{21}}\big]/(2a_{21})>0$, gives
$$
\bar{\beta}:=\frac{1-\lambda\mu+\tfrac{2\lambda^{2}L^{2}}{1-\kappa}+(1+\lambda)\kappa+\sqrt{X^{2}+Y}}{2},
\tag{G.15}
$$
where $X:=1-\lambda\mu+\tfrac{2\lambda^{2}L^{2}}{1-\kappa}-(1-\lambda)\kappa$ and $Y:=\frac{8L^{2}\lambda\kappa(1+\lambda)}{1-\kappa}$, and the source coefficient
$$
\mathcal{P}:=\mathcal{A}+\gamma^{\star}\mathcal{B}=\frac{\lambda^{2}}{2(1-\kappa)}\overline{M^{2}}+\frac{\lambda}{4\mu}\overline{M}^{2}+\frac{\overline{M^{2}}}{8L^{2}}\big(\sqrt{X^{2}+Y}-X\big).
\tag{G.16}
$$
Since $\mathbb{E}\|\theta_{t}-\psi_{t}\|^{2}\leq V_{t}$ and $V_{t}\leq\bar{\beta}V_{t-1}+\mathcal{P}\,C_{t-1}$, iterating and using $V_{0}=\mathbb{E}\|\theta_{0}-\psi_{0}\|^{2}$ (as $m_{0}=\nu_{0}=0$) gives
$$
\mathbb{E}\|\theta_{t}-\psi_{t}\|^{2}\leq\bar{\beta}^{\,t}\,\mathbb{E}\|\theta_{0}-\psi_{0}\|^{2}+\mathcal{P}\sum_{s=1}^{t}\bar{\beta}^{\,t-s}C_{s-1}.
\tag{G.17}
$$
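The claim that the optimal Lyapunov weight $\gamma^{\star}$ recovers the spectral radius, $\beta(\gamma^{\star})=\bar{\beta}=\rho(A)$, can be checked numerically. The sketch below builds the matrix $A$ of Equation G.13 and compares the three quantities; the function name and the parameter values are hypothetical and chosen only so that the recursion contracts.

```python
import numpy as np

def contraction_quantities(lam, mu, L, kappa):
    """Build the 2x2 matrix A of (G.13) and evaluate gamma*, beta_bar."""
    a11 = 1 - lam * mu + lam * kappa + 2 * lam**2 * L**2 / (1 - kappa)
    a12 = lam * kappa * (1 + lam)
    a21 = 2 * L**2 / (1 - kappa)
    a22 = kappa
    A = np.array([[a11, a12], [a21, a22]])
    disc = np.sqrt((a11 - a22) ** 2 + 4 * a12 * a21)   # sqrt(X^2 + Y)
    gamma_star = ((a22 - a11) + disc) / (2 * a21)
    beta_bar = (a11 + a22 + disc) / 2                  # Equation (G.15)
    # beta(gamma) from (G.14), evaluated at gamma*.
    beta_at_star = max(a11 + gamma_star * a21, a22 + a12 / gamma_star)
    rho = max(abs(np.linalg.eigvals(A)))               # spectral radius of A
    return gamma_star, beta_bar, beta_at_star, rho

# Illustrative values only (not taken from the paper).
gamma_star, beta_bar, beta_at_star, rho = contraction_quantities(
    lam=0.01, mu=1.0, L=1.0, kappa=0.1
)
```

At $\gamma^{\star}$ the two arguments of the maximum in $\beta(\gamma)$ coincide, which is why the minimizer solves the quadratic $a_{21}\gamma^{2}+(a_{11}-a_{22})\gamma-a_{12}=0$.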
### G.3 Proof of [Proposition B.2](https://arxiv.org/html/2606.00293#A2.Thmtheorem2)
As in [Section F.3](https://arxiv.org/html/2606.00293#A6.SS3), we assume that the SG-MCMC iterates have reached stationarity.
Define the lag-$k$ autocovariance
$$
\Xi_{k}:=\mathbb{E}_{\pi_{\psi}}\!\left[(\psi_{t+k}-\widehat{\theta})(\psi_{t}-\widehat{\theta})^{\top}\right].
\tag{G.18}
$$
Under stationarity and the linearized dynamics, this autocovariance satisfies
$$
\Xi_{k}=(I-\Lambda\widehat{H})^{k}\Xi_{0}=(I-\Lambda\widehat{H})^{k}\Sigma_{\psi},
\tag{G.19}
$$
where $\Sigma_{\psi}:=\Xi_{0}$ denotes the stationary covariance.
Next, we approximate the stationary covariance of the averaged iterate
$$
\bar{\psi}_{k}:=\frac{1}{k}\sum_{k^{\prime}=1}^{k}\psi_{k^{\prime}}.
\tag{G.20}
$$
Assuming stationarity, its covariance can be computed as
Σψ\(k\)\\displaystyle\\Sigma\_\{\\psi\}^\{\(k\)\}:=𝔼\[\(ψ¯k−θ^\)\(ψ¯k−θ^\)⊤\]\\displaystyle:=\\mathbb\{E\}\\left\[\(\\bar\{\\psi\}\_\{k\}\-\\widehat\{\\theta\}\)\(\\bar\{\\psi\}\_\{k\}\-\\widehat\{\\theta\}\)^\{\\top\}\\right\]\(G\.21\)=1k2𝔼\[\(∑k′=1k\(ψk′−θ^\)\)\(∑k′′=1k\(ψk′′−θ^\)\)⊤\]\\displaystyle=\\frac\{1\}\{k^\{2\}\}\\mathbb\{E\}\\left\[\\left\(\\sum\_\{k^\{\\prime\}=1\}^\{k\}\(\\psi\_\{k^\{\\prime\}\}\-\\widehat\{\\theta\}\)\\right\)\\left\(\\sum\_\{k^\{\\prime\\prime\}=1\}^\{k\}\(\\psi\_\{k^\{\\prime\\prime\}\}\-\\widehat\{\\theta\}\)\\right\)^\{\\top\}\\right\]\(G\.22\)=1k2\(kΣψ\+2∑k′=1k−1Ξk′\)\\displaystyle=\\frac\{1\}\{k^\{2\}\}\\left\(k\\Sigma\_\{\\psi\}\+2\\sum\_\{k^\{\\prime\}=1\}^\{k\-1\}\\Xi\_\{k^\{\\prime\}\}\\right\)\(G\.23\)=1k2\(kΣψ\+2∑k′=1k−1\(I−ΛH^\)k′Σψ\)\.\\displaystyle=\\frac\{1\}\{k^\{2\}\}\\left\(k\\Sigma\_\{\\psi\}\+2\\sum\_\{k^\{\\prime\}=1\}^\{k\-1\}\(I\-\\Lambda\\widehat\{H\}\)^\{k^\{\\prime\}\}\\Sigma\_\{\\psi\}\\right\)\.\(G\.24\)
### G.4 Proof of [Corollary B.4](https://arxiv.org/html/2606.00293#A2.Thmtheorem4)
###### Lemma G\.1\.
Under the conditions of [Corollary B.4](https://arxiv.org/html/2606.00293#A2.Thmtheorem4), let $\hat{\mu},\hat{L}$ denote the smallest and largest eigenvalues of $\widehat{H}=\nabla^{2}\mathcal{L}(\widehat{\theta})$, and let $\underline{\mathsf{U}},\underline{\mathsf{V}}$ be constants such that $\mathbb{E}\|\psi_{t}-\widehat{\theta}\|^{2}\leq\underline{\mathsf{U}}$ and $\mathbb{E}\|\nu_{t}\|^{2}\leq\underline{\mathsf{V}}$. For any step size $\lambda\leq\min\{1/\hat{L},\,1/(4\hat{\mu}),\,B\hat{\mu}/(CL^{2}),\,(\hat{\mu}/C)^{1/2}\}$ and momentum $\kappa$ as in [Corollary B.4](https://arxiv.org/html/2606.00293#A2.Thmtheorem4), there exists $C>0$ such that, whenever $M=(m_{ij})$ below satisfies $\rho(M)<1$, the stationary iterate $\psi_{\infty}\sim\pi_{\psi}$ of the linearized proxy with momentum [Equation B.6](https://arxiv.org/html/2606.00293#A2.E6) satisfies
$$
\mathbb{E}\|\psi_{\infty}-\widehat{\theta}\|^{4}\leq\frac{(1-m_{22})\,\mathcal{A}+m_{12}\,\mathcal{B}}{(1-m_{11})(1-m_{22})-m_{12}m_{21}},
\tag{G.25}
$$
where
$$
\begin{aligned}
m_{11}&:=(1+\lambda\hat{\mu})(1-\lambda\hat{\mu})^{4}+C\Big\{\tfrac{\lambda^{2}L^{2}}{B}+\hat{\mu}^{-1}\tfrac{\lambda^{3}\kappa^{2}L^{2}}{B}+\tfrac{\lambda^{4}L^{4}}{B^{2}}\Big\}, &&\text{(G.26)}\\
m_{12}&:=C\Big\{\hat{\mu}^{-3}\lambda\kappa^{4}+\hat{\mu}^{-1}\tfrac{\lambda^{3}\kappa^{2}L^{2}}{B}\Big\}, &&\text{(G.27)}\\
m_{21}&:=C\Big\{\hat{L}^{4}+\kappa^{2}\hat{L}^{2}+\tfrac{\hat{L}^{2}L^{2}}{B}+\tfrac{L^{4}}{B^{2}}\Big\}, &&\text{(G.28)}\\
m_{22}&:=C\Big\{\kappa^{4}+\kappa^{2}\hat{L}^{2}\Big\}, &&\text{(G.29)}\\
\mathcal{A}&:=C\Bigl(\tfrac{\lambda^{4}\tau_{4}^{4}}{B^{2}}+\tfrac{\lambda^{2}(D^{2}+2D)}{\beta^{2}}\Bigr)+C\Bigl(\tfrac{\lambda^{2}\tau_{4}^{2}}{B}+\tfrac{\lambda D}{\beta}\Bigr)\Bigl[(1+\lambda\hat{\mu})(1-\lambda\hat{\mu})^{2}\,\underline{\mathsf{U}}+\bigl(1+(\lambda\hat{\mu})^{-1}\bigr)\lambda^{2}\kappa^{2}\,\underline{\mathsf{V}}\Bigr], &&\text{(G.30)}\\
\mathcal{B}&:=C\tfrac{\tau_{4}^{4}}{B^{2}}+C\tfrac{\tau_{4}^{2}}{B}\bigl(\hat{L}^{2}\,\underline{\mathsf{U}}+\kappa^{2}\,\underline{\mathsf{V}}\bigr)+C\tfrac{\kappa^{2}L^{2}}{B}\,\underline{\mathsf{V}}. &&\text{(G.31)}
\end{aligned}
$$
###### Proof\.
The true linearized proxy with momentum at $\theta=\widehat{\theta}$ can be written as
$$
\left\{\begin{array}{l}
\nu_{t}=\kappa\nu_{t-1}+\widehat{H}(\psi_{t-1}-\widehat{\theta})+\eta_{t-1},\\
\psi_{t}-\widehat{\theta}=(I-\lambda\widehat{H})(\psi_{t-1}-\widehat{\theta})-\lambda\kappa\nu_{t-1}-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1},
\end{array}\right.
\tag{G.32}
$$
where $\eta_{t-1}:=G_{t}(\widehat{\theta})+\nabla G_{t}(\widehat{\theta})(\psi_{t-1}-\widehat{\theta})-\widehat{H}(\psi_{t-1}-\widehat{\theta})$ and $\xi_{t-1}\sim\mathcal{N}(0,I_{D})$.
Following the proof of [Theorem 4.5](https://arxiv.org/html/2606.00293#S4.Thmtheorem5) and using $\mathbb{E}_{t-1}[\eta_{t-1}]=0$, we have
$$
\begin{aligned}
\mathbb{E}_{t-1}\|\eta_{t-1}\|^{2}&\leq\frac{2\tau_{4}^{2}}{B}+\frac{8L^{2}}{B}\|\psi_{t-1}-\widehat{\theta}\|^{2}, &&\text{(G.33)}\\
\mathbb{E}_{t-1}\|\eta_{t-1}\|^{4}&\leq\frac{24\tau_{4}^{4}}{B^{2}}+\frac{384L^{4}}{B^{2}}\|\psi_{t-1}-\widehat{\theta}\|^{4}. &&\text{(G.34)}
\end{aligned}
$$
Letting $e_{t}:=\psi_{t}-\widehat{\theta}$, [Equation G.32](https://arxiv.org/html/2606.00293#A7.E32) can be written as
$$
e_{t}=(I-\lambda\widehat{H})e_{t-1}-\lambda\kappa\nu_{t-1}-\lambda\eta_{t-1}+\sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}.
\tag{G.35}
$$
Then
$$
\mathbb{E}_{t-1}\|e_{t}\|^{2}=\|(I-\lambda\widehat{H})e_{t-1}-\lambda\kappa\nu_{t-1}\|^{2}+\lambda^{2}\mathbb{E}_{t-1}\|\eta_{t-1}\|^{2}+2\beta^{-1}\lambda D.
\tag{G.36}
$$
Under the step-size condition, $\|I-\lambda\widehat{H}\|_{2}\leq 1-\lambda\hat{\mu}$. Combining this with $\|x-y\|^{2}\leq(1+a)\|x\|^{2}+(1+a^{-1})\|y\|^{2}$ for $a>0$, applied with $a=\lambda\hat{\mu}$, we obtain
$$
\begin{aligned}
\mathbb{E}_{t-1}\|e_{t}\|^{2}&\leq(1+\lambda\hat{\mu})(1-\lambda\hat{\mu})^{2}\|e_{t-1}\|^{2}+\bigl(1+(\lambda\hat{\mu})^{-1}\bigr)\lambda^{2}\kappa^{2}\|\nu_{t-1}\|^{2}\\
&\qquad+\lambda^{2}\mathbb{E}_{t-1}\|\eta_{t-1}\|^{2}+2\beta^{-1}\lambda D.
\end{aligned}
\tag{G.37}
$$
By [Equation G.33](https://arxiv.org/html/2606.00293#A7.E33),
$$
\begin{aligned}
\mathbb{E}_{t-1}\|e_{t}\|^{2}&\leq\left\{(1+\lambda\hat{\mu})(1-\lambda\hat{\mu})^{2}+\frac{8\lambda^{2}L^{2}}{B}\right\}\|e_{t-1}\|^{2}\\
&\qquad+\bigl(1+(\lambda\hat{\mu})^{-1}\bigr)\lambda^{2}\kappa^{2}\|\nu_{t-1}\|^{2}+\frac{2\lambda^{2}\tau_{4}^{2}}{B}+2\beta^{-1}\lambda D.
\end{aligned}
\tag{G.38}
$$
Similarly, using $\|x+y\|^{2}\leq 2\|x\|^{2}+2\|y\|^{2}$ yields
$$
\begin{aligned}
\mathbb{E}_{t-1}\|\nu_{t}\|^{2}&=\|\kappa\nu_{t-1}+\widehat{H}\,e_{t-1}\|^{2}+\mathbb{E}_{t-1}\|\eta_{t-1}\|^{2}\\
&\leq\Big(2\hat{L}^{2}+8L^{2}/B\Big)\|e_{t-1}\|^{2}+2\kappa^{2}\|\nu_{t-1}\|^{2}+2\tau_{4}^{2}/B.
\end{aligned}
\tag{G.39}
$$
Taking expectations in [Equations G.38](https://arxiv.org/html/2606.00293#A7.E38) and [G.39](https://arxiv.org/html/2606.00293#A7.E39) gives the linear recursion
$$
\begin{pmatrix}\mathbb{E}\|e_{t}\|^{2}\\ \mathbb{E}\|\nu_{t}\|^{2}\end{pmatrix}
\leq\underbrace{\begin{pmatrix}(1+\lambda\hat{\mu})(1-\lambda\hat{\mu})^{2}+\frac{8\lambda^{2}L^{2}}{B}&\bigl(1+(\lambda\hat{\mu})^{-1}\bigr)\lambda^{2}\kappa^{2}\\ 2\hat{L}^{2}+\frac{8L^{2}}{B}&2\kappa^{2}\end{pmatrix}}_{A:=(a_{ij})_{2\times 2}}
\begin{pmatrix}\mathbb{E}\|e_{t-1}\|^{2}\\ \mathbb{E}\|\nu_{t-1}\|^{2}\end{pmatrix}
+\begin{pmatrix}c_{e}\\ c_{\nu}\end{pmatrix},
\tag{G.40}
$$
where
$$
c_{e}:=\frac{2\lambda^{2}\tau_{4}^{2}}{B}+2\beta^{-1}\lambda D,\qquad c_{\nu}:=\frac{2\tau_{4}^{2}}{B}.
\tag{G.41}
$$
Under the step-size and momentum restrictions, we bound the entries of $A$.
Observe that
$$
a_{11}=(1+\lambda\hat{\mu})(1-\lambda\hat{\mu})^{2}+\frac{8\lambda^{2}L^{2}}{B}.
\tag{G.42}
$$
Letting $s:=\lambda\hat{\mu}\leq\frac{1}{4}$, we have
$$
(1+s)(1-s)^{2}=1-s-s^{2}+s^{3}\leq 1-s.
\tag{G.43}
$$
Then
$$
1-a_{11}\geq\lambda\hat{\mu}-\frac{8\lambda^{2}L^{2}}{B}.
\tag{G.44}
$$
Thus, by choosing the universal constant $C$ in the step-size condition $\lambda\leq\frac{B\hat{\mu}}{CL^{2}}$ large enough, we have $1-a_{11}\geq\frac{1}{2}\lambda\hat{\mu}$.
Since $a_{22}=2\kappa^{2}$ and $\kappa\leq 1/2$,
$$
1-a_{22}=1-2\kappa^{2}\geq\frac{1}{2}.
\tag{G.45}
$$
It remains to control the off-diagonal product:
$$
a_{12}=(1+(\lambda\hat{\mu})^{-1})\lambda^{2}\kappa^{2}\leq C\lambda^{3},
\tag{G.46}
$$
and
$$
a_{21}=2\hat{L}^{2}+8L^{2}/B\leq C.
\tag{G.47}
$$
For $\lambda\leq(\hat{\mu}/C)^{1/2}$,
$$
a_{12}a_{21}\leq C\lambda^{3}\leq\tfrac{1}{8}\lambda\hat{\mu}.
\tag{G.48}
$$
Thus,
$$
\Delta_{2}:=(1-a_{11})(1-a_{22})-a_{12}a_{21}\geq\frac{1}{2}\lambda\hat{\mu}\cdot\frac{1}{2}-\frac{1}{8}\lambda\hat{\mu}=\frac{1}{8}\lambda\hat{\mu}>0.
\tag{G.49}
$$
Together with $a_{11},a_{22}<1$ and the fact that $A$ is entrywise nonnegative, this implies $\rho(A)<1$.
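The last step uses a standard fact for entrywise nonnegative $2\times 2$ matrices: if $a_{11},a_{22}<1$ and $\Delta_{2}>0$, then $\rho(A)<1$. The sketch below checks this on random nonnegative matrices; the helper names, the sampling range, and the seed are hypothetical choices for illustration, and the check is a numerical sanity test rather than a proof.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue modulus of a square matrix."""
    return max(abs(np.linalg.eigvals(A)))

def criterion(A):
    """Condition used after (G.49): a11 < 1, a22 < 1 and Delta_2 > 0."""
    (a11, a12), (a21, a22) = A
    delta2 = (1 - a11) * (1 - a22) - a12 * a21
    return a11 < 1 and a22 < 1 and delta2 > 0

rng = np.random.default_rng(0)
# Entrywise nonnegative 2x2 matrices with entries in [0, 1.5).
samples = rng.uniform(0.0, 1.5, size=(2000, 2, 2))
accepted = [A for A in samples if criterion(A)]
n_accepted = len(accepted)
violations = sum(1 for A in accepted if spectral_radius(A) >= 1)
```

For a nonnegative $2\times 2$ matrix the largest eigenvalue is $\tfrac{1}{2}\bigl(a_{11}+a_{22}+\sqrt{(a_{11}-a_{22})^{2}+4a_{12}a_{21}}\bigr)$, and requiring it to be below one is equivalent to the two conditions tested in `criterion`.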
Starting from $e_{0}=0$ and $\nu_{0}=0$, the recursion gives, for every $t\geq 1$,
$$
\begin{pmatrix}\mathbb{E}\|e_{t}\|^{2}\\ \mathbb{E}\|\nu_{t}\|^{2}\end{pmatrix}\leq\sum_{j=0}^{t-1}A^{j}\begin{pmatrix}c_{e}\\ c_{\nu}\end{pmatrix}.
\tag{G.50}
$$
Since $\rho(A)<1$, the geometric series is bounded entrywise by
$$
\sum_{j=0}^{t-1}A^{j}\leq\sum_{j=0}^{\infty}A^{j}=(I-A)^{-1}.
\tag{G.51}
$$
Hence
$$
\begin{pmatrix}\mathbb{E}\|e_{t}\|^{2}\\ \mathbb{E}\|\nu_{t}\|^{2}\end{pmatrix}\leq(I-A)^{-1}\begin{pmatrix}c_{e}\\ c_{\nu}\end{pmatrix}.
\tag{G.52}
$$
Writing
$$
I-A=\begin{pmatrix}1-a_{11}&-a_{12}\\ -a_{21}&1-a_{22}\end{pmatrix},
\tag{G.53}
$$
Cramer's rule yields
$$
(I-A)^{-1}\begin{pmatrix}c_{e}\\ c_{\nu}\end{pmatrix}=\frac{1}{\Delta_{2}}\begin{pmatrix}(1-a_{22})c_{e}+a_{12}c_{\nu}\\ a_{21}c_{e}+(1-a_{11})c_{\nu}\end{pmatrix}.
\tag{G.54}
$$
Therefore, with
$$
\underline{\mathsf{U}}=\frac{(1-a_{22})\,c_{e}+a_{12}\,c_{\nu}}{\Delta_{2}},\qquad\underline{\mathsf{V}}=\frac{a_{21}\,c_{e}+(1-a_{11})\,c_{\nu}}{\Delta_{2}},
\tag{G.55}
$$
we have, uniformly for all $t\geq 1$,
$$
\mathbb{E}\|\psi_{t}-\widehat{\theta}\|^{2}\leq\underline{\mathsf{U}},\qquad\mathbb{E}\|\nu_{t}\|^{2}\leq\underline{\mathsf{V}}.
\tag{G.56}
$$
Thus constants as in the statement exist.
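The closed form in Equations G.54 and G.55 can be compared against a direct linear solve, and the uniform bound can be illustrated by iterating the equality case of the recursion in Equation G.40. The matrix `A` and vector `c` below are hypothetical values chosen so that $\rho(A)<1$; they are not taken from the paper.

```python
import numpy as np

def uniform_bounds(A, c):
    """Closed-form (I - A)^{-1} c from (G.54)-(G.55) for a 2x2 matrix A."""
    (a11, a12), (a21, a22) = A
    ce, cnu = c
    delta2 = (1 - a11) * (1 - a22) - a12 * a21
    U = ((1 - a22) * ce + a12 * cnu) / delta2
    V = (a21 * ce + (1 - a11) * cnu) / delta2
    return np.array([U, V])

# Illustrative values only; here Delta_2 = 0.03 > 0 and a11, a22 < 1.
A = np.array([[0.90, 0.01], [2.00, 0.50]])
c = np.array([0.02, 0.30])

bound = uniform_bounds(A, c)
direct = np.linalg.solve(np.eye(2) - A, c)

# Iterate z_t = A z_{t-1} + c from z_0 = 0 (equality case of (G.40)) and
# record the largest value reached in each coordinate.
z = np.zeros(2)
max_seen = np.zeros(2)
for _ in range(500):
    z = A @ z + c
    max_seen = np.maximum(max_seen, z)
```

Because `A` and `c` are entrywise nonnegative, the iterates increase monotonically towards `bound`, which is why a single pair of constants controls every $t\geq 1$.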
If $u$ is deterministic and $v$ is a mean-zero random vector, then
$$\mathbb{E}\|u+v\|^4 \leq \|u\|^4 + 8\|u\|^2\,\mathbb{E}\|v\|^2 + 3\,\mathbb{E}\|v\|^4. \tag{G.57}$$
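A quick way to gain confidence in Equation G.57 is to evaluate both sides exactly for a small discrete distribution. The sketch below assumes a hypothetical three-point, mean-zero, deliberately asymmetric distribution for $v$ in two dimensions and a small grid of deterministic vectors $u$; none of these values come from the paper.

```python
import itertools

def norm2(x):
    """Squared Euclidean norm of a vector given as a sequence of floats."""
    return sum(xi * xi for xi in x)

def fourth_moment_gap(u, support):
    """Right-hand side minus left-hand side of G.57 for v uniform on `support`."""
    n = len(support)
    lhs = sum(norm2([ui + vi for ui, vi in zip(u, v)]) ** 2 for v in support) / n
    ev2 = sum(norm2(v) for v in support) / n        # E||v||^2
    ev4 = sum(norm2(v) ** 2 for v in support) / n   # E||v||^4
    rhs = norm2(u) ** 2 + 8 * norm2(u) * ev2 + 3 * ev4
    return rhs - lhs

# Hypothetical mean-zero support: the three points sum to (0, 0) but are not symmetric.
support = [(2.0, 0.0), (-1.0, 1.0), (-1.0, -1.0)]
u_grid = list(itertools.product([-3.0, -1.0, 0.0, 0.5, 2.0], repeat=2))
gaps = [fourth_moment_gap(u, support) for u in u_grid]
print(min(gaps))
```

The asymmetric support matters: it makes the odd term $\mathbb{E}[\langle u, v\rangle \|v\|^2]$ nonzero, which is the term that forces the constants 8 and 3 rather than smaller ones.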
From [Equation G.32](https://arxiv.org/html/2606.00293#A7.E32), write
$$e_t = u_{t-1} + w_{t-1}, \tag{G.58}$$
where
$$u_{t-1} := (I - \lambda\widehat{H})e_{t-1} - \lambda\kappa\nu_{t-1}, \qquad w_{t-1} := -\lambda\eta_{t-1} + \sqrt{2\beta^{-1}\lambda}\,\xi_{t-1}. \tag{G.59}$$
Conditional on $\mathcal{F}_{t-1}$, $u_{t-1}$ is deterministic. Moreover, since $\mathbb{E}_{t-1}[\eta_{t-1}] = 0$ and $\mathbb{E}[\xi_{t-1}] = 0$, we have
$$\mathbb{E}_{t-1}[w_{t-1}] = 0. \tag{G.60}$$
Applying [Equation G.57](https://arxiv.org/html/2606.00293#A7.E57) therefore gives
$$\mathbb{E}_{t-1}\|e_t\|^4 \leq \|u_{t-1}\|^4 + 8\|u_{t-1}\|^2\,\mathbb{E}_{t-1}\|w_{t-1}\|^2 + 3\,\mathbb{E}_{t-1}\|w_{t-1}\|^4. \tag{G.61}$$
For the leading term, we apply the weighted Young inequality: for any $\epsilon \in (0,1)$,
$$\|x+y\|^4 \leq (1+\epsilon)\|x\|^4 + C\epsilon^{-3}\|y\|^4. \tag{G.62}$$
Taking $\epsilon = \lambda\hat{\mu}$, which is admissible under $\lambda\hat{\mu} \leq 1/4$, and using
$$x = (I - \lambda\widehat{H})e_{t-1}, \qquad y = -\lambda\kappa\nu_{t-1}, \tag{G.63}$$
we obtain
$$\|u_{t-1}\|^4 \leq (1+\lambda\hat{\mu})\|(I - \lambda\widehat{H})e_{t-1}\|^4 + C(\lambda\hat{\mu})^{-3}\lambda^4\kappa^4\|\nu_{t-1}\|^4. \tag{G.64}$$
Since $\|(I - \lambda\widehat{H})e_{t-1}\| \leq (1-\lambda\hat{\mu})\|e_{t-1}\|$, we have
$$\|u_{t-1}\|^4 \leq (1+\lambda\hat{\mu})(1-\lambda\hat{\mu})^4\|e_{t-1}\|^4 + C\hat{\mu}^{-3}\lambda\kappa^4\|\nu_{t-1}\|^4. \tag{G.65}$$
Since $\xi_{t-1}$ is independent of $(\mathcal{F}_{t-1}, \eta_{t-1})$ and mean zero, the cross term vanishes, and using [Equation G.33](https://arxiv.org/html/2606.00293#A7.E33),
\begin{align}
\mathbb{E}_{t-1}\|w_{t-1}\|^2 &= \lambda^2\,\mathbb{E}_{t-1}\|\eta_{t-1}\|^2 + 2\beta^{-1}\lambda D \tag{G.66}\\
&\leq \lambda^2\Big(2\tau_4^2/B + 8L^2/B\,\|e_{t-1}\|^2\Big) + 2\beta^{-1}\lambda D. \tag{G.67}
\end{align}
Using $\|x+y\|^4 \leq 8(\|x\|^4 + \|y\|^4)$, $\mathbb{E}\|\xi_{t-1}\|^4 = D^2 + 2D$ and [Equation G.34](https://arxiv.org/html/2606.00293#A7.E34),
\begin{align}
\mathbb{E}_{t-1}\|w_{t-1}\|^4 &\leq 8\lambda^4\,\mathbb{E}_{t-1}\|\eta_{t-1}\|^4 + 8(2\beta^{-1}\lambda)^2\,\mathbb{E}\|\xi_{t-1}\|^4 \notag\\
&\leq 8\lambda^4\Big(24\tau_4^4/B^2 + 384L^4/B^2\,\|e_{t-1}\|^4\Big) + 32\beta^{-2}\lambda^2(D^2+2D). \tag{G.68}
\end{align}
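The identity $\mathbb{E}\|\xi_{t-1}\|^4 = D^2 + 2D$ used above is the fourth moment of the norm of a standard Gaussian vector. The sketch below assumes $\xi \sim N(0, I_D)$ (an assumption consistent with the stated identity) and compares the closed form with a seeded Monte Carlo estimate; the dimension and sample size are hypothetical choices.

```python
import random

def gaussian_norm4_exact(D):
    """E||xi||^4 for xi ~ N(0, I_D): D terms with E[xi_i^4] = 3, plus D(D-1) cross terms equal to 1."""
    return 3 * D + D * (D - 1)

def gaussian_norm4_mc(D, n_samples, seed=0):
    """Seeded Monte Carlo estimate of E||xi||^4 using the standard library generator."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        sq = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(D))
        total += sq * sq
    return total / n_samples

D = 5                                   # hypothetical dimension
exact = gaussian_norm4_exact(D)         # equals D**2 + 2*D
estimate = gaussian_norm4_mc(D, n_samples=100_000)
print(exact, estimate)
```

Expanding $\|\xi\|^4 = \sum_{i,j} \xi_i^2 \xi_j^2$ gives $3D + D(D-1) = D^2 + 2D$, which is what the exact function computes.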
From the inequality $\|x-y\|^2 \leq (1+\lambda\hat{\mu})\|x\|^2 + (1+(\lambda\hat{\mu})^{-1})\|y\|^2$, we obtain
$$\|u_{t-1}\|^2 \leq (1+\lambda\hat{\mu})(1-\lambda\hat{\mu})^2\|e_{t-1}\|^2 + \bigl(1+(\lambda\hat{\mu})^{-1}\bigr)\lambda^2\kappa^2\|\nu_{t-1}\|^2. \tag{G.69}$$
By [Equation G.69](https://arxiv.org/html/2606.00293#A7.E69) and $\|e\|^2\|\nu\|^2 \leq \tfrac{1}{2}(\|e\|^4 + \|\nu\|^4)$, it follows that
\begin{align}
8\|u_{t-1}\|^2\,\mathbb{E}_{t-1}\|w_{t-1}\|^2 &\leq C\left[(1+\lambda\hat{\mu})(1-\lambda\hat{\mu})^2\frac{\lambda^2L^2}{B} + (1+(\lambda\hat{\mu})^{-1})\frac{\lambda^4\kappa^2L^2}{B}\right]\|e_{t-1}\|^4 \tag{G.70}\\
&\quad + C\left(\lambda^2\tau_4^2/B + \lambda D/\beta\right)(1+\lambda\hat{\mu})(1-\lambda\hat{\mu})^2\|e_{t-1}\|^2 \tag{G.71}\\
&\quad + C(1+(\lambda\hat{\mu})^{-1})\frac{\lambda^4\kappa^2L^2}{B}\|\nu_{t-1}\|^4 \tag{G.72}\\
&\quad + C\left(\lambda^2\tau_4^2/B + \lambda D/\beta\right)(1+(\lambda\hat{\mu})^{-1})\lambda^2\kappa^2\|\nu_{t-1}\|^2. \tag{G.73}
\end{align}
Combining [Equations G.61](https://arxiv.org/html/2606.00293#A7.E61), [G.65](https://arxiv.org/html/2606.00293#A7.E65), [G.68](https://arxiv.org/html/2606.00293#A7.E68), [G.73](https://arxiv.org/html/2606.00293#A7.E73) and [G.55](https://arxiv.org/html/2606.00293#A7.E55), then taking full expectation,
$$\mathbb{E}\|e_t\|^4 \leq m_{11}\,\mathbb{E}\|e_{t-1}\|^4 + m_{12}\,\mathbb{E}\|\nu_{t-1}\|^4 + \mathcal{A}. \tag{G.74}$$
From [Equation G.32](https://arxiv.org/html/2606.00293#A7.E32), $\nu_t = b_{t-1} + \eta_{t-1}$ with $b_{t-1} := \kappa\nu_{t-1} + \widehat{H}e_{t-1}$ and $\mathbb{E}_{t-1}[\eta_{t-1}] = 0$. Applying [Equation G.57](https://arxiv.org/html/2606.00293#A7.E57) gives
$$\mathbb{E}_{t-1}\|\nu_t\|^4 \leq \|b_{t-1}\|^4 + 8\|b_{t-1}\|^2\,\mathbb{E}_{t-1}\|\eta_{t-1}\|^2 + 3\,\mathbb{E}_{t-1}\|\eta_{t-1}\|^4. \tag{G.75}$$
Using $\|a+b\|^2 \leq 2\|a\|^2 + 2\|b\|^2$, $\|\widehat{H}\|_2 = \hat{L}$, and $2ab \leq a^2 + b^2$,
$$\|b_{t-1}\|^4 \leq C(\hat{L}^4 + \kappa^2\hat{L}^2)\|e_{t-1}\|^4 + C(\kappa^4 + \kappa^2\hat{L}^2)\|\nu_{t-1}\|^4. \tag{G.76}$$
For the cross term, $\|b_{t-1}\|^2 \leq C(\hat{L}^2\|e_{t-1}\|^2 + \kappa^2\|\nu_{t-1}\|^2)$ together with [Equation G.33](https://arxiv.org/html/2606.00293#A7.E33) gives
$$8\|b_{t-1}\|^2\,\mathbb{E}_{t-1}\|\eta_{t-1}\|^2 \leq C\Big(\tfrac{\hat{L}^2L^2}{B} + \tfrac{L^4}{B^2}\Big)\|e_{t-1}\|^4 + C\tfrac{\tau_4^2}{B}\bigl(\hat{L}^2\|e_{t-1}\|^2 + \kappa^2\|\nu_{t-1}\|^2\bigr) + C\tfrac{\kappa^2L^2}{B}\|\nu_{t-1}\|^2. \tag{G.77}$$
Taking full expectation, bounding the second moments $\|e_{t-1}\|^2, \|\nu_{t-1}\|^2$ by $\underline{\mathsf{U}}, \underline{\mathsf{V}}$ (so that they contribute to $\mathcal{B}$), and using [Equation G.34](https://arxiv.org/html/2606.00293#A7.E34) for the remaining $\mathbb{E}_{t-1}\|\eta_{t-1}\|^4$ term, yields
$$\mathbb{E}\|\nu_t\|^4 \leq m_{21}\,\mathbb{E}\|e_{t-1}\|^4 + m_{22}\,\mathbb{E}\|\nu_{t-1}\|^4 + \mathcal{B}. \tag{G.78}$$
Thus
$$\begin{pmatrix}\mathbb{E}\|e_t\|^4 \\ \mathbb{E}\|\nu_t\|^4\end{pmatrix} \leq M\begin{pmatrix}\mathbb{E}\|e_{t-1}\|^4 \\ \mathbb{E}\|\nu_{t-1}\|^4\end{pmatrix} + \begin{pmatrix}\mathcal{A} \\ \mathcal{B}\end{pmatrix}. \tag{G.79}$$
Since $M$ is entrywise nonnegative and $\rho(M) < 1$, $I - M$ is a nonsingular $M$-matrix; hence
$$\Delta_4 := \det(I-M) = (1-m_{11})(1-m_{22}) - m_{12}m_{21} > 0 \tag{G.80}$$
and $(I-M)^{-1} = \sum_{j=0}^{\infty} M^j$ is entrywise nonnegative.
Since the recursion is a componentwise inequality with entrywise nonnegative $M$, we have
$$\limsup_{t\to\infty}\begin{pmatrix}\mathbb{E}\|e_t\|^4 \\ \mathbb{E}\|\nu_t\|^4\end{pmatrix} \leq (I-M)^{-1}\begin{pmatrix}\mathcal{A} \\ \mathcal{B}\end{pmatrix}. \tag{G.81}$$
Since $\psi_t \sim \pi_\psi$ as $t \to \infty$, the first component on the left-hand side converges to $\mathbb{E}\|\psi_\infty - \widehat{\theta}\|^4$, and hence
$$\mathbb{E}\|\psi_\infty - \widehat{\theta}\|^4 \leq \frac{(1-m_{22})\,\mathcal{A} + m_{12}\,\mathcal{B}}{(1-m_{11})(1-m_{22}) - m_{12}m_{21}}. \tag{G.82}$$
∎
We now prove [Corollary B.4](https://arxiv.org/html/2606.00293#A2.Thmtheorem4). Under the notation of [Theorem B.3](https://arxiv.org/html/2606.00293#A2.Thmtheorem3) and [Lemma G.1](https://arxiv.org/html/2606.00293#A7.Thmtheorem1), combining the two results gives the master inequality
$$W_2^2(\pi_\theta, \pi_\psi) \leq \frac{\mathcal{P}}{1-\bar{\beta}} \cdot \frac{(1-m_{22})\,\mathcal{A} + m_{12}\,\mathcal{B}}{(1-m_{11})(1-m_{22}) - m_{12}m_{21}}. \tag{G.83}$$
Throughout, $C > 0$ denotes a generic constant (depending only on $\hat{\mu}, \hat{L}, L, \mu, c_\kappa$ and possibly changing between displays), while $c_2, c_3, c_4$ and $C_1$ denote the specific constants fixed in the statement and below. We bound the two factors in turn.
For the quantities $X, Y$ of [Theorem B.3](https://arxiv.org/html/2606.00293#A2.Thmtheorem3), the conditions $\kappa = c_\kappa\lambda \leq \tfrac{1}{2}$ and $\lambda \leq \mu/(c_2L^2)$ (whence $\lambda\mu \leq \mu^2/(c_2L^2) \leq 1/c_2$, using $\mu \leq L$) give $X \geq \tfrac{1}{4}$ and $Y \leq 32L^2c_\kappa\lambda^2$, hence by $\sqrt{X^2+Y} - X \leq Y/(2X)$,
$$\sqrt{X^2+Y} - X \leq \frac{Y}{2X} \leq 64L^2c_\kappa\lambda^2. \tag{G.84}$$
Writing $1-\bar{\beta} = (1-\kappa) - \tfrac{1}{2}(X + \sqrt{X^2+Y})$ and substituting $X = 1 - \lambda\mu + \tfrac{2\lambda^2L^2}{1-\kappa} - (1-\lambda)\kappa$ with $\kappa = c_\kappa\lambda$,
$$1-\bar{\beta} \geq (1-\kappa) - X - Y = \lambda\mu - \frac{2\lambda^2L^2}{1-\kappa} - \lambda\kappa - Y \geq \tfrac{1}{4}\lambda\mu, \tag{G.85}$$
the last step using $\lambda \leq \mu/(c_2L^2)$. Substituting [Equations G.84](https://arxiv.org/html/2606.00293#A7.E84) and [G.85](https://arxiv.org/html/2606.00293#A7.E85) into the definition of $\mathcal{P}$,
$$\frac{\mathcal{P}}{1-\bar{\beta}} \leq \frac{4}{\lambda\mu}\Big[\frac{\lambda^2}{2(1-\kappa)}\overline{M^2} + \frac{\lambda}{4\mu}\overline{M}^2 + \frac{\overline{M^2}}{8L^2}\big(\sqrt{X^2+Y} - X\big)\Big] \leq \frac{\overline{M}^2}{\mu^2} + \frac{4(1+8c_\kappa)}{\mu}\overline{M^2} =: C_1. \tag{G.86}$$
Write $S := \lambda\tau_4^2/B + D/\beta$. Since $c_e = 2\lambda S$, $c_\nu = 2\tau_4^2/B$, $a_{21} \leq C$, $1-a_{11} \leq C\lambda$, and $\Delta_2 \geq \tfrac{1}{8}\lambda\hat{\mu}$, Cramer's rule gives
$$\underline{\mathsf{U}} \leq CS, \qquad \underline{\mathsf{V}} \leq CS + \frac{C\tau_4^2}{B}. \tag{G.87}$$
We claim the step-size restrictions imply
$$1-m_{11} \geq \lambda\hat{\mu}, \qquad 1-m_{22} \geq \tfrac{1}{2}, \qquad m_{12}m_{21} \leq \tfrac{1}{4}\lambda\hat{\mu}. \tag{G.88}$$
We verify the three inequalities in [Equation G.88](https://arxiv.org/html/2606.00293#A7.E88) in turn.
For the first, with $s := \lambda\hat{\mu} \leq \tfrac{1}{4}$,
$$(1+s)(1-s)^4 \leq 1 - \tfrac{19}{8}s, \qquad C\Big\{\tfrac{\lambda^2L^2}{B} + \hat{\mu}^{-1}\tfrac{\lambda^3\kappa^2L^2}{B} + \tfrac{\lambda^4L^4}{B^2}\Big\} \leq \tfrac{11}{8}\lambda\hat{\mu} \tag{G.89}$$
under $\lambda \leq B\hat{\mu}/(c_3L^2)$, so adding the two bounds gives $m_{11} \leq 1 - \tfrac{19}{8}s + \tfrac{11}{8}s = 1 - \lambda\hat{\mu}$, that is, $1-m_{11} \geq \lambda\hat{\mu}$.
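The scalar inequality $(1+s)(1-s)^4 \leq 1 - \tfrac{19}{8}s$ on $0 \leq s \leq \tfrac{1}{4}$ is easy to confirm on a grid. The following sketch is only a numerical sanity check with a hypothetical grid resolution, not a proof.

```python
def contraction_factor(s):
    """Left-hand side of the first inequality in G.89."""
    return (1 + s) * (1 - s) ** 4

def linear_bound(s, slope=19.0 / 8.0):
    """Right-hand side 1 - slope * s; the slope 19/8 is the one used in G.89."""
    return 1 - slope * s

# Uniform grid on [0, 1/4]; the resolution is an arbitrary illustrative choice.
grid = [0.25 * k / 1000 for k in range(1001)]
worst_gap = min(linear_bound(s) - contraction_factor(s) for s in grid)

# A steeper slope such as 5/2 already fails at the endpoint s = 1/4.
fails_with_steeper_slope = linear_bound(0.25, slope=2.5) < contraction_factor(0.25)
print(worst_gap, fails_with_steeper_slope)
```

The gap equals $\tfrac{5}{8}s - 2s^2 - 2s^3 + 3s^4 - s^5$, which vanishes at $s = 0$ and stays nonnegative up to $s = \tfrac{1}{4}$, so the smallest gap on the grid is attained at $s = 0$.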
For the second, since $\kappa = c_\kappa\lambda \leq \tfrac{1}{2}$,
$$m_{22} = C\{\kappa^4 + \kappa^2\hat{L}^2\} \leq Cc_\kappa^2\lambda^2(1+\hat{L}^2) \leq \tfrac{1}{2} \tag{G.90}$$
for $c_\kappa$ and $\lambda$ within the stated ranges, so $1-m_{22} \geq \tfrac{1}{2}$.
For the third, $m_{12} \leq C\lambda^5$ and $m_{21} \leq C$ give
$$m_{12}m_{21} \leq C\lambda^5 \leq \tfrac{1}{4}\lambda\hat{\mu} \qquad \text{whenever } \lambda \leq (\hat{\mu}/c_4)^{1/4}. \tag{G.91}$$
Combining the three inequalities in [Equation G.88](https://arxiv.org/html/2606.00293#A7.E88),
$$\Delta_4 = (1-m_{11})(1-m_{22}) - m_{12}m_{21} \geq \tfrac{1}{2}\lambda\hat{\mu} - \tfrac{1}{4}\lambda\hat{\mu} = \tfrac{1}{4}\lambda\hat{\mu} > 0. \tag{G.92}$$
Since $M$ is entrywise nonnegative with $m_{11}, m_{22} < 1$ and $\det(I-M) = \Delta_4 > 0$, it follows that $\rho(M) < 1$.
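The criterion used here (entrywise nonnegative entries, diagonal entries below one, and a positive determinant of $I - M$ together imply a spectral radius below one) can be sanity-checked for $2\times 2$ matrices with the closed-form eigenvalues. The sketch below uses a hypothetical grid of entries; near-degenerate cases with a determinant within $10^{-9}$ of zero are skipped to avoid floating-point ambiguity.

```python
import itertools
import math

def spectral_radius_2x2(m11, m12, m21, m22):
    """Spectral radius of a 2x2 matrix with m12 * m21 >= 0, so both eigenvalues are real."""
    half_trace = 0.5 * (m11 + m22)
    root = math.sqrt((0.5 * (m11 - m22)) ** 2 + m12 * m21)
    return max(abs(half_trace + root), abs(half_trace - root))

def det_i_minus_m(m11, m12, m21, m22):
    """det(I - M), written as in the definition of Delta_4."""
    return (1 - m11) * (1 - m22) - m12 * m21

diag_values = [0.0, 0.3, 0.6, 0.9]           # hypothetical diagonal entries, all < 1
offdiag_values = [0.0, 0.03, 0.25, 0.7, 1.5]  # hypothetical nonnegative off-diagonals
checked, violations = 0, 0
for m11, m22 in itertools.product(diag_values, repeat=2):
    for m12, m21 in itertools.product(offdiag_values, repeat=2):
        if det_i_minus_m(m11, m12, m21, m22) > 1e-9:
            checked += 1
            if spectral_radius_2x2(m11, m12, m21, m22) >= 1.0:
                violations += 1
print(checked, violations)
```

The check reflects the identity $(1 - \tfrac{1}{2}\operatorname{tr} M)^2 - \tfrac{1}{4}(m_{11} - m_{22})^2 = (1-m_{11})(1-m_{22})$, which makes the Perron root smaller than one exactly when $m_{12}m_{21} < (1-m_{11})(1-m_{22})$.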
Using $\Delta_4 \geq \tfrac{1}{4}\lambda\hat{\mu}$ and $1-m_{22} \leq 1$,
$$\frac{(1-m_{22})\mathcal{A} + m_{12}\mathcal{B}}{\Delta_4} \leq C\lambda^{-1}\mathcal{A} + C\lambda^{-1}m_{12}\mathcal{B}. \tag{G.93}$$
For the first term, since $(1+(\lambda\hat{\mu})^{-1})\lambda^2\kappa^2 \leq C\lambda^3$, [Lemma G.1](https://arxiv.org/html/2606.00293#A7.Thmtheorem1) and $\lambda \leq 1$ give
$$\lambda^{-1}\mathcal{A} \leq C\Big(\tfrac{\lambda^3\tau_4^4}{B^2} + \tfrac{\lambda(D^2+2D)}{\beta^2} + S^2 + \lambda^3S^2 + \lambda^3\tfrac{\tau_4^2}{B}S\Big) \leq C\Big(\tfrac{\lambda^2\tau_4^4}{B^2} + \tfrac{\lambda D\tau_4^2}{B\beta} + \tfrac{D^2+2D}{\beta^2}\Big). \tag{G.94}$$
For the second term, $m_{12} \leq C\lambda^5$ and $\mathcal{B} \leq C(\tau_4^4/B^2 + \tau_4^2S/B)$ give
$$\lambda^{-1}m_{12}\mathcal{B} \leq C\lambda^4\Big(\tfrac{\tau_4^4}{B^2} + \tfrac{\tau_4^2}{B}S\Big) \leq C\Big(\tfrac{\lambda^2\tau_4^4}{B^2} + \tfrac{\lambda D\tau_4^2}{B\beta}\Big). \tag{G.95}$$
Combining Equations G.83, G.86 and G.93 to G.95, with $\mathcal{P}$ as defined in [Equation G.16](https://arxiv.org/html/2606.00293#A7.E16),
$$\begin{split}
W_2^2(\pi_\theta, \pi_\psi) &\leq \underbrace{C\,\tau_4^4}_{A_0^\star}\frac{\lambda^2}{B^2} + \underbrace{C\,\tau_4^2D}_{A_1^\star}\frac{\lambda}{B\beta} + \underbrace{C\,(D^2+2D)}_{A_2^\star}\frac{1}{\beta^2}\\
&\leq A_\star^2\Big(\tfrac{\lambda}{B} + \tfrac{1}{\beta}\Big)^2,
\end{split}$$
where $A_\star^2 := \max\{A_0^\star, A_1^\star, A_2^\star\}$ depends only on $(\mu, L, \hat{\mu}, \hat{L}, c_\kappa, \tau_4, D, \overline{M}^2, \overline{M^2})$ and is independent of $(\lambda, B, \beta)$.