From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD

arXiv cs.LG 05/27/26, 04:00 AM Papers
dp-sgd differential-privacy generalization pac-bayes max-information theory
Summary
This paper proves a finite-sample bound on the approximate max-information of DP-SGD that is at most linear in dataset size, yielding PAC-Bayes generalization bounds for models trained with differential privacy.
arXiv:2605.26222v1 Announce Type: new Abstract: Understanding the relationship between generalization and privacy remains a central challenge in modern machine learning theory, particularly for deep networks trained by variants of differentially private stochastic gradient descent (DP-SGD). In this work we make progress on this persistent open problem by proving a finite-sample bound on the approximate max-information of DP-SGD that exhibits scaling properties comparable with (Dwork et al, 2015)'s classic result for $\epsilon$-differentially private algorithms, namely at most linear in the dataset size. From our result we obtain a general-purpose PAC-Bayes generalization bound in which the necessary prior distribution can be learned by DP-SGD, as well as a generalization bound for DP-SGD-trained models themselves, with a complexity term that is fully explicit and controlled by the optimization hyperparameters.
Original Article
View Cached Full Text
Cached at: 05/27/26, 09:06 AM
# From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD
Source: [https://arxiv.org/html/2605.26222](https://arxiv.org/html/2605.26222)
Christoph H\. Lampert Hossein Zakerinia Institute of Science and Technology Austria \(ISTA\) Klosterneuburg, Austria

###### Abstract

Understanding the relationship between generalization and privacy remains a central challenge in modern machine learning theory, particularly for deep networks trained by variants of differentially private stochastic gradient descent \(DP\-SGD\)\. In this work we make progress on this persistent open problem by proving a finite\-sample bound on the approximate max\-information of DP\-SGD that exhibits scaling properties comparable withDwork et al\.\([2015b](https://arxiv.org/html/2605.26222#bib.bib18)\)’s classic result forϵ\\epsilon\-differentially private algorithms, namely at most linear in the dataset size\. From our result we obtain a general\-purpose PAC\-Bayes generalization bound in which the necessary prior distribution can be learned by DP\-SGD, as well as a generalization bound for DP\-SGD\-trained models themselves, with a complexity term that is fully explicit and controlled by the optimization hyperparameters\.

## 1Introduction

The question of how to formally guarantee that learning algorithms generalize from their training set to new data has been lying at the core of machine learning theory ever since the seminal works on PAC learning in the 1970s and 1980s\(Vapnik & Chervonenkis,[1971](https://arxiv.org/html/2605.26222#bib.bib56);Valiant,[1984](https://arxiv.org/html/2605.26222#bib.bib55)\)\. In recent years, especially the framework of PAC\-Bayes bounds\(McAllester,[1999](https://arxiv.org/html/2605.26222#bib.bib37);Seeger,[2002](https://arxiv.org/html/2605.26222#bib.bib50);Catoni,[2007](https://arxiv.org/html/2605.26222#bib.bib12)\)has achieved remarkable progress in providing meaningful generalization guarantees even for overparameterized models, such as deep networks\(Dziugaite & Roy,[2017](https://arxiv.org/html/2605.26222#bib.bib19);Zhou et al\.,[2019](https://arxiv.org/html/2605.26222#bib.bib60);Lotfi et al\.,[2024](https://arxiv.org/html/2605.26222#bib.bib31)\)\. Phrased informally, one of the core insights is that models generalize if they do not*overfit*, i\.e\.they extract information from the training set but do not*memorize*it\.

In a modern context of learning from very large datasets that often contain personal data, related considerations emerge in the field of*model learning with data privacy*, specifically*differential privacy*\(Dwork & Roth,[2014](https://arxiv.org/html/2605.26222#bib.bib15);Ponomareva et al\.,[2023](https://arxiv.org/html/2605.26222#bib.bib43)\)\. To guarantee that the original training data cannot be extracted from the trained model, one has to put constraints on the amount that each data point can influence the model parameters, i\.e\.one prevents*data memorization*\.

Given their conceptual similarity, it is not surprising that the intersection of generalization and privacy became a fruitful area for innovation in both areas\. For example, in classic settings,Dwork et al\.\([2015b](https://arxiv.org/html/2605.26222#bib.bib18)\);Bassily et al\.\([2016](https://arxiv.org/html/2605.26222#bib.bib5)\)established andJung et al\.\([2020](https://arxiv.org/html/2605.26222#bib.bib25)\)refined generalization guarantees for statistical and low\-sensitivity queries\.Bassily et al\.\([2014](https://arxiv.org/html/2605.26222#bib.bib4)\)proved bounds for private risk minimization algorithms in convex settings, andBombari & Mondelli\([2025](https://arxiv.org/html/2605.26222#bib.bib8)\)andShi et al\.\([2026](https://arxiv.org/html/2605.26222#bib.bib51)\)for private gradient descent for random feature models and two\-layer ConvNets, respectively\. Specifically in a PAC\-Bayes setup,Dziugaite & Roy\([2018](https://arxiv.org/html/2605.26222#bib.bib20)\)demonstrated that privately learned data\-dependent distributions can take the place of the usually data\-independent prior distributions with only a controllable additive penalty\. Unfortunately, all of the above results are not applicable or ineffective for modern deep networks, which tend to be overparameterized, non\-convex and trained for multiple epochs using variants of stochastic gradient descent\.

For such systems, the dominant notion of privacy is*approximate differential privacy*, denoted as\(ϵ,δ\)\(\\epsilon,\\delta\)\-DP, withδ\>0\\delta\>0\. However, a precise characterization of the relationship between generalization and private training for practical deep networks is so far not available\. In particular, it is known that\(ϵ,δ\)\(\\epsilon,\\delta\)\-DP algorithms do not automatically generalize well\(Rogers et al\.,[2016](https://arxiv.org/html/2605.26222#bib.bib48);Stemmer & Nissim,[2019](https://arxiv.org/html/2605.26222#bib.bib53);Blanco\-Justicia et al\.,[2023](https://arxiv.org/html/2605.26222#bib.bib6)\), and even well\-generalizing deep models can be prone to data extraction attacks\(Carlini et al\.,[2019](https://arxiv.org/html/2605.26222#bib.bib10),[2021](https://arxiv.org/html/2605.26222#bib.bib11)\)\.

In this work, we make progress on this question by focusing on the dominant algorithm for differentially\-private model training:*differentially private stochastic gradient descent \(DP\-SGD\)*\(Rajkumar & Agarwal,[2012](https://arxiv.org/html/2605.26222#bib.bib44)\)\. Prior works studied the generalization properties of DP\-SGD predominantly in the framework of mutual\-information\-based bounds\(Xu & Raginsky,[2017](https://arxiv.org/html/2605.26222#bib.bib59)\)\. For example,Wang et al\.\([2021](https://arxiv.org/html/2605.26222#bib.bib58)\)studied the generalization gap of single\-epoch DP\-SGD in expectation over datasets, whilePensia et al\.\([2018](https://arxiv.org/html/2605.26222#bib.bib40)\)andIssa et al\.\([2023](https://arxiv.org/html/2605.26222#bib.bib24)\)established high\-probability bounds for SGD with additive Gaussian noise in the iterates\. However, the resulting bounds grow with the problem dimension, and the works did not demonstrate the possibility of non\-vacuous bounds in practical settings\.

Instead, we aim for dimension\-independent and numerically tight results by adopting the framework of max\-information and PAC\-Bayes bounds\. Our main technical result,[Theorem˜1](https://arxiv.org/html/2605.26222#Thmtheorem1), establishes that the*β\\beta\-approximate max\-information*of DP\-SGD on a datasetSSwithnnindependent elements fulfills

I∞β\(DP\-SGD\(S\),S\)=O\(Enζ2σ2\(log⁡E/β\)\)\\displaystyle I^\{\\beta\}\_\{\\infty\}\(\\texttt\{DP\-SGD\}\(S\),S\)=\{O\\bigl\(En\\frac\{\\zeta^\{2\}\}\{\\sigma^\{2\}\}\(\\log E/\\beta\)\\bigr\)\}\(1\)whereEEis the number of training epochs,ζ\\zetais the clipping constant, andσ\\sigmais the strength of the noise added by the Gaussian mechanism111We will provide the definition of these quantities in[Section2](https://arxiv.org/html/2605.26222#S2), and non\-asymptotic expressions in[Section3](https://arxiv.org/html/2605.26222#S3)\.\.

To our knowledge, these are the first guarantees on the max\-information for a*practical*\(ϵ,δ\)\(\\epsilon,\\delta\)\-DP learning algorithm, in the sense that they are applicable in a regime where real\-world models are trained these days, whether in industry, such as Google’s recent private vision\-language model*DP\-Cap*\(Sander et al\.,[2024](https://arxiv.org/html/2605.26222#bib.bib49)\), or large language model*VaultGemma*\(Sinha et al\.,[2025](https://arxiv.org/html/2605.26222#bib.bib52)\), or academia, such as recent models for private human action recognition\(Luo et al\.,[2024](https://arxiv.org/html/2605.26222#bib.bib32);Nken et al\.,[2025](https://arxiv.org/html/2605.26222#bib.bib38)\)\.

The ability to control DP\-SGD’s max\-information allows us to establish our main conceptual result,[Theorem˜2](https://arxiv.org/html/2605.26222#Thmtheorem2): the demonstration of how DP\-SGD can be used to obtain data\-dependent priors for PAC\-Bayes generalization bounds\. As in\(Dziugaite & Roy,[2018](https://arxiv.org/html/2605.26222#bib.bib20)\), this step comes at the expense of an additive correction term, which we show to have analogous form as \([1](https://arxiv.org/html/2605.26222#S1.E1)\), so we can control it by a suitable choice of DP\-SGD’s hyperparameters\. The resulting generalization guarantees hold*uniformly*, i\.e\.for arbitrarily trained models, not only those trained with privacy, with the DP\-SGD prior serving only as a reference measure in a complexity term\. However, by evaluating the bound for the prior itself, we readily obtain a prior\-free PAC\-Bayes generalization bound specifically for DP\-SGD\-trained models\.

In summary, our main contribution in this work is an explicit finite\-sample bound on the approximate max\-information of the DP\-SGD algorithm\.From it, we derive two additional contributions of independent interest: a new PAC\-Bayes generalization bound in which the prior can be learned from the actual training data using DP\-SGD, and a new generalization bound for DP\-SGD\-trained models only in terms of the DP\-SGD hyperparameters\.

## 2Background

In this section, we introduce the main concepts and required notation\. For more details and examples of the involved concepts, see Appendix[A](https://arxiv.org/html/2605.26222#A1)\.

We adopt a standard setup of statistical learning, in which a \(stochastic\) learning algorithm is given a training dataset,S=\(x1,…,xn\)∈𝒳nS=\(x\_\{1\},\\dots,x\_\{n\}\)\\in\\mathcal\{X\}^\{n\}and outputs a modely∈𝒴⊂ℝdy\\in\\mathcal\{Y\}\\subset\\mathbb\{R\}^\{d\}\(here and in the following, we identify models and their parametrizations\)\. A loss function,ℓ:𝒳×𝒴→ℝ\+\\ell:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}\_\{\+\}, measures the quality of a model,yy, on a data pointxx\.

#### Differential Privacy\.

A randomized algorithm,𝒜:𝒳n→𝒴\\mathcal\{A\}:\\mathcal\{X\}^\{n\}\\to\\mathcal\{Y\}, is called*approximate differentially private \(\(ϵ,δ\)\(\\epsilon,\\delta\)\-DP\)*for someϵ\>0\\epsilon\>0andδ∈\(0,1\)\\delta\\in\(0,1\), if

∀O⊂𝒴:\\displaystyle\\forall O\\subset\\mathcal\{Y\}:\\quadℙ\{𝒜\(S\)∈O\}≤eϵℙ\{𝒜\(S′\)∈O\}\+δ,\\displaystyle\\mathbb\{P\}\\\{\\mathcal\{A\}\(S\)\\in O\\\}\\leq e^\{\\epsilon\}\\mathbb\{P\}\\\{\\mathcal\{A\}\(S^\{\\prime\}\)\\in O\\\}\+\\delta,\(2\)for all datasetsS,S′S,S^\{\\prime\}that are*neighboring*, i\.e\.identical except for a single data point\. If an algorithm is\(ϵ,δ\)\(\\epsilon,\\delta\)\-DP forδ=0\\delta=0, we call it*purely differentially private*\(ϵ\\epsilon\-DP\)\.

The most common mechanism for achieving differential privacy in modern machine learning is the*Gaussian mechanism*\(Dwork et al\.,[2006](https://arxiv.org/html/2605.26222#bib.bib16)\)\. For any function,Ψ:𝒳n→𝒴\\Psi:\\mathcal\{X\}^\{n\}\\to\\mathcal\{Y\}, denote byΔ\(Ψ\):=supS∼S′‖Ψ\(S\)−Ψ\(S′\)‖\\Delta\(\\Psi\):=\\sup\_\{S\\sim S^\{\\prime\}\}\\\|\\Psi\(S\)\-\\Psi\(S^\{\\prime\}\)\\\|its*sensitivity*, whereS∼S′S\\sim S^\{\\prime\}indicates that the two datasets are neighboring\. Then, the Gaussian mechanism of noise strengthσ\\sigmaworks asℳΨ\(S\)=Ψ\(S\)\+σZ\\mathcal\{M\}\_\{\\Psi\}\(S\)=\\Psi\(S\)\+\\sigma Z, whereZ∼𝒩\(0,I\)Z\\sim\\mathcal\{N\}\(0,\\text\{I\}\)is standard Gaussian noise\. For anyϵ∈\(0,1\)\\epsilon\\in\(0,1\)andδ∈\(0,1\)\\delta\\in\(0,1\),ℳΨ\\mathcal\{M\}\_\{\\Psi\}is\(ϵ,δ\)\(\\epsilon,\\delta\)\-DP as long asσ≥Δϵ2log⁡\(1\.25/δ\)\\sigma\\geq\\frac\{\\Delta\}\{\\epsilon\}\\sqrt\{2\\log\(1\.25/\\delta\)\}\.

Algorithm 1DP\-SGD\-stream0:training set

S=\(x1,…,xn\)S=\(x\_\{1\},\\dots,x\_\{n\}\), clipping threshold

ζ\\zeta, noise strength

σ\\sigma, batch size

mm, number of per\-epoch steps

T≤⌊nm⌋T\\leq\\lfloor\\frac\{n\}\{m\}\\rfloor, number of epochs

EE, learning rates

η1,…,ηT\\eta\_\{1\},\\dots,\\eta\_\{T\}
1:

θ0←\\theta\_\{0\}\\leftarrowinitialize model parameters

2:for

e=1,…,Ee=1,\\dots,Edo

3:

I1,…,IT←CreateBatches\(\)I\_\{1\},\\dots,I\_\{T\}\\leftarrow\\ \\text\{CreateBatches\}\(\)// create index sets of disjoint batches

4:for

t=1,…,Tt=1,\\dots,Tdo

5:

z←sample from𝒩\(0,Id\)z\\leftarrow\\text\{ sample from \}\\mathcal\{N\}\(0,\\text\{I\}\_\{d\}\)// standard Gaussian noise

6:

ut←∑i∈Itclip\(∇ℓ\(xi,θt−1\),ζ\)\+σz\\displaystyle u\_\{t\}\\leftarrow\\sum\\nolimits\_\{i\\in I\_\{t\}\}\\text\{clip\}\\bigl\(\\nabla\\ell\(x\_\{i\},\\theta\_\{t\-1\}\),\\zeta\\bigr\)\+\\sigma z// update vector \(clipped gradients plus noise\)

7:

θt←GradientUpdate\(ut,ηt;θ1,…,θt−1\)\\displaystyle\\theta\_\{t\}\\leftarrow\\text\{GradientUpdate\}\(u\_\{t\},\\eta\_\{t\};\\,\\theta\_\{1\},\\dots,\\theta\_\{t\-1\}\)// add noise and update model parameters

8:yield

θt\\theta\_\{t\}// output parameters but continue algorithm

9:endfor

10:

θ0←θT\\theta\_\{0\}\\leftarrow\\theta\_\{T\}// prepare for next epoch

11:endfor

Algorithm 2DP\-SGD0:training set

SSof size

nn, clipping threshold

ζ\\zeta, noise strength

σ\\sigma, batch size

mm, number of per\-epoch steps

T≤⌊nm⌋T\\leq\\lfloor\\frac\{n\}\{m\}\\rfloor, number of epochs

EE, learning rates

η1,…,ηT\\eta\_\{1\},\\dots,\\eta\_\{T\}
1:

\(θ11,θ21,…,θTE\)←DP\-SGD\-stream\(S,ζ,σ,m,T,E,η1,…,ηT\)\(\\theta^\{1\}\_\{1\},\\theta^\{1\}\_\{2\},\\dots,\\theta^\{E\}\_\{T\}\)\\leftarrow\\texttt\{DP\-SGD\-stream\}\(S,\\zeta,\\sigma,m,T,E,\\eta\_\{1\},\\dots,\\eta\_\{T\}\)
1:

θTE\\theta^\{E\}\_\{T\}

#### Differentially\-Private Stochastic Gradient Descent \(DP\-SGD\)\.

The DP\-SGD algorithm\(Abadi et al\.,[2016](https://arxiv.org/html/2605.26222#bib.bib1)\)relies on a repeated application of the Gaussian mechanism: for each batch of data points, it enforces a bound on the sensitivity by clipping the sample gradients to a maximal length\. It applies the Gaussian mechanism to the sum of clipped gradients, and updates the model parameters using the now privatized aggregate\. Note that DP\-SGD is structurally a*streaming*algorithm that can output \(yield\) updated model parameters after each update step\. However, one can also run DP\-SGD as a batch algorithm, by simply ignoring all of its outputs except the last one\.

[Algorithms˜1](https://arxiv.org/html/2605.26222#alg1)and[2](https://arxiv.org/html/2605.26222#alg2)provide corresponding pseudo\-code\. They include two subroutines that allow tailoring the procedure to many real\-world setups while preserving the guarantees of our later Theorem[1](https://arxiv.org/html/2605.26222#Thmtheorem1):*CreateBatches*outputs the index sets of fixed\-size disjoint batches\. Any procedure, deterministic or random, is permitted that does not depend on the values of the data samples, so that data samples within a batch and between batches remain independent\. For example, a natural example would be that datasets are shuffled at the beginning of each epoch and then split equally into consecutive batches\.

*GradientUpdate*combines previous model parameters,θt−1\\theta\_\{t\-1\}, and an update vector,utu\_\{t\}, into new model parametersθt\\theta\_\{t\}, for a given learning rateηt\\eta\_\{t\}\. The classic SGD choice isθt←θt−1−ηut\\theta\_\{t\}\\leftarrow\\theta\_\{t\-1\}\-\\frac\{\\eta\}\{u\}\_\{t\}, but also other deterministic update rules can be used, e\.g\.including momentum and weight decay, or even Adam, as long as they depend on the dataset only through the \(noise\-protected\) update vectors, see Appendix[A](https://arxiv.org/html/2605.26222#A1)\.

#### Approximate Max\-Information\.

A concept that is related to privacy but more tailored to studying problems of generalization is the*approximate max\-information*, which measures how much statistical information an algorithm’s output𝒜\(S\)\\mathcal\{A\}\(S\)contains about its inputSS,

I∞β\(𝒜\(S\),S\)\\displaystyle I^\{\\beta\}\_\{\\infty\}\\big\(\\mathcal\{A\}\(S\),S\\big\)=D∞β\(\(𝒜\(S\),S\)∥𝒜\(S\)×S\)\\displaystyle=D^\{\\beta\}\_\{\\infty\}\\big\(\(\\mathcal\{A\}\(S\),S\)\\\|\\mathcal\{A\}\(S\)\\times S\\big\)\(3\)forD∞β\(X∥Y\)=supO⊆𝒴,ℙ\{X∈O\}\>βlog⁡ℙ\{X∈O\}−βℙ\{Y∈O\}D^\{\\beta\}\_\{\\infty\}\(X\\\|Y\)=\\sup\_\{O\\subseteq\\mathcal\{Y\},\\mathbb\{P\}\\\{X\\in O\\\}\>\\beta\}\\log\\frac\{\\mathbb\{P\}\\\{X\\in O\\\}\-\\beta\}\{\\mathbb\{P\}\\\{Y\\in O\\\}\}\.Dwork et al\.\([2015b](https://arxiv.org/html/2605.26222#bib.bib18), Theorem 20\)established a link between max\-information and privacy: for anyϵ\\epsilon\-differentially private algorithm𝒜\\mathcal\{A\}, it holds that

I∞β\(𝒜\(S\),S\)\\displaystyle I^\{\\beta\}\_\{\\infty\}\(\\mathcal\{A\}\(S\),S\)≤12nϵ2\+ϵn2log⁡\(2/β\)\.\\displaystyle\\leq\{\\textstyle\{\\frac\{1\}\{2\}\}\}n\\epsilon^\{2\}\+\\epsilon\\sqrt\{\\textstyle\\frac\{n\}\{2\}\\log\(2/\\beta\)\}\.\(4\)

#### PAC\-Bayes generalization bounds\.

To study the generalization ability of a learning method, one assumes that the training data is sampled i\.i\.d\.from some data distribution, and compares two quantities of a modelyy: its*training risk*,R^\(y\)=1n∑xℓ\(x,y\)\\widehat\{R\}\(y\)=\\frac\{1\}\{n\}\\sum\_\{x\}\\ell\(x,y\)i\.e\.its \(average\) loss on the training data, and its*true risk*,R\(y\)=𝔼x\[ℓ\(x,y\)\]R\(y\)=\\mathbb\{E\}\_\{x\}\[\\ell\(x,y\)\], i\.e\.the model’s expected loss on future data\.*Generalization bounds*establish a relation between both quantities\. Specifically, PAC\-Bayes learning theory\(McAllester,[1999](https://arxiv.org/html/2605.26222#bib.bib37);Maurer,[2004](https://arxiv.org/html/2605.26222#bib.bib35)\)establishes that ifπ\\piis a probability distribution over𝒴\\mathcal\{Y\}, called the*prior*, that is chosen independently of the training data, then for allδ∈\(0,1\)\\delta\\in\(0,1\)it holds with probability at least1−δ1\-\\deltathat uniformly over all*posterior*distributionsρ\\rhoover𝒴\\mathcal\{Y\}:

kl\(R^\(ρ\)∥R\(ρ\)\)≤DKL\(ρ∥π\)\+log⁡2nδn,\\displaystyle\\text\{kl\}\(\\widehat\{R\}\(\\rho\)\\,\\\|\\,R\(\\rho\)\)\\leq\\frac\{D\_\{\\text\{KL\}\}\(\\rho\\\|\\pi\)\+\\log\\frac\{2\\sqrt\{n\}\}\{\\delta\}\}\{n\},\(5\)whereR^\(ρ\)=𝔼y∼ρ\[R^\(y\)\]\\widehat\{R\}\(\\rho\)=\\mathbb\{E\}\_\{y\\sim\\rho\}\[\\widehat\{R\}\(y\)\]andR\(ρ\)=𝔼y∼ρ\[R\(y\)\]R\(\\rho\)=\\mathbb\{E\}\_\{y\\sim\\rho\}\[R\(y\)\], andkl\(p^∥p\)=DKL\(Ber\(p^\)∥Ber\(p\)\)=p^log⁡p^p\+\(1−p^\)log⁡1−p^1−p\\text\{kl\}\(\\hat\{p\}\\\|p\)=D\_\{\\text\{KL\}\}\(\\text\{Ber\}\(\\hat\{p\}\)\\\|\\text\{Ber\}\(p\)\)=\\hat\{p\}\\log\\frac\{\\hat\{p\}\}\{p\}\+\(1\-\\hat\{p\}\)\\log\\frac\{1\-\\hat\{p\}\}\{1\-p\}222More intuitive notions of similarity betweenR^\(ρ\)\\widehat\{R\}\(\\rho\)andR\(ρ\)R\(\\rho\), e\.g\.on their difference, can be derived from \([5](https://arxiv.org/html/2605.26222#S2.E5)\) using standard relaxations ofkl\. See, e\.g\., our discussion in Appendix[A](https://arxiv.org/html/2605.26222#A1), and\(Rivasplata et al\.,[2020](https://arxiv.org/html/2605.26222#bib.bib47)\)for more details\., and we assume the loss function to only take values in\[0,1\]\[0,1\]\. In words, this means that, except for an arbitrarily small probability of outlier training sets, the test error will be close to the training error for all models that are sufficiently close to the prior\. Unfortunately, it often proves difficult to find data\-agnostic priorsπ\\pithat allow for posteriorsρ\\rhowith smallR^\(ρ\)\\widehat\{R\}\(\\rho\)andDKL\(ρ∥π\)D\_\{\\text\{KL\}\}\(\\rho\\\|\\pi\)at the same time\. Therefore, for real\-world learning tasks the guarantees tend to end up numerically vacuous, e\.g\., guaranteeingR\(ρ\)≤bR\(\\rho\)\\leq bfor someb≥1b\\geq 1, which is fulfilled anyway due to the boundedness of the loss\.

A way to achieve tighter bounds, specifically to find priors that allow for smallerDKL\(ρ∥π\)D\_\{\\text\{KL\}\}\(\\rho\\\|\\pi\), is to extend \([5](https://arxiv.org/html/2605.26222#S2.E5)\) to allow for*distribution\-dependent*or*data\-dependent priors*\(Catoni,[2007](https://arxiv.org/html/2605.26222#bib.bib12);Lever et al\.,[2013](https://arxiv.org/html/2605.26222#bib.bib29);Parrado\-Hernández et al\.,[2012](https://arxiv.org/html/2605.26222#bib.bib39)\)\. Specifically,Dziugaite & Roy\([2018](https://arxiv.org/html/2605.26222#bib.bib20)\)demonstrated that the prior distributions can be*learned*on the training data itself, as long as anϵ\\epsilon\-DP learning algorithm is used, and\(Rivasplata et al\.,[2020](https://arxiv.org/html/2605.26222#bib.bib47)\)rephrased this result to cover all algorithms with small approximate max\-information\. Unfortunately, as mentioned above,ϵ\\epsilon\-DP is a quite restrictive criterion, compared to\(ϵ,δ\)\(\\epsilon,\\delta\)\-DP withδ\>0\\delta\>0\. In particular, it prevents the use of the Gaussian mechanism, and the alternative Laplace mechanism tends to yield unsatisfactory results for high\-dimensional settings\.

Consequently, one of the central open questions in the field is how \([4](https://arxiv.org/html/2605.26222#S2.E4)\) can be extended to\(ϵ,δ\)\(\\epsilon,\\delta\)\-DP\.Rogers et al\.\([2016](https://arxiv.org/html/2605.26222#bib.bib48), Theorem 3\.1\)succeeded to prove a boundI∞β\(𝒜\(S\),S\)≤O\(ϵ2n\+nδ/ϵ\)I^\{\\beta\}\_\{\\infty\}\(\\mathcal\{A\}\(S\),S\)\\leq O\(\\epsilon^\{2\}n\+n\\sqrt\{\\delta/\\epsilon\}\)for datasets that are sampled from a product distribution, i\.e\.have independent entries333They also show that one cannot hope for non\-trivial results on general distribution, but this is not a major problem for machine learning tasks, where one tends to rely on independently sampled datasets anyway\.\. However their focus was on establishing asymptotic guarantees and with a required assumption ofβ=e−ϵ2n\+O\(nδ/ϵ\)\\beta=e^\{\-\\epsilon^\{2\}n\}\+O\(n\\sqrt\{\\delta/\\epsilon\}\), the result is not applicable for practical settings, where constants matter andβ\\betamust be free to choose\. Our work makes first progress towards resolving this open problem, by establishing a bound on the approximate max\-information of DP\-SGD that, like \([4](https://arxiv.org/html/2605.26222#S2.E4)\), grows only linearly in the size of the training set\.

## 3Approximate Max\-Information of DP\-SGD

We now present our main technical result: a bound on the approximate max\-information of DP\-SGD\.

###### Theorem 1\.

Let𝒜\\mathcal\{A\}be either the*streaming DP\-SGD Algorithm[1](https://arxiv.org/html/2605.26222#alg1)*or the*batch DP\-SGD Algorithm[2](https://arxiv.org/html/2605.26222#alg2)*with hyperparametersζ\\zeta\(clipping threshold\),σ\\sigma\(noise scale\),mm\(batch size\),TT\(number of steps per epoch\), andEE\(number of epochs\), that is run on a dataset,SS, that consists ofnnindependent data samples withn≥Tmn\\geq Tm\. Then, for anyβ∈\(0,1\)\\beta\\in\(0,1\)the*approximate max\-information of𝒜\\mathcal\{A\}*fulfills

I∞β\(\\displaystyle I^\{\\beta\}\_\{\\infty\}\(𝒜\(S\),S\)≤12ETν\+Einfλ∈\(0,rν\)1λ\[TF\(λ\+λ22\)\+logEβ\]withF\(x\)=16ν2x2\+νx1−2νx,\\displaystyle\\mathcal\{A\}\(S\),S\)\\leq\\frac\{1\}\{2\}ET\\nu\+E\\inf\_\{\\lambda\\in\(0,r\_\{\\nu\}\)\}\\frac\{1\}\{\\lambda\}\\Bigl\[TF\(\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\)\+\\log\\frac\{E\}\{\\beta\}\\Big\]\\ \\text\{with\}\\ F\(x\)=\\frac\{16\\nu^\{2\}x^\{2\}\+\\nu x\}\{1\-2\\nu x\},\(6\)withν=mζ2σ2\\nu=m\\frac\{\\zeta^\{2\}\}\{\\sigma^\{2\}\}andrν=1ν\+14−12r\_\{\\nu\}=\\sqrt\{\\frac\{1\}\{\\nu\}\+\\frac\{1\}\{4\}\}\-\\frac\{1\}\{2\}\. In less tight, but more explicit, form:≤ETmζ2σ2\(1\+3q\+12q2\)\+ETmζσ\(12\+3q\+12q2\)withq=2Tlog⁡\(E/β\)\.\\displaystyle\\leq ETm\\frac\{\\zeta^\{2\}\}\{\\sigma^\{2\}\}\(1\+3q\+\\frac\{1\}\{2\}q^\{2\}\)\+ET\\sqrt\{m\}\\frac\{\\zeta\}\{\\sigma\}\(\\frac\{1\}\{2\}\+3q\+\\frac\{1\}\{2\}q^\{2\}\)\\quad\\text\{with $q=\\sqrt\{\\frac\{2\}\{T\}\\log\(E/\\beta\)\}$\.\}\(7\)

###### Proof\.

We only have to prove the result forDP\-SGD\-stream, asDP\-SGD\-batch’s additional postprocessing step of dropping some outputs cannot increase the max\-information\(Dwork et al\.,[2015b](https://arxiv.org/html/2605.26222#bib.bib18), Lemma 16\)\. The proof forDP\-SGD\-streamcan be found in Appendix[B](https://arxiv.org/html/2605.26222#A2)\. At its core lies a bound on the max\-information of the iterated use of the Gaussian mechanism on potentially repeating data in terms of the noise strength and the task’s sensitivity\.

For illustration of the core steps, in the following Section we state and fully prove a simpler result that captures the main components of the statement and the proof of[Theorem˜1](https://arxiv.org/html/2605.26222#Thmtheorem1): a bound on the approximate max\-information of a single application of the Gaussian mechanism\. ∎

### 3\.1Approximate Max\-Information of the Gaussian Mechanism

Throughout this section we assume thatΨ:𝒳m→𝒴⊂ℝd\\Psi:\\mathcal\{X\}^\{m\}\\to\\mathcal\{Y\}\\subset\\mathbb\{R\}^\{d\}is a mapping with sensitivitys\>0s\>0, as introduced in Section[2](https://arxiv.org/html/2605.26222#S2)\.S=\(X1,…,Xm\)S=\(X\_\{1\},\\dots,X\_\{m\}\)denotes a random dataset with independent entries\.

###### Proposition 1\(Approximate Max\-Information of the Gaussian Mechanism\)\.

LetℳΨ\(S\)=Ψ\(S\)\+σZ\\mathcal\{M\}\_\{\\Psi\}\(S\)=\\Psi\(S\)\+\\sigma Zbe the*Gaussian mechanism*of noise strengthσ\\sigmaapplied toΨ\\Psi\. Then, for anyβ∈\(0,1\)\\beta\\in\(0,1\), the*approximate max\-information ofℳΨ\\mathcal\{M\}\_\{\\Psi\}*fulfills, forq=log⁡1\+α/2βq=\\sqrt\{\\log\\frac\{1\+\\alpha/2\}\{\\beta\}\}:

I∞β\(ℳ\(S\),S\)\\displaystyle I\_\{\\infty\}^\{\\beta\}\(\\mathcal\{M\}\(S\),S\)≤infα\>01α≥β−12\[ms2σ2\(34\+12q\+14q2\)\+msσ\(1\+q\)q2−log⁡α\],\\displaystyle\\leq\\\!\\\!\\inf\_\{\\smash\[b\]\{\\begin\{subarray\}\{c\}\\alpha\>0\\\\ \\frac\{1\}\{\\alpha\}\\geq\\beta\-\\frac\{1\}\{2\}\\end\{subarray\}\}\}\\Bigl\[\\frac\{ms^\{2\}\}\{\\sigma^\{2\}\}\\Bigl\(\\frac\{3\}\{4\}\+\\frac\{1\}\{2\}q\+\\frac\{1\}\{4\}q^\{2\}\)\+\\frac\{\\sqrt\{m\}s\}\{\\sigma\}\(1\+q\)\\sqrt\{q^\{2\}\-\\log\\alpha\}\\,\\Big\],\(8\)

###### Proof\.

ByDwork et al\.\([2015b](https://arxiv.org/html/2605.26222#bib.bib18)\), to showI∞β\(ℳΨ\(S\),S\)≤κI\_\{\\infty\}^\{\\beta\}\(\\mathcal\{M\}\_\{\\Psi\}\(S\),S\)\\leq\\kappait suffices to proveℙ\{f\(S,Y\)≥κ\}≤β\\mathbb\{P\}\\\{f\(S,Y\)\\geq\\kappa\\\}\\leq\\betaforY=ℳΨ\(S\)Y=\\mathcal\{M\}\_\{\\Psi\}\(S\), and

f\(S,Y\)\\displaystyle f\(S,Y\)=log⁡p\(Y,S\)p\(S\)p\(Y\)=log⁡p\(Y\|S\)p\(Y\)\.\\displaystyle=\\log\\frac\{p\(Y,S\)\}\{p\(S\)p\(Y\)\}=\\log\\frac\{p\(Y\|S\)\}\{p\(Y\)\}\.\(9\)We do so in three steps: in Lemma[2](https://arxiv.org/html/2605.26222#Thmlemma2)we derive an explicit upper bound,g\(G,Z\)g\(G,Z\), tof\(S,Y\)f\(S,Y\), whereG=1σΨ\(S\)G=\\frac\{1\}\{\\sigma\}\\Psi\(S\)incorporates all randomness due to the data sampling andZZis the additional noise added by the Gaussian mechanism\. In Lemma[3](https://arxiv.org/html/2605.26222#Thmlemma3), we upper bound the variance ofGG, which constitutes the deterministic part ofg\(G,Zg\(G,Z\), and in Lemmas[4](https://arxiv.org/html/2605.26222#Thmlemma4)and[5](https://arxiv.org/html/2605.26222#Thmlemma5)we establish tail bounds for the stochastic parts ofg\(G,Z\)g\(G,Z\)\. The statement of Proposition[1](https://arxiv.org/html/2605.26222#Thmlemma1)then follows from combining these Lemmas\. ∎

###### Lemma 2\.

For anyλ∈ℝ\\lambda\\in\\mathbb\{R\}, it holds that

ℙS,Y\{f\(S,Y\)\>λ\}≤ℙG,Z\{g\(G,Z\)\>λ\}\\displaystyle\\mathbb\{P\}\_\{S,Y\}\\\{f\(S,Y\)\>\\lambda\\\}\\leq\\mathbb\{P\}\_\{G,Z\}\\\{g\(G,Z\)\>\\lambda\\\}\(10\)withg\(G,Z\)=⟨Z,G−𝔼\[G\]⟩\+12‖G−𝔼\[G\]‖2\+12𝔼‖G−𝔼\[G\]‖2\.\\displaystyle g\(G,Z\)=\\bigl\\langle Z,G\-\\mathbb\{E\}\[G\]\\bigr\\rangle\+\{\\textstyle\{\\frac\{1\}\{2\}\}\}\\\|G\-\\mathbb\{E\}\[G\]\\\|^\{2\}\+\{\\textstyle\{\\frac\{1\}\{2\}\}\}\\mathbb\{E\}\\\|G\-\\mathbb\{E\}\[G\]\\\|^\{2\}\.\(11\)

###### Proof of Lemma[2](https://arxiv.org/html/2605.26222#Thmlemma2)\.

Fixy∈𝒴y\\in\\mathcal\{Y\}\. LetS~=\(X~1,…,X~m\)\\tilde\{S\}=\(\\tilde\{X\}\_\{1\},\\dots,\\tilde\{X\}\_\{m\}\)be an independent copy ofSS\. Because ofp\(y\)=𝔼S~p\(y\|S~\)p\(y\)=\\mathbb\{E\}\_\{\\tilde\{S\}\}p\(y\|\\tilde\{S\}\)and the convexity oft↦log⁡1tt\\mapsto\\log\\frac\{1\}\{t\}, Jensen’s inequality implies

f\(S,\\displaystyle f\(S,y\)=logp\(y\|S\)𝔼S~p\(y\|S~\)≤𝔼S~logp\(y\|S\)p\(y\|S~\)\.\\displaystyle y\)=\\log\\frac\{p\(y\|S\)\}\{\\mathbb\{E\}\_\{\\tilde\{S\}\}p\(y\|\\tilde\{S\}\)\}\\leq\\mathbb\{E\}\_\{\\tilde\{S\}\}\\log\\frac\{p\(y\|S\)\}\{p\(y\|\\tilde\{S\}\)\}\.\(12\)By the explicit Gaussian form ofℳ\\mathcal\{M\}, we obtain=𝔼S~\[log⁡1\(2π\)de−12σ2‖y−Ψ\(S\)‖2−log⁡1\(2π\)de−12σ2‖y−Ψ\(S~\)‖2\]\\displaystyle=\\mathbb\{E\}\_\{\\tilde\{S\}\}\\Bigl\[\\log\\frac\{1\}\{\\sqrt\{\(2\\pi\)^\{d\}\}\}e^\{\-\\frac\{1\}\{2\\sigma^\{2\}\}\\\|y\-\\Psi\(S\)\\\|^\{2\}\}\-\\log\\frac\{1\}\{\\sqrt\{\(2\\pi\)^\{d\}\}\}e^\{\-\\frac\{1\}\{2\\sigma^\{2\}\}\\\|y\-\\Psi\(\\tilde\{S\}\)\\\|^\{2\}\}\\Bigr\]\(13\)=𝔼S~\[−12σ2‖y−Ψ\(S\)‖2\+12σ2‖y−Ψ\(S~\)‖2\]\\displaystyle=\\mathbb\{E\}\_\{\\tilde\{S\}\}\\Bigl\[\-\\frac\{1\}\{2\\sigma^\{2\}\}\\\|y\-\\Psi\(S\)\\\|^\{2\}\+\\frac\{1\}\{2\\sigma^\{2\}\}\\\|y\-\\Psi\(\\tilde\{S\}\)\\\|^\{2\}\\Bigr\]\(14\)=𝔼S~\[1σ2⟨y−Ψ\(S\),Ψ\(S\)−Ψ\(S~\)⟩\+12σ2‖Ψ\(S\)−Ψ\(S~\)‖2\]\\displaystyle=\\mathbb\{E\}\_\{\\tilde\{S\}\}\\Bigl\[\\frac\{1\}\{\\sigma^\{2\}\}\\langle y\-\\Psi\(S\),\\Psi\(S\)\-\\Psi\(\\tilde\{S\}\)\\rangle\+\\frac\{1\}\{2\\sigma^\{2\}\}\\\|\\Psi\(S\)\-\\Psi\(\\tilde\{S\}\)\\\|^\{2\}\\Bigr\]\(15\)=1σ2⟨y−Ψ\(S\),Ψ\(S\)−𝔼\[Ψ\(S\)\]⟩\+12σ2‖Ψ\(S\)−𝔼\[Ψ\(S\)\]‖2\+12σ2𝔼‖Ψ\(S\)−𝔼\[Ψ\(S\)\]‖2\\displaystyle=\\frac\{1\}\{\\sigma^\{2\}\}\\langle y\-\\Psi\(S\),\\Psi\(S\)\-\\mathbb\{E\}\[\\Psi\(S\)\]\\rangle\+\\frac\{1\}\{2\\sigma^\{2\}\}\\\|\\Psi\(S\)\-\\mathbb\{E\}\[\\Psi\(S\)\]\\\|^\{2\}\+\\frac\{1\}\{2\\sigma^\{2\}\}\\mathbb\{E\}\\\|\\Psi\(S\)\-\\mathbb\{E\}\[\\Psi\(S\)\]\\\|^\{2\}\(16\)where we have used that𝔼S~‖Ψ\(S\)−Ψ\(S~\)‖2=‖Ψ\(S\)−𝔼\[Ψ\(S\)\]‖2\+𝔼‖Ψ\(S\)−𝔼\[Ψ\(S\)\]‖2\\mathbb\{E\}\_\{\\tilde\{S\}\}\\\|\\Psi\(S\)\-\\Psi\(\\tilde\{S\}\)\\\|^\{2\}=\\\|\\Psi\(S\)\-\\mathbb\{E\}\[\\Psi\(S\)\]\\\|^\{2\}\+\\mathbb\{E\}\\\|\\Psi\(S\)\-\\mathbb\{E\}\[\\Psi\(S\)\]\\\|^\{2\}, becauseΨ\(S\)\\Psi\(S\)andΨ\(S~\)\\Psi\(\\tilde\{S\}\)are independent and have identical distributions, so in particular𝔼\[Ψ\(S\)\]=𝔼\[Ψ\(S~\)\]\\mathbb\{E\}\[\\Psi\(S\)\]=\\mathbb\{E\}\[\\Psi\(\\tilde\{S\}\)\]\. InsertingG=1σΨ\(S\)G=\\frac\{1\}\{\\sigma\}\\Psi\(S\)we obtain=⟨y−Ψ\(S\)σ,G−𝔼\[G\]⟩\+12‖G−𝔼\[G\]‖2\+12𝔼‖G−𝔼\[G\]‖2\\displaystyle=\\Bigl\\langle\\frac\{y\-\\Psi\(S\)\}\{\\sigma\},G\-\\mathbb\{E\}\[G\]\\Bigr\\rangle\+\{\\textstyle\{\\frac\{1\}\{2\}\}\}\\\|G\-\\mathbb\{E\}\[G\]\\\|^\{2\}\+\{\\textstyle\{\\frac\{1\}\{2\}\}\}\\mathbb\{E\}\\\|G\-\\mathbb\{E\}\[G\]\\\|^\{2\}\(17\)SubstitutingY=ℳ\(S\)Y=\\mathcal\{M\}\(S\)foryyand observing thatY−Ψ\(S\)σ\\frac\{Y\-\\Psi\(S\)\}\{\\sigma\}has a standard Gaussian distribution yields the statement of Lemma[2](https://arxiv.org/html/2605.26222#Thmlemma2)\. ∎

###### Lemma 3\.

In the setting above, letν=14ms2σ2\\nu=\\frac\{1\}\{4\}m\\frac\{s^\{2\}\}\{\\sigma^\{2\}\}\. Then, it holds that𝔼‖G−𝔼\[G\]‖2≤2ν\\mathbb\{E\}\\\|G\-\\mathbb\{E\}\[G\]\\\|^\{2\}\\leq 2\\nu\.

###### Proof\.

By the assumption on its sensitivity,Ψ\\Psisatisfies a bounded difference inequality: for anyi=1,…,mi=1,\\dots,m,supx1,…,xm,x′∈𝒳‖Ψ\(x1,…,xm\)−Ψ\(x1,…,xi−1,x′,xi\+1,…,xm\)‖≤s\\sup\_\{x\_\{1\},\\dots,x\_\{m\},x^\{\\prime\}\\in\\mathcal\{X\}\}\\\|\\Psi\(x\_\{1\},\\dots,x\_\{m\}\)\-\\Psi\(x\_\{1\},\\dots,x\_\{i\-1\},x^\{\\prime\},x\_\{i\+1\},\\dots,x\_\{m\}\)\\\|\\leq s\. Thus,𝔼‖Ψ\(S\)−𝔼\[Ψ\(S\)\]‖2≤12ms2\\mathbb\{E\}\\\|\\Psi\(S\)\-\\mathbb\{E\}\[\\Psi\(S\)\]\\\|^\{2\}\\leq\{\\textstyle\{\\frac\{1\}\{2\}\}\}ms^\{2\}, by the vector\-valued Efron\-Stein’s inequality, and the result forG=1σΨ\(S\)G=\\frac\{1\}\{\\sigma\}\\Psi\(S\)follows, because𝔼‖G−𝔼\[G\]‖2=1σ2𝔼‖Ψ\(S\)−𝔼\[Ψ\(S\)\]‖2\\mathbb\{E\}\\\|G\-\\mathbb\{E\}\[G\]\\\|^\{2\}=\\frac\{1\}\{\\sigma^\{2\}\}\\mathbb\{E\}\\\|\\Psi\(S\)\-\\mathbb\{E\}\[\\Psi\(S\)\]\\\|^\{2\}\. ∎

###### Lemma 4\.

In the setting above, it holds for anyλ≥2ν\\lambda\\geq\\sqrt\{2\\nu\}:

ℙ\{‖G−𝔼\[G\]‖\>λ\}\\displaystyle\\mathbb\{P\}\\\{\\\|G\-\\mathbb\{E\}\[G\]\\\|\>\\lambda\\\}≤e−\(λ2ν−1\)2\.\\displaystyle\\leq e^\{\-\(\\frac\{\\lambda\}\{\\sqrt\{2\\nu\}\}\-1\)^\{2\}\}\.\(18\)

###### Proof\.

Analogously to the proof of Lemma[3](https://arxiv.org/html/2605.26222#Thmlemma3), the bounded difference property ofΨ\\Psiallows us to apply McDiarmid’s inequality\(Boucheron et al\.,[2013](https://arxiv.org/html/2605.26222#bib.bib9), Theorem 6\.2\)to‖G−𝔼\[G\]‖\\\|G\-\\mathbb\{E\}\[G\]\\\|\. For anyτ≥0\\tau\\geq 0:

ℙ\{‖G−𝔼\[G\]‖≥𝔼‖G−𝔼\[G\]‖\+τ\}\\displaystyle\\mathbb\{P\}\\bigl\\\{\\\|G\-\\mathbb\{E\}\[G\]\\\|\\geq\\mathbb\{E\}\\\|G\-\\mathbb\{E\}\[G\]\\\|\+\\tau\\bigr\\\}≤e−τ22ν\.\\displaystyle\\leq e^\{\-\\frac\{\\tau^\{2\}\}\{2\\nu\}\}\.\(19\)By Lemma[3](https://arxiv.org/html/2605.26222#Thmlemma3),𝔼‖G−𝔼\[G\]‖≤𝔼‖G−𝔼\[G\]‖2≤2ν\\mathbb\{E\}\\\|G\-\\mathbb\{E\}\[G\]\\\|\\leq\\sqrt\{\\mathbb\{E\}\\\|G\-\\mathbb\{E\}\[G\]\\\|^\{2\}\}\\leq\\sqrt\{2\\nu\}\. Settingτ=λ−2ν\\tau=\\lambda\-\\sqrt\{2\\nu\}yields \([18](https://arxiv.org/html/2605.26222#S3.E18)\)\. ∎

###### Lemma 5\.

In the setting above, leth\(G,Z\)=⟨Z,G−𝔼\[G\]⟩\+12‖G−𝔼\[G\]‖2h\(G,Z\)=\\bigl\\langle Z,G\-\\mathbb\{E\}\[G\]\\bigr\\rangle\+\{\\textstyle\{\\frac\{1\}\{2\}\}\}\\\|G\-\\mathbb\{E\}\[G\]\\\|^\{2\}, i\.e\.g\(G,Z\)g\(G,Z\)without the deterministic last term\. Then, it holds for anyβ∈\(0,1\)\\beta\\in\(0,1\)and anyα\>0\\alpha\>0with1α≥β−12\\frac\{1\}\{\\alpha\}\\geq\\beta\-\\frac\{1\}\{2\}:

ℙ\{h\(G,Z\)\>ν\(q\+1\)2\+2ν\(q\+1\)q2−log⁡α\}\\displaystyle\\mathbb\{P\}\\Bigl\\\{h\(G,Z\)\>\\nu\(q\+1\)^\{2\}\+2\\sqrt\{\\nu\}\(q\+1\)\\sqrt\{q^\{2\}\-\\log\\alpha\}\\Bigr\\\}≤β,forq=log⁡1\+α/2β\.\\displaystyle\\leq\\beta,\\quad\\text\{for $q=\\sqrt\{\\log\\frac\{1\+\\alpha/2\}\{\\beta\}\}$\.\}\(20\)

###### Proof\.

For anyr≥2νr\\geq\\sqrt\{2\\nu\}\(to be chosen later\), we split the tail probability as

ℙ\{h\(G,Z\)\>λ\}\\displaystyle\\mathbb\{P\}\\\{h\(G,Z\)\>\\lambda\\\}=ℙ\{\(h\(G,Z\)\>λ\)∧\(‖G−𝔼\[G\]‖≤r\)\}\\displaystyle=\\mathbb\{P\}\\\{\(h\(G,Z\)\>\\lambda\)\\wedge\(\\\|G\-\\mathbb\{E\}\[G\]\\\|\\leq r\)\\\}\+ℙ\{\(h\(G,Z\)\>λ\)∧\(‖G−𝔼\[G\]‖\>r\)\}\\displaystyle\+\\mathbb\{P\}\\\{\(h\(G,Z\)\>\\lambda\)\\wedge\(\\\|G\-\\mathbb\{E\}\[G\]\\\|\>r\)\\\}\(21\)≤ℙ\{h\(G,Z\)\>λ\|‖G−𝔼\[G\]‖≤r\}\+ℙ\{‖G−𝔼\[G\]‖\>r\}\.\\displaystyle\\leq\\mathbb\{P\}\\\{h\(G,Z\)\>\\lambda\\ \\big\|\\ \\\|G\-\\mathbb\{E\}\[G\]\\\|\\leq r\\\}\+\\mathbb\{P\}\\\{\\\|G\-\\mathbb\{E\}\[G\]\\\|\>r\\\}\.\(22\)
Note that conditioned on anyGG, the law of⟨Z,G−𝔼\[G\]⟩\\bigl\\langle Z,G\-\\mathbb\{E\}\[G\]\\bigr\\rangleis a one\-dimensional zero\-mean Gaussian with variance‖G−𝔼\[G\]‖2\\\|G\-\\mathbb\{E\}\[G\]\\\|^\{2\}, soℙ\{⟨Z,G−𝔼\[G\]⟩≥λ\|G\}=Φ¯\(λ‖G−𝔼\[G\]‖\)\\mathbb\{P\}\\\{\\langle Z,G\-\\mathbb\{E\}\[G\]\\rangle\\geq\\lambda\|G\\\}=\\overline\{\\Phi\}\(\\frac\{\\lambda\}\{\\\|G\-\\mathbb\{E\}\[G\]\\\|\}\), whereΦ¯\\overline\{\\Phi\}is the complement of the Gaussian CDF\. Therefore, we can upper bound the right hand side of \([22](https://arxiv.org/html/2605.26222#S3.E22)\):

ℙ\{h\(G,Z\)\>λ\|‖G−𝔼\[G\]‖≤r\}\\displaystyle\\mathbb\{P\}\\\{h\(G,Z\)\>\\lambda\\ \\big\|\\ \\\|G\-\\mathbb\{E\}\[G\]\\\|\\leq r\\\}≤ℙ\{⟨Z,G−𝔼\[G\]⟩\>λ−12r2\|‖G−𝔼\[G\]‖≤r\}\\displaystyle\\leq\\mathbb\{P\}\\bigl\\\{\\langle Z,G\-\\mathbb\{E\}\[G\]\\rangle\>\\lambda\-\{\\textstyle\{\\frac\{1\}\{2\}\}\}r^\{2\}\\ \\bigm\|\\ \\\|G\-\\mathbb\{E\}\[G\]\\\|\\leq r\\bigr\\\}\(23\)≤Φ¯\(λ−12r2r\)≤12e−12\(λr−r2\)2\\displaystyle\\leq\\overline\{\\Phi\}\(\\frac\{\\lambda\-\{\\textstyle\{\\frac\{1\}\{2\}\}\}r^\{2\}\}\{r\}\)\\leq\{\\textstyle\\frac\{1\}\{2\}\}e^\{\-\\frac\{1\}\{2\}\(\\frac\{\\lambda\}\{r\}\-\\frac\{r\}\{2\}\)^\{2\}\}\(24\)
The second term of \([22](https://arxiv.org/html/2605.26222#S3.E22)\) we control by means of Lemma[4](https://arxiv.org/html/2605.26222#Thmlemma4):

ℙ\{‖G−𝔼\[G\]‖\>r\}\\displaystyle\\mathbb\{P\}\\\{\\\|G\-\\mathbb\{E\}\[G\]\\\|\>r\\\}≤e−\(r2ν−1\)2\.\\displaystyle\\leq e^\{\-\(\\frac\{r\}\{\\sqrt\{2\\nu\}\}\-1\)^\{2\}\}\.\(25\)
To determinerr, we introduce a balancing termα\>0\\alpha\>0, and combine \([24](https://arxiv.org/html/2605.26222#S3.E24)\) and \([25](https://arxiv.org/html/2605.26222#S3.E25)\) into

ℙ\{h\(G,Z\)≥λ\}≤α2e−12\(λr−r2\)2−log⁡α\+e−\(r2ν−1\)2,\\displaystyle\\mathbb\{P\}\\\{h\(G,Z\)\\geq\\lambda\\\}\\leq\\textstyle\\frac\{\\alpha\}\{2\}e^\{\-\\frac\{1\}\{2\}\(\\frac\{\\lambda\}\{r\}\-\\frac\{r\}\{2\}\)^\{2\}\-\\log\\alpha\}\+e^\{\-\(\\frac\{r\}\{\\sqrt\{2\\nu\}\}\-1\)^\{2\}\},\(26\)which holds because the effect ofα\\alphacancels out in the first term\. Then, we determine a valueqqthat balances the exponents, i\.e\.we have to solve the system12\(λr−r2\)2\+log⁡α=q2=\(r2ν−1\)2\.\\displaystyle\\frac\{1\}\{2\}\(\\frac\{\\lambda\}\{r\}\-\\frac\{r\}\{2\}\)^\{2\}\+\\log\\alpha=q^\{2\}=\(\\frac\{r\}\{\\sqrt\{2\\nu\}\}\-1\)^\{2\}\.\(27\)We obtainr=\(q\+1\)2νr=\(q\+1\)\\sqrt\{2\\nu\}, andλ=12r2\+r2q2−2log⁡α=ν\(q\+1\)2\+2ν\(q\+1\)q2−log⁡α\.\\displaystyle\\lambda=\{\\textstyle\{\\frac\{1\}\{2\}\}\}r^\{2\}\+r\\sqrt\{2q^\{2\}\-2\\log\\alpha\}=\\nu\(q\+1\)^\{2\}\+2\\sqrt\{\\nu\}\(q\+1\)\\sqrt\{q^\{2\}\-\\log\\alpha\}\.\(28\)By construction, for anyβ∈\(0,1\)\\beta\\in\(0,1\), the right hand side of \([26](https://arxiv.org/html/2605.26222#S3.E26)\) is bounded by\(1\+α2\)e−q2\(1\+\\frac\{\\alpha\}\{2\}\)e^\{\-q^\{2\}\}\. Setting this toβ\\betaand solving forqqyieldsq=log⁡1\+α/2βq=\\sqrt\{\\log\\frac\{1\+\\alpha/2\}\{\\beta\}\}, as long as1α≥β−12\\frac\{1\}\{\\alpha\}\\geq\\beta\-\\frac\{1\}\{2\}, concluding the proof\. ∎

###### Proof of Proposition[1](https://arxiv.org/html/2605.26222#Thmlemma1)– conclusion\.

We combine the above results: it follows from Lemmas[3](https://arxiv.org/html/2605.26222#Thmlemma3)and[5](https://arxiv.org/html/2605.26222#Thmlemma5)that

ℙ\{g\(G,Z\)≥2ν\+λ\}\\displaystyle\\mathbb\{P\}\\\{g\(G,Z\)\\geq 2\\nu\+\\lambda\\\}≤ℙ\{h\(G,Z\)≥λ\}≤β,\\displaystyle\\leq\\mathbb\{P\}\\\{h\(G,Z\)\\geq\\lambda\\\}\\leq\\beta,\(29\)forλ\\lambdaas in \([28](https://arxiv.org/html/2605.26222#S3.E28)\)\. By Lemma[2](https://arxiv.org/html/2605.26222#Thmlemma2), this impliesℙ\{f\(Y,S\)≥2ν\+λ\}≤β\\mathbb\{P\}\\\{f\(Y,S\)\\geq 2\\nu\+\\lambda\\\}\\leq\\beta, and thereforeI∞β\(𝒜\(S\),S\)≤2ν\+λI^\{\\beta\}\_\{\\infty\}\(\\mathcal\{A\}\(S\),S\)\\leq 2\\nu\+\\lambda\. Inserting the explicit expressions forλ\\lambdaandν\\nuconcludes the proof\. ∎

### 3\.2Relation toDwork et al\.\([2015b](https://arxiv.org/html/2605.26222#bib.bib18)\)

![Refer to caption](https://arxiv.org/html/2605.26222v1/x1.png)Figure 1:The coefficient constants for identicalϵ\\epsilonare comparable betweenDwork et al\.\([2015b](https://arxiv.org/html/2605.26222#bib.bib18)\)’s expressions forϵ\\epsilon\-DP and our results for the Gaussian mechanism\. Note that both results apply in different situations, as the Gaussian mechanism is notϵ\\epsilon\-DP for anyϵ\\epsilon\.In this section we compare Proposition[1](https://arxiv.org/html/2605.26222#Thmlemma1)to ’s classic result that anyϵ\\epsilon\-differentially private algorithm fulfillsI∞β\(𝒜\(S\),S\)≤mϵ22\+ϵm2log⁡2βI^\{\\beta\}\_\{\\infty\}\(\\mathcal\{A\}\(S\),S\)\\leq\\frac\{m\\epsilon^\{2\}\}\{2\}\+\\epsilon\\sqrt\{\\frac\{m\}\{2\}\\log\\frac\{2\}\{\\beta\}\}, see[Section˜2](https://arxiv.org/html/2605.26222#S2)\. Note that noϵ\\epsilon\-private algorithm can have the form of a Gaussian mechanism, because that only achieves\(ϵ,δ\)\(\\epsilon,\\delta\)\-differentially privacy withδ\>0\\delta\>0\. So, there is actually no setting where[Dwork et al\.](https://arxiv.org/html/2605.26222#bib.bib18)’s result is applicable at the same time as ours\. Nevertheless, to provide an intuition of their scaling behaviors, we will compare the dependence of both results on the training data size and other parameters\.

Specifically, in the relevant regime ofϵ∈\(0,1\)\\epsilon\\in\(0,1\), the Gaussian mechanism achieves\(ϵ,δ\)\(\\epsilon,\\delta\)\-differential privacy if instantiated withσ=sϵ2log⁡1\.25δ\\sigma=\\frac\{s\}\{\\epsilon\}\\sqrt\{2\\log\\frac\{1\.25\}\{\\delta\}\}, i\.e\.ϵ=sσ2log⁡1\.25δ\\epsilon=\\frac\{s\}\{\\sigma\}\\sqrt\{2\\log\\frac\{1\.25\}\{\\delta\}\}\. Consequently, the above expression would read as

ms2σ2log⁡1\.25δ\+msσlog⁡1\.25δlog⁡2β,\\displaystyle\\frac\{ms^\{2\}\}\{\\sigma^\{2\}\}\\log\\frac\{1\.25\}\{\\delta\}\+\\frac\{\\sqrt\{m\}s\}\{\\sigma\}\\sqrt\{\\log\\frac\{1\.25\}\{\\delta\}\\log\\frac\{2\}\{\\beta\}\},\(30\)which has exactly the same functional form,c1ms2σ2\+c2msσc\_\{1\}\\frac\{ms^\{2\}\}\{\\sigma^\{2\}\}\+c\_\{2\}\\frac\{\\sqrt\{m\}s\}\{\\sigma\}, as \([8](https://arxiv.org/html/2605.26222#S3.E8)\), only with other constants \(which depend logarithmically onβ\\betaand, potentially, onα\\alphaandδ\\delta\)\.

For illustration, we takeδ=β\\delta=\\betaand restrict ourselves toβ∈\(0,12\)\\beta\\in\(0,\\frac\{1\}\{2\}\)andα≤3\\alpha\\leq 3, which includes all cases of practical interest\. In Figure[1](https://arxiv.org/html/2605.26222#S3.F1)we visualize the coefficients to illustrate their relative sizes\. As one can see, they are of comparable size across the full range of parameter choices\. The following lemma confirms this fact formally for the dominant linear term\.

###### Lemma 6\.

Denote byR\(α,β\)=\[34\+12log⁡1\+α/2β\+14log⁡1\+α/2β\]/log⁡1\.25βR\(\\alpha,\\beta\)=\\Bigl\[\\frac\{3\}\{4\}\+\\frac\{1\}\{2\}\\sqrt\{\\log\\frac\{1\+\\alpha/2\}\{\\beta\}\}\+\\frac\{1\}\{4\}\\log\\frac\{1\+\\alpha/2\}\{\\beta\}\\Bigr\]\\Bigm/\\log\\frac\{1\.25\}\{\\beta\}the ratio of the leading terms \([8](https://arxiv.org/html/2605.26222#S3.E8)\) and \([30](https://arxiv.org/html/2605.26222#S3.E30)\), respectively\. Then, it holds forα∈\(0,3\]\\alpha\\in\(0,3\]andβ∈\(0,12\]\\beta\\in\(0,\\frac\{1\}\{2\}\]:

- •R\(α,β\)R\(\\alpha,\\beta\)is strictly increasing inα\\alphaandβ\\beta, and14≤R\(α,β\)≤2\\frac\{1\}\{4\}\\leq R\(\\alpha,\\beta\)\\leq 2\.

###### Proof\.

The proof relies on explicit calculus to establish the monotonicity ofRR\. The intervals are then determined from the boundary values\. The formal derivations are provided in Appendix[B](https://arxiv.org/html/2605.26222#A2)\. ∎

Under the caveat that no formal equivalence is actually possible, the lemma provides an intuition that the max\-information, and thereby the generalization abilities, of the\(ϵ,δ\)\(\\epsilon,\\delta\)\-DP Gaussian mechanism forδ=β\\delta=\\betais "comparable" to that of a genericϵ\\epsilon\-differential private algorithm for the sameϵ\\epsilon\. Consequently, with respect to its generalization properties, it should be possible to use DP\-SGD as a plug\-in replacement for steps that previously required the use of anϵ\\epsilon\-DP algorithm\. In the following section we demonstrate this formally for the case of PAC\-Bayes generalization\.

## 4PAC\-Bayes Learning with DP\-SGD Prior

We now state two applications of Theorem[1](https://arxiv.org/html/2605.26222#Thmtheorem1)in the context of generalization, a PAC\-Bayes bound with prior distribution learned using DP\-SGD \(Theorem[2](https://arxiv.org/html/2605.26222#Thmtheorem2)\), and a PAC\-Bayes generalization bound directly for stochastic classifiers obtained from DP\-SGD \(Corollary[3](https://arxiv.org/html/2605.26222#Thmtheorem3)\)\. The bound holds for any mechanism for deriving a distribution over models from the DP\-SGD output\. The most common choice is to use Gaussian distributions centered at the learned parameter vectors, which makes theDKLD\_\{\\text\{KL\}\}\-term appearing in the bounds explicitly computable\. For other possibilities, see Appendix[A](https://arxiv.org/html/2605.26222#A1)\.

###### Theorem 2\.

For a datasetS=\(X1,…,Xn\)S=\(X\_\{1\},\\dots,X\_\{n\}\)of i\.i\.d\.data points, letπ\\pibe a prior distribution obtained from running DP\-SGD forEEepochs onSS, withTTbatches of sizemmper epoch, with clipping thresholdζ\\zetaand noise strengthσ\\sigma\. Assume that the loss function is bounded,ℓ:𝒳×𝒴→\[0,1\]\\ell:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\[0,1\]\. Then, for anyδ∈\(0,1\)\\delta\\in\(0,1\), it holds with probability at least1−δ1\-\\deltaover the sampling ofSS, that uniformly for all posterior distributionsρ\\rho:

kl\(R^\(ρ\)∥R\(ρ\)\)≤DKL\(ρ∥π\)\+κ\+log⁡\(4nδ\)n\.\\displaystyle\\text\{kl\}\\bigl\(\\widehat\{R\}\(\\rho\)\\\|R\(\\rho\)\\bigr\)\\leq\\frac\{D\_\{\\text\{KL\}\}\(\\rho\\\|\\pi\)\+\\kappa\+\\log\(\\frac\{4\\sqrt\{n\}\}\{\\delta\}\)\}\{n\}\.\(31\)for

κ\\displaystyle\\kappa=12ETν\+Einfλ∈\(0,rν\)1λ\[TF\(λ\+λ22\)\+log⁡2Eδ\]\\displaystyle=\\frac\{1\}\{2\}ET\\nu\+E\\inf\_\{\\lambda\\in\(0,r\_\{\\nu\}\)\}\\frac\{1\}\{\\lambda\}\\Bigl\[TF\(\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\)\+\\log\\frac\{2E\}\{\\delta\}\\Big\]\(32\)withFFdefined as in Theorem[1](https://arxiv.org/html/2605.26222#Thmtheorem1), as well as≤ETmζ2σ2\(1\+3q\+12q2\)\+ETmζσ\(12\+3q\+12q2\)forq=2Tlog⁡\(2E/δ\)\.\\displaystyle\\leq ETm\\frac\{\\zeta^\{2\}\}\{\\sigma^\{2\}\}\(1\+3q\+\\frac\{1\}\{2\}q^\{2\}\)\+ET\\sqrt\{m\}\\frac\{\\zeta\}\{\\sigma\}\(\\frac\{1\}\{2\}\+3q\+\\frac\{1\}\{2\}q^\{2\}\)\\quad\\text\{for $q=\\sqrt\{\\frac\{2\}\{T\}\\log\(2E/\\delta\)\}$\.\}\(33\)

###### Proof\.

Theorem[2](https://arxiv.org/html/2605.26222#Thmtheorem2)is a direct consequence of the*max\-information*Lemma of\(Rivasplata et al\.,[2020](https://arxiv.org/html/2605.26222#bib.bib47), Lemma 7\), which established the inequality \([31](https://arxiv.org/html/2605.26222#S4.E31)\) for priorsπ\\pilearned in any data\-dependent way withI∞δ/2\(π\(S\),S\)I^\{\\delta/2\}\_\{\\infty\}\(\\pi\(S\),S\)in place ofκ\\kappa\. The specific form ofκ\\kappafollows from Theorem[1](https://arxiv.org/html/2605.26222#Thmtheorem1)forβ=δ/2\\beta=\\delta/2, because the construction of a distributionπ\\pifrom the output of DP\-SGD is a postprocessing step that does not increase the max\-information\. ∎

From[Theorem˜2](https://arxiv.org/html/2605.26222#Thmtheorem2)we immediately obtain a generalization bound for DP\-SGD\-trained models that depends only on the training hyper\-parameters, simply by evaluating the bound withρ=π\\rho=\\pi\.

###### Corollary 3\.

In the setting of[Theorem˜2](https://arxiv.org/html/2605.26222#Thmtheorem2), for any distributionπ\\pi, constructed from the output of DP\-SGD, it holds with probability at least1−δ1\-\\deltathat, withκ\\kappaas in Theorem[2](https://arxiv.org/html/2605.26222#Thmtheorem2):

kl\(R^\(π\)∥R\(π\)\)≤κ\+log⁡\(4nδ\)n\.\\displaystyle\\text\{kl\}\\bigl\(\\widehat\{R\}\(\\pi\)\\\|R\(\\pi\)\\bigr\)\\leq\\frac\{\\kappa\+\\log\(\\frac\{4\\sqrt\{n\}\}\{\\delta\}\)\}\{n\}\.\(34\)

![Refer to caption](https://arxiv.org/html/2605.26222v1/x2.png)Figure 2:Numeric values of complexity terms in Theorem[2](https://arxiv.org/html/2605.26222#Thmtheorem2)and Corollary[3](https://arxiv.org/html/2605.26222#Thmtheorem3)for different hyperparameter values\.κ\\kappaandκ′\\kappa^\{\\prime\}denote the values from \([32](https://arxiv.org/html/2605.26222#S4.E32)\) and \([33](https://arxiv.org/html/2605.26222#S4.E33)\), respectively\.Table 1:Numeric upper bounds,ℬ\\mathcal\{B\}, on the risk,RR, for deep networks on MNIST and CIFAR\-10, obtained by numerically inverting \([34](https://arxiv.org/html/2605.26222#S4.E34)\) forπ\\pi, and \([31](https://arxiv.org/html/2605.26222#S4.E31)\) forρ\\rho\. For standard deviations, see Appendix[C](https://arxiv.org/html/2605.26222#A3)\.### 4\.1Numerical Experiments

We now illustrate the behavior of Theorem[2](https://arxiv.org/html/2605.26222#Thmtheorem2)and Corollary[3](https://arxiv.org/html/2605.26222#Thmtheorem3)numerically, see Appendix[C](https://arxiv.org/html/2605.26222#A3)for details of the setup\. Figure[2](https://arxiv.org/html/2605.26222#S4.F2)visualizes the terms on the right hand side of \([34](https://arxiv.org/html/2605.26222#S4.E34)\), which is also the posterior\-independent part of the complexity term \([32](https://arxiv.org/html/2605.26222#S4.E32)\)\. One can see thatκn\\frac\{\\kappa\}\{n\}is robust against different values ofnn, and most affected by the signal to noise ratioζ/σ\\zeta/\\sigma\. Using the optimized choice \([32](https://arxiv.org/html/2605.26222#S4.E32)\) instead of the explicit formula \([33](https://arxiv.org/html/2605.26222#S4.E33)\) results in tighter bounds in all cases, though not dramatically so\. With fixed hyperparameters, the complexity terms scale approximately likelog⁡\(1/δ\)\\log\(1/\\delta\)\. The1nlog⁡\(4n/δ\)\\frac\{1\}\{n\}\\log\(4\\sqrt\{n\}/\\delta\)term is generally small, but can exceedκ/n\\kappa/nfor small values ofnn\.

[Table˜1](https://arxiv.org/html/2605.26222#S4.T1)reports upper bounds on the test error, obtained by numerically inverting thekl\-terms of \([31](https://arxiv.org/html/2605.26222#S4.E31)\) and \([34](https://arxiv.org/html/2605.26222#S4.E34)\), in two settings: for a multi\-layer perceptron network trained on the MNIST dataset, and for a deep ConvNet trained on the CIFAR\-10 dataset\. In both cases one can see that the resulting bounds are non\-vacuous, despite the highly overparameterized setup, and that they are tightened by the use of a DP\-SGD prior rather than a data\-independent prior\.

## 5Conclusion

In this work, we established the first explicit upper bounds on theβ\\beta\-approximate max\-information of DP\-SGD and variants in regimes relevant to modern deep learning\. Our results provide a principled connection between differentially private deep learning and PAC\-Bayes generalization theory, advancing over previous analyses that were restricted to pure differential privacy or restricted problem classes\. Leveraging our bounds, we showed that DP\-SGD can be used to learn data\-dependent priors while retaining generalization guarantees expressible directly in terms of optimization hyperparameters\.

#### Limitations\.

Some limitations remain in our work\. On the technical level, we discuss DP\-SGD with fixed\-size, non\-overlapping batches\. This is in contrast to works on differential privacy, where variable\-size randomly constructed batches, e\.g\.by Poisson sampling, can provide better guarantees by means of*privacy amplification*\(Ponomareva et al\.,[2023](https://arxiv.org/html/2605.26222#bib.bib43)\)\. However, we are confident that our proofs can be extended to this setup, though potentially at the cost of worse constants\. Note, also, that for the purpose of Theorem[2](https://arxiv.org/html/2605.26222#Thmtheorem2), the exact privacy guarantees of the DP\-SGD step do not matter so much, as the prior serves mainly as an analysis tool and the bound is explicit in the DP\-SGD hyperparameters, not in\(ϵ,δ\)\(\\epsilon,\\delta\)\. Consequently, learning the prior with fixed\-size batches after data shuffling, as is standard practice for non\-private SGD, might actually be preferable in practice\. A second limitation is that our PAC\-Bayes bounds hold only for bounded loss functions\. We expect that extensions of our work to unbounded losses will be possible, but it would require changes to the proof techniques that go beyond the scope of this work\.

## Acknowledgments and Disclosure of Funding

This research was supported by the Scientific Service Units \(SSU\) of ISTA through resources provided by Scientific Computing\. The work took place in parts while the first author was on a sabbatical leave at IIT in Genoa \. He would like to thank the*Computational Statistics and Machine Learning*for their hospitality and for providing a stimulating research environment\.

## References

- Abadi et al\. \(2016\)Abadi, M\., Chu, A\., Goodfellow, I\., McMahan, H\. B\., Mironov, I\., Talwar, K\., and Zhang, L\.Deep learning with differential privacy\.In*ACM SIGSAC Conference on Computer and Communications Security \(CCS\)*, 2016\.
- Ambroladze et al\. \(2006\)Ambroladze, A\., Parrado\-Hernández, E\., and Shawe\-Taylor, J\.Tighter PAC\-Bayes bounds\.In*Conference on Neural Information Processing Systems \(NeurIPS\)*, 2006\.
- Anil et al\. \(2022\)Anil, R\., Ghazi, B\., Gupta, V\., Kumar, R\., and Manurangsi, P\.Large\-scale differentially private BERT\.In*Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, 2022\.
- Bassily et al\. \(2014\)Bassily, R\., Thakurta, A\., and Smith, A\.Private empirical risk minimization: Efficient algorithms and tight error bounds\.In*Symposium on Foundations of Computer Science \(FOCS\)*, 2014\.
- Bassily et al\. \(2016\)Bassily, R\., Nissim, K\., Smith, A\. D\., Steinke, T\., Stemmer, U\., and Ullman, J\. R\.Algorithmic stability for adaptive data analysis\.In*Symposium on Theory of Computing \(STOC\)*, 2016\.
- Blanco\-Justicia et al\. \(2023\)Blanco\-Justicia, A\., Sánchez, D\., Domingo\-Ferrer, J\., and Muralidhar, K\.A critical review on the use \(and misuse\) of differential privacy in machine learning\.*ACM Computing Surveys*, 55\(8\), 2023\.
- Blundell et al\. \(2015\)Blundell, C\., Cornebise, J\., Kavukcuoglu, K\., and Wierstra, D\.Weight uncertainty in neural network\.In*International Conference on Machine Learning \(ICML\)*, 2015\.
- Bombari & Mondelli \(2025\)Bombari, S\. and Mondelli, M\.Privacy for free in the overparameterized regime\.*Proceedings of the National Academy of Sciences \(PNAS\)*, 122\(15\), 2025\.
- Boucheron et al\. \(2013\)Boucheron, S\., Lugosi, G\., and Massart, P\.*Concentration Inequalities: A Nonasymptotic Theory of Independence*\.Oxford University Press, 2013\.
- Carlini et al\. \(2019\)Carlini, N\., Liu, C\., Erlingsson, Ú\., Kos, J\., and Song, D\.The secret sharer: Evaluating and testing unintended memorization in neural networks\.In*USENIX Security Symposium*, 2019\.
- Carlini et al\. \(2021\)Carlini, N\., Tramèr, F\., Wallace, E\., Jagielski, M\., Herbert\-Voss, A\., Lee, K\., Roberts, A\., Brown, T\. B\., Song, D\., Erlingsson, Ú\., Oprea, A\., and Raffel, C\.Extracting training data from large language models\.In*USENIX Security Symposium*, 2021\.
- Catoni \(2007\)Catoni, O\.*PAC\-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning*\.Institute of Mathematical Statistics, 2007\.
- Chua et al\. \(2024\)Chua, L\., Ghazi, B\., Kamath, P\., Kumar, R\., Manurangsi, P\., Sinha, A\., and Zhang, C\.Scalable DP\-SGD: shuffling vs\. poisson subsampling\.In*Conference on Neural Information Processing Systems \(NeurIPS\)*, 2024\.
- Chua et al\. \(2025\)Chua, L\., Ghazi, B\., Harrison, C\., Kamath, P\., Kumar, R\., Leeman, E\., Manurangsi, P\., Sinha, A\., and Zhang, C\.Balls\-and\-bins sampling for DP\-SGD\.In*International Conference on Artificial Intelligence and Statistics \(AISTATS\)*, 2025\.
- Dwork & Roth \(2014\)Dwork, C\. and Roth, A\.The algorithmic foundations of differential privacy\.*Foundations and Trends in Theoretical Computer Science*, 9\(3–4\), 2014\.
- Dwork et al\. \(2006\)Dwork, C\., Kenthapadi, K\., McSherry, F\., Mironov, I\., and Naor, M\.Our data, ourselves: Privacy via distributed noise generation\.In*International Conference on the Theory and Applications of Cryptographic Techniques \(EUROCRYPT\)*, 2006\.
- Dwork et al\. \(2015a\)Dwork, C\., Feldman, V\., Hardt, M\., Pitassi, T\., Reingold, O\., and Roth, A\.Generalization in adaptive data analysis and holdout reuse\.*arXiv:1506\.02629*, 2015a\.
- Dwork et al\. \(2015b\)Dwork, C\., Feldman, V\., Hardt, M\., Pitassi, T\., Reingold, O\., and Roth, A\.Generalization in adaptive data analysis and holdout reuse\.In*Conference on Neural Information Processing Systems \(NeurIPS\)*, 2015b\.
- Dziugaite & Roy \(2017\)Dziugaite, G\. K\. and Roy, D\. M\.Computing nonvacuous generalization bounds for deep \(stochastic\) neural networks with many more parameters than training data\.In*Uncertainty in Artificial Intelligence \(UAI\)*, 2017\.
- Dziugaite & Roy \(2018\)Dziugaite, G\. K\. and Roy, D\. M\.Data\-dependent PAC\-Bayes priors via differential privacy\.In*Advances in Neural Information Processing Systems*, 2018\.
- Erlingsson et al\. \(2019\)Erlingsson, Ú\., Feldman, V\., Mironov, I\., Raghunathan, A\., Talwar, K\., and Thakurta, A\.Amplification by shuffling: From local to central differential privacy via anonymity\.In*Symposium on Discrete Algorithms \(SODA\)*, 2019\.
- Feldman & Shenfeld \(2025\)Feldman, V\. and Shenfeld, M\.Privacy amplification by random allocation\.*arXiv:2502\.08202*, 2025\.
- García\-Pérez et al\. \(2025\)García\-Pérez, D\., Parrado\-Hernández, E\., and Shawe\-Taylor, J\.Some theoretical improvements on the tightness of PAC\-Bayes risk certificates for neural networks\.*arXiv preprint arXiv:2510\.07935*, 2025\.
- Issa et al\. \(2023\)Issa, I\., Esposito, A\. R\., and Gastpar, M\.Asymptotically optimal generalization error bounds for noisy, iterative algorithms\.In*Conference on Learning Theory \(COLT\)*, 2023\.
- Jung et al\. \(2020\)Jung, C\., Ligett, K\., Neel, S\., Roth, A\., Sharifi\-Malvajerdi, S\., and Shenfeld, M\.A new analysis of differential privacy’s generalization guarantees\.In*Innovations in Theoretical Computer Science Conference \(ITCS\)*, 2020\.
- Kasiviswanathan et al\. \(2011\)Kasiviswanathan, S\. P\., Lee, H\. K\., Nissim, K\., Raskhodnikova, S\., and Smith, A\. D\.What can we learn privately?*SIAM Journal on Computing*, 40\(3\):793–826, 2011\.
- Krizhevsky & Hinton \(2009\)Krizhevsky, A\. and Hinton, G\.Learning multiple layers of features from tiny images\.Technical report, University of Toronto, 2009\.
- LeCun et al\. \(1998\)LeCun, Y\., Bottou, L\., Bengio, Y\., and Haffner, P\.Gradient\-based learning applied to document recognition\.*Proceedings of the IEEE*, 86\(11\):2278–2324, 1998\.
- Lever et al\. \(2013\)Lever, G\., Laviolette, F\., and Shawe\-Taylor, J\.Tighter pac\-bayes bounds through distribution\-dependent priors\.*Theoretical Computer Science*, 473:4–28, 2013\.
- Lotfi et al\. \(2022\)Lotfi, S\., Finzi, M\., Kapoor, S\., Potapczynski, A\., Goldblum, M\., and Wilson, A\. G\.PAC\-Bayes compression bounds so tight that they can explain generalization\.In*Conference on Neural Information Processing Systems \(NeurIPS\)*, 2022\.
- Lotfi et al\. \(2024\)Lotfi, S\., Finzi, M\. A\., Kuang, Y\., Rudner, T\. G\. J\., Goldblum, M\., and Wilson, A\. G\.Non\-vacuous generalization bounds for large language models\.In*International Conference on Machine Learning \(ICML\)*, 2024\.
- Luo et al\. \(2024\)Luo, Z\., Zou, Y\., Yang, Y\., Durante, Z\., Huang, D\.\-A\., Yu, Z\., Xiao, C\., Fei\-Fei, L\., and Anandkumar, A\.Differentially private video activity recognition\.In*IEEE/CVF Winter Conference on Applications of Computer Vision \(WACV\)*, 2024\.
- Maddox et al\. \(2019\)Maddox, W\. J\., Izmailov, P\., Garipov, T\., Vetrov, D\. P\., and Wilson, A\. G\.A simple baseline for Bayesian uncertainty in deep learning\.In*Conference on Neural Information Processing Systems \(NeurIPS\)*, 2019\.
- Mandt et al\. \(2017\)Mandt, S\., Hoffman, M\. D\., and Blei, D\. M\.Stochastic gradient descent as approximate Bayesian inference\.*Journal of Machine Learning Research \(JMLR\)*, 18:134:1–134:35, 2017\.
- Maurer \(2004\)Maurer, A\.A note on the PAC\-Bayesian theorem\.*arXiv:0411099*, 2004\.
- Maurer et al\. \(2025\)Maurer, A\., Mirzaei, E\., and Pontil, M\.Generalization of Gibbs and Langevin Monte Carlo algorithms in the interpolation regime\.*arXiv:2510\.06028*, 2025\.
- McAllester \(1999\)McAllester, D\. A\.Some PAC\-Bayesian theorems\.In*Conference on Learning Theory \(COLT\)*, 1999\.
- Nken et al\. \(2025\)Nken, A\. T\. A\., McKeever, S\., Corcoran, P\., and Ullah, I\.Video\-DPRP: A differentially private approach for visual privacy\-preserving video human activity recognition\.In*European Conference on Machine Learning and Data Mining \(ECML/PKDD\)*, 2025\.
- Parrado\-Hernández et al\. \(2012\)Parrado\-Hernández, E\., Ambroladze, A\., Shawe\-Taylor, J\., and Sun, S\.PAC\-bayes bounds with data dependent priors\.*Journal of Machine Learning Research \(JMLR\)*, 13:3507–3531, 2012\.
- Pensia et al\. \(2018\)Pensia, A\., Jog, V\. S\., and Loh, P\.\-L\.Generalization error bounds for noisy, iterative algorithms\.In*International Symposium on Information Theory \(ISIT\)*, 2018\.
- Pérez\-Ortiz et al\. \(2021\)Pérez\-Ortiz, M\., Rivasplata, O\., Shawe\-Taylor, J\., and Szepesvári, C\.Tighter risk certificates for neural networks\.*Journal of Machine Learning Research \(JMLR\)*, 22\(227\):1–40, 2021\.
- Pillutla et al\. \(2025\)Pillutla, K\., Upadhyay, J\., Choquette\-Choo, C\. A\., Dvijotham, K\., Ganesh, A\., Henzinger, M\., Katz, J\., McKenna, R\., McMahan, H\. B\., Rush, K\., Steinke, T\., and Thakurta, A\.Correlated noise mechanisms for differentially private learning\.*arXiv:2506\.08201*, 2025\.
- Ponomareva et al\. \(2023\)Ponomareva, N\., Hazimeh, H\., Kurakin, A\., Xu, Z\., Denison, C\., McMahan, H\. B\., Vassilvitskii, S\., Chien, S\., and Thakurta, A\. G\.How to dp\-fy ML: A practical guide to machine learning with differential privacy\.*Journal of Artificial Intelligence Research \(JAIR\)*, 77:1113–1201, 2023\.
- Rajkumar & Agarwal \(2012\)Rajkumar, A\. and Agarwal, S\.A differentially private stochastic gradient descent algorithm for multiparty classification\.In*International Conference on Artificial Intelligence and Statistics \(AISTATS\)*, 2012\.
- Rivasplata et al\. \(2018\)Rivasplata, O\., Szepesvári, C\., Shawe\-Taylor, J\., Parrado\-Hernández, E\., and Sun, S\.PAC\-Bayes bounds for stable algorithms with instance\-dependent priors\.In*Conference on Neural Information Processing Systems \(NeurIPS\)*, 2018\.
- Rivasplata et al\. \(2019\)Rivasplata, O\., Tankasali, V\. M\., and Szepesvári, C\.PAC\-Bayes with backprop\.*arXiv:1908\.07380*, 2019\.
- Rivasplata et al\. \(2020\)Rivasplata, O\., Kuzborskij, I\., Szepesvári, C\., and Shawe\-Taylor, J\.PAC\-Bayes analysis beyond the usual bounds\.In*Conference on Neural Information Processing Systems \(NeurIPS\)*, 2020\.
- Rogers et al\. \(2016\)Rogers, R\., Roth, A\., Smith, A\., and Thakkar, O\.Max\-information, differential privacy, and post\-selection hypothesis testing\.*arXiv: 1604\.03924*, 2016\.
- Sander et al\. \(2024\)Sander, T\., Yu, Y\., Sanjabi, M\., Durmus, A\., Ma, Y\., Chaudhuri, K\., and Guo, C\.Differentially private representation learning via image captioning\.In*International Conference on Machine Learning \(ICML\)*, 2024\.
- Seeger \(2002\)Seeger, M\.PAC\-Bayesian generalization error bounds for Gaussian processes\.*Journal of Machine Learning Research*, 3:233–269, 2002\.
- Shi et al\. \(2026\)Shi, Z\., Wang, P\., Zhang, C\., and Cao, Y\.Towards understanding generalization in DP\-GD: A case study in training two\-layer cnns\.In*Conference on Artificial Intelligence \(AAAI\)*, 2026\.
- Sinha et al\. \(2025\)Sinha, A\., Mesnard, T\., McKenna, R\., Liu, D\., Choquette\-Choo, C\. A\., Huang, Y\., Yu, D\., Kaissis, G\., Charles, Z\., Liu, R\., Chua, L\., Kamath, P\., Manurangsi, P\., He, S\., Zhang, C\., Ghazi, B\., Pigem, B\. D\. B\., Eruvbetine, P\., Warkentin, T\., Joulin, A\., and Kumar, R\.VaultGemma: A differentially private gemma model\.*arXiv:2510\.15001*, 2025\.
- Stemmer & Nissim \(2019\)Stemmer, U\. and Nissim, K\.Concentration bounds for high sensitivity functions through differential privacy\.*Journal of Privacy and Confidentiality*, 9\(1\), 2019\.
- Tang et al\. \(2024\)Tang, Q\., Shpilevskiy, F\., and Lécuyer, M\.DP\-AdamBC: Your DP\-Adam is actually DP\-SGD \(unless you apply bias correction\)\.In*Conference on Artificial Intelligence \(AAAI\)*, 2024\.
- Valiant \(1984\)Valiant, L\. G\.A theory of the learnable\.*Communications of the ACM*, 27\(11\):1134–1142, 1984\.
- Vapnik & Chervonenkis \(1971\)Vapnik, V\. N\. and Chervonenkis, A\. Y\.On the uniform convergence of relative frequencies of events to their probabilities\.*Theory of Probability & Its Applications*, 16\(2\):264–280, 1971\.
- Vershynin \(2018\)Vershynin, R\.*High\-Dimensional Probability: An Introduction with Applications in Data Science*\.Cambridge University Press, 2018\.
- Wang et al\. \(2021\)Wang, H\., Huang, Y\., Gao, R\., and Calmon, F\. P\.Analyzing the generalization capability of SGLD using properties of gaussian channels\.In*Conference on Neural Information Processing Systems \(NeurIPS\)*, 2021\.
- Xu & Raginsky \(2017\)Xu, A\. and Raginsky, M\.Information\-theoretic analysis of generalization capability of learning algorithms\.In*Conference on Neural Information Processing Systems \(NeurIPS\)*, 2017\.
- Zhou et al\. \(2019\)Zhou, W\., Veitch, V\., Austern, M\., Adams, R\. P\., and Orbanz, P\.Non\-vacuous generalization bounds at the ImageNet scale: a PAC\-Bayesian compression approach\.In*International Conference on Learning Representations \(ICLR\)*, 2019\.

## Appendix AExtended Background

In this section we provide additional background information on the concepts introduced in Section[2](https://arxiv.org/html/2605.26222#S2)\.

### A\.1Learning Setup

The learning setup in which we work \(input space𝒳\\mathcal\{X\}, model space𝒴\\mathcal\{Y\}, loss functionℓ:𝒳×𝒴→ℝ\+\\ell:\\mathcal\{X\}\\times\\mathcal\{Y\}\\to\\mathbb\{R\}\_\{\+\}\) provides a unified notation for a number of standard learning scenarios\. Most commonly, in supervised classification,𝒳\\mathcal\{X\}would contain input\-output pairs,𝒴\\mathcal\{Y\}, could be the model parameters, e\.g\.of a neural network, andℓ\\ellmeasures if the model predicts the correct class label or not\. However, the notation also applies more widely\. For example, in unsupervised learning, the input set would be unlabeled data, the output could be a set of cluster centroids, and the loss function measures the distance between a data and its closest cluster centroid\.

### A\.2Differentially\-Private Stochastic Gradient Descent \(DP\-SGD\)

Several variants and improvements of DP\-SGD and its privacy analysis have been developed\.

One aspect under active development is*batch generation*: it has been shown that when data batches are created randomly, e\.g\.by Poisson sampling\(Kasiviswanathan et al\.,[2011](https://arxiv.org/html/2605.26222#bib.bib26)\),*balls\-and\-bins*\(Chua et al\.,[2025](https://arxiv.org/html/2605.26222#bib.bib14)\),kk\-out\-of\-ttallocation\(Feldman & Shenfeld,[2025](https://arxiv.org/html/2605.26222#bib.bib22)\), or even just*data shuffling*\(Erlingsson et al\.,[2019](https://arxiv.org/html/2605.26222#bib.bib21)\), stronger privacy guarantees can be proven than what would be provided by analyzing DP\-SGD simply as a concatenation of Gaussian mechanisms\. However, these techniques tend to result in batches with different \(random\) batch sizes, which creates problems with practical implementation and resource utilization\. A promising alternative is*Truncated Poisson Sampling*\(Chua et al\.,[2024](https://arxiv.org/html/2605.26222#bib.bib13)\), which enforces an upper bound on the batch size and has been used, e\.g\., in training Google’s VaultGemma LLM\(Sinha et al\.,[2025](https://arxiv.org/html/2605.26222#bib.bib52)\)\.

In orthogonal efforts, it has been shown that the use of correlated instead of uncorrelated noise, equally strong privacy guarantees can often be achieved with less overall added noise and higher model utility\(Pillutla et al\.,[2025](https://arxiv.org/html/2605.26222#bib.bib42)\)\.

We believe that an extension of our current analysis, which assumes classic fixed\-size disjoint batches and independent noise, to these techniques should be possible, but would require changes to our proof techniques that go beyond the scope of this work\.

Our discussion of DP\-SGD left the choice of the*GradientUpdate*routine mostly open, because the specific form of the update does not affect either the max\-information or the privacy guarantees, since the analysis depends only on the privatized update vectors\.

Classically, the update steps of \(DP\-\)SGD, given an update vector ofutu\_\{t\}is

- •θt←θt−1−ηtut\\theta\_\{t\}\\leftarrow\\theta\_\{t\-1\}\-\\eta\_\{t\}u\_\{t\}

for some \(constant or time\-varying\) learning rateηt\\eta\_\{t\}\(Abadi et al\.,[2016](https://arxiv.org/html/2605.26222#bib.bib1)\)\.

Including*momentum*of strengthβ\\betaand*weight decay*of strengthα\\alpha, changes the rule to

- •θt←\(1−α\)θt−1−ηtmt\\theta\_\{t\}\\leftarrow\(1\-\\alpha\)\\theta\_\{t\-1\}\-\\eta\_\{t\}m\_\{t\}, where
- •mt←βmt−1\+\(1−β\)utm\_\{t\}\\leftarrow\\beta m\_\{t\-1\}\+\(1\-\\beta\)u\_\{t\},

is the momentum term that is also updated in each step\.

Finally, DP\-Adam\(Anil et al\.,[2022](https://arxiv.org/html/2605.26222#bib.bib3)\)or variants\(Tang et al\.,[2024](https://arxiv.org/html/2605.26222#bib.bib54)\)additionally perform a component\-wise normalization of the updates, e\.g\.

- •θt=θt−1−ηtv^t\+ϵm^t\\theta\_\{t\}=\\theta\_\{t\-1\}\-\\frac\{\\eta\_\{t\}\}\{\\sqrt\{\\hat\{v\}\_\{t\}\}\+\\epsilon\}\\hat\{m\}\_\{t\}, for
- •m^t=mt1−β1t\\hat\{m\}\_\{t\}=\\frac\{m\_\{t\}\}\{1\-\\beta\_\{1\}^\{t\}\}withmt=β1mt−1\+\(1−β1\)utm\_\{t\}=\\beta\_\{1\}m\_\{t\-1\}\+\(1\-\\beta\_\{1\}\)u\_\{t\}, and
- •v^t=vt1−β2t\\hat\{v\}\_\{t\}=\\frac\{v\_\{t\}\}\{1\-\\beta\_\{2\}^\{t\}\}withvt=β2vt−1\+\(1−β2\)ut2v\_\{t\}=\\beta\_\{2\}v\_\{t\-1\}\+\(1\-\\beta\_\{2\}\)u\_\{t\}^\{2\}\.

One can see that in all of them, the model parameters are a deterministic linear \(DP\-SGD\) or non\-linear \(DP\-Adam\) combination of the update vectors, so all of them are covered by our analysis\.

### A\.3PAC\-Bayes generalization bounds\.

#### Construction of priors and posteriors\.

At the core of PAC\-Bayes theory lie stochastic classifiers, i\.e\.distributions,π\\pi\(prior\) andρ\\rho\(posterior\) over the model space from which individual models are sampled\. Our work is agnostic to how exactly the distributions are constructed from the output of a learning algorithm, as this step constitutes a postprocessing operations and therefore cannot lead to an increase of the max\-information \(or a reduction of privacy\)\. In the literature, a number of suitable options exist:

- •Most classically, the training step outputs the parameter vector,θ\\theta, of one model, from which one constructs a Gaussian distribution centered atyywith fixed \(often isotropic\) covariance matrix\. The variance can be treated as a hyper\-parameter and chosen by model\-selection\. The use of a Gaussian form is motivated by the fact that it leads to explicit expression forDKL\(ρ∥π\)D\_\{\\text\{KL\}\}\(\\rho\\\|\\pi\), which can then serve as part of the objective function for the model training step\(Ambroladze et al\.,[2006](https://arxiv.org/html/2605.26222#bib.bib2);Dziugaite & Roy,[2017](https://arxiv.org/html/2605.26222#bib.bib19)\)\.
- •In*Bayesian Deep Learning*, the model parameters directly correspond to the parameters of a distribution over models, e\.g\.by learning not only the mean but also \(co\)variance term of a Gaussian\(Blundell et al\.,[2015](https://arxiv.org/html/2605.26222#bib.bib7);Rivasplata et al\.,[2019](https://arxiv.org/html/2605.26222#bib.bib46)\)\.
- •When the space of model is finite, e\.g\.because the model coefficients are quantized, one can choose the posterior as aδ\\delta\-peak at the \(quantized\) posterior\. This way, the bound applies also to deterministic classifiers\(Zhou et al\.,[2019](https://arxiv.org/html/2605.26222#bib.bib60);Lotfi et al\.,[2022](https://arxiv.org/html/2605.26222#bib.bib30)\)\.
- •Exploiting the iterative nature of \(DP\-\)SGD, one can the construct the mean and covariance matrix of a Gaussian distribution from its iterates\(Maddox et al\.,[2019](https://arxiv.org/html/2605.26222#bib.bib33)\), or treat them as approximate samples from the*Gibbs posterior*\(Mandt et al\.,[2017](https://arxiv.org/html/2605.26222#bib.bib34);Maurer et al\.,[2025](https://arxiv.org/html/2605.26222#bib.bib36)\)\.

#### From*kl*to more interpretable generalization bounds\.

PAC\-Bayes generalization bounds of a formkl\(R^∥R\)≤C/n\\text\{kl\}\(\\widehat\{R\}\\\|R\)\\leq C/nare also known as*fast\-rate bounds*, because their convergence speed can be faster than the classical rateO\(1/n\)O\(\\sqrt\{1/n\}\), specifically when the training risk,R^\\widehat\{R\}, is small\(Seeger,[2002](https://arxiv.org/html/2605.26222#bib.bib50);Catoni,[2007](https://arxiv.org/html/2605.26222#bib.bib12)\)\. This comes at the disadvantage that the bound onRRis only implicit, and the bounding quantity,C/nC/n, lacks immediate interpretability\. Two main routes exist to solve this issue\.

First, observe thatkl\(q∥p\)\\text\{kl\}\(q\\\|p\)is strictly monotonically increasing inp≥qp\\geq q, so it is invertible in that range\. Consequently, from the implicit formkl\(R^∥R\)≤C/n\\text\{kl\}\(\\widehat\{R\}\\\|R\)\\leq C/n, we can compute an upper\-bound onRRasR≤kl−1\(R^∥C/n\)R\\leq\\text\{kl\}^\{\-1\}\(\\widehat\{R\}\\\|C/n\), where

kl−1\(q∥b\)=sup\{p∈\[q,1\]:kl\(q∥p\)≤b\}\.\\text\{kl\}^\{\-1\}\(q\\\|b\)=\\text\{sup\}\\bigl\\\{p\\in\[q,1\]:\\text\{kl\}\(q\\\|p\)\\leq b\\bigr\\\}\.\(35\)Whilekl−1\\text\{kl\}^\{\-1\}does not have a simple analytic form, we can compute it efficiently with numeric techniques, e\.g\., using a binary search or Newton’s method\.

Alternatively, one can use Pinsker’s inequality,kl\(q∥p\)≥2\(q−p\)2\\text\{kl\}\(q\\\|p\)\\geq 2\(q\-p\)^\{2\}, to derive bounds in the classic*generalization\-gap*form,R≤R^\+C2nR\\leq\\widehat\{R\}\+\\sqrt\{\\frac\{C\}\{2n\}\}\. Due to its analytic expression, this form can also be used, e\.g\., as part of the training objective inside of a learning algorithm\.

## Appendix BProofs

In this section we provide the proofs that were left out of the main body due to space reasons\.

### B\.1Proof of Theorem[1](https://arxiv.org/html/2605.26222#Thmtheorem1)

Our proof follows the same type of steps as Proposition[1](https://arxiv.org/html/2605.26222#Thmlemma1), which we demonstrated in the main body, however applied repeatedly to different batches because of the iterative application of the Gaussian mechanism and with refined constants\.

We will prove only the caseE=1E=1explicitly, from which the general case follows by standard properties of the approximate max\-information\. Furthermore, we actually prove the result for an extended version of DP\-SGD, that outputs not only the intermediate model parameters,θt\\theta\_\{t\}in each step but also the the update vectors,utu\_\{t\}, as well as the initialization,θ0\\theta\_\{0\}, and the index sets of the sampled batches,It∈\{1,…,n\}mI\_\{t\}\\in\\\{1,\\dots,n\\\}^\{m\}\. We denote the output at stepttasyt=\(θt,ut\)y\_\{t\}=\(\\theta\_\{t\},u\_\{t\}\), and the initial one asy0=\(θ0,ℐ\)y\_\{0\}=\(\\theta\_\{0\},\\mathcal\{I\}\)withℐ=\{I1,…,IT\}\\mathcal\{I\}=\\\{I\_\{1\},\\dots,I\_\{T\}\\\}\. The standard algorithm that outputs only the models parameters can be obtained from this by postprocessing \(ignoring outputs\), so its approximate max\-information cannot be higher\.

For a datasetS=\(X1,…,Xn\)S=\(X\_\{1\},\\dots,X\_\{n\}\)with independently sampled elements, letY=\(Y0,Y1,Y2,…,YT\)Y=\(Y\_\{0\},Y\_\{1\},Y\_\{2\},\\dots,Y\_\{T\}\)denote the output of \(extended\) DP\-SGD on a datasetSS\. We writeY0t=\(Y0,Y1,Y2,…,Yt\)Y\_\{0\}^\{t\}=\(Y\_\{0\},Y\_\{1\},Y\_\{2\},\\dots,Y\_\{t\}\)for an initial segments ofYY, and analogously for other variables\.

Then, byDwork et al\.\([2015b](https://arxiv.org/html/2605.26222#bib.bib18)\)to establish thatI∞β\(DP\-SGD\(S\),S\)≤κI^\{\\beta\}\_\{\\infty\}\(\\texttt\{DP\-SGD\}\(S\),S\)\\leq\\kappa, it suffices to proveℙ\{f\(S,Y\)≥κ\}≤β\\mathbb\{P\}\\\{f\(S,Y\)\\geq\\kappa\\\}\\leq\\betafor

f\(S,Y\)\\displaystyle f\(S,Y\):=log⁡p\(Y\|S\)p\(Y\)\.\\displaystyle:=\\log\\frac\{p\(Y\|S\)\}\{p\(Y\)\}\.\(36\)
To reason about the steps of the DP\-SGD algorithm, we introduce some additional notation\. For a datasetS=\(X1,…,Xn\)S=\(X\_\{1\},\\dots,X\_\{n\}\)and a set of batch index sets,ℐ=\(I1,…,IT\)\\mathcal\{I\}=\(I\_\{1\},\\dots,I\_\{T\}\), we denote the data batches asB1,…,BTB\_\{1\},\\dots,B\_\{T\}withBt=\(Xi\)i∈ItB\_\{t\}=\(X\_\{i\}\)\_\{i\\in I\_\{t\}\}\. Note that these are random with respect toℐ\\mathcal\{I\}as well as the entries ofSS\. For any model parameter vectorθ∈ℝd\\theta\\in\\mathbb\{R\}^\{d\}and batchBB, letΨ\(B;θ\)=∑x∈Bϕ\(x;θ\)\\Psi\(B;\\theta\)=\\sum\_\{x\\in B\}\\phi\(x;\\theta\)forϕ\(x;θ\)=clip\(∇ℓ\(x;θ\),ζ\)\\phi\(x;\\theta\)=\\text\{clip\}\\bigl\(\\nabla\\ell\(x;\\theta\),\\zeta\\bigr\)\. For any index setI∈\{1,…,n\}mI\\in\\\{1,\\dots,n\\\}^\{m\}, we also use the notationΨ\(S;I,θ\)=Ψ\(S\|I;θ\):=Ψ\(B;θ\)\\Psi\(S;I,\\theta\)=\\Psi\(S\|\_\{I\};\\theta\):=\\Psi\(B;\\theta\)forB=\(Xi\)i∈IB=\(X\_\{i\}\)\_\{i\\in I\}\.

For any stept∈\{1,…,T\}t\\in\\\{1,\\dots,T\\\}, we write the update vector computed by DP\-SGD asUt=Ψ\(S,It,θt−1\)\+σZtU\_\{t\}=\\Psi\(S,I\_\{t\},\\theta\_\{t\-1\}\)\+\\sigma Z\_\{t\}, whereZtZ\_\{t\}is thedd\-dimensional standard Gaussian noise from the Gaussian mechanism\.ZtZ\_\{t\}is independent from all previously computed quantities, includingΨ\(S,It,θt−1\)\\Psi\(S,I\_\{t\},\\theta\_\{t\-1\}\), and also conditional independent from future outputsYt\+1,…,YTY\_\{t\+1\},\\dots,Y\_\{T\}givenUtU\_\{t\}or the derivedθt\\theta\_\{t\}\. It is also independent fromSS,ℐ\\mathcal\{I\}and from allZt′Z\_\{t^\{\\prime\}\}witht′≠tt^\{\\prime\}\\neq t\.

Analogously to Section[3\.1](https://arxiv.org/html/2605.26222#S3.SS1), we now prove a number of lemmas, which in the end we will combine to establish the statement of Theorem[1](https://arxiv.org/html/2605.26222#Thmtheorem1)\.

###### Lemma 8\.

LetGt=1σΨ\(S,It,θt−1\)G\_\{t\}=\\frac\{1\}\{\\sigma\}\\Psi\(S,I\_\{t\},\\theta\_\{t\-1\}\), andμt=𝔼\[Gt\|Y0t−1\]\\mu\_\{t\}=\\mathbb\{E\}\[G\_\{t\}\|Y\_\{0\}^\{t\-1\}\]\. Then, for anyλ≥0\\lambda\\geq 0it holds that

ℙ\{f\(S,Y\)\>λ\}\\displaystyle\\mathbb\{P\}\\Bigl\\\{f\(S,Y\)\>\\lambda\\Bigr\\\}≤ℙ\{12∑t=1T𝔼‖Gt−μt‖2\+∑t=1Tht\(Gt,Zt\)\>λ\}\\displaystyle\\leq\\mathbb\{P\}\\Biggl\\\{\\frac\{1\}\{2\}\\sum\_\{t=1\}^\{T\}\\mathbb\{E\}\\\|G\_\{t\}\-\\mu\_\{t\}\\\|^\{2\}\+\\sum\_\{t=1\}^\{T\}h\_\{t\}\(G\_\{t\},Z\_\{t\}\)\>\\lambda\\Biggr\\\}\(37\)forht\(Gt,Zt\)\\displaystyle h\_\{t\}\(G\_\{t\},Z\_\{t\}\)=⟨Zt,Gt−μt⟩\+12‖Gt−μt‖2\.\\displaystyle=\\bigl\\langle Z\_\{t\},G\_\{t\}\-\\mu\_\{t\}\\bigr\\rangle\+\{\\textstyle\{\\frac\{1\}\{2\}\}\}\\bigl\\\|G\_\{t\}\-\\mu\_\{t\}\\bigr\\\|^\{2\}\.\(38\)

###### Proof\.

LetS~\\tilde\{S\}be an independent copy ofSS\. By Jensen’s inequality applied tot↦log⁡1tt\\mapsto\\log\\frac\{1\}\{t\}, we have

f\(S,Y\)\\displaystyle f\(S,Y\)≤𝔼S~log⁡p\(Y\|S\)p\(Y\|S~\)=𝔼S~log∏t=1Tp\(Yt\|Y0t−1,S\)p\(Yt\|Y0t−1,S~\)=∑t=1T𝔼S~log⁡p\(Yt\|Y0t−1,S\)p\(Yt\|Y0t−1,S~\)\.\\displaystyle\\leq\\mathbb\{E\}\_\{\\tilde\{S\}\}\\log\\frac\{p\(Y\|S\)\}\{p\(Y\|\\tilde\{S\}\)\}=\\mathbb\{E\}\_\{\\tilde\{S\}\}\\log\\prod\_\{t=1\}^\{T\}\\frac\{p\(Y\_\{t\}\|Y\_\{0\}^\{t\-1\},S\)\}\{p\(Y\_\{t\}\|Y\_\{0\}^\{t\-1\},\\tilde\{S\}\)\}=\\sum\_\{t=1\}^\{T\}\\mathbb\{E\}\_\{\\tilde\{S\}\}\\log\\frac\{p\(Y\_\{t\}\|Y\_\{0\}^\{t\-1\},S\)\}\{p\(Y\_\{t\}\|Y\_\{0\}^\{t\-1\},\\tilde\{S\}\)\}\.\(39\)For eacht∈\{1,…,T\}t\\in\\\{1,\\dots,T\\\}, observe that DP\-SGD computes the new model parametersθt\\theta\_\{t\}deterministically from the current update vectorUtU\_\{t\}and the past model parametersθ0t−1\\theta\_\{0\}^\{t\-1\}\. Therefore=∑t=1T𝔼S~log⁡p\(Ut\|Y0t−1,S\)p\(Ut\|Y0t−1,S~\)\.\\displaystyle=\\sum\_\{t=1\}^\{T\}\\mathbb\{E\}\_\{\\tilde\{S\}\}\\log\\frac\{p\(U\_\{t\}\|Y\_\{0\}^\{t\-1\},S\)\}\{p\(U\_\{t\}\|Y\_\{0\}^\{t\-1\},\\tilde\{S\}\)\}\.\(40\)The model update,UtU\_\{t\}, is the result of an application of the Gaussian mechanism, so its distribution, conditioned on any value forSSandY0t−1Y^\{t\-1\}\_\{0\}, which in particular includesItI\_\{t\}andθt−1\\theta\_\{t\-1\}, is Gaussian, namely𝒩\(Ψ\(S;It,θt−1\),σ2I\)\\mathcal\{N\}\(\\Psi\(S;I\_\{t\},\\theta\_\{t\-1\}\),\\sigma^\{2\}\\text\{I\}\)\. Consequently,=∑t=1T𝔼S~log⁡𝒩\(Ut;Ψ\(S;It,θt−1\),σ2I\)𝒩\(Ut;Ψ\(S~;It,θt−1\),σ2I\)\.\\displaystyle=\\sum\_\{t=1\}^\{T\}\\mathbb\{E\}\_\{\\tilde\{S\}\}\\log\\frac\{\\mathcal\{N\}\(U\_\{t\};\\Psi\(S;I\_\{t\},\\theta\_\{t\-1\}\),\\sigma^\{2\}\\text\{I\}\)\}\{\\mathcal\{N\}\(U\_\{t\};\\Psi\(\\tilde\{S\};I\_\{t\},\\theta\_\{t\-1\}\),\\sigma^\{2\}\\text\{I\}\)\}\.\(41\)
We analyze each term in \([40](https://arxiv.org/html/2605.26222#A2.E40)\) separately, exploiting the explicit form of the Gaussian density\. First, observe that for any valueu∈ℝdu\\in\\mathbb\{R\}^\{d\}, any index setJ∈\{1,…,n\}mJ\\in\\\{1,\\dots,n\\\}^\{m\}, and any model parameterθ∈ℝd\\theta\\in\\mathbb\{R\}^\{d\}it holds that:

𝔼S~\\displaystyle\\mathbb\{E\}\_\{\\tilde\{S\}\}log⁡𝒩\(u;Ψ\(S;J,θ\),σ2I\)𝒩\(u;Ψ\(S~;J,θ\),σ2I\)\\displaystyle\\log\\frac\{\\mathcal\{N\}\(u;\\,\\Psi\(S;J,\\theta\),\\sigma^\{2\}\\text\{I\}\)\}\{\\mathcal\{N\}\(u;\\,\\Psi\(\\tilde\{S\};J,\\theta\),\\sigma^\{2\}\\text\{I\}\)\}\(42\)=𝔼S~\[log⁡1\(2π\)de−12σ2‖u−Ψ\(S;J,θ\)‖2−log⁡1\(2π\)de−12σ2‖u−Ψ\(S~;J,θ\)‖2\]\\displaystyle=\\mathbb\{E\}\_\{\\tilde\{S\}\}\\Bigl\[\\log\\frac\{1\}\{\\sqrt\{\(2\\pi\)^\{d\}\}\}e^\{\-\\frac\{1\}\{2\\sigma^\{2\}\}\\\|u\-\\Psi\(S;J,\\theta\)\\\|^\{2\}\}\-\\log\\frac\{1\}\{\\sqrt\{\(2\\pi\)^\{d\}\}\}e^\{\-\\frac\{1\}\{2\\sigma^\{2\}\}\\\|u\-\\Psi\(\\tilde\{S\};J,\\theta\)\\\|^\{2\}\}\\Bigr\]\(43\)=𝔼S~\[−12σ2‖u−Ψ\(S;J,θ\)‖2\+12σ2‖u−Ψ\(S~;J,θ\)‖2\]\\displaystyle=\\mathbb\{E\}\_\{\\tilde\{S\}\}\\Bigl\[\-\\frac\{1\}\{2\\sigma^\{2\}\}\\\|u\-\\Psi\(S;J,\\theta\)\\\|^\{2\}\+\\frac\{1\}\{2\\sigma^\{2\}\}\\\|u\-\\Psi\(\\tilde\{S\};J,\\theta\)\\\|^\{2\}\\Bigr\]\(44\)=𝔼S~\[1σ2⟨u−Ψ\(S;J,θ\),Ψ\(S;J,θ\)−Ψ\(S~;J,θ\)⟩\+12σ2‖Ψ\(S;J,θ\)−Ψ\(S~;J,θ\)‖2\]\.\\displaystyle=\\mathbb\{E\}\_\{\\tilde\{S\}\}\\Bigl\[\\frac\{1\}\{\\sigma^\{2\}\}\\Bigl\\langle u\-\\Psi\(S;J,\\theta\),\\Psi\(S;J,\\theta\)\-\\Psi\(\\tilde\{S\};J,\\theta\)\\Bigr\\rangle\+\\frac\{1\}\{2\\sigma^\{2\}\}\\\|\\Psi\(S;J,\\theta\)\-\\Psi\(\\tilde\{S\};J,\\theta\)\\\|^\{2\}\\Bigr\]\.\(45\)Writingμ¯\(J,θ\):=𝔼S~\[Ψ\(S~;J,θ\)\]\\bar\{\\mu\}\(J,\\theta\):=\\mathbb\{E\}\_\{\\tilde\{S\}\}\[\\Psi\(\\tilde\{S\};J,\\theta\)\], we obtain=1σ2⟨u−Ψ\(S;J,θ\),Ψ\(S;J,θ\)−μ¯\(J,θ\)⟩\\displaystyle=\\frac\{1\}\{\\sigma^\{2\}\}\\Bigl\\langle u\-\\Psi\(S;J,\\theta\),\\Psi\(S;J,\\theta\)\-\\bar\{\\mu\}\(J,\\theta\)\\Bigr\\rangle\+12σ2‖Ψ\(S;J,θ\)−μ¯\(J,θ\)‖2\+12σ2𝔼S~‖Ψ\(S~;J,θ\)−μ¯\(J,θ\)‖2\.\\displaystyle\\qquad\+\\frac\{1\}\{2\\sigma^\{2\}\}\\\|\\Psi\(S;J,\\theta\)\-\\bar\{\\mu\}\(J,\\theta\)\\\|^\{2\}\+\\frac\{1\}\{2\\sigma^\{2\}\}\\mathbb\{E\}\_\{\\tilde\{S\}\}\\\|\\Psi\(\\tilde\{S\};J,\\theta\)\-\\bar\{\\mu\}\(J,\\theta\)\\\|^\{2\}\.\(46\)
Now, insertingJ=ItJ=I\_\{t\},θ=θt−1\\theta=\\theta\_\{t\-1\}, andu=Utu=U\_\{t\}, and abbreviatingμ¯t=μ¯\(It,θt−1\)\\bar\{\\mu\}\_\{t\}=\\bar\{\\mu\}\(I\_\{t\},\\theta\_\{t\-1\}\),Ψt=Ψ\(S;It,θt−1\)\\Psi\_\{t\}=\\Psi\(S;I\_\{t\},\\theta\_\{t\-1\}\)andΨ~t=Ψ\(S~;J,θ\)\\tilde\{\\Psi\}\_\{t\}=\\Psi\(\\tilde\{S\};J,\\theta\), yields for each term in \([41](https://arxiv.org/html/2605.26222#A2.E41)\)

𝔼S~\\displaystyle\\mathbb\{E\}\_\{\\tilde\{S\}\}log𝒩\(Ut;Ψ\(S;It,θt−1\),σ2I\)𝒩\(Ut;Ψ\(S~;It,θt−1\),σ2I\)=1σ2⟨Ut−Ψt,Ψt−μ¯t⟩\+12σ2∥Ψt−μ¯t∥2\+12σ2𝔼S~∥Ψ~t−μ¯t∥2\]\.\\displaystyle\\log\\frac\{\\mathcal\{N\}\(U\_\{t\};\\Psi\(S;I\_\{t\},\\theta\_\{t\-1\}\),\\sigma^\{2\}\\text\{I\}\)\}\{\\mathcal\{N\}\(U\_\{t\};\\Psi\(\\tilde\{S\};I\_\{t\},\\theta\_\{t\-1\}\),\\sigma^\{2\}\\text\{I\}\)\}=\\frac\{1\}\{\\sigma^\{2\}\}\\bigl\\langle U\_\{t\}\\\!\-\\\!\\Psi\_\{t\},\\Psi\_\{t\}\\\!\-\\\!\\bar\{\\mu\}\_\{t\}\\bigr\\rangle\+\\frac\{1\}\{2\\sigma^\{2\}\}\\\|\\Psi\_\{t\}\\\!\-\\\!\\bar\{\\mu\}\_\{t\}\\\|^\{2\}\+\\frac\{1\}\{2\\sigma^\{2\}\}\\mathbb\{E\}\_\{\\tilde\{S\}\}\\\|\\tilde\{\\Psi\}\_\{t\}\\\!\-\\\!\\bar\{\\mu\}\_\{t\}\\\|^\{2\}\\Bigr\]\.\(47\)LetGt=1σΨtG\_\{t\}=\\frac\{1\}\{\\sigma\}\\Psi\_\{t\},G~t=1σΨ~t\\tilde\{G\}\_\{t\}=\\frac\{1\}\{\\sigma\}\\tilde\{\\Psi\}\_\{t\},μt=1σμ¯t\\mu\_\{t\}=\\frac\{1\}\{\\sigma\}\\bar\{\\mu\}\_\{t\}, andZt=1σ\(Ut−Ψt\)Z\_\{t\}=\\frac\{1\}\{\\sigma\}\(U\_\{t\}\-\\Psi\_\{t\}\), whereZtZ\_\{t\}is thedd\-dimensional standard Gaussian noise, added by the Gaussian mechanism\.

Note that conditioned onY0t−1Y\_\{0\}^\{t\-1\}, the elements of a batchBtB\_\{t\}computed fromSSare independent from previous model outputs and have the same product distribution as when the batch is computed from an independent datasetS~\\tilde\{S\}\. Therefore, it holds thatμt=𝔼\[Gt\|Y0t−1\]\\mu\_\{t\}=\\mathbb\{E\}\[G\_\{t\}\|Y\_\{0\}^\{t\-1\}\]\. Overall, in continuation of \([41](https://arxiv.org/html/2605.26222#A2.E41)\):

ℙ\{f\(S,Y\)\>λ\}\\displaystyle\\mathbb\{P\}\\Big\\\{f\(S,Y\)\>\\lambda\\Big\\\}≤ℙ\{12∑t=1T𝔼S~‖G~t−μt‖2\+∑t=1Tht\(Gt,Zt\)\>λ\},\\displaystyle\\leq\\mathbb\{P\}\\Bigl\\\{\\frac\{1\}\{2\}\\sum\_\{t=1\}^\{T\}\\mathbb\{E\}\_\{\\tilde\{S\}\}\\bigl\\\|\\tilde\{G\}\_\{t\}\-\\mu\_\{t\}\\bigr\\\|^\{2\}\+\\sum\_\{t=1\}^\{T\}h\_\{t\}\(G\_\{t\},Z\_\{t\}\)\>\\lambda\\Bigr\\\},\(48\)withhth\_\{t\}as in the statement of Lemma[8](https://arxiv.org/html/2605.26222#Thmlemma8)\. ∎

###### Lemma 9\.

In the setting above, letν=mζ2σ2\\nu=m\\frac\{\\zeta^\{2\}\}\{\\sigma^\{2\}\}\. Then it holds for anyt∈\{1,…,T\}t\\in\\\{1,\\dots,T\\\}:

𝔼S~\[‖G~t−μt‖2\|Y0t−1\]\\displaystyle\\mathbb\{E\}\_\{\\tilde\{S\}\}\\Bigl\[\\\|\\tilde\{G\}\_\{t\}\-\\mu\_\{t\}\\\|^\{2\}\\Bigm\|Y\_\{0\}^\{t\-1\}\\Bigr\]≤ν,𝔼S~\[‖G~t−μt‖\|Y0t−1\]≤ν,\\displaystyle\\leq\\nu,\\qquad\\mathbb\{E\}\_\{\\tilde\{S\}\}\\Bigl\[\\\|\\tilde\{G\}\_\{t\}\-\\mu\_\{t\}\\\|\\Bigm\|Y\_\{0\}^\{t\-1\}\\Bigr\]\\leq\\sqrt\{\\nu\},\(49\)and𝔼\[‖Gt−μt‖2\|Y0t−1\]\\displaystyle\\mathbb\{E\}\\Bigl\[\\\|G\_\{t\}\-\\mu\_\{t\}\\\|^\{2\}\\Bigm\|Y\_\{0\}^\{t\-1\}\\Bigr\]≤ν,𝔼\[‖Gt−μt‖\|Y0t−1\]≤ν\.\\displaystyle\\leq\\nu,\\qquad\\mathbb\{E\}\\Bigl\[\\\|G\_\{t\}\-\\mu\_\{t\}\\\|\\Bigm\|Y\_\{0\}^\{t\-1\}\\Bigr\]\\leq\\sqrt\{\\nu\}\.\(50\)Note that compared to Lemma[3](https://arxiv.org/html/2605.26222#Thmlemma3), the constant isν\\nuinstead of2ν2\\nuhere, because for DP\-SGD we can exploit thatΨ\\Psiis a sum of independent terms, instead of just fulfilling a bounded difference inequality\.

###### Proof\.

For anyt∈\{1,…,T\}t\\in\\\{1,\\dots,T\\\}, with fixedY0t−1Y\_\{0\}^\{t\-1\}, we haveGt=1σΨ\(S;It,θt−1\)=1σ∑i∈Itϕ\(Xi;θt−1\)G\_\{t\}=\\frac\{1\}\{\\sigma\}\\Psi\(S;I\_\{t\},\\theta\_\{t\-1\}\)=\\frac\{1\}\{\\sigma\}\\sum\_\{i\\in I\_\{t\}\}\\phi\(X\_\{i\};\\theta\_\{t\-1\}\), andG~t=1σΨ\(S~;It,θt−1\)=1σ∑i∈Itϕ\(X~i;θt−1\)\\tilde\{G\}\_\{t\}=\\frac\{1\}\{\\sigma\}\\Psi\(\\tilde\{S\};I\_\{t\},\\theta\_\{t\-1\}\)=\\frac\{1\}\{\\sigma\}\\sum\_\{i\\in I\_\{t\}\}\\phi\(\\tilde\{X\}\_\{i\};\\theta\_\{t\-1\}\), where the functionϕ:𝒳×ℝd→ℝd\\phi:\\mathcal\{X\}\\times\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}fulfills‖ϕ\(x;θ\)‖≤ζ\\\|\\phi\(x;\\theta\)\\\|\\leq\\zetafor allxxandθ\\theta\.

Consequently,

𝔼S~\[‖G~t−μt‖2\|Y0t−1\]\\displaystyle\\mathbb\{E\}\_\{\\tilde\{S\}\}\\Bigl\[\\\|\\tilde\{G\}\_\{t\}\-\\mu\_\{t\}\\\|^\{2\}\\Bigm\|Y\_\{0\}^\{t\-1\}\\Bigr\]=1σ2𝔼S~‖∑i∈It\(ϕ\(X~i;θt−1\)−𝔼X~i\[ϕ\(X~i;θt−1\)\]\)‖2\\displaystyle=\\frac\{1\}\{\\sigma^\{2\}\}\\mathbb\{E\}\_\{\\tilde\{S\}\}\\big\\\|\\sum\_\{i\\in I\_\{t\}\}\\bigl\(\\phi\(\\tilde\{X\}\_\{i\};\\theta\_\{t\-1\}\)\-\\mathbb\{E\}\_\{\\tilde\{X\}\_\{i\}\}\[\\phi\(\\tilde\{X\}\_\{i\};\\theta\_\{t\-1\}\)\]\\bigr\)\\,\\big\\\|^\{2\}\(51\)=1σ2∑i∈It𝔼X~i‖ϕ\(X~i;θt−1\)−𝔼X~i\[ϕ\(X~i;θt−1\)\]‖2\\displaystyle=\\frac\{1\}\{\\sigma^\{2\}\}\\sum\_\{i\\in I\_\{t\}\}\\mathbb\{E\}\_\{\\tilde\{X\}\_\{i\}\}\\big\\\|\\phi\(\\tilde\{X\}\_\{i\};\\theta\_\{t\-1\}\)\-\\mathbb\{E\}\_\{\\tilde\{X\}\_\{i\}\}\[\\phi\(\\tilde\{X\}\_\{i\};\\theta\_\{t\-1\}\)\]\\big\\\|^\{2\}\(52\)≤1σ2∑i∈It𝔼Xi‖ϕ\(Xi;θt−1\)‖2≤mζ2σ2,\\displaystyle\\leq\\frac\{1\}\{\\sigma^\{2\}\}\\sum\_\{i\\in I\_\{t\}\}\\mathbb\{E\}\_\{X\_\{i\}\}\\big\\\|\\phi\(X\_\{i\};\\theta\_\{t\-1\}\)\\big\\\|^\{2\}\\leq m\\frac\{\\zeta^\{2\}\}\{\\sigma^\{2\}\},\(53\)where the first identity holds becauseS~\\tilde\{S\}is sampled independently of the algorithm outputs, soμt=1σμ¯\(It,θt−1\)=1σ𝔼S~\[Ψ\(S~;It,θt−1\)\|Y0t−1\]\\mu\_\{t\}=\\frac\{1\}\{\\sigma\}\\bar\{\\mu\}\(I\_\{t\},\\theta\_\{t\-1\}\)=\\frac\{1\}\{\\sigma\}\\mathbb\{E\}\_\{\\tilde\{S\}\}\\bigl\[\\Psi\(\\tilde\{S\};I\_\{t\},\\theta\_\{t\-1\}\)\\bigm\|Y\_\{0\}^\{t\-1\}\\bigr\], and the second identity holds because the elements of batches formed fromS~\\tilde\{S\}, which has the same product distribution asSS, are independent\. The second inequality follows because\(𝔼S~\[‖G~t−μt‖\|Y0t−1\]\)2≤𝔼S~\[‖G~t−μt‖2\|Y0t−1\]\\big\(\\mathbb\{E\}\_\{\\tilde\{S\}\}\\bigl\[\\\|\\tilde\{G\}\_\{t\}\-\\mu\_\{t\}\\\|\\bigm\|Y\_\{0\}^\{t\-1\}\\bigr\]\\big\)^\{2\}\\leq\\mathbb\{E\}\_\{\\tilde\{S\}\}\\bigl\[\\\|\\tilde\{G\}\_\{t\}\-\\mu\_\{t\}\\\|^\{2\}\\bigm\|Y\_\{0\}^\{t\-1\}\\bigr\]by Jensen’s inequality\.

The upper bound for𝔼\[‖Gt−μt‖2\|Y0t−1\]\\mathbb\{E\}\\bigl\[\\\|G\_\{t\}\-\\mu\_\{t\}\\\|^\{2\}\\bigm\|Y\_\{0\}^\{t\-1\}\\bigr\]is obtained by the same steps, because for fixedItI\_\{t\}, the elements of a batchBtB\_\{t\}are independent of the previous algorithm outputs, so conditioned onY0t−1Y\_\{0\}^\{t\-1\}it has the same distribution asB~t\\tilde\{B\}\_\{t\}\. ∎

###### Lemma 10\.

For anyλ∈\(0,12ν\)\\lambda\\in\(0,\\frac\{1\}\{2\\nu\}\), it holds that

𝔼\[eλ‖Gt−μt‖2\|Y0t−1\]\\displaystyle\\mathbb\{E\}\\Bigl\[e^\{\\lambda\\\|G\_\{t\}\-\\mu\_\{t\}\\\|^\{2\}\}\\Bigm\|Y\_\{0\}^\{t\-1\}\\Bigr\]≤e16λ2ν2\+λν1−2λν\.\\displaystyle\\leq e^\{\\frac\{16\\lambda^\{2\}\\nu^\{2\}\+\\lambda\\nu\}\{1\-2\\lambda\\nu\}\}\.\(54\)

###### Proof\.

For conciseness, we assume in the proof that all expectations and probabilities are conditioned onY0t−1Y\_\{0\}^\{t\-1\}\. Using the shorthand notationA=‖Gt−μt‖A=\\\|G\_\{t\}\-\\mu\_\{t\}\\\|,C=\(A−𝔼\[A\]\)2C=\(A\-\\mathbb\{E\}\[A\]\)^\{2\}andD=A2D=A^\{2\}, we will derive a bound on𝔼\[eλD2\]\\mathbb\{E\}\\bigl\[e^\{\\lambda D^\{2\}\}\\bigr\]\.

Observe that for any vectora∈ℝda\\in\\mathbb\{R\}^\{d\}, and anyθ∈ℝd\\theta\\in\\mathbb\{R\}^\{d\}, the functionfa\(x1,…,xm\):=‖Ψ\(x1,…,xm;θ\)−a‖f\_\{a\}\(x\_\{1\},\\dots,x\_\{m\}\):=\\\|\\Psi\(x\_\{1\},\\dots,x\_\{m\};\\theta\)\-a\\\|fulfills a bounded difference condition\. Namely, for any two batchesB,B′B,B^\{\\prime\}that differ only in a single positionii, it holds that\|fa\(B\)−fa\(B′\)\|≤‖Ψ\(B;θ\)−Ψ\(B′;θ\)‖=‖ϕ\(xi;θ\)−ϕ\(xi′;θ\)‖≤2ζ\|f\_\{a\}\(B\)\-f\_\{a\}\(B^\{\\prime\}\)\|\\leq\\\|\\Psi\(B;\\theta\)\-\\Psi\(B^\{\\prime\};\\theta\)\\\|=\\\|\\phi\(x\_\{i\};\\theta\)\-\\phi\(x^\{\\prime\}\_\{i\};\\theta\)\\\|\\leq 2\\zeta\. Therefore, by McDiarmid’s inequality we obtain forA=‖Gt−μt‖=1σ‖Ψt\(Bt\)−σμt‖A=\\\|G\_\{t\}\-\\mu\_\{t\}\\\|=\\frac\{1\}\{\\sigma\}\\\|\\Psi\_\{t\}\(B\_\{t\}\)\-\\sigma\\mu\_\{t\}\\\|:

𝔼\[eλ\(A−𝔼\[A\]\)\]\\displaystyle\\mathbb\{E\}\\bigl\[e^\{\\lambda\(A\-\\mathbb\{E\}\[A\]\)\}\\bigl\]≤e12λ2νandℙ\{\|A−𝔼\[A\]\|≥λ\}≤2e−λ22ν,withν=mζ2σ2\.\\displaystyle\\leq e^\{\\frac\{1\}\{2\}\\lambda^\{2\}\\nu\}\\quad\\text\{and\}\\quad\\mathbb\{P\}\\bigl\\\{\|A\-\\mathbb\{E\}\[A\]\|\\geq\\lambda\\bigl\\\}\\leq 2e^\{\-\\frac\{\\lambda^\{2\}\}\{2\\nu\}\},\\quad\\text\{with $\\nu=m\\frac\{\\zeta^\{2\}\}\{\\sigma^\{2\}\}$\.\}\(55\)
From \([55](https://arxiv.org/html/2605.26222#A2.E55)\) we derive bounds on the moments ofC=\(A−𝔼\[A\]\)2C=\(A\-\\mathbb\{E\}\[A\]\)^\{2\}using the tail\-integral formula from\(Vershynin,[2018](https://arxiv.org/html/2605.26222#bib.bib57), Example 1\.2\.3\)\. For anyk≥1k\\geq 1, it holds that

𝔼\[Ck\]=𝔼\[\(A−𝔼\[A\]\)2k\]\\displaystyle\\mathbb\{E\}\[C^\{k\}\]=\\mathbb\{E\}\[\(A\-\\mathbb\{E\}\[A\]\)^\{2k\}\]=∫0∞2kλ2k−1ℙ\{\|A−𝔼\[A\]\|≥λ\}𝑑λ≤∫0∞2kλ2k−12e−λ22ν𝑑λ\.\\displaystyle=\\int\_\{0\}^\{\\infty\}\\\!\\\!\\\!\\\!2k\\,\\lambda^\{2k\-1\}\\mathbb\{P\}\\\{\|A\-\\mathbb\{E\}\[A\]\|\\geq\\lambda\\\}d\\lambda\\leq\\int\_\{0\}^\{\\infty\}\\\!\\\!2k\\,\\lambda^\{2k\-1\}2e^\{\-\\frac\{\\lambda^\{2\}\}\{2\\nu\}\}d\\lambda\.\(56\)By a change of variables fromλ\\lambdatou=λ22νu=\\frac\{\\lambda^\{2\}\}\{2\\nu\}this yields≤2k\(2ν\)k∫0∞uk−1e−u𝑑u\\displaystyle\\leq 2\\,k\(2\\nu\)^\{k\}\\int\_\{0\}^\{\\infty\}u^\{k\-1\}e^\{\-u\}du\(57\)=2k\!\(2ν\)k\\displaystyle=2\\,k\!\(2\\nu\)^\{k\}\(58\)where the last identity holds because∫0∞uk−1e−u𝑑u=Γ\(k\)=\(k−1\)\!\\int\_\{0\}^\{\\infty\}u^\{k\-1\}e^\{\-u\}du=\\Gamma\(k\)=\(k\-1\)\!

In particular, we obtain upper bounds on the second and higher order moments of the following form

- •𝔼\[C2\]≤2⋅2⋅\(2ν\)2=V,\\mathbb\{E\}\[C^\{2\}\]\\leq 2\\cdot 2\\cdot\(2\\nu\)^\{2\}=V,forV:=16ν2,V:=16\\nu^\{2\},
- •𝔼\[Ck\]≤2⋅k\!⋅\(2ν\)k=k\!2\(2ν\)k−216ν2=k\!2ck−2V\\mathbb\{E\}\[C^\{k\}\]\\leq 2\\cdot k\!\\cdot\(2\\nu\)^\{k\}=\\frac\{k\!\}\{2\}\(2\\nu\)^\{k\-2\}16\\nu^\{2\}=\\frac\{k\!\}\{2\}c^\{k\-2\}V, forc:=2νc:=2\\nu\.

which shows thatCCfulfills the conditions for Bernstein’s inequality\(Boucheron et al\.,[2013](https://arxiv.org/html/2605.26222#bib.bib9), Theorem 2\.10\)with parametersVVandccas above\. Consequently, we obtain a bound on its moment\-generating function:

𝔼\[eλ\(C−𝔼\[C\]\)\]≤eVλ22\(1−cλ\)=e8ν2λ21−2νλfor anyλ∈\(0,1/2ν\)\.\\displaystyle\\mathbb\{E\}\[e^\{\\lambda\(C\-\\mathbb\{E\}\[C\]\)\}\]\\leq e^\{\\frac\{V\\lambda^\{2\}\}\{2\(1\-c\\lambda\)\}\}=e^\{\\frac\{8\\nu^\{2\}\\lambda^\{2\}\}\{1\-2\\nu\\lambda\}\}\\qquad\\text\{for any $\\lambda\\in\(0,1/2\\nu\)$\.\}\(59\)
To relate \([59](https://arxiv.org/html/2605.26222#A2.E59)\) to \([54](https://arxiv.org/html/2605.26222#A2.E54)\), we observe thatD−𝔼\[D\]=C−𝔼\[C\]\+2𝔼\[A\]\(A−𝔼\[A\]\)D\-\\mathbb\{E\}\[D\]=C\-\\mathbb\{E\}\[C\]\+2\\mathbb\{E\}\[A\]\(A\-\\mathbb\{E\}\[A\]\)\. Hence,

𝔼\[eλ\(D−𝔼\[D\]\)\]\\displaystyle\\mathbb\{E\}\[e^\{\\lambda\(D\-\\mathbb\{E\}\[D\]\)\}\]=𝔼\[eλ\(C−𝔼\[C\]\+2𝔼\[A\]\(A−𝔼\[A\]\)\)\]\.\\displaystyle=\\mathbb\{E\}\[e^\{\\lambda\(C\-\\mathbb\{E\}\[C\]\+2\\mathbb\{E\}\[A\]\(A\-\\mathbb\{E\}\[A\]\)\)\}\]\.\(60\)For anyp,p′\>1p,p^\{\\prime\}\>1with1p\+1p′=1\\frac\{1\}\{p\}\+\\frac\{1\}\{p^\{\\prime\}\}=1, it holds by Hölder’s inequality:≤𝔼\[epλ\(C−𝔼\[C\]\)\]1p𝔼\[e2p′λ𝔼\[A\]\(A−𝔼\[A\]\)\]1p′\.\\displaystyle\\leq\\mathbb\{E\}\[e^\{p\\lambda\(C\-\\mathbb\{E\}\[C\]\)\}\]^\{\\frac\{1\}\{p\}\}\\,\\mathbb\{E\}\[e^\{2p^\{\\prime\}\\lambda\\mathbb\{E\}\[A\]\(A\-\\mathbb\{E\}\[A\]\)\}\]^\{\\frac\{1\}\{p^\{\\prime\}\}\}\.\(61\)Using \([59](https://arxiv.org/html/2605.26222#A2.E59)\) for the left factor and \([55](https://arxiv.org/html/2605.26222#A2.E55)\) and[50](https://arxiv.org/html/2605.26222#A2.E50)for the right one, we obtain for anyλ∈\(0,1/2pν\)\\lambda\\in\(0,1/2p\\nu\)≤e1p8ν2\(pλ\)21−2ν\(pλ\)e1p′12\(2p′λν\)2ν=eλ2ν2\(8p1−2pνλ\+2p′\)\.\\displaystyle\\leq e^\{\\frac\{1\}\{p\}\\frac\{8\\nu^\{2\}\(p\\lambda\)^\{2\}\}\{1\-2\\nu\(p\\lambda\)\}\}e^\{\\frac\{1\}\{p^\{\\prime\}\}\\frac\{1\}\{2\}\(2p^\{\\prime\}\\lambda\\sqrt\{\\nu\}\)^\{2\}\\nu\}=e^\{\\lambda^\{2\}\\nu^\{2\}\(\\frac\{8p\}\{1\-2p\\nu\\lambda\}\+2p^\{\\prime\}\)\}\.\(62\)Consequently, withD=‖Gt−μt‖2D=\\\|G\_\{t\}\-\\mu\_\{t\}\\\|^\{2\}and𝔼\[D\]=𝔼\[‖Gt−μt‖2\]≤ν\\mathbb\{E\}\[D\]=\\mathbb\{E\}\[\\\|G\_\{t\}\-\\mu\_\{t\}\\\|^\{2\}\]\\leq\\nuby Lemma[9](https://arxiv.org/html/2605.26222#Thmlemma9), it follows that for anyp,p′\>1p,p^\{\\prime\}\>1with1p\+1p′=1\\frac\{1\}\{p\}\+\\frac\{1\}\{p^\{\\prime\}\}=1:𝔼\[eλ‖Gt−μt‖2\]\\displaystyle\\mathbb\{E\}\\bigl\[e^\{\\lambda\\\|G\_\{t\}\-\\mu\_\{t\}\\\|^\{2\}\}\\bigr\]≤eλ2ν2\(8p1−2pνλ\+2p′\)\+λν\.\\displaystyle\\leq e^\{\\lambda^\{2\}\\nu^\{2\}\(\\frac\{8p\}\{1\-2p\\nu\\lambda\}\+2p^\{\\prime\}\)\+\\lambda\\nu\}\.\(63\)Specifically, forp′=31−2νλp^\{\\prime\}=\\frac\{3\}\{1\-2\\nu\\lambda\}andp=p′p′−1=32\+2νλp=\\frac\{p^\{\\prime\}\}\{p^\{\\prime\}\-1\}=\\frac\{3\}\{2\+2\\nu\\lambda\}, it holds for anyλ∈\(0,1/2ν\)\\lambda\\in\(0,1/2\\nu\):=eλ2ν2\(181−2νλ\)\+λν=e16λ2ν2\+λν1−2νλ,\\displaystyle=e^\{\\lambda^\{2\}\\nu^\{2\}\(\\frac\{18\}\{1\-2\\nu\\lambda\}\)\+\\lambda\\nu\}=e^\{\\frac\{16\\lambda^\{2\}\\nu^\{2\}\+\\lambda\\nu\}\{1\-2\\nu\\lambda\}\},\(64\)which concludes the proof of Lemma[10](https://arxiv.org/html/2605.26222#Thmlemma10)\. ∎

For completeness, we explicitly state these standard identities for Gaussians random variables\.

###### Lemma 11\.

ForZ∼𝒩\(0,1\)Z\\sim\\mathcal\{N\}\(0,1\), and anyλ∈ℝ\\lambda\\in\\mathbb\{R\}, we have

𝔼\[eλZ\]=eλ22\\displaystyle\\mathbb\{E\}\[e^\{\\lambda Z\}\]=e^\{\\frac\{\\lambda^\{2\}\}\{2\}\}\(65\)Additionally, for anyZ=\(Z1,…,Zd\)∼𝒩\(0,Id\)Z=\(Z\_\{1\},\\dots,Z\_\{d\}\)\\sim\\mathcal\{N\}\(0,I\_\{d\}\), and anyv∈ℝdv\\in\\mathbb\{R\}^\{d\}, we have

𝔼\[eλ⟨Z,v⟩\]=eλ2‖v‖22\\displaystyle\\mathbb\{E\}\[e^\{\\lambda\\langle Z,v\\rangle\}\]=e^\{\\frac\{\\lambda^\{2\}\\\|v\\\|^\{2\}\}\{2\}\}\(66\)

###### Proof\.

The proof is elementary:

𝔼\[eλZ\]=12π∫−∞∞eλze−z22𝑑z=12πeλ22∫−∞∞e−12\(z−λ\)2𝑑z=eλ22,\\displaystyle\\mathbb\{E\}\[e^\{\\lambda Z\}\]=\\frac\{1\}\{\\sqrt\{2\\pi\}\}\\int^\{\\infty\}\_\{\-\\infty\}e^\{\\lambda z\}e^\{\-\\frac\{z^\{2\}\}\{2\}\}dz=\\frac\{1\}\{\\sqrt\{2\\pi\}\}e^\{\\frac\{\\lambda^\{2\}\}\{2\}\}\\int^\{\\infty\}\_\{\-\\infty\}e^\{\-\\frac\{1\}\{2\}\(z\-\\lambda\)^\{2\}\}dz=e^\{\\frac\{\\lambda^\{2\}\}\{2\}\},\(67\)and

𝔼\[eλ⟨Z,v⟩\]=𝔼\[e∑i=1dλviZi\]=∏i=1d𝔼\[eλviZi\]=∏i=1deλ2vi22=eλ2‖v‖22\\displaystyle\\mathbb\{E\}\[e^\{\\lambda\\langle Z,v\\rangle\}\]=\\mathbb\{E\}\[e^\{\\sum\_\{i=1\}^\{d\}\\lambda v\_\{i\}Z\_\{i\}\}\]=\\prod\_\{i=1\}^\{d\}\\mathbb\{E\}\[e^\{\\lambda v\_\{i\}Z\_\{i\}\}\]=\\prod\_\{i=1\}^\{d\}e^\{\\frac\{\\lambda^\{2\}v^\{2\}\_\{i\}\}\{2\}\}=e^\{\\frac\{\\lambda^\{2\}\\\|v\\\|^\{2\}\}\{2\}\}\(68\)
∎

###### Lemma 12\.

In the notation of Lemma[8](https://arxiv.org/html/2605.26222#Thmlemma8), it holds that for anyβ∈\(0,1\)\\beta\\in\(0,1\):

ℙ\{∑t=1Tht\(Gt,Zt\)\>τ\}≤β,\\displaystyle\\mathbb\{P\}\\Bigl\\\{\\sum\_\{t=1\}^\{T\}h\_\{t\}\(G\_\{t\},Z\_\{t\}\)\>\\tau\\Bigr\\\}\\leq\\beta,\(69\)forτ=infλ∈\(0,rν\)1λ\[TF\(λ\+λ22\)\+log⁡1β\]withF\(x\)=16ν2x2\+νx1−2νx\.\\displaystyle\\tau=\\inf\_\{\\lambda\\in\(0,r\_\{\\nu\}\)\}\\frac\{1\}\{\\lambda\}\\Bigl\[TF\(\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\)\+\\log\\frac\{1\}\{\\beta\}\\Big\]\\quad\\text\{with\}\\quad F\(x\)=\\frac\{16\\nu^\{2\}x^\{2\}\+\\nu x\}\{1\-2\\nu x\}\.\(70\)

###### Proof\.

We exploit the tower property of expectations\. For anyt∈\{1,…,T\}t\\in\\\{1,\\dots,T\\\}, we writeG¯t=Gt−μt\\bar\{G\}\_\{t\}=G\_\{t\}\-\\mu\_\{t\}, andDt=‖G¯t‖2D\_\{t\}=\\\|\\bar\{G\}\_\{t\}\\\|^\{2\}, such thatht\(Gt,Zt\)=⟨Zt,G¯t⟩\+12Dth\_\{t\}\(G\_\{t\},Z\_\{t\}\)=\\langle Z\_\{t\},\\bar\{G\}\_\{t\}\\rangle\+\\frac\{1\}\{2\}D\_\{t\}\. For any partial sum we obtain:

𝔼\[eλ∑j=1tht\(Zj,Gj\)\]\\displaystyle\\mathbb\{E\}\\Bigl\[e^\{\\lambda\\sum\\limits\_\{j=1\}^\{t\}h\_\{t\}\(Z\_\{j\},G\_\{j\}\)\}\\Bigr\]=𝔼\[eλ\(∑j=1t−1⟨Zj,G¯j⟩\+12Dj\)\+λ⟨Zt,G¯t⟩\+12λDt\],\\displaystyle=\\mathbb\{E\}\\Bigl\[e^\{\\lambda\(\\sum\\limits\_\{j=1\}^\{t\-1\}\\langle Z\_\{j\},\\bar\{G\}\_\{j\}\\rangle\+\\frac\{1\}\{2\}D\_\{j\}\)\\ \+\\ \\lambda\\langle Z\_\{t\},\\bar\{G\}\_\{t\}\\rangle\+\\frac\{1\}\{2\}\\lambda D\_\{t\}\}\\Bigr\],\(71\)=𝔼\[eλ\(∑j=1t−1⟨Zj,G¯j⟩\+12Dj\)𝔼\[𝔼\[eλ⟨Zt,G¯t⟩\|Bt,Y0t−1\]e12λDt\|Y0t−1\]\]\\displaystyle=\\mathbb\{E\}\\Bigl\[e^\{\\lambda\(\\sum\\limits\_\{j=1\}^\{t\-1\}\\langle Z\_\{j\},\\bar\{G\}\_\{j\}\\rangle\+\\frac\{1\}\{2\}D\_\{j\}\)\}\\mathbb\{E\}\\Bigl\[\\mathbb\{E\}\\bigl\[e^\{\\lambda\\langle Z\_\{t\},\\bar\{G\}\_\{t\}\\rangle\}\|B\_\{t\},Y\_\{0\}^\{t\-1\}\]\\ e^\{\\frac\{1\}\{2\}\\lambda D\_\{t\}\}\\Bigm\|Y\_\{0\}^\{t\-1\}\\bigr\]\\Bigr\]\(72\)by the tower property of expectations, becauseG¯1,…,G¯t−1\\bar\{G\}\_\{1\},\\dots,\\bar\{G\}\_\{t\-1\}andD1,…,Dt−1D\_\{1\},\\dots,D\_\{t\-1\}are predictable fromY0t−1Y\_\{0\}^\{t\-1\}, andZ1,…,ZtZ\_\{1\},\\dots,Z\_\{t\}are independent from it, as well as fromBtB\_\{t\}\. By Lemmas[10](https://arxiv.org/html/2605.26222#Thmlemma10)and[11](https://arxiv.org/html/2605.26222#Thmlemma11), it follows that=𝔼\[eλ\(∑j=1t−1⟨Zj,G¯j⟩\+12Dj\)𝔼\[e\(λ\+λ22\)Dt\|Y0t−1\]\]\\displaystyle=\\mathbb\{E\}\\Bigl\[e^\{\\lambda\(\\sum\\limits\_\{j=1\}^\{t\-1\}\\langle Z\_\{j\},\\bar\{G\}\_\{j\}\\rangle\+\\frac\{1\}\{2\}D\_\{j\}\)\}\\mathbb\{E\}\\bigl\[e^\{\(\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\)D\_\{t\}\}\|Y\_\{0\}^\{t\-1\}\\bigr\]\\Bigr\]\(73\)≤𝔼\[eλ\(∑j=1t−1⟨Zj,G¯j⟩\+12Dj\)eF\(λ\+λ22\)\]\.\\displaystyle\\leq\\mathbb\{E\}\\Bigl\[e^\{\\lambda\(\\sum\\limits\_\{j=1\}^\{t\-1\}\\langle Z\_\{j\},\\bar\{G\}\_\{j\}\\rangle\+\\frac\{1\}\{2\}D\_\{j\}\)\}e^\{F\(\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\)\}\\Bigr\]\.\(74\)whereF\(x\)=16ν2x2\+νx1−2νxF\(x\)=\\frac\{16\\nu^\{2\}x^\{2\}\+\\nu x\}\{1\-2\\nu x\}is the exponent of the right\-hand side of \([54](https://arxiv.org/html/2605.26222#A2.E54)\), as long asλ\+λ22≤12ν\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\\leq\\frac\{1\}\{2\\nu\}, i\.e\.0≤λ<rν0\\leq\\lambda<r\_\{\\nu\}forrν:=1ν\+14−12r\_\{\\nu\}:=\\sqrt\{\\frac\{1\}\{\\nu\}\+\\frac\{1\}\{4\}\}\-\\frac\{1\}\{2\}\. Applying the above identity iteratively, we obtain a bound on the moment generating function ofH=∑t=1Tht\(Gt,Zt\)=∑t=1T⟨Zt,G¯t⟩\+12DtH=\\sum\_\{t=1\}^\{T\}h\_\{t\}\(G\_\{t\},Z\_\{t\}\)=\\sum\\limits\_\{t=1\}^\{T\}\\langle Z\_\{t\},\\bar\{G\}\_\{t\}\\rangle\+\\frac\{1\}\{2\}D\_\{t\}:𝔼\[eλH\]\\displaystyle\\mathbb\{E\}\\bigl\[e^\{\\lambda H\}\\bigr\]=𝔼\[eλ\(∑t=1T⟨Zt,G¯t⟩\+12Dt\)\]≤eTF\(λ\+λ22\)\.\\displaystyle=\\mathbb\{E\}\\Bigl\[e^\{\\lambda\(\\sum\\limits\_\{t=1\}^\{T\}\\langle Z\_\{t\},\\bar\{G\}\_\{t\}\\rangle\+\\frac\{1\}\{2\}D\_\{t\}\)\}\\Bigr\]\\leq e^\{TF\(\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\)\}\.\(75\)
By Chernoff’s inequality, we obtain for anyτ≥0\\tau\\geq 0

ℙ\{H≥τ\}\\displaystyle\\mathbb\{P\}\\\{H\\geq\\tau\\\}≤infλ∈\(0,rν\)e−λτ\+TF\(λ\+λ22\)\.\\displaystyle\\leq\\inf\_\{\\lambda\\in\(0,r\_\{\\nu\}\)\}e^\{\-\\lambda\\tau\+TF\(\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\)\}\.\(76\)For anyβ∈\(0,1\)\\beta\\in\(0,1\), solving for the right hand side to equalβ\\betayields

τ\\displaystyle\\tau=infλ∈\(0,rν\)1λ\[TF\(λ\+λ22\)\+log⁡1β\]\.\\displaystyle=\\inf\_\{\\lambda\\in\(0,r\_\{\\nu\}\)\}\\frac\{1\}\{\\lambda\}\\Bigl\[TF\(\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\)\+\\log\\frac\{1\}\{\\beta\}\\Bigr\]\.\(77\)which concludes the proof of Lemma[12](https://arxiv.org/html/2605.26222#Thmlemma12)\. ∎

###### Corollary 4\.

In the setting of Lemma[12](https://arxiv.org/html/2605.26222#Thmlemma12), it holds that

ℙ\{∑t=1Tht\(Gt,Zt\)≥τ\}≤β\\displaystyle\\mathbb\{P\}\\Big\\\{\\sum\_\{t=1\}^\{T\}h\_\{t\}\(G\_\{t\},Z\_\{t\}\)\\geq\\tau\\Big\\\}\\leq\\beta\(78\)forτ=T\(ν\+min⁡\{1,ν\}\)\(12\+3q\+12q2\)withq=2Tlog⁡\(1/β\)\.\\displaystyle\\tau=T\(\\nu\+\\min\\\{1,\\sqrt\{\\nu\}\\\}\)\(\{\\textstyle\{\\frac\{1\}\{2\}\}\}\+3q\+\\frac\{1\}\{2\}q^\{2\}\)\\quad\\text\{with $q=\\sqrt\{\\frac\{2\}\{T\}\\log\(1/\\beta\)\}$\. \}\(79\)

###### Proof\.

We construct an explicit upper bound on the expression forτ\\taufrom \([77](https://arxiv.org/html/2605.26222#A2.E77)\)\. Letu:=ν\(λ\+λ2\)u:=\\nu\(\\lambda\+\\lambda^\{2\}\), sou∈\(0,1\)u\\in\(0,1\)is equivalent toλ∈\(0,rν\)\\lambda\\in\(0,r\_\{\\nu\}\), andF\(λ\+λ22\)=4u2\+12u1−u=−4u\+9u2\(1−u\)F\(\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\)=\\frac\{4u^\{2\}\+\\frac\{1\}\{2\}u\}\{1\-u\}=\-4u\+\\frac\{9u\}\{2\(1\-u\)\}\. Observe that1λ=ν\(1\+λ\)u≤ν\(1\+rν\)u\\frac\{1\}\{\\lambda\}=\\frac\{\\nu\(1\+\\lambda\)\}\{u\}\\leq\\frac\{\\nu\(1\+r\_\{\\nu\}\)\}\{u\}\. Then, withL:=1Tlog⁡1βL:=\\frac\{1\}\{T\}\\log\\frac\{1\}\{\\beta\},

τ\\displaystyle\\tau=infλ∈\(0,rν\)1λ\[TF\(λ\+λ22\)\+log⁡1β\]\\displaystyle=\\inf\_\{\\lambda\\in\(0,r\_\{\\nu\}\)\}\\frac\{1\}\{\\lambda\}\\Bigl\[TF\(\\frac\{\\lambda\+\\lambda^\{2\}\}\{2\}\)\+\\log\\frac\{1\}\{\\beta\}\\Bigr\]\(80\)≤Tinfu∈\(0,1\)ν\(1\+rν\)u\[−4u\+9u2\(1−u\)\+L\]\\displaystyle\\leq T\\,\\inf\_\{u\\in\(0,1\)\}\\frac\{\\nu\(1\+r\_\{\\nu\}\)\}\{u\}\\Bigl\[\-4u\+\\frac\{9u\}\{2\(1\-u\)\}\+L\\Bigr\]\(81\)=Tν\(1\+rν\)infu∈\(0,1\)G\(u\)withG\(u\)=−4\+92\(1−u\)\+Lu\.\\displaystyle=T\\nu\(1\+r\_\{\\nu\}\)\\,\\inf\_\{u\\in\(0,1\)\}G\(u\)\\quad\\text\{with $G\(u\)=\-4\+\\frac\{9\}\{2\(1\-u\)\}\+\\frac\{L\}\{u\}$\.\}\(82\)In this form, we can solve the minimization exactly: we haveG′\(u\)=92\(1−u\)2−Lu2G^\{\\prime\}\(u\)=\\frac\{9\}\{2\(1\-u\)^\{2\}\}\-\\frac\{L\}\{u^\{2\}\}, so for an unconstrained optimizeru∗u\_\{\*\}it holds that92\(1−u∗\)2=Lu∗2\\frac\{9\}\{2\(1\-u\_\{\*\}\)^\{2\}\}=\\frac\{L\}\{u\_\{\*\}^\{2\}\}, i\.e\.32\(1−u∗\)=Lu∗\\frac\{3\}\{\\sqrt\{2\}\(1\-u\_\{\*\}\)\}=\\frac\{\\sqrt\{L\}\}\{u\_\{\*\}\}, which yieldsu∗=2L3\+2Lu\_\{\*\}=\\frac\{\\sqrt\{2L\}\}\{3\+\\sqrt\{2L\}\}\. In particularu∗∈\(0,1\)u\_\{\*\}\\in\(0,1\), so the solution is feasible\. With1−u∗=33\+2L1\-u\_\{\*\}=\\frac\{3\}\{3\+\\sqrt\{2L\}\}, we obtainG\(u∗\)=−4\+923\+2L3\+L3\+2L2L=12\+32L\+LG\(u\_\{\*\}\)=\-4\+\\frac\{9\}\{2\}\\frac\{3\+\\sqrt\{2L\}\}\{3\}\+L\\frac\{3\+\\sqrt\{2L\}\}\{\\sqrt\{2L\}\}=\\frac\{1\}\{2\}\+3\\sqrt\{2L\}\+L\.

From the identitya=\(a\+b−b\)\(a\+b\+b\)a=\(\\sqrt\{a\+b\}\-\\sqrt\{b\}\)\(\\sqrt\{a\+b\}\+\\sqrt\{b\}\)witha=1νa=\\frac\{1\}\{\\nu\}andb=14b=\\frac\{1\}\{4\}, it follows forrν=1ν\+14−12r\_\{\\nu\}=\\sqrt\{\\frac\{1\}\{\\nu\}\+\\frac\{1\}\{4\}\}\-\\frac\{1\}\{2\}thatνrν=11ν\+14\+12≤min⁡\{1,ν\}\\nu r\_\{\\nu\}=\\frac\{1\}\{\\sqrt\{\\frac\{1\}\{\\nu\}\+\\frac\{1\}\{4\}\}\+\\frac\{1\}\{2\}\}\\leq\\min\\\{1,\\sqrt\{\\nu\}\\\}\. Now, the statement of Corollary[4](https://arxiv.org/html/2605.26222#Thmtheorem4)follows by settingq=2Lq=\\sqrt\{2L\}\. ∎

###### Proof of Theorem[1](https://arxiv.org/html/2605.26222#Thmtheorem1)– conclusion\.

We combine the above result: from Lemmas[9](https://arxiv.org/html/2605.26222#Thmlemma9)to[12](https://arxiv.org/html/2605.26222#Thmlemma12), it follows for a single epoch of DP\-SGD that

ℙ\{f\(S,Y\)\>τ\+12Tν\}≤ℙ\{H\>τ\}≤β,\\displaystyle\\mathbb\{P\}\\big\\\{f\(S,Y\)\>\\tau\+\\frac\{1\}\{2\}T\\nu\\bigr\\\}\\leq\\mathbb\{P\}\\big\\\{H\>\\tau\\bigr\\\}\\leq\\beta,\(83\)withτ\\tauas in \([77](https://arxiv.org/html/2605.26222#A2.E77)\)\. This confirms the first statement of Theorem[1](https://arxiv.org/html/2605.26222#Thmtheorem1)for DP\-SGD withE=1E=1\. The second follows by using the upper bound onτ\\tauprovided by Corollary[4](https://arxiv.org/html/2605.26222#Thmtheorem4)\.

For more epochs, observe that any later epoch of DP\-SGD follows the same steps as the first, except that its initial values for the model parameters is not simply random but stems from the algorithm’s previous output\. However, the max\-information property we proved holds uniformly for any initialization\. Consequently, we can modelEE\-epoch DP\-SGD,𝒜E\\mathcal\{A\}\_\{E\}, as anEE\-fold adaptive composition of single\-epoch DP\-SGD,𝒜1\\mathcal\{A\}\_\{1\}, with itself, and it follows from the additivity property of the approximate max\-information\(Dwork et al\.,[2015a](https://arxiv.org/html/2605.26222#bib.bib17), Theorem 15\), that

I∞β\(𝒜E\(S\),S\)\\displaystyle I^\{\\beta\}\_\{\\infty\}\(\\mathcal\{A\}\_\{E\}\(S\),S\)≤EI∞βE\(𝒜1\(S\),S\)\\displaystyle\\leq E\\,I^\{\\frac\{\\beta\}\{E\}\}\_\{\\infty\}\(\\mathcal\{A\}\_\{1\}\(S\),S\)\(84\)which concludes the proof of Theorem[1](https://arxiv.org/html/2605.26222#Thmtheorem1)\. ∎

### B\.2Proof of Lemma[6](https://arxiv.org/html/2605.26222#Thmlemma6)

Forβ∈\(0,12\]\\beta\\in\(0,\\frac\{1\}\{2\}\]andα∈\(0,3\]\\alpha\\in\(0,3\], we prove the monotonicity and bounds on the values of

R\(α,β\)\\displaystyle R\(\\alpha,\\beta\)=34\+12log⁡1\+α/2β\+14log⁡1\+α/2βlog⁡1\.25β\.\\displaystyle=\\frac\{\\frac\{3\}\{4\}\+\\frac\{1\}\{2\}\\sqrt\{\\log\\frac\{1\+\\alpha/2\}\{\\beta\}\}\+\\frac\{1\}\{4\}\\log\\frac\{1\+\\alpha/2\}\{\\beta\}\}\{\\log\\frac\{1\.25\}\{\\beta\}\}\.\(85\)For better readability, we reparametrize the ratio in terms ofa:=log⁡\(1\+α/2\)a:=\\log\(1\+\\alpha/2\)andb:=log⁡\(1/β\)b:=\\log\(1/\\beta\)asR′\(a,b\)\\displaystyle R^\{\\prime\}\(a,b\)=3\+2a\+b\+a\+b4\(log⁡1\.25\+b\)\.\\displaystyle=\\frac\{3\+2\\sqrt\{a\+b\}\+a\+b\}\{4\(\\log 1\.25\+b\)\}\.\(86\)
Because only the numerator depends onaa, and in an explicit form, it is clear that itR′R^\{\\prime\}strict monotonically increasing inaa, soRRis strictly monotonically increasing inα\\alpha\.

To establish monotonicity inβ\\beta, we compute

∂R′∂b\(a,b\)\\displaystyle\\frac\{\\partial R^\{\\prime\}\}\{\\partial b\}\(a,b\)=\(1a\+b\+1\)\(log⁡1\.25\+b\)−\(3\+2a\+b\+a\+b\)4\(log⁡1\.25\+b\)2,\\displaystyle=\\frac\{\(\\frac\{1\}\{\\sqrt\{a\+b\}\}\+1\)\(\\log 1\.25\+b\)\-\\bigl\(3\+2\\sqrt\{a\+b\}\+a\+b\\bigr\)\}\{4\(\\log 1\.25\+b\)^\{2\}\},\(87\)and check the sign of the numerator:\(1a\+b\+1\)\\displaystyle\(\\frac\{1\}\{\\sqrt\{a\+b\}\}\+1\)\(log⁡1\.25\+b\)−\(3\+2a\+b\+a\+b\)\.\\displaystyle\(\\log 1\.25\+b\)\-\\bigl\(3\+2\\sqrt\{a\+b\}\+a\+b\\bigr\)\.\(88\)This expression is decreasing ina≥0a\\geq 0, so we can upper bound it by its value fora=0a=0:≤\(1b\+1\)\(log⁡1\.25\+b\)−\(3\+2b\+b\)\\displaystyle\\leq\(\\frac\{1\}\{\\sqrt\{b\}\}\+1\)\(\\log 1\.25\+b\)\-\\bigl\(3\+2\\sqrt\{b\}\+b\\bigr\)\(89\)=log⁡1\.25b\+log⁡1\.25−3−b−b\.\\displaystyle=\\frac\{\\log 1\.25\}\{\\sqrt\{b\}\}\+\\log 1\.25\-3\-\\sqrt\{b\}\-b\.\(90\)This expression is monotonically decreasing forb≥0b\\geq 0, so we can upper bound it by its value forbmin=minβ∈\(0,12\]⁡log⁡1β=log⁡2b\_\{\\text\{min\}\}=\\min\_\{\\beta\\in\(0,\\frac\{1\}\{2\}\]\}\\log\\frac\{1\}\{\\beta\}=\\log 2\.≤log⁡1\.25log⁡2\+log⁡1\.25−3−log⁡2−log⁡2≈−4\.03<0\\displaystyle\\leq\\frac\{\\log 1\.25\}\{\\sqrt\{\\log 2\}\}\+\\log 1\.25\-3\-\\sqrt\{\\log 2\}\-\\log 2\\approx\-4\.03<0\(91\)This confirms thatR′R^\{\\prime\}is strictly decreasing as a function ofbb, and thereforeRRis strictly increasing as a function ofβ\\beta\. As a consequence of the monotonicity, we obtain upper and lower bounds\. For anyα∈\(0,3\]\\alpha\\in\(0,3\]andβ∈\(0,12\]\\beta\\in\(0,\\frac\{1\}\{2\}\], that is

R\(α,β\)\\displaystyle R\(\\alpha,\\beta\)≤R\(α=3,β=12\)≈1\.95<2,\\displaystyle\\leq R\(\\alpha=3,\\beta=\\frac\{1\}\{2\}\)\\approx 1\.95<2,\(92\)R\(α,β\)\\displaystyle R\(\\alpha,\\beta\)≥limβ′→0R\(α=0,β′\)=14\.\\displaystyle\\geq\\lim\_\{\\beta^\{\\prime\}\\to 0\}R\(\\alpha=0,\\beta^\{\\prime\}\)=\\frac\{1\}\{4\}\.\(93\)

## Appendix CExperiments

Here we present the details of our model training experiments in[Section˜4\.1](https://arxiv.org/html/2605.26222#S4.SS1)\.

#### Implementation\.

For the experiments we use the implementation of\(García\-Pérez et al\.,[2025](https://arxiv.org/html/2605.26222#bib.bib23);Pérez\-Ortiz et al\.,[2021](https://arxiv.org/html/2605.26222#bib.bib41)\), which directly optimizes a PAC\-Bayes bound for a family of Gaussian distributions\.

#### Datasets\.

We evaluate the bounds on the MNIST\(LeCun et al\.,[1998](https://arxiv.org/html/2605.26222#bib.bib28)\)data with60,00060\{,\}000training samples, and CIFAR\-10\(Krizhevsky & Hinton,[2009](https://arxiv.org/html/2605.26222#bib.bib27)\)50,00050\{,\}000training samples\. All reported guarantees are based only on the training data\.

#### Models\.

For MNIST, we use a fully\-connected MLP with three 600\-unit hidden layers and a 10\-class output layer\. For CIFAR\-10, we use a nine\-layer ConvNet with six convolutional layers followed by three fully connected layers for 10\-class classification\. In total, the models have1,198,2101\{,\}198\{,\}210and5,851,3385\{,\}851\{,\}338trainable parameters, respectively\. So, as typical for real\-world deep networks, they are highly overparameterized\.

#### Prior and Posterior distributions\.

For all experiments, both prior and posterior are Gaussian distributions\. For the baseline, the independent prior is

π0=𝒩\(μ0,τ𝐈𝐝\)\\pi\_\{0\}=\\mathcal\{N\}\(\\mu\_\{0\},\\tau\\mathbf\{Id\}\)\(94\)with a randomly generatedμ0\\mu\_\{0\}, andτ\\tauis chosen from\{0\.02,…,0\.08\}\\\{0\.02,\\dots,0\.08\\\}, using a union\-bound argument as described below\. The DP\-SGD prior is

πDP=𝒩\(μDP,τ𝐈𝐝\),\\pi\_\{DP\}=\\mathcal\{N\}\(\\mu\_\{DP\},\\tau\\mathbf\{Id\}\),\(95\)whereμDP\\mu\_\{DP\}is trained with[Algorithm˜2](https://arxiv.org/html/2605.26222#alg2)\. The posterior is also a Gaussian distribution

ρ=𝒩\(μρ,Σρ\),\\rho=\\mathcal\{N\}\(\\mu\_\{\\rho\},\\Sigma\_\{\\rho\}\),\(96\)with diagonalΣρ\\Sigma\_\{\\rho\}\. For all experiments,μρ\\mu\_\{\\rho\}andΣρ\\Sigma\_\{\\rho\}are trained by optimizing the corresponding bounds\.

#### Training the DP\-SGD prior\.

For training the prior, we train a deterministic model based on the DP\-SGD algorithm \([Algorithm˜1](https://arxiv.org/html/2605.26222#alg1)\)\. For CIFAR\-10, we use batch sizem=5000m=5000, clipping norm fromζ∈\{0\.005,0\.01,0\.015,0\.02,0\.03,0\.04,0\.05\}\\zeta\\in\\\{0\.005,0\.01,0\.015,0\.02,0\.03,0\.04,0\.05\\\}, learning rate from\{5,20,50,100\}\\\{5,20,50,100\\\}, epochs fromE∈\{1,3,5,10,20\}E\\in\\\{1,3,5,10,20\\\}and noise withσ=1\\sigma=1\. Out of all possible hyperparameter tuples, we try 100 of them with the lowest value ofκ\\kappa, and compute the bounds using a union\-bound argument as described below\.

#### Computing the certificates\.

Given a generalization bound of the formkl\(R^\|R\)≤b\\text\{kl\}\(\\widehat\{R\}\|R\)\\leq b, we estimate a bound on the true risk of a stochastic model following the same procedure asGarcía\-Pérez et al\.\([2025](https://arxiv.org/html/2605.26222#bib.bib23)\);Pérez\-Ortiz et al\.\([2021](https://arxiv.org/html/2605.26222#bib.bib41)\)\. We estimate the empirical risk,R^\\widehat\{R\}, by150,000150\{,\}000Monte Carlo rollouts, and apply the numeric inversion ofklbased on a binary search, as discuss in Appendix[A\.3](https://arxiv.org/html/2605.26222#A1.SS3)\.

#### Hyperparameter Section via Union Bounds\.

To apply the bounds, the hyperparameters used to generate the prior must be data\-independent\. However, in practice, we want to select them from a finite grid based on the resulting certificates\. To account for this selection, we employ a union bound argument\. Formally, assume we haveK1K\_\{1\}possible tuples of DP\-SGD hyperparameters, andK2K\_\{2\}possible choices of the prior varianceτ\\tau\. For theii\-th DP\-SGD hyperparameter tuple, letμi\\mu\_\{i\}be the output of DP\-SGD training andτj\\tau\_\{j\}be thejj\-th possible value forτ\\tau\. Letπi,j=𝒩\(μi,τj𝐈𝐝\)\\pi\_\{i,j\}=\\mathcal\{N\}\(\\mu\_\{i\},\\tau\_\{j\}\\mathbf\{Id\}\)be the prior generated byii\-th hyperparameter tuple, andjj\-th value forτ\\tau\. Letκi=I∞β/K1\(μi\(S\),S\)\\kappa\_\{i\}=I^\{\\beta/K\_\{1\}\}\_\{\\infty\}\(\\mu\_\{i\}\(S\),S\)be the corresponding max\-information forii\-th hyperparameter tuple\. By the union bound argument, for anyδ,δ′\\delta,\\delta^\{\\prime\}, andβ\\beta, with probability at least1−δ−δ′1\-\\delta\-\\delta^\{\\prime\}for all posteriorsρ\\rho, and all priorsπi,j\\pi\_\{i,j\}, we have

R\(ρ\)≤kl−1\(R^\(ρ\)\|DKL\(ρ∥πi,j\)\+κi\+log⁡\(2K1K2nδ−δ′−β\)n\),R\(\\rho\)\\leq\\text\{kl\}^\{\-1\}\\Biggl\(\\widehat\{R\}\(\\rho\)\\Biggm\|\\frac\{D\_\{\\text\{KL\}\}\(\\rho\\\|\\pi\_\{i,j\}\)\+\\kappa\_\{i\}\+\\log\(\\frac\{2K\_\{1\}K\_\{2\}\\sqrt\{n\}\}\{\\delta\-\\delta^\{\\prime\}\-\\beta\}\)\}\{n\}\\Biggr\),\(97\)
Additionally, if we estimate the training error of the posteriorρ\\rhowith N samples, with probability1−δ′1\-\\delta^\{\\prime\}we have:

R^\(ρ\)≤kl−1\(R~N\(ρ\)\|log⁡\(2Nδ′\)N\),\\widehat\{R\}\(\\rho\)\\leq\\text\{kl\}^\{\-1\}\\Big\(\\tilde\{R\}\_\{N\}\(\\rho\)\\Big\|\\frac\{\\log\(\\frac\{2\\sqrt\{N\}\}\{\\delta^\{\\prime\}\}\)\}\{N\}\\Big\),\(98\)whereR~N\(ρ\)\\tilde\{R\}\_\{N\}\(\\rho\)is the average training error overNNdraws of the posteriorρ\\rho\.

Combining these two bounds withδ=0\.05,δ′=0\.0125,β=0\.025,K1=100,K2=7\\delta=0\.05,\\delta^\{\\prime\}=0\.0125,\\beta=0\.025,K\_\{1\}=100,K\_\{2\}=7, results in the bound values we report in[Table˜1](https://arxiv.org/html/2605.26222#S4.T1)\.

Table 2:Numeric upper bounds,ℬ\\mathcal\{B\}, on the risk,RR, for overparameterized deep networks on MNIST, obtained by numerically inverting \([31](https://arxiv.org/html/2605.26222#S4.E31)\) forπ\\pi, and \([34](https://arxiv.org/html/2605.26222#S4.E34)\) forρ\\rho\. Reported results and standard deviations are 3 runs with different random seeds\.Table 3:Numeric upper bounds,ℬ\\mathcal\{B\}, on the risk,RR, for overparameterized deep networks on CIFAR\-10, obtained by numerically inverting \([31](https://arxiv.org/html/2605.26222#S4.E31)\) forπ\\pi, and \([34](https://arxiv.org/html/2605.26222#S4.E34)\) forρ\\rho\. Reported results and standard deviations are 3 runs with different random seeds\.

## Appendix DDeclaration of LLM Usage\.

In the preparation of the manuscript, we used ChatGPT and Gemini for spell\-checking and stylistic recommendations\. In the scientific process, we used ChatGPT for suggesting derivation paths and for proof\-reading our proofs, in particular making sure that constants are tracked correctly\. For the experiments, we used ChatGPT for implementing routine tasks, such as data plotting\.
From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD

Similar Articles

Provable Robustness against Backdoor Attacks via the Primal-Dual Perspective on Differential Privacy

The Fast Mixing Mechanism for Differential Privacy

Differentially Private Natural Gradient Descent

Population Risk Bounds for Kolmogorov-Arnold Networks Trained by DP-SGD with Correlated Noise

An exact information theory of generalization phase transitions in Bayesian diffusion models

Submit Feedback

Similar Articles

Provable Robustness against Backdoor Attacks via the Primal-Dual Perspective on Differential Privacy
The Fast Mixing Mechanism for Differential Privacy
Differentially Private Natural Gradient Descent
Population Risk Bounds for Kolmogorov-Arnold Networks Trained by DP-SGD with Correlated Noise
An exact information theory of generalization phase transitions in Bayesian diffusion models