Population Risk Bounds for Kolmogorov-Arnold Networks Trained by DP-SGD with Correlated Noise

arXiv cs.LG 05/14/26, 04:00 AM Papers
Summary
This paper establishes the first population risk bounds for Kolmogorov-Arnold Networks trained with mini-batch SGD and DP-SGD using correlated noise, advancing theoretical understanding of KANs in privacy-sensitive domains.
arXiv:2605.12648v1 Announce Type: new
# Population Risk Bounds for Kolmogorov–Arnold Networks Trained by DP-SGD with Correlated Noise
Source: [https://arxiv.org/html/2605.12648](https://arxiv.org/html/2605.12648)
Puyu Wang¹, Jan Schuchardt², Nikita Kalinin³, Junyu Zhou⁴, Sophie Fellenz¹, Christoph Lampert³, Marius Kloft¹

¹RPTU Kaiserslautern-Landau, Kaiserslautern, Germany; ²Machine Learning Research, Morgan Stanley; ³Institute of Science and Technology, Klosterneuburg, Austria; ⁴Catholic University of Eichstätt-Ingolstadt, Ingolstadt, Germany

###### Abstract

We establish the first population risk bounds for Kolmogorov-Arnold Networks (KANs) trained by mini-batch SGD with gradient clipping, covering non-private SGD as well as differentially private SGD (DP-SGD) with Gaussian perturbations that interpolate between independent and temporally correlated noise. This setting is substantially closer to practice than prior KAN theory along two axes: training is by mini-batch SGD, the standard recipe for modern networks, rather than full-batch gradient descent (GD); and correlated-noise mechanisms have empirically shown a more favorable privacy-utility tradeoff than independent-noise mechanisms. Our results cover the corresponding full-batch GD and independent-noise DP-GD results for KANs by Wang et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib19)), while yielding sharper fixed-second-layer specializations. The technical core is a new analysis route for correlated-noise DP training in the non-convex regime. Temporal dependence breaks the conditional-centering structure underlying standard one-step SGD arguments, and the projection step obstructs the exact cancellation structure of correlated perturbations. We address these difficulties through an auxiliary unprojected dynamics, a shifted iterate that absorbs the current noise perturbation, and a high-probability bootstrap certifying projection inactivity. Combining this optimization analysis with a stability-based generalization argument yields the stated population risk bounds. To the best of our knowledge, this is the first optimization and population risk analysis of a correlated-noise mechanism for DP training beyond convex learning, in particular for neural networks.

## 1 Introduction

Kolmogorov-Arnold Networks (KANs) (Liu et al., [2025b](https://arxiv.org/html/2605.12648#bib.bib11)) have recently emerged as a structured alternative to multilayer perceptrons (MLPs). By parameterizing interactions through learnable univariate functions on edges, KANs admit explicit functional decompositions that support interpretability and improved extrapolation in scientific and engineering domains. They have shown strong empirical performance in molecular and biological modeling (Cherednichenko and Poptsova, [2025](https://arxiv.org/html/2605.12648#bib.bib48); Li et al., [2025a](https://arxiv.org/html/2605.12648#bib.bib53)), physics-informed learning (Patra et al., [2025](https://arxiv.org/html/2605.12648#bib.bib51); Shukla et al., [2024](https://arxiv.org/html/2605.12648#bib.bib52); Wang et al., [2025d](https://arxiv.org/html/2605.12648#bib.bib50)), and time-series forecasting (Vaca-Rubio et al., [2024](https://arxiv.org/html/2605.12648#bib.bib49)), domains that frequently involve *sensitive* patient, biological, or industrial data.

Population risk bounds quantify how a trained model performs on new data\. They give worst\-case guarantees on this performance, identify which training choices matter for it, and enable principled comparison of how training algorithms scale with sample size\.

For KANs, however, population risk bounds are still tied to full-batch gradient descent (GD) (Wang et al., [2026](https://arxiv.org/html/2605.12648#bib.bib19)), whereas practitioners train with mini-batch stochastic gradient descent (SGD) and clipping. In this regime, mini-batch sampling and clipping materially change the optimization dynamics, and hence the population risk of the trained model. A natural question is therefore *whether one can obtain population risk guarantees for KANs trained in this more practical regime*.

For sensitive data, the question above must be answered under an additional constraint: formal privacy guarantees. Differential privacy (DP) (Dwork, [2006](https://arxiv.org/html/2605.12648#bib.bib687)) is the standard framework, and its canonical instantiation is DP-SGD (Song et al., [2013](https://arxiv.org/html/2605.12648#bib.bib29)), in which calibrated Gaussian noise is added at each step to mask individual data points. Yet the only existing analyses for private KANs are again restricted to full-batch training (Wang et al., [2026](https://arxiv.org/html/2605.12648#bib.bib19)), leaving the practical mini-batch regime open in the private setting.

A further limitation concerns the noise model. Standard DP-SGD analyses typically assume fresh independent Gaussian noise at every step. Recent correlated-noise mechanisms instead introduce temporal correlations across perturbations, so that consecutive noise terms partially cancel and the cumulative noise entering the optimization dynamics is reduced. These mechanisms have become a leading approach for improving DP utility, with deployment in production federated learning systems for on-device language models (McMahan et al., [2024](https://arxiv.org/html/2605.12648#bib.bib92)) and strong empirical advantage demonstrated in recent benchmarks (Kalinin et al., [2026a](https://arxiv.org/html/2605.12648#bib.bib743)). Yet despite this active line of work (Andersson and Pagh, [2023](https://arxiv.org/html/2605.12648#bib.bib679); Choquette-Choo et al., [2024a](https://arxiv.org/html/2605.12648#bib.bib746), [2023a](https://arxiv.org/html/2605.12648#bib.bib708), [2023b](https://arxiv.org/html/2605.12648#bib.bib685); Denisov et al., [2022](https://arxiv.org/html/2605.12648#bib.bib707); Fichtenberger et al., [2023](https://arxiv.org/html/2605.12648#bib.bib675); Kalinin and Lampert, [2024](https://arxiv.org/html/2605.12648#bib.bib677); Kalinin et al., [2026b](https://arxiv.org/html/2605.12648#bib.bib695); McKenna, [2025](https://arxiv.org/html/2605.12648#bib.bib700); Pillutla et al., [2025](https://arxiv.org/html/2605.12648#bib.bib736); Rodio et al., [2025](https://arxiv.org/html/2605.12648#bib.bib2)), a population risk theory for correlated-noise DP training beyond convex learning is still missing. In particular, no such guarantee is known for training non-convex neural networks such as KANs.

This paper addresses both gaps by establishing population risk bounds for two-layer KANs trained by clipped mini-batch SGD, covering both the non-private and DP settings. In the DP setting, we consider a temporally correlated-noise mechanism, DP-$\lambda$CGD (Kalinin et al., [2026a](https://arxiv.org/html/2605.12648#bib.bib743)), in which the noise takes the form $\xi_{t}=\kappa(Z_{t}-\lambda Z_{t-1})$, with $Z_{t}$ standard Gaussian noise and $\kappa\geq 0$ the noise multiplier; $\lambda=0$ recovers the standard independent-noise mechanism. The correlated-noise DP setting poses two main technical challenges: *(i)* temporal dependence breaks the conditional-centering arguments that underpin standard one-step recursions; and *(ii)* the projection used to keep the iterates localized breaks the partial-cancellation structure on which correlated noise relies (Figure [1](https://arxiv.org/html/2605.12648#S1.F1), right). Overcoming these obstacles is the technical core of the paper.
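To make the partial cancellation concrete, the following short calculation sums the correlated perturbations over $T$ steps. It is a sketch rather than a statement taken from the paper: it assumes the convention $Z_{0}=\mathbf{0}$ from the algorithm description, holds $\kappa$ fixed, ignores the projection, and does not account for the fact that the $\kappa$ required for a given privacy budget generally depends on $\lambda$. The index $j$ denotes a single coordinate.

```latex
\sum_{t=1}^{T}\xi_{t}
  = \kappa\sum_{t=1}^{T}\bigl(Z_{t}-\lambda Z_{t-1}\bigr)
  = \kappa\Bigl(Z_{T}+(1-\lambda)\sum_{t=1}^{T-1}Z_{t}\Bigr),
\qquad
\operatorname{Var}\Bigl[\Bigl(\sum_{t=1}^{T}\xi_{t}\Bigr)_{j}\Bigr]
  = \kappa^{2}\bigl(1+(T-1)(1-\lambda)^{2}\bigr).
```

For $\lambda=0$ the per-coordinate variance of the cumulative noise is $\kappa^{2}T$, whereas for $\lambda$ close to $1$ it stays close to $\kappa^{2}$ at the same $\kappa$.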

![Refer to caption](https://arxiv.org/html/2605.12648v1/x1.png)

![Refer to caption](https://arxiv.org/html/2605.12648v1/x2.png)

Figure 1: CNN on CIFAR-10. Left: Moderate noise correlation improves the accuracy of DP-SGD over independent noise ($\lambda=0$), especially for larger privacy budgets $\epsilon$. However, the gain is not monotone in $\lambda$, and accuracy can drop when $\lambda\rightarrow 1$. Right: Subtracting a $\lambda$-fraction of the previous noise partially cancels consecutive noise perturbations, slowing cumulative-noise growth and thus preserving accuracy. Figure reproduced from Kalinin et al. ([2026a](https://arxiv.org/html/2605.12648#bib.bib743)) with the authors' permission.

Our main contributions are summarized as follows.

- We establish the *first* population risk bounds for two-layer KANs trained by clipped mini-batch SGD, in both non-private and DP settings, together with explicit width regimes under which the guarantees hold. This moves KAN theory beyond the full-batch GD/DP-GD setting of Wang et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib19)).
- In the DP setting, we provide, to the best of our knowledge, the *first* population risk bound for correlated-noise DP training in a non-convex setting. We instantiate this result for two-layer KANs trained with clipped mini-batch DP-SGD. Under a representative parameter regime, the resulting KAN rate matches the convex DP-SCO lower bound of Bassily et al. ([2019](https://arxiv.org/html/2605.12648#bib.bib22)), up to logarithmic factors.
- Our KAN bounds cover several special cases: non-private mini-batch SGD ($\kappa=0$), independent-noise DP-SGD ($\lambda=0$), and full-batch training ($B=n$, with $B$ the batch size and $n$ the sample size). In the full-batch fixed-second-layer specialization, our non-private and private bounds match the corresponding GD and DP-GD sample/privacy scalings of Wang et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib19)), while sharpening the dependence on the NTK margin and reducing the required width.
- Technically, we provide an analysis route for correlated-noise DP training in the non-convex regime. The approach combines an auxiliary unprojected dynamics, a shifted iterate that exposes the cancellation structure of the noise, and a high-probability bootstrap certifying projection inactivity. This framework may be of independent interest beyond KANs.

Concretely, in the polylogarithmic-width regime ($m\asymp\mathrm{polylog}(n)$), our non-private mini-batch SGD bounds yield averaged optimization and population risks of order $1/n$, suppressing logarithmic factors and dependence on the NTK margin $\gamma$. In the private setting, both independent-noise and correlated-noise DP-SGD attain the rate $\mathcal{O}(\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon})$ under $(\epsilon,\delta)$-DP. Full statements, including the precise choices of $B,T,\eta$, the width conditions, and the $\lambda$-dependence, appear in Section [5](https://arxiv.org/html/2605.12648#S5).

The paper is structured as follows. Section [2](https://arxiv.org/html/2605.12648#S2) reviews related work. Section [3](https://arxiv.org/html/2605.12648#S3) presents the problem setup. Section [4](https://arxiv.org/html/2605.12648#S4) gives the core correlated-noise optimization analysis. Section [5](https://arxiv.org/html/2605.12648#S5) derives the private and non-private population risk bounds. Section [6](https://arxiv.org/html/2605.12648#S6) concludes the paper.

## 2 Related Work

KAN theory has so far focused on approximation, expressiveness, and optimization (Eshtehardian et al., [2026](https://arxiv.org/html/2605.12648#bib.bib13); Gao and Tan, [2025](https://arxiv.org/html/2605.12648#bib.bib45); Li et al., [2025b](https://arxiv.org/html/2605.12648#bib.bib9); Liu et al., [2025a](https://arxiv.org/html/2605.12648#bib.bib3); Wang et al., [2025c](https://arxiv.org/html/2605.12648#bib.bib8)). The closest prior work, Wang et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib19)), establishes optimization and population risk bounds for two-layer KANs trained by full-batch GD, and extends them to DP-GD with independent Gaussian noise. Our results recover both as special cases ($B=n$, and $B=n$ with $\lambda=0$, respectively) while also handling mini-batch SGD and the broader correlated-noise regime.

#### Population risk bounds for DP training of neural networks \(NNs\)\.

Beyond Wang et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib19)), recent work has also studied private training of NNs under the independent-noise setting (Ding et al., [2025](https://arxiv.org/html/2605.12648#bib.bib5); Shi et al., [2026](https://arxiv.org/html/2605.12648#bib.bib748); Wang et al., [2025a](https://arxiv.org/html/2605.12648#bib.bib55); Xu and Chen, [2026](https://arxiv.org/html/2605.12648#bib.bib6); Zhang et al., [2026](https://arxiv.org/html/2605.12648#bib.bib1)). In particular, Wang et al. ([2025a](https://arxiv.org/html/2605.12648#bib.bib55)) analyzes DP-GD for three-layer MLPs in regression, and Shi et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib748)) studies DP-GD for two-layer CNNs. Ding et al. ([2025](https://arxiv.org/html/2605.12648#bib.bib5)), Xu and Chen ([2026](https://arxiv.org/html/2605.12648#bib.bib6)), and Zhang et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib1)) study DP-SGD for NNs from a feature-learning perspective, focusing respectively on noisy feature learning dynamics, fairness/robustness degradation, and memorization on long-tailed data. Their results cannot be extended to KANs or to correlated-noise mechanisms.

#### Theory of correlated\-noise differential privacy\.

Correlated-noise mechanisms have been analyzed theoretically from several complementary perspectives (Denisov et al., [2022](https://arxiv.org/html/2605.12648#bib.bib707); Koloskova et al., [2023](https://arxiv.org/html/2605.12648#bib.bib745); Choquette-Choo et al., [2024a](https://arxiv.org/html/2605.12648#bib.bib746)). In particular, Koloskova et al. ([2023](https://arxiv.org/html/2605.12648#bib.bib745)) studies GD with linearly correlated noise. In the smooth non-convex regime, their bounds control the average gradient norm rather than the optimization or population risk. Choquette-Choo et al. ([2024a](https://arxiv.org/html/2605.12648#bib.bib746)) proves utility separations between correlated and independent noise for private convex learning, with explicit guarantees for linear regression. None of these results covers clipped mini-batch DP-SGD for non-convex NNs, in particular KANs, and none provides stability-based population risk guarantees.

A broader overview of related work, covering neural network theory and privacy amplification by subsampling, is provided in Appendix [A](https://arxiv.org/html/2605.12648#A1).

## 3 Problem Setting

We now introduce the learning problem, the two\-layer KAN architecture, the mini\-batch DP\-SGD algorithm with correlated noise, the assumptions underlying our analysis, and our risk decomposition\.

#### Notation and learning problem\.

Let $\mathcal{P}$ be a probability distribution on $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}\subseteq\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|_{2}\leq 1\}$ and $\mathcal{Y}=\{-1,+1\}$. For a positive integer $q$, let $[q]=\{1,\ldots,q\}$. We use $\|\cdot\|_{2}$ for the Euclidean norm and $\langle\cdot,\cdot\rangle$ for the inner product. Given a training dataset $S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ drawn i.i.d. from $\mathcal{P}$, we measure the quality of a classifier $f:\mathcal{X}\to\mathbb{R}$ by the population and empirical risks $\mathcal{L}(f)=\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{P}}\left[\ell(yf(\mathbf{x}))\right]$ and $\mathcal{L}_{S}(f)=\frac{1}{n}\sum_{i=1}^{n}\ell(y_{i}f(\mathbf{x}_{i}))$, respectively, where $\ell(z)=\log(1+\exp(-z))$ is the logistic loss.
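As a small illustration of these definitions, the sketch below evaluates the logistic loss and the empirical risk $\mathcal{L}_{S}(f)$ with NumPy. The function names `logistic_loss` and `empirical_risk` are hypothetical and chosen only for this example; `np.logaddexp` is used so that large negative margins do not overflow.

```python
import numpy as np

def logistic_loss(z):
    """ell(z) = log(1 + exp(-z)), computed stably as logaddexp(0, -z)."""
    return np.logaddexp(0.0, -np.asarray(z, dtype=float))

def empirical_risk(f, X, y):
    """L_S(f) = (1/n) * sum_i ell(y_i * f(x_i)) for labels y_i in {-1, +1}."""
    margins = np.array([y_i * f(x_i) for x_i, y_i in zip(X, y)])
    return float(np.mean(logistic_loss(margins)))
```

A classifier that outputs zero everywhere has empirical risk $\log 2$ on any dataset, which is a convenient sanity check.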

### 3.1 Architecture: Two-layer KANs with B-spline Bases

Let $m$ be the hidden width. Following the spline-based two-layer KAN formulation studied in Gao and Tan ([2025](https://arxiv.org/html/2605.12648#bib.bib45)) and Wang et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib19)), we consider a model with B-spline basis $\{b_{k}\}_{k=1}^{p}$ and a fixed activation function $\sigma:\mathbb{R}\to\mathbb{R}$. For an input $\mathbf{x}=(x_{1},\ldots,x_{d})$, the network is defined as

$$f_{\mathbf{W}}(\mathbf{x})=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}\sum_{k=1}^{p}c_{j,k}\,b_{k}(x_{1,j})\quad\text{with}\quad x_{1,j}=\sigma\Big(\frac{1}{\sqrt{d}}\sum_{i=1}^{d}\sum_{k=1}^{p}w_{i,j,k}\,b_{k}(x_{i})\Big),$$

where $x_{1,j}$ is the output of the $j$-th hidden unit, $\mathbf{W}=\{w_{i,j,k}\}_{i\in[d],\,j\in[m],\,k\in[p]}\in\mathbb{R}^{mdp}$ denotes the trainable first-layer spline coefficients, and $\mathbf{c}=\{c_{j,k}\}_{j\in[m],\,k\in[p]}\in\mathbb{R}^{mp}$ denotes the second-layer spline coefficients. The second-layer coefficients $\mathbf{c}$ are drawn once at initialization from $\mathcal{N}(\mathbf{0},\mathbf{I}_{mp})$ and kept fixed throughout training, while we optimize only the first-layer parameter $\mathbf{W}$. For notational simplicity, we write $\mathcal{L}_{S}(\mathbf{W})=\mathcal{L}_{S}(f_{\mathbf{W}})$ and $\mathcal{L}(\mathbf{W})=\mathcal{L}(f_{\mathbf{W}})$.
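The forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the basis is taken to be $p$ shifted copies of the cardinal cubic B-spline on a uniform grid over $[-1,1]$, the activation is $\tanh$, and the names `cubic_bspline`, `basis`, and `kan_forward` are hypothetical. The array `W` has shape `(d, m, p)` and `c` has shape `(m, p)`, matching the index sets of $\mathbf{W}$ and $\mathbf{c}$.

```python
import numpy as np

def cubic_bspline(u):
    """Cardinal cubic B-spline centered at 0 with support [-2, 2]."""
    a = np.abs(u)
    inner = 2.0 / 3.0 - a**2 + 0.5 * a**3      # |u| < 1
    outer = (2.0 - a) ** 3 / 6.0               # 1 <= |u| < 2
    return np.where(a < 1, inner, np.where(a < 2, outer, 0.0))

def basis(x, p, lo=-1.0, hi=1.0):
    """Evaluate p shifted cubic B-splines (assumed uniform grid, p >= 2)."""
    centers = np.linspace(lo, hi, p)
    h = (hi - lo) / (p - 1)
    return cubic_bspline((np.asarray(x, dtype=float)[..., None] - centers) / h)

def kan_forward(W, c, x):
    """f_W(x) for first-layer W of shape (d, m, p) and fixed c of shape (m, p)."""
    d, m, p = W.shape
    phi = basis(x, p)                                    # b_k(x_i), shape (d, p)
    pre = np.einsum("ijk,ik->j", W, phi) / np.sqrt(d)    # sum over i and k
    hidden = np.tanh(pre)                                # x_{1,j}, shape (m,)
    psi = basis(hidden, p)                               # b_k(x_{1,j}), shape (m, p)
    return np.sum(c * psi) / np.sqrt(m)
```

Because $\tanh$ maps into $(-1,1)$, the same grid can be reused for both layers, which mirrors the requirement that the basis be controlled on an interval containing $[-1,1]\cup\mathrm{range}(\sigma)$.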

### 3.2 Algorithm: Mini-batch DP-SGD with Correlated Noise

To enable training KANs on sensitive datasets, we study a differentially private variant of mini\-batch SGD\. We call two datasets neighboring if they differ in the contribution of one data record\.

###### Definition 3.1 (Differential privacy (Dwork et al., [2006](https://arxiv.org/html/2605.12648#bib.bib20))).

We say that a randomized algorithm $\mathcal{A}$ satisfies $(\epsilon,\delta)$-DP if, for any two neighboring datasets $S$ and $S^{\prime}$ and any measurable set $E$ in the output space of $\mathcal{A}$, it holds that $\mathbb{P}(\mathcal{A}(S)\in E)\leq e^{\epsilon}\,\mathbb{P}(\mathcal{A}(S^{\prime})\in E)+\delta$. We say $\mathcal{A}$ satisfies $\epsilon$-DP if $\delta=0$.

The algorithm we analyze is mini-batch DP-SGD with correlated noise. It subsumes independent-noise DP-SGD and non-private mini-batch SGD as special cases. We use DP-$\lambda$CGD (Kalinin et al., [2026a](https://arxiv.org/html/2605.12648#bib.bib743)), a simple correlated-noise mechanism, as the noise model. After initializing $\mathbf{W}_{0}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{mdp})$, at each iteration $t$ we draw a mini-batch $\mathcal{B}_{t}\subseteq[n]$ of size $B$ uniformly without replacement and form the per-example gradients $g_{t,i}=\nabla\ell(y_{i}f_{\mathbf{W}_{t-1}}(\mathbf{x}_{i}))$, which are clipped at threshold $C_{\mathrm{clip}}$ and averaged: $\tilde{g}_{t,i}=g_{t,i}\min\big\{1,\frac{C_{\mathrm{clip}}}{\|g_{t,i}\|_{2}}\big\}$ and $v_{t}=\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}\tilde{g}_{t,i}$. The noise takes the correlated form $\xi_{t}=\kappa(Z_{t}-\lambda Z_{t-1})$ with $Z_{0}=\mathbf{0}$ and $\{Z_{t}\}_{t\geq 1}\overset{\mathrm{iid}}{\sim}\mathcal{N}(\mathbf{0},\mathbf{I}_{mdp})$, where $\kappa\geq 0$ is the noise multiplier and $\lambda\in[0,1)$ controls the strength of temporal correlation. The iterate is then updated by a projected step onto $\mathcal{K}=\mathcal{B}(\mathbf{W}_{0},R_{*})$: $\mathbf{W}_{t}=\Pi_{\mathcal{K}}(\mathbf{W}_{t-1}-\eta\,\hat{v}_{t})$ with $\hat{v}_{t}=v_{t}+\frac{C_{\mathrm{clip}}}{B}\xi_{t}$. The full procedure is given in Algorithm [1](https://arxiv.org/html/2605.12648#alg1).

**Algorithm 1** Mini-batch DP-SGD with Correlated Gaussian Noise

1: Dataset $S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$; total iterations $T$; step size $\eta>0$; batch size $B$; clip threshold $C_{\mathrm{clip}}>0$; correlation $\lambda\in[0,1)$; noise multiplier $\kappa\geq 0$; localization radius $R_{*}>0$.

2: Initialize $\mathbf{W}_{0}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{mdp})$ and $\mathbf{c}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{mp})$; keep $\mathbf{c}$ fixed; set $Z_{0}=\mathbf{0}\in\mathbb{R}^{mdp}$ and $\mathcal{K}=\mathcal{B}(\mathbf{W}_{0},R_{*})$.

3: **for** $t=1,2,\dots,T-1$ **do**

4: Sample a mini-batch $\mathcal{B}_{t}\subseteq[n]$ of size $B$ uniformly without replacement

5: **for each** $i\in\mathcal{B}_{t}$ **do**

6: $\tilde{g}_{t,i}\leftarrow g_{t,i}\cdot\min\big\{1,\frac{C_{\mathrm{clip}}}{\|g_{t,i}\|_{2}}\big\}$ with $g_{t,i}=\nabla\ell(y_{i}f_{\mathbf{W}_{t-1}}(\mathbf{x}_{i}))$ $\triangleright$ per-example clipping

7: **end for**

8: $v_{t}\leftarrow\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}\tilde{g}_{t,i}$ $\triangleright$ clipped mini-batch gradient

9: Sample fresh noise $Z_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{mdp})$

10: $\xi_{t}\leftarrow\kappa\,(Z_{t}-\lambda Z_{t-1})$ $\triangleright$ correlated Gaussian noise

11: $\hat{v}_{t}\leftarrow v_{t}+\frac{C_{\mathrm{clip}}}{B}\,\xi_{t}$

12: $\mathbf{W}_{t}\leftarrow\Pi_{\mathcal{K}}(\mathbf{W}_{t-1}-\eta\,\hat{v}_{t})$

13: **end for**

14: **return** $\{\mathbf{W}_{t}\}_{t=0}^{T-1}$.
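The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch under assumptions, not the authors' code: `grad_fn(W, i)` is a hypothetical callback returning the per-example gradient $g_{t,i}$ at the current iterate, the parameters are stored as a flat float array, and the names `project_ball` and `dp_lambda_cgd` are chosen for this example only. The sketch says nothing about how $\kappa$ should be calibrated to reach a target $(\epsilon,\delta)$.

```python
import numpy as np

def project_ball(W, center, radius):
    """Euclidean projection onto the ball B(center, radius)."""
    diff = W - center
    dist = np.linalg.norm(diff)
    return W if dist <= radius else center + diff * (radius / dist)

def dp_lambda_cgd(grad_fn, W0, n, T, eta, B, C_clip, lam, kappa, R_star, rng):
    """Return the iterates W_0, ..., W_{T-1} of clipped, projected noisy SGD."""
    W, Z_prev = W0.copy(), np.zeros_like(W0)       # Z_0 = 0
    iterates = [W.copy()]
    for _ in range(1, T):                          # t = 1, ..., T-1
        batch = rng.choice(n, size=B, replace=False)
        v = np.zeros_like(W0)
        for i in batch:
            g = grad_fn(W, i)                      # hypothetical per-example gradient
            g_norm = np.linalg.norm(g)
            scale = 1.0 if g_norm <= C_clip else C_clip / g_norm
            v += scale * g                         # per-example clipping
        v /= B                                     # clipped mini-batch gradient
        Z = rng.standard_normal(W0.shape)          # fresh noise Z_t
        xi = kappa * (Z - lam * Z_prev)            # correlated Gaussian noise
        v_hat = v + (C_clip / B) * xi
        W = project_ball(W - eta * v_hat, W0, R_star)
        Z_prev = Z
        iterates.append(W.copy())
    return iterates
```

Calling the function with `kappa=0.0` gives the non-private clipped variant, and `lam=0.0` gives independent noise, so the same routine can be used to compare the regimes on a toy objective.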

#### Special cases\.

This formulation recovers the standard baselines as special cases. Setting $\lambda=0$ gives $\xi_{t}=\kappa Z_{t}$, recovering *independent-noise DP-SGD*. Setting $\kappa=0$ removes the noise and yields *non-private mini-batch SGD*; taking in addition $C_{\mathrm{clip}}\to\infty$ and $R_{*}\to\infty$ removes the clipping and projection, recovering plain *mini-batch SGD*.

### 3.3 Assumptions

We impose two standard assumptions used in prior KAN analyses (Gao and Tan, [2025](https://arxiv.org/html/2605.12648#bib.bib45); Taheri et al., [2025](https://arxiv.org/html/2605.12648#bib.bib59); Wang et al., [2026](https://arxiv.org/html/2605.12648#bib.bib19)). The first bounds the activation, the B-spline basis functions, and their first two derivatives. It is satisfied, e.g., by cubic (or higher-degree) B-splines together with sigmoid or hyperbolic tangent activations. Let $\mathcal{I}_{b}$ be an interval containing $[-1,1]\cup\mathrm{range}(\sigma)$.

###### Assumption 3\.2\.

Assume $\sigma$ satisfies $|\sigma(u)|\leq B_{\sigma}$, $|\sigma^{\prime}(u)|\leq B^{\prime}_{\sigma}$, and $|\sigma^{\prime\prime}(u)|\leq B^{\prime\prime}_{\sigma}$ for all $u\in\mathbb{R}$. Further, assume $\{b_{k}\}_{k=1}^{p}$ satisfy $|b_{k}(v)|\leq B_{b}$, $|b_{k}^{\prime}(v)|\leq B^{\prime}_{b}$, and $|b_{k}^{\prime\prime}(v)|\leq B^{\prime\prime}_{b}$ for all $v\in\mathcal{I}_{b}$.

The second is a margin-style separability condition on the NTK features at initialization (Lei et al., [2026](https://arxiv.org/html/2605.12648#bib.bib10); Taheri et al., [2025](https://arxiv.org/html/2605.12648#bib.bib59); Wang et al., [2026](https://arxiv.org/html/2605.12648#bib.bib19)). It is weaker than the NTK Gram-matrix positive-definiteness commonly used in the literature (Arora et al., [2019](https://arxiv.org/html/2605.12648#bib.bib151); Gao and Tan, [2025](https://arxiv.org/html/2605.12648#bib.bib45); Nitanda et al., [2019](https://arxiv.org/html/2605.12648#bib.bib73)), and was verified for KANs in Wang et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib19)).

###### Assumption 3.3 (NTK separability).

There exist $\gamma\in(0,1]$ and $\mathbf{u}\in\mathbb{R}^{mdp}$ with $\|\mathbf{u}\|_{2}=1$ such that $y_{i}\langle\nabla f_{\mathbf{W}_{0}}(\mathbf{x}_{i}),\mathbf{u}\rangle\geq\gamma$ for all $i\in[n]$.

### 3.4 Risk Measures and Analysis Strategy

We measure the utility of Algorithm [1](https://arxiv.org/html/2605.12648#alg1) by the averaged population risk $\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}(\mathbf{W}_{t})$. Our analysis decomposes this into two parts:

$$\underbrace{\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}(\mathbf{W}_{t})}_{\text{population risk}}=\underbrace{\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})}_{\text{optimization risk}}+\underbrace{\frac{1}{T}\sum_{t=0}^{T-1}\big(\mathcal{L}(\mathbf{W}_{t})-\mathcal{L}_{S}(\mathbf{W}_{t})\big)}_{\text{generalization gap}}.$$

We bound the optimization risk via a new proof framework tailored to the correlated-noise regime, and the generalization gap via an algorithmic stability argument. Combining the two yields the desired bound on the averaged population risk.

## 4 Optimization Risk Bound for DP-SGD with Correlated Noise

This section presents an optimization risk bound for DP-SGD with correlated noise, the technical core of the paper and the input to the population analysis in Section [5](https://arxiv.org/html/2605.12648#S5).

### 4.1 Technical Challenges

To control the optimization risk $\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})$, a standard approach is to use a comparator argument based on the local curvature of the empirical loss. In particular, for a suitable comparator $\mathbf{W}^{*}\in\mathcal{K}$, one expects an inequality of the form

$$\mathcal{L}_{S}(\mathbf{W}_{t-1})-\mathcal{L}_{S}(\mathbf{W}^{*})\lesssim\big\langle\bar{v}_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\big\rangle+\mathrm{Err}_{t},$$

where $\bar{v}_{t}=\frac{1}{n}\sum_{i=1}^{n}\tilde{g}_{t,i}$ is the full-batch clipped gradient used only in the analysis, and $\mathrm{Err}_{t}$ contains the local curvature error terms. Thus, bounding the optimization risk reduces to controlling the accumulated first-order term $\sum_{t}\langle\bar{v}_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle$. This reduction leads to two distinct difficulties: the correlated perturbations break the conditional-centering argument used in the independent-noise analysis, and the curvature-based comparator inequality needed above is only valid locally for KANs.

A natural first attempt is to follow the independent-noise proof, which applies the projected one-step recursion and uses conditional centering to control the perturbation terms. To see the key step, substitute the decomposition of the noisy averaged clipped gradient $\hat{v}_{t}$ into the first-order term:

$$\langle\hat{v}_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle=\langle\bar{v}_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle+\big\langle v_{t}-\bar{v}_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\big\rangle+\frac{C_{\mathrm{clip}}}{B}\big\langle\xi_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\big\rangle.$$

In the independent-noise case, the mini-batch fluctuation $v_{t}-\bar{v}_{t}$ and the Gaussian perturbation $\xi_{t}=\kappa Z_{t}$ are both conditionally centered given the past. Hence, after taking conditional expectations in the one-step recursion, the two perturbation terms vanish and only the deterministic descent term remains, while the perturbations contribute only through higher-order variance terms. For correlated noise, however, this cancellation fails. Since $\xi_{t}=\kappa(Z_{t}-\lambda Z_{t-1})$, we have $\mathbb{E}[\xi_{t}\mid\mathcal{F}_{t-1}]=-\kappa\lambda Z_{t-1}$, with $\mathcal{F}_{t}$ the sigma-algebra containing all randomness revealed up to the end of iteration $t$. Hence, ignoring the scaling factor $\frac{C_{\mathrm{clip}}}{B}$, the noise component in the first-order term satisfies

$$\mathbb{E}\big[\big\langle\xi_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\big\rangle\,\big|\,\mathcal{F}_{t-1}\big]=-\kappa\lambda\big\langle Z_{t-1},\mathbf{W}_{t-1}-\mathbf{W}^{*}\big\rangle.$$

Thus, instead of vanishing after conditioning, the perturbation leaves a first-order drift term. This term cannot be removed by conditional centering, because $\mathbf{W}_{t-1}$ already depends on $Z_{t-1}$. Moreover, the projection prevents us from directly exploiting the cross-iteration cancellation structure of the correlated perturbations. Hence the standard projected one-step argument no longer applies.
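The conditional drift can be checked numerically. The sketch below is a simple Monte Carlo estimate of $\mathbb{E}[\xi_{t}\mid Z_{t-1}]$ for a fixed previous noise vector; the function name `conditional_mean_of_noise` and the sample sizes are assumptions made for illustration only.

```python
import numpy as np

def conditional_mean_of_noise(Z_prev, kappa, lam, num_samples, rng):
    """Estimate E[xi_t | Z_{t-1}] for xi_t = kappa * (Z_t - lam * Z_{t-1})."""
    Z_t = rng.standard_normal((num_samples, Z_prev.size))  # fresh noise draws
    xi = kappa * (Z_t - lam * Z_prev)                      # Z_prev is held fixed
    return xi.mean(axis=0)

rng = np.random.default_rng(0)
Z_prev = np.array([1.0, -2.0, 0.5])
estimate = conditional_mean_of_noise(Z_prev, kappa=1.0, lam=0.5, num_samples=200_000, rng=rng)
drift = -1.0 * 0.5 * Z_prev   # the predicted value -kappa * lam * Z_{t-1}
```

With $\lambda=0$ the estimate concentrates around zero, while for $\lambda>0$ it concentrates around `drift`, which is the first-order term that conditional centering cannot remove.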

The second difficulty is specific to KANs\. The empirical loss is non\-convex in the first\-layer spline coefficients, and the comparator inequality that converts the first\-order term into empirical loss is valid only when the iterate and the comparator stay in a localized region around the initialization\. Therefore, even after the correlated\-noise drift is handled through a shifted dynamics, the proof must still certify that the auxiliary trajectory remains in the region where the local KAN curvature and self\-boundedness estimates can be applied\.

### 4.2 Main Theorem

To overcome these difficulties, we introduce a proof approach that bypasses the projected dynamics through an auxiliary unprojected trajectory, a shifted iterate that absorbs the current Gaussian perturbation, and a high\-probability bootstrap showing that the projection is inactive\.

We first state the main optimization theorem, followed by a proof sketch in Section [4.3](https://arxiv.org/html/2605.12648#S4.SS3). Let $C_{\sigma,b}>0$ be a constant depending only on $\sigma$ and $b$, and let $G_{\delta}=B_{\sigma}^{\prime}B_{b}B_{b}^{\prime}p\big(4\sqrt{p}+2\sqrt{\log(2/\delta)/m}\big)$. As a byproduct, our analysis suggests the clipping scale $C_{\rm clip}\asymp G_{\delta}$, which is sufficient for the desired high-probability bounds.

###### Theorem 4.1 (Optimization risk bound).

Let $\delta\in(0,1)$, $0<\lambda<1$, and $\kappa>0$. Suppose Assumptions [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3) hold. Set $R_{*}\asymp(\log(T)+\sqrt{\log(n/\delta)})/\gamma$. Let $\{\mathbf{W}_{t}\}_{t=0}^{T-1}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\eta\leq\frac{1}{12C_{\sigma,b}p^{3}}$ and $C_{\rm clip}\asymp G_{\delta}$ chosen so that $C_{\rm clip}\geq G_{\delta}$. Assume $\eta\gamma\sqrt{T}\big(\frac{1}{\sqrt{B}}+\frac{\kappa(1-\lambda)}{B}\big)+\eta^{2}\gamma^{2}\big(\frac{T}{B}+1\big)\lesssim 1$ and

$$\widetilde{\Omega}(\gamma^{-4})\leq m\leq\widetilde{\mathcal{O}}\Big(B^{2}(\eta^{2}\kappa^{2}\gamma^{2}d)^{-1}\min\big\{1,\big(T((1-\lambda)^{2}+\lambda^{2}\eta)\big)^{-1}\big\}\Big),$$

where $\widetilde{\mathcal{O}}$ and $\widetilde{\Omega}$ suppress polylogarithmic factors in $m,n,T,\delta^{-1}$. Then, with probability at least $1-\delta$ over the initialization and the algorithmic randomness,

$$\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim{}&\frac{1}{\gamma^{2}\eta T}+\frac{1}{\gamma\sqrt{BT}}+\eta\Big(\frac{1}{B}+\frac{1}{T}\Big)\\
&+\frac{(1-\lambda)\kappa}{B\sqrt{T}}\Big(\frac{1}{\gamma}+\frac{\eta\sqrt{md}}{\sqrt{B}}\Big)+\Big((1-\lambda)^{2}+\lambda^{2}\eta\Big)\frac{\eta\kappa^{2}md}{B^{2}}=:A_{\rm corr}.
\end{aligned}$$

Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1) identifies an admissible width range for private KAN training: $m$ must be large enough for the local KAN curvature argument, but not so large that the accumulated private noise breaks localization. Corollary [5.3](https://arxiv.org/html/2605.12648#S5.Thmtheorem3) shows that this range is nonempty in the representative regime.
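To get a feel for how the terms of $A_{\rm corr}$ trade off, the bound can be evaluated numerically. The sketch below is only illustrative: it sets every constant hidden by $\lesssim$ to one, and the function and argument names (`num_steps` for $T$, `batch` for $B$, `width` for $m$, `dim` for $d$) are our own labels rather than the paper's.

```python
import math

def a_corr(num_steps, batch, eta, gamma, kappa, lam, width, dim):
    """Right-hand side A_corr of Theorem 4.1 with all hidden constants set to 1 (an assumption)."""
    # Non-private part: optimization error plus mini-batch fluctuation terms.
    non_private = (
        1.0 / (gamma**2 * eta * num_steps)
        + 1.0 / (gamma * math.sqrt(batch * num_steps))
        + eta * (1.0 / batch + 1.0 / num_steps)
    )
    # Term linear in kappa, scaled by the residual factor (1 - lam).
    noise_linear = ((1.0 - lam) * kappa / (batch * math.sqrt(num_steps))) * (
        1.0 / gamma + eta * math.sqrt(width * dim) / math.sqrt(batch)
    )
    # Term quadratic in kappa, scaled by (1 - lam)^2 + lam^2 * eta.
    noise_quadratic = ((1.0 - lam) ** 2 + lam**2 * eta) * eta * kappa**2 * width * dim / batch**2
    return non_private + noise_linear + noise_quadratic

# Illustrative values only: for a fixed kappa, stronger correlation shrinks the noise terms.
settings = dict(num_steps=1000, batch=100, eta=0.1, gamma=0.5, kappa=2.0, width=64, dim=10)
weak = a_corr(lam=0.1, **settings)
strong = a_corr(lam=0.9, **settings)
print(f"A_corr(lam=0.1) = {weak:.4f}, A_corr(lam=0.9) = {strong:.4f}")
```

The comparison holds $\kappa$ fixed; once $\kappa$ is calibrated for privacy it depends on $\lambda$ as well, so this sketch says nothing about the final private rate.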

### 4.3 Proof Sketch of Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1)

We outline the main steps and defer the full proofs to Appendix [C.1](https://arxiv.org/html/2605.12648#A3.SS1). The key idea is to combine the cancellation structure of the correlated perturbations with the localization argument for KANs. The proof proceeds in the following six steps.

###### Proof Sketch\.

Step 1: Reduce to an auxiliary unprojected dynamics. The correlated perturbations break the conditional-centering argument, and the projection further obstructs the cross-iteration cancellation structure of the noise. We therefore first introduce an auxiliary unprojected trajectory $\{\widetilde{\mathbf{W}}_{t}\}_{t=0}^{T-1}$, driven by the same mini-batches and the same Gaussian variables:

$$\widetilde{\mathbf{W}}_{t}=\widetilde{\mathbf{W}}_{t-1}-\eta\big(\nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1})+\widetilde{\Delta}_{t}+c_{\rm priv}(Z_{t}-\lambda Z_{t-1})\big)\quad\text{with}\quad c_{\rm priv}=C_{\rm clip}\kappa/B,$$

where $\widetilde{\Delta}_{t}=\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}\nabla\ell\big(y_{i}f_{\widetilde{\mathbf{W}}_{t-1}}(\mathbf{x}_{i})\big)-\nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1})$ denotes the mini-batch fluctuation around the empirical gradient at $\widetilde{\mathbf{W}}_{t-1}$. On the initialization event, the gradient is uniformly bounded by $G_{\delta}$. Hence, under $C_{\rm clip}\geq G_{\delta}$, clipping is inactive, so the clipped and unclipped gradients coincide. At this stage, $\mathbf{W}^{*}$ is arbitrary. The goal is then to prove a high-probability estimate of the form

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t})\lesssim\mathcal{L}_{S}(\mathbf{W}^{*})+\frac{\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2}}{\eta T}+\text{stochastic fluctuation terms}.$$

Step 2: Introduce the shifted iterate. To obtain the high-probability estimate above, one would naturally try to apply a comparator recursion to the auxiliary trajectory. However, as discussed in Section [4](https://arxiv.org/html/2605.12648#S4), the correlated perturbation is not conditionally centered, creating a first-order drift term.

To handle this obstruction, we define the shifted iterate

$$\mathbf{U}_{t}=\widetilde{\mathbf{W}}_{t}+\eta c_{\rm priv}Z_{t}.$$

Then the auxiliary update admits the exact reformulation

$$\mathbf{U}_{t}=\mathbf{U}_{t-1}-\eta\big(\nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1})+\widetilde{\Delta}_{t}+(1-\lambda)c_{\rm priv}Z_{t-1}\big).$$

This identity is the key cancellation step. The leading correlated perturbation is reduced to a residual term proportional to $1-\lambda$, which is the source of the improved noise dependence in the final bound.
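The reformulation in Step 2 is a pathwise algebraic identity, so it can be verified on any toy problem. The sketch below runs the auxiliary recursion and the shifted recursion side by side and reports the largest discrepancy between $\mathbf{U}_{t}$ and $\widetilde{\mathbf{W}}_{t}+\eta c_{\rm priv}Z_{t}$. The quadratic toy loss, the Gaussian stand-in for the mini-batch fluctuation $\widetilde{\Delta}_{t}$, and all names and parameter values are assumptions made for illustration; none of them come from the KAN setting of the paper.

```python
import numpy as np

def max_shift_identity_gap(num_steps=50, dim=4, eta=0.1, c_priv=0.3, lam=0.7, seed=0):
    """Largest value of ||U_t - (W_t + eta * c_priv * Z_t)||_2 along a toy trajectory."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal(dim)
    z = rng.standard_normal((num_steps + 1, dim))             # Z_0, ..., Z_T
    delta = 0.05 * rng.standard_normal((num_steps + 1, dim))  # stand-in for mini-batch fluctuations

    def grad(w):
        # Gradient of the toy loss 0.5 * ||w - target||^2 (an assumption, not the KAN loss).
        return w - target

    w = np.zeros(dim)                # auxiliary (unprojected) iterate
    u = w + eta * c_priv * z[0]      # shifted iterate U_0 = W_0 + eta * c_priv * Z_0
    max_gap = 0.0
    for t in range(1, num_steps + 1):
        g = grad(w)                  # gradient at the auxiliary iterate from step t-1
        # Shifted recursion: only the residual (1 - lam) * c_priv * Z_{t-1} appears.
        u = u - eta * (g + delta[t] + (1.0 - lam) * c_priv * z[t - 1])
        # Auxiliary recursion with the correlated perturbation Z_t - lam * Z_{t-1}.
        w = w - eta * (g + delta[t] + c_priv * (z[t] - lam * z[t - 1]))
        max_gap = max(max_gap, float(np.linalg.norm(u - (w + eta * c_priv * z[t]))))
    return max_gap

gap = max_shift_identity_gap()
print(f"largest discrepancy: {gap:.2e}")  # zero up to floating-point rounding
```

Note that the shifted recursion is evaluated with the gradient at the auxiliary iterate, not at $\mathbf{U}_{t-1}$; this mismatch is exactly what Step 3 has to account for.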

Step 3: Derive a shifted potential recursion using the local KAN geometry. We apply the squared-distance recursion to the shifted potential $\|\mathbf{U}_{t}-\mathbf{W}^{*}\|_{2}^{2}$. Let $g_{t}=\nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1})$. Using the shifted update from Step 2 and $\mathbf{U}_{t-1}-\mathbf{W}^{*}=\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}+\eta c_{\rm priv}Z_{t-1}$, we obtain

$$\begin{aligned}
\|\mathbf{U}_{t}-\mathbf{W}^{*}\|_{2}^{2}={}&\|\mathbf{U}_{t-1}-\mathbf{W}^{*}\|_{2}^{2}-2\eta\big\langle g_{t},\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\big\rangle-2\eta\big\langle\widetilde{\Delta}_{t},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle\\
&-2(1-\lambda)\eta c_{\rm priv}\big\langle Z_{t-1},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle+\eta^{2}\|g_{t}+\widetilde{\Delta}_{t}\|_{2}^{2}+(1-\lambda)^{2}\eta^{2}c_{\rm priv}^{2}\|Z_{t-1}\|_{2}^{2}\\
&+2(1-\lambda)\eta^{2}c_{\rm priv}\big\langle\widetilde{\Delta}_{t},Z_{t-1}\big\rangle-2\lambda\eta^{2}c_{\rm priv}\big\langle g_{t},Z_{t-1}\big\rangle.
\end{aligned}$$

This identity is the main pathwise recursion. The key step is to estimate the term $-2\eta\langle g_{t},\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\rangle$ using the local Hessian lower bound for the KAN empirical loss, and to control the gradient part of $\eta^{2}\|g_{t}+\widetilde{\Delta}_{t}\|_{2}^{2}$ using the self-bounding estimate.
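For readers who want to see where the last two cross terms of the recursion come from, here is a short derivation based only on the definitions of Steps 2 and 3. Expanding the square of $\mathbf{U}_{t}-\mathbf{W}^{*}=(\mathbf{U}_{t-1}-\mathbf{W}^{*})-\eta\big(g_{t}+\widetilde{\Delta}_{t}+(1-\lambda)c_{\rm priv}Z_{t-1}\big)$ produces the inner product of $g_{t}$ with $\mathbf{U}_{t-1}-\mathbf{W}^{*}$ and the cross term between $g_{t}+\widetilde{\Delta}_{t}$ and $Z_{t-1}$. Rewriting the former in terms of $\widetilde{\mathbf{W}}_{t-1}$ and collecting the $\langle g_{t},Z_{t-1}\rangle$ contributions gives:

```latex
\begin{aligned}
-2\eta\big\langle g_{t},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle
  &= -2\eta\big\langle g_{t},\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\big\rangle
     -2\eta^{2}c_{\rm priv}\big\langle g_{t},Z_{t-1}\big\rangle,\\
2(1-\lambda)\eta^{2}c_{\rm priv}\big\langle g_{t}+\widetilde{\Delta}_{t},Z_{t-1}\big\rangle
  -2\eta^{2}c_{\rm priv}\big\langle g_{t},Z_{t-1}\big\rangle
  &= 2(1-\lambda)\eta^{2}c_{\rm priv}\big\langle\widetilde{\Delta}_{t},Z_{t-1}\big\rangle
     -2\lambda\eta^{2}c_{\rm priv}\big\langle g_{t},Z_{t-1}\big\rangle.
\end{aligned}
```

The coefficient $-2\lambda$ arises as $2(1-\lambda)-2$, so the $\langle g_{t},Z_{t-1}\rangle$ term is the only place where $\lambda$ itself, rather than $1-\lambda$, enters the recursion.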

Since these local estimates require $\widetilde{\mathbf{W}}_{t-1}$ to stay in a localized region, we introduce the shifted localization event $\mathcal{A}_{\bar{R}}^{U}(t)=\big\{\max_{0\leq s\leq t}\|\mathbf{U}_{s}-\mathbf{W}^{*}\|_{2}\leq\bar{R}\big\}$ for some $\bar{R}>0$. We then define a stopped good event $\mathcal{E}_{\rm good}$ consisting of a Gaussian concentration event, a stopped mini-batch quadratic-variation event, and a stopped potential-fluctuation event. The Gaussian event controls $\max_{t}\|Z_{t}\|_{2}$ and $\sum_{t}\|Z_{t-1}\|_{2}^{2}$, the stopped mini-batch event controls the stopped sum of $\|\widetilde{\Delta}_{t}\|_{2}^{2}$, and the stopped potential-fluctuation event controls the martingale terms in the shifted recursion. The stopping indicators allow these events to be proved with high probability before we know that the whole trajectory remains localized.

On $\mathcal{E}_{\rm good}$ and $\mathcal{A}_{\bar{R}}^{U}(t-1)$, both $\|\mathbf{U}_{t-1}-\mathbf{W}^{*}\|_{2}$ and $\|Z_{t-1}\|_{2}$ are bounded. Together with $\widetilde{\mathbf{W}}_{t-1}=\mathbf{U}_{t-1}-\eta c_{\rm priv}Z_{t-1}$, this implies that $\widetilde{\mathbf{W}}_{t-1}$ remains in the local region where the KAN comparator estimates apply. We then obtain $-2\eta\langle g_{t},\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\rangle\leq-\frac{4\eta}{3}\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1})+\frac{8\eta}{3}\mathcal{L}_{S}(\mathbf{W}^{*})$. On this local region, we also apply the self-bounding estimate to control the part $\|g_{t}\|_{2}^{2}$ of $\eta^{2}\|g_{t}+\widetilde{\Delta}_{t}\|_{2}^{2}$ by $\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1})$. The fluctuation part and the remaining martingale terms are controlled by $\mathcal{E}_{\rm good}$.

Step 4: High-probability control and a stopped optimization bound. The stopped good event $\mathcal{E}_{\rm good}$ is shown to hold with high probability by combining three concentration tools: Gaussian and chi-square concentration for the Gaussian terms, the without-replacement variance bound for the mini-batch fluctuation, and martingale concentration for the stopped potential fluctuation. On this event, all fluctuation terms in the shifted recursion are controlled.

Recall that $\tau_{\gamma}$ is the localization scale induced by the NTK-separability comparator, with $\tau_{\gamma}^{2}\asymp(\log^{2}(T)+\log(n/\delta))/\gamma^{2}$. We keep $\bar{R}$ free at this stage; it will be chosen as $\bar{R}\asymp\tau_{\gamma}$ in Step 6. Summing the stopped recursion over $t=0,\ldots,T-1$ and dividing by $\eta T$ gives a stopped optimization bound. Once the bootstrap in Step 5 shows that the shifted trajectory does not exit the localization ball, this stopped bound becomes

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t})\lesssim\mathcal{L}_{S}(\mathbf{W}^{*})+\frac{\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2}}{\eta T}+\eta\Big(\frac{1}{B}+\frac{1}{T}\Big)+\text{Noise}_{\lambda,\kappa},$$

where $\text{Noise}_{\lambda,\kappa}$ collects the fluctuation terms, including factors proportional to $1-\lambda$.

Step 5: Close the bootstrap and show projection inactivity by induction. We now close the localization bootstrap. On $\mathcal{E}_{\rm good}$, the width and step-size conditions in Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1) imply the self-consistency condition for the shifted potential, so the shifted process cannot exit the localization ball. Hence $\mathcal{A}_{\bar{R}}^{U}(T)$ holds, or equivalently $\sup_{0\leq t\leq T}\|\mathbf{U}_{t}-\mathbf{W}^{*}\|_{2}\lesssim\tau_{\gamma}$.

It remains to transfer this bound from $\mathbf{U}_{t}$ to $\widetilde{\mathbf{W}}_{t}$. Since $\widetilde{\mathbf{W}}_{t}=\mathbf{U}_{t}-\eta c_{\rm priv}Z_{t}$, the upper bound on $m$ and the Gaussian maximum-norm bound imply $\eta c_{\rm priv}\|Z_{t}\|_{2}\lesssim\tau_{\gamma}$ uniformly for $t\leq T$. Therefore $\sup_{0\leq t\leq T}\|\widetilde{\mathbf{W}}_{t}-\mathbf{W}_{0}\|_{2}\lesssim\tau_{\gamma}$. Choosing $R_{*}=C_{R}\tau_{\gamma}$ with $C_{R}$ sufficiently large shows that the auxiliary trajectory remains inside the projection ball. Thus projection is inactive, and $\mathbf{W}_{t}=\widetilde{\mathbf{W}}_{t}$ for all $t=0,\ldots,T$ on the good event.

Step 6: Choose the comparator using initialization separability. The previous steps hold for an arbitrary comparator $\mathbf{W}^{*}$. We now invoke Assumption [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3) to choose it. The comparator construction gives $\mathbf{W}^{*}=\mathbf{W}_{0}+\tau_{\rm ntk}\mathbf{u}$ with $\tau_{\rm ntk}^{2}\lesssim(\log^{2}(T)+\log(n/\delta))/\gamma^{2}$ such that, with high probability, $\mathcal{L}_{S}(\mathbf{W}^{*})\leq 1/T$. Recall that $\tau_{\gamma}$ is the localization scale chosen in the theorem, with $\tau_{\gamma}^{2}\asymp(\log^{2}(T)+\log(n/\delta))/\gamma^{2}$. By choosing the constant in $\tau_{\gamma}$ sufficiently large, we have $\tau_{\rm ntk}\leq\tau_{\gamma}$, and hence $\|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2}^{2}\leq\tau_{\gamma}^{2}$.

Substituting this comparator into the optimization bound from Step 4, and using the projection-inactivity result $\mathbf{W}_{t}=\widetilde{\mathbf{W}}_{t}$ from Step 5, gives $\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim A_{\rm corr}$, where $A_{\rm corr}$ is the upper bound stated in Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1). A union bound over the initialization event, the comparator event, the Gaussian concentration event, the stopped mini-batch event, and the stopped potential-fluctuation event gives probability at least $1-\delta$. ∎

## 5 Population Risk Bounds

This section establishes population risk bounds for the iterates of Algorithm [1](https://arxiv.org/html/2605.12648#alg1). Section [5.1](https://arxiv.org/html/2605.12648#S5.SS1) treats the headline correlated-noise DP-SGD case, combining the optimization bound from Section [4](https://arxiv.org/html/2605.12648#S4) with a stability-based generalization argument. Section [5.2](https://arxiv.org/html/2605.12648#S5.SS2) specializes to independent-noise DP-SGD ($\lambda=0$), where a substantially simpler optimization analysis applies; we state the resulting rate. Section [5.3](https://arxiv.org/html/2605.12648#S5.SS3) further specializes to non-private mini-batch SGD ($\kappa=0$).

### 5.1 Population Risk of DP-SGD with Correlated Noise

We first state a privacy guarantee for Algorithm [1](https://arxiv.org/html/2605.12648#alg1) in the correlated-noise regime. The displayed choice of $\kappa$ is a conservative closed-form calibration used for the subsequent risk analysis. Since our focus is the learning-theoretic analysis rather than sharper privacy accounting, we do not optimize this calibration. Tighter accountants may yield smaller noise multipliers. The proof is in Appendix [F](https://arxiv.org/html/2605.12648#A6).

###### Theorem 5.1 (Privacy guarantee).

If $\lambda>0$ and

$$\kappa^{2}\asymp\Big(\frac{1-\lambda^{T}}{1-\lambda}\Big)^{2}\cdot\Big(\frac{B}{n}T+\Big(\frac{B}{n}T\log\Big(\frac{1}{\delta}\Big)\Big)^{1/2}\Big)\log\Big(\frac{1}{\delta}\Big)\epsilon^{-2},$$

then Algorithm [1](https://arxiv.org/html/2605.12648#alg1) satisfies $(\epsilon,\delta)$-DP.
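As a rough illustration of how the calibration in Theorem 5.1 scales, the sketch below evaluates its right-hand side. The relation $\asymp$ hides an absolute constant that the statement does not specify; the `constant` argument, set to one by default, is therefore a placeholder, and the returned value should not be read as a noise multiplier that certifies $(\epsilon,\delta)$-DP. Function and argument names are our own.

```python
import math

def correlated_kappa_squared(lam, num_steps, batch, n, eps, delta, constant=1.0):
    """Order-of-magnitude value of kappa^2 from Theorem 5.1 (hidden constant is an assumption)."""
    if not 0.0 < lam < 1.0:
        raise ValueError("this sketch covers the correlated regime 0 < lam < 1 only")
    geometric = (1.0 - lam**num_steps) / (1.0 - lam)   # (1 - lam^T) / (1 - lam)
    rate = batch * num_steps / n                        # (B / n) * T
    log_term = math.log(1.0 / delta)
    return constant * geometric**2 * (rate + math.sqrt(rate * log_term)) * log_term / eps**2

# Illustrative values: the calibrated noise scale grows as lam approaches 1.
for lam in (0.1, 0.5, 0.9):
    k2 = correlated_kappa_squared(lam, num_steps=200, batch=500, n=10_000, eps=1.0, delta=1e-5)
    print(f"lam={lam}: kappa^2 ~ {k2:.1f}")
```

The prefactor $\big(\frac{1-\lambda^{T}}{1-\lambda}\big)^{2}$ is what later offsets the $(1-\lambda)$ gain in the optimization bound.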

The population risk bounds of DP-SGD with $\lambda>0$ are given as follows. The detailed proof can be found in Appendix [C.2](https://arxiv.org/html/2605.12648#A3.SS2). Denote the algorithmic randomness by $\mathcal{A}:=\{\mathcal{B}_{1},\ldots,\mathcal{B}_{T-1},Z_{1},\ldots,Z_{T-1}\}$. We suppress logarithmic factors in the displayed bound below.

###### Theorem 5.2 (Population risk bound).

Let $\delta\in(0,1)$. Under the assumptions of Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1), let $\{\mathbf{W}_{t}\}_{t=0}^{T-1}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\lambda>0$, $\eta\leq\frac{1}{12C_{\sigma,b}p^{3}}$, and $C_{\rm clip}\asymp G_{\delta}$ chosen so that $C_{\rm clip}\geq G_{\delta}$. If

$$m\gtrsim R_{*}^{2}\log\Big(\frac{m}{\delta}\Big)\eta^{2}\Big(TA_{\mathrm{corr}}+\Big(\log\Big(\frac{n}{\delta}\Big)+R_{*}\Big)\Big(\Big(\frac{T\log(1/\delta)}{B}\Big)^{\frac{1}{2}}+\log\Big(\frac{1}{\delta}\Big)\Big)\Big)^{2},$$

then with probability at least $1-\delta$ over the initialization,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\Bigl(1+\frac{\eta T}{n}\Bigr)\Big(A_{\mathrm{corr}}+\big(\sqrt{m}+R_{*}\big)\delta\Big).$$

We present one representative parameter regime that exhibits the optimal dependence on $n$, $d$, and $\epsilon$, up to logarithmic factors. The choice $B=\rho n$ is made for a clean rate statement rather than as an optimized batch-size recommendation. Other parameter choices are covered by Theorem [5.2](https://arxiv.org/html/2605.12648#S5.Thmtheorem2).

###### Corollary 5.3 (Risk rates under representative parameter regime).

Suppose Assumptions [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3) hold. Let $\{\mathbf{W}_{t}\}_{t=0}^{T-1}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\eta\asymp 1$ and $C_{\rm clip}\asymp G_{\delta}$. Assume that $\lambda\in(0,1)$ is a fixed constant bounded away from $1$, and let $0<\rho<1$ be a fixed constant. If $m\asymp\frac{\mathrm{polylog}(n/\delta)}{\gamma^{6}}$, $\delta\leq\min\bigl\{\frac{1}{n\sqrt{m}},\,\frac{\gamma}{n}\bigr\}$, $B=\rho n$, and $T\asymp\min\big\{\frac{\sqrt{n}}{\gamma},\frac{n\epsilon\gamma^{2}}{\sqrt{d}}\big\}$, then with probability at least $1-\delta$ over the initialization and the algorithmic randomness,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{4}n\epsilon}.$$

Moreover, with probability at least $1-\delta$ over the initialization,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{4}n\epsilon}.$$
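The representative regime of Corollary 5.3 can be instantiated numerically to see which of the two rate terms dominates. The sketch below sets all constants hidden by $\asymp$ and $\lesssim$ to one and ignores logarithmic factors, so its outputs indicate orders of magnitude only. It also reports the leading optimization term $\frac{1}{\gamma^{2}\eta T}$ of $A_{\rm corr}$ with $\eta=1$, which under this choice of $T$ equals the larger of the two rate terms. The function name, the default `rho`, and the example values are assumptions for illustration.

```python
import math

def representative_regime(n, dim, eps, gamma, rho=0.5):
    """Batch size, iteration count, and rate of Corollary 5.3 with hidden constants set to 1."""
    batch = rho * n
    # T is the smaller of the statistical and the privacy-limited horizons.
    steps = min(math.sqrt(n) / gamma, n * eps * gamma**2 / math.sqrt(dim))
    statistical = 1.0 / (gamma * math.sqrt(n))
    privacy = math.sqrt(dim) / (gamma**4 * n * eps)
    # Leading term 1 / (gamma^2 * eta * T) of A_corr with eta = 1.
    leading = 1.0 / (gamma**2 * steps)
    return {"batch": batch, "steps": steps, "statistical": statistical,
            "privacy": privacy, "rate": statistical + privacy, "leading": leading}

regime = representative_regime(n=10_000, dim=100, eps=1.0, gamma=0.5)
for name, value in regime.items():
    print(f"{name}: {value:.4g}")
```

With these example values the statistical horizon $\sqrt{n}/\gamma$ is the smaller one, so the non-private term dominates the rate.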

#### Interpretation of the correlated\-noise rate\.

Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1) shows that, for a fixed $\kappa$, temporal correlation improves $A_{\mathrm{corr}}$ through factors such as $1-\lambda$. Since the stability-based population bound in Theorem [5.2](https://arxiv.org/html/2605.12648#S5.Thmtheorem2) is driven by $A_{\mathrm{corr}}$, this structure propagates to the population risk guarantee. In Corollary [5.3](https://arxiv.org/html/2605.12648#S5.Thmtheorem3), however, the conservative closed-form calibration makes the $\lambda$-dependence of $\kappa$ offset this effect, so the displayed rate matches the independent-noise rate asymptotically. Sharper privacy calibration may lead to improved $\lambda$-dependent risk bounds, and we leave this as an open question.
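The offset can be made explicit by multiplying the calibration of Theorem 5.1 by $(1-\lambda)^{2}$, which is the combination that appears in the noise terms of $A_{\mathrm{corr}}$. The following short computation uses only the statements of Theorems 4.1 and 5.1 and the elementary bound $1-\lambda\leq 1-\lambda^{T}\leq 1$ for $0<\lambda<1$ and $T\geq 1$:

```latex
(1-\lambda)^{2}\kappa^{2}
  \asymp (1-\lambda^{T})^{2}
         \Big(\frac{B}{n}T+\Big(\frac{B}{n}T\log\Big(\frac{1}{\delta}\Big)\Big)^{1/2}\Big)
         \log\Big(\frac{1}{\delta}\Big)\epsilon^{-2},
\qquad
(1-\lambda)^{2}\leq(1-\lambda^{T})^{2}\leq 1 .
```

For a fixed $\lambda$ bounded away from $1$, the factor $(1-\lambda^{T})^{2}$ is therefore of constant order, so the effective noise level $(1-\lambda)\kappa$ no longer decreases with $\lambda$ under this calibration.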

#### Comparison\.

The closest work, Koloskova et al. ([2023](https://arxiv.org/html/2605.12648#bib.bib745)), analyzes linearly correlated noise for GD in smooth non-convex optimization, but only at the stationarity level through average gradient norm bounds. In contrast, Corollary [5.3](https://arxiv.org/html/2605.12648#S5.Thmtheorem3) gives both optimization and population guarantees for correlated-noise DP-SGD, yielding the first learning-theoretic risk analysis of such a mechanism in a non-convex neural network setting.

### 5.2 DP-SGD with Independent Noise ($\lambda=0$ special case)

We state the privacy guarantee and the resulting population risk rate for standard DP-SGD with independent noise, i.e., Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\lambda=0$. The full optimization and generalization analyses are simpler than in the correlated-noise case and are deferred to Appendix [D](https://arxiv.org/html/2605.12648#A4).

###### Theorem 5.4 (Privacy guarantee).

If $\lambda=0$ and $\kappa^{2}\asymp\frac{B^{2}T\log(1/\delta)}{n^{2}\epsilon^{2}}$, then Algorithm [1](https://arxiv.org/html/2605.12648#alg1) satisfies $(\epsilon,\delta)$-DP.

The result below shows that independent-noise DP-SGD attains the rate $\mathcal{O}\big(n^{-1/2}+\sqrt{d}/(n\epsilon)\big)$.

###### Corollary 5.5 (Risk rates under representative parameter regime).

Suppose Assumptions [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3) hold. Assume $\delta\leq\frac{\gamma}{n}$. Let $\{\mathbf{W}_{t}\}_{t=0}^{T-1}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\eta\leq\frac{1}{12C_{\sigma,b}p^{3}}$ and $C_{\mathrm{clip}}\geq G_{\delta}$. If $m\asymp\frac{\mathrm{polylog}(n/\delta)}{\gamma^{4}}$, $B\asymp\gamma\sqrt{n}$, $\eta\asymp\min\big\{1,\frac{\gamma^{2}\sqrt{n}\,\epsilon}{\sqrt{d}}\big\}$, and $T\asymp\frac{\sqrt{n}}{\gamma}$, then with probability at least $1-\delta$ over the initialization and the algorithmic randomness,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}.$$

Moreover, with probability at least $1-\delta$ over the initialization,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}.$$

Comparison. Specializing the more general independent-noise bound in Appendix [D.3](https://arxiv.org/html/2605.12648#A4.SS3) to $B=n$ yields the full-batch DP-GD population risk $\mathcal{O}\big(\frac{\sqrt{d}}{\gamma^{3}n\epsilon}\big)$ under $m\asymp\frac{\mathrm{polylog}(n/\delta)}{\gamma^{4}}$. This improves the full-batch DP-GD result of Wang et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib19)), which obtains the rate $\mathcal{O}\big(\frac{\sqrt{d}}{\gamma^{4}n\epsilon}\big)$ under the stronger width condition $m\asymp\frac{\mathrm{polylog}(n/\delta)}{\gamma^{6}}$. Hence, in this full-batch DP specialization, our bound reduces the privacy-dependent loss by one power of $\gamma$, while requiring a smaller width.

### 5.3 Non-private Mini-batch SGD ($\kappa=0$ special case)

We turn to the non-private setting, recovered from Section [5.2](https://arxiv.org/html/2605.12648#S5.SS2) by setting $\kappa=0$. The general theorem is deferred to Appendix [E](https://arxiv.org/html/2605.12648#A5); we present only the resulting optimization and population risk rates.

###### Corollary 5.6 (Risk rates under representative parameter regime).

Suppose Assumptions [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3) hold. Let $\{\mathbf{W}_{t}\}_{t=0}^{T-1}$ be generated by mini-batch SGD with $\eta\leq\frac{1}{12C_{\sigma,b}p^{3}}$ and $C_{\mathrm{clip}}\geq G_{\delta}$. Assume $\delta\leq\frac{1}{\gamma n}$. If $m\gtrsim\frac{\mathrm{polylog}(n/\delta)}{\gamma^{4}}$, $B\lesssim\gamma\sqrt{n}$, $\eta\asymp\frac{B}{n}$, and $T\gtrsim\frac{n^{2}}{\gamma^{2}B}$, then with probability at least $1-\delta$ over the initialization and the algorithmic randomness,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{1}{n}.$$

Moreover, if $T\asymp\frac{n^{2}}{\gamma^{2}B}$, then with probability at least $1-\delta$ over the initialization,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\frac{1}{\gamma^{2}n}.$$

Comparison. For non-private KANs, the closest work is Wang et al. ([2026](https://arxiv.org/html/2605.12648#bib.bib19)), which proves a population risk bound of order $\mathcal{O}\big(\frac{1}{\gamma^{4}n}\big)$ for full-batch GD under the width condition $m\gtrsim\frac{\mathrm{polylog}(n/\delta)}{\gamma^{6}}$. Specializing the general non-private bound in Appendix [E](https://arxiv.org/html/2605.12648#A5) to $B=n$ under $m\gtrsim\frac{\mathrm{polylog}(n/\delta)}{\gamma^{4}}$ yields the sharper rate $\mathcal{O}\big(\frac{1}{\gamma^{2}n}\big)$. Thus, in the fixed-second-layer setting studied here, this specialization improves both the required width and the dependence on the NTK margin $\gamma$. Other works (Gao and Tan, [2025](https://arxiv.org/html/2605.12648#bib.bib45); Eshtehardian et al., [2026](https://arxiv.org/html/2605.12648#bib.bib13)) provide optimization analyses of GD/SGD for two-layer KANs in regression under polynomial-width conditions.

## 6 Conclusion and Limitations

We established population risk bounds for two-layer KANs trained by clipped mini-batch SGD, covering both non-private and DP settings. The results include explicit width regimes and cover non-private SGD, independent-noise DP-SGD, and full-batch training as special cases. Our analysis route for correlated-noise DP training in the non-convex regime may be useful beyond KANs. The main limitation is the conservative closed-form privacy calibration. Under this calibration, the $\lambda$-dependent noise scale offsets the variance-reduction effect in the optimization bound, so the final rate matches the independent-noise rate asymptotically (Section [5.1](https://arxiv.org/html/2605.12648#S5.SS1)). Whether sharper privacy accounting can turn this structure into improved final risk rates remains open, along with extensions to deeper architectures, weaker smoothness assumptions, and further experiments.

## Acknowledgment

Part of this work was conducted within the DFG SPP 2298 \(ID 464252197\)\. PW acknowledges support by the Alexander\-von\-Humboldt Foundation through a Humboldt Research Fellowship\. MK and SF acknowledge support by the DFG through FOR 5359 \(ID 459419731\), TRR 375 \(ID 511263698\), and SPP 2331 \(ID 441958259, 553345933, 466468799\), by the Carl\-Zeiss Foundation through the initiative AI\-Care, and by the BMFTR award 01IS24071A\. NK is supported in part by the Austrian Science Fund \(FWF\) \[10\.55776/COE12\]\.

## References

- A convergence theory for deep learning via over\-parameterization\.InInternational Conference on Machine Learning,pp\. 242–252\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p2.1)\.
- J\. D\. Andersson and R\. Pagh \(2023\)A smooth binary mechanism for efficient private continual observation\.Advances in Neural Information Processing Systems36,pp\. 49133–49145\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1)\.
- M\. S\. M\. S\. Annamalai, B\. Balle, J\. Hayes, G\. Kaissis, and E\. De Cristofaro \(2025\)The hitchhiker’s guide to efficient, end\-to\-end, and tight dp auditing\.arXiv preprint arXiv:2506\.16666\.Cited by:[§F\.1](https://arxiv.org/html/2605.12648#A6.SS1.p2.5)\.
- S\. Arora, S\. Du, W\. Hu, Z\. Li, and R\. Wang \(2019\)Fine\-grained analysis of optimization and generalization for overparameterized two\-layer neural networks\.InInternational Conference on Machine Learning,pp\. 322–332\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p2.1),[§3\.3](https://arxiv.org/html/2605.12648#S3.SS3.p2.1)\.
- G\. Barthe and F\. Olmedo \(2013\)Beyond differential privacy: composition theorems and relational logic for f\-divergences between probabilistic programs\.InInternational Colloquium on Automata, Languages, and Programming,pp\. 49–60\.Cited by:[§F\.1](https://arxiv.org/html/2605.12648#A6.SS1.p6.5)\.
- P\. L\. Bartlett, D\. J\. Foster, and M\. J\. Telgarsky \(2017\)Spectrally\-normalized margin bounds for neural networks\.InAdvances in Neural Information Processing Systems,Vol\.30\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p3.1)\.
- R\. Bassily, V\. Feldman, K\. Talwar, and A\. Guha Thakurta \(2019\)Private stochastic convex optimization with optimal rates\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[2nd item](https://arxiv.org/html/2605.12648#S1.I1.i2.p1.1)\.
- Y\. Cao and Q\. Gu \(2019\)Generalization bounds of stochastic gradient descent for wide and deep neural networks\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p2.1)\.
- Z\. Chen, Y\. Cao, D\. Zou, and Q\. Gu \(2021\)How much over\-parameterization is sufficient to learn deep ReLU networks?\.InInternational Conference on Learning Representations,Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p2.1)\.
- O\. Cherednichenko and M\. Poptsova \(2025\)Kolmogorov–arnold networks for genomic tasks\.Briefings in Bioinformatics26\(2\),pp\. bbaf129\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p1.1)\.
- C\. A\. Choquette\-Choo, K\. D\. Dvijotham, K\. Pillutla, A\. Ganesh, T\. Steinke, and A\. G\. Thakurta \(2024a\)Correlated noise provably beats independent noise for differentially private learning\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1),[§2](https://arxiv.org/html/2605.12648#S2.SS0.SSS0.Px2.p1.1)\.
- C\. A\. Choquette\-Choo, A\. Ganesh, S\. Haque, T\. Steinke, and A\. Thakurta \(2024b\)Near exact privacy amplification for matrix mechanisms\.arXiv preprint arXiv:2410\.06266\.Cited by:[§A\.2](https://arxiv.org/html/2605.12648#A1.SS2.p1.1),[§F\.2](https://arxiv.org/html/2605.12648#A6.SS2.5.p2.13),[§F\.2](https://arxiv.org/html/2605.12648#A6.SS2.p3.1),[§F\.2](https://arxiv.org/html/2605.12648#A6.SS2.p4.1),[Lemma F\.7](https://arxiv.org/html/2605.12648#A6.Thmtheorem7)\.
- C\. A\. Choquette\-Choo, A\. Ganesh, R\. McKenna, H\. B\. McMahan, J\. Rush, A\. Guha Thakurta, and Z\. Xu \(2023a\)\(Amplified\) banded matrix factorization: a unified approach to private training\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 74856–74889\.External Links:[Link](https://openreview.net/forum?id=zEm6hF97Pz)Cited by:[§A\.2](https://arxiv.org/html/2605.12648#A1.SS2.p1.1),[§1](https://arxiv.org/html/2605.12648#S1.p5.1)\.
- C\. A\. Choquette\-Choo, A\. Ganesh, T\. Steinke, and A\. G\. Thakurta \(2024c\)Privacy amplification for matrix mechanisms\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=xUzWmFdglP)Cited by:[§A\.2](https://arxiv.org/html/2605.12648#A1.SS2.p1.1),[§F\.2](https://arxiv.org/html/2605.12648#A6.SS2.p3.1),[§F\.2](https://arxiv.org/html/2605.12648#A6.SS2.p4.1)\.
- C\. A\. Choquette\-Choo, H\. B\. McMahan, K\. Rush, and A\. Thakurta \(2023b\)Multi\-epoch matrix factorization mechanisms for private machine learning\.InInternational Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1)\.
- S\. Denisov, H\. B\. McMahan, J\. Rush, A\. Smith, and A\. Guha Thakurta \(2022\)Improved differential privacy for SGD via optimal private linear operators on adaptive streams\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 5910–5924\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1),[§2](https://arxiv.org/html/2605.12648#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Ding, M\. Lei, S\. Fu, S\. Wang, D\. Wang, and J\. Xu \(2025\)Understanding private learning from feature perspective\.arXiv preprint arXiv:2511\.18006\.Cited by:[§2](https://arxiv.org/html/2605.12648#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Dwork, F\. McSherry, K\. Nissim, and A\. Smith \(2006\)Calibrating noise to sensitivity in private data analysis\.InTheory of Cryptography,S\. Halevi and T\. Rabin \(Eds\.\),Berlin, Heidelberg,pp\. 265–284\.External Links:ISBN 978\-3\-540\-32732\-5Cited by:[Definition 3\.1](https://arxiv.org/html/2605.12648#S3.Thmtheorem1)\.
- C\. Dwork \(2006\)Differential privacy\.InInternational colloquium on automata, languages, and programming,Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p4.1)\.
- S\. M\. Eshtehardian, M\. H\. Yassaee, and B\. Khalaj \(2026\)On the convergence of two\-layer kolmogorov\-arnold networks with first\-layer training\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.12648#S2.p1.3),[§5\.3](https://arxiv.org/html/2605.12648#S5.SS3.p2.6)\.
- H\. Fichtenberger, M\. Henzinger, and J\. Upadhyay \(2023\)Constant matters: fine\-grained error bound on differentially private continual observation\.InInternational Conference on Machine Learning,pp\. 10072–10092\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1)\.
- S\. Frei, N\. S\. Chatterji, and P\. L\. Bartlett \(2023\)Random feature amplification: feature learning and generalization in neural networks\.Journal of Machine Learning Research24\(303\),pp\. 1–49\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p3.1)\.
- Y\. Gao and V\. Y\. Tan \(2025\)On the convergence of \(stochastic\) gradient descent for Kolmogorov–Arnold networks\.IEEE Transactions on Information Theory\.Cited by:[§2](https://arxiv.org/html/2605.12648#S2.p1.3),[§3\.1](https://arxiv.org/html/2605.12648#S3.SS1.p1.4),[§3\.3](https://arxiv.org/html/2605.12648#S3.SS3.p1.2),[§3\.3](https://arxiv.org/html/2605.12648#S3.SS3.p2.1),[§5\.3](https://arxiv.org/html/2605.12648#S5.SS3.p2.6)\.
- W\. Hoeffding \(1963\)Probability inequalities for sums of bounded random variables\.Journal of the American Statistical Association58\(301\),pp\. 13–30\.Cited by:[§C\.1](https://arxiv.org/html/2605.12648#A3.SS1.21.p3.11)\.
- A\. Jacot, F\. Gabriel, and C\. Hongler \(2018\)Neural tangent kernel: convergence and generalization in neural networks\.Advances in Neural Information Processing Systems31\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p2.1)\.
- Z\. Ji and M\. Telgarsky \(2020\)Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks\.InInternational Conference on Learning Representations,Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p3.1)\.
- N\. P\. Kalinin and C\. Lampert \(2024\)Banded square root matrix factorization for differentially private model training\.InAdvances in Neural Information Processing Systems,Vol\.37,pp\. 17602–17655\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1)\.
- N\. P\. Kalinin, R\. McKenna, R\. Pagh, and C\. H\. Lampert \(2026a\)DP\-$\lambda$CGD: efficient noise correlation for differentially private model training\.Note:arXiv preprint arXiv:2601\.22334Cited by:[Figure 1](https://arxiv.org/html/2605.12648#S1.F1),[Figure 1](https://arxiv.org/html/2605.12648#S1.F1.12.5),[§1](https://arxiv.org/html/2605.12648#S1.p5.1),[§1](https://arxiv.org/html/2605.12648#S1.p6.5),[§3\.2](https://arxiv.org/html/2605.12648#S3.SS2.p2.17)\.
- N\. P\. Kalinin, R\. McKenna, J\. Upadhyay, and C\. H\. Lampert \(2026b\)Back to square roots: an optimal bound on the matrix factorization error for multi\-epoch differentially private SGD\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1)\.
- A\. Koloskova, R\. McKenna, Z\. Charles, J\. Rush, and H\. B\. McMahan \(2023\)Gradient descent with linearly correlated noise: theory and applications to differential privacy\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 35761–35773\.Cited by:[§2](https://arxiv.org/html/2605.12648#S2.SS0.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2605.12648#S5.SS1.SSS0.Px2.p1.1)\.
- B\. Laurent and P\. Massart \(2000\)Adaptive estimation of a quadratic functional by model selection\.Annals of Statistics,pp\. 1302–1338\.Cited by:[§C\.1](https://arxiv.org/html/2605.12648#A3.SS1.p14.1)\.
- Y\. Lei, R\. Jin, and Y\. Ying \(2022\)Stability and generalization analysis of gradient methods for shallow neural networks\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 38557–38570\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p4.1)\.
- Y\. Lei, P\. Wang, Y\. Ying, and D\. Zhou \(2026\)Optimization and generalization of gradient descent for shallow ReLU networks with minimal width\.Journal of Machine Learning Research27\(34\),pp\. 1–35\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p3.1),[§3\.3](https://arxiv.org/html/2605.12648#S3.SS3.p2.1)\.
- Y\. Lei and Y\. Ying \(2020\)Fine\-grained analysis of stability and generalization for stochastic gradient descent\.InInternational Conference on Machine Learning,pp\. 5809–5819\.Cited by:[§C\.2](https://arxiv.org/html/2605.12648#A3.SS2.p3.1),[Definition C\.14](https://arxiv.org/html/2605.12648#A3.Thmtheorem14),[Lemma C\.15](https://arxiv.org/html/2605.12648#A3.Thmtheorem15)\.
- L\. Li, Y\. Zhang, G\. Wang, and K\. Xia \(2025a\)Kolmogorov–arnold graph neural networks for molecular property prediction\.Nature Machine Intelligence7\(8\),pp\. 1346–1354\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p1.1)\.
- P\. Li, L\. Ding, J\. Fu, G\. Wang, Y\. Yuan,et al\.\(2025b\)Generalization bounds for kolmogorov\-arnold networks \(KANs\) and enhanced KANs with lower lipschitz complexity\.InAdvances in Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2605.12648#S2.p1.3)\.
- Y\. Li, Y\. Lei, Z\. Guo, and Y\. Ying \(2026\)Optimal rates for generalization of gradient descent for deep ReLU classification\.InAdvances in Neural Information Processing Systems,Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p3.1)\.
- W\. Liu, E\. Chatzi, and Z\. Lai \(2025a\)On the rate of convergence of kolmogorov\-arnold network regression estimators\.arXiv preprint arXiv:2509\.19830\.Cited by:[§2](https://arxiv.org/html/2605.12648#S2.p1.3)\.
- Z\. Liu, Y\. Wang, S\. Vaidya, F\. Ruehle, J\. Halverson, M\. Soljačić, T\. Y\. Hou, and M\. Tegmark \(2025b\)KAN: Kolmogorov\-Arnold networks\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p1.1)\.
- R\. McKenna \(2025\)Scaling up the banded matrix factorization mechanism for differentially private ML\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1)\.
- H\. B\. McMahan, Z\. Xu, and Y\. Zhang \(2024\)A hassle\-free algorithm for strong differential privacy in federated learning systems\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track,pp\. 842–865\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1)\.
- I\. Mironov \(2017\)Rényi differential privacy\.In2017 IEEE 30th computer security foundations symposium \(CSF\),pp\. 263–275\.Cited by:[Definition D\.1](https://arxiv.org/html/2605.12648#A4.Thmtheorem1),[Lemma D\.2](https://arxiv.org/html/2605.12648#A4.Thmtheorem2),[Lemma D\.4](https://arxiv.org/html/2605.12648#A4.Thmtheorem4),[Lemma D\.5](https://arxiv.org/html/2605.12648#A4.Thmtheorem5),[Lemma D\.6](https://arxiv.org/html/2605.12648#A4.Thmtheorem6)\.
- M\. Nguyen and N\. Muecke \(2024\)How many neurons do we need? a refined analysis for shallow networks trained with gradient descent\.Journal of Statistical Planning and Inference233,pp\. 106169\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p2.1)\.
- A\. Nitanda, G\. Chinot, and T\. Suzuki \(2019\)Gradient descent can learn less over\-parameterized two\-layer neural networks on classification problems\.arXiv preprint arXiv:1905\.09870\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p3.1),[§3\.3](https://arxiv.org/html/2605.12648#S3.SS3.p2.1)\.
- A\. Nitanda and T\. Suzuki \(2021\)Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime\.InInternational Conference on Learning Representations,Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p2.1)\.
- S\. Patra, S\. Panda, B\. K\. Parida, M\. Arya, K\. Jacobs, D\. I\. Bondar, and A\. Sen \(2025\)Physics informed kolmogorov\-arnold neural networks for dynamical analysis via efficient\-kan and wav\-kan\.Journal of Machine Learning Research26\(233\),pp\. 1–39\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p1.1)\.
- K\. Pillutla, J\. Upadhyay, C\. A\. Choquette\-Choo, K\. Dvijotham, A\. Ganesh, M\. Henzinger, J\. Katz, R\. McKenna, H\. B\. McMahan, K\. Rush,et al\.\(2025\)Correlated noise mechanisms for differentially private learning\.Note:arXiv preprint arXiv:2506\.08201Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1)\.
- D\. Richards and I\. Kuzborskij \(2021\)Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel\.InAdvances in Neural Information Processing Systems,Vol\.34\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p4.1)\.
- A\. Rodio, Z\. Chen, and E\. G\. Larsson \(2025\)Optimizing privacy\-utility trade\-off in decentralized learning with generalized correlated noise\.In2025 IEEE Information Theory Workshop \(ITW\),pp\. 1–6\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p5.1)\.
- J\. Schuchardt and N\. Kalinin \(2026\)Sampling\-free privacy accounting for matrix mechanisms under random allocation\.Cited by:[§A\.2](https://arxiv.org/html/2605.12648#A1.SS2.p1.1)\.
- Z\. Shi, P\. Wang, C\. Zhang, and Y\. Cao \(2026\)Towards understanding generalization in DP\-GD: a case study in training two\-layer CNNs\.InAAAI Conference on Artificial Intelligence,Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p3.1),[§2](https://arxiv.org/html/2605.12648#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Shukla, J\. D\. Toscano, Z\. Wang, Z\. Zou, and G\. E\. Karniadakis \(2024\)A comprehensive and fair comparison between mlp and kan representations for differential equations and operator networks\.Computer Methods in Applied Mechanics and Engineering431,pp\. 117290\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p1.1)\.
- S\. Song, K\. Chaudhuri, and A\. D\. Sarwate \(2013\)Stochastic gradient descent with differentially private updates\.In2013 IEEE global conference on signal and information processing,pp\. 245–248\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p4.1)\.
- H\. Taheri, C\. Thrampoulidis, and A\. Mazumdar \(2025\)Sharper guarantees for learning neural network classifiers with gradient methods\.InInternational Conference on Learning Representations,Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p4.1),[§3\.3](https://arxiv.org/html/2605.12648#S3.SS3.p1.2),[§3\.3](https://arxiv.org/html/2605.12648#S3.SS3.p2.1)\.
- H\. Taheri and C\. Thrampoulidis \(2024\)Generalization and stability of interpolating neural networks with minimal width\.Journal of Machine Learning Research25\(156\),pp\. 1–41\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p4.1),[Lemma B\.4](https://arxiv.org/html/2605.12648#A2.Thmtheorem4)\.
- C\. J\. Vaca\-Rubio, L\. Blanco, R\. Pereira, and M\. Caus \(2024\)Kolmogorov\-arnold networks \(kans\) for time series analysis\.In2024 IEEE Globecom Workshops \(GC Wkshps\),pp\. 1–6\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p1.1)\.
- M\. J\. Wainwright \(2019\)High\-dimensional statistics: a non\-asymptotic viewpoint\.Vol\.48,Cambridge university press\.Cited by:[Appendix B](https://arxiv.org/html/2605.12648#A2.p5.1)\.
- P\. Wang, Y\. Lei, M\. Kloft, and Y\. Ying \(2025a\)Optimal utility bounds for differentially private gradient descent in three\-layer neural networks\.In2025 IEEE 12th International Conference on Data Science and Advanced Analytics \(DSAA\),pp\. 1–8\.Cited by:[§2](https://arxiv.org/html/2605.12648#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Wang, Y\. Lei, D\. Wang, Y\. Ying, and D\. Zhou \(2025b\)Generalization guarantees of gradient descent for shallow neural networks\.Neural Computation37\(2\),pp\. 344–402\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p4.1)\.
- P\. Wang, J\. Zhou, P\. Liznerski, and M\. Kloft \(2026\)Optimization, generalization and differential privacy bounds for gradient descent on Kolmogorov\-Arnold networks\.InInternational Conference on Machine Learning,Cited by:[Appendix B](https://arxiv.org/html/2605.12648#A2.2.p1.6),[Lemma B\.1](https://arxiv.org/html/2605.12648#A2.Thmtheorem1),[Appendix B](https://arxiv.org/html/2605.12648#A2.p2.1),[§C\.1](https://arxiv.org/html/2605.12648#A3.SS1.37.p3.1),[§D\.1](https://arxiv.org/html/2605.12648#A4.SS1.p8.2),[Appendix E](https://arxiv.org/html/2605.12648#A5.p1.1),[1st item](https://arxiv.org/html/2605.12648#S1.I1.i1.p1.1),[3rd item](https://arxiv.org/html/2605.12648#S1.I1.i3.p1.5),[§1](https://arxiv.org/html/2605.12648#S1.p3.1),[§1](https://arxiv.org/html/2605.12648#S1.p4.1),[§2](https://arxiv.org/html/2605.12648#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.12648#S2.p1.3),[§3\.1](https://arxiv.org/html/2605.12648#S3.SS1.p1.4),[§3\.3](https://arxiv.org/html/2605.12648#S3.SS3.p1.2),[§3\.3](https://arxiv.org/html/2605.12648#S3.SS3.p2.1),[§5\.2](https://arxiv.org/html/2605.12648#S5.SS2.p3.6),[§5\.3](https://arxiv.org/html/2605.12648#S5.SS3.p2.6)\.
- Y\. Wang, J\. W\. Siegel, Z\. Liu, and T\. Y\. Hou \(2025c\)On the expressiveness and spectral bias of kans\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.12648#S2.p1.3)\.
- Y\. Wang, J\. Sun, J\. Bai, C\. Anitescu, M\. S\. Eshaghi, X\. Zhuang, T\. Rabczuk, and Y\. Liu \(2025d\)Kolmogorov–arnold\-informed neural network: a physics\-informed deep learning framework for solving forward and inverse problems based on kolmogorov–arnold networks\.Computer Methods in Applied Mechanics and Engineering433,pp\. 117518\.Cited by:[§1](https://arxiv.org/html/2605.12648#S1.p1.1)\.
- Y\. Wang, B\. Balle, and S\. P\. Kasiviswanathan \(2019\)Subsampled rényi differential privacy and analytical moments accountant\.InInternational Conference on Artificial Intelligence and Statistics \(AISTATS\),Cited by:[§D\.1](https://arxiv.org/html/2605.12648#A4.SS1.3.p3.2),[§D\.1](https://arxiv.org/html/2605.12648#A4.SS1.4.p4.1)\.
- R\. Xu and K\. Chen \(2026\)Differential privacy in two\-layer networks: how dp\-sgd harms fairness and robustness\.arXiv preprint arXiv:2603\.04881\.Cited by:[§2](https://arxiv.org/html/2605.12648#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Zhang, H\. Xie, M\. Ding, S\. Fu, J\. Liu, and D\. Wang \(2026\)Understanding the impact of differentially private training on memorization of long\-tailed data\.arXiv preprint arXiv:2602\.03872\.Cited by:[§2](https://arxiv.org/html/2605.12648#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Zhou, P\. Wang, and D\. Zhou \(2024\)Generalization analysis with deep relu networks for metric and similarity learning\.arXiv preprint arXiv:2405\.06415\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p3.1)\.
- Y\. Zhu, J\. Dong, and Y\. Wang \(2022\)Optimal accounting of differential privacy via characteristic function\.InInternational Conference on Artificial Intelligence and Statistics \(AISTATS\),Cited by:[§F\.1](https://arxiv.org/html/2605.12648#A6.SS1.p5.5),[§F\.2](https://arxiv.org/html/2605.12648#A6.SS2.p2.1),[Lemma F\.6](https://arxiv.org/html/2605.12648#A6.Thmtheorem6)\.
- D\. Zou, Y\. Cao, D\. Zhou, and Q\. Gu \(2020\)Gradient descent optimizes over\-parameterized deep relu networks\.Machine learning109,pp\. 467–492\.Cited by:[§A\.1](https://arxiv.org/html/2605.12648#A1.SS1.p2.1)\.

Appendix

## Appendix A Further Related Work

This appendix expands on the related work referenced in Section [2](https://arxiv.org/html/2605.12648#S2), covering generalization theory for neural networks \(Appendix [A\.1](https://arxiv.org/html/2605.12648#A1.SS1)\) and privacy amplification by subsampling for correlated\-noise mechanisms \(Appendix [A\.2](https://arxiv.org/html/2605.12648#A1.SS2)\)\.

### A\.1 Population risk bounds for neural networks

Most existing optimization and generalization results for neural networks have been developed for *fully connected* multilayer perceptrons \(MLPs\), rather than architectures with edge\-wise functional parametrizations such as KANs\. Broadly speaking, the theoretical literature on MLPs can be organized into three main directions\.

A first line of work is based on the neural tangent kernel \(NTK\) viewpoint \[[25](https://arxiv.org/html/2605.12648#bib.bib126)\]\. In this regime, a sufficiently wide network behaves approximately like its linearization around initialization, which makes it possible to analyze gradient\-based training through an associated kernel\. This perspective has led to influential convergence guarantees for overparameterized MLPs and, in the lazy\-training regime, to sharp statistical and excess risk bounds in a variety of settings \[[1](https://arxiv.org/html/2605.12648#bib.bib130),[4](https://arxiv.org/html/2605.12648#bib.bib151),[8](https://arxiv.org/html/2605.12648#bib.bib155),[9](https://arxiv.org/html/2605.12648#bib.bib81),[43](https://arxiv.org/html/2605.12648#bib.bib104),[45](https://arxiv.org/html/2605.12648#bib.bib93),[68](https://arxiv.org/html/2605.12648#bib.bib46)\]\.

A second line of work studies generalization through the uniform convergence approach\. Typical tools include Rademacher complexity, covering numbers, norm\-based complexity measures, and margin\-based arguments \[[6](https://arxiv.org/html/2605.12648#bib.bib144),[22](https://arxiv.org/html/2605.12648#bib.bib77),[26](https://arxiv.org/html/2605.12648#bib.bib111),[33](https://arxiv.org/html/2605.12648#bib.bib10),[37](https://arxiv.org/html/2605.12648#bib.bib14),[44](https://arxiv.org/html/2605.12648#bib.bib73),[51](https://arxiv.org/html/2605.12648#bib.bib748),[66](https://arxiv.org/html/2605.12648#bib.bib15)\]\. These approaches yield broad generalization guarantees for MLPs, often under relatively mild assumptions on the training algorithm itself\.

More recently, there has been growing interest in algorithm\-dependent generalization analyses based on algorithmic stability \[[32](https://arxiv.org/html/2605.12648#bib.bib160),[48](https://arxiv.org/html/2605.12648#bib.bib162),[55](https://arxiv.org/html/2605.12648#bib.bib80),[54](https://arxiv.org/html/2605.12648#bib.bib59),[59](https://arxiv.org/html/2605.12648#bib.bib12)\]\. This line of work studies how perturbations in the training data propagate through the optimization dynamics, and consequently yields generalization and excess risk bounds that are directly tied to the learning algorithm\. This viewpoint is especially relevant to the present paper, since our goal is likewise to derive algorithm\-dependent risk guarantees, but now in the more structured setting of KANs and under mini\-batch, clipped, and differentially private training\.

### A\.2 Privacy amplification by subsampling for correlated noise

Prior analyses of correlated\-noise mechanisms with privacy amplification cover banded matrix mechanisms, Poisson\-subsampled matrix mechanisms, and Balls\-in\-Bins schemes \[[12](https://arxiv.org/html/2605.12648#bib.bib720),[13](https://arxiv.org/html/2605.12648#bib.bib708),[14](https://arxiv.org/html/2605.12648#bib.bib721),[50](https://arxiv.org/html/2605.12648#bib.bib744)\]\. We complement this line of work by deriving an analytical noise\-multiplier bound for DP\-$\lambda$CGD under mini\-batches drawn uniformly without replacement, which underlies the privacy guarantee for Algorithm [1](https://arxiv.org/html/2605.12648#alg1)\.

## Appendix B Useful Lemmas

This section collects auxiliary lemmas that are used in the proofs\.

###### Lemma B\.1 \(\[[60](https://arxiv.org/html/2605.12648#bib.bib19)\]\)\.

Let $s\in\mathbb{N}$ and $\mathbf{A}_{i}\in\mathbb{R}^{m_{i}\times n_{i}}$ for $i\in[s]$. Define the block\-diagonal matrix

$$\mathbf{B}=\begin{bmatrix}\mathbf{A}_{1}&\cdots&\mathbf{0}\\ \vdots&\ddots&\vdots\\ \mathbf{0}&\cdots&\mathbf{A}_{s}\end{bmatrix}\in\mathbb{R}^{\sum_{i=1}^{s}m_{i}\times\sum_{i=1}^{s}n_{i}}.$$

Then it holds that

$$\|\mathbf{B}\|_{2}=\max_{i\in[s]}\|\mathbf{A}_{i}\|_{2}.$$
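The identity in Lemma B.1 can be checked directly from the definition of the spectral norm. The derivation below is a standard argument given only as a sketch; the partition $\mathbf{x}=(\mathbf{x}_{1},\dots,\mathbf{x}_{s})$ with $\mathbf{x}_{i}\in\mathbb{R}^{n_{i}}$ is notation introduced for this illustration and does not appear in the source.

```latex
\begin{align*}
\|\mathbf{B}\mathbf{x}\|_{2}^{2}
  &= \sum_{i=1}^{s}\|\mathbf{A}_{i}\mathbf{x}_{i}\|_{2}^{2}
   \le \sum_{i=1}^{s}\|\mathbf{A}_{i}\|_{2}^{2}\,\|\mathbf{x}_{i}\|_{2}^{2} \\
  &\le \max_{i\in[s]}\|\mathbf{A}_{i}\|_{2}^{2}\sum_{i=1}^{s}\|\mathbf{x}_{i}\|_{2}^{2}
   = \max_{i\in[s]}\|\mathbf{A}_{i}\|_{2}^{2}\,\|\mathbf{x}\|_{2}^{2}.
\end{align*}
```

This gives the upper bound. For the matching lower bound, one can take $\mathbf{x}$ to be zero outside the block with the largest norm and equal to a top right\-singular vector of that block inside it.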

###### Lemma B\.2\.

For any $n\geq 2$, let $x_{1},\dots,x_{n}\in\mathbb{R}^{D}$ be deterministic vectors, and let $\bar{x}:=\frac{1}{n}\sum_{i=1}^{n}x_{i}$. Suppose that $\mathcal{B}\subseteq[n]$ is sampled uniformly without replacement among all subsets of cardinality $B$. Define $\bar{x}_{\mathcal{B}}=\frac{1}{B}\sum_{i\in\mathcal{B}}x_{i}$. Then $\mathbb{E}_{\mathcal{B}}[\bar{x}_{\mathcal{B}}]=\bar{x}$ and

$$\mathbb{E}_{\mathcal{B}}\big[\|\bar{x}_{\mathcal{B}}-\bar{x}\|_{2}^{2}\big]=\frac{n-B}{B(n-1)}\cdot\frac{1}{n}\sum_{i=1}^{n}\|x_{i}-\bar{x}\|_{2}^{2}.$$

Furthermore, if $\|x_{i}\|_{2}\leq G$ for all $i\in[n]$, then

$$\mathbb{E}_{\mathcal{B}}\big[\|\bar{x}_{\mathcal{B}}-\bar{x}\|_{2}^{2}\big]\leq\frac{G^{2}}{B}.$$

###### Proof\.

The proof is standard; we include it for completeness. Unbiasedness follows from

$$\mathbb{E}_{\mathcal{B}}[\bar{x}_{\mathcal{B}}]=\frac{1}{B}\sum_{i=1}^{n}\mathbb{P}(i\in\mathcal{B})\,x_{i}=\frac{1}{B}\sum_{i=1}^{n}\frac{B}{n}x_{i}=\bar{x}.$$

To prove the second identity, let $x_{i}^{\prime}=x_{i}-\bar{x}$ for any $i\in[n]$. Then $\sum_{i=1}^{n}x_{i}^{\prime}=0$ and $\bar{x}_{\mathcal{B}}-\bar{x}=\frac{1}{B}\sum_{i\in\mathcal{B}}x_{i}^{\prime}$. Hence,

$$\mathbb{E}_{\mathcal{B}}\big[\|\bar{x}_{\mathcal{B}}-\bar{x}\|_{2}^{2}\big]=\frac{1}{B^{2}}\,\mathbb{E}_{\mathcal{B}}\Big[\Big\|\sum_{i\in\mathcal{B}}x_{i}^{\prime}\Big\|_{2}^{2}\Big]=\frac{1}{B^{2}}\Big(\sum_{i=1}^{n}\mathbb{P}(i\in\mathcal{B})\|x_{i}^{\prime}\|_{2}^{2}+\sum_{i\neq j}\mathbb{P}(i,j\in\mathcal{B})\langle x_{i}^{\prime},x_{j}^{\prime}\rangle\Big).$$

Since $\mathcal{B}$ is sampled uniformly without replacement with $|\mathcal{B}|=B$, we know

$$\mathbb{P}(i\in\mathcal{B})=\frac{B}{n},\qquad\mathbb{P}(i,j\in\mathcal{B})=\frac{B(B-1)}{n(n-1)}\quad(i\neq j).$$

Moreover, from $\sum_{i=1}^{n}x_{i}^{\prime}=0$ we get

$$\sum_{i\neq j}\langle x_{i}^{\prime},x_{j}^{\prime}\rangle=\Big\|\sum_{i=1}^{n}x_{i}^{\prime}\Big\|_{2}^{2}-\sum_{i=1}^{n}\|x_{i}^{\prime}\|_{2}^{2}=-\sum_{i=1}^{n}\|x_{i}^{\prime}\|_{2}^{2}.$$

Therefore,

$$\mathbb{E}_{\mathcal{B}}\big[\|\bar{x}_{\mathcal{B}}-\bar{x}\|_{2}^{2}\big]=\frac{1}{B^{2}}\Big(\frac{B}{n}-\frac{B(B-1)}{n(n-1)}\Big)\sum_{i=1}^{n}\|x_{i}^{\prime}\|_{2}^{2}=\frac{n-B}{B(n-1)}\cdot\frac{1}{n}\sum_{i=1}^{n}\|x_{i}-\bar{x}\|_{2}^{2}.$$

Further, if $\|x_{i}\|_{2}\leq G$ for all $i\in[n]$, then

$$\frac{1}{n}\sum_{i=1}^{n}\|x_{i}-\bar{x}\|_{2}^{2}=\frac{1}{n}\sum_{i=1}^{n}\|x_{i}\|_{2}^{2}-\|\bar{x}\|_{2}^{2}\leq\frac{1}{n}\sum_{i=1}^{n}\|x_{i}\|_{2}^{2}\leq G^{2}.$$

Hence,

$$\mathbb{E}_{\mathcal{B}}\big[\|\bar{x}_{\mathcal{B}}-\bar{x}\|_{2}^{2}\big]\leq\frac{n-B}{B(n-1)}\,G^{2}\leq\frac{G^{2}}{B}.$$

This completes the proof. ∎
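The variance identity of Lemma B.2 can be sanity\-checked on a small example by enumerating every size\-$B$ subset exactly. The following is a minimal sketch using rational arithmetic; the six 2\-dimensional vectors and the helper names (`minibatch_variance`, `closed_form`) are hypothetical choices made for illustration, not part of the paper.

```python
from fractions import Fraction
from itertools import combinations


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(c * c for c in v)


def mean(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    k, dim = len(vectors), len(vectors[0])
    return [sum(v[t] for v in vectors) / k for t in range(dim)]


def minibatch_variance(xs, B):
    """Exact E||x_bar_B - x_bar||^2 over all subsets of size B (uniform, no replacement)."""
    x_bar = mean(xs)
    subsets = list(combinations(range(len(xs)), B))
    total = Fraction(0)
    for S in subsets:
        xb = mean([xs[i] for i in S])
        total += sq_norm([a - b for a, b in zip(xb, x_bar)])
    return total / len(subsets)


def closed_form(xs, B):
    """Right-hand side of the identity: (n - B) / (B (n - 1)) times the mean squared spread."""
    n = len(xs)
    x_bar = mean(xs)
    spread = sum(sq_norm([a - b for a, b in zip(x, x_bar)]) for x in xs) / n
    return Fraction(n - B, B * (n - 1)) * spread


# Hypothetical data: n = 6 vectors in R^2 with exact rational entries.
xs = [[Fraction(a), Fraction(b)] for a, b in [(1, 0), (0, 2), (-3, 1), (2, 2), (0, -1), (4, 1)]]
G2 = max(sq_norm(x) for x in xs)  # G^2 with G = max_i ||x_i||_2

for B in range(1, len(xs) + 1):
    assert minibatch_variance(xs, B) == closed_form(xs, B)
    assert closed_form(xs, B) <= G2 / B
```

Because the arithmetic is exact, the two sides agree with equality rather than up to a tolerance. At $B=n$ the factor $n-B$ vanishes, which matches the fact that the full batch has no sampling variance.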

Our convergence analysis relies on the following standard mirror descent inequality \(see, e\.g\., \[[60](https://arxiv.org/html/2605.12648#bib.bib19)\]\)\.

###### Lemma B\.3 \(Mirror descent inequality\)\.

Let $\mathcal{K}\subseteq\mathbb{R}^{mdp}$ be a nonempty closed convex set, and let $\text{Proj}_{\mathcal{K}}(\cdot)$ denote the Euclidean projection onto $\mathcal{K}$. Fix any $\eta>0$ and any vector $g\in\mathbb{R}^{mdp}$. Define

$$\mathbf{W}^{+}=\text{Proj}_{\mathcal{K}}\big(\mathbf{W}-\eta g\big).$$

Then, for any comparator $\mathbf{W}^{*}\in\mathcal{K}$, the following inequality holds:

$$\big\langle g,\ \mathbf{W}-\mathbf{W}^{*}\big\rangle\leq\frac{1}{2\eta}\Big(\|\mathbf{W}-\mathbf{W}^{*}\|_{2}^{2}-\|\mathbf{W}^{+}-\mathbf{W}^{*}\|_{2}^{2}\Big)+\frac{\eta}{2}\|g\|_{2}^{2}.$$
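Lemma B.3 follows from a short, standard calculation, sketched here for the reader's convenience rather than taken from the cited reference. It uses only two facts: the Euclidean projection onto a closed convex set is non\-expansive, and $\text{Proj}_{\mathcal{K}}(\mathbf{W}^{*})=\mathbf{W}^{*}$ because $\mathbf{W}^{*}\in\mathcal{K}$.

```latex
\begin{align*}
\|\mathbf{W}^{+}-\mathbf{W}^{*}\|_{2}^{2}
  &= \big\|\text{Proj}_{\mathcal{K}}(\mathbf{W}-\eta g)-\text{Proj}_{\mathcal{K}}(\mathbf{W}^{*})\big\|_{2}^{2} \\
  &\le \|\mathbf{W}-\eta g-\mathbf{W}^{*}\|_{2}^{2} \\
  &= \|\mathbf{W}-\mathbf{W}^{*}\|_{2}^{2}
     -2\eta\,\langle g,\ \mathbf{W}-\mathbf{W}^{*}\rangle
     +\eta^{2}\|g\|_{2}^{2}.
\end{align*}
```

Moving the inner product to the left\-hand side and dividing by $2\eta>0$ gives the stated inequality.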

Denote by $[\mathbf{W}_{1},\mathbf{W}_{2}]=\{\alpha\mathbf{W}_{1}+(1-\alpha)\mathbf{W}_{2}:\alpha\in[0,1]\}$ the line segment between $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$.

###### Lemma B\.4 \(Local quasi\-convexity property \[[55](https://arxiv.org/html/2605.12648#bib.bib80)\]\)\.

Let $s\in\mathbb{N}$ and let $G:\mathbb{R}^{s}\rightarrow\mathbb{R}$ be a twice\-differentiable function satisfying $\lambda_{\min}(\nabla^{2}G(\mathbf{W}))\geq-\kappa G(\mathbf{W})$. Let $\mathbf{W}_{1},\mathbf{W}_{2}\in\mathbb{R}^{s}$ be two arbitrary points with distance $\|\mathbf{W}_{1}-\mathbf{W}_{2}\|_{2}\leq D\leq\sqrt{2/\kappa}$. Let $\tau:=(1-D^{2}\kappa/2)^{-1}$. Then,

$$\max_{\mathcal{V}\in[\mathbf{W}_{1},\mathbf{W}_{2}]}G(\mathcal{V})\leq\tau\max\big\{G(\mathbf{W}_{1}),G(\mathbf{W}_{2})\big\}.$$
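To make Lemma B.4 concrete, the sketch below checks it numerically on a one\-dimensional toy function. The choice $G(w)=1+\cos(w)$ with $\kappa=1/2$ is a hypothetical example, not taken from the paper: it satisfies $G''(w)=-\cos(w)\geq-\tfrac{1}{2}(1+\cos(w))$ for every $w$, since that inequality is equivalent to $\cos(w)\leq 1$. The segment maximum is approximated on a fine grid, and $D$ is taken strictly below $\sqrt{2/\kappa}$ so that $\tau$ is finite.

```python
import math

KAPPA = 0.5  # hypothetical curvature constant that works for the toy function below


def G(w):
    """Toy non-negative, non-convex function (an illustrative assumption)."""
    return 1.0 + math.cos(w)


def G_second(w):
    """Second derivative of G; in one dimension this is lambda_min of the Hessian."""
    return -math.cos(w)


def curvature_condition_holds(grid):
    """Check lambda_min(Hessian) >= -kappa * G on a grid of points."""
    return all(G_second(w) >= -KAPPA * G(w) - 1e-12 for w in grid)


def segment_max(w1, w2, steps=2001):
    """Grid approximation of the maximum of G over the segment [w1, w2]."""
    return max(G(w1 + (w2 - w1) * t / (steps - 1)) for t in range(steps))


def quasi_convexity_gap(w1, w2):
    """tau * max{G(w1), G(w2)} minus the segment maximum; the lemma predicts >= 0."""
    D = abs(w1 - w2)
    assert D < math.sqrt(2.0 / KAPPA), "need D strictly below sqrt(2 / kappa)"
    tau = 1.0 / (1.0 - D * D * KAPPA / 2.0)
    return tau * max(G(w1), G(w2)) - segment_max(w1, w2)


grid = [-10.0 + 0.01 * k for k in range(2001)]
pairs = [(-0.5, 0.5), (-0.9, 0.9), (1.0, 2.5), (3.0, 4.5)]
gaps = [quasi_convexity_gap(a, b) for a, b in pairs]
```

The first two pairs straddle the maximizer $w=0$, so the segment maximum strictly exceeds both endpoint values and the factor $\tau>1$ is what makes the bound hold.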

For any scalar $v\in\mathbb{R}$, we denote $\mathbf{h}(v)=[b_{1}(v),\ldots,b_{p}(v)]^{\top}\in\mathbb{R}^{p}$. For $s\in\mathbb{N}$ and a vector $\mathbf{u}=[u_{1},\ldots,u_{s}]^{\top}\in\mathbb{R}^{s}$, we denote $\mathbf{h}(\mathbf{u})=[\mathbf{h}(u_{1})^{\top},\ldots,\mathbf{h}(u_{s})^{\top}]^{\top}\in\mathbb{R}^{sp}$. For each hidden unit $j\in[m]$, the first\-layer spline coefficients $\{w_{i,j,k}\}_{i\in[d],\,k\in[p]}$ are arranged into a vector $\mathbf{w}_{j}\in\mathbb{R}^{dp}$ according to a fixed ordering of $(i,k)$. We write $\mathbf{W}=(\mathbf{w}_{1},\ldots,\mathbf{w}_{m})^{\top}\in\mathbb{R}^{m\times dp}$. The second\-layer spline coefficients $\{c_{j,k}\}_{j\in[m],\,k\in[p]}$ are collected as $\mathbf{c}=(\mathbf{c}_{1},\ldots,\mathbf{c}_{m})\in\mathbb{R}^{mp}$ with $\mathbf{c}_{j}=(c_{j,1},\ldots,c_{j,p})\in\mathbb{R}^{p}$. Then $f_{\mathbf{W}}$ can be rewritten as

$$f_{\mathbf{W}}(\mathbf{x})=\frac{1}{\sqrt{m}}\mathbf{c}^{\top}\mathbf{h}\Big(\sigma\big(\frac{1}{\sqrt{d}}\mathbf{W}\mathbf{h}(\mathbf{x})\big)\Big).$$
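The vectorized form of $f_{\mathbf{W}}$ maps directly onto a few lines of code, which also makes the dimensions ($dp$, $m$, $mp$) explicit. The sketch below is illustrative only: the Gaussian\-bump basis standing in for $b_{1},\ldots,b_{p}$, the choice $\sigma=\tanh$, and the sizes $d=3$, $m=5$, $p=4$ are all assumptions made here, since this section does not fix them.

```python
import math
import random


def h_scalar(v, knots):
    """h(v) = [b_1(v), ..., b_p(v)]; b_k is a hypothetical Gaussian bump centred at knots[k]."""
    return [math.exp(-(v - t) ** 2) for t in knots]


def h_vec(u, knots):
    """h(u) for a vector u in R^s: the stacked vector [h(u_1); ...; h(u_s)] in R^{sp}."""
    out = []
    for v in u:
        out.extend(h_scalar(v, knots))
    return out


def kan_forward(x, W, c, knots, sigma=math.tanh):
    """f_W(x) = m^{-1/2} c^T h(sigma(d^{-1/2} W h(x))); sigma = tanh is an assumption."""
    d, m = len(x), len(W)
    hx = h_vec(x, knots)  # first-layer features, length d * p
    pre = [sum(w * f for w, f in zip(w_j, hx)) / math.sqrt(d) for w_j in W]  # length m
    u = [sigma(a) for a in pre]  # hidden activations, length m
    hu = h_vec(u, knots)  # second-layer features, length m * p
    return sum(ck * f for ck, f in zip(c, hu)) / math.sqrt(m)


rng = random.Random(0)
d, m, p = 3, 5, 4  # hypothetical sizes
knots = [-1.0 + 2.0 * k / (p - 1) for k in range(p)]
W = [[rng.gauss(0.0, 1.0) for _ in range(d * p)] for _ in range(m)]  # rows are w_j in R^{dp}
c = [rng.gauss(0.0, 1.0) for _ in range(m * p)]  # second-layer coefficients
x = [0.2, -0.7, 0.5]
y = kan_forward(x, W, c, knots)
```

With this particular basis every $b_{k}$ takes values in $(0,1]$, so $\|\mathbf{h}(\mathbf{u})\|_{2}\leq\sqrt{mp}$ and therefore $|f_{\mathbf{W}}(\mathbf{x})|\leq\sqrt{p}\,\|\mathbf{c}\|_{2}$ by Cauchy\-Schwarz. That bound is a property of the toy basis, not a claim about the paper's splines.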
To control the gradients and Hessians, we use the following lemma \[[57](https://arxiv.org/html/2605.12648#bib.bib112)\], which provides estimates for $\mathbf{c}\in\mathbb{R}^{mp}$.

###### Lemma B\.5\.

For any $\delta\in(0,1)$, define the event

$$\mathcal{E}_{\delta}:=\Big\{\|\mathbf{c}\|_{2}\leq 4\sqrt{pm}+2\sqrt{\log(2/\delta)}\ \text{ and }\ \max_{j\in[m]}\|\mathbf{c}_{j}\|_{2}\leq 4\sqrt{p}+2\sqrt{\log(2m/\delta)}\Big\}.\tag{1}$$

It holds that $\mathbb{P}(\mathcal{E}_{\delta})\geq 1-\delta$.
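A quick simulation gives a feel for how conservative the event $\mathcal{E}_{\delta}$ in (1) is. The sketch below assumes, purely for illustration, that the entries of $\mathbf{c}$ are drawn i.i.d. from a standard Gaussian; this section does not state the initialization distribution, so that choice and the sizes $m=20$, $p=4$ are hypothetical.

```python
import math
import random


def event_holds(c_blocks, delta):
    """Check both norm bounds in (1) for c given as m blocks c_j, each of length p."""
    m, p = len(c_blocks), len(c_blocks[0])
    block_norms = [math.sqrt(sum(v * v for v in cj)) for cj in c_blocks]
    full_norm = math.sqrt(sum(b * b for b in block_norms))
    full_ok = full_norm <= 4 * math.sqrt(p * m) + 2 * math.sqrt(math.log(2 / delta))
    block_ok = max(block_norms) <= 4 * math.sqrt(p) + 2 * math.sqrt(math.log(2 * m / delta))
    return full_ok and block_ok


def empirical_frequency(m, p, delta, trials, seed=0):
    """Fraction of trials in which the event holds, under the assumed N(0, 1) entries."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        c_blocks = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(m)]
        hits += event_holds(c_blocks, delta)
    return hits / trials


delta = 0.05
freq = empirical_frequency(m=20, p=4, delta=delta, trials=500)
assert freq >= 1 - delta
```

Under this assumed initialization the typical value of $\|\mathbf{c}\|_{2}$ is about $\sqrt{pm}$, well below the threshold $4\sqrt{pm}$, so the observed frequency should sit comfortably above $1-\delta$.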

To simplify the analysis, we work on the above high-probability event associated with the initialization of $\mathbf{c}$.
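The event $\mathcal{E}_{\delta}$ can be probed empirically. The sketch below assumes that $\mathbf{c}$ is initialized with i.i.d. standard Gaussian entries; this initialization is an assumption made for illustration, and the sizes `m`, `p` and the number of trials are arbitrary.

```python
import math
import random

def event_holds(c_blocks, delta):
    # c_blocks[j] plays the role of c_j in R^p; checks both conditions defining E_delta.
    m, p = len(c_blocks), len(c_blocks[0])
    block_norms = [math.sqrt(sum(v * v for v in cj)) for cj in c_blocks]
    total_norm = math.sqrt(sum(b * b for b in block_norms))
    ok_total = total_norm <= 4 * math.sqrt(p * m) + 2 * math.sqrt(math.log(2 / delta))
    ok_blocks = max(block_norms) <= 4 * math.sqrt(p) + 2 * math.sqrt(math.log(2 * m / delta))
    return ok_total and ok_blocks

def empirical_frequency(m, p, delta, trials, seed=0):
    # Fraction of Gaussian draws of c (assumed N(0, I) initialization) that land in E_delta.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        c_blocks = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(m)]
        hits += event_holds(c_blocks, delta)
    return hits / trials

freq = empirical_frequency(m=16, p=4, delta=0.05, trials=500)
print(freq)
```

Under this assumed initialization the thresholds are conservative, so the observed frequency should sit well above $1-\delta$.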

###### Lemma B.6 (Gradient and Hessian).

Let $\delta\in(0,1)$ and let $C_{\sigma,b}>0$ be a constant that depends solely on $\sigma$ and $b$. Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds and the event $\mathcal{E}_{\delta}$ occurs. Then, for any $\mathbf{W}$ and any $\mathbf{x}\in\mathcal{X}$,

$$\big\|\nabla f_{\mathbf{W}}(\mathbf{x})\big\|_{2}\leq C_{\sigma,b}\,p\,\Big(\sqrt{p}+\sqrt{\frac{\log(1/\delta)}{m}}\Big)$$

and

$$\big\|\nabla^{2}f_{\mathbf{W}}(\mathbf{x})\big\|_{2}\leq\frac{C_{\sigma,b}\,p^{\frac{3}{2}}\big(\sqrt{p}+\sqrt{\log(m/\delta)}\big)}{\sqrt{m}}.$$

###### Proof\.

The proof is based on arguments from \[[60](https://arxiv.org/html/2605.12648#bib.bib19)\], where both $\mathbf{c}$ and $\mathbf{W}$ are trained. For completeness, we provide a detailed proof. We first estimate $\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}$. For all $v\in\mathbb{R}$, we denote $\mathbf{h}^{\prime}(v)=[b_{1}^{\prime}(v),\ldots,b_{p}^{\prime}(v)]^{\top}\in\mathbb{R}^{p}$ and $\mathbf{h}^{\prime\prime}(v)=[b_{1}^{\prime\prime}(v),\ldots,b_{p}^{\prime\prime}(v)]^{\top}\in\mathbb{R}^{p}$. Define

$$\mathbf{u}(\mathbf{x})=\sigma\big(\tfrac{1}{\sqrt{d}}\mathbf{W}\mathbf{h}(\mathbf{x})\big)\ \text{ and }\ \mathbf{D}(\mathbf{x})=\mathrm{diag}\big(\sigma^{\prime}(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x}))\big)_{i=1}^{m}\in\mathbb{R}^{m\times m}. \tag{2}$$

From the form $f_{\mathbf{W}}(\mathbf{x})=\frac{1}{\sqrt{m}}\mathbf{c}^{\top}\mathbf{h}(\mathbf{u}(\mathbf{x}))$, we have

$$\frac{\partial\mathbf{h}(\mathbf{u}(\mathbf{x}))}{\partial\mathbf{u}(\mathbf{x})}=\begin{bmatrix}\mathbf{h}^{\prime}(u_{1}(\mathbf{x}))&\mathbf{0}&\cdots&\mathbf{0}\\ \mathbf{0}&\mathbf{h}^{\prime}(u_{2}(\mathbf{x}))&\cdots&\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{0}&\mathbf{0}&\cdots&\mathbf{h}^{\prime}(u_{m}(\mathbf{x}))\end{bmatrix}\in\mathbb{R}^{mp\times m}$$

and

$$\frac{\partial\mathbf{u}(\mathbf{x})}{\partial\mathbf{w}_{i}}=\begin{bmatrix}\mathbf{0}\\ \vdots\\ \frac{1}{\sqrt{d}}\sigma^{\prime}\big(\frac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\mathbf{h}(\mathbf{x})^{\top}\\ \vdots\\ \mathbf{0}\end{bmatrix}\in\mathbb{R}^{m\times pd}.$$
According to the chain rule, for any $i\in[m]$, it holds that

$$\partial_{\mathbf{w}_{i}}f_{\mathbf{W}}(\mathbf{x})=\frac{\partial f_{\mathbf{W}}(\mathbf{x})}{\partial\mathbf{h}(\mathbf{u}(\mathbf{x}))}\ \frac{\partial\mathbf{h}(\mathbf{u}(\mathbf{x}))}{\partial\mathbf{u}(\mathbf{x})}\ \frac{\partial\mathbf{u}(\mathbf{x})}{\partial\mathbf{w}_{i}}=\frac{1}{\sqrt{md}}\big\langle\mathbf{c}_{i},\mathbf{h}^{\prime}(u_{i}(\mathbf{x}))\big\rangle\,\sigma^{\prime}\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\,\mathbf{h}(\mathbf{x})^{\top}.$$

Identifying $\mathbf{W}$ with its vectorization $\mathrm{Vec}\big(\{\mathbf{w}_{i}\}_{i=1}^{m}\big)\in\mathbb{R}^{mpd}$, it then holds that

$$\nabla f_{\mathbf{W}}(\mathbf{x})=\frac{1}{\sqrt{md}}\,\mathrm{Vec}\Big(\big\{\sigma^{\prime}\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\langle\mathbf{c}_{i},\mathbf{h}^{\prime}(u_{i}(\mathbf{x}))\rangle\,\mathbf{h}(\mathbf{x})\big\}_{i=1}^{m}\Big)\in\mathbb{R}^{mpd}.$$

Hence, we get

$$\begin{aligned}\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}&=\frac{1}{\sqrt{md}}\Big(\sum_{i=1}^{m}\sigma^{\prime}\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)^{2}\,\big|\langle\mathbf{c}_{i},\mathbf{h}^{\prime}(u_{i}(\mathbf{x}))\rangle\big|^{2}\,\|\mathbf{h}(\mathbf{x})\|_{2}^{2}\Big)^{\frac{1}{2}}\\ &\leq B_{\sigma}^{\prime}B_{b}\sqrt{\frac{p}{m}}\Big(\sum_{i=1}^{m}\big|\langle\mathbf{c}_{i},\mathbf{h}^{\prime}(u_{i}(\mathbf{x}))\rangle\big|^{2}\Big)^{\frac{1}{2}}\\ &\leq B_{\sigma}^{\prime}B_{b}B_{b}^{\prime}\sqrt{\frac{p^{2}}{m}}\Big(\sum_{i=1}^{m}\|\mathbf{c}_{i}\|_{2}^{2}\Big)^{\frac{1}{2}}\leq B_{\sigma}^{\prime}B_{b}B_{b}^{\prime}\,p\,\Big(4\sqrt{p}+2\sqrt{\frac{\log(2/\delta)}{m}}\Big),\end{aligned}\tag{3}$$

where the last inequality used ([1](https://arxiv.org/html/2605.12648#A2.E1)). The first part of the lemma is proved.

Now we turn to estimating the Hessian of $f_{\mathbf{W}}(\mathbf{x})$, i.e.,

$$\nabla^{2}f_{\mathbf{W}}(\mathbf{x})=\begin{bmatrix}\partial_{\mathbf{w}_{1}}^{2}f_{\mathbf{W}}(\mathbf{x})&\cdots&\mathbf{0}\\ \vdots&\ddots&\vdots\\ \mathbf{0}&\cdots&\partial_{\mathbf{w}_{m}}^{2}f_{\mathbf{W}}(\mathbf{x})\end{bmatrix}\in\mathbb{R}^{mpd\times mpd},$$

where

$$\begin{aligned}\partial_{\mathbf{w}_{i}}^{2}f_{\mathbf{W}}(\mathbf{x})=\frac{1}{d\sqrt{m}}\Big(&\sigma^{\prime\prime}\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\big\langle\mathbf{c}_{i},\mathbf{h}^{\prime}\big(\sigma\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\big)\big\rangle\\ &+\sigma^{\prime}\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)^{2}\big\langle\mathbf{c}_{i},\mathbf{h}^{\prime\prime}\big(\sigma\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\big)\big\rangle\Big)\mathbf{h}(\mathbf{x})\mathbf{h}(\mathbf{x})^{\top}\in\mathbb{R}^{pd\times pd}.\end{aligned}$$

We rewrite each diagonal block of $\nabla^{2}f_{\mathbf{W}}(\mathbf{x})$ as

$$\partial_{\mathbf{w}_{i}}^{2}f_{\mathbf{W}}(\mathbf{x})=\frac{1}{d\sqrt{m}}\langle\mathbf{c}_{i},\mathbf{v}_{i}\rangle\,\mathbf{h}(\mathbf{x})\mathbf{h}(\mathbf{x})^{\top}$$

with

$$\mathbf{v}_{i}=\sigma^{\prime\prime}\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\,\mathbf{h}^{\prime}\big(\sigma\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\big)+\sigma^{\prime}\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)^{2}\,\mathbf{h}^{\prime\prime}\big(\sigma\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\big).$$

According to Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2), we can control $\|\mathbf{v}_{i}\|_{2}$ as

$$\begin{aligned}\|\mathbf{v}_{i}\|_{2}&=\Big\|\sigma^{\prime\prime}\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\,\mathbf{h}^{\prime}\big(\sigma\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\big)+\sigma^{\prime}\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)^{2}\,\mathbf{h}^{\prime\prime}\big(\sigma\big(\tfrac{1}{\sqrt{d}}\mathbf{w}_{i}^{\top}\mathbf{h}(\mathbf{x})\big)\big)\Big\|_{2}\\ &\leq\|\sigma^{\prime\prime}\|_{\infty}\sqrt{p}\,\|b^{\prime}\|_{\infty}+\|\sigma^{\prime}\|_{\infty}^{2}\sqrt{p}\,\|b^{\prime\prime}\|_{\infty}\leq\sqrt{p}\,\big(B_{\sigma}^{\prime\prime}B_{b}^{\prime}+(B_{\sigma}^{\prime})^{2}B_{b}^{\prime\prime}\big).\end{aligned}$$

Combining the estimate of $\|\mathbf{v}_{i}\|_{2}$ and the fact that $\nabla^{2}f_{\mathbf{W}}(\mathbf{x})$ is a block diagonal matrix, we can use Lemma [B.1](https://arxiv.org/html/2605.12648#A2.Thmtheorem1) to get

$$\begin{aligned}\big\|\nabla^{2}f_{\mathbf{W}}(\mathbf{x})\big\|_{2}&=\max_{i\in[m]}\big\|\partial_{\mathbf{w}_{i}}^{2}f_{\mathbf{W}}(\mathbf{x})\big\|_{2}=\max_{i\in[m]}\sup_{\|\mathbf{a}\|_{2}=1}\big|\mathbf{a}^{\top}\,\partial_{\mathbf{w}_{i}}^{2}f_{\mathbf{W}}(\mathbf{x})\,\mathbf{a}\big|\\ &=\frac{1}{d\sqrt{m}}\max_{i\in[m]}\sup_{\|\mathbf{a}\|_{2}=1}\big|\langle\mathbf{c}_{i},\mathbf{v}_{i}\rangle\langle\mathbf{h}(\mathbf{x}),\mathbf{a}\rangle^{2}\big|\leq\frac{1}{d\sqrt{m}}\max_{i\in[m]}\|\mathbf{c}_{i}\|_{2}\,\|\mathbf{v}_{i}\|_{2}\,\|\mathbf{h}(\mathbf{x})\|_{2}^{2}\\ &\leq B_{b}^{2}\frac{p}{\sqrt{m}}\Big(4\sqrt{p}+2\sqrt{\log(2m/\delta)}\Big)\max_{i\in[m]}\|\mathbf{v}_{i}\|_{2}\\ &\leq B_{b}^{2}\big(B_{\sigma}^{\prime\prime}B_{b}^{\prime}+(B_{\sigma}^{\prime})^{2}B_{b}^{\prime\prime}\big)\frac{p^{\frac{3}{2}}}{\sqrt{m}}\Big(4\sqrt{p}+2\sqrt{\log(2m/\delta)}\Big),\end{aligned}\tag{4}$$

where the second equality used the fact that $\partial_{\mathbf{w}_{i}}^{2}f_{\mathbf{W}}(\mathbf{x})$ is symmetric, the first inequality used the Cauchy-Schwarz inequality, and the second inequality used the bound $\sup_{t\in\mathbb{R}}|b(t)|\leq B_{b}$ from Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) together with ([1](https://arxiv.org/html/2605.12648#A2.E1)). The proof is complete. ∎
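The closed-form gradient derived in this proof can be sanity-checked against central finite differences. This is a sketch under stated assumptions, not the paper's code: the basis $b_k(v)=\exp(-(v-t_k)^2/2)$, the activation $\sigma=\tanh$, the knots, and the sizes are all hypothetical choices, for which $B_{\sigma}^{\prime}=B_{b}=1$ and $B_{b}^{\prime}=e^{-1/2}$.

```python
import math
import random

def dot(a, b_):
    return sum(x * y for x, y in zip(a, b_))

def b(v, knots):
    # Hypothetical basis b_k(v) = exp(-(v - t_k)^2 / 2), so |b_k| <= 1.
    return [math.exp(-0.5 * (v - t) ** 2) for t in knots]

def db(v, knots):
    # Derivative b_k'(v); its sup norm is exp(-1/2).
    return [-(v - t) * math.exp(-0.5 * (v - t) ** 2) for t in knots]

def f(W, c, hx, knots, d):
    # f_W(x) = m^{-1/2} sum_i <c_i, h(u_i)>, with u_i = tanh(<w_i, h(x)> / sqrt(d)).
    m = len(W)
    return sum(dot(c[i], b(math.tanh(dot(W[i], hx) / math.sqrt(d)), knots))
               for i in range(m)) / math.sqrt(m)

def grad_block(W, c, hx, knots, d, i):
    # Closed form: (md)^{-1/2} <c_i, h'(u_i)> sigma'(.) h(x), where tanh' = 1 - tanh^2.
    m = len(W)
    u = math.tanh(dot(W[i], hx) / math.sqrt(d))
    scale = dot(c[i], db(u, knots)) * (1.0 - u * u) / math.sqrt(m * d)
    return [scale * v for v in hx]

rng = random.Random(0)
d, m, p = 3, 4, 5                                       # illustrative sizes
knots = [-1.0 + 2.0 * k / (p - 1) for k in range(p)]
x = [rng.uniform(-1, 1) for _ in range(d)]
hx = [val for xi in x for val in b(xi, knots)]          # h(x) in R^{dp}
W = [[rng.gauss(0, 1) for _ in range(d * p)] for _ in range(m)]
c = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(m)]

# Compare each coordinate of the closed form with a central finite difference.
eps, max_err = 1e-6, 0.0
for i in range(m):
    g = grad_block(W, c, hx, knots, d, i)
    for k in range(d * p):
        W[i][k] += eps
        f_plus = f(W, c, hx, knots, d)
        W[i][k] -= 2 * eps
        f_minus = f(W, c, hx, knots, d)
        W[i][k] += eps                                  # restore the entry
        max_err = max(max_err, abs((f_plus - f_minus) / (2 * eps) - g[k]))

grad_norm = math.sqrt(sum(v * v for i in range(m) for v in grad_block(W, c, hx, knots, d, i)))
c_norm = math.sqrt(sum(v * v for ci in c for v in ci))
bound = math.exp(-0.5) * p / math.sqrt(m) * c_norm      # third line of the gradient bound
print(max_err, grad_norm, bound)
```

The quantity `bound` is the intermediate estimate $B_{\sigma}^{\prime}B_{b}B_{b}^{\prime}\,(p/\sqrt{m})\,\|\mathbf{c}\|_{2}$ from the gradient bound, evaluated before the event $\mathcal{E}_{\delta}$ is used to control $\|\mathbf{c}\|_{2}$.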

The following lemma shows that the largest and smallest eigenvalues of $\nabla^{2}\ell(yf_{\mathbf{W}}(\mathbf{x}))$ admit well-controlled upper and lower bounds, respectively. As a consequence, the loss function $\ell(yf_{\mathbf{W}}(\mathbf{x}))$ is weakly convex and smooth with respect to $\mathbf{W}$.

###### Lemma B.7 (Smoothness and Curvature).

Let $\delta\in(0,1)$. Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds and the event $\mathcal{E}_{\delta}$ occurs. Assume $m\gtrsim\log(m/\delta)$. Then, for any $\mathbf{W}$ and any data point $(\mathbf{x},y)\in\mathcal{Z}$,

$$\lambda_{\min}\big(\nabla^{2}\ell(yf_{\mathbf{W}}(\mathbf{x}))\big)\geq-\frac{C_{\sigma,b}\,p^{\frac{3}{2}}\big(\sqrt{\log(m/\delta)}+\sqrt{p}\big)}{\sqrt{m}}\,\ell(yf_{\mathbf{W}}(\mathbf{x})),$$

$$\lambda_{\max}\big(\nabla^{2}\ell(yf_{\mathbf{W}}(\mathbf{x}))\big)\leq C_{\sigma,b}\,p^{3},$$

and

$$\big\|\nabla\mathcal{L}_{S}(\mathbf{W})\big\|_{2}^{2}\leq 4C_{\sigma,b}\,p^{3}\,\mathcal{L}_{S}(\mathbf{W}).$$

###### Proof\.

From the chain rule, the gradient of the loss is given by

$$\nabla\ell(yf_{\mathbf{W}}(\mathbf{x}))=\ell^{\prime}(yf_{\mathbf{W}}(\mathbf{x}))\,y\,\nabla f_{\mathbf{W}}(\mathbf{x}).$$

Note that

$$\nabla^{2}\ell(yf_{\mathbf{W}}(\mathbf{x}))=\ell^{\prime\prime}(yf_{\mathbf{W}}(\mathbf{x}))\,y^{2}\,\nabla f_{\mathbf{W}}(\mathbf{x})\nabla f_{\mathbf{W}}(\mathbf{x})^{\top}+\ell^{\prime}(yf_{\mathbf{W}}(\mathbf{x}))\,y\,\nabla^{2}f_{\mathbf{W}}(\mathbf{x}).$$

Since $\ell$ is convex, $\ell^{\prime\prime}(a)\geq 0$ for all $a\in\mathbb{R}$. Then $\ell^{\prime\prime}(yf_{\mathbf{W}}(\mathbf{x}))\nabla f_{\mathbf{W}}(\mathbf{x})\nabla f_{\mathbf{W}}(\mathbf{x})^{\top}$ is a PSD matrix. By further noting that $|y|=1$, $|\ell^{\prime}(a)|\leq 1$ and $|\ell^{\prime\prime}(a)|\leq 1/4$ for all $a\in\mathbb{R}$, we have

$$-|\ell^{\prime}(yf_{\mathbf{W}}(\mathbf{x}))|\,\big\|\nabla^{2}f_{\mathbf{W}}(\mathbf{x})\big\|_{2}\leq\lambda_{\min}\big(\nabla^{2}\ell(yf_{\mathbf{W}}(\mathbf{x}))\big)$$

and

$$\lambda_{\max}\big(\nabla^{2}\ell(yf_{\mathbf{W}}(\mathbf{x}))\big)\leq\frac{1}{4}\big\|\nabla f_{\mathbf{W}}(\mathbf{x})\big\|_{2}^{2}+\big\|\nabla^{2}f_{\mathbf{W}}(\mathbf{x})\big\|_{2}.$$
Plugging in the estimates of $\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}$ and $\|\nabla^{2}f_{\mathbf{W}}(\mathbf{x})\|_{2}$ from Lemma [B.6](https://arxiv.org/html/2605.12648#A2.Thmtheorem6) and noting that $|\ell^{\prime}(yf_{\mathbf{W}}(\mathbf{x}))|\leq\ell(yf_{\mathbf{W}}(\mathbf{x}))$, we obtain

$$\lambda_{\min}\big(\nabla^{2}\ell(yf_{\mathbf{W}}(\mathbf{x}))\big)\geq-\frac{C_{\sigma,b}\,p^{\frac{3}{2}}\big(\sqrt{\log(m/\delta)}+\sqrt{p}\big)}{\sqrt{m}}\,\ell(yf_{\mathbf{W}}(\mathbf{x}))$$

and

$$\lambda_{\max}\big(\nabla^{2}\ell(yf_{\mathbf{W}}(\mathbf{x}))\big)\leq C_{\sigma,b}\Big[p^{2}\Big(p+\frac{\log(1/\delta)}{m}\Big)+\frac{p^{\frac{3}{2}}\big(\sqrt{p}+\sqrt{\log(m/\delta)}\big)}{\sqrt{m}}\Big].$$

Furthermore, if $m\gtrsim\log(m/\delta)$, we have

$$\lambda_{\max}\big(\nabla^{2}\ell(yf_{\mathbf{W}}(\mathbf{x}))\big)\leq C_{\sigma,b}\,p^{3},$$

which completes the first two inequalities of the lemma.

Note that $C_{\sigma,b}\,p^{3}$ is an upper bound for the term $\sup_{\mathbf{x}\in\mathcal{X}}\frac{1}{4}\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}^{2}+\|\nabla^{2}f_{\mathbf{W}}(\mathbf{x})\|_{2}$ and that $|y|=1$ for any $y\in\mathcal{Y}$. Then, it holds for any $\mathbf{W}$ that

$$\begin{aligned}\big\|\nabla\mathcal{L}_{S}(\mathbf{W})\big\|_{2}^{2}&=\Big\|\frac{1}{n}\sum_{i=1}^{n}\ell^{\prime}(y_{i}f_{\mathbf{W}}(\mathbf{x}_{i}))\,y_{i}\,\nabla f_{\mathbf{W}}(\mathbf{x}_{i})\Big\|_{2}^{2}\leq\Big|\frac{1}{n}\sum_{i=1}^{n}\ell^{\prime}(y_{i}f_{\mathbf{W}}(\mathbf{x}_{i}))\Big|^{2}\sup_{i\in[n]}\big\|\nabla f_{\mathbf{W}}(\mathbf{x}_{i})\big\|_{2}^{2}\\ &\leq 4C_{\sigma,b}\,p^{3}\,\Big|\frac{1}{n}\sum_{i=1}^{n}\ell(y_{i}f_{\mathbf{W}}(\mathbf{x}_{i}))\Big|=4C_{\sigma,b}\,p^{3}\,\mathcal{L}_{S}(\mathbf{W}),\end{aligned}$$

where the second inequality used the self-boundedness property and $|\ell^{\prime}(\cdot)|\leq 1$ of the logistic loss. This completes the proof of the lemma. ∎
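The three scalar properties of the logistic loss used in this proof, namely $|\ell^{\prime}|\leq 1$, $0\leq\ell^{\prime\prime}\leq 1/4$, and the self-bounding inequality $|\ell^{\prime}(a)|\leq\ell(a)$, can be checked numerically. The sketch below takes $\ell(a)=\log(1+e^{-a})$ and evaluates it on an arbitrary grid; the grid range and tolerance are illustrative choices.

```python
import math

def loss(a):
    # Logistic loss l(a) = log(1 + exp(-a)).
    return math.log1p(math.exp(-a))

def dloss(a):
    # l'(a) = -1 / (1 + exp(a)); always negative.
    return -1.0 / (1.0 + math.exp(a))

def d2loss(a):
    # l''(a) = s (1 - s) with s = 1 / (1 + exp(-a)).
    s = 1.0 / (1.0 + math.exp(-a))
    return s * (1.0 - s)

grid = [-20.0 + 0.01 * k for k in range(4001)]          # a in [-20, 20]
tol = 1e-12                                             # slack for floating-point rounding

lipschitz_ok = all(abs(dloss(a)) <= 1.0 for a in grid)
self_bounded_ok = all(abs(dloss(a)) <= loss(a) + tol for a in grid)
curvature_ok = all(0.0 <= d2loss(a) <= 0.25 + tol for a in grid)
print(lipschitz_ok, self_bounded_ok, curvature_ok)
```

The self-bounding inequality follows from $s/(1+s)\leq\log(1+s)$ for $s=e^{-a}>0$, which is why the gradient norm can be controlled by the loss value itself.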

## Appendix C Proofs for DP-SGD with Correlated Noise

Throughout the paper, we use $\mathbf{W}$ to denote the vectorized parameter $(\mathbf{w}_{1}^{\top},\ldots,\mathbf{w}_{m}^{\top})^{\top}\in\mathbb{R}^{mdp}$, unless otherwise specified. We occasionally use the same symbol for the matrix representation in the definition of $f_{\mathbf{W}}(\mathbf{x})=\frac{1}{\sqrt{m}}\mathbf{c}^{\top}\mathbf{h}\big(\sigma\big(\frac{1}{\sqrt{d}}\mathbf{W}\mathbf{h}(\mathbf{x})\big)\big)$, when the meaning is clear from context.

Moreover, under $\mathcal{E}_{\delta}$, Lemma [B.6](https://arxiv.org/html/2605.12648#A2.Thmtheorem6) yields a uniform upper bound on the per-example gradient norm:

$$\|\nabla\ell(yf_{\mathbf{W}}(\mathbf{x}))\|_{2}\leq G_{\delta}:=B_{\sigma}^{\prime}B_{b}B_{b}^{\prime}\,p\,\Big(4\sqrt{p}+2\sqrt{\frac{\log(2/\delta)}{m}}\Big)\qquad\forall(\mathbf{x},y)\in\mathcal{Z},\ \forall\,\mathbf{W}. \tag{5}$$

Therefore, if the clipping threshold is chosen such that $C_{\mathrm{clip}}\geq G_{\delta}$, then the clipping operator is inactive throughout the entire training process on the event $\mathcal{E}_{\delta}$.
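To illustrate why a threshold $C_{\mathrm{clip}}\geq G_{\delta}$ makes clipping inactive, the sketch below uses the standard norm-rescaling clip $g\mapsto g\cdot\min(1,C_{\mathrm{clip}}/\|g\|_{2})$. That this is the exact operator of Algorithm 1 is an assumption here, and the default constants `B_sigma1`, `B_b`, `B_b1` (standing for $B_{\sigma}^{\prime}$, $B_{b}$, $B_{b}^{\prime}$) are placeholders.

```python
import math

def clip(g, c_clip):
    # Standard norm clipping (assumed form): rescale g only if its norm exceeds c_clip.
    norm = math.sqrt(sum(v * v for v in g))
    scale = min(1.0, c_clip / norm) if norm > 0 else 1.0
    return [scale * v for v in g]

def G_delta(p, m, delta, B_sigma1=1.0, B_b=1.0, B_b1=1.0):
    # Uniform per-example gradient bound G_delta from (5); the B constants are placeholders.
    return B_sigma1 * B_b * B_b1 * p * (4 * math.sqrt(p) + 2 * math.sqrt(math.log(2 / delta) / m))

threshold = G_delta(p=4, m=100, delta=0.1)
small = [3.0, 0.0, 0.0]                  # norm 3, below the threshold
large = [10.0 * threshold, 0.0, 0.0]     # norm far above the threshold

unchanged = clip(small, threshold)       # clipping leaves this vector untouched
shrunk = clip(large, threshold)          # this vector is rescaled to norm = threshold
print(threshold, unchanged, shrunk)
```

Any gradient whose norm is at most the threshold passes through unchanged, which is the situation guaranteed for every per-example gradient on $\mathcal{E}_{\delta}$.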

### C.1 Proofs for optimization of DP-SGD with correlated noise

The key idea of the proof is to combine the cancellation structure of the correlated perturbations with the localization argument for KANs\. Following the proof sketch in Section[4\.3](https://arxiv.org/html/2605.12648#S4.SS3), we organize the detailed proof into the following six steps\.

Step 1: Reduce to an auxiliary unprojected dynamics\.

The correlated perturbations break the conditional-centering argument, and the projection in Algorithm [1](https://arxiv.org/html/2605.12648#alg1) further obstructs the cross-iteration cancellation structure of the noise. To analyze this case, we introduce an auxiliary unprojected sequence $\{\widetilde{\mathbf{W}}_{t}\}_{t=0}^{T-1}$, defined recursively by

$$\widetilde{\mathbf{W}}_{0}=\mathbf{W}_{0},\qquad\widetilde{\mathbf{W}}_{t}=\widetilde{\mathbf{W}}_{t-1}-\eta\Big(\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}g^{\mathrm{aux}}_{t,i}+\frac{C_{\mathrm{clip}}}{B}\xi_{t}\Big),$$

where $g^{\mathrm{aux}}_{t,i}=\nabla\ell\big(y_{i}f_{\widetilde{\mathbf{W}}_{t-1}}(\mathbf{x}_{i})\big)$ for each $t\in[T]$ and $i\in[n]$.

We then define the mini-batch fluctuation at the auxiliary iterate $\widetilde{\mathbf{W}}_{t-1}$ by

$$\widetilde{\Delta}_{t}=\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}g^{\mathrm{aux}}_{t,i}-\frac{1}{n}\sum_{i=1}^{n}g^{\mathrm{aux}}_{t,i}=\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}g^{\mathrm{aux}}_{t,i}-\nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}),$$

which measures the discrepancy between the mini-batch gradient and the full empirical gradient.

Starting from $\widetilde{\mathbf{W}}_{0}=\mathbf{W}_{0}$ and $Z_{0}=\mathbf{0}$, the auxiliary unprojected iterate can be rewritten as

$$\widetilde{\mathbf{W}}_{t}=\widetilde{\mathbf{W}}_{t-1}-\eta\Big(\nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1})+\widetilde{\Delta}_{t}+\frac{C_{\mathrm{clip}}\kappa}{B}(Z_{t}-\lambda Z_{t-1})\Big),\qquad t\in[T].$$
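The cancellation structure of the perturbation $Z_{t}-\lambda Z_{t-1}$ can be seen by summing it over iterations: with $Z_{0}=\mathbf{0}$, the sum equals $Z_{T}+(1-\lambda)\sum_{t=1}^{T-1}Z_{t}$, so for $\lambda=1$ it telescopes to $Z_{T}$ alone. The sketch below checks this, assuming i.i.d. standard Gaussian $Z_{t}$ for $t\geq 1$; the distribution, the horizon `T`, and the dimension are assumptions for illustration.

```python
import random

def cumulative_noise(Z, lam):
    # Z[0] is Z_0 = 0; returns sum_{t=1}^{T} (Z_t - lam * Z_{t-1}), coordinate-wise.
    dim = len(Z[0])
    total = [0.0] * dim
    for t in range(1, len(Z)):
        for k in range(dim):
            total[k] += Z[t][k] - lam * Z[t - 1][k]
    return total

def norm(v):
    return sum(x * x for x in v) ** 0.5

rng = random.Random(0)
T, dim = 200, 8                                         # illustrative horizon and dimension
Z = [[0.0] * dim] + [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(T)]

indep = norm(cumulative_noise(Z, 0.0))                  # lam = 0: independent noise accumulates
corr = norm(cumulative_noise(Z, 1.0))                   # lam = 1: the sum telescopes to Z_T
print(indep, corr)
```

In this toy setting the accumulated independent noise grows with the number of steps, whereas the fully correlated sum stays at the scale of a single Gaussian draw.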
Step 2: Introduce the shifted iterate\.

To obtain the high-probability estimate above, one would naturally try to apply a comparator recursion to the auxiliary trajectory. However, the correlated perturbation $Z_{t}-\lambda Z_{t-1}$ is not conditionally centered with respect to the natural filtration. The term involving $Z_{t-1}$ is correlated with the current iterate and creates a first-order drift term.

To handle this obstruction, we define the shifted iterate

$$\mathbf{U}_{t}=\widetilde{\mathbf{W}}_{t}+\eta c_{\mathrm{priv}}Z_{t},\qquad t=0,1,\dots,T.$$

Here, $c_{\mathrm{priv}}=\frac{C_{\mathrm{clip}}\kappa}{B}$. The role of $\mathbf{U}_{t}$ is to absorb the current Gaussian perturbation into the state variable, which allows us to rewrite the correlated-noise dynamics in a form that is more amenable to one-step analysis.

Step 3: Derive a shifted potential recursion using the local KAN geometry\.

We now start the main proof route for $\lambda>0$. The key step is to absorb the current Gaussian noise into a shifted iterate so that the correlated perturbation can be handled at the *single-step* level. Define $g_{t}=\nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1})$.

###### Lemma C.1 (Shifted one-step identity).

For any comparator $\mathbf{W}^{*}\in\mathbb{R}^{mdp}$ and any $t\in[T-1]$,

$$\begin{aligned}\|\mathbf{U}_{t}-\mathbf{W}^{*}\|_{2}^{2}=\;&\|\mathbf{U}_{t-1}-\mathbf{W}^{*}\|_{2}^{2}-2\eta\big\langle g_{t},\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\big\rangle\\ &-2\eta\big\langle\widetilde{\Delta}_{t},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle-2(1-\lambda)\eta c_{\mathrm{priv}}\big\langle Z_{t-1},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle\\ &+\eta^{2}\|g_{t}+\widetilde{\Delta}_{t}\|_{2}^{2}+(1-\lambda)^{2}\eta^{2}c_{\mathrm{priv}}^{2}\|Z_{t-1}\|_{2}^{2}\\ &+2(1-\lambda)\eta^{2}c_{\mathrm{priv}}\big\langle\widetilde{\Delta}_{t},Z_{t-1}\big\rangle-2\lambda\eta^{2}c_{\mathrm{priv}}\big\langle g_{t},Z_{t-1}\big\rangle.\end{aligned}\tag{6}$$

###### Proof\.

By the definitions of $\widetilde{\mathbf{W}}_{t}$ and $\mathbf{U}_{t}$, we have

$$\begin{aligned}\mathbf{U}_{t}&=\widetilde{\mathbf{W}}_{t}+\eta c_{\mathrm{priv}}Z_{t}=\widetilde{\mathbf{W}}_{t-1}-\eta\big(g_{t}+\widetilde{\Delta}_{t}+c_{\mathrm{priv}}(Z_{t}-\lambda Z_{t-1})\big)+\eta c_{\mathrm{priv}}Z_{t}\\ &=\widetilde{\mathbf{W}}_{t-1}-\eta(g_{t}+\widetilde{\Delta}_{t})+\eta\lambda c_{\mathrm{priv}}Z_{t-1}.\end{aligned}$$

Plugging $\widetilde{\mathbf{W}}_{t-1}=\mathbf{U}_{t-1}-\eta c_{\mathrm{priv}}Z_{t-1}$ into the above equality, we get

$$\begin{aligned}\mathbf{U}_{t}&=\mathbf{U}_{t-1}-\eta c_{\mathrm{priv}}Z_{t-1}-\eta(g_{t}+\widetilde{\Delta}_{t})+\eta\lambda c_{\mathrm{priv}}Z_{t-1}\\ &=\mathbf{U}_{t-1}-\eta\big(g_{t}+\widetilde{\Delta}_{t}+(1-\lambda)c_{\mathrm{priv}}Z_{t-1}\big).\end{aligned}$$

Hence,

$$\mathbf{U}_{t}-\mathbf{W}^{*}=\mathbf{U}_{t-1}-\mathbf{W}^{*}-\eta\big(g_{t}+\widetilde{\Delta}_{t}+(1-\lambda)c_{\mathrm{priv}}Z_{t-1}\big).$$

It then follows that

$$\begin{aligned}\|\mathbf{U}_{t}-\mathbf{W}^{*}\|_{2}^{2}=\;&\|\mathbf{U}_{t-1}-\mathbf{W}^{*}\|_{2}^{2}-2\eta\big\langle g_{t}+\widetilde{\Delta}_{t}+(1-\lambda)c_{\mathrm{priv}}Z_{t-1},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle+\eta^{2}\big\|g_{t}+\widetilde{\Delta}_{t}+(1-\lambda)c_{\mathrm{priv}}Z_{t-1}\big\|_{2}^{2}\\ =\;&\|\mathbf{U}_{t-1}-\mathbf{W}^{*}\|_{2}^{2}-2\eta\big\langle g_{t},\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\big\rangle-2\eta\big\langle\widetilde{\Delta}_{t},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle-2(1-\lambda)\eta c_{\mathrm{priv}}\big\langle Z_{t-1},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle\\ &+\eta^{2}\|g_{t}+\widetilde{\Delta}_{t}\|_{2}^{2}+(1-\lambda)^{2}\eta^{2}c_{\mathrm{priv}}^{2}\|Z_{t-1}\|_{2}^{2}+2(1-\lambda)\eta^{2}c_{\mathrm{priv}}\big\langle\widetilde{\Delta}_{t},Z_{t-1}\big\rangle-2\lambda\eta^{2}c_{\mathrm{priv}}\big\langle g_{t},Z_{t-1}\big\rangle,\end{aligned}$$

where we have used $-2\eta\big\langle g_{t},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle=-2\eta\big\langle g_{t},\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\big\rangle-2\eta^{2}c_{\mathrm{priv}}\big\langle g_{t},Z_{t-1}\big\rangle$, implied by $\mathbf{U}_{t-1}=\widetilde{\mathbf{W}}_{t-1}+\eta c_{\mathrm{priv}}Z_{t-1}$. The proof is complete. ∎
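Since the shifted one-step identity is purely algebraic, it can be verified with random vectors. In the sketch below `g`, `Delta`, `Z_prev`, `Z_cur`, `W_prev` and `W_star` are arbitrary Gaussian draws standing in for $g_{t}$, $\widetilde{\Delta}_{t}$, $Z_{t-1}$, $Z_{t}$, $\widetilde{\mathbf{W}}_{t-1}$ and $\mathbf{W}^{*}$; the values of `eta`, `lam` and `c_priv` are illustrative assumptions.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scaled(alpha, a):
    return [alpha * x for x in a]

rng = random.Random(1)
dim, eta, lam, c_priv = 6, 0.1, 0.7, 0.5                # illustrative values

def rand_vec():
    return [rng.gauss(0, 1) for _ in range(dim)]

W_prev, W_star, g, Delta, Z_prev, Z_cur = (rand_vec() for _ in range(6))

# One auxiliary step, followed by the shifted iterates U_{t-1} and U_t.
noise = sub(Z_cur, scaled(lam, Z_prev))                 # Z_t - lam * Z_{t-1}
W_cur = sub(W_prev, scaled(eta, add(add(g, Delta), scaled(c_priv, noise))))
U_prev = add(W_prev, scaled(eta * c_priv, Z_prev))
U_cur = add(W_cur, scaled(eta * c_priv, Z_cur))

lhs = dot(sub(U_cur, W_star), sub(U_cur, W_star))
rhs = (dot(sub(U_prev, W_star), sub(U_prev, W_star))
       - 2 * eta * dot(g, sub(W_prev, W_star))
       - 2 * eta * dot(Delta, sub(U_prev, W_star))
       - 2 * (1 - lam) * eta * c_priv * dot(Z_prev, sub(U_prev, W_star))
       + eta ** 2 * dot(add(g, Delta), add(g, Delta))
       + (1 - lam) ** 2 * eta ** 2 * c_priv ** 2 * dot(Z_prev, Z_prev)
       + 2 * (1 - lam) * eta ** 2 * c_priv * dot(Delta, Z_prev)
       - 2 * lam * eta ** 2 * c_priv * dot(g, Z_prev))
print(abs(lhs - rhs))
```

Note that `Z_cur` does not appear on the right-hand side: the shift absorbs the current perturbation, so only $Z_{t-1}$ remains in the one-step expansion.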

For a fixed comparator $\mathbf{W}^{*}\in\mathbb{R}^{mdp}$ and a shifted radius $\bar{R}>0$, define

$$A_{\bar{R}}^{U}(t)=\Big\{\max_{0\leq s\leq t}\|\mathbf{U}_{s}-\mathbf{W}^{*}\|_{2}\leq\bar{R}\Big\}\qquad\text{and}\qquad A_{\bar{R}}^{U}=A_{\bar{R}}^{U}(T).$$

For $z>0$ and $V_{Z}>0$, define the Gaussian event

$$G_{Z}(z,V_{Z})=\Big\{\max_{0\leq t\leq T-1}\|Z_{t}\|_{2}\leq z\quad\text{and}\quad\sum_{t=1}^{T}\|Z_{t-1}\|_{2}^{2}\leq V_{Z}\Big\}.$$

For $V_{\Delta}>0$, define the stopped mini-batch quadratic event

$$G_{\Delta^{2}}(\bar{R},V_{\Delta})=\Big\{\sum_{t=1}^{T}\|\widetilde{\Delta}_{t}\|_{2}^{2}\,\mathbf{1}_{A_{\bar{R}}^{U}(t-1)}\leq V_{\Delta}\Big\}.$$

Define the stopped potential fluctuation process

$$\begin{aligned}M_{k}=\sum_{t=1}^{k}\mathbf{1}_{A_{\bar{R}}^{U}(t-1)}\Big[&-2\eta\big\langle\widetilde{\Delta}_{t},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle-2(1-\lambda)\eta c_{\mathrm{priv}}\big\langle Z_{t-1},\mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle\\ &+2(1-\lambda)\eta^{2}c_{\mathrm{priv}}\big\langle\widetilde{\Delta}_{t},Z_{t-1}\big\rangle\Big],\end{aligned}$$

and define

$$G_{\mathrm{pot}}(\bar{R},M):=\Big\{\max_{1\leq k\leq T}M_{k}\leq M\Big\}.$$

Finally, define the shifted good event

$$\mathcal{E}_{\mathrm{good}}=G_{Z}(z,V_{Z})\cap G_{\Delta^{2}}(\bar{R},V_{\Delta})\cap G_{\mathrm{pot}}(\bar{R},M).$$

###### Lemma C.2 (Localized comparator under shifted localization).

Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds and the event $\mathcal{E}_{\delta}$ occurs. Let $\mathbf{W}^{*}\in\mathbb{R}^{mdp}$ be fixed. Assume $C_{\mathrm{clip}}\geq G_{\delta}$, where $G_{\delta}$ is defined in ([5](https://arxiv.org/html/2605.12648#A3.E5)), and assume

$$m\gtrsim C_{\sigma,b}^{2}\,p^{3}\bigl(\log(m/\delta)+p\bigr)\bigl(\bar{R}+\eta c_{\mathrm{priv}}z\bigr)^{4}. \tag{7}$$

Then, on the events $A^{U}_{\bar{R}}(t-1)$ and $G_{Z}(z,V_{Z})$, it holds that

$$\frac{2}{3}\,\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1})\leq\big\langle g_{t},\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\big\rangle+\frac{4}{3}\,\mathcal{L}_{S}(\mathbf{W}^{*}).$$

###### Proof.

First, by Lemma [B.7](https://arxiv.org/html/2605.12648#A2.Thmtheorem7), for every data point $(\mathbf{x}_{i},y_{i})$ and every $\mathbf{V}\in\mathbb{R}^{mdp}$,

$$\lambda_{\min}\left(\nabla^{2}\ell(y_{i}f_{\mathbf{V}}(\mathbf{x}_{i}))\right) \geq -\nu\,\ell(y_{i}f_{\mathbf{V}}(\mathbf{x}_{i})) \quad\text{with}\quad \nu = \frac{C_{\sigma,b}\,p^{3/2}\bigl(\sqrt{\log(m/\delta)}+\sqrt{p}\bigr)}{\sqrt{m}}.$$

Hence, since $\mathcal{L}_{S}$ is the average of the per-sample losses and $\lambda_{\min}$ is superadditive on symmetric matrices,

$$\lambda_{\min}\left(\nabla^{2}\mathcal{L}_{S}(\mathbf{V})\right) \geq -\nu\,\mathcal{L}_{S}(\mathbf{V}).$$

Under the events $A^{U}_{\bar{R}}(t-1)$ and $G_{Z}(z,V_{Z})$, it holds that

$$\|\mathbf{U}_{t-1}-\mathbf{W}^{*}\|_{2} \leq \bar{R} \qquad\text{and}\qquad \|Z_{t-1}\|_{2} \leq z.$$

Since $\widetilde{\mathbf{W}}_{t-1} = \mathbf{U}_{t-1}-\eta c_{\mathrm{priv}} Z_{t-1}$, the triangle inequality gives

$$\|\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\|_{2} \leq \|\mathbf{U}_{t-1}-\mathbf{W}^{*}\|_{2} + \eta c_{\mathrm{priv}}\|Z_{t-1}\|_{2} \leq \bar{R}+\eta c_{\mathrm{priv}} z.$$

Taking $D = \bar{R}+\eta c_{\mathrm{priv}} z$, the assumption $m \gtrsim C_{\sigma,b}^{2}\,p^{3}\bigl(\log(m/\delta)+p\bigr)\bigl(\bar{R}+\eta c_{\mathrm{priv}} z\bigr)^{4}$ guarantees

$$\nu D^{2} \leq \frac{1}{2}.$$

Recall that we defined the line segment $[\widetilde{\mathbf{W}}_{t-1},\mathbf{W}^{*}] = \{\alpha\widetilde{\mathbf{W}}_{t-1}+(1-\alpha)\mathbf{W}^{*} : \alpha\in[0,1]\}$. By Taylor's theorem, there exists a point $\mathbf{V}\in[\widetilde{\mathbf{W}}_{t-1},\mathbf{W}^{*}]$ such that

$$\mathcal{L}_{S}(\mathbf{W}^{*}) \geq \mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) + \left\langle \nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}), \mathbf{W}^{*}-\widetilde{\mathbf{W}}_{t-1}\right\rangle - \frac{\nu}{2}\,\mathcal{L}_{S}(\mathbf{V})\,\|\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\|_{2}^{2}.$$

Since $\|\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\|_{2}\leq D$, it follows that

$$\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) - \mathcal{L}_{S}(\mathbf{W}^{*}) \leq \left\langle \nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}), \widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\right\rangle + \frac{\nu D^{2}}{2} \max_{\mathbf{V}\in[\widetilde{\mathbf{W}}_{t-1},\mathbf{W}^{*}]} \mathcal{L}_{S}(\mathbf{V}). \tag{8}$$

On the other hand, Lemma [B.4](https://arxiv.org/html/2605.12648#A2.Thmtheorem4) with $\lambda_{\min}\left(\nabla^{2}\mathcal{L}_{S}(\mathbf{V})\right)\geq-\nu\,\mathcal{L}_{S}(\mathbf{V})$ gives

$$\max_{\mathbf{V}\in[\widetilde{\mathbf{W}}_{t-1},\mathbf{W}^{*}]} \mathcal{L}_{S}(\mathbf{V}) \leq \rho\,\max\{\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}), \mathcal{L}_{S}(\mathbf{W}^{*})\},$$

where $\rho = \big(1-\frac{\nu D^{2}}{2}\big)^{-1}$. Since $\nu D^{2}\leq 1/2$, we have

$$\frac{\nu D^{2}}{2}\,\rho = \frac{\nu D^{2}}{2-\nu D^{2}} \leq \frac{1}{3}.$$

Therefore, combining the above inequality with ([8](https://arxiv.org/html/2605.12648#A3.E8)) and bounding the maximum of the two nonnegative losses by their sum yields

$$\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) - \mathcal{L}_{S}(\mathbf{W}^{*}) \leq \big\langle \nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}), \widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\big\rangle + \frac{1}{3}\big(\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) + \mathcal{L}_{S}(\mathbf{W}^{*})\big).$$

Rearranging gives

$$\frac{2}{3}\,\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) \leq \big\langle \nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}), \widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\big\rangle + \frac{4}{3}\,\mathcal{L}_{S}(\mathbf{W}^{*}). \tag{9}$$

This completes the proof. ∎
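The constant $1/3$ in the last step can be checked numerically. The following is a minimal Python sketch, not part of the paper: the helper name `comparator_slack` is hypothetical, and its argument `nu_d2` stands for the product $\nu D^{2}$. It evaluates $\frac{\nu D^{2}}{2}\rho$ with $\rho=(1-\nu D^{2}/2)^{-1}$ on a grid over $[0,1/2]$.

```python
def comparator_slack(nu_d2: float) -> float:
    """Return (nu*D^2 / 2) * rho, where rho = 1 / (1 - nu*D^2 / 2).

    `nu_d2` is the product nu * D^2 (hypothetical argument name).
    """
    if not 0.0 <= nu_d2 < 2.0:
        raise ValueError("nu_d2 must lie in [0, 2) so that rho is finite and positive")
    rho = 1.0 / (1.0 - nu_d2 / 2.0)
    return (nu_d2 / 2.0) * rho


# Scan nu*D^2 over [0, 1/2], the range guaranteed by the width condition (7).
grid = [0.5 * k / 1000 for k in range(1001)]
worst_slack = max(comparator_slack(x) for x in grid)
print(f"largest slack on [0, 1/2]: {worst_slack:.6f}")
```

The slack is increasing in $\nu D^{2}$, so the largest value on the grid is attained at the endpoint $\nu D^{2}=1/2$.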

We next show that, on the shifted good event $\mathcal{E}_{\rm good}$, the shifted iterate remains localized and yields a pathwise empirical loss bound. In particular, the projection becomes inactive under a suitable choice of $R_{*}$.

###### Lemma C.3 (Conditional shifted-potential bootstrap and loss bound).

Suppose $0<\lambda<1$, Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds, and the event $\mathcal{E}_{\delta}$ occurs. Assume $C_{\mathrm{clip}}\geq G_{\delta}$, where $G_{\delta}$ is defined in ([5](https://arxiv.org/html/2605.12648#A3.E5)). Let $\beta = C_{\sigma,b}\,p^{3}$. Fix a comparator $\mathbf{W}^{*}$ and a shifted radius $\bar{R}>0$, and assume $\eta\leq\frac{1}{12\beta}$, ([7](https://arxiv.org/html/2605.12648#A3.E7)), and

$$\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2} + \frac{8}{3}\eta T\,\mathcal{L}_{S}(\mathbf{W}^{*}) + M + 2\eta^{2}V_{\Delta} + \big((1-\lambda)^{2}\eta^{2}c_{\mathrm{priv}}^{2} + 12\beta\lambda^{2}\eta^{3}c_{\mathrm{priv}}^{2}\big)V_{Z} \leq \bar{R}^{2}. \tag{10}$$

Then, on the good event $\mathcal{E}_{\rm good}$, the localization event $A^{U}_{\bar{R}}$ also occurs and it holds that

$$\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) \leq 8\,\mathcal{L}_{S}(\mathbf{W}^{*}) + \frac{3\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2}}{\eta T} + \frac{3M}{\eta T} + \frac{6\eta V_{\Delta}}{T} + 3\Big((1-\lambda)^{2}\eta + 12\beta\lambda^{2}\eta^{2}\Big)c_{\mathrm{priv}}^{2}\frac{V_{Z}}{T}.$$

If we further assume $R_{*}\geq\bar{R}+\|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2}+\eta c_{\mathrm{priv}} z$, then

$$\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{S}(\mathbf{W}_{t-1}) \leq 8\,\mathcal{L}_{S}(\mathbf{W}^{*}) + \frac{3\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2}}{\eta T} + \frac{3M}{\eta T} + \frac{6\eta V_{\Delta}}{T} + 3\Big((1-\lambda)^{2}\eta + 12\beta\lambda^{2}\eta^{2}\Big)c_{\mathrm{priv}}^{2}\frac{V_{Z}}{T}.$$

###### Proof.

Fix an outcome in $\mathcal{E}_{\rm good}$. We first show, by contradiction, that the shifted localization event $A_{\bar{R}}^{U}$ holds. Suppose $A_{\bar{R}}^{U}$ does not hold. Since $\|\mathbf{U}_{0}-\mathbf{W}^{*}\|_{2}^{2} = \|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2} \leq \bar{R}^{2}$ by ([10](https://arxiv.org/html/2605.12648#A3.E10)), the first exit time $\tau = \min\big\{t\in[T] : \|\mathbf{U}_{t}-\mathbf{W}^{*}\|_{2} > \bar{R}\big\}$ is well-defined. Then $\tau\in[T]$ and $A_{\bar{R}}^{U}(\tau-1)$ holds. Hence, for every $1\leq t\leq\tau$,

$$\mathbf{1}_{A_{\bar{R}}^{U}(t-1)} = 1.$$

For every $t\leq\tau$, since $A_{\bar{R}}^{U}(t-1)$ and $G_{Z}(z,V_{Z})$ hold, we have

$$\|\widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\|_{2} = \|\mathbf{U}_{t-1}-\eta c_{\mathrm{priv}} Z_{t-1}-\mathbf{W}^{*}\|_{2} \leq \bar{R}+\eta c_{\mathrm{priv}} z.$$
By Lemma [C.1](https://arxiv.org/html/2605.12648#A3.Thmtheorem1), for every $t\leq\tau$,

$$\begin{aligned} \|\mathbf{U}_{t}-\mathbf{W}^{*}\|_{2}^{2} ={}\; & \|\mathbf{U}_{t-1}-\mathbf{W}^{*}\|_{2}^{2} - 2\eta\big\langle g_{t}, \widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\big\rangle \\ & - 2\eta\big\langle \widetilde{\Delta}_{t}, \mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle - 2(1-\lambda)\eta c_{\mathrm{priv}}\big\langle Z_{t-1}, \mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle \\ & + \eta^{2}\|g_{t}+\widetilde{\Delta}_{t}\|_{2}^{2} + (1-\lambda)^{2}\eta^{2}c_{\mathrm{priv}}^{2}\|Z_{t-1}\|_{2}^{2} \\ & + 2(1-\lambda)\eta^{2}c_{\mathrm{priv}}\big\langle \widetilde{\Delta}_{t}, Z_{t-1}\big\rangle - 2\lambda\eta^{2}c_{\mathrm{priv}}\big\langle g_{t}, Z_{t-1}\big\rangle. \end{aligned}$$

By Lemma [C.2](https://arxiv.org/html/2605.12648#A3.Thmtheorem2),

$$-2\eta\big\langle g_{t}, \widetilde{\mathbf{W}}_{t-1}-\mathbf{W}^{*}\big\rangle \leq -\frac{4\eta}{3}\,\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) + \frac{8\eta}{3}\,\mathcal{L}_{S}(\mathbf{W}^{*}).$$

Moreover, Lemma [B.7](https://arxiv.org/html/2605.12648#A2.Thmtheorem7) gives

$$\|g_{t}\|_{2}^{2} \leq 4\beta\,\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) \quad\text{with}\quad \beta = C_{\sigma,b}\,p^{3}.$$

Hence,

$$\eta^{2}\|g_{t}+\widetilde{\Delta}_{t}\|_{2}^{2} \leq 2\eta^{2}\|g_{t}\|_{2}^{2} + 2\eta^{2}\|\widetilde{\Delta}_{t}\|_{2}^{2} \leq 8\beta\eta^{2}\,\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) + 2\eta^{2}\|\widetilde{\Delta}_{t}\|_{2}^{2}.$$

Furthermore, applying Young's inequality $2ab\leq\rho a^{2}+\rho^{-1}b^{2}$ with $a=\sqrt{\eta}\,\|g_{t}\|_{2}$, $b=\lambda\eta^{3/2}c_{\mathrm{priv}}\,\|Z_{t-1}\|_{2}$, and $\rho=\frac{1}{12\beta}$, we obtain

$$\begin{aligned} -2\lambda\eta^{2}c_{\mathrm{priv}}\langle g_{t}, Z_{t-1}\rangle & \leq 2\lambda\eta^{2}c_{\mathrm{priv}}\,|\langle g_{t}, Z_{t-1}\rangle| \\ & \leq \frac{\eta}{12\beta}\|g_{t}\|_{2}^{2} + 12\beta\lambda^{2}\eta^{3}c_{\mathrm{priv}}^{2}\|Z_{t-1}\|_{2}^{2} \\ & \leq \frac{\eta}{3}\,\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) + 12\beta\lambda^{2}\eta^{3}c_{\mathrm{priv}}^{2}\|Z_{t-1}\|_{2}^{2}. \end{aligned}$$

Combining the above estimates with $-\frac{4\eta}{3}+8\beta\eta^{2}+\frac{\eta}{3} = -\eta+8\beta\eta^{2} \leq -\frac{\eta}{3}$, which holds because $\eta\leq\frac{1}{12\beta}$, gives

$$\begin{aligned} \|\mathbf{U}_{t}-\mathbf{W}^{*}\|_{2}^{2} \leq{}\; & \|\mathbf{U}_{t-1}-\mathbf{W}^{*}\|_{2}^{2} - \frac{\eta}{3}\,\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) + \frac{8\eta}{3}\,\mathcal{L}_{S}(\mathbf{W}^{*}) \\ & - 2\eta\big\langle \widetilde{\Delta}_{t}, \mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle - 2(1-\lambda)\eta c_{\mathrm{priv}}\big\langle Z_{t-1}, \mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle \\ & + 2(1-\lambda)\eta^{2}c_{\mathrm{priv}}\big\langle \widetilde{\Delta}_{t}, Z_{t-1}\big\rangle + 2\eta^{2}\|\widetilde{\Delta}_{t}\|_{2}^{2} \\ & + \big((1-\lambda)^{2}\eta^{2}c_{\mathrm{priv}}^{2} + 12\beta\lambda^{2}\eta^{3}c_{\mathrm{priv}}^{2}\big)\|Z_{t-1}\|_{2}^{2}. \end{aligned}$$

Summing from $t=1$ to $\tau$, and using $\mathbf{1}_{A_{\bar{R}}^{U}(t-1)}=1$ for $1\leq t\leq\tau$ together with $\tau\leq T$, we obtain

$$\begin{aligned} \|\mathbf{U}_{\tau}-\mathbf{W}^{*}\|_{2}^{2} + \frac{\eta}{3}\sum_{t=1}^{\tau}\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) \leq{}\; & \|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2} + \frac{8}{3}\eta T\,\mathcal{L}_{S}(\mathbf{W}^{*}) + M_{\tau} + 2\eta^{2}\sum_{t=1}^{\tau}\|\widetilde{\Delta}_{t}\|_{2}^{2} \\ & + \Big((1-\lambda)^{2}\eta^{2}c_{\mathrm{priv}}^{2} + 12\beta\lambda^{2}\eta^{3}c_{\mathrm{priv}}^{2}\Big)\sum_{t=1}^{\tau}\|Z_{t-1}\|_{2}^{2}. \end{aligned} \tag{11}$$
On $G_{\mathrm{pot}}(\bar{R},M)$, $G_{\Delta^{2}}(\bar{R},V_{\Delta})$, and $G_{Z}(z,V_{Z})$, since the stopping indicators equal one for all $t\leq\tau$, it holds that

$$M_{\tau}\leq M, \quad \sum_{t=1}^{\tau}\|\widetilde{\Delta}_{t}\|_{2}^{2} \leq \sum_{t=1}^{T}\|\widetilde{\Delta}_{t}\|_{2}^{2}\,\mathbf{1}_{A_{\bar{R}}^{U}(t-1)} \leq V_{\Delta} \quad\text{and}\quad \sum_{t=1}^{\tau}\|Z_{t-1}\|_{2}^{2} \leq V_{Z}.$$

Hence,

$$\begin{aligned} \|\mathbf{U}_{\tau}-\mathbf{W}^{*}\|_{2}^{2} + \frac{\eta}{3}\sum_{t=1}^{\tau}\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) & \leq \|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2} + \frac{8}{3}\eta T\,\mathcal{L}_{S}(\mathbf{W}^{*}) + M + 2\eta^{2}V_{\Delta} + \Big((1-\lambda)^{2}\eta^{2}c_{\mathrm{priv}}^{2} + 12\beta\lambda^{2}\eta^{3}c_{\mathrm{priv}}^{2}\Big)V_{Z} \\ & \leq \bar{R}^{2}. \end{aligned}$$

Thus, since the losses are nonnegative,

$$\|\mathbf{U}_{\tau}-\mathbf{W}^{*}\|_{2}^{2} \leq \bar{R}^{2},$$

which contradicts the definition of $\tau$. Therefore $A_{\bar{R}}^{U}$ holds.

Under $A_{\bar{R}}^{U}$, the same argument yields ([11](https://arxiv.org/html/2605.12648#A3.E11)) with $\tau$ replaced by $T$. Dropping the nonnegative term $\|\mathbf{U}_{T}-\mathbf{W}^{*}\|_{2}^{2}$, we get

$$\frac{\eta}{3}\sum_{t=1}^{T}\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) \leq \|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2} + \frac{8}{3}\eta T\,\mathcal{L}_{S}(\mathbf{W}^{*}) + M + 2\eta^{2}V_{\Delta} + \Big((1-\lambda)^{2}\eta^{2}c_{\mathrm{priv}}^{2} + 12\beta\lambda^{2}\eta^{3}c_{\mathrm{priv}}^{2}\Big)V_{Z}.$$

Dividing both sides by $\eta T/3$ yields

$$\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1}) \leq 8\,\mathcal{L}_{S}(\mathbf{W}^{*}) + \frac{3\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2}}{\eta T} + \frac{3M}{\eta T} + \frac{6\eta V_{\Delta}}{T} + 3\Big((1-\lambda)^{2}\eta + 12\beta\lambda^{2}\eta^{2}\Big)c_{\mathrm{priv}}^{2}\frac{V_{Z}}{T},$$

which proves the first part of the lemma.

Now we prove the second part of the lemma. On $A_{\bar{R}}^{U}\cap G_{Z}(z,V_{Z})$, for every $t=0,1,\dots,T$,

$$\begin{aligned} \|\widetilde{\mathbf{W}}_{t}-\mathbf{W}_{0}\|_{2} & = \|\mathbf{U}_{t}-\eta c_{\mathrm{priv}} Z_{t}-\mathbf{W}_{0}\|_{2} \\ & \leq \|\mathbf{U}_{t}-\mathbf{W}^{*}\|_{2} + \|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2} + \eta c_{\mathrm{priv}}\|Z_{t}\|_{2} \\ & \leq \bar{R} + \|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2} + \eta c_{\mathrm{priv}} z. \end{aligned}$$

If we choose $R_{*}\geq\bar{R}+\|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2}+\eta c_{\mathrm{priv}} z$, then

$$\widetilde{\mathbf{W}}_{t}\in\mathcal{K}=\mathcal{B}(\mathbf{W}_{0},R_{*}), \qquad \forall t=0,1,\dots,T.$$

By induction, the projected and auxiliary unprojected iterates coincide. The claim is trivial at $t=0$. If $\mathbf{W}_{t-1}=\widetilde{\mathbf{W}}_{t-1}$, then the projected and unprojected updates have the same pre-projection point, namely $\widetilde{\mathbf{W}}_{t}$. Since $\widetilde{\mathbf{W}}_{t}\in\mathcal{K}$, projection leaves it unchanged:

$$\mathbf{W}_{t} = \Pi_{\mathcal{K}}(\widetilde{\mathbf{W}}_{t}) = \widetilde{\mathbf{W}}_{t}.$$

Thus

$$\mathbf{W}_{t} = \widetilde{\mathbf{W}}_{t}, \qquad \forall t=0,1,\dots,T.$$

Therefore,

$$\frac{1}{T}\sum_{t=1}^{T}\mathcal{L}_{S}(\mathbf{W}_{t-1}) \leq 8\,\mathcal{L}_{S}(\mathbf{W}^{*}) + \frac{3\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2}}{\eta T} + \frac{3M}{\eta T} + \frac{6\eta V_{\Delta}}{T} + 3\Big((1-\lambda)^{2}\eta + 12\beta\lambda^{2}\eta^{2}\Big)c_{\mathrm{priv}}^{2}\frac{V_{Z}}{T}.$$

This completes the proof. ∎
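The loss bound of Lemma C.3 is exactly the left-hand side of condition (10) rescaled by $3/(\eta T)$. The following Python sketch makes that relation concrete. It is an illustration only: the class `BootstrapInputs`, the two function names, and all numerical values are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class BootstrapInputs:
    """Quantities appearing in Lemma C.3 (illustrative container, hypothetical name)."""
    dist0_sq: float   # ||W_0 - W*||_2^2
    loss_star: float  # L_S(W*)
    M: float          # bound on the stopped potential fluctuation
    V_delta: float    # bound on the stopped mini-batch quadratic variation
    V_Z: float        # bound on the sum of squared noise norms
    eta: float        # step size, assumed <= 1 / (12 * beta)
    T: int            # number of iterations
    lam: float        # correlation parameter lambda in (0, 1)
    c_priv: float     # noise scale
    beta: float       # smoothness-type constant C_{sigma,b} * p^3


def potential_budget(p: BootstrapInputs) -> float:
    """Left-hand side of condition (10); it must not exceed R_bar^2."""
    noise_coeff = ((1 - p.lam) ** 2 * p.eta ** 2
                   + 12 * p.beta * p.lam ** 2 * p.eta ** 3) * p.c_priv ** 2
    return (p.dist0_sq + (8 / 3) * p.eta * p.T * p.loss_star + p.M
            + 2 * p.eta ** 2 * p.V_delta + noise_coeff * p.V_Z)


def average_loss_bound(p: BootstrapInputs) -> float:
    """Right-hand side of the averaged empirical loss bound in Lemma C.3."""
    noise_coeff = ((1 - p.lam) ** 2 * p.eta
                   + 12 * p.beta * p.lam ** 2 * p.eta ** 2) * p.c_priv ** 2
    return (8 * p.loss_star + 3 * p.dist0_sq / (p.eta * p.T) + 3 * p.M / (p.eta * p.T)
            + 6 * p.eta * p.V_delta / p.T + 3 * noise_coeff * p.V_Z / p.T)


# Arbitrary illustrative values; eta = 0.01 satisfies eta <= 1 / (12 * beta).
example = BootstrapInputs(dist0_sq=1.0, loss_star=0.1, M=0.5, V_delta=2.0, V_Z=50.0,
                          eta=0.01, T=100, lam=0.5, c_priv=0.2, beta=4.0)
print("budget (LHS of (10)):", potential_budget(example))
print("average loss bound:  ", average_loss_bound(example))
```

In this reading, any admissible radius $\bar{R}$ satisfies `potential_budget(example) <= R_bar**2`, and the average loss is then at most $3\bar{R}^{2}/(\eta T)$.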

Step 4: High-probability control and a stopped optimization bound.

Now, we show that the good event $\mathcal{E}_{\rm good}$ holds with high probability. We need the following Laurent–Massart chi-square tail bound [[31](https://arxiv.org/html/2605.12648#bib.bib706)].

###### Lemma C.4.

Let $G\sim\mathcal{N}(0,\mathbf{I}_{d})$. Then $\|G\|_{2}^{2}\sim\chi_{d}^{2}$, and for every $\delta\in(0,1)$,

$$\mathbb{P}\left(\|G\|_{2}^{2} \leq d + 2\sqrt{d\log(1/\delta)} + 2\log(1/\delta)\right) \geq 1-\delta.$$
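A small Monte Carlo experiment illustrates how conservative this tail bound is in practice. The sketch below uses only the Python standard library. The function names and the choices of `d`, `delta`, `trials`, and `seed` are illustrative assumptions, not values from the paper.

```python
import math
import random


def laurent_massart_threshold(d: int, delta: float) -> float:
    """Upper threshold d + 2*sqrt(d*log(1/delta)) + 2*log(1/delta) from Lemma C.4."""
    log_term = math.log(1.0 / delta)
    return d + 2.0 * math.sqrt(d * log_term) + 2.0 * log_term


def empirical_coverage(d: int, delta: float, trials: int, seed: int = 0) -> float:
    """Fraction of simulated ||G||_2^2 values, G ~ N(0, I_d), that fall below the threshold."""
    rng = random.Random(seed)
    threshold = laurent_massart_threshold(d, delta)
    hits = 0
    for _ in range(trials):
        squared_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if squared_norm <= threshold:
            hits += 1
    return hits / trials


coverage = empirical_coverage(d=20, delta=0.05, trials=2000)
print(f"empirical coverage: {coverage:.4f} (lemma guarantees at least 0.95)")
```

The empirical coverage should sit at or above $1-\delta$, up to Monte Carlo error, since the lemma is a non-asymptotic upper bound rather than an exact quantile.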

###### Lemma C.5 (High-probability control of the Gaussian noises).

Suppose $Z_{1},\dots,Z_{T}$ are i.i.d. standard Gaussian vectors in $\mathbb{R}^{mdp}$, and let $Z_{0}=\mathbf{0}$. For any $\delta_{Z}\in(0,1)$, define

$$z_{\delta_{Z}} = \sqrt{mdp} + \sqrt{2\log\big(\tfrac{2T}{\delta_{Z}}\big)} \quad\text{and}\quad V_{Z,\delta_{Z}} = (T-1)mdp + 2\sqrt{(T-1)mdp\,\log\big(\tfrac{2}{\delta_{Z}}\big)} + 2\log\big(\tfrac{2}{\delta_{Z}}\big).$$

Then,

$$\mathbb{P}\left(G_{Z}(z_{\delta_{Z}},V_{Z,\delta_{Z}})\right) \geq 1-\delta_{Z}.$$

###### Proof.

We first control the maximum Gaussian norm. For a standard Gaussian vector $Z_{t}\sim\mathcal{N}(0,\mathbf{I}_{mdp})$, the standard Gaussian norm concentration inequality gives, for each $t=1,\dots,T$,

$$\mathbb{P}\Big(\|Z_{t}\|_{2} > \sqrt{mdp} + \sqrt{2\log\big(\tfrac{2T}{\delta_{Z}}\big)}\Big) \leq \frac{\delta_{Z}}{2T}.$$

Since $Z_{0}=\mathbf{0}$, taking a union bound over $t=1,\dots,T$ yields

$$\mathbb{P}\big(\max_{0\leq t\leq T}\|Z_{t}\|_{2} \leq z_{\delta_{Z}}\big) \geq 1-\frac{\delta_{Z}}{2}.$$

Next, we control the quadratic sum. Since $Z_{0}=\mathbf{0}$, it holds that

$$\sum_{t=1}^{T}\|Z_{t-1}\|_{2}^{2} = \sum_{s=0}^{T-1}\|Z_{s}\|_{2}^{2} = \sum_{s=1}^{T-1}\|Z_{s}\|_{2}^{2}.$$

The random variable on the right-hand side follows a chi-square distribution with $K=(T-1)mdp$ degrees of freedom. By the Laurent–Massart chi-square tail bound (see Lemma [C.4](https://arxiv.org/html/2605.12648#A3.Thmtheorem4)), it holds that

$$\mathbb{P}\Big(\sum_{t=1}^{T}\|Z_{t-1}\|_{2}^{2} \leq (T-1)mdp + 2\sqrt{(T-1)mdp\,\log\big(\tfrac{2}{\delta_{Z}}\big)} + 2\log\big(\tfrac{2}{\delta_{Z}}\big)\Big) \geq 1-\frac{\delta_{Z}}{2}.$$

Hence,

$$\mathbb{P}\Big(\sum_{t=1}^{T}\|Z_{t-1}\|_{2}^{2} \leq V_{Z,\delta_{Z}}\Big) \geq 1-\frac{\delta_{Z}}{2}.$$

Combining the two estimates by a union bound gives

$$\mathbb{P}\left(G_{Z}(z_{\delta_{Z}},V_{Z,\delta_{Z}})\right) \geq 1-\delta_{Z}.$$

This completes the proof. ∎
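To get a feel for the scale of the two thresholds in Lemma C.5, they can be evaluated directly. The following Python sketch is a plain transcription of the formulas for $z_{\delta_{Z}}$ and $V_{Z,\delta_{Z}}$. The function name `gaussian_noise_thresholds` and the argument `dim` (standing for the product $mdp$) are hypothetical, and the numbers in the example call are arbitrary.

```python
import math


def gaussian_noise_thresholds(T: int, dim: int, delta_z: float) -> tuple[float, float]:
    """Return (z, V_Z) as defined in Lemma C.5, where `dim` plays the role of m*d*p."""
    if T < 1 or dim < 1 or not 0.0 < delta_z < 1.0:
        raise ValueError("need T >= 1, dim >= 1 and delta_z in (0, 1)")
    # Per-step norm threshold: union bound over T steps at level delta_z / 2.
    z = math.sqrt(dim) + math.sqrt(2.0 * math.log(2.0 * T / delta_z))
    # Quadratic-sum threshold: chi-square with (T - 1) * dim degrees of freedom.
    dof = (T - 1) * dim
    log_term = math.log(2.0 / delta_z)
    v_z = dof + 2.0 * math.sqrt(dof * log_term) + 2.0 * log_term
    return z, v_z


z, v_z = gaussian_noise_thresholds(T=1000, dim=4096, delta_z=0.01)
print(f"z = {z:.2f}, V_Z = {v_z:.1f}, V_Z / ((T - 1) * dim) = {v_z / (999 * 4096):.4f}")
```

For large $T\cdot mdp$ the ratio $V_{Z,\delta_{Z}}/((T-1)mdp)$ is close to one, so the deviation terms are lower order compared with the mean of the chi-square sum.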

###### Lemma C.6 (High-probability control of the stopped mini-batch quadratic variation).

Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds and the event $\mathcal{E}_{\delta}$ occurs. Let $G_{\delta}$ be defined in ([5](https://arxiv.org/html/2605.12648#A3.E5)). Fix $\bar{R}>0$ and $\delta_{\Delta}\in(0,1)$. Define $V_{\Delta,\delta_{\Delta}} = \frac{2TG_{\delta}^{2}}{B} + 8G_{\delta}^{2}\log\big(\frac{1}{\delta_{\Delta}}\big)$. Then, conditioned on the dataset and the initialization satisfying $\mathcal{E}_{\delta}$, it holds that

$$\mathbb{P}_{\mathcal{A}}\left(G_{\Delta^{2}}(\bar{R},V_{\Delta,\delta_{\Delta}})\right) \geq 1-\delta_{\Delta}.$$

###### Proof.

Condition on the dataset $S$ and on the initialization satisfying $\mathcal{E}_{\delta}$. Let $\mathscr{F}_{t} = \sigma\bigl(\mathbf{W}_{0},\mathbf{c},\mathcal{B}_{1},\dots,\mathcal{B}_{t},Z_{1},\dots,Z_{t}\bigr)$ be the natural filtration generated by the algorithmic randomness up to time $t$. For each $t\in[T]$, define $X_{t} = \|\widetilde{\Delta}_{t}\|_{2}^{2}\,\mathbf{1}_{A^{U}_{\bar{R}}(t-1)}$. Since $A^{U}_{\bar{R}}(t-1)$ is $\mathscr{F}_{t-1}$-measurable, $X_{t}$ is adapted.

We first bound the conditional expectation of $X_{t}$. Conditioned on $\mathscr{F}_{t-1}$, the iterate $\widetilde{\mathbf{W}}_{t-1}$ is fixed. Hence the vectors $\nabla\ell(y_{i}f_{\widetilde{\mathbf{W}}_{t-1}}(\mathbf{x}_{i}))$ with $i\in[n]$ are deterministic. By the uniform gradient bound ([5](https://arxiv.org/html/2605.12648#A3.E5)), on $\mathcal{E}_{\delta}$, it holds that

$$\big\|\nabla\ell(y_{i}f_{\widetilde{\mathbf{W}}_{t-1}}(\mathbf{x}_{i}))\big\|_{2} \leq G_{\delta}, \qquad \forall i\in[n].$$

Therefore, by Lemma [B.2](https://arxiv.org/html/2605.12648#A2.Thmtheorem2),

$$\mathbb{E}\big[\|\widetilde{\Delta}_{t}\|_{2}^{2}\,\big|\,\mathscr{F}_{t-1}\big] \leq \frac{G_{\delta}^{2}}{B}.$$

Thus,

$$\mathbb{E}[X_{t}\mid\mathscr{F}_{t-1}] = \mathbf{1}_{A^{U}_{\bar{R}}(t-1)}\,\mathbb{E}\big[\|\widetilde{\Delta}_{t}\|_{2}^{2}\,\big|\,\mathscr{F}_{t-1}\big] \leq \frac{G_{\delta}^{2}}{B}.$$

Consequently,

$$\sum_{t=1}^{T}\mathbb{E}[X_{t}\mid\mathscr{F}_{t-1}] \leq \frac{TG_{\delta}^{2}}{B} =: V.$$

Next we bound $X_{t}$ almost surely. Again using ([5](https://arxiv.org/html/2605.12648#A3.E5)),

$$\Big\|\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}\nabla\ell(y_{i}f_{\widetilde{\mathbf{W}}_{t-1}}(\mathbf{x}_{i}))\Big\|_{2} \leq G_{\delta} \quad\text{and}\quad \big\|\nabla\mathcal{L}_{S}(\widetilde{\mathbf{W}}_{t-1})\big\|_{2} = \Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(y_{i}f_{\widetilde{\mathbf{W}}_{t-1}}(\mathbf{x}_{i}))\Big\|_{2} \leq G_{\delta}.$$

Hence,

$$\|\widetilde{\Delta}_{t}\|_{2} \leq 2G_{\delta}, \qquad\text{and}\qquad X_{t} \leq 4G_{\delta}^{2} =: L.$$
We now apply a standard Chernoff argument for adapted bounded nonnegative variables. For completeness, we include the short proof. For any $\theta>0$ and any random variable $X\in[0,L]$, convexity of the exponential gives

$$e^{\theta X} \leq 1 + \frac{e^{\theta L}-1}{L}X \leq \exp\Big(\frac{e^{\theta L}-1}{L}X\Big).$$

Therefore,

$$\mathbb{E}[e^{\theta X_{t}}\mid\mathscr{F}_{t-1}] \leq \exp\Big(\frac{e^{\theta L}-1}{L}\,\mathbb{E}[X_{t}\mid\mathscr{F}_{t-1}]\Big).$$

Iterating this inequality gives

$$\mathbb{E}\Big[\exp\Big(\theta\sum_{t=1}^{T}X_{t}\Big)\Big] \leq \mathbb{E}\Big[\exp\Big(\frac{e^{\theta L}-1}{L}\sum_{t=1}^{T}\mathbb{E}[X_{t}\mid\mathscr{F}_{t-1}]\Big)\Big] \leq \exp\Big(\frac{e^{\theta L}-1}{L}V\Big).$$

Choose $\theta=\frac{\log(2)}{L}$, so that $e^{\theta L}-1=1$. Then, for any $u>0$, Markov's inequality yields

$$\begin{aligned} \mathbb{P}\Big(\sum_{t=1}^{T}X_{t} \geq 2V+2Lu\Big) & = \mathbb{P}\Big(\exp\Big(\theta\sum_{t=1}^{T}X_{t}\Big) \geq \exp\big(2\theta V+2\theta Lu\big)\Big) \\ & \leq \exp\big(-2\theta V-2\theta Lu\big)\,\mathbb{E}\Big[\exp\Big(\theta\sum_{t=1}^{T}X_{t}\Big)\Big] \leq \exp\Big(-2\theta V-2\theta Lu+\frac{V}{L}\Big) \\ & = \exp\Big(-\big(2\log(2)-1\big)\frac{V}{L} - 2\log(2)\,u\Big). \end{aligned}$$

Since $2\log(2)>1$, the right-hand side is bounded by $\exp(-u)$. Taking $u=\log\big(\frac{1}{\delta_{\Delta}}\big)$ yields

$$\mathbb{P}_{\mathcal{A}}\Big(\sum_{t=1}^{T}X_{t} \geq 2V+2L\log\big(\tfrac{1}{\delta_{\Delta}}\big)\Big) \leq \delta_{\Delta}.$$

Recalling that $V=\frac{TG_{\delta}^{2}}{B}$ and $L=4G_{\delta}^{2}$, so that $2V+2L\log\big(\frac{1}{\delta_{\Delta}}\big)=V_{\Delta,\delta_{\Delta}}$, we obtain

$$\mathbb{P}_{\mathcal{A}}\Big(\sum_{t=1}^{T}\|\widetilde{\Delta}_{t}\|_{2}^{2}\,\mathbf{1}_{A^{U}_{\bar{R}}(t-1)} \leq V_{\Delta,\delta_{\Delta}}\Big) \geq 1-\delta_{\Delta}.$$

This completes the proof. ∎
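The two algebraic facts used at the end of this proof, namely that the Chernoff exponent with $\theta=\log(2)/L$ is at most $-u$ and that $2V+2L\log(1/\delta_{\Delta})$ equals $V_{\Delta,\delta_{\Delta}}$, can be checked numerically. The sketch below is illustrative only: the function names are hypothetical and the grid values are arbitrary.

```python
import math


def stopped_variation_threshold(T: int, B: int, G: float, delta: float) -> float:
    """V_{Delta,delta} = 2V + 2L*log(1/delta), with V = T*G^2/B and L = 4*G^2."""
    V = T * G ** 2 / B
    L = 4.0 * G ** 2
    return 2.0 * V + 2.0 * L * math.log(1.0 / delta)


def chernoff_log_tail(V: float, L: float, u: float) -> float:
    """Log of the Chernoff bound on P(sum X_t >= 2V + 2Lu) for theta = log(2) / L."""
    theta = math.log(2.0) / L
    mgf_exponent = (math.exp(theta * L) - 1.0) / L * V  # equals V / L for this theta
    return -2.0 * theta * V - 2.0 * theta * L * u + mgf_exponent


# Check the claim "right-hand side <= exp(-u)" on a small grid of (V, L, u).
worst_gap = max(
    chernoff_log_tail(V, L, u) + u
    for V in (0.0, 0.1, 1.0, 10.0, 100.0)
    for L in (0.5, 1.0, 4.0)
    for u in (0.1, 1.0, 5.0, 20.0)
)
print("max of (log tail bound + u) over the grid:", worst_gap)
```

A non-positive `worst_gap` confirms, on this grid, that the bound $\exp(-u)$ is valid. The slack comes from the factor $2\log(2)-1>0$ multiplying both $V/L$ and $u$.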

We next estimate the three terms in the stopped potential fluctuation $M_{k}$ separately.

###### Lemma C.7.

Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds and the event $\mathcal{E}_{\delta}$ occurs. Assume $C_{\mathrm{clip}}\geq G_{\delta}$, where $G_{\delta}$ is defined in ([5](https://arxiv.org/html/2605.12648#A3.E5)). Fix $\bar{R}>0$ and $\delta_{\mathrm{pot},1}\in(0,1)$. Define

$$M_{k}^{(1)} = -2\eta\sum_{t=1}^{k}\mathbf{1}_{A_{\bar{R}}^{U}(t-1)}\big\langle \widetilde{\Delta}_{t}, \mathbf{U}_{t-1}-\mathbf{W}^{*}\big\rangle, \qquad k\in[T].$$

Then, conditioned on the dataset and the initialization satisfying the event $\mathcal{E}_{\delta}$, it holds that

$$\mathbb{P}_{\mathcal{A}}\Big(\max_{1\leq k\leq T}|M_{k}^{(1)}| \leq 4\sqrt{2}\,\eta G_{\delta}\bar{R}\sqrt{\frac{T\log(2/\delta_{\mathrm{pot},1})}{B}}\Big) \geq 1-\delta_{\mathrm{pot},1}.$$

###### Proof\.

Conditioned on the datasetSSand on the initialization satisfying the eventℰδ\\mathcal\{E\}\_\{\\delta\}\. For eacht∈\[T\]t\\in\[T\], defineXt=−2η𝟏AR¯U\(t−1\)⟨Δ~t,𝐔t−1−𝐖∗⟩\.X\_\{t\}=\-2\\eta\\mathbf\{1\}\_\{A\_\{\\bar\{R\}\}^\{U\}\(t\-1\)\}\\big\\langle\\widetilde\{\\Delta\}\_\{t\},\\mathbf\{U\}\_\{t\-1\}\-\\mathbf\{W\}^\{\*\}\\big\\rangle\.Then we know

Mk\(1\)=∑t=1kXt\.M\_\{k\}^\{\(1\)\}=\\sum\_\{t=1\}^\{k\}X\_\{t\}\.SinceAR¯U\(t−1\)A\_\{\\bar\{R\}\}^\{U\}\(t\-1\)and𝐔t−1\\mathbf\{U\}\_\{t\-1\}areℱt−1\\mathscr\{F\}\_\{t\-1\}\-measurable, and note that𝔼\[Δ~t∣ℱt−1\]=0,\\mathbb\{E\}\[\\widetilde\{\\Delta\}\_\{t\}\\mid\\mathscr\{F\}\_\{t\-1\}\]=0,we have𝔼\[Xt∣ℱt−1\]=0\.\\mathbb\{E\}\[X\_\{t\}\\mid\\mathscr\{F\}\_\{t\-1\}\]=0\.

We now prove thatXtX\_\{t\}is conditionally sub\-Gaussian\. Conditioned onℱt−1\\mathscr\{F\}\_\{t\-1\}, defineht=𝟏AR¯U\(t−1\)\(𝐔t−1−𝐖∗\)h\_\{t\}=\\mathbf\{1\}\_\{A\_\{\\bar\{R\}\}^\{U\}\(t\-1\)\}\(\\mathbf\{U\}\_\{t\-1\}\-\\mathbf\{W\}^\{\*\}\)\. Then‖ht‖2≤R¯\\\|h\_\{t\}\\\|\_\{2\}\\leq\\bar\{R\}\. Also define deterministic scalarsai=⟨∇ℓ\(yif𝐖~t−1\(𝐱i\)\),ht⟩a\_\{i\}=\\big\\langle\\nabla\\ell\(y\_\{i\}f\_\{\\widetilde\{\\mathbf\{W\}\}\_\{t\-1\}\}\(\\mathbf\{x\}\_\{i\}\)\),h\_\{t\}\\big\\ranglefori∈\[n\]i\\in\[n\]\.

On the eventℰδ\\mathcal\{E\}\_\{\\delta\}, from \([5](https://arxiv.org/html/2605.12648#A3.E5)\) we know\|ai\|≤GδR¯\|a\_\{i\}\|\\leq G\_\{\\delta\}\\bar\{R\}for alli∈\[n\]i\\in\[n\]\. Leta¯=n−1∑i=1nai\\bar\{a\}=n^\{\-1\}\\sum\_\{i=1\}^\{n\}a\_\{i\}\. Conditioned onℱt−1\\mathscr\{F\}\_\{t\-1\}, the valuesa1,…,ana\_\{1\},\\ldots,a\_\{n\}are fixed, and

$$\langle \widetilde{\Delta}_t, h_t \rangle = \frac{1}{B} \sum_{i \in \mathcal{B}_t} a_i - \bar{a}.$$

We use Hoeffding's comparison theorem for finite-population sampling \[[24](https://arxiv.org/html/2605.12648#bib.bib478), Theorem 4\]. Let $I_1, \ldots, I_B$ be sampled uniformly without replacement from $[n]$, and let $J_1, \ldots, J_B$ be i.i.d. uniform random variables on $[n]$. Then, for every convex function $\varphi$,

$$\mathbb{E}\Big[\varphi\Big(\sum_{s=1}^{B} a_{I_s}\Big)\Big] \leq \mathbb{E}\Big[\varphi\Big(\sum_{s=1}^{B} a_{J_s}\Big)\Big].$$

Taking $\varphi(x) = \exp\left(\theta(x/B - \bar{a})\right)$, we get

$$\mathbb{E}\Big[\exp\Big(\theta\Big(\frac{1}{B} \sum_{s=1}^{B} a_{I_s} - \bar{a}\Big)\Big)\Big] \leq \mathbb{E}\Big[\exp\Big(\frac{\theta}{B} \sum_{s=1}^{B} (a_{J_s} - \bar{a})\Big)\Big].$$

It remains to bound the right-hand side. Since $J_1, \ldots, J_B$ are independent and $a_{J_s} - \bar{a}$ has mean zero and lies in an interval of length at most $2 G_\delta \bar{R}$, the usual Hoeffding lemma gives

$$\mathbb{E}\Big[\exp\Big(\frac{\theta}{B}(a_{J_s} - \bar{a})\Big)\Big] \leq \exp\Big(\frac{\theta^2 (G_\delta \bar{R})^2}{2B^2}\Big).$$

Therefore, by independence,

$$\mathbb{E}\Big[\exp\Big(\frac{\theta}{B} \sum_{s=1}^{B} (a_{J_s} - \bar{a})\Big)\Big] = \prod_{s=1}^{B} \mathbb{E}\Big[\exp\Big(\frac{\theta}{B}(a_{J_s} - \bar{a})\Big)\Big] \leq \exp\Big(\frac{\theta^2 (G_\delta \bar{R})^2}{2B}\Big).$$

Consequently,

$$\mathbb{E}\Big[\exp\big(\theta \langle \widetilde{\Delta}_t, h_t \rangle\big) \mid \mathscr{F}_{t-1}\Big] \leq \exp\Big(\frac{\theta^2 G_\delta^2 \bar{R}^2}{2B}\Big).$$
Since $X_t = -2\eta \langle \widetilde{\Delta}_t, h_t \rangle$, applying the preceding bound with $\theta$ replaced by $-2\eta\theta$ gives, for all $\theta \in \mathbb{R}$,

$$\mathbb{E}\left[\exp(\theta X_t) \mid \mathscr{F}_{t-1}\right] \leq \exp\Big(\frac{2\theta^2 \eta^2 G_\delta^2 \bar{R}^2}{B}\Big).$$

Moreover, since $h_t$ is $\mathscr{F}_{t-1}$-measurable and $\mathbb{E}[\widetilde{\Delta}_t \mid \mathscr{F}_{t-1}] = 0$, we have $\mathbb{E}[X_t \mid \mathscr{F}_{t-1}] = 0$. Therefore, $\{X_t\}_{t=1}^{T}$ is a conditionally sub-Gaussian martingale difference sequence with variance proxy $\sigma_1^2 = \frac{4\eta^2 G_\delta^2 \bar{R}^2}{B}$.

By the maximal inequality for conditionally sub-Gaussian martingales, for any $u > 0$,

$$\mathbb{P}_{\mathcal{A}}\Big(\max_{1 \leq k \leq T} \Big|\sum_{t=1}^{k} X_t\Big| \geq \sqrt{2T\sigma_1^2 u}\Big) \leq 2e^{-u}.$$

Taking $u = \log\big(\frac{2}{\delta_{\mathrm{pot},1}}\big)$, we obtain

$$\mathbb{P}_{\mathcal{A}}\Big(\max_{1 \leq k \leq T} |M_k^{(1)}| \geq \sqrt{2T \cdot \frac{4\eta^2 G_\delta^2 \bar{R}^2}{B} \log\big(\tfrac{2}{\delta_{\mathrm{pot},1}}\big)}\Big) \leq \delta_{\mathrm{pot},1}.$$

The threshold equals $2\sqrt{2}\,\eta G_\delta \bar{R} \sqrt{T \log(2/\delta_{\mathrm{pot},1})/B}$, so enlarging the constant to $4\sqrt{2}$ proves the stated bound. ∎
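The two inequalities at the heart of this proof can be checked exactly on a toy population. The following is a minimal sketch, not part of the paper: the scalars `a` and the names `mgf_without_replacement`, `mgf_with_replacement`, and `subgaussian_bound` are hypothetical. `G` plays the role of $G_\delta \bar{R}$, and the without-replacement expectation is computed by enumerating all size-$B$ subsets.

```python
import itertools
import math

def mgf_without_replacement(a, B, theta):
    """Exact E[exp(theta * (mini-batch mean - population mean))] over all size-B subsets."""
    n = len(a)
    a_bar = sum(a) / n
    subsets = list(itertools.combinations(range(n), B))
    total = sum(math.exp(theta * (sum(a[i] for i in s) / B - a_bar)) for s in subsets)
    return total / len(subsets)

def mgf_with_replacement(a, B, theta):
    """Exact MGF of the centered mean of B i.i.d. uniform draws from a."""
    n = len(a)
    a_bar = sum(a) / n
    single = sum(math.exp(theta / B * (x - a_bar)) for x in a) / n
    return single ** B  # independence turns the MGF into a product

def subgaussian_bound(G, B, theta):
    """Hoeffding-lemma bound exp(theta^2 G^2 / (2B)), valid when |a_i| <= G."""
    return math.exp(theta ** 2 * G ** 2 / (2 * B))

# Hypothetical population of scalars with |a_i| <= G = 1 (illustrative values only).
a = [0.9, -0.4, 0.1, -1.0, 0.7, 0.3]
G, B = 1.0, 3
checks = []
for theta in (-3.0, -1.0, 0.5, 2.0):
    wo = mgf_without_replacement(a, B, theta)
    wr = mgf_with_replacement(a, B, theta)
    # Comparison theorem: without-replacement <= with-replacement <= sub-Gaussian bound.
    checks.append(wo <= wr + 1e-12 and wr <= subgaussian_bound(G, B, theta) + 1e-12)
print(all(checks))
```

Exhaustive enumeration is only feasible for very small $n$ and $B$; it is used here so that the comparison is exact rather than a Monte Carlo estimate.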

###### Lemma C\.8\.

Suppose $0 < \lambda < 1$. Fix $\bar{R} > 0$ and $\delta_{\mathrm{pot},2} \in (0,1)$. Define

$$M_k^{(2)} = -2(1-\lambda)\eta c_{\mathrm{priv}} \sum_{t=1}^{k} \mathbf{1}_{A_{\bar{R}}^{U}(t-1)} \big\langle Z_{t-1}, \mathbf{U}_{t-1} - \mathbf{W}^* \big\rangle, \qquad k \in [T].$$

Then, conditioned on the dataset and the initialization satisfying the event $\mathcal{E}_\delta$, it holds that

$$\mathbb{P}_{\mathcal{A}}\Big(\max_{1 \leq k \leq T} |M_k^{(2)}| \leq 4(1-\lambda)\eta c_{\mathrm{priv}}\,\bar{R} \sqrt{T \log\big(\tfrac{2}{\delta_{\mathrm{pot},2}}\big)}\Big) \geq 1 - \delta_{\mathrm{pot},2}.$$

###### Proof\.

Condition on the dataset $S$, the comparator $\mathbf{W}^*$, and the initialization $(\mathbf{W}_0, \mathbf{c})$ satisfying the event $\mathcal{E}_\delta$. For $t \geq 2$, define the lagged filtration $\mathscr{H}_{t-1} = \sigma\bigl(S, \mathbf{W}_0, \mathbf{c}, \mathbf{W}^*, \mathcal{B}_1, \dots, \mathcal{B}_{t-1}, Z_1, \dots, Z_{t-2}\bigr)$. For $t = 1$, the summand is zero since $Z_0 = \mathbf{0}$. For $t \geq 2$, the shifted iterate $\mathbf{U}_{t-1}$ is $\mathscr{H}_{t-1}$-measurable, and so is $\mathbf{1}_{A_{\bar{R}}^{U}(t-1)}$. Moreover, $Z_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{mdp})$ is independent of $\mathscr{H}_{t-1}$.

Define

$$Y_t = -2(1-\lambda)\eta c_{\mathrm{priv}}\, \mathbf{1}_{A_{\bar{R}}^{U}(t-1)} \big\langle Z_{t-1}, \mathbf{U}_{t-1} - \mathbf{W}^* \big\rangle, \qquad t = 1, \dots, T.$$

Then $M_k^{(2)} = \sum_{t=1}^{k} Y_t$. For $t = 1$, $Y_1 = 0$. For $t \geq 2$, $\mathbb{E}[Y_t \mid \mathscr{H}_{t-1}] = 0$.

Furthermore, conditioned on $\mathscr{H}_{t-1}$, $Y_t$ is Gaussian with variance

$$4(1-\lambda)^2 \eta^2 c_{\mathrm{priv}}^2 \mathbf{1}_{A_{\bar{R}}^{U}(t-1)} \|\mathbf{U}_{t-1} - \mathbf{W}^*\|_2^2 \leq 4(1-\lambda)^2 \eta^2 c_{\mathrm{priv}}^2 \bar{R}^2.$$

Therefore, for every $\theta \in \mathbb{R}$,

$$\mathbb{E}\left[\exp(\theta Y_t) \,\middle|\, \mathscr{H}_{t-1}\right] \leq \exp\left(2\theta^2 (1-\lambda)^2 \eta^2 c_{\mathrm{priv}}^2 \bar{R}^2\right).$$

That is, $\{Y_t\}_{t=1}^{T}$ is a conditionally sub-Gaussian martingale difference sequence with variance proxy $\sigma_2^2 = 4(1-\lambda)^2 \eta^2 c_{\mathrm{priv}}^2 \bar{R}^2$. Then by the maximal inequality for conditionally sub-Gaussian martingales, for any $u > 0$,

$$\mathbb{P}_{\mathcal{A}}\Big(\max_{1 \leq k \leq T} \Big|\sum_{t=1}^{k} Y_t\Big| \geq \sqrt{2T\sigma_2^2 u}\Big) \leq 2e^{-u}.$$

Taking $u = \log\big(\frac{2}{\delta_{\mathrm{pot},2}}\big)$, we get, with probability at least $1 - \delta_{\mathrm{pot},2}$,

$$\max_{1 \leq k \leq T} |M_k^{(2)}| \leq 2\sqrt{2}(1-\lambda)\eta c_{\mathrm{priv}}\,\bar{R} \sqrt{T \log\big(\tfrac{2}{\delta_{\mathrm{pot},2}}\big)}.$$

Enlarging the constant to $4$ proves the claim. ∎

###### Lemma C\.9\.

Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds and the event $\mathcal{E}_\delta$ occurs. Assume $C_{\mathrm{clip}} \geq G_\delta$, where $G_\delta$ is defined in ([5](https://arxiv.org/html/2605.12648#A3.E5)). Fix $\bar{R} > 0$, $z > 0$, $V_Z > 0$, and $\delta_{\mathrm{pot},3} \in (0,1)$. Define

$$M_k^{(3)} = 2(1-\lambda)\eta^2 c_{\mathrm{priv}} \sum_{t=1}^{k} \mathbf{1}_{A_{\bar{R}}^{U}(t-1)} \big\langle \widetilde{\Delta}_t, Z_{t-1} \big\rangle, \qquad k \in [T].$$

Then, conditioned on the dataset and the initialization satisfying $\mathcal{E}_\delta$, it holds that

$$\mathbb{P}_{\mathcal{A}}\Big(G_Z(z, V_Z) \cap \Big\{\max_{1 \leq k \leq T} |M_k^{(3)}| > 4(1-\lambda)\eta^2 c_{\mathrm{priv}} G_\delta \sqrt{\frac{V_Z \log(2/\delta_{\mathrm{pot},3})}{B}}\Big\}\Big) \leq \delta_{\mathrm{pot},3}.$$

###### Proof\.

Condition on the dataset $S$, the comparator $\mathbf{W}^*$, and the initialization $(\mathbf{W}_0, \mathbf{c})$ satisfying $\mathcal{E}_\delta$. Conditioned on $\mathscr{F}_{t-1}$, the vector $Z_{t-1}$ is fixed, and so is $\widetilde{\mathbf{W}}_{t-1}$. The randomness in $\widetilde{\Delta}_t$ comes only from the fresh mini-batch $\mathcal{B}_t$.

For each $t \in [T]$, define $Y_t = 2(1-\lambda)\eta^2 c_{\mathrm{priv}}\, \mathbf{1}_{A_{\bar{R}}^{U}(t-1)} \big\langle \widetilde{\Delta}_t, Z_{t-1} \big\rangle$. Then $M_k^{(3)} = \sum_{t=1}^{k} Y_t$. Since $A_{\bar{R}}^{U}(t-1)$ and $Z_{t-1}$ are $\mathscr{F}_{t-1}$-measurable, and $\mathbb{E}[\widetilde{\Delta}_t \mid \mathscr{F}_{t-1}] = 0$, we have

$$\mathbb{E}[Y_t \mid \mathscr{F}_{t-1}] = 0.$$

Thus $\{Y_t\}_{t=1}^{T}$ is a martingale difference sequence.

We next prove a conditional sub-Gaussian bound for $Y_t$. Conditioned on $\mathscr{F}_{t-1}$, set $h_t = \mathbf{1}_{A_{\bar{R}}^{U}(t-1)} Z_{t-1}$. For each $i \in [n]$, define $a_i = \langle \nabla \ell(y_i f_{\widetilde{\mathbf{W}}_{t-1}}(\mathbf{x}_i)), h_t \rangle$. On the event $\mathcal{E}_\delta$, by ([5](https://arxiv.org/html/2605.12648#A3.E5)), it holds that

$$|a_i| \leq G_\delta \|h_t\|_2 = G_\delta \mathbf{1}_{A_{\bar{R}}^{U}(t-1)} \|Z_{t-1}\|_2, \qquad i \in [n].$$

Moreover,

$$\big\langle \widetilde{\Delta}_t, h_t \big\rangle = \frac{1}{B} \sum_{i \in \mathcal{B}_t} a_i - \frac{1}{n} \sum_{i=1}^{n} a_i.$$

As in the proof of Lemma [C.7](https://arxiv.org/html/2605.12648#A3.Thmtheorem7), Hoeffding's comparison theorem for sampling without replacement gives, for every $\theta \in \mathbb{R}$,

$$\mathbb{E}\big[\exp\big(\theta \big\langle \widetilde{\Delta}_t, h_t \big\rangle\big) \,\big|\, \mathscr{F}_{t-1}\big] \leq \exp\Big(\frac{\theta^2 G_\delta^2 \mathbf{1}_{A_{\bar{R}}^{U}(t-1)} \|Z_{t-1}\|_2^2}{2B}\Big).$$

Since $Y_t = 2(1-\lambda)\eta^2 c_{\mathrm{priv}} \big\langle \widetilde{\Delta}_t, h_t \big\rangle$, we obtain

$$\mathbb{E}\big[\exp(\theta Y_t) \mid \mathscr{F}_{t-1}\big] \leq \exp\left(\frac{\theta^2 \sigma_t^2}{2}\right), \qquad \text{where } \sigma_t^2 = 4(1-\lambda)^2 \eta^4 c_{\mathrm{priv}}^2 G_\delta^2 \frac{\mathbf{1}_{A_{\bar{R}}^{U}(t-1)} \|Z_{t-1}\|_2^2}{B}.$$

Thus $\{Y_t\}_{t=1}^{T}$ is a conditionally sub-Gaussian martingale difference sequence with predictable variance process $V_k^{(3)} = \sum_{t=1}^{k} \sigma_t^2$. On $G_Z(z, V_Z)$, we have

$$V_T^{(3)} \leq 4(1-\lambda)^2 \eta^4 c_{\mathrm{priv}}^2 G_\delta^2 \frac{V_Z}{B}.$$

We now apply the standard maximal inequality for conditionally sub-Gaussian martingales with predictable variance process: for every $v > 0$ and $u > 0$,

$$\mathbb{P}_{\mathcal{A}}\Big(\max_{1 \leq k \leq T} \Big|\sum_{t=1}^{k} Y_t\Big| \geq \sqrt{2vu} \ \text{ and } \ V_T^{(3)} \leq v\Big) \leq 2e^{-u}.$$

Take $v = 4(1-\lambda)^2 \eta^4 c_{\mathrm{priv}}^2 G_\delta^2 \frac{V_Z}{B}$ and $u = \log\big(\frac{2}{\delta_{\mathrm{pot},3}}\big)$. Since $G_Z(z, V_Z)$ implies $V_T^{(3)} \leq v$, we obtain

$$\mathbb{P}_{\mathcal{A}}\Big(G_Z(z, V_Z) \cap \Big\{\max_{1 \leq k \leq T} |M_k^{(3)}| \geq 2\sqrt{2}(1-\lambda)\eta^2 c_{\mathrm{priv}} G_\delta \sqrt{\frac{V_Z \log(2/\delta_{\mathrm{pot},3})}{B}}\Big\}\Big) \leq \delta_{\mathrm{pot},3}.$$

Enlarging the numerical constant from $2\sqrt{2}$ to $4$ gives the stated bound. This completes the proof. ∎

We now combine the above three fluctuation bounds to control the stopped potential-fluctuation event $G_{\mathrm{pot}}$.

###### Lemma C\.10\(High\-probability control of the stopped potential fluctuation\)\.

Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds and the event $\mathcal{E}_\delta$ occurs. Assume $C_{\mathrm{clip}} \geq G_\delta$, where $G_\delta$ is defined in ([5](https://arxiv.org/html/2605.12648#A3.E5)). Fix $\bar{R} > 0$, $z > 0$, $V_Z > 0$, and $\delta_{\mathrm{pot}} \in (0,1)$. Define

$$\begin{aligned} M_{\mathrm{pot}}(\bar{R}, z, V_Z; \delta_{\mathrm{pot}}) =\; & 4\sqrt{2}\,\eta G_\delta \bar{R} \sqrt{\frac{T \log(6/\delta_{\mathrm{pot}})}{B}} + 4(1-\lambda)\eta c_{\mathrm{priv}}\,\bar{R} \sqrt{T \log\big(\tfrac{6}{\delta_{\mathrm{pot}}}\big)} \\ & + 4(1-\lambda)\eta^2 c_{\mathrm{priv}} G_\delta \sqrt{\frac{V_Z \log(6/\delta_{\mathrm{pot}})}{B}}. \end{aligned} \tag{12}$$

Then, conditioned on the dataset and the initialization satisfying $\mathcal{E}_\delta$, it holds that

$$\mathbb{P}_{\mathcal{A}}\left(G_Z(z, V_Z) \cap G_{\mathrm{pot}}\bigl(\bar{R}, M_{\mathrm{pot}}(\bar{R}, z, V_Z; \delta_{\mathrm{pot}})\bigr)\right) \geq \mathbb{P}_{\mathcal{A}}\bigl(G_Z(z, V_Z)\bigr) - \delta_{\mathrm{pot}}.$$

###### Proof\.

Note that $M_k = M_k^{(1)} + M_k^{(2)} + M_k^{(3)}$ for every $k \in [T]$. Define the following three events:

$$\begin{aligned} E_1 &= \Big\{\max_{1 \leq k \leq T} |M_k^{(1)}| \leq 4\sqrt{2}\,\eta G_\delta \bar{R} \sqrt{\frac{T \log(6/\delta_{\mathrm{pot}})}{B}}\Big\}, \\ E_2 &= \Big\{\max_{1 \leq k \leq T} |M_k^{(2)}| \leq 4(1-\lambda)\eta c_{\mathrm{priv}}\,\bar{R} \sqrt{T \log\big(\tfrac{6}{\delta_{\mathrm{pot}}}\big)}\Big\}, \\ E_3 &= \Big\{\max_{1 \leq k \leq T} |M_k^{(3)}| \leq 4(1-\lambda)\eta^2 c_{\mathrm{priv}} G_\delta \sqrt{\frac{V_Z \log(6/\delta_{\mathrm{pot}})}{B}}\Big\}. \end{aligned}$$

By Lemmas [C.7](https://arxiv.org/html/2605.12648#A3.Thmtheorem7), [C.8](https://arxiv.org/html/2605.12648#A3.Thmtheorem8), and [C.9](https://arxiv.org/html/2605.12648#A3.Thmtheorem9) with $\delta_{\mathrm{pot},1} = \delta_{\mathrm{pot},2} = \delta_{\mathrm{pot},3} = \delta_{\mathrm{pot}}/3$, we have

$$\mathbb{P}_{\mathcal{A}}(E_1) \geq 1 - \frac{\delta_{\mathrm{pot}}}{3}, \qquad \mathbb{P}_{\mathcal{A}}(E_2) \geq 1 - \frac{\delta_{\mathrm{pot}}}{3}, \qquad \mathbb{P}_{\mathcal{A}}\bigl(G_Z(z, V_Z) \cap E_3^c\bigr) \leq \frac{\delta_{\mathrm{pot}}}{3}.$$

On the event $E_1 \cap E_2 \cap E_3$, for every $k \in [T]$,

$$|M_k| \leq |M_k^{(1)}| + |M_k^{(2)}| + |M_k^{(3)}| \leq M_{\mathrm{pot}}(\bar{R}, z, V_Z; \delta_{\mathrm{pot}}).$$

Hence,

$$E_1 \cap E_2 \cap E_3 \subseteq G_{\mathrm{pot}}\bigl(\bar{R}, M_{\mathrm{pot}}(\bar{R}, z, V_Z; \delta_{\mathrm{pot}})\bigr).$$

It follows that

$$\begin{aligned} & \mathbb{P}_{\mathcal{A}}\left(G_Z(z, V_Z) \cap \left(G_{\mathrm{pot}}\bigl(\bar{R}, M_{\mathrm{pot}}(\bar{R}, z, V_Z; \delta_{\mathrm{pot}})\bigr)\right)^c\right) \\ & \qquad \leq \mathbb{P}_{\mathcal{A}}(E_1^c) + \mathbb{P}_{\mathcal{A}}(E_2^c) + \mathbb{P}_{\mathcal{A}}\bigl(G_Z(z, V_Z) \cap E_3^c\bigr) \leq \delta_{\mathrm{pot}}. \end{aligned}$$

Rearranging the above inequality gives the claim. ∎
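To make the union-bound bookkeeping concrete, the following minimal sketch evaluates the three terms of $M_{\mathrm{pot}}$ in (12). The function name `m_pot` and all numeric inputs are hypothetical; the only thing taken from the text is the formula, where each of the three lemmas is invoked at level $\delta_{\mathrm{pot}}/3$, so $\log(2/(\delta_{\mathrm{pot}}/3)) = \log(6/\delta_{\mathrm{pot}})$.

```python
import math

def m_pot(eta, G_delta, R_bar, T, B, lam, c_priv, V_Z, delta_pot):
    """Return the three summands of M_pot in (12) as a tuple."""
    # Each lemma is applied with failure probability delta_pot / 3.
    log_term = math.log(6.0 / delta_pot)
    # Mini-batch sampling fluctuation (Lemma C.7).
    term1 = 4 * math.sqrt(2) * eta * G_delta * R_bar * math.sqrt(T * log_term / B)
    # Lagged Gaussian-noise fluctuation (Lemma C.8).
    term2 = 4 * (1 - lam) * eta * c_priv * R_bar * math.sqrt(T * log_term)
    # Cross term between sampling noise and Gaussian noise (Lemma C.9).
    term3 = 4 * (1 - lam) * eta ** 2 * c_priv * G_delta * math.sqrt(V_Z * log_term / B)
    return term1, term2, term3

# Illustrative, made-up parameter values.
terms = m_pot(eta=0.1, G_delta=1.0, R_bar=2.0, T=100, B=10,
              lam=0.5, c_priv=0.5, V_Z=50.0, delta_pot=0.1)
total = sum(terms)
print(terms, total)
```

Returning the summands separately makes it easy to see which fluctuation dominates: the second and third terms scale linearly in $1-\lambda$, while the first does not depend on $\lambda$ at all.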

Recall that

$$z_{\delta_Z} = \sqrt{mdp} + \sqrt{2\log\big(\tfrac{2T}{\delta_Z}\big)}, \qquad V_{\Delta,\delta_\Delta} = \frac{2TG_\delta^2}{B} + 8G_\delta^2 \log\big(\tfrac{1}{\delta_\Delta}\big),$$

$$V_{Z,\delta_Z} = (T-1)mdp + 2\sqrt{(T-1)mdp \log\big(\tfrac{2}{\delta_Z}\big)} + 2\log\big(\tfrac{2}{\delta_Z}\big),$$

and

$$\begin{aligned} M_{\delta_{\mathrm{pot}}} =\; & 4\sqrt{2}\,\eta G_\delta \bar{R} \sqrt{\frac{T \log(6/\delta_{\mathrm{pot}})}{B}} + 4(1-\lambda)\eta c_{\mathrm{priv}}\,\bar{R} \sqrt{T \log\big(\tfrac{6}{\delta_{\mathrm{pot}}}\big)} \\ & + 4(1-\lambda)\eta^2 c_{\mathrm{priv}} G_\delta \sqrt{\frac{V_{Z,\delta_Z} \log(6/\delta_{\mathrm{pot}})}{B}}. \end{aligned} \tag{13}$$
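The recalled thresholds are simple closed forms, and evaluating them shows how they scale with the noise dimension and the horizon. The sketch below is illustrative only: the function names are hypothetical, and it assumes that $mdp$ denotes the product $m \cdot d \cdot p$ (the dimension of each Gaussian vector $Z_t$).

```python
import math

def z_threshold(m, d, p, T, delta_Z):
    """z_{delta_Z}: high-probability bound on the norm of each noise vector."""
    return math.sqrt(m * d * p) + math.sqrt(2 * math.log(2 * T / delta_Z))

def v_delta(T, B, G_delta, delta_Delta):
    """V_{Delta, delta_Delta}: bound on the accumulated squared sampling noise."""
    return 2 * T * G_delta ** 2 / B + 8 * G_delta ** 2 * math.log(1 / delta_Delta)

def v_z(m, d, p, T, delta_Z):
    """V_{Z, delta_Z}: bound on the accumulated squared Gaussian-noise norms."""
    k = (T - 1) * m * d * p          # assumed total number of Gaussian coordinates
    log_term = math.log(2 / delta_Z)
    return k + 2 * math.sqrt(k * log_term) + 2 * log_term

# Illustrative, made-up sizes.
m, d, p, T, B = 64, 10, 4, 200, 32
print(z_threshold(m, d, p, T, 0.05), v_delta(T, B, 1.0, 0.05), v_z(m, d, p, T, 0.05))
```

In this form $V_{Z,\delta_Z}$ exceeds its leading term $(T-1)mdp$ only by lower-order corrections in $\log(2/\delta_Z)$, which is why the later rate is stated up to polylogarithmic factors.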
We now combine the high-probability bounds for $G_Z$, $G_{\Delta^2}$, and $G_{\mathrm{pot}}$ with the shifted bootstrap lemma to obtain a conditional optimization bound for the projected iterates.

###### Lemma C\.11\(High\-probability shifted bootstrap\)\.

Suppose $0 < \lambda < 1$, Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds, and the event $\mathcal{E}_\delta$ occurs. Assume $C_{\mathrm{clip}} \geq G_\delta$, where $G_\delta$ is defined in ([5](https://arxiv.org/html/2605.12648#A3.E5)). Fix a comparator $\mathbf{W}^*$ and a shifted radius $\bar{R} > 0$. Let $\delta_Z, \delta_\Delta, \delta_{\mathrm{pot}} \in (0,1)$. Assume that $\eta \leq \frac{1}{12\beta}$ and that conditions ([7](https://arxiv.org/html/2605.12648#A3.E7)) and ([10](https://arxiv.org/html/2605.12648#A3.E10)) hold. If $R_* \geq \bar{R} + \|\mathbf{W}^* - \mathbf{W}_0\|_2 + \eta c z_{\delta_Z}$, then, conditional on the dataset and the initialization satisfying $\mathcal{E}_\delta$, with probability at least $1 - (\delta_Z + \delta_\Delta + \delta_{\mathrm{pot}})$ over the randomness of the algorithm,

$$\begin{aligned} \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}_S(\mathbf{W}_{t-1}) \leq\; & 8\mathcal{L}_S(\mathbf{W}^*) + \frac{3\|\mathbf{W}_0 - \mathbf{W}^*\|_2^2}{\eta T} + \frac{3M_{\delta_{\mathrm{pot}}}}{\eta T} + \frac{6\eta V_{\Delta,\delta_\Delta}}{T} \\ & + 3\Big((1-\lambda)^2 \eta + 12\beta\lambda^2 \eta^2\Big) c^2 \frac{V_{Z,\delta_Z}}{T}. \end{aligned}$$

###### Proof\.

By Lemma [C.5](https://arxiv.org/html/2605.12648#A3.Thmtheorem5), Lemma [C.6](https://arxiv.org/html/2605.12648#A3.Thmtheorem6), and Lemma [C.10](https://arxiv.org/html/2605.12648#A3.Thmtheorem10), it holds that

$$\mathbb{P}_{\mathcal{A}}\Bigl(G_Z(z_{\delta_Z}, V_{Z,\delta_Z}) \cap G_{\Delta^2}(\bar{R}, V_{\Delta,\delta_\Delta}) \cap G_{\mathrm{pot}}(\bar{R}, M_{\delta_{\mathrm{pot}}})\Bigr) \geq 1 - (\delta_Z + \delta_\Delta + \delta_{\mathrm{pot}}).$$

Combining this observation with Lemma [C.3](https://arxiv.org/html/2605.12648#A3.Thmtheorem3) completes the proof. ∎

Steps 5 and 6: Close the bootstrap, show projection inactivity, and choose the comparator using initialization separability.

Compared with the proof sketch in Section [4.3](https://arxiv.org/html/2605.12648#S4.SS3), the last two steps are treated together in the detailed proof. In the sketch, Step 5 closes the localization bootstrap for a fixed comparator, while Step 6 chooses the NTK-separability comparator. In the rigorous argument, these two steps are coupled, since the shifted localization event, the bootstrap condition, and the projection-inactivity argument all depend on the chosen comparator $\mathbf{W}^*$. We therefore first establish the existence of a suitable comparator under the initialization separability assumption, and then use this comparator to close the bootstrap and transfer the bound to the projected iterates.

###### Lemma C\.12\(Comparator under NTK separability\)\.

Suppose Assumption [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3) holds. Assume

$$m \gtrsim \log(m/\delta)\big(\log^2(T) + \log(n/\delta)\big)/\gamma^4.$$

Let $\tau \asymp \frac{1}{\gamma}\big(\log(T) + \sqrt{\log(n/\delta)}\big)$. Assume $R_* \asymp \tau$. Then, with probability at least $1 - \delta$ over the randomness of the initialization, there exists a comparator $\mathbf{W}^* = \mathbf{W}_0 + \tau\mathbf{u} \in \mathcal{K}$ such that

$$\mathcal{L}_S(\mathbf{W}^*) \leq \frac{1}{T},$$

and

$$\|\mathbf{W}^* - \mathbf{W}_0\|_2^2 = \tau^2 \asymp \frac{\log^2(T) + \log(n/\delta)}{\gamma^2}.$$

###### Proof\.

Let $\mathbf{u}$ be the unit vector in Assumption [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3), and define $\mathbf{W}^* = \mathbf{W}_0 + \tau\mathbf{u}$, where $\tau \leq R_*$ will be chosen later. Since $\|\mathbf{u}\|_2 = 1$, we immediately have

$$\|\mathbf{W}^* - \mathbf{W}_0\|_2 = \tau.$$

For any fixed $i \in [n]$, by Taylor's theorem, there exists $\alpha \in [0,1]$ with $\mathbf{W}_i' = \alpha\mathbf{W}_0 + (1-\alpha)\mathbf{W}^*$ such that

$$f_{\mathbf{W}^*}(\mathbf{x}_i) = f_{\mathbf{W}_0}(\mathbf{x}_i) + \big\langle \nabla f_{\mathbf{W}_0}(\mathbf{x}_i), \mathbf{W}^* - \mathbf{W}_0 \big\rangle + \frac{1}{2}(\mathbf{W}^* - \mathbf{W}_0)^\top \nabla^2 f_{\mathbf{W}_i'}(\mathbf{x}_i)(\mathbf{W}^* - \mathbf{W}_0).$$

Multiplying both sides by $y_i$ and using $\mathbf{W}^* - \mathbf{W}_0 = \tau\mathbf{u}$, we obtain

$$y_i f_{\mathbf{W}^*}(\mathbf{x}_i) = y_i f_{\mathbf{W}_0}(\mathbf{x}_i) + \tau\, y_i \big\langle \nabla f_{\mathbf{W}_0}(\mathbf{x}_i), \mathbf{u} \big\rangle + \frac{\tau^2}{2} y_i\, \mathbf{u}^\top \nabla^2 f_{\mathbf{W}_i'}(\mathbf{x}_i)\mathbf{u}.$$

Since $|y_i| \leq 1$ and $\|\mathbf{u}\|_2 = 1$, it holds that

$$y_i f_{\mathbf{W}^*}(\mathbf{x}_i) \geq -|f_{\mathbf{W}_0}(\mathbf{x}_i)| + \tau\, y_i \big\langle \nabla f_{\mathbf{W}_0}(\mathbf{x}_i), \mathbf{u} \big\rangle - \frac{\tau^2}{2} \big\|\nabla^2 f_{\mathbf{W}_i'}(\mathbf{x}_i)\big\|_2.$$

From Assumption [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3), we know that

$$y_i \big\langle \nabla f_{\mathbf{W}_0}(\mathbf{x}_i), \mathbf{u} \big\rangle \geq \gamma,$$

which implies

$$y_i f_{\mathbf{W}^*}(\mathbf{x}_i) \geq \tau\gamma - |f_{\mathbf{W}_0}(\mathbf{x}_i)| - \frac{\tau^2}{2} \big\|\nabla^2 f_{\mathbf{W}_i'}(\mathbf{x}_i)\big\|_2. \tag{14}$$
We now lower-bound the right-hand side of ([14](https://arxiv.org/html/2605.12648#A3.E14)). From Eq. (27) in \[[60](https://arxiv.org/html/2605.12648#bib.bib19)\], we know that with probability at least $1 - \delta/2$, it holds that

$$|f_{\mathbf{W}_0}(\mathbf{x}_i)| \leq B_b \sqrt{2p \log\Big(\frac{2n}{\delta}\Big)}, \qquad \forall i \in [n]. \tag{15}$$

Note that the good event $\mathcal{E}_{\delta/2}$ (see ([1](https://arxiv.org/html/2605.12648#A2.E1))) has probability at least $1 - \delta/2$. From Lemma [B.6](https://arxiv.org/html/2605.12648#A2.Thmtheorem6), we know that with probability at least $1 - \delta/2$ over the initialization, it holds that

$$\big\|\nabla^2 f_{\mathbf{W}_i'}(\mathbf{x}_i)\big\|_2 \leq \frac{C_{\sigma,b}\, p^{\frac{3}{2}} \big(\sqrt{p} + \sqrt{\log(2m/\delta)}\big)}{\sqrt{m}}.$$

Plugging the above two bounds into ([14](https://arxiv.org/html/2605.12648#A3.E14)), we get

$$\begin{aligned} y_i f_{\mathbf{W}^*}(\mathbf{x}_i) & \geq \tau\gamma - B_b \sqrt{2p \log\Big(\frac{2n}{\delta}\Big)} - \frac{\tau^2 C_{\sigma,b}\, p^{\frac{3}{2}} \big(\sqrt{p} + \sqrt{\log(2m/\delta)}\big)}{2\sqrt{m}} \\ & \geq \frac{\tau\gamma}{2} - \frac{\tau^2 C_{\sigma,b}\, p^{\frac{3}{2}} \big(\sqrt{p} + \sqrt{\log(2m/\delta)}\big)}{2\sqrt{m}} \\ & \geq \frac{\tau\gamma}{4} \gtrsim \log(T), \end{aligned}$$

where the second inequality uses $\tau \asymp \big(\log(T) + \sqrt{\log(n/\delta)}\big)/\gamma$, and the third inequality uses the condition $m \gtrsim \log(m/\delta)\big(\log^2(T) + \log(n/\delta)\big)/\gamma^4 \gtrsim \log(m/\delta)\tau^2/\gamma^2$.

Choosing the implicit constant in $\tau$ large enough that $\tau\gamma/4 \geq \log(T)$, it then follows that

$$\ell\big(y_i f_{\mathbf{W}^*}(\mathbf{x}_i)\big) = \log\big(1 + \exp\big(-y_i f_{\mathbf{W}^*}(\mathbf{x}_i)\big)\big) \leq \exp\big(-y_i f_{\mathbf{W}^*}(\mathbf{x}_i)\big) \leq \exp(-\log(T)) = \frac{1}{T}.$$

Averaging over $i \in [n]$ yields

$$\mathcal{L}_S(\mathbf{W}^*) \leq \frac{1}{T},$$

which proves the first result of the lemma.

The second result follows directly from the definition of $\tau$. This completes the proof. ∎
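The final step of the argument rests on two elementary facts about the logistic loss: $\log(1 + e^{-x}) \leq e^{-x}$, and $e^{-x} \leq 1/T$ once the margin $x$ is at least $\log(T)$. A minimal numerical sketch (the helper `logistic_loss` and the chosen margins are illustrative, not from the paper):

```python
import math

def logistic_loss(margin):
    """Numerically stable log(1 + exp(-margin))."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, factor out exp(-margin) to avoid overflow.
    return -margin + math.log1p(math.exp(margin))

T = 1000
# Hypothetical margins y_i * f(x_i), all at least log(T).
margins = [math.log(T) + s for s in (0.0, 0.5, 2.0, 10.0)]
losses = [logistic_loss(x) for x in margins]

# Chain used in the proof: loss <= exp(-margin) <= 1/T (small float tolerance).
ok = all(l <= math.exp(-x) <= 1.0 / T + 1e-15 for l, x in zip(losses, margins))
avg_loss = sum(losses) / len(losses)
print(ok, avg_loss <= 1.0 / T)
```

Because the bound holds sample by sample, averaging over the training set preserves it, which is exactly how the per-example inequality becomes a bound on the empirical risk.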

Recall that $c_{\mathrm{priv}} = \frac{C_{\mathrm{clip}}\kappa}{B}$ and $\beta = C_{\sigma,b}\, p^3$. We now state our main result.

###### Theorem C\.13\(Restatement of Theorem[4\.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1)\)\.

Let $\delta \in (0,1)$, $0 < \lambda < 1$, $\kappa > 0$, and suppose Assumptions [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3) hold. Set $R_* \asymp \frac{\log(T) + \sqrt{\log(n/\delta)}}{\gamma}$. Let $\{\mathbf{W}_t\}_{t=0}^{T-1}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\eta \leq \frac{1}{12 C_{\sigma,b}\, p^3}$ and $C_{\mathrm{clip}} \asymp G_\delta$ chosen so that $C_{\mathrm{clip}} \geq G_\delta$. Assume

$$\eta\gamma\sqrt{T}\Big(\frac{1}{\sqrt{B}} + \frac{\kappa(1-\lambda)}{B}\Big) + \eta^2\gamma^2\Big(\frac{T}{B} + 1\Big) \lesssim 1$$

and

$$\widetilde{\Omega}(\gamma^{-4}) \leq m \leq \widetilde{\mathcal{O}}\Big(B^2 (\eta^2\kappa^2\gamma^2 d)^{-1} \min\big\{1, \big(T((1-\lambda)^2 + \lambda^2\eta)\big)^{-1}\big\}\Big),$$

where $\widetilde{\mathcal{O}}$ and $\widetilde{\Omega}$ suppress polylogarithmic factors in $m, n, T, \delta^{-1}$. Then, with probability at least $1 - \delta$ over the initialization and the algorithmic randomness,

$$\begin{aligned} \frac{1}{T} \sum_{t=0}^{T-1} \mathcal{L}_S(\mathbf{W}_t) \lesssim\; & \mathrm{polylog}\Big(\frac{nT}{\delta}\Big) \cdot \Big[\frac{1}{\gamma^2\eta T} + \frac{1}{\gamma\sqrt{BT}} + \eta\Big(\frac{1}{B} + \frac{1}{T}\Big) \\ & \quad + \frac{(1-\lambda)\kappa}{B\sqrt{T}}\Big(\frac{1}{\gamma} + \frac{\eta\sqrt{md}}{\sqrt{B}}\Big) + \Big((1-\lambda)^2 + \lambda^2\eta\Big)\frac{\eta\kappa^2 md}{B^2}\Big] =: A_{\mathrm{corr}}. \end{aligned}$$
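To see how the pieces of $A_{\mathrm{corr}}$ trade off, the bracketed expression can be evaluated term by term. The sketch below is illustrative only: the function `a_corr_terms`, its dictionary keys, and the numeric inputs are hypothetical labels, the polylogarithmic prefactor and absolute constants are dropped, and $md$ is assumed to denote the product $m \cdot d$.

```python
import math

def a_corr_terms(gamma, eta, T, B, kappa, lam, m, d):
    """Bracketed terms of A_corr, ignoring polylog factors and constants."""
    return {
        # Terms present even without privacy noise.
        "optimization": 1.0 / (gamma ** 2 * eta * T),
        "sampling": 1.0 / (gamma * math.sqrt(B * T)),
        "step_size": eta * (1.0 / B + 1.0 / T),
        # Terms driven by the Gaussian perturbation (scale kappa).
        "noise_cross": (1 - lam) * kappa / (B * math.sqrt(T))
                       * (1.0 / gamma + eta * math.sqrt(m * d) / math.sqrt(B)),
        "noise_variance": ((1 - lam) ** 2 + lam ** 2 * eta)
                          * eta * kappa ** 2 * m * d / B ** 2,
    }

# Made-up configuration, varying only the interpolation parameter lambda.
base = dict(gamma=0.5, eta=0.01, T=10_000, B=256, kappa=2.0, m=128, d=20)
low_lam = a_corr_terms(lam=0.1, **base)
high_lam = a_corr_terms(lam=0.9, **base)
print(low_lam["noise_variance"], high_lam["noise_variance"])
```

With $\eta < 1$, the factor $(1-\lambda)^2 + \lambda^2\eta$ in the last term shrinks as $\lambda$ moves toward 1 over most of the range, and setting $\kappa = 0$ removes both noise terms, leaving only the three terms that do not involve $\kappa$.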

###### Proof\.

Fix auxiliary failure probabilities $\delta_{\mathrm{init}}=\delta_{\mathrm{ntk}}=\delta_{Z}=\delta_{\Delta}=\delta_{\mathrm{pot}}=\frac{\delta}{5}$. Since replacing $\delta$ by a constant fraction only affects logarithmic factors by absolute constants, we suppress this distinction below. Set $\tau_{\gamma}^{2}\asymp\frac{\log^{2}(T)+\log(n/\delta)}{\gamma^{2}}$. We choose the shifted localization radius as $\bar{R}=C_{\bar{R}}\tau_{\gamma}$, where $C_{\bar{R}}>0$ is a sufficiently large universal constant. The proof consists of the following steps.

(i). Comparator construction under initialization separability. By Lemma [C.12](https://arxiv.org/html/2605.12648#A3.Thmtheorem12), under the width condition

$$m\gtrsim\frac{\log(m/\delta_{\mathrm{ntk}})\bigl(\log^{2}(T)+\log(n/\delta_{\mathrm{ntk}})\bigr)}{\gamma^{4}},$$

there exists a comparator $\mathbf{W}^{*}=\mathbf{W}_{0}+\tau_{\mathrm{ntk}}\mathbf{u}$ such that, with probability at least $1-\delta_{\mathrm{ntk}}$ over the initialization,

$$\mathcal{L}_{S}(\mathbf{W}^{*})\leq\frac{1}{T},\qquad\|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2}\leq\tau_{\mathrm{ntk}},$$

where $\tau_{\mathrm{ntk}}^{2}\lesssim\frac{\log^{2}(T)+\log(n/\delta)}{\gamma^{2}}$. By taking $C_{\bar{R}}$ sufficiently large, we may assume $\tau_{\mathrm{ntk}}\leq\tau_{\gamma}$.

(ii). Verification of the shifted-bootstrap assumptions. We now verify the assumptions of Lemma [C.11](https://arxiv.org/html/2605.12648#A3.Thmtheorem11). Throughout this step, we use the same convention as in the theorem statement and suppress polylogarithmic factors in $m,n,T,\delta^{-1}$ and polynomial factors in $p$ and $C_{\sigma,b}$. By the definitions of $z_{\delta_{Z}}$, $V_{Z,\delta_{Z}}$, $V_{\Delta,\delta_{\Delta}}$, and $M_{\delta_{\mathrm{pot}}}$, and since all auxiliary failure probabilities are constant fractions of $\delta$, we have

$$z_{\delta_{Z}}\lesssim\sqrt{md+\log\big(\frac{nT}{\delta}\big)},\qquad\frac{V_{Z,\delta_{Z}}}{T}\lesssim md+\frac{\log(nT/\delta)}{T},$$

and

$$V_{\Delta,\delta_{\Delta}}\lesssim G_{\delta}^{2}\Big(\frac{T}{B}+\log\big(\frac{nT}{\delta}\big)\Big).$$

Moreover,

$$\begin{aligned}M_{\delta_{\mathrm{pot}}}\lesssim{}&\eta G_{\delta}\bar{R}\sqrt{\frac{T\log(nT/\delta)}{B}}+(1-\lambda)\eta c_{\mathrm{priv}}\bar{R}\sqrt{T\log\big(\frac{nT}{\delta}\big)}\\&\quad+(1-\lambda)\eta^{2}c_{\mathrm{priv}}G_{\delta}\sqrt{\frac{Tmd\log(nT/\delta)}{B}}.\end{aligned}$$

Using $\bar{R}\asymp\tau_{\gamma}$, $G_{\delta}\lesssim\sqrt{\log(1/\delta)}$, and $c_{\mathrm{priv}}=C_{\rm clip}\kappa/B$, this gives

$$\begin{aligned}M_{\delta_{\mathrm{pot}}}\lesssim{}&\eta\tau_{\gamma}\sqrt{\frac{T\log(nT/\delta)}{B}}+(1-\lambda)\eta\tau_{\gamma}\frac{\kappa}{B}\sqrt{T\log\big(\frac{nT}{\delta}\big)}\\&\quad+(1-\lambda)\eta^{2}\frac{\kappa}{B}\sqrt{\frac{Tmd\log(nT/\delta)}{B}},\end{aligned}$$

up to the suppressed factors.

On the initialization event $\mathcal{E}_{\delta_{\mathrm{init}}}$, the uniform gradient bound ([5](https://arxiv.org/html/2605.12648#A3.E5)) holds. Since $C_{\rm clip}\geq G_{\delta_{\mathrm{init}}}$, the clipping operator is inactive on this event.

We next verify the localization condition required by Lemma [C.2](https://arxiv.org/html/2605.12648#A3.Thmtheorem2). The lemma requires

$$m\gtrsim C_{\sigma,b}^{2}\,p^{3}\bigl(\log(m/\delta)+p\bigr)\big(\bar{R}+\eta c_{\mathrm{priv}}z_{\delta_{Z}}\big)^{4}.$$

By the upper bound on $m$ in the theorem, together with the bound on $z_{\delta_{Z}}$ and the choice $c_{\mathrm{priv}}=C_{\mathrm{clip}}\kappa/B$, we have $\eta c_{\mathrm{priv}}z_{\delta_{Z}}\lesssim\tau_{\gamma}$ up to the suppressed logarithmic and $p$-dependent factors. Since $\bar{R}\asymp\tau_{\gamma}$, it follows that

$$\bar{R}+\eta c_{\mathrm{priv}}z_{\delta_{Z}}\lesssim\tau_{\gamma}.$$

Since $\tau_{\gamma}^{4}\asymp\frac{(\log^{2}(T)+\log(n/\delta))^{2}}{\gamma^{4}}$, the stated lower bound on $m$ implies that the localization condition in Lemma [C.2](https://arxiv.org/html/2605.12648#A3.Thmtheorem2) holds.

It remains to verify the self-consistency condition ([10](https://arxiv.org/html/2605.12648#A3.E10)) in Lemma [C.3](https://arxiv.org/html/2605.12648#A3.Thmtheorem3). Using $\|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2}^{2}\leq\tau_{\gamma}^{2}$ and $\mathcal{L}_{S}(\mathbf{W}^{*})\leq\frac{1}{T}$, the left-hand side of ([10](https://arxiv.org/html/2605.12648#A3.E10)) is bounded by

$$\tau_{\gamma}^{2}+\eta+M_{\delta_{\mathrm{pot}}}+2\eta^{2}V_{\Delta,\delta_{\Delta}}+\big((1-\lambda)^{2}\eta^{2}c_{\mathrm{priv}}^{2}+2\beta\lambda^{2}\eta^{3}c_{\mathrm{priv}}^{2}\big)V_{Z,\delta_{Z}}.$$

Substituting the preceding estimates, the part beyond $\tau_{\gamma}^{2}$ is bounded by

$$\begin{aligned}&\eta+\eta\tau_{\gamma}\sqrt{\frac{T\log(nT/\delta)}{B}}+(1-\lambda)\eta\frac{\kappa}{B}\tau_{\gamma}\sqrt{T\log\Big(\frac{nT}{\delta}\Big)}\\&\quad+(1-\lambda)\eta^{2}\frac{\kappa}{B}\sqrt{\frac{Tmd\log(nT/\delta)}{B}}+\eta^{2}\Big(\frac{T}{B}+\log\Big(\frac{nT}{\delta}\Big)\Big)\\&\quad+\Big((1-\lambda)^{2}\eta^{2}+\lambda^{2}\eta^{3}\Big)\frac{\kappa^{2}Tmd}{B^{2}},\end{aligned}$$

up to the suppressed factors. The first, second, third, and fifth terms are controlled by the step-size condition $\eta\gamma\sqrt{T}\big(\frac{1}{\sqrt{B}}+\frac{\kappa(1-\lambda)}{B}\big)+\eta^{2}\gamma^{2}\big(\frac{T}{B}+1\big)\lesssim 1$. The fourth term is controlled by the upper bound on $m$: indeed, the first part of the upper bound implies $\eta\kappa\sqrt{md}/B\lesssim 1/\gamma$, and hence

$$(1-\lambda)\eta^{2}\frac{\kappa}{B}\sqrt{\frac{Tmd}{B}}\lesssim(1-\lambda)\frac{\eta}{\gamma}\sqrt{\frac{T}{B}}\lesssim\tau_{\gamma}^{2}$$

under the same step-size condition, up to the suppressed logarithmic factors. The last term is controlled by the second part of the upper bound on $m$, namely

$$m\lesssim\frac{B^{2}}{\eta^{2}\kappa^{2}\gamma^{2}d}\cdot\frac{1}{T((1-\lambda)^{2}+\lambda^{2}\eta)},$$

which gives

$$\Big((1-\lambda)^{2}\eta^{2}+\lambda^{2}\eta^{3}\Big)\frac{\kappa^{2}Tmd}{B^{2}}\lesssim\tau_{\gamma}^{2}.$$

Therefore, the self-consistency condition holds after choosing $C_{\bar{R}}$ sufficiently large.

Finally, the same bound $\eta c_{\mathrm{priv}}z_{\delta_{Z}}\lesssim\tau_{\gamma}$, together with $\bar{R}\asymp\tau_{\gamma}$ and $\|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2}\leq\tau_{\gamma}$, gives $\bar{R}+\|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2}+\eta c_{\mathrm{priv}}z_{\delta_{Z}}\lesssim\tau_{\gamma}$. Choosing the constant in $R_{*}=C\tau_{\gamma}$ sufficiently large ensures

$$R_{*}\geq\bar{R}+\|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2}+\eta c_{\mathrm{priv}}z_{\delta_{Z}}.$$

Therefore all assumptions of Lemma [C.11](https://arxiv.org/html/2605.12648#A3.Thmtheorem11) are verified.

(iii). Application of the shifted-bootstrap lemma. Applying Lemma [C.11](https://arxiv.org/html/2605.12648#A3.Thmtheorem11) and using $\mathcal{L}_{S}(\mathbf{W}^{*})\leq 1/T$ and $\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2}\leq\tau_{\gamma}^{2}$, we obtain that, conditioned on the initialization and comparator events, with probability at least $1-(\delta_{Z}+\delta_{\Delta}+\delta_{\mathrm{pot}})$ over the algorithmic randomness,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{\tau_{\gamma}^{2}}{\eta T}+\frac{M_{\delta_{\mathrm{pot}}}}{\eta T}+\eta\frac{V_{\Delta,\delta_{\Delta}}}{T}+\Big((1-\lambda)^{2}\eta+\lambda^{2}\eta^{2}\Big)\frac{\kappa^{2}}{B^{2}}\frac{V_{Z,\delta_{Z}}}{T}.\tag{16}$$
By the definition of $\tau_{\gamma}$, we have

$$\frac{\tau_{\gamma}^{2}}{\eta T}\lesssim\frac{\log^{2}(T)+\log(n/\delta)}{\gamma^{2}\eta T}.$$

Moreover, from the definition of $M_{\delta_{\mathrm{pot}}}$ and the estimates in Step (ii),

$$\frac{M_{\delta_{\mathrm{pot}}}}{\eta T}\lesssim\tau_{\gamma}\sqrt{\frac{\log(nT/\delta)}{BT}}+(1-\lambda)\frac{\kappa}{B}\tau_{\gamma}\sqrt{\frac{\log(nT/\delta)}{T}}+(1-\lambda)\eta\frac{\kappa}{B}\sqrt{\frac{md\log(nT/\delta)}{BT}}.$$

Also,

$$\eta\frac{V_{\Delta,\delta_{\Delta}}}{T}\lesssim\eta\Big(\frac{1}{B}+\frac{\log(nT/\delta)}{T}\Big)\quad\text{and}\quad\frac{V_{Z,\delta_{Z}}}{T}\lesssim md+\frac{\log(nT/\delta)}{T}.$$

Substituting these estimates into ([16](https://arxiv.org/html/2605.12648#A3.E16)) and using the definition of $\tau_{\gamma}$ gives

$$\begin{aligned}\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\mathrm{polylog}\Big(\frac{nT}{\delta}\Big)\cdot\Big[&\frac{1}{\gamma^{2}\eta T}+\frac{1}{\gamma\sqrt{BT}}+\eta\Big(\frac{1}{B}+\frac{1}{T}\Big)+\frac{(1-\lambda)\kappa}{B\sqrt{T}}\Big(\frac{1}{\gamma}+\frac{\eta\sqrt{md}}{\sqrt{B}}\Big)\\&\quad+\big((1-\lambda)^{2}+\lambda^{2}\eta\big)\frac{\eta\kappa^{2}md}{B^{2}}\Big].\end{aligned}$$

Here we used $m\gtrsim\log(nT/\delta)/(Td)$ and $1/T+1/(\gamma^{2}\eta T)\asymp 1/(\gamma^{2}\eta T)$, since $Td\geq 1$, $\gamma\leq 1$, and $\eta\leq 1$. This is exactly the claimed bound.

(iv). Probability estimate. The above argument holds on the intersection of the initialization event $\mathcal{E}_{\delta_{\mathrm{init}}}$, the comparator event from Lemma [C.12](https://arxiv.org/html/2605.12648#A3.Thmtheorem12), and the algorithmic good event from Lemma [C.11](https://arxiv.org/html/2605.12648#A3.Thmtheorem11). By a union bound, their total failure probability is at most

$$\delta_{\mathrm{init}}+\delta_{\mathrm{ntk}}+\delta_{Z}+\delta_{\Delta}+\delta_{\mathrm{pot}}=\delta.$$

Therefore, the stated optimization bound holds with probability at least $1-\delta$. ∎

### C\.2 Proofs for population risk of DP\-SGD with correlated noise

We introduce the concept of algorithmic stability to control the generalization gap, and then combine the generalization gap with the optimization risk bound to obtain the population risk bound (see the error decomposition in Section [3](https://arxiv.org/html/2605.12648#S3)).

For a randomized algorithm $\mathcal{A}$, let $\mathcal{A}(S)\in\mathbb{R}^{mdp}$ be the output of $\mathcal{A}$ based on the dataset $S$. On-average argument stability measures the average sensitivity of the output to perturbations of the dataset.

###### Definition C\.14 \(On\-average argument stability \[[34](https://arxiv.org/html/2605.12648#bib.bib564)\]\)\.

Let $S=\{z_{1},\ldots,z_{n}\}$ and $\widetilde{S}=\{z^{\prime}_{1},\ldots,z^{\prime}_{n}\}$ be drawn independently from $\mathcal{P}$. For any $i\in[n]$, define $S^{(i)}=\{z_{1},\ldots,z_{i-1},z_{i}^{\prime},z_{i+1},\ldots,z_{n}\}$. Let $\mathcal{A}(S)$ and $\mathcal{A}(S^{(i)})$ be produced by a randomized algorithm $\mathcal{A}$ based on $S$ and $S^{(i)}$, respectively. We say $\mathcal{A}$ is on-average argument $\epsilon$-stable if

$$\mathbb{E}_{S,\widetilde{S},\mathcal{A}}\Big[\frac{1}{n}\sum_{i=1}^{n}\|\mathcal{A}(S)-\mathcal{A}(S^{(i)})\|_{2}\Big]\leq\epsilon.$$

We use the connection between on-average argument stability and generalization error bounds \[[34](https://arxiv.org/html/2605.12648#bib.bib564)\].
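As a concrete illustration of the definition, the inner average $\frac{1}{n}\sum_{i}\|\mathcal{A}(S)-\mathcal{A}(S^{(i)})\|_{2}$ can be computed directly for a toy algorithm and then averaged over dataset draws. The sketch below assumes a deterministic stand-in algorithm (the coordinate-wise sample mean) and standard Gaussian data; the names `replace_one`, `argument_gap`, `mean_algorithm`, and `estimate_stability` are hypothetical and have nothing to do with the KAN training procedure analyzed in the paper.

```python
import math
import random

def replace_one(S, S_tilde, i):
    """Build S^(i): copy S but swap in the i-th point of the independent copy."""
    return S[:i] + [S_tilde[i]] + S[i + 1:]

def argument_gap(algorithm, S, S_tilde):
    """(1/n) * sum_i || A(S) - A(S^(i)) ||_2 for one draw of (S, S_tilde)."""
    out = algorithm(S)
    n = len(S)
    total = 0.0
    for i in range(n):
        total += math.dist(out, algorithm(replace_one(S, S_tilde, i)))
    return total / n

def mean_algorithm(S):
    """Toy deterministic algorithm: the coordinate-wise mean of the data."""
    n, dim = len(S), len(S[0])
    return [sum(z[k] for z in S) / n for k in range(dim)]

def estimate_stability(algorithm, n, dim, trials, seed=0):
    """Monte Carlo estimate of the expectation over (S, S_tilde)."""
    rng = random.Random(seed)

    def draw():
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

    return sum(argument_gap(algorithm, draw(), draw()) for _ in range(trials)) / trials

print(estimate_stability(mean_algorithm, n=50, dim=3, trials=200))
```

For the sample mean, each replacement moves the output by $\|z_{i}-z_{i}^{\prime}\|_{2}/n$, so the estimate shrinks roughly like $1/n$. For a randomized algorithm one would additionally average over its internal randomness.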

###### Lemma C\.15 \(\[[34](https://arxiv.org/html/2605.12648#bib.bib564)\]\)\.

If $\mathcal{A}$ is on-average argument $\epsilon$-stable and the loss $\ell$ is $L$-Lipschitz with respect to $\mathcal{A}(S)$, then

$$\mathbb{E}_{S,\mathcal{A}}\big[\mathcal{L}(\mathcal{A}(S))-\mathcal{L}_{S}(\mathcal{A}(S))\big]\leq 2L\epsilon.$$

To compare the outputs on two neighboring datasets $S$ and $S^{(i)}$, we couple two runs of Algorithm [1](https://arxiv.org/html/2605.12648#alg1) by using the same initialization, the same sequence of mini-batches, and the same Gaussian noise sequence. Under this coupling, the additive perturbations appear identically in the two updates and cancel in the difference of the iterates. The resulting stability recursion therefore depends only on the gradient part of the projected update. It is important, however, that the empirical losses appearing below are still evaluated along the two coupled private trajectories.
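The cancellation under coupling can be checked numerically on a toy problem. The sketch below is a simplification under stated assumptions: a scalar parameter, the quadratic loss $\frac{1}{2}(w-z)^{2}$, full-batch gradients, no clipping, and no projection (with projection, the argument only yields an inequality via non-expansiveness). The function names `noisy_gd` and `coupled_gap` are hypothetical, and the i.i.d. Gaussian noise here stands in for whatever noise sequence the two runs share.

```python
import random

def noisy_gd(data, eta, steps, sigma, seed):
    """Unprojected noisy gradient descent on the mean of 0.5 * (w - z)^2."""
    rng = random.Random(seed)  # identical seed => identical noise sequence
    w = 0.0                    # identical initialization in both runs
    trajectory = [w]
    for _ in range(steps):
        grad = sum(w - z for z in data) / len(data)
        xi = rng.gauss(0.0, 1.0)
        w = w - eta * (grad + sigma * xi)
        trajectory.append(w)
    return trajectory

def coupled_gap(S, S_i, eta, steps, sigma, seed=0):
    """|W_t - W_t^(i)| along two runs that share initialization and noise."""
    run = noisy_gd(S, eta, steps, sigma, seed)
    run_i = noisy_gd(S_i, eta, steps, sigma, seed)
    return [abs(a - b) for a, b in zip(run, run_i)]

S = [0.3, -1.2, 0.8, 2.0]
S_i = [0.3, -1.2, 5.0, 2.0]  # neighboring dataset: third point replaced

gap_with_noise = coupled_gap(S, S_i, eta=0.1, steps=50, sigma=1.0)
gap_without_noise = coupled_gap(S, S_i, eta=0.1, steps=50, sigma=0.0)
print(max(abs(a - b) for a, b in zip(gap_with_noise, gap_without_noise)))
```

Up to floating-point rounding, the two gap sequences coincide: the shared perturbation drops out of the difference of the iterates, while each individual trajectory (and hence each loss value along it) still depends on the noise.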

Let

$$\Delta_{t}^{(i)}=\|\mathbf{W}_{t}-\mathbf{W}_{t}^{(i)}\|_{2},\qquad\mathbf{W}_{\alpha,t}^{(i)}=\alpha\mathbf{W}_{t}+(1-\alpha)\mathbf{W}_{t}^{(i)},\quad\alpha\in[0,1].$$

For each $t\geq 0$, define the shared mini-batch loss

$$\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W})=\frac{1}{B}\sum_{j\in\mathcal{B}_{t}\setminus\{i\}}\ell\left(y_{j}f_{\mathbf{W}}(\mathbf{x}_{j})\right),$$

where the sum is interpreted as $0$ if $\mathcal{B}_{t}\setminus\{i\}=\varnothing$. We also define

$$\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W})=\frac{1}{B}\sum_{j\in\mathcal{B}_{t}}\ell\left(y_{j}f_{\mathbf{W}}(\mathbf{x}_{j})\right),$$

$$\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W})=\frac{1}{B}\sum_{j\in\mathcal{B}_{t}}\ell\left(y_{j}^{(i)}f_{\mathbf{W}}(\mathbf{x}_{j}^{(i)})\right)=\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W})+\frac{1}{B}\mathbf{1}_{\{i\in\mathcal{B}_{t}\}}\ell\left(y_{i}^{\prime}f_{\mathbf{W}}(\mathbf{x}_{i}^{\prime})\right).$$

Further, let

$$a_{\delta,m}=\frac{2C_{\sigma,b}\,p^{\frac{3}{2}}\big(\sqrt{\log(\frac{m}{\delta})}+\sqrt{p}\big)}{\sqrt{m}},\qquad b_{\delta,m}=C_{\sigma,b}\,p\Bigl(\sqrt{p}+\sqrt{\tfrac{\log(1/\delta)}{m}}\Bigr).$$

###### Lemma C\.16\.

Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds and the event $\mathcal{E}_{\delta}$ occurs, and assume $C_{\mathrm{clip}}\geq G_{\delta}$, where $G_{\delta}$ is defined in ([5](https://arxiv.org/html/2605.12648#A3.E5)). Let $\{\mathbf{W}_{t}\}_{t\geq 0}$ and $\{\mathbf{W}_{t}^{(i)}\}_{t\geq 0}$ be two coupled runs of Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\lambda>0$ on $S$ and $S^{(i)}$, respectively, driven by the same initialization, the same mini-batch sequence, and the same Gaussian noise sequence. Assume $\eta\leq\frac{1}{C_{\sigma,b}\,p^{3}}$ and $m\gtrsim C_{\sigma,b}^{2}p^{3}(\log(m/\delta)+p)R_{*}^{4}$. Then, for all $t\geq 0$, it holds that

$$\Delta_{t+1}^{(i)}\leq\bigl(1+a_{\delta,m}\eta\bigl(\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})+\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)})\bigr)\bigr)\Delta_{t}^{(i)}+\frac{\eta b_{\delta,m}}{B}\mathbf{1}_{\{i\in\mathcal{B}_{t}\}}\bigl(\ell(y_{i}f_{\mathbf{W}_{t}}(\mathbf{x}_{i}))+\ell(y_{i}^{\prime}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{i}^{\prime}))\bigr).$$

###### Proof\.

Let $\zeta_{t}=\frac{C_{\mathrm{clip}}}{B}\xi_{t}$ denote the common additive Gaussian perturbation in the two coupled runs. Since the two algorithms use the same initialization, the same mini-batches, and the same Gaussian noise sequence, the projected updates can be written as

$$\mathbf{W}_{t+1}=\Pi_{\mathcal{K}}\big(\mathbf{W}_{t}-\eta(v_{t}+\zeta_{t})\big),\qquad\mathbf{W}_{t+1}^{(i)}=\Pi_{\mathcal{K}}\big(\mathbf{W}_{t}^{(i)}-\eta(v_{t}^{(i)}+\zeta_{t})\big),$$

where, since clipping is inactive on $\mathcal{E}_{\delta}$ under $C_{\mathrm{clip}}\geq G_{\delta}$,

$$v_{t}=\frac{1}{B}\sum_{j\in\mathcal{B}_{t}}\nabla\ell\big(y_{j}f_{\mathbf{W}_{t}}(\mathbf{x}_{j})\big),\qquad v_{t}^{(i)}=\frac{1}{B}\sum_{j\in\mathcal{B}_{t}}\nabla\ell\big(y_{j}^{(i)}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{j}^{(i)})\big).$$

By the non-expansiveness of the Euclidean projection,

$$\begin{aligned}\Delta_{t+1}^{(i)}&=\bigl\|\Pi_{\mathcal{K}}\big(\mathbf{W}_{t}-\eta(v_{t}+\zeta_{t})\big)-\Pi_{\mathcal{K}}\big(\mathbf{W}_{t}^{(i)}-\eta(v_{t}^{(i)}+\zeta_{t})\big)\bigr\|_{2}\\&\leq\bigl\|\mathbf{W}_{t}-\eta v_{t}-\bigl(\mathbf{W}_{t}^{(i)}-\eta v_{t}^{(i)}\bigr)\bigr\|_{2},\end{aligned}\tag{17}$$

where the common noise term cancels exactly.

We split off the possibly replaced sample. By the definitions of $\mathcal{L}_{t,i}^{\mathrm{sh}}$, $\mathcal{L}_{\mathcal{B}_{t}}$, and $\mathcal{L}_{\mathcal{B}_{t}}^{(i)}$, we have

$$v_{t}=\nabla\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t})+\frac{1}{B}\mathbf{1}_{\{i\in\mathcal{B}_{t}\}}\nabla\ell(y_{i}f_{\mathbf{W}_{t}}(\mathbf{x}_{i})),$$

and

$$v_{t}^{(i)}=\nabla\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t}^{(i)})+\frac{1}{B}\mathbf{1}_{\{i\in\mathcal{B}_{t}\}}\nabla\ell(y_{i}^{\prime}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{i}^{\prime})).$$

Substituting these two decompositions into ([17](https://arxiv.org/html/2605.12648#A3.E17)) gives

$$\begin{aligned}\Delta_{t+1}^{(i)}&\leq\big\|\mathbf{W}_{t}-\eta\nabla\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t})-\bigl(\mathbf{W}_{t}^{(i)}-\eta\nabla\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t}^{(i)})\bigr)\big\|_{2}\\&\quad+\frac{\eta}{B}\mathbf{1}_{\{i\in\mathcal{B}_{t}\}}\bigl(\|\nabla\ell(y_{i}f_{\mathbf{W}_{t}}(\mathbf{x}_{i}))\|_{2}+\|\nabla\ell(y_{i}^{\prime}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{i}^{\prime}))\|_{2}\bigr).\end{aligned}\tag{18}$$

We first bound the shared-sample part. Define $\bar{H}_{t,i}=\int_{0}^{1}\nabla^{2}\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{\alpha,t}^{(i)})\,d\alpha$. By the integral form of the mean-value theorem, we have

$$\nabla\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t})-\nabla\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t}^{(i)})=\bar{H}_{t,i}(\mathbf{W}_{t}-\mathbf{W}_{t}^{(i)}).$$

Therefore,

$$\big\|\mathbf{W}_{t}-\eta\nabla\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t})-\bigl(\mathbf{W}_{t}^{(i)}-\eta\nabla\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t}^{(i)})\bigr)\big\|_{2}\leq\|\mathbf{I}-\eta\bar{H}_{t,i}\|_{\mathrm{op}}\Delta_{t}^{(i)}.$$

Since $\nabla^{2}\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W})$ is symmetric, it remains to control the eigenvalues along the segment. Both iterates lie in $\mathcal{K}=\mathcal{B}(\mathbf{W}_{0},R_{*})$, hence the whole segment $\{\mathbf{W}_{\alpha,t}^{(i)}:\alpha\in[0,1]\}$ lies in $\mathcal{K}$, and $\|\mathbf{W}_{t}-\mathbf{W}_{t}^{(i)}\|_{2}\leq 2R_{*}$. Under the width condition $m\gtrsim C_{\sigma,b}^{2}p^{3}(\log(m/\delta)+p)R_{*}^{4}$, the local quasi-convexity condition is valid on this segment. Applying Lemma [B.4](https://arxiv.org/html/2605.12648#A2.Thmtheorem4) to $G=\mathcal{L}_{t,i}^{\mathrm{sh}}$, we obtain

$$\max_{\alpha\in[0,1]}\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{\alpha,t}^{(i)})\leq 2\max\big\{\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t}),\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t}^{(i)})\big\}.$$

Since the loss is nonnegative and $\mathcal{L}_{t,i}^{\mathrm{sh}}$ is the shared part of the corresponding mini-batch losses, we have

$$\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t})\leq\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})\qquad\text{and}\qquad\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{t}^{(i)})\leq\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)}).$$

Thus,

$$\max_{\alpha\in[0,1]}\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{\alpha,t}^{(i)})\leq 2\bigl(\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})+\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)})\bigr).$$

Moreover, by Lemma [B.7](https://arxiv.org/html/2605.12648#A2.Thmtheorem7), for every $\alpha\in[0,1]$,

$$\lambda_{\max}\bigl(\nabla^{2}\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{\alpha,t}^{(i)})\bigr)\leq C_{\sigma,b}p^{3}\leq\eta^{-1},$$

and

$$\lambda_{\min}\bigl(\nabla^{2}\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{\alpha,t}^{(i)})\bigr)\geq-\frac{a_{\delta,m}}{2}\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{\alpha,t}^{(i)}).$$

Combining the previous two displays gives

$$\lambda_{\min}\bigl(\nabla^{2}\mathcal{L}_{t,i}^{\mathrm{sh}}(\mathbf{W}_{\alpha,t}^{(i)})\bigr)\geq-a_{\delta,m}\bigl(\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})+\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)})\bigr).$$

Consequently, the eigenvalues of the averaged Hessian $\bar{H}_{t,i}$ lie between $-a_{\delta,m}\bigl(\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})+\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)})\bigr)$ and $\eta^{-1}$, so those of $\mathbf{I}-\eta\bar{H}_{t,i}$ are nonnegative and

$$\|\mathbf{I}-\eta\bar{H}_{t,i}\|_{\mathrm{op}}\leq 1+a_{\delta,m}\eta\bigl(\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})+\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)})\bigr).$$

It remains to bound the two single-sample gradient terms in ([18](https://arxiv.org/html/2605.12648#A3.E18)). By the self-bounding property of the logistic loss, we know $|\ell^{\prime}(u)|\leq\ell(u)$, and hence, for any $\mathbf{W}$ and $(\mathbf{x},y)$,

$$\|\nabla\ell(yf_{\mathbf{W}}(\mathbf{x}))\|_{2}=|\ell^{\prime}(yf_{\mathbf{W}}(\mathbf{x}))|\,\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}\leq\ell(yf_{\mathbf{W}}(\mathbf{x}))\,\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}.$$

On $\mathcal{E}_{\delta}$, Lemma [B.6](https://arxiv.org/html/2605.12648#A2.Thmtheorem6) gives $\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}\leq b_{\delta,m}$. Therefore,

$$\|\nabla\ell(y_{i}f_{\mathbf{W}_{t}}(\mathbf{x}_{i}))\|_{2}\leq b_{\delta,m}\ell(y_{i}f_{\mathbf{W}_{t}}(\mathbf{x}_{i}))\quad\text{and}\quad\|\nabla\ell(y_{i}^{\prime}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{i}^{\prime}))\|_{2}\leq b_{\delta,m}\ell(y_{i}^{\prime}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{i}^{\prime})).$$

Substituting the above estimates into ([18](https://arxiv.org/html/2605.12648#A3.E18)) yields

$$\Delta_{t+1}^{(i)}\leq\bigl(1+a_{\delta,m}\eta\bigl(\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})+\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)})\bigr)\bigr)\Delta_{t}^{(i)}+\frac{\eta b_{\delta,m}}{B}\mathbf{1}_{\{i\in\mathcal{B}_{t}\}}\bigl(\ell(y_{i}f_{\mathbf{W}_{t}}(\mathbf{x}_{i}))+\ell(y_{i}^{\prime}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{i}^{\prime}))\bigr).$$

This completes the proof. ∎
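The one-step recursion of Lemma C.16 can be unrolled numerically once the loss values along the two trajectories are known. The sketch below treats the recursion as an equality to produce an upper envelope for $\Delta_{t}^{(i)}$, starting from $\Delta_{0}^{(i)}=0$ because the coupled runs share the initialization. The function name `unroll_stability_recursion` and its argument names are hypothetical; the inputs are assumed to be supplied by the caller (for example, logged during training).

```python
def unroll_stability_recursion(a, b, eta, B, batch_loss_sums, hits, sample_loss_sums):
    """Iterate Delta_{t+1} = (1 + a*eta*L_t) * Delta_t + (eta*b/B) * 1{hit_t} * s_t.

    a, b             : stand-ins for the constants a_{delta,m} and b_{delta,m}
    batch_loss_sums  : L_t = mini-batch loss on S at W_t plus that on S^(i) at W_t^(i)
    hits             : hits[t] is True when the replaced index i is in batch t
    sample_loss_sums : s_t = loss of z_i at W_t plus loss of z_i' at W_t^(i)
    """
    delta = 0.0  # shared initialization, so the two runs start at distance zero
    history = [delta]
    for loss_sum, hit, sample_sum in zip(batch_loss_sums, hits, sample_loss_sums):
        growth = 1.0 + a * eta * loss_sum          # multiplicative expansion factor
        injection = (eta * b / B) * sample_sum if hit else 0.0  # replaced-sample term
        delta = growth * delta + injection
        history.append(delta)
    return history

# Illustrative inputs only.
envelope = unroll_stability_recursion(
    a=0.5, b=2.0, eta=0.1, B=4,
    batch_loss_sums=[1.0, 1.0, 1.0],
    hits=[True, False, True],
    sample_loss_sums=[1.0, 5.0, 3.0],
)
print(envelope)
```

The envelope grows only at steps where the replaced index is sampled, and between such steps it is amplified by a factor that stays close to one when the mini-batch losses are small, which is why the cumulative training loss enters the stability analysis.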

We first control the cumulative mini-batch losses along the two coupled trajectories. For $\delta_{f}\in(0,1)$, define the initialization-output event

$$\mathcal{E}_{\delta_{f}}^{\mathrm{out}}=\Big\{\max_{1\leq i\leq n}|f_{\mathbf{W}_{0}}(\mathbf{x}_{i})|\leq B_{b}\sqrt{2p\log\big({4n}/{\delta_{f}}\big)}\ \text{ and }\ \max_{1\leq i\leq n}|f_{\mathbf{W}_{0}}(\mathbf{x}_{i}^{\prime})|\leq B_{b}\sqrt{2p\log\big({4n}/{\delta_{f}}\big)}\Big\},$$

where $\widetilde{S}=\{(\mathbf{x}_{i}^{\prime},y_{i}^{\prime})\}_{i=1}^{n}$ is an independent copy of $S$.

###### Lemma C\.17\.

Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds. Then $\mathbb{P}\big(\mathcal{E}_{\delta_{f}}^{\mathrm{out}}\big)\geq 1-\delta_{f}$. Moreover, on the event $\mathcal{E}_{\delta}\cap\mathcal{E}_{\delta_{f}}^{\mathrm{out}}$, for every $t\geq 0$, every $i\in[n]$, and every $j\in[n]$ with $z_{j}=(\mathbf{x}_{j},y_{j})\in S$ and $z_{j}^{(i)}=(\mathbf{x}_{j}^{(i)},y_{j}^{(i)})\in S^{(i)}$,

$$\max\big\{|f_{\mathbf{W}_{t}}(\mathbf{x}_{j})|,|f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{j}^{(i)})|\big\}\leq F_{\delta,\delta_{f}}\ \text{ and }\ \max\Big\{\ell\left(y_{j}f_{\mathbf{W}_{t}}(\mathbf{x}_{j})\right),\ell\big(y_{j}^{(i)}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{j}^{(i)})\big)\Big\}\leq U_{\delta,\delta_{f}},$$

where $F_{\delta,\delta_{f}}=B_{b}\sqrt{2p\log\big(\frac{4n}{\delta_{f}}\big)}+b_{\delta,m}R_{*}$ and $U_{\delta,\delta_{f}}=\log\bigl(1+e^{F_{\delta,\delta_{f}}}\bigr)\leq\log 2+F_{\delta,\delta_{f}}$.

###### Proof\.

We first control the initial outputs\. Fix any input𝐱∈ℝd\\mathbf\{x\}\\in\\mathbb\{R\}^\{d\}\. Recall thatf𝐖0\(𝐱\)=1m𝐜⊤𝐡\(σ\(1d𝐖0𝐡\(𝐱\)\)\)\.f\_\{\\mathbf\{W\}\_\{0\}\}\(\\mathbf\{x\}\)=\\frac\{1\}\{\\sqrt\{m\}\}\\,\\mathbf\{c\}^\{\\top\}\\mathbf\{h\}\(\\sigma\(\\frac\{1\}\{\\sqrt\{d\}\}\\mathbf\{W\}\_\{0\}\\mathbf\{h\}\(\\mathbf\{x\}\)\)\)\.Conditioned on\(𝐖0,𝐱\)\(\\mathbf\{W\}\_\{0\},\\mathbf\{x\}\), the only randomness comes from𝐜∼𝒩\(0,𝐈mp\)\\mathbf\{c\}\\sim\\mathcal\{N\}\(0,\\mathbf\{I\}\_\{mp\}\)\. Then, we knowf𝐖0\(𝐱\)∼𝒩\(0,1m‖𝐡\(σ\(1d𝐖0𝐡\(𝐱\)\)\)‖22\)\.f\_\{\\mathbf\{W\}\_\{0\}\}\(\\mathbf\{x\}\)\\sim\\mathcal\{N\}\\big\(0,\\frac\{1\}\{m\}\\\|\\mathbf\{h\}\(\\sigma\(\\frac\{1\}\{\\sqrt\{d\}\}\\mathbf\{W\}\_\{0\}\\mathbf\{h\}\(\\mathbf\{x\}\)\)\)\\\|\_\{2\}^\{2\}\)\.Note that each spline basis function is bounded byBbB\_\{b\}on the range ofσ\\sigma, we have‖𝐡\(σ\(1d𝐖0𝐡\(𝐱\)\)\)‖22≤mpBb2\.\\\|\\mathbf\{h\}\(\\sigma\(\\frac\{1\}\{\\sqrt\{d\}\}\\mathbf\{W\}\_\{0\}\\mathbf\{h\}\(\\mathbf\{x\}\)\)\)\\\|\_\{2\}^\{2\}\\leq mp\\,B\_\{b\}^\{2\}\.Applying the Gaussian tail bound to the2n2npoints𝐱1,…,𝐱n,𝐱1′,…,𝐱n′\\mathbf\{x\}\_\{1\},\\dots,\\mathbf\{x\}\_\{n\},\\mathbf\{x\}\_\{1\}^\{\\prime\},\\dots,\\mathbf\{x\}\_\{n\}^\{\\prime\}and taking a union bound yieldsℙ\(ℰδfout\)≥1−δf\.\\mathbb\{P\}\(\\mathcal\{E\}\_\{\\delta\_\{f\}\}^\{\\mathrm\{out\}\}\)\\geq 1\-\\delta\_\{f\}\.

We next extend the bound from the initial point to the whole projected trajectory. On the event $\mathcal{E}_{\delta}$ (see ([1](https://arxiv.org/html/2605.12648#A2.E1))), Lemma [B.6](https://arxiv.org/html/2605.12648#A2.Thmtheorem6) implies that $\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}\leq b_{\delta,m}$ for all $\mathbf{W}\in\mathbb{R}^{mdp}$ and $\mathbf{x}\in\mathcal{X}$. Since both trajectories are projected onto $\mathcal{K}=\mathcal{B}(\mathbf{W}_{0},R_{*})$, we have $\|\mathbf{W}_{t}-\mathbf{W}_{0}\|_{2}\leq R_{*}$ and $\|\mathbf{W}_{t}^{(i)}-\mathbf{W}_{0}\|_{2}\leq R_{*}$. Therefore, by the mean-value theorem and the estimate for $\sup_{\mathbf{x},\mathbf{W}}\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}$,

$$\max\big\{|f_{\mathbf{W}_{t}}(\mathbf{x}_{j})-f_{\mathbf{W}_{0}}(\mathbf{x}_{j})|,\,|f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{j}^{(i)})-f_{\mathbf{W}_{0}}(\mathbf{x}_{j}^{(i)})|\big\}\leq b_{\delta,m}R_{*}.$$

Combining these inequalities with the event $\mathcal{E}_{\delta_{f}}^{\mathrm{out}}$ yields

$$\max\big\{|f_{\mathbf{W}_{t}}(\mathbf{x}_{j})|,\,|f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{j}^{(i)})|\big\}\leq B_{b}\sqrt{2p\log\Big(\frac{4n}{\delta_{f}}\Big)}+b_{\delta,m}R_{*}=F_{\delta,\delta_{f}}.$$

Finally, note that the logistic loss satisfies $\ell(u)=\log(1+e^{-u})\leq\log(1+e^{|u|})\leq\log 2+|u|$. Hence

$$\max\big\{\ell\big(y_{j}f_{\mathbf{W}_{t}}(\mathbf{x}_{j})\big),\,\ell\big(y_{j}^{(i)}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{j}^{(i)})\big)\big\}\leq U_{\delta,\delta_{f}},$$

which completes the proof. ∎
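As a numerical sanity check of the two steps above, the following sketch evaluates the union bound over the $2n$ points at the threshold $B_{b}\sqrt{2p\log(4n/\delta_{f})}$ and the loss cap $U_{\delta,\delta_{f}}=\log(1+e^{F_{\delta,\delta_{f}}})$. The function names and all numeric values are illustrative assumptions, not quantities from the paper.

```python
import math


def output_threshold(B_b: float, p: int, n: int, delta_f: float) -> float:
    """Threshold u = B_b * sqrt(2 p log(4n / delta_f)) for the initial outputs."""
    return B_b * math.sqrt(2.0 * p * math.log(4.0 * n / delta_f))


def union_failure_probability(B_b: float, p: int, n: int, delta_f: float) -> float:
    """Exact two-sided Gaussian tail at the threshold, summed over the 2n points.

    Uses the worst-case standard deviation sqrt(p) * B_b from the proof.
    """
    u = output_threshold(B_b, p, n, delta_f)
    s = math.sqrt(p) * B_b
    per_point = math.erfc(u / (s * math.sqrt(2.0)))  # P(|N(0, s^2)| >= u)
    return 2 * n * per_point


def loss_cap(F: float) -> float:
    """U = log(1 + e^F), computed stably as F + log1p(exp(-F)) for F >= 0."""
    return F + math.log1p(math.exp(-F))


# Hypothetical values chosen only for illustration.
B_b, p, n, delta_f = 1.0, 4, 1000, 0.05
b_times_R = 1.0  # stands in for b_{delta,m} * R_*
F = output_threshold(B_b, p, n, delta_f) + b_times_R
print(union_failure_probability(B_b, p, n, delta_f) <= delta_f)
print(loss_cap(F) <= math.log(2.0) + F)
```

The exact tail is smaller than the sub-Gaussian estimate used in the proof, so the union bound holds with room to spare.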

###### Lemma C\.18\(Freedman’s inequality\)\.

Let $\{X_{t},\mathscr{F}_{t}\}_{t=1}^{T}$ be a martingale difference sequence, that is, $X_{t}$ is $\mathscr{F}_{t}$-measurable and $\mathbb{E}[X_{t}\mid\mathscr{F}_{t-1}]=0$ a.s. for all $t\in[T]$. If $|X_{t}|\leq L$ a.s. for all $t\in[T]$, and the predictable quadratic variation satisfies $V_{T}=\sum_{t=1}^{T}\mathbb{E}[X_{t}^{2}\mid\mathscr{F}_{t-1}]\leq v$ a.s., then for any $\delta\in(0,1)$,

$$\mathbb{P}\Big(\sum_{t=1}^{T}X_{t}\geq\sqrt{2v\log(1/\delta)}+\frac{2L}{3}\log(1/\delta)\Big)\leq\delta.$$
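To make the deviation level in Freedman's inequality concrete, here is a small simulation sketch. It uses i.i.d. Rademacher steps (so $L=1$ and $v=T$), which is an assumed toy martingale and not the mini-batch sequence analyzed in this appendix; the function names and parameters are likewise illustrative.

```python
import math
import random


def freedman_bound(v: float, L: float, delta: float) -> float:
    """Deviation level sqrt(2 v log(1/delta)) + (2L/3) log(1/delta)."""
    log_term = math.log(1.0 / delta)
    return math.sqrt(2.0 * v * log_term) + (2.0 * L / 3.0) * log_term


def empirical_exceedance(T: int, delta: float, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated Rademacher sums that reach the Freedman level.

    Each step is +1 or -1 with equal probability, so |X_t| <= 1 and the
    predictable quadratic variation equals T exactly.
    """
    rng = random.Random(seed)
    threshold = freedman_bound(v=float(T), L=1.0, delta=delta)
    exceed = 0
    for _ in range(trials):
        total = sum(rng.choice((-1.0, 1.0)) for _ in range(T))
        if total >= threshold:
            exceed += 1
    return exceed / trials


print(freedman_bound(v=100.0, L=1.0, delta=0.1))
print(empirical_exceedance(T=100, delta=0.1))
```

The empirical frequency should fall well below $\delta$, which reflects that the inequality is a worst-case guarantee.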

###### Lemma C\.19\.

Suppose the assumptions of Lemma [C.16](https://arxiv.org/html/2605.12648#A3.Thmtheorem16) hold, and let $\delta_{\mathrm{mb}}\in(0,1)$. Then, conditioned on $S$, $S^{(i)}$ and an initialization satisfying $\mathcal{E}_{\delta}\cap\mathcal{E}_{\delta_{f}}^{\mathrm{out}}$, with probability at least $1-2\delta_{\mathrm{mb}}$ over the remaining algorithmic randomness,

$$\begin{aligned}
\sum_{t=0}^{T-1}\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})&\leq\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})+C\,U_{\delta,\delta_{f}}\Big(\sqrt{\frac{T\log(1/\delta_{\mathrm{mb}})}{B}}+\log\Big(\frac{1}{\delta_{\mathrm{mb}}}\Big)\Big),\\
\sum_{t=0}^{T-1}\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)})&\leq\sum_{t=0}^{T-1}\mathcal{L}_{S^{(i)}}(\mathbf{W}_{t}^{(i)})+C\,U_{\delta,\delta_{f}}\Big(\sqrt{\frac{T\log(1/\delta_{\mathrm{mb}})}{B}}+\log\Big(\frac{1}{\delta_{\mathrm{mb}}}\Big)\Big),
\end{aligned}$$

where $C>0$ is an absolute constant.

###### Proof\.

We prove the first inequality; the second follows in exactly the same way. Define the pre-sampling filtration

$$\mathscr{F}_{t}=\sigma\bigl(S,S^{(i)},\mathbf{W}_{0},\mathcal{B}_{0},\ldots,\mathcal{B}_{t-1},Z_{0},\ldots,Z_{t-1}\bigr),\qquad t\in[T]\cup\{0\}.$$

Thus $\mathscr{F}_{t}$ contains all randomness revealed before the fresh mini-batch $\mathcal{B}_{t}$ is drawn. In particular, $\mathbf{W}_{t}$ is $\mathscr{F}_{t}$-measurable.

For each $t=0,\ldots,T-1$, define $X_{t}=\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})-\mathcal{L}_{S}(\mathbf{W}_{t})$. Conditioned on $\mathscr{F}_{t}$, the iterate $\mathbf{W}_{t}$ is fixed and the only randomness in $X_{t}$ comes from the fresh mini-batch $\mathcal{B}_{t}$. Since $\mathcal{B}_{t}$ is sampled uniformly without replacement, we have $\mathbb{E}_{A}[X_{t}\mid\mathscr{F}_{t}]=0$. Moreover, on the event $\mathcal{E}_{\delta}\cap\mathcal{E}_{\delta_{f}}^{\mathrm{out}}$, Lemma [C.17](https://arxiv.org/html/2605.12648#A3.Thmtheorem17) gives

$$|X_{t}|\leq\sup_{z\in S\cup S^{(i)}}|\ell(yf_{\mathbf{W}_{t}}(\mathbf{x}))|\leq U_{\delta,\delta_{f}}.$$

We also bound the conditional variance. Conditioned on $\mathscr{F}_{t}$, the values $u_{j}:=\ell\left(y_{j}f_{\mathbf{W}_{t}}(\mathbf{x}_{j})\right)$ satisfy $|u_{j}|\leq U_{\delta,\delta_{f}}$ for all $j\in[n]$. From Lemma [B.2](https://arxiv.org/html/2605.12648#A2.Thmtheorem2) we know

$$\mathbb{E}_{A}[X_{t}^{2}\mid\mathscr{F}_{t}]\leq\frac{U_{\delta,\delta_{f}}^{2}}{B}.$$

Applying Freedman's inequality (Lemma [C.18](https://arxiv.org/html/2605.12648#A3.Thmtheorem18)) after shifting the time index by one (so that $X_{t-1}$ plays the role of $X_{t}$), with $L=U_{\delta,\delta_{f}}$ and $v=TU_{\delta,\delta_{f}}^{2}/B$, we obtain that, with probability at least $1-\delta_{\mathrm{mb}}$,

$$\sum_{s=0}^{T-1}X_{s}\leq\sqrt{\frac{2T\,U_{\delta,\delta_{f}}^{2}}{B}\log\Big(\frac{1}{\delta_{\mathrm{mb}}}\Big)}+\frac{2U_{\delta,\delta_{f}}}{3}\log\Big(\frac{1}{\delta_{\mathrm{mb}}}\Big).$$

Using

$$\sum_{t=0}^{T-1}\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})=\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})+\sum_{t=0}^{T-1}X_{t}$$

and absorbing the numerical constants into $C$ proves the first inequality.

The second inequality is proved in the same way. Finally, a union bound over the two estimates yields probability at least $1-2\delta_{\mathrm{mb}}$. ∎
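The two properties of $X_{t}$ used in this proof (zero conditional mean and second moment at most $U^{2}/B$) can be checked exactly on a tiny example by enumerating every size-$B$ mini-batch. This is a sketch with assumed loss values bounded by $U=1$; it does not reproduce Lemma B.2, only the conclusion quoted from it.

```python
from fractions import Fraction
from itertools import combinations


def minibatch_moments(losses, B):
    """Exact mean and second moment of (batch mean - full mean).

    The batch is uniform over all size-B subsets, which matches sampling
    uniformly without replacement.
    """
    n = len(losses)
    full_mean = Fraction(sum(losses), n)
    devs = [
        Fraction(sum(losses[j] for j in batch), B) - full_mean
        for batch in combinations(range(n), B)
    ]
    mean_dev = sum(devs, Fraction(0)) / len(devs)
    second_moment = sum((d * d for d in devs), Fraction(0)) / len(devs)
    return mean_dev, second_moment


# Hypothetical per-example losses, all in [0, U] with U = 1.
losses = [Fraction(k, 10) for k in (1, 3, 4, 7, 9, 10)]
U, B = Fraction(1), 3
mean_dev, second_moment = minibatch_moments(losses, B)
print(mean_dev == 0)
print(second_moment <= U * U / B)
```

Exact rational arithmetic is used so that the zero-mean property holds with equality instead of up to floating-point error.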

For each $i\in[n]$ and $t\geq 0$, define

$$M_{t}^{(i)}=\eta\,a_{\delta,m}\bigl(\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})+\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)})\bigr).$$

Let $\mathcal{E}_{\mathrm{opt}}^{(i)}$ denote the event on which

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\leq A_{\mathrm{corr}},\qquad\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S^{(i)}}(\mathbf{W}_{t}^{(i)})\leq A_{\mathrm{corr}}.$$

For each $i\in[n]$, let $\mathcal{E}_{\mathrm{mb}}^{(i)}$ be the event on which both inequalities in Lemma [C.19](https://arxiv.org/html/2605.12648#A3.Thmtheorem19) hold, that is,

$$\begin{aligned}
\sum_{t=0}^{T-1}\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})&\leq\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})+C\,U_{\delta,\delta_{f}}\Big(\sqrt{\frac{T\log(1/\delta_{\mathrm{mb}})}{B}}+\log\Big(\frac{1}{\delta_{\mathrm{mb}}}\Big)\Big),\\
\sum_{t=0}^{T-1}\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)})&\leq\sum_{t=0}^{T-1}\mathcal{L}_{S^{(i)}}(\mathbf{W}_{t}^{(i)})+C\,U_{\delta,\delta_{f}}\Big(\sqrt{\frac{T\log(1/\delta_{\mathrm{mb}})}{B}}+\log\Big(\frac{1}{\delta_{\mathrm{mb}}}\Big)\Big).
\end{aligned}$$

###### Lemma C\.20\.

Suppose the assumptions of Lemma [C.19](https://arxiv.org/html/2605.12648#A3.Thmtheorem19) hold. Assume

$$m\gtrsim C_{\sigma,b}^{2}p^{3}\bigl(\log(m/\delta)+p\bigr)\,\eta^{2}\Big(TA_{\mathrm{corr}}+U_{\delta,\delta_{f}}\sqrt{\frac{T\log(1/\delta_{\mathrm{mb}})}{B}}+U_{\delta,\delta_{f}}\log\Big(\frac{1}{\delta_{\mathrm{mb}}}\Big)\Big)^{2}.$$

Then, on the event $\mathcal{E}_{\mathrm{opt}}^{(i)}\cap\mathcal{E}_{\mathrm{mb}}^{(i)}\cap\mathcal{E}_{\delta_{f}}^{\mathrm{out}}$,

$$\sum_{t=0}^{T-1}M_{t}^{(i)}\leq 1.$$

###### Proof\.

On the event $\mathcal{E}_{\mathrm{mb}}^{(i)}$, Lemma [C.19](https://arxiv.org/html/2605.12648#A3.Thmtheorem19) implies

$$\begin{aligned}
\sum_{t=0}^{T-1}M_{t}^{(i)}&=\eta a_{\delta,m}\sum_{t=0}^{T-1}\Bigl(\mathcal{L}_{\mathcal{B}_{t}}(\mathbf{W}_{t})+\mathcal{L}_{\mathcal{B}_{t}}^{(i)}(\mathbf{W}_{t}^{(i)})\Bigr)\\
&\leq\eta a_{\delta,m}\Big[\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})+\sum_{t=0}^{T-1}\mathcal{L}_{S^{(i)}}(\mathbf{W}_{t}^{(i)})+2C\,U_{\delta,\delta_{f}}\Big(\sqrt{\frac{T\log(1/\delta_{\mathrm{mb}})}{B}}+\log\Big(\frac{1}{\delta_{\mathrm{mb}}}\Big)\Big)\Big].
\end{aligned}$$

On $\mathcal{E}_{\mathrm{opt}}^{(i)}$, each of the first two sums is at most $TA_{\mathrm{corr}}$, so

$$\sum_{t=0}^{T-1}M_{t}^{(i)}\leq 2\eta a_{\delta,m}\Big(TA_{\mathrm{corr}}+C\,U_{\delta,\delta_{f}}\Big(\sqrt{\frac{T\log(1/\delta_{\mathrm{mb}})}{B}}+\log\Big(\frac{1}{\delta_{\mathrm{mb}}}\Big)\Big)\Big).$$

The final claim follows by substituting the definition of $a_{\delta,m}$ and rearranging. ∎

Define

$$R_{t}^{(i)}=\frac{\eta b_{\delta,m}}{B}\mathbf{1}_{\{i\in\mathcal{B}_{t}\}}\bigl(\ell(y_{i}f_{\mathbf{W}_{t}}(\mathbf{x}_{i}))+\ell(y_{i}^{\prime}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{i}^{\prime}))\bigr).$$

###### Lemma C\.21\.

Under the assumptions of Lemma [C.20](https://arxiv.org/html/2605.12648#A3.Thmtheorem20), define $\mathcal{E}_{\mathrm{stab}}^{(i)}=\mathcal{E}_{\mathrm{opt}}^{(i)}\cap\mathcal{E}_{\mathrm{mb}}^{(i)}\cap\mathcal{E}_{\delta_{f}}^{\mathrm{out}}$. Then, on the event $\mathcal{E}_{\mathrm{stab}}^{(i)}$, it holds for every $t\geq 0$ that

$$\Delta_{t+1}^{(i)}\leq e\sum_{k=0}^{t}R_{k}^{(i)}.$$

###### Proof\.

By Lemma [C.16](https://arxiv.org/html/2605.12648#A3.Thmtheorem16), for every $t\geq 0$,

$$\Delta_{t+1}^{(i)}\leq\bigl(1+M_{t}^{(i)}\bigr)\Delta_{t}^{(i)}+R_{t}^{(i)}.$$

Iterating this recursion and using $\Delta_{0}^{(i)}=0$, we obtain

$$\Delta_{t+1}^{(i)}\leq\sum_{k=0}^{t}R_{k}^{(i)}\prod_{s=k+1}^{t}\bigl(1+M_{s}^{(i)}\bigr).$$

On the event $\mathcal{E}_{\mathrm{stab}}^{(i)}$, Lemma [C.20](https://arxiv.org/html/2605.12648#A3.Thmtheorem20) yields

$$\sum_{s=0}^{T-1}M_{s}^{(i)}\leq 1.$$

Hence, for every $0\leq k\leq t\leq T-1$,

$$\prod_{s=k+1}^{t}\bigl(1+M_{s}^{(i)}\bigr)\leq\exp\Bigl(\sum_{s=k+1}^{t}M_{s}^{(i)}\Bigr)\leq e,$$

where we used the inequality $1+x\leq e^{x}$ for $x\geq 0$ and the nonnegativity of the $M_{s}^{(i)}$. Substituting this bound into the previous display gives

$$\Delta_{t+1}^{(i)}\leq e\sum_{k=0}^{t}R_{k}^{(i)}.$$

This completes the proof. ∎
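The unrolling argument can be illustrated numerically. The sketch below iterates the recursion with equality (the worst case it permits) and compares each iterate with $e\sum_{k\leq t}R_{k}$. The sequences `M` and `R` are made-up nonnegative numbers with $\sum_{s}M_{s}\leq 1$, not quantities computed from a trained network.

```python
import math


def unroll(M, R):
    """Iterate Delta_{t+1} = (1 + M_t) * Delta_t + R_t starting from Delta_0 = 0."""
    delta = 0.0
    history = []
    for m_t, r_t in zip(M, R):
        delta = (1.0 + m_t) * delta + r_t
        history.append(delta)
    return history


# Hypothetical coefficients: nonnegative, with sum(M) = 1.
M = [0.1, 0.2, 0.3, 0.4]
R = [0.5, 0.0, 0.25, 1.0]
history = unroll(M, R)
for t, delta_next in enumerate(history):
    bound = math.e * sum(R[: t + 1])
    print(t, delta_next <= bound)
```

Because the multiplicative factors compound to at most $e$ in total, the accumulated perturbation never exceeds $e$ times the sum of the additive terms.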

For each $i\in[n]$, define the overall good event

$$\widehat{\mathcal{E}}^{(i)}=\mathcal{E}_{\delta}\cap\mathcal{E}_{\mathrm{opt}}^{(i)}\cap\mathcal{E}_{\mathrm{mb}}^{(i)}\cap\mathcal{E}_{\delta_{f}}^{\mathrm{out}}.$$

###### Theorem C\.22\(On\-average argument stability\)\.

Under the assumptions of Lemmas [C.20](https://arxiv.org/html/2605.12648#A3.Thmtheorem20) and [C.21](https://arxiv.org/html/2605.12648#A3.Thmtheorem21), assume that

$$\mathbb{P}\big((\widehat{\mathcal{E}}^{(i)})^{c}\big)\leq\delta_{\mathrm{stab}}\qquad\text{for all }i\in[n].$$

Then

$$\mathbb{E}_{S,\widetilde{S},\mathcal{A}}\Big[\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{W}_{T}-\mathbf{W}_{T}^{(i)}\|_{2}\Big]\leq\frac{2e\eta b_{\delta,m}}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+2R_{*}\delta_{\mathrm{stab}}.$$

###### Proof\.

Fix $i\in[n]$. On the event $\widehat{\mathcal{E}}^{(i)}$, Lemma [C.21](https://arxiv.org/html/2605.12648#A3.Thmtheorem21) yields $\Delta_{T}^{(i)}\leq e\sum_{t=0}^{T-1}R_{t}^{(i)}$. Since both trajectories are projected onto the ball $\mathcal{K}=\mathcal{B}(\mathbf{W}_{0},R_{*})$, we have $\Delta_{T}^{(i)}\leq 2R_{*}$. Therefore,

$$\begin{aligned}
\mathbb{E}_{\mathcal{A}}[\Delta_{T}^{(i)}]&=\mathbb{E}_{\mathcal{A}}\Big[\Delta_{T}^{(i)}\mathbf{1}_{\widehat{\mathcal{E}}^{(i)}}\Big]+\mathbb{E}_{\mathcal{A}}\Big[\Delta_{T}^{(i)}\mathbf{1}_{(\widehat{\mathcal{E}}^{(i)})^{c}}\Big]\\
&\leq e\,\mathbb{E}_{\mathcal{A}}\Big[\sum_{t=0}^{T-1}R_{t}^{(i)}\mathbf{1}_{\widehat{\mathcal{E}}^{(i)}}\Big]+2R_{*}\,\mathbb{P}_{\mathcal{A}}\bigl((\widehat{\mathcal{E}}^{(i)})^{c}\bigr)\\
&\leq e\sum_{t=0}^{T-1}\mathbb{E}_{\mathcal{A}}[R_{t}^{(i)}]+2R_{*}\,\mathbb{P}_{\mathcal{A}}\bigl((\widehat{\mathcal{E}}^{(i)})^{c}\bigr).
\end{aligned}$$

Averaging over $i\in[n]$ gives

$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathcal{A}}[\Delta_{T}^{(i)}]\leq\frac{e}{n}\sum_{t=0}^{T-1}\sum_{i=1}^{n}\mathbb{E}_{\mathcal{A}}[R_{t}^{(i)}]+2R_{*}\delta_{\mathrm{stab}}.$$

We now estimate the first term. Since $\mathbf{W}_{t}$ and $\mathbf{W}_{t}^{(i)}$ are measurable with respect to the randomness up to time $t-1$, they are independent of the fresh mini-batch $\mathcal{B}_{t}$. Hence

$$\mathbb{E}_{\mathcal{A}}\Big[\frac{1}{B}\mathbf{1}_{\{i\in\mathcal{B}_{t}\}}\Big]=\frac{1}{n}.$$

It then follows from the definition of $R_{t}^{(i)}$ that

$$\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathcal{A}}[R_{t}^{(i)}]=\frac{\eta b_{\delta,m}}{n}\cdot\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\mathcal{A}}\big[\ell(y_{i}f_{\mathbf{W}_{t}}(\mathbf{x}_{i}))+\ell(y_{i}^{\prime}f_{\mathbf{W}_{t}^{(i)}}(\mathbf{x}_{i}^{\prime}))\big].$$

Taking expectation over $(S,\widetilde{S})$ and using $\frac{1}{n}\sum_{i=1}^{n}\ell(y_{i}f_{\mathbf{W}_{t}}(\mathbf{x}_{i}))=\mathcal{L}_{S}(\mathbf{W}_{t})$ together with the fact that $S^{(i)}$ has the same distribution as $S$, we obtain

$$\mathbb{E}_{S,\widetilde{S},\mathcal{A}}\Big[\frac{1}{n}\sum_{i=1}^{n}R_{t}^{(i)}\Big]=\frac{2\eta b_{\delta,m}}{n}\,\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr].$$

Summing over $t=0,\dots,T-1$ yields

$$\mathbb{E}_{S,\widetilde{S},\mathcal{A}}\Big[\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{W}_{T}-\mathbf{W}_{T}^{(i)}\|_{2}\Big]\leq\frac{2e\eta b_{\delta,m}}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+2R_{*}\delta_{\mathrm{stab}},$$

which proves the claim. ∎
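The identity $\mathbb{E}_{\mathcal{A}}\big[\frac{1}{B}\mathbf{1}_{\{i\in\mathcal{B}_{t}\}}\big]=\frac{1}{n}$ used in this proof is a counting fact about uniform sampling without replacement: a fixed index lies in $\binom{n-1}{B-1}$ of the $\binom{n}{B}$ possible batches. A minimal sketch, with a hypothetical helper name:

```python
from fractions import Fraction
from math import comb


def inclusion_weight(n: int, B: int) -> Fraction:
    """E[(1/B) * 1{i in batch}] for a size-B batch drawn uniformly without replacement."""
    # Batches containing a fixed index i, divided by all size-B batches.
    p_include = Fraction(comb(n - 1, B - 1), comb(n, B))
    return p_include / B


print(inclusion_weight(10, 4) == Fraction(1, 10))
```

This is why the factor $1/B$ in $R_{t}^{(i)}$ turns into $1/n$ after taking expectations, independently of the batch size.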

Recall $c_{\rm priv}=\frac{C_{\mathrm{clip}}\kappa}{B}$ and $U_{\delta,\delta_{f}}\lesssim\sqrt{p\log(n/\delta)}+p^{3/2}R_{*}$.

###### Theorem C\.23\(Restatement of Theorem[5\.2](https://arxiv.org/html/2605.12648#S5.Thmtheorem2)\)\.

Under the assumptions of Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1), let $\{\mathbf{W}_{t}\}_{t=0}^{T}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $C_{\mathrm{clip}}\geq G_{\delta}$ and $\eta\leq\frac{1}{12C_{\sigma,b}p^{3}}$. Assume

$$m\gtrsim R_{*}^{2}\log(m/\delta)\,\eta^{2}\Big(TA_{\mathrm{corr}}+(\log(n/\delta)+R_{*})\Big(\sqrt{\frac{T\log(1/\delta)}{B}}+\log\Big(\frac{1}{\delta}\Big)\Big)\Big)^{2}.$$

Then, with probability at least $1-\delta$ over the initialization,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}[\mathcal{L}(\mathbf{W}_{t})]\lesssim\Big(1+\frac{\eta T\log(1/\delta)}{n}\Big)\Big(A_{\mathrm{corr}}+\log\big(\tfrac{n}{\delta}\big)(\sqrt{m}+R_{*})\delta\Big).$$

###### Proof\.

Choose auxiliary failure probabilities $\delta_{f}=\frac{\delta}{4}$ and $\delta_{\mathrm{opt}}=\delta_{\mathrm{mb}}=\frac{\delta}{8}$, and condition on the initialization event $\mathcal{E}_{\delta/4}$. By Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1), applied to the two coupled runs on $S$ and $S^{(i)}$, we have

$$\mathbb{P}\big((\mathcal{E}_{\mathrm{opt}}^{(i)})^{c}\big)\leq 2\delta_{\mathrm{opt}}.$$

Moreover, Lemma [C.19](https://arxiv.org/html/2605.12648#A3.Thmtheorem19) gives

$$\mathbb{P}\big(\mathcal{E}_{\delta/4}\cap\mathcal{E}_{\delta_{f}}^{\mathrm{out}}\cap(\mathcal{E}_{\mathrm{mb}}^{(i)})^{c}\big)\leq 2\delta_{\mathrm{mb}}.$$

Together with $\mathbb{P}(\mathcal{E}_{\delta/4}^{c})\leq\delta/4$ from Lemma [B.5](https://arxiv.org/html/2605.12648#A2.Thmtheorem5) and $\mathbb{P}((\mathcal{E}_{\delta_{f}}^{\mathrm{out}})^{c})\leq\delta_{f}$ from Lemma [C.17](https://arxiv.org/html/2605.12648#A3.Thmtheorem17), we get, for $\widehat{\mathcal{E}}^{(i)}=\mathcal{E}_{\delta/4}\cap\mathcal{E}_{\delta_{f}}^{\mathrm{out}}\cap\mathcal{E}_{\mathrm{opt}}^{(i)}\cap\mathcal{E}_{\mathrm{mb}}^{(i)}$, that

$$\mathbb{P}\big((\widehat{\mathcal{E}}^{(i)})^{c}\big)\leq\frac{\delta}{4}+\delta_{f}+2\delta_{\mathrm{opt}}+2\delta_{\mathrm{mb}}=\delta.$$

Thus Theorem [C.22](https://arxiv.org/html/2605.12648#A3.Thmtheorem22) applies with $\delta_{\mathrm{stab}}=\delta$ and yields

$$\mathbb{E}_{S,\widetilde{S},\mathcal{A}}\Big[\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{W}_{T}-\mathbf{W}_{T}^{(i)}\|_{2}\Big]\leq\frac{2e\eta b_{\delta/4,m}}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+2R_{*}\delta.$$

We now pass from stability to generalization. On $\mathcal{E}_{\delta/4}$, Lemma [B.6](https://arxiv.org/html/2605.12648#A2.Thmtheorem6) gives

$$\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}\leq b_{\delta/4,m},\qquad\forall\mathbf{W}\in\mathcal{K},\ \forall\mathbf{x}\in\mathcal{X}.$$

Since $|\ell^{\prime}(u)|\leq 1$ for the logistic loss, the loss is $b_{\delta/4,m}$-Lipschitz with respect to $\mathbf{W}$ on this event. Applying Lemma [C.15](https://arxiv.org/html/2605.12648#A3.Thmtheorem15), we obtain

$$\begin{aligned}
\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]&\leq 2b_{\delta/4,m}\Big(\frac{2e\eta b_{\delta/4,m}}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+2R_{*}\delta\Big)\\
&\lesssim\frac{\eta\log(1/\delta)}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\sqrt{\log(1/\delta)}\,R_{*}\delta,
\end{aligned}$$

where we have used $b_{\delta/4,m}\lesssim\sqrt{\log(1/\delta)}$.

We now derive the averaged population risk bound. The same stability argument can be applied to the algorithm stopped at time $t$, and since the loss is nonnegative, the partial sum up to time $t-1$ is at most the full sum up to time $T-1$. Therefore, for every $t=0,\ldots,T-1$,

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})-\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]\lesssim\frac{\eta\log(1/\delta)}{n}\sum_{s=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{s})\bigr]+\sqrt{\log(1/\delta)}\,R_{*}\delta.$$

Averaging this bound over $t=0,\ldots,T-1$ gives

$$\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]&\lesssim\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\frac{\eta\log(1/\delta)}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\sqrt{\log(1/\delta)}\,R_{*}\delta\\
&=\Big(1+\frac{\eta T\log(1/\delta)}{n}\Big)\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\sqrt{\log(1/\delta)}\,R_{*}\delta.
\end{aligned}$$

To invoke the optimization bound, let $\mathcal{E}_{\mathrm{opt}}$ be the event from Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1) on which

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\leq A_{\mathrm{corr}}.$$

Then $\mathbb{P}(\mathcal{E}_{\mathrm{opt}}^{c})\leq\delta_{\mathrm{opt}}$. On the event $\mathcal{E}_{\delta/4}$, for any $\mathbf{x}\in\mathcal{X}$, any $y\in\{-1,+1\}$, and any $\mathbf{W}\in B(\mathbf{W}_{0},R_{*})$, it holds that

$$|f_{\mathbf{W}}(\mathbf{x})|\leq|f_{\mathbf{W}_{0}}(\mathbf{x})|+b_{\delta/4,m}R_{*}\leq B_{b}\sqrt{p}\,\|c\|_{2}+b_{\delta/4,m}R_{*}\lesssim\log\big(\tfrac{n}{\delta}\big)(\sqrt{m}+R_{*}),$$

where we used $\|c\|_{2}\lesssim\sqrt{m}$ on $\mathcal{E}_{\delta/4}$. Consequently,

$$\ell(yf_{\mathbf{W}}(\mathbf{x}))=\log\bigl(1+e^{-yf_{\mathbf{W}}(\mathbf{x})}\bigr)\leq\log 2+|f_{\mathbf{W}}(\mathbf{x})|\lesssim\log\big(\tfrac{n}{\delta}\big)(\sqrt{m}+R_{*}).$$

Since $\mathbf{W}_{t}\in B(\mathbf{W}_{0},R_{*})$ for all $t$ by construction of Algorithm [1](https://arxiv.org/html/2605.12648#alg1), the same bound holds for $\mathcal{L}_{S}(\mathbf{W}_{t})$ uniformly over $t$. Therefore, conditioning on the initialization event $\mathcal{E}_{\delta/4}$ and decomposing over $\mathcal{E}_{\mathrm{opt}}$, we obtain

$$\begin{aligned}
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]&\leq\mathbb{E}_{S}\Big[\mathbb{E}_{\mathcal{A}}\Big[\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\mathbf{1}_{\mathcal{E}_{\mathrm{opt}}}\,\Big|\,S,\mathbf{W}_{0},c\Big]\Big]\\
&\qquad+\mathbb{E}_{S}\Big[\mathbb{E}_{\mathcal{A}}\Big[\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\mathbf{1}_{\mathcal{E}_{\mathrm{opt}}^{c}}\,\Big|\,S,\mathbf{W}_{0},c\Big]\Big]\\
&\lesssim A_{\mathrm{corr}}+\log\big(\tfrac{n}{\delta}\big)(\sqrt{m}+R_{*})\delta.
\end{aligned}$$

Substituting this into the above bound yields

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}[\mathcal{L}(\mathbf{W}_{t})]\lesssim\Big(1+\frac{\eta T\log(1/\delta)}{n}\Big)\Big(A_{\mathrm{corr}}+\log\big(\tfrac{n}{\delta}\big)(\sqrt{m}+R_{*})\delta\Big).$$

Finally, $\mathbb{P}(\mathcal{E}_{\delta/4})\geq 1-\delta/4\geq 1-\delta$, so the above bound holds with probability at least $1-\delta$ over the initialization. This completes the proof. ∎
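The failure-probability bookkeeping in this proof can be tallied mechanically. The sketch below assumes the choices stated in the proof ($\delta_{f}=\delta/4$, $\delta_{\mathrm{opt}}=\delta_{\mathrm{mb}}=\delta/8$); the function name and dictionary keys are illustrative only.

```python
from fractions import Fraction


def failure_budget(delta: Fraction):
    """Sum the four failure contributions entering the union bound."""
    delta_f = delta / 4
    delta_opt = delta / 8
    delta_mb = delta / 8
    terms = {
        "initialization": delta / 4,   # P(E_{delta/4}^c)
        "initial_outputs": delta_f,    # P((E^out)^c)
        "optimization": 2 * delta_opt,  # two coupled runs
        "mini_batch": 2 * delta_mb,     # two martingale estimates
    }
    return terms, sum(terms.values())


terms, total = failure_budget(Fraction(1, 100))
print(total == Fraction(1, 100))
```

Each of the four contributions equals $\delta/4$, so the total matches $\delta_{\mathrm{stab}}=\delta$ exactly.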

###### Corollary C\.24\(Restatement of Corollary[5\.3](https://arxiv.org/html/2605.12648#S5.Thmtheorem3)\)\.

Suppose Assumptions [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3) hold. Let $\{\mathbf{W}_{t}\}_{t=0}^{T}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\eta\asymp 1$ and $C_{\rm clip}\asymp G_{\delta}$. Assume $\lambda\in(0,1)$ is a fixed constant bounded away from $1$, and let $0<\rho<1$ be a fixed constant. If $m\asymp\gamma^{-6}\mathrm{polylog}(n/\delta)$ and $\delta\leq\min\bigl\{\frac{1}{n\sqrt{m}},\,\frac{\gamma}{n}\bigr\}$, set $B=\rho n$ and $T\asymp\min\bigl\{\frac{\sqrt{n}}{\gamma},\frac{n\epsilon\gamma^{2}}{\sqrt{d}}\bigr\}$. Then, with probability at least $1-\delta$ over the initialization and the algorithmic randomness,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\mathrm{polylog}\Big(\frac{n}{\delta}\Big)\Big(\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{4}n\epsilon}\Big).$$

Moreover, with probability at least $1-\delta$ over the initialization, it holds that

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\mathrm{polylog}\Big(\frac{n}{\delta}\Big)\Big(\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{4}n\epsilon}\Big).$$
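To see how the iteration count in the corollary produces the stated rate, the following sketch evaluates $T$ and the two rate terms for sample values, ignoring constants and polylogarithmic factors and taking $\eta=1$. All numbers and function names are assumptions made for illustration.

```python
import math


def iteration_count(n: int, gamma: float, eps: float, d: int) -> float:
    """T = min{ sqrt(n)/gamma, n*eps*gamma^2/sqrt(d) }, constants dropped."""
    return min(math.sqrt(n) / gamma, n * eps * gamma**2 / math.sqrt(d))


def rate(n: int, gamma: float, eps: float, d: int) -> float:
    """1/(gamma*sqrt(n)) + sqrt(d)/(gamma^4 * n * eps), polylog factors dropped."""
    return 1.0 / (gamma * math.sqrt(n)) + math.sqrt(d) / (gamma**4 * n * eps)


# Hypothetical problem sizes.
n, gamma, eps, d = 10_000, 0.5, 1.0, 16
T = iteration_count(n, gamma, eps, d)
leading_term = 1.0 / (gamma**2 * T)  # the term 1/(gamma^2 * eta * T) with eta = 1
print(T, leading_term, rate(n, gamma, eps, d))
print(leading_term <= rate(n, gamma, eps, d))
```

Since $1/T$ is the maximum of the two reciprocals, the leading optimization term equals the larger of the two rate terms and is therefore at most their sum.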

###### Proof\.

Since $\lambda\in(0,1)$ is fixed and bounded away from $1$, we have

$$\frac{1-\lambda^{T}}{1-\lambda}=\mathcal{O}(1)\qquad\text{and}\qquad(1-\lambda)^{2}+\lambda^{2}\eta=\mathcal{O}(1).$$

Therefore, by the choice of $\kappa$ in Theorem [5.1](https://arxiv.org/html/2605.12648#S5.Thmtheorem1) and the fact that $B=\rho n$, we have, up to logarithmic factors,

$$\kappa^{2}\lesssim\frac{T}{\epsilon^{2}}.$$

We now verify the conditions of Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1). The lower bound on $m$ is satisfied since $m\asymp\gamma^{-6}\mathrm{polylog}(n/\delta)$, which is larger than $\gamma^{-4}$ up to logarithmic factors. For the upper bound on $m$, using $B=\rho n$, $\kappa^{2}\lesssim T/\epsilon^{2}$, and $(1-\lambda)^{2}+\lambda^{2}\eta=\mathcal{O}(1)$, the right-hand side of the width upper bound is

$$\frac{B^{2}}{\eta^{2}\kappa^{2}\gamma^{2}d}\cdot\frac{1}{T}\gtrsim\frac{n^{2}\epsilon^{2}}{\eta^{2}\gamma^{2}dT^{2}}.$$

Since $T\lesssim\frac{n\epsilon\gamma^{2}}{\sqrt{d}}$ and $m\asymp\gamma^{-6}\mathrm{polylog}(n/\delta)$, the width upper bound is satisfied after adjusting constants.

We also verify the step-size condition in Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1). Since $B=\rho n$, $\eta\asymp 1$ is chosen sufficiently small, and $T\lesssim\sqrt{n}/\gamma$, we have

$$\eta\gamma\sqrt{T}\frac{1}{\sqrt{B}}\lesssim 1\qquad\text{and}\qquad\eta^{2}\gamma^{2}\Big(\frac{T}{B}+1\Big)\lesssim 1.$$
Moreover, using $\kappa\lesssim\sqrt{T}/\epsilon$, we get

$$\eta\gamma\sqrt{T}\,\frac{\kappa(1-\lambda)}{B}\lesssim\eta\gamma\frac{T}{n\epsilon}\lesssim 1,$$
where the last inequality follows from $T\lesssim n\epsilon\gamma^{2}/\sqrt{d}$, together with $\gamma\leq 1$ and $d\geq 1$. Hence the step-size condition holds.

Now we bound $A_{\rm corr}$. Substituting $B=\rho n$, $\kappa^{2}\lesssim T/\epsilon^{2}$, and $(1-\lambda)^{2}+\lambda^{2}\eta=\mathcal{O}(1)$ into Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1), we obtain, up to logarithmic factors,

$$A_{\rm corr}\lesssim\frac{1}{\gamma^{2}\eta T}+\frac{1}{\gamma\sqrt{nT}}+\eta\Big(\frac{1}{n}+\frac{1}{T}\Big)+\frac{1}{n\epsilon}\Big(\frac{1}{\gamma}+\frac{\eta\sqrt{md}}{\sqrt{n}}\Big)+\frac{\eta Tmd}{n^{2}\epsilon^{2}}.$$
By the choice

$$T\asymp\min\Bigl\{\frac{\sqrt{n}}{\gamma},\frac{n\epsilon\gamma^{2}}{\sqrt{d}}\Bigr\},$$
the first term satisfies

$$\frac{1}{\gamma^{2}\eta T}\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{4}n\epsilon}.$$
The second and third terms are of no larger order. For the linear-noise term,

$$\frac{1}{n\epsilon}\Big(\frac{1}{\gamma}+\frac{\eta\sqrt{md}}{\sqrt{n}}\Big)\lesssim\frac{\sqrt{d}}{\gamma^{4}n\epsilon},$$
where we used $d\geq 1$, $\gamma\leq 1$, $\eta\asymp 1$, and $m\asymp\gamma^{-6}\mathrm{polylog}(n/\delta)$. For the quadratic-noise term, using $T\lesssim n\epsilon\gamma^{2}/\sqrt{d}$, we obtain

$$\frac{\eta Tmd}{n^{2}\epsilon^{2}}\lesssim\frac{\eta m\gamma^{2}\sqrt{d}}{n\epsilon}\lesssim\mathrm{polylog}\big(\frac{n}{\delta}\big)\frac{\sqrt{d}}{\gamma^{4}n\epsilon}.$$
Hence

$$A_{\rm corr}\lesssim\mathrm{polylog}\big(\frac{n}{\delta}\big)\Big(\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{4}n\epsilon}\Big).$$
This proves the optimization bound.
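As a quick numerical sanity check of the balancing step, the sketch below evaluates $T=\min\{\sqrt{n}/\gamma,\ n\epsilon\gamma^{2}/\sqrt{d}\}$ and compares the leading term $1/(\gamma^{2}\eta T)$ with the target rate. It is only illustrative: absolute constants and logarithmic factors are dropped, $T$ is treated as a real number, $\eta=1$ by default, and the function names `horizon`, `target_rate`, and `first_term` are assumptions introduced here, not notation from the paper.

```python
import math

def horizon(n, gamma, eps, d):
    """T = min{sqrt(n)/gamma, n*eps*gamma^2/sqrt(d)} with constants dropped."""
    return min(math.sqrt(n) / gamma, n * eps * gamma**2 / math.sqrt(d))

def target_rate(n, gamma, eps, d):
    """1/(gamma*sqrt(n)) + sqrt(d)/(gamma^4*n*eps), without polylog factors."""
    return 1.0 / (gamma * math.sqrt(n)) + math.sqrt(d) / (gamma**4 * n * eps)

def first_term(n, gamma, eps, d, eta=1.0):
    """Leading optimization term 1/(gamma^2 * eta * T) of the A_corr bound."""
    return 1.0 / (gamma**2 * eta * horizon(n, gamma, eps, d))

if __name__ == "__main__":
    # Hypothetical problem sizes chosen only for illustration.
    for n, gamma, eps, d in [(10_000, 0.5, 1.0, 100), (10**6, 0.1, 0.5, 50)]:
        print(first_term(n, gamma, eps, d) <= target_rate(n, gamma, eps, d))
```

Because $1/(\gamma^{2}T)$ equals the larger of the two summands of the target rate when $T$ is the minimum above, the comparison holds for every admissible input, not only the sampled ones.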

It remains to verify the additional condition in Theorem [5.2](https://arxiv.org/html/2605.12648#S5.Thmtheorem2). Since $R_{*}\asymp\gamma^{-1}\mathrm{polylog}(n/\delta)$, the bound above gives

$$TA_{\rm corr}\lesssim\gamma^{-2}\mathrm{polylog}(n/\delta).$$
The additional mini-batch term in the width condition of Theorem [5.2](https://arxiv.org/html/2605.12648#S5.Thmtheorem2) is no larger than $TA_{\rm corr}$, up to logarithmic factors, under the present choices of $B$ and $T$. Therefore, the right-hand side of the required width condition is bounded by

$$R_{*}^{2}\eta^{2}(TA_{\rm corr})^{2}\lesssim\gamma^{-6}\mathrm{polylog}(n/\delta),$$
where we used $R_{*}\asymp\gamma^{-1}\mathrm{polylog}(n/\delta)$ and $\eta\asymp 1$. Hence the condition is satisfied by the choice $m\asymp\gamma^{-6}\mathrm{polylog}(n/\delta)$.

Applying Theorem [5.2](https://arxiv.org/html/2605.12648#S5.Thmtheorem2), we obtain

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}[\mathcal{L}(\mathbf{W}_{t})]\lesssim\Big(1+\frac{\eta T\log(1/\delta)}{n}\Big)\Big(A_{\mathrm{corr}}+\log(\tfrac{n}{\delta})(\sqrt{m}+R_{*})\delta\Big).$$
Since $T\lesssim\sqrt{n}/\gamma$, the factor $1+\eta T\log(1/\delta)/n$ is absorbed into the logarithmic factor. Moreover, the assumption

$$\delta\leq\min\Bigl\{\frac{1}{n\sqrt{m}},\,\frac{\gamma}{n}\Bigr\}$$
implies

$$\big(\sqrt{m}+\log(\tfrac{n}{\delta})R_{*}\big)\delta\lesssim\mathrm{polylog}\big(\frac{n}{\delta}\big)\frac{1}{n},$$
which is dominated by the displayed rate. Hence

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\mathrm{polylog}\big(\frac{n}{\delta}\big)\Big(\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{4}n\epsilon}\Big).$$
This completes the proof. ∎

## Appendix D Proofs for DP-SGD with Independent Noise

### D.1 Privacy guarantee of DP-SGD with independent noise

In this subsection, we prove the privacy guarantee of Algorithm [1](https://arxiv.org/html/2605.12648#alg1) in the case $\lambda=0$, i.e., standard DP-SGD with independent Gaussian noise. The proof is standard; we include the details for completeness.

Throughout the proof, we use Rényi differential privacy (RDP), which allows a more refined analysis of privacy loss. For the independent-noise baseline, we use the standard replacement neighboring relation: two datasets $S$ and $S'$ of the same size are neighboring if they differ in one data point.

###### Definition D.1 (Rényi differential privacy \[[42](https://arxiv.org/html/2605.12648#bib.bib152)\]).

For $\alpha>1$ and $\rho>0$, a randomized algorithm $\mathcal{A}$ is said to satisfy $(\alpha,\rho)$-RDP if, for every pair of neighboring datasets $S,S'$,

$$D_{\alpha}(\mathcal{A}(S)\,\|\,\mathcal{A}(S')):=\frac{1}{\alpha-1}\log\mathbb{E}_{\theta\sim\mathcal{A}(S')}\!\left[\left(\frac{\mathcal{A}(S)(\theta)}{\mathcal{A}(S')(\theta)}\right)^{\alpha}\right]\leq\rho.$$

The connection between $(\epsilon,\delta)$-DP and RDP is established in the following lemma.

###### Lemma D.2 (RDP to $(\epsilon,\delta)$-DP \[[42](https://arxiv.org/html/2605.12648#bib.bib152)\]).

If $\mathcal{A}$ satisfies $(\alpha,\rho)$-RDP for some $\alpha>1$, then for every $\delta\in(0,1)$, $\mathcal{A}$ also satisfies

$$\Big(\rho+\frac{\log(1/\delta)}{\alpha-1},\,\delta\Big)\text{-DP}.$$
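The conversion in Lemma D.2 is a one-line formula. The following minimal sketch evaluates it; the function name `rdp_to_dp` and its argument checks are assumptions made for illustration.

```python
import math

def rdp_to_dp(alpha, rho, delta):
    """Epsilon of the (epsilon, delta)-DP guarantee implied by (alpha, rho)-RDP.

    Implements epsilon = rho + log(1/delta) / (alpha - 1) from Lemma D.2.
    """
    if alpha <= 1:
        raise ValueError("the RDP order alpha must exceed 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return rho + math.log(1.0 / delta) / (alpha - 1.0)

if __name__ == "__main__":
    # Illustrative values: a larger order alpha shrinks the log(1/delta) penalty.
    for alpha in (2, 11, 101):
        print(alpha, rdp_to_dp(alpha, rho=0.5, delta=1e-5))
```

The second summand decreases as $\alpha$ grows, while the RDP parameter $\rho$ of a Gaussian mechanism grows linearly in $\alpha$, which is why the proof below chooses $\alpha$ to balance the two.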

To achieve DP, we need the concept of $\ell_{2}$-sensitivity, defined as follows.

###### Definition D.3 ($\ell_{2}$-sensitivity).

The $\ell_{2}$-sensitivity of a function (mechanism) $\mathcal{M}:\mathcal{Z}^{n}\rightarrow\mathcal{W}$ is defined as $\Delta=\sup_{S,S'}\|\mathcal{M}(S)-\mathcal{M}(S')\|_{2}$, where the supremum is taken over neighboring datasets $S$ and $S'$.

A basic mechanism for obtaining RDP is the Gaussian mechanism.

###### Lemma D.4 (Gaussian mechanism \[[42](https://arxiv.org/html/2605.12648#bib.bib152)\]).

Consider a function $\mathcal{M}:\mathcal{Z}^{n}\rightarrow\mathbb{R}^{d}$ with $\ell_{2}$-sensitivity $\Delta$, and a dataset $S\in\mathcal{Z}^{n}$. The Gaussian mechanism $\mathcal{G}(S,\sigma)=\mathcal{M}(S)+\mathbf{b}$, where $\mathbf{b}\sim\mathcal{N}(0,\sigma^{2}\mathbf{I}_{d})$, satisfies $(\alpha,\frac{\alpha\Delta^{2}}{2\sigma^{2}})$-RDP for every $\alpha>1$.
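The Gaussian RDP formula can be checked numerically in one dimension. The sketch below compares $\alpha\Delta^{2}/(2\sigma^{2})$ with a trapezoidal-rule evaluation of the Rényi divergence between $\mathcal{N}(\Delta,\sigma^{2})$ and $\mathcal{N}(0,\sigma^{2})$, the worst case for a scalar query with sensitivity $\Delta$. The function names and the integration grid are assumptions chosen for illustration.

```python
import math

def gaussian_rdp(alpha, sensitivity, sigma):
    """Closed-form RDP parameter alpha * Delta^2 / (2 * sigma^2)."""
    return alpha * sensitivity**2 / (2.0 * sigma**2)

def renyi_divergence_1d(alpha, shift, sigma, half_width=12.0, steps=20_000):
    """Trapezoidal estimate of D_alpha(N(shift, sigma^2) || N(0, sigma^2))."""
    # The integrand p^alpha * q^(1 - alpha) is a Gaussian bump centred at alpha * shift.
    lo = min(0.0, alpha * shift) - half_width * sigma
    hi = max(0.0, alpha * shift) + half_width * sigma
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        log_p = -((x - shift) ** 2) / (2.0 * sigma**2)
        log_q = -(x**2) / (2.0 * sigma**2)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(alpha * log_p + (1.0 - alpha) * log_q)
    total *= h / (sigma * math.sqrt(2.0 * math.pi))
    return math.log(total) / (alpha - 1.0)

if __name__ == "__main__":
    print(gaussian_rdp(3, 1.0, 1.0), renyi_divergence_1d(3, 1.0, 1.0))
```

The two printed numbers should agree up to quadrature error, reflecting that the Rényi divergence between equal-variance Gaussians depends only on the squared mean shift.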

The following post-processing property enables flexible use of private data outputs while preserving rigorous privacy guarantees.

###### Lemma D.5 (Post-processing \[[42](https://arxiv.org/html/2605.12648#bib.bib152)\]).

If $\mathcal{A}$ satisfies $(\alpha,\rho)$-RDP and $f$ is any deterministic function, then $f\circ\mathcal{A}$ also satisfies $(\alpha,\rho)$-RDP.

The following RDP composition theorem characterizes the privacy of a composition of parallel or adaptive mechanisms in terms of the privacy guarantees of the individual mechanisms.

###### Lemma D.6 (Composition of RDP \[[42](https://arxiv.org/html/2605.12648#bib.bib152)\]).

Fix an order $\alpha>1$. For each $i\in[k]$, let $\mathcal{A}_{i}:\mathcal{Z}^{n}\to\mathcal{W}_{i}$ be a randomized mechanism satisfying $(\alpha,\rho_{i})$-RDP. Then the following statements hold.

1. (a) Joint (simultaneous) release. Let $\mathcal{A}(S)=(\mathcal{A}_{1}(S),\ldots,\mathcal{A}_{k}(S))$. Suppose $\{\mathcal{A}_{i}\}_{i=1}^{k}$ are independent. Then $\mathcal{A}$ satisfies $(\alpha,\sum_{i=1}^{k}\rho_{i})$-RDP.
2. (b) Adaptive composition. Suppose $\mathcal{A}_{1},\ldots,\mathcal{A}_{k}$ are applied sequentially, and for each $i\in[k]$, $\mathcal{A}_{i}$ may depend on the previous outputs $\mathcal{A}_{1}(S),\ldots,\mathcal{A}_{i-1}(S)$. If for every fixed realization $w_{<i}:=(\mathcal{A}_{1}(S),\ldots,\mathcal{A}_{i-1}(S))$ of previous outputs, the conditional mechanism $\mathcal{A}_{i}(\cdot\,;w_{<i})$ satisfies $(\alpha,\rho_{i})$-RDP, then the overall mechanism $\mathcal{A}(S)=\big(\mathcal{A}_{1}(S),\mathcal{A}_{2}(S;\mathcal{A}_{1}(S)),\ldots,\mathcal{A}_{k}(S;\mathcal{A}_{1}(S),\ldots,\mathcal{A}_{k-1}(S))\big)$ satisfies $(\alpha,\sum_{i=1}^{k}\rho_{i})$-RDP.

###### Lemma D.7 (One-step privacy of the uniformly subsampled Gaussian mechanism).

Fix an iteration $t$ and condition on the past iterate $\mathbf{W}_{t-1}$. Let $\widetilde{g}_{t,i}:=g_{t,i}\cdot\min\big\{1,\frac{C_{\mathrm{clip}}}{\|g_{t,i}\|_{2}}\big\}$. Consider the mechanism

$$\mathcal{M}_{t}(S)=\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}\widetilde{g}_{t,i}+\frac{C_{\mathrm{clip}}}{B}\kappa Z_{t},\qquad Z_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),$$
where $\mathcal{B}_{t}$ is sampled uniformly without replacement from $[n]$ with $|\mathcal{B}_{t}|=B$. Then, in the standard low-sampling regime, there exists an absolute constant $c_{\mathrm{sgm}}$ such that $\mathcal{M}_{t}$ satisfies $\bigl(\alpha,\;c_{\mathrm{sgm}}\frac{\alpha B^{2}}{n^{2}\kappa^{2}}\bigr)$-RDP. A convenient explicit choice is $c_{\mathrm{sgm}}=8$.

###### Proof of Lemma [D.7](https://arxiv.org/html/2605.12648#A4.Thmtheorem7).

For completeness, we provide the proof here. Condition on the past iterate $\mathbf{W}_{t-1}$. Then the clipped gradients $\{\widetilde{g}_{t,i}\}_{i=1}^{n}$ are deterministic vectors and satisfy $\|\widetilde{g}_{t,i}\|_{2}\leq C_{\mathrm{clip}}$ for all $i\in[n]$. For a tuple $U=(u_{1},\dots,u_{B})$ of vectors with $\|u_{i}\|_{2}\leq C_{\mathrm{clip}}$, define

$$M(U)=\frac{1}{B}\sum_{i=1}^{B}u_{i}+\frac{C_{\mathrm{clip}}}{B}\kappa Z,\qquad Z\sim\mathcal{N}(\mathbf{0},\mathbf{I}).$$
If $U$ and $U'$ differ in exactly one entry, then

$$\Big\|\frac{1}{B}\sum_{i=1}^{B}u_{i}-\frac{1}{B}\sum_{i=1}^{B}u_{i}'\Big\|_{2}\leq\frac{2C_{\mathrm{clip}}}{B}.$$
Therefore, $M$ is a Gaussian mechanism with $\ell_{2}$-sensitivity $\Delta=\frac{2C_{\mathrm{clip}}}{B}$ and noise standard deviation $\sigma=\frac{C_{\mathrm{clip}}}{B}\kappa$. By Lemma [D.4](https://arxiv.org/html/2605.12648#A4.Thmtheorem4), $M$ satisfies $(\alpha,\varepsilon_{M}(\alpha))$-RDP with $\varepsilon_{M}(\alpha)=\frac{\alpha\Delta^{2}}{2\sigma^{2}}=\frac{2\alpha}{\kappa^{2}}$. Note that $\mathcal{M}_{t}$ is exactly the mechanism $\mathcal{M}_{t}=M\circ\mathrm{subsample}$ with subsampling ratio $q=\frac{B}{n}$.

Applying Theorem 9 of \[[63](https://arxiv.org/html/2605.12648#bib.bib729)\] to $\mathcal{M}_{t}$, for every integer $\alpha\geq 2$,

$$\begin{aligned}\varepsilon_{t}(\alpha)\leq{}&\frac{1}{\alpha-1}\log\Big(1+q^{2}\binom{\alpha}{2}\min\Bigl\{4\bigl(e^{\varepsilon_{M}(2)}-1\bigr),\;e^{\varepsilon_{M}(2)}\min\{2,(e^{\varepsilon_{M}(\infty)}-1)^{2}\}\Bigr\}\\&+\sum_{j=3}^{\alpha}q^{j}\binom{\alpha}{j}e^{(j-1)\varepsilon_{M}(j)}\min\{2,(e^{\varepsilon_{M}(\infty)}-1)^{j}\}\Big).\end{aligned}\tag{19}$$
Since $\varepsilon_{M}(\infty)=\infty$ and $\varepsilon_{M}(j)=\frac{2j}{\kappa^{2}}$, each inner minimum equals $2$, and hence

$$\varepsilon_{t}(\alpha)\leq\frac{1}{\alpha-1}\log\Big(1+q^{2}\binom{\alpha}{2}\min\Bigl\{4\bigl(e^{4/\kappa^{2}}-1\bigr),\;2e^{4/\kappa^{2}}\Bigr\}+2\sum_{j=3}^{\alpha}q^{j}\binom{\alpha}{j}e^{2j(j-1)/\kappa^{2}}\Big).$$
In the standard low-sampling/high-noise regime, the $j=2$ term dominates the above expression. Accordingly, as noted in \[[63](https://arxiv.org/html/2605.12648#bib.bib729)\], the RDP of the subsampled Gaussian mechanism simplifies to

$$\varepsilon_{t}(\alpha)=\mathcal{O}\Big(\frac{\alpha q^{2}}{\kappa^{2}}\Big)=\mathcal{O}\Big(\frac{\alpha B^{2}}{n^{2}\kappa^{2}}\Big).$$
Hence there exists an absolute constant $c_{\mathrm{sgm}}>0$ such that $\varepsilon_{t}(\alpha)\leq c_{\mathrm{sgm}}\frac{\alpha B^{2}}{n^{2}\kappa^{2}}$.

Finally, since $4(e^{4/\kappa^{2}}-1)\approx 16/\kappa^{2}$ for large $\kappa$ and $\log(1+x)\approx x$ for small $x$, the leading asymptotic contribution of the dominant $j=2$ term is

$$\frac{1}{\alpha-1}\cdot q^{2}\binom{\alpha}{2}\cdot\frac{16}{\kappa^{2}}=8\frac{\alpha q^{2}}{\kappa^{2}},$$
which explains the asymptotic constant $8$. This completes the proof. ∎
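To see how close the simplified bound is to its leading-order value $8\alpha q^{2}/\kappa^{2}$, the sketch below evaluates the expression obtained after substituting $\varepsilon_{M}(j)=2j/\kappa^{2}$ and $\varepsilon_{M}(\infty)=\infty$. The function names and the sample values of $\alpha$, $q$, and $\kappa$ are assumptions chosen for illustration.

```python
import math

def subsampled_gaussian_rdp_bound(alpha, q, kappa):
    """Simplified bound on eps_t(alpha) for an integer order alpha >= 2.

    q is the sampling ratio B/n and kappa the noise multiplier, so that the
    base mechanism has eps_M(j) = 2j / kappa^2.
    """
    if int(alpha) != alpha or alpha < 2:
        raise ValueError("alpha must be an integer >= 2")
    alpha = int(alpha)
    e2 = math.exp(4.0 / kappa**2)  # exp(eps_M(2))
    total = q**2 * math.comb(alpha, 2) * min(4.0 * (e2 - 1.0), 2.0 * e2)
    for j in range(3, alpha + 1):
        total += 2.0 * q**j * math.comb(alpha, j) * math.exp(2.0 * j * (j - 1) / kappa**2)
    return math.log1p(total) / (alpha - 1)

def leading_order(alpha, q, kappa):
    """Asymptotic value 8 * alpha * q^2 / kappa^2 of the j = 2 term."""
    return 8.0 * alpha * q**2 / kappa**2

if __name__ == "__main__":
    alpha, q, kappa = 8, 1e-4, 5.0  # hypothetical low-sampling setting
    print(subsampled_gaussian_rdp_bound(alpha, q, kappa) / leading_order(alpha, q, kappa))
```

For these values the printed ratio is slightly above one, because $e^{4/\kappa^{2}}-1$ exceeds $4/\kappa^{2}$. The terms with $j\geq 3$ are at least $2q^{j}\binom{\alpha}{j}$ regardless of $\kappa$, so the $j=2$ term dominates only when $q$ is small enough relative to $1/\kappa^{2}$.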

Based on the above lemmas, we now prove Theorem [5.4](https://arxiv.org/html/2605.12648#S5.Thmtheorem4). We note the argument below is intended for the subsampled mini-batch regime $B<n$, where privacy amplification by subsampling is relevant. In the full-batch case $B=n$, amplification is not needed, and the privacy guarantee follows directly from the standard Gaussian mechanism (equivalently, from the full-batch DP-GD result in \[[60](https://arxiv.org/html/2605.12648#bib.bib19)\]).

###### Theorem D.9 (Restatement of Theorem [5.4](https://arxiv.org/html/2605.12648#S5.Thmtheorem4)).

Assume $\delta>0$ and $\lambda=0$. If

$$\kappa^{2}\geq\frac{16B^{2}T}{n^{2}\epsilon}+\frac{32B^{2}T\log(1/\delta)}{n^{2}\epsilon^{2}},$$
then Algorithm [1](https://arxiv.org/html/2605.12648#alg1) satisfies $(\epsilon,\delta)$-DP. For simplicity, we set $\kappa^{2}=\frac{16B^{2}T}{n^{2}\epsilon}+\frac{32B^{2}T\log(1/\delta)}{n^{2}\epsilon^{2}}$.

###### Proof.

At iteration $t$, the released noisy mini-batch gradient is

$$\hat{v}_{t}=\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}\widetilde{g}_{t,i}+\frac{C_{\mathrm{clip}}}{B}\kappa Z_{t},\qquad Z_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{mdp}),$$
since $\lambda=0$ implies $\xi_{t}=\kappa Z_{t}$.

Condition on the past iterate $\mathbf{W}_{t-1}$. Then all clipped gradients $\{\widetilde{g}_{t,i}\}_{i=1}^{n}$ are deterministic vectors satisfying $\|\widetilde{g}_{t,i}\|_{2}\leq C_{\mathrm{clip}}$. Therefore, by Lemma [D.7](https://arxiv.org/html/2605.12648#A4.Thmtheorem7), the conditional mechanism that outputs $\hat{v}_{t}$ satisfies

$$\Bigl(\alpha,\;8\frac{\alpha B^{2}}{n^{2}\kappa^{2}}\Bigr)\text{-RDP}.$$
Note that the update $\mathbf{W}_{t}=\Pi_{\mathcal{K}}\!\bigl(\mathbf{W}_{t-1}-\eta\hat{v}_{t}\bigr)$ is a deterministic function of $\hat{v}_{t}$ and $\mathbf{W}_{t-1}$. Hence, by Lemma [D.5](https://arxiv.org/html/2605.12648#A4.Thmtheorem5), the conditional mechanism that outputs $\mathbf{W}_{t}$ also satisfies $\bigl(\alpha,\;8\frac{\alpha B^{2}}{n^{2}\kappa^{2}}\bigr)$-RDP.

Applying Lemma [D.6](https://arxiv.org/html/2605.12648#A4.Thmtheorem6) over $t=0,\dots,T-1$, the full transcript $(\mathbf{W}_{0},\dots,\mathbf{W}_{T-1})$ satisfies $\bigl(\alpha,\;8\frac{\alpha TB^{2}}{n^{2}\kappa^{2}}\bigr)$-RDP. By the post-processing property (Lemma [D.5](https://arxiv.org/html/2605.12648#A4.Thmtheorem5)), the final iterate $\mathbf{W}_{T-1}$ also satisfies $\bigl(\alpha,\;8\frac{\alpha TB^{2}}{n^{2}\kappa^{2}}\bigr)$-RDP.

Choose $\alpha=1+\frac{2\log(1/\delta)}{\epsilon}$, so that $\frac{\log(1/\delta)}{\alpha-1}=\frac{\epsilon}{2}$. Then Lemma [D.2](https://arxiv.org/html/2605.12648#A4.Thmtheorem2) gives that $\mathbf{W}_{T-1}$ satisfies $\big(8\frac{\alpha TB^{2}}{n^{2}\kappa^{2}}+\frac{\epsilon}{2},\,\delta\big)$-DP. Note that, with $q=B/n$, the condition $\kappa^{2}\geq\frac{16\alpha TB^{2}}{n^{2}\epsilon}$ ensures

$$8\frac{\alpha Tq^{2}}{\kappa^{2}}\leq\frac{\epsilon}{2}.$$
Hence, $\mathbf{W}_{T-1}$ satisfies $(\epsilon,\delta)$-DP. Substituting $\alpha=1+\frac{2\log(1/\delta)}{\epsilon}$ into this condition yields

$$\kappa^{2}\geq\frac{16B^{2}T}{n^{2}\epsilon}+\frac{32B^{2}T\log(1/\delta)}{n^{2}\epsilon^{2}},$$
which completes the proof. ∎
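The accounting in this proof can be replayed numerically. The sketch below computes $\kappa^{2}$ from the formula of Theorem D.9 and then recomputes the resulting $\epsilon$ through the same three steps: per-step RDP with constant $8$, composition over $T$ steps, and conversion to $(\epsilon,\delta)$-DP. Function names and the sample values are assumptions for illustration, and $\alpha$ is treated as a real-valued order, as in the proof.

```python
import math

def kappa_squared(B, T, n, eps, delta):
    """Noise level kappa^2 = 16 q^2 T / eps + 32 q^2 T log(1/delta) / eps^2, q = B/n."""
    q = B / n
    return 16.0 * q**2 * T / eps + 32.0 * q**2 * T * math.log(1.0 / delta) / eps**2

def dp_epsilon(B, T, n, eps, delta):
    """Epsilon recovered from T-fold RDP composition at the order used in the proof."""
    q = B / n
    alpha = 1.0 + 2.0 * math.log(1.0 / delta) / eps
    # Composed RDP parameter: T steps, each with parameter 8 * alpha * q^2 / kappa^2.
    rho = 8.0 * alpha * T * q**2 / kappa_squared(B, T, n, eps, delta)
    # RDP-to-DP conversion: rho + log(1/delta) / (alpha - 1).
    return rho + math.log(1.0 / delta) / (alpha - 1.0)

if __name__ == "__main__":
    # Hypothetical training configuration.
    B, T, n, eps, delta = 64, 1000, 50_000, 1.0, 1e-5
    print(kappa_squared(B, T, n, eps, delta), dp_epsilon(B, T, n, eps, delta))
```

With $\kappa^{2}$ set to the stated value, the composed RDP term and the conversion term each contribute $\epsilon/2$, so the recovered value matches the target $\epsilon$ up to floating-point error.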

### D.2 Proofs for optimization of DP-SGD with independent noise

Recall

$$\mathcal{E}_{\delta}=\Big\{\|\mathbf{c}\|_{2}\leq 4\sqrt{pm}+2\sqrt{\log(2/\delta)},\quad\max_{j\in[m]}\|\mathbf{c}_{j}\|_{2}\leq 4\sqrt{p}+2\sqrt{\log(2m/\delta)}\Big\}$$
and

$$\mathcal{F}_{t-1}=\sigma\bigl(\mathbf{W}_{0},\mathcal{B}_{1},\dots,\mathcal{B}_{t-1},Z_{1},\dots,Z_{t-1}\bigr).$$
By Lemma [B.5](https://arxiv.org/html/2605.12648#A2.Thmtheorem5), it holds that $\mathbb{P}(\mathcal{E}_{\delta})\geq 1-\delta$. For any $t\in[T]$, define

$$\Delta_{t}=\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}g_{t,i}-\nabla\mathcal{L}_{S}(\mathbf{W}_{t-1})\quad\text{with}\quad g_{t,i}=\nabla\ell\left(y_{i}f_{\mathbf{W}_{t-1}}(\mathbf{x}_{i})\right).\tag{20}$$

###### Lemma D.10 (Unbiasedness and variance when $\lambda=0$).

Suppose that $\lambda=0$ and $C_{\mathrm{clip}}\geq G_{\delta}$, where $G_{\delta}$ is defined in ([5](https://arxiv.org/html/2605.12648#A3.E5)). Let $\{\mathbf{W}_{t}\}_{t=0}^{T}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1). Under the event $\mathcal{E}_{\delta}$, for all $t\in[T]$,

$$\mathbb{E}_{\mathcal{B}_{t}}[\Delta_{t}\mid\mathcal{F}_{t-1}]=0\quad\text{and}\quad\mathbb{E}_{\mathcal{B}_{t}}\big[\|\Delta_{t}\|_{2}^{2}\mid\mathcal{F}_{t-1}\big]\leq\frac{G_{\delta}^{2}}{B}.$$
Here the expectation is taken with respect to the uniform sampling of the mini-batch $\mathcal{B}_{t}$.

###### Proof.

Assume the event $\mathcal{E}_{\delta}$ holds. For any fixed $t\in[T]$, the uniform gradient bound ([5](https://arxiv.org/html/2605.12648#A3.E5)) implies

$$\|g_{t,i}\|_{2}=\big\|\nabla\ell\!\left(y_{i}f_{\mathbf{W}_{t-1}}(\mathbf{x}_{i})\right)\big\|_{2}\leq G_{\delta}\leq C_{\mathrm{clip}},\qquad\forall i\in[n].$$
Hence clipping is inactive, and since $\lambda=0$, the update becomes

$$\mathbf{W}_{t}=\Pi_{\mathcal{K}}\Big(\mathbf{W}_{t-1}-\eta\Big(\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}g_{t,i}+\frac{C_{\mathrm{clip}}\kappa}{B}Z_{t}\Big)\Big).$$
Conditional on $\mathcal{F}_{t-1}$, the iterate $\mathbf{W}_{t-1}$ is fixed, and thus the vectors $g_{t,1},\dots,g_{t,n}$ are deterministic. Since $\mathcal{B}_{t}$ is sampled uniformly without replacement from $[n]$, the mini-batch average is an unbiased estimator of the full empirical gradient:

$$\mathbb{E}_{\mathcal{B}_{t}}\Big[\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}g_{t,i}\,\Big|\,\mathcal{F}_{t-1}\Big]=\frac{1}{n}\sum_{i=1}^{n}g_{t,i}=\nabla\mathcal{L}_{S}(\mathbf{W}_{t-1}).$$
This proves the first claim.

Moreover, since $\|g_{t,i}\|_{2}\leq G_{\delta}$ for all $i$, Lemma [B.2](https://arxiv.org/html/2605.12648#A2.Thmtheorem2) gives

$$\mathbb{E}_{\mathcal{B}_{t}}\big[\|\Delta_{t}\|_{2}^{2}\mid\mathcal{F}_{t-1}\big]\leq\frac{G_{\delta}^{2}}{B},$$
which completes the proof. ∎
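Both claims of Lemma D.10 can be verified exactly on a toy instance by enumerating every size-$B$ subset. The sketch below does this for a small set of fixed vectors standing in for the per-example gradients at a frozen iterate; the helper names and the choice of unit-norm vectors (so that $G_{\delta}=1$) are assumptions made for illustration.

```python
import itertools
import math

def unit_circle_grads(n):
    """Toy per-example 'gradients': n unit vectors in the plane, so G = 1."""
    return [(math.cos(k), math.sin(k)) for k in range(n)]

def minibatch_moments(grads, B):
    """Exact mean of Delta and of ||Delta||^2 over all size-B subsets.

    Delta is the mini-batch average minus the full average, with the batch
    drawn uniformly without replacement.
    """
    n, dim = len(grads), len(grads[0])
    full = [sum(g[k] for g in grads) / n for k in range(dim)]
    mean_dev = [0.0] * dim
    second_moment = 0.0
    count = 0
    for batch in itertools.combinations(range(n), B):
        avg = [sum(grads[i][k] for i in batch) / B for k in range(dim)]
        dev = [a - f for a, f in zip(avg, full)]
        mean_dev = [m + d for m, d in zip(mean_dev, dev)]
        second_moment += sum(d * d for d in dev)
        count += 1
    return [m / count for m in mean_dev], second_moment / count

if __name__ == "__main__":
    mean_dev, second_moment = minibatch_moments(unit_circle_grads(8), B=3)
    print(mean_dev, second_moment, 1.0 / 3)
```

The mean deviation vanishes up to rounding, and the second moment stays below $G_{\delta}^{2}/B$; sampling without replacement in fact gains an additional finite-population factor, which the lemma does not need.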

###### Lemma D.11.

Let $\delta\in(0,1)$. Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds, ([1](https://arxiv.org/html/2605.12648#A2.E1)) holds, $\lambda=0$ and $C_{\mathrm{clip}}\geq G_{\delta}$, where $G_{\delta}$ is defined in ([5](https://arxiv.org/html/2605.12648#A3.E5)). Let $\{\mathbf{W}_{t}\}_{t=0}^{T-1}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\eta\leq\frac{1}{12C_{\sigma,b}p^{3}}$. Assume

$$m\geq 64C_{\sigma,b}^{2}p^{3}\bigl(\sqrt{\log(m/\delta)}+\sqrt{p}\bigr)^{2}R_{*}^{4}.\tag{21}$$
Then, for any comparator $\mathbf{W}^{*}\in\mathcal{K}$ and all $t\in[T]$, it holds on $\mathcal{E}_{\delta}$ that

$$\begin{aligned}\mathcal{L}_{S}(\mathbf{W}_{t-1})\leq{}&4\mathcal{L}_{S}(\mathbf{W}^{*})+\frac{3}{2\eta}\Big(\|\mathbf{W}_{t-1}-\mathbf{W}^{*}\|_{2}^{2}-\|\mathbf{W}_{t}-\mathbf{W}^{*}\|_{2}^{2}\Big)\\&+6\eta\|\Delta_{t}\|_{2}^{2}+6\eta\Big(\frac{C_{\mathrm{clip}}\kappa}{B}\Big)^{2}\|Z_{t}\|_{2}^{2}-3\Big\langle\Delta_{t}+\frac{C_{\mathrm{clip}}\kappa}{B}Z_{t},\,\mathbf{W}_{t-1}-\mathbf{W}^{*}\Big\rangle.\end{aligned}$$

###### Proof.

Assume $\mathcal{E}_{\delta}$ holds. Let $g_{t}:=\nabla\mathcal{L}_{S}(\mathbf{W}_{t-1})$. Since clipping is inactive and $\lambda=0$, the update can be written as

$$\mathbf{W}_{t}=\Pi_{\mathcal{K}}\Big(\mathbf{W}_{t-1}-\eta\big(g_{t}+\Delta_{t}+c_{\rm priv}Z_{t}\big)\Big)\quad\text{with}\quad c_{\rm priv}=\frac{C_{\mathrm{clip}}\kappa}{B}.$$
Applying Lemma [B.3](https://arxiv.org/html/2605.12648#A2.Thmtheorem3) with $g=g_{t}+\Delta_{t}+c_{\rm priv}Z_{t}$, $\mathbf{W}=\mathbf{W}_{t-1}$, and $\mathbf{W}^{+}=\mathbf{W}_{t}$, we obtain

$$\langle g_{t}+\Delta_{t}+c_{\rm priv}Z_{t},\,\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle\leq\frac{1}{2\eta}\Big(\|\mathbf{W}_{t-1}-\mathbf{W}^{*}\|_{2}^{2}-\|\mathbf{W}_{t}-\mathbf{W}^{*}\|_{2}^{2}\Big)+\frac{\eta}{2}\|g_{t}+\Delta_{t}+c_{\rm priv}Z_{t}\|_{2}^{2}.$$
Hence,

$$\begin{aligned}\langle g_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle\leq{}&\frac{1}{2\eta}\Big(\|\mathbf{W}_{t-1}-\mathbf{W}^{*}\|_{2}^{2}-\|\mathbf{W}_{t}-\mathbf{W}^{*}\|_{2}^{2}\Big)+\frac{\eta}{2}\|g_{t}+\Delta_{t}+c_{\rm priv}Z_{t}\|_{2}^{2}\\&-\langle\Delta_{t}+c_{\rm priv}Z_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle.\end{aligned}\tag{22}$$
On the other hand, by Lemma [B.7](https://arxiv.org/html/2605.12648#A2.Thmtheorem7), we know

$$\lambda_{\min}\big(\nabla^{2}\mathcal{L}_{S}(\mathbf{W})\big)\geq-\frac{C_{\sigma,b}\,p^{3/2}\bigl(\sqrt{\log(m/\delta)}+\sqrt{p}\bigr)}{\sqrt{m}}\,\mathcal{L}_{S}(\mathbf{W}).$$
Since $\mathbf{W}_{t-1},\mathbf{W}^{*}\in\mathcal{K}=\mathcal{B}(\mathbf{W}_{0},R_{*})$, it holds that $\|\mathbf{W}_{t-1}-\mathbf{W}^{*}\|_{2}\leq 2R_{*}$. Combining this with the condition ([21](https://arxiv.org/html/2605.12648#A4.E21)) ensures that

$$\frac{C_{\sigma,b}\,p^{3/2}\bigl(\sqrt{\log(m/\delta)}+\sqrt{p}\bigr)}{\sqrt{m}}(2R_{*})^{2}\leq\frac{1}{2}.$$
Therefore, by Lemma [B.4](https://arxiv.org/html/2605.12648#A2.Thmtheorem4) and the same argument as in the proof of the correlated case (see the proof of Lemma [C.2](https://arxiv.org/html/2605.12648#A3.Thmtheorem2)), we have

$$\mathcal{L}_{S}(\mathbf{W}_{t-1})\leq\mathcal{L}_{S}(\mathbf{W}^{*})+\langle g_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle+\frac{1}{3}\Big(\mathcal{L}_{S}(\mathbf{W}_{t-1})+\mathcal{L}_{S}(\mathbf{W}^{*})\Big).$$
Rearranging implies

$$\frac{2}{3}\,\mathcal{L}_{S}(\mathbf{W}_{t-1})\leq\frac{4}{3}\,\mathcal{L}_{S}(\mathbf{W}^{*})+\langle g_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle.$$
Substituting ([22](https://arxiv.org/html/2605.12648#A4.E22)) into the above inequality yields

$$\begin{aligned}\frac{2}{3}\,\mathcal{L}_{S}(\mathbf{W}_{t-1})\leq{}&\frac{4}{3}\,\mathcal{L}_{S}(\mathbf{W}^{*})+\frac{1}{2\eta}\Big(\|\mathbf{W}_{t-1}-\mathbf{W}^{*}\|_{2}^{2}-\|\mathbf{W}_{t}-\mathbf{W}^{*}\|_{2}^{2}\Big)+\frac{\eta}{2}\|g_{t}+\Delta_{t}+c_{\rm priv}Z_{t}\|_{2}^{2}\\&-\langle\Delta_{t}+c_{\rm priv}Z_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle.\end{aligned}$$
Note that

$$\|g_{t}+\Delta_{t}+c_{\rm priv}Z_{t}\|_{2}^{2}\leq 2\|g_{t}\|_{2}^{2}+4\|\Delta_{t}\|_{2}^{2}+4c_{\rm priv}^{2}\|Z_{t}\|_{2}^{2}\leq 8C_{\sigma,b}p^{3}\,\mathcal{L}_{S}(\mathbf{W}_{t-1})+4\|\Delta_{t}\|_{2}^{2}+4c_{\rm priv}^{2}\|Z_{t}\|_{2}^{2},$$
where we have used Lemma [B.7](https://arxiv.org/html/2605.12648#A2.Thmtheorem7). By further using $4C_{\sigma,b}p^{3}\eta\leq\frac{1}{3}$, which is implied by $\eta\leq(12C_{\sigma,b}p^{3})^{-1}$, it holds that

$$\frac{\eta}{2}\|g_{t}+\Delta_{t}+c_{\rm priv}Z_{t}\|_{2}^{2}\leq\frac{1}{3}\mathcal{L}_{S}(\mathbf{W}_{t-1})+2\eta\|\Delta_{t}\|_{2}^{2}+2\eta c_{\rm priv}^{2}\|Z_{t}\|_{2}^{2}.$$
Therefore,

$$\begin{aligned}\frac{1}{3}\,\mathcal{L}_{S}(\mathbf{W}_{t-1})\leq{}&\frac{4}{3}\,\mathcal{L}_{S}(\mathbf{W}^{*})+\frac{1}{2\eta}\Big(\|\mathbf{W}_{t-1}-\mathbf{W}^{*}\|_{2}^{2}-\|\mathbf{W}_{t}-\mathbf{W}^{*}\|_{2}^{2}\Big)+2\eta\|\Delta_{t}\|_{2}^{2}+2\eta c_{\rm priv}^{2}\|Z_{t}\|_{2}^{2}\\&-\langle\Delta_{t}+c_{\rm priv}Z_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle.\end{aligned}$$
Multiplying both sides by $3$ and substituting $c_{\rm priv}=C_{\mathrm{clip}}\kappa/B$ completes the proof. ∎

We now estimate the perturbation terms in Lemma[D\.11](https://arxiv.org/html/2605.12648#A4.Thmtheorem11)\.

###### Lemma D\.12\.

Suppose the assumptions of Lemma[D\.10](https://arxiv.org/html/2605.12648#A4.Thmtheorem10)hold\. LetC\>0C\>0be an absolute constant\. Then, conditional on the dataset and the initialization satisfyingℰδ\\mathcal\{E\}\_\{\\delta\}\(see \([1](https://arxiv.org/html/2605.12648#A2.E1)\)\), the following statements hold true\.

1. (a) For any $\delta_{\Delta^{2}}\in(0,1)$, $\mathbb{P}_{\mathcal{A}}\Big(\sum_{t=1}^{T}\|\Delta_{t}\|_{2}^{2}\leq\frac{2TG_{\delta}^{2}}{B}+8G_{\delta}^{2}\log\Big(\frac{1}{\delta_{\Delta^{2}}}\Big)\Big)\geq 1-\delta_{\Delta^{2}}.$
2. (b) For any $\delta_{Z^{2}}\in(0,1)$, $\mathbb{P}\Big(\sum_{t=1}^{T}\|Z_{t}\|_{2}^{2}\leq Tmdp+2\sqrt{Tmdp\log(1/\delta_{Z^{2}})}+2\log\big(\frac{1}{\delta_{Z^{2}}}\big)\Big)\geq 1-\delta_{Z^{2}}.$
3. (c) Let $\mathbf{W}^{*}\in\mathcal{K}$, and define $\mathfrak{M}_{T}^{(\Delta)}:=-3\sum_{t=1}^{T}\langle\Delta_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle.$ Then, for any $\delta_{\Delta,\mathrm{lin}}\in(0,1)$, $\mathbb{P}_{\mathcal{A}}\Big(|\mathfrak{M}_{T}^{(\Delta)}|\leq C\,G_{\delta}R_{*}\sqrt{\frac{T\log(1/\delta_{\Delta,\mathrm{lin}})}{B}}\Big)\geq 1-\delta_{\Delta,\mathrm{lin}}.$
4. (d) Let $\mathbf{W}^{*}\in\mathcal{K}$. Define $\mathfrak{M}_{T}^{(Z)}:=-3\frac{C_{\mathrm{clip}}\kappa}{B}\sum_{t=1}^{T}\langle Z_{t},\mathbf{W}_{t-1}-\mathbf{W}^{*}\rangle.$ Then, for any $\delta_{Z,\mathrm{lin}}\in(0,1)$, $\mathbb{P}_{\mathcal{A}}\Big(|\mathfrak{M}_{T}^{(Z)}|\leq C\,\frac{C_{\mathrm{clip}}\kappa}{B}\,R_{*}\sqrt{T\log(1/\delta_{Z,\mathrm{lin}})}\Big)\geq 1-\delta_{Z,\mathrm{lin}}.$
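As a quick sanity check on part (b): the sum $\sum_{t}\|Z_{t}\|_{2}^{2}$ of squared norms of $T$ independent standard Gaussian vectors in dimension $mdp$ is chi-square distributed with $k=Tmdp$ degrees of freedom, and the stated threshold has the same form as the standard Laurent–Massart chi-square tail bound. The sketch below is a small Monte Carlo illustration under these assumptions, not part of the paper's argument. The function names and the parameter values are hypothetical, and the chi-square variable is sampled as a Gamma variable with shape $k/2$ and scale $2$.

```python
import math
import random


def chi_square_threshold(k: int, delta: float) -> float:
    """Threshold k + 2*sqrt(k*log(1/delta)) + 2*log(1/delta) from part (b), with k = T*m*d*p."""
    x = math.log(1.0 / delta)
    return k + 2.0 * math.sqrt(k * x) + 2.0 * x


def empirical_failure_rate(k: int, delta: float, trials: int = 20000, seed: int = 0) -> float:
    """Fraction of chi-square(k) draws that exceed the threshold (hypothetical helper)."""
    rng = random.Random(seed)
    threshold = chi_square_threshold(k, delta)
    # A chi-square variable with k degrees of freedom is Gamma(shape=k/2, scale=2).
    exceed = sum(rng.gammavariate(k / 2.0, 2.0) > threshold for _ in range(trials))
    return exceed / trials


if __name__ == "__main__":
    T, m, d, p = 10, 5, 2, 2  # illustrative sizes only, not taken from the paper
    k = T * m * d * p
    for delta in (0.1, 0.05, 0.01):
        rate = empirical_failure_rate(k, delta)
        print(f"delta={delta}: threshold={chi_square_threshold(k, delta):.1f}, "
              f"empirical failure rate={rate:.4f}")
```

In such a simulation the empirical failure rate should stay below the nominal $\delta$, since the tail bound is not tight for moderate $k$.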

###### Proof\.

Part (a) is proved exactly as Lemma [C.6](https://arxiv.org/html/2605.12648#A3.Thmtheorem6), but without the stopping indicator. Part (b) follows directly from Lemma [C.4](https://arxiv.org/html/2605.12648#A3.Thmtheorem4).

For part (c), conditional on the pre-sampling filtration, $\mathbf{W}_{t-1}-\mathbf{W}^{*}$ is fixed and has norm at most $2R_{*}$, while $\Delta_{t}$ is centered. The proof is the same as that of Lemma [C.7](https://arxiv.org/html/2605.12648#A3.Thmtheorem7), with $\bar{R}$ replaced by $2R_{*}$ and without the stopping indicator.

Finally, for part (d), conditional on the filtration up to time $t-1$, the vector $\mathbf{W}_{t-1}-\mathbf{W}^{*}$ is fixed and has norm at most $2R_{*}$, while $Z_{t}\sim\mathcal{N}(0,\mathbf{I}_{mdp})$ is independent of the past. Hence each summand is conditionally Gaussian with variance bounded by $C(C_{\mathrm{clip}}\kappa R_{*}/B)^{2}$. The claim then follows from the standard maximal inequality for Gaussian martingale differences, exactly as in Lemma [C.8](https://arxiv.org/html/2605.12648#A3.Thmtheorem8). The proof is complete. ∎

###### Theorem D.13 (General optimization risk bound for standard DP-SGD).

Let $\delta\in(0,1)$. Suppose Assumption [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) holds. Let $\{\mathbf{W}_{t}\}_{t=0}^{T}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\lambda=0$, $\eta\leq\frac{1}{12C_{\sigma,b}p^{3}}$ and $C_{\mathrm{clip}}\geq G_{\delta}$. Assume $m\geq 64C_{\sigma,b}^{2}p^{3}\bigl(\sqrt{\log(m/\delta)}+\sqrt{p}\bigr)^{2}R_{*}^{4}.$ Then, conditioned on the dataset and the initialization satisfying $\mathcal{E}_{\delta}$, with probability at least $1-\delta$ over the algorithmic randomness,

$$\begin{aligned}\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\;&\mathcal{L}_{S}(\mathbf{W}^{*})+\frac{\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2}}{\eta T}+G_{\delta}R_{*}\sqrt{\frac{\log(1/\delta)}{BT}}+\frac{C_{\mathrm{clip}}\kappa}{B}\,R_{*}\sqrt{\frac{\log(1/\delta)}{T}}\\&+\eta G_{\delta}^{2}\Big(\frac{1}{B}+\frac{\log(1/\delta)}{T}\Big)+\eta\frac{C_{\mathrm{clip}}^{2}\kappa^{2}}{B^{2}}\Big(md+\frac{\log(1/\delta)}{T}\Big).\end{aligned}$$

###### Proof\.

Fix auxiliary failure probabilities $\delta_{\Delta^{2}}=\delta_{\Delta,\mathrm{lin}}=\delta_{Z^{2}}=\delta_{Z,\mathrm{lin}}=\frac{\delta}{4}.$ Since replacing $\delta$ by a constant fraction only affects logarithmic terms by absolute constants, we suppress this distinction below.

Assume the event $\mathcal{E}_{\delta}$ defined in ([1](https://arxiv.org/html/2605.12648#A2.E1)) holds. Summing the inequality in Lemma [D.11](https://arxiv.org/html/2605.12648#A4.Thmtheorem11) over $t=1,\dots,T$, telescoping the distance terms, and dropping the nonpositive term $-\frac{3}{2\eta}\|\mathbf{W}_{T}-\mathbf{W}^{*}\|_{2}^{2}$ yields

$$\begin{aligned}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\leq\;&4T\mathcal{L}_{S}(\mathbf{W}^{*})+\frac{3}{2\eta}\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2}+6\eta\sum_{t=1}^{T}\|\Delta_{t}\|_{2}^{2}+6\eta\Big(\frac{C_{\mathrm{clip}}\kappa}{B}\Big)^{2}\sum_{t=1}^{T}\|Z_{t}\|_{2}^{2}\\&+\mathfrak{M}_{T}^{(\Delta)}+\mathfrak{M}_{T}^{(Z)}.\end{aligned}$$

Plugging the results of Lemma [D.12](https://arxiv.org/html/2605.12648#A4.Thmtheorem12) into the above inequality and dividing by $T$ gives

$$\begin{aligned}\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\;&\mathcal{L}_{S}(\mathbf{W}^{*})+\frac{\|\mathbf{W}_{0}-\mathbf{W}^{*}\|_{2}^{2}}{\eta T}+G_{\delta}R_{*}\sqrt{\frac{\log(1/\delta)}{BT}}+\frac{C_{\mathrm{clip}}\kappa}{B}\,R_{*}\sqrt{\frac{\log(1/\delta)}{T}}\\&+\eta G_{\delta}^{2}\Big(\frac{1}{B}+\frac{\log(1/\delta)}{T}\Big)+\eta\frac{C_{\mathrm{clip}}^{2}\kappa^{2}}{B^{2}}\Big(md+\frac{\log(1/\delta)}{T}\Big).\end{aligned}$$

The stated probability follows from a union bound over the four good events. ∎
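To see which term of the bound in Theorem D.13 dominates for a given configuration, the right-hand side can be evaluated term by term. The following is a minimal sketch with all hidden constants set to $1$ (an assumption, since the bound holds only up to constants). The container `BoundInputs`, the helper names, and the dictionary keys are hypothetical names introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class BoundInputs:
    """Hypothetical container for the quantities appearing in Theorem D.13."""
    loss_star: float  # L_S(W*)
    dist0_sq: float   # ||W_0 - W*||_2^2
    eta: float        # step size eta
    T: int            # number of iterations
    B: int            # batch size
    G: float          # G_delta
    R: float          # R_*
    C_clip: float     # clipping threshold C_clip
    kappa: float      # noise parameter kappa
    m: int            # the quantity m in the theorem
    d: int            # the quantity d in the theorem
    delta: float      # failure probability delta


def d13_terms(q: BoundInputs) -> dict:
    """Each term of the right-hand side, with hidden constants set to 1 (assumption)."""
    log_term = math.log(1.0 / q.delta)
    noise_scale = q.C_clip * q.kappa / q.B
    return {
        "comparator": q.loss_star,
        "initial_distance": q.dist0_sq / (q.eta * q.T),
        "sampling_linear": q.G * q.R * math.sqrt(log_term / (q.B * q.T)),
        "noise_linear": noise_scale * q.R * math.sqrt(log_term / q.T),
        "sampling_quadratic": q.eta * q.G ** 2 * (1.0 / q.B + log_term / q.T),
        "noise_quadratic": q.eta * noise_scale ** 2 * (q.m * q.d + log_term / q.T),
    }


def d13_bound(q: BoundInputs) -> float:
    """Sum of all terms, i.e. the bound up to constants."""
    return sum(d13_terms(q).values())


if __name__ == "__main__":
    q = BoundInputs(loss_star=0.01, dist0_sq=4.0, eta=0.1, T=1000, B=32,
                    G=1.0, R=2.0, C_clip=1.0, kappa=5.0, m=64, d=10, delta=0.05)
    for name, value in d13_terms(q).items():
        print(f"{name:>20s}: {value:.5f}")
    print(f"{'total':>20s}: {d13_bound(q):.5f}")
```

Setting `kappa=0` removes both noise terms, which mirrors the non-private case in which only the comparator, initialization, and sampling terms remain.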

Now, we present our optimization risk bound under the NTK separability assumption\.

###### Theorem D.14 (Optimization Risk Bound).

Let $\delta\in(0,1)$. Suppose Assumptions [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3) hold. Let $\{\mathbf{W}_{t}\}_{t=0}^{T}$ be generated by Algorithm [1](https://arxiv.org/html/2605.12648#alg1) with $\eta\leq\frac{1}{12C_{\sigma,b}p^{3}}$ and $C_{\mathrm{clip}}\geq G_{\delta}$. Assume $m\gtrsim\max\big\{\frac{\log(\frac{m}{\delta})(\log^{4}(T)+\log^{2}(\frac{n}{\delta}))}{\gamma^{4}},\frac{\log(nT/\delta)}{Td}\big\}$ and $R_{*}\asymp\frac{1}{\gamma}(\log(T)+\sqrt{\log(n/\delta)}).$ Then, with probability at least $1-\delta$ over the randomness of the initialization and the algorithmic randomness,

$$\begin{aligned}\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\;&\frac{\log^{2}(T)+\log(n/\delta)}{\gamma^{2}\eta T}+\frac{\log(1/\delta)\bigl(\log(T)+\sqrt{\log(n/\delta)}\bigr)}{\gamma\sqrt{BT}}\\&+\frac{\log(1/\delta)\bigl(\log(T)+\sqrt{\log(n/\delta)}\bigr)\kappa}{\gamma B\sqrt{T}}+\eta\log(1/\delta)\Big(\frac{1}{B}+\frac{\log(1/\delta)}{T}\Big)\\&+\frac{\eta md\log^{2}(1/\delta)\kappa^{2}}{B^{2}}=:A_{\mathrm{van}}.\end{aligned}$$

###### Proof\.

Fix $\delta_{\mathrm{init}}=\delta_{\mathrm{ntk}}=\delta_{\mathrm{alg}}=\frac{\delta}{3}.$ Take

$$R_{*}\asymp\frac{1}{\gamma}\bigl(\log(T)+\sqrt{\log(n/\delta)}\bigr).$$

Under the corresponding width conditions, Lemma [C.12](https://arxiv.org/html/2605.12648#A3.Thmtheorem12) implies that, with probability at least $1-\delta_{\mathrm{ntk}}$ over the initialization, there exists a comparator $\mathbf{W}^{*}\in\mathcal{K}$ such that

$$\mathcal{L}_{S}(\mathbf{W}^{*})\leq\frac{1}{T},\qquad\|\mathbf{W}^{*}-\mathbf{W}_{0}\|_{2}^{2}\lesssim\frac{\log^{2}(T)+\log(n/\delta)}{\gamma^{2}}.$$

Intersecting this event with $\mathcal{E}_{\delta_{\mathrm{init}}}$ and then applying Theorem [D.13](https://arxiv.org/html/2605.12648#A4.Thmtheorem13) with failure probability $\delta_{\mathrm{alg}}$, we obtain, with probability at least $1-\delta$,

$$\begin{aligned}\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\;&\frac{1}{T}+\frac{\log^{2}(T)+\log(n/\delta)}{\gamma^{2}\eta T}+G_{\delta}R_{*}\sqrt{\frac{\log(1/\delta)}{BT}}+\frac{C_{\mathrm{clip}}\kappa}{B}\,R_{*}\sqrt{\frac{\log(1/\delta)}{T}}\\&+\eta G_{\delta}^{2}\Big(\frac{1}{B}+\frac{\log(1/\delta)}{T}\Big)+\eta\frac{C_{\mathrm{clip}}^{2}\kappa^{2}}{B^{2}}\Big(md+\frac{\log(1/\delta)}{T}\Big).\end{aligned}$$

Recall that $C_{\mathrm{clip}}\leq G_{\delta}\lesssim\sqrt{\log(1/\delta)},$ $\kappa^{2}\asymp\frac{B^{2}T\log(1/\delta)}{n^{2}\epsilon^{2}}$ and $R_{*}\asymp\frac{1}{\gamma}\bigl(\log(T)+\sqrt{\log(n/\delta)}\bigr)$. Substituting these estimates into the previous display gives

$$\begin{aligned}\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\;&\frac{\log^{2}(T)+\log(n/\delta)}{\gamma^{2}\eta T}+\frac{\log(1/\delta)\bigl(\log(T)+\sqrt{\log(n/\delta)}\bigr)}{\gamma\sqrt{BT}}\\&+\frac{\log(1/\delta)\bigl(\log(T)+\sqrt{\log(n/\delta)}\bigr)\kappa}{\gamma B\sqrt{T}}+\eta\log(1/\delta)\Big(\frac{1}{B}+\frac{\log(1/\delta)}{T}\Big)\\&+\frac{\eta md\log^{2}(1/\delta)\kappa^{2}}{B^{2}}.\end{aligned}$$

This proves the theorem. ∎

### D.3 Proofs for population risk of DP-SGD with independent noise

The stability and generalization analysis developed for the correlated-noise case is noise-agnostic once the corresponding optimization good event is available. Therefore, when $\lambda=0$, the same argument applies verbatim after replacing $A_{\mathrm{corr}}$ by $A_{\mathrm{van}}$. Recall that $U_{\delta,\delta_{f}}\lesssim\sqrt{\log(n/\delta)}$ and $b_{\delta,m}\lesssim\sqrt{\log(1/\delta)}$.

###### Theorem D.15 (On-average argument stability for standard DP-SGD).

Let $\delta\in(0,1)$. Suppose the assumptions of Theorem [D.14](https://arxiv.org/html/2605.12648#A4.Thmtheorem14) hold. Assume

$$m\gtrsim\log\Big(\frac{m}{\delta}\Big)\eta^{2}\Big(T^{2}A^{2}_{\mathrm{van}}+\log(n/\delta)\log^{2}(1/\delta)\Bigl(\frac{T}{B}+1\Bigr)\Big).$$

Then, with probability at least $1-\delta$ over initialization,

$$\mathbb{E}_{S,\widetilde{S},\mathcal{A}}\Big[\frac{1}{n}\sum_{i=1}^{n}\|\mathbf{W}_{T}-\mathbf{W}_{T}^{(i)}\|_{2}\Big]\lesssim\frac{\eta\sqrt{\log(1/\delta)}}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\frac{(\log(T)+\sqrt{\log(n/\delta)})\delta}{\gamma}.$$

###### Proof\.

The proof is identical to that of Theorem [C.22](https://arxiv.org/html/2605.12648#A3.Thmtheorem22). The only difference is that the optimization good event $\mathcal{E}_{\mathrm{opt}}^{(i)}$ is now supplied by Theorem [D.14](https://arxiv.org/html/2605.12648#A4.Thmtheorem14) instead of Theorem [4.1](https://arxiv.org/html/2605.12648#S4.Thmtheorem1). Under the stated lower bound on $m$, Lemma [C.20](https://arxiv.org/html/2605.12648#A3.Thmtheorem20) applies with $A_{\mathrm{corr}}$ replaced by $A_{\mathrm{van}}$, and the rest of the argument is unchanged. ∎

###### Theorem D.16 (Population risk bound).

Suppose the assumptions of Theorem [D.14](https://arxiv.org/html/2605.12648#A4.Thmtheorem14) hold. Assume

$$m\gtrsim\log\Big(\frac{m}{\delta}\Big)\eta^{2}\Big(T^{2}A^{2}_{\mathrm{van}}+\log(n/\delta)\log^{2}(1/\delta)\Bigl(\frac{T}{B}+1\Bigr)\Big).$$

Then, with probability at least $1-\delta$ over initialization,

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\Bigl(1+\frac{\eta T\log(\frac{1}{\delta})}{n}\Bigr)\bigl(A_{\mathrm{van}}+\delta\sqrt{\log(\tfrac{n}{\delta})}\bigr)+\sqrt{\log(\tfrac{1}{\delta})}\,\frac{(\log(T)+\sqrt{\log(\frac{n}{\delta})})\delta}{\gamma}.$$

###### Proof\.

By Lemma [B.6](https://arxiv.org/html/2605.12648#A2.Thmtheorem6) and $|\ell^{\prime}(u)|\leq 1$ for the logistic loss, for every $\mathbf{W}\in\mathcal{K}$ and every $(\mathbf{x},y)\in\mathcal{Z}$,

$$\|\nabla_{\mathbf{W}}\ell(yf_{\mathbf{W}}(\mathbf{x}))\|_{2}=|\ell^{\prime}(yf_{\mathbf{W}}(\mathbf{x}))|\,\|\nabla f_{\mathbf{W}}(\mathbf{x})\|_{2}\leq b_{\delta_{0},m}.$$

Hence the loss is $b_{\delta_{0},m}$-Lipschitz with respect to $\mathbf{W}$. Applying Lemma [C.15](https://arxiv.org/html/2605.12648#A3.Thmtheorem15) together with Theorem [D.15](https://arxiv.org/html/2605.12648#A4.Thmtheorem15) yields

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{\eta\log(1/\delta)}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\sqrt{\log(\tfrac{1}{\delta})}\,\frac{(\log(T)+\sqrt{\log(n/\delta)})\delta}{\gamma}.\tag{23}$$

Now, we consider the population risk bound. Since replacing $\delta$ by a smaller auxiliary failure probability only changes logarithmic factors, we suppress this distinction below.

On the initialization event $\mathcal{E}_{\delta_{0}}$, for any $\mathbf{x}\in\mathcal{X}$, any $y\in\{-1,+1\}$, and any $\mathbf{W}\in\mathcal{K}=B(\mathbf{W}_{0},R_{*})$, it holds that

$$|f_{\mathbf{W}}(\mathbf{x})|\leq|f_{\mathbf{W}_{0}}(\mathbf{x})|+\|\nabla f_{\widetilde{\mathbf{W}}}(\mathbf{x})\|_{2}\,\|\mathbf{W}-\mathbf{W}_{0}\|_{2}\leq B_{b}\sqrt{p}\,\|c\|_{2}+b_{\delta_{0},m}R_{*}\lesssim\sqrt{m}+\log(\tfrac{n}{\delta})R_{*},$$

where we used $\|c\|_{2}\lesssim\log(1/\delta_{0})\sqrt{m}$ on $\mathcal{E}_{\delta_{0}}$, the bound $\|\nabla f_{\widetilde{\mathbf{W}}}(\mathbf{x})\|_{2}\leq b_{\delta_{0},m}$ from Lemma [B.6](https://arxiv.org/html/2605.12648#A2.Thmtheorem6), and $\|\mathbf{W}-\mathbf{W}_{0}\|_{2}\leq R_{*}$. Consequently,

$$\ell(yf_{\mathbf{W}}(\mathbf{x}))=\log\bigl(1+e^{-yf_{\mathbf{W}}(\mathbf{x})}\bigr)\leq\log 2+|f_{\mathbf{W}}(\mathbf{x})|\lesssim\sqrt{m}+\log(\tfrac{n}{\delta})R_{*}.$$

Since $\mathbf{W}_{t}\in\mathcal{K}$ for all $t$ by construction of Algorithm [1](https://arxiv.org/html/2605.12648#alg1), we obtain the uniform bound

$$\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\log(\tfrac{n}{\delta})(\sqrt{m}+R_{*})\qquad\forall t.$$
Now choose an auxiliary optimization failure probability $\delta_{\mathrm{opt}}^{\prime}=\frac{\delta}{1+\log(\tfrac{n}{\delta})(\sqrt{m}+R_{*})}$. By Theorem [D.14](https://arxiv.org/html/2605.12648#A4.Thmtheorem14), there exists an optimization good event $\mathcal{E}_{\mathrm{opt}}$ such that

$$\mathbb{P}(\mathcal{E}_{\mathrm{opt}}^{c})\leq\delta_{\mathrm{opt}}^{\prime}\qquad\text{and}\qquad\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\leq A_{\mathrm{van}}\quad\text{on }\mathcal{E}_{\mathrm{opt}}.$$

Therefore, conditioning on $\mathcal{E}_{\delta_{0}}$, we have

$$\begin{aligned}\mathbb{E}_{S,\mathcal{A}}\Big[\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\Big]&=\mathbb{E}_{S,\mathcal{A}}\Big[\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\mathbf{1}_{\mathcal{E}_{\mathrm{opt}}}\Big]+\mathbb{E}_{S,\mathcal{A}}\Big[\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\mathbf{1}_{\mathcal{E}_{\mathrm{opt}}^{c}}\Big]\\&\lesssim A_{\mathrm{van}}+\log(\tfrac{n}{\delta})(\sqrt{m}+R_{*})\,\mathbb{P}(\mathcal{E}_{\mathrm{opt}}^{c})\\&\lesssim A_{\mathrm{van}}+\delta\lesssim A_{\mathrm{van}}+\delta\sqrt{\log(\tfrac{n}{\delta})}.\end{aligned}$$
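The passage from the uniform loss bound to the $\delta$ term uses only the choice of $\delta_{\mathrm{opt}}^{\prime}$. As a worked step, write $M:=\log(\tfrac{n}{\delta})(\sqrt{m}+R_{*})$; this shorthand is introduced here for illustration and is not the paper's notation. The contribution of the bad event can then be spelled out as follows.

```latex
% M is shorthand (introduced here, not in the original) for the uniform bound on L_S(W_t).
\begin{align*}
\mathbb{E}_{S,\mathcal{A}}\Big[\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})
  \mathbf{1}_{\mathcal{E}_{\mathrm{opt}}^{c}}\Big]
&\lesssim M\,\mathbb{P}(\mathcal{E}_{\mathrm{opt}}^{c})
 \le M\,\delta_{\mathrm{opt}}^{\prime}
 = \frac{M}{1+M}\,\delta
 \le \delta .
\end{align*}
```

The denominator $1+M$ in $\delta_{\mathrm{opt}}^{\prime}$ is therefore exactly what cancels the potentially large uniform bound, at the cost of extra logarithmic factors inside $A_{\mathrm{van}}$.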
Combining the above observation with the generalization bound ([23](https://arxiv.org/html/2605.12648#A4.E23)) gives

$$\begin{aligned}\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}[\mathcal{L}(\mathbf{W}_{t})]\lesssim\;&A_{\mathrm{van}}+\delta\sqrt{\log(\tfrac{n}{\delta})}+\frac{\eta\log(\tfrac{1}{\delta})}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}[\mathcal{L}_{S}(\mathbf{W}_{t})]\\&+\sqrt{\log(\tfrac{1}{\delta})}\,\frac{(\log(T)+\sqrt{\log(\tfrac{n}{\delta})})\delta}{\gamma}.\end{aligned}$$

Substituting

$$\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}[\mathcal{L}_{S}(\mathbf{W}_{t})]\lesssim T\bigl(A_{\mathrm{van}}+\delta\sqrt{\log(\tfrac{n}{\delta})}\bigr)$$

yields the claimed bound. ∎

Finally, we give the proof of Corollary [5.5](https://arxiv.org/html/2605.12648#S5.Thmtheorem5).

###### Proof of Corollary [5.5](https://arxiv.org/html/2605.12648#S5.Thmtheorem5).

We first check the optimization bound in Theorem [D.14](https://arxiv.org/html/2605.12648#A4.Thmtheorem14). Under the stated choice of parameters,

$$\frac{1}{\gamma\sqrt{BT}}\asymp\frac{1}{\gamma\sqrt{\gamma\sqrt{n}\cdot\sqrt{n}/\gamma}}\asymp\frac{1}{\gamma\sqrt{n}}.$$

Moreover, $\frac{1}{\gamma^{2}\eta T}\asymp\frac{1}{\gamma\eta\sqrt{n}}$. If $\eta\asymp 1$, then $\frac{1}{\gamma^{2}\eta T}\asymp\frac{1}{\gamma\sqrt{n}}$. If $\eta\asymp\frac{\gamma^{2}\sqrt{n}\,\epsilon}{\sqrt{d}}$, then $\frac{1}{\gamma^{2}\eta T}\asymp\frac{\sqrt{d}}{\gamma^{3}n\epsilon}$. Hence, in all cases,

$$\frac{1}{\gamma^{2}\eta T}\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}.$$

For the term $\eta/B$, if $\eta\asymp 1$, then $\frac{\eta}{B}\asymp\frac{1}{\gamma\sqrt{n}}$. If $\eta\asymp\frac{\gamma^{2}\sqrt{n}\,\epsilon}{\sqrt{d}}$, then $\frac{\eta}{B}\asymp\frac{\gamma\epsilon}{\sqrt{d}}\lesssim\frac{\sqrt{d}}{\gamma^{3}n\epsilon}$, where the last inequality follows from the branch condition $\gamma^{2}\sqrt{n}\,\epsilon\leq\sqrt{d}$. Therefore, in all cases,

$$\frac{\eta}{B}\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}.$$
By Theorem [5.4](https://arxiv.org/html/2605.12648#S5.Thmtheorem4), $\kappa^{2}\asymp\frac{B^{2}T}{n^{2}\epsilon^{2}}$. Since $B\asymp\gamma\sqrt{n}$ and $T\asymp\sqrt{n}/\gamma$, this gives $\kappa^{2}\lesssim\frac{\gamma}{\sqrt{n}\,\epsilon^{2}}$, and $\frac{\kappa}{\gamma B\sqrt{T}}\asymp\frac{1}{\gamma n\epsilon}\lesssim\frac{\sqrt{d}}{\gamma^{3}n\epsilon}$.

For the quadratic noise term, using $\kappa^{2}/B^{2}\asymp T/(n^{2}\epsilon^{2})$, we have $\frac{\eta md\kappa^{2}}{B^{2}}\asymp\frac{\eta mdT}{n^{2}\epsilon^{2}}$. If $\eta\asymp 1$, then $\frac{\eta md\kappa^{2}}{B^{2}}\asymp\frac{md}{\gamma n^{3/2}\epsilon^{2}}\lesssim\frac{1}{\gamma\sqrt{n}}$, where we used the branch condition $\epsilon\geq\sqrt{d}/(\gamma^{2}\sqrt{n})$ and $m\asymp\mathrm{polylog}(n/\delta)/\gamma^{4}$. If $\eta\asymp\frac{\gamma^{2}\sqrt{n}\,\epsilon}{\sqrt{d}}$, then $\frac{\eta md\kappa^{2}}{B^{2}}\asymp\frac{m\gamma\sqrt{d}}{n\epsilon}\lesssim\frac{\sqrt{d}}{\gamma^{3}n\epsilon}$. Hence

$$\frac{\eta md\kappa^{2}}{B^{2}}\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}.$$

Combining the above estimates, Theorem [D.14](https://arxiv.org/html/2605.12648#A4.Thmtheorem14) yields

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}$$

with probability at least $1-\delta$.
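The case analysis above can be checked numerically. The sketch below evaluates the individual terms under the parameter choices used in the proof, $B\asymp\gamma\sqrt{n}$, $T\asymp\sqrt{n}/\gamma$, and $\kappa^{2}\asymp B^{2}T/(n^{2}\epsilon^{2})$, with every hidden constant and logarithmic factor set to $1$ (an assumption). The step size is taken as $\eta=\min\{1,\gamma^{2}\sqrt{n}\,\epsilon/\sqrt{d}\}$, which is inferred from the two branches of the proof. The function name and dictionary keys are hypothetical.

```python
import math


def corollary_rates(n: int, gamma: float, eps: float, d: int) -> dict:
    """Terms checked in the proof, with all hidden constants set to 1 (assumption)."""
    B = gamma * math.sqrt(n)              # B ≍ gamma * sqrt(n)
    T = math.sqrt(n) / gamma              # T ≍ sqrt(n) / gamma
    # Step size inferred from the two branches: eta ≍ 1 or eta ≍ gamma^2 sqrt(n) eps / sqrt(d).
    eta = min(1.0, gamma ** 2 * math.sqrt(n) * eps / math.sqrt(d))
    kappa = B * math.sqrt(T) / (n * eps)  # kappa^2 ≍ B^2 T / (n^2 eps^2)
    return {
        "eta": eta,
        "sampling": 1.0 / (gamma * math.sqrt(B * T)),
        "initial": 1.0 / (gamma ** 2 * eta * T),
        "eta_over_B": eta / B,
        "noise_linear": kappa / (gamma * B * math.sqrt(T)),
        "target": 1.0 / (gamma * math.sqrt(n)) + math.sqrt(d) / (gamma ** 3 * n * eps),
    }


if __name__ == "__main__":
    # One configuration per branch; the numbers are illustrative only.
    for eps in (1.0, 0.01):
        rates = corollary_rates(n=10_000, gamma=0.5, eps=eps, d=4)
        print(f"eps={eps}: " + ", ".join(f"{k}={v:.4g}" for k, v in rates.items()))
```

In both branches each listed term is at most the target rate $\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}$. The quadratic noise term is omitted here because its bound additionally relies on the width scaling $m\asymp\mathrm{polylog}(n/\delta)/\gamma^{4}$.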

Now we show the generalization and population risk rates. By ([23](https://arxiv.org/html/2605.12648#A4.E23)) and the above optimization bound,

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{\eta\log(1/\delta)}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\frac{\delta}{\gamma},$$

where we suppress logarithmic factors in the first term. Therefore,

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{\eta T}{n}\Big(\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}\Big)+\frac{\delta}{\gamma}.$$

Since $\frac{\eta T}{n}\asymp\min\bigl\{\frac{1}{\gamma\sqrt{n}},\,\frac{\gamma\epsilon}{\sqrt{d}}\bigr\},$ we have

$$\frac{\eta T}{n}\cdot\frac{1}{\gamma\sqrt{n}}\lesssim\frac{1}{\gamma^{2}n}\qquad\text{and}\qquad\frac{\eta T}{n}\cdot\frac{\sqrt{d}}{\gamma^{3}n\epsilon}\lesssim\frac{1}{\gamma^{2}n}.$$

Because $\delta\lesssim\gamma/n$, the residual term $\delta/\gamma$ is also of order at most $1/n$, and hence

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{1}{\gamma^{2}n}.$$
Further, Theorem [D.16](https://arxiv.org/html/2605.12648#A4.Thmtheorem16) implies

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\Bigl(1+\frac{\eta T}{n}\Bigr)\bigl(A_{\mathrm{van}}+\delta\bigr)+\frac{\delta}{\gamma},$$

where logarithmic factors are suppressed. Since $\eta T/n\lesssim 1$, $\delta\lesssim\gamma/n$, and $A_{\mathrm{van}}\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}$, we obtain

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}.$$
Finally, we verify that the width conditions in Theorems [D.14](https://arxiv.org/html/2605.12648#A4.Thmtheorem14) and [D.16](https://arxiv.org/html/2605.12648#A4.Thmtheorem16) are satisfied. Since $m\asymp\frac{\mathrm{polylog}(n/\delta)}{\gamma^{4}}$, it suffices to check that $\eta^{2}\Bigl(T^{2}A_{\mathrm{van}}^{2}+\frac{T}{B}+1\Bigr)\lesssim\frac{1}{\gamma^{4}}$. Since $T/B\asymp 1/\gamma^{2}$, we have $\eta^{2}(T/B+1)\lesssim 1/\gamma^{4}$. Also, $A_{\mathrm{van}}\lesssim\frac{1}{\gamma\sqrt{n}}+\frac{\sqrt{d}}{\gamma^{3}n\epsilon}$, and

$$\frac{\eta T}{\gamma\sqrt{n}}\lesssim\frac{1}{\gamma^{2}},\qquad\frac{\eta T\sqrt{d}}{\gamma^{3}n\epsilon}\lesssim\frac{1}{\gamma^{2}}$$

in both cases $\eta\asymp 1$ and $\eta\asymp\frac{\gamma^{2}\sqrt{n}\,\epsilon}{\sqrt{d}}$. Therefore $(\eta T)^{2}A_{\mathrm{van}}^{2}\lesssim\frac{1}{\gamma^{4}}$, and Theorems [D.14](https://arxiv.org/html/2605.12648#A4.Thmtheorem14) and [D.16](https://arxiv.org/html/2605.12648#A4.Thmtheorem16) apply. This completes the proof.

∎

#### Recovery of Full\-batch DP\-GD\.

When $B=n$, the mini-batch gradient coincides with the full empirical gradient, so $\Delta_{t}=0$ for all $t$ (see ([20](https://arxiv.org/html/2605.12648#A4.E20))). The sampling fluctuation term disappears and the recursion reduces to full-batch DP-GD. We therefore specialize the bounds in Theorems [D.14](https://arxiv.org/html/2605.12648#A4.Thmtheorem14) and [D.16](https://arxiv.org/html/2605.12648#A4.Thmtheorem16) to this regime.

Assume $m\asymp\frac{\mathrm{polylog}(n/\delta)}{\gamma^{4}}$ and $\eta\asymp\min\left\{1,\frac{\epsilon}{\gamma\sqrt{md}}\right\}$. By Theorem [5.4](https://arxiv.org/html/2605.12648#S5.Thmtheorem4), when $B=n$, $\kappa^{2}\asymp\frac{B^{2}T}{n^{2}\epsilon^{2}}=\frac{T}{\epsilon^{2}}\asymp\frac{n}{\epsilon^{2}}$ up to logarithmic factors, and hence $\kappa\asymp\sqrt{n}/\epsilon$.

We plug these parameter choices back into the optimization bound given in Theorem [D.14](https://arxiv.org/html/2605.12648#A4.Thmtheorem14) and note that $\frac{1}{\gamma n}\lesssim\frac{1}{\gamma^{2}n}$, $\frac{\eta}{n}\lesssim\frac{1}{n}\leq\frac{1}{\gamma^{2}n}$, and $\frac{\kappa}{\gamma n\sqrt{n}}\asymp\frac{1}{\gamma n\epsilon}\lesssim\frac{\sqrt{md}}{\gamma n\epsilon}$. Further, it holds that $\frac{1}{\gamma^{2}\eta n}\lesssim\frac{1}{\gamma^{2}n}+\frac{\sqrt{md}}{\gamma n\epsilon}$ and

$$\frac{\eta md\kappa^{2}}{n^{2}}\lesssim\frac{1}{\gamma^{2}n}+\frac{\sqrt{md}}{\gamma n\epsilon}.$$

Combining the above estimates yields

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{\sqrt{d}}{\gamma^{3}n\epsilon}$$

with probability at least $1-\delta$.

For the generalization risk, from Theorem [D.16](https://arxiv.org/html/2605.12648#A4.Thmtheorem16) we have

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{\eta}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\frac{\delta}{\gamma}.$$

Plugging in the choices of parameters and noting that, since $\delta\lesssim\gamma/n$, the residual term $\delta/\gamma$ is of the same or smaller order, we obtain

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{1}{\gamma^{2}n}.$$
Furthermore, Theorem [D.16](https://arxiv.org/html/2605.12648#A4.Thmtheorem16) shows

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\Bigl(1+\frac{\eta T}{n}\Bigr)\bigl(A_{\mathrm{van}}+\delta\bigr)+\frac{\delta}{\gamma}.$$

Substituting $\eta$, $T$ and $m\asymp\mathrm{polylog}(n/\delta)/\gamma^{4}$ gives

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\frac{\sqrt{d}}{\gamma^{3}n\epsilon}.$$

One can verify that the width condition in Theorem [D.16](https://arxiv.org/html/2605.12648#A4.Thmtheorem16) holds. This completes the proof.

## Appendix E Proofs for Mini-batch SGD

We note that in the full-batch case $B=n$, privacy amplification is not needed, and the privacy guarantee follows directly from [[60](https://arxiv.org/html/2605.12648#bib.bib19)]. Below, we present risk guarantees for mini-batch SGD.

###### Theorem E.1 (Optimization, generalization and population risk bounds).

Let $\delta\in(0,1)$. Suppose Assumptions [3.2](https://arxiv.org/html/2605.12648#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2605.12648#S3.Thmtheorem3) hold. Let $\{\mathbf{W}_{t}\}_{t=0}^{T}$ be generated by mini-batch SGD with $\eta\leq 1/(12C_{\sigma,b}p^{3})$ and $C_{\mathrm{clip}}\geq G_{\delta}$. Assume

$$m\gtrsim\frac{\log(\frac{m}{\delta})\bigl(\log^{4}(T)+\log^{2}(\frac{n}{\delta})\bigr)}{\gamma^{4}}.$$

Then, with probability at least $1-\delta$ over the randomness of the initialization and the algorithmic randomness, it holds that

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{1}{\gamma^{2}\eta T}+\frac{1}{\gamma\sqrt{BT}}+\frac{\eta}{B}=:A_{\mathrm{non}}.$$

Further assume

$$m\gtrsim\log\bigl(\tfrac{m}{\delta}\bigr)\,\eta^{2}\Bigl(T^{2}A^{2}_{\mathrm{non}}+\log\bigl(\tfrac{n}{\delta}\bigr)\log^{2}\bigl(\tfrac{1}{\delta}\bigr)\bigl(\tfrac{T}{B}+1\bigr)\Bigr).$$

Then, with probability at least $1-\delta$ over the initialization,

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{\eta}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\frac{\delta}{\gamma}$$

and

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\Bigl(1+\frac{\eta T}{n}\Bigr)\bigl(A_{\mathrm{non}}+\delta\bigr)+\frac{\delta}{\gamma}.$$

###### Proof\.

The non-private mini-batch SGD bounds are obtained as direct special cases of the standard DP-SGD analysis by setting the noise level $\kappa=0$. All privacy-related terms then disappear, and the corresponding optimization, generalization, and excess risk bounds follow immediately. ∎

The proof of Corollary[5\.6](https://arxiv.org/html/2605.12648#S5.Thmtheorem6)is given as follows\.

###### Proof of Corollary[5\.6](https://arxiv.org/html/2605.12648#S5.Thmtheorem6)\.

Set $B\lesssim\gamma\sqrt{n}$, $\eta\asymp\frac{B}{n}$, and $T\gtrsim\frac{n^{2}}{\gamma^{2}B}$. Since $m\gtrsim\mathrm{polylog}(n/\delta)/\gamma^{4}$, the width condition in Theorem [E.1](https://arxiv.org/html/2605.12648#A5.Thmtheorem1) is satisfied. By Theorem [E.1](https://arxiv.org/html/2605.12648#A5.Thmtheorem1), we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{1}{\gamma^{2}\eta T}+\frac{1}{\gamma\sqrt{BT}}+\frac{\eta}{B}$$

with probability at least $1-\delta$ over the initialization and the algorithmic randomness. Now $\frac{\eta}{B}\asymp\frac{1}{n}$. Since $T\gtrsim\frac{n^{2}}{\gamma^{2}B}$, we have $\frac{1}{\gamma^{2}\eta T}\asymp\frac{n}{\gamma^{2}BT}\lesssim\frac{1}{n}$ and $\frac{1}{\gamma\sqrt{BT}}\lesssim\frac{1}{\gamma\sqrt{B\cdot n^{2}/(\gamma^{2}B)}}=\frac{1}{n}$. Hence

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{1}{n}$$

with probability at least $1-\delta$.

Further setting $T\asymp\frac{n^{2}}{\gamma^{2}B}$, it holds with probability at least $1-\delta$ over the initialization that

$$\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]\lesssim T\cdot\frac{1}{n}.$$

According to Theorem [E.1](https://arxiv.org/html/2605.12648#A5.Thmtheorem1),

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{\eta}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\frac{\delta}{\gamma}.$$

Using the above optimization estimate, $\eta T\asymp\frac{B}{n}\cdot\frac{n^{2}}{\gamma^{2}B}=\frac{n}{\gamma^{2}}$ and $\delta\leq(\gamma n)^{-1}$, it follows that

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{1}{\gamma^{2}n}.$$

Finally, applying Theorem [E.1](https://arxiv.org/html/2605.12648#A5.Thmtheorem1) again and using $\eta T/n\asymp 1/\gamma^{2}$ and $A_{\mathrm{non}}\lesssim 1/(\gamma^{2}n)$, we obtain

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\frac{1}{\gamma^{2}n}.$$

This completes the proof. ∎
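As a quick sanity check of the parameter choices in this proof, the following sketch evaluates the three terms of $A_{\mathrm{non}}$ numerically. It assumes all hidden constants in $\lesssim$ and $\asymp$ equal 1, which the paper does not claim; the function name `a_non_terms` and the sample values of `n` and `gamma` are illustrative only.

```python
import math

def a_non_terms(n, gamma, B):
    """Evaluate the three terms of A_non for eta = B/n and T = n^2/(gamma^2 B).

    All hidden constants are set to 1 (an assumption made for illustration).
    """
    eta = B / n
    T = math.ceil(n**2 / (gamma**2 * B))
    opt_term = 1.0 / (gamma**2 * eta * T)        # 1 / (gamma^2 eta T)
    samp_term = 1.0 / (gamma * math.sqrt(B * T))  # 1 / (gamma sqrt(B T))
    step_term = eta / B                           # eta / B
    return eta, T, (opt_term, samp_term, step_term)

n, gamma = 10_000, 0.5
B = int(gamma * math.sqrt(n))  # B <= gamma * sqrt(n)
eta, T, terms = a_non_terms(n, gamma, B)

for name, value in zip(("1/(g^2 eta T)", "1/(g sqrt(BT))", "eta/B"), terms):
    print(f"{name:>16}: {value:.3e}   (1/n = {1 / n:.3e})")
print("eta * T / n =", eta * T / n, " vs 1/gamma^2 =", 1 / gamma**2)
```

Under these unit constants every term is at most $1/n$ and $\eta T/n$ matches $1/\gamma^{2}$, in line with the orders used above.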

#### Recovery of full\-batch GD\.

Take $m\gtrsim\frac{\mathrm{polylog}(n/\delta)}{\gamma^{4}}$, $B=n$ and $\eta\asymp 1$. Then the mini-batch gradient coincides with the full empirical gradient, so $\Delta_{t}=0$ (see ([20](https://arxiv.org/html/2605.12648#A4.E20))), the sampling fluctuation terms disappear, and the recursion reduces to full-batch GD. Therefore

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{1}{\gamma^{2}T}$$

with probability at least $1-\delta$ over the initialization and the algorithmic randomness.

Now further set $T\asymp\frac{n}{\gamma^{2}}$. Then $\frac{1}{T}\sum_{t=0}^{T-1}\mathcal{L}_{S}(\mathbf{W}_{t})\lesssim\frac{1}{n}$. By Theorem [E.1](https://arxiv.org/html/2605.12648#A5.Thmtheorem1),

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{\eta}{n}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]+\frac{\delta}{\gamma}.$$

Since $\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}_{S}(\mathbf{W}_{t})\bigr]\lesssim T\cdot\frac{1}{n}\asymp\frac{1}{\gamma^{2}}$, $\delta\leq(\gamma n)^{-1}$ and $\eta\asymp 1$, we obtain

$$\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{T})-\mathcal{L}_{S}(\mathbf{W}_{T})\bigr]\lesssim\frac{1}{\gamma^{2}n}.$$

Applying the excess risk bound in Theorem [E.1](https://arxiv.org/html/2605.12648#A5.Thmtheorem1) once more yields

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}_{S,\mathcal{A}}\bigl[\mathcal{L}(\mathbf{W}_{t})\bigr]\lesssim\frac{1}{\gamma^{2}n}.$$

This completes the proof.

## Appendix F Proof of Theorem 2

In the following, we prove Theorem [5.1](https://arxiv.org/html/2605.12648#S5.Thmtheorem1) in two steps. First, we conduct a general privacy analysis of Algorithm [1](https://arxiv.org/html/2605.12648#alg1) in terms of *dominating pairs*, which we introduce shortly. Then, we use this general analysis to derive an analytical upper bound on the required noise multiplier $\kappa^{2}$.

### F.1 Step 0: Preliminaries

Algorithm [1](https://arxiv.org/html/2605.12648#alg1) is an instance of a subsampled correlated noise mechanism ("matrix mechanism"). We can express its outcome on a specific dataset $S$ more compactly as the random variable

$$\mathbf{X}+\mathbf{C}^{-1}\mathbf{Z}\quad\text{with }\mathbf{X}=\sum_{i=1}^{n}\mathbf{G}^{(i)}\text{ and }Z_{t,d}\sim\mathcal{N}(0,\kappa^{2}).$$

Here, $\mathbf{G}^{(i)}\in\mathbb{R}^{T\times\mathrm{mdp}}$ are the gradient contributions of the $i$-th record in the dataset to each training iteration, with $\|\mathbf{G}^{(i)}_{t,:}\|_{2}\leq C_{\mathrm{clip}}$ if the $i$-th record is included in the $t$-th batch and $\mathbf{G}^{(i)}_{t,:}=0$ otherwise. Note that $\mathbf{G}^{(i)}_{t,:}$ can be adaptively chosen based on earlier outcomes. Meanwhile, $\mathbf{C}^{-1}\in\mathbb{R}^{T\times T}$ is a *correlation matrix* with

$$C^{-1}_{t,u}=\begin{cases}1&\text{if }u=t,\\ -\lambda&\text{if }u=t-1,\\ 0&\text{otherwise,}\end{cases}\tag{24}$$

and inverse

$$C_{t,u}=\begin{cases}1&\text{if }u=t,\\ \lambda^{t-u}&\text{if }u<t,\\ 0&\text{otherwise.}\end{cases}\tag{25}$$

Analogously, we can express the outcome given a neighboring dataset $S^{\prime}$ as the random variable $\mathbf{X}^{\prime}+\mathbf{C}^{-1}\mathbf{Z}$ with $\mathbf{X}^{\prime}=\sum_{i^{\prime}=1}^{n^{\prime}}\mathbf{G}^{(i^{\prime})}$, which differs in the gradient contributions of a single record.
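The relation between (24) and (25) can be checked numerically. The sketch below builds both matrices with NumPy for a small $T$ and confirms that they are inverses of each other; the helper name `correlation_matrices` and the values `T=6`, `lam=0.7` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def correlation_matrices(T, lam):
    """Return (C_inv, C) following Eqs. (24) and (25)."""
    # Eq. (24): ones on the diagonal, -lam on the first subdiagonal.
    C_inv = np.eye(T) - lam * np.eye(T, k=-1)
    # Eq. (25): C[t, u] = lam^(t - u) for u <= t, zero above the diagonal.
    idx = np.arange(T)
    diff = idx[:, None] - idx[None, :]
    C = np.where(diff >= 0, lam ** np.maximum(diff, 0), 0.0)
    return C_inv, C

C_inv, C = correlation_matrices(T=6, lam=0.7)
print(np.allclose(C_inv @ C, np.eye(6)))
```

Multiplying by $\mathbf{C}$ therefore turns the correlated noise $\mathbf{C}^{-1}\mathbf{Z}$ back into independent noise $\mathbf{Z}$, which is the post-processing step used later in this appendix.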

To proceed with the privacy analysis, we formalize the phrase "differ in the contribution of one data record" used in Definition [3.1](https://arxiv.org/html/2605.12648#S3.Thmtheorem1). DP-SGD with Poisson subsampling, where each record is independently included with some fixed rate $r\in[0,1]$, is typically analyzed under the insertion/removal relation, where $S^{\prime}=S\cup\{(x,y)\}$ or $S^{\prime}=S\setminus\{(x,y)\}$ for some record $(x,y)$. In contrast, Algorithm [1](https://arxiv.org/html/2605.12648#alg1) uses mini-batches of fixed size $B$, sampled uniformly without replacement. For this fixed-size subsampling scheme, we use the standard zero-out relation (see \[[3](https://arxiv.org/html/2605.12648#bib.bib751)\] for an overview of papers using this relation).

###### Definition F.1 (Zero-out relation).

Datasets $S$ and $S^{\prime}$ of equal size $n$ are neighboring under the zero-out relation if there exists a pair of records $(x_{i},y_{i})$ and $(x^{\prime}_{i},y^{\prime}_{i})$ such that

$$S^{\prime}=S\setminus\{(x_{i},y_{i})\}\cup\{(x^{\prime}_{i},y^{\prime}_{i})\},$$

and the gradient contribution for one of the records is zero, i.e., $\mathbf{G}^{(i)}=0$ or $\mathbf{G}^{\prime(i)}=0$.

To simplify our analysis in the next section, we further subdivide this relation into two asymmetric parts:

###### Definition F.2 (Zero-out-remove relation).

Datasets $S$ and $S^{\prime}$ of equal size $n$ are neighboring under the zero-out-remove relation if there exists a pair of records $(x_{i},y_{i})$ and $(x^{\prime}_{i},y^{\prime}_{i})$ such that

$$S^{\prime}=S\setminus\{(x_{i},y_{i})\}\cup\{(x^{\prime}_{i},y^{\prime}_{i})\},$$

and the gradient contribution for the substituted record in $S^{\prime}$ is zero, i.e., $\mathbf{G}^{\prime(i)}=0$.

###### Definition F.3 (Zero-out-add relation).

Datasets $S$ and $S^{\prime}$ of equal size $n$ are neighboring under the zero-out-add relation if there exists a pair of records $(x_{i},y_{i})$ and $(x^{\prime}_{i},y^{\prime}_{i})$ such that

$$S^{\prime}=S\setminus\{(x_{i},y_{i})\}\cup\{(x^{\prime}_{i},y^{\prime}_{i})\},$$

and the gradient contribution for the substituted record in $S$ is zero, i.e., $\mathbf{G}^{(i)}=0$.

In the following, we use $S\simeq S^{\prime}$ as shorthand for datasets $S$ and $S^{\prime}$ being neighboring in general, and $S\simeq_{0}S^{\prime}$, $S\simeq_{0,r}S^{\prime}$, $S\simeq_{0,a}S^{\prime}$ for being neighboring under the zero-out relation, the zero-out-remove relation, and the zero-out-add relation, respectively.

Finally, let us introduce the hockey-stick divergence and dominating pairs, which give an alternative characterization of the privacy profile $\delta(\epsilon)$ \[[67](https://arxiv.org/html/2605.12648#bib.bib724)\]. For this, we abuse notation and write $\mathcal{A}(\cdot\mid S)$ for the distribution of the random variable $\mathcal{A}(S)$, i.e., the distribution of the outcome of the randomized algorithm $\mathcal{A}$ applied to dataset $S$.

###### Definition F.4 (Hockey-stick divergence).

The hockey-stick divergence of order $\alpha\geq 0$ between two distributions $P,Q$ on a measurable space $(\Omega,\mathcal{F})$ is

$$H_{\alpha}(P\,\|\,Q)=\sup_{E\in\mathcal{F}}\bigl(P(E)-\alpha Q(E)\bigr).$$
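For intuition, the definition can be evaluated exactly when $P$ and $Q$ are discrete: the supremum is attained by the event $E=\{\omega: P(\omega)>\alpha Q(\omega)\}$, so the divergence equals $\sum_{\omega}\max\{P(\omega)-\alpha Q(\omega),0\}$. The following minimal sketch uses that identity; the function name `hockey_stick` and the toy distributions are illustrative and not part of the paper.

```python
import numpy as np

def hockey_stick(p, q, alpha):
    """Hockey-stick divergence H_alpha(P || Q) for discrete pmfs p and q.

    The supremum over events is attained on {p > alpha * q}, so it suffices
    to sum the positive part of p - alpha * q.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.maximum(p - alpha * q, 0.0).sum())

p = [0.5, 0.5]
q = [0.9, 0.1]
print(hockey_stick(p, q, alpha=1.0))  # alpha = 1 gives the total variation distance
print(hockey_stick(p, q, alpha=2.0))
```

For $\alpha=1$ the value coincides with the total variation distance, and the divergence is non-increasing in $\alpha$.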

###### Definition F.5 (Dominating pairs).

Consider a randomized algorithm $\mathcal{A}$ and two distributions $P,Q$ defined on an arbitrary measurable space. If

$$H_{\alpha}\bigl(\mathcal{A}(\cdot\mid S)\,\|\,\mathcal{A}(\cdot\mid S^{\prime})\bigr)\leq H_{\alpha}(P\,\|\,Q)$$

for all neighboring datasets $S\simeq S^{\prime}$ and all $\alpha\geq 0$, then $(P,Q)$ is a dominating pair of $\mathcal{A}$.

Note that when $(P,Q)$ is a dominating pair of $\mathcal{A}$, then $\mathcal{A}$ is $(\epsilon,\delta)$-DP with $\delta=H_{e^{\epsilon}}(P\,\|\,Q)$ \[[5](https://arxiv.org/html/2605.12648#bib.bib733)\].

### F.2 Step 1: Dominating pairs for arbitrary correlation matrices

In this section, we determine a dominating pair for arbitrary correlation matrices $\mathbf{C}^{-1}$ under uniform subsampling without replacement.

To begin, we cut our work in half by showing that analyzing the zero-out-remove relation also yields a dominating pair for the zero-out-add relation. The proof is identical to that of Lemma 28 from \[[67](https://arxiv.org/html/2605.12648#bib.bib724)\].

###### Lemma F.6 (\[[67](https://arxiv.org/html/2605.12648#bib.bib724)\]).

The following two statements are equivalent:

1. $(P,Q)$ dominates algorithm $\mathcal{A}$ for zero-out-add neighbors.
2. $(Q,P)$ dominates algorithm $\mathcal{A}$ for zero-out-remove neighbors.

###### Proof\.

Let $T[P,Q]:[0,1]\rightarrow[0,1]$ be the trade-off function of distributions $P$ and $Q$. For our purposes, it is sufficient to know that $T[P,Q]$ and $T[Q,P]$ are inverse functions of each other and that $H_{\alpha}(P\,\|\,Q)=1+T[P,Q]^{*}(-\alpha)$, where $f^{*}$ is the convex conjugate of a function $f$.

Consider two datasets $S\simeq_{0,a}S^{\prime}$. Then

$$\begin{aligned}\text{condition 1}&\iff H_{\alpha}(\mathcal{A}(S)\,\|\,\mathcal{A}(S^{\prime}))\leqslant H_{\alpha}(P\,\|\,Q)\\&\iff T[\mathcal{A}(S^{\prime}),\mathcal{A}(S)]\geqslant T[Q,P]\\&\iff T[\mathcal{A}(S),\mathcal{A}(S^{\prime})]\geqslant T[P,Q]\\&\iff H_{\alpha}(\mathcal{A}(S^{\prime})\,\|\,\mathcal{A}(S))\leqslant H_{\alpha}(Q\,\|\,P)\iff\text{condition 2}.\end{aligned}$$

∎

We further need Lemma 4.5 of \[[14](https://arxiv.org/html/2605.12648#bib.bib721)\], restated here as in Lemma 3.3 of \[[12](https://arxiv.org/html/2605.12648#bib.bib720)\]:

###### Lemma F.7 (\[[12](https://arxiv.org/html/2605.12648#bib.bib720)\]).

Let $\mathbf{c}_{1},\ldots,\mathbf{c}_{k}\in\mathbb{R}^{n\times p}$. Let $\mathbf{c}_{1}^{\prime},\ldots,\mathbf{c}_{k}^{\prime}\in\mathbb{R}^{n}$ be such that $\|\mathbf{c}_{i}[j,:]\|_{2}\leq\mathbf{c}_{i}^{\prime}(j)$ for all $i,j$. Then, letting

$$P=N(0,\sigma^{2}\mathbb{I}_{(n\times p)\times(n\times p)}),\qquad Q=\sum_{i}p_{i}N(\mathbf{c}_{i},\sigma^{2}\mathbb{I}_{(n\times p)\times(n\times p)}),$$

$$P^{\prime}=N(0,\sigma^{2}\mathbb{I}_{n\times n}),\qquad Q^{\prime}=\sum_{i}p_{i}N(\mathbf{c}_{i}^{\prime},\sigma^{2}\mathbb{I}_{n\times n}),$$

for all $\alpha$ we have $H_{\alpha}(P,Q)\leq H_{\alpha}(P^{\prime},Q^{\prime})$. Furthermore, this holds even if the $j$th row of each $\mathbf{c}_{i}$ is chosen as a function of the first $j-1$ rows of $P,Q$ (subject to $\|\mathbf{c}_{i}[j,:]\|_{2}\leq\mathbf{c}_{i}^{\prime}(j)$) while the $\mathbf{c}_{i}^{\prime}$ remain fixed.

Using these results, we can now proceed to our main result. Note that this bound is qualitatively very similar to existing bounds for matrix mechanisms under Poisson subsampling \[[14](https://arxiv.org/html/2605.12648#bib.bib721)\] and balls-and-bins sampling \[[12](https://arxiv.org/html/2605.12648#bib.bib720)\], which both rely on the same proof strategy. In the following, we apply the proof strategy for Lemma 3.3 from \[[12](https://arxiv.org/html/2605.12648#bib.bib720)\] almost verbatim.

###### Theorem F.8.

Given the number of iterations $T$, batch size $B$ and dataset size $n$, define the subsampling rate $r=\frac{B}{n}$. Let $|\mathbf{C}|$ be the elementwise absolute values of $\mathbf{C}$. Further define the multivariate Gaussian mixture

$$P=\sum_{\mathbf{y}\in\{0,1\}^{T}}\mathcal{N}\bigl(\mathbf{C}\mathbf{y},\kappa^{2}/C_{\mathrm{clip}}^{2}\,\mathbf{I}\bigr)\cdot s(\mathbf{y})$$

with subsampling probabilities $s(\mathbf{y})=r^{\|\mathbf{y}\|_{0}}(1-r)^{T-\|\mathbf{y}\|_{0}}$, and the zero-mean Gaussian $Q=\mathcal{N}\bigl(\mathbf{0},\kappa^{2}/C_{\mathrm{clip}}^{2}\,\mathbf{I}\bigr)$. Then our algorithm $\mathcal{A}$ is dominated by $(P,Q)$ under the zero-out-remove relation and by $(Q,P)$ under the zero-out-add relation.

###### Proof\.

Consider datasets $S\simeq_{0,r}S^{\prime}$ and assume w.l.o.g. that the $n$th sample is zeroed out, i.e., $\mathcal{A}(S)=\mathbf{X}+\mathbf{C}^{-1}\mathbf{Z}$ with $\mathbf{X}=\sum_{i=1}^{n}\mathbf{G}^{(i)}$ and $\mathcal{A}(S^{\prime})=\mathbf{X}^{\prime}+\mathbf{C}^{-1}\mathbf{Z}$ with $\mathbf{X}^{\prime}=\sum_{i=1}^{n-1}\mathbf{G}^{(i)}$. Further define $\tilde{\mathcal{A}}(S)=\mathbf{C}(\mathbf{X}+\mathbf{C}^{-1}\mathbf{Z})=\mathbf{C}\mathbf{X}+\mathbf{Z}$ and $\tilde{\mathcal{A}}(S^{\prime})=\mathbf{C}\mathbf{X}^{\prime}+\mathbf{Z}$. Since $\mathbf{C}$ is invertible, the post-processing property applies in both directions, so $\tilde{\mathcal{A}}$ is exactly as private as $\mathcal{A}$, i.e., the two mechanisms dominate each other.

Furthermore, "by post-processing, we can assume that we release the contributions to the input matrix of all examples except the differing user's" \[[12](https://arxiv.org/html/2605.12648#bib.bib720)\]. Since these contributions are shared between $\tilde{\mathcal{A}}(S)$ and $\tilde{\mathcal{A}}(S^{\prime})$, distinguishing $\mathbf{C}\mathbf{X}+\mathbf{Z}$ and $\mathbf{C}\mathbf{X}^{\prime}+\mathbf{Z}$ is equivalent to distinguishing $\mathbf{C}(\mathbf{X}-\mathbf{X}^{\prime})+\mathbf{Z}$ and $\mathbf{Z}$. By definition of $\mathbf{X}$ and $\mathbf{X}^{\prime}$, this is equivalent to distinguishing $\mathbf{C}\mathbf{G}^{(n)}+\mathbf{Z}$ and $\mathbf{Z}$. We can therefore assume that the gradient contributions for all records except the $n$th one are zero. The outcome of our mechanism thus has the same form as $P$ and $Q$ in Lemma [F.7](https://arxiv.org/html/2605.12648#A6.Thmtheorem7), i.e., a mixture of matrix-valued isotropic Gaussians with adaptively chosen rows.

Finally, we can apply Lemma [F.7](https://arxiv.org/html/2605.12648#A6.Thmtheorem7) to bound its privacy via a mixture of vector-valued isotropic Gaussians with constant rows. Due to the clipping constant $C_{\mathrm{clip}}$, the $t$-th row of $\mathbf{G}^{(n)}\in\mathbb{R}^{T\times\mathrm{mdp}}$ has norm $0$ if the record does not contribute and norm at most $C_{\mathrm{clip}}$ if it contributes. Thus, given an indicator vector $\mathbf{y}\in\{0,1\}^{T}$ with $y_{u}=1$ if the record participates in step $u$, it follows from the triangle inequality that $\|(\mathbf{C}\mathbf{G}^{(n)})_{t,:}\|_{2}\leq\sum_{u=1}^{T}|\mathbf{C}|_{t,u}C_{\mathrm{clip}}y_{u}$. The result then follows from Lemma [F.7](https://arxiv.org/html/2605.12648#A6.Thmtheorem7) after dividing both the sensitivities and the standard deviation by the clipping constant $C_{\mathrm{clip}}$. ∎

### F.3 Step 2: Analytical noise multiplier bound

In this final section, we use the dominating pair from Theorem [F.8](https://arxiv.org/html/2605.12648#A6.Thmtheorem8), i.e.,

$$P=\sum_{\mathbf{y}\in\{0,1\}^{T}}\mathcal{N}\bigl(\mathbf{C}\mathbf{y},\kappa^{2}/C_{\mathrm{clip}}^{2}\,\mathbf{I}\bigr)\cdot s(\mathbf{y}),\qquad Q=\mathcal{N}\bigl(\mathbf{0},\kappa^{2}/C_{\mathrm{clip}}^{2}\,\mathbf{I}\bigr),$$

to derive an analytical upper bound on the noise multiplier $\kappa$ required to attain a desired privacy parameter $\delta$ given $\epsilon$, i.e., $\max\{H_{e^{\epsilon}}(P\,\|\,Q),H_{e^{\epsilon}}(Q\,\|\,P)\}\leq\delta$. Since we are only interested in asymptotics, we assume a constant clipping norm $C_{\mathrm{clip}}=1$, which only results in a linear scaling of $\kappa$.

The main idea of our proof is to derive a high-probability tail bound on $\mathbf{y}\in\{0,1\}^{T}$ with probability mass function $s(\mathbf{y})=r^{\|\mathbf{y}\|_{0}}(1-r)^{T-\|\mathbf{y}\|_{0}}$. This, in turn, means that the Gaussian mean has bounded norm with high probability. We can then calibrate our noise multiplier to this norm (the sensitivity) while adding some slack for the low probability of the tail bound being violated.

###### Lemma F.9.

Consider an event of possible subsampling indicators $E\subseteq\{0,1\}^{T}$ with probability $S(E)$ under the subsampling distribution with pmf $s(\mathbf{y})$. Let $P,Q$ be defined as in Theorem [F.8](https://arxiv.org/html/2605.12648#A6.Thmtheorem8) and assume $C_{\mathrm{clip}}=1$. Then, for all $\epsilon$,

$$\max\{H_{e^{\epsilon}}(P\,\|\,Q),H_{e^{\epsilon}}(Q\,\|\,P)\}\leq S(E)\cdot\max_{\mathbf{y}\in E}H_{e^{\epsilon}}\bigl(\mathcal{N}(\mathbf{C}\mathbf{y},\kappa^{2}\mathbf{I})\,\|\,\mathcal{N}(\mathbf{0},\kappa^{2}\mathbf{I})\bigr)+S(\overline{E}).$$

###### Proof\.

First consider the $(P,Q)$ case. By the law of total probability we have $P=P(\cdot\mid E)\cdot S(E)+P(\cdot\mid\overline{E})\cdot S(\overline{E})$ with $P(\cdot\mid E)=\sum_{\mathbf{y}\in\{0,1\}^{T}}\mathcal{N}(\mathbf{C}\mathbf{y},\kappa^{2}\mathbf{I})\cdot s(\mathbf{y}\mid E)$. It thus follows from joint convexity of hockey-stick divergences that

$$\begin{aligned}H_{e^{\epsilon}}(P\,\|\,Q)&\leq S(E)\,H_{e^{\epsilon}}(P(\cdot\mid E)\,\|\,Q)+S(\overline{E})\,H_{e^{\epsilon}}(P(\cdot\mid\overline{E})\,\|\,Q)\\&\leq S(E)\,H_{e^{\epsilon}}(P(\cdot\mid E)\,\|\,Q)+S(\overline{E})\\&\leq S(E)\cdot\max_{\mathbf{y}\in E}H_{e^{\epsilon}}\bigl(\mathcal{N}(\mathbf{C}\mathbf{y},\kappa^{2}\mathbf{I})\,\|\,\mathcal{N}(\mathbf{0},\kappa^{2}\mathbf{I})\bigr)+S(\overline{E}),\end{aligned}$$

where the second inequality holds because the hockey-stick divergence is always at most $1$, and the third inequality holds because joint convexity implies quasi-convexity.

The proof for the $(Q,P)$ case is analogous due to translation-equivariance of hockey-stick divergences between individual multivariate Gaussians. ∎

Next, we instantiate this result via an eventEEthat corresponds to a tail bound on the number of participations:

###### Lemma F.10.

Given $T$ training steps, dataset size $n$ and batch size $B$, define the subsampling rate $r=\frac{B}{n}$. Choose an arbitrary constant $c_{1}\in(0,1)$ with $rT>3\ln(\frac{1}{c_{1}\delta})$. Let $\mathbf{y}\sim S$ be the subsampling indicator vector with $y_{t}\sim\mathrm{Bernoulli}(r)$ over $T$ independent trials. Define the event $E=\{\|\mathbf{y}\|_{0}\leq\tau\mid\mathbf{y}\in\{0,1\}^{T}\}$ with $\tau=rT+\sqrt{3rT\ln(\frac{1}{c_{1}\delta})}$. Then $S(E)\geq 1-c_{1}\delta$ and $S(\overline{E})\leq c_{1}\delta$.

###### Proof.

This result is a direct application of the multiplicative Chernoff inequality to $z=\|\mathbf{y}\|_{0}$, which has mean $rT$. For any $c_{1}\in(0,1)$, we set the tail bound equal to $c_{1}\delta$ and solve for $c_{2}$:

$$\Pr[z\geq(1+c_{2})\cdot rT]\leq\exp\left(-\frac{c_{2}^{2}rT}{3}\right)\overset{!}{=}c_{1}\delta\implies c_{2}=\sqrt{\frac{3}{rT}\ln\Bigl(\frac{1}{c_{1}\delta}\Bigr)}.$$

The assumption $rT>3\ln(\frac{1}{c_{1}\delta})$ guarantees $c_{2}<1$, as required by this form of the inequality. We can then define $\tau=(1+c_{2})rT=rT+\sqrt{3rT\ln(\frac{1}{c_{1}\delta})}$. ∎
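To see how conservative the Chernoff threshold is, one can compare it with the exact binomial tail. The sketch below computes $\tau$ as in Lemma F.10 and evaluates $\Pr[\|\mathbf{y}\|_{0}>\tau]$ in log-space using only the standard library; the function names and the sample values ($r=0.05$, $T=1000$, $\delta=10^{-3}$, $c_{1}=1/2$) are assumptions chosen for illustration.

```python
import math

def chernoff_threshold(r, T, delta, c1=0.5):
    """Threshold tau = rT + sqrt(3 rT ln(1/(c1 delta))) from Lemma F.10."""
    log_term = math.log(1.0 / (c1 * delta))
    if r * T <= 3.0 * log_term:
        raise ValueError("Lemma F.10 requires rT > 3 ln(1/(c1 delta))")
    return r * T + math.sqrt(3.0 * r * T * log_term)

def binomial_upper_tail(T, r, tau):
    """Exact Pr[Binomial(T, r) > tau], summed in log-space for stability."""
    total = 0.0
    for k in range(math.floor(tau) + 1, T + 1):
        log_pmf = (math.lgamma(T + 1) - math.lgamma(k + 1) - math.lgamma(T - k + 1)
                   + k * math.log(r) + (T - k) * math.log1p(-r))
        total += math.exp(log_pmf)
    return total

r, T, delta, c1 = 0.05, 1000, 1e-3, 0.5
tau = chernoff_threshold(r, T, delta, c1)
tail = binomial_upper_tail(T, r, tau)
print(f"tau = {tau:.2f}, exact tail = {tail:.3e}, allowed = {c1 * delta:.3e}")
```

The exact tail is guaranteed to be at most $c_{1}\delta$ by the lemma; in practice it is often far smaller, which indicates slack in the resulting noise calibration.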

Next, we determine the maximum sensitivity $\|\mathbf{C}\mathbf{y}\|_{2}$ that can be attained for $\|\mathbf{y}\|_{0}\leq\tau$.

###### Lemma F.11.

Let $\mathbf{C}$ be the $T\times T$ lower triangular matrix with $C_{i,j}=\lambda^{i-j}$ for $i\geq j$. For any $\mathbf{y}\in\{0,1\}^{T}$ with $\|\mathbf{y}\|_{0}\leq\tau$, the following bound holds:

$$\|\mathbf{C}\mathbf{y}\|_{2}\leq\left(\frac{1-\lambda^{T}}{1-\lambda}\right)\sqrt{\tau}.$$

###### Proof.

By the consistency of the induced matrix 2-norm, $\|\mathbf{C}\mathbf{y}\|_{2}\leq\|\mathbf{C}\|_{2}\|\mathbf{y}\|_{2}$. First, since $\mathbf{y}$ is a binary vector, $\|\mathbf{y}\|_{2}=\sqrt{\|\mathbf{y}\|_{0}}\leq\sqrt{\tau}$. Second, the spectral norm satisfies $\|\mathbf{C}\|_{2}\leq\sqrt{\|\mathbf{C}\|_{1}\|\mathbf{C}\|_{\infty}}$, and for the lower triangular Toeplitz matrix $\mathbf{C}$ the maximum column sum and the maximum row sum coincide. Hence $\|\mathbf{C}\|_{2}$ is bounded by the maximum row sum:

$$\|\mathbf{C}\|_{2}\leq\max_{t}\sum_{u=1}^{t}\lambda^{t-u}=\sum_{k=0}^{T-1}\lambda^{k}=\frac{1-\lambda^{T}}{1-\lambda}.$$

Combining these terms yields the stated inequality. ∎
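The bound of Lemma F.11 is easy to probe numerically. The following sketch draws random sparse binary vectors and compares $\|\mathbf{C}\mathbf{y}\|_{2}$ with the stated bound; it assumes $\lambda\in[0,1)$, and the values `T=12`, `lam=0.6`, `tau=4` as well as the variable names are illustrative.

```python
import numpy as np

T, lam, tau = 12, 0.6, 4  # illustrative values, assuming 0 <= lam < 1

# C[t, u] = lam^(t - u) on and below the diagonal, zero above it.
idx = np.arange(T)
diff = idx[:, None] - idx[None, :]
C = np.where(diff >= 0, lam ** np.maximum(diff, 0), 0.0)

row_sum = (1 - lam**T) / (1 - lam)   # maximum row sum of C
bound = row_sum * np.sqrt(tau)       # right-hand side of Lemma F.11
spec = np.linalg.norm(C, 2)          # spectral norm of C

rng = np.random.default_rng(0)
worst = 0.0
for _ in range(2000):
    k = rng.integers(0, tau + 1)     # number of participations, at most tau
    y = np.zeros(T)
    y[rng.choice(T, size=k, replace=False)] = 1.0
    worst = max(worst, float(np.linalg.norm(C @ y)))

print(f"spectral norm {spec:.4f} <= row sum {row_sum:.4f}")
print(f"largest sampled ||Cy||_2 {worst:.4f} <= bound {bound:.4f}")
```

Random sampling only gives a lower estimate of the worst case, so this is a plausibility check rather than a verification of tightness.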

The theorem below is stated under the normalization $C_{\mathrm{clip}}=1$. Applying it to the normalized clipped gradients $\widetilde{g}_{t,i}/C_{\mathrm{clip}}$ immediately yields the privacy guarantee in Theorem [5.1](https://arxiv.org/html/2605.12648#S5.Thmtheorem1).

###### Theorem F.12 (Privacy guarantee).

Assume $\epsilon,\delta\in(0,1]$ and $rT\geq 3\ln(2/\delta)$ with $r=\frac{B}{n}$. If per-step gradients are clipped such that $\|\nabla\ell\|_{2}\leq 1$ and the noise multiplier $\kappa$ satisfies

$$\kappa^{2}\geq 8\cdot\left(\frac{1-\lambda^{T}}{1-\lambda}\right)^{2}\cdot\left(rT+\sqrt{3rT\ln\left(\frac{2}{\delta}\right)}\right)\cdot\frac{\ln\left(\frac{2.5}{\delta}\right)}{\epsilon^{2}},$$

then Algorithm [1](https://arxiv.org/html/2605.12648#alg1) satisfies $(\epsilon,\delta)$-DP.
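The condition on $\kappa^{2}$ is a closed-form expression and can be evaluated directly. The sketch below is a plain transcription of the formula under the normalization $C_{\mathrm{clip}}=1$; the function name `kappa_sq_lower_bound`, the restriction to $\lambda\in[0,1)$, and the sample hyperparameters are assumptions for illustration.

```python
import math

def kappa_sq_lower_bound(eps, delta, B, n, T, lam):
    """Right-hand side of the kappa^2 condition in Theorem F.12 (C_clip = 1)."""
    if not (0 < eps <= 1 and 0 < delta <= 1):
        raise ValueError("Theorem F.12 assumes eps, delta in (0, 1]")
    if not (0 <= lam < 1):
        raise ValueError("this sketch assumes 0 <= lam < 1")
    r = B / n
    if r * T < 3 * math.log(2 / delta):
        raise ValueError("Theorem F.12 assumes rT >= 3 ln(2/delta)")
    geom = (1 - lam**T) / (1 - lam)                     # bound on ||C||_2
    tau = r * T + math.sqrt(3 * r * T * math.log(2 / delta))  # participation bound
    return 8 * geom**2 * tau * math.log(2.5 / delta) / eps**2

base = kappa_sq_lower_bound(eps=0.5, delta=1e-5, B=100, n=10_000, T=5_000, lam=0.0)
corr = kappa_sq_lower_bound(eps=0.5, delta=1e-5, B=100, n=10_000, T=5_000, lam=0.5)
print(f"lam = 0.0: kappa^2 >= {base:.1f}")
print(f"lam = 0.5: kappa^2 >= {corr:.1f}")
```

With $\lambda=0$ the geometric factor equals 1 and the expression reduces to the independent-noise case; larger $\lambda$ increases the bound on $\|\mathbf{C}\|_{2}$ and therefore the required $\kappa^{2}$ in this worst-case analysis.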

###### Proof\.

We partition the space of subsampling indicator vectors into a "good" event $E$ and its complement $\overline{E}$. Let $c_{1}\in(0,1)$ be a constant allocating the failure probability.

By Lemma [F.10](https://arxiv.org/html/2605.12648#A6.Thmtheorem10), defining the event $E=\{\mathbf{y}\mid\|\mathbf{y}\|_{0}\leq\tau\}$ with threshold $\tau=rT+\sqrt{3rT\ln(\frac{1}{c_{1}\delta})}$, the probability of the complement is bounded by $S(\overline{E})\leq c_{1}\delta$, and $S(E)\leq 1$.

Next, we determine the $L_{2}$ sensitivity for any trajectory within $E$. Given $C_{\mathrm{clip}}=1$, the sensitivity is bounded by the norm of the encoded subsampling vector. By Lemma [F.11](https://arxiv.org/html/2605.12648#A6.Thmtheorem11), for any $\mathbf{y}\in E$, we have

$$\max_{\mathbf{y}\in E}\|\mathbf{C}\mathbf{y}\|_{2}\leq\left(\frac{1-\lambda^{T}}{1-\lambda}\right)\sqrt{\tau}=:\Delta_{2}.$$

For the Gaussian mechanism to satisfy $(\epsilon,\delta^{\prime})$-DP on the bounded sensitivity space $E$, standard privacy accounting requires the noise variance to satisfy $\kappa^{2}\geq\Delta_{2}^{2}\frac{2\ln(1.25/\delta^{\prime})}{\epsilon^{2}}$. We set the mechanism's failure probability to $\delta^{\prime}=(1-c_{1})\delta$ and choose $c_{1}=1/2$, so that $\delta^{\prime}=\delta/2$, $\ln(1.25/\delta^{\prime})=\ln(2.5/\delta)$ and $\tau=rT+\sqrt{3rT\ln(2/\delta)}$. Substituting the expressions for $\Delta_{2}$ and $\tau$, the requirement becomes $\kappa^{2}\geq 2\bigl(\frac{1-\lambda^{T}}{1-\lambda}\bigr)^{2}\tau\,\frac{\ln(2.5/\delta)}{\epsilon^{2}}$, which is implied by the bound on $\kappa^{2}$ assumed in the theorem.

This ensures that for all $\mathbf{y}\in E$, the hockey-stick divergence of the Gaussian mechanism is bounded:

$$\max_{\mathbf{y}\in E}H_{e^{\epsilon}}\bigl(\mathcal{N}(\mathbf{C}\mathbf{y},\kappa^{2}\mathbf{I})\,\|\,\mathcal{N}(\mathbf{0},\kappa^{2}\mathbf{I})\bigr)\leq(1-c_{1})\delta.$$

Finally, we assemble the total privacy loss via the joint convexity bound in Lemma [F.9](https://arxiv.org/html/2605.12648#A6.Thmtheorem9). Substituting the mechanism bound on $E$ and the probability of $\overline{E}$:

$$\begin{aligned}\max\{H_{e^{\epsilon}}(P\,\|\,Q),H_{e^{\epsilon}}(Q\,\|\,P)\}&\leq S(E)\cdot(1-c_{1})\delta+S(\overline{E})\\&\leq 1\cdot(1-c_{1})\delta+c_{1}\delta\\&=\delta.\end{aligned}$$

Since $(P,Q)$ and $(Q,P)$ are dominating pairs, bounding these hockey-stick divergences by $\delta$ implies that the algorithm satisfies $(\epsilon,\delta)$-DP. ∎
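The Gaussian step of this proof can be cross-checked against the closed-form hockey-stick divergence between two isotropic Gaussians with mean distance $\Delta$ and standard deviation $\kappa$, namely $\Phi\bigl(\tfrac{\Delta}{2\kappa}-\tfrac{\epsilon\kappa}{\Delta}\bigr)-e^{\epsilon}\Phi\bigl(-\tfrac{\Delta}{2\kappa}-\tfrac{\epsilon\kappa}{\Delta}\bigr)$. This expression is the known analytic Gaussian mechanism formula and is not stated in the paper; the function names and sample values below are illustrative assumptions.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF, written via erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def gaussian_hockey_stick(eps, sensitivity, kappa):
    """H_{e^eps}(N(mu, kappa^2 I) || N(0, kappa^2 I)) with ||mu||_2 = sensitivity."""
    a = sensitivity / (2.0 * kappa)
    b = eps * kappa / sensitivity
    return std_normal_cdf(a - b) - math.exp(eps) * std_normal_cdf(-a - b)

eps, delta, sens = 0.5, 1e-5, 3.0
delta_prime = delta / 2.0  # corresponds to c1 = 1/2 in the proof

# Classical calibration used in the proof: kappa^2 = 2 sens^2 ln(1.25/delta') / eps^2.
kappa_classical = sens * math.sqrt(2.0 * math.log(1.25 / delta_prime)) / eps
# Calibration with the constant 8 from Theorem F.12.
kappa_theorem = sens * math.sqrt(8.0 * math.log(2.5 / delta)) / eps

h_classical = gaussian_hockey_stick(eps, sens, kappa_classical)
h_theorem = gaussian_hockey_stick(eps, sens, kappa_theorem)
print(f"classical: {h_classical:.3e} <= {delta_prime:.1e}")
print(f"theorem:   {h_theorem:.3e} <= {delta_prime:.1e}")
```

Both calibrations should keep the divergence below $\delta^{\prime}=\delta/2$, with the theorem's larger constant leaving considerable slack.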