From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

arXiv cs.LG Papers

Summary

This paper derives batch scaling laws for sketched linear regression under power-law spectra, analyzing one-pass and multi-pass mini-batch SGD. It provides explicit risk decompositions showing how batch size affects bias, variance, and fluctuation terms, and establishes that without-replacement sampling yields lower noise than with-replacement.

arXiv:2605.24316v1 Announce Type: new Abstract: Scaling laws provide compact descriptions of how prediction error varies with compute, model size, and data, but existing theory mainly treats single-sample SGD or full data reuse, leaving the role of mini-batching unclear. We study batch scaling laws for sketched linear regression under a power-law covariance spectrum and a source condition on the target parameter. We analyze one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. Our first result is a risk decomposition: all three procedures share the same irreducible and approximation terms, while their stochastic terms depend on the sampling protocol. One-pass batch SGD splits into bias and variance, whereas the two multi-pass methods split into GD bias, GD variance, and a fluctuation term around a common GD reference trajectory. We then prove source-condition scaling laws for one-pass and multi-pass mini-batch methods. For one-pass batch SGD, mini-batching preserves the approximation and optimization-bias exponents, while the variance scales as $O(\min(M,(T_{\mathrm{eff}}\gamma)^{1/a})/(B T_{\mathrm{eff}}))$. Thus the usual $1/B$ covariance reduction holds at fixed update count $T$, but in the one-pass regime $T=N/B$ it is partly offset by the shorter optimization horizon. For multi-pass batch SGD, with- and without-replacement sampling have identical approximation and GD bias/variance terms; they differ only in the fluctuation covariance prefactor, which is $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement. Hence without-replacement sampling is less noisy for $B>1$, and when $B=N$ the fluctuation vanishes, recovering deterministic gradient descent. These results place batch size on the same theoretical footing as compute, data, and model dimension in sketched linear regression.


# From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression
Source: [https://arxiv.org/html/2605.24316](https://arxiv.org/html/2605.24316)
Ziyan Chen (The University of Sydney, ziyan.chen@sydney.edu.au) and Dingxuan Zhou (The University of Sydney, dingxuan.zhou@sydney.edu.au)

###### Abstract

Scaling laws provide a compact description of how prediction error varies with compute, model size, and data, but existing theoretical results largely focus on single-sample SGD or full data reuse and leave the role of mini-batching unclear. In this paper, we study batch scaling laws for sketched linear regression under a power-law covariance spectrum and a source condition on the target parameter. We analyze three optimization procedures: one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. We first derive an explicit risk decomposition showing that all three procedures share the same irreducible and approximation terms, while the stochastic contributions depend on the optimization protocol: one-pass batch SGD splits into bias and variance, whereas the two multi-pass procedures split into GD bias, GD variance, and a fluctuation term around a common GD reference trajectory. Building on this decomposition, we prove source-condition scaling laws for both one-pass and multi-pass mini-batch methods. For one-pass batch SGD, mini-batching preserves the approximation and optimization-bias exponents, while the variance term scales as $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$; thus the usual covariance reduction by $1/B$ holds at fixed update count $T$, whereas in the one-pass regime $T=N/B$ this gain is partly offset by the shorter optimization horizon. For multi-pass batch SGD, the approximation term and the GD bias/variance contribution are identical for with-replacement and without-replacement sampling; the only difference is the fluctuation term, whose covariance prefactor is $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement. Consequently, without-replacement sampling is less noisy for $B>1$, and when $B=N$ the fluctuation vanishes exactly, recovering deterministic gradient descent. These results place batch size on the same theoretical footing as compute, data, and model dimension within the sketched linear-regression framework.

## 1 Introduction

Scaling laws have become a standard language for describing progress in modern machine learning: across many domains, prediction error follows regular power laws in model size, data size, and compute. This pattern has been documented from early large-scale studies across translation, language, vision, and speech (Hestness et al., [2017](https://arxiv.org/html/2605.24316#bib.bib6)) to modern language-model scaling analyses (Kaplan et al., [2020](https://arxiv.org/html/2605.24316#bib.bib3); Hoffmann et al., [2022](https://arxiv.org/html/2605.24316#bib.bib2)) and multimodal autoregressive modeling (Henighan et al., [2020](https://arxiv.org/html/2605.24316#bib.bib5)). As a result, scaling laws are now used not only to summarize experiments, but also to guide forecasting, resource allocation, and training design (Rosenfeld et al., [2020](https://arxiv.org/html/2605.24316#bib.bib21); Zhai et al., [2022](https://arxiv.org/html/2605.24316#bib.bib22); Alabdulmohsin et al., [2022](https://arxiv.org/html/2605.24316#bib.bib23); Besiroglu et al., [2024](https://arxiv.org/html/2605.24316#bib.bib18); Muennighoff et al., [2023](https://arxiv.org/html/2605.24316#bib.bib19); Paquette et al., [2024](https://arxiv.org/html/2605.24316#bib.bib20)).

Empirical scaling laws do not by themselves explain where the exponents come from, which parts of the risk they describe, or how algorithmic choices change them. Rigorous results are therefore rarer. In statistically transparent models, however, approximation, optimization, and sampling effects can be separated. Following this program, Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4)) proved source-condition scaling laws for one-pass SGD in sketched linear regression, and Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)) showed that multiple passes lead to a decomposition into approximation, GD bias, GD variance, and fluctuation, yielding sharper compute–risk tradeoffs. These papers place empirical scaling laws on a rigorous footing and complement a broader theoretical literature on power-law behavior based on manifold arguments, constructive learning curves, solvable and dynamical models, renormalized high-dimensional asymptotics, quantization, and feature learning (Sharma and Kaplan, [2020](https://arxiv.org/html/2605.24316#bib.bib24); Bahri et al., [2024](https://arxiv.org/html/2605.24316#bib.bib17); Hutter, [2021](https://arxiv.org/html/2605.24316#bib.bib25); Maloney et al., [2022](https://arxiv.org/html/2605.24316#bib.bib26); Michaud et al., [2023](https://arxiv.org/html/2605.24316#bib.bib27); Bordelon et al., [2024](https://arxiv.org/html/2605.24316#bib.bib28), [2025](https://arxiv.org/html/2605.24316#bib.bib29); Atanasov et al., [2024](https://arxiv.org/html/2605.24316#bib.bib30); Dohmatob et al., [2024](https://arxiv.org/html/2605.24316#bib.bib31); Paquette et al., [2024](https://arxiv.org/html/2605.24316#bib.bib20); Ren et al., [2025](https://arxiv.org/html/2605.24316#bib.bib32)).

Batch size is one of the most important large-scale training knobs because it affects hardware utilization, wall-clock efficiency, and gradient noise. On the empirical side, large-batch rules enabled ImageNet training with batches up to 8192 (Goyal et al., [2017](https://arxiv.org/html/2605.24316#bib.bib8)), LARS pushed convolutional training to 8K and 32K (You et al., [2017](https://arxiv.org/html/2605.24316#bib.bib9)), and later studies showed that the gains are highly workload-dependent and eventually saturate (Shallue et al., [2019](https://arxiv.org/html/2605.24316#bib.bib10); Golmant et al., [2019](https://arxiv.org/html/2605.24316#bib.bib11)). A particularly influential synthesis is the gradient-noise-scale viewpoint of McCandlish et al. ([2018](https://arxiv.org/html/2605.24316#bib.bib12)), and batch-dependent empirical scaling laws have recently been studied directly for language models (Shuai et al., [2024](https://arxiv.org/html/2605.24316#bib.bib16)).

There is also a growing theoretical understanding of batch size $B$ through SGD noise. Smith and Le ([2018](https://arxiv.org/html/2605.24316#bib.bib13)) modeled SGD by an SDE with noise scale proportional to $\epsilon N/B$, suggesting that the optimal batch size should grow with both the learning rate and the dataset size $N$. Smith et al. ([2018](https://arxiv.org/html/2605.24316#bib.bib14)) studied the closely related strategy of increasing batch size instead of decaying the learning rate. In least-squares and related settings, the statistical effects of mini-batching, multiple passes, tail averaging, and implicit regularization have also been analyzed extensively (Lin and Rosasco, [2017](https://arxiv.org/html/2605.24316#bib.bib33); Jain et al., [2017](https://arxiv.org/html/2605.24316#bib.bib34); Mücke et al., [2019](https://arxiv.org/html/2605.24316#bib.bib35); Ge et al., [2019](https://arxiv.org/html/2605.24316#bib.bib36); Zou et al., [2021](https://arxiv.org/html/2605.24316#bib.bib37); Wu et al., [2022](https://arxiv.org/html/2605.24316#bib.bib7); Pillaud-Vivien et al., [2018](https://arxiv.org/html/2605.24316#bib.bib15)). What is still missing in this sketched linear-regression framework is a scaling-law analysis that shows how batch size enters approximation, optimization bias, variance, and data reuse.

This paper develops such a theory for sketched linear regression trained with mini-batch methods. We study three stochastic procedures: one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. While mini-batching classically reduces the per-update noise covariance by a factor $1/B$ at a fixed number of updates, our contribution is to show how batch size propagates through the scaling-law decompositions of Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)). In particular, the without-replacement scheme introduces the finite-population factor $\rho_{N,B}$ and recovers GD exactly at $B=N$. Our main contributions are as follows.

- A one-pass batch scaling law with horizon-noise tradeoff. Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1), together with the unified risk decomposition in Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1), shows that batching preserves the one-pass approximation and bias exponents while the stochastic contribution obeys the variance bound in Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1). Equivalently, the per-update covariance gains the usual factor $1/B$ at fixed $T$, but in the actual one-pass regime $T=N/B$ this improvement is partly offset because larger batches shorten the optimization horizon.
- A multi-pass fluctuation law. Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2) shows that mini-batching does not change the deterministic approximation and GD bias–variance terms from Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)); it changes only the fluctuation around the GD reference path. Importantly, this result is not obtained by simply multiplying the fluctuation bound of Lin et al. (2025) by $1/B$: batch updates change the fluctuation recursion and the covariance calculation of the driving noise, so the derivation must be redone in the batch setting. The resulting prefactor is $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement, so without-replacement sampling is less noisy and recovers deterministic GD at $B=N$, complementing the broader picture that multiple passes are statistically useful on hard problems (Pillaud-Vivien et al., [2018](https://arxiv.org/html/2605.24316#bib.bib15)).

#### Notation\.

For two positive-valued functions $f(x)$ and $g(x)$, we write $f(x)\lesssim g(x)$ (equivalently, $f(x)=O(g(x))$) and $f(x)\gtrsim g(x)$ (equivalently, $f(x)=\Omega(g(x))$) if there exists an absolute constant $c>0$ such that $f(x)\leq cg(x)$ and $f(x)\geq cg(x)$, respectively; we write $f(x)\asymp g(x)$ (equivalently, $f(x)=\Theta(g(x))$) when both bounds hold. For vectors $u$ and $v$ in a Hilbert space, we denote their inner product by $\langle u,v\rangle$ or $u^{\top}v$. For matrices $A$ and $B$ of compatible dimensions, we define their inner product by $\langle A,B\rangle:=\operatorname{tr}(A^{\top}B)$. We use $\|\cdot\|$ to denote the operator norm for matrices and the $\ell_{2}$-norm for vectors. For a positive semidefinite (PSD) matrix $A$ and a compatible vector $v$, we write $\|v\|_{A}^{2}:=v^{\top}Av$, and we write $A\preceq B$ when $B-A$ is PSD. For a symmetric matrix $A$, $\mu_{j}(A)$ denotes its $j$-th eigenvalue and $r(A)$ its rank. Finally, $\log(\cdot)$ denotes the base-2 logarithm.

## 2 Preliminaries

We work in the same sketched linear-regression framework as Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)). In addition to the normal GD iterate $\theta_{t}$, we study a one-pass batch SGD iterate, a multi-pass batch SGD iterate with replacement, and a multi-pass batch SGD iterate without replacement; each stochastic update averages a mini-batch of size $B$.

In particular, when $B=1$, the one-pass batch SGD setup reduces to the one-pass SGD setting of Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4)). For the multi-pass methods, our with-replacement and without-replacement rules coincide at $B=1$, because the latter is without replacement only within each mini-batch, not across an epoch. Thus our without-replacement procedure should not be confused with random reshuffling.

#### Problem setup\.

Let $\mathcal{H}$ be a finite- or countably infinite-dimensional Hilbert space. For a parameter $w\in\mathcal{H}$, define the population risk

$$R(w):=\mathbb{E}\bigl[(\langle x,w\rangle-y)^{2}\bigr],\qquad(x,y)\sim P,$$

where $P$ is a Borel probability measure on $\mathcal{H}\times\mathbb{R}$. Define the population covariance and population risk minimizer as follows:

$$H:=\mathbb{E}[xx^{\top}],\quad w^{\ast}\in\arg\min_{w\in\mathcal{H}}R(w).$$

We observe only the sketched covariates $(Sx,y)$, where $S:\mathcal{H}\to\mathbb{R}^{M}$ is the sketching operator. For $u\in\mathbb{R}^{M}$, define the sketched risk

$$R_{M}(u):=R(S^{\top}u)=\mathbb{E}\bigl[(\langle Sx,u\rangle-y)^{2}\bigr].$$

Given $N$ i.i.d. samples drawn according to $P$,

$$D=\{(x_{i},y_{i})\}_{i=1}^{N},$$

with observation $O=\{(Sx_{i},y_{i})\}_{i=1}^{N}$. Write

$$X:=(x_{1},\dots,x_{N})^{\top},\qquad y:=(y_{1},\dots,y_{N})^{\top},$$

and define the population and empirical quantities

$$\Sigma:=SHS^{\top},\qquad\widehat{\Sigma}:=\frac{1}{N}SX^{\top}XS^{\top},\qquad\widehat{b}:=\frac{1}{N}SX^{\top}y.$$

Conditioned on $S$, the minimizer of $R_{M}$ is, as in Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4)),

$$u^{\ast}=(SHS^{\top})^{-1}SHw^{\ast}=\Sigma^{-1}SHw^{\ast}.$$
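As a sanity check on the formula for the sketched minimizer, here is a minimal NumPy sketch. It assumes a finite truncation dimension `d`, an illustrative diagonal spectrum, and a Gaussian sketch with entries of variance $1/M$; the names `excess_risk`, `delta`, and `gap` are introduced here for illustration and are not from the paper. It uses the least-squares identity $R(w)-R(w^{\ast})=\|w-w^{\ast}\|_{H}^{2}$, which implies $R_{M}(u^{\ast}+\delta)-R_{M}(u^{\ast})=\delta^{\top}\Sigma\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 200, 10                                   # d truncates the Hilbert space (assumption)
lam = np.arange(1, d + 1, dtype=float) ** -2.0   # illustrative diagonal spectrum of H
H = np.diag(lam)
w_star = rng.normal(size=d)                      # arbitrary target for the check
S = rng.normal(scale=np.sqrt(1.0 / M), size=(M, d))

Sigma = S @ H @ S.T
u_star = np.linalg.solve(Sigma, S @ H @ w_star)  # u* = Sigma^{-1} S H w*

def excess_risk(u):
    """R_M(u) - R(w*) = ||S^T u - w*||_H^2 for the least-squares risk."""
    r = S.T @ u - w_star
    return r @ (lam * r)

delta = 0.1 * rng.normal(size=M)
gap = excess_risk(u_star + delta) - excess_risk(u_star)
print(gap, delta @ Sigma @ delta)                # the two numbers should agree
```

The value `excess_risk(u_star)` is the quantity called the approximation risk later in the text; any perturbation of `u_star` increases the sketched risk by a quadratic form in $\Sigma$.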

#### Optimization procedures\.

We compare four methods under this common setup. For every optimization procedure run for $L_{\mathrm{run}}$ updates, we use the same blockwise geometric learning-rate schedule: partition the updates into consecutive blocks indexed by $\ell=0,1,2,\ldots$, each containing $L_{\mathrm{run,eff}}:=L_{\mathrm{run}}/\log L_{\mathrm{run}}$ consecutive updates up to endpoint rounding, and set $\gamma_{t}=\gamma/2^{\ell}$ for every update $t$ in block $\ell$. In particular, $L_{\mathrm{run}}=T=N/B$ for one-pass batch SGD, whereas $L_{\mathrm{run}}=L$ for normal GD and the two multi-pass batch methods.
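The blockwise geometric schedule can be sketched in a few lines of Python. The function name `blockwise_geometric_schedule` is hypothetical, and rounding the block length up is only one possible reading of "up to endpoint rounding"; the logarithm is base 2, following the notation section.

```python
import math

def blockwise_geometric_schedule(num_updates: int, gamma: float) -> list[float]:
    """Return [gamma_1, ..., gamma_L] with gamma_t = gamma / 2**ell inside block ell."""
    if num_updates < 2:
        raise ValueError("need at least 2 updates so that log2(num_updates) > 0")
    # Block length L_run / log2(L_run); rounding up is our choice of endpoint rounding.
    block_len = math.ceil(num_updates / math.log2(num_updates))
    return [gamma / 2 ** (t // block_len) for t in range(num_updates)]

schedule = blockwise_geometric_schedule(16, gamma=0.1)
print(schedule)   # four blocks of four updates, halving the stepsize each block
```

With 16 updates the block length is $16/\log 16=4$, so the stepsize takes the values $\gamma,\gamma/2,\gamma/4,\gamma/8$ over four blocks.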

#### 1\. Normal GD\.

Let $(\gamma_{t})$ be the prescribed stepsize schedule. The normal GD iterate is

$$\theta_{t}=\theta_{t-1}-\gamma_{t}\widehat{\Sigma}\theta_{t-1}+\gamma_{t}\widehat{b},\qquad\theta_{0}=0.\tag{1}$$

#### 2\. One\-pass Batch SGD\.

Assume for simplicity that $B\mid N$, and partition $[N]$ into disjoint batches $I_{1},\dots,I_{N/B}$ with $|I_{t}|=B$. For each batch, define

$$\widehat{\Sigma}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}x_{i}^{\top}S^{\top},\qquad\widehat{b}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}y_{i}.$$

The one-pass batch SGD iterate is

$$u_{t}^{\mathrm{op}}=u_{t-1}^{\mathrm{op}}-\gamma_{t}\widehat{\Sigma}_{I_{t}}^{(B)}u_{t-1}^{\mathrm{op}}+\gamma_{t}\widehat{b}_{I_{t}}^{(B)},\qquad t=1,\dots,\frac{N}{B},\qquad u_{0}^{\mathrm{op}}=0.\tag{2}$$

Thus each update uses $B$ samples, and the procedure performs a total of $N/B$ updates.
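A minimal NumPy sketch of the one-pass update (2) follows. It is an illustration under stated assumptions rather than the authors' code: `Z` is a hypothetical $(N,M)$ array whose rows are the sketched covariates $Sx_{i}$, the disjoint batches are taken as consecutive slices, and the helper `stepsizes` rounds the block length up.

```python
import math
import numpy as np

def stepsizes(num_updates, gamma):
    # Blockwise geometric decay with base-2 log; ceil is our rounding choice.
    block_len = math.ceil(num_updates / math.log2(num_updates))
    return [gamma / 2 ** (t // block_len) for t in range(num_updates)]

def one_pass_batch_sgd(Z, y, B, gamma):
    """Z: (N, M) rows S x_i; y: (N,) responses. Each sample is used exactly once."""
    N, M = Z.shape
    if N % B != 0 or N // B < 2:
        raise ValueError("need B | N and T = N / B >= 2")
    T = N // B
    u = np.zeros(M)
    for t, gamma_t in enumerate(stepsizes(T, gamma)):
        Zb, yb = Z[t * B:(t + 1) * B], y[t * B:(t + 1) * B]   # disjoint batch I_t
        # Sigma_hat u - b_hat = Zb^T (Zb u - yb) / B, so this is update (2).
        u = u - gamma_t * (Zb.T @ (Zb @ u - yb)) / B
    return u

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 3))
y = rng.normal(size=8)
u_T = one_pass_batch_sgd(Z, y, B=4, gamma=0.1)
print(u_T)
```

Doubling `B` halves the number of updates `T`, which is the horizon effect discussed for the one-pass variance bound.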

#### 3\. Multi\-pass Batch SGD with Replacement\.

At each step $t\in[L]$, sample a mini-batch with replacement, $i_{t,1},\dots,i_{t,B}\stackrel{\mathrm{iid}}{\sim}\mathrm{unif}([N])$, and define

$$\widehat{\Sigma}_{t}^{(B)}:=\frac{1}{B}\sum_{r=1}^{B}Sx_{i_{t,r}}x_{i_{t,r}}^{\top}S^{\top},\qquad\widehat{b}_{t}^{(B)}:=\frac{1}{B}\sum_{r=1}^{B}Sx_{i_{t,r}}y_{i_{t,r}}.$$

The multi-pass batch SGD iterate with replacement is

$$u_{t}^{\mathrm{wr}}=u_{t-1}^{\mathrm{wr}}-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}u_{t-1}^{\mathrm{wr}}+\gamma_{t}\widehat{b}_{t}^{(B)},\qquad t=1,\dots,L,\qquad u_{0}^{\mathrm{wr}}=0.\tag{3}$$

Here each step again uses $B$ samples, but the algorithm now runs for a total of $L$ updates and may therefore reuse data across passes.

#### 4\. Multi\-pass Batch SGD without Replacement\.

We consider mini-batch updates on the fixed dataset $D$, where at each step $t\in[L]$ we sample a subset $I_{t}\subset[N]$ with $|I_{t}|=B$ uniformly without replacement from $[N]$. Across different iterations $t$, the batches $I_{t}$ are sampled independently, so data may be reused across iterations, but no sample is repeated within a single batch.

For each sampled batch $I_{t}$, define

$$\widehat{\Sigma}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}x_{i}^{\top}S^{\top},\qquad\widehat{b}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}y_{i}.$$

Then the multi-pass batch SGD iterate without replacement is

$$u_{t}^{\mathrm{wor}}=u_{t-1}^{\mathrm{wor}}-\gamma_{t}\widehat{\Sigma}_{I_{t}}^{(B)}u_{t-1}^{\mathrm{wor}}+\gamma_{t}\widehat{b}_{I_{t}}^{(B)},\qquad t=1,\dots,L,\qquad u_{0}^{\mathrm{wor}}=0.\tag{4}$$

In summary, we keep the same problem setup and normal GD process as in Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)), but replace the single-sample updates by mini-batch updates of size $B$. When comparing the stochastic iterates with GD, we always use the same stepsize schedule and the same number of updates: $N/B$ for one-pass batch SGD and $L$ for multi-pass batch SGD, with or without replacement.
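The two multi-pass samplers and the GD reference can be compared with a short NumPy sketch. As before, this is a hedged illustration: `Z` is a hypothetical array of sketched covariates, `stepsizes` uses an assumed rounding rule, and the function names are not from the paper. The example checks the property stated in the text that without-replacement sampling with $B=N$ reproduces GD.

```python
import math
import numpy as np

def stepsizes(num_updates, gamma):
    # Blockwise geometric decay with base-2 log; ceil is our rounding choice.
    block_len = math.ceil(num_updates / math.log2(num_updates))
    return [gamma / 2 ** (t // block_len) for t in range(num_updates)]

def gd(Z, y, L, gamma):
    """Full-batch reference trajectory theta_L, as in update (1)."""
    N, M = Z.shape
    theta = np.zeros(M)
    for gamma_t in stepsizes(L, gamma):
        theta = theta - gamma_t * (Z.T @ (Z @ theta - y)) / N
    return theta

def multi_pass_batch_sgd(Z, y, B, L, gamma, replace, rng):
    """replace=True gives update (3); replace=False gives update (4)."""
    N, M = Z.shape
    u = np.zeros(M)
    for gamma_t in stepsizes(L, gamma):
        # A fresh batch is drawn independently at every step, so data is reused.
        idx = rng.choice(N, size=B, replace=replace)
        Zb, yb = Z[idx], y[idx]
        u = u - gamma_t * (Zb.T @ (Zb @ u - yb)) / B
    return u

rng = np.random.default_rng(0)
Z = rng.normal(size=(32, 4))
y = rng.normal(size=32)
theta_L = gd(Z, y, L=64, gamma=0.05)
u_wor_full = multi_pass_batch_sgd(Z, y, B=32, L=64, gamma=0.05, replace=False, rng=rng)
u_wr = multi_pass_batch_sgd(Z, y, B=8, L=64, gamma=0.05, replace=True, rng=rng)
print(np.max(np.abs(u_wor_full - theta_L)))   # zero up to floating-point error
```

With `replace=False` and `B == N`, every batch is a permutation of the full dataset, so the batch average equals the full average and the iterate follows the GD path.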

## 3 Main Results

This section contains the main theoretical contribution of the paper: explicit batch-size scaling laws for one-pass and multi-pass sketched SGD under the power-law/source-condition model. The main conclusion is that mini-batching does not change the functional form of the deterministic approximation and optimization-bias terms (in the one-pass case, the bias is evaluated at the shorter horizon $T=N/B$). Instead, it enters the stochastic terms mainly through the one-pass variance bound $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$ and through the multi-pass fluctuation prefactor $\rho_{N,B}$.

The assumptions below are the same stylized assumptions used in Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)), adapted here to the mini-batch setting. We first restate those conditions, then give a common risk decomposition, and finally derive the one-pass and multi-pass scaling laws.

###### Assumption 1 (Data assumptions).

Assume the following conditions on the data distribution $P$.

1. A. Gaussian design. The feature vector satisfies $x\sim\mathcal{N}(0,H)$.
2. B. Well-specified model. The response satisfies $\mathbb{E}[y\mid x,w^{\ast}]=\langle x,w^{\ast}\rangle$ with $\sigma^{2}:=\mathbb{E}\bigl[(y-\langle x,w^{\ast}\rangle)^{2}\bigr]$.
3. C. Power-law spectrum. Let $(\lambda_{i})_{i\geq 1}$ denote the eigenvalues of $H$. Then for some $a>1$, $\lambda_{i}\asymp i^{-a}$ for all $i\geq 1$.
4. D. Source condition. Let $(\lambda_{i},v_{i})_{i\geq 1}$ be the eigenvalue–eigenvector pairs of $H$. Assume $w^{\ast}$ follows a prior such that for some $b>1$, $\mathbb{E}\bigl[\langle v_{i},w^{\ast}\rangle\langle v_{j},w^{\ast}\rangle\bigr]=0$ for $i\neq j$ and $\mathbb{E}\bigl[\lambda_{i}\langle v_{i},w^{\ast}\rangle^{2}\bigr]\asymp i^{-b}$ for all $i\geq 1$.

###### Assumption 2 (Source condition in diagonal coordinates).

Assume without loss of generality that $H$ is diagonal with non-increasing diagonal entries

$$H=\operatorname{diag}(\lambda_{1},\lambda_{2},\dots).$$

Assume the true parameter $w^{\ast}$ satisfies: for some $b>1$,

$$\mathbb{E}[w_{i}^{\ast}w_{j}^{\ast}]=0\quad\text{for all }i\neq j;\qquad\mathbb{E}[\lambda_{i}(w_{i}^{\ast})^{2}]\asymp i^{-b}\quad\text{for all }i\geq 1.$$
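To make the data assumptions concrete, the following sketch draws one synthetic instance in the diagonal coordinates of $H$. Several choices here are assumptions made only for illustration: the space is truncated to `d` coordinates, all constants hidden in $\asymp$ are set to 1, the prior on $w^{\ast}$ and the label noise are taken to be Gaussian, and the function name `sample_instance` is hypothetical.

```python
import numpy as np

def sample_instance(d, N, a, b, sigma, rng):
    """Draw (lam, w_star, X, y) in the eigenbasis of H, truncated to d coordinates."""
    i = np.arange(1, d + 1, dtype=float)
    lam = i ** (-a)                                  # power-law spectrum lambda_i = i^{-a}
    # Independent zero-mean coordinates with E[lam_i * (w_i*)^2] = i^{-b}.
    w_star = rng.normal(size=d) * np.sqrt(i ** (-b) / lam)
    X = rng.normal(size=(N, d)) * np.sqrt(lam)       # rows x_i ~ N(0, diag(lam))
    y = X @ w_star + sigma * rng.normal(size=N)      # well-specified, noise variance sigma^2
    return lam, w_star, X, y

rng = np.random.default_rng(0)
lam, w_star, X, y = sample_instance(d=500, N=2000, a=2.0, b=1.5, sigma=1.0, rng=rng)
print(lam[:3], X.shape, y.shape)
```

Only the second moments of $w^{\ast}$ are constrained by the source condition, so any other zero-mean prior with uncorrelated coordinates and the same variances would serve equally well.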

###### Assumption 3 (Stepsize conditions).

Under the notation of the theorem and its proof, assume that with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$, the following hold:

1. A. $\gamma\leq\min\left\{\frac{c}{\log N},\frac{c}{\operatorname{tr}(\Sigma)}\right\}$.
2. B. $\operatorname{tr}(\Sigma^{2})\lesssim 1$.
3. C. $\sum_{i=1}^{M}\frac{\mu_{i}(\Sigma)}{\mu_{i}(\Sigma)+1/(L_{\mathrm{eff}}\gamma)}\leq\frac{N}{4}$.
4. D. For all $t\geq 1$, $\mathbb{P}\left(4\max_{i\in[N]}\|Sx_{i}\|_{2}^{2}>\frac{t}{\gamma}\right)\leq N^{-ct}$.

Throughout this section, we additionally assume that the sketch operator $S:\mathcal{H}\to\mathbb{R}^{M}$ is Gaussian, meaning that in the diagonal coordinates of $H$, its entries are i.i.d. $\mathcal{N}(0,1/M)$. For one-pass and multi-pass batch SGD, let

$$T:=\frac{N}{B}\geq 2,\quad T_{\mathrm{eff}}:=\frac{T}{\log T},\quad L_{\mathrm{eff}}:=\frac{L}{\log L},\quad\rho_{N,B}:=\frac{N-B}{B(N-1)}.$$

Unless a subscript indicates otherwise, the expectations in the theorem statements below are taken over $w^{\ast}$, the sample, and the mini-batch randomness when applicable.

To state the results compactly, define

$$\rho=\begin{cases}1/B,&\text{for multi-pass batch SGD with replacement},\\ \rho_{N,B},&\text{for multi-pass batch SGD without replacement},\end{cases}$$

and write correspondingly

$$(u_{L}^{\rho},\mathrm{Fluc}^{\rho}_{B})=\begin{cases}(u_{L}^{\mathrm{wr}},\mathrm{Fluc}^{\mathrm{wr}}_{B}),&\rho=1/B,\\ (u_{L}^{\mathrm{wor}},\mathrm{Fluc}^{\mathrm{wor}}_{B}),&\rho=\rho_{N,B}.\end{cases}$$
These assumptions and definitions have a simple interpretation. The exponents $a$ and $b$ quantify the statistical complexity of the problem through the spectral decay of $H$ and the regularity of $w^{\ast}$, while the sketch dimension $M$ controls approximation. The stepsize condition is the same kind of high-probability regularity assumption used in Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)); it ensures that the concentration and effective-time arguments can be applied uniformly in the proofs.
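The two fluctuation prefactors are easy to tabulate. The helper below (the name `fluctuation_prefactor` is introduced here, not taken from the paper) simply evaluates $1/B$ and $\rho_{N,B}=(N-B)/(B(N-1))$ for a few batch sizes.

```python
def fluctuation_prefactor(N: int, B: int, replace: bool) -> float:
    """Return 1/B with replacement, rho_{N,B} = (N - B) / (B (N - 1)) without."""
    if N < 2 or not 1 <= B <= N:
        raise ValueError("need N >= 2 and 1 <= B <= N")
    return 1.0 / B if replace else (N - B) / (B * (N - 1))

N = 1024
for B in (1, 32, 512, 1024):
    wr = fluctuation_prefactor(N, B, replace=True)
    wor = fluctuation_prefactor(N, B, replace=False)
    print(f"B={B:5d}  with replacement={wr:.6f}  without replacement={wor:.6f}")
```

The ratio of the two prefactors is $(N-B)/(N-1)$, so they coincide at $B=1$, the without-replacement value is strictly smaller for $B>1$, and it vanishes at $B=N$.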

The next proposition packages the three procedures into one structural statement\. It shows that all three risks share the same common baseline risk, while the one\-pass method further splits into bias plus variance and the two multi\-pass methods split into a common GD reference contribution plus a sampling\-rule\-dependent fluctuation term\.

###### Proposition 3.1 (Risk decompositions for the three optimization procedures).

Assume Assumption [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2). Let $\bar{u}_{T}:=\mathbb{E}[u_{T}^{\mathrm{op}}]$. Then the risks decompose as follows:

$$\begin{aligned}\mathbb{E}[R_{M}(u_{T}^{\mathrm{op}})]={}&\underbrace{R_{M}(u^{\ast})}_{\text{common baseline risk}}+\underbrace{R_{M}(\bar{u}_{T})-R_{M}(u^{\ast})}_{\substack{\text{one-pass bias}\\ \text{excess risk}}}+\underbrace{\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})-R_{M}(\bar{u}_{T})\bigr]}_{\substack{\text{one-pass variance}\\ \text{excess risk}}},\\ \mathbb{E}[R_{M}(u_{L}^{\rho})]={}&\underbrace{R_{M}(u^{\ast})}_{\text{common baseline risk}}+\underbrace{\mathbb{E}\bigl[R_{M}(\theta_{L})-R_{M}(u^{\ast})\bigr]}_{\substack{\text{common GD-reference}\\ \text{excess risk}}}+\underbrace{\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})-R_{M}(\theta_{L})\bigr]}_{\substack{\text{sampling-rule-dependent}\\ \text{fluctuation excess risk}}},\end{aligned}$$

where $u_{L}^{\rho}$ denotes $u_{L}^{\mathrm{wr}}$ when $\rho=1/B$ and $u_{L}^{\mathrm{wor}}$ when $\rho=\rho_{N,B}$. Note that

$$R_{M}(u^{\ast})=\underbrace{R(w^{\ast})}_{\text{irreducible risk}}+\underbrace{\bigl[R_{M}(u^{\ast})-R(w^{\ast})\bigr]}_{\text{approximation risk}}.$$

The proof is given at the end of Appendix [A](https://arxiv.org/html/2605.24316#A1). Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1) is the organizing principle for the rest of the section: the next theorem specializes the one-pass decomposition, and the theorem after that specializes the multi-pass decompositions and makes explicit that with-replacement and without-replacement sampling differ only through the fluctuation prefactor.
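For intuition about where the two prefactors come from, the following is the standard finite-population calculation for the covariance of a mini-batch mean over a fixed dataset. It is not taken from the paper's appendix, and the symbols $a_{i}$, $\bar{a}$, $C$, and $\xi_{i}$ are introduced here for illustration only; the paper's fluctuation analysis may organize the computation differently.

```latex
% Fixed vectors a_1,...,a_N (N >= 2) with mean \bar a and population covariance C.
\bar a := \frac{1}{N}\sum_{i=1}^{N} a_i, \qquad
C := \frac{1}{N}\sum_{i=1}^{N} (a_i-\bar a)(a_i-\bar a)^{\top}.

% With replacement: B i.i.d. uniform draws i_1,...,i_B, so the covariances add.
\operatorname{Cov}\Bigl(\frac{1}{B}\sum_{r=1}^{B} a_{i_r}\Bigr) = \frac{1}{B}\,C.

% Without replacement: \xi_i = 1\{i \in I\} with |I| = B, hence \sum_i \xi_i = B and
\operatorname{Var}(\xi_i) = \frac{B(N-B)}{N^{2}}, \qquad
\operatorname{Cov}(\xi_i,\xi_j) = -\frac{B(N-B)}{N^{2}(N-1)} \quad (i \neq j).

% Using \sum_i (a_i - \bar a) = 0 to rewrite the cross terms:
\operatorname{Cov}\Bigl(\frac{1}{B}\sum_{i\in I} a_i\Bigr)
  = \frac{1}{B^{2}}\cdot\frac{B(N-B)}{N^{2}}\Bigl(1+\frac{1}{N-1}\Bigr)
    \sum_{i=1}^{N}(a_i-\bar a)(a_i-\bar a)^{\top}
  = \frac{N-B}{B(N-1)}\,C = \rho_{N,B}\,C.
```

Since $B\rho_{N,B}=(N-B)/(N-1)$, the without-replacement factor is strictly below $1/B$ whenever $B>1$ and equals zero at $B=N$.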

###### Theorem 3.1 (Scaling law for one-pass batch SGD under the source condition).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), and [2](https://arxiv.org/html/2605.24316#Thmassumption2), and suppose

$$1<b<a+1,\qquad\sigma^{2}\asymp 1,\qquad T_{\mathrm{eff}}\gamma\gtrsim 1.$$

Assume moreover that the one-pass analogue of Assumption [3](https://arxiv.org/html/2605.24316#Thmassumption3) holds as follows: the effective-horizon conditions are imposed with $L_{\mathrm{eff}}$ replaced by $T_{\mathrm{eff}}$, while the maximum-norm condition in that assumption remains over the original $N$ samples. Then there exists an $a$-dependent constant $c>0$ such that, whenever $\gamma\leq c/\log T$, we have with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})\bigr]=\sigma^{2}+\underbrace{\Theta\!\bigl(M^{1-b}\bigr)+\Theta\!\Bigl((T_{\mathrm{eff}}\gamma)^{(1-b)/a}\Bigr)}_{\mathrm{Approx+Bias}}+\underbrace{O\!\left(\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}}\right)}_{\mathrm{Var}}.$$

Here the hidden constants depend only on $(a,b)$. In particular, when $1<b\leq a$, the variance term is dominated by the sum of the approximation and bias terms, so the risk simplifies to

$$\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})\bigr]=\sigma^{2}+\Theta\!\bigl(M^{1-b}\bigr)+\Theta\!\Bigl((T_{\mathrm{eff}}\gamma)^{(1-b)/a}\Bigr).$$

In Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1), the factor $1/B$ should be interpreted at the level of the per-update noise covariance, or equivalently relative to a fixed number of updates $T$. In the actual one-pass regime, however, $T=N/B$, so increasing $B$ simultaneously lowers the one-step noise and shortens the optimization horizon. Accordingly, the full variance term is $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$, not a pure $1/B$ improvement as a function of $B$ at fixed dataset size $N$.
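The difference between the fixed-$T$ and fixed-$N$ readings of the variance bound can be made concrete with a small numerical sketch. It assumes, purely for illustration, that $T_{\mathrm{eff}}$ equals the update count and that the hidden constant is 1; the paper's $T_{\mathrm{eff}}$ may differ (for instance by logarithmic factors), and the function names are hypothetical.

```python
def fixed_T_variance_bound(T, B, M, gamma, a):
    """min{M, (T*gamma)^(1/a)} / (B*T) with the update count T held fixed.

    Illustrative assumptions: T_eff := T and the hidden constant is 1.
    """
    eff_dim = min(M, (T * gamma) ** (1.0 / a))
    return eff_dim / (B * T)


def one_pass_variance_bound(N, B, M, gamma, a):
    """Same bound in the one-pass regime, where T = N / B shrinks as B grows."""
    T = N / B
    eff_dim = min(M, (T * gamma) ** (1.0 / a))
    return eff_dim / (B * T)


if __name__ == "__main__":
    N, M, gamma, a = 512, 64, 0.05, 2.0
    for B in (1, 4, 16):  # keeps T * gamma >= 1, as the theorem requires
        print(B,
              fixed_T_variance_bound(N, B, M, gamma, a),
              one_pass_variance_bound(N, B, M, gamma, a))
```

Under these assumptions $B\,T=N$ in the one-pass regime, so the bound reduces to $\min\{M,(N\gamma/B)^{1/a}\}/N$: in the unsaturated regime it decays like $B^{-1/a}$ rather than $B^{-1}$, and it does not depend on $B$ at all once the minimum is attained by $M$.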

See Appendix [B](https://arxiv.org/html/2605.24316#A2), and in particular Appendix [B.1](https://arxiv.org/html/2605.24316#A2.SS1), for the proof of Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1).

We now turn to the multi-pass setting. The key question is how mini-batching affects each term of the risk decomposition; the next theorem shows that only the fluctuation term changes.

###### Theorem 3.2 (Scaling law for multi-pass batch SGD with and without replacement under the source condition).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [2](https://arxiv.org/html/2605.24316#Thmassumption2), and [3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose

$$1<b<a+1,\qquad\sigma^{2}\asymp 1,\qquad L_{\mathrm{eff}}\gamma\gtrsim 1,\qquad L_{\mathrm{eff}}\lesssim N^{a}/\gamma.$$

Then there exists an $(a,b)$-dependent constant $c>0$ such that, whenever $\gamma\leq c/\log N$, we have with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$, the following hold.

1. If, for some fixed $\varepsilon\in(0,1)$, $L_{\mathrm{eff}}\lesssim N^{(1-\varepsilon)a}/\gamma$, then

   $$\begin{aligned}\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})\bigr]={}&\sigma^{2}+\underbrace{\Theta\!\bigl(M^{1-b}\bigr)}_{\mathrm{Approx}}+\underbrace{\Theta\!\Bigl(\min\!\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}^{1-b}\Bigr)}_{\mathrm{GD\,Bias}}\\&+\underbrace{\Theta\!\left(\frac{\min\!\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}}{N}\right)}_{\mathrm{GD\,Var}}+\underbrace{O\!\left(\rho\,\gamma\log N\left[(L_{\mathrm{eff}}\gamma)^{1/a-1}+\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N}\right]\right)}_{\mathrm{Fluc}}.\end{aligned}$$

2. In particular, when $a\geq b$, $L_{\mathrm{eff}}\lesssim N^{a/b}/\gamma$ and $\gamma\log N\lesssim 1$, the GD variance and fluctuation terms are dominated by the sum of the approximation and GD bias terms, namely,

   $$\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})\bigr]=\sigma^{2}+\Theta\!\bigl(M^{1-b}\bigr)+\Theta\!\Bigl(\min\!\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}^{1-b}\Bigr).$$

3. When $a<b<a+1$ and $L_{\mathrm{eff}}\lesssim N/\gamma$, the approximation and GD bias terms combine as

   $$\Theta\!\bigl(M^{1-b}\bigr)+\Theta\!\bigl((L_{\mathrm{eff}}\gamma)^{(1-b)/a}\bigr)=\Theta\!\Bigl(\min\!\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}^{1-b}\Bigr),$$

   and therefore

   $$\begin{aligned}\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})\bigr]={}&\sigma^{2}+\Theta\!\Bigl(\min\!\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}^{1-b}\Bigr)+\Theta\!\left(\frac{\min\!\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}}{N}\right)\\&+O\!\left(\rho\,\gamma\log N\left[(L_{\mathrm{eff}}\gamma)^{1/a-1}+\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N}\right]\right).\end{aligned}$$

See Appendix [B](https://arxiv.org/html/2605.24316#A2), and in particular Appendix [B.2](https://arxiv.org/html/2605.24316#A2.SS2), for the proof of Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2).
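To see which term of the multi-pass bound depends on the sampling rule, the four terms of item (1) of Theorem 3.2 can be evaluated side by side. The sketch below sets every hidden constant to 1, so only relative orders are meaningful, and it treats $L_{\mathrm{eff}}$ as a plain input because its definition lies outside this section; the function name and the parameter values are hypothetical.

```python
import math


def multipass_terms(M, N, L_eff, gamma, a, b, rho):
    """Order-of-magnitude terms from item (1) of Theorem 3.2 (all constants set to 1)."""
    horizon = L_eff * gamma
    eff_dim = min(M, horizon ** (1.0 / a))
    return {
        "approx": M ** (1.0 - b),
        "gd_bias": eff_dim ** (1.0 - b),
        "gd_var": eff_dim / N,
        # Only this term carries the batch prefactor rho.
        "fluc_bound": rho * gamma * math.log(N)
        * (horizon ** (1.0 / a - 1.0) + horizon ** (1.0 / a) / N),
    }


if __name__ == "__main__":
    M, N, L_eff, gamma, a, b, B = 64, 512, 512, 0.05, 2.0, 1.5, 32
    with_repl = multipass_terms(M, N, L_eff, gamma, a, b, rho=1.0 / B)
    without_repl = multipass_terms(M, N, L_eff, gamma, a, b, rho=(N - B) / (B * (N - 1)))
    for name in with_repl:
        print(name, with_repl[name], without_repl[name])
```

The first three entries are identical for the two sampling rules by construction; only `fluc_bound` changes with $\rho$, and it vanishes when $\rho=0$.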

#### Comparison with previous scaling laws\.

Taken together, Theorems [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1) and [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2) show precisely how the scaling laws of Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)) deform under mini-batching. The one-pass theorem is the mini-batch analogue of Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4)): the approximation term $\Theta(M^{1-b})$ and the one-pass bias term keep the same exponents, while batching changes only the stochastic term. More precisely, the one-pass variance bound is $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$: the factor $1/B$ is the fixed-$T$ covariance gain, whereas at fixed dataset size $N$ one must also account for the shorter horizon $T=N/B$. The multi-pass theorem extends Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)): the approximation term and the GD bias–variance contribution are unchanged, and the only new batch dependence is the fluctuation prefactor $\rho\in\{1/B,\rho_{N,B}\}$. In particular, setting $B=1$ recovers the corresponding one-sample scaling laws; in this case the two multi-pass sampling rules coincide.

#### What batch size changes\.

The common message of the two theorems is that batch size acts as a noise-control parameter rather than a deterministic regularizer. In the one-pass theorem, batching lowers the centered variance; in the multi-pass theorem, it lowers only the fluctuation around the GD reference path. This interpretation is consistent with the gradient-noise-scale viewpoint of Smith and Le ([2018](https://arxiv.org/html/2605.24316#bib.bib13)) and McCandlish et al. ([2018](https://arxiv.org/html/2605.24316#bib.bib12)), and with empirical large-batch studies showing that large batches can be effective when properly tuned but that their gains eventually saturate (Goyal et al., [2017](https://arxiv.org/html/2605.24316#bib.bib8); Smith et al., [2018](https://arxiv.org/html/2605.24316#bib.bib14); Shallue et al., [2019](https://arxiv.org/html/2605.24316#bib.bib10)). Our theorems make this principle explicit in the present linear-regression setting: once the stochastic terms fall below the approximation and optimization terms, further increasing $B$ no longer changes the leading statistical scaling. In the multi-pass setting, without-replacement sampling adds the finite-population gain $\rho_{N,B}<1/B$ when $B>1$, so the benefit of large batches is strongest precisely when $B$ is a non-negligible fraction of $N$.
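As a worked illustration of the finite-population gain, the ratio of the two fluctuation prefactors follows directly from the definition $\rho_{N,B}=(N-B)/(B(N-1))$. The numerical values below take $N=512$, the default sample size of the experiments, and are plain arithmetic rather than measured results.

```latex
\frac{\rho_{N,B}}{1/B} \;=\; \frac{N-B}{N-1},
\qquad
N = 512:\quad
\begin{cases}
B = 1:   & \dfrac{511}{511} = 1,\\[4pt]
B = 64:  & \dfrac{448}{511} \approx 0.877,\\[4pt]
B = 256: & \dfrac{256}{511} \approx 0.501,\\[4pt]
B = 512: & \dfrac{0}{511} = 0.
\end{cases}
```

The ratio stays close to 1 while $B\ll N$ and drops to 0 at full batch, which is why the gain matters mainly when $B$ is a sizeable fraction of $N$.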

#### Implications for choosing batch size\.

The theorems also suggest a simple batch-size design rule. In the one-pass setting, $B$ has two competing effects: it reduces the variance term but also shortens the optimization horizon $T=N/B$, so overly large batches can help the noise term while worsening the bias term. Thus one should increase $B$ only until the full variance bound $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$ falls below the approximation-plus-bias contribution. In the multi-pass setting, by contrast, once $L$ and $\gamma$ are fixed, increasing $B$ leaves the GD bias–variance terms unchanged and only decreases the fluctuation. This makes larger batches statistically attractive until the fluctuation term falls below the common GD reference contribution, with without-replacement sampling being especially appealing in the large-batch regime because $\rho_{N,B}$ is strictly smaller than $1/B$ when $B>1$ and vanishes at full batch.

#### Proof sketch\.

The proofs follow the same overall blueprint as Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)), together with a batch version of the covariance-iterate arguments used by Wu et al. ([2022](https://arxiv.org/html/2605.24316#bib.bib7)). For one-pass batch SGD, Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1) and the mean-centered decomposition $e_{t}=m_{t}+\delta_{t}$ give

$$\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})\bigr]=R_{M}(u^{\ast})+\mathrm{Bias}_{B}+\mathrm{Var}_{B},$$

where the bias follows the deterministic recursion $m_{t}=(I-\gamma_{t}\Sigma)m_{t-1}$. The novel ingredient in the one-pass batch analysis is an exact split of the centered variance into a covariance-fluctuation term and an additive-noise term. Specifically, we write

$$\delta_{t}=q_{t}+v_{t},\qquad\mathrm{Var}_{B}=\mathrm{Var}_{B}^{\mathrm{cov}}+\mathrm{Var}_{B}^{\mathrm{noise}},$$

where $q_{t}$ is the centered covariance-fluctuation process and $v_{t}$ is the additive-noise process with

$$q_{t}=(I-\gamma_{t}\bar{Z}_{t})q_{t-1}+\gamma_{t}(\Sigma-\bar{Z}_{t})m_{t-1},\qquad v_{t}=(I-\gamma_{t}\bar{Z}_{t})v_{t-1}+\gamma_{t}\bar{\xi}_{t},$$

with $\bar{Z}_{t}$ the current batch covariance. We then bound $\mathrm{Var}_{B}^{\mathrm{cov}}$ and $\mathrm{Var}_{B}^{\mathrm{noise}}$ separately: the latter is handled by a batch analogue of the noise-recursion argument in Wu et al. ([2022](https://arxiv.org/html/2605.24316#bib.bib7)), while the former captures the extra randomness created by replacing $\Sigma$ with a random batch covariance. Together with the approximation and bias bounds, this yields the one-pass scaling law.
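The exactness of the split $e_{t}=m_{t}+q_{t}+v_{t}$ can be checked numerically. The sketch below assumes the standard least-squares error recursion $e_{t}=(I-\gamma\bar{Z}_{t})e_{t-1}+\gamma\bar{\xi}_{t}$ with $\bar{\xi}_{t}$ the batch average of $x_i\varepsilon_i$, a constant step size, and $q_0=v_0=0$; the problem instance (dimension, spectrum, seed) and the function name are hypothetical choices for illustration, not the paper's setup.

```python
import numpy as np


def run_one_pass_decomposition(M=8, B=4, T=50, gamma=0.05, sigma=1.0, seed=0):
    """Run the m_t, q_t, v_t recursions next to the full error recursion e_t."""
    rng = np.random.default_rng(seed)
    eigs = np.arange(1, M + 1, dtype=float) ** -2.0   # hypothetical power-law spectrum
    Sigma = np.diag(eigs)
    identity = np.eye(M)

    e = rng.normal(size=M)   # e_0 = u_0 - u*
    m = e.copy()             # m_0 = e_0 (mean of the error process)
    q = np.zeros(M)          # covariance-fluctuation process
    v = np.zeros(M)          # additive-noise process

    for _ in range(T):
        X = rng.normal(size=(B, M)) * np.sqrt(eigs)   # rows x_i ~ N(0, Sigma)
        eps = sigma * rng.normal(size=B)
        Z_bar = X.T @ X / B                           # batch covariance
        xi_bar = X.T @ eps / B                        # batch-averaged noise
        A = identity - gamma * Z_bar

        q = A @ q + gamma * (Sigma - Z_bar) @ m       # uses m_{t-1}, so update q first
        v = A @ v + gamma * xi_bar
        m = (identity - gamma * Sigma) @ m
        e = A @ e + gamma * xi_bar
    return e, m, q, v


if __name__ == "__main__":
    e, m, q, v = run_one_pass_decomposition()
    print("max |e - (m + q + v)| =", np.max(np.abs(e - (m + q + v))))
```

The printed discrepancy should be at the level of floating-point round-off, since summing the three recursions reproduces the recursion for $e_{t}$ term by term.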

For multi-pass methods, we keep the same GD reference path as in Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)) and modify only the fluctuation setup used to compare the stochastic batch iterate with normal GD. Writing

$$\Delta_{t}^{\rho}:=u_{t}^{\rho}-\theta_{t},$$

the perturbation follows the same general proof strategy as in Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)), except that the one-sample random update at each step is replaced by a batch-sampled update:

$$\Delta_{t}^{\mathrm{wr}}=(I-\gamma_{t}\widehat{\Sigma}_{t}^{(B)})\Delta_{t-1}^{\mathrm{wr}}+\gamma_{t}\xi_{t}^{(B)},\qquad\Delta_{t}^{\mathrm{wor}}=(I-\gamma_{t}\widehat{\Sigma}_{I_{t}}^{(B)})\Delta_{t-1}^{\mathrm{wor}}+\gamma_{t}\xi_{t,\mathrm{wor}}^{(B)}.$$

Thus we investigate the perturbation around normal GD using the same perturbative ideas as Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)), but the batch setup changes the covariance calculation of the driving noise. In the with-replacement case, the batch noise is an average of single-sample noises,

$$\xi_{t}^{(B)}=\frac{1}{B}\sum_{r=1}^{B}\zeta_{t}(i_{t,r}),$$

which produces the factor $1/B$; in the without-replacement case, the same argument is combined with the finite-population covariance identity, which replaces $1/B$ by $\rho_{N,B}$. The GD reference contributes the common deterministic terms, and substituting the appendix source-condition bounds into Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1) yields the two theorems.
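The two prefactors can be verified exactly on a small finite population. The sketch below is a scalar stand-in for the vector-valued noise: it enumerates every batch of size $B$ from a centered population and compares the variance of the batch mean with $\sigma_{\mathrm{pop}}^{2}/B$ (with replacement) and $\sigma_{\mathrm{pop}}^{2}\,\rho_{N,B}$ (without replacement). The population values and function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product


def rho_wor(N, B):
    """Finite-population prefactor rho_{N,B} = (N - B) / (B (N - 1))."""
    return Fraction(N - B, B * (N - 1))


def var_of_batch_mean(values, B, replace):
    """Exact variance of the batch mean over all equally likely batches of size B."""
    indices = range(len(values))
    batches = product(indices, repeat=B) if replace else combinations(indices, B)
    means = [sum(values[i] for i in batch) / B for batch in batches]
    center = sum(means) / len(means)
    return sum((x - center) ** 2 for x in means) / len(means)


if __name__ == "__main__":
    values = [Fraction(x) for x in (-3, -1, 0, 1, 3)]   # centered toy population
    N = len(values)
    pop_var = sum(x * x for x in values) / N            # population variance (1/N convention)
    for B in range(1, N + 1):
        wr = var_of_batch_mean(values, B, replace=True)
        wor = var_of_batch_mean(values, B, replace=False)
        print(B, wr == pop_var / B, wor == pop_var * rho_wor(N, B))
```

Exact rational arithmetic is used so that the identities hold with equality rather than up to sampling error; at $B=N$ the without-replacement variance is exactly zero.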

## 4Experiments

We evaluate the batch-dependent predictions of our theory in a synthetic sketched linear-regression model with a diagonal covariance. We fix an ambient dimension $d$, draw a Gaussian sketch $S\in\mathbb{R}^{M\times d}$ with i.i.d. $\mathcal{N}(0,1/M)$ entries, and generate data from $x\sim\mathcal{N}(0,\operatorname{diag}(\lambda_{1},\dots,\lambda_{d}))$, $\lambda_{i}=i^{-a}$, and $y=\langle x,w^{\ast}\rangle+\varepsilon$, with source-condition prior $\mathbb{E}[\lambda_{i}(w_{i}^{\ast})^{2}]\asymp i^{-b}$. In the implementation, conditioned on $(S,w^{\ast})$, we sample the sketched pair $(Sx,y)$ directly from its induced joint Gaussian law. Unless otherwise stated, we use $a=2$, $b=1.5$, $d=10^{4}$, $M=64$, $N=L=512$, $\sigma=1$, $\gamma=0.05$, and $100$ repetitions; full details are in Appendix [L](https://arxiv.org/html/2605.24316#A12).
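One way to sample $(Sx,y)$ from the induced joint Gaussian law is sketched below. It is not the authors' implementation: the prior $w_{i}^{\ast}\sim\mathcal{N}(0,i^{a-b})$ is an assumed choice that makes $\mathbb{E}[\lambda_{i}(w_{i}^{\ast})^{2}]=i^{-b}$ hold with equality, and the function names are hypothetical. With $z=Sx$, the code uses $\operatorname{Cov}(z)=S\Lambda S^{\top}$, $\operatorname{Cov}(z,y)=S\Lambda w^{\ast}$, and $\operatorname{Var}(y)=w^{\ast\top}\Lambda w^{\ast}+\sigma^{2}$, then draws $z$ and $y\mid z$.

```python
import numpy as np


def make_instance(d, M, a, b, rng):
    """Draw the spectrum, the sketch, and a target satisfying the assumed prior."""
    index = np.arange(1, d + 1, dtype=float)
    lam = index ** (-a)                                   # lambda_i = i^{-a}
    S = rng.normal(scale=1.0 / np.sqrt(M), size=(M, d))   # i.i.d. N(0, 1/M) entries
    w_star = rng.normal(size=d) * index ** ((a - b) / 2)  # assumed Gaussian prior
    return lam, S, w_star


def sample_sketched_pairs(lam, S, w_star, sigma, n, rng):
    """Draw n pairs (Sx, y) without materializing the d-dimensional x."""
    cov_z = (S * lam) @ S.T                      # Cov(Sx) = S diag(lam) S^T
    cross = S @ (lam * w_star)                   # Cov(Sx, y)
    var_y = float(lam @ w_star**2) + sigma**2    # Var(y)
    beta = np.linalg.solve(cov_z, cross)         # E[y | z] = beta^T z
    resid_var = var_y - float(cross @ beta)      # Var(y | z), at least sigma^2
    chol = np.linalg.cholesky(cov_z)
    Z = rng.normal(size=(n, S.shape[0])) @ chol.T
    y = Z @ beta + np.sqrt(resid_var) * rng.normal(size=n)
    return Z, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lam, S, w_star = make_instance(d=2000, M=64, a=2.0, b=1.5, rng=rng)
    Z, y = sample_sketched_pairs(lam, S, w_star, sigma=1.0, n=512, rng=rng)
    print(Z.shape, y.shape)
```

Sampling in the sketched space costs $O(M^{2})$ per sample after a one-time $O(M^{2}d)$ setup, which is the practical reason to avoid drawing the full $d$-dimensional $x$.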

Because our theorems show that the explicit mini-batch covariance effect appears in the stochastic terms, we focus the experiments on the three claims that depend most sharply on $B$. We fix $(S,w^{\ast})$ across repetitions to isolate the sampling and optimization randomness, so the reported error bars quantify variability conditional on a representative sketched problem instance.

#### Experiment 1: one\-pass variance sweep\.

In the one-pass theorem, the explicit mini-batch covariance reduction appears in the centered variance term. Because $T=N/B$, changing $B$ also changes the effective horizon $T_{\mathrm{eff}}\gamma$. Accordingly, the first experiment directly measures the centered one-pass variance and, in panel (a), compares it with the predicted $1/(BT_{\mathrm{eff}})$-type upper-bound scaling.

#### Experiment 2: multi\-pass fluctuation sweep\.

In the multi-pass theorem, the deterministic GD contribution is common to with-replacement and without-replacement sampling, so the only sampling-rule-dependent term is the fluctuation. The second experiment is therefore designed to isolate that term and test whether its batch dependence matches the predicted prefactors $1/B$ and $\rho_{N,B}$.

#### Experiment 3: normalized fluctuation collapse\.

If the fluctuation scales as $1/B$ or $\rho_{N,B}$, then dividing by the corresponding batch prefactor should remove the leading dependence on $B$. The third experiment tests this collapse by plotting the normalized fluctuation curves across batch sizes; for without-replacement sampling, the point $B=N$ is omitted because $\rho_{N,N}=0$.
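The normalization step can be sketched as follows. The helper names are hypothetical, and the input values in the usage example are synthetic placeholders generated from the predicted prefactor itself (not measurements), chosen only to show that a perfectly proportional fluctuation collapses to a constant and that the $B=N$ point is dropped for without-replacement sampling.

```python
def batch_prefactor(N, B, replace):
    """Predicted fluctuation prefactor: 1/B with replacement, rho_{N,B} without."""
    return 1.0 / B if replace else (N - B) / (B * (N - 1))


def normalize_fluctuation(fluc_by_batch, N, replace):
    """Divide each fluctuation by its prefactor, skipping batches whose prefactor is 0."""
    normalized = {}
    for B, fluc in fluc_by_batch.items():
        rho = batch_prefactor(N, B, replace)
        if rho > 0:               # rho_{N,N} = 0: the point B = N cannot be normalized
            normalized[B] = fluc / rho
    return normalized


if __name__ == "__main__":
    N, scale = 512, 0.3           # `scale` is an arbitrary placeholder constant
    synthetic = {B: scale * batch_prefactor(N, B, replace=False) for B in (1, 8, 64, 512)}
    print(normalize_fluctuation(synthetic, N, replace=False))
```

With real measurements the normalized values would only be approximately flat, since the prefactor captures the leading batch dependence of an upper bound rather than an exact equality.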

![Refer to caption](https://arxiv.org/html/2605.24316v1/experiments/01_one_pass_variance.png)

\(a\) One\-pass variance sweep

![Refer to caption](https://arxiv.org/html/2605.24316v1/experiments/02_multipass_fluctuation.png)

\(b\) Multi\-pass fluctuation sweep

![Refer to caption](https://arxiv.org/html/2605.24316v1/experiments/03_normalized_fluctuation_collapse.png)

\(c\) Normalized fluctuation collapse

Figure 1: Empirical validation of the batch-dependent stochastic terms. Panel (a) plots the empirical one-pass centered variance against $B$, together with a rescaled reference curve of order $\sum_{j=1}^{M}\min\{1,T_{\mathrm{eff}}\gamma\mu_{j}(\Sigma)\}/(B\,T_{\mathrm{eff}})$; the reference curve is multiplied by a single constant chosen to match the $B=1$ point. Panels (b) and (c) compare the multi-pass fluctuation for with-replacement and without-replacement sampling against the predicted batch prefactors $1/B$ and $\rho_{N,B}=(N-B)/(B(N-1))$. Error bars denote one standard deviation over repetitions.

The results align with our theoretical predictions to a large extent. Panel (a) shows that the one-pass centered variance decreases steadily with the batch size, in line with the predicted $1/(BT_{\mathrm{eff}})$ decay once the effective-dimension factor is taken into account. The dashed reference is an upper-bound curve rather than an exact asymptotic equality, so the relevant comparison is the shape and order of decay, not pointwise equality. In particular, the empirical variance staying below the rescaled reference is exactly what one should expect from the theorem. Panel (b) directly tests the multi-pass fluctuation term. The with-replacement curve follows the predicted $1/B$ decay closely, while the without-replacement curve decays faster in the large-batch regime. This is precisely the behavior predicted by Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2): the fluctuation prefactor is $1/B$ for with-replacement sampling and $\rho_{N,B}<1/B$ when $B>1$ for without-replacement sampling. The without-replacement point $B=N$ is omitted from the plot because $\rho_{N,N}=0$. Panel (c) removes these batch prefactors and plots $\mathrm{Fluc}^{\mathrm{wr}}_{B}/(1/B)$ and $\mathrm{Fluc}^{\mathrm{wor}}_{B}/\rho_{N,B}$. The two normalized curves are substantially flatter than the unnormalized curves, indicating that the leading dependence on $B$ has been largely removed. Overall, these three experiments together support the paper's main message: batch size does not change the leading deterministic exponents, but it quantitatively controls the stochastic terms, and without-replacement sampling has the smaller fluctuation scale $\rho_{N,B}$.

## 5Conclusion

We studied batch scaling laws for sketched linear regression trained by SGD under a power-law covariance spectrum and a source condition. Across one-pass batch SGD and multi-pass batch SGD with and without replacement, we derived a unified risk decomposition that separates approximation, bias, variance, and fluctuation, and used it to identify exactly how batch size enters the excess risk. Our results show that batching preserves the leading approximation and optimization-bias exponents while changing only the stochastic terms: in the one-pass setting the variance term is $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$, so the familiar $1/B$ gain applies at fixed update count but is partly offset at fixed dataset size by the shorter horizon $T=N/B$; in the multi-pass setting the only difference between with-replacement and without-replacement sampling is the fluctuation scale. The without-replacement method is strictly less noisy when $B>1$ and recovers deterministic gradient descent when $B=N$. Simulations support these theoretical predictions.

Several directions remain open\. One option is to extend the analysis beyond sketched linear regression to richer nonlinear or feature\-learning models, while relaxing the current assumptions\. It would also be interesting to study broader optimization settings, such as adaptive step sizes, momentum, or more general data\-reuse schemes, and to develop joint scaling laws that optimize batch size together with model dimension, sample size, and compute in more realistic training regimes\.

## References

- Revisiting neural scaling laws in language and vision\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 22300–22312\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p1.1)\.
- A\. Atanasov, J\. A\. Zavatone\-Veth, and C\. Pehlevan \(2024\)Scaling and renormalization in high\-dimensional regression\.arXiv preprint arXiv:2405\.00592\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- Y\. Bahri, E\. Dyer, J\. Kaplan, J\. Lee, and U\. Sharma \(2024\)Explaining neural scaling laws\.Proceedings of the National Academy of Sciences121\(27\),pp\. e2311878121\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- T\. Besiroglu, E\. Erdil, M\. Barnett, and J\. You \(2024\)Chinchilla scaling: a replication attempt\.arXiv preprint arXiv:2404\.10102\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p1.1)\.
- B\. Bordelon, A\. Atanasov, and C\. Pehlevan \(2024\)A dynamical model of neural scaling laws\.InProceedings of the 41st International Conference on Machine Learning,Vol\.235,pp\. 4345–4382\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- B\. Bordelon, A\. Atanasov, and C\. Pehlevan \(2025\)How feature learning can improve neural scaling laws\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- E\. Dohmatob, Y\. Feng, P\. Yang, F\. Charton, and J\. Kempe \(2024\)A tale of tails: model collapse as a change of scaling laws\.InProceedings of the 41st International Conference on Machine Learning,Vol\.235,pp\. 11165–11197\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- R\. Ge, S\. M\. Kakade, R\. Kidambi, and P\. Netrapalli \(2019\)The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p4.3)\.
- N\. Golmant, N\. Vemuri, Z\. Yao, V\. Feinberg, A\. Gholami, K\. Rothauge, M\. W\. Mahoney, and J\. Gonzalez \(2019\)On the computational inefficiency of large batch sizes for stochastic gradient descent\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p3.1)\.
- P\. Goyal, P\. Dollár, R\. Girshick, P\. Noordhuis, L\. Wesolowski, A\. Kyrola, A\. Tulloch, Y\. Jia, and K\. He \(2017\)Accurate, large minibatch SGD: training ImageNet in 1 hour\.arXiv preprint arXiv:1706\.02677\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p3.1),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px2.p1.5)\.
- T\. Henighan, J\. Kaplan, M\. Katz, M\. Chen, C\. Hesse, J\. Jackson, H\. Jun, T\. B\. Brown, P\. Dhariwal, S\. Gray, C\. Hallacy, B\. Mann, A\. Radford, A\. Ramesh, D\. M\. Ziegler, D\. Amodei, and S\. McCandlish \(2020\)Scaling laws for autoregressive generative modeling\.arXiv preprint arXiv:2010\.14701\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p1.1)\.
- J\. Hestness, S\. Narang, N\. Ardalani, G\. Diamos, H\. Jun, H\. Kianinejad, Md\. M\. A\. Patwary, Y\. Yang, and Y\. Zhou \(2017\)Deep learning scaling is predictable, empirically\.arXiv preprint arXiv:1712\.00409\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p1.1)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. de Las Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark, T\. Hennigan, E\. Noland, K\. Millican, G\. van den Driessche, B\. Damoc, A\. Guy, S\. Osindero, K\. Simonyan, E\. Elsen, J\. W\. Rae, O\. Vinyals, and L\. Sifre \(2022\)Training compute\-optimal large language models\.InAdvances in Neural Information Processing Systems,Vol\.35\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p1.1)\.
- M\. Hutter \(2021\)Learning curve theory\.arXiv preprint arXiv:2102\.04074\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- P\. Jain, S\. M\. Kakade, R\. Kidambi, P\. Netrapalli, and A\. Sidford \(2017\)Parallelizing stochastic gradient descent for least squares regression: mini\-batching, averaging, and model misspecification\.Journal of Machine Learning Research18\(223\),pp\. 1–42\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p4.3)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p1.1)\.
- J\. Lin and L\. Rosasco \(2017\)Optimal rates for multi\-pass stochastic gradient methods\.Journal of Machine Learning Research18\(97\),pp\. 1–47\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p4.3)\.
- L\. Lin, J\. Wu, and P\. L\. Bartlett \(2025\)Improved scaling laws in linear regression via data reuse\.InAdvances in Neural Information Processing Systems,Cited by:[Appendix A](https://arxiv.org/html/2605.24316#A1.p2.1),[Appendix K](https://arxiv.org/html/2605.24316#A11.p1.1),[§D\.1](https://arxiv.org/html/2605.24316#A4.SS1.3.p3.5),[§I\.1](https://arxiv.org/html/2605.24316#A9.SS1.5.p5.4),[§I\.2](https://arxiv.org/html/2605.24316#A9.SS2.1.p1.3),[§I\.3](https://arxiv.org/html/2605.24316#A9.SS3.3.p2.1),[Lemma I\.5](https://arxiv.org/html/2605.24316#A9.Thmlemma5),[2nd item](https://arxiv.org/html/2605.24316#S1.I1.i2.p1.4),[§1](https://arxiv.org/html/2605.24316#S1.p2.1),[§1](https://arxiv.org/html/2605.24316#S1.p5.3),[§2](https://arxiv.org/html/2605.24316#S2.SS0.SSS0.Px6.p3.3),[§2](https://arxiv.org/html/2605.24316#S2.p1.2),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px1.p1.8),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px4.p1.1),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px4.p2.4),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px4.p2.5),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px4.p2.6),[§3](https://arxiv.org/html/2605.24316#S3.p2.1),[§3](https://arxiv.org/html/2605.24316#S3.p6.5)\.
- L\. Lin, J\. Wu, S\. M\. Kakade, P\. L\. Bartlett, and J\. D\. Lee \(2024\)Scaling laws in linear regression: compute, parameters, and data\.InAdvances in Neural Information Processing Systems,Vol\.37\.Cited by:[Appendix A](https://arxiv.org/html/2605.24316#A1.p2.1),[Appendix K](https://arxiv.org/html/2605.24316#A11.p1.1),[§1](https://arxiv.org/html/2605.24316#S1.p2.1),[§1](https://arxiv.org/html/2605.24316#S1.p5.3),[§2](https://arxiv.org/html/2605.24316#S2.SS0.SSS0.Px1.p2.5),[§2](https://arxiv.org/html/2605.24316#S2.SS0.SSS0.Px6.p3.3),[§2](https://arxiv.org/html/2605.24316#S2.p1.2),[§2](https://arxiv.org/html/2605.24316#S2.p2.2),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px1.p1.8),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px4.p1.1),[§3](https://arxiv.org/html/2605.24316#S3.p2.1),[§3](https://arxiv.org/html/2605.24316#S3.p6.5)\.
- A\. Maloney, D\. A\. Roberts, and J\. Sully \(2022\)A solvable model of neural scaling laws\.arXiv preprint arXiv:2210\.16859\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- S\. McCandlish, J\. Kaplan, D\. Amodei, and O\. Team \(2018\)An empirical model of large\-batch training\.arXiv preprint arXiv:1812\.06162\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p3.1),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px2.p1.5)\.
- E\. J\. Michaud, Z\. Liu, U\. Girit, and M\. Tegmark \(2023\)The quantization model of neural scaling\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- N\. Mücke, G\. Neu, and L\. Rosasco \(2019\)Beating SGD saturation with tail\-averaging and minibatching\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p4.3)\.
- N\. Muennighoff, A\. Rush, B\. Barak, T\. Le Scao, N\. Tazi, A\. Piktus, S\. Pyysalo, T\. Wolf, and C\. A\. Raffel \(2023\)Scaling data\-constrained language models\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p1.1)\.
- E\. Paquette, C\. Paquette, L\. Xiao, and J\. Pennington \(2024\)4\+3 phases of compute\-optimal neural scaling laws\.InAdvances in Neural Information Processing Systems,Vol\.37\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p1.1),[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- L\. Pillaud\-Vivien, A\. Rudi, and F\. Bach \(2018\)Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes\.InAdvances in Neural Information Processing Systems,Vol\.31\.Cited by:[2nd item](https://arxiv.org/html/2605.24316#S1.I1.i2.p1.4),[§1](https://arxiv.org/html/2605.24316#S1.p4.3)\.
- Y\. Ren, E\. Nichani, D\. Wu, and J\. D\. Lee \(2025\)Emergence and scaling laws in SGD learning of shallow neural networks\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- J\. S\. Rosenfeld, A\. Rosenfeld, Y\. Belinkov, and N\. Shavit \(2020\)A constructive prediction of the generalization error across scales\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p1.1)\.
- C\. J\. Shallue, J\. Lee, J\. Antognini, J\. Sohl\-Dickstein, R\. Frostig, and G\. E\. Dahl \(2019\)Measuring the effects of data parallelism on neural network training\.Journal of Machine Learning Research20\(112\),pp\. 1–49\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p3.1),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px2.p1.5)\.
- U\. Sharma and J\. Kaplan \(2020\)A neural scaling law from the dimension of the data manifold\.arXiv preprint arXiv:2004\.10802\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p2.1)\.
- X\. Shuai, Y\. Wang, Y\. Wu, X\. Jiang, and X\. Ren \(2024\)Scaling law for language models training considering batch size\.arXiv preprint arXiv:2412\.01505\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p3.1)\.
- S\. L\. Smith, P\. Kindermans, C\. Ying, and Q\. V\. Le \(2018\)Don’t decay the learning rate, increase the batch size\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p4.3),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px2.p1.5)\.
- S\. L\. Smith and Q\. V\. Le \(2018\)A Bayesian perspective on generalization and stochastic gradient descent\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p4.3),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px2.p1.5)\.
- J\. Wu, D\. Zou, V\. Braverman, Q\. Gu, and S\. M\. Kakade \(2022\)The power and limitation of pretraining\-finetuning for linear regression under covariate shift\.InAdvances in Neural Information Processing Systems,Vol\.35\.Cited by:[Appendix A](https://arxiv.org/html/2605.24316#A1.p2.1),[§1](https://arxiv.org/html/2605.24316#S1.p4.3),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px4.p1.1),[§3](https://arxiv.org/html/2605.24316#S3.SS0.SSS0.Px4.p1.8)\.
- Y\. You, I\. Gitman, and B\. Ginsburg \(2017\)Large batch training of convolutional networks\.arXiv preprint arXiv:1708\.03888\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p3.1)\.
- X\. Zhai, A\. Kolesnikov, N\. Houlsby, and L\. Beyer \(2022\)Scaling vision transformers\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 12104–12113\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p1.1)\.
- D\. Zou, Y\. Cao, D\. Zhou, and Q\. Gu \(2021\)The benefits of implicit regularization from stochastic gradient descent in least squares problems\.InAdvances in Neural Information Processing Systems,Vol\.34,pp\. 29773–29785\.Cited by:[§1](https://arxiv.org/html/2605.24316#S1.p4.3)\.

From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression: Supplementary Material

Table of Contents

- [A Appendix Preliminaries](https://arxiv.org/html/2605.24316#A1)
  - [A.1 Block notation](https://arxiv.org/html/2605.24316#A1.SS1)
  - [A.2 Notations and setup formulas for normal GD](https://arxiv.org/html/2605.24316#A1.SS2)
  - [A.3 Notations and setup formulas for one-pass batch SGD](https://arxiv.org/html/2605.24316#A1.SS3)
  - [A.4 Notations and setup formulas for multi-pass batch SGD with replacement](https://arxiv.org/html/2605.24316#A1.SS4)
  - [A.5 Notations and setup formulas for multi-pass batch SGD without replacement](https://arxiv.org/html/2605.24316#A1.SS5)
- [B Assembling the proofs of the main scaling theorems](https://arxiv.org/html/2605.24316#A2)
  - [B.1 Assembling the proof of the one-pass theorem](https://arxiv.org/html/2605.24316#A2.SS1)
  - [B.2 Assembling the proof of the multi-pass theorem](https://arxiv.org/html/2605.24316#A2.SS2)
- [C Approximation Error](https://arxiv.org/html/2605.24316#A3)
  - [C.1 Upper bound](https://arxiv.org/html/2605.24316#A3.SS1)
  - [C.2 Lower bound](https://arxiv.org/html/2605.24316#A3.SS2)
  - [C.3 Bounds under our assumptions](https://arxiv.org/html/2605.24316#A3.SS3)
- [D Bias Error under Normal GD](https://arxiv.org/html/2605.24316#A4)
  - [D.1 Upper and lower bounds](https://arxiv.org/html/2605.24316#A4.SS1)
  - [D.2 Example under the source condition](https://arxiv.org/html/2605.24316#A4.SS2)
- [E Variance Error under Normal GD](https://arxiv.org/html/2605.24316#A5)
  - [E.1 Upper and lower bounds](https://arxiv.org/html/2605.24316#A5.SS1)
  - [E.2 Example under the source condition](https://arxiv.org/html/2605.24316#A5.SS2)
- [F One-pass Batch SGD: Excess Error Decomposition](https://arxiv.org/html/2605.24316#A6)
- [G Bias Error for One-pass Batch SGD](https://arxiv.org/html/2605.24316#A7)
  - [G.1 Upper and lower bounds for the bias term](https://arxiv.org/html/2605.24316#A7.SS1)
  - [G.2 Bounds under the source condition](https://arxiv.org/html/2605.24316#A7.SS2)
- [H Variance Error for One-pass Batch SGD](https://arxiv.org/html/2605.24316#A8)
  - [H.1 Upper and lower bounds for the exact variance components](https://arxiv.org/html/2605.24316#A8.SS1)
  - [H.2 The additive-noise component](https://arxiv.org/html/2605.24316#A8.SS2)
  - [H.3 Variance bound for the additive-noise component](https://arxiv.org/html/2605.24316#A8.SS3)
  - [H.4 The covariance-fluctuation component](https://arxiv.org/html/2605.24316#A8.SS4)
  - [H.5 Bounds under the source condition](https://arxiv.org/html/2605.24316#A8.SS5)
- [I Fluctuation Error under Multi-pass Batch SGD with Replacement](https://arxiv.org/html/2605.24316#A9)
  - [I.1 Upper bound result](https://arxiv.org/html/2605.24316#A9.SS1)
  - [I.2 Fluctuation error under the source condition](https://arxiv.org/html/2605.24316#A9.SS2)
  - [I.3 Lemmas to prove the upper bound](https://arxiv.org/html/2605.24316#A9.SS3)
- [J Fluctuation Error under Multi-pass Batch SGD without Replacement](https://arxiv.org/html/2605.24316#A10)
- [K Collected Auxiliary Lemmas](https://arxiv.org/html/2605.24316#A11)
  - [K.1 General concentration lemmas](https://arxiv.org/html/2605.24316#A11.Thmlemma1)
  - [K.2 Power-law auxiliary lemmas](https://arxiv.org/html/2605.24316#A11.Thmlemma7)
- [L Experimental setup](https://arxiv.org/html/2605.24316#A12)
- [M Limitations and Broader Effects](https://arxiv.org/html/2605.24316#A13)

## Appendix A Appendix Preliminaries

We collect the common notation used throughout the appendix. Later appendix sections refer back to this section whenever possible, instead of restating the full stochastic-update setup each time.

The framework and many of the core ideas in Appendices [A](https://arxiv.org/html/2605.24316#A1)–[G](https://arxiv.org/html/2605.24316#A7) and [K](https://arxiv.org/html/2605.24316#A11) are borrowed from Lin et al. \[[2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)\]. Appendices [H](https://arxiv.org/html/2605.24316#A8)–[J](https://arxiv.org/html/2605.24316#A10) contain the main technical contributions of this paper, although some of the proofs there are inspired by Lin et al. \[[2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)\] and Wu et al. \[[2022](https://arxiv.org/html/2605.24316#bib.bib7)\].

Throughout the appendix, whenever a procedure is run for $L_{\mathrm{run}}$ updates, we use the same blockwise geometric learning-rate schedule $\gamma_t=\gamma/2^{\ell}$ on the $\ell$-th block of $L_{\mathrm{run,eff}}:=L_{\mathrm{run}}/\log L_{\mathrm{run}}$ consecutive updates (up to the obvious endpoint rounding); thus $L_{\mathrm{run}}=T=N/B$ for one-pass batch SGD and $L_{\mathrm{run}}=L$ for normal GD and the two multi-pass methods.
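The schedule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: blocks are indexed from $\ell=0$, the logarithm is natural, and the block length is rounded down; the text fixes none of these details beyond "the obvious endpoint rounding". The function name `blockwise_geometric_schedule` is hypothetical.

```python
import math


def blockwise_geometric_schedule(gamma, num_updates):
    """Return [gamma_1, ..., gamma_L] with gamma_t = gamma / 2**ell on block ell."""
    if num_updates < 2:
        raise ValueError("need at least 2 updates so that log(num_updates) > 0")
    # Block length L_eff = L / log L, rounded down and kept at least 1.
    # (Natural log, floor rounding and ell starting at 0 are assumptions.)
    block_len = max(1, int(num_updates / math.log(num_updates)))
    return [gamma / 2 ** ((t - 1) // block_len) for t in range(1, num_updates + 1)]


# Example: L_run = 100 updates gives blocks of 21 updates each.
rates = blockwise_geometric_schedule(gamma=0.5, num_updates=100)
```

For one-pass batch SGD one would call it with `num_updates = N // B`; for normal GD and the multi-pass methods with `num_updates = L`.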

### A.1 Block notation

For integers $0\leq k_{\ast}\leq k$ (allowing $k=\infty$), define

$$H_{k_{\ast}:k}:=\operatorname{diag}(\lambda_{k_{\ast}+1},\dots,\lambda_{k}),\qquad w_{k_{\ast}:k}:=(w_{k_{\ast}+1},\dots,w_{k})^{\top}.$$

Similarly, let $S_{k_{\ast}:k}$ denote the submatrix of $S$ consisting of columns $k_{\ast}+1,\dots,k$.
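For readers who think in array terms, the block notation corresponds to half-open slices. The sketch below assumes a finite truncation to $d$ coordinates (the case $k=\infty$ is not representable directly), and the helper name `block` and the toy arrays are hypothetical.

```python
import numpy as np


def block(lam, w, S, k_star, k):
    """Return (H_{k*:k}, w_{k*:k}, S_{k*:k}) for a finite truncation."""
    # Entries k*+1, ..., k in the 1-based indexing of the text are the
    # 0-based Python slice [k_star:k].
    return np.diag(lam[k_star:k]), w[k_star:k], S[:, k_star:k]


d, M = 6, 3
lam = np.arange(1, d + 1, dtype=float) ** -2.0   # hypothetical power-law spectrum
w = np.arange(1.0, d + 1)                        # toy coefficients 1, ..., 6
S = np.arange(M * d, dtype=float).reshape(M, d)  # toy "sketch" with readable entries
H_blk, w_blk, S_blk = block(lam, w, S, k_star=2, k=5)
```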

### A.2 Notations and setup formulas for normal GD

This subsection records the notation for the normal GD procedure, equation [1](https://arxiv.org/html/2605.24316#S2.E1). Define

$$\Sigma:=SHS^{\top},\qquad\widehat{\Sigma}:=\frac{SX^{\top}XS^{\top}}{N},\qquad u^{\ast}:=\Sigma^{-1}SHw^{\ast},\qquad\widehat{b}:=\frac{1}{N}SX^{\top}y.$$

Let

$$\widetilde{\varepsilon}_{i}:=y_{i}-x_{i}^{\top}S^{\top}u^{\ast},\qquad\widetilde{\varepsilon}:=(\widetilde{\varepsilon}_{1},\dots,\widetilde{\varepsilon}_{N})^{\top},\qquad\widehat{c}:=\frac{1}{N}SX^{\top}\widetilde{\varepsilon}.$$

The normal GD iterate satisfies

$$\theta_{t}=\theta_{t-1}-\gamma_{t}\widehat{\Sigma}\theta_{t-1}+\gamma_{t}\widehat{b},\qquad\theta_{0}=0.$$
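As a sanity check on the recursion, the following NumPy sketch builds $\widehat{\Sigma}$ and $\widehat{b}$ from synthetic data and confirms that one update coincides with a gradient step on the empirical objective $\frac{1}{2N}\|XS^{\top}\theta-y\|_2^2$. That objective is my reading of the recursion, not a quotation of equation 1, and all sizes and data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M = 50, 8, 4                       # hypothetical sample size and dimensions
X = rng.standard_normal((N, d))          # rows are x_i^T
y = rng.standard_normal(N)
S = rng.standard_normal((M, d)) / np.sqrt(M)

Z = X @ S.T                              # sketched features, row i is (S x_i)^T
Sigma_hat = Z.T @ Z / N                  # S X^T X S^T / N
b_hat = Z.T @ y / N                      # S X^T y / N


def gd_step(theta, gamma_t):
    """One update theta_t = theta_{t-1} - gamma_t Sigma_hat theta_{t-1} + gamma_t b_hat."""
    return theta - gamma_t * Sigma_hat @ theta + gamma_t * b_hat


def empirical_grad(theta):
    """Gradient of (1 / 2N) * ||Z theta - y||^2 (assumed objective)."""
    return Z.T @ (Z @ theta - y) / N


theta = rng.standard_normal(M)
step_a = gd_step(theta, 0.1)
step_b = theta - 0.1 * empirical_grad(theta)
```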

### A.3 Notations and setup formulas for one-pass batch SGD

This subsection records the notation for the one-pass batch SGD procedure, equation [2](https://arxiv.org/html/2605.24316#S2.E2). Assume throughout that $B\mid N$, and define

$$T:=\frac{N}{B}\geq 2,\qquad T_{\mathrm{eff}}:=\frac{T}{\log T},\qquad\Sigma:=SHS^{\top},\qquad u^{\ast}:=\Sigma^{-1}SHw^{\ast}.$$

Partition $[N]$ into disjoint batches $I_{1},\dots,I_{T}$ with $|I_{t}|=B$, and write each block as

$$I_{t}=\{i_{t,1},\dots,i_{t,B}\}.$$

Define

$$\widehat{\Sigma}_{t}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}x_{i}^{\top}S^{\top},\qquad\widehat{b}_{t}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}y_{i},$$

and

$$\widehat{\xi}_{t}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}\bigl(y_{i}-x_{i}^{\top}S^{\top}u^{\ast}\bigr).$$

The one-pass batch SGD iterate $(u_{t}^{\mathrm{op}})$ satisfies

$$u_{t}^{\mathrm{op}}=u_{t-1}^{\mathrm{op}}-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}u_{t-1}^{\mathrm{op}}+\gamma_{t}\widehat{b}_{t}^{(B)},\qquad u_{0}^{\mathrm{op}}=0.$$

With the centered error $e_{t}:=u_{t}^{\mathrm{op}}-u^{\ast}$, this becomes

$$e_{t}=\bigl(I-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}\bigr)e_{t-1}+\gamma_{t}\widehat{\xi}_{t}^{(B)}.$$

For later use, we also write

$$z_{t,b}:=Sx_{i_{t,b}},\qquad\bar{Z}_{t}:=\frac{1}{B}\sum_{b=1}^{B}z_{t,b}z_{t,b}^{\top},\qquad\bar{\xi}_{t}:=\frac{1}{B}\sum_{b=1}^{B}z_{t,b}\bigl(y_{i_{t,b}}-z_{t,b}^{\top}u^{\ast}\bigr),$$

so that equivalently

$$e_{t}=(I-\gamma_{t}\bar{Z}_{t})e_{t-1}+\gamma_{t}\bar{\xi}_{t}.$$

Since the blocks are disjoint subsets of an i.i.d. sample, the pairs $(\bar{Z}_{t},\bar{\xi}_{t})$ are independent across $t$. The mean error $m_{t}:=\mathbb{E}[e_{t}]$ therefore satisfies

$$m_{t}=(I-\gamma_{t}\Sigma)m_{t-1},\qquad m_{0}=-u^{\ast}.$$

Throughout the one-pass appendix sections, whenever Assumption [3](https://arxiv.org/html/2605.24316#Thmassumption3) is invoked, $L_{\mathrm{eff}}$ is replaced by $T_{\mathrm{eff}}$. The sample-level maximum-norm condition remains over the original $N$ samples, since the one-pass procedure still uses all $N$ observations, grouped into $T=N/B$ mini-batches.
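The one-pass recursion and its centered form can be run side by side. The sketch below is illustrative only: it uses a constant step size for brevity, synthetic data, and an arbitrary stand-in vector for $u^{\ast}$ (the algebraic identity $e_t=u_t^{\mathrm{op}}-u^{\ast}$ holds for any fixed reference vector, so the true $u^{\ast}=\Sigma^{-1}SHw^{\ast}$ is not needed to check it).

```python
import numpy as np

rng = np.random.default_rng(1)
N, B, d, M = 40, 5, 8, 3                 # hypothetical sizes with B dividing N
T = N // B
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
S = rng.standard_normal((M, d)) / np.sqrt(M)
u_star = rng.standard_normal(M)          # stand-in for u*, assumption of this sketch

# Disjoint batches I_1, ..., I_T: each sample index appears exactly once.
batches = rng.permutation(N).reshape(T, B)

gamma = 0.05                             # constant step instead of the blockwise schedule
u = np.zeros(M)                          # u_0 = 0
e = -u_star                              # e_0 = u_0 - u*
for I_t in batches:
    Zt = X[I_t] @ S.T                    # rows are z_{t,b}^T
    Sigma_t = Zt.T @ Zt / B              # batch covariance
    b_t = Zt.T @ y[I_t] / B
    xi_t = Zt.T @ (y[I_t] - Zt @ u_star) / B
    u = u - gamma * Sigma_t @ u + gamma * b_t
    e = (np.eye(M) - gamma * Sigma_t) @ e + gamma * xi_t
```

After the loop, `e` and `u - u_star` agree up to floating-point error, which is the content of the centered recursion.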

### A.4 Notations and setup formulas for multi-pass batch SGD with replacement

This subsection records the notation for the multi-pass batch SGD procedure with replacement, namely equation [3](https://arxiv.org/html/2605.24316#S2.E3). Define

$$\Sigma:=SHS^{\top},\qquad\widehat{\Sigma}:=\frac{SX^{\top}XS^{\top}}{N},\qquad u^{\ast}:=\Sigma^{-1}SHw^{\ast},\qquad\widehat{b}:=\frac{1}{N}SX^{\top}y.$$

Let

$$\widetilde{\varepsilon}_{i}:=y_{i}-x_{i}^{\top}S^{\top}u^{\ast},\qquad\widetilde{\varepsilon}:=(\widetilde{\varepsilon}_{1},\dots,\widetilde{\varepsilon}_{N})^{\top},\qquad\widehat{c}:=\frac{1}{N}SX^{\top}\widetilde{\varepsilon}.$$

At each step $t\in[L]$, sample

$$i_{t,1},\dots,i_{t,B}\stackrel{\mathrm{iid}}{\sim}\mathrm{unif}([N]),$$

and define

$$\widehat{\Sigma}_{t}^{(B)}:=\frac{1}{B}\sum_{r=1}^{B}Sx_{i_{t,r}}x_{i_{t,r}}^{\top}S^{\top},\qquad\widehat{b}_{t}^{(B)}:=\frac{1}{B}\sum_{r=1}^{B}Sx_{i_{t,r}}y_{i_{t,r}},$$

$$\widehat{c}_{t}^{(B)}:=\frac{1}{B}\sum_{r=1}^{B}Sx_{i_{t,r}}\widetilde{\varepsilon}_{i_{t,r}}.$$

The multi-pass batch SGD iterate with replacement and the normal GD iterate are

$$u_{t}^{\mathrm{wr}}=u_{t-1}^{\mathrm{wr}}-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}u_{t-1}^{\mathrm{wr}}+\gamma_{t}\widehat{b}_{t}^{(B)},\qquad u_{0}^{\mathrm{wr}}=0,$$

and

$$\theta_{t}=\theta_{t-1}-\gamma_{t}\widehat{\Sigma}\theta_{t-1}+\gamma_{t}\widehat{b},\qquad\theta_{0}=0.$$

Define the fluctuation process

$$\Delta_{t}:=u_{t}^{\mathrm{wr}}-\theta_{t},\qquad\Delta_{0}=0.$$

Then

$$\Delta_{t}=\bigl(I-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}\bigr)\Delta_{t-1}+\gamma_{t}\xi_{t}^{(B)},$$

where

$$\xi_{t}^{(B)}:=-\bigl(\widehat{\Sigma}_{t}^{(B)}-\widehat{\Sigma}\bigr)(\theta_{t-1}-u^{\ast})+\bigl(\widehat{c}_{t}^{(B)}-\widehat{c}\bigr).$$
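The fluctuation recursion follows from subtracting the GD update from the SGD update and using $\widehat{b}_t^{(B)}=\widehat{c}_t^{(B)}+\widehat{\Sigma}_t^{(B)}u^{\ast}$ and $\widehat{b}=\widehat{c}+\widehat{\Sigma}u^{\ast}$. A minimal numerical check is sketched below, assuming synthetic data, a constant step size, and an arbitrary stand-in for $u^{\ast}$ (the identity is algebraic and does not depend on its value).

```python
import numpy as np

rng = np.random.default_rng(2)
N, B, d, M, L = 30, 4, 6, 3, 25          # hypothetical sizes and number of updates
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
S = rng.standard_normal((M, d)) / np.sqrt(M)
u_star = rng.standard_normal(M)          # stand-in for u*, assumption of this sketch

Z = X @ S.T                              # row i is (S x_i)^T
eps = y - Z @ u_star                     # residuals eps_tilde_i
Sigma_hat, b_hat, c_hat = Z.T @ Z / N, Z.T @ y / N, Z.T @ eps / N

gamma = 0.05                             # constant step for brevity
u, theta, delta = np.zeros(M), np.zeros(M), np.zeros(M)
for t in range(L):
    idx = rng.integers(0, N, size=B)     # i.i.d. uniform indices, repeats allowed
    Zt = Z[idx]
    Sigma_t = Zt.T @ Zt / B
    b_t = Zt.T @ y[idx] / B
    c_t = Zt.T @ eps[idx] / B
    # Driving noise, evaluated at theta_{t-1} (theta has not been updated yet).
    xi_t = -(Sigma_t - Sigma_hat) @ (theta - u_star) + (c_t - c_hat)
    delta = (np.eye(M) - gamma * Sigma_t) @ delta + gamma * xi_t
    u = u - gamma * Sigma_t @ u + gamma * b_t
    theta = theta - gamma * Sigma_hat @ theta + gamma * b_hat
```

The vector `delta` propagated through the recursion matches `u - theta` computed from the two iterates directly.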

### A.5 Notations and setup formulas for multi-pass batch SGD without replacement

This subsection records the notation for the multi-pass batch SGD procedure without replacement, namely equation [4](https://arxiv.org/html/2605.24316#S2.E4). Here we retain the common dataset-level quantities $\Sigma$, $\widehat{\Sigma}$, $u^{\ast}$, $\widehat{b}$, $\widetilde{\varepsilon}$, and $\widehat{c}$ from Section [A.4](https://arxiv.org/html/2605.24316#A1.SS4). At each step $t\in[L]$, we sample a subset

$$I_{t}\subset[N],\qquad|I_{t}|=B,$$

uniformly without replacement from $[N]$, independently across iterations. Define

$$\widehat{\Sigma}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}x_{i}^{\top}S^{\top},\qquad\widehat{b}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}y_{i},$$

and

$$\widehat{c}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}\widetilde{\varepsilon}_{i}.$$

The multi-pass batch SGD iterate without replacement is

$$u_{t}^{\mathrm{wor}}=u_{t-1}^{\mathrm{wor}}-\gamma_{t}\widehat{\Sigma}_{I_{t}}^{(B)}u_{t-1}^{\mathrm{wor}}+\gamma_{t}\widehat{b}_{I_{t}}^{(B)},\qquad u_{0}^{\mathrm{wor}}=0.$$

Define the multi-pass fluctuation process

$$\Delta_{t}^{\mathrm{wor}}:=u_{t}^{\mathrm{wor}}-\theta_{t},\qquad\Delta_{0}^{\mathrm{wor}}=0.$$

Then

$$\Delta_{t}^{\mathrm{wor}}=\bigl(I-\gamma_{t}\widehat{\Sigma}_{I_{t}}^{(B)}\bigr)\Delta_{t-1}^{\mathrm{wor}}+\gamma_{t}\xi_{t,\mathrm{wor}}^{(B)},$$

where

$$\xi_{t,\mathrm{wor}}^{(B)}:=-\bigl(\widehat{\Sigma}_{I_{t}}^{(B)}-\widehat{\Sigma}\bigr)(\theta_{t-1}-u^{\ast})+\bigl(\widehat{c}_{I_{t}}^{(B)}-\widehat{c}\bigr).$$
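The difference between the two sampling rules shows up already for scalar batch means. The sketch below enumerates every equally likely batch over a small hypothetical dataset and compares the exact variance of the batch mean with $1/B$ (with replacement) and $\rho_{N,B}=(N-B)/(B(N-1))$ (without replacement) times the population variance. It illustrates the classical finite-population identity that appears to underlie the prefactor quoted in the abstract; it is not the paper's proof for the full fluctuation covariance.

```python
from fractions import Fraction
from itertools import combinations, product


def rho(N, B):
    """Without-replacement prefactor rho_{N,B} = (N - B) / (B (N - 1))."""
    return Fraction(N - B, B * (N - 1))


def batch_mean_variance(values, B, with_replacement):
    """Exact variance of the batch mean, enumerating all equally likely batches."""
    N = len(values)
    if with_replacement:
        draws = product(range(N), repeat=B)      # ordered i.i.d. index tuples
    else:
        draws = combinations(range(N), B)        # uniformly random subsets
    means = [Fraction(sum(values[i] for i in draw), B) for draw in draws]
    mu = sum(means) / len(means)
    return sum((m - mu) ** 2 for m in means) / len(means)


values = [3, -1, 4, 1, -5, 9]                    # hypothetical scalar data
N = len(values)
pop_mean = Fraction(sum(values), N)
pop_var = sum((v - pop_mean) ** 2 for v in values) / N
```

Exact rational arithmetic is used so that the comparison is an equality rather than a floating-point approximation; at $B=N$ the without-replacement variance is exactly zero.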

### A.6 Proof of the main risk decomposition

We record here the proof of the structural decomposition used later in Section [3](https://arxiv.org/html/2605.24316#S3), since it only uses the basic identities from the present section.

###### Proof of Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1).

Since $u^{\ast}$ minimizes the sketched risk $R_{M}$, for every $u\in\mathbb{R}^{M}$ one has

$$R_{M}(u)=R_{M}(u^{\ast})+\|\Sigma^{1/2}(u-u^{\ast})\|_{2}^{2}.$$

For one-pass batch SGD, let $\bar{u}_{T}:=\mathbb{E}[u_{T}^{\mathrm{op}}]$. Applying the previous display with $u=u_{T}^{\mathrm{op}}$ and taking expectation gives

$$\mathbb{E}[R_{M}(u_{T}^{\mathrm{op}})]=R_{M}(u^{\ast})+\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{T}^{\mathrm{op}}-u^{\ast})\|_{2}^{2}\bigr].$$

Now write

$$u_{T}^{\mathrm{op}}-u^{\ast}=(\bar{u}_{T}-u^{\ast})+(u_{T}^{\mathrm{op}}-\bar{u}_{T}).$$

Since $\mathbb{E}[u_{T}^{\mathrm{op}}-\bar{u}_{T}]=0$, the cross term vanishes, so

$$\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{T}^{\mathrm{op}}-u^{\ast})\|_{2}^{2}\bigr]=\bigl\|\Sigma^{1/2}(\bar{u}_{T}-u^{\ast})\bigr\|_{2}^{2}+\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{T}^{\mathrm{op}}-\bar{u}_{T})\|_{2}^{2}\bigr].$$

Since $u^{\ast}$ minimizes $R_{M}$, the first term is exactly

$$R_{M}(\bar{u}_{T})-R_{M}(u^{\ast}).$$

Moreover,

$$R_{M}(u_{T}^{\mathrm{op}})-R_{M}(\bar{u}_{T})=2\bigl\langle\Sigma^{1/2}(\bar{u}_{T}-u^{\ast}),\Sigma^{1/2}(u_{T}^{\mathrm{op}}-\bar{u}_{T})\bigr\rangle+\bigl\|\Sigma^{1/2}(u_{T}^{\mathrm{op}}-\bar{u}_{T})\bigr\|_{2}^{2},$$

so taking expectation and using $\mathbb{E}[u_{T}^{\mathrm{op}}-\bar{u}_{T}]=0$ gives

$$\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})-R_{M}(\bar{u}_{T})\bigr]=\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{T}^{\mathrm{op}}-\bar{u}_{T})\|_{2}^{2}\bigr].$$

This proves the one-pass decomposition in excess-risk form.

For either multi-pass sampling rule, the same identity gives

$$\mathbb{E}[R_{M}(u_{L}^{\rho})]=R_{M}(u^{\ast})+\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{L}^{\rho}-u^{\ast})\|_{2}^{2}\bigr],$$

where $u_{L}^{\rho}$ denotes either $u_{L}^{\mathrm{wr}}$ or $u_{L}^{\mathrm{wor}}$. Writing

$$u_{L}^{\rho}-u^{\ast}=(\theta_{L}-u^{\ast})+(u_{L}^{\rho}-\theta_{L})$$

and expanding the squared norm produces the cross term

$$\mathbb{E}\bigl[\langle\Sigma^{1/2}(\theta_{L}-u^{\ast}),\Sigma^{1/2}(u_{L}^{\rho}-\theta_{L})\rangle\bigr].$$

This term vanishes because the fluctuation process has zero conditional mean: from the recursions defining $\Delta_{t}=u_{t}^{\mathrm{wr}}-\theta_{t}$ and $\Delta_{t}^{\mathrm{wor}}=u_{t}^{\mathrm{wor}}-\theta_{t}$, the driving noise at each step is conditionally centered given $(S,D)$, so

$$\mathbb{E}[u_{L}^{\rho}-\theta_{L}\mid S,D]=0.$$

Hence

$$\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{L}^{\rho}-u^{\ast})\|_{2}^{2}\bigr]=\mathbb{E}\bigl[\|\Sigma^{1/2}(\theta_{L}-u^{\ast})\|_{2}^{2}\bigr]+\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{L}^{\rho}-\theta_{L})\|_{2}^{2}\bigr],$$

where the first term is

$$\mathbb{E}\bigl[R_{M}(\theta_{L})-R_{M}(u^{\ast})\bigr]$$

because $u^{\ast}$ minimizes $R_{M}$, and the second term is

$$\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})-R_{M}(\theta_{L})\bigr]$$

because the corresponding cross term vanishes by the same conditional-centering argument. This proves the multi-pass decomposition in excess-risk form. ∎
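The quadratic identity that opens the proof can be checked numerically. The sketch below assumes a finite-dimensional well-specified Gaussian model, $x\sim\mathcal{N}(0,H)$ and $y=x^{\top}w^{\ast}+\varepsilon$ with noise variance $\sigma^{2}$, under which $R_M(u)=\sigma^{2}+(w^{\ast}-S^{\top}u)^{\top}H(w^{\ast}-S^{\top}u)$. This model, the dimensions, and the spectrum are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, M, sigma2 = 10, 4, 0.3                        # hypothetical dimensions and noise level
lam = np.arange(1, d + 1, dtype=float) ** -1.5   # hypothetical power-law spectrum
H = np.diag(lam)
w_star = rng.standard_normal(d)
S = rng.standard_normal((M, d)) / np.sqrt(M)

Sigma = S @ H @ S.T
u_star = np.linalg.solve(Sigma, S @ H @ w_star)  # u* = Sigma^{-1} S H w*


def sketched_risk(u):
    """R_M(u) under the assumed Gaussian model."""
    r = w_star - S.T @ u
    return sigma2 + r @ H @ r


u = rng.standard_normal(M)                       # arbitrary test point
lhs = sketched_risk(u)
rhs = sketched_risk(u_star) + (u - u_star) @ Sigma @ (u - u_star)
```

The two sides agree because the cross term is proportional to $SH(w^{\ast}-S^{\top}u^{\ast})=SHw^{\ast}-\Sigma u^{\ast}=0$, which is the first-order optimality condition for $u^{\ast}$.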

## Appendix B Assembling the proofs of the main scaling theorems

This section gathers the appendix ingredients used in the proofs of Theorems [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1) and [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2). The detailed derivations are carried out in the subsequent appendix sections; here we simply record how those results fit together. Throughout, we work on the intersection of the high-probability events from the cited lemmas. Since only finitely many such results are invoked, a union bound still yields probability at least

$$1-\exp(-\Omega(M)).$$
### B.1 Assembling the proof of Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1)

Start from Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1):

$$\mathbb{E}[R_{M}(u_{T}^{\mathrm{op}})]=R_{M}(u^{\ast})+\bigl(R_{M}(\bar{u}_{T})-R_{M}(u^{\ast})\bigr)+\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})-R_{M}(\bar{u}_{T})\bigr].$$

The three terms are supplied by the appendix as follows.

1. Common baseline risk. Appendix [C](https://arxiv.org/html/2605.24316#A3) defines
   $$\mathrm{Approx}:=R_{M}(u^{\ast})-R(w^{\ast}).$$
   Under Assumption [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2), $R(w^{\ast})=\sigma^{2}$, and therefore
   $$R_{M}(u^{\ast})=\sigma^{2}+\mathrm{Approx}.$$
   Lemma [C.3](https://arxiv.org/html/2605.24316#A3.Thmlemma3) then gives
   $$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\asymp M^{1-b}.$$
2. One-pass bias term. In Appendix [F](https://arxiv.org/html/2605.24316#A6), the mean error satisfies $m_{T}=\bar{u}_{T}-u^{\ast}$, so Definition [F.1](https://arxiv.org/html/2605.24316#A6.Thmdefinition1) identifies
   $$R_{M}(\bar{u}_{T})-R_{M}(u^{\ast})=\|\Sigma^{1/2}(\bar{u}_{T}-u^{\ast})\|_{2}^{2}=\mathrm{Bias}_{B}.$$
   Lemma [G.3](https://arxiv.org/html/2605.24316#A7.Thmlemma3) yields the general upper bound
   $$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\lesssim\min\bigl\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\bigr\}^{1-b}.$$
   Moreover, when $(T_{\mathrm{eff}}\gamma)^{1/a}\leq M/c_{1}$, the same lemma gives the matching lower bound
   $$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\gtrsim(T_{\mathrm{eff}}\gamma)^{(1-b)/a}.$$
3. One-pass variance term. Proposition [F.1](https://arxiv.org/html/2605.24316#A6.Thmproposition1) and Proposition [F.2](https://arxiv.org/html/2605.24316#A6.Thmproposition2) identify the remaining term with the one-pass variance quantity $\mathrm{Var}_{B}$. Lemma [H.3](https://arxiv.org/html/2605.24316#A8.Thmlemma3) then yields
   $$\mathbb{E}_{w^{\ast}}[\mathrm{Var}_{B}]\lesssim\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}}.$$
   Here the factor $1/B$ should be read as the covariance reduction at a fixed number of updates. Since one-pass batch SGD runs for $T=N/B$ updates, the full variance dependence on $B$ at fixed dataset size $N$ is given by the displayed bound rather than by a standalone $1/B$ law.

From items (1) and (2), the deterministic contribution obeys

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}+\mathrm{Bias}_{B}]=\Theta\bigl(M^{1-b}\bigr)+\Theta\Bigl((T_{\mathrm{eff}}\gamma)^{(1-b)/a}\Bigr).$$

Indeed, when $(T_{\mathrm{eff}}\gamma)^{1/a}\leq M/c_{1}$, the lower bound on $\mathrm{Bias}_{B}$ gives the second scale directly; when $(T_{\mathrm{eff}}\gamma)^{1/a}>M/c_{1}$, one has $(T_{\mathrm{eff}}\gamma)^{(1-b)/a}\lesssim M^{1-b}$, so the approximation lower bound already controls that term. Substituting this deterministic bound and item (3) into the risk decomposition gives the first display in Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1). The simplified regime $1<b\leq a$ follows by comparing the variance order from Lemma [H.3](https://arxiv.org/html/2605.24316#A8.Thmlemma3) with the deterministic orders above.
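The remark in item (3) about the dependence on $B$ at fixed $N$ can be made concrete by evaluating the variance order with $T=N/B$ substituted in. The sketch below drops all constants, uses the natural logarithm for $T_{\mathrm{eff}}$, and picks hypothetical values of $N$, $M$, $\gamma$, and $a$; it illustrates the order only and does not reproduce any experiment from the paper.

```python
import math


def one_pass_variance_order(N, B, M, gamma, a):
    """min(M, (T_eff * gamma)^(1/a)) / (B * T_eff) with T = N / B, constants dropped."""
    T = N // B                               # number of one-pass updates
    T_eff = T / math.log(T)                  # natural log is an assumption
    return min(M, (T_eff * gamma) ** (1.0 / a)) / (B * T_eff)


N, M, gamma, a = 2 ** 16, 10 ** 6, 0.5, 2.0  # hypothetical configuration
orders = {B: one_pass_variance_order(N, B, M, gamma, a) for B in (1, 4, 16, 64)}
ratio_1_to_4 = orders[1] / orders[4]         # would be 4 under a pure 1/B law
```

In this configuration the order still decreases as $B$ grows, but quadrupling $B$ reduces it by a factor well below 4, because $T_{\mathrm{eff}}$ shrinks at the same time.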

### B.2 Assembling the proof of Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2)

Again start from Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1):

$$\mathbb{E}[R_{M}(u_{L}^{\rho})]=R_{M}(u^{\ast})+\mathbb{E}\bigl[R_{M}(\theta_{L})-R_{M}(u^{\ast})\bigr]+\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})-R_{M}(\theta_{L})\bigr].$$

The three contributions are assembled as follows.

1. Common baseline risk. As in the one-pass proof above,
   $$R_{M}(u^{\ast})=\sigma^{2}+\mathrm{Approx},\qquad\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\asymp M^{1-b}$$
   by Assumption [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2) and Lemma [C.3](https://arxiv.org/html/2605.24316#A3.Thmlemma3).
2. Common GD-reference contribution. Sections [D](https://arxiv.org/html/2605.24316#A4) and [E](https://arxiv.org/html/2605.24316#A5) control the deterministic and stochastic pieces of the normal-GD reference iterate $\theta_{L}$. Their source-condition conclusions are Lemmas [D.3](https://arxiv.org/html/2605.24316#A4.Thmlemma3) and [E.2](https://arxiv.org/html/2605.24316#A5.Thmlemma2), namely
   $$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\asymp\min\bigl\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\bigr\}^{1-b},\qquad\mathrm{Var}_{\mathrm{GD}}\asymp\frac{\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}}{N}.$$
   These are exactly the orders labeled $\mathrm{GD\,Bias}$ and $\mathrm{GD\,Var}$ in Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2).
3. Sampling-rule-dependent fluctuation term. When $\rho=1/B$, Lemma [I.2](https://arxiv.org/html/2605.24316#A9.Thmlemma2) gives the upper bound for $\mathrm{Fluc}^{\mathrm{wr}}_{B}$. When $\rho=\rho_{N,B}$, Corollary [J.1](https://arxiv.org/html/2605.24316#A10.Thmcorollary1) transfers the same conclusion to $\mathrm{Fluc}^{\mathrm{wor}}_{B}$ after replacing $1/B$ by $\rho_{N,B}$. Hence, for either sampling rule,
   $$\mathbb{E}[\mathrm{Fluc}^{\rho}_{B}]\lesssim\rho\,\gamma\log N\left[(L_{\mathrm{eff}}\gamma)^{1/a-1}+\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N}\right].$$

Combining the baseline term, the two GD-reference orders, and the fluctuation estimate gives item (1) of Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2). Finally, items (2) and (3) follow by direct comparison of the orders in item (1).

## Appendix C Approximation Error

This section studies the approximation term, which is common to the four optimization procedures equation [1](https://arxiv.org/html/2605.24316#S2.E1), equation [2](https://arxiv.org/html/2605.24316#S2.E2), equation [3](https://arxiv.org/html/2605.24316#S2.E3), and equation [4](https://arxiv.org/html/2605.24316#S2.E4).

Retaining the common setup from Section [2](https://arxiv.org/html/2605.24316#S2), define

$$\mathrm{Approx}:=\min_{u\in\mathbb{R}^{M}}R_{M}(u)-\min_{w\in\mathcal{H}}R(w)=R_{M}(u^{\ast})-R(w^{\ast}),$$

which further yields, by substituting the value of $u^{\ast}$:

$$\mathrm{Approx}=\bigl\|(I-H^{1/2}S^{\top}\Sigma^{-1}SH^{1/2})H^{1/2}w^{\ast}\bigr\|_{2}^{2}.\tag{5}$$
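The closed form in equation (5) can be compared against the definition on a small synthetic instance. The sketch assumes a finite truncation to $d$ coordinates with diagonal $H$ and uses the fact that, in a well-specified model, the excess risk of a linear predictor $w$ over $w^{\ast}$ is $(w-w^{\ast})^{\top}H(w-w^{\ast})$; the dimensions and spectrum are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
d, M = 12, 5                                     # hypothetical truncation and sketch size
lam = np.arange(1, d + 1, dtype=float) ** -2.0   # hypothetical power-law spectrum
H = np.diag(lam)
H_half = np.diag(np.sqrt(lam))
w_star = rng.standard_normal(d)
S = rng.standard_normal((M, d)) / np.sqrt(M)

Sigma = S @ H @ S.T
u_star = np.linalg.solve(Sigma, S @ H @ w_star)

# Definition: excess risk of the best sketched predictor S^T u* over w*.
r = w_star - S.T @ u_star
approx_from_risk = r @ H @ r

# Closed form (5): squared norm of (I - P) H^{1/2} w*, with P an orthogonal projection.
P = H_half @ S.T @ np.linalg.solve(Sigma, S @ H_half)
approx_closed_form = np.sum(((np.eye(d) - P) @ H_half @ w_star) ** 2)
```

The matrix `P` is idempotent, which is what makes the later identity $W^{2}+V^{\top}V=-W$ in the proof of Lemma C.1 a one-line consequence.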
###### Assumption 4 (Gaussian sketching).

The sketching operator $S:\mathcal{H}\to\mathbb{R}^{M}$ is Gaussian, meaning that in the diagonal coordinates of $H$, its entries are i.i.d. distributed as $\mathcal{N}(0,1/M)$.

###### Assumption 5 (Source-condition regime).

In addition to Assumptions[1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3)and[2](https://arxiv.org/html/2605.24316#Thmassumption2), we work in the regime

### C.1 Upper bound

###### Lemma C.1 (Upper bound on the approximation error).

Fix any integer $k\geq 0$ such that $r(H)\geq k+M$, and define

$$A_{k}:=S_{k:\infty}H_{k:\infty}S_{k:\infty}^{\top}.$$

Then the approximation error satisfies the deterministic bound

$$\mathrm{Approx}\lesssim\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2}+w_{0:k}^{\ast\top}\bigl(H_{0:k}^{-1}+S_{0:k}^{\top}A_{k}^{-1}S_{0:k}\bigr)^{-1}w_{0:k}^{\ast}.$$

If in addition Assumption [4](https://arxiv.org/html/2605.24316#Thmassumption4) holds and $k\leq M/2$, then with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathrm{Approx}\lesssim\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2}+\beta_{k}\,\|w^{\ast}_{0:k}\|_{2}^{2},$$

where

$$\beta_{k}:=\frac{\sum_{i>k}\lambda_{i}}{M}+\lambda_{k+1}+\sqrt{\frac{\sum_{i>k}\lambda_{i}^{2}}{M}}.$$

###### Proof\.

For the deterministic identities below, it is enough to work in an eigenbasis ofHH, so we writeHHin diagonal form and use the block notation from Section[A\.1](https://arxiv.org/html/2605.24316#A1.SS1)\. When Assumption[4](https://arxiv.org/html/2605.24316#Thmassumption4)is invoked later for the high\-probability part, this is also the natural coordinate system by rotational invariance\. Set

𝒯:=H1/2​S⊤​Σ−1​S​H1/2−I\.\\mathcal\{T\}:=H^\{1/2\}S^\{\\top\}\\Sigma^\{\-1\}SH^\{1/2\}\-I\.Using the block decomposition induced by the split0​:​k0\\text\{:\}kandk​:​∞k\\text\{:\}\\infty, write

𝒯=\(UVV⊤W\),\\mathcal\{T\}=\\begin\{pmatrix\}U&V\\\\ V^\{\\top\}&W\\end\{pmatrix\},where

U:=H0:k1/2​S0:k⊤​Σ−1​S0:k​H0:k1/2−I,U:=H\_\{0:k\}^\{1/2\}S\_\{0:k\}^\{\\top\}\\Sigma^\{\-1\}S\_\{0:k\}H\_\{0:k\}^\{1/2\}\-I,V:=H0:k1/2​S0:k⊤​Σ−1​Sk:∞​Hk:∞1/2,V:=H\_\{0:k\}^\{1/2\}S\_\{0:k\}^\{\\top\}\\Sigma^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\},W:=Hk:∞1/2​Sk:∞⊤​Σ−1​Sk:∞​Hk:∞1/2−I\.W:=H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}\\Sigma^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}\-I\.Then equation[5](https://arxiv.org/html/2605.24316#A3.E5)gives

Approx=‖𝒯​H1/2​w∗‖22\.\\mathrm\{Approx\}=\\\|\\mathcal\{T\}H^\{1/2\}w^\{\\ast\}\\\|\_\{2\}^\{2\}\.By the inequality\(a\+b\)2≤2​a2\+2​b2\(a\+b\)^\{2\}\\leq 2a^\{2\}\+2b^\{2\},

Approx\\displaystyle\\mathrm\{Approx\}≤2​w0:k∗⊤​H0:k1/2​\(U2\+V​V⊤\)​H0:k1/2​w0:k∗\\displaystyle\\leq 2w\_\{0:k\}^\{\\ast\\top\}H\_\{0:k\}^\{1/2\}\(U^\{2\}\+VV^\{\\top\}\)H\_\{0:k\}^\{1/2\}w\_\{0:k\}^\{\\ast\}\+2​wk:∞∗⊤​Hk:∞1/2​\(W2\+V⊤​V\)​Hk:∞1/2​wk:∞∗\.\\displaystyle\\qquad\+2w\_\{k:\\infty\}^\{\\ast\\top\}H\_\{k:\\infty\}^\{1/2\}\(W^\{2\}\+V^\{\\top\}V\)H\_\{k:\\infty\}^\{1/2\}w\_\{k:\\infty\}^\{\\ast\}\.\(6\)
We first control the tail block\. Since

Σ=S0:k​H0:k​S0:k⊤\+Ak,\\Sigma=S\_\{0:k\}H\_\{0:k\}S\_\{0:k\}^\{\\top\}\+A\_\{k\},a direct calculation shows that

W2\+V⊤​V=−W\.W^\{2\}\+V^\{\\top\}V=\-W\.\(7\)Moreover,

0⪯Hk:∞1/2​Sk:∞⊤​Σ−1​Sk:∞​Hk:∞1/2⪯I,0\\preceq H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}\\Sigma^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}\\preceq I,so−I⪯W⪯0\-I\\preceq W\\preceq 0\. Combining this with equation[7](https://arxiv.org/html/2605.24316#A3.E7), we obtain

0⪯W2\+V⊤​V=−W⪯I,0\\preceq W^\{2\}\+V^\{\\top\}V=\-W\\preceq I,and therefore

wk:∞∗⊤​Hk:∞1/2​\(W2\+V⊤​V\)​Hk:∞1/2​wk:∞∗≤‖wk:∞∗‖Hk:∞2\.w\_\{k:\\infty\}^\{\\ast\\top\}H\_\{k:\\infty\}^\{1/2\}\(W^\{2\}\+V^\{\\top\}V\)H\_\{k:\\infty\}^\{1/2\}w\_\{k:\\infty\}^\{\\ast\}\\leq\\\|w^\{\\ast\}\_\{k:\\infty\}\\\|\_\{H\_\{k:\\infty\}\}^\{2\}\.\(8\)
We next control the head block\. Applying the Woodbury identity to

Σ−1=\(S0:k​H0:k​S0:k⊤\+Ak\)−1,\\Sigma^\{\-1\}=\(S\_\{0:k\}H\_\{0:k\}S\_\{0:k\}^\{\\top\}\+A\_\{k\}\)^\{\-1\},we get

Σ−1=Ak−1−Ak−1​S0:k​\(H0:k−1\+S0:k⊤​Ak−1​S0:k\)−1​S0:k⊤​Ak−1\.\\Sigma^\{\-1\}=A\_\{k\}^\{\-1\}\-A\_\{k\}^\{\-1\}S\_\{0:k\}\\bigl\(H\_\{0:k\}^\{\-1\}\+S\_\{0:k\}^\{\\top\}A\_\{k\}^\{\-1\}S\_\{0:k\}\\bigr\)^\{\-1\}S\_\{0:k\}^\{\\top\}A\_\{k\}^\{\-1\}\.Substituting this identity into the definitions ofUUandVV, we then have

U2\+V​V⊤=H0:k−1/2​\(H0:k−1\+S0:k⊤​Ak−1​S0:k\)−1​H0:k−1/2\.U^\{2\}\+VV^\{\\top\}=H\_\{0:k\}^\{\-1/2\}\\bigl\(H\_\{0:k\}^\{\-1\}\+S\_\{0:k\}^\{\\top\}A\_\{k\}^\{\-1\}S\_\{0:k\}\\bigr\)^\{\-1\}H\_\{0:k\}^\{\-1/2\}\.\(9\)Hence

w0:k∗⊤​H0:k1/2​\(U2\+V​V⊤\)​H0:k1/2​w0:k∗=w0:k∗⊤​\(H0:k−1\+S0:k⊤​Ak−1​S0:k\)−1​w0:k∗\.w\_\{0:k\}^\{\\ast\\top\}H\_\{0:k\}^\{1/2\}\(U^\{2\}\+VV^\{\\top\}\)H\_\{0:k\}^\{1/2\}w\_\{0:k\}^\{\\ast\}=w\_\{0:k\}^\{\\ast\\top\}\\bigl\(H\_\{0:k\}^\{\-1\}\+S\_\{0:k\}^\{\\top\}A\_\{k\}^\{\-1\}S\_\{0:k\}\\bigr\)^\{\-1\}w\_\{0:k\}^\{\\ast\}\.\(10\)Putting equation[8](https://arxiv.org/html/2605.24316#A3.E8)and equation[10](https://arxiv.org/html/2605.24316#A3.E10)into equation[6](https://arxiv.org/html/2605.24316#A3.E6)proves the first claim\.

For the high-probability bound, define

$$\beta_{k}:=\frac{\sum_{i>k}\lambda_{i}}{M}+\lambda_{k+1}+\sqrt{\frac{\sum_{i>k}\lambda_{i}^{2}}{M}}.$$

By Lemma [K.4](https://arxiv.org/html/2605.24316#A11.Thmlemma4), with probability at least $1-\exp(-\Omega(M))$,

$$\|A_{k}\|_{2}\lesssim\beta_{k}.$$

Equivalently,

$$A_{k}^{-1}\succeq c\,\beta_{k}^{-1}I$$

for some absolute constant $c>0$. Also, since $k\leq M/2$, a standard Gaussian covariance concentration bound gives

$$S_{0:k}^{\top}S_{0:k}\succeq c_{0}I_{k}$$

with probability at least $1-\exp(-\Omega(M))$, for some absolute constant $c_{0}>0$. Therefore,

$$S_{0:k}^{\top}A_{k}^{-1}S_{0:k}\succeq c\,\beta_{k}^{-1}S_{0:k}^{\top}S_{0:k}\succeq c^{\prime}\beta_{k}^{-1}I_{k}.$$

Hence

$$\bigl(H_{0:k}^{-1}+S_{0:k}^{\top}A_{k}^{-1}S_{0:k}\bigr)^{-1}\preceq\bigl(c^{\prime}\beta_{k}^{-1}I_{k}\bigr)^{-1}\lesssim\beta_{k}I_{k}.$$

Substituting this into the deterministic bound gives

$$\mathrm{Approx}\lesssim\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2}+\beta_{k}\|w^{\ast}_{0:k}\|_{2}^{2},$$

which completes the proof. ∎

### C.2 Lower bound

###### Lemma C.2 (Lower bound on the approximation error under the source condition).

Assume Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2), and let

$$H_{w}:=\mathbb{E}[w^{\ast}w^{\ast\top}].$$

Then, conditioned on the sketch matrix $S$,

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\gtrsim\sum_{i>M}\lambda_{i}i^{a-b}.$$

In particular, under Assumptions [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3) and [2](https://arxiv.org/html/2605.24316#Thmassumption2),

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\gtrsim M^{1-b}.$$

###### Proof\.

Reuse the block decomposition from the proof of Lemma [C.1](https://arxiv.org/html/2605.24316#A3.Thmlemma1). Since Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2) implies that $H$ and $H_{w}$ are diagonal and that the coordinates of $w^{\ast}$ are uncorrelated, the cross terms vanish after taking expectation over $w^{\ast}$, and we obtain

$$\begin{aligned}
\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]
&=\operatorname{tr}\bigl((U^{2}+VV^{\top})H_{0:k}H_{w,0:k}\bigr)+\operatorname{tr}\bigl((W^{2}+V^{\top}V)H_{k:\infty}H_{w,k:\infty}\bigr)\\
&\geq\operatorname{tr}\bigl((W^{2}+V^{\top}V)H_{k:\infty}H_{w,k:\infty}\bigr)\\
&=-\operatorname{tr}\bigl(WH_{k:\infty}H_{w,k:\infty}\bigr),
\end{aligned} \tag{11}$$

where the last identity uses equation [7](https://arxiv.org/html/2605.24316#A3.E7).

Define

$$P_{k}:=I-H_{k:\infty}^{1/2}S_{k:\infty}^{\top}A_{k}^{-1}S_{k:\infty}H_{k:\infty}^{1/2}.$$

Since

$$\Sigma=S_{0:k}H_{0:k}S_{0:k}^{\top}+A_{k}\succeq A_{k},$$

we have

$$\Sigma^{-1}\preceq A_{k}^{-1}.$$

The matrix

$$H_{k:\infty}^{1/2}S_{k:\infty}^{\top}A_{k}^{-1}S_{k:\infty}H_{k:\infty}^{1/2}$$

is an orthogonal projection onto the row space induced by the tail sketch, hence $P_{k}$ is also a projection matrix. Moreover,

$$H_{k:\infty}^{1/2}S_{k:\infty}^{\top}\Sigma^{-1}S_{k:\infty}H_{k:\infty}^{1/2}\preceq H_{k:\infty}^{1/2}S_{k:\infty}^{\top}A_{k}^{-1}S_{k:\infty}H_{k:\infty}^{1/2},$$

so

$$-W=I-H_{k:\infty}^{1/2}S_{k:\infty}^{\top}\Sigma^{-1}S_{k:\infty}H_{k:\infty}^{1/2}\succeq P_{k}.$$

Since $H_{k:\infty}H_{w,k:\infty}\succeq 0$, equation [11](https://arxiv.org/html/2605.24316#A3.E11) implies

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\geq\operatorname{tr}\bigl(P_{k}H_{k:\infty}H_{w,k:\infty}\bigr).$$

Finally, all eigenvalues of $P_{k}$ are either $0$ or $1$, with at most $M$ zeros. Applying Von Neumann's trace inequality to the last display, we obtain

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\geq\sum_{i>k+M}\mu_{i}(HH_{w}).$$

Since Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2) gives

$$\mu_{i}(HH_{w})=\lambda_{i}\,\mathbb{E}[(w_{i}^{\ast})^{2}]\asymp i^{-b}=\lambda_{i}i^{a-b},$$

we conclude that

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\gtrsim\sum_{i>k+M}\lambda_{i}i^{a-b}.$$

Setting $k=0$ gives

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\gtrsim\sum_{i>M}\lambda_{i}i^{a-b}.$$

Under the power-law assumption $\lambda_{i}\asymp i^{-a}$, this becomes

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\gtrsim\sum_{i>M}i^{-b}\asymp M^{1-b},$$

which completes the proof. ∎

### C.3 Bounds under our assumptions

###### Lemma C.3 (Approximation error under our assumptions).

Assume Assumptions [2](https://arxiv.org/html/2605.24316#Thmassumption2), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [5](https://arxiv.org/html/2605.24316#Thmassumption5). Then with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\asymp M^{1-b}.$$

###### Proof\.

We first prove the upper bound. Let $k=\lfloor M/2\rfloor$. By Lemma [C.1](https://arxiv.org/html/2605.24316#A3.Thmlemma1),

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\lesssim\mathbb{E}_{w^{\ast}}\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2}+\beta_{k}\,\mathbb{E}_{w^{\ast}}\|w^{\ast}_{0:k}\|_{2}^{2}$$

with probability at least $1-\exp(-\Omega(M))$. Under Assumptions [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3) and [2](https://arxiv.org/html/2605.24316#Thmassumption2),

$$\mathbb{E}_{w^{\ast}}\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2}=\sum_{i>k}\lambda_{i}\mathbb{E}[(w_{i}^{\ast})^{2}]\asymp\sum_{i>k}i^{-b}\asymp k^{1-b}.$$

Also, using $\lambda_{i}\asymp i^{-a}$,

$$\beta_{k}\lesssim\frac{\sum_{i>k}i^{-a}}{M}+k^{-a}+\sqrt{\frac{\sum_{i>k}i^{-2a}}{M}}\lesssim M^{-a}$$

when $k\asymp M$. Moreover,

$$\mathbb{E}_{w^{\ast}}\|w^{\ast}_{0:k}\|_{2}^{2}=\sum_{i\leq k}\mathbb{E}[(w_{i}^{\ast})^{2}]\asymp\sum_{i\leq k}i^{a-b}\lesssim k^{a-b+1},$$

where the last step uses Assumption [5](https://arxiv.org/html/2605.24316#Thmassumption5). Therefore,

$$\beta_{k}\,\mathbb{E}_{w^{\ast}}\|w^{\ast}_{0:k}\|_{2}^{2}\lesssim M^{-a}k^{a-b+1}\asymp M^{1-b}.$$

Since also $k^{1-b}\asymp M^{1-b}$, this proves

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\lesssim M^{1-b}.$$

For the lower bound, Lemma [C.2](https://arxiv.org/html/2605.24316#A3.Thmlemma2) gives

$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\gtrsim\sum_{i>M}\lambda_{i}i^{a-b}\asymp\sum_{i>M}i^{-b}\asymp M^{1-b}.$$

Combining the two bounds yields the claim. ∎
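The two power-law estimates used in this proof, $\sum_{i>M}i^{-b}\asymp M^{1-b}$ and $\beta_{k}\lesssim M^{-a}$ for $k\asymp M$, can be sanity-checked numerically. The following is a minimal sketch under stated assumptions: the spectrum is idealized as $\lambda_{i}=i^{-a}$ exactly, the exponents `a = 2.0`, `b = 1.5` are hypothetical choices, and the helper names `tail_sum` and `beta_k` are introduced here for illustration only.

```python
import math

def tail_sum(start, exponent, terms=50_000):
    """Approximate sum_{i > start} i^(-exponent) for exponent > 1.

    Uses a finite partial sum plus an integral estimate of the remainder.
    """
    end = start + terms
    partial = sum(i ** (-exponent) for i in range(start + 1, end + 1))
    remainder = end ** (1.0 - exponent) / (exponent - 1.0)
    return partial + remainder

def beta_k(k, M, a):
    """beta_k from the proof of Lemma C.1, with the ASSUMED spectrum lambda_i = i^(-a)."""
    return (tail_sum(k, a) / M
            + (k + 1) ** (-a)
            + math.sqrt(tail_sum(k, 2 * a) / M))

a, b = 2.0, 1.5  # hypothetical exponents (b > 1 so the tail sum converges)
for M in (64, 256, 1024):
    k = M // 2
    approx_ratio = tail_sum(M, b) / M ** (1.0 - b)  # should stay bounded in M
    beta_ratio = beta_k(k, M, a) / M ** (-a)        # should stay bounded in M
    print(M, round(approx_ratio, 3), round(beta_ratio, 3))
```

Both ratios should stay within a constant range as $M$ grows, which is all the $\asymp$ and $\lesssim$ statements assert; the constants themselves depend on $a$ and $b$.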

## Appendix D Bias Error under Normal GD

This section focuses on the normal GD procedure, equation [1](https://arxiv.org/html/2605.24316#S2.E1). Similarly, we retain the notation of Section [2](https://arxiv.org/html/2605.24316#S2). In particular,

$$\theta_{t}=\theta_{t-1}-\gamma_{t}\widehat{\Sigma}\theta_{t-1}+\gamma_{t}\widehat{b},\qquad\theta_{0}=0,$$

with

$$\Sigma:=SHS^{\top},\qquad\widehat{\Sigma}:=\frac{1}{N}SX^{\top}XS^{\top},\qquad u^{\ast}:=\Sigma^{-1}SHw^{\ast},\qquad\widehat{b}:=\frac{1}{N}SX^{\top}y.$$

Define

$$\widetilde{\varepsilon}_{i}:=y_{i}-x_{i}^{\top}S^{\top}u^{\ast},\qquad\widetilde{\varepsilon}:=(\widetilde{\varepsilon}_{1},\dots,\widetilde{\varepsilon}_{N})^{\top},\qquad\widehat{c}:=\frac{1}{N}SX^{\top}\widetilde{\varepsilon},$$

and introduce the shorthand

$$C_{L}:=\prod_{t=1}^{L}(I-\gamma_{t}\widehat{\Sigma}),\qquad V(\widehat{\Sigma}):=\frac{1}{N}\sum_{t=1}^{L}\gamma_{t}\prod_{i=t+1}^{L}(I-\gamma_{i}\widehat{\Sigma}).$$

Then

$$\theta_{L}-u^{\ast}=-C_{L}u^{\ast}+V(\widehat{\Sigma})SX^{\top}\widetilde{\varepsilon}.$$

Accordingly, the GD bias term is

$$\mathrm{Bias}_{\mathrm{GD}}(w^{\ast}):=\mathbb{E}_{X}\bigl[\|\Sigma^{1/2}C_{L}u^{\ast}\|_{2}^{2}\bigr].$$
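The unrolled identity $\theta_{L}-u^{\ast}=-C_{L}u^{\ast}+V(\widehat{\Sigma})SX^{\top}\widetilde{\varepsilon}$ is purely algebraic and can be checked by running the recursion directly. The sketch below does so in the simplest setting, a one-dimensional sketch ($M=1$), so that every matrix becomes a scalar. The data model, the stepsize schedule, and the value of `u_star` are hypothetical; the identity holds for any reference point, not only the population minimizer.

```python
import random

random.seed(0)
N, L = 20, 15
u_star = 0.7  # hypothetical scalar stand-in for u^*
z = [random.gauss(0.0, 1.0) for _ in range(N)]            # sketched features S x_i (scalars, assumed)
y = [u_star * zi + random.gauss(0.0, 0.3) for zi in z]    # hypothetical noisy labels
gammas = [0.2 / (1 + t // 5) for t in range(L)]           # hypothetical decaying stepsizes

sigma_hat = sum(zi * zi for zi in z) / N                  # \widehat{\Sigma}
b_hat = sum(zi * yi for zi, yi in zip(z, y)) / N          # \widehat{b}
resid = [yi - zi * u_star for zi, yi in zip(z, y)]        # \widetilde{\varepsilon}_i

# Run theta_t = theta_{t-1} - gamma_t * sigma_hat * theta_{t-1} + gamma_t * b_hat from theta_0 = 0.
theta = 0.0
for g in gammas:
    theta = theta - g * sigma_hat * theta + g * b_hat

# Closed form built from the shorthand C_L and V(sigma_hat).
C_L = 1.0
for g in gammas:
    C_L *= 1.0 - g * sigma_hat
V = 0.0
for t, g in enumerate(gammas):
    prod = 1.0
    for g_later in gammas[t + 1:]:
        prod *= 1.0 - g_later * sigma_hat
    V += g * prod
V /= N
closed_form = -C_L * u_star + V * sum(zi * ri for zi, ri in zip(z, resid))

gd_error = theta - u_star
print(abs(gd_error - closed_form))  # agreement up to floating-point rounding
```

The first term, $-C_{L}u^{\ast}$, is the part that survives when the residuals vanish, which is why it alone defines the bias.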
### D.1 Upper and lower bounds

###### Lemma D.1 (Upper bound on the GD bias term).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose

$$L_{\mathrm{eff}}\lesssim N^{a}/\gamma.$$

Fix any integer $k\leq M/3$ such that $\operatorname{rank}(H)\geq k+M$, and define

$$A_{k}:=S_{k:\infty}H_{k:\infty}S_{k:\infty}^{\top},\qquad\widetilde{k}:=\lceil N/2\rceil,\qquad\Sigma_{\widetilde{k}:\infty}:=S_{\widetilde{k}:\infty}H_{\widetilde{k}:\infty}S_{\widetilde{k}:\infty}^{\top}.$$

Then with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})\lesssim\frac{\|w^{\ast}_{0:k}\|_{2}^{2}}{L_{\mathrm{eff}}\gamma}\left(\frac{\mu_{M/2}(A_{k})}{\mu_{M}(A_{k})}\right)^{2}+\overline{B}\,\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2},$$

where

$$\overline{B}:=1+\frac{(L_{\mathrm{eff}}\gamma)^{2}\operatorname{tr}(\Sigma_{\widetilde{k}:\infty})^{2}}{N^{2}}+\|\Sigma_{\widetilde{k}:\infty}\|_{2}^{2}+\frac{\operatorname{tr}(\Sigma_{\widetilde{k}:\infty}^{2})}{N}+\sqrt{\frac{\operatorname{tr}(\Sigma_{\widetilde{k}:\infty}^{4})}{N}}.$$

###### Proof\.

By rotational invariance of Gaussian sketching, we may work in the diagonal coordinates of $H$ and use the block notation from Section [A.1](https://arxiv.org/html/2605.24316#A1.SS1). Define

$$M_{L}:=C_{L}\Sigma C_{L}.$$

Substituting

$$SH=(S_{0:k}H_{0:k},\,S_{k:\infty}H_{k:\infty})$$

into the identity

$$\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})=\mathbb{E}_{X}\bigl[w^{\ast\top}HS^{\top}\Sigma^{-1}M_{L}\Sigma^{-1}SHw^{\ast}\bigr]$$

and splitting head and tail blocks gives

$$\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})\leq 2T_{1}+2T_{2},$$

where

$$T_{1}:=\mathbb{E}_{X}\bigl[w_{0:k}^{\ast\top}H_{0:k}S_{0:k}^{\top}\Sigma^{-1}M_{L}\Sigma^{-1}S_{0:k}H_{0:k}w_{0:k}^{\ast}\bigr],$$

and

$$T_{2}:=\mathbb{E}_{X}\bigl[w_{k:\infty}^{\ast\top}H_{k:\infty}S_{k:\infty}^{\top}\Sigma^{-1}M_{L}\Sigma^{-1}S_{k:\infty}H_{k:\infty}w_{k:\infty}^{\ast}\bigr].$$

For the head term,

$$T_{1}\leq\mathbb{E}_{X}\|M_{L}\|_{2}\cdot\|\Sigma^{-1}S_{0:k}H_{0:k}\|_{2}^{2}\cdot\|w_{0:k}^{\ast}\|_{2}^{2}.$$

Applying Lemma [K.1](https://arxiv.org/html/2605.24316#A11.Thmlemma1) and spectral calculus, we have

$$\mathbb{E}_{X}\|M_{L}\|_{2}\lesssim\frac{1}{L_{\mathrm{eff}}\gamma},$$

while Lemma [K.6](https://arxiv.org/html/2605.24316#A11.Thmlemma6) gives

$$\|\Sigma^{-1}S_{0:k}H_{0:k}\|_{2}\lesssim\frac{\mu_{M/2}(A_{k})}{\mu_{M}(A_{k})}$$

with probability at least $1-\exp(-\Omega(M))$. Hence

$$T_{1}\lesssim\frac{\|w_{0:k}^{\ast}\|_{2}^{2}}{L_{\mathrm{eff}}\gamma}\left(\frac{\mu_{M/2}(A_{k})}{\mu_{M}(A_{k})}\right)^{2}.$$

For the tail term, define

$$\mathcal{B}:=\mathbb{E}_{X}\bigl[\Sigma^{-1/2}M_{L}\Sigma^{-1/2}\bigr].$$

Then

$$T_{2}\leq\|\mathcal{B}\|_{2}\cdot\|H_{k:\infty}^{1/2}S_{k:\infty}^{\top}\Sigma^{-1}S_{k:\infty}H_{k:\infty}^{1/2}\|_{2}\cdot\|w_{k:\infty}^{\ast}\|_{H_{k:\infty}}^{2}.$$

The middle operator norm is at most one, while the same calculation as in Appendix B.1 of Lin et al. [[2025](https://arxiv.org/html/2605.24316#bib.bib1)] gives

$$\|\mathcal{B}\|_{2}\lesssim\overline{B}.$$

Combining the estimates for $T_{1}$ and $T_{2}$ proves the claim. ∎

###### Lemma D.2 (Lower bound on the GD bias term).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [3](https://arxiv.org/html/2605.24316#Thmassumption3). Let

$$H_{w}:=\mathbb{E}_{w^{\ast}}[w^{\ast}w^{\ast\top}],\qquad\Sigma_{w}:=SHH_{w}HS^{\top}.$$

Then with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\gtrsim\sum_{i=2\tau+1}^{M}\frac{\mu_{3i}(\Sigma_{w})}{\mu_{i}(\Sigma)},$$

where

$$\tau:=\mathbb{E}_{X}\Bigl[\#\bigl\{i\in[M]:\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}>1/4\bigr\}\Bigr].$$

###### Proof\.

Set $C_{L}:=\prod_{t=1}^{L}(I-\gamma_{t}\widehat{\Sigma})$. We have

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]=\operatorname{tr}\!\Bigl(\mathbb{E}_{X}\bigl[\Sigma^{-1/2}C_{L}\Sigma C_{L}^{\top}\Sigma^{-1/2}\bigr]\cdot\Sigma^{-1/2}\Sigma_{w}\Sigma^{-1/2}\Bigr).$$

Moreover,

$$\mathbb{E}_{X}\bigl[\Sigma^{-1/2}C_{L}\Sigma C_{L}^{\top}\Sigma^{-1/2}\bigr]\succeq\Sigma^{-1/2}\,\mathbb{E}_{X}[C_{L}]\,\Sigma\,\mathbb{E}_{X}[C_{L}]^{\top}\Sigma^{-1/2},$$

since $\mathbb{E}_{X}[(C_{L}-\mathbb{E}_{X}[C_{L}])\Sigma(C_{L}-\mathbb{E}_{X}[C_{L}])^{\top}]\succeq 0$. Applying this PSD lower bound, followed by a spectral truncation and Von Neumann trace argument, yields

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\gtrsim\sum_{i=2\tau+1}^{M}\mu_{i}\bigl(\Sigma^{-1/2}\Sigma_{w}\Sigma^{-1/2}\bigr).$$

Lastly, the spectral comparison yields

$$\mu_{i}\bigl(\Sigma^{-1/2}\Sigma_{w}\Sigma^{-1/2}\bigr)\gtrsim\frac{\mu_{3i}(\Sigma_{w})}{\mu_{i}(\Sigma)}.$$

Substituting this estimate into the previous display proves the lemma. ∎

### D.2 Example under the source condition

###### Lemma D.3 (Bias bounds under the source condition).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [2](https://arxiv.org/html/2605.24316#Thmassumption2), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose

$$a>b-1,\qquad L_{\mathrm{eff}}\lesssim N^{a}/\gamma.$$

Then there exists an $(a,b)$-dependent constant $c>0$ such that, whenever

$$\gamma\leq\frac{c}{\log N},$$

we have with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\asymp\min\!\bigl\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\bigr\}^{1-b}.$$

###### Proof\.

We first verify that the ingredients entering Lemmas [D.1](https://arxiv.org/html/2605.24316#A4.Thmlemma1) and [D.2](https://arxiv.org/html/2605.24316#A4.Thmlemma2) are all of the claimed order.

By Lemma [K.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7),

$$\mu_{i}(\Sigma)\asymp i^{-a}\qquad\text{for }i\in[M]$$

with probability at least $1-\exp(-\Omega(M))$. Therefore,

$$\operatorname{tr}(\Sigma^{2})\lesssim 1,\qquad\sum_{i=1}^{M}\frac{\mu_{i}(\Sigma)}{\mu_{i}(\Sigma)+1/(L_{\mathrm{eff}}\gamma)}\lesssim(L_{\mathrm{eff}}\gamma)^{1/a}.$$

Next, Lemma [K.8](https://arxiv.org/html/2605.24316#A11.Thmlemma8) implies

$$\frac{\mu_{M/2}(A_{k})}{\mu_{M}(A_{k})}\lesssim 1$$

for every admissible $k$, while the power-law tail bounds give

$$\operatorname{tr}(\Sigma_{\widetilde{k}:\infty})\lesssim N^{1-a},\qquad\|\Sigma_{\widetilde{k}:\infty}\|_{2}\lesssim N^{-a},\qquad\operatorname{tr}(\Sigma_{\widetilde{k}:\infty}^{2})\lesssim N^{1-2a},\qquad\operatorname{tr}(\Sigma_{\widetilde{k}:\infty}^{4})\lesssim N^{1-4a}.$$

Hence the quantity $\overline{B}$ in Lemma [D.1](https://arxiv.org/html/2605.24316#A4.Thmlemma1) satisfies

$$\overline{B}\lesssim 1$$

whenever $L_{\mathrm{eff}}\lesssim N^{a}/\gamma$.

For the upper bound, choose

$$k:=\min\!\bigl\{M/3,(L_{\mathrm{eff}}\gamma)^{1/a}\bigr\}.$$

Using Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2),

$$\mathbb{E}_{w^{\ast}}\|w_{0:k}^{\ast}\|_{2}^{2}\asymp\sum_{i=1}^{k}i^{a-b},\qquad\mathbb{E}_{w^{\ast}}\|w_{k:\infty}^{\ast}\|_{H_{k:\infty}}^{2}\asymp\sum_{i>k}i^{-b}.$$

Plugging these relations into Lemma [D.1](https://arxiv.org/html/2605.24316#A4.Thmlemma1) gives

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\lesssim\frac{1}{L_{\mathrm{eff}}\gamma}\sum_{i=1}^{k}i^{a-b}+\sum_{i>k}i^{-b}\lesssim k^{1-b}.$$

Since $k\asymp\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}$, this yields

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\lesssim\min\!\bigl\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\bigr\}^{1-b}.$$

For the lower bound, Lemma [D.2](https://arxiv.org/html/2605.24316#A4.Thmlemma2) and the estimate on $\tau$ (the quantity defined in Lemma [D.2](https://arxiv.org/html/2605.24316#A4.Thmlemma2)) give

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\gtrsim\sum_{i\gtrsim(L_{\mathrm{eff}}\gamma)^{1/a}}\frac{\mu_{3i}(\Sigma_{w})}{\mu_{i}(\Sigma)}.$$

Under Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2), the operator $HH_{w}H$ has eigenvalues of order $i^{-a-b}$, so Lemma [K.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7) applied to $HH_{w}H$ yields

$$\mu_{i}(\Sigma_{w})\asymp i^{-a-b}.$$

Combining this with $\mu_{i}(\Sigma)\asymp i^{-a}$ gives

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\gtrsim\sum_{i\gtrsim(L_{\mathrm{eff}}\gamma)^{1/a}}i^{-b}\gtrsim\min\!\bigl\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\bigr\}^{1-b}.$$

This completes the proof. ∎
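The choice $k\asymp(L_{\mathrm{eff}}\gamma)^{1/a}$ in the upper bound balances the head term $\frac{1}{L_{\mathrm{eff}}\gamma}\sum_{i\leq k}i^{a-b}$ against the tail term $\sum_{i>k}i^{-b}$, so that both are of order $k^{1-b}$. A minimal numerical sketch of this balance follows; it assumes the regime $(L_{\mathrm{eff}}\gamma)^{1/a}\leq M/3$, uses hypothetical exponents `a = 2.0`, `b = 1.5` (which satisfy $a>b-1$ and $b>1$), and the helper names are introduced here for illustration only.

```python
def head_sum(k, p):
    """sum_{i=1}^{k} i^p."""
    return sum(i ** p for i in range(1, k + 1))

def tail_sum(k, b, terms=50_000):
    """sum_{i>k} i^(-b) for b > 1: finite sum plus an integral estimate of the remainder."""
    end = k + terms
    partial = sum(i ** (-b) for i in range(k + 1, end + 1))
    return partial + end ** (1.0 - b) / (b - 1.0)

a, b = 2.0, 1.5  # hypothetical exponents with a > b - 1 and b > 1
ratios = []
for k in (16, 64, 256, 1024):
    L_gamma = float(k) ** a  # chosen so that k = (L_eff * gamma)^(1/a)
    bound = head_sum(k, a - b) / L_gamma + tail_sum(k, b)
    ratios.append(bound / k ** (1.0 - b))  # should stay bounded as k grows
print([round(r, 3) for r in ratios])
```

If the ratio drifted with $k$, the two terms would not be balanced; a flat ratio is consistent with the $k^{1-b}$ rate stated above.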

## Appendix E Variance Error under Normal GD

Retain the notation of Section [D](https://arxiv.org/html/2605.24316#A4). In particular,

$$V(\widehat{\Sigma}):=\frac{1}{N}\sum_{t=1}^{L}\gamma_{t}\prod_{i=t+1}^{L}(I-\gamma_{i}\widehat{\Sigma}),\qquad V_{L}:=I-\prod_{t=1}^{L}(I-\gamma_{t}\widehat{\Sigma}).$$

We define the normalized GD variance term by

$$\mathrm{Var}_{\mathrm{GD}}:=\mathbb{E}_{X}\Bigl[\operatorname{tr}\bigl(XS^{\top}V(\widehat{\Sigma})\Sigma V(\widehat{\Sigma})SX^{\top}\bigr)\Bigr].$$

Then using

$$V(\widehat{\Sigma})=\frac{1}{N}\Bigl(I-\prod_{t=1}^{L}(I-\gamma_{t}\widehat{\Sigma})\Bigr)\widehat{\Sigma}^{-1}=\frac{1}{N}V_{L}\widehat{\Sigma}^{-1},$$

we may rewrite

$$\mathrm{Var}_{\mathrm{GD}}=\frac{1}{N}\,\mathbb{E}_{X}\Bigl[\operatorname{tr}\bigl(\Sigma V_{L}\widehat{\Sigma}^{-1}V_{L}\bigr)\Bigr].$$
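The identity $V(\widehat{\Sigma})=\frac{1}{N}V_{L}\widehat{\Sigma}^{-1}$ is a telescoping sum: since every factor is a polynomial in $\widehat{\Sigma}$, it suffices to verify it eigenvalue by eigenvalue, that is, to check $s\sum_{t}\gamma_{t}\prod_{i>t}(1-\gamma_{i}s)=1-\prod_{t}(1-\gamma_{t}s)$ for a scalar $s$. The sketch below does this for a hypothetical stepsize schedule; the function names are illustrative and not taken from the paper.

```python
def n_times_v(s, gammas):
    """N * V evaluated at one eigenvalue s: sum_t gamma_t * prod_{i>t} (1 - gamma_i * s)."""
    total = 0.0
    for t, g in enumerate(gammas):
        prod = 1.0
        for g_later in gammas[t + 1:]:
            prod *= 1.0 - g_later * s
        total += g * prod
    return total

def v_L(s, gammas):
    """V_L evaluated at one eigenvalue s: 1 - prod_t (1 - gamma_t * s)."""
    prod = 1.0
    for g in gammas:
        prod *= 1.0 - g * s
    return 1.0 - prod

gammas = [0.3, 0.3, 0.15, 0.15, 0.075]  # hypothetical stepsize schedule
eigenvalues = (0.01, 0.1, 0.5, 1.0, 2.0)  # hypothetical eigenvalues of \widehat{\Sigma}
max_gap = max(abs(n_times_v(s, gammas) * s - v_L(s, gammas)) for s in eigenvalues)
print(max_gap)  # zero up to floating-point rounding
```

Each summand $\gamma_{t}s\prod_{i>t}(1-\gamma_{i}s)$ equals $\prod_{i>t}(1-\gamma_{i}s)-\prod_{i\geq t}(1-\gamma_{i}s)$, which is why the sum collapses.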
### E.1 Upper and lower bounds

###### Lemma E.1 (Upper and lower bounds on the GD variance term).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose

$$L_{\mathrm{eff}}\lesssim N^{a}/\gamma.$$

Then

$$\mathrm{Var}_{\mathrm{GD}}\lesssim\frac{D_{U}}{N},\qquad\mathrm{Var}_{\mathrm{GD}}\gtrsim\frac{D_{L}}{N},$$

where

$$D_{U}:=\mathbb{E}_{X}\Biggl[\#\bigl\{i\in[M]:\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}>1/4\bigr\}+(L_{\mathrm{eff}}\gamma_{0})\sum_{i:\,\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}\leq 1/4}\mu_{i}(\widehat{\Sigma})\Biggr],$$

and

$$D_{L}:=\mathbb{E}_{X}\Biggl[(L_{\mathrm{eff}}\gamma_{0})^{2}\sum_{i:\,\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}\leq 1/4}\mu_{i}(\Sigma)\mu_{i}(\widehat{\Sigma})+\frac{1}{5}\sum_{i:\,\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}>1/4}\frac{\mu_{i}(\Sigma)}{\mu_{i}(\widehat{\Sigma})}\Biggr].$$

###### Proof\.

We use the identity

$$\mathrm{Var}_{\mathrm{GD}}=\frac{1}{N}\,\mathbb{E}_{X}\bigl[\operatorname{tr}(\Sigma V_{L}\widehat{\Sigma}^{-1}V_{L})\bigr]$$

and fix any $\lambda>0$, to be chosen later. Then

$$\operatorname{tr}(\Sigma V_{L}\widehat{\Sigma}^{-1}V_{L})\leq\|\Sigma^{1/2}(\widehat{\Sigma}+\lambda I)^{-1/2}\|_{2}^{2}\cdot\operatorname{tr}\bigl(V_{L}^{2}+\lambda\widehat{\Sigma}^{-1}V_{L}^{2}\bigr).$$

Note that the stepsize assumptions imply

$$V_{L}\preceq I-(I-2\gamma_{0}\widehat{\Sigma})^{L_{\mathrm{eff}}}.$$

Hence, by Bernoulli's inequality,

$$\operatorname{tr}\bigl(V_{L}^{2}+\lambda\widehat{\Sigma}^{-1}V_{L}^{2}\bigr)\lesssim\#\bigl\{i:\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}>1/4\bigr\}+(L_{\mathrm{eff}}\gamma_{0})\sum_{i:\,\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}\leq 1/4}\mu_{i}(\widehat{\Sigma})$$

after choosing $\lambda=(L_{\mathrm{eff}}\gamma)^{-1}\leq(L_{\mathrm{eff}}\gamma_{0})^{-1}$. Applying Lemma [K.1](https://arxiv.org/html/2605.24316#A11.Thmlemma1) then yields

$$\mathbb{E}_{X}\bigl[\operatorname{tr}(\Sigma V_{L}\widehat{\Sigma}^{-1}V_{L})\bigr]\lesssim D_{U},$$

which proves the upper bound.

For the lower bound, Von Neumann's trace inequality gives

$$\operatorname{tr}(\Sigma V_{L}\widehat{\Sigma}^{-1}V_{L})\geq\sum_{i=1}^{M}\mu_{i}(\Sigma)\mu_{i}(\widehat{\Sigma})\,\mu_{2(M-i)+1}\bigl(V_{L}^{2}\widehat{\Sigma}^{-2}\bigr).$$

Note that the scalar function

$$f(x):=\frac{(1-(1-\gamma_{0}x)^{L_{\mathrm{eff}}})^{2}}{x^{2}}$$

is decreasing on $[0,1/\gamma_{0}]$, and therefore

$$\mu_{2(M-i)+1}\bigl(V_{L}^{2}\widehat{\Sigma}^{-2}\bigr)\gtrsim\begin{cases}(L_{\mathrm{eff}}\gamma_{0})^{2},&\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}\leq 1/4,\\[3.0pt] 1/\mu_{i}(\widehat{\Sigma})^{2},&\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}>1/4.\end{cases}$$

Substituting this bound into the previous display gives

$$\operatorname{tr}(\Sigma V_{L}\widehat{\Sigma}^{-1}V_{L})\gtrsim(L_{\mathrm{eff}}\gamma_{0})^{2}\sum_{i:\,\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}\leq 1/4}\mu_{i}(\Sigma)\mu_{i}(\widehat{\Sigma})+\frac{1}{5}\sum_{i:\,\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}>1/4}\frac{\mu_{i}(\Sigma)}{\mu_{i}(\widehat{\Sigma})}.$$

Taking expectation over $X$ proves the lower bound. ∎
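The two-regime behaviour of $f$ used in the lower bound can be checked directly. Writing $u=\gamma_{0}x$, one has $(1-(1-u)^{L_{\mathrm{eff}}})/u=\sum_{j<L_{\mathrm{eff}}}(1-u)^{j}$, which is decreasing on $[0,1]$; this explains the monotonicity. The sketch below evaluates $f$ on a grid for hypothetical values `gamma0 = 0.1` and `L_eff = 50`, with a constant-stepsize form of $f$; the constants `0.75` and `0.04` are illustrative thresholds derived for this check, not the constants of the paper.

```python
def f(x, gamma0, L_eff):
    """f(x) = (1 - (1 - gamma0 * x)^L_eff)^2 / x^2 from the proof of Lemma E.1."""
    return (1.0 - (1.0 - gamma0 * x) ** L_eff) ** 2 / x ** 2

gamma0, L_eff = 0.1, 50  # hypothetical values
xs = [j * 0.001 / gamma0 for j in range(1, 1001)]  # grid on (0, 1/gamma0]
vals = [f(x, gamma0, L_eff) for x in xs]

# Monotonicity on the grid.
is_decreasing = all(v1 >= v2 for v1, v2 in zip(vals, vals[1:]))

# Small-eigenvalue regime: f is comparable to (L_eff * gamma0)^2.
scale = (L_eff * gamma0) ** 2
small_ok = all(0.75 * scale <= f(x, gamma0, L_eff) <= scale
               for x in xs if x * L_eff * gamma0 <= 0.25)

# Large-eigenvalue regime: f is at least a constant times 1 / x^2.
large_ok = all(f(x, gamma0, L_eff) >= 0.04 / x ** 2
               for x in xs if x * L_eff * gamma0 > 0.25)

print(is_decreasing, small_ok, large_ok)
```

In the small regime the upper comparison is Bernoulli's inequality, $1-(1-u)^{L_{\mathrm{eff}}}\leq L_{\mathrm{eff}}u$; in the large regime the lower comparison follows from $(1-u)^{L_{\mathrm{eff}}}\leq e^{-1/4}$.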

### E.2 Example under the source condition

###### Lemma E.2 (Variance bounds under the source condition).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose

$$L_{\mathrm{eff}}\lesssim N^{a}/\gamma.$$

Then there exists an $a$-dependent constant $c>0$ such that, whenever

$$\gamma\leq\frac{c}{\log N},$$

we have with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathrm{Var}_{\mathrm{GD}}\asymp\frac{\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}}{N}.$$

###### Proof\.

For the upper bound, Lemma [E.1](https://arxiv.org/html/2605.24316#A5.Thmlemma1) gives

$$\mathrm{Var}_{\mathrm{GD}}\lesssim\frac{D_{U}}{N}.$$

By Lemma [K.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7), $\mu_{i}(\Sigma)\asymp i^{-a}$, and the same truncation argument used in the proof of Lemma [D.3](https://arxiv.org/html/2605.24316#A4.Thmlemma3) implies

$$\mathbb{E}_{X}\Bigl[\#\{i:\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}>1/4\}\Bigr]\lesssim(L_{\mathrm{eff}}\gamma)^{1/a}.$$

Moreover,

$$(L_{\mathrm{eff}}\gamma_{0})\sum_{i:\,\mu_{i}(\widehat{\Sigma})L_{\mathrm{eff}}\gamma_{0}\leq 1/4}\mu_{i}(\widehat{\Sigma})\lesssim(L_{\mathrm{eff}}\gamma)\sum_{i\gtrsim(L_{\mathrm{eff}}\gamma)^{1/a}}i^{-a}\lesssim(L_{\mathrm{eff}}\gamma)^{1/a}.$$

Since also $D_{U}\leq M$, we obtain

$$\mathrm{Var}_{\mathrm{GD}}\lesssim\frac{\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}}{N}.$$

For the lower bound, first assume

$$(L_{\mathrm{eff}}\gamma)^{1/a}\leq M/c$$

for a sufficiently large constant $c>0$. By Lemmas [K.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7) and [K.10](https://arxiv.org/html/2605.24316#A11.Thmlemma10), with high probability we have

$$\mu_{i}(\Sigma)\asymp i^{-a},\qquad\mu_{i}(\widehat{\Sigma})\asymp i^{-a}\qquad\text{for }i\leq c^{-1}\min\{M,N\}.$$

Therefore the first term in $D_{L}$ satisfies

$$D_{L}\gtrsim(L_{\mathrm{eff}}\gamma)^{2}\sum_{i\gtrsim(L_{\mathrm{eff}}\gamma)^{1/a}}i^{-2a}\gtrsim(L_{\mathrm{eff}}\gamma)^{1/a}.$$

If instead

$$(L_{\mathrm{eff}}\gamma)^{1/a}\geq M/c,$$

then the second term in $D_{L}$ and the same empirical-spectrum estimate imply

$$D_{L}\gtrsim\sum_{i\leq M/c}\frac{\mu_{i}(\Sigma)}{\mu_{i}(\widehat{\Sigma})}\gtrsim M.$$

Combining the two regimes with Lemma [E.1](https://arxiv.org/html/2605.24316#A5.Thmlemma1) yields

$$\mathrm{Var}_{\mathrm{GD}}\gtrsim\frac{\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}}{N}.$$

This proves the lemma. ∎

## Appendix F One-pass Batch SGD: Excess Error Decomposition

This section focuses on the one-pass batch SGD procedure in equation [2](https://arxiv.org/html/2605.24316#S2.E2) and uses the notation from Section [A.3](https://arxiv.org/html/2605.24316#A1.SS3). In particular, the centered error $e_{t}:=u_{t}^{\mathrm{op}}-u^{\ast}$ satisfies

$$e_{t}=\bigl(I-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}\bigr)e_{t-1}+\gamma_{t}\widehat{\xi}_{t}^{(B)}.\tag{12}$$
### F.1 Excess error decomposition

Since $u^{\ast}$ is the minimizer of the sketched population risk, the excess error is

$$\mathcal{E}_{\mathrm{ex}}^{(B)}:=R_{M}(u_{T}^{\mathrm{op}})-R_{M}(u^{\ast})=\|u_{T}^{\mathrm{op}}-u^{\ast}\|_{\Sigma}^{2}=\|e_{T}\|_{\Sigma}^{2}.$$

Taking expectation, we will decompose

$$\mathbb{E}\mathcal{E}_{\mathrm{ex}}^{(B)}=\mathbb{E}\|e_{T}\|_{\Sigma}^{2}$$

into deterministic and stochastic contributions.

We use throughout the original mean-centered decomposition

$$e_{T}=m_{T}+(e_{T}-m_{T}),\qquad m_{T}:=\mathbb{E}[e_{T}].$$
###### Definition F.1 (Bias and centered variance).

Define the mean iterate error and centered fluctuation by

$$m_{t}:=\mathbb{E}[e_{t}],\qquad\delta_{t}:=e_{t}-m_{t}.$$

The one-pass bias and variance terms are

$$\mathrm{Bias}_{B}:=\|m_{T}\|_{\Sigma}^{2},\qquad\mathrm{Var}_{B}:=\mathbb{E}\|\delta_{T}\|_{\Sigma}^{2}.$$

Since the batch used at step $t$ is independent of $e_{t-1}$ in the one-pass regime, and

$$\mathbb{E}[\widehat{\Sigma}_{t}^{(B)}]=\Sigma,\qquad\mathbb{E}[\widehat{\xi}_{t}^{(B)}]=0,$$

the mean recursion satisfies

$$m_{t}=(I-\gamma_{t}\Sigma)m_{t-1},\qquad m_{0}=-u^{\ast},$$

and hence

$$m_{T}=-\prod_{t=1}^{T}(I-\gamma_{t}\Sigma)u^{\ast}.$$

Therefore

$$\boxed{\mathrm{Bias}_{B}=\left\|\prod_{t=1}^{T}(I-\gamma_{t}\Sigma)u^{\ast}\right\|_{\Sigma}^{2}.}$$

###### Proposition F.1 (Exact mean-centered decomposition of the excess error).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1) and [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2). Under the one-pass batch SGD recursion in equation [12](https://arxiv.org/html/2605.24316#A6.E12),

$$\mathbb{E}\mathcal{E}_{\mathrm{ex}}^{(B)}=\mathbb{E}\|e_{T}\|_{\Sigma}^{2}=\mathrm{Bias}_{B}+\mathrm{Var}_{B}.$$

###### Proof.

Since $e_{T}=m_{T}+\delta_{T}$, we have

$$\|e_{T}\|_{\Sigma}^{2}=\|m_{T}\|_{\Sigma}^{2}+2\langle m_{T},\delta_{T}\rangle_{\Sigma}+\|\delta_{T}\|_{\Sigma}^{2}.$$

Taking expectation and using $\mathbb{E}[\delta_{T}]=0$ gives

$$\mathbb{E}\|e_{T}\|_{\Sigma}^{2}=\|m_{T}\|_{\Sigma}^{2}+\mathbb{E}\|\delta_{T}\|_{\Sigma}^{2},$$

which is exactly the claimed identity. ∎
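The decomposition in Proposition F.1 can be checked numerically. The sketch below is illustrative only: it assumes a small diagonal $\Sigma$ with a power-law spectrum, a constant stepsize, and labels that are well specified in the sketched coordinates (so that $y-z^{\top}u^{\ast}$ is independent Gaussian noise). None of these choices come from the paper. With sample averages in place of expectations, the identity holds exactly, because the cross term vanishes once the empirical mean is subtracted.

```python
import numpy as np

def simulate_one_pass(n_runs=500, d=4, B=8, T=50, gamma=0.1, sigma=0.5, seed=0):
    """Return (Sigma, final errors e_T) for independent one-pass batch SGD runs."""
    rng = np.random.default_rng(seed)
    eigs = np.arange(1, d + 1, dtype=float) ** -2.0    # assumed power-law spectrum
    Sigma = np.diag(eigs)
    u_star = np.ones(d)                                # assumed target
    errs = np.empty((n_runs, d))
    for r in range(n_runs):
        e = -u_star.copy()                             # e_0 = u_0 - u*, with u_0 = 0
        for _ in range(T):
            z = rng.standard_normal((B, d)) * np.sqrt(eigs)  # fresh batch, rows ~ N(0, Sigma)
            noise = sigma * rng.standard_normal(B)           # stands in for y - z^T u*
            Sigma_hat = z.T @ z / B
            xi_hat = z.T @ noise / B
            e = e - gamma * Sigma_hat @ e + gamma * xi_hat   # recursion (12)
        errs[r] = e
    return Sigma, errs

def sigma_norm_sq(v, Sigma):
    """Squared Sigma-norm v^T Sigma v, applied row-wise if v is 2-D."""
    return np.einsum("...i,ij,...j->...", v, Sigma, v)

Sigma, errs = simulate_one_pass()
m_T = errs.mean(axis=0)                                # empirical mean error
total = sigma_norm_sq(errs, Sigma).mean()              # estimate of E||e_T||^2
bias = sigma_norm_sq(m_T, Sigma)                       # ||m_T||^2
var = sigma_norm_sq(errs - m_T, Sigma).mean()          # estimate of E||delta_T||^2
print(total, bias + var)
```

The two printed numbers agree up to floating-point error, regardless of the number of runs.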

### F.2 Fluctuation recursion and an exact split of the variance term

Subtracting the mean recursion from equation [12](https://arxiv.org/html/2605.24316#A6.E12) gives the exact centered fluctuation recursion

$$\delta_{t}=\bigl(I-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}\bigr)\delta_{t-1}+\gamma_{t}\bigl(\Sigma-\widehat{\Sigma}_{t}^{(B)}\bigr)m_{t-1}+\gamma_{t}\widehat{\xi}_{t}^{(B)}.\tag{13}$$

Equivalently, in the shorthand notation of Section [A.3](https://arxiv.org/html/2605.24316#A1.SS3),

$$\delta_{t}=\bigl(I-\gamma_{t}\bar{Z}_{t}\bigr)\delta_{t-1}+\gamma_{t}(\Sigma-\bar{Z}_{t})m_{t-1}+\gamma_{t}\bar{\xi}_{t}.\tag{14}$$

For the variance analysis it is convenient to split $\delta_{t}$ into the part caused by batch-covariance fluctuations and the part caused by the additive label noise.

###### Definition F.2 (Centered covariance and noise components).

Define two auxiliary processes $(q_{t})$ and $(v_{t})$ by

$$q_{0}=0,\qquad v_{0}=0,$$

and, for $t\geq 1$,

$$q_{t}=(I-\gamma_{t}\bar{Z}_{t})q_{t-1}+\gamma_{t}(\Sigma-\bar{Z}_{t})m_{t-1},$$

$$v_{t}=(I-\gamma_{t}\bar{Z}_{t})v_{t-1}+\gamma_{t}\bar{\xi}_{t}.$$

Define the corresponding quadratic terms by

$$\mathrm{Var}_{B}^{\mathrm{cov}}:=\mathbb{E}\|q_{T}\|_{\Sigma}^{2},\qquad\mathrm{Var}_{B}^{\mathrm{noise}}:=\mathbb{E}\|v_{T}\|_{\Sigma}^{2}.$$

###### Proposition F.2 (Exact split of the centered variance).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1) and [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2). Then, for every $t$,

$$\delta_{t}=q_{t}+v_{t},\qquad\mathbb{E}[q_{t}]=0.$$

Moreover, if

$$\mathcal{G}_{t}:=\sigma\bigl(z_{s,b}:1\leq s\leq t,\ 1\leq b\leq B\bigr),$$

then

$$\mathbb{E}[v_{t}\mid\mathcal{G}_{t},w^{\ast}]=0,$$

and therefore

$$\mathrm{Var}_{B}=\mathrm{Var}_{B}^{\mathrm{cov}}+\mathrm{Var}_{B}^{\mathrm{noise}}.$$

###### Proof.

The identity $\delta_{t}=q_{t}+v_{t}$ follows by induction from equation [14](https://arxiv.org/html/2605.24316#A6.E14). Indeed, it is true at time $t=0$, and if $\delta_{t-1}=q_{t-1}+v_{t-1}$, then

$$\delta_{t}=\bigl(I-\gamma_{t}\bar{Z}_{t}\bigr)(q_{t-1}+v_{t-1})+\gamma_{t}(\Sigma-\bar{Z}_{t})m_{t-1}+\gamma_{t}\bar{\xi}_{t}=q_{t}+v_{t}.$$

Next, $\mathbb{E}[q_{0}]=0$. Suppose $\mathbb{E}[q_{t-1}]=0$. Since $q_{t-1}$ depends only on batches $1,\dots,t-1$ while $\bar{Z}_{t}$ comes from the disjoint $t$-th batch, $q_{t-1}$ and $\bar{Z}_{t}$ are independent. Together with $\mathbb{E}[\bar{Z}_{t}]=\Sigma$ and the fact that $m_{t-1}$ is deterministic, this gives

$$\mathbb{E}[q_{t}]=(I-\gamma_{t}\Sigma)\mathbb{E}[q_{t-1}]+\gamma_{t}\mathbb{E}[(\Sigma-\bar{Z}_{t})m_{t-1}]=0.$$

Thus $\mathbb{E}[q_{t}]=0$ for every $t$.

To prove the conditional mean-zero property for $v_{t}$, fix one sample in the current batch and write $z=Sx$. Conditioning on the sketch matrix $S$, the pair $(x,z)$ is jointly Gaussian with

$$\mathbb{E}[x\mid z]=HS^{\top}(SHS^{\top})^{-1}z=HS^{\top}\Sigma^{-1}z.$$

Since $u^{\ast}=\Sigma^{-1}SHw^{\ast}$, the Gaussian conditional-regression identity yields

$$\mathbb{E}[\langle x,w^{\ast}\rangle\mid z,w^{\ast}]=\langle\mathbb{E}[x\mid z],w^{\ast}\rangle=\langle HS^{\top}\Sigma^{-1}z,w^{\ast}\rangle=z^{\top}u^{\ast}.$$

On the other hand, Assumption [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2) implies

$$\mathbb{E}[y-\langle x,w^{\ast}\rangle\mid x,w^{\ast}]=0.$$

Therefore, by the tower property,

$$\mathbb{E}[y-z^{\top}u^{\ast}\mid z,w^{\ast}]=\mathbb{E}[y-\langle x,w^{\ast}\rangle\mid z,w^{\ast}]+\mathbb{E}[\langle x,w^{\ast}\rangle-z^{\top}u^{\ast}\mid z,w^{\ast}]=0.$$

Multiplying by $z$, which is measurable with respect to $\sigma(z,w^{\ast})$, gives

$$\mathbb{E}\bigl[z(y-z^{\top}u^{\ast})\mid z,w^{\ast}\bigr]=0.$$

Applying this to each summand in $\bar{\xi}_{t}=\frac{1}{B}\sum_{b=1}^{B}z_{t,b}(y_{i_{t,b}}-z_{t,b}^{\top}u^{\ast})$, and conditioning on $\mathcal{G}_{t}$, which fixes the current sketched covariates $(z_{t,b})_{b=1}^{B}$, we obtain

$$\mathbb{E}[\bar{\xi}_{t}\mid\mathcal{G}_{t},w^{\ast}]=0.$$

Because $v_{t-1}$ depends only on the first $t-1$ batches, it is independent of the current block $(z_{t,b})_{b=1}^{B}$. Since $\bar{Z}_{t}$ is $\mathcal{G}_{t}$-measurable, it follows that

$$\mathbb{E}[v_{t}\mid\mathcal{G}_{t},w^{\ast}]=(I-\gamma_{t}\bar{Z}_{t})\mathbb{E}[v_{t-1}\mid\mathcal{G}_{t},w^{\ast}]+\gamma_{t}\mathbb{E}[\bar{\xi}_{t}\mid\mathcal{G}_{t},w^{\ast}].$$

Conditioning on the additional current block does not change the conditional expectation of $v_{t-1}$, so the induction hypothesis gives $\mathbb{E}[v_{t-1}\mid\mathcal{G}_{t},w^{\ast}]=\mathbb{E}[v_{t-1}\mid\mathcal{G}_{t-1},w^{\ast}]=0$. Starting from $v_{0}=0$, we therefore obtain by induction that

$$\mathbb{E}[v_{t}\mid\mathcal{G}_{t},w^{\ast}]=0\qquad\text{for all }t.$$

Finally, $q_{T}$ is $\mathcal{G}_{T}$-measurable, so

$$\mathbb{E}\langle q_{T},v_{T}\rangle_{\Sigma}=\mathbb{E}\Bigl[\mathbb{E}\bigl[\langle q_{T},v_{T}\rangle_{\Sigma}\mid\mathcal{G}_{T},w^{\ast}\bigr]\Bigr]=\mathbb{E}\Bigl[\bigl\langle q_{T},\mathbb{E}[v_{T}\mid\mathcal{G}_{T},w^{\ast}]\bigr\rangle_{\Sigma}\Bigr]=0.$$

Using $\delta_{T}=q_{T}+v_{T}$, we conclude that

$$\mathbb{E}\|\delta_{T}\|_{\Sigma}^{2}=\mathbb{E}\|q_{T}\|_{\Sigma}^{2}+\mathbb{E}\|v_{T}\|_{\Sigma}^{2}=\mathrm{Var}_{B}^{\mathrm{cov}}+\mathrm{Var}_{B}^{\mathrm{noise}},$$

as claimed. ∎
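The pathwise identity $\delta_{t}=q_{t}+v_{t}$ can be verified on a single simulated trajectory. The following is a minimal sketch under illustrative assumptions that are not taken from the paper: a diagonal $\Sigma$, a constant stepsize, and Gaussian label noise standing in for $y-z^{\top}u^{\ast}$. It runs the four recursions for $e_{t}$, $m_{t}$, $q_{t}$, and $v_{t}$ side by side on the same batches.

```python
import numpy as np

def run_split(d=4, B=4, T=30, gamma=0.1, sigma=0.5, seed=1):
    """Run e_t, m_t, q_t, v_t on shared batches and return their final values."""
    rng = np.random.default_rng(seed)
    eigs = np.arange(1, d + 1, dtype=float) ** -2.0    # assumed spectrum
    Sigma = np.diag(eigs)
    u_star = np.ones(d)                                # assumed target
    I = np.eye(d)
    e = -u_star.copy()                                 # e_0 = -u*
    m = -u_star.copy()                                 # m_0 = -u*
    q = np.zeros(d)                                    # q_0 = 0
    v = np.zeros(d)                                    # v_0 = 0
    for _ in range(T):
        z = rng.standard_normal((B, d)) * np.sqrt(eigs)
        Z_bar = z.T @ z / B                            # batch covariance
        xi_bar = z.T @ (sigma * rng.standard_normal(B)) / B
        # Every update uses the previous-step values m_{t-1}, q_{t-1}, v_{t-1}.
        e_new = (I - gamma * Z_bar) @ e + gamma * xi_bar
        q_new = (I - gamma * Z_bar) @ q + gamma * (Sigma - Z_bar) @ m
        v_new = (I - gamma * Z_bar) @ v + gamma * xi_bar
        m_new = (I - gamma * Sigma) @ m
        e, q, v, m = e_new, q_new, v_new, m_new
    return e, m, q, v

e_T, m_T, q_T, v_T = run_split()
delta_T = e_T - m_T
print(np.max(np.abs(delta_T - (q_T + v_T))))
```

The printed discrepancy is at the level of floating-point rounding, which reflects that the split is an algebraic identity rather than a statement in expectation.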

## Appendix G Bias Error for One-pass Batch SGD

This section focuses on the one-pass batch SGD procedure in equation [2](https://arxiv.org/html/2605.24316#S2.E2) and uses the notation from Section [A.3](https://arxiv.org/html/2605.24316#A1.SS3). In particular,

$$T:=\frac{N}{B},\qquad T_{\mathrm{eff}}:=\frac{T}{\log T},\qquad\Sigma:=SHS^{\top},\qquad u^{\ast}:=\Sigma^{-1}SHw^{\ast}.$$

By Definition [F.1](https://arxiv.org/html/2605.24316#A6.Thmdefinition1), this section studies the exact one-pass bias term

$$\mathrm{Bias}_{B}=\|m_{T}\|_{\Sigma}^{2}.$$

Define

$$C_{T}:=\prod_{t=1}^{T}(I-\gamma_{t}\Sigma).$$

Since the mean recursion satisfies $m_{t}=(I-\gamma_{t}\Sigma)m_{t-1}$ and $m_{0}=-u^{\ast}$, we have

$$m_{T}=-C_{T}u^{\ast},\qquad\mathrm{Bias}_{B}=\|C_{T}u^{\ast}\|_{\Sigma}^{2}.$$
### G.1 Upper and lower bounds for the bias term

###### Lemma G.1 (Upper bound on the one-pass batch bias).

Assume Assumptions [2](https://arxiv.org/html/2605.24316#Thmassumption2), [3](https://arxiv.org/html/2605.24316#Thmassumption3), and [4](https://arxiv.org/html/2605.24316#Thmassumption4), with the convention in Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1). Fix any integer $k\leq M/3$ such that $\operatorname{rank}(H)\geq k+M$, and define

$$A_{k}:=S_{k:\infty}H_{k:\infty}S_{k:\infty}^{\top}.$$

Then, with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathrm{Bias}_{B}\lesssim\frac{\|w^{\ast}_{0:k}\|_{2}^{2}}{T_{\mathrm{eff}}\gamma}\left(\frac{\mu_{M/2}(A_{k})}{\mu_{M}(A_{k})}\right)^{2}+\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2}.$$

In particular,

$$\mathrm{Bias}_{B}\lesssim\frac{\|u^{\ast}\|_{2}^{2}}{T_{\mathrm{eff}}\gamma}.$$

###### Proof.

By Assumption [4](https://arxiv.org/html/2605.24316#Thmassumption4), we may work in the diagonal coordinates of $H$, as encoded in Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2). Define

$$M_{T}:=C_{T}\Sigma C_{T}.$$

By the effective-time comparison, we have

$$M_{T}\preceq(I-\gamma\Sigma)^{T_{\mathrm{eff}}}\Sigma(I-\gamma\Sigma)^{T_{\mathrm{eff}}}=:M.$$

Therefore,

$$\begin{aligned}\mathrm{Bias}_{B}&=\langle M_{T},u^{\ast}u^{\ast\top}\rangle\\&\leq\langle M,u^{\ast}u^{\ast\top}\rangle\\&=w^{\ast\top}HS^{\top}\Sigma^{-1}M\Sigma^{-1}SHw^{\ast}.\end{aligned}\tag{15}$$
Now decompose

$$SH=(S_{0:k}H_{0:k},\,S_{k:\infty}H_{k:\infty}).$$

Then, by spectral truncation and the elementary inequality $(a+b)^{\top}A(a+b)\leq 2a^{\top}Aa+2b^{\top}Ab$ for positive semidefinite $A$, we have

$$\mathrm{Bias}_{B}\lesssim T_{1}+T_{2},$$

where

$$T_{1}:=w_{0:k}^{\ast\top}H_{0:k}S_{0:k}^{\top}\Sigma^{-1}M\Sigma^{-1}S_{0:k}H_{0:k}w_{0:k}^{\ast},$$

and

$$T_{2}:=w_{k:\infty}^{\ast\top}H_{k:\infty}S_{k:\infty}^{\top}\Sigma^{-1}M\Sigma^{-1}S_{k:\infty}H_{k:\infty}w_{k:\infty}^{\ast}.$$
For the head term, we have

$$T_{1}\leq\|M\|_{2}\cdot\|\Sigma^{-1}S_{0:k}H_{0:k}\|_{2}^{2}\cdot\|w_{0:k}^{\ast}\|_{2}^{2}.$$

By spectral calculus,

$$\|M\|_{2}=\max_{x\in[0,\|\Sigma\|_{2}]}x(1-\gamma x)^{2T_{\mathrm{eff}}}\lesssim\frac{1}{T_{\mathrm{eff}}\gamma}.$$

Moreover, Lemma [K.6](https://arxiv.org/html/2605.24316#A11.Thmlemma6) yields

$$\|\Sigma^{-1}S_{0:k}H_{0:k}\|_{2}\lesssim\frac{\mu_{M/2}(A_{k})}{\mu_{M}(A_{k})}$$

with probability at least $1-\exp(-\Omega(M))$. Hence

$$T_{1}\lesssim\frac{\|w_{0:k}^{\ast}\|_{2}^{2}}{T_{\mathrm{eff}}\gamma}\left(\frac{\mu_{M/2}(A_{k})}{\mu_{M}(A_{k})}\right)^{2}.$$
For the tail term, since $0\preceq I-\gamma\Sigma\preceq I$, we also have

$$0\preceq M=(I-\gamma\Sigma)^{T_{\mathrm{eff}}}\Sigma(I-\gamma\Sigma)^{T_{\mathrm{eff}}}\preceq\Sigma.$$

Therefore,

$$\begin{aligned}T_{2}&\leq w_{k:\infty}^{\ast\top}H_{k:\infty}S_{k:\infty}^{\top}\Sigma^{-1}\Sigma\Sigma^{-1}S_{k:\infty}H_{k:\infty}w_{k:\infty}^{\ast}\\&=w_{k:\infty}^{\ast\top}H_{k:\infty}^{1/2}\Bigl(H_{k:\infty}^{1/2}S_{k:\infty}^{\top}\Sigma^{-1}S_{k:\infty}H_{k:\infty}^{1/2}\Bigr)H_{k:\infty}^{1/2}w_{k:\infty}^{\ast}\\&\leq\|w_{k:\infty}^{\ast}\|_{H_{k:\infty}}^{2},\end{aligned}$$

where the last step uses

$$0\preceq H_{k:\infty}^{1/2}S_{k:\infty}^{\top}\Sigma^{-1}S_{k:\infty}H_{k:\infty}^{1/2}\preceq I.$$

Combining the bounds on $T_{1}$ and $T_{2}$ proves the first claim.

For the simpler bound, we may ignore the head–tail split and use

$$\mathrm{Bias}_{B}\leq\|M\|_{2}\,\|u^{\ast}\|_{2}^{2}\lesssim\frac{\|u^{\ast}\|_{2}^{2}}{T_{\mathrm{eff}}\gamma}.$$

∎
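The spectral-calculus step $\max_{x}x(1-\gamma x)^{2T_{\mathrm{eff}}}\lesssim 1/(T_{\mathrm{eff}}\gamma)$ used in the proof can be illustrated numerically. In the sketch below, the integer `T` plays the role of $T_{\mathrm{eff}}$ and the grid covers $[0,1/\gamma]$, which contains $[0,\|\Sigma\|_{2}]$ whenever $\gamma\|\Sigma\|_{2}\leq 1$. Both choices are simplifying assumptions. Elementary calculus places the maximizer at $x=1/(\gamma(2T+1))$, so the supremum is below $1/(2\gamma T)$.

```python
import numpy as np

def kernel_sup(gamma, T, n_grid=200_001):
    """Grid approximation of max over x in [0, 1/gamma] of x * (1 - gamma*x)^(2T)."""
    x = np.linspace(0.0, 1.0 / gamma, n_grid)
    return np.max(x * (1.0 - gamma * x) ** (2 * T))

gamma = 0.05
results = []
for T in (10, 100, 1000):
    sup = kernel_sup(gamma, T)
    bound = 1.0 / (2.0 * gamma * T)       # explicit constant for the 1/(T*gamma) rate
    results.append((T, sup, bound))
    print(T, sup, bound)
```

Each tenfold increase in `T` shrinks the supremum by roughly a factor of ten, matching the $1/(T_{\mathrm{eff}}\gamma)$ rate.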

###### Lemma G.2 (Lower bound on the one-pass batch bias).

Let

$$H_{w}:=\mathbb{E}_{w^{\ast}}[w^{\ast}w^{\ast\top}],\qquad\Sigma_{w}:=SHH_{w}HS^{\top}.$$

Then, conditioned on the sketch matrix $S$,

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\gtrsim\sum_{i:\,\mu_{i}(\Sigma)<1/(\gamma T_{\mathrm{eff}})}\frac{\mu_{i}(\Sigma_{w})}{\mu_{i}(\Sigma)}.$$

###### Proof.

Define

$$M_{T}^{\prime}:=\Sigma(I-2\gamma\Sigma)^{2T_{\mathrm{eff}}}.$$

One can verify that the blockwise geometric stepsize schedule implies

$$\prod_{t=1}^{T}(1-\gamma_{t}x)^{2}\geq(1-2\gamma x)^{2T_{\mathrm{eff}}}\qquad\text{for every }x\in[0,\|\Sigma\|_{2}].$$

By functional calculus, this gives

$$M_{T}:=C_{T}\Sigma C_{T}\succeq M_{T}^{\prime}.$$

Therefore,

$$\mathrm{Bias}_{B}=\langle M_{T},u^{\ast}u^{\ast\top}\rangle\geq\langle M_{T}^{\prime},u^{\ast}u^{\ast\top}\rangle.$$

Since

$$\mathbb{E}_{w^{\ast}}[u^{\ast}u^{\ast\top}]=\Sigma^{-1}\Sigma_{w}\Sigma^{-1},$$

we have

$$\begin{aligned}\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]&\geq\operatorname{tr}\bigl(M_{T}^{\prime}\Sigma^{-1}\Sigma_{w}\Sigma^{-1}\bigr)\\&=\operatorname{tr}\bigl(\Sigma^{-1}M_{T}^{\prime}\Sigma^{-1}\Sigma_{w}\bigr).\end{aligned}\tag{16}$$

Applying von Neumann's trace inequality to equation [16](https://arxiv.org/html/2605.24316#A7.E16) gives

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\gtrsim\sum_{i=1}^{M}\mu_{M-i+1}(\Sigma^{-1}M_{T}^{\prime}\Sigma^{-1})\,\mu_{i}(\Sigma_{w}).$$

Since $M_{T}^{\prime}$ is positive definite, the identity $\mu_{M-i+1}(A)=\mu_{i}(A^{-1})^{-1}$ yields

$$\mu_{M-i+1}(\Sigma^{-1}M_{T}^{\prime}\Sigma^{-1})=\frac{1}{\mu_{i}(\Sigma^{2}M_{T}^{\prime-1})}=\frac{1}{\mu_{i}\bigl(\Sigma(I-2\gamma\Sigma)^{-2T_{\mathrm{eff}}}\bigr)}.$$

Therefore,

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\gtrsim\sum_{i=1}^{M}\frac{\mu_{i}(\Sigma_{w})}{\mu_{i}\bigl(\Sigma(I-2\gamma\Sigma)^{-2T_{\mathrm{eff}}}\bigr)}.$$

If $\mu_{i}(\Sigma)<1/(\gamma T_{\mathrm{eff}})$, then

$$(1-2\gamma\mu_{i}(\Sigma))^{-2T_{\mathrm{eff}}}\lesssim 1,$$

so the corresponding denominator is comparable to $\mu_{i}(\Sigma)$. Restricting the sum to this index set gives

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\gtrsim\sum_{i:\,\mu_{i}(\Sigma)<1/(\gamma T_{\mathrm{eff}})}\frac{\mu_{i}(\Sigma_{w})}{\mu_{i}(\Sigma)},$$

which is exactly the claimed bound. ∎
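The lower-bound direction of von Neumann's trace inequality used above states that, for symmetric positive semidefinite $A$ and $B$ of size $M$, $\operatorname{tr}(AB)\geq\sum_{i}\mu_{M-i+1}(A)\,\mu_{i}(B)$, with eigenvalues sorted in decreasing order. The following sketch checks both directions on random matrices. The helper `random_psd` and the matrix size are hypothetical choices made for illustration.

```python
import numpy as np

def random_psd(n, rng):
    """Hypothetical helper: a random symmetric positive semidefinite matrix."""
    G = rng.standard_normal((n, n))
    return G @ G.T

rng = np.random.default_rng(0)
A = random_psd(6, rng)
B = random_psd(6, rng)

mu_A = np.sort(np.linalg.eigvalsh(A))[::-1]    # descending: mu_1 >= ... >= mu_M
mu_B = np.sort(np.linalg.eigvalsh(B))[::-1]

lower = np.sum(mu_A[::-1] * mu_B)              # sum_i mu_{M-i+1}(A) * mu_i(B)
upper = np.sum(mu_A * mu_B)                    # sum_i mu_i(A) * mu_i(B)
trace_AB = np.trace(A @ B)
print(lower, trace_AB, upper)
```

In the proof, $A$ corresponds to $\Sigma^{-1}M_{T}^{\prime}\Sigma^{-1}$ and $B$ to $\Sigma_{w}$, and only the lower inequality is needed.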

### G.2 Bounds under the source condition

###### Lemma G.3 (Bounds on the one-pass batch bias under the source condition).

Assume Assumptions [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [2](https://arxiv.org/html/2605.24316#Thmassumption2), [3](https://arxiv.org/html/2605.24316#Thmassumption3), and [4](https://arxiv.org/html/2605.24316#Thmassumption4), with the convention in Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1). Suppose moreover that

Then there exist $(a,b)$-dependent constants $c_{0},c_{1}>0$ such that, whenever

$$\gamma\leq\frac{c_{0}}{\log T},$$

we have, with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\lesssim\min\bigl\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\bigr\}^{1-b}.$$

Moreover, if in addition

$$(T_{\mathrm{eff}}\gamma)^{1/a}\leq\frac{M}{c_{1}},$$

then, on the same event,

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\gtrsim(T_{\mathrm{eff}}\gamma)^{(1-b)/a}.$$

###### Proof.

We verify the upper and lower bounds separately.

For the upper bound, choose

$$k:=\min\bigl\{M/3,(T_{\mathrm{eff}}\gamma)^{1/a}\bigr\}.$$

By Lemma [K.8](https://arxiv.org/html/2605.24316#A11.Thmlemma8),

$$\frac{\mu_{M/2}(A_{k})}{\mu_{M}(A_{k})}\lesssim 1$$

with probability at least $1-\exp(-\Omega(M))$. Under Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2),

$$\mathbb{E}_{w^{\ast}}\|w_{0:k}^{\ast}\|_{2}^{2}\asymp\sum_{i=1}^{k}i^{a-b},\qquad\mathbb{E}_{w^{\ast}}\|w_{k:\infty}^{\ast}\|_{H_{k:\infty}}^{2}\asymp\sum_{i>k}i^{-b}.$$

Plugging these relations into Lemma [G.1](https://arxiv.org/html/2605.24316#A7.Thmlemma1) gives

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\lesssim\frac{1}{T_{\mathrm{eff}}\gamma}\sum_{i=1}^{k}i^{a-b}+\sum_{i>k}i^{-b}\lesssim k^{1-b}.$$

Since $k\asymp\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}$, this yields

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\lesssim\min\bigl\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\bigr\}^{1-b}.$$
For the lower bound, set

$$\theta:=(T_{\mathrm{eff}}\gamma)^{1/a}.$$

Lemma [K.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7) gives

$$\mu_{i}(\Sigma)\asymp i^{-a}\qquad\text{for }i\in[M]$$

with probability at least $1-\exp(-\Omega(M))$. Therefore, the condition

$$\mu_{i}(\Sigma)<\frac{1}{\gamma T_{\mathrm{eff}}}$$

is equivalent, up to constants, to

$$i\gtrsim(T_{\mathrm{eff}}\gamma)^{1/a}=\theta.$$

Moreover, under Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2), the operator $HH_{w}H$ has eigenvalues of order $i^{-a-b}$, so Lemma [K.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7) applied to $HH_{w}H$ yields

$$\mu_{i}(\Sigma_{w})\asymp i^{-a-b}.$$

Combining these estimates with Lemma [G.2](https://arxiv.org/html/2605.24316#A7.Thmlemma2), we obtain

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\gtrsim\sum_{i\gtrsim\theta}\frac{\mu_{i}(\Sigma_{w})}{\mu_{i}(\Sigma)}\asymp\sum_{i\gtrsim\theta}i^{-b}.$$

If $\theta\leq M/c_{1}$ for a sufficiently large $(a,b)$-dependent constant $c_{1}$, then the last sum contains the range $[C\theta,2C\theta]$ for some constant $C>0$, and hence

$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\gtrsim\theta^{1-b}=(T_{\mathrm{eff}}\gamma)^{(1-b)/a}.$$

Combining the upper and lower bounds proves the claim. ∎
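The power-law arithmetic behind the upper bound, namely $\frac{1}{T_{\mathrm{eff}}\gamma}\sum_{i\leq k}i^{a-b}+\sum_{i>k}i^{-b}\lesssim k^{1-b}$ with $k=(T_{\mathrm{eff}}\gamma)^{1/a}$, can be illustrated with a short numerical sketch. The exponents $a=2$, $b=1.5$ and the truncation of the infinite tail at `n_max` are assumptions made here for illustration; they are chosen so that the head sum grows and the tail sum converges.

```python
import numpy as np

def bias_upper_terms(k, a, b, n_max=1_000_000):
    """Head and tail sums from the upper bound, with T_eff * gamma set to k**a."""
    i = np.arange(1, n_max + 1, dtype=float)
    head = np.sum(i[:k] ** (a - b)) / k ** a    # (1/(T_eff*gamma)) * sum_{i<=k} i^(a-b)
    tail = np.sum(i[k:] ** (-b))                # sum_{i>k} i^(-b), truncated at n_max
    return head, tail

a, b = 2.0, 1.5                                 # illustrative exponents (assumed)
ratios = []
for k in (10, 100, 1000):
    head, tail = bias_upper_terms(k, a, b)
    ratios.append((head + tail) / k ** (1.0 - b))
    print(k, head, tail, ratios[-1])
```

The ratio to $k^{1-b}$ stays bounded as $k$ grows, which is the content of the $\lesssim k^{1-b}$ step. The head and the tail each contribute a constant multiple of $k^{1-b}$.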

## Appendix H Variance Error for One-pass Batch SGD

This section focuses on the one-pass batch SGD procedure in equation [2](https://arxiv.org/html/2605.24316#S2.E2) and uses the notation from Section [A.3](https://arxiv.org/html/2605.24316#A1.SS3). In particular,

$$T:=\frac{N}{B},\qquad T_{\mathrm{eff}}:=\frac{T}{\log T},\qquad u_{t}^{\mathrm{op}}-u^{\ast}=(I-\gamma_{t}\bar{Z}_{t})(u_{t-1}^{\mathrm{op}}-u^{\ast})+\gamma_{t}\bar{\xi}_{t}.$$

Throughout this section, Assumption [3](https://arxiv.org/html/2605.24316#Thmassumption3) is understood with $N$ replaced by $T$ and $L_{\mathrm{eff}}$ replaced by $T_{\mathrm{eff}}$ (except that the max norm is taken over the original $N$ samples).

For the covariance-iterate arguments below, it is convenient to reindex the one-pass updates by $t=0,\dots,T-1$. Whenever this convention is used, we set

$$(\gamma_{t},\bar{Z}_{t},\bar{\xi}_{t}):=(\gamma_{t+1},\bar{Z}_{t+1},\bar{\xi}_{t+1}),$$

where the right-hand side refers to the original indexing $t=1,\dots,T$. In particular, $q_{T}$ and $v_{T}$ still denote the states after all $T$ one-pass updates.

We also define

$$\widetilde{\sigma}^{2}(w^{\ast}):=2\bigl(\sigma^{2}+\alpha\|w^{\ast}\|_{H}^{2}\bigr),\qquad\overline{\sigma}^{2}(w^{\ast}):=\widetilde{\sigma}^{2}(w^{\ast})+\alpha\|w^{\ast}\|_{H}^{2}=2\sigma^{2}+3\alpha\|w^{\ast}\|_{H}^{2},$$

where $\alpha$ is the constant from Lemma [K.2](https://arxiv.org/html/2605.24316#A11.Thmlemma2). Under Gaussian design, that lemma permits the choice $\alpha=3$.

### H.1 Upper and lower bounds for the exact variance components

By Proposition [F.2](https://arxiv.org/html/2605.24316#A6.Thmproposition2), the exact centered variance splits as

$$\mathrm{Var}_{B}=\mathrm{Var}_{B}^{\mathrm{cov}}+\mathrm{Var}_{B}^{\mathrm{noise}}.$$

We first record abstract upper and lower bounds for the two exact variance subcomponents. Both are controlled by the same spectral kernel, which we define first and bound only in the last subsection.

###### Definition H.1 (Common raw kernel quantity).

Let

$$\widetilde{T}_{\Sigma}(\gamma)\circ A:=\Sigma A+A\Sigma-\gamma\Sigma A\Sigma$$

for any symmetric matrix $A$. Define

$$\mathcal{K}_{B}:=\frac{1}{B(1-\gamma R_{B}^{2})}\left\langle\Sigma,\;\sum_{t=0}^{T-1}\gamma_{t}^{2}\prod_{i=t+1}^{T-1}(I-\gamma_{i}\Sigma)^{2}\Sigma\right\rangle,\tag{17}$$

where

$$R_{B}^{2}:=\left(1+\frac{\alpha-1}{B}\right)\operatorname{tr}(\Sigma).\tag{18}$$

#### Stability of the batch fourth-moment factor.

Under Assumption [3](https://arxiv.org/html/2605.24316#Thmassumption3), the absolute constant $c>0$ in the stepsize condition may be chosen sufficiently small so that the denominator in the above definition is bounded away from zero. Indeed, since $B\geq 1$ and $\alpha\geq 1$ (as for the Gaussian choice $\alpha=3$),

$$R_{B}^{2}=\left(1+\frac{\alpha-1}{B}\right)\operatorname{tr}(\Sigma)\leq\alpha\,\operatorname{tr}(\Sigma).$$

Therefore Assumption 3A gives

$$\gamma R_{B}^{2}\leq\alpha\gamma\operatorname{tr}(\Sigma)\leq\alpha c.$$

Choosing $c\leq 1/(2\alpha)$, we obtain

$$\gamma R_{B}^{2}\leq\frac{1}{2},\qquad 1-\gamma R_{B}^{2}\geq\frac{1}{2}.$$

Throughout this section, we work under this choice of the absolute constant. Consequently, all factors of the form $(1-\gamma R_{B}^{2})^{-1}$ are absorbed into universal constants.
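To make Definition H.1 and the stability argument concrete, the sketch below evaluates $R_{B}^{2}$ and $\mathcal{K}_{B}$ for a small diagonal $\Sigma$. It assumes a constant stepsize $\gamma_{t}=\gamma$ (the paper uses a schedule), the Gaussian value $\alpha=3$, and $\gamma=c/\operatorname{tr}(\Sigma)$ with $c=1/(2\alpha)$. The function names are hypothetical. Under a constant stepsize the kernel reduces to a sum over eigenvalues, $\sum_{j}\lambda_{j}^{2}\sum_{t}\gamma^{2}(1-\gamma\lambda_{j})^{2(T-1-t)}$.

```python
import numpy as np

def r_b_squared(Sigma, B, alpha=3.0):
    """R_B^2 from (18)."""
    return (1.0 + (alpha - 1.0) / B) * np.trace(Sigma)

def kernel_K_B(Sigma, B, gamma, T, alpha=3.0):
    """K_B from (17), specialised to a constant stepsize gamma_t = gamma (assumed)."""
    lam = np.linalg.eigvalsh(Sigma)
    t = np.arange(T)
    # For each eigenvalue: sum_t gamma^2 * (1 - gamma*lam)^(2*(T-1-t)).
    geom = gamma**2 * np.sum((1.0 - gamma * lam[:, None]) ** (2 * (T - 1 - t)), axis=1)
    inner = np.sum(lam**2 * geom)               # <Sigma, (sum_t ...) Sigma>
    return inner / (B * (1.0 - gamma * r_b_squared(Sigma, B, alpha)))

alpha = 3.0
Sigma = np.diag(np.arange(1, 9, dtype=float) ** -2.0)   # assumed power-law spectrum
gamma = (1.0 / (2.0 * alpha)) / np.trace(Sigma)         # gamma * tr(Sigma) = c = 1/(2*alpha)
T = 200

batch_sizes = (1, 4, 16, 64)
stability = [gamma * r_b_squared(Sigma, B, alpha) for B in batch_sizes]
kernels = [kernel_K_B(Sigma, B, gamma, T, alpha) for B in batch_sizes]
for B, s, K in zip(batch_sizes, stability, kernels):
    print(B, s, K)
```

At fixed `T`, the only dependence on `B` is through the prefactor $1/(B(1-\gamma R_{B}^{2}))$, so the kernel decays essentially like $1/B$ once $\gamma R_{B}^{2}\leq 1/2$.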

###### Proposition H.1 (Upper and lower bounds for the exact variance components).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1) and [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2). Suppose additionally that $\gamma R_{B}^{2}<1/2$. Then the two exact centered-variance subcomponents satisfy

$$0\leq\mathrm{Var}_{B}^{\mathrm{noise}}\leq\widetilde{\sigma}^{2}(w^{\ast})\,\mathcal{K}_{B},$$

$$0\leq\mathrm{Var}_{B}^{\mathrm{cov}}\leq\alpha\|u^{\ast}\|_{\Sigma}^{2}\,\mathcal{K}_{B}\leq\alpha\|w^{\ast}\|_{H}^{2}\,\mathcal{K}_{B}.$$

Consequently,

$$0\leq\mathrm{Var}_{B}\leq\overline{\sigma}^{2}(w^{\ast})\,\mathcal{K}_{B}.$$

The same bounds hold after taking expectation over $w^{\ast}$.

###### Proof.

The lower bounds are immediate from the definitions

$$\mathrm{Var}_{B}^{\mathrm{noise}}=\mathbb{E}\|v_{T}\|_{\Sigma}^{2},\qquad\mathrm{Var}_{B}^{\mathrm{cov}}=\mathbb{E}\|q_{T}\|_{\Sigma}^{2},$$

and from the exact identity $\mathrm{Var}_{B}=\mathrm{Var}_{B}^{\mathrm{cov}}+\mathrm{Var}_{B}^{\mathrm{noise}}$.

For the upper bound on the additive-noise component, Theorem [H.1](https://arxiv.org/html/2605.24316#A8.Thmmytheorem1) gives

$$\mathrm{Var}_{B}^{\mathrm{noise}}=\langle\Sigma,C_{T}^{(B)}\rangle\leq\widetilde{\sigma}^{2}(w^{\ast})\,\mathcal{K}_{B}.$$

Similarly, Theorem [H.2](https://arxiv.org/html/2605.24316#A8.Thmmytheorem2) yields

$$\mathrm{Var}_{B}^{\mathrm{cov}}=\langle\Sigma,Q_{T}^{(B)}\rangle\leq\alpha\|u^{\ast}\|_{\Sigma}^{2}\,\mathcal{K}_{B}.$$

To compare $\|u^{\ast}\|_{\Sigma}^{2}$ with $\|w^{\ast}\|_{H}^{2}$, write

$$P:=H^{1/2}S^{\top}\Sigma^{-1}SH^{1/2}.$$

Since $\Sigma=SHS^{\top}$, the matrix $P$ is symmetric and satisfies $P^{2}=P$, so $P$ is an orthogonal projector and therefore $0\preceq P\preceq I$. Using $u^{\ast}=\Sigma^{-1}SHw^{\ast}$,

$$\|u^{\ast}\|_{\Sigma}^{2}=w^{\ast\top}HS^{\top}\Sigma^{-1}SHw^{\ast}=w^{\ast\top}H^{1/2}PH^{1/2}w^{\ast}\leq\|w^{\ast}\|_{H}^{2}.$$

Combining the last three displays proves the upper bounds. ∎
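The projector argument at the end of the proof is easy to check numerically. The sketch below uses a finite ambient dimension, a diagonal $H$ with a power-law spectrum, and a Gaussian sketch with entries of variance $1/M$. These are illustrative assumptions and may differ from the paper's exact setup. It verifies that $P$ is symmetric and idempotent and that $\|u^{\ast}\|_{\Sigma}^{2}\leq\|w^{\ast}\|_{H}^{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 50, 10                                    # ambient and sketch dimensions (assumed)
h = np.arange(1, d + 1, dtype=float) ** -2.0     # diagonal of H, assumed power law
S = rng.standard_normal((M, d)) / np.sqrt(M)     # Gaussian sketch, assumed scaling
w_star = rng.standard_normal(d)                  # arbitrary target for the check

H = np.diag(h)
H_half = np.diag(np.sqrt(h))
Sigma = S @ H @ S.T                              # Sigma = S H S^T

# P = H^{1/2} S^T Sigma^{-1} S H^{1/2}, computed with a linear solve.
P = H_half @ S.T @ np.linalg.solve(Sigma, S @ H_half)
u_star = np.linalg.solve(Sigma, S @ H @ w_star)  # u* = Sigma^{-1} S H w*

u_norm_sq = u_star @ Sigma @ u_star              # ||u*||_Sigma^2
w_norm_sq = w_star @ H @ w_star                  # ||w*||_H^2
print(np.max(np.abs(P @ P - P)), u_norm_sq, w_norm_sq)
```

Because $P$ has rank $M<d$, the inequality is typically strict: part of $H^{1/2}w^{\ast}$ lies outside the range of the sketch.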

### H.2 The additive-noise component

We first treat the additive-noise component $(v_{t})$ from Definition [F.2](https://arxiv.org/html/2605.24316#A6.Thmdefinition2).

Define

$$v_{0}:=0,\qquad v_{t+1}:=(I-\gamma_{t}\bar{Z}_{t})v_{t}+\gamma_{t}\bar{\xi}_{t}.$$

Let

$$C_{t}^{(B)}:=\mathbb{E}[v_{t}v_{t}^{\top}]$$

be the covariance iterate of the additive-noise component. Also define the batch fourth-moment operator

$$M_{\Sigma}^{(B)}(A):=\mathbb{E}[\bar{Z}_{t}A\bar{Z}_{t}],$$

and the batch covariance operator

$$T_{\Sigma}^{(B)}(\gamma)\circ A:=\Sigma A+A\Sigma-\gamma M_{\Sigma}^{(B)}(A).$$
###### Lemma H\.1\(Covariance recursion for the additive\-noise component\)\.

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1) and [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2). Let

$$\Sigma_{\xi}^{(B)}:=\mathbb{E}[\bar{\xi}_{t}\bar{\xi}_{t}^{\top}].$$

Then

$$\Sigma_{\xi}^{(B)}\preceq\frac{\widetilde{\sigma}^{2}(w^{\ast})}{B}\,\Sigma.$$

Moreover, the covariance iterate satisfies the exact recursion

$$C_{t+1}^{(B)}=\bigl(I-\gamma_{t}T_{\Sigma}^{(B)}(\gamma_{t})\bigr)\circ C_{t}^{(B)}+\gamma_{t}^{2}\Sigma_{\xi}^{(B)}.\tag{19}$$

Finally, if $\gamma R_{B}^{2}<1/2$, then for all $t$,

$$C_{t}^{(B)}\preceq\frac{\gamma\,\widetilde{\sigma}^{2}(w^{\ast})}{B(1-\gamma R_{B}^{2})}\,I.$$

###### Proof\.

We first prove the bound on $\Sigma_{\xi}^{(B)}$. Since the $B$ samples in a batch are independent and centered,

$$\Sigma_{\xi}^{(B)}=\mathbb{E}[\bar{\xi}_{t}\bar{\xi}_{t}^{\top}]=\frac{1}{B}\mathbb{E}\!\left[z\bigl(y-z^{\top}u^{\ast}\bigr)^{2}z^{\top}\right].$$

Now $z=Sx$, and part (ii) of Lemma [K.2](https://arxiv.org/html/2605.24316#A11.Thmlemma2) yields

$$\mathbb{E}\!\left[(y-\langle u^{\ast},Sx\rangle)^{2}(Sx)(Sx)^{\top}\right]\preceq\widetilde{\sigma}^{2}(w^{\ast})\,\Sigma,$$

where $\widetilde{\sigma}^{2}(w^{\ast})=2(\sigma^{2}+\alpha\|w^{\ast}\|_{H}^{2})$. Hence

$$\Sigma_{\xi}^{(B)}\preceq\frac{\widetilde{\sigma}^{2}(w^{\ast})}{B}\Sigma.$$

We next derive the recursion. By definition,

$$v_{t+1}=(I-\gamma_{t}\bar{Z}_{t})v_{t}+\gamma_{t}\bar{\xi}_{t}.$$

Taking the second moment, the mixed term vanishes after conditioning on the current sketched covariates $\mathcal{G}_{t}^{\mathrm{cur}}:=\sigma(z_{t,1},\dots,z_{t,B})$ and on $w^{\ast}$: the vector $v_{t}$ depends only on earlier batches and is therefore independent of $\mathcal{G}_{t}^{\mathrm{cur}}$, while Proposition [F.2](https://arxiv.org/html/2605.24316#A6.Thmproposition2) gives $\mathbb{E}[\bar{\xi}_{t}\mid\mathcal{G}_{t}^{\mathrm{cur}},w^{\ast}]=0$. Hence both cross terms are zero, and

$$C_{t+1}^{(B)}=\mathbb{E}[(I-\gamma_{t}\bar{Z}_{t})v_{t}v_{t}^{\top}(I-\gamma_{t}\bar{Z}_{t})]+\gamma_{t}^{2}\Sigma_{\xi}^{(B)}.$$

Since $v_{t}$ is independent of the current batch, we have

$$\mathbb{E}[(I-\gamma_{t}\bar{Z}_{t})v_{t}v_{t}^{\top}(I-\gamma_{t}\bar{Z}_{t})]=\bigl(I-\gamma_{t}T_{\Sigma}^{(B)}(\gamma_{t})\bigr)\circ C_{t}^{(B)},$$

which proves equation [19](https://arxiv.org/html/2605.24316#A8.E19).

Finally, we prove the crude bound by induction. Set

$$\kappa_{B}:=\frac{\gamma\,\widetilde{\sigma}^{2}(w^{\ast})}{B(1-\gamma R_{B}^{2})}.$$

We show $C_{t}^{(B)}\preceq\kappa_{B}I$ for all $t$. This is true for $t=0$ since $C_{0}^{(B)}=0$. Assume it holds at time $t$. Then by monotonicity of $M_{\Sigma}^{(B)}$,

$$M_{\Sigma}^{(B)}(C_{t}^{(B)})\preceq\kappa_{B}\,M_{\Sigma}^{(B)}(I).$$

Next we compute

$$M_{\Sigma}^{(B)}(I)=\mathbb{E}[\bar{Z}_{t}^{2}]=\frac{1}{B}\mathbb{E}[(zz^{\top})^{2}]+\frac{B-1}{B}\Sigma^{2}.$$

By part (i) of Lemma [K.2](https://arxiv.org/html/2605.24316#A11.Thmlemma2), for every PSD matrix $A$,

$$\mathbb{E}[zz^{\top}Azz^{\top}]\preceq\alpha\,\operatorname{tr}(\Sigma A)\Sigma.$$

Applying this with $A=I$, and using $\Sigma^{2}\preceq\|\Sigma\|_{2}\Sigma\preceq\operatorname{tr}(\Sigma)\Sigma$, we obtain

$$M_{\Sigma}^{(B)}(I)\preceq\left(\frac{\alpha}{B}+\frac{B-1}{B}\right)\operatorname{tr}(\Sigma)\Sigma=R_{B}^{2}\Sigma.$$

Therefore

$$C_{t+1}^{(B)}\preceq\kappa_{B}(I-2\gamma_{t}\Sigma+\gamma_{t}^{2}R_{B}^{2}\Sigma)+\gamma_{t}^{2}\frac{\widetilde{\sigma}^{2}(w^{\ast})}{B}\Sigma.$$

Since $\gamma_{t}\leq\gamma$, we get

$$C_{t+1}^{(B)}\preceq\kappa_{B}I-\frac{\gamma_{t}^{2}\widetilde{\sigma}^{2}(w^{\ast})}{B}\left(\frac{2-\gamma R_{B}^{2}}{1-\gamma R_{B}^{2}}-1\right)\Sigma\preceq\kappa_{B}I.$$

This closes the induction. ∎
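The batch second-moment identity $\mathbb{E}[\bar{Z}_{t}^{2}]=\frac{1}{B}\mathbb{E}[(zz^{\top})^{2}]+\frac{B-1}{B}\Sigma^{2}$ used in the induction can be verified exactly on a toy distribution. The sketch below assumes, purely for illustration, that $z$ is uniform on a small finite set, so that every expectation is a finite average and the $n^{B}$ iid batches can be enumerated; the sizes and names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, M, B = 4, 3, 2                 # support size, dimension, batch size (illustrative)

# Hypothetical finite population: z is uniform on the rows of Z, so expectations are exact averages.
Z = rng.standard_normal((n, M))
outer = np.einsum("ni,nj->nij", Z, Z)               # z z^T for each support point
Sigma = outer.mean(axis=0)                          # E[z z^T]
fourth = np.mean([O @ O for O in outer], axis=0)    # E[(z z^T)^2]

# Left-hand side: enumerate all n^B equally likely iid batches and average Zbar^2.
lhs = np.zeros((M, M))
for idx in itertools.product(range(n), repeat=B):
    Zbar = outer[list(idx)].mean(axis=0)
    lhs += Zbar @ Zbar
lhs /= n ** B

# Right-hand side: (1/B) E[(z z^T)^2] + ((B-1)/B) Sigma^2.
rhs = fourth / B + (B - 1) / B * Sigma @ Sigma
```

The two sides agree up to floating-point error because the $B$ diagonal pairs contribute the single-sample fourth moment and the $B(B-1)$ off-diagonal pairs contribute $\Sigma^{2}$.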

### H.3 Variance bound for the additive-noise component

We now state and prove the variance bound for the additive-noise component.

###### Theorem H.1 (Variance bound for the additive-noise component).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1) and [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2), and suppose $\gamma R_{B}^{2}<1/2$. Then

$$\mathrm{Var}_{B}^{\mathrm{noise}}=\langle\Sigma,C_{T}^{(B)}\rangle\leq\frac{\widetilde{\sigma}^{2}(w^{\ast})}{B(1-\gamma R_{B}^{2})}\left\langle\Sigma,\;\sum_{t=0}^{T-1}\gamma_{t}^{2}\prod_{i=t+1}^{T-1}(I-\gamma_{i}\Sigma)^{2}\Sigma\right\rangle.$$

###### Proof\.

Starting from the exact recursion equation [19](https://arxiv.org/html/2605.24316#A8.E19), Lemma [H.1](https://arxiv.org/html/2605.24316#A8.Thmlemma1) gives the crude bound $C_{t}^{(B)}\preceq\kappa_{B}I$, where $\kappa_{B}=\gamma\widetilde{\sigma}^{2}(w^{\ast})/(B(1-\gamma R_{B}^{2}))$. Combining this with equation [19](https://arxiv.org/html/2605.24316#A8.E19), the estimate $M_{\Sigma}^{(B)}(I)\preceq R_{B}^{2}\Sigma$, and the bound $\Sigma_{\xi}^{(B)}\preceq\widetilde{\sigma}^{2}(w^{\ast})\Sigma/B$, we obtain

$$C_{t+1}^{(B)}\preceq C_{t}^{(B)}-\gamma_{t}\Sigma C_{t}^{(B)}-\gamma_{t}C_{t}^{(B)}\Sigma+\gamma_{t}^{2}\frac{\widetilde{\sigma}^{2}(w^{\ast})}{B(1-\gamma R_{B}^{2})}\Sigma.$$

Since $\gamma_{t}^{2}\Sigma C_{t}^{(B)}\Sigma\succeq 0$, the last display is in turn bounded by

$$C_{t+1}^{(B)}\preceq\bigl(I-\gamma_{t}\widetilde{T}_{\Sigma}(\gamma_{t})\bigr)\circ C_{t}^{(B)}+\gamma_{t}^{2}\frac{\widetilde{\sigma}^{2}(w^{\ast})}{B(1-\gamma R_{B}^{2})}\Sigma,$$

where

$$\widetilde{T}_{\Sigma}(\gamma)\circ A:=\Sigma A+A\Sigma-\gamma\Sigma A\Sigma.$$

Expanding this definition shows that, for every matrix $A$,

$$(I-\gamma\widetilde{T}_{\Sigma}(\gamma))\circ A=(I-\gamma\Sigma)A(I-\gamma\Sigma),$$

so the map $A\mapsto(I-\gamma\widetilde{T}_{\Sigma}(\gamma))\circ A$ preserves the PSD order. Unrolling the recursion from $t=0$ to $T-1$, and using $C_{0}^{(B)}=0$, we therefore get

$$C_{T}^{(B)}\preceq\frac{\widetilde{\sigma}^{2}(w^{\ast})}{B(1-\gamma R_{B}^{2})}\sum_{t=0}^{T-1}\gamma_{t}^{2}\prod_{i=t+1}^{T-1}(I-\gamma_{i}\widetilde{T}_{\Sigma}(\gamma_{i}))\circ\Sigma.$$

Since each product is applied to $A=\Sigma$, every term in the expansion is a polynomial in $\Sigma$ and hence commutes with $\Sigma$, so each factor $(I-\gamma_{i}\Sigma)A(I-\gamma_{i}\Sigma)$ collapses to $(I-\gamma_{i}\Sigma)^{2}A$. Therefore,

$$\prod_{i=t+1}^{T-1}(I-\gamma_{i}\widetilde{T}_{\Sigma}(\gamma_{i}))\circ\Sigma=\prod_{i=t+1}^{T-1}(I-\gamma_{i}\Sigma)^{2}\Sigma.$$

Taking the inner product with $\Sigma$ proves the claim. ∎
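The unrolling step can be illustrated numerically. The sketch below iterates the surrogate recursion $C_{t+1}=(I-\gamma_{t}\Sigma)C_{t}(I-\gamma_{t}\Sigma)+\gamma_{t}^{2}\Sigma$ with the scalar prefactor $\widetilde{\sigma}^{2}(w^{\ast})/(B(1-\gamma R_{B}^{2}))$ dropped, and compares the result with the closed-form sum in the theorem. The covariance, the step-size schedule, and all names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
M, T = 4, 30
G = rng.standard_normal((M, M))
Sigma = G @ G.T / M                              # a generic PSD covariance (illustrative)
gamma0 = 0.4 / np.trace(Sigma)                   # keeps every gamma_t * Sigma below I/2
gammas = gamma0 / 2.0 ** (np.arange(T) // 10)    # hypothetical blockwise-halving schedule
I = np.eye(M)

# Iterate C_{t+1} = (I - g Sigma) C_t (I - g Sigma) + g^2 Sigma from C_0 = 0.
C = np.zeros((M, M))
for g in gammas:
    C = (I - g * Sigma) @ C @ (I - g * Sigma) + g**2 * Sigma

# Closed form: sum_t gamma_t^2 prod_{i>t} (I - gamma_i Sigma)^2 Sigma.
closed = np.zeros((M, M))
for t in range(T):
    prod = I.copy()
    for g in gammas[t + 1:]:
        prod = prod @ (I - g * Sigma) @ (I - g * Sigma)
    closed += gammas[t] ** 2 * prod @ Sigma

kernel = np.trace(Sigma @ closed)                # <Sigma, .> as a trace inner product
```

Here `kernel` is the inner product appearing on the right-hand side of the theorem, up to the omitted prefactor.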

### H.4 The covariance-fluctuation component

We now treat the centered covariance-fluctuation process $(q_{t})$ from Definition [F.2](https://arxiv.org/html/2605.24316#A6.Thmdefinition2). We also reindex the recursion from equation [14](https://arxiv.org/html/2605.24316#A6.E14) as

$$q_{0}:=0,\qquad q_{t+1}:=(I-\gamma_{t}\bar{Z}_{t})q_{t}+\gamma_{t}\zeta_{t},\qquad\zeta_{t}:=(\Sigma-\bar{Z}_{t})m_{t},$$

for $t=0,\dots,T-1$. Let

$$Q_{t}^{(B)}:=\mathbb{E}[q_{t}q_{t}^{\top}],\qquad\Lambda_{t}^{(B)}:=\mathbb{E}[\zeta_{t}\zeta_{t}^{\top}].$$

###### Lemma H.2 (Covariance iterate for the centered covariance-fluctuation component).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1) and [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2). Then $\mathbb{E}[q_{t}]=0$ for every $t$, and

$$\Lambda_{t}^{(B)}\preceq\frac{\alpha\|m_{t}\|_{\Sigma}^{2}}{B}\,\Sigma\preceq\frac{\alpha\|u^{\ast}\|_{\Sigma}^{2}}{B}\,\Sigma.$$

Moreover, the covariance iterate satisfies the exact recursion

$$Q_{t+1}^{(B)}=\bigl(I-\gamma_{t}T_{\Sigma}^{(B)}(\gamma_{t})\bigr)\circ Q_{t}^{(B)}+\gamma_{t}^{2}\Lambda_{t}^{(B)}.\tag{20}$$

Finally, if $\gamma R_{B}^{2}<1/2$, then for all $t$,

$$Q_{t}^{(B)}\preceq\frac{\gamma\alpha\|u^{\ast}\|_{\Sigma}^{2}}{B(1-\gamma R_{B}^{2})}\,I.$$

###### Proof\.

We first prove by induction that $\mathbb{E}[q_{t}]=0$ for every $t$. This is true at time $t=0$. Suppose $\mathbb{E}[q_{t}]=0$. The iterate $q_{t}$ depends only on the first $t$ batches and is therefore independent of the current batch defining $\bar{Z}_{t}$. Using also $\mathbb{E}[\bar{Z}_{t}]=\Sigma$ and the fact that $m_{t}$ is deterministic,

$$\mathbb{E}[q_{t+1}]=(I-\gamma_{t}\Sigma)\mathbb{E}[q_{t}]+\gamma_{t}\mathbb{E}[(\Sigma-\bar{Z}_{t})m_{t}]=0.$$

Next, let $A_{t}:=m_{t}m_{t}^{\top}$. Since the $B$ samples in a batch are *iid*,

$$\mathbb{E}[\bar{Z}_{t}A_{t}\bar{Z}_{t}]=\frac{1}{B}\mathbb{E}[zz^{\top}A_{t}zz^{\top}]+\frac{B-1}{B}\Sigma A_{t}\Sigma.$$

Therefore,

$$\Lambda_{t}^{(B)}=\mathbb{E}[(\Sigma-\bar{Z}_{t})A_{t}(\Sigma-\bar{Z}_{t})]=\frac{1}{B}\Bigl(\mathbb{E}[zz^{\top}A_{t}zz^{\top}]-\Sigma A_{t}\Sigma\Bigr)\preceq\frac{1}{B}\mathbb{E}[zz^{\top}A_{t}zz^{\top}].$$

By part (i) of Lemma [K.2](https://arxiv.org/html/2605.24316#A11.Thmlemma2),

$$\mathbb{E}[zz^{\top}A_{t}zz^{\top}]\preceq\alpha\,\operatorname{tr}(\Sigma A_{t})\Sigma=\alpha\|m_{t}\|_{\Sigma}^{2}\Sigma,$$

which proves the first bound on $\Lambda_{t}^{(B)}$. Under the zero-based convention fixed at the start of the section, $m_{t+1}=(I-\gamma_{t}\Sigma)m_{t}$, and all eigenvalues of $I-\gamma_{t}\Sigma$ lie in $[0,1]$. Hence the sequence $\|m_{t}\|_{\Sigma}^{2}$ is nonincreasing, which gives $\|m_{t}\|_{\Sigma}^{2}\leq\|m_{0}\|_{\Sigma}^{2}=\|u^{\ast}\|_{\Sigma}^{2}$.

We now derive the recursion. Expanding the second moment of $q_{t+1}=(I-\gamma_{t}\bar{Z}_{t})q_{t}+\gamma_{t}\zeta_{t}$ gives

$$Q_{t+1}^{(B)}=\mathbb{E}[(I-\gamma_{t}\bar{Z}_{t})q_{t}q_{t}^{\top}(I-\gamma_{t}\bar{Z}_{t})]+\gamma_{t}^{2}\Lambda_{t}^{(B)}+\Gamma_{t}+\Gamma_{t}^{\top},$$

where

$$\Gamma_{t}:=\mathbb{E}[(I-\gamma_{t}\bar{Z}_{t})q_{t}\zeta_{t}^{\top}].$$

Because $q_{t}$ is independent of the current batch and $m_{t}$ is deterministic, if we define the linear operator

$$\mathcal{L}_{t}(A):=\mathbb{E}[(I-\gamma_{t}\bar{Z}_{t})A(\Sigma-\bar{Z}_{t})],$$

then

$$\Gamma_{t}=\mathcal{L}_{t}\bigl(\mathbb{E}[q_{t}]m_{t}^{\top}\bigr)=0.$$

Thus the cross terms vanish. Since $q_{t}$ is independent of the current batch,

$$\mathbb{E}[(I-\gamma_{t}\bar{Z}_{t})q_{t}q_{t}^{\top}(I-\gamma_{t}\bar{Z}_{t})]=\bigl(I-\gamma_{t}T_{\Sigma}^{(B)}(\gamma_{t})\bigr)\circ Q_{t}^{(B)},$$

which proves equation [20](https://arxiv.org/html/2605.24316#A8.E20).

Finally, set

$$\kappa_{q}:=\frac{\gamma\alpha\|u^{\ast}\|_{\Sigma}^{2}}{B(1-\gamma R_{B}^{2})}.$$

We prove by induction that $Q_{t}^{(B)}\preceq\kappa_{q}I$. This is true at $t=0$. Assuming it holds at time $t$, monotonicity of $M_{\Sigma}^{(B)}$ and the bound on $M_{\Sigma}^{(B)}(I)$ from Lemma [H.1](https://arxiv.org/html/2605.24316#A8.Thmlemma1) give

$$M_{\Sigma}^{(B)}(Q_{t}^{(B)})\preceq\kappa_{q}R_{B}^{2}\Sigma.$$

Using equation [20](https://arxiv.org/html/2605.24316#A8.E20) and the bound on $\Lambda_{t}^{(B)}$, we get

$$Q_{t+1}^{(B)}\preceq\kappa_{q}(I-2\gamma_{t}\Sigma+\gamma_{t}^{2}R_{B}^{2}\Sigma)+\gamma_{t}^{2}\frac{\alpha\|u^{\ast}\|_{\Sigma}^{2}}{B}\Sigma\preceq\kappa_{q}I.$$

This closes the induction. ∎
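The exact formula $\Lambda_{t}^{(B)}=\frac{1}{B}\bigl(\mathbb{E}[zz^{\top}A_{t}zz^{\top}]-\Sigma A_{t}\Sigma\bigr)$ from the proof can be checked by direct enumeration. As a minimal sketch, assume $z$ is uniform on a small finite set and let a random vector `m` stand in for the deterministic $m_{t}$; both choices, and the names below, are illustrative only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, M, B = 4, 3, 2
Z = rng.standard_normal((n, M))            # hypothetical finite support for z
outer = np.einsum("ni,nj->nij", Z, Z)
Sigma = outer.mean(axis=0)
m = rng.standard_normal(M)                 # stands in for the deterministic vector m_t
A = np.outer(m, m)

single = np.mean([O @ A @ O for O in outer], axis=0)   # E[z z^T A z z^T]

# Exact Lambda: average (Sigma - Zbar) A (Sigma - Zbar) over all n^B iid batches.
Lam = np.zeros((M, M))
for idx in itertools.product(range(n), repeat=B):
    D = Sigma - outer[list(idx)].mean(axis=0)
    Lam += D @ A @ D
Lam /= n ** B

formula = (single - Sigma @ A @ Sigma) / B
gap = single / B - Lam                     # equals Sigma A Sigma / B, hence PSD
```

The matrix `gap` is the term discarded in the final inequality of that display; it is rank one and positive semidefinite.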

###### Theorem H.2 (Bound on the centered covariance-fluctuation component).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1) and [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2), and suppose $\gamma R_{B}^{2}<1/2$. Then

$$\mathrm{Var}_{B}^{\mathrm{cov}}=\langle\Sigma,Q_{T}^{(B)}\rangle\leq\frac{\alpha\|u^{\ast}\|_{\Sigma}^{2}}{B(1-\gamma R_{B}^{2})}\left\langle\Sigma,\;\sum_{t=0}^{T-1}\gamma_{t}^{2}\prod_{i=t+1}^{T-1}(I-\gamma_{i}\Sigma)^{2}\Sigma\right\rangle.$$

###### Proof\.

Starting from the exact recursion equation [20](https://arxiv.org/html/2605.24316#A8.E20), Lemma [H.2](https://arxiv.org/html/2605.24316#A8.Thmlemma2) gives the crude bound $Q_{t}^{(B)}\preceq\kappa_{q}I$, where $\kappa_{q}=\gamma\alpha\|u^{\ast}\|_{\Sigma}^{2}/(B(1-\gamma R_{B}^{2}))$. Combining this with equation [20](https://arxiv.org/html/2605.24316#A8.E20), the estimate $M_{\Sigma}^{(B)}(I)\preceq R_{B}^{2}\Sigma$, and the bound $\Lambda_{t}^{(B)}\preceq\alpha\|u^{\ast}\|_{\Sigma}^{2}\Sigma/B$, we get

$$Q_{t+1}^{(B)}\preceq Q_{t}^{(B)}-\gamma_{t}\Sigma Q_{t}^{(B)}-\gamma_{t}Q_{t}^{(B)}\Sigma+\gamma_{t}^{2}\frac{\alpha\|u^{\ast}\|_{\Sigma}^{2}}{B(1-\gamma R_{B}^{2})}\Sigma.$$

Since $\gamma_{t}^{2}\Sigma Q_{t}^{(B)}\Sigma\succeq 0$, this is bounded by

$$Q_{t+1}^{(B)}\preceq\bigl(I-\gamma_{t}\widetilde{T}_{\Sigma}(\gamma_{t})\bigr)\circ Q_{t}^{(B)}+\gamma_{t}^{2}\frac{\alpha\|u^{\ast}\|_{\Sigma}^{2}}{B(1-\gamma R_{B}^{2})}\Sigma.$$

The same unrolling and commutation argument as in the proof of Theorem [H.1](https://arxiv.org/html/2605.24316#A8.Thmmytheorem1) therefore yields

$$Q_{T}^{(B)}\preceq\frac{\alpha\|u^{\ast}\|_{\Sigma}^{2}}{B(1-\gamma R_{B}^{2})}\sum_{t=0}^{T-1}\gamma_{t}^{2}\prod_{i=t+1}^{T-1}(I-\gamma_{i}\Sigma)^{2}\Sigma.$$

Taking the inner product with $\Sigma$ proves the claim. ∎

### H.5 Bounds under the source condition

We now specialize the abstract upper bounds from Proposition [H.1](https://arxiv.org/html/2605.24316#A8.Thmproposition1) under the power-law spectrum and source-condition assumptions. The only remaining input is the bound on $\mathcal{K}_{B}$, which is proved in the next subsection.

###### Lemma H.3 (Source-condition bounds for the exact one-pass variance terms).

Assume Assumptions[1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3),[2](https://arxiv.org/html/2605.24316#Thmassumption2),[4](https://arxiv.org/html/2605.24316#Thmassumption4), and[3](https://arxiv.org/html/2605.24316#Thmassumption3), with the convention in Theorem[3\.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1)\. Suppose moreover that

Then, with probability at least $1-\exp(-\Omega(M))$ over the sketch matrix $S$,

$$\mathbb{E}_{w^{\ast}}[\mathrm{Var}_{B}^{\mathrm{noise}}]\lesssim\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}},\qquad\mathbb{E}_{w^{\ast}}[\mathrm{Var}_{B}^{\mathrm{cov}}]\lesssim\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}},$$

and consequently

$$\mathbb{E}_{w^{\ast}}[\mathrm{Var}_{B}]\lesssim\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}}.$$

###### Proof\.

By Proposition [H.1](https://arxiv.org/html/2605.24316#A8.Thmproposition1),

$$\mathbb{E}_{w^{\ast}}[\mathrm{Var}_{B}^{\mathrm{noise}}]\leq\mathbb{E}_{w^{\ast}}[\widetilde{\sigma}^{2}(w^{\ast})]\,\mathcal{K}_{B},$$

$$\mathbb{E}_{w^{\ast}}[\mathrm{Var}_{B}^{\mathrm{cov}}]\leq\alpha\,\mathbb{E}_{w^{\ast}}[\|u^{\ast}\|_{\Sigma}^{2}]\,\mathcal{K}_{B}\leq\alpha\,\mathbb{E}_{w^{\ast}}[\|w^{\ast}\|_{H}^{2}]\,\mathcal{K}_{B}.$$

Under Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2),

$$\mathbb{E}_{w^{\ast}}[\|w^{\ast}\|_{H}^{2}]=\sum_{i\geq 1}\lambda_{i}\,\mathbb{E}[(w_{i}^{\ast})^{2}]\asymp\sum_{i\geq 1}i^{-b}\lesssim 1,$$

since $b>1$. Therefore

$$\mathbb{E}_{w^{\ast}}[\widetilde{\sigma}^{2}(w^{\ast})]\lesssim 1,\qquad\mathbb{E}_{w^{\ast}}[\overline{\sigma}^{2}(w^{\ast})]\lesssim 1.$$

Applying Lemma [H.5](https://arxiv.org/html/2605.24316#A8.Thmlemma5), proved in the next subsection, yields

$$\mathbb{E}_{w^{\ast}}[\mathrm{Var}_{B}^{\mathrm{noise}}]\lesssim\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}},$$

and

$$\mathbb{E}_{w^{\ast}}[\mathrm{Var}_{B}^{\mathrm{cov}}]\lesssim\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}}.$$

Summing the two bounds gives the final claim. ∎
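To make the role of $B$ in this rate concrete, the following sketch evaluates $\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(B\,T_{\mathrm{eff}})$ with all constants dropped, first at a fixed update count $T$ and then in the one-pass regime $T=N/B$. The helper `variance_rate` and the numerical values of `N`, `M`, `gamma`, and `a` are hypothetical and chosen only for illustration.

```python
import math

def variance_rate(T, B, M, gamma, a):
    """Rate min{M, (T_eff*gamma)^(1/a)} / (B*T_eff) with T_eff = T / log T (constants dropped)."""
    T_eff = T / math.log(T)
    return min(M, (T_eff * gamma) ** (1.0 / a)) / (B * T_eff)

N, M, gamma, a = 2**20, 10**6, 0.5, 2.0     # illustrative values, not from the paper

# Fixed update count T: doubling B halves the rate.
fixed_T = [variance_rate(4096, B, M, gamma, a) for B in (1, 2, 4, 8)]

# One-pass regime T = N / B: larger B also shortens the horizon T_eff.
one_pass = [variance_rate(N // B, B, M, gamma, a) for B in (1, 2, 4, 8)]

fixed_ratios = [fixed_T[i] / fixed_T[i + 1] for i in range(3)]
one_pass_ratios = [one_pass[i] / one_pass[i + 1] for i in range(3)]
```

With these values, each doubling of $B$ reduces the fixed-$T$ rate by a factor of exactly two, while in the one-pass regime the reduction factor lies strictly between one and two because the horizon $T_{\mathrm{eff}}$ shrinks at the same time.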

### H.6 Bounding the common kernel quantity

We now prove the bounds on $\mathcal{K}_{B}$ used above. We first convert this kernel quantity into the effective-dimension form and then simplify it under the power-law spectrum assumption.

###### Lemma H.4 (Effective-dimension reduction for $\mathcal{K}_{B}$).

Assume the blockwise geometric learning-rate schedule

$$\gamma_{t}=\frac{\gamma}{2^{\ell}}\qquad\text{for }t\in I_{\ell},$$

where the blocks $I_{\ell}$ form a partition of $\{0,\dots,T-1\}$ into consecutive intervals of length comparable, up to endpoint rounding, to

$$T_{\mathrm{eff}}:=\frac{T}{\log T}.$$

Then there exists a universal constant $c>0$ such that

$$\mathcal{K}_{B}\leq\frac{c}{B\,T_{\mathrm{eff}}}\sum_{j=1}^{M}\min\{1,T_{\mathrm{eff}}\gamma\,\mu_{j}(\Sigma)\}.$$

Equivalently, if we write

$$s:=T_{\mathrm{eff}}\gamma,$$

then

$$\mathcal{K}_{B}\leq\frac{c}{B\,T_{\mathrm{eff}}}\left(\#\{\mu_{j}(\Sigma)\geq 1/s\}+s\sum_{\mu_{j}(\Sigma)<1/s}\mu_{j}(\Sigma)\right).$$

This second display is simply an exact rewriting of the effective-dimension bound above, and it is the form used in the power-law estimate below.
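The equivalence of the two displays in the lemma is an elementary identity, which the following sketch checks on an arbitrary spectrum. The eigenvalues `mu` and the value of `s` are hypothetical placeholders for $\mu_{j}(\Sigma)$ and $T_{\mathrm{eff}}\gamma$.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.sort(rng.uniform(0.0, 1.0, size=50))[::-1]   # hypothetical eigenvalues of Sigma
s = 7.3                                              # stands in for T_eff * gamma

lhs = np.minimum(1.0, s * mu).sum()                  # sum_j min{1, s * mu_j}
head = np.count_nonzero(mu >= 1.0 / s)               # eigenvalues at or above the threshold 1/s
tail = s * mu[mu < 1.0 / s].sum()                    # eigenvalues below the threshold, scaled by s
rhs = head + tail
```

Eigenvalues at or above $1/s$ each contribute one, and the remaining ones contribute $s\mu_{j}$, which is why the sum splits into a count plus a scaled tail.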

###### Proof\.

Diagonalize $\Sigma$ as

$$\Sigma=U\operatorname{diag}(\mu_{1}(\Sigma),\dots,\mu_{M}(\Sigma))U^{\top}.$$

Then Definition [H.1](https://arxiv.org/html/2605.24316#A8.Thmdefinition1) gives

$$\mathcal{K}_{B}=\frac{1}{B(1-\gamma R_{B}^{2})}\sum_{j=1}^{M}\mu_{j}(\Sigma)^{2}\sum_{t=0}^{T-1}\gamma_{t}^{2}\prod_{i=t+1}^{T-1}\bigl(1-\gamma_{i}\mu_{j}(\Sigma)\bigr)^{2}.$$

For each scalar $\lambda\geq 0$, define

$$\Phi_{T}(\lambda):=\lambda^{2}\sum_{t=0}^{T-1}\gamma_{t}^{2}\prod_{i=t+1}^{T-1}(1-\gamma_{i}\lambda)^{2}.$$

We claim that, for every $\lambda$ in the spectrum of $\Sigma$,

$$\Phi_{T}(\lambda)\lesssim\frac{1}{T_{\mathrm{eff}}}\min\{1,T_{\mathrm{eff}}\gamma\lambda\}.\tag{21}$$

To prove this, partition $\{0,\dots,T-1\}$ into the geometric blocks $(I_{\ell})_{\ell\geq 0}$, and write

$$\eta_{\ell}:=\frac{\gamma}{2^{\ell}},\qquad a_{\ell}:=T_{\mathrm{eff}}\eta_{\ell}\lambda,\qquad a_{0}=T_{\mathrm{eff}}\gamma\lambda.$$

Since $\gamma\operatorname{tr}(\Sigma)\lesssim 1$ by Assumption [3](https://arxiv.org/html/2605.24316#Thmassumption3) and $\lambda\leq\operatorname{tr}(\Sigma)$, we have $0\leq\gamma_{t}\lambda\leq 1/2$ after taking the absolute constant in that assumption small enough. Hence all factors $1-\gamma_{t}\lambda$ lie in $[0,1]$, and therefore for each block $I_{\ell}$,

$$\begin{aligned}\Phi_{T,\ell}(\lambda)&:=\lambda^{2}\sum_{t\in I_{\ell}}\gamma_{t}^{2}\prod_{i=t+1}^{T-1}(1-\gamma_{i}\lambda)^{2}\\&\leq\lambda^{2}|I_{\ell}|\eta_{\ell}^{2}\prod_{q>\ell}(1-\eta_{q}\lambda)^{2|I_{q}|}.\end{aligned}$$

Using $|I_{\ell}|\asymp T_{\mathrm{eff}}$, the inequality $1-x\leq e^{-x}$ for $x\in[0,1]$, and the geometric identity $\sum_{q>\ell}\eta_{q}\asymp\eta_{\ell}$, we obtain

$$\Phi_{T,\ell}(\lambda)\lesssim T_{\mathrm{eff}}\lambda^{2}\eta_{\ell}^{2}\exp\!\Bigl(-cT_{\mathrm{eff}}\lambda\sum_{q>\ell}\eta_{q}\Bigr)\lesssim\frac{a_{\ell}^{2}}{T_{\mathrm{eff}}}e^{-ca_{\ell}}$$

for a universal constant $c>0$. Summing over $\ell$ gives

$$\Phi_{T}(\lambda)\lesssim\frac{1}{T_{\mathrm{eff}}}\sum_{\ell\geq 0}a_{\ell}^{2}e^{-ca_{\ell}},\qquad a_{\ell}=\frac{a_{0}}{2^{\ell}}.$$

If $a_{0}\leq 1$, then

$$\sum_{\ell\geq 0}a_{\ell}^{2}e^{-ca_{\ell}}\leq\sum_{\ell\geq 0}a_{\ell}^{2}\lesssim a_{0}^{2}\leq a_{0}.$$

If $a_{0}\geq 1$, let $\ell_{\star}:=\lfloor\log_{2}a_{0}\rfloor$. Then for $\ell>\ell_{\star}$, we have $a_{\ell}<1$, so

$$\sum_{\ell>\ell_{\star}}a_{\ell}^{2}e^{-ca_{\ell}}\leq\sum_{\ell>\ell_{\star}}a_{\ell}^{2}\lesssim 1.$$

For $\ell\leq\ell_{\star}$, we have $a_{\ell}\geq 1$, and the dyadic points $a_{\ell}=a_{0}/2^{\ell}$ decrease geometrically, while $x\mapsto x^{2}e^{-cx}$ decays exponentially for large $x$. Hence

$$\sum_{\ell\leq\ell_{\star}}a_{\ell}^{2}e^{-ca_{\ell}}\lesssim 1.$$

Combining the last three displays proves equation [21](https://arxiv.org/html/2605.24316#A8.E21). Substituting this bound into the diagonal expansion above and using that $1-\gamma R_{B}^{2}$ is bounded below by a universal constant under the hypothesis $\gamma R_{B}^{2}<1/2$, the first claim is verified.

The second display is just the identity

$$\sum_{j=1}^{M}\min\{1,s\mu_{j}\}=\#\{\mu_{j}\geq 1/s\}+s\sum_{\mu_{j}<1/s}\mu_{j},\qquad s:=T_{\mathrm{eff}}\gamma.$$

No further reduction is needed here; Lemma [H.5](https://arxiv.org/html/2605.24316#A8.Thmlemma5) applies this equivalent form directly under the power-law assumption. ∎
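The scalar estimate (21) can be probed numerically. The sketch below evaluates $\Phi_{T}(\lambda)$ through its one-step recursion and compares it with $\min\{1,T_{\mathrm{eff}}\gamma\lambda\}/T_{\mathrm{eff}}$ on a grid of $\lambda$. It assumes a simplified schedule with `L` equal-length blocks and treats the block length `K` as $T_{\mathrm{eff}}$; these choices and the helper `phi_T` are illustrative, not the paper's exact construction.

```python
import numpy as np

def phi_T(lam, gammas):
    """Phi_T(lam) = lam^2 * sum_t gamma_t^2 * prod_{i>t} (1 - gamma_i*lam)^2, via its one-step recursion."""
    phi = 0.0
    for g in gammas:
        phi = (1.0 - g * lam) ** 2 * phi + (g * lam) ** 2
    return phi

# Hypothetical schedule: L blocks of equal length K, step size halved per block.
L, K, gamma = 10, 200, 0.5
gammas = np.repeat(gamma / 2.0 ** np.arange(L), K)   # T = L * K steps
T_eff = K                                            # block length plays the role of T / log T

lams = np.logspace(-6, 0, 61)                        # gamma * lam <= 1/2 on this grid
ratios = np.array(
    [phi_T(lam, gammas) * T_eff / min(1.0, T_eff * gamma * lam) for lam in lams]
)
```

On this grid the entries of `ratios` stay bounded by a small constant over six orders of magnitude in $\lambda$, which is the behavior that (21) describes.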

###### Lemma H.5 (Power-law bound for $\mathcal{K}_{B}$).

Assume Assumptions [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3) and [4](https://arxiv.org/html/2605.24316#Thmassumption4), and suppose $T_{\mathrm{eff}}\gamma\gtrsim 1$ (as in Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1)). Then, with probability at least $1-\exp(-\Omega(M))$ over the sketch matrix $S$,

$$\mathcal{K}_{B}\lesssim\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}}.$$

Consequently,

$$\mathrm{Var}_{B}\lesssim\overline{\sigma}^{2}(w^{\ast})\,\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}}.$$

###### Proof\.

By Lemma [H.4](https://arxiv.org/html/2605.24316#A8.Thmlemma4), it suffices to bound

$$\sum_{j=1}^{M}\min\{1,T_{\mathrm{eff}}\gamma\,\mu_{j}(\Sigma)\}.$$

By Lemma [K.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7), with probability at least $1-\exp(-\Omega(M))$,

$$\mu_{j}(\Sigma)\asymp j^{-a},\qquad j\in[M].$$

Let

$$k_{\star}:=\min\{M,\lfloor(T_{\mathrm{eff}}\gamma)^{1/a}\rfloor\}.$$

Then for $j\leq k_{\star}$,

$$T_{\mathrm{eff}}\gamma\,\mu_{j}(\Sigma)\gtrsim 1,$$

while for $j>k_{\star}$,

$$T_{\mathrm{eff}}\gamma\,\mu_{j}(\Sigma)\lesssim T_{\mathrm{eff}}\gamma\,j^{-a}.$$

Hence

$$\sum_{j=1}^{M}\min\{1,T_{\mathrm{eff}}\gamma\,\mu_{j}(\Sigma)\}\lesssim k_{\star}+T_{\mathrm{eff}}\gamma\sum_{j>k_{\star}}j^{-a}.$$

Since $a>1$,

$$\sum_{j>k_{\star}}j^{-a}\lesssim k_{\star}^{\,1-a},$$

and therefore

$$\sum_{j=1}^{M}\min\{1,T_{\mathrm{eff}}\gamma\,\mu_{j}(\Sigma)\}\lesssim k_{\star}+T_{\mathrm{eff}}\gamma\,k_{\star}^{\,1-a}\lesssim k_{\star}\leq\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}.$$

Here the tail sum runs over $k_{\star}<j\leq M$ and is empty when $k_{\star}=M$. When $k_{\star}<M$, the definition of $k_{\star}$ gives $T_{\mathrm{eff}}\gamma<(k_{\star}+1)^{a}\lesssim k_{\star}^{a}$ (using $k_{\star}\geq 1$, which holds once $T_{\mathrm{eff}}\gamma\geq 1$), and this yields the second inequality. Substituting into Lemma [H.4](https://arxiv.org/html/2605.24316#A8.Thmlemma4) proves the first claim.

The second claim follows from Proposition [H.1](https://arxiv.org/html/2605.24316#A8.Thmproposition1). ∎
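The power-law estimate can be illustrated with an exact spectrum $\mu_{j}=j^{-a}$, which is an idealization of the two-sided bound $\mu_{j}(\Sigma)\asymp j^{-a}$ used in the proof. In the sketch below, the helper `effective_dim` and the values of `M`, `a`, and the grid of $s=T_{\mathrm{eff}}\gamma$ are assumptions made for the example.

```python
import numpy as np

def effective_dim(s, M, a):
    """sum_{j<=M} min{1, s * j^{-a}} for an exact power-law spectrum mu_j = j^{-a}."""
    j = np.arange(1, M + 1, dtype=float)
    return np.minimum(1.0, s * j ** (-a)).sum()

M, a = 2000, 2.0                                  # illustrative sketch size and decay exponent
s_grid = np.logspace(0, 8, 33)                    # s = T_eff * gamma >= 1
ratios = np.array(
    [effective_dim(s, M, a) / min(M, s ** (1.0 / a)) for s in s_grid]
)
```

Across this grid the ratio stays within a constant factor of one, and it equals one once $s^{1/a}$ exceeds $M$, which corresponds to the saturated case $k_{\star}=M$.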

## Appendix I Fluctuation Error under Multi-pass Batch SGD with Replacement

This section focuses on the multi-pass batch SGD procedure with replacement (equation [3](https://arxiv.org/html/2605.24316#S2.E3)) and uses the notation from Section [A.4](https://arxiv.org/html/2605.24316#A1.SS4). The proof proceeds in three steps: we first rewrite the fluctuation recursion, then control the covariance of the stochastic noise term, and finally apply a stochastic approximation lemma together with the leave-one-out control of GD outputs. Unlike the single-sample update, each step here involves a random positive semidefinite mini-batch covariance matrix.

Define the fluctuation term, conditional on the sketched dataset, by

$$\mathrm{Fluc}^{\mathrm{wr}}_{B}:=\mathbb{E}_{\mathrm{batch}}\bigl[\|\Sigma^{1/2}(u_{L}^{\mathrm{wr}}-\theta_{L})\|_{2}^{2}\bigr]=\mathbb{E}_{\mathrm{batch}}\bigl[\|\Sigma^{1/2}\Delta_{L}\|_{2}^{2}\bigr].$$

For convenience, we restate the fluctuation process and its noise term here:

$$\Delta_{t}=\bigl(I-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}\bigr)\Delta_{t-1}+\gamma_{t}\xi_{t}^{(B)},\tag{22}$$

where

$$\xi_{t}^{(B)}:=-\bigl(\widehat{\Sigma}_{t}^{(B)}-\widehat{\Sigma}\bigr)(\theta_{t-1}-u^{\ast})+\bigl(\widehat{c}_{t}^{(B)}-\widehat{c}\bigr).\tag{23}$$
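The recursion (22) with noise term (23) can be checked against a direct simulation. The sketch below assumes that $\widehat{c}$ and $\widehat{c}_{t}^{(B)}$ are the full-sample and mini-batch averages of $z_{i}(y_{i}-z_{i}^{\top}u^{\ast})$, which is one reading consistent with the centered form of (23); the paper's Section A.4 fixes the actual definitions. The data, step sizes, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, B, L = 40, 3, 8, 25
Z = rng.standard_normal((N, M))                 # rows play the role of sketched covariates S x_i
u_star = rng.standard_normal(M)
resid = 0.1 * rng.standard_normal(N)            # residuals y_i - z_i^T u*
Sigma_hat = Z.T @ Z / N
c_hat = Z.T @ resid / N                         # assumed form of c-hat (see lead-in)

gammas = np.full(L, 0.1)
u = np.zeros(M)                                 # batch SGD iterate u_t^{wr}
theta = np.zeros(M)                             # full-batch GD reference theta_t
delta = np.zeros(M)                             # Delta_t propagated through recursion (22)
for g in gammas:
    idx = rng.integers(0, N, size=B)            # with-replacement mini-batch
    Sig_b = Z[idx].T @ Z[idx] / B
    c_b = Z[idx].T @ resid[idx] / B
    xi = -(Sig_b - Sigma_hat) @ (theta - u_star) + (c_b - c_hat)   # noise term (23)
    delta = (np.eye(M) - g * Sig_b) @ delta + g * xi               # recursion (22)
    u = u - g * (Sig_b @ (u - u_star) - c_b)                       # mini-batch least-squares step
    theta = theta - g * (Sigma_hat @ (theta - u_star) - c_hat)     # full-gradient step

mismatch = np.max(np.abs(delta - (u - theta)))
```

Under the stated assumption, `delta` tracks `u - theta` up to floating-point error for any realization of the batches, since (22) is an algebraic rewriting of the difference of the two updates.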
### I.1 Upper bound result

###### Lemma I.1 (Upper bound on the fluctuation error for multi-pass batch SGD with replacement).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose

$$L_{\mathrm{eff}}\lesssim N^{a}/\gamma.$$

Under the notation above, fix any $s\in[0,1]$ and $\alpha>1$. Let $\theta_{t}^{(-i)}$ denote the leave-one-out GD iterate, namely,

$$\theta_{t}^{(-i)}=\left(I-\gamma_{t}\widehat{\Sigma}^{(-i)}\right)\theta_{t-1}^{(-i)}+\gamma_{t}\left(SX^{\top}y\right)^{(-i)},\qquad\text{with }\theta_{0}^{(-i)}=0,$$

where $\widehat{\Sigma}^{(-i)}:=\sum_{j\neq i}Sx_{j}x_{j}^{\top}S^{\top}/N$ and $(SX^{\top}y)^{(-i)}:=\sum_{j\neq i}Sx_{j}y_{j}/N$. Define

$$\lambda:=\frac{1}{L_{\mathrm{eff}}\gamma},\qquad R_{p}:=\bigl\|(\Sigma+\lambda I)^{1/2}(\widehat{\Sigma}+\lambda I)^{-1/2}\bigr\|_{2}^{2},$$

$$a_{\max}:=\max_{i\in[N],\,t\in[L]}\bigl|y_{i}-x_{i}^{\top}S^{\top}\theta_{t}^{(-i)}\bigr|,$$

$$B_{\Delta}:=a_{\max}^{2}\cdot\max_{i\in[N]}\|Sx_{i}\|_{2}^{2}\cdot R_{p}\cdot\frac{(L_{\mathrm{eff}}\gamma)^{2-s}}{N^{2}},$$

and

$$F_{B}:=R_{p}\Bigl(\max_{i\in[N]}(x_{i}^{\top}S^{\top}u^{\ast})^{2}+\max_{i\in[N]}\widetilde{\varepsilon}_{i}^{2}+\max_{i\in[N],\,t\in[L]}(x_{i}^{\top}S^{\top}\theta_{t}^{(-i)})^{2}+\max_{i\in[N]}\|x_{i}^{\top}S^{\top}\|_{\Sigma^{-s}}^{2}\cdot B_{\Delta}\Bigr).$$

Then there exists a constant $c>0$, depending only on $(s,\alpha)$, such that with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

𝔼​\[FlucBwr\]=𝔼w∗,\(xi,yi\)i=1N,batch​\[‖Σ1/2​\(uLwr−θL\)‖22\]≤c⋅𝔼​\[FB\]⋅tr​\(Σ^1/α\)B⋅γ1/α​Leff1/α−1\.\\mathbb\{E\}\[\\mathrm\{Fluc\}^\{\\mathrm\{wr\}\}\_\{B\}\]=\\mathbb\{E\}\_\{w^\{\\ast\},\(x\_\{i\},y\_\{i\}\)\_\{i=1\}^\{N\},\\mathrm\{batch\}\}\\bigl\[\\\|\\Sigma^\{1/2\}\(u\_\{L\}^\{\\mathrm\{wr\}\}\-\\theta\_\{L\}\)\\\|\_\{2\}^\{2\}\\bigr\]\\leq c\\cdot\\mathbb\{E\}\[F\_\{B\}\]\\cdot\\frac\{\\mathrm\{tr\}\(\\widehat\{\\Sigma\}^\{1/\\alpha\}\)\}\{B\}\\cdot\\gamma^\{1/\\alpha\}L\_\{\\mathrm\{eff\}\}^\{1/\\alpha\-1\}\.

###### Proof\.

Conditioned on $S$, $w^{\ast}$, and the dataset $D=(x_{i},y_{i})_{i=1}^{N}$, the mini-batch average is unbiased:

$$\mathbb{E}[\widehat{\Sigma}_{t}^{(B)}\mid S,w^{\ast},D,\mathcal{F}_{t-1}]=\widehat{\Sigma},\qquad\mathbb{E}[\widehat{c}_{t}^{(B)}\mid S,w^{\ast},D,\mathcal{F}_{t-1}]=\widehat{c}.$$

Hence

$$\mathbb{E}[\xi_{t}^{(B)}\mid S,w^{\ast},D,\mathcal{F}_{t-1}]=0.$$

We then rewrite the batch noise as an average of single-sample noises. Define, for $j\in[N]$,

$$\zeta_{t}(j):=-\bigl(Sx_{j}x_{j}^{\top}S^{\top}-\widehat{\Sigma}\bigr)(\theta_{t-1}-u^{\ast})+\bigl(Sx_{j}\widetilde{\varepsilon}_{j}-\widehat{c}\bigr).$$

Then

$$\xi_{t}^{(B)}=\frac{1}{B}\sum_{r=1}^{B}\zeta_{t}(i_{t,r}).$$

Conditioned on $(S,w^{\ast},D,\mathcal{F}_{t-1})$, the vectors $\zeta_{t}(i_{t,1}),\dots,\zeta_{t}(i_{t,B})$ are i.i.d. and mean zero, hence

$$\mathbb{E}\bigl[\xi_{t}^{(B)}\xi_{t}^{(B)\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]=\frac{1}{B}\,\mathbb{E}\bigl[\zeta_{t}(i)\zeta_{t}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr],$$

where $i\sim\mathrm{unif}([N])$.
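The $1/B$ factor for with-replacement batches can be checked exactly on a tiny synthetic problem by enumerating all $N^{B}$ equally likely ordered batches. The sketch below is illustrative only: the data are random, the vector `d` stands in for $\theta_{t-1}-u^{\ast}$, and the function name `batch_covariance_gap` is hypothetical.

```python
import itertools

import numpy as np


def batch_covariance_gap(N=4, M=3, B=2, seed=0):
    """Max entrywise gap between E[xi xi^T] and (1/B) E[zeta zeta^T]."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(N, M))          # rows stand in for z_j = S x_j
    eps = rng.normal(size=N)             # stand-in residuals
    d = rng.normal(size=M)               # stands in for theta_{t-1} - u_star
    Sigma_hat = Z.T @ Z / N
    c_hat = Z.T @ eps / N

    # Single-sample noises zeta_t(j); they average to zero over j by construction.
    zeta = np.array([
        -(np.outer(Z[j], Z[j]) - Sigma_hat) @ d + (Z[j] * eps[j] - c_hat)
        for j in range(N)
    ])
    single = zeta.T @ zeta / N           # E[zeta zeta^T] with i ~ unif([N])

    # Exact second moment of the batch noise over all ordered batches.
    batch = np.zeros((M, M))
    for idx in itertools.product(range(N), repeat=B):
        xi = zeta[list(idx)].mean(axis=0)
        batch += np.outer(xi, xi)
    batch /= N ** B

    return float(np.max(np.abs(batch - single / B)))


if __name__ == "__main__":
    print(batch_covariance_gap())
```

Exhaustive enumeration is used instead of Monte Carlo so that the comparison is exact up to floating-point error; it is only feasible for very small $N$ and $B$.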

Write

$$d_{t}:=\theta_{t-1}-u^{\ast},\qquad z_{i}:=Sx_{i},$$

and decompose

$$\zeta_{t}(i)=\zeta_{t,1}(i)+\zeta_{t,2}(i),$$

where

$$\zeta_{t,1}(i):=-\bigl(z_{i}z_{i}^{\top}-\widehat{\Sigma}\bigr)d_{t},\qquad\zeta_{t,2}(i):=z_{i}\widetilde{\varepsilon}_{i}-\widehat{c}.$$

Since $(a+b)(a+b)^{\top}\preceq 2aa^{\top}+2bb^{\top}$ for all vectors $a,b$,

$$\begin{aligned}\mathbb{E}\bigl[\zeta_{t}(i)\zeta_{t}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]&\preceq 2\,\mathbb{E}\bigl[\zeta_{t,1}(i)\zeta_{t,1}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]\\&\qquad+2\,\mathbb{E}\bigl[\zeta_{t,2}(i)\zeta_{t,2}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr].\end{aligned}$$

We bound the two terms separately. Since

$$\zeta_{t,1}(i)=a_{t}(i)-\mathbb{E}[a_{t}(i)\mid S,w^{\ast},D,\mathcal{F}_{t-1}],\qquad a_{t}(i):=-z_{i}z_{i}^{\top}d_{t},$$

its covariance is dominated by its second moment:

$$\mathbb{E}\bigl[\zeta_{t,1}(i)\zeta_{t,1}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]\preceq\mathbb{E}\bigl[a_{t}(i)a_{t}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr].$$

Now

$$a_{t}(i)a_{t}(i)^{\top}=\bigl(z_{i}^{\top}d_{t}\bigr)^{2}z_{i}z_{i}^{\top}\preceq\max_{j\in[N]}\bigl(z_{j}^{\top}d_{t}\bigr)^{2}z_{i}z_{i}^{\top},$$

so averaging over $i\sim\mathrm{unif}([N])$ gives

$$\mathbb{E}\bigl[\zeta_{t,1}(i)\zeta_{t,1}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]\preceq\max_{j\in[N]}\bigl(x_{j}^{\top}S^{\top}(\theta_{t-1}-u^{\ast})\bigr)^{2}\,\widehat{\Sigma}.$$

Similarly,

$$\zeta_{t,2}(i)=b(i)-\mathbb{E}[b(i)\mid S,w^{\ast},D],\qquad b(i):=z_{i}\widetilde{\varepsilon}_{i},$$

and therefore

$$\mathbb{E}\bigl[\zeta_{t,2}(i)\zeta_{t,2}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]\preceq\mathbb{E}\bigl[b(i)b(i)^{\top}\mid S,w^{\ast},D\bigr].$$

Since

$$b(i)b(i)^{\top}=\widetilde{\varepsilon}_{i}^{2}z_{i}z_{i}^{\top}\preceq\max_{j\in[N]}\widetilde{\varepsilon}_{j}^{2}z_{i}z_{i}^{\top},$$

we obtain

$$\mathbb{E}\bigl[\zeta_{t,2}(i)\zeta_{t,2}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]\preceq\max_{j\in[N]}\widetilde{\varepsilon}_{j}^{2}\,\widehat{\Sigma}.$$

Combining the decomposition above with the last four displays yields

$$\mathbb{E}\bigl[\xi_{t}^{(B)}\xi_{t}^{(B)\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]\preceq\frac{\sigma_{\xi,B}^{2}}{B}\,\widehat{\Sigma},\tag{24}$$

where

$$\sigma_{\xi,B}^{2}:=2\max_{i\in[N],\,t\in[L]}\left[\bigl(x_{i}^{\top}S^{\top}(\theta_{t-1}-u^{\ast})\bigr)^{2}+\widetilde{\varepsilon}_{i}^{2}\right].$$
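The semidefinite bound (24) can be probed numerically for a single iteration. The sketch below uses the exact with-replacement covariance $\mathbb{E}[\xi\xi^{\top}]=\mathbb{E}[\zeta\zeta^{\top}]/B$ and checks that $\sigma_{\xi,B}^{2}\widehat{\Sigma}/B$ minus this covariance has no negative eigenvalue. It is a sketch under assumptions: the data are synthetic, the maximum over $t$ in $\sigma_{\xi,B}^{2}$ is replaced by a single fixed vector `d` standing in for $\theta_{t-1}-u^{\ast}$, and the function name `min_eig_of_bound_gap` is hypothetical.

```python
import numpy as np


def min_eig_of_bound_gap(N=50, M=6, B=8, seed=1):
    """Smallest eigenvalue of (sigma^2 / B) * Sigma_hat - E[xi xi^T]."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(N, M))          # rows stand in for z_i = S x_i
    eps = rng.normal(size=N)             # stand-in residuals
    d = rng.normal(size=M)               # stands in for theta_{t-1} - u_star
    Sigma_hat = Z.T @ Z / N
    c_hat = Z.T @ eps / N

    # Single-sample noises zeta(j), one row per data index j.
    proj = Z @ d                                          # z_j^T d
    zeta = -(Z * proj[:, None] - Sigma_hat @ d) + (Z * eps[:, None] - c_hat)
    cov_xi = zeta.T @ zeta / (N * B)                      # exact E[xi xi^T]

    # sigma_{xi,B}^2 for one iteration (no max over t in this sketch).
    sigma2 = 2.0 * float(np.max(proj ** 2 + eps ** 2))
    gap = sigma2 / B * Sigma_hat - cov_xi
    return float(np.linalg.eigvalsh((gap + gap.T) / 2).min())


if __name__ == "__main__":
    print(min_eig_of_bound_gap())
```

A nonnegative return value (up to rounding) is what the bound predicts; a single random instance is of course only a consistency check, not a proof.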
Let

$$\lambda:=\frac{1}{L_{\mathrm{eff}}\gamma},\qquad R_{p}:=\bigl\|(\Sigma+\lambda I)^{1/2}(\widehat{\Sigma}+\lambda I)^{-1/2}\bigr\|_{2}^{2}.$$

Condition on $S$, $w^{\ast}$, and $D$. By Lemma [I.3](https://arxiv.org/html/2605.24316#A9.Thmlemma3), assumptions (1), (3), (4), and (5) of Lemma [I.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4) hold for

$$A_{t}=\widehat{\Sigma}_{t}^{(B)},\qquad\Sigma_{\nu}=\widehat{\Sigma},\qquad C_{A}=\max_{i\in[N]}\|Sx_{i}\|_{2}^{2}.$$

Moreover, since $\theta_{t-1}$ is deterministic once $(S,w^{\ast},D)$ are fixed and each mini-batch is sampled independently across iterations, the pair $(A_{t},\xi_{t}^{(B)})$ is independent of $\mathcal{F}_{t-1}$. Step 1 (the unbiasedness of the mini-batch averages shown above) gives $\mathbb{E}_{\mathrm{batch}}[\xi_{t}^{(B)}]=0$, and taking expectation over the batch randomness in equation [24](https://arxiv.org/html/2605.24316#A9.E24) gives the covariance part of assumption (2) with

$$\sigma_{\xi}^{2}=\frac{\sigma_{\xi,B}^{2}}{B}.$$

Therefore, using

$$\|(\widehat{\Sigma}+\lambda I)^{1/2}\Delta_{L}\|_{2}^{2}=\|\widehat{\Sigma}^{1/2}\Delta_{L}\|_{2}^{2}+\lambda\|\Delta_{L}\|_{2}^{2},$$

and applying Lemma [I.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4) with $u=1$ and $u=0$, we obtain

$$\begin{aligned}\mathbb{E}_{\mathrm{batch}}\|\Sigma^{1/2}\Delta_{L}\|_{2}^{2}&\leq R_{p}\,\mathbb{E}_{\mathrm{batch}}\|(\widehat{\Sigma}+\lambda I)^{1/2}\Delta_{L}\|_{2}^{2}\\&=R_{p}\,\mathbb{E}_{\mathrm{batch}}\|\widehat{\Sigma}^{1/2}\Delta_{L}\|_{2}^{2}+R_{p}\,\lambda\,\mathbb{E}_{\mathrm{batch}}\|\Delta_{L}\|_{2}^{2}\\&\lesssim R_{p}\cdot\frac{\sigma_{\xi,B}^{2}}{B}\cdot\gamma_{0}\,\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\,(L_{\mathrm{eff}}\gamma_{0})^{1/\alpha-1}\\&\qquad+R_{p}\cdot\lambda\cdot\frac{\sigma_{\xi,B}^{2}}{B}\cdot\gamma_{0}\,\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\,(L_{\mathrm{eff}}\gamma_{0})^{1/\alpha}\\&=R_{p}\cdot\frac{\sigma_{\xi,B}^{2}}{B}\cdot\bigl(1+\lambda L_{\mathrm{eff}}\gamma_{0}\bigr)\gamma_{0}\,\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\,(L_{\mathrm{eff}}\gamma_{0})^{1/\alpha-1}\\&\lesssim R_{p}\cdot\frac{\sigma_{\xi,B}^{2}}{B}\cdot\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\,L_{\mathrm{eff}}^{1/\alpha-1}\gamma_{0}^{1/\alpha}\\&\leq R_{p}\cdot\frac{\sigma_{\xi,B}^{2}}{B}\cdot\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\,L_{\mathrm{eff}}^{1/\alpha-1}\gamma^{1/\alpha}.\end{aligned}\tag{25}$$

Here we used

$$\lambda L_{\mathrm{eff}}\gamma_{0}=\frac{\gamma_{0}}{\gamma}\leq 1,$$

since $\lambda=1/(L_{\mathrm{eff}}\gamma)$ and $\gamma_{0}\leq\gamma$.

By the elementary inequality

$$\bigl(x_{i}^{\top}S^{\top}(\theta_{t-1}-u^{\ast})\bigr)^{2}\leq 2(x_{i}^{\top}S^{\top}\theta_{t-1})^{2}+2(x_{i}^{\top}S^{\top}u^{\ast})^{2},$$

and Lemma D.3 in Lin et al. [[2025](https://arxiv.org/html/2605.24316#bib.bib1)], we have

$$\max_{i\in[N],\,t\in[L]}(x_{i}^{\top}S^{\top}\theta_{t})^{2}\lesssim\max_{i\in[N],\,t\in[L]}(x_{i}^{\top}S^{\top}\theta_{t}^{(-i)})^{2}+\max_{i\in[N]}\|x_{i}^{\top}S^{\top}\|_{\Sigma^{-s}}^{2}\cdot B_{\Delta},$$

where

$$a_{\max}:=\max_{i\in[N],\,t\in[L]}\bigl|y_{i}-x_{i}^{\top}S^{\top}\theta_{t}^{(-i)}\bigr|,\qquad B_{\Delta}:=a_{\max}^{2}\cdot\max_{i\in[N]}\|Sx_{i}\|_{2}^{2}\cdot R_{p}\cdot\frac{(L_{\mathrm{eff}}\gamma)^{2-s}}{N^{2}}.$$

Therefore, defining

$$F_{B}:=R_{p}\Bigl(\max_{i\in[N]}(x_{i}^{\top}S^{\top}u^{\ast})^{2}+\max_{i\in[N]}\widetilde{\varepsilon}_{i}^{2}+\max_{i\in[N],\,t\in[L]}(x_{i}^{\top}S^{\top}\theta_{t}^{(-i)})^{2}+\max_{i\in[N]}\|x_{i}^{\top}S^{\top}\|_{\Sigma^{-s}}^{2}\cdot B_{\Delta}\Bigr),$$

we obtain

$$\sigma_{\xi,B}^{2}\lesssim\frac{F_{B}}{R_{p}}.$$

Substituting this bound into equation [25](https://arxiv.org/html/2605.24316#A9.E25) yields

$$\mathbb{E}_{\mathrm{batch}}\|\Sigma^{1/2}\Delta_{L}\|_{2}^{2}\lesssim F_{B}\cdot\frac{\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})}{B}\cdot\gamma^{1/\alpha}L_{\mathrm{eff}}^{1/\alpha-1}.$$

Taking expectation over $w^{\ast}$ and $(x_{i},y_{i})_{i=1}^{N}$ gives the desired result. ∎

### I.2 Fluctuation error under the source condition

###### Lemma I.2 (Upper fluctuation bound under the source condition for multi-pass batch SGD with replacement).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [2](https://arxiv.org/html/2605.24316#Thmassumption2), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [3](https://arxiv.org/html/2605.24316#Thmassumption3). Let $\varepsilon\in(0,1)$, and suppose in addition that

$$L_{\mathrm{eff}}\lesssim N^{(1-\varepsilon)a}/\gamma.$$

Then there exists an $(a,\varepsilon)$-dependent constant $c>0$ such that, whenever

$$\gamma\leq\frac{c}{\log N},$$

we have, with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathbb{E}[\mathrm{Fluc}^{\mathrm{wr}}_{B}]\lesssim\frac{\gamma\log N}{B}\left[(L_{\mathrm{eff}}\gamma)^{1/a-1}+\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N}\right].$$

###### Proof\.

We follow the Gaussian concentration estimates and the leave-one-out argument of Appendix D.3 of Lin et al. [[2025](https://arxiv.org/html/2605.24316#bib.bib1)], which imply that, for any fixed $s\in[0,1-1/a)$, conditioned on $S$ and $w^{\ast}$,

$$\mathbb{E}\bigl[F_{B}\mid S,w^{\ast}\bigr]\lesssim(\sigma^{2}+\|w^{\ast}\|_{H}^{2})\log N\left[1+\frac{\log^{2}N\,(L_{\mathrm{eff}}\gamma)^{2-s}}{N^{2}}\right].$$

Here we reuse the proof of Lemma [I.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1) and apply its power-law conclusion. As in that proof, the stochastic approximation lemma is invoked with the effective covariance proxy

$$\sigma_{\xi,\mathrm{eff}}^{2}:=\frac{\sigma_{\xi,B}^{2}}{B},$$

since $\mathbb{E}_{\mathrm{batch}}[\xi_{t}^{(B)}\xi_{t}^{(B)\top}]\preceq\sigma_{\xi,\mathrm{eff}}^{2}\widehat{\Sigma}$. On the high-probability event of Lemma [K.10](https://arxiv.org/html/2605.24316#A11.Thmlemma10), $\Sigma_{\nu}=\widehat{\Sigma}$ satisfies

$$\mu_{j}(\widehat{\Sigma})\asymp j^{-a}\qquad\text{for }j\leq\min\{M,N/c\},$$

which is precisely the spectral range required by the power-law part of Lemma [I.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4). Therefore, applying that power-law bound with $u=1$ and $u=0$, and using $\lambda=(L_{\mathrm{eff}}\gamma)^{-1}$, gives

$$\mathbb{E}_{\mathrm{batch}}\|\widehat{\Sigma}^{1/2}\Delta_{L}\|_{2}^{2}\lesssim\sigma_{\xi,\mathrm{eff}}^{2}\gamma_{0}(L_{\mathrm{eff}}\gamma_{0})^{1/a-1},$$

and

$$\lambda\,\mathbb{E}_{\mathrm{batch}}\|\Delta_{L}\|_{2}^{2}\lesssim\sigma_{\xi,\mathrm{eff}}^{2}\gamma_{0}(L_{\mathrm{eff}}\gamma_{0})^{1/a-1}.$$

Hence the same comparison as in Step 4 of the proof of Lemma [I.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1) (the chain of inequalities leading to equation [25](https://arxiv.org/html/2605.24316#A9.E25)) yields

$$\mathbb{E}_{\mathrm{batch}}\|\Sigma^{1/2}\Delta_{L}\|_{2}^{2}\lesssim R_{p}\,\sigma_{\xi,\mathrm{eff}}^{2}\gamma_{0}(L_{\mathrm{eff}}\gamma_{0})^{1/a-1}\leq R_{p}\,\sigma_{\xi,\mathrm{eff}}^{2}\gamma(L_{\mathrm{eff}}\gamma)^{1/a-1},$$

since

$$\gamma_{0}(L_{\mathrm{eff}}\gamma_{0})^{1/a-1}=L_{\mathrm{eff}}^{1/a-1}\gamma_{0}^{1/a}\leq L_{\mathrm{eff}}^{1/a-1}\gamma^{1/a}=\gamma(L_{\mathrm{eff}}\gamma)^{1/a-1}.$$

Finally, the proof of Lemma [I.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1) gives $\sigma_{\xi,B}^{2}\lesssim F_{B}/R_{p}$, hence

$$\sigma_{\xi,\mathrm{eff}}^{2}=\frac{\sigma_{\xi,B}^{2}}{B}\lesssim\frac{F_{B}}{B\,R_{p}}.$$

Therefore,

$$\mathbb{E}[\mathrm{Fluc}^{\mathrm{wr}}_{B}\mid S,w^{\ast}]\lesssim\frac{\gamma(L_{\mathrm{eff}}\gamma)^{1/a-1}}{B}\,\mathbb{E}\bigl[F_{B}\mid S,w^{\ast}\bigr].$$

Substituting the bound on $F_{B}$ above yields

$$\mathbb{E}[\mathrm{Fluc}^{\mathrm{wr}}_{B}\mid S,w^{\ast}]\lesssim\frac{\sigma^{2}+\|w^{\ast}\|_{H}^{2}}{B}\cdot\gamma\log N\left[1+\frac{\log^{2}N\,(L_{\mathrm{eff}}\gamma)^{2-s}}{N^{2}}\right](L_{\mathrm{eff}}\gamma)^{1/a-1}.$$

Taking expectation over $w^{\ast}$ yields

$$\mathbb{E}[\mathrm{Fluc}^{\mathrm{wr}}_{B}]\lesssim\frac{\gamma\log N}{B}\left[1+\frac{\log^{2}N\,(L_{\mathrm{eff}}\gamma)^{2-s}}{N^{2}}\right](L_{\mathrm{eff}}\gamma)^{1/a-1}.$$

Now choose

$$s:=1-\frac{1}{a(1-\varepsilon/2)}.$$

Then

$$\frac{\log^{2}N\,(L_{\mathrm{eff}}\gamma)^{1-s}}{N}\lesssim\log^{2}N\cdot N^{(1-\varepsilon)a(1-s)-1}=\log^{2}N\cdot N^{-\frac{\varepsilon}{2(1-\varepsilon/2)}}\lesssim 1,$$

where we used the assumption $L_{\mathrm{eff}}\gamma\lesssim N^{(1-\varepsilon)a}$. Therefore

$$\frac{\log^{2}N\,(L_{\mathrm{eff}}\gamma)^{2-s}}{N^{2}}(L_{\mathrm{eff}}\gamma)^{1/a-1}=\frac{\log^{2}N\,(L_{\mathrm{eff}}\gamma)^{1-s}}{N}\cdot\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N}\lesssim\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N}.$$

Combining this with the leading term yields

$$\left[1+\frac{\log^{2}N\,(L_{\mathrm{eff}}\gamma)^{2-s}}{N^{2}}\right](L_{\mathrm{eff}}\gamma)^{1/a-1}\lesssim(L_{\mathrm{eff}}\gamma)^{1/a-1}+\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N},$$

which proves the claim. ∎
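The exponent arithmetic behind the choice of $s$ can be verified with exact rational arithmetic. The short sketch below computes $(1-\varepsilon)a(1-s)-1$ for $s=1-1/(a(1-\varepsilon/2))$ and compares it with $-\varepsilon/(2(1-\varepsilon/2))$; the function name `n_exponent` and the sample values of $a$ and $\varepsilon$ are illustrative choices.

```python
from fractions import Fraction


def n_exponent(a, eps):
    """Exponent of N in N^{(1-eps) a (1-s) - 1} with s = 1 - 1/(a (1 - eps/2))."""
    s = 1 - 1 / (a * (1 - eps / 2))
    return (1 - eps) * a * (1 - s) - 1


if __name__ == "__main__":
    for a in (Fraction(3, 2), Fraction(2), Fraction(5)):
        for eps in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
            lhs = n_exponent(a, eps)
            rhs = -eps / (2 * (1 - eps / 2))
            print(a, eps, lhs, rhs, lhs == rhs)
```

The exponent does not depend on $a$ and is strictly negative for every $\varepsilon\in(0,1)$, which is why the $\log^{2}N$ factor is absorbed.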

### I.3 Lemmas to prove the upper bound

In this subsection, we first verify that the actual mini\-batch covariance process satisfies the moment assumptions needed later, and then state and prove a stochastic approximation lemma for mini\-batch covariance matrices\.

###### Lemma I.3 (Verification of the moment assumptions for $\widehat{\Sigma}_{t}^{(B)}$).

Condition on $S$, $w^{\ast}$, and the dataset $D$. Let

$$A_{t}:=\widehat{\Sigma}_{t}^{(B)},\qquad\Sigma_{\nu}:=\widehat{\Sigma},\qquad C_{A}:=\max_{i\in[N]}\|Sx_{i}\|_{2}^{2}.$$

Then $(A_{t})_{t\in[L]}$ are i.i.d. random PSD matrices over the batch randomness, and

$$\mathbb{E}_{\mathrm{batch}}[A_{t}]=\Sigma_{\nu},\qquad\mathbb{E}_{\mathrm{batch}}[A_{t}^{2}]\preceq C_{A}\Sigma_{\nu},\qquad\|\Sigma_{\nu}\|_{2}\leq C_{A}.$$

Moreover, if

$$\gamma_{0}:=\min\!\left\{\frac{1}{8C_{A}},\,\gamma\right\},$$

then $\gamma_{0}C_{A}\leq 1/8$, hence $4\gamma_{0}C_{A}\leq 1/2<1$. In particular, these are exactly the mean and matrix-moment bounds needed later to apply Lemma [I.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4) with $A_{t}=\widehat{\Sigma}_{t}^{(B)}$ and $\Sigma_{\nu}=\widehat{\Sigma}$.

###### Proof\.

Write

$$z_{i}:=Sx_{i},\qquad Z_{i}:=z_{i}z_{i}^{\top},\qquad I\sim\mathrm{unif}([N]).$$

Then

$$A_{t}=\frac{1}{B}\sum_{r=1}^{B}Z_{i_{t,r}},$$

so the matrices $A_{t}$ are PSD and are i.i.d. across $t$ because the mini-batches are sampled independently with replacement. Moreover,

$$\mathbb{E}_{\mathrm{batch}}[A_{t}]=\frac{1}{B}\sum_{r=1}^{B}\frac{1}{N}\sum_{i=1}^{N}Z_{i}=\widehat{\Sigma}=\Sigma_{\nu}.$$

For the second moment, independence of the batch draws gives

$$\mathbb{E}_{\mathrm{batch}}[A_{t}^{2}]=\frac{1}{B}\,\mathbb{E}[Z_{I}^{2}]+\frac{B-1}{B}\,\widehat{\Sigma}^{2}.$$

Now

$$Z_{i}^{2}=\|z_{i}\|_{2}^{2}Z_{i}\preceq C_{A}Z_{i},$$

so

$$\mathbb{E}[Z_{I}^{2}]\preceq C_{A}\,\mathbb{E}[Z_{I}]=C_{A}\widehat{\Sigma}.$$

Also, since each $Z_{i}\preceq C_{A}I$, we have

$$\widehat{\Sigma}=\frac{1}{N}\sum_{i=1}^{N}Z_{i}\preceq C_{A}I,\qquad\text{hence}\qquad\|\widehat{\Sigma}\|_{2}\leq C_{A},$$

and therefore

$$\widehat{\Sigma}^{2}\preceq\|\widehat{\Sigma}\|_{2}\widehat{\Sigma}\preceq C_{A}\widehat{\Sigma}.$$

Substituting the last two displays into the expression for $\mathbb{E}_{\mathrm{batch}}[A_{t}^{2}]$ yields

$$\mathbb{E}_{\mathrm{batch}}[A_{t}^{2}]\preceq\frac{1}{B}C_{A}\widehat{\Sigma}+\frac{B-1}{B}C_{A}\widehat{\Sigma}=C_{A}\widehat{\Sigma}=C_{A}\Sigma_{\nu}.$$

The bound $\|\Sigma_{\nu}\|_{2}\leq C_{A}$ was proved above, and the definition of $\gamma_{0}$ gives $\gamma_{0}C_{A}\leq 1/8$ immediately. This completes the verification. ∎
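The moment bounds of Lemma I.3 can be checked exactly on a small synthetic instance by enumerating all ordered with-replacement batches. The sketch below confirms the second-moment formula $\mathbb{E}[A_{t}^{2}]=\tfrac{1}{B}\mathbb{E}[Z_{I}^{2}]+\tfrac{B-1}{B}\widehat{\Sigma}^{2}$, the domination $\mathbb{E}[A_{t}^{2}]\preceq C_{A}\widehat{\Sigma}$, and $\|\widehat{\Sigma}\|_{2}\leq C_{A}$. The random data and the function name `check_moment_bounds` are assumptions made for illustration.

```python
import itertools

import numpy as np


def check_moment_bounds(N=4, M=3, B=2, seed=0):
    """Return (formula_err, min_eig, op_norm, C_A) for a synthetic instance."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(N, M))                      # rows stand in for z_i = S x_i
    Zmats = np.einsum("ni,nj->nij", Z, Z)            # Z_i = z_i z_i^T
    Sigma_hat = Zmats.mean(axis=0)
    C_A = float(np.max(np.sum(Z ** 2, axis=1)))      # max_i ||z_i||_2^2

    # Exact E[A_t^2] by enumerating all N^B equally likely ordered batches.
    second = np.zeros((M, M))
    for idx in itertools.product(range(N), repeat=B):
        A = Zmats[list(idx)].mean(axis=0)
        second += A @ A
    second /= N ** B

    # Closed form from the proof: (1/B) E[Z_I^2] + ((B-1)/B) Sigma_hat^2.
    EZ2 = np.mean([Zi @ Zi for Zi in Zmats], axis=0)
    formula = EZ2 / B + (B - 1) / B * (Sigma_hat @ Sigma_hat)

    formula_err = float(np.max(np.abs(second - formula)))
    gap = C_A * Sigma_hat - second
    min_eig = float(np.linalg.eigvalsh((gap + gap.T) / 2).min())
    op_norm = float(np.linalg.norm(Sigma_hat, 2))
    return formula_err, min_eig, op_norm, C_A


if __name__ == "__main__":
    print(check_moment_bounds())
```

With these quantities in hand, a stepsize cap in the spirit of the lemma would be `gamma0 = min(1 / (8 * C_A), gamma)` for a user-chosen `gamma`.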

###### Lemma I.4 (Stochastic approximation with random PSD updates).

Consider the recursion

$$\mu_{t}=(I-\gamma_{t}A_{t})\mu_{t-1}+\gamma_{t}\xi_{t},\qquad\mu_{0}=0,\qquad t\in[L],$$

where $A_{t}\in\mathbb{R}^{M\times M}$ are i.i.d. random PSD matrices, $\xi_{t}\in\mathbb{R}^{M}$ are random vectors, and each pair $(A_{t},\xi_{t})$ is independent of $\sigma((A_{s},\xi_{s}):s<t)$. Assume:

1. $\mathbb{E}[A_{t}]=\Sigma_{\nu}$ for some PSD matrix $\Sigma_{\nu}$;
2. $\mathbb{E}[\xi_{t}]=0$ and $\mathbb{E}[\xi_{t}\xi_{t}^{\top}]\preceq\sigma_{\xi}^{2}\Sigma_{\nu}$;
3. $\mathbb{E}[A_{t}^{2}]\preceq C_{A}\Sigma_{\nu}$;
4. $\|\Sigma_{\nu}\|_{2}\leq C_{A}$;
5. $\gamma_{0}C_{A}\leq 1/8$.

Then for any $u\in[0,1]$ and any $\alpha>1$,

$$\mathbb{E}\|\Sigma_{\nu}^{u/2}\mu_{L}\|_{2}^{2}\leq c_{\alpha}\,\sigma_{\xi}^{2}\,\gamma_{0}\,\mathrm{tr}(\Sigma_{\nu}^{1/\alpha})\,(L_{\mathrm{eff}}\gamma_{0})^{1/\alpha-u},$$

for some constant $c_{\alpha}>0$ depending only on $\alpha$. Moreover, if $\mu_{j}(\Sigma_{\nu})\asymp j^{-a}$ for $j\leq\min\{M,N/\widetilde{c}\}$ and some constants $a>1$ and $\widetilde{c}>0$, then

$$\mathbb{E}\|\Sigma_{\nu}^{u/2}\mu_{L}\|_{2}^{2}\leq c_{a}\,\sigma_{\xi}^{2}\,\gamma_{0}\,(L_{\mathrm{eff}}\gamma_{0})^{1/a-u},$$

for some constant $c_{a}>0$ depending only on $a$.

###### Proof\.

We define recursively

$$\xi_{t}^{(0)}:=\xi_{t},\qquad\xi_{t}^{(k)}:=(\Sigma_{\nu}-A_{t})\mu_{t-1}^{(k-1)}\quad\text{for }k\geq 1,$$

and, for $k\geq 0$,

$$\mu_{t}^{(k)}=(I-\gamma_{t}\Sigma_{\nu})\mu_{t-1}^{(k)}+\gamma_{t}\xi_{t}^{(k)},\qquad\mu_{0}^{(k)}=0.$$

In particular, $\mu_{t}^{(0)}$ follows the deterministic-operator recursion driven by the original noise $\xi_{t}$. One can verify the decomposition

$$\mu_{t}-\sum_{i=0}^{k}\mu_{t}^{(i)}=(I-\gamma_{t}A_{t})\Bigl(\mu_{t-1}-\sum_{i=0}^{k}\mu_{t-1}^{(i)}\Bigr)+\gamma_{t}\xi_{t}^{(k+1)}.$$

Below we quote Lemma D.6 in Lin et al. [[2025](https://arxiv.org/html/2605.24316#bib.bib1)], whose proof only uses the deterministic operator $\Sigma_{\nu}$ and does not rely on the rank-one structure.

###### Lemma I.5 (Lemma D.6 in Lin et al. [[2025](https://arxiv.org/html/2605.24316#bib.bib1)]).

Consider

$$\mu_{t}^{r}=(I-\gamma_{t}\Sigma_{\nu})\mu_{t-1}^{r}+\gamma_{t}\xi_{t}^{r},\qquad\mu_{0}^{r}=0,$$

with $\mathbb{E}[\xi_{t}^{r}]=0$ and

$$\mathbb{E}[\xi_{t}^{r}\xi_{t}^{r\top}]\preceq\sigma_{\xi,r}^{2}\Sigma_{\nu}.$$

Then for any $u\in[0,1]$ and $\alpha>1$,

$$\mathbb{E}\|\Sigma_{\nu}^{u/2}\mu_{L}^{r}\|_{2}^{2}\lesssim\sigma_{\xi,r}^{2}\,\gamma_{0}\,\mathrm{tr}(\Sigma_{\nu}^{1/\alpha})\,(L_{\mathrm{eff}}\gamma_{0})^{1/\alpha-u}.$$

Under power-law eigenvalues $\mu_{j}(\Sigma_{\nu})\asymp j^{-a}$, this becomes

$$\mathbb{E}\|\Sigma_{\nu}^{u/2}\mu_{L}^{r}\|_{2}^{2}\lesssim\sigma_{\xi,r}^{2}\,\gamma_{0}\,(L_{\mathrm{eff}}\gamma_{0})^{1/a-u}.$$

Next we bound the covariance of an iterate $\nu_{t}$ in terms of the covariance of its driving noise $\eta_{t}$.

###### Lemma I.6 (Covariance bound for a semi-stochastic linear recursion).

Consider the recursion

$$\nu_{t}=(I-\gamma_{t}\Sigma_{\nu})\nu_{t-1}+\gamma_{t}\eta_{t},\qquad\nu_{0}=0,$$

where $\Sigma_{\nu}$ is PSD, $\mathbb{E}[\eta_{t}]=0$, and

$$\mathbb{E}[\eta_{t}\eta_{t}^{\top}]\preceq\sigma_{\eta}^{2}\Sigma_{\nu}.$$

Assume moreover that all eigenvalues of $I-\gamma_{t}\Sigma_{\nu}$ lie in $[0,1]$. Then for every $t\geq 0$,

$$\mathbb{E}[\nu_{t}\nu_{t}^{\top}]\preceq\sigma_{\eta}^{2}\gamma_{0}I.$$
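Before turning to the proof, the statement can be sanity-checked numerically by propagating the exact covariance in the extremal case $\mathbb{E}[\eta_{t}\eta_{t}^{\top}]=\sigma_{\eta}^{2}\Sigma_{\nu}$. The sketch assumes, beyond what the lemma states explicitly, that the noises are uncorrelated across iterations (so the covariance obeys $C_{t}=P_{t}C_{t-1}P_{t}+\gamma_{t}^{2}\sigma_{\eta}^{2}\Sigma_{\nu}$ with $P_{t}=I-\gamma_{t}\Sigma_{\nu}$) and that $\gamma_{t}\leq\gamma_{0}$ with a piecewise-constant halving schedule. The random $\Sigma_{\nu}$, the schedule, and the function name `covariance_bound_gap` are illustrative choices.

```python
import numpy as np


def covariance_bound_gap(M=5, T=40, sigma2=1.7, seed=0):
    """Smallest eigenvalue over t of sigma2 * gamma0 * I - E[nu_t nu_t^T]."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(M, M))
    Sigma = G @ G.T / M                               # a random PSD Sigma_nu
    gamma0 = 0.5 / np.linalg.norm(Sigma, 2)           # eigenvalues of I - gamma_t*Sigma stay in [0.5, 1]

    C = np.zeros((M, M))                              # E[nu_0 nu_0^T] = 0
    worst = np.inf
    for t in range(T):
        gamma_t = gamma0 * 0.5 ** (t // 10)           # assumed schedule with gamma_t <= gamma0
        P = np.eye(M) - gamma_t * Sigma
        # Exact covariance update when E[eta eta^T] = sigma2 * Sigma and the
        # noises are uncorrelated across iterations (assumption of this sketch).
        C = P @ C @ P + gamma_t ** 2 * sigma2 * Sigma
        gap = sigma2 * gamma0 * np.eye(M) - C
        worst = min(worst, float(np.linalg.eigvalsh((gap + gap.T) / 2).min()))
    return worst


if __name__ == "__main__":
    print(covariance_bound_gap())
```

A nonnegative value is consistent with the lemma; it does not replace the proof, which handles general noise covariances dominated by $\sigma_{\eta}^{2}\Sigma_{\nu}$.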

###### Proof\.

Unrolling the recursion gives

$$\nu_{t}=\sum_{i=1}^{t}\gamma_{i}\prod_{j=i+1}^{t}(I-\gamma_{j}\Sigma_{\nu})\eta_{i}.$$

Therefore, when the noise terms are uncorrelated across iterations so that the cross terms vanish,

$$\mathbb{E}[\nu_{t}\nu_{t}^{\top}]=\sum_{i=1}^{t}\gamma_{i}^{2}\prod_{j=i+1}^{t}(I-\gamma_{j}\Sigma_{\nu})\,\mathbb{E}[\eta_{i}\eta_{i}^{\top}]\prod_{j=i+1}^{t}(I-\gamma_{j}\Sigma_{\nu}).$$

Using $\mathbb{E}[\eta_{i}\eta_{i}^{\top}]\preceq\sigma_{\eta}^{2}\Sigma_{\nu}$ and $\gamma_{i}\leq\gamma_{0}$, we obtain

$$\mathbb{E}[\nu_{t}\nu_{t}^{\top}]\preceq\sigma_{\eta}^{2}\gamma_{0}\sum_{i=1}^{t}\gamma_{i}\prod_{j=i+1}^{t}(I-\gamma_{j}\Sigma_{\nu})\,\Sigma_{\nu}\prod_{j=i+1}^{t}(I-\gamma_{j}\Sigma_{\nu}).$$

All factors are polynomials in $\Sigma_{\nu}$, so they commute. Since every eigenvalue of $I-\gamma_{j}\Sigma_{\nu}$ lies in $[0,1]$, one has

$$\prod_{j=i+1}^{t}(I-\gamma_{j}\Sigma_{\nu})^{2}\preceq\prod_{j=i+1}^{t}(I-\gamma_{j}\Sigma_{\nu}).$$

Hence

$$\mathbb{E}[\nu_{t}\nu_{t}^{\top}]\preceq\sigma_{\eta}^{2}\gamma_{0}\sum_{i=1}^{t}\gamma_{i}\prod_{j=i+1}^{t}(I-\gamma_{j}\Sigma_{\nu})\,\Sigma_{\nu}.$$

Finally, diagonalizing $\Sigma_{\nu}$ reduces the last sum to the scalar identity

$$\sum_{i=1}^{t}\gamma_{i}\prod_{j=i+1}^{t}(1-\gamma_{j}\lambda)\,\lambda=1-\prod_{j=1}^{t}(1-\gamma_{j}\lambda)\leq 1,\qquad\lambda\geq 0,$$

which yields

$$\sum_{i=1}^{t}\gamma_{i}\prod_{j=i+1}^{t}(I-\gamma_{j}\Sigma_{\nu})\,\Sigma_{\nu}\preceq I.$$

Combining the last two displays proves the claim. ∎
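The scalar identity used in the last step follows by writing $\gamma_{i}\lambda=1-(1-\gamma_{i}\lambda)$, which turns each summand into a difference of consecutive products so that the sum telescopes. A minimal numerical check, with an arbitrary illustrative stepsize list and the hypothetical helper name `telescoping_sides`:

```python
import math


def telescoping_sides(gammas, lam):
    """Return (lhs, rhs) of sum_i gamma_i*lam*prod_{j>i}(1-gamma_j*lam) = 1 - prod_j(1-gamma_j*lam)."""
    t = len(gammas)
    lhs = sum(
        gammas[i] * lam * math.prod(1 - g * lam for g in gammas[i + 1:])
        for i in range(t)
    )
    rhs = 1 - math.prod(1 - g * lam for g in gammas)
    return lhs, rhs


if __name__ == "__main__":
    # Stepsizes chosen so that every factor 1 - gamma_j * lam lies in [0, 1].
    print(telescoping_sides([0.3, 0.2, 0.2, 0.1, 0.05], lam=1.5))
```

The inequality $\leq 1$ additionally uses that every factor $1-\gamma_{j}\lambda$ lies in $[0,1]$, which is the eigenvalue assumption of Lemma I.6.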

After bounding the covariance, we now prove the covariance propagation bound.

###### Lemma I.7 (Covariance propagation bound).

For all $k\geq 0$,

$$\mathbb{E}[\xi_t^{(k)}\xi_t^{(k)\top}]\preceq\sigma_\xi^{2}\gamma_0^{k}(4C_A)^{k}\Sigma_\nu,$$

and

$$\mathbb{E}[\mu_t^{(k)}\mu_t^{(k)\top}]\preceq\sigma_\xi^{2}\gamma_0^{k+1}(4C_A)^{k}I.$$

###### Proof\.

We proceed by induction on $k$. For $k=0$, the first inequality is exactly the assumption

$$\mathbb{E}[\xi_t\xi_t^{\top}]\preceq\sigma_\xi^{2}\Sigma_\nu.$$

For the second inequality when $k=0$, Lemma [I.6](https://arxiv.org/html/2605.24316#A9.Thmlemma6) applied to

$$\mu_t^{(0)}=(I-\gamma_t\Sigma_\nu)\mu_{t-1}^{(0)}+\gamma_t\xi_t$$

gives

$$\mathbb{E}[\mu_t^{(0)}\mu_t^{(0)\top}]\preceq\sigma_\xi^{2}\gamma_0I.$$

Now assume the claim holds for some $k\geq 0$. Since

$$\xi_t^{(k+1)}=(\Sigma_\nu-A_t)\mu_{t-1}^{(k)},$$

we have

$$\mathbb{E}[\xi_t^{(k+1)}\xi_t^{(k+1)\top}]=\mathbb{E}\Bigl[(\Sigma_\nu-A_t)\,\mathbb{E}[\mu_{t-1}^{(k)}\mu_{t-1}^{(k)\top}]\,(\Sigma_\nu-A_t)\Bigr].$$

Using the induction hypothesis,

$$\mathbb{E}[\mu_{t-1}^{(k)}\mu_{t-1}^{(k)\top}]\preceq\sigma_\xi^{2}\gamma_0^{k+1}(4C_A)^{k}I,$$

thus

$$\mathbb{E}[\xi_t^{(k+1)}\xi_t^{(k+1)\top}]\preceq\sigma_\xi^{2}\gamma_0^{k+1}(4C_A)^{k}\,\mathbb{E}[(\Sigma_\nu-A_t)^{2}].$$

Using $\mathbb{E}[A_t]=\Sigma_\nu$, we obtain

$$\mathbb{E}[(\Sigma_\nu-A_t)^{2}]=\mathbb{E}[A_t^{2}]-\Sigma_\nu^{2}\preceq\mathbb{E}[A_t^{2}].$$

By the assumption $\mathbb{E}[A_t^{2}]\preceq C_A\Sigma_\nu$, it follows that

$$\mathbb{E}[(\Sigma_\nu-A_t)^{2}]\preceq C_A\Sigma_\nu.$$

Hence

$$\mathbb{E}[\xi_t^{(k+1)}\xi_t^{(k+1)\top}]\preceq\sigma_\xi^{2}\gamma_0^{k+1}(4C_A)^{k}C_A\Sigma_\nu\preceq\sigma_\xi^{2}\gamma_0^{k+1}(4C_A)^{k+1}\Sigma_\nu.$$

The bound on $\mathbb{E}[\mu_t^{(k+1)}\mu_t^{(k+1)\top}]$ then follows from Lemma [I.6](https://arxiv.org/html/2605.24316#A9.Thmlemma6), applied to the recursion

$$\mu_t^{(k+1)}=(I-\gamma_t\Sigma_\nu)\mu_{t-1}^{(k+1)}+\gamma_t\xi_t^{(k+1)},$$

with

$$\mathbb{E}[\xi_t^{(k+1)}\xi_t^{(k+1)\top}]\preceq\sigma_\xi^{2}\gamma_0^{k+1}(4C_A)^{k+1}\Sigma_\nu,$$

which gives

$$\mathbb{E}[\mu_t^{(k+1)}\mu_t^{(k+1)\top}]\preceq\sigma_\xi^{2}\gamma_0^{k+2}(4C_A)^{k+1}I.$$

This closes the induction. ∎

In addition to the lemmas above, the final upper bound requires the following companion estimate.

###### Lemma I.8 (Companion bound for the $p$-part recursion).

Consider

$$\mu_t^{p}=(I-\gamma_tA_t)\mu_{t-1}^{p}+\gamma_t\xi_t^{p},\qquad\mu_0^{p}=0,$$

with $\mathbb{E}[\xi_t^{p}]=0$ and

$$\mathbb{E}[\xi_t^{p}\xi_t^{p\top}]\preceq\sigma_{\xi,p}^{2}\Sigma_\nu.$$

Then for any $u\in[0,1]$,

$$\mathbb{E}\|\Sigma_\nu^{u/2}\mu_L^{p}\|_2^{2}\lesssim\sigma_{\xi,p}^{2}\gamma_0^{2}C_A^{u}\,\mathrm{tr}(\Sigma_\nu)L_{\mathrm{eff}}.$$

###### Proof\.

We have

$$\mathbb{E}\|\Sigma_\nu^{u/2}\mu_L^{p}\|_2^{2}=\sum_{t=1}^{L}\gamma_t^{2}\,\mathrm{tr}\!\left(\mathbb{E}\Bigl[\Sigma_\nu^{u}\prod_{i=t+1}^{L}(I-\gamma_iA_i)\,\xi_t^{p}\xi_t^{p\top}\,\prod_{j=L}^{t+1}(I-\gamma_jA_j)\Bigr]\right).$$

Using $\mathbb{E}[\xi_t^{p}\xi_t^{p\top}]\preceq\sigma_{\xi,p}^{2}\Sigma_\nu$, $\sum_t\gamma_t^{2}\lesssim\gamma_0^{2}L_{\mathrm{eff}}$, and $\Sigma_\nu^{2}\preceq C_A\Sigma_\nu$, elementary calculations give

$$\mathbb{E}\|\Sigma_\nu^{u/2}\mu_L^{p}\|_2^{2}\lesssim\sigma_{\xi,p}^{2}\gamma_0^{2}C_A^{u}\,\mathrm{tr}(\Sigma_\nu)L_{\mathrm{eff}}.$$

∎

We can now combine Lemmas [I.5](https://arxiv.org/html/2605.24316#A9.Thmlemma5), [I.7](https://arxiv.org/html/2605.24316#A9.Thmlemma7), and [I.8](https://arxiv.org/html/2605.24316#A9.Thmlemma8). By Minkowski's inequality,

$$\bigl(\mathbb{E}\|\Sigma_\nu^{u/2}\mu_L\|_2^{2}\bigr)^{1/2}\leq\sum_{i=0}^{k}\bigl(\mathbb{E}\|\Sigma_\nu^{u/2}\mu_L^{(i)}\|_2^{2}\bigr)^{1/2}+\Bigl(\mathbb{E}\Bigl\|\Sigma_\nu^{u/2}\Bigl(\mu_L-\sum_{i=0}^{k}\mu_L^{(i)}\Bigr)\Bigr\|_2^{2}\Bigr)^{1/2}.$$

Applying Lemma [I.5](https://arxiv.org/html/2605.24316#A9.Thmlemma5) to $\mu_L^{(i)}$ and Lemma [I.8](https://arxiv.org/html/2605.24316#A9.Thmlemma8) to the remainder yields

$$\begin{aligned}
\bigl(\mathbb{E}\|\Sigma_\nu^{u/2}\mu_L\|_2^{2}\bigr)^{1/2}&\lesssim\sum_{i=0}^{k}\Bigl(\sigma_\xi^{2}\gamma_0^{i}(4C_A)^{i}\cdot\gamma_0\,\mathrm{tr}(\Sigma_\nu^{1/\alpha})(L_{\mathrm{eff}}\gamma_0)^{1/\alpha-u}\Bigr)^{1/2}\\
&\qquad+\Bigl(\sigma_\xi^{2}\gamma_0^{k+3}(4C_A)^{k+1}\,C_A^{u}\,\mathrm{tr}(\Sigma_\nu)L_{\mathrm{eff}}\Bigr)^{1/2}.
\end{aligned}$$

Since $4\gamma_0C_A\leq 1/2<1$, the geometric series converges. Letting $k\to\infty$ gives

$$\mathbb{E}\|\Sigma_\nu^{u/2}\mu_L\|_2^{2}\lesssim\sigma_\xi^{2}\,\gamma_0\,\mathrm{tr}(\Sigma_\nu^{1/\alpha})\,(L_{\mathrm{eff}}\gamma_0)^{1/\alpha-u}.$$

Now suppose $\mu_j(\Sigma_\nu)\asymp j^{-a}$ for $j\leq\min\{M,N/\widetilde{c}\}$. Then the power-law part of Lemma [I.5](https://arxiv.org/html/2605.24316#A9.Thmlemma5) yields, for every $i\geq 0$,

$$\mathbb{E}\|\Sigma_\nu^{u/2}\mu_L^{(i)}\|_2^{2}\lesssim\sigma_\xi^{2}\,\gamma_0^{i+1}(4C_A)^{i}\,(L_{\mathrm{eff}}\gamma_0)^{1/a-u}.$$

The remainder estimate is unchanged. Therefore the same geometric-series argument, together with $4\gamma_0C_A\leq 1/2<1$, gives

$$\mathbb{E}\|\Sigma_\nu^{u/2}\mu_L\|_2^{2}\lesssim\sigma_\xi^{2}\,\gamma_0\,(L_{\mathrm{eff}}\gamma_0)^{1/a-u}.$$

This is the claimed power-law conclusion. ∎

## Appendix J Fluctuation Error under Multi-pass Batch SGD without Replacement

This section focuses on the multi-pass batch SGD procedure without replacement (equation [4](https://arxiv.org/html/2605.24316#S2.E4)) and uses the notation from Section [A.5](https://arxiv.org/html/2605.24316#A1.SS5). Appendix [I](https://arxiv.org/html/2605.24316#A9) proves the corresponding fluctuation bounds for multi-pass batch SGD with replacement (equation [3](https://arxiv.org/html/2605.24316#S2.E3)); here we record the without-replacement counterpart.

Define

$$\mathrm{Fluc}^{\mathrm{wor}}_{B}:=\mathbb{E}_{\mathrm{batch}}\bigl[\|\Sigma^{1/2}(u_L^{\mathrm{wor}}-\theta_L)\|_2^{2}\bigr]=\mathbb{E}_{\mathrm{batch}}\bigl[\|\Sigma^{1/2}\Delta_L^{\mathrm{wor}}\|_2^{2}\bigr].$$

By Section [A.5](https://arxiv.org/html/2605.24316#A1.SS5), the fluctuation process satisfies

$$\Delta_t^{\mathrm{wor}}=\bigl(I-\gamma_t\widehat{\Sigma}_{I_t}^{(B)}\bigr)\Delta_{t-1}^{\mathrm{wor}}+\gamma_t\xi_{t,\mathrm{wor}}^{(B)}.$$

Define the finite-population correction factor

$$\rho_{N,B}:=\frac{N-B}{B(N-1)}.\tag{26}$$

Observe that

$$\rho_{N,1}=1,\qquad\rho_{N,N}=0,$$

and, more generally, at the asymptotic level,

$$\rho_{N,B}\asymp\frac{1}{B}\qquad\text{when }B\ll N.$$
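To make the size of the finite-population factor concrete, the sketch below evaluates $\rho_{N,B}$ exactly and compares it with the with-replacement prefactor $1/B$. The helper name `rho` and the chosen batch sizes are illustrative; $N=512$ is the value used in the experiments of Appendix L.

```python
from fractions import Fraction


def rho(N, B):
    """Finite-population factor rho_{N,B} = (N - B) / (B (N - 1)), computed exactly."""
    return Fraction(N - B, B * (N - 1))


N = 512
for B in [1, 2, 64, 256, 512]:
    r = rho(N, B)
    # r * B = (N - B) / (N - 1) is the ratio of rho_{N,B} to the prefactor 1/B.
    print(B, float(r), float(r * B))
```

The ratio $(N-B)/(N-1)$ stays close to one while $B\ll N$ and drops to zero at $B=N$, which is the regime distinction stated above.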
###### Lemma J.1 (Finite-population covariance identity).

Let $\zeta_1,\dots,\zeta_N\in\mathbb{R}^{M}$ satisfy

$$\frac{1}{N}\sum_{j=1}^{N}\zeta_j=0.$$

Let $I\subset[N]$ be sampled uniformly without replacement with $|I|=B$, and define

$$\bar{\zeta}_I:=\frac{1}{B}\sum_{j\in I}\zeta_j.$$

Then

$$\mathbb{E}_I[\bar{\zeta}_I\bar{\zeta}_I^{\top}]=\rho_{N,B}\cdot\frac{1}{N}\sum_{j=1}^{N}\zeta_j\zeta_j^{\top}=\rho_{N,B}\,\mathbb{E}_{i\sim\mathrm{unif}([N])}[\zeta_i\zeta_i^{\top}].$$

###### Proof\.

Expanding the covariance gives

$$\mathbb{E}_I[\bar{\zeta}_I\bar{\zeta}_I^{\top}]=\frac{1}{B^{2}}\sum_{i=1}^{N}\mathbb{P}(i\in I)\,\zeta_i\zeta_i^{\top}+\frac{1}{B^{2}}\sum_{i\neq j}\mathbb{P}(i,j\in I)\,\zeta_i\zeta_j^{\top}.$$

For uniform sampling without replacement,

$$\mathbb{P}(i\in I)=\frac{B}{N},\qquad\mathbb{P}(i,j\in I)=\frac{B(B-1)}{N(N-1)}\quad\text{for }i\neq j.$$

Since $\sum_{j=1}^{N}\zeta_j=0$,

$$\sum_{i\neq j}\zeta_i\zeta_j^{\top}=\left(\sum_{i=1}^{N}\zeta_i\right)\left(\sum_{j=1}^{N}\zeta_j\right)^{\top}-\sum_{i=1}^{N}\zeta_i\zeta_i^{\top}=-\sum_{i=1}^{N}\zeta_i\zeta_i^{\top}.$$

Substituting these identities yields

$$\mathbb{E}_I[\bar{\zeta}_I\bar{\zeta}_I^{\top}]=\frac{1}{B^{2}}\left(\frac{B}{N}-\frac{B(B-1)}{N(N-1)}\right)\sum_{i=1}^{N}\zeta_i\zeta_i^{\top}=\frac{N-B}{BN(N-1)}\sum_{i=1}^{N}\zeta_i\zeta_i^{\top},$$

which is exactly the claimed formula because $\rho_{N,B}=(N-B)/(B(N-1))$. ∎
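The identity of Lemma J.1 can also be confirmed by brute force on a small population. The sketch below enumerates every size-$B$ subset and uses exact rational arithmetic, so equality is tested without floating-point tolerance. The toy vectors in `raw` and all function names are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def second_moment(vectors, weight):
    """weight * sum_v v v^T, returned as a nested list of Fractions."""
    M = len(vectors[0])
    return [[weight * sum(v[a] * v[b] for v in vectors) for b in range(M)]
            for a in range(M)]


def batch_second_moment(zetas, B):
    """E_I[zeta_bar_I zeta_bar_I^T], averaged exactly over all size-B subsets."""
    N = len(zetas)
    M = len(zetas[0])
    means = [[sum(zetas[j][m] for j in I) / B for m in range(M)]
             for I in combinations(range(N), B)]
    return second_moment(means, Fraction(1, len(means)))


def rho(N, B):
    """Finite-population factor (N - B) / (B (N - 1))."""
    return Fraction(N - B, B * (N - 1))


# Hypothetical toy population in R^2, centered so that the zetas sum to zero.
raw = [(1, 0), (0, 2), (3, 1), (-1, 4), (2, -2)]
N = len(raw)
center = [Fraction(sum(v[m] for v in raw), N) for m in range(2)]
zetas = [[Fraction(v[m]) - center[m] for m in range(2)] for v in raw]

population = second_moment(zetas, Fraction(1, N))
for B in range(1, N + 1):
    expected = [[rho(N, B) * x for x in row] for row in population]
    print(B, batch_second_moment(zetas, B) == expected)
```

Exhaustive enumeration is only feasible for tiny $N$, but it exercises the same inclusion probabilities $B/N$ and $B(B-1)/(N(N-1))$ that drive the proof.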

The next corollary shows that the upper-bound conclusions of Lemma [I.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1) and Lemma [I.2](https://arxiv.org/html/2605.24316#A9.Thmlemma2) remain valid after replacing the factor $1/B$ by $\rho_{N,B}$ and the with-replacement iterate $u_t^{\mathrm{wr}}$ by the multi-pass batch SGD without-replacement iterate $u_t^{\mathrm{wor}}$.

###### Corollary J.1 (Upper fluctuation bounds for multi-pass batch SGD without replacement).

Assume the same fixed-dataset mini-batch setup as above, where at each iteration $t$ the batch $I_t\subset[N]$ is sampled uniformly without replacement with $|I_t|=B$, independently across iterations.

Let $\rho_{N,B}$ be defined by equation [26](https://arxiv.org/html/2605.24316#A10.E26). Then the following hold.

1. (1) Upper bound. Under the assumptions of Lemma [I.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1), one has
   $$\mathbb{E}[\mathrm{Fluc}^{\mathrm{wor}}_{B}]\leq c\cdot\mathbb{E}[F_B]\cdot\rho_{N,B}\,\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\cdot\gamma^{1/\alpha}L_{\mathrm{eff}}^{1/\alpha-1}.$$
2. (2) Source-condition upper bound. Under the assumptions of Lemma [I.2](https://arxiv.org/html/2605.24316#A9.Thmlemma2), one has
   $$\mathbb{E}[\mathrm{Fluc}^{\mathrm{wor}}_{B}]\lesssim\rho_{N,B}\,\gamma\log N\left[(L_{\mathrm{eff}}\gamma)^{1/a-1}+\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N}\right].$$

In particular, when $B=N$, one has $\rho_{N,N}=0$, and therefore

$$\mathrm{Fluc}^{\mathrm{wor}}_{N}=0,$$

which matches the identity $u_t^{\mathrm{wor}}=\theta_t$ from Section [2](https://arxiv.org/html/2605.24316#S2).

###### Proof\.

We only need to identify the places in the proofs of Lemma [I.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1) and Lemma [I.2](https://arxiv.org/html/2605.24316#A9.Thmlemma2) where the factor $1/B$ enters, and replace it by the correct covariance factor for sampling without replacement.

Fix $t$, and condition on $(S,w^{\ast},D,\mathcal{F}_{t-1})$. Define

$$\zeta_t(j):=-\bigl(Sx_jx_j^{\top}S^{\top}-\widehat{\Sigma}\bigr)(\theta_{t-1}-u^{\ast})+\bigl(Sx_j\widetilde{\varepsilon}_j-\widehat{c}\bigr),\qquad j\in[N].$$

Since

$$\widehat{\Sigma}=\frac{1}{N}\sum_{i=1}^{N}Sx_ix_i^{\top}S^{\top},\qquad\widehat{c}=\frac{1}{N}\sum_{i=1}^{N}Sx_i\widetilde{\varepsilon}_i,$$

we have

$$\frac{1}{N}\sum_{j=1}^{N}\zeta_t(j)=0.$$

Now let $I_t\subset[N]$ be a uniformly random subset of size $B$, sampled without replacement, and write

$$\xi_{t,\mathrm{wor}}^{(B)}=\frac{1}{B}\sum_{j\in I_t}\zeta_t(j).$$

For the upper-bound argument, we must also verify that the random transition matrix

$$A_t:=\widehat{\Sigma}_{I_t}^{(B)}=\frac{1}{B}\sum_{i\in I_t}z_iz_i^{\top},\qquad z_i:=Sx_i,\qquad C_A:=\max_{i\in[N]}\|z_i\|_2^{2},$$

satisfies the same moment assumptions used in Lemma [I.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4). Since each $z_iz_i^{\top}\preceq C_AI$, we have pathwise

$$0\preceq A_t\preceq C_AI,\qquad\text{hence}\qquad A_t^{2}\preceq C_AA_t.$$

Taking expectation over the without-replacement batch gives

$$\mathbb{E}_{\mathrm{batch}}[A_t]=\widehat{\Sigma},\qquad\mathbb{E}_{\mathrm{batch}}[A_t^{2}]\preceq C_A\,\mathbb{E}_{\mathrm{batch}}[A_t]=C_A\widehat{\Sigma}.$$

Also $\widehat{\Sigma}=\frac{1}{N}\sum_{i=1}^{N}z_iz_i^{\top}\preceq C_AI$, so $\|\widehat{\Sigma}\|_2\leq C_A$. Thus the same mean and matrix-moment bounds as in Lemma [I.3](https://arxiv.org/html/2605.24316#A9.Thmlemma3) remain valid for the without-replacement transition matrices.

Since $\theta_{t-1}$ is deterministic once $(S,w^{\ast},D)$ are fixed and the batches $(I_t)_{t\in[L]}$ are sampled independently across iterations, the pair $(\widehat{\Sigma}_{I_t}^{(B)},\xi_{t,\mathrm{wor}}^{(B)})$ is also independent of $\mathcal{F}_{t-1}$, so the adaptedness requirement in Lemma [I.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4) is unchanged.

Conditioned on $(S,w^{\ast},D,\mathcal{F}_{t-1})$, the family $(\zeta_t(j))_{j=1}^{N}$ is deterministic and centered. Applying Lemma [J.1](https://arxiv.org/html/2605.24316#A10.Thmlemma1) therefore gives

$$\mathbb{E}\!\left[\xi_{t,\mathrm{wor}}^{(B)}\xi_{t,\mathrm{wor}}^{(B)\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\right]=\rho_{N,B}\,\mathbb{E}_{i\sim\mathrm{unif}([N])}\!\left[\zeta_t(i)\zeta_t(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\right].\tag{27}$$

Therefore equation [24](https://arxiv.org/html/2605.24316#A9.E24) in the proof of Lemma [I.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1) is replaced by

$$\mathbb{E}\!\left[\xi_{t,\mathrm{wor}}^{(B)}\xi_{t,\mathrm{wor}}^{(B)\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\right]\preceq\rho_{N,B}\,\sigma_{\xi,B}^{2}\,\widehat{\Sigma}.$$

From this point onward, the proof of Lemma [I.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1) is unchanged after replacing $\Delta_t$ by $\Delta_t^{\mathrm{wor}}$, $u_L^{\mathrm{wr}}$ by $u_L^{\mathrm{wor}}$, and every occurrence of $1/B$ coming from equation [24](https://arxiv.org/html/2605.24316#A9.E24) by $\rho_{N,B}$. This yields part (1).

The proof of Lemma [I.2](https://arxiv.org/html/2605.24316#A9.Thmlemma2) then propagates the same covariance replacement, so every occurrence of $1/B$ in the corresponding with-replacement source-condition upper bound is replaced by $\rho_{N,B}$. This yields part (2).

The last statement follows because

$$\rho_{N,N}=\frac{N-N}{N(N-1)}=0.$$

Thus, when $B=N$, the fluctuation bounds vanish, which is consistent with $u_t^{\mathrm{wor}}=\theta_t$ for all $t$. ∎
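The $B=N$ case can be illustrated with a small simulation. The sketch below assumes the update is the standard least-squares mini-batch step on sketched features, which is consistent with the fluctuation recursion above but is not copied from the paper's code. The toy sizes, the constant step size, and the function names are all illustrative assumptions.

```python
import numpy as np


def minibatch_sgd_wor(Z, y, B, gammas, rng):
    """Least-squares mini-batch SGD; each batch is drawn without replacement."""
    N, M = Z.shape
    u = np.zeros(M)
    for g in gammas:
        idx = rng.choice(N, size=B, replace=False)
        grad = Z[idx].T @ (Z[idx] @ u - y[idx]) / B
        u = u - g * grad
    return u


def full_batch_gd(Z, y, gammas):
    """Gradient descent on the same empirical least-squares objective."""
    N, M = Z.shape
    theta = np.zeros(M)
    for g in gammas:
        grad = Z.T @ (Z @ theta - y) / N
        theta = theta - g * grad
    return theta


rng = np.random.default_rng(0)
N, M = 32, 4                      # toy sizes, not the paper's settings
Z = rng.standard_normal((N, M)) / np.sqrt(M)
y = Z @ rng.standard_normal(M) + 0.1 * rng.standard_normal(N)
gammas = [0.05] * 50              # constant step size, for illustration only

theta = full_batch_gd(Z, y, gammas)
u_full = minibatch_sgd_wor(Z, y, N, gammas, rng)   # B = N: each batch is a permutation of the data
u_small = minibatch_sgd_wor(Z, y, 4, gammas, rng)  # B < N: genuinely stochastic
print(np.linalg.norm(u_full - theta), np.linalg.norm(u_small - theta))
```

With $B=N$ every batch is a reordering of the full dataset, so the iterate matches gradient descent up to floating-point summation order, whereas a small batch leaves a visible gap.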

## Appendix K Collected Auxiliary Lemmas

This section collects the auxiliary lemmas from Section E of Lin et al. [[2025](https://arxiv.org/html/2605.24316#bib.bib1)] and Sections A and G of Lin et al. [[2024](https://arxiv.org/html/2605.24316#bib.bib4)]. These are the spectral, moment, and concentration ingredients used repeatedly across the appendix proofs for equation [2](https://arxiv.org/html/2605.24316#S2.E2), equation [3](https://arxiv.org/html/2605.24316#S2.E3), and equation [4](https://arxiv.org/html/2605.24316#S2.E4).

We use the common notation from Section [A](https://arxiv.org/html/2605.24316#A1), especially Section [A.1](https://arxiv.org/html/2605.24316#A1.SS1). Let $(\lambda_i)_{i\geq 1}$ denote the eigenvalues of $H$ in non-increasing order. For integers $0\leq k_{\ast}\leq k\leq\infty$, define

$$\Sigma_{k_{\ast}:k}:=S_{k_{\ast}:k}H_{k_{\ast}:k}S_{k_{\ast}:k}^{\top},\qquad\Sigma_{k:\infty}:=S_{k:\infty}H_{k:\infty}S_{k:\infty}^{\top}.$$

For any symmetric PSD matrix $A$, we write $\mu_1(A)\geq\mu_2(A)\geq\cdots$ for its eigenvalues.

### K.1 General concentration lemmas

###### Lemma K.1 (Covariance replacement).

Let $\lambda>0$. If

$$\sum_{i=1}^{M}\frac{\mu_i(\Sigma)}{\mu_i(\Sigma)+\lambda}\leq\frac{N}{4},$$

then with probability at least $1-\exp(-\Omega(N))$ over the sample,

$$\bigl\|(\widehat{\Sigma}+\lambda I)^{-1/2}(\Sigma+\lambda I)^{1/2}\bigr\|_2\leq 3.$$

Moreover,

$$\mathbb{E}_X\bigl\|(\widehat{\Sigma}+\lambda I)^{-1/2}(\Sigma+\lambda I)^{1/2}\bigr\|_2^{4}\leq 100+\exp(-cN)\frac{\|\Sigma\|_2^{2}}{\lambda^{2}}$$

for some absolute constant $c>0$.

###### Lemma K.2 (Sketched fourth-moment and residual covariance bounds).

Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1) and [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2). Condition on the sketch matrix $S$, and let

$$z:=Sx,\qquad\Sigma:=SHS^{\top},\qquad u^{\ast}:=\Sigma^{-1}SHw^{\ast}.$$

Then there exists an absolute constant $\alpha>0$ such that the following hold.

1. (i) For every PSD matrix $A\in\mathbb{R}^{M\times M}$,
   $$\mathbb{E}[zz^{\top}Azz^{\top}]\preceq\alpha\,\operatorname{tr}(\Sigma A)\Sigma.$$
2. (ii) Writing
   $$\widetilde{\sigma}^{2}(w^{\ast}):=2\bigl(\sigma^{2}+\alpha\|w^{\ast}\|_H^{2}\bigr),$$
   one has
   $$\mathbb{E}\!\left[(y-z^{\top}u^{\ast})^{2}zz^{\top}\right]\preceq\widetilde{\sigma}^{2}(w^{\ast})\,\Sigma.$$

Under Gaussian design, one may take $\alpha=3$.

###### Lemma K.3 (Head–tail eigenvalue comparison).

There exists an absolute constant $c>1$ such that for every $0\leq k\leq M$, with probability at least

$$1-\exp(-\Omega(M))-\exp(-\Omega(k)),$$

we have for every $j\in[M]$,

$$\left|\mu_j(\Sigma)-\lambda_j-\frac{1}{M}\sum_{i>k}\lambda_i\right|\leq c\left(\frac{k}{M}\lambda_j+\lambda_{k+1}+\sqrt{\frac{\sum_{i>k}\lambda_i^{2}}{M}}\right).$$

In particular, if $k\leq M/c^{2}$, then the term $\tfrac{k}{M}\lambda_j$ can be absorbed into the left-hand side.

###### Lemma K.4 (Tail concentration).

For any $k\geq 0$, with probability at least $1-\delta$,

$$\left\|\Sigma_{k:\infty}-\frac{1}{M}\sum_{i>k}\lambda_i\,I_M\right\|_2\lesssim\lambda_{k+1}\left(1+\frac{\log(1/\delta)}{M}\right)+\sqrt{\frac{\sum_{i>k}\lambda_i^{2}}{M}\left(1+\frac{\log(1/\delta)}{M}\right)}.$$

In particular, with probability at least $1-\exp(-\Omega(M))$,

$$\left\|\Sigma_{k:\infty}-\frac{1}{M}\sum_{i>k}\lambda_i\,I_M\right\|_2\lesssim\lambda_{k+1}+\sqrt{\frac{\sum_{i>k}\lambda_i^{2}}{M}}.$$

###### Lemma K.5 (Head concentration).

For any $k\geq 1$, with probability at least $1-\delta$,

$$\bigl|\mu_j(\Sigma_{0:k})-\lambda_j\bigr|\lesssim\frac{k+\log(1/\delta)}{M}\,\lambda_j,\qquad j\leq k.$$

In particular, with probability at least $1-\exp(-\Omega(k))$,

$$\bigl|\mu_j(\Sigma_{0:k})-\lambda_j\bigr|\lesssim\frac{k}{M}\lambda_j,\qquad j\leq k.$$

###### Lemma K.6 (Head–tail resolvent estimate).

Fix an integer $k\leq M/3$ such that $\operatorname{rank}(H)\geq k+M$, and define

$$A_k:=S_{k:\infty}H_{k:\infty}S_{k:\infty}^{\top},\qquad\Sigma=S_{0:k}H_{0:k}S_{0:k}^{\top}+A_k.$$

Then, with probability at least

$$1-\exp(-\Omega(M)),$$

one has

$$\|\Sigma^{-1}S_{0:k}H_{0:k}\|_2\lesssim\frac{\mu_{M/2}(A_k)}{\mu_M(A_k)}.$$

### K.2 Power-law auxiliary lemmas

###### Lemma K.7 (Power-law spectrum of the sketched covariance).

Suppose the population spectrum obeys $\lambda_j\asymp j^{-a}$ for some $a>1$. Then, with probability at least $1-\exp(-\Omega(M))$,

$$\mu_j(\Sigma)\asymp j^{-a},\qquad j\in[M].$$

###### Lemma K.8 (Tail spectral ratio under power law).

Suppose $\lambda_j\asymp j^{-a}$ with $a>1$. Then there exists an $a$-dependent constant $c>0$ such that for any $k\geq 0$,

$$\frac{\mu_{M/2}(\Sigma_{k:\infty})}{\mu_M(\Sigma_{k:\infty})}\leq c$$

with probability at least $1-\exp(-\Omega(M))$.

###### Lemma K.9 (Source-condition approximation bound).

Suppose the source condition with exponent $b>1$ holds. Then, with probability at least $1-\exp(-\Omega(M))$ over the sketch matrix,

$$M^{1-b}\lesssim\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\lesssim M^{1-b}.$$

The hidden constants depend only on the source exponents.

###### Lemma K.10 (Empirical spectrum under power law).

Assume $\mu_j(\Sigma)\asymp j^{-a}$ for $j\in[M]$. Then there exist $a$-dependent constants $c,c_1,c_2>0$ such that, conditioned on $S$, with probability at least $1-\exp(-\Omega(N))$ over the sample,

$$c_1j^{-a}\leq\mu_j(\widehat{\Sigma})\leq c_2j^{-a},\qquad j\leq\min\{M,N/c\},$$

and

$$\mu_j(\widehat{\Sigma})\lesssim j^{-a},\qquad j\leq\min\{M,N\}.$$

## Appendix L Experimental setup

All simulations were run on the CPU of a standard laptop; the full suite took approximately 2 hours and used less than 1 GB of memory.

All experiments are run in the synthetic diagonal-coordinate sketched linear-regression model from Sections [2](https://arxiv.org/html/2605.24316#S2) and [3](https://arxiv.org/html/2605.24316#S3). We fix an ambient dimension $d$, let

$$H=\operatorname{diag}(\lambda_1,\dots,\lambda_d),\qquad\lambda_i=i^{-a},$$

draw a Gaussian sketch $S\in\mathbb{R}^{M\times d}$ with i.i.d. $\mathcal{N}(0,1/M)$ entries, and set

$$\Sigma=SHS^{\top}.$$

The source-condition prior is chosen so that the coordinates of $w^{\ast}$ are independent Gaussian and satisfy

$$\mathbb{E}[\lambda_i(w_i^{\ast})^{2}]\asymp i^{-b}.$$

Conditioned on $(S,w^{\ast})$, the sketched feature $z:=Sx\in\mathbb{R}^{M}$ and the clean signal $\langle x,w^{\ast}\rangle$ are jointly Gaussian, so the implementation samples the pair $(z,y)$ directly from this induced law, with

$$y=\langle x,w^{\ast}\rangle+\varepsilon,\qquad\varepsilon\sim\mathcal{N}(0,\sigma^{2}).$$

This lets us work entirely in the sketched coordinates used throughout the paper.
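One way to sample $(z,y)$ from the induced law without forming the $d$-dimensional $x$ is to use Gaussian conditioning. The sketch below is an assumption about how this could be done, not the paper's implementation: it assumes $x\sim\mathcal{N}(0,H)$, takes the prior variance of $w_i^{\ast}$ to be exactly $i^{-b}/\lambda_i$, and uses toy values of $d$ and $M$. The function names are hypothetical.

```python
import numpy as np


def induced_law(S, lam, w_star):
    """Moments of (z, s) for z = S x and s = <x, w*>, with x ~ N(0, diag(lam))."""
    Sigma = (S * lam) @ S.T                 # S H S^T, using that H is diagonal
    cross = S @ (lam * w_star)              # Cov(z, s) = S H w*
    var_s = float(lam @ w_star**2)          # Var(s) = w*^T H w*
    beta = np.linalg.solve(Sigma, cross)    # E[s | z] = beta^T z
    cond_var = var_s - float(cross @ beta)  # Var(s | z), a Schur complement
    return Sigma, beta, cond_var


def sample_pairs(S, lam, w_star, sigma, n, rng):
    """Draw n pairs (z, y) without ever forming the d-dimensional x."""
    Sigma, beta, cond_var = induced_law(S, lam, w_star)
    chol = np.linalg.cholesky(Sigma)
    Z = rng.standard_normal((n, S.shape[0])) @ chol.T
    s = Z @ beta + np.sqrt(max(cond_var, 0.0)) * rng.standard_normal(n)
    y = s + sigma * rng.standard_normal(n)
    return Z, y


rng = np.random.default_rng(0)
a, b, d, M = 2.0, 1.5, 400, 8       # toy d and M; the paper uses d = 10^4 and M = 64
idx = np.arange(1, d + 1)
lam = idx ** (-a)
w_star = rng.standard_normal(d) * np.sqrt(idx ** (-b) / lam)  # E[lam_i w_i^2] = i^{-b}
S = rng.standard_normal((M, d)) / np.sqrt(M)
Sigma, beta, cond_var = induced_law(S, lam, w_star)
Z, y = sample_pairs(S, lam, w_star, sigma=1.0, n=512, rng=rng)
print(Z.shape, y.shape, cond_var)
```

The regression coefficient `beta` equals $\Sigma^{-1}SHw^{\ast}$, so the conditional mean of the clean signal given $z$ is $z^{\top}u^{\ast}$ in the notation of Lemma K.2.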

Across repetitions, the draw of $(S,w^{\ast})$ is fixed and only the dataset and optimization randomness are refreshed. Unless otherwise stated, the shared parameters are

$$a=2,\qquad b=1.5,\qquad d=10^{4},\qquad M=64,\qquad N=512,\qquad L=512,\qquad\sigma=1,\qquad\gamma=0.05,$$

with $R=100$ independent repetitions. We use the candidate power-of-two batch sizes

$$\mathcal{B}:=\{1,2,4,8,16,32,64,128,256,512\},$$

and then specify the experiment-dependent subsets below. For multi-pass experiments we compare both sampling rules from the paper: with-replacement mini-batching and without-replacement mini-batching, with finite-population factor

$$\rho_{N,B}=\frac{N-B}{B(N-1)}.$$
All methods use the same Bartlett-decay schedule implemented in the code: for a run of $L_{\mathrm{run}}$ updates, the stepsize is held constant on blocks of length comparable to $L_{\mathrm{run,eff}}\asymp L_{\mathrm{run}}/\log L_{\mathrm{run}}$ and divided by two from one block to the next. This is the simulation counterpart of the blockwise geometric schedule used in our theory. Here $L_{\mathrm{run}}=T=N/B$ for one-pass batch SGD, and $L_{\mathrm{run}}=L$ for multi-pass batch SGD and for the full-batch GD reference iterate $\theta_L$.
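A blockwise halving schedule of this kind might look as follows. This is a hedged sketch rather than the schedule in the paper's code: the base-2 logarithm, the rounding of the block length, and the guard for very short runs are assumptions, and `blockwise_halving_schedule` is a hypothetical name.

```python
import math


def blockwise_halving_schedule(gamma0, L_run):
    """Step sizes for L_run updates: constant on each block, halved between blocks."""
    if L_run <= 0:
        return []
    # Assumed block length ~ L_run / log2(L_run); the max(...) guards L_run = 1,
    # where the logarithm vanishes.
    block = max(1, math.ceil(L_run / max(1.0, math.log2(L_run))))
    return [gamma0 / 2 ** (t // block) for t in range(L_run)]


schedule = blockwise_halving_schedule(0.05, 512)
print(len(schedule), schedule[0], schedule[-1])
```

With $L_{\mathrm{run}}=512$ and this choice of logarithm the block length is 57, so the step size is halved eight times over the run. A different logarithm base or rounding rule would change the block boundaries but not the geometric shape.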

The three experiments below are chosen to match the three batch-dependent predictions in our theorems. Experiment 1 isolates the one-pass variance term, which is the only place where $B$ enters the one-pass theorem. Experiment 2 isolates the fluctuation term in the multi-pass theorem, which is the only part that differs between with-replacement and without-replacement sampling. Experiment 3 then checks whether dividing by the predicted batch prefactor removes the leading $B$-dependence.

### L.1 Experiment 1: one-pass variance sweep

For each batch size $B$, we run one-pass batch SGD with shuffled disjoint batches, so each sample is used exactly once and the number of updates is

$$T=\frac{N}{B},\qquad T_{\mathrm{eff}}=\frac{T}{\log T}.$$

Let $u_{T,r}^{\mathrm{op}}$ be the one-pass output on repetition $r$, and let

$$\bar{u}_T^{\mathrm{op}}:=\frac{1}{R}\sum_{r=1}^{R}u_{T,r}^{\mathrm{op}}.$$

We estimate the centered one-pass variance by the Bessel-corrected empirical average

$$\widehat{\mathrm{Var}}_B:=\frac{R}{R-1}\cdot\frac{1}{R}\sum_{r=1}^{R}\bigl\|\Sigma^{1/2}(u_{T,r}^{\mathrm{op}}-\bar{u}_T^{\mathrm{op}})\bigr\|_2^{2}.$$

This is the quantity plotted in Figure [1](https://arxiv.org/html/2605.24316#S4.F1)(a).
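The Bessel-corrected estimator above reduces to a few array operations. In the sketch below, `U` is a placeholder array holding the $R$ outputs as rows and `Sigma` is a random PSD stand-in for $SHS^{\top}$; both are synthetic and serve only to show the computation. The function name is hypothetical.

```python
import numpy as np


def centered_variance(U, Sigma):
    """(1/(R-1)) * sum_r ||Sigma^{1/2} (u_r - u_bar)||^2 for outputs U of shape (R, M)."""
    R = U.shape[0]
    D = U - U.mean(axis=0)
    # ||Sigma^{1/2} v||^2 = v^T Sigma v, so no matrix square root is needed.
    return float(np.einsum("ri,ij,rj->", D, Sigma, D)) / (R - 1)


rng = np.random.default_rng(0)
R, M = 100, 5                       # R matches the paper; M is a toy value
G = rng.standard_normal((M, M))
Sigma = G @ G.T / M                 # placeholder PSD matrix standing in for S H S^T
U = rng.standard_normal((R, M))     # placeholder for the R one-pass outputs
var_hat = centered_variance(U, Sigma)
print(var_hat)
```

The same value can be obtained as $\operatorname{tr}(\Sigma\,\widehat{C})$, where $\widehat{C}$ is the sample covariance of the outputs with the $1/(R-1)$ normalization.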

To compare with the theory, we also draw a rescaled upper-bound reference of order

$$\frac{1}{B\,T_{\mathrm{eff}}}\sum_{j=1}^{M}\min\{1,T_{\mathrm{eff}}\gamma\mu_j(\Sigma)\},$$

using the eigenvalues of the fixed simulated sketched covariance $\Sigma$. Because our theorem gives only an upper bound on $\mathrm{Var}_B$, this plot is meant to test the predicted $B$-dependence and effective-dimension shape, not equality of leading constants. Accordingly, the empirical curve is expected to remain below a suitably rescaled upper-bound reference.
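The reference curve above can be evaluated directly from a list of eigenvalues. The sketch below uses the idealized spectrum $\mu_j=j^{-a}$ as a stand-in for the eigenvalues of the simulated $\Sigma$, and falls back to $T_{\mathrm{eff}}=1$ when $T=1$, where $\log T$ vanishes. Both choices are assumptions for illustration, as is the function name.

```python
import math


def variance_reference(B, N, gamma, eigs):
    """(1 / (B T_eff)) * sum_j min{1, T_eff * gamma * mu_j} with T = N / B."""
    T = N // B
    # T_eff = T / log T; at T = 1 the logarithm vanishes, so fall back to T_eff = 1 (assumption).
    T_eff = T / math.log(T) if T > 1 else 1.0
    return sum(min(1.0, T_eff * gamma * mu) for mu in eigs) / (B * T_eff)


N, gamma, M, a = 512, 0.05, 64, 2.0
eigs = [j ** (-a) for j in range(1, M + 1)]   # idealized power-law stand-in for mu_j(Sigma)
for B in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
    print(B, variance_reference(B, N, gamma, eigs))
```

Since $\min\{1,x\}\leq x$, the reference never exceeds $\gamma\,\mathrm{tr}(\Sigma)/B$, and the truncation at one is what produces the effective-dimension shape.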

### L.2 Experiment 2: multi-pass fluctuation sweep

For each batch size $B$, we run multi-pass batch SGD for $L$ updates under both sampling rules, producing iterates $u_{L,r}^{\mathrm{wr}}$ and $u_{L,r}^{\mathrm{wor}}$. On the same dataset and with the same stepsize schedule, we also run the full-batch GD reference iterate $\theta_{L,r}$. We then estimate the fluctuation terms by

$$
\widehat{\mathrm{Fluc}}_{B}^{\rho}:=\frac{1}{R}\sum_{r=1}^{R}\bigl\|\Sigma^{1/2}(u_{L,r}^{\rho}-\theta_{L,r})\bigr\|_{2}^{2},\qquad\rho\in\{\mathrm{wr},\mathrm{wor}\}.
$$

This directly matches the fluctuation quantity in Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2). We plot these empirical means against $B$ together with one-parameter reference curves of the form

$$
C_{\mathrm{wr}}\cdot\frac{1}{B},\qquad C_{\mathrm{wor}}\cdot\rho_{N,B},
$$

where $C_{\mathrm{wr}}$ and $C_{\mathrm{wor}}$ are fitted from the average normalized fluctuation. The purpose of this experiment is to test whether the sampling-rule dependence is indeed captured by the batch prefactors $1/B$ and $\rho_{N,B}$.
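The relation between the two prefactors can be made explicit with a short calculation. The factor $(N-B)/(N-1)$ is the classical finite-population correction for the variance of a sample mean drawn without replacement. The values $N=1024$ and $B=256$ in the last line are illustrative only and are not taken from the experiments.

```latex
\begin{align*}
\rho_{N,B} &= \frac{N-B}{B(N-1)} = \frac{1}{B}\cdot\frac{N-B}{N-1}, \\
\frac{\rho_{N,B}}{1/B} &= \frac{N-B}{N-1} \le 1,
  \quad\text{with equality iff } B = 1, \\
\rho_{N,1} &= 1, \qquad \rho_{N,N} = 0, \\
N = 1024,\ B = 256:\quad
\rho_{N,B} &= \frac{768}{256\cdot 1023} \approx 2.93\times 10^{-3},
  \qquad \frac{1}{B} \approx 3.91\times 10^{-3}.
\end{align*}
```

On log-log axes the two reference curves therefore coincide at $B=1$ and separate as $B$ approaches $N$, where the without-replacement curve drops to zero.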

### L.3 Experiment 3: normalized fluctuation collapse

The third experiment uses the same multi-pass runs as Experiment 2 but removes the predicted batch prefactors. We plot

$$
\frac{\widehat{\mathrm{Fluc}}_{B}^{\mathrm{wr}}}{1/B}\qquad\text{and}\qquad\frac{\widehat{\mathrm{Fluc}}_{B}^{\mathrm{wor}}}{\rho_{N,B}}.
$$

If the theorem captures the leading $B$-dependence correctly, these normalized quantities should be approximately constant across $B$. This is a stronger check than Experiment 2 alone: Experiment 2 tests the decay pattern on log–log axes, while Experiment 3 tests whether the remaining dependence after normalization is essentially flat. For without-replacement sampling, the point $B=N$ is omitted from the normalized plot because $\rho_{N,N}=0$.
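The normalization step can be sketched as follows. The function names, the dictionary layout keyed by batch size, and the max/min flatness summary are assumptions introduced for illustration, and the input values are synthetic placeholders rather than results from the paper.

```python
def rho(N, B):
    """Without-replacement prefactor rho_{N,B} = (N - B) / (B * (N - 1))."""
    if N < 2 or not 1 <= B <= N:
        raise ValueError("require N >= 2 and 1 <= B <= N")
    return (N - B) / (B * (N - 1))


def normalized_fluctuations(fluc_by_batch, N, rule):
    """Divide each empirical fluctuation by its predicted batch prefactor.

    fluc_by_batch: dict mapping batch size B to the empirical fluctuation.
    rule:          "wr" (prefactor 1/B) or "wor" (prefactor rho_{N,B}).
    """
    if rule not in ("wr", "wor"):
        raise ValueError("rule must be 'wr' or 'wor'")
    normalized = {}
    for B, fluc in sorted(fluc_by_batch.items()):
        prefactor = 1.0 / B if rule == "wr" else rho(N, B)
        if prefactor == 0.0:
            continue  # B = N without replacement: rho_{N,N} = 0, so the point is omitted
        normalized[B] = fluc / prefactor
    return normalized


def flatness_ratio(normalized):
    """Hypothetical summary: max/min of the normalized values (1.0 means perfectly flat)."""
    values = list(normalized.values())
    return max(values) / min(values)


# Synthetic placeholder input that follows the predicted law exactly.
N = 64
synthetic = {B: 0.5 * rho(N, B) for B in (1, 2, 4, 8, 16, 32, 64)}
collapsed = normalized_fluctuations(synthetic, N, "wor")
```

Averaging the values in `collapsed` gives one way to obtain the fitted constant used for the Experiment 2 reference curve, since that constant is described as the average normalized fluctuation.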

Throughout all three experiments, the error bars shown in the main-text plots are one empirical standard deviation over the $R=100$ repetitions.

## Appendix M Limitations and Broader Effects

#### Limitations.

Our analysis is intentionally stylized. The theory is proved for sketched linear regression under a Gaussian design, a well-specified teacher–student model, power-law covariance decay, and a source condition on the target parameter. These assumptions let us separate approximation, bias, variance, and fluctuation cleanly, but they do not cover misspecification, heavy-tailed or dependent data, non-Gaussian sketching, or feature learning in nonlinear models. The experiments are likewise synthetic and are designed to test the predicted scaling behavior rather than empirical competitiveness on real tasks. Accordingly, the resulting batch-size scaling laws should be interpreted as precise results for a controlled regime, not as universal prescriptions for all SGD training problems.

#### Broader effects.

One positive effect of this work is sharper guidance for how batch size changes optimization noise in large-scale training. By isolating when batching primarily alters stochastic terms and when without-replacement sampling can further reduce fluctuation, the results can inform more compute-efficient training strategies and improve theoretical intuition for algorithm design. The main risk is overgeneralization: if stylized scaling laws derived under restrictive assumptions are transferred directly to complex real-world systems, practitioners may choose training rules that are poorly calibrated for misspecified models, distribution shift, or fairness and safety constraints that are absent from our setup. We therefore view these results as a theoretical foundation that should be paired with application-specific validation before informing deployment decisions.
