From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression
Summary
This paper derives batch scaling laws for sketched linear regression under power-law spectra, analyzing one-pass and multi-pass mini-batch SGD. It provides explicit risk decompositions showing how batch size affects bias, variance, and fluctuation terms, and establishes that without-replacement sampling yields lower noise than with-replacement.
# From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression
Source: [https://arxiv.org/html/2605.24316](https://arxiv.org/html/2605.24316)
Ziyan Chen (The University of Sydney, ziyan.chen@sydney.edu.au) and Dingxuan Zhou (The University of Sydney, dingxuan.zhou@sydney.edu.au)
###### Abstract
Scaling laws provide a compact description of how prediction error varies with compute, model size, and data, but existing theoretical results largely focus on single-sample SGD or full data reuse and leave the role of mini-batching unclear. In this paper, we study batch scaling laws for sketched linear regression under a power-law covariance spectrum and a source condition on the target parameter. We analyze three optimization procedures: one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. We first derive an explicit risk decomposition showing that all three procedures share the same irreducible and approximation terms, while the stochastic contributions depend on the optimization protocol: one-pass batch SGD splits into bias and variance, whereas the two multi-pass procedures split into GD bias, GD variance, and a fluctuation term around a common GD reference trajectory. Building on this decomposition, we prove source-condition scaling laws for both one-pass and multi-pass mini-batch methods. For one-pass batch SGD, mini-batching preserves the approximation and optimization-bias exponents, while the variance term scales as $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$; thus the usual covariance reduction by $1/B$ holds at fixed update count $T$, whereas in the one-pass regime $T=N/B$ this gain is partly offset by the shorter optimization horizon. For multi-pass batch SGD, the approximation term and the GD bias/variance contribution are identical for with-replacement and without-replacement sampling; the only difference is the fluctuation term, whose covariance prefactor is $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement. Consequently, without-replacement sampling is less noisy for $B>1$, and when $B=N$ the fluctuation vanishes exactly, recovering deterministic gradient descent. These results place batch size on the same theoretical footing as compute, data, and model dimension within the sketched linear-regression framework.
## 1 Introduction
Scaling laws have become a standard language for describing progress in modern machine learning: across many domains, prediction error follows regular power laws in model size, data size, and compute. This pattern has been documented from early large-scale studies across translation, language, vision, and speech (Hestness et al., [2017](https://arxiv.org/html/2605.24316#bib.bib6)) to modern language-model scaling analyses (Kaplan et al., [2020](https://arxiv.org/html/2605.24316#bib.bib3); Hoffmann et al., [2022](https://arxiv.org/html/2605.24316#bib.bib2)) and multimodal autoregressive modeling (Henighan et al., [2020](https://arxiv.org/html/2605.24316#bib.bib5)). As a result, scaling laws are now used not only to summarize experiments, but also to guide forecasting, resource allocation, and training design (Rosenfeld et al., [2020](https://arxiv.org/html/2605.24316#bib.bib21); Zhai et al., [2022](https://arxiv.org/html/2605.24316#bib.bib22); Alabdulmohsin et al., [2022](https://arxiv.org/html/2605.24316#bib.bib23); Besiroglu et al., [2024](https://arxiv.org/html/2605.24316#bib.bib18); Muennighoff et al., [2023](https://arxiv.org/html/2605.24316#bib.bib19); Paquette et al., [2024](https://arxiv.org/html/2605.24316#bib.bib20)).
Empirical scaling laws do not by themselves explain where the exponents come from, which parts of the risk they describe, or how algorithmic choices change them, and rigorous results that answer these questions are rarer. In statistically transparent models, however, approximation, optimization, and sampling effects can be separated. Following this program, Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4)) proved source-condition scaling laws for one-pass SGD in sketched linear regression, and Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)) showed that multiple passes lead to a decomposition into approximation, GD bias, GD variance, and fluctuation, yielding sharper compute–risk tradeoffs. These papers place empirical scaling laws on a rigorous footing and complement a broader theoretical literature on power-law behavior based on manifold arguments, constructive learning curves, solvable and dynamical models, renormalized high-dimensional asymptotics, quantization, and feature learning (Sharma and Kaplan, [2020](https://arxiv.org/html/2605.24316#bib.bib24); Bahri et al., [2024](https://arxiv.org/html/2605.24316#bib.bib17); Hutter, [2021](https://arxiv.org/html/2605.24316#bib.bib25); Maloney et al., [2022](https://arxiv.org/html/2605.24316#bib.bib26); Michaud et al., [2023](https://arxiv.org/html/2605.24316#bib.bib27); Bordelon et al., [2024](https://arxiv.org/html/2605.24316#bib.bib28), [2025](https://arxiv.org/html/2605.24316#bib.bib29); Atanasov et al., [2024](https://arxiv.org/html/2605.24316#bib.bib30); Dohmatob et al., [2024](https://arxiv.org/html/2605.24316#bib.bib31); Paquette et al., [2024](https://arxiv.org/html/2605.24316#bib.bib20); Ren et al., [2025](https://arxiv.org/html/2605.24316#bib.bib32)).
Batch size is one of the most important large-scale training knobs because it affects hardware utilization, wall-clock efficiency, and gradient noise. On the empirical side, large-batch rules enabled ImageNet training with batches up to 8192 (Goyal et al., [2017](https://arxiv.org/html/2605.24316#bib.bib8)), LARS pushed convolutional training to 8K and 32K (You et al., [2017](https://arxiv.org/html/2605.24316#bib.bib9)), and later studies showed that the gains are highly workload-dependent and eventually saturate (Shallue et al., [2019](https://arxiv.org/html/2605.24316#bib.bib10); Golmant et al., [2019](https://arxiv.org/html/2605.24316#bib.bib11)). A particularly influential synthesis is the gradient-noise-scale viewpoint of McCandlish et al. ([2018](https://arxiv.org/html/2605.24316#bib.bib12)), and batch-dependent empirical scaling laws have recently been studied directly for language models (Shuai et al., [2024](https://arxiv.org/html/2605.24316#bib.bib16)).
There is also a growing theoretical understanding of batch size $B$ through SGD noise. Smith and Le ([2018](https://arxiv.org/html/2605.24316#bib.bib13)) modeled SGD by an SDE with noise scale proportional to $\epsilon N/B$, suggesting that the optimal batch size should grow with both the learning rate and the dataset size $N$. Smith et al. ([2018](https://arxiv.org/html/2605.24316#bib.bib14)) studied the closely related strategy of increasing batch size instead of decaying the learning rate. In least-squares and related settings, the statistical effects of mini-batching, multiple passes, tail averaging, and implicit regularization have also been analyzed extensively (Lin and Rosasco, [2017](https://arxiv.org/html/2605.24316#bib.bib33); Jain et al., [2017](https://arxiv.org/html/2605.24316#bib.bib34); Mücke et al., [2019](https://arxiv.org/html/2605.24316#bib.bib35); Ge et al., [2019](https://arxiv.org/html/2605.24316#bib.bib36); Zou et al., [2021](https://arxiv.org/html/2605.24316#bib.bib37); Wu et al., [2022](https://arxiv.org/html/2605.24316#bib.bib7); Pillaud-Vivien et al., [2018](https://arxiv.org/html/2605.24316#bib.bib15)). What is still missing in this sketched linear-regression framework is a scaling-law analysis that shows how batch size enters approximation, optimization bias, variance, and data reuse.
This paper develops such a theory for sketched linear regression trained with mini-batch methods. We study three stochastic procedures: one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. While mini-batching classically reduces the per-update noise covariance by a factor $1/B$ at a fixed number of updates, our contribution is to show how batch size propagates through the scaling-law decompositions of Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)). In particular, the without-replacement scheme introduces the finite-population factor $\rho_{N,B}$ and recovers GD exactly at $B=N$. Our main contributions are as follows.
- A one-pass batch scaling law with horizon-noise tradeoff. Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1), together with the unified risk decomposition in Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1), shows that batching preserves the one-pass approximation and bias exponents while the stochastic contribution obeys the variance bound in Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1). Equivalently, the per-update covariance gains the usual factor $1/B$ at fixed $T$, but in the actual one-pass regime $T=N/B$ this improvement is partly offset because larger batches shorten the optimization horizon.
- A multi-pass fluctuation law. Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2) shows that mini-batching does not change the deterministic approximation and GD bias–variance terms from Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)); it changes only the fluctuation around the GD reference path. Importantly, this result is not obtained by simply multiplying the fluctuation bound of Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)) by $1/B$: batch updates change the fluctuation recursion and the covariance calculation of the driving noise, so the derivation must be redone in the batch setting. The resulting prefactor is $1/B$ with replacement and $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement, so without-replacement sampling is less noisy for $B>1$ and recovers deterministic GD at $B=N$, complementing the broader picture that multiple passes are statistically useful on hard problems (Pillaud-Vivien et al., [2018](https://arxiv.org/html/2605.24316#bib.bib15)).
#### Notation\.
For two positive-valued functions $f(x)$ and $g(x)$, we write $f(x)\lesssim g(x)$ (equivalently, $f(x)=O(g(x))$) and $f(x)\gtrsim g(x)$ (equivalently, $f(x)=\Omega(g(x))$) if there exists an absolute constant $c>0$ such that $f(x)\leq cg(x)$ and $f(x)\geq cg(x)$, respectively; we write $f(x)\asymp g(x)$ (equivalently, $f(x)=\Theta(g(x))$) when both bounds hold. For vectors $u$ and $v$ in a Hilbert space, we denote their inner product by $\langle u,v\rangle$ or $u^{\top}v$. For matrices $A$ and $B$ of compatible dimensions, we define their inner product by $\langle A,B\rangle:=\operatorname{tr}(A^{\top}B)$. We use $\|\cdot\|$ to denote the operator norm for matrices and the $\ell_{2}$-norm for vectors. For a positive semidefinite (PSD) matrix $A$ and a compatible vector $v$, we write $\|v\|_{A}^{2}:=v^{\top}Av$, and we write $A\preceq B$ when $B-A$ is PSD. For a symmetric matrix $A$, $\mu_{j}(A)$ denotes its $j$-th eigenvalue and $r(A)$ its rank. Finally, $\log(\cdot)$ denotes the base-2 logarithm.
## 2 Preliminaries
We work in the same sketched linear-regression framework as Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)). In addition to the normal GD iterate $\theta_{t}$, we study a one-pass batch SGD iterate, a multi-pass batch SGD iterate with replacement, and a multi-pass batch SGD iterate without replacement; each stochastic update averages a mini-batch of size $B$.
In particular, when $B=1$, the one-pass batch SGD setup reduces to the one-pass SGD setting of Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4)). For the multi-pass methods, our with-replacement and without-replacement rules coincide at $B=1$, because the latter is without replacement only within each mini-batch, not across an epoch. Thus our without-replacement procedure should not be confused with random reshuffling.
#### Problem setup\.
Let $\mathcal{H}$ be a finite- or countably infinite-dimensional Hilbert space. For a parameter $w\in\mathcal{H}$, define the population risk

$$R(w):=\mathbb{E}\bigl[(\langle x,w\rangle-y)^{2}\bigr],\qquad(x,y)\sim P,$$

where $P$ is a Borel probability measure on $\mathcal{H}\times\mathbb{R}$. Define the population covariance and population risk minimizer as follows:

$$H:=\mathbb{E}[xx^{\top}],\quad w^{\ast}\in\arg\min_{w\in\mathcal{H}}R(w).$$

We observe only the sketched covariates $(Sx,y)$, where $S:\mathcal{H}\to\mathbb{R}^{M}$ is the sketching operator. For $u\in\mathbb{R}^{M}$, define the sketched risk

$$R_{M}(u):=R(S^{\top}u)=\mathbb{E}\bigl[(\langle Sx,u\rangle-y)^{2}\bigr].$$
We are given $N$ i.i.d. samples drawn according to $P$,

$$D=\{(x_{i},y_{i})\}_{i=1}^{N},$$

of which we observe $O=\{(Sx_{i},y_{i})\}_{i=1}^{N}$. Write

$$X:=(x_{1},\dots,x_{N})^{\top},\qquad y:=(y_{1},\dots,y_{N})^{\top},$$

and define the population and empirical quantities

$$\Sigma:=SHS^{\top},\qquad\widehat{\Sigma}:=\frac{1}{N}SX^{\top}XS^{\top},\qquad\widehat{b}:=\frac{1}{N}SX^{\top}y.$$

Conditioned on $S$, the minimizer of $R_{M}$ is, as in Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4)),

$$u^{\ast}=(SHS^{\top})^{-1}SHw^{\ast}=\Sigma^{-1}SHw^{\ast}.$$
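The expression for $u^{\ast}$ follows from the first-order condition of the sketched risk. The short derivation below is a sketch rather than a quotation from the paper: it assumes that $\Sigma$ is invertible and uses the normal equation $\mathbb{E}[xy]=Hw^{\ast}$ satisfied by the population minimizer.

```latex
\begin{aligned}
\nabla R_M(u)
  &= 2\,\mathbb{E}\bigl[Sx\,(\langle Sx,u\rangle - y)\bigr]
   = 2\bigl(\Sigma u - S\,\mathbb{E}[xy]\bigr), \\
\mathbb{E}[xy] = H w^{\ast}
  &\;\Longrightarrow\;
  \nabla R_M(u^{\ast}) = 0
  \iff \Sigma u^{\ast} = S H w^{\ast}
  \iff u^{\ast} = \Sigma^{-1} S H w^{\ast}.
\end{aligned}
```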
#### Optimization procedures\.
We compare four methods under this common setup. For every optimization procedure run for $L_{\mathrm{run}}$ updates, we use the same blockwise geometric learning-rate schedule: partition the updates into consecutive blocks indexed by $\ell=0,1,2,\ldots$, each containing $L_{\mathrm{run,eff}}:=L_{\mathrm{run}}/\log L_{\mathrm{run}}$ consecutive updates up to endpoint rounding, and set $\gamma_{t}=\gamma/2^{\ell}$ for every update $t$ in block $\ell$. In particular, $L_{\mathrm{run}}=T=N/B$ for one-pass batch SGD, whereas $L_{\mathrm{run}}=L$ for normal GD and the two multi-pass batch methods.
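As an illustration of the blockwise geometric schedule, the following minimal sketch builds the stepsize list for a given number of updates. The function name `blockwise_geometric_stepsizes` is hypothetical, and rounding the block length up is an assumption, since the text only specifies the block length "up to endpoint rounding".

```python
import math

def blockwise_geometric_stepsizes(num_updates, gamma):
    """Return [gamma_1, ..., gamma_{num_updates}] with gamma_t = gamma / 2**block.

    Each block holds about num_updates / log2(num_updates) updates.
    Rounding the block length up is an assumption of this sketch.
    """
    if num_updates < 2:
        raise ValueError("need num_updates >= 2 so that log2(num_updates) > 0")
    block_len = max(1, math.ceil(num_updates / math.log2(num_updates)))
    # Update t (0-based) lies in block t // block_len.
    return [gamma / 2 ** (t // block_len) for t in range(num_updates)]

# 16 updates -> log2(16) = 4 blocks of 4 updates each.
schedule = blockwise_geometric_stepsizes(16, 0.1)
print(schedule[0], schedule[4], schedule[8], schedule[12])
```

With 16 updates the stepsize is halved after every 4 updates, so the final block runs at $\gamma/8$.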
#### 1\. Normal GD\.
Let $(\gamma_{t})$ be the prescribed stepsize schedule. The normal GD iterate is

$$\theta_{t}=\theta_{t-1}-\gamma_{t}\widehat{\Sigma}\theta_{t-1}+\gamma_{t}\widehat{b},\qquad\theta_{0}=0.\tag{1}$$
#### 2\. One\-pass Batch SGD\.
Assume for simplicity that $B\mid N$, and partition $[N]$ into disjoint batches $I_{1},\dots,I_{N/B}$ with $|I_{t}|=B$. For each batch, define

$$\widehat{\Sigma}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}x_{i}^{\top}S^{\top},\qquad\widehat{b}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}y_{i}.$$

The one-pass batch SGD iterate is

$$u_{t}^{\mathrm{op}}=u_{t-1}^{\mathrm{op}}-\gamma_{t}\widehat{\Sigma}_{I_{t}}^{(B)}u_{t-1}^{\mathrm{op}}+\gamma_{t}\widehat{b}_{I_{t}}^{(B)},\qquad t=1,\dots,\frac{N}{B},\qquad u_{0}^{\mathrm{op}}=0.\tag{2}$$

Thus each update uses $B$ fresh samples, and the algorithm performs a total of $N/B$ updates.
#### 3\. Multi\-pass Batch SGD with Replacement\.
At each step $t\in[L]$, sample a mini-batch with replacement, $i_{t,1},\dots,i_{t,B}\stackrel{\mathrm{iid}}{\sim}\mathrm{unif}([N])$, and define

$$\widehat{\Sigma}_{t}^{(B)}:=\frac{1}{B}\sum_{r=1}^{B}Sx_{i_{t,r}}x_{i_{t,r}}^{\top}S^{\top},\qquad\widehat{b}_{t}^{(B)}:=\frac{1}{B}\sum_{r=1}^{B}Sx_{i_{t,r}}y_{i_{t,r}}.$$

The multi-pass batch SGD iterate with replacement is

$$u_{t}^{\mathrm{wr}}=u_{t-1}^{\mathrm{wr}}-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}u_{t-1}^{\mathrm{wr}}+\gamma_{t}\widehat{b}_{t}^{(B)},\qquad t=1,\dots,L,\qquad u_{0}^{\mathrm{wr}}=0.\tag{3}$$

Here each step again uses $B$ samples, but the algorithm now runs for a total of $L$ updates and may therefore reuse data across passes.
#### 4\. Multi\-pass Batch SGD without Replacement\.
We consider mini-batch updates on the fixed dataset $D$, where at each step $t\in[L]$ we sample a subset $I_{t}\subset[N]$ with $|I_{t}|=B$ uniformly without replacement from $[N]$. Across different iterations $t$, the batches $I_{t}$ are sampled independently, so data may be reused across iterations, but no sample is repeated within a single batch.
For each sampled batch $I_{t}$, define

$$\widehat{\Sigma}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}x_{i}^{\top}S^{\top},\qquad\widehat{b}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}y_{i}.$$

Then the multi-pass batch SGD iterate without replacement is

$$u_{t}^{\mathrm{wor}}=u_{t-1}^{\mathrm{wor}}-\gamma_{t}\widehat{\Sigma}_{I_{t}}^{(B)}u_{t-1}^{\mathrm{wor}}+\gamma_{t}\widehat{b}_{I_{t}}^{(B)},\qquad t=1,\dots,L,\qquad u_{0}^{\mathrm{wor}}=0.\tag{4}$$
In summary, we keep the same problem setup and normal GD process as in Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)), but replace the single-sample updates by mini-batch updates of size $B$. When comparing the stochastic iterates with GD, we always use the same stepsize schedule and the same number of updates: $N/B$ for one-pass batch SGD and $L$ for multi-pass batch SGD, with or without replacement.
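The three stochastic procedures share one update rule and differ only in how the batch indices are chosen. The sketch below is a minimal pure-Python illustration under stated assumptions: the helper names (`batch_step`, `one_pass_batches`, `with_replacement_batch`, `without_replacement_batch`) are hypothetical, `Z[i]` stands for an already sketched covariate $Sx_i$, and the toy data are random placeholders rather than a draw from the paper's model. It uses the identity $\widehat{\Sigma}_{I}u-\widehat{b}_{I}=\frac{1}{B}\sum_{i\in I}Sx_{i}(\langle Sx_{i},u\rangle-y_{i})$, so no matrix needs to be formed.

```python
import random

def batch_step(u, Z, y, idx, gamma):
    """One update u <- u - gamma * (Sigma_hat_I u - b_hat_I) over batch indices idx."""
    B = len(idx)
    grad = [0.0] * len(u)
    for i in idx:
        residual = sum(zj * uj for zj, uj in zip(Z[i], u)) - y[i]
        for j, zj in enumerate(Z[i]):
            grad[j] += zj * residual / B
    return [uj - gamma * gj for uj, gj in zip(u, grad)]

def one_pass_batches(N, B):
    """Disjoint batches I_1, ..., I_{N/B}; assumes B divides N."""
    return [list(range(t * B, (t + 1) * B)) for t in range(N // B)]

def with_replacement_batch(N, B, rng):
    """B i.i.d. uniform indices; repeats inside a batch are possible."""
    return [rng.randrange(N) for _ in range(B)]

def without_replacement_batch(N, B, rng):
    """A uniform size-B subset, drawn afresh at every step."""
    return rng.sample(range(N), B)

# Toy data (placeholders): N sketched covariates of dimension M.
rng = random.Random(0)
N, M, gamma = 8, 3, 0.1
Z = [[rng.gauss(0, 1) for _ in range(M)] for _ in range(N)]
y = [rng.gauss(0, 1) for _ in range(N)]
u0 = [0.0] * M

gd_step = batch_step(u0, Z, y, list(range(N)), gamma)              # full-batch GD
wor_full = batch_step(u0, Z, y, without_replacement_batch(N, N, rng), gamma)
wr_step = batch_step(u0, Z, y, with_replacement_batch(N, 4, rng), gamma)
print(max(abs(a - b) for a, b in zip(gd_step, wor_full)))          # ~0 up to rounding
```

At $B=N$ a without-replacement batch is a permutation of the whole dataset, so its step agrees with the GD step up to floating-point summation order, which is the mechanism behind the exact recovery of GD in that case.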
## 3 Main Results
This section contains the main theoretical contribution of the paper: explicit batch-size scaling laws for one-pass and multi-pass sketched SGD under the power-law/source-condition model. The main conclusion is that mini-batching does not change the functional form of the deterministic approximation and optimization-bias terms (in the one-pass case, the bias is evaluated at the shorter horizon $T=N/B$). Instead, it enters the stochastic terms mainly through the one-pass variance bound $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$ and through the multi-pass fluctuation prefactor $\rho_{N,B}$.
The assumptions below are the same stylized assumptions used inLinet al\.\([2024](https://arxiv.org/html/2605.24316#bib.bib4),[2025](https://arxiv.org/html/2605.24316#bib.bib1)\), adapted here to the mini\-batch setting\. We first restate those conditions, then give a common risk decomposition, and finally derive the one\-pass and multi\-pass scaling laws\.
###### Assumption 1 (Data assumptions).
Assume the following conditions on the data distribution $P$.
1. A. Gaussian design. The feature vector satisfies $x\sim\mathcal{N}(0,H)$.
2. B. Well-specified model. The response satisfies $\mathbb{E}[y\mid x,w^{\ast}]=\langle x,w^{\ast}\rangle$, with $\sigma^{2}:=\mathbb{E}\bigl[(y-\langle x,w^{\ast}\rangle)^{2}\bigr]$.
3. C. Power-law spectrum. Let $(\lambda_{i})_{i\geq 1}$ denote the eigenvalues of $H$. Then for some $a>1$, $\lambda_{i}\asymp i^{-a}$ for all $i\geq 1$.
4. D. Source condition. Let $(\lambda_{i},v_{i})_{i\geq 1}$ be the eigenvalue–eigenvector pairs of $H$. Assume $w^{\ast}$ follows a prior such that for some $b>1$, $\mathbb{E}\bigl[\langle v_{i},w^{\ast}\rangle\langle v_{j},w^{\ast}\rangle\bigr]=0$ for $i\neq j$ and $\mathbb{E}\bigl[\lambda_{i}\langle v_{i},w^{\ast}\rangle^{2}\bigr]\asymp i^{-b}$ for all $i\geq 1$.
###### Assumption 2 (Source condition in diagonal coordinates).
Assume without loss of generality that $H$ is diagonal with non-increasing diagonal entries,

$$H=\operatorname{diag}(\lambda_{1},\lambda_{2},\dots).$$

Assume the true parameter $w^{\ast}$ satisfies: for some $b>1$,

$$\mathbb{E}[w_{i}^{\ast}w_{j}^{\ast}]=0\quad\text{for all }i\neq j;\qquad\mathbb{E}[\lambda_{i}(w_{i}^{\ast})^{2}]\asymp i^{-b}\quad\text{for all }i\geq 1.$$
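To make the roles of $a$ and $b$ concrete, the sketch below builds a truncated instance of the power-law and source conditions with all hidden constants set to one, which is an assumption of the example (the conditions only fix the quantities up to constants). The function names and the truncation level are hypothetical. It also evaluates the tail sum $\sum_{i>M}i^{-b}$, which by integral comparison is at most $M^{1-b}/(b-1)$; reading this tail as the source of the $M^{1-b}$ approximation exponent is a heuristic, not the paper's proof.

```python
def power_law_instance(d, a, b):
    """Truncate to d coordinates with lambda_i = i^{-a} and E[(w_i*)^2] = i^{a-b}."""
    lambdas = [i ** (-a) for i in range(1, d + 1)]
    w_second_moments = [i ** (a - b) for i in range(1, d + 1)]
    return lambdas, w_second_moments

def source_tail(d, b, M):
    """Sum of i^{-b} over M < i <= d (the truncated tail of the source profile)."""
    return sum(i ** (-b) for i in range(M + 1, d + 1))

a, b = 2.0, 1.5            # illustrative exponents with 1 < b < a + 1
lambdas, w2 = power_law_instance(1000, a, b)

M = 100
tail = source_tail(100_000, b, M)
upper = M ** (1 - b) / (b - 1)   # integral bound on the infinite tail
print(tail, upper)               # the tail sits just below the M^{1-b} bound
```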
###### Assumption 3 (Stepsize conditions).
Under the notation of the theorem and its proof, assume that with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$, the following hold:
1. A. $\gamma\leq\min\left\{\frac{c}{\log N},\frac{c}{\operatorname{tr}(\Sigma)}\right\}$.
2. B. $\operatorname{tr}(\Sigma^{2})\lesssim 1$.
3. C. $\sum_{i=1}^{M}\frac{\mu_{i}(\Sigma)}{\mu_{i}(\Sigma)+1/(L_{\mathrm{eff}}\gamma)}\leq\frac{N}{4}$.
4. D. For all $t\geq 1$, $\mathbb{P}\left(4\max_{i\in[N]}\|Sx_{i}\|_{2}^{2}>\frac{t}{\gamma}\right)\leq N^{-ct}$.
Throughout this section, we additionally assume that the sketch operator $S:\mathcal{H}\to\mathbb{R}^{M}$ is Gaussian, meaning that in the diagonal coordinates of $H$, its entries are i.i.d. $\mathcal{N}(0,1/M)$. For one-pass and multi-pass batch SGD, let

$$T:=\frac{N}{B}\geq 2,\quad T_{\mathrm{eff}}:=\frac{T}{\log T},\quad L_{\mathrm{eff}}:=\frac{L}{\log L},\quad\rho_{N,B}:=\frac{N-B}{B(N-1)}.$$
Unless a subscript indicates otherwise, the expectations in the theorem statements below are taken over $w^{\ast}$, the sample, and the mini-batch randomness when applicable.
To state the results compactly, define

$$\rho=\begin{cases}1/B,&\text{for multi-pass batch SGD with replacement},\\ \rho_{N,B},&\text{for multi-pass batch SGD without replacement},\end{cases}$$

and write correspondingly

$$(u_{L}^{\rho},\mathrm{Fluc}^{\rho}_{B})=\begin{cases}(u_{L}^{\mathrm{wr}},\mathrm{Fluc}^{\mathrm{wr}}_{B}),&\rho=1/B,\\ (u_{L}^{\mathrm{wor}},\mathrm{Fluc}^{\mathrm{wor}}_{B}),&\rho=\rho_{N,B}.\end{cases}$$
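The prefactor $\rho_{N,B}$ has the same form as the classical finite-population correction for the variance of a sample mean. The sketch below is an illustration on a scalar toy population, not part of the paper's argument: it enumerates every equally likely batch with exact rational arithmetic and checks that the batch-mean variance equals $\rho_{N,B}$ times the population variance without replacement and $1/B$ times it with replacement. The function names and the toy values are assumptions of the example.

```python
from fractions import Fraction
from itertools import combinations, product

def rho_without_replacement(N, B):
    """rho_{N,B} = (N - B) / (B (N - 1)); requires N >= 2 and 1 <= B <= N."""
    return Fraction(N - B, B * (N - 1))

def variance_of_batch_mean(values, B, replace):
    """Exact variance of a size-B batch mean, enumerating every equally likely batch."""
    N = len(values)
    batches = product(range(N), repeat=B) if replace else combinations(range(N), B)
    means = [sum(values[i] for i in batch) / B for batch in batches]
    grand_mean = sum(means) / len(means)
    return sum((m - grand_mean) ** 2 for m in means) / len(means)

# Toy scalar population (arbitrary illustrative values).
values = [Fraction(v) for v in (3, -1, 4, 1, -5, 9)]
N = len(values)
mu = sum(values) / N
population_var = sum((v - mu) ** 2 for v in values) / N   # divisor N, not N - 1

for B in (1, 2, 3, N):
    wor = variance_of_batch_mean(values, B, replace=False)
    print(B, wor == rho_without_replacement(N, B) * population_var)
```

The two rules agree at $B=1$, where $\rho_{N,1}=1$, and the without-replacement variance is exactly zero at $B=N$.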
These assumptions and definitions have a simple interpretation. The exponents $a$ and $b$ quantify the statistical complexity of the problem through the spectral decay of $H$ and the regularity of $w^{\ast}$, while the sketch dimension $M$ controls approximation. The stepsize condition is the same kind of high-probability regularity assumption used in Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)); it ensures that the concentration and effective-time arguments can be applied uniformly in the proofs.
The next proposition packages the three procedures into one structural statement. It shows that all three risks share a common baseline risk, while the one-pass risk further splits into bias plus variance and the two multi-pass risks split into a common GD reference contribution plus a sampling-rule-dependent fluctuation term.
###### Proposition 3.1 (Risk decompositions for the three optimization procedures).
Assume Assumption [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2). Let $\bar{u}_{T}:=\mathbb{E}[u_{T}^{\mathrm{op}}]$. Then the risks decompose as follows:

$$\mathbb{E}[R_{M}(u_{T}^{\mathrm{op}})]=\underbrace{R_{M}(u^{\ast})}_{\text{common baseline risk}}+\underbrace{R_{M}(\bar{u}_{T})-R_{M}(u^{\ast})}_{\text{one-pass bias excess risk}}+\underbrace{\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})-R_{M}(\bar{u}_{T})\bigr]}_{\text{one-pass variance excess risk}},$$

$$\mathbb{E}[R_{M}(u_{L}^{\rho})]=\underbrace{R_{M}(u^{\ast})}_{\text{common baseline risk}}+\underbrace{\mathbb{E}\bigl[R_{M}(\theta_{L})-R_{M}(u^{\ast})\bigr]}_{\text{common GD-reference excess risk}}+\underbrace{\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})-R_{M}(\theta_{L})\bigr]}_{\text{sampling-rule-dependent fluctuation excess risk}},$$

where $u_{L}^{\rho}$ denotes $u_{L}^{\mathrm{wr}}$ when $\rho=1/B$ and $u_{L}^{\mathrm{wor}}$ when $\rho=\rho_{N,B}$. Note that

$$R_{M}(u^{\ast})=\underbrace{R(w^{\ast})}_{\text{irreducible risk}}+\underbrace{\bigl[R_{M}(u^{\ast})-R(w^{\ast})\bigr]}_{\text{approximation risk}}.$$
The proof is given at the end of Appendix [A](https://arxiv.org/html/2605.24316#A1). Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1) is the organizing principle for the rest of the section: the next theorem specializes the one-pass decomposition, and the theorem after that specializes the multi-pass decompositions and makes explicit that with-replacement and without-replacement sampling differ only through the fluctuation prefactor.
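Because $R_{M}$ is a quadratic function minimized at $u^{\ast}$, each excess-risk term in the decomposition can be read as a squared $\Sigma$-norm. The lines below are a sketch of that reading and are not taken from the paper's proof. They assume that $\bar{u}_{T}$ is the mean of $u_{T}^{\mathrm{op}}$ conditional on $(S,w^{\ast})$, and they use the fact that, conditional on the dataset $D$, the unbiased mini-batch averages in the linear recursions (3) and (4) give $\mathbb{E}[u_{L}^{\rho}\mid D]=\theta_{L}$.

```latex
\begin{aligned}
R_M(u) - R_M(u^{\ast})
  &= \|u - u^{\ast}\|_{\Sigma}^{2}
  && \text{(quadratic risk, } \nabla R_M(u^{\ast}) = 0\text{)}, \\
\mathbb{E}\bigl[R_M(u_T^{\mathrm{op}}) - R_M(\bar{u}_T)\bigr]
  &= \mathbb{E}\,\|u_T^{\mathrm{op}} - \bar{u}_T\|_{\Sigma}^{2}
  && \text{(cross term vanishes since } \mathbb{E}[u_T^{\mathrm{op}}] = \bar{u}_T\text{)}, \\
\mathbb{E}\bigl[R_M(u_L^{\rho}) - R_M(\theta_L)\bigr]
  &= \mathbb{E}\,\|u_L^{\rho} - \theta_L\|_{\Sigma}^{2}
  && \text{(cross term vanishes since } \mathbb{E}[u_L^{\rho} \mid D] = \theta_L\text{)}.
\end{aligned}
```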
###### Theorem 3.1 (Scaling law for one-pass batch SGD under the source condition).
Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), and [2](https://arxiv.org/html/2605.24316#Thmassumption2), and suppose

$$1<b<a+1,\qquad\sigma^{2}\asymp 1,\qquad T_{\mathrm{eff}}\gamma\gtrsim 1.$$

Assume moreover that the one-pass analogue of Assumption [3](https://arxiv.org/html/2605.24316#Thmassumption3) holds as follows: the effective-horizon conditions are imposed with $L_{\mathrm{eff}}$ replaced by $T_{\mathrm{eff}}$, while the maximum-norm condition in that assumption remains over the original $N$ samples. Then there exists an $a$-dependent constant $c>0$ such that, whenever $\gamma\leq c/\log T$, we have with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,

$$\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})\bigr]=\sigma^{2}+\underbrace{\Theta\bigl(M^{1-b}\bigr)+\Theta\Bigl((T_{\mathrm{eff}}\gamma)^{(1-b)/a}\Bigr)}_{\mathrm{Approx+Bias}}+\underbrace{O\left(\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}}\right)}_{\mathrm{Var}}.$$

Here the hidden constants depend only on $(a,b)$. In particular, when $1<b\leq a$, the variance term is dominated by the sum of the approximation and bias terms, so the risk simplifies to

$$\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})\bigr]=\sigma^{2}+\Theta\bigl(M^{1-b}\bigr)+\Theta\Bigl((T_{\mathrm{eff}}\gamma)^{(1-b)/a}\Bigr).$$
In Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1), the factor $1/B$ should be interpreted at the level of the per-update noise covariance, or equivalently relative to a fixed number of updates $T$. In the actual one-pass regime, however, $T=N/B$, so increasing $B$ simultaneously lowers the one-step noise and shortens the optimization horizon. Accordingly, the full variance term is $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$, not a pure $1/B$ improvement as a function of $B$ at fixed dataset size $N$.
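The horizon-noise tradeoff can be made explicit by substituting $T=N/B$: since $B\,T_{\mathrm{eff}}=N/\log(N/B)$, the variance bound equals $\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}\log(N/B)/N$ at fixed $N$. The sketch below evaluates the bound with hidden constants dropped; the function name and the numerical values are illustrative assumptions and are not checked against the stepsize conditions. It considers a regime where the minimum is attained at $M$, so that the bound depends on $B$ only through $\log(N/B)$.

```python
import math

def one_pass_variance_bound(N, B, M, gamma, a):
    """min{M, (T_eff*gamma)^{1/a}} / (B*T_eff) with T = N/B; constants dropped."""
    T = N / B
    if T < 2:
        raise ValueError("the theorem assumes T = N/B >= 2")
    T_eff = T / math.log2(T)   # log is base 2, as in the Notation paragraph
    return min(M, (T_eff * gamma) ** (1.0 / a)) / (B * T_eff)

N, M, gamma, a = 2 ** 20, 2, 0.05, 2.0   # illustrative values only
small_batch = one_pass_variance_bound(N, 1, M, gamma, a)
large_batch = one_pass_variance_bound(N, 1024, M, gamma, a)

# Both cases are M-limited here, so the ratio is log2(N/1) / log2(N/1024) = 20/10.
print(small_batch / large_batch)
```

In this illustration a 1024-fold increase in batch size lowers the bound by a factor of about 2, whereas at a fixed number of updates the same increase would lower it by a factor of 1024.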
See Appendix [B](https://arxiv.org/html/2605.24316#A2), and in particular Appendix [B.1](https://arxiv.org/html/2605.24316#A2.SS1), for the proof of Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1).
We now turn to the multi-pass setting. The key question is how mini-batching affects each term of the decomposed risk; the next theorem shows that only the fluctuation term changes.
###### Theorem 3.2 (Scaling law for multi-pass batch SGD with and without replacement under the source condition).
Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [2](https://arxiv.org/html/2605.24316#Thmassumption2), and [3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose

$$1<b<a+1,\qquad\sigma^{2}\asymp 1,\qquad L_{\mathrm{eff}}\gamma\gtrsim 1,\qquad L_{\mathrm{eff}}\lesssim N^{a}/\gamma.$$

Then there exists an $(a,b)$-dependent constant $c>0$ such that, whenever $\gamma\leq c/\log N$, we have with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$, the following hold.
1. If, for some fixed $\varepsilon\in(0,1)$, $L_{\mathrm{eff}}\lesssim N^{(1-\varepsilon)a}/\gamma$, then

   $$\begin{aligned}\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})\bigr]={}&\sigma^{2}+\underbrace{\Theta\bigl(M^{1-b}\bigr)}_{\mathrm{Approx}}+\underbrace{\Theta\Bigl(\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}^{1-b}\Bigr)}_{\mathrm{GD\,Bias}}\\&+\underbrace{\Theta\left(\frac{\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}}{N}\right)}_{\mathrm{GD\,Var}}+\underbrace{O\left(\rho\,\gamma\log N\left[(L_{\mathrm{eff}}\gamma)^{1/a-1}+\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N}\right]\right)}_{\mathrm{Fluc}}.\end{aligned}$$
2. In particular, when $a\geq b$, $L_{\mathrm{eff}}\lesssim N^{a/b}/\gamma$, and $\gamma\log N\lesssim 1$, the GD variance and fluctuation terms are dominated by the sum of the approximation and GD bias terms, namely,

   $$\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})\bigr]=\sigma^{2}+\Theta\bigl(M^{1-b}\bigr)+\Theta\Bigl(\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}^{1-b}\Bigr).$$
3. When $a<b<a+1$ and $L_{\mathrm{eff}}\lesssim N/\gamma$, the approximation and GD bias terms combine as

   $$\Theta\bigl(M^{1-b}\bigr)+\Theta\bigl((L_{\mathrm{eff}}\gamma)^{(1-b)/a}\bigr)=\Theta\Bigl(\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}^{1-b}\Bigr),$$

   and therefore

   $$\begin{aligned}\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})\bigr]={}&\sigma^{2}+\Theta\Bigl(\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}^{1-b}\Bigr)+\Theta\left(\frac{\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}}{N}\right)\\&+O\left(\rho\,\gamma\log N\left[(L_{\mathrm{eff}}\gamma)^{1/a-1}+\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N}\right]\right).\end{aligned}$$
See Appendix[B](https://arxiv.org/html/2605.24316#A2), and in particular Appendix[B\.2](https://arxiv.org/html/2605.24316#A2.SS2), for the proof of Theorem[3\.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2)\.
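To get a feel for the relative sizes of the four terms in part (1) of Theorem 3.2, the following minimal sketch evaluates each rate with every hidden constant set to 1. The function name `multipass_terms` is hypothetical, the natural logarithm in $L_{\mathrm{eff}}=L/\log L$ is an assumption (the base only shifts constants), and the output is an order-of-magnitude illustration, not a prediction of the actual risk.

```python
import math

def multipass_terms(M, N, L, gamma, a, b, rho):
    """Rates from Theorem 3.2(1) with all hidden constants set to 1 (illustrative only)."""
    if L < 2:
        raise ValueError("L must be at least 2 so that log(L) > 0")
    L_eff = L / math.log(L)                      # assumed natural log
    horizon = L_eff * gamma                      # the quantity L_eff * gamma
    k = min(M, horizon ** (1.0 / a))             # min{M, (L_eff * gamma)^(1/a)}
    return {
        "approx": M ** (1.0 - b),
        "gd_bias": k ** (1.0 - b),
        "gd_var": k / N,
        "fluc": rho * gamma * math.log(N)
                * (horizon ** (1.0 / a - 1.0) + horizon ** (1.0 / a) / N),
    }

# Example with the paper's default experiment sizes and batch size B = 16,
# using the with-replacement prefactor rho = 1/B.
terms = multipass_terms(M=64, N=512, L=512, gamma=0.05, a=2.0, b=1.5, rho=1.0 / 16)
for name, value in terms.items():
    print(f"{name:8s} {value:.4e}")
```

Only the `fluc` entry depends on `rho`, which mirrors the statement that the sampling rule enters through the fluctuation term alone.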
#### Comparison with previous scaling laws\.
Taken together, Theorems [3\.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1) and [3\.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2) show precisely how the scaling laws of Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)) deform under mini-batching. The one-pass theorem is the mini-batch analogue of Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4)): the approximation term $\Theta(M^{1-b})$ and the one-pass bias term keep the same exponents, while batching changes only the stochastic term. More precisely, the one-pass variance bound is $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$: the factor $1/B$ is the fixed-$T$ covariance gain, whereas at fixed dataset size $N$ one must also account for the shorter horizon $T=N/B$. The multi-pass theorem extends Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)): the approximation term and the GD bias–variance contribution are unchanged, and the only new batch dependence is the fluctuation prefactor $\rho\in\{1/B,\rho_{N,B}\}$. In particular, setting $B=1$ recovers the corresponding one-sample scaling laws; in this case the two multi-pass sampling rules coincide.
#### What batch size changes\.
The common message of the two theorems is that batch size acts as a noise-control parameter rather than a deterministic regularizer. In the one-pass theorem, batching lowers the centered variance; in the multi-pass theorem, it lowers only the fluctuation around the GD reference path. This interpretation is consistent with the gradient-noise-scale viewpoint of Smith and Le ([2018](https://arxiv.org/html/2605.24316#bib.bib13)) and McCandlish et al. ([2018](https://arxiv.org/html/2605.24316#bib.bib12)), and with empirical large-batch studies showing that large batches can be effective when properly tuned but that their gains eventually saturate (Goyal et al., [2017](https://arxiv.org/html/2605.24316#bib.bib8); Smith et al., [2018](https://arxiv.org/html/2605.24316#bib.bib14); Shallue et al., [2019](https://arxiv.org/html/2605.24316#bib.bib10)). Our theorems make this principle explicit in the present linear-regression setting: once the stochastic terms fall below the approximation and optimization terms, further increasing $B$ no longer changes the leading statistical scaling. In the multi-pass setting, without-replacement sampling adds the finite-population gain $\rho_{N,B}<1/B$ when $B>1$, so the extra gain over with-replacement sampling is largest when $B$ is a non-negligible fraction of $N$.
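The two fluctuation prefactors can be compared directly. The short sketch below tabulates $1/B$ against $\rho_{N,B}=(N-B)/(B(N-1))$ in exact rational arithmetic; the function names are hypothetical, and $N=512$ is chosen only because it matches the default sample size used in the experiments.

```python
from fractions import Fraction

def rho_with_replacement(B):
    """Fluctuation prefactor 1/B for with-replacement sampling."""
    return Fraction(1, B)

def rho_without_replacement(N, B):
    """Finite-population prefactor (N - B) / (B (N - 1))."""
    return Fraction(N - B, B * (N - 1))

N = 512
for B in (1, 2, 16, 128, 256, 512):
    wr = rho_with_replacement(B)
    wor = rho_without_replacement(N, B)
    ratio = wor / wr  # simplifies to (N - B) / (N - 1)
    print(f"B={B:4d}  1/B={float(wr):.5f}  rho_NB={float(wor):.5f}  ratio={float(ratio):.3f}")
```

The ratio $(N-B)/(N-1)$ equals 1 at $B=1$, stays close to 1 while $B\ll N$, and reaches 0 at $B=N$, which is why the two sampling rules only separate visibly in the large-batch regime.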
#### Implications for choosing batch size\.
The theorems also suggest a simple batch-size design rule. In the one-pass setting, $B$ has two competing effects: it reduces the variance term but also shortens the optimization horizon $T=N/B$, so overly large batches can help the noise term while worsening the bias term. Thus one should increase $B$ only until the full variance bound $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$ drops below the approximation-plus-bias contribution. In the multi-pass setting, by contrast, once $L$ and $\gamma$ are fixed, increasing $B$ leaves the GD bias–variance terms unchanged and only decreases the fluctuation. This makes larger batches statistically attractive until the fluctuation term falls below the common GD reference contribution. Without-replacement sampling is especially appealing in the large-batch regime because $\rho_{N,B}$ is strictly smaller than $1/B$ when $B>1$ and vanishes at full batch.
#### Proof sketch\.
The proofs follow the same overall blueprint as Lin et al. ([2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)), together with a batch version of the covariance-iterate arguments used by Wu et al. ([2022](https://arxiv.org/html/2605.24316#bib.bib7)). For one-pass batch SGD, Proposition [3\.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1) and the mean-centered decomposition $e_{t}=m_{t}+\delta_{t}$ give

$$\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})\bigr]=R_{M}(u^{\ast})+\mathrm{Bias}_{B}+\mathrm{Var}_{B},$$

where the bias follows the deterministic recursion $m_{t}=(I-\gamma_{t}\Sigma)m_{t-1}$. The novel ingredient in the one-pass batch analysis is an exact split of the centered variance into a covariance-fluctuation term and an additive-noise term. Specifically, we write

$$\delta_{t}=q_{t}+v_{t},\qquad\mathrm{Var}_{B}=\mathrm{Var}_{B}^{\mathrm{cov}}+\mathrm{Var}_{B}^{\mathrm{noise}},$$

where $q_{t}$ is the centered covariance-fluctuation process and $v_{t}$ is the additive-noise process, with

$$q_{t}=(I-\gamma_{t}\bar{Z}_{t})q_{t-1}+\gamma_{t}(\Sigma-\bar{Z}_{t})m_{t-1},\qquad v_{t}=(I-\gamma_{t}\bar{Z}_{t})v_{t-1}+\gamma_{t}\bar{\xi}_{t},$$

and $\bar{Z}_{t}$ the current batch covariance. We then bound $\mathrm{Var}_{B}^{\mathrm{cov}}$ and $\mathrm{Var}_{B}^{\mathrm{noise}}$ separately: the latter is handled by a batch analogue of the noise-recursion argument in Wu et al. ([2022](https://arxiv.org/html/2605.24316#bib.bib7)), while the former captures the extra randomness created by replacing $\Sigma$ with a random batch covariance. Together with the approximation and bias bounds, this yields the one-pass scaling law.
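The split $\delta_{t}=q_{t}+v_{t}$ is an algebraic identity once the error recursion is fixed, and it can be checked numerically. The sketch below assumes the standard least-squares error recursion $e_{t}=(I-\gamma\bar{Z}_{t})e_{t-1}+\gamma\bar{\xi}_{t}$ with $e_{0}=m_{0}$ and $q_{0}=v_{0}=0$, which is not restated in this section. It also uses a constant step size, a toy diagonal $\Sigma$, and Gaussian noise purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, B, T, gamma = 5, 4, 50, 0.05                      # toy sizes (assumed)
Sigma = np.diag(np.arange(1, M + 1, dtype=float) ** -2.0)  # toy power-law covariance
I = np.eye(M)

m = rng.normal(size=M)   # deterministic initial error m_0
e = m.copy()             # e_0 = m_0
q = np.zeros(M)          # covariance-fluctuation process, q_0 = 0
v = np.zeros(M)          # additive-noise process, v_0 = 0

for _ in range(T):
    Z = rng.multivariate_normal(np.zeros(M), Sigma, size=B)  # one batch of features
    Z_bar = Z.T @ Z / B                                      # batch covariance
    xi_bar = Z.T @ rng.normal(size=B) / B                    # batch-averaged noise
    A = I - gamma * Z_bar
    q = A @ q + gamma * (Sigma - Z_bar) @ m   # uses m_{t-1}
    v = A @ v + gamma * xi_bar
    e = A @ e + gamma * xi_bar                # assumed error recursion
    m = (I - gamma * Sigma) @ m               # deterministic bias recursion

max_gap = np.max(np.abs((e - m) - (q + v)))
print(f"max |delta_T - (q_T + v_T)| = {max_gap:.2e}")
```

The gap is at the level of floating-point round-off, reflecting that the decomposition is exact and not an approximation.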
For multi-pass methods, we keep the same GD reference path as in Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)) and modify only the fluctuation setup used to compare the stochastic batch iterate with normal GD. Writing

$$\Delta_{t}^{\rho}:=u_{t}^{\rho}-\theta_{t},$$

we analyze the perturbation with the same general proof strategy as in Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)), except that the one-sample random update at each step is replaced by a batch-sampled update:

$$\Delta_{t}^{\mathrm{wr}}=(I-\gamma_{t}\widehat{\Sigma}_{t}^{(B)})\Delta_{t-1}^{\mathrm{wr}}+\gamma_{t}\xi_{t}^{(B)},\qquad\Delta_{t}^{\mathrm{wor}}=(I-\gamma_{t}\widehat{\Sigma}_{I_{t}}^{(B)})\Delta_{t-1}^{\mathrm{wor}}+\gamma_{t}\xi_{t,\mathrm{wor}}^{(B)}.$$

Thus the perturbative ideas of Lin et al. ([2025](https://arxiv.org/html/2605.24316#bib.bib1)) carry over, and the batch setup changes only the covariance calculation of the driving noise. In the with-replacement case, the batch noise is an average of single-sample noises,

$$\xi_{t}^{(B)}=\frac{1}{B}\sum_{r=1}^{B}\zeta_{t}(i_{t,r}),$$

which produces the factor $1/B$; in the without-replacement case, the same argument is combined with the finite-population covariance identity, which replaces $1/B$ by $\rho_{N,B}$. The GD reference contributes the common deterministic terms, and substituting the appendix source-condition bounds into Proposition [3\.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1) yields the two theorems.
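The finite-population covariance identity behind $\rho_{N,B}$ can be verified exactly in a scalar toy case. The sketch below enumerates every batch drawn from a small fixed population and checks that the variance of the batch mean is $\sigma^{2}/B$ with replacement and $\rho_{N,B}\,\sigma^{2}$ without replacement, where $\sigma^{2}$ is the population variance with divisor $N$. The population values and the helper name are arbitrary choices for illustration; the paper applies the vector-valued analogue to the gradient noise.

```python
from fractions import Fraction
from itertools import combinations, product

def var_of_batch_mean(population, B, replace):
    """Exact variance of the mean of a uniformly drawn batch of size B."""
    if replace:
        batches = product(population, repeat=B)   # ordered draws, all equally likely
    else:
        batches = combinations(population, B)     # subsets, all equally likely
    means = [Fraction(sum(batch), B) for batch in batches]
    mu = sum(means) / len(means)
    return sum((x - mu) ** 2 for x in means) / len(means)

population = [Fraction(x) for x in (3, -1, 4, 1, -5, 9)]
N = len(population)
pop_mean = sum(population) / N
sigma2 = sum((x - pop_mean) ** 2 for x in population) / N  # divisor N

for B in range(1, 5):
    wr = var_of_batch_mean(population, B, replace=True)
    wor = var_of_batch_mean(population, B, replace=False)
    assert wr == sigma2 / B
    assert wor == Fraction(N - B, B * (N - 1)) * sigma2
    print(B, float(wr), float(wor))
```

Because the arithmetic uses `Fraction`, the assertions check the identities exactly instead of up to Monte Carlo error.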
## 4 Experiments
We evaluate the batch-dependent predictions of our theory in a synthetic sketched linear-regression model with a diagonal covariance. We fix an ambient dimension $d$, draw a Gaussian sketch $S\in\mathbb{R}^{M\times d}$ with i.i.d. $\mathcal{N}(0,1/M)$ entries, and generate data from $x\sim\mathcal{N}(0,\operatorname{diag}(\lambda_{1},\dots,\lambda_{d}))$ with $\lambda_{i}=i^{-a}$ and $y=\langle x,w^{\ast}\rangle+\varepsilon$, with source-condition prior $\mathbb{E}[\lambda_{i}(w_{i}^{\ast})^{2}]\asymp i^{-b}$. In the implementation, conditioned on $(S,w^{\ast})$, we sample the sketched pair $(Sx,y)$ directly from its induced joint Gaussian law. Unless otherwise stated, we use $a=2$, $b=1.5$, $d=10^{4}$, $M=64$, $N=L=512$, $\sigma=1$, $\gamma=0.05$, and $100$ repetitions; full details are in Appendix [L](https://arxiv.org/html/2605.24316#A12).
Because our theorems show that the explicit mini-batch covariance effect appears in the stochastic terms, we focus the experiments on the three claims that depend most sharply on $B$. We fix $(S,w^{\ast})$ across repetitions to isolate the sampling and optimization randomness, so the reported error bars quantify variability conditional on a representative sketched problem instance.
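The synthetic model above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: it samples $x$ in the ambient space and then applies $S$, whereas the paper samples $(Sx,y)$ directly from the induced joint Gaussian law; the Gaussian prior $w_{i}^{\ast}\sim\mathcal{N}(0,i^{a-b})$ is one assumed choice that satisfies $\mathbb{E}[\lambda_{i}(w_{i}^{\ast})^{2}]=i^{-b}$; and the function names and the small $d$ in the example are hypothetical.

```python
import numpy as np

def make_problem(d, M, a, b, rng):
    """Draw the spectrum, the Gaussian sketch S, and a target w_star."""
    idx = np.arange(1, d + 1, dtype=float)
    lam = idx ** (-a)                                     # lambda_i = i^{-a}
    S = rng.normal(scale=np.sqrt(1.0 / M), size=(M, d))   # i.i.d. N(0, 1/M) entries
    # Assumed prior: w_i ~ N(0, i^{a-b}), so that E[lambda_i * w_i^2] = i^{-b}.
    w_star = rng.normal(size=d) * np.sqrt(idx ** (a - b))
    return lam, S, w_star

def sample_sketched(N, lam, S, w_star, sigma, rng):
    """Return N sketched features (rows are S x_i) and their labels."""
    x = rng.normal(size=(N, lam.size)) * np.sqrt(lam)     # x ~ N(0, diag(lam))
    y = x @ w_star + sigma * rng.normal(size=N)           # y = <x, w*> + noise
    return x @ S.T, y

rng = np.random.default_rng(0)
lam, S, w_star = make_problem(d=1000, M=64, a=2.0, b=1.5, rng=rng)
Z, y = sample_sketched(N=512, lam=lam, S=S, w_star=w_star, sigma=1.0, rng=rng)
print(Z.shape, y.shape)
```

Keeping `S` and `w_star` fixed while redrawing `Z, y` across repetitions reproduces the conditioning on a single problem instance described above. Direct ambient sampling costs $O(Nd)$ per draw, which is presumably one reason to sample from the induced law when $d=10^{4}$.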
#### Experiment 1: one\-pass variance sweep\.
In the one-pass theorem, the explicit mini-batch covariance reduction appears in the centered variance term. Because $T=N/B$, changing $B$ also changes the effective horizon $T_{\mathrm{eff}}\gamma$. Accordingly, the first experiment directly measures the centered one-pass variance, and panel (a) compares it with the predicted $1/(BT_{\mathrm{eff}})$-type upper-bound scaling.
#### Experiment 2: multi\-pass fluctuation sweep\.
In the multi-pass theorem, the deterministic GD contribution is common to with-replacement and without-replacement sampling, so the only sampling-rule-dependent term is the fluctuation. The second experiment is therefore designed to isolate that term and test whether its batch dependence matches the predicted prefactors $1/B$ and $\rho_{N,B}$.
#### Experiment 3: normalized fluctuation collapse\.
If the fluctuation scales as $1/B$ or $\rho_{N,B}$, then dividing by the corresponding batch prefactor should remove the leading dependence on $B$. The third experiment tests this collapse by plotting the normalized fluctuation curves across batch sizes; for without-replacement sampling, the point $B=N$ is omitted because $\rho_{N,N}=0$.

Figure 1: Empirical validation of the batch-dependent stochastic terms, with panels (a) one-pass variance sweep, (b) multi-pass fluctuation sweep, and (c) normalized fluctuation collapse. Panel (a) plots the empirical one-pass centered variance against $B$, together with a rescaled reference curve of order $\sum_{j=1}^{M}\min\{1,T_{\mathrm{eff}}\gamma\mu_{j}(\Sigma)\}/(B\,T_{\mathrm{eff}})$; the reference curve is multiplied by a single constant chosen to match the $B=1$ point. Panels (b) and (c) compare the multi-pass fluctuation for with-replacement and without-replacement sampling against the predicted batch prefactors $1/B$ and $\rho_{N,B}=(N-B)/(B(N-1))$. Error bars denote one standard deviation over repetitions.

The results align with our theoretical predictions to a large extent. Panel (a) shows that the one-pass centered variance decreases steadily with the batch size, in line with the predicted $1/(BT_{\mathrm{eff}})$ decay once the effective-dimension factor is taken into account. The dashed reference is an upper-bound curve rather than an exact asymptotic equality, so the relevant comparison is the shape and order of decay, not pointwise equality. In particular, the empirical variance staying below the rescaled reference is exactly what one should expect from the theorem. Panel (b) directly tests the multi-pass fluctuation term. The with-replacement curve follows the predicted $1/B$ decay closely, while the without-replacement curve decays faster in the large-batch regime. This is precisely the behavior predicted by Theorem [3\.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2): the fluctuation prefactor is $1/B$ for with-replacement sampling and $\rho_{N,B}<1/B$ (when $B>1$) for without-replacement sampling. The without-replacement point $B=N$ is omitted from the plot because $\rho_{N,N}=0$.
Panel (c) removes these batch prefactors and plots $\mathrm{Fluc}^{\mathrm{wr}}_{B}/(1/B)$ and $\mathrm{Fluc}^{\mathrm{wor}}_{B}/\rho_{N,B}$. The two normalized curves are substantially flatter than the unnormalized curves, indicating that the leading dependence on $B$ has been largely removed. Overall, the three experiments support the paper's main message: batch size does not change the leading deterministic exponents, but it quantitatively controls the stochastic terms, and without-replacement sampling has the smaller fluctuation scale $\rho_{N,B}$.
## 5 Conclusion
We studied batch scaling laws for sketched linear regression trained by SGD under a power-law covariance spectrum and a source condition. Across one-pass batch SGD and multi-pass batch SGD with and without replacement, we derived a unified risk decomposition that separates approximation, bias, variance, and fluctuation, and used it to identify exactly how batch size enters the excess risk. Our results show that batching preserves the leading approximation and optimization-bias exponents while changing only the stochastic terms: in the one-pass setting the variance term is $O(\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}/(BT_{\mathrm{eff}}))$, so the familiar $1/B$ gain applies at fixed update count but is partly offset at fixed dataset size by the shorter horizon $T=N/B$; in the multi-pass setting the only difference between with-replacement and without-replacement sampling is the fluctuation scale. The without-replacement method is strictly less noisy when $B>1$ and recovers deterministic gradient descent when $B=N$. Simulations support these theoretical predictions.
Several directions remain open\. One option is to extend the analysis beyond sketched linear regression to richer nonlinear or feature\-learning models, while relaxing the current assumptions\. It would also be interesting to study broader optimization settings, such as adaptive step sizes, momentum, or more general data\-reuse schemes, and to develop joint scaling laws that optimize batch size together with model dimension, sample size, and compute in more realistic training regimes\.
## References
- Revisiting neural scaling laws in language and vision. In Advances in Neural Information Processing Systems, Vol. 35, pp. 22300–22312.
- A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan (2024). Scaling and renormalization in high-dimensional regression. arXiv preprint arXiv:2405.00592.
- Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma (2024). Explaining neural scaling laws. Proceedings of the National Academy of Sciences 121(27), pp. e2311878121.
- T. Besiroglu, E. Erdil, M. Barnett, and J. You (2024). Chinchilla scaling: a replication attempt. arXiv preprint arXiv:2404.10102.
- B. Bordelon, A. Atanasov, and C. Pehlevan (2024). A dynamical model of neural scaling laws. In Proceedings of the 41st International Conference on Machine Learning, Vol. 235, pp. 4345–4382.
- B. Bordelon, A. Atanasov, and C. Pehlevan (2025). How feature learning can improve neural scaling laws. In International Conference on Learning Representations.
- E. Dohmatob, Y. Feng, P. Yang, F. Charton, and J. Kempe (2024). A tale of tails: model collapse as a change of scaling laws. In Proceedings of the 41st International Conference on Machine Learning, Vol. 235, pp. 11165–11197.
- R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli (2019). The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares. In Advances in Neural Information Processing Systems, Vol. 32.
- N. Golmant, N. Vemuri, Z. Yao, V. Feinberg, A. Gholami, K. Rothauge, M. W. Mahoney, and J. Gonzalez (2019). On the computational inefficiency of large batch sizes for stochastic gradient descent. In International Conference on Learning Representations.
- P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017). Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
- T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, C. Hallacy, B. Mann, A. Radford, A. Ramesh, D. M. Ziegler, D. Amodei, and S. McCandlish (2020). Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701.
- J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, Md. M. A. Patwary, Y. Yang, and Y. Zhou (2017). Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.
- J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre (2022). Training compute-optimal large language models. In Advances in Neural Information Processing Systems, Vol. 35.
- M. Hutter (2021). Learning curve theory. arXiv preprint arXiv:2102.04074.
- P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford (2017). Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. Journal of Machine Learning Research 18(223), pp. 1–42.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- J. Lin and L. Rosasco (2017). Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research 18(97), pp. 1–47.
- L. Lin, J. Wu, and P. L. Bartlett (2025). Improved scaling laws in linear regression via data reuse. In Advances in Neural Information Processing Systems.
- L. Lin, J. Wu, S. M. Kakade, P. L. Bartlett, and J. D. Lee (2024). Scaling laws in linear regression: compute, parameters, and data. In Advances in Neural Information Processing Systems, Vol. 37.
- A. Maloney, D. A. Roberts, and J. Sully (2022). A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859.
- S. McCandlish, J. Kaplan, D. Amodei, and OpenAI Dota Team (2018). An empirical model of large-batch training. arXiv preprint arXiv:1812.06162.
- E. J. Michaud, Z. Liu, U. Girit, and M. Tegmark (2023). The quantization model of neural scaling. In Advances in Neural Information Processing Systems, Vol. 36.
- N. Mücke, G. Neu, and L. Rosasco (2019). Beating SGD saturation with tail-averaging and minibatching. In Advances in Neural Information Processing Systems, Vol. 32.
- N. Muennighoff, A. Rush, B. Barak, T. Le Scao, N. Tazi, A. Piktus, S. Pyysalo, T. Wolf, and C. A. Raffel (2023). Scaling data-constrained language models. In Advances in Neural Information Processing Systems, Vol. 36.
- E. Paquette, C. Paquette, L. Xiao, and J. Pennington (2024). 4+3 phases of compute-optimal neural scaling laws. In Advances in Neural Information Processing Systems, Vol. 37.
- L. Pillaud-Vivien, A. Rudi, and F. Bach (2018). Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, Vol. 31.
- Y. Ren, E. Nichani, D. Wu, and J. D. Lee (2025). Emergence and scaling laws in SGD learning of shallow neural networks. In Advances in Neural Information Processing Systems.
- J. S. Rosenfeld, A. Rosenfeld, Y. Belinkov, and N. Shavit (2020). A constructive prediction of the generalization error across scales. In International Conference on Learning Representations.
- C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl (2019). Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research 20(112), pp. 1–49.
- U. Sharma and J. Kaplan (2020). A neural scaling law from the dimension of the data manifold. arXiv preprint arXiv:2004.10802.
- X. Shuai, Y. Wang, Y. Wu, X. Jiang, and X. Ren (2024). Scaling law for language models training considering batch size. arXiv preprint arXiv:2412.01505.
- S. L. Smith, P. Kindermans, C. Ying, and Q. V. Le (2018). Don't decay the learning rate, increase the batch size. In International Conference on Learning Representations.
- S. L. Smith and Q. V. Le (2018). A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations.
- J. Wu, D. Zou, V. Braverman, Q. Gu, and S. M. Kakade (2022). The power and limitation of pretraining-finetuning for linear regression under covariate shift. In Advances in Neural Information Processing Systems, Vol. 35.
- Y. You, I. Gitman, and B. Ginsburg (2017). Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
- X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer (2022). Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104–12113.
- D. Zou, Y. Cao, D. Zhou, and Q. Gu (2021). The benefits of implicit regularization from stochastic gradient descent in least squares problems. In Advances in Neural Information Processing Systems, Vol. 34, pp. 29773–29785.
From One\-Pass SGD to Data Reuse: Mini\-Batch Scaling Laws in Sketched Linear Regression Supplementary Material
Table of Contents
- [A](https://arxiv.org/html/2605.24316#A1) Appendix Preliminaries
  - [A.1](https://arxiv.org/html/2605.24316#A1.SS1) Block notation
  - [A.2](https://arxiv.org/html/2605.24316#A1.SS2) Notations and setup formulas for normal GD
  - [A.3](https://arxiv.org/html/2605.24316#A1.SS3) Notations and setup formulas for one-pass batch SGD
  - [A.4](https://arxiv.org/html/2605.24316#A1.SS4) Notations and setup formulas for multi-pass batch SGD with replacement
  - [A.5](https://arxiv.org/html/2605.24316#A1.SS5) Notations and setup formulas for multi-pass batch SGD without replacement
- [B](https://arxiv.org/html/2605.24316#A2) Assembling the proofs of the main scaling theorems
  - [B.1](https://arxiv.org/html/2605.24316#A2.SS1) Assembling the proof of the one-pass theorem
  - [B.2](https://arxiv.org/html/2605.24316#A2.SS2) Assembling the proof of the multi-pass theorem
- [C](https://arxiv.org/html/2605.24316#A3) Approximation Error
  - [C.1](https://arxiv.org/html/2605.24316#A3.SS1) Upper bound
  - [C.2](https://arxiv.org/html/2605.24316#A3.SS2) Lower bound
  - [C.3](https://arxiv.org/html/2605.24316#A3.SS3) Bounds under our assumptions
- [D](https://arxiv.org/html/2605.24316#A4) Bias Error under Normal GD
  - [D.1](https://arxiv.org/html/2605.24316#A4.SS1) Upper and lower bounds
  - [D.2](https://arxiv.org/html/2605.24316#A4.SS2) Example under the source condition
- [E](https://arxiv.org/html/2605.24316#A5) Variance Error under Normal GD
  - [E.1](https://arxiv.org/html/2605.24316#A5.SS1) Upper and lower bounds
  - [E.2](https://arxiv.org/html/2605.24316#A5.SS2) Example under the source condition
- [F](https://arxiv.org/html/2605.24316#A6) One-pass Batch SGD: Excess Error Decomposition
- [G](https://arxiv.org/html/2605.24316#A7) Bias Error for One-pass Batch SGD
  - [G.1](https://arxiv.org/html/2605.24316#A7.SS1) Upper and lower bounds for the bias term
  - [G.2](https://arxiv.org/html/2605.24316#A7.SS2) Bounds under the source condition
- [H](https://arxiv.org/html/2605.24316#A8) Variance Error for One-pass Batch SGD
  - [H.1](https://arxiv.org/html/2605.24316#A8.SS1) Upper and lower bounds for the exact variance components
  - [H.2](https://arxiv.org/html/2605.24316#A8.SS2) The additive-noise component
  - [H.3](https://arxiv.org/html/2605.24316#A8.SS3) Variance bound for the additive-noise component
  - [H.4](https://arxiv.org/html/2605.24316#A8.SS4) The covariance-fluctuation component
  - [H.5](https://arxiv.org/html/2605.24316#A8.SS5) Bounds under the source condition
- [I](https://arxiv.org/html/2605.24316#A9) Fluctuation Error under Multi-pass Batch SGD with Replacement
  - [I.1](https://arxiv.org/html/2605.24316#A9.SS1) Upper bound result
  - [I.2](https://arxiv.org/html/2605.24316#A9.SS2) Fluctuation error under the source condition
  - [I.3](https://arxiv.org/html/2605.24316#A9.SS3) Lemmas to prove the upper bound
- [J](https://arxiv.org/html/2605.24316#A10) Fluctuation Error under Multi-pass Batch SGD without Replacement
- [K](https://arxiv.org/html/2605.24316#A11) Collected Auxiliary Lemmas
  - [K.1](https://arxiv.org/html/2605.24316#A11.Thmlemma1) General concentration lemmas
  - [K.2](https://arxiv.org/html/2605.24316#A11.Thmlemma7) Power-law auxiliary lemmas
- [L](https://arxiv.org/html/2605.24316#A12) Experimental setup
- [M](https://arxiv.org/html/2605.24316#A13) Limitations and Broader Effects
## Appendix A Appendix Preliminaries
We collect the common notation used throughout the appendix\. Later appendix sections refer back to this section whenever possible, instead of restating the full stochastic\-update setup each time\.
The framework and many of the core ideas in Appendices [A](https://arxiv.org/html/2605.24316#A1)–[G](https://arxiv.org/html/2605.24316#A7) and [K](https://arxiv.org/html/2605.24316#A11) are borrowed from Lin et al. \[[2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)\]. Appendices [H](https://arxiv.org/html/2605.24316#A8)–[J](https://arxiv.org/html/2605.24316#A10) contain the main technical contributions of this paper, although some of the proofs there are inspired by Lin et al. \[[2024](https://arxiv.org/html/2605.24316#bib.bib4), [2025](https://arxiv.org/html/2605.24316#bib.bib1)\] and Wu et al. \[[2022](https://arxiv.org/html/2605.24316#bib.bib7)\].
Throughout the appendix, whenever a procedure is run for $L_{\mathrm{run}}$ updates, we use the same blockwise geometric learning-rate schedule $\gamma_{t}=\gamma/2^{\ell}$ on the $\ell$-th block of $L_{\mathrm{run,eff}}:=L_{\mathrm{run}}/\log L_{\mathrm{run}}$ consecutive updates (up to the obvious endpoint rounding). Here $L_{\mathrm{run}}=T=N/B$ for one-pass batch SGD and $L_{\mathrm{run}}=L$ for normal GD and the two multi-pass methods.
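The schedule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `blockwise_lr`, the zero-based block index, and the ceiling rounding of the block length are assumptions, since the text only says "up to the obvious endpoint rounding".

```python
import math

def blockwise_lr(gamma, L_run):
    """Step sizes gamma_t = gamma / 2**ell on the ell-th block (ell = 0, 1, ...).

    Requires L_run >= 2 so that log(L_run) > 0.
    """
    # Block length L_run / log(L_run); rounding up is an assumption.
    block_len = max(1, math.ceil(L_run / math.log(L_run)))
    return [gamma / 2 ** (t // block_len) for t in range(L_run)]

rates = blockwise_lr(gamma=0.5, L_run=100)
```

With this rounding, the step size is halved roughly $\log L_{\mathrm{run}}$ times over the run.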
### A.1 Block notation
For integers $0\leq k_{\ast}\leq k$ (allowing $k=\infty$), define
$$H_{k_{\ast}:k}:=\operatorname{diag}(\lambda_{k_{\ast}+1},\dots,\lambda_{k}),\qquad w_{k_{\ast}:k}:=(w_{k_{\ast}+1},\dots,w_{k})^{\top}.$$
Similarly, let $S_{k_{\ast}:k}$ denote the submatrix of $S$ consisting of columns $k_{\ast}+1,\dots,k$.
### A.2 Notations and setup formulas for normal GD
This subsection records the notation for the normal GD procedure, equation [1](https://arxiv.org/html/2605.24316#S2.E1). Define
$$\Sigma:=SHS^{\top},\qquad\widehat{\Sigma}:=\frac{SX^{\top}XS^{\top}}{N},\qquad u^{\ast}:=\Sigma^{-1}SHw^{\ast},\qquad\widehat{b}:=\frac{1}{N}SX^{\top}y.$$
Let
$$\widetilde{\varepsilon}_{i}:=y_{i}-x_{i}^{\top}S^{\top}u^{\ast},\qquad\widetilde{\varepsilon}:=(\widetilde{\varepsilon}_{1},\dots,\widetilde{\varepsilon}_{N})^{\top},\qquad\widehat{c}:=\frac{1}{N}SX^{\top}\widetilde{\varepsilon}.$$
The normal GD iterate satisfies
$$\theta_{t}=\theta_{t-1}-\gamma_{t}\widehat{\Sigma}\theta_{t-1}+\gamma_{t}\widehat{b},\qquad\theta_{0}=0.$$
### A.3 Notations and setup formulas for one-pass batch SGD
This subsection records the notation for the one-pass batch SGD procedure, equation [2](https://arxiv.org/html/2605.24316#S2.E2). Assume throughout that $B\mid N$, and define
$$T:=\frac{N}{B}\geq 2,\qquad T_{\mathrm{eff}}:=\frac{T}{\log T},\qquad\Sigma:=SHS^{\top},\qquad u^{\ast}:=\Sigma^{-1}SHw^{\ast}.$$
Partition $[N]$ into disjoint batches $I_{1},\dots,I_{T}$ with $|I_{t}|=B$, and write each block as
$$I_{t}=\{i_{t,1},\dots,i_{t,B}\}.$$
Define
$$\widehat{\Sigma}_{t}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}x_{i}^{\top}S^{\top},\qquad\widehat{b}_{t}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}y_{i},$$
and
$$\widehat{\xi}_{t}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}\bigl(y_{i}-x_{i}^{\top}S^{\top}u^{\ast}\bigr).$$
The one-pass batch SGD iterate $(u_{t}^{\mathrm{op}})$ satisfies
$$u_{t}^{\mathrm{op}}=u_{t-1}^{\mathrm{op}}-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}u_{t-1}^{\mathrm{op}}+\gamma_{t}\widehat{b}_{t}^{(B)},\qquad u_{0}^{\mathrm{op}}=0.$$
With the centered error $e_{t}:=u_{t}^{\mathrm{op}}-u^{\ast}$, this becomes
$$e_{t}=\bigl(I-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}\bigr)e_{t-1}+\gamma_{t}\widehat{\xi}_{t}^{(B)}.$$
For later use, we also write
$$z_{t,b}:=Sx_{i_{t,b}},\qquad\bar{Z}_{t}:=\frac{1}{B}\sum_{b=1}^{B}z_{t,b}z_{t,b}^{\top},\qquad\bar{\xi}_{t}:=\frac{1}{B}\sum_{b=1}^{B}z_{t,b}\bigl(y_{i_{t,b}}-z_{t,b}^{\top}u^{\ast}\bigr),$$
so that equivalently
$$e_{t}=(I-\gamma_{t}\bar{Z}_{t})e_{t-1}+\gamma_{t}\bar{\xi}_{t}.$$
Since the blocks are disjoint subsets of an i.i.d. sample, the pairs $(\bar{Z}_{t},\bar{\xi}_{t})$ are independent across $t$, and $e_{t-1}$ depends only on the batches before step $t$. Moreover $\mathbb{E}[\bar{Z}_{t}]=\Sigma$, and $\mathbb{E}[\bar{\xi}_{t}]=SHw^{\ast}-\Sigma u^{\ast}=0$ by the definition of $u^{\ast}$. The mean error $m_{t}:=\mathbb{E}[e_{t}]$ therefore satisfies
$$m_{t}=(I-\gamma_{t}\Sigma)m_{t-1},\qquad m_{0}=-u^{\ast}.$$
Throughout the one-pass appendix sections, whenever Assumption [3](https://arxiv.org/html/2605.24316#Thmassumption3) is invoked, $L_{\mathrm{eff}}$ is replaced by $T_{\mathrm{eff}}$. The sample-level maximum-norm condition remains over the original $N$ samples, since the one-pass procedure still uses all $N$ observations, grouped into $T=N/B$ mini-batches.
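The equivalence between the iterate recursion and the centered-error recursion can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's experimental code: the spectrum is truncated to a finite dimension `d`, the data are synthetic Gaussian with a power-law covariance, and the step size is held constant instead of following the blockwise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, d, M = 240, 8, 40, 6            # toy sizes (assumed); B divides N
T = N // B
lam = np.arange(1, d + 1, dtype=float) ** -2.0   # assumed spectrum, a = 2
X = rng.normal(size=(N, d)) * np.sqrt(lam)       # rows x_i with covariance diag(lam)
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=N)
S = rng.normal(size=(M, d)) / np.sqrt(M)         # Gaussian sketch

Sigma = S @ np.diag(lam) @ S.T
u_star = np.linalg.solve(Sigma, S @ (lam * w_star))
Z = X @ S.T                                      # row i is z_i = S x_i
gamma = 0.1                                      # constant step, for simplicity

u = np.zeros(M)            # iterate u_t^op, started at 0
e = -u_star.copy()         # centered error e_0 = -u*
m = -u_star.copy()         # mean error m_0 = -u*
for t in range(T):
    idx = np.arange(t * B, (t + 1) * B)          # disjoint batch I_t
    Zb, yb = Z[idx], y[idx]
    Z_bar = Zb.T @ Zb / B
    b_bar = Zb.T @ yb / B
    xi_bar = Zb.T @ (yb - Zb @ u_star) / B
    u = u - gamma * Z_bar @ u + gamma * b_bar
    e = (np.eye(M) - gamma * Z_bar) @ e + gamma * xi_bar
    m = (np.eye(M) - gamma * Sigma) @ m          # deterministic mean recursion
```

After the loop, `e` agrees with `u - u_star` up to floating-point error, while `m` tracks only the deterministic part of the dynamics.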
### A.4 Notations and setup formulas for multi-pass batch SGD with replacement
This subsection records the notation for the multi-pass batch SGD procedure with replacement, namely equation [3](https://arxiv.org/html/2605.24316#S2.E3). Define
$$\Sigma:=SHS^{\top},\qquad\widehat{\Sigma}:=\frac{SX^{\top}XS^{\top}}{N},\qquad u^{\ast}:=\Sigma^{-1}SHw^{\ast},\qquad\widehat{b}:=\frac{1}{N}SX^{\top}y.$$
Let
$$\widetilde{\varepsilon}_{i}:=y_{i}-x_{i}^{\top}S^{\top}u^{\ast},\qquad\widetilde{\varepsilon}:=(\widetilde{\varepsilon}_{1},\dots,\widetilde{\varepsilon}_{N})^{\top},\qquad\widehat{c}:=\frac{1}{N}SX^{\top}\widetilde{\varepsilon}.$$
At each step $t\in[L]$, sample
$$i_{t,1},\dots,i_{t,B}\stackrel{\mathrm{iid}}{\sim}\mathrm{unif}([N]),$$
and define
$$\widehat{\Sigma}_{t}^{(B)}:=\frac{1}{B}\sum_{r=1}^{B}Sx_{i_{t,r}}x_{i_{t,r}}^{\top}S^{\top},\qquad\widehat{b}_{t}^{(B)}:=\frac{1}{B}\sum_{r=1}^{B}Sx_{i_{t,r}}y_{i_{t,r}},$$
$$\widehat{c}_{t}^{(B)}:=\frac{1}{B}\sum_{r=1}^{B}Sx_{i_{t,r}}\widetilde{\varepsilon}_{i_{t,r}}.$$
The multi-pass batch SGD iterate with replacement and the normal GD iterate are
$$u_{t}^{\mathrm{wr}}=u_{t-1}^{\mathrm{wr}}-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}u_{t-1}^{\mathrm{wr}}+\gamma_{t}\widehat{b}_{t}^{(B)},\qquad u_{0}^{\mathrm{wr}}=0,$$
and
$$\theta_{t}=\theta_{t-1}-\gamma_{t}\widehat{\Sigma}\theta_{t-1}+\gamma_{t}\widehat{b},\qquad\theta_{0}=0.$$
Define the fluctuation process
$$\Delta_{t}:=u_{t}^{\mathrm{wr}}-\theta_{t},\qquad\Delta_{0}=0.$$
Then
$$\Delta_{t}=\bigl(I-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}\bigr)\Delta_{t-1}+\gamma_{t}\xi_{t}^{(B)},$$
where
$$\xi_{t}^{(B)}:=-\bigl(\widehat{\Sigma}_{t}^{(B)}-\widehat{\Sigma}\bigr)(\theta_{t-1}-u^{\ast})+\bigl(\widehat{c}_{t}^{(B)}-\widehat{c}\bigr).$$
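The fluctuation recursion is an exact algebraic consequence of the two iterate recursions, which a short simulation can confirm. The following is a sketch under assumptions: the rows of `Z` are synthetic stand-ins for the sketched features $Sx_{i}$, `u_star` is an arbitrary fixed vector (the identity holds for any reference point, because $\widehat{b}_{t}^{(B)}=\widehat{\Sigma}_{t}^{(B)}u^{\ast}+\widehat{c}_{t}^{(B)}$ and $\widehat{b}=\widehat{\Sigma}u^{\ast}+\widehat{c}$), and the step size is constant.

```python
import numpy as np

rng = np.random.default_rng(1)
N, B, M, L = 60, 5, 4, 30            # toy sizes (assumed)
Z = rng.normal(size=(N, M))          # row i stands in for z_i = S x_i
y = rng.normal(size=N)
u_star = rng.normal(size=M)          # fixed reference vector
gamma = 0.05                         # constant step, for simplicity

Sigma_hat = Z.T @ Z / N
b_hat = Z.T @ y / N
eps = y - Z @ u_star                 # residuals eps_i
c_hat = Z.T @ eps / N

u = np.zeros(M)
theta = np.zeros(M)
delta = np.zeros(M)
for t in range(L):
    idx = rng.integers(0, N, size=B)             # i.i.d. uniform indices
    Zb = Z[idx]
    Sig_t = Zb.T @ Zb / B
    b_t = Zb.T @ y[idx] / B
    c_t = Zb.T @ eps[idx] / B
    # Driving noise, built from theta_{t-1} before theta is updated.
    xi_t = -(Sig_t - Sigma_hat) @ (theta - u_star) + (c_t - c_hat)
    delta = (np.eye(M) - gamma * Sig_t) @ delta + gamma * xi_t
    u = u - gamma * Sig_t @ u + gamma * b_t
    theta = theta - gamma * Sigma_hat @ theta + gamma * b_hat
```

At the end of the run, `delta` matches `u - theta` up to rounding, so the fluctuation process can be simulated either directly or as a difference of iterates.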
### A.5 Notations and setup formulas for multi-pass batch SGD without replacement
This subsection records the notation for the multi-pass batch SGD procedure without replacement, namely equation [4](https://arxiv.org/html/2605.24316#S2.E4). Here we retain the common dataset-level quantities $\Sigma$, $\widehat{\Sigma}$, $u^{\ast}$, $\widehat{b}$, $\widetilde{\varepsilon}$, and $\widehat{c}$ from Section [A.4](https://arxiv.org/html/2605.24316#A1.SS4). At each step $t\in[L]$, we sample a subset
$$I_{t}\subset[N],\qquad|I_{t}|=B,$$
uniformly without replacement from $[N]$, independently across iterations. Define
$$\widehat{\Sigma}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}x_{i}^{\top}S^{\top},\qquad\widehat{b}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}y_{i},$$
and
$$\widehat{c}_{I_{t}}^{(B)}:=\frac{1}{B}\sum_{i\in I_{t}}Sx_{i}\widetilde{\varepsilon}_{i}.$$
The multi-pass batch SGD iterate without replacement is
$$u_{t}^{\mathrm{wor}}=u_{t-1}^{\mathrm{wor}}-\gamma_{t}\widehat{\Sigma}_{I_{t}}^{(B)}u_{t-1}^{\mathrm{wor}}+\gamma_{t}\widehat{b}_{I_{t}}^{(B)},\qquad u_{0}^{\mathrm{wor}}=0.$$
Define the multi-pass fluctuation process
$$\Delta_{t}^{\mathrm{wor}}:=u_{t}^{\mathrm{wor}}-\theta_{t},\qquad\Delta_{0}^{\mathrm{wor}}=0.$$
Then
$$\Delta_{t}^{\mathrm{wor}}=\bigl(I-\gamma_{t}\widehat{\Sigma}_{I_{t}}^{(B)}\bigr)\Delta_{t-1}^{\mathrm{wor}}+\gamma_{t}\xi_{t,\mathrm{wor}}^{(B)},$$
where
$$\xi_{t,\mathrm{wor}}^{(B)}:=-\bigl(\widehat{\Sigma}_{I_{t}}^{(B)}-\widehat{\Sigma}\bigr)(\theta_{t-1}-u^{\ast})+\bigl(\widehat{c}_{I_{t}}^{(B)}-\widehat{c}\bigr).$$
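The only difference between the two sampling rules is the covariance prefactor of the batch average: $1/B$ with replacement versus $\rho_{N,B}=(N-B)/(B(N-1))$ without replacement, as stated in the abstract. For a scalar population this is the classical finite-population correction, and it can be verified exactly by enumeration. The example below is a toy check with made-up population values; the helper name `rho_wor` is hypothetical.

```python
import itertools
import numpy as np

def rho_wor(N, B):
    """Without-replacement prefactor rho_{N,B} = (N - B) / (B (N - 1))."""
    return (N - B) / (B * (N - 1))

a = np.array([0.3, -1.2, 2.0, 0.7, -0.5, 1.1])   # toy population (assumed)
N, B = len(a), 3
pop_var = np.mean((a - a.mean()) ** 2)           # population variance, 1/N normalization

# Enumerate every equally likely batch under each sampling rule.
wr_means = [np.mean(a[list(ix)]) for ix in itertools.product(range(N), repeat=B)]
wor_means = [np.mean(a[list(ix)]) for ix in itertools.combinations(range(N), B)]

var_wr = np.var(wr_means)     # exact variance of the batch mean, with replacement
var_wor = np.var(wor_means)   # exact variance of the batch mean, without replacement
```

Since $\rho_{N,B}<1/B$ whenever $B>1$, the without-replacement batch average is strictly less noisy, and $\rho_{N,N}=0$ recovers the noise-free full-batch step.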
### A.6 Proof of the main risk decomposition.
We record here the proof of the structural decomposition used in Section [3](https://arxiv.org/html/2605.24316#S3), since it only uses the basic identities from the present section.
###### Proof of Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1).
Since $u^{\ast}$ minimizes the sketched risk $R_{M}$, for every $u\in\mathbb{R}^{M}$ one has
$$R_{M}(u)=R_{M}(u^{\ast})+\|\Sigma^{1/2}(u-u^{\ast})\|_{2}^{2}.$$
For one-pass batch SGD, let $\bar{u}_{T}:=\mathbb{E}[u_{T}^{\mathrm{op}}]$. Applying the previous display with $u=u_{T}^{\mathrm{op}}$ and taking expectation gives
$$\mathbb{E}[R_{M}(u_{T}^{\mathrm{op}})]=R_{M}(u^{\ast})+\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{T}^{\mathrm{op}}-u^{\ast})\|_{2}^{2}\bigr].$$
Now write
$$u_{T}^{\mathrm{op}}-u^{\ast}=(\bar{u}_{T}-u^{\ast})+(u_{T}^{\mathrm{op}}-\bar{u}_{T}).$$
Since $\mathbb{E}[u_{T}^{\mathrm{op}}-\bar{u}_{T}]=0$, the cross term vanishes, so
$$\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{T}^{\mathrm{op}}-u^{\ast})\|_{2}^{2}\bigr]=\bigl\|\Sigma^{1/2}(\bar{u}_{T}-u^{\ast})\bigr\|_{2}^{2}+\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{T}^{\mathrm{op}}-\bar{u}_{T})\|_{2}^{2}\bigr].$$
Since $u^{\ast}$ minimizes $R_{M}$, the first term is exactly
$$R_{M}(\bar{u}_{T})-R_{M}(u^{\ast}).$$
Moreover,
$$R_{M}(u_{T}^{\mathrm{op}})-R_{M}(\bar{u}_{T})=2\bigl\langle\Sigma^{1/2}(\bar{u}_{T}-u^{\ast}),\Sigma^{1/2}(u_{T}^{\mathrm{op}}-\bar{u}_{T})\bigr\rangle+\bigl\|\Sigma^{1/2}(u_{T}^{\mathrm{op}}-\bar{u}_{T})\bigr\|_{2}^{2},$$
so taking expectation and using $\mathbb{E}[u_{T}^{\mathrm{op}}-\bar{u}_{T}]=0$ gives
$$\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})-R_{M}(\bar{u}_{T})\bigr]=\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{T}^{\mathrm{op}}-\bar{u}_{T})\|_{2}^{2}\bigr].$$
This proves the one-pass decomposition in excess-risk form.
For either multi-pass sampling rule, the same identity gives
$$\mathbb{E}[R_{M}(u_{L}^{\rho})]=R_{M}(u^{\ast})+\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{L}^{\rho}-u^{\ast})\|_{2}^{2}\bigr],$$
where $u_{L}^{\rho}$ denotes either $u_{L}^{\mathrm{wr}}$ or $u_{L}^{\mathrm{wor}}$. Writing
$$u_{L}^{\rho}-u^{\ast}=(\theta_{L}-u^{\ast})+(u_{L}^{\rho}-\theta_{L})$$
and expanding the squared norm produces the cross term
$$\mathbb{E}\bigl[\langle\Sigma^{1/2}(\theta_{L}-u^{\ast}),\Sigma^{1/2}(u_{L}^{\rho}-\theta_{L})\rangle\bigr].$$
This term vanishes because the fluctuation process has zero conditional mean: from the recursions defining $\Delta_{t}=u_{t}^{\mathrm{wr}}-\theta_{t}$ and $\Delta_{t}^{\mathrm{wor}}=u_{t}^{\mathrm{wor}}-\theta_{t}$, the driving noise at each step is conditionally centered given $(S,D)$, so
$$\mathbb{E}[u_{L}^{\rho}-\theta_{L}\mid S,D]=0.$$
Hence
$$\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{L}^{\rho}-u^{\ast})\|_{2}^{2}\bigr]=\mathbb{E}\bigl[\|\Sigma^{1/2}(\theta_{L}-u^{\ast})\|_{2}^{2}\bigr]+\mathbb{E}\bigl[\|\Sigma^{1/2}(u_{L}^{\rho}-\theta_{L})\|_{2}^{2}\bigr],$$
where the first term is
$$\mathbb{E}\bigl[R_{M}(\theta_{L})-R_{M}(u^{\ast})\bigr]$$
because $u^{\ast}$ minimizes $R_{M}$, and the second term is
$$\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})-R_{M}(\theta_{L})\bigr]$$
because the corresponding cross term vanishes by the same conditional-centering argument. This proves the multi-pass decomposition in excess-risk form. ∎
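The quadratic identity $R_{M}(u)=R_{M}(u^{\ast})+\|\Sigma^{1/2}(u-u^{\ast})\|_{2}^{2}$ that drives both decompositions can be checked numerically. The sketch below assumes a well-specified model, so that the population sketched risk takes the form $R_{M}(u)=\sigma^{2}+\|H^{1/2}(w^{\ast}-S^{\top}u)\|_{2}^{2}$; that closed form, the finite dimension `d`, and the function name `risk` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, M, sigma2 = 30, 5, 0.25                       # toy sizes and noise level (assumed)
lam = np.arange(1, d + 1, dtype=float) ** -1.5   # assumed power-law spectrum
H = np.diag(lam)
w_star = rng.normal(size=d)
S = rng.normal(size=(M, d)) / np.sqrt(M)

Sigma = S @ H @ S.T
u_star = np.linalg.solve(Sigma, S @ H @ w_star)

def risk(u):
    """Population sketched risk under the assumed well-specified model."""
    r = w_star - S.T @ u
    return sigma2 + r @ H @ r

u = rng.normal(size=M)                           # arbitrary test point
lhs = risk(u)
rhs = risk(u_star) + (u - u_star) @ Sigma @ (u - u_star)
```

The two sides agree because the cross term is proportional to $SH(w^{\ast}-S^{\top}u^{\ast})=SHw^{\ast}-\Sigma u^{\ast}=0$.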
## Appendix B Assembling the proofs of the main scaling theorems
This section gathers the appendix ingredients used in the proofs of Theorems [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1) and [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2). The detailed derivations are carried out in the subsequent appendix sections; here we simply record how those results fit together. Throughout, we work on the intersection of the high-probability events from the cited lemmas. Since only finitely many such results are invoked, a union bound still yields probability at least
$$1-\exp(-\Omega(M)).$$
### B.1 Assembling the proof of Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1)
Start from Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1):
$$\mathbb{E}[R_{M}(u_{T}^{\mathrm{op}})]=R_{M}(u^{\ast})+\bigl(R_{M}(\bar{u}_{T})-R_{M}(u^{\ast})\bigr)+\mathbb{E}\bigl[R_{M}(u_{T}^{\mathrm{op}})-R_{M}(\bar{u}_{T})\bigr].$$
The three terms are supplied by the appendix as follows.
1. Common baseline risk. Appendix [C](https://arxiv.org/html/2605.24316#A3) defines
   $$\mathrm{Approx}:=R_{M}(u^{\ast})-R(w^{\ast}).$$
   Under Assumption [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2), $R(w^{\ast})=\sigma^{2}$, and therefore
   $$R_{M}(u^{\ast})=\sigma^{2}+\mathrm{Approx}.$$
   Lemma [C.3](https://arxiv.org/html/2605.24316#A3.Thmlemma3) then gives
   $$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\asymp M^{1-b}.$$
2. One-pass bias term. In Appendix [F](https://arxiv.org/html/2605.24316#A6), the mean error satisfies $m_{T}=\bar{u}_{T}-u^{\ast}$, so Definition [F.1](https://arxiv.org/html/2605.24316#A6.Thmdefinition1) identifies
   $$R_{M}(\bar{u}_{T})-R_{M}(u^{\ast})=\|\Sigma^{1/2}(\bar{u}_{T}-u^{\ast})\|_{2}^{2}=\mathrm{Bias}_{B}.$$
   Lemma [G.3](https://arxiv.org/html/2605.24316#A7.Thmlemma3) yields the general upper bound
   $$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\lesssim\min\!\bigl\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\bigr\}^{1-b}.$$
   Moreover, when $(T_{\mathrm{eff}}\gamma)^{1/a}\leq M/c_{1}$, the same lemma gives the matching lower bound
   $$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{B}]\gtrsim(T_{\mathrm{eff}}\gamma)^{(1-b)/a}.$$
3. One-pass variance term. Proposition [F.1](https://arxiv.org/html/2605.24316#A6.Thmproposition1) and Proposition [F.2](https://arxiv.org/html/2605.24316#A6.Thmproposition2) identify the remaining term with the one-pass variance quantity $\mathrm{Var}_{B}$. Lemma [H.3](https://arxiv.org/html/2605.24316#A8.Thmlemma3) then yields
   $$\mathbb{E}_{w^{\ast}}[\mathrm{Var}_{B}]\lesssim\frac{\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}}{B\,T_{\mathrm{eff}}}.$$
   Here the factor $1/B$ should be read as the covariance reduction at a fixed number of updates. Since one-pass batch SGD runs for $T=N/B$ updates, the full variance dependence on $B$ at fixed dataset size $N$ is given by the displayed bound rather than by a standalone $1/B$ law.

From items (1) and (2), the deterministic contribution obeys
$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}+\mathrm{Bias}_{B}]=\Theta\!\bigl(M^{1-b}\bigr)+\Theta\!\Bigl((T_{\mathrm{eff}}\gamma)^{(1-b)/a}\Bigr).$$
Indeed, when $(T_{\mathrm{eff}}\gamma)^{1/a}\leq M/c_{1}$, the lower bound on $\mathrm{Bias}_{B}$ gives the second scale directly; when $(T_{\mathrm{eff}}\gamma)^{1/a}>M/c_{1}$, one has $(T_{\mathrm{eff}}\gamma)^{(1-b)/a}\lesssim M^{1-b}$, so the approximation lower bound already controls that term. Substituting this deterministic bound and item (3) into the risk decomposition gives the first display in Theorem [3.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1). The simplified regime $1<b\leq a$ follows by comparing the variance order from Lemma [H.3](https://arxiv.org/html/2605.24316#A8.Thmlemma3) with the deterministic orders above.
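To see why the variance bound in item (3) is not a standalone $1/B$ law at fixed $N$, it helps to evaluate its order for a few batch sizes. The helper below is a hedged illustration: `var_order` is a hypothetical name, constants hidden by $\lesssim$ are ignored, and the parameter values are arbitrary.

```python
import math

def var_order(N, B, M, gamma, a):
    """Order of the one-pass variance bound min{M, (T_eff*gamma)^(1/a)} / (B*T_eff).

    Assumes B divides N and T = N / B >= 2; hidden constants are dropped.
    """
    T = N // B
    T_eff = T / math.log(T)
    return min(M, (T_eff * gamma) ** (1.0 / a)) / (B * T_eff)

N, M, gamma, a = 4096, 64, 0.5, 2.0              # illustrative values (assumed)
orders = {B: var_order(N, B, M, gamma, a) for B in (1, 4, 16, 64)}
ratio = orders[1] / orders[16]                   # would equal 16 under a pure 1/B law
```

Because $B\,T_{\mathrm{eff}}=N/\log(N/B)$, the denominator depends on $B$ only through a logarithm at fixed $N$; in this unsaturated example the decrease of the bound with $B$ comes mainly from the shorter horizon shrinking $(T_{\mathrm{eff}}\gamma)^{1/a}$.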
### B.2 Assembling the proof of Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2)
Again start from Proposition [3.1](https://arxiv.org/html/2605.24316#S3.Thmproposition1):
$$\mathbb{E}[R_{M}(u_{L}^{\rho})]=R_{M}(u^{\ast})+\mathbb{E}\bigl[R_{M}(\theta_{L})-R_{M}(u^{\ast})\bigr]+\mathbb{E}\bigl[R_{M}(u_{L}^{\rho})-R_{M}(\theta_{L})\bigr].$$
The three contributions are assembled as follows.
1. Common baseline risk. As in the one-pass proof above,
   $$R_{M}(u^{\ast})=\sigma^{2}+\mathrm{Approx},\qquad\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\asymp M^{1-b}$$
   by Assumption [1.B](https://arxiv.org/html/2605.24316#S3.I1.i2) and Lemma [C.3](https://arxiv.org/html/2605.24316#A3.Thmlemma3).
2. Common GD-reference contribution. Sections [D](https://arxiv.org/html/2605.24316#A4) and [E](https://arxiv.org/html/2605.24316#A5) control the deterministic and stochastic pieces of the normal-GD reference iterate $\theta_{L}$. Their source-condition conclusions are Lemmas [D.3](https://arxiv.org/html/2605.24316#A4.Thmlemma3) and [E.2](https://arxiv.org/html/2605.24316#A5.Thmlemma2), namely
   $$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\asymp\min\!\bigl\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\bigr\}^{1-b},$$
   $$\mathrm{Var}_{\mathrm{GD}}\asymp\frac{\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}}{N}.$$
   These are exactly the orders labeled $\mathrm{GD\,Bias}$ and $\mathrm{GD\,Var}$ in Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2).
3. Sampling-rule-dependent fluctuation term. When $\rho=1/B$, Lemma [I.2](https://arxiv.org/html/2605.24316#A9.Thmlemma2) gives the upper bound for $\mathrm{Fluc}^{\mathrm{wr}}_{B}$. When $\rho=\rho_{N,B}$, Corollary [J.1](https://arxiv.org/html/2605.24316#A10.Thmcorollary1) transfers the same conclusion to $\mathrm{Fluc}^{\mathrm{wor}}_{B}$ after replacing $1/B$ by $\rho_{N,B}$. Hence, for either sampling rule,
   $$\mathbb{E}[\mathrm{Fluc}^{\rho}_{B}]\lesssim\rho\,\gamma\log N\left[(L_{\mathrm{eff}}\gamma)^{1/a-1}+\frac{(L_{\mathrm{eff}}\gamma)^{1/a}}{N}\right].$$

Combining the baseline term, the two GD-reference orders, and the fluctuation estimate gives item (1) of Theorem [3.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2). Finally, items (2) and (3) follow by direct comparison of the orders in item (1).
## Appendix C Approximation Error
This section studies the approximation term, which is common to the four optimization procedures equation [1](https://arxiv.org/html/2605.24316#S2.E1), equation [2](https://arxiv.org/html/2605.24316#S2.E2), equation [3](https://arxiv.org/html/2605.24316#S2.E3), and equation [4](https://arxiv.org/html/2605.24316#S2.E4).
Retaining the common setup from Section [2](https://arxiv.org/html/2605.24316#S2), define
$$\mathrm{Approx}:=\min_{u\in\mathbb{R}^{M}}R_{M}(u)-\min_{w\in\mathcal{H}}R(w)=R_{M}(u^{\ast})-R(w^{\ast}).$$
Substituting the value of $u^{\ast}$ gives
$$\mathrm{Approx}=\bigl\|(I-H^{1/2}S^{\top}\Sigma^{-1}SH^{1/2})H^{1/2}w^{\ast}\bigr\|_{2}^{2}.\tag{5}$$
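Formula (5) can be compared against the residual form $\|H^{1/2}(w^{\ast}-S^{\top}u^{\ast})\|_{2}^{2}$ on a small instance. The sketch below works in a truncated finite dimension `d` with an assumed power-law spectrum and assumes the well-specified setting in which the excess of $R_{M}(u^{\ast})$ over $R(w^{\ast})$ equals that residual; none of the sizes are taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
d, M = 30, 5                                     # toy sizes (assumed)
lam = np.arange(1, d + 1, dtype=float) ** -1.5   # assumed power-law spectrum
H_half = np.diag(np.sqrt(lam))
w_star = rng.normal(size=d)
S = rng.normal(size=(M, d)) / np.sqrt(M)         # Gaussian sketch, entries N(0, 1/M)

Sigma = S @ np.diag(lam) @ S.T
u_star = np.linalg.solve(Sigma, S @ (lam * w_star))   # u* = Sigma^{-1} S H w*

# P = H^{1/2} S^T Sigma^{-1} S H^{1/2} is an orthogonal projection of rank M.
P = H_half @ S.T @ np.linalg.solve(Sigma, S @ H_half)

approx_eq5 = np.sum(((np.eye(d) - P) @ (H_half @ w_star)) ** 2)
approx_residual = np.sum(lam * (w_star - S.T @ u_star) ** 2)
```

Both expressions compute the squared norm of the part of $H^{1/2}w^{\ast}$ that the sketch cannot capture.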
###### Assumption 4 (Gaussian sketching).
The sketching operator $S:\mathcal{H}\to\mathbb{R}^{M}$ is Gaussian, meaning that in the diagonal coordinates of $H$, its entries are i.i.d. with distribution $\mathcal{N}(0,1/M)$.
###### Assumption 5 (Source-condition regime).
In addition to Assumptions[1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3)and[2](https://arxiv.org/html/2605.24316#Thmassumption2), we work in the regime
### C.1 Upper bound
###### Lemma C.1 (Upper bound on the approximation error).
Fix any integer $k\geq 0$ such that $r(H)\geq k+M$, and define
$$A_{k}:=S_{k:\infty}H_{k:\infty}S_{k:\infty}^{\top}.$$
Then the approximation error satisfies the deterministic bound
$$\mathrm{Approx}\lesssim\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2}+w_{0:k}^{\ast\top}\bigl(H_{0:k}^{-1}+S_{0:k}^{\top}A_{k}^{-1}S_{0:k}\bigr)^{-1}w_{0:k}^{\ast}.$$
If in addition Assumption [4](https://arxiv.org/html/2605.24316#Thmassumption4) holds and $k\leq M/2$, then with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,
$$\mathrm{Approx}\lesssim\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2}+\beta_{k}\,\|w^{\ast}_{0:k}\|_{2}^{2},$$
where
$$\beta_{k}:=\frac{\sum_{i>k}\lambda_{i}}{M}+\lambda_{k+1}+\sqrt{\frac{\sum_{i>k}\lambda_{i}^{2}}{M}}.$$
###### Proof.
For the deterministic identities below, it is enough to work in an eigenbasis of $H$, so we write $H$ in diagonal form and use the block notation from Section [A.1](https://arxiv.org/html/2605.24316#A1.SS1). When Assumption [4](https://arxiv.org/html/2605.24316#Thmassumption4) is invoked later for the high-probability part, this is also the natural coordinate system by rotational invariance. Set
$$\mathcal{T}:=H^{1/2}S^{\top}\Sigma^{-1}SH^{1/2}-I.$$
Using the block decomposition induced by the split $0{:}k$ and $k{:}\infty$, write
$$\mathcal{T}=\begin{pmatrix}U&V\\ V^{\top}&W\end{pmatrix},$$
where
$$U:=H_{0:k}^{1/2}S_{0:k}^{\top}\Sigma^{-1}S_{0:k}H_{0:k}^{1/2}-I,$$
$$V:=H_{0:k}^{1/2}S_{0:k}^{\top}\Sigma^{-1}S_{k:\infty}H_{k:\infty}^{1/2},$$
$$W:=H_{k:\infty}^{1/2}S_{k:\infty}^{\top}\Sigma^{-1}S_{k:\infty}H_{k:\infty}^{1/2}-I.$$
Then equation [5](https://arxiv.org/html/2605.24316#A3.E5) gives
$$\mathrm{Approx}=\|\mathcal{T}H^{1/2}w^{\ast}\|_{2}^{2}.$$
By the inequality $(a+b)^{2}\leq 2a^{2}+2b^{2}$,
$$\begin{aligned}\mathrm{Approx}&\leq 2w_{0:k}^{\ast\top}H_{0:k}^{1/2}(U^{2}+VV^{\top})H_{0:k}^{1/2}w_{0:k}^{\ast}\\&\qquad+2w_{k:\infty}^{\ast\top}H_{k:\infty}^{1/2}(W^{2}+V^{\top}V)H_{k:\infty}^{1/2}w_{k:\infty}^{\ast}.\end{aligned}\tag{6}$$
We first control the tail block. Since
$$\Sigma=S_{0:k}H_{0:k}S_{0:k}^{\top}+A_{k},$$
a direct calculation shows that
$$W^{2}+V^{\top}V=-W.\tag{7}$$
Moreover,
$$0\preceq H_{k:\infty}^{1/2}S_{k:\infty}^{\top}\Sigma^{-1}S_{k:\infty}H_{k:\infty}^{1/2}\preceq I,$$
so $-I\preceq W\preceq 0$. Combining this with equation [7](https://arxiv.org/html/2605.24316#A3.E7), we obtain
$$0\preceq W^{2}+V^{\top}V=-W\preceq I,$$
and therefore
$$w_{k:\infty}^{\ast\top}H_{k:\infty}^{1/2}(W^{2}+V^{\top}V)H_{k:\infty}^{1/2}w_{k:\infty}^{\ast}\leq\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2}.\tag{8}$$
We next control the head block. Applying the Woodbury identity to
$$\Sigma^{-1}=(S_{0:k}H_{0:k}S_{0:k}^{\top}+A_{k})^{-1},$$
we get
$$\Sigma^{-1}=A_{k}^{-1}-A_{k}^{-1}S_{0:k}\bigl(H_{0:k}^{-1}+S_{0:k}^{\top}A_{k}^{-1}S_{0:k}\bigr)^{-1}S_{0:k}^{\top}A_{k}^{-1}.$$
Substituting this identity into the definitions of $U$ and $V$, we then have
$$U^{2}+VV^{\top}=H_{0:k}^{-1/2}\bigl(H_{0:k}^{-1}+S_{0:k}^{\top}A_{k}^{-1}S_{0:k}\bigr)^{-1}H_{0:k}^{-1/2}.\tag{9}$$
Hence
$$w_{0:k}^{\ast\top}H_{0:k}^{1/2}(U^{2}+VV^{\top})H_{0:k}^{1/2}w_{0:k}^{\ast}=w_{0:k}^{\ast\top}\bigl(H_{0:k}^{-1}+S_{0:k}^{\top}A_{k}^{-1}S_{0:k}\bigr)^{-1}w_{0:k}^{\ast}.\tag{10}$$
Putting equation [8](https://arxiv.org/html/2605.24316#A3.E8) and equation [10](https://arxiv.org/html/2605.24316#A3.E10) into equation [6](https://arxiv.org/html/2605.24316#A3.E6) proves the first claim.
For the high-probability bound, define
$$\beta_{k}:=\frac{\sum_{i>k}\lambda_{i}}{M}+\lambda_{k+1}+\sqrt{\frac{\sum_{i>k}\lambda_{i}^{2}}{M}}.$$
By Lemma [K.4](https://arxiv.org/html/2605.24316#A11.Thmlemma4), with probability at least $1-\exp(-\Omega(M))$,
$$\|A_{k}\|_{2}\lesssim\beta_{k}.$$
Equivalently,
$$A_{k}^{-1}\succeq c\,\beta_{k}^{-1}I$$
for some absolute constant $c>0$. Also, since $k\leq M/2$, a standard Gaussian covariance concentration bound gives
$$S_{0:k}^{\top}S_{0:k}\succeq c_{0}I_{k}$$
with probability at least $1-\exp(-\Omega(M))$, for some absolute constant $c_{0}>0$. Therefore,
$$S_{0:k}^{\top}A_{k}^{-1}S_{0:k}\succeq c\,\beta_{k}^{-1}S_{0:k}^{\top}S_{0:k}\succeq c^{\prime}\beta_{k}^{-1}I_{k}.$$
Hence
$$\bigl(H_{0:k}^{-1}+S_{0:k}^{\top}A_{k}^{-1}S_{0:k}\bigr)^{-1}\preceq\bigl(c^{\prime}\beta_{k}^{-1}I_{k}\bigr)^{-1}\lesssim\beta_{k}I_{k}.$$
Substituting this into the deterministic bound gives
$$\mathrm{Approx}\lesssim\|w^{\ast}_{k:\infty}\|_{H_{k:\infty}}^{2}+\beta_{k}\|w^{\ast}_{0:k}\|_{2}^{2},$$
which completes the proof. ∎
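The two block identities (7) and (9) used in this proof are exact and can be confirmed numerically on a truncated problem. The following sketch assumes a finite dimension `d` in place of the infinite-dimensional setting and an illustrative power-law spectrum; the variable names mirror the proof but are otherwise arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d, M, k = 40, 8, 3                               # toy sizes with d >= k + M (assumed)
lam = np.arange(1, d + 1, dtype=float) ** -2.0   # assumed power-law spectrum
S = rng.normal(size=(M, d)) / np.sqrt(M)
H_half = np.diag(np.sqrt(lam))
Sigma = S @ np.diag(lam) @ S.T

# T_op = H^{1/2} S^T Sigma^{-1} S H^{1/2} - I and its blocks for the split 0:k, k:d.
T_op = H_half @ S.T @ np.linalg.solve(Sigma, S @ H_half) - np.eye(d)
U, V, W = T_op[:k, :k], T_op[:k, k:], T_op[k:, k:]

# Right-hand side of (9), built from the head/tail split of the sketch.
S0, S1 = S[:, :k], S[:, k:]
A_k = S1 @ np.diag(lam[k:]) @ S1.T
core = np.linalg.inv(np.diag(1.0 / lam[:k]) + S0.T @ np.linalg.solve(A_k, S0))
H0_inv_half = np.diag(lam[:k] ** -0.5)
rhs9 = H0_inv_half @ core @ H0_inv_half

lhs7 = W @ W + V.T @ V
lhs9 = U @ U + V @ V.T
```

Both identities reflect the fact that $\mathcal{T}+I$ is an orthogonal projection, so $\mathcal{T}^{2}=-\mathcal{T}$; reading off the diagonal blocks gives $W^{2}+V^{\top}V=-W$ and $U^{2}+VV^{\top}=-U$.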
### C.2 Lower bound
###### Lemma C.2 (Lower bound on the approximation error under the source condition).
Assume Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2), and let
$$H_{w}:=\mathbb{E}[w^{\ast}w^{\ast\top}].$$
Then, conditioned on the sketch matrix $S$,
$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\gtrsim\sum_{i>M}\lambda_{i}i^{a-b}.$$
In particular, under Assumptions [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3) and [2](https://arxiv.org/html/2605.24316#Thmassumption2),
$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\gtrsim M^{1-b}.$$
###### Proof.
Reuse the block decomposition from the proof of Lemma [C.1](https://arxiv.org/html/2605.24316#A3.Thmlemma1). Since Assumption [2](https://arxiv.org/html/2605.24316#Thmassumption2) implies that $H$ and $H_{w}$ are diagonal and that the coordinates of $w^{\ast}$ are uncorrelated, the cross terms vanish after taking expectation over $w^{\ast}$, and we obtain
$$\begin{aligned}\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]&=\operatorname{tr}\bigl((U^{2}+VV^{\top})H_{0:k}H_{w,0:k}\bigr)+\operatorname{tr}\bigl((W^{2}+V^{\top}V)H_{k:\infty}H_{w,k:\infty}\bigr)\\&\geq\operatorname{tr}\bigl((W^{2}+V^{\top}V)H_{k:\infty}H_{w,k:\infty}\bigr)\\&=-\operatorname{tr}\bigl(WH_{k:\infty}H_{w,k:\infty}\bigr),\end{aligned}\tag{11}$$
where the last identity uses equation [7](https://arxiv.org/html/2605.24316#A3.E7).
Define
Pk:=I−Hk:∞1/2Sk:∞⊤Ak−1Sk:∞Hk:∞1/2\.P\_\{k\}:=I\-H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}A\_\{k\}^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}\.Since
Σ=S0:kH0:kS0:k⊤\+Ak⪰Ak,\\Sigma=S\_\{0:k\}H\_\{0:k\}S\_\{0:k\}^\{\\top\}\+A\_\{k\}\\succeq A\_\{k\},we have
Σ−1⪯Ak−1\.\\Sigma^\{\-1\}\\preceq A\_\{k\}^\{\-1\}\.The matrix
Hk:∞1/2Sk:∞⊤Ak−1Sk:∞Hk:∞1/2H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}A\_\{k\}^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}is an orthogonal projection onto the row space induced by the tail sketch, hencePkP\_\{k\}is also a projection matrix\. Moreover,
Hk:∞1/2Sk:∞⊤Σ−1Sk:∞Hk:∞1/2⪯Hk:∞1/2Sk:∞⊤Ak−1Sk:∞Hk:∞1/2,H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}\\Sigma^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}\\preceq H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}A\_\{k\}^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\},so
−W=I−Hk:∞1/2Sk:∞⊤Σ−1Sk:∞Hk:∞1/2⪰Pk\.\-W=I\-H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}\\Sigma^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}\\succeq P\_\{k\}\.SinceHk:∞Hw,k:∞⪰0H\_\{k:\\infty\}H\_\{w,k:\\infty\}\\succeq 0, equation[11](https://arxiv.org/html/2605.24316#A3.E11)implies
𝔼w∗\[Approx\]≥tr\(PkHk:∞Hw,k:∞\)\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Approx\}\]\\geq\\operatorname\{tr\}\\bigl\(P\_\{k\}H\_\{k:\\infty\}H\_\{w,k:\\infty\}\\bigr\)\.Finally, all eigenvalues ofPkP\_\{k\}are either0or11, with at mostMMzeros\. Applying Von Neumann’s trace inequality to the last display, we obtain
𝔼w∗\[Approx\]≥∑i\>k\+Mμi\(HHw\)\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Approx\}\]\\geq\\sum\_\{i\>k\+M\}\\mu\_\{i\}\(HH\_\{w\}\)\.Since Assumption[2](https://arxiv.org/html/2605.24316#Thmassumption2)gives
μi\(HHw\)=λi𝔼\[\(wi∗\)2\]≍i−b=λiia−b,\\mu\_\{i\}\(HH\_\{w\}\)=\\lambda\_\{i\}\\,\\mathbb\{E\}\[\(w\_\{i\}^\{\\ast\}\)^\{2\}\]\\asymp i^\{\-b\}=\\lambda\_\{i\}i^\{a\-b\},we conclude that
𝔼w∗\[Approx\]≳∑i\>k\+Mλiia−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Approx\}\]\\gtrsim\\sum\_\{i\>k\+M\}\\lambda\_\{i\}i^\{a\-b\}\.Settingk=0k=0gives
𝔼w∗\[Approx\]≳∑i\>Mλiia−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Approx\}\]\\gtrsim\\sum\_\{i\>M\}\\lambda\_\{i\}i^\{a\-b\}\.Under the power\-law assumptionλi≍i−a\\lambda\_\{i\}\\asymp i^\{\-a\}, this becomes
𝔼w∗\[Approx\]≳∑i\>Mi−b≍M1−b,\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Approx\}\]\\gtrsim\\sum\_\{i\>M\}i^\{\-b\}\\asymp M^\{1\-b\},which completes the proof\. ∎
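As a quick numerical sanity check of the final step, the sketch below compares the tail sum $\sum_{i>M} i^{-b}$ with $M^{1-b}$. The exponent `b = 2.0`, the values of `M`, and the truncation length are illustrative assumptions, not values taken from the paper.

```python
def tail_sum(M, b, n_terms=200_000):
    """Approximate sum_{i>M} i^{-b}: add n_terms terms explicitly, then
    estimate the neglected remainder by the integral of x^{-b}."""
    last = M + n_terms
    partial = sum(i ** (-b) for i in range(M + 1, last + 1))
    remainder = last ** (1 - b) / (b - 1)
    return partial + remainder

b = 2.0  # hypothetical exponent with b > 1, so the tail is summable
# Ratio of the tail sum to M^{1-b}; it should stay bounded above and below.
ratios = {M: tail_sum(M, b) / M ** (1 - b) for M in (10, 100, 1000)}
```

For this choice of `b` the ratios approach $1/(b-1)=1$ as `M` grows, which is the two-sided comparison $\sum_{i>M} i^{-b}\asymp M^{1-b}$ used in the proof.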
### C.3 Bounds under our assumptions
###### Lemma C.3 (Approximation error under our assumptions).
Assume Assumptions [2](https://arxiv.org/html/2605.24316#Thmassumption2), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [5](https://arxiv.org/html/2605.24316#Thmassumption5). Then with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,
$$\mathbb{E}_{w^{\ast}}[\mathrm{Approx}]\asymp M^{1-b}.$$
###### Proof\.
We first prove the upper bound\. Letk=⌊M/2⌋k=\\lfloor M/2\\rfloor\. By Lemma[C\.1](https://arxiv.org/html/2605.24316#A3.Thmlemma1),
𝔼w∗\[Approx\]≲𝔼w∗‖wk:∞∗‖Hk:∞2\+βk𝔼w∗‖w0:k∗‖22\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Approx\}\]\\lesssim\\mathbb\{E\}\_\{w^\{\\ast\}\}\\\|w^\{\\ast\}\_\{k:\\infty\}\\\|\_\{H\_\{k:\\infty\}\}^\{2\}\+\\beta\_\{k\}\\,\\mathbb\{E\}\_\{w^\{\\ast\}\}\\\|w^\{\\ast\}\_\{0:k\}\\\|\_\{2\}^\{2\}with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)\. Under Assumptions[1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3)and[2](https://arxiv.org/html/2605.24316#Thmassumption2),
𝔼w∗‖wk:∞∗‖Hk:∞2=∑i\>kλi𝔼\[\(wi∗\)2\]≍∑i\>ki−b≍k1−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\\\|w^\{\\ast\}\_\{k:\\infty\}\\\|\_\{H\_\{k:\\infty\}\}^\{2\}=\\sum\_\{i\>k\}\\lambda\_\{i\}\\mathbb\{E\}\[\(w\_\{i\}^\{\\ast\}\)^\{2\}\]\\asymp\\sum\_\{i\>k\}i^\{\-b\}\\asymp k^\{1\-b\}\.Also, usingλi≍i−a\\lambda\_\{i\}\\asymp i^\{\-a\},
βk≲∑i\>ki−aM\+k−a\+∑i\>ki−2aM≲M−a\\beta\_\{k\}\\lesssim\\frac\{\\sum\_\{i\>k\}i^\{\-a\}\}\{M\}\+k^\{\-a\}\+\\sqrt\{\\frac\{\\sum\_\{i\>k\}i^\{\-2a\}\}\{M\}\}\\lesssim M^\{\-a\}whenk≍Mk\\asymp M\. Moreover,
𝔼w∗‖w0:k∗‖22=∑i≤k𝔼\[\(wi∗\)2\]≍∑i≤kia−b≲ka−b\+1,\\mathbb\{E\}\_\{w^\{\\ast\}\}\\\|w^\{\\ast\}\_\{0:k\}\\\|\_\{2\}^\{2\}=\\sum\_\{i\\leq k\}\\mathbb\{E\}\[\(w\_\{i\}^\{\\ast\}\)^\{2\}\]\\asymp\\sum\_\{i\\leq k\}i^\{a\-b\}\\lesssim k^\{a\-b\+1\},where the last step uses Assumption[5](https://arxiv.org/html/2605.24316#Thmassumption5)\. Therefore,
βk𝔼w∗‖w0:k∗‖22≲M−aka−b\+1≍M1−b\.\\beta\_\{k\}\\,\\mathbb\{E\}\_\{w^\{\\ast\}\}\\\|w^\{\\ast\}\_\{0:k\}\\\|\_\{2\}^\{2\}\\lesssim M^\{\-a\}k^\{a\-b\+1\}\\asymp M^\{1\-b\}\.Since alsok1−b≍M1−bk^\{1\-b\}\\asymp M^\{1\-b\}, this proves
𝔼w∗\[Approx\]≲M1−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Approx\}\]\\lesssim M^\{1\-b\}\.
For the lower bound, Lemma[C\.2](https://arxiv.org/html/2605.24316#A3.Thmlemma2)gives
𝔼w∗\[Approx\]≳∑i\>Mλiia−b≍∑i\>Mi−b≍M1−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Approx\}\]\\gtrsim\\sum\_\{i\>M\}\\lambda\_\{i\}i^\{a\-b\}\\asymp\\sum\_\{i\>M\}i^\{\-b\}\\asymp M^\{1\-b\}\.Combining the two bounds yields the claim\. ∎
## Appendix D Bias Error under Normal GD
This section focuses on the normal GD procedure in equation [1](https://arxiv.org/html/2605.24316#S2.E1). As before, we retain the notation of Section [2](https://arxiv.org/html/2605.24316#S2). In particular,
$$\theta_{t}=\theta_{t-1}-\gamma_{t}\widehat{\Sigma}\theta_{t-1}+\gamma_{t}\widehat{b},\qquad\theta_{0}=0,$$
with
$$\Sigma:=SHS^{\top},\qquad\widehat{\Sigma}:=\frac{1}{N}SX^{\top}XS^{\top},\qquad u^{\ast}:=\Sigma^{-1}SHw^{\ast},\qquad\widehat{b}:=\frac{1}{N}SX^{\top}y.$$
Define
$$\widetilde{\varepsilon}_{i}:=y_{i}-x_{i}^{\top}S^{\top}u^{\ast},\qquad\widetilde{\varepsilon}:=(\widetilde{\varepsilon}_{1},\dots,\widetilde{\varepsilon}_{N})^{\top},\qquad\widehat{c}:=\frac{1}{N}SX^{\top}\widetilde{\varepsilon},$$
and introduce the shorthand
$$C_{L}:=\prod_{t=1}^{L}(I-\gamma_{t}\widehat{\Sigma}),\qquad V(\widehat{\Sigma}):=\frac{1}{N}\sum_{t=1}^{L}\gamma_{t}\prod_{i=t+1}^{L}(I-\gamma_{i}\widehat{\Sigma}).$$
Then
$$\theta_{L}-u^{\ast}=-C_{L}u^{\ast}+V(\widehat{\Sigma})SX^{\top}\widetilde{\varepsilon}.$$
Accordingly, the GD bias term is
$$\mathrm{Bias}_{\mathrm{GD}}(w^{\ast}):=\mathbb{E}_{X}\bigl[\|\Sigma^{1/2}C_{L}u^{\ast}\|_{2}^{2}\bigr].$$
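The decomposition of $\theta_{L}-u^{\ast}$ is purely algebraic, so it can be checked numerically. The sketch below does this in the scalar case $M=1$, which is an illustrative simplification rather than the paper's setting; the data, the step sizes, and the value of `u_star` are hypothetical, and the identity holds for any reference point because $\widetilde{\varepsilon}$ is defined relative to it.

```python
import random

random.seed(0)
N = 8
z = [random.gauss(0, 1) for _ in range(N)]  # sketched features S x_i (scalars when M = 1)
y = [random.gauss(0, 1) for _ in range(N)]  # synthetic labels
u_star = 0.7                                # hypothetical reference point
gammas = [0.1, 0.2, 0.05, 0.1, 0.15]        # hypothetical step sizes gamma_1..gamma_L

Sigma_hat = sum(zi * zi for zi in z) / N
b_hat = sum(zi * yi for zi, yi in zip(z, y)) / N

# Run the GD recursion theta_t = theta_{t-1} - g*Sigma_hat*theta_{t-1} + g*b_hat.
theta = 0.0
for g in gammas:
    theta = theta - g * Sigma_hat * theta + g * b_hat

# C_L = prod_t (1 - gamma_t * Sigma_hat)
C_L = 1.0
for g in gammas:
    C_L *= 1 - g * Sigma_hat

# V = (1/N) * sum_t gamma_t * prod_{i>t} (1 - gamma_i * Sigma_hat)
V = 0.0
for t, g in enumerate(gammas):
    tail = 1.0
    for gi in gammas[t + 1:]:
        tail *= 1 - gi * Sigma_hat
    V += g * tail
V /= N

# Residuals relative to u_star and the scalar analogue of S X^T eps.
eps = [yi - zi * u_star for zi, yi in zip(z, y)]
SXt_eps = sum(zi * ei for zi, ei in zip(z, eps))

lhs = theta - u_star
rhs = -C_L * u_star + V * SXt_eps
```

The two sides agree up to floating-point rounding, reflecting that the first term propagates the initial error $-u^{\ast}$ while the second accumulates the residual-driven updates.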
### D.1 Upper and lower bounds
###### Lemma D\.1\(Upper bound on the GD bias term\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1),[1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3),[4](https://arxiv.org/html/2605.24316#Thmassumption4), and[3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose
Leff≲Na/γ\.L\_\{\\mathrm\{eff\}\}\\lesssim N^\{a\}/\\gamma\.Fix any integerk≤M/3k\\leq M/3such thatrank\(H\)≥k\+M\\operatorname\{rank\}\(H\)\\geq k\+M, and define
Ak:=Sk:∞Hk:∞Sk:∞⊤,k~:=⌈N/2⌉,Σk~:∞:=Sk~:∞Hk~:∞Sk~:∞⊤\.A\_\{k\}:=S\_\{k:\\infty\}H\_\{k:\\infty\}S\_\{k:\\infty\}^\{\\top\},\\qquad\\widetilde\{k\}:=\\lceil N/2\\rceil,\\qquad\\Sigma\_\{\\widetilde\{k\}:\\infty\}:=S\_\{\\widetilde\{k\}:\\infty\}H\_\{\\widetilde\{k\}:\\infty\}S\_\{\\widetilde\{k\}:\\infty\}^\{\\top\}\.Then with probability at least
1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the randomness ofSS,
BiasGD\(w∗\)≲‖w0:k∗‖22Leffγ\(μM/2\(Ak\)μM\(Ak\)\)2\+B¯‖wk:∞∗‖Hk:∞2,\\mathrm\{Bias\}\_\{\\mathrm\{GD\}\}\(w^\{\\ast\}\)\\lesssim\\frac\{\\\|w^\{\\ast\}\_\{0:k\}\\\|\_\{2\}^\{2\}\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\\left\(\\frac\{\\mu\_\{M/2\}\(A\_\{k\}\)\}\{\\mu\_\{M\}\(A\_\{k\}\)\}\\right\)^\{2\}\+\\overline\{B\}\\,\\\|w^\{\\ast\}\_\{k:\\infty\}\\\|\_\{H\_\{k:\\infty\}\}^\{2\},where
B¯:=1\+\(Leffγ\)2tr\(Σk~:∞\)2N2\+‖Σk~:∞‖22\+tr\(Σk~:∞2\)N\+tr\(Σk~:∞4\)N\.\\overline\{B\}:=1\+\\frac\{\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{2\}\\operatorname\{tr\}\(\\Sigma\_\{\\widetilde\{k\}:\\infty\}\)^\{2\}\}\{N^\{2\}\}\+\\\|\\Sigma\_\{\\widetilde\{k\}:\\infty\}\\\|\_\{2\}^\{2\}\+\\frac\{\\operatorname\{tr\}\(\\Sigma\_\{\\widetilde\{k\}:\\infty\}^\{2\}\)\}\{N\}\+\\sqrt\{\\frac\{\\operatorname\{tr\}\(\\Sigma\_\{\\widetilde\{k\}:\\infty\}^\{4\}\)\}\{N\}\}\.
###### Proof\.
By rotational invariance of Gaussian sketching, we may work in the diagonal coordinates ofHHand use the block notation from Section[A\.1](https://arxiv.org/html/2605.24316#A1.SS1)\. Define
ML:=CLΣCL\.M\_\{L\}:=C\_\{L\}\\Sigma C\_\{L\}\.Substituting
SH=\(S0:kH0:k,Sk:∞Hk:∞\)SH=\(S\_\{0:k\}H\_\{0:k\},\\,S\_\{k:\\infty\}H\_\{k:\\infty\}\)into the identity
BiasGD\(w∗\)=𝔼X\[w∗⊤HS⊤Σ−1MLΣ−1SHw∗\]\\mathrm\{Bias\}\_\{\\mathrm\{GD\}\}\(w^\{\\ast\}\)=\\mathbb\{E\}\_\{X\}\\bigl\[w^\{\\ast\\top\}HS^\{\\top\}\\Sigma^\{\-1\}M\_\{L\}\\Sigma^\{\-1\}SHw^\{\\ast\}\\bigr\]and splitting head and tail blocks gives
BiasGD\(w∗\)≤2T1\+2T2,\\mathrm\{Bias\}\_\{\\mathrm\{GD\}\}\(w^\{\\ast\}\)\\leq 2T\_\{1\}\+2T\_\{2\},where
T1:=𝔼X\[w0:k∗⊤H0:kS0:k⊤Σ−1MLΣ−1S0:kH0:kw0:k∗\],T\_\{1\}:=\\mathbb\{E\}\_\{X\}\\bigl\[w\_\{0:k\}^\{\\ast\\top\}H\_\{0:k\}S\_\{0:k\}^\{\\top\}\\Sigma^\{\-1\}M\_\{L\}\\Sigma^\{\-1\}S\_\{0:k\}H\_\{0:k\}w\_\{0:k\}^\{\\ast\}\\bigr\],and
T2:=𝔼X\[wk:∞∗⊤Hk:∞Sk:∞⊤Σ−1MLΣ−1Sk:∞Hk:∞wk:∞∗\]\.T\_\{2\}:=\\mathbb\{E\}\_\{X\}\\bigl\[w\_\{k:\\infty\}^\{\\ast\\top\}H\_\{k:\\infty\}S\_\{k:\\infty\}^\{\\top\}\\Sigma^\{\-1\}M\_\{L\}\\Sigma^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}w\_\{k:\\infty\}^\{\\ast\}\\bigr\]\.
For the head term,
T1≤𝔼X‖ML‖2⋅‖Σ−1S0:kH0:k‖22⋅‖w0:k∗‖22\.T\_\{1\}\\leq\\mathbb\{E\}\_\{X\}\\\|M\_\{L\}\\\|\_\{2\}\\cdot\\\|\\Sigma^\{\-1\}S\_\{0:k\}H\_\{0:k\}\\\|\_\{2\}^\{2\}\\cdot\\\|w\_\{0:k\}^\{\\ast\}\\\|\_\{2\}^\{2\}\.Applying Lemma[K\.1](https://arxiv.org/html/2605.24316#A11.Thmlemma1)and spectral calculus, we have
𝔼X‖ML‖2≲1Leffγ,\\mathbb\{E\}\_\{X\}\\\|M\_\{L\}\\\|\_\{2\}\\lesssim\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\},while Lemma[K\.6](https://arxiv.org/html/2605.24316#A11.Thmlemma6)gives
‖Σ−1S0:kH0:k‖2≲μM/2\(Ak\)μM\(Ak\)\\\|\\Sigma^\{\-1\}S\_\{0:k\}H\_\{0:k\}\\\|\_\{2\}\\lesssim\\frac\{\\mu\_\{M/2\}\(A\_\{k\}\)\}\{\\mu\_\{M\}\(A\_\{k\}\)\}with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)\. Hence
T1≲‖w0:k∗‖22Leffγ\(μM/2\(Ak\)μM\(Ak\)\)2\.T\_\{1\}\\lesssim\\frac\{\\\|w\_\{0:k\}^\{\\ast\}\\\|\_\{2\}^\{2\}\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\\left\(\\frac\{\\mu\_\{M/2\}\(A\_\{k\}\)\}\{\\mu\_\{M\}\(A\_\{k\}\)\}\\right\)^\{2\}\.
For the tail term, define
$$\mathcal{B}:=\mathbb{E}_{X}\bigl[\Sigma^{-1/2}M_{L}\Sigma^{-1/2}\bigr].$$
Then
$$T_{2}\leq\|\mathcal{B}\|_{2}\cdot\|H_{k:\infty}^{1/2}S_{k:\infty}^{\top}\Sigma^{-1}S_{k:\infty}H_{k:\infty}^{1/2}\|_{2}\cdot\|w_{k:\infty}^{\ast}\|_{H_{k:\infty}}^{2}.$$
The middle operator norm is at most one, while the same calculation as in Appendix B.1 of Lin et al. \[[2025](https://arxiv.org/html/2605.24316#bib.bib1)\] gives
$$\|\mathcal{B}\|_{2}\lesssim\overline{B}.$$
Combining the estimates for $T_{1}$ and $T_{2}$ proves the claim. ∎
###### Lemma D\.2\(Lower bound on the GD bias term\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1),[4](https://arxiv.org/html/2605.24316#Thmassumption4), and[3](https://arxiv.org/html/2605.24316#Thmassumption3)\. Let
Hw:=𝔼w∗\[w∗w∗⊤\],Σw:=SHHwHS⊤\.H\_\{w\}:=\\mathbb\{E\}\_\{w^\{\\ast\}\}\[w^\{\\ast\}w^\{\\ast\\top\}\],\\qquad\\Sigma\_\{w\}:=SHH\_\{w\}HS^\{\\top\}\.Then with probability at least
1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the randomness ofSS,
𝔼w∗\[BiasGD\(w∗\)\]≳∑i=2τ\+1Mμ3i\(Σw\)μi\(Σ\),\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{\\mathrm\{GD\}\}\(w^\{\\ast\}\)\]\\gtrsim\\sum\_\{i=2\\tau\+1\}^\{M\}\\frac\{\\mu\_\{3i\}\(\\Sigma\_\{w\}\)\}\{\\mu\_\{i\}\(\\Sigma\)\},where
τ:=𝔼X\[\#\{i∈\[M\]:μi\(Σ^\)Leffγ0\>1/4\}\]\.\\tau:=\\mathbb\{E\}\_\{X\}\\Bigl\[\\\#\\bigl\\\{i\\in\[M\]:\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\>1/4\\bigr\\\}\\Bigr\]\.
###### Proof\.
Set $C_{L}:=\prod_{t=1}^{L}(I-\gamma_{t}\widehat{\Sigma})$. We have
$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]=\operatorname{tr}\Bigl(\mathbb{E}_{X}\bigl[\Sigma^{-1/2}C_{L}\Sigma C_{L}^{\top}\Sigma^{-1/2}\bigr]\cdot\Sigma^{-1/2}\Sigma_{w}\Sigma^{-1/2}\Bigr).$$
Moreover,
$$\mathbb{E}_{X}\bigl[\Sigma^{-1/2}C_{L}\Sigma C_{L}^{\top}\Sigma^{-1/2}\bigr]\succeq\Sigma^{-1/2}\,\mathbb{E}_{X}[C_{L}]\,\Sigma\,\mathbb{E}_{X}[C_{L}]^{\top}\Sigma^{-1/2},$$
since $\mathbb{E}_{X}[(C_{L}-\mathbb{E}_{X}[C_{L}])\Sigma(C_{L}-\mathbb{E}_{X}[C_{L}])^{\top}]\succeq 0$. Applying this PSD lower bound, followed by a spectral truncation argument and von Neumann's trace inequality, yields
$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\gtrsim\sum_{i=2\tau+1}^{M}\mu_{i}\bigl(\Sigma^{-1/2}\Sigma_{w}\Sigma^{-1/2}\bigr).$$
Lastly, the spectral comparison yields
$$\mu_{i}\bigl(\Sigma^{-1/2}\Sigma_{w}\Sigma^{-1/2}\bigr)\gtrsim\frac{\mu_{3i}(\Sigma_{w})}{\mu_{i}(\Sigma)}.$$
Substituting this estimate into the previous display proves the lemma. ∎
### D.2 Example under the source condition
###### Lemma D.3 (Bias bounds under the source condition).
Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [2](https://arxiv.org/html/2605.24316#Thmassumption2), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose
$$a>b-1,\qquad L_{\mathrm{eff}}\lesssim N^{a}/\gamma.$$
Then there exists an $(a,b)$-dependent constant $c>0$ such that, whenever
$$\gamma\leq\frac{c}{\log N},$$
we have with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,
$$\mathbb{E}_{w^{\ast}}[\mathrm{Bias}_{\mathrm{GD}}(w^{\ast})]\asymp\min\bigl\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\bigr\}^{1-b}.$$
###### Proof\.
We first verify that the ingredients entering Lemmas[D\.1](https://arxiv.org/html/2605.24316#A4.Thmlemma1)and[D\.2](https://arxiv.org/html/2605.24316#A4.Thmlemma2)are all of the claimed order\.
By Lemma[K\.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7),
μi\(Σ\)≍i−afori∈\[M\]\\mu\_\{i\}\(\\Sigma\)\\asymp i^\{\-a\}\\qquad\\text\{for \}i\\in\[M\]with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)\. Therefore,
tr\(Σ2\)≲1,∑i=1Mμi\(Σ\)μi\(Σ\)\+1/\(Leffγ\)≲\(Leffγ\)1/a\.\\operatorname\{tr\}\(\\Sigma^\{2\}\)\\lesssim 1,\\qquad\\sum\_\{i=1\}^\{M\}\\frac\{\\mu\_\{i\}\(\\Sigma\)\}\{\\mu\_\{i\}\(\\Sigma\)\+1/\(L\_\{\\mathrm\{eff\}\}\\gamma\)\}\\lesssim\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\.
Next, Lemma[K\.8](https://arxiv.org/html/2605.24316#A11.Thmlemma8)implies
μM/2\(Ak\)μM\(Ak\)≲1\\frac\{\\mu\_\{M/2\}\(A\_\{k\}\)\}\{\\mu\_\{M\}\(A\_\{k\}\)\}\\lesssim 1for every admissiblekk, while the power\-law tail bounds give
tr\(Σk~:∞\)≲N1−a,‖Σk~:∞‖2≲N−a,tr\(Σk~:∞2\)≲N1−2a,tr\(Σk~:∞4\)≲N1−4a\.\\operatorname\{tr\}\(\\Sigma\_\{\\widetilde\{k\}:\\infty\}\)\\lesssim N^\{1\-a\},\\qquad\\\|\\Sigma\_\{\\widetilde\{k\}:\\infty\}\\\|\_\{2\}\\lesssim N^\{\-a\},\\qquad\\operatorname\{tr\}\(\\Sigma\_\{\\widetilde\{k\}:\\infty\}^\{2\}\)\\lesssim N^\{1\-2a\},\\qquad\\operatorname\{tr\}\(\\Sigma\_\{\\widetilde\{k\}:\\infty\}^\{4\}\)\\lesssim N^\{1\-4a\}\.Hence the quantityB¯\\overline\{B\}in Lemma[D\.1](https://arxiv.org/html/2605.24316#A4.Thmlemma1)satisfies
B¯≲1\\overline\{B\}\\lesssim 1wheneverLeff≲Na/γL\_\{\\mathrm\{eff\}\}\\lesssim N^\{a\}/\\gamma\.
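To make the bookkeeping explicit, the four tail estimates can be substituted into the definition of $\overline{B}$ term by term. This is only a worked restatement of the step above, using $a>0$ and $L_{\mathrm{eff}}\gamma\lesssim N^{a}$:

$$
\begin{aligned}
\frac{(L_{\mathrm{eff}}\gamma)^{2}\operatorname{tr}(\Sigma_{\widetilde{k}:\infty})^{2}}{N^{2}} &\lesssim (L_{\mathrm{eff}}\gamma)^{2}N^{-2a}=\bigl(L_{\mathrm{eff}}\gamma N^{-a}\bigr)^{2}\lesssim 1,\\
\|\Sigma_{\widetilde{k}:\infty}\|_{2}^{2} &\lesssim N^{-2a}\leq 1,\\
\frac{\operatorname{tr}(\Sigma_{\widetilde{k}:\infty}^{2})}{N} &\lesssim N^{-2a}\leq 1,\\
\sqrt{\frac{\operatorname{tr}(\Sigma_{\widetilde{k}:\infty}^{4})}{N}} &\lesssim \sqrt{N^{-4a}}=N^{-2a}\leq 1.
\end{aligned}
$$

Only the first term involves the optimization horizon, which is why the condition $L_{\mathrm{eff}}\lesssim N^{a}/\gamma$ is exactly what keeps $\overline{B}$ bounded.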
For the upper bound, choose
k:=min\{M/3,\(Leffγ\)1/a\}\.k:=\\min\\\!\\bigl\\\{M/3,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\bigr\\\}\.Using Assumption[2](https://arxiv.org/html/2605.24316#Thmassumption2),
𝔼w∗‖w0:k∗‖22≍∑i=1kia−b,𝔼w∗‖wk:∞∗‖Hk:∞2≍∑i\>ki−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\\\|w\_\{0:k\}^\{\\ast\}\\\|\_\{2\}^\{2\}\\asymp\\sum\_\{i=1\}^\{k\}i^\{a\-b\},\\qquad\\mathbb\{E\}\_\{w^\{\\ast\}\}\\\|w\_\{k:\\infty\}^\{\\ast\}\\\|\_\{H\_\{k:\\infty\}\}^\{2\}\\asymp\\sum\_\{i\>k\}i^\{\-b\}\.Plugging these relations into Lemma[D\.1](https://arxiv.org/html/2605.24316#A4.Thmlemma1)gives
𝔼w∗\[BiasGD\(w∗\)\]≲1Leffγ∑i=1kia−b\+∑i\>ki−b≲k1−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{\\mathrm\{GD\}\}\(w^\{\\ast\}\)\]\\lesssim\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\\sum\_\{i=1\}^\{k\}i^\{a\-b\}\+\\sum\_\{i\>k\}i^\{\-b\}\\lesssim k^\{1\-b\}\.Sincek≍min\{M,\(Leffγ\)1/a\}k\\asymp\\min\\\{M,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}, this yields
𝔼w∗\[BiasGD\(w∗\)\]≲min\{M,\(Leffγ\)1/a\}1−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{\\mathrm\{GD\}\}\(w^\{\\ast\}\)\]\\lesssim\\min\\\!\\bigl\\\{M,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\bigr\\\}^\{1\-b\}\.
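The exponent arithmetic behind this upper bound can be written out as follows. The sketch assumes $b>1$ (needed for the tail sum to converge) together with the stated condition $a>b-1$, and uses $k\leq(L_{\mathrm{eff}}\gamma)^{1/a}$, that is, $1/(L_{\mathrm{eff}}\gamma)\leq k^{-a}$:

$$
\begin{aligned}
\frac{1}{L_{\mathrm{eff}}\gamma}\sum_{i=1}^{k}i^{a-b} &\lesssim k^{-a}\cdot k^{a-b+1}=k^{1-b},\\
\sum_{i>k}i^{-b} &\lesssim k^{1-b}.
\end{aligned}
$$

Both the head and the tail contributions are therefore of order $k^{1-b}$, so the choice of $k$ balances the two terms of Lemma D.1.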
For the lower bound, Lemma[D\.2](https://arxiv.org/html/2605.24316#A4.Thmlemma2)and the estimate onτ\\tau\(mentioned in Lemma[D\.2](https://arxiv.org/html/2605.24316#A4.Thmlemma2)\) give
𝔼w∗\[BiasGD\(w∗\)\]≳∑i≳\(Leffγ\)1/aμ3i\(Σw\)μi\(Σ\)\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{\\mathrm\{GD\}\}\(w^\{\\ast\}\)\]\\gtrsim\\sum\_\{i\\gtrsim\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\}\\frac\{\\mu\_\{3i\}\(\\Sigma\_\{w\}\)\}\{\\mu\_\{i\}\(\\Sigma\)\}\.Under Assumption[2](https://arxiv.org/html/2605.24316#Thmassumption2), the operatorHHwHHH\_\{w\}Hhas eigenvalues of orderi−a−bi^\{\-a\-b\}, so Lemma[K\.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7)applied toHHwHHH\_\{w\}Hyields
μi\(Σw\)≍i−a−b\.\\mu\_\{i\}\(\\Sigma\_\{w\}\)\\asymp i^\{\-a\-b\}\.Combining this withμi\(Σ\)≍i−a\\mu\_\{i\}\(\\Sigma\)\\asymp i^\{\-a\}gives
𝔼w∗\[BiasGD\(w∗\)\]≳∑i≳\(Leffγ\)1/ai−b≳min\{M,\(Leffγ\)1/a\}1−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{\\mathrm\{GD\}\}\(w^\{\\ast\}\)\]\\gtrsim\\sum\_\{i\\gtrsim\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\}i^\{\-b\}\\gtrsim\\min\\\!\\bigl\\\{M,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\bigr\\\}^\{1\-b\}\.This completes the proof\. ∎
## Appendix E Variance Error under Normal GD
We retain the notation of Appendix [D](https://arxiv.org/html/2605.24316#A4). In particular,
$$V(\widehat{\Sigma}):=\frac{1}{N}\sum_{t=1}^{L}\gamma_{t}\prod_{i=t+1}^{L}(I-\gamma_{i}\widehat{\Sigma}),\qquad V_{L}:=I-\prod_{t=1}^{L}(I-\gamma_{t}\widehat{\Sigma}).$$
We define the normalized GD variance term by
$$\mathrm{Var}_{\mathrm{GD}}:=\mathbb{E}_{X}\Bigl[\operatorname{tr}\bigl(XS^{\top}V(\widehat{\Sigma})\Sigma V(\widehat{\Sigma})SX^{\top}\bigr)\Bigr].$$
Then using
$$V(\widehat{\Sigma})=\frac{1}{N}\Bigl(I-\prod_{t=1}^{L}(I-\gamma_{t}\widehat{\Sigma})\Bigr)\widehat{\Sigma}^{-1}=\frac{1}{N}V_{L}\widehat{\Sigma}^{-1},$$
we may rewrite
$$\mathrm{Var}_{\mathrm{GD}}=\frac{1}{N}\,\mathbb{E}_{X}\Bigl[\operatorname{tr}\bigl(\Sigma V_{L}\widehat{\Sigma}^{-1}V_{L}\bigr)\Bigr].$$
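The identity $V(\widehat{\Sigma})=\frac{1}{N}V_{L}\widehat{\Sigma}^{-1}$ is a telescoping sum, which is easy to confirm on a single eigenvalue. In the sketch below, `s` plays the role of one eigenvalue of $\widehat{\Sigma}$; its value, the step sizes, and `N` are hypothetical.

```python
gammas = [0.1, 0.2, 0.05, 0.1, 0.15]  # hypothetical step sizes gamma_1..gamma_L
s = 0.8                               # hypothetical eigenvalue of Sigma_hat
N = 10

def prod_tail(start):
    """Product of (1 - gamma_i * s) over the step sizes gammas[start:]."""
    p = 1.0
    for g in gammas[start:]:
        p *= 1 - g * s
    return p

# Left-hand side: (1/N) * sum_t gamma_t * prod_{i>t} (1 - gamma_i * s)
V = sum(g * prod_tail(t + 1) for t, g in enumerate(gammas)) / N

# Right-hand side: (1/N) * V_L / s with V_L = 1 - prod_t (1 - gamma_t * s)
V_L = 1 - prod_tail(0)
V_closed = V_L / (N * s)
```

The agreement follows from $\gamma_{t}s\prod_{i>t}(1-\gamma_{i}s)=\prod_{i>t}(1-\gamma_{i}s)-\prod_{i\geq t}(1-\gamma_{i}s)$, so the sum collapses to $1-\prod_{t}(1-\gamma_{t}s)$.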
### E.1 Upper and lower bounds
###### Lemma E\.1\(Upper and lower bounds on the GD variance term\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1),[1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3),[4](https://arxiv.org/html/2605.24316#Thmassumption4), and[3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose
Leff≲Na/γ\.L\_\{\\mathrm\{eff\}\}\\lesssim N^\{a\}/\\gamma\.Then
VarGD≲DUN,VarGD≳DLN,\\mathrm\{Var\}\_\{\\mathrm\{GD\}\}\\lesssim\\frac\{D\_\{U\}\}\{N\},\\qquad\\mathrm\{Var\}\_\{\\mathrm\{GD\}\}\\gtrsim\\frac\{D\_\{L\}\}\{N\},where
DU:=𝔼X\[\#\{i∈\[M\]:μi\(Σ^\)Leffγ0\>1/4\}\+\(Leffγ0\)∑i:μi\(Σ^\)Leffγ0≤1/4μi\(Σ^\)\],D\_\{U\}:=\\mathbb\{E\}\_\{X\}\\Biggl\[\\\#\\bigl\\\{i\\in\[M\]:\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\>1/4\\bigr\\\}\+\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)\\sum\_\{i:\\,\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\\leq 1/4\}\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)\\Biggr\],and
DL:=𝔼X\[\(Leffγ0\)2∑i:μi\(Σ^\)Leffγ0≤1/4μi\(Σ\)μi\(Σ^\)\+15∑i:μi\(Σ^\)Leffγ0\>1/4μi\(Σ\)μi\(Σ^\)\]\.D\_\{L\}:=\\mathbb\{E\}\_\{X\}\\Biggl\[\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{2\}\\sum\_\{i:\\,\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\\leq 1/4\}\\mu\_\{i\}\(\\Sigma\)\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)\+\\frac\{1\}\{5\}\\sum\_\{i:\\,\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\>1/4\}\\frac\{\\mu\_\{i\}\(\\Sigma\)\}\{\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)\}\\Biggr\]\.
###### Proof\.
Recall the identity
$$\mathrm{Var}_{\mathrm{GD}}=\frac{1}{N}\,\mathbb{E}_{X}\bigl[\operatorname{tr}(\Sigma V_{L}\widehat{\Sigma}^{-1}V_{L})\bigr],$$
and fix any $\lambda>0$ for later separation. Then
tr\(ΣVLΣ^−1VL\)≤‖Σ1/2\(Σ^\+λI\)−1/2‖22⋅tr\(VL2\+λΣ^−1VL2\)\.\\operatorname\{tr\}\(\\Sigma V\_\{L\}\\widehat\{\\Sigma\}^\{\-1\}V\_\{L\}\)\\leq\\\|\\Sigma^\{1/2\}\(\\widehat\{\\Sigma\}\+\\lambda I\)^\{\-1/2\}\\\|\_\{2\}^\{2\}\\cdot\\operatorname\{tr\}\\bigl\(V\_\{L\}^\{2\}\+\\lambda\\widehat\{\\Sigma\}^\{\-1\}V\_\{L\}^\{2\}\\bigr\)\.Note that the stepsize assumptions imply
VL⪯I−\(I−2γ0Σ^\)Leff\.V\_\{L\}\\preceq I\-\(I\-2\\gamma\_\{0\}\\widehat\{\\Sigma\}\)^\{L\_\{\\mathrm\{eff\}\}\}\.Hence, using Bernoulli’s inequality exactly as in that proof,
tr\(VL2\+λΣ^−1VL2\)≲\#\{i:μi\(Σ^\)Leffγ0\>1/4\}\+\(Leffγ0\)∑i:μi\(Σ^\)Leffγ0≤1/4μi\(Σ^\)\\operatorname\{tr\}\\bigl\(V\_\{L\}^\{2\}\+\\lambda\\widehat\{\\Sigma\}^\{\-1\}V\_\{L\}^\{2\}\\bigr\)\\lesssim\\\#\\bigl\\\{i:\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\>1/4\\bigr\\\}\+\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)\\sum\_\{i:\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\\leq 1/4\}\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)after choosingλ=\(Leffγ\)−1≤\(Leffγ0\)−1\\lambda=\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{\-1\}\\leq\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{\-1\}\. Applying Lemma[K\.1](https://arxiv.org/html/2605.24316#A11.Thmlemma1)then yields
𝔼X\[tr\(ΣVLΣ^−1VL\)\]≲DU,\\mathbb\{E\}\_\{X\}\\bigl\[\\operatorname\{tr\}\(\\Sigma V\_\{L\}\\widehat\{\\Sigma\}^\{\-1\}V\_\{L\}\)\\bigr\]\\lesssim D\_\{U\},which proves the upper bound\.
For the lower bound, von Neumann’s trace inequality gives
tr\(ΣVLΣ^−1VL\)≥∑i=1Mμi\(Σ\)μi\(Σ^\)μ2\(M−i\)\+1\(VL2Σ^−2\)\.\\operatorname\{tr\}\(\\Sigma V\_\{L\}\\widehat\{\\Sigma\}^\{\-1\}V\_\{L\}\)\\geq\\sum\_\{i=1\}^\{M\}\\mu\_\{i\}\(\\Sigma\)\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)\\,\\mu\_\{2\(M\-i\)\+1\}\\bigl\(V\_\{L\}^\{2\}\\widehat\{\\Sigma\}^\{\-2\}\\bigr\)\.Note that the scalar function
f\(x\):=\(1−\(1−γ0x\)Leff\)2x2f\(x\):=\\frac\{\(1\-\(1\-\\gamma\_\{0\}x\)^\{L\_\{\\mathrm\{eff\}\}\}\)^\{2\}\}\{x^\{2\}\}is decreasing on\[0,1/γ0\]\[0,1/\\gamma\_\{0\}\], and therefore
μ2\(M−i\)\+1\(VL2Σ^−2\)≳\{\(Leffγ0\)2,μi\(Σ^\)Leffγ0≤1/4,1/μi\(Σ^\)2,μi\(Σ^\)Leffγ0\>1/4\.\\mu\_\{2\(M\-i\)\+1\}\\bigl\(V\_\{L\}^\{2\}\\widehat\{\\Sigma\}^\{\-2\}\\bigr\)\\gtrsim\\begin\{cases\}\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{2\},&\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\\leq 1/4,\\\\\[3\.0pt\] 1/\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)^\{2\},&\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\>1/4\.\\end\{cases\}Substituting this bound into the previous display gives
tr\(ΣVLΣ^−1VL\)≳\(Leffγ0\)2∑i:μi\(Σ^\)Leffγ0≤1/4μi\(Σ\)μi\(Σ^\)\+15∑i:μi\(Σ^\)Leffγ0\>1/4μi\(Σ\)μi\(Σ^\)\.\\operatorname\{tr\}\(\\Sigma V\_\{L\}\\widehat\{\\Sigma\}^\{\-1\}V\_\{L\}\)\\gtrsim\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{2\}\\sum\_\{i:\\,\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\\leq 1/4\}\\mu\_\{i\}\(\\Sigma\)\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)\+\\frac\{1\}\{5\}\\sum\_\{i:\\,\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\>1/4\}\\frac\{\\mu\_\{i\}\(\\Sigma\)\}\{\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)\}\.Taking expectation overXXproves the lower bound\. ∎
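The two-regime behaviour of the scalar function $f$ used in the lower bound can be illustrated numerically. The sketch below evaluates $f$ on a grid for hypothetical values `gamma0 = 0.1` and `L_eff = 50`. The constants `0.75` and `0.04` are our own, obtained from the elementary bounds $1-(1-u/L)^{L}\geq u-u^{2}/2$ for $u\leq 1/4$ and $1-(1-u/L)^{L}\geq 1-e^{-1/4}$ for $u>1/4$; they are not the constants of the paper.

```python
def f(x, gamma0, L_eff):
    """f(x) = (1 - (1 - gamma0*x)^L_eff)^2 / x^2, the scalar function from the proof."""
    return (1 - (1 - gamma0 * x) ** L_eff) ** 2 / x ** 2

gamma0, L_eff = 0.1, 50                        # hypothetical values
grid = [0.005 * j for j in range(1, 2001)]     # points in (0, 1/gamma0]
values = [f(x, gamma0, L_eff) for x in grid]

# Monotonicity on the grid (small tolerance for floating-point rounding).
is_decreasing = all(v2 <= v1 + 1e-12 for v1, v2 in zip(values, values[1:]))

# Split the grid at the threshold x * L_eff * gamma0 = 1/4 used in the proof.
small = [x for x in grid if x * L_eff * gamma0 <= 0.25]
large = [x for x in grid if x * L_eff * gamma0 > 0.25]

# Small eigenvalues: f(x) is of order (L_eff * gamma0)^2.
small_ok = all(f(x, gamma0, L_eff) >= 0.75 * (L_eff * gamma0) ** 2 for x in small)
# Large eigenvalues: f(x) is of order 1 / x^2.
large_ok = all(f(x, gamma0, L_eff) * x ** 2 >= 0.04 for x in large)
```

Both regimes match the case distinction in the proof: below the threshold the iteration has barely moved and $f$ behaves like $(L_{\mathrm{eff}}\gamma_{0})^{2}$, while above it the iteration has essentially converged and $f$ behaves like $1/x^{2}$.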
### E.2 Example under the source condition
###### Lemma E.2 (Variance bounds under the source condition).
Assume Assumptions [1.A](https://arxiv.org/html/2605.24316#S3.I1.i1), [1.C](https://arxiv.org/html/2605.24316#S3.I1.i3), [4](https://arxiv.org/html/2605.24316#Thmassumption4), and [3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose
$$L_{\mathrm{eff}}\lesssim N^{a}/\gamma.$$
Then there exists an $a$-dependent constant $c>0$ such that, whenever
$$\gamma\leq\frac{c}{\log N},$$
we have with probability at least $1-\exp(-\Omega(M))$ over the randomness of $S$,
$$\mathrm{Var}_{\mathrm{GD}}\asymp\frac{\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}}{N}.$$
###### Proof\.
For the upper bound, Lemma[E\.1](https://arxiv.org/html/2605.24316#A5.Thmlemma1)gives
VarGD≲DUN\.\\mathrm\{Var\}\_\{\\mathrm\{GD\}\}\\lesssim\\frac\{D\_\{U\}\}\{N\}\.By Lemma[K\.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7),μi\(Σ\)≍i−a\\mu\_\{i\}\(\\Sigma\)\\asymp i^\{\-a\}, and the same truncation argument used in the proof of Lemma[D\.3](https://arxiv.org/html/2605.24316#A4.Thmlemma3)implies
𝔼X\[\#\{i:μi\(Σ^\)Leffγ0\>1/4\}\]≲\(Leffγ\)1/a\.\\mathbb\{E\}\_\{X\}\\Bigl\[\\\#\\\{i:\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\>1/4\\\}\\Bigr\]\\lesssim\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\.Moreover,
\(Leffγ0\)∑i:μi\(Σ^\)Leffγ0≤1/4μi\(Σ^\)≲\(Leffγ\)∑i≳\(Leffγ\)1/ai−a≲\(Leffγ\)1/a\.\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)\\sum\_\{i:\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\\leq 1/4\}\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)\\lesssim\(L\_\{\\mathrm\{eff\}\}\\gamma\)\\sum\_\{i\\gtrsim\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\}i^\{\-a\}\\lesssim\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\.Since alsoDU≤MD\_\{U\}\\leq M, we obtain
VarGD≲min\{M,\(Leffγ\)1/a\}N\.\\mathrm\{Var\}\_\{\\mathrm\{GD\}\}\\lesssim\\frac\{\\min\\\{M,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}\}\{N\}\.
For the lower bound, first assume
\(Leffγ\)1/a≤M/c\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\leq M/cfor a sufficiently large constantc\>0c\>0\. By Lemmas[K\.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7)and[K\.10](https://arxiv.org/html/2605.24316#A11.Thmlemma10), with high probability we have
μi\(Σ\)≍i−a,μi\(Σ^\)≍i−afori≤c−1min\{M,N\}\.\\mu\_\{i\}\(\\Sigma\)\\asymp i^\{\-a\},\\qquad\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)\\asymp i^\{\-a\}\\qquad\\text\{for \}i\\leq c^\{\-1\}\\min\\\{M,N\\\}\.Therefore the first term inDLD\_\{L\}satisfies
DL≳\(Leffγ\)2∑i≳\(Leffγ\)1/ai−2a≳\(Leffγ\)1/a\.D\_\{L\}\\gtrsim\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{2\}\\sum\_\{i\\gtrsim\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\}i^\{\-2a\}\\gtrsim\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\.
If instead
\(Leffγ\)1/a≥M/c,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\geq M/c,then the second term inDLD\_\{L\}and the same empirical\-spectrum estimate imply
DL≳∑i≤M/cμi\(Σ\)μi\(Σ^\)≳M\.D\_\{L\}\\gtrsim\\sum\_\{i\\leq M/c\}\\frac\{\\mu\_\{i\}\(\\Sigma\)\}\{\\mu\_\{i\}\(\\widehat\{\\Sigma\}\)\}\\gtrsim M\.Combining the two regimes with Lemma[E\.1](https://arxiv.org/html/2605.24316#A5.Thmlemma1)yields
VarGD≳min\{M,\(Leffγ\)1/a\}N\.\\mathrm\{Var\}\_\{\\mathrm\{GD\}\}\\gtrsim\\frac\{\\min\\\{M,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}\}\{N\}\.This proves the lemma\. ∎
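The scaling $D_{U}\asymp D_{L}\asymp\min\{M,(L_{\mathrm{eff}}\gamma)^{1/a}\}$ can be illustrated in an idealized setting. The sketch below assumes, purely for illustration, that both $\mu_{i}(\Sigma)$ and $\mu_{i}(\widehat{\Sigma})$ equal the exact power law $i^{-a}$ and that $\gamma_{0}=\gamma$; the proof only establishes these relations up to constants and with high probability. The values of `a`, `M`, and `L_gamma` are hypothetical.

```python
def variance_proxies(L_gamma, M, a):
    """Evaluate D_U and D_L with both spectra replaced by the power law i^{-a}."""
    mu = [i ** (-a) for i in range(1, M + 1)]
    large = [m for m in mu if m * L_gamma > 0.25]   # eigenvalues above the threshold
    small = [m for m in mu if m * L_gamma <= 0.25]  # eigenvalues below the threshold
    D_U = len(large) + L_gamma * sum(small)
    # When the two spectra coincide, mu_i(Sigma) / mu_i(Sigma_hat) = 1 on the large set.
    D_L = L_gamma ** 2 * sum(m * m for m in small) + len(large) / 5
    return D_U, D_L

a, M = 2.0, 10_000
ratios = []
for L_gamma in (1e2, 1e4, 1e6):
    target = min(M, L_gamma ** (1 / a))
    D_U, D_L = variance_proxies(L_gamma, M, a)
    ratios.append((D_U / target, D_L / target))
```

In this idealized setting both ratios stay within fixed constants as `L_gamma` varies over four orders of magnitude, consistent with the two-sided bound of the lemma after dividing by $N$.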
## Appendix F One-pass Batch SGD: Excess Error Decomposition
This section focuses on the one-pass batch SGD procedure in equation [2](https://arxiv.org/html/2605.24316#S2.E2) and uses the notation from Section [A.3](https://arxiv.org/html/2605.24316#A1.SS3). In particular, the centered error $e_{t}:=u_{t}^{\mathrm{op}}-u^{\ast}$ satisfies
$$e_{t}=\bigl(I-\gamma_{t}\widehat{\Sigma}_{t}^{(B)}\bigr)e_{t-1}+\gamma_{t}\widehat{\xi}_{t}^{(B)}.\tag{12}$$
### F.1 Excess error decomposition
Since $u^{\ast}$ is the minimizer of the sketched population risk, the excess error is
$$\mathcal{E}_{\mathrm{ex}}^{(B)}:=R_{M}(u_{T}^{\mathrm{op}})-R_{M}(u^{\ast})=\|u_{T}^{\mathrm{op}}-u^{\ast}\|_{\Sigma}^{2}=\|e_{T}\|_{\Sigma}^{2}.$$
Taking expectation, we will decompose
$$\mathbb{E}\mathcal{E}_{\mathrm{ex}}^{(B)}=\mathbb{E}\|e_{T}\|_{\Sigma}^{2}$$
into deterministic and stochastic contributions.
We use throughout the original mean-centered decomposition
$$e_{T}=m_{T}+(e_{T}-m_{T}),\qquad m_{T}:=\mathbb{E}[e_{T}].$$
###### Definition F\.1\(Bias and centered variance\)\.
Define the mean iterate error and centered fluctuation by
mt:=𝔼\[et\],δt:=et−mt\.m\_\{t\}:=\\mathbb\{E\}\[e\_\{t\}\],\\qquad\\delta\_\{t\}:=e\_\{t\}\-m\_\{t\}\.The one\-pass bias and variance terms are
BiasB:=‖mT‖Σ2,VarB:=𝔼‖δT‖Σ2\.\\mathrm\{Bias\}\_\{B\}:=\\\|m\_\{T\}\\\|\_\{\\Sigma\}^\{2\},\\qquad\\mathrm\{Var\}\_\{B\}:=\\mathbb\{E\}\\\|\\delta\_\{T\}\\\|\_\{\\Sigma\}^\{2\}\.Since
𝔼\[Σ^t\(B\)\]=Σ,𝔼\[ξ^t\(B\)\]=0,\\mathbb\{E\}\[\\widehat\{\\Sigma\}\_\{t\}^\{\(B\)\}\]=\\Sigma,\\qquad\\mathbb\{E\}\[\\widehat\{\\xi\}\_\{t\}^\{\(B\)\}\]=0,the mean recursion satisfies
mt=\(I−γtΣ\)mt−1,m0=−u∗,m\_\{t\}=\(I\-\\gamma\_\{t\}\\Sigma\)m\_\{t\-1\},\\qquad m\_\{0\}=\-u^\{\\ast\},and hence
mT=−∏t=1T\(I−γtΣ\)u∗\.m\_\{T\}=\-\\prod\_\{t=1\}^\{T\}\(I\-\\gamma\_\{t\}\\Sigma\)u^\{\\ast\}\.Therefore
BiasB=∥∏t=1T\(I−γtΣ\)u∗∥Σ2\.\\boxed\{\\mathrm\{Bias\}\_\{B\}=\\left\\\|\\prod\_\{t=1\}^\{T\}\(I\-\\gamma\_\{t\}\\Sigma\)u^\{\\ast\}\\right\\\|\_\{\\Sigma\}^\{2\}\.\}
###### Proposition F\.1\(Exact mean\-centered decomposition of the excess error\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1)and[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2)\. Under the one\-pass batch SGD recursion equation[12](https://arxiv.org/html/2605.24316#A6.E12),
𝔼ℰex\(B\)=𝔼‖eT‖Σ2=BiasB\+VarB\.\\mathbb\{E\}\\mathcal\{E\}\_\{\\mathrm\{ex\}\}^\{\(B\)\}=\\mathbb\{E\}\\\|e\_\{T\}\\\|\_\{\\Sigma\}^\{2\}=\\mathrm\{Bias\}\_\{B\}\+\\mathrm\{Var\}\_\{B\}\.
###### Proof\.
SinceeT=mT\+δTe\_\{T\}=m\_\{T\}\+\\delta\_\{T\}, we have
‖eT‖Σ2=‖mT‖Σ2\+2⟨mT,δT⟩Σ\+‖δT‖Σ2\.\\\|e\_\{T\}\\\|\_\{\\Sigma\}^\{2\}=\\\|m\_\{T\}\\\|\_\{\\Sigma\}^\{2\}\+2\\langle m\_\{T\},\\delta\_\{T\}\\rangle\_\{\\Sigma\}\+\\\|\\delta\_\{T\}\\\|\_\{\\Sigma\}^\{2\}\.Taking expectation and using𝔼\[δT\]=0\\mathbb\{E\}\[\\delta\_\{T\}\]=0gives
𝔼‖eT‖Σ2=‖mT‖Σ2\+𝔼‖δT‖Σ2,\\mathbb\{E\}\\\|e\_\{T\}\\\|\_\{\\Sigma\}^\{2\}=\\\|m\_\{T\}\\\|\_\{\\Sigma\}^\{2\}\+\\mathbb\{E\}\\\|\\delta\_\{T\}\\\|\_\{\\Sigma\}^\{2\},which is exactly the claimed identity\. ∎
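The mean recursion and the decomposition of Proposition F.1 can be verified exactly on a toy problem. The sketch below uses a scalar model ($M=1$) with a hypothetical three-point population of pairs `(z, y)`, batch size `B = 2`, and `T = 2` steps, and enumerates every possible sequence of i.i.d. batches with exact rational arithmetic. All numerical values are illustrative assumptions.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical population of (z, y) pairs, each drawn with probability 1/3.
pop = [(F(1), F(2)), (F(2), F(1)), (F(-1), F(1))]
Sigma = sum(z * z for z, _ in pop) / len(pop)             # population second moment
u_star = sum(z * y for z, y in pop) / len(pop) / Sigma    # population minimizer
gammas = [F(1, 4), F(1, 8)]                               # hypothetical step sizes
B, T = 2, len(gammas)

def final_error(draws):
    """Run e_t = (1 - g*Sigma_hat_t) e_{t-1} + g*xi_hat_t on one sequence of batches."""
    e = -u_star  # u_0 = 0, so e_0 = -u_star
    for t, g in enumerate(gammas):
        batch = draws[t * B:(t + 1) * B]
        Sigma_hat = sum(z * z for z, _ in batch) / B
        xi_hat = sum(z * (y - z * u_star) for z, y in batch) / B
        e = (1 - g * Sigma_hat) * e + g * xi_hat
    return e

# Enumerate all equally likely outcomes of the B*T independent draws.
outcomes = [final_error(d) for d in product(pop, repeat=B * T)]
n = len(outcomes)

m_T = sum(outcomes) / n                                   # E[e_T]
excess = sum(Sigma * e * e for e in outcomes) / n         # E ||e_T||_Sigma^2
bias = Sigma * m_T ** 2                                   # ||m_T||_Sigma^2
var = sum(Sigma * (e - m_T) ** 2 for e in outcomes) / n   # E ||delta_T||_Sigma^2

# Closed form m_T = -prod_t (1 - gamma_t * Sigma) * u_star
m_closed = -u_star
for g in gammas:
    m_closed *= 1 - g * Sigma
```

Because each batch is independent of the past and $\mathbb{E}[\widehat{\xi}_{t}^{(B)}]=0$ at the population minimizer, the enumerated mean matches the closed form exactly, and the excess error splits into bias plus variance with no cross term.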
### F.2 Fluctuation recursion and an exact split of the variance term
Subtracting the mean recursion from equation[12](https://arxiv.org/html/2605.24316#A6.E12)gives the exact centered fluctuation recursion
δt=\(I−γtΣ^t\(B\)\)δt−1\+γt\(Σ−Σ^t\(B\)\)mt−1\+γtξ^t\(B\)\.\\delta\_\{t\}=\\bigl\(I\-\\gamma\_\{t\}\\widehat\{\\Sigma\}\_\{t\}^\{\(B\)\}\\bigr\)\\delta\_\{t\-1\}\+\\gamma\_\{t\}\\bigl\(\\Sigma\-\\widehat\{\\Sigma\}\_\{t\}^\{\(B\)\}\\bigr\)m\_\{t\-1\}\+\\gamma\_\{t\}\\widehat\{\\xi\}\_\{t\}^\{\(B\)\}\.\(13\)Equivalently, in the shorthand notation of Section[A\.3](https://arxiv.org/html/2605.24316#A1.SS3),
δt=\(I−γtZ¯t\)δt−1\+γt\(Σ−Z¯t\)mt−1\+γtξ¯t\.\\delta\_\{t\}=\\bigl\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\\bigr\)\\delta\_\{t\-1\}\+\\gamma\_\{t\}\(\\Sigma\-\\bar\{Z\}\_\{t\}\)m\_\{t\-1\}\+\\gamma\_\{t\}\\bar\{\\xi\}\_\{t\}\.\(14\)
For the variance analysis it is convenient to splitδt\\delta\_\{t\}into the part caused by batch\-covariance fluctuations and the part caused by the additive label noise\.
###### Definition F\.2\(Centered covariance and noise components\)\.
Define two auxiliary processes $(q_{t})$ and $(v_{t})$ by
$$q_{0}=0,\qquad v_{0}=0,$$
and, for $t\geq 1$,
$$
\begin{aligned}
q_{t}&=(I-\gamma_{t}\bar{Z}_{t})q_{t-1}+\gamma_{t}(\Sigma-\bar{Z}_{t})m_{t-1},\\
v_{t}&=(I-\gamma_{t}\bar{Z}_{t})v_{t-1}+\gamma_{t}\bar{\xi}_{t}.
\end{aligned}
$$
Define the corresponding quadratic terms by
$$\mathrm{Var}_{B}^{\mathrm{cov}}:=\mathbb{E}\|q_{T}\|_{\Sigma}^{2},\qquad\mathrm{Var}_{B}^{\mathrm{noise}}:=\mathbb{E}\|v_{T}\|_{\Sigma}^{2}.$$
###### Proposition F\.2\(Exact split of the centered variance\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1)and[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2)\. Then, for everytt,
δt=qt\+vt,𝔼\[qt\]=0\.\\delta\_\{t\}=q\_\{t\}\+v\_\{t\},\\qquad\\mathbb\{E\}\[q\_\{t\}\]=0\.Moreover, if
𝒢t:=σ\(zs,b:1≤s≤t,1≤b≤B\),\\mathcal\{G\}\_\{t\}:=\\sigma\\bigl\(z\_\{s,b\}:1\\leq s\\leq t,\\ 1\\leq b\\leq B\\bigr\),then
𝔼\[vt∣𝒢t,w∗\]=0,\\mathbb\{E\}\[v\_\{t\}\\mid\\mathcal\{G\}\_\{t\},w^\{\\ast\}\]=0,and therefore
VarB=VarBcov\+VarBnoise\.\\mathrm\{Var\}\_\{B\}=\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}\+\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\}\.
###### Proof\.
The identityδt=qt\+vt\\delta\_\{t\}=q\_\{t\}\+v\_\{t\}follows by induction from equation[14](https://arxiv.org/html/2605.24316#A6.E14)\. Indeed, it is true at timet=0t=0, and ifδt−1=qt−1\+vt−1\\delta\_\{t\-1\}=q\_\{t\-1\}\+v\_\{t\-1\}, then
δt=\(I−γtZ¯t\)\(qt−1\+vt−1\)\+γt\(Σ−Z¯t\)mt−1\+γtξ¯t=qt\+vt\.\\delta\_\{t\}=\\bigl\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\\bigr\)\(q\_\{t\-1\}\+v\_\{t\-1\}\)\+\\gamma\_\{t\}\(\\Sigma\-\\bar\{Z\}\_\{t\}\)m\_\{t\-1\}\+\\gamma\_\{t\}\\bar\{\\xi\}\_\{t\}=q\_\{t\}\+v\_\{t\}\.
Next,𝔼\[q0\]=0\\mathbb\{E\}\[q\_\{0\}\]=0\. If𝔼\[qt−1\]=0\\mathbb\{E\}\[q\_\{t\-1\}\]=0, then using thatqt−1q\_\{t\-1\}depends only on batches1,…,t−11,\\dots,t\-1whileZ¯t\\bar\{Z\}\_\{t\}comes from the disjointtt\-th batch, we may use independence ofqt−1q\_\{t\-1\}andZ¯t\\bar\{Z\}\_\{t\}, together with𝔼\[Z¯t\]=Σ\\mathbb\{E\}\[\\bar\{Z\}\_\{t\}\]=\\Sigmaand the fact thatmt−1m\_\{t\-1\}is deterministic, to obtain
𝔼\[qt\]=\(I−γtΣ\)𝔼\[qt−1\]\+γt𝔼\[\(Σ−Z¯t\)mt−1\]=0\.\\mathbb\{E\}\[q\_\{t\}\]=\(I\-\\gamma\_\{t\}\\Sigma\)\\mathbb\{E\}\[q\_\{t\-1\}\]\+\\gamma\_\{t\}\\mathbb\{E\}\[\(\\Sigma\-\\bar\{Z\}\_\{t\}\)m\_\{t\-1\}\]=0\.Thus𝔼\[qt\]=0\\mathbb\{E\}\[q\_\{t\}\]=0for everytt\.
To prove the conditional mean\-zero property forvtv\_\{t\}, fix one sample in the current batch and writez=Sxz=Sx\. Conditioning on the sketch matrixSS, the pair\(x,z\)\(x,z\)is jointly Gaussian with
𝔼\[x∣z\]=HS⊤\(SHS⊤\)−1z=HS⊤Σ−1z\.\\mathbb\{E\}\[x\\mid z\]=HS^\{\\top\}\(SHS^\{\\top\}\)^\{\-1\}z=HS^\{\\top\}\\Sigma^\{\-1\}z\.Sinceu∗=Σ−1SHw∗u^\{\\ast\}=\\Sigma^\{\-1\}SHw^\{\\ast\}, the Gaussian conditional\-regression identity yields
𝔼\[⟨x,w∗⟩∣z,w∗\]=⟨𝔼\[x∣z\],w∗⟩=⟨HS⊤Σ−1z,w∗⟩=z⊤u∗\.\\mathbb\{E\}\[\\langle x,w^\{\\ast\}\\rangle\\mid z,w^\{\\ast\}\]=\\langle\\mathbb\{E\}\[x\\mid z\],w^\{\\ast\}\\rangle=\\langle HS^\{\\top\}\\Sigma^\{\-1\}z,w^\{\\ast\}\\rangle=z^\{\\top\}u^\{\\ast\}\.On the other hand, Assumption[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2)implies
𝔼\[y−⟨x,w∗⟩∣x,w∗\]=0\.\\mathbb\{E\}\[y\-\\langle x,w^\{\\ast\}\\rangle\\mid x,w^\{\\ast\}\]=0\.Therefore, by the tower property,
𝔼\[y−z⊤u∗∣z,w∗\]=𝔼\[y−⟨x,w∗⟩∣z,w∗\]\+𝔼\[⟨x,w∗⟩−z⊤u∗∣z,w∗\]=0\.\\mathbb\{E\}\[y\-z^\{\\top\}u^\{\\ast\}\\mid z,w^\{\\ast\}\]=\\mathbb\{E\}\[y\-\\langle x,w^\{\\ast\}\\rangle\\mid z,w^\{\\ast\}\]\+\\mathbb\{E\}\[\\langle x,w^\{\\ast\}\\rangle\-z^\{\\top\}u^\{\\ast\}\\mid z,w^\{\\ast\}\]=0\.Multiplying byzz, which is measurable with respect toσ\(z,w∗\)\\sigma\(z,w^\{\\ast\}\), gives
𝔼\[z\(y−z⊤u∗\)∣z,w∗\]=0\.\\mathbb\{E\}\\bigl\[z\(y\-z^\{\\top\}u^\{\\ast\}\)\\mid z,w^\{\\ast\}\\bigr\]=0\.Applying this to each summand inξ¯t=1B∑b=1Bzt,b\(yit,b−zt,b⊤u∗\)\\bar\{\\xi\}\_\{t\}=\\frac\{1\}\{B\}\\sum\_\{b=1\}^\{B\}z\_\{t,b\}\(y\_\{i\_\{t,b\}\}\-z\_\{t,b\}^\{\\top\}u^\{\\ast\}\), and conditioning on𝒢t\\mathcal\{G\}\_\{t\}, which fixes the current sketched covariates\(zt,b\)b=1B\(z\_\{t,b\}\)\_\{b=1\}^\{B\}, we obtain
𝔼\[ξ¯t∣𝒢t,w∗\]=0\.\\mathbb\{E\}\[\\bar\{\\xi\}\_\{t\}\\mid\\mathcal\{G\}\_\{t\},w^\{\\ast\}\]=0\.Becausevt−1v\_\{t\-1\}depends only on the firstt−1t\-1batches, it is independent of the current block\(zt,b\)b=1B\(z\_\{t,b\}\)\_\{b=1\}^\{B\}\. Hence
$$\mathbb{E}[v_t\mid\mathcal{G}_t,w^\ast] = (I-\gamma_t\bar{Z}_t)\,\mathbb{E}[v_{t-1}\mid\mathcal{G}_t,w^\ast] + \gamma_t\,\mathbb{E}[\bar{\xi}_t\mid\mathcal{G}_t,w^\ast],$$
where $\bar{Z}_t$ can be pulled out of the conditional expectation because it is $\mathcal{G}_t$-measurable. Since $v_{t-1}$ is independent of the current block, conditioning additionally on that block does not change its conditional mean, and the induction hypothesis gives $\mathbb{E}[v_{t-1}\mid\mathcal{G}_t,w^\ast] = \mathbb{E}[v_{t-1}\mid\mathcal{G}_{t-1},w^\ast] = 0$. Starting from $v_0 = 0$, we therefore obtain by induction that
𝔼\[vt∣𝒢t,w∗\]=0for allt\.\\mathbb\{E\}\[v\_\{t\}\\mid\\mathcal\{G\}\_\{t\},w^\{\\ast\}\]=0\\qquad\\text\{for all \}t\.
Finally,qTq\_\{T\}is𝒢T\\mathcal\{G\}\_\{T\}\-measurable, so
𝔼⟨qT,vT⟩Σ=𝔼\[𝔼\[⟨qT,vT⟩Σ∣𝒢T,w∗\]\]=0\.\\mathbb\{E\}\\langle q\_\{T\},v\_\{T\}\\rangle\_\{\\Sigma\}=\\mathbb\{E\}\\Bigl\[\\mathbb\{E\}\\bigl\[\\langle q\_\{T\},v\_\{T\}\\rangle\_\{\\Sigma\}\\mid\\mathcal\{G\}\_\{T\},w^\{\\ast\}\\bigr\]\\Bigr\]=0\.UsingδT=qT\+vT\\delta\_\{T\}=q\_\{T\}\+v\_\{T\}, we conclude that
𝔼‖δT‖Σ2=𝔼‖qT‖Σ2\+𝔼‖vT‖Σ2=VarBcov\+VarBnoise,\\mathbb\{E\}\\\|\\delta\_\{T\}\\\|\_\{\\Sigma\}^\{2\}=\\mathbb\{E\}\\\|q\_\{T\}\\\|\_\{\\Sigma\}^\{2\}\+\\mathbb\{E\}\\\|v\_\{T\}\\\|\_\{\\Sigma\}^\{2\}=\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}\+\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\},as claimed\. ∎
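The pathwise identity $\delta_t = q_t + v_t$ can be sanity-checked numerically. The following is a minimal sketch in one dimension, assuming a well-specified scalar model $y = z u^\ast + \varepsilon$ with Gaussian $z$, a constant stepsize, and the initialization $u_0 = 0$; the function name `simulate_split` and all parameter values are illustrative choices, not part of the paper.

```python
import random

def simulate_split(T=50, B=4, gamma=0.05, sigma_z=1.0, noise_sd=0.5,
                   u_star=2.0, seed=0):
    """Run the scalar one-pass recursions and return (delta_T, q_T, v_T)."""
    rng = random.Random(seed)
    Sigma = sigma_z ** 2   # population second moment of z
    e = -u_star            # error iterate e_0 = u_0 - u*, assuming u_0 = 0
    m = -u_star            # mean iterate m_0 = -u*
    q = 0.0                # covariance-fluctuation component
    v = 0.0                # additive-noise component
    for _ in range(T):
        zs = [rng.gauss(0.0, sigma_z) for _ in range(B)]
        ys = [z * u_star + rng.gauss(0.0, noise_sd) for z in zs]
        Z_bar = sum(z * z for z in zs) / B
        xi_bar = sum(z * (y - z * u_star) for z, y in zip(zs, ys)) / B
        # Every update uses m_{t-1}, so the mean iterate is advanced last.
        e = (1 - gamma * Z_bar) * e + gamma * xi_bar
        q = (1 - gamma * Z_bar) * q + gamma * (Sigma - Z_bar) * m
        v = (1 - gamma * Z_bar) * v + gamma * xi_bar
        m = (1 - gamma * Sigma) * m
    return e - m, q, v

delta_T, q_T, v_T = simulate_split()
print(abs(delta_T - (q_T + v_T)))  # zero up to floating-point rounding
```

The check is deterministic along any sample path because the identity is algebraic; the orthogonality $\mathbb{E}\langle q_T, v_T\rangle_\Sigma = 0$ would instead require averaging over many seeds.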
## Appendix G Bias Error for One\-pass Batch SGD
This section focuses on the one\-pass batch SGD procedure equation[2](https://arxiv.org/html/2605.24316#S2.E2)and uses the notation from Section[A\.3](https://arxiv.org/html/2605.24316#A1.SS3)\. In particular,
T:=NB,Teff:=TlogT,Σ:=SHS⊤,u∗:=Σ−1SHw∗\.T:=\\frac\{N\}\{B\},\\qquad T\_\{\\mathrm\{eff\}\}:=\\frac\{T\}\{\\log T\},\\qquad\\Sigma:=SHS^\{\\top\},\\qquad u^\{\\ast\}:=\\Sigma^\{\-1\}SHw^\{\\ast\}\.By Definition[F\.1](https://arxiv.org/html/2605.24316#A6.Thmdefinition1), this section studies the exact one\-pass bias term
BiasB=‖mT‖Σ2\.\\mathrm\{Bias\}\_\{B\}=\\\|m\_\{T\}\\\|\_\{\\Sigma\}^\{2\}\.
Define
CT:=∏t=1T\(I−γtΣ\)\.C\_\{T\}:=\\prod\_\{t=1\}^\{T\}\(I\-\\gamma\_\{t\}\\Sigma\)\.Since the mean recursion satisfiesmt=\(I−γtΣ\)mt−1m\_\{t\}=\(I\-\\gamma\_\{t\}\\Sigma\)m\_\{t\-1\}andm0=−u∗m\_\{0\}=\-u^\{\\ast\}, we have
mT=−CTu∗,BiasB=‖CTu∗‖Σ2\.m\_\{T\}=\-C\_\{T\}u^\{\\ast\},\\qquad\\mathrm\{Bias\}\_\{B\}=\\\|C\_\{T\}u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\.
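Because $C_T$ is a polynomial in $\Sigma$, the bias can be evaluated coordinate-wise in the eigenbasis of $\Sigma$. The sketch below assumes $\Sigma$ is already diagonal with eigenvalues `eigs` and takes an arbitrary stepsize list `gammas`; both function names are hypothetical, and the stepsize values in the usage lines are placeholders rather than the schedule of Assumption 3.

```python
def bias_recursion(eigs, u_star, gammas):
    """Iterate m_t = (I - gamma_t * Sigma) m_{t-1} from m_0 = -u* and
    return ||m_T||_Sigma^2, with Sigma = diag(eigs)."""
    m = [-u for u in u_star]
    for g in gammas:
        m = [(1 - g * lam) * mj for lam, mj in zip(eigs, m)]
    return sum(lam * mj * mj for lam, mj in zip(eigs, m))

def bias_closed_form(eigs, u_star, gammas):
    """Evaluate ||C_T u*||_Sigma^2 with C_T = prod_t (I - gamma_t * Sigma)."""
    total = 0.0
    for lam, u in zip(eigs, u_star):
        c = 1.0
        for g in gammas:
            c *= (1 - g * lam)
        total += lam * (c * u) ** 2
    return total

eigs = [i ** -2.0 for i in range(1, 6)]   # illustrative power-law spectrum
u_star = [1.0] * 5
gammas = [0.5] * 10 + [0.25] * 10          # illustrative two-block schedule
print(bias_recursion(eigs, u_star, gammas), bias_closed_form(eigs, u_star, gammas))
```

The two values agree up to rounding, which is the identity $\mathrm{Bias}_B = \|C_T u^\ast\|_\Sigma^2$ in coordinates.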
### G\.1 Upper and lower bounds for the bias term
###### Lemma G\.1\(Upper bound on the one\-pass batch bias\)\.
Assume Assumptions[2](https://arxiv.org/html/2605.24316#Thmassumption2),[4](https://arxiv.org/html/2605.24316#Thmassumption4), and[3](https://arxiv.org/html/2605.24316#Thmassumption3), with the convention in Theorem[3\.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1)\. Fix any integerk≤M/3k\\leq M/3such thatrank\(H\)≥k\+M\\operatorname\{rank\}\(H\)\\geq k\+M, and define
Ak:=Sk:∞Hk:∞Sk:∞⊤\.A\_\{k\}:=S\_\{k:\\infty\}H\_\{k:\\infty\}S\_\{k:\\infty\}^\{\\top\}\.Then, with probability at least
1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the randomness ofSS,
BiasB≲‖w0:k∗‖22Teffγ\(μM/2\(Ak\)μM\(Ak\)\)2\+‖wk:∞∗‖Hk:∞2\.\\mathrm\{Bias\}\_\{B\}\\lesssim\\frac\{\\\|w^\{\\ast\}\_\{0:k\}\\\|\_\{2\}^\{2\}\}\{T\_\{\\mathrm\{eff\}\}\\gamma\}\\left\(\\frac\{\\mu\_\{M/2\}\(A\_\{k\}\)\}\{\\mu\_\{M\}\(A\_\{k\}\)\}\\right\)^\{2\}\+\\\|w^\{\\ast\}\_\{k:\\infty\}\\\|\_\{H\_\{k:\\infty\}\}^\{2\}\.In particular,
BiasB≲‖u∗‖22Teffγ\.\\mathrm\{Bias\}\_\{B\}\\lesssim\\frac\{\\\|u^\{\\ast\}\\\|\_\{2\}^\{2\}\}\{T\_\{\\mathrm\{eff\}\}\\gamma\}\.
###### Proof\.
By Assumption[4](https://arxiv.org/html/2605.24316#Thmassumption4), we may work in the diagonal coordinates ofHH, as encoded in Assumption[2](https://arxiv.org/html/2605.24316#Thmassumption2)\. Define
MT:=CTΣCT\.M\_\{T\}:=C\_\{T\}\\Sigma C\_\{T\}\.By the effective\-time comparison, we have
MT⪯\(I−γΣ\)TeffΣ\(I−γΣ\)Teff=:M\.M\_\{T\}\\preceq\(I\-\\gamma\\Sigma\)^\{T\_\{\\mathrm\{eff\}\}\}\\Sigma\(I\-\\gamma\\Sigma\)^\{T\_\{\\mathrm\{eff\}\}\}=:M\.Therefore,
$$
\begin{aligned}
\mathrm{Bias}_B &= \langle M_T, u^\ast u^{\ast\top}\rangle \\
&\le \langle M, u^\ast u^{\ast\top}\rangle \\
&= w^{\ast\top} H S^\top \Sigma^{-1} M \Sigma^{-1} S H w^\ast.
\end{aligned}
\tag{15}
$$
Now decompose
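$$SH = \bigl(S_{0:k}H_{0:k},\; S_{k:\infty}H_{k:\infty}\bigr),$$
so that $SHw^\ast = S_{0:k}H_{0:k}w^\ast_{0:k} + S_{k:\infty}H_{k:\infty}w^\ast_{k:\infty}$. Since $\Sigma^{-1}M\Sigma^{-1}$ is positive semidefinite, the elementary inequality $(a+b)^\top G(a+b) \le 2a^\top Ga + 2b^\top Gb$, valid for any positive semidefinite matrix $G$, applied to the last line of the preceding display gives the spectral-truncation bound
$$\mathrm{Bias}_B \le 2T_1 + 2T_2,$$
where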
T1:=w0:k∗⊤H0:kS0:k⊤Σ−1MΣ−1S0:kH0:kw0:k∗,T\_\{1\}:=w\_\{0:k\}^\{\\ast\\top\}H\_\{0:k\}S\_\{0:k\}^\{\\top\}\\Sigma^\{\-1\}M\\Sigma^\{\-1\}S\_\{0:k\}H\_\{0:k\}w\_\{0:k\}^\{\\ast\},and
T2:=wk:∞∗⊤Hk:∞Sk:∞⊤Σ−1MΣ−1Sk:∞Hk:∞wk:∞∗\.T\_\{2\}:=w\_\{k:\\infty\}^\{\\ast\\top\}H\_\{k:\\infty\}S\_\{k:\\infty\}^\{\\top\}\\Sigma^\{\-1\}M\\Sigma^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}w\_\{k:\\infty\}^\{\\ast\}\.
For the head term, we have
T1≤‖M‖2⋅‖Σ−1S0:kH0:k‖22⋅‖w0:k∗‖22\.T\_\{1\}\\leq\\\|M\\\|\_\{2\}\\cdot\\\|\\Sigma^\{\-1\}S\_\{0:k\}H\_\{0:k\}\\\|\_\{2\}^\{2\}\\cdot\\\|w\_\{0:k\}^\{\\ast\}\\\|\_\{2\}^\{2\}\.By spectral calculus,
‖M‖2=maxx∈\[0,‖Σ‖2\]x\(1−γx\)2Teff≲1Teffγ\.\\\|M\\\|\_\{2\}=\\max\_\{x\\in\[0,\\\|\\Sigma\\\|\_\{2\}\]\}x\(1\-\\gamma x\)^\{2T\_\{\\mathrm\{eff\}\}\}\\lesssim\\frac\{1\}\{T\_\{\\mathrm\{eff\}\}\\gamma\}\.Moreover, Lemma[K\.6](https://arxiv.org/html/2605.24316#A11.Thmlemma6)yields
‖Σ−1S0:kH0:k‖2≲μM/2\(Ak\)μM\(Ak\)\\\|\\Sigma^\{\-1\}S\_\{0:k\}H\_\{0:k\}\\\|\_\{2\}\\lesssim\\frac\{\\mu\_\{M/2\}\(A\_\{k\}\)\}\{\\mu\_\{M\}\(A\_\{k\}\)\}with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)\. Hence
T1≲‖w0:k∗‖22Teffγ\(μM/2\(Ak\)μM\(Ak\)\)2\.T\_\{1\}\\lesssim\\frac\{\\\|w\_\{0:k\}^\{\\ast\}\\\|\_\{2\}^\{2\}\}\{T\_\{\\mathrm\{eff\}\}\\gamma\}\\left\(\\frac\{\\mu\_\{M/2\}\(A\_\{k\}\)\}\{\\mu\_\{M\}\(A\_\{k\}\)\}\\right\)^\{2\}\.
For the tail term, since0⪯I−γΣ⪯I0\\preceq I\-\\gamma\\Sigma\\preceq I, we also have
0⪯M=\(I−γΣ\)TeffΣ\(I−γΣ\)Teff⪯Σ\.0\\preceq M=\(I\-\\gamma\\Sigma\)^\{T\_\{\\mathrm\{eff\}\}\}\\Sigma\(I\-\\gamma\\Sigma\)^\{T\_\{\\mathrm\{eff\}\}\}\\preceq\\Sigma\.Therefore,
$$
\begin{aligned}
T_2 &\le w_{k:\infty}^{\ast\top} H_{k:\infty} S_{k:\infty}^{\top} \Sigma^{-1}\Sigma\Sigma^{-1} S_{k:\infty} H_{k:\infty} w_{k:\infty}^{\ast} \\
&= w_{k:\infty}^{\ast\top} H_{k:\infty}^{1/2} \Bigl(H_{k:\infty}^{1/2} S_{k:\infty}^{\top} \Sigma^{-1} S_{k:\infty} H_{k:\infty}^{1/2}\Bigr) H_{k:\infty}^{1/2} w_{k:\infty}^{\ast} \\
&\le \|w_{k:\infty}^{\ast}\|_{H_{k:\infty}}^{2},
\end{aligned}
$$
where the last step uses
0⪯Hk:∞1/2Sk:∞⊤Σ−1Sk:∞Hk:∞1/2⪯I\.0\\preceq H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}\\Sigma^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}\\preceq I\.Combining the bounds onT1T\_\{1\}andT2T\_\{2\}proves the first claim\.
For the simpler bound, we may ignore the head–tail split and use
BiasB≤‖M‖2‖u∗‖22≲‖u∗‖22Teffγ\.\\mathrm\{Bias\}\_\{B\}\\leq\\\|M\\\|\_\{2\}\\,\\\|u^\{\\ast\}\\\|\_\{2\}^\{2\}\\lesssim\\frac\{\\\|u^\{\\ast\}\\\|\_\{2\}^\{2\}\}\{T\_\{\\mathrm\{eff\}\}\\gamma\}\.∎
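The spectral-calculus step $\max_x x(1-\gamma x)^{2T_{\mathrm{eff}}} \lesssim 1/(T_{\mathrm{eff}}\gamma)$ follows from $1-u \le e^{-u}$ and $\max_{x\ge 0} x e^{-cx} = 1/(ec)$, which together give the explicit constant $1/(2e\,\gamma T_{\mathrm{eff}})$ on the range $0 \le \gamma x \le 1$. The sketch below checks this on a grid; `kernel_sup` is a hypothetical helper, and the integer `T` stands in for $T_{\mathrm{eff}}$ purely for simplicity.

```python
import math

def kernel_sup(gamma, T, x_max, grid=20000):
    """Grid approximation of max over x in [0, x_max] of x * (1 - gamma*x)^(2T)."""
    best = 0.0
    for j in range(grid + 1):
        x = x_max * j / grid
        best = max(best, x * (1 - gamma * x) ** (2 * T))
    return best

gamma, T = 0.1, 200
sup_val = kernel_sup(gamma, T, x_max=1.0 / gamma)
bound = 1.0 / (2 * math.e * gamma * T)
print(sup_val, bound)  # the grid maximum stays below the analytic bound
```

The maximizer sits near $x = 1/(\gamma(2T+1))$, so the bound is tight up to a factor close to one when $T$ is large.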
###### Lemma G\.2\(Lower bound on the one\-pass batch bias\)\.
Let
Hw:=𝔼w∗\[w∗w∗⊤\],Σw:=SHHwHS⊤\.H\_\{w\}:=\\mathbb\{E\}\_\{w^\{\\ast\}\}\[w^\{\\ast\}w^\{\\ast\\top\}\],\\qquad\\Sigma\_\{w\}:=SHH\_\{w\}HS^\{\\top\}\.Then, conditioned on the sketch matrixSS,
𝔼w∗\[BiasB\]≳∑i:μi\(Σ\)<1/\(γTeff\)μi\(Σw\)μi\(Σ\)\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{B\}\]\\gtrsim\\sum\_\{i:\\,\\mu\_\{i\}\(\\Sigma\)<1/\(\\gamma T\_\{\\mathrm\{eff\}\}\)\}\\frac\{\\mu\_\{i\}\(\\Sigma\_\{w\}\)\}\{\\mu\_\{i\}\(\\Sigma\)\}\.
###### Proof\.
Let us define
$$M_T' := \Sigma(I-2\gamma\Sigma)^{2T_{\mathrm{eff}}}.$$
One can verify that the blockwise geometric stepsize schedule implies
∏t=1T\(1−γtx\)2≥\(1−2γx\)2Tefffor everyx∈\[0,‖Σ‖2\]\.\\prod\_\{t=1\}^\{T\}\(1\-\\gamma\_\{t\}x\)^\{2\}\\geq\(1\-2\\gamma x\)^\{2T\_\{\\mathrm\{eff\}\}\}\\qquad\\text\{for every \}x\\in\[0,\\\|\\Sigma\\\|\_\{2\}\]\.By functional calculus, this gives
MT:=CTΣCT⪰MT′\.M\_\{T\}:=C\_\{T\}\\Sigma C\_\{T\}\\succeq M\_\{T\}^\{\\prime\}\.Therefore,
BiasB=⟨MT,u∗u∗⊤⟩≥⟨MT′,u∗u∗⊤⟩\.\\mathrm\{Bias\}\_\{B\}=\\langle M\_\{T\},u^\{\\ast\}u^\{\\ast\\top\}\\rangle\\geq\\langle M\_\{T\}^\{\\prime\},u^\{\\ast\}u^\{\\ast\\top\}\\rangle\.Since
𝔼w∗\[u∗u∗⊤\]=Σ−1ΣwΣ−1,\\mathbb\{E\}\_\{w^\{\\ast\}\}\[u^\{\\ast\}u^\{\\ast\\top\}\]=\\Sigma^\{\-1\}\\Sigma\_\{w\}\\Sigma^\{\-1\},we have
$$
\begin{aligned}
\mathbb{E}_{w^\ast}[\mathrm{Bias}_B] &\ge \operatorname{tr}\bigl(M_T'\Sigma^{-1}\Sigma_w\Sigma^{-1}\bigr) \\
&= \operatorname{tr}\bigl(\Sigma^{-1}M_T'\Sigma^{-1}\Sigma_w\bigr).
\end{aligned}
\tag{16}
$$
Applying von Neumann’s trace inequality to equation[16](https://arxiv.org/html/2605.24316#A7.E16) gives
𝔼w∗\[BiasB\]≳∑i=1MμM−i\+1\(Σ−1MT′Σ−1\)μi\(Σw\)\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{B\}\]\\gtrsim\\sum\_\{i=1\}^\{M\}\\mu\_\{M\-i\+1\}\(\\Sigma^\{\-1\}M\_\{T\}^\{\\prime\}\\Sigma^\{\-1\}\)\\,\\mu\_\{i\}\(\\Sigma\_\{w\}\)\.SinceMT′M\_\{T\}^\{\\prime\}is positive definite, the identityμM−i\+1\(A\)=μi\(A−1\)−1\\mu\_\{M\-i\+1\}\(A\)=\\mu\_\{i\}\(A^\{\-1\}\)^\{\-1\}yields
μM−i\+1\(Σ−1MT′Σ−1\)=1μi\(Σ2MT′−1\)=1μi\(Σ\(I−2γΣ\)−2Teff\)\.\\mu\_\{M\-i\+1\}\(\\Sigma^\{\-1\}M\_\{T\}^\{\\prime\}\\Sigma^\{\-1\}\)=\\frac\{1\}\{\\mu\_\{i\}\(\\Sigma^\{2\}M\_\{T\}^\{\\prime\-1\}\)\}=\\frac\{1\}\{\\mu\_\{i\}\\\!\\bigl\(\\Sigma\(I\-2\\gamma\\Sigma\)^\{\-2T\_\{\\mathrm\{eff\}\}\}\\bigr\)\}\.Therefore,
𝔼w∗\[BiasB\]≳∑i=1Mμi\(Σw\)μi\(Σ\(I−2γΣ\)−2Teff\)\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{B\}\]\\gtrsim\\sum\_\{i=1\}^\{M\}\\frac\{\\mu\_\{i\}\(\\Sigma\_\{w\}\)\}\{\\mu\_\{i\}\\\!\\bigl\(\\Sigma\(I\-2\\gamma\\Sigma\)^\{\-2T\_\{\\mathrm\{eff\}\}\}\\bigr\)\}\.Ifμi\(Σ\)<1/\(γTeff\)\\mu\_\{i\}\(\\Sigma\)<1/\(\\gamma T\_\{\\mathrm\{eff\}\}\), then
\(1−2γμi\(Σ\)\)−2Teff≲1,\(1\-2\\gamma\\mu\_\{i\}\(\\Sigma\)\)^\{\-2T\_\{\\mathrm\{eff\}\}\}\\lesssim 1,so the corresponding denominator is comparable toμi\(Σ\)\\mu\_\{i\}\(\\Sigma\)\. Restricting the sum to this index set gives
𝔼w∗\[BiasB\]≳∑i:μi\(Σ\)<1/\(γTeff\)μi\(Σw\)μi\(Σ\),\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{B\}\]\\gtrsim\\sum\_\{i:\\,\\mu\_\{i\}\(\\Sigma\)<1/\(\\gamma T\_\{\\mathrm\{eff\}\}\)\}\\frac\{\\mu\_\{i\}\(\\Sigma\_\{w\}\)\}\{\\mu\_\{i\}\(\\Sigma\)\},which is exactly the claimed bound\. ∎
### G\.2 Bounds under the source condition
###### Lemma G\.3\(Bounds on the one\-pass batch bias under the source condition\)\.
Assume Assumptions[1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3),[2](https://arxiv.org/html/2605.24316#Thmassumption2),[4](https://arxiv.org/html/2605.24316#Thmassumption4), and[3](https://arxiv.org/html/2605.24316#Thmassumption3), with the convention in Theorem[3\.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1)\. Suppose moreover that
Then there exist\(a,b\)\(a,b\)\-dependent constantsc0,c1\>0c\_\{0\},c\_\{1\}\>0such that, whenever
γ≤c0logT,\\gamma\\leq\\frac\{c\_\{0\}\}\{\\log T\},we have with probability at least
1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the randomness ofSS,
𝔼w∗\[BiasB\]≲min\{M,\(Teffγ\)1/a\}1−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{B\}\]\\lesssim\\min\\\!\\bigl\\\{M,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\bigr\\\}^\{1\-b\}\.Moreover, if in addition
\(Teffγ\)1/a≤Mc1,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\leq\\frac\{M\}\{c\_\{1\}\},then, on the same event,
𝔼w∗\[BiasB\]≳\(Teffγ\)\(1−b\)/a\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{B\}\]\\gtrsim\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{\(1\-b\)/a\}\.
###### Proof\.
In this proof, we verify the upper and lower bounds separately\.
For the upper bound, choose
k:=min\{M/3,\(Teffγ\)1/a\}\.k:=\\min\\\!\\bigl\\\{M/3,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\bigr\\\}\.By Lemma[K\.8](https://arxiv.org/html/2605.24316#A11.Thmlemma8),
μM/2\(Ak\)μM\(Ak\)≲1\\frac\{\\mu\_\{M/2\}\(A\_\{k\}\)\}\{\\mu\_\{M\}\(A\_\{k\}\)\}\\lesssim 1with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)\. Under Assumption[2](https://arxiv.org/html/2605.24316#Thmassumption2),
𝔼w∗‖w0:k∗‖22≍∑i=1kia−b,𝔼w∗‖wk:∞∗‖Hk:∞2≍∑i\>ki−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\\\|w\_\{0:k\}^\{\\ast\}\\\|\_\{2\}^\{2\}\\asymp\\sum\_\{i=1\}^\{k\}i^\{a\-b\},\\qquad\\mathbb\{E\}\_\{w^\{\\ast\}\}\\\|w\_\{k:\\infty\}^\{\\ast\}\\\|\_\{H\_\{k:\\infty\}\}^\{2\}\\asymp\\sum\_\{i\>k\}i^\{\-b\}\.Plugging these relations into Lemma[G\.1](https://arxiv.org/html/2605.24316#A7.Thmlemma1)gives
𝔼w∗\[BiasB\]≲1Teffγ∑i=1kia−b\+∑i\>ki−b≲k1−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{B\}\]\\lesssim\\frac\{1\}\{T\_\{\\mathrm\{eff\}\}\\gamma\}\\sum\_\{i=1\}^\{k\}i^\{a\-b\}\+\\sum\_\{i\>k\}i^\{\-b\}\\lesssim k^\{1\-b\}\.Sincek≍min\{M,\(Teffγ\)1/a\}k\\asymp\\min\\\{M,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}, this yields
𝔼w∗\[BiasB\]≲min\{M,\(Teffγ\)1/a\}1−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{B\}\]\\lesssim\\min\\\!\\bigl\\\{M,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\bigr\\\}^\{1\-b\}\.
For the lower bound, set
θ:=\(Teffγ\)1/a\.\\theta:=\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\.Lemma[K\.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7)gives
μi\(Σ\)≍i−afori∈\[M\]\\mu\_\{i\}\(\\Sigma\)\\asymp i^\{\-a\}\\qquad\\text\{for \}i\\in\[M\]with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)\. Therefore, the condition
$$\mu_i(\Sigma) < \frac{1}{\gamma T_{\mathrm{eff}}}$$
forces
$$i \gtrsim \theta.$$
Moreover, under Assumption[2](https://arxiv.org/html/2605.24316#Thmassumption2), the operator $HH_wH$ has eigenvalues of order $i^{-a-b}$, so Lemma[K\.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7) applied to $HH_wH$ yields
μi\(Σw\)≍i−a−b\.\\mu\_\{i\}\(\\Sigma\_\{w\}\)\\asymp i^\{\-a\-b\}\.Combining these estimates with Lemma[G\.2](https://arxiv.org/html/2605.24316#A7.Thmlemma2), we obtain
𝔼w∗\[BiasB\]≳∑i≳θμi\(Σw\)μi\(Σ\)≍∑i≳θi−b\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{B\}\]\\gtrsim\\sum\_\{i\\gtrsim\\theta\}\\frac\{\\mu\_\{i\}\(\\Sigma\_\{w\}\)\}\{\\mu\_\{i\}\(\\Sigma\)\}\\asymp\\sum\_\{i\\gtrsim\\theta\}i^\{\-b\}\.Ifθ≤M/c1\\theta\\leq M/c\_\{1\}for a sufficiently large\(a,b\)\(a,b\)\-dependent constantc1c\_\{1\}, then the last sum contains the range\[Cθ,2Cθ\]\[C\\theta,2C\\theta\]for some constantC\>0C\>0, and hence
𝔼w∗\[BiasB\]≳θ1−b≳\(Teffγ\)\(1−b\)/a\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Bias\}\_\{B\}\]\\gtrsim\\theta^\{1\-b\}\\gtrsim\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{\(1\-b\)/a\}\.Combining the upper and lower bounds proves the claim\. ∎
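The power-law arithmetic behind the upper bound can be illustrated numerically. The sketch below evaluates $\frac{1}{T_{\mathrm{eff}}\gamma}\sum_{i\le k} i^{a-b} + \sum_{i>k} i^{-b}$ with $k = (T_{\mathrm{eff}}\gamma)^{1/a}$ and compares it with $k^{1-b}$. The exponents $a = 2$, $b = 1.5$, the value $T_{\mathrm{eff}}\gamma = 10^4$, the truncation point `n_tail`, and the name `bias_upper_proxy` are all illustrative assumptions; the sketch also assumes $M$ is large enough that $k$ is not capped at $M/3$.

```python
def bias_upper_proxy(a, b, t_gamma, n_tail=200_000):
    """Return (k, head + tail) where
    head = (1/t_gamma) * sum_{i<=k} i^(a-b),
    tail = sum_{k<i<=n_tail} i^(-b)   (infinite tail truncated at n_tail)."""
    k = int(round(t_gamma ** (1.0 / a)))
    head = sum(i ** (a - b) for i in range(1, k + 1)) / t_gamma
    tail = sum(i ** (-b) for i in range(k + 1, n_tail + 1))
    return k, head + tail

k, value = bias_upper_proxy(a=2.0, b=1.5, t_gamma=1e4)
print(k, value / k ** (1 - 1.5))  # ratio to k^(1-b) is an O(1) constant
```

Both the head and the tail contribute at the same order $k^{1-b}$, which is why the choice $k \asymp (T_{\mathrm{eff}}\gamma)^{1/a}$ balances the two terms.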
## Appendix H Variance Error for One\-pass Batch SGD
This section focuses on the one\-pass batch SGD procedure equation[2](https://arxiv.org/html/2605.24316#S2.E2)and uses the notation from Section[A\.3](https://arxiv.org/html/2605.24316#A1.SS3)\. In particular,
$$T := \frac{N}{B},\qquad T_{\mathrm{eff}} := \frac{T}{\log T},\qquad u_t^{\mathrm{op}}-u^\ast = (I-\gamma_t\bar{Z}_t)(u_{t-1}^{\mathrm{op}}-u^\ast)+\gamma_t\bar{\xi}_t.$$
Throughout this section, Assumption[3](https://arxiv.org/html/2605.24316#Thmassumption3) is understood with $N$ replaced by $T$ and $L_{\mathrm{eff}}$ replaced by $T_{\mathrm{eff}}$; the only exception is the max norm, which is still taken over the original $N$ samples.
For the covariance\-iterate arguments below, it is convenient to reindex the one\-pass updates byt=0,…,T−1t=0,\\dots,T\-1: whenever this convention is used, we set
\(γt,Z¯t,ξ¯t\):=\(γt\+1,Z¯t\+1,ξ¯t\+1\)\.\(\\gamma\_\{t\},\\bar\{Z\}\_\{t\},\\bar\{\\xi\}\_\{t\}\):=\(\\gamma\_\{t\+1\},\\bar\{Z\}\_\{t\+1\},\\bar\{\\xi\}\_\{t\+1\}\)\.In particular,qTq\_\{T\}andvTv\_\{T\}still denote the states after allTTone\-pass updates\.
We also define
σ~2\(w∗\):=2\(σ2\+α‖w∗‖H2\),σ¯2\(w∗\):=σ~2\(w∗\)\+α‖w∗‖H2=2σ2\+3α‖w∗‖H2,\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\):=2\\bigl\(\\sigma^\{2\}\+\\alpha\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}\\bigr\),\\qquad\\overline\{\\sigma\}^\{2\}\(w^\{\\ast\}\):=\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\+\\alpha\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}=2\\sigma^\{2\}\+3\\alpha\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\},whereα\\alphais the constant from Lemma[K\.2](https://arxiv.org/html/2605.24316#A11.Thmlemma2)\. Under Gaussian design, that lemma permits the choiceα=3\\alpha=3\.
### H\.1 Upper and lower bounds for the exact variance components
By Proposition[F\.2](https://arxiv.org/html/2605.24316#A6.Thmproposition2), the exact centered variance splits as
VarB=VarBcov\+VarBnoise\.\\mathrm\{Var\}\_\{B\}=\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}\+\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\}\.We first record abstract upper and lower bounds for the two exact variance subcomponents\. Both are controlled by the same spectral kernel, which we define first and bound only in the last subsection\.
###### Definition H\.1\(Common raw kernel quantity\)\.
Let
$$\widetilde{T}_\Sigma(\gamma)\circ A := \Sigma A + A\Sigma - \gamma\Sigma A\Sigma$$
for any symmetric matrix $A$. Define
$$\mathcal{K}_B := \frac{1}{B(1-\gamma R_B^2)}\left\langle \Sigma,\; \sum_{t=0}^{T-1}\gamma_t^2\prod_{i=t+1}^{T-1}(I-\gamma_i\Sigma)^2\,\Sigma\right\rangle, \tag{17}$$
where
$$R_B^2 := \left(1+\frac{\alpha-1}{B}\right)\operatorname{tr}(\Sigma). \tag{18}$$
#### Stability of the batch fourth\-moment factor\.
Under Assumption[3](https://arxiv.org/html/2605.24316#Thmassumption3), the absolute constantc\>0c\>0in the stepsize condition may be chosen sufficiently small so that the denominator in the above definition is bounded away from zero\. Indeed, sinceB≥1B\\geq 1,
RB2=\(1\+α−1B\)tr\(Σ\)≤αtr\(Σ\)\.R\_\{B\}^\{2\}=\\left\(1\+\\frac\{\\alpha\-1\}\{B\}\\right\)\\operatorname\{tr\}\(\\Sigma\)\\leq\\alpha\\,\\operatorname\{tr\}\(\\Sigma\)\.Therefore Assumption 3A gives
γRB2≤αγtr\(Σ\)≤αc\.\\gamma R\_\{B\}^\{2\}\\leq\\alpha\\gamma\\operatorname\{tr\}\(\\Sigma\)\\leq\\alpha c\.Choosingc≤1/\(2α\)c\\leq 1/\(2\\alpha\), we obtain
γRB2≤12,1−γRB2≥12\.\\gamma R\_\{B\}^\{2\}\\leq\\frac\{1\}\{2\},\\qquad 1\-\\gamma R\_\{B\}^\{2\}\\geq\\frac\{1\}\{2\}\.Throughout this section, we work under this choice of the absolute constant\. Consequently, all factors of the form\(1−γRB2\)−1\(1\-\\gamma R\_\{B\}^\{2\}\)^\{\-1\}are absorbed into universal constants\.
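To make the kernel $\mathcal{K}_B$ and the stability factor concrete, the following sketch evaluates both for a diagonal $\Sigma$. It assumes that $\gamma$ in the prefactor is the largest stepsize in the schedule and uses the Gaussian-design value $\alpha = 3$; the name `kernel_K`, the spectrum, and the constant schedule are illustrative choices.

```python
def kernel_K(eigs, gammas, B, alpha=3.0):
    """Return (K_B, R_B^2) for Sigma = diag(eigs) and stepsizes gammas[0..T-1]."""
    gamma0 = max(gammas)                       # assumed to play the role of gamma
    trace = sum(eigs)
    R2 = (1 + (alpha - 1) / B) * trace
    if gamma0 * R2 >= 1:
        raise ValueError("need gamma * R_B^2 < 1")
    total = 0.0
    T = len(gammas)
    for lam in eigs:
        suffix = 1.0   # prod_{i=t+1}^{T-1} (1 - gamma_i * lam)^2, built backwards
        acc = 0.0
        for t in range(T - 1, -1, -1):
            acc += gammas[t] ** 2 * suffix
            suffix *= (1 - gammas[t] * lam) ** 2
        total += lam * lam * acc               # <Sigma, (...) Sigma> per coordinate
    return total / (B * (1 - gamma0 * R2)), R2

eigs = [1.0, 0.25, 0.1]
gammas = [0.1] * 20
for B in (1, 4, 16):
    K, R2 = kernel_K(eigs, gammas, B)
    print(B, K, R2 <= 3.0 * sum(eigs))         # R_B^2 never exceeds alpha * tr(Sigma)
```

Multiplying $\mathcal{K}_B$ by $B(1-\gamma R_B^2)$ removes all dependence on $B$, so the batch size enters only through the explicit $1/B$ and the bounded stability factor.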
###### Proposition H\.1\(Upper and lower bounds for the exact variance components\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1)and[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2)\. Suppose additionally thatγRB2<1/2\\gamma R\_\{B\}^\{2\}<1/2\. Then the two exact centered\-variance subcomponents satisfy
0≤VarBnoise≤σ~2\(w∗\)𝒦B,0\\leq\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\}\\leq\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\\,\\mathcal\{K\}\_\{B\},0≤VarBcov≤α‖u∗‖Σ2𝒦B≤α‖w∗‖H2𝒦B\.0\\leq\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}\\leq\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\\,\\mathcal\{K\}\_\{B\}\\leq\\alpha\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}\\,\\mathcal\{K\}\_\{B\}\.Consequently,
0≤VarB≤σ¯2\(w∗\)𝒦B\.0\\leq\\mathrm\{Var\}\_\{B\}\\leq\\overline\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\\,\\mathcal\{K\}\_\{B\}\.The same bounds hold after taking expectation overw∗w^\{\\ast\}\.
###### Proof\.
The lower bounds are immediate from the definitions
VarBnoise=𝔼‖vT‖Σ2,VarBcov=𝔼‖qT‖Σ2,\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\}=\\mathbb\{E\}\\\|v\_\{T\}\\\|\_\{\\Sigma\}^\{2\},\\qquad\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}=\\mathbb\{E\}\\\|q\_\{T\}\\\|\_\{\\Sigma\}^\{2\},and from the exact identityVarB=VarBcov\+VarBnoise\\mathrm\{Var\}\_\{B\}=\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}\+\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\}\.
For the upper bound on the additive\-noise component, Theorem[H\.1](https://arxiv.org/html/2605.24316#A8.Thmmytheorem1)gives
VarBnoise=⟨Σ,CT\(B\)⟩≤σ~2\(w∗\)𝒦B\.\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\}=\\langle\\Sigma,C\_\{T\}^\{\(B\)\}\\rangle\\leq\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\\,\\mathcal\{K\}\_\{B\}\.Similarly, Theorem[H\.2](https://arxiv.org/html/2605.24316#A8.Thmmytheorem2)yields
VarBcov=⟨Σ,QT\(B\)⟩≤α‖u∗‖Σ2𝒦B\.\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}=\\langle\\Sigma,Q\_\{T\}^\{\(B\)\}\\rangle\\leq\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\\,\\mathcal\{K\}\_\{B\}\.To compare‖u∗‖Σ2\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}with‖w∗‖H2\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}, write
$$P := H^{1/2}S^\top\Sigma^{-1}SH^{1/2}.$$
The matrix $P$ is symmetric, and since $\Sigma = SHS^\top$ we have $P^2 = P$, so $P$ is an orthogonal projector and therefore $0 \preceq P \preceq I$. Using $u^\ast = \Sigma^{-1}SHw^\ast$,
‖u∗‖Σ2=w∗⊤HS⊤Σ−1SHw∗=w∗⊤H1/2PH1/2w∗≤‖w∗‖H2\.\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}=w^\{\\ast\\top\}HS^\{\\top\}\\Sigma^\{\-1\}SHw^\{\\ast\}=w^\{\\ast\\top\}H^\{1/2\}PH^\{1/2\}w^\{\\ast\}\\leq\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}\.Combining the last three displays proves the upper bounds\. ∎
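For completeness, the idempotence of $P$ used in this proof is a one-line computation in the notation above, assuming only that $\Sigma = SHS^\top$ is invertible:

$$
P^2 = H^{1/2}S^\top\Sigma^{-1}\underbrace{\bigl(SHS^\top\bigr)}_{=\,\Sigma}\Sigma^{-1}SH^{1/2} = H^{1/2}S^\top\Sigma^{-1}SH^{1/2} = P, \qquad P^\top = P.
$$

A symmetric idempotent matrix has eigenvalues in $\{0,1\}$, which is exactly the sandwich $0 \preceq P \preceq I$.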
### H\.2 The additive\-noise component
We first treat the additive\-noise component\(vt\)\(v\_\{t\}\)from Definition[F\.2](https://arxiv.org/html/2605.24316#A6.Thmdefinition2)\.
Define
v0:=0,vt\+1:=\(I−γtZ¯t\)vt\+γtξ¯t\.v\_\{0\}:=0,\\qquad v\_\{t\+1\}:=\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)v\_\{t\}\+\\gamma\_\{t\}\\bar\{\\xi\}\_\{t\}\.Let
Ct\(B\):=𝔼\[vtvt⊤\]C\_\{t\}^\{\(B\)\}:=\\mathbb\{E\}\[v\_\{t\}v\_\{t\}^\{\\top\}\]be the covariance iterate of the additive\-noise component\. Also define the batch fourth\-moment operator
MΣ\(B\)\(A\):=𝔼\[Z¯tAZ¯t\],M\_\{\\Sigma\}^\{\(B\)\}\(A\):=\\mathbb\{E\}\[\\bar\{Z\}\_\{t\}A\\bar\{Z\}\_\{t\}\],and the batch covariance operator
TΣ\(B\)\(γ\)∘A:=ΣA\+AΣ−γMΣ\(B\)\(A\)\.T\_\{\\Sigma\}^\{\(B\)\}\(\\gamma\)\\circ A:=\\Sigma A\+A\\Sigma\-\\gamma M\_\{\\Sigma\}^\{\(B\)\}\(A\)\.
###### Lemma H\.1\(Covariance recursion for the additive\-noise component\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1)and[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2)\. Let
Σξ\(B\):=𝔼\[ξ¯tξ¯t⊤\]\.\\Sigma\_\{\\xi\}^\{\(B\)\}:=\\mathbb\{E\}\[\\bar\{\\xi\}\_\{t\}\\bar\{\\xi\}\_\{t\}^\{\\top\}\]\.Then
Σξ\(B\)⪯σ~2\(w∗\)BΣ\.\\Sigma\_\{\\xi\}^\{\(B\)\}\\preceq\\frac\{\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\}\{B\}\\,\\Sigma\.Moreover, the covariance iterate satisfies the exact recursion
Ct\+1\(B\)=\(I−γtTΣ\(B\)\(γt\)\)∘Ct\(B\)\+γt2Σξ\(B\)\.C\_\{t\+1\}^\{\(B\)\}=\\bigl\(I\-\\gamma\_\{t\}T\_\{\\Sigma\}^\{\(B\)\}\(\\gamma\_\{t\}\)\\bigr\)\\circ C\_\{t\}^\{\(B\)\}\+\\gamma\_\{t\}^\{2\}\\Sigma\_\{\\xi\}^\{\(B\)\}\.\(19\)Finally, ifγRB2<1/2\\gamma R\_\{B\}^\{2\}<1/2, then for alltt,
Ct\(B\)⪯γσ~2\(w∗\)B\(1−γRB2\)I\.C\_\{t\}^\{\(B\)\}\\preceq\\frac\{\\gamma\\,\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\\,I\.
###### Proof\.
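We first prove the bound on $\Sigma_\xi^{(B)}$. Since the $B$ summands $z_{t,b}\bigl(y_{i_{t,b}} - z_{t,b}^\top u^\ast\bigr)$ of $\bar{\xi}_t$ are i.i.d. with mean zero, the cross terms in $\mathbb{E}[\bar{\xi}_t\bar{\xi}_t^\top]$ vanish and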
Σξ\(B\)=𝔼\[ξ¯tξ¯t⊤\]=1B𝔼\[z\(y−z⊤u∗\)2z⊤\]\.\\Sigma\_\{\\xi\}^\{\(B\)\}=\\mathbb\{E\}\[\\bar\{\\xi\}\_\{t\}\\bar\{\\xi\}\_\{t\}^\{\\top\}\]=\\frac\{1\}\{B\}\\mathbb\{E\}\\\!\\left\[z\\bigl\(y\-z^\{\\top\}u^\{\\ast\}\\bigr\)^\{2\}z^\{\\top\}\\right\]\.Nowz=Sxz=Sx, and part \(ii\) of Lemma[K\.2](https://arxiv.org/html/2605.24316#A11.Thmlemma2)yields
𝔼\[\(y−⟨u∗,Sx⟩\)2\(Sx\)\(Sx\)⊤\]⪯σ~2\(w∗\)Σ,\\mathbb\{E\}\\\!\\left\[\(y\-\\langle u^\{\\ast\},Sx\\rangle\)^\{2\}\(Sx\)\(Sx\)^\{\\top\}\\right\]\\preceq\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\\,\\Sigma,whereσ~2\(w∗\)=2\(σ2\+α‖w∗‖H2\)\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)=2\(\\sigma^\{2\}\+\\alpha\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}\)\. Hence
Σξ\(B\)⪯σ~2\(w∗\)BΣ\.\\Sigma\_\{\\xi\}^\{\(B\)\}\\preceq\\frac\{\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\}\{B\}\\Sigma\.
We next derive the recursion\. By definition,
vt\+1=\(I−γtZ¯t\)vt\+γtξ¯t\.v\_\{t\+1\}=\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)v\_\{t\}\+\\gamma\_\{t\}\\bar\{\\xi\}\_\{t\}\.Taking the second moment, the mixed term vanishes after conditioning on the current sketched covariates𝒢tcur:=σ\(zt,1,…,zt,B\)\\mathcal\{G\}\_\{t\}^\{\\mathrm\{cur\}\}:=\\sigma\(z\_\{t,1\},\\dots,z\_\{t,B\}\)and onw∗w^\{\\ast\}: the vectorvtv\_\{t\}depends only on earlier batches and is therefore independent of𝒢tcur\\mathcal\{G\}\_\{t\}^\{\\mathrm\{cur\}\}, while Proposition[F\.2](https://arxiv.org/html/2605.24316#A6.Thmproposition2)gives𝔼\[ξ¯t∣𝒢tcur,w∗\]=0\\mathbb\{E\}\[\\bar\{\\xi\}\_\{t\}\\mid\\mathcal\{G\}\_\{t\}^\{\\mathrm\{cur\}\},w^\{\\ast\}\]=0\. Hence both cross terms are zero, and
Ct\+1\(B\)=𝔼\[\(I−γtZ¯t\)vtvt⊤\(I−γtZ¯t\)\]\+γt2Σξ\(B\)\.C\_\{t\+1\}^\{\(B\)\}=\\mathbb\{E\}\[\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)v\_\{t\}v\_\{t\}^\{\\top\}\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)\]\+\\gamma\_\{t\}^\{2\}\\Sigma\_\{\\xi\}^\{\(B\)\}\.Sincevtv\_\{t\}is independent of the current batch, we have
𝔼\[\(I−γtZ¯t\)vtvt⊤\(I−γtZ¯t\)\]=\(I−γtTΣ\(B\)\(γt\)\)∘Ct\(B\),\\mathbb\{E\}\[\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)v\_\{t\}v\_\{t\}^\{\\top\}\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)\]=\\bigl\(I\-\\gamma\_\{t\}T\_\{\\Sigma\}^\{\(B\)\}\(\\gamma\_\{t\}\)\\bigr\)\\circ C\_\{t\}^\{\(B\)\},which proves equation[19](https://arxiv.org/html/2605.24316#A8.E19)\.
Finally, we prove the crude bound by induction\. Set
κB:=γσ~2\(w∗\)B\(1−γRB2\)\.\\kappa\_\{B\}:=\\frac\{\\gamma\\,\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\.We showCt\(B\)⪯κBIC\_\{t\}^\{\(B\)\}\\preceq\\kappa\_\{B\}Ifor alltt\. This is true fort=0t=0sinceC0\(B\)=0C\_\{0\}^\{\(B\)\}=0\. Assume it holds at timett\. Then by monotonicity ofMΣ\(B\)M\_\{\\Sigma\}^\{\(B\)\},
$$M_\Sigma^{(B)}\bigl(C_t^{(B)}\bigr) \preceq \kappa_B\, M_\Sigma^{(B)}(I).$$
It remains to bound $M_\Sigma^{(B)}(I)$. Expanding the square of the batch average and using the independence of the samples within a batch,
MΣ\(B\)\(I\)=𝔼\[Z¯t2\]=1B𝔼\[\(zz⊤\)2\]\+B−1BΣ2\.M\_\{\\Sigma\}^\{\(B\)\}\(I\)=\\mathbb\{E\}\[\\bar\{Z\}\_\{t\}^\{2\}\]=\\frac\{1\}\{B\}\\mathbb\{E\}\[\(zz^\{\\top\}\)^\{2\}\]\+\\frac\{B\-1\}\{B\}\\Sigma^\{2\}\.By part \(i\) of Lemma[K\.2](https://arxiv.org/html/2605.24316#A11.Thmlemma2), for every PSD matrixAA,
𝔼\[zz⊤Azz⊤\]⪯αtr\(ΣA\)Σ\.\\mathbb\{E\}\[zz^\{\\top\}Azz^\{\\top\}\]\\preceq\\alpha\\,\\operatorname\{tr\}\(\\Sigma A\)\\Sigma\.Applying this withA=IA=I, and usingΣ2⪯‖Σ‖2Σ⪯tr\(Σ\)Σ\\Sigma^\{2\}\\preceq\\\|\\Sigma\\\|\_\{2\}\\Sigma\\preceq\\operatorname\{tr\}\(\\Sigma\)\\Sigma, we obtain
MΣ\(B\)\(I\)⪯\(αB\+B−1B\)tr\(Σ\)Σ=RB2Σ\.M\_\{\\Sigma\}^\{\(B\)\}\(I\)\\preceq\\left\(\\frac\{\\alpha\}\{B\}\+\\frac\{B\-1\}\{B\}\\right\)\\operatorname\{tr\}\(\\Sigma\)\\Sigma=R\_\{B\}^\{2\}\\Sigma\.Therefore
Ct\+1\(B\)⪯κB\(I−2γtΣ\+γt2RB2Σ\)\+γt2σ~2\(w∗\)BΣ\.C\_\{t\+1\}^\{\(B\)\}\\preceq\\kappa\_\{B\}\(I\-2\\gamma\_\{t\}\\Sigma\+\\gamma\_\{t\}^\{2\}R\_\{B\}^\{2\}\\Sigma\)\+\\gamma\_\{t\}^\{2\}\\frac\{\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\}\{B\}\\Sigma\.Sinceγt≤γ\\gamma\_\{t\}\\leq\\gamma, we get
Ct\+1\(B\)⪯κBI−γt2σ~2\(w∗\)B\(2−γRB21−γRB2−1\)Σ⪯κBI\.C\_\{t\+1\}^\{\(B\)\}\\preceq\\kappa\_\{B\}I\-\\frac\{\\gamma\_\{t\}^\{2\}\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\}\{B\}\\left\(\\frac\{2\-\\gamma R\_\{B\}^\{2\}\}\{1\-\\gamma R\_\{B\}^\{2\}\}\-1\\right\)\\Sigma\\preceq\\kappa\_\{B\}I\.This closes the induction\. ∎
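The induction can be illustrated in one dimension, where the matrix inequalities become scalar ones. The sketch below assumes scalar Gaussian $z$, so that $\mathbb{E}[\bar{Z}_t^2] = R_B^2\,\Sigma$ holds with equality for $\alpha = 3$, and replaces $\Sigma_\xi^{(B)}$ by its upper bound $\widetilde{\sigma}^2\Sigma/B$; the iterates are therefore an upper envelope for $C_t^{(B)}$. The name `noise_covariance_path`, the two-block schedule, and the value of `sigma_tilde2` are illustrative assumptions.

```python
def noise_covariance_path(Sigma, gammas, B, sigma_tilde2, alpha=3.0):
    """Iterate the scalar covariance recursion and return (path, kappa_B)."""
    gamma0 = max(gammas)                      # assumed to play the role of gamma
    R2 = (1 + (alpha - 1) / B) * Sigma        # tr(Sigma) = Sigma in one dimension
    if gamma0 * R2 >= 0.5:
        raise ValueError("need gamma * R_B^2 < 1/2")
    kappa = gamma0 * sigma_tilde2 / (B * (1 - gamma0 * R2))
    C = 0.0
    path = [C]
    for g in gammas:
        # contraction factor 1 - 2*g*Sigma + g^2 * E[Z_bar^2], plus the noise input
        C = (1 - 2 * g * Sigma + g * g * R2 * Sigma) * C \
            + g * g * sigma_tilde2 * Sigma / B
        path.append(C)
    return path, kappa

path, kappa = noise_covariance_path(Sigma=1.0, gammas=[0.1] * 200 + [0.05] * 200,
                                    B=4, sigma_tilde2=2.0)
print(max(path), kappa)  # the envelope never exceeds kappa_B
```

With a constant stepsize the recursion converges to $\gamma\widetilde{\sigma}^2/(B(2-\gamma R_B^2))$, which lies below $\kappa_B$; this is the slack that the induction step exploits.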
### H\.3Variance bound for the additive\-noise component
We now state and prove a bound on the variance of the additive\-noise component\.
###### Theorem H\.1\(Variance bound for the additive\-noise component\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1)and[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2), and supposeγRB2<1/2\\gamma R\_\{B\}^\{2\}<1/2\. Then
VarBnoise=⟨Σ,CT\(B\)⟩≤σ~2\(w∗\)B\(1−γRB2\)⟨Σ,∑t=0T−1γt2∏i=t\+1T−1\(I−γiΣ\)2Σ⟩\.\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\}=\\langle\\Sigma,C\_\{T\}^\{\(B\)\}\\rangle\\leq\\frac\{\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\\left\\langle\\Sigma,\\;\\sum\_\{t=0\}^\{T\-1\}\\gamma\_\{t\}^\{2\}\\prod\_\{i=t\+1\}^\{T\-1\}\(I\-\\gamma\_\{i\}\\Sigma\)^\{2\}\\Sigma\\right\\rangle\.
###### Proof\.
Starting from the exact recursion equation[19](https://arxiv.org/html/2605.24316#A8.E19), Lemma[H\.1](https://arxiv.org/html/2605.24316#A8.Thmlemma1)gives the crude boundCt\(B\)⪯κBIC\_\{t\}^\{\(B\)\}\\preceq\\kappa\_\{B\}I, whereκB=γσ~2\(w∗\)/\(B\(1−γRB2\)\)\\kappa\_\{B\}=\\gamma\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)/\(B\(1\-\\gamma R\_\{B\}^\{2\}\)\)\. Combining this with equation[19](https://arxiv.org/html/2605.24316#A8.E19), the estimateMΣ\(B\)\(I\)⪯RB2ΣM\_\{\\Sigma\}^\{\(B\)\}\(I\)\\preceq R\_\{B\}^\{2\}\\Sigma, and the boundΣξ\(B\)⪯σ~2\(w∗\)Σ/B\\Sigma\_\{\\xi\}^\{\(B\)\}\\preceq\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\\Sigma/B, we obtain
Ct\+1\(B\)⪯Ct\(B\)−γtΣCt\(B\)−γtCt\(B\)Σ\+γt2σ~2\(w∗\)B\(1−γRB2\)Σ\.C\_\{t\+1\}^\{\(B\)\}\\preceq C\_\{t\}^\{\(B\)\}\-\\gamma\_\{t\}\\Sigma C\_\{t\}^\{\(B\)\}\-\\gamma\_\{t\}C\_\{t\}^\{\(B\)\}\\Sigma\+\\gamma\_\{t\}^\{2\}\\frac\{\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\\Sigma\.Sinceγt2ΣCt\(B\)Σ⪰0\\gamma\_\{t\}^\{2\}\\Sigma C\_\{t\}^\{\(B\)\}\\Sigma\\succeq 0, the last display is in turn bounded by
Ct\+1\(B\)⪯\(I−γtT~Σ\(γt\)\)∘Ct\(B\)\+γt2σ~2\(w∗\)B\(1−γRB2\)Σ,C\_\{t\+1\}^\{\(B\)\}\\preceq\\bigl\(I\-\\gamma\_\{t\}\\widetilde\{T\}\_\{\\Sigma\}\(\\gamma\_\{t\}\)\\bigr\)\\circ C\_\{t\}^\{\(B\)\}\+\\gamma\_\{t\}^\{2\}\\frac\{\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\\Sigma,where
T~Σ\(γ\)∘A:=ΣA\+AΣ−γΣAΣ\.\\widetilde\{T\}\_\{\\Sigma\}\(\\gamma\)\\circ A:=\\Sigma A\+A\\Sigma\-\\gamma\\Sigma A\\Sigma\.Unrolling this recursion fromt=0t=0toT−1T\-1, and usingC0\(B\)=0C\_\{0\}^\{\(B\)\}=0, we get
$$C_{T}^{(B)}\preceq\frac{\widetilde{\sigma}^{2}(w^{\ast})}{B(1-\gamma R_{B}^{2})}\sum_{t=0}^{T-1}\gamma_{t}^{2}\prod_{i=t+1}^{T-1}\bigl(I-\gamma_{i}\widetilde{T}_{\Sigma}(\gamma_{i})\bigr)\circ\Sigma.$$ Expanding the definition of $\widetilde{T}_{\Sigma}$ shows that, for every matrix $A$,
$$\bigl(I-\gamma\widetilde{T}_{\Sigma}(\gamma)\bigr)\circ A=(I-\gamma\Sigma)A(I-\gamma\Sigma),$$ and the right-hand side equals $(I-\gamma\Sigma)^{2}A$ whenever $A$ commutes with $\Sigma$. Since the recursion starts from $A=\Sigma$, every term in the expansion is a polynomial in $\Sigma$, and hence commutes with $\Sigma$. Therefore,
∏i=t\+1T−1\(I−γiT~Σ\(γi\)\)∘Σ=∏i=t\+1T−1\(I−γiΣ\)2Σ\.\\prod\_\{i=t\+1\}^\{T\-1\}\(I\-\\gamma\_\{i\}\\widetilde\{T\}\_\{\\Sigma\}\(\\gamma\_\{i\}\)\)\\circ\\Sigma=\\prod\_\{i=t\+1\}^\{T\-1\}\(I\-\\gamma\_\{i\}\\Sigma\)^\{2\}\\Sigma\.Taking the inner product withΣ\\Sigmaproves the claim\. ∎
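The operator identity and the unrolled product used in this proof can be checked numerically. The following is a minimal sketch, assuming a randomly generated PSD matrix as a stand-in for $\Sigma$ and a hypothetical four-step halving schedule; the names `T_tilde` and `step` are illustrative and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
G = rng.standard_normal((d, d))
Sigma = G @ G.T / d                      # random PSD stand-in for Sigma (assumption)
gamma = 0.5 / np.linalg.eigvalsh(Sigma).max()
I = np.eye(d)

def T_tilde(g, A):
    """T~_Sigma(g) o A = Sigma A + A Sigma - g Sigma A Sigma."""
    return Sigma @ A + A @ Sigma - g * Sigma @ A @ Sigma

def step(g, A):
    """(I - g T~_Sigma(g)) o A."""
    return A - g * T_tilde(g, A)

# The sandwich form holds for any A, even one that does not commute with Sigma.
H = rng.standard_normal((d, d))
A = H + H.T
sandwich_ok = np.allclose(step(gamma, A), (I - gamma * Sigma) @ A @ (I - gamma * Sigma))

# Starting from A = Sigma, the composed operators collapse to prod (I - g_i Sigma)^2 Sigma.
steps = gamma / 2.0 ** np.arange(4)      # hypothetical halving schedule
lhs = Sigma.copy()
rhs = Sigma.copy()
for g in steps:
    lhs = step(g, lhs)
    rhs = (I - g * Sigma) @ (I - g * Sigma) @ rhs
product_ok = np.allclose(lhs, rhs)

print(sandwich_ok, product_ok)
```

Commutation with $\Sigma$ is only needed for the second check, where the two factors $(I-\gamma_{i}\Sigma)$ are merged into a square on one side.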
### H\.4The covariance\-fluctuation component
We now treat the centered covariance\-fluctuation process\(qt\)\(q\_\{t\}\)from Definition[F\.2](https://arxiv.org/html/2605.24316#A6.Thmdefinition2)\. Also, we reindex the recursion from equation[14](https://arxiv.org/html/2605.24316#A6.E14)as
q0:=0,qt\+1:=\(I−γtZ¯t\)qt\+γtζt,ζt:=\(Σ−Z¯t\)mt,q\_\{0\}:=0,\\qquad q\_\{t\+1\}:=\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)q\_\{t\}\+\\gamma\_\{t\}\\zeta\_\{t\},\\qquad\\zeta\_\{t\}:=\(\\Sigma\-\\bar\{Z\}\_\{t\}\)m\_\{t\},fort=0,…,T−1t=0,\\dots,T\-1\. Let
Qt\(B\):=𝔼\[qtqt⊤\],Λt\(B\):=𝔼\[ζtζt⊤\]\.Q\_\{t\}^\{\(B\)\}:=\\mathbb\{E\}\[q\_\{t\}q\_\{t\}^\{\\top\}\],\\qquad\\Lambda\_\{t\}^\{\(B\)\}:=\\mathbb\{E\}\[\\zeta\_\{t\}\\zeta\_\{t\}^\{\\top\}\]\.
###### Lemma H\.2\(Covariance iterate for the centered covariance\-fluctuation component\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1)and[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2)\. Then𝔼\[qt\]=0\\mathbb\{E\}\[q\_\{t\}\]=0for everytt, and
Λt\(B\)⪯α‖mt‖Σ2BΣ⪯α‖u∗‖Σ2BΣ\.\\Lambda\_\{t\}^\{\(B\)\}\\preceq\\frac\{\\alpha\\\|m\_\{t\}\\\|\_\{\\Sigma\}^\{2\}\}\{B\}\\,\\Sigma\\preceq\\frac\{\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\}\{B\}\\,\\Sigma\.Moreover, the covariance iterate satisfies the exact recursion
Qt\+1\(B\)=\(I−γtTΣ\(B\)\(γt\)\)∘Qt\(B\)\+γt2Λt\(B\)\.Q\_\{t\+1\}^\{\(B\)\}=\\bigl\(I\-\\gamma\_\{t\}T\_\{\\Sigma\}^\{\(B\)\}\(\\gamma\_\{t\}\)\\bigr\)\\circ Q\_\{t\}^\{\(B\)\}\+\\gamma\_\{t\}^\{2\}\\Lambda\_\{t\}^\{\(B\)\}\.\(20\)Finally, ifγRB2<1/2\\gamma R\_\{B\}^\{2\}<1/2, then for alltt,
Qt\(B\)⪯γα‖u∗‖Σ2B\(1−γRB2\)I\.Q\_\{t\}^\{\(B\)\}\\preceq\\frac\{\\gamma\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\\,I\.
###### Proof\.
We first prove by induction that $\mathbb{E}[q_{t}]=0$ for every $t$. This holds at $t=0$ since $q_{0}=0$. Suppose $\mathbb{E}[q_{t}]=0$. Because $q_{t}$ depends only on the first $t$ batches, it is independent of the current batch defining $\bar{Z}_{t}$. Using also $\mathbb{E}[\bar{Z}_{t}]=\Sigma$ and the fact that $m_{t}$ is deterministic,
𝔼\[qt\+1\]=\(I−γtΣ\)𝔼\[qt\]\+γt𝔼\[\(Σ−Z¯t\)mt\]=0\.\\mathbb\{E\}\[q\_\{t\+1\}\]=\(I\-\\gamma\_\{t\}\\Sigma\)\\mathbb\{E\}\[q\_\{t\}\]\+\\gamma\_\{t\}\\mathbb\{E\}\[\(\\Sigma\-\\bar\{Z\}\_\{t\}\)m\_\{t\}\]=0\.
Next, let $A_{t}:=m_{t}m_{t}^{\top}$. Since the $B$ samples in a batch are i.i.d.,
𝔼\[Z¯tAtZ¯t\]=1B𝔼\[zz⊤Atzz⊤\]\+B−1BΣAtΣ\.\\mathbb\{E\}\[\\bar\{Z\}\_\{t\}A\_\{t\}\\bar\{Z\}\_\{t\}\]=\\frac\{1\}\{B\}\\mathbb\{E\}\[zz^\{\\top\}A\_\{t\}zz^\{\\top\}\]\+\\frac\{B\-1\}\{B\}\\Sigma A\_\{t\}\\Sigma\.Therefore,
Λt\(B\)=𝔼\[\(Σ−Z¯t\)At\(Σ−Z¯t\)\]=1B\(𝔼\[zz⊤Atzz⊤\]−ΣAtΣ\)⪯1B𝔼\[zz⊤Atzz⊤\]\.\\Lambda\_\{t\}^\{\(B\)\}=\\mathbb\{E\}\[\(\\Sigma\-\\bar\{Z\}\_\{t\}\)A\_\{t\}\(\\Sigma\-\\bar\{Z\}\_\{t\}\)\]=\\frac\{1\}\{B\}\\Bigl\(\\mathbb\{E\}\[zz^\{\\top\}A\_\{t\}zz^\{\\top\}\]\-\\Sigma A\_\{t\}\\Sigma\\Bigr\)\\preceq\\frac\{1\}\{B\}\\mathbb\{E\}\[zz^\{\\top\}A\_\{t\}zz^\{\\top\}\]\.By part \(i\) of Lemma[K\.2](https://arxiv.org/html/2605.24316#A11.Thmlemma2),
$$\mathbb{E}[zz^{\top}A_{t}zz^{\top}]\preceq\alpha\,\operatorname{tr}(\Sigma A_{t})\Sigma=\alpha\|m_{t}\|_{\Sigma}^{2}\Sigma,$$ which proves the first bound on $\Lambda_{t}^{(B)}$. Under the zero-based convention fixed at the start of the section, $m_{t+1}=(I-\gamma_{t}\Sigma)m_{t}$. Since $I-\gamma_{t}\Sigma$ commutes with $\Sigma$ and all of its eigenvalues lie in $[0,1]$, the sequence $\|m_{t}\|_{\Sigma}^{2}$ is nonincreasing, so $\|m_{t}\|_{\Sigma}^{2}\leq\|m_{0}\|_{\Sigma}^{2}=\|u^{\ast}\|_{\Sigma}^{2}$.
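The batch second-moment identity and the resulting formula for $\Lambda_{t}^{(B)}$ rely only on the samples in a batch being i.i.d., so they can be verified exactly on a small discrete distribution. The sketch below assumes a hypothetical uniform distribution over four random points in $\mathbb{R}^{3}$ and enumerates every batch of size $B=3$; it illustrates the algebra and is not the paper's data model.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d, B = 4, 3, 3
Z = rng.standard_normal((n, d))          # hypothetical support points z_1..z_n
m = rng.standard_normal(d)               # stand-in for the deterministic vector m_t
A = np.outer(m, m)

outer = np.einsum("ni,nj->nij", Z, Z)    # z_k z_k^T for each support point
Sigma = outer.mean(axis=0)               # E[z z^T] under the uniform distribution
single = np.mean([P @ A @ P for P in outer], axis=0)   # E[z z^T A z z^T]

# Exact expectations over all n^B equally likely i.i.d. batches.
batch_second = np.zeros((d, d))
Lambda = np.zeros((d, d))
for idx in itertools.product(range(n), repeat=B):
    Zbar = outer[list(idx)].mean(axis=0)
    batch_second += Zbar @ A @ Zbar
    Lambda += (Sigma - Zbar) @ A @ (Sigma - Zbar)
batch_second /= n ** B
Lambda /= n ** B

identity_ok = np.allclose(batch_second, single / B + (B - 1) / B * Sigma @ A @ Sigma)
lambda_ok = np.allclose(Lambda, (single - Sigma @ A @ Sigma) / B)
# Dropping the subtracted term Sigma A Sigma / B can only enlarge the bound.
psd_ok = np.linalg.eigvalsh(single / B - Lambda).min() >= -1e-10

print(identity_ok, lambda_ok, psd_ok)
```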
We now derive the recursion\. Expanding the second moment ofqt\+1=\(I−γtZ¯t\)qt\+γtζtq\_\{t\+1\}=\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)q\_\{t\}\+\\gamma\_\{t\}\\zeta\_\{t\}gives
Qt\+1\(B\)=𝔼\[\(I−γtZ¯t\)qtqt⊤\(I−γtZ¯t\)\]\+γt2Λt\(B\)\+Γt\+Γt⊤,Q\_\{t\+1\}^\{\(B\)\}=\\mathbb\{E\}\[\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)q\_\{t\}q\_\{t\}^\{\\top\}\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)\]\+\\gamma\_\{t\}^\{2\}\\Lambda\_\{t\}^\{\(B\)\}\+\\Gamma\_\{t\}\+\\Gamma\_\{t\}^\{\\top\},where
Γt:=𝔼\[\(I−γtZ¯t\)qtζt⊤\]\.\\Gamma\_\{t\}:=\\mathbb\{E\}\[\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)q\_\{t\}\\zeta\_\{t\}^\{\\top\}\]\.Becauseqtq\_\{t\}is independent of the current batch andmtm\_\{t\}is deterministic, if we define the linear operator
ℒt\(A\):=𝔼\[\(I−γtZ¯t\)A\(Σ−Z¯t\)\],\\mathcal\{L\}\_\{t\}\(A\):=\\mathbb\{E\}\[\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)A\(\\Sigma\-\\bar\{Z\}\_\{t\}\)\],then
Γt=ℒt\(𝔼\[qt\]mt⊤\)=0\.\\Gamma\_\{t\}=\\mathcal\{L\}\_\{t\}\\bigl\(\\mathbb\{E\}\[q\_\{t\}\]m\_\{t\}^\{\\top\}\\bigr\)=0\.Thus the cross terms vanish\. Sinceqtq\_\{t\}is independent of the current batch,
𝔼\[\(I−γtZ¯t\)qtqt⊤\(I−γtZ¯t\)\]=\(I−γtTΣ\(B\)\(γt\)\)∘Qt\(B\),\\mathbb\{E\}\[\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)q\_\{t\}q\_\{t\}^\{\\top\}\(I\-\\gamma\_\{t\}\\bar\{Z\}\_\{t\}\)\]=\\bigl\(I\-\\gamma\_\{t\}T\_\{\\Sigma\}^\{\(B\)\}\(\\gamma\_\{t\}\)\\bigr\)\\circ Q\_\{t\}^\{\(B\)\},which proves equation[20](https://arxiv.org/html/2605.24316#A8.E20)\.
Finally, set
κq:=γα‖u∗‖Σ2B\(1−γRB2\)\.\\kappa\_\{q\}:=\\frac\{\\gamma\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\.We prove by induction thatQt\(B\)⪯κqIQ\_\{t\}^\{\(B\)\}\\preceq\\kappa\_\{q\}I\. This is true att=0t=0\. Assuming it holds at timett, monotonicity ofMΣ\(B\)M\_\{\\Sigma\}^\{\(B\)\}and the bound onMΣ\(B\)\(I\)M\_\{\\Sigma\}^\{\(B\)\}\(I\)from Lemma[H\.1](https://arxiv.org/html/2605.24316#A8.Thmlemma1)give
MΣ\(B\)\(Qt\(B\)\)⪯κqRB2Σ\.M\_\{\\Sigma\}^\{\(B\)\}\(Q\_\{t\}^\{\(B\)\}\)\\preceq\\kappa\_\{q\}R\_\{B\}^\{2\}\\Sigma\.Using equation[20](https://arxiv.org/html/2605.24316#A8.E20)and the bound onΛt\(B\)\\Lambda\_\{t\}^\{\(B\)\}, we get
Qt\+1\(B\)⪯κq\(I−2γtΣ\+γt2RB2Σ\)\+γt2α‖u∗‖Σ2BΣ⪯κqI\.Q\_\{t\+1\}^\{\(B\)\}\\preceq\\kappa\_\{q\}\(I\-2\\gamma\_\{t\}\\Sigma\+\\gamma\_\{t\}^\{2\}R\_\{B\}^\{2\}\\Sigma\)\+\\gamma\_\{t\}^\{2\}\\frac\{\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\}\{B\}\\Sigma\\preceq\\kappa\_\{q\}I\.This closes the induction\. ∎
###### Theorem H\.2\(Bound on the centered covariance\-fluctuation component\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1)and[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2), and supposeγRB2<1/2\\gamma R\_\{B\}^\{2\}<1/2\. Then
VarBcov=⟨Σ,QT\(B\)⟩≤α‖u∗‖Σ2B\(1−γRB2\)⟨Σ,∑t=0T−1γt2∏i=t\+1T−1\(I−γiΣ\)2Σ⟩\.\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}=\\langle\\Sigma,Q\_\{T\}^\{\(B\)\}\\rangle\\leq\\frac\{\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\\left\\langle\\Sigma,\\;\\sum\_\{t=0\}^\{T\-1\}\\gamma\_\{t\}^\{2\}\\prod\_\{i=t\+1\}^\{T\-1\}\(I\-\\gamma\_\{i\}\\Sigma\)^\{2\}\\Sigma\\right\\rangle\.
###### Proof\.
Starting from the exact recursion equation[20](https://arxiv.org/html/2605.24316#A8.E20), Lemma[H\.2](https://arxiv.org/html/2605.24316#A8.Thmlemma2)gives the crude boundQt\(B\)⪯κqIQ\_\{t\}^\{\(B\)\}\\preceq\\kappa\_\{q\}I, whereκq=γα‖u∗‖Σ2/\(B\(1−γRB2\)\)\\kappa\_\{q\}=\\gamma\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}/\(B\(1\-\\gamma R\_\{B\}^\{2\}\)\)\. Combining this with equation[20](https://arxiv.org/html/2605.24316#A8.E20), the estimateMΣ\(B\)\(I\)⪯RB2ΣM\_\{\\Sigma\}^\{\(B\)\}\(I\)\\preceq R\_\{B\}^\{2\}\\Sigma, and the boundΛt\(B\)⪯α‖u∗‖Σ2Σ/B\\Lambda\_\{t\}^\{\(B\)\}\\preceq\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\\Sigma/B, we get
Qt\+1\(B\)⪯Qt\(B\)−γtΣQt\(B\)−γtQt\(B\)Σ\+γt2α‖u∗‖Σ2B\(1−γRB2\)Σ\.Q\_\{t\+1\}^\{\(B\)\}\\preceq Q\_\{t\}^\{\(B\)\}\-\\gamma\_\{t\}\\Sigma Q\_\{t\}^\{\(B\)\}\-\\gamma\_\{t\}Q\_\{t\}^\{\(B\)\}\\Sigma\+\\gamma\_\{t\}^\{2\}\\frac\{\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\\Sigma\.Sinceγt2ΣQt\(B\)Σ⪰0\\gamma\_\{t\}^\{2\}\\Sigma Q\_\{t\}^\{\(B\)\}\\Sigma\\succeq 0, this is bounded by
Qt\+1\(B\)⪯\(I−γtT~Σ\(γt\)\)∘Qt\(B\)\+γt2α‖u∗‖Σ2B\(1−γRB2\)Σ\.Q\_\{t\+1\}^\{\(B\)\}\\preceq\\bigl\(I\-\\gamma\_\{t\}\\widetilde\{T\}\_\{\\Sigma\}\(\\gamma\_\{t\}\)\\bigr\)\\circ Q\_\{t\}^\{\(B\)\}\+\\gamma\_\{t\}^\{2\}\\frac\{\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\\Sigma\.The same unrolling and commutation argument as in the proof of Theorem[H\.1](https://arxiv.org/html/2605.24316#A8.Thmmytheorem1)therefore yields
QT\(B\)⪯α‖u∗‖Σ2B\(1−γRB2\)∑t=0T−1γt2∏i=t\+1T−1\(I−γiΣ\)2Σ\.Q\_\{T\}^\{\(B\)\}\\preceq\\frac\{\\alpha\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\\sum\_\{t=0\}^\{T\-1\}\\gamma\_\{t\}^\{2\}\\prod\_\{i=t\+1\}^\{T\-1\}\(I\-\\gamma\_\{i\}\\Sigma\)^\{2\}\\Sigma\.Taking the inner product withΣ\\Sigmaproves the claim\. ∎
### H\.5Bounds under the source condition
We now specialize the abstract upper bounds from Proposition[H\.1](https://arxiv.org/html/2605.24316#A8.Thmproposition1)under the power\-law spectrum and source\-condition assumptions\. The only remaining input is the bound on𝒦B\\mathcal\{K\}\_\{B\}, which is proved in the next subsection\.
###### Lemma H\.3\(Source\-condition bounds for the exact one\-pass variance terms\)\.
Assume Assumptions[1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3),[2](https://arxiv.org/html/2605.24316#Thmassumption2),[4](https://arxiv.org/html/2605.24316#Thmassumption4), and[3](https://arxiv.org/html/2605.24316#Thmassumption3), with the convention in Theorem[3\.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1)\. Suppose moreover that
Then, with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the sketch matrixSS,
𝔼w∗\[VarBnoise\]≲min\{M,\(Teffγ\)1/a\}BTeff,\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\}\]\\lesssim\\frac\{\\min\\\{M,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}\}\{B\\,T\_\{\\mathrm\{eff\}\}\},𝔼w∗\[VarBcov\]≲min\{M,\(Teffγ\)1/a\}BTeff,\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}\]\\lesssim\\frac\{\\min\\\{M,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}\}\{B\\,T\_\{\\mathrm\{eff\}\}\},and consequently
𝔼w∗\[VarB\]≲min\{M,\(Teffγ\)1/a\}BTeff\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Var\}\_\{B\}\]\\lesssim\\frac\{\\min\\\{M,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}\}\{B\\,T\_\{\\mathrm\{eff\}\}\}\.
###### Proof\.
By Proposition[H\.1](https://arxiv.org/html/2605.24316#A8.Thmproposition1),
𝔼w∗\[VarBnoise\]≤𝔼w∗\[σ~2\(w∗\)\]𝒦B,\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\}\]\\leq\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\]\\,\\mathcal\{K\}\_\{B\},𝔼w∗\[VarBcov\]≤α𝔼w∗\[‖u∗‖Σ2\]𝒦B≤α𝔼w∗\[‖w∗‖H2\]𝒦B\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}\]\\leq\\alpha\\,\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\\|u^\{\\ast\}\\\|\_\{\\Sigma\}^\{2\}\]\\,\\mathcal\{K\}\_\{B\}\\leq\\alpha\\,\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}\]\\,\\mathcal\{K\}\_\{B\}\.Under Assumption[2](https://arxiv.org/html/2605.24316#Thmassumption2),
𝔼w∗\[‖w∗‖H2\]=∑i≥1λi𝔼\[\(wi∗\)2\]≍∑i≥1i−b≲1,\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}\]=\\sum\_\{i\\geq 1\}\\lambda\_\{i\}\\,\\mathbb\{E\}\[\(w\_\{i\}^\{\\ast\}\)^\{2\}\]\\asymp\\sum\_\{i\\geq 1\}i^\{\-b\}\\lesssim 1,sinceb\>1b\>1\. Therefore
𝔼w∗\[σ~2\(w∗\)\]≲1,𝔼w∗\[σ¯2\(w∗\)\]≲1\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\]\\lesssim 1,\\qquad\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\overline\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\]\\lesssim 1\.Applying Lemma[H\.5](https://arxiv.org/html/2605.24316#A8.Thmlemma5), proved in the next subsection, yields
𝔼w∗\[VarBnoise\]≲min\{M,\(Teffγ\)1/a\}BTeff,\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{noise\}\}\]\\lesssim\\frac\{\\min\\\{M,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}\}\{B\\,T\_\{\\mathrm\{eff\}\}\},and
𝔼w∗\[VarBcov\]≲min\{M,\(Teffγ\)1/a\}BTeff\.\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Var\}\_\{B\}^\{\\mathrm\{cov\}\}\]\\lesssim\\frac\{\\min\\\{M,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}\}\{B\\,T\_\{\\mathrm\{eff\}\}\}\.Summing the two bounds gives the final claim\. ∎
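To see how the rate in Lemma H.3 depends on the batch size, one can evaluate its right-hand side directly. The sketch below drops all constants, takes $\log$ to be the natural logarithm, and uses hypothetical values of $N$, $\gamma$, $M$, and $a$; it compares doubling $B$ at a fixed update count $T$ with doubling $B$ in the one-pass regime $T=N/B$.

```python
import math

def variance_rate(T, B, gamma, M, a):
    """min{M, (T_eff * gamma)^(1/a)} / (B * T_eff), constants dropped.

    T_eff = T / log T with the natural logarithm (an assumption of this sketch).
    """
    T_eff = T / math.log(T)
    return min(M, (T_eff * gamma) ** (1.0 / a)) / (B * T_eff)

# Hypothetical problem sizes, chosen so that the min is not saturated at M.
N, gamma, M, a = 2 ** 20, 0.1, 10 ** 6, 2.0

# Fixed update count T = N: doubling B halves the rate.
fixed_T_ratio = variance_rate(N, 2, gamma, M, a) / variance_rate(N, 1, gamma, M, a)

# One-pass regime T = N / B: doubling B also halves the number of updates.
one_pass_ratio = variance_rate(N // 2, 2, gamma, M, a) / variance_rate(N, 1, gamma, M, a)

print(fixed_T_ratio, one_pass_ratio)
```

With these values the first ratio is $1/2$, while the second lies strictly between $1/2$ and $1$: part of the $1/B$ gain is offset by the shorter optimization horizon.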
### H\.6Bounding the common kernel quantity
We now prove the bounds on𝒦B\\mathcal\{K\}\_\{B\}used above\. We first convert this kernel quantity into the effective\-dimension form and then simplify it under the power\-law spectrum assumption\.
###### Lemma H\.4\(Effective\-dimension reduction for𝒦B\\mathcal\{K\}\_\{B\}\)\.
Assume the blockwise geometric learning\-rate schedule
γt=γ2ℓfort∈Iℓ,\\gamma\_\{t\}=\\frac\{\\gamma\}\{2^\{\\ell\}\}\\qquad\\text\{for \}t\\in I\_\{\\ell\},where the blocksIℓI\_\{\\ell\}form a partition of\{0,…,T−1\}\\\{0,\\dots,T\-1\\\}into consecutive intervals of length comparable to
$$T_{\mathrm{eff}}:=\frac{T}{\log T}$$ up to endpoint rounding. Then there exists a universal constant $c>0$ such that
𝒦B≤cBTeff∑j=1Mmin\{1,Teffγμj\(Σ\)\}\.\\mathcal\{K\}\_\{B\}\\leq\\frac\{c\}\{B\\,T\_\{\\mathrm\{eff\}\}\}\\sum\_\{j=1\}^\{M\}\\min\\\{1,T\_\{\\mathrm\{eff\}\}\\gamma\\,\\mu\_\{j\}\(\\Sigma\)\\\}\.Equivalently, if we write
s:=Teffγ,s:=T\_\{\\mathrm\{eff\}\}\\gamma,then
𝒦B≤cBTeff\(\#\{μj\(Σ\)≥1/s\}\+s∑μj\(Σ\)<1/sμj\(Σ\)\)\.\\mathcal\{K\}\_\{B\}\\leq\\frac\{c\}\{B\\,T\_\{\\mathrm\{eff\}\}\}\\left\(\\\#\\\{\\mu\_\{j\}\(\\Sigma\)\\geq 1/s\\\}\+s\\sum\_\{\\mu\_\{j\}\(\\Sigma\)<1/s\}\\mu\_\{j\}\(\\Sigma\)\\right\)\.This second display is simply an exact rewriting of the effective\-dimension bound above, and it is the form used in the power\-law estimate below\.
###### Proof\.
DiagonalizeΣ\\Sigmaas
Σ=Udiag\(μ1\(Σ\),…,μM\(Σ\)\)U⊤\.\\Sigma=U\\operatorname\{diag\}\(\\mu\_\{1\}\(\\Sigma\),\\dots,\\mu\_\{M\}\(\\Sigma\)\)U^\{\\top\}\.Then Definition[H\.1](https://arxiv.org/html/2605.24316#A8.Thmdefinition1)gives
𝒦B=1B\(1−γRB2\)∑j=1Mμj\(Σ\)2∑t=0T−1γt2∏i=t\+1T−1\(1−γiμj\(Σ\)\)2\.\\mathcal\{K\}\_\{B\}=\\frac\{1\}\{B\(1\-\\gamma R\_\{B\}^\{2\}\)\}\\sum\_\{j=1\}^\{M\}\\mu\_\{j\}\(\\Sigma\)^\{2\}\\sum\_\{t=0\}^\{T\-1\}\\gamma\_\{t\}^\{2\}\\prod\_\{i=t\+1\}^\{T\-1\}\\bigl\(1\-\\gamma\_\{i\}\\mu\_\{j\}\(\\Sigma\)\\bigr\)^\{2\}\.For each scalarλ≥0\\lambda\\geq 0, define
ΦT\(λ\):=λ2∑t=0T−1γt2∏i=t\+1T−1\(1−γiλ\)2\.\\Phi\_\{T\}\(\\lambda\):=\\lambda^\{2\}\\sum\_\{t=0\}^\{T\-1\}\\gamma\_\{t\}^\{2\}\\prod\_\{i=t\+1\}^\{T\-1\}\(1\-\\gamma\_\{i\}\\lambda\)^\{2\}\.We claim that
ΦT\(λ\)≲1Teffmin\{1,Teffγλ\}\.\\Phi\_\{T\}\(\\lambda\)\\lesssim\\frac\{1\}\{T\_\{\\mathrm\{eff\}\}\}\\min\\\{1,T\_\{\\mathrm\{eff\}\}\\gamma\\lambda\\\}\.\(21\)To prove this, partition\{0,…,T−1\}\\\{0,\\dots,T\-1\\\}into the geometric blocks\(Iℓ\)ℓ≥0\(I\_\{\\ell\}\)\_\{\\ell\\geq 0\}, and write
ηℓ:=γ2ℓ,aℓ:=Teffηℓλ,a0=Teffγλ\.\\eta\_\{\\ell\}:=\\frac\{\\gamma\}\{2^\{\\ell\}\},\\qquad a\_\{\\ell\}:=T\_\{\\mathrm\{eff\}\}\\eta\_\{\\ell\}\\lambda,\\qquad a\_\{0\}=T\_\{\\mathrm\{eff\}\}\\gamma\\lambda\.Sinceγtr\(Σ\)≲1\\gamma\\operatorname\{tr\}\(\\Sigma\)\\lesssim 1by Assumption[3](https://arxiv.org/html/2605.24316#Thmassumption3), we have0≤γtλ≤1/20\\leq\\gamma\_\{t\}\\lambda\\leq 1/2after taking the absolute constant in that assumption small enough\. Hence all factors1−γtλ1\-\\gamma\_\{t\}\\lambdalie in\[0,1\]\[0,1\], and therefore for each blockIℓI\_\{\\ell\},
$$\begin{aligned}\Phi_{T,\ell}(\lambda)&:=\lambda^{2}\sum_{t\in I_{\ell}}\gamma_{t}^{2}\prod_{i=t+1}^{T-1}(1-\gamma_{i}\lambda)^{2}\\&\leq\lambda^{2}|I_{\ell}|\eta_{\ell}^{2}\prod_{q>\ell}(1-\eta_{q}\lambda)^{2|I_{q}|}.\end{aligned}$$ Using $|I_{\ell}|\asymp T_{\mathrm{eff}}$, the inequality $1-x\leq e^{-x}$ for $x\in[0,1]$, and the geometric identity $\sum_{q>\ell}\eta_{q}\asymp\eta_{\ell}$, we obtain
ΦT,ℓ\(λ\)≲Teffλ2ηℓ2exp\(−cTeffλ∑q\>ℓηq\)≲aℓ2Teffe−caℓ\\Phi\_\{T,\\ell\}\(\\lambda\)\\lesssim T\_\{\\mathrm\{eff\}\}\\lambda^\{2\}\\eta\_\{\\ell\}^\{2\}\\exp\\\!\\Bigl\(\-cT\_\{\\mathrm\{eff\}\}\\lambda\\sum\_\{q\>\\ell\}\\eta\_\{q\}\\Bigr\)\\lesssim\\frac\{a\_\{\\ell\}^\{2\}\}\{T\_\{\\mathrm\{eff\}\}\}e^\{\-ca\_\{\\ell\}\}for a universal constantc\>0c\>0\. Summing overℓ\\ellgives
ΦT\(λ\)≲1Teff∑ℓ≥0aℓ2e−caℓ,aℓ=a02ℓ\.\\Phi\_\{T\}\(\\lambda\)\\lesssim\\frac\{1\}\{T\_\{\\mathrm\{eff\}\}\}\\sum\_\{\\ell\\geq 0\}a\_\{\\ell\}^\{2\}e^\{\-ca\_\{\\ell\}\},\\qquad a\_\{\\ell\}=\\frac\{a\_\{0\}\}\{2^\{\\ell\}\}\.Ifa0≤1a\_\{0\}\\leq 1, then
∑ℓ≥0aℓ2e−caℓ≤∑ℓ≥0aℓ2≲a02≤a0\.\\sum\_\{\\ell\\geq 0\}a\_\{\\ell\}^\{2\}e^\{\-ca\_\{\\ell\}\}\\leq\\sum\_\{\\ell\\geq 0\}a\_\{\\ell\}^\{2\}\\lesssim a\_\{0\}^\{2\}\\leq a\_\{0\}\.Ifa0≥1a\_\{0\}\\geq 1, letℓ⋆:=⌊log2a0⌋\\ell\_\{\\star\}:=\\lfloor\\log\_\{2\}a\_\{0\}\\rfloor\. Then forℓ\>ℓ⋆\\ell\>\\ell\_\{\\star\}, we haveaℓ<1a\_\{\\ell\}<1, so
∑ℓ\>ℓ⋆aℓ2e−caℓ≤∑ℓ\>ℓ⋆aℓ2≲1\.\\sum\_\{\\ell\>\\ell\_\{\\star\}\}a\_\{\\ell\}^\{2\}e^\{\-ca\_\{\\ell\}\}\\leq\\sum\_\{\\ell\>\\ell\_\{\\star\}\}a\_\{\\ell\}^\{2\}\\lesssim 1\.Forℓ≤ℓ⋆\\ell\\leq\\ell\_\{\\star\}, we haveaℓ≥1a\_\{\\ell\}\\geq 1, and the dyadic pointsaℓ=a0/2ℓa\_\{\\ell\}=a\_\{0\}/2^\{\\ell\}decrease geometrically, whilex↦x2e−cxx\\mapsto x^\{2\}e^\{\-cx\}decays exponentially for largexx\. Hence
∑ℓ≤ℓ⋆aℓ2e−caℓ≲1\.\\sum\_\{\\ell\\leq\\ell\_\{\\star\}\}a\_\{\\ell\}^\{2\}e^\{\-ca\_\{\\ell\}\}\\lesssim 1\.Combining the last three displays proves equation[21](https://arxiv.org/html/2605.24316#A8.E21)\. Substituting this bound into the diagonal expansion above and using that1−γRB21\-\\gamma R\_\{B\}^\{2\}is bounded below by a universal constant under the hypothesisγRB2<1/2\\gamma R\_\{B\}^\{2\}<1/2, the first claim is verified\.
The second display is just the identity
∑j=1Mmin\{1,sμj\}=\#\{μj≥1/s\}\+s∑μj<1/sμj,s:=Teffγ\.\\sum\_\{j=1\}^\{M\}\\min\\\{1,s\\mu\_\{j\}\\\}=\\\#\\\{\\mu\_\{j\}\\geq 1/s\\\}\+s\\sum\_\{\\mu\_\{j\}<1/s\}\\mu\_\{j\},\\qquad s:=T\_\{\\mathrm\{eff\}\}\\gamma\.No further reduction is needed here; Lemma[H\.5](https://arxiv.org/html/2605.24316#A8.Thmlemma5)applies this equivalent form directly under the power\-law assumption\. ∎
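The scalar quantity $\Phi_{T}(\lambda)$ and the rewriting of the effective-dimension sum can both be computed directly. The following is a minimal sketch under stated assumptions: the schedule uses $\lfloor\log_{2}T\rfloor$ blocks of equal length with the last block absorbing the remainder, the spectrum is a hypothetical exact power law, and all function names are illustrative.

```python
import math
import numpy as np

def blockwise_schedule(T, gamma):
    """Step sizes gamma / 2^ell on consecutive blocks (block layout is an assumption)."""
    n_blocks = max(1, int(math.log2(T)))
    block_len = T // n_blocks
    ell = np.minimum(np.arange(T) // block_len, n_blocks - 1)
    return gamma / 2.0 ** ell, block_len

def phi_recursive(lam, steps):
    """Phi_T(lam) via v_{t+1} = (1 - g_t lam)^2 v_t + (g_t lam)^2 with v_0 = 0."""
    v = 0.0
    for g in steps:
        v = (1.0 - g * lam) ** 2 * v + (g * lam) ** 2
    return v

def phi_direct(lam, steps):
    """Phi_T(lam) from its definition as a sum of products."""
    total = 0.0
    for t in range(len(steps)):
        total += steps[t] ** 2 * np.prod((1.0 - steps[t + 1:] * lam) ** 2)
    return lam ** 2 * total

def effective_dimension(mu, s):
    return np.minimum(1.0, s * mu).sum()

def effective_dimension_split(mu, s):
    large = mu >= 1.0 / s
    return large.sum() + s * mu[~large].sum()

steps, block_len = blockwise_schedule(T=64, gamma=0.5)
phi_rec = phi_recursive(1.0, steps)      # gamma_t * lam <= 1/2 throughout
phi_dir = phi_direct(1.0, steps)

mu = np.arange(1, 51, dtype=float) ** (-1.5)   # hypothetical power-law spectrum
eff_a = effective_dimension(mu, 37.0)
eff_b = effective_dimension_split(mu, 37.0)

print(phi_rec, phi_dir, eff_a, eff_b)
```

The recursive form is the scalar analogue of the covariance recursion unrolled in the proof of Theorem H.1, which is why it reproduces the sum-of-products definition.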
###### Lemma H\.5\(Power\-law bound for𝒦B\\mathcal\{K\}\_\{B\}\)\.
Assume Assumptions [1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3) and [4](https://arxiv.org/html/2605.24316#Thmassumption4), and suppose $T_{\mathrm{eff}}\gamma\gtrsim 1$ \(as in Theorem [3\.1](https://arxiv.org/html/2605.24316#S3.Thmmytheorem1)\)\. Then, with probability at least $1-\exp(-\Omega(M))$ over the sketch matrix $S$,
𝒦B≲min\{M,\(Teffγ\)1/a\}BTeff\.\\mathcal\{K\}\_\{B\}\\lesssim\\frac\{\\min\\\{M,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}\}\{B\\,T\_\{\\mathrm\{eff\}\}\}\.Consequently,
VarB≲σ¯2\(w∗\)min\{M,\(Teffγ\)1/a\}BTeff\.\\mathrm\{Var\}\_\{B\}\\lesssim\\overline\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\\,\\frac\{\\min\\\{M,\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\\}\}\{B\\,T\_\{\\mathrm\{eff\}\}\}\.
###### Proof\.
By Lemma[H\.4](https://arxiv.org/html/2605.24316#A8.Thmlemma4), it suffices to bound
∑j=1Mmin\{1,Teffγμj\(Σ\)\}\.\\sum\_\{j=1\}^\{M\}\\min\\\{1,T\_\{\\mathrm\{eff\}\}\\gamma\\,\\mu\_\{j\}\(\\Sigma\)\\\}\.By Lemma[K\.7](https://arxiv.org/html/2605.24316#A11.Thmlemma7), with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\),
μj\(Σ\)≍j−a,j∈\[M\]\.\\mu\_\{j\}\(\\Sigma\)\\asymp j^\{\-a\},\\qquad j\\in\[M\]\.Let
k⋆:=min\{M,⌊\(Teffγ\)1/a⌋\}\.k\_\{\\star\}:=\\min\\\{M,\\lfloor\(T\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\\rfloor\\\}\.Then forj≤k⋆j\\leq k\_\{\\star\},
Teffγμj\(Σ\)≳1,T\_\{\\mathrm\{eff\}\}\\gamma\\,\\mu\_\{j\}\(\\Sigma\)\\gtrsim 1,while forj\>k⋆j\>k\_\{\\star\},
Teffγμj\(Σ\)≲Teffγj−a\.T\_\{\\mathrm\{eff\}\}\\gamma\\,\\mu\_\{j\}\(\\Sigma\)\\lesssim T\_\{\\mathrm\{eff\}\}\\gamma\\,j^\{\-a\}\.Hence
∑j=1Mmin\{1,Teffγμj\(Σ\)\}≲k⋆\+Teffγ∑j\>k⋆j−a\.\\sum\_\{j=1\}^\{M\}\\min\\\{1,T\_\{\\mathrm\{eff\}\}\\gamma\\,\\mu\_\{j\}\(\\Sigma\)\\\}\\lesssim k\_\{\\star\}\+T\_\{\\mathrm\{eff\}\}\\gamma\\sum\_\{j\>k\_\{\\star\}\}j^\{\-a\}\.Sincea\>1a\>1,
∑j\>k⋆j−a≲k⋆1−a,\\sum\_\{j\>k\_\{\\star\}\}j^\{\-a\}\\lesssim k\_\{\\star\}^\{\\,1\-a\},and therefore
$$\sum_{j=1}^{M}\min\{1,T_{\mathrm{eff}}\gamma\,\mu_{j}(\Sigma)\}\lesssim k_{\star}+T_{\mathrm{eff}}\gamma\,k_{\star}^{\,1-a}\lesssim k_{\star}\leq\min\{M,(T_{\mathrm{eff}}\gamma)^{1/a}\}.$$ The second inequality uses $T_{\mathrm{eff}}\gamma<(k_{\star}+1)^{a}\leq 2^{a}k_{\star}^{a}$, which holds when $k_{\star}=\lfloor(T_{\mathrm{eff}}\gamma)^{1/a}\rfloor\geq 1$; in the remaining case $k_{\star}=M$ the tail sum over $j>k_{\star}$ is empty and the bound by $k_{\star}$ is immediate. Substituting into Lemma [H\.4](https://arxiv.org/html/2605.24316#A8.Thmlemma4) proves the first claim\.
The second claim follows from Proposition[H\.1](https://arxiv.org/html/2605.24316#A8.Thmproposition1)\. ∎
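The two-sided comparison between the effective-dimension sum and $k_{\star}$ can be checked on an idealized spectrum. The sketch below assumes $\mu_{j}=j^{-a}$ exactly (the lemma only gives $\mu_{j}\asymp j^{-a}$), uses hypothetical values of $M$, $a$, and $s=T_{\mathrm{eff}}\gamma$, and takes the explicit constant $1+2^{a}/(a-1)$ from an integral bound on the tail sum.

```python
import numpy as np

def effective_dimension(mu, s):
    """sum_j min{1, s * mu_j}."""
    return float(np.minimum(1.0, s * mu).sum())

def k_star(M, s, a):
    """k_* = min{M, floor(s^(1/a))}."""
    return min(M, int(np.floor(s ** (1.0 / a))))

M, a = 200, 1.5                                  # hypothetical sketch size and exponent
mu = np.arange(1, M + 1, dtype=float) ** (-a)    # idealized spectrum mu_j = j^(-a)

# Tail: sum_{j>k} j^(-a) <= k^(1-a) / (a-1), and s < (k+1)^a <= 2^a k^a when k >= 1.
C = 1.0 + 2.0 ** a / (a - 1.0)

rows = []
for s in [5.0, 50.0, 500.0, 5000.0]:
    k = k_star(M, s, a)
    rows.append((s, k, effective_dimension(mu, s)))

for s, k, eff in rows:
    print(f"s={s:7.1f}  k_star={k:4d}  eff_dim={eff:8.2f}  ratio={eff / k:.3f}")
```

For the largest value of $s$ the cap at $M$ is active, so the sum equals $M$ and the tail contributes nothing.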
## Appendix IFluctuation Error under Multi\-pass Batch SGD with Replacement
This section focuses on multi\-pass batch SGD with replacement \(equation [3](https://arxiv.org/html/2605.24316#S2.E3)\) and uses the notation from Section [A\.4](https://arxiv.org/html/2605.24316#A1.SS4)\. The proof proceeds in three steps: we first rewrite the fluctuation recursion, then control the covariance of the stochastic noise term, and finally apply a stochastic approximation lemma together with the leave\-one\-out control of GD outputs\. Unlike the single\-sample update, each step here involves a random positive semidefinite mini\-batch covariance matrix\.
Define the fluctuation term, conditional on the sketched dataset, by
FlucBwr:=𝔼batch\[‖Σ1/2\(uLwr−θL\)‖22\]=𝔼batch\[‖Σ1/2ΔL‖22\]\.\\mathrm\{Fluc\}^\{\\mathrm\{wr\}\}\_\{B\}:=\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\\bigl\[\\\|\\Sigma^\{1/2\}\(u\_\{L\}^\{\\mathrm\{wr\}\}\-\\theta\_\{L\}\)\\\|\_\{2\}^\{2\}\\bigr\]=\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\\bigl\[\\\|\\Sigma^\{1/2\}\\Delta\_\{L\}\\\|\_\{2\}^\{2\}\\bigr\]\.
For convenience, we restate the fluctuation process and its noise term here:
Δt=\(I−γtΣ^t\(B\)\)Δt−1\+γtξt\(B\),\\Delta\_\{t\}=\\bigl\(I\-\\gamma\_\{t\}\\widehat\{\\Sigma\}\_\{t\}^\{\(B\)\}\\bigr\)\\Delta\_\{t\-1\}\+\\gamma\_\{t\}\\xi\_\{t\}^\{\(B\)\},\(22\)where
ξt\(B\):=−\(Σ^t\(B\)−Σ^\)\(θt−1−u∗\)\+\(c^t\(B\)−c^\)\.\\xi\_\{t\}^\{\(B\)\}:=\-\\bigl\(\\widehat\{\\Sigma\}\_\{t\}^\{\(B\)\}\-\\widehat\{\\Sigma\}\\bigr\)\(\\theta\_\{t\-1\}\-u^\{\\ast\}\)\+\\bigl\(\\widehat\{c\}\_\{t\}^\{\(B\)\}\-\\widehat\{c\}\\bigr\)\.\(23\)
### I\.1Upper bound result
###### Lemma I\.1\(Upper bound on the fluctuation error for multi\-pass batch SGD with replacement\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1),[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2),[1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3),[4](https://arxiv.org/html/2605.24316#Thmassumption4), and[3](https://arxiv.org/html/2605.24316#Thmassumption3), and suppose
Leff≲Na/γ\.L\_\{\\mathrm\{eff\}\}\\lesssim N^\{a\}/\\gamma\.Under the notation above, fix anys∈\[0,1\]s\\in\[0,1\]andα\>1\\alpha\>1\. Letθt\(−i\)\\theta\_\{t\}^\{\(\-i\)\}denote the leave\-one\-out GD iterate, namely,
θt\(−i\)=\(I−γtΣ^\(−i\)\)θt−1\(−i\)\+γt\(SX⊤y\)\(−i\),withθ0\(−i\)=0,\\theta\_\{t\}^\{\(\-i\)\}=\\left\(I\-\\gamma\_\{t\}\\widehat\{\\Sigma\}^\{\(\-i\)\}\\right\)\\theta\_\{t\-1\}^\{\(\-i\)\}\+\\gamma\_\{t\}\\left\(SX^\{\\top\}y\\right\)^\{\(\-i\)\},\\qquad\\text\{with \}\\theta\_\{0\}^\{\(\-i\)\}=0,whereΣ^\(−i\):=∑j≠iSxjxj⊤S⊤/N\\widehat\{\\Sigma\}^\{\(\-i\)\}:=\\sum\_\{j\\neq i\}Sx\_\{j\}x\_\{j\}^\{\\top\}S^\{\\top\}/Nand\(SX⊤y\)\(−i\):=∑j≠iSxjyj/N\(SX^\{\\top\}y\)^\{\(\-i\)\}:=\\sum\_\{j\\neq i\}Sx\_\{j\}y\_\{j\}/N\. Define
λ:=1Leffγ,Rp:=‖\(Σ\+λI\)1/2\(Σ^\+λI\)−1/2‖22,\\lambda:=\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\},\\qquad R\_\{p\}:=\\bigl\\\|\(\\Sigma\+\\lambda I\)^\{1/2\}\(\\widehat\{\\Sigma\}\+\\lambda I\)^\{\-1/2\}\\bigr\\\|\_\{2\}^\{2\},amax:=maxi∈\[N\],t∈\[L\]\|yi−xi⊤S⊤θt\(−i\)\|,a\_\{\\max\}:=\\max\_\{i\\in\[N\],\\,t\\in\[L\]\}\\bigl\|y\_\{i\}\-x\_\{i\}^\{\\top\}S^\{\\top\}\\theta\_\{t\}^\{\(\-i\)\}\\bigr\|,BΔ:=amax2⋅maxi∈\[N\]‖Sxi‖22⋅Rp⋅\(Leffγ\)2−sN2,B\_\{\\Delta\}:=a\_\{\\max\}^\{2\}\\cdot\\max\_\{i\\in\[N\]\}\\\|Sx\_\{i\}\\\|\_\{2\}^\{2\}\\cdot R\_\{p\}\\cdot\\frac\{\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{2\-s\}\}\{N^\{2\}\},and
FB:=Rp\(maxi∈\[N\]\(xi⊤S⊤u∗\)2\+maxi∈\[N\]ε~i2\+maxi∈\[N\],t∈\[L\]\(xi⊤S⊤θt\(−i\)\)2\+maxi∈\[N\]∥xi⊤S⊤∥Σ−s2⋅BΔ\)\.F\_\{B\}:=R\_\{p\}\\Bigl\(\\max\_\{i\\in\[N\]\}\(x\_\{i\}^\{\\top\}S^\{\\top\}u^\{\\ast\}\)^\{2\}\+\\max\_\{i\\in\[N\]\}\\widetilde\{\\varepsilon\}\_\{i\}^\{2\}\+\\max\_\{i\\in\[N\],\\,t\\in\[L\]\}\(x\_\{i\}^\{\\top\}S^\{\\top\}\\theta\_\{t\}^\{\(\-i\)\}\)^\{2\}\+\\max\_\{i\\in\[N\]\}\\\|x\_\{i\}^\{\\top\}S^\{\\top\}\\\|\_\{\\Sigma^\{\-s\}\}^\{2\}\\cdot B\_\{\\Delta\}\\Bigr\)\.Then there exists a constantc\>0c\>0, depending only on\(s,α\)\(s,\\alpha\), such that with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the randomness ofSS,
𝔼\[FlucBwr\]=𝔼w∗,\(xi,yi\)i=1N,batch\[‖Σ1/2\(uLwr−θL\)‖22\]≤c⋅𝔼\[FB\]⋅tr\(Σ^1/α\)B⋅γ1/αLeff1/α−1\.\\mathbb\{E\}\[\\mathrm\{Fluc\}^\{\\mathrm\{wr\}\}\_\{B\}\]=\\mathbb\{E\}\_\{w^\{\\ast\},\(x\_\{i\},y\_\{i\}\)\_\{i=1\}^\{N\},\\mathrm\{batch\}\}\\bigl\[\\\|\\Sigma^\{1/2\}\(u\_\{L\}^\{\\mathrm\{wr\}\}\-\\theta\_\{L\}\)\\\|\_\{2\}^\{2\}\\bigr\]\\leq c\\cdot\\mathbb\{E\}\[F\_\{B\}\]\\cdot\\frac\{\\mathrm\{tr\}\(\\widehat\{\\Sigma\}^\{1/\\alpha\}\)\}\{B\}\\cdot\\gamma^\{1/\\alpha\}L\_\{\\mathrm\{eff\}\}^\{1/\\alpha\-1\}\.
###### Proof\.
Conditioned onSS,w∗w^\{\\ast\}, and the datasetD=\(xi,yi\)i=1ND=\(x\_\{i\},y\_\{i\}\)\_\{i=1\}^\{N\}, the mini\-batch average is unbiased:
𝔼\[Σ^t\(B\)∣S,w∗,D,ℱt−1\]=Σ^,𝔼\[c^t\(B\)∣S,w∗,D,ℱt−1\]=c^\.\\mathbb\{E\}\[\\widehat\{\\Sigma\}\_\{t\}^\{\(B\)\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\]=\\widehat\{\\Sigma\},\\qquad\\mathbb\{E\}\[\\widehat\{c\}\_\{t\}^\{\(B\)\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\]=\\widehat\{c\}\.Hence
𝔼\[ξt\(B\)∣S,w∗,D,ℱt−1\]=0\.\\mathbb\{E\}\[\\xi\_\{t\}^\{\(B\)\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\]=0\.
We then rewrite the batch noise as an average of single\-sample noises\. Define, forj∈\[N\]j\\in\[N\],
ζt\(j\):=−\(Sxjxj⊤S⊤−Σ^\)\(θt−1−u∗\)\+\(Sxjε~j−c^\)\.\\zeta\_\{t\}\(j\):=\-\\bigl\(Sx\_\{j\}x\_\{j\}^\{\\top\}S^\{\\top\}\-\\widehat\{\\Sigma\}\\bigr\)\(\\theta\_\{t\-1\}\-u^\{\\ast\}\)\+\\bigl\(Sx\_\{j\}\\widetilde\{\\varepsilon\}\_\{j\}\-\\widehat\{c\}\\bigr\)\.Then
ξt\(B\)=1B∑r=1Bζt\(it,r\)\.\\xi\_\{t\}^\{\(B\)\}=\\frac\{1\}\{B\}\\sum\_\{r=1\}^\{B\}\\zeta\_\{t\}\(i\_\{t,r\}\)\.Conditioned on\(S,w∗,D,ℱt−1\)\(S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\), the vectorsζt\(it,1\),…,ζt\(it,B\)\\zeta\_\{t\}\(i\_\{t,1\}\),\\dots,\\zeta\_\{t\}\(i\_\{t,B\}\)are i\.i\.d\. and mean zero, hence
𝔼\[ξt\(B\)ξt\(B\)⊤∣S,w∗,D,ℱt−1\]=1B𝔼\[ζt\(i\)ζt\(i\)⊤∣S,w∗,D,ℱt−1\],\\mathbb\{E\}\\bigl\[\\xi\_\{t\}^\{\(B\)\}\\xi\_\{t\}^\{\(B\)\\top\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\\bigr\]=\\frac\{1\}\{B\}\\,\\mathbb\{E\}\\bigl\[\\zeta\_\{t\}\(i\)\\zeta\_\{t\}\(i\)^\{\\top\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\\bigr\],wherei∼unif\(\[N\]\)i\\sim\\mathrm\{unif\}\(\[N\]\)\.
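The $1/B$ prefactor for with-replacement sampling can be confirmed by exact enumeration on a tiny population, and the same enumeration recovers the prefactor $\rho_{N,B}=(N-B)/(B(N-1))$ that the abstract states for sampling without replacement. This is a minimal sketch with hypothetical random vectors standing in for $\zeta_{t}(1),\dots,\zeta_{t}(N)$, centered so that their population mean is zero.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, d, B = 5, 2, 3
zeta = rng.standard_normal((N, d))       # stand-ins for zeta_t(1..N) (assumption)
zeta -= zeta.mean(axis=0)                # population mean zero, as for the true noises
single_cov = zeta.T @ zeta / N           # E[zeta(i) zeta(i)^T] with i ~ unif([N])

def batch_cov(index_tuples):
    """Exact second moment of the batch mean over equally likely index tuples."""
    acc = np.zeros((d, d))
    count = 0
    for idx in index_tuples:
        xi = zeta[list(idx)].mean(axis=0)
        acc += np.outer(xi, xi)
        count += 1
    return acc / count

# With replacement: all N^B ordered tuples. Without: all ordered tuples of distinct indices.
cov_wr = batch_cov(itertools.product(range(N), repeat=B))
cov_wor = batch_cov(itertools.permutations(range(N), B))
rho = (N - B) / (B * (N - 1))

wr_ok = np.allclose(cov_wr, single_cov / B)
wor_ok = np.allclose(cov_wor, rho * single_cov)
print(wr_ok, wor_ok, 1 / B, rho)
```

Since $\rho_{N,B}\leq 1/B$ with equality only at $B=1$, the without-replacement batch mean is never noisier than the with-replacement one in this toy check.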
Write
dt:=θt−1−u∗,zi:=Sxi,d\_\{t\}:=\\theta\_\{t\-1\}\-u^\{\\ast\},\\qquad z\_\{i\}:=Sx\_\{i\},and decompose
ζt\(i\)=ζt,1\(i\)\+ζt,2\(i\),\\zeta\_\{t\}\(i\)=\\zeta\_\{t,1\}\(i\)\+\\zeta\_\{t,2\}\(i\),where
ζt,1\(i\):=−\(zizi⊤−Σ^\)dt,ζt,2\(i\):=ziε~i−c^\.\\zeta\_\{t,1\}\(i\):=\-\\bigl\(z\_\{i\}z\_\{i\}^\{\\top\}\-\\widehat\{\\Sigma\}\\bigr\)d\_\{t\},\\qquad\\zeta\_\{t,2\}\(i\):=z\_\{i\}\\widetilde\{\\varepsilon\}\_\{i\}\-\\widehat\{c\}\.Since\(a\+b\)\(a\+b\)⊤⪯2aa⊤\+2bb⊤\(a\+b\)\(a\+b\)^\{\\top\}\\preceq 2aa^\{\\top\}\+2bb^\{\\top\}for all vectorsa,ba,b,
$$\begin{aligned}\mathbb{E}\bigl[\zeta_{t}(i)\zeta_{t}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]&\preceq 2\,\mathbb{E}\bigl[\zeta_{t,1}(i)\zeta_{t,1}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]\\&\qquad+2\,\mathbb{E}\bigl[\zeta_{t,2}(i)\zeta_{t,2}(i)^{\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr].\end{aligned}$$ We bound the two terms separately. Since
ζt,1\(i\)=at\(i\)−𝔼\[at\(i\)∣S,w∗,D,ℱt−1\],at\(i\):=−zizi⊤dt,\\zeta\_\{t,1\}\(i\)=a\_\{t\}\(i\)\-\\mathbb\{E\}\[a\_\{t\}\(i\)\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\],\\qquad a\_\{t\}\(i\):=\-z\_\{i\}z\_\{i\}^\{\\top\}d\_\{t\},its covariance is dominated by its second moment:
𝔼\[ζt,1\(i\)ζt,1\(i\)⊤∣S,w∗,D,ℱt−1\]⪯𝔼\[at\(i\)at\(i\)⊤∣S,w∗,D,ℱt−1\]\.\\mathbb\{E\}\\bigl\[\\zeta\_\{t,1\}\(i\)\\zeta\_\{t,1\}\(i\)^\{\\top\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\\bigr\]\\preceq\\mathbb\{E\}\\bigl\[a\_\{t\}\(i\)a\_\{t\}\(i\)^\{\\top\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\\bigr\]\.Now
at\(i\)at\(i\)⊤=\(zi⊤dt\)2zizi⊤⪯maxj∈\[N\]\(zj⊤dt\)2zizi⊤,a\_\{t\}\(i\)a\_\{t\}\(i\)^\{\\top\}=\\bigl\(z\_\{i\}^\{\\top\}d\_\{t\}\\bigr\)^\{2\}z\_\{i\}z\_\{i\}^\{\\top\}\\preceq\\max\_\{j\\in\[N\]\}\\bigl\(z\_\{j\}^\{\\top\}d\_\{t\}\\bigr\)^\{2\}z\_\{i\}z\_\{i\}^\{\\top\},so averaging overi∼unif\(\[N\]\)i\\sim\\mathrm\{unif\}\(\[N\]\)gives
𝔼\[ζt,1\(i\)ζt,1\(i\)⊤∣S,w∗,D,ℱt−1\]⪯maxj∈\[N\]\(xj⊤S⊤\(θt−1−u∗\)\)2Σ^\.\\mathbb\{E\}\\bigl\[\\zeta\_\{t,1\}\(i\)\\zeta\_\{t,1\}\(i\)^\{\\top\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\\bigr\]\\preceq\\max\_\{j\\in\[N\]\}\\bigl\(x\_\{j\}^\{\\top\}S^\{\\top\}\(\\theta\_\{t\-1\}\-u^\{\\ast\}\)\\bigr\)^\{2\}\\,\\widehat\{\\Sigma\}\.Similarly,
ζt,2\(i\)=b\(i\)−𝔼\[b\(i\)∣S,w∗,D\],b\(i\):=ziε~i,\\zeta\_\{t,2\}\(i\)=b\(i\)\-\\mathbb\{E\}\[b\(i\)\\mid S,w^\{\\ast\},D\],\\qquad b\(i\):=z\_\{i\}\\widetilde\{\\varepsilon\}\_\{i\},and therefore
𝔼\[ζt,2\(i\)ζt,2\(i\)⊤∣S,w∗,D,ℱt−1\]⪯𝔼\[b\(i\)b\(i\)⊤∣S,w∗,D\]\.\\mathbb\{E\}\\bigl\[\\zeta\_\{t,2\}\(i\)\\zeta\_\{t,2\}\(i\)^\{\\top\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\\bigr\]\\preceq\\mathbb\{E\}\\bigl\[b\(i\)b\(i\)^\{\\top\}\\mid S,w^\{\\ast\},D\\bigr\]\.Since
b\(i\)b\(i\)⊤=ε~i2zizi⊤⪯maxj∈\[N\]ε~j2zizi⊤,b\(i\)b\(i\)^\{\\top\}=\\widetilde\{\\varepsilon\}\_\{i\}^\{2\}z\_\{i\}z\_\{i\}^\{\\top\}\\preceq\\max\_\{j\\in\[N\]\}\\widetilde\{\\varepsilon\}\_\{j\}^\{2\}z\_\{i\}z\_\{i\}^\{\\top\},we obtain
𝔼\[ζt,2\(i\)ζt,2\(i\)⊤∣S,w∗,D,ℱt−1\]⪯maxj∈\[N\]ε~j2Σ^\.\\mathbb\{E\}\\bigl\[\\zeta\_\{t,2\}\(i\)\\zeta\_\{t,2\}\(i\)^\{\\top\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\\bigr\]\\preceq\\max\_\{j\\in\[N\]\}\\widetilde\{\\varepsilon\}\_\{j\}^\{2\}\\,\\widehat\{\\Sigma\}\.Combining the decomposition above with the last four displays yields
$$\mathbb{E}\bigl[\xi_{t}^{(B)}\xi_{t}^{(B)\top}\mid S,w^{\ast},D,\mathcal{F}_{t-1}\bigr]\preceq\frac{\sigma_{\xi,B}^{2}}{B}\,\widehat{\Sigma},\tag{24}$$
where
$$\sigma_{\xi,B}^{2}:=2\max_{i\in[N],\,t\in[L]}\left[\bigl(x_{i}^{\top}S^{\top}(\theta_{t-1}-u^{\ast})\bigr)^{2}+\widetilde{\varepsilon}_{i}^{2}\right].$$
Let
λ:=1Leffγ,Rp:=‖\(Σ\+λI\)1/2\(Σ^\+λI\)−1/2‖22\.\\lambda:=\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\},\\qquad R\_\{p\}:=\\bigl\\\|\(\\Sigma\+\\lambda I\)^\{1/2\}\(\\widehat\{\\Sigma\}\+\\lambda I\)^\{\-1/2\}\\bigr\\\|\_\{2\}^\{2\}\.Condition onSS,w∗w^\{\\ast\}, andDD\. By Lemma[I\.3](https://arxiv.org/html/2605.24316#A9.Thmlemma3), assumptions\(1\),\(3\),\(4\), and\(5\)of Lemma[I\.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4)hold for
At=Σ^t\(B\),Σν=Σ^,CA=maxi∈\[N\]‖Sxi‖22\.A\_\{t\}=\\widehat\{\\Sigma\}\_\{t\}^\{\(B\)\},\\qquad\\Sigma\_\{\\nu\}=\\widehat\{\\Sigma\},\\qquad C\_\{A\}=\\max\_\{i\\in\[N\]\}\\\|Sx\_\{i\}\\\|\_\{2\}^\{2\}\.Moreover, sinceθt−1\\theta\_\{t\-1\}is deterministic once\(S,w∗,D\)\(S,w^\{\\ast\},D\)are fixed and each mini\-batch is sampled independently across iterations, the pair\(At,ξt\(B\)\)\(A\_\{t\},\\xi\_\{t\}^\{\(B\)\}\)is independent ofℱt−1\\mathcal\{F\}\_\{t\-1\}\. Step 1 gives𝔼batch\[ξt\(B\)\]=0\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[\\xi\_\{t\}^\{\(B\)\}\]=0, and taking expectation over the batch randomness in equation[24](https://arxiv.org/html/2605.24316#A9.E24)gives the covariance part of assumption\(2\)with
σξ2=σξ,B2B\.\\sigma\_\{\\xi\}^\{2\}=\\frac\{\\sigma\_\{\\xi,B\}^\{2\}\}\{B\}\.Therefore, using
‖\(Σ^\+λI\)1/2ΔL‖22=‖Σ^1/2ΔL‖22\+λ‖ΔL‖22,\\\|\(\\widehat\{\\Sigma\}\+\\lambda I\)^\{1/2\}\\Delta\_\{L\}\\\|\_\{2\}^\{2\}=\\\|\\widehat\{\\Sigma\}^\{1/2\}\\Delta\_\{L\}\\\|\_\{2\}^\{2\}\+\\lambda\\\|\\Delta\_\{L\}\\\|\_\{2\}^\{2\},and applying Lemma[I\.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4)withu=1u=1andu=0u=0, we obtain
$$\begin{aligned}
\mathbb{E}_{\mathrm{batch}}\|\Sigma^{1/2}\Delta_{L}\|_{2}^{2}
&\leq R_{p}\,\mathbb{E}_{\mathrm{batch}}\|(\widehat{\Sigma}+\lambda I)^{1/2}\Delta_{L}\|_{2}^{2}\\
&= R_{p}\,\mathbb{E}_{\mathrm{batch}}\|\widehat{\Sigma}^{1/2}\Delta_{L}\|_{2}^{2}+R_{p}\,\lambda\,\mathbb{E}_{\mathrm{batch}}\|\Delta_{L}\|_{2}^{2}\\
&\lesssim R_{p}\cdot\frac{\sigma_{\xi,B}^{2}}{B}\cdot\gamma_{0}\,\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\,(L_{\mathrm{eff}}\gamma_{0})^{1/\alpha-1}\\
&\qquad+R_{p}\cdot\lambda\cdot\frac{\sigma_{\xi,B}^{2}}{B}\cdot\gamma_{0}\,\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\,(L_{\mathrm{eff}}\gamma_{0})^{1/\alpha}\\
&= R_{p}\cdot\frac{\sigma_{\xi,B}^{2}}{B}\cdot\bigl(1+\lambda L_{\mathrm{eff}}\gamma_{0}\bigr)\gamma_{0}\,\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\,(L_{\mathrm{eff}}\gamma_{0})^{1/\alpha-1}\\
&\lesssim R_{p}\cdot\frac{\sigma_{\xi,B}^{2}}{B}\cdot\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\,L_{\mathrm{eff}}^{1/\alpha-1}\gamma_{0}^{1/\alpha}\\
&\leq R_{p}\cdot\frac{\sigma_{\xi,B}^{2}}{B}\cdot\mathrm{tr}(\widehat{\Sigma}^{1/\alpha})\,L_{\mathrm{eff}}^{1/\alpha-1}\gamma^{1/\alpha}.
\end{aligned}\tag{25}$$
Here we used
λLeffγ0=γ0γ≤1,\\lambda L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}=\\frac\{\\gamma\_\{0\}\}\{\\gamma\}\\leq 1,sinceλ=1/\(Leffγ\)\\lambda=1/\(L\_\{\\mathrm\{eff\}\}\\gamma\)andγ0≤γ\\gamma\_\{0\}\\leq\\gamma\.
By the elementary inequality
$$\bigl(x_{i}^{\top}S^{\top}(\theta_{t-1}-u^{\ast})\bigr)^{2}\leq 2(x_{i}^{\top}S^{\top}\theta_{t-1})^{2}+2(x_{i}^{\top}S^{\top}u^{\ast})^{2},$$
and Lemma D.3 in Lin et al. [[2025](https://arxiv.org/html/2605.24316#bib.bib1)], we have
maxi∈\[N\],t∈\[L\]\(xi⊤S⊤θt\)2≲maxi∈\[N\],t∈\[L\]\(xi⊤S⊤θt\(−i\)\)2\+maxi∈\[N\]∥xi⊤S⊤∥Σ−s2⋅BΔ,\\max\_\{i\\in\[N\],\\,t\\in\[L\]\}\(x\_\{i\}^\{\\top\}S^\{\\top\}\\theta\_\{t\}\)^\{2\}\\lesssim\\max\_\{i\\in\[N\],\\,t\\in\[L\]\}\(x\_\{i\}^\{\\top\}S^\{\\top\}\\theta\_\{t\}^\{\(\-i\)\}\)^\{2\}\+\\max\_\{i\\in\[N\]\}\\\|x\_\{i\}^\{\\top\}S^\{\\top\}\\\|\_\{\\Sigma^\{\-s\}\}^\{2\}\\cdot B\_\{\\Delta\},where
amax:=maxi∈\[N\],t∈\[L\]\|yi−xi⊤S⊤θt\(−i\)\|,a\_\{\\max\}:=\\max\_\{i\\in\[N\],\\,t\\in\[L\]\}\\bigl\|y\_\{i\}\-x\_\{i\}^\{\\top\}S^\{\\top\}\\theta\_\{t\}^\{\(\-i\)\}\\bigr\|,BΔ:=amax2⋅maxi∈\[N\]‖Sxi‖22⋅Rp⋅\(Leffγ\)2−sN2\.B\_\{\\Delta\}:=a\_\{\\max\}^\{2\}\\cdot\\max\_\{i\\in\[N\]\}\\\|Sx\_\{i\}\\\|\_\{2\}^\{2\}\\cdot R\_\{p\}\\cdot\\frac\{\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{2\-s\}\}\{N^\{2\}\}\.Therefore, defining
FB:=Rp\(maxi∈\[N\]\(xi⊤S⊤u∗\)2\+maxi∈\[N\]ε~i2\+maxi∈\[N\],t∈\[L\]\(xi⊤S⊤θt\(−i\)\)2\+maxi∈\[N\]∥xi⊤S⊤∥Σ−s2⋅BΔ\),F\_\{B\}:=R\_\{p\}\\Bigl\(\\max\_\{i\\in\[N\]\}\(x\_\{i\}^\{\\top\}S^\{\\top\}u^\{\\ast\}\)^\{2\}\+\\max\_\{i\\in\[N\]\}\\widetilde\{\\varepsilon\}\_\{i\}^\{2\}\+\\max\_\{i\\in\[N\],\\,t\\in\[L\]\}\(x\_\{i\}^\{\\top\}S^\{\\top\}\\theta\_\{t\}^\{\(\-i\)\}\)^\{2\}\+\\max\_\{i\\in\[N\]\}\\\|x\_\{i\}^\{\\top\}S^\{\\top\}\\\|\_\{\\Sigma^\{\-s\}\}^\{2\}\\cdot B\_\{\\Delta\}\\Bigr\),we obtain
σξ,B2≲FBRp\.\\sigma\_\{\\xi,B\}^\{2\}\\lesssim\\frac\{F\_\{B\}\}\{R\_\{p\}\}\.Substituting this bound into equation[25](https://arxiv.org/html/2605.24316#A9.E25)yields
𝔼batch‖Σ1/2ΔL‖22≲FB⋅tr\(Σ^1/α\)B⋅γ1/αLeff1/α−1\.\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\\\|\\Sigma^\{1/2\}\\Delta\_\{L\}\\\|\_\{2\}^\{2\}\\lesssim F\_\{B\}\\cdot\\frac\{\\mathrm\{tr\}\(\\widehat\{\\Sigma\}^\{1/\\alpha\}\)\}\{B\}\\cdot\\gamma^\{1/\\alpha\}L\_\{\\mathrm\{eff\}\}^\{1/\\alpha\-1\}\.Taking expectation overw∗w^\{\\ast\}and\(xi,yi\)i=1N\(x\_\{i\},y\_\{i\}\)\_\{i=1\}^\{N\}gives the desired result\. ∎
### I.2 Fluctuation error under the source condition
###### Lemma I\.2\(Upper fluctuation bound under the source condition for multi\-pass batch SGD with replacement\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1),[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2),[1\.C](https://arxiv.org/html/2605.24316#S3.I1.i3),[2](https://arxiv.org/html/2605.24316#Thmassumption2),[4](https://arxiv.org/html/2605.24316#Thmassumption4), and[3](https://arxiv.org/html/2605.24316#Thmassumption3)\. Letε∈\(0,1\)\\varepsilon\\in\(0,1\), and suppose in addition that
Leff≲N\(1−ε\)a/γ\.L\_\{\\mathrm\{eff\}\}\\lesssim N^\{\(1\-\\varepsilon\)a\}/\\gamma\.Then there exists an\(a,ε\)\(a,\\varepsilon\)\-dependent constantc\>0c\>0such that, whenever
γ≤clogN,\\gamma\\leq\\frac\{c\}\{\\log N\},we have with probability at least
1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the randomness ofSS,
𝔼\[FlucBwr\]≲γlogNB\[\(Leffγ\)1/a−1\+\(Leffγ\)1/aN\]\.\\mathbb\{E\}\[\\mathrm\{Fluc\}^\{\\mathrm\{wr\}\}\_\{B\}\]\\lesssim\\frac\{\\gamma\\log N\}\{B\}\\left\[\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\-1\}\+\\frac\{\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\}\{N\}\\right\]\.
###### Proof\.
We follow the Gaussian concentration estimates and the leave-one-out argument of Appendix D.3 of Lin et al. [[2025](https://arxiv.org/html/2605.24316#bib.bib1)], which imply that, for any fixed $s\in[0,1-1/a)$, conditioned on $S$ and $w^{\ast}$,
𝔼\[FB∣S,w∗\]≲\(σ2\+‖w∗‖H2\)logN\[1\+log2N\(Leffγ\)2−sN2\]\.\\mathbb\{E\}\\bigl\[F\_\{B\}\\mid S,w^\{\\ast\}\\bigr\]\\lesssim\(\\sigma^\{2\}\+\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}\)\\log N\\left\[1\+\\frac\{\\log^\{2\}N\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{2\-s\}\}\{N^\{2\}\}\\right\]\.Here we reuse the proof of Lemma[I\.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1)and apply its power\-law conclusion\. As in that proof, the stochastic approximation lemma is invoked with the effective covariance proxy
σξ,eff2:=σξ,B2B,\\sigma\_\{\\xi,\\mathrm\{eff\}\}^\{2\}:=\\frac\{\\sigma\_\{\\xi,B\}^\{2\}\}\{B\},since𝔼batch\[ξt\(B\)ξt\(B\)⊤\]⪯σξ,eff2Σ^\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[\\xi\_\{t\}^\{\(B\)\}\\xi\_\{t\}^\{\(B\)\\top\}\]\\preceq\\sigma\_\{\\xi,\\mathrm\{eff\}\}^\{2\}\\widehat\{\\Sigma\}\. On the high\-probability event of Lemma[K\.10](https://arxiv.org/html/2605.24316#A11.Thmlemma10),Σν=Σ^\\Sigma\_\{\\nu\}=\\widehat\{\\Sigma\}satisfies
μj\(Σ^\)≍j−aforj≤min\{M,N/c\},\\mu\_\{j\}\(\\widehat\{\\Sigma\}\)\\asymp j^\{\-a\}\\qquad\\text\{for \}j\\leq\\min\\\{M,N/c\\\},which is precisely the spectral range required by the power\-law part of Lemma[I\.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4)\. Therefore, applying that power\-law bound withu=1u=1andu=0u=0, and usingλ=\(Leffγ\)−1\\lambda=\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{\-1\}, gives
𝔼batch‖Σ^1/2ΔL‖22≲σξ,eff2γ0\(Leffγ0\)1/a−1,\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\\\|\\widehat\{\\Sigma\}^\{1/2\}\\Delta\_\{L\}\\\|\_\{2\}^\{2\}\\lesssim\\sigma\_\{\\xi,\\mathrm\{eff\}\}^\{2\}\\gamma\_\{0\}\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/a\-1\},and
λ𝔼batch‖ΔL‖22≲σξ,eff2γ0\(Leffγ0\)1/a−1\.\\lambda\\,\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\\\|\\Delta\_\{L\}\\\|\_\{2\}^\{2\}\\lesssim\\sigma\_\{\\xi,\\mathrm\{eff\}\}^\{2\}\\gamma\_\{0\}\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/a\-1\}\.Hence the same comparison as in Step 4 of the proof of Lemma[I\.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1)yields
𝔼batch‖Σ1/2ΔL‖22≲Rpσξ,eff2γ0\(Leffγ0\)1/a−1≤Rpσξ,eff2γ\(Leffγ\)1/a−1,\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\\\|\\Sigma^\{1/2\}\\Delta\_\{L\}\\\|\_\{2\}^\{2\}\\lesssim R\_\{p\}\\,\\sigma\_\{\\xi,\\mathrm\{eff\}\}^\{2\}\\gamma\_\{0\}\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/a\-1\}\\leq R\_\{p\}\\,\\sigma\_\{\\xi,\\mathrm\{eff\}\}^\{2\}\\gamma\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\-1\},since
γ0\(Leffγ0\)1/a−1=Leff1/a−1γ01/a≤Leff1/a−1γ1/a=γ\(Leffγ\)1/a−1\.\\gamma\_\{0\}\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/a\-1\}=L\_\{\\mathrm\{eff\}\}^\{1/a\-1\}\\gamma\_\{0\}^\{1/a\}\\leq L\_\{\\mathrm\{eff\}\}^\{1/a\-1\}\\gamma^\{1/a\}=\\gamma\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\-1\}\.Finally, the proof of Lemma[I\.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1)givesσξ,B2≲FB/Rp\\sigma\_\{\\xi,B\}^\{2\}\\lesssim F\_\{B\}/R\_\{p\}, hence
σξ,eff2=σξ,B2B≲FBBRp\.\\sigma\_\{\\xi,\\mathrm\{eff\}\}^\{2\}=\\frac\{\\sigma\_\{\\xi,B\}^\{2\}\}\{B\}\\lesssim\\frac\{F\_\{B\}\}\{B\\,R\_\{p\}\}\.Therefore,
𝔼\[FlucBwr∣S,w∗\]≲γ\(Leffγ\)1/a−1B𝔼\[FB∣S,w∗\]\.\\mathbb\{E\}\[\\mathrm\{Fluc\}^\{\\mathrm\{wr\}\}\_\{B\}\\mid S,w^\{\\ast\}\]\\lesssim\\frac\{\\gamma\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\-1\}\}\{B\}\\mathbb\{E\}\\bigl\[F\_\{B\}\\mid S,w^\{\\ast\}\\bigr\]\.Substituting the bound onFBF\_\{B\}above yields
𝔼\[FlucBwr∣S,w∗\]≲σ2\+‖w∗‖H2B⋅γlogN\[1\+log2N\(Leffγ\)2−sN2\]\(Leffγ\)1/a−1\.\\mathbb\{E\}\[\\mathrm\{Fluc\}^\{\\mathrm\{wr\}\}\_\{B\}\\mid S,w^\{\\ast\}\]\\lesssim\\frac\{\\sigma^\{2\}\+\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}\}\{B\}\\cdot\\gamma\\log N\\left\[1\+\\frac\{\\log^\{2\}N\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{2\-s\}\}\{N^\{2\}\}\\right\]\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\-1\}\.Taking expectation overw∗w^\{\\ast\}yields
𝔼\[FlucBwr\]≲γlogNB\[1\+log2N\(Leffγ\)2−sN2\]\(Leffγ\)1/a−1\.\\mathbb\{E\}\[\\mathrm\{Fluc\}^\{\\mathrm\{wr\}\}\_\{B\}\]\\lesssim\\frac\{\\gamma\\log N\}\{B\}\\left\[1\+\\frac\{\\log^\{2\}N\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{2\-s\}\}\{N^\{2\}\}\\right\]\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\-1\}\.Now choose
s:=1−1a\(1−ε/2\)\.s:=1\-\\frac\{1\}\{a\(1\-\\varepsilon/2\)\}\.Then
log2N\(Leffγ\)1−sN≲log2N⋅N\(1−ε\)a\(1−s\)−1=log2N⋅N−ε2\(1−ε/2\)≲1,\\frac\{\\log^\{2\}N\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1\-s\}\}\{N\}\\lesssim\\log^\{2\}N\\cdot N^\{\(1\-\\varepsilon\)a\(1\-s\)\-1\}=\\log^\{2\}N\\cdot N^\{\-\\frac\{\\varepsilon\}\{2\(1\-\\varepsilon/2\)\}\}\\lesssim 1,where we used the assumptionLeffγ≲N\(1−ε\)aL\_\{\\mathrm\{eff\}\}\\gamma\\lesssim N^\{\(1\-\\varepsilon\)a\}\. Therefore
log2N\(Leffγ\)2−sN2\(Leffγ\)1/a−1=log2N\(Leffγ\)1−sN⋅\(Leffγ\)1/aN≲\(Leffγ\)1/aN\.\\frac\{\\log^\{2\}N\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{2\-s\}\}\{N^\{2\}\}\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\-1\}=\\frac\{\\log^\{2\}N\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1\-s\}\}\{N\}\\cdot\\frac\{\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\}\{N\}\\lesssim\\frac\{\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\}\{N\}\.Combining this with the leading term yields
\[1\+log2N\(Leffγ\)2−sN2\]\(Leffγ\)1/a−1≲\(Leffγ\)1/a−1\+\(Leffγ\)1/aN,\\left\[1\+\\frac\{\\log^\{2\}N\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{2\-s\}\}\{N^\{2\}\}\\right\]\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\-1\}\\lesssim\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\-1\}\+\\frac\{\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\}\{N\},which proves the claim\. ∎
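The exponent bookkeeping behind the choice of $s$ can be replayed with exact rational arithmetic. The following sketch is illustrative only: the function names and the sample values of $a$ and $\varepsilon$ are assumptions. It checks that $s<1-1/a$ and that the exponent of $N$ in $(L_{\mathrm{eff}}\gamma)^{1-s}/N$ equals $-\varepsilon/(2(1-\varepsilon/2))$ when $L_{\mathrm{eff}}\gamma=N^{(1-\varepsilon)a}$.

```python
from fractions import Fraction


def source_exponent(a, eps):
    """The choice s = 1 - 1 / (a (1 - eps/2)) made in the proof."""
    return 1 - 1 / (a * (1 - eps / 2))


def cross_term_exponent(a, eps):
    """Exponent of N in (L_eff*gamma)^(1-s) / N at L_eff*gamma = N^((1-eps) a)."""
    s = source_exponent(a, eps)
    return (1 - eps) * a * (1 - s) - 1


# Hypothetical sample values; any a > 1 and eps in (0, 1) can be tried.
a, eps = Fraction(2), Fraction(1, 2)
s = source_exponent(a, eps)
print("s =", s, "| upper limit 1 - 1/a =", 1 - 1 / a)
print("exponent of N:", cross_term_exponent(a, eps))
print("closed form  :", -eps / (2 * (1 - eps / 2)))
```

Since the exponent is strictly negative, the polynomial decay in $N$ dominates the $\log^{2}N$ factor, which is how the cross term is absorbed into $(L_{\mathrm{eff}}\gamma)^{1/a}/N$.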
### I.3 Lemmas to prove the upper bound
In this subsection, we first verify that the actual mini\-batch covariance process satisfies the moment assumptions needed later, and then state and prove a stochastic approximation lemma for mini\-batch covariance matrices\.
###### Lemma I.3 (Verification of the moment assumptions for $\widehat{\Sigma}_{t}^{(B)}$).
Condition onSS,w∗w^\{\\ast\}, and the datasetDD\. Let
At:=Σ^t\(B\),Σν:=Σ^,CA:=maxi∈\[N\]‖Sxi‖22\.A\_\{t\}:=\\widehat\{\\Sigma\}\_\{t\}^\{\(B\)\},\\qquad\\Sigma\_\{\\nu\}:=\\widehat\{\\Sigma\},\\qquad C\_\{A\}:=\\max\_\{i\\in\[N\]\}\\\|Sx\_\{i\}\\\|\_\{2\}^\{2\}\.Then\(At\)t∈\[L\]\(A\_\{t\}\)\_\{t\\in\[L\]\}are i\.i\.d\. random PSD matrices over the batch randomness, and
𝔼batch\[At\]=Σν,𝔼batch\[At2\]⪯CAΣν,‖Σν‖2≤CA\.\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[A\_\{t\}\]=\\Sigma\_\{\\nu\},\\qquad\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[A\_\{t\}^\{2\}\]\\preceq C\_\{A\}\\Sigma\_\{\\nu\},\\qquad\\\|\\Sigma\_\{\\nu\}\\\|\_\{2\}\\leq C\_\{A\}\.Moreover, if
γ0:=min\{18CA,γ\},\\gamma\_\{0\}:=\\min\\\!\\left\\\{\\frac\{1\}\{8C\_\{A\}\},\\,\\gamma\\right\\\},thenγ0CA≤1/8\\gamma\_\{0\}C\_\{A\}\\leq 1/8, hence4γ0CA≤1/2<14\\gamma\_\{0\}C\_\{A\}\\leq 1/2<1\. In particular, these are exactly the mean and matrix\-moment bounds needed later to apply Lemma[I\.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4)withAt=Σ^t\(B\)A\_\{t\}=\\widehat\{\\Sigma\}\_\{t\}^\{\(B\)\}andΣν=Σ^\\Sigma\_\{\\nu\}=\\widehat\{\\Sigma\}\.
###### Proof\.
Write
zi:=Sxi,Zi:=zizi⊤,I∼unif\(\[N\]\)\.z\_\{i\}:=Sx\_\{i\},\\qquad Z\_\{i\}:=z\_\{i\}z\_\{i\}^\{\\top\},\\qquad I\\sim\\mathrm\{unif\}\(\[N\]\)\.Then
At=1B∑r=1BZit,r,A\_\{t\}=\\frac\{1\}\{B\}\\sum\_\{r=1\}^\{B\}Z\_\{i\_\{t,r\}\},so the matricesAtA\_\{t\}are PSD and are i\.i\.d\. acrossttbecause the mini\-batches are sampled independently with replacement\. Moreover,
𝔼batch\[At\]=1B∑r=1B1N∑i=1NZi=Σ^=Σν\.\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[A\_\{t\}\]=\\frac\{1\}\{B\}\\sum\_\{r=1\}^\{B\}\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}Z\_\{i\}=\\widehat\{\\Sigma\}=\\Sigma\_\{\\nu\}\.For the second moment, independence of the batch draws gives
𝔼batch\[At2\]=1B𝔼\[ZI2\]\+B−1BΣ^2\.\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[A\_\{t\}^\{2\}\]=\\frac\{1\}\{B\}\\,\\mathbb\{E\}\[Z\_\{I\}^\{2\}\]\+\\frac\{B\-1\}\{B\}\\,\\widehat\{\\Sigma\}^\{2\}\.Now
Zi2=‖zi‖22Zi⪯CAZi,Z\_\{i\}^\{2\}=\\\|z\_\{i\}\\\|\_\{2\}^\{2\}Z\_\{i\}\\preceq C\_\{A\}Z\_\{i\},so
𝔼\[ZI2\]⪯CA𝔼\[ZI\]=CAΣ^\.\\mathbb\{E\}\[Z\_\{I\}^\{2\}\]\\preceq C\_\{A\}\\,\\mathbb\{E\}\[Z\_\{I\}\]=C\_\{A\}\\widehat\{\\Sigma\}\.Also, since eachZi⪯CAIZ\_\{i\}\\preceq C\_\{A\}I, we have
Σ^=1N∑i=1NZi⪯CAI,hence‖Σ^‖2≤CA,\\widehat\{\\Sigma\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}Z\_\{i\}\\preceq C\_\{A\}I,\\qquad\\text\{hence\}\\qquad\\\|\\widehat\{\\Sigma\}\\\|\_\{2\}\\leq C\_\{A\},and therefore
Σ^2⪯‖Σ^‖2Σ^⪯CAΣ^\.\\widehat\{\\Sigma\}^\{2\}\\preceq\\\|\\widehat\{\\Sigma\}\\\|\_\{2\}\\widehat\{\\Sigma\}\\preceq C\_\{A\}\\widehat\{\\Sigma\}\.Substituting the last two displays into the expression for𝔼batch\[At2\]\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[A\_\{t\}^\{2\}\]yields
𝔼batch\[At2\]⪯1BCAΣ^\+B−1BCAΣ^=CAΣ^=CAΣν\.\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[A\_\{t\}^\{2\}\]\\preceq\\frac\{1\}\{B\}C\_\{A\}\\widehat\{\\Sigma\}\+\\frac\{B\-1\}\{B\}C\_\{A\}\\widehat\{\\Sigma\}=C\_\{A\}\\widehat\{\\Sigma\}=C\_\{A\}\\Sigma\_\{\\nu\}\.The bound‖Σν‖2≤CA\\\|\\Sigma\_\{\\nu\}\\\|\_\{2\}\\leq C\_\{A\}was proved above, and the definition ofγ0\\gamma\_\{0\}givesγ0CA≤1/8\\gamma\_\{0\}C\_\{A\}\\leq 1/8immediately\. This completes the verification\. ∎
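The two facts used in this proof, the second-moment formula for $A_t$ and the domination $\mathbb{E}_{\mathrm{batch}}[A_t^2]\preceq C_A\widehat{\Sigma}$, can be verified exactly on a small dataset. The sketch below is a sanity check under stated assumptions: the three sketched features in `z` are hypothetical, the dimension is fixed to 2 so that positive semidefiniteness reduces to a diagonal and determinant test, and all helper names are illustrative.

```python
from fractions import Fraction
from itertools import product


def outer(u):
    return [[a * b for b in u] for a in u]


def matmul(P, Q):
    n = len(P)
    return [[sum(P[r][k] * Q[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def lincomb(terms):
    """Sum of coef * matrix over a list of (coef, matrix) pairs."""
    n = len(terms[0][1])
    return [[sum(c * m[r][k] for c, m in terms) for k in range(n)]
            for r in range(n)]


def average(mats):
    return lincomb([(Fraction(1, len(mats)), m) for m in mats])


def is_psd_2x2(S):
    """A symmetric 2x2 matrix is PSD iff its diagonal and determinant are >= 0."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return S[0][0] >= 0 and S[1][1] >= 0 and det >= 0


def second_moment_enumerated(Z, B):
    """Exact E[A_t^2] over all N^B with-replacement batches."""
    squares = []
    for idx in product(range(len(Z)), repeat=B):
        A = average([Z[i] for i in idx])
        squares.append(matmul(A, A))
    return average(squares)


def second_moment_closed_form(Z, B):
    """(1/B) E[Z_I^2] + ((B-1)/B) Sigma_hat^2."""
    sigma_hat = average(Z)
    return lincomb([(Fraction(1, B), average([matmul(Zi, Zi) for Zi in Z])),
                    (Fraction(B - 1, B), matmul(sigma_hat, sigma_hat))])


# Hypothetical sketched features z_i = S x_i in dimension M = 2.
z = [[1, 0], [1, 1], [0, 2]]
Z = [outer(zi) for zi in z]
C_A = max(sum(c * c for c in zi) for zi in z)  # max_i ||z_i||^2
sigma_hat = average(Z)

for B in (1, 2, 3):
    EA2 = second_moment_enumerated(Z, B)
    gap = lincomb([(Fraction(C_A), sigma_hat), (Fraction(-1), EA2)])
    print(B, EA2 == second_moment_closed_form(Z, B), is_psd_2x2(gap))
```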
###### Lemma I\.4\(Stochastic approximation with random PSD updates\)\.
Consider the recursion
μt=\(I−γtAt\)μt−1\+γtξt,μ0=0,t∈\[L\],\\mu\_\{t\}=\(I\-\\gamma\_\{t\}A\_\{t\}\)\\mu\_\{t\-1\}\+\\gamma\_\{t\}\\xi\_\{t\},\\qquad\\mu\_\{0\}=0,\\qquad t\\in\[L\],whereAt∈ℝM×MA\_\{t\}\\in\\mathbb\{R\}^\{M\\times M\}are i\.i\.d\. random PSD matrices,ξt∈ℝM\\xi\_\{t\}\\in\\mathbb\{R\}^\{M\}are random vectors, and each pair\(At,ξt\)\(A\_\{t\},\\xi\_\{t\}\)is independent ofσ\(\(As,ξs\):s<t\)\\sigma\(\(A\_\{s\},\\xi\_\{s\}\):s<t\)\. Assume:
1. 1\.𝔼\[At\]=Σν\\mathbb\{E\}\[A\_\{t\}\]=\\Sigma\_\{\\nu\}for some PSD matrixΣν\\Sigma\_\{\\nu\};
2. 2\.𝔼\[ξt\]=0\\mathbb\{E\}\[\\xi\_\{t\}\]=0and𝔼\[ξtξt⊤\]⪯σξ2Σν\\mathbb\{E\}\[\\xi\_\{t\}\\xi\_\{t\}^\{\\top\}\]\\preceq\\sigma\_\{\\xi\}^\{2\}\\Sigma\_\{\\nu\};
3. 3\.𝔼\[At2\]⪯CAΣν\\mathbb\{E\}\[A\_\{t\}^\{2\}\]\\preceq C\_\{A\}\\Sigma\_\{\\nu\};
4. 4\.‖Σν‖2≤CA\\\|\\Sigma\_\{\\nu\}\\\|\_\{2\}\\leq C\_\{A\};
5. 5\.γ0CA≤1/8\\gamma\_\{0\}C\_\{A\}\\leq 1/8\.
Then for anyu∈\[0,1\]u\\in\[0,1\]and anyα\>1\\alpha\>1,
𝔼‖Σνu/2μL‖22≤cασξ2γ0tr\(Σν1/α\)\(Leffγ0\)1/α−u,\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}\\\|\_\{2\}^\{2\}\\leq c\_\{\\alpha\}\\,\\sigma\_\{\\xi\}^\{2\}\\,\\gamma\_\{0\}\\,\\mathrm\{tr\}\(\\Sigma\_\{\\nu\}^\{1/\\alpha\}\)\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/\\alpha\-u\},for some constantcα\>0c\_\{\\alpha\}\>0depending only onα\\alpha\. Moreover, ifμj\(Σν\)≍j−a\\mu\_\{j\}\(\\Sigma\_\{\\nu\}\)\\asymp j^\{\-a\}forj≤min\{M,N/c~\}j\\leq\\min\\\{M,N/\\widetilde\{c\}\\\}and some constantsa\>1a\>1andc~\>0\\widetilde\{c\}\>0, then
𝔼‖Σνu/2μL‖22≤caσξ2γ0\(Leffγ0\)1/a−u,\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}\\\|\_\{2\}^\{2\}\\leq c\_\{a\}\\,\\sigma\_\{\\xi\}^\{2\}\\,\\gamma\_\{0\}\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/a\-u\},for some constantca\>0c\_\{a\}\>0depending only onaa\.
###### Proof\.
We define recursively
$$\xi_{t}^{(0)}:=\xi_{t},\qquad\xi_{t}^{(k)}:=(\Sigma_{\nu}-A_{t})\mu_{t-1}^{(k-1)}\quad(k\geq 1),$$
and, for every $k\geq 0$,
$$\mu_{t}^{(k)}=(I-\gamma_{t}\Sigma_{\nu})\mu_{t-1}^{(k)}+\gamma_{t}\xi_{t}^{(k)},\qquad\mu_{0}^{(k)}=0.$$
Each $\mu_{t}^{(k)}$ is thus driven by the deterministic operator $\Sigma_{\nu}$, with the randomness of $A_{t}$ moved into the noise $\xi_{t}^{(k)}$. Subtracting the recursions for $\mu_{t}^{(0)},\dots,\mu_{t}^{(k)}$ from the recursion for $\mu_{t}$ shows that the remainder satisfies
$$\mu_{t}-\sum_{i=0}^{k}\mu_{t}^{(i)}=(I-\gamma_{t}A_{t})\Bigl(\mu_{t-1}-\sum_{i=0}^{k}\mu_{t-1}^{(i)}\Bigr)+\gamma_{t}\xi_{t}^{(k+1)}.$$
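The remainder recursion can be checked mechanically. The sketch below assumes the convention in which every level $\mu_t^{(i)}$, $i\geq 0$, follows the $\Sigma_\nu$-recursion with $\xi_t^{(0)}=\xi_t$ and $\xi_t^{(i)}=(\Sigma_\nu-A_t)\mu_{t-1}^{(i-1)}$, and verifies that $r_t=\mu_t-\sum_{i=0}^{k}\mu_t^{(i)}$ obeys $r_t=(I-\gamma_tA_t)r_{t-1}+\gamma_t\xi_t^{(k+1)}$. For brevity it works in the scalar case $M=1$; the step sizes, the values of $A_t$ and $\xi_t$, and the function names are hypothetical.

```python
from fractions import Fraction as F


def decompose(gammas, A, xi, sigma, k):
    """Return the remainder r_0..r_L and the noise xi^{(k+1)} (scalar case)."""
    L = len(gammas)
    # Original recursion mu_t = (1 - gamma_t A_t) mu_{t-1} + gamma_t xi_t.
    mu = [F(0)]
    for t in range(L):
        mu.append((1 - gammas[t] * A[t]) * mu[-1] + gammas[t] * xi[t])
    levels = []
    noise = list(xi)  # xi^{(0)} = xi
    for _ in range(k + 1):
        m = [F(0)]
        for t in range(L):
            m.append((1 - gammas[t] * sigma) * m[-1] + gammas[t] * noise[t])
        levels.append(m)
        # Noise of the next level: (Sigma_nu - A_t) times the previous iterate.
        noise = [(sigma - A[t]) * m[t] for t in range(L)]
    remainder = [mu[t] - sum(lv[t] for lv in levels) for t in range(L + 1)]
    return remainder, noise


def remainder_recursion_holds(gammas, A, xi, sigma, k):
    r, next_noise = decompose(gammas, A, xi, sigma, k)
    return all(
        r[t + 1] == (1 - gammas[t] * A[t]) * r[t] + gammas[t] * next_noise[t]
        for t in range(len(gammas))
    )


# Hypothetical toy inputs; sigma plays the role of Sigma_nu.
gammas = [F(1, 2), F(1, 3), F(1, 4), F(1, 5)]
A = [F(1), F(0), F(3), F(2)]
xi = [F(1), F(-2), F(1, 2), F(3)]
sigma = F(3, 2)

for k in range(4):
    print(k, remainder_recursion_holds(gammas, A, xi, sigma, k))
```

The identity is purely algebraic, so it holds for any inputs. The probabilistic assumptions only enter when the covariance of each level is bounded.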
Below we quote Lemma D.6 of Lin et al. [[2025](https://arxiv.org/html/2605.24316#bib.bib1)], whose proof uses only the deterministic operator $\Sigma_{\nu}$ and does not rely on the rank-one structure.
###### Lemma I.5 (Lemma D.6 in Lin et al. [[2025](https://arxiv.org/html/2605.24316#bib.bib1)]).
Consider
μtr=\(I−γtΣν\)μt−1r\+γtξtr,μ0r=0,\\mu\_\{t\}^\{r\}=\(I\-\\gamma\_\{t\}\\Sigma\_\{\\nu\}\)\\mu\_\{t\-1\}^\{r\}\+\\gamma\_\{t\}\\xi\_\{t\}^\{r\},\\qquad\\mu\_\{0\}^\{r\}=0,with𝔼\[ξtr\]=0\\mathbb\{E\}\[\\xi\_\{t\}^\{r\}\]=0and
𝔼\[ξtrξtr⊤\]⪯σξ,r2Σν\.\\mathbb\{E\}\[\\xi\_\{t\}^\{r\}\\xi\_\{t\}^\{r\\top\}\]\\preceq\\sigma\_\{\\xi,r\}^\{2\}\\Sigma\_\{\\nu\}\.Then for anyu∈\[0,1\]u\\in\[0,1\]andα\>1\\alpha\>1,
𝔼‖Σνu/2μLr‖22≲σξ,r2γ0tr\(Σν1/α\)\(Leffγ0\)1/α−u\.\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}^\{r\}\\\|\_\{2\}^\{2\}\\lesssim\\sigma\_\{\\xi,r\}^\{2\}\\,\\gamma\_\{0\}\\,\\mathrm\{tr\}\(\\Sigma\_\{\\nu\}^\{1/\\alpha\}\)\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/\\alpha\-u\}\.Under power\-law eigenvaluesμj\(Σν\)≍j−a\\mu\_\{j\}\(\\Sigma\_\{\\nu\}\)\\asymp j^\{\-a\}, this becomes
𝔼‖Σνu/2μLr‖22≲σξ,r2γ0\(Leffγ0\)1/a−u\.\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}^\{r\}\\\|\_\{2\}^\{2\}\\lesssim\\sigma\_\{\\xi,r\}^\{2\}\\,\\gamma\_\{0\}\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/a\-u\}\.
Next, we bound the covariance of an iterate $\nu_{t}$ that is driven by zero-mean noise $\eta_{t}$ through a recursion with the deterministic operator $\Sigma_{\nu}$.
###### Lemma I\.6\(Covariance bound for a semi\-stochastic linear recursion\)\.
Consider the recursion
νt=\(I−γtΣν\)νt−1\+γtηt,ν0=0,\\nu\_\{t\}=\(I\-\\gamma\_\{t\}\\Sigma\_\{\\nu\}\)\\nu\_\{t\-1\}\+\\gamma\_\{t\}\\eta\_\{t\},\\qquad\\nu\_\{0\}=0,whereΣν\\Sigma\_\{\\nu\}is PSD,𝔼\[ηt\]=0\\mathbb\{E\}\[\\eta\_\{t\}\]=0, and
𝔼\[ηtηt⊤\]⪯ση2Σν\.\\mathbb\{E\}\[\\eta\_\{t\}\\eta\_\{t\}^\{\\top\}\]\\preceq\\sigma\_\{\\eta\}^\{2\}\\Sigma\_\{\\nu\}\.Assume moreover that all eigenvalues ofI−γtΣνI\-\\gamma\_\{t\}\\Sigma\_\{\\nu\}lie in\[0,1\]\[0,1\]\. Then for everyt≥0t\\geq 0,
𝔼\[νtνt⊤\]⪯ση2γ0I\.\\mathbb\{E\}\[\\nu\_\{t\}\\nu\_\{t\}^\{\\top\}\]\\preceq\\sigma\_\{\\eta\}^\{2\}\\gamma\_\{0\}I\.
###### Proof\.
Unrolling the recursion gives
νt=∑i=1tγi∏j=i\+1t\(I−γjΣν\)ηi\.\\nu\_\{t\}=\\sum\_\{i=1\}^\{t\}\\gamma\_\{i\}\\prod\_\{j=i\+1\}^\{t\}\(I\-\\gamma\_\{j\}\\Sigma\_\{\\nu\}\)\\eta\_\{i\}\.Therefore,
𝔼\[νtνt⊤\]=∑i=1tγi2∏j=i\+1t\(I−γjΣν\)𝔼\[ηiηi⊤\]∏j=i\+1t\(I−γjΣν\)\.\\mathbb\{E\}\[\\nu\_\{t\}\\nu\_\{t\}^\{\\top\}\]=\\sum\_\{i=1\}^\{t\}\\gamma\_\{i\}^\{2\}\\prod\_\{j=i\+1\}^\{t\}\(I\-\\gamma\_\{j\}\\Sigma\_\{\\nu\}\)\\mathbb\{E\}\[\\eta\_\{i\}\\eta\_\{i\}^\{\\top\}\]\\prod\_\{j=i\+1\}^\{t\}\(I\-\\gamma\_\{j\}\\Sigma\_\{\\nu\}\)\.Using𝔼\[ηiηi⊤\]⪯ση2Σν\\mathbb\{E\}\[\\eta\_\{i\}\\eta\_\{i\}^\{\\top\}\]\\preceq\\sigma\_\{\\eta\}^\{2\}\\Sigma\_\{\\nu\}andγi≤γ0\\gamma\_\{i\}\\leq\\gamma\_\{0\}, we obtain
𝔼\[νtνt⊤\]⪯ση2γ0∑i=1tγi∏j=i\+1t\(I−γjΣν\)Σν∏j=i\+1t\(I−γjΣν\)\.\\mathbb\{E\}\[\\nu\_\{t\}\\nu\_\{t\}^\{\\top\}\]\\preceq\\sigma\_\{\\eta\}^\{2\}\\gamma\_\{0\}\\sum\_\{i=1\}^\{t\}\\gamma\_\{i\}\\prod\_\{j=i\+1\}^\{t\}\(I\-\\gamma\_\{j\}\\Sigma\_\{\\nu\}\)\\Sigma\_\{\\nu\}\\prod\_\{j=i\+1\}^\{t\}\(I\-\\gamma\_\{j\}\\Sigma\_\{\\nu\}\)\.All factors are polynomials inΣν\\Sigma\_\{\\nu\}, so they commute\. Since every eigenvalue ofI−γjΣνI\-\\gamma\_\{j\}\\Sigma\_\{\\nu\}lies in\[0,1\]\[0,1\], one has
∏j=i\+1t\(I−γjΣν\)2⪯∏j=i\+1t\(I−γjΣν\)\.\\prod\_\{j=i\+1\}^\{t\}\(I\-\\gamma\_\{j\}\\Sigma\_\{\\nu\}\)^\{2\}\\preceq\\prod\_\{j=i\+1\}^\{t\}\(I\-\\gamma\_\{j\}\\Sigma\_\{\\nu\}\)\.Hence
𝔼\[νtνt⊤\]⪯ση2γ0∑i=1tγi∏j=i\+1t\(I−γjΣν\)Σν\.\\mathbb\{E\}\[\\nu\_\{t\}\\nu\_\{t\}^\{\\top\}\]\\preceq\\sigma\_\{\\eta\}^\{2\}\\gamma\_\{0\}\\sum\_\{i=1\}^\{t\}\\gamma\_\{i\}\\prod\_\{j=i\+1\}^\{t\}\(I\-\\gamma\_\{j\}\\Sigma\_\{\\nu\}\)\\Sigma\_\{\\nu\}\.Finally, diagonalizingΣν\\Sigma\_\{\\nu\}reduces the last sum to the scalar identity
∑i=1tγi∏j=i\+1t\(1−γjλ\)λ=1−∏j=1t\(1−γjλ\)≤1,λ≥0,\\sum\_\{i=1\}^\{t\}\\gamma\_\{i\}\\prod\_\{j=i\+1\}^\{t\}\(1\-\\gamma\_\{j\}\\lambda\)\\lambda=1\-\\prod\_\{j=1\}^\{t\}\(1\-\\gamma\_\{j\}\\lambda\)\\leq 1,\\qquad\\lambda\\geq 0,which yields
∑i=1tγi∏j=i\+1t\(I−γjΣν\)Σν⪯I\.\\sum\_\{i=1\}^\{t\}\\gamma\_\{i\}\\prod\_\{j=i\+1\}^\{t\}\(I\-\\gamma\_\{j\}\\Sigma\_\{\\nu\}\)\\Sigma\_\{\\nu\}\\preceq I\.Combining the last two displays proves the claim\. ∎
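The scalar telescoping identity used in the last step is easy to confirm numerically. The sketch below is a minimal check with hypothetical step sizes and eigenvalue: it evaluates both sides of $\sum_{i=1}^{t}\gamma_i\lambda\prod_{j=i+1}^{t}(1-\gamma_j\lambda)=1-\prod_{j=1}^{t}(1-\gamma_j\lambda)$ exactly.

```python
from fractions import Fraction


def weighted_sum(gammas, lam):
    """Left-hand side: sum_i gamma_i * lam * prod_{j > i} (1 - gamma_j * lam)."""
    t = len(gammas)
    total = Fraction(0)
    for i in range(t):
        tail = Fraction(1)
        for j in range(i + 1, t):
            tail *= 1 - gammas[j] * lam
        total += gammas[i] * lam * tail
    return total


def one_minus_product(gammas, lam):
    """Right-hand side: 1 - prod_j (1 - gamma_j * lam)."""
    prod = Fraction(1)
    for g in gammas:
        prod *= 1 - g * lam
    return 1 - prod


# Hypothetical decaying step sizes and an eigenvalue with gamma_j * lam in [0, 1].
gammas = [Fraction(1, 2), Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)]
lam = Fraction(3, 2)

lhs = weighted_sum(gammas, lam)
rhs = one_minus_product(gammas, lam)
print(lhs, rhs, lhs == rhs, lhs <= 1)
```

The bound by 1 uses that each factor $1-\gamma_j\lambda$ lies in $[0,1]$, which is the eigenvalue assumption of the lemma.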
After bounding the covariance, we now prove the covariance propagation bound.
###### Lemma I\.7\(Covariance propagation bound\)\.
For allk≥0k\\geq 0,
𝔼\[ξt\(k\)ξt\(k\)⊤\]⪯σξ2γ0k\(4CA\)kΣν,\\mathbb\{E\}\[\\xi\_\{t\}^\{\(k\)\}\\xi\_\{t\}^\{\(k\)\\top\}\]\\preceq\\sigma\_\{\\xi\}^\{2\}\\gamma\_\{0\}^\{k\}\(4C\_\{A\}\)^\{k\}\\Sigma\_\{\\nu\},and
𝔼\[μt\(k\)μt\(k\)⊤\]⪯σξ2γ0k\+1\(4CA\)kI\.\\mathbb\{E\}\[\\mu\_\{t\}^\{\(k\)\}\\mu\_\{t\}^\{\(k\)\\top\}\]\\preceq\\sigma\_\{\\xi\}^\{2\}\\gamma\_\{0\}^\{k\+1\}\(4C\_\{A\}\)^\{k\}I\.
###### Proof\.
We proceed by induction onkk\. Fork=0k=0, the first inequality is exactly the assumption
𝔼\[ξtξt⊤\]⪯σξ2Σν\.\\mathbb\{E\}\[\\xi\_\{t\}\\xi\_\{t\}^\{\\top\}\]\\preceq\\sigma\_\{\\xi\}^\{2\}\\Sigma\_\{\\nu\}\.For the second inequality whenk=0k=0, Lemma[I\.6](https://arxiv.org/html/2605.24316#A9.Thmlemma6)applied to
μt\(0\)=\(I−γtΣν\)μt−1\(0\)\+γtξt\\mu\_\{t\}^\{\(0\)\}=\(I\-\\gamma\_\{t\}\\Sigma\_\{\\nu\}\)\\mu\_\{t\-1\}^\{\(0\)\}\+\\gamma\_\{t\}\\xi\_\{t\}gives
𝔼\[μt\(0\)μt\(0\)⊤\]⪯σξ2γ0I\.\\mathbb\{E\}\[\\mu\_\{t\}^\{\(0\)\}\\mu\_\{t\}^\{\(0\)\\top\}\]\\preceq\\sigma\_\{\\xi\}^\{2\}\\gamma\_\{0\}I\.
Now assume the claim holds for somek≥0k\\geq 0\. Since
ξt\(k\+1\)=\(Σν−At\)μt−1\(k\),\\xi\_\{t\}^\{\(k\+1\)\}=\(\\Sigma\_\{\\nu\}\-A\_\{t\}\)\\mu\_\{t\-1\}^\{\(k\)\},we have
𝔼\[ξt\(k\+1\)ξt\(k\+1\)⊤\]=𝔼\[\(Σν−At\)𝔼\[μt−1\(k\)μt−1\(k\)⊤\]\(Σν−At\)\]\.\\mathbb\{E\}\[\\xi\_\{t\}^\{\(k\+1\)\}\\xi\_\{t\}^\{\(k\+1\)\\top\}\]=\\mathbb\{E\}\\Bigl\[\(\\Sigma\_\{\\nu\}\-A\_\{t\}\)\\,\\mathbb\{E\}\[\\mu\_\{t\-1\}^\{\(k\)\}\\mu\_\{t\-1\}^\{\(k\)\\top\}\]\\,\(\\Sigma\_\{\\nu\}\-A\_\{t\}\)\\Bigr\]\.Using the induction hypothesis,
𝔼\[μt−1\(k\)μt−1\(k\)⊤\]⪯σξ2γ0k\+1\(4CA\)kI,\\mathbb\{E\}\[\\mu\_\{t\-1\}^\{\(k\)\}\\mu\_\{t\-1\}^\{\(k\)\\top\}\]\\preceq\\sigma\_\{\\xi\}^\{2\}\\gamma\_\{0\}^\{k\+1\}\(4C\_\{A\}\)^\{k\}I,thus
𝔼\[ξt\(k\+1\)ξt\(k\+1\)⊤\]⪯σξ2γ0k\+1\(4CA\)k𝔼\[\(Σν−At\)2\]\.\\mathbb\{E\}\[\\xi\_\{t\}^\{\(k\+1\)\}\\xi\_\{t\}^\{\(k\+1\)\\top\}\]\\preceq\\sigma\_\{\\xi\}^\{2\}\\gamma\_\{0\}^\{k\+1\}\(4C\_\{A\}\)^\{k\}\\,\\mathbb\{E\}\[\(\\Sigma\_\{\\nu\}\-A\_\{t\}\)^\{2\}\]\.Using𝔼\[At\]=Σν\\mathbb\{E\}\[A\_\{t\}\]=\\Sigma\_\{\\nu\}, we obtain the exact identity
𝔼\[\(Σν−At\)2\]=𝔼\[At2\]−Σν2⪯𝔼\[At2\]\.\\mathbb\{E\}\[\(\\Sigma\_\{\\nu\}\-A\_\{t\}\)^\{2\}\]=\\mathbb\{E\}\[A\_\{t\}^\{2\}\]\-\\Sigma\_\{\\nu\}^\{2\}\\preceq\\mathbb\{E\}\[A\_\{t\}^\{2\}\]\.By the assumption𝔼\[At2\]⪯CAΣν\\mathbb\{E\}\[A\_\{t\}^\{2\}\]\\preceq C\_\{A\}\\Sigma\_\{\\nu\}, it follows that
𝔼\[\(Σν−At\)2\]⪯CAΣν\.\\mathbb\{E\}\[\(\\Sigma\_\{\\nu\}\-A\_\{t\}\)^\{2\}\]\\preceq C\_\{A\}\\Sigma\_\{\\nu\}\.Hence
𝔼\[ξt\(k\+1\)ξt\(k\+1\)⊤\]⪯σξ2γ0k\+1\(4CA\)kCAΣν⪯σξ2γ0k\+1\(4CA\)k\+1Σν\.\\mathbb\{E\}\[\\xi\_\{t\}^\{\(k\+1\)\}\\xi\_\{t\}^\{\(k\+1\)\\top\}\]\\preceq\\sigma\_\{\\xi\}^\{2\}\\gamma\_\{0\}^\{k\+1\}\(4C\_\{A\}\)^\{k\}C\_\{A\}\\Sigma\_\{\\nu\}\\preceq\\sigma\_\{\\xi\}^\{2\}\\gamma\_\{0\}^\{k\+1\}\(4C\_\{A\}\)^\{k\+1\}\\Sigma\_\{\\nu\}\.The bound on𝔼\[μt\(k\+1\)μt\(k\+1\)⊤\]\\mathbb\{E\}\[\\mu\_\{t\}^\{\(k\+1\)\}\\mu\_\{t\}^\{\(k\+1\)\\top\}\]then follows from Lemma[I\.6](https://arxiv.org/html/2605.24316#A9.Thmlemma6), applied to the recursion
μt\(k\+1\)=\(I−γtΣν\)μt−1\(k\+1\)\+γtξt\(k\+1\),\\mu\_\{t\}^\{\(k\+1\)\}=\(I\-\\gamma\_\{t\}\\Sigma\_\{\\nu\}\)\\mu\_\{t\-1\}^\{\(k\+1\)\}\+\\gamma\_\{t\}\\xi\_\{t\}^\{\(k\+1\)\},with
$$\mathbb{E}[\xi_{t}^{(k+1)}\xi_{t}^{(k+1)\top}]\preceq\sigma_{\xi}^{2}\gamma_{0}^{k+1}(4C_{A})^{k+1}\Sigma_{\nu},$$
which gives
$$\mathbb{E}[\mu_{t}^{(k+1)}\mu_{t}^{(k+1)\top}]\preceq\sigma_{\xi}^{2}\gamma_{0}^{k+2}(4C_{A})^{k+1}I.$$
This closes the induction. ∎
Besides the lemmas above, which control the terms $\mu_{L}^{(i)}$, the final upper bound also requires the following companion estimate for the remainder.
###### Lemma I.8 (Companion bound for the $p$-part recursion).
Consider
μtp=\(I−γtAt\)μt−1p\+γtξtp,μ0p=0,\\mu\_\{t\}^\{p\}=\(I\-\\gamma\_\{t\}A\_\{t\}\)\\mu\_\{t\-1\}^\{p\}\+\\gamma\_\{t\}\\xi\_\{t\}^\{p\},\\qquad\\mu\_\{0\}^\{p\}=0,with𝔼\[ξtp\]=0\\mathbb\{E\}\[\\xi\_\{t\}^\{p\}\]=0and
𝔼\[ξtpξtp⊤\]⪯σξ,p2Σν\.\\mathbb\{E\}\[\\xi\_\{t\}^\{p\}\\xi\_\{t\}^\{p\\top\}\]\\preceq\\sigma\_\{\\xi,p\}^\{2\}\\Sigma\_\{\\nu\}\.Then for anyu∈\[0,1\]u\\in\[0,1\],
𝔼‖Σνu/2μLp‖22≲σξ,p2γ02CAutr\(Σν\)Leff\.\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}^\{p\}\\\|\_\{2\}^\{2\}\\lesssim\\sigma\_\{\\xi,p\}^\{2\}\\gamma\_\{0\}^\{2\}C\_\{A\}^\{u\}\\mathrm\{tr\}\(\\Sigma\_\{\\nu\}\)L\_\{\\mathrm\{eff\}\}\.
###### Proof\.
We have
𝔼‖Σνu/2μLp‖22=∑t=1Lγt2tr\(𝔼\[Σνu∏i=t\+1L\(I−γiAi\)ξtpξtp⊤∏j=Lt\+1\(I−γjAj\)\]\)\.\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}^\{p\}\\\|\_\{2\}^\{2\}=\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}^\{2\}\\,\\mathrm\{tr\}\\\!\\left\(\\mathbb\{E\}\\bigl\[\\Sigma\_\{\\nu\}^\{u\}\\prod\_\{i=t\+1\}^\{L\}\(I\-\\gamma\_\{i\}A\_\{i\}\)\\,\\xi\_\{t\}^\{p\}\\xi\_\{t\}^\{p\\top\}\\,\\prod\_\{j=L\}^\{t\+1\}\(I\-\\gamma\_\{j\}A\_\{j\}\)\\bigr\]\\right\)\.Using𝔼\[ξtpξtp⊤\]⪯σξ,p2Σν\\mathbb\{E\}\[\\xi\_\{t\}^\{p\}\\xi\_\{t\}^\{p\\top\}\]\\preceq\\sigma\_\{\\xi,p\}^\{2\}\\Sigma\_\{\\nu\},∑tγt2≲γ02Leff\\sum\_\{t\}\\gamma\_\{t\}^\{2\}\\lesssim\\gamma\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}, andΣν2⪯CAΣν\\Sigma\_\{\\nu\}^\{2\}\\preceq C\_\{A\}\\Sigma\_\{\\nu\}, we have the following through elementary calculations:
𝔼‖Σνu/2μLp‖22≲σξ,p2γ02CAutr\(Σν\)Leff\.\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}^\{p\}\\\|\_\{2\}^\{2\}\\lesssim\\sigma\_\{\\xi,p\}^\{2\}\\gamma\_\{0\}^\{2\}C\_\{A\}^\{u\}\\mathrm\{tr\}\(\\Sigma\_\{\\nu\}\)L\_\{\\mathrm\{eff\}\}\.∎
Now we are able to combine Lemmas[I\.5](https://arxiv.org/html/2605.24316#A9.Thmlemma5),[I\.7](https://arxiv.org/html/2605.24316#A9.Thmlemma7), and[I\.8](https://arxiv.org/html/2605.24316#A9.Thmlemma8)\. By Minkowski’s inequality,
\(𝔼‖Σνu/2μL‖22\)1/2≤∑i=0k\(𝔼‖Σνu/2μL\(i\)‖22\)1/2\+\(𝔼‖Σνu/2\(μL−∑i=0kμL\(i\)\)‖22\)1/2\.\\bigl\(\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}\\\|\_\{2\}^\{2\}\\bigr\)^\{1/2\}\\leq\\sum\_\{i=0\}^\{k\}\\bigl\(\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}^\{\(i\)\}\\\|\_\{2\}^\{2\}\\bigr\)^\{1/2\}\+\\Bigl\(\\mathbb\{E\}\\Bigl\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\Bigl\(\\mu\_\{L\}\-\\sum\_\{i=0\}^\{k\}\\mu\_\{L\}^\{\(i\)\}\\Bigr\)\\Bigr\\\|\_\{2\}^\{2\}\\Bigr\)^\{1/2\}\.Applying Lemma[I\.5](https://arxiv.org/html/2605.24316#A9.Thmlemma5)toμL\(i\)\\mu\_\{L\}^\{\(i\)\}and Lemma[I\.8](https://arxiv.org/html/2605.24316#A9.Thmlemma8)to the remainder yields
\(𝔼‖Σνu/2μL‖22\)1/2≲∑i=0k\(σξ2γ0i\(4CA\)i⋅γ0tr\(Σν1/α\)\(Leffγ0\)1/α−u\)1/2\\bigl\(\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}\\\|\_\{2\}^\{2\}\\bigr\)^\{1/2\}\\lesssim\\sum\_\{i=0\}^\{k\}\\Bigl\(\\sigma\_\{\\xi\}^\{2\}\\gamma\_\{0\}^\{i\}\(4C\_\{A\}\)^\{i\}\\cdot\\gamma\_\{0\}\\mathrm\{tr\}\(\\Sigma\_\{\\nu\}^\{1/\\alpha\}\)\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/\\alpha\-u\}\\Bigr\)^\{1/2\}\+\(σξ2γ0k\+3\(4CA\)k\+1CAutr\(Σν\)Leff\)1/2\.\\qquad\+\\Bigl\(\\sigma\_\{\\xi\}^\{2\}\\gamma\_\{0\}^\{k\+3\}\(4C\_\{A\}\)^\{k\+1\}\\,C\_\{A\}^\{u\}\\,\\mathrm\{tr\}\(\\Sigma\_\{\\nu\}\)L\_\{\\mathrm\{eff\}\}\\Bigr\)^\{1/2\}\.Since4γ0CA≤1/2<14\\gamma\_\{0\}C\_\{A\}\\leq 1/2<1, the geometric series converges\. Lettingk→∞k\\to\\inftygives
𝔼‖Σνu/2μL‖22≲σξ2γ0tr\(Σν1/α\)\(Leffγ0\)1/α−u\.\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}\\\|\_\{2\}^\{2\}\\lesssim\\sigma\_\{\\xi\}^\{2\}\\,\\gamma\_\{0\}\\,\\mathrm\{tr\}\(\\Sigma\_\{\\nu\}^\{1/\\alpha\}\)\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/\\alpha\-u\}\.Now supposeμj\(Σν\)≍j−a\\mu\_\{j\}\(\\Sigma\_\{\\nu\}\)\\asymp j^\{\-a\}forj≤min\{M,N/c~\}j\\leq\\min\\\{M,N/\\widetilde\{c\}\\\}\. Then the power\-law part of Lemma[I\.5](https://arxiv.org/html/2605.24316#A9.Thmlemma5)yields, for everyi≥0i\\geq 0,
𝔼‖Σνu/2μL\(i\)‖22≲σξ2γ0i\+1\(4CA\)i\(Leffγ0\)1/a−u\.\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}^\{\(i\)\}\\\|\_\{2\}^\{2\}\\lesssim\\sigma\_\{\\xi\}^\{2\}\\,\\gamma\_\{0\}^\{i\+1\}\(4C\_\{A\}\)^\{i\}\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/a\-u\}\.The remainder estimate is unchanged\. Therefore the same geometric\-series argument, together with4γ0CA≤1/2<14\\gamma\_\{0\}C\_\{A\}\\leq 1/2<1, gives
𝔼‖Σνu/2μL‖22≲σξ2γ0\(Leffγ0\)1/a−u\.\\mathbb\{E\}\\\|\\Sigma\_\{\\nu\}^\{u/2\}\\mu\_\{L\}\\\|\_\{2\}^\{2\}\\lesssim\\sigma\_\{\\xi\}^\{2\}\\,\\gamma\_\{0\}\\,\(L\_\{\\mathrm\{eff\}\}\\gamma\_\{0\}\)^\{1/a\-u\}\.This is the claimed power\-law conclusion\. ∎
## Appendix J Fluctuation Error under Multi\-pass Batch SGD without Replacement
This section focuses on the multi\-pass batch SGD procedure without replacement, equation [4](https://arxiv.org/html/2605.24316#S2.E4), and uses the notation from Section [A\.5](https://arxiv.org/html/2605.24316#A1.SS5)\. Appendix [I](https://arxiv.org/html/2605.24316#A9) proves the corresponding fluctuation bounds for the multi\-pass batch SGD procedure with replacement, equation [3](https://arxiv.org/html/2605.24316#S2.E3); here we record the without\-replacement counterparts\.
Define
FlucBwor:=𝔼batch\[‖Σ1/2\(uLwor−θL\)‖22\]=𝔼batch\[‖Σ1/2ΔLwor‖22\]\.\\mathrm\{Fluc\}^\{\\mathrm\{wor\}\}\_\{B\}:=\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\\bigl\[\\\|\\Sigma^\{1/2\}\(u\_\{L\}^\{\\mathrm\{wor\}\}\-\\theta\_\{L\}\)\\\|\_\{2\}^\{2\}\\bigr\]=\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\\bigl\[\\\|\\Sigma^\{1/2\}\\Delta\_\{L\}^\{\\mathrm\{wor\}\}\\\|\_\{2\}^\{2\}\\bigr\]\.By Section[A\.5](https://arxiv.org/html/2605.24316#A1.SS5), the fluctuation process satisfies
Δtwor=\(I−γtΣ^It\(B\)\)Δt−1wor\+γtξt,wor\(B\)\.\\Delta\_\{t\}^\{\\mathrm\{wor\}\}=\\bigl\(I\-\\gamma\_\{t\}\\widehat\{\\Sigma\}\_\{I\_\{t\}\}^\{\(B\)\}\\bigr\)\\Delta\_\{t\-1\}^\{\\mathrm\{wor\}\}\+\\gamma\_\{t\}\\xi\_\{t,\\mathrm\{wor\}\}^\{\(B\)\}\.
Define the finite\-population correction factor
ρN,B:=N−BB\(N−1\)\.\\rho\_\{N,B\}:=\\frac\{N\-B\}\{B\(N\-1\)\}\.\(26\)We can observe that
ρN,1=1,ρN,N=0,\\rho\_\{N,1\}=1,\\qquad\\rho\_\{N,N\}=0,and, more generally, at the asymptotic level,
ρN,B≍1BwhenB≪N\.\\rho\_\{N,B\}\\asymp\\frac\{1\}\{B\}\\qquad\\text\{when \}B\\ll N\.
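To make the size of the correction concrete, the following minimal sketch tabulates the factor against the with\-replacement prefactor 1/B. The helper name `rho` is ours, and the values N=512 and the power\-of\-two grid are borrowed from the experimental setup purely for illustration. The ratio of the two prefactors is (N−B)/(N−1), which stays close to one for small B and reaches zero at B=N.

```python
def rho(N: int, B: int) -> float:
    """Finite-population correction factor (N - B) / (B * (N - 1))."""
    return (N - B) / (B * (N - 1))


N = 512  # illustrative dataset size
batches = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

for B in batches:
    ratio = rho(N, B) * B  # algebraically equal to (N - B) / (N - 1)
    print(f"B={B:4d}  1/B={1 / B:.5f}  rho={rho(N, B):.5f}  rho/(1/B)={ratio:.4f}")
```

The table shows that the two sampling rules are nearly indistinguishable for B much smaller than N and separate only when the batch is a sizeable fraction of the dataset.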
###### Lemma J\.1\(Finite\-population covariance identity\)\.
Letζ1,…,ζN∈ℝM\\zeta\_\{1\},\\dots,\\zeta\_\{N\}\\in\\mathbb\{R\}^\{M\}satisfy
1N∑j=1Nζj=0\.\\frac\{1\}\{N\}\\sum\_\{j=1\}^\{N\}\\zeta\_\{j\}=0\.LetI⊂\[N\]I\\subset\[N\]be sampled uniformly without replacement with\|I\|=B\|I\|=B, and define
ζ¯I:=1B∑j∈Iζj\.\\bar\{\\zeta\}\_\{I\}:=\\frac\{1\}\{B\}\\sum\_\{j\\in I\}\\zeta\_\{j\}\.Then
𝔼I\[ζ¯Iζ¯I⊤\]=ρN,B⋅1N∑j=1Nζjζj⊤=ρN,B𝔼i∼unif\(\[N\]\)\[ζiζi⊤\]\.\\mathbb\{E\}\_\{I\}\[\\bar\{\\zeta\}\_\{I\}\\bar\{\\zeta\}\_\{I\}^\{\\top\}\]=\\rho\_\{N,B\}\\cdot\\frac\{1\}\{N\}\\sum\_\{j=1\}^\{N\}\\zeta\_\{j\}\\zeta\_\{j\}^\{\\top\}=\\rho\_\{N,B\}\\,\\mathbb\{E\}\_\{i\\sim\\mathrm\{unif\}\(\[N\]\)\}\[\\zeta\_\{i\}\\zeta\_\{i\}^\{\\top\}\]\.
###### Proof\.
Expanding the covariance gives
𝔼I\[ζ¯Iζ¯I⊤\]=1B2∑i=1Nℙ\(i∈I\)ζiζi⊤\+1B2∑i≠jℙ\(i,j∈I\)ζiζj⊤\.\\mathbb\{E\}\_\{I\}\[\\bar\{\\zeta\}\_\{I\}\\bar\{\\zeta\}\_\{I\}^\{\\top\}\]=\\frac\{1\}\{B^\{2\}\}\\sum\_\{i=1\}^\{N\}\\mathbb\{P\}\(i\\in I\)\\,\\zeta\_\{i\}\\zeta\_\{i\}^\{\\top\}\+\\frac\{1\}\{B^\{2\}\}\\sum\_\{i\\neq j\}\\mathbb\{P\}\(i,j\\in I\)\\,\\zeta\_\{i\}\\zeta\_\{j\}^\{\\top\}\.For uniform sampling without replacement,
ℙ\(i∈I\)=BN,ℙ\(i,j∈I\)=B\(B−1\)N\(N−1\)fori≠j\.\\mathbb\{P\}\(i\\in I\)=\\frac\{B\}\{N\},\\qquad\\mathbb\{P\}\(i,j\\in I\)=\\frac\{B\(B\-1\)\}\{N\(N\-1\)\}\\quad\\text\{for \}i\\neq j\.Since∑j=1Nζj=0\\sum\_\{j=1\}^\{N\}\\zeta\_\{j\}=0,
∑i≠jζiζj⊤=\(∑i=1Nζi\)\(∑j=1Nζj\)⊤−∑i=1Nζiζi⊤=−∑i=1Nζiζi⊤\.\\sum\_\{i\\neq j\}\\zeta\_\{i\}\\zeta\_\{j\}^\{\\top\}=\\left\(\\sum\_\{i=1\}^\{N\}\\zeta\_\{i\}\\right\)\\left\(\\sum\_\{j=1\}^\{N\}\\zeta\_\{j\}\\right\)^\{\\top\}\-\\sum\_\{i=1\}^\{N\}\\zeta\_\{i\}\\zeta\_\{i\}^\{\\top\}=\-\\sum\_\{i=1\}^\{N\}\\zeta\_\{i\}\\zeta\_\{i\}^\{\\top\}\.Substituting these identities yields
𝔼I\[ζ¯Iζ¯I⊤\]=1B2\(BN−B\(B−1\)N\(N−1\)\)∑i=1Nζiζi⊤=N−BBN\(N−1\)∑i=1Nζiζi⊤,\\mathbb\{E\}\_\{I\}\[\\bar\{\\zeta\}\_\{I\}\\bar\{\\zeta\}\_\{I\}^\{\\top\}\]=\\frac\{1\}\{B^\{2\}\}\\left\(\\frac\{B\}\{N\}\-\\frac\{B\(B\-1\)\}\{N\(N\-1\)\}\\right\)\\sum\_\{i=1\}^\{N\}\\zeta\_\{i\}\\zeta\_\{i\}^\{\\top\}=\\frac\{N\-B\}\{BN\(N\-1\)\}\\sum\_\{i=1\}^\{N\}\\zeta\_\{i\}\\zeta\_\{i\}^\{\\top\},which is exactly the claimed formula becauseρN,B=\(N−B\)/\(B\(N−1\)\)\\rho\_\{N,B\}=\(N\-B\)/\(B\(N\-1\)\)\. ∎
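The identity can also be checked numerically on a small population. The sketch below is a sanity check under our own choices (N=7, M=3, Gaussian vectors that are centered by hand; all function names are hypothetical): it averages the outer product of the batch mean over every size\-B subset, which is the exact expectation under uniform sampling without replacement, and compares it with the right\-hand side of the lemma.

```python
import itertools

import numpy as np


def rho(N, B):
    """Finite-population correction factor (N - B) / (B * (N - 1))."""
    return (N - B) / (B * (N - 1))


def exact_batch_second_moment(zeta, B):
    """Average of zbar_I zbar_I^T over all size-B subsets I of the rows of zeta."""
    N, M = zeta.shape
    total = np.zeros((M, M))
    count = 0
    for subset in itertools.combinations(range(N), B):
        zbar = zeta[list(subset)].mean(axis=0)
        total += np.outer(zbar, zbar)
        count += 1
    return total / count


rng = np.random.default_rng(0)
N, M = 7, 3
zeta = rng.standard_normal((N, M))
zeta -= zeta.mean(axis=0)        # enforce the centering assumption of the lemma
population = zeta.T @ zeta / N   # (1/N) * sum_j zeta_j zeta_j^T

max_err = max(
    np.abs(exact_batch_second_moment(zeta, B) - rho(N, B) * population).max()
    for B in range(1, N + 1)
)
print(f"max abs deviation over all B: {max_err:.2e}")
```

Because the enumeration is exhaustive, any deviation is floating\-point round\-off rather than Monte Carlo error.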
The next corollary shows that the upper\-bound conclusions of Lemma[I\.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1)and Lemma[I\.2](https://arxiv.org/html/2605.24316#A9.Thmlemma2)remain valid after replacing the factor1/B1/BbyρN,B\\rho\_\{N,B\}and the with\-replacement iterateutwru\_\{t\}^\{\\mathrm\{wr\}\}by the multi\-pass batch SGD without\-replacement iterateutworu\_\{t\}^\{\\mathrm\{wor\}\}\.
###### Corollary J\.1\(Upper fluctuation bounds for multi\-pass batch SGD without replacement\)\.
Assume the same fixed\-dataset mini\-batch setup as above, where at each iterationttthe batchIt⊂\[N\]I\_\{t\}\\subset\[N\]is sampled uniformly without replacement with\|It\|=B\|I\_\{t\}\|=B, independently across iterations\.
LetρN,B\\rho\_\{N,B\}be defined by equation[26](https://arxiv.org/html/2605.24316#A10.E26)\. Then the following hold\.
1. \(1\)Upper bound\.Under the assumptions of Lemma[I\.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1), one has 𝔼\[FlucBwor\]≤c⋅𝔼\[FB\]⋅ρN,Btr\(Σ^1/α\)⋅γ1/αLeff1/α−1\.\\mathbb\{E\}\[\\mathrm\{Fluc\}^\{\\mathrm\{wor\}\}\_\{B\}\]\\leq c\\cdot\\mathbb\{E\}\[F\_\{B\}\]\\cdot\\rho\_\{N,B\}\\,\\mathrm\{tr\}\(\\widehat\{\\Sigma\}^\{1/\\alpha\}\)\\cdot\\gamma^\{1/\\alpha\}L\_\{\\mathrm\{eff\}\}^\{1/\\alpha\-1\}\.
2. \(2\)Source\-condition upper bound\.Under the assumptions of Lemma[I\.2](https://arxiv.org/html/2605.24316#A9.Thmlemma2), one has 𝔼\[FlucBwor\]≲ρN,BγlogN\[\(Leffγ\)1/a−1\+\(Leffγ\)1/aN\]\.\\mathbb\{E\}\[\\mathrm\{Fluc\}^\{\\mathrm\{wor\}\}\_\{B\}\]\\lesssim\\rho\_\{N,B\}\\,\\gamma\\log N\\left\[\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\-1\}\+\\frac\{\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{1/a\}\}\{N\}\\right\]\.
In particular, whenB=NB=N, one hasρN,N=0\\rho\_\{N,N\}=0, and therefore
FlucNwor=0,\\mathrm\{Fluc\}^\{\\mathrm\{wor\}\}\_\{N\}=0,which matches the identityutwor=θtu\_\{t\}^\{\\mathrm\{wor\}\}=\\theta\_\{t\}from Section[2](https://arxiv.org/html/2605.24316#S2)\.
###### Proof\.
We only need to identify the places in the proofs of Lemma[I\.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1)and Lemma[I\.2](https://arxiv.org/html/2605.24316#A9.Thmlemma2)where the factor1/B1/Benters, and replace it by the correct covariance factor for sampling without replacement\.
Fixtt, and condition on\(S,w∗,D,ℱt−1\)\(S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\)\. Define
ζt\(j\):=−\(Sxjxj⊤S⊤−Σ^\)\(θt−1−u∗\)\+\(Sxjε~j−c^\),j∈\[N\]\.\\zeta\_\{t\}\(j\):=\-\\bigl\(Sx\_\{j\}x\_\{j\}^\{\\top\}S^\{\\top\}\-\\widehat\{\\Sigma\}\\bigr\)\(\\theta\_\{t\-1\}\-u^\{\\ast\}\)\+\\bigl\(Sx\_\{j\}\\widetilde\{\\varepsilon\}\_\{j\}\-\\widehat\{c\}\\bigr\),\\qquad j\\in\[N\]\.Since
Σ^=1N∑i=1NSxixi⊤S⊤,c^=1N∑i=1NSxiε~i,\\widehat\{\\Sigma\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}Sx\_\{i\}x\_\{i\}^\{\\top\}S^\{\\top\},\\qquad\\widehat\{c\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}Sx\_\{i\}\\widetilde\{\\varepsilon\}\_\{i\},we have
1N∑j=1Nζt\(j\)=0\.\\frac\{1\}\{N\}\\sum\_\{j=1\}^\{N\}\\zeta\_\{t\}\(j\)=0\.Now letIt⊂\[N\]I\_\{t\}\\subset\[N\]be a uniformly random subset of sizeBB, sampled without replacement, and write
ξt,wor\(B\)=1B∑j∈Itζt\(j\)\.\\xi\_\{t,\\mathrm\{wor\}\}^\{\(B\)\}=\\frac\{1\}\{B\}\\sum\_\{j\\in I\_\{t\}\}\\zeta\_\{t\}\(j\)\.For the upper\-bound argument, we must also verify that the random transition matrix
At:=Σ^It\(B\)=1B∑i∈Itzizi⊤,zi:=Sxi,CA:=maxi∈\[N\]‖zi‖22A\_\{t\}:=\\widehat\{\\Sigma\}\_\{I\_\{t\}\}^\{\(B\)\}=\\frac\{1\}\{B\}\\sum\_\{i\\in I\_\{t\}\}z\_\{i\}z\_\{i\}^\{\\top\},\\qquad z\_\{i\}:=Sx\_\{i\},\\qquad C\_\{A\}:=\\max\_\{i\\in\[N\]\}\\\|z\_\{i\}\\\|\_\{2\}^\{2\}satisfies the same moment assumptions used in Lemma[I\.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4)\. Since eachzizi⊤⪯CAIz\_\{i\}z\_\{i\}^\{\\top\}\\preceq C\_\{A\}I, we have pathwise
0⪯At⪯CAI,henceAt2⪯CAAt\.0\\preceq A\_\{t\}\\preceq C\_\{A\}I,\\qquad\\text\{hence\}\\qquad A\_\{t\}^\{2\}\\preceq C\_\{A\}A\_\{t\}\.Taking expectation over the without\-replacement batch gives
𝔼batch\[At\]=Σ^,𝔼batch\[At2\]⪯CA𝔼batch\[At\]=CAΣ^\.\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[A\_\{t\}\]=\\widehat\{\\Sigma\},\\qquad\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[A\_\{t\}^\{2\}\]\\preceq C\_\{A\}\\,\\mathbb\{E\}\_\{\\mathrm\{batch\}\}\[A\_\{t\}\]=C\_\{A\}\\widehat\{\\Sigma\}\.AlsoΣ^=1N∑i=1Nzizi⊤⪯CAI\\widehat\{\\Sigma\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}z\_\{i\}z\_\{i\}^\{\\top\}\\preceq C\_\{A\}I, so‖Σ^‖2≤CA\\\|\\widehat\{\\Sigma\}\\\|\_\{2\}\\leq C\_\{A\}\. Thus the same mean and matrix\-moment bounds as in Lemma[I\.3](https://arxiv.org/html/2605.24316#A9.Thmlemma3)remain valid for the without\-replacement transition matrices\. Sinceθt−1\\theta\_\{t\-1\}is deterministic once\(S,w∗,D\)\(S,w^\{\\ast\},D\)are fixed and the batches\(It\)t∈\[L\]\(I\_\{t\}\)\_\{t\\in\[L\]\}are sampled independently across iterations, the pair\(Σ^It\(B\),ξt,wor\(B\)\)\(\\widehat\{\\Sigma\}\_\{I\_\{t\}\}^\{\(B\)\},\\xi\_\{t,\\mathrm\{wor\}\}^\{\(B\)\}\)is also independent ofℱt−1\\mathcal\{F\}\_\{t\-1\}, so the adaptedness requirement in Lemma[I\.4](https://arxiv.org/html/2605.24316#A9.Thmlemma4)is unchanged\. Conditioned on\(S,w∗,D,ℱt−1\)\(S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\), the family\(ζt\(j\)\)j=1N\(\\zeta\_\{t\}\(j\)\)\_\{j=1\}^\{N\}is deterministic and centered\. Applying Lemma[J\.1](https://arxiv.org/html/2605.24316#A10.Thmlemma1)therefore gives
𝔼\[ξt,wor\(B\)ξt,wor\(B\)⊤∣S,w∗,D,ℱt−1\]=ρN,B𝔼i∼unif\(\[N\]\)\[ζt\(i\)ζt\(i\)⊤∣S,w∗,D,ℱt−1\]\.\\mathbb\{E\}\\\!\\left\[\\xi\_\{t,\\mathrm\{wor\}\}^\{\(B\)\}\\xi\_\{t,\\mathrm\{wor\}\}^\{\(B\)\\top\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\\right\]=\\rho\_\{N,B\}\\,\\mathbb\{E\}\_\{i\\sim\\mathrm\{unif\}\(\[N\]\)\}\\\!\\left\[\\zeta\_\{t\}\(i\)\\zeta\_\{t\}\(i\)^\{\\top\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\\right\]\.\(27\)Therefore equation[24](https://arxiv.org/html/2605.24316#A9.E24)in the proof of Lemma[I\.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1)is replaced by
𝔼\[ξt,wor\(B\)ξt,wor\(B\)⊤∣S,w∗,D,ℱt−1\]⪯ρN,Bσξ,B2Σ^\.\\mathbb\{E\}\\\!\\left\[\\xi\_\{t,\\mathrm\{wor\}\}^\{\(B\)\}\\xi\_\{t,\\mathrm\{wor\}\}^\{\(B\)\\top\}\\mid S,w^\{\\ast\},D,\\mathcal\{F\}\_\{t\-1\}\\right\]\\preceq\\rho\_\{N,B\}\\,\\sigma\_\{\\xi,B\}^\{2\}\\,\\widehat\{\\Sigma\}\.From this point onward, the proof of Lemma[I\.1](https://arxiv.org/html/2605.24316#A9.Thmlemma1)is unchanged after replacingΔt\\Delta\_\{t\}byΔtwor\\Delta\_\{t\}^\{\\mathrm\{wor\}\},uLwru\_\{L\}^\{\\mathrm\{wr\}\}byuLworu\_\{L\}^\{\\mathrm\{wor\}\}, and every occurrence of1/B1/Bcoming from equation[24](https://arxiv.org/html/2605.24316#A9.E24)byρN,B\\rho\_\{N,B\}\. This yields part\(1\)\.
The proof of Lemma[I\.2](https://arxiv.org/html/2605.24316#A9.Thmlemma2)then propagates the same covariance replacement, so every occurrence of1/B1/Bin the corresponding with\-replacement source\-condition upper bound is replaced byρN,B\\rho\_\{N,B\}\. This yields part\(2\)\.
The last statement follows because
ρN,N=N−NN\(N−1\)=0\.\\rho\_\{N,N\}=\\frac\{N\-N\}\{N\(N\-1\)\}=0\.Thus, whenB=NB=N, the fluctuation bounds vanish, which is consistent withutwor=θtu\_\{t\}^\{\\mathrm\{wor\}\}=\\theta\_\{t\}for alltt\. ∎
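The degenerate case B=N can be illustrated with a toy least\-squares problem. The sketch below is not the paper's experimental code: it uses a constant stepsize, a small random design, the Euclidean norm instead of the Σ\-weighted norm, and hypothetical function names. With B=N every without\-replacement batch is a permutation of the whole dataset, so the iterate coincides with full\-batch GD up to round\-off, whereas a small batch leaves a visible gap.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, L, gamma = 32, 4, 50, 0.05  # toy sizes, chosen only for illustration
Z = rng.standard_normal((N, M))
y = Z @ rng.standard_normal(M) + 0.1 * rng.standard_normal(N)


def batch_sgd_wor(B, seed):
    """Mini-batch SGD where each batch is drawn uniformly without replacement."""
    batch_rng = np.random.default_rng(seed)
    u = np.zeros(M)
    for _ in range(L):
        idx = batch_rng.choice(N, size=B, replace=False)
        grad = Z[idx].T @ (Z[idx] @ u - y[idx]) / B
        u = u - gamma * grad
    return u


def full_gd():
    """Full-batch gradient descent on the same data with the same stepsize."""
    theta = np.zeros(M)
    for _ in range(L):
        theta = theta - gamma * Z.T @ (Z @ theta - y) / N
    return theta


gap_full = np.linalg.norm(batch_sgd_wor(N, seed=0) - full_gd())
gap_small = np.linalg.norm(batch_sgd_wor(4, seed=0) - full_gd())
print(f"B=N gap: {gap_full:.2e}   B=4 gap: {gap_small:.2e}")
```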
## Appendix K Collected Auxiliary Lemmas
This section collects the auxiliary lemmas that we use from Section E of Lin et al\. \[[2025](https://arxiv.org/html/2605.24316#bib.bib1)\] and Sections A and G of Lin et al\. \[[2024](https://arxiv.org/html/2605.24316#bib.bib4)\]\. These are the spectral, moment, and concentration ingredients used repeatedly across the appendix proofs for equation [2](https://arxiv.org/html/2605.24316#S2.E2), equation [3](https://arxiv.org/html/2605.24316#S2.E3), and equation [4](https://arxiv.org/html/2605.24316#S2.E4)\.
We use the common notation from Section[A](https://arxiv.org/html/2605.24316#A1), especially Section[A\.1](https://arxiv.org/html/2605.24316#A1.SS1)\. Let\(λi\)i≥1\(\\lambda\_\{i\}\)\_\{i\\geq 1\}denote the eigenvalues ofHHin non\-increasing order\. For integers0≤k∗≤k≤∞0\\leq k\_\{\\ast\}\\leq k\\leq\\infty, define
Σk∗:k:=Sk∗:kHk∗:kSk∗:k⊤,Σk:∞:=Sk:∞Hk:∞Sk:∞⊤\.\\Sigma\_\{k\_\{\\ast\}:k\}:=S\_\{k\_\{\\ast\}:k\}H\_\{k\_\{\\ast\}:k\}S\_\{k\_\{\\ast\}:k\}^\{\\top\},\\qquad\\Sigma\_\{k:\\infty\}:=S\_\{k:\\infty\}H\_\{k:\\infty\}S\_\{k:\\infty\}^\{\\top\}\.For any symmetric PSD matrixAA, we writeμ1\(A\)≥μ2\(A\)≥⋯\\mu\_\{1\}\(A\)\\geq\\mu\_\{2\}\(A\)\\geq\\cdotsfor its eigenvalues\.
### K\.1 General concentration lemmas
###### Lemma K\.1\(Covariance replacement\)\.
Letλ\>0\\lambda\>0\. If
∑i=1Mμi\(Σ\)μi\(Σ\)\+λ≤N4,\\sum\_\{i=1\}^\{M\}\\frac\{\\mu\_\{i\}\(\\Sigma\)\}\{\\mu\_\{i\}\(\\Sigma\)\+\\lambda\}\\leq\\frac\{N\}\{4\},then with probability at least1−exp\(−Ω\(N\)\)1\-\\exp\(\-\\Omega\(N\)\)over the sample,
‖\(Σ^\+λI\)−1/2\(Σ\+λI\)1/2‖2≤3\.\\bigl\\\|\(\\widehat\{\\Sigma\}\+\\lambda I\)^\{\-1/2\}\(\\Sigma\+\\lambda I\)^\{1/2\}\\bigr\\\|\_\{2\}\\leq 3\.Moreover,
𝔼X‖\(Σ^\+λI\)−1/2\(Σ\+λI\)1/2‖24≤100\+exp\(−cN\)‖Σ‖22λ2\\mathbb\{E\}\_\{X\}\\bigl\\\|\(\\widehat\{\\Sigma\}\+\\lambda I\)^\{\-1/2\}\(\\Sigma\+\\lambda I\)^\{1/2\}\\bigr\\\|\_\{2\}^\{4\}\\leq 100\+\\exp\(\-cN\)\\frac\{\\\|\\Sigma\\\|\_\{2\}^\{2\}\}\{\\lambda^\{2\}\}for some absolute constantc\>0c\>0\.
###### Lemma K\.2\(Sketched fourth\-moment and residual covariance bounds\)\.
Assume Assumptions[1\.A](https://arxiv.org/html/2605.24316#S3.I1.i1)and[1\.B](https://arxiv.org/html/2605.24316#S3.I1.i2)\. Condition on the sketch matrixSS, and let
z:=Sx,Σ:=SHS⊤,u∗:=Σ−1SHw∗\.z:=Sx,\\qquad\\Sigma:=SHS^\{\\top\},\\qquad u^\{\\ast\}:=\\Sigma^\{\-1\}SHw^\{\\ast\}\.Then there exists an absolute constantα\>0\\alpha\>0such that the following hold\.
1. \(i\)For every PSD matrixA∈ℝM×MA\\in\\mathbb\{R\}^\{M\\times M\}, 𝔼\[zz⊤Azz⊤\]⪯αtr\(ΣA\)Σ\.\\mathbb\{E\}\[zz^\{\\top\}Azz^\{\\top\}\]\\preceq\\alpha\\,\\operatorname\{tr\}\(\\Sigma A\)\\Sigma\.
2. \(ii\)Writing σ~2\(w∗\):=2\(σ2\+α‖w∗‖H2\),\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\):=2\\bigl\(\\sigma^\{2\}\+\\alpha\\\|w^\{\\ast\}\\\|\_\{H\}^\{2\}\\bigr\),one has 𝔼\[\(y−z⊤u∗\)2zz⊤\]⪯σ~2\(w∗\)Σ\.\\mathbb\{E\}\\\!\\left\[\(y\-z^\{\\top\}u^\{\\ast\}\)^\{2\}zz^\{\\top\}\\right\]\\preceq\\widetilde\{\\sigma\}^\{2\}\(w^\{\\ast\}\)\\,\\Sigma\.
Under Gaussian design, one may takeα=3\\alpha=3\.
###### Lemma K\.3\(Head–tail eigenvalue comparison\)\.
There exists an absolute constantc\>1c\>1such that for every0≤k≤M0\\leq k\\leq M, with probability at least
1−exp\(−Ω\(M\)\)−exp\(−Ω\(k\)\),1\-\\exp\(\-\\Omega\(M\)\)\-\\exp\(\-\\Omega\(k\)\),we have for everyj∈\[M\]j\\in\[M\],
\|μj\(Σ\)−λj−1M∑i\>kλi\|≤c\(kMλj\+λk\+1\+∑i\>kλi2M\)\.\\left\|\\mu\_\{j\}\(\\Sigma\)\-\\lambda\_\{j\}\-\\frac\{1\}\{M\}\\sum\_\{i\>k\}\\lambda\_\{i\}\\right\|\\leq c\\left\(\\frac\{k\}\{M\}\\lambda\_\{j\}\+\\lambda\_\{k\+1\}\+\\sqrt\{\\frac\{\\sum\_\{i\>k\}\\lambda\_\{i\}^\{2\}\}\{M\}\}\\right\)\.In particular, ifk≤M/c2k\\leq M/c^\{2\}, then the termkMλj\\tfrac\{k\}\{M\}\\lambda\_\{j\}can be absorbed into the left\-hand side\.
###### Lemma K\.4\(Tail concentration\)\.
For anyk≥0k\\geq 0, with probability at least1−δ1\-\\delta,
‖Σk:∞−1M∑i\>kλiIM‖2≲λk\+1\(1\+log\(1/δ\)M\)\+∑i\>kλi2M\(1\+log\(1/δ\)M\)\.\\left\\\|\\Sigma\_\{k:\\infty\}\-\\frac\{1\}\{M\}\\sum\_\{i\>k\}\\lambda\_\{i\}\\,I\_\{M\}\\right\\\|\_\{2\}\\lesssim\\lambda\_\{k\+1\}\\left\(1\+\\frac\{\\log\(1/\\delta\)\}\{M\}\\right\)\+\\sqrt\{\\frac\{\\sum\_\{i\>k\}\\lambda\_\{i\}^\{2\}\}\{M\}\\left\(1\+\\frac\{\\log\(1/\\delta\)\}\{M\}\\right\)\}\.In particular, with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\),
‖Σk:∞−1M∑i\>kλiIM‖2≲λk\+1\+∑i\>kλi2M\.\\left\\\|\\Sigma\_\{k:\\infty\}\-\\frac\{1\}\{M\}\\sum\_\{i\>k\}\\lambda\_\{i\}\\,I\_\{M\}\\right\\\|\_\{2\}\\lesssim\\lambda\_\{k\+1\}\+\\sqrt\{\\frac\{\\sum\_\{i\>k\}\\lambda\_\{i\}^\{2\}\}\{M\}\}\.
###### Lemma K\.5\(Head concentration\)\.
For anyk≥1k\\geq 1, with probability at least1−δ1\-\\delta,
\|μj\(Σ0:k\)−λj\|≲k\+log\(1/δ\)Mλj,j≤k\.\\bigl\|\\mu\_\{j\}\(\\Sigma\_\{0:k\}\)\-\\lambda\_\{j\}\\bigr\|\\lesssim\\frac\{k\+\\log\(1/\\delta\)\}\{M\}\\,\\lambda\_\{j\},\\qquad j\\leq k\.In particular, with probability at least1−exp\(−Ω\(k\)\)1\-\\exp\(\-\\Omega\(k\)\),
\|μj\(Σ0:k\)−λj\|≲kMλj,j≤k\.\\bigl\|\\mu\_\{j\}\(\\Sigma\_\{0:k\}\)\-\\lambda\_\{j\}\\bigr\|\\lesssim\\frac\{k\}\{M\}\\lambda\_\{j\},\\qquad j\\leq k\.
###### Lemma K\.6\(Head–tail resolvent estimate\)\.
Fix an integerk≤M/3k\\leq M/3such thatrank\(H\)≥k\+M\\operatorname\{rank\}\(H\)\\geq k\+M, and define
Ak:=Sk:∞Hk:∞Sk:∞⊤,Σ=S0:kH0:kS0:k⊤\+Ak\.A\_\{k\}:=S\_\{k:\\infty\}H\_\{k:\\infty\}S\_\{k:\\infty\}^\{\\top\},\\qquad\\Sigma=S\_\{0:k\}H\_\{0:k\}S\_\{0:k\}^\{\\top\}\+A\_\{k\}\.Then, with probability at least
1−exp\(−Ω\(M\)\),1\-\\exp\(\-\\Omega\(M\)\),one has
‖Σ−1S0:kH0:k‖2≲μM/2\(Ak\)μM\(Ak\)\.\\\|\\Sigma^\{\-1\}S\_\{0:k\}H\_\{0:k\}\\\|\_\{2\}\\lesssim\\frac\{\\mu\_\{M/2\}\(A\_\{k\}\)\}\{\\mu\_\{M\}\(A\_\{k\}\)\}\.
### K\.2 Power\-law auxiliary lemmas
###### Lemma K\.7\(Power\-law spectrum of the sketched covariance\)\.
Suppose the population spectrum obeysλj≍j−a\\lambda\_\{j\}\\asymp j^\{\-a\}for somea\>1a\>1\. Then, with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\),
μj\(Σ\)≍j−a,j∈\[M\]\.\\mu\_\{j\}\(\\Sigma\)\\asymp j^\{\-a\},\\qquad j\\in\[M\]\.
###### Lemma K\.8\(Tail spectral ratio under power law\)\.
Supposeλj≍j−a\\lambda\_\{j\}\\asymp j^\{\-a\}witha\>1a\>1\. Then there exists anaa\-dependent constantc\>0c\>0such that for anyk≥0k\\geq 0,
μM/2\(Σk:∞\)μM\(Σk:∞\)≤c\\frac\{\\mu\_\{M/2\}\(\\Sigma\_\{k:\\infty\}\)\}\{\\mu\_\{M\}\(\\Sigma\_\{k:\\infty\}\)\}\\leq cwith probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)\.
###### Lemma K\.9\(Source\-condition approximation bound\)\.
Suppose the source condition with exponentb\>1b\>1holds\. Then, with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the sketch matrix,
M1−b≲𝔼w∗\[Approx\]≲M1−b\.M^\{1\-b\}\\lesssim\\mathbb\{E\}\_\{w^\{\\ast\}\}\[\\mathrm\{Approx\}\]\\lesssim M^\{1\-b\}\.The hidden constants depend only on the source exponents\.
###### Lemma K\.10\(Empirical spectrum under power law\)\.
Assumeμj\(Σ\)≍j−a\\mu\_\{j\}\(\\Sigma\)\\asymp j^\{\-a\}forj∈\[M\]j\\in\[M\]\. Then there existaa\-dependent constantsc,c1,c2\>0c,c\_\{1\},c\_\{2\}\>0such that, conditioned onSS, with probability at least1−exp\(−Ω\(N\)\)1\-\\exp\(\-\\Omega\(N\)\)over the sample,
c1j−a≤μj\(Σ^\)≤c2j−a,j≤min\{M,N/c\},c\_\{1\}j^\{\-a\}\\leq\\mu\_\{j\}\(\\widehat\{\\Sigma\}\)\\leq c\_\{2\}j^\{\-a\},\\qquad j\\leq\\min\\\{M,N/c\\\},and
μj\(Σ^\)≲j−a,j≤min\{M,N\}\.\\mu\_\{j\}\(\\widehat\{\\Sigma\}\)\\lesssim j^\{\-a\},\\qquad j\\leq\\min\\\{M,N\\\}\.
## Appendix L Experimental setup
All simulations were run on the CPU of a standard laptop; the full suite took approximately 2 hours and used less than 1 GB of memory\.
All experiments are run in the synthetic diagonal\-coordinate sketched linear\-regression model from Sections[2](https://arxiv.org/html/2605.24316#S2)and[3](https://arxiv.org/html/2605.24316#S3)\. We fix an ambient dimensiondd, let
H=diag\(λ1,…,λd\),λi=i−a,H=\\operatorname\{diag\}\(\\lambda\_\{1\},\\dots,\\lambda\_\{d\}\),\\qquad\\lambda\_\{i\}=i^\{\-a\},draw a Gaussian sketchS∈ℝM×dS\\in\\mathbb\{R\}^\{M\\times d\}with i\.i\.d\.𝒩\(0,1/M\)\\mathcal\{N\}\(0,1/M\)entries, and set
Σ=SHS⊤\.\\Sigma=SHS^\{\\top\}\.The source\-condition prior is chosen so that the coordinates ofw∗w^\{\\ast\}are independent Gaussian and satisfy
𝔼\[λi\(wi∗\)2\]≍i−b\.\\mathbb\{E\}\[\\lambda\_\{i\}\(w\_\{i\}^\{\\ast\}\)^\{2\}\]\\asymp i^\{\-b\}\.Conditioned on\(S,w∗\)\(S,w^\{\\ast\}\), the sketched featurez:=Sx∈ℝMz:=Sx\\in\\mathbb\{R\}^\{M\}and the clean signal⟨x,w∗⟩\\langle x,w^\{\\ast\}\\rangleare jointly Gaussian, so the implementation samples the pair\(z,y\)\(z,y\)directly from this induced law, with
y=⟨x,w∗⟩\+ε,ε∼𝒩\(0,σ2\)\.y=\\langle x,w^\{\\ast\}\\rangle\+\\varepsilon,\\qquad\\varepsilon\\sim\\mathcal\{N\}\(0,\\sigma^\{2\}\)\.This lets us work entirely in the sketched coordinates used throughout the paper\.
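One way to sample the pair directly in sketched coordinates is sketched below. This is an illustrative reconstruction, not the authors' code: it assumes x is drawn from a centered Gaussian with covariance H, so that z and the clean signal have joint covariance with blocks SHS⊤, SHw∗, and w∗⊤Hw∗. It also uses smaller d and M than the paper's defaults to keep the example fast. All variable and function names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, sigma = 2.0, 1.5, 1.0
d, M = 2000, 16  # smaller than the paper's d = 10^4, M = 64 (illustration only)

idx = np.arange(1, d + 1)
lam = idx ** (-a)                              # diagonal of H
S = rng.standard_normal((M, d)) / np.sqrt(M)   # i.i.d. N(0, 1/M) entries
Sigma = (S * lam) @ S.T                        # S H S^T
# Independent Gaussian coordinates with E[lam_i * w_i^2] = i^{-b}
w_star = rng.standard_normal(d) * np.sqrt(idx ** (a - b))

# Joint covariance of (z, <x, w*>), assuming x ~ N(0, H)
cross = S @ (lam * w_star)                     # Cov(z, <x, w*>) = S H w*
signal_var = np.sum(lam * w_star ** 2)         # Var(<x, w*>) = w*^T H w*
joint = np.block([
    [Sigma, cross[:, None]],
    [cross[None, :], np.array([[signal_var]])],
])
chol = np.linalg.cholesky(joint)


def sample(n):
    """Draw n pairs (z, y) without ever materializing the d-dimensional x."""
    g = rng.standard_normal((n, M + 1)) @ chol.T
    z, clean = g[:, :M], g[:, M]
    y = clean + sigma * rng.standard_normal(n)
    return z, y


z, y = sample(512)
print(z.shape, y.shape)
```

The cost per sample then scales with M rather than d, which is what makes repeated dataset refreshes cheap.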
Across repetitions, the draw of\(S,w∗\)\(S,w^\{\\ast\}\)is fixed and only the dataset and optimization randomness are refreshed\. Unless otherwise stated, the shared parameters are
a=2,b=1\.5,d=104,M=64,N=512,a=2,\\qquad b=1\.5,\\qquad d=10^\{4\},\\qquad M=64,\\qquad N=512,L=512,σ=1,γ=0\.05,L=512,\\qquad\\sigma=1,\\qquad\\gamma=0\.05,withR=100R=100independent repetitions\. We use the candidate power\-of\-two batch sizes
ℬ:=\{1,2,4,8,16,32,64,128,256,512\},\\mathcal\{B\}:=\\\{1,2,4,8,16,32,64,128,256,512\\\},and then specify the experiment\-dependent subsets below\. For multi\-pass experiments we compare both sampling rules from the paper: with\-replacement mini\-batching and without\-replacement mini\-batching, with finite\-population factor
ρN,B=N−BB\(N−1\)\.\\rho\_\{N,B\}=\\frac\{N\-B\}\{B\(N\-1\)\}\.
All methods use the same Bartlett\-decay schedule implemented in the code: for a run ofLrunL\_\{\\mathrm\{run\}\}updates, the stepsize is held constant on blocks of length comparable toLrun,eff≍Lrun/logLrunL\_\{\\mathrm\{run,eff\}\}\\asymp L\_\{\\mathrm\{run\}\}/\\log L\_\{\\mathrm\{run\}\}and divided by two from one block to the next\. This is the simulation counterpart of the blockwise geometric schedule used in our theory\. ThusLrun=T=N/BL\_\{\\mathrm\{run\}\}=T=N/Bfor one\-pass batch SGD andLrun=LL\_\{\\mathrm\{run\}\}=Lfor multi\-pass batch SGD and for the full\-batch GD reference iterateθL\\theta\_\{L\}\.
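A blockwise halving schedule of this kind could look as follows. This is a hypothetical sketch: the function name, the base\-2 logarithm, and the rounding of the block length are our assumptions, and the schedule in the authors' code may differ in these details.

```python
import math


def blockwise_halving_schedule(L_run, gamma0):
    """Stepsizes held constant on blocks and halved from one block to the next.

    Assumption: block length = ceil(L_run / log2(L_run)); the paper only states
    that blocks have length comparable to L_run / log(L_run).
    """
    if L_run < 2:
        # log(1) = 0, so fall back to a single constant block
        return [gamma0] * L_run
    block_len = max(1, math.ceil(L_run / math.log2(L_run)))
    return [gamma0 / 2 ** (t // block_len) for t in range(L_run)]


schedule = blockwise_halving_schedule(512, 0.05)
print(len(schedule), schedule[0], schedule[-1])
```

Under these assumptions a run of 512 updates is split into blocks of 57 steps, so the stepsize is halved eight times over the run.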
The three experiments below are chosen to match the three batch\-dependent predictions in our theorems\. Experiment 1 isolates the one\-pass variance term, which is the only place whereBBenters the one\-pass theorem\. Experiment 2 isolates the fluctuation term in the multi\-pass theorem, which is the only part that differs between with\-replacement and without\-replacement sampling\. Experiment 3 then checks whether dividing by the predicted batch prefactor removes the leadingBB\-dependence\.
### L\.1 Experiment 1: one\-pass variance sweep
For each batch sizeBB, we run one\-pass batch SGD with shuffled disjoint batches, so each sample is used exactly once and the number of updates is
T=NB,Teff=TlogT\.T=\\frac\{N\}\{B\},\\qquad T\_\{\\mathrm\{eff\}\}=\\frac\{T\}\{\\log T\}\.LetuT,ropu\_\{T,r\}^\{\\mathrm\{op\}\}be the one\-pass output on repetitionrr, and let
u¯Top:=1R∑r=1RuT,rop\.\\bar\{u\}\_\{T\}^\{\\mathrm\{op\}\}:=\\frac\{1\}\{R\}\\sum\_\{r=1\}^\{R\}u\_\{T,r\}^\{\\mathrm\{op\}\}\.We estimate the centered one\-pass variance by the Bessel\-corrected empirical average
Var^B:=RR−1⋅1R∑r=1R‖Σ1/2\(uT,rop−u¯Top\)‖22\.\\widehat\{\\mathrm\{Var\}\}\_\{B\}:=\\frac\{R\}\{R\-1\}\\cdot\\frac\{1\}\{R\}\\sum\_\{r=1\}^\{R\}\\bigl\\\|\\Sigma^\{1/2\}\(u\_\{T,r\}^\{\\mathrm\{op\}\}\-\\bar\{u\}\_\{T\}^\{\\mathrm\{op\}\}\)\\bigr\\\|\_\{2\}^\{2\}\.This is the quantity plotted in Figure[1](https://arxiv.org/html/2605.24316#S4.F1)\(a\)\.
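The estimator reduces to a sum of squared Σ\-weighted deviations divided by R−1. A minimal sketch follows, with a hypothetical function name and synthetic inputs standing in for the one\-pass outputs; it uses the identity that the squared Σ\-weighted norm of v equals v⊤Σv, so no matrix square root is needed.

```python
import numpy as np


def centered_variance(U, Sigma):
    """Bessel-corrected estimate: (1/(R-1)) * sum_r (u_r - ubar)^T Sigma (u_r - ubar).

    U has shape (R, M), one output per repetition; Sigma has shape (M, M).
    """
    R = U.shape[0]
    D = U - U.mean(axis=0)
    return np.einsum("ri,ij,rj->", D, Sigma, D) / (R - 1)


# Synthetic stand-in data (not results from the paper)
rng = np.random.default_rng(0)
U_demo = rng.standard_normal((100, 8))
A_demo = rng.standard_normal((8, 8))
Sigma_demo = A_demo @ A_demo.T
print(centered_variance(U_demo, Sigma_demo))
```

Equivalently, the estimate is the trace of Σ times the sample covariance of the outputs, which gives an independent way to cross\-check an implementation.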
To compare with the theory, we also draw a rescaled upper\-bound reference of order
1BTeff∑j=1Mmin\{1,Teffγμj\(Σ\)\},\\frac\{1\}\{B\\,T\_\{\\mathrm\{eff\}\}\}\\sum\_\{j=1\}^\{M\}\\min\\\{1,T\_\{\\mathrm\{eff\}\}\\gamma\\mu\_\{j\}\(\\Sigma\)\\\},using the eigenvalues of the fixed simulated sketched covarianceΣ\\Sigma\. Because our theorem gives only an upper bound onVarB\\mathrm\{Var\}\_\{B\}, this plot is meant to test the predictedBB\-dependence and effective\-dimension shape, not equality of leading constants\. Accordingly, the empirical curve is expected to remain below a suitably rescaled upper\-bound reference\.
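The reference curve can be evaluated directly from the spectrum. The sketch below is illustrative only: the function name is ours, the natural logarithm in the effective horizon is an assumption, and an exact power\-law spectrum is used as a stand\-in for the eigenvalues of the simulated Σ. Note that T=1 (that is, B=N) makes log T vanish, so the sketch rejects that case rather than guessing how it should be handled.

```python
import numpy as np


def one_pass_reference(B, N, gamma, eigs):
    """Unscaled reference (1 / (B * T_eff)) * sum_j min(1, T_eff * gamma * mu_j)."""
    T = N // B
    if T < 2:
        raise ValueError("T_eff = T / log(T) is undefined for T = 1")
    T_eff = T / np.log(T)  # assumption: natural logarithm
    return np.minimum(1.0, T_eff * gamma * eigs).sum() / (B * T_eff)


M, N, gamma = 64, 512, 0.05
eigs = np.arange(1, M + 1) ** (-2.0)  # stand-in spectrum mu_j = j^{-a} with a = 2

for B in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    print(f"B={B:4d}  reference={one_pass_reference(B, N, gamma, eigs):.3e}")
```

In a plot, this curve would still be multiplied by a single constant before being compared with the empirical variance, since the theorem fixes the rate but not the leading constant.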
### L\.2 Experiment 2: multi\-pass fluctuation sweep
For each batch sizeBB, we run multi\-pass batch SGD forLLupdates under both sampling rules, producing iteratesuL,rwru\_\{L,r\}^\{\\mathrm\{wr\}\}anduL,rworu\_\{L,r\}^\{\\mathrm\{wor\}\}\. On the same dataset and with the same stepsize schedule, we also run the full\-batch GD reference iterateθL,r\\theta\_\{L,r\}\. We then estimate the fluctuation terms by
Fluc^Bρ:=1R∑r=1R‖Σ1/2\(uL,rρ−θL,r\)‖22,ρ∈\{wr,wor\}\.\\widehat\{\\mathrm\{Fluc\}\}\_\{B\}^\{\\rho\}:=\\frac\{1\}\{R\}\\sum\_\{r=1\}^\{R\}\\bigl\\\|\\Sigma^\{1/2\}\(u\_\{L,r\}^\{\\rho\}\-\\theta\_\{L,r\}\)\\bigr\\\|\_\{2\}^\{2\},\\qquad\\rho\\in\\\{\\mathrm\{wr\},\\mathrm\{wor\}\\\}\.This directly matches the fluctuation quantity in Theorem[3\.2](https://arxiv.org/html/2605.24316#S3.Thmmytheorem2)\. We plot these empirical means againstBBtogether with one\-parameter reference curves of the form
Cwr⋅1B,Cwor⋅ρN,B,C\_\{\\mathrm\{wr\}\}\\cdot\\frac\{1\}\{B\},\\qquad C\_\{\\mathrm\{wor\}\}\\cdot\\rho\_\{N,B\},whereCwrC\_\{\\mathrm\{wr\}\}andCworC\_\{\\mathrm\{wor\}\}are fitted from the average normalized fluctuation\. The purpose of this experiment is to test whether the sampling\-rule dependence is indeed captured by the batch prefactors1/B1/BandρN,B\\rho\_\{N,B\}\.
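One reading of "fitted from the average normalized fluctuation" is that each constant is the mean over B of the measured fluctuation divided by its predicted prefactor. The sketch below implements that reading; it is our interpretation, with hypothetical names, and the fluctuation values are synthetic placeholders rather than measurements. Batch sizes with a zero prefactor (B=N without replacement) are excluded from the average.

```python
import numpy as np


def rho(N, B):
    """Finite-population correction factor (N - B) / (B * (N - 1))."""
    return (N - B) / (B * (N - 1))


def fit_constant(fluc, prefactor):
    """Mean of fluc / prefactor over the batch sizes whose prefactor is nonzero."""
    fluc = np.asarray(fluc, dtype=float)
    prefactor = np.asarray(prefactor, dtype=float)
    keep = prefactor > 0
    return float(np.mean(fluc[keep] / prefactor[keep]))


N = 512
batches = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512])
pref_wr = 1.0 / batches
pref_wor = rho(N, batches)

# Synthetic placeholders that follow the predicted shape exactly (not paper data)
fake_fluc_wr = 0.3 * pref_wr
fake_fluc_wor = 0.3 * pref_wor

C_wr = fit_constant(fake_fluc_wr, pref_wr)
C_wor = fit_constant(fake_fluc_wor, pref_wor)
print(C_wr, C_wor)
```

With real measurements the per\-batch ratios would scatter around the fitted constant, and the size of that scatter is what the normalized plot is meant to expose.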
### L\.3 Experiment 3: normalized fluctuation collapse
The third experiment uses the same multi\-pass runs as Experiment 2 but removes the predicted batch prefactors\. We plot
Fluc^Bwr1/BandFluc^BworρN,B\.\\frac\{\\widehat\{\\mathrm\{Fluc\}\}\_\{B\}^\{\\mathrm\{wr\}\}\}\{1/B\}\\qquad\\text\{and\}\\qquad\\frac\{\\widehat\{\\mathrm\{Fluc\}\}\_\{B\}^\{\\mathrm\{wor\}\}\}\{\\rho\_\{N,B\}\}\.If the theorem captures the leadingBB\-dependence correctly, these normalized quantities should be approximately constant acrossBB\. This is a stronger check than Experiment 2 alone: Experiment 2 tests the decay pattern on log–log axes, while Experiment 3 tests whether the remaining dependence after normalization is essentially flat\. For without\-replacement sampling, the pointB=NB=Nis omitted from the normalized plot becauseρN,N=0\\rho\_\{N,N\}=0\.
Throughout all three experiments, the error bars shown in the main\-text plots are one empirical standard deviation over theR=100R=100repetitions\.
## Appendix M Limitations and Broader Effects
#### Limitations\.
Our analysis is intentionally stylized\. The theory is proved for sketched linear regression under a Gaussian design, a well\-specified teacher–student model, power\-law covariance decay, and a source condition on the target parameter\. These assumptions let us separate approximation, bias, variance, and fluctuation cleanly, but they do not cover misspecification, heavy\-tailed or dependent data, non\-Gaussian sketching, or feature learning in nonlinear models\. The experiments are likewise synthetic and are designed to test the predicted scaling behavior rather than empirical competitiveness on real tasks\. Accordingly, the resulting batch\-size scaling laws should be interpreted as precise results for a controlled regime, not as universal prescriptions for all SGD training problems\.
#### Broader effects\.
One positive effect of this work is sharper guidance for how batch size changes optimization noise in large\-scale training\. By isolating when batching primarily alters stochastic terms and when without\-replacement sampling can further reduce fluctuation, the results can inform more compute\-efficient training strategies and improve theoretical intuition for algorithm design\. The main risk is overgeneralization: if stylized scaling laws derived under restrictive assumptions are transferred directly to complex real\-world systems, practitioners may choose training rules that are poorly calibrated for misspecified models, distribution shift, or fairness and safety constraints that are absent from our setup\. We therefore view these results as a theoretical foundation that should be paired with application\-specific validation before informing deployment decisions\.