Sketched Linear Contrastive Learning: Approximation, Optimization, and Statistical Scaling
Summary
This paper derives a scaling law for sketched linear contrastive learning under a Gaussian latent-variable model, analyzing how risk decomposes into approximation, optimization, and statistical terms, and provides theoretical guidance for balancing model size, data, and compute in contrastive learning.
# Approximation, Optimization, and Statistical Scaling
Source: [https://arxiv.org/html/2606.26617](https://arxiv.org/html/2606.26617)
## Sketched Linear Contrastive Learning: Approximation, Optimization, and Statistical Scaling
###### Abstract
Scaling laws describe how learning performance varies with model size, data size, and compute. While recent theoretical work has established scaling laws for sketched linear regression, much less is understood for contrastive representation learning. In this paper, we study a sketched linear model for contrastive learning under a paired Gaussian latent-variable setup. The learner observes only sketched views of two correlated variables and trains a bilinear contrastive score by full-batch empirical gradient descent. We analyze a Gaussian-negative quadratic contrastive surrogate under aligned power-law spectra and a contrastive source condition, where we derive a risk decomposition into irreducible risk, approximation error, GD bias, GD variance, and a cross term. The cross term is controlled by the bias and variance and therefore does not affect the upper-bound scaling. Our main theorem gives an explicit scaling law with respect to sketch dimension $M$, sample size $N$, and effective optimization horizon $L_{\mathrm{eff}}\gamma$. Compared with standard linear-regression scaling laws, the contrastive setting must learn interactions between two views, and this changes how optimization and finite-sample noise scale with model size, data, and training time. This provides a first theoretical step toward understanding scaling behavior in contrastive learning and gives guidance for balancing model size, data, and optimization compute.
## Introduction
Scaling laws provide a compact way to describe how prediction error changes with model size, data size, and compute. A representative form is the neural language-model scaling law of Kaplan et al. ([2020](https://arxiv.org/html/2606.26617#bib.bib2)), where the loss is modeled as a sum of power laws in the number of non-embedding parameters $N$ and the dataset size $D$:

$$L(N,D)=L_{\infty}+\left(\frac{N_{c}}{N}\right)^{\alpha_{N}}+\left(\frac{D_{c}}{D}\right)^{\alpha_{D}}.$$

Related empirical laws have been observed across language, vision, translation, speech, and multimodal modeling (Hestness et al. [2017](https://arxiv.org/html/2606.26617#bib.bib1); Henighan et al. [2020](https://arxiv.org/html/2606.26617#bib.bib3); Hoffmann et al. [2022](https://arxiv.org/html/2606.26617#bib.bib4); Zhai et al. [2022a](https://arxiv.org/html/2606.26617#bib.bib5); Muennighoff et al. [2023](https://arxiv.org/html/2606.26617#bib.bib6)). These empirical laws are useful because they allow one to forecast the benefit of additional compute, data, or parameters before performing expensive training runs. At the same time, empirical power laws do not by themselves explain where the exponents come from, which part of the risk they describe, or how algorithmic choices affect them. This has motivated a growing theoretical literature studying scaling laws in simplified but statistically transparent models (Hutter [2021](https://arxiv.org/html/2606.26617#bib.bib7); Sharma and Kaplan [2020](https://arxiv.org/html/2606.26617#bib.bib8); Maloney et al. [2022](https://arxiv.org/html/2606.26617#bib.bib9); Bahri et al. [2024](https://arxiv.org/html/2606.26617#bib.bib10); Bordelon et al. [2024](https://arxiv.org/html/2606.26617#bib.bib11); Atanasov et al. [2026](https://arxiv.org/html/2606.26617#bib.bib12); Paquette et al. [2024](https://arxiv.org/html/2606.26617#bib.bib13); Dohmatob et al. [2024](https://arxiv.org/html/2606.26617#bib.bib14)).
In particular, recent work on sketched linear regression derives provable scaling laws under power-law covariance spectra and source conditions, connecting approximation, optimization, data, and compute (Lin et al. [2024](https://arxiv.org/html/2606.26617#bib.bib15), [2025](https://arxiv.org/html/2606.26617#bib.bib16); Chen and Zhou [2026](https://arxiv.org/html/2606.26617#bib.bib17)). Inspired by this sketched linear framework, we ask whether an analogous scaling-law theory can be developed for contrastive learning.
Contrastive learning is a central paradigm in modern representation learning. Its basic principle is to learn representations in which related pairs are close, while unrelated pairs are separated. This idea appears in early metric-learning objectives (Hadsell et al. [2006](https://arxiv.org/html/2606.26617#bib.bib23)), in mutual-information and noise-contrastive formulations such as CPC and InfoNCE (Oord et al. [2018](https://arxiv.org/html/2606.26617#bib.bib24)), and in self-supervised visual representation learning methods such as SimCLR, MoCo, and supervised contrastive learning (Chen et al. [2020](https://arxiv.org/html/2606.26617#bib.bib25); He et al. [2020](https://arxiv.org/html/2606.26617#bib.bib26); Khosla et al. [2020](https://arxiv.org/html/2606.26617#bib.bib27)). The same principle also underlies large-scale language–image pretraining: CLIP learns aligned visual and textual representations by contrasting matched image–text pairs against mismatched pairs (Radford et al. [2021](https://arxiv.org/html/2606.26617#bib.bib28)), and related systems such as ALIGN and LiT demonstrate that contrastive pretraining can produce strong zero-shot transfer, retrieval, and robust visual classification performance at scale (Jia et al. [2021](https://arxiv.org/html/2606.26617#bib.bib29); Zhai et al. [2022b](https://arxiv.org/html/2606.26617#bib.bib30)). The broad goal of contrastive learning is therefore not merely to fit a supervised label, but to recover a representation geometry that supports downstream classification, retrieval, transfer, and multimodal alignment.
Despite this empirical success, the theoretical understanding of contrastive learning remains incomplete. Existing theoretical works explain several important aspects of contrastive learning, including why contrastive objectives can recover latent factors, how augmentations and positive-pair structure affect learned representations, and how contrastive learning differs from generative or reconstruction-based unsupervised learning (Arora et al. [2019](https://arxiv.org/html/2606.26617#bib.bib31); Wang and Isola [2020](https://arxiv.org/html/2606.26617#bib.bib32); Tosh et al. [2021](https://arxiv.org/html/2606.26617#bib.bib33); Zimmermann et al. [2021](https://arxiv.org/html/2606.26617#bib.bib34); HaoChen et al. [2021](https://arxiv.org/html/2606.26617#bib.bib35); Ji et al. [2021](https://arxiv.org/html/2606.26617#bib.bib36)). Recent work also studies statistical consistency and generalization of contrastive representation learning through function-approximation and finite-sample analyses (Li et al. [2026](https://arxiv.org/html/2606.26617#bib.bib37)). However, these works do not directly explain how contrastive learning risk should scale with model size, sample size, and optimization time under a power-law spectral model. In parallel, empirical studies of contrastive language–image pretraining have found that CLIP-like models exhibit predictable scaling behavior. For example, Cherti et al. ([2023](https://arxiv.org/html/2606.26617#bib.bib38)) report power-law scaling for OpenCLIP models trained on public image–text data across zero-shot classification, retrieval, linear probing, and fine-tuning, while Li et al. ([2023](https://arxiv.org/html/2606.26617#bib.bib39)) identify an inverse scaling law for CLIP training based on reducing image/text token length as the encoders grow. More recent work uses scaling-law fits to compare open vision–language pretraining procedures and datasets (Nezhurina et al. [2025](https://arxiv.org/html/2606.26617#bib.bib40)).
These empirical findings suggest that contrastive learning has stable scaling structure, but they do not provide a decomposition into approximation, optimization, and sampling effects\.
This paper develops a theoretical scaling-law model for contrastive learning in a sketched linear setting. We study a Gaussian paired-view model and a bilinear contrastive score trained on sketched inputs. Instead of analyzing the full nonlinear InfoNCE loss, we consider a Gaussian-negative quadratic contrastive surrogate, which keeps the central contrastive structure: the positive cross-covariance term corresponds to alignment of matched pairs, while the quadratic marginal-covariance term corresponds to the Gaussian negatives. The resulting problem is analytically tractable, but it exhibits a new feature. Ordinary linear regression learns how individual input directions contribute to a response, whereas contrastive learning must learn relations between two views. In our bilinear model, these relations are represented by interactions between pairs of spectral directions, and this changes how optimization and finite-sample noise scale.
Our result is qualitatively consistent with empirical contrastive scaling laws in the sense that the risk decomposes into stable power\-law terms controlled by model size, data size, and compute\. However, our theorem should not be read as a direct quantitative prediction for deep CLIP training\. We analyze a controlled Gaussian, sketched, bilinear surrogate trained by empirical gradient descent\. The purpose is to isolate mechanisms that can produce scaling behavior in contrastive learning\. The main message is that approximation, optimization, and sampling error still separate, but the optimization and variance terms are shaped by view–view interactions rather than by single input directions alone\.
Our main contributions are as follows\.
- A sketched linear contrastive learning setup. We propose a sketched Gaussian quadratic surrogate for contrastive learning. The model is idealized to permit exact risk decompositions, while retaining the bilinear alignment structure of contrastive representation learning.
- Scaling laws for empirical GD. We analyze empirical gradient descent and decompose the expected sketched risk into irreducible risk, approximation error, GD bias, and GD variance, along with a cross term that can be absorbed into the sum of bias and variance. Under aligned power-law assumptions, we obtain explicit scaling laws with respect to sketch dimension, sample size, and the number of GD steps through the effective horizon $L_{\mathrm{eff}}\gamma$. A key new quantity is the product effective dimension, which counts active pairs of spectral directions rather than the active single directions counted in linear regression.
- Guidance for compute-resource allocation. Our final scaling law separates the effects of model size, data size, and optimization steps, which gives a transparent rule for balancing them under a fixed compute budget in the stylized contrastive setting.
To the best of our knowledge, this is the first provable scaling\-law analysis for a contrastive learning objective in a sketched bilinear model\. A central novelty is the product effective dimension, which has no analogue in ordinary linear regression and reflects the fact that contrastive learning activates pairs of spectral directions across two views\. The analysis is stylized, but it provides a theoretical starting point for understanding why contrastive objectives can exhibit scaling behavior and how their scaling differs from ordinary linear prediction\.
#### Notation\.
For two positive quantities $f$ and $g$, we write $f\lesssim g$ (equivalently, $f=\mathcal{O}(g)$) and $f\gtrsim g$ (equivalently, $f=\Omega(g)$) if the inequality holds up to an absolute constant. We write $f\asymp g$ or $f=\Theta(g)$ when both bounds hold. For matrices $A$ and $B$ of compatible dimensions, $\langle A,B\rangle:=\operatorname{tr}(A^{\top}B)$ denotes the Frobenius inner product. We use $\|\cdot\|$ for the operator norm of a matrix and the Euclidean norm of a vector, and $\|\cdot\|_{F}$ for the Frobenius norm. For positive semidefinite matrices $A$ and $B$, $A\preceq B$ means that $B-A$ is positive semidefinite. If $\Sigma\succeq 0$, we write $\|u\|_{\Sigma}^{2}:=u^{\top}\Sigma u$ and $\|A\|_{\Sigma,\Sigma}^{2}:=\operatorname{tr}(A^{\top}\Sigma A\Sigma)$. Finally, $\mu_{i}(\Sigma)$ denotes the $i$-th largest eigenvalue of a symmetric matrix $\Sigma$, and $\mathbb{E}_{\mathcal{D}}$ denotes expectation over the training sample, conditional on the sketch when the sketch is fixed.
## Preliminary
#### Contrastive\-learning background\.
Contrastive learning starts from paired views $(a_{i},b_{i})$, where $a_{i}$ and $b_{i}$ form a positive (matched) pair and $b_{j}$ for $j\neq i$ serves as a negative view for $a_{i}$. A common formulation trains encoders $f_{\theta}$ and $g_{\theta}$ through a similarity score with temperature $\tau>0$, such as

$$s_{\theta}(u,v):=\frac{\langle f_{\theta}(u),g_{\theta}(v)\rangle}{\tau}.$$

The goal is to learn representations that assign high scores to matched pairs and low scores to mismatched pairs, thereby recovering a geometry useful for downstream predictions (Oord et al. [2018](https://arxiv.org/html/2606.26617#bib.bib24); Radford et al. [2021](https://arxiv.org/html/2606.26617#bib.bib28)). Our sketched bilinear model can be viewed as replacing the nonlinear encoders $f_{\theta}$, $g_{\theta}$ by fixed linear sketches and learning only the bilinear interaction matrix.
### Problem Setup
#### Latent\-variable contrastive setup\.
Let $D\in\mathbb{N}\cup\{\infty\}$ be the ambient dimension and let $M\leq D$ be the sketch (model) dimension. We consider the paired Gaussian model

$$z\sim\mathcal{N}(0,\Lambda_{z}),\qquad\epsilon_{x},\epsilon_{y}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\Lambda_{\epsilon}),$$

where $z,\epsilon_{x},\epsilon_{y}$ are mutually independent, and

$$x=z+\epsilon_{x},\qquad y=z+\epsilon_{y}.$$

Define the marginal covariance and the cross-covariance by

$$H:=\mathbb{E}[xx^{\top}]=\mathbb{E}[yy^{\top}]=\Lambda_{z}+\Lambda_{\epsilon},\qquad C:=\mathbb{E}[xy^{\top}]=\Lambda_{z}.$$

Let $S\in\mathbb{R}^{M\times D}$ be a Gaussian sketching matrix, drawn once and held fixed throughout training, with entries

$$S_{ij}\overset{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,1/M).$$

For an input pair $(x,y)$, the learner observes only the sketched pair

$$\widetilde{x}=Sx,\qquad\widetilde{y}=Sy.$$
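The data model above can be simulated directly. The following is a minimal NumPy sketch, assuming a finite ambient dimension and illustrative power-law diagonals; the function name `sample_sketched_pairs`, the exponents 2.0 and 1.2, and the sizes are assumptions made here for illustration, not values taken from the paper.

```python
import numpy as np

def sample_sketched_pairs(lam_z, lam_eps, M, N, rng):
    """Draw a Gaussian sketch S and N sketched pairs (S x_i, S y_i).

    lam_z, lam_eps: 1-D arrays holding the diagonals of Lambda_z and Lambda_eps.
    Returns S of shape (M, D) and the two sketched views, each of shape (N, M).
    """
    D = lam_z.shape[0]
    S = rng.normal(0.0, np.sqrt(1.0 / M), size=(M, D))   # S_ij ~ N(0, 1/M)
    z = rng.normal(size=(N, D)) * np.sqrt(lam_z)          # shared latent z
    x = z + rng.normal(size=(N, D)) * np.sqrt(lam_eps)    # x = z + eps_x
    y = z + rng.normal(size=(N, D)) * np.sqrt(lam_eps)    # y = z + eps_y
    return S, x @ S.T, y @ S.T

# Illustrative finite-D instance; exponents and sizes are assumptions.
rng = np.random.default_rng(0)
D, M, N = 200, 20, 5000
idx = np.arange(1, D + 1, dtype=float)
lam_z, lam_eps = idx ** -2.0, idx ** -1.2
S, x_tilde, y_tilde = sample_sketched_pairs(lam_z, lam_eps, M, N, rng)

C_M = S @ np.diag(lam_z) @ S.T      # population sketched cross-covariance S C S^T
C_hat = x_tilde.T @ y_tilde / N     # empirical counterpart from the samples
rel_err = np.linalg.norm(C_hat - C_M) / np.linalg.norm(C_M)
```

Here `rel_err` is the relative Frobenius gap between the empirical and population sketched cross-covariance, which should shrink as `N` grows.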
#### Bilinear contrastive score\.
In the unsketched setup, we learn a bilinear score matrix $W$ between paired observations,

$$s_{W}(x,y):=x^{\top}Wy,\qquad W\in\mathbb{R}^{D\times D}.$$

Let $p_{+}$ be the joint law of a positive pair $(x,y)$, and let $p_{x}$ and $p_{y}$ be its marginals. A large-negative-sample population contrastive objective can be written in the score-based form (Wang and Isola [2020](https://arxiv.org/html/2606.26617#bib.bib32))

$$\begin{aligned}
\mathcal{L}_{\mathrm{pop}}(s;\tau)&:=\mathbb{E}_{(x,y)\sim p_{+}}\left[\ell_{s}(x,y)\right],\\
\ell_{s}(x,y)&:=\log\mathbb{E}_{y^{\prime}\sim p_{y}}\exp\left(\frac{s(x,y^{\prime})-s(x,y)}{\tau}\right),\\
\mathcal{L}_{\mathrm{pop}}(s;\tau)&=-\frac{1}{\tau}\mathbb{E}_{(x,y)\sim p_{+}}s(x,y)+\mathbb{E}_{x\sim p_{x}}\left[\log\mathbb{E}_{y^{\prime}\sim p_{y}}\exp\left(\frac{s(x,y^{\prime})}{\tau}\right)\right].
\end{aligned}$$

For our paired Gaussian model, we use a quadratic Gaussian-negative surrogate derived from the above formula,

$$R(W):=-\langle W,C\rangle+\frac{1}{2}\operatorname{tr}(W^{\top}HWH),\qquad W\in\mathbb{R}^{D\times D}.$$

Here the linear cross-covariance term $-\langle W,C\rangle$ represents positive-pair alignment, while the quadratic marginal term $\operatorname{tr}(W^{\top}HWH)$ represents Gaussian-negative regularization obtained by averaging over independent negatives. Its full population minimizer is

$$W^{\star}=H^{-1}CH^{-1}.$$
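One way to see how the quadratic surrogate connects to the score-based objective is the following short calculation. It is a sketch under stated assumptions (temperature $\tau=1$, finite $D$, invertible $H$, and negatives $y^{\prime}\sim\mathcal{N}(0,H)$ drawn independently of $x$), and it is not necessarily the authors' own derivation. It uses only the Gaussian moment generating function and the cyclic property of the trace.

```latex
\begin{aligned}
\log \mathbb{E}_{y'}\exp\!\left(x^{\top} W y'\right)
  &= \tfrac{1}{2}\, x^{\top} W H W^{\top} x
  && \text{(Gaussian MGF, } y' \sim \mathcal{N}(0,H)\text{)} \\
\mathbb{E}_{x}\!\left[\tfrac{1}{2}\, x^{\top} W H W^{\top} x\right]
  &= \tfrac{1}{2}\operatorname{tr}\!\left(W H W^{\top} H\right)
   = \tfrac{1}{2}\operatorname{tr}\!\left(W^{\top} H W H\right) \\
\mathbb{E}_{(x,y)\sim p_{+}}\!\left[x^{\top} W y\right]
  &= \operatorname{tr}\!\left(W\, \mathbb{E}[y x^{\top}]\right)
   = \operatorname{tr}\!\left(W^{\top} C\right)
   = \langle W, C\rangle \\
\nabla R(W) = -C + H W H = 0
  &\;\Longrightarrow\; W^{\star} = H^{-1} C H^{-1}
\end{aligned}
```

Combining the first three lines gives $-\langle W,C\rangle+\frac{1}{2}\operatorname{tr}(W^{\top}HWH)$, and the last line recovers the stated minimizer from the first-order condition.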
For the sketched model $W=S^{\top}AS$, the learnable score matrix is $A\in\mathbb{R}^{M\times M}$. We define the sketched marginal covariance and the sketched cross-covariance as

$$\Sigma:=SHS^{\top},\qquad C_{M}:=SCS^{\top}.$$

Then the risk of the sketched-space matrix parameter is

$$R_{M}(A):=R(S^{\top}AS)=-\langle A,C_{M}\rangle+\frac{1}{2}\operatorname{tr}(A^{\top}\Sigma A\Sigma).$$

For this surrogate, the Hessian (data covariance operator) is the curvature operator that determines how fast GD learns each direction and how strongly empirical noise is amplified. It acts on a matrix perturbation $B$ as

$$\mathscr{H}(B)=\Sigma B\Sigma.$$

If $\Sigma$ has eigenvalues $\mu_{i}$, then an interaction between the $i$-th and $j$-th spectral directions has curvature $\mu_{i}\mu_{j}$. As with the full population minimizer, the sketched population minimizer is

$$A^{\star}=\Sigma^{-1}C_{M}\Sigma^{-1}=(SHS^{\top})^{-1}SCS^{\top}(SHS^{\top})^{-1}.$$
#### Normal Gradient Descent\.
For any matrix $B$, define the population contrastive norm

$$\|B\|_{\Sigma,\Sigma}^{2}:=\operatorname{tr}(B^{\top}\Sigma B\Sigma).$$

Because $R_{M}$ is quadratic with minimizer $A^{\star}$, the excess risk of any $A\in\mathbb{R}^{M\times M}$ satisfies

$$R_{M}(A)-R_{M}(A^{\star})=\frac{1}{2}\|A-A^{\star}\|_{\Sigma,\Sigma}^{2}.$$
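The excess-risk identity can be checked numerically on a small instance. The snippet below is a minimal sketch, assuming a finite ambient dimension and illustrative spectra; the helper names `sketched_risk` and `contrastive_norm_sq` and all sizes are hypothetical choices made here.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 50, 8
idx = np.arange(1, D + 1, dtype=float)
lam_z, lam_eps = idx ** -2.0, idx ** -1.2        # illustrative spectra (assumed)
H, C = np.diag(lam_z + lam_eps), np.diag(lam_z)
S = rng.normal(0.0, np.sqrt(1.0 / M), size=(M, D))
Sigma, C_M = S @ H @ S.T, S @ C @ S.T

def sketched_risk(A):
    """R_M(A) = -<A, C_M> + 0.5 * tr(A^T Sigma A Sigma)."""
    return -np.sum(A * C_M) + 0.5 * np.trace(A.T @ Sigma @ A @ Sigma)

def contrastive_norm_sq(B):
    """||B||_{Sigma,Sigma}^2 = tr(B^T Sigma B Sigma)."""
    return np.trace(B.T @ Sigma @ B @ Sigma)

Sigma_inv = np.linalg.inv(Sigma)
A_star = Sigma_inv @ C_M @ Sigma_inv              # sketched population minimizer

A = A_star + rng.normal(size=(M, M))              # arbitrary competitor
excess = sketched_risk(A) - sketched_risk(A_star)
half_norm = 0.5 * contrastive_norm_sq(A - A_star)
```

Up to floating-point error, `excess` and `half_norm` agree, and `Sigma @ A_star @ Sigma` reproduces `C_M`, which is the first-order condition for the sketched minimizer.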
Given $N$ paired samples $(\widetilde{x}_{i},\widetilde{y}_{i})_{i=1}^{N}$, define the empirical sketched marginal covariances as

$$\widehat{\Sigma}_{x}:=\frac{1}{N}\sum_{i=1}^{N}\widetilde{x}_{i}\widetilde{x}_{i}^{\top},\qquad\widehat{\Sigma}_{y}:=\frac{1}{N}\sum_{i=1}^{N}\widetilde{y}_{i}\widetilde{y}_{i}^{\top},$$

and the empirical sketched cross-covariance as

$$\widehat{C}:=\frac{1}{N}\sum_{i=1}^{N}\widetilde{x}_{i}\widetilde{y}_{i}^{\top}.$$

We consider full-batch GD on the empirical loss

$$\widehat{R}_{M}(A)=-\langle A,\widehat{C}\rangle+\frac{1}{2}\operatorname{tr}\left(A^{\top}\widehat{\Sigma}_{x}A\widehat{\Sigma}_{y}\right).$$

The resulting GD recursion is

$$A_{t}=A_{t-1}-\gamma_{t}\left(\widehat{\Sigma}_{x}A_{t-1}\widehat{\Sigma}_{y}-\widehat{C}\right),\qquad t=1,\ldots,L,$$

with initialization $A_{0}=0$, where $\gamma_{t}$ is the learning rate at step $t$ and $L$ is the number of full-batch GD steps. We use a geometrically decaying stepsize schedule

$$\gamma_{t}=\frac{\gamma_{0}}{2^{\ell}},\qquad\ell=\left\lfloor\frac{t}{L_{\mathrm{eff}}}\right\rfloor,\qquad t=1,\ldots,L,$$

where the effective number of steps is

$$L_{\mathrm{eff}}:=\left\lfloor\frac{L}{\log L}\right\rfloor.$$
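The recursion and the stepsize schedule translate into a few lines of NumPy. This is a minimal sketch rather than the authors' implementation; the function names `stepsize_schedule` and `empirical_gd` are hypothetical, the logarithm is taken to be natural, and the guard `L >= 2` is an assumption added here so that $\log L>0$.

```python
import math
import numpy as np

def stepsize_schedule(gamma0, L):
    """gamma_t = gamma0 / 2**floor(t / L_eff) for t = 1..L, with L_eff = floor(L / log L)."""
    if L < 2:
        raise ValueError("L must be at least 2 so that log L > 0")
    L_eff = math.floor(L / math.log(L))
    t = np.arange(1, L + 1)
    return gamma0 / 2.0 ** (t // L_eff), L_eff

def empirical_gd(x_tilde, y_tilde, gamma0, L):
    """Full-batch GD on the empirical quadratic loss, starting from A_0 = 0."""
    N, M = x_tilde.shape
    Sx = x_tilde.T @ x_tilde / N          # empirical marginal covariance of x
    Sy = y_tilde.T @ y_tilde / N          # empirical marginal covariance of y
    C_hat = x_tilde.T @ y_tilde / N       # empirical cross-covariance
    gammas, _ = stepsize_schedule(gamma0, L)
    A = np.zeros((M, M))
    for g in gammas:
        A = A - g * (Sx @ A @ Sy - C_hat)  # gradient of the empirical loss
    return A

gammas, L_eff = stepsize_schedule(1.0, 100)
```

With `L = 100` the schedule uses `L_eff = 21`, so the stepsize is halved at steps 21, 42, 63, and 84.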
### Assumptions
###### Assumption 1\(Aligned power\-law spectra\)\.
The covariance operators $\Lambda_{z}$ and $\Lambda_{\epsilon}$ are diagonal with eigenvalues $\{\lambda_{z,i}\}_{i=1}^{D}$ and $\{\lambda_{\epsilon,i}\}_{i=1}^{D}$, respectively. Moreover, there exist constants $a,b>1$ with $a>b+\frac{1}{2}$ and constants $0<c_{0}\leq c_{1}$ such that

$$c_{0}i^{-a}\leq\lambda_{z,i}\leq c_{1}i^{-a},\qquad c_{0}i^{-b}\leq\lambda_{\epsilon,i}\leq c_{1}i^{-b},\qquad i\geq 1.$$
This assumption fixes a common spectral coordinate system. The diagonal assumption is mainly a structural normalization: because a Gaussian sketch is rotationally invariant, rotating the ambient basis does not change the distribution of the sketched problem (Sarlós [2006](https://arxiv.org/html/2606.26617#bib.bib18); Halko et al. [2011](https://arxiv.org/html/2606.26617#bib.bib19); Woodruff [2014](https://arxiv.org/html/2606.26617#bib.bib20)). The power-law decay specifies an effective low-dimensional structure, with most signal and noise energy concentrated in leading spectral directions; analogous polynomial spectral-decay conditions are standard in kernel learning and modern linear-regression scaling-law analyses (Caponnetto and De Vito [2007](https://arxiv.org/html/2606.26617#bib.bib21); Bahri et al. [2024](https://arxiv.org/html/2606.26617#bib.bib10); Lin et al. [2024](https://arxiv.org/html/2606.26617#bib.bib15), [2025](https://arxiv.org/html/2606.26617#bib.bib16)).
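Two elementary consequences of the power-law bounds are worth writing out. The calculation below is a sketch that assumes the lower constant is strictly positive ($c_{0}>0$) and uses only $a>b$, so that $i^{-a}\leq i^{-b}$ for $i\geq 1$. It concerns the unsketched diagonals $H_{ii}$ and $C_{ii}$, not the sketched eigenbasis.

```latex
\begin{aligned}
H_{ii} &= \lambda_{z,i} + \lambda_{\epsilon,i},
&\qquad c_{0}\, i^{-b} \;\le\; H_{ii} \;\le\; 2 c_{1}\, i^{-b}, \\
\frac{C_{ii}}{H_{ii}} &= \frac{\lambda_{z,i}}{\lambda_{z,i} + \lambda_{\epsilon,i}},
&\qquad \frac{c_{0}}{2 c_{1}}\, i^{-(a-b)} \;\le\; \frac{C_{ii}}{H_{ii}} \;\le\; \frac{c_{1}}{c_{0}}\, i^{-(a-b)} .
\end{aligned}
```

So the marginal spectrum decays like $i^{-b}$, and the population signal-to-total ratio decays like $i^{-(a-b)}$. This is only a population-level heuristic for why the exponent $a-b$ appears later in the analysis.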
###### Assumption 2\(GD stepsize schedule and effective horizon\)\.
Denote the maximum squared norms of the sketched training inputs by

$$R_{x}:=\max_{1\leq i\leq N}\|\widetilde{x}_{i}\|_{2}^{2},\qquad R_{y}:=\max_{1\leq i\leq N}\|\widetilde{y}_{i}\|_{2}^{2}.$$

We take the clipped initial stepsize

$$\gamma_{0}:=\min\left\{\gamma,\,\frac{1}{4R_{x}R_{y}}\right\},$$

where $\gamma>0$ is a deterministic tuning parameter satisfying $\gamma\leq c_{\gamma}$ for a sufficiently small constant $c_{\gamma}>0$. We also assume the effective horizon is non-degenerate,

$$L_{\mathrm{eff}}\gamma\gtrsim 1.$$

This assumption ensures that the full-batch GD dynamics are stable while still running for a meaningful amount of optimization time. The clipping by $(R_{x}R_{y})^{-1}$ is a data-dependent safeguard against steps that are too large for the empirical quadratic loss, whereas the deterministic upper bound on $\gamma$ keeps the population-level contraction arguments valid. The condition $L_{\mathrm{eff}}\gamma\gtrsim 1$ simply rules out the degenerate regime in which the effective optimization horizon vanishes; in that case, starting from $A_{0}=0$, the algorithm would have essentially no opportunity to learn the sketched population minimizer.
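The clipping rule is easy to compute, and its stabilizing effect can be checked directly. Each empirical covariance is an average of rank-one matrices, so its operator norm is at most the largest squared sample norm, and the largest curvature of the empirical loss is therefore at most $R_{x}R_{y}$. The snippet below is a minimal sketch; `clipped_stepsize` is a hypothetical helper and the random inputs merely stand in for sketched data.

```python
import numpy as np

def clipped_stepsize(x_tilde, y_tilde, gamma):
    """gamma_0 = min(gamma, 1 / (4 R_x R_y)) with R_x, R_y the max squared row norms."""
    R_x = np.max(np.sum(x_tilde ** 2, axis=1))
    R_y = np.max(np.sum(y_tilde ** 2, axis=1))
    return min(gamma, 1.0 / (4.0 * R_x * R_y))

rng = np.random.default_rng(3)
x_tilde = rng.normal(size=(100, 6))    # stand-in sketched inputs (assumed)
y_tilde = rng.normal(size=(100, 6))
gamma0 = clipped_stepsize(x_tilde, y_tilde, gamma=0.1)

Sx = x_tilde.T @ x_tilde / 100
Sy = y_tilde.T @ y_tilde / 100
# Largest eigenvalue of the map B -> Sx B Sy is the product of the two operator norms.
curvature_max = np.linalg.norm(Sx, 2) * np.linalg.norm(Sy, 2)
```

By construction `gamma0 * curvature_max` is at most 1/4, so every GD step contracts along every empirical curvature direction.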
###### Assumption 3\(Contrastive source condition\)\.
Let the spectral decomposition of the sketched covariance $\Sigma$ be

$$\Sigma=\sum_{i=1}^{M}\mu_{i}v_{i}v_{i}^{\top},\qquad\mu_{1}\geq\mu_{2}\geq\cdots\geq\mu_{M}>0.$$

We assume there exist coefficients $(\kappa_{i})_{i=1}^{M}$ such that

$$C_{M}=\sum_{i=1}^{M}\kappa_{i}\mu_{i}v_{i}v_{i}^{\top},$$

that is, the sketched cross-covariance is diagonal in the eigenbasis of $\Sigma$. Moreover, for the matching lower bound, we further assume

$$\kappa_{i}\asymp i^{-(a-b)},$$

and that the tail coefficients are bounded away from one: for some fixed $i_{0}\geq 1$ and $c_{\kappa}\in(0,1)$, $\kappa_{i}\leq 1-c_{\kappa}$ for all $i\geq i_{0}$, which ensures that the lower bounds are non-degenerate.

This source condition says that the contrastive target $C_{M}$ is aligned with the principal spectral directions of the sketched covariance $\Sigma$. The coefficients $\kappa_{i}$ measure how much target signal remains in each eigendirection, and the decay condition encodes a regular target that becomes weaker in increasingly low-variance directions. Such source or target-covariance alignment assumptions are mathematically natural in inverse problems and have been used explicitly in recent linear-regression scaling-law analyses to connect the spectrum of the data covariance with the regression target (Lin et al. [2024](https://arxiv.org/html/2606.26617#bib.bib15), [2025](https://arxiv.org/html/2606.26617#bib.bib16)). Importantly, this source condition is imposed on the eigenbasis of the sketched covariance $\Sigma$; it cannot be induced from Assumption [1](https://arxiv.org/html/2606.26617#Thmassumption1) and Gaussian sketching alone, which primarily control spectral decay and sketch-induced eigenvalue behavior.
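Under the source condition, the sketched minimizer and its norm have closed forms. The following short calculation is a sketch that uses only $A^{\star}=\Sigma^{-1}C_{M}\Sigma^{-1}$, the shared eigenbasis $\{v_{i}\}$, and the definitions of $R_{M}$ and $\|\cdot\|_{\Sigma,\Sigma}$.

```latex
\begin{aligned}
A^{\star} &= \Sigma^{-1} C_{M} \Sigma^{-1}
  = \sum_{i=1}^{M} \frac{\kappa_{i}}{\mu_{i}}\, v_{i} v_{i}^{\top}, \\
\|A^{\star}\|_{\Sigma,\Sigma}^{2}
  &= \operatorname{tr}\!\left(A^{\star\top} \Sigma A^{\star} \Sigma\right)
  = \sum_{i=1}^{M} \left(\frac{\kappa_{i}}{\mu_{i}}\right)^{2} \mu_{i}^{2}
  = \sum_{i=1}^{M} \kappa_{i}^{2}, \\
R_{M}(A^{\star})
  &= -\langle A^{\star}, C_{M}\rangle + \tfrac{1}{2}\|A^{\star}\|_{\Sigma,\Sigma}^{2}
  = -\sum_{i=1}^{M} \kappa_{i}^{2} + \tfrac{1}{2}\sum_{i=1}^{M} \kappa_{i}^{2}
  = -\tfrac{1}{2}\sum_{i=1}^{M} \kappa_{i}^{2}.
\end{aligned}
```

In particular, the target is supported on the diagonal pairs $(i,i)$ only, and each such coordinate contributes $\kappa_{i}^{2}$ to the target energy, with curvature $\mu_{i}^{2}$ under the operator $B\mapsto\Sigma B\Sigma$.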
## Main Results
We now state the main scaling law for empirical gradient descent in the sketched contrastive learning problem. The result separates the population risk into irreducible risk, approximation error, GD bias, GD variance, and a cross term, following the error decomposition of Lin et al. ([2024](https://arxiv.org/html/2606.26617#bib.bib15), [2025](https://arxiv.org/html/2606.26617#bib.bib16)). Let the empirical cross-covariance residual be $\widehat{E}:=\widehat{C}-\widehat{\Sigma}_{x}A^{\star}\widehat{\Sigma}_{y}$. The GD bias and variance filters used below are the linear maps

$$\begin{aligned}
\widehat{\mathscr{B}}_{L}(B)&:=\left[\prod_{t=1}^{L}\left(I-\gamma_{t}\widehat{\Sigma}_{x}(\cdot)\widehat{\Sigma}_{y}\right)\right](B),\\
\widehat{\mathscr{V}}_{L}(E)&:=\sum_{t=1}^{L}\gamma_{t}\left[\prod_{s=t+1}^{L}\left(I-\gamma_{s}\widehat{\Sigma}_{x}(\cdot)\widehat{\Sigma}_{y}\right)\right](E),
\end{aligned}$$

where an empty product is the identity map. Below we give the risk decomposition for empirical GD.
###### Proposition 1\(Risk decomposition for empirical GD\)\.
Let $A_{L}$ be the $L$-step empirical GD iterate, and let $\mathbb{E}_{\mathcal{D}}$ denote expectation over the training sample. Then the expected risk admits the decomposition

$$\mathbb{E}_{\mathcal{D}}\left[R_{M}(A_{L})\right]=R_{\mathrm{irr}}+\operatorname{Approx}+\operatorname{Bias}_{L}+\operatorname{Var}_{L}+\operatorname{Cross}_{L},$$

where

$$\begin{aligned}
R_{\mathrm{irr}}&:=R(W^{\star}),\\
\operatorname{Approx}&:=R_{M}(A^{\star})-R(W^{\star}),\\
\operatorname{Bias}_{L}&:=\frac{1}{2}\mathbb{E}_{\mathcal{D}}\left[\left\|\widehat{\mathscr{B}}_{L}(A^{\star})\right\|_{\Sigma,\Sigma}^{2}\right],\\
\operatorname{Var}_{L}&:=\frac{1}{2}\mathbb{E}_{\mathcal{D}}\left[\left\|\widehat{\mathscr{V}}_{L}(\widehat{E})\right\|_{\Sigma,\Sigma}^{2}\right],\\
\operatorname{Cross}_{L}&:=-\mathbb{E}_{\mathcal{D}}\left[\left\langle\widehat{\mathscr{B}}_{L}(A^{\star}),\widehat{\mathscr{V}}_{L}(\widehat{E})\right\rangle_{\Sigma,\Sigma}\right],
\end{aligned}$$

and $\langle P,Q\rangle_{\Sigma,\Sigma}:=\operatorname{tr}(P^{\top}\Sigma Q\Sigma)$ is the inner product that induces $\|\cdot\|_{\Sigma,\Sigma}$.
By the Cauchy–Schwarz inequality followed by the AM–GM inequality, the cross term in the above proposition satisfies

$$|\operatorname{Cross}_{L}|\leq 2\sqrt{\operatorname{Bias}_{L}\operatorname{Var}_{L}}\leq\operatorname{Bias}_{L}+\operatorname{Var}_{L}.$$

Consequently, for an upper-bound analysis,

$$\mathbb{E}_{\mathcal{D}}\left[R_{M}(A_{L})\right]\lesssim R_{\mathrm{irr}}+\operatorname{Approx}+\operatorname{Bias}_{L}+\operatorname{Var}_{L}.$$
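The algebra behind the decomposition can be verified on a single toy draw, without taking the expectation over the sample. Unrolling the GD recursion from a zero initialization gives an iterate error equal to the variance filter applied to the residual minus the bias filter applied to the minimizer. The snippet below is a minimal sketch with an isotropic toy model standing in for the sketched views; the sizes, the schedule, and the helper names `hess` and `inner` are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, L = 300, 4, 50
z = rng.normal(size=(N, M))
x = z + 0.5 * rng.normal(size=(N, M))    # toy "sketched" views (assumed)
y = z + 0.5 * rng.normal(size=(N, M))
Sx, Sy, C_hat = x.T @ x / N, y.T @ y / N, x.T @ y / N

Sigma = 1.25 * np.eye(M)                 # population covariance of the toy views
A_star = np.eye(M) / 1.25 ** 2           # Sigma^{-1} C Sigma^{-1} with C = I
gammas = 0.1 / 2.0 ** (np.arange(1, L + 1) // 10)   # hypothetical decaying schedule

def hess(B):
    """Empirical curvature operator B -> Sx B Sy."""
    return Sx @ B @ Sy

def inner(P, Q):
    """<P, Q>_{Sigma,Sigma} = tr(P^T Sigma Q Sigma)."""
    return np.trace(P.T @ Sigma @ Q @ Sigma)

E_hat = C_hat - hess(A_star)             # empirical cross-covariance residual
A = np.zeros((M, M))                     # GD iterate, A_0 = 0
B = A_star.copy()                        # bias filter applied to A_star
V = np.zeros((M, M))                     # variance filter applied to E_hat
for g in gammas:
    A = A - g * (hess(A) - C_hat)
    B = B - g * hess(B)
    V = V - g * hess(V) + g * E_hat

excess = 0.5 * inner(A - A_star, A - A_star)
bias, var, cross = 0.5 * inner(B, B), 0.5 * inner(V, V), -inner(B, V)
```

On this draw `A - A_star` equals `V - B`, the excess risk equals `bias + var + cross`, and the per-sample cross term is bounded by `2 * sqrt(bias * var)`.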
We next combine the approximation, GD-bias, and GD-variance estimates. Write the effective horizon and effective decay power as

$$R:=L_{\mathrm{eff}}\gamma,\qquad\delta:=a-b.$$

We now state the contrastive scaling law.
###### Theorem 1\(Scaling law for empirical GD in sketched contrastive learning\)\.
Suppose Assumptions [1](https://arxiv.org/html/2606.26617#Thmassumption1), [2](https://arxiv.org/html/2606.26617#Thmassumption2), and [3](https://arxiv.org/html/2606.26617#Thmassumption3) hold, with $\frac{1}{2}<\delta<b+\frac{1}{2}$ and $\delta\neq(b+1)/2$. Assume the effective horizon satisfies

$$1\lesssim R\lesssim\min\left\{\frac{N}{M},\,M^{2b}\right\}.$$

Then with probability at least $1-\exp(-\Omega(M))$,

$$\begin{aligned}
\mathbb{E}_{\mathcal{D}}[R_{M}(A_{L})]&=R_{\mathrm{irr}}+\mathcal{A}_{M}+\mathcal{B}_{R}+\mathcal{V}_{N,M,R}+\mathcal{C}_{N,M,R},\\
\mathcal{A}_{M}&:=\Theta\left(M^{-\min\{2\delta-1,b\}}\right),\\
\mathcal{B}_{R}&:=\Theta\left(R^{\frac{1-2\delta}{2b}}\right),\\
\mathcal{V}_{N,M,R}&:=\Theta\left(\frac{D_{\times}(R,M)}{N}\right),\\
\mathcal{C}_{N,M,R}&:=\mathcal{O}\left(R^{\frac{1-2\delta}{4b}}\mathcal{V}_{N,M,R}^{1/2}\right).
\end{aligned}$$

Here $\mathcal{A}_{M}$, $\mathcal{B}_{R}$, $\mathcal{V}_{N,M,R}$, and $\mathcal{C}_{N,M,R}$ correspond to approximation, GD bias, GD variance, and the cross term, respectively, with

$$D_{\times}(R,M)=\begin{cases}R^{1/b}\log(eR),&1\leq R^{1/b}\leq M,\\ R^{1/b}\left(1+\log\dfrac{M^{2}}{R^{1/b}}\right),&M<R^{1/b}<M^{2},\\ M^{2},&R^{1/b}\geq M^{2}.\end{cases}$$

The hidden constants in $\Theta(\cdot)$ and $\mathcal{O}(\cdot)$ depend only on the problem exponents and the constants in the assumptions.
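The piecewise scale in the theorem is straightforward to evaluate. The helper below is a minimal sketch: the name `d_cross` is hypothetical, the logarithm is taken to be natural, and the returned value is meaningful only up to the constants hidden in the theorem's order notation.

```python
import math

def d_cross(R, M, b):
    """Piecewise scale D_x(R, M) from the theorem, for R**(1/b) >= 1."""
    r = R ** (1.0 / b)
    if r < 1.0:
        raise ValueError("the formula is stated for R**(1/b) >= 1")
    if r <= M:                      # unsaturated product regime
        return r * math.log(math.e * R)
    if r < M ** 2:                  # intermediate regime
        return r * (1.0 + math.log(M ** 2 / r))
    return float(M ** 2)            # fully saturated regime

# Example values for M = 10 and b = 2 (both chosen only for illustration).
unsaturated = d_cross(16.0, 10, 2.0)       # R**(1/b) = 4 <= M
intermediate = d_cross(2500.0, 10, 2.0)    # M < R**(1/b) = 50 < M**2
saturated = d_cross(1e8, 10, 2.0)          # R**(1/b) = 1e4 >= M**2
```

Dividing the returned value by the sample size `N` gives the order of the GD-variance term for a given horizon and sketch dimension.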
#### Interpretation of the terms\.
The irreducible riskRirrR\_\{\\mathrm\{irr\}\}is the best value achievable by the full, unsketched quadratic contrastive surrogate\. It is independent of the sketch size, sample size, and optimization time, and therefore plays the role of a problem\-dependent floor\. The approximation term is the price of restricting the full bilinear scoreWWto the sketched classW=S⊤ASW=S^\{\\top\}AS\. It decreases with the sketch dimensionMM, because a larger sketch captures more spectral mass of the population contrastive target, but it is not improved by taking more samples or by running GD longer\.
The GD-bias term is an optimization error. Starting from $A_{0} = 0$, GD first learns the diagonal source directions with large curvature $\mu_{i}^{2}$. After effective optimization horizon $R = L_{\mathrm{eff}}\gamma$, directions satisfying $R\mu_{i}^{2} \gtrsim 1$ are substantially learned, whereas directions below this threshold remain biased. In the unsaturated regime, this gives the rate $R^{(1-2\delta)/(2b)}$ under the source condition. In contrast, the GD-variance term is statistical: it measures how empirical covariance and cross-covariance fluctuations are amplified. Because the curvature operator acts on interactions between two spectral directions, its curvature values are products $\{\mu_{i}\mu_{j}\}_{i,j}$. The corresponding active dimension is
$$
d_{\mathrm{prod}}(R,M) := \sum_{i,j \leq M} \min\{1, (R\mu_{i}\mu_{j})^{2}\}.
$$
We call this count the product effective dimension. Under the power-law spectral scaling, it has the piecewise scale $D_{\times}(R,M)$: it is of order $R^{1/b}\log(eR)$ in the unsaturated product regime, of order $R^{1/b}(1 + \log(M^{2}/R^{1/b}))$ in the intermediate regime, and of order $M^{2}$ in the fully saturated regime. Thus increasing optimization time reduces bias but also activates more noisy interactions between the two views. Finally, the cross term is not a separate scaling bottleneck: by Cauchy–Schwarz it is controlled by the geometric mean of bias and variance, and hence can be absorbed into them for upper-bound purposes.
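The product effective dimension can be evaluated by brute force on a small example. The sketch below assumes an exact power-law spectrum $\mu_{i} = i^{-b}$ (the analysis only requires $\mu_{i} \asymp i^{-b}$), and the function name `d_prod` is a hypothetical helper, not code from the paper.

```python
import numpy as np


def d_prod(R: float, M: int, b: float) -> float:
    """Brute-force product effective dimension for mu_i = i^{-b} (assumed)."""
    mu = np.arange(1, M + 1, dtype=float) ** (-b)
    products = np.outer(mu, mu)  # curvature values mu_i * mu_j
    return float(np.minimum(1.0, (R * products) ** 2).sum())


M, b = 20, 1.5
early = d_prod(10.0, M, b)                   # few active interaction pairs
later = d_prod(100.0, M, b)                  # more pairs activated
saturated = d_prod(10.0 * M ** (2 * b), M, b)  # every pair has R*mu_i*mu_j >= 1
```

With this spectrum every pair is active once $R \geq M^{2b}$, so the count reaches its ceiling $M^{2}$, matching the fully saturated regime.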
The horizon conditions in Theorem [1](https://arxiv.org/html/2606.26617#Thmmytheorem1) play two roles: $R \lesssim M^{2b}$ keeps the GD-bias term in the unsaturated (not fully learned) regime, while $R \lesssim N/M$ ensures that, with high probability, the empirical marginal covariances can be replaced by the population covariance (the covariance event). Previous sketched linear-regression scaling-law papers provide covariance replacement lemmas for this purpose (Lin et al. [2024](https://arxiv.org/html/2606.26617#bib.bib15), [2025](https://arxiv.org/html/2606.26617#bib.bib16)), but those tools are not strong enough for our contrastive setup, where the bilinear Hessian involves products of empirical covariances and requires uniform control over view–view interaction directions.
#### Proof sketch\.
For the approximation term, the upper bound first projects onto the leading population spectral coordinates and then controls the leakage created by the Gaussian sketch outside this truncated subspace\. The matching lower bound shows that two losses cannot be avoided: the spectral tail beyond the sketch dimension and the head leakage caused by random projection of the leading coordinates\.
For the GD-bias term, the source condition reduces the relevant population dynamics to diagonal modes. The GD filter contracts diagonal coordinates with $\mu_{i}^{2} \gtrsim R^{-1}$ after effective horizon $R$. The lower bound keeps the residual mass in the complementary low-curvature diagonal modes, where $R\mu_{i}^{2} \lesssim 1$, showing that these directions cannot be learned faster by the same GD schedule.
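The bias exponent can be recovered by a short heuristic calculation. The sketch below is not the paper's proof: it assumes the source condition is saturated ($\kappa_{i} \asymp i^{-\delta}$ with $\delta > 1/2$), that the threshold index $i_{\star}$ lies well below $M$ (which is what $R \lesssim M^{2b}$ provides), and that the unlearned residual of diagonal mode $i$ in the $\|\cdot\|_{\Sigma,\Sigma}$ norm is of order $\kappa_{i}^{2}$.

```latex
\begin{align*}
i_{\star} &:= \max\{\, i : R\mu_{i}^{2} \gtrsim 1 \,\} \asymp R^{1/(2b)}
  && \text{since } \mu_{i} \asymp i^{-b},\\
\text{GD bias} &\approx \sum_{i_{\star} < i \leq M} \kappa_{i}^{2}
  \asymp \sum_{i > i_{\star}} i^{-2\delta}
  \asymp i_{\star}^{\,1-2\delta}
  && \text{since } 2\delta > 1,\\
&\asymp R^{\frac{1-2\delta}{2b}}.
\end{align*}
```

The last line agrees with the stated rate $\mathcal{B}_{R} = \Theta(R^{(1-2\delta)/(2b)})$.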
For the GD\-variance term, the upper bound combines empirical covariance concentration with the GD variance filter\. This reduces the problem to counting which view–view interaction directions are activated by the optimization horizon\. The lower bound isolates a large set of active coordinates and shows that their empirical fluctuations survive the filter, yielding the same variance scaling\.
## Experiments




Figure 1: Synthetic verification of the sketched contrastive scaling law. Top-left: approximation error versus sketch dimension $M$. Top-right: GD bias versus the number of GD iterations $L$, compared with the bias reference. Bottom-left: GD empirical variance versus $L$ compared with the variance reference, with vertical markers indicating the two variance regimes. Bottom-right: empirical excess risk along GD compared with the empirical bias-plus-variance.

We run synthetic experiments under the paired Gaussian latent-variable model of Theorem [1](https://arxiv.org/html/2606.26617#Thmmytheorem1). The two views share a latent component, have independent Gaussian noises with diagonal power-law spectra, and are observed through a fixed Gaussian sketch; the sketched bilinear model is then trained by full-batch GD. For the optimization panels, we enforce the source alignment and use controlled empirical marginal covariances satisfying the covariance event, so the experiments test the predicted GD bias/variance filters rather than the sharpness of the covariance-concentration condition itself.
Figure [1](https://arxiv.org/html/2606.26617#Sx4.F1) summarizes the results. The approximation experiment varies the sketch dimension $M$ and compares the empirical approximation error with the predicted power-law rate. The GD-bias experiment fixes $M$ and varies the number of GD steps over a long horizon, comparing the measured GD bias with the rate predicted by Theorem [1](https://arxiv.org/html/2606.26617#Thmmytheorem1). The GD-variance experiment fixes $M$ and $N$, varies $L$, and estimates the variance component across independently sampled datasets. The variance reference is computed from the finite sketched spectrum and the actual accumulated stepsize, so the panel directly probes the predicted transition from the unsaturated product regime, through the intermediate regime, to the finite-dimensional saturated regime. Finally, the excess-risk experiment varies $L$ and compares the empirical excess risk with the measured bias-plus-variance curve.
Overall, the synthetic results support the qualitative predictions of the theory. The approximation curve closely follows the predicted sketch-dimension exponent: the fitted slope is about $-1.23$, compared with the predicted slope $-1.10$. The GD-bias curve agrees with the theorem when $L$ is small, since this part of the run lies inside the bias-unsaturated regime; on the small-$L$ portion, the fitted slope against $\Gamma_{L}$ is about $-0.87$, close to the theorem-guided slope $-0.91$. For larger $L$, the run may leave the regime covered by the theorem, and the empirical curve correspondingly deviates from the theorem-guided reference. The GD-variance curve exhibits the expected three-stage increase with the number of iterations: additional direction-pairs become active, the curve bends through the intermediate regime, and it eventually saturates at the finite sketched dimension. Its transition points and saturation scale are consistent with the finite-spectral calculation detailed in the supplementary materials; log–log fits give early/intermediate/late slopes of approximately $0.95/0.36/0.01$, close to the finite-filter reference slopes $0.85/0.33/0.01$. The excess-risk curve is almost indistinguishable from the bias-plus-variance curve, with maximum relative discrepancy below $0.4\%$, supporting the risk decomposition in Proposition [1](https://arxiv.org/html/2606.26617#Thmproposition1) and the treatment of the cross term as a lower-order contribution.
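Fitted slopes of this kind are typically obtained by least squares on log–log axes. The sketch below illustrates that procedure on placeholder data generated from a known power law; the grid, prefactor, and noise level are assumptions chosen for illustration and are not the measurements reported above.

```python
import numpy as np


def loglog_slope(x, y) -> float:
    """Least-squares slope of log(y) against log(x)."""
    slope, _intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(slope)


rng = np.random.default_rng(0)
M_grid = np.array([16, 32, 64, 128, 256, 512], dtype=float)
true_exponent = -1.10  # placeholder exponent for the synthetic curve

# Synthetic "approximation error": power law with small multiplicative noise.
noise = np.exp(0.02 * rng.standard_normal(M_grid.size))
err = 3.0 * M_grid ** true_exponent * noise

fitted = loglog_slope(M_grid, err)
```

On real curves the fit window matters: the slope should be estimated only on the portion of the run that lies inside the regime covered by the theorem.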
## Discussion
### Compute Budget Allocation
#### Compute proxy\.
Scaling-law studies often express training compute as the product of model size, number of processed examples or tokens, and optimization steps, up to hardware- and architecture-dependent constants (Kaplan et al. [2020](https://arxiv.org/html/2606.26617#bib.bib2); Hoffmann et al. [2022](https://arxiv.org/html/2606.26617#bib.bib4)). In our sketched bilinear model, we use the leading-order compute proxy
$$
\mathcal{C} \asymp LNM^{2}.
$$
#### Compute allocation\.
The effective optimization horizon is $R = L_{\mathrm{eff}}\gamma$, and we use the heuristic relation $R \approx L\gamma/\log L$. Thus, when $\gamma$ is kept near a constant stable value, $L$ and $R$ are interchangeable up to logarithmic factors. The question is how to allocate a fixed compute budget $\mathcal{C}$ among optimization time $L$, sketch/model size $M$, and sample size $N$.
###### Proposition 2 (Compute allocation under the covariance constraint).
Ignore logarithmic factors and fix a stable step size $\gamma \asymp 1$. Let $q := 2\delta - 1$ with $0 < q < 2b$, and impose the effective-horizon constraint from Theorem [1](https://arxiv.org/html/2606.26617#Thmmytheorem1),
$$
1 \lesssim R \lesssim \min\left\{\frac{N}{M},\, M^{2b}\right\}.
$$
Under the compute proxy $\mathcal{C} \asymp LNM^{2}$, the leading balanced allocations are as follows, up to logarithmic factors.
1. Product-unsaturated, forced. If $M \asymp R^{1/b}$, then $R \asymp \mathcal{C}^{\frac{b}{2b+3}}$, $M \asymp \mathcal{C}^{\frac{1}{2b+3}}$, $N \asymp \mathcal{C}^{\frac{b+1}{2b+3}}$.
2. Rough source. If $0 < q < b$, then $R \asymp \mathcal{C}^{\frac{2b}{4b+3}}$, $M \asymp \mathcal{C}^{\frac{1}{4b+3}}$, $N \asymp \mathcal{C}^{\frac{2b+1}{4b+3}}$.
3. Smooth source. If $b < q < 2b$, then $R \asymp \mathcal{C}^{\frac{2b^{2}}{4b^{2}+3q}}$, $M \asymp \mathcal{C}^{\frac{q}{4b^{2}+3q}}$, $N \asymp \mathcal{C}^{\frac{2b^{2}+q}{4b^{2}+3q}}$.

The boundary case $q = b$ gives the same exponents from the rough and smooth formulas, up to logarithmic factors.
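The three rows can be tabulated mechanically. The following sketch returns the exponents of $\mathcal{C}$ for $R$, $M$, and $N$ as exact fractions; the function name `allocation_exponents` and the `forced_unsaturated` flag are hypothetical, and the code uses $N \asymp RM$ and $L \asymp R$ (log factors ignored), so the three exponents should satisfy $r + n + 2m = 1$.

```python
from fractions import Fraction


def allocation_exponents(b, q, forced_unsaturated=False):
    """Exponents (r, m, n) with R ~ C^r, M ~ C^m, N ~ C^n, log factors ignored."""
    b, q = Fraction(b), Fraction(q)
    if not (0 < q < 2 * b):
        raise ValueError("requires 0 < q < 2b")
    if forced_unsaturated:
        # Row 1: M ~ R^{1/b}.
        denom = 2 * b + 3
        r, m = b / denom, 1 / denom
    elif q <= b:
        # Row 2 (rough source); the boundary q = b is handled here.
        denom = 4 * b + 3
        r, m = 2 * b / denom, 1 / denom
    else:
        # Row 3 (smooth source).
        denom = 4 * b * b + 3 * q
        r, m = 2 * b * b / denom, q / denom
    n = r + m  # N ~ R * M, the lower boundary of the covariance constraint
    return r, m, n


rough = allocation_exponents(2, 1)    # b = 2, q = 1
smooth = allocation_exponents(2, 3)   # b = 2, q = 3
```

For `b = 2, q = 1` this yields the exponents `4/11`, `1/11`, and `5/11`, matching the rough-source row.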
#### Choosing time steps and learning rate\.
Let $R_{\star}$ denote the target horizon prescribed by Proposition [2](https://arxiv.org/html/2606.26617#Thmproposition2). The learning-rate choice is the largest stable clipped step,
$$
\gamma_{\star} = \min\left\{c_{\gamma}, \frac{1}{4R_{x}R_{y}}\right\},
$$
up to absolute constant factors. We then choose
$$
L_{\mathrm{eff}} = \left\lceil \frac{R_{\star}}{\gamma_{\star}} \right\rceil.
$$
The number of GD steps $L$ is then the smallest integer satisfying $\lfloor L/\log L \rfloor \geq L_{\mathrm{eff}}$. Thus, when $\gamma_{\star} \asymp 1$, the compute-allocation exponents can be read as explicit prescriptions for the number of GD steps: realize the target $R_{\star}$ in Proposition [2](https://arxiv.org/html/2606.26617#Thmproposition2) by taking $L \asymp R_{\star}\log R_{\star}$, with $M$ and $N$ chosen according to the corresponding row above.
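The recipe above translates directly into a few lines of code. In the sketch below, `R_x`, `R_y`, and `c_gamma` are treated as given inputs (their values here are placeholders), the function name `gd_schedule` is hypothetical, and the search for $L$ is a plain linear scan starting at $L = 2$, which is adequate for illustration but not for very large horizons.

```python
import math


def gd_schedule(R_star: float, R_x: float, R_y: float, c_gamma: float = 1.0):
    """Return (gamma_star, L_eff, L) following the clipped-step recipe."""
    gamma_star = min(c_gamma, 1.0 / (4.0 * R_x * R_y))
    L_eff = math.ceil(R_star / gamma_star)
    # Smallest integer L >= 2 with floor(L / log L) >= L_eff.
    # L = 1 is skipped because log(1) = 0.
    L = 2
    while math.floor(L / math.log(L)) < L_eff:
        L += 1
    return gamma_star, L_eff, L


gamma_star, L_eff, L = gd_schedule(R_star=10.0, R_x=0.5, R_y=0.5)
```

With these placeholder inputs the step is not clipped by `R_x * R_y`, so `gamma_star` equals `c_gamma` and `L_eff` equals the target horizon.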
#### Allocation intuition\.
The effective-horizon constraint changes the compute-allocation rule: increasing the optimization horizon requires a proportional increase in sample size. The balanced allocations therefore keep $N$ near the lower boundary $N \asymp RM$, which spends just enough data to stabilize the alignment between empirical marginal covariance and population covariance, while leaving the remaining compute to balance approximation against optimization bias.
### Prior Work and Our Limitations
Our stylized setup is motivated by three nearby lines of work. CLIP learns image–text alignment by training modality-specific encoders on paired image–caption data with a contrastive objective (Radford et al. [2021](https://arxiv.org/html/2606.26617#bib.bib28)). InfoNCE formalizes contrastive learning by comparing positive pairs against negative samples (Oord et al. [2018](https://arxiv.org/html/2606.26617#bib.bib24)). Recent data-reuse scaling laws study how optimization time and repeated use of finite data affect sketched linear regression (Lin et al. [2025](https://arxiv.org/html/2606.26617#bib.bib16)). Since we aim for a tractable model of contrastive scaling, we acknowledge the following three limitations of the problem setup relative to these three works; they make the scaling law explicit but restrict the scope of the model:
- Bilinear matched-view representation. We use a bilinear score, and assume the two views $x$ and $y$ have matched dimensions and are sketched by the same linear map. This abstracts away the nonlinear, modality-specific image and text encoders used in CLIP, where the raw inputs live in different spaces before being mapped to a common embedding space.
- No explicit sampled negatives. Our empirical training data consists of paired positive samples, and the negative side is represented through a quadratic Gaussian-negative surrogate rather than explicit negative pairs. This is less general than InfoNCE training, where positives are explicitly contrasted against negative samples.
- Simplified optimization and covariance control. We analyze Gaussian data with full-batch GD and under a covariance-concentration regime, where empirical marginal covariances are controlled by their population counterparts. This assumption is stronger than what is typically available in practical contrastive training and omits features such as mini-batch stochasticity.
Despite these simplifications, the result is still informative\. From the scaling\-law perspective, it extends linear\-regression theory to a contrastive objective whose effective degrees of freedom come from products of view\-specific spectral directions\. From the contrastive\-learning perspective, it complements existing approximation and generalization analyses by adding an explicit optimization scheme and showing how approximation, optimization bias, and finite\-sample variance interact during training\.
## Conclusion
We studied scaling laws for sketched contrastive representation learning under a paired Gaussian latent-variable model. For a quadratic Gaussian-negative contrastive surrogate trained by empirical GD, we decomposed the expected sketched risk into irreducible risk, approximation error, GD bias, GD variance, and a cross term that can be absorbed for upper bounds. Under aligned power-law spectra and a contrastive source condition, our analysis gives explicit scaling laws in terms of sketch dimension, sample size, and optimization horizon. The main new feature compared with sketched linear regression is that the model must learn interactions between two views, which changes how optimization and finite-sample noise scale. Several directions remain open. It would be important to sharpen the covariance-event analysis beyond the present Gaussian setting, to extend the model to heterogeneous views where $x$ and $y$ may have different dimensions but share the same latent variable $z$, and to analyze multi-pass or mini-batch SGD, where fluctuations around the GD reference path become more delicate.
## References
- S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi (2019). A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, pp. 5628–5637.
- A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan (2026). Scaling and renormalization in high-dimensional regression. Journal of Statistical Mechanics: Theory and Experiment 2026(4), pp. 043404. [Document](https://dx.doi.org/10.1088/1742-5468/ae4bba)
- Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma (2024). Explaining neural scaling laws. Proceedings of the National Academy of Sciences 121(27), pp. e2311878121.
- B. Bordelon, A. Atanasov, and C. Pehlevan (2024). A dynamical model of neural scaling laws. In Proceedings of the 41st International Conference on Machine Learning.
- A. Caponnetto and E. De Vito (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics 7(3), pp. 331–368.
- T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020). A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, pp. 1597–1607.
- Z. Chen and D. Zhou (2026). From one-pass SGD to data reuse: mini-batch scaling laws in sketched linear regression. arXiv preprint arXiv:2605.24316. [Link](https://arxiv.org/abs/2605.24316)
- M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev (2023). Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818–2829.
- E. Dohmatob, Y. Feng, P. Yang, F. Charton, and J. Kempe (2024). A tale of tails: model collapse as a change of scaling laws. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 235, pp. 11165–11197.
- R. Hadsell, S. Chopra, and Y. LeCun (2006). Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1735–1742.
- N. Halko, P. Martinsson, and J. A. Tropp (2011). Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53(2), pp. 217–288.
- J. Z. HaoChen, C. Wei, A. Gaidon, and T. Ma (2021). Provable guarantees for self-supervised deep learning with spectral contrastive loss. In Advances in Neural Information Processing Systems, Vol. 34, pp. 5000–5011.
- K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.
- T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, C. Hallacy, B. Mann, A. Radford, A. Ramesh, N. Ryder, D. M. Ziegler, J. Schulman, D. Amodei, and S. McCandlish (2020). Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701.
- J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, Md. M. A. Patwary, Y. Yang, and Y. Zhou (2017). Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409.
- J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. (2022). Training compute-optimal large language models. In Advances in Neural Information Processing Systems, Vol. 35.
- M. Hutter (2021). Learning curve theory. arXiv preprint arXiv:2102.04074.
- W. Ji, Z. Deng, R. Nakada, J. Zou, and L. Zhang (2021). The power of contrast for feature learning: a theoretical analysis. arXiv preprint arXiv:2110.02473.
- C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 4904–4916.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020). Supervised contrastive learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 18661–18673.
- X. Li, Z. Wang, and C. Xie (2023). An inverse scaling law for CLIP training. In Advances in Neural Information Processing Systems, Vol. 36.
- Y. Li, X. Wei, T. Yang, and Y. Ying (2026). Statistical consistency and generalization of contrastive representation learning. arXiv preprint arXiv:2605.02116.
- L. Lin, J. Wu, and P. L. Bartlett (2025). Improved scaling laws in linear regression via data reuse. In Advances in Neural Information Processing Systems, Vol. 38.
- L. Lin, J. Wu, S. M. Kakade, P. L. Bartlett, and J. D. Lee (2024). Scaling laws in linear regression: compute, parameters, and data. In Advances in Neural Information Processing Systems, Vol. 37, pp. 60556–60606. [Document](https://dx.doi.org/10.52202/079017-1937)
- A. Maloney, D. A. Roberts, and J. Sully (2022). A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859.
- N. Muennighoff, A. M. Rush, B. Barak, T. Le Scao, A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel (2023). Scaling data-constrained language models. In Advances in Neural Information Processing Systems, Vol. 36.
- M. Nezhurina, T. Porian, G. Puccetti, T. Kerssies, R. Beaumont, M. Cherti, and J. Jitsev (2025). Scaling laws for robust comparison of open foundation language-vision models and datasets. In Advances in Neural Information Processing Systems, Vol. 38.
- A. v. d. Oord, Y. Li, and O. Vinyals (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- E. Paquette, C. Paquette, L. Xiao, and J. Pennington (2024). 4+3 phases of compute-optimal neural scaling laws. In Advances in Neural Information Processing Systems, Vol. 37.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021). Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748–8763.
- T. Sarlós (2006). Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pp. 143–152.
- U. Sharma and J. Kaplan (2020). A neural scaling law from the dimension of the data manifold. arXiv preprint arXiv:2004.10802.
- C. Tosh, A. Krishnamurthy, and D. Hsu (2021). Contrastive learning, multi-view redundancy, and linear models. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, pp. 1179–1206.
- T. Wang and P. Isola (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning, pp. 9929–9939.
- D. P. Woodruff (2014). Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science 10(1–2), pp. 1–157.
- X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer (2022a). Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12104–12113.
- X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer (2022b). LiT: zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18123–18133.
- R. S. Zimmermann, Y. Sharma, S. Schneider, M. Bethge, and W. Brendel (2021). Contrastive learning inverts the data generating process. In Proceedings of the 38th International Conference on Machine Learning, pp. 12979–12990.
Supplementary Material
## Catalogue of the Supplementary Material
The supplementary material is organized as follows\. The first part collects notation and proves the main decomposition and scaling theorem from the main paper\. The next parts prove the approximation, GD\-bias, and GD\-variance bounds separately\. The final parts collect auxiliary probabilistic and spectral lemmas and provide additional experimental details\.
## Appendix A Preliminary
### Appendix notation and minimizer identities
We collect here the notation used throughout the appendix\. The full covariance and cross\-covariance are
$$
H := \mathbb{E}[xx^{\top}] = \mathbb{E}[yy^{\top}], \qquad C := \mathbb{E}[xy^{\top}].
$$
When these operators are diagonal, we write
$$
H = \operatorname{diag}(h_{1}, h_{2}, \ldots), \qquad C = \operatorname{diag}(\tau_{1}, \tau_{2}, \ldots), \qquad h_{i} = \lambda_{z,i} + \lambda_{\epsilon,i}, \quad \tau_{i} = \lambda_{z,i}.
$$
The full quadratic contrastive risk is
$$
R(W) := -\langle W, C\rangle + \frac{1}{2}\operatorname{tr}(W^{\top}HWH),
$$
with the minimizer
$$
W^{\star} := H^{-1}CH^{-1}.
$$
For the fixed sketch matrix $S$, write
$$
\widetilde{x} = Sx, \qquad \widetilde{y} = Sy, \qquad \Sigma := SHS^{\top}, \qquad C_{M} := SCS^{\top} = \mathbb{E}[\widetilde{x}\widetilde{y}^{\top}].
$$
The sketched risk is
$$
R_{M}(A) := -\langle A, C_{M}\rangle + \frac{1}{2}\operatorname{tr}(A^{\top}\Sigma A\Sigma),
$$
with the minimizer
$$
A^{\star} := \Sigma^{-1}C_{M}\Sigma^{-1}.
$$
The minimizer identities are justified below. For matrices of compatible dimensions, define
$$
\langle B_{1}, B_{2}\rangle_{\Sigma,\Sigma} := \operatorname{tr}(B_{1}^{\top}\Sigma B_{2}\Sigma), \qquad \|B\|_{\Sigma,\Sigma}^{2} := \langle B, B\rangle_{\Sigma,\Sigma}.
$$
For empirical GD, define the empirical marginal sketched covariances
$$
\widehat{\Sigma}_{x} := \frac{1}{N}\sum_{n=1}^{N}\widetilde{x}_{n}\widetilde{x}_{n}^{\top}, \qquad \widehat{\Sigma}_{y} := \frac{1}{N}\sum_{n=1}^{N}\widetilde{y}_{n}\widetilde{y}_{n}^{\top},
$$
and the empirical sketched cross-covariance
$$
\widehat{C} := \frac{1}{N}\sum_{n=1}^{N}\widetilde{x}_{n}\widetilde{y}_{n}^{\top}.
$$
We use the empirical Hessian, empirical residual, and GD filters
$$
\widehat{\mathscr{H}}(B) := \widehat{\Sigma}_{x}B\widehat{\Sigma}_{y}, \qquad \widehat{E} := \widehat{C} - \widehat{\Sigma}_{x}A^{\star}\widehat{\Sigma}_{y},
$$
$$
\widehat{\mathscr{B}}_{r:s} := \prod_{t=r}^{s}(I - \gamma_{t}\widehat{\mathscr{H}}), \qquad \widehat{\mathscr{B}}_{L+1:L} := I,
$$
and
$$
\widehat{\mathscr{B}}_{L} := \widehat{\mathscr{B}}_{1:L}, \qquad \widehat{\mathscr{V}}_{L} := \sum_{t=1}^{L}\gamma_{t}\widehat{\mathscr{B}}_{t+1:L}.
$$
We also use the optimization shorthands
$$
R := L_{\mathrm{eff}}\gamma, \qquad \rho := R^{-1/2}.
$$
For a sufficiently small absolute constant $c_{\mathrm{rel}} > 0$, define the covariance event
$$
\mathcal{E}_{\mathrm{cov}}(R) := \left\{\max_{\sharp \in \{x,y\}} \left\|\Sigma^{-1/2}\widehat{\Sigma}_{\sharp}\Sigma^{-1/2} - I\right\| \leq c_{\mathrm{rel}}R^{-1/2}\right\}.
$$
Under the aligned power-law assumptions, write $\delta := a - b$. By Lemma [11](https://arxiv.org/html/2606.26617#Thmlemma11), on the high-probability sketch event, for the spectral decomposition $\Sigma = \sum_{i=1}^{M}\mu_{i}v_{i}v_{i}^{\top}$,
$$
\mu_{i} \asymp i^{-b}, \qquad C_{M} = \sum_{i=1}^{M}\kappa_{i}\mu_{i}v_{i}v_{i}^{\top}, \qquad \kappa_{i} \lesssim i^{-\delta}.
$$
For approximation arguments we also use the whitened signal and sketch projection
$$
K := H^{-1/2}CH^{-1/2}, \qquad P := H^{1/2}S^{\top}(SHS^{\top})^{-1}SH^{1/2}.
$$
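The filters $\widehat{\mathscr{B}}_{L}$ and $\widehat{\mathscr{V}}_{L}$ arise from unrolling the GD recursion: $A_{L} - A^{\star} = -\widehat{\mathscr{B}}_{L}(A^{\star}) + \widehat{\mathscr{V}}_{L}(\widehat{E})$ when $A_{0} = 0$. The sketch below checks this identity numerically on a toy model. It assumes a constant step size, an update of the form $A_{t} = A_{t-1} - \gamma(\widehat{\Sigma}_{x}A_{t-1}\widehat{\Sigma}_{y} - \widehat{C})$, and hypothetical data drawn directly in the sketched space with $\Sigma = 2I$ and $C_{M} = I$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, L, gamma = 4, 200, 30, 0.05  # toy sizes and step size (assumptions)

# Hypothetical sketched views: shared latent z plus independent unit noise.
z = rng.standard_normal((N, M))
x = z + rng.standard_normal((N, M))
y = z + rng.standard_normal((N, M))

Sx = x.T @ x / N      # empirical marginal covariance of view x
Sy = y.T @ y / N      # empirical marginal covariance of view y
C_hat = x.T @ y / N   # empirical cross-covariance

# Population quantities of this toy model: Sigma = 2 I, C_M = I.
Sigma_inv = np.eye(M) / 2.0
A_star = Sigma_inv @ np.eye(M) @ Sigma_inv


def hess(B):
    """Empirical Hessian operator B -> Sx B Sy."""
    return Sx @ B @ Sy


E_hat = C_hat - hess(A_star)  # empirical residual

# Full-batch GD from A_0 = 0.
A = np.zeros((M, M))
for _ in range(L):
    A = A - gamma * (hess(A) - C_hat)

# Bias filter: (I - gamma * H_hat)^L applied to A_star.
bias = A_star.copy()
for _ in range(L):
    bias = bias - gamma * hess(bias)

# Variance filter: gamma * sum_{k=0}^{L-1} (I - gamma * H_hat)^k applied to E_hat.
var = np.zeros((M, M))
term = E_hat.copy()
for _ in range(L):
    var += gamma * term
    term = term - gamma * hess(term)

closed_form = A_star - bias + var
```

The identity is purely algebraic, so it holds for any reference matrix in place of `A_star`; the choice of the population minimizer is what makes the first piece a bias and the second a noise-driven variance.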
We now justify the minimizer identities used above. For the population risk,
$$
R(W) = -\langle W, C\rangle + \frac{1}{2}\operatorname{tr}(W^{\top}HWH).
$$
For any perturbation $\Delta \in \mathbb{R}^{D \times D}$, the first variation is
$$
\begin{aligned}
\mathrm{d}R(W)[\Delta] &= -\langle\Delta, C\rangle + \frac{1}{2}\operatorname{tr}(\Delta^{\top}HWH) + \frac{1}{2}\operatorname{tr}(W^{\top}H\Delta H)\\
&= -\langle\Delta, C\rangle + \frac{1}{2}\langle\Delta, HWH\rangle + \frac{1}{2}\langle\Delta, HWH\rangle\\
&= \langle\Delta, HWH - C\rangle.
\end{aligned}
$$
Hence
$$
\nabla R(W) = HWH - C.
$$
The population minimizer therefore satisfies the normal equation
$$
HW^{\star}H = C.
$$
Assuming $H$ is invertible, this gives
$$
W^{\star} = H^{-1}CH^{-1}.
$$
Moreover, since the full-population Hessian operator
$$
\mathscr{H}_{\mathrm{full}}(W) = HWH
$$
is positive semidefinite with respect to the Frobenius inner product, this stationary point is the global minimizer.
The same argument applies to the sketched minimizer. From its definition,
RM\(A\)\\displaystyle R\_\{M\}\(A\)=−⟨A,CM⟩\+12tr\(A⊤ΣAΣ\)\.\\displaystyle=\-\\langle A,C\_\{M\}\\rangle\+\\frac\{1\}\{2\}\\operatorname\{tr\}\(A^\{\\top\}\\Sigma A\\Sigma\)\.For any perturbationΔ∈ℝM×M\\Delta\\in\\mathbb\{R\}^\{M\\times M\}, we have
dRM\(A\)\[Δ\]\\displaystyle\\mathrm\{d\}R\_\{M\}\(A\)\[\\Delta\]=−⟨Δ,CM⟩\+12tr\(Δ⊤ΣAΣ\)\+12tr\(A⊤ΣΔΣ\)\\displaystyle=\-\\langle\\Delta,C\_\{M\}\\rangle\+\\frac\{1\}\{2\}\\operatorname\{tr\}\(\\Delta^\{\\top\}\\Sigma A\\Sigma\)\+\\frac\{1\}\{2\}\\operatorname\{tr\}\(A^\{\\top\}\\Sigma\\Delta\\Sigma\)=−⟨Δ,CM⟩\+12⟨Δ,ΣAΣ⟩\+12⟨Δ,ΣAΣ⟩\\displaystyle=\-\\langle\\Delta,C\_\{M\}\\rangle\+\\frac\{1\}\{2\}\\langle\\Delta,\\Sigma A\\Sigma\\rangle\+\\frac\{1\}\{2\}\\langle\\Delta,\\Sigma A\\Sigma\\rangle=⟨Δ,ΣAΣ−CM⟩\.\\displaystyle=\\langle\\Delta,\\Sigma A\\Sigma\-C\_\{M\}\\rangle\.Therefore,
∇RM\(A\)=ΣAΣ−CM\.\\nabla R\_\{M\}\(A\)=\\Sigma A\\Sigma\-C\_\{M\}\.The sketched population minimizer satisfies
ΣA⋆Σ=CM\.\\Sigma A^\{\\star\}\\Sigma=C\_\{M\}\.AssumingΣ\\Sigmais invertible, we obtain
A⋆=Σ−1CMΣ−1\.A^\{\\star\}=\\Sigma^\{\-1\}C\_\{M\}\\Sigma^\{\-1\}\.Since
Σ=SHS⊤,CM=SCS⊤,\\Sigma=SHS^\{\\top\},\\qquad C\_\{M\}=SCS^\{\\top\},this can be written as
A⋆=\(SHS⊤\)−1SCS⊤\(SHS⊤\)−1\.A^\{\\star\}=\(SHS^\{\\top\}\)^\{\-1\}SCS^\{\\top\}\(SHS^\{\\top\}\)^\{\-1\}\.Again, the sketched Hessian operator
ℋ\(A\)=ΣAΣ\\mathscr\{H\}\(A\)=\\Sigma A\\Sigmais positive semidefinite with respect to the Frobenius inner product, so this stationary point is the global minimizer\.
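The two minimizer identities above can be checked numerically. The following is a minimal sketch, assuming a small random instance: the dimensions `D` and `M`, the random matrices `H`, `C`, `S`, and the helper `risk` are illustrative choices, not objects taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 6, 3

# Hypothetical test instance: a random SPD marginal covariance H,
# a random symmetric cross-covariance C, and a Gaussian sketch S.
G = rng.standard_normal((D, D))
H = G @ G.T + D * np.eye(D)
B = rng.standard_normal((D, D))
C = 0.5 * (B + B.T)
S = rng.standard_normal((M, D)) / np.sqrt(M)

def risk(W, Cmat, Hmat):
    """Quadratic surrogate: -<W, C> + 0.5 * tr(W^T H W H)."""
    return -np.sum(W * Cmat) + 0.5 * np.trace(W.T @ Hmat @ W @ Hmat)

# Full population minimizer W* = H^{-1} C H^{-1} and its gradient H W* H - C.
H_inv = np.linalg.inv(H)
W_star = H_inv @ C @ H_inv
grad_full = H @ W_star @ H - C

# Sketched minimizer A* = Sigma^{-1} C_M Sigma^{-1} and its gradient.
Sigma = S @ H @ S.T
C_M = S @ C @ S.T
Sigma_inv = np.linalg.inv(Sigma)
A_star = Sigma_inv @ C_M @ Sigma_inv
grad_sketch = Sigma @ A_star @ Sigma - C_M

# Moving away from either stationary point should increase the risk.
Delta = rng.standard_normal((D, D))
gap_full = risk(W_star + 0.1 * Delta, C, H) - risk(W_star, C, H)
Delta_M = rng.standard_normal((M, M))
gap_sketch = risk(A_star + 0.1 * Delta_M, C_M, Sigma) - risk(A_star, C_M, Sigma)
```

Both gradients vanish up to floating-point error, and both gaps are positive because the quadratic term $\frac{1}{2}\operatorname{tr}(\Delta^\top H \Delta H)$ is strictly positive for invertible $H$ and nonzero $\Delta$.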
### Irreducible risk
We first identify the risk level achieved by the full population minimizer\. Since
$$W^{\star} = H^{-1} C H^{-1},$$
the population normal equation gives
$$H W^{\star} H = C.$$
Therefore,
R\(W⋆\)\\displaystyle R\(W^\{\\star\}\)=−⟨W⋆,C⟩\+12tr\(\(W⋆\)⊤HW⋆H\)\\displaystyle=\-\\langle W^\{\\star\},C\\rangle\+\\frac\{1\}\{2\}\\operatorname\{tr\}\\bigl\(\(W^\{\\star\}\)^\{\\top\}HW^\{\\star\}H\\bigr\)=−⟨W⋆,C⟩\+12⟨W⋆,HW⋆H⟩\\displaystyle=\-\\langle W^\{\\star\},C\\rangle\+\\frac\{1\}\{2\}\\langle W^\{\\star\},HW^\{\\star\}H\\rangle=−⟨W⋆,C⟩\+12⟨W⋆,C⟩\\displaystyle=\-\\langle W^\{\\star\},C\\rangle\+\\frac\{1\}\{2\}\\langle W^\{\\star\},C\\rangle=−12⟨W⋆,C⟩\.\\displaystyle=\-\\frac\{1\}\{2\}\\langle W^\{\\star\},C\\rangle\.SubstitutingW⋆=H−1CH−1W^\{\\star\}=H^\{\-1\}CH^\{\-1\}, we obtain
R\(W⋆\)\\displaystyle R\(W^\{\\star\}\)=−12tr\[\(H−1CH−1\)⊤C\]\\displaystyle=\-\\frac\{1\}\{2\}\\operatorname\{tr\}\\left\[\(H^\{\-1\}CH^\{\-1\}\)^\{\\top\}C\\right\]=−12tr\(H−1C⊤H−1C\)\\displaystyle=\-\\frac\{1\}\{2\}\\operatorname\{tr\}\\left\(H^\{\-1\}C^\{\\top\}H^\{\-1\}C\\right\)=−12‖H−1/2CH−1/2‖F2\.\\displaystyle=\-\\frac\{1\}\{2\}\\left\\\|H^\{\-1/2\}CH^\{\-1/2\}\\right\\\|\_\{F\}^\{2\}\.In general, we call this quantity the irreducible risk and denote
Rirr:=R\(W⋆\)=−12‖H−1/2CH−1/2‖F2\.R\_\{\\mathrm\{irr\}\}:=R\(W^\{\\star\}\)=\-\\frac\{1\}\{2\}\\left\\\|H^\{\-1/2\}CH^\{\-1/2\}\\right\\\|\_\{F\}^\{2\}\.Note that all additional error terms, including approximation error from sketching and excess error from finite\-data optimization, are measured relative to this irreducible baseline\.
We next evaluate the irreducible risk explicitly. By Assumption [1](https://arxiv.org/html/2606.26617#Thmassumption1), the full marginal covariance and cross-covariance are simultaneously diagonal:
H=Λz\+Λϵ,C=Λz\.H=\\Lambda\_\{z\}\+\\Lambda\_\{\\epsilon\},\\qquad C=\\Lambda\_\{z\}\.Writing
Λz=diag\(λz,1,λz,2,…\),Λϵ=diag\(λϵ,1,λϵ,2,…\),\\Lambda\_\{z\}=\\operatorname\{diag\}\(\\lambda\_\{z,1\},\\lambda\_\{z,2\},\\ldots\),\\qquad\\Lambda\_\{\\epsilon\}=\\operatorname\{diag\}\(\\lambda\_\{\\epsilon,1\},\\lambda\_\{\\epsilon,2\},\\ldots\),we have
H=diag\(h1,h2,…\),hi:=λz,i\+λϵ,i\.H=\\operatorname\{diag\}\(h\_\{1\},h\_\{2\},\\ldots\),\\qquad h\_\{i\}:=\\lambda\_\{z,i\}\+\\lambda\_\{\\epsilon,i\}\.Therefore,
H−1/2CH−1/2=diag\(λz,1λz,1\+λϵ,1,λz,2λz,2\+λϵ,2,…\)\.H^\{\-1/2\}CH^\{\-1/2\}=\\operatorname\{diag\}\\left\(\\frac\{\\lambda\_\{z,1\}\}\{\\lambda\_\{z,1\}\+\\lambda\_\{\\epsilon,1\}\},\\frac\{\\lambda\_\{z,2\}\}\{\\lambda\_\{z,2\}\+\\lambda\_\{\\epsilon,2\}\},\\ldots\\right\)\.Hence the irreducible risk simplifies to
R\(W⋆\)\\displaystyle R\(W^\{\\star\}\)=−12‖H−1/2CH−1/2‖F2\\displaystyle=\-\\frac\{1\}\{2\}\\left\\\|H^\{\-1/2\}CH^\{\-1/2\}\\right\\\|\_\{F\}^\{2\}=−12∑i≥1\(λz,iλz,i\+λϵ,i\)2\.\\displaystyle=\-\\frac\{1\}\{2\}\\sum\_\{i\\geq 1\}\\left\(\\frac\{\\lambda\_\{z,i\}\}\{\\lambda\_\{z,i\}\+\\lambda\_\{\\epsilon,i\}\}\\right\)^\{2\}\.Equivalently, defining the population signal\-to\-marginal ratio
κi:=λz,iλz,i\+λϵ,i,\\kappa\_\{i\}:=\\frac\{\\lambda\_\{z,i\}\}\{\\lambda\_\{z,i\}\+\\lambda\_\{\\epsilon,i\}\},we obtain
R\(W⋆\)=−12∑i≥1κi2\.R\(W^\{\\star\}\)=\-\\frac\{1\}\{2\}\\sum\_\{i\\geq 1\}\\kappa\_\{i\}^\{2\}\.Under Assumption[1](https://arxiv.org/html/2606.26617#Thmassumption1), if
λz,i≍i−a,λϵ,i≍i−b,a\>b,\\lambda\_\{z,i\}\\asymp i^\{\-a\},\\qquad\\lambda\_\{\\epsilon,i\}\\asymp i^\{\-b\},\\qquad a\>b,then
κi=λz,iλz,i\+λϵ,i≍i−\(a−b\)\.\\kappa\_\{i\}=\\frac\{\\lambda\_\{z,i\}\}\{\\lambda\_\{z,i\}\+\\lambda\_\{\\epsilon,i\}\}\\asymp i^\{\-\(a\-b\)\}\.Consequently,
R\(W⋆\)=−12∑i≥1κi2≍−∑i≥1i−2\(a−b\)\.R\(W^\{\\star\}\)=\-\\frac\{1\}\{2\}\\sum\_\{i\\geq 1\}\\kappa\_\{i\}^\{2\}\\asymp\-\\sum\_\{i\\geq 1\}i^\{\-2\(a\-b\)\}\.In particular, the irreducible risk is finite whenever
2\(a−b\)\>1,equivalentlya−b\>12\.2\(a\-b\)\>1,\\qquad\\text\{equivalently\}\\qquad a\-b\>\\frac\{1\}\{2\}\.
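The summability threshold $a-b>\frac{1}{2}$ can be illustrated with truncated sums. This is a minimal sketch that assumes exact power laws $\lambda_{z,i}=i^{-a}$ and $\lambda_{\epsilon,i}=i^{-b}$ with unit constants; the function names and truncation levels are hypothetical.

```python
def kappa(i, a, b):
    """Signal-to-marginal ratio for lambda_z = i^-a and lambda_eps = i^-b."""
    lam_z = i ** (-a)
    lam_eps = i ** (-b)
    return lam_z / (lam_z + lam_eps)

def irreducible_risk(a, b, dim):
    """Truncated value of -0.5 * sum_{i <= dim} kappa_i^2."""
    return -0.5 * sum(kappa(i, a, b) ** 2 for i in range(1, dim + 1))

# a - b = 1.0 > 1/2: the partial sums stabilize as the truncation grows.
conv_small = irreducible_risk(2.0, 1.0, 10_000)
conv_large = irreducible_risk(2.0, 1.0, 100_000)

# a - b = 0.25 < 1/2: the partial sums keep drifting toward minus infinity.
div_small = irreducible_risk(1.25, 1.0, 10_000)
div_large = irreducible_risk(1.25, 1.0, 100_000)
```

In the first case the two truncations agree closely, while in the second the value keeps decreasing with the truncation level, consistent with the finiteness condition stated above.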
### Proof of Proposition[1](https://arxiv.org/html/2606.26617#Thmproposition1)
We first split the risk into its full\-population baseline, its sketched approximation gap, and its sketched excess risk:
RM\(AL\)\\displaystyle R\_\{M\}\(A\_\{L\}\)=R\(W⋆\)\+\[RM\(A⋆\)−R\(W⋆\)\]\\displaystyle=R\(W^\{\\star\}\)\+\\left\[R\_\{M\}\(A^\{\\star\}\)\-R\(W^\{\\star\}\)\\right\]\+\[RM\(AL\)−RM\(A⋆\)\]\\displaystyle\+\\left\[R\_\{M\}\(A\_\{L\}\)\-R\_\{M\}\(A^\{\\star\}\)\\right\]=Rirr\+Approx\+\[RM\(AL\)−RM\(A⋆\)\]\.\\displaystyle=R\_\{\\mathrm\{irr\}\}\+\\operatorname\{Approx\}\+\\left\[R\_\{M\}\(A\_\{L\}\)\-R\_\{M\}\(A^\{\\star\}\)\\right\]\.SinceA⋆A^\{\\star\}minimizes the sketched population risk, we have
RM\(AL\)−RM\(A⋆\)=12‖AL−A⋆‖Σ,Σ2\.R\_\{M\}\(A\_\{L\}\)\-R\_\{M\}\(A^\{\\star\}\)=\\frac\{1\}\{2\}\\\|A\_\{L\}\-A^\{\\star\}\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\.By Lemma[1](https://arxiv.org/html/2606.26617#Thmlemma1),
RM\(AL\)−RM\(A⋆\)=BiasLsamp\+VarLsamp\+CrossLsamp,R\_\{M\}\(A\_\{L\}\)\-R\_\{M\}\(A^\{\\star\}\)=\\operatorname\{Bias\}^\{\\mathrm\{samp\}\}\_\{L\}\+\\operatorname\{Var\}^\{\\mathrm\{samp\}\}\_\{L\}\+\\operatorname\{Cross\}^\{\\mathrm\{samp\}\}\_\{L\},where
BiasLsamp:=12‖ℬ^L\(A⋆\)‖Σ,Σ2,\\operatorname\{Bias\}^\{\\mathrm\{samp\}\}\_\{L\}:=\\frac\{1\}\{2\}\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\},VarLsamp:=12‖𝒱^L\(E^\)‖Σ,Σ2,\\operatorname\{Var\}^\{\\mathrm\{samp\}\}\_\{L\}:=\\frac\{1\}\{2\}\\left\\\|\\widehat\{\\mathscr\{V\}\}\_\{L\}\(\\widehat\{E\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\},and
CrossLsamp:=−⟨ℬ^L\(A⋆\),𝒱^L\(E^\)⟩Σ,Σ\.\\operatorname\{Cross\}^\{\\mathrm\{samp\}\}\_\{L\}:=\-\\left\\langle\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\),\\widehat\{\\mathscr\{V\}\}\_\{L\}\(\\widehat\{E\}\)\\right\\rangle\_\{\\Sigma,\\Sigma\}\.Taking expectation over the empirical sample gives
𝔼𝒟\[RM\(AL\)\]=Rirr\+Approx\+BiasL\+VarL\+CrossL,\\mathbb\{E\}\_\{\\mathcal\{D\}\}\[R\_\{M\}\(A\_\{L\}\)\]=R\_\{\\mathrm\{irr\}\}\+\\operatorname\{Approx\}\+\\operatorname\{Bias\}\_\{L\}\+\\operatorname\{Var\}\_\{L\}\+\\operatorname\{Cross\}\_\{L\},with the definitions in Proposition[1](https://arxiv.org/html/2606.26617#Thmproposition1)\.
It remains to bound the cross term\. By Cauchy–Schwarz,
\|CrossL\|≤𝔼𝒟\[‖ℬ^L\(A⋆\)‖Σ,Σ‖𝒱^L\(E^\)‖Σ,Σ\]\\displaystyle\|\\operatorname\{Cross\}\_\{L\}\|\\leq\\mathbb\{E\}\_\{\\mathcal\{D\}\}\\left\[\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}\\left\\\|\\widehat\{\\mathscr\{V\}\}\_\{L\}\(\\widehat\{E\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}\\right\]≤\(𝔼𝒟\[‖ℬ^L\(A⋆\)‖Σ,Σ2\]\)1/2\(𝔼𝒟\[‖𝒱^L\(E^\)‖Σ,Σ2\]\)1/2\\displaystyle\\leq\\left\(\\mathbb\{E\}\_\{\\mathcal\{D\}\}\\left\[\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\\right\]\\right\)^\{1/2\}\\left\(\\mathbb\{E\}\_\{\\mathcal\{D\}\}\\left\[\\left\\\|\\widehat\{\\mathscr\{V\}\}\_\{L\}\(\\widehat\{E\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\\right\]\\right\)^\{1/2\}=2BiasLVarL\.\\displaystyle=2\\sqrt\{\\operatorname\{Bias\}\_\{L\}\\operatorname\{Var\}\_\{L\}\}\.Using2ab≤a\+b2\\sqrt\{ab\}\\leq a\+b, we obtain
\|CrossL\|≤BiasL\+VarL\.\|\\operatorname\{Cross\}\_\{L\}\|\\leq\\operatorname\{Bias\}\_\{L\}\+\\operatorname\{Var\}\_\{L\}\.Therefore,
𝔼𝒟\[RM\(AL\)\]≤Rirr\+Approx\+2BiasL\+2VarL,\\mathbb\{E\}\_\{\\mathcal\{D\}\}\[R\_\{M\}\(A\_\{L\}\)\]\\leq R\_\{\\mathrm\{irr\}\}\+\\operatorname\{Approx\}\+2\\operatorname\{Bias\}\_\{L\}\+2\\operatorname\{Var\}\_\{L\},which proves the proposition\.
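The algebra behind the cross-term bound can be seen on a toy example. The sketch below replaces the $\Sigma,\Sigma$-weighted norm with a plain Euclidean norm and uses two random vectors as stand-ins for the bias and variance components; these simplifications are assumptions made only for illustration.

```python
import random

random.seed(0)
dim = 5
b_vec = [random.gauss(0, 1) for _ in range(dim)]  # stand-in for the bias component
v_vec = [random.gauss(0, 1) for _ in range(dim)]  # stand-in for the variance component

def dot(x, y):
    """Euclidean inner product of two equal-length lists."""
    return sum(xi * yi for xi, yi in zip(x, y))

bias = 0.5 * dot(b_vec, b_vec)
var = 0.5 * dot(v_vec, v_vec)
cross = -dot(b_vec, v_vec)

# Excess error of the difference b - v, which expands into bias + var + cross.
diff = [bi - vi for bi, vi in zip(b_vec, v_vec)]
excess = 0.5 * dot(diff, diff)
```

Here `excess` equals `bias + var + cross`, and `abs(cross)` is at most `bias + var`, which is the pointwise version of the inequality $2\sqrt{ab}\leq a+b$ used in the proof.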
### Proof of Theorem[1](https://arxiv.org/html/2606.26617#Thmmytheorem1)
By Proposition[1](https://arxiv.org/html/2606.26617#Thmproposition1),
𝔼𝒟\[RM\(AL\)\]=Rirr\+Approx\+BiasL\+VarL\+CrossL\.\\mathbb\{E\}\_\{\\mathcal\{D\}\}\[R\_\{M\}\(A\_\{L\}\)\]=R\_\{\\mathrm\{irr\}\}\+\\operatorname\{Approx\}\+\\operatorname\{Bias\}\_\{L\}\+\\operatorname\{Var\}\_\{L\}\+\\operatorname\{Cross\}\_\{L\}\.We substitute the specific estimates for the four non\-baseline terms\. Since1≲R≲N/M1\\lesssim R\\lesssim N/M, Lemma[6](https://arxiv.org/html/2606.26617#Thmlemma6)implies that, conditional on the fixed sketch,ℰcov\(R\)\\mathcal\{E\}\_\{\\mathrm\{cov\}\}\(R\)holds with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\), and also impliesN≳MN\\gtrsim M, as required by the variance upper bound\. The conditionR≲M2bR\\lesssim M^\{2b\}is equivalent to the bias\-unsaturated regimeR1/\(2b\)≲MR^\{1/\(2b\)\}\\lesssim M\. We work on the intersection of this covariance event, the high\-probability sketch event in the approximation bound, and the sketched\-spectrum event of Lemma[11](https://arxiv.org/html/2606.26617#Thmlemma11)\.
First, since12<δ<b\+12\\frac\{1\}\{2\}<\\delta<b\+\\frac\{1\}\{2\}, Theorem[2](https://arxiv.org/html/2606.26617#Thmmytheorem2)gives
Approx=Θ\(M−min\{2δ−1,b\}\)\.\\operatorname\{Approx\}=\\Theta\\left\(M^\{\-\\min\\\{2\\delta\-1,b\\\}\}\\right\)\.Second, under the bias\-unsaturated conditionR1/\(2b\)≲MR^\{1/\(2b\)\}\\lesssim M, Theorem[3](https://arxiv.org/html/2606.26617#Thmmytheorem3)gives
BiasL=Θ\(R1−2δ2b\),R=Leffγ\.\\operatorname\{Bias\}\_\{L\}=\\Theta\\left\(R^\{\\frac\{1\-2\\delta\}\{2b\}\}\\right\),\\qquad R=L\_\{\\mathrm\{eff\}\}\\gamma\.Define
D×\(R,M\):=\{R1/blog\(eR\),1≤R1/b≤M,R1/b\(1\+logM2R1/b\),M<R1/b<M2,M2,R1/b≥M2\.D\_\{\\times\}\(R,M\):=\\begin\{cases\}R^\{1/b\}\\log\(eR\),&1\\leq R^\{1/b\}\\leq M,\\\\ R^\{1/b\}\\left\(1\+\\log\\dfrac\{M^\{2\}\}\{R^\{1/b\}\}\\right\),&M<R^\{1/b\}<M^\{2\},\\\\ M^\{2\},&R^\{1/b\}\\geq M^\{2\}\.\\end\{cases\}Third, since the theorem assumesR≳1R\\gtrsim 1, Theorem[4](https://arxiv.org/html/2606.26617#Thmmytheorem4)gives, under the tail nondegeneracy condition for the variance lower bound,
VarL=Θ\(D×\(R,M\)N\)\.\\operatorname\{Var\}\_\{L\}=\\Theta\\left\(\\frac\{D\_\{\\times\}\(R,M\)\}\{N\}\\right\)\.Finally, by Cauchy–Schwarz and the preceding bias and variance estimates,
CrossL\\displaystyle\\operatorname\{Cross\}\_\{L\}=𝒪\(BiasLVarL\)\\displaystyle=\\mathcal\{O\}\\left\(\\sqrt\{\\operatorname\{Bias\}\_\{L\}\\operatorname\{Var\}\_\{L\}\}\\right\)=𝒪\(R1−2δ2b⋅D×\(R,M\)N\)\.\\displaystyle=\\mathcal\{O\}\\left\(\\sqrt\{R^\{\\frac\{1\-2\\delta\}\{2b\}\}\\cdot\\frac\{D\_\{\\times\}\(R,M\)\}\{N\}\}\\right\)\.Combining these displays proves
𝔼𝒟\[RM\(AL\)\]\\displaystyle\\mathbb\{E\}\_\{\\mathcal\{D\}\}\\left\[R\_\{M\}\(A\_\{L\}\)\\right\]=Rirr\+Θ\(M−min\{2δ−1,b\}\)\+Θ\(R1−2δ2b\)\\displaystyle=R\_\{\\mathrm\{irr\}\}\+\\Theta\\left\(M^\{\-\\min\\\{2\\delta\-1,b\\\}\}\\right\)\+\\Theta\\left\(R^\{\\frac\{1\-2\\delta\}\{2b\}\}\\right\)\+Θ\(D×\(R,M\)N\)\\displaystyle\\quad\+\\Theta\\left\(\\frac\{D\_\{\\times\}\(R,M\)\}\{N\}\\right\)\+𝒪\(R1−2δ2b⋅D×\(R,M\)N\),\\displaystyle\\quad\+\\mathcal\{O\}\\left\(\\sqrt\{R^\{\\frac\{1\-2\\delta\}\{2b\}\}\\cdot\\frac\{D\_\{\\times\}\(R,M\)\}\{N\}\}\\right\),which is the claimed scaling law\.
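To get a feel for how the three leading terms trade off, the piecewise quantity $D_{\times}(R,M)$ and the resulting rate can be coded directly. This is a rough sketch with all hidden constants set to 1 and the cross term dropped; the function names `d_cross` and `excess_risk_proxy` are hypothetical.

```python
import math

def d_cross(R, M, b):
    """Piecewise effective dimension D_x(R, M) as defined in the proof."""
    r = R ** (1.0 / b)
    if r < 1:
        raise ValueError("the definition assumes R^(1/b) >= 1")
    if r <= M:
        return r * math.log(math.e * R)
    if r < M ** 2:
        return r * (1.0 + math.log(M ** 2 / r))
    return float(M ** 2)

def excess_risk_proxy(M, N, R, b, delta):
    """Approximation + bias + variance rates, constants set to 1."""
    approx = M ** (-min(2 * delta - 1, b))
    bias = R ** ((1 - 2 * delta) / (2 * b))
    var = d_cross(R, M, b) / N
    return approx + bias + var
```

For example, `excess_risk_proxy(10, 1000, 4.0, 2.0, 1.0)` and `excess_risk_proxy(10, 10000, 4.0, 2.0, 1.0)` differ only through the variance term, so increasing `N` lowers the proxy while leaving the approximation and bias contributions unchanged.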
### Proof of Proposition[2](https://arxiv.org/html/2606.26617#Thmproposition2)
This result is a compute\-allocation heuristic rather than a minimax optimality theorem\. Throughout the proof,≈\\approx,≍\\asymp, and the displayed power laws suppress logarithmic factors fromLeff=⌊L/logL⌋L\_\{\\mathrm\{eff\}\}=\\lfloor L/\\log L\\rfloor, from the boundary logarithms inD×\(R,M\)D\_\{\\times\}\(R,M\), and from implementation\-dependent constants in the compute proxy\. Sinceγ≍1\\gamma\\asymp 1, we identify
𝒞≍LNM2≈RNM2\.\\mathcal\{C\}\\asymp LNM^\{2\}\\approx RNM^\{2\}\.Let
q:=2δ−1,α:=min\{q,b\}\.q:=2\\delta\-1,\\qquad\\alpha:=\\min\\\{q,b\\\}\.Theorem[1](https://arxiv.org/html/2606.26617#Thmmytheorem1)gives the leading excess\-risk proxy
ℰ\(M,N,R\)≍M−α\+R−q/\(2b\)\+D×\(R,M\)N\.\\mathcal\{E\}\(M,N,R\)\\asymp M^\{\-\\alpha\}\+R^\{\-q/\(2b\)\}\+\\frac\{D\_\{\\times\}\(R,M\)\}\{N\}\.The new covariance\-concentration condition in Theorem[1](https://arxiv.org/html/2606.26617#Thmmytheorem1)requires
R≲NM,or equivalentlyN≳RM\.R\\lesssim\\frac\{N\}\{M\},\\qquad\\text\{or equivalently\}\\qquad N\\gtrsim RM\.An efficient allocation should not takeNNmuch larger than this lower bound unless the variance term is still leading: largerNNconsumes compute but only reduces the variance\. In the allocations below, taking
$$N \asymp RM$$
already makes the variance term lower order than the balanced approximation and bias terms. Therefore the compute proxy reduces to
𝒞≈R\(RM\)M2=R2M3\.\\mathcal\{C\}\\approx R\(RM\)M^\{2\}=R^\{2\}M^\{3\}\.The remaining horizon constraint is the bias\-unsaturation condition
R≲M2b,equivalentlyM≳R1/\(2b\)\.R\\lesssim M^\{2b\},\\qquad\\text\{equivalently\}\\qquad M\\gtrsim R^\{1/\(2b\)\}\.
We first record the product\-unsaturated boundary allocation\. If one insists on the product\-unsaturated condition
R1/b≲M,R^\{1/b\}\\lesssim M,the smallest admissible model size is
$$M \asymp R^{1/b}.$$
Together with $N \asymp RM$, this gives
N≍R1\+1/b,𝒞≈R2R3/b=R\(2b\+3\)/b\.N\\asymp R^\{1\+1/b\},\\qquad\\mathcal\{C\}\\approx R^\{2\}R^\{3/b\}=R^\{\(2b\+3\)/b\}\.Hence
R≍𝒞b2b\+3,M≍𝒞12b\+3,N≍𝒞b\+12b\+3\.R\\asymp\\mathcal\{C\}^\{\\frac\{b\}\{2b\+3\}\},\\qquad M\\asymp\\mathcal\{C\}^\{\\frac\{1\}\{2b\+3\}\},\\qquad N\\asymp\\mathcal\{C\}^\{\\frac\{b\+1\}\{2b\+3\}\}\.On this boundary,D×\(R,M\)≍R1/bD\_\{\\times\}\(R,M\)\\asymp R^\{1/b\}, so
D×\(R,M\)N≍R−1,\\frac\{D\_\{\\times\}\(R,M\)\}\{N\}\\asymp R^\{\-1\},which is lower order than the bias termR−q/\(2b\)R^\{\-q/\(2b\)\}for0<q<2b0<q<2b\.
We next consider the intermediate product regime
R1/\(2b\)≲M≲R1/b,R^\{1/\(2b\)\}\\lesssim M\\lesssim R^\{1/b\},whereD×\(R,M\)≍R1/bD\_\{\\times\}\(R,M\)\\asymp R^\{1/b\}up to logarithmic factors\. In this regime, withN≍RMN\\asymp RM, the variance term is
D×\(R,M\)N≍R1/bRM=R1/b−1M−1\.\\frac\{D\_\{\\times\}\(R,M\)\}\{N\}\\asymp\\frac\{R^\{1/b\}\}\{RM\}=R^\{1/b\-1\}M^\{\-1\}\.
In the rough\-source range0<q<b0<q<b, we haveα=q\\alpha=q\. Balancing approximation and bias gives
M−q≍R−q/\(2b\),M≍R1/\(2b\)\.M^\{\-q\}\\asymp R^\{\-q/\(2b\)\},\\qquad M\\asymp R^\{1/\(2b\)\}\.This is the lower boundary of the admissible window\. Hence
N\\displaystyle N≍RM≍R1\+1/\(2b\),\\displaystyle\\asymp RM\\asymp R^\{1\+1/\(2b\)\},𝒞\\displaystyle\\mathcal\{C\}≈R2R3/\(2b\)=R\(4b\+3\)/\(2b\)\.\\displaystyle\\approx R^\{2\}R^\{3/\(2b\)\}=R^\{\(4b\+3\)/\(2b\)\}\.Solving forR,M,NR,M,Ngives
R≍𝒞2b4b\+3,M≍𝒞14b\+3,N≍𝒞2b\+14b\+3\.R\\asymp\\mathcal\{C\}^\{\\frac\{2b\}\{4b\+3\}\},\\qquad M\\asymp\\mathcal\{C\}^\{\\frac\{1\}\{4b\+3\}\},\\qquad N\\asymp\\mathcal\{C\}^\{\\frac\{2b\+1\}\{4b\+3\}\}\.The variance term is
R1/b−1M−1≍R−1\+1/\(2b\),R^\{1/b\-1\}M^\{\-1\}\\asymp R^\{\-1\+1/\(2b\)\},which is lower order thanR−q/\(2b\)R^\{\-q/\(2b\)\}becauseq<bq<bandb\>1b\>1\.
In the smooth\-source rangeb<q<2bb<q<2b, we haveα=b\\alpha=b\. Balancing approximation and bias gives
M−b≍R−q/\(2b\),M≍Rq/\(2b2\)\.M^\{\-b\}\\asymp R^\{\-q/\(2b\)\},\\qquad M\\asymp R^\{q/\(2b^\{2\}\)\}\.Sinceb<q<2bb<q<2b, this choice lies strictly betweenR1/\(2b\)R^\{1/\(2b\)\}andR1/bR^\{1/b\}, so it is inside the intermediate product window\. Then
N\\displaystyle N≍RM≍R1\+q/\(2b2\),\\displaystyle\\asymp RM\\asymp R^\{1\+q/\(2b^\{2\}\)\},𝒞\\displaystyle\\mathcal\{C\}≈R2R3q/\(2b2\)=R\(4b2\+3q\)/\(2b2\)\.\\displaystyle\\approx R^\{2\}R^\{3q/\(2b^\{2\}\)\}=R^\{\(4b^\{2\}\+3q\)/\(2b^\{2\}\)\}\.Consequently,
R≍𝒞2b24b2\+3q,M≍𝒞q4b2\+3q,N≍𝒞2b2\+q4b2\+3q\.R\\asymp\\mathcal\{C\}^\{\\frac\{2b^\{2\}\}\{4b^\{2\}\+3q\}\},\\qquad M\\asymp\\mathcal\{C\}^\{\\frac\{q\}\{4b^\{2\}\+3q\}\},\\qquad N\\asymp\\mathcal\{C\}^\{\\frac\{2b^\{2\}\+q\}\{4b^\{2\}\+3q\}\}\.The variance term is
R1/b−1M−1≍R−1\+1/b−q/\(2b2\)\.R^\{1/b\-1\}M^\{\-1\}\\asymp R^\{\-1\+1/b\-q/\(2b^\{2\}\)\}\.This is lower order than the balanced bias termR−q/\(2b\)R^\{\-q/\(2b\)\}, because
1−1b\+q2b2−q2b=\(b−1\)\(2b−q\)2b2\>0\.1\-\\frac\{1\}\{b\}\+\\frac\{q\}\{2b^\{2\}\}\-\\frac\{q\}\{2b\}=\\frac\{\(b\-1\)\(2b\-q\)\}\{2b^\{2\}\}\>0\.Thus the variance is non\-leading in both source regimes once the covariance constraint is saturated\. Finally,R=LeffγR=L\_\{\\mathrm\{eff\}\}\\gammaandLeff=⌊L/logL⌋L\_\{\\mathrm\{eff\}\}=\\lfloor L/\\log L\\rfloor, so suppressing floors and logarithmic factors givesL/logL≈R/γL/\\log L\\approx R/\\gamma\. This completes the claimed allocation rules\.
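The exponent bookkeeping in the two source regimes can be reproduced with exact rational arithmetic. This is a small sketch assuming $N\asymp RM$ and $\mathcal{C}\asymp R^{2}M^{3}$ with logarithms suppressed; the function names are hypothetical.

```python
from fractions import Fraction

def allocation_exponents(b, q):
    """Exponents (r, m, n) with R ~ C^r, M ~ C^m, N ~ C^n.

    Assumes N ~ R*M and C ~ R^2 * M^3, with M ~ R^s where
    s = 1/(2b) in the rough-source range 0 < q < b and
    s = q/(2 b^2) in the smooth-source range b < q < 2b.
    """
    b, q = Fraction(b), Fraction(q)
    if 0 < q < b:
        s = 1 / (2 * b)
    elif b < q < 2 * b:
        s = q / (2 * b * b)
    else:
        raise ValueError("q must lie in (0, b) or (b, 2b)")
    r = 1 / (2 + 3 * s)  # from C ~ R^(2 + 3s)
    return r, s * r, (1 + s) * r

def variance_is_lower_order(b, q):
    """Check that R^(1/b - 1) * M^(-1) decays faster than the bias R^(-q/(2b))."""
    b, q = Fraction(b), Fraction(q)
    r, m, _ = allocation_exponents(b, q)
    s = m / r
    return (1 - 1 / b + s) > q / (2 * b)
```

With $b=2$, the rough case $q=1$ returns $(4/11,\,1/11,\,5/11)$ and the smooth case $q=3$ returns $(8/25,\,3/25,\,11/25)$, matching the closed-form exponents in the two displays above.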
## Appendix B Approximation Error
### Approximation error definition
The approximation error is the gap between the best sketched model and the best full model:
Approx:=minA∈ℝM×MR\(S⊤AS\)−minW∈ℝD×DR\(W\)\.\\operatorname\{Approx\}:=\\min\_\{A\\in\\mathbb\{R\}^\{M\\times M\}\}R\(S^\{\\top\}AS\)\-\\min\_\{W\\in\\mathbb\{R\}^\{D\\times D\}\}R\(W\)\.Define the whitened signal and sketch projection by
K\\displaystyle K:=H−1/2CH−1/2,\\displaystyle:=H^\{\-1/2\}CH^\{\-1/2\},P\\displaystyle P:=H1/2S⊤\(SHS⊤\)−1SH1/2\.\\displaystyle:=H^\{1/2\}S^\{\\top\}\(SHS^\{\\top\}\)^\{\-1\}SH^\{1/2\}\.ThenPPis an orthogonal projection and the approximation error admits the exact representation
Approx=12‖K−PKP‖F2\.\\operatorname\{Approx\}=\\frac\{1\}\{2\}\\left\\\|K\-PKP\\right\\\|\_\{F\}^\{2\}\.
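The exact representation can be compared against the definition on a small instance. The sketch below assumes a finite-dimensional diagonal model with $h_i=i^{-1}$ and $\kappa_i=i^{-1}$ and a Gaussian sketch; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M = 8, 3

idx = np.arange(1, D + 1, dtype=float)
h = idx ** -1.0            # hypothetical marginal spectrum h_i
kap = idx ** -1.0          # hypothetical ratios kappa_i
H = np.diag(h)
C = np.diag(kap * h)       # chosen so that H^{-1/2} C H^{-1/2} = diag(kappa_i)
S = rng.standard_normal((M, D)) / np.sqrt(M)

def risk(W):
    """Population risk R(W) = -<W, C> + 0.5 * tr(W^T H W H)."""
    return -np.sum(W * C) + 0.5 * np.trace(W.T @ H @ W @ H)

# Definition: best sketched risk minus best full risk.
W_star = np.diag(kap / h)  # H^{-1} C H^{-1} in the diagonal model
Sigma = S @ H @ S.T
Sigma_inv = np.linalg.inv(Sigma)
A_star = Sigma_inv @ (S @ C @ S.T) @ Sigma_inv
approx_direct = risk(S.T @ A_star @ S) - risk(W_star)

# Representation: 0.5 * ||K - P K P||_F^2.
H_half = np.diag(np.sqrt(h))
K = np.diag(kap)
P = H_half @ S.T @ Sigma_inv @ S @ H_half
approx_proj = 0.5 * np.linalg.norm(K - P @ K @ P, "fro") ** 2
```

The two values `approx_direct` and `approx_proj` agree up to floating-point error, and `P @ P` reproduces `P`, as expected of an orthogonal projection.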
### A general upper bound for the approximation error
###### Proposition 3\(General upper bound for approximation error\)\.
Fix an integerk≤M/2k\\leq M/2, and assume thatΣ\>k\\Sigma\_\{\>k\}is invertible, where
Σ\>k:=Sk:∞Hk:∞Sk:∞⊤\.\\Sigma\_\{\>k\}:=S\_\{k:\\infty\}H\_\{k:\\infty\}S\_\{k:\\infty\}^\{\\top\}\.Define
βk:=∑i\>khiM\+hk\+1\+∑i\>khi2M\.\\beta\_\{k\}:=\\frac\{\\sum\_\{i\>k\}h\_\{i\}\}\{M\}\+h\_\{k\+1\}\+\\sqrt\{\\frac\{\\sum\_\{i\>k\}h\_\{i\}^\{2\}\}\{M\}\}\.Then, with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the Gaussian sketchSS,
Approx≲∑i\>kκi2\+βk∑i≤kκi2hi\.\\operatorname\{Approx\}\\lesssim\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\+\\beta\_\{k\}\\sum\_\{i\\leq k\}\\frac\{\\kappa\_\{i\}^\{2\}\}\{h\_\{i\}\}\.
###### Proof\.
LetQ:=I−PQ:=I\-P\. SincePPis an orthogonal projection,QQis also an orthogonal projection:
P2=P,Q2=Q,PQ=QP=0\.P^\{2\}=P,\\qquad Q^\{2\}=Q,\\qquad PQ=QP=0\.UsingI=P\+QI=P\+Q, we expand
K=\(P\+Q\)K\(P\+Q\)=PKP\+PKQ\+QKP\+QKQ\.K=\(P\+Q\)K\(P\+Q\)=PKP\+PKQ\+QKP\+QKQ\.Therefore,
K−PKP=PKQ\+QKP\+QKQ\.K\-PKP=PKQ\+QKP\+QKQ\.The three terms on the right\-hand side are mutually orthogonal in Frobenius inner product\. For example,
⟨QKQ,QKP⟩\\displaystyle\\langle QKQ,QKP\\rangle=tr\(\(QKQ\)⊤QKP\)\\displaystyle=\\operatorname\{tr\}\\left\(\(QKQ\)^\{\\top\}QKP\\right\)=tr\(QKQKP\)\\displaystyle=\\operatorname\{tr\}\(QKQKP\)=tr\(PQKQK\)=0,\\displaystyle=\\operatorname\{tr\}\(PQKQK\)=0,
where the last equality usesPQ=0PQ=0\. The other cross terms vanish in the same way\. SinceKKis symmetric, we also have
\|PKQ\|F=\|QKP\|F\.\|PKQ\|\_\{F\}=\|QKP\|\_\{F\}\.Thus
\|K−PKP\|F2=\|QKQ\|F2\+2\|QKP\|F2\.\|K\-PKP\|\_\{F\}^\{2\}=\|QKQ\|\_\{F\}^\{2\}\+2\|QKP\|\_\{F\}^\{2\}\.On the other hand,
QK=QK\(P\+Q\)=QKP\+QKQ,QK=QK\(P\+Q\)=QKP\+QKQ,and the two terms are Frobenius\-orthogonal\. Hence
\|QK\|F2=\|QKP\|F2\+\|QKQ\|F2\.\|QK\|\_\{F\}^\{2\}=\|QKP\|\_\{F\}^\{2\}\+\|QKQ\|\_\{F\}^\{2\}\.Combining the previous two displays gives
Approx=12\|K−PKP\|F2≤\|QK\|F2\.\\operatorname\{Approx\}=\\frac\{1\}\{2\}\|K\-PKP\|\_\{F\}^\{2\}\\leq\|QK\|\_\{F\}^\{2\}\.Now
\|QK\|F2=tr\(K⊤Q⊤QK\)=tr\(KQK\)=tr\(QK2\),\|QK\|\_\{F\}^\{2\}=\\operatorname\{tr\}\(K^\{\\top\}Q^\{\\top\}QK\)=\\operatorname\{tr\}\(KQK\)=\\operatorname\{tr\}\(QK^\{2\}\),where we usedK=K⊤K=K^\{\\top\},Q=Q⊤Q=Q^\{\\top\}, andQ2=QQ^\{2\}=Q\. Therefore,
Approx≤⟨Q,K2⟩\.\\operatorname\{Approx\}\\leq\\langle Q,K^\{2\}\\rangle\.
Next define $\Delta_P := P - I$. Because $Q = I - P$, we have $\Delta_P = -Q$, and hence
$$\Delta_P^2 = Q^2 = Q.$$
Therefore,
Approx≤⟨ΔP2,K2⟩\.\\operatorname\{Approx\}\\leq\\langle\\Delta\_\{P\}^\{2\},K^\{2\}\\rangle\.
We now split the coordinates into0:k0:kandk:∞k:\\infty\. Write
ΔP=\(𝖴𝖵𝖵⊤𝖶\)\.\\Delta\_\{P\}=\\begin\{pmatrix\}\\mathsf\{U\}&\\mathsf\{V\}\\\\ \\mathsf\{V\}^\{\\top\}&\\mathsf\{W\}\\end\{pmatrix\}\.Then
ΔP2=\(𝖴2\+𝖵𝖵⊤𝖴𝖵\+𝖵𝖶𝖵⊤𝖴\+𝖶𝖵⊤𝖶2\+𝖵⊤𝖵\)\.\\Delta\_\{P\}^\{2\}=\\begin\{pmatrix\}\\mathsf\{U\}^\{2\}\+\\mathsf\{V\}\\mathsf\{V\}^\{\\top\}&\\mathsf\{U\}\\mathsf\{V\}\+\\mathsf\{V\}\\mathsf\{W\}\\\\ \\mathsf\{V\}^\{\\top\}\\mathsf\{U\}\+\\mathsf\{W\}\\mathsf\{V\}^\{\\top\}&\\mathsf\{W\}^\{2\}\+\\mathsf\{V\}^\{\\top\}\\mathsf\{V\}\\end\{pmatrix\}\.SinceK2K^\{2\}is diagonal, it is block diagonal under the same split:
K2=\(K0:k200Kk:∞2\)\.K^\{2\}=\\begin\{pmatrix\}K\_\{0:k\}^\{2\}&0\\\\ 0&K\_\{k:\\infty\}^\{2\}\\end\{pmatrix\}\.Therefore the off\-diagonal blocks do not contribute to the inner product, and
⟨ΔP2,K2⟩=⟨𝖴2\+𝖵𝖵⊤,K0:k2⟩\+⟨𝖶2\+𝖵⊤𝖵,Kk:∞2⟩\.\\langle\\Delta\_\{P\}^\{2\},K^\{2\}\\rangle=\\langle\\mathsf\{U\}^\{2\}\+\\mathsf\{V\}\\mathsf\{V\}^\{\\top\},K\_\{0:k\}^\{2\}\\rangle\+\\langle\\mathsf\{W\}^\{2\}\+\\mathsf\{V\}^\{\\top\}\\mathsf\{V\},K\_\{k:\\infty\}^\{2\}\\rangle\.
We first bound the tail term\. By Lemma[9](https://arxiv.org/html/2606.26617#Thmlemma9),
0⪯𝖶2\+𝖵⊤𝖵⪯I\.0\\preceq\\mathsf\{W\}^\{2\}\+\\mathsf\{V\}^\{\\top\}\\mathsf\{V\}\\preceq I\.SinceKk:∞2⪰0K\_\{k:\\infty\}^\{2\}\\succeq 0, this implies
⟨𝖶2\+𝖵⊤𝖵,Kk:∞2⟩≤tr\(Kk:∞2\)=∑i\>kκi2\.\\langle\\mathsf\{W\}^\{2\}\+\\mathsf\{V\}^\{\\top\}\\mathsf\{V\},K\_\{k:\\infty\}^\{2\}\\rangle\\leq\\operatorname\{tr\}\(K\_\{k:\\infty\}^\{2\}\)=\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\.
We next bound the head term\. Again by Lemma[9](https://arxiv.org/html/2606.26617#Thmlemma9),
𝖴2\+𝖵𝖵⊤=H0:k−1/2RkheadH0:k−1/2,\\mathsf\{U\}^\{2\}\+\\mathsf\{V\}\\mathsf\{V\}^\{\\top\}=H\_\{0:k\}^\{\-1/2\}R\_\{k\}^\{\\mathrm\{head\}\}H\_\{0:k\}^\{\-1/2\},where
Rkhead:=\(H0:k−1\+S0:k⊤Σ\>k−1S0:k\)−1\.R\_\{k\}^\{\\mathrm\{head\}\}:=\\left\(H\_\{0:k\}^\{\-1\}\+S\_\{0:k\}^\{\\top\}\\Sigma\_\{\>k\}^\{\-1\}S\_\{0:k\}\\right\)^\{\-1\}\.Therefore,
$$\langle \mathsf{U}^2 + \mathsf{V}\mathsf{V}^\top, K_{0:k}^2\rangle = \left\langle R_k^{\mathrm{head}}, H_{0:k}^{-1/2} K_{0:k}^2 H_{0:k}^{-1/2}\right\rangle.$$
Since $H$ and $K$ are diagonal,
H0:k−1/2K0:k2H0:k−1/2=diag\(κi2hi\)i≤k\.H\_\{0:k\}^\{\-1/2\}K\_\{0:k\}^\{2\}H\_\{0:k\}^\{\-1/2\}=\\operatorname\{diag\}\\left\(\\frac\{\\kappa\_\{i\}^\{2\}\}\{h\_\{i\}\}\\right\)\_\{i\\leq k\}\.
It remains to controlRkheadR\_\{k\}^\{\\mathrm\{head\}\}\. By Lemma[7](https://arxiv.org/html/2606.26617#Thmlemma7), with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\),
‖Σ\>k‖≲βk\.\\\|\\Sigma\_\{\>k\}\\\|\\lesssim\\beta\_\{k\}\.Thus
Σ\>k−1⪰cβk−1IM\.\\Sigma\_\{\>k\}^\{\-1\}\\succeq c\\beta\_\{k\}^\{\-1\}I\_\{M\}\.By Lemma[8](https://arxiv.org/html/2606.26617#Thmlemma8), sincek≤M/2k\\leq M/2,
S0:k⊤S0:k⪰c′IkS\_\{0:k\}^\{\\top\}S\_\{0:k\}\\succeq c^\{\\prime\}I\_\{k\}with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)\. Hence
S0:k⊤Σ\>k−1S0:k⪰cβk−1S0:k⊤S0:k⪰c′′βk−1Ik\.S\_\{0:k\}^\{\\top\}\\Sigma\_\{\>k\}^\{\-1\}S\_\{0:k\}\\succeq c\\beta\_\{k\}^\{\-1\}S\_\{0:k\}^\{\\top\}S\_\{0:k\}\\succeq c^\{\\prime\\prime\}\\beta\_\{k\}^\{\-1\}I\_\{k\}\.SinceH0:k−1⪰0H\_\{0:k\}^\{\-1\}\\succeq 0, we obtain
H0:k−1\+S0:k⊤Σ\>k−1S0:k⪰c′′βk−1Ik\.H\_\{0:k\}^\{\-1\}\+S\_\{0:k\}^\{\\top\}\\Sigma\_\{\>k\}^\{\-1\}S\_\{0:k\}\\succeq c^\{\\prime\\prime\}\\beta\_\{k\}^\{\-1\}I\_\{k\}\.Taking inverses reverses the Loewner order, so
Rkhead=\(H0:k−1\+S0:k⊤Σ\>k−1S0:k\)−1⪯CβkIk\.R\_\{k\}^\{\\mathrm\{head\}\}=\\left\(H\_\{0:k\}^\{\-1\}\+S\_\{0:k\}^\{\\top\}\\Sigma\_\{\>k\}^\{\-1\}S\_\{0:k\}\\right\)^\{\-1\}\\preceq C\\beta\_\{k\}I\_\{k\}\.Therefore,
⟨𝖴2\+𝖵𝖵⊤,K0:k2⟩≤Cβk∑i≤kκi2hi\.\\langle\\mathsf\{U\}^\{2\}\+\\mathsf\{V\}\\mathsf\{V\}^\{\\top\},K\_\{0:k\}^\{2\}\\rangle\\leq C\\beta\_\{k\}\\sum\_\{i\\leq k\}\\frac\{\\kappa\_\{i\}^\{2\}\}\{h\_\{i\}\}\.
Putting the head and tail estimates together gives
Approx≲∑i\>kκi2\+βk∑i≤kκi2hi\.\\operatorname\{Approx\}\\lesssim\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\+\\beta\_\{k\}\\sum\_\{i\\leq k\}\\frac\{\\kappa\_\{i\}^\{2\}\}\{h\_\{i\}\}\.This completes the proof\. ∎
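The right-hand side of the bound can be evaluated under power-law spectra to see how it scales with the sketch size. This is a rough sketch, assuming $h_i=i^{-b}$ and $\kappa_i=i^{-\delta}$ exactly, a finite truncation `dim`, the choice $k=\lfloor M/2\rfloor$, and all hidden constants set to 1; the function name is hypothetical.

```python
import math

def upper_bound(M, b, delta, dim):
    """Tail sum plus beta_k-weighted head sum, with k = M // 2."""
    k = M // 2
    h = [i ** (-b) for i in range(1, dim + 1)]
    kap2 = [i ** (-2 * delta) for i in range(1, dim + 1)]
    tail_h = sum(h[k:])                     # sum_{i > k} h_i
    tail_h2 = sum(x * x for x in h[k:])     # sum_{i > k} h_i^2
    beta_k = tail_h / M + h[k] + math.sqrt(tail_h2 / M)  # h[k] is h_{k+1}
    tail = sum(kap2[k:])                    # sum_{i > k} kappa_i^2
    head = beta_k * sum(kap2[i] / h[i] for i in range(k))
    return tail + head

b, delta = 2.0, 1.0
bound_32 = upper_bound(32, b, delta, dim=5000)
bound_64 = upper_bound(64, b, delta, dim=5000)
slope = math.log(bound_64 / bound_32) / math.log(2.0)  # empirical exponent in M
```

For these parameters $\min\{2\delta-1,b\}=1$, so `slope` is expected to land near $-1$, with deviations attributable to the small values of `M` and the finite truncation.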
### A general lower bound for the approximation error
###### Proposition 4\(General lower bound for approximation error\)\.
Fix an integerk≤c0Mk\\leq c\_\{0\}Mfor a sufficiently small absolute constantc0\>0c\_\{0\}\>0, and assume that the tail rank is at least2M2M, so thathk\+2Mh\_\{k\+2M\}is well\-defined\. Then, with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the Gaussian sketchSS,
$$\operatorname{Approx} \gtrsim h_{k+2M}\sum_{i\leq k}\frac{\kappa_i^2}{h_i} + \sum_{j>M}\mu_j(K_{k:\infty}^2).$$
In particular, if the sequence $\{\kappa_i^2\}_{i \geq 1}$ is non-increasing, then
∑j\>Mμj\(Kk:∞2\)=∑i\>k\+Mκi2,\\sum\_\{j\>M\}\\mu\_\{j\}\(K\_\{k:\\infty\}^\{2\}\)=\\sum\_\{i\>k\+M\}\\kappa\_\{i\}^\{2\},and hence
Approx≳hk\+2M∑i≤kκi2hi\+∑i\>k\+Mκi2\.\\operatorname\{Approx\}\\gtrsim h\_\{k\+2M\}\\sum\_\{i\\leq k\}\\frac\{\\kappa\_\{i\}^\{2\}\}\{h\_\{i\}\}\+\\sum\_\{i\>k\+M\}\\kappa\_\{i\}^\{2\}\.
###### Proof\.
LetQ:=I−PQ:=I\-P\. As in the proof of Proposition[3](https://arxiv.org/html/2606.26617#Thmproposition3), sincePPis an orthogonal projection,QQis also an orthogonal projection and
K−PKP=PKQ\+QKP\+QKQ\.K\-PKP=PKQ\+QKP\+QKQ\.The three terms on the right\-hand side are mutually orthogonal in Frobenius inner product, and sinceK=K⊤K=K^\{\\top\},
\|PKQ\|F=\|QKP\|F\.\|PKQ\|\_\{F\}=\|QKP\|\_\{F\}\.Therefore,
Approx=12\|K−PKP\|F2=12\|QKQ\|F2\+\|QKP\|F2\.\\ \\operatorname\{Approx\}=\\frac\{1\}\{2\}\|K\-PKP\|\_\{F\}^\{2\}=\\frac\{1\}\{2\}\|QKQ\|\_\{F\}^\{2\}\+\|QKP\|\_\{F\}^\{2\}\.Meanwhile,
QK=QK\(P\+Q\)=QKP\+QKQ,QK=QK\(P\+Q\)=QKP\+QKQ,and the two terms are orthogonal\. Hence
\|QK\|F2=\|QKP\|F2\+\|QKQ\|F2\.\|QK\|\_\{F\}^\{2\}=\|QKP\|\_\{F\}^\{2\}\+\|QKQ\|\_\{F\}^\{2\}\.Comparing the previous two displays, we obtain
Approx=12\|QK\|F2\+12\|QKP\|F2≥12\|QK\|F2\.\\operatorname\{Approx\}=\\frac\{1\}\{2\}\|QK\|\_\{F\}^\{2\}\+\\frac\{1\}\{2\}\|QKP\|\_\{F\}^\{2\}\\geq\\frac\{1\}\{2\}\|QK\|\_\{F\}^\{2\}\.SinceQ=Q⊤=Q2Q=Q^\{\\top\}=Q^\{2\}, we have
\|QK\|F2=tr\(KQK\)=tr\(QK2\)=⟨Q,K2⟩\.\|QK\|\_\{F\}^\{2\}=\\operatorname\{tr\}\(KQK\)=\\operatorname\{tr\}\(QK^\{2\}\)=\\langle Q,K^\{2\}\\rangle\.Thus
Approx≥12⟨Q,K2⟩\.\\operatorname\{Approx\}\\geq\\frac\{1\}\{2\}\\langle Q,K^\{2\}\\rangle\.
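Together with the upper-bound argument in Proposition 3, this sandwiches the approximation error between $\frac{1}{2}\langle Q,K^{2}\rangle$ and $\langle Q,K^{2}\rangle$. A minimal numerical sketch follows, assuming a finite diagonal model with hypothetical spectra and a Gaussian sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 10, 4
idx = np.arange(1, D + 1, dtype=float)
h = idx ** -1.5            # hypothetical marginal spectrum h_i
K = np.diag(idx ** -1.0)   # hypothetical whitened signal, kappa_i = i^-1
S = rng.standard_normal((M, D))

def fro2(A):
    """Squared Frobenius norm."""
    return float(np.sum(A * A))

# Projection P = H^{1/2} S^T (S H S^T)^{-1} S H^{1/2}, written via R = S H^{1/2}.
R_mat = S @ np.diag(np.sqrt(h))
P = R_mat.T @ np.linalg.solve(R_mat @ R_mat.T, R_mat)
Q = np.eye(D) - P

approx = 0.5 * fro2(K - P @ K @ P)
qk2 = float(np.trace(Q @ K @ K))                  # <Q, K^2>
decomp = fro2(Q @ K @ Q) + 2.0 * fro2(Q @ K @ P)  # orthogonal decomposition
```

On this instance `2 * approx` matches `decomp`, and `approx` lies between `0.5 * qk2` and `qk2`, so $\langle Q,K^{2}\rangle$ determines the approximation error up to a factor of two.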
Now define $\Delta_P := P - I$. Since $Q = I - P$, we have $\Delta_P = -Q$, and therefore
$$\Delta_P^2 = Q^2 = Q.$$
Splitting coordinates into $0{:}k$ and $k{:}\infty$, write
ΔP=\(𝖴𝖵𝖵⊤𝖶\)\.\\Delta\_\{P\}=\\begin\{pmatrix\}\\mathsf\{U\}&\\mathsf\{V\}\\\\ \\mathsf\{V\}^\{\\top\}&\\mathsf\{W\}\\end\{pmatrix\}\.Then
ΔP2=\(𝖴2\+𝖵𝖵⊤𝖴𝖵\+𝖵𝖶𝖵⊤𝖴\+𝖶𝖵⊤𝖶2\+𝖵⊤𝖵\)\.\\Delta\_\{P\}^\{2\}=\\begin\{pmatrix\}\\mathsf\{U\}^\{2\}\+\\mathsf\{V\}\\mathsf\{V\}^\{\\top\}&\\mathsf\{U\}\\mathsf\{V\}\+\\mathsf\{V\}\\mathsf\{W\}\\\\ \\mathsf\{V\}^\{\\top\}\\mathsf\{U\}\+\\mathsf\{W\}\\mathsf\{V\}^\{\\top\}&\\mathsf\{W\}^\{2\}\+\\mathsf\{V\}^\{\\top\}\\mathsf\{V\}\\end\{pmatrix\}\.BecauseK2K^\{2\}is block diagonal under the same split, the off\-diagonal blocks vanish in the inner product, so
⟨Q,K2⟩=⟨ΔP2,K2⟩\\langle Q,K^\{2\}\\rangle=\\langle\\Delta\_\{P\}^\{2\},K^\{2\}\\rangleequals
⟨𝖴2\+𝖵𝖵⊤,K0:k2⟩\+⟨𝖶2\+𝖵⊤𝖵,Kk:∞2⟩\.\\langle\\mathsf\{U\}^\{2\}\+\\mathsf\{V\}\\mathsf\{V\}^\{\\top\},K\_\{0:k\}^\{2\}\\rangle\+\\langle\\mathsf\{W\}^\{2\}\+\\mathsf\{V\}^\{\\top\}\\mathsf\{V\},K\_\{k:\\infty\}^\{2\}\\rangle\.Therefore,
Approx≥12⟨𝖴2\+𝖵𝖵⊤,K0:k2⟩\+12⟨𝖶2\+𝖵⊤𝖵,Kk:∞2⟩\.\\operatorname\{Approx\}\\geq\\frac\{1\}\{2\}\\langle\\mathsf\{U\}^\{2\}\+\\mathsf\{V\}\\mathsf\{V\}^\{\\top\},K\_\{0:k\}^\{2\}\\rangle\+\\frac\{1\}\{2\}\\langle\\mathsf\{W\}^\{2\}\+\\mathsf\{V\}^\{\\top\}\\mathsf\{V\},K\_\{k:\\infty\}^\{2\}\\rangle\.
We first lower bound the tail term\. By Lemma[9](https://arxiv.org/html/2606.26617#Thmlemma9),
𝖶2\+𝖵⊤𝖵=−𝖶\.\\mathsf\{W\}^\{2\}\+\\mathsf\{V\}^\{\\top\}\\mathsf\{V\}=\-\\mathsf\{W\}\.Hence
⟨𝖶2\+𝖵⊤𝖵,Kk:∞2⟩=⟨−𝖶,Kk:∞2⟩\.\\langle\\mathsf\{W\}^\{2\}\+\\mathsf\{V\}^\{\\top\}\\mathsf\{V\},K\_\{k:\\infty\}^\{2\}\\rangle=\\langle\-\\mathsf\{W\},K\_\{k:\\infty\}^\{2\}\\rangle\.By definition,
−𝖶=I−Hk:∞1/2Sk:∞⊤Σ−1Sk:∞Hk:∞1/2\.\-\\mathsf\{W\}=I\-H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}\\Sigma^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}\.Recall
Σ=S0:kH0:kS0:k⊤\+Σ\>k,Σ\>k:=Sk:∞Hk:∞Sk:∞⊤\.\\Sigma=S\_\{0:k\}H\_\{0:k\}S\_\{0:k\}^\{\\top\}\+\\Sigma\_\{\>k\},\\qquad\\Sigma\_\{\>k\}:=S\_\{k:\\infty\}H\_\{k:\\infty\}S\_\{k:\\infty\}^\{\\top\}\.Since
Σ⪰Σ\>k,\\Sigma\\succeq\\Sigma\_\{\>k\},the order\-reversing property of matrix inversion gives
Σ−1⪯Σ\>k−1\.\\Sigma^\{\-1\}\\preceq\\Sigma\_\{\>k\}^\{\-1\}\.Therefore,
$$H_{k:\infty}^{1/2} S_{k:\infty}^\top \Sigma^{-1} S_{k:\infty} H_{k:\infty}^{1/2} \preceq H_{k:\infty}^{1/2} S_{k:\infty}^\top \Sigma_{>k}^{-1} S_{k:\infty} H_{k:\infty}^{1/2}.$$
Subtracting both sides from $I$, we obtain
−𝖶⪰Πk,\-\\mathsf\{W\}\\succeq\\Pi\_\{k\},where
Πk:=I−Hk:∞1/2Sk:∞⊤Σ\>k−1Sk:∞Hk:∞1/2\.\\Pi\_\{k\}:=I\-H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}\\Sigma\_\{\>k\}^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}\.Let
Rktail:=Sk:∞Hk:∞1/2\.R\_\{k\}^\{\\mathrm\{tail\}\}:=S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}\.ThenΣ\>k=Rktail\(Rktail\)⊤\\Sigma\_\{\>k\}=R\_\{k\}^\{\\mathrm\{tail\}\}\(R\_\{k\}^\{\\mathrm\{tail\}\}\)^\{\\top\}, and
Hk:∞1/2Sk:∞⊤Σ\>k−1Sk:∞Hk:∞1/2=\(Rktail\)⊤\(Rktail\(Rktail\)⊤\)−1Rktail\.H\_\{k:\\infty\}^\{1/2\}S\_\{k:\\infty\}^\{\\top\}\\Sigma\_\{\>k\}^\{\-1\}S\_\{k:\\infty\}H\_\{k:\\infty\}^\{1/2\}=\(R\_\{k\}^\{\\mathrm\{tail\}\}\)^\{\\top\}\\left\(R\_\{k\}^\{\\mathrm\{tail\}\}\(R\_\{k\}^\{\\mathrm\{tail\}\}\)^\{\\top\}\\right\)^\{\-1\}R\_\{k\}^\{\\mathrm\{tail\}\}\.This is the orthogonal projection onto the row space ofRktailR\_\{k\}^\{\\mathrm\{tail\}\}\. SinceRktailR\_\{k\}^\{\\mathrm\{tail\}\}has rankMMalmost surely under Gaussian sketching,Πk\\Pi\_\{k\}is an orthogonal projection such thatI−ΠkI\-\\Pi\_\{k\}has rankMM\. Hence, by Lemma[10](https://arxiv.org/html/2606.26617#Thmlemma10),
⟨Πk,Kk:∞2⟩≥∑j\>Mμj\(Kk:∞2\)\.\\langle\\Pi\_\{k\},K\_\{k:\\infty\}^\{2\}\\rangle\\geq\\sum\_\{j\>M\}\\mu\_\{j\}\(K\_\{k:\\infty\}^\{2\}\)\.Using−𝖶⪰Πk\-\\mathsf\{W\}\\succeq\\Pi\_\{k\}andKk:∞2⪰0K\_\{k:\\infty\}^\{2\}\\succeq 0, we get
⟨𝖶2\+𝖵⊤𝖵,Kk:∞2⟩=⟨−𝖶,Kk:∞2⟩≥∑j\>Mμj\(Kk:∞2\)\.\\langle\\mathsf\{W\}^\{2\}\+\\mathsf\{V\}^\{\\top\}\\mathsf\{V\},K\_\{k:\\infty\}^\{2\}\\rangle=\\langle\-\\mathsf\{W\},K\_\{k:\\infty\}^\{2\}\\rangle\\geq\\sum\_\{j\>M\}\\mu\_\{j\}\(K\_\{k:\\infty\}^\{2\}\)\.
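The projection inequality used for the tail term can be illustrated numerically: a projection that removes an $M$-dimensional subspace retains at least the sum of all but the $M$ largest eigenvalues. The sketch below assumes a finite tail of `T` coordinates and uses a generic Gaussian matrix as a stand-in for $R_k^{\mathrm{tail}}$; both are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
T, M = 12, 4  # T: number of tail coordinates (hypothetical), M: sketch size

# Diagonal of K_{k:inf}^2, sorted in non-increasing order.
kap2 = np.arange(1, T + 1, dtype=float) ** -2.0
R_tail = rng.standard_normal((M, T))  # stand-in for S_{k:inf} H_{k:inf}^{1/2}

# Projection onto the row space of R_tail, and its complement Pi_k.
row_proj = R_tail.T @ np.linalg.solve(R_tail @ R_tail.T, R_tail)
Pi = np.eye(T) - row_proj

lhs = float(np.sum(np.diag(Pi) * kap2))  # <Pi_k, K^2> for diagonal K^2
rhs = float(np.sum(kap2[M:]))            # sum of all but the M largest eigenvalues
rank_removed = int(round(np.trace(row_proj)))
```

Here `rank_removed` equals `M`, and `lhs` is at least `rhs`; equality would require the removed subspace to align exactly with the top `M` coordinates.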
We next lower bound the head term\. By Lemma[9](https://arxiv.org/html/2606.26617#Thmlemma9),
𝖴2\+𝖵𝖵⊤=H0:k−1/2RkheadH0:k−1/2,\\mathsf\{U\}^\{2\}\+\\mathsf\{V\}\\mathsf\{V\}^\{\\top\}=H\_\{0:k\}^\{\-1/2\}R\_\{k\}^\{\\mathrm\{head\}\}H\_\{0:k\}^\{\-1/2\},where
Rkhead:=\(H0:k−1\+S0:k⊤Σ\>k−1S0:k\)−1\.R\_\{k\}^\{\\mathrm\{head\}\}:=\\left\(H\_\{0:k\}^\{\-1\}\+S\_\{0:k\}^\{\\top\}\\Sigma\_\{\>k\}^\{\-1\}S\_\{0:k\}\\right\)^\{\-1\}\.Therefore,
⟨𝖴2\+𝖵𝖵⊤,K0:k2⟩=⟨Rkhead,H0:k−1/2K0:k2H0:k−1/2⟩\.\\langle\\mathsf\{U\}^\{2\}\+\\mathsf\{V\}\\mathsf\{V\}^\{\\top\},K\_\{0:k\}^\{2\}\\rangle=\\left\\langle R\_\{k\}^\{\\mathrm\{head\}\},H\_\{0:k\}^\{\-1/2\}K\_\{0:k\}^\{2\}H\_\{0:k\}^\{\-1/2\}\\right\\rangle\.SinceHHandKKare diagonal,
H0:k−1/2K0:k2H0:k−1/2=diag\(κi2hi\)i≤k\.H\_\{0:k\}^\{\-1/2\}K\_\{0:k\}^\{2\}H\_\{0:k\}^\{\-1/2\}=\\operatorname\{diag\}\\left\(\\frac\{\\kappa\_\{i\}^\{2\}\}\{h\_\{i\}\}\\right\)\_\{i\\leq k\}\.
It remains to lower boundRkheadR\_\{k\}^\{\\mathrm\{head\}\}\. By Lemma[7](https://arxiv.org/html/2606.26617#Thmlemma7), with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\),
μmin\(Σ\>k\)≳hk\+2M\.\\mu\_\{\\min\}\(\\Sigma\_\{\>k\}\)\\gtrsim h\_\{k\+2M\}\.Thus
Σ\>k−1⪯Chk\+2M−1IM\.\\Sigma\_\{\>k\}^\{\-1\}\\preceq Ch\_\{k\+2M\}^\{\-1\}I\_\{M\}\.By Lemma[8](https://arxiv.org/html/2606.26617#Thmlemma8), sincek≤c0Mk\\leq c\_\{0\}M,
S0:k⊤S0:k⪯C′IkS\_\{0:k\}^\{\\top\}S\_\{0:k\}\\preceq C^\{\\prime\}I\_\{k\}with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)\. Hence
S0:k⊤Σ\>k−1S0:k⪯Chk\+2M−1S0:k⊤S0:k⪯C′′hk\+2M−1Ik\.S\_\{0:k\}^\{\\top\}\\Sigma\_\{\>k\}^\{\-1\}S\_\{0:k\}\\preceq Ch\_\{k\+2M\}^\{\-1\}S\_\{0:k\}^\{\\top\}S\_\{0:k\}\\preceq C^\{\\prime\\prime\}h\_\{k\+2M\}^\{\-1\}I\_\{k\}\.Also, becausehih\_\{i\}is non\-increasing andi≤ki\\leq k, we have
hi≥hk≥hk\+2M\.h\_\{i\}\\geq h\_\{k\}\\geq h\_\{k\+2M\}\.Therefore,
H0:k−1⪯hk\+2M−1Ik\.H\_\{0:k\}^\{\-1\}\\preceq h\_\{k\+2M\}^\{\-1\}I\_\{k\}\.Combining the previous two bounds gives
H0:k−1\+S0:k⊤Σ\>k−1S0:k⪯C′′′hk\+2M−1Ik\.H\_\{0:k\}^\{\-1\}\+S\_\{0:k\}^\{\\top\}\\Sigma\_\{\>k\}^\{\-1\}S\_\{0:k\}\\preceq C^\{\\prime\\prime\\prime\}h\_\{k\+2M\}^\{\-1\}I\_\{k\}\.Taking inverses reverses the Loewner order, so
Rkhead⪰chk\+2MIk\.R\_\{k\}^\{\\mathrm\{head\}\}\\succeq ch\_\{k\+2M\}I\_\{k\}\.Consequently,
⟨𝖴2\+𝖵𝖵⊤,K0:k2⟩≥chk\+2M∑i≤kκi2hi\.\\langle\\mathsf\{U\}^\{2\}\+\\mathsf\{V\}\\mathsf\{V\}^\{\\top\},K\_\{0:k\}^\{2\}\\rangle\\geq ch\_\{k\+2M\}\\sum\_\{i\\leq k\}\\frac\{\\kappa\_\{i\}^\{2\}\}\{h\_\{i\}\}\.
Combining the head and tail lower bounds, and absorbing the factor1/21/2into the implicit constant, yields
$$\operatorname{Approx}\gtrsim h_{k+2M}\sum_{i\leq k}\frac{\kappa_{i}^{2}}{h_{i}}+\sum_{j>M}\mu_{j}(K_{k:\infty}^{2}).$$ If the sequence $(\kappa_{i}^{2})_{i\geq 1}$ is non-increasing, then the eigenvalues of $K_{k:\infty}^{2}$, in decreasing order, are
κk\+12,κk\+22,…,\\kappa\_\{k\+1\}^\{2\},\\kappa\_\{k\+2\}^\{2\},\\ldots,and therefore
∑j\>Mμj\(Kk:∞2\)=∑i\>k\+Mκi2\.\\sum\_\{j\>M\}\\mu\_\{j\}\(K\_\{k:\\infty\}^\{2\}\)=\\sum\_\{i\>k\+M\}\\kappa\_\{i\}^\{2\}\.This proves the claimed lower bound\. ∎
###### Theorem 2 \(Approximation scaling under latent\-noise power laws\)\.
Suppose Assumption[1](https://arxiv.org/html/2606.26617#Thmassumption1)holds\. Then, with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\)over the Gaussian sketchSS,
Approx≍M−b∑i≤Mi3b−2a\+∑i\>Mi−2\(a−b\)\.\\operatorname\{Approx\}\\asymp M^\{\-b\}\\sum\_\{i\\leq M\}i^\{3b\-2a\}\+\\sum\_\{i\>M\}i^\{\-2\(a\-b\)\}\.Consequently, ifa\>b\+1/2a\>b\+1/2, then
Approx≍\{M1−2\(a−b\),b\+12<a<3b\+12,M−blogM,a=3b\+12,M−b,a\>3b\+12\.\\operatorname\{Approx\}\\asymp\\begin\{cases\}M^\{1\-2\(a\-b\)\},&b\+\\frac\{1\}\{2\}<a<\\frac\{3b\+1\}\{2\},\\\\\[4\.0pt\] M^\{\-b\}\\log M,&a=\\frac\{3b\+1\}\{2\},\\\\\[4\.0pt\] M^\{\-b\},&a\>\\frac\{3b\+1\}\{2\}\.\\end\{cases\}Equivalently, away from the logarithmic boundarya=\(3b\+1\)/2a=\(3b\+1\)/2,
Approx≍M−min\{2\(a−b\)−1,b\}\.\\operatorname\{Approx\}\\asymp M^\{\-\\min\\\{\\,2\(a\-b\)\-1,\\ b\\,\\\}\}\.
###### Proof\.
We start from the general upper and lower bounds proved in Propositions[3](https://arxiv.org/html/2606.26617#Thmproposition3)and[4](https://arxiv.org/html/2606.26617#Thmproposition4)\. For anyk≤cMk\\leq cM, these bounds imply that, with probability at least1−exp\(−Ω\(M\)\)1\-\\exp\(\-\\Omega\(M\)\),
Approx≲∑i\>kκi2\+βk∑i≤kκi2hi,\\operatorname\{Approx\}\\lesssim\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\+\\beta\_\{k\}\\sum\_\{i\\leq k\}\\frac\{\\kappa\_\{i\}^\{2\}\}\{h\_\{i\}\},and
Approx≳hk\+2M∑i≤kκi2hi\+∑i\>k\+Mκi2,\\operatorname\{Approx\}\\gtrsim h\_\{k\+2M\}\\sum\_\{i\\leq k\}\\frac\{\\kappa\_\{i\}^\{2\}\}\{h\_\{i\}\}\+\\sum\_\{i\>k\+M\}\\kappa\_\{i\}^\{2\},where the marginal spectrumhih\_\{i\}and whitened signal coefficientsκi\\kappa\_\{i\}are defined by
$$h_{i}:=\lambda_{z,i}+\lambda_{\epsilon,i},\qquad \kappa_{i}:=\frac{\lambda_{z,i}}{\lambda_{z,i}+\lambda_{\epsilon,i}}.$$ The coefficient $\beta_{k}$ is
βk:=∑i\>khiM\+hk\+1\+∑i\>khi2M\.\\beta\_\{k\}:=\\frac\{\\sum\_\{i\>k\}h\_\{i\}\}\{M\}\+h\_\{k\+1\}\+\\sqrt\{\\frac\{\\sum\_\{i\>k\}h\_\{i\}^\{2\}\}\{M\}\}\.
We now substitute the power laws. Since $a>b+1/2$, we have in particular $a>b$, so the noise spectrum decays more slowly than the signal spectrum and dominates the marginal spectrum:
hi=λz,i\+λϵ,i≍i−b\.h\_\{i\}=\\lambda\_\{z,i\}\+\\lambda\_\{\\epsilon,i\}\\asymp i^\{\-b\}\.Moreover,
κi=λz,iλz,i\+λϵ,i≍i−ai−b=i−\(a−b\)\.\\kappa\_\{i\}=\\frac\{\\lambda\_\{z,i\}\}\{\\lambda\_\{z,i\}\+\\lambda\_\{\\epsilon,i\}\}\\asymp\\frac\{i^\{\-a\}\}\{i^\{\-b\}\}=i^\{\-\(a\-b\)\}\.Hence
κi2≍i−2\(a−b\)\.\\kappa\_\{i\}^\{2\}\\asymp i^\{\-2\(a\-b\)\}\.
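As a quick sanity check of the two-sided estimate for $\kappa_i$, the sketch below assumes exact power laws $\lambda_{z,i}=i^{-a}$ and $\lambda_{\epsilon,i}=i^{-b}$. This is an illustrative choice, since the assumption in the text only fixes the rates up to constants. The function name `kappa` and the parameter values are hypothetical.

```python
def kappa(i, a, b):
    """Whitened signal coefficient for exact power laws (illustrative assumption)."""
    lam_z = i ** (-a)      # latent signal spectrum, decaying like i^{-a}
    lam_eps = i ** (-b)    # noise spectrum, decaying like i^{-b}
    return lam_z / (lam_z + lam_eps)

a, b = 2.0, 1.2            # any pair with a > b + 1/2 and b > 1 would do
# Ratio of kappa_i to the claimed rate i^{-(a-b)}; it equals 1 / (1 + i^{-(a-b)}).
ratios = [kappa(i, a, b) / i ** (-(a - b)) for i in range(1, 10001)]
print(min(ratios), max(ratios))
```

Under these assumptions the ratio equals $1/(1+i^{-(a-b)})$, which lies in $[1/2,1)$ for every $i\geq 1$, so the implicit constants in $\kappa_i\asymp i^{-(a-b)}$ can be taken as $1/2$ and $1$.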
Next we estimateβk\\beta\_\{k\}\. Sincehi≍i−bh\_\{i\}\\asymp i^\{\-b\}andb\>1b\>1,
∑i\>khi≍∑i\>ki−b≍k1−b\.\\sum\_\{i\>k\}h\_\{i\}\\asymp\\sum\_\{i\>k\}i^\{\-b\}\\asymp k^\{1\-b\}\.Therefore
∑i\>khiM≍k1−bM\.\\frac\{\\sum\_\{i\>k\}h\_\{i\}\}\{M\}\\asymp\\frac\{k^\{1\-b\}\}\{M\}\.Also,
hk\+1≍k−b\.h\_\{k\+1\}\\asymp k^\{\-b\}\.Finally, since2b\>12b\>1,
∑i\>khi2≍∑i\>ki−2b≍k1−2b,\\sum\_\{i\>k\}h\_\{i\}^\{2\}\\asymp\\sum\_\{i\>k\}i^\{\-2b\}\\asymp k^\{1\-2b\},and hence
∑i\>khi2M≍k1−2bM\.\\sqrt\{\\frac\{\\sum\_\{i\>k\}h\_\{i\}^\{2\}\}\{M\}\}\\asymp\\sqrt\{\\frac\{k^\{1\-2b\}\}\{M\}\}\.We now choosek=⌊cM⌋k=\\lfloor cM\\rfloorfor a sufficiently small absolute constantc\>0c\>0\. Under this choice,
k1−bM≍M−b,k−b≍M−b,\\frac\{k^\{1\-b\}\}\{M\}\\asymp M^\{\-b\},\\qquad k^\{\-b\}\\asymp M^\{\-b\},and
k1−2bM≍M−b\.\\sqrt\{\\frac\{k^\{1\-2b\}\}\{M\}\}\\asymp M^\{\-b\}\.Thus
βk≍M−b\.\\beta\_\{k\}\\asymp M^\{\-b\}\.Similarly,
hk\+2M≍M−b\.h\_\{k\+2M\}\\asymp M^\{\-b\}\.
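The estimate $\beta_k\asymp M^{-b}$ can be checked numerically. The sketch below assumes the exact power law $h_i=i^{-b}$ and the cutoff $k=\lfloor cM\rfloor$ with $c=0.25$. Both are illustrative choices, and the helper names `tail_sum` and `beta_k` are hypothetical. Infinite tails are approximated by explicit terms plus an integral remainder.

```python
import math

def tail_sum(k, p, factor=200):
    """Approximate sum_{i>k} i^{-p} for p > 1: explicit terms up to T, integral beyond T."""
    T = factor * max(k, 1)
    explicit = sum(i ** (-p) for i in range(k + 1, T + 1))
    return explicit + T ** (1.0 - p) / (p - 1.0)

def beta_k(M, b, c=0.25):
    """beta_k with k = floor(c*M) and h_i = i^{-b} (exact power law assumed)."""
    k = int(c * M)
    term_trace = tail_sum(k, b) / M                 # (sum_{i>k} h_i) / M
    term_top = (k + 1) ** (-b)                      # h_{k+1}
    term_frob = math.sqrt(tail_sum(k, 2 * b) / M)   # sqrt((sum_{i>k} h_i^2) / M)
    return term_trace + term_top + term_frob

b = 1.5
for M in (500, 2000, 8000):
    # The rescaled quantity should stay roughly constant if beta_k scales like M^{-b}.
    print(M, beta_k(M, b) * M ** b)
```

If the rescaled values stay within a narrow band as $M$ grows, all three terms of $\beta_k$ are of the common order $M^{-b}$, as claimed.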
Substituting these estimates into the general upper bound gives
Approx≲∑i\>Mi−2\(a−b\)\+M−b∑i≤Mi−2\(a−b\)i−b\.\\operatorname\{Approx\}\\lesssim\\sum\_\{i\>M\}i^\{\-2\(a\-b\)\}\+M^\{\-b\}\\sum\_\{i\\leq M\}\\frac\{i^\{\-2\(a\-b\)\}\}\{i^\{\-b\}\}\.Since
i−2\(a−b\)i−b=ib−2\(a−b\)=i3b−2a,\\frac\{i^\{\-2\(a\-b\)\}\}\{i^\{\-b\}\}=i^\{b\-2\(a\-b\)\}=i^\{3b\-2a\},we obtain
Approx≲∑i\>Mi−2\(a−b\)\+M−b∑i≤Mi3b−2a\.\\operatorname\{Approx\}\\lesssim\\sum\_\{i\>M\}i^\{\-2\(a\-b\)\}\+M^\{\-b\}\\sum\_\{i\\leq M\}i^\{3b\-2a\}\.
The lower bound gives the same order\. Indeed, usinghk\+2M≍M−bh\_\{k\+2M\}\\asymp M^\{\-b\},κi2≍i−2\(a−b\)\\kappa\_\{i\}^\{2\}\\asymp i^\{\-2\(a\-b\)\}, andk≍Mk\\asymp M,
Approx≳M−b∑i≤Mi3b−2a\+∑i\>Mi−2\(a−b\)\.\\operatorname\{Approx\}\\gtrsim M^\{\-b\}\\sum\_\{i\\leq M\}i^\{3b\-2a\}\+\\sum\_\{i\>M\}i^\{\-2\(a\-b\)\}\.Therefore,
Approx≍M−b∑i≤Mi3b−2a\+∑i\>Mi−2\(a−b\)\.\\operatorname\{Approx\}\\asymp M^\{\-b\}\\sum\_\{i\\leq M\}i^\{3b\-2a\}\+\\sum\_\{i\>M\}i^\{\-2\(a\-b\)\}\.
It remains to simplify the sums\. By Assumption[1](https://arxiv.org/html/2606.26617#Thmassumption1),δ=a−b\>1/2\\delta=a\-b\>1/2\. The tail sum is
∑i\>Mi−2δ≍M1−2δ\.\\sum\_\{i\>M\}i^\{\-2\\delta\}\\asymp M^\{1\-2\\delta\}\.The head\-leakage sum is
M−b∑i≤Mib−2δ\.M^\{\-b\}\\sum\_\{i\\leq M\}i^\{b\-2\\delta\}\.There are three cases\.
Ifb−2δ\>−1b\-2\\delta\>\-1, equivalently
δ<b\+12,\\delta<\\frac\{b\+1\}\{2\},then
∑i≤Mib−2δ≍Mb−2δ\+1,\\sum\_\{i\\leq M\}i^\{b\-2\\delta\}\\asymp M^\{b\-2\\delta\+1\},and hence
M−b∑i≤Mib−2δ≍M1−2δ\.M^\{\-b\}\\sum\_\{i\\leq M\}i^\{b\-2\\delta\}\\asymp M^\{1\-2\\delta\}\.
Ifb−2δ=−1b\-2\\delta=\-1, equivalently
δ=b\+12,\\delta=\\frac\{b\+1\}\{2\},then
∑i≤Mib−2δ≍logM,\\sum\_\{i\\leq M\}i^\{b\-2\\delta\}\\asymp\\log M,and hence
M−b∑i≤Mib−2δ≍M−blogM\.M^\{\-b\}\\sum\_\{i\\leq M\}i^\{b\-2\\delta\}\\asymp M^\{\-b\}\\log M\.In this boundary case,
M1−2δ=M−b,M^\{1\-2\\delta\}=M^\{\-b\},so the logarithmic head term dominates\.
Ifb−2δ<−1b\-2\\delta<\-1, equivalently
δ\>b\+12,\\delta\>\\frac\{b\+1\}\{2\},then
∑i≤Mib−2δ≍1,\\sum\_\{i\\leq M\}i^\{b\-2\\delta\}\\asymp 1,and hence
M−b∑i≤Mib−2δ≍M−b\.M^\{\-b\}\\sum\_\{i\\leq M\}i^\{b\-2\\delta\}\\asymp M^\{\-b\}\.In this regime,
M1−2δ≲M−b,M^\{1\-2\\delta\}\\lesssim M^\{\-b\},so the head\-leakage term dominates\.
Returning toδ=a−b\\delta=a\-b, we obtain
Approx≍\{M1−2\(a−b\),b\+12<a<3b\+12,M−blogM,a=3b\+12,M−b,a\>3b\+12\.\\operatorname\{Approx\}\\asymp\\begin\{cases\}M^\{1\-2\(a\-b\)\},&b\+\\frac\{1\}\{2\}<a<\\frac\{3b\+1\}\{2\},\\\\\[4\.0pt\] M^\{\-b\}\\log M,&a=\\frac\{3b\+1\}\{2\},\\\\\[4\.0pt\] M^\{\-b\},&a\>\\frac\{3b\+1\}\{2\}\.\\end\{cases\}This completes the proof\. ∎
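The two regimes of the theorem can be illustrated numerically. The sketch below evaluates the two-term expression $M^{-b}\sum_{i\leq M}i^{3b-2a}+\sum_{i>M}i^{-2(a-b)}$ with all implicit constants set to one (an assumption made only for illustration), fits a log-log slope between two values of $M$, and compares it with $\min\{2(a-b)-1,\,b\}$. The function names are hypothetical.

```python
import math

def approx_proxy(M, a, b, tail_factor=50):
    """M^{-b} * sum_{i<=M} i^{3b-2a} + sum_{i>M} i^{-2(a-b)}, constants set to 1."""
    p = 2.0 * (a - b)                       # tail exponent; needs p > 1
    head = M ** (-b) * sum(i ** (3 * b - 2 * a) for i in range(1, M + 1))
    T = tail_factor * M
    tail = sum(i ** (-p) for i in range(M + 1, T + 1))
    tail += T ** (1.0 - p) / (p - 1.0)      # integral estimate of the remainder beyond T
    return head + tail

def fitted_exponent(a, b, M1=1000, M2=4000):
    """Negative log-log slope of the proxy between M1 and M2."""
    r1, r2 = approx_proxy(M1, a, b), approx_proxy(M2, a, b)
    return -math.log(r2 / r1) / math.log(M2 / M1)

def predicted_exponent(a, b):
    """Exponent from the theorem, away from the logarithmic boundary."""
    return min(2 * (a - b) - 1, b)

for a, b in [(2.0, 1.2), (3.5, 1.2)]:       # tail-dominated and head-leakage regimes
    print(a, b, fitted_exponent(a, b), predicted_exponent(a, b))
```

The first parameter pair satisfies $a<(3b+1)/2$ and the second satisfies $a>(3b+1)/2$, so the fitted slopes should be close to $2(a-b)-1$ and to $b$ respectively. Near the boundary $a=(3b+1)/2$ the logarithmic factor makes a two-point slope fit less informative.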
## Appendix C Excess risk of empirical gradient descent
We use the sketched riskRMR\_\{M\}, its minimizerA⋆A^\{\\star\}, the empirical residualE^:=C^−Σ^xA⋆Σ^y\\widehat\{E\}:=\\widehat\{C\}\-\\widehat\{\\Sigma\}\_\{x\}A^\{\\star\}\\widehat\{\\Sigma\}\_\{y\}, and the empirical GD filtersℬ^L\\widehat\{\\mathscr\{B\}\}\_\{L\}and𝒱^L\\widehat\{\\mathscr\{V\}\}\_\{L\}\. The relevant quantities for this section are the following bias, variance, and cross terms\.
###### Lemma 1 \(Bias–variance–cross decomposition for empirical GD\)\.
LetALA\_\{L\}be the output of empirical GD afterLLsteps\. Then
RM\(AL\)−RM\(A⋆\)=BiasLsamp\+VarLsamp\+CrossLsamp,R\_\{M\}\(A\_\{L\}\)\-R\_\{M\}\(A^\{\\star\}\)=\\operatorname\{Bias\}^\{\\mathrm\{samp\}\}\_\{L\}\+\\operatorname\{Var\}^\{\\mathrm\{samp\}\}\_\{L\}\+\\operatorname\{Cross\}^\{\\mathrm\{samp\}\}\_\{L\},where
BiasLsamp:=12‖ℬ^L\(A⋆\)‖Σ,Σ2,\\operatorname\{Bias\}^\{\\mathrm\{samp\}\}\_\{L\}:=\\frac\{1\}\{2\}\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\},VarLsamp:=12‖𝒱^L\(E^\)‖Σ,Σ2,\\operatorname\{Var\}^\{\\mathrm\{samp\}\}\_\{L\}:=\\frac\{1\}\{2\}\\left\\\|\\widehat\{\\mathscr\{V\}\}\_\{L\}\(\\widehat\{E\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\},and
CrossLsamp:=−⟨ℬ^L\(A⋆\),𝒱^L\(E^\)⟩Σ,Σ\.\\operatorname\{Cross\}^\{\\mathrm\{samp\}\}\_\{L\}:=\-\\left\\langle\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\),\\widehat\{\\mathscr\{V\}\}\_\{L\}\(\\widehat\{E\}\)\\right\\rangle\_\{\\Sigma,\\Sigma\}\.
###### Proof\.
Let
Δt:=At−A⋆\.\\Delta\_\{t\}:=A\_\{t\}\-A^\{\\star\}\.Using the GD recursion and substitutingAt−1=Δt−1\+A⋆A\_\{t\-1\}=\\Delta\_\{t\-1\}\+A^\{\\star\}, we obtain
Δt=Δt−1−γtΣ^xΔt−1Σ^y\+γt\(C^−Σ^xA⋆Σ^y\)\.\\Delta\_\{t\}=\\Delta\_\{t\-1\}\-\\gamma\_\{t\}\\widehat\{\\Sigma\}\_\{x\}\\Delta\_\{t\-1\}\\widehat\{\\Sigma\}\_\{y\}\+\\gamma\_\{t\}\\left\(\\widehat\{C\}\-\\widehat\{\\Sigma\}\_\{x\}A^\{\\star\}\\widehat\{\\Sigma\}\_\{y\}\\right\)\.By the definition ofℋ^\\widehat\{\\mathscr\{H\}\}andE^\\widehat\{E\}, this becomes
Δt=\(I−γtℋ^\)\(Δt−1\)\+γtE^\.\\Delta\_\{t\}=\(I\-\\gamma\_\{t\}\\widehat\{\\mathscr\{H\}\}\)\(\\Delta\_\{t\-1\}\)\+\\gamma\_\{t\}\\widehat\{E\}\.Iterating the recursion fromt=1t=1toLL, and using
$$\Delta_{0}=A_{0}-A^{\star}=-A^{\star},$$ which holds for the zero initialization $A_{0}=0$, gives
$$\Delta_{L}=-\widehat{\mathscr{B}}_{L}(A^{\star})+\widehat{\mathscr{V}}_{L}(\widehat{E}).$$ Since $R_{M}$ is quadratic with minimizer $A^{\star}$ and Hessian $B\mapsto\Sigma B\Sigma$, its excess risk is
RM\(AL\)−RM\(A⋆\)=12‖ΔL‖Σ,Σ2\.R\_\{M\}\(A\_\{L\}\)\-R\_\{M\}\(A^\{\\star\}\)=\\frac\{1\}\{2\}\\\|\\Delta\_\{L\}\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\.Substituting the previous display and expanding the square gives
$$\frac{1}{2}\left\|-\widehat{\mathscr{B}}_{L}(A^{\star})+\widehat{\mathscr{V}}_{L}(\widehat{E})\right\|_{\Sigma,\Sigma}^{2}=\frac{1}{2}\left\|\widehat{\mathscr{B}}_{L}(A^{\star})\right\|_{\Sigma,\Sigma}^{2}+\frac{1}{2}\left\|\widehat{\mathscr{V}}_{L}(\widehat{E})\right\|_{\Sigma,\Sigma}^{2}-\left\langle\widehat{\mathscr{B}}_{L}(A^{\star}),\widehat{\mathscr{V}}_{L}(\widehat{E})\right\rangle_{\Sigma,\Sigma}.$$ This proves the desired decomposition. ∎
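The decomposition can be verified in the simplest possible case $M=1$, where every matrix is a scalar. The sketch below is only an illustration: the filters `B` and `V` are the scalar versions obtained by unrolling the recursion for $\Delta_t$ (the general filter definitions are given elsewhere in the paper), and the function name and all numerical values are hypothetical.

```python
def scalar_gd_decomposition(sigma, sig_x_hat, sig_y_hat, c_hat, a_star, steps):
    """Run scalar empirical GD from a_0 = 0 and return the bias/variance/cross terms."""
    h_hat = sig_x_hat * sig_y_hat                    # scalar empirical Hessian
    e_hat = c_hat - sig_x_hat * a_star * sig_y_hat   # empirical residual E_hat
    a, B, V = 0.0, 1.0, 0.0
    for g in steps:
        a = a - g * (sig_x_hat * a * sig_y_hat - c_hat)  # GD step on the empirical risk
        B = (1.0 - g * h_hat) * B                    # bias filter: prod_t (1 - g_t h_hat)
        V = (1.0 - g * h_hat) * V + g                # variance filter, unrolled recursion
    delta = a - a_star
    w = sigma ** 2                                   # ||x||_{Sigma,Sigma}^2 = sigma^2 x^2 when M = 1
    return {
        "delta": delta, "B": B, "V": V, "e_hat": e_hat,
        "excess": 0.5 * w * delta ** 2,
        "bias": 0.5 * w * (B * a_star) ** 2,
        "var": 0.5 * w * (V * e_hat) ** 2,
        "cross": -w * (B * a_star) * (V * e_hat),
    }

out = scalar_gd_decomposition(0.8, 0.9, 0.7, 0.3, 0.5, [0.5] * 20)
print(out["excess"], out["bias"] + out["var"] + out["cross"])
```

The two printed numbers agree up to floating-point error, and `delta` coincides with `-B * a_star + V * e_hat`. The cross term can take either sign, which is why it is bounded separately through the bias and variance.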
## Appendix D Bias of Normal GD
### A general upper bound for the GD bias
The GD\-bias term is
BiasL:=12‖ℬ^L\(A⋆\)‖Σ,Σ2\.\\operatorname\{Bias\}\_\{L\}:=\\frac\{1\}\{2\}\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\.
Fork∈\{1,…,M\}k\\in\\\{1,\\ldots,M\\\}, define the population spectral projector
Pk:=∑i≤kvivi⊤\.P\_\{k\}:=\\sum\_\{i\\leq k\}v\_\{i\}v\_\{i\}^\{\\top\}\.
###### Lemma 2 \(General GD\-bias upper bound\)\.
Suppose Assumptions[2](https://arxiv.org/html/2606.26617#Thmassumption2)and[3](https://arxiv.org/html/2606.26617#Thmassumption3)hold, and suppose thatℰcov\(R\)\\mathcal\{E\}\_\{\\mathrm\{cov\}\}\(R\)occurs\. Then, for every1≤k≤M1\\leq k\\leq M,
$$\operatorname{Bias}_{L}\lesssim\frac{1}{L_{\mathrm{eff}}\gamma}\sum_{i\leq k}\frac{\kappa_{i}^{2}}{\mu_{i}^{2}}+\sum_{i>k}\kappa_{i}^{2}.$$ More generally, before the source condition is substituted, the bound reads
BiasL≲1Leffγ‖PkA⋆Pk‖F2\+‖A⋆−PkA⋆Pk‖Σ,Σ2\.\\operatorname\{Bias\}\_\{L\}\\lesssim\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\\\|P\_\{k\}A^\{\\star\}P\_\{k\}\\\|\_\{F\}^\{2\}\+\\\|A^\{\\star\}\-P\_\{k\}A^\{\\star\}P\_\{k\}\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\.
###### Proof\.
Define the population head and tail parts ofA⋆A^\{\\star\}by
Ah⋆:=PkA⋆Pk,At⋆:=A⋆−PkA⋆Pk\.A^\{\\star\}\_\{\\mathrm\{h\}\}:=P\_\{k\}A^\{\\star\}P\_\{k\},\\qquad A^\{\\star\}\_\{\\mathrm\{t\}\}:=A^\{\\star\}\-P\_\{k\}A^\{\\star\}P\_\{k\}\.By linearity ofℬ^L\\widehat\{\\mathscr\{B\}\}\_\{L\},
ℬ^L\(A⋆\)=ℬ^L\(Ah⋆\)\+ℬ^L\(At⋆\)\.\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\)=\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{h\}\}\)\+\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{t\}\}\)\.Using‖B1\+B2‖Σ,Σ2≤2‖B1‖Σ,Σ2\+2‖B2‖Σ,Σ2\\\|B\_\{1\}\+B\_\{2\}\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\\leq 2\\\|B\_\{1\}\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\+2\\\|B\_\{2\}\\\|\_\{\\Sigma,\\Sigma\}^\{2\}, we get
BiasL≲‖ℬ^L\(Ah⋆\)‖Σ,Σ2\+‖ℬ^L\(At⋆\)‖Σ,Σ2\.\\operatorname\{Bias\}\_\{L\}\\lesssim\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{h\}\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\+\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{t\}\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\.
We first bound the head term\. By Lemma[12](https://arxiv.org/html/2606.26617#Thmlemma12),
‖ℬ^L\(Ah⋆\)‖Σ,Σ2≲‖ℬ^L\(Ah⋆\)‖Σ^x,Σ^y2\.\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{h\}\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\\lesssim\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{h\}\}\)\\right\\\|\_\{\\widehat\{\\Sigma\}\_\{x\},\\widehat\{\\Sigma\}\_\{y\}\}^\{2\}\.By Lemma[14](https://arxiv.org/html/2606.26617#Thmlemma14),
‖ℬ^L\(Ah⋆\)‖Σ^x,Σ^y2≲1Leffγ‖Ah⋆‖F2\.\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{h\}\}\)\\right\\\|\_\{\\widehat\{\\Sigma\}\_\{x\},\\widehat\{\\Sigma\}\_\{y\}\}^\{2\}\\lesssim\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\\\|A^\{\\star\}\_\{\\mathrm\{h\}\}\\\|\_\{F\}^\{2\}\.Therefore,
‖ℬ^L\(Ah⋆\)‖Σ,Σ2≲1Leffγ‖PkA⋆Pk‖F2\.\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{h\}\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\\lesssim\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\\\|P\_\{k\}A^\{\\star\}P\_\{k\}\\\|\_\{F\}^\{2\}\.
We next bound the tail term\. Again by Lemma[12](https://arxiv.org/html/2606.26617#Thmlemma12),
‖ℬ^L\(At⋆\)‖Σ,Σ2≲‖ℬ^L\(At⋆\)‖Σ^x,Σ^y2\.\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{t\}\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\\lesssim\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{t\}\}\)\\right\\\|\_\{\\widehat\{\\Sigma\}\_\{x\},\\widehat\{\\Sigma\}\_\{y\}\}^\{2\}\.By the contraction part of Lemma[14](https://arxiv.org/html/2606.26617#Thmlemma14),
‖ℬ^L\(At⋆\)‖Σ^x,Σ^y2≤‖At⋆‖Σ^x,Σ^y2\.\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{t\}\}\)\\right\\\|\_\{\\widehat\{\\Sigma\}\_\{x\},\\widehat\{\\Sigma\}\_\{y\}\}^\{2\}\\leq\\\|A^\{\\star\}\_\{\\mathrm\{t\}\}\\\|\_\{\\widehat\{\\Sigma\}\_\{x\},\\widehat\{\\Sigma\}\_\{y\}\}^\{2\}\.Applying Lemma[12](https://arxiv.org/html/2606.26617#Thmlemma12)in the other direction gives
‖At⋆‖Σ^x,Σ^y2≲‖At⋆‖Σ,Σ2\.\\\|A^\{\\star\}\_\{\\mathrm\{t\}\}\\\|\_\{\\widehat\{\\Sigma\}\_\{x\},\\widehat\{\\Sigma\}\_\{y\}\}^\{2\}\\lesssim\\\|A^\{\\star\}\_\{\\mathrm\{t\}\}\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\.Thus
‖ℬ^L\(At⋆\)‖Σ,Σ2≲‖A⋆−PkA⋆Pk‖Σ,Σ2\.\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\_\{\\mathrm\{t\}\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\\lesssim\\\|A^\{\\star\}\-P\_\{k\}A^\{\\star\}P\_\{k\}\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\.
Combining the head and tail bounds yields
BiasL≲1Leffγ‖PkA⋆Pk‖F2\+‖A⋆−PkA⋆Pk‖Σ,Σ2\.\\operatorname\{Bias\}\_\{L\}\\lesssim\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\\\|P\_\{k\}A^\{\\star\}P\_\{k\}\\\|\_\{F\}^\{2\}\+\\\|A^\{\\star\}\-P\_\{k\}A^\{\\star\}P\_\{k\}\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\.
It remains to use Assumption[3](https://arxiv.org/html/2606.26617#Thmassumption3)\. Under this source condition,
CM=∑i=1Mκiμivivi⊤\.C\_\{M\}=\\sum\_\{i=1\}^\{M\}\\kappa\_\{i\}\\mu\_\{i\}v\_\{i\}v\_\{i\}^\{\\top\}\.Since
A⋆=Σ−1CMΣ−1,A^\{\\star\}=\\Sigma^\{\-1\}C\_\{M\}\\Sigma^\{\-1\},we have
A⋆=∑i=1Mκiμivivi⊤\.A^\{\\star\}=\\sum\_\{i=1\}^\{M\}\\frac\{\\kappa\_\{i\}\}\{\\mu\_\{i\}\}v\_\{i\}v\_\{i\}^\{\\top\}\.Therefore,
PkA⋆Pk=∑i≤kκiμivivi⊤,P\_\{k\}A^\{\\star\}P\_\{k\}=\\sum\_\{i\\leq k\}\\frac\{\\kappa\_\{i\}\}\{\\mu\_\{i\}\}v\_\{i\}v\_\{i\}^\{\\top\},and hence
‖PkA⋆Pk‖F2=∑i≤kκi2μi2\.\\\|P\_\{k\}A^\{\\star\}P\_\{k\}\\\|\_\{F\}^\{2\}=\\sum\_\{i\\leq k\}\\frac\{\\kappa\_\{i\}^\{2\}\}\{\\mu\_\{i\}^\{2\}\}\.Similarly,
A⋆−PkA⋆Pk=∑i\>kκiμivivi⊤\.A^\{\\star\}\-P\_\{k\}A^\{\\star\}P\_\{k\}=\\sum\_\{i\>k\}\\frac\{\\kappa\_\{i\}\}\{\\mu\_\{i\}\}v\_\{i\}v\_\{i\}^\{\\top\}\.Thus
‖A⋆−PkA⋆Pk‖Σ,Σ2=∑i\>kμi2\(κiμi\)2=∑i\>kκi2\.\\\|A^\{\\star\}\-P\_\{k\}A^\{\\star\}P\_\{k\}\\\|\_\{\\Sigma,\\Sigma\}^\{2\}=\\sum\_\{i\>k\}\\mu\_\{i\}^\{2\}\\left\(\\frac\{\\kappa\_\{i\}\}\{\\mu\_\{i\}\}\\right\)^\{2\}=\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\.Substituting these two identities into the general head–tail bound gives
BiasL≲1Leffγ∑i≤kκi2μi2\+∑i\>kκi2\.\\operatorname\{Bias\}\_\{L\}\\lesssim\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\\sum\_\{i\\leq k\}\\frac\{\\kappa\_\{i\}^\{2\}\}\{\\mu\_\{i\}^\{2\}\}\+\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\.This completes the proof\. ∎
### A general lower bound for the GD bias
Recall that the GD\-bias term is
BiasL:=12‖ℬ^L\(A⋆\)‖Σ,Σ2\.\\operatorname\{Bias\}\_\{L\}:=\\frac\{1\}\{2\}\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\.Fork∈\{1,…,M\}k\\in\\\{1,\\ldots,M\\\}, define
Pk:=∑i≤kvivi⊤,Qk:=I−Pk\.P\_\{k\}:=\\sum\_\{i\\leq k\}v\_\{i\}v\_\{i\}^\{\\top\},\\qquad Q\_\{k\}:=I\-P\_\{k\}\.We also define the diagonal tail subspace
𝒯k:=span\{vivi⊤:i\>k\}\.\\mathcal\{T\}\_\{k\}:=\\operatorname\{span\}\\\{v\_\{i\}v\_\{i\}^\{\\top\}:i\>k\\\}\.LetΠ𝒯k\\Pi\_\{\\mathcal\{T\}\_\{k\}\}denote the orthogonal projection onto𝒯k\\mathcal\{T\}\_\{k\}under theΣ,Σ\\Sigma,\\Sigma\-inner product\.
###### Lemma 3 \(General GD\-bias lower bound\)\.
Suppose Assumptions[2](https://arxiv.org/html/2606.26617#Thmassumption2)and[3](https://arxiv.org/html/2606.26617#Thmassumption3)hold, and suppose thatℰcov\(R\)\\mathcal\{E\}\_\{\\mathrm\{cov\}\}\(R\)occurs\. Let
ρ:=\(Leffγ\)−1/2\.\\rho:=\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{\-1/2\}\.Choosekksuch that
μk\+1≤c0ρ\\mu\_\{k\+1\}\\leq c\_\{0\}\\rhofor a sufficiently small absolute constantc0\>0c\_\{0\}\>0\. Define
Kh:=∑i≤kκivivi⊤,Kt:=∑i\>kκivivi⊤\.K\_\{\\mathrm\{h\}\}:=\\sum\_\{i\\leq k\}\\kappa\_\{i\}v\_\{i\}v\_\{i\}^\{\\top\},\\qquad K\_\{\\mathrm\{t\}\}:=\\sum\_\{i\>k\}\\kappa\_\{i\}v\_\{i\}v\_\{i\}^\{\\top\}\.Assume that
crel‖Kt‖F\+crel2ρ2‖Kh‖F≤η‖Kt‖Fc\_\{\\mathrm\{rel\}\}\\\|K\_\{\\mathrm\{t\}\}\\\|\_\{F\}\+c\_\{\\mathrm\{rel\}\}^\{2\}\\rho^\{2\}\\\|K\_\{\\mathrm\{h\}\}\\\|\_\{F\}\\leq\\eta\\\|K\_\{\\mathrm\{t\}\}\\\|\_\{F\}for a sufficiently small absolute constantη\>0\\eta\>0\. Then
BiasL≳∑i\>kκi2\.\\operatorname\{Bias\}\_\{L\}\\gtrsim\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\.
###### Proof\.
We work in the whitened coordinates
ZB:=Σ1/2BΣ1/2\.Z\_\{B\}:=\\Sigma^\{1/2\}B\\Sigma^\{1/2\}\.In these coordinates,
‖B‖Σ,Σ=‖ZB‖F\.\\\|B\\\|\_\{\\Sigma,\\Sigma\}=\\\|Z\_\{B\}\\\|\_\{F\}\.By Assumption[3](https://arxiv.org/html/2606.26617#Thmassumption3),
CM=∑i=1Mκiμivivi⊤,C\_\{M\}=\\sum\_\{i=1\}^\{M\}\\kappa\_\{i\}\\mu\_\{i\}v\_\{i\}v\_\{i\}^\{\\top\},and therefore
A⋆=Σ−1CMΣ−1=∑i=1Mκiμivivi⊤\.A^\{\\star\}=\\Sigma^\{\-1\}C\_\{M\}\\Sigma^\{\-1\}=\\sum\_\{i=1\}^\{M\}\\frac\{\\kappa\_\{i\}\}\{\\mu\_\{i\}\}v\_\{i\}v\_\{i\}^\{\\top\}\.Thus
K⋆:=Σ1/2A⋆Σ1/2=∑i=1Mκivivi⊤=Kh\+Kt,K\_\{\\star\}:=\\Sigma^\{1/2\}A^\{\\star\}\\Sigma^\{1/2\}=\\sum\_\{i=1\}^\{M\}\\kappa\_\{i\}v\_\{i\}v\_\{i\}^\{\\top\}=K\_\{\\mathrm\{h\}\}\+K\_\{\\mathrm\{t\}\},where
Kh:=∑i≤kκivivi⊤,Kt:=∑i\>kκivivi⊤\.K\_\{\\mathrm\{h\}\}:=\\sum\_\{i\\leq k\}\\kappa\_\{i\}v\_\{i\}v\_\{i\}^\{\\top\},\\qquad K\_\{\\mathrm\{t\}\}:=\\sum\_\{i\>k\}\\kappa\_\{i\}v\_\{i\}v\_\{i\}^\{\\top\}\.
We next rewrite the empirical and population Hessian operators in the whitened coordinates\. Define
$$\mathscr{H}(B):=\Sigma B\Sigma,\qquad\mathscr{B}_{L}:=\prod_{t=1}^{L}(I-\gamma_{t}\mathscr{H}).$$ Because $\Sigma^{1/2}(\Sigma B\Sigma)\Sigma^{1/2}=\Sigma Z_{B}\Sigma$, the population Hessian keeps the same form in the whitened coordinates:
$$\mathscr{H}(Z)=\Sigma Z\Sigma.$$ On the event $\mathcal{E}_{\mathrm{cov}}(R)$, we can write
Σ^x=Σ1/2\(I\+Ex\)Σ1/2,Σ^y=Σ1/2\(I\+Ey\)Σ1/2,\\widehat\{\\Sigma\}\_\{x\}=\\Sigma^\{1/2\}\(I\+E\_\{x\}\)\\Sigma^\{1/2\},\\qquad\\widehat\{\\Sigma\}\_\{y\}=\\Sigma^\{1/2\}\(I\+E\_\{y\}\)\\Sigma^\{1/2\},where
‖Ex‖∨‖Ey‖≤crelρ\.\\\|E\_\{x\}\\\|\\vee\\\|E\_\{y\}\\\|\\leq c\_\{\\mathrm\{rel\}\}\\rho\.Hence the empirical Hessian becomes
ℋ^w\(Z\)=Σ\(I\+Ex\)Z\(I\+Ey\)Σ\.\\widehat\{\\mathscr\{H\}\}\_\{\\mathrm\{w\}\}\(Z\)=\\Sigma\(I\+E\_\{x\}\)Z\(I\+E\_\{y\}\)\\Sigma\.Equivalently,
ℋ^w\(Z\)−ℋ\(Z\)=ΣExZΣ\+ΣZEyΣ\+ΣExZEyΣ\.\\widehat\{\\mathscr\{H\}\}\_\{\\mathrm\{w\}\}\(Z\)\-\\mathscr\{H\}\(Z\)=\\Sigma E\_\{x\}Z\\Sigma\+\\Sigma ZE\_\{y\}\\Sigma\+\\Sigma E\_\{x\}ZE\_\{y\}\\Sigma\.Let
ℬ^Lw:=∏t=1L\(I−γtℋ^w\)\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}:=\\prod\_\{t=1\}^\{L\}\(I\-\\gamma\_\{t\}\\widehat\{\\mathscr\{H\}\}\_\{\\mathrm\{w\}\}\)be the empirical filter in these whitened coordinates\. Then for any matrixBB, withZB=Σ1/2BΣ1/2Z\_\{B\}=\\Sigma^\{1/2\}B\\Sigma^\{1/2\},
$$\Sigma^{1/2}\left(\widehat{\Sigma}_{x}B\widehat{\Sigma}_{y}\right)\Sigma^{1/2}=\Sigma(I+E_{x})Z_{B}(I+E_{y})\Sigma=\widehat{\mathscr{H}}_{\mathrm{w}}(Z_{B}).$$ Therefore each one-step empirical GD map is conjugated by $B\mapsto\Sigma^{1/2}B\Sigma^{1/2}$:
Σ1/2\(I−γtΣ^x\(⋅\)Σ^y\)\(B\)Σ1/2=\(I−γtℋ^w\)\(ZB\)\.\\Sigma^\{1/2\}\\left\(I\-\\gamma\_\{t\}\\widehat\{\\Sigma\}\_\{x\}\(\\cdot\)\\widehat\{\\Sigma\}\_\{y\}\\right\)\(B\)\\Sigma^\{1/2\}=\\left\(I\-\\gamma\_\{t\}\\widehat\{\\mathscr\{H\}\}\_\{\\mathrm\{w\}\}\\right\)\(Z\_\{B\}\)\.Iterating this identity overt=1,…,Lt=1,\\ldots,Lgives
Σ1/2ℬ^L\(A⋆\)Σ1/2=ℬ^Lw\(K⋆\)\.\\Sigma^\{1/2\}\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\)\\Sigma^\{1/2\}=\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\star\}\)\.
We first lower\-bound the population evolution of the tail source\. SinceKtK\_\{\\mathrm\{t\}\}is diagonal in the eigenbasis ofΣ\\Sigma, the population operator acts coordinatewise:
ℋ\(vivi⊤\)=μi2vivi⊤\.\\mathscr\{H\}\(v\_\{i\}v\_\{i\}^\{\\top\}\)=\\mu\_\{i\}^\{2\}v\_\{i\}v\_\{i\}^\{\\top\}\.Therefore,
ℬL\(Kt\)=∑i\>kκiψL\(μi2\)vivi⊤,\\mathscr\{B\}\_\{L\}\(K\_\{\\mathrm\{t\}\}\)=\\sum\_\{i\>k\}\\kappa\_\{i\}\\psi\_\{L\}\(\\mu\_\{i\}^\{2\}\)v\_\{i\}v\_\{i\}^\{\\top\},where
ψL\(s\):=∏t=1L\(1−γts\)\.\\psi\_\{L\}\(s\):=\\prod\_\{t=1\}^\{L\}\(1\-\\gamma\_\{t\}s\)\.Sinceμk\+1≤c0ρ\\mu\_\{k\+1\}\\leq c\_\{0\}\\rho, for everyi\>ki\>k,
μi2≤c02ρ2=c02Leffγ\.\\mu\_\{i\}^\{2\}\\leq c\_\{0\}^\{2\}\\rho^\{2\}=\\frac\{c\_\{0\}^\{2\}\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\.Takingc0\>0c\_\{0\}\>0sufficiently small and using Assumption[2](https://arxiv.org/html/2606.26617#Thmassumption2), we have
ψL\(μi2\)2≥c\\psi\_\{L\}\(\\mu\_\{i\}^\{2\}\)^\{2\}\\geq cfor an absolute constantc\>0c\>0\. Hence
‖ℬL\(Kt\)‖F2=∑i\>kκi2ψL\(μi2\)2≳∑i\>kκi2\.\\\|\\mathscr\{B\}\_\{L\}\(K\_\{\\mathrm\{t\}\}\)\\\|\_\{F\}^\{2\}=\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\\psi\_\{L\}\(\\mu\_\{i\}^\{2\}\)^\{2\}\\gtrsim\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\.
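The step $\psi_L(\mu_i^2)^2\geq c$ rests on the elementary inequality $\prod_t(1-x_t)\geq 1-\sum_t x_t$ for $x_t\in[0,1]$, combined with $\sum_t\gamma_t\lesssim L_{\mathrm{eff}}\gamma$. The sketch below illustrates this with a hypothetical schedule that halves the stepsize after each of roughly $\log_2 L$ equal phases. The schedule, the function names, and the constants are assumptions made for illustration and are not taken from Assumption 2.

```python
import math

def halving_schedule(L, gamma):
    """Hypothetical schedule: about log2(L) equal phases, halving the stepsize each phase."""
    n_phases = max(1, int(math.log2(L)))
    L_eff = L // n_phases                      # steps per phase, playing the role of L_eff
    steps = [gamma / 2 ** (t // L_eff) for t in range(L)]
    return steps, L_eff

def psi(steps, s):
    """Scalar GD filter psi_L(s) = prod_t (1 - gamma_t * s)."""
    out = 1.0
    for g in steps:
        out *= 1.0 - g * s
    return out

L, gamma, c0 = 1024, 0.1, 0.3
steps, L_eff = halving_schedule(L, gamma)
s_max = c0 ** 2 / (L_eff * gamma)              # largest tail value of mu_i^2 allowed by the cutoff
# Prints: (sum of stepsizes) / (L_eff * gamma), psi at s_max, and the bound 1 - 2 c0^2.
print(sum(steps) / (L_eff * gamma), psi(steps, s_max), 1 - 2 * c0 ** 2)
```

For this schedule $\sum_t\gamma_t\leq 2L_{\mathrm{eff}}\gamma$, so $\psi_L(s)\geq 1-2c_0^2$ whenever $0\leq s\leq c_0^2/(L_{\mathrm{eff}}\gamma)$, and any $c_0<1/\sqrt{2}$ keeps the filter bounded away from zero on the tail.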
We now compare the empirical evolution with the population evolution after projecting onto the diagonal tail subspace\. Projection is useful because it removes the possible cancellation between the filtered head and filtered tail parts\. Indeed,
‖ℬ^Lw\(K⋆\)‖F≥‖Π𝒯kℬ^Lw\(K⋆\)‖F\.\\left\\\|\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\star\}\)\\right\\\|\_\{F\}\\geq\\left\\\|\\Pi\_\{\\mathcal\{T\}\_\{k\}\}\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\star\}\)\\right\\\|\_\{F\}\.By linearity,
Π𝒯kℬ^Lw\(K⋆\)=Π𝒯kℬ^Lw\(Kt\)\+Π𝒯kℬ^Lw\(Kh\)\.\\Pi\_\{\\mathcal\{T\}\_\{k\}\}\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\star\}\)=\\Pi\_\{\\mathcal\{T\}\_\{k\}\}\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\mathrm\{t\}\}\)\+\\Pi\_\{\\mathcal\{T\}\_\{k\}\}\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\mathrm\{h\}\}\)\.
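We first control the tail perturbation. Every $Z\in\mathcal{T}_{k}$ is a combination of the tail eigenprojectors, so that
$$Z=\sum_{i>k}z_{i}v_{i}v_{i}^{\top},\qquad Z\Sigma=\Sigma Z=\sum_{i>k}z_{i}\mu_{i}v_{i}v_{i}^{\top}$$ for some coefficients $z_{i}$. Since $\mu_{i}\leq\mu_{k+1}$ for $i>k$, we get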
‖ZΣ‖F≤μk\+1‖Z‖F≤c0ρ‖Z‖F,\\\|Z\\Sigma\\\|\_\{F\}\\leq\\mu\_\{k\+1\}\\\|Z\\\|\_\{F\}\\leq c\_\{0\}\\rho\\\|Z\\\|\_\{F\},and similarly
‖ΣZ‖F≤c0ρ‖Z‖F\.\\\|\\Sigma Z\\\|\_\{F\}\\leq c\_\{0\}\\rho\\\|Z\\\|\_\{F\}\.Using
‖Ex‖∨‖Ey‖≤crelρ,\\\|E\_\{x\}\\\|\\vee\\\|E\_\{y\}\\\|\\leq c\_\{\\mathrm\{rel\}\}\\rho,we obtain
‖\(ℋ^w−ℋ\)\(Z\)‖F≲crelρ2‖Z‖F\.\\\|\(\\widehat\{\\mathscr\{H\}\}\_\{\\mathrm\{w\}\}\-\\mathscr\{H\}\)\(Z\)\\\|\_\{F\}\\lesssim c\_\{\\mathrm\{rel\}\}\\rho^\{2\}\\\|Z\\\|\_\{F\}\.By the telescoping identity for products,
ℬ^Lw\(Kt\)−ℬL\(Kt\)=−∑t=1Lγtℬ^t\+1:Lw\(ℋ^w−ℋ\)ℬ1:t−1\(Kt\),\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\mathrm\{t\}\}\)\-\\mathscr\{B\}\_\{L\}\(K\_\{\\mathrm\{t\}\}\)=\-\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{t\+1:L\}\(\\widehat\{\\mathscr\{H\}\}\_\{\\mathrm\{w\}\}\-\\mathscr\{H\}\)\\mathscr\{B\}\_\{1:t\-1\}\(K\_\{\\mathrm\{t\}\}\),where empty products are interpreted as the identity\. The population iteratesℬ1:t−1\(Kt\)\\mathscr\{B\}\_\{1:t\-1\}\(K\_\{\\mathrm\{t\}\}\)remain in𝒯k\\mathcal\{T\}\_\{k\}, and the population filter is a contraction in Frobenius norm\. Moreover, by the eventℰcov\(R\)\\mathcal\{E\}\_\{\\mathrm\{cov\}\}\(R\)and the stepsize condition, the empirical filters are uniformly bounded in theΣ,Σ\\Sigma,\\Sigma\-norm, equivalently in the whitened Frobenius norm\. Hence
‖ℬ^Lw\(Kt\)−ℬL\(Kt\)‖F≲crelρ2\(∑t=1Lγt\)‖Kt‖F\.\\left\\\|\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\mathrm\{t\}\}\)\-\\mathscr\{B\}\_\{L\}\(K\_\{\\mathrm\{t\}\}\)\\right\\\|\_\{F\}\\lesssim c\_\{\\mathrm\{rel\}\}\\rho^\{2\}\\left\(\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\right\)\\\|K\_\{\\mathrm\{t\}\}\\\|\_\{F\}\.Since the geometrically decaying schedule satisfies
∑t=1Lγt≲Leffγ=ρ−2,\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\lesssim L\_\{\\mathrm\{eff\}\}\\gamma=\\rho^\{\-2\},we get
$$\left\|\widehat{\mathscr{B}}^{\mathrm{w}}_{L}(K_{\mathrm{t}})-\mathscr{B}_{L}(K_{\mathrm{t}})\right\|_{F}\lesssim c_{\mathrm{rel}}\|K_{\mathrm{t}}\|_{F}.$$ Since $\mathscr{B}_{L}(K_{\mathrm{t}})$ already lies in $\mathcal{T}_{k}$ and the orthogonal projection $\Pi_{\mathcal{T}_{k}}$ is non-expansive, it follows that
‖Π𝒯kℬ^Lw\(Kt\)−ℬL\(Kt\)‖F≲crel‖Kt‖F\.\\left\\\|\\Pi\_\{\\mathcal\{T\}\_\{k\}\}\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\mathrm\{t\}\}\)\-\\mathscr\{B\}\_\{L\}\(K\_\{\\mathrm\{t\}\}\)\\right\\\|\_\{F\}\\lesssim c\_\{\\mathrm\{rel\}\}\\\|K\_\{\\mathrm\{t\}\}\\\|\_\{F\}\.
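The telescoping identity used above is the standard one for a difference of operator products. As a sketch, write $\widehat{T}_t:=I-\gamma_t\widehat{\mathscr{H}}_{\mathrm{w}}$ and $T_t:=I-\gamma_t\mathscr{H}$ (shorthand introduced here, not in the original), with products ordered so that later steps act last:

```latex
\begin{aligned}
\widehat{T}_L\cdots\widehat{T}_1 - T_L\cdots T_1
 &= \sum_{t=1}^{L}\Big(\widehat{T}_L\cdots\widehat{T}_{t+1}\,\widehat{T}_t\,T_{t-1}\cdots T_1
      - \widehat{T}_L\cdots\widehat{T}_{t+1}\,T_t\,T_{t-1}\cdots T_1\Big)\\
 &= \sum_{t=1}^{L}\widehat{T}_L\cdots\widehat{T}_{t+1}\,\big(\widehat{T}_t - T_t\big)\,T_{t-1}\cdots T_1\\
 &= -\sum_{t=1}^{L}\gamma_t\,\widehat{\mathscr{B}}^{\mathrm{w}}_{t+1:L}\,
      \big(\widehat{\mathscr{H}}_{\mathrm{w}}-\mathscr{H}\big)\,\mathscr{B}_{1:t-1}.
\end{aligned}
```

The first line telescopes because the second term of the $t$-th summand equals the first term of the $(t+1)$-th summand. The last line uses $\widehat{T}_t-T_t=-\gamma_t(\widehat{\mathscr{H}}_{\mathrm{w}}-\mathscr{H})$. Each summand therefore carries exactly one perturbation factor, applied to a population iterate that stays in $\mathcal{T}_k$.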
It remains to control the leakage from the population head into the diagonal tail subspace\. Since the population operatorℋ\\mathscr\{H\}preserves diagonal coordinates, the population filter never mapsKhK\_\{\\mathrm\{h\}\}into𝒯k\\mathcal\{T\}\_\{k\}\. Hence all diagonal\-tail leakage fromKhK\_\{\\mathrm\{h\}\}is caused by the perturbationℋ^w−ℋ\\widehat\{\\mathscr\{H\}\}\_\{\\mathrm\{w\}\}\-\\mathscr\{H\}\.
The key observation is that first\-order perturbations cannot map a head diagonal component directly into a tail diagonal component\. Indeed, forKh=PkKhPkK\_\{\\mathrm\{h\}\}=P\_\{k\}K\_\{\\mathrm\{h\}\}P\_\{k\},
Π𝒯k\(ΣExKhΣ\)=0,Π𝒯k\(ΣKhEyΣ\)=0\.\\Pi\_\{\\mathcal\{T\}\_\{k\}\}\(\\Sigma E\_\{x\}K\_\{\\mathrm\{h\}\}\\Sigma\)=0,\\qquad\\Pi\_\{\\mathcal\{T\}\_\{k\}\}\(\\Sigma K\_\{\\mathrm\{h\}\}E\_\{y\}\\Sigma\)=0\.To land in the diagonal tail subspace, both the left and right indices must move from the head to the tail\. This requires either the second\-order termΣExKhEyΣ\\Sigma E\_\{x\}K\_\{\\mathrm\{h\}\}E\_\{y\}\\Sigma, or two first\-order perturbations at different times\. In either case, the leakage carries two perturbation factors\. Since each perturbation factor is bounded bycrelρc\_\{\\mathrm\{rel\}\}\\rho, and the two outside tail covariance factors contribute at mostμk\+12≲ρ2\\mu\_\{k\+1\}^\{2\}\\lesssim\\rho^\{2\}, the accumulated diagonal\-tail leakage over the whole effective horizon is bounded by
‖Π𝒯kℬ^Lw\(Kh\)‖F≲crel2ρ2‖Kh‖F\.\\left\\\|\\Pi\_\{\\mathcal\{T\}\_\{k\}\}\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\mathrm\{h\}\}\)\\right\\\|\_\{F\}\\lesssim c\_\{\\mathrm\{rel\}\}^\{2\}\\rho^\{2\}\\\|K\_\{\\mathrm\{h\}\}\\\|\_\{F\}\.More explicitly, this follows by applying the above telescoping expansion once for the second\-order term and twice for the pair of first\-order terms, using
∑t=1Lγt≲ρ−2,\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\lesssim\\rho^\{\-2\},and the fact that one left\-index leakage and one right\-index leakage are both necessary before a head diagonal coordinate can contribute to𝒯k\\mathcal\{T\}\_\{k\}\.
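The vanishing of the first-order leakage can be checked coordinate by coordinate. The following sketch assumes that, in the whitened coordinates, the coefficient of $Z$ along $v_jv_j^{\top}$ is the Frobenius pairing $\langle v_jv_j^{\top},Z\rangle_F=v_j^{\top}Zv_j$, and uses $\Sigma v_j=\mu_jv_j$ together with $K_{\mathrm{h}}v_j=0$ for $j>k$:

```latex
\begin{aligned}
\big\langle v_jv_j^{\top},\ \Sigma E_xK_{\mathrm{h}}\Sigma\big\rangle_F
 &= v_j^{\top}\Sigma E_xK_{\mathrm{h}}\Sigma v_j
  = \mu_j^{2}\,v_j^{\top}E_x\,(K_{\mathrm{h}}v_j) = 0,\\
\big\langle v_jv_j^{\top},\ \Sigma K_{\mathrm{h}}E_y\Sigma\big\rangle_F
 &= v_j^{\top}\Sigma K_{\mathrm{h}}E_y\Sigma v_j
  = \mu_j^{2}\,(v_j^{\top}K_{\mathrm{h}})\,E_yv_j = 0,
 \qquad j>k.
\end{aligned}
```

In each first-order term one side of $K_{\mathrm{h}}$ still meets a tail eigenvector directly, which annihilates it. Only a term with perturbations on both sides, such as $\Sigma E_xK_{\mathrm{h}}E_y\Sigma$, can have a nonzero component along $v_jv_j^{\top}$ with $j>k$.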
Combining the estimates, we obtain by the reverse triangle inequality
$$\begin{aligned}\left\|\Pi_{\mathcal{T}_{k}}\widehat{\mathscr{B}}^{\mathrm{w}}_{L}(K_{\star})\right\|_{F}&\geq\|\mathscr{B}_{L}(K_{\mathrm{t}})\|_{F}-\left\|\Pi_{\mathcal{T}_{k}}\widehat{\mathscr{B}}^{\mathrm{w}}_{L}(K_{\mathrm{t}})-\mathscr{B}_{L}(K_{\mathrm{t}})\right\|_{F}-\left\|\Pi_{\mathcal{T}_{k}}\widehat{\mathscr{B}}^{\mathrm{w}}_{L}(K_{\mathrm{h}})\right\|_{F}\\&\geq\|\mathscr{B}_{L}(K_{\mathrm{t}})\|_{F}-Cc_{\mathrm{rel}}\|K_{\mathrm{t}}\|_{F}-Cc_{\mathrm{rel}}^{2}\rho^{2}\|K_{\mathrm{h}}\|_{F}.\end{aligned}$$ By the assumed smallness condition
crel‖Kt‖F\+crel2ρ2‖Kh‖F≤η‖Kt‖F,c\_\{\\mathrm\{rel\}\}\\\|K\_\{\\mathrm\{t\}\}\\\|\_\{F\}\+c\_\{\\mathrm\{rel\}\}^\{2\}\\rho^\{2\}\\\|K\_\{\\mathrm\{h\}\}\\\|\_\{F\}\\leq\\eta\\\|K\_\{\\mathrm\{t\}\}\\\|\_\{F\},and by takingη\>0\\eta\>0sufficiently small, we get
‖Π𝒯kℬ^Lw\(K⋆\)‖F≳‖Kt‖F\.\\left\\\|\\Pi\_\{\\mathcal\{T\}\_\{k\}\}\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\star\}\)\\right\\\|\_\{F\}\\gtrsim\\\|K\_\{\\mathrm\{t\}\}\\\|\_\{F\}\.Since projection cannot increase the norm,
‖ℬ^Lw\(K⋆\)‖F≥‖Π𝒯kℬ^Lw\(K⋆\)‖F\.\\left\\\|\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\star\}\)\\right\\\|\_\{F\}\\geq\\left\\\|\\Pi\_\{\\mathcal\{T\}\_\{k\}\}\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\star\}\)\\right\\\|\_\{F\}\.Returning to the originalAA\-coordinates,
‖ℬ^L\(A⋆\)‖Σ,Σ2=‖ℬ^Lw\(K⋆\)‖F2≳‖Kt‖F2\.\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}=\\left\\\|\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{L\}\(K\_\{\\star\}\)\\right\\\|\_\{F\}^\{2\}\\gtrsim\\\|K\_\{\\mathrm\{t\}\}\\\|\_\{F\}^\{2\}\.Finally,
‖Kt‖F2=∑i\>kκi2\.\\\|K\_\{\\mathrm\{t\}\}\\\|\_\{F\}^\{2\}=\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\.Therefore,
BiasL=12‖ℬ^L\(A⋆\)‖Σ,Σ2≳∑i\>kκi2\.\\operatorname\{Bias\}\_\{L\}=\\frac\{1\}\{2\}\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(A^\{\\star\}\)\\right\\\|\_\{\\Sigma,\\Sigma\}^\{2\}\\gtrsim\\sum\_\{i\>k\}\\kappa\_\{i\}^\{2\}\.This completes the proof\. ∎
### Specific GD\-bias scaling
We now combine the GD-bias upper and lower bounds under the power-law notation

$$R := L_{\mathrm{eff}}\gamma, \qquad \delta := a - b.$$

By Lemma [11](https://arxiv.org/html/2606.26617#Thmlemma11), on the high-probability sketch event, the sketched covariance eigenvalues satisfy $\mu_i \asymp i^{-b}$. Define the upper-bound optimization cutoff

$$k_+ := \left\lfloor \min\left\{M,\, R^{1/(2b)}\right\} \right\rfloor.$$
###### Theorem 3\(Specific GD\-bias bounds\)\.
Suppose Assumptions [1](https://arxiv.org/html/2606.26617#Thmassumption1), [2](https://arxiv.org/html/2606.26617#Thmassumption2), and [3](https://arxiv.org/html/2606.26617#Thmassumption3) hold. Work on the high-probability sketch event of Lemma [11](https://arxiv.org/html/2606.26617#Thmlemma11), and suppose that $\mathcal{E}_{\mathrm{cov}}(R)$ occurs. Then

$$\operatorname{Bias}_L \lesssim \frac{1}{R}\sum_{i\leq k_+} i^{2b-2\delta} + \sum_{i>k_+} i^{-2\delta}.$$

For the lower bound, let $k_- < M$ be a cutoff satisfying

$$\mu_{k_-+1} \leq c_0 R^{-1/2}$$

and the smallness condition in Lemma [3](https://arxiv.org/html/2606.26617#Thmlemma3). Then

$$\operatorname{Bias}_L \gtrsim \sum_{i>k_-} i^{-2\delta}.$$

Consequently, if

$$\frac{1}{2} < \delta < b + \frac{1}{2}, \qquad R^{1/(2b)} \lesssim M,$$

then

$$\boxed{\operatorname{Bias}_L \asymp R^{\frac{1-2\delta}{2b}} = (L_{\mathrm{eff}}\gamma)^{\frac{1-2(a-b)}{2b}}.}$$

In the same unsaturated regime, at the boundary $\delta = b + \frac{1}{2}$, the upper bound gives

$$\operatorname{Bias}_L \lesssim R^{-1}\log R,$$

while the lower bound gives

$$\operatorname{Bias}_L \gtrsim R^{-1}.$$

If $\delta > b + \frac{1}{2}$, then

$$\operatorname{Bias}_L \lesssim R^{-1},$$

and any admissible lower cutoff $k_- \asymp R^{1/(2b)}$ gives

$$\operatorname{Bias}_L \gtrsim R^{\frac{1-2\delta}{2b}}.$$
###### Proof\.
The general GD-bias upper bound gives, for every $k \leq M$,

$$\operatorname{Bias}_L \lesssim \frac{1}{R}\sum_{i\leq k} \frac{\kappa_i^2}{\mu_i^2} + \sum_{i>k} \kappa_i^2.$$

Using Lemma [11](https://arxiv.org/html/2606.26617#Thmlemma11) and the lower-bound source condition,

$$\mu_i \asymp i^{-b}, \qquad \kappa_i \asymp i^{-\delta},$$

we obtain

$$\frac{\kappa_i^2}{\mu_i^2} \asymp i^{2b-2\delta}.$$

Choosing $k = k_+$ gives the displayed upper bound.

For the conditional lower bound, Lemma [3](https://arxiv.org/html/2606.26617#Thmlemma3) gives

$$\operatorname{Bias}_L \gtrsim \sum_{i>k_-} \kappa_i^2 \asymp \sum_{i>k_-} i^{-2\delta}.$$
It remains to evaluate the sums in the unsaturated regime. Assume $R^{1/(2b)} \lesssim M$. Then $k_+ \asymp R^{1/(2b)}$, and there is a cutoff $k_- \asymp R^{1/(2b)}$ with $k_- < M$ and $\mu_{k_-+1} \leq c_0 R^{-1/2}$. For $1/2 < \delta \leq b + 1/2$, the smallness condition in Lemma [3](https://arxiv.org/html/2606.26617#Thmlemma3) holds for this choice: $\|K_{\mathrm{h}}\|_F \lesssim 1$, $\|K_{\mathrm{t}}\|_F \asymp k_-^{1/2-\delta}$, and hence

$$R^{-1}\frac{\|K_{\mathrm{h}}\|_F}{\|K_{\mathrm{t}}\|_F} \lesssim R^{-1+\frac{\delta-1/2}{2b}} = o(1).$$

Taking $c_{\mathrm{rel}}$ sufficiently small verifies the remaining part of the smallness condition.
Since $\delta > 1/2$,

$$\sum_{i>k_-} i^{-2\delta} \asymp k_-^{1-2\delta} \asymp R^{\frac{1-2\delta}{2b}}.$$

For the head sum,

$$\frac{1}{R}\sum_{i\leq k_+} i^{2b-2\delta}$$

has three regimes. If $2b - 2\delta > -1$, equivalently

$$\delta < b + \frac{1}{2},$$

then

$$\sum_{i\leq k_+} i^{2b-2\delta} \asymp k_+^{2b-2\delta+1},$$

and therefore

$$\frac{1}{R}\sum_{i\leq k_+} i^{2b-2\delta} \asymp R^{\frac{1-2\delta}{2b}}.$$

Thus the upper and lower bounds match.

If $\delta = b + \frac{1}{2}$, then

$$\sum_{i\leq k_+} i^{-1} \asymp \log k_+,$$

so the upper bound is

$$\operatorname{Bias}_L \lesssim R^{-1}\log R,$$

whereas the tail lower bound gives

$$\operatorname{Bias}_L \gtrsim R^{-1}.$$

If $\delta > b + \frac{1}{2}$, then

$$\sum_{i\leq k_+} i^{2b-2\delta} \asymp 1,$$

and hence

$$\operatorname{Bias}_L \lesssim R^{-1}.$$

Whenever the lower-bound admissibility condition holds with $k_- \asymp R^{1/(2b)}$, the lower bound is

$$\operatorname{Bias}_L \gtrsim R^{\frac{1-2\delta}{2b}}.$$

This completes the proof. ∎
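The matched exponent in the regime $1/2 < \delta < b + 1/2$ can be checked numerically. The following is a minimal sketch, not part of the analysis: it replaces the $\asymp$ relations by exact power laws, truncates the infinite tail sum at a hypothetical cutoff `D`, and uses illustrative values of `b`, `delta`, and `M`. It evaluates the upper-bound expression at $k = k_+$ and fits its log-log slope in $R$.

```python
import numpy as np

def bias_upper_bound(R, b, delta, M, D=10**6):
    """Evaluate (1/R) * sum_{i<=k+} i^(2b-2delta) + sum_{i>k+} i^(-2delta).

    The infinite tail sum is truncated at D (an assumption of this sketch).
    """
    k_plus = int(np.floor(min(M, R ** (1.0 / (2.0 * b)))))
    i = np.arange(1, D + 1, dtype=float)
    head = np.sum(i[:k_plus] ** (2 * b - 2 * delta)) / R
    tail = np.sum(i[k_plus:] ** (-2 * delta))
    return head + tail

# Illustrative parameters with 1/2 < delta < b + 1/2 and R^(1/(2b)) << M.
b, delta, M = 1.0, 1.0, 10**5
Rs = np.logspace(2, 6, 9)
vals = np.array([bias_upper_bound(R, b, delta, M) for R in Rs])

# Slope of log(bound) against log(R), compared with (1 - 2*delta) / (2*b).
slope = np.polyfit(np.log(Rs), np.log(vals), 1)[0]
predicted = (1 - 2 * delta) / (2 * b)
print(f"fitted slope {slope:.3f}, predicted exponent {predicted:.3f}")
```

The fitted slope only approximates the predicted exponent because of the floor in $k_+$ and the finite truncation; it says nothing about the constants hidden in $\lesssim$.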
## Appendix E Variance of Normal GD
### A general upper bound for the GD variance
Using the empirical residual $\widehat{E} := \widehat{C} - \widehat{\Sigma}_x A^\star \widehat{\Sigma}_y$ and the empirical GD variance filter $\widehat{\mathscr{V}}_L$, the GD-variance term is

$$\operatorname{Var}_L := \frac{1}{2}\mathbb{E}\left[\left\|\widehat{\mathscr{V}}_L(\widehat{E})\right\|_{\Sigma,\Sigma}^2\right].$$
###### Lemma 4\(General GD\-variance upper bound\)\.
Suppose Assumptions [2](https://arxiv.org/html/2606.26617#Thmassumption2) and [3](https://arxiv.org/html/2606.26617#Thmassumption3) hold, and suppose that $\mathcal{E}_{\mathrm{cov}}(R)$ occurs. Suppose also $N \gtrsim M$. Then

$$\operatorname{Var}_L \lesssim \frac{1}{N}\sum_{i,j=1}^{M} \min\{1, (R\mu_i\mu_j)^2\}.$$
###### Proof\.
Let $\mathcal{L}(B) := \Sigma^{1/2} B \Sigma^{1/2}$ and define the whitened empirical variance filter

$$\widehat{\mathscr{V}}^{\mathrm{w}}_L := \mathcal{L}\,\widehat{\mathscr{V}}_L\,\mathcal{L}^{-1}.$$

Let

$$\widehat{Z} := \mathcal{L}(\widehat{E}) = \Sigma^{1/2}\widehat{E}\,\Sigma^{1/2}.$$

Since

$$\|B\|_{\Sigma,\Sigma} = \|\Sigma^{1/2} B \Sigma^{1/2}\|_F,$$

we can write, in whitened coordinates,

$$\left\|\widehat{\mathscr{V}}_L(\widehat{E})\right\|_{\Sigma,\Sigma}^2 = \left\|\widehat{\mathscr{V}}^{\mathrm{w}}_L(\widehat{Z})\right\|_F^2.$$

By Lemma [20](https://arxiv.org/html/2606.26617#Thmlemma20), the empirical and population variance filters are comparable in whitened coordinates. Hence

$$\operatorname{Var}_L \lesssim \mathbb{E}\left[\left\|\mathscr{V}_L(\widehat{Z})\right\|_F^2\right],$$

where

$$\mathscr{V}_L := \sum_{t=1}^{L} \gamma_t \prod_{s=t+1}^{L} (I - \gamma_s\mathscr{H}), \qquad \mathscr{H}(Z) := \Sigma Z \Sigma.$$

The population filter is diagonal in the basis $\{v_i v_j^\top\}_{i,j=1}^{M}$. Writing

$$g_L(s) := \sum_{t=1}^{L} \gamma_t \prod_{r=t+1}^{L} (1 - \gamma_r s),$$

we have

$$\mathscr{V}_L(v_i v_j^\top) = g_L(\mu_i\mu_j)\, v_i v_j^\top.$$

Therefore,

$$\mathbb{E}\left[\left\|\mathscr{V}_L(\widehat{Z})\right\|_F^2\right] = \sum_{i,j=1}^{M} g_L(\mu_i\mu_j)^2\, \mathbb{E}\left[\left\langle \widehat{Z}, v_i v_j^\top \right\rangle_F^2\right].$$

By Lemma [17](https://arxiv.org/html/2606.26617#Thmlemma17), applied to the deterministic matrix $W = v_i v_j^\top$,

$$\mathbb{E}\left[\left\langle \widehat{Z}, v_i v_j^\top \right\rangle_F^2\right] \lesssim \frac{1}{N}\left\|\Sigma v_i v_j^\top \Sigma\right\|_F^2 = \frac{\mu_i^2\mu_j^2}{N}.$$

Combining this with Lemma [15](https://arxiv.org/html/2606.26617#Thmlemma15) gives

$$\operatorname{Var}_L \lesssim \frac{1}{N}\sum_{i,j=1}^{M} \min\{1, (R\mu_i\mu_j)^2\}.$$

This proves the claim. ∎
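The diagonalization step in the proof above, namely that the population filter acts on $v_i v_j^\top$ as multiplication by $g_L(\mu_i\mu_j)$, can be checked directly on a small example. The sketch below is illustrative only: the covariance, its spectrum, and the geometrically decaying stepsizes are hypothetical choices, and the filter is applied through the recursion $A_t = (I - \gamma_t\mathscr{H})(A_{t-1}) + \gamma_t Z$ with $A_0 = 0$, which unrolls to the defining sum.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L = 6, 40

# Hypothetical covariance with known eigenpairs (mu_i, v_i).
V, _ = np.linalg.qr(rng.standard_normal((M, M)))
mu = np.arange(1, M + 1, dtype=float) ** -1.5
Sigma = V @ np.diag(mu) @ V.T

# Hypothetical geometrically decaying stepsizes.
gammas = 0.5 * 0.9 ** np.arange(L)

def variance_filter(Z):
    """Apply sum_t gamma_t prod_{s>t} (I - gamma_s H) to Z, with H(Z) = Sigma Z Sigma."""
    out = np.zeros_like(Z)
    for g in gammas:                       # t = 1, ..., L
        out = out - g * (Sigma @ out @ Sigma) + g * Z
    return out

def g_L(s):
    """Scalar filter g_L(s), built by the same recursion."""
    out = 0.0
    for g in gammas:
        out = (1.0 - g * s) * out + g
    return out

# Compare the operator and scalar filters on one basis element v_i v_j^T.
i, j = 1, 4
Z = np.outer(V[:, i], V[:, j])
max_err = np.max(np.abs(variance_filter(Z) - g_L(mu[i] * mu[j]) * Z))
print(f"max deviation: {max_err:.2e}")
```

The deviation is at the level of floating-point round-off, which reflects that $\mathscr{H}(v_i v_j^\top) = \mu_i\mu_j\, v_i v_j^\top$ holds exactly.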
### A general lower bound for the GD variance
Recall that the GD-variance term is

$$\operatorname{Var}_L := \frac{1}{2}\mathbb{E}\left[\left\|\widehat{\mathscr{V}}_L(\widehat{E})\right\|_{\Sigma,\Sigma}^2\right].$$
###### Lemma 5\(General GD\-variance lower bound\)\.
Suppose Assumptions [2](https://arxiv.org/html/2606.26617#Thmassumption2) and [3](https://arxiv.org/html/2606.26617#Thmassumption3) hold, and suppose that $\mathcal{E}_{\mathrm{cov}}(R)$ occurs. Suppose that there exist an integer $i_0 \geq 1$ and a constant $c_\kappa \in (0,1)$ such that

$$\kappa_i \leq 1 - c_\kappa, \qquad i \geq i_0.$$

Then

$$\operatorname{Var}_L \gtrsim \frac{1}{N}\sum_{\substack{i,j\leq M \\ i,j\geq i_0 \\ i\neq j}} \min\{1, (R\mu_i\mu_j)^2\}.$$
###### Proof\.
We work in whitened coordinates. For any matrix $B$, define

$$Z_B := \Sigma^{1/2} B \Sigma^{1/2}.$$

Then

$$\|B\|_{\Sigma,\Sigma} = \|Z_B\|_F.$$

Let

$$\widehat{Z} := \Sigma^{1/2}\widehat{E}\,\Sigma^{1/2}.$$

Let $\widehat{\mathscr{V}}^{\mathrm{w}}_L$ denote the empirical variance filter in these whitened coordinates. Then

$$\left\|\widehat{\mathscr{V}}_L(\widehat{E})\right\|_{\Sigma,\Sigma}^2 = \left\|\widehat{\mathscr{V}}^{\mathrm{w}}_L(\widehat{Z})\right\|_F^2.$$

Therefore,

$$\operatorname{Var}_L = \frac{1}{2}\mathbb{E}\left[\left\|\widehat{\mathscr{V}}^{\mathrm{w}}_L(\widehat{Z})\right\|_F^2\right].$$

By Lemma [20](https://arxiv.org/html/2606.26617#Thmlemma20),

$$\mathbb{E}\left[\left\|\widehat{\mathscr{V}}^{\mathrm{w}}_L(\widehat{Z})\right\|_F^2\right] \gtrsim \mathbb{E}\left[\left\|\mathscr{V}_L(\widehat{Z})\right\|_F^2\right],$$

where

$$\mathscr{V}_L := \sum_{t=1}^{L} \gamma_t \prod_{s=t+1}^{L} (I - \gamma_s\mathscr{H}), \qquad \mathscr{H}(Z) := \Sigma Z \Sigma.$$

Since $\mathscr{H}$ is diagonal in the basis

$$\{v_i v_j^\top : 1 \leq i,j \leq M\},$$

with eigenvalue

$$s_{ij} := \mu_i\mu_j$$

on direction $v_i v_j^\top$, the population variance filter satisfies

$$\mathscr{V}_L(v_i v_j^\top) = g_L(s_{ij})\, v_i v_j^\top,$$

where

$$g_L(s) := \sum_{t=1}^{L} \gamma_t \prod_{r=t+1}^{L} (1 - \gamma_r s).$$

Therefore,

$$\begin{aligned}
\mathbb{E}\left[\left\|\mathscr{V}_L(\widehat{Z})\right\|_F^2\right]
&= \sum_{i,j=1}^{M} g_L(\mu_i\mu_j)^2\, \mathbb{E}[\widehat{Z}_{ij}^2] \\
&\geq \sum_{\substack{i,j\leq M \\ i,j\geq i_0 \\ i\neq j}} g_L(\mu_i\mu_j)^2\, \mathbb{E}[\widehat{Z}_{ij}^2].
\end{aligned}$$

By Lemma [19](https://arxiv.org/html/2606.26617#Thmlemma19), for $i,j \geq i_0$ and $i \neq j$,

$$\mathbb{E}[\widehat{Z}_{ij}^2] \gtrsim \frac{\mu_i^2\mu_j^2}{N}.$$

Hence

$$\mathbb{E}\left[\left\|\mathscr{V}_L(\widehat{Z})\right\|_F^2\right] \gtrsim \frac{1}{N}\sum_{\substack{i,j\leq M \\ i,j\geq i_0 \\ i\neq j}} g_L(\mu_i\mu_j)^2 \mu_i^2\mu_j^2.$$

By Lemma [18](https://arxiv.org/html/2606.26617#Thmlemma18),

$$g_L(\mu_i\mu_j)^2 \mu_i^2\mu_j^2 \gtrsim \min\{1, (R\mu_i\mu_j)^2\}.$$

Therefore,

$$\mathbb{E}\left[\left\|\mathscr{V}_L(\widehat{Z})\right\|_F^2\right] \gtrsim \frac{1}{N}\sum_{\substack{i,j\leq M \\ i,j\geq i_0 \\ i\neq j}} \min\{1, (R\mu_i\mu_j)^2\}.$$

Combining this with the empirical-to-population filter comparison gives

$$\operatorname{Var}_L \gtrsim \frac{1}{N}\sum_{\substack{i,j\leq M \\ i,j\geq i_0 \\ i\neq j}} \min\{1, (R\mu_i\mu_j)^2\}.$$

This proves the claim. ∎
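Both variance bounds rest on the scalar comparison $s\,g_L(s) \asymp \min\{1, Rs\}$, which after squaring gives $g_L(\mu_i\mu_j)^2\mu_i^2\mu_j^2 \asymp \min\{1, (R\mu_i\mu_j)^2\}$. As a hedged illustration, the sketch below checks this for a constant stepsize, which is a simplifying assumption (the analysis uses a decaying schedule) under which $L_{\mathrm{eff}} = L$ is taken as given and $s\,g_L(s) = 1 - (1-\gamma s)^L$. In that special case the ratio to $\min\{1, Rs\}$ lies in $[1 - e^{-1}, 1]$ by Bernoulli's inequality and $1 - e^{-x} \geq (1 - e^{-1})\min\{1, x\}$.

```python
import numpy as np

gamma, L = 0.1, 500
R = gamma * L                              # assumes L_eff = L for a constant stepsize

def g_L(s):
    """g_L(s) = sum_{t=1}^L gamma * (1 - gamma*s)^(L-t) for a constant stepsize."""
    powers = (1.0 - gamma * s) ** np.arange(L)
    return gamma * np.sum(powers)

# Grid of eigenvalue products s = mu_i * mu_j in (0, 1], so gamma * s <= 1.
s_grid = np.logspace(-6, 0, 200)
filtered = np.array([s * g_L(s) for s in s_grid])
target = np.minimum(1.0, R * s_grid)
ratio = filtered / target
print(f"s*g_L(s) / min(1, R*s) ranges over [{ratio.min():.3f}, {ratio.max():.3f}]")
```

The constants for the decaying schedule in the analysis may differ; only the two-sided shape of the comparison is illustrated here.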
### Specific GD\-variance scaling
We now specialize the general GD-variance bounds using the power-law notation $R := L_{\mathrm{eff}}\gamma$ and $T := R^{1/b}$. By Lemma [11](https://arxiv.org/html/2606.26617#Thmlemma11), on the high-probability sketch event, $\mu_i \asymp i^{-b}$. Define the product effective dimension

$$d_{\mathrm{prod}}(R,M) := \sum_{i,j=1}^{M} \min\{1, (R\mu_i\mu_j)^2\}.$$

We also use the piecewise product scale

$$D_\times(R,M) := \begin{cases} T\log(eT), & 1 \leq T \leq M, \\ T\left(1 + \log\dfrac{M^2}{T}\right), & M < T < M^2, \\ M^2, & T \geq M^2. \end{cases}$$

By Lemma [11](https://arxiv.org/html/2606.26617#Thmlemma11),

$$R\mu_i\mu_j \asymp R(ij)^{-b}.$$

Thus

$$d_{\mathrm{prod}}(R,M) \asymp \sum_{i,j=1}^{M} \min\{1, R^2(ij)^{-2b}\}.$$
###### Theorem 4\(Specific GD\-variance bounds\)\.
Suppose Assumptions [1](https://arxiv.org/html/2606.26617#Thmassumption1), [2](https://arxiv.org/html/2606.26617#Thmassumption2), and [3](https://arxiv.org/html/2606.26617#Thmassumption3) hold. Work on the high-probability sketch event of Lemma [11](https://arxiv.org/html/2606.26617#Thmlemma11), and suppose that $\mathcal{E}_{\mathrm{cov}}(R)$ occurs. Suppose also that $N \gtrsim M$. Then

$$\operatorname{Var}_L \lesssim \frac{1}{N} d_{\mathrm{prod}}(R,M).$$

Moreover,

$$d_{\mathrm{prod}}(R,M) \lesssim D_\times(R,M),$$

and therefore

$$\boxed{\operatorname{Var}_L \lesssim \frac{D_\times(R,M)}{N}.}$$

Furthermore, assume the tail coefficients are bounded away from one: there exist fixed constants $i_0 \geq 1$ and $c_\kappa \in (0,1)$ such that

$$\kappa_i \leq 1 - c_\kappa, \qquad i \geq i_0.$$

If $T \gtrsim 1$ and $M$ is sufficiently large relative to $i_0$ (as required by Lemma [21](https://arxiv.org/html/2606.26617#Thmlemma21) in the proof), then

$$\operatorname{Var}_L \gtrsim \frac{D_\times(R,M)}{N}.$$

Consequently,

$$\boxed{\operatorname{Var}_L \asymp \frac{D_\times(R,M)}{N}.}$$
###### Proof\.
The general GD-variance upper bound in Lemma [4](https://arxiv.org/html/2606.26617#Thmlemma4) gives

$$\operatorname{Var}_L \lesssim \frac{1}{N}\sum_{i,j=1}^{M} \min\{1, (R\mu_i\mu_j)^2\} = \frac{1}{N} d_{\mathrm{prod}}(R,M).$$

It remains to estimate $d_{\mathrm{prod}}(R,M)$.

Using Lemma [11](https://arxiv.org/html/2606.26617#Thmlemma11), we have

$$d_{\mathrm{prod}}(R,M) \lesssim \sum_{i,j=1}^{M} \min\{1, R^2(ij)^{-2b}\}.$$

Since $T = R^{1/b}$,

$$R^2(ij)^{-2b} = \left(\frac{T}{ij}\right)^{2b}.$$

We split the sum into the region $ij \leq T$ and the region $ij > T$:

$$d_{\mathrm{prod}}(R,M) \lesssim \underbrace{\sum_{\substack{i,j\leq M \\ ij\leq T}} 1}_{S_1} + \underbrace{\sum_{\substack{i,j\leq M \\ ij>T}} \left(\frac{T}{ij}\right)^{2b}}_{S_2}.$$
We first bound $S_1$. If $T \leq M$, then

$$S_1 \leq \sum_{i\leq T} \frac{T}{i} \lesssim T\log(eT).$$

If $M < T < M^2$, then

$$S_1 \leq \sum_{i\leq T/M} M + \sum_{T/M<i\leq M} \frac{T}{i} \lesssim T + T\log\frac{M^2}{T}.$$

If $T \geq M^2$, then $S_1 \leq M^2$. Hence $S_1 \lesssim D_\times(R,M)$.

We next bound $S_2$. Since $b > 1/2$, we have $2b > 1$. If $T \leq M$, then

$$\begin{aligned}
S_2 &\lesssim T^{2b}\sum_{i\leq T} i^{-2b}\left(\frac{T}{i}\right)^{1-2b} + T^{2b}\sum_{i>T} i^{-2b} \\
&\lesssim T\sum_{i\leq T}\frac{1}{i} + T \lesssim T\log(eT).
\end{aligned}$$

If $M < T < M^2$, the tail region is nonempty only for $i > T/M$, and therefore

$$S_2 \lesssim T^{2b}\sum_{T/M<i\leq M} i^{-2b}\left(\frac{T}{i}\right)^{1-2b} = T\sum_{T/M<i\leq M}\frac{1}{i} \lesssim T\log\frac{M^2}{T}.$$

If $T \geq M^2$, then $S_2 = 0$. Combining the estimates for $S_1$ and $S_2$, we obtain

$$d_{\mathrm{prod}}(R,M) \lesssim D_\times(R,M).$$

Thus

$$\operatorname{Var}_L \lesssim \frac{D_\times(R,M)}{N}.$$
We now prove the lower bound. By Lemma [5](https://arxiv.org/html/2606.26617#Thmlemma5),

$$\operatorname{Var}_L \gtrsim \frac{1}{N}\sum_{\substack{i,j\leq M \\ i,j\geq i_0 \\ i\neq j}} \min\{1, (R\mu_i\mu_j)^2\}.$$

By Lemma [21](https://arxiv.org/html/2606.26617#Thmlemma21), if $T \gtrsim 1$ and $M$ is sufficiently large relative to $i_0$, then

$$\sum_{\substack{i,j\leq M \\ i,j\geq i_0 \\ i\neq j}} \min\{1, (R\mu_i\mu_j)^2\} \gtrsim D_\times(R,M).$$

Therefore,

$$\operatorname{Var}_L \gtrsim \frac{D_\times(R,M)}{N}.$$

Combining the upper and lower bounds proves

$$\operatorname{Var}_L \asymp \frac{D_\times(R,M)}{N}.$$

∎
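To see the three regimes of the product scale at work, one can compute $d_{\mathrm{prod}}(R,M)$ by brute force and compare it with $D_\times(R,M)$. The sketch below is only an illustration under stated assumptions: it takes the eigenvalues to be exactly $\mu_i = i^{-b}$ (the analysis only has $\mu_i \asymp i^{-b}$), and the values of `M`, `b`, and `R` are arbitrary choices, one per regime of $T = R^{1/b}$.

```python
import numpy as np

def d_prod(R, M, b):
    """Product effective dimension with exact power-law eigenvalues mu_i = i^(-b)."""
    mu = np.arange(1, M + 1, dtype=float) ** (-b)
    prod = R * np.outer(mu, mu)
    return np.sum(np.minimum(1.0, prod ** 2))

def D_times(R, M, b):
    """Piecewise product scale with T = R^(1/b); assumes T >= 1."""
    T = R ** (1.0 / b)
    if T <= M:
        return T * np.log(np.e * T)
    if T < M ** 2:
        return T * (1.0 + np.log(M ** 2 / T))
    return float(M ** 2)

M, b = 200, 1.0
ratios = {}
for R in (50.0, 2000.0, 1e5):              # T <= M, M < T < M^2, T >= M^2
    ratios[R] = d_prod(R, M, b) / D_times(R, M, b)
    print(f"R = {R:g}: d_prod / D_times = {ratios[R]:.2f}")
```

In the saturated regime $T \geq M^2$ every term of the double sum equals one, so the ratio is exactly one; in the other two regimes it stays of constant order, consistent with the two-sided bound.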
## Appendix F Auxiliary Lemmas
###### Lemma 6\(High\-probability covariance event\)\.
Fix the sketch matrix $S$ and assume $\Sigma = SHS^\top$ is positive definite. Let

$$R := L_{\mathrm{eff}}\gamma.$$

There exist absolute constants $C, c > 0$ such that, for every $t \geq 0$, conditional on $S$,

$$\max_{\sharp\in\{x,y\}} \left\|\Sigma^{-1/2}\widehat{\Sigma}_\sharp\Sigma^{-1/2} - I\right\| \leq C\left(\sqrt{\frac{M+t}{N}} + \frac{M+t}{N}\right)$$

with probability at least $1 - 2\exp(-ct)$. Consequently, if

$$N \geq CR(M+t),$$

then $\mathcal{E}_{\mathrm{cov}}(R)$ holds with probability at least $1 - 2\exp(-ct)$ after increasing $C$ by a constant depending only on $c_{\mathrm{rel}}$. In particular, taking $t \asymp M$, $\mathcal{E}_{\mathrm{cov}}(R)$ holds with probability at least $1 - \exp(-\Omega(M))$ whenever

$$N \gtrsim RM, \qquad \text{equivalently} \qquad R \lesssim N/M.$$
###### Proof\.
Conditional on $S$, the whitened variables

$$\xi := \Sigma^{-1/2}\widetilde{x}, \qquad \eta := \Sigma^{-1/2}\widetilde{y}$$

have standard Gaussian marginal distributions in $\mathbb{R}^M$. Hence

$$\Sigma^{-1/2}\widehat{\Sigma}_x\Sigma^{-1/2} = \frac{1}{N}\sum_{n=1}^{N}\xi_n\xi_n^\top, \qquad \Sigma^{-1/2}\widehat{\Sigma}_y\Sigma^{-1/2} = \frac{1}{N}\sum_{n=1}^{N}\eta_n\eta_n^\top.$$

The standard Gaussian sample-covariance concentration bound gives, for each $\sharp \in \{x,y\}$,

$$\left\|\Sigma^{-1/2}\widehat{\Sigma}_\sharp\Sigma^{-1/2} - I\right\| \leq C\left(\sqrt{\frac{M+t}{N}} + \frac{M+t}{N}\right)$$

with probability at least $1 - \exp(-ct)$. A union bound over the two marginals proves the first display. If $N \geq CR(M+t)$, then the first term is at most a constant multiple of $R^{-1/2}$, while the second is no larger than a constant multiple of $R^{-1}$, hence also of $R^{-1/2}$ in the non-degenerate regime $R \gtrsim 1$. Choosing the constant in the sample-size condition sufficiently large relative to $c_{\mathrm{rel}}$ yields $\mathcal{E}_{\mathrm{cov}}(R)$. ∎
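The concentration rate used in this proof is easy to observe in simulation. The following minimal sketch draws whitened Gaussian samples directly (so $\Sigma = I$ without loss of generality), forms the sample covariance, and compares its operator-norm deviation from the identity with the $t = 0$ rate $\sqrt{M/N} + M/N$. The dimensions, the seed, and the constant $3$ used in the comparison are illustrative assumptions, not values from the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 50, 5000

# Whitened samples: rows are i.i.d. N(0, I_M).
xi = rng.standard_normal((N, M))
cov_hat = xi.T @ xi / N

# Operator-norm deviation of the whitened sample covariance from the identity.
deviation = np.linalg.norm(cov_hat - np.eye(M), ord=2)
rate = np.sqrt(M / N) + M / N
print(f"deviation {deviation:.3f}, rate {rate:.3f}, ratio {deviation / rate:.2f}")
```

With $N \gtrsim RM$ the rate is of order $R^{-1/2}$, which is the scale $\rho$ appearing in the covariance event.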
###### Lemma 7 (Tail covariance concentration; Lin et al. [2024](https://arxiv.org/html/2606.26617#bib.bib15), Lemma G.2).
Let $H = \operatorname{diag}(h_1, h_2, \ldots)$ be a diagonal positive semidefinite operator with non-increasing diagonal entries. Let $S \in \mathbb{R}^{M\times D}$ have entries $S_{ij} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, 1/M)$. For any integer $k \geq 0$, define

$$\Sigma_{>k} := S_{k:\infty} H_{k:\infty} S_{k:\infty}^\top.$$

Then, with probability at least $1 - \exp(-\Omega(M))$,

$$\left\|\Sigma_{>k} - \frac{\sum_{i>k} h_i}{M} I_M\right\| \lesssim h_{k+1} + \sqrt{\frac{\sum_{i>k} h_i^2}{M}}.$$

Consequently,

$$\|\Sigma_{>k}\| \lesssim \beta_k, \qquad \beta_k := \frac{\sum_{i>k} h_i}{M} + h_{k+1} + \sqrt{\frac{\sum_{i>k} h_i^2}{M}}.$$

Moreover, if the tail rank is at least $2M$, then, with probability at least $1 - \exp(-\Omega(M))$,

$$\mu_{\min}(\Sigma_{>k}) \gtrsim h_{k+2M}.$$
###### Lemma 8\(Head Gaussian covariance concentration\)\.
Let $S_{0:k} \in \mathbb{R}^{M\times k}$ be the first $k$ columns of a Gaussian sketching matrix with entries $S_{ij} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, 1/M)$. There exists an absolute constant $c_0 \in (0,1)$ such that, whenever $k \leq c_0 M$, with probability at least $1 - \exp(-\Omega(M))$,

$$c I_k \preceq S_{0:k}^\top S_{0:k} \preceq C I_k,$$

where $c, C > 0$ are absolute constants.
###### Lemma 9\(Head–tail block identities\)\.
Let

$$\Sigma := SHS^\top, \qquad P := H^{1/2}S^\top\Sigma^{-1}SH^{1/2}, \qquad \Delta_P := P - I.$$

Fix an integer $k$ such that

$$\Sigma_{>k} := S_{k:\infty}H_{k:\infty}S_{k:\infty}^\top$$

is invertible. Split the coordinates into $0{:}k$ and $k{:}\infty$, and write

$$\Delta_P = \begin{pmatrix} \mathsf{U} & \mathsf{V} \\ \mathsf{V}^\top & \mathsf{W} \end{pmatrix}.$$

Then

$$\mathsf{W}^2 + \mathsf{V}^\top\mathsf{V} = -\mathsf{W}, \qquad 0 \preceq \mathsf{W}^2 + \mathsf{V}^\top\mathsf{V} \preceq I.$$

Moreover,

$$\mathsf{U}^2 + \mathsf{V}\mathsf{V}^\top = H_{0:k}^{-1/2}\left(H_{0:k}^{-1} + S_{0:k}^\top\Sigma_{>k}^{-1}S_{0:k}\right)^{-1}H_{0:k}^{-1/2}.$$

These identities are the same head-tail algebra used in the approximation analysis of sketched linear regression (Lin et al. [2024](https://arxiv.org/html/2606.26617#bib.bib15), Lemma C.1).
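Both block identities are consequences of $P$ being an orthogonal projection, so that $\Delta_P^2 = -\Delta_P$, together with the Woodbury identity for the head block. As a sanity check, the sketch below verifies them numerically in a finite-dimensional stand-in: the ambient dimension `D`, the spectrum `h`, and the split index `k` are hypothetical choices, with `D - k >= M` so that the tail covariance is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D, k = 5, 20, 3                         # finite ambient dimension D is an assumption

h = np.arange(1, D + 1, dtype=float) ** -1.5
S = rng.standard_normal((M, D)) / np.sqrt(M)
H_half = np.diag(np.sqrt(h))

Sigma = S @ np.diag(h) @ S.T
P = H_half @ S.T @ np.linalg.solve(Sigma, S @ H_half)
Delta = P - np.eye(D)
U, Vb, W = Delta[:k, :k], Delta[:k, k:], Delta[k:, k:]

# First identity: W^2 + V^T V = -W.
err_tail = np.max(np.abs(W @ W + Vb.T @ Vb + W))

# Second identity: U^2 + V V^T expressed through the tail covariance Sigma_{>k}.
S_head, S_tail = S[:, :k], S[:, k:]
Sigma_tail = S_tail @ np.diag(h[k:]) @ S_tail.T
inner = np.linalg.inv(np.diag(1.0 / h[:k]) + S_head.T @ np.linalg.solve(Sigma_tail, S_head))
H_head_inv_half = np.diag(h[:k] ** -0.5)
rhs = H_head_inv_half @ inner @ H_head_inv_half
err_head = np.max(np.abs(U @ U + Vb @ Vb.T - rhs))
print(f"tail identity error {err_tail:.2e}, head identity error {err_head:.2e}")
```

Both errors are at round-off level, as expected for exact algebraic identities.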
###### Lemma 10\(Projection trace lower bound\)\.
Let $B \succeq 0$ be a compact positive semidefinite operator with eigenvalues $\mu_1(B) \geq \mu_2(B) \geq \cdots \geq 0$. Let $\Pi$ be an orthogonal projection such that $I - \Pi$ has rank at most $m$. Then

$$\langle \Pi, B \rangle \geq \sum_{j>m} \mu_j(B).$$

Equivalently, a rank-$m$ projection can remove at most the largest $m$ eigenvalue directions of $B$.
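A finite-dimensional instance makes the statement concrete. The sketch below, with an arbitrary random PSD matrix standing in for the compact operator $B$, compares $\langle \Pi, B\rangle = \operatorname{tr}(\Pi B)$ with the sum of the eigenvalues beyond the $m$-th, first for a random projection and then for the extremal one that removes exactly the top-$m$ eigenvectors. The sizes and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 12, 4

G = rng.standard_normal((n, n))
B = G @ G.T                                 # random PSD matrix (finite-dimensional stand-in)
eigs = np.sort(np.linalg.eigvalsh(B))[::-1]  # mu_1(B) >= mu_2(B) >= ...

# Orthogonal projection whose complement I - Pi has rank exactly m.
Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
Pi = np.eye(n) - Q @ Q.T

lhs = np.trace(Pi @ B)                      # <Pi, B>
rhs = eigs[m:].sum()                        # sum_{j > m} mu_j(B)

# Equality case: the complement is the span of the top-m eigenvectors of B.
_, vecs = np.linalg.eigh(B)                 # eigenvalues in ascending order
top = vecs[:, -m:]
lhs_tight = np.trace((np.eye(n) - top @ top.T) @ B)
print(f"random projection: {lhs:.3f} >= {rhs:.3f}; extremal projection: {lhs_tight:.3f}")
```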
###### Lemma 11\(Sketched marginal spectrum under power laws\)\.
Suppose Assumption [1](https://arxiv.org/html/2606.26617#Thmassumption1) holds. Recall

$$H = \Lambda_z + \Lambda_\epsilon, \qquad \Sigma = SHS^\top.$$

Then

$$h_i := \mu_i(H) \asymp i^{-b}.$$

Moreover, with probability at least $1 - \exp(-\Omega(M))$ over the Gaussian sketch $S$,

$$\mu_i(\Sigma) \asymp i^{-b}, \qquad i = 1, \ldots, M.$$

The sketched-spectrum statement is Lemma E.5 of Lin et al. ([2025](https://arxiv.org/html/2606.26617#bib.bib16)), stated here in our notation.
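The sketched-spectrum statement can be illustrated by simulation. The sketch below is a rough check under simplifying assumptions: the spectrum is taken to be exactly $h_i = i^{-b}$, the ambient dimension is truncated at a hypothetical finite `D`, and the values of `M`, `b`, and the seed are arbitrary. It prints the range of $\mu_i(\Sigma)\, i^{b}$ over $i = 1, \ldots, M$, which should stay of constant order if $\mu_i(\Sigma) \asymp i^{-b}$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D, b = 50, 5000, 2.0                    # finite D truncates the spectrum (assumption)

h = np.arange(1, D + 1, dtype=float) ** (-b)   # h_i = i^(-b) exactly
S = rng.standard_normal((M, D)) / np.sqrt(M)   # S_ij ~ N(0, 1/M)
Sigma = (S * h) @ S.T                          # S H S^T without forming H explicitly

# Eigenvalues of the sketched covariance in decreasing order.
mu = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
ratio = mu * np.arange(1, M + 1, dtype=float) ** b
print(f"mu_i(Sigma) * i^b ranges over [{ratio.min():.2f}, {ratio.max():.2f}]")
```

A single draw at one value of $M$ cannot confirm the high-probability statement; it only shows that the ratio does not drift by orders of magnitude across the index range.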
###### Lemma 12\(Two\-sided norm equivalence on the covariance event\)\.
Suppose $\mathcal{E}_{\mathrm{cov}}(R)$ occurs. Let

$$\rho := R^{-1/2}, \qquad c_L := 1 - c_{\mathrm{rel}}\rho, \qquad c_U := 1 + c_{\mathrm{rel}}\rho.$$

Then, for every matrix $B$,

$$c_L^2\|B\|_{\Sigma,\Sigma}^2 \leq \|B\|_{\widehat{\Sigma}_x,\widehat{\Sigma}_y}^2 \leq c_U^2\|B\|_{\Sigma,\Sigma}^2.$$
###### Proof\.
By $\mathcal{E}_{\mathrm{cov}}(R)$,

$$c_L\Sigma \preceq \widehat{\Sigma}_x \preceq c_U\Sigma, \qquad c_L\Sigma \preceq \widehat{\Sigma}_y \preceq c_U\Sigma.$$

For PSD matrices $A_1 \preceq A_2$ and $C_1 \preceq C_2$, we have

$$\operatorname{tr}(B^\top A_1 B C_1) \leq \operatorname{tr}(B^\top A_2 B C_2).$$

Indeed,

$$\operatorname{tr}(B^\top A_2 B C_2) - \operatorname{tr}(B^\top A_1 B C_1)$$

equals

$$\operatorname{tr}(B^\top (A_2 - A_1) B C_2) + \operatorname{tr}(B^\top A_1 B (C_2 - C_1)),$$

and both terms are nonnegative. Applying this monotonicity twice gives

$$\|B\|_{\widehat{\Sigma}_x,\widehat{\Sigma}_y}^2 \leq c_U^2\|B\|_{\Sigma,\Sigma}^2$$

and

$$\|B\|_{\widehat{\Sigma}_x,\widehat{\Sigma}_y}^2 \geq c_L^2\|B\|_{\Sigma,\Sigma}^2.$$

This proves the claim. ∎
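The sandwich bound can be checked on a small example. In the sketch below, the empirical covariances are simulated as $\Sigma^{1/2}(I + E)\Sigma^{1/2}$ with a symmetric perturbation of operator norm `eps`, so that $(1-\varepsilon)\Sigma \preceq \widehat{\Sigma} \preceq (1+\varepsilon)\Sigma$ holds by construction; `eps` plays the role of $c_{\mathrm{rel}}\rho$. The matrices, sizes, and helper names are hypothetical and serve only this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, eps = 8, 0.1                            # eps plays the role of c_rel * rho

def sym_sqrt(A):
    """Symmetric square root of a positive definite matrix."""
    w, Q = np.linalg.eigh(A)
    return (Q * np.sqrt(w)) @ Q.T

def perturbed(Sigma_half):
    """Return Sigma^{1/2} (I + E) Sigma^{1/2} with symmetric E and ||E|| = eps."""
    E = rng.standard_normal((M, M))
    E = (E + E.T) / 2
    E *= eps / np.linalg.norm(E, ord=2)
    return Sigma_half @ (np.eye(M) + E) @ Sigma_half

def prod_norm_sq(B, A, C):
    """Squared product norm ||B||_{A,C}^2 = tr(B^T A B C)."""
    return np.trace(B.T @ A @ B @ C)

G = rng.standard_normal((M, M))
Sigma = G @ G.T + 0.1 * np.eye(M)
Sigma_half = sym_sqrt(Sigma)
Sx, Sy = perturbed(Sigma_half), perturbed(Sigma_half)

B = rng.standard_normal((M, M))
pop = prod_norm_sq(B, Sigma, Sigma)
emp = prod_norm_sq(B, Sx, Sy)
print(f"{(1 - eps) ** 2 * pop:.3f} <= {emp:.3f} <= {(1 + eps) ** 2 * pop:.3f}")
```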
###### Lemma 13\(Scalar GD filter\)\.
Suppose the stepsize schedule in Assumption [2](https://arxiv.org/html/2606.26617#Thmassumption2) holds and the stepsizes satisfy

$$0 \leq \gamma_t s \leq 1, \qquad t = 1, \ldots, L.$$

Let

$$\psi_L(s) := \prod_{t=1}^{L} (1 - \gamma_t s).$$

Then

$$0 \leq \psi_L(s)^2 \leq 1,$$

and

$$s\,\psi_L(s)^2 \lesssim \frac{1}{L_{\mathrm{eff}}\gamma}.$$
###### Proof\.
The first claim follows from $0 \leq 1 - \gamma_t s \leq 1$. For the second claim, the geometrically decaying schedule keeps a constant-order fraction of the first $L_{\mathrm{eff}}$ steps at scale comparable to $\gamma$. Hence

$$\psi_L(s)^2 \leq (1 - c\gamma s)^{cL_{\mathrm{eff}}} \leq \exp(-cL_{\mathrm{eff}}\gamma s)$$

for an absolute constant $c > 0$. Therefore,

$$s\,\psi_L(s)^2 \leq s\exp(-cL_{\mathrm{eff}}\gamma s) \leq \frac{C}{L_{\mathrm{eff}}\gamma},$$

because $\sup_{s\geq 0} s e^{-as} = 1/(ae)$. This proves the claim. ∎
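For intuition, the two steps of this argument can be traced numerically in the simplest case of a constant stepsize, which is an assumption of this sketch rather than the schedule of Assumption 2. Then $\psi_L(s)^2 = (1-\gamma s)^{2L} \leq \exp(-2L\gamma s)$, and the supremum of $s\,e^{-as}$ with $a = 2L\gamma$ equals $1/(ae)$. The values of `gamma` and `L` are illustrative.

```python
import numpy as np

gamma, L = 0.05, 400                       # constant stepsize: an assumption of this sketch
R = gamma * L                              # plays the role of L_eff * gamma

# Grid of s with 0 <= gamma * s <= 1.
s = np.linspace(0.0, 1.0 / gamma, 20001)
psi_sq = (1.0 - gamma * s) ** (2 * L)      # psi_L(s)^2 for gamma_t = gamma
envelope = s * np.exp(-2.0 * R * s)        # s * exp(-a s) with a = 2 R

sup_filter = np.max(s * psi_sq)
sup_envelope = 1.0 / (2.0 * R * np.e)      # sup_s s * exp(-a s) = 1 / (a e)
print(f"sup s*psi_L(s)^2 = {sup_filter:.5f} <= 1/(2eR) = {sup_envelope:.5f}")
```

In this special case the bound $s\,\psi_L(s)^2 \leq 1/(2eR)$ is nearly attained at $s \approx 1/(2R)$, which shows that the $1/(L_{\mathrm{eff}}\gamma)$ rate cannot be improved by more than a constant.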
###### Lemma 14\(Empirical product\-norm filter and contraction\)\.
Let

$$\widehat{\mathscr{B}}_L := \prod_{t=1}^{L} (I - \gamma_t\widehat{\mathscr{H}}), \qquad \widehat{\mathscr{H}}(B) := \widehat{\Sigma}_x B \widehat{\Sigma}_y.$$

Suppose Assumption [2](https://arxiv.org/html/2606.26617#Thmassumption2) holds. Then, for every matrix $B$,

$$\left\|\widehat{\mathscr{B}}_L(B)\right\|_{\widehat{\Sigma}_x,\widehat{\Sigma}_y}^2 \lesssim \frac{1}{L_{\mathrm{eff}}\gamma}\|B\|_F^2.$$

Moreover,

$$\left\|\widehat{\mathscr{B}}_L(B)\right\|_{\widehat{\Sigma}_x,\widehat{\Sigma}_y}^2 \leq \|B\|_{\widehat{\Sigma}_x,\widehat{\Sigma}_y}^2.$$
###### Proof\.
Diagonalize the empirical marginal covariances:
Σ^x=Uxdiag\(αi\)Ux⊤,Σ^y=Uydiag\(βj\)Uy⊤\.\\widehat\{\\Sigma\}\_\{x\}=U\_\{x\}\\operatorname\{diag\}\(\\alpha\_\{i\}\)U\_\{x\}^\{\\top\},\\qquad\\widehat\{\\Sigma\}\_\{y\}=U\_\{y\}\\operatorname\{diag\}\(\\beta\_\{j\}\)U\_\{y\}^\{\\top\}\.For any matrixBB, write
B~:=Ux⊤BUy\.\\widetilde\{B\}:=U\_\{x\}^\{\\top\}BU\_\{y\}\.In this basis, the empirical Hessian operator acts coordinatewise:
ℋ^:B~ij↦αiβjB~ij\.\\widehat\{\\mathscr\{H\}\}:\\widetilde\{B\}\_\{ij\}\\mapsto\\alpha\_\{i\}\\beta\_\{j\}\\widetilde\{B\}\_\{ij\}\.Therefore,
ℬ^L:B~ij↦ψL\(αiβj\)B~ij,\\widehat\{\\mathscr\{B\}\}\_\{L\}:\\widetilde\{B\}\_\{ij\}\\mapsto\\psi\_\{L\}\(\\alpha\_\{i\}\\beta\_\{j\}\)\\widetilde\{B\}\_\{ij\},where
ψL\(s\):=∏t=1L\(1−γts\)\.\\psi\_\{L\}\(s\):=\\prod\_\{t=1\}^\{L\}\(1\-\\gamma\_\{t\}s\)\.Hence
‖ℬ^L\(B\)‖Σ^x,Σ^y2=∑i,jαiβjψL\(αiβj\)2B~ij2\.\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(B\)\\right\\\|\_\{\\widehat\{\\Sigma\}\_\{x\},\\widehat\{\\Sigma\}\_\{y\}\}^\{2\}=\\sum\_\{i,j\}\\alpha\_\{i\}\\beta\_\{j\}\\psi\_\{L\}\(\\alpha\_\{i\}\\beta\_\{j\}\)^\{2\}\\widetilde\{B\}\_\{ij\}^\{2\}\.By Assumption[2](https://arxiv.org/html/2606.26617#Thmassumption2),
αi≤‖Σ^x‖≤Rx,βj≤‖Σ^y‖≤Ry,\\alpha\_\{i\}\\leq\\\|\\widehat\{\\Sigma\}\_\{x\}\\\|\\leq R\_\{x\},\\qquad\\beta\_\{j\}\\leq\\\|\\widehat\{\\Sigma\}\_\{y\}\\\|\\leq R\_\{y\},andγt≤γ0≤\(4RxRy\)−1\\gamma\_\{t\}\\leq\\gamma\_\{0\}\\leq\(4R\_\{x\}R\_\{y\}\)^\{\-1\}\. Hence0≤γtαiβj≤10\\leq\\gamma\_\{t\}\\alpha\_\{i\}\\beta\_\{j\}\\leq 1, so Lemma[13](https://arxiv.org/html/2606.26617#Thmlemma13)gives
αiβjψL\(αiβj\)2≲1Leffγ\.\\alpha\_\{i\}\\beta\_\{j\}\\psi\_\{L\}\(\\alpha\_\{i\}\\beta\_\{j\}\)^\{2\}\\lesssim\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\.Therefore,
‖ℬ^L\(B\)‖Σ^x,Σ^y2≲1Leffγ∑i,jB~ij2=1Leffγ‖B‖F2\.\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(B\)\\right\\\|\_\{\\widehat\{\\Sigma\}\_\{x\},\\widehat\{\\Sigma\}\_\{y\}\}^\{2\}\\lesssim\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\\sum\_\{i,j\}\\widetilde\{B\}\_\{ij\}^\{2\}=\\frac\{1\}\{L\_\{\\mathrm\{eff\}\}\\gamma\}\\\|B\\\|\_\{F\}^\{2\}\.The contraction bound follows from
0≤ψL\(αiβj\)2≤1,0\\leq\\psi\_\{L\}\(\\alpha\_\{i\}\\beta\_\{j\}\)^\{2\}\\leq 1,again by Lemma[13](https://arxiv.org/html/2606.26617#Thmlemma13)\. Thus
‖ℬ^L\(B\)‖Σ^x,Σ^y2≤∑i,jαiβjB~ij2=‖B‖Σ^x,Σ^y2\.\\left\\\|\\widehat\{\\mathscr\{B\}\}\_\{L\}\(B\)\\right\\\|\_\{\\widehat\{\\Sigma\}\_\{x\},\\widehat\{\\Sigma\}\_\{y\}\}^\{2\}\\leq\\sum\_\{i,j\}\\alpha\_\{i\}\\beta\_\{j\}\\widetilde\{B\}\_\{ij\}^\{2\}=\\\|B\\\|\_\{\\widehat\{\\Sigma\}\_\{x\},\\widehat\{\\Sigma\}\_\{y\}\}^\{2\}\.∎
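The coordinatewise action used in this proof can be illustrated with a small numerical sketch. It assumes a hypothetical halving stepsize schedule and random empirical covariances, and it takes the product norm to be $\|B\|_{\widehat{\Sigma}_x,\widehat{\Sigma}_y}^2=\langle B,\widehat{\Sigma}_xB\widehat{\Sigma}_y\rangle_F$, which agrees with the spectral expression in the proof. The code iterates $B\mapsto B-\gamma_t\widehat{\Sigma}_xB\widehat{\Sigma}_y$ directly and compares the result with the filter $\psi_L(\alpha_i\beta_j)$ applied in the joint eigenbasis.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, L = 6, 40, 50
X = rng.standard_normal((N, M))
Y = rng.standard_normal((N, M))
Sx, Sy = X.T @ X / N, Y.T @ Y / N          # empirical marginal covariances

# Hypothetical schedule: start at 1/(4 R_x R_y) and halve every 10 steps.
gamma0 = 1.0 / (4.0 * np.linalg.norm(Sx, 2) * np.linalg.norm(Sy, 2))
gammas = gamma0 / 2.0 ** (np.arange(L) // 10)

def product_norm_sq(B):
    """<B, Sx B Sy>_F, i.e. sum_ij alpha_i beta_j Btilde_ij^2."""
    return float(np.sum(B * (Sx @ B @ Sy)))

B0 = rng.standard_normal((M, M))
B_iter = B0.copy()
for step in gammas:                         # apply (I - gamma_t H_hat) repeatedly
    B_iter = B_iter - step * (Sx @ B_iter @ Sy)

alpha, Ux = np.linalg.eigh(Sx)
beta, Uy = np.linalg.eigh(Sy)
s = np.outer(alpha, beta)                   # eigenvalues alpha_i * beta_j of H_hat
psi = np.prod(1.0 - gammas[:, None, None] * s[None, :, :], axis=0)
B_spec = Ux @ (psi * (Ux.T @ B0 @ Uy)) @ Uy.T

print(np.max(np.abs(B_iter - B_spec)))
print(product_norm_sq(B_iter), "<=", product_norm_sq(B0))
```

The two computations agree up to floating-point error, and the product norm does not increase, as in the contraction bound.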
###### Lemma 15\(Scalar variance filter\)\.
Let
ψL\(s\):=∏t=1L\(1−γts\),gL\(s\):=∑t=1Lγt∏r=t\+1L\(1−γrs\)\.\\psi\_\{L\}\(s\):=\\prod\_\{t=1\}^\{L\}\(1\-\\gamma\_\{t\}s\),\\qquad g\_\{L\}\(s\):=\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\prod\_\{r=t\+1\}^\{L\}\(1\-\\gamma\_\{r\}s\)\.Assume the stepsize schedule in Assumption[2](https://arxiv.org/html/2606.26617#Thmassumption2)and suppose0≤γts≤10\\leq\\gamma\_\{t\}s\\leq 1for allt=1,…,Lt=1,\\ldots,L\. Then
0≤sgL\(s\)≤1,0\\leq sg\_\{L\}\(s\)\\leq 1,and
s2gL\(s\)2≲min\{1,\(Leffγs\)2\}\.s^\{2\}g\_\{L\}\(s\)^\{2\}\\lesssim\\min\\\{1,\(L\_\{\\mathrm\{eff\}\}\\gamma s\)^\{2\}\\\}\.
###### Proof\.
First observe that
1−ψL\(s\)=1−∏t=1L\(1−γts\)\.1\-\\psi\_\{L\}\(s\)=1\-\\prod\_\{t=1\}^\{L\}\(1\-\\gamma\_\{t\}s\)\.Expanding this telescopically gives
1−ψL\(s\)=∑t=1L\[∏r=t\+1L\(1−γrs\)\]\[1−\(1−γts\)\]\.1\-\\psi\_\{L\}\(s\)=\\sum\_\{t=1\}^\{L\}\\left\[\\prod\_\{r=t\+1\}^\{L\}\(1\-\\gamma\_\{r\}s\)\\right\]\\left\[1\-\(1\-\\gamma\_\{t\}s\)\\right\]\.Since1−\(1−γts\)=γts1\-\(1\-\\gamma\_\{t\}s\)=\\gamma\_\{t\}s, we get
1−ψL\(s\)=s∑t=1Lγt∏r=t\+1L\(1−γrs\)=sgL\(s\)\.1\-\\psi\_\{L\}\(s\)=s\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\prod\_\{r=t\+1\}^\{L\}\(1\-\\gamma\_\{r\}s\)=sg\_\{L\}\(s\)\.Because0≤γts≤10\\leq\\gamma\_\{t\}s\\leq 1, every factor1−γts1\-\\gamma\_\{t\}slies in\[0,1\]\[0,1\], and hence
0≤ψL\(s\)≤1\.0\\leq\\psi\_\{L\}\(s\)\\leq 1\.Therefore,
0≤sgL\(s\)=1−ψL\(s\)≤1\.0\\leq sg\_\{L\}\(s\)=1\-\\psi\_\{L\}\(s\)\\leq 1\.This proves
s2gL\(s\)2≤1\.s^\{2\}g\_\{L\}\(s\)^\{2\}\\leq 1\.
It remains to prove the small\-ssbound\. Since
1−∏t=1L\(1−γts\)≤∑t=1Lγts,1\-\\prod\_\{t=1\}^\{L\}\(1\-\\gamma\_\{t\}s\)\\leq\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}s,we have
sgL\(s\)≤s∑t=1Lγt\.sg\_\{L\}\(s\)\\leq s\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\.Under the geometrically decaying schedule,
∑t=1Lγt≲Leffγ\.\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\lesssim L\_\{\\mathrm\{eff\}\}\\gamma\.Therefore,
sgL\(s\)≲Leffγs\.sg\_\{L\}\(s\)\\lesssim L\_\{\\mathrm\{eff\}\}\\gamma s\.Squaring gives
s2gL\(s\)2≲\(Leffγs\)2\.s^\{2\}g\_\{L\}\(s\)^\{2\}\\lesssim\(L\_\{\\mathrm\{eff\}\}\\gamma s\)^\{2\}\.Combining this withs2gL\(s\)2≤1s^\{2\}g\_\{L\}\(s\)^\{2\}\\leq 1, we obtain
s2gL\(s\)2≲min\{1,\(Leffγs\)2\}\.s^\{2\}g\_\{L\}\(s\)^\{2\}\\lesssim\\min\\\{1,\(L\_\{\\mathrm\{eff\}\}\\gamma s\)^\{2\}\\\}\.∎
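The identity $sg_L(s)=1-\psi_L(s)$ holds for any stepsizes, so it is easy to confirm numerically. The sketch below uses a hypothetical schedule that halves the stepsize every `block` steps (a stand-in for Assumption 2, not a statement of it) and also checks the two elementary bounds $sg_L(s)\leq 1$ and $sg_L(s)\leq s\sum_t\gamma_t$ used in the proof.

```python
import numpy as np

def halving_schedule(gamma, L, block):
    """Hypothetical geometric schedule: stepsize halves every `block` steps."""
    return gamma / 2.0 ** (np.arange(L) // block)

def psi(gammas, s):
    """psi_L(s) = prod_t (1 - gamma_t s)."""
    return float(np.prod(1.0 - gammas * s))

def g(gammas, s):
    """g_L(s) = sum_t gamma_t * prod_{r > t} (1 - gamma_r s), by definition."""
    factors = 1.0 - gammas * s
    # tail[t] = product of factors[t+1:], with an empty product equal to 1.
    tail = np.concatenate([np.cumprod(factors[::-1])[::-1][1:], [1.0]])
    return float(np.sum(gammas * tail))

gammas = halving_schedule(gamma=0.25, L=64, block=16)
checks = []
for s in [0.0, 0.1, 1.0, 4.0]:              # gamma_t * s <= 0.25 * 4 = 1
    sg = s * g(gammas, s)
    checks.append((s, sg, 1.0 - psi(gammas, s), s * float(gammas.sum())))
    print(checks[-1])
```

Each printed tuple lists $s$, $sg_L(s)$, $1-\psi_L(s)$, and $s\sum_t\gamma_t$; the second and third entries coincide, and the second never exceeds $\min\{1,s\sum_t\gamma_t\}$.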
###### Lemma 16\(Gaussian covariance\-fluctuation moment bound\)\.
Let\(ξ,η\)∈ℝM×ℝM\(\\xi,\\eta\)\\in\\mathbb\{R\}^\{M\}\\times\\mathbb\{R\}^\{M\}be jointly Gaussian with
𝔼\[ξξ⊤\]=I,𝔼\[ηη⊤\]=I,𝔼\[ξη⊤\]=K0,\\mathbb\{E\}\[\\xi\\xi^\{\\top\}\]=I,\\qquad\\mathbb\{E\}\[\\eta\\eta^\{\\top\}\]=I,\\qquad\\mathbb\{E\}\[\\xi\\eta^\{\\top\}\]=K\_\{0\},where‖K0‖≤1\\\|K\_\{0\}\\\|\\leq 1\. Then, for every deterministic matrixW∈ℝM×MW\\in\\mathbb\{R\}^\{M\\times M\},
𝔼\[⟨ξη⊤−K0,W⟩F2\]≲‖W‖F2,\\mathbb\{E\}\\left\[\\left\\langle\\xi\\eta^\{\\top\}\-K\_\{0\},W\\right\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\\|W\\\|\_\{F\}^\{2\},𝔼\[⟨ξξ⊤−I,W⟩F2\]≲‖W‖F2,\\mathbb\{E\}\\left\[\\left\\langle\\xi\\xi^\{\\top\}\-I,W\\right\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\\|W\\\|\_\{F\}^\{2\},and
𝔼\[⟨ηη⊤−I,W⟩F2\]≲‖W‖F2\.\\mathbb\{E\}\\left\[\\left\\langle\\eta\\eta^\{\\top\}\-I,W\\right\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\\|W\\\|\_\{F\}^\{2\}\.Consequently, if\(ξn,ηn\)n=1N\(\\xi\_\{n\},\\eta\_\{n\}\)\_\{n=1\}^\{N\}are i\.i\.d\. copies of\(ξ,η\)\(\\xi,\\eta\), and
Gxy:=1N∑n=1N\(ξnηn⊤−K0\),G\_\{xy\}:=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\(\\xi\_\{n\}\\eta\_\{n\}^\{\\top\}\-K\_\{0\}\),Gx:=1N∑n=1N\(ξnξn⊤−I\),Gy:=1N∑n=1N\(ηnηn⊤−I\),G\_\{x\}:=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\(\\xi\_\{n\}\\xi\_\{n\}^\{\\top\}\-I\),\\qquad G\_\{y\}:=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\(\\eta\_\{n\}\\eta\_\{n\}^\{\\top\}\-I\),then
𝔼\[⟨Gxy,W⟩F2\]≲1N‖W‖F2,\\mathbb\{E\}\\left\[\\langle G\_\{xy\},W\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\frac\{1\}\{N\}\\\|W\\\|\_\{F\}^\{2\},𝔼\[⟨Gx,W⟩F2\]≲1N‖W‖F2,𝔼\[⟨Gy,W⟩F2\]≲1N‖W‖F2\.\\mathbb\{E\}\\left\[\\langle G\_\{x\},W\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\frac\{1\}\{N\}\\\|W\\\|\_\{F\}^\{2\},\\qquad\\mathbb\{E\}\\left\[\\langle G\_\{y\},W\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\frac\{1\}\{N\}\\\|W\\\|\_\{F\}^\{2\}\.
###### Proof\.
We prove the first bound; the other two are the standard special cases withη=ξ\\eta=\\xiandK0=IK\_\{0\}=I\.
Observe that
⟨ξη⊤−K0,W⟩F=ξ⊤Wη−tr\(K0⊤W\)\.\\left\\langle\\xi\\eta^\{\\top\}\-K\_\{0\},W\\right\\rangle\_\{F\}=\\xi^\{\\top\}W\\eta\-\\operatorname\{tr\}\(K\_\{0\}^\{\\top\}W\)\.Therefore,
𝔼\[⟨ξη⊤−K0,W⟩F2\]=Var\(ξ⊤Wη\)\.\\mathbb\{E\}\\left\[\\left\\langle\\xi\\eta^\{\\top\}\-K\_\{0\},W\\right\\rangle\_\{F\}^\{2\}\\right\]=\\operatorname\{Var\}\(\\xi^\{\\top\}W\\eta\)\.Since\(ξ,η\)\(\\xi,\\eta\)is jointly Gaussian, Isserlis’ formula gives
𝔼\[\(ξ⊤Wη\)2\]=∑i,j,k,ℓWijWkℓ𝔼\[ξiηjξkηℓ\]\.\\mathbb\{E\}\[\(\\xi^\{\\top\}W\\eta\)^\{2\}\]=\\sum\_\{i,j,k,\\ell\}W\_\{ij\}W\_\{k\\ell\}\\mathbb\{E\}\[\\xi\_\{i\}\\eta\_\{j\}\\xi\_\{k\}\\eta\_\{\\ell\}\]\.For jointly Gaussian variables,
𝔼\[ξiηjξkηℓ\]=𝔼\[ξiηj\]𝔼\[ξkηℓ\]\+𝔼\[ξiξk\]𝔼\[ηjηℓ\]\+𝔼\[ξiηℓ\]𝔼\[ηjξk\]\.\\mathbb\{E\}\[\\xi\_\{i\}\\eta\_\{j\}\\xi\_\{k\}\\eta\_\{\\ell\}\]=\\mathbb\{E\}\[\\xi\_\{i\}\\eta\_\{j\}\]\\mathbb\{E\}\[\\xi\_\{k\}\\eta\_\{\\ell\}\]\+\\mathbb\{E\}\[\\xi\_\{i\}\\xi\_\{k\}\]\\mathbb\{E\}\[\\eta\_\{j\}\\eta\_\{\\ell\}\]\+\\mathbb\{E\}\[\\xi\_\{i\}\\eta\_\{\\ell\}\]\\mathbb\{E\}\[\\eta\_\{j\}\\xi\_\{k\}\]\.Using
𝔼\[ξiηj\]=\(K0\)ij,𝔼\[ξiξk\]=δik,𝔼\[ηjηℓ\]=δjℓ,\\mathbb\{E\}\[\\xi\_\{i\}\\eta\_\{j\}\]=\(K\_\{0\}\)\_\{ij\},\\qquad\\mathbb\{E\}\[\\xi\_\{i\}\\xi\_\{k\}\]=\\delta\_\{ik\},\\qquad\\mathbb\{E\}\[\\eta\_\{j\}\\eta\_\{\\ell\}\]=\\delta\_\{j\\ell\},we get
𝔼\[\(ξ⊤Wη\)2\]=tr\(K0⊤W\)2\+∥W∥F2\+tr\(W⊤K0W⊤K0\)\.\\mathbb\{E\}\[\(\\xi^\{\\top\}W\\eta\)^\{2\}\]=\\operatorname\{tr\}\(K\_\{0\}^\{\\top\}W\)^\{2\}\+\\\|W\\\|\_\{F\}^\{2\}\+\\operatorname\{tr\}\(W^\{\\top\}K\_\{0\}W^\{\\top\}K\_\{0\}\)\.Thus
Var\(ξ⊤Wη\)=‖W‖F2\+tr\(W⊤K0W⊤K0\)\.\\operatorname\{Var\}\(\\xi^\{\\top\}W\\eta\)=\\\|W\\\|\_\{F\}^\{2\}\+\\operatorname\{tr\}\(W^\{\\top\}K\_\{0\}W^\{\\top\}K\_\{0\}\)\.Since‖K0‖≤1\\\|K\_\{0\}\\\|\\leq 1,
\|tr\(W⊤K0W⊤K0\)\|≤‖W⊤K0‖F2≤‖W‖F2\.\\left\|\\operatorname\{tr\}\(W^\{\\top\}K\_\{0\}W^\{\\top\}K\_\{0\}\)\\right\|\\leq\\\|W^\{\\top\}K\_\{0\}\\\|\_\{F\}^\{2\}\\leq\\\|W\\\|\_\{F\}^\{2\}\.Hence
𝔼\[⟨ξη⊤−K0,W⟩F2\]≲‖W‖F2\.\\mathbb\{E\}\\left\[\\left\\langle\\xi\\eta^\{\\top\}\-K\_\{0\},W\\right\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\\|W\\\|\_\{F\}^\{2\}\.
For the empirical average, independence gives
𝔼\[⟨Gxy,W⟩F2\]=1N2∑n=1N𝔼\[⟨ξnηn⊤−K0,W⟩F2\],\\mathbb\{E\}\\left\[\\langle G\_\{xy\},W\\rangle\_\{F\}^\{2\}\\right\]=\\frac\{1\}\{N^\{2\}\}\\sum\_\{n=1\}^\{N\}\\mathbb\{E\}\\left\[\\left\\langle\\xi\_\{n\}\\eta\_\{n\}^\{\\top\}\-K\_\{0\},W\\right\\rangle\_\{F\}^\{2\}\\right\],because the cross terms vanish by centering\. Therefore,
𝔼\[⟨Gxy,W⟩F2\]≲1N‖W‖F2\.\\mathbb\{E\}\\left\[\\langle G\_\{xy\},W\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\frac\{1\}\{N\}\\\|W\\\|\_\{F\}^\{2\}\.The same argument proves the bounds forGxG\_\{x\}andGyG\_\{y\}\. ∎
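The closed form for $\operatorname{Var}(\xi^\top W\eta)$ can be verified by summing the three Isserlis pairings explicitly. The sketch below does this with `numpy.einsum` for a randomly drawn $W$ and a hypothetical $K_0$ rescaled to operator norm $0.9$; the dimension and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 5
A = rng.standard_normal((M, M))
K0 = 0.9 * A / np.linalg.norm(A, 2)        # hypothetical K_0 with ||K_0|| = 0.9
W = rng.standard_normal((M, M))
I = np.eye(M)

# E[xi_i eta_j xi_k eta_l] = K_ij K_kl + delta_ik delta_jl + K_il K_kj.
second_moment = (
    np.einsum("ij,kl,ij,kl->", W, W, K0, K0)
    + np.einsum("ij,kl,ik,jl->", W, W, I, I)
    + np.einsum("ij,kl,il,kj->", W, W, K0, K0)
)
mean = np.trace(K0.T @ W)                   # E[xi^T W eta]
variance = second_moment - mean**2

fro_sq = np.linalg.norm(W, "fro") ** 2
closed_form = fro_sq + np.trace(W.T @ K0 @ W.T @ K0)
print(variance, closed_form, 2.0 * fro_sq)
```

The pairing sum matches $\|W\|_F^2+\operatorname{tr}(W^\top K_0W^\top K_0)$, and the value stays below $2\|W\|_F^2$, which is the constant implicit in the $\lesssim$ of the lemma.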
###### Lemma 17\(Contrastive empirical\-noise covariance upper bound\)\.
Let
E^:=C^−Σ^xA⋆Σ^y,\\widehat\{E\}:=\\widehat\{C\}\-\\widehat\{\\Sigma\}\_\{x\}A^\{\\star\}\\widehat\{\\Sigma\}\_\{y\},and define its whitened version
Z^:=Σ1/2E^Σ1/2\.\\widehat\{Z\}:=\\Sigma^\{1/2\}\\widehat\{E\}\\Sigma^\{1/2\}\.Assume the contrastive source condition, so that
K⋆:=Σ1/2A⋆Σ1/2K\_\{\\star\}:=\\Sigma^\{1/2\}A^\{\\star\}\\Sigma^\{1/2\}satisfies
0⪯K⋆⪯I\.0\\preceq K\_\{\\star\}\\preceq I\.Assume also thatN≳MN\\gtrsim M\. Then, for every deterministic matrixW∈ℝM×MW\\in\\mathbb\{R\}^\{M\\times M\},
𝔼\[⟨Z^,W⟩F2\]≲1N‖ΣWΣ‖F2\.\\mathbb\{E\}\\left\[\\langle\\widehat\{Z\},W\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\frac\{1\}\{N\}\\\|\\Sigma W\\Sigma\\\|\_\{F\}^\{2\}\.
###### Proof\.
Define the whitened paired variables
ξ:=Σ−1/2x~,η:=Σ−1/2y~\.\\xi:=\\Sigma^\{\-1/2\}\\widetilde\{x\},\\qquad\\eta:=\\Sigma^\{\-1/2\}\\widetilde\{y\}\.Then
𝔼\[ξξ⊤\]=I,𝔼\[ηη⊤\]=I,𝔼\[ξη⊤\]=K⋆\.\\mathbb\{E\}\[\\xi\\xi^\{\\top\}\]=I,\\qquad\\mathbb\{E\}\[\\eta\\eta^\{\\top\}\]=I,\\qquad\\mathbb\{E\}\[\\xi\\eta^\{\\top\}\]=K\_\{\\star\}\.By the contrastive source condition,
0⪯K⋆⪯I\.0\\preceq K\_\{\\star\}\\preceq I\.For the empirical quantities, define
Gxy:=1N∑n=1N\(ξnηn⊤−K⋆\),G\_\{xy\}:=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\(\\xi\_\{n\}\\eta\_\{n\}^\{\\top\}\-K\_\{\\star\}\),Gx:=1N∑n=1N\(ξnξn⊤−I\),Gy:=1N∑n=1N\(ηnηn⊤−I\)\.G\_\{x\}:=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\(\\xi\_\{n\}\\xi\_\{n\}^\{\\top\}\-I\),\\qquad G\_\{y\}:=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\(\\eta\_\{n\}\\eta\_\{n\}^\{\\top\}\-I\)\.Then
Σ^x=Σ1/2\(I\+Gx\)Σ1/2,Σ^y=Σ1/2\(I\+Gy\)Σ1/2,\\widehat\{\\Sigma\}\_\{x\}=\\Sigma^\{1/2\}\(I\+G\_\{x\}\)\\Sigma^\{1/2\},\\qquad\\widehat\{\\Sigma\}\_\{y\}=\\Sigma^\{1/2\}\(I\+G\_\{y\}\)\\Sigma^\{1/2\},and
C^=Σ1/2\(K⋆\+Gxy\)Σ1/2\.\\widehat\{C\}=\\Sigma^\{1/2\}\(K\_\{\\star\}\+G\_\{xy\}\)\\Sigma^\{1/2\}\.Since
A⋆=Σ−1/2K⋆Σ−1/2,A^\{\\star\}=\\Sigma^\{\-1/2\}K\_\{\\star\}\\Sigma^\{\-1/2\},we obtain
$$
\begin{aligned}
\widehat{Z}&=\Sigma^{1/2}\left(\widehat{C}-\widehat{\Sigma}_{x}A^{\star}\widehat{\Sigma}_{y}\right)\Sigma^{1/2}\\
&=\Sigma\left[K_{\star}+G_{xy}-(I+G_{x})K_{\star}(I+G_{y})\right]\Sigma\\
&=\Sigma\left[G_{xy}-G_{x}K_{\star}-K_{\star}G_{y}-G_{x}K_{\star}G_{y}\right]\Sigma.
\end{aligned}
$$
Let
MW:=ΣWΣ\.M\_\{W\}:=\\Sigma W\\Sigma\.Then
⟨Z^,W⟩F=⟨Gxy−GxK⋆−K⋆Gy−GxK⋆Gy,MW⟩F\.\\langle\\widehat\{Z\},W\\rangle\_\{F\}=\\left\\langle G\_\{xy\}\-G\_\{x\}K\_\{\\star\}\-K\_\{\\star\}G\_\{y\}\-G\_\{x\}K\_\{\\star\}G\_\{y\},M\_\{W\}\\right\\rangle\_\{F\}\.Using\(a\+b\+c\+d\)2≤4\(a2\+b2\+c2\+d2\)\(a\+b\+c\+d\)^\{2\}\\leq 4\(a^\{2\}\+b^\{2\}\+c^\{2\}\+d^\{2\}\), it suffices to bound the four terms separately\.
First, by Lemma[16](https://arxiv.org/html/2606.26617#Thmlemma16),
𝔼\[⟨Gxy,MW⟩F2\]≲1N‖MW‖F2\.\\mathbb\{E\}\[\\langle G\_\{xy\},M\_\{W\}\\rangle\_\{F\}^\{2\}\]\\lesssim\\frac\{1\}\{N\}\\\|M\_\{W\}\\\|\_\{F\}^\{2\}\.
Second,
⟨GxK⋆,MW⟩F=⟨Gx,MWK⋆⊤⟩F\.\\langle G\_\{x\}K\_\{\\star\},M\_\{W\}\\rangle\_\{F\}=\\langle G\_\{x\},M\_\{W\}K\_\{\\star\}^\{\\top\}\\rangle\_\{F\}\.Since‖K⋆‖≤1\\\|K\_\{\\star\}\\\|\\leq 1,
‖MWK⋆⊤‖F≤‖MW‖F\.\\\|M\_\{W\}K\_\{\\star\}^\{\\top\}\\\|\_\{F\}\\leq\\\|M\_\{W\}\\\|\_\{F\}\.Thus, again by Lemma[16](https://arxiv.org/html/2606.26617#Thmlemma16),
𝔼\[⟨GxK⋆,MW⟩F2\]≲1N‖MW‖F2\.\\mathbb\{E\}\[\\langle G\_\{x\}K\_\{\\star\},M\_\{W\}\\rangle\_\{F\}^\{2\}\]\\lesssim\\frac\{1\}\{N\}\\\|M\_\{W\}\\\|\_\{F\}^\{2\}\.Similarly,
⟨K⋆Gy,MW⟩F=⟨Gy,K⋆⊤MW⟩F,\\langle K\_\{\\star\}G\_\{y\},M\_\{W\}\\rangle\_\{F\}=\\langle G\_\{y\},K\_\{\\star\}^\{\\top\}M\_\{W\}\\rangle\_\{F\},and
‖K⋆⊤MW‖F≤‖MW‖F\.\\\|K\_\{\\star\}^\{\\top\}M\_\{W\}\\\|\_\{F\}\\leq\\\|M\_\{W\}\\\|\_\{F\}\.Therefore,
𝔼\[⟨K⋆Gy,MW⟩F2\]≲1N‖MW‖F2\.\\mathbb\{E\}\[\\langle K\_\{\\star\}G\_\{y\},M\_\{W\}\\rangle\_\{F\}^\{2\}\]\\lesssim\\frac\{1\}\{N\}\\\|M\_\{W\}\\\|\_\{F\}^\{2\}\.
It remains to control the quadratic empirical\-covariance term\. Conditional onGyG\_\{y\}, the matrixMWGy⊤K⋆⊤M\_\{W\}G\_\{y\}^\{\\top\}K\_\{\\star\}^\{\\top\}is fixed with respect to the samples enteringGxG\_\{x\}up to the correlation between the two views\. A standard Gaussian decoupling argument for jointly Gaussian quadratic forms gives
𝔼\[⟨GxK⋆Gy,MW⟩F2\|Gy\]≲1N∥MWGy⊤K⋆⊤∥F2\.\\mathbb\{E\}\\left\[\\langle G\_\{x\}K\_\{\\star\}G\_\{y\},M\_\{W\}\\rangle\_\{F\}^\{2\}\\,\\middle\|\\,G\_\{y\}\\right\]\\lesssim\\frac\{1\}\{N\}\\\|M\_\{W\}G\_\{y\}^\{\\top\}K\_\{\\star\}^\{\\top\}\\\|\_\{F\}^\{2\}\.Using‖K⋆‖≤1\\\|K\_\{\\star\}\\\|\\leq 1,
‖MWGy⊤K⋆⊤‖F2≤‖MW‖F2‖Gy‖2\.\\\|M\_\{W\}G\_\{y\}^\{\\top\}K\_\{\\star\}^\{\\top\}\\\|\_\{F\}^\{2\}\\leq\\\|M\_\{W\}\\\|\_\{F\}^\{2\}\\\|G\_\{y\}\\\|^\{2\}\.Taking expectation and using the standard Gaussian sample\-covariance moment bound
𝔼‖Gy‖2≲MN\+1\\mathbb\{E\}\\\|G\_\{y\}\\\|^\{2\}\\lesssim\\frac\{M\}\{N\}\+1together withN≳MN\\gtrsim M, we obtain
𝔼\[⟨GxK⋆Gy,MW⟩F2\]≲1N‖MW‖F2\.\\mathbb\{E\}\\left\[\\langle G\_\{x\}K\_\{\\star\}G\_\{y\},M\_\{W\}\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\frac\{1\}\{N\}\\\|M\_\{W\}\\\|\_\{F\}^\{2\}\.Combining the four estimates yields
𝔼\[⟨Z^,W⟩F2\]≲1N‖MW‖F2\.\\mathbb\{E\}\\left\[\\langle\\widehat\{Z\},W\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\frac\{1\}\{N\}\\\|M\_\{W\}\\\|\_\{F\}^\{2\}\.SinceMW=ΣWΣM\_\{W\}=\\Sigma W\\Sigma, this is exactly
𝔼\[⟨Z^,W⟩F2\]≲1N‖ΣWΣ‖F2\.\\mathbb\{E\}\\left\[\\langle\\widehat\{Z\},W\\rangle\_\{F\}^\{2\}\\right\]\\lesssim\\frac\{1\}\{N\}\\\|\\Sigma W\\Sigma\\\|\_\{F\}^\{2\}\.∎
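The proof rests on the algebraic expansion of $\widehat{Z}$ into the four fluctuation terms, which holds sample by sample. The sketch below checks that expansion on simulated data. It assumes, for illustration only, diagonal $\Sigma$ and diagonal $K_\star$ with hypothetical values, and generates the paired views as $\eta=K_\star\xi+(I-K_\star^2)^{1/2}\zeta$.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 4, 200
mu = np.arange(1, M + 1, dtype=float) ** -1.1   # hypothetical spectrum of Sigma
kappa = np.linspace(0.8, 0.2, M)                # hypothetical diagonal of K_star
S, S_half, K = np.diag(mu), np.diag(np.sqrt(mu)), np.diag(kappa)

# Whitened pairs with E[xi eta^T] = K_star, then the sketched views.
xi = rng.standard_normal((N, M))
eta = xi * kappa + rng.standard_normal((N, M)) * np.sqrt(1.0 - kappa**2)
x, y = xi @ S_half, eta @ S_half

Sx, Sy, C = x.T @ x / N, y.T @ y / N, x.T @ y / N
A_star = np.diag(kappa / mu)                    # Sigma^{-1/2} K_star Sigma^{-1/2}
Z_hat = S_half @ (C - Sx @ A_star @ Sy) @ S_half

# Whitened empirical fluctuations.
Gx = xi.T @ xi / N - np.eye(M)
Gy = eta.T @ eta / N - np.eye(M)
Gxy = xi.T @ eta / N - K
Z_expanded = S @ (Gxy - Gx @ K - K @ Gy - Gx @ K @ Gy) @ S

print(np.max(np.abs(Z_hat - Z_expanded)))
```

The two expressions agree to machine precision, so the lemma reduces to bounding the second moment of each of the four terms paired with $M_W=\Sigma W\Sigma$.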
###### Lemma 18\(Scalar variance filter lower bound\)\.
Let
ψL\(s\):=∏t=1L\(1−γts\),gL\(s\):=∑t=1Lγt∏r=t\+1L\(1−γrs\)\.\\psi\_\{L\}\(s\):=\\prod\_\{t=1\}^\{L\}\(1\-\\gamma\_\{t\}s\),\\qquad g\_\{L\}\(s\):=\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\prod\_\{r=t\+1\}^\{L\}\(1\-\\gamma\_\{r\}s\)\.Assume the stepsize schedule in Assumption[2](https://arxiv.org/html/2606.26617#Thmassumption2)\. Suppose0≤γts≤10\\leq\\gamma\_\{t\}s\\leq 1for allt=1,…,Lt=1,\\ldots,L\. Then
sgL\(s\)=1−ψL\(s\)\.sg\_\{L\}\(s\)=1\-\\psi\_\{L\}\(s\)\.Moreover, there exist absolute constantsc,C\>0c,C\>0such that
s2gL\(s\)2≥cmin\{1,\(Leffγs\)2\},s^\{2\}g\_\{L\}\(s\)^\{2\}\\geq c\\min\\\{1,\(L\_\{\\mathrm\{eff\}\}\\gamma s\)^\{2\}\\\},whenevers≤C/γs\\leq C/\\gamma\.
###### Proof\.
First, by telescoping,
$$
\begin{aligned}
1-\psi_{L}(s)&=1-\prod_{t=1}^{L}(1-\gamma_{t}s)\\
&=\sum_{t=1}^{L}\left[\prod_{r=t+1}^{L}(1-\gamma_{r}s)\right]\left[1-(1-\gamma_{t}s)\right]\\
&=s\sum_{t=1}^{L}\gamma_{t}\prod_{r=t+1}^{L}(1-\gamma_{r}s)=sg_{L}(s).
\end{aligned}
$$
Thus it remains to lower-bound $1-\psi_{L}(s)$.
By the definition of the geometrically decaying schedule, at leastLeffL\_\{\\mathrm\{eff\}\}steps have stepsize comparable toγ\\gamma\. Therefore, there exists an absolute constantc1\>0c\_\{1\}\>0such that
∑t=1Lγt≥c1Leffγ\.\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\geq c\_\{1\}L\_\{\\mathrm\{eff\}\}\\gamma\.Since0≤γts≤10\\leq\\gamma\_\{t\}s\\leq 1, we use
1−u≤e−u\(u≥0\)1\-u\\leq e^\{\-u\}\\qquad\(u\\geq 0\)to obtain
ψL\(s\)=∏t=1L\(1−γts\)≤exp\(−s∑t=1Lγt\)≤exp\(−c1Leffγs\)\.\\psi\_\{L\}\(s\)=\\prod\_\{t=1\}^\{L\}\(1\-\\gamma\_\{t\}s\)\\leq\\exp\\left\(\-s\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\right\)\\leq\\exp\(\-c\_\{1\}L\_\{\\mathrm\{eff\}\}\\gamma s\)\.Hence
1−ψL\(s\)≥1−exp\(−c1Leffγs\)\.1\-\\psi\_\{L\}\(s\)\\geq 1\-\\exp\(\-c\_\{1\}L\_\{\\mathrm\{eff\}\}\\gamma s\)\.For allu≥0u\\geq 0,
1−e−u≥c2min\{1,u\}1\-e^\{\-u\}\\geq c\_\{2\}\\min\\\{1,u\\\}for an absolute constantc2\>0c\_\{2\}\>0\. Takingu=c1Leffγsu=c\_\{1\}L\_\{\\mathrm\{eff\}\}\\gamma s, we get
1−ψL\(s\)≥c3min\{1,Leffγs\}\.1\-\\psi\_\{L\}\(s\)\\geq c\_\{3\}\\min\\\{1,L\_\{\\mathrm\{eff\}\}\\gamma s\\\}\.SincesgL\(s\)=1−ψL\(s\)sg\_\{L\}\(s\)=1\-\\psi\_\{L\}\(s\), we conclude
s2gL\(s\)2=\(1−ψL\(s\)\)2≥cmin\{1,\(Leffγs\)2\}\.s^\{2\}g\_\{L\}\(s\)^\{2\}=\(1\-\\psi\_\{L\}\(s\)\)^\{2\}\\geq c\\min\\\{1,\(L\_\{\\mathrm\{eff\}\}\\gamma s\)^\{2\}\\\}\.∎
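The chain $1-\psi_L(s)\geq 1-\exp(-s\sum_t\gamma_t)\geq c_2\min\{1,s\sum_t\gamma_t\}$ can be checked on a grid. The sketch below uses a hypothetical halving schedule and the explicit choice $c_2=1-e^{-1}$, which is valid because $1-e^{-u}$ is concave on $[0,1]$ and increasing afterwards. The accumulated stepsize `Gamma_L` plays the role of $c_1L_{\mathrm{eff}}\gamma$ in the proof.

```python
import numpy as np

def halving_schedule(gamma, L, block):
    """Hypothetical geometric schedule: stepsize halves every `block` steps."""
    return gamma / 2.0 ** (np.arange(L) // block)

gammas = halving_schedule(gamma=0.25, L=64, block=16)
Gamma_L = float(gammas.sum())                    # accumulated stepsize
c2 = 1.0 - np.exp(-1.0)                          # explicit constant for 1 - e^{-u}

s_grid = np.linspace(0.0, 4.0, 401)              # keeps gamma_t * s <= 1
one_minus_psi = np.array([1.0 - np.prod(1.0 - gammas * s) for s in s_grid])
lower = c2 * np.minimum(1.0, Gamma_L * s_grid)
gap = float(np.min(one_minus_psi - lower))
print(f"Gamma_L = {Gamma_L}, smallest gap over the grid = {gap:.3e}")
```

A non-negative gap over the whole grid is the unsquared form of the lemma; squaring both sides gives the stated bound on $s^2g_L(s)^2$.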
###### Lemma 19\(Coordinatewise contrastive noise non\-degeneracy\)\.
Let
ξ:=Σ−1/2x~,η:=Σ−1/2y~\.\\xi:=\\Sigma^\{\-1/2\}\\widetilde\{x\},\\qquad\\eta:=\\Sigma^\{\-1/2\}\\widetilde\{y\}\.Assume
$$
\mathbb{E}[\xi\xi^{\top}]=I,\qquad\mathbb{E}[\eta\eta^{\top}]=I,\qquad\mathbb{E}[\xi\eta^{\top}]=K_{\star}=\operatorname{diag}(\kappa_{1},\ldots,\kappa_{M}).
$$
Fix $c_{\kappa}\in(0,1)$, and define the non-degenerate index set
ℐκ:=\{\(i,j\):i≠j,κi≤1−cκ,κj≤1−cκ\}\.\\mathcal\{I\}\_\{\\kappa\}:=\\\{\(i,j\):i\\neq j,\\ \\kappa\_\{i\}\\leq 1\-c\_\{\\kappa\},\\ \\kappa\_\{j\}\\leq 1\-c\_\{\\kappa\}\\\}\.Let
E^:=C^−Σ^xA⋆Σ^y,Z^:=Σ1/2E^Σ1/2\.\\widehat\{E\}:=\\widehat\{C\}\-\\widehat\{\\Sigma\}\_\{x\}A^\{\\star\}\\widehat\{\\Sigma\}\_\{y\},\\qquad\\widehat\{Z\}:=\\Sigma^\{1/2\}\\widehat\{E\}\\Sigma^\{1/2\}\.Then, for every\(i,j\)∈ℐκ\(i,j\)\\in\\mathcal\{I\}\_\{\\kappa\}, ifNNis sufficiently large,
𝔼\[Z^ij2\]≳μi2μj2N\.\\mathbb\{E\}\[\\widehat\{Z\}\_\{ij\}^\{2\}\]\\gtrsim\\frac\{\\mu\_\{i\}^\{2\}\\mu\_\{j\}^\{2\}\}\{N\}\.The hidden constant may depend oncκc\_\{\\kappa\}, but not oni,j,N,Mi,j,N,M\.
###### Proof\.
Define the whitened empirical fluctuations
Gxy:=1N∑n=1N\(ξnηn⊤−K⋆\),G\_\{xy\}:=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\(\\xi\_\{n\}\\eta\_\{n\}^\{\\top\}\-K\_\{\\star\}\),Gx:=1N∑n=1N\(ξnξn⊤−I\),Gy:=1N∑n=1N\(ηnηn⊤−I\)\.G\_\{x\}:=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\(\\xi\_\{n\}\\xi\_\{n\}^\{\\top\}\-I\),\\qquad G\_\{y\}:=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\(\\eta\_\{n\}\\eta\_\{n\}^\{\\top\}\-I\)\.Since
A⋆=Σ−1/2K⋆Σ−1/2,A^\{\\star\}=\\Sigma^\{\-1/2\}K\_\{\\star\}\\Sigma^\{\-1/2\},we have
$$
\begin{aligned}
\widehat{Z}&=\Sigma^{1/2}\left(\widehat{C}-\widehat{\Sigma}_{x}A^{\star}\widehat{\Sigma}_{y}\right)\Sigma^{1/2}\\
&=\Sigma\left[K_{\star}+G_{xy}-(I+G_{x})K_{\star}(I+G_{y})\right]\Sigma\\
&=\Sigma\left[G_{xy}-G_{x}K_{\star}-K_{\star}G_{y}-G_{x}K_{\star}G_{y}\right]\Sigma.
\end{aligned}
$$
For $i\neq j$, define the first-order coordinate fluctuation
Fij:=\(Gxy\)ij−κj\(Gx\)ij−κi\(Gy\)ij\.F\_\{ij\}:=\(G\_\{xy\}\)\_\{ij\}\-\\kappa\_\{j\}\(G\_\{x\}\)\_\{ij\}\-\\kappa\_\{i\}\(G\_\{y\}\)\_\{ij\}\.Then
Z^ij=μiμj\[Fij−\(GxK⋆Gy\)ij\]\.\\widehat\{Z\}\_\{ij\}=\\mu\_\{i\}\\mu\_\{j\}\\left\[F\_\{ij\}\-\(G\_\{x\}K\_\{\\star\}G\_\{y\}\)\_\{ij\}\\right\]\.By the inequality
\(a−b\)2≥12a2−b2,\(a\-b\)^\{2\}\\geq\\frac\{1\}\{2\}a^\{2\}\-b^\{2\},we get
𝔼\[Z^ij2\]≥μi2μj2\[12𝔼\[Fij2\]−𝔼\[\(GxK⋆Gy\)ij2\]\]\.\\mathbb\{E\}\[\\widehat\{Z\}\_\{ij\}^\{2\}\]\\geq\\mu\_\{i\}^\{2\}\\mu\_\{j\}^\{2\}\\left\[\\frac\{1\}\{2\}\\mathbb\{E\}\[F\_\{ij\}^\{2\}\]\-\\mathbb\{E\}\[\(G\_\{x\}K\_\{\\star\}G\_\{y\}\)\_\{ij\}^\{2\}\]\\right\]\.
We first lower\-bound𝔼\[Fij2\]\\mathbb\{E\}\[F\_\{ij\}^\{2\}\]\. For a single sample define
fij:=ξiηj−κjξiξj−κiηiηj\.f\_\{ij\}:=\\xi\_\{i\}\\eta\_\{j\}\-\\kappa\_\{j\}\\xi\_\{i\}\\xi\_\{j\}\-\\kappa\_\{i\}\\eta\_\{i\}\\eta\_\{j\}\.Then
Fij=1N∑n=1Nfij\(n\)\.F\_\{ij\}=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}f\_\{ij\}^\{\(n\)\}\.Sincefijf\_\{ij\}is centered and the samples are independent,
𝔼\[Fij2\]=1N𝔼\[fij2\]\.\\mathbb\{E\}\[F\_\{ij\}^\{2\}\]=\\frac\{1\}\{N\}\\mathbb\{E\}\[f\_\{ij\}^\{2\}\]\.
We now compute a lower bound on𝔼\[fij2\]\\mathbb\{E\}\[f\_\{ij\}^\{2\}\]\. BecauseK⋆K\_\{\\star\}is diagonal, the pairs\(ξi,ηi\)\(\\xi\_\{i\},\\eta\_\{i\}\)and\(ξj,ηj\)\(\\xi\_\{j\},\\eta\_\{j\}\)are independent fori≠ji\\neq j\. We may write
ηi=κiξi\+1−κi2ζi,ηj=κjξj\+1−κj2ζj,\\eta\_\{i\}=\\kappa\_\{i\}\\xi\_\{i\}\+\\sqrt\{1\-\\kappa\_\{i\}^\{2\}\}\\,\\zeta\_\{i\},\\qquad\\eta\_\{j\}=\\kappa\_\{j\}\\xi\_\{j\}\+\\sqrt\{1\-\\kappa\_\{j\}^\{2\}\}\\,\\zeta\_\{j\},whereξi,ξj,ζi,ζj\\xi\_\{i\},\\xi\_\{j\},\\zeta\_\{i\},\\zeta\_\{j\}are independent standard Gaussian random variables\. Substituting this representation intofijf\_\{ij\}, we obtain a second\-order Gaussian polynomial\. Since the monomials
ξiξj,ξiζj,ζiξj,ζiζj\\xi\_\{i\}\\xi\_\{j\},\\quad\\xi\_\{i\}\\zeta\_\{j\},\\quad\\zeta\_\{i\}\\xi\_\{j\},\\quad\\zeta\_\{i\}\\zeta\_\{j\}are orthogonal inL2L^\{2\}, the variance is the sum of the squared coefficients\. In particular, when
κi≤1−cκ,κj≤1−cκ,\\kappa\_\{i\}\\leq 1\-c\_\{\\kappa\},\\qquad\\kappa\_\{j\}\\leq 1\-c\_\{\\kappa\},at least one coefficient has magnitude bounded below by a positive constant depending only oncκc\_\{\\kappa\}\. Therefore
𝔼\[fij2\]≥c\(cκ\)\>0\.\\mathbb\{E\}\[f\_\{ij\}^\{2\}\]\\geq c\(c\_\{\\kappa\}\)\>0\.Hence
𝔼\[Fij2\]≥c\(cκ\)N\.\\mathbb\{E\}\[F\_\{ij\}^\{2\}\]\\geq\\frac\{c\(c\_\{\\kappa\}\)\}\{N\}\.
It remains to show that the quadratic term is lower order\. Since‖K⋆‖≤1\\\|K\_\{\\star\}\\\|\\leq 1, standard Gaussian sample\-covariance moment bounds give
𝔼\[\(GxK⋆Gy\)ij2\]≲1N2\.\\mathbb\{E\}\[\(G\_\{x\}K\_\{\\star\}G\_\{y\}\)\_\{ij\}^\{2\}\]\\lesssim\\frac\{1\}\{N^\{2\}\}\.Consequently, forNNsufficiently large,
12𝔼\[Fij2\]−𝔼\[\(GxK⋆Gy\)ij2\]≳1N\.\\frac\{1\}\{2\}\\mathbb\{E\}\[F\_\{ij\}^\{2\}\]\-\\mathbb\{E\}\[\(G\_\{x\}K\_\{\\star\}G\_\{y\}\)\_\{ij\}^\{2\}\]\\gtrsim\\frac\{1\}\{N\}\.Therefore,
𝔼\[Z^ij2\]≳μi2μj2N\.\\mathbb\{E\}\[\\widehat\{Z\}\_\{ij\}^\{2\}\]\\gtrsim\\frac\{\\mu\_\{i\}^\{2\}\\mu\_\{j\}^\{2\}\}\{N\}\.∎
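The step "the variance is the sum of the squared coefficients" can be made explicit. Substituting the representation of $\eta_i,\eta_j$ into $f_{ij}$ gives, by a direct expansion that is not written out in the proof, the coefficients $-\kappa_i^2\kappa_j$, $(1-\kappa_i^2)\sqrt{1-\kappa_j^2}$, $-\kappa_i\kappa_j\sqrt{1-\kappa_i^2}$, and $-\kappa_i\sqrt{(1-\kappa_i^2)(1-\kappa_j^2)}$ on the four monomials, whose squares sum to $(1-\kappa_i^2)(1-\kappa_j^2)+\kappa_i^2\kappa_j^2$. The sketch below cross-checks this against the Gaussian quadratic-form identity $\operatorname{Var}(z^\top Qz)=2\operatorname{tr}((QC)^2)$ for symmetric $Q$; the function names are hypothetical.

```python
import numpy as np

def second_moment_from_coefficients(ki, kj):
    """Sum of squared coefficients of f_ij on the four orthogonal monomials."""
    si2, sj2 = 1.0 - ki**2, 1.0 - kj**2
    return (
        ki**4 * kj**2          # xi_i xi_j
        + si2**2 * sj2         # xi_i zeta_j
        + ki**2 * si2 * kj**2  # zeta_i xi_j
        + ki**2 * si2 * sj2    # zeta_i zeta_j
    )

def second_moment_from_quadratic_form(ki, kj):
    """E[f_ij^2] for z = (xi_i, xi_j, eta_i, eta_j) with covariance C."""
    C = np.eye(4)
    C[0, 2] = C[2, 0] = ki     # E[xi_i eta_i]
    C[1, 3] = C[3, 1] = kj     # E[xi_j eta_j]
    Q = np.zeros((4, 4))
    Q[0, 3] = Q[3, 0] = 0.5          # xi_i eta_j
    Q[0, 1] = Q[1, 0] = -0.5 * kj    # -kappa_j xi_i xi_j
    Q[2, 3] = Q[3, 2] = -0.5 * ki    # -kappa_i eta_i eta_j
    QC = Q @ C
    return float(2.0 * np.trace(QC @ QC) + np.trace(QC) ** 2)

pairs = [(0.0, 0.0), (0.5, 0.5), (0.3, 0.8), (0.9, 0.1)]
for ki, kj in pairs:
    print(ki, kj, second_moment_from_coefficients(ki, kj),
          second_moment_from_quadratic_form(ki, kj))
```

If $0\leq\kappa_i,\kappa_j\leq 1-c_\kappa$, as under the contrastive source condition, then $(1-\kappa_i^2)(1-\kappa_j^2)\geq c_\kappa^2$, which gives one admissible value of the constant $c(c_\kappa)$.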
###### Lemma 20\(Empirical\-to\-population variance\-filter comparison\)\.
Let
ρ:=\(Leffγ\)−1/2,R:=Leffγ\.\\rho:=\(L\_\{\\mathrm\{eff\}\}\\gamma\)^\{\-1/2\},\\qquad R:=L\_\{\\mathrm\{eff\}\}\\gamma\.Assumeℰcov\(R\)\\mathcal\{E\}\_\{\\mathrm\{cov\}\}\(R\)\. Let
𝒱L:=∑t=1Lγt∏s=t\+1L\(I−γsℋ\),ℋ\(Z\):=ΣZΣ,\\mathscr\{V\}\_\{L\}:=\\sum\_\{t=1\}^\{L\}\\gamma\_\{t\}\\prod\_\{s=t\+1\}^\{L\}\(I\-\\gamma\_\{s\}\\mathscr\{H\}\),\\qquad\\mathscr\{H\}\(Z\):=\\Sigma Z\\Sigma,be the population variance filter in whitened coordinates, and let𝒱^Lw\\widehat\{\\mathscr\{V\}\}^\{\\mathrm\{w\}\}\_\{L\}be the corresponding empirical variance filter\. Ifcrel\>0c\_\{\\mathrm\{rel\}\}\>0in the definition ofℰcov\(R\)\\mathcal\{E\}\_\{\\mathrm\{cov\}\}\(R\)is sufficiently small, then for every matrixZZ,
‖𝒱^Lw\(Z\)‖F2≍‖𝒱L\(Z\)‖F2\.\\left\\\|\\widehat\{\\mathscr\{V\}\}^\{\\mathrm\{w\}\}\_\{L\}\(Z\)\\right\\\|\_\{F\}^\{2\}\\asymp\\left\\\|\\mathscr\{V\}\_\{L\}\(Z\)\\right\\\|\_\{F\}^\{2\}\.In particular,
𝔼\[‖𝒱^Lw\(Z^\)‖F2\]≍𝔼\[‖𝒱L\(Z^\)‖F2\]\.\\mathbb\{E\}\\left\[\\left\\\|\\widehat\{\\mathscr\{V\}\}^\{\\mathrm\{w\}\}\_\{L\}\(\\widehat\{Z\}\)\\right\\\|\_\{F\}^\{2\}\\right\]\\asymp\\mathbb\{E\}\\left\[\\left\\\|\\mathscr\{V\}\_\{L\}\(\\widehat\{Z\}\)\\right\\\|\_\{F\}^\{2\}\\right\]\.
###### Proof\.
Write
ℋ^w=ℋ\+Δ,\\widehat\{\\mathscr\{H\}\}\_\{\\mathrm\{w\}\}=\\mathscr\{H\}\+\\Delta,where
ℋ^w\(Z\)=Σ\(I\+Ex\)Z\(I\+Ey\)Σ,‖Ex‖∨‖Ey‖≤crelρ\.\\widehat\{\\mathscr\{H\}\}\_\{\\mathrm\{w\}\}\(Z\)=\\Sigma\(I\+E\_\{x\}\)Z\(I\+E\_\{y\}\)\\Sigma,\\qquad\\\|E\_\{x\}\\\|\\vee\\\|E\_\{y\}\\\|\\leq c\_\{\\mathrm\{rel\}\}\\rho\.Thus
Δ\(Z\)=ΣExZΣ\+ΣZEyΣ\+ΣExZEyΣ\.\\Delta\(Z\)=\\Sigma E\_\{x\}Z\\Sigma\+\\Sigma ZE\_\{y\}\\Sigma\+\\Sigma E\_\{x\}ZE\_\{y\}\\Sigma\.Using‖Σ‖≲1\\\|\\Sigma\\\|\\lesssim 1and the relative concentration assumption,
‖Δ\(Z\)‖F≲crelρ\(‖ΣZΣ‖F\+ρ‖Z‖F\)\.\\\|\\Delta\(Z\)\\\|\_\{F\}\\lesssim c\_\{\\mathrm\{rel\}\}\\rho\\left\(\\\|\\Sigma Z\\Sigma\\\|\_\{F\}\+\\rho\\\|Z\\\|\_\{F\}\\right\)\.The second term is harmless on the effective spectral range becauseρ2=1/R\\rho^\{2\}=1/Ris the learning threshold\.
By the resolvent/telescoping identity for variance filters,
$$
\widehat{\mathscr{V}}^{\mathrm{w}}_{L}-\mathscr{V}_{L}=-\sum_{t=1}^{L}\gamma_{t}\sum_{r=t+1}^{L}\gamma_{r}\,\widehat{\mathscr{B}}^{\mathrm{w}}_{r+1:L}\,\Delta\,\mathscr{B}_{t+1:r-1},
$$
where
ℬa:b:=∏s=ab\(I−γsℋ\),ℬ^a:bw:=∏s=ab\(I−γsℋ^w\),\\mathscr\{B\}\_\{a:b\}:=\\prod\_\{s=a\}^\{b\}\(I\-\\gamma\_\{s\}\\mathscr\{H\}\),\\qquad\\widehat\{\\mathscr\{B\}\}^\{\\mathrm\{w\}\}\_\{a:b\}:=\\prod\_\{s=a\}^\{b\}\(I\-\\gamma\_\{s\}\\widehat\{\\mathscr\{H\}\}\_\{\\mathrm\{w\}\}\),and empty products are interpreted as the identity\. The population and empirical filters are contractions under the stable stepsize condition\. Moreover, the double sum is controlled by the effective horizonR=LeffγR=L\_\{\\mathrm\{eff\}\}\\gamma, while every occurrence ofΔ\\Deltacontributes a factorcrelρc\_\{\\mathrm\{rel\}\}\\rhotogether with one curvature factor\. Combining the scalar filter bounds yields
$$
\left\|(\widehat{\mathscr{V}}^{\mathrm{w}}_{L}-\mathscr{V}_{L})(Z)\right\|_{F}\leq Cc_{\mathrm{rel}}\left\|\mathscr{V}_{L}(Z)\right\|_{F}
$$
for every matrix $Z$, where $C>0$ is an absolute constant. Writing
$$
\widehat{\mathscr{V}}^{\mathrm{w}}_{L}(Z)=\mathscr{V}_{L}(Z)+(\widehat{\mathscr{V}}^{\mathrm{w}}_{L}-\mathscr{V}_{L})(Z),
$$
the triangle inequality and the reverse triangle inequality give
‖𝒱^Lw\(Z\)‖F≤\(1\+Ccrel\)‖𝒱L\(Z\)‖F\\left\\\|\\widehat\{\\mathscr\{V\}\}^\{\\mathrm\{w\}\}\_\{L\}\(Z\)\\right\\\|\_\{F\}\\leq\(1\+Cc\_\{\\mathrm\{rel\}\}\)\\left\\\|\\mathscr\{V\}\_\{L\}\(Z\)\\right\\\|\_\{F\}and
‖𝒱^Lw\(Z\)‖F≥\(1−Ccrel\)‖𝒱L\(Z\)‖F\.\\left\\\|\\widehat\{\\mathscr\{V\}\}^\{\\mathrm\{w\}\}\_\{L\}\(Z\)\\right\\\|\_\{F\}\\geq\(1\-Cc\_\{\\mathrm\{rel\}\}\)\\left\\\|\\mathscr\{V\}\_\{L\}\(Z\)\\right\\\|\_\{F\}\.Takingcrel\>0c\_\{\\mathrm\{rel\}\}\>0sufficiently small gives
‖𝒱^Lw\(Z\)‖F2≍‖𝒱L\(Z\)‖F2\.\\left\\\|\\widehat\{\\mathscr\{V\}\}^\{\\mathrm\{w\}\}\_\{L\}\(Z\)\\right\\\|\_\{F\}^\{2\}\\asymp\\left\\\|\\mathscr\{V\}\_\{L\}\(Z\)\\right\\\|\_\{F\}^\{2\}\.Applying this toZ=Z^Z=\\widehat\{Z\}and taking expectation proves the expectation comparison\. ∎
###### Lemma 21\(Piecewise product effective\-dimension lower bound\)\.
Assume
μi≍i−b,i=1,…,M,\\mu\_\{i\}\\asymp i^\{\-b\},\\qquad i=1,\\ldots,M,for someb\>1/2b\>1/2\. Let
R:=Leffγ,T:=R1/b,R:=L\_\{\\mathrm\{eff\}\}\\gamma,\\qquad T:=R^\{1/b\},and define
$$
D_{\times}(R,M):=\begin{cases}T\log(eT),&1\leq T\leq M,\\ T\left(1+\log\dfrac{M^{2}}{T}\right),&M<T<M^{2},\\ M^{2},&T\geq M^{2}.\end{cases}
$$
Fix a constant $i_{0}\geq 1$. If $T\gtrsim 1$ and $M$ is sufficiently large relative to $i_{0}$, then
$$
\sum_{\substack{i,j\leq M\\ i,j\geq i_{0}\\ i\neq j}}\min\{1,(R\mu_{i}\mu_{j})^{2}\}\gtrsim D_{\times}(R,M).
$$
###### Proof\.
Sinceμi≍i−b\\mu\_\{i\}\\asymp i^\{\-b\}, there is a sufficiently small constantc1\>0c\_\{1\}\>0such that
ij≤c1T⟹Rμiμj≳1\.ij\\leq c\_\{1\}T\\quad\\Longrightarrow\\quad R\\mu\_\{i\}\\mu\_\{j\}\\gtrsim 1\.Therefore each admissible pair in the setij≤c1Tij\\leq c\_\{1\}Tcontributes a constant to the sum\. It remains to count such pairs inside theM×MM\\times Mbox, with the fixed lower cutoffi,j≥i0i,j\\geq i\_\{0\}and the diagonal exclusion changing only absolute constants\.
IfT=O\(1\)T=O\(1\), thenD×\(R,M\)=O\(1\)D\_\{\\times\}\(R,M\)=O\(1\)\. Sincei0i\_\{0\}is fixed andMMis sufficiently large, the single off\-diagonal pair\(i0,i0\+1\)\(i\_\{0\},i\_\{0\}\+1\)gives a constant contribution to the sum\. Thus the claim holds in this bounded horizon case\.
If1≪T≤M1\\ll T\\leq M, then fori0≤i≤c2Ti\_\{0\}\\leq i\\leq c\_\{2\}T, the number of admissiblejj’s withij≤c1Tij\\leq c\_\{1\}Tis at least a constant multiple ofT/iT/i\. Thus
\#\{admissible active pairs\}≳TlogT≍Tlog\(eT\)\.\\\#\\\{\\text\{admissible active pairs\}\\\}\\gtrsim T\\log T\\asymp T\\log\(eT\)\.
IfM<T<M2M<T<M^\{2\}, split the count into two parts\. When the rangei0≤i≤c2T/Mi\_\{0\}\\leq i\\leq c\_\{2\}T/Mis nonempty, a constant fraction of allMMvalues ofjjare admissible for theseii’s, giving a contribution≳T\\gtrsim T\. If this range is empty, thenT/M=O\(1\)T/M=O\(1\), and the logarithmic contribution below already dominates the missingTTterm\. Second, forc3T/M<i≤c4Mc\_\{3\}T/M<i\\leq c\_\{4\}M, the number of admissiblejj’s is again at least a constant multiple ofT/iT/i, and hence
∑c3T/M<i≤c4MTi≳TlogM2T\.\\sum\_\{c\_\{3\}T/M<i\\leq c\_\{4\}M\}\\frac\{T\}\{i\}\\gtrsim T\\log\\frac\{M^\{2\}\}\{T\}\.Together these two parts give
\#\{admissible active pairs\}≳T\(1\+logM2T\)\.\\\#\\\{\\text\{admissible active pairs\}\\\}\\gtrsim T\\left\(1\+\\log\\frac\{M^\{2\}\}\{T\}\\right\)\.
If $T\geq M^{2}$, choose a sufficiently small constant $c_{5}>0$. Then for all $i,j\leq c_{5}M$, one has $ij\leq c_{1}T$. Hence the admissible-pair count is at least a constant multiple of $M^{2}$, which matches $D_{\times}(R,M)=M^{2}$ in this regime.

Combining the three regimes proves the claim. ∎
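To get a feel for the lemma, the off-diagonal sum and $D_\times(R,M)$ can be evaluated side by side. The sketch below assumes the exact power law $\mu_i=i^{-b}$ (the lemma only requires $\mu_i\asymp i^{-b}$) and uses hypothetical function names; it illustrates the scaling and is not a proof of the hidden constants.

```python
import math
import numpy as np

def d_times(R, M, b):
    """Piecewise effective dimension D_x(R, M) with T = R^(1/b), for T >= 1."""
    T = R ** (1.0 / b)
    if T < 1.0:
        raise ValueError("D_x is only defined here for T >= 1")
    if T >= M**2:
        return float(M**2)
    if T > M:
        return T * (1.0 + math.log(M**2 / T))
    return T * math.log(math.e * T)

def offdiag_sum(R, M, b, i0=1):
    """Sum of min{1, (R mu_i mu_j)^2} over i0 <= i, j <= M with i != j."""
    mu = np.arange(i0, M + 1, dtype=float) ** (-b)
    terms = np.minimum(1.0, (R * np.outer(mu, mu)) ** 2)
    return float(terms.sum() - np.trace(terms))

M, b = 10, 1.1
for R in [1.0, 5.0, 50.0, 2.0 * M ** (2 * b)]:
    print(R, offdiag_sum(R, M, b), d_times(R, M, b))
```

In the saturated regime every off-diagonal pair is active, so the sum equals $M(M-1)$ while $D_\times=M^2$. The three pieces of $D_\times$ also agree at the boundaries $T=M$ and $T=M^2$, so the reference curve is continuous in $T$.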
## Appendix G Additional Experimental Details
All synthetic experiments were run locally on a CPU-only machine. The numerical implementation uses standard dense linear algebra and plotting routines; no GPU acceleration was used. Random Gaussian sketches were generated with entries sampled independently as $S_{ij}\sim\mathcal{N}(0,1/M)$. For the optimization panels, we use ambient dimension $D=4096$, sketch dimension $M=32$, sample size $N=4096$, power-law exponents $a=2.6$ and $b=1.1$, and $240$ independent repetitions. The target coefficients are aligned with the covariance eigenbasis so that the source condition in Assumption [3](https://arxiv.org/html/2606.26617#Thmassumption3) is enforced in the finite construction. The empirical horizon used in the plots is the accumulated stepsize
$$
\Gamma_{L}:=\sum_{t=1}^{L}\gamma_{t},
$$
which plays the same scaling role as $R$ in the theorem, up to the logarithmic factors suppressed by the GD schedule.
For the optimization panels, the theorem is applied on the covariance event $\mathcal{E}_{\mathrm{cov}}(R)$, which is ensured theoretically by the sample condition $N\gtrsim RM$. In the numerical implementation we use controlled marginal covariances rather than testing this concentration condition directly: wherever the GD update or finite-filter reference uses $\widehat{\Sigma}_{x}$ and $\widehat{\Sigma}_{y}$, we replace the raw empirical marginal covariances by the corresponding population sketched covariance matrices with small bounded relative perturbations. Equivalently, the perturbed matrices have the form $\Sigma^{1/2}(I+\Delta)\Sigma^{1/2}$ with $\|\Delta\|_{\mathrm{op}}$ bounded by the prescribed perturbation radius. This keeps the optimization panels focused on the predicted GD bias/variance filters, while the sampled cross-covariance fluctuations still generate the GD-variance component measured across repetitions.
1. Approximation experiment. The first experiment tests the approximation term by varying the sketch dimension $M$. We use ambient dimension $D=4096$, power-law exponents $a=2.6$ and $b=1.1$, and average over independently generated Gaussian sketches for each value of $M$. The tested sketch dimensions are $M\in\{128,192,256,384,512\}$. The empirical approximation error is computed from population sketched moments, without sampling finite datasets. The fitted empirical slope is approximately $-1.23$, while the predicted slope is $-1.10$. This shows that the observed approximation scaling is close to the theoretical sketch-dimension rate.
2. GD-bias experiment. The second experiment tests the GD-bias term along the optimization horizon. We fix $M=32$, $N=4096$, and $D=4096$, enforce the source-aligned target, and run empirical GD with the same decaying stepsize schedule used in the theory. The number of GD iterations ranges from $16$ to $98304$. The plotted reference is the $\Gamma_{L}$-rate from Theorem [1](https://arxiv.org/html/2606.26617#Thmmytheorem1), rescaled by a single constant. For small $L$, the run remains in the theorem's bias-unsaturated regime, and the measured bias follows the reference curve closely. A log-log fit over this small-$L$ portion gives an empirical slope of approximately $-0.87$ as a function of $\Gamma_{L}$, compared with the theorem-guided slope $-0.91$. For larger $L$, the run may leave the theorem's bias-unsaturated regime, so the empirical curve is expected to separate from the displayed reference.
3. GD-variance experiment. The third experiment tests the GD-variance term along the same optimization horizon. We fix $M=32$, $N=4096$, and $D=4096$, vary $L$, independently generate empirical datasets from the sketched Gaussian law, and measure the variance component across repetitions. The expected reference curve in this panel is computed from the finite sketched spectrum using the exact GD variance filter

   $$\begin{aligned} d_{\mathrm{GD}}(L) &:= \sum_{i,j\leq M}\left(\widehat{\mu}_{i}\widehat{\nu}_{j}\right)^{2}g_{ij}(L)^{2},\\ g_{ij}(L) &:= \sum_{t=1}^{L}\gamma_{t}\prod_{r=t+1}^{L}\left(1-\gamma_{r}\widehat{\mu}_{i}\widehat{\nu}_{j}/\tau\right), \end{aligned}$$

   and the plotted variance reference is proportional to $d_{\mathrm{GD}}(L)/N$. Here $\widehat{\mu}_{i}$ and $\widehat{\nu}_{j}$ are the finite sketched marginal eigenvalues used in the empirical construction. The two vertical markers indicate the unsaturated-to-intermediate and intermediate-to-saturated transitions, approximately $L=1.27\times 10^{3}$ and $L=9.14\times 10^{4}$, respectively. The empirical curve exhibits the expected three-stage increase: an initial growth phase, a bent intermediate regime, and saturation near the finite sketched dimension. Fitting the empirical curve on the early, intermediate, and late parts gives slopes approximately $0.95$, $0.36$, and $0.01$, respectively; the corresponding slopes of the finite-filter reference are approximately $0.85$, $0.33$, and $0.01$.
4. Excess-risk decomposition experiment. The fourth experiment tests the full excess-risk decomposition using the same values $M=32$, $N=4096$, $D=4096$, and the same optimization checkpoints. At each checkpoint, we compute the empirical excess risk, the empirical GD-bias term, the empirical GD-variance term, and the cross term. The empirical excess risk is nearly identical to the bias-plus-variance curve across all checkpoints. The cross term is negative and very small: its largest relative magnitude is below $0.4\%$ of the bias-plus-variance reference. This supports the treatment of the cross term as negligible for the upper-bound scaling law.
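The finite-filter reference from the GD-variance experiment above can be evaluated without forming the inner product over $r$ explicitly, because $g_{ij}(L)$ satisfies the recursion $g_{ij}(L)=(1-\gamma_{L}\widehat{\mu}_{i}\widehat{\nu}_{j}/\tau)\,g_{ij}(L-1)+\gamma_{L}$ with $g_{ij}(0)=0$. The following is a minimal sketch under assumptions: the function name, the power-law spectra, the constant stepsize, and $\tau=1$ are illustrative placeholders, not the values used for the figures.

```python
import numpy as np

def gd_variance_filter(mu, nu, gammas, tau=1.0):
    """Evaluate d_GD(L) = sum_{i,j} (mu_i nu_j)^2 g_ij(L)^2 for L = len(gammas)."""
    lam = np.outer(mu, nu)  # products mu_i * nu_j
    g = np.zeros_like(lam)
    for gamma in gammas:
        # One step of the recursion equivalent to the sum-product definition.
        g = (1.0 - gamma * lam / tau) * g + gamma
    return float(np.sum((lam * g) ** 2))

# Hypothetical spectra and a constant stepsize, chosen only for illustration.
M = 8
mu = np.arange(1, M + 1, dtype=float) ** -2.0
nu = np.arange(1, M + 1, dtype=float) ** -2.0
gammas = np.full(4096, 0.5)

checkpoints = [16, 256, 4096]
d_vals = [gd_variance_filter(mu, nu, gammas[:L]) for L in checkpoints]
```

Dividing each entry of `d_vals` by the sample size gives a curve proportional to the variance reference. In this toy setting with a constant stepsize the values increase with `L` and stay below $M^{2}\tau^{2}$, since each product $\widehat{\mu}_{i}\widehat{\nu}_{j}\,g_{ij}(L)$ approaches $\tau$.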
In the approximation experiment, the reported means decrease from approximately $1.72\times 10^{-2}$ at $M=128$ to $3.08\times 10^{-3}$ at $M=512$. In the GD-bias experiment, the bias decreases from approximately $2.80\times 10^{-2}$ at $L=16$ to $1.64\times 10^{-9}$ at $L=98304$. In the GD-variance experiment, the variance increases from approximately $4.78\times 10^{-4}$ at $L=16$ to $1.20\times 10^{-1}$ at $L=98304$, with saturation visible near the end of the horizon. In the excess-risk decomposition experiment, the empirical excess risk stays close to the bias-plus-variance reference; for example, at $L=98304$, both are approximately $1.20\times 10^{-1}$. These numerical results are consistent with the four plotted curves in Figure [1](https://arxiv.org/html/2606.26617#Sx4.F1).
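As a quick consistency check on the approximation numbers, the log-log slope implied by the two reported endpoint means can be computed directly. This is only a two-point estimate using the rounded values quoted above, so it need not coincide exactly with the least-squares fit over all five sketch dimensions; the helper name `loglog_slope` is introduced here for illustration.

```python
import math

def loglog_slope(x0, y0, x1, y1):
    """Slope of the line through (log x0, log y0) and (log x1, log y1)."""
    return (math.log(y1) - math.log(y0)) / (math.log(x1) - math.log(x0))

# Endpoint means reported for the approximation experiment.
approx_slope = loglog_slope(128, 1.72e-2, 512, 3.08e-3)
print(f"two-point approximation slope: {approx_slope:.2f}")
```

The resulting value of about $-1.24$ sits close to the fitted slope of $-1.23$ reported for the approximation experiment.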
## Appendix H Notation Map
This section records the notation used in the main text and proofs\. The goal is to keep a one\-to\-one association between each recurring symbol and its formula\. Symbols used only as dummy variables inside a single displayed sum or inner product are not listed\.
### Population and sketched objects
### Risks, minimizers, and GD quantities
### Scaling terms and product dimensions
### Proof coordinates, projections, and filters
### Noise variables used in auxiliary lemmas