Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

arXiv cs.LG 06/11/26, 04:00 AM Papers
Summary
This paper introduces Bernstein–Schur kernels, a class of nonstationary kernels between shift-invariant and dot-product templates, and provides a random feature construction by sketching the finite modulation and randomizing the completely monotone radial factor. The method yields unbiased estimators with operator-norm bounds controlled by intrinsic dimensions, and experiments validate the approach on a biased kernel example.
arXiv:2606.11255v1 Announce Type: new Abstract: Bernstein--Schur kernels are products of a finite-feature kernel (one with an explicit finite-dimensional feature map) and a completely monotone shift-invariant kernel: nonstationary kernels that fall between the shift-invariant and dot-product templates random features usually exploit, so in general neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that \emph{randomizes both factors: it sketches the finite modulation and randomizes the completely monotone radial factor, sampling the latter's one-dimensional Bernstein--Widder scale and then applying Gaussian random Fourier features (whose frequency is still $d$-dimensional). The feature dimension is then $Dm$, set by the sketch size $m$ and the radial-draw count $D$, free of the $O(d^2)$ size of the exact modulation feature. Keeping the modulation \emph{exact} is the analyzable limit ($m\to\infty$): there we prove unbiasedness, an exact variance for the recommended flat estimator, an expected matrix-Bernstein operator-norm bound (with a matching high-probability tail) controlled by the top eigenvalues of the kernel and modulation Gram matrices together with an intrinsic dimension rather than the crude $N\max_{ij}$ entrywise route, and a deterministic relative-spectral kernel-ridge stability result. By conditioning on the sketch, the doubly-randomized estimator inherits the same intrinsic-dimension operator-norm guarantee plus a single additive sketch term, tunable by $m$ independently of $D$. The motivating instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, $b\ge0$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$; for it the radial mixture is the IMQ spectral sampler, and one frequency per scale is variance-optimal at a fixed radial-feature budget.
Original Article
View Cached Full Text
Cached at: 06/11/26, 01:45 PM
# Bernstein–Schur Kernels: Random Features by Sketched Modulation and Radial Randomization
Source: [https://arxiv.org/html/2606.11255](https://arxiv.org/html/2606.11255)
\\bbl@luahyphenate\\directlua

if Babel\.locale\_mapped == nil then Babel\.locale\_mapped = true Babel\.linebreaking\.add\_before\(Babel\.locale\_map, 1\) Babel\.loc\_to\_scr = Babel\.chr\_to\_loc = Babel\.chr\_to\_loc or end Babel\.locale\_props\[1\]\.letters = false\\directluaif Babel\.script\_blocks\[’Latn’\] then Babel\.loc\_to\_scr\[1\] = Babel\.script\_blocks\[’Latn’\] Babel\.locale\_props\[1\]\.lg = 89 end\\directluaif Babel\.script\_blocks\[’Latn’\] then Babel\.loc\_to\_scr\[1\] = Babel\.script\_blocks\[’Latn’\] end\\bbl@luahyphenate\\directluaif Babel\.locale\_mapped == nil then Babel\.locale\_mapped = true Babel\.linebreaking\.add\_before\(Babel\.locale\_map, 1\) Babel\.loc\_to\_scr = Babel\.chr\_to\_loc = Babel\.chr\_to\_loc or end Babel\.locale\_props\[2\]\.letters = false\\directluaif Babel\.script\_blocks\[”\] then Babel\.loc\_to\_scr\[2\] = Babel\.script\_blocks\[”\] Babel\.locale\_props\[2\]\.lg = 89 end\\directluaif Babel\.script\_blocks\[”\] then Babel\.loc\_to\_scr\[2\] = Babel\.script\_blocks\[”\] end \[tifinagh\]rm\[Path=fonts/, Extension=\.ttf, UprightFont=\*\]ebrima\\bbl@patterns@luaenglish\\bbl@patterns@luatifinagh

###### Abstract

*Bernstein–Schur kernels*are products of a finite\-feature kernel \(one with an explicit finite\-dimensional feature map\) and a completely monotone shift\-invariant kernel: nonstationary kernels that fall between the shift\-invariant and dot\-product templates random features usually exploit, so in general neither Bochner sampling nor polynomial sketching applies to the full kernel directly\. We give one random\-feature construction for the whole class that*randomizes both factors*: it sketches the finite modulation and randomizes the completely monotone radial factor, sampling the latter’s one\-dimensional Bernstein–Widder scale and then applying Gaussian random Fourier features \(whose frequency is stilldd\-dimensional\)\. The feature dimension is thenDmDm, set by the sketch sizemmand the radial\-draw countDD, free of theO\(d2\)O\(d^\{2\}\)size of the exact modulation feature\. Keeping the modulation*exact*is the analyzable limit \(m→∞m\\to\\infty\): there we prove unbiasedness, an exact variance for the recommended flat estimator, an expected matrix\-Bernstein operator\-norm bound \(with a matching high\-probability tail\) controlled by the top eigenvalues of the kernel and modulation Gram matrices together with an intrinsic dimension rather than the crudeNmaxijN\\max\_\{ij\}entrywise route, and a deterministic relative\-spectral kernel\-ridge stability result\. By conditioning on the sketch, the doubly\-randomized estimator inherits the*same*intrinsic\-dimension operator\-norm guarantee plus a single additive sketch term, tunable bymmindependently ofDD\. The motivating instance is the biased\\tifinaghfontⵟ\-kernelk\\tifinaghfontⵟ,b\(w,x\)=\(w⊤x\+b\)2/\(‖w−x‖2\+ε\)k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(w,x\)=\(w^\{\\top\}x\+b\)^\{2\}/\(\\\|w\-x\\\|^\{2\}\+\\varepsilon\),b≥0b\\geq 0, whose family span contains the inverse\-multiquadric kernel by finite differences inbb; for it the radial mixture is the IMQ spectral sampler, and one frequency per scale is variance\-optimal at a fixed radial\-feature budget\. A modulation–radial error decomposition shows the product stays useful when the modulation is sketched: radial noise is modulated by the alignment factor while modulation error is localized by proximity\. Experiments validate the construction off\-sphere, where the kernel is genuinely non\-dot\-product and both uniform and k\-means Nyström degrade withddat matched landmark count \(while at matched*representation*cost an adaptive Nyström can be more accurate, at the price of being data\-dependent and non\-streaming\) and show, in a controlled test, that the\\tifinaghfontⵟ\-kernel is preferred on the coupled alignment–proximity target and not on single\-factor controls\.

## 1Introduction

Kernel methods \(support vector machines\(Cortes & Vapnik,[1995](https://arxiv.org/html/2606.11255#bib.bib8)\), kernel ridge regression, Gaussian processes\(Rasmussen & Williams,[2006](https://arxiv.org/html/2606.11255#bib.bib22)\), and the attention layers of modern transformers, which compute a kernel smoother over tokens\(Tsai et al\.,[2019](https://arxiv.org/html/2606.11255#bib.bib28)\)\) rest on theN×NN\\times NGram matrixKij=k\(xi,xj\)K\_\{ij\}=k\(x\_\{i\},x\_\{j\}\), whoseO\(N2\)O\(N^\{2\}\)storage andO\(N3\)O\(N^\{3\}\)factorization become infeasible onceNNexceeds10510^\{5\}\. The kernel may be fixed or learned implicitly\(Li et al\.,[2019](https://arxiv.org/html/2606.11255#bib.bib13)\), but in every case scaling these methods means never formingKKexplicitly, and the dominant tool is the random feature map: a low\-dimensionalzzwith𝔼\[z\(x\)⊤z\(w\)\]=k\(x,w\)\\mathbb\{E\}\[z\(x\)^\{\\top\}z\(w\)\]=k\(x,w\), so thatK≈ZZ⊤K\\approx ZZ^\{\\top\}is handled inO\(NM\)O\(NM\)time and memory\. For attention in particular, this is the random\-feature linear\-attention route\(Katharopoulos et al\.,[2020](https://arxiv.org/html/2606.11255#bib.bib12); Choromanski et al\.,[2021](https://arxiv.org/html/2606.11255#bib.bib7)\)\.

Which kernels admit such a map is dictated by their structure\.*Random Fourier features*\(RFF\)\(Rahimi & Recht,[2007](https://arxiv.org/html/2606.11255#bib.bib23)\)cover shift\-invariant kernelsk\(x,w\)=ϕ\(x−w\)k\(x,w\)=\\phi\(x\-w\)through Bochner’s theorem, sampling the nonnegative spectral measure whose Fourier transform isϕ\\phi; quasi\-Monte Carlo\(Avron et al\.,[2016](https://arxiv.org/html/2606.11255#bib.bib3)\)and orthogonal frequencies\(Yu et al\.,[2016](https://arxiv.org/html/2606.11255#bib.bib34)\)reduce their variance, and the construction underpins large\-scale kernel ridge regression\(Avron et al\.,[2017](https://arxiv.org/html/2606.11255#bib.bib4)\)\.*Polynomial sketches*and dot\-product random features \(TensorSketch, Fastfood, and the dot\-product feature maps ofPham & Pagh,[2013](https://arxiv.org/html/2606.11255#bib.bib19); Le et al\.,[2013](https://arxiv.org/html/2606.11255#bib.bib17); Kar & Karnick,[2012](https://arxiv.org/html/2606.11255#bib.bib14)\) do the same for dot\-product kernelsk\(x,w\)=κ\(x⊤w\)k\(x,w\)=\\kappa\(x^\{\\top\}w\)\.*Nyström methods*\(Williams & Seeger,[2000](https://arxiv.org/html/2606.11255#bib.bib32); Musco & Musco,[2017](https://arxiv.org/html/2606.11255#bib.bib18)\)take a different route, building a low\-rank approximation fromm≪Nm\\ll Nlandmark columns at the mercy of the kernel’s spectral decay\. Each family is bound to a structure: RFF to shift\-invariance, sketches to dot\-product form, Nyström to fast eigenvalue decay\. A unifying thread is that any kernel written as a*mixture*can be linearized by sampling the mixture, which extends RFF beyond Gaussian spectra to broad stationary families\(Wilson & Adams,[2013](https://arxiv.org/html/2606.11255#bib.bib33)\); our radial factor is handled in exactly this way\. What is specific here is that the mixture is applied to only the radial part of a kernel that is*not*stationary overall, and is composed with an exact finite feature for the polynomial part, so the scheme sits at the intersection of mixture\-based RFF and exact feature maps rather than inside either\.

The kernel we wish to scale fits none of these molds\. The biased\\tifinaghfontⵟ\-kernel

k\\tifinaghfontⵟ,b\(w,x\)=\(w⊤x\+b\)2‖w−x‖2\+ε,b≥0,ε\>0,k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(w,x\)=\\frac\{\(w^\{\\top\}x\+b\)^\{2\}\}\{\\\|w\-x\\\|^\{2\}\+\\varepsilon\},\\qquad b\\geq 0,\\ \\varepsilon\>0,\(1\)couples alignment and proximity: a squared inner product in the numerator, an inverse squared\-distance in the denominator\. The finite\-difference identity in Proposition[A\.9](https://arxiv.org/html/2606.11255#A1.Thmtheorem9)shows that the family spanspan⁡\{k\\tifinaghfontⵟ,b:b≥0\}\\operatorname\{span\}\\\{k\_\{\\text\{\\tifinaghfont ⵟ\},b\}:b\\geq 0\\\}contains the inverse\-multiquadric \(IMQ\) kernel\. We do not rely on any single\-kernel universality statement in this paper; fixed\-b\>0b\>0universality is discussed separately byBouhsine \([2026](https://arxiv.org/html/2606.11255#bib.bib6)\)\. In the construction and experiments we treat a fixedb\>0b\>0as a practical inductive\-bias parameter, and the unbiasedk\\tifinaghfontⵟk\_\{\\text\{\\tifinaghfont ⵟ\}\}\(b=0b=0\), whose RKHS functions all vanish at the origin, is the special case\. Butk\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is*not*shift\-invariant and*not*a dot\-product kernel: the radial denominator‖x−w‖2\\\|x\-w\\\|^\{2\}alone is shift\-invariant, yet multiplying it by the alignment numerator\(x⊤w\+b\)2\(x^\{\\top\}w\+b\)^\{2\}destroys both forms at once, so neither Bochner sampling nor polynomial sketching applies to the full kernel, and exact methods revert to theO\(N2\)O\(N^\{2\}\)Gram matrix\. \(On the unit sphere‖x−w‖2=2−2x⊤w\\\|x\-w\\\|^\{2\}=2\-2x^\{\\top\}w, sok\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}coincides with a rational dot\-product kernel and a dimension\-efficient dot\-product route becomes available; deriving that route is left to future work, and we target the generalℝd\\mathbb\{R\}^\{d\}construction here, where no such reduction exists\.\) The key empirical test is therefore off\-sphere \(Figure[1](https://arxiv.org/html/2606.11255#S1.F1)\): on a bounded ball with varying norms \(Section[4\.1](https://arxiv.org/html/2606.11255#S4.SS1)\) the kernel is genuinely non\-dot\-product, and RAY \(Random Approximation of the\\tifinaghfontⵟ\-kernel\) retains itsO\(1/D\)O\(1/\\sqrt\{D\}\)behavior while landmark methods degrade withdd\.

The obstacle dissolves oncek\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is read as a Schur product \(Eq\.[2](https://arxiv.org/html/2606.11255#S2.E2)\) of two individually tractable pieces: the biased degree\-2 polynomial kernel\(w⊤x\+b\)2\(w^\{\\top\}x\+b\)^\{2\}, which has an exact finite feature map, and the IMQ\-type radial kernel\(‖w−x‖2\+ε\)−1\(\\\|w\-x\\\|^\{2\}\+\\varepsilon\)^\{\-1\}, which, being completely monotone in‖x−w‖2\\\|x\-w\\\|^\{2\}, is a Bernstein–Widder mixture of Gaussians over a single scale parameter \(the classical Schoenberg radial kernels\(Schoenberg,[1938](https://arxiv.org/html/2606.11255#bib.bib25)\), for which broad scale\-mixture samplers exist beyond this IMQ case\(Langrené et al\.,[2024](https://arxiv.org/html/2606.11255#bib.bib16)\)\)\. We turn this into a feature scheme that randomizes*both*factors: sample the radial factor’s one\-dimensional scale, apply random Fourier features to the resulting Gaussian, and sketch the polynomial factor\. Keeping the polynomial exact is the analyzable limit \(m→∞m\\to\\infty\), where the variance and concentration are sharpest; but that feature carries theO\(d2\)O\(d^\{2\}\)polynomial size, whereas sketching makes the dimensionDmDm, free ofd2d^\{2\}, at the cost of one controllable sketch term \(Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\)\. Beyond the radial random Fourier features, the only added randomness is the one\-dimensional Bernstein scale, so for a fixed dataset the radial draw count carries no explicitdd\-dependence \(as for standard RFF\), the genuine contrast with uniform Nyström, which we confirm degrades withdd\.

The point is not that products of kernels can be tensored, a standard closure property, but that factoring this particular nonstationary, non\-dot\-product kernel turns each piece into a textbook object \(an exact finite polynomial feature and a Gaussian mixture\), so a kernel fitting neither random\-feature template is linearized by combining the standard tools for each\. This places RAY among random\-feature methods that reach beyond Bochner sampling: for dot\-product\(Kar & Karnick,[2012](https://arxiv.org/html/2606.11255#bib.bib14); Pham & Pagh,[2013](https://arxiv.org/html/2606.11255#bib.bib19)\), compositional\(Daniely et al\.,[2017](https://arxiv.org/html/2606.11255#bib.bib9)\), and non\-stationary spectral or harmonizable\(Remes et al\.,[2017](https://arxiv.org/html/2606.11255#bib.bib21); Ton et al\.,[2018](https://arxiv.org/html/2606.11255#bib.bib27); Shen et al\.,[2019](https://arxiv.org/html/2606.11255#bib.bib26)\)kernels, from which it differs by factoring the kernel into a finite modulation and a completely monotone radial factor and randomizing each by its native tool, a polynomial sketch and a Bernstein–Widder radial sampler \(Section[5](https://arxiv.org/html/2606.11255#S5)\)\.

The contribution is the analysis of the resulting estimator and the class it defines\. For exact modulation we prove unbiasedness, an exact variance for the flat estimator, an expected matrix\-Bernstein operator\-norm bound \(with a matching tail\) controlled by the top eigenvalues of the kernel and modulation Gram matrices together with an intrinsic dimension, and relative\-spectral kernel\-ridge stability; conditioning on the sketch carries the operator\-norm guarantee to the deployed doubly\-randomized estimator\. Unbiasedness, the variance, and the uniform entrywise bound lift verbatim to the*Bernstein–Schur*class \(a finite\-feature kernel times a completely monotone radial kernel, of whichk\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is the flagship\) by a single theorem \(Theorem[2\.5](https://arxiv.org/html/2606.11255#S2.Thmtheorem5)\); the matrix\-Bernstein and kernel\-ridge statements we prove fork\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}, with the same proof pattern for any bounded finite modulation\. The construction is self\-contained: positive\-definiteness from the Schur factorization \(Eq\.[2](https://arxiv.org/html/2606.11255#S2.E2)\), the IMQ span from the finite\-difference identity \(Proposition[A\.9](https://arxiv.org/html/2606.11255#A1.Thmtheorem9)\), and the radial mixture from Bernstein–Widder \(Section[2\.2](https://arxiv.org/html/2606.11255#S2.SS2)\); we rely onBouhsine \([2026](https://arxiv.org/html/2606.11255#bib.bib6)\)only for the historical infinite\-feature interpretation and fixed\-bbuniversality not needed here\.

#### Contributions\.

1. \(i\)A random\-feature construction for a kernel class, not one kernel\.We identify*Bernstein–Schur kernels*, a finite\-feature kernel times a completely monotone shift\-invariant kernel, and give one estimator for the whole class: keep the finite feature exact, sample the radial Bernstein–Widder scale, and apply random Fourier features \(Theorem[2\.5](https://arxiv.org/html/2606.11255#S2.Thmtheorem5), unbiased with variance and uniform bounds\)\. The biased\\tifinaghfontⵟ\-kernel is the flagship instance \(Section[2](https://arxiv.org/html/2606.11255#S2), Proposition[2\.1](https://arxiv.org/html/2606.11255#S2.Thmtheorem1)shows it is genuinely neither stationary nor dot\-product\); the deployed estimator randomizes both factors, with exact modulation as the analyzable limit \(Section[2\.5](https://arxiv.org/html/2606.11255#S2.SS5)\)\.
2. \(ii\)Sharp variance and optimal allocation\(Section[3](https://arxiv.org/html/2606.11255#S3)\): the*exact*variance of the recommended flat estimator \(Theorem[3\.1](https://arxiv.org/html/2606.11255#S3.Thmtheorem1)\), a two\-level identity proving one frequency per scale is variance\-optimal within the hierarchical estimator at a fixed radial\-feature budget \(Proposition[A\.2](https://arxiv.org/html/2606.11255#A1.Thmtheorem2)\), and a normalized variant whose variance is bounded free of the data radius and bias \(Proposition[A\.7](https://arxiv.org/html/2606.11255#A1.Thmtheorem7)\)\.
3. \(iii\)Gram concentration and deterministic KRR perturbation stability: a uniform Gram bound \(Theorem[A\.3](https://arxiv.org/html/2606.11255#A1.Thmtheorem3)\) and an expected matrix\-Bernstein operator\-norm bound \(Theorem[3\.2](https://arxiv.org/html/2606.11255#S3.Thmtheorem2)\) with a matching high\-probability tail \(Corollary[3\.3](https://arxiv.org/html/2606.11255#S3.Thmtheorem3)\), controlled by the top eigenvalues of the kernel and modulation Gram matrices and an intrinsic dimension rather than the crudeNmaxijN\\max\_\{ij\}entrywise route \(while still allowing worst\-caseNN\-scaling through those spectra\), turned into a deterministic relative\-spectral kernel\-ridge stability result \(Theorem[3\.6](https://arxiv.org/html/2606.11255#S3.Thmtheorem6)\): a perturbation bound on the ridge coefficients, not a high\-probability excess\-risk guarantee, which we leave open\.
4. \(iv\)RAY, the doubly\-randomized estimator, and the signal\-modulation decomposition\.The exact polynomial feature costsO\(d2\)O\(d^\{2\}\)per draw, so RAY sketches it: the feature dimension becomesDmDm, free ofd2d^\{2\}\. Conditioning on the sketch, this doubly\-randomized estimator carries the*same*intrinsic\-dimension operator\-norm guarantee as the exact\-modulation limit plus a single additive sketch term, tunable bymmindependently ofDD\(Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\); exact modulation is them→∞m\\to\\inftylimit\. A modulation–radial error decomposition \(Proposition[3\.7](https://arxiv.org/html/2606.11255#S3.Thmtheorem7)\) explains why the product still helps: radial approximation noise is modulated by the true finite feature, modulation error is localized by radial proximity, and only an interaction term mixes them, so the alignment×\\timesproximity product suppresses similarity false positives even when the modulation is sketched \(Section[4\.5](https://arxiv.org/html/2606.11255#S4.SS5)\)\.
5. \(v\)Empirical validation\(Section[4](https://arxiv.org/html/2606.11255#S4)\): theO\(1/D\)O\(1/\\sqrt\{D\}\)rate, the exact\(R2\+b\)4\(R^\{2\}\+b\)^\{4\}bias law, flat\-sampling optimality, an off\-sphere bounded\-ball test confirming the general\-ℝd\\mathbb\{R\}^\{d\}niche where no direct dot\-product reduction is available and Nyström degrades withdd\(Section[4\.1](https://arxiv.org/html/2606.11255#S4.SS1)\), a cost\-matched comparison \(Section[4\.3](https://arxiv.org/html/2606.11255#S4.SS3)\), the fidelity of RAY to the exact\\tifinaghfontⵟ\-kernel at low draw counts with an IMQ\-RFF ablation isolating the exact numerator \(Section[D](https://arxiv.org/html/2606.11255#A4.SS0.SSS0.Px5)\), and a controlled coupled\-target preference test where the\\tifinaghfontⵟ\-kernel is the best bias on the coupled alignment–proximity target and not on single\-factor controls \(Section[4\.4](https://arxiv.org/html/2606.11255#S4.SS4)\)\.

Section[2](https://arxiv.org/html/2606.11255#S2)constructs the estimator step by step; Section[3](https://arxiv.org/html/2606.11255#S3)establishes its guarantees and variance reduction; Section[4](https://arxiv.org/html/2606.11255#S4)reports the experiments; Sections[5](https://arxiv.org/html/2606.11255#S5)–[6](https://arxiv.org/html/2606.11255#S6)discuss and conclude\. Proofs and further validations are in the appendix\.

![Refer to caption](https://arxiv.org/html/2606.11255v1/x1.png)Figure 1:The key regime: an*off\-sphere*bounded ball \(varying norms\), wherek\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is genuinely non\-dot\-product and no direct dot\-product reduction is available\.\(a\)RAY’s relative Frobenius Gram error follows theO\(1/D\)O\(1/\\sqrt\{D\}\)Monte\-Carlo rate at every dimension\.\(b\)AtD=1000D=1000RAY stays bounded asddgrows, while uniform and k\-means Nyström \(fixedm=100m=100landmarks\) degrade \(matched in radial/landmark count; the cost\-matched comparison is Section[4\.3](https://arxiv.org/html/2606.11255#S4.SS3)\)\.

## 2Constructing the Random\\tifinaghfontⵟ\-Feature

We seek a feature mapz:ℝd→ℝMbz:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{M\_\{b\}\}with𝔼\[z\(x\)⊤z\(w\)\]=k\\tifinaghfontⵟ,b\(x,w\)\\mathbb\{E\}\[z\(x\)^\{\\top\}z\(w\)\]=k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x,w\), so that the Gram matrix ofX=\{x1,…,xN\}X=\\\{x\_\{1\},\\dots,x\_\{N\}\\\}is approximated byZZ⊤ZZ^\{\\top\}inO\(NMb\)O\(NM\_\{b\}\)rather thanO\(N2\)O\(N^\{2\}\)\. We buildzzin five steps: \(i\) factork\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}into a polynomial and a radial kernel \(Section[2\.1](https://arxiv.org/html/2606.11255#S2.SS1)\); \(ii\) write the radial kernel as a nonnegative mixture of Gaussians over a one\-dimensional scale \(Section[2\.2](https://arxiv.org/html/2606.11255#S2.SS2)\); \(iii\) discretize the mixture by sampling that scale \(Section[2\.3](https://arxiv.org/html/2606.11255#S2.SS3)\); \(iv\) approximate the Gaussian factor by random Fourier features and tensor with the modulation feature \(Section[2\.4](https://arxiv.org/html/2606.11255#S2.SS4)\); and \(v\) sketch the modulation, so that*both*factors are randomized and the feature dimension is free of theO\(d2\)O\(d^\{2\}\)polynomial size \(Section[2\.5](https://arxiv.org/html/2606.11255#S2.SS5)\)\. Steps \(i\)–\(iv\) with an exact modulation give the analyzable limit; step \(v\) is the deployed estimator\.

### 2\.1Step 1: Schur factorization

The kernel splits as a Schur product of a polynomial and a radial kernel,

k\\tifinaghfontⵟ,b\(w,x\)=\(w⊤x\+b\)2⏟pb\(w,x\)⋅1‖w−x‖2\+ε⏟hε\(w,x\),k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(w,x\)=\\underbrace\{\(w^\{\\top\}x\+b\)^\{2\}\}\_\{p\_\{b\}\(w,x\)\}\\cdot\\underbrace\{\\frac\{1\}\{\\\|w\-x\\\|^\{2\}\+\\varepsilon\}\}\_\{h\_\{\\varepsilon\}\(w,x\)\},\(2\)both factors positive definite \(pbp\_\{b\}from its explicit feature mappb\(x\)⊤pb\(w\)p\_\{b\}\(x\)^\{\\top\}p\_\{b\}\(w\)\(Proposition[2\.2](https://arxiv.org/html/2606.11255#S2.Thmtheorem2)\), andhεh\_\{\\varepsilon\}as a nonnegative Bernstein–Widder mixture of Gaussian kernels \(Section[2\.2](https://arxiv.org/html/2606.11255#S2.SS2)\)\), sok\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is PSD by the Schur product theorem\(Schur,[1911](https://arxiv.org/html/2606.11255#bib.bib24)\)\. Being continuous and positive definite,k\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is a Mercer kernel in the usual integral\-operator sense on every compactX⊂ℝdX\\subset\\mathbb\{R\}^\{d\}, the setting throughout \(Section[3](https://arxiv.org/html/2606.11255#S3)fixes‖x‖≤R\\\|x\\\|\\leq R\); we make no claim on noncompact domains\. \(This is self\-contained;Bouhsine \([2026](https://arxiv.org/html/2606.11255#bib.bib6)\)is cited only for historical context and fixed\-bbuniversality results not used here\.\) The polynomial factor is the tractable one: it has an exact, finite feature map\. That the full kernel nonetheless sits outside both random\-feature templates is not a matter of degree but a structural fact\.

The radial factor is itself recovered from the family by an elementary*forward*second difference inbb:\(2h2\)−1\[k\\tifinaghfontⵟ,b\+2h−2k\\tifinaghfontⵟ,b\+h\+k\\tifinaghfontⵟ,b\]=hε\(2h^\{2\}\)^\{\-1\}\\bigl\[k\_\{\\text\{\\tifinaghfont ⵟ\},b\+2h\}\-2k\_\{\\text\{\\tifinaghfont ⵟ\},b\+h\}\+k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\\bigr\]=h\_\{\\varepsilon\}exactly, for anyb≥0b\\geq 0,h\>0h\>0, with all three biases inside the domainb≥0b\\geq 0\(Proposition[A\.9](https://arxiv.org/html/2606.11255#A1.Thmtheorem9), Appendix[A](https://arxiv.org/html/2606.11255#A1)\)\. This is the only “spans IMQ” statement we use, and it keeps the claim self\-contained\.

###### Proposition 2\.1\(Off\-sphere the kernel is neither stationary nor dot\-product\)\.

Forb≥0b\\geq 0,ε\>0\\varepsilon\>0, the kernelk\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is not shift\-invariant on any domain containing two points of different norm, and not a dot\-product kernel on any domain containing two pairs of equal inner products=x⊤w≠−bs=x^\{\\top\}w\\neq\-bbut different distance \(ats=−bs=\-bthe numerator vanishes for both pairs, so this value is excluded\)\. \(On the unit sphere, where distance and inner product are in bijection, it does reduce to a dot\-product kernel\.\)

The witnesses are immediate:k\\tifinaghfontⵟ,b\(x,x\)=\(‖x‖2\+b\)2/εk\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x,x\)=\(\\\|x\\\|^\{2\}\+b\)^\{2\}/\\varepsilonvaries with‖x‖\\\|x\\\|, ruling out shift\-invariance; andx=w=e1x\{=\}w\{=\}e\_\{1\}versusx′=2e1,w′=e1/2x^\{\\prime\}\{=\}\\sqrt\{2\}\\,e\_\{1\},\\,w^\{\\prime\}\{=\}e\_\{1\}/\\sqrt\{2\}havex⊤w=x′⁣⊤w′=1x^\{\\top\}w=x^\{\\prime\\top\}w^\{\\prime\}=1but‖x−w‖2=0≠12=‖x′−w′‖2\\\|x\-w\\\|^\{2\}=0\\neq\\tfrac\{1\}\{2\}=\\\|x^\{\\prime\}\-w^\{\\prime\}\\\|^\{2\}, sok\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is not a function ofx⊤wx^\{\\top\}walone\. This is why neither Bochner sampling nor dot\-product sketching applies to the full kernel, and why the off\-sphere experiment \(Section[4\.1](https://arxiv.org/html/2606.11255#S4.SS1)\) is the key test\.

###### Proposition 2\.2\(Biased polynomial feature\)\.

Forb≥0b\\geq 0define

pb\(x\)=\(vec⁡\(x⊗x\),2bx,b\)⊤∈ℝd2\+d\+1\.p\_\{b\}\(x\)=\\bigl\(\\operatorname\{vec\}\(x\\otimes x\),\\;\\sqrt\{2b\}\\,x,\\;b\\bigr\)^\{\\top\}\\in\\mathbb\{R\}^\{d^\{2\}\+d\+1\}\.\(3\)Thenpb\(x\)⊤pb\(w\)=\(x⊤w\)2\+2b\(x⊤w\)\+b2=\(x⊤w\+b\)2p\_\{b\}\(x\)^\{\\top\}p\_\{b\}\(w\)=\(x^\{\\top\}w\)^\{2\}\+2b\(x^\{\\top\}w\)\+b^\{2\}=\(x^\{\\top\}w\+b\)^\{2\}, and‖pb\(x\)‖2=\(‖x‖2\+b\)2\\\|p\_\{b\}\(x\)\\\|^\{2\}=\(\\\|x\\\|^\{2\}\+b\)^\{2\}\.

###### Proof\.

pb\(x\)⊤pb\(w\)=vec\(x⊗x\)⊤vec\(w⊗w\)\+2bx⊤w\+b⋅b=\(x⊤w\)2\+2b\(x⊤w\)\+b2p\_\{b\}\(x\)^\{\\top\}p\_\{b\}\(w\)=\\operatorname\{vec\}\(x\\otimes x\)^\{\\top\}\\operatorname\{vec\}\(w\\otimes w\)\+2b\\,x^\{\\top\}w\+b\\cdot b=\(x^\{\\top\}w\)^\{2\}\+2b\(x^\{\\top\}w\)\+b^\{2\}, where the constant coordinatebbcontributesb⋅b=b2b\\cdot b=b^\{2\}\. Settingw=xw=xgives‖pb\(x\)‖2=‖x‖4\+2b‖x‖2\+b2=\(‖x‖2\+b\)2\\\|p\_\{b\}\(x\)\\\|^\{2\}=\\\|x\\\|^\{4\}\+2b\\\|x\\\|^\{2\}\+b^\{2\}=\(\\\|x\\\|^\{2\}\+b\)^\{2\}\. ∎

This finite feature is the object the deployed estimator will*sketch*\. We develop the construction in two passes\. First, holding the modulation exact, the estimator’s randomness is entirely radial, so the unbiasedness and variance identities are transparent \(Section[3](https://arxiv.org/html/2606.11255#S3)\); exact modulation is them→∞m\\to\\inftyanalyzable limit\. The deployed estimator then sketches this feature \(Step 5, Section[2\.5](https://arxiv.org/html/2606.11255#S2.SS5)\), making the feature dimension free of theO\(d2\)O\(d^\{2\}\)polynomial size at the price of a single, controllable modulation\-sketch term on top of the radial one \(Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\)\. Richer modulations \(attention\-style features or learned anchors\) slot into the same two\-error decomposition\. The feature is real exactly on the kernel’s domainb≥0b\\geq 0\(the entry2b\\sqrt\{2b\}requires it\), consistent with the definition \([1](https://arxiv.org/html/2606.11255#S1.E1)\)\. By symmetry the dimension reduces fromd2d^\{2\}tod\(d\+1\)/2d\(d\+1\)/2using the upper triangle ofx⊗xx\\otimes xwith2\\sqrt\{2\}scaling on off\-diagonals; for largedd, TensorSketch\(Pham & Pagh,[2013](https://arxiv.org/html/2606.11255#bib.bib19)\)reduces it further to a controllableDpolyD\_\{\\mathrm\{poly\}\}\(Appendix[D](https://arxiv.org/html/2606.11255#A4)\)\. Settingb=0b=0recovers the unbiased featurep0\(x\)=vec⁡\(x⊗x\)p\_\{0\}\(x\)=\\operatorname\{vec\}\(x\\otimes x\)\. From here on, unless stated otherwise,pbp\_\{b\}denotes the symmetrized degree\-2 feature of dimensiondb=d\(d\+1\)/2\+d\+1d\_\{b\}=d\(d\+1\)/2\+d\+1, exactly equivalent in inner product to the redundantd2\+d\+1d^\{2\}\+d\+1tensor feature of \([3](https://arxiv.org/html/2606.11255#S2.E3)\)\.

### 2\.2Step 2: Bernstein–Widder representation of the radial factor

The radial factorhεh\_\{\\varepsilon\}is completely monotone in‖x−w‖2\\\|x\-w\\\|^\{2\}, so by the Bernstein–Widder theorem\(Widder,[1941](https://arxiv.org/html/2606.11255#bib.bib31)\)it is a Laplace mixture of Gaussian kernels,

hε\(x,w\)=1‖x−w‖2\+ε=∫0∞e−t\(‖x−w‖2\+ε\)𝑑t=∫0∞e−tεgt\(x,w\)𝑑t,h\_\{\\varepsilon\}\(x,w\)=\\frac\{1\}\{\\\|x\-w\\\|^\{2\}\+\\varepsilon\}=\\int\_\{0\}^\{\\infty\}e^\{\-t\(\\\|x\-w\\\|^\{2\}\+\\varepsilon\)\}\\,dt=\\int\_\{0\}^\{\\infty\}e^\{\-t\\varepsilon\}\\,g\_\{t\}\(x,w\)\\,dt,\(4\)wheregt\(x,w\)=e−t‖x−w‖2g\_\{t\}\(x,w\)=e^\{\-t\\\|x\-w\\\|^\{2\}\}is the Gaussian kernel at scalett\. Multiplying by the polynomial factor and tensoring its feature gives the exact feature map ofk\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(the infinite\-feature interpretation, derived here and discussed historically byBouhsine,[2026](https://arxiv.org/html/2606.11255#bib.bib6)\): withΦb\(x\)\(t\)=e−tε/2φt\(x\)⊗pb\(x\)\\Phi\_\{b\}\(x\)\(t\)=e^\{\-t\\varepsilon/2\}\\varphi\_\{t\}\(x\)\\otimes p\_\{b\}\(x\)\(φt\\varphi\_\{t\}the canonical Gaussian feature map\),

⟨Φb\(x\),Φb\(w\)⟩=∫0∞e−tεgt\(x,w\)\(x⊤w\+b\)2𝑑t=k\\tifinaghfontⵟ,b\(x,w\)\.\\langle\\Phi\_\{b\}\(x\),\\Phi\_\{b\}\(w\)\\rangle=\\int\_\{0\}^\{\\infty\}e^\{\-t\\varepsilon\}\\,g\_\{t\}\(x,w\)\\,\(x^\{\\top\}w\+b\)^\{2\}\\,dt=k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x,w\)\.\(5\)The mixing measureεe−εtdt\\varepsilon e^\{\-\\varepsilon t\}\\,dtis, up to normalization, the density ofT∼Exp⁡\(ε\)T\\sim\\operatorname\{Exp\}\(\\varepsilon\)\. Since𝔼T∼Exp⁡\(ε\)\[e−T‖x−w‖2\]=ε/\(‖x−w‖2\+ε\)\\mathbb\{E\}\_\{T\\sim\\operatorname\{Exp\}\(\\varepsilon\)\}\[e^\{\-T\\\|x\-w\\\|^\{2\}\}\]=\\varepsilon/\(\\\|x\-w\\\|^\{2\}\+\\varepsilon\), the kernel is an expectation over that single scale,

k\\tifinaghfontⵟ,b\(x,w\)=1ε𝔼T∼Exp⁡\(ε\)\[gT\(x,w\)\(x⊤w\+b\)2\],k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x,w\)=\\frac\{1\}\{\\varepsilon\}\\,\\mathbb\{E\}\_\{T\\sim\\operatorname\{Exp\}\(\\varepsilon\)\}\\\!\\bigl\[g\_\{T\}\(x,w\)\\,\(x^\{\\top\}w\+b\)^\{2\}\\bigr\],\(6\)and it is this expectation, rather than the integral \([5](https://arxiv.org/html/2606.11255#S2.E5)\), that the next step discretizes\.

### 2\.3Step 3: Sampling the radial scale

Equation \([6](https://arxiv.org/html/2606.11255#S2.E6)\) replaces the integral overttby an expectation over a*one\-dimensional*random scale\. Drawingt1,…,tD∼iidExp⁡\(ε\)t\_\{1\},\\dots,t\_\{D\}\\overset\{\\mathrm\{iid\}\}\{\\sim\}\\operatorname\{Exp\}\(\\varepsilon\)and averaging gives the Monte\-Carlo estimate

k\\tifinaghfontⵟ,b\(x,w\)≈\(x⊤w\+b\)2ε⋅1D∑j=1Dgtj\(x,w\)\.k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x,w\)\\approx\\frac\{\(x^\{\\top\}w\+b\)^\{2\}\}\{\\varepsilon\}\\cdot\\frac\{1\}\{D\}\\sum\_\{j=1\}^\{D\}g\_\{t\_\{j\}\}\(x,w\)\.\(7\)At this point only the radial scale is sampled and the modulation is still exact, the analyzable limit, whose single \(radial\) error source makes the unbiasedness and variance identities of Section[3](https://arxiv.org/html/2606.11255#S3)clean\. Step 5 then sketches the modulation, randomizing*both*factors and removing theO\(d2\)O\(d^\{2\}\)floor at the cost of one additional, controllable error source \(Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\)\. This shapes the scheme’s dimension behavior: the outer Bernstein scaletjt\_\{j\}is one\-dimensional, and while the inner Gaussian frequencyω∈ℝd\\omega\\in\\mathbb\{R\}^\{d\}of the radial RFF is stilldd\-dimensional, the fixed\-dataset Hoeffding/union\-bound count for the radial estimate carries no explicitdd\-dependence \(Corollary[A\.4](https://arxiv.org/html/2606.11255#A1.Thmtheorem4)\), as it does for standard RFF; the genuinedd\-contrast is in applicability and in the polynomial feature dimension, discussed in Section[3](https://arxiv.org/html/2606.11255#S3)\. It remains to makegtjg\_\{t\_\{j\}\}computable by an inner product\.

### 2\.4Step 4: Random Fourier features and the estimator

Each Gaussiangt\(x,w\)=e−t‖x−w‖2g\_\{t\}\(x,w\)=e^\{\-t\\\|x\-w\\\|^\{2\}\}is shift\-invariant, so Bochner’s theorem applies to it \(even though it does not apply tok\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}\): drawingω∼𝒩\(0,2tId\)\\omega\\sim\\mathcal\{N\}\(0,2tI\_\{d\}\)andβ∼Unif\(\[0,2π\]\)\\beta\\sim\\mathrm\{Unif\}\(\[0,2\\pi\]\),𝔼ω,β\[2cos⁡\(ω⊤x\+β\)cos⁡\(ω⊤w\+β\)\]=gt\(x,w\)\\mathbb\{E\}\_\{\\omega,\\beta\}\[2\\cos\(\\omega^\{\\top\}x\+\\beta\)\\cos\(\\omega^\{\\top\}w\+\\beta\)\]=g\_\{t\}\(x,w\)\(Rahimi & Recht,[2007](https://arxiv.org/html/2606.11255#bib.bib23)\)\. TensoringD′D^\{\\prime\}such features with the exact polynomial feature yields the estimator; we writedb:=d\(d\+1\)/2\+d\+1d\_\{b\}:=d\(d\{\+\}1\)/2\+d\+1for the symmetric polynomial feature \(the default; the redundantd2d^\{2\}tensor is reduced losslessly\) andMb=DD′dbM\_\{b\}=D\\,D^\{\\prime\}\\,d\_\{b\}\.

###### Definition 2\.3\(Exact\-modulation\\tifinaghfontⵟ\-feature\)\.

GivenD,D′∈ℕD,D^\{\\prime\}\\in\\mathbb\{N\},ε\>0\\varepsilon\>0,b≥0b\\geq 0, the*exact\-modulation*\\tifinaghfontⵟ\-feature mapz:ℝd→ℝMbz:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{M\_\{b\}\}\(the analyzable limit, which the deployed RAY map of Step 5 randomizes further\) is:

1. 1\.Drawt1,…,tD∼iidExp⁡\(ε\)t\_\{1\},\\dots,t\_\{D\}\\overset\{\\mathrm\{iid\}\}\{\\sim\}\\operatorname\{Exp\}\(\\varepsilon\)\.
2. 2\.For eachjj, drawωj,1,…,ωj,D′∼iid𝒩\(0,2tjId\)\\omega\_\{j,1\},\\dots,\\omega\_\{j,D^\{\\prime\}\}\\overset\{\\mathrm\{iid\}\}\{\\sim\}\\mathcal\{N\}\(0,2t\_\{j\}I\_\{d\}\)andβj,ℓ∼iidUnif\(\[0,2π\]\)\\beta\_\{j,\\ell\}\\overset\{\\mathrm\{iid\}\}\{\\sim\}\\mathrm\{Unif\}\(\[0,2\\pi\]\)\(the factor22is the standard RFF convention so that𝔼ω,β\[2cos⁡\(ω⊤x\+β\)cos⁡\(ω⊤w\+β\)\]=e−tj‖x−w‖2\\mathbb\{E\}\_\{\\omega,\\beta\}\[2\\cos\(\\omega^\{\\top\}x\+\\beta\)\\cos\(\\omega^\{\\top\}w\+\\beta\)\]=e^\{\-t\_\{j\}\\\|x\-w\\\|^\{2\}\}\)\.
3. 3\.Gaussian RFF at scaletjt\_\{j\}:ψtj\(x\)=2/D′\(cos⁡\(ωj,ℓ⊤x\+βj,ℓ\)\)ℓ=1D′∈ℝD′\\psi\_\{t\_\{j\}\}\(x\)=\\sqrt\{2/D^\{\\prime\}\}\\,\(\\cos\(\\omega\_\{j,\\ell\}^\{\\top\}x\+\\beta\_\{j,\\ell\}\)\)\_\{\\ell=1\}^\{D^\{\\prime\}\}\\in\\mathbb\{R\}^\{D^\{\\prime\}\}\.
4. 4\.Exact biased polynomial feature:p\(x\)=ε−1/2pb\(x\)∈ℝdbp\(x\)=\\varepsilon^\{\-1/2\}p\_\{b\}\(x\)\\in\\mathbb\{R\}^\{d\_\{b\}\}\.
5. 5\.Block:zj\(x\)=D−1/2ψtj\(x\)⊗p\(x\)z\_\{j\}\(x\)=D^\{\-1/2\}\\,\\psi\_\{t\_\{j\}\}\(x\)\\otimes p\(x\)\. Concatenate:z\(x\)=\(z1\(x\)⊤,…,zD\(x\)⊤\)⊤z\(x\)=\(z\_\{1\}\(x\)^\{\\top\},\\dots,z\_\{D\}\(x\)^\{\\top\}\)^\{\\top\}\.

The two levelsDDandD′D^\{\\prime\}are less fundamental than they look, and naming what the inner draw actually samples makes this clear\. The hierarchical drawt∼Exp⁡\(ε\)t\\sim\\operatorname\{Exp\}\(\\varepsilon\),ω∼𝒩\(0,2tId\)\\omega\\sim\\mathcal\{N\}\(0,2tI\_\{d\}\)producesω\\omegawith marginal density∫0∞𝒩\(ω;0,2tId\)εe−εt𝑑t\\int\_\{0\}^\{\\infty\}\\mathcal\{N\}\(\\omega;0,2tI\_\{d\}\)\\,\\varepsilon e^\{\-\\varepsilon t\}\\,dt, which by the same Gaussian scale mixture \([4](https://arxiv.org/html/2606.11255#S2.E4)\) is the normalized Bochner spectral distribution of the rescaled radial factorεhε\\varepsilon h\_\{\\varepsilon\}; the missing total mass1/ε1/\\varepsilonis carried by the feature scaling\. Step 4 is therefore standard random Fourier features forhεh\_\{\\varepsilon\}tensored with the exact polynomial feature\. The IMQ spectral density can be written using modified Bessel functions; the radial\-scale mixture is simply a tractable sampler for that distribution, not a device that competes with it\. Seen this way the inner level is redundant: at equal costDD′DD^\{\\prime\}theD′D^\{\\prime\}frequencies of a block share one scaletjt\_\{j\}and are correlated, whereasD′D^\{\\prime\}independent spectral draws are not, so variance is minimized atD′=1D^\{\\prime\}=1except in the degenerate zero\-outer\-variance case \(Appendix[D\.2](https://arxiv.org/html/2606.11255#A4.SS2)confirms this directly\)\. We keepD′D^\{\\prime\}general for the analysis but recommendD′=1D^\{\\prime\}=1, drawingDDindependent\(tj,ωj\)\(t\_\{j\},\\omega\_\{j\}\)pairs, which collapses the estimator to

z\(x\)=D−1/2\(2cos⁡\(ωj⊤x\+βj\)p\(x\)\)j=1D,ωj∼𝒩\(0,2tjId\),tj∼Exp⁡\(ε\)\.z\(x\)=D^\{\-1/2\}\\Bigl\(\\sqrt\{2\}\\,\\cos\(\\omega\_\{j\}^\{\\top\}x\+\\beta\_\{j\}\)\\,p\(x\)\\Bigr\)\_\{j=1\}^\{D\},\\qquad\\omega\_\{j\}\\sim\\mathcal\{N\}\(0,2t\_\{j\}I\_\{d\}\),\\ t\_\{j\}\\sim\\operatorname\{Exp\}\(\\varepsilon\)\.\(8\)The contribution of this section is thus the Schur factorization that isolates an exactly\-representable, sketchable polynomial modulation \(Section[2\.1](https://arxiv.org/html/2606.11255#S2.SS1)\) paired with this sampler for the IMQ factor, not an improvement over IMQ random features; the dimension comparison this invites is taken up in Section[3\.2](https://arxiv.org/html/2606.11255#S3.SS2)\.

###### Theorem 2\.4\(Unbiasedness\)\.

𝔼\[z\(x\)⊤z\(w\)\]=k\\tifinaghfontⵟ,b\(x,w\)\\mathbb\{E\}\[z\(x\)^\{\\top\}z\(w\)\]=k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x,w\)\.

###### Proof\.

z\(x\)⊤z\(w\)=1D∑j=1D\(ψtj\(x\)⊤ψtj\(w\)\)\(p\(x\)⊤p\(w\)\)z\(x\)^\{\\top\}z\(w\)=\\frac\{1\}\{D\}\\sum\_\{j=1\}^\{D\}\(\\psi\_\{t\_\{j\}\}\(x\)^\{\\top\}\\psi\_\{t\_\{j\}\}\(w\)\)\(p\(x\)^\{\\top\}p\(w\)\)\. The polynomial inner product is exact:p\(x\)⊤p\(w\)=\(x⊤w\+b\)2/εp\(x\)^\{\\top\}p\(w\)=\(x^\{\\top\}w\+b\)^\{2\}/\\varepsilon\. By the standard RFF identity\(Rahimi & Recht,[2007](https://arxiv.org/html/2606.11255#bib.bib23)\),𝔼ω\[ψtj\(x\)⊤ψtj\(w\)∣tj\]=e−tj‖x−w‖2\\mathbb\{E\}\_\{\\omega\}\[\\psi\_\{t\_\{j\}\}\(x\)^\{\\top\}\\psi\_\{t\_\{j\}\}\(w\)\\mid t\_\{j\}\]=e^\{\-t\_\{j\}\\\|x\-w\\\|^\{2\}\}\. Taking expectation overtj∼Exp⁡\(ε\)t\_\{j\}\\sim\\operatorname\{Exp\}\(\\varepsilon\)and using \([6](https://arxiv.org/html/2606.11255#S2.E6)\),

𝔼\[z\(x\)⊤z\(w\)\]=𝔼T∼Exp⁡\(ε\)\[e−T‖x−w‖2\]⋅\(x⊤w\+b\)2ε=ε‖x−w‖2\+ε⋅\(x⊤w\+b\)2ε=k\\tifinaghfontⵟ,b\(x,w\)\.∎\\mathbb\{E\}\[z\(x\)^\{\\top\}z\(w\)\]=\\mathbb\{E\}\_\{T\\sim\\operatorname\{Exp\}\(\\varepsilon\)\}\\\!\\bigl\[e^\{\-T\\\|x\-w\\\|^\{2\}\}\\bigr\]\\cdot\\frac\{\(x^\{\\top\}w\+b\)^\{2\}\}\{\\varepsilon\}=\\frac\{\\varepsilon\}\{\\\|x\-w\\\|^\{2\}\+\\varepsilon\}\\cdot\\frac\{\(x^\{\\top\}w\+b\)^\{2\}\}\{\\varepsilon\}=k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x,w\)\.\\qed

The construction used nothing specific tok\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}beyond its structure as a finite\-feature kernel times a completely monotone shift\-invariant kernel\. The estimator, and its guarantees, hold for the whole class\.

###### Theorem 2\.5\(Bernstein–Schur random features\)\.

Letk\(x,w\)=p\(x,w\)f\(‖x−w‖2\)k\(x,w\)=p\(x,w\)\\,f\(\\\|x\-w\\\|^\{2\}\)withp\(x,w\)=u\(x\)⊤u\(w\)p\(x,w\)=u\(x\)^\{\\top\}u\(w\)for a finite featureu:ℝd→ℝdpu:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\_\{p\}\}, andffcompletely monotone with Bernstein mixturef\(r\)=∫0∞e−tr𝑑ν\(t\)f\(r\)=\\int\_\{0\}^\{\\infty\}e^\{\-tr\}\\,d\\nu\(t\)of*finite*massmf:=ν\(ℝ≥0\)=f\(0\)<∞m\_\{f\}:=\\nu\(\\mathbb\{R\}\_\{\\geq 0\}\)=f\(0\)<\\infty\(distinct from the sketch dimensionmmof Section[2\.5](https://arxiv.org/html/2606.11255#S2.SS5)\)\. DrawTj∼ν/mfT\_\{j\}\\sim\\nu/m\_\{f\},ωj∣Tj∼𝒩\(0,2TjId\)\\omega\_\{j\}\\mid T\_\{j\}\\sim\\mathcal\{N\}\(0,2T\_\{j\}I\_\{d\}\),βj∼Unif\[0,2π\]\\beta\_\{j\}\\sim\\mathrm\{Unif\}\[0,2\\pi\], and setz\(x\)=\(mf/D2cos⁡\(ωj⊤x\+βj\)u\(x\)\)j=1Dz\(x\)=\\bigl\(\\sqrt\{m\_\{f\}/D\}\\,\\sqrt\{2\}\\cos\(\\omega\_\{j\}^\{\\top\}x\+\\beta\_\{j\}\)\\,u\(x\)\\bigr\)\_\{j=1\}^\{D\}\. Then𝔼\[z\(x\)⊤z\(w\)\]=k\(x,w\)\\mathbb\{E\}\[z\(x\)^\{\\top\}z\(w\)\]=k\(x,w\), and if‖u\(x\)‖≤B\\\|u\(x\)\\\|\\leq BonXX, thenVar⁡\[z\(x\)⊤z\(w\)\]≤3mf2B4/\(2D\)\\operatorname\{Var\}\[z\(x\)^\{\\top\}z\(w\)\]\\leq 3m\_\{f\}^\{2\}B^\{4\}/\(2D\)and, with probability≥1−δ\\geq 1\-\\delta,supi,j\|z\(xi\)⊤z\(xj\)−k\(xi,xj\)\|≤mfB28log⁡\(2N2/δ\)/D\\sup\_\{i,j\}\|z\(x\_\{i\}\)^\{\\top\}z\(x\_\{j\}\)\-k\(x\_\{i\},x\_\{j\}\)\|\\leq m\_\{f\}B^\{2\}\\sqrt\{8\\log\(2N^\{2\}/\\delta\)/D\}\.

We call these*Bernstein–Schur kernels*\. The biased\\tifinaghfontⵟ\-kernel is the caseu=pbu=p\_\{b\},f\(r\)=\(r\+ε\)−1f\(r\)=\(r\+\\varepsilon\)^\{\-1\}withdν\(t\)=e−εtdtd\\nu\(t\)=e^\{\-\\varepsilon t\}\\,dt, so the mass ismf=ν\(ℝ≥0\)=1/εm\_\{f\}=\\nu\(\\mathbb\{R\}\_\{\\geq 0\}\)=1/\\varepsilonand the lawν/mf\\nu/m\_\{f\}isExp⁡\(ε\)\\operatorname\{Exp\}\(\\varepsilon\); withB=maxx⁡‖pb\(x\)‖=R2\+bB=\\max\_\{x\}\\\|p\_\{b\}\(x\)\\\|=R^\{2\}\+bthe variance scalemf2B4=\(R2\+b\)4/ε2m\_\{f\}^\{2\}B^\{4\}=\(R^\{2\}\+b\)^\{4\}/\\varepsilon^\{2\}recovers the bounds of Section[3](https://arxiv.org/html/2606.11255#S3)\. Theorem[2\.5](https://arxiv.org/html/2606.11255#S2.Thmtheorem5)establishes unbiasedness, the variance, and the uniform entrywise bound for the whole class; the matrix\-Bernstein \(Theorem[3\.2](https://arxiv.org/html/2606.11255#S3.Thmtheorem2)\) and kernel\-ridge \(Theorem[3\.6](https://arxiv.org/html/2606.11255#S3.Thmtheorem6)\) statements we prove fork\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}, and the same proof pattern applies to any member after replacingPPbymfm\_\{f\}times the bounded modulation Gram, though we do not write the general statement out\. The class is an elementary closure construction, but it is useful because it packages a family of kernels that are often neither stationary nor pure dot\-product, and it permits unified unbiased features, exact variance formulas, and matrix\-level approximation guarantees\. Table[1](https://arxiv.org/html/2606.11255#S2.T1)lists nontrivial members; onlyuuandν\\nuchange between them\.

Table 1:Bernstein–Schur kernelsk=p⋅fk=p\\cdot f\. Each is linearized by the same estimator: keep the finite feature of the modulationppexact, draw the radial scale from the \(normalized\) Bernstein measure offf, and apply random Fourier features to the Gaussian\. The “mixing” column gives the normalized lawν/mf\\nu/m\_\{f\}\(the IMQ hasdν=e−εtdtd\\nu=e^\{\-\\varepsilon t\}dt, massmf=1/εm\_\{f\}=1/\\varepsilon; the generalized IMQ\(r\+ε\)−α\(r\+\\varepsilon\)^\{\-\\alpha\}hasdν∝tα−1e−εtdtd\\nu\\propto t^\{\\alpha\-1\}e^\{\-\\varepsilon t\}dt, massmf=ε−αm\_\{f\}=\\varepsilon^\{\-\\alpha\}\)\. HereΓ\(α,ε\)\\Gamma\(\\alpha,\\varepsilon\)denotes shapeα\\alphaand rateε\\varepsilon\. Only the two columns right offfchange between instances\.These other members are linearized by the same estimator in practice, not only in principle: on a genuinely non\-\\tifinaghfontⵟinstance \(a degree\-3 modulation\(x⊤w\+b\)3\(x^\{\\top\}w\+b\)^\{3\}times a generalized\-IMQ radial\(‖x−w‖2\+ε\)−2\(\\\|x\-w\\\|^\{2\}\+\\varepsilon\)^\{\-2\}withΓ\(2,ε\)\\Gamma\(2,\\varepsilon\)mixing\), the same scheme \(keepuuexact, drawTj∼Γ\(2,ε\)T\_\{j\}\\sim\\Gamma\(2,\\varepsilon\), apply RFF\) is unbiased and converges at the Monte\-Carlo rate, exactly as for the\\tifinaghfontⵟ\-kernel \(Appendix[D](https://arxiv.org/html/2606.11255#A4)\)\. The class\-level theorem is thus more than a formal generalization: swappinguuandν\\nuin one estimator linearizes a different nonstationary kernel\.

### 2\.5Step 5: Sketching the modulation, the deployed estimator

Definition[2\.3](https://arxiv.org/html/2606.11255#S2.Thmtheorem3)keeps the modulation featurepbp\_\{b\}exact, so its dimensiondb=O\(d2\)d\_\{b\}=O\(d^\{2\}\)fixes the feature sizeDD′dbD\\,D^\{\\prime\}\\,d\_\{b\}\. The deployed estimator removes this floor by randomizing the modulation as well\. LetTSm\\mathrm\{TS\}\_\{m\}be a degree\-2 TensorSketch of the augmented input\(x,b\)\(x,\\sqrt\{b\}\), so𝔼\[TSm\(x\)⊤TSm\(w\)\]=\(x⊤w\+b\)2\\mathbb\{E\}\[\\mathrm\{TS\}\_\{m\}\(x\)^\{\\top\}\\mathrm\{TS\}\_\{m\}\(w\)\]=\(x^\{\\top\}w\+b\)^\{2\}over the sketch randomness\(Pham & Pagh,[2013](https://arxiv.org/html/2606.11255#bib.bib19); Avron et al\.,[2014](https://arxiv.org/html/2606.11255#bib.bib2)\), and setp^m\(x\)=ε−1/2TSm\(x\)∈ℝm\\widehat\{p\}\_\{m\}\(x\)=\\varepsilon^\{\-1/2\}\\mathrm\{TS\}\_\{m\}\(x\)\\in\\mathbb\{R\}^\{m\}\. The deployed*RAY*map \(randomizing both factors\) replaces step 4 of Definition[2\.3](https://arxiv.org/html/2606.11255#S2.Thmtheorem3)byp^m\\widehat\{p\}\_\{m\}\(using the flatD′=1D^\{\\prime\}=1form\):

z^\(x\)=D−1/2\(2cos⁡\(ωj⊤x\+βj\)p^m\(x\)\)j=1D∈ℝDm,ωj∼𝒩\(0,2tjId\),tj∼Exp⁡\(ε\)\.\\widehat\{z\}\(x\)=D^\{\-1/2\}\\bigl\(\\sqrt\{2\}\\,\\cos\(\\omega\_\{j\}^\{\\top\}x\+\\beta\_\{j\}\)\\,\\widehat\{p\}\_\{m\}\(x\)\\bigr\)\_\{j=1\}^\{D\}\\in\\mathbb\{R\}^\{Dm\},\\qquad\\omega\_\{j\}\\sim\\mathcal\{N\}\(0,2t\_\{j\}I\_\{d\}\),\\ t\_\{j\}\\sim\\operatorname\{Exp\}\(\\varepsilon\)\.\(9\)Both factors are now random \(the radial scale by the Bernstein mixture, the modulation by the sketch\) and the feature dimensionDmDmis free ofd2d^\{2\}\. The experiments \(Section[4\.2](https://arxiv.org/html/2606.11255#S4.SS2)\) and Proposition[A\.8](https://arxiv.org/html/2606.11255#A1.Thmtheorem8)instead use a lower\-variance*quadratic\-only*variant that sketches only the degree\-2 term\(x⊤w\)2\(x^\{\\top\}w\)^\{2\}and keeps the linear and constant terms2bx⊤w\+b22b\\,x^\{\\top\}w\+b^\{2\}exact, of dimensionD\(m\+d\+1\)D\(m\{\+\}d\{\+\}1\): still free of theO\(d2\)O\(d^\{2\}\)floor, larger only by an additiveO\(d\)O\(d\)per draw\. Both forms are covered by Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\(the analysis touches the modulation only throughP^m\\widehat\{P\}\_\{m\}\)\. The exact\-modulation estimator \([8](https://arxiv.org/html/2606.11255#S2.E8)\) \(Definition[2\.3](https://arxiv.org/html/2606.11255#S2.Thmtheorem3)\) is the limitm→∞m\\to\\infty\(p^m→pb\\widehat\{p\}\_\{m\}\\to p\_\{b\}\), where the modulation randomness vanishes; that limit is the object the variance and concentration identities of Section[3](https://arxiv.org/html/2606.11255#S3)are stated for, and Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)carries the operator\-norm guarantee to finitemmat the cost of one additional, controllable sketch term\. The modulation randomizer is modular: any unbiased or low\-rank PSD modulation feature plugs in \(TensorSketch, random\-Maclaurin products, or anchor features\), the choice being a geometry\-dependent design decision rather than part of the analysis \(Section[5](https://arxiv.org/html/2606.11255#S5)\)\. We state the guarantees for exact modulation and then transfer them, so the next two sections read as: clean identities at the limit, then the doubly\-randomized bound\.

## 3Approximation Guarantees

### 3\.1Variance and uniform error

Throughout,X⊂ℝdX\\subset\\mathbb\{R\}^\{d\}is compact with‖x‖≤R\\\|x\\\|\\leq Rfor allx∈Xx\\in X\. Proofs are in Appendix[A](https://arxiv.org/html/2606.11255#A1)\.

The variance of the recommended flat estimator \(D′=1D^\{\\prime\}=1\) is known in closed form\. A coarser envelope valid for allD,D′D,D^\{\\prime\}\(VD≤\(R2\+b\)4Dε2\(1\+32D′\)V\_\{D\}\\leq\\frac\{\(R^\{2\}\+b\)^\{4\}\}\{D\\varepsilon^\{2\}\}\(1\+\\frac\{3\}\{2D^\{\\prime\}\}\), the3/\(2D′\)3/\(2D^\{\\prime\}\)term being the inner random\-Fourier\-feature variance\) and the two\-level identity \([14](https://arxiv.org/html/2606.11255#A1.E14)\) whose budget argument singles outD′=1D^\{\\prime\}=1as optimal are deferred to Appendix[A](https://arxiv.org/html/2606.11255#A1)\(Theorem[A\.1](https://arxiv.org/html/2606.11255#A1.Thmtheorem1), Proposition[A\.2](https://arxiv.org/html/2606.11255#A1.Thmtheorem2)\); the experiments confirmD′=1D^\{\\prime\}=1directly \(Appendix[D\.2](https://arxiv.org/html/2606.11255#A4.SS2)\)\.

###### Theorem 3\.1\(Exact variance of the flat estimator\)\.

Writer=‖x−w‖2r=\\\|x\-w\\\|^\{2\}anda=\(x⊤w\+b\)2a=\(x^\{\\top\}w\+b\)^\{2\}\. The flat estimator,*one*inner frequency per scale,D′=1D^\{\\prime\}=1\(not theD′→∞D^\{\\prime\}\\to\\inftyinner\-averaged estimator\),k^D=aε1D∑j=1D2cos⁡\(ωj⊤x\+βj\)cos⁡\(ωj⊤w\+βj\)\\widehat\{k\}\_\{D\}=\\tfrac\{a\}\{\\varepsilon\}\\tfrac\{1\}\{D\}\\sum\_\{j=1\}^\{D\}2\\cos\(\\omega\_\{j\}^\{\\top\}x\+\\beta\_\{j\}\)\\cos\(\\omega\_\{j\}^\{\\top\}w\+\\beta\_\{j\}\), withTj∼Exp⁡\(ε\)T\_\{j\}\\sim\\operatorname\{Exp\}\(\\varepsilon\)andωj∣Tj∼𝒩\(0,2TjId\)\\omega\_\{j\}\\mid T\_\{j\}\\sim\\mathcal\{N\}\(0,2T\_\{j\}I\_\{d\}\), has

Var⁡\[k^D\(x,w\)\]=a2Dε2\[1\+12εε\+4r−\(εε\+r\)2\]\.\\operatorname\{Var\}\[\\widehat\{k\}\_\{D\}\(x,w\)\]=\\frac\{a^\{2\}\}\{D\\varepsilon^\{2\}\}\\left\[\\,1\+\\frac\{1\}\{2\}\\frac\{\\varepsilon\}\{\\varepsilon\+4r\}\-\\Bigl\(\\frac\{\\varepsilon\}\{\\varepsilon\+r\}\\Bigr\)^\{2\}\\,\\right\]\.Atr=0r=0the bracket is12\\tfrac\{1\}\{2\}\(the one\-frequency variance does not vanish\); theD′→∞D^\{\\prime\}\\to\\inftyinner\-averaged estimator instead has outer\-only variancea2Dε2\[εε\+2r−\(εε\+r\)2\]\\tfrac\{a^\{2\}\}\{D\\varepsilon^\{2\}\}\[\\tfrac\{\\varepsilon\}\{\\varepsilon\+2r\}\-\(\\tfrac\{\\varepsilon\}\{\\varepsilon\+r\}\)^\{2\}\], theVoutV\_\{\\mathrm\{out\}\}term of Proposition[A\.2](https://arxiv.org/html/2606.11255#A1.Thmtheorem2)\.

The identity locates the variance precisely: it is dominated by the fourth power of the bias\-shifted alignmenta2=\(x⊤w\+b\)4a^\{2\}=\(x^\{\\top\}w\+b\)^\{4\}and scales asε−2\\varepsilon^\{\-2\}, while the radial bracket equals12\\tfrac\{1\}\{2\}atr=0r=0, tends to11asr→∞r\\to\\infty, and is uniformly bounded by32\\tfrac\{3\}\{2\}\. So the alignment numerator, not the distance, sets the scale; this recovers theO\(\(R2\+b\)4/\(Dε2\)\)O\(\(R^\{2\}\+b\)^\{4\}/\(D\\varepsilon^\{2\}\)\)order of Theorem[A\.1](https://arxiv.org/html/2606.11255#A1.Thmtheorem1)with exact constants\.

A Hoeffding bound per entry plus a union bound gives a uniform entrywise Gram errorsupij\|z\(xi\)⊤z\(xj\)−k\\tifinaghfontⵟ,b\|≤\(R2\+b\)2ε−18log⁡\(2N2/δ\)/D\\sup\_\{ij\}\|z\(x\_\{i\}\)^\{\\top\}z\(x\_\{j\}\)\-k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\|\\leq\(R^\{2\}\+b\)^\{2\}\\varepsilon^\{\-1\}\\sqrt\{8\\log\(2N^\{2\}/\\delta\)/D\}with probability1−δ1\-\\delta\(Theorem[A\.3](https://arxiv.org/html/2606.11255#A1.Thmtheorem3)\), hence a sample complexityD=O\(\(R2\+b\)4ε−2τ−2log⁡\(N/η\)\)D=O\(\(R^\{2\}\+b\)^\{4\}\\varepsilon^\{\-2\}\\tau^\{\-2\}\\log\(N/\\eta\)\)free of explicitdd\(Corollary[A\.4](https://arxiv.org/html/2606.11255#A1.Thmtheorem4)\); both are in Appendix[A](https://arxiv.org/html/2606.11255#A1)\. Converting this entrywise bound to operator norm through‖A‖op≤Nmaxij⁡\|Aij\|\\\|A\\\|\_\{\\mathrm\{op\}\}\\leq N\\max\_\{ij\}\|A\_\{ij\}\|costs a factorNN\(Corollary[A\.5](https://arxiv.org/html/2606.11255#A1.Thmtheorem5)\)\. That factor is wasteful: it ignores that the per\-draw error matrices are structured\. Each radial draw contributesK\(j\)=\(ΨjΨj⊤\)∘PK^\{\(j\)\}=\(\\Psi\_\{j\}\\Psi\_\{j\}^\{\\top\}\)\\circ PwithP=\[\(xi⊤xj\+b\)2/ε\]P=\[\(x\_\{i\}^\{\\top\}x\_\{j\}\+b\)^\{2\}/\\varepsilon\], a Schur product of two positive semidefinite matrices and hence itself PSD; matrix Bernstein exploits this\.

###### Theorem 3\.2\(Expected matrix\-Bernstein operator\-norm bound\)\.

LetK=\[k\\tifinaghfontⵟ,b\(xi,xj\)\]K=\[k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x\_\{i\},x\_\{j\}\)\]andP=\[\(xi⊤xj\+b\)2/ε\]P=\[\(x\_\{i\}^\{\\top\}x\_\{j\}\+b\)^\{2\}/\\varepsilon\]be the kernel and polynomial Gram matrices \(both PSD\), and letKDK\_\{D\}be the exact\-modulation estimate fromDDi\.i\.d\. radial draws\. WithV=∑j𝔼\[\(D−1\(K\(j\)−K\)\)2\]V=\\sum\_\{j\}\\mathbb\{E\}\[\(D^\{\-1\}\(K^\{\(j\)\}\-K\)\)^\{2\}\]the matrix variance anddint=tr⁡\(V\)/‖V‖op≤Nd\_\{\\mathrm\{int\}\}=\\operatorname\{tr\}\(V\)/\\\|V\\\|\_\{\\mathrm\{op\}\}\\leq Nits intrinsic dimension \(ifV=0V=0the estimate is exact and the bound holds trivially with the conventiondint=1d\_\{\\mathrm\{int\}\}=1\),

𝔼‖KD−K‖op≤2‖P‖op‖K‖oplog⁡\(2dint\)D\+‖P‖oplog⁡\(2dint\)D\.\\mathbb\{E\}\\,\\\|K\_\{D\}\-K\\\|\_\{\\mathrm\{op\}\}\\;\\leq\\;2\\sqrt\{\\frac\{\\\|P\\\|\_\{\\mathrm\{op\}\}\\,\\\|K\\\|\_\{\\mathrm\{op\}\}\\,\\log\(2d\_\{\\mathrm\{int\}\}\)\}\{D\}\}\\;\+\\;\\frac\{\\\|P\\\|\_\{\\mathrm\{op\}\}\\,\\log\(2d\_\{\\mathrm\{int\}\}\)\}\{D\}\.

The leading term scales with the top eigenvalues‖P‖op,‖K‖op\\\|P\\\|\_\{\\mathrm\{op\}\},\\\|K\\\|\_\{\\mathrm\{op\}\}and the effective rankdintd\_\{\\mathrm\{int\}\}, all≪N\\ll Nfor spectrally concentrated data; since‖P‖op≤tr⁡\(P\)≤N\(R2\+b\)2/ε\\\|P\\\|\_\{\\mathrm\{op\}\}\\leq\\operatorname\{tr\}\(P\)\\leq N\(R^\{2\}\+b\)^\{2\}/\\varepsilonanddint≤Nd\_\{\\mathrm\{int\}\}\\leq N, the bound never exceeds the order of Corollary[A\.5](https://arxiv.org/html/2606.11255#A1.Thmtheorem5)and is data\-adaptively tighter\. The theorem is stated for exact modulation; the same proof pattern extends to a general Bernstein–Schur kernel after replacingPPby the bounded modulation Gram multiplied by the finite massmfm\_\{f\}\. The key step is thatK\(j\)⪰0K^\{\(j\)\}\\succeq 0, so\(K\(j\)\)2⪯‖K\(j\)‖opK\(j\)⪯2‖P‖opK\(j\)\(K^\{\(j\)\}\)^\{2\}\\preceq\\\|K^\{\(j\)\}\\\|\_\{\\mathrm\{op\}\}K^\{\(j\)\}\\preceq 2\\\|P\\\|\_\{\\mathrm\{op\}\}K^\{\(j\)\}\(the Schur\-multiplier bound, Lemma[A\.6](https://arxiv.org/html/2606.11255#A1.Thmtheorem6), gives‖K\(j\)‖op≤2‖P‖op\\\|K^\{\(j\)\}\\\|\_\{\\mathrm\{op\}\}\\leq 2\\\|P\\\|\_\{\\mathrm\{op\}\}\)\. Hence‖V‖op≤2‖P‖op‖K‖op/D\\\|V\\\|\_\{\\mathrm\{op\}\}\\leq 2\\\|P\\\|\_\{\\mathrm\{op\}\}\\\|K\\\|\_\{\\mathrm\{op\}\}/D, and the intrinsic matrix\-Bernstein inequality\(Tropp,[2012](https://arxiv.org/html/2606.11255#bib.bib29);[2015](https://arxiv.org/html/2606.11255#bib.bib30)\)yields the bound\. The full proof is in Appendix[A](https://arxiv.org/html/2606.11255#A1); an empirical check across datasets of varying spectral spread confirms the data\-adaptive constant \(Section[D](https://arxiv.org/html/2606.11255#A4.SS0.SSS0.Px1)\)\.

###### Corollary 3\.3\(High\-probability operator\-norm bound\)\.

With the summandsYj=D−1\(K\(j\)−K\)Y\_\{j\}=D^\{\-1\}\(K^\{\(j\)\}\-K\), the almost\-sure bound‖Yj‖op≤L:=3‖P‖op/D\\\|Y\_\{j\}\\\|\_\{\\mathrm\{op\}\}\\leq L:=3\\\|P\\\|\_\{\\mathrm\{op\}\}/D, and the matrix variance‖∑j𝔼\[Yj2\]‖op≤v:=2‖P‖op‖K‖op/D\\bigl\\\|\\sum\_\{j\}\\mathbb\{E\}\[Y\_\{j\}^\{2\}\]\\bigr\\\|\_\{\\mathrm\{op\}\}\\leq v:=2\\\|P\\\|\_\{\\mathrm\{op\}\}\\\|K\\\|\_\{\\mathrm\{op\}\}/D, the tail form of the intrinsic matrix\-Bernstein inequality\(Tropp,[2015](https://arxiv.org/html/2606.11255#bib.bib30), Thm\. 7\.3\.1\)gives, with probability at least1−δ1\-\\delta,

‖KD−K‖op≤2‖P‖op‖K‖oplog⁡\(4dint/δ\)D\+2‖P‖oplog⁡\(4dint/δ\)D\.\\\|K\_\{D\}\-K\\\|\_\{\\mathrm\{op\}\}\\leq 2\\sqrt\{\\frac\{\\\|P\\\|\_\{\\mathrm\{op\}\}\\\|K\\\|\_\{\\mathrm\{op\}\}\\log\(4d\_\{\\mathrm\{int\}\}/\\delta\)\}\{D\}\}\+\\frac\{2\\\|P\\\|\_\{\\mathrm\{op\}\}\\log\(4d\_\{\\mathrm\{int\}\}/\\delta\)\}\{D\}\.\(10\)

The proof \(Appendix[A](https://arxiv.org/html/2606.11255#A1)\) establishesLLandvvfromK\(j\)⪰0K^\{\(j\)\}\\succeq 0and the Schur\-multiplier bound \(Lemma[A\.6](https://arxiv.org/html/2606.11255#A1.Thmtheorem6)\), then inverts the intrinsic Bernstein tail\.

The same conditioning argument carries this bound to the*doubly*\-randomized estimator of Step 5, where the modulation is also sketched, the map one actually deploys\.

###### Theorem 3\.4\(Operator\-norm error of RAY\)\.

LetK^D,m=P^m∘R^D\\widehat\{K\}\_\{D,m\}=\\widehat\{P\}\_\{m\}\\circ\\widehat\{R\}\_\{D\}be the doubly\-randomized estimator \([9](https://arxiv.org/html/2606.11255#S2.E9)\): a single degree\-2 TensorSketch of dimensionmm\(drawn once, shared across draws\) gives the modulation GramP^m=\[TSm\(xi\)⊤TSm\(xj\)\]⪰0\\widehat\{P\}\_\{m\}=\[\\mathrm\{TS\}\_\{m\}\(x\_\{i\}\)^\{\\top\}\\mathrm\{TS\}\_\{m\}\(x\_\{j\}\)\]\\succeq 0with𝔼P^m=P=\[\(xi⊤xj\+b\)2/ε\]\\mathbb\{E\}\\,\\widehat\{P\}\_\{m\}=P=\[\(x\_\{i\}^\{\\top\}x\_\{j\}\+b\)^\{2\}/\\varepsilon\], andDDradial draws give the radial GramR^D\\widehat\{R\}\_\{D\}with𝔼R^D=R\\mathbb\{E\}\\,\\widehat\{R\}\_\{D\}=R, the unit\-diagonal GramRij=ε/\(ε\+‖xi−xj‖2\)R\_\{ij\}=\\varepsilon/\(\\varepsilon\+\\\|x\_\{i\}\-x\_\{j\}\\\|^\{2\}\)\(soK=R∘PK=R\\circ P\), independent of the sketch\. Suppose the sketch satisfies the spectral event‖P^m−P‖op≤η‖P‖op\\\|\\widehat\{P\}\_\{m\}\-P\\\|\_\{\\mathrm\{op\}\}\\leq\\eta\\\|P\\\|\_\{\\mathrm\{op\}\}\(Remark[3\.5](https://arxiv.org/html/2606.11255#S3.Thmtheorem5)discusses when the degree\-2 TensorSketch achieves it\)\. Then on that event, with probability at least1−δ1\-\\deltaover the radial draws,

‖K^D,m−K‖op≤2\(1\+η\)‖P‖op‖KS‖oplog⁡\(4dint,S/δ\)D\+2\(1\+η\)‖P‖oplog⁡\(4dint,S/δ\)D⏟radial,O\(D−1/2\)\+η‖P‖op⏟sketch,\\\|\\widehat\{K\}\_\{D,m\}\-K\\\|\_\{\\mathrm\{op\}\}\\leq\\underbrace\{2\\sqrt\{\\frac\{\(1\+\\eta\)\\\|P\\\|\_\{\\mathrm\{op\}\}\\\|K\_\{S\}\\\|\_\{\\mathrm\{op\}\}\\log\(4d\_\{\\mathrm\{int\},S\}/\\delta\)\}\{D\}\}\+\\frac\{2\(1\+\\eta\)\\\|P\\\|\_\{\\mathrm\{op\}\}\\log\(4d\_\{\\mathrm\{int\},S\}/\\delta\)\}\{D\}\}\_\{\\text\{radial\},\\ O\(D^\{\-1/2\}\)\}\+\\underbrace\{\\eta\\,\\\|P\\\|\_\{\\mathrm\{op\}\}\}\_\{\\text\{sketch\}\},\(11\)whereKS=P^m∘RK\_\{S\}=\\widehat\{P\}\_\{m\}\\circ R,‖KS‖op≤\(1\+η\)‖P‖op\\\|K\_\{S\}\\\|\_\{\\mathrm\{op\}\}\\leq\(1\+\\eta\)\\\|P\\\|\_\{\\mathrm\{op\}\}, anddint,S=tr⁡\(VS\)/‖VS‖op≤Nd\_\{\\mathrm\{int\},S\}=\\operatorname\{tr\}\(V\_\{S\}\)/\\\|V\_\{S\}\\\|\_\{\\mathrm\{op\}\}\\leq Nis the intrinsic dimension of the*sketch\-conditioned*matrix varianceVSV\_\{S\}\(the variance of Theorem[3\.2](https://arxiv.org/html/2606.11255#S3.Thmtheorem2)for the pair\(P^m,KS\)\(\\widehat\{P\}\_\{m\},K\_\{S\}\)rather than\(P,K\)\(P,K\), reducing todintd\_\{\\mathrm\{int\}\}asm→∞m\\to\\infty\)\. Atm→∞m\\to\\infty\(η→0\\eta\\to 0\) the sketch term vanishes and \([11](https://arxiv.org/html/2606.11255#S3.E11)\) is exactly Corollary[3\.3](https://arxiv.org/html/2606.11255#S3.Thmtheorem3): randomizing the modulation costs the single additive termη‖P‖op\\eta\\\|P\\\|\_\{\\mathrm\{op\}\}, set bymmindependently ofDD\.

*Conditioning is the whole argument\.*Given the sketch,P^m\\widehat\{P\}\_\{m\}is a fixed PSD modulation Gram andK^D,m=P^m∘R^D\\widehat\{K\}\_\{D,m\}=\\widehat\{P\}\_\{m\}\\circ\\widehat\{R\}\_\{D\}is an*exact\-modulation*estimator ofKS=P^m∘RK\_\{S\}=\\widehat\{P\}\_\{m\}\\circ R; Theorem[3\.2](https://arxiv.org/html/2606.11255#S3.Thmtheorem2)and Corollary[3\.3](https://arxiv.org/html/2606.11255#S3.Thmtheorem3)apply verbatim withP↦P^mP\\mapsto\\widehat\{P\}\_\{m\}\(so‖P^m‖op≤\(1\+η\)‖P‖op\\\|\\widehat\{P\}\_\{m\}\\\|\_\{\\mathrm\{op\}\}\\leq\(1\+\\eta\)\\\|P\\\|\_\{\\mathrm\{op\}\}and‖KS‖op≤‖P^m‖op\\\|K\_\{S\}\\\|\_\{\\mathrm\{op\}\}\\leq\\\|\\widehat\{P\}\_\{m\}\\\|\_\{\\mathrm\{op\}\}by Lemma[A\.6](https://arxiv.org/html/2606.11255#A1.Thmtheorem6)withRRunit\-diagonal\), giving the radial term\. The remaining bias isKS−K=\(P^m−P\)∘R=R∘EPK\_\{S\}\-K=\(\\widehat\{P\}\_\{m\}\-P\)\\circ R=R\\circ E\_\{P\}withEP:=P^m−PE\_\{P\}:=\\widehat\{P\}\_\{m\}\-Psymmetric but*not*PSD; here the symmetric Schur\-multiplier bound Lemma[A\.6](https://arxiv.org/html/2606.11255#A1.Thmtheorem6)\(b\) applies \(withA=R⪰0A=R\\succeq 0,maxi⁡Rii=1\\max\_\{i\}R\_\{ii\}=1\), giving‖R∘EP‖op≤‖EP‖op≤η‖P‖op\\\|R\\circ E\_\{P\}\\\|\_\{\\mathrm\{op\}\}\\leq\\\|E\_\{P\}\\\|\_\{\\mathrm\{op\}\}\\leq\\eta\\\|P\\\|\_\{\\mathrm\{op\}\}\. The triangle inequality combines the two; the split is validated in Section[4\.2](https://arxiv.org/html/2606.11255#S4.SS2)\. The conservative entrywise Frobenius decomposition of Proposition[A\.8](https://arxiv.org/html/2606.11255#A1.Thmtheorem8)needs no subspace\-embedding hypothesis and remains available as a fallback \(Appendix[A](https://arxiv.org/html/2606.11255#A1)\)\.

This propagates to downstream kernel ridge regression: the approximation perturbs the learned predictor only through the operator\-norm Gram error\.

###### Theorem 3\.6\(KRR stability under relative spectral approximation\)\.

For a targety∈ℝNy\\in\\mathbb\{R\}^\{N\}and ridgeλ\>0\\lambda\>0, letα^=\(K\+λI\)−1y\\hat\{\\alpha\}=\(K\+\\lambda I\)^\{\-1\}y,α~=\(KD\+λI\)−1y\\tilde\{\\alpha\}=\(K\_\{D\}\+\\lambda I\)^\{\-1\}y,A=K\+λIA=K\+\\lambda I, andE=KD−KE=K\_\{D\}\-K\. If the whitened error obeys‖A−1/2EA−1/2‖op≤ρ<1\\\|A^\{\-1/2\}EA^\{\-1/2\}\\\|\_\{\\mathrm\{op\}\}\\leq\\rho<1, then

\(1−ρ\)A⪯KD\+λI⪯\(1\+ρ\)A,‖α~−α^‖A≤ρ1−ρ‖α^‖A,\(1\-\\rho\)A\\preceq K\_\{D\}\+\\lambda I\\preceq\(1\+\\rho\)A,\\qquad\\\|\\tilde\{\\alpha\}\-\\hat\{\\alpha\}\\\|\_\{A\}\\leq\\frac\{\\rho\}\{1\-\\rho\}\\,\\\|\\hat\{\\alpha\}\\\|\_\{A\},where‖v‖A2=v⊤Av\\\|v\\\|\_\{A\}^\{2\}=v^\{\\top\}Av\.

This is the spectral\-approximation form standard in random\-feature KRR analysis\(Avron et al\.,[2017](https://arxiv.org/html/2606.11255#bib.bib4)\): control of the*relative*errorρ\\rho, not the raw‖KD−K‖op\\\|K\_\{D\}\-K\\\|\_\{\\mathrm\{op\}\}, governs the predictor\. The theorem is deterministic\. A high\-probability conditionρ=O\(D−1/2\)\\rho=O\(D^\{\-1/2\}\)would follow by applying matrix Bernstein \(Theorem[3\.2](https://arxiv.org/html/2606.11255#S3.Thmtheorem2)\) to the whitened summandsA−1/2\(K\(j\)−K\)A−1/2A^\{\-1/2\}\(K^\{\(j\)\}\-K\)A^\{\-1/2\}; their scale is set by the effective dimensiondeff\(λ\)=tr⁡\(K\(K\+λI\)−1\)d\_\{\\mathrm\{eff\}\}\(\\lambda\)=\\operatorname\{tr\}\(K\(K\+\\lambda I\)^\{\-1\}\)\. We outline this route and leave the high\-probability bound and full excess\-risk analysis to future work\. A cruder coefficient bound‖α^−α~‖2≤λ−2‖KD−K‖op‖y‖2\\\|\\hat\{\\alpha\}\-\\tilde\{\\alpha\}\\\|\_\{2\}\\leq\\lambda^\{\-2\}\\\|K\_\{D\}\-K\\\|\_\{\\mathrm\{op\}\}\\\|y\\\|\_\{2\}also holds, sinceKD=ZZ⊤⪰0K\_\{D\}=ZZ^\{\\top\}\\succeq 0makesKD\+λIK\_\{D\}\+\\lambda Iinvertible with inverse norm≤1/λ\\leq 1/\\lambda\. It matches the1/D1/\\sqrt\{D\}convergence observed downstream \(Section[D](https://arxiv.org/html/2606.11255#A4.SS0.SSS0.Px5)\)\.

Read acrossbb, these bounds expose what the bias costs\. Settingb=0b=0recovers the unbiased bounds withR4R^\{4\}andR8R^\{8\}; forb\>0b\>0they inflate by powers of\(1\+b/R2\)\(1\+b/R^\{2\}\), the bias enlarging the effective data radius fromR2R^\{2\}toR2\+bR^\{2\}\+b\. This is the price side of the trade whose benefit, the finite\-difference IMQ reach and the practical bias shift, motivatedk\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}overk\\tifinaghfontⵟk\_\{\\text\{\\tifinaghfont ⵟ\}\}in the first place \(Section[1](https://arxiv.org/html/2606.11255#S1)\); the\(R2\+b\)4\(R^\{2\}\+b\)^\{4\}variance law is not merely an artifact of the proof but is borne out empirically in Section[D](https://arxiv.org/html/2606.11255#A4.SS0.SSS0.Px2)\.

The\(R2\+b\)4\(R^\{2\}\+b\)^\{4\}blow\-up is removable by normalizing the modulation feature toqb\(x\)=pb\(x\)/\(‖x‖2\+b\)q\_\{b\}\(x\)=p\_\{b\}\(x\)/\(\\\|x\\\|^\{2\}\+b\)\(unit norm\): the resulting*normalized estimator*has variance bounded by3/\(2Dε2\)3/\(2D\\varepsilon^\{2\}\), free ofRRandbb\(Proposition[A\.7](https://arxiv.org/html/2606.11255#A1.Thmtheorem7), Appendix[A](https://arxiv.org/html/2606.11255#A1)\)\. It is not an approximation tok\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}but an unbiased estimator of a*different*, cosine\-rescaled kernel, so the gain is stability, not a free fix\.

Finally, the Gram error has a structure that explains why the product is worth keeping even when the modulation is itself compressed\. WritingK=P∘HK=P\\circ HwithPij=pb\(xi,xj\)P\_\{ij\}=p\_\{b\}\(x\_\{i\},x\_\{j\}\),Hij=hε\(xi,xj\)H\_\{ij\}=h\_\{\\varepsilon\}\(x\_\{i\},x\_\{j\}\):

###### Proposition 3\.7\(Modulation–radial error decomposition\)\.

WithP^m\\widehat\{P\}\_\{m\}any modulation approximation andH^D\\widehat\{H\}\_\{D\}the radial estimate,K^D,m=P^m∘H^D\\widehat\{K\}\_\{D,m\}=\\widehat\{P\}\_\{m\}\\circ\\widehat\{H\}\_\{D\}satisfies

K^D,m−K=P∘\(H^D−H\)⏟radial error, modulated by finite feature\+\(P^m−P\)∘H⏟modulation error, localized by proximity\+\(P^m−P\)∘\(H^D−H\)⏟interaction\.\\widehat\{K\}\_\{D,m\}\-K=\\underbrace\{P\\circ\(\\widehat\{H\}\_\{D\}\-H\)\}\_\{\\text\{radial error, modulated by finite feature\}\}\+\\underbrace\{\(\\widehat\{P\}\_\{m\}\-P\)\\circ H\}\_\{\\text\{modulation error, localized by proximity\}\}\+\\underbrace\{\(\\widehat\{P\}\_\{m\}\-P\)\\circ\(\\widehat\{H\}\_\{D\}\-H\)\}\_\{\\text\{interaction\}\}\.For exact modulation \(P^m=P\\widehat\{P\}\_\{m\}=P\) this collapses toK^D−K=P∘\(H^D−H\)\\widehat\{K\}\_\{D\}\-K=P\\circ\(\\widehat\{H\}\_\{D\}\-H\), so entrywise\|K^ij−Kij\|=pb\(xi,xj\)\|H^ij−Hij\|\|\\widehat\{K\}\_\{ij\}\-K\_\{ij\}\|=p\_\{b\}\(x\_\{i\},x\_\{j\}\)\\,\|\\widehat\{H\}\_\{ij\}\-H\_\{ij\}\|\.

\(Proof: expand\(P\+EP\)∘\(H\+EH\)\(P\+E\_\{P\}\)\\circ\(H\+E\_\{H\}\)and subtractKK\.\) The radial Monte\-Carlo noise is Schur\-scaled by the modulation; under bounded norms, and literally for the normalized variant whereGij=\(xi⊤xj\+b\)2/\(\(‖xi‖2\+b\)\(‖xj‖2\+b\)\)∈\[0,1\]G\_\{ij\}=\(x\_\{i\}^\{\\top\}x\_\{j\}\+b\)^\{2\}/\(\(\\\|x\_\{i\}\\\|^\{2\}\+b\)\(\\\|x\_\{j\}\\\|^\{2\}\+b\)\)\\in\[0,1\], this acts as an alignment gate, sincepbp\_\{b\}otherwise also carries norm and bias scale\. Section[4\.5](https://arxiv.org/html/2606.11255#S4.SS5)shows this modulation suppresses false positives even when the polynomial factor is sketched\.

### 3\.2Where the dimension enters, and what to compare against

Table[2](https://arxiv.org/html/2606.11255#S3.T2)states the sample complexity for aτ\\tau\-approximation of theNN\-point Gram matrix\. We are careful about the setting, because it is easy to overclaim\. At this*dataset level*\(a union bound over the≤N2\\leq N^\{2\}pairs, via Hoeffding\), the RAY radial count is free ofdd\(Corollary[A\.4](https://arxiv.org/html/2606.11255#A1.Thmtheorem4)\)\. But so is standard RFF: a union bound over a finite point set removes the explicitddfor the Gaussian and IMQ kernels as well\. Dimension\-freeness of the sample count is therefore a property of the dataset\-level analysis, not a special feature of our construction\.

Table 2:Sample complexity for a dataset\-levelτ\\tau\-approximation of theNN\-point Gram matrix \(Hoeffding \+ union bound\)\. All counts are free of the input dimensionddat this level; the constant is the squared per\-draw range\. Standard RFF uses an absolutely bounded cosine feature, so its count is free ofRRand the bandwidthσ\\sigma; the IMQ–Laplace estimator carries the radial massmf=1/εm\_\{f\}=1/\\varepsilon, givingε−2\\varepsilon^\{\-2\}\. RAY’s count is exactly that radialε−2\\varepsilon^\{\-2\}times the*unbounded*polynomial modulation range\(R2\+b\)4\(R^\{2\}\+b\)^\{4\}, the honest price of the alignment numerator, removed only by normalizing the modulation \(Proposition[A\.7](https://arxiv.org/html/2606.11255#A1.Thmtheorem7)\)\. The methods differ*qualitatively*in applicability, since RFF needs shift\-invariance and sketching a dot\-product form, andk\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}has neither\.RRis the data radius\.Table 3:Approximation taxonomy\. Random\-feature methods draw feature parameters from kernel\-defined distributions before seeing the dataset and are unbiased for arbitrary pairs\. Nyström methods are data\-dependent deployment baselines: they choose columns, centers, or leverage\-sampled landmarks from the observed training set and approximate the empirical Gram matrix or downstream predictor\.What RAY genuinely buys is therefore not a smallerdd\-dependence than RFF, but*applicability*: it brings a kernel that is neither shift\-invariant nor a dot\-product, on which RFF and polynomial sketching cannot be run, into the same dataset\-level,dd\-free regime that RFF enjoys for the kernels it does cover, at the cost of a larger constant:\(R2\+b\)4/ε2\(R^\{2\}\+b\)^\{4\}/\\varepsilon^\{2\}in place of the absolute constant of a bounded\-cosine RFF, the unbounded polynomial modulation being the one factor RFF does not carry\. The explicitddfamiliar from RFF reappears only in the*domain\-level*\(uniform\-over\-a\-dd\-ball\) analysis through the covering number; there RAY’s inner Gaussian features incur it as well, so RAY is notdd\-free in that stronger sense either\. Nyström occupies a different category \(Table[3](https://arxiv.org/html/2606.11255#S3.T3)\): it is an adaptive landmark approximation to the observed sample, not a data\-independent random\-feature map\. Its error can degrade with dimension through the spectral\-decay rateO\(m−2s/d\)O\(m^\{\-2s/d\}\), which is the contrast our matched\-landmark experiments target \(Section[D](https://arxiv.org/html/2606.11255#A4.SS0.SSS0.Px4), Table[11](https://arxiv.org/html/2606.11255#A4.T11)\), but its adaptivity can also make it a stronger deployment baseline at fixed representation size \(Section[4\.3](https://arxiv.org/html/2606.11255#S4.SS3)\)\.

### 3\.3Variance reduction

Three optional refinements of the sampling lower the variance further; all are evaluated in Appendix[D](https://arxiv.org/html/2606.11255#A4)\.

#### Quasi\-Monte Carlo overtt\.

Replacetj∼Exp⁡\(ε\)t\_\{j\}\\sim\\operatorname\{Exp\}\(\\varepsilon\)by quasi\-random samples through the inverse CDFtj=−ε−1log⁡ujt\_\{j\}=\-\\varepsilon^\{\-1\}\\log u\_\{j\},uju\_\{j\}from a Sobol or Halton sequence\. The radial integrandt↦e−t\(ε\+‖x−w‖2\)t\\mapsto e^\{\-t\(\\varepsilon\+\\\|x\-w\\\|^\{2\}\)\}isC∞C^\{\\infty\}and of bounded variation, so by Koksma–Hlawka\(Kuo & Nuyens,[2016](https://arxiv.org/html/2606.11255#bib.bib15)\)the*tt\-integral’s*QMC error isO\(\(log⁡D\)s/D\)O\(\(\\log D\)^\{s\}/D\)\. We emphasize, however, that this is the rate of the outer integral only: the estimator also carries the inner Gaussian RFF Monte\-Carlo noise, which is not quasi\-randomized, and empirically \(Appendix[D\.1](https://arxiv.org/html/2606.11255#A4.SS1)\) it is this inner noise that sets the overall convergence, so QMC delivers a constant\-factor variance reduction rather than a faster rate\. Deterministic Sobol/Halton points also trade the exact i\.i\.d\. unbiasedness of Theorem[2\.4](https://arxiv.org/html/2606.11255#S2.Thmtheorem4)for a bounded deterministic integration error;*randomized*QMC \(a scrambled or shifted sequence\) restores unbiasedness in expectation over the randomization while keeping the improved error, and is the variant we recommend when unbiasedness must hold exactly\.

#### Orthogonal features for the inner approximation\.

For each scaletjt\_\{j\}, drawing the inner frequencies as scaled rows of a Haar\-random orthogonal matrix\(Yu et al\.,[2016](https://arxiv.org/html/2606.11255#bib.bib34)\)reduces the inner\-loop variance without affecting unbiasedness\. The combined QMC\-outer / orthogonal\-inner estimator achieves the best of both\.

#### Importance sampling overtt\.

Samplingt∼Exp⁡\(ε\+η\)t\\sim\\operatorname\{Exp\}\(\\varepsilon\+\\eta\)with weightwt=εε\+ηeηtw\_\{t\}=\\frac\{\\varepsilon\}\{\\varepsilon\+\\eta\}e^\{\\eta t\}keeps the estimator unbiased and reduces variance for nearby pairs \(‖x−w‖2≤η\\\|x\-w\\\|^\{2\}\\leq\\eta\) by up to\(ε/\(ε\+η\)\)2\(\\varepsilon/\(\\varepsilon\+\\eta\)\)^\{2\},*providedη\\etais not too large*: the importance estimate of the radial factor at squared distancerrhas a finite second moment only forη<ε\+2r\\eta<\\varepsilon\+2r\(Appendix[D](https://arxiv.org/html/2606.11255#A4)\), so a single proposal safe for every pair, includingr=0r=0, requiresη<ε\\eta<\\varepsilon\. The proposal is data\-independent and so practical within that range, if suboptimal for the full pairwise\-distance spread\.

## 4Experiments

We lead with the construction’s reason for being and then exercise its design and use\. The off\-sphere bounded ball is RAY’s representational niche, wherek\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is genuinely nonstationary and no dot\-product reduction exists \(Section[4\.1](https://arxiv.org/html/2606.11255#S4.SS1), Figure[1](https://arxiv.org/html/2606.11255#S1.F1)\); we then study TensorSketch compression of the polynomial factor at matched feature dimension \(Section[4\.2](https://arxiv.org/html/2606.11255#S4.SS2)\) and the honest fair\-cost comparison \(Section[4\.3](https://arxiv.org/html/2606.11255#S4.SS3)\)\. We then turn to*when the coupling matters*\(when the\\tifinaghfontⵟ\-kernel is the right kernel \(Section[4\.4](https://arxiv.org/html/2606.11255#S4.SS4)\) and why, through the alignment\-modulation decomposition \(Section[4\.5](https://arxiv.org/html/2606.11255#S4.SS5)\)\) and close with two systems results in RAY’s data\-independent, streaming niche: a linear\-time\\tifinaghfontⵟ\-attention primitive \(Section[4\.6](https://arxiv.org/html/2606.11255#S4.SS6)\) and large\-scale streaming primal training on HIGGS \(Section[4\.7](https://arxiv.org/html/2606.11255#S4.SS7)\), at a scale where the Gram is impossible\. Confirmatory studies \(kernel\-value variance reduction \(Appendix[D\.1](https://arxiv.org/html/2606.11255#A4.SS1)\), the outer/inner allocation that singles out flat sampling \(Appendix[D\.2](https://arxiv.org/html/2606.11255#A4.SS2)\), and theO\(N2\)O\(N^\{2\}\)scalability wall \(Appendix[D\.3](https://arxiv.org/html/2606.11255#A4.SS3)\)\) are deferred to Appendix[D](https://arxiv.org/html/2606.11255#A4), together with the sphere\-normalized Gram\-rate and dimension checks\. All experiments are seed\-deterministic and reproducible from the archived scripts\. Sphere\-normalized studies are deferred as approximation\-quality checks: on the spherek\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is a dot\-product kernel with natural zonal baselines \(Random Gegenbauer features\(Han et al\.,[2022](https://arxiv.org/html/2606.11255#bib.bib10)\)\), so the representational claims are off\-sphere\.

### 4\.1Off\-sphere validation on a bounded ball

The key test is off\-sphere \(Figure[1](https://arxiv.org/html/2606.11255#S1.F1)\): a bounded ball wherek\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is not a function ofx⊤wx^\{\\top\}walone, so no dot\-product reduction exists\. We sampleN=1000N=1000points with directions uniform on𝕊d−1\\mathbb\{S\}^\{d\-1\}and radii uniform on\[0\.25,1\]\[0\.25,1\]\(a dimension\-independent spread of norms; uniform\-in\-ball sampling would collapse toward the sphere asddgrows\)\. Approximating the exact Gram with RAY \(flatD′=1D^\{\\prime\}=1,b=1b=1,ε\\varepsilonthe median squared distance,55seeds\) reproduces theO\(1/D\)O\(1/\\sqrt\{D\}\)rate, while uniform\- and k\-means\-landmark Nyström degrade withdd\(Table[4](https://arxiv.org/html/2606.11255#S4.T4)\): k\-means helps by a constant but not the trend, and RAY atD=1000D=1000is below both byd=8d=8\. This matches radial draws against landmarks, not feature dimension \(RAY’s isDdbD\\,d\_\{b\}\); the cost\-matched comparison is Section[4\.3](https://arxiv.org/html/2606.11255#S4.SS3)\. Bounded norm, not sphere normalization, is what the estimator requires\.

Table 4:Relative Frobenius Gram error on an*off\-sphere*bounded ball \(‖x‖∈\[0\.25,1\]\\\|x\\\|\\in\[0\.25,1\], varying norms;N=1000N=1000,b=1b=1, mean over55seeds\)\. RAY follows theO\(1/D\)O\(1/\\sqrt\{D\}\)rate at every dimension; both uniform and k\-means Nyström \(m=100m=100\) worsen asddgrows, k\-means by a smaller constant\. Standard deviation over the55seeds is≤0\.015\\leq 0\.015for the RAY columns and≤0\.003\\leq 0\.003for the Nyström columns, so thed≥8d\\geq 8ordering \(RAY below both Nyström variants\) is outside seed noise\.The adaptive*ridge\-leverage\-score*Nyström\(Musco & Musco,[2017](https://arxiv.org/html/2606.11255#bib.bib18)\), the strongest standard baseline, does not change the picture: it tracks uniform Nyström \(0\.0720\.072/0\.0940\.094/0\.1100\.110atd=8d=8/1616/3232\), k\-means stays the best landmark variant \(0\.0510\.051/0\.0790\.079/0\.0970\.097\), and all three worsen withddwhile RAY atD=1000D=1000remains below them byd=16d=16\. The behavior persists into genuinely high dimension: on the same off\-sphere ball atd∈\{128,256,512\}d\\in\\\{128,256,512\\\}, RAY follows theO\(1/D\)O\(1/\\sqrt\{D\}\)rate \(fitted slopes−0\.41\-0\.41,−0\.51\-0\.51,−0\.52\-0\.52\) with relative error0\.050\.05–0\.060\.06atD=1000D=1000, unchanged in order across a4×4\\timesrange ofdd\.

The fidelity carries to real data kept off\-sphere\. Standardizing and scaling each dataset by its maximum row norm \(so‖x‖≤1\\\|x\\\|\\leq 1with*varying*norms, not on the sphere\), RAY atD=128D=128tracks the exact\\tifinaghfontⵟ\-kernel’s downstream accuracy ondigits\(0\.9830\.983vs\. exact0\.9860\.986\) and on a largercovtypesubsample \(N=3000N\{=\}3000,d=54d\{=\}54:0\.6950\.695vs\. exact0\.6820\.682\), in both cases matching or beating Gaussian RFF and k\-means Nyström at the same budget\. So the sphere\-normalized ridge\-regression results \(Appendix[D](https://arxiv.org/html/2606.11255#A4)\) are not an artifact of normalization: RAY reproduces the exact kernel off\-sphere as well\.

### 4\.2Compressing the polynomial factor with TensorSketch

RAY sketches the degree\-2 modulation to escape theO\(d2\)O\(d^\{2\}\)feature floor \(Section[2\.5](https://arxiv.org/html/2606.11255#S2.SS5)\); here we validate the error decomposition and exercise the accuracy–cost trade\. The variance splits into a radial Monte\-Carlo term \(∼D−1/2\\sim D^\{\-1/2\}\), a polynomial\-sketch term \(∼m−1/2\\sim m^\{\-1/2\}\), and their product \(Proposition[A\.8](https://arxiv.org/html/2606.11255#A1.Thmtheorem8)\)\. For a fixed pair \(40004000repetitions\) the split is clean: varyingDDatm=256m=256drives only the radial term \(0\.09→8\.5×10−40\.09\\to 8\.5\\times 10^\{\-4\}\), varyingmmatD=1000D=1000only the sketch term \(4\.2×10−3→1\.4×10−44\.2\\times 10^\{\-3\}\\to 1\.4\\times 10^\{\-4\}\), and the empirical variance matches the three\-term formula throughout \(ratio0\.940\.94–1\.061\.06\)\. The operator\-norm error of Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)shows the same structure \(Figure[2](https://arxiv.org/html/2606.11255#S4.F2)\): atm=128m\{=\}128the radial term falls asD−1/2D^\{\-1/2\}\(40\.5→3\.640\.5\\to 3\.6overD=10→1000D\{=\}10\\to 1000\) while the sketch term is aDD\-independent floor that decays withmm\(19\.5→7\.419\.5\\to 7\.4asm=64→256m\{=\}64\\to 256\), vanishing asm→∞m\\to\\infty\.

![Refer to caption](https://arxiv.org/html/2606.11255v1/x2.png)Figure 2:Operator\-norm error of the deployed \(doubly\-randomized\) RAY estimator, validating Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\(off\-sphere,d=16d=16,N=300N=300,‖P‖op=186\\\|P\\\|\_\{\\mathrm\{op\}\}=186\)\.\(a\)At fixed sketch sizem=128m\{=\}128, the radial term falls asO\(1/D\)O\(1/\\sqrt\{D\}\)while the sketch termη‖P‖op\\eta\\\|P\\\|\_\{\\mathrm\{op\}\}is aDD\-independent floor; the total decays to that floor, and them→∞m\\to\\infty\(exact\-modulation\) curve is the zero\-floor limit\.\(b\)The sketch term‖EP∘R‖op\\\|E\_\{P\}\\circ R\\\|\_\{\\mathrm\{op\}\}and the relative sketch errorη=‖EP‖op/‖P‖op\\eta=\\\|E\_\{P\}\\\|\_\{\\mathrm\{op\}\}/\\\|P\\\|\_\{\\mathrm\{op\}\}both shrink withmm, so increasingmmrecovers exact\-modulation RAY \(Corollary[3\.3](https://arxiv.org/html/2606.11255#S3.Thmtheorem3)\)\. The single additive sketch term is the entire price of randomizing the modulation\.At finitemmthe compression recovers most of the efficiency the exact feature loses to itsO\(d2\)O\(d^\{2\}\)size: on sphere\-normalized digits \(d=64d=64\) atM=dbM=d\_\{b\}, RAY \(m=128m=128\) reaches0\.9770\.977where exact modulation, starved to a single radial draw, scores0\.9280\.928, near the optimal rank\-MMceiling0\.9860\.986\(sanity check, Table[21](https://arxiv.org/html/2606.11255#A4.T21), Appendix[D](https://arxiv.org/html/2606.11255#A4)\)\. In low dimension there is nothing to compress \(california,db=45d\_\{b\}=45\): the sketch only adds noise and one takesm→∞m\\to\\infty\. RAY is the default scalable map; exact modulation is its low\-ddendpoint, not a separate method\.

At a fixed feature dimensionM=D\(m\+d\+1\)M=D\(m\{\+\}d\{\+\}1\)the sketch sizemmtrades against the radial\-draw countDD, and the optimum is interior \(Table[5](https://arxiv.org/html/2606.11255#S4.T5)\)\. The best allocation for Gram fidelity is an intermediate sketch size; downstream KRR error instead favors smallermm\(more radial draws\), so the operating point depends on whether Gram fidelity or prediction is the goal\. This operationalizes the radial\-vs\-sketch split of Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\.

Table 5:Deployed\-RAY allocation at fixed feature dimensionM=8192M=8192on an off\-sphere ball \(relative Frobenius Gram error, mean over33seeds\): at fixedM=D\(m\+d\+1\)M=D\(m\{\+\}d\{\+\}1\)the sketch sizemmtrades against the radial\-draw countDD\. The error is minimized at an intermediatemm\(smallmmleaves a large sketch term, largemmstarves the radial draws\), and the optimalmmgrows slowly withdd\.Sketching also removes theO\(d2\)O\(d^\{2\}\)representation floor outright \(Table[6](https://arxiv.org/html/2606.11255#S4.T6)\)\. Atd=1024d\{=\}1024the exact\-modulation feature needs33\.733\.7GB per theN=1000N\{=\}1000representation and cannot be built, while the sketched feature is7474MB and builds in2828ms, a direct demonstration that sketching removes theO\(d2\)O\(d^\{2\}\)floor \(Limitation \(v\)\)\.

Table 6:Feature\-construction cost vs\. dimensionddat fixed radial drawsD=8D\{=\}8, sketchm=128m\{=\}128,N=1000N\{=\}1000\(off\-sphere ball\)\. Exact modulation’s per\-point feature dimensionDdbD\\,d\_\{b\}grows asO\(d2\)O\(d^\{2\}\)and becomes impossible to build; the sketched featureD\(m\+d\+1\)D\(m\{\+\}d\{\+\}1\)stays linear indd\. “–” marks a feature too large to materialize \(\>2\>2M coordinates/point\)\.
### 4\.3Fair\-cost comparison

Matched draws understate RAY’s cost; matched representation is the honest axis\. The Nyström methods here are strong*data\-dependent*deployment baselines \(Table[3](https://arxiv.org/html/2606.11255#S3.T3)\), not data\-independent feature maps\. On the off\-sphere bounded ball \(Table[7](https://arxiv.org/html/2606.11255#S4.T7);d=64d=64,‖x‖∈\[0\.3,1\.5\]\\\|x\\\|\\in\[0\.3,1\.5\], coupled target,M=db=2145M=d\_\{b\}=2145\), RAY \(0\.4730\.473RMSE\) far outperforms the budget\-starved exact\-modulation limit \(1\.1141\.114, one radial draw\): compression, not exactness, is the right setting at this budget\. The adaptive k\-means and ridge\-leverage Nyström of the*exact*\\tifinaghfontⵟ\-kernel are most accurate \(0\.0990\.099each; matching, so neither is a weak comparator\), and Gaussian RFF is a strong different\-kernel reference\. This makes theO\(d2\)O\(d^\{2\}\)floor \(Limitation \(v\)\) concrete: atd=64d=64realizing the kernel as features needs more draws, so a moderate\-NN, fixed\-budget deployment may prefer adaptive landmarks unless streaming, unbiasedness, or pre\-data features are required\. The ordering is the same on sphere\-normalizeddigits\(Table[22](https://arxiv.org/html/2606.11255#A4.T22), Appendix[D](https://arxiv.org/html/2606.11255#A4)\)\.

Table 7:Off\-sphere fair\-cost: KRR test RMSE \(↓\\downarrow\) on a bounded ball \(d=64d=64,‖x‖∈\[0\.3,1\.5\]\\\|x\\\|\\in\[0\.3,1\.5\], coupled target, mean±\\pmstd over33seeds\) at matched representation dimension\. RAY beats the budget\-starved exact\-modulation limit; k\-means and ridge\-leverage Nyström of the exact\\tifinaghfontⵟ\-kernel are most accurate; Gaussian RFF approximates a different kernel\. The sphere\-normalized version is Table[22](https://arxiv.org/html/2606.11255#A4.T22)\(Appendix[D](https://arxiv.org/html/2606.11255#A4)\)\.
### 4\.4Coupled\-target preference

The comparisons so far measure approximation quality, not whether the\\tifinaghfontⵟ\-kernel is the right kernel\. We test that inductive bias directly: the kernel couples alignment and proximity, so it should be preferred exactly when the target benefits from both, and not otherwise\. On off\-sphere data \(‖x‖∈\[0\.3,1\.5\]\\\|x\\\|\\in\[0\.3,1\.5\],d=16d=16\) we build three regression targets from small atom sums and compare kernel ridge regression with the Gaussian, IMQ, degree\-2 polynomial, and\\tifinaghfontⵟkernels plus RAY \(Table[8](https://arxiv.org/html/2606.11255#S4.T8)\)\. The two single\-factor controls are kernel\-natural \(*proximity*y=∑kak/\(‖x−vk‖2\+ε0\)y=\\sum\_\{k\}a\_\{k\}/\(\\\|x\-v\_\{k\}\\\|^\{2\}\+\\varepsilon\_\{0\}\)and*alignment*y=∑kak\(uk⊤x\)2y=\\sum\_\{k\}a\_\{k\}\(u\_\{k\}^\{\\top\}x\)^\{2\}\), so they should, and do, favor the distance kernels and the polynomial kernel respectively\. The*coupled*targety=∑kaktanh⁡\(2uk⊤x\)e−‖x−vk‖y=\\sum\_\{k\}a\_\{k\}\\tanh\(2u\_\{k\}^\{\\top\}x\)\\,e^\{\-\\\|x\-v\_\{k\}\\\|\}is deliberately*not*of the\\tifinaghfontⵟ\-kernel’s form: it multiplies atanh\\tanhalignment by a Laplace proximity, matching no candidate kernel, so a\\tifinaghfontⵟwin there cannot be an artifact of\\tifinaghfontⵟ\-generated data\. The\\tifinaghfontⵟ\-kernel wins the coupled target and only there\. The evidence is therefore a controlled compatibility result for the alignment×\\timesproximity coupling, not a general necessity theorem\. RAY inherits the preference once its approximation has converged: atD=4000D=4000it reaches0\.0930\.093on the coupled target, below both distance kernels, approaching the exact\\tifinaghfontⵟ\-kernel’s0\.0870\.087; at smallerDDthe residual approximation error keeps it above them, consistent with theO\(1/D\)O\(1/\\sqrt\{D\}\)convergence measured elsewhere\.

Table 8:Kernel ridge regression test RMSE \(↓\\downarrow\) on off\-sphere targets \(d=16d=16, varying norms, mean over55seeds, RAY atD=4000D=4000, best per row in bold\)\. The coupled target \(tanh alignment×\\timesLaplace proximity\) matches no candidate kernel; the\\tifinaghfontⵟ\-kernel is best only there, while distance kernels win the proximity control and the polynomial wins the alignment control\. RAY tracks the exact\\tifinaghfontⵟ\-kernel, beating both distance kernels on the coupled target\.The preference survives fair per\-kernel tuning \(Table[9](https://arxiv.org/html/2606.11255#S4.T9)\): with every kernel’sb,ε,λb,\\varepsilon,\\lambdagrid\-searched on a held\-out split, the exact\\tifinaghfontⵟ\-kernel is best on the coupled target and statistically tied ondigits, so the coupled\-target preference is not an artifact of fixed hyperparameters\.

Table 9:Validation\-tuned downstream comparison \(every kernel’sb,ε,λb,\\varepsilon,\\lambdagrid\-searched on a held\-out split; off\-sphere; mean±\\pmstd over33seeds\)\. The\\tifinaghfontⵟ\-kernel’s coupled\-target advantage survives fair tuning: it wins the synthetic coupled \(alignment×\\timesproximity\) target and ties ondigitswhere one factor suffices\. RAY is the deployed sketched approximation at a fixed budget\.This advantage belongs to the kernel, and the matched\-cost comparison tells us where realizing it is paid for\. Resolving the coupling through random features costsO\(d2\)O\(d^\{2\}\)per draw \(the polynomial floor of Section[3\.2](https://arxiv.org/html/2606.11255#S3.SS2)\), so at a small fixed feature dimension only a few draws are affordable: the kernel is under\-resolved, and a well\-resolved RFF for a simpler kernel can score better on the very target the coupling was meant to win\. AtM=4096M\{=\}4096this is exactly what happens \(Table[10](https://arxiv.org/html/2606.11255#S4.T10)\): sketched RAY reaches0\.1030\.103on the coupled target against0\.0380\.038and0\.0390\.039for resolved Gaussian and IMQ RFF, while the exact\\tifinaghfontⵟ\-kernel, paying full price, keeps the lead at0\.0250\.025, and each single\-factor control goes to the matched RFF built for it\. The point is not that the coupling stops helping but that a tiny fixed budget is the wrong place to ask for it, because there a cheap simpler kernel out\-resolves the right one\. RAY is built for the opposite regime, the one the whole construction targets: whereNNis large enough that the exact Gram cannot be formed and the features must be data\-independent and streaming \(Section[4\.7](https://arxiv.org/html/2606.11255#S4.SS7)\)\. There the alternative to RAY is not a cheaper RFF on a small target but a method that cannot run at all, and the coupling RAY carries through the sketch is precisely what makes it worth forming, the gap to the exact kernel closing at theO\(1/D\)O\(1/\\sqrt\{D\}\)rate as the draw budget grows\.

Table 10:Matched\-dimension random\-feature comparison \(RMSE↓\\downarrow, off\-sphered=32d\{=\}32,M=4096M\{=\}4096, mean of33seeds\)\. At this small fixed budget theO\(d2\)O\(d^\{2\}\)polynomial floor leaves the coupling under\-resolved, so a cheaper RFF for a simpler kernel scores better while the exact\\tifinaghfontⵟ\-kernel keeps the lead; RAY’s regime is large\-NNstreaming, not a fixed\-budget bake\-off \(see text\)\.
### 4\.5The alignment numerator as signal modulation

The coupled\-target preference is explained by the modulation–radial decomposition \(Proposition[3\.7](https://arxiv.org/html/2606.11255#S3.Thmtheorem7)\): radial Monte\-Carlo noise enters each Gram entry scaled by the alignment modulationpbp\_\{b\}, so the product sharpens*between*\-pair discrimination while leaving any single pair’s relative error unchanged\. What matters here is that this gating survives approximation\. On off\-sphere data \(d=16d=16\) we score three pair types \(*true*\(close and aligned\),*radial distractors*\(close but weakly aligned\), and*alignment distractors*\(aligned but far\)\) by the AUC separating true from each \(400400pairs/type\)\. A radial\-only kernel \(IMQ\) false\-positives on radial distractors \(AUC0\.220\.22\); an alignment\-only kernel \(degree\-2 polynomial\) false\-positives on alignment distractors \(AUC0\.000\.00\); only the\\tifinaghfontⵟproduct suppresses both \(AUC1\.001\.00/1\.001\.00\), the exact\-modulation estimator inherits this almost exactly, and the deployed RAY \(sketched modulation\) partially survives on the alignment distractors \(AUC0\.730\.73\)\. So the coupling that the preference test \(Section[4\.4](https://arxiv.org/html/2606.11255#S4.SS4)\) rewards is exactly the one the Schur modulation preserves under approximation\.

### 4\.6Linear\-time streaming attention

Attention is a kernel smoother over tokens,attn\(qi\)=∑jk\(qi,kj\)vj/∑jk\(qi,kj\)\\mathrm\{attn\}\(q\_\{i\}\)=\\sum\_\{j\}k\(q\_\{i\},k\_\{j\}\)v\_\{j\}/\\sum\_\{j\}k\(q\_\{i\},k\_\{j\}\), atO\(N2\)O\(N^\{2\}\)cost\. An explicit feature map collapses it toO\(NM\)O\(NM\): withϕ\\phisuch thatϕ\(q\)⊤ϕ\(k\)≈k\(q,k\)\\phi\(q\)^\{\\top\}\\phi\(k\)\\approx k\(q,k\), the smoother factorizes asϕ\(qi\)⊤\(∑jϕ\(kj\)vj⊤\)/ϕ\(qi\)⊤∑jϕ\(kj\)\\phi\(q\_\{i\}\)^\{\\top\}\\bigl\(\\sum\_\{j\}\\phi\(k\_\{j\}\)v\_\{j\}^\{\\top\}\\bigr\)/\\phi\(q\_\{i\}\)^\{\\top\}\\sum\_\{j\}\\phi\(k\_\{j\}\)\. RAY supplies such a map fork\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}, and it is the kind only random features provide: one map for every token \(data\-independent\) and causal, through a recurrent stateSt=St−1\+ϕ\(kt\)vt⊤S\_\{t\}=S\_\{t\-1\}\+\\phi\(k\_\{t\}\)v\_\{t\}^\{\\top\}\. A landmark method cannot serve here: its landmarks are chosen from the whole sequence, so it is neither per\-token\-fixed nor causal\.

We measure fidelity and scaling, not language quality: random\-feature attention is a coarse approximation, and the\\tifinaghfontⵟ\-attention architecture is developed separately\. The linear\-attention output matches exact\\tifinaghfontⵟ\-attention with a median per\-token error that falls monotonically with the feature dimensionMM\(Figure[3](https://arxiv.org/html/2606.11255#S4.F3)a,d=32d=32:0\.450\.45atM=1552M\{=\}1552down to0\.120\.12atM=12416M\{=\}12416\), and the induced attention\-weight matrix tracks it \(0\.57→0\.120\.57\\to 0\.12\)\. The approximation does not degrade with context length: at fixedM=3104M\{=\}3104the per\-token error stays in0\.260\.26–0\.370\.37acrossNNfrom256256to81928192, with no upward drift\. Its one limitation is sharpness \(the radial RFF is the hard factor\), so the error scales with how peaked the attention is: from0\.110\.11on diffuse attention \(ε=4×\\varepsilon=4\\timesthe median squared distance\) to1\.021\.02when peaked \(ε=0\.25×\\varepsilon=0\.25\\times\), Figure[3](https://arxiv.org/html/2606.11255#S4.F3)b\. The payoff is theO\(N2\)O\(N^\{2\}\)wall \(Figure[3](https://arxiv.org/html/2606.11255#S4.F3)c\): the exactN×NN\\times Nattention matrix reaches8\.68\.6GB atN=32768N\{=\}32768and137137GB atN=131072N\{=\}131072\(no longer storable\) while RAY stays linear \(1\.61\.6GB,1\.51\.5s atN=131072N\{=\}131072\), and a single0\.20\.2MB recurrent stateSt=St−1\+ϕ\(kt\)vt⊤S\_\{t\}=S\_\{t\-1\}\+\\phi\(k\_\{t\}\)v\_\{t\}^\{\\top\}, constant inNN, decodes causally\. That recurrence is*exact*, reproducing masked\\tifinaghfontⵟ\-attention up to the same feature\-approximation error \(0\.240\.24\)\. This is RAY’s niche made concrete: a faithful, linear\-time, streaming kernel attention that the exact kernel cannot scale to and a landmark method cannot stream\.

![Refer to caption](https://arxiv.org/html/2606.11255v1/x3.png)Figure 3:RAY as a linear\-time, streaming\\tifinaghfontⵟ\-attention primitive \(random queries/keys/values,d=32d=32\)\.\(a\)The linear\-attention output and the induced attention\-weight matrix both match exact\\tifinaghfontⵟ\-attention with a median error that falls with the feature dimensionMM; one fixed map is applied to every token\.\(b\)The one limitation: error scales with attention sharpness: diffuse attention \(large radial scaleε\\varepsilon\) is easy, peaked attention is the hard regime for the radial RFF\.\(c\)The exactN×NN\\times Nattention matrix becomes impossible to store \(137137GB atN=131072N\{=\}131072,×\\boldsymbol\{\\times\}\), while RAY’sO\(NM\)O\(NM\)features stay linear inNNand a singleO\(Mdv\)O\(Md\_\{v\}\)recurrent state \(constant inNN\) decodes causally; that recurrenceSt=St−1\+ϕ\(kt\)vt⊤S\_\{t\}=S\_\{t\-1\}\+\\phi\(k\_\{t\}\)v\_\{t\}^\{\\top\}is exact, which a data\-dependent landmark scheme cannot offer\.
### 4\.7Large\-scale primal training on HIGGS

RAY is a primal feature map, never forming a Gram\. We close by exercising that property where it is forced: HIGGS\(Baldi et al\.,[2014](https://arxiv.org/html/2606.11255#bib.bib5)\),11,000,00011\{,\}000\{,\}000examples withd=28d=28physics features, a scale at which theN×NN\\times NGram \(∼1014\\sim 10^\{14\}entries\) cannot be stored\. Compressed Bernstein–Schur features train as a memory\-flat streaming primal \(peak8\.58\.5GB, no Gram\) at this scale, with RAY dominating exact modulation at matched budget; Gaussian RFF leads on AUC, as expected on a smooth detector task with no alignment×\\timesproximity coupling to exploit\. The point is streaming feasibility where the Gram is impossible, not peak AUC; the full sweep, protocol, and table are in Appendix[D](https://arxiv.org/html/2606.11255#A4)\(Table[18](https://arxiv.org/html/2606.11255#A4.T18)\)\.

## 5Discussion

The biased\\tifinaghfontⵟ\-kernel fits neither random\-feature template \(it is neither shift\-invariant nor a dot\-product kernel \(Proposition[2\.1](https://arxiv.org/html/2606.11255#S2.Thmtheorem1)\)\) yet RAY linearizes it with operator\-norm guarantees, and for the entire Bernstein–Schur*class*rather than a single kernel \(Theorem[2\.5](https://arxiv.org/html/2606.11255#S2.Thmtheorem5)\)\. The Schur factorization is what makes this work: it turns an intractable product into two textbook objects, an exact finite modulation and one completely monotone radial factor, each randomized by its native tool\. The matrix\-Bernstein analysis \(Theorem[3\.2](https://arxiv.org/html/2606.11255#S3.Thmtheorem2)\) then controls the error through the top eigenvalues and an intrinsic dimension, not the crudeNmaxijN\\max\_\{ij\}route, and the deployed doubly\-randomized estimator inherits that guarantee up to a single tunable sketch term \(Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\)\. The tensor\-product closure these pieces rely on is classical\(Aronszajn,[1950](https://arxiv.org/html/2606.11255#bib.bib1)\); the contribution is the factorization that brings it to bear on a kernel outside both templates, and the class\-level concentration analysis of the result\.

The modulation feature is interchangeable: Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)sees it only as PSD and anη\\eta\-spectral approximation ofPP, so any unbiased or low\-rank such feature \(TensorSketch\(Avron et al\.,[2014](https://arxiv.org/html/2606.11255#bib.bib2)\), random\-Maclaurin products\(Kar & Karnick,[2012](https://arxiv.org/html/2606.11255#bib.bib14)\), or anchor squares\) plugs into the same bound\. On the sphere, where the modulation is a dot\-product kernel, anchor squares are exact with few anchors, the route of the spherical\\tifinaghfontⵟ\-attention ofLuna et al\. \([2026](https://arxiv.org/html/2606.11255#bib.bib11)\); off the sphere they are biased \(𝔼a\(a⊤x\)2\(a⊤w\)2=‖x‖2‖w‖2\+2\(x⊤w\)2≠\(x⊤w\+b\)2\\mathbb\{E\}\_\{a\}\(a^\{\\top\}x\)^\{2\}\(a^\{\\top\}w\)^\{2\}=\\\|x\\\|^\{2\}\\\|w\\\|^\{2\}\+2\(x^\{\\top\}w\)^\{2\}\\neq\(x^\{\\top\}w\+b\)^\{2\}\), so we default to an unbiased sketch\. The choice is geometry, not analysis\.

Computationally, via the Kronecker structure

z\(x\)⊤z\(w\)=\(x⊤w\+b\)2εD∑j=1Dψtj\(x\)⊤ψtj\(w\),z\(x\)^\{\\top\}z\(w\)=\\frac\{\(x^\{\\top\}w\+b\)^\{2\}\}\{\\varepsilon D\}\\sum\_\{j=1\}^\{D\}\\psi\_\{t\_\{j\}\}\(x\)^\{\\top\}\\psi\_\{t\_\{j\}\}\(w\),\(12\)a kernel evaluation costsO\(D\(D′\+d\)\)O\(D\(D^\{\\prime\}\+d\)\)without forming theMbM\_\{b\}\-dimensional feature\. In primal form the estimator never materializes theN×NN\\times NGram, payingO\(NDdb\)O\(N\\,D\\,d\_\{b\}\)feature memory; this beats theO\(N2\)O\(N^\{2\}\)Gram wheneverDdb≪ND\\,d\_\{b\}\\ll N\. The polynomial dimensiondb=O\(d2\)d\_\{b\}=O\(d^\{2\}\)is the only term scaling poorly indd\. The deployed estimator sketches it tom≪d2m\\ll d^\{2\}\(Section[4\.2](https://arxiv.org/html/2606.11255#S4.SS2)\), making the feature dimensionDmDmat the price of one tunable sketch term \(Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\)\.

Several caveats bound the method\. The variance constant\(R2\+b\)4\(R^\{2\}\+b\)^\{4\}exceeds theR4R^\{4\}of Gaussian RFF \(the price of the polynomial factor and the bias\) and the realized Gram error grows weakly withddthrough that constant rather than the rate \(Appendix[D](https://arxiv.org/html/2606.11255#A4.SS0.SSS0.Px1)\)\. The estimator needs bounded\-norm inputs, since the unbounded numerator destabilizes the kernel on raw data, though the off\-sphere ball \(Section[4\.1](https://arxiv.org/html/2606.11255#S4.SS1)\) shows exact spherical support is not required; the data\-independent importance proposal is likewise suboptimal when distances spread widely\. TheO\(d2\)O\(d^\{2\}\)floor of the exact\-modulation feature is removed by sketching \(Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\), worthwhile except at very lowddor very high target accuracy, where one takesm→∞m\\to\\infty\. And at moderateNNand fixed representation size, adaptive Nyström can be more accurate, fitting landmarks to the observed sample; RAY does not aim to replace landmark methods, but to provide data\-independent, unbiased, streaming features\.

Several questions remain open\. Can the variance bound be tightened on low\-dimensional manifolds, where the data occupies a thin shell? Is there a kernel\-quadrature analogue with a faster\-than\-Monte\-Carlo rate? Canε\\varepsilonandbbbe learned end\-to\-end while preserving unbiasedness? And can a feature\-covariance \(Rudi–Rosasco\-style\) analysis give a sharp excess\-risk bound for RAY\-KRR, beyond the loose operator\-norm route of Corollary[A\.5](https://arxiv.org/html/2606.11255#A1.Thmtheorem5)?

## 6Conclusion

We gave a random\-feature scheme for*Bernstein–Schur kernels*\(finite\-feature modulations of completely monotone radial kernels\) and its flagship, the biased\\tifinaghfontⵟ\-kernelk\\tifinaghfontⵟ,b\(w,x\)=\(w⊤x\+b\)2/\(‖w−x‖2\+ε\)k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(w,x\)=\(w^\{\\top\}x\+b\)^\{2\}/\(\\\|w\-x\\\|^\{2\}\+\\varepsilon\), a kernel that is neither shift\-invariant nor a dot\-product kernel and so fits neither standard random\-feature template\. The scheme reads the kernel as a Schur product and randomizes*both*factors: it samples the radial Bernstein–Widder scalet∼Exp⁡\(ε\)t\\sim\\operatorname\{Exp\}\(\\varepsilon\)and applies random Fourier features to the resulting Gaussian, and it sketches the modulation, so the deployed feature has dimensionDmDm, free of theO\(d2\)O\(d^\{2\}\)size of the exact polynomial feature\. Keeping the modulation exact is the analyzable limit \(m→∞m\\to\\infty\): there the estimator is unbiased withO\(\(R2\+b\)4/\(Dε2\)\)O\(\(R^\{2\}\+b\)^\{4\}/\(D\\varepsilon^\{2\}\)\)variance and a radial sample count free of explicit dimension for a fixed dataset, and a matrix\-Bernstein operator\-norm bound holds; conditioning on the sketch carries that guarantee to the doubly\-randomized estimator up to a single tunable sketch term \(Theorem[3\.4](https://arxiv.org/html/2606.11255#S3.Thmtheorem4)\)\. Experiments confirm theO\(1/D\)O\(1/\\sqrt\{D\}\)rate, the\(R2\+b\)4\(R^\{2\}\+b\)^\{4\}bias law, and the matched\-landmark off\-sphere behavior, wherek\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is genuinely non\-dot\-product and no dot\-product reduction is available\. Cost\-matched, adaptive data\-dependent Nyström can be more accurate at moderateNN, while RAY provides data\-independent, unbiased, streaming primal features for a nonstationary product kernel\. The same doubly\-randomized construction applies unchanged to the whole Bernstein–Schur class; the modulation randomizer \(sketch, random\-Maclaurin, or anchor features\) is a geometry\-dependent choice, with anchor features the natural primitive on the sphere and an unbiased sketch off it\.

#### Reproducibility\.

The code for every experiment, with the paper’s figures and a project page, is available online\.111[https://www\.tahabouhsine\.com/ray](https://www.tahabouhsine.com/ray)The per\-table datasets, splits, seed counts, and swept grids are collected in Appendix[E](https://arxiv.org/html/2606.11255#A5), and each result table states its own seed count and error bars in its caption\.

## References

- Aronszajn \(1950\)Aronszajn, N\. \(1950\)\.Theory of reproducing kernels\.Transactions of the American Mathematical Society, 68\(3\):337–404\.
- Avron et al\. \(2014\)Avron, H\., Nguyen, H\., and Woodruff, D\. \(2014\)\.Subspace embeddings for the polynomial kernel\.InNeurIPS\.
- Avron et al\. \(2016\)Avron, H\., Kapralov, M\., Musco, C\., Musco, C\., Velingker, A\., and Zandieh, A\. \(2016\)\.Quasi\-Monte Carlo feature maps for shift\-invariant kernels\.Journal of Machine Learning Research, 17\(120\):1–38\.
- Avron et al\. \(2017\)Avron, H\., Kapralov, M\., Musco, C\., Musco, C\., Velingker, A\., and Zandieh, A\. \(2017\)\.Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees\.InICML\.
- Baldi et al\. \(2014\)Baldi, P\., Sadowski, P\., and Whiteson, D\. \(2014\)\.Searching for exotic particles in high\-energy physics with deep learning\.Nature Communications, 5:4308\.
- Bouhsine \(2026\)Bouhsine, T\. \(2026\)\.Action at a Distance: A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance\.arXiv preprint\.
- Choromanski et al\. \(2021\)Choromanski, K\., Likhosherstov, V\., Dohan, D\., Song, X\., Gane, A\., Sarlós, T\., Hawkins, P\., Davis, J\., Mohiuddin, A\., Kaiser, L\., Belanger, D\., Colwell, L\., and Weller, A\. \(2021\)\.Rethinking attention with Performers\.InICLR\.
- Cortes & Vapnik \(1995\)Cortes, C\. and Vapnik, V\. \(1995\)\.Support\-vector networks\.Machine Learning, 20\(3\):273–297\.
- Daniely et al\. \(2017\)Daniely, A\., Frostig, R\., Gupta, V\., and Singer, Y\. \(2017\)\.Random features for compositional kernels\.arXiv:1703\.07872\.
- Han et al\. \(2022\)Han, I\., Zandieh, A\., and Avron, H\. \(2022\)\.Random Gegenbauer features for scalable kernel methods\.InICML, Proceedings of Machine Learning Research, volume 162, pages 8330–8358\.
- Luna et al\. \(2026\)Luna, J\. M\., Bouhsine, T\., and Choromanski, K\. \(2026\)\.SLAY: Geometry\-aware spherical linearized attention with Yat\-kernel\.arXiv:2602\.04915\.
- Katharopoulos et al\. \(2020\)Katharopoulos, A\., Vyas, A\., Pappas, N\., and Fleuret, F\. \(2020\)\.Transformers are RNNs: Fast autoregressive transformers with linear attention\.InICML\.
- Li et al\. \(2019\)Li, C\.\-L\., Chang, W\.\-C\., Mroueh, Y\., Yang, Y\., and Póczos, B\. \(2019\)\.Implicit kernel learning\.InAISTATS\.
- Kar & Karnick \(2012\)Kar, P\. and Karnick, H\. \(2012\)\.Random feature maps for dot product kernels\.InAISTATS\.
- Kuo & Nuyens \(2016\)Kuo, F\. Y\. and Nuyens, D\. \(2016\)\.Application of quasi\-Monte Carlo methods to elliptic PDEs with random diffusion coefficients\.Acta Numerica, 25:225–356\.
- Langrené et al\. \(2024\)Langrené, N\., Warin, X\., and Gruet, P\. \(2024\)\.A spectral mixture representation of isotropic kernels with application to random Fourier features\.arXiv:2411\.02770\.
- Le et al\. \(2013\)Le, Q\., Sarlós, T\., and Smola, A\. \(2013\)\.Fastfood: approximating kernel expansions in loglinear time\.InICML\.
- Musco & Musco \(2017\)Musco, C\. and Musco, C\. \(2017\)\.Recursive sampling for the Nyström method\.InNeurIPS\.
- Pham & Pagh \(2013\)Pham, N\. and Pagh, R\. \(2013\)\.Fast and scalable polynomial kernels via explicit feature maps\.InKDD\.
- Radford et al\. \(2021\)Radford, A\., Kim, J\. W\., Hallacy, C\., Ramesh, A\., Goh, G\., Agarwal, S\., Sastry, G\., Askell, A\., Mishkin, P\., Clark, J\., Krueger, G\., and Sutskever, I\. \(2021\)\.Learning transferable visual models from natural language supervision\.InICML\.
- Remes et al\. \(2017\)Remes, S\., Heinonen, M\., and Kaski, S\. \(2017\)\.Non\-stationary spectral kernels\.InNeurIPS\.
- Rasmussen & Williams \(2006\)Rasmussen, C\. E\. and Williams, C\. K\. I\. \(2006\)\.Gaussian Processes for Machine Learning\.MIT Press\.
- Rahimi & Recht \(2007\)Rahimi, A\. and Recht, B\. \(2007\)\.Random features for large\-scale kernel machines\.InNeurIPS, pages 1177–1184\.
- Schur \(1911\)Schur, J\. \(1911\)\.Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen\.Journal für die reine und angewandte Mathematik, 140:1–28\.
- Schoenberg \(1938\)Schoenberg, I\. J\. \(1938\)\.Metric spaces and completely monotone functions\.Annals of Mathematics, 39\(4\):811–841\.
- Shen et al\. \(2019\)Shen, Z\., Heinonen, M\., and Kaski, S\. \(2019\)\.Harmonizable mixture kernels with variational Fourier features\.InAISTATS, Proceedings of Machine Learning Research, volume 89, pages 3273–3282\.
- Ton et al\. \(2018\)Ton, J\.\-F\., Flaxman, S\., Sejdinovic, D\., and Bhatt, S\. \(2018\)\.Spatial mapping with Gaussian processes and nonstationary Fourier features\.Spatial Statistics, 28:59–78\.
- Tsai et al\. \(2019\)Tsai, Y\.\-H\. H\., Bai, S\., Yamada, M\., Morency, L\.\-P\., and Salakhutdinov, R\. \(2019\)\.Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel\.InEMNLP\.
- Tropp \(2012\)Tropp, J\. A\. \(2012\)\.User\-friendly tail bounds for sums of random matrices\.Foundations of Computational Mathematics, 12\(4\):389–434\.
- Tropp \(2015\)Tropp, J\. A\. \(2015\)\.An introduction to matrix concentration inequalities\.Foundations and Trends in Machine Learning, 8\(1–2\):1–230\.
- Widder \(1941\)Widder, D\. V\. \(1941\)\.The Laplace Transform\.Princeton University Press\.
- Williams & Seeger \(2000\)Williams, C\. K\. I\. and Seeger, M\. \(2000\)\.Using the Nyström method to speed up kernel machines\.InNeurIPS, pages 682–688\.
- Wilson & Adams \(2013\)Wilson, A\. G\. and Adams, R\. P\. \(2013\)\.Gaussian process kernels for pattern discovery and extrapolation\.InICML\.
- Yu et al\. \(2016\)Yu, F\. X\., Suresh, A\. T\., Choromanski, K\. M\., Holtmann\-Rice, D\. N\., and Kumar, S\. \(2016\)\.Orthogonal random features\.InNeurIPS\.

## Appendix AProofs of the Approximation Guarantees

Both results rest on a single per\-scale termY=\(ψt\(x\)⊤ψt\(w\)\)\(p\(x\)⊤p\(w\)\)Y=\(\\psi\_\{t\}\(x\)^\{\\top\}\\psi\_\{t\}\(w\)\)\(p\(x\)^\{\\top\}p\(w\)\), whose polynomial factor is exact,p\(x\)⊤p\(w\)=\(x⊤w\+b\)2/εp\(x\)^\{\\top\}p\(w\)=\(x^\{\\top\}w\+b\)^\{2\}/\\varepsilon, and whose Gaussian factor is an unbiasedD′D^\{\\prime\}\-sample estimate ofgt\(x,w\)g\_\{t\}\(x,w\)\. Two facts about it are used repeatedly: the a\.s\. bound\|ψt\(x\)⊤ψt\(w\)\|≤‖ψt\(x\)‖‖ψt\(w\)‖≤2\|\\psi\_\{t\}\(x\)^\{\\top\}\\psi\_\{t\}\(w\)\|\\leq\\\|\\psi\_\{t\}\(x\)\\\|\\\|\\psi\_\{t\}\(w\)\\\|\\leq 2\(each‖ψt‖2=2D′∑ℓcos2≤2\\\|\\psi\_\{t\}\\\|^\{2\}=\\tfrac\{2\}\{D^\{\\prime\}\}\\sum\_\{\\ell\}\\cos^\{2\}\\leq 2\), so\|Y\|≤2\(R2\+b\)2/ε\|Y\|\\leq 2\(R^\{2\}\+b\)^\{2\}/\\varepsilon; and the second moment𝔼\[\(ψt\(x\)⊤ψt\(w\)\)2\]≤𝔼T\[e−2T‖x−w‖2\]\+O\(1/D′\)\\mathbb\{E\}\[\(\\psi\_\{t\}\(x\)^\{\\top\}\\psi\_\{t\}\(w\)\)^\{2\}\]\\leq\\mathbb\{E\}\_\{T\}\[e^\{\-2T\\\|x\-w\\\|^\{2\}\}\]\+O\(1/D^\{\\prime\}\)\. Corollary[A\.5](https://arxiv.org/html/2606.11255#A1.Thmtheorem5)needs no separate proof: it is‖A‖op≤‖A‖F≤Nmaxij⁡\|Aij\|\\\|A\\\|\_\{\\mathrm\{op\}\}\\leq\\\|A\\\|\_\{F\}\\leq N\\max\_\{ij\}\|A\_\{ij\}\|applied to Theorem[A\.3](https://arxiv.org/html/2606.11255#A1.Thmtheorem3)\.

###### Theorem A\.1\(Variance envelope\)\.

LetVD=Var⁡\[z\(x\)⊤z\(w\)\]V\_\{D\}=\\operatorname\{Var\}\[z\(x\)^\{\\top\}z\(w\)\]\. Accounting for both the inner \(D′D^\{\\prime\}\) and outer \(DD\) sampling,

VD≤\(‖x‖2\+b\)2\(‖w‖2\+b\)2Dε2\(ε2‖x−w‖2\+ε\+32D′\)≤\(R2\+b\)4Dε2\(1\+32D′\)\.V\_\{D\}\\leq\\frac\{\(\\\|x\\\|^\{2\}\+b\)^\{2\}\(\\\|w\\\|^\{2\}\+b\)^\{2\}\}\{D\\,\\varepsilon^\{2\}\}\\\!\\left\(\\frac\{\\varepsilon\}\{2\\\|x\-w\\\|^\{2\}\+\\varepsilon\}\+\\frac\{3\}\{2D^\{\\prime\}\}\\right\)\\leq\\frac\{\(R^\{2\}\+b\)^\{4\}\}\{D\\,\\varepsilon^\{2\}\}\\\!\\left\(1\+\\frac\{3\}\{2D^\{\\prime\}\}\\right\)\.\(13\)The3/\(2D′\)3/\(2D^\{\\prime\}\)term is the inner random\-Fourier\-feature variance; it does not vanish at the recommendedD′=1D^\{\\prime\}=1\(where the constant is5/25/2\), only asD′→∞D^\{\\prime\}\\to\\infty\.

###### Proof of Theorem[A\.1](https://arxiv.org/html/2606.11255#A1.Thmtheorem1)\.

Writez\(x\)⊤z\(w\)=1D∑jYjz\(x\)^\{\\top\}z\(w\)=\\frac\{1\}\{D\}\\sum\_\{j\}Y\_\{j\}; theYjY\_\{j\}are i\.i\.d\., soVD=1DVar⁡\[Y1\]≤1D𝔼\[Y12\]=1D\(p\(x\)⊤p\(w\)\)2𝔼\[\(ψT\(x\)⊤ψT\(w\)\)2\]V\_\{D\}=\\frac\{1\}\{D\}\\operatorname\{Var\}\[Y\_\{1\}\]\\leq\\frac\{1\}\{D\}\\mathbb\{E\}\[Y\_\{1\}^\{2\}\]=\\frac\{1\}\{D\}\(p\(x\)^\{\\top\}p\(w\)\)^\{2\}\\,\\mathbb\{E\}\[\(\\psi\_\{T\}\(x\)^\{\\top\}\\psi\_\{T\}\(w\)\)^\{2\}\], the polynomial factor being deterministic\. It obeys\(p\(x\)⊤p\(w\)\)2=\(x⊤w\+b\)4/ε2≤\(‖x‖2\+b\)2\(‖w‖2\+b\)2/ε2\(p\(x\)^\{\\top\}p\(w\)\)^\{2\}=\(x^\{\\top\}w\+b\)^\{4\}/\\varepsilon^\{2\}\\leq\(\\\|x\\\|^\{2\}\+b\)^\{2\}\(\\\|w\\\|^\{2\}\+b\)^\{2\}/\\varepsilon^\{2\}by Cauchy–Schwarz andb≥0b\\geq 0\. For the Gaussian factor, the law of total variance overω∣T\\omega\\mid TthenTTgives𝔼\[\(ψT\(x\)⊤ψT\(w\)\)2\]=𝔼T\[gT\(x,w\)2\]\+𝔼T\[vT\]/D′\\mathbb\{E\}\[\(\\psi\_\{T\}\(x\)^\{\\top\}\\psi\_\{T\}\(w\)\)^\{2\}\]=\\mathbb\{E\}\_\{T\}\[g\_\{T\}\(x,w\)^\{2\}\]\+\\mathbb\{E\}\_\{T\}\[v\_\{T\}\]/D^\{\\prime\}, wherevt=Varω⁡\[2cos⁡\(ω⊤x\+β\)cos⁡\(ω⊤w\+β\)\]v\_\{t\}=\\operatorname\{Var\}\_\{\\omega\}\[2\\cos\(\\omega^\{\\top\}x\+\\beta\)\\cos\(\\omega^\{\\top\}w\+\\beta\)\]\. The outer term is𝔼T∼Exp⁡\(ε\)\[e−2T‖x−w‖2\]=ε/\(2‖x−w‖2\+ε\)\\mathbb\{E\}\_\{T\\sim\\operatorname\{Exp\}\(\\varepsilon\)\}\[e^\{\-2T\\\|x\-w\\\|^\{2\}\}\]=\\varepsilon/\(2\\\|x\-w\\\|^\{2\}\+\\varepsilon\)\. For the inner term,𝔼ω,β\[\(2cos⁡\(ω⊤x\+β\)cos⁡\(ω⊤w\+β\)\)2\]=1\+12e−4t‖x−w‖2≤32\\mathbb\{E\}\_\{\\omega,\\beta\}\[\(2\\cos\(\\omega^\{\\top\}x\+\\beta\)\\cos\(\\omega^\{\\top\}w\+\\beta\)\)^\{2\}\]=1\+\\tfrac\{1\}\{2\}e^\{\-4t\\\|x\-w\\\|^\{2\}\}\\leq\\tfrac\{3\}\{2\}, sovt≤32v\_\{t\}\\leq\\tfrac\{3\}\{2\}and𝔼T\[vT\]≤32\\mathbb\{E\}\_\{T\}\[v\_\{T\}\]\\leq\\tfrac\{3\}\{2\}\. Substituting and dividing byDDgives \([13](https://arxiv.org/html/2606.11255#A1.E13)\); the second form usesε/\(2‖x−w‖2\+ε\)≤1\\varepsilon/\(2\\\|x\-w\\\|^\{2\}\+\\varepsilon\)\\leq 1\. \(The1/ε21/\\varepsilon^\{2\}comes from the polynomial normalizationp=ε−1/2pbp=\\varepsilon^\{\-1/2\}p\_\{b\}\.\) ∎

###### Proposition A\.2\(ExplicitD,D′D,D^\{\\prime\}variance and optimal allocation\)\.

Letvt\(x,w\)=Varω⁡\[2cos⁡\(ω⊤x\+β\)cos⁡\(ω⊤w\+β\)\]v\_\{t\}\(x,w\)=\\operatorname\{Var\}\_\{\\omega\}\[\\,2\\cos\(\\omega^\{\\top\}x\+\\beta\)\\cos\(\\omega^\{\\top\}w\+\\beta\)\\,\]be the variance of one trigonometric feature pair at scalett, and setVout=VarT∼Exp⁡\(ε\)⁡\[gT\(x,w\)\]V\_\{\\mathrm\{out\}\}=\\operatorname\{Var\}\_\{T\\sim\\operatorname\{Exp\}\(\\varepsilon\)\}\[g\_\{T\}\(x,w\)\],Vin=𝔼T∼Exp⁡\(ε\)\[vT\(x,w\)\]V\_\{\\mathrm\{in\}\}=\\mathbb\{E\}\_\{T\\sim\\operatorname\{Exp\}\(\\varepsilon\)\}\[v\_\{T\}\(x,w\)\]\. Then

Var⁡\[z\(x\)⊤z\(w\)\]=\(x⊤w\+b\)4ε2\(VoutD\+VinDD′\)\.\\operatorname\{Var\}\[z\(x\)^\{\\top\}z\(w\)\]=\\frac\{\(x^\{\\top\}w\+b\)^\{4\}\}\{\\varepsilon^\{2\}\}\\Bigl\(\\frac\{V\_\{\\mathrm\{out\}\}\}\{D\}\+\\frac\{V\_\{\\mathrm\{in\}\}\}\{D\\,D^\{\\prime\}\}\\Bigr\)\.\(14\)At a fixed feature budgetB=DD′B=DD^\{\\prime\}the right side is minimized byD′=1D^\{\\prime\}=1\(henceD=BD=B\) wheneverVout\>0V\_\{\\mathrm\{out\}\}\>0: the inner termVin/BV\_\{\\mathrm\{in\}\}/Bis fixed by the budget, while the outer termVout/DV\_\{\\mathrm\{out\}\}/Ddecreases as1/D1/D\. IfVout=0V\_\{\\mathrm\{out\}\}=0\(for example atx=wx=w\), every allocation with the sameDD′DD^\{\\prime\}ties\.

###### Proof of Proposition[A\.2](https://arxiv.org/html/2606.11255#A1.Thmtheorem2)\.

The polynomial factor is deterministic, soVar⁡\[Y1\]=\(p\(x\)⊤p\(w\)\)2Var⁡\[ψT\(x\)⊤ψT\(w\)\]\\operatorname\{Var\}\[Y\_\{1\}\]=\(p\(x\)^\{\\top\}p\(w\)\)^\{2\}\\operatorname\{Var\}\[\\psi\_\{T\}\(x\)^\{\\top\}\\psi\_\{T\}\(w\)\]with\(p\(x\)⊤p\(w\)\)2=\(x⊤w\+b\)4/ε2\(p\(x\)^\{\\top\}p\(w\)\)^\{2\}=\(x^\{\\top\}w\+b\)^\{4\}/\\varepsilon^\{2\}\. By the law of total variance, conditioning on the scaleTTand then on theD′D^\{\\prime\}inner frequencies,

Var⁡\[ψT\(x\)⊤ψT\(w\)\]=𝔼T\[Varω⁡\(ψT\(x\)⊤ψT\(w\)∣T\)\]\+VarT⁡\(𝔼ω\[ψT\(x\)⊤ψT\(w\)∣T\]\)\.\\operatorname\{Var\}\[\\psi\_\{T\}\(x\)^\{\\top\}\\psi\_\{T\}\(w\)\]=\\mathbb\{E\}\_\{T\}\\bigl\[\\operatorname\{Var\}\_\{\\omega\}\(\\psi\_\{T\}\(x\)^\{\\top\}\\psi\_\{T\}\(w\)\\mid T\)\\bigr\]\+\\operatorname\{Var\}\_\{T\}\\bigl\(\\mathbb\{E\}\_\{\\omega\}\[\\psi\_\{T\}\(x\)^\{\\top\}\\psi\_\{T\}\(w\)\\mid T\]\\bigr\)\.The conditional mean isgT\(x,w\)g\_\{T\}\(x,w\), giving the outer termVarT⁡\(gT\)=Vout\\operatorname\{Var\}\_\{T\}\(g\_\{T\}\)=V\_\{\\mathrm\{out\}\}\. The conditional object is an average ofD′D^\{\\prime\}i\.i\.d\. feature pairs, so its variance isvT/D′v\_\{T\}/D^\{\\prime\}, giving the inner term𝔼T\[vT\]/D′=Vin/D′\\mathbb\{E\}\_\{T\}\[v\_\{T\}\]/D^\{\\prime\}=V\_\{\\mathrm\{in\}\}/D^\{\\prime\}\. Dividing byDD\(theDDblocks are i\.i\.d\.\) yields \([14](https://arxiv.org/html/2606.11255#A1.E14)\)\. WithDD′=BDD^\{\\prime\}=Bfixed,Vin/\(DD′\)=Vin/BV\_\{\\mathrm\{in\}\}/\(DD^\{\\prime\}\)=V\_\{\\mathrm\{in\}\}/Bis constant andVout/DV\_\{\\mathrm\{out\}\}/Dis decreasing inDD, soD′=1D^\{\\prime\}=1is optimal\. ∎

###### Theorem A\.3\(Uniform Gram approximation\)\.

ForX=\{x1,…,xN\}X=\\\{x\_\{1\},\\dots,x\_\{N\}\\\}with‖xi‖≤R\\\|x\_\{i\}\\\|\\leq R, the exact\-modulation estimator withDDouter samples satisfies, with probability at least1−δ1\-\\delta,

supi,j∈\[N\]\|z\(xi\)⊤z\(xj\)−k\\tifinaghfontⵟ,b\(xi,xj\)\|≤\(R2\+b\)2ε8log⁡\(2N2/δ\)D\.\\sup\_\{i,j\\in\[N\]\}\\bigl\|z\(x\_\{i\}\)^\{\\top\}z\(x\_\{j\}\)\-k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x\_\{i\},x\_\{j\}\)\\bigr\|\\leq\\frac\{\(R^\{2\}\+b\)^\{2\}\}\{\\varepsilon\}\\sqrt\{\\frac\{8\\log\(2N^\{2\}/\\delta\)\}\{D\}\}\.\(15\)The constant uses the a\.s\. bound\|ψt\(x\)⊤ψt\(w\)\|≤2\|\\psi\_\{t\}\(x\)^\{\\top\}\\psi\_\{t\}\(w\)\|\\leq 2, not the expectation\-level value11\.

###### Corollary A\.4\(Sample complexity\)\.

D=O\(\(R2\+b\)4ε−2τ−2log⁡\(N/η\)\)D=O\\\!\\bigl\(\(R^\{2\}\+b\)^\{4\}\\varepsilon^\{\-2\}\\tau^\{\-2\}\\log\(N/\\eta\)\\bigr\)outer samples suffice for entrywise error at mostτ\\tauwith probability1−η1\-\\eta, independent of dimensiondd\.

###### Corollary A\.5\(Operator\-norm control\)\.

Since‖A‖op≤‖A‖F≤Nmaxij⁡\|Aij\|\\\|A\\\|\_\{\\mathrm\{op\}\}\\leq\\\|A\\\|\_\{F\}\\leq N\\max\_\{ij\}\|A\_\{ij\}\|, Theorem[A\.3](https://arxiv.org/html/2606.11255#A1.Thmtheorem3)gives‖KD−K‖op≤N\(R2\+b\)2ε−18log⁡\(2N2/δ\)/D\\\|K\_\{D\}\-K\\\|\_\{\\mathrm\{op\}\}\\leq N\(R^\{2\}\+b\)^\{2\}\\varepsilon^\{\-1\}\\sqrt\{8\\log\(2N^\{2\}/\\delta\)/D\}with probability1−δ1\-\\delta\. The factorNNis the price of this elementary route; Theorem[3\.2](https://arxiv.org/html/2606.11255#S3.Thmtheorem2)removes it via matrix concentration, replacingNmaxij\|⋅\|N\\max\_\{ij\}\|\\cdot\|with the top eigenvalues of the kernel and modulation Gram matrices and an intrinsic dimension \(which can still scale withNNin the worst case\)\.

###### Proof of Theorem[A\.3](https://arxiv.org/html/2606.11255#A1.Thmtheorem3)\.

Fix\(xi,xj\)\(x\_\{i\},x\_\{j\}\); the estimator averagesDDi\.i\.d\. termsYk∈\[−c,c\]Y\_\{k\}\\in\[\-c,c\]withc=2\(R2\+b\)2/εc=2\(R^\{2\}\+b\)^\{2\}/\\varepsilonby the a\.s\. bound above\. Hoeffding’s inequality givesℙ\(\|z\(xi\)⊤z\(xj\)−k\\tifinaghfontⵟ,b\|\>s\)≤2exp⁡\(−Ds2/\(2c2\)\)=2exp⁡\(−Ds2ε2/\(8\(R2\+b\)4\)\)\\mathbb\{P\}\(\|z\(x\_\{i\}\)^\{\\top\}z\(x\_\{j\}\)\-k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\|\>s\)\\leq 2\\exp\(\-Ds^\{2\}/\(2c^\{2\}\)\)=2\\exp\\\!\\bigl\(\-Ds^\{2\}\\varepsilon^\{2\}/\(8\(R^\{2\}\+b\)^\{4\}\)\\bigr\)\. A union bound over the≤N2\\leq N^\{2\}pairs and solving2N2exp⁡\(⋅\)≤δ2N^\{2\}\\exp\(\\cdot\)\\leq\\deltaforssyields \([15](https://arxiv.org/html/2606.11255#A1.E15)\)\. ∎

###### Lemma A\.6\(Schur multiplier bound\)\.

LetA⪰0A\\succeq 0withmaxi⁡Aii≤a\\max\_\{i\}A\_\{ii\}\\leq a\.\(a\)IfP⪰0P\\succeq 0, then0⪯A∘P⪯a‖P‖opI0\\preceq A\\circ P\\preceq a\\\|P\\\|\_\{\\mathrm\{op\}\}I\.\(b\)IfE=E⊤E=E^\{\\top\}is symmetric \(not necessarily PSD\), then‖A∘E‖op≤a‖E‖op\\\|A\\circ E\\\|\_\{\\mathrm\{op\}\}\\leq a\\\|E\\\|\_\{\\mathrm\{op\}\}\.

###### Proof\.

WriteA=UU⊤=∑rurur⊤A=UU^\{\\top\}=\\sum\_\{r\}u\_\{r\}u\_\{r\}^\{\\top\}withuru\_\{r\}the columns ofUU; thenA∘E=∑r\(urur⊤\)∘E=∑rDrEDrA\\circ E=\\sum\_\{r\}\(u\_\{r\}u\_\{r\}^\{\\top\}\)\\circ E=\\sum\_\{r\}D\_\{r\}ED\_\{r\}withDr=diag⁡\(ur\)D\_\{r\}=\\operatorname\{diag\}\(u\_\{r\}\), and the diagonal matrix∑rDr2\\sum\_\{r\}D\_\{r\}^\{2\}has entries∑r\(ur\)i2=\(UU⊤\)ii=Aii≤a\\sum\_\{r\}\(u\_\{r\}\)\_\{i\}^\{2\}=\(UU^\{\\top\}\)\_\{ii\}=A\_\{ii\}\\leq a, so∑rDr2⪯aI\\sum\_\{r\}D\_\{r\}^\{2\}\\preceq aI\. For any unit vectorxxand symmetricEE,

\|x⊤\(A∘E\)x\|=\|∑r\(Drx\)⊤E\(Drx\)\|≤‖E‖op∑r‖Drx‖2=‖E‖opx⊤\(∑rDr2\)x≤a‖E‖op,\\bigl\|x^\{\\top\}\(A\\circ E\)x\\bigr\|=\\Bigl\|\\sum\_\{r\}\(D\_\{r\}x\)^\{\\top\}E\(D\_\{r\}x\)\\Bigr\|\\leq\\\|E\\\|\_\{\\mathrm\{op\}\}\\sum\_\{r\}\\\|D\_\{r\}x\\\|^\{2\}=\\\|E\\\|\_\{\\mathrm\{op\}\}\\,x^\{\\top\}\\Bigl\(\\textstyle\\sum\_\{r\}D\_\{r\}^\{2\}\\Bigr\)x\\leq a\\\|E\\\|\_\{\\mathrm\{op\}\},which is \(b\)\. TakingE=P⪰0E=P\\succeq 0removes the absolute value \(each summand is nonnegative\), giving0≤x⊤\(A∘P\)x≤a‖P‖op0\\leq x^\{\\top\}\(A\\circ P\)x\\leq a\\\|P\\\|\_\{\\mathrm\{op\}\}andA∘P⪰0A\\circ P\\succeq 0by the Schur product theorem, which is \(a\)\. ∎

###### Proof of Theorem[3\.2](https://arxiv.org/html/2606.11255#S3.Thmtheorem2)\.

WriteKD−K=∑j=1DYjK\_\{D\}\-K=\\sum\_\{j=1\}^\{D\}Y\_\{j\}withYj=D−1\(K\(j\)−K\)Y\_\{j\}=D^\{\-1\}\(K^\{\(j\)\}\-K\)andK\(j\)=\(ΨjΨj⊤\)∘PK^\{\(j\)\}=\(\\Psi\_\{j\}\\Psi\_\{j\}^\{\\top\}\)\\circ Pthe Gram of thejj\-th radial draw; theYjY\_\{j\}are i\.i\.d\. symmetric with𝔼\[Yj\]=0\\mathbb\{E\}\[Y\_\{j\}\]=0\(Theorem[2\.4](https://arxiv.org/html/2606.11255#S2.Thmtheorem4)\)\. Two structural facts:K\(j\)⪰0K^\{\(j\)\}\\succeq 0, being a Schur product of the PSD matricesΨjΨj⊤\\Psi\_\{j\}\\Psi\_\{j\}^\{\\top\}andPP; and‖K\(j\)‖op≤2‖P‖op\\\|K^\{\(j\)\}\\\|\_\{\\mathrm\{op\}\}\\leq 2\\\|P\\\|\_\{\\mathrm\{op\}\}by Lemma[A\.6](https://arxiv.org/html/2606.11255#A1.Thmtheorem6)with\(ΨjΨj⊤\)ii=‖ψtj\(xi\)‖2≤2\(\\Psi\_\{j\}\\Psi\_\{j\}^\{\\top\}\)\_\{ii\}=\\\|\\psi\_\{t\_\{j\}\}\(x\_\{i\}\)\\\|^\{2\}\\leq 2\. Likewise‖K‖op=‖R∘P‖op≤‖P‖op\\\|K\\\|\_\{\\mathrm\{op\}\}=\\\|R\\circ P\\\|\_\{\\mathrm\{op\}\}\\leq\\\|P\\\|\_\{\\mathrm\{op\}\}by Lemma[A\.6](https://arxiv.org/html/2606.11255#A1.Thmtheorem6), withRRthe radial Gram \(PSD, unit diagonal\)\.

*A\.s\. bound\.*∥Yj∥op≤D−1\(∥K\(j\)∥op\+∥K∥op\)≤3∥P∥op/D=:L\\\|Y\_\{j\}\\\|\_\{\\mathrm\{op\}\}\\leq D^\{\-1\}\(\\\|K^\{\(j\)\}\\\|\_\{\\mathrm\{op\}\}\+\\\|K\\\|\_\{\\mathrm\{op\}\}\)\\leq 3\\\|P\\\|\_\{\\mathrm\{op\}\}/D=:L\.

*Matrix variance\.*For PSDMM,M2⪯‖M‖opMM^\{2\}\\preceq\\\|M\\\|\_\{\\mathrm\{op\}\}M; with‖K\(j\)‖op≤2‖P‖op\\\|K^\{\(j\)\}\\\|\_\{\\mathrm\{op\}\}\\leq 2\\\|P\\\|\_\{\\mathrm\{op\}\}this gives\(K\(j\)\)2⪯2‖P‖opK\(j\)\(K^\{\(j\)\}\)^\{2\}\\preceq 2\\\|P\\\|\_\{\\mathrm\{op\}\}K^\{\(j\)\}, so𝔼\[\(K\(j\)\)2\]⪯2‖P‖opK\\mathbb\{E\}\[\(K^\{\(j\)\}\)^\{2\}\]\\preceq 2\\\|P\\\|\_\{\\mathrm\{op\}\}Kand𝔼\[\(K\(j\)−K\)2\]=𝔼\[\(K\(j\)\)2\]−K2⪯2‖P‖opK\\mathbb\{E\}\[\(K^\{\(j\)\}\-K\)^\{2\}\]=\\mathbb\{E\}\[\(K^\{\(j\)\}\)^\{2\}\]\-K^\{2\}\\preceq 2\\\|P\\\|\_\{\\mathrm\{op\}\}K\. Hence

V=∑j=1D𝔼\[Yj2\]=1D𝔼\[\(K\(1\)−K\)2\]⪯2‖P‖opDK,∥V∥op≤2‖P‖op‖K‖opD=:v\.V=\\sum\_\{j=1\}^\{D\}\\mathbb\{E\}\[Y\_\{j\}^\{2\}\]=\\tfrac\{1\}\{D\}\\,\\mathbb\{E\}\[\(K^\{\(1\)\}\-K\)^\{2\}\]\\preceq\\tfrac\{2\\\|P\\\|\_\{\\mathrm\{op\}\}\}\{D\}\\,K,\\qquad\\\|V\\\|\_\{\\mathrm\{op\}\}\\leq\\tfrac\{2\\\|P\\\|\_\{\\mathrm\{op\}\}\\\|K\\\|\_\{\\mathrm\{op\}\}\}\{D\}=:v\.
*Conclusion \(expectation\)\.*The intrinsic\-dimension matrix Bernstein inequality\(Tropp,[2015](https://arxiv.org/html/2606.11255#bib.bib30)\)gives𝔼‖∑jYj‖op≤2vlog⁡\(2dint\)\+13Llog⁡\(2dint\)\\mathbb\{E\}\\\|\\sum\_\{j\}Y\_\{j\}\\\|\_\{\\mathrm\{op\}\}\\leq\\sqrt\{2v\\log\(2d\_\{\\mathrm\{int\}\}\)\}\+\\tfrac\{1\}\{3\}L\\log\(2d\_\{\\mathrm\{int\}\}\)withdint=tr⁡\(V\)/‖V‖opd\_\{\\mathrm\{int\}\}=\\operatorname\{tr\}\(V\)/\\\|V\\\|\_\{\\mathrm\{op\}\}\. Substitutingv=2‖P‖op‖K‖op/Dv=2\\\|P\\\|\_\{\\mathrm\{op\}\}\\\|K\\\|\_\{\\mathrm\{op\}\}/DandL=3‖P‖op/DL=3\\\|P\\\|\_\{\\mathrm\{op\}\}/Dyields the stated bound;dint≤rank⁡\(V\)≤Nd\_\{\\mathrm\{int\}\}\\leq\\operatorname\{rank\}\(V\)\\leq Nand‖P‖op≤tr⁡\(P\)≤N\(R2\+b\)2/ε\\\|P\\\|\_\{\\mathrm\{op\}\}\\leq\\operatorname\{tr\}\(P\)\\leq N\(R^\{2\}\+b\)^\{2\}/\\varepsilongive the worst\-case comparison to Corollary[A\.5](https://arxiv.org/html/2606.11255#A1.Thmtheorem5)\.

*Conclusion \(tail\)\.*With the*same*vvandLL, the tail form of the intrinsic matrix Bernstein inequality\(Tropp,[2015](https://arxiv.org/html/2606.11255#bib.bib30), Thm\. 7\.3\.1\)statesℙ\{‖∑jYj‖op≥s\}≤4dintexp⁡\(−s2/2v\+Ls/3\)\\mathbb\{P\}\\\{\\\|\\sum\_\{j\}Y\_\{j\}\\\|\_\{\\mathrm\{op\}\}\\geq s\\\}\\leq 4d\_\{\\mathrm\{int\}\}\\exp\\\!\\bigl\(\-\\tfrac\{s^\{2\}/2\}\{v\+Ls/3\}\\bigr\)fors≥v\+L/3s\\geq\\sqrt\{v\}\+L/3\. Setting the right side toδ\\deltaand writingℓ=log⁡\(4dint/δ\)\\ell=\\log\(4d\_\{\\mathrm\{int\}\}/\\delta\), the quadratics2−23Lℓs−2vℓ=0s^\{2\}\-\\tfrac\{2\}\{3\}L\\ell\\,s\-2v\\ell=0has positive roots=13Lℓ\+\(13Lℓ\)2\+2vℓ≤2vℓ\+23Lℓs=\\tfrac\{1\}\{3\}L\\ell\+\\sqrt\{\(\\tfrac\{1\}\{3\}L\\ell\)^\{2\}\+2v\\ell\}\\leq\\sqrt\{2v\\ell\}\+\\tfrac\{2\}\{3\}L\\ell\. Hence, with probability at least1−δ1\-\\delta,‖∑jYj‖op≤2vlog⁡\(4dint/δ\)\+23Llog⁡\(4dint/δ\)\\\|\\sum\_\{j\}Y\_\{j\}\\\|\_\{\\mathrm\{op\}\}\\leq\\sqrt\{2v\\log\(4d\_\{\\mathrm\{int\}\}/\\delta\)\}\+\\tfrac\{2\}\{3\}L\\log\(4d\_\{\\mathrm\{int\}\}/\\delta\), which is \([10](https://arxiv.org/html/2606.11255#S3.E10)\) after substitutingvvandLL\(the linear coefficient is23L=2‖P‖op/D\\tfrac\{2\}\{3\}L=2\\\|P\\\|\_\{\\mathrm\{op\}\}/D\)\. ∎

###### Proof of Theorem[3\.1](https://arxiv.org/html/2606.11255#S3.Thmtheorem1)\.

WriteY=2cos⁡\(ω⊤x\+β\)cos⁡\(ω⊤w\+β\)=cos⁡\(ω⊤\(x−w\)\)\+cos⁡\(ω⊤\(x\+w\)\+2β\)Y=2\\cos\(\\omega^\{\\top\}x\+\\beta\)\\cos\(\\omega^\{\\top\}w\+\\beta\)=\\cos\(\\omega^\{\\top\}\(x\-w\)\)\+\\cos\(\\omega^\{\\top\}\(x\+w\)\+2\\beta\)\. The second term has zero mean overβ∼Unif\[0,2π\]\\beta\\sim\\mathrm\{Unif\}\[0,2\\pi\], so withω∣T∼𝒩\(0,2TId\)\\omega\\mid T\\sim\\mathcal\{N\}\(0,2TI\_\{d\}\)and the Gaussian characteristic function𝔼ω\[cos⁡\(ω⊤u\)\]=e−T‖u‖2\\mathbb\{E\}\_\{\\omega\}\[\\cos\(\\omega^\{\\top\}u\)\]=e^\{\-T\\\|u\\\|^\{2\}\},𝔼\[Y∣T\]=e−Tr\\mathbb\{E\}\[Y\\mid T\]=e^\{\-Tr\}and𝔼\[Y\]=𝔼T∼Exp⁡\(ε\)\[e−Tr\]=ε/\(ε\+r\)\\mathbb\{E\}\[Y\]=\\mathbb\{E\}\_\{T\\sim\\operatorname\{Exp\}\(\\varepsilon\)\}\[e^\{\-Tr\}\]=\\varepsilon/\(\\varepsilon\+r\)\. For the second moment,𝔼β\[Y2∣ω\]=cos2⁡\(ω⊤\(x−w\)\)\+12\\mathbb\{E\}\_\{\\beta\}\[Y^\{2\}\\mid\\omega\]=\\cos^\{2\}\(\\omega^\{\\top\}\(x\-w\)\)\+\\tfrac\{1\}\{2\}, and𝔼ω\[cos2⁡\(ω⊤\(x−w\)\)∣T\]=12\(1\+e−4Tr\)\\mathbb\{E\}\_\{\\omega\}\[\\cos^\{2\}\(\\omega^\{\\top\}\(x\-w\)\)\\mid T\]=\\tfrac\{1\}\{2\}\(1\+e^\{\-4Tr\}\)\(usingcos2⁡θ=12\(1\+cos⁡2θ\)\\cos^\{2\}\\theta=\\tfrac\{1\}\{2\}\(1\+\\cos 2\\theta\)and𝔼\[cos⁡\(2ω⊤\(x−w\)\)∣T\]=e−4Tr\\mathbb\{E\}\[\\cos\(2\\omega^\{\\top\}\(x\-w\)\)\\mid T\]=e^\{\-4Tr\}\)\. Hence𝔼\[Y2∣T\]=1\+12e−4Tr\\mathbb\{E\}\[Y^\{2\}\\mid T\]=1\+\\tfrac\{1\}\{2\}e^\{\-4Tr\}and𝔼\[Y2\]=1\+12ε/\(ε\+4r\)\\mathbb\{E\}\[Y^\{2\}\]=1\+\\tfrac\{1\}\{2\}\\varepsilon/\(\\varepsilon\+4r\)\. ThusVar⁡\(Y\)=1\+12ε/\(ε\+4r\)−\(ε/\(ε\+r\)\)2\\operatorname\{Var\}\(Y\)=1\+\\tfrac\{1\}\{2\}\\varepsilon/\(\\varepsilon\+4r\)\-\(\\varepsilon/\(\\varepsilon\+r\)\)^\{2\}, andk^D\\widehat\{k\}\_\{D\}averagesDDi\.i\.d\. copies scaled bya/εa/\\varepsilon, givingVar⁡\[k^D\]=\(a/ε\)2Var⁡\(Y\)/D\\operatorname\{Var\}\[\\widehat\{k\}\_\{D\}\]=\(a/\\varepsilon\)^\{2\}\\operatorname\{Var\}\(Y\)/D\. ∎

###### Proof of Theorem[3\.6](https://arxiv.org/html/2606.11255#S3.Thmtheorem6)\.

LetB=A−1/2EA−1/2B=A^\{\-1/2\}EA^\{\-1/2\}, so‖B‖op≤ρ\\\|B\\\|\_\{\\mathrm\{op\}\}\\leq\\rhogives−ρI⪯B⪯ρI\-\\rho I\\preceq B\\preceq\\rho I, and multiplying byA1/2A^\{1/2\}on both sides gives−ρA⪯E⪯ρA\-\\rho A\\preceq E\\preceq\\rho A; sinceKD\+λI=A\+EK\_\{D\}\+\\lambda I=A\+Ethis is the spectral sandwich\. For the coefficients,α~−α^=\(A\+E\)−1y−A−1y=−\(A\+E\)−1Eα^\\tilde\{\\alpha\}\-\\hat\{\\alpha\}=\(A\+E\)^\{\-1\}y\-A^\{\-1\}y=\-\(A\+E\)^\{\-1\}E\\hat\{\\alpha\}, soA1/2\(α~−α^\)=−\(I\+B\)−1BA1/2α^A^\{1/2\}\(\\tilde\{\\alpha\}\-\\hat\{\\alpha\}\)=\-\(I\+B\)^\{\-1\}B\\,A^\{1/2\}\\hat\{\\alpha\}\. As‖B‖op≤ρ<1\\\|B\\\|\_\{\\mathrm\{op\}\}\\leq\\rho<1,‖\(I\+B\)−1‖op≤\(1−ρ\)−1\\\|\(I\+B\)^\{\-1\}\\\|\_\{\\mathrm\{op\}\}\\leq\(1\-\\rho\)^\{\-1\}, hence‖α~−α^‖A=‖A1/2\(α~−α^\)‖2≤ρ1−ρ‖A1/2α^‖2=ρ1−ρ‖α^‖A\\\|\\tilde\{\\alpha\}\-\\hat\{\\alpha\}\\\|\_\{A\}=\\\|A^\{1/2\}\(\\tilde\{\\alpha\}\-\\hat\{\\alpha\}\)\\\|\_\{2\}\\leq\\frac\{\\rho\}\{1\-\\rho\}\\\|A^\{1/2\}\\hat\{\\alpha\}\\\|\_\{2\}=\\frac\{\\rho\}\{1\-\\rho\}\\\|\\hat\{\\alpha\}\\\|\_\{A\}\. ∎

###### Proposition A\.7\(Normalized estimator\)\.

Letqb\(x\)=pb\(x\)/\(‖x‖2\+b\)q\_\{b\}\(x\)=p\_\{b\}\(x\)/\(\\\|x\\\|^\{2\}\+b\), so‖qb\(x\)‖=1\\\|q\_\{b\}\(x\)\\\|=1\(forb\>0b\>0this is defined on all ofℝd\\mathbb\{R\}^\{d\}; forb=0b=0we restrict to domains excluding the origin\)\. Thenqb\(x\)⊤qb\(w\)=\(x⊤w\+b\)2/\(\(‖x‖2\+b\)\(‖w‖2\+b\)\)q\_\{b\}\(x\)^\{\\top\}q\_\{b\}\(w\)=\(x^\{\\top\}w\+b\)^\{2\}/\\bigl\(\(\\\|x\\\|^\{2\}\+b\)\(\\\|w\\\|^\{2\}\+b\)\\bigr\), and the normalized kernelk¯\\tifinaghfontⵟ,b\(x,w\)=qb\(x\)⊤qb\(w\)\(‖x−w‖2\+ε\)−1\\bar\{k\}\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x,w\)=q\_\{b\}\(x\)^\{\\top\}q\_\{b\}\(w\)\\,\(\\\|x\-w\\\|^\{2\}\+\\varepsilon\)^\{\-1\}is positive definite\. Its flat estimator \(replacepbp\_\{b\}byqbq\_\{b\}\) is unbiased with, by Theorem[3\.1](https://arxiv.org/html/2606.11255#S3.Thmtheorem1)and‖qb‖=1\\\|q\_\{b\}\\\|=1,Var⁡\[k¯^D\(x,w\)\]≤1Dε2\(1\+12εε\+4‖x−w‖2\)≤32Dε2\\operatorname\{Var\}\[\\widehat\{\\bar\{k\}\}\_\{D\}\(x,w\)\]\\leq\\frac\{1\}\{D\\varepsilon^\{2\}\}\(1\+\\tfrac\{1\}\{2\}\\tfrac\{\\varepsilon\}\{\\varepsilon\+4\\\|x\-w\\\|^\{2\}\}\)\\leq\\frac\{3\}\{2D\\varepsilon^\{2\}\}, a bound free ofRRandbb\.

###### Proof of Proposition[A\.7](https://arxiv.org/html/2606.11255#A1.Thmtheorem7)\.

‖pb\(x\)‖2=\(‖x‖2\+b\)2\\\|p\_\{b\}\(x\)\\\|^\{2\}=\(\\\|x\\\|^\{2\}\+b\)^\{2\}\(Proposition[2\.2](https://arxiv.org/html/2606.11255#S2.Thmtheorem2)\), soqb=pb/\(‖x‖2\+b\)q\_\{b\}=p\_\{b\}/\(\\\|x\\\|^\{2\}\+b\)has unit norm andqb\(x\)⊤qb\(w\)=pb\(x\)⊤pb\(w\)/\(\(‖x‖2\+b\)\(‖w‖2\+b\)\)=\(x⊤w\+b\)2/\(\(‖x‖2\+b\)\(‖w‖2\+b\)\)q\_\{b\}\(x\)^\{\\top\}q\_\{b\}\(w\)=p\_\{b\}\(x\)^\{\\top\}p\_\{b\}\(w\)/\(\(\\\|x\\\|^\{2\}\+b\)\(\\\|w\\\|^\{2\}\+b\)\)=\(x^\{\\top\}w\+b\)^\{2\}/\(\(\\\|x\\\|^\{2\}\+b\)\(\\\|w\\\|^\{2\}\+b\)\)\. The normalized kernel is a Schur product of this PSD kernel andhεh\_\{\\varepsilon\}, hence PSD; the variance bound is Theorem[3\.1](https://arxiv.org/html/2606.11255#S3.Thmtheorem1)witha=qb\(x\)⊤qb\(w\)≤1a=q\_\{b\}\(x\)^\{\\top\}q\_\{b\}\(w\)\\leq 1\. ∎

###### Proposition A\.8\(RAY error decomposition\)\.

Sketch only the quadratic term:p^m\(x,w\)=TS2\(x\)⊤TS2\(w\)\+2bx⊤w\+b2\\widehat\{p\}\_\{m\}\(x,w\)=\\mathrm\{TS\}\_\{2\}\(x\)^\{\\top\}\\mathrm\{TS\}\_\{2\}\(w\)\+2b\\,x^\{\\top\}w\+b^\{2\}with𝔼\[TS2\(x\)⊤TS2\(w\)\]=\(x⊤w\)2\\mathbb\{E\}\[\\mathrm\{TS\}\_\{2\}\(x\)^\{\\top\}\\mathrm\{TS\}\_\{2\}\(w\)\]=\(x^\{\\top\}w\)^\{2\}, so𝔼\[p^m\]=p=\(x⊤w\+b\)2\\mathbb\{E\}\[\\widehat\{p\}\_\{m\}\]=p=\(x^\{\\top\}w\+b\)^\{2\}\(the linear and constant terms are kept exact and add no sketch variance\)\. Letp^m\\widehat\{p\}\_\{m\}be independent of the unbiased radial estimateh^D\\widehat\{h\}\_\{D\}ofh\(x,w\)=\(‖x−w‖2\+ε\)−1h\(x,w\)=\(\\\|x\-w\\\|^\{2\}\+\\varepsilon\)^\{\-1\}\. Thenk^D,m=p^mh^D\\widehat\{k\}\_\{D,m\}=\\widehat\{p\}\_\{m\}\\widehat\{h\}\_\{D\}is unbiased fork\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}and

Var⁡\[k^D,m\]=p2Var⁡\[h^D\]⏟radial Monte Carlo\+h2Var⁡\[p^m\]⏟polynomial sketch\+Var⁡\[p^m\]Var⁡\[h^D\]⏟interaction\.\\operatorname\{Var\}\[\\widehat\{k\}\_\{D,m\}\]=\\underbrace\{p^\{2\}\\operatorname\{Var\}\[\\widehat\{h\}\_\{D\}\]\}\_\{\\text\{radial Monte Carlo\}\}\+\\underbrace\{h^\{2\}\\operatorname\{Var\}\[\\widehat\{p\}\_\{m\}\]\}\_\{\\text\{polynomial sketch\}\}\+\\underbrace\{\\operatorname\{Var\}\[\\widehat\{p\}\_\{m\}\]\\,\\operatorname\{Var\}\[\\widehat\{h\}\_\{D\}\]\}\_\{\\text\{interaction\}\}\.The degree\-2 TensorSketch boundVar⁡\[p^m\]≤C‖x‖4‖w‖4/m\\operatorname\{Var\}\[\\widehat\{p\}\_\{m\}\]\\leq C\\,\\\|x\\\|^\{4\}\\\|w\\\|^\{4\}/m\(the bias terms being exact\) makes the sketch terms vanish asm→∞m\\to\\infty, recovering the exact\-modulation estimator\.

###### Proof of Proposition[A\.8](https://arxiv.org/html/2606.11255#A1.Thmtheorem8)\.

By independence𝔼\[p^mh^D\]=ph=k\\mathbb\{E\}\[\\widehat\{p\}\_\{m\}\\widehat\{h\}\_\{D\}\]=ph=kand𝔼\[p^m2h^D2\]=𝔼\[p^m2\]𝔼\[h^D2\]=\(p2\+Var⁡p^m\)\(h2\+Var⁡h^D\)\\mathbb\{E\}\[\\widehat\{p\}\_\{m\}^\{2\}\\widehat\{h\}\_\{D\}^\{2\}\]=\\mathbb\{E\}\[\\widehat\{p\}\_\{m\}^\{2\}\]\\mathbb\{E\}\[\\widehat\{h\}\_\{D\}^\{2\}\]=\(p^\{2\}\+\\operatorname\{Var\}\\widehat\{p\}\_\{m\}\)\(h^\{2\}\+\\operatorname\{Var\}\\widehat\{h\}\_\{D\}\); subtractingp2h2p^\{2\}h^\{2\}gives the three\-term identity\. ∎

###### Proposition A\.9\(The IMQ factor is a finite difference inbb\)\.

For everyx,wx,w, everyb≥0b\\geq 0, and every steph\>0h\>0,

k\\tifinaghfontⵟ,b\+2h\(w,x\)−2k\\tifinaghfontⵟ,b\+h\(w,x\)\+k\\tifinaghfontⵟ,b\(w,x\)2h2=1‖w−x‖2\+ε=hε\(w,x\),\\frac\{k\_\{\\text\{\\tifinaghfont ⵟ\},b\+2h\}\(w,x\)\-2\\,k\_\{\\text\{\\tifinaghfont ⵟ\},b\+h\}\(w,x\)\+k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(w,x\)\}\{2h^\{2\}\}=\\frac\{1\}\{\\\|w\-x\\\|^\{2\}\+\\varepsilon\}=h\_\{\\varepsilon\}\(w,x\),\(16\)exactly, with no limit required and all three biasesb,b\+h,b\+2hb,\\,b\+h,\\,b\+2hinside the kernel’s domainb≥0b\\geq 0\.

###### Proof\.

The numerator\(w⊤x\+b\)2\(w^\{\\top\}x\+b\)^\{2\}is a quadratic polynomial inbb, so any second difference equals the constant2h22h^\{2\}; the forward difference keeps all biases inb≥0b\\geq 0\(the centered form would requireb≥hb\\geq h\)\. Dividing by thebb\-independent denominator‖w−x‖2\+ε\\\|w\-x\\\|^\{2\}\+\\varepsilonyieldshεh\_\{\\varepsilon\}\. ∎

## Appendix BKronecker Implementation

For kernel evaluation only the inner product is needed, and it factors as \([12](https://arxiv.org/html/2606.11255#S5.E12)\), computable inO\(D\(D′\+d\)\)O\(D\(D^\{\\prime\}\+d\)\)per pair\. For the full Gram matrix, Algorithm[1](https://arxiv.org/html/2606.11255#alg1)assembles it via Hadamard products\.

Algorithm 1RAY Gram\-matrix approximation0:Data

X=\{x1,…,xN\}X=\\\{x\_\{1\},\\dots,x\_\{N\}\\\}, parameters

D,D′,ε,bD,D^\{\\prime\},\\varepsilon,b
1:Draw

t1,…,tD∼Exp⁡\(ε\)t\_\{1\},\\dots,t\_\{D\}\\sim\\operatorname\{Exp\}\(\\varepsilon\)
2:Compute polynomial Gram

Pij=\(xi⊤xj\+b\)2/εP\_\{ij\}=\(x\_\{i\}^\{\\top\}x\_\{j\}\+b\)^\{2\}/\\varepsilonO\(N2d\)O\(N^\{2\}d\)

3:

Kapprox←𝟎N×NK\_\{\\mathrm\{approx\}\}\\leftarrow\\mathbf\{0\}\_\{N\\times N\}
4:for

j=1,…,Dj=1,\\dots,Ddo

5:Draw

ωj,ℓ∼𝒩\(0,2tjId\)\\omega\_\{j,\\ell\}\\sim\\mathcal\{N\}\(0,2t\_\{j\}I\_\{d\}\),

βj,ℓ∼Unif\(\[0,2π\]\)\\beta\_\{j,\\ell\}\\sim\\mathrm\{Unif\}\(\[0,2\\pi\]\)
6:Build

Ψj∈ℝN×D′\\Psi\_\{j\}\\in\\mathbb\{R\}^\{N\\times D^\{\\prime\}\},

Ψj\[i,ℓ\]=2/D′cos⁡\(ωj,ℓ⊤xi\+βj,ℓ\)\\Psi\_\{j\}\[i,\\ell\]=\\sqrt\{2/D^\{\\prime\}\}\\cos\(\\omega\_\{j,\\ell\}^\{\\top\}x\_\{i\}\+\\beta\_\{j,\\ell\}\)
7:

Kapprox\+=\(ΨjΨj⊤\)∘P/DK\_\{\\mathrm\{approx\}\}\\mathrel\{\+\}=\(\\Psi\_\{j\}\\Psi\_\{j\}^\{\\top\}\)\\circ P/DO\(N2D′\+ND′d\)O\(N^\{2\}D^\{\\prime\}\+ND^\{\\prime\}d\)

8:endfor

9:return

KapproxK\_\{\\mathrm\{approx\}\}

## Appendix CSpectral Interpretation: Bernstein–Schur Kernels

The scheme drawsttfrom the Bernstein–Widder mixing measure of the completely monotone radial factorhεh\_\{\\varepsilon\}\. This is not a Bochner spectral measure \(k\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is not shift\-invariant\) but plays the analogous role, and the construction depends onhεh\_\{\\varepsilon\}only through this measure\. Here we prove the general Theorem[2\.5](https://arxiv.org/html/2606.11255#S2.Thmtheorem5)\.

###### Proof of Theorem[2\.5](https://arxiv.org/html/2606.11255#S2.Thmtheorem5)\.

For one draw setY=2cos⁡\(ω⊤x\+β\)cos⁡\(ω⊤w\+β\)Y=2\\cos\(\\omega^\{\\top\}x\+\\beta\)\\cos\(\\omega^\{\\top\}w\+\\beta\); conditioned onTTthe standard RFF identity gives𝔼\[Y∣T\]=e−T‖x−w‖2\\mathbb\{E\}\[Y\\mid T\]=e^\{\-T\\\|x\-w\\\|^\{2\}\}, so𝔼\[mY\]=m𝔼T∼ν/m\[e−T‖x−w‖2\]=∫0∞e−t‖x−w‖2𝑑ν\(t\)=f\(‖x−w‖2\)\\mathbb\{E\}\[m\\,Y\]=m\\,\\mathbb\{E\}\_\{T\\sim\\nu/m\}\[e^\{\-T\\\|x\-w\\\|^\{2\}\}\]=\\int\_\{0\}^\{\\infty\}e^\{\-t\\\|x\-w\\\|^\{2\}\}d\\nu\(t\)=f\(\\\|x\-w\\\|^\{2\}\)\. Withu\(x\)⊤u\(w\)=p\(x,w\)u\(x\)^\{\\top\}u\(w\)=p\(x,w\),𝔼\[z\(x\)⊤z\(w\)\]=p\(x,w\)f\(‖x−w‖2\)=k\(x,w\)\\mathbb\{E\}\[z\(x\)^\{\\top\}z\(w\)\]=p\(x,w\)f\(\\\|x\-w\\\|^\{2\}\)=k\(x,w\)\. For the variance,\|mYu\(x\)⊤u\(w\)\|≤2mB2\|m\\,Y\\,u\(x\)^\{\\top\}u\(w\)\|\\leq 2mB^\{2\}and𝔼\[Y2∣T\]≤32\\mathbb\{E\}\[Y^\{2\}\\mid T\]\\leq\\tfrac\{3\}\{2\}\(as in the proof of Theorem[3\.1](https://arxiv.org/html/2606.11255#S3.Thmtheorem1)\), soVar⁡\[mYu\(x\)⊤u\(w\)\]≤m2B4𝔼\[Y2\]≤32m2B4\\operatorname\{Var\}\[m\\,Y\\,u\(x\)^\{\\top\}u\(w\)\]\\leq m^\{2\}B^\{4\}\\,\\mathbb\{E\}\[Y^\{2\}\]\\leq\\tfrac\{3\}\{2\}m^\{2\}B^\{4\}; averagingDDi\.i\.d\. copies divides byDD\. For the uniform bound, each summand lies in an interval of length4mB24mB^\{2\}, so Hoeffding givesℙ\(\|z\(xi\)⊤z\(xj\)−k\(xi,xj\)\|≥s\)≤2exp⁡\(−Ds2/\(8m2B4\)\)\\mathbb\{P\}\(\|z\(x\_\{i\}\)^\{\\top\}z\(x\_\{j\}\)\-k\(x\_\{i\},x\_\{j\}\)\|\\geq s\)\\leq 2\\exp\(\-Ds^\{2\}/\(8m^\{2\}B^\{4\}\)\), and a union bound over≤N2\\leq N^\{2\}pairs yields the claim\. ∎

The biased\\tifinaghfontⵟ\-kernel is the caseu=pbu=p\_\{b\}\(dp=d\(d\+1\)/2\+d\+1d\_\{p\}=d\(d\{\+\}1\)/2\+d\+1\),f\(r\)=\(r\+ε\)−1f\(r\)=\(r\+\\varepsilon\)^\{\-1\}withdν\(t\)=e−εtdtd\\nu\(t\)=e^\{\-\\varepsilon t\}\\,dt, somf=1/εm\_\{f\}=1/\\varepsilonandν/mf=Exp⁡\(ε\)\\nu/m\_\{f\}=\\operatorname\{Exp\}\(\\varepsilon\); nontrivial members are tabulated in the main text \(Table[1](https://arxiv.org/html/2606.11255#S2.T1)\)\.

## Appendix DFurther Validations

#### Gram\-matrix approximation error\.

![Refer to caption](https://arxiv.org/html/2606.11255v1/x4.png)Figure 4:Sphere\-normalized sanity check \(here the kernel coincides with a dot\-product kernel, so this isolates the dimension behavior and is not a representation claim\)\. RAY approximates the biased Gram at the Monte\-Carlo rate with a radial sample count that grows little with dimension \(flatD′=1D^\{\\prime\}=1\)\.\(a\)Relative Frobenius error vs\.DD\(N=1000N=1000,b=1b=1,ε=1\\varepsilon=1\); all dimensions track theO\(1/D\)O\(1/\\sqrt\{D\}\)guide within a factor of∼3\\sim 3\.\(b\)Radial samples for≤10%\\leq 10\\%error vs\. dimension: RAY’sD⋆D^\{\\star\}stays bounded and plateaus byd=20d=20, while the Nyström landmark countm⋆m^\{\\star\}exceeds the tested range ford≥20d\\geq 20\.Random\\tifinaghfontⵟ\-features converge at the Monte\-Carlo rate, and their accuracy grows only mildly with dimension\. Approximating the exact biased Gram matrix ofN=1000N=1000sphere\-normalized points \(normalization fixesR=1R=1, isolating the dimension behavior from theR2R^\{2\}growth of raw data;b=1b=1,ε=1\\varepsilon=1, mean over55seeds, the recommended flatD′=1D^\{\\prime\}=1soDDis the number of cosine draws\), the relative Frobenius error falls asO\(1/D\)O\(1/\\sqrt\{D\}\)in the number of radial samplesDD\(from0\.330\.33atD=10D=10to0\.0240\.024atD=1000D=1000atd=2d=2\(Figure[4](https://arxiv.org/html/2606.11255#A4.F4)a\)\) and atD=1000D=1000stays within\[0\.024,0\.073\]\[0\.024,0\.073\]acrossd∈\{2,…,20\}d\\in\\\{2,\\dots,20\\\}, a factor of∼3\\sim 3\. Uniform\-landmark Nyström at a fixed budget behaves oppositely \(Table[11](https://arxiv.org/html/2606.11255#A4.T11); we add stronger k\-means and adaptive ridge\-leverage\-score Nyström\(Musco & Musco,[2017](https://arxiv.org/html/2606.11255#bib.bib18)\)baselines in Section[4\.1](https://arxiv.org/html/2606.11255#S4.SS1)\): atd=2d=2the error is below the table’s displayed precision, likely reflecting unusually rapid spectral decay on this sampled problem rather than literal exact recovery, but byd=20d=20it degrades to0\.390\.39\(m=50m\{=\}50\) and0\.290\.29\(m=100m\{=\}100\)\. At matched draws versus landmarks, RAY is therefore preferable in moderate\-to\-high dimension, Nyström only in low\. The operator\-norm error controlled by Theorem[3\.2](https://arxiv.org/html/2606.11255#S3.Thmtheorem2)tracks the Frobenius error: the sameO\(1/D\)O\(1/\\sqrt\{D\}\)rate, and smaller in relative terms \(0\.43→0\.0590\.43\\to 0\.059vs\.0\.59→0\.0710\.59\\to 0\.071asDDruns10→100010\\to 1000atd=10d=10\)\. The theorem’s data\-adaptive constant is borne out directly: across bounded\-ball datasets of increasing spectral spread \(intrinsic dimensiondintd\_\{\\mathrm\{int\}\}from1\.41\.4to3\.53\.5,N=400N=400\) the matrix\-Bernstein expression stays near5555–6060while the crudeNmaxijN\\max\_\{ij\}route grows from6666to147147, the gap widening with the spectrum, and both upper\-bound the measured‖KD−K‖op\\\|K\_\{D\}\-K\\\|\_\{\\mathrm\{op\}\}\(2020–2424\)\.

Table 11:Relative Frobenius error‖Kapprox−K‖F/‖K‖F\\\|K\_\{\\mathrm\{approx\}\}\-K\\\|\_\{F\}/\\\|K\\\|\_\{F\}of the biased Gram matrix \(b=1b=1,ε=1\\varepsilon=1, unit sphere, mean over55seeds\)\. RAY usesDDradial samples at the recommended flatD′=1D^\{\\prime\}=1, soDDis the number of cosine draws \(total feature dimensionDdbD\\,d\_\{b\}\); Nyström usesmmlandmarks\.
#### Bias scaling\.

The bias enters the variance exactly as the fourth power Theorem[A\.1](https://arxiv.org/html/2606.11255#A1.Thmtheorem1)predicts \(Figure[5](https://arxiv.org/html/2606.11255#A4.F5)\)\. For two unit\-vector pairs inℝ2\\mathbb\{R\}^\{2\}\(aligned \(x⊤w=1x^\{\\top\}w=1\) andx⊤w=0\.5x^\{\\top\}w=0\.5\) we sweepb∈\{0,…,5\}b\\in\\\{0,\\dots,5\\\}atD=200D=200,ε=1\\varepsilon=1and fit the log\-log slope of the empirical variance \(20002000repetitions\) againstx⊤w\+bx^\{\\top\}w\+b\.

![Refer to caption](https://arxiv.org/html/2606.11255v1/x5.png)Figure 5:Estimator variance vs\. the bias\-shifted alignmentx⊤w\+bx^\{\\top\}w\+b\(log\-log,20002000repetitions\)\. Both pairs follow a fourth power \(fitted slopes4\.014\.01and3\.993\.99against the slope\-44guide\)\. For the aligned pair the variance equals the\(R2\+b\)4\(R^\{2\}\+b\)^\{4\}envelope of Theorem[A\.1](https://arxiv.org/html/2606.11255#A1.Thmtheorem1)\(the ratioVar/\(R2\+b\)4\\operatorname\{Var\}/\(R^\{2\}\+b\)^\{4\}is constant at≈5×10−5\\approx 5\\times 10^\{\-5\}, so the bound is tight\); thex⊤w=0\.5x^\{\\top\}w=0\.5pair lies below it, the gap being the Cauchy–Schwarz step in the proof\.The fitted exponents are4\.014\.01and3\.993\.99\(and2\.012\.01,1\.991\.99for the standard deviation\), matching the predicted44and22\. The aligned pair sits on the\(R2\+b\)4\(R^\{2\}\+b\)^\{4\}envelope, whereVar/\(R2\+b\)4\\operatorname\{Var\}/\(R^\{2\}\+b\)^\{4\}is constant and the bound is tight; thex⊤w=0\.5x^\{\\top\}w=0\.5pair lies strictly below it, the gap being the Cauchy–Schwarz stepx⊤w≤‖x‖‖w‖x^\{\\top\}w\\leq\\\|x\\\|\\\|w\\\|in the proof\.

Beyond the envelope, the*exact*variance of Theorem[3\.1](https://arxiv.org/html/2606.11255#S3.Thmtheorem1)is confirmed pointwise: acrossb∈\{0,1,2\}b\\in\\\{0,1,2\\\}and1818off\-sphere pairs spanningr=‖x−w‖2∈\[0\.4,2\.2\]r=\\\|x\-w\\\|^\{2\}\\in\[0\.4,2\.2\], the ratio of empirical \(over2×1052\\times 10^\{5\}draws\) to predicted variance is1\.001±0\.0021\.001\\pm 0\.002, and the estimator is unbiased to within0\.5%0\.5\\%\.

#### The normalized estimator removes the bias/radius variance blow\-up\.

The normalized variant \(Proposition[A\.7](https://arxiv.org/html/2606.11255#A1.Thmtheorem7)\) tradesk\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}for a bounded\-variance rescaling\. We confirm the exact relationVar⁡\(exact modulation\)/Var⁡\(normalized\)=\(\(‖x‖2\+b\)\(‖w‖2\+b\)\)2\\operatorname\{Var\}\(\\text\{exact modulation\}\)/\\operatorname\{Var\}\(\\text\{normalized\}\)=\(\(\\\|x\\\|^\{2\}\+b\)\(\\\|w\\\|^\{2\}\+b\)\)^\{2\}by sweepingbbatR=1R=1and the data radiusRRatb=1b=1\(2×1052\\times 10^\{5\}draws per point, Table[12](https://arxiv.org/html/2606.11255#A4.T12)\)\. The measured ratio matches\(R2\+b\)4\(R^\{2\}\+b\)^\{4\}to three figures, the exact\-modulation variance grows from0\.020\.02to449449across the sweep, and the normalized variant stays bounded in\[0\.045,0\.53\]≤32\[0\.045,0\.53\]\\leq\\tfrac\{3\}\{2\}, as the proposition predicts\. The normalized estimator is therefore the stable choice for large\-radius or large\-bias data, at the cost of estimating the rescaled kernel\.

Table 12:Flat\-estimator variance of exact vs\. normalized modulation \(ε=1\\varepsilon=1,2×1052\\times 10^\{5\}draws\)\. The ratio equals\(R2\+b\)4\(R^\{2\}\+b\)^\{4\}exactly; exact modulation blows up while the normalized estimator stays bounded \(≤32\\leq\\tfrac\{3\}\{2\}\)\.
#### Fixed\-dataset radial sample complexity\.

The radial sample count needed for a target accuracy grows little with dimension \(and then plateaus\) whereas Nyström’s landmark count does not stay bounded at all\. With the recommended flatD′=1D^\{\\prime\}=1\(soDDis the number of cosine draws\), for eachd∈\{2,…,100\}d\\in\\\{2,\\dots,100\\\}\(N=500N=500on the sphere,b=ε=1b=\\varepsilon=1,33seeds\) we findD⋆D^\{\\star\}, the smallestDDreaching relative Frobenius error≤0\.10\\leq 0\.10, and the analogous Nyström countm⋆m^\{\\star\}\(Table[13](https://arxiv.org/html/2606.11255#A4.T13), Figure[4](https://arxiv.org/html/2606.11255#A4.F4)b\)\.

Table 13:Radial samples needed for relative Frobenius error≤0\.10\\leq 0\.10vs\. dimension \(b=1b=1,ε=1\\varepsilon=1, unit sphere, flatD′=1D^\{\\prime\}=1\)\. RAY’s radial countD⋆D^\{\\star\}stays bounded and plateaus; Nyström’s landmark countm⋆m^\{\\star\}exceeds the tested range \(\>350\>350\) ford≥20d\\geq 20\. “\>350\>350” means350350landmarks did not reach the target\.RAY’sD⋆D^\{\\star\}rises from200200to750750and then plateaus \(no further growth fromd=20d=20to100100\), while Nyström reaches the target only ford≤10d\\leq 10: fromd=20d=20on,350350landmarks leave the error above0\.100\.10\(atd=100d=100,0\.200\.20\)\. The contrast is between a radial count that saturates inddand Nyström’s curse of dimensionalityO\(m−2s/d\)O\(m^\{\-2s/d\}\), the empirical content of Table[2](https://arxiv.org/html/2606.11255#S3.T2); the total RAY feature dimension isD⋆dbD^\{\\star\}d\_\{b\}, which does grow withddthrough the polynomial factor\.

#### Downstream kernel ridge regression\.

The exact polynomial factor becomes a large downstream advantage at a low number of random draws\. On two real datasets \(digits\(classification,d=64d=64\) andcalifornia\(regression,d=8d=8\), standardized andℓ2\\ell\_\{2\}\-normalized to the sphere, withε\\varepsilonthe median squared distance,b=1b=1, and ridgeλ=10−2\\lambda=10^\{\-2\}, all fixed rather than tuned on validation\) we fit kernel ridge regression in Gram form \(to isolate kernel\-approximation quality; Appendix[D\.3](https://arxiv.org/html/2606.11255#A4.SS3)evaluates primal\-feature scalability separately\) and compare, at a*matched number of random draws*DD, RAY against IMQ random features \(the radial\-only estimator of Section[2\.4](https://arxiv.org/html/2606.11255#S2.SS4)\), Gaussian RFF, and Nyström on the exact\\tifinaghfontⵟ\-kernel, with the three exact kernels as references\. MatchingDDmatches the Gram\-assembly compute up to the one\-timeO\(N2d\)O\(N^\{2\}d\)polynomial Gram, a small overhead next to the gap below\. Figure[6](https://arxiv.org/html/2606.11255#A4.F6)reports the mean over33splits with±1\\pm 1standard\-deviation bands\.

![Refer to caption](https://arxiv.org/html/2606.11255v1/x6.png)Figure 6:Downstream KRR test metric vs\. the number of random drawsDDon sphere\-normalized real data \(mean over33splits,±1\\pm 1s\.d\. bands\); the dashed line is the exact\\tifinaghfontⵟ\-kernel\.\(a\)digits: RAY\-\\tifinaghfontⵟsits at the exact\-kernel accuracy already atD=8D=8, while Gaussian RFF, IMQ\-RFF, and Nyström climb slowly and need∼512\\sim\\\!512features to catch up\.\(b\)california: RAY tracks the exact kernel from the smallest budgets\. RAY keeps the alignment numerator exact and spends itsDDdraws only on the radial factor; the others must sample the whole kernel\. As*exact*kernels the\\tifinaghfontⵟ\-kernel ties Gaussian on digits \(0\.9860\.986\) and trails it on california \(0\.5580\.558vs\.0\.5310\.531RMSE, IMQ0\.5350\.535\)\.With onlyD=32D=32radial samples RAY already reaches0\.9800\.980accuracy on digits and comes within0\.010\.01RMSE of the exact\\tifinaghfontⵟ\-kernel on california, while Gaussian RFF, IMQ\-RFF, and Nyström at the same budget trail by a wide margin \(0\.810\.81–0\.890\.89on digits\): they spend their draws approximating the whole kernel, whereas RAY keeps the alignment numerator exact and spends itsDDdraws only on the radial factor\. The IMQ\-RFF ablation isolates this: removing the numerator and adding it back lifts digits accuracy from0\.810\.81to0\.980\.98atD=32D=32, so the exact numerator is the source of the low\-budget efficiency\. As an*exact*kernel the\\tifinaghfontⵟ\-kernel is competitive but not uniformly best \(it ties Gaussian on digits and trails it by0\.030\.03RMSE on california\); the gain here is RAY’s draw efficiency\. Two caveats define the scope: the comparison is matched in the number of*draws*, not in explicit feature dimension, which for RAY carries the polynomial factordb=O\(d2\)d\_\{b\}=O\(d^\{2\}\)per draw, so the draw\-efficiency advantage does not transfer to a dimension\-matched setting in highdd, where the on\-sphere dot\-product route \(a direction we leave to future work\) is preferable; and the\\tifinaghfontⵟ\-kernel needs inputs of bounded norm: on raw standardized features its unbounded numerator destabilizes both the exact kernel and RAY \(exact\-\\tifinaghfontⵟRMSE0\.840\.84, RAY diverging\), so we keep‖x‖\\\|x\\\|bounded \(here by sphere normalization; Section[4\.1](https://arxiv.org/html/2606.11255#S4.SS1)validates the off\-sphere bounded\-ball regime\)\.

The relative\-spectral KRR theorem \(Theorem[3\.6](https://arxiv.org/html/2606.11255#S3.Thmtheorem6)\) holds as stated\. On a digits subset \(N=600N=600,λ=0\.1\\lambda=0\.1\) the whitened errorρ=‖A−1/2\(KD−K\)A−1/2‖op\\rho=\\\|A^\{\-1/2\}\(K\_\{D\}\-K\)A^\{\-1/2\}\\\|\_\{\\mathrm\{op\}\}falls withDD\(log\-log slope−0\.83\-0\.83\), and onceρ<1\\rho<1\(atD=1024D=1024,ρ=0\.69\\rho=0\.69\) the predicted bound‖α~−α^‖A≤ρ1−ρ‖α^‖A\\\|\\tilde\{\\alpha\}\-\\hat\{\\alpha\}\\\|\_\{A\}\\leq\\frac\{\\rho\}\{1\-\\rho\}\\\|\\hat\{\\alpha\}\\\|\_\{A\}holds with a wide margin \(0\.18≤2\.180\.18\\leq 2\.18\); at smallerDDthe relative error decreases monotonically \(1\.59→0\.181\.59\\to 0\.18overD:16→1024D\{:\}16\\to 1024\) as the deterministic guarantee predicts once it becomes active\.

#### Real\-data check: competitive across regimes\.

The coupled\-target preference study \(Section[4\.4](https://arxiv.org/html/2606.11255#S4.SS4)\) is synthetic by construction\. The same reading \(a single\\tifinaghfontⵟ\-kernel stays competitive whichever geometry carries the labels, rather than failing on one\) can be tested on real data\. We embed CIFAR\-10 and CIFAR\-100 with a frozen CLIP image encoder\(Radford et al\.,[2021](https://arxiv.org/html/2606.11255#bib.bib20)\)and run exact kernel ridge regression with the Gaussian, IMQ, degree\-2 polynomial, and\\tifinaghfontⵟkernels under two preprocessings of the embeddings: centered, where class identity is angular and alignment dominates, and bounded\-ball, where it is radial and proximity dominates \(Table[14](https://arxiv.org/html/2606.11255#A4.T14)\)\. Across five seeds the four kernels agree to within about one standard deviation: on CIFAR\-10 they are statistically tied at≈0\.944\\approx 0\.944, and on CIFAR\-100 the polynomial \(alignment\) kernel leads by roughly one standard deviation with the\\tifinaghfontⵟ\-kernel close behind\. The reading is modest but consistent with the coupling picture: the\\tifinaghfontⵟ\-kernel is competitive in both regimes, never far from the best, because it carries whichever factor is informative, whereas a single\-factor kernel would be exposed on the regime it is blind to; on these embeddings, though, one factor already suffices, so no kernel separates decisively\. And this is a statement about the*kernel*: atd=512d=512the random\-feature\\tifinaghfontⵟneeds more draws than a single\-factor RFF to realize it, consistent with theO\(d2\)O\(d^\{2\}\)representation floor \(Limitation \(v\)\)\.

Table 14:Exact\-kernel KRR test accuracy on frozen CLIP embeddings of CIFAR\-10/100 \(1010k train /22k test, fixedλ\\lambda\), under centered \(alignment\-dominated\) and bounded\-ball \(proximity\-dominated\) preprocessing\. Means over55random train/test subsamples; per\-cell standard deviation≤0\.006\\leq 0\.006\. On CIFAR\-10 all four kernels agree within one standard deviation; on CIFAR\-100 the polynomial leads by about one standard deviation\. Best mean per row in bold \(CIFAR\-10 alignment is a four\-way tie\)\.
#### Importance sampling \(Section[3\.3](https://arxiv.org/html/2606.11255#S3.SS3)\)\.

The proposalt∼Exp⁡\(ε\+η\)t\\sim\\operatorname\{Exp\}\(\\varepsilon\+\\eta\)is unbiased but has a finite\-variance window\.

###### Proposition D\.1\(Exponential proposal for the radial factor\)\.

WithT∼Exp⁡\(ε\+η\)T\\sim\\operatorname\{Exp\}\(\\varepsilon\+\\eta\)\(η≥0\\eta\\geq 0\) andh^η\(r\)=1ε\+ηe−\(r−η\)T\\widehat\{h\}\_\{\\eta\}\(r\)=\\frac\{1\}\{\\varepsilon\+\\eta\}e^\{\-\(r\-\\eta\)T\}, we have𝔼\[h^η\(r\)\]=\(r\+ε\)−1\\mathbb\{E\}\[\\widehat\{h\}\_\{\\eta\}\(r\)\]=\(r\+\\varepsilon\)^\{\-1\}, and𝔼\[h^η\(r\)2\]=\(\(ε\+η\)\(ε\+2r−η\)\)−1\\mathbb\{E\}\[\\widehat\{h\}\_\{\\eta\}\(r\)^\{2\}\]=\\bigl\(\(\\varepsilon\+\\eta\)\(\\varepsilon\+2r\-\\eta\)\\bigr\)^\{\-1\}is finite iffη<ε\+2r\\eta<\\varepsilon\+2r\. A proposal safe for all pairs, includingr=0r=0, requiresη<ε\\eta<\\varepsilon\.

###### Proof\.

𝔼\[h^η\(r\)\]=∫0∞1ε\+ηe−\(r−η\)t\(ε\+η\)e−\(ε\+η\)t𝑑t=∫0∞e−\(r\+ε\)t𝑑t=\(r\+ε\)−1\\mathbb\{E\}\[\\widehat\{h\}\_\{\\eta\}\(r\)\]=\\int\_\{0\}^\{\\infty\}\\frac\{1\}\{\\varepsilon\+\\eta\}e^\{\-\(r\-\\eta\)t\}\(\\varepsilon\+\\eta\)e^\{\-\(\\varepsilon\+\\eta\)t\}dt=\\int\_\{0\}^\{\\infty\}e^\{\-\(r\+\\varepsilon\)t\}dt=\(r\+\\varepsilon\)^\{\-1\}\. Similarly𝔼\[h^η\(r\)2\]=1ε\+η∫0∞e−\(ε\+2r−η\)t𝑑t\\mathbb\{E\}\[\\widehat\{h\}\_\{\\eta\}\(r\)^\{2\}\]=\\frac\{1\}\{\\varepsilon\+\\eta\}\\int\_\{0\}^\{\\infty\}e^\{\-\(\\varepsilon\+2r\-\\eta\)t\}dt, finite iffε\+2r−η\>0\\varepsilon\+2r\-\\eta\>0, equal to\(\(ε\+η\)\(ε\+2r−η\)\)−1\(\(\\varepsilon\+\\eta\)\(\\varepsilon\+2r\-\\eta\)\)^\{\-1\}\. ∎

Empirically, on pairs at several squared distances \(x⊤w=0\.5x^\{\\top\}w=0\.5,b=1b=1,ε=1\\varepsilon=1,η<ε\\eta<\\varepsilon,D=200D=200,30003000repetitions\) the estimator stays unbiased \(bias≤10−3\\leq 10^\{\-3\}\) and the reduction is real but tuning\-dependent: up to a0\.42×0\.42\\timesvariance ratio at‖x−w‖2=1\\\|x\-w\\\|^\{2\}=1\(η=0\.5\\eta=0\.5\) and0\.55×0\.55\\timesat‖x−w‖2=0\.25\\\|x\-w\\\|^\{2\}=0\.25\(η=0\.1\\eta=0\.1\), but*no*improvement for the nearest pair‖x−w‖2=0\.05\\\|x\-w\\\|^\{2\}=0\.05\(whereη=0\\eta=0is best\)\. Since the optimalη\\etais data\-dependent, a single fixed proposal is a compromise\.

#### Orthogonal features ind\>1d\>1\(Section[3\.3](https://arxiv.org/html/2606.11255#S3.SS3)\)\.

At a fixed scalettwe estimate the Gaussian factorgt\(x,w\)g\_\{t\}\(x,w\)withD′=dD^\{\\prime\}=dinner frequencies, drawn either i\.i\.d\.𝒩\(0,2tId\)\\mathcal\{N\}\(0,2tI\_\{d\}\)or as a scaled Haar\-orthogonal block\(Yu et al\.,[2016](https://arxiv.org/html/2606.11255#bib.bib34)\)that preserves the marginal\. Both are unbiased \(empirical bias≤6×10−3\\leq 6\\times 10^\{\-3\}\), and the orthogonal block lowers the inner variance by a ratio of0\.720\.72atd=5d=5and0\.760\.76atd=20d=20\(40004000repetitions,x⊤w=0\.3x^\{\\top\}w=0\.3\)\. Unlike thed=1d=1case of Appendix[D\.1](https://arxiv.org/html/2606.11255#A4.SS1), where there is no direction to decorrelate, orthogonality helps onced\>1d\>1\.

#### Polynomial sketching \(Section[5](https://arxiv.org/html/2606.11255#S5)\)\.

The TensorSketch that RAY uses \(Proposition[A\.8](https://arxiv.org/html/2606.11255#A1.Thmtheorem8)\) sketches only the quadratic term\(x⊤w\)2\(x^\{\\top\}w\)^\{2\}and keeps2bx⊤w\+b22b\\,x^\{\\top\}w\+b^\{2\}exact\. Here we probe the cruder variant that sketches the whole augmented feature\(x,b\)\(x,\\sqrt\{b\}\), replacing the exactdbd\_\{b\}\-dimensional polynomial by aDpolyD\_\{\\mathrm\{poly\}\}\-dimensional sketch \(this also sketches the bias and linear terms, so we do not use it in the main experiments\)\. Its error decays slowly, as∼Dpoly−1/2\\sim D\_\{\\mathrm\{poly\}\}^\{\-1/2\}: atd=10d=10\(db=121d\_\{b\}=121\) a sketch ofDpoly=128D\_\{\\mathrm\{poly\}\}=128gives full\-Gram error0\.1750\.175against the exact feature’s0\.0460\.046; atd=100d=100\(db=10,201d\_\{b\}=10\{,\}201\) a20×20\\timessmaller sketch \(Dpoly=512D\_\{\\mathrm\{poly\}\}=512\) gives0\.1720\.172against0\.0600\.060\. Sketching is thus a controllable memory–accuracy trade, useful only whend2d^\{2\}is prohibitive and some accuracy can be sacrificed; at moderateddthe exact polynomial feature is both cheaper and more accurate\. Thed2d^\{2\}cost is better reduced losslessly by the symmetricd\(d\+1\)/2d\(d\{\+\}1\)/2reduction\.

#### A non\-\\tifinaghfontⵟBernstein–Schur instance \(Theorem[2\.5](https://arxiv.org/html/2606.11255#S2.Thmtheorem5)\)\.

To check that the construction is genuinely about the class and not the\\tifinaghfontⵟ\-kernel alone, we run it unchanged on a different member:k\(x,w\)=\(x⊤w\+b\)3\(‖x−w‖2\+ε\)−αk\(x,w\)=\(x^\{\\top\}w\+b\)^\{3\}\(\\\|x\-w\\\|^\{2\}\+\\varepsilon\)^\{\-\\alpha\}, a degree\-3 polynomial modulation times the generalized IMQ \(α=2\\alpha=2\), whose Bernstein measure isΓ\(α,ε\)\\Gamma\(\\alpha,\\varepsilon\)with shapeα\\alphaand rateε\\varepsilon\. Keeping the cubic modulation exact and samplingT∼Γ\(α,ε\)T\\sim\\Gamma\(\\alpha,\\varepsilon\),ω∼𝒩\(0,2TId\)\\omega\\sim\\mathcal\{N\}\(0,2TI\_\{d\}\), on off\-sphere data \(‖x‖∈\[0\.3,1\.2\]\\\|x\\\|\\in\[0\.3,1\.2\],d=8d=8,N=400N=400\) the estimator is unbiased \(mean of200200estimates atD=50D=50is within0\.0230\.023relative Frobenius error of the exact Gram\) and converges at the Monte\-Carlo rate \(relative error0\.71→0\.0780\.71\\to 0\.078asDDruns10→100010\\to 1000, fitted log\-log slope−0\.48\-0\.48\)\. Only the modulation feature and the mixing law changed from the\\tifinaghfontⵟ\-kernel; the recipe did not\.

#### Sensitivity tobbandε\\varepsilon\.

We sweep the biasb∈\{0,0\.1,1,10\}b\\in\\\{0,0\.1,1,10\\\}and the radial scaleε∈\{0\.25,1,4\}×\\varepsilon\\in\\\{0\.25,1,4\\\}\\times\(median squared distance\) and report the relative Frobenius Gram error of the flat estimator \(D=200D=200,N=300N=300,d=16d=16,55seeds\) for exact, normalized, and sketched modulation \(the last is RAY\) \(Table[15](https://arxiv.org/html/2606.11255#A4.T15)\)\. The exact and normalized estimators stay in a tight0\.060\.06–0\.250\.25band across two orders of magnitude inbband a16×16\\timesrange inε\\varepsilon: the method is insensitive to either, andε=\\varepsilon=median squared distance is a sound default\. RAY adds the expectedO\(m−1/2\)O\(m^\{\-1/2\}\)sketch error \(m=128m=128\), largest at smallbband smallε\\varepsilonwhere the quadratic term dominates the modulation\.

Table 15:Relative Frobenius Gram error vs\. the biasbband radial scaleε\\varepsilon\(as a multiple of the median squared distance\), flat estimator atD=200D=200\(N=300N=300,d=16d=16, mean over55seeds; per\-cell standard deviation≤0\.023\\leq 0\.023for the exact and normalized estimators,≤0\.046\\leq 0\.046for RAY \(sketch\)\)\. The exact and normalized estimators are stable across the grid; RAY adds anO\(m−1/2\)O\(m^\{\-1/2\}\)sketch error, largest where the quadratic modulation dominates \(smallbb, smallε\\varepsilon\)\.
### D\.1Kernel\-value variance and variance reduction

The estimator’s variance falls asO\(1/D\)O\(1/D\), and quasi\-Monte Carlo over the scale lowers it by a constant factor\. We estimatek\\tifinaghfontⵟ,b\(x,w\)k\_\{\\text\{\\tifinaghfont ⵟ\},b\}\(x,w\)for a fixedd=1d=1pair over10001000repetitions \(x=0\.5x=0\.5,w=1w=1,b=ε=1b=\\varepsilon=1,D′=50D^\{\\prime\}=50; the scalar exact polynomial factor isolates the radial Gaussian approximation\), comparing the basic, QMC, orthogonal, and combined estimators in Table[16](https://arxiv.org/html/2606.11255#A4.T16)\.

Table 16:Empirical variance of the biased\-kernel estimator \(×10−3\\times 10^\{\-3\},b=1b=1,ε=1\\varepsilon=1,10001000repetitions\)\.Basic RAY halves its variance with each doubling ofDD\(a4\.5×4\.5\\timesdrop fromD=10D\{=\}10to5050\), and QMC sampling of the radial scale cuts it a further2\.42\.4–3\.2×3\.2\\timesat everyDD\. Orthogonalizing the inner frequencies does nothing here \(in one dimension there is no direction to decorrelate\) but reduces the inner variance by2424–28%28\\%onced\>1d\>1\(Appendix[D](https://arxiv.org/html/2606.11255#A4)\) and composes with QMC\. The Theorem[A\.1](https://arxiv.org/html/2606.11255#A1.Thmtheorem1)bound holds but is loose, its worst\-case Cauchy–Schwarz making the empirical constants far smaller \(atD=100D\{=\}100, bound0\.160\.16vs\. variance0\.0020\.002\)\.

One caveat on QMC: it buys a constant, not a faster rate\. Fitting log\-log RMSE againstD∈\[10,2000\]D\\in\[10,2000\]gives a Monte\-Carlo slope of−0\.502\-0\.502and a QMC slope of−0\.514\-0\.514, the sameO\(1/D\)O\(1/\\sqrt\{D\}\)rate, with QMC a constant∼1\.7×\\sim\\\!1\.7\\timeslower\. The inner Gaussian RFF noise, which is not quasi\-randomized, sets the convergence, so quasi\-randomizing only the one\-dimensional scale cannot change it; theO\(\(log⁡D\)s/D\)O\(\(\\log D\)^\{s\}/D\)rate of Section[3\.3](https://arxiv.org/html/2606.11255#S3.SS3)is the outer integral’s alone\.

### D\.2Outer/inner allocation: flat sampling is optimal

Flat sampling \(one frequency per scale\) is the best use of a fixed budget, confirming Proposition[A\.2](https://arxiv.org/html/2606.11255#A1.Thmtheorem2)\. The estimator splits its randomness betweenDDouter scales andD′D^\{\\prime\}inner frequencies at identical costO\(N2DD′\)O\(N^\{2\}DD^\{\\prime\}\); holding the budgetB=DD′=1000B=DD^\{\\prime\}=1000fixed and varying the split \(d=10d=10,N=400N=400, sphere,ε=b=1\\varepsilon=b=1,44seeds\) gives Table[17](https://arxiv.org/html/2606.11255#A4.T17)\.

Table 17:Relative Frobenius error at a fixed budgetB=DD′=1000B=DD^\{\\prime\}=1000as the inner countD′D^\{\\prime\}varies \(soD=1000/D′D=1000/D^\{\\prime\}\)\. SmallerD′D^\{\\prime\}is better;D′=1D^\{\\prime\}=1\(flat sampling\) wins\.Error rises monotonically withD′D^\{\\prime\}:D′=1D^\{\\prime\}=1is optimal here,3×3\\timesbetter thanD′=50D^\{\\prime\}=50\. This is Proposition[A\.2](https://arxiv.org/html/2606.11255#A1.Thmtheorem2)in numbers: withDD′DD^\{\\prime\}fixed the inner termVin/\(DD′\)V\_\{\\mathrm\{in\}\}/\(DD^\{\\prime\}\)is constant and the outer termVout/DV\_\{\\mathrm\{out\}\}/Dfalls as1/D1/DwhenVout\>0V\_\{\\mathrm\{out\}\}\>0, so the largestDDwins \(ifVout=0V\_\{\\mathrm\{out\}\}=0, all allocations tie\)\. The two\-level scheme is thus an artifact away from that degenerate case: one should drawDDindependent\(tj,ωj\)\(t\_\{j\},\\omega\_\{j\}\)pairs, the flat estimator \([8](https://arxiv.org/html/2606.11255#S2.E8)\) that Section[2\.4](https://arxiv.org/html/2606.11255#S2.SS4)identifies with IMQ random features\. \(Accordingly, the main Gram, dimension, and ridge\-regression experiments all use the recommended flatD′=1D^\{\\prime\}=1; only the variance\-reduction probe of Appendix[D\.1](https://arxiv.org/html/2606.11255#A4.SS1), which isolates the radial Gaussian factor, fixesD′=50D^\{\\prime\}=50\.\)

### D\.3Scalability: clearing theO\(N2\)O\(N^\{2\}\)wall

Random\\tifinaghfontⵟ\-features clear theO\(N2\)O\(N^\{2\}\)bottleneck the construction was built to clear\. We time fitting ridge regression on random targets \(cost, not accuracy\) three ways \(exact kernel ridge \(formKK, solve theN×NN\\times Nsystem\); RAY primal ridge \(featuresZ∈ℝN×MZ\\in\\mathbb\{R\}^\{N\\times M\}withM=Ddb′M=D\\,d\_\{b\}^\{\\prime\}, the symmetric polynomialdb′=d\(d\+1\)/2\+d\+1d\_\{b\}^\{\\prime\}=d\(d\{\+\}1\)/2\+d\+1and flatD′=1D^\{\\prime\}=1, solve theM×MM\\times Msystem\); and Nyström \(mmlandmarks\)\) reporting wall\-clock and the dominant array’s memory \(d=8d=8,D=m=64D=m=64,M=2880M=2880; Figure[7](https://arxiv.org/html/2606.11255#A4.F7)\)\.

![Refer to caption](https://arxiv.org/html/2606.11255v1/x7.png)Figure 7:Cost of fitting ridge regression vs\.NN\(d=8d=8,D=m=64D=m=64, log\-log\)\.\(a\)Wall\-clock: exact ridge steepens at the predicted∼N2\\sim N^\{2\}rate \(fitted exponent2\.12\.1\) and is run only while feasible; RAY and Nyström grow gently\.\(b\)Representation memory: the exactN×NN\\times NGram reaches3333GB byN=64,000N=64\{,\}000\(above the dashed cap, where it no longer fits\), while RAY \(NMNM\) and Nyström \(NmNm\) stay linear inNN\.Exact ridge scales as the predictedO\(N2\)O\(N^\{2\}\)\(fitted time exponent2\.12\.1\) and its Gram matrix reaches3333GB atN=64,000N=64\{,\}000, where it no longer fits in memory; RAY and Nyström, whose representations are linear inNN\(NMNMandNmNm\), run there in1\.51\.5and0\.020\.02seconds\. The honest caveat is the*constant*, not the scaling: RAY’s representationM=Ddb′M=D\\,d\_\{b\}^\{\\prime\}carries the degree\-2 polynomial dimension, so it is heavier than Nyström’smm\(28802880vs6464here\); the symmetric reduction \(used\) and sketching shrinkdb′d\_\{b\}^\{\\prime\}, but a downstream method built on RAY should keepDDand the polynomial dimension modest\.

#### Large\-scale primal training on HIGGS \(full sweep\)\.

The streaming HIGGS demonstration of Section[4\.7](https://arxiv.org/html/2606.11255#S4.SS7)compares, at matched representation dimensionMM, Gaussian RFF \(D=MD\{=\}M\), exact modulation \(D=⌊M/db⌋D\{=\}\\lfloor M/d\_\{b\}\\rfloor\), and RAY \(m=128m\{=\}128,D=⌊M/\(m\+d\+1\)⌋D\{=\}\\lfloor M/\(m\{\+\}d\{\+\}1\)\\rfloor\); the off\-sphere preprocessing standardizes then rescales by the99\.999\.9th norm percentile \(varying norms in the bounded ball\),ε\\varepsilonis the median squared distance,b=1b=1, and the polynomial floor isdb=435d\_\{b\}=435\. Two facts come through \(Table[18](https://arxiv.org/html/2606.11255#A4.T18)\)\. First, the estimator trains at1111M scale as an ordinary streaming primal model: peak memory is flat at8\.58\.5GB, identical to plain random features, because the Gram is never formed\. Second, at matched representation budget RAY dominates exact modulation: the exact polynomial starves the radial factor toD=1D\{=\}1atM=512M\{=\}512, while compression buys33–52×52\\timesmore draws and a consistent AUC gain\. Gaussian RFF leads on AUC, which is expected, HIGGS is a smooth detector\-feature classification with no strong alignment×\\timesproximity coupling for the\\tifinaghfontⵟ\-modulation to exploit\. The claim is not “RAY wins HIGGS” but that compressed Bernstein–Schur features are a practical, memory\-flat, million\-scale estimator, and that compression, not exactness, is the right setting oncedbd\_\{b\}is large\.

Table 18:Streaming primal training on HIGGS \(Ntrain=10\.5N\_\{\\text\{train\}\}=10\.5M,Ntest=500N\_\{\\text\{test\}\}=500k,d=28d=28,db=435d\_\{b\}=435\) at matched representation dimensionMM\. Test AUC \(↑\\uparrow\),DD= radial draws afforded at that budget; best feature method perMMin bold\. Peak memory is flat at8\.58\.5GB across all methods \(no Gram, no full feature matrix\); the entire sweep runs in∼5\\sim 5minutes on one M5 Pro GPU\. Linear baseline AUC0\.6820\.682\. Single run \(one seed\): this is a scalability demonstration, not a statistical AUC comparison\.*Protocol\.*Streaming Adam \(learning rate0\.020\.02, batch size81928192, two passes over the10\.510\.5M training rows\), single precision \(fp32\), one fixed seed; features are rebuilt per mini\-batch on the GPU and discarded\. Reported wall\-clock includes per\-batch feature construction but excludes the one\-time gzip load of the data; the8\.68\.6GB peak \(measured bygetrusage\) is the resident process maximum, the same for all methods since none forms a Gram\. The whole\{512,…,8192\}×\{\\\{512,\\dots,8192\\\}\\times\\\{linear, Gaussian, exact modulation, RAY\}\\\}sweep takes∼260\\sim\\\!260s on one Apple M5 Pro\.

#### Relative\-spectral KRR stability for the sketched estimator\.

The relative\-spectral guarantee \(Theorem[3\.6](https://arxiv.org/html/2606.11255#S3.Thmtheorem6)\) also governs the deployed sketched RAY \(Table[19](https://arxiv.org/html/2606.11255#A4.T19)\)\. The sketch term keepsρ\\rhoabove the exact\-modulation value, so more regularization or budget is needed for the bound to activate, but once it does the coefficient error obeys it\.

Table 19:Relative\-spectral KRR stability for the deployed \(sketched\) RAY \(digits subset,N=600N\{=\}600\)\.ρ=‖A−1/2\(KD,m−K\)A−1/2‖op\\rho=\\\|A^\{\-1/2\}\(K\_\{D,m\}\-K\)A^\{\-1/2\}\\\|\_\{\\mathrm\{op\}\}withA=K\+λIA=K\+\\lambda I; the deterministic bound‖α~−α^‖A≤ρ1−ρ‖α^‖A\\\|\\tilde\{\\alpha\}\-\\hat\{\\alpha\}\\\|\_\{A\}\\leq\\frac\{\\rho\}\{1\-\\rho\}\\\|\\hat\{\\alpha\}\\\|\_\{A\}activates onceρ<1\\rho<1\.ρ\\rhofalls with the radial drawsDD, the sketch sizemm, and the ridgeλ\\lambda; the measured coefficient error stays within the bound throughout\.
#### Bounded\-input preprocessing\.

The unbounded numerator ofk\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}makes a norm bound mandatory \(Limitation \(iii\), Table[20](https://arxiv.org/html/2606.11255#A4.T20)\)\. Raw standardization \(max norm9\.39\.3\) inflates RMSE to0\.370\.37\(exact\) /0\.960\.96\(RAY\); any bounded scheme recovers≤0\.05\\leq 0\.05, confirming Limitation \(iii\)\.

Table 20:Bounded\-input preprocessing \(off\-sphere coupled target,d=16d\{=\}16, KRR test RMSE, mean over33seeds\)\. The unbounded numerator destabilizes both the exact kernel and RAY on raw data; any norm bound restores accuracy; clipping and max\-norm scaling are best\.
#### Sphere\-normalized matched\-cost checks\.

On the unit spherek\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}coincides with a dot\-product kernel, where direct zonal/dot\-product routes are natural baselines; the sphere\-normalized matched\-dimension and fair\-cost results below are therefore sanity checks, with the off\-sphere versions \(Tables[5](https://arxiv.org/html/2606.11255#S4.T5),[7](https://arxiv.org/html/2606.11255#S4.T7)\) the primary evidence\. The ordering is the same: at matched dimension RAY’s TensorSketch recovers most of the efficiency the exact feature loses to itsO\(d2\)O\(d^\{2\}\)size \(Table[21](https://arxiv.org/html/2606.11255#A4.T21)\), and at matched representation the adaptive Nyström variants are most accurate while RAY is the data\-independent streaming option \(Table[22](https://arxiv.org/html/2606.11255#A4.T22)\)\.

Table 21:Matched\-dimension KRR on sphere\-normalizeddigits\(d=64d=64,b=1b=1, accuracy, mean±\\pmstd over33splits\)\. At a fixed explicit feature dimensionMM, exact modulation affords onlyM/dbM/d\_\{b\}radial draws; RAY \(m=128m=128\) compresses the polynomial factor and affords many more, approaching the optimal rank\-MMceiling\.Table 22:Fair\-cost comparison on sphere\-normalizeddigits\(d=64d=64,b=1b=1, accuracy mean±\\pmstd over33splits\) at matched representation dimension\. Memory isN×dimN\\times\\text\{dim\}floats; build is feature/landmark construction wall\-clock\.

## Appendix EExperimental Details

This appendix collects the protocol shared across experiments and the per\-table configuration, so that every number in the paper can be reproduced\. The accompanying code, figures, and a project page are available online\.222[https://www\.tahabouhsine\.com/ray](https://www.tahabouhsine.com/ray)

#### Shared protocol\.

Unless a table states otherwise, the radial scaleε\\varepsilonis set to the median squared pairwise distance of the inputs \(computed on a subsample\), the bias isb=1b=1, and the deployed RAY map is the lower\-variance quadratic\-only TensorSketch \(sketch sizemm,DDradial draws, flatD′=1D^\{\\prime\}=1\): the degree\-2 term is sketched to dimensionmmand the linear/constant terms are kept exact, so the feature dimension isD\(m\+d\+1\)D\(m\{\+\}d\{\+\}1\)\. Kernel ridge regression is solved exactly \(in the primal \(features\) for RAY and RFF, and in the dual \(N×NN\\times N\) for the exact\\tifinaghfontⵟ\-kernel and Nyström\) with the ridgeλ\\lambdalisted per table; “exact modulation” is Definition[2\.3](https://arxiv.org/html/2606.11255#S2.Thmtheorem3)\(them→∞m\\to\\inftylimit\)\. Off\-sphere data is a bounded ball with norms drawn over a stated interval \(e\.g\.‖x‖∈\[0\.3,1\.5\]\\\|x\\\|\\in\[0\.3,1\.5\]\), the regime wherek\\tifinaghfontⵟ,bk\_\{\\text\{\\tifinaghfont ⵟ\},b\}is genuinely non\-dot\-product; sphere\-normalized runs \(Appendix tables\) fixR=1R=1and are sanity checks only\. Reported error bars are mean±\\pmstd over the seed count in each caption\.

#### Hardware and precision\.

All Gram\-error, KRR, operator\-norm, and feature\-build timing experiments run on CPU in double precision \(fp64\) and use no GPU\. The sole GPU experiment is the million\-scale HIGGS streaming run \(Table[18](https://arxiv.org/html/2606.11255#A4.T18)\), which runs in single precision \(fp32\) on one Apple M5 Pro GPU; its full protocol \(streaming Adam, batch81928192, two passes over the training rows, peak resident memory\) is stated inline with the table\. Timing tables therefore compare like for like within a row; absolute wall\-clock is hardware\-specific and reported only to show scaling, not peak throughput\.

Table 23:Per\-table configuration\. “Data” gives the dataset and input dimensiondd; “train/test” the KRR split sizes \(“Gram” or “op\-norm” marks experiments that approximate a matrix rather than fit a predictor, with no split; “build” marks feature\-construction timing only\)\. Grids list the swept hyperparameters;ε\\varepsilonis the median squared distance throughout\. Each table’s own caption restates its seed count and error bars\.
Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Similar Articles

Hierarchical RBF-KAN and RBF-SKAN Architectures for Multidimensional Function Approximation and Random Field Learning

Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization

Perturbative methods for non-parametric instrumental variable

Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

Feature Lottery? A Bifurcation Theory of Concept Emergence

Submit Feedback

Similar Articles

Hierarchical RBF-KAN and RBF-SKAN Architectures for Multidimensional Function Approximation and Random Field Learning
Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization
Perturbative methods for non-parametric instrumental variable
Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity
Feature Lottery? A Bifurcation Theory of Concept Emergence