Graph Dimensionality Reduction for Contextual Bandits: Structure-Specific Regret Bounds under Approximate Smoothness and Noisy Eigenspaces
Summary
Proposes GraphDR-LinUCB, a method for contextual bandits with graph-structured arms that projects features onto the graph's low-frequency spectral subspace. Achieves the first regret bound for spectral-projection-based contextual bandits and demonstrates 15x regret reduction on real datasets over full-dimensional LinUCB.
View Cached Full Text
Cached at: 06/29/26, 05:26 AM
# Graph Dimensionality Reduction for Contextual Bandits: Structure-Specific Regret Bounds under Approximate Smoothness and Noisy Eigenspaces
Source: [https://arxiv.org/html/2606.27917](https://arxiv.org/html/2606.27917)
###### Abstract
Contextual bandits with graph\-structured arms arise in recommendation, citation retrieval, and social advertising, where arms connected on a graph tend to share reward signal\. Standard dimensionality reduction ignores this structure, inflating exploration cost by a factor ofd/kd/k\. We proposeGraphDR\-LinUCB, which projects arm features onto the graph’s low\-frequency spectral subspace and runs linear UCB in the resultingkk\-dimensional space\. We prove the firstO~\(kT\)\\widetilde\{O\}\(k\\sqrt\{T\}\)regret bound for spectral\-projection\-based contextual bandits, reducing dimension dependence fromddtokk; a perturbation argument extends this to noisy graphs, with an explicit penalty for reward\-smoothness mismatch and graph\-estimation error\. Our central theoretical finding is that the high\-frequency reward component need not incur a worst\-case linear\-in\-TTpenalty: its actual cost depends on its realized impact along the played path, not on its total energy\. A simple spectral comparison between subspaces \(Γk\\Gamma\_\{k\}\) predicts which reducer wins on a given dataset, correctly calling five of six real\-dataset outcomes without any fitted threshold\. Across a synthetic benchmark and six real datasets \(MovieLens, Amazon, LastFM, ogbn\-arxiv, MIND\),GraphDR\-LinUCBreduces cumulative regret by15×15\\timesover full\-dimensional LinUCB and outperforms competing graph\-aware methods on five of six; the single failure is precisely where the graph’s spectral subspace is misaligned with the reward\.
## Introduction
Contextual bandits with graph\-structured arms appear broadly in deployed systems\[[25](https://arxiv.org/html/2606.27917#bib.bib37)\]; users connected by social links, products organized by co\-purchase or genre, and documents linked by citation\. In each setting, reward functions are typically smooth over the graph \(nearby nodes share expected reward\), and graph spectral methods are designed precisely to capture this structure\. The natural question is whether projecting arm features onto a low\-frequency Laplacian eigenspace, reducing the exploration dimension fromddtok≪dk\\ll d, can deliver a genuine regret improvement over operating in the full feature space\.
The answer is not obvious\. Standard misspecified\-linear\-bandit theory\[[23](https://arxiv.org/html/2606.27917#bib.bib36),[12](https://arxiv.org/html/2606.27917#bib.bib33)\]charges a worst\-case bias proportional toνkT\\nu\_\{k\}Tfor any deviation of the reward from the projected subspace, which can quickly dominate theT\\sqrt\{T\}exploration gain\. Prior graph\-bandit work sidesteps this by either propagating information along edges\[[13](https://arxiv.org/html/2606.27917#bib.bib34),[5](https://arxiv.org/html/2606.27917#bib.bib29)\]or soft\-regularizing with the graph Laplacian\[[38](https://arxiv.org/html/2606.27917#bib.bib40)\]; neither performs a hard projection that actually reduces the exploration dimension\. Spectral bandits\[[34](https://arxiv.org/html/2606.27917#bib.bib39)\]operate in a graph\-frequency basis but do not analyze noisy eigenspaces or quantify the cost of imperfect graph smoothness\. The theoretical viability of hard graph\-spectral projection for bandits, and its practical conditions, remain uncharacterized\. Moreover, a hard projection is useful only when the reward is smooth on the graph \(a property of the data, not the algorithm\), making a structural falsification test essential: projecting onto a node\-permuted eigenspace leaves the dimension unchanged but destroys graph\-reward alignment, so any genuine gain must collapse\.
To tackle these challenges, we studyGraphDR\-LinUCB\(graph dimensionality reduction\): project each arm’s feature vector onto the bottom\-kkLaplacian eigenspace and run LinUCB\[[1](https://arxiv.org/html/2606.27917#bib.bib26),[3](https://arxiv.org/html/2606.27917#bib.bib28)\]in the resultingkk\-dimensional space, with graph\-shuffle controls on every experiment as a structural falsification test\. We prove exact\-subspace and Davis–Kahan robust regret bounds, develop a structure\-specific residual analysis that replaces the worst\-case misspecification penalty with realized graph quantities, and introduce a spectral selection rule for choosing between graph\-DR and competing reducers without fitting any threshold\.
#### Contributions\.
- •Finite\-sample and robust bounds\.We prove thatGraphDR\-LinUCBachievesO~\(kT\)\\widetilde\{O\}\(k\\sqrt\{T\}\)regret under exact graph smoothness, reducing dimension dependence fromddtokk\. A Davis–Kahan argument extends this to approximate smoothness and noisy graph observation, making the regret cost of eigenspace estimation error explicit through the spectral gapΔk\\Delta\_\{k\}and smoothness tailζk\\zeta\_\{k\}\.
- •Structure\-specific residual theorem\.We show that the high\-frequency residual need not be paid for at the worst\-case linear\-in\-TTrate\. The cost is governed by two realized graph quantities, residual leverage𝖲𝖱𝖫T\\mathsf\{SRL\}\_\{T\}and candidate\-set residual width𝖢𝖱𝖶T\\mathsf\{CRW\}\_\{T\}, which are sublinear when the residual is graph\-benign\. An implementable pilot\-based variant makes this bound computable from data\.
- •Spectral selection rule\.We define the subspace\-capture marginΓk=‖Uk⊤θ‖22−‖P⊤θ‖22\\Gamma\_\{k\}=\\left\\lVert U\_\{k\}^\{\\top\}\\theta\\right\\rVert\_\{2\}^\{2\}\-\\left\\lVert P^\{\\top\}\\theta\\right\\rVert\_\{2\}^\{2\}, a zero\-parameter spectral statistic that predicts which reducer better aligns with the reward without requiring any fitted threshold\. We evaluate it alongside the theory\-derived diagnostic𝒢k\\mathcal\{G\}\_\{k\}and the scalar tailζk\\zeta\_\{k\}\.
- •Empirical evaluation\.We validate the approach on stochastic block model and random\-geometric\-graph synthetics with graph\-shuffle controls on every configuration, and evaluate on six real graph\-bandit datasets spanning recommendation, news, and citation \(MovieLens\-100k/1M, Amazon, LastFM, MIND\-small, ogbn\-arxiv\)\.
## Related Work
#### Linear contextual bandits\.
The LinUCB framework\[[7](https://arxiv.org/html/2606.27917#bib.bib9),[3](https://arxiv.org/html/2606.27917#bib.bib28),[1](https://arxiv.org/html/2606.27917#bib.bib26)\]achievesO~\(dT\)\\widetilde\{O\}\(d\\sqrt\{T\}\)regret fordd\-dimensional linear rewards through elliptical confidence sets and self\-normalized martingale concentration\[[24](https://arxiv.org/html/2606.27917#bib.bib35)\], with a matchingΩ\(dT\)\\Omega\(d\\sqrt\{T\}\)lower bound established by Dani et al\.\[[9](https://arxiv.org/html/2606.27917#bib.bib10)\]\. Beyond UCB, Thompson sampling\[[6](https://arxiv.org/html/2606.27917#bib.bib30),[32](https://arxiv.org/html/2606.27917#bib.bib16)\]provides a Bayesian alternative exploration strategy\.
#### Misspecified linear bandits\.
When the reward is only approximately linear in the chosen features, regret incurs an additional bias term\. Worst\-case analyses\[[23](https://arxiv.org/html/2606.27917#bib.bib36),[12](https://arxiv.org/html/2606.27917#bib.bib33)\]pay an envelope that is linear in the horizon and proportional to the misspecification level, a cost that quickly dominates theT\\sqrt\{T\}exploration term even under mild misspecification\. Sharp horizon\-dependent analyses of KL\-regularized contextual bandits\[[40](https://arxiv.org/html/2606.27917#bib.bib7)\]sharpen these worst\-case envelopes under regularization, though the graph\-structural form of the residual remains unexamined\.
#### Graph bandits\.
Online clustering of bandits\[[13](https://arxiv.org/html/2606.27917#bib.bib34)\]propagates reward information across a user graph by sharing estimates between neighboring arms\. Laplacian\-regularized LinUCB\[[38](https://arxiv.org/html/2606.27917#bib.bib40)\]adds a graph\-smoothness penalty to the ridge objective, biasing the estimator toward functions that vary slowly over the graph\. The graph\-feedback bandit setting of Mannor and Shamir\[[26](https://arxiv.org/html/2606.27917#bib.bib18)\]observes neighboring arm rewards as side information; recent work extends this to similar\-arm graphs\[[30](https://arxiv.org/html/2606.27917#bib.bib2),[17](https://arxiv.org/html/2606.27917#bib.bib3)\]and interference structures\[[18](https://arxiv.org/html/2606.27917#bib.bib1)\]\. Wang et al\.\[[36](https://arxiv.org/html/2606.27917#bib.bib23)\]combine low\-rank structure and graph Laplacian regularization for matrix contextual bandits via soft penalization; the present paper instead uses hard spectral projection and derives an explicit spectral\-gap regret bound\.
#### Spectral bandits\.
Spectral bandits\[[34](https://arxiv.org/html/2606.27917#bib.bib39)\]design algorithms for payoffs that are smooth functions on a graph, with regret governed by a spectral effective dimension that reflects the concentration of the payoff in low\-frequency graph coordinates\. The algorithm operates in the full graph\-frequency basis with a per\-eigenvalue smoothness weight rather than a hard truncation at a fixed dimensionkk\.
#### Graph signal processing and spectral embeddings\.
Laplacian eigenvectors as low\-frequency graph coordinates originate in graph signal processing\[[33](https://arxiv.org/html/2606.27917#bib.bib38)\]and underpin Laplacian Eigenmaps\[[4](https://arxiv.org/html/2606.27917#bib.bib12)\]and spectral clustering\[[27](https://arxiv.org/html/2606.27917#bib.bib11),[35](https://arxiv.org/html/2606.27917#bib.bib13)\]\. Our contribution is not a new embedding method but the first bandit regret analysis tying hard spectral projection to exploration cost\.
#### Model selection and adaptivekk\.
Corralling methods\[[2](https://arxiv.org/html/2606.27917#bib.bib27)\]achieve regret competitive with the best algorithm in a pool of base learners, and model selection for contextual bandits more broadly has been studied via smoothed meta\-algorithms\[[28](https://arxiv.org/html/2606.27917#bib.bib19)\], providing a plausible path to adaptive\-kkGraphDR; a tight graph\-specific theorem is left to future work\.
#### Dimension reduction in bandits\.
Random projection\[[39](https://arxiv.org/html/2606.27917#bib.bib8)\]reduces the feature dimension before running a linear bandit, with regret depending on the projected dimension\. Bilinear bandits\[[20](https://arxiv.org/html/2606.27917#bib.bib17)\]exploit low\-rank reward structure via a two\-stage subspace\-exploration approach achievingO~\(\(d1\+d2\)3/2rT\)\\widetilde\{O\}\(\(d\_\{1\}\{\+\}d\_\{2\}\)^\{3/2\}\\sqrt\{rT\}\)regret, and tight two\-to\-infinity subspace recovery methods\[[19](https://arxiv.org/html/2606.27917#bib.bib22)\]sharpen these bounds for matrix bandits\. Recent work handles non\-stationary subspaces: Khosravi and Huo\[[21](https://arxiv.org/html/2606.27917#bib.bib24)\]proveO~\(rT\)\\widetilde\{O\}\(r\\sqrt\{T\}\)dynamic regret for piecewise\-stationary low\-rank linear bandits, and Dai et al\.\[[8](https://arxiv.org/html/2606.27917#bib.bib25)\]attain rank\-dependent regret for adversarial contextual bandits with low\-rank experts in a model\-routing setting\. Beyond subspace methods, matrix sketching for high\-dimensional linear bandits\[[37](https://arxiv.org/html/2606.27917#bib.bib4)\]shows adaptive sketch sizing is needed to avoid linear regret under heavy spectral tails\. In multi\-agent and distributed settings, the spectral gap of the communication graph directly controls the regret of cooperative bandit algorithms\[[29](https://arxiv.org/html/2606.27917#bib.bib5),[31](https://arxiv.org/html/2606.27917#bib.bib6)\]\.
None of the above uses a graph\-structural subspace as the feature projector or ties bandit regret to the reward\-graph spectral gap\. The present paper provides the first such analysis, characterizing when projecting arm features onto the Laplacian eigenspace reduces exploration cost and when it does not\.
## Setup and Algorithm
#### Graph and Laplacian\.
LetGGbe an undirected graph with symmetric normalized LaplacianL=I−D−1/2AD−1/2L=I\-D^\{\-1/2\}AD^\{\-1/2\}, whose eigenvalues satisfy0=λ1≤λ2≤⋯≤λn≤20=\\lambda\_\{1\}\\leq\\lambda\_\{2\}\\leq\\cdots\\leq\\lambda\_\{n\}\\leq 2\. We assume nonnegative edge weights and no isolated nodes, adding self\-loops where needed so thatDii\>0D\_\{ii\}\>0\. We further assume a positive separating eigengapΔk=λk\+1−λk\>0\\Delta\_\{k\}=\\lambda\_\{k\+1\}\-\\lambda\_\{k\}\>0, so that the bottom\-kkeigenspace is well defined\. LetEk∈ℝd×kE\_\{k\}\\in\\mathbb\{R\}^\{d\\times k\}be an orthonormal basis for the feature\-space low\-frequency subspace, with projectorΠk=EkEk⊤\\Pi\_\{k\}=E\_\{k\}E\_\{k\}^\{\\top\}\. In the node\-arm, one\-hot\-feature case,d=nd=nandEk=UkE\_\{k\}=U\_\{k\}, whereUkU\_\{k\}collects thekklowest\-frequency Laplacian eigenvectors\. Throughout the theory,UkU\_\{k\}*includes*the constant eigenvectoru1u\_\{1\}\(λ1=0\\lambda\_\{1\}=0on connected graphs\); experiments shift tou2,…,uk\+1u\_\{2\},\\dots,u\_\{k\+1\}and report that convention explicitly\. The analysis is for the node\-arm settingd=nd=n,xt,a=eax\_\{t,a\}=e\_\{a\},Ek=UkE\_\{k\}=U\_\{k\}\.
#### Bandit interaction\.
At each roundtt, the learner observes a finite arm set𝒜t\\mathcal\{A\}\_\{t\}, where arma∈𝒜ta\\in\\mathcal\{A\}\_\{t\}has featurext,a∈ℝdx\_\{t,a\}\\in\\mathbb\{R\}^\{d\}\. After choosingata\_\{t\}, it receives rewardyt=xt,at⊤θ⋆\+ηty\_\{t\}=x\_\{t,a\_\{t\}\}^\{\\top\}\\theta^\{\\star\}\+\\eta\_\{t\}, whereθ⋆∈ℝd\\theta^\{\\star\}\\in\\mathbb\{R\}^\{d\}is unknown andηt\\eta\_\{t\}is conditionally sub\-Gaussian noise\. In practice, the clean Laplacian may be unavailable\. We letL^\\widehat\{L\}denote an observed or estimated Laplacian andE^k∈ℝd×k\\widehat\{E\}\_\{k\}\\in\\mathbb\{R\}^\{d\\times k\}an orthonormal basis for its bottom\-kkeigenspace, and setΠ^k=E^kE^k⊤\\widehat\{\\Pi\}\_\{k\}=\\widehat\{E\}\_\{k\}\\widehat\{E\}\_\{k\}^\{\\top\}andP^=E^k⊤\\widehat\{P\}=\\widehat\{E\}\_\{k\}^\{\\top\}\. The projected feature iszt,a=E^k⊤xt,a∈ℝkz\_\{t,a\}=\\widehat\{E\}\_\{k\}^\{\\top\}x\_\{t,a\}\\in\\mathbb\{R\}^\{k\}\.
Algorithm 1GraphDR\-LinUCBInput: Observed LaplacianL^\\widehat\{L\}; arm\-feature oraclext,ax\_\{t,a\}; dimensionkk; regularizationλ\>0\\lambda\>0; confidence sequence\(βt\)t≥1\(\\beta\_\{t\}\)\_\{t\\geq 1\} Output: Pulled armsa1,…,aTa\_\{1\},\\dots,a\_\{T\}
1:Compute bottom\-
kkorthonormal eigenbasis
E^k∈ℝd×k\\widehat\{E\}\_\{k\}\\in\\mathbb\{R\}^\{d\\times k\}of
L^\\widehat\{L\}\.
2:Initialize
V1←λIkV\_\{1\}\\leftarrow\\lambda I\_\{k\},
b1←𝟎∈ℝkb\_\{1\}\\leftarrow\\mathbf\{0\}\\in\\mathbb\{R\}^\{k\}\.
3:for
t=1,2,…,Tt=1,2,\\dots,Tdo
4:Compute
α^t←Vt−1bt\\widehat\{\\alpha\}\_\{t\}\\leftarrow V\_\{t\}^\{\-1\}b\_\{t\}\.
5:Observe candidate set
𝒜t\\mathcal\{A\}\_\{t\}; project each arm:
zt,a←E^k⊤xt,az\_\{t,a\}\\leftarrow\\widehat\{E\}\_\{k\}^\{\\top\}x\_\{t,a\}for all
a∈𝒜ta\\in\\mathcal\{A\}\_\{t\}\.
6:Pull
at←argmaxa∈𝒜t\[zt,a⊤α^t\+βt‖zt,a‖Vt−1\]a\_\{t\}\\leftarrow\\arg\\max\_\{a\\in\\mathcal\{A\}\_\{t\}\}\\\!\\left\[z\_\{t,a\}^\{\\top\}\\widehat\{\\alpha\}\_\{t\}\+\\beta\_\{t\}\\left\\lVert z\_\{t,a\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\\right\]\.
7:Observe reward
yty\_\{t\}\.
8:
Vt\+1←Vt\+zt,atzt,at⊤V\_\{t\+1\}\\leftarrow V\_\{t\}\+z\_\{t,a\_\{t\}\}z\_\{t,a\_\{t\}\}^\{\\top\};
bt\+1←bt\+zt,atytb\_\{t\+1\}\\leftarrow b\_\{t\}\+z\_\{t,a\_\{t\}\}y\_\{t\}\.
9:endfor
#### Algorithm\.
GraphDR\-LinUCB\(Algorithm[1](https://arxiv.org/html/2606.27917#alg1); pipeline in Figure[1](https://arxiv.org/html/2606.27917#Sx3.F1)\) runs ordinary linear UCB in thekk\-dimensional projected space; the projectionP^\\widehat\{P\}is the only thing that distinguishes it from standard LinUCB\.
GraphGGLaplacianLLbasisE^k\\widehat\{E\}\_\{k\}projectz=E^k⊤xz=\\widehat\{E\}\_\{k\}^\{\\top\}xLinUCBinℝk\\mathbb\{R\}^\{k\}pullata\_\{t\}residualr⟂,k=\(I−Π^k\)θ⋆r\_\{\\perp,k\}=\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\theta^\{\\star\}controls𝖲𝖱𝖫T,𝖢𝖱𝖶T\\mathsf\{SRL\}\_\{T\},\\ \\mathsf\{CRW\}\_\{T\}Figure 1:GraphDR\-LinUCB\. The graph determines a bottom\-kklow\-frequency Laplacian eigenbasisE^k\\widehat\{E\}\_\{k\}; arm features are projected toℝk\\mathbb\{R\}^\{k\}and LinUCB runs there\. Only the high\-frequency residualr⟂,kr\_\{\\perp,k\}left outside the subspace can hurt the learner; its effect is governed by the realized graph quantities𝖲𝖱𝖫T\\mathsf\{SRL\}\_\{T\}and𝖢𝖱𝖶T\\mathsf\{CRW\}\_\{T\}\. The shuffle control permutes node labels before computingE^k\\widehat\{E\}\_\{k\}, preserving the projected dimension but destroying graph alignment and eliminating the regret advantage\.
## Theory
We present three theorems in order of increasing sharpness\. Table[1](https://arxiv.org/html/2606.27917#Sx4.T1)summarizes all three regimes; proofs are in the appendix section\.
Table 1:Summary of the four theoretical regimes\. The structure\-specific bound \(Thm\.[5](https://arxiv.org/html/2606.27917#Thmtheorem5)\) tightens the generic worst\-case linear bias to realized graph quantities𝖲𝖱𝖫T\\mathsf\{SRL\}\_\{T\}and𝖢𝖱𝖶T\\mathsf\{CRW\}\_\{T\}\(Corollary[9](https://arxiv.org/html/2606.27917#Thmtheorem9)\); the estimated\-residual variant \(Thm\.[7](https://arxiv.org/html/2606.27917#Thmtheorem7)\) makes it implementable\.### Exact known\-subspace bound
When the reward is perfectly smooth on the graph, projection onto the Laplacian eigenspace is lossless and the problem reduces exactly to akk\-dimensional linear bandit\.
###### Assumption 1\(Exact graph smoothness\)\.
There existsα⋆∈ℝk\\alpha^\{\\star\}\\in\\mathbb\{R\}^\{k\}such thatθ⋆=Ekα⋆\\theta^\{\\star\}=E\_\{k\}\\alpha^\{\\star\}\. Equivalently,θ⋆∈range\(Πk\)\\theta^\{\\star\}\\in\\operatorname\{range\}\(\\Pi\_\{k\}\)\.
Under Assumption[1](https://arxiv.org/html/2606.27917#Thmassumption1),zt,a=Ek⊤xt,az\_\{t,a\}=E\_\{k\}^\{\\top\}x\_\{t,a\}satisfies
xt,a⊤θ⋆=xt,a⊤Ekα⋆=\(Ek⊤xt,a\)⊤α⋆=zt,a⊤α⋆,x\_\{t,a\}^\{\\top\}\\theta^\{\\star\}=x\_\{t,a\}^\{\\top\}E\_\{k\}\\alpha^\{\\star\}=\(E\_\{k\}^\{\\top\}x\_\{t,a\}\)^\{\\top\}\\alpha^\{\\star\}=z\_\{t,a\}^\{\\top\}\\alpha^\{\\star\},so the projected problem is an exactkk\-dimensional linear bandit and the standard LinUCB analysis applies verbatim inℝk\\mathbb\{R\}^\{k\}\.
###### Assumption 2\(Bandit regularity\)\.
For allttanda∈𝒜ta\\in\\mathcal\{A\}\_\{t\},‖zt,a‖2≤Lx\\left\\lVert z\_\{t,a\}\\right\\rVert\_\{2\}\\leq L\_\{x\}\. The projected parameter satisfies‖α⋆‖2≤S\\left\\lVert\\alpha^\{\\star\}\\right\\rVert\_\{2\}\\leq S\. The noise sequence is conditionallyRR\-sub\-Gaussian with respect to the filtrationℱt\\mathcal\{F\}\_\{t\}generated by the history before observingyty\_\{t\}:𝔼\[exp\(sηt\)∣ℱt\]≤exp\(s2R2/2\)\\mathbb\{E\}\[\\exp\(s\\eta\_\{t\}\)\\mid\\mathcal\{F\}\_\{t\}\]\\leq\\exp\(s^\{2\}R^\{2\}/2\)for everys∈ℝs\\in\\mathbb\{R\}\. The arm sets and features may be chosen adaptively from the past but are fixed beforeηt\\eta\_\{t\}is realized\.
Define pseudo\-regret
Reg\(T\)=∑t=1T\(maxa∈𝒜txt,a⊤θ⋆−xt,at⊤θ⋆\)\\mathrm\{Reg\}\(T\)=\\sum\_\{t=1\}^\{T\}\\big\(\\max\_\{a\\in\\mathcal\{A\}\_\{t\}\}x\_\{t,a\}^\{\\top\}\\theta^\{\\star\}\-x\_\{t,a\_\{t\}\}^\{\\top\}\\theta^\{\\star\}\\big\)\.
###### Theorem 1\(Reduced\-dimension regret under exact smoothness\)\.
Under Assumptions[1](https://arxiv.org/html/2606.27917#Thmassumption1)and[2](https://arxiv.org/html/2606.27917#Thmassumption2), suppose the algorithm uses the true projectionEk⊤E\_\{k\}^\{\\top\}, regularizationλ≥Lx2\\lambda\\geq L\_\{x\}^\{2\}, and confidence radius
βt=Rklog\(1\+\(t−1\)Lx2λk\)\+2log1δ\+λS\.\\beta\_\{t\}=R\\sqrt\{k\\log\\\!\\left\(1\+\\tfrac\{\(t\-1\)L\_\{x\}^\{2\}\}\{\\lambda k\}\\right\)\+2\\log\\tfrac\{1\}\{\\delta\}\}\+\\sqrt\{\\lambda\}S\.Then, with probability at least1−δ1\-\\delta, simultaneously for allT≥1T\\geq 1,
Reg\(T\)≤2βT\+12Tklog\(1\+TLx2λk\)\.\\mathrm\{Reg\}\(T\)\\leq 2\\beta\_\{T\+1\}\\sqrt\{2Tk\\log\\\!\\left\(1\+\\tfrac\{TL\_\{x\}^\{2\}\}\{\\lambda k\}\\right\)\}\.Consequently, for fixedR,Lx,S,λ,δR,L\_\{x\},S,\\lambda,\\delta,Reg\(T\)=O~\(kT\)\\mathrm\{Reg\}\(T\)=\\widetilde\{O\}\(k\\sqrt\{T\}\)\.
###### Corollary 2\(Comparison with full\-dimensional LinUCB\)\.
Full\-dimensional LinUCB inℝd\\mathbb\{R\}^\{d\}has rateO~\(dT\)\\widetilde\{O\}\(d\\sqrt\{T\}\)\. Under exact graph smoothness andk≪dk\\ll d,GraphDR\-LinUCBreduces the dimension dependence fromddtokk\.
### Robust gap\-dependent bound
In practice, two idealizations of Theorem[1](https://arxiv.org/html/2606.27917#Thmtheorem1)fail simultaneously: the reward may carry energy outside the firstkkgraph frequencies \(ζk\>0\\zeta\_\{k\}\>0\), and the eigenspace is estimated from a noisy Laplacian \(εL\>0\\varepsilon\_\{L\}\>0\)\. The following handles both\. The key tool is a projector\-perturbation bound from Davis and Kahan\.
###### Lemma 3\(Davis–Kahan projector perturbation\[[10](https://arxiv.org/html/2606.27917#bib.bib31)\]\)\.
AssumeLLandL^\\widehat\{L\}are symmetric and‖L^−L‖2≤εL≤Δk/4\\left\\lVert\\widehat\{L\}\-L\\right\\rVert\_\{2\}\\leq\\varepsilon\_\{L\}\\leq\\Delta\_\{k\}/4\. Then‖Π^k−Πk‖2≤4εL/Δk\\left\\lVert\\widehat\{\\Pi\}\_\{k\}\-\\Pi\_\{k\}\\right\\rVert\_\{2\}\\leq 4\\varepsilon\_\{L\}/\\Delta\_\{k\}, and consequently‖\(I−Π^k\)Πk‖2≤4εL/Δk\\left\\lVert\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\Pi\_\{k\}\\right\\rVert\_\{2\}\\leq 4\\varepsilon\_\{L\}/\\Delta\_\{k\}\.
###### Assumption 3\(Approximate smoothness and noisy graph\)\.
Letrk=\(I−Πk\)θ⋆r\_\{k\}=\(I\-\\Pi\_\{k\}\)\\theta^\{\\star\}\. Assume‖Πkθ⋆‖2≤Sk\\left\\lVert\\Pi\_\{k\}\\theta^\{\\star\}\\right\\rVert\_\{2\}\\leq S\_\{k\}and‖rk‖2≤ζk\\left\\lVert r\_\{k\}\\right\\rVert\_\{2\}\\leq\\zeta\_\{k\}\. The ambient features satisfy‖xt,a‖2≤Lx\\left\\lVert x\_\{t,a\}\\right\\rVert\_\{2\}\\leq L\_\{x\}for allt,at,a\. The noise is conditionallyRR\-sub\-Gaussian as in Assumption[2](https://arxiv.org/html/2606.27917#Thmassumption2), and the arm sets are chosen before the current noise is realized\.
###### Theorem 4\(Generic robust regret bound\)\.
Assume Lemma[3](https://arxiv.org/html/2606.27917#Thmtheorem3)and Assumption[3](https://arxiv.org/html/2606.27917#Thmassumption3)\. RunGraphDR\-LinUCBwithP^=E^k⊤\\widehat\{P\}=\\widehat\{E\}\_\{k\}^\{\\top\}andλ≥Lx2\\lambda\\geq L\_\{x\}^\{2\}, with confidence radiusβt=βt0\+νk2\(t−1\)klog\(1\+\(t−1\)Lx2/\(λk\)\)\\beta\_\{t\}=\\beta\_\{t\}^\{0\}\+\\nu\_\{k\}\\sqrt\{2\(t\-1\)k\\log\(1\+\(t\-1\)L\_\{x\}^\{2\}/\(\\lambda k\)\)\}\(equivalentlyβt0\+νkAT\\beta\_\{t\}^\{0\}\+\\nu\_\{k\}A\_\{T\}for known horizonTT\), which requiresνk\\nu\_\{k\}known or upper\-bounded\. Define
γk\\displaystyle\\gamma\_\{k\}=4εLΔk,\\displaystyle=\\frac\{4\\varepsilon\_\{L\}\}\{\\Delta\_\{k\}\},νk\\displaystyle\\nu\_\{k\}=Lx\(ζk\+Skγk\)=Lx\(ζk\+4SkεLΔk\),\\displaystyle=L\_\{x\}\(\\zeta\_\{k\}\+S\_\{k\}\\gamma\_\{k\}\)=L\_\{x\}\\\!\\left\(\\zeta\_\{k\}\+\\tfrac\{4S\_\{k\}\\varepsilon\_\{L\}\}\{\\Delta\_\{k\}\}\\right\),S¯k\\displaystyle\\overline\{S\}\_\{k\}=Sk\+ζk\.\\displaystyle=S\_\{k\}\+\\zeta\_\{k\}\.βt0=Rklog\(1\+\(t−1\)Lx2λk\)\+2log1δ\+λS¯k,\\displaystyle\\beta\_\{t\}^\{0\}=R\\sqrt\{k\\log\\\!\\left\(1\+\\tfrac\{\(t\-1\)L\_\{x\}^\{2\}\}\{\\lambda k\}\\right\)\+2\\log\\tfrac\{1\}\{\\delta\}\}\+\\sqrt\{\\lambda\}\\,\\overline\{S\}\_\{k\},AT=2Tklog\(1\+TLx2λk\)\.\\displaystyle A\_\{T\}=\\sqrt\{2Tk\\log\\\!\\left\(1\+\\tfrac\{TL\_\{x\}^\{2\}\}\{\\lambda k\}\\right\)\}\.
Then, with probability at least1−δ1\-\\delta, simultaneously for allT≥1T\\geq 1,
Reg\(T\)≤2βT\+10AT\+2νkAT2\+2νkT=2βT\+10AT\+4νkTklog\(1\+TLx2λk\)\+2νkT\.\\mathrm\{Reg\}\(T\)\\leq 2\\beta\_\{T\+1\}^\{0\}A\_\{T\}\+2\\nu\_\{k\}A\_\{T\}^\{2\}\+2\\nu\_\{k\}T\\\\ =2\\beta\_\{T\+1\}^\{0\}A\_\{T\}\+4\\nu\_\{k\}Tk\\log\\\!\\left\(1\+\\tfrac\{TL\_\{x\}^\{2\}\}\{\\lambda k\}\\right\)\+2\\nu\_\{k\}T\.Suppressing logarithmic factors,Reg\(T\)=O~\(kT\+Lx\(ζk\+SkεL/Δk\)kT\)\\mathrm\{Reg\}\(T\)=\\widetilde\{O\}\\\!\\big\(k\\sqrt\{T\}\+L\_\{x\}\(\\zeta\_\{k\}\+S\_\{k\}\\varepsilon\_\{L\}/\\Delta\_\{k\}\)kT\\big\)\. Ifζk=0\\zeta\_\{k\}=0andεL=0\\varepsilon\_\{L\}=0, thenνk=0\\nu\_\{k\}=0and the bound reduces toO~\(kT\)\\widetilde\{O\}\(k\\sqrt\{T\}\)\. Ifνk\\nu\_\{k\}is unknown it must be replaced by an upper bound or selected by a doubling or grid wrapper; otherwise the result is an oracle bound\.
TheεL/Δk\\varepsilon\_\{L\}/\\Delta\_\{k\}dependence is the key takeaway: a large spectral gap protects against graph noise, while a small gap amplifies it\.
### Main result: structure\-specific residual regret
The generic bound of Theorem[4](https://arxiv.org/html/2606.27917#Thmtheorem4)uses only the scalar envelope\|ξt,a\|≤νk\|\\xi\_\{t,a\}\|\\leq\\nu\_\{k\}, treating the high\-frequency residual\(I−Π^k\)θ⋆\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\theta^\{\\star\}as an arbitrary adversary\. In graph problems this residual is not arbitrary: it is a high\-frequency graph signal, and it harms the learner only through the arms that actually appear and the low\-frequency coordinates that are actually played\. The following theorem retains this structure instead of collapsing it intoνk\\nu\_\{k\}\.
Letr⟂,k=\(I−Π^k\)θ⋆r\_\{\\perp,k\}=\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\theta^\{\\star\}andξt\(a\)=xt,a⊤r⟂,k\\xi\_\{t\}\(a\)=x\_\{t,a\}^\{\\top\}r\_\{\\perp,k\}\.
###### Definition 1\(Residual leverage and candidate residual width\)\.
For a realized trajectory, define the*residual leverage*
𝖲𝖱𝖫T\(r⟂,k\)=max1≤t≤T\+1‖∑s=1t−1zs,asξs\(as\)‖Vt−1,\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\)=\\max\_\{1\\leq t\\leq T\+1\}\\left\\lVert\\textstyle\\sum\_\{s=1\}^\{t\-1\}z\_\{s,a\_\{s\}\}\\,\\xi\_\{s\}\(a\_\{s\}\)\\right\\rVert\_\{V\_\{t\}^\{\-1\}\},and the*cumulative candidate residual width*
𝖢𝖱𝖶T\(r⟂,k\)=∑t=1T\(maxa∈𝒜tξt\(a\)−mina∈𝒜tξt\(a\)\)\.\\mathsf\{CRW\}\_\{T\}\(r\_\{\\perp,k\}\)=\\sum\_\{t=1\}^\{T\}\\Big\(\\max\_\{a\\in\\mathcal\{A\}\_\{t\}\}\\xi\_\{t\}\(a\)\-\\min\_\{a\\in\\mathcal\{A\}\_\{t\}\}\\xi\_\{t\}\(a\)\\Big\)\.
𝖲𝖱𝖫T\\mathsf\{SRL\}\_\{T\}is the self\-normalized bias that the high\-frequency residual injects into the regression updates;𝖢𝖱𝖶T\\mathsf\{CRW\}\_\{T\}is the amount by which the residual can change the ordering of arms within each candidate set\. In the node\-arm case \(xt,a=eax\_\{t,a\}=e\_\{a\},r⟂,k=U\>kc\>kr\_\{\\perp,k\}=U\_\{\>k\}c\_\{\>k\}\), both are determined by where the high\-frequency graph signal lives and which arms the bandit actually compares: a strictly finer characterization than the scalar tailζk\\zeta\_\{k\}\.
###### Theorem 5\(A posteriori oracle residual bound\)\.
Assume the noise and bounded projected\-feature conditions in Assumption[2](https://arxiv.org/html/2606.27917#Thmassumption2)\. Fix a horizonTTand supposeGraphDR\-LinUCBis run withE^k\\widehat\{E\}\_\{k\},λ≥Lx2\\lambda\\geq L\_\{x\}^\{2\}, and confidence radiusβt=βt0\+BT\\beta\_\{t\}=\\beta\_\{t\}^\{0\}\+B\_\{T\}withBT≥𝖲𝖱𝖫T\(r⟂,k\)B\_\{T\}\\geq\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\), where
βt0=Rklog\(1\+\(t−1\)Lx2λk\)\+2log1δ\+λ‖E^k⊤θ⋆‖2,\\displaystyle\\beta\_\{t\}^\{0\}=R\\sqrt\{k\\log\\\!\\left\(1\+\\tfrac\{\(t\-1\)L\_\{x\}^\{2\}\}\{\\lambda k\}\\right\)\+2\\log\\tfrac\{1\}\{\\delta\}\}\+\\sqrt\{\\lambda\}\\,\\left\\lVert\\widehat\{E\}\_\{k\}^\{\\top\}\\theta^\{\\star\}\\right\\rVert\_\{2\},AT=2Tklog\(1\+TLx2λk\)\.\\displaystyle A\_\{T\}=\\sqrt\{2Tk\\log\\\!\\left\(1\+\\tfrac\{TL\_\{x\}^\{2\}\}\{\\lambda k\}\\right\)\}\.Then, with probability at least1−δ1\-\\delta,
Reg\(T\)≤2βT\+10AT\+2BTAT\+𝖢𝖱𝖶T\(r⟂,k\)\.\\mathrm\{Reg\}\(T\)\\leq 2\\beta\_\{T\+1\}^\{0\}A\_\{T\}\+2B\_\{T\}A\_\{T\}\+\\mathsf\{CRW\}\_\{T\}\(r\_\{\\perp,k\}\)\.In particular, the oracle choiceBT=𝖲𝖱𝖫T\(r⟂,k\)B\_\{T\}=\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\)yieldsReg\(T\)≤2βT\+10AT\+2𝖲𝖱𝖫T\(r⟂,k\)AT\+𝖢𝖱𝖶T\(r⟂,k\)\\mathrm\{Reg\}\(T\)\\leq 2\\beta\_\{T\+1\}^\{0\}A\_\{T\}\+2\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\)A\_\{T\}\+\\mathsf\{CRW\}\_\{T\}\(r\_\{\\perp,k\}\)\.
This is an a posteriori oracle inequality:BTB\_\{T\}depends on the unknownθ⋆\\theta^\{\\star\}and the realized trajectory\. Proposition[6](https://arxiv.org/html/2606.27917#Thmtheorem6)below characterizes exactly what is lost when the confidence radius is not inflated, explaining why an estimate of𝖲𝖱𝖫T\\mathsf\{SRL\}\_\{T\}is needed for an implementable algorithm\.
###### Proposition 6\(What happens without residual\-radius inflation\)\.
If the same algorithm is run with the smaller radiusβt0\\beta\_\{t\}^\{0\}only, the same proof gives the valid bound
Reg\(T\)≤2βT\+10AT\+𝖲𝖱𝖫T\(r⟂,k\)AT\+T𝖲𝖱𝖫T\(r⟂,k\)\+𝖢𝖱𝖶T\(r⟂,k\)\.\\mathrm\{Reg\}\(T\)\\leq 2\\beta\_\{T\+1\}^\{0\}A\_\{T\}\+\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\)A\_\{T\}\\\\ \+T\\,\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\)\+\\mathsf\{CRW\}\_\{T\}\(r\_\{\\perp,k\}\)\.Thus the clean2𝖲𝖱𝖫TAT2\\mathsf\{SRL\}\_\{T\}A\_\{T\}oracle term requires either a known upper bound on𝖲𝖱𝖫T\\mathsf\{SRL\}\_\{T\}or an estimated\-residual confidence inflation\. Without inflation, the comparator arm is not the played arm, and its uncertainty cannot be collapsed by the elliptical\-potential lemma\.
The oracle bound becomes implementable by replacing the unknown residual with a pilot estimate\.
###### Theorem 7\(Estimated\-residual horizon oracle bound\)\.
Letr^\\widehat\{r\}be a pilot estimate ofr⟂,kr\_\{\\perp,k\}obtained from data independent of the bandit noise used in the regret run, with‖r^−r⟂,k‖2≤er\\left\\lVert\\widehat\{r\}\-r\_\{\\perp,k\}\\right\\rVert\_\{2\}\\leq e\_\{r\}and‖xt,a‖2≤Lx\\left\\lVert x\_\{t,a\}\\right\\rVert\_\{2\}\\leq L\_\{x\}\. Defineξ^t\(a\)=xt,a⊤r^\\widehat\{\\xi\}\_\{t\}\(a\)=x\_\{t,a\}^\{\\top\}\\widehat\{r\}and a computable leverage boundB^T≥max1≤t≤T\+1‖∑s<tzs,asξ^s\(as\)‖Vt−1\\widehat\{B\}\_\{T\}\\geq\\max\_\{1\\leq t\\leq T\+1\}\\left\\lVert\\sum\_\{s<t\}z\_\{s,a\_\{s\}\}\\widehat\{\\xi\}\_\{s\}\(a\_\{s\}\)\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\. RunGraphDR\-LinUCBwith radiusβt=βt0\+B^T\+LxerAT\\beta\_\{t\}=\\beta\_\{t\}^\{0\}\+\\widehat\{B\}\_\{T\}\+L\_\{x\}e\_\{r\}A\_\{T\}\. Then, on the same high\-probability event as Theorem[5](https://arxiv.org/html/2606.27917#Thmtheorem5),
Reg\(T\)≤2βT\+10AT\+2\(B^T\+LxerAT\)AT\+𝖢𝖱𝖶^T\(r^\)\+2LxerT,\\mathrm\{Reg\}\(T\)\\leq 2\\beta\_\{T\+1\}^\{0\}A\_\{T\}\+2\(\\widehat\{B\}\_\{T\}\+L\_\{x\}e\_\{r\}A\_\{T\}\)A\_\{T\}\\\\ \+\\widehat\{\\mathsf\{CRW\}\}\_\{T\}\(\\widehat\{r\}\)\+2L\_\{x\}e\_\{r\}T,where𝖢𝖱𝖶^T\(r^\)=∑t=1T\(maxaxt,a⊤r^−minaxt,a⊤r^\)\\widehat\{\\mathsf\{CRW\}\}\_\{T\}\(\\widehat\{r\}\)=\\sum\_\{t=1\}^\{T\}\(\\max\_\{a\}x\_\{t,a\}^\{\\top\}\\widehat\{r\}\-\\min\_\{a\}x\_\{t,a\}^\{\\top\}\\widehat\{r\}\)\. To be genuinely online, use the predictable radiiβt=βt0\+B^t\+LxerAt\\beta\_\{t\}=\\beta\_\{t\}^\{0\}\+\\widehat\{B\}\_\{t\}\+L\_\{x\}e\_\{r\}A\_\{t\}withB^t=maxu≤t‖∑s<uzsξ^s\(as\)‖Vu−1\\widehat\{B\}\_\{t\}=\\max\_\{u\\leq t\}\\left\\lVert\\sum\_\{s<u\}z\_\{s\}\\widehat\{\\xi\}\_\{s\}\(a\_\{s\}\)\\right\\rVert\_\{V\_\{u\}^\{\-1\}\}; the stated horizon\-TTform is the special case att=Tt=T\.
Two corollaries bound the range of behaviour\. First, the generic theorem is only a loose special case of Theorem[5](https://arxiv.org/html/2606.27917#Thmtheorem5)\.
###### Corollary 8\(Generic theorem as a loose special case\)\.
If\|ξt\(a\)\|≤νk\|\\xi\_\{t\}\(a\)\|\\leq\\nu\_\{k\}for allt,at,a, then𝖲𝖱𝖫T\(r⟂,k\)≤νkAT\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\)\\leq\\nu\_\{k\}A\_\{T\}and𝖢𝖱𝖶T\(r⟂,k\)≤2νkT\\mathsf\{CRW\}\_\{T\}\(r\_\{\\perp,k\}\)\\leq 2\\nu\_\{k\}T\. Substituting these into the oracle version of Theorem[5](https://arxiv.org/html/2606.27917#Thmtheorem5)withBT=νkATB\_\{T\}=\\nu\_\{k\}A\_\{T\}recovers Theorem[4](https://arxiv.org/html/2606.27917#Thmtheorem4), up to the harmless replacement of‖E^k⊤θ⋆‖2\\left\\lVert\\widehat\{E\}\_\{k\}^\{\\top\}\\theta^\{\\star\}\\right\\rVert\_\{2\}byS¯k\\overline\{S\}\_\{k\}\.
###### Corollary 9\(Benign residuals yield sublinear extra bias\)\.
On any problem sequence with𝖲𝖱𝖫T\(r⟂,k\)=O\(klogT\)\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\)=O\(\\sqrt\{k\\log T\}\)and𝖢𝖱𝖶T\(r⟂,k\)=O\(T\)\\mathsf\{CRW\}\_\{T\}\(r\_\{\\perp,k\}\)=O\(\\sqrt\{T\}\), the oracle\-inflated algorithm of Theorem[5](https://arxiv.org/html/2606.27917#Thmtheorem5)givesReg\(T\)=O~\(kT\)\+O~\(kT\)\+O\(T\)=O~\(kT\)\\mathrm\{Reg\}\(T\)=\\widetilde\{O\}\(k\\sqrt\{T\}\)\+\\widetilde\{O\}\(k\\sqrt\{T\}\)\+O\(\\sqrt\{T\}\)=\\widetilde\{O\}\(k\\sqrt\{T\}\), since𝖲𝖱𝖫TAT=O~\(kT\)\\mathsf\{SRL\}\_\{T\}A\_\{T\}=\\widetilde\{O\}\(k\\sqrt\{T\}\), so the high\-frequency residual contributes only sublinear regret\. These conditions hold, for example, when the residual is nearly common\-mode inside most candidate sets and its product with the played low\-frequency features exhibits self\-normalized cancellation rather than worst\-case alignment\.
## Synthetic Experiments
We validate the dimension\-reduction mechanism on controlled synthetic graphs where the reward subspace is known by construction\. Arms are graph nodes with one\-hot features \(d=nd=n\), andθ⋆=Ukα⋆/‖Ukα⋆‖\\theta^\{\\star\}=U\_\{k\}\\alpha^\{\\star\}/\\left\\lVert U\_\{k\}\\alpha^\{\\star\}\\right\\rVertuses the true bottom\-kknontrivial Laplacian eigenvectors, so the reward is graph\-smooth by design\. All methods compared are: Random; LinUCB\-full \(d=nd=n\); principal component analysis \(PCA\)\+LinUCB; Johnson\-Lindenstrauss \(JL\)\+LinUCB;GraphDR\+LinUCB; andGraphDR\-shuffled, the graph\-shuffle control that projects with the bottom\-kknontrivial eigenvectors of a node\-permuted copy of the same graph\. The shuffle control has the same projected dimension as GraphDR but a graph\-misaligned eigenspace; any genuine gain must collapse when it is applied, and in every experiment below, it does\. RegretR\(T\)R\(T\)is the mean±\\pmstandard error over88seeds; LinUCB uses Sherman\-Morrison updates\.
#### Main comparison\.
On a stochastic block model \(SBM\) withn=d=200n=d=200,C=k⋆=5C=k^\{\\star\}=5communities, andT=20,000T=20\{,\}000,GraphDR\-LinUCBreduces regret by more than15×15\\timesrelative to full LinUCB and69×69\\timesrelative to PCA \(Figure[2](https://arxiv.org/html/2606.27917#Sx5.F2)\):
Figure 2:Main synthetic comparison with the graph\-shuffle control\.GraphDR\+LinUCB dominates; the shuffled eigenspace collapses toward the random\-policy level\.The shuffle control collapses toward the random\-policy level \(957\.8957\.8\), showing the gain is structural rather than merely dimensional\. PCA performs poorly because one\-hot features are isotropic and carry no variance signal\.
#### Dimension scaling\.
Fixingk⋆=5k^\{\\star\}=5and scalingd=nd=nfrom6060to800800, GraphDR regret stays nearly flat \(from22\.222\.2to40\.640\.6, a1\.8×1\.8\\timesgrowth\) while full\-dimensional LinUCB grows steeply \(from183\.0183\.0to852\.1852\.1, a4\.7×4\.7\\timesgrowth\), matching the predictedkk\-versus\-ddexploration cost of Corollary[2](https://arxiv.org/html/2606.27917#Thmtheorem2)\(Figure[3](https://arxiv.org/html/2606.27917#Sx5.F3)\)\.
Figure 3:Dimension scaling at fixedk⋆=5k^\{\\star\}=5\. GraphDR\-LinUCB remains nearly flat asd=nd=ngrows; full LinUCB grows with ambient dimension\.
#### Second graph family\.
On a random geometric graph \(RGG,n=200n=200, radius0\.160\.16\) the same pattern holds: GraphDR attainsR\(T\)=18\.1R\(T\)=18\.1against417\.8417\.8for full LinUCB and over20002000for PCA, JL, and Random, with the shuffle control collapsing to2544\.62544\.6\(Figure[4](https://arxiv.org/html/2606.27917#Sx5.F4)\)\.
Figure 4:Second graph family: random geometric graph\. GraphDR dominates; the graph\-shuffle control collapses\.
#### kk\-sweep\.
Sweeping the projected dimension at fixedk⋆=5k^\{\\star\}=5on the SBM confirms that regret is minimized exactly atk=5k=5and rises on both sides: too few dimensions omit signal; too many reintroduce exploration cost\. The regret atk=1,2,3,𝟓,8,12,20k=1,2,3,\\mathbf\{5\},8,12,20is419\.9,384\.2,98\.6,28\.7,39\.5,55\.4,69\.5419\.9,384\.2,98\.6,\\mathbf\{28\.7\},39\.5,55\.4,69\.5\(Figure[5](https://arxiv.org/html/2606.27917#Sx5.F5)\)\.
Figure 5:Regret versus projected dimensionkk\. The minimum is at the true spectral dimensionk⋆=5k^\{\\star\}=5\.
#### Noise stress test and robust theorem validation\.
WhenGGis observed under edge\-flip noise,kkmust be selected from a noisy spectrum\. The bootstrap and naive eigengap estimators return identicalk^\\widehat\{k\}at every noise level; under heavier noisek^\\widehat\{k\}degrades \(4→2→14\\\!\\to\\\!2\\\!\\to\\\!1\) with measurable regret cost\. A complementary stress test directly probes the misspecification envelopeνk\\nu\_\{k\}by measuringζk\\zeta\_\{k\}andεL\\varepsilon\_\{L\}in each run: regret rises monotonically with both terms \(pooled Spearman0\.680\.68across1616runs\)\.
Table 2:Cumulative regretR\(T=20,000\)R\(T=20\{,\}000\)across six real graph\-bandit datasets \(top\-1500 nodes each,55seeds; lowest inbold\)\.GraphDR\-LinUCBbeats full LinUCB on five of six datasets, failing only on ogbn\-arxiv where the citation\-graph eigenspace is misaligned with the subject\-label reward\. The graph\-shuffle control is always near\-random\. GraphDR wins on4/64/6against PCA; the scalar tailζk\\zeta\_\{k\}partially but not fully predicts the pattern\.Table 3:Cumulative regretR\(T\)R\(T\)against graph\-aware bandits \(T=20000T\{=\}20000,55seeds synthetic; lowest per row inbold\)\.GraphDR\-LinUCBbeats both SpectralUCB and Laplacian\-regularized LinUCB on the synthetic benchmark and on five of six real datasets\. The single exception is ogbn\-arxiv \(ζk=0\.96\\zeta\_\{k\}\{=\}0\.96\): even soft graph\-aware methods beat hard bottom\-kkprojection when the eigenspace is misaligned with the reward\. The advantage of hard spectral truncation over soft eigenvalue\-penalized smoothing is that truncation entirely removes the exploration cost of high\-frequency directions rather than merely down\-weighting it\.
## Real\-Data Benchmarks
We evaluate on six real graph\-bandit datasets: four recommendation graphs \(MovieLens\-100k/1M\[[25](https://arxiv.org/html/2606.27917#bib.bib37)\], Amazon Digital Music, LastFM/HetRec\) and two non\-recommendation graphs \(MIND\-small news\[[11](https://arxiv.org/html/2606.27917#bib.bib32)\], ogbn\-arxiv citation\[[16](https://arxiv.org/html/2606.27917#bib.bib20)\]\)\. Items, artists, papers, or news articles are arms; features are node indicators; the graph is built from co\-rating, co\-listening, citation, or category and co\-click relations; and rewards are calibrated from real ratings, plays, labels, or click\-through statistics\. All runs report cumulative regret atT=20,000T=20\{,\}000over55seeds\. Full dataset construction, per\-dataset MovieLens results, and the measured spectral quantities \(k^\\widehat\{k\}, eigengap, andζk\\zeta\_\{k\}\) are in the appendix\.
#### Cross\-dataset summary\.
Table[2](https://arxiv.org/html/2606.27917#Sx5.T2)showsGraphDR\-LinUCBbeats full LinUCB on five of six datasets; the exception is ogbn\-arxiv, where the citation\-graph eigenspace is misaligned with the subject\-label reward \(ζk=0\.96\\zeta\_\{k\}=0\.96\)\. The GraphDR\-versus\-PCA comparison is more nuanced: GraphDR leads on four datasets \(MovieLens\-100k, Amazon, LastFM, MIND\), while PCA wins on MovieLens\-1M and ogbn\-arxiv\. At scale and when the reward is only weakly smooth, learned content variance can carry signal that PCA exploits better than the graph\. The scalar reward tailζk\\zeta\_\{k\}partially but not fully predicts this: ogbn\-arxiv \(0\.960\.96\) is GraphDR’s clearest failure, yet MIND has the largest tail \(0\.990\.99\) and GraphDR wins decisively\. The structure\-specific theorem explains the discrepancy; what matters is not total high\-frequency energy but whether the residual creates large candidate\-set width or residual leverage along the played path\.
#### Comparison against graph\-aware baselines\.
The decisive test for a graph\-DR contribution is whether the algorithm beats bandits that already exploit the graph\. Table[3](https://arxiv.org/html/2606.27917#Sx5.T3)compares against SpectralUCB\[[34](https://arxiv.org/html/2606.27917#bib.bib39)\]: LinUCB in the full Laplacian eigenbasis with a per\-eigenvalue smoothness penalty and no hard truncation, and Laplacian\-regularized LinUCB\[[38](https://arxiv.org/html/2606.27917#bib.bib40)\]: full\-ddLinUCB with graph\-smoothness regularization replacing ridge\.
GraphDR\-LinUCB’s advantage holds against graph\-aware competitors on every graph\-smooth dataset; on the misaligned ogbn\-arxiv \(ζk=0\.96\\zeta\_\{k\}=0\.96\), Laplacian\-regularized LinUCB is the safer choice\.
To isolate the projection map from the feature source, a matched experiment replaces content\-PCA with graph\-only PCA \(top\-kksingular vectors ofAA, the same graph\-only input GraphDR receives\): both graph\-spectral maps beat graph\-blind baselines on most datasets, and GraphDR and graph\-PCA split33–33given identical input, confirming the advantage is structural\.
#### Diagnostic\.
A theory\-motivated diagnostic𝒢k\\mathcal\{G\}\_\{k\}based on the residual quantities of Theorem[5](https://arxiv.org/html/2606.27917#Thmtheorem5)was evaluated but failed to outperformζk\\zeta\_\{k\}on the signed gap; we replace it with the computableΓk\\Gamma\_\{k\}below\.
### A working selection rule: the subspace\-capture margin
The structure\-specific theorem says a reducer hurts the learner only through the reward energy it fails to keep\. The GraphDR\-vs\-PCA contest is therefore most directly predicted not by the absolute graph tailζk\\zeta\_\{k\}, but by a*comparison*of how much reward energy each reducer’s subspace captures\. LetUkU\_\{k\}be GraphDR’s bottom\-kknontrivial Laplacian eigenbasis andPPthe competing reducer’skk\-dimensional basis \(content\-PCA here\)\. Define thesubspace\-capture margin
Γk\\displaystyle\\Gamma\_\{k\}\\;=‖Uk⊤θ‖22−‖P⊤θ‖22\\displaystyle=\\;\\left\\lVert U\_\{k\}^\{\\top\}\\theta\\right\\rVert\_\{2\}^\{2\}\\;\-\\;\\left\\lVert P^\{\\top\}\\theta\\right\\rVert\_\{2\}^\{2\}=‖Uk⊤θ‖22⏟graph low\-freq energy kept−‖P⊤θ‖22⏟competitor energy kept,\\displaystyle=\\;\\underbrace\{\\left\\lVert U\_\{k\}^\{\\top\}\\theta\\right\\rVert\_\{2\}^\{2\}\}\_\{\\text\{graph low\-freq energy kept\}\}\\;\-\\;\\underbrace\{\\left\\lVert P^\{\\top\}\\theta\\right\\rVert\_\{2\}^\{2\}\}\_\{\\text\{competitor energy kept\}\},the \(squared, unit\-norm\-reward\) reward energy GraphDR’s subspace captures minus that the competitor captures\.Γk\>0\\Gamma\_\{k\}\>0predicts GraphDR;Γk<0\\Gamma\_\{k\}<0predicts the competitor\. Unlikeζk\\zeta\_\{k\}, which only sees GraphDR’s own tail,Γk\\Gamma\_\{k\}is a head\-to\-head alignment statistic and is computable from the same pilot reward estimateθ^\\widehat\{\\theta\}the diagnostic𝒢k\\mathcal\{G\}\_\{k\}already requires\.
Validated on all six real datasets, the fitted\-threshold rule is leave\-one\-out correct on6/66/6; the zero\-parameter sign ruleΓk\>0\\Gamma\_\{k\}\>0is correct on5/65/6, with the lone miss at the near\-tieΓk=−0\.008\\Gamma\_\{k\}=\-0\.008\(MIND\-small\)\.
Against the dead diagnostic𝒢k\\mathcal\{G\}\_\{k\}\(\|ρs\|≈0\.11\|\\rho\_\{s\}\|\\approx 0\.11,2/62/6leave\-one\-out\) and the scalar tailζk\\zeta\_\{k\}\(2/62/6\),Γk\\Gamma\_\{k\}is the clear headline rule; the eigengapΔk\\Delta\_\{k\}correlates at0\.830\.83but reaches only4/64/6leave\-one\-out and lacks mechanistic justification for the head\-to\-head comparison\.
## Conclusion
We presentedGraphDR\-LinUCB, a reduction from graph\-smooth contextual bandits tokk\-dimensional linear bandits via Laplacian eigenspace projection\. The central theoretical insight is that the high\-frequency reward component need not cost a linear\-in\-TTpenalty: its actual cost is governed by two realized graph quantities, residual leverage and candidate\-set residual width, rather than by the tail energyζk\\zeta\_\{k\}alone\. This structure\-specific view replaces worst\-case misspecification theory with a finer, graph\-aware accounting, and the subspace\-capture marginΓk\\Gamma\_\{k\}translates this accounting into a practical, threshold\-free selection rule\. Graph\-shuffle controls on every configuration confirm the gain is structural, and synthetic and real\-data experiments validate the approach across recommendation, news, and citation graphs\.
## References
- \[1\]Y\. Abbasi\-Yadkori, D\. Pál, and C\. Szepesvári\(2011\)Improved algorithms for linear stochastic bandits\.InAdvances in Neural Information Processing Systems,Cited by:[Introduction](https://arxiv.org/html/2606.27917#Sx1.p3.2),[Linear contextual bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px1.p1.3)\.
- \[2\]A\. Agarwal, H\. Luo, B\. Neyshabur, and R\. E\. Schapire\(2017\)Corralling a band of bandit algorithms\.InConference on Learning Theory,Cited by:[Appendix K](https://arxiv.org/html/2606.27917#A11.p3.6),[Model selection and adaptivekk\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px6.p1.1)\.
- \[3\]P\. Auer\(2002\)Using confidence bounds for exploitation\-exploration trade\-offs\.Journal of Machine Learning Research3,pp\. 397–422\.Cited by:[Introduction](https://arxiv.org/html/2606.27917#Sx1.p3.2),[Linear contextual bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px1.p1.3)\.
- \[4\]M\. Belkin and P\. Niyogi\(2003\)Laplacian eigenmaps for dimensionality reduction and data representation\.Neural Computation15\(6\),pp\. 1373–1396\.External Links:[Document](https://dx.doi.org/10.1162/089976603321780317)Cited by:[Graph signal processing and spectral embeddings\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px5.p1.1)\.
- \[5\]N\. Cesa\-Bianchi, C\. Gentile, and G\. Zappella\(2013\)A gang of bandits\.InAdvances in Neural Information Processing Systems,Cited by:[Introduction](https://arxiv.org/html/2606.27917#Sx1.p2.2)\.
- \[6\]O\. Chapelle and L\. Li\(2011\)An empirical evaluation of Thompson sampling\.InAdvances in Neural Information Processing Systems,Cited by:[Linear contextual bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px1.p1.3)\.
- \[7\]W\. Chu, L\. Li, L\. Reyzin, and R\. E\. Schapire\(2011\)Contextual bandits with linear payoff functions\.InInternational Conference on Artificial Intelligence and Statistics,Cited by:[Linear contextual bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px1.p1.3)\.
- \[8\]Y\. Dai, N\. Golrezaei, and P\. Jaillet\(2026\)Policy regret for embedding model routing: contextual bandits with low\-rank experts\.External Links:2606\.14929Cited by:[Dimension reduction in bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px7.p1.2)\.
- \[9\]V\. Dani, T\. P\. Hayes, and S\. M\. Kakade\(2008\)Stochastic linear optimization under bandit feedback\.InConference on Learning Theory,Cited by:[Linear contextual bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px1.p1.3)\.
- \[10\]C\. Davis and W\. M\. Kahan\(1970\)The rotation of eigenvectors by a perturbation\. III\.SIAM Journal on Numerical Analysis7\(1\),pp\. 1–46\.Cited by:[Lemma 3](https://arxiv.org/html/2606.27917#Thmtheorem3)\.
- \[11\]M\. Dudík, J\. Langford, and L\. Li\(2011\)Doubly robust policy evaluation and learning\.InInternational Conference on Machine Learning,Cited by:[Real\-Data Benchmarks](https://arxiv.org/html/2606.27917#Sx6.p1.4)\.
- \[12\]D\. J\. Foster, C\. Gentile, M\. Mohri, and J\. Zimmert\(2020\)Adapting to misspecification in contextual bandits\.InAdvances in Neural Information Processing Systems,Cited by:[Introduction](https://arxiv.org/html/2606.27917#Sx1.p2.2),[Misspecified linear bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px2.p1.1)\.
- \[13\]C\. Gentile, S\. Li, and G\. Zappella\(2014\)Online clustering of bandits\.InInternational Conference on Machine Learning,Cited by:[Introduction](https://arxiv.org/html/2606.27917#Sx1.p2.2),[Graph bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px3.p1.1)\.
- \[14\]B\. Ghojogh, A\. Ghodsi, F\. Karray, and M\. Crowley\(2021\)Laplacian\-based dimensionality reduction including spectral clustering, Laplacian eigenmap, locality preserving projection, graph embedding, and diffusion map: tutorial and survey\.External Links:2106\.02154Cited by:[Appendix K](https://arxiv.org/html/2606.27917#A11.p3.6)\.
- \[15\]W\. L\. Hamilton, R\. Ying, and J\. Leskovec\(2017\)Inductive representation learning on large graphs\.External Links:1706\.02216Cited by:[Appendix K](https://arxiv.org/html/2606.27917#A11.p3.6)\.
- \[16\]W\. Hu, M\. Fey, M\. Zitnik, Y\. Dong, H\. Ren, B\. Liu, M\. Catasta, and J\. Leskovec\(2020\)Open graph benchmark: datasets for machine learning on graphs\.External Links:2005\.00687Cited by:[Real\-Data Benchmarks](https://arxiv.org/html/2606.27917#Sx6.p1.4)\.
- \[17\]R\. Huang and Z\. Huang\(2025\)Nearly tight bounds for cross\-learning contextual bandits with graphical feedback\.arXiv preprint arXiv:2502\.04678\.Cited by:[Graph bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px3.p1.1)\.
- \[18\]F\. Jamshidi, M\. Shahverdikondori, and N\. Kiyavash\(2025\)Graph\-dependent regret bounds in multi\-armed bandits with interference\.External Links:2503\.07555Cited by:[Graph bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px3.p1.1)\.
- \[19\]Y\. Jedra, W\. Réveillard, S\. Stojanovic, and A\. Proutiere\(2024\)Low\-rank bandits via tight two\-to\-infinity singular subspace recovery\.External Links:2402\.15739Cited by:[Dimension reduction in bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px7.p1.2)\.
- \[20\]K\. Jun, R\. Willett, S\. Wright, and R\. Nowak\(2019\)Bilinear bandits with low\-rank structure\.External Links:1901\.02470Cited by:[Dimension reduction in bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px7.p1.2)\.
- \[21\]H\. Khosravi and X\. Huo\(2026\)Catching a moving subspace: low\-rank bandits beyond stationarity\.External Links:2605\.20269Cited by:[Dimension reduction in bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px7.p1.2)\.
- \[22\]T\. N\. Kipf and M\. Welling\(2017\)Semi\-supervised classification with graph convolutional networks\.External Links:1609\.02907Cited by:[Appendix K](https://arxiv.org/html/2606.27917#A11.p3.6)\.
- \[23\]T\. Lattimore, C\. Szepesvári, and G\. Weisz\(2019\)Learning with good feature representations in bandits and in RL with a generative model\.InInternational Conference on Machine Learning,Cited by:[Introduction](https://arxiv.org/html/2606.27917#Sx1.p2.2),[Misspecified linear bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px2.p1.1)\.
- \[24\]T\. Lattimore and C\. Szepesvári\(2020\)Bandit algorithms\.Cambridge University Press\.Cited by:[Linear contextual bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px1.p1.3)\.
- \[25\]L\. Li, W\. Chu, J\. Langford, and R\. E\. Schapire\(2010\)A contextual\-bandit approach to personalized news article recommendation\.InInternational Conference on World Wide Web,Cited by:[Introduction](https://arxiv.org/html/2606.27917#Sx1.p1.2),[Real\-Data Benchmarks](https://arxiv.org/html/2606.27917#Sx6.p1.4)\.
- \[26\]S\. Mannor and O\. Shamir\(2011\)From bandits to experts: on the value of side\-observations\.External Links:1106\.2436Cited by:[Graph bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px3.p1.1)\.
- \[27\]A\. Y\. Ng, M\. I\. Jordan, and Y\. Weiss\(2001\)On spectral clustering: analysis and an algorithm\.InAdvances in Neural Information Processing Systems,Cited by:[Graph signal processing and spectral embeddings\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px5.p1.1)\.
- \[28\]A\. Pacchiano, M\. Phan, Y\. Abbasi\-Yadkori, A\. Rao, J\. Zimmert, T\. Lattimore, and C\. Szepesvari\(2020\)Model selection in contextual stochastic bandit problems\.External Links:2003\.01704Cited by:[Model selection and adaptivekk\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px6.p1.1)\.
- \[29\]P\. Paschalidis, R\. Zhang, and N\. Li\(2024\)Cooperative multi\-agent graph bandits: UCB algorithm and regret analysis\.InAmerican Control Conference \(ACC\),External Links:[Document](https://dx.doi.org/10.23919/ACC60939.2024.10644845)Cited by:[Dimension reduction in bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px7.p1.2)\.
- \[30\]H\. Qi, F\. Guo, L\. Zhu, Q\. Zhang, and X\. Li\(2025\)Graph feedback bandits on similar arms: with and without graph structures\.arXiv preprint arXiv:2501\.14314\.Cited by:[Graph bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px3.p1.1)\.
- \[31\]H\. Qiu, M\. Zhang, and N\. Cesa\-Bianchi\(2026\)Near\-optimal regret for distributed adversarial bandits: a black\-box approach\.arXiv preprint arXiv:2602\.06404\.Cited by:[Dimension reduction in bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px7.p1.2)\.
- \[32\]D\. Russo, B\. Van Roy, A\. Kazerouni, I\. Osband, and Z\. Wen\(2017\)A tutorial on thompson sampling\.External Links:1707\.02038Cited by:[Linear contextual bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px1.p1.3)\.
- \[33\]D\. I\. Shuman, S\. K\. Narang, P\. Frossard, A\. Ortega, and P\. Vandergheynst\(2013\)The emerging field of signal processing on graphs\.IEEE Signal Processing Magazine30\(3\),pp\. 83–98\.Cited by:[Graph signal processing and spectral embeddings\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px5.p1.1)\.
- \[34\]M\. Valko, R\. Munos, B\. Kveton, and T\. Kocák\(2014\)Spectral bandits for smooth graph functions\.InInternational Conference on Machine Learning,Cited by:[Introduction](https://arxiv.org/html/2606.27917#Sx1.p2.2),[Spectral bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px4.p1.1),[Comparison against graph\-aware baselines\.](https://arxiv.org/html/2606.27917#Sx6.SSx3.SSS0.Px2.p1.1)\.
- \[35\]U\. von Luxburg\(2007\)A tutorial on spectral clustering\.External Links:0711\.0189Cited by:[Graph signal processing and spectral embeddings\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px5.p1.1)\.
- \[36\]Y\. Wang, J\. Li, Y\. Kang, S\. Gao, and Z\. Xiao\(2025\)Generalized low\-rank matrix contextual bandits with graph information\.External Links:2507\.17528Cited by:[Graph bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px3.p1.1)\.
- \[37\]D\. Wen, H\. Yin, X\. Zhang, P\. Zhao, L\. Zhang, and Z\. Wei\(2024\)Revisiting matrix sketching in linear bandits: achieving sublinear regret via dyadic block sketching\.arXiv preprint arXiv:2410\.10258\.Cited by:[Dimension reduction in bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px7.p1.2)\.
- \[38\]K\. Yang, L\. Toni, and X\. Dong\(2020\)Laplacian\-regularized graph bandits: algorithms and theoretical analysis\.InInternational Conference on Artificial Intelligence and Statistics,Cited by:[Introduction](https://arxiv.org/html/2606.27917#Sx1.p2.2),[Graph bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px3.p1.1),[Comparison against graph\-aware baselines\.](https://arxiv.org/html/2606.27917#Sx6.SSx3.SSS0.Px2.p1.1)\.
- \[39\]X\. Yu\(2019\)Contextual bandits with random projection\.External Links:1903\.08600Cited by:[Dimension reduction in bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px7.p1.2)\.
- \[40\]H\. Zhao, C\. Ye, Q\. Gu, and T\. Zhang\(2024\)Sharp analysis for KL\-regularized contextual bandits and RLHF\.External Links:2411\.04625Cited by:[Misspecified linear bandits\.](https://arxiv.org/html/2606.27917#Sx2.SS0.SSS0.Px2.p1.1)\.
## Appendix AProofs
This appendix contains the proofs of all results stated in the Theory section, in the order they appear\. Each proof begins with a brief sketch of the key step to orient before the formal argument\.
###### Proof of Theorem[1](https://arxiv.org/html/2606.27917#Thmtheorem1)\.
*Sketch\.*The projectionEk⊤E\_\{k\}^\{\\top\}makes the problem exactlykk\-dimensional; the standard self\-normalized concentration and elliptical potential arguments then apply directly inℝk\\mathbb\{R\}^\{k\}\.
Letzt=zt,atz\_\{t\}=z\_\{t,a\_\{t\}\}\. Under Assumption[1](https://arxiv.org/html/2606.27917#Thmassumption1),yt=zt⊤α⋆\+ηty\_\{t\}=z\_\{t\}^\{\\top\}\\alpha^\{\\star\}\+\\eta\_\{t\}\. The ridge estimate satisfies
α^t−α⋆=Vt−1∑s=1t−1zsηs−λVt−1α⋆\.\\widehat\{\\alpha\}\_\{t\}\-\\alpha^\{\\star\}=V\_\{t\}^\{\-1\}\\sum\_\{s=1\}^\{t\-1\}z\_\{s\}\\eta\_\{s\}\-\\lambda V\_\{t\}^\{\-1\}\\alpha^\{\\star\}\.Taking theVtV\_\{t\}\-norm and usingVt⪰λIkV\_\{t\}\\succeq\\lambda I\_\{k\},
‖α^t−α⋆‖Vt≤‖∑s=1t−1zsηs‖Vt−1\+λS\.\\left\\lVert\\widehat\{\\alpha\}\_\{t\}\-\\alpha^\{\\star\}\\right\\rVert\_\{V\_\{t\}\}\\leq\\left\\lVert\\textstyle\\sum\_\{s=1\}^\{t\-1\}z\_\{s\}\\eta\_\{s\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\+\\sqrt\{\\lambda\}S\.The self\-normalized martingale inequality gives, with probability at least1−δ1\-\\delta, uniformly overtt,
‖∑s=1t−1zsηs‖Vt−1≤R2log\(det\(Vt\)1/2det\(λIk\)1/2δ\)\.\\left\\lVert\\textstyle\\sum\_\{s=1\}^\{t\-1\}z\_\{s\}\\eta\_\{s\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\\leq R\\sqrt\{2\\log\\\!\\left\(\\tfrac\{\\det\(V\_\{t\}\)^\{1/2\}\}\{\\det\(\\lambda I\_\{k\}\)^\{1/2\}\\delta\}\\right\)\}\.Since‖zs‖2≤Lx\\left\\lVert z\_\{s\}\\right\\rVert\_\{2\}\\leq L\_\{x\},logdet\(Vt\)det\(λIk\)≤klog\(1\+\(t−1\)Lx2λk\)\\log\\frac\{\\det\(V\_\{t\}\)\}\{\\det\(\\lambda I\_\{k\}\)\}\\leq k\\log\(1\+\\frac\{\(t\-1\)L\_\{x\}^\{2\}\}\{\\lambda k\}\), so‖α^t−α⋆‖Vt≤βt\\left\\lVert\\widehat\{\\alpha\}\_\{t\}\-\\alpha^\{\\star\}\\right\\rVert\_\{V\_\{t\}\}\\leq\\beta\_\{t\}for alltton this event\. For any arm, Cauchy–Schwarz gives\|zt,a⊤\(α^t−α⋆\)\|≤βt‖zt,a‖Vt−1\|z\_\{t,a\}^\{\\top\}\(\\widehat\{\\alpha\}\_\{t\}\-\\alpha^\{\\star\}\)\|\\leq\\beta\_\{t\}\\left\\lVert z\_\{t,a\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}, so the UCB is optimistic\. Lettingat⋆a\_\{t\}^\{\\star\}be optimal, the standard optimism argument yieldsxt,at⋆⊤θ⋆−xt,at⊤θ⋆≤2βt‖zt‖Vt−1x\_\{t,a\_\{t\}^\{\\star\}\}^\{\\top\}\\theta^\{\\star\}\-x\_\{t,a\_\{t\}\}^\{\\top\}\\theta^\{\\star\}\\leq 2\\beta\_\{t\}\\left\\lVert z\_\{t\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\. Summing and using monotonicity ofβt\\beta\_\{t\},Reg\(T\)≤2βT\+1∑t=1T‖zt‖Vt−1\\mathrm\{Reg\}\(T\)\\leq 2\\beta\_\{T\+1\}\\sum\_\{t=1\}^\{T\}\\left\\lVert z\_\{t\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\. Becauseλ≥Lx2\\lambda\\geq L\_\{x\}^\{2\},‖zt‖Vt−12≤1\\left\\lVert z\_\{t\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}^\{2\}\\leq 1, and the elliptical potential lemma gives∑t=1T‖zt‖Vt−1≤2Tklog\(1\+TLx2λk\)\\sum\_\{t=1\}^\{T\}\\left\\lVert z\_\{t\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\\leq\\sqrt\{2Tk\\log\(1\+\\frac\{TL\_\{x\}^\{2\}\}\{\\lambda k\}\)\}\. Substitution proves the theorem\. ∎
###### Proof of Lemma[3](https://arxiv.org/html/2606.27917#Thmtheorem3)\.
*Sketch\.*Weyl’s inequality controls the eigenvalue shift; Davis–Kahan then bounds the projector rotation by the shift divided by the gap\.
By Weyl’s inequality, each eigenvalue ofL^\\widehat\{L\}differs from the corresponding eigenvalue ofLLby at mostεL\\varepsilon\_\{L\}, so the separation between the perturbed bottom\-kkcluster and the rest is at leastΔk−2εL≥Δk/2\\Delta\_\{k\}\-2\\varepsilon\_\{L\}\\geq\\Delta\_\{k\}/2\. The Davis–KahansinΘ\\sin\\Thetatheorem for symmetric matrices gives‖Π^k−Πk‖2≤2‖L^−L‖2Δk−2εL≤4εL/Δk\\left\\lVert\\widehat\{\\Pi\}\_\{k\}\-\\Pi\_\{k\}\\right\\rVert\_\{2\}\\leq\\frac\{2\\left\\lVert\\widehat\{L\}\-L\\right\\rVert\_\{2\}\}\{\\Delta\_\{k\}\-2\\varepsilon\_\{L\}\}\\leq 4\\varepsilon\_\{L\}/\\Delta\_\{k\}\. The second inequality holds because\(I−Π^k\)Πk\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\Pi\_\{k\}is a cross\-projector bounded in operator norm by the distance between projectors\. ∎
###### Proof of Theorem[4](https://arxiv.org/html/2606.27917#Thmtheorem4)\.
*Sketch\.*Decomposeθ⋆\\theta^\{\\star\}into its in\-subspace part and high\-frequency residual; bound the residual’s effect on the regression updates byνk\\nu\_\{k\}pointwise; then apply the standard optimism argument with an inflated confidence radius\.
Decomposeθ⋆=Π^kθ⋆\+\(I−Π^k\)θ⋆\\theta^\{\\star\}=\\widehat\{\\Pi\}\_\{k\}\\theta^\{\\star\}\+\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\theta^\{\\star\}and setα^⋆=E^k⊤θ⋆\\widehat\{\\alpha\}^\{\\star\}=\\widehat\{E\}\_\{k\}^\{\\top\}\\theta^\{\\star\}, soΠ^kθ⋆=E^kα^⋆\\widehat\{\\Pi\}\_\{k\}\\theta^\{\\star\}=\\widehat\{E\}\_\{k\}\\widehat\{\\alpha\}^\{\\star\}and‖α^⋆‖2≤‖θ⋆‖2≤Sk\+ζk=S¯k\\left\\lVert\\widehat\{\\alpha\}^\{\\star\}\\right\\rVert\_\{2\}\\leq\\left\\lVert\\theta^\{\\star\}\\right\\rVert\_\{2\}\\leq S\_\{k\}\+\\zeta\_\{k\}=\\overline\{S\}\_\{k\}\. Moreover,
‖\(I−Π^k\)θ⋆‖2\\displaystyle\\left\\lVert\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\theta^\{\\star\}\\right\\rVert\_\{2\}=‖\(I−Π^k\)\(Πkθ⋆\+rk\)‖2\\displaystyle=\\left\\lVert\(I\-\\widehat\{\\Pi\}\_\{k\}\)\(\\Pi\_\{k\}\\theta^\{\\star\}\+r\_\{k\}\)\\right\\rVert\_\{2\}≤‖\(I−Π^k\)Πk‖2‖Πkθ⋆‖2\+‖\(I−Π^k\)rk‖2\\displaystyle\\leq\\left\\lVert\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\Pi\_\{k\}\\right\\rVert\_\{2\}\\left\\lVert\\Pi\_\{k\}\\theta^\{\\star\}\\right\\rVert\_\{2\}\+\\left\\lVert\(I\-\\widehat\{\\Pi\}\_\{k\}\)r\_\{k\}\\right\\rVert\_\{2\}≤γkSk\+ζk\.\\displaystyle\\leq\\gamma\_\{k\}S\_\{k\}\+\\zeta\_\{k\}\.For any arm defineξt,a=xt,a⊤\(I−Π^k\)θ⋆\\xi\_\{t,a\}=x\_\{t,a\}^\{\\top\}\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\theta^\{\\star\}; then\|ξt,a\|≤Lx‖\(I−Π^k\)θ⋆‖2≤Lx\(ζk\+Skγk\)=νk\|\\xi\_\{t,a\}\|\\leq L\_\{x\}\\left\\lVert\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\theta^\{\\star\}\\right\\rVert\_\{2\}\\leq L\_\{x\}\(\\zeta\_\{k\}\+S\_\{k\}\\gamma\_\{k\}\)=\\nu\_\{k\}, andxt,a⊤θ⋆=zt,a⊤α^⋆\+ξt,ax\_\{t,a\}^\{\\top\}\\theta^\{\\star\}=z\_\{t,a\}^\{\\top\}\\widehat\{\\alpha\}^\{\\star\}\+\\xi\_\{t,a\}with\|ξt,a\|≤νk\|\\xi\_\{t,a\}\|\\leq\\nu\_\{k\}\. For the selected arm writezt=zt,atz\_\{t\}=z\_\{t,a\_\{t\}\},ξt=ξt,at\\xi\_\{t\}=\\xi\_\{t,a\_\{t\}\}, soyt=zt⊤α^⋆\+ηt\+ξty\_\{t\}=z\_\{t\}^\{\\top\}\\widehat\{\\alpha\}^\{\\star\}\+\\eta\_\{t\}\+\\xi\_\{t\}and
α^t−α^⋆=Vt−1∑s<tzsηs\+Vt−1∑s<tzsξs−λVt−1α^⋆\.\\widehat\{\\alpha\}\_\{t\}\-\\widehat\{\\alpha\}^\{\\star\}=V\_\{t\}^\{\-1\}\\sum\_\{s<t\}z\_\{s\}\\eta\_\{s\}\+V\_\{t\}^\{\-1\}\\sum\_\{s<t\}z\_\{s\}\\xi\_\{s\}\-\\lambda V\_\{t\}^\{\-1\}\\widehat\{\\alpha\}^\{\\star\}\.On the self\-normalized event,‖∑s<tzsηs‖Vt−1≤Rklog\(1\+\(t−1\)Lx2λk\)\+2log1δ\\left\\lVert\\sum\_\{s<t\}z\_\{s\}\\eta\_\{s\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\\leq R\\sqrt\{k\\log\(1\+\\frac\{\(t\-1\)L\_\{x\}^\{2\}\}\{\\lambda k\}\)\+2\\log\\frac\{1\}\{\\delta\}\}andλ‖α^⋆‖Vt−1≤λS¯k\\lambda\\left\\lVert\\widehat\{\\alpha\}^\{\\star\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\\leq\\sqrt\{\\lambda\}\\,\\overline\{S\}\_\{k\}\. Since\|ξs\|≤νk\|\\xi\_\{s\}\|\\leq\\nu\_\{k\}andVt⪰VsV\_\{t\}\\succeq V\_\{s\},
‖∑s<tzsξs‖Vt−1\\displaystyle\\left\\lVert\\textstyle\\sum\_\{s<t\}z\_\{s\}\\xi\_\{s\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}≤νk∑s<t‖zs‖Vs−1\\displaystyle\\leq\\nu\_\{k\}\\sum\_\{s<t\}\\left\\lVert z\_\{s\}\\right\\rVert\_\{V\_\{s\}^\{\-1\}\}≤νk2\(t−1\)klog\(1\+\(t−1\)Lx2λk\)\.\\displaystyle\\leq\\nu\_\{k\}\\sqrt\{2\(t\-1\)k\\log\\\!\\left\(1\+\\tfrac\{\(t\-1\)L\_\{x\}^\{2\}\}\{\\lambda k\}\\right\)\}\.Hence‖α^t−α^⋆‖Vt≤βt0\+νk2\(t−1\)klog\(1\+\(t−1\)Lx2λk\)\\left\\lVert\\widehat\{\\alpha\}\_\{t\}\-\\widehat\{\\alpha\}^\{\\star\}\\right\\rVert\_\{V\_\{t\}\}\\leq\\beta\_\{t\}^\{0\}\+\\nu\_\{k\}\\sqrt\{2\(t\-1\)k\\log\(1\+\\frac\{\(t\-1\)L\_\{x\}^\{2\}\}\{\\lambda k\}\)\}\. Withμt\(a\)=xt,a⊤θ⋆\\mu\_\{t\}\(a\)=x\_\{t,a\}^\{\\top\}\\theta^\{\\star\},qt\(a\)=zt,a⊤α^⋆q\_\{t\}\(a\)=z\_\{t,a\}^\{\\top\}\\widehat\{\\alpha\}^\{\\star\}, we have\|μt−qt\|≤νk\|\\mu\_\{t\}\-q\_\{t\}\|\\leq\\nu\_\{k\}, and forat⋆∈argmaxμta\_\{t\}^\{\\star\}\\in\\arg\\max\\mu\_\{t\},μt\(at⋆\)−μt\(at\)≤qt\(at⋆\)−qt\(at\)\+2νk\\mu\_\{t\}\(a\_\{t\}^\{\\star\}\)\-\\mu\_\{t\}\(a\_\{t\}\)\\leq q\_\{t\}\(a\_\{t\}^\{\\star\}\)\-q\_\{t\}\(a\_\{t\}\)\+2\\nu\_\{k\}\. The optimism argument applied toqtq\_\{t\}with the inflated radius and summation giveReg\(T\)≤2βT\+10AT\+2νkAT2\+2νkT\\mathrm\{Reg\}\(T\)\\leq 2\\beta\_\{T\+1\}^\{0\}A\_\{T\}\+2\\nu\_\{k\}A\_\{T\}^\{2\}\+2\\nu\_\{k\}T\. ∎
#### Why the generic bound is too pessimistic\.
Theorem[4](https://arxiv.org/html/2606.27917#Thmtheorem4)uses only the uniform envelope\|ξt,a\|≤νk\|\\xi\_\{t,a\}\|\\leq\\nu\_\{k\}, treating the residual as an adversary that can corrupt every regression update and change the best arm every round\. In graph problemsξt,a\\xi\_\{t,a\}is not arbitrary: it is the value of the high\-frequency graph signal\(I−Π^k\)θ⋆\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\theta^\{\\star\}on the arms that actually appear\. The next theorem keeps this structure instead of collapsing it into the scalarνk\\nu\_\{k\}\.
###### Proof of Theorem[5](https://arxiv.org/html/2606.27917#Thmtheorem5)\.
*Sketch\.*Decompose the instantaneous regret into a low\-frequency part \(controlled by the UCB optimism in the projected space\) and a high\-frequency part \(controlled by𝖢𝖱𝖶T\\mathsf\{CRW\}\_\{T\}summed over rounds\); the regression bias from the residual is absorbed into𝖲𝖱𝖫T\\mathsf\{SRL\}\_\{T\}\.
Writeμt\(a\)=xt,a⊤θ⋆\\mu\_\{t\}\(a\)=x\_\{t,a\}^\{\\top\}\\theta^\{\\star\},qt\(a\)=zt,a⊤E^k⊤θ⋆q\_\{t\}\(a\)=z\_\{t,a\}^\{\\top\}\\widehat\{E\}\_\{k\}^\{\\top\}\\theta^\{\\star\}, soμt\(a\)=qt\(a\)\+ξt\(a\)\\mu\_\{t\}\(a\)=q\_\{t\}\(a\)\+\\xi\_\{t\}\(a\)andyt=qt\(at\)\+ηt\+ξt\(at\)y\_\{t\}=q\_\{t\}\(a\_\{t\}\)\+\\eta\_\{t\}\+\\xi\_\{t\}\(a\_\{t\}\)\. The ridge error decomposes as
α^t−E^k⊤θ⋆=Vt−1∑s<tzsηs\+Vt−1∑s<tzsξs\(as\)−λVt−1E^k⊤θ⋆\.\\widehat\{\\alpha\}\_\{t\}\-\\widehat\{E\}\_\{k\}^\{\\top\}\\theta^\{\\star\}=V\_\{t\}^\{\-1\}\\sum\_\{s<t\}z\_\{s\}\\eta\_\{s\}\+V\_\{t\}^\{\-1\}\\sum\_\{s<t\}z\_\{s\}\\xi\_\{s\}\(a\_\{s\}\)\-\\lambda V\_\{t\}^\{\-1\}\\widehat\{E\}\_\{k\}^\{\\top\}\\theta^\{\\star\}\.On the self\-normalized event, for alltt,‖α^t−E^k⊤θ⋆‖Vt≤βt0\+𝖲𝖱𝖫T\(r⟂,k\)≤βt0\+BT\\left\\lVert\\widehat\{\\alpha\}\_\{t\}\-\\widehat\{E\}\_\{k\}^\{\\top\}\\theta^\{\\star\}\\right\\rVert\_\{V\_\{t\}\}\\leq\\beta\_\{t\}^\{0\}\+\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\)\\leq\\beta\_\{t\}^\{0\}\+B\_\{T\}\. Because the algorithm uses radiusβt0\+BT\\beta\_\{t\}^\{0\}\+B\_\{T\}, its UCB is optimistic for the projected meansqtq\_\{t\}\. Lettingatq∈argmaxaqt\(a\)a\_\{t\}^\{q\}\\in\\arg\\max\_\{a\}q\_\{t\}\(a\), the LinUCB argument givesqt\(atq\)−qt\(at\)≤2\(βt0\+BT\)‖zt‖Vt−1q\_\{t\}\(a\_\{t\}^\{q\}\)\-q\_\{t\}\(a\_\{t\}\)\\leq 2\(\\beta\_\{t\}^\{0\}\+B\_\{T\}\)\\left\\lVert z\_\{t\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\. Sinceat⋆a\_\{t\}^\{\\star\}maximizesμt\\mu\_\{t\}rather thanqtq\_\{t\},
μt\(at⋆\)−μt\(at\)\\displaystyle\\mu\_\{t\}\(a\_\{t\}^\{\\star\}\)\-\\mu\_\{t\}\(a\_\{t\}\)=qt\(at⋆\)−qt\(at\)\+ξt\(at⋆\)−ξt\(at\)\\displaystyle=q\_\{t\}\(a\_\{t\}^\{\\star\}\)\-q\_\{t\}\(a\_\{t\}\)\+\\xi\_\{t\}\(a\_\{t\}^\{\\star\}\)\-\\xi\_\{t\}\(a\_\{t\}\)≤qt\(atq\)−qt\(at\)\+\(maxaξt\(a\)−minaξt\(a\)\)\.\\displaystyle\\leq q\_\{t\}\(a\_\{t\}^\{q\}\)\-q\_\{t\}\(a\_\{t\}\)\+\\Big\(\\max\_\{a\}\\xi\_\{t\}\(a\)\-\\min\_\{a\}\\xi\_\{t\}\(a\)\\Big\)\.Summing and using the elliptical potential bound proves the theorem\. ∎
###### Proof of Proposition[6](https://arxiv.org/html/2606.27917#Thmtheorem6)\.
The confidence event still gives\|qt\(a\)−zt,a⊤α^t\|≤\(βt0\+𝖲𝖱𝖫T\)‖zt,a‖Vt−1\|q\_\{t\}\(a\)\-z\_\{t,a\}^\{\\top\}\\widehat\{\\alpha\}\_\{t\}\|\\leq\(\\beta\_\{t\}^\{0\}\+\\mathsf\{SRL\}\_\{T\}\)\\left\\lVert z\_\{t,a\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}, but the algorithm only maximizes the UCB with radiusβt0\\beta\_\{t\}^\{0\}, sozt,atq⊤α^t\+βt0‖zt,atq‖Vt−1≤zt,at⊤α^t\+βt0‖zt,at‖Vt−1z\_\{t,a\_\{t\}^\{q\}\}^\{\\top\}\\widehat\{\\alpha\}\_\{t\}\+\\beta\_\{t\}^\{0\}\\left\\lVert z\_\{t,a\_\{t\}^\{q\}\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\\leq z\_\{t,a\_\{t\}\}^\{\\top\}\\widehat\{\\alpha\}\_\{t\}\+\\beta\_\{t\}^\{0\}\\left\\lVert z\_\{t,a\_\{t\}\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\. Combining,
qt\(atq\)−qt\(at\)≤𝖲𝖱𝖫T‖zt,atq‖Vt−1\+\(2βt0\+𝖲𝖱𝖫T\)‖zt‖Vt−1\.q\_\{t\}\(a\_\{t\}^\{q\}\)\-q\_\{t\}\(a\_\{t\}\)\\leq\\mathsf\{SRL\}\_\{T\}\\left\\lVert z\_\{t,a\_\{t\}^\{q\}\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\+\(2\\beta\_\{t\}^\{0\}\+\\mathsf\{SRL\}\_\{T\}\)\\left\\lVert z\_\{t\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\.The played\-arm terms sum to at most\(2βT\+10\+𝖲𝖱𝖫T\)AT\(2\\beta\_\{T\+1\}^\{0\}\+\\mathsf\{SRL\}\_\{T\}\)A\_\{T\}, while‖zt,atq‖Vt−1≤1\\left\\lVert z\_\{t,a\_\{t\}^\{q\}\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\\leq 1underλ≥Lx2\\lambda\\geq L\_\{x\}^\{2\}, giving the extraT𝖲𝖱𝖫TT\\,\\mathsf\{SRL\}\_\{T\}term\. Adding𝖢𝖱𝖶T\\mathsf\{CRW\}\_\{T\}proves the claim\. ∎
###### Proof of Theorem[7](https://arxiv.org/html/2606.27917#Thmtheorem7)\.
The residual\-leverage estimation error obeys‖∑s<tzs\(ξs\(as\)−ξ^s\(as\)\)‖Vt−1≤Lxer∑s<t‖zs‖Vs−1≤LxerAT\\left\\lVert\\sum\_\{s<t\}z\_\{s\}\(\\xi\_\{s\}\(a\_\{s\}\)\-\\widehat\{\\xi\}\_\{s\}\(a\_\{s\}\)\)\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\\leq L\_\{x\}e\_\{r\}\\sum\_\{s<t\}\\left\\lVert z\_\{s\}\\right\\rVert\_\{V\_\{s\}^\{\-1\}\}\\leq L\_\{x\}e\_\{r\}A\_\{T\}, since\|xs,as⊤\(r⟂,k−r^\)\|≤Lxer\|x\_\{s,a\_\{s\}\}^\{\\top\}\(r\_\{\\perp,k\}\-\\widehat\{r\}\)\|\\leq L\_\{x\}e\_\{r\}\. HenceB^T\+LxerAT\\widehat\{B\}\_\{T\}\+L\_\{x\}e\_\{r\}A\_\{T\}upper\-bounds𝖲𝖱𝖫T\(r⟂,k\)\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\)\. The true candidate residual width differs from the estimated one by at most2LxerT2L\_\{x\}e\_\{r\}T\. Applying Theorem[5](https://arxiv.org/html/2606.27917#Thmtheorem5)withBT=B^T\+LxerATB\_\{T\}=\\widehat\{B\}\_\{T\}\+L\_\{x\}e\_\{r\}A\_\{T\}proves the result\. ∎
#### The implementable advantage is contingent on pilot quality\.
Theorem[7](https://arxiv.org/html/2606.27917#Thmtheorem7)improves on Theorem[4](https://arxiv.org/html/2606.27917#Thmtheorem4)*only when the pilot estimate is good\.*The terms2LxerAT2=O~\(erkT\)2L\_\{x\}e\_\{r\}A\_\{T\}^\{2\}=\\widetilde\{O\}\(e\_\{r\}kT\)and2LxerT2L\_\{x\}e\_\{r\}Tare linear inTT, so a poor pilot \(er=Θ\(1\)e\_\{r\}=\\Theta\(1\)\) recovers the generic linear bias\. The estimated\-residual bound is most useful when a cheap exploration prefix or logged data yieldser=o\(1\)e\_\{r\}=o\(1\)*and*the realizedB^T,𝖢𝖱𝖶^T\\widehat\{B\}\_\{T\},\\widehat\{\\mathsf\{CRW\}\}\_\{T\}are small, precisely the benign\-residual regime the diagnostic in Appendix[D](https://arxiv.org/html/2606.27917#A4)is designed to detect\.
###### Proof of Corollary[8](https://arxiv.org/html/2606.27917#Thmtheorem8)\.
‖∑s<tzsξs\(as\)‖Vt−1≤νk∑s<t‖zs‖Vs−1≤νkAT\\left\\lVert\\sum\_\{s<t\}z\_\{s\}\\xi\_\{s\}\(a\_\{s\}\)\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\\leq\\nu\_\{k\}\\sum\_\{s<t\}\\left\\lVert z\_\{s\}\\right\\rVert\_\{V\_\{s\}^\{\-1\}\}\\leq\\nu\_\{k\}A\_\{T\}, and every candidate\-set residual range is at most2νk2\\nu\_\{k\}\. ∎
#### What is graph\-specific about the theorem\.
In the one\-hot node\-arm casext,a=eax\_\{t,a\}=e\_\{a\}andr⟂,k=U\>kc\>kr\_\{\\perp,k\}=U\_\{\>k\}c\_\{\>k\}is the high\-frequency graph signal, soξt\(a\)=ea⊤U\>kc\>k\\xi\_\{t\}\(a\)=e\_\{a\}^\{\\top\}U\_\{\>k\}c\_\{\>k\}\. Thus𝖢𝖱𝖶T\\mathsf\{CRW\}\_\{T\}is the cumulative range of the high\-frequency graph signal over the candidate sets, while𝖲𝖱𝖫T\\mathsf\{SRL\}\_\{T\}is the self\-normalized cross\-correlation between low\-frequency coordinatesUk\(a\)U\_\{k\}\(a\)and high\-frequency residual values\(U\>kc\>k\)a\(U\_\{\>k\}c\_\{\>k\}\)\_\{a\}along the played path\. This is sharper thanζk\\zeta\_\{k\}because it knows*where*the residual lives on the graph and*which*arms the bandit compares\.
## Appendix BProjection Identity
Ifθ⋆=Ekα⋆\\theta^\{\\star\}=E\_\{k\}\\alpha^\{\\star\}andEk⊤Ek=IkE\_\{k\}^\{\\top\}E\_\{k\}=I\_\{k\}, then‖θ⋆‖22=\(α⋆\)⊤Ek⊤Ekα⋆=‖α⋆‖22\\left\\lVert\\theta^\{\\star\}\\right\\rVert\_\{2\}^\{2\}=\(\\alpha^\{\\star\}\)^\{\\top\}E\_\{k\}^\{\\top\}E\_\{k\}\\alpha^\{\\star\}=\\left\\lVert\\alpha^\{\\star\}\\right\\rVert\_\{2\}^\{2\}, and for everyx∈ℝdx\\in\\mathbb\{R\}^\{d\},x⊤θ⋆=\(Ek⊤x\)⊤α⋆x^\{\\top\}\\theta^\{\\star\}=\(E\_\{k\}^\{\\top\}x\)^\{\\top\}\\alpha^\{\\star\}\. Thus projection toEk⊤xE\_\{k\}^\{\\top\}xis signal\-lossless under exact graph smoothness\.
## Appendix CElliptical Potential Bound
LetV1=λIkV\_\{1\}=\\lambda I\_\{k\},Vt\+1=Vt\+ztzt⊤V\_\{t\+1\}=V\_\{t\}\+z\_\{t\}z\_\{t\}^\{\\top\},λ≥Lx2\\lambda\\geq L\_\{x\}^\{2\},‖zt‖2≤Lx\\left\\lVert z\_\{t\}\\right\\rVert\_\{2\}\\leq L\_\{x\}\. Then‖zt‖Vt−12≤1\\left\\lVert z\_\{t\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}^\{2\}\\leq 1, and by the matrix determinant lemmadet\(Vt\+1\)=det\(Vt\)\(1\+‖zt‖Vt−12\)\\det\(V\_\{t\+1\}\)=\\det\(V\_\{t\}\)\(1\+\\left\\lVert z\_\{t\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}^\{2\}\)\. Sinceu≤2log\(1\+u\)u\\leq 2\\log\(1\+u\)foru∈\[0,1\]u\\in\[0,1\],∑t=1T‖zt‖Vt−12≤2logdet\(VT\+1\)det\(V1\)\\sum\_\{t=1\}^\{T\}\\left\\lVert z\_\{t\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}^\{2\}\\leq 2\\log\\frac\{\\det\(V\_\{T\+1\}\)\}\{\\det\(V\_\{1\}\)\}, and the trace–determinant inequality giveslogdet\(VT\+1\)det\(V1\)≤klog\(1\+TLx2λk\)\\log\\frac\{\\det\(V\_\{T\+1\}\)\}\{\\det\(V\_\{1\}\)\}\\leq k\\log\(1\+\\frac\{TL\_\{x\}^\{2\}\}\{\\lambda k\}\)\. Hence∑t=1T‖zt‖Vt−1≤2Tklog\(1\+TLx2λk\)\\sum\_\{t=1\}^\{T\}\\left\\lVert z\_\{t\}\\right\\rVert\_\{V\_\{t\}^\{\-1\}\}\\leq\\sqrt\{2Tk\\log\(1\+\\frac\{TL\_\{x\}^\{2\}\}\{\\lambda k\}\)\}\.
## Appendix DA Predictive Diagnostic and Its Held\-Out Evaluation
#### Diagnostic construction\.
The real\-data picture in the Real\-Data Benchmarks section exposes a gap that the scalar tailζk\\zeta\_\{k\}cannot close:ζk\\zeta\_\{k\}is large on both MIND \(0\.990\.99\) and ogbn\-arxiv \(0\.960\.96\), yet GraphDR wins decisively on MIND and loses on ogbn\-arxiv\. The structure\-specific theorem says the deciding factor is not total high\-frequency energy but whether the residual is harmless in the candidate distribution\. This motivates a computable alignment diagnostic\.
Given a pilot reward estimateθ^\\widehat\{\\theta\}from logged data, validation data, or a short exploration prefix, definer^⟂,k=\(I−Π^k\)θ^\\widehat\{r\}\_\{\\perp,k\}=\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\widehat\{\\theta\}\. For a candidate\-set sample𝒟=\{𝒜t\}t=1M\\mathcal\{D\}=\\\{\\mathcal\{A\}\_\{t\}\\\}\_\{t=1\}^\{M\}we measure the empirical analogues of the two theorem quantities: the candidate residual width
𝖢𝖱𝖶^k\(𝒟\)=1M∑t=1M\(maxa∈𝒜txt,a⊤r^⟂,k−mina∈𝒜txt,a⊤r^⟂,k\),\\widehat\{\\mathsf\{CRW\}\}\_\{k\}\(\\mathcal\{D\}\)=\\frac\{1\}\{M\}\\sum\_\{t=1\}^\{M\}\\Big\(\\max\_\{a\\in\\mathcal\{A\}\_\{t\}\}x\_\{t,a\}^\{\\top\}\\widehat\{r\}\_\{\\perp,k\}\-\\min\_\{a\\in\\mathcal\{A\}\_\{t\}\}x\_\{t,a\}^\{\\top\}\\widehat\{r\}\_\{\\perp,k\}\\Big\),and an approximate residual leverage𝖲𝖱𝖫^k\(𝒟\)=max1≤t≤M\+1‖∑s<tz^s,asxs,as⊤r^⟂,k‖V^t−1\\widehat\{\\mathsf\{SRL\}\}\_\{k\}\(\\mathcal\{D\}\)=\\max\_\{1\\leq t\\leq M\+1\}\\left\\lVert\\sum\_\{s<t\}\\widehat\{z\}\_\{s,a\_\{s\}\}\\,x\_\{s,a\_\{s\}\}^\{\\top\}\\widehat\{r\}\_\{\\perp,k\}\\right\\rVert\_\{\\widehat\{V\}\_\{t\}^\{\-1\}\}computed along a short greedy or random prefix, combined into
𝒢k\(𝒟\)=𝖢𝖱𝖶^k\(𝒟\)\+22klog\(1\+MLx2/\(λk\)\)M𝖲𝖱𝖫^k\(𝒟\)\.\\mathcal\{G\}\_\{k\}\(\\mathcal\{D\}\)=\\widehat\{\\mathsf\{CRW\}\}\_\{k\}\(\\mathcal\{D\}\)\+2\\sqrt\{\\frac\{2k\\log\(1\+ML\_\{x\}^\{2\}/\(\\lambda k\)\)\}\{M\}\}\\;\\widehat\{\\mathsf\{SRL\}\}\_\{k\}\(\\mathcal\{D\}\)\.Small𝒢k\\mathcal\{G\}\_\{k\}is*hypothesized*to indicate that GraphDR should be competitive; large𝒢k\\mathcal\{G\}\_\{k\}suggests PCA, full LinUCB, or a Laplacian\-regularized alternative may be safer\.
#### Held\-out evaluation protocol\.
A binary outcome over six datasets cannot support fitting and validating a threshold rule, so we expand each base dataset into150150problem instances by subsampling node\-induced subgraphs at several sizes, varyingk∈\{2,5,10,20,40\}k\\in\\\{2,5,10,20,40\\\}, varying candidate\-set construction \(uniform, popularity\-biased, neighbourhood\-restricted\), and edge\-perturbing the graph at several noise levels\. Each instance produces one tuple\(ζk,𝖲𝖱𝖫T,𝖢𝖱𝖶T,𝒢k,RTGraphDR,RTPCA\)\(\\zeta\_\{k\},\\mathsf\{SRL\}\_\{T\},\\mathsf\{CRW\}\_\{T\},\\mathcal\{G\}\_\{k\},R\_\{T\}^\{\\mathrm\{GraphDR\}\},R\_\{T\}^\{\\mathrm\{PCA\}\}\)\. The primary test rank\-correlates each predictor with the signed gapG−P:=RTGraphDR−RTPCAG\\\!\-\\\!P:=R\_\{T\}^\{\\mathrm\{GraphDR\}\}\-R\_\{T\}^\{\\mathrm\{PCA\}\}\(negative means GraphDR wins\); the decision\-rule test fits a thresholdY^\(τ\)=𝟏\{𝒢k≤τ\}\\widehat\{Y\}\(\\tau\)=\\mathbf\{1\}\\\{\\mathcal\{G\}\_\{k\}\\leq\\tau\\\}under leave\-one\-base\-dataset\-out\.
#### Results\.
The outcome is, on its declared primary test,*negative*, and we report it straight\. Measured over the150150instances, the absolute Spearman correlation with the signed gap is0\.2520\.252forζk\\zeta\_\{k\},0\.070\.07for𝖲𝖱𝖫T\\mathsf\{SRL\}\_\{T\},0\.030\.03for𝖢𝖱𝖶T\\mathsf\{CRW\}\_\{T\}, and0\.110\.11for𝒢k\\mathcal\{G\}\_\{k\}\(the theorem\-consistent normalization raises𝒢k\\mathcal\{G\}\_\{k\}’s correlation from0\.0360\.036to0\.1060\.106, still belowζk\\zeta\_\{k\}\)\. No structure\-specific predictor beatsζk\\zeta\_\{k\}on rank correlation, so the declared positive criterion is not met\. The leave\-one\-base\-dataset\-out decision\-rule test is more favourable:𝒢k\\mathcal\{G\}\_\{k\}reaches held\-out accuracy0\.620\.62versus0\.470\.47forζk\\zeta\_\{k\}\(and𝖢𝖱𝖶T\\mathsf\{CRW\}\_\{T\}alone0\.630\.63\), so the realized residual quantities do help a held\-out classifier\. On the MIND\-versus\-ogbn separation that motivated the diagnostic,𝒢k\\mathcal\{G\}\_\{k\}does not separate the two cases \(MIND0\.1920\.192versus ogbn0\.1370\.137, the wrong direction\), so the scalar collapse fails the specific case it was designed for\.
We therefore present𝒢k\\mathcal\{G\}\_\{k\}as a theory\-motivated diagnostic with honest empirical characterization, not a validated model\-selection rule\. The structure\-specific theorem stands on its own; the realized residual quantities carry some held\-out classification signal; but a unit\-weighted scalar collapse does not beatζk\\zeta\_\{k\}, and a deployable rule would require a learned\(𝖲𝖱𝖫T,𝖢𝖱𝖶T\)\(\\mathsf\{SRL\}\_\{T\},\\mathsf\{CRW\}\_\{T\}\)weighting and more base graphs\.
## Appendix EEstimating the Residual Diagnostic
The diagnostic𝒢k\\mathcal\{G\}\_\{k\}from the predictive diagnostic appendix is computed as follows\.
1. 1\.BuildL^\\widehat\{L\}from the training graph and computeE^k\\widehat\{E\}\_\{k\}\.
2. 2\.Fit a pilot reward modelθ^\\widehat\{\\theta\}from a disjoint exploration prefix, logged data, or cross\-fitting \(a node reward estimate in the one\-hot case; a ridge or doubly robust model with general features\)\.
3. 3\.Computer^⟂,k=\(I−Π^k\)θ^\\widehat\{r\}\_\{\\perp,k\}=\(I\-\\widehat\{\\Pi\}\_\{k\}\)\\widehat\{\\theta\}\.
4. 4\.Estimate𝖢𝖱𝖶^k\\widehat\{\\mathsf\{CRW\}\}\_\{k\}on held\-out candidate sets by averaging the range ofxt,a⊤r^⟂,kx\_\{t,a\}^\{\\top\}\\widehat\{r\}\_\{\\perp,k\}over candidates\.
5. 5\.Estimate𝖲𝖱𝖫^k\\widehat\{\\mathsf\{SRL\}\}\_\{k\}on a short simulated or logged prefix\.
6. 6\.Compute the subspace\-capture marginΓk=‖Uk⊤θ^‖22−‖P⊤θ^‖22\\Gamma\_\{k\}=\\left\\lVert U\_\{k\}^\{\\top\}\\widehat\{\\theta\}\\right\\rVert\_\{2\}^\{2\}\-\\left\\lVert P^\{\\top\}\\widehat\{\\theta\}\\right\\rVert\_\{2\}^\{2\}against the competing reducer’s basisPP\. Select GraphDR whenΓk\>0\\Gamma\_\{k\}\>0; otherwise prefer PCA, full LinUCB, or a Laplacian\-regularized baseline\.
## Appendix FFull Diagnostic\-Validation Protocol
The protocol behind the predictive diagnostic evaluation is specified here in full so that the negative primary result cannot be read as post hoc\.
The instance generation expands each of the five base datasets used in the diagnostic study \(MovieLens\-100k, LastFM, Amazon, ogbn\-arxiv, MIND\-small; MovieLens\-1M is omitted from the subgraph\-resampled study\) by subsampling node\-induced subgraphs at sizes\{400,800,1500\}\\\{400,800,1500\\\}, varyingk∈\{2,5,10,20,40\}k\\in\\\{2,5,10,20,40\\\}, varying candidate\-set construction across uniform, popularity\-biased, and neighbourhood\-restricted sampling, and edge\-perturbing the graph at noise levels\{0,0\.05\}\\\{0,0\.05\\\}\. This yields150150instances, each producing one tuple\(ζk,𝖲𝖱𝖫T,𝖢𝖱𝖶T,𝒢k,RTGraphDR,RTPCA\)\(\\zeta\_\{k\},\\mathsf\{SRL\}\_\{T\},\\mathsf\{CRW\}\_\{T\},\\mathcal\{G\}\_\{k\},R\_\{T\}^\{\\mathrm\{GraphDR\}\},R\_\{T\}^\{\\mathrm\{PCA\}\}\)atT=8000T\{=\}8000over33seeds\.
The continuous target is the signed gapG−P:=RTGraphDR−RTPCAG\\\!\-\\\!P:=R\_\{T\}^\{\\mathrm\{GraphDR\}\}\-R\_\{T\}^\{\\mathrm\{PCA\}\}\(negative means GraphDR wins\), and the binary target isY=𝟏\{RTGraphDR<RTPCA\}Y=\\mathbf\{1\}\\\{R\_\{T\}^\{\\mathrm\{GraphDR\}\}<R\_\{T\}^\{\\mathrm\{PCA\}\}\\\}\. We report each of𝖲𝖱𝖫T\\mathsf\{SRL\}\_\{T\}and𝖢𝖱𝖶T\\mathsf\{CRW\}\_\{T\}separately, together with a two\-dimensional logistic predictor on\(𝖲𝖱𝖫T,𝖢𝖱𝖶T\)\(\\mathsf\{SRL\}\_\{T\},\\mathsf\{CRW\}\_\{T\}\), before collapsing to the scalar𝒢k\\mathcal\{G\}\_\{k\}\. The primary test rank\-correlates \(Spearman, Kendall\) each predictor withG−PG\\\!\-\\\!Pacross instances\. For the decision\-rule claim we fit a thresholdY^\(τ\)=𝟏\{𝒢k≤τ\}\\widehat\{Y\}\(\\tau\)=\\mathbf\{1\}\\\{\\mathcal\{G\}\_\{k\}\\leq\\tau\\\}on training instances and evaluate held\-out using leave\-one\-base\-dataset\-out, comparing against the analogous threshold rule onζk\\zeta\_\{k\}by AUC and held\-out accuracy\. Because𝒢k\\mathcal\{G\}\_\{k\}depends on a pilot estimater^\\widehat\{r\}, we also report𝒢k\\mathcal\{G\}\_\{k\}and its predictive power across pilot budgets\. The measured correlations and held\-out accuracies are reported in the predictive diagnostic appendix\.
## Appendix GReal\-Data Construction and Per\-Dataset Results
The six datasets comprise four recommendation graphs \(MovieLens\-100k/1M, Amazon Digital Music, LastFM/HetRec\) and two non\-recommendation graphs \(MIND\-small news, ogbn\-arxiv citation\)\. Table[4](https://arxiv.org/html/2606.27917#A7.T4)records how each graph and bandit protocol is built\.
Table 4:Real\-data benchmarks: four recommendation graphs and two non\-recommendation graphs\. All datasets use node\-indicator arm features \(xi=eix\_\{i\}=e\_\{i\},d=nd=n\) consistent with the graph\-causal synthetic model\.Table 5:kk\-selection and measured spectral quantities\. Eigengaps are uniformly small and reward tails large across all datasets; theζk\\zeta\_\{k\}ordering does not match GraphDR’s win/loss ordering, reinforcing that scalar tail energy alone underdetermines the outcome\.The two MovieLens datasets bracket the GraphDR\-versus\-PCA boundary\. On MovieLens\-100k \(n=1682n=1682items, an item–itemkkNN co\-rating graph with13,68613\{,\}686edges, real centered unit\-normalized mean\-rating reward\), GraphDR attains the lowest regret, though by a small margin over PCA and far below the synthetic regime, consistent with the measured tailζk=0\.67\\zeta\_\{k\}=0\.67\(Table[6](https://arxiv.org/html/2606.27917#A7.T6)\)\. The PCA baseline here uses learned item embeddings whereas GraphDR uses node indicators, so the comparison tests whether graph structure beats learned feature variance, not whether Laplacian projection beats PCA on the same input; the matched comparison is in the matched\-comparison appendix\. On MovieLens\-1M \(top\-1500 most\-rated items,ζk=0\.68\\zeta\_\{k\}=0\.68\) GraphDR beats full LinUCB by2\.2×2\.2\\timesand the shuffle control collapses to near\-random, but PCA wins: at scale learned item embeddings carry strong variance signal that content\-PCA exploits better than the graph when the reward is only weakly smooth \(Table[7](https://arxiv.org/html/2606.27917#A7.T7)\)\.
Table 6:MovieLens\-100k cumulative regret\. GraphDR attains the lowest regret; the margin over PCA is small and far below the synthetic regime, consistent withζk=0\.67\\zeta\_\{k\}=0\.67\.Table 7:MovieLens\-1M \(top\-1500 most\-rated items\),ζk=0\.68\\zeta\_\{k\}=0\.68\. GraphDR beats full LinUCB \(2\.2×2\.2\\times\) and the shuffle control collapses to near\-random, but PCA wins\.
## Appendix HMatched Comparison: Graph\-Only PCA on the Same Input
The PCA baseline in the main body consumes learned item embeddings\. To isolate the*projection map*from the*feature source*, we add graph\-only PCA: PCA applied to the graph adjacency itself \(the top\-kkright singular vectors ofAA\), which consumes exactly the same graph\-only information GraphDR receives, with no ratings or content\. The only remaining difference from GraphDR is the projection criterion \(top variance directions ofAAversus bottom\-kklow\-frequency Laplacian eigenvectors\)\.
Table 8:Corrected matched\-input comparison \(T=20000T\{=\}20000,55seeds, paired arm/noise streams, largest\-connected\-component graphs; lowest inbold\)\.The confound is resolved:graph\-only projection \(GraphDR and/or graph\-PCA\) beats content\-PCA and full LinUCB except where content variance is genuinely strong \(ml\-1m\), so GraphDR’s advantage is structural, not a feature\-source artifact\.Reported straight:given the*same*graph\-only input, GraphDR \(normalized low\-frequency Laplacian\) and graph\-PCA \(raw\-adjacency top variance\) split33–33: GraphDR wins on ml\-100k, Amazon, and LastFM; graph\-PCA wins on ml\-1m, ogbn\-arxiv, and MIND\. The graph\-PCA advantage is sharpest on ogbn\-arxiv \(197\.5197\.5vs\.667\.6667\.6\), the citation graph whose subject\-label reward we already document as low\-frequency\-misaligned \(ζk=0\.96\\zeta\_\{k\}\{=\}0\.96\): there the degree\-normalized Laplacian discards degree\-correlated structure that raw\-adjacency PCA retains\. The honest takeaway is therefore “use graph structure”: both graph\-spectral maps dominate the graph\-blind baselines, rather than a claim that the normalized low\-frequency subspace is the uniquely best graph\-spectral projection; selecting between the two graph\-spectral maps per dataset is future work\.
## Appendix ISubspace\-Capture Margin: Full Validation
Table 9:The subspace\-capture marginΓk\\Gamma\_\{k\}predicts the GraphDR\-vs\-content\-PCA outcome whereζk\\zeta\_\{k\}cannot\.Rows sorted byΓk\\Gamma\_\{k\}; final column is the realized outcome \(signed gapRGraphDR−RPCAR\_\{\\mathrm\{GraphDR\}\}\-R\_\{\\mathrm\{PCA\}\}in parentheses, negative==GraphDR wins\)\. The fitted\-threshold rule is leave\-one\-base\-dataset\-out correct on6/66/6; the zero\-parameter sign ruleΓk\>0\\Gamma\_\{k\}\>0is correct on5/65/6\(lone miss: MIND at the near\-tieΓk=−0\.008\\Gamma\_\{k\}=\-0\.008, where*both*subspaces capture almost no reward energy yet GraphDR wins because content\-PCA is itself far from the reward\)\. Crucially,Γk\\Gamma\_\{k\}orders ogbn\-arxiv \(−0\.240\-0\.240, GraphDR’s documented failure\) at the bottom whileζk\\zeta\_\{k\}cannot: MIND has the*largest*tail \(0\.990\.99\) yet GraphDR wins, exactly the inversion that defeated the scalar tail and the unit\-weighted𝒢k\\mathcal\{G\}\_\{k\}\.#### Comparison to alternative selectors\.
Γk\\Gamma\_\{k\}’s Spearman rank\-correlation with the signed gap is0\.490\.49and its leave\-one\-out accuracy is6/66/6\(fitted\) /5/65/6\(zero\-parameter\), versus the original𝒢k\\mathcal\{G\}\_\{k\}’s\|ρs\|≈0\.11\|\\rho\_\{s\}\|\\approx 0\.11andζk\\zeta\_\{k\}’s2/62/6\. The eigengapΔk\\Delta\_\{k\}alone rank\-correlates at0\.830\.83but reaches only4/64/6leave\-one\-out and has no mechanistic justification for the head\-to\-head comparison against content\-PCA; we reportΓk\\Gamma\_\{k\}as the headline rule andΔk\\Delta\_\{k\}as a correlated secondary signal\. Withn=6n=6base datasets we cannot certify a threshold law; the open task is to confirm this regularity on more base graphs and against the matched graph\-PCA competitor\.
## Appendix JAdditional Synthetic Results
This appendix collects the full synthetic results summarized in the Synthetic Experiments section\.
#### Dimension scaling\.
Fixingk⋆=5k^\{\\star\}=5and scalingd=nd=nfrom6060to800800:
GraphDR regret stays nearly flat asddgrows while full\-dimensional LinUCB grows substantially, matching the predictedkk\-versus\-ddexploration cost \(Figure[3](https://arxiv.org/html/2606.27917#Sx5.F3)\)\.
#### Second graph family\.
On a random geometric graph \(n=200n=200, radius0\.160\.16\): GraphDRR\(T\)=18\.1R\(T\)=18\.1versus LinUCB\-full417\.8417\.8, PCA2045\.52045\.5, JL2364\.12364\.1, Random2613\.02613\.0; the shuffle control collapses to2544\.62544\.6, near\-random \(Figure[4](https://arxiv.org/html/2606.27917#Sx5.F4)\)\.
#### Regret versus projected dimensionkk\.
Projecting onto the true bottom\-kknontrivial eigenvectors in the SBM withk⋆=5k^\{\\star\}=5, regret atk=1,2,3,𝟓,8,12,20k=1,2,3,\\mathbf\{5\},8,12,20is419\.9,384\.2,98\.6,28\.7,39\.5,55\.4,69\.5419\.9,384\.2,98\.6,\\mathbf\{28\.7\},39\.5,55\.4,69\.5\. Regret is minimized atk=k⋆=5k=k^\{\\star\}=5and rises on both sides \(Figure[5](https://arxiv.org/html/2606.27917#Sx5.F5)\)\.
#### Noisy graph stress test\.
When the graph is observed under edge\-flip noise, the bootstrap estimator and naive eigengap estimator return identicalk^\\widehat\{k\}at each noise level; under heavier noisek^\\widehat\{k\}degrades \(4→2→14\\\!\\to\\\!2\\\!\\to\\\!1\) with measurable regret cost, motivating Theorem[4](https://arxiv.org/html/2606.27917#Thmtheorem4)\.
#### Validating the robust theorem\.
We measureζk\\zeta\_\{k\}andεL\\varepsilon\_\{L\}in each run \(n=200n=200SBM,k⋆=5k^\{\\star\}=5,T=20000T=20000,55seeds\)\. Injecting reward energy outside the bottom\-kksubspace raisesζk\\zeta\_\{k\}from0to0\.710\.71and regret rises monotonically; holding the reward exactly smooth and perturbing edges raisesεL\\varepsilon\_\{L\}from0to0\.480\.48and regret again rises monotonically \(Table[10](https://arxiv.org/html/2606.27917#A10.T10)\)\. Pooled Spearman\(R,ρk\)=0\.68\(R,\\rho\_\{k\}\)=0\.68, monotone in binned medians\.
Table 10:Stress test consistent with the generic robust theorem: regret rises monotonically with each measured misspecification term\. Pooled Spearman\(R,ρk\)=0\.68\(R,\\rho\_\{k\}\)=0\.68\.
## Appendix KLimitations and Additional Evaluation
Several boundaries of the results deserve explicit statement\.
The structure\-specific bound \(Theorem[5](https://arxiv.org/html/2606.27917#Thmtheorem5)\) uses𝖲𝖱𝖫T\(r⟂,k\)\\mathsf\{SRL\}\_\{T\}\(r\_\{\\perp,k\}\), which depends on the unknownθ⋆\\theta^\{\\star\}; we state it as an oracle inequality and make it implementable only through the estimated\-residual variant of Theorem[7](https://arxiv.org/html/2606.27917#Thmtheorem7), whose advantage is contingent on a good pilot\. If the residual has large candidate\-set width every round, the residual term remains linear inTT; this cost is unavoidable without benign\-residual structure\.
Our real\-data protocol calibrates rewards from ratings and clicks rather than from a logged online experiment with known propensities\. With logged propensities or an exploration log under a known logging policypp, one could report, for a target policyπ\\pi,
V^IPS\(π\)=1N∑i=1Nπ\(ai∣xi\)pi\(ai∣xi\)ri,\\displaystyle\\widehat\{V\}\_\{\\mathrm\{IPS\}\}\(\\pi\)=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\frac\{\\pi\(a\_\{i\}\\mid x\_\{i\}\)\}\{p\_\{i\}\(a\_\{i\}\\mid x\_\{i\}\)\}r\_\{i\},V^SNIPS\(π\)=∑iπ\(ai∣xi\)pi\(ai∣xi\)ri∑iπ\(ai∣xi\)pi\(ai∣xi\)\.\\displaystyle\\widehat\{V\}\_\{\\mathrm\{SNIPS\}\}\(\\pi\)=\\frac\{\\sum\_\{i\}\\frac\{\\pi\(a\_\{i\}\\mid x\_\{i\}\)\}\{p\_\{i\}\(a\_\{i\}\\mid x\_\{i\}\)\}r\_\{i\}\}\{\\sum\_\{i\}\\frac\{\\pi\(a\_\{i\}\\mid x\_\{i\}\)\}\{p\_\{i\}\(a\_\{i\}\\mid x\_\{i\}\)\}\}\.V^DR\(π\)=1N∑i=1N\[∑aπ\(a∣xi\)μ^\(xi,a\)\+π\(ai∣xi\)pi\(ai∣xi\)\(ri−μ^\(xi,ai\)\)\]\.\\widehat\{V\}\_\{\\mathrm\{DR\}\}\(\\pi\)=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\\!\\Biggl\[\\sum\_\{a\}\\pi\(a\\mid x\_\{i\}\)\\widehat\{\\mu\}\(x\_\{i\},a\)\\\\ \+\\frac\{\\pi\(a\_\{i\}\\mid x\_\{i\}\)\}\{p\_\{i\}\(a\_\{i\}\\mid x\_\{i\}\)\}\\big\(r\_\{i\}\-\\widehat\{\\mu\}\(x\_\{i\},a\_\{i\}\)\\big\)\\Biggr\]\.A fully logged evaluation and an adaptive\-kkstudy \(testing whether a corralling wrapper\[[2](https://arxiv.org/html/2606.27917#bib.bib27)\]matches oracle\-kkregret\) are natural extensions left to future work\. Replacing the node\-indicator arm features with learned graph embeddings\[[22](https://arxiv.org/html/2606.27917#bib.bib14),[15](https://arxiv.org/html/2606.27917#bib.bib15)\]and studying how embedding quality affects the spectral gap is a further direction, as is extending the Laplacian\-based DR survey of\[[14](https://arxiv.org/html/2606.27917#bib.bib21)\]to the bandit feedback setting\. We also prove only upper bounds; matching lower bounds for theεL/Δk\\varepsilon\_\{L\}/\\Delta\_\{k\}dependence and for the residual leverage and candidate\-width terms remain open\.Similar Articles
Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity
This paper studies piecewise-stationary low-rank linear contextual bandits, proposes the SPSC algorithm that achieves dynamic regret scaling with the intrinsic rank instead of the ambient dimension, and characterizes the identification boundary for subspace recovery under scalar feedback.
Contextual Slate GLM Bandits with Limited Adaptivity
Proposes algorithms for contextual slate bandits with generalized linear rewards under limited adaptivity, achieving regret bounds independent of the non-linearity parameter. The batched and rarely-switching algorithms are computationally efficient and empirically outperform baselines, including in a language model example selection task.
A Comparative Study of Bayesian Contextual Bandits for Real-Time Warehouse Sorter Optimization
This paper presents a comparative study of Bayesian Contextual Bandits, XGBoost, and Linear Regression for real-time sorter diversion optimization in e-commerce warehouses, showing BCB achieves 2.03% reward uplift with superior online learning and inference latency.
Learning in Markovian bandits with non-observable states and constrained decision epochs
This paper studies regret minimization in Markovian bandits with non-observable states and constrained decision epochs, introducing a generalization called self-degrading Markovian bandits. The authors propose the UCB-NOM algorithm that achieves nearly logarithmic regret and provide bounds that do not depend on the number of states.
Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts
This paper formalizes embedding model routing as an adversarial contextual linear bandit with low-rank experts, proposing the Hypentropy Policy Gradient (HPG) algorithm that achieves O~(s√(MT)) policy regret, avoiding the curse of dimensionality.