Randomized Exploration for Linear Bandits via Absolute Perturbations
Summary
This paper proposes Absolute Thompson Sampling (ATS), a modification of Thompson Sampling that ensures optimism in expectation by using absolute exploration noise, enabling a simpler UCB-style regret analysis while maintaining computational efficiency. It achieves regret matching existing TS bounds, and introduces an ensemble variant that converges to UCB behavior.
View Cached Full Text
Cached at: 06/30/26, 05:28 AM
# Randomized Exploration for Linear Bandits via Absolute Perturbations
Source: [https://arxiv.org/html/2606.28616](https://arxiv.org/html/2606.28616)
###### Abstract
In stochastic linear bandits, the canonical Upper Confidence Bound \(UCB\) algorithm admits a simple frequentist regret analysis but can be computationally demanding, while Thompson Sampling \(TS\) is computationally attractive yet typically harder to analyze due to its non\-optimistic nature\. We propose Absolute Thompson Sampling \(ATS\), a simple modification of TS that ensures optimism in expectation by replacing the signed exploration noise with its absolute value\. This preserves the computational efficiency of TS while avoiding the technically involved anti\-concentration arguments common in TS analyses, enabling a simple UCB\-style regret analysis\. We show that ATS achievesO~\(d3/2K\)\\widetilde\{O\}\(d^\{3/2\}\\sqrt\{K\}\)regret, matching existing bounds for TS in linear bandits\. We further introduce Ensemble Absolute Thompson Sampling \(EATS\), which takes the maximum over multiple absolute perturbations with normalization by the ensemble size\. As the ensemble size grows, EATS converges to the UCB objective, recovering UCB behavior in the limit\. Experiments show that moderate ensemble sizes already yield strong performance\. Our results point to a bridge between randomized exploration and deterministic optimism both in theory and practice\.
## 1Introduction
We study stochastic linear bandit problems, where a learner sequentially selects arms from a given arm set and receives noisy linear rewards\. Efficient exploration is the central challenge for achieving low regret\. Two of the most widely used approaches are optimism\-based algorithms, such as Upper Confidence Bound \(UCB\), and randomized algorithms, such as Thompson Sampling \(TS\)\.
The UCB algorithm selects arms according to the*Optimism in the Face of Uncertainty*\(OFU\) principle, maximizing an upper confidence bound on the estimated reward\. This principle leads to a simple and elegant regret analysis based on confidence sets and the elliptical potential lemma\(abbasi2011improved\)\. However, UCB can be computationally expensive, as it requires maximizing an elliptical bonus term over the arm set\. In contrast, TS selects arms by sampling a parameter from a posterior distribution and acting greedily with respect to the sampled parameter\. TS is computationally attractive since each round reduces to solving a linear maximization problem over the arm set\. However, its frequentist regret analysis is more involved than that of UCB\. Because the sampled parameter may underestimate the true reward, typical analyses require that TS produces an optimistic parameter with some probability and control the regret incurred during non\-optimistic rounds\(agrawal2013thompson;abeille2017linear\)\.
Thus, UCB and TS have complementary computational and analytical properties\. We bridge this gap by introducing a simple modification of TS that mimics the behavior of UCB\. In[Section˜4](https://arxiv.org/html/2606.28616#S4), we introduceAbsolute Thompson Sampling \(ATS\), which replaces the signed exploration noise in TS with its absolute value\. This modification ensures*optimism in expectation*: the exploration bonus used in ATS matches the UCB bonus in expectation\. As a result, the regret analysis becomes conceptually similar to that of UCB and avoids explicitly handling non\-optimistic rounds\. We show that ATS achievesO~\(d3/2K\)\\widetilde\{O\}\(d^\{3/2\}\\sqrt\{K\}\)regret with high probability, matching the existing TS regret bound for linear bandits\(abeille2017linear\)\. Moreover, ATS retains the computational efficiency of TS, requiring only one additional linear maximization per round\.
While ATS achieves the same regret bound as TS, it incurs an additionald\\sqrt\{d\}factor compared to UCB\(abbasi2011improved\)\. To address this gap, we propose an extension of ATS that more closely resembles UCB\. Motivated by extreme\-value properties of Gaussian random variables \([Proposition˜5\.1](https://arxiv.org/html/2606.28616#S5.Thmtheorem1)\), we introduceEnsemble Absolute Thompson Sampling \(EATS\), which replaces the single perturbation in ATS with the maximum over multiple perturbations\. We show that, as the ensemble size grows, the EATS objective converges to the UCB objective, recovering UCB behavior in the limit\. Empirically, moderate ensemble sizes already yield strong performance, suggesting a practical interpolation between randomized exploration and deterministic optimism \([Section˜6](https://arxiv.org/html/2606.28616#S6)\)\. Establishing theoretical guarantees for EATS with a finite ensemble size remains an interesting open problem \([Remark˜5\.3](https://arxiv.org/html/2606.28616#S5.Thmtheorem3)\)\.
Together, our results highlight a new perspective on randomized exploration: by carefully shaping the exploration term, one can design algorithms that retain the computational efficiency of TS while exhibiting analysis and behavior closely aligned with UCB\. Our work opens new directions for designing simple and efficient randomized algorithms for bandit problems\.
## 2Related Work
For linear bandits,agrawal2013thompsonandabeille2017linearprovided the fundamental frequentist regret analyses of TS, establishingO~\(d3/2K\)\\widetilde\{O\}\(d^\{3/2\}\\sqrt\{K\}\)bounds\. A common strategy in these works is to show that TS samples an optimistic parameter with constant probabilitypp, and then bound the regret by scaling the optimistic regret by1/p1/p\(janz2023ensemble\)\. Similar proof techniques were later extended to multinomial logistic bandits\(oh2019thompson\), generalized linear bandits\(kveton2020randomized\), and linear MDPs\(zanette2020frequentist\)\.abeille2025anddeveloped a tighter analysis that avoids relying on constant\-probability optimism, achievingO~\(dK\)\\widetilde\{O\}\(d\\sqrt\{K\}\)regret, matching the lower bound for linear bandits\(dani2008stochastic;rusmevichientong2010linearly\)\. However, their result requires additional structural assumptions on the action set such as theL2L\_\{2\}unit ball\.
While these works establish sublinear regret guarantees, existing linear TS analyses are technically involved, as they must carefully control the regret incurred during non\-optimistic rounds\. In contrast, by ensuring*optimism in expectation*, our approach avoids the need to separately analyze non\-optimistic rounds, leading to a proof that is conceptually as simple as that of UCB\(abbasi2011improved\)\. See[Section˜4](https://arxiv.org/html/2606.28616#S4)for more details of our approach\.
Closer to our work,zhang2022feelandhuix2023tightmodify TS to enforce stronger optimism\. However, their approaches rely on sampling from skewed posterior distributions via Markov Chain Monte Carlo, which increases computational complexity and diminishes the practical appeal of TS\. In contrast,vaswani2019oldmodify UCB by introducing random perturbations\. While they achieveO~\(dK\)\\widetilde\{O\}\(d\\sqrt\{K\}\)regret, their method requires maximizing an elliptical bonus term over the arm set, which can be computationally expensive in practice\. Our method, by contrast, introduces only a minimal modification to TS and can be implemented using the same linear maximization oracle as standard TS\.
## 3Preliminary
Given two vectorsx,y∈ℝdx,y\\in\\mathbb\{R\}^\{d\}and a positive definite matrixA∈ℝd×dA\\in\\mathbb\{R\}^\{d\\times d\}, we denote⟨x,y⟩=x⊤y\\langle x,y\\rangle=x^\{\\top\}yand∥x∥A=x⊤Ax\\lVert x\\rVert\_\{A\}=\\sqrt\{x^\{\\top\}Ax\}\. We denote by𝕊d−1\\mathbb\{S\}^\{d\-1\}the unit sphere inℝd\\mathbb\{R\}^\{d\}\. All\-zero and all\-one vectors are denoted by𝟎\\mathbf\{0\}and𝟏\\mathbf\{1\}, respectively\. For random variablesXnX\_\{n\}andXX, we writeXn→ℙXX\_\{n\}\\xrightarrow\{\\mathbb\{P\}\}Xto denote convergence in probability, i\.e\., for anyε\>0\\varepsilon\>0,limn→∞ℙ\(\|Xn−X\|\>ε\)=0\\lim\_\{n\\to\\infty\}\\mathbb\{P\}\(\\lvert X\_\{n\}\-X\\rvert\>\\varepsilon\)=0\.
### 3\.1Stochastic Linear Bandits
Let𝒜⊂ℝd\\mathcal\{A\}\\subset\\mathbb\{R\}^\{d\}be the compact set of arms\. Without loss of generality, we assume∥ϕ∥2≤1\\lVert\\phi\\rVert\_\{2\}\\leq 1for anyϕ∈𝒜\\phi\\in\\mathcal\{A\}\. At each roundkk, the agent pulls an armϕk∈𝒜\\phi\_\{k\}\\in\\mathcal\{A\}and observes the rewardrk=\(θ⋆\)⊤ϕk\+εkr\_\{k\}=\\left\(\\theta^\{\\star\}\\right\)^\{\\top\}\\phi\_\{k\}\+\\varepsilon\_\{k\}\. Here,θ⋆∈ℝd\\theta^\{\\star\}\\in\\mathbb\{R\}^\{d\}is unknown to the agent and satisfies∥θ⋆∥2≤B\\lVert\\theta^\{\\star\}\\rVert\_\{2\}\\leq B, andεk\\varepsilon\_\{k\}is zero\-meanRR\-sub\-Gaussian noise\. Definer\(ϕ\)≔\(θ⋆\)⊤ϕr\(\\phi\)\\coloneqq\(\\theta^\{\\star\}\)^\{\\top\}\\phi\. The agent’s goal is to achieve sublinear regret:
Reg\(K\):=∑k=1Kr⋆−r\(ϕk\)=o\(K\)\\displaystyle\{\\textstyle\\operatorname\{Reg\}\(K\)\\vcentcolon\\nolinebreak\\mkern\-1\.2mu=\\sum^\{K\}\_\{k=1\}r^\{\\star\}\-r\(\\phi\_\{k\}\)=o\\left\(K\\right\)\}\(1\)wherer⋆=maxϕ∈𝒜r\(ϕ\)r^\{\\star\}=\\max\_\{\\phi\\in\\mathcal\{A\}\}r\(\\phi\)\. We denoteϕ⋆∈argmaxϕ∈𝒜r\(ϕ\)\\phi^\{\\star\}\\in\\operatorname\*\{arg\\,max\}\_\{\\phi\\in\\mathcal\{A\}\}r\(\\phi\)as the optimal arm\.
We denoteΛk\\Lambda\_\{k\}as the regularized Gram matrix andθ^k\\hat\{\\theta\}\_\{k\}as the ridge estimator ofθ⋆\\theta^\{\\star\}at roundkk:
Λk=λI\+∑i=1k−1ϕiϕi⊤,θ^k=Λk−1∑i=1k−1ϕiri,\\displaystyle\\Lambda\_\{k\}=\\lambda I\+\\sum\_\{i=1\}^\{k\-1\}\\phi\_\{i\}\\phi\_\{i\}^\{\\top\},\\quad\\hat\{\\theta\}\_\{k\}=\\Lambda\_\{k\}^\{\-1\}\\sum\_\{i=1\}^\{k\-1\}\\phi\_\{i\}r\_\{i\},\(2\)whereλ\>0\\lambda\>0is a regularization parameter\.
### 3\.2Basic Algorithms
The two popular algorithms are Upper Confidence Bound \(UCB\) and Thompson Sampling \(TS\)\.
#### UCB algorithm\.
UCB constructs a confidence set forθ⋆\\theta^\{\\star\}and selects the arm with the largest upper confidence bound\. The following lemma gives a confidence bound for the estimation errorθ^k−θ⋆\\hat\{\\theta\}\_\{k\}\-\\theta^\{\\star\}\.
###### Lemma 3\.1\(Confidence bound;abbasi2011improved\)\.
Letℰ\\mathscr\{E\}be the event such that for allkk,
∥θ^k−θ⋆∥Λk≤λB\+Rdln\(1\+k/λδ\)=:βk\.\\displaystyle\\left\\lVert\\hat\{\\theta\}\_\{k\}\-\\theta^\{\\star\}\\right\\rVert\_\{\\Lambda\_\{k\}\}\\leq\\sqrt\{\\lambda\}B\+R\\sqrt\{d\\ln\\left\(\\frac\{1\+k/\\lambda\}\{\\delta\}\\right\)\}=\\nolinebreak\\mkern\-1\.2mu\\vcentcolon\\beta\_\{k\}\\;\.whereδ\>0\\delta\>0\. Then, for anyδ\>0\\delta\>0,ℙ\(ℰ\)≥1−δ\\mathbb\{P\}\(\\mathscr\{E\}\)\\geq 1\-\\delta, and consequently, for anyϕ∈𝒜\\phi\\in\\mathcal\{A\}andkk,
\|ϕ⊤θ⋆−ϕ⊤θ^k\|≤βk‖ϕ‖Λk−1\.\\displaystyle\\lvert\\phi^\{\\top\}\\theta^\{\\star\}\-\\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\\rvert\\leq\\beta\_\{k\}\\left\\lVert\\phi\\right\\rVert\_\{\\Lambda\_\{k\}^\{\-1\}\}\\;\.\(3\)
Using[Eq\.˜3](https://arxiv.org/html/2606.28616#S3.E3), UCB selects the arm with the largest upper confidence bound:
\(UCB\)ϕkUCB∈argmaxϕ∈𝒜ϕ⊤θ^k\+βk∥ϕ∥Λk−1\.\\displaystyle\\text\{\(UCB\)\}\\qquad\\phi\_\{k\}^\{\\mathrm\{UCB\}\}\\in\\operatorname\*\{arg\\,max\}\_\{\\phi\\in\\mathcal\{A\}\}\\ \\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\beta\_\{k\}\\lVert\\phi\\rVert\_\{\\Lambda\_\{k\}^\{\-1\}\}\.\(4\)This rule embodies the*Optimism in the Face of Uncertainty*\(OFU\) principle: it favors arms with high estimated rewards and large uncertainty, thereby balancing exploitation and exploration\. UCB has a regret of𝒪\(dKln\(Kδ−1\)\)\\mathcal\{O\}\(d\\sqrt\{K\}\\ln\(K\\delta^\{\-1\}\)\)with probability at least1−δ1\-\\delta\(abbasi2011improved\)\.
#### Thompson sampling\.
TS maintains a posterior distribution over the unknown parameterθ⋆\\theta^\{\\star\}and acts greedily with respect to a sampled parameter\. When Gaussian priors and likelihoods are used for estimatingθ⋆\\theta^\{\\star\}, the posterior distribution is also Gaussian, and TS selects the arm by:
\(TS\)ϕkTS∈argmaxϕ∈𝒜f\(ϕ,ξk\)≔ϕ⊤θ^k\+βkϕ⊤Λk−1/2ξkwhereξk∼𝒩\(𝟎,I\)\.\\displaystyle\\text\{\(TS\)\}\\quad\\phi\_\{k\}^\{\\mathrm\{TS\}\}\\in\\operatorname\*\{arg\\,max\}\_\{\\phi\\in\\mathcal\{A\}\}\\ f\(\\phi,\\xi\_\{k\}\)\\coloneqq\\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\beta\_\{k\}\\phi^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}\\quad\\text\{where\}\\quad\\xi\_\{k\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\)\.\(5\)
TS is often preferred in practice because it is computationally simpler and easier to implement\. For a fixed Gaussian sample,[Eq\.˜5](https://arxiv.org/html/2606.28616#S3.E5)reduces to a linear maximization over𝒜\\mathcal\{A\}, whereas UCB in[Eq\.˜4](https://arxiv.org/html/2606.28616#S3.E4)maximizes a linear term plus an ellipsoidal bonus\.
However, regret analysis for TS is more challenging than for UCB\. Unlike UCB, TS may select non\-optimistic arms, which requires a more delicate analysis of non\-optimistic rounds\. A typical proof strategy first shows that TS samples an optimistic parameter with constant probabilityp\>0p\>0via anti\-concentration arguments \(e\.g\., Lemma 2 inagrawal2013thompson\), and then scales the optimistic regret by1/p1/p\(e\.g\.,*master theorem*injanz2023ensemble\)\. This yields a regret bound of𝒪\(p−1d3Kln\(Kδ−1\)\)\\mathcal\{O\}\(p^\{\-1\}\\sqrt\{d^\{3\}K\}\\ln\(K\\delta^\{\-1\}\)\)\(agrawal2013thompson;abeille2017linear\)\.
## 4Absolute Thompson Sampling
As explained in the previous section, UCB admits a simple regret analysis but can be computationally intensive, whereas TS is computationally efficient but requires a more involved analysis\. To bridge this gap, we design a randomized algorithm that retains the computational efficiency of TS while allowing a regret analysis closer to that of UCB\.
Our key idea is to make a minimal modification to TS so that it behaves more like UCB\. Specifically, we use the following equality: For any symmetric positive definite matrixΛ\\Lambdaand vectorϕ∈ℝd\\phi\\in\\mathbb\{R\}^\{d\},
𝔼ξ∼𝒩\(𝟎,I\)\[\|ϕ⊤Λ−1/2ξ\|\]=2π‖ϕ‖Λ−1\.\\displaystyle\\mathbb\{E\}\_\{\\xi\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\)\}\\left\[\\left\\lvert\\phi^\{\\top\}\\Lambda^\{\-1/2\}\\xi\\right\\rvert\\right\]=\\sqrt\{\\frac\{2\}\{\\pi\}\}\\left\\lVert\\phi\\right\\rVert\_\{\\Lambda^\{\-1\}\}\\;\.\(6\)Thus, for a fixedϕ\\phi, the absolute inner\-product term in TS, namely\|ϕ⊤Λk−1/2ξk\|\\lvert\\phi^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}\\rvert, behaves like the UCB bonus term‖ϕ‖Λk−1\\\|\\phi\\\|\_\{\\Lambda\_\{k\}^\{\-1\}\}in expectation\.
Motivated by this observation, we replace the signed inner product in[Eq\.˜5](https://arxiv.org/html/2606.28616#S3.E5)with its absolute value\.
\(ATS\)ϕkATS∈argmaxϕ∈𝒜ϕ⊤θ^k\+π2βk\|ϕ⊤Λk−1/2ξk\|whereξk∼𝒩\(𝟎,I\)\.\\displaystyle\\text\{\(ATS\)\}\\quad\\phi\_\{k\}^\{\\mathrm\{ATS\}\}\\in\\operatorname\*\{arg\\,max\}\_\{\\phi\\in\\mathcal\{A\}\}\\ \\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\sqrt\{\\frac\{\\pi\}\{2\}\}\\beta\_\{k\}\\left\\lvert\\phi^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}\\right\\rvert\\quad\\text\{where\}\\quad\\xi\_\{k\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\)\.\(7\)We call this algorithm absolute Thompson sampling \(ATS\)\.
### 4\.1Properties of ATS
Letℱk≔σ\(ξ1,ε1,…,ξk,εk\)\\mathcal\{F\}\_\{k\}\\coloneqq\\sigma\(\\xi\_\{1\},\\varepsilon\_\{1\},\\ldots,\\xi\_\{k\},\\varepsilon\_\{k\}\)be the filtration such thatεk\\varepsilon\_\{k\}andξk\\xi\_\{k\}areℱk\\mathcal\{F\}\_\{k\}\-measurable\. By slightly abusing notation, we use the shorthand𝔼ξk\[⋅\]:=𝔼\[⋅\|ℱk−1\]\\mathbb\{E\}\_\{\\xi\_\{k\}\}\\left\[\\cdot\\right\]\\vcentcolon\\nolinebreak\\mkern\-1\.2mu=\\mathbb\{E\}\\left\[\\cdot\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\middle\|\\allowbreak\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\mathopen\{\}\\mathcal\{F\}\_\{k\-1\}\\right\]\.
for*k=1,…,Kk=1,\\dots,K*do
Update Gram matrix
Λk=λI\+∑i=1k−1ϕiϕi⊤\\Lambda\_\{k\}=\\lambda I\+\\sum\_\{i=1\}^\{k\-1\}\\phi\_\{i\}\\phi\_\{i\}^\{\\top\};
Sample
ξk∼𝒩\(𝟎,I\)\\xi\_\{k\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\);
Compute
ϕk∈argmaxϕ∈𝒜ϕ⊤θ^k\+π2βk\|ϕ⊤Λk−1/2ξk\|\\phi\_\{k\}\\in\\operatorname\*\{arg\\,max\}\_\{\\phi\\in\\mathcal\{A\}\}\\ \\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\sqrt\{\\frac\{\\pi\}\{2\}\}\\beta\_\{k\}\\left\\lvert\\phi^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}\\right\\rvert\. See[Remark˜4\.3](https://arxiv.org/html/2606.28616#S4.Thmtheorem3);
Pull
ϕk\\phi\_\{k\}and observe
rkr\_\{k\};
end for
Algorithm 1Absolute Thompson Sampling \(ATS\)
### 4\.2Regret Analysis
The minimal algorithmic modification by absolute value significantly simplifies the regret analysis of TS\. This section proves that[algorithm˜1](https://arxiv.org/html/2606.28616#algorithm1)achieves sublinear regret with high probability\.
###### Theorem 4\.4\.
For anyδ\>0\\delta\>0, with probability at least1−δ1\-\\delta, for anyK≥1K\\geq 1,[algorithm˜1](https://arxiv.org/html/2606.28616#algorithm1)achieves regret ofReg\(K\)=𝒪\(d3Kln\(Kδ−1\)\)\\operatorname\{Reg\}\(K\)=\\mathcal\{O\}\\left\(\\sqrt\{d^\{3\}K\}\\ln\(K\\delta^\{\-1\}\)\\right\)\.
###### Proof\.
The proof mirrors the regret analysis of UCB\(abbasi2011improved\): Decompose the regret with optimism \(Step 1\) and then bound the remaining terms via the elliptical potential lemma \(Step 3\)\. Since the optimism holds in expectation, we add one step to remove the expectations over the samplesξk\\xi\_\{k\}before applying the elliptical potential lemma \(Step 2\)\. Notably, unlike the standard analysis of TS \(e\.g\.,agrawal2013thompson;abeille2017linear\),we do not need to establish a constant probability of optimism or separately handle non\-optimistic rounds, which significantly simplifies the proof\.
#### Step 1\. Regret decomposition with optimism\.
Suppose that the eventℰ\\mathscr\{E\}in[Lemma˜3\.1](https://arxiv.org/html/2606.28616#S3.Thmtheorem1)holds\. By adding and subtracting the expected objective of ATS𝔼ξk\[ϕk⊤θ^k\+π2βk\|ϕk⊤Λk−1/2ξk\|\]\\mathbb\{E\}\_\{\\xi\_\{k\}\}\\left\[\\phi\_\{k\}^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\sqrt\{\\frac\{\\pi\}\{2\}\}\\beta\_\{k\}\\left\\lvert\\phi\_\{k\}^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}\\right\\rvert\\right\], we decompose the regret by:
Reg\(K\)=\\displaystyle\\operatorname\{Reg\}\(K\)=∑k=1K\(ϕ⋆\)⊤θ⋆−𝔼ξk\[ϕk⊤θ^k\+π2βk\|ϕk⊤Λk−1/2ξk\|\]⏟≤0by[Lemma˜4\.2](https://arxiv.org/html/2606.28616#S4.Thmtheorem2)\\displaystyle\\sum\_\{k=1\}^\{K\}\\underbrace\{\(\\phi^\{\\star\}\)^\{\\top\}\\theta^\{\\star\}\-\\mathbb\{E\}\_\{\\xi\_\{k\}\}\\left\[\\phi\_\{k\}^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\sqrt\{\\frac\{\\pi\}\{2\}\}\\beta\_\{k\}\\left\\lvert\\phi\_\{k\}^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}\\right\\rvert\\right\]\}\_\{\\leq 0\\text\{ by \\lx@cref\{creftype~refnum\}\{lemma:optimism\-ATS\}\}\}\(9\)\+∑k=1K\(𝔼ξk\[ϕk⊤θ^k\]−ϕk⊤θ⋆\)⏟=:1\+π2∑k=1Kβk𝔼ξk\[\|ϕk⊤Λk−1/2ξk\|\]⏟≕2\.\\displaystyle\+\\underbrace\{\\sum\_\{k=1\}^\{K\}\\left\(\\mathbb\{E\}\_\{\\xi\_\{k\}\}\\left\[\\phi\_\{k\}^\{\\top\}\\hat\{\\theta\}\_\{k\}\\right\]\-\\phi\_\{k\}^\{\\top\}\\theta^\{\\star\}\\right\)\}\_\{=\\nolinebreak\\mkern\-1\.2mu\\vcentcolon\\hbox to7\.83pt\{\\vbox to7\.83pt\{\\pgfpicture\\makeatletter\\hbox\{\\enskip\\lower\-3\.91264pt\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@setlinewidth\{\\the\\pgflinewidth\}\\pgfsys@invoke\{ \}\\nullfont\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{ \{\{\}\}\\hbox\{\\hbox\{\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{\{\}\{\{\{\}\}\}\{\{\}\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\{\}\\pgfsys@moveto\{3\.71265pt\}\{0\.0pt\}\\pgfsys@curveto\{3\.71265pt\}\{2\.05046pt\}\{2\.05046pt\}\{3\.71265pt\}\{0\.0pt\}\{3\.71265pt\}\\pgfsys@curveto\{\-2\.05046pt\}\{3\.71265pt\}\{\-3\.71265pt\}\{2\.05046pt\}\{\-3\.71265pt\}\{0\.0pt\}\\pgfsys@curveto\{\-3\.71265pt\}\{\-2\.05046pt\}\{\-2\.05046pt\}\{\-3\.71265pt\}\{0\.0pt\}\{\-3\.71265pt\}\\pgfsys@curveto\{2\.05046pt\}\{\-3\.71265pt\}\{3\.71265pt\}\{\-2\.05046pt\}\{3\.71265pt\}\{0\.0pt\}\\pgfsys@closepath\\pgfsys@moveto\{0\.0pt\}\{0\.0pt\}\\pgfsys@stroke\\pgfsys@invoke\{ \} \}\{\{\{\{\}\}\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\pgfsys@transformcm\{1\.0\}\{0\.0\}\{0\.0\}\{1\.0\}\{\-1\.99306pt\}\{\-2\.25555pt\}\\pgfsys@invoke\{ \}\\hbox\{\{\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\hbox\{\{1\}\} \}\}\\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \} \\pgfsys@invoke\{ \}\\pgfsys@endscope\{\{\{\}\}\}\{\}\{\}\\hss\}\\pgfsys@discardpath\\pgfsys@invoke\{ \}\\pgfsys@endscope\\hss\}\}\\endpgfpicture\}\}\}\+\\underbrace\{\\sqrt\{\\frac\{\\pi\}\{2\}\}\\sum\_\{k=1\}^\{K\}\\beta\_\{k\}\\mathbb\{E\}\_\{\\xi\_\{k\}\}\\left\[\\left\\lvert\\phi\_\{k\}^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}\\right\\rvert\\right\]\}\_\{\\eqqcolon\\hbox to7\.83pt\{\\vbox to7\.83pt\{\\pgfpicture\\makeatletter\\hbox\{\\enskip\\lower\-3\.91264pt\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@setlinewidth\{\\the\\pgflinewidth\}\\pgfsys@invoke\{ \}\\nullfont\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{ \{\{\}\}\\hbox\{\\hbox\{\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{\{\}\{\{\{\}\}\}\{\{\}\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\{\}\\pgfsys@moveto\{3\.71265pt\}\{0\.0pt\}\\pgfsys@curveto\{3\.71265pt\}\{2\.05046pt\}\{2\.05046pt\}\{3\.71265pt\}\{0\.0pt\}\{3\.71265pt\}\\pgfsys@curveto\{\-2\.05046pt\}\{3\.71265pt\}\{\-3\.71265pt\}\{2\.05046pt\}\{\-3\.71265pt\}\{0\.0pt\}\\pgfsys@curveto\{\-3\.71265pt\}\{\-2\.05046pt\}\{\-2\.05046pt\}\{\-3\.71265pt\}\{0\.0pt\}\{\-3\.71265pt\}\\pgfsys@curveto\{2\.05046pt\}\{\-3\.71265pt\}\{3\.71265pt\}\{\-2\.05046pt\}\{3\.71265pt\}\{0\.0pt\}\\pgfsys@closepath\\pgfsys@moveto\{0\.0pt\}\{0\.0pt\}\\pgfsys@stroke\\pgfsys@invoke\{ \} \}\{\{\{\{\}\}\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\pgfsys@transformcm\{1\.0\}\{0\.0\}\{0\.0\}\{1\.0\}\{\-1\.99306pt\}\{\-2\.25555pt\}\\pgfsys@invoke\{ \}\\hbox\{\{\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\hbox\{\{2\}\} \}\}\\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \} \\pgfsys@invoke\{ \}\\pgfsys@endscope\{\{\{\}\}\}\{\}\{\}\\hss\}\\pgfsys@discardpath\\pgfsys@invoke\{ \}\\pgfsys@endscope\\hss\}\}\\endpgfpicture\}\}\}\\;\.
Therefore, it suffices to bound1and2\. The term2is bounded by
2≤\(a\)π2βK∑k=1K𝔼ξk\[∥ξk∥22\]𝔼ξk\[∥ϕk∥Λk−12\]\\displaystyle\\stackrel\{\{\\scriptstyle\\scriptscriptstyle\(\\mkern\-1\.5mua\\mkern\-1\.5mu\)\}\}\{\{\\leq\}\}\\sqrt\{\\frac\{\\pi\}\{2\}\}\\beta\_\{K\}\\sum\_\{k=1\}^\{K\}\\sqrt\{\\mathbb\{E\}\_\{\\xi\_\{k\}\}\\left\[\\lVert\\xi\_\{k\}\\rVert\_\{2\}^\{2\}\\right\]\}\\sqrt\{\\mathbb\{E\}\_\{\\xi\_\{k\}\}\\left\[\\lVert\\phi\_\{k\}\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}^\{2\}\\right\]\}\(10\)=\(b\)πd2βK∑k=1K𝔼ξk\[∥ϕk∥Λk−12\]≤\(c\)πd2βKK∑k=1K𝔼ξk\[∥ϕk∥Λk−12\],\\displaystyle\\stackrel\{\{\\scriptstyle\\scriptscriptstyle\(\\mkern\-1\.5mub\\mkern\-1\.5mu\)\}\}\{\{=\}\}\\sqrt\{\\frac\{\\pi d\}\{2\}\}\\beta\_\{K\}\\sum\_\{k=1\}^\{K\}\\sqrt\{\\mathbb\{E\}\_\{\\xi\_\{k\}\}\\left\[\\lVert\\phi\_\{k\}\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}^\{2\}\\right\]\}\\stackrel\{\{\\scriptstyle\\scriptscriptstyle\(\\mkern\-1\.5muc\\mkern\-1\.5mu\)\}\}\{\{\\leq\}\}\\sqrt\{\\frac\{\\pi d\}\{2\}\}\\beta\_\{K\}\\sqrt\{K\\sum\_\{k=1\}^\{K\}\\mathbb\{E\}\_\{\\xi\_\{k\}\}\\left\[\\lVert\\phi\_\{k\}\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}^\{2\}\\right\]\}\\;,where \(a\) uses the Cauchy–Schwarz inequality:𝔼\[X⊤Y\]≤𝔼\[∥X∥2∥Y∥2\]≤𝔼‖X‖22𝔼‖Y‖22\\mathbb\{E\}\[X^\{\\top\}Y\]\\leq\\mathbb\{E\}\[\\lVert X\\rVert\_\{2\}\\lVert Y\\rVert\_\{2\}\]\\leq\\sqrt\{\\mathbb\{E\}\\\|X\\\|\_\{2\}^\{2\}\}\\sqrt\{\\mathbb\{E\}\\\|Y\\\|\_\{2\}^\{2\}\}, \(b\) uses𝔼ξ\[∥ξ∥22\]=d\\mathbb\{E\}\_\{\\xi\}\[\\lVert\\xi\\rVert\_\{2\}^\{2\}\]=dforξ∼𝒩\(𝟎,I\)\\xi\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\), and \(c\) uses Cauchy–Schwarz again\.
By inserting the bound of2into[Eq\.˜9](https://arxiv.org/html/2606.28616#S4.E9),
Reg\(K\)≤∑k=1K𝔼\[ϕk⊤θ^k\|ℱk−1\]−ϕk⊤θ⋆⏟=1\+πd2βKK∑k=1K𝔼\[∥ϕk∥Λk−12\|ℱk−1\]⏟=:3\.\\displaystyle\\operatorname\{Reg\}\(K\)\\leq\\underbrace\{\\sum\_\{k=1\}^\{K\}\\mathbb\{E\}\\left\[\\phi\_\{k\}^\{\\top\}\\hat\{\\theta\}\_\{k\}\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\middle\|\\allowbreak\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\mathopen\{\}\\mathcal\{F\}\_\{k\-1\}\\right\]\-\\phi\_\{k\}^\{\\top\}\\theta^\{\\star\}\}\_\{=\\hbox to7\.83pt\{\\vbox to7\.83pt\{\\pgfpicture\\makeatletter\\hbox\{\\enskip\\lower\-3\.91264pt\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@setlinewidth\{\\the\\pgflinewidth\}\\pgfsys@invoke\{ \}\\nullfont\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{ \{\{\}\}\\hbox\{\\hbox\{\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{\{\}\{\{\{\}\}\}\{\{\}\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\{\}\\pgfsys@moveto\{3\.71265pt\}\{0\.0pt\}\\pgfsys@curveto\{3\.71265pt\}\{2\.05046pt\}\{2\.05046pt\}\{3\.71265pt\}\{0\.0pt\}\{3\.71265pt\}\\pgfsys@curveto\{\-2\.05046pt\}\{3\.71265pt\}\{\-3\.71265pt\}\{2\.05046pt\}\{\-3\.71265pt\}\{0\.0pt\}\\pgfsys@curveto\{\-3\.71265pt\}\{\-2\.05046pt\}\{\-2\.05046pt\}\{\-3\.71265pt\}\{0\.0pt\}\{\-3\.71265pt\}\\pgfsys@curveto\{2\.05046pt\}\{\-3\.71265pt\}\{3\.71265pt\}\{\-2\.05046pt\}\{3\.71265pt\}\{0\.0pt\}\\pgfsys@closepath\\pgfsys@moveto\{0\.0pt\}\{0\.0pt\}\\pgfsys@stroke\\pgfsys@invoke\{ \} \}\{\{\{\{\}\}\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\pgfsys@transformcm\{1\.0\}\{0\.0\}\{0\.0\}\{1\.0\}\{\-1\.99306pt\}\{\-2\.25555pt\}\\pgfsys@invoke\{ \}\\hbox\{\{\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\hbox\{\{1\}\} \}\}\\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \} \\pgfsys@invoke\{ \}\\pgfsys@endscope\{\{\{\}\}\}\{\}\{\}\\hss\}\\pgfsys@discardpath\\pgfsys@invoke\{ \}\\pgfsys@endscope\\hss\}\}\\endpgfpicture\}\}\}\+\\underbrace\{\\sqrt\{\\frac\{\\pi d\}\{2\}\}\\beta\_\{K\}\\sqrt\{K\\sum\_\{k=1\}^\{K\}\\mathbb\{E\}\\left\[\\lVert\\phi\_\{k\}\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}^\{2\}\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\middle\|\\allowbreak\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\mathopen\{\}\\mathcal\{F\}\_\{k\-1\}\\right\]\}\}\_\{=\\nolinebreak\\mkern\-1\.2mu\\vcentcolon\\hbox to7\.83pt\{\\vbox to7\.83pt\{\\pgfpicture\\makeatletter\\hbox\{\\enskip\\lower\-3\.91264pt\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@setlinewidth\{\\the\\pgflinewidth\}\\pgfsys@invoke\{ \}\\nullfont\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{ \{\{\}\}\\hbox\{\\hbox\{\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{\{\}\{\{\{\}\}\}\{\{\}\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\{\}\\pgfsys@moveto\{3\.71265pt\}\{0\.0pt\}\\pgfsys@curveto\{3\.71265pt\}\{2\.05046pt\}\{2\.05046pt\}\{3\.71265pt\}\{0\.0pt\}\{3\.71265pt\}\\pgfsys@curveto\{\-2\.05046pt\}\{3\.71265pt\}\{\-3\.71265pt\}\{2\.05046pt\}\{\-3\.71265pt\}\{0\.0pt\}\\pgfsys@curveto\{\-3\.71265pt\}\{\-2\.05046pt\}\{\-2\.05046pt\}\{\-3\.71265pt\}\{0\.0pt\}\{\-3\.71265pt\}\\pgfsys@curveto\{2\.05046pt\}\{\-3\.71265pt\}\{3\.71265pt\}\{\-2\.05046pt\}\{3\.71265pt\}\{0\.0pt\}\\pgfsys@closepath\\pgfsys@moveto\{0\.0pt\}\{0\.0pt\}\\pgfsys@stroke\\pgfsys@invoke\{ \} \}\{\{\{\{\}\}\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\pgfsys@transformcm\{1\.0\}\{0\.0\}\{0\.0\}\{1\.0\}\{\-1\.99306pt\}\{\-2\.25555pt\}\\pgfsys@invoke\{ \}\\hbox\{\{\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\hbox\{\{3\}\} \}\}\\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \} \\pgfsys@invoke\{ \}\\pgfsys@endscope\{\{\{\}\}\}\{\}\{\}\\hss\}\\pgfsys@discardpath\\pgfsys@invoke\{ \}\\pgfsys@endscope\\hss\}\}\\endpgfpicture\}\}\}\\;\.\(11\)
#### Step 2\. Removing expectations\.
The confidence bound[Eq\.˜3](https://arxiv.org/html/2606.28616#S3.E3)suggests that3is on the order of∑k=1K∥ϕk∥Λk−1\\sum^\{K\}\_\{k=1\}\\lVert\\phi\_\{k\}\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}\. This motivates us to bound1and3using the elliptical potential lemma:
###### Lemma 4\.5\(Elliptical potential;abbasi2011improved\)\.
For allK≥1K\\geq 1andλ≥1\\lambda\\geq 1,
∑k=1K‖ϕk‖Λk−12≤2dln\(1\+Kλd\)⟹∑k=1K‖ϕk‖Λk−1≤2dKln\(1\+Kλd\)\.\\sum\_\{k=1\}^\{K\}\\left\\\|\\phi\_\{k\}\\right\\\|\_\{\\Lambda\_\{k\}^\{\-1\}\}^\{2\}\\leq 2d\\ln\\left\(1\+\\frac\{K\}\{\\lambda d\}\\right\)\\implies\\sum\_\{k=1\}^\{K\}\\left\\\|\\phi\_\{k\}\\right\\\|\_\{\\Lambda\_\{k\}^\{\-1\}\}\\leq\\sqrt\{2dK\\ln\\left\(1\+\\frac\{K\}\{\\lambda d\}\\right\)\}\\;\.
However, to use[Lemma˜4\.5](https://arxiv.org/html/2606.28616#S4.Thmtheorem5), we need to remove the expectations in1and3over the samplesξk\\xi\_\{k\}\. We use the following lemma to convert expectations to concrete realizations\.
###### Lemma 4\.6\.
Letℰ1\\mathscr\{E\}\_\{1\}andℰ2\\mathscr\{E\}\_\{2\}be the events such that, for anyK≥1K\\geq 1,
ℰ1:\\displaystyle\\mathscr\{E\}\_\{1\}:\\qquad∑k=1K𝔼\[ϕk⊤θ^k\|ℱk−1\]≤∑k=1Kϕk⊤θ^k\+2\(B\+βKλ\)Klnπ2K26δ,\\displaystyle\\sum\_\{k=1\}^\{K\}\\mathbb\{E\}\\left\[\\phi\_\{k\}^\{\\top\}\\hat\{\\theta\}\_\{k\}\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\middle\|\\allowbreak\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\mathopen\{\}\\mathcal\{F\}\_\{k\-1\}\\right\]\\leq\\sum\_\{k=1\}^\{K\}\\phi\_\{k\}^\{\\top\}\\hat\{\\theta\}\_\{k\}\+2\\left\(B\+\\frac\{\\beta\_\{K\}\}\{\\sqrt\{\\lambda\}\}\\right\)\\sqrt\{K\\ln\\frac\{\\pi^\{2\}K^\{2\}\}\{6\\delta\}\}\\;,\(12\)ℰ2:\\displaystyle\\mathscr\{E\}\_\{2\}:\\qquad∑k=1K𝔼\[∥ϕk∥Λk−12\|ℱk−1\]≤2∑k=1K∥ϕk∥Λk−12\+4λln2Kδ,\\displaystyle\\sum\_\{k=1\}^\{K\}\\mathbb\{E\}\\left\[\\lVert\\phi\_\{k\}\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}^\{2\}\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\middle\|\\allowbreak\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\mathopen\{\}\\mathcal\{F\}\_\{k\-1\}\\right\]\\leq 2\\sum\_\{k=1\}^\{K\}\\lVert\\phi\_\{k\}\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}^\{2\}\+\\frac\{4\}\{\\lambda\}\\ln\\frac\{2K\}\{\\delta\}\\;,\(13\)whereδ\>0\\delta\>0\. Then, for anyδ\>0\\delta\>0,ℙ\(ℰ∩ℰ1c\)≤δ\\mathbb\{P\}\\left\(\\mathscr\{E\}\\cap\\mathscr\{E\}\_\{1\}^\{c\}\\right\)\\leq\\deltaandℙ\(ℰ2c\)≤δ\\mathbb\{P\}\\left\(\\mathscr\{E\}\_\{2\}^\{c\}\\right\)\\leq\\delta\.
###### Proof\.
For the first claim, under the good eventℰ\\mathscr\{E\}, we have
\|ϕk⊤θ^k\|≤∥θ^k∥2≤∥θ⋆∥2\+∥θ^k−θ⋆∥2≤∥θ⋆∥2\+1λ∥θ^k−θ⋆∥Λk≤B\+βKλ\.\\left\\lvert\\phi\_\{k\}^\{\\top\}\\hat\{\\theta\}\_\{k\}\\right\\rvert\\leq\\lVert\\hat\{\\theta\}\_\{k\}\\rVert\_\{2\}\\leq\\lVert\\theta^\{\\star\}\\rVert\_\{2\}\+\\lVert\\hat\{\\theta\}\_\{k\}\-\\theta^\{\\star\}\\rVert\_\{2\}\\leq\\lVert\\theta^\{\\star\}\\rVert\_\{2\}\+\\frac\{1\}\{\\sqrt\{\\lambda\}\}\\lVert\\hat\{\\theta\}\_\{k\}\-\\theta^\{\\star\}\\rVert\_\{\\Lambda\_\{k\}\}\\leq B\+\\frac\{\\beta\_\{K\}\}\{\\sqrt\{\\lambda\}\}\\;\.Then, the first claim is followed by applying a version of Azuma–Hoeffding \([Lemma˜A\.2](https://arxiv.org/html/2606.28616#A1.Thmtheorem2)\) to the martingale difference sequence𝔼\[ϕk⊤θ^k\|ℱk−1\]−ϕk⊤θ^k\\mathbb\{E\}\\left\[\\phi\_\{k\}^\{\\top\}\\hat\{\\theta\}\_\{k\}\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\middle\|\\allowbreak\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\mathopen\{\}\\mathcal\{F\}\_\{k\-1\}\\right\]\-\\phi\_\{k\}^\{\\top\}\\hat\{\\theta\}\_\{k\}\. The second claim uses Lemma D\.4 inrosenberg2020near\(see[Lemma˜A\.3](https://arxiv.org/html/2606.28616#A1.Thmtheorem3)\) with the fact that∥ϕk∥Λk−12≤1λ\\lVert\\phi\_\{k\}\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}^\{2\}\\leq\\frac\{1\}\{\\lambda\}\. ∎
#### Step 3\. Applying elliptical potential lemma\.
Now suppose that the eventℰ∩ℰ1∩ℰ2\\mathscr\{E\}\\cap\\mathscr\{E\}\_\{1\}\\cap\\mathscr\{E\}\_\{2\}holds\. This happens with probability at least1−3δ1\-3\\delta\.111This is becauseℙ\(ℰ∩ℰ1∩ℰ2\)=ℙ\(ℰ∩ℰ1\)−ℙ\(ℰ∩ℰ1∩ℰ2c\)≥1−2δ−ℙ\(ℰ2c\)≥1−3δ\\mathbb\{P\}\\left\(\\mathscr\{E\}\\cap\\mathscr\{E\}\_\{1\}\\cap\\mathscr\{E\}\_\{2\}\\right\)=\\mathbb\{P\}\\left\(\\mathscr\{E\}\\cap\\mathscr\{E\}\_\{1\}\\right\)\-\\mathbb\{P\}\\left\(\\mathscr\{E\}\\cap\\mathscr\{E\}\_\{1\}\\cap\\mathscr\{E\}\_\{2\}^\{c\}\\right\)\\geq 1\-2\\delta\-\\mathbb\{P\}\\left\(\\mathscr\{E\}\_\{2\}^\{c\}\\right\)\\geq 1\-3\\delta\.Then the first term1in[Eq\.˜11](https://arxiv.org/html/2606.28616#S4.E11)is bounded by
1≤\(a\)∑k=1Kϕk⊤θ^k−ϕk⊤θ⋆\+2\(B\+βKλ\)Klnπ2K26δ≤\(b\)βK∑k=1K‖ϕk‖Λk−1\+2\(B\+βKλ\)Klnπ2K26δ,\\displaystyle\\hbox to9\.93pt\{\\vbox to9\.93pt\{\\pgfpicture\\makeatletter\\hbox\{\\enskip\\lower\-4\.9644pt\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@setlinewidth\{\\the\\pgflinewidth\}\\pgfsys@invoke\{ \}\\nullfont\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{ \{\{\}\}\\hbox\{\\hbox\{\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{\{\}\{\{\{\}\}\}\{\{\}\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\{\}\\pgfsys@moveto\{4\.7644pt\}\{0\.0pt\}\\pgfsys@curveto\{4\.7644pt\}\{2\.63133pt\}\{2\.63133pt\}\{4\.7644pt\}\{0\.0pt\}\{4\.7644pt\}\\pgfsys@curveto\{\-2\.63133pt\}\{4\.7644pt\}\{\-4\.7644pt\}\{2\.63133pt\}\{\-4\.7644pt\}\{0\.0pt\}\\pgfsys@curveto\{\-4\.7644pt\}\{\-2\.63133pt\}\{\-2\.63133pt\}\{\-4\.7644pt\}\{0\.0pt\}\{\-4\.7644pt\}\\pgfsys@curveto\{2\.63133pt\}\{\-4\.7644pt\}\{4\.7644pt\}\{\-2\.63133pt\}\{4\.7644pt\}\{0\.0pt\}\\pgfsys@closepath\\pgfsys@moveto\{0\.0pt\}\{0\.0pt\}\\pgfsys@stroke\\pgfsys@invoke\{ \} \}\{\{\{\{\}\}\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\pgfsys@transformcm\{1\.0\}\{0\.0\}\{0\.0\}\{1\.0\}\{\-2\.5pt\}\{\-3\.22221pt\}\\pgfsys@invoke\{ \}\\hbox\{\{\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\hbox\{\{1\}\} \}\}\\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \} \\pgfsys@invoke\{ \}\\pgfsys@endscope\{\{\{\}\}\}\{\}\{\}\\hss\}\\pgfsys@discardpath\\pgfsys@invoke\{ \}\\pgfsys@endscope\\hss\}\}\\endpgfpicture\}\}\\stackrel\{\{\\scriptstyle\\scriptscriptstyle\(\\mkern\-1\.5mua\\mkern\-1\.5mu\)\}\}\{\{\\leq\}\}\\sum\_\{k=1\}^\{K\}\\phi\_\{k\}^\{\\top\}\\hat\{\\theta\}\_\{k\}\-\\phi\_\{k\}^\{\\top\}\\theta^\{\\star\}\+2\\left\(B\+\\frac\{\\beta\_\{K\}\}\{\\sqrt\{\\lambda\}\}\\right\)\\sqrt\{K\\ln\\frac\{\\pi^\{2\}K^\{2\}\}\{6\\delta\}\}\\stackrel\{\{\\scriptstyle\\scriptscriptstyle\(\\mkern\-1\.5mub\\mkern\-1\.5mu\)\}\}\{\{\\leq\}\}\\beta\_\{K\}\\sum\_\{k=1\}^\{K\}\\left\\lVert\\phi\_\{k\}\\right\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}\+2\\left\(B\+\\frac\{\\beta\_\{K\}\}\{\\sqrt\{\\lambda\}\}\\right\)\\sqrt\{K\\ln\\frac\{\\pi^\{2\}K^\{2\}\}\{6\\delta\}\}\\;,where \(a\) usesℰ1\\mathscr\{E\}\_\{1\}and \(b\) uses[Eq\.˜3](https://arxiv.org/html/2606.28616#S3.E3)underℰ\\mathscr\{E\}\. Usingℰ2\\mathscr\{E\}\_\{2\}, we have
3≤\(a\)βKπdK22∑k=1K∥ϕk∥Λk−12\+4λln2Kδ\.\\displaystyle\\hbox to9\.93pt\{\\vbox to9\.93pt\{\\pgfpicture\\makeatletter\\hbox\{\\enskip\\lower\-4\.9644pt\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@setlinewidth\{\\the\\pgflinewidth\}\\pgfsys@invoke\{ \}\\nullfont\\hbox to0\.0pt\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{ \{\{\}\}\\hbox\{\\hbox\{\{\\pgfsys@beginscope\\pgfsys@invoke\{ \}\{\{\}\{\{\{\}\}\}\{\{\}\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\}\{\{\}\\pgfsys@moveto\{4\.7644pt\}\{0\.0pt\}\\pgfsys@curveto\{4\.7644pt\}\{2\.63133pt\}\{2\.63133pt\}\{4\.7644pt\}\{0\.0pt\}\{4\.7644pt\}\\pgfsys@curveto\{\-2\.63133pt\}\{4\.7644pt\}\{\-4\.7644pt\}\{2\.63133pt\}\{\-4\.7644pt\}\{0\.0pt\}\\pgfsys@curveto\{\-4\.7644pt\}\{\-2\.63133pt\}\{\-2\.63133pt\}\{\-4\.7644pt\}\{0\.0pt\}\{\-4\.7644pt\}\\pgfsys@curveto\{2\.63133pt\}\{\-4\.7644pt\}\{4\.7644pt\}\{\-2\.63133pt\}\{4\.7644pt\}\{0\.0pt\}\\pgfsys@closepath\\pgfsys@moveto\{0\.0pt\}\{0\.0pt\}\\pgfsys@stroke\\pgfsys@invoke\{ \} \}\{\{\{\{\}\}\\pgfsys@beginscope\\pgfsys@invoke\{ \}\\pgfsys@transformcm\{1\.0\}\{0\.0\}\{0\.0\}\{1\.0\}\{\-2\.5pt\}\{\-3\.22221pt\}\\pgfsys@invoke\{ \}\\hbox\{\{\\definecolor\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@rgb@stroke\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\pgfsys@color@rgb@fill\{0\}\{0\}\{0\}\\pgfsys@invoke\{ \}\\hbox\{\{3\}\} \}\}\\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \\pgfsys@invoke\{ \}\\pgfsys@endscope\}\}\} \} \\pgfsys@invoke\{ \}\\pgfsys@endscope\{\{\{\}\}\}\{\}\{\}\\hss\}\\pgfsys@discardpath\\pgfsys@invoke\{ \}\\pgfsys@endscope\\hss\}\}\\endpgfpicture\}\}\\stackrel\{\{\\scriptstyle\\scriptscriptstyle\(\\mkern\-1\.5mua\\mkern\-1\.5mu\)\}\}\{\{\\leq\}\}\\beta\_\{K\}\\sqrt\{\\frac\{\\pi dK\}\{2\}\}\\sqrt\{2\\sum\_\{k=1\}^\{K\}\\lVert\\phi\_\{k\}\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}^\{2\}\+\\frac\{4\}\{\\lambda\}\\ln\\frac\{2K\}\{\\delta\}\}\\;\.
Finally, by inserting the bounds for1and3into[Eq\.˜11](https://arxiv.org/html/2606.28616#S4.E11),
Reg\(K\)≤\\displaystyle\\operatorname\{Reg\}\(K\)\\leqβK∑k=1K‖ϕk‖Λk−1\+2\(B\+βKλ\)Klnπ2K26δ\+βKπdK22∑k=1K∥ϕk∥Λk−12\+4λln2Kδ\\displaystyle\\beta\_\{K\}\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}\\sum\_\{k=1\}^\{K\}\\left\\lVert\\phi\_\{k\}\\right\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}\}\+2\\left\(B\+\\frac\{\\beta\_\{K\}\}\{\\sqrt\{\\lambda\}\}\\right\)\\sqrt\{K\\ln\\frac\{\\pi^\{2\}K^\{2\}\}\{6\\delta\}\}\+\\beta\_\{K\}\\sqrt\{\\frac\{\\pi dK\}\{2\}\}\\sqrt\{\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}2\\sum\_\{k=1\}^\{K\}\\lVert\\phi\_\{k\}\\rVert\_\{\\Lambda^\{\-1\}\_\{k\}\}^\{2\}\}\+\\frac\{4\}\{\\lambda\}\\ln\\frac\{2K\}\{\\delta\}\}≤\(a\)\\displaystyle\\stackrel\{\{\\scriptstyle\\scriptscriptstyle\(\\mkern\-1\.5mua\\mkern\-1\.5mu\)\}\}\{\{\\leq\}\}βK2dKln\(1\+Kλd\)\+2\(B\+βKλ\)Klnπ2K26δ\+βKπdK24dln\(1\+Kλd\)\+4λln2Kδ\\displaystyle\\beta\_\{K\}\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}\\sqrt\{2dK\\ln\\left\(1\+\\frac\{K\}\{\\lambda d\}\\right\)\}\}\+2\\left\(B\+\\frac\{\\beta\_\{K\}\}\{\\sqrt\{\\lambda\}\}\\right\)\\sqrt\{K\\ln\\frac\{\\pi^\{2\}K^\{2\}\}\{6\\delta\}\}\+\\beta\_\{K\}\\sqrt\{\\frac\{\\pi dK\}\{2\}\}\\sqrt\{\{\\color\[rgb\]\{1,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{1,0,0\}4d\\ln\\left\(1\+\\frac\{K\}\{\\lambda d\}\\right\)\}\+\\frac\{4\}\{\\lambda\}\\ln\\frac\{2K\}\{\\delta\}\}=\(b\)\\displaystyle\\stackrel\{\{\\scriptstyle\\scriptscriptstyle\(\\mkern\-1\.5mub\\mkern\-1\.5mu\)\}\}\{\{=\}\}𝒪\(d3Kln\(Kδ−1\)\),\\displaystyle\\mathcal\{O\}\\left\(\\sqrt\{d^\{3\}K\}\\ln\(K\\delta^\{\-1\}\)\\right\)\\;,where \(a\) applies[Lemma˜4\.5](https://arxiv.org/html/2606.28616#S4.Thmtheorem5)to the red terms and \(b\) usesβK=𝒪\(dlnKδ−1\)\\beta\_\{K\}=\\mathcal\{O\}\(\\sqrt\{d\\ln K\\delta^\{\-1\}\}\)\. ∎
## 5Ensemble Absolute Thompson Sampling
Although ATS is computationally efficient, its regret bound incurs an additionald\\sqrt\{d\}factor compared to UCB\. This looseness stems from the fact that the key identity \([6](https://arxiv.org/html/2606.28616#S4.E6)\) holds only for a fixedϕ\\phiindependent of the sampleξ\\xi, whereas in ATS the selected actionϕkATS\\phi\_\{k\}^\{\\mathrm\{ATS\}\}depends on the sampleξk\\xi\_\{k\}\. Consequently, the ATS objective can be overly optimistic in expectation relative to UCB, as suggested by[Lemma˜4\.2](https://arxiv.org/html/2606.28616#S4.Thmtheorem2)\.
In this section, we explore a simple extension of ATS that can mitigate thisd\\sqrt\{d\}gap by more closely approximating the UCB objective\.Our analysis provides only asymptotic justification, and a non\-asymptotic regret guarantee remains open\.We discuss the details in[Remark˜5\.3](https://arxiv.org/html/2606.28616#S5.Thmtheorem3)\.
The key idea is to leverage extreme\-value properties of Gaussian random variables\.
###### Proposition 5\.1\.
Letξ1,…,ξN∼𝒩\(𝟎,I\)\\xi\_\{1\},\\dots,\\xi\_\{N\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\)be i\.i\.d\. Gaussian vectors\. For fixed symmetric positive definite matrixΛ∈ℝd×d\\Lambda\\in\\mathbb\{R\}^\{d\\times d\}and vectorϕ∈ℝd\\phi\\in\\mathbb\{R\}^\{d\}, we have
maxi∈\[N\]12lnN\|ϕ⊤Λ−1/2ξi\|→ℙ‖ϕ‖Λ−1\.\\displaystyle\\max\_\{i\\in\[N\]\}\\frac\{1\}\{\\sqrt\{2\\ln N\}\}\\left\\lvert\\phi^\{\\top\}\\Lambda^\{\-1/2\}\\xi\_\{i\}\\right\\rvert\\xrightarrow\{\\mathbb\{P\}\}\\left\\lVert\\phi\\right\\rVert\_\{\\Lambda^\{\-1\}\}\\;\.\(14\)
The proof is provided in[Section˜B\.1](https://arxiv.org/html/2606.28616#A2.SS1)\. Thus, for largeNNand fixedϕ\\phi, the left hand side of[Eq\.˜14](https://arxiv.org/html/2606.28616#S5.E14)provides an estimator of the UCB bonus‖ϕ‖Λ−1\\\|\\phi\\\|\_\{\\Lambda^\{\-1\}\}\. This observation motivates the following algorithm which we call ensemble absolute Thompson sampling \(EATS\):
\(EATS\)ϕkEATS∈argmaxϕ∈𝒜maxi≤Nϕ⊤θ^k\+βk2lnN\|ϕ⊤Λk−1/2ξki\|whereξki∼𝒩\(0,I\)\.\\displaystyle\\text\{\(EATS\)\}\\quad\\phi\_\{k\}^\{\\mathrm\{EATS\}\}\\in\\arg\\max\_\{\\phi\\in\\mathcal\{A\}\}\\max\_\{i\\leq N\}\\ \\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\frac\{\\beta\_\{k\}\}\{\\sqrt\{2\\ln N\}\}\\lvert\{\\phi^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}^\{i\}\}\\rvert\\quad\\text\{where\}\\quad\\xi^\{i\}\_\{k\}\\sim\\mathcal\{N\}\(0,I\)\.\(15\)
### 5\.1Properties of EATS
for*k=1,…,Kk=1,\\dots,K*do
Update Gram matrix
Λk=λI\+∑i=1k−1ϕiϕi⊤\\Lambda\_\{k\}=\\lambda I\+\\sum\_\{i=1\}^\{k\-1\}\\phi\_\{i\}\\phi\_\{i\}^\{\\top\};
Sample
ξk1,…,ξkN∼𝒩\(𝟎,I\)\\xi\_\{k\}^\{1\},\\dots,\\xi\_\{k\}^\{N\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\);
Compute
ϕk∈argmaxϕ∈𝒜maxi∈\[N\]ϕ⊤θ^k\+βk2lnN\|ϕ⊤Λk−1/2ξki\|\\phi\_\{k\}\\in\\operatorname\*\{arg\\,max\}\_\{\\phi\\in\\mathcal\{A\}\}\\max\_\{i\\in\[N\]\}\\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\frac\{\\beta\_\{k\}\}\{\\sqrt\{2\\ln N\}\}\\left\\lvert\\phi^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}^\{i\}\\right\\rvert\. See[Remark˜5\.2](https://arxiv.org/html/2606.28616#S5.Thmtheorem2);
Pull
ϕk\\phi\_\{k\}and observe
rkr\_\{k\};
end for
Algorithm 2Ensemble Absolute Thompson Sampling \(EATS\)
## 6Experiments
This section presents the numerical results for the proposed algorithms\. We first compare the empirical performance of ATS and TS, with the goal of showing that ATS achieves performance comparable to that of TS\. We then study the empirical behavior of EATS to illustrate the potential of a new algorithmic design that may approach the statistical performance of UCB while substantially reducing computational complexity, provided that the ensemble size is chosen reasonably\.
### 6\.1Comparison ofATS\\mathrm\{ATS\}vsTS\\mathrm\{TS\}
We conduct experiments in three environments: the unit ball, the hypercube, and the two\-lane shortest\-path problem\. Unless otherwise specified, all reported performance metrics are averaged over 100 independent random seeds\.
Across these three environments, in addition to UCB \([Eq\.˜4](https://arxiv.org/html/2606.28616#S3.E4)\), TS \([Eq\.˜5](https://arxiv.org/html/2606.28616#S3.E5)\), and ATS \([Eq\.˜7](https://arxiv.org/html/2606.28616#S4.E7)\), we also compare the following variants:
\(TS\-No\-Inflation\)ϕkTS\-No\-Inflation∈argmaxϕ∈𝒜ϕ⊤θ^k\+ϕ⊤Λk−1/2ξkwhereξk∼𝒩\(𝟎,I\)\.\\displaystyle\\text\{\(TS\-No\-Inflation\)\}\\quad\\phi\_\{k\}^\{\\text\{TS\-No\-Inflation\}\}\\in\\operatorname\*\{arg\\,max\}\_\{\\phi\\in\\mathcal\{A\}\}\\ \\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\phi^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}\\quad\\text\{where\}\\quad\\xi\_\{k\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\)\\;\.\(ATS\-No\-π/2\)ϕkATS\-No\-π/2∈argmaxϕ∈𝒜ϕ⊤θ^k\+βk\|ϕ⊤Λk−1/2ξk\|whereξk∼𝒩\(𝟎,I\)\.\\displaystyle\\text\{\(ATS\-No\-$\\sqrt\{\\pi/2\}$\)\}\\quad\\phi\_\{k\}^\{\\text\{ATS\-No\-$\\sqrt\{\\pi/2\}$\}\}\\in\\operatorname\*\{arg\\,max\}\_\{\\phi\\in\\mathcal\{A\}\}\\ \\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\beta\_\{k\}\\left\\lvert\\phi^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}\\right\\rvert\\quad\\text\{where\}\\quad\\xi\_\{k\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\)\\;\.\(ATS\-No\-Inflation\)ϕkATS\-No\-Inflation∈argmaxϕ∈𝒜ϕ⊤θ^k\+\|ϕ⊤Λk−1/2ξk\|whereξk∼𝒩\(𝟎,I\)\.\\displaystyle\\text\{\(ATS\-No\-Inflation\)\}\\quad\\phi\_\{k\}^\{\\text\{ATS\-No\-Inflation\}\}\\in\\operatorname\*\{arg\\,max\}\_\{\\phi\\in\\mathcal\{A\}\}\\ \\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\left\\lvert\\phi^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}\\right\\rvert\\quad\\text\{where\}\\quad\\xi\_\{k\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\)\\;\.In addition, because UCB can be computationally expensive, we use Rarely\-Switching\-UCB \(RS\-UCB\)\(abbasi2011improved\)to reduce the computational cost\.
#### Unit Ball
The unit ball environment is a natural benchmark in this setting, as information\-theoretic lower bounds for linear bandits have been constructed with the arm set taken to be the unit ball\(lattimore2020bandit\)\. In addition, recent work has shown that linear Thompson Sampling does not require forced exploration when the action set is smooth and convex\(abeille2025and\)\.
From[figure˜1](https://arxiv.org/html/2606.28616#S6.F1), the left panel shows that ATS performs worse than TS by only a constant factor, while exhibiting the same dependence on the dimensiondd\. After removing theπ/2\\sqrt\{\\pi/2\}factor, as in ATS\-No\-π/2\\sqrt\{\\pi/2\}, the resulting performance is comparable to that of TS, and is even slightly better in our experiments\.
In the right panel, we further evaluate the no\-inflation variants of both TS and ATS, following the algorithmic design inabeille2025and, where TS is shown to achieve a𝒪~\(dT\)\\tilde\{\\mathcal\{O\}\}\(d\\sqrt\{T\}\)regret guarantee on smooth and convex action sets such as the unit ball\. Our experiments support this theoretical prediction and further suggest that ATS may also be able to match the performance of TS in the same setting considered byabeille2025and\.


Figure 1:Sweep results for the linear bandit problem with unit\-sphere arm set,‖θ⋆‖=1\\\|\\theta^\{\\star\}\\\|=1, and Gaussian noiseηt∼𝒩\(0,0\.1\)\\eta\_\{t\}\\sim\\mathcal\{N\}\(0,0\.1\)\. The parameterθ⋆\\theta^\{\\star\}is sampled uniformly from the unit sphere\. Results are averaged over 100 independent random seeds\. Both panels present log–log plots of the final regret of different algorithms, where the feature dimension ranges over15,16,…,24\{15,16,\\dotsc,24\}and the horizon is fixed atT=250000T=250000:\(Left\)TS, ATS, and ATS\-No\-π/2\\sqrt\{\\pi/2\};\(Right\)RS\-UCB, TS\-No\-Inflation, and ATS\-No\-Inflation\.
#### Hypercube
The hypercube environment is defined as follows\. For arm dimensiond∈ℤ\+d\\in\\mathbb\{Z\}^\{\+\}, the action set is𝒜=\{±1\}d/d\\mathcal\{A\}=\\\{\\pm 1\\\}^\{d\}/\\sqrt\{d\}, so that every armϕ∈𝒜\\phi\\in\\mathcal\{A\}satisfies‖ϕ‖2≤1\\\|\\phi\\\|\_\{2\}\\leq 1\. The unknown parameterθ∗\\theta\_\{\*\}is sampled uniformly from the unitℓ2\\ell\_\{2\}\-sphere\. We consider this environment for two reasons\. First, information\-theoretic lower bounds for linear bandits are known to arise from hypercube\-type constructions\(lattimore2020bandit\)\. Second, the action set has cardinality2d2^\{d\}, making exact UCB optimization computationally expensive in this setting\.
The hypercube is also a challenging experimental regime because the number of actions grows exponentially withdd, and the asymptotic behavior appears only slowly\. From[figure˜2](https://arxiv.org/html/2606.28616#S6.F2), we observe that ATS performs worse than TS and ATS\-No\-π/2\\sqrt\{\\pi/2\}by an approximately constant factor\. As in the unit ball setting, removing theπ/2\\sqrt\{\\pi/2\}factor improves the empirical performance of ATS, making it slightly better than TS in our experiments\. Moreover, the right panel suggests that these results should still be interpreted as transient behavior, since none of the observed regrets appears to have reached its asymptotic scaling by the end of the horizon\.


Figure 2:Sweep results for the linear bandit problem with the arm set is the hypercube,‖θ⋆‖=1\\\|\\theta^\{\\star\}\\\|=1, and noiseηt∼𝒩\(0,0\.1\)\\eta\_\{t\}\\sim\\mathcal\{N\}\(0,0\.1\)\. The parameterθ⋆\\theta^\{\\star\}is drawn uniformly at random from the unit sphere\. Results are averaged over 100 independent random seeds\. Both panels show plots of the final regret for different algorithms, with feature dimension varying over10,11,…,19\{10,11,\\dotsc,19\}and horizon fixed atT=500000T=500000:\(Left\)Log–log plot of final regret of TS, ATS, and ATS\-No\-π/2\\sqrt\{\\pi/2\};\(Right\)TS, ATS, and ATS\-No\-π/2\\sqrt\{\\pi/2\}\.There are at least two possible explanations for why the observeddd\-dependence of TS, ATS, and ATS\-No\-π/2\\sqrt\{\\pi/2\}does not fully match the theoretical prediction\. The first is that the dimensions considered may still be too small\. In[figure˜2](https://arxiv.org/html/2606.28616#S6.F2), ford∈\{20,…,30\}d\\in\\\{20,\\dotsc,30\\\}andT=200,000T=200\{,\}000, all three algorithms still appear to be in the transient regime\. Pushing to substantially largerddis computationally infeasible in our setting, since evend=30d=30already corresponds to roughly2302^\{30\}arms\. The second possibility is that the time horizon is still not large enough\. Indeed, the right panel of[figure˜2](https://arxiv.org/html/2606.28616#S6.F2)indicates that all three methods remain in the transient regime even at the largest horizon we consider\. However, horizons such asT=500,000T=500\{,\}000are already substantial relative to the dimensions studied and are also near the limit of what we can simulate under our computational budget\.


Figure 3:Numerical results for the linear bandit problem with the arm set is the hypercube,‖θ⋆‖=1\\\|\\theta^\{\\star\}\\\|=1, and noiseηt∼𝒩\(0,0\.1\)\\eta\_\{t\}\\sim\\mathcal\{N\}\(0,0\.1\)\. The parameterθ⋆\\theta^\{\\star\}is drawn uniformly at random from the unit sphere\. Results are averaged over 100 independent random seeds\. Both panels show plots of the final regret for different algorithms, with feature dimension varying over20,21,…,29\{20,21,\\dotsc,29\}and horizon fixed atT=200000T=200000:\(Left\)Log–log plot of final regret of TS, ATS, and ATS\-No\-π/2\\sqrt\{\\pi/2\};\(Right\)TS, ATS, and ATS\-No\-π/2\\sqrt\{\\pi/2\}\.Nevertheless, the current experiments consistently support the conclusion that ATS is worse than TS and ATS\-No\-π/2\\sqrt\{\\pi/2\}by an approximately constant factor, that removing theπ/2\\sqrt\{\\pi/2\}term improves ATS and can even make it slightly outperform TS empirically, and that these methods appear to exhibit broadly similar dependence ondd\. It is therefore reasonable to conjecture that this qualitative picture persists in larger\-ddand larger\-TTregimes\.
#### Two\-Lane Shortest Path
In this section, we study a shortest\-path bandit instance on a two\-lane layered graph\. The learner’s goal is to identify a minimum\-cost path from a source nodessto a sink nodett\. The graph hasm∈ℤ\+m\\in\\mathbb\{Z\}^\{\+\}stages, each containing two nodes representing the upper and lower lanes\. From each node at stageii, the learner may either stay in the same lane or switch to the other lane at stagei\+1i\+1\. Hence, each inter\-stage transition admits four possible edges:\{upper, lower\}→\{upper, lower\}\\\{\\text\{upper, lower\}\\\}\\to\\\{\\text\{upper, lower\}\\\}\.
We assign cost11to every switching transition, cost11to upper→\\toupper, and cost0to lower→\\tolower\. Consequently, the optimal path is obtained by staying in the lower lane at every stage, yielding total cost0\. The resulting graph contains2m2^\{m\}distinctss\-ttpaths andd=2\+4\(m−1\)\+2=4md=2\+4\(m\-1\)\+2=4medges\. See[figure˜9](https://arxiv.org/html/2606.28616#A3.F9)as an example\.
For any pathxx, letϕ\(x\)∈\{0,1\}d\\phi\(x\)\\in\\\{0,1\\\}^\{d\}denote its edge\-incidence feature vector\. The observed cost is given byC~\(x\)=ϕ\(x\)⊤θ∗\+ε\\tilde\{C\}\(x\)=\\phi\(x\)^\{\\top\}\\theta\_\{\*\}\+\\varepsilon, whereε∼𝒩\(0,R2\)\\varepsilon\\sim\\mathcal\{N\}\(0,R^\{2\}\)\. The edge\-parameter vectorθ∗\\theta\_\{\*\}assigns value0to lower→\\tolower edges and value11to all other edges, that is,\(θ∗\)i=1−𝕀\(ei=lower→lower\)\(\\theta\_\{\*\}\)\_\{i\}=1\-\\mathbb\{I\}\\\!\\left\(e\_\{i\}=\\mathrm\{lower\}\\to\\mathrm\{lower\}\\right\)\. From[figure˜4](https://arxiv.org/html/2606.28616#S6.F4), we see that ATS is worse than TS and ATS\-No\-π/2\\sqrt\{\\pi/2\}only by a constant factor, while all three appear to share the same scaling with the dimensiondd\. Consistent with the unit ball experiments, removing theπ/2\\sqrt\{\\pi/2\}factor yields a constant\-factor improvement in empirical performance, and ATS\-No\-π/2\\sqrt\{\\pi/2\}becomes comparable to TS\.
Moreover, TS\-No\-Inflation and ATS\-No\-Inflation continue to outperform the other algorithms by a significant margin\. However, their empirical dependence onddappears to be close tod1\.5d^\{1\.5\}, in contrast to the behavior observed in the unit ball setting\.


Figure 4:Sweep results for the two\-lane shortest\-path problem with Gaussian noiseεt∼𝒩\(0,0\.2\)\\varepsilon\_\{t\}\\sim\\mathcal\{N\}\(0,0\.2\), averaged over 50 independent random seeds\. Both panels report the final regret of TS, TS\-No\-Inflation, ATS, ATS\-No\-π/2\\sqrt\{\\pi/2\}, and ATS\-No\-Inflation as the feature dimension ranges over\{15,16,…,25\}\\\{15,16,\\dotsc,25\\\}, with horizon fixed atT=250000T=250000:\(Left\)final regret versusdd;\(Right\)log\-log plot of final regret versusdd\.
### 6\.2Empirical evidences for EATS
In this section, we present empirical evidence that EATS has the potential to satisfy aO~\(dK\)\\tilde\{O\}\(d\\sqrt\{K\}\)regret guarantee whenN=poly\(d\)N=\\mathrm\{poly\}\(d\)\. To this end, we study the dependence of cumulative regret on the dimensionddthrough a dimension sweep\.In all EATS experiments reported in this section, we chooseN=dN=d\. The settings of the environments in this section are also the same as the ones in[Section˜6\.1](https://arxiv.org/html/2606.28616#S6.SS1)\.
#### Unit Ball
From[figure˜5](https://arxiv.org/html/2606.28616#S6.F5), we see that the regret of EATS exhibits an approximately linear dependence ondd, while remaining worse than UCB by a substantial constant factor\. At the same time, its empirical performance is comparable to that of TS\-No\-Inflation, for whichabeille2025andestablished a regret bound with linear dependence ondd\.
Figure 5:Sweep results for the linear bandit problem with unit\-sphere arm set\. Same setting as[figure˜1](https://arxiv.org/html/2606.28616#S6.F1)\. Results are averaged over 100 independent random seeds\. Log–log plot of regret vs dimension for EATS, RS\-UCB, and TS\-No\-Inflation\.
#### Hypercube
From the left panel of[figure˜6](https://arxiv.org/html/2606.28616#S6.F6), we observe that the regret of EATS scales linearly withdd\. The right panel further suggests that EATS has already reached its asymptotic regime byd=30d=30andT=200,000T=200\{,\}000, which in turn indicates that the experiments ford∈\{20,…,30\}d\\in\\\{20,\\dotsc,30\\\}are likely also operating in the asymptotic regime\. In contrast,[figure˜3](https://arxiv.org/html/2606.28616#S6.F3)shows that TS and ATS still remain in their transient regimes over the same range, which explains why their observed dependence onddis smaller\.
Overall, the fact that EATS consistently displays linear scaling across both environments, including one in which the asymptotic regime appears to be visible, suggests that aO~\(dK\)\\tilde\{O\}\(d\\sqrt\{K\}\)\-type guarantee for EATS is plausible\.


Figure 6:Sweep results for the linear bandit problem with the arm set is the hypercube\. Same setting as[figure˜2](https://arxiv.org/html/2606.28616#S6.F2)\.\(Left\)Log–log plot of final regret of EATS, TS, ATS, and TS\-No\-Inflation;\(Right\)The behavior of EATS atd=30d=30andT=100,000T=100,000\.
## 7Conclusion
We introduced Absolute Thompson Sampling \(ATS\), which replaces the signed exploration noise in TS with its absolute value\. This ensures optimism in expectation, enabling a UCB\-style regret analysis without the technically involved anti\-concentration arguments typically required for TS\. ATS achievesO~\(d3/2K\)\\widetilde\{O\}\(d^\{3/2\}\\sqrt\{K\}\)regret while retaining the computational efficiency of randomized exploration\.
We also proposed Ensemble Absolute Thompson Sampling \(EATS\), which approximates the UCB objective by taking the maximum over multiple absolute perturbations with normalization by the ensemble size\. As the ensemble size grows, EATS recovers UCB behavior in the limit\. Experiments suggest that moderate ensemble sizes can yield strong performance\. Establishing non\-asymptotic guarantees for EATS with finite ensemble sizes remains an interesting direction for future work\.
Overall, our results suggest a new perspective on randomized exploration: by carefully shaping the exploration term, one can design algorithms that retain the computational efficiency of TS while exhibiting analysis and behavior closely aligned with UCB\.
## References
Supplementary Materials
*The following content was not necessarily subject to peer review\.*
## Appendix AUseful Lemmas
###### Lemma A\.1\(Azuma\-Hoeffding\)\.
Let\(Xk,ℱk\)k=1K\(X\_\{k\},\\mathcal\{F\}\_\{k\}\)\_\{k=1\}^\{K\}be a martingale difference sequence, i\.e\.,XkX\_\{k\}isℱk\\mathcal\{F\}\_\{k\}\-measurable and𝔼\[Xk\|ℱk−1\]=0\\mathbb\{E\}\[X\_\{k\}\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\|\\allowbreak\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\mathopen\{\}\\mathcal\{F\}\_\{k\-1\}\]=0\. AssumeXk∈\[lk,uk\]X\_\{k\}\\in\[l\_\{k\},u\_\{k\}\]almost surely\. Then, for anyδ∈\(0,1\)\\delta\\in\(0,1\),
ℙ\(∑k=1KXk≥∑k=1K\(uk−lk\)22log1δ\)≤δ\.\\displaystyle\\mathbb\{P\}\\left\(\\sum\_\{k=1\}^\{K\}X\_\{k\}\\geq\\sqrt\{\\sum\_\{k=1\}^\{K\}\\frac\{\(u\_\{k\}\-l\_\{k\}\)^\{2\}\}\{2\}\\log\\frac\{1\}\{\\delta\}\}\\right\)\\leq\\delta\\;\.
###### Lemma A\.2\(Azuma\-Hoeffding with good events\)\.
Let\(Yk\)k=1∞\(Y\_\{k\}\)\_\{k=1\}^\{\\infty\}beℱk\\mathcal\{F\}\_\{k\}\-measurable and integrable random variables\. Letℰk\\mathscr\{E\}\_\{k\}beℱk−1\\mathcal\{F\}\_\{k\-1\}\-measurable events and letℰ\\mathscr\{E\}be an event such thatℰ⊆⋂k=1∞ℰk\\mathscr\{E\}\\subseteq\\bigcap^\{\\infty\}\_\{k=1\}\\mathscr\{E\}\_\{k\}\. Assume that onℰk\\mathscr\{E\}\_\{k\}, we haveYk∈\[lk,uk\]Y\_\{k\}\\in\[l\_\{k\},u\_\{k\}\]almost surely\. Then, for anyδ∈\(0,1\)\\delta\\in\(0,1\),
ℙ\(ℰ∩\{∃K≥1:∑k=1K\(Yk−𝔼\[Yk\|ℱk−1\]\)≥∑k=1K\(uk−lk\)22logπ2K26δ\}\)≤δ\.\\displaystyle\\mathbb\{P\}\\left\(\\mathscr\{E\}\\cap\\left\\\{\\exists K\\geq 1:\\sum\_\{k=1\}^\{K\}\\left\(Y\_\{k\}\-\\mathbb\{E\}\[Y\_\{k\}\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\middle\|\\allowbreak\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\mathopen\{\}\\mathcal\{F\}\_\{k\-1\}\]\\right\)\\geq\\sqrt\{\\sum\_\{k=1\}^\{K\}\\frac\{\(u\_\{k\}\-l\_\{k\}\)^\{2\}\}\{2\}\\log\\frac\{\\pi^\{2\}K^\{2\}\}\{6\\delta\}\}\\right\\\}\\right\)\\leq\\delta\\;\.
###### Proof\.
DefineY~k≔𝟙\[ℰk\]Yk\\tilde\{Y\}\_\{k\}\\coloneqq\\mathbbm\{1\}\[\\mathscr\{E\}\_\{k\}\]Y\_\{k\}which returnsYkY\_\{k\}ifℰk\\mathscr\{E\}\_\{k\}holds and0otherwise\. Sinceℰk∈ℱk−1\\mathscr\{E\}\_\{k\}\\in\\mathcal\{F\}\_\{k\-1\}, the processX~k≔Y~k−𝔼\[Y~k∣ℱk−1\]\\tilde\{X\}\_\{k\}\\coloneqq\\tilde\{Y\}\_\{k\}\-\\mathbb\{E\}\[\\tilde\{Y\}\_\{k\}\\mid\\mathcal\{F\}\_\{k\-1\}\]isℱk\\mathcal\{F\}\_\{k\}\-measurable and a martingale difference sequence\. Moreover,X~k\\tilde\{X\}\_\{k\}lies in an interval of length at mostuk−lku\_\{k\}\-l\_\{k\}almost surely\. DefineδK≔6δ/π2K2\\delta\_\{K\}\\coloneqq 6\\delta/\\pi^\{2\}K^\{2\}\. By[Lemma˜A\.1](https://arxiv.org/html/2606.28616#A1.Thmtheorem1), and taking a union bound overK≥1K\\geq 1,
ℙ\(∃K≥1:∑k=1KX~k≥∑k=1K\(uk−lk\)22log1δK\)≤∑K=1∞δK=δ\.\\mathbb\{P\}\\left\(\\exists K\\geq 1:\\sum\_\{k=1\}^\{K\}\\tilde\{X\}\_\{k\}\\geq\\sqrt\{\\sum\_\{k=1\}^\{K\}\\frac\{\(u\_\{k\}\-l\_\{k\}\)^\{2\}\}\{2\}\\log\\frac\{1\}\{\\delta\_\{K\}\}\}\\right\)\\leq\\sum\_\{K=1\}^\{\\infty\}\\delta\_\{K\}=\\delta\\;\.On the eventℰ\\mathscr\{E\}, we haveY~k=Yk\\tilde\{Y\}\_\{k\}=Y\_\{k\}for allkk\. Therefore,
ℙ\(ℰ∩\{∃K≥1:∑k=1K\(Yk−𝔼\[Yk\|ℱk−1\]\)≥∑k=1K\(uk−lk\)22log1δK\}\)≤δ\.\\mathbb\{P\}\\left\(\\mathscr\{E\}\\cap\\left\\\{\\exists K\\geq 1:\\sum\_\{k=1\}^\{K\}\\left\(Y\_\{k\}\-\\mathbb\{E\}\[Y\_\{k\}\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\middle\|\\allowbreak\\mathchoice\{\\\>\}\{\\\>\}\{\\,\}\{\\,\}\\mathopen\{\}\\mathcal\{F\}\_\{k\-1\}\]\\right\)\\geq\\sqrt\{\\sum\_\{k=1\}^\{K\}\\frac\{\(u\_\{k\}\-l\_\{k\}\)^\{2\}\}\{2\}\\log\\frac\{1\}\{\\delta\_\{K\}\}\}\\right\\\}\\right\)\\leq\\delta\\;\.∎
###### Lemma A\.3\(Lemma D\.4inrosenberg2020near\)\.
Let\(Xk\)k=1∞\\left\(X\_\{k\}\\right\)\_\{k=1\}^\{\\infty\}be random variables with expectation adapted to the filtration\(ℱk\)k=0∞\\left\(\\mathcal\{F\}\_\{k\}\\right\)\_\{k=0\}^\{\\infty\}\. Suppose that0≤Xk≤C0\\leq X\_\{k\}\\leq Calmost surely\. Then, with probability at least1−δ1\-\\delta, the following holds for allK≥1K\\geq 1:
∑k=1K𝔼\[Xk∣ℱk−1\]≤4Cln2Kδ\+2∑k=1KXk\.\\sum\_\{k=1\}^\{K\}\\mathbb\{E\}\\left\[X\_\{k\}\\mid\\mathcal\{F\}\_\{k\-1\}\\right\]\\leq 4C\\ln\\frac\{2K\}\{\\delta\}\+2\\sum\_\{k=1\}^\{K\}X\_\{k\}\\;\.
## Appendix BProofs for[Section˜5](https://arxiv.org/html/2606.28616#S5)
### B\.1Proof of[Proposition˜5\.1](https://arxiv.org/html/2606.28616#S5.Thmtheorem1)
###### Lemma B\.1\(Restatement of[Proposition˜5\.1](https://arxiv.org/html/2606.28616#S5.Thmtheorem1)\)\.
Letξ1,…,ξN∼𝒩\(𝟎,I\)\\xi\_\{1\},\\dots,\\xi\_\{N\}\\sim\\mathcal\{N\}\(\\mathbf\{0\},I\)be i\.i\.d\. Gaussian vectors\. For fixed symmetric positive definite matrixΛ∈ℝd×d\\Lambda\\in\\mathbb\{R\}^\{d\\times d\}and vectorϕ∈ℝd\\phi\\in\\mathbb\{R\}^\{d\}, we have
maxi∈\[N\]12lnN\|ϕ⊤Λ−1/2ξi\|→ℙ‖ϕ‖Λ−1\.\\displaystyle\\max\_\{i\\in\[N\]\}\\frac\{1\}\{\\sqrt\{2\\ln N\}\}\\left\\lvert\\phi^\{\\top\}\\Lambda^\{\-1/2\}\\xi\_\{i\}\\right\\rvert\\xrightarrow\{\\mathbb\{P\}\}\\left\\lVert\\phi\\right\\rVert\_\{\\Lambda^\{\-1\}\}\\;\.
###### Proof\.
LetZ1,…,ZN∼𝒩\(0,1\)Z\_\{1\},\\dots,Z\_\{N\}\\sim\\mathcal\{N\}\(0,1\)be i\.i\.d\. random variables\. Then,
maxi∈\[N\]12lnN\|ϕ⊤Λ−1/2ξi\|=‖ϕ‖Λ−112lnNmaxi∈\[N\]\|Zi\|\.\\max\_\{i\\in\[N\]\}\\frac\{1\}\{\\sqrt\{2\\ln N\}\}\\left\\lvert\\phi^\{\\top\}\\Lambda^\{\-1/2\}\\xi\_\{i\}\\right\\rvert=\\\|\\phi\\\|\_\{\\Lambda^\{\-1\}\}\\frac\{1\}\{\\sqrt\{2\\ln N\}\}\\max\_\{i\\in\[N\]\}\\lvert Z\_\{i\}\\rvert\\;\.Thus, it suffices to show that12lnNmaxi∈\[N\]\|Zi\|→ℙ1\\frac\{1\}\{\\sqrt\{2\\ln N\}\}\\max\_\{i\\in\[N\]\}\\lvert Z\_\{i\}\\rvert\\xrightarrow\{\\mathbb\{P\}\}1\.
We use the shorthanduN:=2lnNu\_\{N\}\\vcentcolon\\nolinebreak\\mkern\-1\.2mu=\\sqrt\{2\\ln N\}\. We prove the claim by showing that, for anyε\>0\\varepsilon\>0,
ℙ\(\|maxi∈\[N\]\|Zi\|uN−1\|\>ε\)≤ℙ\(maxi∈\[N\]\|Zi\|\>\(1\+ε\)uN\)\+ℙ\(maxi∈\[N\]\|Zi\|≤\(1−ε\)uN\)⟶N→∞0\.\\mathbb\{P\}\\left\(\\left\|\\frac\{\\max\_\{i\\in\[N\]\}\\left\|Z\_\{i\}\\right\|\}\{u\_\{N\}\}\-1\\right\|\>\\varepsilon\\right\)\\leq\\mathbb\{P\}\\left\(\\max\_\{i\\in\[N\]\}\\left\|Z\_\{i\}\\right\|\>\(1\+\\varepsilon\)u\_\{N\}\\right\)\+\\mathbb\{P\}\\left\(\\max\_\{i\\in\[N\]\}\\left\|Z\_\{i\}\\right\|\\leq\(1\-\\varepsilon\)u\_\{N\}\\right\)\\underset\{N\\rightarrow\\infty\}\{\\longrightarrow\}0\.\(17\)We show that both the upper tail and the lower tail vanish asN→∞N\\to\\infty\.
#### Upper tail bound\.
Fixε\>0\\varepsilon\>0\. LetZ∼𝒩\(0,1\)Z\\sim\\mathcal\{N\}\(0,1\)\. Usingℙ\(\|Z\|\>t\)≤2exp\(−t2/2\)\\mathbb\{P\}\(\|Z\|\>t\)\\leq 2\\exp\(\-t^\{2\}/2\)fort≥0t\\geq 0,
ℙ\(maxi∈\[N\]\|Zi\|\>\(1\+ε\)uN\)\\displaystyle\\mathbb\{P\}\\left\(\\max\_\{i\\in\[N\]\}\\left\|Z\_\{i\}\\right\|\>\(1\+\\varepsilon\)u\_\{N\}\\right\)≤Nℙ\(\|Z\|\>\(1\+ε\)uN\)≤2Nexp\(−\(1\+ε\)2uN22\)\\displaystyle\\leq N\\mathbb\{P\}\\left\(\\left\|Z\\right\|\>\(1\+\\varepsilon\)u\_\{N\}\\right\)\\leq 2N\\exp\\left\(\-\\frac\{\(1\+\\varepsilon\)^\{2\}u\_\{N\}^\{2\}\}\{2\}\\right\)=2Nexp\(−\(1\+ε\)2lnN\)=2N1−\(1\+ε\)2=2N−\(2ε\+ε2\)⟶N→∞0\.\\displaystyle=2N\\exp\\left\(\-\(1\+\\varepsilon\)^\{2\}\\ln N\\right\)=2N^\{1\-\(1\+\\varepsilon\)^\{2\}\}=2N^\{\-\\left\(2\\varepsilon\+\\varepsilon^\{2\}\\right\)\}\\underset\{N\\rightarrow\\infty\}\{\\longrightarrow\}0\.
#### Lower tail bound\.
For the lower tail, we have
ℙ\(maxi≤N\|Zi\|≤t\)=\(1−ℙ\(\|Z\|\>t\)\)N≤exp\(−Nℙ\(\|Z\|\>t\)\)\.\\mathbb\{P\}\\left\(\\max\_\{i\\leq N\}\\left\|Z\_\{i\}\\right\|\\leq t\\right\)=\(1\-\\mathbb\{P\}\(\|Z\|\>t\)\)^\{N\}\\leq\\exp\(\-N\\mathbb\{P\}\(\|Z\|\>t\)\)\\;\.\(18\)We use the following Gaussian tail lower bound:
ℙ\(\|Z\|\>t\)≥22πt1\+t2e−t2/2\\mathbb\{P\}\(\|Z\|\>t\)\\geq\\frac\{2\}\{\\sqrt\{2\\pi\}\}\\frac\{t\}\{1\+t^\{2\}\}e^\{\-t^\{2\}/2\}Taket=\(1−ε\)uNt=\(1\-\\varepsilon\)u\_\{N\}\. WhenN→∞N\\to\\infty, we havet→∞t\\rightarrow\\inftyandt1\+t2≍1t\\frac\{t\}\{1\+t^\{2\}\}\\asymp\\frac\{1\}\{t\}\. So for largeNN,
ℙ\(\|Z\|\>t\)≥c1uNexp\(−\(1−ε\)2uN22\)=c1uNN−\(1−ε\)2\.\\mathbb\{P\}\(\|Z\|\>t\)\\geq c\\frac\{1\}\{u\_\{N\}\}\\exp\\left\(\-\\frac\{\(1\-\\varepsilon\)^\{2\}u\_\{N\}^\{2\}\}\{2\}\\right\)=c\\frac\{1\}\{u\_\{N\}\}N^\{\-\(1\-\\varepsilon\)^\{2\}\}\\;\.Therefore,
Nℙ\(\|Z\|\>t\)≥c1uNN1−\(1−ε\)2=c12lnNN2ε−ε2⟶N→∞∞,N\\mathbb\{P\}\(\|Z\|\>t\)\\geq c\\frac\{1\}\{u\_\{N\}\}N^\{1\-\(1\-\\varepsilon\)^\{2\}\}=c\\frac\{1\}\{\\sqrt\{2\\ln N\}\}N^\{2\\varepsilon\-\\varepsilon^\{2\}\}\\underset\{N\\rightarrow\\infty\}\{\\longrightarrow\}\\infty,since2ε−ε2\>02\\varepsilon\-\\varepsilon^\{2\}\>0\. By[Eq\.˜18](https://arxiv.org/html/2606.28616#A2.E18), we have
ℙ\(maxi≤N\|Zi\|≤\(1−ε\)uN\)⟶N→∞0\.\\mathbb\{P\}\\left\(\\max\_\{i\\leq N\}\\left\|Z\_\{i\}\\right\|\\leq\(1\-\\varepsilon\)u\_\{N\}\\right\)\\underset\{N\\rightarrow\\infty\}\{\\longrightarrow\}0\\;\.By substituting the upper and lower tail bounds into[Eq\.˜17](https://arxiv.org/html/2606.28616#A2.E17), we obtain the desired result\. ∎
### B\.2Proof of[Proposition˜5\.4](https://arxiv.org/html/2606.28616#S5.Thmtheorem4)
###### Proposition B\.2\(Restatement of[Proposition˜5\.4](https://arxiv.org/html/2606.28616#S5.Thmtheorem4)\)\.
For anykk, whenN→∞N\\to\\infty, it holds that
maxϕ∈𝒜maxi∈\[N\]ϕ⊤θ^k\+βk2lnN\|ϕ⊤Λk−1/2ξki\|→ℙmaxϕ∈𝒜ϕ⊤θ^k\+βk‖ϕ‖Λk−1\.\\displaystyle\\max\_\{\\phi\\in\\mathcal\{A\}\}\\max\_\{i\\in\[N\]\}\\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\frac\{\\beta\_\{k\}\}\{\\sqrt\{2\\ln N\}\}\\lvert\{\\phi^\{\\top\}\\Lambda\_\{k\}^\{\-1/2\}\\xi\_\{k\}^\{i\}\}\\rvert\\xrightarrow\{\\mathbb\{P\}\}\\max\_\{\\phi\\in\\mathcal\{A\}\}\\phi^\{\\top\}\\hat\{\\theta\}\_\{k\}\+\\beta\_\{k\}\\left\\lVert\\phi\\right\\rVert\_\{\\Lambda\_\{k\}^\{\-1\}\}\\;\.\(19\)
###### Proof\.
We condition onℱk−1\\mathcal\{F\}\_\{k\-1\}\. Without loss of generality, we prove the following claim: For fixed symmetric positive definite matrixΛ∈ℝd×d\\Lambda\\in\\mathbb\{R\}^\{d\\times d\}, vectorθ∈ℝd\\theta\\in\\mathbb\{R\}^\{d\}andβ\>0\\beta\>0,
maxϕ∈𝒜ϕ⊤θ\+βmaxi∈\[N\]12lnN\|ϕ⊤Λ−1/2ξi\|⏟=:MN\(ϕ\)→ℙmaxϕ∈𝒜ϕ⊤θ\+β‖ϕ‖Λ−1\.\\displaystyle\\max\_\{\\phi\\in\\mathcal\{A\}\}\\phi^\{\\top\}\\theta\+\\underbrace\{\\beta\\max\_\{i\\in\[N\]\}\\frac\{1\}\{\\sqrt\{2\\ln N\}\}\\left\\lvert\\phi^\{\\top\}\\Lambda^\{\-1/2\}\\xi^\{i\}\\right\\rvert\}\_\{=\\nolinebreak\\mkern\-1\.2mu\\vcentcolon M\_\{N\}\(\\phi\)\}\\xrightarrow\{\\mathbb\{P\}\}\\max\_\{\\phi\\in\\mathcal\{A\}\}\\phi^\{\\top\}\\theta\+\\beta\\left\\lVert\\phi\\right\\rVert\_\{\\Lambda^\{\-1\}\}\\;\.\(20\)
Since𝒜\\mathcal\{A\}is compact, pick a finiteε\\varepsilon\-net𝒩ε\\mathcal\{N\}\_\{\\varepsilon\}of𝒜\\mathcal\{A\}such that, for everyϕ∈𝒜\\phi\\in\\mathcal\{A\}, there isψ∈𝒜ε\\psi\\in\\mathcal\{A\}\_\{\\varepsilon\}with‖ϕ−ψ‖2≤ε\\left\\lVert\\phi\-\\psi\\right\\rVert\_\{2\}\\leq\\varepsilon\. From[Proposition˜5\.1](https://arxiv.org/html/2606.28616#S5.Thmtheorem1)and union bound overψ∈𝒜ε\\psi\\in\\mathcal\{A\}\_\{\\varepsilon\}, we have
maxψ∈𝒜ε\|MN\(ψ\)−β‖ψ‖Λ−1\|→ℙ0\.\\displaystyle\\max\_\{\\psi\\in\\mathcal\{A\}\_\{\\varepsilon\}\}\\left\\lvert M\_\{N\}\(\\psi\)\-\\beta\\left\\lVert\\psi\\right\\rVert\_\{\\Lambda^\{\-1\}\}\\right\\rvert\\xrightarrow\{\\mathbb\{P\}\}0\\;\.\(21\)
We have the following Lipschitz continuity ofMNM\_\{N\}: for anyϕ1,ϕ2∈𝒜\\phi\_\{1\},\\phi\_\{2\}\\in\\mathcal\{A\},
\|MN\(ϕ1\)−MN\(ϕ2\)\|≤β2lnNmaxi∈\[N\]\|\(ϕ1−ϕ2\)⊤Λ−1/2ξi\|≤‖ϕ1−ϕ2‖2LN,\\displaystyle\\left\\lvert M\_\{N\}\(\\phi\_\{1\}\)\-M\_\{N\}\(\\phi\_\{2\}\)\\right\\rvert\\leq\\frac\{\\beta\}\{\\sqrt\{2\\ln N\}\}\\max\_\{i\\in\[N\]\}\\left\\lvert\(\\phi\_\{1\}\-\\phi\_\{2\}\)^\{\\top\}\\Lambda^\{\-1/2\}\\xi^\{i\}\\right\\rvert\\leq\\left\\lVert\\phi\_\{1\}\-\\phi\_\{2\}\\right\\rVert\_\{2\}L\_\{N\}\\;,where we defined
LN:=β2lnNmaxi∈\[N\]∥Λ−1/2ξi∥2≤β∥Λ−1/2∥op12lnNmaxi∈\[N\]∥ξi∥2→ℙβ∥Λ−1/2∥op\.L\_\{N\}\\vcentcolon\\nolinebreak\\mkern\-1\.2mu=\\frac\{\\beta\}\{\\sqrt\{2\\ln N\}\}\\max\_\{i\\in\[N\]\}\\left\\lVert\\Lambda^\{\-1/2\}\\xi^\{i\}\\right\\rVert\_\{2\}\\leq\\beta\\left\\lVert\\Lambda^\{\-1/2\}\\right\\rVert\_\{\\mathrm\{op\}\}\\frac\{1\}\{\\sqrt\{2\\ln N\}\}\\max\_\{i\\in\[N\]\}\\left\\lVert\\xi^\{i\}\\right\\rVert\_\{2\}\\xrightarrow\{\\mathbb\{P\}\}\\beta\\left\\lVert\\Lambda^\{\-1/2\}\\right\\rVert\_\{\\mathrm\{op\}\}\\;\.Here,∥⋅∥op\\\|\\cdot\\\|\_\{\\mathrm\{op\}\}denotes the operator norm and the convergence in the last step follows from[Proposition˜5\.1](https://arxiv.org/html/2606.28616#S5.Thmtheorem1)\. Additionally,ϕ↦‖ϕ‖Λ−1\\phi\\mapsto\\\|\\phi\\\|\_\{\\Lambda^\{\-1\}\}is‖Λ−1/2‖op\\\|\\Lambda^\{\-1/2\}\\\|\_\{\\mathrm\{op\}\}\-Lipschitz continuous\. Therefore, by combining the above Lipschitz continuity results with[Eq\.˜21](https://arxiv.org/html/2606.28616#A2.E21), we have
maxϕ∈𝒜\|MN\(ϕ\)−β‖ϕ‖Λ−1\|\\displaystyle\\max\_\{\\phi\\in\\mathcal\{A\}\}\\left\\lvert M\_\{N\}\(\\phi\)\-\\beta\\left\\lVert\\phi\\right\\rVert\_\{\\Lambda^\{\-1\}\}\\right\\rvert≤maxψ∈𝒜ε\|MN\(ψ\)−β‖ψ∥Λ−1\|\+O\(ε\)→ℙO\(ε\)\.\\displaystyle\\leq\\max\_\{\\psi\\in\\mathcal\{A\}\_\{\\varepsilon\}\}\\left\\lvert M\_\{N\}\(\\psi\)\-\\beta\\left\\lVert\\psi\\right\\rVert\_\{\\Lambda^\{\-1\}\}\\right\\rvert\+O\(\\varepsilon\)\\xrightarrow\{\\mathbb\{P\}\}O\(\\varepsilon\)\\;\.
Finally, defineFN\(ϕ\):=ϕ⊤θ\+MN\(ϕ\)F\_\{N\}\(\\phi\)\\vcentcolon\\nolinebreak\\mkern\-1\.2mu=\\phi^\{\\top\}\\theta\+M\_\{N\}\(\\phi\)andF\(ϕ\):=ϕ⊤θ\+β∥ϕ∥Λ−1F\(\\phi\)\\vcentcolon\\nolinebreak\\mkern\-1\.2mu=\\phi^\{\\top\}\\theta\+\\beta\\left\\lVert\\phi\\right\\rVert\_\{\\Lambda^\{\-1\}\}\. We have
\|maxϕ∈𝒜FN\(ϕ\)−maxϕ∈𝒜F\(ϕ\)\|≤maxϕ∈𝒜\|FN\(ϕ\)−F\(ϕ\)\|≤maxψ∈𝒜ε\|MN\(ψ\)−β‖ψ∥Λ−1\|\+O\(ε\)→ℙO\(ε\)\.\\displaystyle\\left\\lvert\\max\_\{\\phi\\in\\mathcal\{A\}\}F\_\{N\}\(\\phi\)\-\\max\_\{\\phi\\in\\mathcal\{A\}\}F\(\\phi\)\\right\\rvert\\leq\\max\_\{\\phi\\in\\mathcal\{A\}\}\\left\\lvert F\_\{N\}\(\\phi\)\-F\(\\phi\)\\right\\rvert\\leq\\max\_\{\\psi\\in\\mathcal\{A\}\_\{\\varepsilon\}\}\\left\\lvert M\_\{N\}\(\\psi\)\-\\beta\\left\\lVert\\psi\\right\\rVert\_\{\\Lambda^\{\-1\}\}\\right\\rvert\+O\(\\varepsilon\)\\xrightarrow\{\\mathbb\{P\}\}O\(\\varepsilon\)\\;\.By lettingε→0\\varepsilon\\to 0, we obtain the desired result\. ∎
## Appendix CAdditional Experimental results


Figure 7:Numerical results for linear bandits on the unit ball with feature dimensiond=10d=10, horizonT=200000T=200000,‖θ⋆‖=1\\\|\\theta^\{\\star\}\\\|=1, and noiseηt∼𝒩\(0,0\.1\)\\eta\_\{t\}\\sim\\mathcal\{N\}\(0,0\.1\)\. The parameterθ⋆\\theta^\{\\star\}is sampled uniformly from the unit sphere, and regret is averaged over100100independent seeds\.\(Left\)Regret curves of UCB, TS, TS with no inflation, ATS, ATS with noπ/2\\sqrt\{\\pi/2\}factor and ATS and the right column shows log\-log plots of cumulative regret\. In the right column, the black dotted line isRref\(t\)=d1\.5tR\_\{\\mathrm\{ref\}\}\(t\)=d^\{1\.5\}\\sqrt\{t\}for reference\.

Figure 8:Numerical results for linear bandits on the hypercube with feature dimensiond=10d=10, horizonT=200000T=200000,‖θ⋆‖=1\\\|\\theta^\{\\star\}\\\|=1, and noiseηt∼𝒩\(0,0\.1\)\\eta\_\{t\}\\sim\\mathcal\{N\}\(0,0\.1\)\. The parameterθ⋆\\theta^\{\\star\}is sampled uniformly from the unit sphere, and regret is averaged over100100independent seeds\.\(Left\)Regret curves of UCB, TS, TS\-no\-inflation, ATS, ATS\-no\-inflation, ATS\-no\-π/2\\sqrt\{\\pi/2\}\.\(Right\)Log\-log plots of the regret curves\. The black dotted line isRref\(t\)=d1\.5tR\_\{\\mathrm\{ref\}\}\(t\)=d^\{1\.5\}\\sqrt\{t\}for reference\.Figure 9:Example of the two\-lane shortest path problem withm=4m=4Similar Articles
Distributed Online Bandit Submodular Maximization with Bounded Sampling Violations
This paper presents a unified algorithmic framework for distributed online submodular maximization under partition matroid constraints, achieving sublinear (1-1/e)-regret guarantees for both full-information and bandit feedback. It also introduces a bounded stochastic pipage rounding scheme to ensure cumulative sampling violations remain sublinear.
Best Arm Identification in Generalized Linear Bandits via Hybrid Feedback
This paper introduces a hybrid Track-and-Stop algorithm for best arm identification in generalized linear bandits that unifies absolute and relative feedback. The authors propose a likelihood-ratio-based confidence sequence to adaptively allocate queries, demonstrating improved sample efficiency over baseline methods.
Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity
This paper studies piecewise-stationary low-rank linear contextual bandits, proposes the SPSC algorithm that achieves dynamic regret scaling with the intrinsic rank instead of the ambient dimension, and characterizes the identification boundary for subspace recovery under scalar feedback.
Learning in Markovian bandits with non-observable states and constrained decision epochs
This paper studies regret minimization in Markovian bandits with non-observable states and constrained decision epochs, introducing a generalization called self-degrading Markovian bandits. The authors propose the UCB-NOM algorithm that achieves nearly logarithmic regret and provide bounds that do not depend on the number of states.
Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
Bebop proposes entropy-aware multi-token prediction with rejection sampling and a novel TV loss to accelerate RL training of LLMs, achieving up to 1.8x speedup. The method addresses the degradation of acceptance rates during RL by optimizing training objectives.